7. Scalding Model
• Source objects read and write data (from HDFS, DBs, MemCache, etc.).
• Pipes represent the flow of data through the job. You can think of a Pipe as a distributed list.
9. Yep, we’re counting words:
Data is modeled as streams of named Tuples (of objects).
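The word-count shape can be sketched in plain Scala collections. This is not the Scalding API itself, just an in-memory analog (the actual job applies the same flatMap/groupBy shape to a Pipe); the sample lines are invented.

```scala
// In-memory analog of the word-count job, using plain Scala collections.
object WordCount {
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // emit one entry per word
      .filter(_.nonEmpty)
      .groupBy(identity)                    // group entries by the word
      .map { case (word, occs) => (word, occs.size) } // reduce each group to a count
}
```

On a cluster, the groupBy step is what forces a shuffle to the reducers.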
10. Why Scala
• The Scala language has many built-in features that make domain-specific languages easy to implement.
• Map/Reduce already fits the functional paradigm.
• Scala’s collection API covers almost all common use cases.
14. Word Co-occurrence
The generalized “plus” handles lists/sets/maps and can be customized (implement Monoid[T]).
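A minimal sketch of the Monoid idea behind the generalized “plus”. The trait and instances below follow the usual algebraic convention (zero plus an associative plus); Scalding’s real Monoid[T] lives in its own library, and the co-occurrence maps here are invented sample data.

```scala
// A Monoid is a zero plus an associative "plus" -- the hook Scalding's
// generalized sum uses.
trait Monoid[T] {
  def zero: T
  def plus(a: T, b: T): T
}

object Monoid {
  // Ints add numerically.
  implicit val intMonoid: Monoid[Int] = new Monoid[Int] {
    def zero = 0
    def plus(a: Int, b: Int) = a + b
  }
  // Maps merge keywise, combining values with the value monoid --
  // exactly what summing co-occurrence counts needs.
  implicit def mapMonoid[K, V](implicit vm: Monoid[V]): Monoid[Map[K, V]] =
    new Monoid[Map[K, V]] {
      def zero = Map.empty[K, V]
      def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
        b.foldLeft(a) { case (acc, (k, v)) =>
          acc.updated(k, vm.plus(acc.getOrElse(k, vm.zero), v))
        }
    }
  def sum[T](xs: Iterable[T])(implicit m: Monoid[T]): T =
    xs.foldLeft(m.zero)(m.plus)
}
```

Because plus is associative, partial sums can be combined in any order, which is what makes this safe to run in parallel.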
15. GroupBuilder: enabling parallel reductions
• groupBy takes a function that mutates a GroupBuilder.
• A GroupBuilder adds fields which are reductions of (potentially different) inputs.
• On the left, we add 7 fields.
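The idea of one grouping pass producing several reduced fields can be sketched with plain collections. This is an in-memory analog, not the GroupBuilder API; the Event type and its fields are made up for illustration.

```scala
// One grouping pass that adds several fields, each a reduction of a
// (possibly different) input field -- the GroupBuilder pattern.
case class Event(user: String, clicks: Int, dwellMs: Long)

object GroupDemo {
  // For each user: event count, total clicks, and max dwell time,
  // all computed over the same group.
  def summarize(events: Seq[Event]): Map[String, (Int, Int, Long)] =
    events.groupBy(_.user).map { case (user, es) =>
      (user, (es.size, es.map(_.clicks).sum, es.map(_.dwellMs).max))
    }
}
```

In Scalding each of those reductions becomes an added field on the grouped pipe, and independent reductions can run in parallel.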
16. scald.rb
• A driver script that compiles the job and runs it locally, or transfers it and runs it remotely.
• We plan to add EMR support.
18. Most functions in the API have very close analogs in scala.collection.Iterable.
19. Cascading
• is the Java library that handles most of the map/reduce planning for Scalding.
• has years of production use.
• is used, tested, and optimized by many teams (Concurrent, Inc.; DSLs in Scala, Clojure, and Python at Twitter; Ruby at Etsy).
• has a (very fast) local mode that runs without Hadoop.
• has a flow planner designed to be portable (Cascading on Spark? Storm?).
20. mapReduceMap
• We abstract Cascading’s map-side aggregation ability with a function called mapReduceMap.
• If only mapReduceMaps are called, map-side aggregation works. If a foldLeft is called (which cannot be done map-side), Scalding falls back to pushing everything to the reducers.
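Why an associative reduce can run map-side can be sketched with a combiner simulation. This is an in-memory illustration of the principle, not the mapReduceMap implementation: each “mapper partition” pre-aggregates locally, and the reducer only merges partials. Partition boundaries are arbitrary and assumed non-empty.

```scala
// Sketch of map-side aggregation: with an associative reduceFn, each
// partition can be pre-reduced (a combiner), and the final result is the
// reduction of the partials -- identical to a single global reduce.
object MapSideAgg {
  def aggregated[T, U](partitions: Seq[Seq[T]])(mapFn: T => U)(reduceFn: (U, U) => U): U = {
    // Map side: apply mapFn, then pre-reduce within each (non-empty) partition.
    val partials = partitions.map(_.map(mapFn).reduce(reduceFn))
    // Reduce side: combine only the per-partition partials.
    partials.reduce(reduceFn)
  }
}
```

A foldLeft with an arbitrary (non-associative) function cannot be split this way, which is why it forces all data to the reducers.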
23. Optimized Joins
• The map-side join is called joinWithTiny. It implements a left or inner join with a very small pipe.
• blockJoin deals with data skew by replicating the data (useful for walking the Twitter follower graph, where everyone follows Gaga/Bieber/Obama).
• Coming soon: combine the two to set replication dynamically on a per-key basis, so only Gaga is replicated, and just the right amount.
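The map-side join idea can be sketched in memory: because the tiny side fits in RAM, each mapper holds it as a local lookup table and no shuffle of the big side by join key is needed. This is an analog of the concept, not the joinWithTiny API; the user/tweet data is invented.

```scala
// In-memory analog of a map-side (joinWithTiny-style) left join:
// the small side is a local Map, so each record of the big side is
// joined with a lookup instead of a shuffle.
object TinyJoin {
  def leftJoin[K, V, W](big: Seq[(K, V)], tiny: Map[K, W]): Seq[(K, V, Option[W])] =
    big.map { case (k, v) => (k, v, tiny.get(k)) } // None where the tiny side has no match
}
```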
24. Scalding @Twitter
• The revenue quality team (ads targeting, market insight, click prediction, traffic quality) uses Scalding for all of its work.
• Scala engineers throughout the company use it (e.g., storage, platform).
• More than 60 in-production Scalding jobs, and more than 200 ad-hoc jobs.
• It is not our only tool: Pig, PyCascading, Cascalog, and Hive are also used.
25. Example: finding similarity
• A simple recommendation algorithm is cosine similarity.
• Represent user-tweet interaction as a vector, then find the users whose vectors point in directions near that of the user in question.
• We’ve developed a Matrix library on top of Scalding to make this easy.
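The underlying math is small enough to sketch on sparse vectors represented as Maps. This is an in-memory illustration of the formula dot(a, b) / (|a| · |b|), not the Matrix library itself; the keys stand in for tweet ids.

```scala
// Cosine similarity on sparse user-interaction vectors stored as Maps.
object Cosine {
  def dot(a: Map[String, Double], b: Map[String, Double]): Double =
    a.iterator.map { case (k, v) => v * b.getOrElse(k, 0.0) }.sum

  def norm(a: Map[String, Double]): Double = math.sqrt(dot(a, a))

  // 1.0 for parallel vectors, 0.0 for orthogonal ones.
  def similarity(a: Map[String, Double], b: Map[String, Double]): Double =
    dot(a, b) / (norm(a) * norm(b))
}
```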
30. Matrix in foreground, map/reduce behind
With this syntax, we can focus on the logic, not on how to map linear algebra onto Hadoop.
31. Example uses:
• Random walks on the following graph. Matrix power iteration until convergence: (m * m * m * m).
• Dimensionality reduction of the follower graph (matrix product with a lower-dimensional projection matrix).
• Triangle counting: (M*M*M).trace / 3.
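The trace formula can be checked on a tiny dense matrix. This sketch assumes a directed adjacency matrix, where each directed 3-cycle contributes 3 to trace(M³) (a symmetric undirected matrix would need a divisor of 6); the dense Vector representation is for illustration only, since the real library works on distributed sparse matrices.

```scala
// Checking triangle counting via trace(M*M*M) / 3 on a small directed
// adjacency matrix.
object Triangles {
  type Mat = Vector[Vector[Int]]

  def mult(a: Mat, b: Mat): Mat = {
    val bt = b.transpose
    a.map { row => bt.map { col => row.zip(col).map { case (x, y) => x * y }.sum } }
  }

  def trace(m: Mat): Int = m.indices.map(i => m(i)(i)).sum

  // Each directed 3-cycle is counted once from each of its 3 vertices.
  def triangles(m: Mat): Int = trace(mult(mult(m, m), m)) / 3
}
```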
32. What is next?
• Improve the logical flow planning (reorder commuting filters/projections before maps, etc.).
• Improve Matrix flow planning to narrow the gap to hand-optimized code.
33. One more thing:
• Type-safety geeks can relax: we just pushed a type-safe API, analogous to Scoobi/Scrunch/Spark, to Scalding 0.6.0.
34. That’s it.
• Follow and mention: @scalding @posco
• Pull requests: http://github.com/twitter/scalding