12. BIG NUMBERS
• Petabytes of data
• 1k+ node Hadoop cluster
• Multi-billion dollar merchandising business
• Lots of users and items
13. How should I use Map Reduce?
• Raw map reduce
• Pig
• Hive
• Cascading
• Scoobi
• Scalding
14. Decision Time
• “And every one that heareth these sayings of
mine (great software engineers of the past),
and doeth them not, shall be likened unto a
foolish man, which built his house upon the
sand.”
• “And the rain descended, and the floods
came, and the winds blew, and beat upon that
house; and it fell: and great was the fall of it.”
16. Good Pig
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';
-- do joins and group by also
17. Bad Pig
DEFINE NV_terms `perl nv_terms2.pl`
ship('$scripts/nv_terms2.pl');
i5 = stream i4 through NV_terms as (leafcat:chararray,
name:chararray, name1:chararray);
i7 = foreach i5 generate leafcat,
com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name) as
name,
com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name1) as
name1;
19. Cascading Rocks!
• What is it?
• Supports large workflows and reusable
components
– DAG generation
– Parallel Executions
20. Cascading code in Scala
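// masterPipe is assumed to be a var holding a Cascading Pipe declared earlier;
// each step wraps the previous pipe
masterPipe = new FilterURLEncodedStrings(masterPipe, "sqr")
masterPipe = new FilterInappropriateQueries(masterPipe, "sqr")
masterPipe = new GroupBy(masterPipe,
  CFields("user_id", "epoch_ts", "sqr"),
  sortFields)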
30. Markov Chains
• Investigation of buying patterns in ~50 lines of
code
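// x: one user's purchases in time order (defined elsewhere)
val purchases = "firsttime" :: x.take(500).toList
// consecutive (previous, next) purchase pairs
val pairs = purchases zip purchases.tail
val grouped = pairs.groupBy { case (from, to) =>
  from.toString + "-" + to.toString }
// number of times each "from-to" transition was seen
val sizes = grouped.map { case (transition, occurrences) =>
  transition -> occurrences.size
}.toList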
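The snippet above depends on an `x` defined elsewhere on the slide. As a rough, self-contained sketch of the same idea, the version below counts transitions for one purchase history and then normalises the counts into transition probabilities. The object and method names and the normalisation step are illustrative assumptions, not code from the talk.

```scala
object MarkovSketch {
  // Count how often each item is immediately followed by each other item.
  // "firsttime" is the synthetic start state used in the slide code.
  def transitionCounts(history: Seq[String]): Map[(String, String), Int] = {
    val purchases = "firsttime" +: history
    purchases.zip(purchases.tail)
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }
  }

  // Assumption: normalise counts per source state to estimate P(next | current).
  def transitionProbabilities(
      counts: Map[(String, String), Int]): Map[(String, String), Double] = {
    val totals = counts
      .groupBy { case ((from, _), _) => from }
      .map { case (from, cs) => from -> cs.values.sum }
    counts.map { case ((from, to), n) => (from, to) -> n.toDouble / totals(from) }
  }
}
```

Keying on a `(from, to)` tuple instead of a `"from-to"` string avoids ambiguity when item names themselves contain a hyphen.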
31. Mining Search Queries
• 20+ billion user queries - give me the top ones
per user
(Diagram stages: De-Dupe, Rank, Validate, Sample Data)
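To make the de-dupe and rank stages concrete, here is a small in-memory sketch using plain Scala collections. It is not the Cascading or Scalding job from the talk; at 20+ billion queries the same logic would run on Hadoop. The `Search` case class and `topQueriesPerUser` are hypothetical names, with fields borrowed from the `user_id`, `epoch_ts`, and `sqr` columns seen in the Cascading snippet.

```scala
object TopQueriesSketch {
  // One search event: who searched, when, and the query string.
  final case class Search(userId: String, epochTs: Long, sqr: String)

  // Return each user's n most frequent queries with their counts.
  def topQueriesPerUser(events: Seq[Search], n: Int): Map[String, Seq[(String, Int)]] =
    events
      .distinct // de-dupe: drop exact repeats of the same event
      .groupBy(_.userId)
      .map { case (user, searches) =>
        val ranked = searches
          .groupBy(_.sqr)
          .map { case (query, hits) => query -> hits.size }
          .toSeq
          // rank: most frequent first, ties broken by query text
          .sortBy { case (query, count) => (-count, query) }
          .take(n)
        user -> ranked
      }
}
```

Calling `TopQueriesSketch.topQueriesPerUser(events, 10)` yields a map from user id to that user's ranked queries; a validation stage could then compare a sample of these results against the raw data.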
Mention Option and Either
First-class functions
Mention how great traits are
I feel like Haskell will never break into the corporation; this is a great draft
All my life I’ve wanted a type-safe build system. And NOW I have it
They break backward compatibility
Weak IDE support: debugging, refactoring, etc.
Explain the madness
Tell them about the example
The most complicated system for counting words (insert meme here)
Explain why we use Hadoop. Data is huge. I can’t say when you want to make the jump to Map Reduce, but I see growth in making it THE platform
Say why raw Map Reduce stinks. Mention what Hive and Scoobi are
Explain why we didn’t go with Scoobi even though it’s all Scala
Scheduling and DAG creation
Where is my SOURCE?
Mention Azkaban
Can do parallel executions of tasks that don’t depend on each other
Supports static dependencies via cascades
Verbose. You still need to write a bunch of code.
Mention Scoobi and how it’s not super stable
Remind them about how it combines the best of Pig and Cascading
This is actual code to compute a user’s preferences. Explain a bit about user preferences
Mahout has some functions for this, but they are hard to set up and get going
Less precise than other state-of-the-art methods but still accurate
Scala Days Talk with Chris Severs
Linear Model
Talk about Concept Extraction
Use SQLite for ad hoc queries
Talk about the use of cascades
Talk about traps and counters
Scalding makes this 100 times easier because of cascades and flows