SlideShare a Scribd company logo
1 of 33
Scala and Hadoop @ eBay
What we will cover
• Polymorphic Function Values
• Higher Kinded/Recursive Types
• Cokleislis Star Operators
• Scala Macros
I have no clue what those things are
What we will ACTUALLY cover
• Why Scala
• Why Hadoop
• How we use Scala with Hadoop
• Lots of CODE!
Why Scala?
• JVM
• **Functional**
• Expressive
• How to convince your boss?
Someone on Hacker News said
Scala sucks
• Compile Times
• You changed List again?
• Complicated
• Leads to Madness
Madness?
trait Lazy[+T, P] {
var creationParameters: P = None.asInstanceOf[P];
lazy val lazyThing: Either[Throwable, T] = try {
Right(create(creationParameters)) }
catch { case e => Left(e) }
def get(createParams: P): Either[Throwable, T] = {
creationParameters = createParams
lazyThing
}
def create(params: P): T
}
Madness?
def getSingleInstance[T, P](params: P)(implicit
lazyCreator: Lazy[T, P]): T = {
lazyCreator.get(params) match {
case Right(successValue) => successValue
case Left(exception) => throw new
StackException(exception)
}
}
This is used by ONE client class
• Show some self-restraint
Hadoop
• void map(K1 key, V1 value,
OutputCollector<K2, V2> output, Reporter
reporter)
• void reduce(K2 key, Iterator<V2> values,
OutputCollector<K3, V3> output, Reporter
reporter)
BIG NUMBERS
• Petabytes of data
• 1k+ node Hadoop cluster
• Multi-billion dollar merchandising business
• Lots of users and items 
How should I use Map Reduce?
• Raw map reduce 
• Pig
• Hive
• Cascading
• Scoobi
• Scalding 
Decision Time
• “And every one that heareth these sayings of
mine (great software engineers of the past),
and doeth them not, shall be likened unto a
foolish man, which built his house upon the
sand.”
• “And the rain descended, and the floods
came, and the winds blew, and beat upon that
house; and it fell: and great was the fall of it.”
I believe!
• Scalding combines the best of PIG and
Cascading
Good Pig
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';
// do joins and group by also
Bad Pig
DEFINE NV_terms `perl nv_terms2.pl`
ship('$scripts/nv_terms2.pl');
i5 = stream i4 through NV_terms as (leafcat:chararray,
name:chararray, name1:chararray);
i7 = foreach i5 generate leafcat,
com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name) as
name,
com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name1) as
name1;
Other Pig Issues
• Scheduling and DAG creation
Cascading Rocks!
• What is it?
• Supports large workflows and reusable
components
– DAG generation
– Parallel Executions
Cascading code in Scala
val masterPipe = new
FilterURLEncodedStrings(masterPipe, "sqr")
masterPipe = new
FilterInappropriateQueries(masterPipe, "sqr”)
masterPipe = new GroupBy(masterPipe,
CFields("user_id", "epoch_ts", "sqr"),
sortFields)
Someone should really code review this
Cascading Issues
This page intentionally left blank
Scalding Time
class WordCountJob(args : Args) extends Job(args) {
TextLine( args("input") )
.flatMap('line -> 'word) { line : String => tokenize(line) }
.groupBy('word) { _.size }
.write( Tsv( args("output") ) )
// Split a piece of text into individual words.
def tokenize(text : String) : Array[String] = {
// Lowercase each word and remove punctuation.
text.toLowerCase.replaceAll("[^a-zA-Z0-9s]", "").split("s+")
}
}
Scalding @ eBay
• Boilerplate reduction
• Extensibility
• New hires
Practical Scalding Use
• Pimp my pimp
• Code generated boilerplate
• Cascades
• Traps
• Testing!
class eBayJob(args: Args) extends Job(args) with PipeBoilerPlate {
implicit def pipe2eBayRichPipe(pipe: Pipe) = new eBayRichPipe(pipe)
class eBayRichPipe(pipe: Pipe) extends RichPipe(pipe) with
CommonFunctions
trait CommonFunctions {
import Dsl._
import RichPipe.assignName
def pipe: Pipe
def reallyComplexFunction(field: Fields, param: Long) = {
//mind blowing code here
}
}}
CheckoutTransactionsPipe(//default path logic)
.project(//fields I need)
.countUserInteractions(//params)
.doScoreCalculation(//params)
.doConfidenceCalculation(//params)
Seems a bit too readable for Scala
Collaborative Filtering
• Typically hard to run on large datasets
Structured Data Importance
• Do people shop by brand?
0
0.2
0.4
0.6
0.8
1
1.2
Supply
Handbags and Purses
Markov Chains
• Investigation of buying patterns in ~50 lines of
code
val purchases = "firsttime" :: x.take(500).toList
val pairs = purchases zip purchases.tail
val grouped = pairs.groupBy(x =>
x._1.toString+"-"+x._2.toString)
val sizes = grouped map { x => {
x._1 -> x._2.size
}} toList
Mining Search Queries
• 20+ billion user queries - give me the top ones
per user
De-Dupe Rank ValidateSample Data
Automation
Hadoop Proxy
Batch Database Load
Machines
Cassandra
Jenkins
MySql
Mongo
Questions?
www.ebaynyc.com

More Related Content

What's hot

Scala Refactoring for Fun and Profit
Scala Refactoring for Fun and ProfitScala Refactoring for Fun and Profit
Scala Refactoring for Fun and ProfitTomer Gabel
 
Parse, scale to millions
Parse, scale to millionsParse, scale to millions
Parse, scale to millionsFlorent Vilmart
 
今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?Sho Yoshida
 
Jug Marche: Meeting June 2014. Java 8 hands on
Jug Marche: Meeting June 2014. Java 8 hands onJug Marche: Meeting June 2014. Java 8 hands on
Jug Marche: Meeting June 2014. Java 8 hands onOnofrio Panzarino
 
Java scriptcore brief introduction
Java scriptcore brief introductionJava scriptcore brief introduction
Java scriptcore brief introductionHorky Chen
 
Value protocols and codables
Value protocols and codablesValue protocols and codables
Value protocols and codablesFlorent Vilmart
 
Journey's End – Collection and Reduction in the Stream API
Journey's End – Collection and Reduction in the Stream APIJourney's End – Collection and Reduction in the Stream API
Journey's End – Collection and Reduction in the Stream APIMaurice Naftalin
 
Ruby is an Acceptable Lisp
Ruby is an Acceptable LispRuby is an Acceptable Lisp
Ruby is an Acceptable LispAstrails
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Modelssparktc
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Modern Data Stack France
 
The Evolution of Scala / Scala進化論
The Evolution of Scala / Scala進化論The Evolution of Scala / Scala進化論
The Evolution of Scala / Scala進化論scalaconfjp
 
Unleash your inner console cowboy
Unleash your inner console cowboyUnleash your inner console cowboy
Unleash your inner console cowboyKenneth Geisshirt
 
HOW TO SCALE FROM ZERO TO BILLIONS!
HOW TO SCALE FROM ZERO TO BILLIONS!HOW TO SCALE FROM ZERO TO BILLIONS!
HOW TO SCALE FROM ZERO TO BILLIONS!Maziyar PANAHI
 
Tools for writing Haskell programs
Tools for writing Haskell programsTools for writing Haskell programs
Tools for writing Haskell programsnkpart
 
Adding Riak to your NoSQL Bag of Tricks
Adding Riak to your NoSQL Bag of TricksAdding Riak to your NoSQL Bag of Tricks
Adding Riak to your NoSQL Bag of Trickssiculars
 
Persistent Data Structures - partial::Conf
Persistent Data Structures - partial::ConfPersistent Data Structures - partial::Conf
Persistent Data Structures - partial::ConfIvan Vergiliev
 

What's hot (19)

Scala Refactoring for Fun and Profit
Scala Refactoring for Fun and ProfitScala Refactoring for Fun and Profit
Scala Refactoring for Fun and Profit
 
Parse, scale to millions
Parse, scale to millionsParse, scale to millions
Parse, scale to millions
 
今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?
 
Jug Marche: Meeting June 2014. Java 8 hands on
Jug Marche: Meeting June 2014. Java 8 hands onJug Marche: Meeting June 2014. Java 8 hands on
Jug Marche: Meeting June 2014. Java 8 hands on
 
Java scriptcore brief introduction
Java scriptcore brief introductionJava scriptcore brief introduction
Java scriptcore brief introduction
 
Value protocols and codables
Value protocols and codablesValue protocols and codables
Value protocols and codables
 
Journey's End – Collection and Reduction in the Stream API
Journey's End – Collection and Reduction in the Stream APIJourney's End – Collection and Reduction in the Stream API
Journey's End – Collection and Reduction in the Stream API
 
Mist - Serverless proxy to Apache Spark
Mist - Serverless proxy to Apache SparkMist - Serverless proxy to Apache Spark
Mist - Serverless proxy to Apache Spark
 
Ruby is an Acceptable Lisp
Ruby is an Acceptable LispRuby is an Acceptable Lisp
Ruby is an Acceptable Lisp
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
 
The Evolution of Scala / Scala進化論
The Evolution of Scala / Scala進化論The Evolution of Scala / Scala進化論
The Evolution of Scala / Scala進化論
 
Clojure & Scala
Clojure & ScalaClojure & Scala
Clojure & Scala
 
RubyMotion
RubyMotionRubyMotion
RubyMotion
 
Unleash your inner console cowboy
Unleash your inner console cowboyUnleash your inner console cowboy
Unleash your inner console cowboy
 
HOW TO SCALE FROM ZERO TO BILLIONS!
HOW TO SCALE FROM ZERO TO BILLIONS!HOW TO SCALE FROM ZERO TO BILLIONS!
HOW TO SCALE FROM ZERO TO BILLIONS!
 
Tools for writing Haskell programs
Tools for writing Haskell programsTools for writing Haskell programs
Tools for writing Haskell programs
 
Adding Riak to your NoSQL Bag of Tricks
Adding Riak to your NoSQL Bag of TricksAdding Riak to your NoSQL Bag of Tricks
Adding Riak to your NoSQL Bag of Tricks
 
Persistent Data Structures - partial::Conf
Persistent Data Structures - partial::ConfPersistent Data Structures - partial::Conf
Persistent Data Structures - partial::Conf
 

Similar to Scala and Hadoop @ eBay

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
PSGI and Plack from first principles
PSGI and Plack from first principlesPSGI and Plack from first principles
PSGI and Plack from first principlesPerl Careers
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraHandaru Sakti
 
Fast as C: How to Write Really Terrible Java
Fast as C: How to Write Really Terrible JavaFast as C: How to Write Really Terrible Java
Fast as C: How to Write Really Terrible JavaCharles Nutter
 
The Essential Perl Hacker's Toolkit
The Essential Perl Hacker's ToolkitThe Essential Perl Hacker's Toolkit
The Essential Perl Hacker's ToolkitStephen Scaffidi
 
Fast Web Applications Development with Ruby on Rails on Oracle
Fast Web Applications Development with Ruby on Rails on OracleFast Web Applications Development with Ruby on Rails on Oracle
Fast Web Applications Development with Ruby on Rails on OracleRaimonds Simanovskis
 
Migrating Legacy Data (Ruby Midwest)
Migrating Legacy Data (Ruby Midwest)Migrating Legacy Data (Ruby Midwest)
Migrating Legacy Data (Ruby Midwest)Patrick Crowley
 
Static or Dynamic Typing? Why not both?
Static or Dynamic Typing? Why not both?Static or Dynamic Typing? Why not both?
Static or Dynamic Typing? Why not both?Mario Camou Riveroll
 
SproutCore and the Future of Web Apps
SproutCore and the Future of Web AppsSproutCore and the Future of Web Apps
SproutCore and the Future of Web AppsMike Subelsky
 
Ruby on Rails survival guide of an aged Java developer
Ruby on Rails survival guide of an aged Java developerRuby on Rails survival guide of an aged Java developer
Ruby on Rails survival guide of an aged Java developergicappa
 
Scalding Presentation
Scalding PresentationScalding Presentation
Scalding PresentationLandoop Ltd
 
Why hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scaldingWhy hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scaldingXebia Nederland BV
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
Leveraging the Power of Graph Databases in PHP
Leveraging the Power of Graph Databases in PHPLeveraging the Power of Graph Databases in PHP
Leveraging the Power of Graph Databases in PHPJeremy Kendall
 
Down the Rabbit Hole: An Adventure in JVM Wonderland
Down the Rabbit Hole: An Adventure in JVM WonderlandDown the Rabbit Hole: An Adventure in JVM Wonderland
Down the Rabbit Hole: An Adventure in JVM WonderlandCharles Nutter
 

Similar to Scala and Hadoop @ eBay (20)

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
PSGI and Plack from first principles
PSGI and Plack from first principlesPSGI and Plack from first principles
PSGI and Plack from first principles
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data Era
 
Fast as C: How to Write Really Terrible Java
Fast as C: How to Write Really Terrible JavaFast as C: How to Write Really Terrible Java
Fast as C: How to Write Really Terrible Java
 
The Essential Perl Hacker's Toolkit
The Essential Perl Hacker's ToolkitThe Essential Perl Hacker's Toolkit
The Essential Perl Hacker's Toolkit
 
Perl6 meets JVM
Perl6 meets JVMPerl6 meets JVM
Perl6 meets JVM
 
Fast Web Applications Development with Ruby on Rails on Oracle
Fast Web Applications Development with Ruby on Rails on OracleFast Web Applications Development with Ruby on Rails on Oracle
Fast Web Applications Development with Ruby on Rails on Oracle
 
Migrating Legacy Data (Ruby Midwest)
Migrating Legacy Data (Ruby Midwest)Migrating Legacy Data (Ruby Midwest)
Migrating Legacy Data (Ruby Midwest)
 
Static or Dynamic Typing? Why not both?
Static or Dynamic Typing? Why not both?Static or Dynamic Typing? Why not both?
Static or Dynamic Typing? Why not both?
 
SproutCore and the Future of Web Apps
SproutCore and the Future of Web AppsSproutCore and the Future of Web Apps
SproutCore and the Future of Web Apps
 
Ruby on Rails survival guide of an aged Java developer
Ruby on Rails survival guide of an aged Java developerRuby on Rails survival guide of an aged Java developer
Ruby on Rails survival guide of an aged Java developer
 
[Start] Scala
[Start] Scala[Start] Scala
[Start] Scala
 
Scio
ScioScio
Scio
 
Scalding Presentation
Scalding PresentationScalding Presentation
Scalding Presentation
 
Why hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scaldingWhy hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scalding
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Leveraging the Power of Graph Databases in PHP
Leveraging the Power of Graph Databases in PHPLeveraging the Power of Graph Databases in PHP
Leveraging the Power of Graph Databases in PHP
 
Down the Rabbit Hole: An Adventure in JVM Wonderland
Down the Rabbit Hole: An Adventure in JVM WonderlandDown the Rabbit Hole: An Adventure in JVM Wonderland
Down the Rabbit Hole: An Adventure in JVM Wonderland
 

Recently uploaded

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Recently uploaded (20)

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Scala and Hadoop @ eBay

  • 2. What we will cover • Polymorphic Function Values • Higher Kinded/Recursive Types • Cokleislis Star Operators • Scala Macros
  • 3. I have no clue what those things are
  • 4. What we will ACTUALLY cover • Why Scala • Why Hadoop • How we use Scala with Hadoop • Lots of CODE!
  • 5. Why Scala? • JVM • **Functional** • Expressive • How to convince your boss?
  • 6. Someone on Hacker News said Scala sucks • Compile Times • You changed List again? • Complicated • Leads to Madness
  • 7. Madness? trait Lazy[+T, P] { var creationParameters: P = None.asInstanceOf[P]; lazy val lazyThing: Either[Throwable, T] = try { Right(create(creationParameters)) } catch { case e => Left(e) } def get(createParams: P): Either[Throwable, T] = { creationParameters = createParams lazyThing } def create(params: P): T }
  • 8. Madness? def getSingleInstance[T, P](params: P)(implicit lazyCreator: Lazy[T, P]): T = { lazyCreator.get(params) match { case Right(successValue) => successValue case Left(exception) => throw new StackException(exception) } }
  • 9. This is used by ONE client class • Show some self-restraint
  • 10.
  • 11. Hadoop • void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) • void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
  • 12. BIG NUMBERS • Petabytes of data • 1k+ node Hadoop cluster • Multi-billion dollar merchandising business • Lots of users and items 
  • 13. How should I use Map Reduce? • Raw map reduce  • Pig • Hive • Cascading • Scoobi • Scalding 
  • 14. Decision Time • “And every one that heareth these sayings of mine (great software engineers of the past), and doeth them not, shall be likened unto a foolish man, which built his house upon the sand.” • “And the rain descended, and the floods came, and the winds blew, and beat upon that house; and it fell: and great was the fall of it.”
  • 15. I believe! • Scalding combines the best of PIG and Cascading
  • 16. Good Pig A = LOAD 'input' AS (x, y, z); B = FILTER A BY x > 5; DUMP B; C = FOREACH B GENERATE y, z; STORE C INTO 'output'; // do joins and group by also
  • 17. Bad Pig DEFINE NV_terms `perl nv_terms2.pl` ship('$scripts/nv_terms2.pl'); i5 = stream i4 through NV_terms as (leafcat:chararray, name:chararray, name1:chararray); i7 = foreach i5 generate leafcat, com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name) as name, com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name1) as name1;
  • 18. Other Pig Issues • Scheduling and DAG creation
  • 19. Cascading Rocks! • What is it? • Supports large workflows and reusable components – DAG generation – Parallel Executions
  • 20. Cascading code in Scala val masterPipe = new FilterURLEncodedStrings(masterPipe, "sqr") masterPipe = new FilterInappropriateQueries(masterPipe, "sqr”) masterPipe = new GroupBy(masterPipe, CFields("user_id", "epoch_ts", "sqr"), sortFields)
  • 21. Someone should really code review this
  • 22. Cascading Issues This page intentionally left blank
  • 23. Scalding Time class WordCountJob(args : Args) extends Job(args) { TextLine( args("input") ) .flatMap('line -> 'word) { line : String => tokenize(line) } .groupBy('word) { _.size } .write( Tsv( args("output") ) ) // Split a piece of text into individual words. def tokenize(text : String) : Array[String] = { // Lowercase each word and remove punctuation. text.toLowerCase.replaceAll("[^a-zA-Z0-9s]", "").split("s+") } }
  • 24. Scalding @ eBay • Boilerplate reduction • Extensibility • New hires
  • 25. Practical Scalding Use • Pimp my pimp • Code generated boilerplate • Cascades • Traps • Testing!
  • 26. class eBayJob(args: Args) extends Job(args) with PipeBoilerPlate { implicit def pipe2eBayRichPipe(pipe: Pipe) = new eBayRichPipe(pipe) class eBayRichPipe(pipe: Pipe) extends RichPipe(pipe) with CommonFunctions trait CommonFunctions { import Dsl._ import RichPipe.assignName def pipe: Pipe def reallyComplexFunction(field: Fields, param: Long) = { //mind blowing code here } }}
  • 27. CheckoutTransactionsPipe(//default path logic) .project(//fields I need) .countUserInteractions(//params) .doScoreCalculation(//params) .doConfidenceCalculation(//params) Seems a bit too readable for Scala
  • 28. Collaborative Filtering • Typically hard to run on large datasets
  • 29. Structured Data Importance • Do people shop by brand? 0 0.2 0.4 0.6 0.8 1 1.2 Supply Handbags and Purses
  • 30. Markov Chains • Investigation of buying patterns in ~50 lines of code val purchases = "firsttime" :: x.take(500).toList val pairs = purchases zip purchases.tail val grouped = pairs.groupBy(x => x._1.toString+"-"+x._2.toString) val sizes = grouped map { x => { x._1 -> x._2.size }} toList
  • 31. Mining Search Queries • 20+ billion user queries - give me the top ones per user De-Dupe Rank ValidateSample Data
  • 32. Automation Hadoop Proxy Batch Database Load Machines Cassandra Jenkins MySql Mongo

Editor's Notes

  1. Introduce myself and ebay NYC
  2. Laugh
  3. We are starting to use scala for live site recs
  4. Mention the Option and EitherFirst class functionsMention how great traits areI feel like Haskell will never break into the corporation this is a great draft All my life I’ve wanted a type safe build system. And NOW I have it
  5. They break backward compatibilityWeak IDE support – debugging, refactoring, etcExplain the madness
  6. Tell them about the example
  7. The most complicated system for counting words insert meme hereExplain why we use hadoop. Data is huge. I can’t say when you want to make the jump to map reduce but I see growth in making it THE platform
  8. Say why raw map reduce stinks. Mention what hive is and scoobi is
  9. Explain why we didn’t go with scoobi even though it’s all scala
  10. Scheduling and DAG creationWhere is my SOURCE?
  11. Mentionazkaban
  12. Can do parallel executions of tasks that don’t depend on each otherSupports static dependencies via cascades
  13. Verbose. You still need to write a bunch of code.
  14. Mention about scoobi and how it’s not super stableRemindthen about how it combines the best of PIG and Cascading
  15. This is actual code to compute a user’s preferences. Explain a bit about user preferences
  16. Mahout has some functions for this but they are hard to setup and get goingLess precise than other state of the art methods but still accurateScala Days Talk with Chris Severs
  17. Linear ModelTalk about Concept ExtractionUse SQL Lite for ad hoc queries
  18. Talk about the use of cascadesTalk about traps and counters
  19. Scalding makes this 100% times easier because of cascades and flows