SlideShare a Scribd company logo
1 of 33
MapReduce
   frameworks and methods
    Presenter:
          Adam Horvath
          (adam@teamleadnet.com)


23 May 2012, Sydney    Adam Horvath   1
Count the flowers
• Given a flower field
• Given unlimited resources
• How to count the different colours of flowers
  in minimal time?




23 May 2012, Sydney         Adam Horvath          2
Original problem
• Building product recommendations
      – For a ‘web shop’
• Looking for easy to use frameworks
      – Not too many generic ones
• Data warehouse
      – Way too complex, way too expensive
• Maximum 1 screen to write the whole processing
      – If it’s longer, you wouldn’t understand
• Functional languages - steep learning curve
      – Neither you nor others would understand

23 May 2012, Sydney          Adam Horvath          3
MapReduce based solutions
• Mapreduce data processing
      – Takes a while to get used to it
• Hadoop
      – Way too hard to start with
• Lightweight frameworks
      – A little limitation here and there
• MapReduce.NET
      – Roll your own

23 May 2012, Sydney         Adam Horvath     4
How can I learn to fly/MapReduce?
• Hadoop, petabytes of data
      – Get a Boeing 747
      – Pass the required 1500 hours of flying
      – Ready to go!
• Simple frameworks, gigabytes of data
      – Order a plane model online for $30
      – Play with a simulator for 60 minutes
      – Ready to go!

23 May 2012, Sydney        Adam Horvath          5
What is MapReduce
• Simple framework to parallel batch process
  data (even semi- or unstructured data)
• The output is usually reasonable size (human
  readable or for other systems)
• Working with key-value pairs
• Two major steps: Map and Reduce



23 May 2012, Sydney         Adam Horvath         6
What is not MapReduce
• Database
      – The input is usually plain text files
• Online tool
      – The process starts in batches, might run for hours
• A language
      – Any programming language can be used to create
        Map and Reduce steps (depending on the
        framework used)
• Compressing cartographic data

23 May 2012, Sydney          Adam Horvath                    7
Map
• Transform the input data into something that can be
  aggregated (counted, grouped…)
• The “SELECT T1,T2… FROM X…” part of the query
• Map from Key1-Value1 to
      – Key1-Value2 OR
      – Key2-Value2
• Example, log file line:
      – “20120501 /urlxxx”
      – Input key: might be null or file name
      – Input value: the line itself
• Output
      – Key: “urlxxx” Value: “20120501”


23 May 2012, Sydney              Adam Horvath           8
Reduce
• Aggregates the output of the Map step
• The simplest case is just to output the Map result
  without processing
• The “…SUM() … GROUP BY…” part of the query
• Reduces from Key1-Value1 to
      – Key1-Value1 (ideally, value1 is a list or a sum)
      – Key1-Value2 (input and output are incompatible)
• Input
      – Key: “urlxxx” Value: “20120501”
      – Key: “urlxxx” Value: “20120502”
• Output
      – Key: “urlxxx” Values: “20120501, 20120502”

23 May 2012, Sydney           Adam Horvath                 9
Taken from Google Code University

      23 May 2012, Sydney           Adam Horvath   10
“Embarrassingly parallel”
• MapReduce only makes sense when the data
  can be processed parallel
• Easy to slice the input into smaller
  chunks/problems
• Nothing is too big/complex if can be taken
  apart
• There is no overlapping/dependence between
  the chunks

23 May 2012, Sydney       Adam Horvath      11
Structured vs Unstructured data
• MapReduce frequently works on unstructured
  data or semi structured data
• Structured is great for answering requests
      – Normalised SQL tables
      – Typically quick answers
      – Reasonable size of data
• Unstructured is great for finding hidden trends
      – Logs, documents, images…
      – Typically batch processed, slow answers
      – Any size of data


23 May 2012, Sydney         Adam Horvath            12
History of MapReduce
• Google described it in 2004
• Map and reduce exists in many languages,
  manly in functional ones
• Input/output is stored on a distributed file
  system (GFS)
• Google Inc. holds PN/7,650,331 - System and
  method for efficient large-scale data
  processing

23 May 2012, Sydney           Adam Horvath       13
Questions for MapReduce
•   Create a database with inverted index
•   Recommend content based on previous visits
•   Parse logs, create visitor analytics
•   Process text documents, categorise them
•   Process images
•   Find partners for life



23 May 2012, Sydney     Adam Horvath             14
When to choose SQL
• Small structured data
      – Normalised tables
      – The database engine is capable of running the
        JOINs and GROUP BYs
• Online requirement
      – If the answer needs to be served within a second




23 May 2012, Sydney          Adam Horvath                  15
Typical framework limitations
• Same reduce output as input
      – Reduce input: [K1, [V1]], [K1, [V2]], output: [K1,[V1,V2]]
      – No need to create mergers but limits the complexity
• Memory
      – The hard disk is the new tape
• Debugging
      – No framework is flawless but usually hard to debug
• Typed/untyped, generics
      – Very hard to understand the process without types
• Network usage
• Faulty/dead machines slow the whole process down

23 May 2012, Sydney             Adam Horvath                         16
Well known frameworks
• Hadoop – Java based, data stored on HDFS
      – HDFS – open source distributed file system for
        large files (~Google File System)
      – HBase – built on top of HDFS, provides fast key-
        value lookup for tables (~Google Bigtable)
• Pig – Map Reduce expressed in Pig Latin
• Hive - SQL over Hadoop
• Sqoop - data from SQL to Hadoop

23 May 2012, Sydney        Adam Horvath                    17
MapReduce components
• Input
      – Reads data storage (files, database…)
• Slicer
      – Distributes the huge data into small chunks
• Map
      – Transfers the data for easy processing
• Partition/sort
      – Sorts the output of Map operation to transfer to Reducers
• Reduce
      – Aggregates the data
• Merger
      – Merges two or more outputs of Reduces
• After process
      – Some cleanup
• Output
      – Writes the output of the MapReduce to disk (files, database…)


23 May 2012, Sydney                    Adam Horvath                     18
MapReduce.NET
• Open source
• Written in c#
• Made for couple of machines with several Gb
  of data
• Can process TB with batches
• In comparison
      – Facebook: Hadoop, 4800 cores, 5.5 PetaBytes,
      – 12 TB per node
23 May 2012, Sydney       Adam Horvath                 19
Counting flowers
•   Make small areas on the field
•   Place Mappers on each areas
•   Have one or more reducer for some areas
•   Mapper yells color, reducer takes note
•   Another reducer/merger merges the output of
    reducers



23 May 2012, Sydney         Adam Horvath      20
Counting flowers - code
// MAP
public override void Map(string key, string value, IQueue<Color, int> result)
{
         Flower[] flowers = Flower.ParseArr(value);
         foreach (var item in flowers)
         {
                  result.Push(item.Color, 1);
         }
}

// REDUCE
public override int Reduce(Color key, int value, int result)
{
         return result + value;
}

23 May 2012, Sydney                 Adam Horvath                                21
Inverted index
• Mapping from content to location
      – “expression” -> file.txt, 44
• Lucene is a great tool utilising inverted indices
• Index types
      – DocId list - which documents contain the term
      – Positional list - where do the documents contain the terms
      – Schema-independent - unique position of the word in the
        whole data store
• Stop words
      – Can crate noise, like the word “the”
      – Omitting it can cause problems
             • “The veronicas”

23 May 2012, Sydney              Adam Horvath                    22
Demo
                      Inverted index




23 May 2012, Sydney       Adam Horvath   23
Product recommendation
• Recommend content based on user behaviour
      – Use visit log, but other factors like clicking previous
        recommendations and aging are important too
• Process
      – MAP
             Log  Session-ProdId
      – REDUCE
             Session-ProdId  Session (ProdId 1, ProdId 2...)
      – MAP
             Session (ProdId 1, ProdId 2...) 
              Transition (ProdId 1-ProdId 2), Transition(ProdId 2 -ProdId 2)
      – REDUCE
             Transition(ProdId 1-ProdId 2)  ProdId 1 (ProdId 2, ProdId 3…)


23 May 2012, Sydney                  Adam Horvath                              24
Demo
                      Product recommendation




23 May 2012, Sydney           Adam Horvath     25
How to get started
• Download MapReduce.NET, write simple
  processors
• Download Hadoop VM from Cloudera
      – Fix all the bugs in the VM…
      – Start creating distributed processing
• Create theories, prove them using your logs!



23 May 2012, Sydney          Adam Horvath        26
Case studies
• Works for me
      – Find churning customers/visitors
      – Detecting keywords and searching by keyword
      – Item recommendation
• Yahoo
      – Search, search assist
• eHarmony
      – Matchmaking
• Nugg.ad
      – Predictive Behavioral Targeting
• Facebook
      – Recommendations, etc.


23 May 2012, Sydney              Adam Horvath         27
CS101 – why is it slow?
• Use custom data structures
      – Built in one like LinkedList, Stack, Queue will blow
        up – create fixed size structures
• Serialization is always slow, deserialization is
  even worse
      – Avro IDL, Apache Thrift IDL, ProtoBuf
• Dictionary*“slow”+++
      – is a find/read/modify/find/write operation
      – Write custom dictionary with direct indexing

23 May 2012, Sydney            Adam Horvath                    28
CS101 – why is it slow?
• Do not use synchronisation
      – Few locks and waits are okay, thousands are not
• Use CPU native data types
      – String is not
• Do not use in-built string operations
      – Split, replace, … creates extra String instances
• Don’t use RegEx
• Store data in the file system
      – Anything that resembles a database (even SQLite) is at
        least a magnitude slower

23 May 2012, Sydney            Adam Horvath                  29
CS101 – why is it slow?
• Allow thread scheduler to do his job
      – Use Thread.Sleep(0) and not Thread.Sleep(1)
• Do not use sorted data structures
      – Sort the data when you are done
• Reflection is fine to call a long running method
      – But slow to call a function many times
• Choose hash function very carefully
      – Validate the choice, collision is not your friend
• Never performance test under Virtual Machines
• Win32 on 32 bit platform
      – 2Gb max process size
      – On some Linux distributions depending on Kernel this holds too



23 May 2012, Sydney               Adam Horvath                       30
URLs
• This presentation
  http://slidesha.re/mapreduce-intro
• MapReduce.NET
  http://code.google.com/p/mapreduce-net
• Apache Hadoop
  http://hadoop.apache.org
• Apache Pig
  http://pig.apache.org
• Apache Hive
  http://hive.apache.org
• Cloudera Hadoop demo VM
  https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+De
  mo+VM
• A Comparison of Approaches to Large-Scale Data Analysis
  http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/be
  nchmarks-sigmod09.pdf
• Information Retrieval: Implementing and Evaluating Search Engines
  http://www.ir.uwaterloo.ca/book

23 May 2012, Sydney          Adam Horvath                        31
Questions




    Adam Horvath
    (adam@teamleadnet.com)



23 May 2012, Sydney     Adam Horvath   32
After hour labs




    Adam Horvath
    (adam@teamleadnet.com)



23 May 2012, Sydney        Adam Horvath   33

More Related Content

Similar to MapReduce frameworks and methods - Adam Horvath, Google Technology User Group, Sydney

Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at myliferesponseteam
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupCsaba Toth
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Andrew Brust
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Mark Rittman
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoopinside-BigData.com
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...Mark Rittman
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2tcloudcomputing-tw
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceCsaba Toth
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache HadoopKMS Technology
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopRan Ziv
 
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemZohar Elkayam
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Zohar Elkayam
 

Similar to MapReduce frameworks and methods - Adam Horvath, Google Technology User Group, Sydney (20)

Hadoop
HadoopHadoop
Hadoop
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at mylife
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache Hadoop
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
 

Recently uploaded

THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxPoojaSen20
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinojohnmickonozaleda
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 

Recently uploaded (20)

THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipino
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 

MapReduce frameworks and methods - Adam Horvath, Google Technology User Group, Sydney

  • 1. MapReduce frameworks and methods Presenter: Adam Horvath (adam@teamleadnet.com) 23 May 2012, Sydney Adam Horvath 1
  • 2. Count the flowers • Given a flower field • Given unlimited resources • How to count the different colours of flowers in minimal time? 23 May 2012, Sydney Adam Horvath 2
  • 3. Original problem • Building product recommendations – For a ‘web shop’ • Looking for easy to use frameworks – Not too many generic ones • Data warehouse – Way too complex, way too expensive • Maximum 1 screen to write the whole processing – If it’s longer, you wouldn’t understand • Functional languages - steep learning curve – Neither you nor others would understand 23 May 2012, Sydney Adam Horvath 3
  • 4. MapReduce based solutions • Mapreduce data processing – Takes a while to get used to it • Hadoop – Way too hard to start with • Lightweight frameworks – A little limitation here and there • MapReduce.NET – Roll your own 23 May 2012, Sydney Adam Horvath 4
  • 5. How can I learn to fly/MapReduce? • Hadoop, petabytes of data – Get a Boeing 747 – Pass the required 1500 hours of flying – Ready to go! • Simple frameworks, gigabytes of data – Order a plane model online for $30 – Play with a simulator for 60 minutes – Ready to go! 23 May 2012, Sydney Adam Horvath 5
  • 6. What is MapReduce • Simple framework to parallel batch process data (even semi- or unstructured data) • The output is usually reasonable size (human readable or for other systems) • Working with key-value pairs • Two major steps: Map and Reduce 23 May 2012, Sydney Adam Horvath 6
  • 7. What is not MapReduce • Database – The input is usually plain text files • Online tool – The process starts in batches, might run for hours • A language – Any programming language can be used to create Map and Reduce steps (depending on the framework used) • Compressing cartographic data 23 May 2012, Sydney Adam Horvath 7
  • 8. Map • Transform the input data into something that can be aggregated (counted, grouped…) • The “SELECT T1,T2… FROM X…” part of the query • Map from Key1-Value1 to – Key1-Value2 OR – Key2-Value2 • Example, log file line: – “20120501 /urlxxx” – Input key: might be null or file name – Input value: the line itself • Output – Key: “urlxxx” Value: “20120501” 23 May 2012, Sydney Adam Horvath 8
  • 9. Reduce • Aggregates the output of the Map step • The simplest case is just to output the Map result without processing • The “…SUM() … GROUP BY…” part of the query • Reduces from Key1-Value1 to – Key1-Value1 (ideally, value1 is a list or a sum) – Key1-Value2 (input and output are incompatible) • Input – Key: “urlxxx” Value: “20120501” – Key: “urlxxx” Value: “20120502” • Output – Key: “urlxxx” Values: “20120501, 20120502” 23 May 2012, Sydney Adam Horvath 9
  • 10. Taken from Google Code University 23 May 2012, Sydney Adam Horvath 10
  • 11. “Embarrassingly parallel” • MapReduce only makes sense when the data can be processed parallel • Easy to slice the input into smaller chunks/problems • Nothing is too big/complex if can be taken apart • There is no overlapping/dependence between the chunks 23 May 2012, Sydney Adam Horvath 11
  • 12. Structured vs Unstructured data • MapReduce frequently works on unstructured data or semi structured data • Structured is great for answering requests – Normalised SQL tables – Typically quick answers – Reasonable size of data • Unstructured is great for finding hidden trends – Logs, documents, images… – Typically batch processed, slow answers – Any size of data 23 May 2012, Sydney Adam Horvath 12
  • 13. History of MapReduce • Google described it in 2004 • Map and reduce exists in many languages, manly in functional ones • Input/output is stored on a distributed file system (GFS) • Google Inc. holds PN/7,650,331 - System and method for efficient large-scale data processing 23 May 2012, Sydney Adam Horvath 13
  • 14. Questions for MapReduce • Create a database with inverted index • Recommend content based on previous visits • Parse logs, create visitor analytics • Process text documents, categorise them • Process images • Find partners for life 23 May 2012, Sydney Adam Horvath 14
  • 15. When to choose SQL • Small structured data – Normalised tables – The database engine is capable of running the JOINs and GROUP BYs • Online requirement – If the answer needs to be served within a second 23 May 2012, Sydney Adam Horvath 15
  • 16. Typical framework limitations • Same reduce output as input – Reduce input: [K1, [V1]], [K1, [V2]], output: [K1,[V1,V2]] – No need to create mergers but limits the complexity • Memory – The hard disk is the new tape • Debugging – No framework is flawless but usually hard to debug • Typed/untyped, generics – Very hard to understand the process without types • Network usage • Faulty/dead machines slow the whole process down 23 May 2012, Sydney Adam Horvath 16
  • 17. Well known frameworks • Hadoop – Java based, data stored on HDFS – HDFS – open source distributed file system for large files (~Google File System) – HBase – built on top of HDFS, provides fast key- value lookup for tables (~Google Bigtable) • Pig – Map Reduce expressed in Pig Latin • Hive - SQL over Hadoop • Sqoop - data from SQL to Hadoop 23 May 2012, Sydney Adam Horvath 17
  • 18. MapReduce components • Input – Reads data storage (files, database…) • Slicer – Distributes the huge data into small chunks • Map – Transfers the data for easy processing • Partition/sort – Sorts the output of Map operation to transfer to Reducers • Reduce – Aggregates the data • Merger – Merges two or more outputs of Reduces • After process – Some cleanup • Output – Writes the output of the MapReduce to disk (files, database…) 23 May 2012, Sydney Adam Horvath 18
  • 19. MapReduce.NET • Open source • Written in c# • Made for couple of machines with several Gb of data • Can process TB with batches • In comparison – Facebook: Hadoop, 4800 cores, 5.5 PetaBytes, – 12 TB per node 23 May 2012, Sydney Adam Horvath 19
  • 20. Counting flowers • Make small areas on the field • Place Mappers on each areas • Have one or more reducer for some areas • Mapper yells color, reducer takes note • Another reducer/merger merges the output of reducers 23 May 2012, Sydney Adam Horvath 20
  • 21. Counting flowers - code // MAP public override void Map(string key, string value, IQueue<Color, int> result) { Flower[] flowers = Flower.ParseArr(value); foreach (var item in flowers) { result.Push(item.Color, 1); } } // REDUCE public override int Reduce(Color key, int value, int result) { return result + value; } 23 May 2012, Sydney Adam Horvath 21
  • 22. Inverted index • Mapping from content to location – “expression” -> file.txt, 44 • Lucene is a great tool utilising inverted indices • Index types – DocId list - which documents contain the term – Positional list - where do the documents contain the terms – Schema-independent - unique position of the word in the whole data store • Stop words – Can crate noise, like the word “the” – Omitting it can cause problems • “The veronicas” 23 May 2012, Sydney Adam Horvath 22
  • 23. Demo Inverted index 23 May 2012, Sydney Adam Horvath 23
  • 24. Product recommendation • Recommend content based on user behaviour – Use visit log, but other factors like clicking previous recommendations and aging are important too • Process – MAP Log  Session-ProdId – REDUCE Session-ProdId  Session (ProdId 1, ProdId 2...) – MAP Session (ProdId 1, ProdId 2...)  Transition (ProdId 1-ProdId 2), Transition(ProdId 2 -ProdId 2) – REDUCE Transition(ProdId 1-ProdId 2)  ProdId 1 (ProdId 2, ProdId 3…) 23 May 2012, Sydney Adam Horvath 24
  • 25. Demo Product recommendation 23 May 2012, Sydney Adam Horvath 25
  • 26. How to get started • Download MapReduce.NET, write simple processors • Download Hadoop VM from Cloudera – Fix all the bugs in the VM… – Start creating distributed processing • Create theories, prove them using your logs! 23 May 2012, Sydney Adam Horvath 26
  • 27. Case studies • Works for me – Find churning customers/visitors – Detecting keywords and searching by keyword – Item recommendation • Yahoo – Search, search assist • eHarmony – Matchmaking • Nugg.ad – Predictive Behavioral Targeting • Facebook – Recommendations, etc. 23 May 2012, Sydney Adam Horvath 27
  • 28. CS101 – why is it slow? • Use custom data structures – Built in one like LinkedList, Stack, Queue will blow up – create fixed size structures • Serialization is always slow, deserialization is even worse – Avro IDL, Apache Thrift IDL, ProtoBuf • Dictionary*“slow”+++ – is a find/read/modify/find/write operation – Write custom dictionary with direct indexing 23 May 2012, Sydney Adam Horvath 28
  • 29. CS101 – why is it slow? • Do not use synchronisation – Few locks and waits are okay, thousands are not • Use CPU native data types – String is not • Do not use in-built string operations – Split, replace, … creates extra String instances • Don’t use RegEx • Store data in the file system – Anything that resembles a database (even SQLite) is at least a magnitude slower 23 May 2012, Sydney Adam Horvath 29
  • 30. CS101 – why is it slow? • Allow thread scheduler to do his job – Use Thread.Sleep(0) and not Thread.Sleep(1) • Do not use sorted data structures – Sort the data when you are done • Reflection is fine to call a long running method – But slow to call a function many times • Choose hash function very carefully – Validate the choice, collision is not your friend • Never performance test under Virtual Machines • Win32 on 32 bit platform – 2Gb max process size – On some Linux distributions depending on Kernel this holds too 23 May 2012, Sydney Adam Horvath 30
  • 31. URLs • This presentation http://slidesha.re/mapreduce-intro • MapReduce.NET http://code.google.com/p/mapreduce-net • Apache Hadoop http://hadoop.apache.org • Apache Pig http://pig.apache.org • Apache Hive http://hive.apache.org • Cloudera Hadoop demo VM https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+De mo+VM • A Comparison of Approaches to Large-Scale Data Analysis http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/be nchmarks-sigmod09.pdf • Information Retrieval: Implementing and Evaluating Search Engines http://www.ir.uwaterloo.ca/book 23 May 2012, Sydney Adam Horvath 31
  • 32. Questions Adam Horvath (adam@teamleadnet.com) 23 May 2012, Sydney Adam Horvath 32
  • 33. After hour labs Adam Horvath (adam@teamleadnet.com) 23 May 2012, Sydney Adam Horvath 33