Hadoop!
2010/05/16
  naoki yanai
   id:yanaoki




                1
Hadoop

Hadoop
(Elastic MapReduce)


                      2
naoki yanai (id:yanaoki)
Web

Hadoop



                           m        m

                           iPhone

          Ruby Java


                                        3
Hadoop




         4
Hadoop




Java

Apache




                  5
Hadoop
Google 2004

MapReduce
  http://labs.google.com/papers/mapreduce.html

Google File System (GFS)
  http://labs.google.com/papers/gfs.html

2010                          Google




                                                 6
Hadoop

Web




  →




               7
8
Hadoop
         9
Hadoop
Yahoo
  Yahoo      Hadoop




     Facebook Amazon

                       10
Hadoop
RDBMS




 Join   mapreduce join




SQL     Hadoop

                         11
Hadoop


                         MapReduce
        web
                     HDFS




          RDB

  Web           Hadoop



                                     12
Hadoop


            N




Hadoop



                14
Hadoop
MapReduce HDFS

Hadoop

         MapReduce HDFS




                          15
MapReduce


      → map       → reduce      →

map   reduce    hadoop

                    key-value

               Hadoop




                                    16
HDFS

Hadoop


MapReduce




                   17
MapReduce

master
  MR:JobTracker (Job)
slave
  MR:TaskTracker (map, reduce)
slave
  MR:TaskTracker (map, reduce)




                                                            18
HDFS

master
  HDFS:NameNode
slave
  HDFS:DataNode
slave
  HDFS:DataNode




                                               19
Hadoop
                  MapReduce HDFS
master
  MR:JobTracker
  HDFS:NameNode
slave
  MR:TaskTracker
  HDFS:DataNode
slave
  MR:TaskTracker
  HDFS:DataNode
Hadoop
HDFS
MapReduce map                   reduce             JobTracker
                                                   map    reduce
                                                                   20
MapReduce


AA                A3
AB                B2
BC                C1
input            output


 map    reduce

                          21
MapReduce
Example                               Google

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
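The pseudocode above can be sketched in Python as follows. This is a minimal in-memory illustration, not Hadoop code: `run_job` and the document names are hypothetical, each letter of the input is assumed to be one word, and `reduce_fn` returns the key together with the count for convenience.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name (unused); value: document contents."""
    for word in value:  # assumption: every character is one word
        yield word, "1"


def reduce_fn(key, values):
    """key: a word; values: an iterable of counts given as strings."""
    return key, str(sum(int(v) for v in values))


def run_job(documents):
    """Hypothetical driver: run map, group by key, then run reduce."""
    groups = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


result = run_job({"line1": "AA", "line2": "AB", "line3": "BC"})
print(result)
```

With the input AA, AB, BC from the earlier slide, this yields the counts A3, B2, C1.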
                                               22
MapReduce

input (HDFS)   map output    shuffle (sort)   reduce      output (HDFS)
AA             A:1, A:1      A:<1,1,1>        reduce 1    A:3
AB             A:1, B:1      B:<1,1>          reduce 2    B:2
BC             B:1, C:1      C:<1>            reduce 1    C:1
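The flow in the diagram above (map, then shuffle with sort, then reduce) can be imitated in the style of Hadoop Streaming scripts, where records travel as tab-separated text lines. This is a rough sketch under assumptions: `mapper`, `shuffle`, and `reducer` are hypothetical names, `shuffle` is only a local stand-in for what Hadoop does between the phases, and each letter is treated as one word.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.strip():  # assumption: each letter is one word
            yield f"{word}\t1"


def shuffle(mapped_lines):
    """Local stand-in for the shuffle: sorting makes equal keys adjacent."""
    return sorted(mapped_lines)


def reducer(sorted_lines):
    """Sum the counts of each run of identical keys."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


output = list(reducer(shuffle(mapper(["AA", "AB", "BC"]))))
print(output)
```

The reducer only works because the sort places all records for one key next to each other, which is why the shuffle step sits between map and reduce.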
                                                             23
MapReduce




  Google
            24
Hadoop
Mahout

Hadoop

Apache




  CollaborativeFiltering
  Classifier
  Clustering
  DecisionForest

                           25
Hadoop




         26
Hadoop




         27
Amazon Web Service   EC2




                           28
Amazon Web Service




WebAPI




                     29
Amazon Web Service
 EC2 ( Elastic Compute Cloud )
                                 root/admin



 S3 ( Simple Storage Service )




 EMR ( Elastic MapReduce )
    Web
    Hadoop              → MapReduce

    EC2   S3                       +α
                                              30
Elastic MapReduce
            Hadoop




                  Hadoop


input    output     S3


                           31
Elastic MapReduce




 Amazon




                    32
Elastic MapReduce
client           cloud         master
          API            Job




         input/output                        slave


                          S3
                                            slave

                                    slave

                                                     33
Elastic MapReduce




      MapReduce




                    34
Elastic MapReduce
Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2294
Item




                                                       35
Elastic MapReduce
                          map/reduce


               map/reduce
input       http://www.grouplens.org/
        5




                                        36
Elastic MapReduce

input S3
[       ID] [        ID] [     ]

    map/reduce

    output      S3
[         ID] [        ID] [   ]

                                   37
Elastic MapReduce
S                                                   S



    map
     map reduce     map
                     map reduce     map
                                     map reduce
      map reduce
           reduce     map reduce
                           reduce     map reduce
                                           reduce




                                                        38
Elastic MapReduce
          step1 :
input
  key:[] value:[   ID_           ID_       ]




           map                       ID
                         key:[        ID] values[          ID_   ]

         reduce                      ID
output
         ID \t      ID_          |          ID_     |...
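Step 1 can be sketched roughly as below. The field names did not survive in the slide text, so this assumes tab-separated input lines of user ID, item ID, and rating, grouped by user ID. The function names and sample IDs are hypothetical; only the output shape (`ID \t ID_...|ID_...|...`) follows the slide.

```python
from collections import defaultdict


def step1_map(line):
    """Assumed input: 'user_id<TAB>item_id<TAB>rating'. Key on the user ID."""
    user_id, item_id, rating = line.split("\t")[:3]
    return user_id, f"{item_id}_{rating}"


def step1_reduce(user_id, values):
    """Join all of one user's 'item_rating' values with '|'."""
    return f"{user_id}\t{'|'.join(values)}"


def run_step1(lines):
    """Hypothetical local driver standing in for the Hadoop job."""
    groups = defaultdict(list)
    for line in lines:
        user_id, value = step1_map(line)
        groups[user_id].append(value)
    return [step1_reduce(u, vs) for u, vs in sorted(groups.items())]


sample = ["u1\ti1\t5", "u1\ti2\t3", "u2\ti1\t4"]
step1_output = run_step1(sample)
print(step1_output)
```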


                                                                     39
Elastic MapReduce
 step2 :
input
  key:[     ID] value:[       ID_          |     ID_          |...]



                                    ID
           map
                      key:[         IDx_       IDy] values[           x_   y]

                                    ID
          reduce
output
          IDx_            _    IDy
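A possible shape for step 2 is sketched below. It assumes the input is one line per user as produced by step 1, that the map emits every pair of items rated by the same user, and that the reduce turns the collected rating pairs into a similarity score. The surviving slide text does not name the metric, so cosine similarity is swapped in purely for illustration. All function names and sample IDs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def step2_map(line):
    """Assumed input: 'user_id<TAB>item_rating|item_rating|...'."""
    _user_id, packed = line.split("\t")
    ratings = sorted(tuple(p.rsplit("_", 1)) for p in packed.split("|"))
    for (item_x, rx), (item_y, ry) in combinations(ratings, 2):
        yield (item_x, item_y), (float(rx), float(ry))


def cosine(pairs):
    """Illustrative metric only; the slide's own metric is not recoverable."""
    dot = sum(x * y for x, y in pairs)
    norm = sqrt(sum(x * x for x, _ in pairs)) * sqrt(sum(y * y for _, y in pairs))
    return dot / norm if norm else 0.0


def run_step2(lines):
    """Hypothetical local driver: group rating pairs by item pair, then score."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in step2_map(line):
            groups[key].append(value)
    return [f"{x}_{cosine(v):.3f}_{y}" for (x, y), v in sorted(groups.items())]


step2_output = run_step2(["u1\ti1_5|i2_3", "u2\ti1_4|i2_4"])
print(step2_output)
```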


                                                                                40
Elastic MapReduce
         step3 :
input
         IDx_        _           IDy


                             IDx_(1-              ) key map
          map map
                  key: <         IDx_(1-   )> values <   IDy>


         reduce             1-
output
         IDx_        IDy_
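The `(1- )` in the map key suggests the usual trick of sorting on one minus the similarity, so that an ascending key sort lists the most similar items first. A minimal sketch under that assumption follows. The field layout `itemX_similarity_itemY`, the function names, and the sample values are all hypothetical.

```python
def step3_map(line):
    """Assumed input: 'itemX_similarity_itemY'. Key on (itemX, 1 - similarity)."""
    item_x, similarity, item_y = line.split("_")
    return (item_x, 1.0 - float(similarity)), item_y


def step3_reduce(key, item_y):
    """Turn the sort key back into a similarity for the output record."""
    item_x, distance = key
    return f"{item_x}_{item_y}_{1.0 - distance:.3f}"


def run_step3(lines):
    """Hypothetical local driver: the sort stands in for Hadoop's key ordering."""
    mapped = sorted(step3_map(line) for line in lines)
    return [step3_reduce(key, item_y) for key, item_y in mapped]


step3_output = run_step3(["i1_0.200_i2", "i1_0.900_i3", "i2_0.500_i3"])
print(step3_output)
```

Note that Hadoop Streaming compares keys as text by default, so a real job would likely need fixed-width formatting of the numeric part for the order to come out right.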


                                                                41
Elastic MapReduce




                    42
Elastic MapReduce
              1



elastic-mapreduce \
--create \
--name "item similarity job" \
--alive \
--log-uri s3n://<bucket>/logs \
--num-instances 10 \
--instance-type m1.small \
--availability-zone us-west-1a




                                 43
EC2




EC2
            44
Elastic MapReduce




WAITING
                         45
Elastic MapReduce
             2
                            S3                     (s3cmd
         input
            map/reduce python

s3cmd.rb put <bucket>:input/input.tsv input.tsv
s3cmd.rb put <bucket>:script/map.py map1.py
s3cmd.rb put <bucket>:script/reduce1.py reduce1.py
...


                                                            46
Elastic MapReduce
           4
     Job



elastic-mapreduce \
--job-flow-id j-2ROU0QKL6KOV6 \
--json item_similarity.json




                                 47
Elastic MapReduce




Step1       RUNNING
                      48
Elastic MapReduce
            5
      output



s3sync.rb -r --make-dirs <bucket>:output .

elastic-mapreduce \
--terminate \
--job-flow-id j-2ROU0QKL6KOV6




                                              49
Hadoop
         x




             50
Hadoop

Tom White ( )
         (      )
         (      )




¥4,830




                    51
52
