SlideShare a Scribd company logo
1 of 29
Download to read offline
Hadoop/Mahout/HBase



                     2011/04/10
                   #TokyoWebmining10-2

                         yanaoki



2011   4   18
•
                • HBase
                • Mahout
                • Naive Bayes
                •
                • Web

2011   4   18
•
                    •   naoki yanai
                •
                    •
                    •                 …

                •
                    •
                    •       Hadoop

                •
                    •

2011   4   18
HBase
                •   KeyValue

                    •                                                         read/write

                        •   goal is the hosting of very large tables -- billions of rows ,
                            millions of columns ...


                    •   Hadoop

                •   CAP                   C,P

                    •   C:            ,A:             ,P:

                •            Sharding

                •   Hadoop/MapReduce
2011   4   18
HBase
                •
                    •   ―

                    •   ―

                    •



                            qualifier

2011   4   18
Mahout
           •
           •    Hadoop

                •
                •                          HBase

                •
           •
                •   Classifier / Clustering / Pattern Mining

                •   Recommenders / Collaborative Filtering

                •   Evolutionary Algorithms ...
2011   4   18
Mahout

           •
           •
                •
                •   Mahout

                •   Mahout in Action PDF

                •   hamadakoichi

                •   TokyoWebmining

2011   4   18
Naive Bayes
           •        F1,...,Fn           C




           •    C




           •


2011   4   18
Naive Bayes
                •
                    •
                        •
                    •
                        •
                    •
                        •
2011   4   18
Naive Bayes
                •
                    •
                •
                    •
                •
                    •
                •
                    •
2011   4   18
•       Web

                    •
                    •
                    •
                •
                    •

2011   4   18
2011   4   18
•    Ruby

                •   ExtractContent

           require "open-uri"
           require "extractcontent"

           html = open("http://
           news.nifty.com/....htm").read
           body, title = ExtractContent::analyse(html)

           puts body.toutf8 #=>        HTML


2011   4   18
•    Ruby

                •   scrAPI


       require 'scrapi'
       require 'open-uri'

       scr = Scraper.define do
        process "div.tweet", "tweets[]"=> :text
        result :tweets
       end

       tweets = scr.scrape(URI.parse("http://togetter.com/li/
       121476"), :parser_options => {:char_encoding => 'utf8'})

       tweets.each{ |tw| puts tw } #=>


2011   4   18
•                                             RSS                      HBase


           •
                      (URL)
                                         content                         categories

       http://togetter/1.html                                  category:src=”togetter”
                                                   ...
                                                               category:cat=”social”

       http://                                                 category:src=”nifty”
       news.nifty.com/....html     AKB      ...
                                                               category:cat=”entertainment”
       http://groups.google.com/                         10
       group/webmining-tokyo/
                                                  …

       http://ameblo.jp/....html
                                   KARA …

2011   4   18
•    HBase

                    category_id <TAB>

           •    HBase           MaprReduce   HDFS

                •
                    •
                    •
                        •   Wikipedia

                    •
2011   4   18
•    mahout

                    $ mahout trainclassifier       ...

                    $ mahout testclassifier        …

           •    mahout

                •    --input/--output         /

                •    --dataSource                   HDFS   HBase

                •    --gramSize     N-gram

                •    --classifierType

                •    --alpha

                •    --minDF/--minSupport                  /

2011   4   18
•                            HBase


           •
           =======================================================
           Summary
           -------------------------------------------------------
           Correctly Classified Instances          :       1884       82.2348%
           Incorrectly Classified Instances        :        407       17.7652%
           Total Classified Instances              :       2291
           =======================================================
           Confusion Matrix
           -------------------------------------------------------
           a       b       c       d       e       <--Classified as
           216     32      22      155     0        |  425         a     = t
           0       514     13      70      0        |  597         b     = s
           0       2       514     9       0        |  525         c     = e
           1       8       13      638     0        |  660         d     = b
           0       0       67      15      2        |  84          e     = a
           Default Category: unknown: 5


2011   4   18
•
           •                                      reducer                      HBase


            //
            BayesParameters params = new BayesParameters();
            params.set("alpha_i", "1");
            algorithm = new CBayesAlgorithm();
            datastore = new HBaseBayesDatastore("model_table_name", params);
            classifier = new ClassifierContext(algorithm, datastore);

            //
            ClassifierResult category = classifier.classifyDocument(doc.toArray(new String
            [doc.size()]), "default");

            String label = category.getLabel();


2011   4   18
•

                      (URL)
                                         content                        categories

       http://togetter/1.html                                 category:src=”togetter”
                                                   ...
                                                              category:cat=”social”

       http://                                                category:src=”nifty”
       news.nifty.com/....html     AKB      ...
                                                              category:cat=”entertainment”
       http://groups.google.com/                         10
       group/webmining-tokyo/                                 category:cat=”technology”
                                                  …

       http://ameblo.jp/....html
                                   KARA …                     category:cat=”entertainment”

2011   4   18
Web




2011   4   18
Web
                •   Google News Togetter
                                   RSS

                •
                    •                              …

                    •                                         …
                •
                        a                   935        5.2M
                        b                  5,112       7.2M
                        e                  3,746       8.1M
                        s                  4,737       12M
                        t                  3,969       9.2M
2011   4   18
4/18

                                 Web
                •
                      •
                =======================================================
                Summary
                -------------------------------------------------------
                Correctly Classified Instances          :      13388        91.6798%
                Incorrectly Classified Instances        :       1215         8.3202%
                Total Classified Instances              :      14603

                =======================================================
                Confusion Matrix
                -------------------------------------------------------
                a         b         c         d         e         <--Classified as
                2328      19        515       250       0          |  3112       a       =   t
                3         2939      54        20        0          |  3016       b       =   e
                32        3         3542      109       0          |  3686       c       =   s
                33        16        128       3877      0          |  4054       d       =   b
                1         27        2         3         702        |  735        e       =   a
                Default Category: unknown: 5


2011   4   18
Web


                •
                    •
                        •                              alpha


                              1         0.5     0.1        0.01    0.001




                            65.38%   65.83%   66.73%     66.82%   67.02%


2011   4   18
4/18

                               Web


                •
                    •
                        •   N-Gram


                                     unigram   bigram


                                     63.57%    66.09%


2011   4   18
Web


                •
                    •
                        •

                                           +




                                  56.8%   65.38%


2011   4   18
4/18

                            Web


                •
                    •
                        •



                                  67.02%   67.88%


2011   4   18
•
                    •
                •               HBase/Mahout

                    •
                    •   HBase



2011   4   18
2011   4   18

More Related Content

Viewers also liked

Mahoutにパッチを送ってみた
Mahoutにパッチを送ってみたMahoutにパッチを送ってみた
Mahoutにパッチを送ってみたissaymk2
 
ComplementaryNaiveBayesClassifier
ComplementaryNaiveBayesClassifierComplementaryNaiveBayesClassifier
ComplementaryNaiveBayesClassifierNaoki Yanai
 
Introduction to fuzzy kmeans on mahout
Introduction to fuzzy kmeans on mahoutIntroduction to fuzzy kmeans on mahout
Introduction to fuzzy kmeans on mahouttakaya imai
 
Introduction to Mahout Clustering - #TokyoWebmining #6
Introduction to Mahout Clustering - #TokyoWebmining #6Introduction to Mahout Clustering - #TokyoWebmining #6
Introduction to Mahout Clustering - #TokyoWebmining #6Koichi Hamada
 
Apache Mahout - Random Forests - #TokyoWebmining #8
Apache Mahout - Random Forests - #TokyoWebmining #8 Apache Mahout - Random Forests - #TokyoWebmining #8
Apache Mahout - Random Forests - #TokyoWebmining #8 Koichi Hamada
 
協調フィルタリング with Mahout
協調フィルタリング with Mahout協調フィルタリング with Mahout
協調フィルタリング with MahoutKatsuhiro Takata
 
Mahout Canopy Clustering - #TokyoWebmining 9
Mahout Canopy Clustering - #TokyoWebmining 9Mahout Canopy Clustering - #TokyoWebmining 9
Mahout Canopy Clustering - #TokyoWebmining 9Koichi Hamada
 
"Mahout Recommendation" - #TokyoWebmining 14th
"Mahout Recommendation" -  #TokyoWebmining 14th"Mahout Recommendation" -  #TokyoWebmining 14th
"Mahout Recommendation" - #TokyoWebmining 14thKoichi Hamada
 
MapReduceによる大規模データを利用した機械学習
MapReduceによる大規模データを利用した機械学習MapReduceによる大規模データを利用した機械学習
MapReduceによる大規模データを利用した機械学習Preferred Networks
 
20161029 TVI Tokyowebmining Seminar for Share
20161029 TVI Tokyowebmining Seminar for Share20161029 TVI Tokyowebmining Seminar for Share
20161029 TVI Tokyowebmining Seminar for ShareYasushi Gunya
 
計量経済学と 機械学習の交差点入り口 (公開用)
計量経済学と 機械学習の交差点入り口 (公開用)計量経済学と 機械学習の交差点入り口 (公開用)
計量経済学と 機械学習の交差点入り口 (公開用)Shota Yasui
 
オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京
オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京
オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京Koichi Hamada
 
Appium: Automation for Mobile Apps
Appium: Automation for Mobile AppsAppium: Automation for Mobile Apps
Appium: Automation for Mobile AppsSauce Labs
 
はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-
はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-
はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-Naoki Yanai
 

Viewers also liked (15)

Mahoutにパッチを送ってみた
Mahoutにパッチを送ってみたMahoutにパッチを送ってみた
Mahoutにパッチを送ってみた
 
ComplementaryNaiveBayesClassifier
ComplementaryNaiveBayesClassifierComplementaryNaiveBayesClassifier
ComplementaryNaiveBayesClassifier
 
Introduction to fuzzy kmeans on mahout
Introduction to fuzzy kmeans on mahoutIntroduction to fuzzy kmeans on mahout
Introduction to fuzzy kmeans on mahout
 
Introduction to Mahout Clustering - #TokyoWebmining #6
Introduction to Mahout Clustering - #TokyoWebmining #6Introduction to Mahout Clustering - #TokyoWebmining #6
Introduction to Mahout Clustering - #TokyoWebmining #6
 
Frequency Pattern Mining
Frequency Pattern MiningFrequency Pattern Mining
Frequency Pattern Mining
 
Apache Mahout - Random Forests - #TokyoWebmining #8
Apache Mahout - Random Forests - #TokyoWebmining #8 Apache Mahout - Random Forests - #TokyoWebmining #8
Apache Mahout - Random Forests - #TokyoWebmining #8
 
協調フィルタリング with Mahout
協調フィルタリング with Mahout協調フィルタリング with Mahout
協調フィルタリング with Mahout
 
Mahout Canopy Clustering - #TokyoWebmining 9
Mahout Canopy Clustering - #TokyoWebmining 9Mahout Canopy Clustering - #TokyoWebmining 9
Mahout Canopy Clustering - #TokyoWebmining 9
 
"Mahout Recommendation" - #TokyoWebmining 14th
"Mahout Recommendation" -  #TokyoWebmining 14th"Mahout Recommendation" -  #TokyoWebmining 14th
"Mahout Recommendation" - #TokyoWebmining 14th
 
MapReduceによる大規模データを利用した機械学習
MapReduceによる大規模データを利用した機械学習MapReduceによる大規模データを利用した機械学習
MapReduceによる大規模データを利用した機械学習
 
20161029 TVI Tokyowebmining Seminar for Share
20161029 TVI Tokyowebmining Seminar for Share20161029 TVI Tokyowebmining Seminar for Share
20161029 TVI Tokyowebmining Seminar for Share
 
計量経済学と 機械学習の交差点入り口 (公開用)
計量経済学と 機械学習の交差点入り口 (公開用)計量経済学と 機械学習の交差点入り口 (公開用)
計量経済学と 機械学習の交差点入り口 (公開用)
 
オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京
オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京
オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京
 
Appium: Automation for Mobile Apps
Appium: Automation for Mobile AppsAppium: Automation for Mobile Apps
Appium: Automation for Mobile Apps
 
はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-
はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-
はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-
 

Similar to Hadoop/Mahout/HBaseで テキスト分類器を作ったよ

Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars GeorgeJAX London
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...rhatr
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airshipdave_revell
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBradford Stephens
 
Analyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBaseAnalyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBaseWibiData
 
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleDebugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleData Con LA
 
What's behind facebook
What's behind facebookWhat's behind facebook
What's behind facebookAjen 陳
 
Facebook Hadoop Data & Applications
Facebook Hadoop Data & ApplicationsFacebook Hadoop Data & Applications
Facebook Hadoop Data & Applicationsdzhou
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestHBaseCon
 
Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureRyan Hennig
 
SDEC2011 Essentials of Hive
SDEC2011 Essentials of HiveSDEC2011 Essentials of Hive
SDEC2011 Essentials of HiveKorea Sdec
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDCDrew Farris
 
Be nice to your designers
Be nice to your designersBe nice to your designers
Be nice to your designersPai-Cheng Tao
 
Riak seattle-meetup-august
Riak seattle-meetup-augustRiak seattle-meetup-august
Riak seattle-meetup-augustpharkmillups
 
Programming Hive Reading #4
Programming Hive Reading #4Programming Hive Reading #4
Programming Hive Reading #4moai kids
 

Similar to Hadoop/Mahout/HBaseで テキスト分類器を作ったよ (20)

Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars George
 
HBase, no trouble
HBase, no troubleHBase, no trouble
HBase, no trouble
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airship
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
 
Analyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBaseAnalyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBase
 
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleDebugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
 
What's behind facebook
What's behind facebookWhat's behind facebook
What's behind facebook
 
HBase app HUG talk
HBase app HUG talkHBase app HUG talk
HBase app HUG talk
 
Mar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBaseMar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBase
 
Facebook Hadoop Data & Applications
Facebook Hadoop Data & ApplicationsFacebook Hadoop Data & Applications
Facebook Hadoop Data & Applications
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 
Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and Future
 
SDEC2011 Essentials of Hive
SDEC2011 Essentials of HiveSDEC2011 Essentials of Hive
SDEC2011 Essentials of Hive
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDC
 
Be nice to your designers
Be nice to your designersBe nice to your designers
Be nice to your designers
 
20100128ebay
20100128ebay20100128ebay
20100128ebay
 
Riak seattle-meetup-august
Riak seattle-meetup-augustRiak seattle-meetup-august
Riak seattle-meetup-august
 
Programming Hive Reading #4
Programming Hive Reading #4Programming Hive Reading #4
Programming Hive Reading #4
 

Recently uploaded

Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdfJamie (Taka) Wang
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.francesco barbera
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIUdaiappa Ramachandran
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServiceRenan Moreira de Oliveira
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum ComputingGDSC PJATK
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?SANGHEE SHIN
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 

Recently uploaded (20)

Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AI
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum Computing
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 

Hadoop/Mahout/HBaseで テキスト分類器を作ったよ

  • 1. Hadoop/Mahout/HBase 2011/04/10 #TokyoWebmining10-2 yanaoki 2011 4 18
  • 2. • HBase • Mahout • Naive Bayes • • Web 2011 4 18
  • 3. • naoki yanai • • • … • • • Hadoop • • 2011 4 18
  • 4. HBase • KeyValue • read/write • goal is the hosting of very large tables -- billions of rows , millions of columns ... • Hadoop • CAP C,P • C: ,A: ,P: • Sharding • Hadoop/MapReduce 2011 4 18
  • 5. HBase • • ― • ― • qualifier 2011 4 18
  • 6. Mahout • • Hadoop • • HBase • • • Classifier / Clustering / Pattern Mining • Recommenders / Collaborative Filtering • Evolutionary Algorithms ... 2011 4 18
  • 7. Mahout • • • • Mahout • Mahout in Action PDF • hamadakoichi • TokyoWebmining 2011 4 18
  • 8. Naive Bayes • F1,...,Fn C • C • 2011 4 18
  • 9. Naive Bayes • • • • • • • 2011 4 18
  • 10. Naive Bayes • • • • • • • • 2011 4 18
  • 11. Web • • • • • 2011 4 18
  • 12. 2011 4 18
  • 13. Ruby • ExtractContent require "open-uri" require "extractcontent" html = open("http:// news.nifty.com/....htm").read body, title = ExtractContent::analyse(html) puts body.toutf8 #=> HTML 2011 4 18
  • 14. Ruby • scrAPI require 'scrapi' require 'open-uri' scr = Scraper.define do process "div.tweet", "tweets[]"=> :text result :tweets end tweets = scr.scrape(URI.parse("http://togetter.com/li/ 121476"), :parser_options => {:char_encoding => 'utf8'}) tweets.each{ |tw| puts tw } #=> 2011 4 18
  • 15. RSS HBase • (URL) content categories http://togetter/1.html category:src=”togetter” ... category:cat=”social” http:// category:src=”nifty” news.nifty.com/....html AKB ... category:cat=”entertainment” http://groups.google.com/ 10 group/webmining-tokyo/ … http://ameblo.jp/....html KARA … 2011 4 18
  • 16. HBase category_id <TAB> • HBase MaprReduce HDFS • • • • Wikipedia • 2011 4 18
  • 17. mahout $ mahout trainclassifier ... $ mahout testclassifier … • mahout • --input/--output / • --dataSource HDFS HBase • --gramSize N-gram • --classifierType • --alpha • --minDF/--minSupport / 2011 4 18
  • 18. HBase • ======================================================= Summary ------------------------------------------------------- Correctly Classified Instances          :       1884       82.2348% Incorrectly Classified Instances        :        407       17.7652% Total Classified Instances              :       2291 ======================================================= Confusion Matrix ------------------------------------------------------- a       b       c       d       e       <--Classified as 216     32      22      155     0        |  425         a     = t 0       514     13      70      0        |  597         b     = s 0       2       514     9       0        |  525         c     = e 1       8       13      638     0        |  660         d     = b 0       0       67      15      2        |  84          e     = a Default Category: unknown: 5 2011 4 18
  • 19. • reducer HBase // BayesParameters params = new BayesParameters(); params.set("alpha_i", "1"); algorithm = new CBayesAlgorithm(); datastore = new HBaseBayesDatastore("model_table_name", params); classifier = new ClassifierContext(algorithm, datastore); // ClassifierResult category = classifier.classifyDocument(doc.toArray(new String [doc.size()]), "default"); String label = category.getLabel(); 2011 4 18
  • 20. (URL) content categories http://togetter/1.html category:src=”togetter” ... category:cat=”social” http:// category:src=”nifty” news.nifty.com/....html AKB ... category:cat=”entertainment” http://groups.google.com/ 10 group/webmining-tokyo/ category:cat=”technology” … http://ameblo.jp/....html KARA … category:cat=”entertainment” 2011 4 18
  • 21. Web 2011 4 18
  • 22. Web • Google News Togetter RSS • • … • … • a 935 5.2M b 5,112 7.2M e 3,746 8.1M s 4,737 12M t 3,969 9.2M 2011 4 18
  • 23. 4/18 Web • • ======================================================= Summary ------------------------------------------------------- Correctly Classified Instances          :      13388        91.6798% Incorrectly Classified Instances        :       1215         8.3202% Total Classified Instances              :      14603 ======================================================= Confusion Matrix ------------------------------------------------------- a         b         c         d         e         <--Classified as 2328      19        515       250       0          |  3112       a     = t 3         2939      54        20        0          |  3016       b     = e 32        3         3542      109       0          |  3686       c     = s 33        16        128       3877      0          |  4054       d     = b 1         27        2         3         702        |  735        e     = a Default Category: unknown: 5 2011 4 18
  • 24. Web • • • alpha 1 0.5 0.1 0.01 0.001 65.38% 65.83% 66.73% 66.82% 67.02% 2011 4 18
  • 25. 4/18 Web • • • N-Gram unigram bigram 63.57% 66.09% 2011 4 18
  • 26. Web • • • + 56.8% 65.38% 2011 4 18
  • 27. 4/18 Web • • • 67.02% 67.88% 2011 4 18
  • 28. • • HBase/Mahout • • HBase 2011 4 18
  • 29. 2011 4 18