I built a text classifier with Hadoop/Mahout/HBase
2. •
• HBase
• Mahout
• Naive Bayes
•
• Web
2011-04-18
3. •
• naoki yanai
•
•
• …
•
•
• Hadoop
•
•
4. HBase
• KeyValue
• read/write
• goal is the hosting of very large tables -- billions of rows, millions of columns ...
• Hadoop
• In CAP terms, HBase provides C and P
• C: consistency, A: availability, P: partition tolerance
• Sharding
• Hadoop/MapReduce
5. HBase
•
• ―
• ―
•
qualifier
6. Mahout
•
• Hadoop
•
• HBase
•
•
• Classifier / Clustering / Pattern Mining
• Recommenders / Collaborative Filtering
• Evolutionary Algorithms ...
7. Mahout
•
•
•
• Mahout
• Mahout in Action PDF
• hamadakoichi
• TokyoWebmining
11. • Web
•
•
•
•
•
13. • Ruby
• ExtractContent
require "open-uri"
require "kconv" # provides String#toutf8
require "extractcontent"
html = open("http://news.nifty.com/....htm").read
body, title = ExtractContent::analyse(html)
puts body.toutf8 #=> body text extracted from the HTML
14. • Ruby
• scrAPI
require 'scrapi'
require 'open-uri'
scr = Scraper.define do
  process "div.tweet", "tweets[]" => :text
  result :tweets
end
tweets = scr.scrape(URI.parse("http://togetter.com/li/121476"), :parser_options => {:char_encoding => 'utf8'})
tweets.each { |tw| puts tw } # print the text of each tweet
15. • RSS HBase
•
Row key (URL)                                    content  categories
http://togetter/1.html                           ...      category:src="togetter", category:cat="social"
http://news.nifty.com/....html                   AKB ...  category:src="nifty", category:cat="entertainment"
http://groups.google.com/group/webmining-tokyo/  10 …
http://ameblo.jp/....html                        KARA …
16. • HBase
category_id <TAB>
• HBase MapReduce HDFS
•
•
•
• Wikipedia
•
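The `category_id <TAB>` layout above can be sketched as a small formatter. This is only an illustration: it assumes the text after the tab is the document's tokens joined by single spaces, and `training_line` is a hypothetical helper name, not part of Mahout or HBase.

```ruby
# Hypothetical helper: build one training line in the
# "category_id <TAB> text" layout described on the slide.
# Assumption: the text part is the tokens joined by single spaces.
def training_line(category_id, tokens)
  # Tabs or newlines inside a token would break the one-line-per-document
  # layout, so collapse any whitespace and drop empty tokens.
  cleaned = tokens.map { |t| t.gsub(/\s+/, " ").strip }.reject(&:empty?)
  "#{category_id}\t#{cleaned.join(' ')}"
end

puts training_line("e", ["AKB", "news", "article"])
```

Keeping one document per line lets a MapReduce job split the input at any line boundary.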
17. • mahout
$ mahout trainclassifier ...
$ mahout testclassifier …
• mahout
• --input/--output: input / output paths
• --dataSource: HDFS or HBase
• --gramSize: N-gram size
• --classifierType
• --alpha
• --minDF/--minSupport: minimum document frequency / minimum support
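To make the `--gramSize` option concrete, here is a minimal sketch of word N-gram generation over an already tokenized document. The `ngrams` helper is hypothetical and is not Mahout's implementation; in particular, whether lower-order grams are kept alongside the N-grams is a detail this sketch does not try to reproduce.

```ruby
# Hypothetical helper: build word N-grams from a list of tokens.
# Returns an empty list when n is not positive or the document is too short.
def ngrams(tokens, n)
  return [] if n < 1 || tokens.size < n
  tokens.each_cons(n).map { |gram| gram.join(" ") }
end

tokens = %w[hbase stores very large tables]
p ngrams(tokens, 1) # unigrams
p ngrams(tokens, 2) # bigrams
```

Larger N captures more word order, at the cost of a much larger feature space.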
18. • HBase
•
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 1884 82.2348%
Incorrectly Classified Instances : 407 17.7652%
Total Classified Instances : 2291
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e <--Classified as
216 32 22 155 0 | 425 a = t
0 514 13 70 0 | 597 b = s
0 2 514 9 0 | 525 c = e
1 8 13 638 0 | 660 d = b
0 0 67 15 2 | 84 e = a
Default Category: unknown: 5
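The summary line can be re-derived from the confusion matrix. The sketch below assumes rows are the true categories and columns are the predicted ones, which is consistent with the per-row totals shown. The helper names `accuracy` and `recall_per_class` are illustrative only.

```ruby
# Confusion matrix copied from the slide.
# Assumption: rows = true category, columns = predicted category.
LABELS = %w[t s e b a]
MATRIX = [
  [216,  32,  22, 155, 0],
  [  0, 514,  13,  70, 0],
  [  0,   2, 514,   9, 0],
  [  1,   8,  13, 638, 0],
  [  0,   0,  67,  15, 2]
]

# Overall accuracy: diagonal (correct) counts divided by all counts.
def accuracy(matrix)
  correct = matrix.each_index.sum { |i| matrix[i][i] }
  total = matrix.sum { |row| row.sum }
  correct.to_f / total
end

# Per-category recall: correct predictions divided by the row total.
def recall_per_class(matrix)
  matrix.each_with_index.map do |row, i|
    row.sum.zero? ? 0.0 : row[i].to_f / row.sum
  end
end

puts format("%.4f%%", accuracy(MATRIX) * 100) # prints 82.2348%
LABELS.zip(recall_per_class(MATRIX)).each do |label, recall|
  puts format("%s: %.4f", label, recall)
end
```

Per-category recall shows what the overall figure hides: category `a` has only 84 test documents, and just 2 of them are classified correctly.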
19. •
• reducer HBase
// Set up the classifier
BayesParameters params = new BayesParameters();
params.set("alpha_i", "1");
algorithm = new CBayesAlgorithm();
datastore = new HBaseBayesDatastore("model_table_name", params);
classifier = new ClassifierContext(algorithm, datastore);
// Classify a document
ClassifierResult category = classifier.classifyDocument(doc.toArray(new String[doc.size()]), "default");
String label = category.getLabel();
20. •
Row key (URL)                                    content  categories
http://togetter/1.html                           ...      category:src="togetter", category:cat="social"
http://news.nifty.com/....html                   AKB ...  category:src="nifty", category:cat="entertainment"
http://groups.google.com/group/webmining-tokyo/  10 …     category:cat="technology"
http://ameblo.jp/....html                        KARA …   category:cat="entertainment"
22. Web
• Google News Togetter
RSS
•
• …
• …
•
a 935 5.2M
b 5,112 7.2M
e 3,746 8.1M
s 4,737 12M
t 3,969 9.2M
23. 4/18
Web
•
•
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 13388 91.6798%
Incorrectly Classified Instances : 1215 8.3202%
Total Classified Instances : 14603
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e <--Classified as
2328 19 515 250 0 | 3112 a = t
3 2939 54 20 0 | 3016 b = e
32 3 3542 109 0 | 3686 c = s
33 16 128 3877 0 | 4054 d = b
1 27 2 3 702 | 735 e = a
Default Category: unknown: 5
24. Web
•
•
• alpha
1 0.5 0.1 0.01 0.001
65.38% 65.83% 66.73% 66.82% 67.02%
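As a rough illustration of what a smoothing parameter such as alpha does, the sketch below uses textbook additive smoothing for a multinomial naive Bayes word probability. This is an assumption made for illustration: Mahout's complementary naive Bayes computes its weights differently, and `smoothed_prob` and the counts passed to it are hypothetical.

```ruby
# Additive smoothing for one word in one category (textbook form,
# not Mahout's exact weight formula).
# count:       occurrences of the word in the category
# class_total: total word occurrences in the category
# vocab_size:  number of distinct words over all categories
# alpha:       smoothing strength
def smoothed_prob(count, class_total, vocab_size, alpha)
  (count + alpha).to_f / (class_total + alpha * vocab_size)
end

# Made-up counts: a word never seen in a category of 1000 word
# occurrences, with a vocabulary of 5000 distinct words.
[1, 0.5, 0.1, 0.01, 0.001].each do |alpha|
  unseen = smoothed_prob(0, 1000, 5000, alpha)
  puts format("alpha=%-5s P(unseen word | category)=%.8f", alpha, unseen)
end
```

Smaller alpha values give less probability mass to words never seen in a category, so the observed counts dominate the decision.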
25. 4/18
Web
•
•
• N-Gram
unigram bigram
63.57% 66.09%
26. Web
•
•
•
+
56.8% 65.38%
27. 4/18
Web
•
•
•
67.02% 67.88%
28. •
•
• HBase/Mahout
•
• HBase