The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this talk, we’ll discuss the evolution of our infrastructure and the development of capabilities for data mining on “big data”. We’ll share our experiences as a case study, but make recommendations for best practices and point out opportunities for future work.
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Scaling Big Data Mining Infrastructure Twitter Experience
1. Scaling Big Data
Mining Infrastructure
Jimmy Lin Dmitriy Ryaboy
@lintool @squarecog
Hadoop Summit Europe
Thursday, March 21, 2013
2. From the Ivory Tower…
Source: Wikipedia (All Souls College, Oxford)
3. … to building sh*t that works.
Source: Wikipedia (Factory)
4. IMHO
Represents personal opinion. Not official position of Twitter.
Management not responsible for misuse. Void where prohibited. YMMV.
(If someone asks, I probably wasn’t here)
5.
6. “Yesterday”
~150 people total
~60 Hadoop nodes
~6 people use analytics stack daily
7. “Today”
~1400 people total
10s of Ks of Hadoop nodes, multiple DCs
10s of PBs total Hadoop DW capacity
~100 TB ingest daily
dozens of teams use Hadoop daily
10s of Ks of Hadoop jobs daily
11. It’s impossible to overstress this: 80%
of the work in any data project is in
cleaning the data. – DJ Patil ―Data
Jujitsu‖
Source: Wikipedia (Jujitsu)
12. Reality
Your boss says something vague
You think very hard on how to move the needle
Where’s the data?
What’s in this dataset?
What’s all the f#$!* crap in the data?
Clean the data
Run some off-the-shelf data mining algorithm
…
Productionize, act on the insight
Rinse, repeat
13. Data science is less
glamorous that you think!
How do we make data scientists’ lives a bit easier?
24. JSON to the Rescue!
Source: http://www.flickr.com/photos/snukkel/3206405352/
25. This should really be a list…
Remember the camelSnake!
{
"token": 945842,
"feature_enabled": "super_special",
"userid": 229922,
"page": "null", Is this really an integer?
"info": { "email": "my@place.com" }
}
Is this really null?
What keys? What values?
This does not scale.
26. struct MessageInfo {
1: optional string name
2: optional string email
}
struct LogMessage {
1: required i64 token
2: required string user_id
3: optional list<Feature> enabled_features
4: optional i64 page = 0
5: optional MessageInfo info
}
+ DDL provides type safety
enum Feature {
super_special, + Auto codgen
less_special
} + Efficient serialization
+ Sane schema migration
+ Separate logical from
physical
Use Thrift. or Protobufs.or Avro.
27. Schemas aren’t enough!
We need a data discovery service!
Where’s the data?
How do I read it?
Who produces it?
Who consumes it?
When was it last generated?
…
28. Where to find data?
Old way:
Hard-coded partitioning scheme, path, format
A = LOAD ‘/tables/statuses/2011/01/{05,06,07}/*.lzo’
USING LzoStatusProtobufBlockPigLoader();
Custom loader
How do people know? 1.) Ask around 2.) Cargo-cult
New way:
Nice UI for browsing
A = LOAD ‘tables.statuses’
USING TwadoopLoader(); Same loader each time
B = FILTER A BY year == ‘2011’
AND month == ‘11’
AND day == ‘01’
AND hour >= ’05’
AND hour <= ‘07’;
Filters are pushed into the loader.
No need to understand partitioning
scheme.
29. Data Access Layer (DAL)
―All problems in computer science can be solved by another level
of indirection... Except for the problem of too many layers of
indirection.‖ – David Wheeler
All data accesses go through DAL
Thin layer on top of HCatalog
30. Data Access Layer (DAL)
Who wrote what data, when?
#win
Automatically construct data/job dependency graph
Automatically figure out ownership
Hooks into alerting, auditing, repos,
deploy systems, etc.
31. Plumbing
Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter.
SIGMOD 2012.
Source: Wikipedia (Plumbing)
35. upload results
data munging
Joining multiple dataset
Feature extraction
… download
down-sample test data
train predict
36. What doesn’t work…
1. Down-sampling for training on single-processor
Defeats the whole point of big data!
2. Ad hoc productionizing
Disconnected from rest of production workflow
37. Production considerations:
dependency management
scheduling
resource allocation
monitoring
error reporting
alerting
…
We need…
40. Stochastic Gradient Descent
Conceptually, classifier training is a like
user-defined aggregate function!
AVG SGD
initialize sum = 0 initialize
count = 0 weights
update add to sum
increment
count
terminate return sum / count return weights
41. previous Pig dataflow previous Pig dataflow
map
Classifier
Training
reduce
label, feature vector
Pig storage
function
model model model
feature vector feature vector
Making model UDF model UDF
Predictions prediction prediction
Just like any other parallel Pig dataflow
42. It’s just Pig!
For ―free‖: dependency management,
scheduling, resource allocation,
monitoring, error reporting, alerting, …
44. Takeaway messages
How do we make data scientists’ lives a bit easier?
Adding a bit of structure goes a long way
Getting the plumbing right makes all the difference
―In theory, there is no difference between
theory and practice. But, in practice, there
is.‖
Questions?
- Jan L.A. van de Snepscheut