Scaling Big Data Mining Infrastructure Twitter Experience

Scaling Big Data
Mining Infrastructure

Jimmy Lin Dmitriy Ryaboy
@lintool @squarecog

Hadoop Summit Europe
Thursday, March 21, 2013

From the Ivory Tower…

Source: Wikipedia (All Souls College, Oxford)

… to building sh*t that works.
Source: Wikipedia (Factory)

IMHO
Represents personal opinion. Not official position of Twitter.
Management not responsible for misuse. Void where prohibited. YMMV.

(If someone asks, I probably wasn’t here)

“Yesterday”

~150 people total
~60 Hadoop nodes
~6 people use analytics stack daily

“Today”

~1400 people total
10s of Ks of Hadoop nodes, multiple DCs
10s of PBs total Hadoop DW capacity
~100 TB ingest daily
dozens of teams use Hadoop daily
10s of Ks of Hadoop jobs daily

Why?

data → actionable insights
data → data products

Big data mining
Cool, you get to work on new algorithms!

No, not really…

Big data mining isn’t mainly
about data mining per se!

It’s impossible to overstress this: 80%
of the work in any data project is in
cleaning the data. – DJ Patil, “Data Jujitsu”

Source: Wikipedia (Jujitsu)

Reality
Your boss says something vague
You think very hard on how to move the needle
Where’s the data?
What’s in this dataset?
What’s all the f#$!* crap in the data?
Clean the data
Run some off-the-shelf data mining algorithm
…
Productionize, act on the insight
Rinse, repeat

Data science is less
glamorous than you think!

How do we make data scientists’ lives a bit easier?

Gathering
Source: Wikipedia (Logging)

Moving

Source: Wikipedia (Timber rafting)

Organizing

Source: Wikipedia (Logging)

Log directly into a database!

Source: http://www.flickr.com/photos/snukkel/3206405352/

create table `my_audit_log` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`created_at` datetime,
`user_id` int(11),
`action` varchar(256),
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Don’t do this!
— Workload mismatch
— Scaling challenges
— Overkill
— Schema changes

[Diagram: in each datacenter, Scribe daemons on production hosts feed Scribe aggregators, which write to HDFS on a staging Hadoop cluster. The main datacenter additionally hosts the main Hadoop data warehouse (DW).]

Use Scribe. Or Flume. Or Kafka.

Scribe solves log transport only…

System.out.println
LOG.info

^(\w+\s+\d+\s+\d+:\d+:\d+)\s+
([^@]+?)@(\S+)\s+(\S+):\s+(\S+)\s+(\S+)
\s+((?:\S+?,\s+)*(?:\S+?))\s+(\S+)\s+(\S+)
\s+\[([^\]]+)\]\s+"(\w+)\s+([^"\\]*
(?:\\.[^"\\]*)*)\s+(\S+)"\s+(\S+)\s+
(\S+)\s+"([^"\\]*(?:\\.[^"\\]*)*)
"\s+"([^"\\]*(?:\\.[^"\\]*)*)"\s*
(\d*-[\d-]*)?\s*(\d+)?\s*(\d*\.[\d\.]*)?
(\s+[-\w]+)?.*$

An actual Java regular expression used to
parse log messages at Twitter circa 2010

Plain-text log messages suck
Don’t do this!
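
To make the fragility concrete, here is a minimal sketch of regex-based log parsing. The log format, field names, and sample lines are hypothetical and far simpler than the expression above; they are not Twitter's.

```python
import re

# Hypothetical plain-text format: "<date> <time> <level> user=<id> action=<name>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) (?P<level>\w+) user=(?P<user_id>\d+) action=(?P<action>\w+)$"
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match the pattern."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None

# The message format the parser was written for.
old_line = "2010-06-01 12:00:01 INFO user=229922 action=click"
# A developer later rewords the message; the parser stops matching.
new_line = "2010-06-01 12:00:01 INFO action=click by user 229922"

parsed_old = parse_line(old_line)
parsed_new = parse_line(new_line)
```

Nothing fails loudly when the message is reworded: `parsed_new` is simply `None`, so the downstream job quietly loses records unless someone counts the misses.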

userid
CamelCase

smallCamelCase

user_id

snake_case

camel_Snake

dunder__snake

Naming things is hard!

JSON to the Rescue!

Source: http://www.flickr.com/photos/snukkel/3206405352/

{
"token": 945842,
"feature_enabled": "super_special",
"userid": 229922,
"page": "null",
"info": { "email": "my@place.com" }
}

Callouts:
This should really be a list… ("feature_enabled")
Remember the camelSnake! ("userid")
Is this really an integer?
Is this really null? ("page")
What keys? What values? ("info")

This does not scale.
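
A minimal sketch of what each consumer of such a record ends up writing. The checks and the `find_problems` helper are illustrative assumptions, not code from the talk.

```python
import json

# The record from the slide, as it might arrive from a logging client.
raw = '''{
  "token": 945842,
  "feature_enabled": "super_special",
  "userid": 229922,
  "page": "null",
  "info": { "email": "my@place.com" }
}'''

record = json.loads(raw)

def find_problems(rec):
    """Collect the ambiguities a downstream consumer has to guess about."""
    problems = []
    if rec.get("page") == "null":
        problems.append("page is the string 'null', not a JSON null")
    if "user_id" not in rec and "userid" in rec:
        problems.append("user id is logged under 'userid', not 'user_id'")
    if not isinstance(rec.get("feature_enabled"), list):
        problems.append("feature_enabled is a single value, not a list")
    return problems

problems = find_problems(record)
```

The record parses without complaint, which is the problem: JSON guarantees syntax, not meaning, so every reader re-implements checks like these.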

enum Feature {
  super_special,
  less_special
}

struct MessageInfo {
  1: optional string name
  2: optional string email
}

struct LogMessage {
  1: required i64 token
  2: required string user_id
  3: optional list<Feature> enabled_features
  4: optional i64 page = 0
  5: optional MessageInfo info
}

+ DDL provides type safety
+ Auto codegen
+ Efficient serialization
+ Sane schema migration
+ Separate logical from physical

Use Thrift. Or Protobufs. Or Avro.
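
As a rough illustration of what a schema buys you, the sketch below mirrors the `LogMessage` definition with hand-written Python dataclasses. This is an assumption-laden stand-in, not Thrift-generated code: real generated classes also handle serialization and schema evolution, which this does not attempt.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Feature(Enum):
    SUPER_SPECIAL = 0
    LESS_SPECIAL = 1

@dataclass
class MessageInfo:
    name: Optional[str] = None
    email: Optional[str] = None

@dataclass
class LogMessage:
    token: int                      # required i64
    user_id: str                    # required string
    enabled_features: List[Feature] = field(default_factory=list)
    page: int = 0                   # optional, defaults to 0
    info: Optional[MessageInfo] = None

    def __post_init__(self):
        # Dataclasses do not enforce types at runtime, so check required fields by hand.
        if not isinstance(self.token, int):
            raise TypeError("token must be an integer")
        if not isinstance(self.user_id, str):
            raise TypeError("user_id must be a string")

msg = LogMessage(token=945842, user_id="229922",
                 enabled_features=[Feature.SUPER_SPECIAL],
                 info=MessageInfo(email="my@place.com"))

# A producer that sends the token as a string is rejected at construction time.
try:
    LogMessage(token="945842", user_id="229922")
    rejected = False
except TypeError:
    rejected = True
```

The questions raised by the JSON record (is it a list, is it an integer, is it null) are answered once in the schema instead of in every consumer.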

Schemas aren’t enough!

We need a data discovery service!
Where’s the data?
How do I read it?
Who produces it?
Who consumes it?
When was it last generated?
…

Where to find data?

Old way:
A = LOAD '/tables/statuses/2011/01/{05,06,07}/*.lzo'
    USING LzoStatusProtobufBlockPigLoader();

Hard-coded partitioning scheme, path, format
Custom loader

How do people know? 1.) Ask around 2.) Cargo-cult

New way:
Nice UI for browsing

A = LOAD 'tables.statuses'
    USING TwadoopLoader();

B = FILTER A BY year == '2011'
    AND month == '11'
    AND day == '01'
    AND hour >= '05'
    AND hour <= '07';

Same loader each time.
Filters are pushed into the loader.
No need to understand partitioning scheme.
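
One way to picture filter pushdown: the loader turns a logical table name plus a time range into physical paths, so the script never spells them out. The sketch below assumes a hypothetical hourly directory layout and a hypothetical `CATALOG` mapping; the actual layout and loader internals are not described in the slides.

```python
from datetime import datetime, timedelta

# Hypothetical catalog: logical table name -> physical path template.
CATALOG = {"tables.statuses": "/tables/statuses/{:%Y/%m/%d/%H}"}

def resolve_partitions(table, start, end):
    """Expand an inclusive hourly time range into physical partition paths."""
    template = CATALOG[table]
    paths = []
    hour = start
    while hour <= end:
        paths.append(template.format(hour))
        hour += timedelta(hours=1)
    return paths

# Equivalent of: year == '2011' AND month == '11' AND day == '01'
#                AND hour >= '05' AND hour <= '07'
paths = resolve_partitions(
    "tables.statuses",
    datetime(2011, 11, 1, 5),
    datetime(2011, 11, 1, 7),
)
```

If the partitioning scheme changes, only the catalog entry changes; scripts that filter on year, month, day, and hour keep working.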

Data Access Layer (DAL)
“All problems in computer science can be solved by another level
of indirection... Except for the problem of too many layers of
indirection.” – David Wheeler

All data accesses go through DAL
Thin layer on top of HCatalog

Data Access Layer (DAL)
Who wrote what data, when?

#win
Automatically construct data/job dependency graph
Automatically figure out ownership

Hooks into alerting, auditing, repos,
deploy systems, etc.
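
Once every read and write goes through one layer, the dependency graph falls out of the access records. A minimal sketch, assuming a hypothetical record shape of (job, mode, dataset) and made-up job and table names:

```python
from collections import defaultdict

# Hypothetical access records a data access layer could capture.
ACCESS_LOG = [
    ("ingest_statuses", "write", "tables.statuses"),
    ("sessionize", "read", "tables.statuses"),
    ("sessionize", "write", "tables.sessions"),
    ("daily_report", "read", "tables.sessions"),
]

def build_dependencies(access_log):
    """Return (producer jobs per dataset, upstream jobs per job)."""
    producers = defaultdict(set)
    for job, mode, dataset in access_log:
        if mode == "write":
            producers[dataset].add(job)
    upstream = defaultdict(set)
    for job, mode, dataset in access_log:
        if mode == "read":
            # A job depends on whichever jobs wrote the data it reads.
            upstream[job] |= producers[dataset]
    return dict(producers), dict(upstream)

producers, upstream = build_dependencies(ACCESS_LOG)
```

Here `producers` answers "who produces it?" and `upstream` gives the job-to-job edges that alerting or ownership tooling could walk.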

Plumbing
Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter.
SIGMOD 2012.
Source: Wikipedia (Plumbing)

Classification

Source: Wikipedia (Sorting)

[Diagram: Training: given (label, feature vector) examples, a learner induces a classifier. Predicting: the classifier assigns a label (?) to a feature vector whose label is unknown.]

Stone age machine learning…

Source: Wikipedia (Stonehenge)

[Diagram: data munging (joining multiple datasets, feature extraction, …) → down-sample → download → train → predict on test data → upload results]

What doesn’t work…
1. Down-sampling for training on a single processor
   → Defeats the whole point of big data!
2. Ad hoc productionizing
   → Disconnected from rest of production workflow

Production considerations:
dependency management
scheduling
resource allocation
monitoring
error reporting
alerting
…

We need…

Seamless scaling

Source: Wikipedia (Galaxy)

Integration with production workflows

Source: Wikipedia (Oil refinery)

Stochastic Gradient Descent
Conceptually, classifier training is like a
user-defined aggregate function!

             AVG                     SGD
initialize   sum = 0, count = 0      initialize weights
update       add to sum,             [weight update equation]
             increment count
terminate    return sum / count      return weights
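
The analogy can be sketched with two classes that share the same initialize / update / terminate shape. The choice of logistic regression, the learning rate, and the toy data are assumptions made for illustration; the slide does not specify the loss or the update rule.

```python
import math

class Avg:
    """AVG as an aggregate: initialize, update per record, terminate."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    def terminate(self):
        return self.total / self.count

class SgdLogistic:
    """Logistic regression trained by SGD, with the same three-step shape."""
    def __init__(self, num_features, learning_rate=0.1):
        self.weights = [0.0] * num_features
        self.learning_rate = learning_rate

    def update(self, example):
        label, features = example  # label is 0 or 1
        score = sum(w * x for w, x in zip(self.weights, features))
        predicted = 1.0 / (1.0 + math.exp(-score))
        step = self.learning_rate * (label - predicted)
        self.weights = [w + step * x for w, x in zip(self.weights, features)]

    def terminate(self):
        return self.weights

def aggregate(agg, records):
    """Stream records through an aggregate, one update per record."""
    for record in records:
        agg.update(record)
    return agg.terminate()

mean = aggregate(Avg(), [1.0, 2.0, 3.0])

# Toy data: label is 1 when the first feature is positive; the second feature is a bias term.
examples = [(1, [1.0, 1.0]), (0, [-1.0, 1.0]), (1, [2.0, 1.0]), (0, [-2.0, 1.0])] * 50
weights = aggregate(SgdLogistic(num_features=2), examples)
```

Both aggregates see one record at a time and keep only a small state, which is why training can sit inside a dataflow the same way a built-in aggregate does.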

[Diagram: Classifier training. A previous Pig dataflow emits (label, feature vector) records; map and reduce stages feed a Pig storage function, which writes out the models.]

[Diagram: Making predictions. A previous Pig dataflow emits feature vectors; a model UDF turns each into a prediction.]
Just like any other parallel Pig dataflow
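
The prediction side can be pictured as an ordinary per-row function. In this sketch the weights are hypothetical stand-ins for a model loaded from storage, and a list comprehension stands in for the parallel dataflow; it is not the Pig UDF from the talk.

```python
import math

def make_model_udf(weights):
    """Wrap a trained weight vector as a function from feature vector to prediction."""
    def predict(features):
        score = sum(w * x for w, x in zip(weights, features))
        return 1.0 / (1.0 + math.exp(-score))
    return predict

# Hypothetical weights, standing in for a model loaded from storage.
classify = make_model_udf([2.0, 0.0])

# Each row is a feature vector; applying the UDF row by row is trivially parallel.
rows = [[1.5, 1.0], [-1.5, 1.0], [0.0, 1.0]]
predictions = [classify(features) for features in rows]
```

Because the function depends only on its input row and the fixed weights, it needs no coordination between rows.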

It’s just Pig!

For “free”: dependency management,
scheduling, resource allocation,
monitoring, error reporting, alerting, …

Source: Wikipedia (Road surface)

Takeaway messages
How do we make data scientists’ lives a bit easier?

Adding a bit of structure goes a long way
Getting the plumbing right makes all the difference

“In theory, there is no difference between
theory and practice. But, in practice, there
is.”
- Jan L.A. van de Snepscheut

Questions?
