jstein.cassandra.nyc.2011
1. Cassandra as the central nervous system of your distributed systems
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
@allthingshadoop
@cassandranosql
@allthingsscala
@charmalloc
*/
http://www.medialets.com
4. Medialets
• Largest deployment of rich media ads for mobile devices
• Over 300,000,000 devices supported
• 3-4 TB of new data every day
• Thousands of services in production
• Hundreds of thousands of events received every second
• Response times are measured in microseconds
• Languages
– 35% JVM (20% Scala & 10% Java)
– 30% Ruby
– 20% C/C++
– 13% Python
– 2% Bash
5. The million foot view
[Architecture diagram. Components shown: AdServing, Collection, Kafka, mysql, Hadoop, Cassandra, Muse]
7. Let's look at just one data point captured
• 09/10/2011 11:12:13
• App = Yahoo!
• Platform = iOS
• OS = 4.3.4
• Device = iPad2,1
• Resolution = 768x1024
• Events
– videoPlayPercent = 38
– Taste = great
8. The time series part of it
• 09/10/2011 11:12:13
Quarter Q3
Month 201109
Week 201136
Day 20110910
Hour 2011091011
Minute 201109101112
Second 20110910111213
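As a rough sketch of how these bucket keys could be derived from one timestamp, the Scala below uses java.time. The object and method names are hypothetical, and ISO week numbering is an assumption (the slide does not say which week rule produced 201136, although ISO numbering does give that value for this date).

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.IsoFields

// Hypothetical helper, not part of the original deck.
object TimeBuckets {
  private def fmt(pattern: String) = DateTimeFormatter.ofPattern(pattern)

  // Returns (bucket name, key) pairs for one timestamp, coarsest first.
  def keys(ts: LocalDateTime): Seq[(String, String)] = {
    val quarter = "Q" + ts.get(IsoFields.QUARTER_OF_YEAR)
    // Assumption: ISO week-based year and week number.
    val week = f"${ts.get(IsoFields.WEEK_BASED_YEAR)}%04d${ts.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR)}%02d"
    Seq(
      "Quarter" -> quarter,
      "Month"   -> ts.format(fmt("yyyyMM")),
      "Week"    -> week,
      "Day"     -> ts.format(fmt("yyyyMMdd")),
      "Hour"    -> ts.format(fmt("yyyyMMddHH")),
      "Minute"  -> ts.format(fmt("yyyyMMddHHmm")),
      "Second"  -> ts.format(fmt("yyyyMMddHHmmss"))
    )
  }
}
```

Calling TimeBuckets.keys(LocalDateTime.of(2011, 9, 10, 11, 12, 13)) yields one key per granularity, each usable as a row key for that time period.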
9. Metrics For Different Wants
Yahoo! + iOS + 4.3.4 + iPad2,1 + 768x1024
Yahoo! + videoPlayPercent = 30 + Taste = great
Yahoo! + Taste = great
Yahoo! + videoPlayPercent = 30
iPad2,1 + videoPlayPercent = 30 + Taste = great
768x1024 + videoPlayPercent = 30 + Taste = great
iOS + 4.3.4 + iPad2,1
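One way to picture these combinations is as a fixed list of dimension sets, each turned into a counter column name for every event. The sketch below is illustrative only: the names are hypothetical, and the column-name layout (dimension names joined by "+", then "#", then the values joined by "+") is an assumption loosely modeled on the Skeletor slide later in this deck.

```scala
// Hypothetical sketch, not the production code.
object AggregateColumns {
  // One captured data point, flattened to dimension -> value.
  val event: Map[String, String] = Map(
    "app" -> "Yahoo!", "platform" -> "iOS", "osversion" -> "4.3.4",
    "device" -> "iPad2,1", "resolution" -> "768x1024"
  )

  // Assumed layout: "dim1+dim2#value1+value2".
  def columnName(dims: Seq[String], e: Map[String, String]): String =
    dims.mkString("+") + "#" + dims.map(e).mkString("+")

  // The dimension combinations chosen up front to be counted.
  val wanted: Seq[Seq[String]] = Seq(
    Seq("app", "platform", "osversion", "device", "resolution"),
    Seq("platform", "osversion", "device"),
    Seq("app")
  )

  // Every counter column that one event would increment.
  def columnsFor(e: Map[String, String]): Seq[String] =
    wanted.map(columnName(_, e))
}
```

Each event then increments one counter per wanted combination, which is why the set of metrics has to be decided before the data is written.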
11. Storing the time series
Column families hold your rows of data. The row key in each column family is the time period you are dealing with, so an “event” occurring at 09/10/2011 12:13:14 will become 4 rows:
BySecond = 20110910121314
ByMinute = 201109101213
ByHour = 2011091012
ByDay = 20110910

CREATE COLUMN FAMILY ByDay
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByHour
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByMinute
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY BySecond
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;
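To make the "one event becomes 4 rows" idea concrete, here is a small in-memory model in Scala. It is only a sketch: nested mutable maps stand in for Cassandra counter columns, all names are hypothetical, and the timestamp is assumed to arrive as a 14-digit yyyyMMddHHmmss string so that each row key is a prefix of it.

```scala
import scala.collection.mutable

// In-memory stand-in for the four counter column families; not a Cassandra client.
object CounterModel {
  // column family -> row key -> column name -> count
  type Store = mutable.Map[String, mutable.Map[String, mutable.Map[String, Long]]]

  def newStore(): Store = mutable.Map.empty

  // Assumes ts is yyyyMMddHHmmss; each row key is a prefix of it.
  def rowKeys(ts: String): Seq[(String, String)] = Seq(
    "BySecond" -> ts.take(14),
    "ByMinute" -> ts.take(12),
    "ByHour"   -> ts.take(10),
    "ByDay"    -> ts.take(8)
  )

  // Increment the same column in one row of every column family.
  def record(store: Store, ts: String, column: String): Unit =
    for ((cf, key) <- rowKeys(ts)) {
      val row = store
        .getOrElseUpdate(cf, mutable.Map.empty)
        .getOrElseUpdate(key, mutable.Map.empty)
      row(column) = row.getOrElse(column, 0L) + 1L
    }
}
```

Two events in the same minute land in two different BySecond rows but increment the same column in a single ByMinute, ByHour, and ByDay row, so coarser periods can be read without summing finer ones.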
17. Inserting data with Skeletor
Skeletor is the Scala wrapper of Hector for Cassandra
https://github.com/joestein/skeletor
aggregateColumnNames("AppPlatformOSVersionDeviceResolution") =
  "app+platform+osversion+device+resolution#"

def ccAppPlatformOSVersionDeviceResolution(c: (String) => Unit) = {
  c(aggregateColumnNames("AppPlatformOSVersionDeviceResolution") + app + p(platform) + p(osversion) +
    p(device) + p(resolution))
}

//rows we are going to write to
aggregateKeys(KEYSPACE \ "ByMonth") = month //201109
aggregateKeys(KEYSPACE \ "ByDay") = day //20110910
aggregateKeys(KEYSPACE \ "ByHour") = hour //2011091012
aggregateKeys(KEYSPACE \ "ByMinute") = minute //201109101213

def r(columnName: String): Unit = {
  aggregateKeys.foreach { tuple: (ColumnFamily, String) =>
    val (columnFamily, row) = tuple
    if (row != null && row.size > 0)
      rows add (columnFamily -> row has columnName inc) //increment the counter
  }
}

ccAppPlatformOSVersionDeviceResolution(r)
18. Retrieving Data
MultigetSliceCounterQuery
• setColumnFamily("ByDay")
• setKeys("20110910")
• setRange("app+event1=","app+event1=~",false,1000)
• We will get all the apps and counts for event1
• setRange("app+event2=","app+event2=~",false,1000)
• We will get all the apps and the counts for event2
By app tastes great vs less filling
• Sample code for the aggregate metrics and retrieving them
https://github.com/joestein/apophis
• What is with the tilde?
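A likely reading of the tilde: column names are compared as UTF8 strings, and "~" (0x7E) sorts after the other printable ASCII characters, so a slice from "app+event1=" to "app+event1=~" covers every column whose name starts with that prefix. The Scala sketch below imitates such an inclusive slice over sorted column names in memory. The sample data and all names are made up for illustration, and it does not call Hector.

```scala
// In-memory imitation of a column slice; hypothetical names and sample data.
object TildeRange {
  // Columns of one ByDay row: column name -> counter value.
  val row: Map[String, Long] = Map(
    "app+event1=Yahoo!" -> 12L,
    "app+event1=Zebra"  -> 3L,
    "app+event2=Yahoo!" -> 7L,
    "app=Yahoo!"        -> 40L
  )

  // Inclusive slice over column names in sorted order (start <= name <= finish).
  def slice(cols: Map[String, Long], start: String, finish: String): Seq[(String, Long)] =
    cols.toSeq.sortBy(_._1).filter { case (name, _) =>
      name.compareTo(start) >= 0 && name.compareTo(finish) <= 0
    }
}
```

TildeRange.slice(TildeRange.row, "app+event1=", "app+event1=~") keeps only the two event1 columns. The trick relies on values starting with a character that sorts below "~"; a value beginning with a non-ASCII character would sort after the finish bound and be missed.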
20. A few more things about retrieving data
• You need to start backwards from here.
• If you want to do things ad hoc, then map/reduce is better
• Sometimes more rows are better, allowing more nodes to do work
– If you need to look at 100,000 metrics, it is better to pull them out of 100 rows than out of 1
– Don’t be afraid to make CFs and composite keys out of Time + Aggregate data
• 20111023+app=Yahoo!
• This could be the row that holds ALL of the app information for that day. If you want to look at 100 apps at once with 1000 metrics for each per time period, this could be the way to go