SlideShare a Scribd company logo
1 of 51
Algebird 
Abstract Algebra 
for 
Analytics 
Sam BESSALAH 
@samklr 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop @samklr
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
Abstract Algebra 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
From WikiPedia 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
Algebraic Structure 
“ Set of values, coupled with one or 
more finite operations,and a set of 
laws those operations must obey. “ 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Algebraic Structure 
“ Set of values, coupled with one or more 
finite operations, and a set of laws those 
operations must obey. “ 
e.g Sum, Magma, Semigroup, Groups, Monoid, 
Abelian Group, Semi Lattices, Rings, Monads, 
etc. 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Semigroup 
Semigroup Law : 
(x <> y) <> z = x <> (y <> z) 
(associativity) 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Semigroup 
Semigroup Law : 
(x <> y) <> z = x <> (y <> z) 
(associativity) 
trait Semigroup[T] { 
def aggregate(x : T, y : T) : T 
} 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Monoids 
Monoid Laws : 
(x <> y) <> z = x <> (y <> z) 
(associativity) 
identity <> x = x 
x <> identity = x 
(identity) 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Monoids 
Monoid Laws : 
(x <> y) <> z = x <> (y <> z) 
(associativity) 
identity <> x = x 
x <> identity = x 
(identiy / zero) 
trait Monoid[T] { 
def identity : T 
def aggregate (x, y) : T 
} 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Monoids 
Monoid Laws : 
(x <> y) <> z = x <> (y <> z) 
(associativity) 
identity <> x = x 
x <> identity = x 
trait Monoid[T] extends Semigroup[T]{ 
def identity : T 
} 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Groups 
Group Laws: 
(x <> y) <> z = x <> (y <> z) 
(associativity) 
identity <> x = x 
x <> identity = x 
(identity) 
x <> inverse x = identity 
inverse x <> x = identity 
(invertibility) 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Groups 
Group Laws 
(x <> y) <> z = x <> (y <> z) 
identity <> x = x 
x <> identity = x 
x <> inverse x = identity 
inverse x <> x = identity 
trait Group[T] extends Monoid[T]{ 
def inverse (v : T) :T 
} 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Many More 
- Abelian groups (Commutative Sets) 
- Rings 
- Semi Lattices 
- Ordered Semigroups 
- Fields .. 
Many of those are in Algebird …. 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Examples 
- (a min b) min c = a (b min c) with Int. 
- a max ( b max c) = (a max b) max c ** 
- a or (b or c) = (a or b) or c 
- a and (b and c) = (a and b) and c 
- int addition 
- set union 
- harmonic sum 
- Integer mean 
- Priority queue 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Why do we need those algebraic 
structures ? 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
We want to : 
- Build scalable analytics systems 
- Leverage distributed computing to perform aggregation 
on really large data sets. 
- A lot of operations in analytics are just sorting and 
counting at the end of the day 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Distributed Computing → Parallellism 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Distributed Computing → Parallellism 
Associativity → enables parallelism 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Distributed Computing → Parallellism 
Associativity enables parallelism 
Identity means we can ignore some data 
Commutativity helps us ignore order 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Typical Map Reduce ... 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Finding Top-K Elements in Scalding ... 
class TopKJob(args : Args) extends Job (args) { 
Tsv ( args(‘input’), visitScheme) 
.filter (. ..) 
.leftJoinWithTiny ( … ) 
.filter ( … ) 
.groupBy( ‘fieldOne) 
{ _.sortWithTake (visitScheme -> top } 
(biggerSale) 
.write(Tsv(...) ) 
} 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
.sortWithTake( … ) 
Looking into .sortWithTake in Scalding, there’s one 
nice thing : 
class PiorityQueueMonoid[T] (max : Int) 
(implicit order : Ordering[T] ) 
extends Monoid[Priorityqueue[T] ] 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
class PiorityQueueMonoid[T] (max : Int) 
(implicit order : Ordering[T] ) 
extends Monoid[Priorityqueue[T] ] 
Let’s take a look : 
PQ1 : 55, 45, 21, 3 
PQ2: 100, 80, 40, 3 
top-4 (PQ1 U PQ2 ): 100, 80, 55, 45 
Priority Queue : 
Can be empty 
Two Priority Queues can be “added” in any order 
Associative + Commutative 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
class PiorityQueueMonoid[T] (max : Int) 
(implicit order : Ordering[T] ) 
extends Monoid[Priorityqueue[T] ] 
Let’s take a look : 
PQ1 : 55, 45, 21, 3 
PQ2: 100, 80, 40, 3 
top-4 (PQ1 U PQ2 ): 100, 80, 55, 45 
Priority Queue : 
Makes Scalding go fast, 
by doing sorting, 
filtering and extracting 
in one single “map” 
step. 
Can be empty 
Two Priority Queues can be “added” in any order 
Associative + Commutative 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Stream Mining Challenges 
- Update predictions after each observation 
- Single pass : can’t read old data or replay 
the stream 
- Full size of the stream often unknown 
- Limited time for computation per 
observation 
- O(1) memory size 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Stream Mining Challenges 
http://radar.oreilly.com/2013/10/stream-mining-essentials.html 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Tradeoff : Space and speed over 
accuracy. 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Tradeoff : Space and speed over 
accuracy. 
use sketches. 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Sketches 
Probabilistic data structures that store a summary 
(hashed mostly)of a data set that would be costly to 
store in its entirety, thus providing most of the 
time, sublinear algorithmic properties. 
E.g Bloom Filters, Counter Sketch, KMV counters, 
Count Min Sketch, HyperLogLog, Min Hashes 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Bloom filters 
Approximate data structure for set membership 
Behaves like an approximate set 
BloomFilter.contains(x) => NO | Maybe 
P(False Positive) > 0 
P(False Negative) = 0 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Internally : 
Bit Array of fixed size 
add(x) : for all element i, b[h(x,i)]=1 
contains(x) : TRUE if b[h(x,i)] = = 1 for all i. 
(Boolean AND => associative) 
Both are associative => BF can be designed as a Monoid 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Bloom filters 
import com.twitter.algebird._ 
import com.twitter.algebird.Operators._ 
// generate 2 lists 
val A = (1 to 300).toList 
// Generate a Bloomfilter 
val NUM_HASHES = 6 
val WIDTH = 6000 // bits 
val SEED = 1 
implicit val bfm = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED) 
// approximate set with bloomfilter 
val A_bf = A.map{i => bfm.create(i.toString)}.reduce(_ + _) 
val approxBool = A_bf.contains(“150”) ---> ApproximateBoolean(true, 0.9995…) 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Count Min Sketch 
Gives an approximation of the number of occurrences of an 
element in a set. 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Count Min Sketch 
Count min sketch 
Adding an element is a numerical addition 
Querying uses a MIN function. 
Both are associative. 
useful for detecting heavy hitters, topK, LSH 
We have in Algebird : 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
HyperLogLog 
Popular sketch for cardinality estimtion. 
Gives within a probilistic distribution of an error 
the number of distinct values in a data set. 
HLL.size = Approx[Number] 
Intuition 
Long runs of trailings 0 in a random bits 
chain are rare 
But the more bit chains you look at, the more 
likely you are to find a long one 
The longest run of trailing 0-bits seen can be 
an estimator of the number of unique bit chains 
observed. 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Adding an element uses a Max and Sum function. 
Both are associative and Monoids. (Max is an 
ordered 
semigroup in Algebird really) 
Querying for an element uses an harmonic mean 
which is a Monoid. 
In Algebird : 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Many More juicy sketches ... 
- MinHashes to compute Jaccard similarity 
- QTree for quantiles estimation. Neat for anomaly 
detection. 
- SpaceSaverMonoid, Awesome to find the approximate 
most frequent and top K elements. 
- TopKMonoid 
- SGD, PriorityQueues, Histograms, etc. 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
SummingBird : Lamba in a box 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Heard of Lambda Architecture ? 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
SummingBird 
Same code for both batch and real time processing. 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
SummingBird 
Same code, for both batch and real time processing. 
But works only on Monoids. 
Uses Storehaus, as a mergeable store layer. 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
http://github.com/twitter/algebird 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
http://github.com/twitter/algebird 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr 
These slides : 
http://bit.ly/1szncAZ 
http://slidesha.re/1zhhXKU
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Links 
-Algebra for analytics by Oscar Boykin (Creator of Algebird) 
http://speakerdeck.com/johnynek/algebra-for-analytics 
- Take a look into HLearn https://github.com/mikeizbicki/HLearn 
- Great intro into Algebird by Michael Noll 
http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad- 
for-large-scala-data-analytics/ 
-Aggregate Knowledge http://research.neustar.biz/2012/10/25/sketch-of- 
the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure 
- Probabilistic data structures for web analytics. 
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures- 
web-analytics-data-mining/ 
- http://debasishg.blogspot.fr/2014/01/count-min-sketch-data-structure- 
for.html 
- http://infolab.stanford.edu/~ullman/mmds/ch3.pdf 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr

More Related Content

What's hot

Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Spark Summit
 
Developing distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka ClusterDeveloping distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka ClusterKonstantin Tsykulenko
 
Asynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka StreamsAsynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka StreamsJohan Andrén
 
Reactive Streams / Akka Streams - GeeCON Prague 2014
Reactive Streams / Akka Streams - GeeCON Prague 2014Reactive Streams / Akka Streams - GeeCON Prague 2014
Reactive Streams / Akka Streams - GeeCON Prague 2014Konrad Malawski
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on AndroidTomáš Kypta
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingToni Cebrián
 
Scalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/CascadingScalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/Cascadingjohnynek
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
Akka Actor presentation
Akka Actor presentationAkka Actor presentation
Akka Actor presentationGene Chang
 
whats new in java 8
whats new in java 8 whats new in java 8
whats new in java 8 Dori Waldman
 
Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015Holden Karau
 
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...CloudxLab
 
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka StreamsFresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka StreamsKonrad Malawski
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Async - react, don't wait - PingConf
Async - react, don't wait - PingConfAsync - react, don't wait - PingConf
Async - react, don't wait - PingConfJohan Andrén
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraPiotr Kolaczkowski
 
Norikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In RubyNorikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In RubySATOSHI TAGOMORI
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰Wayne Chen
 

What's hot (20)

Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
 
Developing distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka ClusterDeveloping distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka Cluster
 
Asynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka StreamsAsynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka Streams
 
mesos-devoxx14
mesos-devoxx14mesos-devoxx14
mesos-devoxx14
 
Reactive Streams / Akka Streams - GeeCON Prague 2014
Reactive Streams / Akka Streams - GeeCON Prague 2014Reactive Streams / Akka Streams - GeeCON Prague 2014
Reactive Streams / Akka Streams - GeeCON Prague 2014
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on Android
 
XML-Motor
XML-MotorXML-Motor
XML-Motor
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using Scalding
 
Scalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/CascadingScalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/Cascading
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
Akka Actor presentation
Akka Actor presentationAkka Actor presentation
Akka Actor presentation
 
whats new in java 8
whats new in java 8 whats new in java 8
whats new in java 8
 
Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015
 
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
 
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka StreamsFresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Async - react, don't wait - PingConf
Async - react, don't wait - PingConfAsync - react, don't wait - PingConf
Async - react, don't wait - PingConf
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
 
Norikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In RubyNorikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In Ruby
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
 

Viewers also liked

Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In ProductionSamir Bessalah
 
A Study to Design and Implement a Manual for the Learning Process of Technica...
A Study to Design and Implement a Manual for the Learning Process of Technica...A Study to Design and Implement a Manual for the Learning Process of Technica...
A Study to Design and Implement a Manual for the Learning Process of Technica...UNIVERSIDAD MAGISTER (Sitio Oficial)
 
An application of abstract algebra to music theory
An application of abstract algebra to music theoryAn application of abstract algebra to music theory
An application of abstract algebra to music theorymorkir
 
Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015Samir Bessalah
 
Definition ofvectorspace
Definition ofvectorspaceDefinition ofvectorspace
Definition ofvectorspaceTanuj Parikh
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsTuri, Inc.
 
Snapdragon processors
Snapdragon processorsSnapdragon processors
Snapdragon processorsDeepak Mathew
 
Kill the mutants - A better way to test your tests
Kill the mutants - A better way to test your testsKill the mutants - A better way to test your tests
Kill the mutants - A better way to test your testsRoy van Rijn
 
Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?Marina Santini
 
Machine Learning on Big Data
Machine Learning on Big DataMachine Learning on Big Data
Machine Learning on Big DataMax Lin
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsDavid Pittman
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsNhatHai Phan
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The PeopleDaniel Tunkelang
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningLars Marius Garshol
 

Viewers also liked (19)

Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
algebraic-geometry
algebraic-geometryalgebraic-geometry
algebraic-geometry
 
A Study to Design and Implement a Manual for the Learning Process of Technica...
A Study to Design and Implement a Manual for the Learning Process of Technica...A Study to Design and Implement a Manual for the Learning Process of Technica...
A Study to Design and Implement a Manual for the Learning Process of Technica...
 
An application of abstract algebra to music theory
An application of abstract algebra to music theoryAn application of abstract algebra to music theory
An application of abstract algebra to music theory
 
Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015
 
Information Security Seminar #2
Information Security Seminar #2Information Security Seminar #2
Information Security Seminar #2
 
Definition ofvectorspace
Definition ofvectorspaceDefinition ofvectorspace
Definition ofvectorspace
 
Snapdragon Processor
Snapdragon ProcessorSnapdragon Processor
Snapdragon Processor
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning Models
 
Snapdragon processors
Snapdragon processorsSnapdragon processors
Snapdragon processors
 
Kill the mutants - A better way to test your tests
Kill the mutants - A better way to test your testsKill the mutants - A better way to test your tests
Kill the mutants - A better way to test your tests
 
Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?
 
Machine Learning on Big Data
Machine Learning on Big DataMachine Learning on Big Data
Machine Learning on Big Data
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and Applications
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The People
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 
Build Features, Not Apps
Build Features, Not AppsBuild Features, Not Apps
Build Features, Not Apps
 

Similar to Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Everything is Permitted: Extending Built-ins
Everything is Permitted: Extending Built-insEverything is Permitted: Extending Built-ins
Everything is Permitted: Extending Built-insAndrew Dupont
 
Concurrent programming with Celluloid (MWRC 2012)
Concurrent programming with Celluloid (MWRC 2012)Concurrent programming with Celluloid (MWRC 2012)
Concurrent programming with Celluloid (MWRC 2012)tarcieri
 
The things we don't see – stories of Software, Scala and Akka
The things we don't see – stories of Software, Scala and AkkaThe things we don't see – stories of Software, Scala and Akka
The things we don't see – stories of Software, Scala and AkkaKonrad Malawski
 
Ruby — An introduction
Ruby — An introductionRuby — An introduction
Ruby — An introductionGonçalo Silva
 
Taxonomy of Scala
Taxonomy of ScalaTaxonomy of Scala
Taxonomy of Scalashinolajla
 
Blocks by Lachs Cox
Blocks by Lachs CoxBlocks by Lachs Cox
Blocks by Lachs Coxlachie
 
Ruby 2: some new things
Ruby 2: some new thingsRuby 2: some new things
Ruby 2: some new thingsDavid Black
 
A tour on ruby and friends
A tour on ruby and friendsA tour on ruby and friends
A tour on ruby and friends旻琦 潘
 
Pharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source SmalltalkPharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source SmalltalkSerge Stinckwich
 
Ruby 入門 第一次就上手
Ruby 入門 第一次就上手Ruby 入門 第一次就上手
Ruby 入門 第一次就上手Wen-Tien Chang
 
Spock: Test Well and Prosper
Spock: Test Well and ProsperSpock: Test Well and Prosper
Spock: Test Well and ProsperKen Kousen
 
Ruby is an Acceptable Lisp
Ruby is an Acceptable LispRuby is an Acceptable Lisp
Ruby is an Acceptable LispAstrails
 
Ruby Topic Maps Tutorial (2007-10-10)
Ruby Topic Maps Tutorial (2007-10-10)Ruby Topic Maps Tutorial (2007-10-10)
Ruby Topic Maps Tutorial (2007-10-10)Benjamin Bock
 
Introductionto fp with groovy
Introductionto fp with groovyIntroductionto fp with groovy
Introductionto fp with groovyIsuru Samaraweera
 
Code is not text! How graph technologies can help us to understand our code b...
Code is not text! How graph technologies can help us to understand our code b...Code is not text! How graph technologies can help us to understand our code b...
Code is not text! How graph technologies can help us to understand our code b...Andreas Dewes
 
BASE Meetup: "Analysing Scala Puzzlers: Essential and Accidental Complexity i...
BASE Meetup: "Analysing Scala Puzzlers: Essential and Accidental Complexity i...BASE Meetup: "Analysing Scala Puzzlers: Essential and Accidental Complexity i...
BASE Meetup: "Analysing Scala Puzzlers: Essential and Accidental Complexity i...Andrew Phillips
 
Scala Up North: "Analysing Scala Puzzlers: Essential and Accidental Complexit...
Scala Up North: "Analysing Scala Puzzlers: Essential and Accidental Complexit...Scala Up North: "Analysing Scala Puzzlers: Essential and Accidental Complexit...
Scala Up North: "Analysing Scala Puzzlers: Essential and Accidental Complexit...Andrew Phillips
 
Code for Startup MVP (Ruby on Rails) Session 2
Code for Startup MVP (Ruby on Rails) Session 2Code for Startup MVP (Ruby on Rails) Session 2
Code for Startup MVP (Ruby on Rails) Session 2Henry S
 

Similar to Algebird : Abstract Algebra for big data analytics. Devoxx 2014 (20)

Everything is Permitted: Extending Built-ins
Everything is Permitted: Extending Built-insEverything is Permitted: Extending Built-ins
Everything is Permitted: Extending Built-ins
 
Concurrent programming with Celluloid (MWRC 2012)
Concurrent programming with Celluloid (MWRC 2012)Concurrent programming with Celluloid (MWRC 2012)
Concurrent programming with Celluloid (MWRC 2012)
 
The things we don't see – stories of Software, Scala and Akka
The things we don't see – stories of Software, Scala and AkkaThe things we don't see – stories of Software, Scala and Akka
The things we don't see – stories of Software, Scala and Akka
 
Ruby — An introduction
Ruby — An introductionRuby — An introduction
Ruby — An introduction
 
Taxonomy of Scala
Taxonomy of ScalaTaxonomy of Scala
Taxonomy of Scala
 
Blocks by Lachs Cox
Blocks by Lachs CoxBlocks by Lachs Cox
Blocks by Lachs Cox
 
Ruby 2: some new things
Ruby 2: some new thingsRuby 2: some new things
Ruby 2: some new things
 
A tour on ruby and friends
A tour on ruby and friendsA tour on ruby and friends
A tour on ruby and friends
 
Pharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source SmalltalkPharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source Smalltalk
 
Ruby 入門 第一次就上手
Ruby 入門 第一次就上手Ruby 入門 第一次就上手
Ruby 入門 第一次就上手
 
Spock: Test Well and Prosper
Spock: Test Well and ProsperSpock: Test Well and Prosper
Spock: Test Well and Prosper
 
Solr Masterclass Bangkok, June 2014
Solr Masterclass Bangkok, June 2014Solr Masterclass Bangkok, June 2014
Solr Masterclass Bangkok, June 2014
 
Ruby is an Acceptable Lisp
Ruby is an Acceptable LispRuby is an Acceptable Lisp
Ruby is an Acceptable Lisp
 
Ruby Topic Maps Tutorial (2007-10-10)
Ruby Topic Maps Tutorial (2007-10-10)Ruby Topic Maps Tutorial (2007-10-10)
Ruby Topic Maps Tutorial (2007-10-10)
 
Rails by example
Rails by exampleRails by example
Rails by example
 
Introductionto fp with groovy
Introductionto fp with groovyIntroductionto fp with groovy
Introductionto fp with groovy
 
Code is not text! How graph technologies can help us to understand our code b...
Code is not text! How graph technologies can help us to understand our code b...Code is not text! How graph technologies can help us to understand our code b...
Code is not text! How graph technologies can help us to understand our code b...
 
BASE Meetup: "Analysing Scala Puzzlers: Essential and Accidental Complexity i...
BASE Meetup: "Analysing Scala Puzzlers: Essential and Accidental Complexity i...BASE Meetup: "Analysing Scala Puzzlers: Essential and Accidental Complexity i...
BASE Meetup: "Analysing Scala Puzzlers: Essential and Accidental Complexity i...
 
Scala Up North: "Analysing Scala Puzzlers: Essential and Accidental Complexit...
Scala Up North: "Analysing Scala Puzzlers: Essential and Accidental Complexit...Scala Up North: "Analysing Scala Puzzlers: Essential and Accidental Complexit...
Scala Up North: "Analysing Scala Puzzlers: Essential and Accidental Complexit...
 
Code for Startup MVP (Ruby on Rails) Session 2
Code for Startup MVP (Ruby on Rails) Session 2Code for Startup MVP (Ruby on Rails) Session 2
Code for Startup MVP (Ruby on Rails) Session 2
 

Recently uploaded

UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencessuser9e7c64
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profileakrivarotava
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 

Recently uploaded (20)

UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 

Algebird : Abstract Algebra for big data analytics. Devoxx 2014

  • 1. Algebird Abstract Algebra for Analytics Sam BESSALAH @samklr Room 4 #Devoxx #algebird #scalding #monoid #hadoop @samklr
  • 2. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 3. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 4. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 5. Abstract Algebra Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 6. From WikiPedia Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 7. Algebraic Structure “ Set of values, coupled with one or more finite operations,and a set of laws those operations must obey. “ Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 8. Algebraic Structure “ Set of values, coupled with one or more finite operations, and a set of laws those operations must obey. “ e.g Sum, Magma, Semigroup, Groups, Monoid, Abelian Group, Semi Lattices, Rings, Monads, etc. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 9. Semigroup Semigroup Law : (x <> y) <> z = x <> (y <> z) (associativity) Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 10. Semigroup Semigroup Law : (x <> y) <> z = x <> (y <> z) (associativity) trait Semigroup[T] { def aggregate(x : T, y : T) : T } Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 11. Monoids Monoid Laws : (x <> y) <> z = x <> (y <> z) (associativity) identity <> x = x x <> identity = x (identity) Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 12. Monoids Monoid Laws : (x <> y) <> z = x <> (y <> z) (associativity) identity <> x = x x <> identity = x (identiy / zero) trait Monoid[T] { def identity : T def aggregate (x, y) : T } Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 13. Monoids Monoid Laws : (x <> y) <> z = x <> (y <> z) (associativity) identity <> x = x x <> identity = x trait Monoid[T] extends Semigroup[T]{ def identity : T } Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 14. Groups Group Laws: (x <> y) <> z = x <> (y <> z) (associativity) identity <> x = x x <> identity = x (identity) x <> inverse x = identity inverse x <> x = identity (invertibility) Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 15. Groups Group Laws (x <> y) <> z = x <> (y <> z) identity <> x = x x <> identity = x x <> inverse x = identity inverse x <> x = identity trait Group[T] extends Monoid[T]{ def inverse (v : T) :T } Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 16. Many More - Abelian groups (Commutative Sets) - Rings - Semi Lattices - Ordered Semigroups - Fields .. Many of those are in Algebird …. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 17. Examples - (a min b) min c = a (b min c) with Int. - a max ( b max c) = (a max b) max c ** - a or (b or c) = (a or b) or c - a and (b and c) = (a and b) and c - int addition - set union - harmonic sum - Integer mean - Priority queue Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 18. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 19. Why do we need those algebraic structures ? Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 20. We want to : - Build scalable analytics systems - Leverage distributed computing to perform aggregation on really large data sets. - A lot of operations in analytics are just sorting and counting at the end of the day Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 21. Distributed Computing → Parallellism Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 22. Distributed Computing → Parallellism Associativity → enables parallelism Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 23. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 24. Distributed Computing → Parallellism Associativity enables parallelism Identity means we can ignore some data Commutativity helps us ignore order Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 25. Typical Map Reduce ... Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 26. Finding Top-K Elements in Scalding ... class TopKJob(args : Args) extends Job (args) { Tsv ( args(‘input’), visitScheme) .filter (. ..) .leftJoinWithTiny ( … ) .filter ( … ) .groupBy( ‘fieldOne) { _.sortWithTake (visitScheme -> top } (biggerSale) .write(Tsv(...) ) } Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 27. .sortWithTake( … ) Looking into .sortWithTake in Scalding, there’s one nice thing : class PiorityQueueMonoid[T] (max : Int) (implicit order : Ordering[T] ) extends Monoid[Priorityqueue[T] ] Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 28. class PiorityQueueMonoid[T] (max : Int) (implicit order : Ordering[T] ) extends Monoid[Priorityqueue[T] ] Let’s take a look : PQ1 : 55, 45, 21, 3 PQ2: 100, 80, 40, 3 top-4 (PQ1 U PQ2 ): 100, 80, 55, 45 Priority Queue : Can be empty Two Priority Queues can be “added” in any order Associative + Commutative Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 29. class PiorityQueueMonoid[T] (max : Int) (implicit order : Ordering[T] ) extends Monoid[Priorityqueue[T] ] Let’s take a look : PQ1 : 55, 45, 21, 3 PQ2: 100, 80, 40, 3 top-4 (PQ1 U PQ2 ): 100, 80, 55, 45 Priority Queue : Makes Scalding go fast, by doing sorting, filtering and extracting in one single “map” step. Can be empty Two Priority Queues can be “added” in any order Associative + Commutative Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 30. Stream Mining Challenges - Update predictions after each observation - Single pass : can’t read old data or replay the stream - Full size of the stream often unknown - Limited time for computation per observation - O(1) memory size Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 31. Stream Mining Challenges http://radar.oreilly.com/2013/10/stream-mining-essentials.html Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 32. Tradeoff : Space and speed over accuracy. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 33. Tradeoff : Space and speed over accuracy. use sketches. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 34. Sketches Probabilistic data structures that store a summary (hashed mostly)of a data set that would be costly to store in its entirety, thus providing most of the time, sublinear algorithmic properties. E.g Bloom Filters, Counter Sketch, KMV counters, Count Min Sketch, HyperLogLog, Min Hashes #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 35. Bloom filters Approximate data structure for set membership Behaves like an approximate set BloomFilter.contains(x) => NO | Maybe P(False Positive) > 0 P(False Negative) = 0 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 36. Internally : Bit Array of fixed size add(x) : for all element i, b[h(x,i)]=1 contains(x) : TRUE if b[h(x,i)] = = 1 for all i. (Boolean AND => associative) Both are associative => BF can be designed as a Monoid #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 37. Bloom filters import com.twitter.algebird._ import com.twitter.algebird.Operators._ // generate 2 lists val A = (1 to 300).toList // Generate a Bloomfilter val NUM_HASHES = 6 val WIDTH = 6000 // bits val SEED = 1 implicit val bfm = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED) // approximate set with bloomfilter val A_bf = A.map{i => bfm.create(i.toString)}.reduce(_ + _) val approxBool = A_bf.contains(“150”) ---> ApproximateBoolean(true, 0.9995…) #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 38. Count Min Sketch Gives an approximation of the number of occurrences of an element in a set. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 39. Count Min Sketch Count min sketch Adding an element is a numerical addition Querying uses a MIN function. Both are associative. useful for detecting heavy hitters, topK, LSH We have in Algebird : #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 40. HyperLogLog Popular sketch for cardinality estimtion. Gives within a probilistic distribution of an error the number of distinct values in a data set. HLL.size = Approx[Number] Intuition Long runs of trailings 0 in a random bits chain are rare But the more bit chains you look at, the more likely you are to find a long one The longest run of trailing 0-bits seen can be an estimator of the number of unique bit chains observed. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 41. Adding an element uses a Max and Sum function. Both are associative and Monoids. (Max is an ordered semigroup in Algebird really) Querying for an element uses an harmonic mean which is a Monoid. In Algebird : #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 42. Many More juicy sketches ... - MinHashes to compute Jaccard similarity - QTree for quantiles estimation. Neat for anomaly detection. - SpaceSaverMonoid, Awesome to find the approximate most frequent and top K elements. - TopKMonoid - SGD, PriorityQueues, Histograms, etc. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 43. SummingBird : Lamba in a box #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 44. Heard of Lambda Architecture ? #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 45. SummingBird Same code for both batch and real time processing. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 46. SummingBird Same code, for both batch and real time processing. But works only on Monoids. Uses Storehaus, as a mergeable store layer. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 47. http://github.com/twitter/algebird #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 48. http://github.com/twitter/algebird #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 49. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr These slides : http://bit.ly/1szncAZ http://slidesha.re/1zhhXKU
  • 50. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 51. Links -Algebra for analytics by Oscar Boykin (Creator of Algebird) http://speakerdeck.com/johnynek/algebra-for-analytics - Take a look into HLearn https://github.com/mikeizbicki/HLearn - Great intro into Algebird by Michael Noll http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad- for-large-scala-data-analytics/ -Aggregate Knowledge http://research.neustar.biz/2012/10/25/sketch-of- the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure - Probabilistic data structures for web analytics. http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures- web-analytics-data-mining/ - http://debasishg.blogspot.fr/2014/01/count-min-sketch-data-structure- for.html - http://infolab.stanford.edu/~ullman/mmds/ch3.pdf #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr