SlideShare a Scribd company logo
1 of 83
Spark and Friends: 
Spark Streaming, 
Machine Learning, Graph Processing, 
Lambda Architecture, Approximations 
Chris Fregly 
East Bay Java User Group 
Oct 2014 
Kinesis 
Streaming
Who am I? 
Former Netflix’er 
(netflix.github.io) 
Spark Contributor 
(github.com/apache/spark) 
Founder 
(fluxcapacitor.com) 
Author 
(effectivespark.com, 
sparkinaction.com)
Spark Streaming-Kinesis Jira
Not-so-Quick Poll 
• Spark, Spark Streaming? 
• Hadoop, Hive, Pig? 
• Parquet, Avro, RCFile, ORCFile? 
• EMR, DynamoDB, Redshift? 
• Flume, Kafka, Kinesis, Storm? 
• Lambda Architecture? 
• PageRank, Collaborative Filtering, Recommendations? 
• Probabilistic Data Structs, Bloom Filters, HyperLogLog?
Quick Poll 
Raiders or 49’ers?
“Streaming” 
Kinesis 
Streaming 
Video 
Streaming 
Piping 
Big Data 
Streaming
Agenda 
• Spark, Spark Streaming 
• Use Cases 
• API and Libraries 
• Machine Learning 
• Graph Processing 
• Execution Model 
• Fault Tolerance 
• Cluster Deployment 
• Monitoring 
• Scaling and Tuning 
• Lambda Architecture 
• Probabilistic Data Structures 
• Approximations
Timeline of Spark Evolution
Spark and Berkeley AMP Lab 
Berkeley Data Analytics Stack (BDAS)
Spark Overview 
• Based on 2007 Microsoft Dryad paper 
• Written in Scala 
• Supports Java, Python, SQL, and R 
• Data fits in memory when possible, but not 
required 
• Improved efficiency over MapReduce 
– 100x in-memory, 2-10x on-disk 
• Compatible with Hadoop 
– File formats, SerDes, and UDFs
Spark Use Cases 
• Ad hoc, exploratory, interactive analytics 
• Real-time + Batch Analytics 
– Lambda Architecture 
• Real-time Machine Learning 
• Real-time Graph Processing 
• Approximate, Time-bound Queries
Explosion of Specialized Systems
Unified High-level Spark Libraries 
• Spark SQL (Data Processing) 
• Spark Streaming (Streaming) 
• MLlib (Machine Learning) 
• GraphX (Graph Processing) 
• BlinkDB (Approximate Queries) 
• Statistics (Correlations, Sampling, etc) 
• Others 
– Shark (Hive on Spark) 
– Spork (Pig on Spark)
Benefits of Unified Libraries 
• Advancements in higher-level libraries are 
pushed down into core and vice-versa 
• Examples 
– Spark Streaming 
• GC and memory management improvements 
– Spark GraphX 
• IndexedRDD for random, hashed-based access within 
a partition versus scanning entire partition 
– Spark Core 
• Sort-based Shuffle
Resilient Distributed Dataset 
(RDD)
RDD Overview 
• Core Spark abstraction 
• Represents partitions 
across the cluster nodes 
• Enables parallel processing 
on data sets 
• Partitions can be in-memory 
or on-disk 
• Immutable, recomputable, 
fault tolerant 
• Contains transformation history (“lineage”) for 
whole data set
Types of RDDs 
• Spark Out-of-the-Box 
– HadoopRDD 
– FilteredRDD 
– MappedRDD 
– PairRDD 
– ShuffledRDD 
– UnionRDD 
– DoubleRDD 
– JdbcRDD 
– JsonRDD 
– SchemaRDD 
– VertexRDD 
– EdgeRDD 
• External 
– CassandraRDD (DataStax) 
– GeoRDD (Esri) 
– EsSpark (ElasticSearch)
RDD Partitions
RDD Lineage Examples
Demo! 
• Load user data 
• Load gender data 
• Join user data 
with gender data 
• Analyze lineage
Join Optimizations 
• When joining large dataset with small dataset (reference 
data) 
• Broadcast small dataset to each node/partition of large 
dataset (one broadcast per node)
Spark API
Spark API Overview 
• Richer, more expressive than MapReduce 
• Native support for Java, Scala, Python, 
SQL, and R (mostly) 
• Unified API across all libraries 
• Operations 
– Transformations (lazy evaluation) 
– Actions (execute transformations)
Transformations
Actions
Spark Execution Model
Job Scheduling 
• Job 
– Contains many stages 
• Contains many tasks 
• FIFO 
– Long-running jobs may starve resources for other 
jobs 
• Fair 
– Round-robin to prevent resource starvation 
• Fair Scheduler Pools 
– High-priority pool for important jobs 
– Separate user pools 
– FIFO within the pool 
– Modeled after Hadoop Fair Scheduler
Spark Execution Model Overview 
• Parallel, distributed 
• DAG-based 
• Lazy evaluation 
• Allows optimizations 
– Reduce disk I/O 
– Reduce shuffle I/O 
– Single pass through dataset 
– Parallel execution 
– Task pipelining 
• Data locality and rack awareness 
• Worker node fault tolerance using RDD 
lineage graphs per partition
Execution Optimizations
Task Pipelining
Spark Cluster Deployment
Types of Cluster Deployments 
• Spark Standalone (default) 
• YARN 
– Allows hierarchies of resources 
– Kerberos integration 
– Multiple workloads from disparate execution frameworks 
• Hive, Pig, Spark, MapReduce, Cascading, etc 
• Mesos 
– Coarse-grained 
• Single, long-running Mesos tasks runs Spark mini tasks 
– Fine-grained 
• New Mesos task for each Spark task 
• Higher overhead 
• Not good for long-running Spark jobs like Spark Streaming
Spark Standalone Deployment
Spark YARN Deployment
Master High Availability 
• Multiple Master Nodes 
• ZooKeeper maintains current Master 
• Existing applications and workers will be 
notified of new Master election 
• New applications and workers need to 
explicitly specify current Master 
• Alternatives (Not recommended) 
– Local filesystem 
– NFS Mount
Spark After Dark 
(sparkafterdark.com)
Spark After Dark ~ Tinder
Mutual Match!
Goal: Increase Matches (1/2) 
• Top Influencers 
– PageRank 
– Most desirable people 
• People You May Know 
– Shortest Path 
– Facebook-enabled 
• Recommendations 
– Alternating Least 
Squares (ALS)
Goal: Increase Matches (2/2) 
• Cluster on Interests, 
Height, Religion, etc 
– K-Means 
– Nearest Neighbor 
• Textual Analysis 
of Profile Text 
– TF/IDF 
– N-gram 
– LDA Topic Extraction
Demo!
Spark Streaming
Spark Streaming Overview 
• Low latency, high throughput, fault-tolerance 
(mostly) 
• Long-running Spark application 
• Supports Flume, Kafka, Twitter, Kinesis, 
Socket, File, etc. 
• Graceful shutdown, in-flight message draining 
• Uses Spark Core, DAG Execution Model, and 
Fault Tolerance 
• Submits micro-batch jobs to the cluster
Spark Streaming Overview
Spark Streaming Use Cases 
• ETL and enrichment 
of streaming data 
on injest 
• Operational 
dashboards 
• Lambda 
Architecture
Discretized Stream (DStream) 
• Core Spark Streaming abstraction 
• Micro-batches of RDDs 
• Operations similar to RDD – operates on underlying RDDs
Spark Streaming API
Spark Streaming API Overview 
• Rich, expressive API similar to core 
• Operations 
– Transformations (lazy) 
– Actions (execute transformations) 
• Window and State Operations 
• Requires checkpointing to snip long-running 
DStream lineage 
• Register DStream as a Spark SQL table 
for querying?! Wow.
DStream Transformations
DStream Actions
Window and State DStream Operations
DStream Window and State Example
Spark Streaming Cluster 
Deployment
Spark Streaming Cluster Deployment
Adding Receivers
Adding Workers
Spark Streaming 
+ 
Kinesis
Spark Streaming + Kinesis Architecture
Throughput and Pricing
Demo! 
Kinesis 
Streaming 
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/… 
Scala: …/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala 
Java: …/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java
Spark Streaming 
Fault Tolerance
Characteristics of Sources 
• Buffered 
– Flume, Kafka, Kinesis 
– Allows replay and back pressure 
• Batched 
– Flume, Kafka, Kinesis 
– Improves throughput at expense of duplication on 
failure 
• Checkpointed 
– Kafka, Kinesis 
– Allows replay from specific checkpoint
Message Delivery Guarantees 
• Exactly once [1] 
– No loss 
– No redeliver 
– Perfect delivery 
– Incurs higher latency for transactional semantics 
– Spark default per batch using DStream lineage 
– Degrades to less guarantees depending on source 
• At least once [1..n] 
– No loss 
– Possible redeliver 
• At most once [0,1] 
– Possible loss 
– No redeliver 
– *Best configuration if some data loss is acceptable 
• Ordered 
– Per partition: Kafka, Kinesis 
– Global across all partitions: Hard to scale
Types of Checkpoints 
Spark 
1. Spark checkpointing of StreamingContext 
DStreams and metadata 
2. Lineage of state and window DStream 
operations 
Kinesis 
3. Kinesis Client Library (KCL) checkpoints 
current position within shard 
– Checkpoint info is stored in DynamoDB per 
Kinesis application keyed by shard
Fault Tolerance 
• Points of Failure 
– Receiver 
– Driver 
– Worker/Processor 
• Possible Solutions 
– Use HDFS File Source for durability 
– Data Replication 
– Secondary/Backup Nodes 
– Checkpoints 
• Stream, Window, and State info
Streaming Receiver Failure 
• Use a backup receiver 
• Use multiple receivers pulling from multiple 
shards 
– Use checkpoint-enabled, sharded streaming 
source (ie. Kafka and Kinesis) 
• Data is replicated to 2 nodes immediately 
upon ingestion 
– Will spill to disk if doesn’t fit in memory 
• Possible loss of most-recent batch 
• Possible at-least once delivery of batch 
• Use buffered sources for replay 
– Kafka and Kinesis
Streaming Driver Failure 
• Use a backup Driver 
– Use DStream metadata checkpoint info to 
recover 
• Single point of failure – interrupts stream 
processing 
• Streaming Driver is a long-running Spark 
application 
– Schedules long-running stream receivers 
• State and Window RDD checkpoints to 
HDFS to help avoid data loss (mostly)
Stream Worker/Processor Failure 
• No problem! 
• DStream RDD partitions will be recalculated from lineage 
• Causes blip in processing during node failover
Spark Streaming 
Monitoring and Tuning
Monitoring 
• Monitor driver, receiver, worker nodes, and 
streams 
• Alert upon failure or unusually high latency 
• Spark Web UI 
– Streaming tab 
• Ganglia, CloudWatch 
• StreamingListener callback
Spark Web UI
Tuning 
• Batch interval 
– High: reduce overhead of submitting new tasks for each batch 
– Low: keeps latencies low 
– Sweet spot: DStream job time (scheduling + processing) is 
steady and less than batch interval 
• Checkpoint interval 
– High: reduce load on checkpoint overhead 
– Low: reduce amount of data loss on failure 
– Recommendation: 5-10x sliding window interval 
• Use DStream.repartition() to increase parallelism of processing 
DStream jobs across cluster 
• Use spark.streaming.unpersist=true to let the Streaming Framework 
figure out when to unpersist 
• Use CMS GC for consistent processing times
Lambda Architecture
Lambda Architecture Overview 
• Batch Layer 
– Immutable, 
Batch read, 
Append-only write 
– Source of truth 
– ie. HDFS 
• Speed Layer 
– Mutable, 
Random read/write 
– Most complex 
– Recent data only 
– ie. Cassandra 
• Serving Layer 
– Immutable, 
Random read, 
Batch write 
– ie. ElephantDB
Spark + AWS + Lambda
Spark + Lambda + GraphX + MLlib
Demo! 
• Load JSON 
• Convert to Parquet file 
• Save Parquet file to disk 
• Query Parquet file directly
Approximations
Approximation Overview 
• Required for scaling 
• Speed up analysis of large datasets 
• Reduce size of working dataset 
• Data is messy 
• Collection of data is messy 
• Exact isn’t always necessary 
• “Approximate is the new Exact”
Some Approximation Methods 
• Approximate time-bound queries 
– BlinkDB 
• Bernouilli and Poisson Sampling 
– RDD: sample(), takeSample() 
• HyperLogLog 
PairRDD: countApproxDistinctByKey() 
• Count-min Sketch 
• Spark Streaming and Twitter Algebird 
• Bloom Filters 
– Everywhere!
Approximations In Action 
Figure: Memory Savings with Approximation Techniques 
(http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
Spark Statistics Library 
• Correlations 
– Dependence between 2 random variables 
– Pearson, Spearman 
• Hypothesis Testing 
– Measure of statistical significance 
– Chi-squared test 
• Stratified Sampling 
– Sample separately from different sub-populations 
– Bernoulli and Poisson sampling 
– With and without replacement 
• Random data generator 
– Uniform, standard normal, and Poisson distribution
Summary 
• Spark, Spark Streaming Overview 
• Use Cases 
• API and Libraries 
• Machine Learning 
• Graph Processing 
• Execution Model 
• Fault Tolerance 
• Cluster Deployment 
• Monitoring 
• Scaling and Tuning 
• Lambda Architecture 
• Probabilistic Data Structures 
• Approximations 
http://effectivespark.com http://sparkinaction.com 
Thanks!! 
Chris Fregly 
@cfregly

More Related Content

What's hot

Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSpark Summit
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchMark Miller
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesAnand Narayanan
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkEvan Chan
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkEvan Chan
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckData Con LA
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng ShiDatabricks
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkEvan Chan
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Databricks
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkTaras Matyashovsky
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Apache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Apache Con 2021 : Apache Bookkeeper Key Value Store and use casesApache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Apache Con 2021 : Apache Bookkeeper Key Value Store and use casesShivji Kumar Jha
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software DevelopmentAlexis Seigneurin
 
Akka Streams And Kafka Streams: Where Microservices Meet Fast Data
Akka Streams And Kafka Streams: Where Microservices Meet Fast DataAkka Streams And Kafka Streams: Where Microservices Meet Fast Data
Akka Streams And Kafka Streams: Where Microservices Meet Fast DataLightbend
 

What's hot (20)

Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data Search
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika Technologies
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Apache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Apache Con 2021 : Apache Bookkeeper Key Value Store and use casesApache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Apache Con 2021 : Apache Bookkeeper Key Value Store and use cases
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software Development
 
Akka Streams And Kafka Streams: Where Microservices Meet Fast Data
Akka Streams And Kafka Streams: Where Microservices Meet Fast DataAkka Streams And Kafka Streams: Where Microservices Meet Fast Data
Akka Streams And Kafka Streams: Where Microservices Meet Fast Data
 

Viewers also liked

Summarization and opinion detection in product reviews
Summarization and opinion detection in product reviewsSummarization and opinion detection in product reviews
Summarization and opinion detection in product reviewspapanaboinasuman
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKTaposh Roy
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Yongho Ha
 

Viewers also liked (9)

Summarization and opinion detection in product reviews
Summarization and opinion detection in product reviewsSummarization and opinion detection in product reviews
Summarization and opinion detection in product reviews
 
Hadoop bootcamp getting started
Hadoop bootcamp getting startedHadoop bootcamp getting started
Hadoop bootcamp getting started
 
On Centralizing Logs
On Centralizing LogsOn Centralizing Logs
On Centralizing Logs
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
 

Similar to Spark Streaming, Machine Learning, Graph Processing and Approximations with Kinesis

Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Chris Fregly
 
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Chris Fregly
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...Databricks
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkDatabricks
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkUserReport
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestDatabricks
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Exampleconfluent
 
Spy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformRedge Technologies
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
 

Similar to Spark Streaming, Machine Learning, Graph Processing and Approximations with Kinesis (20)

Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
 
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
 
Glint with Apache Spark
Glint with Apache SparkGlint with Apache Spark
Glint with Apache Spark
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
 
Spy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platform
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 

More from Chris Fregly

AWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataChris Fregly
 
Pandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfChris Fregly
 
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupRay AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupChris Fregly
 
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedChris Fregly
 
Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine LearningChris Fregly
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...Chris Fregly
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon BraketChris Fregly
 
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-PersonChris Fregly
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapChris Fregly
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...Chris Fregly
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Chris Fregly
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Chris Fregly
 
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Chris Fregly
 
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...Chris Fregly
 
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...Chris Fregly
 
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Chris Fregly
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...Chris Fregly
 
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Chris Fregly
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...Chris Fregly
 
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...Chris Fregly
 

More from Chris Fregly (20)

AWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and Data
 
Pandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdf
 
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupRay AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
 
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
 
Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine Learning
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon Braket
 
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:Cap
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
 
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
 
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
 
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
 
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
 
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
 

Spark Streaming, Machine Learning, Graph Processing and Approximations with Kinesis

  • 1. Spark and Friends: Spark Streaming, Machine Learning, Graph Processing, Lambda Architecture, Approximations Chris Fregly East Bay Java User Group Oct 2014 Kinesis Streaming
  • 2. Who am I? Former Netflix’er (netflix.github.io) Spark Contributor (github.com/apache/spark) Founder (fluxcapacitor.com) Author (effectivespark.com, sparkinaction.com)
  • 4. Not-so-Quick Poll • Spark, Spark Streaming? • Hadoop, Hive, Pig? • Parquet, Avro, RCFile, ORCFile? • EMR, DynamoDB, Redshift? • Flume, Kafka, Kinesis, Storm? • Lambda Architecture? • PageRank, Collaborative Filtering, Recommendations? • Probabilistic Data Structs, Bloom Filters, HyperLogLog?
  • 5. Quick Poll Raiders or 49’ers?
  • 6. “Streaming” Kinesis Streaming Video Streaming Piping Big Data Streaming
  • 7. Agenda • Spark, Spark Streaming • Use Cases • API and Libraries • Machine Learning • Graph Processing • Execution Model • Fault Tolerance • Cluster Deployment • Monitoring • Scaling and Tuning • Lambda Architecture • Probabilistic Data Structures • Approximations
  • 8. Timeline of Spark Evolution
  • 9. Spark and Berkeley AMP Lab Berkeley Data Analytics Stack (BDAS)
  • 10. Spark Overview • Based on 2007 Microsoft Dryad paper • Written in Scala • Supports Java, Python, SQL, and R • Data fits in memory when possible, but not required • Improved efficiency over MapReduce – 100x in-memory, 2-10x on-disk • Compatible with Hadoop – File formats, SerDes, and UDFs
  • 11. Spark Use Cases • Ad hoc, exploratory, interactive analytics • Real-time + Batch Analytics – Lambda Architecture • Real-time Machine Learning • Real-time Graph Processing • Approximate, Time-bound Queries
  • 13. Unified High-level Spark Libraries • Spark SQL (Data Processing) • Spark Streaming (Streaming) • MLlib (Machine Learning) • GraphX (Graph Processing) • BlinkDB (Approximate Queries) • Statistics (Correlations, Sampling, etc) • Others – Shark (Hive on Spark) – Spork (Pig on Spark)
  • 14. Benefits of Unified Libraries • Advancements in higher-level libraries are pushed down into core and vice-versa • Examples – Spark Streaming • GC and memory management improvements – Spark GraphX • IndexedRDD for random, hashed-based access within a partition versus scanning entire partition – Spark Core • Sort-based Shuffle
  • 16. RDD Overview • Core Spark abstraction • Represents partitions across the cluster nodes • Enables parallel processing on data sets • Partitions can be in-memory or on-disk • Immutable, recomputable, fault tolerant • Contains transformation history (“lineage”) for whole data set
  • 17. Types of RDDs • Spark Out-of-the-Box – HadoopRDD – FilteredRDD – MappedRDD – PairRDD – ShuffledRDD – UnionRDD – DoubleRDD – JdbcRDD – JsonRDD – SchemaRDD – VertexRDD – EdgeRDD • External – CassandraRDD (DataStax) – GeoRDD (Esri) – EsSpark (ElasticSearch)
  • 20. Demo! • Load user data • Load gender data • Join user data with gender data • Analyze lineage
  • 21. Join Optimizations • When joining large dataset with small dataset (reference data) • Broadcast small dataset to each node/partition of large dataset (one broadcast per node)
  • 23. Spark API Overview • Richer, more expressive than MapReduce • Native support for Java, Scala, Python, SQL, and R (mostly) • Unified API across all libraries • Operations – Transformations (lazy evaluation) – Actions (execute transformations)
  • 27. Job Scheduling • Job – Contains many stages • Contains many tasks • FIFO – Long-running jobs may starve resources for other jobs • Fair – Round-robin to prevent resource starvation • Fair Scheduler Pools – High-priority pool for important jobs – Separate user pools – FIFO within the pool – Modeled after Hadoop Fair Scheduler
  • 28. Spark Execution Model Overview • Parallel, distributed • DAG-based • Lazy evaluation • Allows optimizations – Reduce disk I/O – Reduce shuffle I/O – Single pass through dataset – Parallel execution – Task pipelining • Data locality and rack awareness • Worker node fault tolerance using RDD lineage graphs per partition
  • 32. Types of Cluster Deployments • Spark Standalone (default) • YARN – Allows hierarchies of resources – Kerberos integration – Multiple workloads from disparate execution frameworks • Hive, Pig, Spark, MapReduce, Cascading, etc • Mesos – Coarse-grained • Single, long-running Mesos tasks runs Spark mini tasks – Fine-grained • New Mesos task for each Spark task • Higher overhead • Not good for long-running Spark jobs like Spark Streaming
  • 35. Master High Availability • Multiple Master Nodes • ZooKeeper maintains current Master • Existing applications and workers will be notified of new Master election • New applications and workers need to explicitly specify current Master • Alternatives (Not recommended) – Local filesystem – NFS Mount
  • 36. Spark After Dark (sparkafterdark.com)
  • 37. Spark After Dark ~ Tinder
  • 39. Goal: Increase Matches (1/2) • Top Influencers – PageRank – Most desirable people • People You May Know – Shortest Path – Facebook-enabled • Recommendations – Alternating Least Squares (ALS)
  • 40. Goal: Increase Matches (2/2) • Cluster on Interests, Height, Religion, etc – K-Means – Nearest Neighbor • Textual Analysis of Profile Text – TF/IDF – N-gram – LDA Topic Extraction
  • 41. Demo!
  • 43. Spark Streaming Overview • Low latency, high throughput, fault-tolerance (mostly) • Long-running Spark application • Supports Flume, Kafka, Twitter, Kinesis, Socket, File, etc. • Graceful shutdown, in-flight message draining • Uses Spark Core, DAG Execution Model, and Fault Tolerance • Submits micro-batch jobs to the cluster
  • 45. Spark Streaming Use Cases • ETL and enrichment of streaming data on injest • Operational dashboards • Lambda Architecture
  • 46. Discretized Stream (DStream) • Core Spark Streaming abstraction • Micro-batches of RDDs • Operations similar to RDD – operates on underlying RDDs
  • 48. Spark Streaming API Overview • Rich, expressive API similar to core • Operations – Transformations (lazy) – Actions (execute transformations) • Window and State Operations • Requires checkpointing to snip long-running DStream lineage • Register DStream as a Spark SQL table for querying?! Wow.
  • 51. Window and State DStream Operations
  • 52. DStream Window and State Example
  • 57. Spark Streaming + Kinesis
  • 58. Spark Streaming + Kinesis Architecture
  • 60. Demo! Kinesis Streaming https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/… Scala: …/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala Java: …/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java
  • 62. Characteristics of Sources • Buffered – Flume, Kafka, Kinesis – Allows replay and back pressure • Batched – Flume, Kafka, Kinesis – Improves throughput at expense of duplication on failure • Checkpointed – Kafka, Kinesis – Allows replay from specific checkpoint
  • 63. Message Delivery Guarantees • Exactly once [1] – No loss – No redeliver – Perfect delivery – Incurs higher latency for transactional semantics – Spark default per batch using DStream lineage – Degrades to less guarantees depending on source • At least once [1..n] – No loss – Possible redeliver • At most once [0,1] – Possible loss – No redeliver – *Best configuration if some data loss is acceptable • Ordered – Per partition: Kafka, Kinesis – Global across all partitions: Hard to scale
  • 64. Types of Checkpoints Spark 1. Spark checkpointing of StreamingContext DStreams and metadata 2. Lineage of state and window DStream operations Kinesis 3. Kinesis Client Library (KCL) checkpoints current position within shard – Checkpoint info is stored in DynamoDB per Kinesis application keyed by shard
  • 65. Fault Tolerance • Points of Failure – Receiver – Driver – Worker/Processor • Possible Solutions – Use HDFS File Source for durability – Data Replication – Secondary/Backup Nodes – Checkpoints • Stream, Window, and State info
  • 66. Streaming Receiver Failure • Use a backup receiver • Use multiple receivers pulling from multiple shards – Use checkpoint-enabled, sharded streaming source (ie. Kafka and Kinesis) • Data is replicated to 2 nodes immediately upon ingestion – Will spill to disk if doesn’t fit in memory • Possible loss of most-recent batch • Possible at-least once delivery of batch • Use buffered sources for replay – Kafka and Kinesis
  • 67. Streaming Driver Failure • Use a backup Driver – Use DStream metadata checkpoint info to recover • Single point of failure – interrupts stream processing • Streaming Driver is a long-running Spark application – Schedules long-running stream receivers • State and Window RDD checkpoints to HDFS to help avoid data loss (mostly)
  • 68. Stream Worker/Processor Failure • No problem! • DStream RDD partitions will be recalculated from lineage • Causes blip in processing during node failover
  • 70. Monitoring • Monitor driver, receiver, worker nodes, and streams • Alert upon failure or unusually high latency • Spark Web UI – Streaming tab • Ganglia, CloudWatch • StreamingListener callback
  • 72. Tuning • Batch interval – High: reduce overhead of submitting new tasks for each batch – Low: keeps latencies low – Sweet spot: DStream job time (scheduling + processing) is steady and less than batch interval • Checkpoint interval – High: reduce load on checkpoint overhead – Low: reduce amount of data loss on failure – Recommendation: 5-10x sliding window interval • Use DStream.repartition() to increase parallelism of processing DStream jobs across cluster • Use spark.streaming.unpersist=true to let the Streaming Framework figure out when to unpersist • Use CMS GC for consistent processing times
  • 74. Lambda Architecture Overview • Batch Layer – Immutable, Batch read, Append-only write – Source of truth – ie. HDFS • Speed Layer – Mutable, Random read/write – Most complex – Recent data only – ie. Cassandra • Serving Layer – Immutable, Random read, Batch write – ie. ElephantDB
  • 75. Spark + AWS + Lambda
  • 76. Spark + Lambda + GraphX + MLlib
  • 77. Demo! • Load JSON • Convert to Parquet file • Save Parquet file to disk • Query Parquet file directly
  • 79. Approximation Overview • Required for scaling • Speed up analysis of large datasets • Reduce size of working dataset • Data is messy • Collection of data is messy • Exact isn’t always necessary • “Approximate is the new Exact”
  • 80. Some Approximation Methods • Approximate time-bound queries – BlinkDB • Bernouilli and Poisson Sampling – RDD: sample(), takeSample() • HyperLogLog PairRDD: countApproxDistinctByKey() • Count-min Sketch • Spark Streaming and Twitter Algebird • Bloom Filters – Everywhere!
  • 81. Approximations In Action Figure: Memory Savings with Approximation Techniques (http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
  • 82. Spark Statistics Library • Correlations – Dependence between 2 random variables – Pearson, Spearman • Hypothesis Testing – Measure of statistical significance – Chi-squared test • Stratified Sampling – Sample separately from different sub-populations – Bernoulli and Poisson sampling – With and without replacement • Random data generator – Uniform, standard normal, and Poisson distribution
  • 83. Summary • Spark, Spark Streaming Overview • Use Cases • API and Libraries • Machine Learning • Graph Processing • Execution Model • Fault Tolerance • Cluster Deployment • Monitoring • Scaling and Tuning • Lambda Architecture • Probabilistic Data Structures • Approximations http://effectivespark.com http://sparkinaction.com Thanks!! Chris Fregly @cfregly