Spark Streaming, Machine Learning, Graph Processing and Approximations with Kinesis

Spark and Friends:
Spark Streaming,
Machine Learning, Graph Processing,
Lambda Architecture, Approximations
Chris Fregly
East Bay Java User Group
Oct 2014
Kinesis
Streaming

Who am I?
Former Netflix’er
(netflix.github.io)
Spark Contributor
(github.com/apache/spark)
Founder
(fluxcapacitor.com)
Author
(effectivespark.com,
sparkinaction.com)

Not-so-Quick Poll
• Spark, Spark Streaming?
• Hadoop, Hive, Pig?
• Parquet, Avro, RCFile, ORCFile?
• EMR, DynamoDB, Redshift?
• Flume, Kafka, Kinesis, Storm?
• Lambda Architecture?
• PageRank, Collaborative Filtering, Recommendations?
• Probabilistic Data Structs, Bloom Filters, HyperLogLog?

Quick Poll
Raiders or 49’ers?

“Streaming”
Kinesis
Streaming
Video
Streaming
Piping
Big Data
Streaming

Agenda
• Spark, Spark Streaming
• Use Cases
• API and Libraries
• Machine Learning
• Graph Processing
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Probabilistic Data Structures
• Approximations

Spark and Berkeley AMP Lab
Berkeley Data Analytics Stack (BDAS)

Spark Overview
• Based on 2007 Microsoft Dryad paper
• Written in Scala
• Supports Java, Python, SQL, and R
• Data fits in memory when possible, but not
required
• Improved efficiency over MapReduce
– 100x in-memory, 2-10x on-disk
• Compatible with Hadoop
– File formats, SerDes, and UDFs

Spark Use Cases
• Ad hoc, exploratory, interactive analytics
• Real-time + Batch Analytics
– Lambda Architecture
• Real-time Machine Learning
• Real-time Graph Processing
• Approximate, Time-bound Queries

Explosion of Specialized Systems

Unified High-level Spark Libraries
• Spark SQL (Data Processing)
• Spark Streaming (Streaming)
• MLlib (Machine Learning)
• GraphX (Graph Processing)
• BlinkDB (Approximate Queries)
• Statistics (Correlations, Sampling, etc)
• Others
– Shark (Hive on Spark)
– Spork (Pig on Spark)

Benefits of Unified Libraries
• Advancements in higher-level libraries are
pushed down into core and vice-versa
• Examples
– Spark Streaming
• GC and memory management improvements
– Spark GraphX
• IndexedRDD for random, hashed-based access within
a partition versus scanning entire partition
– Spark Core
• Sort-based Shuffle

Resilient Distributed Dataset
(RDD)

RDD Overview
• Core Spark abstraction
• Represents partitions
across the cluster nodes
• Enables parallel processing
on data sets
• Partitions can be in-memory
or on-disk
• Immutable, recomputable,
fault tolerant
• Contains transformation history (“lineage”) for
whole data set

Types of RDDs
• Spark Out-of-the-Box
– HadoopRDD
– FilteredRDD
– MappedRDD
– PairRDD
– ShuffledRDD
– UnionRDD
– DoubleRDD
– JdbcRDD
– JsonRDD
– SchemaRDD
– VertexRDD
– EdgeRDD
• External
– CassandraRDD (DataStax)
– GeoRDD (Esri)
– EsSpark (ElasticSearch)

Demo!
• Load user data
• Load gender data
• Join user data
with gender data
• Analyze lineage

Join Optimizations
• When joining large dataset with small dataset (reference
data)
• Broadcast small dataset to each node/partition of large
dataset (one broadcast per node)

Spark API Overview
• Richer, more expressive than MapReduce
• Native support for Java, Scala, Python,
SQL, and R (mostly)
• Unified API across all libraries
• Operations
– Transformations (lazy evaluation)
– Actions (execute transformations)

Job Scheduling
• Job
– Contains many stages
• Contains many tasks
• FIFO
– Long-running jobs may starve resources for other
jobs
• Fair
– Round-robin to prevent resource starvation
• Fair Scheduler Pools
– High-priority pool for important jobs
– Separate user pools
– FIFO within the pool
– Modeled after Hadoop Fair Scheduler

Spark Execution Model Overview
• Parallel, distributed
• DAG-based
• Lazy evaluation
• Allows optimizations
– Reduce disk I/O
– Reduce shuffle I/O
– Single pass through dataset
– Parallel execution
– Task pipelining
• Data locality and rack awareness
• Worker node fault tolerance using RDD
lineage graphs per partition

Types of Cluster Deployments
• Spark Standalone (default)
• YARN
– Allows hierarchies of resources
– Kerberos integration
– Multiple workloads from disparate execution frameworks
• Hive, Pig, Spark, MapReduce, Cascading, etc
• Mesos
– Coarse-grained
• Single, long-running Mesos tasks runs Spark mini tasks
– Fine-grained
• New Mesos task for each Spark task
• Higher overhead
• Not good for long-running Spark jobs like Spark Streaming

Master High Availability
• Multiple Master Nodes
• ZooKeeper maintains current Master
• Existing applications and workers will be
notified of new Master election
• New applications and workers need to
explicitly specify current Master
• Alternatives (Not recommended)
– Local filesystem
– NFS Mount

Spark After Dark
(sparkafterdark.com)

Goal: Increase Matches (1/2)
• Top Influencers
– PageRank
– Most desirable people
• People You May Know
– Shortest Path
– Facebook-enabled
• Recommendations
– Alternating Least
Squares (ALS)

Goal: Increase Matches (2/2)
• Cluster on Interests,
Height, Religion, etc
– K-Means
– Nearest Neighbor
• Textual Analysis
of Profile Text
– TF/IDF
– N-gram
– LDA Topic Extraction

Spark Streaming Overview
• Low latency, high throughput, fault-tolerance
(mostly)
• Long-running Spark application
• Supports Flume, Kafka, Twitter, Kinesis,
Socket, File, etc.
• Graceful shutdown, in-flight message draining
• Uses Spark Core, DAG Execution Model, and
Fault Tolerance
• Submits micro-batch jobs to the cluster

Spark Streaming Use Cases
• ETL and enrichment
of streaming data
on injest
• Operational
dashboards
• Lambda
Architecture

Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDD – operates on underlying RDDs

Spark Streaming API Overview
• Rich, expressive API similar to core
• Operations
– Transformations (lazy)
– Actions (execute transformations)
• Window and State Operations
• Requires checkpointing to snip long-running
DStream lineage
• Register DStream as a Spark SQL table
for querying?! Wow.

Window and State DStream Operations

DStream Window and State Example

Spark Streaming Cluster
Deployment

Spark Streaming Cluster Deployment

Spark Streaming + Kinesis Architecture

Demo!
Kinesis
Streaming
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/…
Scala: …/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
Java: …/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java

Spark Streaming
Fault Tolerance

Characteristics of Sources
• Buffered
– Flume, Kafka, Kinesis
– Allows replay and back pressure
• Batched
– Flume, Kafka, Kinesis
– Improves throughput at expense of duplication on
failure
• Checkpointed
– Kafka, Kinesis
– Allows replay from specific checkpoint

Message Delivery Guarantees
• Exactly once [1]
– No loss
– No redeliver
– Perfect delivery
– Incurs higher latency for transactional semantics
– Spark default per batch using DStream lineage
– Degrades to less guarantees depending on source
• At least once [1..n]
– No loss
– Possible redeliver
• At most once [0,1]
– Possible loss
– No redeliver
– *Best configuration if some data loss is acceptable
• Ordered
– Per partition: Kafka, Kinesis
– Global across all partitions: Hard to scale

Types of Checkpoints
Spark
1. Spark checkpointing of StreamingContext
DStreams and metadata
2. Lineage of state and window DStream
operations
Kinesis
3. Kinesis Client Library (KCL) checkpoints
current position within shard
– Checkpoint info is stored in DynamoDB per
Kinesis application keyed by shard

Fault Tolerance
• Points of Failure
– Receiver
– Driver
– Worker/Processor
• Possible Solutions
– Use HDFS File Source for durability
– Data Replication
– Secondary/Backup Nodes
– Checkpoints
• Stream, Window, and State info

Streaming Receiver Failure
• Use a backup receiver
• Use multiple receivers pulling from multiple
shards
– Use checkpoint-enabled, sharded streaming
source (ie. Kafka and Kinesis)
• Data is replicated to 2 nodes immediately
upon ingestion
– Will spill to disk if doesn’t fit in memory
• Possible loss of most-recent batch
• Possible at-least once delivery of batch
• Use buffered sources for replay
– Kafka and Kinesis

Streaming Driver Failure
• Use a backup Driver
– Use DStream metadata checkpoint info to
recover
• Single point of failure – interrupts stream
processing
• Streaming Driver is a long-running Spark
application
– Schedules long-running stream receivers
• State and Window RDD checkpoints to
HDFS to help avoid data loss (mostly)

Stream Worker/Processor Failure
• No problem!
• DStream RDD partitions will be recalculated from lineage
• Causes blip in processing during node failover

Spark Streaming
Monitoring and Tuning

Monitoring
• Monitor driver, receiver, worker nodes, and
streams
• Alert upon failure or unusually high latency
• Spark Web UI
– Streaming tab
• Ganglia, CloudWatch
• StreamingListener callback

Tuning
• Batch interval
– High: reduce overhead of submitting new tasks for each batch
– Low: keeps latencies low
– Sweet spot: DStream job time (scheduling + processing) is
steady and less than batch interval
• Checkpoint interval
– High: reduce load on checkpoint overhead
– Low: reduce amount of data loss on failure
– Recommendation: 5-10x sliding window interval
• Use DStream.repartition() to increase parallelism of processing
DStream jobs across cluster
• Use spark.streaming.unpersist=true to let the Streaming Framework
figure out when to unpersist
• Use CMS GC for consistent processing times

Lambda Architecture Overview
• Batch Layer
– Immutable,
Batch read,
Append-only write
– Source of truth
– ie. HDFS
• Speed Layer
– Mutable,
Random read/write
– Most complex
– Recent data only
– ie. Cassandra
• Serving Layer
– Immutable,
Random read,
Batch write
– ie. ElephantDB

Spark + Lambda + GraphX + MLlib

Demo!
• Load JSON
• Convert to Parquet file
• Save Parquet file to disk
• Query Parquet file directly

Approximation Overview
• Required for scaling
• Speed up analysis of large datasets
• Reduce size of working dataset
• Data is messy
• Collection of data is messy
• Exact isn’t always necessary
• “Approximate is the new Exact”

Some Approximation Methods
• Approximate time-bound queries
– BlinkDB
• Bernouilli and Poisson Sampling
– RDD: sample(), takeSample()
• HyperLogLog
PairRDD: countApproxDistinctByKey()
• Count-min Sketch
• Spark Streaming and Twitter Algebird
• Bloom Filters
– Everywhere!

Approximations In Action
Figure: Memory Savings with Approximation Techniques
(http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)

Spark Statistics Library
• Correlations
– Dependence between 2 random variables
– Pearson, Spearman
• Hypothesis Testing
– Measure of statistical significance
– Chi-squared test
• Stratified Sampling
– Sample separately from different sub-populations
– Bernoulli and Poisson sampling
– With and without replacement
• Random data generator
– Uniform, standard normal, and Poisson distribution

Summary
• Spark, Spark Streaming Overview
• Use Cases
• API and Libraries
• Machine Learning
• Graph Processing
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Probabilistic Data Structures
• Approximations
http://effectivespark.com http://sparkinaction.com
Thanks!!
Chris Fregly
@cfregly

Spark Streaming, Machine Learning, Graph Processing and Approximations with Kinesis

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (9)

Similar to Spark Streaming, Machine Learning, Graph Processing and Approximations with Kinesis

Similar to Spark Streaming, Machine Learning, Graph Processing and Approximations with Kinesis (20)

More from Chris Fregly

More from Chris Fregly (20)

Spark Streaming, Machine Learning, Graph Processing and Approximations with Kinesis