S4 is a distributed stream computing platform that allows programmers to easily implement applications for processing continuous unbounded streams of data in real-time. It uses an actor-based programming model and is designed to be fault-tolerant, scalable, and pluggable. S4 was originally developed at Yahoo! Labs to enable personalized search ads by modeling users' click behaviors in real-time from streams of user activity data. It aims to maximize revenue and user experience by controlling ad ranking, pricing, filtering, and placement based on personalized models of users' intent.
Gen AI in Business - Global Trends Report 2024.pdf
Apache S4: A Distributed Stream Computing Platform for Real-Time Applications
1. Apache S4: A Distributed Stream
Computing Platform
Presented at Stanford Infolab – Nov 4, 2011
http://incubator.apache.org/projects/s4 (migrating from http://s4.io)
S4 Committers: {fpj, kishoreg, leoneu, mmorel,
robbins}@apache.org
Presented by Leo Neumeyer (@leoneu)
1
2. About Me
Born in Buenos Aires, Argentina, studied EE.
School/Work in Canada (Signal Processing, Speech Coding).
SRI Int'l (Menlo Park) Speech Lab, DARPA benchmarks, lab
founded speech recognition spin-off Nuance Comm Inc.
Mindstech: Startup to teach spoken English in Asia using web
audio/video (before 2-way media was widely available).
Yahoo! Labs: Search advertising (optimization, auctions).
Quantbench: mission is to create a marketplace for data
scientists, data providers, and investment funds.
2
3. S4 Project History
Started as a research project at Yahoo! Labs in August 2008
out of the need to personalize search ads in real-time.
Open sourced in September 2009.
Moved to Apache Incubator in October 2011.
3
4. Motivation
Online Parameter
Personalized Search Twitter Trends
Optimization
given multiple event streams
Predict Market Prices extract information
Spam Filtering
Automatic Trading
using data driven models
in real time
with low latency
Network Intrusion at scale
Detection Sensor Networks
It's Fun!
4
5. S4 Architecture
Node
App
App Server App
App
App PE Prototype
App
App PE Instance
App
App
Stream
App
App
Unlimited There is one Apps An app is a PE instances
number of server process encapsulate graph are clones of
nodes. Each per node. The units of work. composed of the prototype.
node has one server They can PE prototypes They are
process. loads/unloads consume and and streams associated with
apps. produce event that produce, a unique key
streams. consume, and and contain the
transmit msgs. state.
S4 is a general-purpose, real-time, distributed, decentralized, robust, scalable,
event driven, pluggable platform that allows programmers to easily implement
applications for processing continuous unbounded streams of data.
5
6. Latency vs. Accuracy
Zero Errors Real-Time
Latency ➔ Unconstrained ➔ Constrained
Why? ➔ Reproducible results ➔ Limited control over
inbound data rate and
computing complexity
Use ➔ Debug ➔ Process unstructured data
➔ Train Models ➔ Tolerance to small errors
➔ Graceful recovery from
inbound data streams
6
7. Design
Actors programming model.
Probabilistic thinking in both algorithms and systems.
Run on commodity hardware.
All in-memory, no disk bottlenecks.
Pluggable (Protocols, applications, serialization, etc.)
Object oriented design → POJOs
Static typing, no string literals, minimize type casting.
Science friendly → constant change, ease of use.
7
8. Programming Model
Example: estimate click-
through rate in a web
application after applying a
filter to remove bot traffic.
8
10. Research Areas: Systems
Checkpointing strategies
Replication strategies
Dynamic load balancing
Adaptive load management
Query languages
10
11. Fault Tolerance
Problem Approaches S4
High Availability ➔ Warm/hot failover ➔ Warm failover
➔ Cold failover ➔ Standby nodes +
Apache Zookeeper
State Loss ➔ Lossy checkpointing ➔ Lossy checkpointing
➔ Lossless checkpoint.
(Crashes, system
updates)
Low Latency ➔ Decouple stream ➔ Asynchronous writes
processing from ➔ Uncoordinated
checkpointing checkpointing
Approach: checkpoints are count or time based, pluggable backend to
support any data store, lazy PE restore, tuning is application dependent.
Research by M. Morel, F. Junqueira, Yahoo! Research Europe, 2011.
11
13. Research Areas: Algorithms
Self-adaptive models: adaptive language models using small
amounts of data.
Personalization: learn from user feedback (clicks, location,
behavior) to deliver relevant information in RT.
Trend detection: find personal Twitter trends relevant to you.
Intrusion detection: summarize high level state of the network
and detect unusual patterns.
Sensor networks: large amounts of audio/video and other
sources require processing, recognition, detection, and
tracking. Detect events across sensors.
13
14. Personalized Search Ads
Goal is to maximize:
Revenue
Click yield
User experience
By controlling:
Ranking
Pricing
Filtering
Placement
S. Schroedl, A. Kesari, and L. Neumeyer, “Personalized ad placement in web search,” in ADKDD ’10: Proceedings of the 4th Annual
International Workshop on Data Mining and Audience Intelligence for Online Advertising, 2010.
14
15. Personalized Search Ads
Model ad click intent using recent user activity.
More likely to click → show more North ads.
Example 1
First query is digital slr camera
Next query is canon slr
More likely than average to click another ad
Example 2
Repeated query without previous clicks
Less likely to click another ad
15
16. Personalized Search Ads
Modeling user session
Typical features:
Number of searches/clicks by user past 24 hrs
User COPC: Ratio of observed clicks to predicted clicks
Identical query searched before / clicked before
Time (seconds) since last search/click
Similarity measures: current vs. previous queries
Modeling technique: stochastic gradient-descent boosted
trees (GDBT)
16
17. Personalized Search Ads
Target
P[CLICK|ad,query,user]
Approximation
P[CLICK|ad,query]* ucp[user,session]
Non-personalized User Click Propensity (UCP)
long-term model for user session
computed using Hadoop computed using S4
17
18. Personalized Search Ads
Results:
We can reduce the average number of ads (ad footprint) by
7% without decreasing click yield and revenue.
- OR -
For a given ad footprint we can increase click yield by
~2%.
18