Apache S4: A Distributed Stream
Computing Platform

Presented at Stanford Infolab – Nov 4, 2011

http://incubator.apache.org/projects/s4 (migrating from http://s4.io)


  S4 Committers: {fpj, kishoreg, leoneu, mmorel,
  robbins}@apache.org
  Presented by Leo Neumeyer (@leoneu)


About Me

 Born in Buenos Aires, Argentina, studied EE.
 School/Work in Canada (Signal Processing, Speech Coding).
 SRI Int'l (Menlo Park) Speech Lab, DARPA benchmarks, lab
 founded speech recognition spin-off Nuance Comm Inc.
 Mindstech: Startup to teach spoken English in Asia using web
 audio/video (before 2-way media was widely available).
 Yahoo! Labs: Search advertising (optimization, auctions).
 Quantbench: mission is to create a marketplace for data
 scientists, data providers, and investment funds.




S4 Project History

 Started as a research project at Yahoo! Labs in August 2008
 out of the need to personalize search ads in real-time.
 Open sourced in September 2009.
 Moved to Apache Incubator in October 2011.




Motivation


 Given multiple event streams, extract information using data driven
 models, in real time, with low latency, at scale.

 Applications:
  Personalized Search
  Twitter Trends
  Online Parameter Optimization
  Predict Market Prices
  Automatic Trading
  Spam Filtering
  Network Intrusion Detection
  Sensor Networks

 It's Fun!
S4 Architecture

 Node: Unlimited number of nodes. Each node has one process.
 Server: There is one server process per node. The server loads/unloads
 apps.
 App: Apps encapsulate units of work. They can consume and produce event
 streams.
 PE Prototype and Stream: An app is a graph composed of PE prototypes and
 streams that produce, consume, and transmit msgs.
 PE Instance: PE instances are clones of the prototype. They are associated
 with a unique key and contain the state.



S4 is a general-purpose, real-time, distributed, decentralized, robust, scalable,
event driven, pluggable platform that allows programmers to easily implement
applications for processing continuous unbounded streams of data.
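
The relationship between a PE prototype and its keyed PE instances can be illustrated with a minimal sketch. This is not the S4 API: `CounterPE`, `Stream`, `put`, and `key_of` are hypothetical names chosen for illustration, and the sketch runs in a single process with no distribution across nodes.

```python
import copy


class CounterPE:
    """Hypothetical processing element: counts the events seen for one key."""

    def __init__(self):
        self.key = None
        self.count = 0

    def on_event(self, event):
        self.count += 1


class Stream:
    """Hypothetical stream: routes each event to the PE instance for its key."""

    def __init__(self, prototype, key_of):
        self.prototype = prototype  # template; never receives events itself
        self.key_of = key_of        # function that extracts the key from an event
        self.instances = {}         # key -> PE instance holding that key's state

    def put(self, event):
        key = self.key_of(event)
        pe = self.instances.get(key)
        if pe is None:
            # First event for this key: clone the prototype.
            pe = copy.deepcopy(self.prototype)
            pe.key = key
            self.instances[key] = pe
        pe.on_event(event)


stream = Stream(CounterPE(), key_of=lambda e: e["user"])
for user in ["ann", "bob", "ann"]:
    stream.put({"user": user})

counts = {key: pe.count for key, pe in stream.instances.items()}
print(counts)  # {'ann': 2, 'bob': 1}
```

Each key gets its own clone, so state is never shared between keys and the prototype itself stays untouched.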
Latency vs. Accuracy


             Zero Errors               Real-Time
 Latency     ➔ Unconstrained           ➔ Constrained
 Why?        ➔ Reproducible results    ➔ Limited control over inbound data
                                         rate and computing complexity
 Use         ➔ Debug                   ➔ Process unstructured data
             ➔ Train Models            ➔ Tolerance to small errors
                                       ➔ Graceful recovery from inbound
                                         data streams
Design

 Actors programming model.
 Probabilistic thinking in both algorithms and systems.
 Run on commodity hardware.
 All in-memory, no disk bottlenecks.
 Pluggable (Protocols, applications, serialization, etc.)
 Object oriented design → POJOs
 Static typing, no string literals, minimize type casting.
 Science friendly → constant change, ease of use.




Programming Model


                    Example: estimate click-
                    through rate in a web
                    application after applying a
                    filter to remove bot traffic.
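
A minimal sketch of this example as a two-stage pipeline: a filter stage drops bot traffic, then an estimator keeps running counts per ad. The event fields (`type`, `ad`, `user_agent`), the `is_bot` rule, and the `CtrEstimator` class are assumptions made for illustration; the slide does not show the actual S4 app.

```python
from collections import defaultdict


def is_bot(event):
    # Hypothetical filter rule; a real filter would use richer signals.
    return "bot" in event["user_agent"].lower()


class CtrEstimator:
    """Keeps running impression and click counts per ad, in memory."""

    def __init__(self):
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def on_event(self, event):
        if event["type"] == "impression":
            self.impressions[event["ad"]] += 1
        elif event["type"] == "click":
            self.clicks[event["ad"]] += 1

    def ctr(self, ad):
        shown = self.impressions[ad]
        return self.clicks[ad] / shown if shown else 0.0


events = [
    {"type": "impression", "ad": "a1", "user_agent": "Mozilla/5.0"},
    {"type": "impression", "ad": "a1", "user_agent": "ExampleBot/1.0"},
    {"type": "click", "ad": "a1", "user_agent": "ExampleBot/1.0"},
    {"type": "impression", "ad": "a1", "user_agent": "Mozilla/5.0"},
    {"type": "click", "ad": "a1", "user_agent": "Mozilla/5.0"},
]

estimator = CtrEstimator()
for event in events:
    if not is_bot(event):          # stage 1: filter out bot traffic
        estimator.on_event(event)  # stage 2: update the counts

print(estimator.ctr("a1"))  # 1 click / 2 impressions = 0.5
```

Without the filter the estimate would be 2 clicks over 3 impressions, which is why the filter runs before the counts are updated.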




Coding an App




Research Areas: Systems

 Checkpointing strategies
 Replication strategies
 Dynamic load balancing
 Adaptive load management
 Query languages




Fault Tolerance

Problem                     Approaches                   S4
High Availability           ➔ Warm/hot failover          ➔ Warm failover
                            ➔ Cold failover              ➔ Standby nodes +
                                                           Apache ZooKeeper
State Loss                  ➔ Lossy checkpointing        ➔ Lossy checkpointing
(Crashes, system updates)   ➔ Lossless checkpointing
Low Latency                 ➔ Decouple stream            ➔ Asynchronous writes
                              processing from            ➔ Uncoordinated
                              checkpointing                checkpointing

Approach: checkpoints are count or time based, pluggable backend to
support any data store, lazy PE restore, tuning is application dependent.
Research by M. Morel, F. Junqueira, Yahoo! Research Europe, 2011.
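
The approach can be illustrated with a minimal sketch of count-based, lossy checkpointing with a pluggable backend and lazy restore. `DictStore`, `CheckpointedCounter`, and the `every` parameter are hypothetical names, not the S4 checkpointing API. For simplicity the write is synchronous here, whereas the slide describes asynchronous writes.

```python
class DictStore:
    """Hypothetical backend; any object with save/load could be plugged in."""

    def __init__(self):
        self.data = {}

    def save(self, key, state):
        self.data[key] = state

    def load(self, key):
        return self.data.get(key)


class CheckpointedCounter:
    """Counter PE that checkpoints its state every `every` events."""

    def __init__(self, key, store, every=3):
        self.key = key
        self.store = store
        self.every = every
        self.count = None           # state not loaded yet (lazy restore)
        self.since_checkpoint = 0

    def on_event(self):
        if self.count is None:
            # Lazy restore: read the checkpoint only when the first event arrives.
            self.count = self.store.load(self.key) or 0
        self.count += 1
        self.since_checkpoint += 1
        if self.since_checkpoint >= self.every:
            self.store.save(self.key, self.count)
            self.since_checkpoint = 0


store = DictStore()
pe = CheckpointedCounter("word:hello", store, every=3)
for _ in range(5):
    pe.on_event()

# Simulate a crash: the in-memory instance is lost, the store survives.
recovered = CheckpointedCounter("word:hello", store, every=3)
recovered.on_event()
print(pe.count, recovered.count)  # 5 4
```

The recovered instance resumes from the last checkpoint (3), so the two events processed after it are lost. That bounded loss is the trade-off behind lossy checkpointing, and the checkpoint interval is the application-dependent tuning knob.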

Resilience in a Distributed Word Count Task




Research Areas: Algorithms

 Self-adaptive models: adaptive language models using small
 amounts of data.
 Personalization: learn from user feedback (clicks, location,
 behavior) to deliver relevant information in real time.
 Trend detection: find personal Twitter trends relevant to you.
 Intrusion detection: summarize high level state of the network
 and detect unusual patterns.
 Sensor networks: large amounts of audio/video and other
 sources require processing, recognition, detection, and
 tracking. Detect events across sensors.




Personalized Search Ads

 Goal is to maximize:
  Revenue
  Click yield
  User experience

 By controlling:
  Ranking
  Pricing
  Filtering
  Placement

S. Schroedl, A. Kesari, and L. Neumeyer, “Personalized ad placement in web search,” in ADKDD ’10: Proceedings of the 4th Annual
International Workshop on Data Mining and Audience Intelligence for Online Advertising, 2010.

Personalized Search Ads

 Model ad click intent using recent user activity.
 More likely to click → show more North ads (ads placed above the search results).

 Example 1
  First query is digital slr camera
  Next query is canon slr
  More likely than average to click another ad

 Example 2
  Repeated query without previous clicks
  Less likely to click another ad

Personalized Search Ads

 Modeling user session

 Typical features:
   Number of searches/clicks by user past 24 hrs
   User COPC: Ratio of observed clicks to predicted clicks
   Identical query searched before / clicked before
   Time (seconds) since last search/click
   Similarity measures: current vs. previous queries

 Modeling technique: stochastic gradient boosted decision
 trees (GBDT)
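
A minimal sketch of how a few of these session features could be computed from a user's recent events. The event representation, the `session_features` function, and the choice to compute COPC over the same 24-hour window are assumptions made for illustration; `predicted_clicks` stands in for the output of a click model that is not shown.

```python
def session_features(events, now, predicted_clicks):
    """Compute session features from (timestamp_seconds, kind) pairs.

    kind is "search" or "click"; predicted_clicks is the number of clicks
    a (hypothetical) click model expected over the same window.
    """
    day_ago = now - 24 * 3600
    recent = [(t, kind) for t, kind in events if day_ago <= t <= now]
    searches = [t for t, kind in recent if kind == "search"]
    clicks = [t for t, kind in recent if kind == "click"]
    return {
        "searches_24h": len(searches),
        "clicks_24h": len(clicks),
        # COPC: observed clicks divided by predicted clicks.
        "copc": len(clicks) / predicted_clicks if predicted_clicks > 0 else None,
        "secs_since_last_search": now - max(searches) if searches else None,
    }


events = [(1000, "search"), (1030, "click"), (4600, "search")]
features = session_features(events, now=5000, predicted_clicks=0.5)
print(features)
```

A COPC above 1 means the user clicked more often than the model predicted. Here one observed click against 0.5 predicted clicks gives a COPC of 2.0.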

Personalized Search Ads


   Target
      P[CLICK | ad, query, user]

   Approximation
      P[CLICK | ad, query] * ucp[user, session]

      P[CLICK | ad, query]: non-personalized long-term model, computed using Hadoop
      ucp[user, session]: User Click Propensity (UCP) for user session, computed using S4
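
The approximation can be read as scaling a long-term, non-personalized click probability by a per-session multiplier. The sketch below works through one case. The function name, the example values, and the clamping to [0, 1] are assumptions; the slide does not say how out-of-range products are handled.

```python
def personalized_click_probability(p_click_ad_query, ucp):
    """Approximate P[CLICK | ad, query, user] as P[CLICK | ad, query] * ucp."""
    # Clamp so the product stays a valid probability (an assumption).
    return min(1.0, max(0.0, p_click_ad_query * ucp))


# Long-term model predicts a 4% click probability for this (ad, query) pair;
# the session model rates this user as 1.5x more likely to click than average.
p = personalized_click_probability(0.04, 1.5)
print(round(p, 4))
```

A `ucp` of 1.0 leaves the long-term estimate unchanged, a value above 1.0 raises it, and a value below 1.0 lowers it.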


Personalized Search Ads

 Results:

  We can reduce the average number of ads (ad footprint) by
  7% without decreasing click yield and revenue.

                - OR -

  For a given ad footprint we can increase click yield by
  ~2%.




Thank you!
 Join the Apache S4 project:

  s4-user-subscribe@incubator.apache.org

  s4-dev-subscribe@incubator.apache.org




Apache S4: A Distributed Stream Computing Platform for Real-Time Applications

  • 1. Apache S4: A Distributed Stream Computing Platform Presented at Stanford Infolab – Nov 4, 2011 http://incubator.apache.org/projects/s4 (migrating from http://s4.io) S4 Committers: {fpj, kishoreg, leoneu, mmorel, robbins}@apache.org Presented by Leo Neumeyer (@leoneu) 1
  • 2. About Me Born in Buenos Aires, Argentina, studied EE. School/Work in Canada (Signal Processing, Speech Coding). SRI Int'l (Menlo Park) Speech Lab, DARPA benchmarks, lab founded speech recognition spin-off Nuance Comm Inc. Mindstech: Startup to teach spoken English in Asia using web audio/video (before 2-way media was widely available). Yahoo! Labs: Search advertising (optimization, auctions). Quantbench: mission is to create a marketplace for data scientists, data providers, and investment funds. 2
  • 3. S4 Project History Started as a research project at Yahoo! Labs in August 2008 out of the need to personalize search ads in real-time. Open sourced in September 2009. Moved to Apache Incubator in October 2011. 3
  • 4. Motivation Online Parameter Personalized Search Twitter Trends Optimization given multiple event streams Predict Market Prices extract information Spam Filtering Automatic Trading using data driven models in real time with low latency Network Intrusion at scale Detection Sensor Networks It's Fun! 4
  • 5. S4 Architecture Node App App Server App App App PE Prototype App App PE Instance App App Stream App App Unlimited There is one Apps An app is a PE instances number of server process encapsulate graph are clones of nodes. Each per node. The units of work. composed of the prototype. node has one server They can PE prototypes They are process. loads/unloads consume and and streams associated with apps. produce event that produce, a unique key streams. consume, and and contain the transmit msgs. state. S4 is a general-purpose, real-time, distributed, decentralized, robust, scalable, event driven, pluggable platform that allows programmers to easily implement applications for processing continuous unbounded streams of data. 5
  • 6. Latency vs. Accuracy. Zero Errors: latency is unconstrained. Why? Reproducible results. Use: debug, train models. Real-Time: latency is constrained. Why? Limited control over inbound data rate and computing complexity. Use: process unstructured data, tolerance to small errors, graceful recovery from inbound data streams.
  • 7. Design: Actors programming model. Probabilistic thinking in both algorithms and systems. Run on commodity hardware. All in-memory, no disk bottlenecks. Pluggable (protocols, applications, serialization, etc.). Object-oriented design → POJOs. Static typing, no string literals, minimize type casting. Science friendly → constant change, ease of use.
  • 8. Programming Model. Example: estimate click-through rate in a web application after applying a filter to remove bot traffic.
  • 10. Research Areas: Systems. Checkpointing strategies. Replication strategies. Dynamic load balancing. Adaptive load management. Query languages.
  • 11. Fault Tolerance (columns: Problem | Approaches | S4). High Availability | warm/hot failover, cold failover | warm failover, standby nodes + Apache ZooKeeper. State Loss (crashes, system updates) | lossy checkpointing, lossless checkpointing | lossy checkpointing. Low Latency | decouple stream processing from checkpointing | asynchronous writes, uncoordinated checkpointing. Approach: checkpoints are count or time based, pluggable backend to support any data store, lazy PE restore, tuning is application dependent. Research by M. Morel, F. Junqueira, Yahoo! Research Europe, 2011.
  • 12. Resilience in a Distributed Word Count Task
  • 13. Research Areas: Algorithms. Self-adaptive models: adaptive language models using small amounts of data. Personalization: learn from user feedback (clicks, location, behavior) to deliver relevant information in real time. Trend detection: find personal Twitter trends relevant to you. Intrusion detection: summarize high-level state of the network and detect unusual patterns. Sensor networks: large amounts of audio/video and other sources require processing, recognition, detection, and tracking. Detect events across sensors.
  • 14. Personalized Search Ads. Goal is to maximize: revenue, click yield, user experience. By controlling: ranking, pricing, filtering, placement. S. Schroedl, A. Kesari, and L. Neumeyer, “Personalized ad placement in web search,” in ADKDD ’10: Proceedings of the 4th Annual International Workshop on Data Mining and Audience Intelligence for Online Advertising, 2010.
  • 15. Personalized Search Ads. Model ad click intent using recent user activity. More likely to click → show more North ads. Example 1: first query is "digital slr camera", next query is "canon slr"; more likely than average to click another ad. Example 2: repeated query without previous clicks; less likely to click another ad.
  • 16. Personalized Search Ads: Modeling user session. Typical features: number of searches/clicks by user in the past 24 hrs; user COPC (ratio of observed clicks to predicted clicks); identical query searched before / clicked before; time (seconds) since last search/click; similarity measures (current vs. previous queries). Modeling technique: stochastic gradient-descent boosted trees (GDBT).
  • 17. Personalized Search Ads. Target: P[CLICK|ad,query,user]. Approximation: P[CLICK|ad,query] * ucp[user,session]. First factor: non-personalized long-term model, computed using Hadoop. Second factor: User Click Propensity (UCP) for the user session, computed using S4.
  • 18. Personalized Search Ads. Results: We can reduce the average number of ads (ad footprint) by 7% without decreasing click yield and revenue. - OR - For a given ad footprint we can increase click yield by ~2%.
  • 19. Thank you! Join the Apache S4 project: s4-user-subscribe@incubator.apache.org s4-dev-subscribe@incubator.apache.org
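
The programming model example on slide 8 (click-through rate after a bot filter), combined with the keyed PE instances described on slide 5, can be sketched roughly as follows. This is a minimal illustration in Python, not the S4 API: the names `Event`, `CtrPE`, `KeyedStream`, `is_bot`, and `run` are all hypothetical, and the user-agent check is a toy filter assumed only for the example.

```python
import copy
from dataclasses import dataclass


@dataclass
class Event:
    kind: str        # "impression" or "click"
    ad_id: str       # key used to route the event
    user_agent: str  # used by the toy bot filter below


class CtrPE:
    """Hypothetical PE: keeps impression and click counts for one key (one ad)."""

    def __init__(self):
        self.impressions = 0
        self.clicks = 0

    def on_event(self, event):
        if event.kind == "impression":
            self.impressions += 1
        elif event.kind == "click":
            self.clicks += 1

    def ctr(self):
        return self.clicks / self.impressions if self.impressions else 0.0


class KeyedStream:
    """Routes each event to the PE instance for its key.

    The instance is cloned from the prototype the first time a key is seen,
    so each key owns its own state.
    """

    def __init__(self, prototype, key_fn):
        self.prototype = prototype
        self.key_fn = key_fn
        self.instances = {}

    def put(self, event):
        key = self.key_fn(event)
        if key not in self.instances:
            self.instances[key] = copy.deepcopy(self.prototype)
        self.instances[key].on_event(event)


def is_bot(event):
    # Toy filter, assumed for illustration only.
    return "bot" in event.user_agent.lower()


def run(events):
    """Filter bot traffic, then estimate CTR per ad."""
    stream = KeyedStream(CtrPE(), key_fn=lambda e: e.ad_id)
    for event in events:
        if not is_bot(event):
            stream.put(event)
    return {key: pe.ctr() for key, pe in stream.instances.items()}
```

In a distributed setting the key would also decide which node hosts the PE instance; here everything lives in one process to keep the sketch short.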
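
The checkpointing approach on slide 11 (count-based checkpoints, asynchronous writes, pluggable backend, lazy PE restore) can be illustrated with a small word-count sketch in the spirit of slide 12. This is an assumption-laden toy in Python, not S4 code: `DictBackend`, `Node`, and the `save`/`load` interface are hypothetical names, and an in-memory dict stands in for a real data store.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


class DictBackend:
    """Stand-in for a pluggable data store (hypothetical save/load interface)."""

    def __init__(self):
        self.data = {}

    def save(self, key, blob):
        self.data[key] = blob

    def load(self, key):
        return self.data.get(key)


class Node:
    """Counts events per key and checkpoints each key independently."""

    def __init__(self, backend, every=3):
        self.backend = backend
        self.every = every  # count-based: checkpoint every N events per key
        self.executor = ThreadPoolExecutor(max_workers=1)
        self.counts = {}    # key -> in-memory state

    def on_event(self, key):
        if key not in self.counts:
            # Lazy restore: state is loaded only when the key is first seen.
            blob = self.backend.load(key)
            self.counts[key] = pickle.loads(blob) if blob is not None else 0
        self.counts[key] += 1
        if self.counts[key] % self.every == 0:
            # Serialize now, write in the background so event processing
            # does not wait for the store.
            blob = pickle.dumps(self.counts[key])
            self.executor.submit(self.backend.save, key, blob)

    def shutdown(self):
        self.executor.shutdown(wait=True)
```

Each key is checkpointed on its own schedule with no coordination across keys, and anything processed after the last checkpoint is lost on a crash, which is the lossy trade-off the slide describes. For example, after 7 events with `every=3`, a replacement node restores a count of 6.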