A Real-Time Search Engine with Lucene and S4
Yahoo! S4 applied to Information Retrieval




 2/5/2011                                    Michaël Figuière
Speaker

      @mfiguiere
      blog.xebia.fr



      Michaël Figuière           Distributed
                                 Architectures

                         NoSQL
Search Engines
Our case study




      A Search Engine to keep track of activities
                within an enterprise
The Problem
A Search Engine


                  Search
A Search Engine


            MyCustomer   Search
A Search Engine


                      MyCustomer                               Search




   Document     Non Disclosure Agreement                                          12 days ago
                   ... MyCustomer agrees not to disclose any part of ...



   Document     2010 Sales Report                                                 1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        2 days ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
Indexing Pipeline



                    Tika


       PDF
                  Text
                            Analyzer
                Extractor
                                                Search
                                                 Index
                            Analyzer
      Phone
       Call



                                       Lucene
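
A minimal sketch of the pipeline above, in plain Python rather than Tika and Lucene: each source type gets its own text extraction step, then a shared analyzer tokenizes the text and feeds an in-memory inverted index. The names `extract_text`, `analyze` and `InvertedIndex` are hypothetical stand-ins, not Tika or Lucene APIs.

```python
import re
from collections import defaultdict

def extract_text(source_type, payload):
    """Stand-in for the text extractor: turn a raw item into plain text."""
    if source_type == "phone_call":
        # Structured record: concatenate the fields worth searching.
        return " ".join(str(payload[k]) for k in ("customer", "description"))
    # Assumption: documents (e.g. PDFs) have already been converted to text.
    return payload

def analyze(text):
    """Stand-in for an analyzer: lowercase and split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]

class InvertedIndex:
    """Maps each term to the set of document ids that contain it."""
    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, tokens):
        for tok in tokens:
            self.postings[tok].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

index = InvertedIndex()
index.add("doc-1", analyze(extract_text("pdf", "MyCustomer agrees not to disclose any part of ...")))
index.add("call-1", analyze(extract_text("phone_call", {
    "customer": "MyCustomer",
    "description": "Invoice not received for order #2354E",
})))
print(sorted(index.search("mycustomer")))  # both items mention the customer
```

The two sources only differ in the extraction step; everything after it is shared, which is what lets new source types be added without touching the index.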
A more complex Search Engine


                      MyCustomer                               Search

                    Sales                   Legal                     Accounting




   Document     2010 Sales Report                                                 1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        2 days ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
Indexing Pipeline



            Tika       Mahout


 PDF
             Text
                       Classifier   Analyzer
           Extractor
                                                       Search
                                                        Index
                       Classifier   Analyzer
 Phone
  Call



                                              Lucene
More complex ...

• Entity Recognition
         Recognizes an entity however it is written



• Language Recognition
         To index each language separately



• Fetching linked URLs
         Enhances document context by also indexing linked URLs



• ...
A Real-Time Search Engine


                      MyCustomer                               Search

                    Sales                   Legal                     Accounting




   Document     2010 Sales Report                                                  1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        3 seconds ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
Indexing Pipeline


                                         Since Lucene 2.9



 PDF
          Text          Some
                                     Analyzer
        Extractor   Pre-Processing
                                                  Near Real-Time
                                                   Search Index
                        Some
                                     Analyzer
Phone               Pre-Processing
 Call
But...




 PDF
           Text          Some
                                         Analyzer
         Extractor   Pre-Processing
                                                    Near Real-Time
                                                     Search Index
                         Some
                                         Analyzer
Phone                Pre-Processing
 Call


                                      What if it takes
                                      one second per document
                                      on a single box?
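
To make the question concrete, here is a back-of-the-envelope calculation: at one second of pre-processing per document, a single sequential worker caps out at 86,400 documents per day, and the number of workers needed grows linearly with the arrival rate. The arrival rate of 25 documents per second is a made-up figure for illustration.

```python
import math

def max_docs_per_day(seconds_per_doc, workers=1):
    """Upper bound on daily throughput when each worker handles one document at a time."""
    return workers * 86_400 / seconds_per_doc

def workers_needed(docs_per_second, seconds_per_doc):
    """Minimum number of parallel workers needed to keep up with the arrival rate."""
    return math.ceil(docs_per_second * seconds_per_doc)

print(max_docs_per_day(1.0))    # 86400.0
print(workers_needed(25, 1.0))  # 25 (hypothetical arrival rate of 25 docs/s)
```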
Let’s distribute it

                 Server 1                    Server 3

            Pre-            Search      Pre-            Search
         Processing          Index   Processing          Index



                 Server 2                    Server N

            Pre-            Search
         Processing          Index




  Processing logic and index structure distributed together
That’s a problem...

• Processing and index storage may have different scaling needs
        Depending on the search traffic, the processing overhead, ...



• Scaling index storage up and down is slow and complex
        Whereas stateless processing is simple to scale up/down



• Expensive pre-processing may make searches slower
        And indexing in real time shouldn’t make searches slower!
Let’s move it to Hadoop




 PDF
          Text          Some
                                     Analyzer
        Extractor   Pre-Processing
                                                Near Real-Time
                                                 Search Index
                        Some
                                     Analyzer
Phone               Pre-Processing
 Call




                                     Hadoop MapReduce
But...

• Hadoop can only deal with chunks of data
         Data must be available somewhere on HDFS



• An unbounded stream of data doesn’t fit into Hadoop MapReduce
         Hadoop is designed and optimized for batch processing



• Manually bounding the stream won’t be efficient
         It would result in many frequent, inefficient batches
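
A rough sketch of why bounding the stream by hand hurts: if events are collected into fixed windows and each window is run as a batch job, an event waits for its window to close and then for the job to finish. The window length and per-job overhead below are illustrative assumptions, not measured Hadoop figures.

```python
def batch_latency(arrival_time, window, job_overhead):
    """Seconds between an event's arrival and the end of the batch job that processes it.

    Events are grouped into fixed windows [k*window, (k+1)*window); the job
    for a window starts when the window closes and takes job_overhead seconds.
    """
    window_end = (int(arrival_time // window) + 1) * window
    return window_end + job_overhead - arrival_time

# Hypothetical numbers: 60 s windows, 30 s of job startup and processing.
arrivals = [0, 15, 30, 45, 59]
latencies = [batch_latency(t, window=60, job_overhead=30) for t in arrivals]
print(latencies)                        # [90, 75, 60, 45, 31]
print(sum(latencies) / len(latencies))  # 60.2
```

Shrinking the window lowers the waiting time but multiplies the number of jobs, so the fixed overhead per job ends up dominating.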
S4
S4

• A distributed, fault-tolerant, stream processing system



• Elastic

            Based on Zookeeper



• Project started in November 2010, still experimental

            But things are moving fast!
Where does S4 come from?

• Open Source project created by Yahoo!




• Initially built for relevant ad selection and clever positioning on webpages
         But designed to be generic enough



Processing Element

                                  Your business
                                  logic goes here

                     Processing
                      Element



      Events Input                Events Output
Processing Node




                        Processing Node

           Processing     Processing      Processing
           Element 1      Element 2       Element N
S4 Cluster


                                      Cluster
                                      Management
             Processing Node 1


   Events
             Processing Node 2   Zookeeper
   Stream


             Processing Node N
Programming model


                                  PhoneCallPE

                               Accepts events with:
                                 Type=PhoneCall
           Event               KeyTuple: Id=15497              Event
      Type: PhoneCall                                 Type: EnrichedPhoneCall

   KeyTuple: «Id=15497»                                KeyTuple: «Id=15497»

  Value: <serialized object>                          Value: <serialized object>


                                                A new Processing
                                                Element instance is created
                                                for each value of «Id»
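
The keyed dispatch described above can be sketched in a few lines. This is not the S4 API: `Event`, `PhoneCallPE` and `Dispatcher` are hypothetical names, and the sketch only mirrors the idea that a Processing Element declares the event type and key it accepts, and that one instance is created lazily for each distinct key value.

```python
from dataclasses import dataclass

@dataclass
class Event:
    type: str
    key: tuple      # e.g. ("Id", 15497)
    value: object

class PhoneCallPE:
    """Hypothetical processing element: enriches PhoneCall events."""
    accepts_type = "PhoneCall"
    accepts_key = "Id"

    def __init__(self, key_value):
        self.key_value = key_value
        self.seen = 0   # per-key state lives in the instance

    def process(self, event):
        self.seen += 1
        enriched = dict(event.value, events_seen_for_key=self.seen)
        return Event("EnrichedPhoneCall", event.key, enriched)

class Dispatcher:
    """Routes events to one PE instance per distinct key value."""
    def __init__(self, pe_class):
        self.pe_class = pe_class
        self.instances = {}

    def dispatch(self, event):
        name, value = event.key
        if event.type != self.pe_class.accepts_type or name != self.pe_class.accepts_key:
            return None  # not for this PE
        if value not in self.instances:
            self.instances[value] = self.pe_class(value)  # created on first use
        return self.instances[value].process(event)

dispatcher = Dispatcher(PhoneCallPE)
out = dispatcher.dispatch(Event("PhoneCall", ("Id", 15497), {"customer": "MyCustomer"}))
print(out.type, out.key)  # EnrichedPhoneCall ('Id', 15497)
```

Because state is held per instance, anything accumulated for one `Id` never mixes with another, which is what makes partitioning by key across nodes safe.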
An indexing pipeline with S4

               ReRoutingPE
                                                  Handles incoming events
                                                  and load-balances them
                                                  according to partitioning
 TextExtractionPE              TextExtractionPE




               ReRoutingPE




 ClassificationPE               ClassificationPE




                   MergingPE
An indexing pipeline with S4

               ReRoutingPE




 TextExtractionPE              TextExtractionPE


                                                  Handles result events
               ReRoutingPE                        and load-balances them between
                                                  Processing Nodes

 ClassificationPE               ClassificationPE




                   MergingPE
An indexing pipeline with S4

               ReRoutingPE




 TextExtractionPE              TextExtractionPE




               ReRoutingPE




 ClassificationPE               ClassificationPE

                                                  Handles final result
                                                  events and pushes
                   MergingPE
                                                  them to the Indexer
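
A sketch of the three stages, under assumptions: routing is modeled as a stable hash of the event id modulo the number of workers, and the text extraction and classification steps are trivial stand-ins. None of the names below come from S4; they only mirror the boxes on the slides.

```python
import zlib

def route(event_id, n_workers):
    """ReRoutingPE stand-in: pick a worker with a stable hash of the event id."""
    return zlib.crc32(str(event_id).encode("utf-8")) % n_workers

def text_extraction(event):
    """TextExtractionPE stand-in: the raw payload is assumed to be text already."""
    return dict(event, text=event["raw"].strip())

def classification(event):
    """ClassificationPE stand-in: a keyword rule instead of a trained classifier."""
    category = "Accounting" if "invoice" in event["text"].lower() else "Sales"
    return dict(event, category=category)

def run_pipeline(events, n_workers=2):
    """Push each event through both stages and merge the results for the indexer."""
    merged = []  # MergingPE stand-in: collects final events
    for event in events:
        worker = route(event["id"], n_workers)   # first ReRoutingPE
        extracted = dict(text_extraction(event), extracted_on=worker)
        worker = route(event["id"], n_workers)   # second ReRoutingPE
        classified = dict(classification(extracted), classified_on=worker)
        merged.append(classified)
    return merged

results = run_pipeline([
    {"id": 15497, "raw": " Invoice not received for order #2354E "},
    {"id": 15498, "raw": "2010 Sales Report"},
])
for r in results:
    print(r["id"], r["category"], r["extracted_on"], r["classified_on"])
```

In a real cluster each stage would run on separate nodes; here the worker index is only recorded on the event to show where it would have been sent.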
Some drawbacks

• The system is lossy
         Events may be lost when nodes are overloaded or during failures



• A workaround is to increase the size of the nodes' incoming queues
         But still, events may be lost during failures



• Still experimental
         But very promising
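
The trade-off can be illustrated with a bounded queue that drops events when full. This is a toy model, not S4's actual queueing code: a larger capacity absorbs longer bursts, but anything still queued is lost if the node fails.

```python
from collections import deque

class LossyQueue:
    """Bounded FIFO that drops new events once capacity is reached."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0

    def offer(self, event):
        if len(self.items) >= self.capacity:
            self.dropped += 1      # overload: the event is lost
            return False
        self.items.append(event)
        return True

    def poll(self):
        return self.items.popleft() if self.items else None

def simulate(capacity, burst=10, processed_per_tick=1, ticks=5):
    """Each tick, `burst` events arrive and `processed_per_tick` are consumed."""
    queue = LossyQueue(capacity)
    for tick in range(ticks):
        for i in range(burst):
            queue.offer((tick, i))
        for _ in range(processed_per_tick):
            queue.poll()
    return queue.dropped

print(simulate(capacity=5))    # 41
print(simulate(capacity=100))  # 0
```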
More: Real-Time Inverted Search


                      MyCustomer                               Search

                    Sales                   Legal                     Accounting


                                     20 new results...


   Document     2010 Sales Report                                                  1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        3 seconds ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
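
Inverted search flips the usual roles: queries are stored, and each incoming document is matched against them so that an open results page can show a counter such as "20 new results". A minimal sketch, assuming conjunctive keyword queries and a hypothetical `StandingQueries` class:

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumerics, returning a set of terms."""
    return {tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok}

class StandingQueries:
    """Stores registered queries and counts new matching documents for each."""
    def __init__(self):
        self.queries = {}                   # query id -> set of required terms
        self.new_results = defaultdict(int)

    def register(self, query_id, query_text):
        self.queries[query_id] = tokenize(query_text)

    def on_document(self, text):
        """Match one freshly indexed document against every stored query."""
        terms = tokenize(text)
        matched = [qid for qid, required in self.queries.items() if required <= terms]
        for qid in matched:
            self.new_results[qid] += 1
        return matched

standing = StandingQueries()
standing.register("q1", "MyCustomer")
standing.register("q2", "MyCustomer invoice")
standing.on_document("Customer: MyCustomer. Invoice not received for order #2354E")
standing.on_document("2010 Sales Report ... MyCustomer: 12 M with 3 deals")
print(dict(standing.new_results))  # {'q1': 2, 'q2': 1}
```

Scanning every stored query per document is the simplest approach; it is only meant to show the direction of the matching, not how to make it scale.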
Summary

• S4 is a nice processing system for real-time search
         Though events may be lost when nodes are overloaded or during failures



• Not only at indexing time, but also at query time!
         As S4 ensures low latency, query-time processing is possible



• A promising roadmap...
         Better failure handling, client API in major languages,
         initial processing with Hadoop, ...
Questions / Answers




                       ?
                      blog.xebia.fr
                      @mfiguiere

More Related Content

Similar to FOSDEM (feb 2011) - A real-time search engine with Lucene and S4

Fishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick Start
Fishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick StartFishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick Start
Fishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick StartKim Negaard
 
Implementing Big Data at the Speed of Business
Implementing Big Data at the Speed of BusinessImplementing Big Data at the Speed of Business
Implementing Big Data at the Speed of BusinessDataWorks Summit
 
Nosql Now 2012: MongoDB Use Cases
Nosql Now 2012: MongoDB Use CasesNosql Now 2012: MongoDB Use Cases
Nosql Now 2012: MongoDB Use CasesMongoDB
 
Time Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sTime Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sInside Analysis
 
A Trifecta of Real-Time Applications: Apache Kafka, Flink, and Druid
A Trifecta of Real-Time Applications: Apache Kafka, Flink, and DruidA Trifecta of Real-Time Applications: Apache Kafka, Flink, and Druid
A Trifecta of Real-Time Applications: Apache Kafka, Flink, and DruidHostedbyConfluent
 
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...Soroosh Khodami
 
IPexcel - Company Overview
IPexcel - Company OverviewIPexcel - Company Overview
IPexcel - Company OverviewIPexcel
 
How Does the Denodo Platform Accelerate Your Time to Insights?
How Does the Denodo Platform Accelerate Your Time to Insights?How Does the Denodo Platform Accelerate Your Time to Insights?
How Does the Denodo Platform Accelerate Your Time to Insights?Denodo
 
Microsoft StreamInsight
Microsoft StreamInsight Microsoft StreamInsight
Microsoft StreamInsight Mark Ginnebaugh
 
ABBYY USA TAWPI presentation
ABBYY USA TAWPI presentationABBYY USA TAWPI presentation
ABBYY USA TAWPI presentationABBYY
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Digital Transformation Mindset - More Than Just Technology
Digital Transformation Mindset - More Than Just TechnologyDigital Transformation Mindset - More Than Just Technology
Digital Transformation Mindset - More Than Just Technologyconfluent
 
Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022
Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022
Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022InfluxData
 
Albel Pres Continuous Intelligence Overview
Albel Pres   Continuous Intelligence OverviewAlbel Pres   Continuous Intelligence Overview
Albel Pres Continuous Intelligence OverviewAli BELCAID
 
Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?
Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?
Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?Denodo
 
Building Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineBuilding Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineTrieu Nguyen
 
Global automation domination: how do you roll out one workflow solution acros...
Global automation domination: how do you roll out one workflow solution acros...Global automation domination: how do you roll out one workflow solution acros...
Global automation domination: how do you roll out one workflow solution acros...sharedserviceslink.com
 
Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...
Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...
Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...Looker
 
EvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics PlatformEvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics PlatformSergei Dolukhanov
 
EvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics PlatformEvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics PlatformSergei Dolukhanov
 

Similar to FOSDEM (feb 2011) - A real-time search engine with Lucene and S4 (20)

Fishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick Start
Fishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick StartFishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick Start
Fishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick Start
 
Implementing Big Data at the Speed of Business
Implementing Big Data at the Speed of BusinessImplementing Big Data at the Speed of Business
Implementing Big Data at the Speed of Business
 
Nosql Now 2012: MongoDB Use Cases
Nosql Now 2012: MongoDB Use CasesNosql Now 2012: MongoDB Use Cases
Nosql Now 2012: MongoDB Use Cases
 
Time Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sTime Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today's
 
A Trifecta of Real-Time Applications: Apache Kafka, Flink, and Druid
A Trifecta of Real-Time Applications: Apache Kafka, Flink, and DruidA Trifecta of Real-Time Applications: Apache Kafka, Flink, and Druid
A Trifecta of Real-Time Applications: Apache Kafka, Flink, and Druid
 
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
 
IPexcel - Company Overview
IPexcel - Company OverviewIPexcel - Company Overview
IPexcel - Company Overview
 
How Does the Denodo Platform Accelerate Your Time to Insights?
How Does the Denodo Platform Accelerate Your Time to Insights?How Does the Denodo Platform Accelerate Your Time to Insights?
How Does the Denodo Platform Accelerate Your Time to Insights?
 
Microsoft StreamInsight
Microsoft StreamInsight Microsoft StreamInsight
Microsoft StreamInsight
 
ABBYY USA TAWPI presentation
ABBYY USA TAWPI presentationABBYY USA TAWPI presentation
ABBYY USA TAWPI presentation
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Digital Transformation Mindset - More Than Just Technology
Digital Transformation Mindset - More Than Just TechnologyDigital Transformation Mindset - More Than Just Technology
Digital Transformation Mindset - More Than Just Technology
 
Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022
Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022
Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022
 
Albel Pres Continuous Intelligence Overview
Albel Pres   Continuous Intelligence OverviewAlbel Pres   Continuous Intelligence Overview
Albel Pres Continuous Intelligence Overview
 
Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?
Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?
Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?
 
Building Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineBuilding Reactive Real-time Data Pipeline
Building Reactive Real-time Data Pipeline
 
Global automation domination: how do you roll out one workflow solution acros...
Global automation domination: how do you roll out one workflow solution acros...Global automation domination: how do you roll out one workflow solution acros...
Global automation domination: how do you roll out one workflow solution acros...
 
Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...
Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...
Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...
 
EvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics PlatformEvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics Platform
 
EvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics PlatformEvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics Platform
 

More from Michaël Figuière

EclipseCon - Building an IDE for Apache Cassandra
EclipseCon - Building an IDE for Apache CassandraEclipseCon - Building an IDE for Apache Cassandra
EclipseCon - Building an IDE for Apache CassandraMichaël Figuière
 
Paris Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersParis Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersMichaël Figuière
 
YaJug - Cassandra for Java Developers
YaJug - Cassandra for Java DevelopersYaJug - Cassandra for Java Developers
YaJug - Cassandra for Java DevelopersMichaël Figuière
 
Geneva JUG - Cassandra for Java Developers
Geneva JUG - Cassandra for Java DevelopersGeneva JUG - Cassandra for Java Developers
Geneva JUG - Cassandra for Java DevelopersMichaël Figuière
 
Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!Michaël Figuière
 
NYC* Tech Day - New Cassandra Drivers in Depth
NYC* Tech Day - New Cassandra Drivers in DepthNYC* Tech Day - New Cassandra Drivers in Depth
NYC* Tech Day - New Cassandra Drivers in DepthMichaël Figuière
 
Paris Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra DriversParis Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra DriversMichaël Figuière
 
ApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
ApacheCon Europe 2012 - Real Time Big Data in practice with CassandraApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
ApacheCon Europe 2012 - Real Time Big Data in practice with CassandraMichaël Figuière
 
NoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
NoSQL Matters 2012 - Real Time Big Data in practice with CassandraNoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
NoSQL Matters 2012 - Real Time Big Data in practice with CassandraMichaël Figuière
 
GTUG Nantes (Dec 2011) - BigTable et NoSQL
GTUG Nantes (Dec 2011) - BigTable et NoSQLGTUG Nantes (Dec 2011) - BigTable et NoSQL
GTUG Nantes (Dec 2011) - BigTable et NoSQLMichaël Figuière
 
Duchess France (Nov 2011) - Atelier Apache Mahout
Duchess France (Nov 2011) - Atelier Apache MahoutDuchess France (Nov 2011) - Atelier Apache Mahout
Duchess France (Nov 2011) - Atelier Apache MahoutMichaël Figuière
 
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...Michaël Figuière
 
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec CassandraBreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec CassandraMichaël Figuière
 
Mix-IT (Apr 2011) - Intelligence Collective avec Apache Mahout
Mix-IT (Apr 2011) - Intelligence Collective avec Apache MahoutMix-IT (Apr 2011) - Intelligence Collective avec Apache Mahout
Mix-IT (Apr 2011) - Intelligence Collective avec Apache MahoutMichaël Figuière
 
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux EntreprisesBreizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux EntreprisesMichaël Figuière
 
Xebia Knowledge Exchange (feb 2011) - Large Scale Web Development
Xebia Knowledge Exchange (feb 2011) - Large Scale Web DevelopmentXebia Knowledge Exchange (feb 2011) - Large Scale Web Development
Xebia Knowledge Exchange (feb 2011) - Large Scale Web DevelopmentMichaël Figuière
 
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...Michaël Figuière
 
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprisesLorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprisesMichaël Figuière
 
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprisesTours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprisesMichaël Figuière
 

More from Michaël Figuière (20)

EclipseCon - Building an IDE for Apache Cassandra
EclipseCon - Building an IDE for Apache CassandraEclipseCon - Building an IDE for Apache Cassandra
EclipseCon - Building an IDE for Apache Cassandra
 
Paris Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersParis Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for Developers
 
YaJug - Cassandra for Java Developers
YaJug - Cassandra for Java DevelopersYaJug - Cassandra for Java Developers
YaJug - Cassandra for Java Developers
 
Geneva JUG - Cassandra for Java Developers
Geneva JUG - Cassandra for Java DevelopersGeneva JUG - Cassandra for Java Developers
Geneva JUG - Cassandra for Java Developers
 
ChtiJUG - Cassandra 2.0
ChtiJUG - Cassandra 2.0ChtiJUG - Cassandra 2.0
ChtiJUG - Cassandra 2.0
 
Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!
 
NYC* Tech Day - New Cassandra Drivers in Depth
NYC* Tech Day - New Cassandra Drivers in DepthNYC* Tech Day - New Cassandra Drivers in Depth
NYC* Tech Day - New Cassandra Drivers in Depth
 
Paris Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra DriversParis Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra Drivers
 
ApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
ApacheCon Europe 2012 - Real Time Big Data in practice with CassandraApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
ApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
 
NoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
NoSQL Matters 2012 - Real Time Big Data in practice with CassandraNoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
NoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
 
GTUG Nantes (Dec 2011) - BigTable et NoSQL
GTUG Nantes (Dec 2011) - BigTable et NoSQLGTUG Nantes (Dec 2011) - BigTable et NoSQL
GTUG Nantes (Dec 2011) - BigTable et NoSQL
 
Duchess France (Nov 2011) - Atelier Apache Mahout
Duchess France (Nov 2011) - Atelier Apache MahoutDuchess France (Nov 2011) - Atelier Apache Mahout
Duchess France (Nov 2011) - Atelier Apache Mahout
 
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
 
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec CassandraBreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
 
Mix-IT (Apr 2011) - Intelligence Collective avec Apache Mahout
Mix-IT (Apr 2011) - Intelligence Collective avec Apache MahoutMix-IT (Apr 2011) - Intelligence Collective avec Apache Mahout
Mix-IT (Apr 2011) - Intelligence Collective avec Apache Mahout
 
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux EntreprisesBreizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
 
Xebia Knowledge Exchange (feb 2011) - Large Scale Web Development
Xebia Knowledge Exchange (feb 2011) - Large Scale Web DevelopmentXebia Knowledge Exchange (feb 2011) - Large Scale Web Development
Xebia Knowledge Exchange (feb 2011) - Large Scale Web Development
 
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
 
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprisesLorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
 
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprisesTours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
 

FOSDEM (feb 2011) - A real-time search engine with Lucene and S4

  • 1. A Real-Time Search Engine with Lucene and S4 Yahoo! S4 applied to Information Retrieval 2/5/2011 Michaël Figuière
  • 2. Speaker @mfiguiere blog.xebia.fr Michaël Figuière Distributed Architectures NoSQL Search Engines
  • 3. Our case study A Search Engine to keep track of activities within an enterprise
  • 6. A Search Engine MyCustomer Search
  • 7. A Search Engine MyCustomer Search Document Non Disclosure Agreement 12 days ago ... MyCustomer agrees not to disclose any part of ... Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 2 days ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  • 8. Indexing Pipeline: PDF → Text Extractor (Tika) → Analyzer → Search Index; Phone Call → Analyzer → Search Index (analyzers and index provided by Lucene)
  • 9. A more complex Search Engine MyCustomer Search Sales Juridic Accounting Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 2 days ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  • 10. Indexing Pipeline: PDF → Text Extractor (Tika) → Classifier (Mahout) → Analyzer → Search Index; Phone Call → Classifier (Mahout) → Analyzer → Search Index (analyzers and index provided by Lucene)
  • 11. More complex ... • Entity Recognition: recognizes an entity however it is written • Language Recognition: to index each language separately • Fetching linked URLs: enhances document context by also indexing linked URLs • ...
  • 12. A Real-Time Search Engine MyCustomer Search Sales Juridic Accounting Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 3 seconds ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  • 13. A Real-Time Search Engine MyCustomer Search Sales Juridic Accounting Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 3 seconds ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  • 14. Indexing Pipeline (Near Real-Time Search, since Lucene 2.9): PDF → Text Extractor → Some Pre-Processing → Analyzer → Near Real-Time Search Index; Phone Call → Some Pre-Processing → Analyzer → Near Real-Time Search Index
  • 15. But... PDF → Text Extractor → Some Pre-Processing → Analyzer → Near Real-Time Search Index; Phone Call → Some Pre-Processing → Analyzer → Near Real-Time Search Index. What if it takes one second per document on a single box?
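To make the one-second-per-document concern concrete, here is a rough back-of-the-envelope sketch. The function name `boxes_needed` and the arrival rates are illustrative assumptions, not figures from the talk; only the one second per document comes from the slide.

```python
import math

def boxes_needed(docs_per_second: float, seconds_per_doc: float = 1.0) -> int:
    """Minimum number of boxes so that pre-processing keeps up with arrivals.

    Hypothetical helper: assumes each box processes one document at a time
    and that the work spreads evenly across boxes.
    """
    per_box_rate = 1.0 / seconds_per_doc  # documents per second per box
    return math.ceil(docs_per_second / per_box_rate)

# A single box at one second per document tops out at 86,400 documents a day.
docs_per_day_single_box = int(24 * 60 * 60 / 1.0)
```

Under these assumptions, an incoming stream of 50 documents per second would already need 50 boxes dedicated to pre-processing alone.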
  • 16. Let’s distribute it: Server 1, Server 2, Server 3, ..., Server N each run Pre-Processing alongside a Search Index. Processing logic and index structure are distributed together
  • 17. That’s a problem... • Processing and index storage may have different scaling needs, depending on the search traffic, the processing overhead, ... • Scaling index storage up and down is slow and complex, whereas stateless processing is simple to scale up/down • Expensive pre-processing may make searches slower, and indexing in real time shouldn’t make searches slower!
  • 18. Let’s move it to Hadoop: PDF → Text Extractor → Some Pre-Processing → Analyzer → Near Real-Time Search Index; Phone Call → Some Pre-Processing → Analyzer → Near Real-Time Search Index, with the processing run on Hadoop MapReduce
  • 19. But... • Hadoop can only deal with chunks of data: the data must be available somewhere on HDFS • An unbounded stream of data can’t fit into Hadoop MapReduce: Hadoop is designed and optimized for batch processing • Manually bounding the stream won’t be efficient: it will result in a lot of frequent, inefficient batches
  • 20. S4
  • 21. S4 • A distributed, fault-tolerant stream processing system • Elastic: based on ZooKeeper • Project started in November 2010, still experimental, but things are moving fast!
  • 22. Where does S4 come from? • Open Source project created by Yahoo! • Initially built for relevant ad selection and clever positioning on webpages, but designed to be generic enough
  • 23. Processing Element: Input Events → Processing Element (your business logic goes here) → Output Events
  • 24. Processing Node: a Processing Node hosts Processing Element 1, Processing Element 2, ..., Processing Element N
  • 25. S4 Cluster: an Events Stream feeds Processing Node 1, Processing Node 2, ..., Processing Node N; Cluster Management relies on ZooKeeper
  • 26. Programming model: PhoneCallPE accepts events with Type=PhoneCall and KeyTuple Id=15497. Input Event: Type: PhoneCall, KeyTuple: «Id=15497», Value: <serialized object>. Output Event: Type: EnrichedPhoneCall, KeyTuple: «Id=15497», Value: <serialized object>. A new Processing Element instance is created for each value of «Id»
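The keyed programming model described on this slide can be illustrated with a small simulation. This is not the S4 API: `Event`, `PhoneCallPE`, `Dispatcher` and the `events_seen_for_key` field are hypothetical names used only to show how one Processing Element instance per key value might be created and reused.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

@dataclass
class Event:
    type: str             # e.g. "PhoneCall"
    key: Tuple[str, Any]  # e.g. ("Id", 15497)
    value: Dict[str, Any]

class PhoneCallPE:
    """Hypothetical PE holding per-key state; one instance per key value."""
    accepted_type = "PhoneCall"

    def __init__(self, key: Tuple[str, Any]) -> None:
        self.key = key
        self.seen = 0  # state private to this key

    def process(self, event: Event) -> Event:
        self.seen += 1
        enriched = dict(event.value, events_seen_for_key=self.seen)
        # The output event keeps the key and changes the type.
        return Event("EnrichedPhoneCall", event.key, enriched)

class Dispatcher:
    """Routes events to the PE instance owning their key, creating it on demand."""

    def __init__(self, pe_class) -> None:
        self.pe_class = pe_class
        self.instances: Dict[Tuple[str, Any], Any] = {}

    def dispatch(self, event: Event) -> Optional[Event]:
        if event.type != self.pe_class.accepted_type:
            return None  # this PE does not accept the event type
        pe = self.instances.get(event.key)
        if pe is None:
            pe = self.pe_class(event.key)  # new instance for a new key value
            self.instances[event.key] = pe
        return pe.process(event)
```

Dispatching two events with «Id=15497» and one with another Id would leave two `PhoneCallPE` instances in `Dispatcher.instances`, each with its own counter.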
  • 27. An indexing pipeline with S4: ReRoutingPE → TextExtractionPE (several instances) → ReRoutingPE → ClassificationPE (several instances) → MergingPE. The first ReRoutingPE handles incoming events and load-balances them according to partitioning
  • 28. An indexing pipeline with S4: ReRoutingPE → TextExtractionPE (several instances) → ReRoutingPE → ClassificationPE (several instances) → MergingPE. The second ReRoutingPE handles result events and load-balances them between Processing Nodes
  • 29. An indexing pipeline with S4: ReRoutingPE → TextExtractionPE (several instances) → ReRoutingPE → ClassificationPE (several instances) → MergingPE. The MergingPE handles final result events and pushes them to the Indexer
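The stage order of this pipeline can be sketched in a single process as follows. Everything here is an illustrative assumption rather than S4 code: the CRC32-based `partition` function, the whitespace-only text extraction and the keyword rules standing in for a trained classifier are placeholders chosen to keep the sketch self-contained.

```python
import zlib

def partition(key: str, node_count: int) -> int:
    """Stable partitioning: the same key always maps to the same node."""
    return zlib.crc32(key.encode("utf-8")) % node_count

def text_extraction_pe(event: dict) -> dict:
    # Stand-in for a real text extractor: only normalizes whitespace.
    return {**event, "text": " ".join(event["raw"].split())}

def classification_pe(event: dict) -> dict:
    # Toy keyword rules standing in for a trained classifier.
    text = event["text"].lower()
    if "invoice" in text:
        category = "Accounting"
    elif "deal" in text:
        category = "Sales"
    else:
        category = "Other"
    return {**event, "category": category}

def run_pipeline(events, node_count=3):
    index = []  # stand-in for the indexer fed by the merging stage
    for event in events:
        # First re-routing step: choose the node that extracts the text.
        extraction_node = partition(event["id"], node_count)
        extracted = text_extraction_pe(event)
        # Second re-routing step: choose the node that classifies the result.
        classification_node = partition(event["id"] + ":classify", node_count)
        classified = classification_pe(extracted)
        # Merging step: push the final result event to the indexer.
        index.append({**classified, "route": (extraction_node, classification_node)})
    return index
```

In a real deployment each stage would run on separate Processing Nodes and exchange events over the network; the sketch only keeps the ordering and the partitioning decision.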
  • 30. Some drawbacks • The system is lossy: events may be lost when nodes are overloaded or during failure • A workaround is to increase the size of the nodes' incoming queues, but events may still be lost during failure • Still experimental, but very promising
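The lossy behaviour and the queue-size workaround can be illustrated with a bounded queue that drops events when full. This is a simplified model, not how S4 is implemented; `LossyNode` and its `dropped` counter are hypothetical names.

```python
import queue

class LossyNode:
    """Hypothetical node with a bounded inbox: events are dropped when it is full."""

    def __init__(self, capacity: int) -> None:
        # capacity must be positive; queue.Queue treats 0 as "unbounded".
        self.inbox = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def offer(self, event) -> bool:
        try:
            self.inbox.put_nowait(event)  # never blocks the sender
            return True
        except queue.Full:
            self.dropped += 1  # overloaded: the event is lost
            return False
```

With a capacity of 2, a burst of five events loses three of them; raising the capacity to 5 absorbs the same burst. A larger queue does nothing for events held in memory when the node itself fails, which matches the caveat on the slide.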
  • 31. More: Real-Time Inverted Search MyCustomer Search Sales Juridic Accounting 20 new results... Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 3 seconds ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
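One plausible reading of inverted search, offered here as an assumption since the slide only shows the "20 new results..." notice, is that queries are stored and each incoming document is matched against them. The sketch below uses hypothetical names (`InvertedSearch`, `register`, `on_document`) and a naive all-terms-present match instead of a real query engine.

```python
import re
from collections import defaultdict

def tokenize(text: str) -> set:
    """Lowercased word tokens; a stand-in for a real analyzer."""
    return set(re.findall(r"\w+", text.lower()))

class InvertedSearch:
    """Stores queries and matches every incoming document against them."""

    def __init__(self) -> None:
        self.queries = {}                    # query_id -> set of required terms
        self.new_results = defaultdict(int)  # query_id -> count of new matches

    def register(self, query_id: str, query_text: str) -> None:
        self.queries[query_id] = tokenize(query_text)

    def on_document(self, text: str) -> list:
        terms = tokenize(text)
        # A query matches when all of its terms appear in the document.
        matched = [qid for qid, required in self.queries.items()
                   if required and required <= terms]
        for qid in matched:
            self.new_results[qid] += 1
        return matched
```

A search page holding the query "MyCustomer" open could then read `new_results` to display how many matching documents arrived since the results were last rendered.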
  • 32. Summary • S4 is a nice processing system for real-time search, although events may be lost when nodes are overloaded or during failure • Not only for indexing time, also for query time! As S4 ensures low latency, query-time processing is possible • A promising roadmap: better failure handling, client API in major languages, initial processing with Hadoop, ...
  • 33. Questions / Answers ? blog.xebia.fr @mfiguiere