FOSDEM (feb 2011) - A real-time search engine with Lucene and S4

A Real-Time Search Engine with Lucene and S4
Yahoo! S4 applied to Information Retrieval

2/5/2011 Michaël Figuière

Speaker

@mfiguiere
blog.xebia.fr

Michaël Figuière

Distributed Architectures

NoSQL
Search Engines

Our case study

A Search Engine to keep track of activities
within an enterprise

A Search Engine

Search

A Search Engine

MyCustomer Search

Document Non Disclosure Agreement 12 days ago
... MyCustomer agrees not to disclose any part of ...

Document 2010 Sales Report 1 month ago
... MyCustomer: 12 M€ with 3 deals ...

Phone Call 2 days ago
Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min
Description: Invoice not received for order #2354E

Indexing Pipeline

[Diagram: PDF -> Text Extractor (Tika) -> Analyzer -> Search Index (Lucene)
          Phone Call -> Analyzer -> Search Index (Lucene)]
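
As a rough illustration of what this pipeline does, here is a minimal sketch in plain Python. It is not Lucene or Tika code: `analyze` and `ToyIndex` are hypothetical stand-ins for an analyzer and a search index, and text extraction is assumed to have already happened.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """In-memory inverted index mapping each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, doc_type, text):
        # Store the document, then register it under every term it contains.
        self.docs[doc_id] = (doc_type, text)
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the ids of documents that contain every term of the query."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = ToyIndex()
index.add(1, "Document", "MyCustomer agrees not to disclose any part of ...")
index.add(2, "Phone Call", "Customer: MyCustomer. Invoice not received for order #2354E")
index.add(3, "Document", "Unrelated memo")
```

Both document types end up in the same index once they are reduced to text, which is why a single "MyCustomer" search can return documents and phone calls together.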

A more complex Search Engine

MyCustomer Search

Sales | Legal | Accounting


Phone Call 2 days ago

Indexing Pipeline

[Diagram: PDF -> Text Extractor (Tika) -> Classifier (Mahout) -> Analyzer -> Search Index (Lucene)
          Phone Call -> Classifier (Mahout) -> Analyzer -> Search Index (Lucene)]

More complex ...

• Entity Recognition
Recognizes an entity however it is written

• Language Recognition
To index each language separately

• Fetching linked URLs
Enhances document context by also indexing linked URLs

• ...

A Real-Time Search Engine

MyCustomer Search



Phone Call 3 seconds ago

Indexing Pipeline

[Diagram: PDF -> Text Extractor -> Some Pre-Processing -> Analyzer -> Near Real-Time Search Index
          Phone Call -> Some Pre-Processing -> Analyzer -> Near Real-Time Search Index
          Near real-time search: since Lucene 2.9]

But...

[Diagram: the same pipeline (PDF and Phone Call -> Some Pre-Processing -> Analyzer -> Near Real-Time Search Index)]

What if it takes one second per document on a single box?
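
To make the question concrete, here is a small back-of-the-envelope sketch. The one second per document comes from the slide; the arrival rates used in the usage note are hypothetical, and the function names are illustrative only.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def max_docs_per_day(seconds_per_doc, boxes=1):
    """Upper bound on documents processed per day, assuming purely sequential work per box."""
    return boxes * SECONDS_PER_DAY / seconds_per_doc


def boxes_needed(docs_per_second, seconds_per_doc):
    """Minimum number of boxes required to keep up with a given arrival rate."""
    return math.ceil(docs_per_second * seconds_per_doc)
```

At one second per document, `max_docs_per_day(1.0)` gives 86,400 documents per day for a single box, and a hypothetical arrival rate of 50 documents per second would need `boxes_needed(50, 1.0)`, that is 50 boxes, just for pre-processing.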

Let’s distribute it

[Diagram: Server 1, Server 2, Server 3, ... Server N, each hosting its own Pre-Processing and Search Index]

Processing logic and index structure distributed together

That’s a problem...

• Processing and index storage may have different scaling needs
Depending on the search traffic, the processing overhead, ...

• Scaling index storage up and down is slow and complex
Whereas stateless processing is simple to scale up/down

• Expensive pre-processing may make searches slower
And indexing in real time shouldn’t make searches slower!

Let’s move it to Hadoop

[Diagram: the same pipeline (PDF and Phone Call -> Some Pre-Processing -> Analyzer -> Near Real-Time Search Index), with the pre-processing running on Hadoop MapReduce]

But...

• Hadoop can only deal with chunks of data
Data must be available somewhere on HDFS

• An unbounded stream of data can’t fit into Hadoop MapReduce
Hadoop is designed and optimized for batch processing

• Manually bounding the stream won’t be efficient
It would result in a lot of frequent, inefficient batches
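
The cost of bounding the stream by hand can be sketched with a simple latency model. This is only an illustration under stated assumptions: events are processed when their time window closes, and each batch job pays a fixed startup overhead. The window length and overhead below are hypothetical values, not measurements.

```python
def batch_latency(arrival_times, window, job_overhead):
    """Latency of each event when events are only processed once their window closes.

    arrival_times: event arrival times in seconds
    window: length of each batch window in seconds (assumed value)
    job_overhead: fixed cost of launching one batch job in seconds (assumed value)
    """
    latencies = []
    for t in arrival_times:
        # The event waits until the end of the window it arrived in.
        window_end = (int(t // window) + 1) * window
        latencies.append(window_end + job_overhead - t)
    return latencies


# Four events arriving within a 10-second window, with a 30-second job overhead.
example = batch_latency([0.0, 2.5, 5.0, 7.5], window=10, job_overhead=30)
```

Shrinking the window reduces the waiting time but multiplies the number of jobs, each paying the same overhead, so the latency never drops below that overhead.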

S4

• A distributed, fault-tolerant, stream processing system

• Elastic

Based on ZooKeeper

• Project started in November 2010, still experimental

But things are moving fast!

Where does S4 come from?

• Open Source project created by Yahoo!

• Initially built for relevant ad selection and clever positioning on webpages
But designed to be generic enough

Processing Element

[Diagram: Events Input -> Processing Element ("Your business logic goes here") -> Events Output]

Processing Node

[Diagram: a Processing Node hosting Processing Element 1, Processing Element 2, ... Processing Element N]

S4 Cluster

[Diagram: an Events Stream feeds Processing Node 1, Processing Node 2, ... Processing Node N; Cluster Management is handled by ZooKeeper]

Programming model

[Diagram: Input event -> PhoneCallPE -> Output event]

PhoneCallPE
Accepts events with: Type=PhoneCall, KeyTuple: Id=15497

Input event
Type: PhoneCall
KeyTuple: «Id=15497»
Value: <serialized object>

Output event
Type: EnrichedPhoneCall
KeyTuple: «Id=15497»
Value: <serialized object>

A new Processing Element instance is created for each value of «Id»
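
The keyed dispatch described above can be sketched as follows. This is plain Python, not the S4 API: `Event`, `PhoneCallPE` and `Dispatcher` are hypothetical names, and the "enrichment" is a placeholder for real business logic.

```python
from dataclasses import dataclass


@dataclass
class Event:
    type: str
    key: tuple      # e.g. ("Id", 15497)
    value: object   # stands in for the serialized object


class PhoneCallPE:
    """Toy Processing Element: one instance per key value, holding per-key state."""

    accepted_type = "PhoneCall"

    def __init__(self, key):
        self.key = key
        self.seen = 0

    def process(self, event):
        # Placeholder business logic: count the event and mark the payload as enriched.
        self.seen += 1
        enriched = dict(event.value, enriched=True)
        return Event("EnrichedPhoneCall", event.key, enriched)


class Dispatcher:
    """Routes events to the PE instance for their key, creating it on first use."""

    def __init__(self, pe_class):
        self.pe_class = pe_class
        self.instances = {}

    def dispatch(self, event):
        if event.type != self.pe_class.accepted_type:
            return None  # this PE does not accept the event type
        pe = self.instances.get(event.key)
        if pe is None:
            pe = self.instances[event.key] = self.pe_class(event.key)
        return pe.process(event)
```

Dispatching two events with KeyTuple «Id=15497» reuses the same instance, while an event with a different Id creates a second one.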

An indexing pipeline with S4

[Diagram, built up over several slides: ReRoutingPE, two TextExtractionPE instances, ReRoutingPE, two ClassificationPE instances, MergingPE]

• ReRoutingPE
Handles incoming events and load-balances them according to partitioning

• ReRoutingPE
Handles result events and load-balances them between Processing Nodes

• MergingPE
Handles final result events and pushes them to the Indexer
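
The two ideas in this pipeline, routing by partition and merging results before indexing, can be sketched as below. This is not S4 code. `partition_for` and `Merger` are hypothetical names, the hash function is an arbitrary choice, and the sketch assumes that text extraction and classification each emit one partial result per document.

```python
import zlib


def partition_for(key, num_nodes):
    """Stable hash partitioning: the same key always maps to the same node."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


class Merger:
    """Collects partial results per document id and emits one merged record when complete."""

    def __init__(self, expected_parts, sink):
        self.expected = set(expected_parts)
        self.pending = {}
        self.sink = sink  # callable standing in for the Indexer

    def on_result(self, doc_id, part, payload):
        parts = self.pending.setdefault(doc_id, {})
        parts[part] = payload
        # Push to the sink only once every expected part has arrived.
        if self.expected.issubset(parts):
            self.sink(doc_id, self.pending.pop(doc_id))
```

Routing on the document id means all partial results for one document reach the same merger, so the merge state never has to be shared between nodes.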

Some drawbacks

• The system is lossy
Events may be lost when nodes are overloaded or during failure

• A workaround is to increase the incoming queue of nodes
But still, events may be lost during failure

• Still experimental
But very promising
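
The lossy behaviour and the queue-size workaround can be illustrated with a bounded queue. This is a sketch only: it assumes that new events are rejected when the queue is full, which may not match the actual S4 drop policy, and `LossyQueue` and `simulate_burst` are hypothetical names.

```python
from collections import deque


class LossyQueue:
    """Bounded incoming queue that drops new events once it is full (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0

    def offer(self, event):
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(event)
        return True

    def poll(self):
        return self.items.popleft() if self.items else None


def simulate_burst(capacity, burst_size):
    """Send a burst of events with no consumer running; return (queued, dropped)."""
    queue = LossyQueue(capacity)
    for i in range(burst_size):
        queue.offer(i)
    return len(queue.items), queue.dropped
```

A burst of 25 events loses 15 of them with a capacity of 10 and none with a capacity of 100. A larger queue absorbs overload, but whatever is still queued disappears if the node itself fails.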

More: Real-Time Inverted Search

MyCustomer Search


20 new results...


Phone Call 3 seconds ago
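
One way to read "inverted search" is that queries are registered ahead of time and each incoming document is matched against them, so that open result pages can be told about new results. The sketch below follows that reading, which is an assumption, and uses hypothetical names throughout. It is not based on any Lucene or S4 API.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class InvertedSearch:
    """Stores queries and finds which of them match each new document."""

    def __init__(self):
        self.queries = {}  # query id -> set of required terms

    def register(self, query_id, query_text):
        self.queries[query_id] = tokens(query_text)

    def match(self, document_text):
        """Return the ids of registered queries whose terms all appear in the document."""
        doc_terms = tokens(document_text)
        return sorted(
            qid for qid, terms in self.queries.items()
            if terms and terms <= doc_terms
        )
```

A user who searched for "MyCustomer" would be registered once, and every new phone call or document mentioning MyCustomer would then show up as a match for that query.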

Summary

• S4 is a nice processing system for real-time search
Events may be lost when nodes are overloaded or during failure

• Not only for indexing time, also for query time!
As S4 ensures low latency, query-time processing is possible

• A promising roadmap...
Better failure handling, client API in major languages,
initial processing with Hadoop, ...

Questions / Answers

?
blog.xebia.fr
@mfiguiere
