7. A Search Engine
MyCustomer Search
Document Non Disclosure Agreement 12 days ago
... MyCustomer agrees not to disclose any part of ...
Document 2010 Sales Report 1 month ago
... MyCustomer: 12 M€ with 3 deals ...
Phone Call 2 days ago
Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min
Description: Invoice not received for order #2354E
8. Indexing Pipeline
Tika
PDF
Text
Analyzer
Extractor
Search
Index
Analyzer
Phone
Call
Lucene
9. A more complex Search Engine
MyCustomer Search
Sales Juridic Accounting
Document 2010 Sales Report 1 month ago
... MyCustomer: 12 M€ with 3 deals ...
Phone Call 2 days ago
Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min
Description: Invoice not received for order #2354E
10. Indexing Pipeline
Tika Mahout
PDF
Text
Classifier Analyzer
Extractor
Search
Index
Classifier Analyzer
Phone
Call
Lucene
11. More complex ...
• Entity Recognition
Recognizes an entity written in any way
• Language Recognition
To index each language separately
• Fetching linked URLs
Enhances document context by also indexing linked URLs
• ...
12. A Real-Time Search Engine
MyCustomer Search
Sales Juridic Accounting
Document 2010 Sales Report 1 month ago
... MyCustomer: 12 M€ with 3 deals ...
Phone Call 3 seconds ago
Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min
Description: Invoice not received for order #2354E
13. A Real-Time Search Engine
MyCustomer Search
Sales Juridic Accounting
Document 2010 Sales Report 1 month ago
... MyCustomer: 12 M€ with 3 deals ...
Phone Call 3 seconds ago
Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min
Description: Invoice not received for order #2354E
14. Indexing Pipeline
Since Lucene 2.9
PDF
Text Some
Analyzer
Extractor Pre-Processing
Near Real-Time
Search Index
Some
Analyzer
Phone Pre-Processing
Call
15. But...
PDF
Text Some
Analyzer
Extractor Pre-Processing
Near Real-Time
Search Index
Some
Analyzer
Phone Pre-Processing
Call
What if it takes
one second/document
on a single box ??
16. Let’s distribute it
Server 1 Server 3
Pre- Search Pre- Search
Processing Index Processing Index
Server 2 Server N
Pre- Search
Processing Index
Processing logic and index structure distributed together
17. That’s a problem...
• Processing and index storage may have different scaling needs
Depending on the search traffic, the processing overhead, ...
• Scaling up and down an index storage is long and complex
Whereas stateless processing is simple to scale up/down
• Expensive pre-processing may make searches slower
And indexing in real-time shouldn’t make searches slower !
18. Let’s move it to Hadoop
PDF
Text Some
Analyzer
Extractor Pre-Processing
Near Real-Time
Search Index
Some
Analyzer
Phone Pre-Processing
Call
Hadoop MapReduce
19. But...
• Hadoop can only deal with chunk of data
Data must be available somewhere on HDFS
• Unbounded stream of data can’t fit into Hadoop MapReduce
Hadoop is thought and optimized for batch processing
• Manually bounding the stream won’t be efficient
It’ll resulting in lot of regular and inefficient batches
21. S4
• A distributed, fault-tolerant, stream processing system
• Elastic
Based on Zookeeper
• Project started in november 2010, still experimental
But things are moving fast !
22. Where does S4 come from ?
• Open Source project created by Yahoo!
• Initially built for relevant ad selection and clever positioning on webpages
But thought to be generic enough
• Expensive pre-processing may make searches slower
And indexing in real-time shouldn’t make searches slower !
23. Processing Element
Your business
logic goes here
Processing
Element
Events Input Events Output
24. Processing Node
Processing Node
Processing Processing Processing
Element 1 Element 2 Element N
26. Programming model
PhoneCallPE
Accept events with :
Type=PhoneCall
Event KeyTuple: Id=15497 Event
Type: PhoneCall Type: EnrichedPhoneCall
KeyTuple: «Id=15497» KeyTuple: «Id=15497»
Value: <serialized object> Value: <serialized object>
A new Processing
Element instance is created
for each value of «Id»
27. An indexing pipeline with S4
ReRoutingPE
Handles incoming events
and load-balance them
according to partitioning
TextExtractionPE TextExtractionPE
ReRoutingPE
ClassificationPE ClassificationPE
MergingPE
28. An indexing pipeline with S4
ReRoutingPE
TextExtractionPE TextExtractionPE
Handles result events
ReRoutingPE and load-balance between
Processing Nodes
ClassificationPE ClassificationPE
MergingPE
29. An indexing pipeline with S4
ReRoutingPE
TextExtractionPE TextExtractionPE
ReRoutingPE
ClassificationPE ClassificationPE
Handles final result
events and push
MergingPE
them to the Indexer
30. Some drawbacks
• The system is lossy
Events may be lost when nodes are overloaded or during failure
• A workaround is to increase the incoming queue of nodes
But still, events may be lost during failure
• Still experimental
But very promising
31. More: Real-Time Inverted Search
MyCustomer Search
Sales Juridic Accounting
20 new results...
Document 2010 Sales Report 1 month ago
... MyCustomer: 12 M€ with 3 deals ...
Phone Call 3 seconds ago
Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min
Description: Invoice not received for order #2354E
32. Summary
• S4 is a nice processing system for real-time search
Events may be lost when nodes are overloaded or during failure
• Not only for indexing-time, also for query-time !
As S4 ensures low latency, query-time processing is possible
• A promising roadmap....
Better failure handling, client API in major languages,
initial processing with Hadoop, ...