The document discusses search implementation and ElasticSearch. It begins with an overview of how search works by indexing documents into an inverted index of tokens and associated document postings. It then provides a Ruby implementation of a basic search index and demonstrates indexing documents and searching the index. The document concludes by describing features of ElasticSearch like its use of HTTP and JSON, schema-free indexing, distributed search capabilities, and Ruby integration.
17. WHY SEARCH SUCKS?
How do you implement search?
class MyModel
  include Whatever::Search
end
MyModel.search "something"
18. WHY SEARCH SUCKS?
How do you implement search?
class MyModel
  include Whatever::Search
  MAGIC
end
MyModel.search "whatever"
19. WHY SEARCH SUCKS?
How do you implement search?
Query Results Result
def search
  @results = MyModel.search params[:q]
  respond_with @results
end
20. WHY SEARCH SUCKS?
How do you implement search?
Query Results Result
MAGIC
def search
  @results = MyModel.search params[:q]
  respond_with @results
end
21. WHY SEARCH SUCKS?
How do you implement search?
Query Results Result
MAGIC +
def search
  @results = MyModel.search params[:q]
  respond_with @results
end
25. HOW DOES SEARCH WORK?
A collection of documents
file_1.txt
The ruby is a pink to blood-red colored gemstone ...
file_2.txt
Ruby is a dynamic, reflective, general-purpose object-oriented
programming language ...
file_3.txt
"Ruby" is a song by English rock band Kaiser Chiefs ...
26. HOW DOES SEARCH WORK?
How do you search documents?
File.read('file_1.txt').include?('ruby')
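The one-liner above generalizes to a scan over the whole collection. The following is only a sketch: the `DOCUMENTS` hash is a hypothetical in-memory stand-in for the three files, and `naive_search` is an illustrative name. It shows why this approach scales poorly: every query reads every document.

```ruby
# Hypothetical in-memory stand-ins for the three files from the earlier slide.
DOCUMENTS = {
  'file_1.txt' => 'The ruby is a pink to blood-red colored gemstone ...',
  'file_2.txt' => 'Ruby is a dynamic, reflective, general-purpose object-oriented programming language ...',
  'file_3.txt' => '"Ruby" is a song by English rock band Kaiser Chiefs ...'
}

# Scan every document on every query. The cost grows with the total
# size of the collection, which is what an inverted index avoids.
def naive_search(documents, term)
  documents.select { |_name, content| content.downcase.include?(term.downcase) }.keys
end

p naive_search(DOCUMENTS, 'ruby')
p naive_search(DOCUMENTS, 'song')
```

Note that this is a substring match, so a query such as 'pin' would also match "pink".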
27. HOW DOES SEARCH WORK?
The inverted index
TOKENS POSTINGS
ruby file_1.txt file_2.txt file_3.txt
pink file_1.txt
gemstone file_1.txt
dynamic file_2.txt
reflective file_2.txt
programming file_2.txt
song file_3.txt
english file_3.txt
rock file_3.txt
http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices
28. HOW DOES SEARCH WORK?
The inverted index
MySearchLib.search "ruby"
ruby file_1.txt file_2.txt file_3.txt
pink file_1.txt
gemstone file_1.txt
dynamic file_2.txt
reflective file_2.txt
programming file_2.txt
song file_3.txt
english file_3.txt
rock file_3.txt
29. HOW DOES SEARCH WORK?
The inverted index
MySearchLib.search "song"
ruby file_1.txt file_2.txt file_3.txt
pink file_1.txt
gemstone file_1.txt
dynamic file_2.txt
reflective file_2.txt
programming file_2.txt
song file_3.txt
english file_3.txt
rock file_3.txt
30. A naïve Ruby implementation
module SimpleSearch

  def index document, content
    tokens = analyze content
    store document, tokens
    puts "Indexed document #{document} with tokens:", tokens.inspect, "\n"
  end

  def analyze content
    # >>> Split content by words into "tokens"
    content.split(/\W/).
    # >>> Downcase every word
    map { |word| word.downcase }.
    # >>> Reject stop words, digits and whitespace
    reject { |word| STOPWORDS.include?(word) || word =~ /^\d+/ || word == '' }
  end

  def store document_id, tokens
    tokens.each do |token|
      # >>> Save the "posting"
      ( (INDEX[token] ||= []) << document_id ).uniq!
    end
  end

  def search token
    puts "Results for token '#{token}':"
    # >>> Print documents stored in index for this token
    (INDEX[token] || []).each { |document| puts " * #{document}" }
  end

  INDEX = {}
  # The remaining stopwords are cut off in the source after "there"
  STOPWORDS = %w|a an and are as at but by for if in is it no not of on or that the then there|

  extend self

end
31. HOW DOES SEARCH WORK?
Indexing documents
SimpleSearch.index "file1", "Ruby is a language. Java is also a language."
SimpleSearch.index "file2", "Ruby is a song."
SimpleSearch.index "file3", "Ruby is a stone."
SimpleSearch.index "file4", "Java is a language."
Indexed document file1 with tokens:
["ruby", "language", "java", "also", "language"]
Indexed document file2 with tokens:
["ruby", "song"]
Indexed document file3 with tokens:
["ruby", "stone"]
Indexed document file4 with tokens:
["java", "language"]
Words downcased, stopwords removed.
32. HOW DOES SEARCH WORK?
The index
puts "What's in our index?"
p SimpleSearch::INDEX
{
"ruby" => ["file1", "file2", "file3"],
"language" => ["file1", "file4"],
"java" => ["file1", "file4"],
"also" => ["file1"],
"stone" => ["file3"],
"song" => ["file2"]
}
33. HOW DOES SEARCH WORK?
Search the index
SimpleSearch.search "ruby"
Results for token 'ruby':
* file1
* file2
* file3
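Searching for more than one token comes down to intersecting postings lists. A minimal sketch, assuming the index contents printed on the previous slide; `SLIDE_INDEX` and `search_all` are hypothetical names, not part of SimpleSearch:

```ruby
# Postings copied from the index printed on the previous slide.
SLIDE_INDEX = {
  'ruby'     => %w[file1 file2 file3],
  'language' => %w[file1 file4],
  'java'     => %w[file1 file4],
  'also'     => %w[file1],
  'stone'    => %w[file3],
  'song'     => %w[file2]
}

# Hypothetical helper: documents that contain every given token (AND search).
def search_all(index, *tokens)
  postings = tokens.map { |token| index.fetch(token.downcase, []) }
  postings.reduce { |common, list| common & list } || []
end

p search_all(SLIDE_INDEX, 'ruby', 'language')  # => ["file1"]
p search_all(SLIDE_INDEX, 'java', 'language')  # => ["file1", "file4"]
p search_all(SLIDE_INDEX, 'ruby', 'python')    # => []
```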
34. HOW DOES SEARCH WORK?
The inverted index
TOKENS POSTINGS
ruby 3 file_1.txt file_2.txt file_3.txt
pink 1 file_1.txt
gemstone file_1.txt
dynamic file_2.txt
reflective file_2.txt
programming file_2.txt
song file_3.txt
english file_3.txt
rock file_3.txt
http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices
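The numbers next to "ruby" and "pink" above appear to be document counts, that is, the length of each postings list. A small sketch of deriving them, assuming that reading; `SAMPLE_POSTINGS` and `document_frequencies` are illustrative names:

```ruby
# A subset of the postings shown on the slide.
SAMPLE_POSTINGS = {
  'ruby' => %w[file_1.txt file_2.txt file_3.txt],
  'pink' => %w[file_1.txt],
  'song' => %w[file_3.txt]
}

# Document frequency: in how many documents does each token occur?
def document_frequencies(index)
  index.transform_values(&:size)
end

p document_frequencies(SAMPLE_POSTINGS)
# => {"ruby"=>3, "pink"=>1, "song"=>1}
```

A token that occurs in every document, like "ruby" here, does little to tell documents apart, which is why search engines tend to weight rare tokens higher.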
35. It is very practical to know how search works.
For instance, now you know that
the analysis step is very important.
Most of the time, it's more important than the search step.
ElasticSearch
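One way to see why analysis matters: the query has to go through the same analysis as the indexed content, otherwise it misses. The sketch below is self-contained and uses a shortened, hypothetical stopword list; `analyze_text`, `build_index` and `lookup` are illustrative names.

```ruby
# Shortened, hypothetical stopword list (not the one from the slides).
DEMO_STOPWORDS = %w[a an and is the]

# The same analysis chain must run at index time and at query time.
def analyze_text(text)
  text.split(/\W+/).map(&:downcase).reject { |word| word.empty? || DEMO_STOPWORDS.include?(word) }
end

# Build an inverted index: token => list of document ids.
def build_index(documents)
  documents.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |(id, content), index|
    analyze_text(content).each { |token| index[token] |= [id] }
  end
end

# Look up a query, either analyzed first or passed through as typed.
def lookup(index, query, analyzed: true)
  tokens = analyzed ? analyze_text(query) : [query]
  tokens.flat_map { |token| index.fetch(token, []) }.uniq
end

DEMO_INDEX = build_index({ 'file1' => 'Ruby is a language.', 'file2' => 'Ruby is a song.' })

p lookup(DEMO_INDEX, 'Ruby', analyzed: false)  # => []
p lookup(DEMO_INDEX, 'Ruby')                   # => ["file1", "file2"]
```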
37. HOW DOES SEARCH WORK?
The Search Engine Textbook
Search Engines
Information Retrieval in Practice
Bruce Croft, Donald Metzler and Trevor Strohman
Addison Wesley, 2009
http://search-engines-book.com
38. SEARCH IMPLEMENTATIONS
The Baseline Information Retrieval Implementation
Lucene in Action
Michael McCandless, Erik Hatcher and Otis Gospodnetic
July, 2010
http://manning.com/hatcher3
52. ELASTICSEARCH FEATURES
HTTP / JSON / Schema Free / Index as Resource / Distributed / Queries / Facets / Mapping / Ruby
The “Sliding Window” problem
curl -X DELETE http://localhost:9200/logs_2010_01
logs
  logs_2010_02
  logs_2010_03
  logs_2010_04
“We can really store only three months worth of data.”
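Because each month lives in its own index, expiring old data is a matter of deleting whole indices. A sketch of picking the ones to drop, assuming zero-padded names like `logs_2010_01` that sort chronologically; `expired_indices` is a hypothetical helper, and the script only prints the curl commands rather than running them.

```ruby
# Hypothetical helper: return the index names that fall outside the most
# recent `keep` months. Assumes names sort chronologically (zero-padded).
def expired_indices(names, keep: 3)
  names.sort.reverse.drop(keep).reverse
end

indices = %w[logs_2010_01 logs_2010_02 logs_2010_03 logs_2010_04]

# Print (do not execute) one DELETE request per expired index.
expired_indices(indices).each do |name|
  puts "curl -X DELETE http://localhost:9200/#{name}"
end
```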
53. ELASTICSEARCH FEATURES
HTTP / JSON / Schema Free / Index as Resource / Distributed / Queries / Facets / Mapping / Ruby
Index Templates
curl -X PUT localhost:9200/_template/bookmarks_template -d '
{
  "template" : "users_*",
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 3
    }
  },
  "mappings": {
    "url": {
      "properties": {
        "url": {
          "type": "string", "analyzer": "simple", "boost": 10
        },
        "title": {
          "type": "string", "analyzer": "snowball", "boost": 5
        }
        // ...
      }
    }
  }
}
'
The "template" pattern: apply this configuration for every matching index being created.
http://www.elasticsearch.org/guide/reference/api/admin-indices-templates.html
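To get a feel for which index names a pattern like `users_*` would cover, shell-style glob matching is a close enough approximation. This is a sketch using Ruby's `File.fnmatch?`, not ElasticSearch's own matching code; `template_applies?` is a hypothetical name.

```ruby
# Approximate the template pattern with shell-style glob matching.
# This mimics the idea only; ElasticSearch does its own matching.
def template_applies?(pattern, index_name)
  File.fnmatch?(pattern, index_name)
end

%w[users_1 users_karmi articles].each do |name|
  puts "#{name}: #{template_applies?('users_*', name)}"
end
```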
55. ELASTICSEARCH FEATURES
HTTP / JSON / Schema Free / Index as Resource / Distributed / Queries / Facets / Mapping / Ruby
Index A is split into 3 shards, and duplicated in 2 replicas.
Shards   Replicas
A1       A1'  A1''
A2       A2'  A2''
A3       A3'  A3''
curl -XPUT 'http://localhost:9200/A/' -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 2
    }
  }
}'
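The diagram adds up as follows: each of the 3 shards exists once as a primary plus 2 replica copies, so the cluster holds 9 shard copies in total. A sketch of that arithmetic, together with the settings body from the curl call; `index_settings` and `total_shard_copies` are illustrative names.

```ruby
require 'json'

# Build the settings body used in the curl call above.
def index_settings(shards:, replicas:)
  { settings: { index: { number_of_shards: shards, number_of_replicas: replicas } } }
end

# Every shard exists once as a primary plus once per replica.
def total_shard_copies(shards:, replicas:)
  shards * (1 + replicas)
end

puts JSON.generate(index_settings(shards: 3, replicas: 2))
puts total_shard_copies(shards: 3, replicas: 2)  # => 9
```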
56. ELASTICSEARCH FEATURES
HTTP / JSON / Schema Free / Index as Resource / Distributed / Queries / Facets / Mapping / Ruby
SHARDS: Improve indexing performance
REPLICAS: Improve search performance
57. ELASTICSEARCH FEATURES
HTTP / JSON / Schema Free / Distributed / Queries / Facets / Mapping / Ruby
$ curl -X GET "http://localhost:9200/_search?q=<YOUR QUERY>"
Terms      apple
           apple iphone
Phrases    "apple iphone"
Proximity  "apple safari"~5
Fuzzy      apple~0.8
Wildcards  app*
           *pp*
Boosting   apple^10 safari
Range      [2011/05/01 TO 2011/05/31]
           [java TO json]
Boolean    apple AND NOT iphone
           +apple -iphone
           (apple OR iphone) AND NOT review
Fields     title:iphone^15 OR body:iphone
           published_on:[2011/05/01 TO "2011/05/27 10:00:00"]
http://lucene.apache.org/java/3_1_0/queryparsersyntax.html
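Most of these queries contain spaces, quotes or other characters that need escaping before they go into the `q` parameter. A sketch of building the URL with Ruby's standard `uri` library; `search_url` is a hypothetical helper, and the host and port are the defaults from the curl example.

```ruby
require 'uri'

# Hypothetical helper: build a URI-search URL with a properly escaped query.
def search_url(query, host: 'localhost', port: 9200)
  "http://#{host}:#{port}/_search?#{URI.encode_www_form(q: query)}"
end

puts search_url('apple AND NOT iphone')
puts search_url('"apple safari"~5')
puts search_url('title:iphone^15 OR body:iphone')
```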
67. ELASTICSEARCH FEATURES
HTTP / JSON / Schema Free / Distributed / Queries / Facets / Mapping / Ruby
K R O A T I E N
Trigrams:
K R O
R O A
O A T
A T I
T I E
I E N
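The trigram split above is a sliding window of three characters over the word. A minimal sketch in plain Ruby; `ngrams` is an illustrative name, not an ElasticSearch or Tire API:

```ruby
# Slide a window of `size` characters over the word.
def ngrams(word, size = 3)
  word.chars.each_cons(size).map(&:join)
end

p ngrams('KROATIEN')
# => ["KRO", "ROA", "OAT", "ATI", "TIE", "IEN"]
```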
68. ELASTICSEARCH FEATURES
HTTP / JSON / Schema Free / Distributed / Queries / Facets / Mapping / Ruby
Tire.index 'articles' do
  delete
  create
  store :title => 'One', :tags => ['ruby'], :published_on => '2011-01-01'
  store :title => 'Two', :tags => ['ruby', 'python'], :published_on => '2011-01-02'
  store :title => 'Three', :tags => ['java'], :published_on => '2011-01-02'
  store :title => 'Four', :tags => ['ruby', 'php'], :published_on => '2011-01-03'
  refresh
end
s = Tire.search 'articles' do
  query { string 'title:T*' }
  filter :terms, :tags => ['ruby']
  sort { title 'desc' }
  facet('global-tags') { terms :tags, :global => true }
  facet('current-tags') { terms :tags }
end
http://github.com/karmi/tire
69. ELASTICSEARCH FEATURES
HTTP / JSON / Schema Free / Distributed / Queries / Facets / Mapping / Ruby
class Article < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks
end
$ rake environment tire:import CLASS='Article'
Article.search do
  query { string 'love' }
  facet('timeline') { date :published_on, :interval => 'month' }
  sort { published_on 'desc' }
end
http://github.com/karmi/tire
70. ELASTICSEARCH FEATURES
HTTP / JSON / Schema Free / Distributed / Queries / Facets / Mapping / Ruby
class Article
  include Whatever::ORM
  include Tire::Model::Search
  include Tire::Model::Callbacks
end
$ rake environment tire:import CLASS='Article'
Article.search do
  query { string 'love' }
  facet('timeline') { date :published_on, :interval => 'month' }
  sort { published_on 'desc' }
end
http://github.com/karmi/tire
72. Try ElasticSearch and Tire with a one-line command.
$ rails new tired -m "https://gist.github.com/raw/951343/tired.rb"
A “batteries included” installation.
Downloads and launches ElasticSearch.
Sets up a Rails application and launches it.
When you're tired of it, just delete the folder.