1. Semantic Search Tutorial
Introduction
Peter Mika| Yahoo! Research, Spain
pmika@yahoo-inc.com
Thanh Tran | Institute AIFB, KIT, Germany
Tran@aifb.uni-karlsruhe.de
2. About the speakers
Peter Mika
Senior Research Scientist
Head of Semantic Search group at Yahoo! Research in
Barcelona
Semantic Search, Web Object Retrieval, Natural
Language Processing
Tran Duc Thanh
Assistent Professor at AIFB, Karlsruhe Institute of
Technology
Head of Semantic Search group
Semantic Search, Semantic / Linked Data Management
3. Agenda
Introduction (10 min)
Semantic Web data (50 min)
The RDF data model
Publishing RDF
Crawling and indexing RDF data
Query processing (40 min)
Ranking (30 min)
Result presentation (15 min)
Semantic Search evaluation (15 min)
Questions (5 min)
4. Why Semantic Search? I.
“We are at the beginning of search.“ (Marissa Mayer)
Solved large classes of queries, e.g. navigational
Heavy investment in computational power
Remaining queries are hard, not solvable by brute force,
and require a deep understanding of the world and
human cognition
Background knowledge and metadata can help to
address poorly solved queries
5. Poorly solved information needs
Many of these queries
Ambiguous searches would not be asked by
paris hilton users, who learned over
Long tail queries time what search
technology can and can
george bush (and I mean the beer brewer notArizona)
in do.
Multimedia search
paris hilton sexy
Imprecise or overly precise searches
jim hendler
pictures of strong adventures people
Precise searches for descriptions
countries in africa
32 year old computer scientist living in barcelona
reliable digital camera under 300 dollars
7. Why Semantic Search? II.
The Semantic Web is now a reality
Large amounts of data published in RDF
Heterogeneous data of varying quality
Users who are not skilled in writing complex queries (e.g.
SPARQL) and may not be experts in the domain
Searching data instead or in addition to searching
documents
Direct answers
Novel search tasks
8. Example: direct answers in search
Points of Faceted
interest in Information
search for Information box with
Vienna, from the Shopping content from and
Austria Knowledgeresults links to Yahoo!
Graph Travel
Since Aug,
2010, „regular‟
search results
are „Powered
by Bing‟
9. Novel search tasks
Aggregation of search results
e.g. price comparison across websites
Analysis and prediction
e.g. world temperature by 2020
Semantic profiling
recommendations based on particular interests
Semantic log analysis
understanding user behavior in terms of objects
Support for complex tasks
e.g. booking a vacation using a combination of services
10. Contextual (pervasive, ambient) search
Yahoo! Connected
TV:
Widget engine
embedded into the
TV
Yahoo! IntoNow:
recognize audio and
show related content
12. Document retrieval and data retrieval
Information Retrieval (IR) support the retrieval of
documents (document retrieval)
Representation based on lightweight syntax-centric models
Work well for topical search
Not so well for more complex information needs
Web scale
Database (DB) and Knowledge-based Systems (KB)
deliver more precise answers (data retrieval)
More expressive models
Allow for complex queries
Retrieve concrete answers that precisely match queries
Not just matching and filtering, but also joins
Limitations in scalability
13. Combination of document and data
retrieval
Documents with metadata
Metadata may be embedded inside the document
I’m looking for documents that mention countries in
Africa.
Data retrieval
Structured data, but searchable text fields
I’m looking for directors, who have directed movies
where the synopsis mentions dinosaurs.
14. Semantic Search
Target (combination of) document and data retrieval
Semantic search is a retrieval paradigm that
Exploits the structure/semantics of the data or explicit
background knowledge to understand user intent and the
meaning of content
Incorporates the intent of the query and the meaning of
content into the search process (semantic models)
Wide range of semantic search systems
Employ different semantic models, possibly at different
steps of the search process and in order to support
different tasks
15. Semantic models
Semantics is concerned with the meaning of the
resources made available for search
Various representations of meaning
Linguistic models: models of relationships among
words
Taxonomies, thesauri, dictionaries of entity names
Inference along linguistic relations, e.g. broader/narrower
terms
Conceptual models: models of relationships among
objects
Ontologies capture entities in the world and their
relationships
Inference along domain-specific relations
We will focus on conceptual models in this tutorial
In particular, the RDF/OWL conceptual model for
16. Semantic Search – a process view
Knowledge Representation
Semantic Models
• Keywords Resources
• Forms
Query • NL
Construction • Formal language
•IR-style matching & ranking
•DB-style precise matching
Query •KB-style matching &
Processing inferences
•Query visualization
•Document and data Documents
Result presentation
Presentation •Summarization
•Implicit feedback
•Explicit feedback
Query •Incentives
Refinement
Document Representation
17. Semantic Search systems
For data / document retrieval, semantic search
systems might combine a range of techniques, ranging
from statistics-based IR methods for ranking,
database methods for efficient indexing and query
processing, up to complex reasoning techniques for
making inferences!
18. Example: Information Workbench
Addressing the lifecycle of
interacting with the Web of
Data
Integration of data sources
Content generation by the end
user
Search and Exploration
Visualization User- Wikipedia
Publishing generated
DBpedia, Yago
Integrated management of
Earthquake
heterogeneous data (Data.gov)
sources
Structured and unstructured
Structure
Published and user-generated Dynamic
d
Static and dynamic
Open domain
19. Data Sources in the Application
Entire English Wikipedia
Data from Linked Open Data
DBpedia
YAGO
…
Data from Data.gov (US Government)
E.g. live data about earthquakes
Many more
20. Semantic Search
Hybrid Search: Structured queries combined with
keywords across structured and unstructured data
sources
Query interpretation: Translation of keywords into
hybrid queries
Keyword search/query interpretation combined
with faceted search: iterative refinement process
based on keywords and operations on facets
21. Search, Refinement and Navigation
Keywords
Query
Translations
Term
Completions
Vorlesung Knowledge Discovery - Institut
Facets
AIFB
2
1
24. Data on the Web
Data on the Web is not directly accessible
Most web pages are generated from databases, but
formatted for human consumption
APIs offer limited views over data
Two solutions
Extraction using Information Extraction (IE) techniques
Out of scope for this tutorial
Relying on publishers to expose structured data using
standard Semantic Web formats
25. Information extraction
Natural Language Processing
Named entity recognition and disambiguation, sentiment
analysis etc.
Extraction of information about entities
Suchanek et al. YAGO: A Core of Semantic Knowledge
Unifying WordNet and Wikipedia, WWW, 2007.
Wu and Weld. Autonomously Semantifying Wikipedia, CIKM
2007.
Extraction from HTML tables
Cafarella et al. WebTables: Exploring the Power of Tables on
the Web. VLDB 2008
Wrapper induction
Kushmerick et al. Wrapper Induction for Information
ExtractionText extraction. IJCAI 2007
Filling web forms automatically (form-filling)
Madhavan et al. Google's Deep-Web Crawl. VLDB 2008
26. Semantic Web
Sharing data across the Web
Standard data model
RDF
A number of syntaxes (file formats)
RDF/XML, RDFa
Powerful, logic-based languages for schemas
OWL, RIF
Query languages and protocols
HTTP, SPARQL
27. Resource Description Framework (RDF)
Each resource (thing, entity) is identified by a URI
Globally unique identifiers
URLs a subset of URIs
Often abbreviated using namespaces
e.g. example:roi = http://example.org/roi
RDF represents knowledge as a set of triples
Each triple is a single fact about the entity (an attribute or a
relationship)
A set of triples forms an RDF graph
RDF document
type foaf:Person
example:roi name
“Roi Blanco”
28. Linking resources
Roi‟s homepage Friend-of-a-Friend ontology
type
example:roi foaf:Person
name
“Roi Blanco” knows
sameAs
Yahoo!‟s website
type
worksWith
#roi2 #peter
email
“pmika@yahoo-inc.com”
29. Ontologies
Ontologies are the schemas for RDF graphs
Define the intended meaning of certain classes and
relationships in a domain
e.g. the FOAF ontology defines the foaf:Person class and
relationships such as foaf:knows
Ontologies are published in standard languages such
as OWL
Language defines the constructs for modeling and their
meaning
e.g. subClassOf, sameAs
Tools for editing ontologies such as Protégé, TopBraid
Composer
Reusing and extending well-known ontologies helps to
interpret unknown data
e.g. if X is subClassOf foaf:Person, and A is of type
X, then we know that A is a person
30. Example: schema.org
Agreement on a shared set of schemas for common
types of web content
Bing, Google, and Yahoo! as initial supporters
Similar in intent to sitemaps.org (2006)
Use a single format to communicate the same information to all
three search engines
Support for microdata
schema.org covers areas of interest to all search
engines
Business listings (local), creative works
(video), recipes, reviews
User defined extensions
Each search engine continues to develop its products
32. Publishing RDF
Interlinked RDF documents (Linked Data)
Each document describes a single resource with URIs
pointing to related resources
Common RDF file formats are RDF/XML and Turtle
Mostly implemented as a wrapper around a database or
Web service
Embedding RDF inside HTML
RDFa, microdata
SPARQL endpoints
Triple stores are databases for managing RDF data
SPARQL is a standard protocol and query language for
accessing triple stores using HTTP
33. Linked Data
Grass-roots community effort to (re)publish open
datasets in RDF
Centered around the Dbpedia project
Directory of datasets at linkeddata.org, thedatahub.org
34. Metadata in HTML anno 1995
What does this term
<HTML> mean?
<HEAD profile="http://dublincore.org/documents/dcq-
html/">
<META name="DC.author" content="Peter Mika">
<LINK rel="DC.rights copyright"
href="http://www.example.org/rights.html" />
<LINK rel="meta" type="application/rdf+xml" title="FOAF"
href= "http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
…
</HTML>
How is this data
related?
35. Microformats (anno 2003)
Mark-up for data in HTML pages
Reuse existing HTML elements (class, rel)
Microformats exist for a limited set of objects
Persons, organizations, reviews, events etc., see microformats.org
No formal schemas
Limited reuse, extensibility of schemas
Unclear which combinations are allowed
No identifiers for entities
No interlinking between entities
<div class="vcard">
<a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
<div class="tel">+1-919-555-7878</div>
<div class="title">Area Administrator, Assistant</div>
</div>
36. RDFa and RDFa Lite (2008-2012)
W3C standards for embedding RDF data in HTML
documents
A set of new HTML attributes to be used in head or body
A specification of how to extract the data from these
attributes
RDFa is just a syntax, you have to choose a
vocabulary separately
RDFa family
RDFa 1.0
Recommendation since October, 2008
RDFa 1.1 is a small update on RDFa to make it easier to
use
RDFa Primer (Working Draft, May 8, 2012)
RDFa 1.1 Lite is a subset of RDFa 1.1 with the most
37. Example: Facebook‟s Open Graph
Protocol
The „Like‟ button provides publishers with a way to
promote their content on Facebook and build
communities
Shows up in profiles and news feed
Site owners can later reach users who have liked an
object
Facebook Graph API allows 3rd party developers to
access the data
Open Graph Protocol is an RDFa-based format that
allows to describe the object that the user „Likes‟
38. Example: Facebook‟s Open Graph
Protocol
RDF vocabulary to be used in conjunction with RDFa
Simplify the work of developers by restricting the freedom in RDFa
Activities, Businesses, Groups, Organizations, People, Places, Product
s and Entertainment
Only HTML <head> accepted
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url"
content="http://www.imdb.com/title/tt0117500/" />
<meta property="og:image" content="http://ia.media-
imdb.com/images/rock.jpg" /> …
</head> ...
39. Microdata
HTML5
Currently under standardization at the W3C
Originally part of the HTML5 spec, but now a separate
document
Comparable to RDFa 1.1 Lite
Key extensibility features (such as multiple types)
missing
HTML5 also has a number of “semantic” elements
<div itemscope itemid=“http://www.yahoo.com/resource/person”>
<p>My name is <span <video>, <article>…
such as <time>, itemprop="name">Neil</span>.</p>
<p>My band is called <span itemprop="band">Four Parts
Water</span>.
I was born on <time itemprop="birthday" datetime="2009-05-10">
May 10th 2009</time>.
<img itemprop="image" src=”me.png" alt=”me”></p>
</div>
40. Current state of metadata on the Web
31% of webpages, 5% of domains contain some
metadata
Analysis of the Bing Crawl (US crawl, January, 2012)
RDFa is most common format
By URL: 25% RDFa, 7% microdata, 9% microformat
By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
Adoption is stronger among large publishers
Especially for RDFa and microdata
See also
P. Mika, T. Potter. Metadata Statistics for a Large Web
Corpus, LDOW 2012
H.Mühleisen, C.Bizer.Web Data Commons - Extracting
Structured Data from Two Large Web Corpora, LDOW
2012
41. Exponential growth in RDFa data
Another five-fold increase
between October 2010 and
January, 2012
Five-fold increase between
March, 2009 and
October, 2010
Percentage of URLs with embedded metadata in various formats
43. Crawling the Semantic Web
Linked Data
Similar to HTML crawling, but the the crawler needs to
parse RDF/XML (and others) to extract URIs to be
crawled
Semantic Sitemap/VOID descriptions
RDFa
Same as HTML crawling, but data is extracted after
crawling
Mika et al. Investigating the Semantic Gap through Query
Log Analysis, ISWC 2010.
SPARQL endpoints
Endpoints are not linked, need to be discovered by other
means
Semantic Sitemap/VOID descriptions
44. Data fusion
Ontology matching
Widely studied in Semantic Web research, see e.g. list of
publications at ontologymatching.org
Unfortunately, not much of it is applicable in a Web context due to the
quality of ontologies
Entity resolution
Logic-based approaches in the Semantic Web
Studied as record linkage in the database literature
Machine learning based approaches, focusing on attributes
Graph-based approaches, see e.g. the work of Lisa Getoor
are applicable to RDF data
Improvements over only attribute based matching
Blending
Merging objects that represent the same real world entity and
reconciling information from multiple sources
45. Data quality assessment and curation
Heterogeneity, quality of data is an even larger issue
Quality ranges from well-curated data sets (e.g. Freebase) to
microformats
In the worst of cases, the data becomes a graph of words
Short amounts of text: prone to mistakes in data entry or
extraction
Example: mistake in a phone number or state code
Quality assessment and data curation
Quality varies from data created by experts to user-generated
content
Automated data validation
Against known-good data or using triangulation
Validation against the ontology or using probabilistic models
Data validation by trained professionals or crowdsourcing
Sampling data for evaluation
46. Indexing
Search requires matching and ranking
Matching selects a subset of the elements to be scored
The goal of indexing is to speed up matching
Retrieval needs to be performed in milliseconds
Without an index, retrieval would require streaming
through the collection
The type of index depends on the query model to
support
DB-style indexing
IR-style indexing
47. IR-style indexing
Index data as text
Create virtual documents from data
One virtual document per subgraph, resource or triple
typically: resource
Key differences to Text Retrieval
RDF data is structured
Minimally, queries on property values are required
48. Horizontal index structure
Two fields (indices): one for terms, one for properties
For each term, store the property on the same position in
the property index
Positions are required even without phrase queries
Query engine needs to support the alignment operator
✓ Dictionary is number of unique terms + number of
properties
Occurrences is number of tokens * 2
49. Vertical index structure
One field (index) per property
Positions are not required
But useful for phrase queries
Query engine needs to support fields
Dictionary is number of unique terms
Occurrences is number of tokens
✗ Number of fields is a problem for merging, query
performance
50. Distributed indexing
MapReduce is ideal for building inverted indices
Map creates (term, {doc1}) pairs
Reduce collects all docs for the same term:
(term, {doc1, doc2…}
Sub-indices are merged separately
Term-partitioned indices
Peter Mika. Distributed Indexing for Semantic
Search, SemSearch 2010.
52. Structure
Taxonomy of search approaches
Query processing / matching techniques for Semantic
Search
Types of semantic data
Formalisms for querying semantic data
Approaches
General task: hybrid graph pattern matching
Matching keyword query against text
Matching structured query against structured data
Matching keyword query against structured data
Matching structured query against text (a hybrid case)
Main tasks, challenges and opportunities
53. Taxonomy of search approaches (1)
The search problem
A collection of resources, called data
Information needs expressed as queries
Search is the task of efficiently computing results from
data that are relevant to queries
Document data retrieval vs. structured data retrieval
Differences in query and data representation and
matching
Efficiently retrieve structured data that exactly match
formal information needs expressed as structured
queries
Effectively rank textual results that match ambiguous NL /
keyword queries to a certain degree (notions of
relevance)
Semantic search: ranked retrieval of document and
54. Taxonomy of search approaches (2)
Query engines (of
databases)
Exact
Complete
Query
Sound
• Approximate
Matching
• Not complete
• Not sound
• Ranked
Data
• Best effort
• Top-k
Search engines (stand-alone/database
extensons)
Query processing mainly focuses on efficiency of matching
whereas ranking deals with degree of matching (relevance)!
55. Query processing for Semantic Search (1)
Resources represented by semantic data ranging from
Structured data with well defined schemas
Semi-structured data with incomplete or no schemas
Data that largely comprise text
Hybrid / embedded data
Information needs of varying complexity, captured using
different formalisms and querying paradigms
Natural language texts and keywords
Form-based inputs
Formal structured queries
(Search is end-user oriented paradigm, requires
“natural”, intuitive querying interfaces)
Semantic search: efficiently computing results (query
processing) from data that are relevant to queries
(ranking)
56. Query processing for Semantic Search (2)
NL Form- / facet- Structured Queries
Keywords
Questions based Inputs (SPARQL)
Ambiquities
Query
Matching
Data
RDF data Semi- OWL ontologies with
Structured
embedded in Structured rich, formal
RDF data
text (RDFa) RDF data semantics
Ambiquities: confidence degree, truth/trust
57. Query processing for Semantic Search (3)
Textual Data
Structured
query on
Keyword query
textual data
on textual
, e.g. querying
data, e.g. Web Search target
Semantic
extension for
search systems group of
different search
users, information
systems?
Unstructured needs, and types Structured
of data. Structured
Query Query
Query processing for on
Keyword query query
on structured search is hybrid
semantic structured data
data, e.g. e.g. standard
combination of techniques!
search querying
extensions for interface for
databases databases /
RDF stores
Structured Data
58. Types of data models (1)
Textual
Bag-of-words
Represent documents, text in structured data,…, real-
world objects (captured as structured data)
Lacks “structure”
in text, e.g. linguistic structure, hyperlinks, (positional
information)
Structure in structured data representation
term (statistics)
In combination with
combination
Cloud Computing
Cloud
technologies, promising
Computing
solutions for the
Technologies
management of `big
solutions
data' have emerged.
management
Existing industry
`big data'
solutions are able to
industry
support complex
solutions
queries and analytics
support
tasks with terabytes of
complex
data. For
……
example, using a
Greenplum.
59. Types of data models (2)
Textual
Structured
Resource Description Framework (RDF)
Represent real-world objects, services, applications, ….
documents
Resource attribute values and relationships between
resources
Schema
Picture
creator
Person
Bob
60. Types of data models (3)
Textual
Structured
Hybrid
RDF data embedded in text (RDFa)
61. Types of data models – RDFa (1)
…
<div about="/alice/posts/trouble_with_bob">
<h2 property="dc:title">The trouble with Bob</h2>
<h3 property="dc:creator">Alice</h3>
Bob is a good friend of mine. We went to the same university, and
also shared an apartment in Berlin in 2008. The trouble with Bob is
that he takes much better photos than I do:
<div about="http://example.com/bob/photos/sunset.jpg">
<img src="http://example.com/bob/photos/sunset.jpg" />
<span property="dc:title">Beautiful Sunset</span>
by <span property="dc:creator">Bob</span>.
</div>
</div>
…
adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
62. Types of semantic data – RDFa (2)
Bob is a good friend of mine. We content
went to the same university, and
also shared an apartment in Berlin
in 2008. The trouble with Bob is
that he takes much better photos
than I do:
content
adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
63. Types of semantic data - conclusion
Semantic data in general can be conceived
as a graph with text and structured data
items as nodes, and edges represent
different types of relationships including
explicit semantic relationships and
vaguely specified ones such as hyperlinks!
64. Formalisms for querying semantic data
(1)
Example information need
“Information about a friend of Alice, who shared
an apartment with her in Berlin and knows
someone working at KIT.”
Unstructured queries
Fully-structured queries
Hybrid queries: unstructured + structured
65. Formalisms for querying semantic data
(2)
Example information need
“Information about a friend of Alice, who shared
an apartment with her in Berlin and knows
someone working at KIT.”
Unstructured
NL
Keywords
shared apartment Berlin Alice
66. Formalisms for querying semantic data
(3)
Example information need
“Information about a friend of Alice, who shared
an apartment with her in Berlin and knows
someone working at KIT.”
Unstructured
Fully-structured
SPARQL:
BGP, filter, optional, union, select, construct, ask, describe
PREFIX ns: <http://example.org/ns#>
SELECT ?x
WHERE { ?x ns:knows ? y. ?y ns:name “Alice”.
?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” }
67. Formalisms for querying semantic data
(4)
Fully-structured
Unstructured
Hybrid: content and structure constraints
“shared apartment Berlin Alice”
?x ns:knows ? y. ?y ns:name “Alice”.
?x ns:knows ?z. ?z ns: works ?v.
?v ns:name “KIT”
68. Formalisms for querying semantic data
(5)
Fully-structured
Unstructured
Hybrid: content and structure constraints
“shared apartment Berlin Alice”
?x ns:knows ? y. ?y ns:name “Alice”.
?x ns:knows ?z. ?z ns: works ?v.
?v ns:name “KIT”
69. Formalisms for querying semantic data - conclusion
Semantic search queries can be conceived
as graph patterns with nodes referring to
text and structured data items, and edges
referring to relationships between these
items!
70. Processing hybrid graph patterns (1)
Example information need
“Information about a friend of Alice, who shared an apartment with
her in Berlin and knows someone working at KIT.”
apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
?y ns:name “Alice”. ?x ns:knows ? y
trouble with bob FluidOps 34
Peter
sunset.jpg
Bob is a good friend
Beautiful
of mine. We went to Sunset
the same Germany Semantic
Alice Search
university, and also
shared an
apartment in Berlin
in 2008. The trouble Germany 2009
Bob
with Bob is that he Thanh
takes much better
photos than I do: KIT
72. Matching keyword query against text
• Retrieve documents
• Inverted list (inverted index)
keyword {<doc1, pos, score, ...>,
<doc2, pos, score, ...>, ...}
• AND-semantics: top-k join
shared Berlin Alice shared Berlin Alice
D1 D1 D1
shared = berlin = alice
shared
73. Matching structured query against structured
data
• Retrieve data for triple patterns
• Index on tables
• Multiple “redundant” indexes to cover different access
patterns
• Join (conjunction of triples)
• Blocking, e.g. linear merge join (required sorted input)
• Non-blocking, e.g. symmetric hash-join
• Materialized join indexes
Per1 ns:works ?v ?v ns:name “KIT”
?x ns:knows ?y. ?x ns:knows ?z.
SP-index PO-index ?z ns: works ?v. ?v ns:name “KIT”
=
= =
Per1 ns:works Ins1 Ins1 ns:name KIT
Per1 ns:works Ins1 Ins1 ns:name KIT
74. Matching keyword query against structured data
• Retrieve keyword elements
• Using inverted index
keyword {<el1, score, ...>, <el2, score, ...>,…}
• Exploration / “Join”
• Data indexes for triple lookup
• Materialized index (paths up to graphs)
• Top-k Steiner tree search, top-k subgraph exploration
Alice Bob KIT Alice Bob KIT
↔ ↔
=
=
Alice ns:knows Bob Inst1 ns:name KIT
Bob ns:works Inst1
75. Matching structured query against text
• Based on offline IE (offline see Peter‟s slides)
• Based on online IE, i.e., “retrieve “ is as follows
• Derive keywords to retrieve relevant documents
• On-the-fly information extraction, i.e., phrase pattern matching “X
name Y”
• Retrieve extracted data for structured part
• Retrieve documents for derived text patterns, e.g.
sequence, windows, reg. exp. ?x ns:knows ?y. ?x ns:knows ?
?z ns: works ?v. ?v ns:name “K
knows
name KIT
76. Matching structured query against text
• Index
• Inverted index for document retrieval and pattern matching
• Join index inverted index for storing materialized joins between
keywords
• Neighborhood indexes for phrase patterns
?x ns:knows ?y. ?x ns:knows ?
?z ns: works ?v. ?v ns:name “K
KIT name
knows
name KIT
77. Query processing – main tasks
Retrieval
Documents , data
Query elements, triples, paths, graphs
Inverted index,…, but also other
(B+ tree)
Matching
Index
documents, triples, materialized
paths
Join
Data Different join
implementations, efficiency
depends on availability of indexes
Non-blocking join good for early
result reporting and for
“unpredictable” Linked Data / data
78. Query processing – more tasks
More complex queries:
disjunction, aggregation, grouping, an
Query
alytics…
Join order optimization
Approximate
Approximate the search space
Matching
Approximate the results
(matching, join)
Parallelization
Top-k
Data Use only some entries in the input
streams to produce k results
Multiple sources
Federation, routing
On-the-fly mapping, similarity join
Hybrid
79. Query processing on the Web -
research challenges and opportunities
Large amount of
semantic data
• Optimization, parallelizati
Data on
inconsistent, redunda • Approximation
nt, and low quality
• Hybrid querying and data
Large amount of data management
embedded in text • Federation, routing
Large amount of • Online schema
sources mappings
Large amount of links • Similarity join
between sources
81. Structure
Problem definition
Types of ambiguities
Ranking paradigms
Model construction
Content-based
Structure-based
82. Ranking – problem definition
Query
• Ambiguities arise when
representation is incomplete /
imprecise
Matching
• Ambiguities at the level of
• elements (content ambiguity)
• structure between elements
Data
(structure ambiguity)
Due to ambiguities in the representation of the
information needs and the underlying resources, the
results cannot be guaranteed to exactly match the query.
Ranking is the problem of determining the degree of
matching using some notions of relevance.
83. Content ambiguity
apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
?y ns:name “Alice”. ?x ns:knows ? y
trouble with bob FluidOps 34
Peter
sunset.jpg
Bob is a good friend
Beautiful
of mine. We went to Sunset
the same Germany Semantic
Alice Search
university, and also
shared an
apartment in Berlin
in 2008. The trouble Germany 2009
Bob
with Bob is that he Thanh
takes much better
photos than I do: KIT
What is meant by “Berlin” in the query? What is meant by “KIT” in the query?
What is meant by “Berlin” in the data? What is meant by “KIT” in the data?
A city with the name Berlin? a person? A research group? a university? a location?
84. Structure ambiguity
apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
?y ns:name “Alice”. ?x ns:knows ? y
trouble with bob FluidOps 34
Peter
sunset.jpg
Bob is a good friend
Beautiful
of mine. We went to Sunset
the same Germany Semantic
Alice Search
university, and also
shared an
apartment in Berlin
in 2008. The trouble Germany 2009
Bob
with Bob is that he Thanh
takes much better
photos than I do: KIT
What is the connection between What is meant by “works”?
“Berlin” and “Alice”? Works at? employed?
Friend? Co-worker?
85. Ambiguity
Recall: query processing is matching at the level of
syntax and semantics
Ambiguities arise when data or query allow for multiple
interpretations, i.e. multiple matches
Syntactic, e.g. works vs. works at
Semantic, e.g. works vs. employ
“Aboutness”, i.e., contain some elements which
represent the correct interpretation
Ambiguities arise when matching elements of different
granularities
Does i contains the interpretation for j, given some part(s) of i
(syntactically/semantically) match j
E.g. Berlin vs. “…we went to the same university, and also, we
shared an apartment in Berlin in 2008…”
Strictly speaking, ranking is performed after syntactic /
semantic matching is done!
86. Features: What to use to deal with ambiguities?
What is meant by “Berlin”? What is the
connection between “Berlin” and “Alice”?
Content features
Frequencies of terms: d more likely to be “about” a
query term k when d more often, mentions k
(probabilistic IR)
Co-occurrences: terms K that often co-occur form a
contextual interpretation, i.e., topics (cluster
hypothesis)
Structure features
Consider relevance at level of fields
Linked-based popularity
87. Ranking paradigms
Explicit relevance model
Foundation: probability ranking principle
Ranking results by the posterior probability (odds) of
being observed in the relevant class:
P(w|R) varies in different approaches, e.g., binary
independence model, 2-poisson model, relevance
model
P(D|R)
P( D | R) P(w | R) (1 P(w | N ))
P(D/N) w D w D
k
P( w | R) P( w | q1 ,..., qk ) P ( m) P ( w | m) P(qk | m)
m M i 1
88. Ranking paradigms
No explicit notion of relevance: similarity between the
query and the document model
Vector space model (cosine similarity)
Language models (KL divergence)
Sim(q, d ) Cos(( w1,d ,..., wt , d ), ( w1,q ,..., wk , q ))
P(t | q )
Sim(q, d ) KL( q || d ) P(t | q ) log(
t V P(t | d )
89. Model construction
How to obtain
Relevance models?
Weights for query / document terms?
Language models for document / queries?
90. Content-based model construction
Document statistics, e.g. • An object is more likely
Term frequency about “Berlin”?
Document length • When it contains a
Collection statistics, e.g. relatively high number
of mentions of the term
Inverse document
“Berlin”
frequency
• When the number of
Background language
mentions of this term in
models
the overall collection is
tf relatively low
wt , d idf
|d |
tf
P(t | d ) (1 ) P(t | C )
|d |
91. Structure-based model construction
Consider structure of objects during content-based
modeling, i.e., to obtain structured content-based
model
Content-based model for structured
objects, documents and for general tuples
P(t | d ) f P(t | f )
f Fd
• An object is more likely about “Berlin”?
• When one of its (important) fields contains a
relatively high number of mentions of the term “Berlin”
92. Structure-based model construction
PageRank
Link analysis algorithm
Measuring relative importance of nodes
Link counts as a vote of support
The PageRank of a node recursively depends on the
number and PageRank of all nodes that link to it
(incoming links)
ObjectRank
Types and semantics of links vary in structured data
setting
Authority transfer schema graph specifies connection
• An object about “Berlin” is more important than one another?
strengths
• When a relatively large number of objects are linked to it
Recursively compute authority transfer data graph
93. Taxonomy of ranking approaches
Explicitly vs. non-explicitly relevance-based
Content-based ranking
Structure-based ranking
Content- and-structure-based ranking
95. Search interface
Input and output functionality
helping the user to formulate complex queries
presenting the results in an intelligent manner
Semantic Search brings improvements in
Query formulation
Snippet generation
Suggesting related entities
Adaptive and interactive presentation
Presentation adapts to the kind of query and results presented
Object results can be actionable, e.g. buy this product
Aggregated search
Grouping similar items, summarizing results in various ways
Filtering (facets), possibly across different dimensions
Task completion
Help the user to fulfill the task by placing the query in a task context
96. Query formulation
“Snap-to-grid”: suggest the most likely interpretation of the
query
Given the ontology or a summary of the data
While the user is typing or after issuing the query
Example: Freebase suggest, TrueKnowledge
97. Enhanced results/Rich Snippets
Use mark-up from the webpage to generate search
snippets
Originally invented at Yahoo! (SearchMonkey)
Google, Yahoo!, Bing, Yandex now consume
schema.org markup
Validators available from Google and Bing
98. Other result presentation tasks
Select the most relevant resources within an RDF
document
Penin et al. Snippet Generation for Semantic Web
Search Engines, ASWC 2010
For each resource, rank the properties to be displayed
Natural Language Generation (NLG)
Verbalize, explain results
104. Semantic Search challenge (2010/2011)
Two tasks
Entity Search
Queries where the user is looking for a single real world object
Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW
2010.
List search (new in 2011)
Queries where the user is looking for a class of objects
Billion Triples Challenge 2009 dataset
Evaluated using Amazon‟s Mechanical Turk
Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST
2010
Blanco et al. Repeatable and Reliable Search System
Evaluation using Crowd-Sourcing, SIGIR2011
106. Catching the bad guys
Payment can be rejected for workers who try to game
the system
An explanation is commonly expected, though cheaters rarely
complain
We opted to mix control questions into the real results
Gold-win cases that are known to be perfect
Gold-loose cases that are known to be bad
Metrics
Worker and std. dev onReal
Avg.
Known gold-win and gold-loose results
Known Good Total Time to
Time to complete
bad N complete
N Mean N Mean N Mean (sec)
badguy 20 2.556 200 2.738 20 2.684 240 29.6
goodguy 13 1 130 2.038 13 3 156 95
whoknows 1 1 21 1.571 2 3 24 83.5
107. Other evaluations
TREC Entity Track
Related Entity Finding
Entities related to a given entity through a particular relationship
Retrieval over documents (ClueWeb 09 collection)
Example: (Homepages of) airlines that fly Boeing 747
Entity List Completion
Given some elements of a list of entities, complete the list
Question Answering over Linked Data
Retrieval over specific datasets (Dbpedia and
MusicBrainz)
Full natural language questions of different forms
Correct results defined by an equivalent SPARQL query
Example: Give me all actors starring in Batman Begins.
109. Resources
Books
Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern
Information Retrieval. ACM Press. 2011
Survey papers
Thanh Tran, Peter Mika. Survey of Semantic Search
Approaches. Under submission, 2012.
Conferences and workshops
ISWC, ESWC, WWW, SIGIR, CIKM, SemTech
Semantic Search workshop series
Exploiting Semantic Annotations in Information Retrieval
(ESAIR)
Entity-oriented Search (EOS) workshop
Editor's Notes
Search is a form of content aggregation
- In recent years we have witnessed tremendous interest and substantial economic exploitation of search technologies, both at web and enterprise scale. In this regard, technologies for Information Retrieval (IR) can be distinguished from solutions in the field of Database (DB) and Knowledge-based Expert Systems (KB). Whereas IR applications support the retrieval of documents (document retrieval), DB and KB systems deliver more precise answers (data retrieval). The technical differences between these two paradigms can be broken down into three main dimensions, i.e. the representation of the user need (query model), the underlying resources (data model) and the matching technique. The representation of user need and resource content in current IR systems is still almost exclusively achieved by the lightweight syntax-centric models such as the predominant keyword paradigm (i.e. keyword queries matched against bag-of-words document representation). While these systems have shown to work well for topical search, i.e. retrieve documents based on a topic, they usually fail to address more complex information needs. Using more expressive models for the representation of the user need, and leveraging the structure and semantics inherent in the data, DB and KB systems allow for complex queries, and for the retrieval of concrete answers that precisely match them.
Semantic search can be seen as a retrieval paradigm Centered on the use of semanticsIncorporates the semantics entailed by the query and (or) the resources into the matching process, it essentially performs semantic search.
Another trend resulting from this convergence of textual, structured and semantic data is the need for hybrid semantic search systems. Whilestandard IR focuses on the retrieval of documents, the DB and KB systems are built for data retrieval. As opposed to these types of systems, hybrid search systems manage the different types of data in a holistic way, and support the retrieval of answers that are integrated units of information, possibly assembled from different types of data. In such a system, there is not only a convergence at the
We implemented the search paradigms and integrated them as separate search modules into a demonstrator system of the Information Workbench7 that has been developed as a showcase for interaction with the Web of data. In particular, keyword search is implemented according to the design and technologies employed by standard Semantic Web search engines. Like Sindice and FalconS, we use an invertedindex to store and retrieve RDF resources based on terms. Also using the inverted index, faceted search is implemented based on the techniques discussed in [25]. Result completion is based on recent work discussed for the TASTIER system [8]. For computing join graphs, we use the top-k procedure elaborated in [9]. This technique is also used for computing top-k interpretations, i.e. to support query completion. We choose to display the top-6 queries and the top-25 results respectively.
- resource-base navigation
Facebook invited, but continues to pursue OGP
>> >> Intro Session: motivation, overview etc.: 10 min >> >>1)Representation of the Search Space (M) 45>> >>2)Offline Preprocessing: Crawling and Indexing (P) 60>> >>3)Query Processing (T) 4)Matching (T) 90>> >>5)Ranking 60 >> >>6)Result Presentation (45)>> >>7)Evaluation (45)>> >>Demo Session (30)>> >> Wrap-up Session (10)
Close to the topic of keyword-search in databases, except knowledge-bases have a schema-oblivious designDifferent papers assume vastly different query needs even on the same type of data
>> >> Intro Session: motivation, overview etc.: 10 min >> >>1)Representation of the Search Space (M) 45>> >>2)Offline Preprocessing: Crawling and Indexing (P) 60>> >>3)Query Processing (T) 4)Matching (T) 90>> >>5)Ranking 60 >> >>6)Result Presentation (45)>> >>7)Evaluation (45)>> >>Demo Session (30)>> >> Wrap-up Session (10)
Approximate many results need rankingRanking also needed in the case where qp is complete and sound, but queries and data representation so imprecise such that we have to deal with too many results
Miss structural information in textsHyperlinksLinguistic structurePositional information
- Real world objects
SELECT Returns all, or a subset of, the variables bound in a query pattern match. CONSTRUCT Returns an RDF graph constructed by substituting variables in a set of triple templates. ASK Returns a boolean indicating whether a query pattern matches or not. DESCRIBE Returns an RDF graph that describes the resources found. Graph patterns are defined recursively. A graph pattern may have zero or more optional graph patterns, and any part of a query pattern may have an optional part. In this example, there are two optional graph patterns.Section 6 introduces the ability to make portions of a query optional; Section 7 introduces the ability to express the disjunction of alternative graph patterns; and Section 8 introduces the ability to constrain portions of a query to particular source graphs. Section 8 also presents SPARQL's mechanism for defining the source graphs for a query.Basic graph patterns allow applications to make queries where the entire query pattern must match for there to be a solution. For every solution of a query containing only group graph patterns with at least one basic graph pattern, every variable is bound to an RDF Term in a solution. However, regular, complete structures cannot be assumed in all RDF graphs. It is useful to be able to have queries that allow information to be added to the solution where the information is available, but do not reject the solution because some part of the query pattern does not match. Optional matching provides this facility: if the optional part does not match, it creates no bindings but does not eliminate the solution.The UNION pattern combines graph patterns; each alternative possibility can contain more than one triple pattern:SPARQL provides a means of combining graph patterns so that one of several alternative graph patterns may match. If more than one of the alternatives matches, all the possible pattern solutions are found.
Web data: Text+ Linked Data+ Semi-structured RDF+ Hybrid datathat can be conceived as forming data graphsHear abour bob and alice all the time (in computer science literatures), want to find out more… build Semantic Web search engine. To address complex information needs by exploiting Web data:- Query as a set of constrains Match structured data Match text
- Less than 5 percent of IR papers deal with query processing and the aspect of efficiency
Partitioning has impact on performance!)Blocking: iterator-based approachesNon-blocking: good for streaming, good we cannot wait for some parts of the results to be completely worked-offLink data: cannot wait for sources, (some are slower then other) thus better to push data into query processing as the they come instead of pulling data and wait (busy waiting)Top-k:
-phrase patterns (e.g., “X is the capital of Y”) for large scale extraction. Such simple patterns, when coupled with the richness and redundancy of theWeb, can be very useful in scraping millions or even billions of facts from the Web.- patterns: Matched when keywords or data types Xi appear in sequence. Matched if all keywords/data types/patterns appear within an m-words window.For extraction: relation patternsFor text search: entity patterns -When not assuming that all relevant data can be extracted such matching against text still needed: Hybrid search
-phrase patterns (e.g., “X is the capital of Y”) for large scale extraction. Such simple patterns, when coupled with the richness and redundancy of theWeb, can be very useful in scraping millions or even billions of facts from the Web.- patterns: Matched when keywords or data types Xi appear in sequence. Matched if all keywords/data types/patterns appear within an m-words window.For extraction: relation patternsFor text search: entity patterns -When not assuming that all relevant data can be extracted such matching against text still needed: Hybrid search
Given some materialized indexes no joins at all Given sorted inputs sorted merge join
Given some materialized indexes no joins at all Given sorted inputs sorted merge joinjoin
Every task is a challenge of itself, some more some less well elaboratedThere are separate challenges for every problems
>> >> Intro Session: motivation, overview etc.: 10 min >> >>1)Representation of the Search Space (M) 45>> >>2)Offline Preprocessing: Crawling and Indexing (P) 60>> >>3)Query Processing (T) 4)Matching (T) 90>> >>5)Ranking 60 >> >>6)Result Presentation (45)>> >>7)Evaluation (45)>> >>Demo Session (30)>> >> Wrap-up Session (10)
Approximate many results need rankingRanking also needed in the case where qp is complete and sound, but queries and data representation so imprecise such that we have to deal with too many results
Web data: Text+ Linked Data+ Semi-structured RDF+ Hybrid datathat can be conceived as forming data graphsHear abour bob and alice all the time (in computer science literatures), want to find out more… build Semantic Web search engine. To address complex information needs by exploiting Web data:- Query as a set of constrains Match structured data Match text
Syntactic works vs.works at Semantic works vs. employ
text which contains a large mentions of “Berlin” is likely to be about “Berlin”i is more likely to be the correct interpretation of K when terms in K co-occur in a large number of context (bag of words) associated with i“Berlin and apartment” more often co-occur in the geographic location context/topic than in the context of people“Berlin and apartment” more often co-occur in the geographic location context/topic than in the context of people