Semantic Search tutorial at SemTech 2012

Semantic Search Tutorial
Introduction

Peter Mika | Yahoo! Research, Spain
pmika@yahoo-inc.com

Thanh Tran | Institute AIFB, KIT, Germany
Tran@aifb.uni-karlsruhe.de

About the speakers
 Peter Mika
 Senior Research Scientist
 Head of Semantic Search group at Yahoo! Research in
Barcelona
 Semantic Search, Web Object Retrieval, Natural
Language Processing
 Tran Duc Thanh
 Assistant Professor at AIFB, Karlsruhe Institute of Technology
 Head of Semantic Search group
 Semantic Search, Semantic / Linked Data Management

Agenda

 Introduction (10 min)
 Semantic Web data (50 min)
 The RDF data model
 Publishing RDF
 Crawling and indexing RDF data
 Query processing (40 min)
 Ranking (30 min)
 Result presentation (15 min)
 Semantic Search evaluation (15 min)
 Questions (5 min)

Why Semantic Search? I.
 “We are at the beginning of search.“ (Marissa Mayer)
 Solved large classes of queries, e.g. navigational
 Heavy investment in computational power
 Remaining queries are hard, not solvable by brute force,
and require a deep understanding of the world and
human cognition
 Background knowledge and metadata can help to
address poorly solved queries

Poorly solved information needs
Many of these queries would not be asked by users, who learned over time what search technology can and cannot do.
 Ambiguous searches
 paris hilton
 Long tail queries
 george bush (and I mean the beer brewer in Arizona)
 Multimedia search
 paris hilton sexy
 Imprecise or overly precise searches
 jim hendler
 pictures of strong adventures people
 Precise searches for descriptions
 countries in africa
 32 year old computer scientist living in barcelona
 reliable digital camera under 300 dollars

Example: multiple interpretations

Why Semantic Search? II.
 The Semantic Web is now a reality
 Large amounts of data published in RDF
 Heterogeneous data of varying quality
 Users who are not skilled in writing complex queries (e.g.
SPARQL) and may not be experts in the domain
 Searching data instead or in addition to searching
documents
 Direct answers
 Novel search tasks

Example: direct answers in search

[Screenshot callouts]
Points of interest in Vienna, Austria
Faceted search for Shopping results
Information from the Knowledge Graph
Information box with content from and links to Yahoo! Travel
Since Aug, 2010, 'regular' search results are 'Powered by Bing'

Novel search tasks
 Aggregation of search results
 e.g. price comparison across websites
 Analysis and prediction
 e.g. world temperature by 2020
 Semantic profiling
 recommendations based on particular interests
 Semantic log analysis
 understanding user behavior in terms of objects
 Support for complex tasks
 e.g. booking a vacation using a combination of services

Contextual (pervasive, ambient) search
Yahoo! Connected TV: Widget engine embedded into the TV

Yahoo! IntoNow: recognize audio and show related content

Interactive search and task completion

Document retrieval and data retrieval
 Information Retrieval (IR) supports the retrieval of documents (document retrieval)
 Representation based on lightweight syntax-centric models
 Work well for topical search
 Not so well for more complex information needs
 Web scale
 Database (DB) and Knowledge-based Systems (KB)
deliver more precise answers (data retrieval)
 More expressive models
 Allow for complex queries
 Retrieve concrete answers that precisely match queries
 Not just matching and filtering, but also joins
 Limitations in scalability

Combination of document and data
retrieval
 Documents with metadata
 Metadata may be embedded inside the document
 I’m looking for documents that mention countries in
Africa.
 Data retrieval
 Structured data, but searchable text fields
 I’m looking for directors, who have directed movies
where the synopsis mentions dinosaurs.

Semantic Search
 Target (combination of) document and data retrieval
 Semantic search is a retrieval paradigm that
 Exploits the structure/semantics of the data or explicit
background knowledge to understand user intent and the
meaning of content
 Incorporates the intent of the query and the meaning of
content into the search process (semantic models)
 Wide range of semantic search systems
 Employ different semantic models, possibly at different
steps of the search process and in order to support
different tasks

Semantic models
 Semantics is concerned with the meaning of the
resources made available for search
 Various representations of meaning
 Linguistic models: models of relationships among
words
 Taxonomies, thesauri, dictionaries of entity names
 Inference along linguistic relations, e.g. broader/narrower
terms
 Conceptual models: models of relationships among
objects
 Ontologies capture entities in the world and their
relationships
 Inference along domain-specific relations
 We will focus on conceptual models in this tutorial
 In particular, the RDF/OWL conceptual model for

Semantic Search – a process view

[Diagram: Knowledge Representation (Semantic Models, Resources) and Document Representation (Documents) feed a four-step search cycle]

Query Construction
• Keywords
• Forms
• NL
• Formal language

Query Processing
• IR-style matching & ranking
• DB-style precise matching
• KB-style matching & inferences

Result Presentation
• Query visualization
• Document and data presentation
• Summarization

Query Refinement
• Implicit feedback
• Explicit feedback
• Incentives

Semantic Search systems
For data / document retrieval, semantic search
systems might combine a range of techniques, ranging
from statistics-based IR methods for ranking,
database methods for efficient indexing and query
processing, up to complex reasoning techniques for
making inferences!

Example: Information Workbench
 Addressing the lifecycle of interacting with the Web of Data
 Integration of data sources
 Content generation by the end user
 Search and Exploration
 Visualization
 Publishing
 Integrated management of heterogeneous data sources
 Structured and unstructured
 Published and user-generated
 Static and dynamic
 Open domain
[Figure labels: user-generated content; Wikipedia; DBpedia, Yago; Earthquake (Data.gov); structured; dynamic]

Data Sources in the Application
 Entire English Wikipedia

 Data from Linked Open Data
 DBpedia
 YAGO
…

 Data from Data.gov (US Government)
 E.g. live data about earthquakes

 Many more

Semantic Search
 Hybrid Search: Structured queries combined with
keywords across structured and unstructured data
sources

 Query interpretation: Translation of keywords into
hybrid queries

 Keyword search/query interpretation combined
with faceted search: iterative refinement process
based on keywords and operations on facets

Search, Refinement and Navigation
[Screenshot callouts: Keywords; Query Translations; Term Completions; Facets]

Result Inspection, Analysis and Browsing

Data on the Web
 Data on the Web is not directly accessible
 Most web pages are generated from databases, but
formatted for human consumption
 APIs offer limited views over data
 Two solutions
 Extraction using Information Extraction (IE) techniques
 Out of scope for this tutorial
 Relying on publishers to expose structured data using
standard Semantic Web formats

Information extraction
 Natural Language Processing
 Named entity recognition and disambiguation, sentiment
analysis etc.
 Extraction of information about entities
 Suchanek et al. YAGO: A Core of Semantic Knowledge
Unifying WordNet and Wikipedia, WWW, 2007.
 Wu and Weld. Autonomously Semantifying Wikipedia, CIKM
2007.
 Extraction from HTML tables
 Cafarella et al. WebTables: Exploring the Power of Tables on
the Web. VLDB 2008
 Wrapper induction
 Kushmerick et al. Wrapper Induction for Information Extraction. IJCAI 1997
 Filling web forms automatically (form-filling)
 Madhavan et al. Google's Deep-Web Crawl. VLDB 2008

Semantic Web
 Sharing data across the Web
 Standard data model
 RDF
 A number of syntaxes (file formats)
 RDF/XML, RDFa
 Powerful, logic-based languages for schemas
 OWL, RIF
 Query languages and protocols
 HTTP, SPARQL

Resource Description Framework (RDF)
 Each resource (thing, entity) is identified by a URI
 Globally unique identifiers
 URLs are a subset of URIs
 Often abbreviated using namespaces
 e.g. example:roi = http://example.org/roi
 RDF represents knowledge as a set of triples
 Each triple is a single fact about the entity (an attribute or a
relationship)
 A set of triples forms an RDF graph
RDF document:
example:roi type foaf:Person
example:roi name “Roi Blanco”
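
The triple model above can be sketched in a few lines of Python. This is only an illustration: the prefix table is an assumption (the slide gives just the example: prefix), and mapping the slide's "type" and "name" edges to rdf:type and foaf:name is also an assumption.

```python
# Hypothetical prefix table; only "example" is taken from the slide.
PREFIXES = {
    "example": "http://example.org/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(curie):
    """Expand a prefixed name such as 'example:roi' into a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


# An RDF graph is a set of (subject, predicate, object) triples.
graph = {
    (expand("example:roi"), expand("rdf:type"), expand("foaf:Person")),
    (expand("example:roi"), expand("foaf:name"), "Roi Blanco"),
}


def facts_about(graph, subject):
    """Return all (predicate, object) pairs stated about one resource."""
    return sorted((p, o) for s, p, o in graph if s == subject)
```

Calling facts_about(graph, expand("example:roi")) lists the two facts of the example document.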

Linking resources
[Figure: an RDF graph linking three sources: Roi's homepage, the Friend-of-a-Friend ontology, and Yahoo!'s website]
Nodes: example:roi, foaf:Person, “Roi Blanco”, #roi2, #peter, “pmika@yahoo-inc.com”
Edge labels: type, name, knows, sameAs, worksWith, email

Ontologies
 Ontologies are the schemas for RDF graphs
 Define the intended meaning of certain classes and
relationships in a domain
 e.g. the FOAF ontology defines the foaf:Person class and
relationships such as foaf:knows
 Ontologies are published in standard languages such
as OWL
 Language defines the constructs for modeling and their
meaning
 e.g. subClassOf, sameAs
 Tools for editing ontologies such as Protégé, TopBraid
Composer
 Reusing and extending well-known ontologies helps to
interpret unknown data
 e.g. if X is subClassOf foaf:Person, and A is of type
X, then we know that A is a person
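
The subClassOf inference in the last bullet can be sketched as a small fixpoint computation. The class ex:Researcher and the instance ex:alice are hypothetical stand-ins for X and A; a real system would use an OWL/RDFS reasoner rather than this loop.

```python
def infer_types(triples):
    """Add (a, 'type', C) whenever a has type X and X is a subClassOf C.

    Repeats until nothing new is derived, so chains of subclasses are followed.
    """
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p != "type":
                continue
            for s2, p2, o2 in list(inferred):
                if p2 == "subClassOf" and s2 == o and (s, "type", o2) not in inferred:
                    inferred.add((s, "type", o2))
                    changed = True
    return inferred


triples = {
    ("ex:Researcher", "subClassOf", "foaf:Person"),  # hypothetical class X
    ("ex:alice", "type", "ex:Researcher"),           # hypothetical instance A
}
closure = infer_types(triples)
```

After the call, closure also contains the derived fact that ex:alice is of type foaf:Person.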

Example: schema.org
 Agreement on a shared set of schemas for common
types of web content
 Bing, Google, and Yahoo! as initial supporters
 Similar in intent to sitemaps.org (2006)
 Use a single format to communicate the same information to all
three search engines
 Support for microdata
 schema.org covers areas of interest to all search
engines
 Business listings (local), creative works
(video), recipes, reviews
 User defined extensions
 Each search engine continues to develop its products

Documentation and OWL ontology

Publishing RDF

 Interlinked RDF documents (Linked Data)
 Each document describes a single resource with URIs
pointing to related resources
 Common RDF file formats are RDF/XML and Turtle
 Mostly implemented as a wrapper around a database or
Web service
 Embedding RDF inside HTML
 RDFa, microdata
 SPARQL endpoints
 Triple stores are databases for managing RDF data
 SPARQL is a standard protocol and query language for
accessing triple stores using HTTP

Linked Data
 Grass-roots community effort to (re)publish open
datasets in RDF
 Centered around the DBpedia project
 Directory of datasets at linkeddata.org, thedatahub.org

Metadata in HTML anno 1995
<HTML>
<HEAD profile="http://dublincore.org/documents/dcq-html/">
<META name="DC.author" content="Peter Mika">
<LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" />
<LINK rel="meta" type="application/rdf+xml" title="FOAF" href="http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
…
</HTML>

[Callouts: What does this term mean? How is this data related?]

Microformats (anno 2003)
 Mark-up for data in HTML pages
 Reuse existing HTML elements (class, rel)
 Microformats exist for a limited set of objects
 Persons, organizations, reviews, events etc., see microformats.org
 No formal schemas
 Limited reuse, extensibility of schemas
 Unclear which combinations are allowed
 No identifiers for entities
 No interlinking between entities

<div class="vcard">
<a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
<div class="tel">+1-919-555-7878</div>
<div class="title">Area Administrator, Assistant</div>
</div>

RDFa and RDFa Lite (2008-2012)
 W3C standards for embedding RDF data in HTML
documents
 A set of new HTML attributes to be used in head or body
 A specification of how to extract the data from these
attributes
 RDFa is just a syntax; you have to choose a vocabulary separately
 RDFa family
 RDFa 1.0
 Recommendation since October, 2008
 RDFa 1.1 is a small update on RDFa to make it easier to
use
 RDFa Primer (Working Draft, May 8, 2012)
 RDFa 1.1 Lite is a subset of RDFa 1.1 with the most

Example: Facebook's Open Graph Protocol
 The 'Like' button provides publishers with a way to promote their content on Facebook and build communities
 Shows up in profiles and news feed
 Site owners can later reach users who have liked an object
 Facebook Graph API allows 3rd party developers to access the data
 Open Graph Protocol is an RDFa-based format that allows publishers to describe the object that the user 'Likes'

Example: Facebook's Open Graph Protocol
 RDF vocabulary to be used in conjunction with RDFa
 Simplify the work of developers by restricting the freedom in RDFa
 Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment
 Only HTML <head> accepted

<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url"
content="http://www.imdb.com/title/tt0117500/" />
<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> …
</head> ...
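
Since Open Graph data lives only in <meta> elements of the head, a consumer can read it with a very small parser. The sketch below uses Python's standard html.parser module; the class name OpenGraphParser and the shortened PAGE string are assumptions made for illustration, and this is not how Facebook itself processes pages.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect og:* properties from <meta property="..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        if prop and prop.startswith("og:") and "content" in attributes:
            self.properties[prop] = attributes["content"]


# Shortened version of the page shown above.
PAGE = """<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head></html>"""

parser = OpenGraphParser()
parser.feed(PAGE)
og = parser.properties
```

The resulting dictionary og maps each og: property to its content value, which is all that is needed to describe the liked object.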

Microdata
 HTML5
 Currently under standardization at the W3C
 Originally part of the HTML5 spec, but now a separate
document
 Comparable to RDFa 1.1 Lite
 Key extensibility features (such as multiple types)
missing
 HTML5 also has a number of “semantic” elements such as <time>, <video>, <article>…
<div itemscope itemid="http://www.yahoo.com/resource/person">
<p>My name is <span itemprop="name">Neil</span>.</p>
<p>My band is called <span itemprop="band">Four Parts Water</span>.
I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
<img itemprop="image" src="me.png" alt="me"></p>
</div>

Current state of metadata on the Web
 31% of webpages, 5% of domains contain some
metadata
 Analysis of the Bing Crawl (US crawl, January, 2012)
 RDFa is the most common format
 By URL: 25% RDFa, 7% microdata, 9% microformat
 By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
 Adoption is stronger among large publishers
 Especially for RDFa and microdata

 See also
 P. Mika, T. Potter. Metadata Statistics for a Large Web
Corpus, LDOW 2012
 H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora, LDOW 2012

Exponential growth in RDFa data

Five-fold increase between March, 2009 and October, 2010
Another five-fold increase between October, 2010 and January, 2012

[Figure: Percentage of URLs with embedded metadata in various formats]

Crawling the Semantic Web
 Linked Data
 Similar to HTML crawling, but the crawler needs to parse RDF/XML (and others) to extract URIs to be crawled
 Semantic Sitemap/VOID descriptions
 RDFa
 Same as HTML crawling, but data is extracted after
crawling
 Mika et al. Investigating the Semantic Gap through Query
Log Analysis, ISWC 2010.
 SPARQL endpoints
 Endpoints are not linked, need to be discovered by other
means
 Semantic Sitemap/VOID descriptions

Data fusion
 Ontology matching
 Widely studied in Semantic Web research, see e.g. list of
publications at ontologymatching.org
 Unfortunately, not much of it is applicable in a Web context due to the
quality of ontologies
 Entity resolution
 Logic-based approaches in the Semantic Web
 Studied as record linkage in the database literature
 Machine learning based approaches, focusing on attributes
 Graph-based approaches (see e.g. the work of Lise Getoor) are applicable to RDF data
 Improvements over only attribute based matching
 Blending
 Merging objects that represent the same real world entity and
reconciling information from multiple sources

Data quality assessment and curation
 Heterogeneity and quality of data are an even larger issue
 Quality ranges from well-curated data sets (e.g. Freebase) to
microformats
 In the worst of cases, the data becomes a graph of words
 Short amounts of text: prone to mistakes in data entry or
extraction
 Example: mistake in a phone number or state code

 Quality assessment and data curation
 Quality varies from data created by experts to user-generated
content
 Automated data validation
 Against known-good data or using triangulation
 Validation against the ontology or using probabilistic models
 Data validation by trained professionals or crowdsourcing
 Sampling data for evaluation

Indexing
 Search requires matching and ranking
 Matching selects a subset of the elements to be scored
 The goal of indexing is to speed up matching
 Retrieval needs to be performed in milliseconds
 Without an index, retrieval would require streaming
through the collection
 The type of index depends on the query model to
support
 DB-style indexing
 IR-style indexing

IR-style indexing
 Index data as text
 Create virtual documents from data
 One virtual document per subgraph, resource or triple
 typically: resource

 Key differences to Text Retrieval
 RDF data is structured
 Minimally, queries on property values are required
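
Creating one virtual document per resource can be sketched as grouping triples by subject while keeping the property as a field name, so that queries on property values stay possible. The function name and the sample triples are hypothetical; the affiliation value in particular is invented for illustration.

```python
from collections import defaultdict


def virtual_documents(triples):
    """Group literal-valued triples by subject: one fielded document per resource."""
    docs = defaultdict(lambda: defaultdict(list))
    for subject, prop, value in triples:
        docs[subject][prop].extend(value.lower().split())
    return {subject: dict(fields) for subject, fields in docs.items()}


triples = [
    ("ex:roi", "name", "Roi Blanco"),
    ("ex:roi", "affiliation", "Yahoo Research"),  # hypothetical value
    ("ex:peter", "name", "Peter Mika"),
]
docs = virtual_documents(triples)
```

Each entry of docs is a bag of tokens per property, which is the input assumed by the fielded index structures discussed next.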

Horizontal index structure

 Two fields (indices): one for terms, one for properties
 For each term, store the property on the same position in
the property index
 Positions are required even without phrase queries
 Query engine needs to support the alignment operator
 ✓ Dictionary is number of unique terms + number of
properties
 Occurrences is number of tokens * 2
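
A minimal in-memory sketch of the horizontal layout follows: one index for terms and one for properties, aligned by token position. The function names and sample documents are assumptions; a production engine would store compressed posting lists rather than Python sets.

```python
from collections import defaultdict


def build_horizontal_index(docs):
    """docs: {doc_id: [(property, text), ...]} -> (term index, property index).

    Both indices map a key to {doc_id: set(positions)}. Position i in the
    property index records the property of the token at position i.
    """
    term_index = defaultdict(lambda: defaultdict(set))
    prop_index = defaultdict(lambda: defaultdict(set))
    for doc_id, fields in docs.items():
        position = 0
        for prop, text in fields:
            for token in text.lower().split():
                term_index[token][doc_id].add(position)
                prop_index[prop][doc_id].add(position)
                position += 1
    return term_index, prop_index


def aligned_match(term_index, prop_index, term, prop):
    """Alignment operator: docs where `term` occurs at a position labelled `prop`."""
    hits = set()
    for doc_id, positions in term_index.get(term, {}).items():
        if positions & prop_index.get(prop, {}).get(doc_id, set()):
            hits.add(doc_id)
    return hits


docs = {
    "ex:roi": [("name", "Roi Blanco"), ("city", "Barcelona")],
    "ex:bar": [("name", "Barcelona Bar")],
}
terms, props = build_horizontal_index(docs)
```

Every token produces one posting in each index, which is why the number of occurrences doubles while the dictionary only grows by the number of properties.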

Vertical index structure

 One field (index) per property
 Positions are not required
 But useful for phrase queries
 Query engine needs to support fields
 Dictionary is number of unique terms
 Occurrences is number of tokens
 ✗ Number of fields is a problem for merging, query
performance
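
For comparison, the vertical layout can be sketched as one inverted index per property. Again the function names and sample data are hypothetical; positions are omitted because this sketch does not support phrase queries.

```python
from collections import defaultdict


def build_vertical_index(docs):
    """docs: {doc_id: {property: text}} -> {property: {term: set(doc_ids)}}."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, fields in docs.items():
        for prop, text in fields.items():
            for token in text.lower().split():
                index[prop][token].add(doc_id)
    return index


def search_field(index, prop, term):
    """Return the documents whose field `prop` contains `term`."""
    return set(index.get(prop, {}).get(term, set()))


docs = {
    "ex:roi": {"name": "Roi Blanco", "city": "Barcelona"},
    "ex:bar": {"name": "Barcelona Bar"},
}
index = build_vertical_index(docs)
```

Here the number of top-level entries in index equals the number of distinct properties, which is the quantity that becomes a problem on heterogeneous Web data.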

Distributed indexing
 MapReduce is ideal for building inverted indices
 Map creates (term, {doc1}) pairs
 Reduce collects all docs for the same term:
(term, {doc1, doc2…})
 Sub-indices are merged separately
 Term-partitioned indices
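
The map and reduce steps above can be simulated in a single process to show the data flow. This is a sketch only: the shuffle is a plain dictionary, the function names are assumptions, and a real job would run on a MapReduce framework with the merge step handled separately.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_phase(term, doc_ids):
    """Collect all documents for the same term into one sorted posting list."""
    return term, sorted(doc_ids)


def build_index(collection):
    grouped = defaultdict(list)  # stands in for the shuffle step
    for doc_id, text in collection.items():
        for term, emitted_doc in map_phase(doc_id, text):
            grouped[term].append(emitted_doc)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


collection = {"doc1": "semantic search", "doc2": "semantic web"}
index = build_index(collection)
```

Because each reducer sees all postings of a term, the output is naturally partitioned by term.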

 Peter Mika. Distributed Indexing for Semantic
Search, SemSearch 2010.

Structure
 Taxonomy of search approaches
 Query processing / matching techniques for Semantic
Search
 Types of semantic data
 Formalisms for querying semantic data
 Approaches
 General task: hybrid graph pattern matching
 Matching keyword query against text
 Matching structured query against structured data
 Matching keyword query against structured data
 Matching structured query against text (a hybrid case)
 Main tasks, challenges and opportunities

Taxonomy of search approaches (1)
 The search problem
 A collection of resources, called data
 Information needs expressed as queries
 Search is the task of efficiently computing results from
data that are relevant to queries
 Document data retrieval vs. structured data retrieval
 Differences in query and data representation and
matching
 Efficiently retrieve structured data that exactly match
formal information needs expressed as structured
queries
 Effectively rank textual results that match ambiguous NL /
keyword queries to a certain degree (notions of
relevance)
 Semantic search: ranked retrieval of document and

Taxonomy of search approaches (2)
Query engines (of databases)
 Exact
 Complete
 Sound

Search engines (stand-alone/database extensions)
• Approximate
• Not complete
• Not sound
• Ranked
• Best effort
• Top-k

[Figure: Query, Matching, Data]
Query processing mainly focuses on efficiency of matching, whereas ranking deals with degree of matching (relevance)!

Query processing for Semantic Search (1)
 Resources represented by semantic data ranging from
 Structured data with well defined schemas
 Semi-structured data with incomplete or no schemas
 Data that largely comprise text
 Hybrid / embedded data
 Information needs of varying complexity, captured using
different formalisms and querying paradigms
 Natural language texts and keywords
 Form-based inputs
 Formal structured queries
(Search is an end-user oriented paradigm and requires “natural”, intuitive querying interfaces)
 Semantic search: efficiently computing results (query
processing) from data that are relevant to queries
(ranking)

Query: Keywords | NL Questions | Form- / facet-based Inputs | Structured Queries (SPARQL)
Ambiguities

Matching

Data: RDF data embedded in text (RDFa) | Semi-structured RDF data | Structured RDF data | OWL ontologies with rich, formal semantics
Ambiguities: confidence degree, truth/trust

Unstructured Query, Textual Data: keyword query on textual data, e.g. Web Search
Structured Query, Textual Data: structured query on textual data, e.g. querying extension for search systems?
Unstructured Query, Structured Data: keyword query on structured data, e.g. search extensions for databases
Structured Query, Structured Data: structured query on structured data, e.g. standard querying interface for databases / RDF stores

Semantic search systems target different groups of users, information needs, and types of data. Query processing for semantic search is a hybrid combination of techniques!

Types of data models (1)
 Textual
 Bag-of-words
 Represent documents, text in structured data,…, real-
world objects (captured as structured data)
 Lacks “structure”
 in text, e.g. linguistic structure, hyperlinks, (positional
information)
 Structure in structured data representation

[Figure: a text and its bag-of-words representation]
Text: In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using a Greenplum.
term (statistics): combination, Cloud, Computing, Technologies, solutions, management, `big data', industry, solutions, support, complex, ……

 Textual
 Structured
 Resource Description Framework (RDF)
 Represent real-world objects, services, applications, ….
documents
 Resource attribute values and relationships between
resources
 Schema
[Figure: schema graph with the classes Picture and Person linked by creator, and the instance Bob]

 Textual
 Structured
 Hybrid
 RDF data embedded in text (RDFa)

Types of data models – RDFa (1)
…
<div about="/alice/posts/trouble_with_bob">
<h2 property="dc:title">The trouble with Bob</h2>
<h3 property="dc:creator">Alice</h3>

Bob is a good friend of mine. We went to the same university, and
also shared an apartment in Berlin in 2008. The trouble with Bob is
that he takes much better photos than I do:

<div about="http://example.com/bob/photos/sunset.jpg">
<img src="http://example.com/bob/photos/sunset.jpg" />
<span property="dc:title">Beautiful Sunset</span>
by <span property="dc:creator">Bob</span>.
</div>
</div>
…
adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/

Types of semantic data – RDFa (2)

[Figure: graph view of the RDFa example, in which the following text is attached to the resources as content]
Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:

adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/

Types of semantic data - conclusion

Semantic data in general can be conceived
as a graph with text and structured data
items as nodes, and edges represent
different types of relationships including
explicit semantic relationships and
vaguely specified ones such as hyperlinks!

Formalisms for querying semantic data (1)

Example information need
“Information about a friend of Alice, who shared
an apartment with her in Berlin and knows
someone working at KIT.”

 Unstructured queries
 Fully-structured queries
 Hybrid queries: unstructured + structured

Formalisms for querying semantic data (2)

 Unstructured
 NL
 Keywords
shared apartment Berlin Alice

Formalisms for querying semantic data (3)


 Unstructured
 Fully-structured
 SPARQL: BGP, filter, optional, union, select, construct, ask, describe
PREFIX ns: <http://example.org/ns#>
SELECT ?x
WHERE { ?x ns:knows ?y . ?y ns:name "Alice" .
        ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT" }

Formalisms for querying semantic data (4)
 Unstructured
 Hybrid: content and structure constraints

“shared apartment Berlin Alice”

?x ns:knows ?y. ?y ns:name “Alice”.
?x ns:knows ?z. ?z ns:works ?v.
?v ns:name “KIT”


Formalisms for querying semantic data - conclusion

Semantic search queries can be conceived
as graph patterns with nodes referring to
text and structured data items, and edges
referring to relationships between these
items!

Processing hybrid graph patterns (1)
“Information about a friend of Alice, who shared an apartment with
her in Berlin and knows someone working at KIT.”

Keyword part: apartment shared Berlin Alice
Structured part: ?y ns:name “Alice”. ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name “KIT”

[Figure: data graph with the nodes trouble with bob, sunset.jpg, Beautiful Sunset, Alice, Bob, Peter, Thanh, FluidOps, KIT, Semantic Search, Germany, 34 and 2009, and the text “Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:”]

Processing hybrid graph patterns (2)
 Matching hybrid graph patterns against data

Matching keyword query against text
• Retrieve documents
• Inverted list (inverted index)
keyword → {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...}
• AND-semantics: top-k join

[Figure: the inverted lists for shared, Berlin and Alice each contain D1 and are joined on the document: shared = berlin = alice]
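
AND-semantics over inverted lists amounts to intersecting posting lists. The sketch below uses a plain merge-based intersection over lists sorted by document ID; it deliberately leaves out scores and the top-k early termination mentioned on the slide. The index contents are hypothetical.

```python
def intersect(postings_a, postings_b):
    """Merge-join two posting lists that are sorted by document ID."""
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result


def and_query(index, keywords):
    """Return the documents that contain every keyword."""
    lists = sorted((index.get(k, []) for k in keywords), key=len)
    if not lists:
        return []
    result = lists[0]  # start with the shortest list to keep intermediates small
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result


# Hypothetical index: document 1 plays the role of D1 in the figure.
index = {"shared": [1, 3], "berlin": [1, 2, 3], "alice": [1]}
```

With this index, and_query(index, ["shared", "berlin", "alice"]) keeps only document 1.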

Matching structured query against structured
data
• Retrieve data for triple patterns
• Index on tables
• Multiple “redundant” indexes to cover different access
patterns
• Join (conjunction of triples)
• Blocking, e.g. linear merge join (requires sorted input)
• Non-blocking, e.g. symmetric hash-join
• Materialized join indexes

Example query: ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name “KIT”
SP-index lookup for Per1 ns:works ?v returns Per1 ns:works Ins1
PO-index lookup for ?v ns:name “KIT” returns Ins1 ns:name KIT
The two results are joined on ?v (Ins1 = Ins1)
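
The index lookups and the join can be sketched as follows. Note the assumptions: the indexes are plain dictionaries, the two extra triples about Per2 and Ins2 are invented so that the join has something to filter, and hash_join is an ordinary (blocking) hash join rather than the symmetric variant.

```python
from collections import defaultdict

triples = [
    ("Per1", "ns:works", "Ins1"),
    ("Ins1", "ns:name", "KIT"),
    ("Per2", "ns:works", "Ins2"),  # hypothetical extra data
    ("Ins2", "ns:name", "AIFB"),   # hypothetical extra data
]

# Redundant indexes covering two different access patterns.
sp_index = defaultdict(list)  # (subject, predicate) -> objects
po_index = defaultdict(list)  # (predicate, object)  -> subjects
for s, p, o in triples:
    sp_index[(s, p)].append(o)
    po_index[(p, o)].append(s)


def hash_join(left, right, key):
    """Join two lists of variable bindings (dicts) on a shared variable."""
    table = defaultdict(list)
    for row in left:
        table[row[key]].append(row)
    return [{**l, **r} for r in right for l in table.get(r[key], [])]


# ?z ns:works ?v   joined with   ?v ns:name "KIT"
works = [{"?z": s, "?v": o} for s, p, o in triples if p == "ns:works"]
named_kit = [{"?v": s} for s in po_index[("ns:name", "KIT")]]
answers = hash_join(works, named_kit, "?v")
```

The join keeps only the binding where ?v is Ins1, i.e. the person working at the institution named KIT.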

Matching keyword query against structured data
• Retrieve keyword elements
• Using inverted index
keyword → {<el1, score, ...>, <el2, score, ...>,…}
• Exploration / “Join”
• Data indexes for triple lookup
• Materialized index (paths up to graphs)
• Top-k Steiner tree search, top-k subgraph exploration

[Figure: the keywords Alice, Bob and KIT are mapped to data elements (↔) and connected by joining the triples]
Alice ns:knows Bob
Bob ns:works Inst1
Inst1 ns:name KIT
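
The exploration step can be illustrated with a breadth-first search that connects two keyword elements through the data graph. This is a much simpler stand-in for top-k Steiner tree search or top-k subgraph exploration: it finds a single shortest connecting path, ignores edge direction and scores, and treats the literal KIT as a node.

```python
from collections import defaultdict, deque

triples = [
    ("Alice", "ns:knows", "Bob"),
    ("Bob", "ns:works", "Inst1"),
    ("Inst1", "ns:name", "KIT"),
]


def neighbours(triples):
    """Build an undirected adjacency map from the triples."""
    adjacency = defaultdict(set)
    for s, _, o in triples:
        adjacency[s].add(o)
        adjacency[o].add(s)
    return adjacency


def connect(triples, start, goal):
    """Return a shortest path of nodes linking two keyword elements, or None."""
    adjacency = neighbours(triples)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for node in sorted(adjacency[path[-1]]):
            if node not in seen:
                seen.add(node)
                queue.append(path + [node])
    return None
```

For the example data, connect(triples, "Alice", "KIT") walks from Alice via Bob and Inst1 to KIT, which corresponds to the three joined triples in the figure.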

Matching structured query against text
• Based on offline IE (see Peter's slides)
• Based on online IE, i.e., retrieval proceeds as follows
• Derive keywords to retrieve relevant documents
• On-the-fly information extraction, i.e., phrase pattern matching “X name Y”
• Retrieve extracted data for structured part
• Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp.

[Figure: text patterns such as “knows” and “name KIT” derived from the structured query]

Matching structured query against text
• Index
• Inverted index for document retrieval and pattern matching
• Join index → inverted index for storing materialized joins between keywords
• Neighborhood indexes for phrase patterns

[Figure: index entries for the patterns “knows” and “name KIT” derived from the structured query]

Query processing – main tasks
 Retrieval
 Documents, data elements, triples, paths, graphs
 Inverted index,…, but also other (B+ tree)
 Index
 Documents, triples, materialized paths
 Join
 Different join implementations, efficiency depends on availability of indexes
 Non-blocking join good for early result reporting and for “unpredictable” Linked Data / data
[Figure: Query, Matching, Data]

Query processing – more tasks
 More complex queries: disjunction, aggregation, grouping, analytics…
 Join order optimization
 Approximate
 Approximate the search space
 Approximate the results (matching, join)
 Parallelization
 Top-k
 Use only some entries in the input streams to produce k results
 Multiple sources
 Federation, routing
 On-the-fly mapping, similarity join
 Hybrid
[Figure: Query, Matching, Data]

Query processing on the Web -
research challenges and opportunities

 Large amount of semantic data
 Data inconsistent, redundant, and low quality
 Large amount of data embedded in text
 Large amount of sources
 Large amount of links between sources

• Optimization, parallelization
• Approximation
• Hybrid querying and data management
• Federation, routing
• Online schema mappings
• Similarity join

Structure
 Problem definition
 Types of ambiguities
 Ranking paradigms
 Model construction
 Content-based
 Structure-based

Ranking – problem definition
• Ambiguities arise when representation is incomplete / imprecise
• Ambiguities at the level of
• elements (content ambiguity)
• structure between elements (structure ambiguity)
[Figure: Query, Matching, Data]

Due to ambiguities in the representation of the
information needs and the underlying resources, the
results cannot be guaranteed to exactly match the query.
Ranking is the problem of determining the degree of
matching using some notions of relevance.

Content ambiguity


[Figure: excerpt of the data graph with the nodes Peter, sunset.jpg, Beautiful, Alice, Search, Bob and the text fragments “shared an apartment in Berlin” and “takes much better”]

What is meant by “Berlin” in the query? What is meant by “Berlin” in the data?
A city with the name Berlin? A person?
What is meant by “KIT” in the query? What is meant by “KIT” in the data?
A research group? A university? A location?

Structure ambiguity


[Figure: excerpt of the data graph with the nodes Peter, sunset.jpg, Beautiful, Alice, Search, Bob and the text fragments “shared an apartment in Berlin” and “takes much better”]

What is the connection between “Berlin” and “Alice”?
Friend? Co-worker?
What is meant by “works”?
Works at? Employed?

Ambiguity
 Recall: query processing is matching at the level of
syntax and semantics
 Ambiguities arise when data or query allow for multiple
interpretations, i.e. multiple matches
 Syntactic, e.g. works vs. works at
 Semantic, e.g. works vs. employ
 “Aboutness”, i.e., contain some elements which
represent the correct interpretation
 Ambiguities arise when matching elements of different
granularities
 Does i contain the interpretation for j, given that some part(s) of i (syntactically/semantically) match j?
 E.g. Berlin vs. “…we went to the same university, and also, we
shared an apartment in Berlin in 2008…”
 Strictly speaking, ranking is performed after syntactic /
semantic matching is done!

Features: What to use to deal with ambiguities?

What is meant by “Berlin”? What is the
connection between “Berlin” and “Alice”?
 Content features
 Frequencies of terms: d is more likely to be “about” a query term k when d mentions k more often (probabilistic IR)
 Co-occurrences: terms K that often co-occur form a
contextual interpretation, i.e., topics (cluster
hypothesis)
 Structure features
 Consider relevance at level of fields
 Link-based popularity

Ranking paradigms
 No explicit notion of relevance: similarity between the
query and the document model
 Vector space model (cosine similarity)
 Language models (KL divergence)

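\[ \mathrm{Sim}(q, d) = \cos\big((w_{1,d}, \ldots, w_{t,d}),\ (w_{1,q}, \ldots, w_{t,q})\big) \]

\[ \mathrm{Sim}(q, d) = -\mathrm{KL}(\theta_q \,\|\, \theta_d) = -\sum_{t \in V} P(t \mid \theta_q) \log \frac{P(t \mid \theta_q)}{P(t \mid \theta_d)} \]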
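
Both similarity measures are easy to compute once the weight vectors or language models are available. The sketch below represents them as dictionaries from term to weight or probability; the function names are assumptions, and the KL variant assumes the document model has already been smoothed so that it assigns non-zero probability to every query term.

```python
import math


def cosine(d, q):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * q.get(term, 0.0) for term, weight in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0


def neg_kl(p_q, p_d):
    """Negative KL divergence -KL(q || d) between two language models.

    Assumes p_d[t] > 0 wherever p_q[t] > 0 (e.g. after smoothing).
    """
    return -sum(pq * math.log(pq / p_d[t]) for t, pq in p_q.items() if pq > 0)
```

Identical vectors give a cosine of 1 and identical models a divergence of 0; the more the document model departs from the query model, the more negative the score becomes.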

Model construction
 How to obtain
 Relevance models?
 Weights for query / document terms?
 Language models for document / queries?

Content-based model construction
 Document statistics, e.g.
 Term frequency
 Document length
 Collection statistics, e.g.
 Inverse document frequency
 Background language models

• An object is more likely about “Berlin”?
• When it contains a relatively high number of mentions of the term “Berlin”
• When the number of mentions of this term in the overall collection is relatively low

\[ w_{t,d} = \frac{tf}{|d|} \cdot idf \]

\[ P(t \mid \theta_d) = \lambda \frac{tf}{|d|} + (1 - \lambda) P(t \mid C) \]
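
The two content-based estimates can be sketched directly from token lists. Two choices here are assumptions rather than part of the slide: idf is computed as log(N / df), and the default value of the smoothing parameter lam is arbitrary.

```python
import math


def tf_idf(term, doc, collection):
    """Length-normalised term frequency times inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in collection if term in d)
    idf = math.log(len(collection) / df) if df else 0.0  # assumed idf variant
    return tf / len(doc) * idf


def smoothed_prob(term, doc, collection, lam=0.8):
    """Document language model mixed with the collection (background) model."""
    all_tokens = [token for d in collection for token in d]
    p_collection = all_tokens.count(term) / len(all_tokens)
    return lam * doc.count(term) / len(doc) + (1 - lam) * p_collection


# Hypothetical two-document collection of token lists.
collection = [["berlin", "berlin", "city"], ["paris", "city"]]
```

In this toy collection “berlin” receives a positive weight in the first document, while “city”, which occurs in every document, receives weight zero.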

Structure-based model construction
 Consider structure of objects during content-based
modeling, i.e., to obtain structured content-based
model
 Content-based model for structured
objects, documents and for general tuples

\[ P(t \mid \theta_d) = \sum_{f \in F_d} \alpha_f \, P(t \mid \theta_f) \]

• An object is more likely about “Berlin”?
• When one of its (important) fields contains a
relatively high number of mentions of the term “Berlin”
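
A field-weighted model of this kind can be sketched as a weighted sum of per-field term probabilities. The field names, the weights and the use of unsmoothed maximum-likelihood field models are all assumptions made for the illustration.

```python
def field_mixture(term, fields, weights):
    """Mix per-field maximum-likelihood estimates using field weights."""
    probability = 0.0
    for name, tokens in fields.items():
        p_field = tokens.count(term) / len(tokens) if tokens else 0.0
        probability += weights.get(name, 0.0) * p_field
    return probability


# Hypothetical object with two fields; the title is treated as more important.
fields = {"title": ["berlin"], "body": ["a", "trip", "to", "berlin"]}
weights = {"title": 0.7, "body": 0.3}
```

With these weights a mention of “Berlin” in the short title field contributes far more to the score than the single mention in the body.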

Structure-based model construction
 PageRank
 Link analysis algorithm
 Measuring relative importance of nodes
 Link counts as a vote of support
 The PageRank of a node recursively depends on the
number and PageRank of all nodes that link to it
(incoming links)
 ObjectRank
 Types and semantics of links vary in structured data
setting
 Authority transfer schema graph specifies connection
strengths
 Recursively compute authority transfer data graph
• An object about “Berlin” is more important than another?
• When a relatively large number of objects are linked to it
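The recursive definition of PageRank can be sketched as a simple power iteration. This is plain PageRank under common assumptions (damping factor 0.85, a fixed number of iterations, dangling nodes spreading their score uniformly), not ObjectRank: ObjectRank would additionally weight each edge by the authority transfer rate of its link type.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict node -> list of nodes it links to. Returns dict node -> score."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives a small base score (random jump).
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                # A node passes its score in equal shares along its outgoing links.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its score evenly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank

graph = {"a": ["berlin"], "b": ["berlin"], "berlin": ["a"]}
scores = pagerank(graph)
```

In this toy graph the node "berlin" receives links from both other nodes and ends up with the highest score, which matches the intuition that an object is more important when many objects link to it.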

Taxonomy of ranking approaches
 Explicitly vs. non-explicitly relevance-based
 Content-based ranking
 Structure-based ranking
 Content- and structure-based ranking

Search interface
 Input and output functionality
 helping the user to formulate complex queries
 presenting the results in an intelligent manner
 Semantic Search brings improvements in
 Query formulation
 Snippet generation
 Suggesting related entities
 Adaptive and interactive presentation
 Presentation adapts to the kind of query and results presented
 Object results can be actionable, e.g. buy this product
 Aggregated search
 Grouping similar items, summarizing results in various ways
 Filtering (facets), possibly across different dimensions
 Task completion
 Help the user to fulfill the task by placing the query in a task context

Query formulation
 “Snap-to-grid”: suggest the most likely interpretation of the
query
 Given the ontology or a summary of the data
 While the user is typing or after issuing the query
 Example: Freebase suggest, TrueKnowledge

Enhanced results/Rich Snippets
 Use mark-up from the webpage to generate search
snippets
 Originally invented at Yahoo! (SearchMonkey)
 Google, Yahoo!, Bing, Yandex now consume
schema.org markup
 Validators available from Google and Bing

Other result presentation tasks
 Select the most relevant resources within an RDF
document
 Penin et al. Snippet Generation for Semantic Web
Search Engines, ASWC 2010
 For each resource, rank the properties to be displayed
 Natural Language Generation (NLG)
 Verbalize, explain results

Related entities

Related actors
and movies

Adaptive presentation:
housing search

Semantic Search challenge (2010/2011)
 Two tasks
 Entity Search
 Queries where the user is looking for a single real world object
 Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW
2010.
 List search (new in 2011)
 Queries where the user is looking for a class of objects

 Billion Triples Challenge 2009 dataset
 Evaluated using Amazon's Mechanical Turk
 Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST
2010
 Blanco et al. Repeatable and Reliable Search System
Evaluation using Crowd-Sourcing, SIGIR 2011

Catching the bad guys
 Payment can be rejected for workers who try to game
the system
 An explanation is commonly expected, though cheaters rarely
complain
 We opted to mix control questions into the real results
 Gold-win cases that are known to be perfect
 Gold-lose cases that are known to be bad
 Metrics
 Avg. and std. dev. on gold-win and gold-lose results
 Time to complete

Worker     Known bad        Real             Known good       Total N   Time to complete (sec)
           N      Mean      N      Mean      N      Mean
badguy     20     2.556     200    2.738     20     2.684     240       29.6
goodguy    13     1         130    2.038     13     3         156       95
whoknows   1      1         21     1.571     2      3         24        83.5
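One way to act on such control questions is sketched below. The decision rule is an assumption made for illustration, not the procedure used in the challenge: a worker passes if their mean rating on gold-win items exceeds their mean rating on gold-lose items by at least a hypothetical gap of 1.0 rating points.

```python
from statistics import mean

def screen_worker(gold_lose_ratings, gold_win_ratings, min_gap=1.0):
    """Return True if the worker separates known-bad from known-good results.

    min_gap is a hypothetical threshold on the difference of mean ratings;
    workers who saw no control questions of either kind are not accepted.
    """
    if not gold_lose_ratings or not gold_win_ratings:
        return False
    return mean(gold_win_ratings) - mean(gold_lose_ratings) >= min_gap

careful = screen_worker([1, 1, 1], [3, 3, 3])
careless = screen_worker([3, 2, 3], [3, 2, 3])
```

A worker like "goodguy" in the table (mean 1 on known bad, mean 3 on known good) would pass such a check, while "badguy" (2.556 versus 2.684) would not. Time to complete could be added as a second signal.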

Other evaluations

 TREC Entity Track
 Related Entity Finding
 Entities related to a given entity through a particular relationship
 Retrieval over documents (ClueWeb 09 collection)
 Example: (Homepages of) airlines that fly Boeing 747
 Entity List Completion
 Given some elements of a list of entities, complete the list
 Question Answering over Linked Data
 Retrieval over specific datasets (DBpedia and
MusicBrainz)
 Full natural language questions of different forms
 Correct results defined by an equivalent SPARQL query
 Example: Give me all actors starring in Batman Begins.

Resources
 Books
 Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern
Information Retrieval. ACM Press. 2011
 Survey papers
 Thanh Tran, Peter Mika. Survey of Semantic Search
Approaches. Under submission, 2012.
 Conferences and workshops
 ISWC, ESWC, WWW, SIGIR, CIKM, SemTech
 Semantic Search workshop series
 Exploiting Semantic Annotations in Information Retrieval
(ESAIR)
 Entity-oriented Search (EOS) workshop


Editor's Notes