SlideShare a Scribd company logo
1 of 109
Semantic Search Tutorial
     Introduction


   Peter Mika| Yahoo! Research, Spain
         pmika@yahoo-inc.com

 Thanh Tran | Institute AIFB, KIT, Germany
       Tran@aifb.uni-karlsruhe.de
About the speakers
 Peter Mika
   Senior Research Scientist
   Head of Semantic Search group at Yahoo! Research in
    Barcelona
   Semantic Search, Web Object Retrieval, Natural
    Language Processing
 Tran Duc Thanh
   Assistent Professor at AIFB, Karlsruhe Institute of
    Technology
   Head of Semantic Search group
   Semantic Search, Semantic / Linked Data Management
Agenda

 Introduction (10 min)
 Semantic Web data (50 min)
   The RDF data model
   Publishing RDF
   Crawling and indexing RDF data
 Query processing (40 min)
 Ranking (30 min)
 Result presentation (15 min)
 Semantic Search evaluation (15 min)
 Questions (5 min)
Why Semantic Search? I.
 “We are at the beginning of search.“ (Marissa Mayer)
   Solved large classes of queries, e.g. navigational
   Heavy investment in computational power
   Remaining queries are hard, not solvable by brute force,
   and require a deep understanding of the world and
   human cognition
 Background knowledge and metadata can help to
 address poorly solved queries
Poorly solved information needs
                                               Many of these queries
 Ambiguous searches                           would not be asked by
   paris hilton                               users, who learned over
 Long tail queries                            time what search
                                               technology can and can
   george bush (and I mean the beer brewer    notArizona)
                                               in do.
 Multimedia search
   paris hilton sexy
 Imprecise or overly precise searches
   jim hendler
   pictures of strong adventures people
 Precise searches for descriptions
   countries in africa
   32 year old computer scientist living in barcelona
   reliable digital camera under 300 dollars
Example: multiple interpretations
Why Semantic Search? II.
 The Semantic Web is now a reality
   Large amounts of data published in RDF
   Heterogeneous data of varying quality
   Users who are not skilled in writing complex queries (e.g.
   SPARQL) and may not be experts in the domain
 Searching data instead or in addition to searching
 documents
   Direct answers
   Novel search tasks
Example: direct answers in search




  Points of                 Faceted
  interest in      Information
                            search for   Information box with
  Vienna,          from the Shopping     content from and
  Austria          Knowledgeresults      links to Yahoo!
                   Graph                 Travel
                               Since Aug,
                               2010, „regular‟
                               search results
                               are „Powered
                               by Bing‟
Novel search tasks
 Aggregation of search results
   e.g. price comparison across websites
 Analysis and prediction
   e.g. world temperature by 2020
 Semantic profiling
   recommendations based on particular interests
 Semantic log analysis
   understanding user behavior in terms of objects
 Support for complex tasks
   e.g. booking a vacation using a combination of services
Contextual (pervasive, ambient) search
                                 Yahoo! Connected
                                 TV:
                                 Widget engine
                                 embedded into the
                                 TV




          Yahoo! IntoNow:
          recognize audio and
          show related content
Interactive search and task completion
Document retrieval and data retrieval
 Information Retrieval (IR) support the retrieval of
  documents (document retrieval)
     Representation based on lightweight syntax-centric models
     Work well for topical search
     Not so well for more complex information needs
     Web scale
 Database (DB) and Knowledge-based Systems (KB)
  deliver more precise answers (data retrieval)
     More expressive models
     Allow for complex queries
     Retrieve concrete answers that precisely match queries
     Not just matching and filtering, but also joins
     Limitations in scalability
Combination of document and data
    retrieval
 Documents with metadata
   Metadata may be embedded inside the document
   I’m looking for documents that mention countries in
   Africa.
 Data retrieval
   Structured data, but searchable text fields
   I’m looking for directors, who have directed movies
   where the synopsis mentions dinosaurs.
Semantic Search
 Target (combination of) document and data retrieval
 Semantic search is a retrieval paradigm that
   Exploits the structure/semantics of the data or explicit
    background knowledge to understand user intent and the
    meaning of content
   Incorporates the intent of the query and the meaning of
    content into the search process (semantic models)
 Wide range of semantic search systems
   Employ different semantic models, possibly at different
   steps of the search process and in order to support
   different tasks
Semantic models
 Semantics is concerned with the meaning of the
 resources made available for search
   Various representations of meaning
 Linguistic models: models of relationships among
 words
   Taxonomies, thesauri, dictionaries of entity names
   Inference along linguistic relations, e.g. broader/narrower
   terms
 Conceptual models: models of relationships among
 objects
   Ontologies capture entities in the world and their
    relationships
   Inference along domain-specific relations
 We will focus on conceptual models in this tutorial
   In particular, the RDF/OWL conceptual model for
Semantic Search – a process view

                                                   Knowledge Representation


                                            Semantic Models
             • Keywords                                                 Resources
             • Forms
  Query      • NL
Construction • Formal language


             •IR-style matching & ranking
             •DB-style precise matching
  Query      •KB-style matching &
Processing    inferences


             •Query visualization
             •Document and data                                           Documents
   Result     presentation
Presentation •Summarization


             •Implicit feedback
             •Explicit feedback
  Query      •Incentives
Refinement




                                                    Document Representation
Semantic Search systems
For data / document retrieval, semantic search
systems might combine a range of techniques, ranging
from statistics-based IR methods for ranking,
database methods for efficient indexing and query
processing, up to complex reasoning techniques for
making inferences!
Example: Information Workbench
 Addressing the lifecycle of
  interacting with the Web of
  Data
   Integration of data sources
   Content generation by the end
    user
   Search and Exploration
   Visualization                             User-     Wikipedia
   Publishing                              generated
                                                        DBpedia, Yago
 Integrated management of
                                                        Earthquake
  heterogeneous data                                    (Data.gov)
  sources
     Structured and unstructured
                                Structure
     Published and user-generated                      Dynamic
                                    d
     Static and dynamic
     Open domain
Data Sources in the Application
 Entire English Wikipedia


 Data from Linked Open Data
   DBpedia
   YAGO
  …


 Data from Data.gov (US Government)
   E.g. live data about earthquakes


 Many more
Semantic Search
 Hybrid Search: Structured queries combined with
  keywords across structured and unstructured data
  sources

 Query interpretation: Translation of keywords into
  hybrid queries

 Keyword search/query interpretation combined
  with faceted search: iterative refinement process
  based on keywords and operations on facets
Search, Refinement and Navigation
                                           Keywords




                                                      Query
                                                      Translations

                                                        Term
                                                        Completions




Vorlesung Knowledge Discovery - Institut
              Facets
                                  AIFB
2
1
Result Inspection, Analysis and Browsing
Semantic Web data
Data on the Web
 Data on the Web is not directly accessible
   Most web pages are generated from databases, but
    formatted for human consumption
   APIs offer limited views over data
 Two solutions
   Extraction using Information Extraction (IE) techniques
     Out of scope for this tutorial
   Relying on publishers to expose structured data using
   standard Semantic Web formats
Information extraction
 Natural Language Processing
   Named entity recognition and disambiguation, sentiment
    analysis etc.
 Extraction of information about entities
   Suchanek et al. YAGO: A Core of Semantic Knowledge
    Unifying WordNet and Wikipedia, WWW, 2007.
   Wu and Weld. Autonomously Semantifying Wikipedia, CIKM
    2007.
 Extraction from HTML tables
   Cafarella et al. WebTables: Exploring the Power of Tables on
    the Web. VLDB 2008
 Wrapper induction
   Kushmerick et al. Wrapper Induction for Information
    ExtractionText extraction. IJCAI 2007
 Filling web forms automatically (form-filling)
   Madhavan et al. Google's Deep-Web Crawl. VLDB 2008
Semantic Web
 Sharing data across the Web
   Standard data model
     RDF
   A number of syntaxes (file formats)
     RDF/XML, RDFa
   Powerful, logic-based languages for schemas
     OWL, RIF
   Query languages and protocols
     HTTP, SPARQL
Resource Description Framework (RDF)
 Each resource (thing, entity) is identified by a URI
   Globally unique identifiers
   URLs a subset of URIs
   Often abbreviated using namespaces
     e.g. example:roi = http://example.org/roi
 RDF represents knowledge as a set of triples
   Each triple is a single fact about the entity (an attribute or a
    relationship)
 A set of triples forms an RDF graph
    RDF document
                              type       foaf:Person

                example:roi      name

                                        “Roi Blanco”
Linking resources
Roi‟s homepage                                Friend-of-a-Friend ontology

                                 type
   example:roi                                      foaf:Person
                     name


                            “Roi Blanco”                    knows
          sameAs



Yahoo!‟s website
                                             type

                   worksWith
  #roi2                        #peter

                                           email

                                              “pmika@yahoo-inc.com”
Ontologies
 Ontologies are the schemas for RDF graphs
   Define the intended meaning of certain classes and
   relationships in a domain
     e.g. the FOAF ontology defines the foaf:Person class and
     relationships such as foaf:knows
 Ontologies are published in standard languages such
 as OWL
   Language defines the constructs for modeling and their
   meaning
     e.g. subClassOf, sameAs
   Tools for editing ontologies such as Protégé, TopBraid
   Composer
 Reusing and extending well-known ontologies helps to
 interpret unknown data
   e.g. if X is subClassOf foaf:Person, and A is of type
   X, then we know that A is a person
Example: schema.org
 Agreement on a shared set of schemas for common
 types of web content
   Bing, Google, and Yahoo! as initial supporters
   Similar in intent to sitemaps.org (2006)
     Use a single format to communicate the same information to all
      three search engines
 Support for microdata
 schema.org covers areas of interest to all search
 engines
   Business listings (local), creative works
    (video), recipes, reviews
   User defined extensions
 Each search engine continues to develop its products
Documentation and OWL ontology
Publishing RDF

 Interlinked RDF documents (Linked Data)
   Each document describes a single resource with URIs
    pointing to related resources
   Common RDF file formats are RDF/XML and Turtle
   Mostly implemented as a wrapper around a database or
    Web service
 Embedding RDF inside HTML
   RDFa, microdata
 SPARQL endpoints
   Triple stores are databases for managing RDF data
   SPARQL is a standard protocol and query language for
   accessing triple stores using HTTP
Linked Data
 Grass-roots community effort to (re)publish open
 datasets in RDF
   Centered around the Dbpedia project
   Directory of datasets at linkeddata.org, thedatahub.org
Metadata in HTML anno 1995
                      What does this term
<HTML>                mean?
<HEAD profile="http://dublincore.org/documents/dcq-
  html/">
<META name="DC.author" content="Peter Mika">
<LINK rel="DC.rights copyright"
      href="http://www.example.org/rights.html" />
<LINK rel="meta" type="application/rdf+xml" title="FOAF"
     href= "http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
…
</HTML>

                        How is this data
                        related?
Microformats (anno 2003)
 Mark-up for data in HTML pages
   Reuse existing HTML elements (class, rel)
 Microformats exist for a limited set of objects
   Persons, organizations, reviews, events etc., see microformats.org
 No formal schemas
   Limited reuse, extensibility of schemas
   Unclear which combinations are allowed
 No identifiers for entities
   No interlinking between entities

<div class="vcard">
 <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
 <div class="tel">+1-919-555-7878</div>
 <div class="title">Area Administrator, Assistant</div>
</div>
RDFa and RDFa Lite (2008-2012)
 W3C standards for embedding RDF data in HTML
 documents
   A set of new HTML attributes to be used in head or body
   A specification of how to extract the data from these
   attributes
 RDFa is just a syntax, you have to choose a
  vocabulary separately
 RDFa family
   RDFa 1.0
     Recommendation since October, 2008
   RDFa 1.1 is a small update on RDFa to make it easier to
   use
     RDFa Primer (Working Draft, May 8, 2012)
   RDFa 1.1 Lite is a subset of RDFa 1.1 with the most
Example: Facebook‟s Open Graph
    Protocol
 The „Like‟ button provides publishers with a way to
 promote their content on Facebook and build
 communities
   Shows up in profiles and news feed
   Site owners can later reach users who have liked an
    object
   Facebook Graph API allows 3rd party developers to
    access the data
 Open Graph Protocol is an RDFa-based format that
 allows to describe the object that the user „Likes‟
Example: Facebook‟s Open Graph
     Protocol
 RDF vocabulary to be used in conjunction with RDFa
   Simplify the work of developers by restricting the freedom in RDFa
 Activities, Businesses, Groups, Organizations, People, Places, Product
  s and Entertainment
 Only HTML <head> accepted


<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
   <title>The Rock (1996)</title>
   <meta property="og:title" content="The Rock" />
   <meta property="og:type" content="movie" />
   <meta property="og:url"
   content="http://www.imdb.com/title/tt0117500/" />
   <meta property="og:image" content="http://ia.media-
   imdb.com/images/rock.jpg" /> …
</head> ...
Microdata
 HTML5
   Currently under standardization at the W3C
   Originally part of the HTML5 spec, but now a separate
    document
 Comparable to RDFa 1.1 Lite
   Key extensibility features (such as multiple types)
    missing
 HTML5 also has a number of “semantic” elements
<div itemscope itemid=“http://www.yahoo.com/resource/person”>
<p>My name is <span <video>, <article>…
  such as <time>, itemprop="name">Neil</span>.</p>
<p>My band is called <span itemprop="band">Four Parts
   Water</span>.
    I was born on <time itemprop="birthday" datetime="2009-05-10">
   May 10th 2009</time>.
    <img itemprop="image" src=”me.png" alt=”me”></p>
</div>
Current state of metadata on the Web
 31% of webpages, 5% of domains contain some
 metadata
   Analysis of the Bing Crawl (US crawl, January, 2012)
   RDFa is most common format
     By URL: 25% RDFa, 7% microdata, 9% microformat
     By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
   Adoption is stronger among large publishers
     Especially for RDFa and microdata

 See also
   P. Mika, T. Potter. Metadata Statistics for a Large Web
    Corpus, LDOW 2012
   H.Mühleisen, C.Bizer.Web Data Commons - Extracting
    Structured Data from Two Large Web Corpora, LDOW
    2012
Exponential growth in RDFa data

                Another five-fold increase
                between October 2010 and
                January, 2012



            Five-fold increase between
            March, 2009 and
            October, 2010




Percentage of URLs with embedded metadata in various formats
Crawling and indexing
Crawling the Semantic Web
 Linked Data
   Similar to HTML crawling, but the the crawler needs to
    parse RDF/XML (and others) to extract URIs to be
    crawled
   Semantic Sitemap/VOID descriptions
 RDFa
   Same as HTML crawling, but data is extracted after
    crawling
   Mika et al. Investigating the Semantic Gap through Query
    Log Analysis, ISWC 2010.
 SPARQL endpoints
   Endpoints are not linked, need to be discovered by other
    means
   Semantic Sitemap/VOID descriptions
Data fusion
 Ontology matching
   Widely studied in Semantic Web research, see e.g. list of
    publications at ontologymatching.org
     Unfortunately, not much of it is applicable in a Web context due to the
      quality of ontologies
 Entity resolution
   Logic-based approaches in the Semantic Web
   Studied as record linkage in the database literature
     Machine learning based approaches, focusing on attributes
   Graph-based approaches, see e.g. the work of Lisa Getoor
    are applicable to RDF data
     Improvements over only attribute based matching
 Blending
   Merging objects that represent the same real world entity and
    reconciling information from multiple sources
Data quality assessment and curation
 Heterogeneity, quality of data is an even larger issue
   Quality ranges from well-curated data sets (e.g. Freebase) to
    microformats
     In the worst of cases, the data becomes a graph of words
   Short amounts of text: prone to mistakes in data entry or
    extraction
     Example: mistake in a phone number or state code

 Quality assessment and data curation
   Quality varies from data created by experts to user-generated
    content
   Automated data validation
     Against known-good data or using triangulation
     Validation against the ontology or using probabilistic models
   Data validation by trained professionals or crowdsourcing
     Sampling data for evaluation
Indexing
 Search requires matching and ranking
   Matching selects a subset of the elements to be scored
 The goal of indexing is to speed up matching
   Retrieval needs to be performed in milliseconds
   Without an index, retrieval would require streaming
   through the collection
 The type of index depends on the query model to
 support
   DB-style indexing
   IR-style indexing
IR-style indexing
 Index data as text
   Create virtual documents from data
   One virtual document per subgraph, resource or triple
     typically: resource

 Key differences to Text Retrieval
   RDF data is structured
   Minimally, queries on property values are required
Horizontal index structure

 Two fields (indices): one for terms, one for properties
 For each term, store the property on the same position in
  the property index
   Positions are required even without phrase queries
 Query engine needs to support the alignment operator
 ✓ Dictionary is number of unique terms + number of
  properties
 Occurrences is number of tokens * 2
Vertical index structure

 One field (index) per property
 Positions are not required
   But useful for phrase queries
 Query engine needs to support fields
 Dictionary is number of unique terms
 Occurrences is number of tokens
 ✗ Number of fields is a problem for merging, query
  performance
Distributed indexing
 MapReduce is ideal for building inverted indices
   Map creates (term, {doc1}) pairs
   Reduce collects all docs for the same term:
    (term, {doc1, doc2…}
   Sub-indices are merged separately
     Term-partitioned indices

 Peter Mika. Distributed Indexing for Semantic
 Search, SemSearch 2010.
Query Processing / Matching
Structure
 Taxonomy of search approaches
 Query processing / matching techniques for Semantic
  Search
 Types of semantic data
 Formalisms for querying semantic data
 Approaches
   General task: hybrid graph pattern matching
   Matching keyword query against text
   Matching structured query against structured data
   Matching keyword query against structured data
   Matching structured query against text (a hybrid case)
 Main tasks, challenges and opportunities
Taxonomy of search approaches (1)
 The search problem
   A collection of resources, called data
   Information needs expressed as queries
   Search is the task of efficiently computing results from
   data that are relevant to queries
 Document data retrieval vs. structured data retrieval
   Differences in query and data representation and
    matching
   Efficiently retrieve structured data that exactly match
    formal information needs expressed as structured
    queries
   Effectively rank textual results that match ambiguous NL /
    keyword queries to a certain degree (notions of
    relevance)
 Semantic search: ranked retrieval of document and
Taxonomy of search approaches (2)
                          Query engines (of
                          databases)
                               Exact
                                  Complete
   Query
                                  Sound

                              • Approximate
     Matching




                                • Not complete
                                • Not sound

                              • Ranked
    Data
                                • Best effort
                                • Top-k
                         Search engines (stand-alone/database
                         extensons)
Query processing mainly focuses on efficiency of matching
whereas ranking deals with degree of matching (relevance)!
Query processing for Semantic Search (1)
 Resources represented by semantic data ranging from
     Structured data with well defined schemas
     Semi-structured data with incomplete or no schemas
     Data that largely comprise text
     Hybrid / embedded data
 Information needs of varying complexity, captured using
  different formalisms and querying paradigms
   Natural language texts and keywords
   Form-based inputs
   Formal structured queries
  (Search is end-user oriented paradigm, requires
  “natural”, intuitive querying interfaces)
 Semantic search: efficiently computing results (query
  processing) from data that are relevant to queries
  (ranking)
Query processing for Semantic Search (2)
                        NL           Form- / facet-   Structured Queries
      Keywords
                     Questions       based Inputs         (SPARQL)
Ambiquities


                                   Query




                                     Matching

                                    Data



   RDF data             Semi-                         OWL ontologies with
                                        Structured
 embedded in          Structured                         rich, formal
                                        RDF data
  text (RDFa)         RDF data                            semantics
Ambiquities: confidence degree, truth/trust
Query processing for Semantic Search (3)
                            Textual Data


                                      Structured
                                       query on
               Keyword query
                                     textual data
                 on textual
                                   , e.g. querying
               data, e.g. Web Search target
                   Semantic
                                    extension for
               search systems group of
                       different        search
                       users, information
                                      systems?
Unstructured      needs, and types Structured
                                       of data.        Structured
Query                                                  Query
                     Query processing for on
               Keyword query           query
                on structured search is hybrid
                 semantic           structured data
                  data, e.g.         e.g. standard
                combination of techniques!
                  search                 querying
               extensions for          interface for
                databases              databases /
                                       RDF stores
                            Structured Data
Types of data models (1)
 Textual
   Bag-of-words
   Represent documents, text in structured data,…, real-
    world objects (captured as structured data)
   Lacks “structure”
     in text, e.g. linguistic structure, hyperlinks, (positional
      information)
     Structure in structured data representation


    term (statistics)
     In combination with
     combination
     Cloud Computing
     Cloud
     technologies, promising
     Computing
     solutions for the
     Technologies
     management of `big
     solutions
     data' have emerged.
     management
     Existing industry
     `big data'
     solutions are able to
     industry
     support complex
     solutions
     queries and analytics
     support
     tasks with terabytes of
     complex
     data. For
     ……
     example, using a
     Greenplum.
Types of data models (2)
 Textual
 Structured
   Resource Description Framework (RDF)
   Represent real-world objects, services, applications, ….
    documents
   Resource attribute values and relationships between
    resources
   Schema
                        Picture
                     creator
            Person




      Bob
Types of data models (3)
 Textual
 Structured
 Hybrid
   RDF data embedded in text (RDFa)
Types of data models – RDFa (1)
…
<div about="/alice/posts/trouble_with_bob">
    <h2 property="dc:title">The trouble with Bob</h2>
    <h3 property="dc:creator">Alice</h3>

        Bob is a good friend of mine. We went to the same university, and
        also shared an apartment in Berlin in 2008. The trouble with Bob is
        that he takes much better photos than I do:

    <div about="http://example.com/bob/photos/sunset.jpg">
     <img src="http://example.com/bob/photos/sunset.jpg" />
     <span property="dc:title">Beautiful Sunset</span>
     by <span property="dc:creator">Bob</span>.
    </div>
</div>
…
                                adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
Types of semantic data – RDFa (2)

Bob is a good friend of mine. We         content
went to the same university, and
also shared an apartment in Berlin
in 2008. The trouble with Bob is
that he takes much better photos
than I do:
                                 content




                                       adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
Types of semantic data - conclusion



Semantic data in general can be conceived
 as a graph with text and structured data
   items as nodes, and edges represent
  different types of relationships including
    explicit semantic relationships and
vaguely specified ones such as hyperlinks!
Formalisms for querying semantic data
   (1)

Example information need
“Information about a friend of Alice, who shared
an apartment with her in Berlin and knows
someone working at KIT.”

 Unstructured queries
 Fully-structured queries
 Hybrid queries: unstructured + structured
Formalisms for querying semantic data
   (2)

Example information need
“Information about a friend of Alice, who shared
an apartment with her in Berlin and knows
someone working at KIT.”
 Unstructured
   NL
   Keywords
                  shared   apartment   Berlin   Alice
Formalisms for querying semantic data
   (3)

Example information need
“Information about a friend of Alice, who shared
an apartment with her in Berlin and knows
someone working at KIT.”

 Unstructured
 Fully-structured
   SPARQL:
   BGP, filter, optional, union, select, construct, ask, describe
     PREFIX ns: <http://example.org/ns#>
     SELECT ?x
     WHERE { ?x ns:knows ? y. ?y ns:name “Alice”.
        ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” }
Formalisms for querying semantic data
    (4)
 Fully-structured
 Unstructured
 Hybrid: content and structure constraints




                                                   “shared apartment Berlin Alice”

            ?x ns:knows ? y. ?y ns:name “Alice”.
            ?x ns:knows ?z. ?z ns: works ?v.
            ?v ns:name “KIT”
Formalisms for querying semantic data
    (5)
 Fully-structured
 Unstructured
 Hybrid: content and structure constraints




                                                   “shared apartment Berlin Alice”

            ?x ns:knows ? y. ?y ns:name “Alice”.
            ?x ns:knows ?z. ?z ns: works ?v.
            ?v ns:name “KIT”
Formalisms for querying semantic data - conclusion




Semantic search queries can be conceived
 as graph patterns with nodes referring to
text and structured data items, and edges
  referring to relationships between these
                     items!
Processing hybrid graph patterns (1)
 Example information need
 “Information about a friend of Alice, who shared an apartment with
 her in Berlin and knows someone working at KIT.”


apartment shared Berlin Alice              ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”

                            ?y ns:name “Alice”. ?x ns:knows ? y

                    trouble with bob                            FluidOps                    34
                                                                            Peter
                                             sunset.jpg
                    Bob is a good friend
                                                            Beautiful
                    of mine. We went to                     Sunset
                    the same                                            Germany     Semantic
                                             Alice                                  Search
                    university, and also
                    shared an
                    apartment in Berlin
                    in 2008. The trouble                                          Germany    2009
                                                      Bob
                    with Bob is that he                            Thanh
                    takes much better
                    photos than I do:                                      KIT
Processing hybrid graph patterns (2)
 Matching hybrid graph patterns against data
Matching keyword query against text
 • Retrieve documents
    • Inverted list (inverted index)
            keyword  {<doc1, pos, score, ...>,
                       <doc2, pos, score, ...>, ...}
 • AND-semantics: top-k join

shared          Berlin         Alice                   shared   Berlin   Alice


 D1             D1            D1


  shared    =    berlin   =    alice




  shared
Matching structured query against structured
          data
   • Retrieve data for triple patterns
         • Index on tables
         • Multiple “redundant” indexes to cover different access
             patterns
   • Join (conjunction of triples)
         • Blocking, e.g. linear merge join (required sorted input)
         • Non-blocking, e.g. symmetric hash-join
         • Materialized join indexes
Per1 ns:works ?v ?v ns:name “KIT”
                                               ?x ns:knows ?y. ?x ns:knows ?z.
      SP-index PO-index                                          ?z ns: works ?v. ?v ns:name “KIT”




                                                 =
                                       =                 =

                           Per1 ns:works Ins1 Ins1 ns:name KIT
Per1 ns:works Ins1 Ins1 ns:name KIT
Matching keyword query against structured data
    • Retrieve keyword elements
       • Using inverted index
          keyword  {<el1, score, ...>, <el2, score, ...>,…}
    • Exploration / “Join”
       • Data indexes for triple lookup
       • Materialized index (paths up to graphs)
       • Top-k Steiner tree search, top-k subgraph exploration


         Alice      Bob        KIT                  Alice   Bob   KIT

               ↔           ↔


                              =
                   =
Alice ns:knows Bob              Inst1 ns:name KIT
                Bob ns:works Inst1
Matching structured query against text
• Based on offline IE (offline see Peter‟s slides)
• Based on online IE, i.e., “retrieve “ is as follows
   • Derive keywords to retrieve relevant documents
   • On-the-fly information extraction, i.e., phrase pattern matching “X
     name Y”
   • Retrieve extracted data for structured part
   • Retrieve documents for derived text patterns, e.g.
     sequence, windows, reg. exp.                     ?x ns:knows ?y. ?x ns:knows ?
                                                        ?z ns: works ?v. ?v ns:name “K

                                                     knows


                                                                            name KIT
Matching structured query against text
• Index
   • Inverted index for document retrieval and pattern matching
   • Join index  inverted index for storing materialized joins between
      keywords
   • Neighborhood indexes for phrase patterns


                                                   ?x ns:knows ?y. ?x ns:knows ?
                                                   ?z ns: works ?v. ?v ns:name “K
 KIT   name
                                                knows


                                                                       name KIT
Query processing – main tasks
              Retrieval
                Documents , data
 Query           elements, triples, paths, graphs
                Inverted index,…, but also other
                 (B+ tree)
  Matching




                Index
                 documents, triples, materialized
                 paths
              Join
 Data           Different join
                 implementations, efficiency
                 depends on availability of indexes
                Non-blocking join good for early
                 result reporting and for
                 “unpredictable” Linked Data / data
Query processing – more tasks
              More complex queries:
               disjunction, aggregation, grouping, an
 Query
               alytics…
              Join order optimization
              Approximate
                Approximate the search space
  Matching




                Approximate the results
                 (matching, join)
              Parallelization
              Top-k
 Data           Use only some entries in the input
                 streams to produce k results
              Multiple sources
                Federation, routing
                On-the-fly mapping, similarity join
              Hybrid
Query processing on the Web -
     research challenges and opportunities


 Large amount of
    semantic data
                                 • Optimization, parallelizati
   Data                           on
    inconsistent, redunda        • Approximation
    nt, and low quality
                                 • Hybrid querying and data
   Large amount of data           management
    embedded in text             • Federation, routing
   Large amount of              • Online schema
    sources                        mappings
   Large amount of links        • Similarity join
    between sources
Ranking
Structure
 Problem definition
 Types of ambiguities
 Ranking paradigms
 Model construction
   Content-based
   Structure-based
Ranking – problem definition
 Query
                 • Ambiguities arise when
                   representation is incomplete /
                   imprecise
   Matching




                 • Ambiguities at the level of
                    • elements (content ambiguity)
                    • structure between elements
  Data
                      (structure ambiguity)

Due to ambiguities in the representation of the
information needs and the underlying resources, the
results cannot be guaranteed to exactly match the query.
Ranking is the problem of determining the degree of
matching using some notions of relevance.
Content ambiguity
apartment shared Berlin Alice              ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”

                            ?y ns:name “Alice”. ?x ns:knows ? y

                    trouble with bob                            FluidOps                    34
                                                                            Peter
                                             sunset.jpg
                    Bob is a good friend
                                                            Beautiful
                    of mine. We went to                     Sunset
                    the same                                            Germany     Semantic
                                             Alice                                  Search
                    university, and also
                    shared an
                    apartment in Berlin
                    in 2008. The trouble                                          Germany    2009
                                                      Bob
                    with Bob is that he                            Thanh
                    takes much better
                    photos than I do:                                      KIT

 What is meant by “Berlin” in the query?       What is meant by “KIT” in the query?
 What is meant by “Berlin” in the data?        What is meant by “KIT” in the data?
 A city with the name Berlin? a person?        A research group? a university? a location?
Structure ambiguity
apartment shared Berlin Alice              ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”

                            ?y ns:name “Alice”. ?x ns:knows ? y

                    trouble with bob                            FluidOps                    34
                                                                            Peter
                                             sunset.jpg
                    Bob is a good friend
                                                            Beautiful
                    of mine. We went to                     Sunset
                    the same                                            Germany     Semantic
                                             Alice                                  Search
                    university, and also
                    shared an
                    apartment in Berlin
                    in 2008. The trouble                                          Germany    2009
                                                      Bob
                    with Bob is that he                            Thanh
                    takes much better
                    photos than I do:                                      KIT


 What is the connection between                What is meant by “works”?
 “Berlin” and “Alice”?                         Works at? employed?
 Friend? Co-worker?
Ambiguity
 Recall: query processing is matching at the level of
  syntax and semantics
 Ambiguities arise when data or query allow for multiple
  interpretations, i.e. multiple matches
   Syntactic, e.g. works vs. works at
   Semantic, e.g. works vs. employ
 “Aboutness”, i.e., contain some elements which
  represent the correct interpretation
   Ambiguities arise when matching elements of different
    granularities
   Does i contains the interpretation for j, given some part(s) of i
    (syntactically/semantically) match j
   E.g. Berlin vs. “…we went to the same university, and also, we
    shared an apartment in Berlin in 2008…”
 Strictly speaking, ranking is performed after syntactic /
  semantic matching is done!
Features: What to use to deal with ambiguities?

What is meant by “Berlin”? What is the
connection between “Berlin” and “Alice”?
 Content features
   Frequencies of terms: d more likely to be “about” a
    query term k when d more often, mentions k
    (probabilistic IR)
   Co-occurrences: terms K that often co-occur form a
    contextual interpretation, i.e., topics (cluster
    hypothesis)
 Structure features
   Consider relevance at level of fields
   Linked-based popularity
Ranking paradigms
 Explicit relevance model
   Foundation: probability ranking principle
   Ranking results by the posterior probability (odds) of
    being observed in the relevant class:
   P(w|R) varies in different approaches, e.g., binary
    independence model, 2-poisson model, relevance
    model


       P(D|R)
                           P( D | R)          P(w | R)         (1 P(w | N ))
       P(D/N)                           w D              w D

                                                                  k
    P( w | R)   P( w | q1 ,..., qk )         P ( m) P ( w | m)         P(qk | m)
                                       m M                       i 1
Ranking paradigms
  No explicit notion of relevance: similarity between the
  query and the document model
    Vector space model (cosine similarity)
    Language models (KL divergence)




  Sim(q, d )    Cos(( w1,d ,..., wt , d ), ( w1,q ,..., wk , q ))

                                                                  P(t |   q   )
Sim(q, d )     KL(    q   ||   d   )         P(t |   q   ) log(
                                       t V                        P(t |   d   )
Model construction
 How to obtain
   Relevance models?
   Weights for query / document terms?
   Language models for document / queries?
Content-based model construction
 Document statistics, e.g.         • An object is more likely
   Term frequency                  about “Berlin”?
   Document length                    • When it contains a
 Collection statistics, e.g.          relatively high number
                                       of mentions of the term
   Inverse document
                                       “Berlin”
    frequency
                                       • When the number of
   Background language
                                       mentions of this term in
    models
                                       the overall collection is
                 tf                    relatively low
    wt , d           idf
                |d |
                  tf
P(t |   d   )           (1   ) P(t | C )
                 |d |
Structure-based model construction
 Consider structure of objects during content-based
 modeling, i.e., to obtain structured content-based
 model
   Content-based model for structured
   objects, documents and for general tuples



      P(t |   d   )          f   P(t |   f   )
                      f Fd


• An object is more likely about “Berlin”?
   • When one of its (important) fields contains a
   relatively high number of mentions of the term “Berlin”
Structure-based model construction
  PageRank
    Link analysis algorithm
    Measuring relative importance of nodes
    Link counts as a vote of support
    The PageRank of a node recursively depends on the
     number and PageRank of all nodes that link to it
     (incoming links)
  ObjectRank
    Types and semantics of links vary in structured data
     setting
    Authority transfer schema graph specifies connection
• An object about “Berlin” is more important than one another?
     strengths
   • When a relatively large number of objects are linked to it
    Recursively compute authority transfer data graph
Taxonomy of ranking approaches
 Explicitly vs. non-explicitly relevance-based
 Content-based ranking
 Structure-based ranking
 Content- and-structure-based ranking
Result Presentation
Search interface
 Input and output functionality
   helping the user to formulate complex queries
   presenting the results in an intelligent manner
 Semantic Search brings improvements in
     Query formulation
     Snippet generation
     Suggesting related entities
     Adaptive and interactive presentation
       Presentation adapts to the kind of query and results presented
       Object results can be actionable, e.g. buy this product
   Aggregated search
     Grouping similar items, summarizing results in various ways
     Filtering (facets), possibly across different dimensions
   Task completion
     Help the user to fulfill the task by placing the query in a task context
Query formulation
 “Snap-to-grid”: suggest the most likely interpretation of the
  query
   Given the ontology or a summary of the data
   While the user is typing or after issuing the query
   Example: Freebase suggest, TrueKnowledge
Enhanced results/Rich Snippets
 Use mark-up from the webpage to generate search
 snippets
  Originally invented at Yahoo! (SearchMonkey)
 Google, Yahoo!, Bing, Yandex now consume
 schema.org markup
  Validators available from Google and Bing
Other result presentation tasks
 Select the most relevant resources within an RDF
 document
   Penin et al. Snippet Generation for Semantic Web
   Search Engines, ASWC 2010
 For each resource, rank the properties to be displayed
 Natural Language Generation (NLG)
   Verbalize, explain results
Aggregated search: facets
Aggregated search: Sig.ma
Related entities




               Related actors
               and movies
Adaptive presentation:
housing search
Evaluation
Semantic Search challenge (2010/2011)
 Two tasks
   Entity Search
     Queries where the user is looking for a single real world object
     Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW
      2010.
   List search (new in 2011)
     Queries where the user is looking for a class of objects

 Billion Triples Challenge 2009 dataset
 Evaluated using Amazon‟s Mechanical Turk
   Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST
    2010
   Blanco et al. Repeatable and Reliable Search System
    Evaluation using Crowd-Sourcing, SIGIR2011
Evaluation form
Catching the bad guys
 Payment can be rejected for workers who try to game
 the system
   An explanation is commonly expected, though cheaters rarely
    complain
 We opted to mix control questions into the real results
   Gold-win cases that are known to be perfect
   Gold-loose cases that are known to be bad
 Metrics
 Worker and std. dev onReal
   Avg.
            Known       gold-win and gold-loose results
                                      Known Good Total      Time to
   Time to complete
            bad                                      N      complete
           N   Mean    N     Mean     N    Mean             (sec)


 badguy    20 2.556    200   2.738    20   2.684    240     29.6
 goodguy   13 1        130   2.038    13   3        156     95
 whoknows 1    1       21    1.571    2    3        24      83.5
Other evaluations

 TREC Entity Track
   Related Entity Finding
     Entities related to a given entity through a particular relationship
     Retrieval over documents (ClueWeb 09 collection)
     Example: (Homepages of) airlines that fly Boeing 747
   Entity List Completion
     Given some elements of a list of entities, complete the list
 Question Answering over Linked Data
   Retrieval over specific datasets (Dbpedia and
    MusicBrainz)
   Full natural language questions of different forms
   Correct results defined by an equivalent SPARQL query
   Example: Give me all actors starring in Batman Begins.
Resources
Resources
 Books
  Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern
   Information Retrieval. ACM Press. 2011
 Survey papers
  Thanh Tran, Peter Mika. Survey of Semantic Search
   Approaches. Under submission, 2012.
 Conferences and workshops
  ISWC, ESWC, WWW, SIGIR, CIKM, SemTech
  Semantic Search workshop series
  Exploiting Semantic Annotations in Information Retrieval
   (ESAIR)
  Entity-oriented Search (EOS) workshop

More Related Content

What's hot

What happened to the Semantic Web?
What happened to the Semantic Web?What happened to the Semantic Web?
What happened to the Semantic Web?Peter Mika
 
Semantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsSemantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsPeter Mika
 
Knowledge Integration in Practice
Knowledge Integration in PracticeKnowledge Integration in Practice
Knowledge Integration in PracticePeter Mika
 
Understanding Queries through Entities
Understanding Queries through EntitiesUnderstanding Queries through Entities
Understanding Queries through EntitiesPeter Mika
 
Related Entity Finding on the Web
Related Entity Finding on the WebRelated Entity Finding on the Web
Related Entity Finding on the WebPeter Mika
 
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Bradley Allen
 
Semantic Search using RDF Metadata (SemTech 2005)
Semantic Search using RDF Metadata (SemTech 2005)Semantic Search using RDF Metadata (SemTech 2005)
Semantic Search using RDF Metadata (SemTech 2005)Bradley Allen
 
Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)Bradley Allen
 
From Queries to Answers in the Web
From Queries to Answers in the WebFrom Queries to Answers in the Web
From Queries to Answers in the WebRoi Blanco
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Roi Blanco
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialBarbara Starr
 
Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationTrey Grainger
 
Smoke Signals and Social Signals: A look at the patents and papers
Smoke Signals and Social Signals: A look at the patents and papersSmoke Signals and Social Signals: A look at the patents and papers
Smoke Signals and Social Signals: A look at the patents and papersBill Slawski
 
Enterprise Navigation (KM World 2007)
Enterprise Navigation (KM World 2007)Enterprise Navigation (KM World 2007)
Enterprise Navigation (KM World 2007)Bradley Allen
 
Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Roi Blanco
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Trey Grainger
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge GraphTrey Grainger
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”VOGIN-academie
 
Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Trey Grainger
 
Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Peter Mika
 

What's hot (20)

What happened to the Semantic Web?
What happened to the Semantic Web?What happened to the Semantic Web?
What happened to the Semantic Web?
 
Semantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsSemantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistants
 
Knowledge Integration in Practice
Knowledge Integration in PracticeKnowledge Integration in Practice
Knowledge Integration in Practice
 
Understanding Queries through Entities
Understanding Queries through EntitiesUnderstanding Queries through Entities
Understanding Queries through Entities
 
Related Entity Finding on the Web
Related Entity Finding on the WebRelated Entity Finding on the Web
Related Entity Finding on the Web
 
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
 
Semantic Search using RDF Metadata (SemTech 2005)
Semantic Search using RDF Metadata (SemTech 2005)Semantic Search using RDF Metadata (SemTech 2005)
Semantic Search using RDF Metadata (SemTech 2005)
 
Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)
 
From Queries to Answers in the Web
From Queries to Answers in the WebFrom Queries to Answers in the Web
From Queries to Answers in the Web
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorial
 
Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital Transformation
 
Smoke Signals and Social Signals: A look at the patents and papers
Smoke Signals and Social Signals: A look at the patents and papersSmoke Signals and Social Signals: A look at the patents and papers
Smoke Signals and Social Signals: A look at the patents and papers
 
Enterprise Navigation (KM World 2007)
Enterprise Navigation (KM World 2007)Enterprise Navigation (KM World 2007)
Enterprise Navigation (KM World 2007)
 
Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge Graph
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
 
Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)
 
Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015
 

Viewers also liked

Semantic Search Over The Web
Semantic Search Over The WebSemantic Search Over The Web
Semantic Search Over The Webalierkan
 
Increasing product and service visibility through front-end semantic web
Increasing product and service visibility through front-end semantic webIncreasing product and service visibility through front-end semantic web
Increasing product and service visibility through front-end semantic webJay Myers
 
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...Khirulnizam Abd Rahman
 
Weakly Supervised Machine Reading
Weakly Supervised Machine ReadingWeakly Supervised Machine Reading
Weakly Supervised Machine ReadingIsabelle Augenstein
 
USFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders
USFD at SemEval-2016 - Stance Detection on Twitter with AutoencodersUSFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders
USFD at SemEval-2016 - Stance Detection on Twitter with AutoencodersIsabelle Augenstein
 
Lodifier: Generating Linked Data from Unstructured Text
Lodifier: Generating Linked Data from Unstructured TextLodifier: Generating Linked Data from Unstructured Text
Lodifier: Generating Linked Data from Unstructured TextIsabelle Augenstein
 
Seed Selection for Distantly Supervised Web-Based Relation Extraction
Seed Selection for Distantly Supervised Web-Based Relation ExtractionSeed Selection for Distantly Supervised Web-Based Relation Extraction
Seed Selection for Distantly Supervised Web-Based Relation ExtractionIsabelle Augenstein
 
Distant Supervision with Imitation Learning
Distant Supervision with Imitation LearningDistant Supervision with Imitation Learning
Distant Supervision with Imitation LearningIsabelle Augenstein
 
Relation Extraction from the Web using Distant Supervision
Relation Extraction from the Web using Distant SupervisionRelation Extraction from the Web using Distant Supervision
Relation Extraction from the Web using Distant SupervisionIsabelle Augenstein
 
Lily for the Bay Area HBase UG - NYC edition
Lily for the Bay Area HBase UG - NYC editionLily for the Bay Area HBase UG - NYC edition
Lily for the Bay Area HBase UG - NYC editionNGDATA
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word CloudsMarina Santini
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Andre Freitas
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic SearchTommaso Teofili
 
Information Extraction with Linked Data
Information Extraction with Linked DataInformation Extraction with Linked Data
Information Extraction with Linked DataIsabelle Augenstein
 
Natural Language Processing for the Semantic Web
Natural Language Processing for the Semantic WebNatural Language Processing for the Semantic Web
Natural Language Processing for the Semantic WebIsabelle Augenstein
 
Text analysis and Semantic Search with GATE
Text analysis and Semantic Search with GATEText analysis and Semantic Search with GATE
Text analysis and Semantic Search with GATEDiana Maynard
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question AnsweringMarina Santini
 

Viewers also liked (20)

Semantic Search Over The Web
Semantic Search Over The WebSemantic Search Over The Web
Semantic Search Over The Web
 
Increasing product and service visibility through front-end semantic web
Increasing product and service visibility through front-end semantic webIncreasing product and service visibility through front-end semantic web
Increasing product and service visibility through front-end semantic web
 
Concept based semantic search
Concept based semantic searchConcept based semantic search
Concept based semantic search
 
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
 
Weakly Supervised Machine Reading
Weakly Supervised Machine ReadingWeakly Supervised Machine Reading
Weakly Supervised Machine Reading
 
USFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders
USFD at SemEval-2016 - Stance Detection on Twitter with AutoencodersUSFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders
USFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders
 
Lodifier: Generating Linked Data from Unstructured Text
Lodifier: Generating Linked Data from Unstructured TextLodifier: Generating Linked Data from Unstructured Text
Lodifier: Generating Linked Data from Unstructured Text
 
Seed Selection for Distantly Supervised Web-Based Relation Extraction
Seed Selection for Distantly Supervised Web-Based Relation ExtractionSeed Selection for Distantly Supervised Web-Based Relation Extraction
Seed Selection for Distantly Supervised Web-Based Relation Extraction
 
Distant Supervision with Imitation Learning
Distant Supervision with Imitation LearningDistant Supervision with Imitation Learning
Distant Supervision with Imitation Learning
 
Relation Extraction from the Web using Distant Supervision
Relation Extraction from the Web using Distant SupervisionRelation Extraction from the Web using Distant Supervision
Relation Extraction from the Web using Distant Supervision
 
Mapping Keywords to
Mapping Keywords to Mapping Keywords to
Mapping Keywords to
 
Lily for the Bay Area HBase UG - NYC edition
Lily for the Bay Area HBase UG - NYC editionLily for the Bay Area HBase UG - NYC edition
Lily for the Bay Area HBase UG - NYC edition
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic Search
 
Human Neural Machine
Human Neural MachineHuman Neural Machine
Human Neural Machine
 
Information Extraction with Linked Data
Information Extraction with Linked DataInformation Extraction with Linked Data
Information Extraction with Linked Data
 
Natural Language Processing for the Semantic Web
Natural Language Processing for the Semantic WebNatural Language Processing for the Semantic Web
Natural Language Processing for the Semantic Web
 
Text analysis and Semantic Search with GATE
Text analysis and Semantic Search with GATEText analysis and Semantic Search with GATE
Text analysis and Semantic Search with GATE
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question Answering
 

Similar to Semantic Search tutorial at SemTech 2012

Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Thanh Tran
 
Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011sssw2011
 
NetIKX Semantic Search Presentation
NetIKX Semantic Search PresentationNetIKX Semantic Search Presentation
NetIKX Semantic Search Presentationurvics
 
Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?Stuart Weibel
 
Slawek Korea
Slawek KoreaSlawek Korea
Slawek KoreaSlawek
 
Inteligent Catalogue Final
Inteligent Catalogue FinalInteligent Catalogue Final
Inteligent Catalogue Finalguestcaef1d
 
Chapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and RetrievalChapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and Retrievalcaptainmactavish1996
 
Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrievalKU Leuven
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the partsCarole Goble
 
Index nominum to ontology
Index nominum to ontologyIndex nominum to ontology
Index nominum to ontologyGautier Poupeau
 
Search Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By DesignSearch Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By DesignMarianne Sweeny
 
From Linked Data to Semantic Applications
From Linked Data to Semantic ApplicationsFrom Linked Data to Semantic Applications
From Linked Data to Semantic ApplicationsAndre Freitas
 
DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0John Breslin
 
Semantic Search
Semantic SearchSemantic Search
Semantic Searchsssw2012
 
Sem tech2013 tutorial
Sem tech2013 tutorialSem tech2013 tutorial
Sem tech2013 tutorialThengo Kim
 
Recent Trends in Semantic Search Technologies
Recent Trends in Semantic Search TechnologiesRecent Trends in Semantic Search Technologies
Recent Trends in Semantic Search TechnologiesThanh Tran
 
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Laurent Alquier
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialLeeFeigenbaum
 

Similar to Semantic Search tutorial at SemTech 2012 (20)

Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012
 
Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011
 
NetIKX Semantic Search Presentation
NetIKX Semantic Search PresentationNetIKX Semantic Search Presentation
NetIKX Semantic Search Presentation
 
Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?
 
Slawek Korea
Slawek KoreaSlawek Korea
Slawek Korea
 
Inteligent Catalogue Final
Inteligent Catalogue FinalInteligent Catalogue Final
Inteligent Catalogue Final
 
Chapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and RetrievalChapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and Retrieval
 
Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrieval
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the parts
 
Index nominum to ontology
Index nominum to ontologyIndex nominum to ontology
Index nominum to ontology
 
Search Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By DesignSearch Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By Design
 
From Linked Data to Semantic Applications
From Linked Data to Semantic ApplicationsFrom Linked Data to Semantic Applications
From Linked Data to Semantic Applications
 
Web Mining
Web MiningWeb Mining
Web Mining
 
DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0
 
Semantic Web in Action
Semantic Web in ActionSemantic Web in Action
Semantic Web in Action
 
Semantic Search
Semantic SearchSemantic Search
Semantic Search
 
Sem tech2013 tutorial
Sem tech2013 tutorialSem tech2013 tutorial
Sem tech2013 tutorial
 
Recent Trends in Semantic Search Technologies
Recent Trends in Semantic Search TechnologiesRecent Trends in Semantic Search Technologies
Recent Trends in Semantic Search Technologies
 
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web Tutorial
 

More from Peter Mika

Hackathon s pb
Hackathon s pbHackathon s pb
Hackathon s pbPeter Mika
 
Investigating the Semantic Gap through Query Log Analysis
Investigating the Semantic Gap through Query Log AnalysisInvestigating the Semantic Gap through Query Log Analysis
Investigating the Semantic Gap through Query Log AnalysisPeter Mika
 
Making the Web searchable
Making the Web searchableMaking the Web searchable
Making the Web searchablePeter Mika
 
Publishing data on the Semantic Web
Publishing data on the Semantic WebPublishing data on the Semantic Web
Publishing data on the Semantic WebPeter Mika
 
Hack U Barcelona 2011
Hack U Barcelona 2011Hack U Barcelona 2011
Hack U Barcelona 2011Peter Mika
 
Semantic Search Summer School2009
Semantic Search Summer School2009Semantic Search Summer School2009
Semantic Search Summer School2009Peter Mika
 
Year of the Monkey: Lessons from the first year of SearchMonkey
Year of the Monkey: Lessons from the first year of SearchMonkeyYear of the Monkey: Lessons from the first year of SearchMonkey
Year of the Monkey: Lessons from the first year of SearchMonkeyPeter Mika
 
Semantic Web Austin Yahoo
Semantic Web Austin YahooSemantic Web Austin Yahoo
Semantic Web Austin YahooPeter Mika
 

More from Peter Mika (8)

Hackathon s pb
Hackathon s pbHackathon s pb
Hackathon s pb
 
Investigating the Semantic Gap through Query Log Analysis
Investigating the Semantic Gap through Query Log AnalysisInvestigating the Semantic Gap through Query Log Analysis
Investigating the Semantic Gap through Query Log Analysis
 
Making the Web searchable
Making the Web searchableMaking the Web searchable
Making the Web searchable
 
Publishing data on the Semantic Web
Publishing data on the Semantic WebPublishing data on the Semantic Web
Publishing data on the Semantic Web
 
Hack U Barcelona 2011
Hack U Barcelona 2011Hack U Barcelona 2011
Hack U Barcelona 2011
 
Semantic Search Summer School2009
Semantic Search Summer School2009Semantic Search Summer School2009
Semantic Search Summer School2009
 
Year of the Monkey: Lessons from the first year of SearchMonkey
Year of the Monkey: Lessons from the first year of SearchMonkeyYear of the Monkey: Lessons from the first year of SearchMonkey
Year of the Monkey: Lessons from the first year of SearchMonkey
 
Semantic Web Austin Yahoo
Semantic Web Austin YahooSemantic Web Austin Yahoo
Semantic Web Austin Yahoo
 

Recently uploaded

Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...RKavithamani
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 

Recently uploaded (20)

Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 

Semantic Search tutorial at SemTech 2012

  • 1. Semantic Search Tutorial Introduction Peter Mika| Yahoo! Research, Spain pmika@yahoo-inc.com Thanh Tran | Institute AIFB, KIT, Germany Tran@aifb.uni-karlsruhe.de
  • 2. About the speakers  Peter Mika  Senior Research Scientist  Head of Semantic Search group at Yahoo! Research in Barcelona  Semantic Search, Web Object Retrieval, Natural Language Processing  Tran Duc Thanh  Assistent Professor at AIFB, Karlsruhe Institute of Technology  Head of Semantic Search group  Semantic Search, Semantic / Linked Data Management
  • 3. Agenda  Introduction (10 min)  Semantic Web data (50 min)  The RDF data model  Publishing RDF  Crawling and indexing RDF data  Query processing (40 min)  Ranking (30 min)  Result presentation (15 min)  Semantic Search evaluation (15 min)  Questions (5 min)
  • 4. Why Semantic Search? I.  “We are at the beginning of search.“ (Marissa Mayer)  Solved large classes of queries, e.g. navigational  Heavy investment in computational power  Remaining queries are hard, not solvable by brute force, and require a deep understanding of the world and human cognition  Background knowledge and metadata can help to address poorly solved queries
  • 5. Poorly solved information needs Many of these queries  Ambiguous searches would not be asked by  paris hilton users, who learned over  Long tail queries time what search technology can and can  george bush (and I mean the beer brewer notArizona) in do.  Multimedia search  paris hilton sexy  Imprecise or overly precise searches  jim hendler  pictures of strong adventures people  Precise searches for descriptions  countries in africa  32 year old computer scientist living in barcelona  reliable digital camera under 300 dollars
  • 7. Why Semantic Search? II.  The Semantic Web is now a reality  Large amounts of data published in RDF  Heterogeneous data of varying quality  Users who are not skilled in writing complex queries (e.g. SPARQL) and may not be experts in the domain  Searching data instead or in addition to searching documents  Direct answers  Novel search tasks
  • 8. Example: direct answers in search Points of Faceted interest in Information search for Information box with Vienna, from the Shopping content from and Austria Knowledgeresults links to Yahoo! Graph Travel Since Aug, 2010, „regular‟ search results are „Powered by Bing‟
  • 9. Novel search tasks  Aggregation of search results  e.g. price comparison across websites  Analysis and prediction  e.g. world temperature by 2020  Semantic profiling  recommendations based on particular interests  Semantic log analysis  understanding user behavior in terms of objects  Support for complex tasks  e.g. booking a vacation using a combination of services
  • 10. Contextual (pervasive, ambient) search Yahoo! Connected TV: Widget engine embedded into the TV Yahoo! IntoNow: recognize audio and show related content
  • 11. Interactive search and task completion
  • 12. Document retrieval and data retrieval  Information Retrieval (IR) support the retrieval of documents (document retrieval)  Representation based on lightweight syntax-centric models  Work well for topical search  Not so well for more complex information needs  Web scale  Database (DB) and Knowledge-based Systems (KB) deliver more precise answers (data retrieval)  More expressive models  Allow for complex queries  Retrieve concrete answers that precisely match queries  Not just matching and filtering, but also joins  Limitations in scalability
  • 13. Combination of document and data retrieval  Documents with metadata  Metadata may be embedded inside the document  I’m looking for documents that mention countries in Africa.  Data retrieval  Structured data, but searchable text fields  I’m looking for directors, who have directed movies where the synopsis mentions dinosaurs.
  • 14. Semantic Search  Target (combination of) document and data retrieval  Semantic search is a retrieval paradigm that  Exploits the structure/semantics of the data or explicit background knowledge to understand user intent and the meaning of content  Incorporates the intent of the query and the meaning of content into the search process (semantic models)  Wide range of semantic search systems  Employ different semantic models, possibly at different steps of the search process and in order to support different tasks
  • 15. Semantic models  Semantics is concerned with the meaning of the resources made available for search  Various representations of meaning  Linguistic models: models of relationships among words  Taxonomies, thesauri, dictionaries of entity names  Inference along linguistic relations, e.g. broader/narrower terms  Conceptual models: models of relationships among objects  Ontologies capture entities in the world and their relationships  Inference along domain-specific relations  We will focus on conceptual models in this tutorial  In particular, the RDF/OWL conceptual model for
  • 16. Semantic Search – a process view Knowledge Representation Semantic Models • Keywords Resources • Forms Query • NL Construction • Formal language •IR-style matching & ranking •DB-style precise matching Query •KB-style matching & Processing inferences •Query visualization •Document and data Documents Result presentation Presentation •Summarization •Implicit feedback •Explicit feedback Query •Incentives Refinement Document Representation
  • 17. Semantic Search systems For data / document retrieval, semantic search systems might combine a range of techniques, ranging from statistics-based IR methods for ranking, database methods for efficient indexing and query processing, up to complex reasoning techniques for making inferences!
  • 18. Example: Information Workbench  Addressing the lifecycle of interacting with the Web of Data  Integration of data sources  Content generation by the end user  Search and Exploration  Visualization User- Wikipedia  Publishing generated DBpedia, Yago  Integrated management of Earthquake heterogeneous data (Data.gov) sources  Structured and unstructured Structure  Published and user-generated Dynamic d  Static and dynamic  Open domain
  • 19. Data Sources in the Application  Entire English Wikipedia  Data from Linked Open Data  DBpedia  YAGO …  Data from Data.gov (US Government)  E.g. live data about earthquakes  Many more
  • 20. Semantic Search  Hybrid Search: Structured queries combined with keywords across structured and unstructured data sources  Query interpretation: Translation of keywords into hybrid queries  Keyword search/query interpretation combined with faceted search: iterative refinement process based on keywords and operations on facets
  • 21. Search, Refinement and Navigation Keywords Query Translations Term Completions Vorlesung Knowledge Discovery - Institut Facets AIFB 2 1
  • 24. Data on the Web  Data on the Web is not directly accessible  Most web pages are generated from databases, but formatted for human consumption  APIs offer limited views over data  Two solutions  Extraction using Information Extraction (IE) techniques  Out of scope for this tutorial  Relying on publishers to expose structured data using standard Semantic Web formats
  • 25. Information extraction  Natural Language Processing  Named entity recognition and disambiguation, sentiment analysis etc.  Extraction of information about entities  Suchanek et al. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW, 2007.  Wu and Weld. Autonomously Semantifying Wikipedia, CIKM 2007.  Extraction from HTML tables  Cafarella et al. WebTables: Exploring the Power of Tables on the Web. VLDB 2008  Wrapper induction  Kushmerick et al. Wrapper Induction for Information ExtractionText extraction. IJCAI 2007  Filling web forms automatically (form-filling)  Madhavan et al. Google's Deep-Web Crawl. VLDB 2008
  • 26. Semantic Web  Sharing data across the Web  Standard data model  RDF  A number of syntaxes (file formats)  RDF/XML, RDFa  Powerful, logic-based languages for schemas  OWL, RIF  Query languages and protocols  HTTP, SPARQL
  • 27. Resource Description Framework (RDF)  Each resource (thing, entity) is identified by a URI  Globally unique identifiers  URLs a subset of URIs  Often abbreviated using namespaces  e.g. example:roi = http://example.org/roi  RDF represents knowledge as a set of triples  Each triple is a single fact about the entity (an attribute or a relationship)  A set of triples forms an RDF graph RDF document type foaf:Person example:roi name “Roi Blanco”
  • 28. Linking resources Roi‟s homepage Friend-of-a-Friend ontology type example:roi foaf:Person name “Roi Blanco” knows sameAs Yahoo!‟s website type worksWith #roi2 #peter email “pmika@yahoo-inc.com”
  • 29. Ontologies  Ontologies are the schemas for RDF graphs  Define the intended meaning of certain classes and relationships in a domain  e.g. the FOAF ontology defines the foaf:Person class and relationships such as foaf:knows  Ontologies are published in standard languages such as OWL  Language defines the constructs for modeling and their meaning  e.g. subClassOf, sameAs  Tools for editing ontologies such as Protégé, TopBraid Composer  Reusing and extending well-known ontologies helps to interpret unknown data  e.g. if X is subClassOf foaf:Person, and A is of type X, then we know that A is a person
  • 30. Example: schema.org  Agreement on a shared set of schemas for common types of web content  Bing, Google, and Yahoo! as initial supporters  Similar in intent to sitemaps.org (2006)  Use a single format to communicate the same information to all three search engines  Support for microdata  schema.org covers areas of interest to all search engines  Business listings (local), creative works (video), recipes, reviews  User defined extensions  Each search engine continues to develop its products
  • 32. Publishing RDF  Interlinked RDF documents (Linked Data)  Each document describes a single resource with URIs pointing to related resources  Common RDF file formats are RDF/XML and Turtle  Mostly implemented as a wrapper around a database or Web service  Embedding RDF inside HTML  RDFa, microdata  SPARQL endpoints  Triple stores are databases for managing RDF data  SPARQL is a standard protocol and query language for accessing triple stores using HTTP
  • 33. Linked Data  Grass-roots community effort to (re)publish open datasets in RDF  Centered around the Dbpedia project  Directory of datasets at linkeddata.org, thedatahub.org
  • 34. Metadata in HTML anno 1995 What does this term <HTML> mean? <HEAD profile="http://dublincore.org/documents/dcq- html/"> <META name="DC.author" content="Peter Mika"> <LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" /> <LINK rel="meta" type="application/rdf+xml" title="FOAF" href= "http://www.cs.vu.nl/~pmika/foaf.rdf"> </HEAD> … </HTML> How is this data related?
  • 35. Microformats (anno 2003)  Mark-up for data in HTML pages  Reuse existing HTML elements (class, rel)  Microformats exist for a limited set of objects  Persons, organizations, reviews, events etc., see microformats.org  No formal schemas  Limited reuse, extensibility of schemas  Unclear which combinations are allowed  No identifiers for entities  No interlinking between entities <div class="vcard"> <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a> <div class="tel">+1-919-555-7878</div> <div class="title">Area Administrator, Assistant</div> </div>
  • 36. RDFa and RDFa Lite (2008-2012)  W3C standards for embedding RDF data in HTML documents  A set of new HTML attributes to be used in head or body  A specification of how to extract the data from these attributes  RDFa is just a syntax, you have to choose a vocabulary separately  RDFa family  RDFa 1.0  Recommendation since October, 2008  RDFa 1.1 is a small update on RDFa to make it easier to use  RDFa Primer (Working Draft, May 8, 2012)  RDFa 1.1 Lite is a subset of RDFa 1.1 with the most
  • 37. Example: Facebook‟s Open Graph Protocol  The „Like‟ button provides publishers with a way to promote their content on Facebook and build communities  Shows up in profiles and news feed  Site owners can later reach users who have liked an object  Facebook Graph API allows 3rd party developers to access the data  Open Graph Protocol is an RDFa-based format that allows to describe the object that the user „Likes‟
  • 38. Example: Facebook‟s Open Graph Protocol  RDF vocabulary to be used in conjunction with RDFa  Simplify the work of developers by restricting the freedom in RDFa  Activities, Businesses, Groups, Organizations, People, Places, Product s and Entertainment  Only HTML <head> accepted <html xmlns:og="http://opengraphprotocol.org/schema/"> <head> <title>The Rock (1996)</title> <meta property="og:title" content="The Rock" /> <meta property="og:type" content="movie" /> <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" /> <meta property="og:image" content="http://ia.media- imdb.com/images/rock.jpg" /> … </head> ...
  • 39. Microdata  HTML5  Currently under standardization at the W3C  Originally part of the HTML5 spec, but now a separate document  Comparable to RDFa 1.1 Lite  Key extensibility features (such as multiple types) missing  HTML5 also has a number of “semantic” elements <div itemscope itemid=“http://www.yahoo.com/resource/person”> <p>My name is <span <video>, <article>… such as <time>, itemprop="name">Neil</span>.</p> <p>My band is called <span itemprop="band">Four Parts Water</span>. I was born on <time itemprop="birthday" datetime="2009-05-10"> May 10th 2009</time>. <img itemprop="image" src=”me.png" alt=”me”></p> </div>
  • 40. Current state of metadata on the Web  31% of webpages, 5% of domains contain some metadata  Analysis of the Bing Crawl (US crawl, January, 2012)  RDFa is most common format  By URL: 25% RDFa, 7% microdata, 9% microformat  By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat  Adoption is stronger among large publishers  Especially for RDFa and microdata  See also  P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012  H.Mühleisen, C.Bizer.Web Data Commons - Extracting Structured Data from Two Large Web Corpora, LDOW 2012
  • 41. Exponential growth in RDFa data Another five-fold increase between October 2010 and January, 2012 Five-fold increase between March, 2009 and October, 2010 Percentage of URLs with embedded metadata in various formats
  • 43. Crawling the Semantic Web  Linked Data  Similar to HTML crawling, but the the crawler needs to parse RDF/XML (and others) to extract URIs to be crawled  Semantic Sitemap/VOID descriptions  RDFa  Same as HTML crawling, but data is extracted after crawling  Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010.  SPARQL endpoints  Endpoints are not linked, need to be discovered by other means  Semantic Sitemap/VOID descriptions
  • 44. Data fusion  Ontology matching  Widely studied in Semantic Web research, see e.g. list of publications at ontologymatching.org  Unfortunately, not much of it is applicable in a Web context due to the quality of ontologies  Entity resolution  Logic-based approaches in the Semantic Web  Studied as record linkage in the database literature  Machine learning based approaches, focusing on attributes  Graph-based approaches, see e.g. the work of Lisa Getoor are applicable to RDF data  Improvements over only attribute based matching  Blending  Merging objects that represent the same real world entity and reconciling information from multiple sources
  • 45. Data quality assessment and curation  Heterogeneity, quality of data is an even larger issue  Quality ranges from well-curated data sets (e.g. Freebase) to microformats  In the worst of cases, the data becomes a graph of words  Short amounts of text: prone to mistakes in data entry or extraction  Example: mistake in a phone number or state code  Quality assessment and data curation  Quality varies from data created by experts to user-generated content  Automated data validation  Against known-good data or using triangulation  Validation against the ontology or using probabilistic models  Data validation by trained professionals or crowdsourcing  Sampling data for evaluation
  • 46. Indexing  Search requires matching and ranking  Matching selects a subset of the elements to be scored  The goal of indexing is to speed up matching  Retrieval needs to be performed in milliseconds  Without an index, retrieval would require streaming through the collection  The type of index depends on the query model to support  DB-style indexing  IR-style indexing
  • 47. IR-style indexing  Index data as text  Create virtual documents from data  One virtual document per subgraph, resource or triple  typically: resource  Key differences to Text Retrieval  RDF data is structured  Minimally, queries on property values are required
  • 48. Horizontal index structure  Two fields (indices): one for terms, one for properties  For each term, store the property on the same position in the property index  Positions are required even without phrase queries  Query engine needs to support the alignment operator  ✓ Dictionary is number of unique terms + number of properties  Occurrences is number of tokens * 2
  • 49. Vertical index structure  One field (index) per property  Positions are not required  But useful for phrase queries  Query engine needs to support fields  Dictionary is number of unique terms  Occurrences is number of tokens  ✗ Number of fields is a problem for merging, query performance
  • 50. Distributed indexing  MapReduce is ideal for building inverted indices  Map creates (term, {doc1}) pairs  Reduce collects all docs for the same term: (term, {doc1, doc2…}  Sub-indices are merged separately  Term-partitioned indices  Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.
  • 51. Query Processing / Matching
  • 52. Structure  Taxonomy of search approaches  Query processing / matching techniques for Semantic Search  Types of semantic data  Formalisms for querying semantic data  Approaches  General task: hybrid graph pattern matching  Matching keyword query against text  Matching structured query against structured data  Matching keyword query against structured data  Matching structured query against text (a hybrid case)  Main tasks, challenges and opportunities
  • 53. Taxonomy of search approaches (1)  The search problem  A collection of resources, called data  Information needs expressed as queries  Search is the task of efficiently computing results from data that are relevant to queries  Document data retrieval vs. structured data retrieval  Differences in query and data representation and matching  Efficiently retrieve structured data that exactly match formal information needs expressed as structured queries  Effectively rank textual results that match ambiguous NL / keyword queries to a certain degree (notions of relevance)  Semantic search: ranked retrieval of document and
  • 54. Taxonomy of search approaches (2) Query engines (of databases)  Exact  Complete Query  Sound • Approximate Matching • Not complete • Not sound • Ranked Data • Best effort • Top-k Search engines (stand-alone/database extensons) Query processing mainly focuses on efficiency of matching whereas ranking deals with degree of matching (relevance)!
  • 55. Query processing for Semantic Search (1)  Resources represented by semantic data ranging from  Structured data with well defined schemas  Semi-structured data with incomplete or no schemas  Data that largely comprise text  Hybrid / embedded data  Information needs of varying complexity, captured using different formalisms and querying paradigms  Natural language texts and keywords  Form-based inputs  Formal structured queries (Search is end-user oriented paradigm, requires “natural”, intuitive querying interfaces)  Semantic search: efficiently computing results (query processing) from data that are relevant to queries (ranking)
  • 56. Query processing for Semantic Search (2) NL Form- / facet- Structured Queries Keywords Questions based Inputs (SPARQL) Ambiquities Query Matching Data RDF data Semi- OWL ontologies with Structured embedded in Structured rich, formal RDF data text (RDFa) RDF data semantics Ambiquities: confidence degree, truth/trust
  • 57. Query processing for Semantic Search (3) Textual Data Structured query on Keyword query textual data on textual , e.g. querying data, e.g. Web Search target Semantic extension for search systems group of different search users, information systems? Unstructured needs, and types Structured of data. Structured Query Query Query processing for on Keyword query query on structured search is hybrid semantic structured data data, e.g. e.g. standard combination of techniques! search querying extensions for interface for databases databases / RDF stores Structured Data
  • 58. Types of data models (1)  Textual  Bag-of-words  Represent documents, text in structured data,…, real- world objects (captured as structured data)  Lacks “structure”  in text, e.g. linguistic structure, hyperlinks, (positional information)  Structure in structured data representation term (statistics) In combination with combination Cloud Computing Cloud technologies, promising Computing solutions for the Technologies management of `big solutions data' have emerged. management Existing industry `big data' solutions are able to industry support complex solutions queries and analytics support tasks with terabytes of complex data. For …… example, using a Greenplum.
  • 59. Types of data models (2)  Textual  Structured  Resource Description Framework (RDF)  Represent real-world objects, services, applications, …. documents  Resource attribute values and relationships between resources  Schema Picture creator Person Bob
  • 60. Types of data models (3)  Textual  Structured  Hybrid  RDF data embedded in text (RDFa)
  • 61. Types of data models – RDFa (1) … <div about="/alice/posts/trouble_with_bob"> <h2 property="dc:title">The trouble with Bob</h2> <h3 property="dc:creator">Alice</h3> Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: <div about="http://example.com/bob/photos/sunset.jpg"> <img src="http://example.com/bob/photos/sunset.jpg" /> <span property="dc:title">Beautiful Sunset</span> by <span property="dc:creator">Bob</span>. </div> </div> … adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
  • 62. Types of semantic data – RDFa (2) Bob is a good friend of mine. We content went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: content adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
  • 63. Types of semantic data - conclusion Semantic data in general can be conceived as a graph with text and structured data items as nodes, and edges represent different types of relationships including explicit semantic relationships and vaguely specified ones such as hyperlinks!
  • 64. Formalisms for querying semantic data (1) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”  Unstructured queries  Fully-structured queries  Hybrid queries: unstructured + structured
  • 65. Formalisms for querying semantic data (2) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”  Unstructured  NL  Keywords shared apartment Berlin Alice
  • 66. Formalisms for querying semantic data (3) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”  Unstructured  Fully-structured  SPARQL: BGP, filter, optional, union, select, construct, ask, describe  PREFIX ns: <http://example.org/ns#> SELECT ?x WHERE { ?x ns:knows ? y. ?y ns:name “Alice”. ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” }
  • 67. Formalisms for querying semantic data (4)  Fully-structured  Unstructured  Hybrid: content and structure constraints “shared apartment Berlin Alice” ?x ns:knows ? y. ?y ns:name “Alice”. ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
  • 68. Formalisms for querying semantic data (5)  Fully-structured  Unstructured  Hybrid: content and structure constraints “shared apartment Berlin Alice” ?x ns:knows ? y. ?y ns:name “Alice”. ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
  • 69. Formalisms for querying semantic data - conclusion Semantic search queries can be conceived as graph patterns with nodes referring to text and structured data items, and edges referring to relationships between these items!
  • 70. Processing hybrid graph patterns (1) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” ?y ns:name “Alice”. ?x ns:knows ? y trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend Beautiful of mine. We went to Sunset the same Germany Semantic Alice Search university, and also shared an apartment in Berlin in 2008. The trouble Germany 2009 Bob with Bob is that he Thanh takes much better photos than I do: KIT
  • 71. Processing hybrid graph patterns (2)  Matching hybrid graph patterns against data
  • 72. Matching keyword query against text • Retrieve documents • Inverted list (inverted index) keyword  {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...} • AND-semantics: top-k join shared Berlin Alice shared Berlin Alice D1 D1 D1 shared = berlin = alice shared
  • 73. Matching structured query against structured data • Retrieve data for triple patterns • Index on tables • Multiple “redundant” indexes to cover different access patterns • Join (conjunction of triples) • Blocking, e.g. linear merge join (required sorted input) • Non-blocking, e.g. symmetric hash-join • Materialized join indexes Per1 ns:works ?v ?v ns:name “KIT” ?x ns:knows ?y. ?x ns:knows ?z. SP-index PO-index ?z ns: works ?v. ?v ns:name “KIT” = = = Per1 ns:works Ins1 Ins1 ns:name KIT Per1 ns:works Ins1 Ins1 ns:name KIT
  • 74. Matching keyword query against structured data • Retrieve keyword elements • Using inverted index keyword  {<el1, score, ...>, <el2, score, ...>,…} • Exploration / “Join” • Data indexes for triple lookup • Materialized index (paths up to graphs) • Top-k Steiner tree search, top-k subgraph exploration Alice Bob KIT Alice Bob KIT ↔ ↔ = = Alice ns:knows Bob Inst1 ns:name KIT Bob ns:works Inst1
  • 75. Matching structured query against text • Based on offline IE (offline see Peter‟s slides) • Based on online IE, i.e., “retrieve “ is as follows • Derive keywords to retrieve relevant documents • On-the-fly information extraction, i.e., phrase pattern matching “X name Y” • Retrieve extracted data for structured part • Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp. ?x ns:knows ?y. ?x ns:knows ? ?z ns: works ?v. ?v ns:name “K knows name KIT
  • 76. Matching structured query against text • Index • Inverted index for document retrieval and pattern matching • Join index  inverted index for storing materialized joins between keywords • Neighborhood indexes for phrase patterns ?x ns:knows ?y. ?x ns:knows ? ?z ns: works ?v. ?v ns:name “K KIT name knows name KIT
  • 77. Query processing – main tasks  Retrieval  Documents , data Query elements, triples, paths, graphs  Inverted index,…, but also other (B+ tree) Matching  Index documents, triples, materialized paths  Join Data  Different join implementations, efficiency depends on availability of indexes  Non-blocking join good for early result reporting and for “unpredictable” Linked Data / data
  • 78. Query processing – more tasks  More complex queries: disjunction, aggregation, grouping, an Query alytics…  Join order optimization  Approximate  Approximate the search space Matching  Approximate the results (matching, join)  Parallelization  Top-k Data  Use only some entries in the input streams to produce k results  Multiple sources  Federation, routing  On-the-fly mapping, similarity join  Hybrid
  • 79. Query processing on the Web - research challenges and opportunities  Large amount of semantic data • Optimization, parallelizati  Data on inconsistent, redunda • Approximation nt, and low quality • Hybrid querying and data  Large amount of data management embedded in text • Federation, routing  Large amount of • Online schema sources mappings  Large amount of links • Similarity join between sources
  • 81. Structure  Problem definition  Types of ambiguities  Ranking paradigms  Model construction  Content-based  Structure-based
  • 82. Ranking – problem definition Query • Ambiguities arise when representation is incomplete / imprecise Matching • Ambiguities at the level of • elements (content ambiguity) • structure between elements Data (structure ambiguity) Due to ambiguities in the representation of the information needs and the underlying resources, the results cannot be guaranteed to exactly match the query. Ranking is the problem of determining the degree of matching using some notions of relevance.
  • 83. Content ambiguity apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” ?y ns:name “Alice”. ?x ns:knows ? y trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend Beautiful of mine. We went to Sunset the same Germany Semantic Alice Search university, and also shared an apartment in Berlin in 2008. The trouble Germany 2009 Bob with Bob is that he Thanh takes much better photos than I do: KIT What is meant by “Berlin” in the query? What is meant by “KIT” in the query? What is meant by “Berlin” in the data? What is meant by “KIT” in the data? A city with the name Berlin? a person? A research group? a university? a location?
  • 84. Structure ambiguity apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” ?y ns:name “Alice”. ?x ns:knows ? y trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend Beautiful of mine. We went to Sunset the same Germany Semantic Alice Search university, and also shared an apartment in Berlin in 2008. The trouble Germany 2009 Bob with Bob is that he Thanh takes much better photos than I do: KIT What is the connection between What is meant by “works”? “Berlin” and “Alice”? Works at? employed? Friend? Co-worker?
  • 85. Ambiguity  Recall: query processing is matching at the level of syntax and semantics  Ambiguities arise when data or query allow for multiple interpretations, i.e. multiple matches  Syntactic, e.g. works vs. works at  Semantic, e.g. works vs. employ  “Aboutness”, i.e., contain some elements which represent the correct interpretation  Ambiguities arise when matching elements of different granularities  Does i contains the interpretation for j, given some part(s) of i (syntactically/semantically) match j  E.g. Berlin vs. “…we went to the same university, and also, we shared an apartment in Berlin in 2008…”  Strictly speaking, ranking is performed after syntactic / semantic matching is done!
  • 86. Features: What to use to deal with ambiguities? What is meant by “Berlin”? What is the connection between “Berlin” and “Alice”?  Content features  Frequencies of terms: d more likely to be “about” a query term k when d more often, mentions k (probabilistic IR)  Co-occurrences: terms K that often co-occur form a contextual interpretation, i.e., topics (cluster hypothesis)  Structure features  Consider relevance at level of fields  Linked-based popularity
  • 87. Ranking paradigms  Explicit relevance model  Foundation: probability ranking principle  Ranking results by the posterior probability (odds) of being observed in the relevant class:  P(w|R) varies in different approaches, e.g., binary independence model, 2-poisson model, relevance model P(D|R) P( D | R) P(w | R) (1 P(w | N )) P(D/N) w D w D k P( w | R) P( w | q1 ,..., qk ) P ( m) P ( w | m) P(qk | m) m M i 1
  • 88. Ranking paradigms  No explicit notion of relevance: similarity between the query and the document model  Vector space model (cosine similarity)  Language models (KL divergence) Sim(q, d ) Cos(( w1,d ,..., wt , d ), ( w1,q ,..., wk , q )) P(t | q ) Sim(q, d ) KL( q || d ) P(t | q ) log( t V P(t | d )
  • 89. Model construction  How to obtain  Relevance models?  Weights for query / document terms?  Language models for document / queries?
  • 90. Content-based model construction  Document statistics, e.g. • An object is more likely  Term frequency about “Berlin”?  Document length • When it contains a  Collection statistics, e.g. relatively high number of mentions of the term  Inverse document “Berlin” frequency • When the number of  Background language mentions of this term in models the overall collection is tf relatively low wt , d idf |d | tf P(t | d ) (1 ) P(t | C ) |d |
  • 91. Structure-based model construction  Consider structure of objects during content-based modeling, i.e., to obtain structured content-based model  Content-based model for structured objects, documents and for general tuples P(t | d ) f P(t | f ) f Fd • An object is more likely about “Berlin”? • When one of its (important) fields contains a relatively high number of mentions of the term “Berlin”
  • 92. Structure-based model construction  PageRank  Link analysis algorithm  Measuring relative importance of nodes  Link counts as a vote of support  The PageRank of a node recursively depends on the number and PageRank of all nodes that link to it (incoming links)  ObjectRank  Types and semantics of links vary in structured data setting  Authority transfer schema graph specifies connection • An object about “Berlin” is more important than one another? strengths • When a relatively large number of objects are linked to it  Recursively compute authority transfer data graph
  • 93. Taxonomy of ranking approaches  Explicitly vs. non-explicitly relevance-based  Content-based ranking  Structure-based ranking  Content- and-structure-based ranking
  • 95. Search interface  Input and output functionality  helping the user to formulate complex queries  presenting the results in an intelligent manner  Semantic Search brings improvements in  Query formulation  Snippet generation  Suggesting related entities  Adaptive and interactive presentation  Presentation adapts to the kind of query and results presented  Object results can be actionable, e.g. buy this product  Aggregated search  Grouping similar items, summarizing results in various ways  Filtering (facets), possibly across different dimensions  Task completion  Help the user to fulfill the task by placing the query in a task context
  • 96. Query formulation  “Snap-to-grid”: suggest the most likely interpretation of the query  Given the ontology or a summary of the data  While the user is typing or after issuing the query  Example: Freebase suggest, TrueKnowledge
  • 97. Enhanced results/Rich Snippets  Use mark-up from the webpage to generate search snippets  Originally invented at Yahoo! (SearchMonkey)  Google, Yahoo!, Bing, Yandex now consume schema.org markup  Validators available from Google and Bing
  • 98. Other result presentation tasks  Select the most relevant resources within an RDF document  Penin et al. Snippet Generation for Semantic Web Search Engines, ASWC 2010  For each resource, rank the properties to be displayed  Natural Language Generation (NLG)  Verbalize, explain results
  • 101. Related entities Related actors and movies
  • 104. Semantic Search challenge (2010/2011)  Two tasks  Entity Search  Queries where the user is looking for a single real world object  Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010.  List search (new in 2011)  Queries where the user is looking for a class of objects  Billion Triples Challenge 2009 dataset  Evaluated using Amazon‟s Mechanical Turk  Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010  Blanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR2011
  • 106. Catching the bad guys  Payment can be rejected for workers who try to game the system  An explanation is commonly expected, though cheaters rarely complain  We opted to mix control questions into the real results  Gold-win cases that are known to be perfect  Gold-loose cases that are known to be bad  Metrics Worker and std. dev onReal  Avg. Known gold-win and gold-loose results Known Good Total Time to  Time to complete bad N complete N Mean N Mean N Mean (sec) badguy 20 2.556 200 2.738 20 2.684 240 29.6 goodguy 13 1 130 2.038 13 3 156 95 whoknows 1 1 21 1.571 2 3 24 83.5
  • 107. Other evaluations  TREC Entity Track  Related Entity Finding  Entities related to a given entity through a particular relationship  Retrieval over documents (ClueWeb 09 collection)  Example: (Homepages of) airlines that fly Boeing 747  Entity List Completion  Given some elements of a list of entities, complete the list  Question Answering over Linked Data  Retrieval over specific datasets (Dbpedia and MusicBrainz)  Full natural language questions of different forms  Correct results defined by an equivalent SPARQL query  Example: Give me all actors starring in Batman Begins.
  • 109. Resources  Books  Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press. 2011  Survey papers  Thanh Tran, Peter Mika. Survey of Semantic Search Approaches. Under submission, 2012.  Conferences and workshops  ISWC, ESWC, WWW, SIGIR, CIKM, SemTech  Semantic Search workshop series  Exploiting Semantic Annotations in Information Retrieval (ESAIR)  Entity-oriented Search (EOS) workshop

Editor's Notes

  1. Search is a form of content aggregation
  2. - In recent years we have witnessed tremendous interest and substantial economic exploitation of search technologies, both at web and enterprise scale. In this regard, technologies for Information Retrieval (IR) can be distinguished from solutions in the field of Database (DB) and Knowledge-based Expert Systems (KB). Whereas IR applications support the retrieval of documents (document retrieval), DB and KB systems deliver more precise answers (data retrieval). The technical differences between these two paradigms can be broken down into three main dimensions, i.e. the representation of the user need (query model), the underlying resources (data model) and the matching technique. The representation of user need and resource content in current IR systems is still almost exclusively achieved by the lightweight syntax-centric models such as the predominant keyword paradigm (i.e. keyword queries matched against bag-of-words document representation). While these systems have shown to work well for topical search, i.e. retrieve documents based on a topic, they usually fail to address more complex information needs. Using more expressive models for the representation of the user need, and leveraging the structure and semantics inherent in the data, DB and KB systems allow for complex queries, and for the retrieval of concrete answers that precisely match them.
  3. Semantic search can be seen as a retrieval paradigm Centered on the use of semanticsIncorporates the semantics entailed by the query and (or) the resources into the matching process, it essentially performs semantic search.
  4. Another trend resulting from this convergence of textual, structured and semantic data is the need for hybrid semantic search systems. Whilestandard IR focuses on the retrieval of documents, the DB and KB systems are built for data retrieval. As opposed to these types of systems, hybrid search systems manage the different types of data in a holistic way, and support the retrieval of answers that are integrated units of information, possibly assembled from different types of data. In such a system, there is not only a convergence at the
  5. We implemented the search paradigms and integrated them as separate search modules into a demonstrator system of the Information Workbench7 that has been developed as a showcase for interaction with the Web of data. In particular, keyword search is implemented according to the design and technologies employed by standard Semantic Web search engines. Like Sindice and FalconS, we use an invertedindex to store and retrieve RDF resources based on terms. Also using the inverted index, faceted search is implemented based on the techniques discussed in [25]. Result completion is based on recent work discussed for the TASTIER system [8]. For computing join graphs, we use the top-k procedure elaborated in [9]. This technique is also used for computing top-k interpretations, i.e. to support query completion. We choose to display the top-6 queries and the top-25 results respectively.
  6. - resource-base navigation
  7. Facebook invited, but continues to pursue OGP
  8. &gt;&gt; &gt;&gt; Intro Session: motivation, overview etc.: 10 min &gt;&gt; &gt;&gt;1)Representation of the Search Space (M) 45&gt;&gt; &gt;&gt;2)Offline Preprocessing: Crawling and Indexing (P) 60&gt;&gt; &gt;&gt;3)Query Processing (T) 4)Matching (T) 90&gt;&gt; &gt;&gt;5)Ranking 60 &gt;&gt; &gt;&gt;6)Result Presentation (45)&gt;&gt; &gt;&gt;7)Evaluation (45)&gt;&gt; &gt;&gt;Demo Session (30)&gt;&gt; &gt;&gt; Wrap-up Session (10)
  9. Close to the topic of keyword-search in databases, except knowledge-bases have a schema-oblivious designDifferent papers assume vastly different query needs even on the same type of data
  10. &gt;&gt; &gt;&gt; Intro Session: motivation, overview etc.: 10 min &gt;&gt; &gt;&gt;1)Representation of the Search Space (M) 45&gt;&gt; &gt;&gt;2)Offline Preprocessing: Crawling and Indexing (P) 60&gt;&gt; &gt;&gt;3)Query Processing (T) 4)Matching (T) 90&gt;&gt; &gt;&gt;5)Ranking 60 &gt;&gt; &gt;&gt;6)Result Presentation (45)&gt;&gt; &gt;&gt;7)Evaluation (45)&gt;&gt; &gt;&gt;Demo Session (30)&gt;&gt; &gt;&gt; Wrap-up Session (10)
  11. Approximate many results  need rankingRanking also needed in the case where qp is complete and sound, but queries and data representation so imprecise such that we have to deal with too many results
  12. Miss structural information in textsHyperlinksLinguistic structurePositional information
  13. - Real world objects
  14. SELECT Returns all, or a subset of, the variables bound in a query pattern match. CONSTRUCT Returns an RDF graph constructed by substituting variables in a set of triple templates. ASK Returns a boolean indicating whether a query pattern matches or not. DESCRIBE Returns an RDF graph that describes the resources found. Graph patterns are defined recursively. A graph pattern may have zero or more optional graph patterns, and any part of a query pattern may have an optional part. In this example, there are two optional graph patterns.Section 6 introduces the ability to make portions of a query optional; Section 7 introduces the ability to express the disjunction of alternative graph patterns; and Section 8 introduces the ability to constrain portions of a query to particular source graphs. Section 8 also presents SPARQL&apos;s mechanism for defining the source graphs for a query.Basic graph patterns allow applications to make queries where the entire query pattern must match for there to be a solution. For every solution of a query containing only group graph patterns with at least one basic graph pattern, every variable is bound to an RDF Term in a solution. However, regular, complete structures cannot be assumed in all RDF graphs. It is useful to be able to have queries that allow information to be added to the solution where the information is available, but do not reject the solution because some part of the query pattern does not match. Optional matching provides this facility: if the optional part does not match, it creates no bindings but does not eliminate the solution.The UNION pattern combines graph patterns; each alternative possibility can contain more than one triple pattern:SPARQL provides a means of combining graph patterns so that one of several alternative graph patterns may match. If more than one of the alternatives matches, all the possible pattern solutions are found.
  15. Web data: Text+ Linked Data+ Semi-structured RDF+ Hybrid datathat can be conceived as forming data graphsHear abour bob and alice all the time (in computer science literatures), want to find out more… build Semantic Web search engine. To address complex information needs by exploiting Web data:- Query as a set of constrains Match structured data Match text
  16. - Less than 5 percent of IR papers deal with query processing and the aspect of efficiency
  17. Partitioning has impact on performance!)Blocking: iterator-based approachesNon-blocking: good for streaming, good we cannot wait for some parts of the results to be completely worked-offLink data: cannot wait for sources, (some are slower then other) thus better to push data into query processing as the they come instead of pulling data and wait (busy waiting)Top-k:
  18. -phrase patterns (e.g., “X is the capital of Y”) for large scale extraction. Such simple patterns, when coupled with the richness and redundancy of theWeb, can be very useful in scraping millions or even billions of facts from the Web.- patterns: Matched when keywords or data types Xi appear in sequence. Matched if all keywords/data types/patterns appear within an m-words window.For extraction: relation patternsFor text search: entity patterns -When not assuming that all relevant data can be extracted such matching against text still needed: Hybrid search
  19. -phrase patterns (e.g., “X is the capital of Y”) for large scale extraction. Such simple patterns, when coupled with the richness and redundancy of theWeb, can be very useful in scraping millions or even billions of facts from the Web.- patterns: Matched when keywords or data types Xi appear in sequence. Matched if all keywords/data types/patterns appear within an m-words window.For extraction: relation patternsFor text search: entity patterns -When not assuming that all relevant data can be extracted such matching against text still needed: Hybrid search
  20. Given some materialized indexes  no joins at all Given sorted inputs  sorted merge join
  21. Given some materialized indexes  no joins at all Given sorted inputs  sorted merge joinjoin
  22. Every task is a challenge of itself, some more some less well elaboratedThere are separate challenges for every problems
  23. &gt;&gt; &gt;&gt; Intro Session: motivation, overview etc.: 10 min &gt;&gt; &gt;&gt;1)Representation of the Search Space (M) 45&gt;&gt; &gt;&gt;2)Offline Preprocessing: Crawling and Indexing (P) 60&gt;&gt; &gt;&gt;3)Query Processing (T) 4)Matching (T) 90&gt;&gt; &gt;&gt;5)Ranking 60 &gt;&gt; &gt;&gt;6)Result Presentation (45)&gt;&gt; &gt;&gt;7)Evaluation (45)&gt;&gt; &gt;&gt;Demo Session (30)&gt;&gt; &gt;&gt; Wrap-up Session (10)
  24. Approximate many results  need rankingRanking also needed in the case where qp is complete and sound, but queries and data representation so imprecise such that we have to deal with too many results
  25. Web data: Text+ Linked Data+ Semi-structured RDF+ Hybrid datathat can be conceived as forming data graphsHear abour bob and alice all the time (in computer science literatures), want to find out more… build Semantic Web search engine. To address complex information needs by exploiting Web data:- Query as a set of constrains Match structured data Match text
  26. Syntactic works vs.works at Semantic works vs. employ
  27. text which contains a large mentions of “Berlin” is likely to be about “Berlin”i is more likely to be the correct interpretation of K when terms in K co-occur in a large number of context (bag of words) associated with i“Berlin and apartment” more often co-occur in the geographic location context/topic than in the context of people“Berlin and apartment” more often co-occur in the geographic location context/topic than in the context of people