Searching, Navigating, and Querying the Semantic Web with SWSE

http://challenge.semanticweb.org/

Andreas, Aidan, Renaud,
Juergen, Sean, Stefan

Deadline: Friday, 13 July

8 pages, LNCS Springer style

1 slide will be 1 paragraph in the final paper

Please send slides back by Tuesday

First paper draft on Wednesday

Prototype by 13 July

1. Introduction


Sean/Andreas

What is the Problem?


Current search engines, both for the Web and the
Intranet, are based on keyword searches over
documents


More advanced systems have document clustering
capabilities using topic taxonomies, or use shallow
metadata (meta tags for topic description)



With the traditional approach to search, you only get
documents that match the keywords, not answers to precise
questions (e.g. the telephone number of person X, or the
projects a person is working on)



Hard to further process traditional search results
(information in documents) with a program


Not possible to combine two sources and derive an
answer that one source alone couldn't provide (mashups)

Why is it interesting and important?


Loads of data available (on the Web,
Intranet)



Leverage and provide new insights into
information assets


Connecting the dots across data sources
can reveal previously unseen relations
(data mashups)



Ultimately: transform the web of
documents into a web of data

Why is it hard?

1. Multitude of formats and data models: HTML (text documents), RSS, (relational) data, RDF

2. Difficult to integrate and consolidate data about entities: no common identifiers across sources

3. Scale: the Web is huge

4. How can users navigate massive data sets with unknown schemas?


Why hasn't it been solved before?


Sean?

What are the key components of
the approach?

1. Graph-structured data format (RDF) to merge data from multiple sources in multiple formats; entity-centric world view (talking about books, persons, stocks, locations rather than just documents)

2. Exact matching based on IFPs and fuzzy matching to consolidate entities

3. Distributed system, KISS architecture, highly optimised primitives; scale by adding hardware

4. Entity search and navigation interface oblivious to schema


2. Example Session


User Interaction Model


data model: entities with attributes, and
relations between entities


UI primitives


match keywords in attributes, display entities


filter by entity type


follow relations (incoming links and outgoing
links) for an entity


focus change


Example queries


Get the phone number of Rudi Studer (answers instead of links)


Explore and navigate Rudi Studer and surrounding entities (combination of different sources)


Maybe show the ontology graph here for the Rudi Studer result set (authored 110 papers, 38 people know him, he knows 27 people, he's the maker of a file, editor of something, workinfo homepage is this…)

- Andreas
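As a rough illustration of how this session maps onto the entity-centric data model and the UI primitives above, here is a minimal Python sketch; the quads, URIs, property names and phone number are invented for illustration and do not come from the actual SWSE data or API.

# Minimal sketch of the entity-centric lookup behind the example session.
# Quads, URIs and values below are made up; they do not reflect the real
# SWSE dataset or interface.
QUADS = [
    # (subject, predicate, object, source)
    ("ex:rudi",  "foaf:name",  "Rudi Studer",    "http://src1.example.org/"),
    ("ex:rudi",  "foaf:phone", "+00-000-0000",   "http://src1.example.org/"),
    ("ex:rudi",  "rdf:type",   "foaf:Person",    "http://src1.example.org/"),
    ("ex:paper", "dc:creator", "ex:rudi",        "http://src2.example.org/"),
]

def keyword_search(term):
    """Match a keyword in attribute values and return the matching entities."""
    return {s for s, p, o, c in QUADS if term.lower() in o.lower()}

def attributes(entity):
    """Attribute/value pairs for an entity (outgoing links)."""
    return [(p, o) for s, p, o, c in QUADS if s == entity]

def incoming(entity):
    """Incoming links: which entities refer to this one, and how."""
    return [(s, p) for s, p, o, c in QUADS if o == entity]

if __name__ == "__main__":
    for entity in keyword_search("rudi studer"):
        print(entity, attributes(entity), incoming(entity))

In the deployed system the same interaction pattern (keyword match, then attribute and link navigation) runs against the distributed index described in Section 3, not against an in-memory list.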

3. Architecture


Two components


Data preparation and integration




Semantic search and query engine

Architecture Overview

[Architecture diagram: HTML, RSS and RDF data flow through the Data Preparation and Integration layer (Crawling and Data Gathering, Data Conversion, Entity Recognition, Entity Matching and Consolidation) into the RDF Store; the Semantic Search and Query Engine (Index Manager, Query Processing, Ranking) operates on the store and serves the Entity Search and Navigation Interface.]
3.1. Data Preparation and
Integration


Crawling and data gathering


Juergen/Andreas


What?


Why?


How?



Crawl during June/July 2007


Started with RDF (rdfs:seeAlso)


Added search-engine-scraped RSS (found via common English words)


Added DBLP URLs for HTML pages and crawled them to depth 2/3?


Data Conversion


We convert structured data exported from the "Deep Web" under liberal licenses


Such data includes DBLP, CiteSeer, IMDb, Wikipedia etc.


Much data is present in these datasets; we can combine information (e.g. DBLP and CiteSeer) so the sum is greater than the parts


Create a target ontology/schema in RDFS/OWL; write wrappers, e.g. XSLT, to convert data to RDF according to the schema
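A minimal sketch of such a wrapper in Python, assuming an invented DBLP-like XML record and an invented target namespace; the actual conversion is done with XSLT against the hand-crafted RDFS/OWL schema mentioned above.

# Sketch of a data-conversion wrapper: structured "Deep Web" export -> RDF.
# The XML layout and the target properties are invented for illustration.
import xml.etree.ElementTree as ET

SAMPLE = """<article key="journals/example/Studer07">
  <author>Rudi Studer</author>
  <title>An Example Article</title>
  <year>2007</year>
</article>"""

def to_triples(xml_text):
    rec = ET.fromstring(xml_text)
    subject = "<http://example.org/pub/%s>" % rec.get("key")
    triples = []
    for child in rec:
        # map each XML element to an (invented) RDF property
        prop = "<http://example.org/schema#%s>" % child.tag
        triples.append('%s %s "%s" .' % (subject, prop, child.text))
    return triples

if __name__ == "__main__":
    print("\n".join(to_triples(SAMPLE)))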

Crawling and data gathering

(MultiCrawler)



Use of the MultiCrawler architecture for gathering data: a crawler framework developed specifically for getting structured data from a variety of different sources (RDF, XML, RSS, HTML, ...)


Support for crawling RDF documents by following rdfs:seeAlso links (see the sketch below)


Ability to get more structured data by crawling different data formats, converting them automatically into RDF using XSLTs, and extracting new URLs


Over two months (June, July 2007) XYZ documents (resulting in ABC quads/triples) were crawled from different RDF repositories, RSS feeds and selected HTML documents
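A much-simplified, single-machine Python sketch of the seeAlso-following behaviour; MultiCrawler itself is a distributed pipeline, and the regex-based link extraction and placeholder seed URL below are stand-ins for illustration only.

# Sketch of crawling RDF documents by following rdfs:seeAlso links.
# Simplified stand-in for MultiCrawler: single-threaded, no politeness,
# naive regex-based link extraction over RDF/XML.
import re
import urllib.request
from collections import deque

SEE_ALSO = re.compile(r'rdfs:seeAlso\s+rdf:resource="([^"]+)"')

def crawl(seed, limit=50):
    seen, queue, documents = set(), deque([seed]), []
    while queue and len(documents) < limit:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            body = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue
        documents.append((url, body))
        for target in SEE_ALSO.findall(body):
            queue.append(target)          # breadth-first expansion
    return documents

# docs = crawl("http://example.org/foaf.rdf")   # placeholder seed URL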

Entity consolidation (IFPs)



Need to integrate information on the same resources
across sources


Can do so automatically if URIs are correctly used; often
they're not


Can look at other unique keys (inverse functional
properties) to match instances of the same resource;
keys such as ISBN codes, IM chat usernames, etc.


IFPs are identified as such in ontologies


If two instances of book have the same ISBN code they
are the same book


This way we attempt to have one instance (result) per
entity; the total knowledge contribution towards an entity
from all sources is summarised in one result
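A small Python sketch of the exact-matching step, with toy triples and a hard-coded IFP list; in the system the IFPs are read from the ontologies, and fuzzy matching complements this step for resources without usable keys.

# Sketch of entity consolidation via inverse functional properties (IFPs):
# two instances sharing a value for an IFP are treated as the same entity.
# Toy triples and the IFP list are invented; real IFPs come from ontologies.
IFPS = {"ex:isbn", "foaf:mbox"}

TRIPLES = [
    ("src1:book42", "ex:isbn",   "978-3-16-148410-0"),
    ("src2:bookA",  "ex:isbn",   "978-3-16-148410-0"),
    ("src2:bookA",  "dc:title",  "A Book"),
    ("src1:alice",  "foaf:mbox", "mailto:alice@example.org"),
]

parent = {}                      # simple union-find over instance identifiers
def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

key_to_instance = {}
for s, p, o in TRIPLES:
    if p in IFPS:
        key = (p, o)
        if key in key_to_instance:
            union(s, key_to_instance[key])   # same IFP value => same entity
        else:
            key_to_instance[key] = s

print(find("src1:book42") == find("src2:bookA"))   # True: consolidated
print(find("src1:book42") == find("src1:alice"))   # False: different entities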

Entity Matching and Linking


How to interlink various web documents with existing RDF entities?


RDF entities: Geonames, foaf:Agent, doap:Project



Web documents: HTML, RSS, RDF, ...


We want to annotate web documents (unstructured information) with existing RDF resources
(semi-structured information) via rdfs:seeAlso links on a large scale.


To be able to have a better description of the entities identified in a document:


A web document speaks about a company named «DERI». What is the company?
How can I find its description and contact information?


To be able to find web documents based on entity conjunctions:


I want to read documents that refer to events about the «SWSE» project of «Andreas Harth» from «DERI» in «Ireland».


How?


Web documents are indexed with a normal IR engine, then named entities are matched
against the inverted index (very efficient, hundreds of matches per second). The issue is how
to avoid noisy matches: we need a disambiguation process (see the sketch below).


Disambiguation: weighting scheme from the IR engine; basic reasoning using contextual
information about the entity; statistically, with co-occurrence of rdfs:seeAlso links.


Evaluation?


Random selection of RDF entities; manual verification of the links between the entity and
the web documents.
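A toy Python sketch of the matching step: documents go into an inverted index, entity labels are looked up against it, and a naive token-overlap score stands in for the disambiguation stage; the documents, labels and threshold are invented, and the real system uses an IR engine plus the weighting and contextual reasoning described above.

# Sketch of linking RDF entities to web documents via an inverted index.
from collections import defaultdict

DOCS = {
    "doc1": "DERI is a research institute in Galway, Ireland",
    "doc2": "The SWSE project is developed at DERI by Andreas Harth",
}

ENTITIES = {
    "ex:DERI":  "DERI",
    "ex:SWSE":  "SWSE project",
    "ex:harth": "Andreas Harth",
}

def tokens(text):
    return text.lower().split()

# inverted index: token -> set of documents containing it
index = defaultdict(set)
for doc_id, text in DOCS.items():
    for tok in tokens(text):
        index[tok].add(doc_id)

def match(label, threshold=1.0):
    """Documents containing enough of the label's tokens (naive disambiguation)."""
    toks = tokens(label)
    hits = defaultdict(int)
    for tok in toks:
        for doc_id in index.get(tok, ()):
            hits[doc_id] += 1
    return [d for d, n in hits.items() if n / len(toks) >= threshold]

for uri, label in ENTITIES.items():
    print(uri, "rdfs:seeAlso", match(label))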

3.2. Semantic Web Search and Query Engine


Index Manager


Index Manager maintains local YARS index


Index contains complete quad index and keyword index


Keyword index (Lucene) provides identifiers of entities
that match a keyword query


Complete quad index contains six quad indices in different
orders (SPOC, POCS, etc.), which are required to offer
lookups on any quad pattern


Each of these individual indices comprises a blocked,
compressed, sorted file of quads with an in-memory
sparse index


The sparse index in memory stores the first quad of each
block and its position on disk, allowing direct access to
the first block pertaining to a given quad pattern (sketched below)
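A compact Python sketch of the lookup idea behind one such index (SPOC order), with blocks held in a list rather than a compressed on-disk file; block size and quads are invented.

# Sketch of one quad index (SPOC order): quads sorted, grouped into blocks,
# with a sparse in-memory index holding the first quad of each block.
# In SWSE the blocks live compressed on disk; here everything is in memory.
from bisect import bisect_right

BLOCK_SIZE = 2
quads = sorted([
    ("s1", "p1", "o1", "c1"),
    ("s1", "p2", "o2", "c1"),
    ("s2", "p1", "o3", "c2"),
    ("s3", "p1", "o4", "c2"),
    ("s3", "p2", "o5", "c3"),
])

blocks = [quads[i:i + BLOCK_SIZE] for i in range(0, len(quads), BLOCK_SIZE)]
sparse = [b[0] for b in blocks]          # first quad of each block

def lookup(prefix):
    """Return all quads starting with the given pattern prefix, e.g. ('s3',)."""
    # jump to the first block that can contain the prefix, then scan forward
    start = max(bisect_right(sparse, prefix) - 1, 0)
    out = []
    for block in blocks[start:]:
        for quad in block:
            if quad[:len(prefix)] == prefix:
                out.append(quad)
            elif quad[:len(prefix)] > prefix:
                return out
    return out

print(lookup(("s3",)))        # both s3 quads
print(lookup(("s1", "p2")))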




Index Manager


Specifically the Index Manager


Creates the local indices given a file


Re-orders and merge-sorts the quad indices


Serialises the sparse index for each quad index


Creates the keyword index


Offers query processing over the local index


Can be accessed via an RMI client to pose lookups and
SPARQL queries against the index

Query Processing


Provide a SPARQL interface enriched with keyword
lookups to aid exploration of unknown data


Distributed query processing based on flooding and
hash partitioning, depending on query and data distribution


Iterator notion with batch transfer of data to improve
performance (sketched below)


The UI uses the query processor; a SPARQL API is also
available
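A rough Python sketch of the batched-iterator idea; the remote side is simulated with a local generator, and the batch size and bindings are invented.

# Sketch of the iterator model with batch transfer: instead of shipping one
# binding at a time between machines, results move in batches to cut round
# trips. The "remote" side is simulated with a local generator.

def remote_scan(pattern, batch_size=100):
    """Pretend remote lookup: yields lists (batches) of result bindings."""
    results = [{"?x": "ex:entity%d" % i} for i in range(250)]  # fake bindings
    for i in range(0, len(results), batch_size):
        yield results[i:i + batch_size]          # one network round trip each

def iterate(batches):
    """Flatten batches back into a per-binding iterator for upstream operators."""
    for batch in batches:
        for binding in batch:
            yield binding

count = sum(1 for _ in iterate(remote_scan(("?x", "rdf:type", "foaf:Person"))))
print(count)   # 250 bindings fetched in 3 batches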

Ranking


The ranking method used is called ReConRank


Need a way of ordering results according to importance


Also need trust metrics, since data can be provided by
anyone, anywhere, about anything


Combine the data-source graph (physical layer) and the data
graph (logical layer) into one large graph and apply link
analysis to rank both entities (importance) and sources
(trust) (see the sketch below)


Also include TF-IDF scores from the keyword index in a
weighting scheme to improve relevance of top-scored
entities


Currently operates at run-time on the results returned
from the index
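A very rough Python sketch of the combined-graph intuition: entity nodes and source nodes in one graph, and a single PageRank-style computation yielding both importance and trust scores. The toy graph, damping factor and iteration count are invented, and this is not the actual ReConRank algorithm.

# Sketch of link analysis over a combined graph: data-level nodes (entities)
# and physical-level nodes (sources) in one structure, so one computation
# yields entity importance and source trust scores. Toy graph, not ReConRank.
graph = {
    "ex:rudi":      ["ex:paper1", "src:dblp"],   # entity links + provenance edge
    "ex:paper1":    ["ex:rudi", "src:citeseer"],
    "src:dblp":     ["ex:rudi", "ex:paper1"],
    "src:citeseer": ["ex:paper1"],
}

def rank(graph, damping=0.85, iterations=20):
    nodes = list(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, targets in graph.items():
            share = damping * score[n] / len(targets)
            for t in targets:
                new[t] = new.get(t, 0.0) + share
        score = new
    return score

for node, s in sorted(rank(graph).items(), key=lambda kv: -kv[1]):
    print("%-13s %.3f" % (node, s))

In the deployed system these link-analysis scores are further combined with the TF-IDF scores from the keyword index, as noted above.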

Entity Search and Navigation
Interface


Sean

4. Prototype


Andreas/Renaud/Aidan


Online with > 1m sources (need to get a lot of HTML; currently have roughly 100k FOAF and 100k RSS sources)


> 1m entities


Target: 500 million triples


Index generation, object consolidation: xx hours, xx triples/second


Distributed on 4 machines for query processing and 1 machine for the user interface


SPARQL endpoint

Evaluation


Response times for the user interface < 1 second


See http://jayant7k.blogspot.com/2006/06/benchmarking-results-of-mysql-lucene.html for an acceptable benchmark method


Need to test concurrent access (see the sketch below):


select 10000 keyword searches


do HTTP lookups with 1, 2, 3 concurrent threads


measure overall time


calculate average response time (overall time / 10000)
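A Python sketch of that measurement along the lines of the benchmark method linked above; the endpoint URL, query list and scaled-down query count are placeholders, not the real test setup.

# Sketch of the planned benchmark: replay keyword searches against the HTTP
# interface with 1, 2, 3 concurrent threads and report average response time.
import time
import threading
import urllib.request
from urllib.parse import quote

ENDPOINT = "http://localhost:8080/search?q="      # placeholder endpoint
QUERIES = ["rudi studer", "deri", "swse"] * 10     # placeholder for the 10000 searches

def worker(queries):
    for q in queries:
        try:
            urllib.request.urlopen(ENDPOINT + quote(q), timeout=30).read()
        except Exception:
            pass                  # a real run would count failures separately

def benchmark(num_threads):
    chunks = [QUERIES[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    start = time.time()
    for t in threads: t.start()
    for t in threads: t.join()
    total = time.time() - start
    print("%d threads: %.1fs total, %.3fs avg/query" %
          (num_threads, total, total / len(QUERIES)))

if __name__ == "__main__":
    for n in (1, 2, 3):
        benchmark(n)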



Evaluation of: fuzzy matching (Renaud) how?


Evaluation of ranking (Aidan) how?



Build that thing - TBD


get stats for current index (aidan) - Monday


get class diagram for current index (andreas) - Monday


object consolidate index (aidan) - Monday


crawl HTML pages (andreas) - Monday


entity recognition for RSS and HTML (renaud) - Tuesday


disambiguation? (renaud) - Wednesday


rebuild index with additional information (aidan) - Thursday


performance tests (aidan/renaud) - Friday


add ranking to UI (andreas) - Tuesday


cut-off for Lucene, or check how Lucene can return streaming results (aidan) - Tuesday


hack local join of keyword and SPOC (andreas/aidan) - Tuesday

5. Conclusion


Andreas/Sean


We have shown how to apply Semantic Web technology
to a large-scale Web data integration scenario using a
systems approach


Future work: complement the centralised warehousing
approach with an on-demand approach to include live
data sources in query processing; the basics, namely
distributed query processing, are already there


We see commercial potential in a wide range of
application areas for our system: general Web search,
vertical search engines, and enterprise search

Conclusion II


add Semantic Web Challenge criteria to the conclusion