Data on the (Semantic) Web


13 Dec 2013


Agenda (75 min)


Data on the Web


Extracting data


Publishing data


Linked Data


Metadata in HTML


SPARQL endpoints


Crawling and extraction


Indexing RDF data


Database-style indexing


IR-style indexing



IR view of the Web


Web accessible resources


Documents (typically HTML)


Multimedia


Search engines index NL text


Most of the structure in HTML is discarded


Multimedia is indexed by surrounding text


Additional information on web graph, usage


See Manning, Raghavan, Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
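
The IR view described on this slide can be illustrated with a minimal sketch: tags are thrown away, the remaining text is tokenized, and an inverted index maps each token to the pages containing it. The page URLs and contents are made up for illustration, and real engines do far more (ranking, link analysis, skipping script and style content).

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps only character data; all markup structure is discarded."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_tokens(html):
    """Strip tags and split the remaining text into lowercase word tokens."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return re.findall(r"\w+", " ".join(parser.chunks).lower())


def build_index(pages):
    """Build an inverted index: token -> set of page URLs containing it."""
    index = defaultdict(set)
    for url, html in pages.items():
        for token in html_to_tokens(html):
            index[token].add(url)
    return index


# Hypothetical pages: one table-based, one plain text.
PAGES = {
    "http://example.org/a": "<h1>The Rock</h1><table><tr><td>Year</td><td>1996</td></tr></table>",
    "http://example.org/b": "<p>A rock band from 1996</p>",
}
index = build_index(PAGES)
```

Both pages end up under the tokens "rock" and "1996"; the fact that 1996 is the value of a "Year" column on the first page is lost, which is the limitation the following slides address.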



Data on the Web


Most pages on the Web are generated from structured data


Data is stored in relational databases (typically)


Queried through web forms


Presented as tables or simply as unstructured text


The structure and semantics (meaning) of the data are not directly accessible to search engines


Two solutions


Extraction using Information Extraction (IE) techniques (implicit metadata)


Relying on publishers to expose structured data using standard Semantic Web formats (explicit metadata)


Information Extraction methods


Named Entity Recognition (NER) and disambiguation


OpenCalais, Zemanta



Extraction of triples


TextRunner, NELL


Suchanek et al. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW, 2007.


Wu and Weld. Autonomously Semantifying Wikipedia, CIKM 2007.


Filling web forms automatically (form-filling)


Madhavan et al. Google's Deep-Web Crawl. VLDB 2008


Extraction from HTML tables


Cafarella et al. WebTables: Exploring the Power of Tables on the Web. VLDB 2008


Wrapper induction


Kushmerick et al. Wrapper Induction for Information Extraction. IJCAI 1997
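
As a rough illustration of extraction from HTML tables, the sketch below reads the first row of a table as a header and turns every following row into a record. This is not the WebTables method (which also has to decide which tables are relational at all); it only assumes a clean table with a header row, and the sample table is invented.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text chunks of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Treat the first row as the header and map each later row to a dict."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical input table.
SAMPLE = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>The Rock</td><td>1996</td></tr>
</table>
"""
records = table_to_records(SAMPLE)
```

The column names become attribute names, which is the implicit metadata that the extraction approach tries to recover.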


Information Extraction


A tale of many trade-offs


Less or no training data, lower quality


The more complex the model to learn, the more training data needed


The deeper the analysis, the slower the processing


The more narrowly trained, the more likely to break


Populating a Knowledge Base is easier than ad-hoc extraction


However, a complete and correct semantic representation of the content may not be needed for all tasks

Publishing data on the Web


Pre-Semantic Web technologies have been inadequate


Existing formats are not appropriate for serendipitous reuse


HTML: structure is lost due to a mix of presentation and content


XML: captures structure, but not semantics


Lack of protocols to talk to databases over the Web


Motivation has been lacking


Publishers are interested to the extent that they benefit from sharing data, e.g. because it drives traffic back to their site


What the Semantic Web provides


Data format: RDF


Designed for object-relationship data


Identification of objects by URIs


Multiple serializations: RDF/XML, Turtle, N3, N-Triples, TriX etc.


Schema language: OWL


Description Logic based


Extensible using rule languages such as RIF


Query language and protocol: SPARQL


The principles of Linked Data
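
To make the RDF data model concrete, here is a minimal sketch that holds a few (subject, predicate, object) triples in memory and writes them in the N-Triples serialization. The `URI` marker class and the example.org resources are assumptions made for this illustration; only the FOAF property URIs are real vocabulary terms. Datatypes, language tags and blank nodes are left out.

```python
class URI(str):
    """Marks a string as a URI reference rather than a plain literal."""


def term(value):
    """Serialize one RDF term in N-Triples syntax."""
    if isinstance(value, URI):
        return f"<{value}>"
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def to_ntriples(triples):
    """One line per triple: subject, predicate, object, terminated by ' .'"""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


EX = "http://example.org/"           # hypothetical namespace
FOAF = "http://xmlns.com/foaf/0.1/"  # FOAF vocabulary namespace

triples = [
    (URI(EX + "neil"), URI(FOAF + "name"), "Neil"),
    (URI(EX + "neil"), URI(FOAF + "knows"), URI(EX + "joe")),
]
output = to_ntriples(triples)
```

The first triple has a literal as its object and the second links two objects identified by URIs, which is the object-relationship structure the slide refers to.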


Methods for publishing RDF data


Multiple ways of publishing RDF data


SPARQL endpoints


Linked Data


Metadata in HTML documents


Data feeds


GRDDL


Automated tools



Each requires different treatment in crawling and extraction


SPARQL endpoints


SPARQL is a standard query language and protocol for accessing RDF stores via HTTP


Also possible to expose a traditional RDBMS via a wrapper


Advantages:


Most flexible and best performing access from a consumer perspective


Disadvantages:


Higher maintenance


Discovery is problematic


Tools:


Triple stores (Oracle, Virtuoso, Sesame, Jena, OWLIM etc.)


RDB-to-RDF mappers such as D2RQ and Triplify


SPARQL query builders
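
A sketch of what access over the SPARQL protocol looks like from the consumer side: the query travels as the `query` parameter of an HTTP GET, and the JSON results format returns one binding per row. The endpoint URL and the sample response are invented for illustration, and the request is only built, not sent.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request


def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL protocol query request via HTTP GET."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def bindings_to_rows(results_json):
    """Flatten a SPARQL JSON results document into plain dicts."""
    document = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in document["results"]["bindings"]
    ]


ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint
QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"

request = build_sparql_request(ENDPOINT, QUERY)
# Decode the URL again to check that the query survives encoding intact.
sent_query = parse_qs(urlsplit(request.full_url).query)["query"][0]

# Hypothetical response body in the SPARQL JSON results format.
SAMPLE_RESPONSE = """{
  "head": {"vars": ["s"]},
  "results": {"bindings": [
    {"s": {"type": "uri", "value": "http://example.org/neil"}}
  ]}
}"""
rows = bindings_to_rows(SAMPLE_RESPONSE)
```

Passing `request` to `urllib.request.urlopen` would perform the actual call against a live endpoint.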

Linked Data


A web of interlinked RDF documents


Each document describes the characteristics of a single object, and links to related objects


Most important: links to the same object in different data sets (sameAs)


Guidelines for proper configuration of web servers to serve such documents


Rapidly growing community


Focus on public datasets (government, scientific)


See linkeddata.org
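
One way a consumer might use sameAs links is to group all URIs that denote the same object. The sketch below does this with a small union-find structure over `owl:sameAs` triples; the dataset URIs are hypothetical, and treating sameAs as fully transitive is a simplifying assumption that real data does not always justify.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_clusters(triples):
    """Group URIs connected by owl:sameAs into sorted clusters (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            parent[find(subject)] = find(obj)

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical triples from three datasets describing the same city.
TRIPLES = [
    ("http://a.example/Berlin", OWL_SAME_AS, "http://b.example/berlin"),
    ("http://b.example/berlin", OWL_SAME_AS, "http://c.example/2950159"),
    ("http://a.example/Paris", "http://www.w3.org/2000/01/rdf-schema#label", "Paris"),
]
clusters = same_as_clusters(TRIPLES)
```

The three Berlin URIs end up in one cluster even though the first and third are never linked directly.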










The even larger picture: entire datasets connected

Linked Data


Advantages:


No change to the publishing of the HTML documents


Data can be published by a third party (e.g. DBpedia)


Disadvantages:


Web servers need to be configured to properly handle URIs that identify concepts instead of documents


Search engines need to be extended to crawl linked data


Data is not always linked to documents


Tools


Linked Data browsers (Tabulator, Marbles etc.)


RDB-to-RDF mappers (D2RQ, Triplify)
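
The server configuration mentioned under the disadvantages usually amounts to answering a request for a concept URI with a 303 redirect to a document about the concept, chosen by the Accept header. The sketch below shows that decision only; the `/resource/`, `/data/` and `/page/` layout is an assumed URL scheme, and the Accept matching ignores quality values for brevity.

```python
RDF_MEDIA_TYPES = ("application/rdf+xml", "text/turtle")


def respond(path, accept_header):
    """Return (status, redirect target) for a request to a concept URI.

    Assumed layout: /resource/X identifies the concept, /data/X is the RDF
    description and /page/X the HTML page about it.
    """
    prefix = "/resource/"
    if not path.startswith(prefix):
        return 404, None
    name = path[len(prefix):]
    wants_rdf = any(media_type in accept_header for media_type in RDF_MEDIA_TYPES)
    target = ("/data/" if wants_rdf else "/page/") + name
    return 303, target  # 303 See Other: the concept itself is not a document


rdf_response = respond("/resource/Berlin", "text/turtle")
html_response = respond("/resource/Berlin", "text/html")
```

A Linked Data crawler would follow the first redirect, a browser the second.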



Metadata in HTML


Microformats, RDFa, Microdata


Advantages:


Data and document are always in sync


Browser plug-in friendly


Search engine friendly


Copy-paste friendly


Tools:


XML editors (e.g. Oxygen)


RDFa Distiller


RDFa bookmarklet, Ubiquity RDFa plugin


Optimus microformat parser


Examples: many, including SlideShare, YouTube, LinkedIn, Digg, Myspace, Facebook



Microformats (μf)


Agreements on the way to encode certain kinds of data in HTML


Reuse of semantic-bearing HTML elements


Based on existing standards


Minimality: designed to solve particular problems


Microformats exist for a limited set of objects


hCard, hResume, hProduct, hRecipe


Varying degrees of support and stability


hCard and rel-tag are widely supported


Community centered around microformats.org


Specifications and discussions are hosted there

Example: the hCard microformat

<cite class="vcard">
  <a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">Eric Meyer</a>
</cite> wrote a post
(<cite><a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">Tax Relief</a></cite>)
about an unintentionally humorous letter he received from the
<span class="vcard">
  <a class="fn org url" href="http://irs.gov/">Internal Revenue Service</a>
</span>.

<div class="vcard">
  <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
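
A consumer of hCard data might read the class names roughly as follows. This is a simplified sketch, not a conforming microformats parser: it knows only the handful of properties used in the example, does not handle nested cards, and assumes every non-void element inside a card is properly closed.

```python
from html.parser import HTMLParser

TEXT_PROPS = {"fn", "org", "tel", "title"}  # value is the element's text
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HCardParser(HTMLParser):
    """Collects one dict per element carrying class="vcard"."""

    def __init__(self):
        super().__init__()
        self.cards = []    # finished cards
        self._card = None  # card currently being filled
        self._stack = []   # (text properties, text chunks) per open element in the card

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if self._card is None:
            if "vcard" not in classes:
                return
            self._card = {}
        href = attrs.get("href") or ""
        if "url" in classes and href:
            self._card["url"] = href
        if "email" in classes and href:
            is_mailto = href.startswith("mailto:")
            self._card["email"] = href[len("mailto:"):] if is_mailto else href
        self._stack.append((classes & TEXT_PROPS, []))

    def handle_data(self, data):
        for _, chunks in self._stack:
            chunks.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        props, chunks = self._stack.pop()
        text = " ".join("".join(chunks).split())
        for prop in props:
            self._card[prop] = text
        if not self._stack:  # the vcard root element was just closed
            self.cards.append(self._card)
            self._card = None


SAMPLE = """
<span class="vcard">
  <a class="fn org url" href="http://irs.gov/">Internal Revenue Service</a>
</span>.
<div class="vcard">
  <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
"""
parser = HCardParser()
parser.feed(SAMPLE)
parser.close()
cards = parser.cards
```

Note that the meaning of each value comes entirely from agreed class names such as `fn` and `tel`; nothing in the markup itself says which vocabulary they belong to.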

Microformats: limitations


No shared syntax


Each microformat has a separate syntax tailored to the vocabulary


No formal schemas


Limited reuse, extensibility of schemas


Unclear which combinations are allowed


No datatypes


No namespaces, unique identifiers (URIs)


No interlinking


Mapping between instances is required


RDFa


W3C standard for embedding RDF data in HTML documents


A set of new HTML attributes


Despite the extension of HTML, RDFa does not require XHTML


A specification of how to extract the data from these attributes


RDFa can be used to embed data in HTML headers or to annotate parts of the body of HTML documents


RDFa is just a syntax; you have to choose a vocabulary separately

Differences in usage



Microformats are the first choice for most publishers because they are simple


If you find none that perfectly fits your needs then you need RDFa


Microformats have a fixed schema: you cannot add your own attributes


Example: a social networking site with user profiles


vCard is a good candidate, but for example it doesn’t have a way to express the user’s social connections


You either live without this, or go with RDFa
Example: Facebook’s Open Graph Protocol


Open Graph Protocol


RDF vocabulary to be used in conjunction with RDFa


Simplify the work of developers by restricting the freedom in RDFa


Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment


Only HTML <head> accepted


http://opengraphprotocol.org/


Facebook as consumer


Facebook indexes OGP data whenever someone ‘likes’ a page with OGP data


Social recommendation (‘like’ button) provides publishers with a way to promote their content on Facebook


Shows up in profiles and the news feed; the user subscribes to a channel of future feeds from the web page they liked


Facebook Graph API allows 3rd party developers to access the data


http://developers.facebook.com/docs/api



Example: Facebook’s Open Graph Protocol

<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
  <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> …
</head> ...
</html>
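
Because OGP data lives only in `<meta>` elements of the `<head>`, a consumer needs very little machinery to read it. The sketch below collects `og:` properties with the standard-library HTML parser; it is an illustration of the idea, not Facebook's actual crawler logic, and it keeps only the last value if a property is repeated.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects og:* properties from <meta> elements inside <head>."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        elif tag == "meta" and self._in_head:
            attrs = dict(attrs)
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content") is not None:
                self.properties[prop] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False


PAGE = """
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
  <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
</head>
<body><meta property="og:title" content="ignored: not in head" /></body>
</html>
"""
parser = OpenGraphParser()
parser.feed(PAGE)
parser.close()
og = parser.properties
```

Restricting the markup to the head is what keeps this so much simpler than a general RDFa processor.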

Microdata


HTML5 is currently under standardization at the W3C


Introduces Microdata


Similar to microformats


Some predefined vocabularies with central registration


Some of the flexibility of RDFa


Introduce new terms using reverse domain names or full URIs


Semantic HTML elements such as <time>, <video>, <article>…



Microdata example

<div itemscope itemid="http://www.yahoo.com/resource/person">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called
    <span itemprop="band">Four Parts Water</span>.
    I was born on
    <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
    <img itemprop="image" src="me.png" alt="me">
  </p>
</div>
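
Reading Microdata follows the same pattern as the other formats: find the element with `itemscope`, then collect each `itemprop` inside it, taking the value from an attribute for elements such as `<time>` and `<img>` and from the text otherwise. The sketch below is a simplified reading of those rules; it ignores nested items, multi-valued properties and `itemtype`, and assumes non-void elements are properly closed.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}
# Elements whose property value comes from an attribute instead of their text.
ATTR_VALUE = {"a": "href", "img": "src", "link": "href", "meta": "content", "time": "datetime"}


class MicrodataParser(HTMLParser):
    """Collects top-level items as {"id": ..., "properties": {...}} dicts."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._item = None
        self._stack = []  # (property name or None, text chunks) per open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._item is None:
            if "itemscope" not in attrs or tag in VOID_TAGS:
                return
            self._item = {"id": attrs.get("itemid"), "properties": {}}
            self._stack.append((None, []))
            return
        prop = attrs.get("itemprop")
        value_attr = ATTR_VALUE.get(tag)
        if prop and value_attr and attrs.get(value_attr) is not None:
            self._item["properties"][prop] = attrs[value_attr]
            prop = None  # value already taken from the attribute
        if tag not in VOID_TAGS:
            self._stack.append((prop, []))

    def handle_data(self, data):
        for _, chunks in self._stack:
            chunks.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        prop, chunks = self._stack.pop()
        if prop:
            self._item["properties"][prop] = " ".join("".join(chunks).split())
        if not self._stack:  # the itemscope element was just closed
            self.items.append(self._item)
            self._item = None


SAMPLE = """
<div itemscope itemid="http://www.yahoo.com/resource/person">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.
     I was born on
     <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
     <img itemprop="image" src="me.png" alt="me">
  </p>
</div>
"""
parser = MicrodataParser()
parser.feed(SAMPLE)
parser.close()
items = parser.items
```

The birthday is taken from the machine-readable `datetime` attribute rather than the displayed text, which is the benefit of the semantic `<time>` element.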


The state of metadata in HTML


5-10% of webpages contain some explicit metadata


Depending on how you count…


Too many competing approaches


Too many formats: microformats vs RDFa vs Microdata


Too many schemas: publishers may need to use multiple different vocabularies or microformats to satisfy everyone