Scientific RDF Databases


21 Oct 2013


Michael Mertens

K.U.Leuven


Outline


Introduction to RDF


RDF Databases


Advantages for scientific R&D


In practice


Criticism



Introduction


RDF: Resource Description Framework


Originally: metadata data model


Now: general method for the conceptual description of web resources (Semantic Web)


Traditional Web in 2009:







Sharing documents


URL as retrieval mechanism


HTML standard format


Hypertext links

Image taken from “The Emerging Web of Linked Data”, Chris Bizer



Semantic Web


Data on the web


HTML describes documents and links between them



Semantic web:


Publish data in RDF, OWL, XML, ...


Describe arbitrary things: people, books, events, ...


Links between these concepts


Machine-readable, web-accessible databases






Tim Berners-Lee: LINKED DATA


Connected structured data


3 simple principles:


Use URLs to name conceptual things


Looking up a URL returns useful data about that thing


Relationships link to other URLs
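
The three principles can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `example.org` URLs, the `WEB` dictionary that stands in for real HTTP lookups, and the helper names `look_up` and `follow`.

```python
# Hypothetical in-memory "web": each URL names a thing and maps to data about it.
WEB = {
    "http://example.org/id/alice": {
        "name": "Alice",
        # Principle 3: the relationship value is another URL.
        "knows": "http://example.org/id/bob",
    },
    "http://example.org/id/bob": {
        "name": "Bob",
    },
}


def look_up(url):
    """Principle 2: looking up a URL returns useful data about the thing."""
    return WEB[url]


def follow(url, relationship):
    """Principle 3: follow a relationship by looking up the URL it points to."""
    target_url = look_up(url)[relationship]
    return look_up(target_url)


friend = follow("http://example.org/id/alice", "knows")
print(friend["name"])
```

On the real web, `look_up` would be an HTTP request; the dictionary only keeps the sketch self-contained.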





Semantic Web > Linked Data


Semantic Web > Linked Data > Example


Before: scientific data was usually not shared


Pharmaceutical Drug Discovery


Data is spread over many sources


DrugBank, ClinicalTrials.gov, Health Care and Life Science


Genomics data, protein data, ...



A question nobody had examined before:


“What proteins are involved in signal transduction AND are related to pyramidal neurons?”

Example taken from “Tim Berners-Lee on the next Web”


The web:



223,000 hits, 0 results



Linked Data:




32 hits, 32 results



DRD1, 1812 adenylate cyclase activation
ADRB2, 154 adenylate cyclase activation
ADRB2, 154 arrestin mediated desensitization of G-protein coupled…
DRD1IP, 50632 dopamine receptor signaling pathway
DRD1, 1812 dopamine receptor, adenylate cyclase activating pathway
DRD2, 1813 dopamine receptor, adenylate cyclase inhibiting pathway
GRM7, 2917 G-protein coupled receptor protein signaling pathway
GNG3, 2785 G-protein coupled receptor protein signaling pathway
GNG12, 55970 G-protein coupled receptor protein signaling pathway
DRD2, 1813 G-protein coupled receptor protein signaling pathway
ADRB2, 154 G-protein coupled receptor protein signaling pathway
CALM3, 808 G-protein coupled receptor protein signaling pathway
HTR2A, 3356 G-protein coupled receptor protein signaling pathway
DRD1, 1812 G-protein signaling, coupled to cyclic nucleotide second…
SSTR5, 6755 G-protein signaling, coupled to cyclic nucleotide second…
MTNR1A, 4543 G-protein signaling, coupled to cyclic nucleotide…
HTR6, 3362 G-protein signaling, coupled to cyclic nucleotide second…
GRIK2, 2898 glutamate signaling pathway
GRIN1, 2902 glutamate signaling pathway
GRIN2A, 2903 glutamate signaling pathway
GRIN2B, 2904 glutamate signaling pathway
ADAM10, 102 integrin-mediated signaling pathway
GRM7, 2917 negative regulation of adenylate cyclase activity
LRP1, 4035 negative regulation of Wnt receptor signaling pathway
ADAM10, 102 Notch receptor processing
ASCL1, 429 Notch signaling pathway
HTR2A, 3356 serotonin receptor signaling pathway
ADRB2, 154 transmembrane receptor protein tyrosine kinase…
PTPRG, 5793 transmembrane receptor protein tyrosine kinase…
EPHA4, 2043 transmembrane receptor protein tyrosine kinase…
NRTN, 4902 transmembrane receptor protein tyrosine kinase…
CTNND1, 1500 Wnt receptor signaling pathway





PREFIX go: <http://purl.org/obo/owl/GO#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX mesh: <http://purl.org/commons/record/mesh/>
# The sc: and ro: prefixes used below are not declared on the slide.

SELECT ?genename ?processname
WHERE
{
  graph <http://purl.org/commons/hcls/pubmesh>
  { ?paper ?p mesh:D017966 .
    ?article sc:identified_by_pmid ?paper .
    ?gene sc:describes_gene_or_gene_product_mentioned_by ?article . }

  graph <http://purl.org/commons/hcls/goa>
  { ?protein rdfs:subClassOf ?res .
    ?res owl:onProperty ro:has_function .
    ?res owl:someValuesFrom ?res2 .
    ?res2 owl:onProperty ro:realized_as .
    ?res2 owl:someValuesFrom ?process .

    graph <http://purl.org/commons/hcls/20070416/classrelations>
    { { ?process <http://purl.org/obo/owl/obo#part_of> go:GO_0007166 }
      union
      { ?process rdfs:subClassOf go:GO_0007166 } }

    ?protein rdfs:subClassOf ?parent .
    ?parent owl:equivalentClass ?res3 .
    ?res3 owl:hasValue ?gene . }

  graph <http://purl.org/commons/hcls/gene>
  { ?gene rdfs:label ?genename }

  graph <http://purl.org/commons/hcls/20070416>
  { ?process rdfs:label ?processname }
}




Related to Pyramidal Neurons



Part of Signal Transduction





Used 4 sources




What do we need?


Identifiers: URIs


Linking mechanism: HTTP


Vocabulary: Web Ontology Language (OWL)


Serialization: RDF/XML








Identifiers: URIs


Use of HTTP URLs


Link to “Resources”


Possibly many documents per resource


Shift to non-information resources:

http://dbpedia.org/resource/London


HTML: http://dbpedia.org/page/London

RDF: http://dbpedia.org/data/London.rdf

N-Triples: http://dbpedia.org/data/London.ntriples
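
The split between one resource URI and several document URLs can be sketched as a lookup table. This is only an illustration built from the three URLs on the slide: the format keys (`html`, `rdf`, `ntriples`) and the function name `document_url` are assumptions, and the sketch says nothing about how the server itself chooses a document.

```python
RESOURCE_PREFIX = "http://dbpedia.org/resource/"

# URL patterns copied from the slide; the dictionary keys are labels chosen for this sketch.
DOCUMENT_TEMPLATES = {
    "html": "http://dbpedia.org/page/{name}",
    "rdf": "http://dbpedia.org/data/{name}.rdf",
    "ntriples": "http://dbpedia.org/data/{name}.ntriples",
}


def document_url(resource_uri, fmt):
    """Map a resource URI (the thing) to one of its document URLs (a description)."""
    if not resource_uri.startswith(RESOURCE_PREFIX):
        raise ValueError("not a resource URI: " + resource_uri)
    name = resource_uri[len(RESOURCE_PREFIX):]
    return DOCUMENT_TEMPLATES[fmt].format(name=name)


london = "http://dbpedia.org/resource/London"
for fmt in ("html", "rdf", "ntriples"):
    print(fmt, document_url(london, fmt))
```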


Linking mechanism: HTTP


Accessible through generic data browsers


Can be crawled by search engines


Connects different sources




In contrast, each Web API exposes its own interface












Vocabulary: Web Ontology Language (OWL)


Knowledge representation language


Designed to be interpreted by computers


Describes data as individuals (grouped into classes) and property assertions (relationships)







<owl:Class rdf:ID="Money">
  <rdfs:subClassOf rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
</owl:Class>

<owl:DatatypeProperty rdf:ID="currency">
  <rdfs:domain rdf:resource="#Money"/>
  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
</owl:DatatypeProperty>




Different URIs about the same thing are linked with ‘owl:sameAs’
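
One way a consumer might use ‘owl:sameAs’ links is to group all URIs that name the same thing. The sketch below does this with a small union-find structure. It is an assumption about how a client could process such links, not something the slides prescribe, and every URI except the DBpedia one is hypothetical.

```python
def same_as_groups(same_as_pairs):
    """Group URIs connected by sameAs links (treated as symmetric and transitive)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving keeps chains short
            uri = parent[uri]
        return uri

    for left, right in same_as_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for uri in parent:
        groups.setdefault(find(uri), set()).add(uri)
    return list(groups.values())


pairs = [
    ("http://dbpedia.org/resource/London", "http://example.org/geo/London"),
    ("http://example.org/geo/London", "http://example.org/places/london"),
    ("http://example.org/people/a", "http://example.org/staff/a"),
]
groups = same_as_groups(pairs)
print(sorted(len(group) for group in groups))
```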








RDF: Resource Description Framework


Based on triples


Subject, predicate, object




Resources identified by URI


URIs can be looked up to retrieve RDF information


RDF information links to other URIs

<http://dbpedia.org/resource/London,
 http://dbpedia.org/ontology/country,
 http://dbpedia.org/resource/United_Kingdom>
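
A triple maps naturally onto a three-field record. The sketch below stores the London triple from the slide next to a second triple added purely for illustration, and filters them with a wildcard pattern. The `Triple` type and the `match` helper are names assumed for this sketch.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

triples = [
    Triple(DBR + "London", DBO + "country", DBR + "United_Kingdom"),  # from the slide
    Triple(DBR + "Leuven", DBO + "country", DBR + "Belgium"),  # illustrative addition
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]


in_uk = match(triples, predicate=DBO + "country", obj=DBR + "United_Kingdom")
print([t.subject for t in in_uk])
```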


RDF vs XML


This looks a lot like XML...


Why don’t we just use XML?

RDF: <Page, author, Name>

XML: the same statement can be encoded in many ways:

<document href="Page">
  <author>Name</author>
</document>

<document>
  <details>
    <uri>Page</uri>
    <author>Name</author>
  </details>
</document>

<author>
  <uri>Page</uri>
  <name>Name</name>
</author>

...


RDF: Serialization


RDF/XML: proposed by W3C

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>


N3 or Turtle: human-readability

@prefix dc: <http://purl.org/dc/elements/1.1/>.

<http://en.wikipedia.org/wiki/Tony_Benn>
    dc:title "Tony Benn";
    dc:publisher "Wikipedia".
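
Both serializations carry the same two triples. The following is a minimal sketch, not a full RDF/XML parser: it uses Python's standard `xml.etree.ElementTree` and handles only flat `rdf:Description` elements with literal values, as in the Tony Benn example. The function names `simple_triples` and `to_ntriples` are assumptions, and literal escaping is ignored.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>"""


def simple_triples(rdf_xml):
    """Extract (subject, predicate, literal) triples from flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml)
    result = []
    for description in root.findall("{%s}Description" % RDF_NS):
        subject = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # ElementTree names elements "{namespace}local"; joined, they form the predicate URI.
            namespace, local = prop.tag[1:].split("}")
            result.append((subject, namespace + local, prop.text))
    return result


def to_ntriples(triples):
    """Write one 'subject predicate "literal" .' line per triple (no escaping)."""
    return "\n".join('<%s> <%s> "%s" .' % t for t in triples)


print(to_ntriples(simple_triples(RDF_XML)))
```

A real application would use a dedicated RDF library; the point here is only that the XML nesting reduces to plain subject, predicate, object statements.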


RDF Databases


Also called “Triple Store”


Data in the form of triples:



subject, predicate, object


Dominant query language: SPARQL





PREFIX abc: <nul://sparql/exampleOntology#>

SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Africa .
}
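
To show what a triple store does with such a query, here is a minimal sketch that evaluates the same four triple patterns over a small in-memory dataset. The data (the `ex:` identifiers and the two countries) is hypothetical, since the slide shows only the query, and the nested-loop join is a naive stand-in for what a real SPARQL engine does.

```python
ABC = "nul://sparql/exampleOntology#"

# Hypothetical dataset; not part of the slides.
triples = [
    ("ex:nairobi", ABC + "cityname", "Nairobi"),
    ("ex:nairobi", ABC + "isCapitalOf", "ex:kenya"),
    ("ex:kenya", ABC + "countryname", "Kenya"),
    ("ex:kenya", ABC + "isInContinent", ABC + "Africa"),
    ("ex:brussels", ABC + "cityname", "Brussels"),
    ("ex:brussels", ABC + "isCapitalOf", "ex:belgium"),
    ("ex:belgium", ABC + "countryname", "Belgium"),
    ("ex:belgium", ABC + "isInContinent", ABC + "Europe"),
]


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`; return None on mismatch."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # a variable: bind it or check the earlier binding
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:  # a constant must match exactly
            return None
    return binding


def evaluate(patterns, triples):
    """Join the triple patterns one at a time, keeping only consistent bindings."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            for extended in [match_pattern(pattern, triple, binding)]
            if extended is not None
        ]
    return bindings


query = [
    ("?x", ABC + "cityname", "?capital"),
    ("?x", ABC + "isCapitalOf", "?y"),
    ("?y", ABC + "countryname", "?country"),
    ("?y", ABC + "isInContinent", ABC + "Africa"),
]

results = [(b["?capital"], b["?country"]) for b in evaluate(query, triples)]
print(results)
```

The shared variables `?x` and `?y` are what join the patterns; no table schema or foreign key has to be declared for that.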


Built on W3C’s “Linked Data”


Subset of “Graph databases”


Nodes (entities), edges (relationships), properties




Directed, labeled graph structure (predicate URI as label)
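
The graph view can be sketched directly: each triple becomes a directed edge from subject to object, labeled with the predicate. The `ex:` identifiers and the helper names `build_graph` and `reachable` below are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical triples; each one is a directed, labeled edge.
triples = [
    ("ex:London", "ex:country", "ex:United_Kingdom"),
    ("ex:United_Kingdom", "ex:partOf", "ex:Europe"),
    ("ex:Leuven", "ex:country", "ex:Belgium"),
]


def build_graph(triples):
    """Adjacency map: node -> list of (edge label, target node)."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def reachable(graph, start):
    """All nodes reachable from `start` by following edges in their direction."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _label, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


graph = build_graph(triples)
print(reachable(graph, "ex:London"))
```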


Graph View


Image taken from w3.org


The only standardised NoSQL database


In contrast to normal RDBMS:


Very flexible data model


Does not require a fixed table schema


Information stored as its most basic building blocks


Enables improvements in data-intensive operations



Examples: eBay, Facebook, Digg, ...


Scalable: Distributed design


Self-documenting data


Vocabulary identified in OWL or RDFS definitions


Allows multiple schemata


Open


Discover new data sources at run-time


Often weak consistency guarantees


Solved with additional middleware

Limitations of Relational Databases:



Not directly visible to web agents


Primary-foreign key relationships


Meaning is implicit, unspecified semantics


No relationships across separate databases


Parent-child relationships are not natural


“Self-joins” for each level in hierarchy
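
The hierarchy point can be made concrete. With parent-child links stored as triples, all ancestors of a class can be collected by one traversal, whatever the depth, rather than by one self-join per level. The class names below are hypothetical and `superclasses` is a name assumed for this sketch.

```python
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Hypothetical three-level hierarchy expressed as triples.
triples = [
    ("ex:DopamineReceptorSignaling", SUBCLASS_OF, "ex:GProteinCoupledSignaling"),
    ("ex:GProteinCoupledSignaling", SUBCLASS_OF, "ex:SignalTransduction"),
    ("ex:SignalTransduction", SUBCLASS_OF, "ex:BiologicalProcess"),
]


def superclasses(triples, start):
    """Collect every direct and indirect superclass of `start`."""
    parents = {}
    for subject, predicate, obj in triples:
        if predicate == SUBCLASS_OF:
            parents.setdefault(subject, set()).add(obj)

    found, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            if parent not in found:  # also guards against cycles
                found.add(parent)
                frontier.append(parent)
    return found


print(sorted(superclasses(triples, "ex:DopamineReceptorSignaling")))
```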




Advantages for Scientific R&D


Studies continue to show that research in all fields is increasingly collaborative



Example: genomic research


Complex data distributed over many datasets


Entrez Gene (EG), Gene Ontology (GO), Swiss-Prot, GenBank, ...


Problem = Lack of well-defined standards


Integration Nightmare:


data scattered, different formats, lacking information


synonyms, ambiguity


Changing models:


maintenance not feasible


Understanding and reasoning


need for connecting ontologies


Challenge: Syntactic and Semantic heterogeneity


Integration of Databases > Challenges


Localization of resources


Identify relevant web resources


Data formats


Resources are represented in HTML, TXT, images, ...


Synonyms


Researchers can name their own data differently


Ambiguity


E.g. “insulin” can represent a drug, protein, gene, ...


Relations


One-to-one / One-to-many between identifiers


Granularity


Can cause missing data, ...


Integration of Databases > Approaches


Data Warehouse Approach


Translate data into one local database


Eliminates unavailability & slow response


Allows data processing and optimization


Maintenance problem


evolution of content and structure



Examples: BioWarehouse, Biozon, DataFoundry


Federated Database Approach


Translate queries for individual sources


Easier to maintain (e.g. adding a new source)


Poor performance



Examples: BioKleisli, DiscoveryLink, QIS
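
The federated idea, translating one query for each individual source, can be sketched as follows. The two in-memory sources, their differing conventions and all function names are hypothetical. Real systems such as those listed above are far more elaborate.

```python
# Two hypothetical sources with different conventions for naming the same gene.
GENE_SOURCE = {"DRD1": {"entrez_id": 1812}}
PROCESS_SOURCE = [{"gene": "drd1", "process": "adenylate cyclase activation"}]


def translate_for_genes(symbol):
    return symbol.upper()  # this source keys genes by upper-case symbol


def run_on_genes(key):
    return GENE_SOURCE.get(key)


def translate_for_processes(symbol):
    return symbol.lower()  # this source stores lower-case symbols


def run_on_processes(key):
    return [row["process"] for row in PROCESS_SOURCE if row["gene"] == key]


SOURCES = {
    "genes": (translate_for_genes, run_on_genes),
    "processes": (translate_for_processes, run_on_processes),
}


def federated_query(symbol, sources):
    """Translate the query per source, run it there, and merge the answers."""
    merged = {}
    for name, (translate, run) in sources.items():
        merged[name] = run(translate(symbol))
    return merged


result = federated_query("Drd1", SOURCES)
print(result)
```

Adding a source means adding one translator pair, which is the maintenance advantage noted above. Every query still has to visit every source, which hints at the performance cost.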


Semantic Web Approach


No need to map data models


Rely on standardized ontologies



Less work, better performance


But only if sources comply



In Practice


Scientists need:


Access to data



Ability to utilize data



Handle uncertainty



Linked Open Data:


“We all need the same databases, for different decisions or applications”


Complements data in internal/licensed sources


Stimulates cross-scientific sharing


Examples


Biological data: Human Genome Project


Increase in web-accessible databases


GenBank, Gene Ontology, UniProt, PhenoDB, ...


Integration is the key problem



Increase in RDF availability


YeastHub


Registration of web-accessible databases


Metadata according to Dublin Core standards using RSS 1.0 to describe an ontology


Data Conversion


XML or RDB to RDF conversion


(e.g. unique ID = RDF ID, rest of columns are properties)


Data Integration


Ad hoc RDF queries


Form-based queries (supervised)
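
The conversion rule in parentheses (unique ID becomes the RDF ID, remaining columns become properties) can be sketched in a few lines. This is an illustration of the rule only, not YeastHub's actual code: the table, the base URIs and the function name `rows_to_triples` are assumptions.

```python
def rows_to_triples(rows, id_column, base_uri, property_base):
    """Turn relational rows into triples: ID column -> subject, other columns -> properties."""
    triples = []
    for row in rows:
        subject = base_uri + str(row[id_column])
        for column, value in row.items():
            if column == id_column or value is None:
                continue  # the ID is already in the subject; NULLs yield no triple
            triples.append((subject, property_base + column, value))
    return triples


# Hypothetical gene table with one row.
rows = [{"gene_id": 1812, "symbol": "DRD1", "organism": "Homo sapiens"}]

gene_triples = rows_to_triples(
    rows,
    id_column="gene_id",
    base_uri="http://example.org/gene/",
    property_base="http://example.org/schema#",
)
for triple in gene_triples:
    print(triple)
```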

Criticism


Feasibility


Human behavior and personal preferences


‘Database hugging’


Organizations tend to keep data for themselves


Censorship and Privacy





Is published data reusable in research?


Requires:


Provenance information


Quality


Attribution


Consistency


...


Out-of-context data fails to respect scientific research methodology




Discussion


Has anyone ever worked with linked (RDF) data before? What are your experiences?



Will the semantic web grow to become the Giant Global Graph?



Why haven’t RDF databases taken off like Relational Databases?