Scientific RDF Databases
Michael Mertens
K.U.Leuven
Outline
•
Introduction to RDF
•
RDF Databases
•
Advantages for scientific R&D
•
In practice
•
Criticism
Introduction
RDF: Resource Description Framework
•
Originally: metadata data model
•
Now: a general method for the conceptual description of web resources (Semantic Web)
•
Traditional Web in 2009:
•
Sharing documents
•
URL as retrieval mechanism
•
HTML standard format
•
Hypertext links
Image taken from “The Emerging Web of Linked Data”, Chris Bizer
Semantic Web
•
Data on the web
–
HTML describes documents and links between them
–
Semantic web:
•
Publish data in RDF, OWL, XML, ..
•
Describe arbitrary things: people, books, events, ..
•
Link between these concepts
•
Machine-readable, web-accessible databases
•
Tim Berners-Lee: LINKED DATA
•
Connected structured data
•
3 simple principles:
–
URLs for conceptual things
–
Looking up a URL returns useful data about that thing
–
Relationships link to other URLs
Semantic Web > Linked Data
•
Before: scientific data usually not shared
•
Pharmaceutical Drug Discovery
–
A lot of spread out data
•
DrugBank, ClinicalTrials.gov, Health Care and Life Science
–
Genomics data, Protein data, ..
•
A question nobody examined before:
“What proteins are involved in signal transduction AND are related to pyramidal neurons?”
Example taken from “Tim Berners-Lee on the next Web”
•
The web:
223,000 hits, 0 results
•
Linked Data:
32 hits, 32 results
DRD1, 1812 adenylate cyclase activation
ADRB2, 154 adenylate cyclase activation
ADRB2, 154 arrestin mediated desensitization of G-protein coupled…
DRD1IP, 50632 dopamine receptor signaling pathway
DRD1, 1812 dopamine receptor, adenylate cyclase activating pathway
DRD2, 1813 dopamine receptor, adenylate cyclase inhibiting pathway
GRM7, 2917 G-protein coupled receptor protein signaling pathway
GNG3, 2785 G-protein coupled receptor protein signaling pathway
GNG12, 55970 G-protein coupled receptor protein signaling pathway
DRD2, 1813 G-protein coupled receptor protein signaling pathway
ADRB2, 154 G-protein coupled receptor protein signaling pathway
CALM3, 808 G-protein coupled receptor protein signaling pathway
HTR2A, 3356 G-protein coupled receptor protein signaling pathway
DRD1, 1812 G-protein signaling, coupled to cyclic nucleotide second…
SSTR5, 6755 G-protein signaling, coupled to cyclic nucleotide second…
MTNR1A, 4543 G-protein signaling, coupled to cyclic nucleotide…
HTR6, 3362 G-protein signaling, coupled to cyclic nucleotide second…
GRIK2, 2898 glutamate signaling pathway
GRIN1, 2902 glutamate signaling pathway
GRIN2A, 2903 glutamate signaling pathway
GRIN2B, 2904 glutamate signaling pathway
ADAM10, 102 integrin-mediated signaling pathway
GRM7, 2917 negative regulation of adenylate cyclase activity
LRP1, 4035 negative regulation of Wnt receptor signaling pathway
ADAM10, 102 Notch receptor processing
ASCL1, 429 Notch signaling pathway
HTR2A, 3356 serotonin receptor signaling pathway
ADRB2, 154 transmembrane receptor protein tyrosine kinase…
PTPRG, 5793 transmembrane receptor protein tyrosine kinase…
EPHA4, 2043 transmembrane receptor protein tyrosine kinase…
NRTN, 4902 transmembrane receptor protein tyrosine kinase…
CTNND1, 1500 Wnt receptor signaling pathway
PREFIX go: <http://purl.org/obo/owl/GO#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX mesh: <http://purl.org/commons/record/mesh/>
# The sc: and ro: prefixes used below are not declared on the slide.
SELECT ?genename ?processname
WHERE {
  graph <http://purl.org/commons/hcls/pubmesh> {
    ?paper ?p mesh:D017966 .
    ?article sc:identified_by_pmid ?paper .
    ?gene sc:describes_gene_or_gene_product_mentioned_by ?article .
  }
  graph <http://purl.org/commons/hcls/goa> {
    ?protein rdfs:subClassOf ?res .
    ?res owl:onProperty ro:has_function .
    ?res owl:someValuesFrom ?res2 .
    ?res2 owl:onProperty ro:realized_as .
    ?res2 owl:someValuesFrom ?process .
    graph <http://purl.org/commons/hcls/20070416/classrelations> {
      { ?process <http://purl.org/obo/owl/obo#part_of> go:GO_0007166 }
      union
      { ?process rdfs:subClassOf go:GO_0007166 }
    }
    ?protein rdfs:subClassOf ?parent .
    ?parent owl:equivalentClass ?res3 .
    ?res3 owl:hasValue ?gene .
  }
  graph <http://purl.org/commons/hcls/gene> {
    ?gene rdfs:label ?genename
  }
  graph <http://purl.org/commons/hcls/20070416> {
    ?process rdfs:label ?processname
  }
}
Related to Pyramidal Neurons
Part of Signal Transduction
Used 4 sources
•
What do we need?
–
Identifiers: URIs
–
Linking mechanism: HTTP
–
Vocabulary: Web Ontology Language (OWL)
–
Serialization: RDF/XML
•
Identifiers: URIs
–
Use of HTTP URL
–
Link to “Resources”
–
Possibly many documents per resource
–
Shift to non-information resources:
http://dbpedia.org/resource/London
HTML: http://dbpedia.org/page/London
RDF: http://dbpedia.org/data/London.rdf
N-Triples: http://dbpedia.org/data/London.ntriples
•
Linking mechanism: HTTP
–
Accessible through generic data browsers
–
Can be crawled by search engines
–
Connects different sources
–
In contrast, each Web API uses its own interface
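As a rough illustration of HTTP as the linking mechanism, the sketch below builds a request that asks for RDF rather than HTML. It assumes the server supports HTTP content negotiation through the Accept header. The helper name `build_rdf_request` and the `ACCEPT` table are made up for this example, and the request is only constructed, not sent.

```python
from urllib.request import Request

# MIME types for two RDF serializations; which ones a given server honours is an assumption.
ACCEPT = {"rdfxml": "application/rdf+xml", "turtle": "text/turtle"}

def build_rdf_request(resource_uri, fmt="rdfxml"):
    """Build (but do not send) an HTTP GET that asks for RDF instead of HTML."""
    return Request(resource_uri, headers={"Accept": ACCEPT[fmt]})

request = build_rdf_request("http://dbpedia.org/resource/London")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the actual lookup. The same generic code works for any source that publishes Linked Data, which is the contrast with per-service Web APIs.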
•
Vocabulary: Web Ontology Language (OWL)
–
Knowledge representation language
–
Designed to be interpreted by computers
–
Describes data in terms of classes, individuals (instances of classes) and property assertions (relationships)
<owl:Class rdf:ID="Money">
  <rdfs:subClassOf rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
</owl:Class>
<owl:DatatypeProperty rdf:ID="currency">
  <rdfs:domain rdf:resource="#Money"/>
  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
</owl:DatatypeProperty>
–
URIs about the same thing are linked with ‘owl:sameAs’
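One way to picture what ‘owl:sameAs’ gives an integrator: URIs declared equivalent can be grouped so that facts about any of them are treated as facts about one thing. The sketch below is a minimal illustration using a union-find structure. The function name `same_as_groups` and the example.org URI are hypothetical, and real OWL reasoners do considerably more than this.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def same_as_groups(triples):
    """Group URIs that are connected, directly or indirectly, by owl:sameAs."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path compression
            uri = parent[uri]
        return uri

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical statements: the second URI is invented for illustration.
statements = [
    ("http://dbpedia.org/resource/London", SAME_AS, "http://example.org/city/London"),
    ("http://example.org/city/London", SAME_AS, "http://example.org/places/42"),
    ("http://dbpedia.org/resource/London", "http://dbpedia.org/ontology/country",
     "http://dbpedia.org/resource/United_Kingdom"),
]
print(same_as_groups(statements))
```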
•
Based on triples
–
Subject, predicate, object
•
Resources identified by URI
•
URIs can be dereferenced to look up RDF information
•
RDF information links to other URIs
< http://dbpedia.org/resource/London,
http://dbpedia.org/ontology/country,
http://dbpedia.org/resource/United_Kingdom >
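The triple above can be held in memory as a plain tuple. The following sketch shows a tiny triple store with wildcard matching. The Leuven statement and the `match` helper are assumptions added for illustration and are not part of the slides.

```python
# Each statement is a (subject, predicate, object) tuple of URIs.
DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

triples = {
    (DBR + "London", DBO + "country", DBR + "United_Kingdom"),  # from the slide
    (DBR + "Leuven", DBO + "country", DBR + "Belgium"),         # hypothetical extra statement
}

def match(store, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

london_facts = match(triples, s=DBR + "London")
print(london_facts)
```

Because the object of one triple is itself a URI, it can be used as the subject of the next lookup, which is how RDF information links to other URIs.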
This looks a lot like XML..
Why don’t we just use XML??
RDF vs XML
RDF: <Page, author, Name>
XML:
<document href="Page">
  <author>Name</author>
</document>

<document>
  <details>
    <uri>Page</uri>
    <author>Name</author>
  </details>
</document>

<author>
  <uri>Page</uri>
  <name>Name</name>
</author>
...
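The point of the comparison can be made concrete: a consumer of the XML needs separate handling for every layout, while all three carry the same single triple. A minimal sketch, assuming only the three layouts shown above exist; the function name `to_triple` is made up for this example.

```python
import xml.etree.ElementTree as ET

# The three XML layouts from the slide (with straight quotes).
VARIANT_A = '<document href="Page"><author>Name</author></document>'
VARIANT_B = '<document><details><uri>Page</uri><author>Name</author></details></document>'
VARIANT_C = '<author><uri>Page</uri><name>Name</name></author>'

def to_triple(xml_text):
    """Map one of the three known layouts onto the triple (Page, author, Name)."""
    root = ET.fromstring(xml_text)
    if root.tag == "document" and "href" in root.attrib:
        return (root.attrib["href"], "author", root.findtext("author"))
    if root.tag == "document":
        return (root.findtext("details/uri"), "author", root.findtext("details/author"))
    if root.tag == "author":
        return (root.findtext("uri"), "author", root.findtext("name"))
    raise ValueError("unknown layout")

results = {to_triple(x) for x in (VARIANT_A, VARIANT_B, VARIANT_C)}
print(results)
```

Every new XML layout needs another branch, whereas the RDF statement has one shape regardless of who publishes it.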
RDF: Serialization
•
RDF/XML: proposed by W3C
•
N3 or Turtle: human-readable

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>

@prefix dc: <http://purl.org/dc/elements/1.1/>.
<http://en.wikipedia.org/wiki/Tony_Benn>
    dc:title "Tony Benn";
    dc:publisher "Wikipedia".
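Both serializations encode the same two triples. The sketch below extracts them from the RDF/XML form using only the Python standard library. It handles just this flat shape (rdf:Description elements with literal-valued children) and is not a general RDF/XML parser; the function name `flat_rdfxml_to_triples` is an assumption.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>"""

def flat_rdfxml_to_triples(text):
    """Handle only flat rdf:Description elements with literal-valued children."""
    triples = []
    root = ET.fromstring(text)
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for child in desc:
            # ElementTree writes tags as "{namespace}local"; predicate URI = namespace + local
            namespace, local = child.tag[1:].split("}")
            triples.append((subject, namespace + local, child.text))
    return triples

for s, p, o in flat_rdfxml_to_triples(RDF_XML):
    print(f'<{s}> <{p}> "{o}" .')
```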
RDF Databases
•
Also called “Triple Store”
•
Data in the form of triples: subject, predicate, object
•
Dominant query language: SPARQL
PREFIX abc: <nul://sparql/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Africa .
}
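To show what a triple store does with such a query, the sketch below evaluates the same graph pattern by hand over a small in-memory list of triples. The data is invented for illustration, prefixes are dropped for brevity, and a real SPARQL engine would use indexes and a query planner rather than nested scans.

```python
# Hypothetical data using the predicate names from the query above.
triples = [
    ("city1", "cityname", "Nairobi"), ("city1", "isCapitalOf", "c1"),
    ("c1", "countryname", "Kenya"), ("c1", "isInContinent", "Africa"),
    ("city2", "cityname", "Brussels"), ("city2", "isCapitalOf", "c2"),
    ("c2", "countryname", "Belgium"), ("c2", "isInContinent", "Europe"),
]

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]

def capitals_in(continent):
    """Bind ?x and ?y via isCapitalOf, then check the remaining patterns."""
    rows = []
    for (x, p, y) in triples:
        if p != "isCapitalOf":
            continue
        if continent not in objects(y, "isInContinent"):
            continue
        for capital in objects(x, "cityname"):
            for country in objects(y, "countryname"):
                rows.append((capital, country))
    return rows

print(capitals_in("Africa"))
```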
•
Built on W3C’s “Linked Data”
•
Subset of “Graph databases”
•
Nodes (entities), edges (relationships), properties
Directed, labeled graph structure (predicate URI as label)
Graph View
Image taken from w3.org
•
Only standardised NoSQL database
•
In contrast to a conventional RDBMS:
–
Very flexible data model
•
Does not require a fixed table schema
–
Information is stored as its most basic building blocks (triples)
•
Enables improvements in data-intensive operations
•
Examples: eBay, Facebook, Digg, ..
•
Scalable: distributed design
•
Self-documenting data
–
Vocabulary identified in OWL or RDFS definitions
–
Allows multiple schemata
•
Open
–
Discover new data sources at run-time
•
Often weak consistency guarantees
–
Solved with additional middleware
Limitations of Relational Databases:
•
Not directly visible to web agents
•
Primary-foreign key relationships
–
Meaning is implicit, unspecified semantics
•
No relationships across separate databases
•
Parent-child relationships are not natural
–
“Self-joins” for each level in the hierarchy
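The hierarchy limitation is easiest to see next to the graph alternative: with parent-child links stored as edges, one traversal covers any depth, where a relational table needs one self-join per level. A minimal sketch with a made-up taxonomy; the `ancestors` helper and the data are illustrative assumptions.

```python
# Hypothetical class hierarchy as (child, "subClassOf", parent) triples.
edges = [
    ("poodle", "subClassOf", "dog"),
    ("dog", "subClassOf", "mammal"),
    ("mammal", "subClassOf", "animal"),
]

def ancestors(node, triples):
    """Follow subClassOf edges upward through any number of levels."""
    parents = {}
    for child, _, parent in triples:
        parents.setdefault(child, []).append(parent)
    seen, stack = [], [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen

print(ancestors("poodle", edges))
```

The traversal does not need to know the depth of the hierarchy in advance, which is what the per-level self-joins require.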
Advantages for Scientific R&D
•
Studies continue to show that research in all fields is increasingly collaborative
•
Example: genomic research
–
Complex data distributed over many datasets
•
Entrez Gene (EG), Gene Ontology (GO), Swiss-Prot, GenBank, ..
•
Problem = lack of well-defined standards
–
Integration nightmare:
•
data scattered, different formats, lacking information
•
synonyms, ambiguity
–
Changing models:
•
maintenance not feasible
–
Understanding and reasoning
•
need for connecting ontologies
•
Challenge: syntactic and semantic heterogeneity
Integration of Databases > Challenges
•
Localization of resources
–
Identify relevant web resources
•
Data formats
–
Resources are represented in HTML, TXT, images, ..
•
Synonyms
–
Researchers can name their own data differently
•
Ambiguity
–
E.g. “insulin” can represent a drug, protein, gene, ..
•
Relations
–
One-to-one / one-to-many between identifiers
•
Granularity
–
Can cause missing data, ..
Integration of Databases > Approaches
•
Data Warehouse Approach
–
Translate data into one local database
–
Eliminate unavailability & slow response
–
Allow data processing and optimization
–
Maintenance problem
•
evolution of content and structure
–
Examples: BioWarehouse, Biozon, DataFoundry
•
Federated Database Approach
–
Translate queries for individual sources
–
Easier to maintain (e.g. adding a new source)
–
Poor performance
–
Examples: BioKleisli, DiscoveryLink, QIS
•
Semantic Web Approach
–
No need to map data models
–
Rely on standardized ontologies
–
Less work, better performance
–
But only if sources comply
In Practice
•
Scientists need:
–
Access to data
–
Ability to utilize data
–
Ability to handle uncertainty
•
Linked Open Data:
–
“We all need the same databases, for different decisions or applications”
–
Complements data in internal/licensed sources
–
Stimulates cross-scientific sharing
Examples
•
Biological data: Human Genome Project
–
Increase in web-accessible databases
•
GenBank, Gene Ontology, UniProt, PhenoDB, ..
–
Integration is the key problem
–
Increase in RDF availability
•
YeastHub
–
Registration of web-accessible databases
•
Metadata according to Dublin Core standards, using RSS 1.0 to describe an ontology
–
Data conversion
•
XML or RDB to RDF conversion
–
(e.g. unique ID = RDF ID, remaining columns are properties)
–
Data integration
•
Ad hoc RDF queries
•
Form-based queries (supervised)
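The RDB-to-RDF conversion rule (unique ID becomes the RDF ID, the remaining columns become properties) can be sketched in a few lines. This is an illustration of the rule only, not YeastHub's actual code. The namespace, column names and row values are hypothetical.

```python
BASE = "http://example.org/yeast/"  # hypothetical namespace

# Hypothetical rows, as they might come out of a relational table.
rows = [
    {"id": "gene1", "name": "example gene", "organism": "yeast"},
    {"id": "gene2", "name": "another gene", "organism": None},
]

def rows_to_triples(table_rows, id_column, base):
    """Unique ID column -> subject URI; every other non-empty column -> one property."""
    triples = []
    for row in table_rows:
        subject = base + str(row[id_column])
        for column, value in row.items():
            if column == id_column or value is None:
                continue
            triples.append((subject, base + column, value))
    return triples

for triple in rows_to_triples(rows, "id", BASE):
    print(triple)
```

Empty cells simply produce no triple, so the output needs no placeholder for missing values.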
Criticism
•
Feasibility
–
Human behavior and personal preferences
•
‘Database hugging’
–
Organizations tend to keep data for themselves
•
Censorship and privacy
•
Published data reusable in research?
–
Requires:
•
Provenance information
•
Quality
•
Attribution
•
Consistency
•
...
–
Out-of-context data fails to respect scientific research methodology
References
•
Bringing Web 2.0 to bioinformatics. 2008, Zhang Zhang, Kei-Hoi Cheung and Jeffrey P. Townsend
•
Semantic web approach to database integration in life sciences. 2006, Kei-Hoi Cheung, Andrew K. Smith, Kevin Y.L. Yip, Christopher J.O. Baker and Mark B. Gerstein
•
Integrating large biomedical knowledge resources with RDF. 2007, Satya S. Sahoo, Olivier Bodenreider, Kelly Zeng, Amit Sheth
•
RDF/RDFS-based Relational Database Integration. 2006, Huajun Chen, Zhaohui Wu, Heng Wang, Yuxin Mao
Discussion
•
Has anyone ever worked with linked (RDF) data before? What are your experiences?
•
Will the semantic web grow to become the Giant Global Graph?
•
Why haven’t RDF databases taken off like relational databases?