Semantic Web Technologies

Oct 22, 2013

Semantic Web Technologies in Biosciences

Kei Cheung, Ph.D.

Yale Center for Medical Informatics

Outline


Introduction


Past and current Web (Syntactic Web)


Future Web (Semantic Web)


Semantic Web technologies with examples in the biosciences


Data Growth


The Human Genome Project (HGP) created a paradigm shift in biology (experimental -> computational) due to the flood of DNA sequence data it produced.

Since the HGP, other types of high-throughput biotechnologies have emerged and produced vast quantities of data of diverse types (transcript profiling, protein profiling, genotyping, next-generation sequencing, etc.).

An increasing number of bio-data providers have made their data available through the Web.


Problems and Issues


Each database represents a data silo accessed by local applications written in specific languages

The web pages display data but do not expose the structure of the data in a machine-readable format

Different user/query interfaces

No uniform/global data schema

Lack of standard IDs, terminology, vocabulary, data formats, etc.




Available Tools/Approaches


Web search engines (e.g., Google, Yahoo)

One-stop shopping (e.g., NCBI)

Gateway or directory listing (e.g., Neuroscience Database Gateway)

Screen scraping methods to extract data from web content (e.g., Perl scripts)

Kei (Hoi) Cheung (15 years ago)

Kei (Hoi) Cheung (more recent)

Kei (Hui) Cheung: Not me!

I’m NOT a company!

Find the most recent image of the person “Kei Hoi Cheung”

Semantic Web = Brilliant Web!

Knowledge-driven bioscience data integration on the Semantic Web

[Figure: layered diagram. Data layer: KEGG, NeuronDB, PDB, DrugBank, GenBank, CCDB, GeneCards, Gene Expression Omnibus. Knowledge layer: the concepts Cell, Drug, Protein, Gene, Disease, Sequence, Image, Receptor and Pathway, connected by the relations targets, treats, is-a, has-image, has-sequence, encodes, underlies, has-part and is_involved_in. Top layer: knowledge-based applications.]

Semantic Web Stack

Problems with XML


DTDs have limited expressiveness for defining an XML language

XML is designed as a language for message encoding

XML is only self-descriptive about the following structural relationships: containment, adjacency, co-occurrence, attribute and opaque reference

All these relationships are useful for serialization, but are not optimal for modeling the objects of a problem domain

For example, the relationship between the <spot> and <coord_*> tags of AGML is no different from that between <spot> and <dia_*>

A computer algorithm must nevertheless treat them differently to develop meaningful applications: to calculate the distance between two <spot>s, an algorithm must use the values of <coord_*>, but to calculate the area of each <spot>, it must retrieve the values of <dia_*> instead
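The AGML point above can be sketched in a few lines of Python. This is only an illustration: the concrete tag names (coord_x, coord_y, dia_x, dia_y), the sample values and the elliptical-spot assumption are hypothetical stand-ins for <coord_*> and <dia_*>, not taken from the AGML specification. Note that the XML structure treats both tag families identically; the meaning lives entirely in the hand-written functions.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical AGML-like fragment; the tag names are assumptions.
AGML_SNIPPET = """
<gel>
  <spot id="s1"><coord_x>0</coord_x><coord_y>0</coord_y><dia_x>2</dia_x><dia_y>4</dia_y></spot>
  <spot id="s2"><coord_x>3</coord_x><coord_y>4</coord_y><dia_x>2</dia_x><dia_y>2</dia_y></spot>
</gel>
"""


def child_number(spot, tag):
    """Read a numeric child element; XML only tells us 'spot contains tag'."""
    return float(spot.findtext(tag))


def distance(spot_a, spot_b):
    # The programmer, not the XML, knows that coord_* means position.
    dx = child_number(spot_a, "coord_x") - child_number(spot_b, "coord_x")
    dy = child_number(spot_a, "coord_y") - child_number(spot_b, "coord_y")
    return math.hypot(dx, dy)


def area(spot):
    # Likewise, only out-of-band knowledge says dia_* means diameter.
    # Assumes an elliptical spot whose axes are the two diameters.
    return math.pi * (child_number(spot, "dia_x") / 2) * (child_number(spot, "dia_y") / 2)


# Index the spots by their id attribute.
spots = {s.get("id"): s for s in ET.fromstring(AGML_SNIPPET).iter("spot")}
```

Calling `distance(spots["s1"], spots["s2"])` and `area(spots["s1"])` works only because the semantics were hard-coded; a generic XML tool could not have chosen the right children on its own.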

Proliferation of Bio-XML Formats

Sequence: BSML, AGAVE

Microarray gene expression: GEML, MAML, MAGE-ML

Pathway: BIND, SBML, PSI-MI

From XML to RDF: RDF (e.g., BioPAX), semantically rich ontologies, reasoning (machine intelligence)

Semantic Web


The Semantic Web provides a common machine-readable framework that allows data to be shared and reused across application, enterprise, and community boundaries

The Semantic Web is a web of data

The Semantic Web is about two things:

It is about common formats for identification, integration and combination of data drawn from diverse sources

It is also about languages for recording how the data relates to real-world objects

RDF


The foundational Semantic Web technology is the Resource Description Framework (RDF)

RDF is a framework for describing resources

RDF has a very simple yet elegant data model (a directed labeled graph): everything is a resource that connects with other resources via properties

A resource is anything that is identifiable by a uniform resource identifier (URI)

Characteristics of RDF


The graph structure offered by RDF makes it extensible and evolvable: adding nodes and edges to a graph doesn’t change the structure of any existing subgraph.

RDF has an open-world assumption in that it allows anyone to make statements about any resource.

RDF is monotonic in that new statements neither change nor negate the validity of previous assertions. This makes it particularly suitable for an academic environment, in which consensus and disagreement about the same resources coexist usefully and need to be formally recorded.

All RDF terms share a global naming scheme in URIs, making distributed data and ontologies possible.

The combined effect of global naming, a universal data structure and the open-world assumption is that resources exist independently but can be readily linked with little precoordination.
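A minimal sketch of these characteristics, assuming triples are modeled as plain Python tuples of URI strings. The example.org URIs and property names are hypothetical, and blank nodes (which need extra care when merging) are left out. Two independently produced graphs are combined by set union, so nothing already stated is changed, and a disagreement about the same resource is simply recorded side by side.

```python
# Triples are (subject, property, object) tuples of URI strings.
# All URIs below are hypothetical, for illustration only.
EX = "http://example.org/"

lab_a = {
    (EX + "protein/P1", EX + "encodedBy", EX + "gene/G1"),
    (EX + "protein/P1", EX + "locatedIn", EX + "location/nucleus"),
}
lab_b = {
    (EX + "protein/P1", EX + "participatesIn", EX + "process/mitosis"),
    # A differing statement about the same resource simply coexists.
    (EX + "protein/P1", EX + "locatedIn", EX + "location/cytoplasm"),
}


def merge(*graphs):
    """Merge graphs by set union: no existing statement is removed or altered."""
    merged = set()
    for g in graphs:
        merged |= g
    return merged


def about(graph, subject):
    """Return every (property, object) pair stated about one resource."""
    return {(p, o) for s, p, o in graph if s == subject}


combined = merge(lab_a, lab_b)
```

Because both labs use the same URI for the protein, `about(combined, EX + "protein/P1")` returns all four statements without any prior coordination between the two sources.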

Linked Data


Linked Data is about using the Web to connect related data that wasn't previously linked

Wikipedia defines Linked Data as "a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF."

In addition to providers and consumers of linked data, there are link creators who create semantic links between different RDF datasets (e.g., links can be created between protein kinases and drugs)

Linked Data Cloud (linkeddata.org)

RDFS and OWL


RDF Schema (RDFS): it supports classes and class hierarchies

Web Ontology Language (OWL): OWL Lite, OWL DL, OWL Full

RDFS and OWL are layered on top of RDF and add support for axioms and inference, making the Semantic Web capable of supporting knowledge-based querying and inferencing
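As a rough illustration of what inference over a class hierarchy means, the sketch below applies two RDFS entailment rules (transitivity of rdfs:subClassOf, and propagation of rdf:type to superclasses) until nothing new can be derived. The ex: classes and the instance are hypothetical, prefixed names are kept as plain strings, and a real reasoner implements many more rules far more efficiently.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical mini-ontology and one instance, written as prefixed names.
triples = {
    ("ex:Kinase", SUBCLASS_OF, "ex:Enzyme"),
    ("ex:Enzyme", SUBCLASS_OF, "ex:Protein"),
    ("ex:cdc2", RDF_TYPE, "ex:Kinase"),
}


def rdfs_closure(graph):
    """Apply two RDFS entailment rules until no new triple appears."""
    inferred = set(graph)
    while True:
        new = set()
        for s, p, o in inferred:
            if p != SUBCLASS_OF:
                continue
            for s2, p2, o2 in inferred:
                # Rule rdfs11: subClassOf is transitive.
                if p2 == SUBCLASS_OF and s2 == o:
                    new.add((s, SUBCLASS_OF, o2))
                # Rule rdfs9: an instance of a subclass is an instance of the superclass.
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        if new <= inferred:
            return inferred
        inferred |= new


closure = rdfs_closure(triples)
```

After the closure, `("ex:cdc2", "rdf:type", "ex:Protein")` is present even though it was never stated, which is what lets a query for all proteins also return kinases.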

Uniform Resource Identifiers (URIs)

A URI is a string of characters used to identify or name a resource on the Internet.

URLs (Uniform Resource Locators) are a particular type of URI, used for resources that can be accessed on the WWW (e.g., web pages)

In RDF, URIs typically look like “normal” URLs, often with fragment identifiers to point at specific parts of a document:

http://www.semantic-systems-biology.org/SSB#CCO_B0000000

(id for “core cell cycle protein” in Cell Cycle Ontology)
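The structure of such a URI can be inspected with Python's standard library. The sketch below splits the Cell Cycle Ontology URI from the slide into its document part and its fragment identifier; the helper name same_namespace is an illustrative assumption, not part of any RDF toolkit.

```python
from urllib.parse import urldefrag, urlsplit

CCO_TERM = "http://www.semantic-systems-biology.org/SSB#CCO_B0000000"

# urldefrag separates the document part from the fragment identifier.
namespace, local_id = urldefrag(CCO_TERM)


def same_namespace(uri_a, uri_b):
    """True if two hash URIs differ only in their fragment identifier."""
    return urldefrag(uri_a).url == urldefrag(uri_b).url


# urlsplit exposes the usual URL components (scheme, host, path, fragment).
parts = urlsplit(CCO_TERM)
```

Here `namespace` holds the part before `#` and `local_id` holds `CCO_B0000000`, which is how prefixed names such as `ssb:CCO_B0000000` relate to full URIs.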

RDF Triple/Graph


The basic information unit in RDF is an RDF statement (triple) of the form (subject, property, object)

Each RDF statement can be modeled as a graph comprising two nodes connected by a directed arc

A triple example

A set of such triples can jointly form a directed labeled graph (DLG) that can in theory model a significant part of domain knowledge

An RDF graph can be represented in different formats (XML, Turtle, N3, …)
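To make the triple model concrete, here is a small sketch that stores triples as Python tuples and writes them out in N-Triples, a simple line-based RDF syntax (chosen here for brevity; it is not one of the formats named on the slide). The example_protein URI is hypothetical, the Literal helper is an assumption of this sketch, and only backslashes and double quotes are escaped.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A plain string literal; URIs are represented as ordinary str values."""
    value: str


SSB = "http://www.semantic-systems-biology.org/SSB#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# The protein URI is hypothetical; CCO_B0000000 and its label come from the slides.
graph = [
    (SSB + "example_protein", SSB + "is_a", SSB + "CCO_B0000000"),
    (SSB + "CCO_B0000000", RDFS_LABEL, Literal("core cell cycle protein")),
]


def term(t):
    """Format one term: quoted string for literals, <...> for URIs."""
    if isinstance(t, Literal):
        escaped = t.value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{t}>"


def to_ntriples(triples):
    """Serialize each (subject, property, object) as one N-Triples line."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


def outgoing_arcs(triples, node):
    """View the triples as a directed labeled graph: arcs leaving `node`."""
    return [(p, o) for s, p, o in triples if s == node]
```

Each output line is one arc of the directed labeled graph: subject node, arc label, object node.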

Cell Cycle Ontology (CCO)

(Antezana et al., 2009, Genome Biology)

http://genomebiology.com/2009/10/5/R58

Named Graph


RDF graphs are nameable by URIs


This enables RDF statements to be created to describe graphs


This helps establish provenance and trust


Representation formats: TriX and TriG

:G2 { :G1 swp:assertedBy _:w1 .
      _:w1 swp:authority :Erick .
      _:w1 dc:date "2009-05-29"^^xsd:date .
      _:w1 dc:license "Creative Commons Attribution License"^^xsd:string .
      :Erick rdf:type ex:Person .
      :Erick ex:email <mailto:erant@psb.ugent.be> }
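One way to picture named graphs in code is to extend each triple with a fourth element, the graph name, giving a "quad". The sketch below mirrors part of the TriG example; the single statement placed inside :G1 is hypothetical, and prefixed names are kept as plain strings rather than expanded to URIs.

```python
# Quads are (graph_name, subject, property, object) tuples.
quads = [
    (":G1", ":cdc2", "rdf:type", ":Protein"),  # hypothetical content of :G1
    (":G2", ":G1", "swp:assertedBy", "_:w1"),
    (":G2", "_:w1", "swp:authority", ":Erick"),
    (":G2", "_:w1", "dc:date", "2009-05-29"),
]


def graph(quads, name):
    """Return the triples that belong to one named graph."""
    return {(s, p, o) for g, s, p, o in quads if g == name}


def asserted_by(quads, graph_name):
    """Follow graph -> warrant -> authority to find who asserted a graph."""
    authorities = set()
    for _, s, p, warrant in quads:
        if s == graph_name and p == "swp:assertedBy":
            for _, s2, p2, o2 in quads:
                if s2 == warrant and p2 == "swp:authority":
                    authorities.add(o2)
    return authorities
```

Because :G1 is itself a resource with a name, the statements in :G2 can describe it, which is the basis for recording provenance and trust.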

SPARQL


It is a standard query language for RDF

It can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

It contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions

The results of SPARQL queries can be result sets or RDF graphs

RDF Graph Match (SPARQL)

BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb: <http://www.semantic-systems-biology.org/SSB#>

SELECT ?protein_label
WHERE {
  GRAPH <cco_S_pombe> {
    ?protein ssb:is_a ssb:CCO_B0000000 .
    ?protein rdfs:label ?protein_label
  }
}

ssb:CCO_B0000000 = core cell cycle protein
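Graph matching can be sketched without a triplestore. The toy matcher below evaluates a conjunction of triple patterns against an in-memory set of triples, binding variables (terms starting with "?") as it goes. It is a simplified illustration of the idea, not how a SPARQL engine is implemented; the protein identifiers and labels in the data are hypothetical.

```python
def is_var(term):
    """Pattern terms that start with '?' are variables."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; None on failure."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check consistency with an earlier binding.
            if new.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new


def match_bgp(patterns, triples):
    """Evaluate a conjunction of triple patterns (a basic graph pattern)."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for b in bindings:
            for t in triples:
                extended = match_pattern(pattern, t, b)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Hypothetical data in the style of the S. pombe graph.
triples = {
    ("ssb:prot_1", "ssb:is_a", "ssb:CCO_B0000000"),
    ("ssb:prot_1", "rdfs:label", "cdc2"),
    ("ssb:prot_2", "ssb:is_a", "ssb:CCO_B0000001"),
    ("ssb:prot_2", "rdfs:label", "other protein"),
}
query = [
    ("?protein", "ssb:is_a", "ssb:CCO_B0000000"),
    ("?protein", "rdfs:label", "?protein_label"),
]
labels = sorted(b["?protein_label"] for b in match_bgp(query, triples))
```

The shared variable `?protein` is what joins the two patterns: only the label of a resource that is also a core cell cycle protein ends up in `labels`.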

SPARQL (Cont’d)


The following SPARQL query on the A. thaliana graph allows users to infer a putative location for proteins with no documented cellular location. The assumption behind the query is that two proteins that participate in the same interaction are likely to share the same cellular location, in this case the nucleus (CCO_C0000252):

BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb: <http://www.semantic-systems-biology.org/SSB#>

SELECT ?prot_in_the_nucleus ?prot_to_study ?interaction_label
WHERE {
  GRAPH <cco_A_thaliana> {
    ?interaction a ssb:interaction .
    ?interaction rdfs:label ?interaction_label .
    ?prot_A ssb:participates_in ?interaction .
    ?prot_B ssb:participates_in ?interaction .
    ?prot_A rdfs:label ?prot_in_the_nucleus .
    ?prot_B rdfs:label ?prot_to_study .
    ?prot_A ssb:located_in ssb:CCO_C0000252 .
    OPTIONAL {
      ?prot_B ssb:located_in ?location_B .
    }
    FILTER (!bound(?location_B))
  }
}
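The OPTIONAL plus FILTER (!bound(...)) combination is a common SPARQL idiom for "has no such statement". The logic of the query can be sketched in plain Python as follows; the ex: proteins, the interaction and the cytoplasm location are hypothetical, and labels are omitted to keep the sketch short.

```python
NUCLEUS = "ssb:CCO_C0000252"

# Hypothetical data in the style of the A. thaliana graph.
triples = {
    ("ex:int1", "rdf:type", "ssb:interaction"),
    ("ex:protA", "ssb:participates_in", "ex:int1"),
    ("ex:protB", "ssb:participates_in", "ex:int1"),
    ("ex:protC", "ssb:participates_in", "ex:int1"),
    ("ex:protA", "ssb:located_in", NUCLEUS),
    ("ex:protC", "ssb:located_in", "ex:cytoplasm"),
}


def objects(subject, prop):
    return {o for s, p, o in triples if s == subject and p == prop}


def subjects(prop, obj):
    return {s for s, p, o in triples if p == prop and o == obj}


def putative_nuclear_proteins():
    """Pair each nuclear protein with interaction partners lacking any location."""
    results = set()
    for interaction in subjects("rdf:type", "ssb:interaction"):
        participants = subjects("ssb:participates_in", interaction)
        for prot_a in participants:
            if NUCLEUS not in objects(prot_a, "ssb:located_in"):
                continue
            for prot_b in participants:
                # Mirrors OPTIONAL {...} FILTER (!bound(?location_B)):
                # keep prot_b only if it has no located_in statement at all.
                if not objects(prot_b, "ssb:located_in"):
                    results.add((prot_a, prot_b, interaction))
    return results
```

With this data only ex:protB is reported as a candidate: ex:protC already has a documented location, so the filter removes it.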

OWL DL Representation

:Nucleus
  a owl:Class ;
  rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty :part_of ;
      owl:someValuesFrom :Cell
    ] .

Necessary but not sufficient condition: every nucleus is part of some cell, but something that is part of a cell is not necessarily a nucleus

OWL Reasoning


Which proteins participate in “mitosis”?

:Protein
  a owl:Class ;
  rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty :participates_in ;
      owl:someValuesFrom :Mitosis
    ] .

Visualization Application

Semantic Web Rules


Semantic Web Rule Language (SWRL): it combines sublanguages of the OWL Web Ontology Language with those of the Rule Markup Language

It can help increase the expressivity of OWL ontologies by augmenting such ontologies with rules

Rules are often easier to understand than description logic

SWRL Examples


Protein(?p1) ∧ cellularLocation(?p1, Nucleus) → NuclearProtein(?p1)

participatesInteraction(?protein1, ?interaction1) ∧ participatesInteraction(?protein2, ?interaction1) ∧ participatesInteraction(?protein2, ?interaction2) ∧ participatesInteraction(?protein3, ?interaction2) → proteinInteraction(?protein1, ?protein3)
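How such rules derive new facts can be sketched with a naive forward-chaining loop. The facts, the individual names (cdc2, actin, p1, i1, ...) and the tuple encoding are assumptions of this sketch; an actual SWRL engine works on an OWL ontology and is considerably more involved.

```python
# Facts are tuples: (class, x) for class membership and
# (property, x, y) for property assertions. Individuals are hypothetical.
facts = {
    ("Protein", "cdc2"),
    ("cellularLocation", "cdc2", "Nucleus"),
    ("Protein", "actin"),
    ("cellularLocation", "actin", "Cytoplasm"),
    ("participatesInteraction", "p1", "i1"),
    ("participatesInteraction", "p2", "i1"),
    ("participatesInteraction", "p2", "i2"),
    ("participatesInteraction", "p3", "i2"),
}


def nuclear_protein_rule(known):
    """Protein(?p1) and cellularLocation(?p1, Nucleus) imply NuclearProtein(?p1)."""
    return {
        ("NuclearProtein", f[1])
        for f in known
        if f[0] == "Protein" and ("cellularLocation", f[1], "Nucleus") in known
    }


def protein_interaction_rule(known):
    """Chain two interactions through a shared protein (second rule above)."""
    part = {(f[1], f[2]) for f in known if f[0] == "participatesInteraction"}
    derived = set()
    for p1, i1 in part:
        for p2, i1b in part:
            if i1b != i1:
                continue
            for p2b, i2 in part:
                if p2b != p2:
                    continue
                for p3, i2b in part:
                    if i2b == i2:
                        derived.add(("proteinInteraction", p1, p3))
    return derived


def forward_chain(initial, rules):
    """Apply all rules repeatedly until no new fact is derived."""
    known = set(initial)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(known)
        if derived <= known:
            return known
        known |= derived


result = forward_chain(facts, [nuclear_protein_rule, protein_interaction_rule])
```

Note that nothing in the second rule forces its variables to be distinct, so the loop also derives trivial pairs such as proteinInteraction(p1, p1) alongside proteinInteraction(p1, p3).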

Enabling Technologies/Tools


Triplestores (e.g., Virtuoso, Oracle, AllegroGraph, …)

SPARQL endpoints

Ontology editors (e.g., Protégé, SWOOP, OBO-Edit, …)

OWL reasoners (e.g., Pellet, RacerPro, FaCT++, …)


Semantic Web Related Communities


National Center for Biomedical Ontology


OBO Foundry


BioPAX


Semantic Web Activity of the World Wide Web Consortium

Semantic Web for Health Care and Life Sciences Interest Group

BioRDF, COI, LODD, Sci. Discourse, Terminology, Translational Medicine Ontology


Roads to Semantic Web


Provide data in RDF format (data providers): UniProt, Gene Ontology, NCI Metathesaurus

Convert non-RDF data to RDF data (third-party efforts): YeastHub, D2RQ, Triplify

Mix RDF data with non-RDF data: RDFa (e.g., Fuzz Firefox extension), GRDDL
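The second road, converting non-RDF data to RDF, can be illustrated with a small table-to-triples mapping: each row becomes a subject URI and each remaining column becomes a property. This is only a sketch of the general idea under stated assumptions; the example.org namespaces, the column names and the sample rows are hypothetical, and the code does not reproduce how YeastHub, D2RQ or Triplify actually work.

```python
import csv
import io

BASE = "http://example.org/gene/"    # hypothetical namespace for row subjects
VOCAB = "http://example.org/vocab#"  # hypothetical vocabulary for columns

# Hypothetical tabular source, e.g. an export from a relational database.
TABLE = """gene_id,symbol,organism
G001,cdc2,S. pombe
G002,cdc13,S. pombe
"""


def rows_to_triples(text, key_column):
    """Map each row to a subject URI and each remaining cell to one triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + row[key_column]
        for column, value in row.items():
            if column != key_column and value:
                triples.append((subject, VOCAB + column, value))
    return triples


triples = rows_to_triples(TABLE, "gene_id")
```

The key design choice is how row identifiers are turned into URIs: using a shared, stable naming scheme is what later allows these triples to be linked with data from other sources.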

Merge between Web 2.0 and Semantic Web

People (FOAF)

Yahoo! Pipes (Semantic Web Pipes developed at DERI)


Dapper (Semantify Dapper)


MediaWiki (Semantic MediaWiki)


Google Map (Semantic Google Map)


The End