Towards an intelligent framework to quickly find data from distributed heterogeneous biomedical resources.

18 Nov 2013


Despoina Antonakaki, Dasha Zhernakova, Erik Roos, K Joeri van der Velde, Mark Kiestra, Tomasz Adamusiak, Niran Abeygunawardena, Helen Parkinson, Rolf Sijmons, Morris A. Swertz


Biologists' challenges: a web of data

- Find data
  - Many different resources: local, structured (ArrayExpress), free text (PubMed)
  - Type in many search boxes: Google, NCBI/Entrez, EBI/EB-eye, KEGG/DBGET
- Merge and pool data
  - Big Excel file (trying to make headers fit)
- Size of data
  - Working for weeks (map and match)
- Major problem: “Using Microsoft Word as sequence annotation tool”



Informatics challenges: too many silos

- Differences in terminology
  - Need to reach “hidden” structured data: encapsulated in databases, legacy systems
  - Different conceptualizations of information
- Differences in formats and structure
  - Too many formats for specifying and describing biomedical entities; no standard representation model
- Automatic matching and merging
  - Difficult to merge into a single query
  - Working for weeks (map and match)


Query across silos

[Figure: a single query sent across DB1, DB2 and DB3, each with its own format (Format 1, Format 2, Format 3). Scale: local? national? EU? global? Sources shown: LifeLines, GenerationR, TweelingReg, PSI, Celiac Disease.]

Wanted: a ‘meta’ search infrastructure to

- Find me cases
- Find me cohorts/partners

Connecting different ‘biobanks’?

Outline

Three challenges for biologists and the corresponding challenges for informatics:

1. Merge and pool data: differences in formats and structure
2. Find data: differences in terminology
3. Size of data: automatic matching and merging
4. Across data sets: all of the above + distribution

Approaches

1. Integrate data into one ‘pheno’ model (MOLGENIS)
2. Use ontologies (OntoCAT)
3. Indexing (Lucene)
4. Query expansion (Lucene + OntoCAT)

Discussion

1. Federated data queries (MOLGENIS & RDF)



Data warehouse: put it all in one place?

Pheno-OM data model

Flexible: any feature, value, and target combination.

[Figure: Pheno-OM object model. Entities shown: Observed Value (with time), Observation Target, Observable Feature, Panel/cohort/Biobanks, Individual, Protocol, Protocol Application (with time), Observed Relation, Inferred Value (with time). Example: feature "Height", value "179cm", target "Ind1".]

http://wwwdev.ebi.ac.uk/microarray-srv/pheno/doc/objectmodel.html
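The feature/value/target idea can be illustrated with a minimal Python sketch. This is not the Pheno-OM implementation: the class names loosely follow the entity names on the slide, while the field names, the `values_for` helper and the second observation are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObservableFeature:
    name: str  # e.g. "Height"


@dataclass(frozen=True)
class ObservationTarget:
    name: str  # e.g. "Ind1"; could stand for an individual or a panel


@dataclass(frozen=True)
class ObservedValue:
    target: ObservationTarget
    feature: ObservableFeature
    value: str                  # kept as text so any kind of value fits
    time: Optional[str] = None  # hypothetical field: when it was observed


def values_for(observations, feature_name):
    """Return (target name, value) pairs for one feature."""
    return [(o.target.name, o.value)
            for o in observations if o.feature.name == feature_name]


height = ObservableFeature("Height")
ind1 = ObservationTarget("Ind1")
observations = [
    ObservedValue(ind1, height, "179cm"),                       # example from the slide
    ObservedValue(ind1, ObservableFeature("Weight"), "80kg"),   # made-up second row
]
print(values_for(observations, "Height"))  # [('Ind1', '179cm')]
```

Because every observation is just a (target, feature, value) combination, a new feature needs no schema change, which is the flexibility the slide points at.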

An example of Excel data (or BBMRI-NL)


Use ontologies

To overcome different terminologies, two approaches:

1. Use ontologies to annotate the source
   - Of course, this depends on other parties
2. Use ontologies for query expansion (synonyms, part-of relations, subclasses)

[Figure: the query "Deformed ears?" is expanded to "Abnormally shaped ears" before it reaches the Pheno-DB. The index is built with the help of several ontologies with mappings.]

HPO and MP (the same terms are listed under both):

- Abnormally shaped ears
- Auricular malformation
- Deformed auricles
- Deformed ears
- Malformed auricles
- Malformed ears
- Malformed external ears



Complexity in ontologies

- Sometimes they change unpredictably
- Sometimes they suddenly become unavailable
- Searching across different ontologies requires expert knowledge

Some facts

- NCBO BioPortal: 204 ontologies, 29 REST signatures
  - BUT: REST signatures change or break without notice
- OLS: 79 OBO ontologies, 16 web service signatures; stable, open, local
  - BUT: not as rich, rudimentary documentation
- Individual users create their own ontologies
- Integration is hard





OntoCAT hides the complexity (ontocat.org)

[Figure: OntoCAT sits between applications and ontology sources. Labels shown: Ontology Browser, EFO BioPortal Import, OntoAPI, OWL API; BioPortal, EBI OLS, OWL & OBO.]

Generic Ontology Service interface:

- searchOntology()
- getChildren()
- getParents()
- getSynonyms()
- getDefinitions()
- ...



- Implemented in Java 6
- Open source (LGPL v3)
- Simple, easy-to-use API for BioPortal, OLS web services and the OWL API (BioportalOntologyService, OlsOntologyService and FileOntologyService)
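The idea of one generic interface over several ontology back ends can be sketched in Python. OntoCAT itself is a Java library, so everything here is an assumption made for illustration: the snake_case method names only mirror the operations named on the slide, and `InMemoryOntologyService`, the `TOY:` identifiers and the parent label are hypothetical.

```python
from abc import ABC, abstractmethod


class OntologyService(ABC):
    """Hypothetical Python mirror of the operations named on the slide."""

    @abstractmethod
    def search_ontology(self, keyword): ...

    @abstractmethod
    def get_children(self, term_id): ...

    @abstractmethod
    def get_parents(self, term_id): ...

    @abstractmethod
    def get_synonyms(self, term_id): ...


class InMemoryOntologyService(OntologyService):
    """Toy back end; a real one would call a web service or read a file."""

    def __init__(self, terms):
        # terms: {term_id: {"label": str, "synonyms": [str], "parents": [term_id]}}
        self.terms = terms

    def search_ontology(self, keyword):
        kw = keyword.lower()
        return sorted(
            tid for tid, t in self.terms.items()
            if kw in t["label"].lower()
            or any(kw in s.lower() for s in t["synonyms"])
        )

    def get_children(self, term_id):
        return sorted(tid for tid, t in self.terms.items()
                      if term_id in t["parents"])

    def get_parents(self, term_id):
        return list(self.terms[term_id]["parents"])

    def get_synonyms(self, term_id):
        return list(self.terms[term_id]["synonyms"])


service = InMemoryOntologyService({
    "TOY:0": {"label": "Ear abnormality", "synonyms": [], "parents": []},
    "TOY:1": {"label": "Abnormally shaped ears",
              "synonyms": ["Deformed ears", "Malformed ears"],
              "parents": ["TOY:0"]},
})
print(service.search_ontology("deformed"))
print(service.get_synonyms("TOY:1"))
```

Client code written against `OntologyService` does not need to know which back end answers, which is the point of hiding the differences between services.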



Use case diagram of OntoCAT

[Figure: use case of a simplified user interaction with existing ontology resources through OntoCAT. Labels shown: BBMRI ontology, OWL API, HPO, NCBO BioPortal, OLS (EMBL-EBI), OBO files.]

- Web applications can connect using REST or SOAP services
- R connects through the ontoCAT Bioconductor package




Common workflow to integrate ontology resources

OntoCAT example: find the term “membrane” in multiple ontologies


More examples available

1. Updating ontology properties
   - EFO involves constructing mappings to multiple domain-specific ontologies (Disease, Cell Type)
   - Multithreading the OntoCAT requests makes it possible to process and import extra information from over 20,000 external ontology terms in less than 10 minutes
2. Annotating user experimental values with ontology terms
   - ArrayExpress Archive & Gene Expression Atlas: >1 million unique experiment annotations, annotated with EBI's version of EFO
   - Values not yet in EFO have to be checked against publicly available ontologies
   - Previously a manual process, now done with Zooma (local EFO, OWL, local DBs)
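The multithreaded import mentioned in the first example can be pictured with a short Python sketch. It only shows the general pattern of issuing many independent lookups from a thread pool; `fetch_term_info` is a hypothetical stand-in that makes no network call, and the worker count is an arbitrary assumption, not a figure from the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_term_info(term_id):
    """Stand-in for one remote ontology lookup (hypothetical, no network)."""
    return {"id": term_id, "label": "label for " + term_id}


def fetch_all(term_ids, max_workers=8):
    # pool.map keeps the input order, so results line up with term_ids.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_term_info, term_ids))


results = fetch_all(["TOY:%d" % i for i in range(5)])
print([r["id"] for r in results])
```

Threads help here because each lookup spends most of its time waiting on a remote service, so many requests can be in flight at once.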



OntoCAT & Zooma use cases

[Figure: the ArrayExpress Archive and the Gene Expression Atlas hold > 1 million unique experiment annotations. These are annotated with ontology terms from EBI's pre-release version of the application ontology EFO. Not available in EFO? ???]



OntoCAT & Zooma use cases (continued)

3. Local ontology management
   - eXtensible Genotype And Phenotype data platform (XGAP-MOLGENIS): search widget
   - Interactive annotation of data with ontology terms
   - Allows searching publicly available ontologies and downloading terms for unambiguous annotation of QTL or GWAS data








4. Data analysis & annotation
   - New Bioconductor package to read and query OWL/OBO ontologies in R
   - Built-in offline support for EFO and BioPortal ontology queries



OntoCAT characteristics & tools

- Provides synonym and definition lookup across the two major ontology services implemented
- Supports interoperability using RDF
- CompositeOntologyService: a class that combines multiple ontology resources, including different repositories, behind a single entry point
- Cache
- Ranking
- Prioritization
- Fallback mechanism if an ontology resource is unavailable
- Demo on Google App Engine: http://ontocat-web.appspot.com
- OntoCAT browser retrieving OLS: http://gbic.target.rug.nl:8080/ontocatbrowser/molgenis.do?__target=main&select=OntocatBrowser
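One way to read the composite, cache, prioritization and fallback items together is the following Python sketch. It is not OntoCAT's CompositeOntologyService: the class, the `SourceUnavailable` exception and the two toy sources are hypothetical and only illustrate the pattern of querying sources in priority order behind one entry point.

```python
class SourceUnavailable(Exception):
    """Raised by a source that cannot be reached (hypothetical)."""


class CompositeService:
    def __init__(self, sources):
        # sources: list of (name, lookup) in priority order;
        # lookup(term) returns a list of synonyms or raises SourceUnavailable.
        self.sources = sources
        self.cache = {}

    def get_synonyms(self, term):
        key = term.lower()
        if key in self.cache:
            return self.cache[key]
        merged, answered = [], False
        for _name, lookup in self.sources:
            try:
                found = lookup(term)
            except SourceUnavailable:
                continue  # fallback: skip this source, try the next one
            answered = True
            for synonym in found:
                if synonym not in merged:
                    merged.append(synonym)  # earlier sources rank first
        if answered:
            self.cache[key] = merged  # do not cache a total outage
        return merged


def remote_source(term):
    raise SourceUnavailable("service is down")


def local_source(term):
    table = {"abnormally shaped ears": ["Deformed ears", "Malformed ears"]}
    return table.get(term.lower(), [])


composite = CompositeService([("remote", remote_source), ("local", local_source)])
print(composite.get_synonyms("Abnormally shaped ears"))
```

The caller still gets an answer from the local source although the first source is down, and a repeated lookup is served from the cache.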


OntoCAT's applications

- OntoCAT ontology mapping application: http://zooma.sourceforge.net
- OntoCAT Bioconductor/R package: http://bioconductor.org/help/bioc-views/2.7/bioc/html/ontoCAT.html




Indexing: general features

- A data structure that overcomes the barriers of searching in large databases
- Created using DB tables as the basis for search
- Efficient access to ordered records and rapid random lookup
- Needs less disk space for storage (key fields only)

Lucene:

- Open source Java library (known from internet search engines)
- Full-text indexing and searching capability
- Format independent (documents & fields)

Query expansion:

- Related terms (synonyms & children) are appended with the OR operator and assigned a lower weight
- Changes the document ranking, i.e. the order of the retrieved documents
- Even if query expansion does not improve the search, the query becomes more precise
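The expansion rule described above (related terms appended with OR at a lower weight) can be sketched as a small Python helper that builds a query string in Lucene's classic query syntax, where `^` sets a boost. The function name and the boost value of 0.5 are assumptions for illustration; the talk does not state which weights were used.

```python
def quote(term):
    """Wrap a term in double quotes so it is treated as a phrase."""
    return '"%s"' % term.replace('"', '\\"')


def expand_query(query, related_terms, boost=0.5):
    """Append related terms with OR, each carrying a lower boost."""
    parts = [quote(query)]
    seen = {query.lower()}
    for term in related_terms:
        if term.lower() in seen:
            continue  # skip duplicates and the original query itself
        seen.add(term.lower())
        parts.append("%s^%s" % (quote(term), boost))
    return " OR ".join(parts)


expanded = expand_query(
    "Deformed ears",
    ["Abnormally shaped ears", "Malformed ears", "deformed ears"],
)
print(expanded)
# "Deformed ears" OR "Abnormally shaped ears"^0.5 OR "Malformed ears"^0.5
```

Keeping the original phrase at full weight means exact matches still rank first, while documents that only use a synonym are retrieved further down the list.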








Indexing: the approach

- Overcome the barriers of searching in large data sets
- Optimize the in-memory representation, e.g. as a tree

Steps:

1. Create a new index and add documents (fields from the DB, ontology terms from OntoCAT)
2. Analyzer: extracts tokens from the text to be indexed and eliminates the rest
3. Parser: select fields (term/value)
   - Tokenized? Indexed? Case sensitive?
4. Collect results
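The four steps above can be illustrated with a tiny inverted index in plain Python. This is a sketch of the general technique, not of Lucene or of the MOLGENIS plugin: `TinyIndex`, `analyze` and the stop-word list are made-up names, and the analyzer simply lower-cases, splits on non-alphanumeric characters and drops stop words.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "are", "is", "of", "that", "the", "to"}


def analyze(text):
    """Step 2: extract lower-cased tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # (field, token) -> document ids

    def add_document(self, doc_id, fields):
        """Step 1: add a document made of named text fields."""
        for field, text in fields.items():
            for token in analyze(text):
                self.postings[(field, token)].add(doc_id)

    def search(self, field, query):
        """Steps 3 and 4: look up every query token in one field, collect hits."""
        tokens = analyze(query)
        if not tokens:
            return []
        hits = set.intersection(
            *(self.postings.get((field, t), set()) for t in tokens)
        )
        return sorted(hits)


index = TinyIndex()
index.add_document("AAO:0000289", {
    "name": "Meckel's cartilage",
    "def": "Paired, rod-shaped elements that extend the length of the mandible",
})
index.add_document("CHEBI:24431", {
    "name": "molecular structure",
    "def": "A description of the molecular entity or part thereof",
})
print(index.search("def", "the Mandible"))
print(index.search("name", "cartilage"))
```

Because the same analyzer runs at indexing time and at query time, "the Mandible" and "mandible" resolve to the same token, which is why the case-sensitivity and tokenization choices in step 3 matter.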




def: "Paired, cup-shaped cartilage that are dorsal to the septomaxillae and anterior to the oblique cartilage. The anterior, convex face of each alary cartilage is synchondrotically fused to the superior prenasal cartilage and the ventral edge is fused to the superior margin of the crista intermedia." [AAO:LAP]
related_synonym: "alinasal cartilage" []
related_synonym: "cartilago alaris" []
related_synonym: "cartilago alaris nasi" []
related_synonym: "cartilago cupullaris" []

[Term]
id: AAO:0000289
name: Meckel's_cartilage
def: "Paired, rod-shaped elements that extend the length of the mandible and lie between the dentaries and the angulosplenials." [AAO:LAP]
relationship: part_of AAO:0000274 ! lower_jaw_skeleton

[Term]
id: CHEBI:24431
name: molecular structure
def: "A description of the molecular entity or part thereof based on its composition and/or the connectivity between its constituent atoms." []
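Stanzas like the ones above have to be split into fields before they can be indexed. The following Python sketch reads `[Term]` stanzas into dictionaries; it is a simplified assumption of how that could be done, not the parser used in the project, and it ignores most of the OBO format (escapes, other stanza types, header tags). The `parse_obo_terms` name is hypothetical.

```python
def parse_obo_terms(text):
    """Parse [Term] stanzas into dicts mapping each tag to a list of values."""
    terms, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
            continue
        if current is None or not line or ":" not in line:
            continue
        tag, value = line.split(":", 1)
        # Simplification: treat everything after " ! " as a trailing comment.
        value = value.split(" ! ")[0].strip()
        current.setdefault(tag.strip(), []).append(value)
    return terms


sample = """
[Term]
id: AAO:0000289
name: Meckel's_cartilage
relationship: part_of AAO:0000274 ! lower_jaw_skeleton

[Term]
id: CHEBI:24431
name: molecular structure
"""
terms = parse_obo_terms(sample)
for term in terms:
    print(term["id"][0], term["name"][0])
```

Values are stored as lists because tags such as `related_synonym` can occur several times in one stanza.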












Indexing DB: implementation

[Figure: the user enters a search term; the pipeline runs 1. Analyze Query, 2. Parse Index, 3. Collect Results, and then outputs the results. Example field values "Oblique cartilage.", "cartilago cupullaris", "Septomaxillae" and "angulosplenials" are marked "Tokenized??".]



Pheno Warehouse: query expansion

[Figure: the query "Deformed ears?" is expanded before it is sent to the Pheno Warehouse. Expansion terms come from HPO (Abnormally shaped ears, Auricular malformation, Deformed auricles) and MP (Malformed auricles, Malformed ears, Malformed external ears), etc. OntoCAT provides the common API for ontology tasks over local ontologies (OWL or OBO), CWA, BioPortal and OLS. Example: "Deformed ears" is expanded to "Abnormally shaped ears".]

http://www.ontocat.org and http://precedings.nature.com/documents/4666



Query expansion details & ontology selection

[Screenshots: the ontologies used; the expanded query and the results.]

Indexing: implementation (OntoCAT)

Lucene scoring uses a combination of the Vector Space Model (VSM) of information retrieval and the Boolean model to determine how relevant a given document is to a user's query.

[Screenshots: results for the query "lung disease", searched without and with query expansion.]
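The combination of a Boolean match and vector-space ranking described above can be illustrated with a toy scorer. This is not Lucene's scoring formula: it is a plain TF-IDF cosine similarity applied to documents that pass a Boolean OR filter, and the function names and the three sample documents are made up for the example.

```python
import math
from collections import Counter


def tokens(text):
    return text.lower().split()


def score(query, docs):
    """Boolean OR filter first, then rank by cosine similarity of TF-IDF vectors."""
    doc_tokens = {doc_id: tokens(text) for doc_id, text in docs.items()}
    n = len(docs)
    df = Counter(t for toks in doc_tokens.values() for t in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}

    def vector(toks):
        tf = Counter(toks)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = vector(tokens(query))
    ranked = []
    for doc_id, toks in doc_tokens.items():
        if not set(toks) & set(q):
            continue  # Boolean step: at least one query term must occur
        ranked.append((doc_id, cosine(q, vector(toks))))
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "d1": "lung disease",
    "d2": "chronic lung disease in adults",
    "d3": "heart rate",
}
ranking = score("lung disease", docs)
print([doc_id for doc_id, _ in ranking])
```

The Boolean step decides which documents are candidates at all; the vector-space step orders them, so the shorter, more focused document ranks above the longer one.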



Distributed querying in BBMRI

[Figure: the query "Deformed ears?" is sent to several biobanks: Twin Registry, Generation R, LifeLines, BBMRI-SE. OntoCAT provides the common API for ontology tasks (http://www.ontocat.org and http://precedings.nature.com/documents/4666). RDF + OWL?]

Federated data queries (MOLGENIS & RDF)

How can MOLGENIS data be distributed via RDF/SPARQL?

[Figure: as before, "Deformed ears?" is expanded to "Abnormally shaped ears" using the HPO and MP term lists and the ontologies with mappings. Here the query has to reach several databases (DB, DB, DB), with RDF and SPARQL shown as the open question ("?").]

Discussion & next steps: distributed querying?

- How to map a database to RDF such that it helps querying?
- Diversity: put all data into MOLGENIS' pheno model
  (+ quick, works offline; - has to be updated all the time)
- Map to all distributed sources “on the fly” (RDF & SPARQL)
  - Agree on distributed query mechanisms
  (+ always up to date; - slow, breaks if sources go offline)
- Investigate other projects such as Open Data
  - Can MOLGENIS be part of open data?
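As a starting point for the mapping question raised above, the following Python sketch turns (target, feature, value) rows into RDF triples in N-Triples syntax. It is only one possible answer, not a proposal from the talk: the `http://example.org/pheno/` namespace, the predicate names and the `to_ntriples` function are all hypothetical.

```python
from urllib.parse import quote


def to_ntriples(observations, base="http://example.org/pheno/"):
    """Emit three triples per (target, feature, value) row."""
    lines = []
    for i, (target, feature, value) in enumerate(observations):
        obs = "<%sobservation/%d>" % (base, i)
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append("%s <%starget> <%sindividual/%s> ."
                     % (obs, base, base, quote(target)))
        lines.append("%s <%sfeature> <%sfeature/%s> ."
                     % (obs, base, base, quote(feature)))
        lines.append('%s <%svalue> "%s" .' % (obs, base, literal))
    return "\n".join(lines)


triples = to_ntriples([("Ind1", "Height", "179cm")])
print(triples)
```

Once every source exposes its rows in a shared shape like this, a SPARQL query can ask for all observations of one feature regardless of where they are stored; agreeing on the predicates and feature identifiers is the hard part.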




Thank you for your attention.

Questions?





- OntoCAT: http://www.ontocat.org/, http://precedings.nature.com/documents/4666/version/1, http://www.biomedcentral.com/imedia/1627447285460829_article.pdf
- Guide/examples: http://www.ontocat.org/wiki/OntocatGuide
- Available from: http://gbic.target.rug.nl:8080/ontocatbrowser/molgenis.do?__target=main&select=OntocatBrowser
- OntoCAT demo on Google App Engine: http://ontocat-web.appspot.com
- MOLGENIS Lucene index & query expansion app: http://www.molgenis.org/svn/molgenis_projects/molgenis4phenotype/handwritten/java/plugins/LuceneIndex/
- Pheno-OM data model: http://wwwdev.ebi.ac.uk/microarray-srv/pheno/doc/objectmodel.html
- XGAP: http://www.xgap.org