Data Integration in Bioinformatics and Life Sciences

Data Integration in Bioinformatics
and Life Sciences
Erhard Rahm, Toralf Kirsten, Michael Hartung
http://dbs.uni-leipzig.de
http://www.izbi.de
EDBT – Summer School, September 2007
E. Rahm et al.: Data Integration in Bioinformatics and Life Sciences EDBT summer school 2007 2
What is the Problem?
"What protocols were used for tumors in similar locations, for patients in the same age group, with the same genetic background?"
Source: L. Haas, ICDE2006 keynote
DILS workshop series
• International workshop series Data Integration in the Life Sciences (DILS)
  - DILS2004: Leipzig (Interdisciplinary Center for Bioinformatics)
  - DILS2005: San Diego, USA (UCSD Supercomputing Center)
  - DILS2006: Cambridge/Hinxton, UK (EBI)
  - DILS2007: Philadelphia (UPenn)
  - DILS2008: Have you ever been in Paris? ☺
Agenda
• Kinds of data to be integrated
• General data integration alternatives
• Warehouse approaches
• Virtual and mapping-based data integration
• Matching large life science ontologies
• Data quality aspects
• Conclusions and further challenges
Agenda
• Kinds of data to be integrated
  - Experimental data
  - Clinical data
  - Public web data
  - Ontologies
• General data integration alternatives
• Warehouse approaches
• Virtual and mapping-based data integration
• Matching large life science ontologies
• Data quality aspects
• Conclusions and further challenges
Scientific data management process
• Sharing/reuse of data products
• Community-oriented research
Source: Gertz/Ludaescher: SDM Tutorial, EDBT2006
Data integration in life sciences
• Many heterogeneous data sources
  - Experimental data produced by chip-based techniques
    - Genome-wide measurement of gene activity under different conditions (e.g., normal vs. different disease states)
    - Experimental annotations (metadata about experiments)
  - Clinical data
  - Lots of inter-connected web data sources and ontologies
    - Sequence data, annotation data, vocabularies, …
  - Publications (knowledge in text documents)
  - Private vs. public data
• Different kinds of analysis
  - Gene expression analysis
  - Transcription analysis
  - Functional profiling
  - Pathway analysis and reconstruction
  - Text mining, …
[Figure: Affymetrix gene expression microarray]
Expression experiment and analysis
[Figure: steps of an expression experiment and its analysis]
(1) Cell selection
(2) RNA/DNA preparation
(3) Hybridization
(4) Array scan
(5) Image analysis
(6) Data pre-processing
(7) Expression analysis/data mining
(8) Interpretation using annotations
Intermediate products shown: sample, mRNA, labeling, array, array image, array spot intensities, spot intensities for experiment series, gene expression matrix, gene groups (co-regulated, ...)
Experimental data
• High volume of experimental data
  - Various existing chip types for gene expression and mutation analysis
  - Fast growing amount of numeric data values
• Need to pre-process chip data (no standard routines)
  - Different data aggregation levels (e.g. Affy probe vs. probeset expression values)
  - Various statistical approaches, e.g. tests and resampling procedures, …
  - Visualizations, e.g. heatmap, M/A plot, …
• Need for comprehensive, standardized experimental annotations
  - Experimental set-up and procedure (hybridization process, utilized devices, …)
  - Manual specification by the experimenter
  - Often user-dependent use of abbreviations and names / synonyms
  - Recommendation: Minimum Information About a Microarray Experiment (MIAME)*
* Brazma et al.: Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nature Genetics, 29(4): 365-371, 2001
Clinical data: Requirements
• Patient-oriented data
  - Personal data
  - Different types of findings, e.g. general clinical findings (blood pressure, etc.), pathological findings (tissue samples), genetic findings
  - Applied therapies (timing and dosages of drugs, …)
• Clinical studies to evaluate and improve treatment protocols, e.g. against cancer
  - Data acquisition during complex workflows running in different hospitals
  - Special software systems for study management (eResearch Network, Oracle Clinical, ...)
  - New research direction: collect and evaluate patient-specific genetic data (e.g., gene expression data) within clinical studies to investigate molecular-biological causes of diseases and impact of drugs
• Need to integrate experimental and clinical data within distributed study management workflows
• High privacy requirements: protect identity of individual patients
Clinical trials: Inter-organizational workflows
[Figure: data acquisition and analysis in a clinical trial]
• Selection of patients meeting pre-defined inclusion criteria → personal (patient) data
• Periodic doctor or hospital visits (operations, checkups) → general clinical findings
• Tissue extraction, followed by
  - Pathological analysis (microscopy, antibody tests) → pathological findings
  - Genome-location-specific genetic analysis: mutation profiling (banding analysis, FISH) → genetic findings
  - Genome-wide chip-based genetic analysis: mutation profiling (Matrix-CGH), expression profiling (microarray) → chip-based genetic data
Publicly accessible data in web sources
• Genome sources: Ensembl, NCBI Entrez, UCSC Genome, ...
  - Objects: genes, transcripts, proteins etc. of different species
• Object-specific sources
  - Proteins: UniProt (SwissProt, TrEMBL), Protein Data Bank, ...
  - Protein interactions: BIND, MINT, DIP, ...
  - Genes: HUGO (standardized gene symbols for human genome), MGD, ...
  - Pathways: KEGG (metabolic & regulatory pathways), GenMAPP, ...
  - ...
• Publication sources: Medline/PubMed (>16 million entries)
• Ontologies
  - Utilized to describe properties of biological objects
  - Controlled vocabulary of concepts to reduce terminology variations
  - Popular examples: Gene Ontology, Open Biomedical Ontologies (OBO)
Sample web data with cross-references
• Annotation data vs. mapping data
[Figure: sample entry of a web data source]
  - Source-specific ID (accession)
  - Annotations: names, symbols, synonyms, etc.
  - References to other data sources: Enzyme, GeneOntology, OMIM, UniGene, KEGG
• Problem: semantics of mappings (missing mapping type)
  - Gene ↔ gene: orthologous vs. paralogous genes
Highly connected data sources
• Heterogeneity
  - Files and databases
  - Format and schema differences
  - Semantics
• Many, highly connected data sources and ontologies
• Frequent changes
  - Data, schema, APIs
• Incomplete data sources
• Overlapping data sources
  → need to fuse corresponding objects from different sources
• Common (global) database schema ???
Ontologies
• Increasing use of ontologies in bioinformatics and medicine to organize domains, annotate data and support data integration
  - Develop a shared understanding of concepts in a domain
  - Define the terms used
  - Attach these terms to real data (annotation)
  - Provide ability to query data from different sources using a common vocabulary
• Some popular life science ontologies
  - Gene Ontology (http://www.geneontology.org)
    - Species-independent, comprehensive sub-ontologies about Molecular Functions, Biological Processes and Cellular Components
  - UMLS: Unified Medical Language System (http://www.nlm.nih.gov/research/umls/umlsmain.html)
    - Metathesaurus comprising medical subjects and terms of Medical Subject Headings, International Classification of Diseases (ICD), …
OBO – Open Biomedical Ontologies
http://obo.sourceforge.net/main.html
• An umbrella project for grouping different ontologies in biological/medical field
Currently covered aspects:
• Anatomies
• Cell Types
• Sequence Attributes
• Temporal Attributes
• Phenotypes
• Diseases
• ….
Requirements for ontologies in OBO:
- Open, can be used by all without any constraints
- Common shared syntax
- No overlap with other ontologies in OBO
- Share a unique identifier space
- Include text definitions of their terms
Why OBO?
- GO only covers three specific domains
- Other aspects could also be annotated: anatomy, …
- No standardization of ontologies: format, syntax, …
- What ontologies do exist in the biomedical domain?
- Creation takes a lot of work → reuse existing ontologies
Agenda
• Kinds of data to be integrated
• General data integration alternatives
  - Physical vs. virtual integration
  - P2P-like / Peer Data Management Systems (PDMS)
  - Scientific workflows
• Warehouse approaches
• Virtual and mapping-based data integration
• Matching large life science ontologies
• Data quality aspects
• Conclusions and further challenges
Instance integration: Physical vs. virtual
[Figure: two integration architectures side by side]
• Virtual integration (query mediators): Source 1 … Source m … Source n → Wrapper 1 … Wrapper m … Wrapper n → Mediator (with metadata) → Client 1 … Client k
• Physical integration (data warehousing): Operational systems → Import (ETL) → Data warehouse (with metadata) → Data marts → Analysis tools
Peer Data Integration: Typical Scenario
[Figure: peers Gene Ontology, SwissProt, Ensembl, NetAffx and local data; sample queries "Protein annotations for gene X?" and "Check GO annotation for genes of interest?"]
• Bidirectional mappings between data sources instead of global schema
• Queries refer to single source and are propagated to relevant peers
• Adding new sources becomes simpler
• Support for local data sources (e.g. private gene list)
Data integration: Physical vs. virtual
                                 Physical       Virtual:            Virtual:
                                 (Warehouse)    Query mediators     Peer Data Mgmt
Schema integration               A priori       A priori            No schema integration
Instance data integration        A priori       At query runtime    At query runtime
Scalability to many sources      -              o                   o
Analysis of large data volumes   +              -                   -
Achievable data quality          +              o                   o
Data freshness                   o              +                   +
Source autonomy                  o              +                   +
(HW) resource requirements       -              -                   o
Classification of data integration approaches
[Figure: classification matrix of integration systems]
• Type of instance data integration: physical integration, virtual integration, hybrid integration
• Type of schema integration
  - Homogenized/global view: application-specific global schema/ontology; generic representations
  - No global view: mapping-based/P2P; service (app.) integration/workflows
• Systems shown: Annonda; DiscoveryLink; Tambis; Observer; Ensembl, UCSC Genome Browser, ...; ArrayExpress, GX, GEO, SMD, GeWare, ...; EnsMart/BioMart; Columba; IMG, TrialDB; Kleisli; hybrid integration approach in GeWare; LinkDB; DAS; GenMapper; BioMoby/Taverna; Kepler; caBIG/caGrid; BioFuice; SRS
Application-specific vs. generic representation
Application-specific global schema (table Protein):

  Accession        | Name                    | Organism     | ...
  ENSP00000226317  | Cytokine B6 precursor   | Homo Sapiens | ...
  ENSP00000306512  | Interleukin-8 precursor | Homo Sapiens | ...

Generic representation using EAV:

  Metadata, table Entity:
  Entity_ID | Name
  1         | Protein
  2         | ProteinFunctionRel
  3         | Function

  Metadata, table Attribute:
  Attribute_ID | Entity_ID | Name
  1            | 1         | Accession
  2            | 1         | Name
  3            | 1         | Organism
  4            | ...       | ...

  Instance data, table AttributeValue:
  Tuple_ID | Attribute_ID | Value
  1        | 1            | ENSP00000226317
  1        | 2            | Cytokine B6 precursor
  1        | 3            | Homo Sapiens
  2        | 1            | ENSP00000306512
  2        | 2            | ...

Generic representation: flexible and extensible, but hard to query
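The "hard to query" remark can be illustrated with a minimal sketch. It assumes an in-memory SQLite database whose table and column names are simplified from the slide; the helper `eav_rows` is hypothetical and not part of any system named here. Reading one protein record requires two joins plus a pivot step, whereas the application-specific table would return it with a single SELECT.

```python
import sqlite3

# Simplified EAV tables (names adapted from the slide, values as shown there).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (entity_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attribute (attribute_id INTEGER PRIMARY KEY, entity_id INTEGER, name TEXT);
CREATE TABLE attribute_value (tuple_id INTEGER, attribute_id INTEGER, value TEXT);
INSERT INTO entity VALUES (1, 'Protein');
INSERT INTO attribute VALUES (1, 1, 'Accession'), (2, 1, 'Name'), (3, 1, 'Organism');
INSERT INTO attribute_value VALUES
  (1, 1, 'ENSP00000226317'), (1, 2, 'Cytokine B6 precursor'), (1, 3, 'Homo Sapiens'),
  (2, 1, 'ENSP00000306512');
""")

def eav_rows(conn, entity):
    """Pivot EAV triples back into one dict per tuple (hypothetical helper)."""
    rows = {}
    query = """SELECT v.tuple_id, a.name, v.value
               FROM attribute_value v
               JOIN attribute a ON a.attribute_id = v.attribute_id
               JOIN entity e ON e.entity_id = a.entity_id
               WHERE e.name = ?
               ORDER BY v.tuple_id, a.attribute_id"""
    for tuple_id, attr, value in conn.execute(query, (entity,)):
        rows.setdefault(tuple_id, {})[attr] = value
    return [rows[key] for key in sorted(rows)]

proteins = eav_rows(conn, "Protein")
```

The flexibility is visible in the same sketch: adding a new attribute is an INSERT into `attribute`, with no schema change.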
Scientific Workflows
• Integrate data sources at the application (analysis) level
  - Complementary to data-focused integration approaches
  - Reuse of existing applications, services, and (sub-)workflows
  - Issues: semantically rich service registration, service composition (matching), manipulation of result data, monitoring and debugging workflow execution, …
• Example: Promoter Identification Workflow*
* Source: Kepler Project http://www.kepler-project.org/Wiki.jsp?page=WorkflowExamples
Agenda
• Kinds of data to be integrated
• General data integration alternatives
• Warehouse approaches
  - The GeWare platform for microarray data management
  - Architecture; preprocessing and analysis workflows
  - Integrating data from clinical studies
  - Generic annotation management
  - Hybrid integration for expression + annotation analysis
• Virtual and mapping-based data integration
• Matching large life science ontologies
• Data quality aspects
• Conclusions and further challenges
The GeWare system*
• Many platforms for microarray data management: ArrayExpress (EBI), Gene Expression Omnibus (NCBI), Stanford Microarray Database, ...
• GeWare: Genetic Data Warehouse (U Leipzig)
  - Under development since 2003
  - Central data management and analysis platform
  - Data of chip-based experiments (i.e. expression microarrays & Matrix-CGH arrays)
  - Uniform and autonomous specification of experiment annotations
  - Import of clinical data
  - Integration of gene annotations from public sources
  - Various methods for pre-processing, analysis and visualization
  - Coupling with existing tools for powerful and flexible analysis, e.g. R packages, BioConductor
*Rahm, E; Kirsten, T; Lange, J: The GeWare data warehouse platform for the analysis of molecular-biological and clinical data. Journal of Integrative Bioinformatics, 4(1):47, 2007
GeWare Applications
• Two collaborative cancer research studies
  - Molecular Mechanism in Malignant Lymphoma (MMML): http://www.lymphome.de/Projekte/MMML
  - German Glioma Network: http://www.gliomnetzwerk.de/
  - Data from several national clinical, pathological and molecular-genetics centers
  - Experimental and clinical data for hundreds of patients
• Local research groups at the Univ. Leipzig, e.g.
  - Expression analysis of different types of human thyroid nodules
  - Expression analysis of physiological properties of mice
  - Analysis of factors influencing the specific binding of sequences on microarrays
System architecture
[Figure: GeWare system architecture]
• Data sources
  - Expression/mutation data: CEL files & expression/CGH matrices (CSV)
  - Experimental & clinical annotation data: manual user input; daily import from study management system
  - Gene annotations: public data sources (GO, Ensembl, NetAffx) via local copies and SRS
• Data warehouse
  - Staging area, data im-/export, database API, stored procedures
  - Core data warehouse: multidimensional data model including gene expression data, clone copy numbers, experimental & clinical annotations, public data
  - Mapping DB
  - Data mart: expression / CGH matrices, pre-processing results
• Web interface
  - Data pre-processing
  - Data analysis (canned queries, statistics, visualization)
  - Administration
GeWare – System workflows
[Figure: GeWare import and analysis workflows]
• Import workflow: experiment creation / selection; import of raw data; preprocessing (normalization / aggregation); import of pre-processed data; manual experiment annotation
• Analysis workflow: browse / search in annotations; management of analysis objects (gene/clone groups, treatment groups, expression / CGH matrices); internal / integrated analysis (statistics, visualization); external analysis (functional profiling, clustering); export; reporting
Multidimensional Data Management
• Fact tables: expression values for different chip types and many chips
  - Scalability and extensibility
• Dimensions (chips/patients, genes, analysis methods)
• Multidimensional analysis
  - Easy selection, aggregation and comparison of values
  - Basis to support more advanced analysis methods
  - Focused selection and creation of matrices
[Figure: data cube with dimensions analysis methods, experiments (chips) and genes]
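A minimal sketch of selection and aggregation over such a fact table, in plain Python rather than a database. All chip names, gene names, values and the two helper functions are hypothetical; GeWare's actual schema and queries are not shown in the slides.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fact rows: (chip, gene, pre-processing method, expression value)
facts = [
    ("chip1", "geneA", "RMA", 7.0), ("chip1", "geneB", "RMA", 5.0),
    ("chip2", "geneA", "RMA", 9.0), ("chip2", "geneB", "RMA", 5.5),
    ("chip3", "geneA", "RMA", 8.0), ("chip3", "geneB", "RMA", 3.0),
    ("chip1", "geneA", "MAS5", 6.1),
]
# Hypothetical chip dimension: chip -> treatment group
chip_group = {"chip1": "control", "chip2": "tumor", "chip3": "tumor"}

def expression_matrix(facts, chips, genes, method):
    """Focused selection: build a gene x chip matrix for one method."""
    matrix = {gene: {} for gene in genes}
    for chip, gene, m, value in facts:
        if m == method and chip in chips and gene in matrix:
            matrix[gene][chip] = value
    return matrix

def mean_by_group(facts, method, chip_group):
    """Aggregation along the chip dimension: mean value per (gene, treatment group)."""
    buckets = defaultdict(list)
    for chip, gene, m, value in facts:
        if m == method:
            buckets[(gene, chip_group[chip])].append(value)
    return {key: mean(values) for key, values in buckets.items()}

matrix = expression_matrix(facts, {"chip1", "chip2"}, ["geneA"], "RMA")
group_means = mean_by_group(facts, "RMA", chip_group)
```

Because the method is a dimension of its own, the same chips can be compared under different pre-processing methods without duplicating the chip or gene data.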
GeWare – Data Warehouse Model
[Figure: GeWare data warehouse schema; relationship cardinalities (1, *) are shown in the diagram]
• Annotation-related dimensions
  - Experiment, Treatment Group, Chip (sample, array, treatment, …)
  - Gene, Gene Group (GO function, location, pathway, ...)
  - Clone, Clone Group (chromosomal location, …)
• Processing-related dimensions
  - Transformation Method (MAS5, RMA, Li-Wong, …)
  - Analysis Method (clustering, classification, Westfall/Young, ...)
• Facts: expression data, analysis results
  - Data warehouse: Gene Intensity, Clone Intensity
  - Data mart: Expression Matrix, CGH Matrix
Clinical data: integration architecture*
[Figure: integration of clinical and chip-based data]
• Management of clinical studies (eResearch Network)
  - Patient-related findings: clinical findings from clinical centers, pathological findings from pathological centers
  - Validation by data checks
  - Study repository: administration, simple reports, data export
• Management of chip-related data (GeWare)
  - Chip-based genetic data: gene expression data, Matrix-CGH data, lab annotation data (identified by chip ID)
  - Public gene/clone annotations: GO, Ensembl, NetAffx
  - Data warehouse: data analysis & reports, data export
• Periodic transfer from the study repository to the data warehouse
• Mapping table patient IDs ↔ chip IDs, based on a common patient ID
*Kirsten, T; Lange, J; Rahm, E: An integrated platform for analyzing molecular-biological data within clinical studies. Information Integration in Healthcare Application, LNCS 4254, 2006
Analysis example
• Visualizations of expression values using clinical data
[Figure: heatmap of a selected gene expression matrix; axes: chips/patients (Chip 1 to Chip 25) with chip/patient dendrogram, and genes with gene dendrogram]
Annotation management
• Generic approach to specify structure and vocabulary for experimental, clinical and genetic annotations
  - Consistent metadata instead of free text or undocumented abbreviations and naming
• Manual specification of experimental annotations
  - Describing the experimental set-up and procedure: sample modifications, hybridization process, utilized devices, …
• Automatic import of clinical annotations and genetic annotations
• Annotation templates
  - Collections of hierarchically structured annotation categories
  - Permissible annotation values can be restricted to controlled vocabularies
  - MIAME-compliant templates
• Controlled vocabularies: locally developed or external (e.g. NCBI Taxonomy)
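The idea of a template with controlled vocabularies can be sketched as follows. The category paths, vocabulary terms and the `validate` helper are hypothetical examples, not GeWare's template format.

```python
# Hypothetical template: hierarchical category path -> allowed terms (None = free text)
TEMPLATE = {
    ("Sample", "Organism"): {"Homo sapiens", "Mus musculus"},
    ("Sample", "Tissue"): {"thyroid", "lymph node"},
    ("Hybridization", "Protocol"): None,
}

def validate(annotation, template=TEMPLATE):
    """Return a list of problems; an empty list means the annotation fits the template."""
    problems = []
    for path, value in annotation.items():
        if path not in template:
            problems.append(f"unknown category: {'/'.join(path)}")
        elif template[path] is not None and value not in template[path]:
            problems.append(f"value {value!r} not in vocabulary for {'/'.join(path)}")
    for path in template:
        if path not in annotation:
            problems.append(f"missing category: {'/'.join(path)}")
    return problems

good = {
    ("Sample", "Organism"): "Homo sapiens",
    ("Sample", "Tissue"): "thyroid",
    ("Hybridization", "Protocol"): "overnight, 45 C",
}
bad = dict(good)
bad[("Sample", "Organism")] = "human"  # synonym outside the controlled vocabulary
```

Rejecting "human" in favor of the vocabulary term is what keeps later searches over annotations consistent.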
Experiment annotation: implementation (1)
• Template example
  - Easy specification and adaptation
  - Association of available vocabularies
[Figure: screenshot of a template description]
Experiment annotation: implementation (2)
• Template example
  - Automatically generated web GUI
  - Hierarchically ordered categories
[Figure: screenshots of the index page and of a generated page to capture annotation values, using terms of associated vocabularies]
Experiment annotation: application
• Search in experiment annotation: create treatment groups (later reuse in analysis)
[Figure: screenshots: search for relevant chips by specifying queries; save result as group]
Hybrid integration of data sources*
[Figure: hybrid integration combining expression analysis and annotation analysis]
• Expression analysis: identification of relevant genes using experimental data (expression (signal) value, p-value); DWH + analysis tools
• Annotation analysis: identification of relevant genes using annotation data (molecular function, gene location, protein (product), disease); query mediator with SRS, gene annotation and mapping DB
• Gene / clone groups connect both kinds of analysis
*Kirsten, T; Rahm, E: Hybrid integration of molecular-biological annotation data. Proc. 2nd Intl. Workshop DILS, July 2005
Agenda
• Kinds of data to be integrated
• General data integration alternatives
• Warehouse approaches
• Virtual and mapping-based data integration
  - Web-link integration: DBGet/LinkDB
  - GenMapper
  - Distributed Annotation System (DAS)
  - Sequence Retrieval System (SRS)
  - BioFuice
• Matching large life science ontologies
• Data quality aspects
• Conclusions and further challenges
Integration based on available web-links
• Web-link = URL of a source + ID of the object of interest
• Simple integration approach
  - Little integration effort
  - Scalable
  - Navigational analysis: only one object at a time
• DBGET + LinkDB:
  - Collection of web-links between many sources
  - Management of source-specific sets of object IDs and their connecting mappings
  - No explicit mapping types
www.genome.jp/dbget/
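A minimal sketch of the web-link idea: a link is produced by filling an object ID into a per-source URL pattern. The URL templates, source names and IDs below are placeholders on example.org, not the real link syntax of DBGET/LinkDB or of any public source.

```python
from urllib.parse import quote

# Hypothetical URL templates; real sources define their own link syntax.
LINK_TEMPLATES = {
    "source_a": "https://source-a.example.org/entry?id={id}",
    "source_b": "https://source-b.example.org/gene/{id}",
}
# Hypothetical link collection: (source, id) -> linked (source, id) pairs, no mapping type
LINKS = {("source_a", "P12345"): [("source_b", "G:42")]}

def web_link(source, object_id):
    """Web-link = URL of a source + ID of the object of interest."""
    return LINK_TEMPLATES[source].format(id=quote(object_id, safe=""))

def linked_urls(source, object_id):
    """Navigate from one object to the web-links of its associated objects."""
    return [web_link(s, i) for s, i in LINKS.get((source, object_id), [])]
```

The sketch also shows the limitation named on the slide: navigation starts from one object at a time, and the link collection carries no information about what kind of relationship a link expresses.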
GenMapper*
• Generic data model, GAM, to uniformly represent annotation data
  - Flexible w.r.t. heterogeneity, evolution and integration
• Exploits existing mappings between objects/sources
  - Valuable knowledge, available in almost every source, scalable
• High-level operations to support data integration and data access
  - Tailored annotation views for specific analysis needs
[Figure: GenMapper architecture]
  - Data sources: NetAffx, LocusLink, Unigene, GO
  - Data integration: Parse, Import
  - GAM-based annotation management (GAM data model)
  - Data access: Map, Compose, GenerateView, …; e.g. Map(Affx, Unigene), Map(Unigene, GO)
  - Annotation views
  - Application integration
[Figure: GAM data model; relationship cardinalities (1, n) are shown in the diagram]
  - SOURCE (Source Id, Name, Type, Content)
  - OBJECT (Object Id, Source Id, Accession, Text, Number)
  - SOURCE_REL (Src Rel Id, Source1 Id, Source2 Id, Type)
  - OBJECT_REL (Obj Rel Id, Src Rel Id, Object1 Id, Object2 Id, Evidence)
*Do, H.H.; Rahm, E.: Flexible integration of molecular-biological annotation data: The GenMapper approach. Proc. 9th EDBT Conf., 2004
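The four GAM tables and a Compose-style operation can be sketched with SQLite. This is a simplified illustration under assumptions: several columns (Type, Content, Text, Number, Evidence) are omitted, all accessions are placeholders, and the `compose` function is a hypothetical reading of the operation, not GenMapper's implementation.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE source (source_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE object (object_id INTEGER PRIMARY KEY, source_id INTEGER, accession TEXT);
CREATE TABLE source_rel (src_rel_id INTEGER PRIMARY KEY, source1_id INTEGER, source2_id INTEGER);
CREATE TABLE object_rel (src_rel_id INTEGER, object1_id INTEGER, object2_id INTEGER);
INSERT INTO source VALUES (1, 'Affx'), (2, 'Unigene'), (3, 'GO');
INSERT INTO object VALUES (1, 1, 'probe_1'), (2, 1, 'probe_2'),
                          (3, 2, 'cluster_1'), (4, 3, 'term_1'), (5, 3, 'term_2');
INSERT INTO source_rel VALUES (1, 1, 2), (2, 2, 3);
INSERT INTO object_rel VALUES (1, 1, 3), (1, 2, 3), (2, 3, 4), (2, 3, 5);
""")

# Join two stored mappings on the objects of the shared intermediate source.
COMPOSE_SQL = """
SELECT o1.accession, o3.accession
FROM source_rel r1
JOIN source s1 ON s1.source_id = r1.source1_id
JOIN source s2 ON s2.source_id = r1.source2_id
JOIN source_rel r2 ON r2.source1_id = r1.source2_id
JOIN source s3 ON s3.source_id = r2.source2_id
JOIN object_rel a ON a.src_rel_id = r1.src_rel_id
JOIN object_rel b ON b.src_rel_id = r2.src_rel_id AND b.object1_id = a.object2_id
JOIN object o1 ON o1.object_id = a.object1_id
JOIN object o3 ON o3.object_id = b.object2_id
WHERE s1.name = ? AND s2.name = ? AND s3.name = ?
ORDER BY o1.accession, o3.accession
"""

def compose(db, src, via, dst):
    """Derive a src -> dst mapping from the stored src -> via and via -> dst mappings."""
    return db.execute(COMPOSE_SQL, (src, via, dst)).fetchall()

affx_go = compose(db, "Affx", "Unigene", "GO")
```

Since every source and mapping lives in the same four tables, adding a new source needs new rows only, which is the flexibility the generic model is meant to provide.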
GenMapper: Usage scenario
Distributed Annotation System (DAS)
• Integration of distributed data sources with central genome server
  - Genome server: primary source containing reference genome sequence
  - Annotation server: wrapped source of a research group/organization
• Annotations are mapped to a reference genome sequence
  - Only sequence coordinates for each object are necessary (i.e., chr, start, stop, strand)
• Simple and scalable approach
• Recalculation of all annotations when the reference sequence has changed
[Figure: annotation viewer connected to the genome server (genome DB) and to annotation servers 1, 2, ..., n]
www.biodas.org
DAS: Query processing
• Query formulation
  - Select organism and chromosome from reference genome
  - Position-based (range) queries for associated objects
• Query processing
  - Send range query to genome DB and relevant annotation servers
  - Merge retrieved results
• Query result can be viewed on the genome at different detail levels with associated annotations, i.e., objects of different types
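The query processing steps above can be sketched with in-memory stand-ins for the servers. The classes, feature names and coordinates are hypothetical; the real DAS protocol works over HTTP with its own request and response formats, which are not modeled here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    chrom: str
    start: int
    stop: int
    strand: str
    label: str
    server: str

class AnnotationServer:
    """Hypothetical in-memory stand-in for a wrapped annotation source."""
    def __init__(self, name, features):
        self.name = name
        self.features = [Feature(*f, server=name) for f in features]

    def range_query(self, chrom, start, stop):
        # A feature matches if its interval overlaps the requested interval.
        return [f for f in self.features
                if f.chrom == chrom and f.start <= stop and f.stop >= start]

def query_region(servers, chrom, start, stop):
    """Send the same range query to every server and merge the results by position."""
    hits = [f for s in servers for f in s.range_query(chrom, start, stop)]
    return sorted(hits, key=lambda f: (f.start, f.stop, f.server))

servers = [
    AnnotationServer("genes", [("12", 100, 500, "+", "geneA"), ("12", 900, 1200, "-", "geneB")]),
    AnnotationServer("snps", [("12", 450, 450, "+", "snp1"), ("7", 450, 450, "+", "snp2")]),
]
region = query_region(servers, "12", 400, 950)
```

Merging works without any schema matching because all servers share one coordinate system, which is also why every annotation has to be recalculated when the reference sequence changes.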
Sequence Retrieval System (SRS)
• Originally developed for accessing sequence data at EMBL
  - Commercial version by BioWisdom (before: Lion Bioscience)
• Data integration primarily for file data sources, but extended for database access and analysis tools
• Mapping-based integration, no global schema
• Local installation of sources necessary
  - Indexing (queryable attributes) of file-based sources by a proprietary script language
  - Definition of hub-tables (and queryable attributes) in relational sources
• Large wrapper library available for public sources
Source: Lion BioScience
SRS: Query formulation and processing
• Query formulation
  - Source selection
  - Filter specification for queryable attributes
• Query types
  - Keyword search
  - Range search for numeric and date attributes
  - Regular expressions
• Automatic translation to SQL queries for relational sources
• Merge of result sets
  - Intersection
  - Union
SRS: Query formulation and processing cont.
• Explorative analysis
  - Traverse selected objects to objects of another data source
• Automatically generated paths between sources
  - Shortest paths (Dijkstra)
  - No consideration of path/mapping semantics
  - No join, only source graph traversal
• Result
  - Set of associated objects
  - No explicit mapping data (object correspondences) retrieved
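Path generation over a source graph can be sketched with a textbook Dijkstra search. The graph below, including which sources are linked and the uniform edge weights, is a hypothetical example and does not reflect SRS's actual link configuration.

```python
import heapq

# Hypothetical link graph between sources; all edge weights are 1 here.
SOURCE_GRAPH = {
    "NetAffx": {"UniGene": 1},
    "UniGene": {"NetAffx": 1, "Ensembl": 1, "OMIM": 1},
    "Ensembl": {"UniGene": 1, "SwissProt": 1},
    "OMIM": {"UniGene": 1, "SwissProt": 1},
    "SwissProt": {"Ensembl": 1, "OMIM": 1},
    "PDB": {},  # no links in this toy graph
}

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns the list of sources on a cheapest path, or None."""
    queue = [(0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, weight in graph[node].items():
            if neighbor not in seen:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None

path = shortest_path(SOURCE_GRAPH, "NetAffx", "SwissProt")
```

In this toy graph two equally short routes exist (via Ensembl or via OMIM). A shortest-path search picks one without asking what the links mean, which is the "no consideration of path/mapping semantics" limitation from the slide.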
BioFuice*: Design goals
• Utilization of instance-level cross-references (often manually curated, high quality data): instance-level mappings between sources
• Navigational access to many sources
• Support for queries and ad-hoc analysis workflows
• Often no full transparency necessary: users want to know from which sources data comes (data lineage/provenance)
• Support for integrating local (non-public) data
• Support for object matching and fusion (data quality)
  - Creation of new instance mappings
→ Mapping-based data integration
*Kirsten, T; Rahm, E: BioFuice: Mapping-based data integration in bioinformatics. Proc. 3rd DILS, 2006
BioFuice (2)
• BioFuice: Bioinformatics information fusion utilizing instance correspondences and peer mappings
• Basis: iFuice approach*
  - Generic way to information fusion
  - High-level operators
• P2P-like infrastructure
  - Mappings between autonomous data sources (peers), e.g. sets of instance correspondences
  - Simple addition of new sources where they fit best
• Mapping mediator
  - Mapping management and operator execution
  - Downloadable sources are materialized for better performance (hybrid integration)
• Utilization of application-specific semantic domain model
* Rahm, E., et al.: iFuice - Information Fusion utilizing Instance Correspondences and Peer Mappings. Proc. 8th WebDB, Baltimore, June 2005
BioFuice: Data sources
• Physical data source (PDS)
  - Public, private and local data (gene list, …), ontologies
  - Split into logical data sources
• Logical data source (LDS)
  - Refers to one object type and a physical data source, e.g. Gene@Ensembl
  - Contains object instances
• Object instances
  - Set of relevant attributes
  - One id attribute
[Figure: PDS Ensembl with LDS Gene, Sequence Region and Exon; sample object instance of Gene@Ensembl]
  Accession: ENSG00000121380
  Descr.: Apoptosis facilitator Bcl-2-like …
  Sequence region start position: 12115145
  Sequence region stop position: 12255214
  Biotype: protein coding
  Confidence: KNOWN
BioFuice Mappings
• Directed relationships between LDS
• Mappings have a semantic mapping type
  - E.g. OrthologousGenes
• Different kinds of mappings
  - Same mappings vs. association mappings
    - Same: equality relationship
  - ID mappings vs. computed mappings (e.g. query mappings)
  - Materialized mappings (mapping tables) vs. dynamic generation (on the fly)
BioFuice: metadata models
• Used by mediator for mapping/operator execution
• Domain model indicates available object types and relationships
[Figure: source mapping model and the domain model extracted from it; legend: LDS, PDS, mapping, same-mapping]
• Source mapping model
  - PDS: Ensembl, SwissProt, MySequences, NetAffx
  - LDS: Sequence, Sequence Region, Exon, Gene, Protein
  - Mappings: EstDnaBlast.hsa, Ensembl.SRegionExons, Ensembl.ExonGene, Ensembl.GeneProteins, Ensembl.sameNetAffxGenes
• Domain model
  - Object types: Sequence, Sequence Region, Exon, Gene, Protein
  - Relationships: SequenceCoordinates, RegionTouchedExons, GeneOfExon, codedProteins, OrthologousGenes
BioFuice Operators
• Query capabilities + scripting support
• Set-oriented operators
  - Input: set of objects/mappings + parameters / query conditions
  - Output: set of resulting objects
  - Combination of operators within scripts for workflow-like execution
• Selected operators:
  - Single source: queryInstances, searchInstances, …
  - Navigation: traverse, map, compose, …
  - Navigation + aggregation: aggregate, aggregateTraverse, …
  - Generic: diff, union, intersect, …
BioFuice architecture
[Figure: BioFuice architecture; components shown]
• User interface: script editor, keyword search, model-based queries, pre-defined queries, command line interface
• BioFuice Query: query manager, query transformation (query specification, query result)
• RiFuice
• Function library for
  - Setting and retrieval of iFuice objects
  - Execution of iFuice scripts
  - Metadata settings and retrieval
• BioFuice base: iFuice connector (iFuice script and metadata; script result / data transfer); CSV export, FASTA export, XML export
• iFuice core with iFuice core API: mediator interface (request, response), fusion control unit and repository, mapping handler, repository, cache, duplicate detection
• Mapping layer: mappings retrieving data of a single LDS but also interconnecting different LDS (mapping call, mapping result)
• Generic mapping execution services
• Data sources: relational database, XML database, XML file, XML stream, application, web service
BioFuice: Script example
• Scenario
  - Given: set of sequences in local source MySequences
  - Wanted: three classes: unaligned, non-coding and protein-coding sequences
$alignedSeqMR := map ( MySequences, { SeqDnaBlast } );
$unalignedSeqOI := diff ( MySequences, domain ( $alignedSeqMR ) );
$codingSeqMR := compose ( $alignedSeqMR, { Ensembl.SRegionExons } );
$protCodingSeqOI := domain ( $codingSeqMR );
$nonCodingSeqOI := diff ( domain ( $alignedSeqMR ), $protCodingSeqOI );
[Figure: sources MySequences (Sequence) and Ensembl (Sequence Region, Exon), connected by the mappings SeqDnaBlast and Ensembl.SRegionExons. Legend: LDS, PDS, mapping, same-mapping]
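The set logic of the script can be mimicked with ordinary Python sets. This is only a sketch: the sequence, region, and exon identifiers are invented, the two dictionaries are hypothetical stand-ins for the mappings SeqDnaBlast and Ensembl.SRegionExons, and the helper functions approximate the iFuice operators rather than reproduce them.

```python
# Hypothetical stand-ins for the two mappings used in the script.
my_sequences = {"s1", "s2", "s3", "s4"}
seq_dna_blast = {"s1": {"r10"}, "s2": {"r11"}, "s3": {"r12"}}  # sequence -> sequence regions
sregion_exons = {"r10": {"e100"}, "r12": {"e101", "e102"}}      # sequence region -> exons


def map_objects(objects, mapping):
    """Return the correspondences (source, target) found for the given objects."""
    return {(o, t) for o in objects for t in mapping.get(o, ())}


def compose(correspondences, mapping):
    """Follow a second mapping from the targets of existing correspondences."""
    return {(s, t2) for s, t in correspondences for t2 in mapping.get(t, ())}


def domain(correspondences):
    """Return the source objects of a set of correspondences."""
    return {s for s, _ in correspondences}


aligned = map_objects(my_sequences, seq_dna_blast)
unaligned_seqs = my_sequences - domain(aligned)        # diff
coding = compose(aligned, sregion_exons)
prot_coding_seqs = domain(coding)
non_coding_seqs = domain(aligned) - prot_coding_seqs   # diff

print(sorted(unaligned_seqs))    # ['s4']
print(sorted(prot_coding_seqs))  # ['s1', 's3']
print(sorted(non_coding_seqs))   # ['s2']
```

Every input sequence ends up in exactly one of the three classes, which is the property the scenario asks for.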
BioFuice Query Processing
iFuice application: citation analysis*
ƒ Citation analysis is important for evaluating the scientific impact of
publications, venues, researchers, universities, etc.
ƒ What are the most cited papers of journal X or conference Y?
ƒ What is the H-index of author Z ?
ƒ Frequent changes: new publications & new citations
ƒ Idea: Combine publication lists, e.g., from DBLP or PubMed, with
citation counts, e.g., from Google Scholar, CiteSeer, or Scopus
ƒ Warehousing approach, virtual (on-the-fly) or hybrid integration
ƒ Fast approximate results by Online Citation Service (OCS)**
ƒ http://labs.dbs.uni-leipzig.de/ocs
* Rahm, E.; Thor, A.: Citation analysis of database publications. ACM SIGMOD Record, 2005
** Thor, A., Aumueller, D., Rahm, E.: Data integration support for Mashups. Proc. IIWeb 2007
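The two example questions can be answered once a publication list is joined with citation counts. The sketch below assumes both inputs are already available in memory and that titles match exactly; the records, field names, and counts are invented for illustration and do not reflect how OCS works internally.

```python
# Hypothetical publication list, as a bibliography source might provide it.
publications = [
    {"title": "Paper A", "author": "Z", "venue": "X"},
    {"title": "Paper B", "author": "Z", "venue": "X"},
    {"title": "Paper C", "author": "Z", "venue": "Y"},
    {"title": "Paper D", "author": "Z", "venue": "Y"},
    {"title": "Paper E", "author": "Z", "venue": "X"},
    {"title": "Paper F", "author": "W", "venue": "X"},
    {"title": "Paper G", "author": "Z", "venue": "Y"},
]

# Hypothetical citation counts keyed by title, as a citation source might provide them.
citation_counts = {"Paper A": 25, "Paper B": 8, "Paper C": 5,
                   "Paper D": 3, "Paper E": 3, "Paper F": 40}


def h_index(counts):
    """Largest h such that at least h publications have h or more citations each."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h


def author_h_index(author):
    """Join publications with citation counts; missing counts are treated as 0."""
    counts = [citation_counts.get(p["title"], 0)
              for p in publications if p["author"] == author]
    return h_index(counts)


def most_cited(venue, k=2):
    """Return the k most cited titles of a venue."""
    titles = [p["title"] for p in publications if p["venue"] == venue]
    return sorted(titles, key=lambda t: citation_counts.get(t, 0), reverse=True)[:k]


print(author_h_index("Z"))  # 3
print(most_cited("X"))      # ['Paper F', 'Paper A']
```

In practice the join is the hard part, because titles from different sources rarely match exactly; that is where object matching comes in.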
Sample OCS result
caBIG™/caGRID*
ƒ cancer Biomedical Informatics Grid™ (caBIG™)
ƒ Virtual network connecting individuals and organizations to enable the sharing of
data and tools, creating a World Wide Web of cancer research
ƒ Overall goal: Speed the delivery of innovative approaches for the prevention and
treatment of cancer
ƒ Objectives
ƒ Common, widely distributed infrastructure that permits the cancer research
community to focus on innovation
ƒ Service-based integration of applications and data
ƒ Shared, harmonized set of terminology, data elements, and data models that
facilitate information exchange and provide syntactic and semantic interoperability
ƒ Collection of interoperable applications developed to common standards
ƒ Raw published cancer research data is available for mining and integration
*Joel H. Saltz, et al.: caGrid: design and implementation of the core architecture of the cancer biomedical informatics
grid. Bioinformatics, Vol. 22, No. 15, 2006, pp. 1910-1916
Service-based data integration in caGrid
Source: T. Kurc et al.: Panel Discussion, caBIG Annual Meeting 2007
caBIG/caGRID: Data description infrastructure
[Figure: caGrid data description infrastructure, split into syntactic interoperability (GME) and semantic interoperability]
caBIG/caGRID: Base vocabulary - NCI Thesaurus
ƒ About NCI Thesaurus
ƒ Reference terminology for NCI
ƒ About 54,000 concepts in 20 hierarchies
ƒ Broad coverage of the cancer domain, e.g.:
  ƒ Findings and Disorders
  ƒ Anatomy
  ƒ Drugs, Chemicals
  ƒ Administrative Concepts
  ƒ Conceptual Entities/Data Types
ƒ Advantages
ƒ Uniform conceptualization in a domain
ƒ Standardization, interoperability, classification
ƒ Enable reuse of data and information
ƒ Usage in caBIG/caGrid
ƒ Annotation of medical data (images, …)
ƒ Service discovery in grids
ƒ Building of Common Data Elements (CDE) for exchange of medical data
caBIG/caGRID: Building common data elements
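[Figure: building a common data element (CDE)]
ƒ Data Element = Data Element Concept + Value Domain
ƒ Example data element: Person Reported Age Value
ƒ Data Element Concept: Person Reported Age (Object Class: Person; Property: Reported Age)
ƒ Value Domain: Age Value (Numeric; Low Value: 1; High Value: 150)
ƒ Concepts come from the NCI Thesaurus via the Enterprise Vocabulary Services
ƒ Data elements are kept in the caDSR metadata repository
ƒ A data element describes instance data stored in a local database, e.g., the value 33
Source: caDSR & ISO 11179 Training - Jennifer Brush, Dianne Reeves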
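The composition of a data element from a data element concept and a value domain can be written down as three small record types. This is a minimal sketch that only borrows the names and bounds shown on the slide; the class and field names are assumptions and not the caDSR data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElementConcept:
    object_class: str  # e.g. "Person"
    prop: str          # e.g. "Reported Age"


@dataclass(frozen=True)
class ValueDomain:
    name: str
    datatype: str  # e.g. "Numeric"
    low: float
    high: float

    def accepts(self, value) -> bool:
        """Check that a value is numeric (bool excluded) and inside [low, high]."""
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        return is_number and self.low <= value <= self.high


@dataclass(frozen=True)
class DataElement:
    name: str
    concept: DataElementConcept
    value_domain: ValueDomain

    def accepts(self, value) -> bool:
        """A value is valid for the element if its value domain accepts it."""
        return self.value_domain.accepts(value)


person_reported_age = DataElement(
    name="Person Reported Age Value",
    concept=DataElementConcept(object_class="Person", prop="Reported Age"),
    value_domain=ValueDomain(name="Age Value", datatype="Numeric", low=1, high=150),
)

print(person_reported_age.accepts(33))    # True
print(person_reported_age.accepts(200))   # False
print(person_reported_age.accepts("33"))  # False
```

Keeping the value domain separate from the concept means the same "Age Value" domain could be reused by other data elements.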
Agenda
ƒ Kinds of data to be integrated
ƒ General data integration alternatives
ƒ Warehouse approaches
ƒ Virtual and mapping-based data integration
ƒ Matching large life science ontologies
ƒ Motivation
ƒ Match approaches and frameworks (Coma++, Prompt, Sambo)
ƒ Instance-based match approach (DILS07), evaluation results
ƒ Data quality aspects
ƒ Conclusions and further challenges
Motivation
ƒ Increasing number of connected sources and ontologies
ƒ Ontology matching (alignment)
ƒ Goal: Find semantically related concepts
ƒ Output: Set of correspondences (ontology mapping)
ƒ Ideally: + semantic mapping type (equivalence, is-a, part-of, …)
ƒ Use:
ƒ Improved analysis
ƒ Validation (curation) and recommendation of instance associations
ƒ Ontology merge or curation, e.g. to reduce overlap between ontologies
[Figure: instance associations connect the sources Gene (Entrez), Protein (SwissProt), and Protein (Ensembl) with the ontologies Molecular Function (GO), Biological Process (GO), and Genetic Disorders (OMIM); the mappings between the ontologies are unknown (?)]
Automatic Match Techniques*
ƒ Schema-based
  ƒ Element
    ƒ Linguistic: names, descriptions
    ƒ Constraint-based: types, keys
  ƒ Structure
    ƒ Constraint-based: parents, children, leaves
ƒ Instance-based
  ƒ Element
    ƒ Linguistic: IR (word frequencies, key terms)
    ƒ Constraint-based: value pattern and ranges
ƒ Reuse-oriented
  ƒ Element, Structure: dictionaries, thesauri, previous match results
ƒ Combined approaches: hybrid vs. composite
ƒ Many frameworks/prototypes (COMA++, Prompt, FOAM, Clio, …), but
mostly not used in bioinformatics
*Rahm, E., P.A. Bernstein: A Survey of Approaches to Automatic Schema Matching. VLDB Journal 10(4), 2001
Frameworks: PROMPT*
ƒ Framework for ontology alignment and merging
ƒ Plug-in tool for Protégé-2000
ƒ Linguistic matching
ƒ Iterative user feedback and match result manipulation
ƒ Automatic detection of ontology conflicts
ƒ Interactive conflict resolution and automatic conflict resolution based on the
user-preferred ontology
ƒ Merge operation: Create a new ontology or extend one selected
ontology
ƒ Automatic creation of parent- and sub-concept relationships
ƒ Suggestions of similar concepts based on ontology matches
*Noy, N.; Musen, M.: PROMPT – Algorithm and tool for automated ontology merging and alignment.
Proc. Conf. on Artificial Intelligence and Innovative Applications of Artificial Intelligence, 2000.
System Architecture*
[Figure: COMA++ system architecture]
ƒ Graphical User Interface
ƒ Match Customizer: matcher configs, match strategies
ƒ Execution Engine: component identification, matcher execution, similarity combination (driven by the matcher strategy)
ƒ Model Manipulation and Mapping Manipulation
ƒ Repository
ƒ Model Pool (external schemas, ontologies) and Mapping Pool (exported mappings)
ƒ Repository schema:
  ƒ SOURCE (Source Id, Name, Structure, Content)
  ƒ SOURCE_REL (Source Rel Id, Source1 Id, Source2 Id, Type)
  ƒ OBJECT (Object Id, Source Id, Accession, Text, Number)
  ƒ OBJECT_REL (Object Rel Id, Source Rel Id, Object1 Id, Object2 Id, Evidence)
*Do, H.H., E. Rahm: COMA - A System for Flexible Combination of Schema Matching Approaches. VLDB 2002
Aumüller D., H.-H. Do, S. Massmann, E. Rahm: Schema and Ontology Matching with COMA++. SIGMOD 2005
Frameworks: SAMBO*
ƒ System for aligning and merging biomedical ontologies
ƒ Framework to find similar concepts in overlapping ontologies for
alignment and merge tasks
ƒ Import of OWL ontologies
ƒ Support of various match strategies by applying/combining different matchers
and use of auxiliary information
ƒ Linguistic, structure-based, constraint-based, instance-based matchers
ƒ Iterative user feedback for match results
ƒ Result manipulation by description logic reasoner checking for ontology
consistency, cycles, unsatisfiable concepts
*Lambrix, P; Tan, H.: SAMBO – A system for aligning and merging biomedical ontologies.
Journal of Web Semantics, 4(3):196-206 , 2006.
Metadata-based match approaches
ƒ Metadata: Concept names, descriptions, ontology structure, ...
ƒ Match mainly based on syntax and structure
ƒ Limited use of domain knowledge
ƒ Highly similar names with opposite semantics,
e.g., ion vs. anion, organic vs. inorganic
Concept names and their 2-gram similarity (Sim_2-Gram):
ion transporter vs. anion transport: 0.77
ion transporter activity vs. ion transport: 0.66
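A small sketch shows why purely syntactic name matching is risky. It assumes a Dice coefficient over sets of lowercase character n-grams; the slides do not say which n-gram variant produced the values above, so the scores computed here are not expected to reproduce them.

```python
def ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Dice coefficient over the n-gram sets of two strings (an assumed variant)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


pairs = [
    ("ion transporter", "anion transport"),
    ("ion transporter activity", "ion transport"),
    ("organic anion transport", "inorganic anion transport"),
]
for left, right in pairs:
    print(f"{left} vs. {right}: {ngram_similarity(left, right):.2f}")
```

All three pairs score high although "ion" vs. "anion" and "organic" vs. "inorganic" denote different things, which is exactly the limitation named above. Passing `n=3` gives the trigram variant.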
Instance-based match approach*
ƒ Approach
ƒ Use domain-specific knowledge expressed in existing instance associations to
create ontology mappings
ƒ Key idea: "Two concepts are related
if they share a significant number of associated objects"
ƒ Flexible and extensible approach
ƒ Instance associations of pre-selected sources
ƒ Different metrics to determine the instance-based similarity
ƒ Combination of different ontology mappings
* Kirsten, T, Thor, A; Rahm, E.: Instance-based matching of large life science ontologies.
Proc. 4th Intl. Workshop DILS, July 2007
Instance-based matching
[Figure: correspondence creation using shared associated instances]
ƒ Molecular Function (MF) concepts shown:
  ƒ GO:0005215 Transporter activity
  ƒ GO:0008504 Anion transporter activity
  ƒ GO:0008514 Organic anion transporter activity
  ƒ GO:0015075 Ion transporter activity
  ƒ GO:0015103 Inorganic anion transporter activity
ƒ Biological Process (BP) concepts shown:
  ƒ GO:0050875 Cellular process
  ƒ GO:0051234 Establishment of localization
  ƒ GO:0006810 Transport
  ƒ GO:0006811 Ion transport
  ƒ GO:0006820 Anion transport
  ƒ GO:0015711 Organic anion transport
  ƒ GO:0015698 Inorganic anion transport
ƒ Associated instances (proteins):
  ƒ ID: ENSP00000355930; Name: Solute carrier family 22 member 1 isoform a; MF: GO:0015075, ...; BP: GO:0006811, ...; Species: Homo sapiens
  ƒ ID: ENSP00000325240; Name: LIM and SHB domain protein 1; MF: GO:0015075, ...; BP: GO:0006811, ...; Species: Homo sapiens
ƒ Both proteins are associated with GO:0015075 (MF) and GO:0006811 (BP), so these two concepts share instances and a correspondence is created between them
Selected similarity metrics
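ƒ Notation: N_{c_1} and N_{c_2} are the numbers of instances associated with the concepts c_1 ∈ O_1 and c_2 ∈ O_2; N_{c_1 c_2} is the number of instances associated with both
ƒ Baseline similarity Sim_Base
$$Sim_{Base}(c_1, c_2) = \begin{cases} 1 & \text{if } N_{c_1 c_2} > 0 \\ 0 & \text{if } N_{c_1 c_2} = 0 \end{cases}$$
ƒ Dice similarity Sim_Dice
$$Sim_{Dice}(c_1, c_2) = \frac{2 \cdot N_{c_1 c_2}}{N_{c_1} + N_{c_2}}$$
ƒ Minimum similarity Sim_Min
$$Sim_{Min}(c_1, c_2) = \frac{N_{c_1 c_2}}{\min(N_{c_1}, N_{c_2})}$$
ƒ Ordering: $0 \le Sim_{Dice} \le Sim_{Min} \le Sim_{Base} \le 1$
ƒ Example: N_{c_1} = 4, N_{c_2} = 3, N_{c_1 c_2} = 2
ƒ Sim_Base = 1
ƒ Sim_Dice = 2*2/(4+3) = 0.57
ƒ Sim_Min = 2/3 = 0.67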
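The three metrics can be computed directly from the sets of instances associated with two concepts. The following is a sketch with invented protein identifiers, chosen so that the set sizes match the slide example (4 and 3 associated instances, 2 shared).

```python
def instance_similarities(instances_c1, instances_c2):
    """Return the Base, Dice, and Min similarities for two sets of associated instances."""
    n1, n2 = len(instances_c1), len(instances_c2)
    shared = len(instances_c1 & instances_c2)
    if n1 == 0 or n2 == 0:
        # Without instances on one side there is no evidence for a correspondence.
        return {"base": 0.0, "dice": 0.0, "min": 0.0}
    return {
        "base": 1.0 if shared > 0 else 0.0,
        "dice": 2 * shared / (n1 + n2),
        "min": shared / min(n1, n2),
    }


# Hypothetical protein identifiers associated with c1 (in O1) and c2 (in O2).
c1_instances = {"p1", "p2", "p3", "p4"}
c2_instances = {"p3", "p4", "p5"}

sims = instance_similarities(c1_instances, c2_instances)
print({name: round(value, 2) for name, value in sims.items()})
# {'base': 1.0, 'dice': 0.57, 'min': 0.67}
```

Because the minimum of the two set sizes is never larger than their mean, Sim_Min is always at least Sim_Dice, which matches the ordering stated on the slide.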
Evaluation metrics
ƒ Computation of precision & recall needs a perfect mapping
ƒ Laborious for large ontologies
ƒ Might not be well-defined
ƒ Metric Match Ratio to approximate "precision"
ƒ Idea: Measure average number of match counterparts per matched concept
$$MatchRatio_{O_1} = \frac{|Corr_{O_1-O_2}|}{|C_{O_1-Match}|} \ge 1$$
$$CombinedMatchRatio = \frac{2 \cdot |Corr_{O_1-O_2}|}{|C_{O_1-Match}| + |C_{O_2-Match}|} \ge 1$$
ƒ Metric Match Coverage to approximate "recall"
ƒ Idea: Measure fraction of matched concepts
$$MatchCoverage_{O_1} = \frac{|C_{O_1-Match}|}{|C_{O_1}|} \in [0 \ldots 1]$$
$$CombinedInstMatchCoverage = \frac{|C_{O_1-Match}| + |C_{O_2-Match}|}{|C_{O_1-Inst}| + |C_{O_2-Inst}|} \in [0 \ldots 1]$$
ƒ Goal: high Match Coverage with low Match Ratio
Evaluation metrics cont.
ƒ Example:
ƒ |C_O1-Inst| = 1000, |C_O2-Inst| = 1200
ƒ |C_O1-Match| = 800, |C_O2-Match| = 900
ƒ |Corr_O1-O2| = 1000
ƒ InstMatchCoverage_O1 = 800/1000 = 0.80
ƒ InstMatchCoverage_O2 = 900/1200 = 0.75
ƒ MatchRatio_O1 = 1000/800 = 1.25
ƒ MatchRatio_O2 = 1000/900 = 1.11
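The example can be re-derived from the five counts. This sketch takes the counts as plain integers; the function and key names are chosen here for readability and do not come from the original study.

```python
def match_metrics(corr, matched_o1, matched_o2, inst_o1, inst_o2):
    """Compute coverage and ratio metrics from the sizes of the sets involved.

    corr       -- number of correspondences between O1 and O2
    matched_oX -- number of concepts of OX taking part in a correspondence
    inst_oX    -- number of concepts of OX that have associated instances
    """
    return {
        "inst_match_coverage_o1": matched_o1 / inst_o1,
        "inst_match_coverage_o2": matched_o2 / inst_o2,
        "combined_inst_match_coverage": (matched_o1 + matched_o2) / (inst_o1 + inst_o2),
        "match_ratio_o1": corr / matched_o1,
        "match_ratio_o2": corr / matched_o2,
        "combined_match_ratio": 2 * corr / (matched_o1 + matched_o2),
    }


metrics = match_metrics(corr=1000, matched_o1=800, matched_o2=900,
                        inst_o1=1000, inst_o2=1200)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# inst_match_coverage_o1: 0.80
# inst_match_coverage_o2: 0.75
# combined_inst_match_coverage: 0.77
# match_ratio_o1: 1.25
# match_ratio_o2: 1.11
# combined_match_ratio: 1.18
```

The two combined values are not on the slide; they follow from applying the combined formulas to the same counts.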
Match scenario
ƒ Ontologies
ƒ Subontologies of Gene Ontology: molecular functions, biological processes, and cellular
components
ƒ Genetic disorders of OMIM
ƒ Instances: Ensembl proteins of different species, i.e.,
Homo sapiens, Mus musculus, Rattus norvegicus
[Figure: Ensembl proteins of different species are associated with Molecular Function, Biological Process, Cellular Component, and Genetic Disorder concepts]
Ontology overlap between species
[Figure: two Venn diagrams over Homo sapiens, Mus musculus, and Rattus norvegicus]
ƒ Number of associated Molecular Functions (total # functions: 7,514)
  ƒ Per-species totals: 2,530, 2,324, 2,162
  ƒ Associated in all three species: 1,954
  ƒ Associated in exactly two species: 253, 81, 31
  ƒ Associated in exactly one species: 242, 86, 96
ƒ Number of associated Biological Processes (total # processes: 12,555)
  ƒ Per-species totals: 3,018, 2,810, 2,709
  ƒ Associated in all three species: 2,452
  ƒ Associated in exactly two species: 201, 77, 47
  ƒ Associated in exactly one species: 288, 110, 133
Exhaustive match study
ƒ Instance-based matching
ƒ Direct protein associations of human, mouse, rat
ƒ Study of match combinations: Union, intersection
ƒ Utilization of indirect associations
ƒ (Simple) Metadata-based matching
ƒ Utilization of concept names
ƒ Trigram string similarity; different thresholds
ƒ Comparison of instance- and metadata-based match results
Match results: Direct instance associations
ƒ Sim_Base: High Coverage (99%), moderate to high Match Ratios
ƒ Sim_Dice: Very restrictive (Coverage < 20%) but low Match Ratios
ƒ Sim_Min: High Coverage (60%-80%) with high number of covered
concepts but significantly lower Match Ratios than Sim_Base
[Chart: Combined Instance Coverage (0.0 to 1.0) for SimBase, SimMin, SimDice, and SimKappa per ontology pair (MF - BP, MF - CC, BP - CC) and species (Human, Mouse, Rat)]
Match Ratios per ontology (Match Ratios for Homo sapiens)
         MF - BP       MF - CC       BP - CC
         MF     BP     MF     CC     BP     CC
Base     20.4   17.0   7.6    28.6   9.8    46.3
Min      4.4    4.0    2.2    7.8    2.4    8.6
Dice     1.3    1.2    1.0    1.3    1.0    1.3
Kappa    2.0    2.0    1.9    2.7    1.7    2.6
Match results: Metadata-based matching
ƒ Growing Coverage and Match Ratios for lower thresholds
ƒ No correspondences with a similarity ≥ 0.9
ƒ Moderate to low Match Ratios
ƒ Inclusion of false positives for low thresholds, e.g. 0.5
[Chart: Match Coverage per ontology (0.00 to 0.50) for MF - BP, MF - CC, and BP - CC at thresholds 0.5, 0.6, 0.7, and 0.8]
Match Ratios per ontology
Threshold  MF - BP       MF - CC       BP - CC
           MF     BP     MF     CC     BP     CC
0.5        4.4    6.9    2.5    6.3    2.5    3.4
0.6        2.4    2.9    2.7    4.6    1.7    2.0
0.7        1.4    1.4    1.1    1.5    1.4    1.4
0.8        1.1    1.1    1.1    1.2    1.1    1.2
Match results: Match combinations
ƒ Combinations between instance-based (Sim_Min) and metadata-based match
approach
ƒ Union: Increased coverage, higher influence of Sim_Min for increased thresholds of
the metadata-based matcher
ƒ Intersection: Low Match Coverage (<1%) and Match Ratios
ƒ Low overlap between instance- and metadata-based mappings
Match Ratios per ontology (threshold 0.7)
              MF - BP       MF - CC       BP - CC
              MF     BP     MF     CC     BP     CC
Union         4.1    3.7    2.2    6.7    2.4    7.6
Intersection  1.0    1.0    1.0    1.0    1.0    1.3
[Chart: Match Coverage per ontology for combined mappings (0.00 to 1.00) for MF - BP, MF - CC, and BP - CC at thresholds 0.5, 0.6, 0.7, and 0.8; Sim_Min = 1.0, Homo sapiens]
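Combining two mappings by union or intersection is plain set algebra over correspondences. The sketch below uses two tiny invented mappings between MF and BP concept names; it only illustrates why the union raises coverage while the intersection of weakly overlapping mappings becomes very small.

```python
# Hypothetical correspondences (MF concept, BP concept) from two matchers.
instance_based = {
    ("ion transporter activity", "ion transport"),
    ("anion transporter activity", "anion transport"),
    ("transporter activity", "transport"),
}
metadata_based = {
    ("ion transporter activity", "ion transport"),
    ("ion transporter activity", "anion transport"),  # likely false positive
}


def matched_concepts(mapping, side):
    """Concepts of one ontology (side 0 or 1) that take part in the mapping."""
    return {pair[side] for pair in mapping}


def match_ratio(mapping, side):
    """Average number of match counterparts per matched concept of one ontology."""
    matched = matched_concepts(mapping, side)
    return len(mapping) / len(matched) if matched else 0.0


union = instance_based | metadata_based
intersection = instance_based & metadata_based

print(len(union), len(intersection))       # 4 1
print(round(match_ratio(union, 0), 2))     # 1.33
print(round(match_ratio(intersection, 0), 2))  # 1.0
```

The union keeps the false positive of the name-based matcher, so in practice its threshold has to be chosen with care.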
Agenda
ƒ Kinds of data to be integrated
ƒ General data integration alternatives
ƒ Warehouse approaches
ƒ Virtual and mapping-based data integration
ƒ Data quality aspects
ƒ Overview and examples of quality problems
ƒ Object Matching
ƒ Data cleaning frameworks
ƒ Conclusions and further challenges
Overview*
Data quality problems
ƒ Single-source problems
  ƒ Schema level (lack of integrity constraints, poor schema design): uniqueness, referential integrity, ...
  ƒ Instance level (data entry errors): misspellings, redundancy/duplicates, contradictory values, ...
ƒ Multi-source problems
  ƒ Schema level (heterogeneous schema models and design): naming conflicts, structural conflicts, ...
  ƒ Instance level (overlapping, contradicting, and inconsistent data): inconsistent aggregating, inconsistent timing, ...
*Rahm, E; Do, H.-H.: Data cleaning: Problems and current approaches.
IEEE Techn. Bulletin on Data Engineering, 23(4), 2000
Single-source problems
ƒ Example: Protein data
Accession | Entry-Name | Protein-Name | Species | Comment | Sequence
P68511 | 1433F_RAT | 14-3-3 protein eta | Rat | | MGDREQLL...
P11576 | 1433F_RAT | 14-3-3 protein eta | Rattus norvegicus | | mgdreqll...
P0A5B7 | 14KD_MYCTU | 14 kDa antigen, also: 16kDa antigen, HSP16.3 | Mycobacterium tuberculosis | [ENSEMBL: ENSP00007463 ] |
ƒ Problems shown: uniqueness, multiple values, synonyms, case insensitivity, missing values,
encoding of further annotations and links
ƒ Causes
ƒ Schemaless storage, e.g., file-based data storage
ƒ Lack of input/acceptance integrity constraints
ƒ ...
Multi-source problems (selection)
ƒ Multiple experiments on same problem with different results
ƒ Different normalization and analysis methods
ƒ Human interpretation!
ƒ Observations of mobile things, e.g., animals in bordering areas
ƒ Human observations
ƒ Varying annotations (difficult to be objective):
white-brown vs. brown-white, full vs. complete
ƒ Example: Describe and count animal populations in Area 1 and Area 2, then integrate
the observations with object fusion
Observation tables of the two areas:
Nr | Colour | Pattern | ...
1 | white-brown | spotted | ...
2 | beige | complete | ...
3 | white | complete | ...

Nr | Colour | Pattern | ...
1 | snow-white | full | ...
2 | white-brown | spotted | ...
Simple solution strategies
ƒ Uniqueness
ƒ Utilization of global identifiers
ƒ Use identifier mappings to a second source (of the same type and detail level)
ƒ Multiple values/encodings
ƒ Extract atomic values by specific parsers, regular expressions
ƒ Normalization of dependent attributes
ƒ Synonyms: Use available controlled vocabularies/ontologies as much as possible,
e.g., NCBI Taxonomy for species
ƒ Case insensitivity: e.g., transform all values to upper/lower case
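Several of these strategies fit into one small cleaning function. The sketch below works on the protein record with multiple values from the example; the field names, the one-entry synonym table, and the regular expressions are assumptions made for illustration, and a real pipeline would look species up in a controlled vocabulary such as NCBI Taxonomy.

```python
import re

# Hypothetical excerpt of a synonym table for species names.
SPECIES_SYNONYMS = {"rat": "Rattus norvegicus"}

# Matches encoded cross-references such as "[ENSEMBL: ENSP00007463 ]".
XREF_PATTERN = re.compile(r"\[(\w+):\s*(\w+)\s*\]")


def clean_record(record):
    """Return a cleaned copy of a protein record (a dict of strings)."""
    species = record.get("species", "").strip()
    names = re.split(r",\s*(?:also:\s*)?", record.get("protein_name", ""))
    sequence = record.get("sequence", "").strip()
    return {
        "accession": record["accession"],
        # Synonyms: map known variants to the preferred species name.
        "species": SPECIES_SYNONYMS.get(species.lower(), species),
        # Multiple values: split the name field into atomic values.
        "protein_names": [n.strip() for n in names if n.strip()],
        # Encodings: extract (source, identifier) pairs from the comment.
        "xrefs": XREF_PATTERN.findall(record.get("comment", "")),
        # Case: store sequences in upper case; keep missing values explicit.
        "sequence": sequence.upper() if sequence else None,
    }


raw = {
    "accession": "P0A5B7",
    "protein_name": "14 kDa antigen, also: 16kDa antigen, HSP16.3",
    "species": "Mycobacterium tuberculosis",
    "comment": "[ENSEMBL: ENSP00007463 ]",
    "sequence": "",
}
cleaned = clean_record(raw)
print(cleaned["protein_names"])  # ['14 kDa antigen', '16kDa antigen', 'HSP16.3']
print(cleaned["xrefs"])          # [('ENSEMBL', 'ENSP00007463')]
print(cleaned["sequence"])       # None
```

Applied to the two rat records, the same function maps "Rat" to "Rattus norvegicus" and upper-cases "mgdreqll...", which makes the duplicate visible to a later matching step.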
Object matching approaches
ƒ Value-based
  ƒ Single attribute
  ƒ Multiple attributes
    ƒ Unsupervised
      ƒ Aggregation function with threshold
      ƒ User-specified rules: Hernandez et al. (SIGMOD 1995)
      ƒ Clustering: Monge, Elkan (DMKD 1997); McCallum et al. (SIGKDD 2000); Cohen, Richman (SIGKDD 2002)
    ƒ Supervised
      ƒ Decision trees: Verykios et al. (Information Sciences 2000); Tejada et al. (Information Systems 2001)
      ƒ Support vector machine: Bilenko, Mooney (SIGKDD 2003); Minton et al. (2005)
ƒ Context-based
  ƒ Hierarchies: Ananthakrishna et al. (VLDB 2002)
  ƒ Graphs: Bhattacharya, Getoor (DMKD 2004); Dong et al. (SIGMOD 2005)
  ƒ Ontologies
Similarity-based grouping*
ƒ Goal: Detect and group duplicate (very similar) data entries
ƒ Sequential procedure
ƒ Specification of grouping rules: Which similarity functions (also combinations)
for which attributes
ƒ Pairwise grouping: Compute the similarity of data entries and compare them based
on the selected/specified grouping rules
ƒ Grouping of pairs of data entries into cliques based on
ƒ Total number of groups
ƒ Number of data entries in a group
ƒ Disjoint/overlapping groups
ƒ Analysis and evaluation of generated groupings
*Jakoniene, V.; Rundqvist, D.; Lambrix, P.: A method for
similarity-based grouping of biological data. Proc. DILS, 2006
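The sequential procedure can be sketched as a grouping rule, a pairwise step, and a grouping step. Two simplifications are made here and should be read as assumptions: the grouping rule is a token-based Jaccard similarity on a single name attribute, and pairs are merged into connected components (via union-find) rather than cliques, which yields disjoint groups only.

```python
from itertools import combinations


def token_jaccard(a, b):
    """Grouping rule: Jaccard similarity of the lowercase word sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def group_entries(entries, similarity, threshold):
    """Group entries whose pairwise similarity reaches the threshold."""
    parent = list(range(len(entries)))

    def find(i):
        # Follow parent links to the group representative, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise step: link every pair that satisfies the grouping rule.
    for i, j in combinations(range(len(entries)), 2):
        if similarity(entries[i], entries[j]) >= threshold:
            parent[find(i)] = find(j)

    # Grouping step: collect entries by their representative.
    groups = {}
    for i, entry in enumerate(entries):
        groups.setdefault(find(i), []).append(entry)
    return list(groups.values())


names = ["14-3-3 protein eta", "14-3-3 eta protein", "14 kDa antigen", "16 kDa antigen"]
for group in group_entries(names, token_jaccard, threshold=0.5):
    print(group)
# ['14-3-3 protein eta', '14-3-3 eta protein']
# ['14 kDa antigen', '16 kDa antigen']
```

Raising the threshold to 0.6 splits the second group, which illustrates how the rule settings drive the total number of groups and the group sizes.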
Similarity-based grouping: Test cases
ƒ Test: Group selected proteins into classes using
ƒ Annotations, e.g., attributes like product, definition
ƒ Protein sequences
ƒ Associations to GO ontology
ƒ Results
ƒ Best grouping by using GO associations
ƒ Annotation-based: Too many groups
ƒ Sequence alignments: Too specific for grouping
BIO-AJAX*
ƒ Framework for biological data cleaning
ƒ Operators
ƒ MAP: translates the data from one schema to another schema
ƒ VIEW: extracts portions of data for cleaning purposes
ƒ MATCH: detects duplicate or similar records
ƒ MERGE: combines duplicate or similar records into one record
*Herbert, K.G.; Gehani, N.H.; Piel, W.H.; Wang, J.T.-L.; Wu, C.H.: BIO-AJAX: An Extensible Framework
for Biological Data Cleaning. SIGMOD Record 33(2), 2004
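How the four operators chain together can be illustrated with plain Python functions named after them. This is not BIO-AJAX's actual interface: the function signatures, the record layout, and the exact-key matching rule are all assumptions made for the sketch.

```python
def map_schema(records, field_map):
    """MAP: rename fields from a source schema to a target schema."""
    return [{field_map.get(k, k): v for k, v in r.items()} for r in records]


def view(records, predicate):
    """VIEW: extract the portion of the data that should be cleaned."""
    return [r for r in records if predicate(r)]


def match(records, key):
    """MATCH: group records whose key values are equal (a deliberately simple rule)."""
    groups = {}
    for r in records:
        groups.setdefault(key(r), []).append(r)
    return list(groups.values())


def merge(group):
    """MERGE: combine a group into one record, keeping the first non-empty value per field."""
    merged = {}
    for r in group:
        for field, value in r.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged


# Hypothetical source records with abbreviated field names.
raw = [
    {"sp": "Rattus norvegicus", "nm": "14-3-3 protein eta", "seq": ""},
    {"sp": "rattus norvegicus", "nm": "14-3-3 Protein Eta", "seq": "MGDREQLL"},
    {"sp": "Homo sapiens", "nm": "LIM and SHB domain protein 1", "seq": "MSLS"},
]

mapped = map_schema(raw, {"sp": "species", "nm": "name", "seq": "sequence"})
rat_only = view(mapped, lambda r: r["species"].lower() == "rattus norvegicus")
duplicates = match(rat_only, key=lambda r: r["name"].lower())
cleaned = [merge(group) for group in duplicates]
print(cleaned)
# [{'species': 'Rattus norvegicus', 'name': '14-3-3 protein eta', 'sequence': 'MGDREQLL'}]
```

A similarity-based rule could replace the exact key in `match` without changing the rest of the pipeline.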
Further data cleaning frameworks
ƒ Research prototypes
ƒ AJAX (Galhardas et al., VLDB 2001)
ƒ IntelliClean (Lee et al., SIGKDD 2000)
ƒ Potter's Wheel (Raman et al., VLDB 2001)
ƒ Febrl (Christen, Churches, PAKDD 2004)
ƒ TAILOR (Elfeky et al., Data Eng. 2002)
ƒ MOMA (Thor, Rahm, CIDR 2007)
ƒ Commercial solutions
ƒ DataCleanser (EDD), Merge/Purge Library (Sagent/QM Software),
MasterMerge (Pitney Bowes), ...
ƒ MS SQL Server 2005: Data Cleaning Operators (Fuzzy Join/Lookup)
Agenda
ƒ Motivation
ƒ General data integration alternatives
ƒ Warehousing of large biological data collections
ƒ Virtual integration of molecular-biological data
ƒ Data quality aspects
ƒ Matching large life science ontologies
ƒ Conclusions and further challenges
Overall conclusions
ƒ Diverse data characteristics
ƒ Large amounts of experimental data produced by different chip technologies
ƒ Integration / management of clinical data
ƒ Huge amount of inter-connected web sources
ƒ High amount of text data
ƒ Comprehensive standardization efforts needed: object ids / formats,
preprocessing routines of chip data, shared vocabularies / ontologies
ƒ Need to support explorative workflows across different sources
ƒ Different data integration architectures needed
ƒ Data Warehousing
ƒ Virtual and mapping-based integration approaches
ƒ Combinations
Overall conclusions cont.
ƒ Warehousing for integration of large collections of biological data
ƒ Ideal for analysis / data mining on huge data sets, e.g. experimental chip data
ƒ Comprehensive data preprocessing
ƒ Support for consistent annotations needed
ƒ Integration of external data for enhanced analysis
ƒ Mapping-based data integration (e.g., BioFuice)
ƒ Utilization of instance-level mappings to traverse between sources and fuse
objects
ƒ Set-oriented navigation + structured queries + keyword search
ƒ Programmability / workflow orientation
ƒ Ontology matching
ƒ Metadata vs. instance-based matching, combined approach
ƒ Key problem: validation of mappings by domain experts
ƒ More research needed
Future challenges
ƒ Clinical data management: many organizational issues, data privacy
ƒ Bridging different work styles and research goals: computer scientists
vs. biologists vs. clinicians
ƒ Make data integration easier and faster, e.g. by a mashup-like
paradigm
ƒ Enable biologist/users to extract, clean, integrate and analyze data themselves
ƒ Make it easier to develop and use data-driven workflows
ƒ Annotation and ontology management
ƒ Creation, evolution, matching, merging of ontologies
ƒ Utilization of generic and domain-specific approaches
ƒ Data quality: object matching and fusion, provenance, …
ƒ Data integration in new application fields, e.g. systems biology
ƒ e.g., management of metabolic and regulatory pathways and protein-protein
interaction networks
ƒ Combination of data of wet-lab experiments with cell-based simulation (in silico
experiments)
Literature: Surveys, Overviews
ƒ T. Hernandez, S. Kambhampati: Integration of biological sources: current systems
and challenges ahead. SIGMOD record, 33(3):51-60, 2004.
ƒ Z. Lacroix: Biological data integration: wrapping data and tools. IEEE Trans.
Information Technology in Biomedicine, 6(2), 2002.
ƒ Z. Lacroix, T. Critchlow (eds.): Bioinformatics – Managing scientific data. Morgan
Kaufmann Publishers, 2003.
ƒ B. Louie, P. Mork, F. Martin-Sanchez, A. Halevy, P. Tarczy-Hornoch: Data
integration and genomic medicine. Journal of Biomedical Informatics, 40:5-16, 2007.
ƒ L. Stein: Integrating biological databases. Nature Reviews Genetics, 4(5):337-345,
2003.
ƒ H.-H. Do, T. Kirsten, E. Rahm: Comparative evaluation of microarray-based gene
expression databases. Proc. 10th BTW Conf., 2003.
ƒ M.Y. Galperin: The molecular biology database collection: 2006 update. Nucleic
Acids Research, 34 (Database Issue):D3-D5, 2006.
Literature: Warehousing of biological data
ƒ A. Brazma et al.: Minimum information about a microarray experiment (MIAME) –
toward standards for microarray data. Nature Genetics, 29(4):365-371, 2001.
ƒ A. Kasprzyk, D. Keefe, D. Smedley et al.: EnsMart: A generic system for fast and
flexible access to biological data. Genome Research, 14(1):160-169, 2004.
ƒ T. Kirsten, J. Lange, E. Rahm: An integrated platform for analyzing
molecular-biological data within clinical studies. Proc. Intl. EDBT Workshop on Information
Integration in Healthcare Applications, 2006.
ƒ V.M. Markowitz et al.: The Integrated Microbial Genomes (IMG) System: A Case
Study in Biological Data Management. Proc. VLDB, 2005.
ƒ R. Nagarajan, M. Ahmed, A. Phatak: Database challenges in the integration of
biomedical data sets. Proc. 30th VLDB Conf., 2004.
ƒ E. Rahm, T. Kirsten, J. Lange: The GeWare data warehouse platform for the analysis
of molecular-biological and clinical data. Journal of Integrative Bioinformatics, 4(1):47,
2007.
ƒ K. Rother, H. Müller, S. Trissl et al.: Columba: Multidimensional data integration of
protein annotations. Proc. 1st DILS Workshop, 2004.
Literature: Virtual & mapping-based integration
ƒ H.-H. Do, E. Rahm: Flexible integration of molecular-biological annotation data: The GenMapper
approach. Proc. EDBT Conf., 2004.
ƒ T. Etzold, A. Ulyanov, P. Argos: SRS: Information retrieval system for molecular biology data
banks. Methods in Enzymology, 266:114-128, 1996.
ƒ L. Haas et al.: DiscoveryLink: A system for integrating life sciences data. IBM Systems Journal,
2001.
ƒ D. Hull et al.: Taverna: a tool for building and running workflows of services. Nucleic Acids
Research, 2006.
ƒ T. Kirsten, E. Rahm: BioFuice: Mapping-based data integration in bioinformatics. Proc. 3rd Intl.
Workshop on Data Integration in the Life Sciences, 2006.
ƒ B. Ludaescher et al.: Scientific Workflow Management and the Kepler System. Concurrency and
Computation: Practice & Experience, 2005.
ƒ A. Prlic, E. Birney, T. Cox et al.: The distributed annotation system for integration of biological
data. Proc. 3rd Workshop on Data Integration in the Life Sciences, 2006.
ƒ S. Prompramote, Y.P. Chen: Annonda: Tool for integrating molecular-biological annotation data.
Proc. 21st ICDE Conf., 2005.
ƒ E. Rahm, A. Thor, D. Aumüller et al.: iFuice – Information fusion utilizing instance-based peer
mappings. Proc. 8th WebDB Workshop, 2005.
ƒ R. Stevens et al.: Tambis - Transparent Access to Multiple Bioinformatics Information Sources.
Bioinformatics, 2000.
ƒ J. Saltz, S. Oster, et al.: caGRID: Design and implementation of the core architecture of the
cancer biomedical informatics grid. Bioinformatics, 22(15):1910-1916, 2006.
Literature: Ontologies and ontology matching
ƒ S. Schulze-Kremer: Ontologies for molecular biology. Proc. 3rd Pacific Symposium
on Biocomputing, 1998.
ƒ O. Bodenreider, M. Aubry, A. Burgun: Non-lexical approaches to identifying
associative relations in the Gene Ontology. Proc. Pacific Symposium on
Biocomputing, 2005.
ƒ O. Bodenreider, A. Burgun: Linking the Gene Ontology to other biological ontologies.
Proc. ISMB Meeting on Bio-Ontologies, 2005.
ƒ J. Euzenat, P. Shvaiko: Ontology matching. Springer Verlag, 2007.
ƒ T. Kirsten, A. Thor, E. Rahm: Matching large life science ontologies. Proc. 4th Intl.
Workshop on Data Integration in the Life Sciences, 2007.
ƒ P. Mork, P. Bernstein: Adapting a generic match algorithm to align ontologies of
human anatomy. Proc. 20th ICDE Conf., 2004.
ƒ S. Myhre, H. Tveit, T. Mollestad, A. Laengreid: Additional Gene Ontology structure
for improved biological reasoning. Bioinformatics, 22(16):2020-2037, 2006.
ƒ P. Lambrix, H. Tan: SAMBO – A system for aligning and merging biomedical
ontologies. Journal of Web Semantics, 4(3):196-206, 2006.
Literature: Data quality aspects
ƒ A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios: Duplicate Record Detection: A Survey.
IEEE Transactions on Knowledge and Data Engineering 19(1), 2007.
ƒ K.G. Herbert et al: BIO-AJAX: An Extensible Framework for Biological Data Cleaning.
SIGMOD Record 33(2), 2004
ƒ K.G. Herbert, J. Wang: Biological data cleaning: A case study. International Journal of
Information Quality, 1(1):60-82, 2007.
ƒ V. Jakoniene, D. Rundqvist, and P. Lambrix: A method for similarity-based grouping of
biological data. Proc 3rd Intl. Workshop on Data Integration in the Life Sciences, 2006.
ƒ J. Koh, M. Lee, A. Khan et al.: Duplicate detection in biological data using association rule
mining. Proc Workshop on Data and Text Mining in Bioinformatics, 2004.
ƒ A. Monge, C. Elkan: An efficient domain-independent algorithm for detecting approximately
duplicate database records. Proc. SIGMOD Workshop on Research Issues on Data Mining
and Knowledge Discovery, 1997.
ƒ H. Müller and J.-C. Freytag: Problems, Methods and Challenges in Comprehensive Data
Cleansing. Technical Report HUB-IB-164, Humboldt University Berlin, 2003.
ƒ F. Naumann, J.-C. Freytag, and U. Leser: Completeness of integrated information sources.
Journal of Information Systems, 29(7):583-615, 2004.
ƒ E. Rahm, H.-H. Do: Data cleaning: Problems and current approaches. IEEE Bulletin of the
Technical Committee on Data Engineering, 23(4):3-13, 2000.
Online Bibliographies
http://dc-pubs.dbs.uni-leipzig.de/
http://se-pubs.dbs.uni-leipzig.de/