Integration of Biological Sources:
Current Systems and Challenges Ahead

SIGMOD Record, Vol. 33, No. 3, September 2004


Thomas Hernandez & Subbarao Kambhampati

Dept. of Computer Science and Engineering

Arizona State University


Traditionally, biologists integrated biological data
manually. However, the growing volume of data, the
variety of formats, and the wide distribution of sources
over the internet make manual integration practically
infeasible. Automated integration is needed.

This need is also justified by the characteristics of the
biological sources:

Characteristics of Biological Sources

Variety of data. The data stored typically cover several
biological and genomic research fields (e.g. gene
expression and sequences, disease characteristics,
molecular structures, microarray data, etc.). Not only
can the quantity of data available in a source be quite
large, but each record can itself be extremely large
(e.g. DNA sequences, 3D protein structures, etc.).

Heterogeneous representations. Several sources
containing similar data can have very different
representations. The representational heterogeneity
includes structural (i.e. schema), naming, semantic (i.e.
the same concept under different terms, or the same term
for different concepts), and content (different data for
the same semantic object) differences.

Autonomous operations. Sources are free to modify their
design and/or schema, and to remove or modify data,
without any prior public notification. Nearly all sources
are web-based and therefore dependent on network traffic
and overall availability. The data is dynamic.

Different interfaces and querying capabilities.

Integration Approaches in
Existing Systems

Existing systems can be classified first in terms of data
models. The data model refers to the design assumptions
made by the integration system about the syntactic nature
of the data exported by the sources.

1. Text data model. Systems view sources as exporting
mainly text, and integration involves supporting
keyword/text search across the sources.

2. Structured data model. When sources are viewed as
exporting more structured data, there are two broad
types of integration approaches: the data is either
warehoused or accessed on demand from the sources.

3. Linked records model. Systems view sources as exporting
linked sets of browsable records, and integration
involves supporting effective navigation across sources.


The majority of systems use the (semi-)structured or
linked record models. These systems are discussed in
more detail below.

They follow three types of approach:

1. Warehouse integration. It materializes the data from
multiple sources into a local warehouse and executes all
queries on the data contained in the warehouse instead of
on the actual sources. It emphasizes data translation,
rather than the query translation used in mediator-based
integration.

Pros: less dependency on network, improved efficiency of
query optimization, enabling users to filter, validate,
modify, and annotate the data obtained from the sources.

Cons: outdated data and the need for frequent updates.
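
The warehouse idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two source record layouts, the field names, and the single `gene` table standing in for the warehouse schema. It only shows the shape of the approach: records are translated once into a local schema, and queries then run locally without contacting the sources.

```python
import sqlite3

# Hypothetical records as two sources might export them (invented data).
SOURCE_A = [{"acc": "P1", "gene_name": "abc1", "organism": "E. coli"}]
SOURCE_B = [{"id": "P2", "symbol": "xyz2", "species": "H. sapiens"}]


def translate_a(rec):
    """Data translation: map a source-A record onto the warehouse schema."""
    return (rec["acc"], rec["gene_name"], rec["organism"], "A")


def translate_b(rec):
    """Data translation: map a source-B record onto the warehouse schema."""
    return (rec["id"], rec["symbol"], rec["species"], "B")


def build_warehouse(conn):
    """Materialize both sources into one local table."""
    conn.execute(
        "CREATE TABLE gene (accession TEXT, name TEXT, organism TEXT, origin TEXT)"
    )
    rows = [translate_a(r) for r in SOURCE_A] + [translate_b(r) for r in SOURCE_B]
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def query_warehouse(conn, organism):
    """Queries touch only the local copy, never the remote sources."""
    cur = conn.execute(
        "SELECT accession, name FROM gene WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return cur.fetchall()
```

The local copy is also where the drawback shows up: nothing in this sketch notices when a source changes, so the table has to be rebuilt or refreshed periodically.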


2. Mediator-based integration. It concentrates on query
translation. A mediator is responsible for reformulating,
at runtime, a query on a single mediated schema into a
query on the local schemas of the underlying data sources.
The mapping between the source descriptions and the
mediator is crucial for such a translation. There are two
main approaches for establishing the mapping between each
source schema and the global schema: global-as-view (GAV)
and local-as-view (LAV). In GAV, the mediator relations are
written directly in terms of the source relations. In LAV,
every source relation is defined over the relations and the
schema of the mediator. LAV is preferred for large-scale
integration, and GAV is appropriate when the set of
sources being integrated is known and stable.
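
A small sketch may help make the GAV case concrete. The source relations (`seq_db`, `struct_db`), the mediated relation `ProteinStructure`, and all data are hypothetical. The point is only that, under GAV, a mediated relation is written as a view over source relations, so a query on the mediated schema is answered by unfolding that view. LAV is not shown here; it requires rewriting queries using views, which is a harder problem.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Sources = Dict[str, List[Row]]

# Hypothetical source relations (names and contents invented for illustration).
SOURCES: Sources = {
    "seq_db": [{"acc": "P1", "sequence": "MKT"}, {"acc": "P2", "sequence": "GAV"}],
    "struct_db": [{"protein": "P1", "pdb": "1ABC"}],
}


def protein_structure_view(src: Sources) -> List[Row]:
    """GAV definition: ProteinStructure(accession, sequence, structure)
    is written directly as a join over the two source relations."""
    return [
        {"accession": s["acc"], "sequence": s["sequence"], "structure": t["pdb"]}
        for s in src["seq_db"]
        for t in src["struct_db"]
        if s["acc"] == t["protein"]
    ]


# The GAV mapping: mediated relation name -> view over the sources.
GAV_MAPPING: Dict[str, Callable[[Sources], List[Row]]] = {
    "ProteinStructure": protein_structure_view,
}


def answer(relation: str, predicate: Callable[[Row], bool], src: Sources) -> List[Row]:
    """Reformulate a query on the mediated schema by unfolding the view,
    then apply the query's selection predicate to the result."""
    return [row for row in GAV_MAPPING[relation](src) if predicate(row)]
```

Adding a new source under GAV means editing the view definitions themselves, which is one way to see why GAV suits a known, stable set of sources.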


3. Navigation-based integration. It emerges from the fact
that an increasing number of sources on the web require
users to manually browse through several web pages and
data sources in order to obtain the desired information.
The specific paths essentially constitute workflows in
which the output of one source is redirected to the input
of the next source until the requested information is
reached.
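
The workflow view of navigation can be sketched as a chain of lookups. The two lookup tables and the function names below are hypothetical stand-ins for real sources; the sketch only shows the output of one step being fed as input to the next.

```python
# Hypothetical link tables standing in for two browsable sources.
GENE_TO_PROTEIN = {"abc1": "P1"}
PROTEIN_TO_STRUCTURE = {"P1": "1ABC"}


def gene_source(gene):
    """Follow a link from a gene record to a protein accession."""
    return GENE_TO_PROTEIN.get(gene)


def protein_source(accession):
    """Follow a link from a protein record to a structure identifier."""
    return PROTEIN_TO_STRUCTURE.get(accession)


def run_path(path, start):
    """Execute a navigation path: each step's output is the next step's input.
    Stops early and returns None when a step finds no link to follow."""
    value = start
    for step in path:
        if value is None:
            break
        value = step(value)
    return value
```

Because a path is just a list of steps, a user-defined path such as `[gene_source, protein_source]` can be stored and reused for later queries.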


There are also other classifications besides the data model:

1. Aim of integration: portal-oriented or query-oriented.

2. Source model: complementary (horizontal) or vertical
(sources overlap, which requires aggregation).

3. User model: low expertise, high expertise in query
languages, or interactive query formulation.

4. Level of transparency: users choose the sources, or the
choice of sources is hard-wired.


Sequence Retrieval System (SRS)

SRS first parses flat files that contain structured text
with field names. It then creates and stores an index for
each field and uses these local indexes at query time to
retrieve relevant entries. Although extensive indexed
entries are kept locally to be used by the query
processor at query time, SRS is not a warehouse system,
as the actual data is neither modified nor stored locally.
The other main feature of SRS is that it keeps track of
the cross-references between sources. It uses its own
parsing component to identify links that exist between
entries in different sources during parsing and indexing.
These links are then used to suggest more results to a
user after a query has been processed.
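
The indexing and cross-referencing behaviour described above can be approximated in a few lines. This is not SRS code: the flat-file layout, the field tags `ID`, `DE`, and `DR`, and the `//` record separator are assumptions chosen for illustration. The sketch builds one index per field and keeps cross-references separately so they can be offered as suggestions after a search.

```python
from collections import defaultdict

# Hypothetical flat file: "TAG: value" lines, entries separated by "//".
FLAT_FILE_A = """ID: A1
DE: kinase protein
DR: B7
//
ID: A2
DE: transport protein
//
"""


def parse_flat_file(text):
    """Split the file into entries, each a dict of field tag -> list of values."""
    entries, current = [], {}
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                entries.append(current)
            current = {}
        elif ":" in line:
            field, value = line.split(":", 1)
            current.setdefault(field.strip(), []).append(value.strip())
    if current:
        entries.append(current)
    return entries


def build_indexes(entries):
    """Build one token index per field, plus a table of cross-references."""
    indexes = defaultdict(lambda: defaultdict(set))
    links = defaultdict(set)
    for entry in entries:
        entry_id = entry["ID"][0]
        for field, values in entry.items():
            for value in values:
                for token in value.lower().split():
                    indexes[field][token].add(entry_id)
        for target in entry.get("DR", []):  # DR is assumed to hold links
            links[entry_id].add(target)
    return indexes, links


def search(indexes, links, field, term):
    """Return matching entry ids and the linked entries to suggest."""
    hits = sorted(indexes[field].get(term.lower(), set()))
    suggestions = sorted({t for h in hits for t in links.get(h, set())})
    return hits, suggestions
```

Only identifiers and tokens are stored in the indexes; the entries themselves would still be read from the original files, which matches the point that indexing alone does not make a system a warehouse.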



BioKleisli is a mediator-based integration system. The
mediator on top of the underlying sources relies mainly
on a high-level query language (CPL, more expressive
than SQL) to query across several sources. Queries are
decomposed into sub-queries, and source-specific wrappers
map the sub-queries to specific heterogeneous sources,
which are accessed through predefined atomic query
functions.

BioKleisli doesn’t use any global molecular biology
schema or ontology.

It is aimed at performing a horizontal integration. A
query attribute is usually bound to an attribute in a
single predetermined source and there is essentially no
content overlap.
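
A rough sketch of this style of horizontal integration follows. It is written in Python rather than CPL, and the attribute names, lookup tables, and atomic functions are all hypothetical. Each query attribute is bound to exactly one source function, so a query decomposes into one sub-query per attribute with no overlap to reconcile.

```python
# Hypothetical data held by two non-overlapping sources.
SEQUENCES = {"P1": "MKT"}
STRUCTURES = {"P1": "1ABC"}


def get_sequence(accession):
    """Atomic query function for the (hypothetical) sequence source."""
    return SEQUENCES.get(accession)


def get_structure(accession):
    """Atomic query function for the (hypothetical) structure source."""
    return STRUCTURES.get(accession)


# Each attribute is bound to a single predetermined source function.
ATTRIBUTE_SOURCE = {
    "sequence": get_sequence,
    "structure": get_structure,
}


def run_query(accession, attributes):
    """Decompose the query into one sub-query per attribute and merge results."""
    return {attr: ATTRIBUTE_SOURCE[attr](accession) for attr in attributes}
```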


TAMBIS is a mediator-based and ontology-driven
integration system.




In a global




CPL query
execution plan

Use BioKleisli existing

function library to
access sources


The TAMBIS domain ontology mainly serves the
purpose of easing the user’s task of formulating the
query instead of schema mapping between sources.


DiscoveryLink is also a mediator-based integration
system. Applications typically connect to
DiscoveryLink and submit a query in SQL on the global
schema, not necessarily aware of the underlying sources.
Underneath, a federated database query processor
communicates with source-specific wrappers to
determine the optimal plan for a given query.

The wrappers have two roles. They translate the source
data models, and they provide source-specific information
about query capabilities that helps the optimizer
determine which parts of a query can be submitted to
each source.
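
The second wrapper role can be illustrated with a simplified sketch. It is not DiscoveryLink's optimizer, which is cost-based; the `Wrapper` class, its `supported_ops` field, and the predicate format are assumptions. The sketch only shows the decision the capability information enables: predicates a source can evaluate are pushed down to it, and the rest are applied by the mediator afterwards.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Row = Dict[str, object]
Predicate = Tuple[str, str, object]  # (column, operator, value)

OPS = {
    "=": lambda a, b: a == b,
    "<": lambda a, b: a < b,
    ">": lambda a, b: a > b,
}


def matches(row: Row, preds: List[Predicate]) -> bool:
    """True if the row satisfies every predicate."""
    return all(OPS[op](row[col], val) for col, op, val in preds)


@dataclass
class Wrapper:
    """Hypothetical wrapper: exposes rows and advertises supported operators."""
    name: str
    rows: List[Row]
    supported_ops: Set[str]

    def can_evaluate(self, pred: Predicate) -> bool:
        return pred[1] in self.supported_ops

    def fetch(self, preds: List[Predicate]) -> List[Row]:
        # Stands in for the source evaluating the pushed-down predicates.
        return [r for r in self.rows if matches(r, preds)]


def plan_and_run(wrapper: Wrapper, preds: List[Predicate]):
    """Split predicates by source capability, then execute the plan."""
    pushed = [p for p in preds if wrapper.can_evaluate(p)]
    residual = [p for p in preds if not wrapper.can_evaluate(p)]
    rows = [r for r in wrapper.fetch(pushed) if matches(r, residual)]
    return pushed, residual, rows
```

A source that supports only equality would receive the equality predicates, while range conditions stay with the mediator.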

Other Existing Systems

BACIIS is an end-user product which was developed
following a mediator-based approach combined with
extensive use of a knowledge base (KB). The KB
contains a domain ontology which serves as a global
schema and maps the database schema to the domain
ontology.

BioNavigator is a commercially available navigation-based
integration system. Users can define their preferred
execution path for a query and reuse it later.

GUS is a warehouse-based integration system.


As mentioned earlier, warehouse-based approaches have
two clear advantages. First, they simplify query
optimization and processing by storing the data locally
according to a single global schema. Second, they enable
users to add their own annotations to stored data
and to specify filtering conditions to clean the data,
since it is stored locally.

However, it is still unclear how this user feature can be
achieved efficiently and, more specifically, how the data
could effectively be validated or modified without human
intervention and extensive domain expertise. Furthermore,
data warehousing faces the big problem of handling updates
in the sources, and an even bigger challenge when the data
has been modified and annotated locally and therefore
differs from the data in the sources.


Although GAV and LAV were introduced earlier for the
mediator-based approach, there are no mediator-based
integration systems implementing them so far.
[...]-oriented approaches are still relatively new.

Much like TAMBIS and BioKleisli, most of the current
systems only address horizontal integration and
don’t consider the potential overlap between
sources. DiscoveryLink makes an attempt to solve the
problem of selecting between several potential sources
by using the information provided by wrappers to
estimate querying costs. But optimization and source
selection from the point of view of overlap and coverage
is not addressed.

Thomas Hernandez & Subbarao Kambhampati.
Integration of Biological Sources: Current Systems and Challenges Ahead.
SIGMOD Record, Vol. 33, No. 3, September 2004.