Integration of Biological Sources:
Current Systems and Challenges Ahead


(Sigmod Record, Vol. 33, No. 3, September 2004)



Thomas Hernandez & Subbarao Kambhampati

Dept. of Computer Science and Engineering

Arizona State University


Introduction


Traditionally, biological data was integrated manually by
biologists. However, the growing volume of data, its variety
of formats, and its wide distribution over the internet make
manual integration practically infeasible. Automated
integration is needed.


This need is also justified by the characteristics of the
biological sources:















Characteristics of Biological
Sources


Variety of data. Typical data stored cover several
biological and genomic research fields (e.g. gene
expression and sequences, disease characteristics,
molecular structures, microarray data). Not only
can the quantity of data available in a source be quite
large, but each record can itself be
extremely large (e.g. DNA sequences, 3D protein
structures).


Heterogeneous representations. Several sources
containing similar data can have very different
representations. This representational heterogeneity
includes structural (i.e. schema), naming, semantic (i.e.
the same concept under different terms, or the same term
for different concepts), and content (different data for
the same semantic object) differences.
















Autonomous operations. Sources are free to modify their
design and/or schema, and to remove or modify data, without
any prior public notification. Nearly all sources are
web-based and therefore dependent on network traffic
and overall availability. The data is dynamic.



Different interfaces and querying capabilities.













Integration Approaches in
Existing Systems



Existing systems can be classified first in terms of their
data model, that is, the design assumptions made by the
integration system about the syntactic nature of the data
exported by the sources.


1. Text data model. Sources are viewed as exporting
mainly text, and integration involves supporting
keyword/text search across the sources.


2. Structured data model. When sources are viewed as
exporting more structured data, there are two broad
types of integration approach: the data is either
warehoused or accessed on demand from the sources.


3. Linked records model. Sources are viewed as exporting
linked sets of browsable records, and integration
involves supporting effective navigation across sources.
















The majority of systems use the (semi-)structured or
linked records models. These systems are discussed in
more detail in what follows.


They follow three types of approach:


1. Warehouse integration. It materializes the data from
multiple sources into a local warehouse and executes all
queries on the data contained in the warehouse instead of
on the actual sources. It emphasizes data translation,
in contrast to the query translation of mediator-based
integration.


Pros: less dependence on the network, more efficient
query optimization, and the ability for users to filter,
validate, modify, and annotate the data obtained from the
sources.


Cons: outdated data and the need for frequent updates.
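
The warehouse approach can be sketched in a few lines. This is only an illustration under stated assumptions: the two in-memory sources, their field names, the translation functions, and the `Warehouse` class are all hypothetical and do not correspond to any system discussed in these slides.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]

# Hypothetical source exports with heterogeneous field names.
SOURCE_A: List[Record] = [{"acc": "P1", "organism": "human"}]
SOURCE_B: List[Record] = [{"id": "P2", "species": "mouse"}]


def translate_a(rec: Record) -> Record:
    """Map a SOURCE_A record onto the (assumed) warehouse schema."""
    return {"accession": rec["acc"], "species": rec["organism"]}


def translate_b(rec: Record) -> Record:
    """Map a SOURCE_B record onto the (assumed) warehouse schema."""
    return {"accession": rec["id"], "species": rec["species"]}


class Warehouse:
    def __init__(self) -> None:
        self.rows: List[Record] = []

    def load(self, source: Iterable[Record],
             translate: Callable[[Record], Record]) -> None:
        # Data translation happens once, at load time.
        self.rows.extend(translate(rec) for rec in source)

    def query(self, **conditions: str) -> List[Record]:
        # Queries run on the local copy only, never on the sources.
        return [row for row in self.rows
                if all(row.get(k) == v for k, v in conditions.items())]


warehouse = Warehouse()
warehouse.load(SOURCE_A, translate_a)
warehouse.load(SOURCE_B, translate_b)
mouse_rows = warehouse.query(species="mouse")

# A later change at the source stays invisible until the next reload.
SOURCE_B.append({"id": "P3", "species": "mouse"})
stale_rows = warehouse.query(species="mouse")
```

The last two statements illustrate the main drawback listed above: the new record in `SOURCE_B` does not appear in the query result until the warehouse is refreshed.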





















2. Mediator-based integration. It concentrates on query
translation. A mediator is responsible for reformulating, at
runtime, a query on a single mediated schema into
queries on the local schemas of the underlying data sources.
The mapping between the source descriptions and the mediated
schema is crucial for this translation. There are two main
approaches for establishing the mapping between each source
schema and the global schema: global-as-view (GAV) and
local-as-view (LAV). In GAV, the mediator relations are
written directly in terms of the source relations. In LAV,
every source relation is defined over the relations and the
schema of the mediator. LAV is preferred for large-scale
integration, and GAV is appropriate when the set of
sources being integrated is known and stable.
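
The difference between the two mappings can be made concrete with a small sketch. Everything here is an assumption made for illustration: the mediated relation `Protein(acc, name, function)`, the two source relations, and the informal rule strings used for the LAV descriptions. Only the GAV side is executed; answering queries under LAV requires rewriting queries using views, which is not attempted here.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, str]

# Hypothetical source relations (lists of dicts stand in for remote sources).
source_names: List[Row] = [
    {"acc": "P1", "name": "kinase A"},
    {"acc": "P2", "name": "lipase B"},
]
source_functions: List[Row] = [{"acc": "P1", "function": "phosphorylation"}]


def mediated_protein() -> Iterator[Row]:
    """GAV: the mediated relation Protein(acc, name, function) is written
    directly as a join over the source relations."""
    functions = {row["acc"]: row["function"] for row in source_functions}
    for row in source_names:
        if row["acc"] in functions:
            yield {"acc": row["acc"], "name": row["name"],
                   "function": functions[row["acc"]]}


def answer(predicate: Callable[[Row], bool]) -> List[Row]:
    """A query on the mediated schema is answered by unfolding the view."""
    return [row for row in mediated_protein() if predicate(row)]


# LAV, by contrast, describes each source as a view over the mediated
# schema (informal notation, not executed). Adding a source only adds
# one entry here, whereas under GAV mediated_protein() must be edited.
LAV_DESCRIPTIONS = {
    "source_names": "source_names(acc, name) :- Protein(acc, name, _)",
    "source_functions": "source_functions(acc, f) :- Protein(acc, _, f)",
}

result = answer(lambda row: row["function"] == "phosphorylation")
```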
























3. Navigation-based integration. It emerges from the fact
that an increasing number of sources on the web require
users to manually browse through several web
pages and data sources in order to obtain the desired
information. The specific paths essentially constitute
workflows in which the output of a source is redirected to
the input of the next source until the requested
information is reached.
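
A navigation path of this kind can be sketched as a chain of steps in which each source consumes the identifiers produced by the previous one. The two sources, their contents, and the `run_path` helper are hypothetical and serve only to illustrate the workflow idea.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical sources: each maps an identifier to identifiers that can be
# looked up in the next source along the path.
GENE_TO_PROTEIN: Dict[str, List[str]] = {"geneX": ["P1", "P2"]}
PROTEIN_TO_STRUCTURE: Dict[str, List[str]] = {"P1": ["S10"], "P2": []}

Step = Callable[[List[str]], List[str]]


def gene_source(gene_ids: List[str]) -> List[str]:
    """Follow links from gene entries to protein entries."""
    return [p for g in gene_ids for p in GENE_TO_PROTEIN.get(g, [])]


def protein_source(protein_ids: List[str]) -> List[str]:
    """Follow links from protein entries to structure entries."""
    return [s for p in protein_ids for s in PROTEIN_TO_STRUCTURE.get(p, [])]


def run_path(path: Sequence[Step], start: List[str]) -> List[str]:
    """Feed the output of each source into the input of the next one."""
    ids = start
    for step in path:
        ids = step(ids)
    return ids


structures = run_path([gene_source, protein_source], ["geneX"])
```

A stored `path` list plays the role of a reusable workflow: the same sequence of sources can be replayed for a different starting identifier.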























There are also other classifications besides the data model
classification:


1. Aim of integration: portal oriented or query oriented;


2. Source model: complementary (horizontal) or vertical
(overlap exists and requires aggregation);


3. User model: low expertise, high expertise in query
languages, or interactive query formulation;


4. Level of transparency: users choose the sources, or the
choice of sources is hard-wired.







































Sequence Retrieval System
(SRS)



SRS first parses flat files that contain structured text
with field names. It then creates and stores an index for
each field and uses these local indexes at query time to
retrieve relevant entries. Although extensive indexed
entries are kept locally to be used by the query
processor at query time, SRS is not a warehouse system,
as the actual data is neither modified nor stored locally.
The other main feature of SRS is that it keeps track of
the cross-references between sources. It uses its own
parsing component to identify links that exist between
entries in different sources during parsing and indexing.
These links are then used to suggest more results to a
user after a query has been processed.




http://srs.embl-heidelberg.de:8000/srs5/
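
The indexing and cross-referencing steps described for SRS can be sketched as follows. This is not SRS's own parser or index format. The sample flat file, the two-letter field codes (`ID`, `DE`, `DR`), the `//` entry terminator, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical flat file: one field per line, entries terminated by "//".
FLAT_FILE = """\
ID   P1
DE   serine kinase
DR   PDB:S10
//
ID   P2
DE   lipase
//
"""


def parse_entries(text):
    """Split the flat file into entries mapping field name -> list of values."""
    entries, current = [], defaultdict(list)
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
        elif line.strip():
            field, value = line.split(None, 1)
            current[field].append(value.strip())
    return entries


def build_indexes(entries):
    """Build one inverted index per field, plus a table of cross-references."""
    field_index = defaultdict(lambda: defaultdict(set))
    links = defaultdict(set)
    for entry in entries:
        entry_id = entry["ID"][0]
        for field, values in entry.items():
            for value in values:
                if field == "DR":  # assumed cross-reference field
                    links[entry_id].add(value)
                else:
                    for token in value.lower().split():
                        field_index[field][token].add(entry_id)
    return field_index, links


def search(field_index, links, field, term):
    """Return matching entry ids and linked entries to suggest to the user."""
    hits = sorted(field_index[field].get(term.lower(), set()))
    suggestions = sorted({l for h in hits for l in links.get(h, ())})
    return hits, suggestions


field_index, links = build_indexes(parse_entries(FLAT_FILE))
hits, suggestions = search(field_index, links, "DE", "kinase")
```

Note that only entry identifiers and links are kept in the indexes; the entry text itself is not copied, which mirrors the point that SRS is not a warehouse.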













BioKleisli



BioKleisli is a mediator-based integration system. The
mediator on top of the underlying sources relies mainly
on a high level query language (CPL, more expressive
than SQL) to query across several sources. Queries are
decomposed into sub-queries and source-specific
wrappers map sub-queries to specific heterogeneous
sources, which are accessed through predefined atomic
query functions.


BioKleisli doesn’t use any global molecular biology
schema or ontology.


It is aimed at performing a horizontal integration. A
query attribute is usually bound to an attribute in a
single predetermined source and there is essentially no
content overlap.
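
The decomposition of a cross-source query into sub-queries answered by atomic query functions can be sketched as below. The code is plain Python, not CPL, and the two sources, their records, and the function names are hypothetical. In line with the horizontal integration described above, each attribute is bound to exactly one source.

```python
from typing import Dict, List

# Hypothetical wrapped sources with no content overlap.
GENE_SOURCE: Dict[str, List[Dict[str, str]]] = {
    "locus1": [{"protein_acc": "P1"}, {"protein_acc": "P2"}],
}
PROTEIN_SOURCE: Dict[str, Dict[str, str]] = {
    "P1": {"accession": "P1", "description": "kinase"},
    "P2": {"accession": "P2", "description": "lipase"},
}


def gene_source_lookup(locus: str) -> List[Dict[str, str]]:
    """Atomic query function for the (assumed) gene source."""
    return GENE_SOURCE.get(locus, [])


def protein_source_lookup(accession: str) -> List[Dict[str, str]]:
    """Atomic query function for the (assumed) protein source."""
    record = PROTEIN_SOURCE.get(accession)
    return [record] if record else []


def proteins_for_locus(locus: str) -> List[Dict[str, str]]:
    """Cross-source query: the first sub-query goes to the gene source,
    and its results parameterize the second sub-query on the protein source."""
    return [
        {"locus": locus,
         "accession": protein["accession"],
         "description": protein["description"]}
        for gene in gene_source_lookup(locus)
        for protein in protein_source_lookup(gene["protein_acc"])
    ]


rows = proteins_for_locus("locus1")
```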










TAMBIS



TAMBIS is a mediator-based and ontology-driven
integration system.



Query processing in TAMBIS (diagram):

1. GUI (concepts defined in a global schema)
2. Source-independent GRAIL query
3. Query internal form
4. Source-dependent CPL query execution plan
5. Use BioKleisli's existing function library to access sources




The TAMBIS domain ontology mainly serves to ease the
user’s task of formulating a query, rather than to map
schemas between sources.



DiscoveryLink



DiscoveryLink is also a mediator-based integration
system. Applications typically connect to
DiscoveryLink and submit a query in SQL on the global
schema, not necessarily aware of the underlying sources.
Underneath, a federated database query processor
communicates with source-specific wrappers to
determine the optimal plan for a given query.



The wrappers have two roles. They translate the source
data models and provide source-specific information
about query capabilities that will help the optimizer
determine which parts of a query can be submitted to
each source.
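
The second role of the wrappers can be sketched as a capability check used when planning a query. This is a simplified illustration, not DiscoveryLink's wrapper interface: the `Wrapper` class, the `pushable_ops` attribute, the predicate triples, and the sample rows are all assumptions.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Predicate = Tuple[str, str, object]  # (column, operator symbol, value)
OPS = {"=": operator.eq, "<": operator.lt}


@dataclass
class Wrapper:
    name: str
    rows: List[Dict[str, object]]
    pushable_ops: Set[str]  # operators the source can evaluate itself

    def can_push(self, predicate: Predicate) -> bool:
        return predicate[1] in self.pushable_ops

    def fetch(self, predicates: List[Predicate]) -> List[Dict[str, object]]:
        # Stands in for the source evaluating the pushed-down predicates.
        return [row for row in self.rows
                if all(OPS[op](row[col], val) for col, op, val in predicates)]


def plan(wrapper: Wrapper, predicates: List[Predicate]):
    """Split predicates into those sent to the source and those kept local."""
    pushed = [p for p in predicates if wrapper.can_push(p)]
    local = [p for p in predicates if not wrapper.can_push(p)]
    return pushed, local


def execute(wrapper: Wrapper, predicates: List[Predicate]):
    pushed, local = plan(wrapper, predicates)
    rows = wrapper.fetch(pushed)
    # Remaining predicates are applied by the query processor itself.
    return [row for row in rows
            if all(OPS[op](row[col], val) for col, op, val in local)]


wrapper = Wrapper(
    name="protein_source",
    rows=[
        {"acc": "P1", "species": "human", "length": 250},
        {"acc": "P2", "species": "human", "length": 900},
        {"acc": "P3", "species": "mouse", "length": 100},
    ],
    pushable_ops={"="},  # this source only supports equality filters
)
query = [("species", "=", "human"), ("length", "<", 300)]
pushed, local = plan(wrapper, query)
result = execute(wrapper, query)
```

A real optimizer would also use cost information from the wrappers to choose between alternative plans; the sketch covers only the capability check.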


Other Existing Systems



BACIIS is an end-use product which was developed
following a mediator-based approach combined with
extensive use of a knowledge base (KB). The KB
contains a domain ontology, which serves as a global
schema, and maps the database schema to the domain
ontology.



BioNavigator is a commercially available navigation-based
integration system. Users can define their preferred
execution path for a query and reuse it later.



GUS is a warehouse-based integration system.

Discussion



As mentioned earlier, warehouse-based approaches have
two clear advantages. First, they simplify query
optimization and processing by storing the data locally
according to a single global schema. Second, they enable
users to add their own annotations to stored data
and to specify filtering conditions that clean the data,
since it is stored locally.


However, it is still unclear how this user-friendly
feature can be achieved efficiently and, more specifically,
how the data could effectively be validated or modified
without human intervention and extensive domain
expertise. Furthermore, data warehousing faces the major
problem of handling updates in the sources, and an even
bigger challenge when the data is modified and
annotated locally and therefore differs from the data
in the sources.





Although GAV and LAV were introduced earlier for the
mediator-based approach, no mediator-based
integration system implements them so far.
Wrapper-oriented approaches are still relatively new.



Much like TAMBIS and BioKleisli, most of the current
systems only address horizontal integration and
don’t consider the potential overlap between
sources. DiscoveryLink makes an attempt to solve the
problem of selecting between several potential sources
by using the information provided by wrappers to
estimate querying costs. However, optimization and source
selection based on overlap and coverage are not
considered.

Reference

Thomas Hernandez & Subbarao Kambhampati. Integration of Biological
Sources: Current Systems and Challenges Ahead. Sigmod Record, Vol. 33,
No. 3, September 2004.