Data Integration in the Life Sciences


2 Oct 2013


Bioinformatics Databases: Fundamentals of Database Technology & Data Organization

Kristen Chambers

Director of Bioinformatics

Dartmouth Medical School


BioInformatics @ Dartmouth Medical School




How can data be organized?


Paper (i.e. in notebooks)


Flat files


Collection of data records


Minimal structure, no metadata


Application program must contain relationship information


Database


Hierarchical


Network


Relational



What is a relational database?

A database composed of relations and conforming to a set of principles governing how such relations are supposed to behave (“Codd’s 12 Rules”). There are many database systems that use tables but don’t conform to all of the principles. These are often called “semirelational” systems.

(from Understanding SQL, Martin Gruber)




Practically speaking...


A database is a body of information stored in two dimensions (rows and columns)

Rows are records

Columns are attributes of those record entities

The groups of rows and columns, or tables, are largely independent of each other

The power of the database lies in the relationships that you construct among the tables

A database is self-describing: it contains metadata, which is a description of its own structure
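The "self-describing" point can be seen directly in a small sketch. It assumes SQLite (through Python's built-in `sqlite3` module) as a stand-in DBMS, which the slides do not mention, and a hypothetical `Sequence` table. Other systems expose the same idea through their own catalog tables.

```python
import sqlite3

# Hypothetical table for illustration; the names are not from the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sequence (sname TEXT PRIMARY KEY, length INTEGER)")

# SQLite stores a description of its own structure in the sqlite_master table.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, default_value, pk)
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(Sequence)")]

print(tables)
print(columns)
```

The table and column definitions come back as ordinary query results, so an application can discover the structure without having it hard-coded.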





What is a Database Management System (DBMS)?

A set of programs which define, administer and process databases and their associated applications

A scalable DBMS can run on multiple platforms (varying sizes)

A DBMS that supports interoperability uses industry-standard language and standard ways of exchanging data

Examples: Oracle, Sybase, 4D, MS Access …


Features of a Relational Database


Rows (records) are in no particular order


Columns (fields) are ordered, numbered and named; names should indicate content of the field

Primary key uniquely identifies each row: ensures that no row is empty, and that every row is different from every other row

Two-step commit process
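The primary-key behaviour listed above can be tried out in a short sketch. It assumes SQLite via Python's `sqlite3` module and a hypothetical `KnownGenes` table keyed on an invented `gene_id` column. The explicit `NOT NULL` is an assumption needed for SQLite, which otherwise tolerates NULL in non-integer primary keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# gene_id is the primary key; NOT NULL is spelled out for SQLite's sake.
conn.execute(
    "CREATE TABLE KnownGenes (gene_id TEXT PRIMARY KEY NOT NULL, gname TEXT)")
conn.execute("INSERT INTO KnownGenes VALUES ('G1', 'example gene A')")


def try_insert(gene_id, gname):
    """Return True if the row was accepted, False if a key rule rejected it."""
    try:
        conn.execute("INSERT INTO KnownGenes VALUES (?, ?)", (gene_id, gname))
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_ok = try_insert("G1", "another name")  # same key as an existing row
missing_ok = try_insert(None, "no key")          # no key at all
new_ok = try_insert("G2", "example gene B")      # fresh key

# Rows have no inherent order, so request one explicitly when it matters.
ordered = [r[0] for r in conn.execute(
    "SELECT gene_id FROM KnownGenes ORDER BY gene_id")]
print(duplicate_ok, missing_ok, new_ok, ordered)
```

Only the row with a fresh key is accepted. The duplicate key and the missing key are both rejected by the database itself, not by application code.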


Features of a Relational Database


A view is a subset of the database that an application (or user) can process

The database schema is the structure of the entire database

A constraint is a condition you apply to an attribute of a table
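A minimal sketch of these three terms, assuming SQLite via Python's `sqlite3` module. The `Sequence` table, the `LongSequences` view and the 1000-length cutoff are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Sequence (
        sname  TEXT PRIMARY KEY,
        length INTEGER CHECK (length > 0)   -- constraint on one attribute
    )
""")
conn.executemany("INSERT INTO Sequence VALUES (?, ?)",
                 [("seqA", 1200), ("seqB", 300), ("seqC", 4500)])

# A view exposes a subset of the data under its own name.
conn.execute("CREATE VIEW LongSequences AS "
             "SELECT sname, length FROM Sequence WHERE length >= 1000")
long_names = [r[0] for r in conn.execute(
    "SELECT sname FROM LongSequences ORDER BY sname")]

# The schema (the structure of the whole database) can itself be queried;
# SQLite's internal objects are filtered out by name.
schema_objects = sorted((r[0], r[1]) for r in conn.execute(
    "SELECT type, name FROM sqlite_master WHERE name NOT LIKE 'sqlite_%'"))

# The CHECK constraint rejects a row that violates the condition.
try:
    conn.execute("INSERT INTO Sequence VALUES ('bad', -5)")
    constraint_held = False
except sqlite3.IntegrityError:
    constraint_held = True

print(long_names, schema_objects, constraint_held)
```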


Relationships between tables


One-to-One, Many-to-One, Many-to-Many

A “join” is an operation that combines data from multiple tables into a single result table

E-R (entity-relationship) diagram is the basic graphic to describe the structure of a database

    SELECT Sequence.sname, KnownGenes.gname, KnownGenes.length
    FROM Sequence, KnownGenes
    WHERE KnownGenes.length = Sequence.length
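To make the join query concrete, the sketch below runs it against two small tables. It assumes SQLite via Python's `sqlite3` module. The rows are invented, and the column lists are inferred from the query itself, not taken from a real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sequence (sname TEXT, length INTEGER)")
conn.execute("CREATE TABLE KnownGenes (gname TEXT, length INTEGER)")
# Invented rows, for illustration only.
conn.executemany("INSERT INTO Sequence VALUES (?, ?)",
                 [("seq1", 900), ("seq2", 1500), ("seq3", 2100)])
conn.executemany("INSERT INTO KnownGenes VALUES (?, ?)",
                 [("geneA", 1500), ("geneB", 2100), ("geneC", 3000)])

# Same join as the slide, plus ORDER BY so the output is reproducible.
rows = conn.execute("""
    SELECT Sequence.sname, KnownGenes.gname, KnownGenes.length
    FROM Sequence, KnownGenes
    WHERE KnownGenes.length = Sequence.length
    ORDER BY Sequence.sname
""").fetchall()

for row in rows:
    print(row)
```

Only pairs whose lengths match appear in the result table. `seq1` and `geneC` have no partner and are dropped.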


E-R Diagram

The tool for communicating with relational databases: SQL

Structured Query Language (SQL)

A query is a question you ask the database, and SQL retrieves the appropriate answer set

Interactive SQL (command line) vs. RAD tool

Standardization issue: ANSI (American National Standards Institute)


Data Types


Types of data indicate functions that are possible between related fields

Each field is assigned one data type (imposes structure on data)

Examples: text (CHAR, VARCHAR), number (INT, DEC), date, time, money, binary

Standardization issue: ANSI (American National Standards Institute)



Designing a database is not trivial


The value is not in the data, but in the structure

Design to facilitate the retrieval and interpretation of the data


A word about database design:


Reusable ‘core’ modules, with customizable components

Standard business logic framework controls transactions (middle layer)

Metadata-based back-end data storage (facilitates data sharing)


Example: BioInformatics Core Technology


BioInformatics Core Technology

Life science has become a field which generates an enormous amount of un-integrated data.


How can methods for data organization help to solve this problem?


What is Data Integration?


Creating a system which allows the extraction of a piece or set of information (query result) across multiple domains (possibly disparate data sources: flat files, databases, spreadsheets, URLs...)
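As a rough illustration of a query that spans disparate sources, the sketch below answers one question from a flat file and a relational database together. Everything in it is an assumption made for the example: the gene names, the `fold_change` column, the `Annotation` table and the `integrated_query` helper. An in-memory CSV stands in for a file on disk.

```python
import csv
import io
import sqlite3

# Source 1: a flat file (an in-memory CSV standing in for a file on disk).
flat_file = io.StringIO("gene,fold_change\ngeneA,5.2\ngeneB,1.3\ngeneC,4.8\n")

# Source 2: a relational database holding different facts about the same genes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Annotation (gene TEXT PRIMARY KEY, description TEXT)")
conn.executemany("INSERT INTO Annotation VALUES (?, ?)",
                 [("geneA", "hypothetical kinase"),
                  ("geneC", "hypothetical receptor")])


def integrated_query(min_fold):
    """Answer one question from both sources without modifying either."""
    flat_file.seek(0)
    results = []
    for record in csv.DictReader(flat_file):
        fold = float(record["fold_change"])
        if fold < min_fold:
            continue
        hit = conn.execute(
            "SELECT description FROM Annotation WHERE gene = ?",
            (record["gene"],)).fetchone()
        results.append((record["gene"], fold, hit[0] if hit else None))
    return results


answer = integrated_query(4.0)
print(answer)
```

The caller sees a single answer set and does not need to know which fact came from which source.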


Sample integration problem: Cancer Biomarker Discovery

Clinical center collects blood samples from 1000 individuals with colon cancer

Expression analysis reveals that protein ‘x’ is over-expressed in these samples, relative to controls

Could this be a colon cancer biomarker?


Understanding transcription factors for protein ‘x’ production

Show me all genes in the public literature that are putatively related to protein ‘x’, have more than 4-fold expression differential between affected and normal tissue and are homologous to known transcription factors.

Q1: Find homologs (SEQUENCE)

Q2: Find genes with 4-fold differential (EXPRESSION)

Q3: Show me genes in public literature (LITERATURE)

Result: (Q1 ∩ Q2 ∩ Q3)
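If the combined query is read as the set of genes that satisfy all three sub-queries, the final step is a plain set intersection. The sketch below assumes that reading, along with invented gene identifiers and an assumed pairing of each sub-query with one data source.

```python
# Invented gene identifiers standing in for the three query results.
q1_homologs = {"geneA", "geneB", "geneC", "geneD"}   # assumed: from sequence data
q2_differential = {"geneB", "geneC", "geneE"}        # assumed: from expression data
q3_literature = {"geneC", "geneB", "geneF"}          # assumed: from literature data

# Genes that satisfy all three conditions at once.
candidates = q1_homologs & q2_differential & q3_literature
print(sorted(candidates))
```

In practice each set would come from a separate system. The integration problem lies in getting the three systems to agree on what counts as the same gene, so that the intersection is meaningful.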


Key components to integration


Accessing without modifying original data sources


Handling redundant, conflicting, missing, changing (versions) data

Normalizing analytical data from different data sources

Conforming terminology to industry standards

Accessing the integrated data as a single repository


Including metadata in repository


Approaches to Integration: where are the key issues addressed?

Federated database (poses constraints on original data sources; fragility in reliance on source systems)

Data warehousing (ETL layer, original data sources untouched, requires understanding of domain, sophisticated update/archive processes)

Integrating data source profiles

Indexed Flat Files

Others…


Data Warehousing


Metadata: one key to success

Describes data types, relationships, histories, etc.

Back-end (supports developers), front-end (supports users and application)

Data value: 55

Metadata values:

Data element name: vehicle speed

Unit: miles per hour

Description: the average velocity of a vehicle
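The "55" example can be written down as a small data structure that keeps the value and its metadata together. The `DataElement` class and its field names are assumptions made for this sketch, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """A value together with the metadata needed to interpret it."""
    value: float
    name: str
    unit: str
    description: str

    def describe(self) -> str:
        # :g prints 55 as "55", not "55.0"
        return f"{self.name} = {self.value:g} {self.unit} ({self.description})"


speed = DataElement(
    value=55,
    name="vehicle speed",
    unit="miles per hour",
    description="the average velocity of a vehicle",
)
print(speed.describe())
```

On its own, the number 55 could be a speed, an age or a temperature. It becomes interpretable, and comparable across sources, only once the metadata is attached.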


Standards: the final frontier

Naming conventions

Standard coordinate systems

Unify interpretations of single object types

Unify software solutions to the same problem (also data formats)

Standards for metadata (incompatible or missing metadata)


Developing Standards for Life Sciences Research

Discovery science does not lend itself well to constraints (especially system constraints)

Decentralized data management infrastructure, competition

Wildly varying skill levels for data and information management

Several groups (Bio-Ontologies, HGNC, OMG, etc.) and national research initiatives (EDRN, caBIG, etc.) are taking the lead in the effort to create ‘workable’ standards.

New approach to integration: Cancer Biomarker Discovery

Network of distributed data ‘silos’ (does not perturb data sources)

Centralized query and ‘business logic’ servers, accessed through web interface

CORBA framework ‘manages’ XML profile definitions across the web

A profile is a set of resource definitions implemented in XML for data sources residing in one or more distributed systems
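The slides do not show what a profile looks like, so the sketch below is purely hypothetical. The element names, the attribute names and the `load_profile` helper are all invented to illustrate resource definitions expressed in XML and read by a central query server.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: element and attribute names are invented for this sketch.
PROFILE_XML = """
<profile id="biomarker-demo">
  <resource name="expression" type="database" location="site-a.example.org"/>
  <resource name="literature" type="flatfile" location="site-b.example.org"/>
</profile>
"""


def load_profile(xml_text):
    """Return a mapping from resource name to its type and location."""
    root = ET.fromstring(xml_text.strip())
    return {
        res.get("name"): {"type": res.get("type"), "location": res.get("location")}
        for res in root.findall("resource")
    }


resources = load_profile(PROFILE_XML)
for name, info in resources.items():
    print(name, info["type"], info["location"])
```

With definitions like these, the central server learns where each silo lives and what kind of source it is, while the data stays where it was.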

