Bioinformatics Databases:
Fundamental Concepts of Database Technology & Data Organization

Kristen Anton
Director of BioInformatics
Geisel School of Medicine at Dartmouth

January 22, 2013

BioInformatics @ Geisel School of Medicine at Dartmouth



How can data be organized?

Paper (e.g. in notebooks)
Flat files
  Collection of data records
  Minimal structure, no metadata
  Application program must contain relationship information
Database
  Hierarchical
  Network
  Relational


What is a relational database?

A database composed of relations and conforming to a set of principles governing how such relations are supposed to behave (Codd's 12 Rules). There are many database systems that use tables but don't conform to all of the principles. These are often called semirelational systems.

from Understanding SQL, Martin Gruber

Edgar Codd



Practically speaking...

A database is a body of information stored in two dimensions (rows and columns)
Rows are records
Columns are attributes of those record entities (usually!)
The groups of rows and columns, or tables, are largely independent of each other
The power of the database lies in the relationships that you construct among the tables
A database is self-describing: it contains metadata, which is a description of its own structure
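
The self-describing property can be observed in a small session. This is a minimal sketch using Python's built-in sqlite3 module; the `gene` table, its columns, and the toy values are assumptions made for illustration and do not come from the slides.

```python
import sqlite3

# In-memory database; table name, column names and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO gene (symbol, length) VALUES (?, ?)",
    [("geneA", 1200), ("geneB", 850)],
)

# Rows are records; columns are attributes of each record.
rows = conn.execute("SELECT symbol, length FROM gene ORDER BY gene_id").fetchall()

# Metadata: the database can be asked to describe its own structure.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
columns = [(r[1], r[2]) for r in conn.execute("PRAGMA table_info(gene)")]

print(rows)
print(tables)
print(columns)
```

Other database systems expose the same kind of information through their own catalog tables; `sqlite_master` and `PRAGMA table_info` are specific to SQLite.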






What is a Database Management System (DBMS)?

A set of programs which define, administer and process databases and their associated applications
A scalable DBMS can run on multiple platforms (varying sizes)
A DBMS that supports interoperability uses industry-standard language and standard ways of exchanging data -> open source

Examples: Oracle, Sybase, MySQL, MS Access, PostgreSQL, …


Features of a Relational Database

Rows (records) are in no particular order
Columns (fields) are ordered, numbered and named; names should indicate content of the field
Primary key uniquely identifies each row: ensures that no row is empty, and that every row is different from every other row
Two-step commit process
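
The primary key and commit behavior listed above can be sketched with Python's built-in sqlite3 module. The `sample` table and the helper `try_insert` are hypothetical names chosen for this example. The sketch shows an ordinary single-database transaction (commit on success, rollback on failure), which is simpler than a commit protocol coordinated across several systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOT NULL is spelled out because SQLite historically allows NULLs
# in non-INTEGER primary key columns.
conn.execute("CREATE TABLE sample (sample_id TEXT PRIMARY KEY NOT NULL, tissue TEXT)")
conn.execute("INSERT INTO sample VALUES ('S001', 'colon')")
conn.commit()


def try_insert(sample_id, tissue):
    """Insert one row in a transaction; roll back if a constraint fails."""
    try:
        conn.execute("INSERT INTO sample VALUES (?, ?)", (sample_id, tissue))
        conn.commit()      # make the change permanent
        return True
    except sqlite3.IntegrityError:
        conn.rollback()    # undo the failed transaction
        return False


duplicate_ok = try_insert("S001", "blood")   # same key as an existing row
missing_ok = try_insert(None, "blood")       # no key at all
new_ok = try_insert("S002", "blood")         # unique key
row_count = conn.execute("SELECT COUNT(*) FROM sample").fetchone()[0]
print(duplicate_ok, missing_ok, new_ok, row_count)
```

Only the row with a new, non-empty key is accepted, so the table ends with two rows.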


Features of a Relational Database

A view is a subset of the database that an application (or user) can process
The database schema is the structure of the entire database
A constraint is a condition you apply to an attribute of a table
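
The three terms above can be seen side by side in a short sketch. It uses Python's built-in sqlite3 module; the `patient` table, the `site_a_patients` view and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    site       TEXT NOT NULL,
    age        INTEGER CHECK (age >= 0)   -- constraint on one attribute
);
-- A view exposes only a subset of the database to an application or user.
CREATE VIEW site_a_patients AS
    SELECT patient_id, age FROM patient WHERE site = 'A';
""")
conn.executemany(
    "INSERT INTO patient (site, age) VALUES (?, ?)",
    [("A", 61), ("B", 47), ("A", 55)],
)

view_rows = conn.execute(
    "SELECT patient_id, age FROM site_a_patients ORDER BY patient_id"
).fetchall()

# The constraint rejects a value that violates the condition.
try:
    conn.execute("INSERT INTO patient (site, age) VALUES ('A', -5)")
    constraint_enforced = False
except sqlite3.IntegrityError:
    constraint_enforced = True

# The schema (structure of the entire database) is itself queryable.
schema_objects = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY name"
).fetchall()

print(view_rows, constraint_enforced, schema_objects)
```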


Relationships between tables

One-to-One, Many-to-One, Many-to-Many
A join is an operation that combines data from multiple tables into a single result table
An E-R (entity-relationship) diagram is the basic graphic used to describe the structure of a database

    SELECT Sequence.sname, KnownGenes.gname, KnownGenes.length
    FROM Sequence, KnownGenes
    WHERE KnownGenes.length = Sequence.length
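
The join on this slide can be run against a toy database to see what it returns. This is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the slide, while the rows and the added `ORDER BY` clause are assumptions made so the output is small and deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sequence   (sname TEXT, length INTEGER);
CREATE TABLE KnownGenes (gname TEXT, length INTEGER);
INSERT INTO Sequence   VALUES ('seq1', 1200), ('seq2', 850), ('seq3', 430);
INSERT INTO KnownGenes VALUES ('geneA', 850), ('geneB', 1200), ('geneC', 999);
""")

# Pair every sequence with every known gene of the same length.
query = """
SELECT Sequence.sname, KnownGenes.gname, KnownGenes.length
FROM Sequence, KnownGenes
WHERE KnownGenes.length = Sequence.length
ORDER BY Sequence.sname
"""
result = conn.execute(query).fetchall()
print(result)
```

Rows without a partner in the other table (`seq3`, `geneC`) do not appear in the result, which is the defining behavior of this kind of join.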


E-R Diagram


The universal language for communicating with (and within) relational databases: SQL

Structured Query Language (SQL)
A query is a question you ask the database, and SQL code operationalizes the question and retrieves the appropriate answer set
Interactive SQL (command line) vs. RAD tool/GUI
Standardization issue: ANSI (American National Standards Institute)


Data Types

Types of data indicate functions that are possible between related fields
Each field is assigned one data type (imposes structure on data)
Examples: text (CHAR, VARCHAR), number (INT, DEC); date, time, currency …
Standardization issue: ANSI (American National Standards Institute)
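
As a rough illustration of how a field's type determines which operations make sense, the sketch below uses Python's built-in sqlite3 module. The `specimen` table and its values are invented. SQLite itself is lenient about declared types, so an explicit `CHECK` on `typeof(...)` is added here as an assumption to imitate the stricter behavior of most other systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE specimen (
    label      VARCHAR(20),
    cell_count INT CHECK (typeof(cell_count) = 'integer'),
    collected  DATE
)
""")
conn.execute("INSERT INTO specimen VALUES ('vial-1', 5400, '2013-01-22')")
conn.execute("INSERT INTO specimen VALUES ('vial-2', 6100, '2013-01-23')")

# Numeric type: arithmetic and aggregation are meaningful.
total_cells = conn.execute("SELECT SUM(cell_count) FROM specimen").fetchone()[0]

# Text type: string functions are meaningful.
upper_labels = [r[0] for r in conn.execute(
    "SELECT UPPER(label) FROM specimen ORDER BY label")]

# A value of the wrong type is rejected rather than stored.
try:
    conn.execute("INSERT INTO specimen VALUES ('vial-3', 'many', '2013-01-24')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(total_cells, upper_labels, rejected)
```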


No data types? You get clinical/pathology!

Beyond Data Types: Ontology

An ontology is a representation of concepts within a domain and their relationships (the “big picture”)
Represents all the data within a domain
Ties sub-models into an overarching model
Information architecture/information model: critical to definition & organization of data
Common Data Elements (CDE): capture the critical attributes and a shared vocabulary of common terminology for representing an object
Ontology modeling tools (e.g. Protégé)


Supports conversion of unstructured intuitive knowledge into a formal specification that defines classes and their attributes and relationships

Common Data Element

“Cell count”
Software automatically knows how to: 1) Validate 2) Store 3) Display
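
One way to picture how a common data element lets software validate, store and display a value is the sketch below. The `CommonDataElement` class, the `record` helper, and the unit and range given for "Cell count" are all hypothetical; they are not taken from any CDE standard or from the system described in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonDataElement:
    """Hypothetical CDE definition: name, type, unit and permitted range."""
    name: str
    dtype: type
    unit: str
    minimum: float
    maximum: float

    def validate(self, value):
        """Return True if the value has the declared type and is in range."""
        return (
            isinstance(value, self.dtype)
            and not isinstance(value, bool)
            and self.minimum <= value <= self.maximum
        )

    def display(self, value):
        """Format a value with the element's name and unit."""
        return f"{self.name}: {value} {self.unit}"


# Illustrative definition; the unit and range are assumptions.
cell_count = CommonDataElement("Cell count", int, "cells/uL", 0, 1_000_000)

store = {}  # stand-in for a database table keyed by (sample ID, element name)


def record(sample_id, element, value):
    """Validate, then store; reject anything the definition does not allow."""
    if not element.validate(value):
        raise ValueError(f"invalid {element.name}: {value!r}")
    store[(sample_id, element.name)] = value


record("S001", cell_count, 5400)
shown = cell_count.display(store[("S001", "Cell count")])
print(shown)
```

Because the definition travels with the element, every application that handles "Cell count" applies the same checks and formatting.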


A word about database design:

Designing a database is not trivial
The value is not only in the data, but also in the structure and the metadata
Design to facilitate the retrieval and interpretation of the data



Design database for data extraction: think it through

How can you define your data elements and describe them effectively?
Relationships ease extraction and/or reporting of data from the system
Redundancy
Concept of attributes in rows instead of columns
Minimize free text
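
The idea of storing attributes in rows instead of columns can be sketched as follows, using Python's built-in sqlite3 module. The `observation` table and its contents are assumptions for illustration; this layout is often called entity-attribute-value, a name the slides do not use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per (sample, attribute) pair instead of one column per attribute.
CREATE TABLE observation (
    sample_id TEXT NOT NULL,
    attribute TEXT NOT NULL,
    value     TEXT NOT NULL,
    PRIMARY KEY (sample_id, attribute)
);
INSERT INTO observation VALUES
    ('S001', 'tissue', 'colon'),
    ('S001', 'cell_count', '5400'),
    ('S002', 'tissue', 'blood'),
    ('S002', 'stage', 'II');   -- a new attribute needs no schema change
""")

# Extraction: which samples have a given attribute value?
colon_samples = [r[0] for r in conn.execute(
    "SELECT sample_id FROM observation "
    "WHERE attribute = 'tissue' AND value = 'colon'")]

# Gather one sample's rows back into a record-like dict.
s002 = dict(conn.execute(
    "SELECT attribute, value FROM observation WHERE sample_id = 'S002'"))

print(colon_samples, s002)
```

The trade-off is that every value is stored as text here, so type checking has to come from metadata about each attribute rather than from the column definition.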



Enter the scene: good technology
Example: BioInformatics Core System

Reusable core modules, with customizable components
Standard business logic framework controls transactions (middle layer)
Metadata-based back-end data storage (facilitates data retrieval, sharing, integration)


BioInformatics Core Technology


Data Security: High Priority

HIPAA, FIPS 140-2, PGP encryption, IRB requirements …


Data Security: Is privacy obsolete?

Solid technology, secure procedures.


Life science has become a field which generates an enormous amount of un-integrated data.

How can methods for data organization help to solve this problem?


What is Data Integration?

Creating a system which allows the extraction of a piece or set of information (query result) across multiple domains (possibly disparate data sources: flat files, databases, spreadsheets, URLs...)
or
Pooling data to create power for detection of small signals


Sample integration problem: Cancer Biomarker Discovery

Clinical center collects blood samples from 1000 individuals with colon cancer
Expression analysis reveals that transcription factor protein x is over-expressed in these samples, relative to controls
Could protein x be a colon cancer biomarker?


Understanding transcription factors for protein x production

Show me all genes in the public literature that are putatively related to protein x, have more than 4-fold expression differential between affected and normal tissue, and are homologous to known transcription factors.

Q1: Find homologs (SEQUENCE)
Q2: Find genes with 4-fold differential (EXPRESSION)
Q3: Show me genes in public literature (LITERATURE)

(Q1 ∩ Q2 ∩ Q3)
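
The combined question asks for genes that satisfy all three conditions at once, which amounts to intersecting the three result sets. A minimal sketch follows; the gene identifiers are placeholders and not real results, and in practice each set would come from a separate query against the sequence, expression and literature sources.

```python
# Hypothetical result sets; gene identifiers are placeholders, not real data.
q1_homologs = {"geneA", "geneB", "geneC", "geneD"}   # from SEQUENCE
q2_differential = {"geneB", "geneC", "geneE"}        # from EXPRESSION
q3_literature = {"geneC", "geneB", "geneF"}          # from LITERATURE

# A gene qualifies only if it appears in all three result sets.
candidates = q1_homologs & q2_differential & q3_literature
print(sorted(candidates))
```

The hard part in practice is not the intersection itself but making sure all three sources identify the same gene by the same name, which is where shared terminology and metadata come in.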


Key components to integration

Accessing without modifying original data sources
Handling redundant, conflicting, missing, changing (versions) data
Normalizing analytical data from different data sources
Conforming terminology to industry (or agreed-upon) standard -> CDEs
Accessing integrated data as a single repository
Including metadata in repository


Approaches to Integration: where are the key issues addressed?

Federated database (poses constraints on original data sources; fragility in reliance on source systems)
Data warehousing (ETL layer, original data sources untouched, requires understanding of domain, sophisticated update/archive processes, high cost)
Integrating data source profiles
Indexed flat files
Others…


Data Warehousing


Metadata: one key to success

Describes data types, relationships, histories, etc.
System (supports developers), interface (supports users and application)

Data value: 55
Metadata values:
  Data element name: vehicle speed
  Unit: miles per hour
  Description: the average velocity of a vehicle
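
The vehicle speed example can be written down as a small data structure to show why the bare value 55 is not enough on its own. The `DataElement` class and the `to_km_per_hour` helper are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass
class DataElement:
    """A value together with the metadata needed to interpret it."""
    value: float
    name: str
    unit: str
    description: str

    def describe(self):
        return f"{self.name} = {self.value} {self.unit} ({self.description})"


speed = DataElement(
    value=55,
    name="vehicle speed",
    unit="miles per hour",
    description="the average velocity of a vehicle",
)


def to_km_per_hour(element):
    """Unit conversion is only possible because the unit is recorded."""
    if element.unit != "miles per hour":
        raise ValueError(f"cannot convert from {element.unit!r}")
    return element.value * 1.609344  # kilometres per mile


print(speed.describe())
print(to_km_per_hour(speed))
```

Without the unit, a program receiving 55 could not tell whether a conversion is needed before combining it with data from another source.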


Standards: creating order in the chaos

Naming conventions
Standard coordinate systems
Unify interpretations of single object types (ontology)
Unify software solutions to the same problem (also data formats)
Standards for metadata (incompatible or missing metadata)


Developing Standards for Life Sciences Research

Discovery science does not lend itself well to constraints (especially system constraints)
Decentralized data management infrastructure, competition
Wildly varying skill levels for data and information management

Several groups (Bio-Ontologies, HGNC, NCBI, etc.) and national research initiatives (EDRN, TCGI, etc.) are taking the lead in the effort to create workable standards.


New approach to integration: Cancer Biomarker Discovery

Create a network of distributed data silos (does not perturb data sources)
Centralize query and business logic servers, access information through web interface
Manage data extraction and integration through the web via middleware/business logic centrally …
And via profile locally (a set of resource definitions implemented in XML for data sources residing in one or more distributed systems)
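
To make the notion of a local profile more concrete, the sketch below parses a small XML document with Python's built-in xml.etree.ElementTree module. The XML layout, the element and attribute names, and the functions `load_profile` and `resources_with` are all invented for illustration; the slides do not show the actual profile format.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: element and attribute names are invented for this sketch.
PROFILE_XML = """
<profile site="site-a">
  <resource id="specimens" type="database">
    <element name="cell_count" unit="cells/uL"/>
    <element name="tissue"/>
  </resource>
  <resource id="assays" type="flatfile">
    <element name="expression_ratio"/>
  </resource>
</profile>
"""


def load_profile(xml_text):
    """Map each resource ID to its type and the data elements it exposes."""
    root = ET.fromstring(xml_text)
    return {
        res.get("id"): {
            "type": res.get("type"),
            "elements": [el.get("name") for el in res.findall("element")],
        }
        for res in root.findall("resource")
    }


def resources_with(profile, element_name):
    """Central query logic: find which local resources can answer a request."""
    return sorted(
        rid for rid, info in profile.items() if element_name in info["elements"]
    )


profile = load_profile(PROFILE_XML)
print(resources_with(profile, "tissue"))
```

In this arrangement the data stay where they are; only the description of what each site holds is shared with the central query service.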

