Bioinformatics Needs for the post-genomic era - Grid Computing


2 Oct 2013 (3 years and 8 months ago)

Bioinformatics Needs for the
post-genomic era

Dr. Erik Bongcam

The Linnaeus Centre for

From Egg to Adult in 3 x 10^9

A single cell, the fertilized egg, eventually
differentiates into the ~300 different types of cells
that make up an adult body.

With a few exceptions, all of these cells contain the
complete human genome, but express only a
subset of the genes.

Gene expression patterns are determined largely
by cell type, and vice versa.

The “body” has:

The genome

A comprehensive list of genes

Gene expression data

Protein localization in the cells

Information about protein-protein and
protein-DNA interactions.

Ways to store, display and query masses of
data so activity can focus on relevant bits.

Primary Flows of Information
and Substance in the Cell

Why a Grid?

The growth of molecular biology problems is
outpacing Moore's Law

Growing interest in Bioinformatics from
other disciplines

New experimental approaches (genomics,
proteomics, etc.) require new and more
demanding solutions

Comparative Genomics

Comparative genomics: comparison of
whole genomes (e.g. human and mouse)
and new techniques for phylogenetic


RNomics: tertiary structure prediction and
novel RNA gene location in whole genomes

We are conducting genome-wide scans for
RNA regulatory elements and RNA genes
using state-of-the-art comparative genomics

The analysis involves comparison of the
human and mouse genomes using tools such
as stochastic context-free grammars
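
To make the idea of a stochastic context-free grammar (SCFG) concrete, the following is a minimal sketch of the inside algorithm for a toy RNA grammar. It is not the tool used in the analysis above. The grammar (S -> x S y | x S | empty), the rule probabilities and the emission values are all assumptions chosen for illustration only.

```python
from functools import lru_cache

# Toy SCFG (all values are illustrative assumptions):
#   S -> x S y   with probability P_PAIR * emit_pair(x, y)   (base pair)
#   S -> x S     with probability P_LEFT * emit_single(x)    (unpaired base)
#   S -> empty   with probability P_END
P_PAIR, P_LEFT, P_END = 0.4, 0.4, 0.2
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def emit_single(base: str) -> float:
    """Uniform emission probability for an unpaired base."""
    return 0.25


def emit_pair(left: str, right: str) -> float:
    """Pair emission: 4 Watson-Crick pairs at 0.22, the other 12 at 0.01."""
    return 0.22 if (left, right) in WATSON_CRICK else 0.01


def inside_probability(seq: str) -> float:
    """Sum the probabilities of all parses of seq under the toy grammar."""

    @lru_cache(maxsize=None)
    def inside(i: int, j: int) -> float:
        # Probability that S generates the subsequence seq[i:j].
        if i == j:
            return P_END
        total = P_LEFT * emit_single(seq[i]) * inside(i + 1, j)
        if j - i >= 2:
            total += P_PAIR * emit_pair(seq[i], seq[j - 1]) * inside(i + 1, j - 1)
        return total

    return inside(0, len(seq))


if __name__ == "__main__":
    # A complementary sequence scores higher than a non-complementary one.
    print(inside_probability("GGGAAACCC"))
    print(inside_probability("GGGAAAGGG"))
```

A real genome-wide scan would use a much richer grammar (bifurcations, bulges, trained parameters) and work in log space; the recursion here is only suitable for short sequences.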

Molecular Interactions

Large-scale in silico maps of the molecular
interactions over entire proteomes and
genomes. These maps provide quantitative
functional models that bridge the biological
with the chemical.

We are developing models of gene participation in
biological processes. Such models are developed
from microarray-based gene expression data and
background knowledge, e.g. as provided by the
so-called Gene Ontology. The GRID Test Bed will
be an excellent computational environment for
finding molecular classifiers associated with
major diseases such as cancer, atherosclerosis
and other diseases that kill many
people in Europe.

What is needed?

Standard, stable interfaces to conceptual
problem solvers / data / objects

A distributed way to store and analyse

Security for user data

Avoiding duplication of implementation
and computation

Protein structure prediction

an example

There are over 1.3 million sequences in the
non-redundant protein database managed by the NCBI
and over 19 thousand structures in the Protein Data
Bank (PDB)

Using this data we have built a library of common
protein substructures linking structure and
sequence on a local level

Our library consists of over 4000 unique
substructures, each associated with between seven
and two thousand example sequence fragments

In order to extract properties that recognize
proteins containing particular substructures, we
iteratively test different (combinations of)
properties on proteins containing and proteins not
containing the substructure of interest.
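
The iterative testing of property combinations described above can be sketched as a greedy forward search. This is only an illustration under stated assumptions: the property names, the toy proteins, the balanced-accuracy score and the greedy strategy are all hypothetical and are not taken from the original method.

```python
from typing import FrozenSet, List, Set, Tuple

Protein = Set[str]  # toy representation: the set of properties a protein has


def score(combo: FrozenSet[str],
          positives: List[Protein],
          negatives: List[Protein]) -> float:
    """Balanced accuracy of the rule 'protein has every property in combo'."""
    true_pos = sum(combo <= p for p in positives)
    true_neg = sum(not (combo <= n) for n in negatives)
    return 0.5 * (true_pos / len(positives) + true_neg / len(negatives))


def greedy_select(properties: Set[str],
                  positives: List[Protein],
                  negatives: List[Protein],
                  max_size: int = 3) -> Tuple[FrozenSet[str], float]:
    """Add one property at a time while the score keeps improving."""
    chosen: FrozenSet[str] = frozenset()
    best = score(chosen, positives, negatives)
    while len(chosen) < max_size:
        candidates = [(score(chosen | {q}, positives, negatives), q)
                      for q in sorted(properties - chosen)]
        if not candidates:
            break
        top_score, top_prop = max(candidates)
        if top_score <= best:
            break  # no remaining property improves the separation
        chosen, best = chosen | {top_prop}, top_score
    return chosen, best


# Hypothetical toy data: proteins with / without the substructure of interest.
positives = [{"helix_start", "hydrophobic_core"},
             {"helix_start", "hydrophobic_core", "gly_rich"},
             {"helix_start", "hydrophobic_core"}]
negatives = [{"helix_start"}, {"gly_rich"}, {"hydrophobic_core", "gly_rich"}]
properties = {"helix_start", "hydrophobic_core", "gly_rich"}

chosen, best = greedy_select(properties, positives, negatives)

if __name__ == "__main__":
    print(sorted(chosen), best)
```

Each call to `score` is independent of the others, which is what makes this kind of search easy to spread across many processors.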

Calculating properties for all groups takes one
week on ten Athlon XP 1700+ (1.46 GHz, 1 GB
RAM) processors

In a more realistic search space, without the
drastic search space reductions, we estimate that
we need approximately 700 processor days with 2 GB
RAM. Depending on the available resources,
we would like to run several such trials in order to
test different parameter settings. Thus our upper
estimates may be multiplied by a factor of 5
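
The figures above can be put side by side with a short calculation. This is a rough sketch that assumes perfect parallel scaling; the pool size of 100 processors is a hypothetical value, not one given in the slides.

```python
def processor_days(wall_clock_days: float, processors: int) -> float:
    """Total compute consumed by a run, in processor days."""
    return wall_clock_days * processors


def wall_clock_days(total_processor_days: float, processors: int) -> float:
    """Elapsed days needed, assuming perfect parallel scaling."""
    return total_processor_days / processors


# Reduced search space: one week on ten processors.
reduced_run = processor_days(7, 10)

# Realistic search space: about 700 processor days, times 5 trials.
realistic_run = 700
upper_estimate = realistic_run * 5

# Hypothetical grid pool size (an assumption for illustration).
pool_size = 100

if __name__ == "__main__":
    print(reduced_run)                                 # 70
    print(upper_estimate)                              # 3500
    print(wall_clock_days(upper_estimate, pool_size))  # 35.0
```

Under these assumptions the realistic search costs ten times the reduced one per trial, and the full set of five trials costs fifty times as much.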