Bioinformatics Needs for the post-genomic era - Grid Computing

Oct 2, 2013

Bioinformatics Needs for the post-genomic era

Dr. Erik Bongcam-Rudloff

The Linnaeus Centre for Bioinformatics

From Egg to Adult in 3x10^9 Bases


A single cell, the fertilized egg, eventually
differentiates into the ~300 different types of cells
that make up an adult body.


With a few exceptions, all of these cells contain the
complete human genome but express only a
subset of the genes.


Gene expression patterns are determined largely
by cell type, and vice versa.




The “body” has:


The genome


A comprehensive list of genes


Gene expression data


Protein localization in the cells


Information about protein/protein and
protein/DNA interactions


Ways to store, display and query masses of
data so activity can focus on relevant bits.

Primary Flows of Information
and Substance in the Cell

Why a Grid?


The growth of molecular biology problems is
outpacing Moore's Law


Growing interest in Bioinformatics from
other disciplines


New experimental approaches (genomics,
proteomics, etc.) require new and more
demanding solutions

Comparative Genomics


Comparative genomics: comparison of
whole genomes (e.g. human and mouse)
and new techniques for phylogenetic
footprinting.

RNomics


RNomics: tertiary structure prediction and
location of novel RNA genes in whole genomes


We are conducting genome-wide scans for
RNA regulatory elements and RNA genes
using state-of-the-art comparative genomics
tools.



The analysis involves comparison of the
human and mouse genomes using tools such
as stochastic context-free grammars.
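As a rough illustration of how a stochastic context-free grammar can score RNA secondary structure, the sketch below finds the most probable parse of a short sequence under a toy grammar. The grammar, its rule probabilities and the function name viterbi_parse are assumptions made for this example only; the slides do not say which grammars or tools were used, and real analyses use trained parameters and aligned genomes.

```python
import math

# Toy grammar (hypothetical, not taken from the slides):
#   S -> x S y   for the 4 complementary pairs (x, y), probability 0.1 each
#   S -> x S     for each base x, probability 0.05 each
#   S -> S x     for each base x, probability 0.05 each
#   S -> (empty) probability 0.2
# Rule probabilities sum to 4*0.1 + 4*0.05 + 4*0.05 + 0.2 = 1.
PAIRS = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")}
P_PAIR, P_LEFT, P_RIGHT, P_END = 0.1, 0.05, 0.05, 0.2


def viterbi_parse(seq):
    """Return (log-probability, dot-bracket structure) of the best parse."""
    n = len(seq)
    # best[i][j]: log-probability of the best derivation of seq[i:j] from S
    best = [[-math.inf] * (n + 1) for _ in range(n + 1)]
    back = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        best[i][i] = math.log(P_END)
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            options = [
                (math.log(P_LEFT) + best[i + 1][j], "left"),
                (math.log(P_RIGHT) + best[i][j - 1], "right"),
            ]
            if length >= 2 and (seq[i], seq[j - 1]) in PAIRS:
                options.append((math.log(P_PAIR) + best[i + 1][j - 1], "pair"))
            best[i][j], back[i][j] = max(options)
    # Trace back the chosen rules to recover the base pairs.
    structure = ["."] * n
    i, j = 0, n
    while i < j:
        move = back[i][j]
        if move == "pair":
            structure[i], structure[j - 1] = "(", ")"
            i, j = i + 1, j - 1
        elif move == "left":
            i += 1
        else:
            j -= 1
    return best[0][n], "".join(structure)


log_prob, structure = viterbi_parse("gggaaaccc")
print(structure, math.exp(log_prob))
```

The dynamic programme fills a table over all subsequences, so its cost grows with the square of the sequence length even for this grammar without branching. That growth is one reason genome-wide scans of this kind are computationally demanding.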

Molecular Interactions


Large-scale in silico maps of the molecular
interactions over entire proteomes and
genomes. These maps provide quantitative
functional models that bridge the biological
with the chemical.


We are developing models of gene participation in
biological processes. Such models are developed
from microarray-based gene expression data and
background knowledge, e.g. as provided by the
so-called Gene Ontology. The GRID Test Bed will
be an excellent computational environment for
finding molecular classifiers associated with
major diseases such as cancer, atherosclerosis
and other diseases that kill many people in
Europe.
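To make the idea of a molecular classifier concrete, here is a minimal sketch that assigns an expression profile to the class with the nearest centroid. The method, the class labels and the expression values are all assumptions chosen for illustration; the slides do not specify the modelling technique, and this sketch does not use Gene Ontology annotations.

```python
import math


def centroid(profiles):
    """Mean expression value per gene across a list of profiles."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]


def nearest_centroid(sample, centroids):
    """Label of the centroid closest to the sample (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))


# Hypothetical training data: three genes measured in two samples per class.
training = {
    "tumour": [[2.0, 0.5, 1.0], [2.4, 0.3, 1.2]],
    "normal": [[0.4, 1.8, 1.0], [0.6, 2.2, 0.8]],
}
centroids = {label: centroid(profiles) for label, profiles in training.items()}

predicted = nearest_centroid([2.1, 0.6, 1.0], centroids)
print(predicted)
```

Real microarray data sets have thousands of genes per sample, so selecting and validating informative genes dominates the cost rather than the classification step itself.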

What is needed?


Standard, stable interfaces to conceptual
problem solvers / data / objects


A distributed way to store and analyse
information


Security for user data


Avoiding duplication of implementation
and computation

Protein structure prediction: an example


There are over 1.3 million sequences in the
non-redundant protein database managed by the NCBI
and over 19 thousand structures in the Protein Data
Bank (PDB).


Using this data we have built a library of common
protein substructures linking structure and
sequence on a local level


Our library consists of over 4000 unique
substructures, each associated with between seven
and two thousand example sequence fragments.


In order to extract properties that recognize
proteins containing particular substructures, we
iteratively test different (combinations of)
properties on proteins containing and proteins not
containing the substructure of interest.
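One simple way to picture this search is an exhaustive test of small property combinations, each scored by how well it separates proteins with the substructure from those without. The sketch below assumes boolean properties and a plain accuracy score; the property names, the data and the scoring rule are hypothetical, and the slides do not describe the actual procedure.

```python
from itertools import combinations


def rule_accuracy(props, positives, negatives):
    """Accuracy of the rule 'has the substructure iff it has every property'."""
    hits = sum(all(p[name] for name in props) for p in positives)
    rejects = sum(not all(n[name] for name in props) for n in negatives)
    return (hits + rejects) / (len(positives) + len(negatives))


def best_property_combination(names, positives, negatives, max_size=2):
    """Try every combination of up to max_size properties, keep the best."""
    best, best_score = (), 0.0
    for size in range(1, max_size + 1):
        for combo in combinations(names, size):
            score = rule_accuracy(combo, positives, negatives)
            if score > best_score:
                best, best_score = combo, score
    return best, best_score


# Hypothetical property names and toy data.
names = ["hydrophobic_core", "glycine_rich", "charged_ends"]
positives = [  # proteins containing the substructure of interest
    {"hydrophobic_core": True, "glycine_rich": True, "charged_ends": False},
    {"hydrophobic_core": True, "glycine_rich": True, "charged_ends": True},
]
negatives = [  # proteins not containing it
    {"hydrophobic_core": True, "glycine_rich": False, "charged_ends": True},
    {"hydrophobic_core": False, "glycine_rich": True, "charged_ends": False},
]

best, score = best_property_combination(names, positives, negatives)
print(best, score)
```

The number of combinations grows quickly with the number of properties and the combination size, and the test must be repeated for each substructure in the library.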


Calculating properties for all groups takes one
week on ten Athlon XP 1700+ (1.46 GHz, 1 GB
RAM) processors.


In a more realistic search space, without the
drastic search space reductions, we estimate that
we need approximately 700 processor days with 2 GB
RAM. Depending on the available resources,
we would like to run several such trials in order to
test different parameter settings. Thus our upper
estimates may be multiplied by a factor of 5 to 10.
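The estimates above can be put on a common scale of processor days. The sketch below only does arithmetic on the quoted figures; the 100-processor grid in the last step is a hypothetical size, and the wall-clock conversion assumes perfect parallel scaling.

```python
def processor_days(wall_days, processors):
    """Total compute, expressed in processor days."""
    return wall_days * processors


def wall_clock_days(total_processor_days, processors):
    """Idealised elapsed time, assuming perfect parallel scaling."""
    return total_processor_days / processors


# One week on ten processors, with the reduced search space.
reduced_run = processor_days(wall_days=7.0, processors=10)

# Estimate quoted for the more realistic search space.
realistic_run = 700.0

# Several trials with different parameter settings: factor of 5 to 10.
low, high = realistic_run * 5, realistic_run * 10

print(reduced_run, realistic_run / reduced_run)
print(low, high)
print(wall_clock_days(high, processors=100))  # hypothetical 100-processor grid
```

On these figures the realistic search needs ten times the compute of the reduced run, and the full set of trials needs 3500 to 7000 processor days.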