Data-intensive Computing: Case Study Area 1: Bioinformatics

powerfultennesseeΒιοτεχνολογία

2 Οκτ 2013 (πριν από 4 χρόνια και 12 μέρες)

80 εμφανίσεις

Data
-
intensive Computing: Case
Study Area 1: Bioinformatics

B. Ramamurthy

10/3/2013

1

Human Genetics


Genomics


Human Genome project


Proteomics


Diseasome


Tree of life project


Phylogenetics



10/3/2013

2

Human cell


Base pair of DNA: CG, AT


C


cytosine, G


guanine, A


adenine , T
-

thymine


Each human cell contains approximately 3 billion base pairs.


The DNA of a single cell contains so much information that if it were represented
in printed words, simply listing the first letter of each base would require over 1.5
million pages of text!


If laid end
-
to
-
end, the DNA strand measures about 2


3 meters.


DNA is a single large molecule at the nucleus of cell


It is coiled a double helix


Each strand of the DNA molecule is made of A, C, G and T: example:
AAAGTTCTTAATTA that will be matched on the other strand by the matching base:
TTTCAAGAATTAAT


These string of alphabets contain all the codes needed for the human functions


Ref text: Bioinformatics: Databases, tools and algorithms, by. O. Bosu and S.K.
Thukral

10/3/2013

3

More details


Sequence of base pairs are grouped to make sense: genes


When a gene inside needs to be activated, the DNA
molecule at the cell nucleus uncoils and unfurls to the right
extent to expose that gene


From the exposed ends of the DNA a RNA is formed.


mRNA or messenger RNA is formed that carries with it the
“print” of the open DNA section


RNA and DNA differ in one respect: RNA does not contain T
or thymine but it has uracil (U). RNA is short
-
lived


Once mRNA is formed open sections of the DNA close off.

10/3/2013

4

Protein formation


mRNA travels to the cytoplasm where it meets the ribosome (rRNA)


Ribosome reads the code in the mRNA (codon) and form the amino
acids.


Twenty amino acids are prevalent in human cells. Ex: codon GCU
GCC GCA correspond to alanine


In effect ribosome is a process control computer that takes in as
input codons and produces amino acids as output.


Amino acids polymerize and form polypeptide chains called
proteins


Proteins fold and form the basic structures such as skin and hair.


Even though brain controls major human functions at the cell level
it is the DNA that has the command and control.


DNA is fixed code for a given human. (WORM characteristics)



10/3/2013

5

Life’s processes


DNA is “program” that controls functions,
operations and structure of a cell and in turn
that of our life processes.


Life processes are in fact dependent of the
program in a DNA and the hundreds of
millions of ribosomes.


Life in this context appears as an immense
distributed system.

10/3/2013

6

Bioinformatics


Can we study, understand and analyze the complexity of the
immensely complex system? It structure and programs?


University of Arizona’s tree of life project (
ToL
):
http://tolweb.org


Human Genome project (NIH and DOE): collecting approximately
30,000 genes in human DNA and determining the sequences three
billion bases that make up the human DNA.


Out of the 30000 genes we do not know the functions of more than
50% of them.


99.9% of the nucleotide sequence is same for all of us


0.1% is attributed to individual differences such as race, color of
skin, disposition to diseases


High throughput sequencing is generating ultra scale biological
data: how to analyze this data?


That is a data
-
intensive problem
.

10/3/2013

7

Existing solutions?


Traditional databases: store, retrieve, analyze
and/or predict huge biological data


Software tools for implementing algorithms,
and developing applications for in
-
silico

experiments


Visualization tools, user interfaces, web
accessibility for search through data


Machine learning and data mining
methodologies.

10/3/2013

8

Databases


Taxonomy DB


Genomics


Sequence db


Structure db


Proteomic database (PDB)


Micro
-
array db


Expression db


Enzyme db


Disease db


Molecular biology db

10/3/2013

9

Tools


Data analysis tools


MySQL


Perl


Prediction tools


Clustering


Modeling tools


Surface prediction, predicting area of interest,
protein
-
protein interaction


Alignment tools



10/3/2013

10

How can we help?


How can we leverage our knowledge of large
scale data management to address
bioinformatics problems? DC methods.


Large number of tools and data: how we
standardize the efforts so that they are
complementary or repetitive? Cloud
computing.

10/3/2013

11

10/3/2013

12

Text Mining vs Genetic Sequence
Mining (Dot plot)



C

O

R

R

E

L

A

T

I

O

N

S

R

























E

























L

























A

























T

























I

























O

























N

























S

























H

























I

























P



























A

C

T

C

T

A

G

G

A

G

T

C

G

























A

























T

























A

























A

























T

























T

























C

























G

























A

























T

























C