Principles of Bioinformatics


Oct 2, 2013

YM-Biochem


Information Biology vs Bioinformatics

Information biology (infobiology): solving biological problems using bioinformatic tools

Bioinformatics: tool development


A biological system has its own logic

Glucose oxidation in a test tube

$\mathrm{C_6H_{12}O_6 + 6\,O_2 \rightarrow 6\,CO_2 + 6\,H_2O}$

Glucose oxidation in a cell

$\mathrm{C_6H_{12}O_6 + 6\,H_2O + 6\,O_2 \rightarrow 6\,CO_2 + 12\,H_2O}$


The first hypothesis

Most evolutionary change at the molecular level is driven by random drift rather than natural selection
- Kimura’s neutral hypothesis


Approx. date of common ancestor can be estimated from the fossil record

                              Millions of years ago
Precambrian Time
  Archean Era                 4600-2500
  Proterozoic Era             2500-570
Phanerozoic Time
  Paleozoic Era
    Cambrian Period           570-505
    Ordovician Period         505-438
    Silurian Period           438-408
    Devonian Period           408-360
    Carboniferous Period      360-286
    Permian Period            286-245
  Mesozoic Era
    Triassic Period           245-208
    Jurassic Period           208-144
    Cretaceous Period         144-66.4
  Cenozoic Era
    Tertiary Period
      Paleocene Epoch         66.4-57.8
      Eocene Epoch            57.8-38.6
      Oligocene Epoch         38.6-23.7
      Miocene Epoch           23.7-5.3
      Pliocene Epoch          5.3-1.6
    Quaternary Period
      Pleistocene Epoch       1.6-0.01
      Holocene Epoch          0.01-0


Rates of evolution in eight proteins

Protein              Rate (# of changes in 10^9 years)
Fibrinopeptides      8.3
Pancreatic RNase     2.1
Lysozyme             2.0
α-Globin             1.2
Myoglobin            0.89
Insulin              0.44
Cytochrome c         0.3
Histone H4           0.01

* Ridley (1993) Evolution. Blackwell Scientific Publications, Table 7.1 (p.142)
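
A minimal sketch of how rates like these can serve as a molecular clock. It assumes, which the slide does not state, that each rate is the number of amino acid changes per site per 10^9 years and that two lineages accumulate changes independently after they split, so the expected distance between them is roughly d = 2 * r * t. The function name is hypothetical.

```python
# Rates copied from the table above (changes per 10^9 years).
RATES = {
    "Fibrinopeptides": 8.3,
    "Pancreatic RNase": 2.1,
    "Lysozyme": 2.0,
    "alpha-Globin": 1.2,
    "Myoglobin": 0.89,
    "Insulin": 0.44,
    "Cytochrome c": 0.3,
    "Histone H4": 0.01,
}


def divergence_time_years(distance_per_site, rate_per_1e9_years):
    """Estimate the time since two lineages split.

    Assumption: distance = 2 * rate * time, because both lineages
    accumulate changes independently. Multiple hits at the same site
    are ignored, so this only makes sense for small distances.
    """
    if rate_per_1e9_years <= 0:
        raise ValueError("rate must be positive")
    return distance_per_site / (2.0 * rate_per_1e9_years) * 1e9


if __name__ == "__main__":
    # Hypothetical observation: 0.06 changes per site between two species.
    for protein in ("Fibrinopeptides", "Cytochrome c"):
        t = divergence_time_years(0.06, RATES[protein])
        print(f"{protein}: about {t:.3g} years")
```

The same observed distance implies a much older split for a slowly evolving protein than for a fast one, which is why the choice of protein matters when dating a common ancestor.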


Amount of variation in natural populations

Species              # of loci    Polymorphic loci
Phlox cuspidata      16           11 %
Drosophila robusta   40           39 %
Bufo americanus      14           26 %
Homo sapiens         71           28 %

* Modified from Ridley (1993) Evolution. Blackwell Scientific Publications, Table 7.2 (p.145)


Diversity at the Sequence Level

Alkaptonuria

Continuous traits

Mutation

Polymorphism

Gene

Chromosomes

in population


The advantage of having diversity?

Don’t put all the eggs in the same basket

Problem with inbreeding

The combinatory strategy


PAM

Accepted point mutation


Sequence similarity and
evolutionary distance


PAM250 matrix

http://www.cmbi.kun.nl/bioinf/tools/pam.shtml

Steps to construct PAM1 matrix:

1. Collect statistics
2. Convert to frequency
3. Calculate likelihood ratio
4. Take log value of the likelihood ratio

$\mathrm{PAM250} = (\mathrm{PAM1})^{250}$
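
The four steps and the 250th power can be sketched on a toy three-letter alphabet. This is a simplified illustration, not Dayhoff's exact procedure: the counts and frequencies are made up, the function names are hypothetical, and the scaling to "1% of positions changed" is done in one step from the raw counts.

```python
import math

# Toy data (hypothetical): symmetric substitution counts and
# background frequencies for a three-letter alphabet.
COUNTS = [[0, 6, 2],
          [6, 0, 4],
          [2, 4, 0]]
FREQS = [0.5, 0.3, 0.2]


def pam1_from_counts(counts, freqs):
    """Steps 1-2: turn counts into a mutation probability matrix.

    Row i holds P(residue i is replaced by residue j), scaled so that
    on average 1% of positions change (one PAM unit).
    """
    n = len(freqs)
    raw = [[counts[i][j] / freqs[i] if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    changed = sum(freqs[i] * sum(raw[i]) for i in range(n))
    scale = 0.01 / changed
    pam1 = [[raw[i][j] * scale for j in range(n)] for i in range(n)]
    for i in range(n):
        pam1[i][i] = 1.0 - sum(pam1[i])  # probability of no change
    return pam1


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to an integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


def log_odds(matrix, freqs):
    """Steps 3-4: likelihood ratio against chance, then 10 * log10."""
    n = len(freqs)
    return [[10.0 * math.log10(matrix[i][j] / freqs[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    pam250 = mat_pow(pam1_from_counts(COUNTS, FREQS), 250)
    for row in log_odds(pam250, FREQS):
        print([round(score, 1) for score in row])
```

Multiplying PAM1 by itself 250 times extrapolates from closely related sequences to distant ones, which is why the scores are computed only after the power is taken.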


The second hypothesis

Re-use the components


Comparison of photosynthesis and pentose phosphate cycle

[Figure: carbon rearrangements in the two pathways. Sugars are labelled by carbon number (1, 3, 6, 5 + 5, 3 + 7, 4 + 6); the reactions are catalysed by transketolase and transaldolase.]


From biochemistry to applied microbiology

Microorganisms that can
  use petroleum as carbon source
  use mineral as energy source
  . . .


Static image of life


The next challenge: from components to circuitry

Explanation: Below is the full data for every gene in yeast. Data between
timepoints have been normalized with respect to each other. There are
some bacterial genes on each of the 4 chips. We recommend that this
document be downloaded and opened in Excel. The 17 data after each
gene are the normalized fluorescence between 0 and 160 minutes after cell
cycle reinitiation from START. Have fun.

Gene Name zero ten twenty thirty fourty fifty sixty seventy eighty ninety hundred one-ten one-twenty one-thirty one-fourty one-fifty one-sixty

18srRnaa 22 38 41 43 23 29 25 20 17 98 46 27 23 38 27 28 287
18srRnab 5 9 -13 -9 -14 -13 -11 -18 -1 -18 9 -8 -15 -6 -19 -35 150
18srRnac 3 -2 13 5 6 5 -3 -1 -6 37 8 -3 -3 7 7 0 182
. . .

(data from http://genomics.stanford.edu/yeast/full_data.html)

Picture taken from http://genomics.stanford.edu/yeast/additional_figures_link.html

Picture taken from http://www.ibms.sinica.edu.tw/~peck/chinese/marray_4.htm


Cluster Analysis


Products of genome projects

DNA

RNA

transcription

translation

Protein

Genomic seq

EST

SAGE

gene chips

Protein expression, modification, interaction, etc.

Annotation


The third hypothesis

Independent folding motifs (IFMs) are
functional units


The evolution of functional motifs?

Gene

duplication

Variation

(mutation)

Gene

duplication

Recombination

+


Modifying the pocket size of active
site makes a new enzyme


Method to introduce a keto group to an aliphatic chain

[Figure: TCA cycle. acetyl CoA + OAA → citrate → isocitrate → α-ketoglutarate → succinyl CoA → succinate → fumarate → malate → OAA.
Step labels: release of CO2 (-2H, -CO2, twice); formation of carrier (CoA in, CoA + GTP out); dehydrogenation (-2H), hydration (H2O), dehydrogenation (-2H).]


Oxidation of fatty acid

RCH2-CH2-CH2-C(=O)-S-CoA
    | -2H
RCH2-CH=CH-C(=O)-S-CoA
    | +H2O
RCH2-CH(OH)-CH2-C(=O)-S-CoA
    | -2H
RCH2-C(=O)-CH2-C(=O)-S-CoA

RCH2CH2CH2CH2CH2CH2-C(=O)-S-CoA
    → RCH2CH2CH2CH2-C(=O)-S-CoA + Acetyl CoA
    → RCH2CH2-C(=O)-S-CoA + Acetyl CoA


Concept of protein family


The Fourth hypothesis

The combinatory strategy


The number of genes of an organism


Human*: 30,000 ~ 35,000


Thale cress: 26,000 (plant)


Worm: 18,000


Drosophila: 13,000


Yeast: 6,000


Tuberculosis microbe: 4,000

Data from http://www.sanger.ac.uk/HGP/publication2001/facts.shtml

Human data is from http://www.nature.com/genomics/papers/human.html


Combination at DNA level: Generation
of Antibody Diversity

* Picture taken from http://www.cvm.tamu.edu/vtmi409502/


Combination at RNA level: Putative
alternative splicing site (PASS) db

Ubiquinone dehydrogenase


Chip target design needs to distinguish
different splicing forms

Splice form 2

Splice form 1

Sum of splice forms 1 & 2


The Fifth hypothesis

Sequence implies structure implies function.
- Murray-Rust, 1994


Comparative
anatomy

Comparative

structural biology


Sequence alignment

Use PAM/BLOSUM matrices to score the matches

Use gap insertion and extension penalties to optimize the alignment

=> measure “similarity”
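
A minimal sketch of scoring a global alignment with separate gap insertion (opening) and extension penalties. The three-state recurrence is the standard affine-gap dynamic programme; the match/mismatch function below is a hypothetical stand-in for a PAM or BLOSUM lookup, and the penalty values are arbitrary.

```python
NEG_INF = float("-inf")


def simple_score(x, y):
    """Hypothetical stand-in for a PAM/BLOSUM matrix lookup."""
    return 2 if x == y else -1


def align_score(a, b, score=simple_score, gap_open=3, gap_extend=1):
    """Best global alignment score of a and b with affine gaps.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    m[i][j]: best score with a[i-1] aligned to b[j-1]
    x[i][j]: best score with a[i-1] aligned to a gap
    y[i][j]: best score with b[j-1] aligned to a gap
    """
    n, k = len(a), len(b)
    m = [[NEG_INF] * (k + 1) for _ in range(n + 1)]
    x = [[NEG_INF] * (k + 1) for _ in range(n + 1)]
    y = [[NEG_INF] * (k + 1) for _ in range(n + 1)]
    m[0][0] = 0
    for i in range(1, n + 1):
        x[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, k + 1):
        y[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            s = score(a[i - 1], b[j - 1])
            m[i][j] = max(m[i - 1][j - 1], x[i - 1][j - 1], y[i - 1][j - 1]) + s
            x[i][j] = max(m[i - 1][j] - gap_open,
                          x[i - 1][j] - gap_extend,
                          y[i - 1][j] - gap_open)
            y[i][j] = max(m[i][j - 1] - gap_open,
                          y[i][j - 1] - gap_extend,
                          x[i][j - 1] - gap_open)
    return max(m[n][k], x[n][k], y[n][k])


if __name__ == "__main__":
    print(align_score("AAGGGAA", "AAAA"))
```

Charging less for extending a gap than for opening one favours a single long gap over several scattered ones. The resulting score is the "similarity" measure.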


Similarity is not equal to homology


The Nature of Genome Analysis

Size

Quantity

Repetitious


The Need for Automatic Analysis

-rw-r--r-- 1 c00liu00 root   29614 Feb 6 08:53 HSU15422
-rw-r--r-- 1 c00liu00 root   15828 Feb 6 15:11 HSU15422.blx.ace
-rw-r--r-- 1 c00liu00 root 1645755 Feb 6 09:37 HSU15422.est.bl
-rw-r--r-- 1 c00liu00 root  792494 Feb 6 09:37 HSU15422.est.bln.ace
-rw-r--r-- 1 c00liu00 root     385 Feb 6 09:18 HSU15422.genie.ace
-rw-r--r-- 1 c00liu00 root     233 Feb 6 09:19 HSU15422.genscan.ace
-rw-r--r-- 1 c00liu00 root      61 Feb 6 08:53 HSU15422.gf.ace
-rw-r--r-- 1 c00liu00 root 3387868 Feb 6 15:10 HSU15422.gp.bl
-rw-r--r-- 1 c00liu00 root    7320 Feb 6 15:11 HSU15422.gp.blq
-rw-r--r-- 1 c00liu00 root     446 Feb 6 08:53 HSU15422.grail.ace
-rw-r--r-- 1 c00liu00 root   29518 Feb 6 09:19 HSU15422.masked
-rw-r--r-- 1 c00liu00 root    8479 Feb 6 08:53 HSU15422.orf.ace
-rw-r--r-- 1 c00liu00 root    2656 Feb 6 15:21 HSU15422.prom.ace
-rw-r--r-- 1 c00liu00 root   49689 Feb 6 09:19 HSU15422.repeats.ace
-rw-r--r-- 1 c00liu00 root  116902 Feb 6 09:19 HSU15422.rpt.bln
-rw-r--r-- 1 c00liu00 root    7720 Feb 6 15:23 HSU15422.splice.ace
-rw-r--r-- 1 c00liu00 root   83920 Feb 6 15:23 HSU15422.startstop.ace



The Need for Value-Added Databases

Information Quality

Information Contents


Lots of Functionally Unknown
Sequences in GenBank


Expressed Sequence Tag (EST)

Partial cDNA sequences of genes expressed in different tissues

Tissues → mRNA → cDNA → 5' partial sequencing / 3' partial sequencing → EST


Gold Mining in A Field Full of
Fool’s Gold

BLASTN 2.0.8 [Jan-05-1999]

Query= gi|3958354|gb|AI298618|AI298618 qm96b01.x1 NCI_CGAP_Lu5
       Homo sapiens cDNA clone IMAGE:1896553 3' similar to TR:Q13538 Q13538
       ORF2: FUNCTION UNKNOWN. ;, mRNA sequence [Homo sapiens] (419 letters)

Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
          402,852 sequences; 969,381,864 total letters

                                                                    Score    E
Sequences producing significant alignments:                         (bits)  Value

gb|U49973|HSU49973 Human Tigger1 transposable element, complete...    168  5e-40
emb|AL021408|HS523C21 Homo sapiens DNA sequence from PAC 523C21...    155  8e-36
gb|AC002287|HUAC002287 Homo sapiens Chromosome 16 BAC clone CIT...    155  8e-36
gb|AC005622|AC005622 Homo sapiens chromosome 19, cosmid R30953,...    153  3e-35
gb|AC004736|AC004736 Human Chromosome 11p14.3 PAC clone pDJ1082...    151  1e-34
gb|AC002553|AC002553 Homo sapiens chromosome 17, clone hCIT529I...    151  1e-34
gb|AC003667|AC003667 Homo sapiens Xp22 PAC RPCI1-17L20 (Rosewel...    147  2e-33
gb|AC004800|AC004800 Homo sapiens chromosome 7 clone UWGC:g3586...    145  7e-33
gb|AC005826|AC005826 Homo sapiens clone UWGC:rg041a03 from 7p14...    145  7e-33
gb|AF047825|AF047825 Homo sapiens PAC 50H2 in the CUTL1 locus, ...    139  5e-31

. . . . .


Why is there an incorrect assignment?

gb|U49973|HSU49973 Human Tigger1 transposable element, complete consensus sequence
Length = 2418

Score = 168 bits (85), Expect = 5e-40
Identities = 131/145 (90%), Gaps = 1/145 (0%)
Strand = Plus / Plus

Query: 61   gtggaaacaggaagagaactagaactggaagtaaagcattgaagatgtgactgaattgct 120
            ||||||| || ||||||||||||| | |||||  ||| | ||||||||||||||||||||
Sbjct: 1709 gtggaaatagcaagagaactagaattagaagtggagcct-gaagatgtgactgaattgct 1767

Query: 121  gcaatctcatgatcaaatttgaatggatgaggagttgctttttagggatgagcaaagaaa 180
            ||||||||||||| ||| ||||| |||||||||||||||| ||| |||||||||||||||
Sbjct: 1768 gcaatctcatgataaaacttgaacggatgaggagttgcttcttatggatgagcaaagaaa 1827

Query: 181  gtggtttctcgagatggaatctact 205
            ||||||||| |||||||||||||||
Sbjct: 1828 gtggtttcttgagatggaatctact 1852
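
As a small check on the "Identities" line, the sketch below counts identical columns in the three alignment blocks shown on this slide. The function name is hypothetical; the sequences are copied from the alignment, with the single gap in the subject written as "-".

```python
# Aligned blocks copied from the BLAST alignment on this slide.
QUERY_BLOCKS = [
    "gtggaaacaggaagagaactagaactggaagtaaagcattgaagatgtgactgaattgct",
    "gcaatctcatgatcaaatttgaatggatgaggagttgctttttagggatgagcaaagaaa",
    "gtggtttctcgagatggaatctact",
]
SBJCT_BLOCKS = [
    "gtggaaatagcaagagaactagaattagaagtggagcct-gaagatgtgactgaattgct",
    "gcaatctcatgataaaacttgaacggatgaggagttgcttcttatggatgagcaaagaaa",
    "gtggtttcttgagatggaatctact",
]


def count_identities(query, subject):
    """Return (identical columns, gap columns, alignment length)."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return identical, gaps, len(query)


if __name__ == "__main__":
    identical, gaps, length = count_identities("".join(QUERY_BLOCKS),
                                               "".join(SBJCT_BLOCKS))
    print(f"Identities = {identical}/{length} ({100 * identical // length}%), "
          f"Gaps = {gaps}/{length}")
```

A high identity over 145 bases says only that this part of the EST resembles the Tigger1 element; it says nothing about the function of the rest of the 419-letter query.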


Tigger1 Was Discovered in 1996

LOCUS       HSU49973     2418 bp    DNA             PRI       28-JUN-1997
DEFINITION  Human Tigger1 transposable element, complete consensus sequence.
ACCESSION   U49973
NID         g2226003
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Primates; Catarrhini; Hominidae;
            Homo.
REFERENCE   1  (bases 1 to 2418)
  AUTHORS   Smit,A.F. and Riggs,A.D.
  TITLE     Tiggers and DNA transposon fossils in the human genome
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 93 (4), 1443-1448 (1996)
  MEDLINE   96202298


Assignment Was Made in 1997

LOCUS       AI298618      419 bp    mRNA            EST       29-JAN-1999
DEFINITION  qm96b01.x1 NCI_CGAP_Lu5 Homo sapiens cDNA clone IMAGE:1896553 3'
            similar to TR:Q13538 Q13538 ORF2: FUNCTION UNKNOWN. ;, mRNA
            sequence.
ACCESSION   AI298618
NID         g3958354
KEYWORDS    EST.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
            Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 419)
  AUTHORS   NCI-CGAP http://www.ncbi.nlm.nih.gov/ncicgap.
  TITLE     National Cancer Institute, Cancer Genome Anatomy Project (CGAP),
            Tumor Gene Index
  JOURNAL   Unpublished (1997)


Transitive Catastrophe

Hs.47393 Homo sapiens
ESTs, Moderately similar to ORF2: function unknown [H.sapiens]

MAPPING INFORMATION
  Chromosome: 3
  Gene Map 98: AA218858 , Chr.3, D3S3591-D3S1283

EXPRESSION INFORMATION
  cDNA sources: CNS, Heart, Kidney, Lung, Ovary, Parathyroid, Placenta, Testis,
  Thymus, Tonsil, Uterus, Whole embryo

EST SEQUENCES (37)
  T98806 cDNA clone IMAGE:122293 3' read 1.6 kb
  AA287238 cDNA clone IMAGE:713684 Ovary 5' read 1.5 kb
  AA709280 cDNA clone 1343469 Testis 3' read 1.3 kb
  AA873649 cDNA clone IMAGE:1358275 Tonsil 3' read 1.0 kb
  AA237014 cDNA clone IMAGE:683842 Tonsil 3' read 1.0 kb
  AA953287 cDNA clone IMAGE:1573207 Kidney 3' read 1.0 kb
  AA923144 cDNA clone IMAGE:1557049 Lung 3' read 0.9 kb
  AI298618 cDNA clone IMAGE:1896553 Lung 3' read 0.9 kb
  . . . . .


The sixth hypothesis

Mechanism can be determined by
interactions among molecules


Concept of Interaction Map

Protein-nucleic acid interaction

Protein-protein interaction


Protein-nucleic acid interaction


Protein-protein interaction

Physical and physiologic outcomes of dimerization    Possible examples
Proximity and orientation                            Single transmembrane cell surface receptors
Differential regulation by heterodimerization        Myc/Mad/Max; Bcl-2 family
Temporal and spatial thresholds                      Id; Emc
Enhanced specificity                                 Many DNA-binding proteins, DCoH
Enlarged surface area                                Growth factor-receptor interactions
Regulated monomer-to-dimer transitions               STAT proteins; Smad proteins
Imposition of a kinetic barrier                      E-cadherin; Synaptotagmin

* This table was taken from Klemm et al (1998) Annu. Rev. Immunol. 16:569-592


TNFR


Relation of death domain containing
proteins

IL-1R associated kinase M

TNFR *

TRAMP *

TRAIL *

FAS soluble protein

RAIDD

Nuclear matrix protein p84

RIP protein kinase

TRADD

FADD

TNFR-6 *

Ankyrin-1

Ankyrin-B

Ankyrin-G

Myeloid differentiation primary response gene 88 *

Ectodysplasin-A receptor *

*

NGFR *

DAP-kinase

NFKB1

NF-KB subunit

*

*

Unc5_chr4 *

*

*: Proteins containing a transmembrane domain predicted by the TMHMM 1.0 server

1-30: IDs of the protein loci, the same as in the previous figure


Clustering method can be used in
many different places

Numerical taxonomy

Phylogenetic analysis

Microarray data analysis

(pathway discovery)
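
A minimal sketch of one such clustering method applied to expression profiles: agglomerative clustering with average linkage, using 1 minus the Pearson correlation as the distance. The gene names and values are made up, the function names are hypothetical, and real microarray studies may use other distances or linkage rules.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cluster(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    profiles maps a gene name to its list of measurements.
    Distance between clusters is the average of (1 - r) over all
    pairs of members (average linkage).
    """
    clusters = [[name] for name in profiles]

    def distance(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1.0 - pearson(profiles[a], profiles[b])
                   for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Hypothetical profiles: two rising genes, two falling genes.
    toy = {
        "g1": [1, 2, 3, 4],
        "g2": [2, 4, 6, 8.5],
        "g3": [4, 3, 2, 1],
        "g4": [8, 6, 4, 1],
    }
    print(cluster(toy, 2))
```

The same loop works for numerical taxonomy or phylogenetic distances if the correlation-based distance is replaced with the appropriate measure.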


Summary: hypotheses

Evolution at molecular level is driven by random drift (neutral mutation)

Re-use components

Independent folding motifs are functional units

The combinatory strategy

Sequence implies structure implies function

Mechanism can be determined by interactions among molecules