Introduction to Bioinformatics - Ohio University


Oct 2, 2013


What does mathematics contribute to bioinformatics?

Winfried Just

Department of Mathematics

Ohio University

A new microscope and a new physics

In 2004 PLoS Biology published a paper by Joel E. Cohen:

Mathematics Is Biology's Next Microscope, Only Better; Biology Is Mathematics' Next Physics, Only Better.

Really?

How does this new microscope differ from the traditional ones?

How to use it?

Why did mathematicians become seriously interested in biology?

And how is all this related to bioinformatics?

More empirical observations



NSF and NIH recently started to invest heavily in biomathematics.

In 2002 the Mathematical Biosciences Institute (MBI, located at OSU) was founded; this is the first and so far only NSF institute dedicated exclusively to applications of mathematics in one other area.

Several other new research institutes in biomathematics are supported from public or private sources.

A number of new journals specializing in biomathematics got started.

The job market for biomathematicians is currently rather favorable, both in academia and industry, especially in the pharmaceutical industry.

What is behind this trend?

And why do we observe this trend now, instead of 30 years ago or 30 years from now?

There are two main reasons:

1. Contemporary biology generates huge mountains of data. Drawing biologically meaningful inferences from these data requires analysis in the framework of good mathematical models. Hence mathematics has become a necessary tool for biology.

2. Currently available computer power allows us to investigate sufficiently detailed mathematical models to draw biologically realistic inferences. Thus mathematics has become a useful tool for biology.



Biomathematics vs. bioinformatics

Everything that has been said so far about “biomathematics” could also be said about “bioinformatics.”

What is the difference between the two areas?

Biomathematics: Applications of mathematics to biology.

Bioinformatics: The design, implementation, and use of computer algorithms to draw inferences from massive sets of biomolecular data. It is an interdisciplinary field that draws on knowledge from biology, biochemistry, statistics, mathematics, and computer science.



Example of a huge data set: GenBank

The first viral genome was published in the 1970s; the first bacterial genome, H. influenzae, 1.83 ∙ 10^6 bp, in 1995. The first genome of a multicellular organism, C. elegans, 10^8 bp, followed in 1998. The draft of our own genome, H. sapiens, π ∙ 10^9 bp, was announced in June 2000.

As of February 2008, GenBank contained 85 759 586 764 bp of information.

How can we draw concrete inferences from such a huge mountain of information?


Where are the genes?

Let us look, for example, at our own genome. The information about it is written in GenBank as a sequence of π ∙ 10^9 letters that would fill a million tightly typed pages, the equivalent of several thousand novels:

...actggtacctgtatatggacgctccatatttaatgcgcgatgcaggatctaaa...

Less than 1.5% of this sequence codes for proteins. How can we find these genes?

No human can read the whole sequence. A computer can read it easily, in a few seconds. So, maybe the computer will tell us where the genes are, where they start, and where they end.

But what is the computer supposed to compute???

Honest Craig’s Casino

This is a casino in Nevada where one plays 64-number roulette. In each round, a player bets chips on three among those 64 numbers. If one of these three chosen numbers comes up, honest Craig will pay a suitable premium. If not, the player loses the chips.

QUESTION: How long does it take, on average, for a winning number to come up?



ANSWER: 64/3 = 21.33 rounds.
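The answer is the mean of a geometric distribution: if each round wins with probability p = 3/64, the expected number of rounds until the first win is 1/p. The following is a minimal Python sketch that checks this against a simulation; the function names, the number of trials, and the random seed are illustrative assumptions, not part of the original slides.

```python
import random
from fractions import Fraction

def expected_rounds(winning_numbers=3, total_numbers=64):
    """Mean of a geometric distribution: 1/p rounds until the first win."""
    p = Fraction(winning_numbers, total_numbers)
    return 1 / p

def simulated_mean_rounds(trials=100_000, seed=0):
    """Estimate the same mean by playing the 64-number roulette many times."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        rounds = 1
        # Outcomes 0, 1, 2 stand for the player's three chosen numbers.
        while rng.randrange(64) >= 3:
            rounds += 1
        total += rounds
    return total / trials

print(float(expected_rounds()))   # exact value 64/3
print(simulated_mean_rounds())    # should land close to the exact value
```

The simulated mean fluctuates from seed to seed, so it only approximates 64/3.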




Probability of long waiting times

Let us assume that Craig is as honest as he claims. Then the probability P(k) that our player will keep losing throughout the first k rounds is (61/64)^k. In particular, starting from k = 50 we obtain the following probabilities:

P(50) = 0.0907   P(51) = 0.0864   P(52) = 0.0824   P(53) = 0.0785   P(54) = 0.0748

P(55) = 0.0713   P(56) = 0.0680   P(57) = 0.0648   P(58) = 0.0618   P(59) = 0.0589

P(60) = 0.0561   P(61) = 0.0535   P(62) = 0.0510   P(63) = 0.0486   P(64) = 0.0463

P(65) = 0.0441   P(66) = 0.0421   P(67) = 0.0401   P(68) = 0.0382   P(69) = 0.0364

P(100) = 0.0082   P(200) = 0.000068   P(300) = 0.00000056
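The probabilities above can be recomputed directly from P(k) = (61/64)^k. The sketch below does so and searches for the shortest losing streak whose probability falls below 0.05; the function names are illustrative assumptions.

```python
from fractions import Fraction

# Chance of losing a single round if Craig is honest: 61 of 64 numbers lose.
LOSS_PROBABILITY = Fraction(61, 64)

def losing_streak_probability(k):
    """P(k): probability of losing each of the first k rounds."""
    return float(LOSS_PROBABILITY ** k)

def shortest_significant_streak(alpha=0.05):
    """Smallest k for which P(k) drops below alpha."""
    k = 0
    while losing_streak_probability(k) >= alpha:
        k += 1
    return k

for k in (50, 62, 63, 100):
    print(k, round(losing_streak_probability(k), 4))

print(shortest_significant_streak())  # 63
```

The threshold of 63 rounds is the first entry of the table that is smaller than 0.05.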


Some statistical terminology

The assumption that Craig is as honest as he claims will be our null hypothesis. The suspicion that he is cheating after all is our alternative hypothesis. The number of losses that precede the first winning round will be our test statistic. The p-value is the probability that the test statistic takes the observed or a more extreme value under the assumption of the null hypothesis.

If the p-value falls below our agreed-upon significance level, we are justified in rejecting the null hypothesis. In science, the most commonly used significance level is 0.05. Falsely accusing honest Craig of cheating would be a Type I error; trusting him when he is in fact cheating would be a Type II error.
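These terms can be made concrete for the casino. In the sketch below the p-value of an observed losing streak is (61/64)^k, and the null hypothesis is rejected when it falls below 0.05. The Type II error rate depends on how Craig would cheat, which the slides do not specify; the value `true_win_probability=1/64` used in the example is purely a hypothetical assumption, as are all function names.

```python
def p_value(observed_losses, win_probability=3 / 64):
    """Probability of at least `observed_losses` losses before the first win,
    assuming the null hypothesis (an honest roulette)."""
    return (1 - win_probability) ** observed_losses

def reject_null(observed_losses, alpha=0.05):
    """Decide whether the observed losing streak is significant at level alpha."""
    return p_value(observed_losses) < alpha

def type_one_error_rate(threshold=63):
    """Chance that an honest casino produces a streak of `threshold` or more losses."""
    return p_value(threshold)

def type_two_error_rate(true_win_probability, threshold=63):
    """Chance that a cheating casino still yields a streak shorter than `threshold`."""
    return 1 - (1 - true_win_probability) ** threshold

print(reject_null(62), reject_null(63))
print(type_one_error_rate())
# Hypothetical cheating scenario: only one of the three numbers can really win.
print(type_two_error_rate(true_win_probability=1 / 64))
```

Lowering the significance level reduces the Type I error rate but raises the Type II error rate for any fixed way of cheating.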



Craig Venter’s Lab

In 1995 Craig Venter’s team sequenced the genome of the bacterium H. influenzae. If we want to detect the positions of its 1740 genes that code proteins in its sequence of 1 830 140 base pairs, we can reason as follows: In bacteria almost all the genome codes proteins. Let us start from position n and read triplets: (n, n+1, n+2), (n+3, n+4, n+5), … If we read in the correct reading frame, we will read a sequence of codons that ends with a STOP codon, that is, one of TAA, TGA, TAG.

Such a STOP codon will appear on average once in about 300 triplets.

If we read in one of the other five reading frames, we will read garbage, that is, a more or less random sequence of triplets, and one of the triplets TAA, TGA, TAG will be encountered on average once every 64/3 = 21.33 triplets.
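Reading triplets in the six reading frames (three offsets on each strand) can be sketched as follows. The helper names and the toy sequence are illustrative assumptions; the last function simply counts that 3 of the 64 possible triplets are STOP codons, which is where the 64/3 figure comes from under the simplifying assumption of uniformly random triplets.

```python
from fractions import Fraction
from itertools import product

STOP_CODONS = {"TAA", "TGA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return dna.translate(COMPLEMENT)[::-1]

def triplets(dna, start):
    """Yield (n, n+1, n+2), (n+3, n+4, n+5), ... for a 0-based start n."""
    for i in range(start, len(dna) - 2, 3):
        yield dna[i:i + 3]

def six_reading_frames(dna):
    """Return six lists of triplets: offsets 0, 1, 2 on each of the two strands."""
    dna = dna.upper()
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(list(triplets(strand, offset)))
    return frames

def stop_fraction():
    """Fraction of all 64 possible triplets that are STOP codons."""
    all_triplets = ["".join(t) for t in product("ACGT", repeat=3)]
    stops = sum(1 for t in all_triplets if t in STOP_CODONS)
    return Fraction(stops, len(all_triplets))

toy = "ATGAAATAGC"  # made-up sequence, for illustration only
for frame in six_reading_frames(toy):
    print(frame)
print(1 / stop_fraction())  # 64/3
```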



Rings a bell?

This is the same problem!

With minor modifications: Now our null hypothesis will be that we read in the wrong reading frame; the alternative hypothesis will be that we read a coding sequence in the correct reading frame. If we don’t encounter a STOP codon while reading 63 successive triplets, we can reject the null hypothesis at significance level 0.05 and conclude that we found a sequence that codes a protein whose end is easy to find.

So we can design an easy gene-finding algorithm based on finding these so-called ORFs (open reading frames).
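A minimal sketch of such an ORF scan is given below. It is not a production gene finder: it treats an ORF simply as a run of at least 63 STOP-free triplets that ends in a STOP codon, ignores START codons and promoters, and uses a made-up toy sequence. The function name, the return format, and the 0-based coordinates are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def find_orfs(dna, min_codons=63):
    """Report STOP-free runs of at least `min_codons` triplets that end in a STOP.

    Returns tuples (strand, start, end, codons). `strand` is "+" or "-";
    `start` and `end` are 0-based, end-exclusive offsets on that strand,
    with `end` just past the STOP codon; `codons` excludes the STOP.
    """
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    orfs = []
    for strand, seq in strands.items():
        for offset in range(3):          # three reading frames per strand
            run_start = offset
            for i in range(offset, len(seq) - 2, 3):
                if seq[i:i + 3] in STOP_CODONS:
                    codons = (i - run_start) // 3
                    if codons >= min_codons:
                        orfs.append((strand, run_start, i + 3, codons))
                    run_start = i + 3    # next run begins after this STOP
    return orfs

# Toy input: 71 non-STOP triplets followed by a STOP codon.
toy_gene = "ATG" + "GCT" * 70 + "TAA"
print(find_orfs(toy_gene))
```

A run that is cut off by the end of the sequence is not reported, because no closing STOP codon has been seen for it.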



Some caveats



The beginning of the gene is somewhat more difficult
to determine, since
ATG
is both the START codon and
the codon for methionine, and the promoter is also
part of the gene.


The “garbage” in the other five reading frames is not
completely random.


This approach will miss all genes that code proteins
shorter than 63 amino acids (type ? error) and will
sometimes discover spurious genes (type ? error).


This approach is unsuitable for discovering RNA-coding genes.

However, the above problems can be solved, and there exist good gene-finding algorithms based on this idea.

Craig Venter’s lab in 2000

But now let us look at the genome of H. sapiens:

Protein-coding regions constitute only a small fraction of our genome.

All by itself, this would lead to a lot more Type I errors than in prokaryotes.



The coding sequences, exons, are interspersed with introns.


A given codon may be split by an intron.


Consecutive exons don’t have to sit in the same reading
frame.



Introns look similar to random sequences.



So we are faced with a much more difficult problem.

Nowadays there exist pretty good algorithms for finding genes

in eukaryotes. But:


No algorithm for finding genes in prokaryotes will work here.

Mathematics and mathematicians


1. Mathematics is a great language for elucidating the common structure in apparently unrelated problems.

2. Mathematicians have a tendency to talk about complicated theories in their jargon instead of giving simple and concrete answers.

3. “Mathematical microscopes” often don’t come with a simple user’s manual. In order to successfully use them, one needs to understand to some extent how they work. The choice of the most appropriate “mathematical microscope” for a given biological problem often requires active cooperation between mathematicians and biologists.

4. The key to success in this type of cooperation is finding a common language and mutual understanding of and respect for the two different intellectual approaches.

5. Mathematical models form the basis for formulating hypotheses, often in the form of probabilities.

6. The final interpretation of these hypotheses and their experimental verification belongs to the biologists. Thus “mathematical microscopes” will not make the more traditional ones redundant.

In points 3-6, feel free to substitute “bioinformatics” for “mathematics.”

Biomathematics vs. bioinformatics

Biomathematics: Applications of mathematics to biology.

Bioinformatics: The design, implementation, and use of computer algorithms to draw inferences from massive sets of biomolecular data. It is an interdisciplinary field that draws on knowledge from biology, biochemistry, statistics, mathematics, and computer science.

The design of all bioinformatics tools is based on mathematical models. In order to choose the most appropriate among the available tools and draw proper inferences, one needs to understand these models.