What does mathematics contribute to bioinformatics?
Winfried Just
Department of Mathematics
Ohio University
A new microscope and a new physics
In 2004, PLoS Biology published a paper by Joel E. Cohen:
"Mathematics Is Biology's Next Microscope, Only Better; Biology Is Mathematics' Next Physics, Only Better."
Really?
How does this new microscope differ from the traditional ones?
How does one use it?
Why did mathematicians become seriously interested in
biology?
And how is all this related to bioinformatics?
More empirical observations
• NSF and NIH recently started to invest heavily in biomathematics.
• In 2002 the Mathematical Biosciences Institute (MBI, located at OSU) was founded; this is the first and so far only NSF institute dedicated exclusively to applications of mathematics in one other area.
• Several other new research institutes in biomathematics are supported from public or private sources.
• A number of new journals specializing in biomathematics have been launched.
• The job market for biomathematicians is currently rather favorable, both in academia and industry, especially in the pharmaceutical industry.
What is behind this trend?
And why do we observe this trend now, instead of 30 years ago or 30 years from now?
There are two main reasons:
1. Contemporary biology generates huge mountains of data. Drawing biologically meaningful inferences from these data requires analysis in the framework of good mathematical models. Hence mathematics has become a necessary tool for biology.
2. Currently available computer power allows us to investigate sufficiently detailed mathematical models to draw biologically realistic inferences. Thus mathematics has become a useful tool for biology.
Biomathematics vs. bioinformatics
Everything that has been said so far about
“biomathematics” could also be said about
“bioinformatics.”
What is the difference between the two areas?
Biomathematics: Applications of mathematics to biology.
Bioinformatics: The design, implementation, and use of computer algorithms to draw inferences from massive sets of biomolecular data. It is an interdisciplinary field that draws on knowledge from biology, biochemistry, statistics, mathematics, and computer science.
Example of a huge data set: GenBank
The first viral genome was published in the 1970s; the first bacterial genome, H. influenzae, 1.83 · 10^6 bp, in 1995.
The first genome of a multicellular organism, C. elegans, 10^8 bp, followed in 1998.
The draft of our own genome, H. sapiens, π · 10^9 bp, was announced in June 2000.
As of February 2008, GenBank contained 85 759 586 764 bp of information.
How can we draw concrete inferences from such a huge mountain of information?
Where are the genes?
Let us look, for example, at our own genome. The information about it is written in GenBank as a sequence of π · 10^9 letters that would fill a million tightly typed pages, the equivalent of several thousand novels:
...actggtacctgtatatggacgctccatatttaatgcgcgatgcaggatctaaa...
Less than 1.5% of this sequence codes for proteins. How do we find these genes?
No human can read the whole sequence. A computer can read
it easily, in a few seconds. So, maybe the computer will tell us
where the genes are, where they start, and where they end.
But what is the computer supposed to compute???
Honest Craig's Casino
This is a casino in Nevada where one plays 64-number roulette. In each round, a player bets chips on three among those 64 numbers. If one of these three chosen numbers comes up, honest Craig will pay a suitable premium. If not, the player loses the chips.
QUESTION: How long does it take, on average, for a winning number to come up?
ANSWER: 64/3 = 21.33 rounds.
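The answer follows because each round is won with probability 3/64, and the mean waiting time for the first success is the reciprocal of that probability. A minimal simulation sketch can be used to check it; the function names, the seed, and the number of trials below are illustrative choices, not part of the original slides.

```python
import random

WINNING_NUMBERS = 3   # numbers the player bets on in each round
WHEEL_SIZE = 64       # numbers on Honest Craig's wheel


def rounds_until_first_win(rng: random.Random) -> int:
    """Play until one of the player's numbers comes up; return the round count."""
    rounds = 1
    # Treat pockets 0, 1, 2 as the player's numbers; by symmetry the choice is irrelevant.
    while rng.randrange(WHEEL_SIZE) >= WINNING_NUMBERS:
        rounds += 1
    return rounds


def mean_waiting_time(trials: int = 100_000, seed: int = 0) -> float:
    """Average the waiting time over many simulated players (fair wheel assumed)."""
    rng = random.Random(seed)
    return sum(rounds_until_first_win(rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print(f"exact:     {WHEEL_SIZE / WINNING_NUMBERS:.2f} rounds")
    print(f"simulated: {mean_waiting_time():.2f} rounds")
```

The simulated mean should land close to 64/3; it will not match exactly because it is a random estimate.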
Probability of long waiting times
Let us assume that Craig is as honest as he claims. Then the probability P(k) that our player will keep losing throughout the first k rounds is P(k) = (61/64)^k.
In particular, starting from k = 50 we obtain the following probabilities:
P(50) = 0.0907   P(51) = 0.0864   P(52) = 0.0824   P(53) = 0.0785   P(54) = 0.0748
P(55) = 0.0713   P(56) = 0.0680   P(57) = 0.0648   P(58) = 0.0618   P(59) = 0.0589
P(60) = 0.0561   P(61) = 0.0535   P(62) = 0.0510   P(63) = 0.0486   P(64) = 0.0463
P(65) = 0.0441   P(66) = 0.0421   P(67) = 0.0401   P(68) = 0.0382   P(69) = 0.0364
P(100) = 0.0082   P(200) = 0.000068   P(300) = 0.00000056
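The probabilities above can be recomputed directly from P(k) = (61/64)^k. The short sketch below does so; the function name `p_keep_losing` and the print layout are illustrative assumptions.

```python
def p_keep_losing(k: int, losing_numbers: int = 61, wheel_size: int = 64) -> float:
    """Probability of losing each of the first k rounds on a fair wheel."""
    return (losing_numbers / wheel_size) ** k


if __name__ == "__main__":
    # Rows of five values, k = 50..69, followed by three larger values of k.
    for row_start in range(50, 70, 5):
        print("   ".join(f"P({k}) = {p_keep_losing(k):.4f}"
                         for k in range(row_start, row_start + 5)))
    for k in (100, 200, 300):
        print(f"P({k}) = {p_keep_losing(k):.2e}")
```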
Some statistical terminology
The assumption that Craig is as honest as he claims will be our null hypothesis. The suspicion that he is cheating after all is our alternative hypothesis.
The number of losses that precede the first winning round will be our test statistic.
The p-value is the probability that the test statistic takes the observed or a more extreme value under the assumption of the null hypothesis.
If the p-value falls below our agreed-upon significance level, we are justified in rejecting the null hypothesis. In science, the most commonly used significance level is 0.05.
Falsely accusing honest Craig of cheating would be a Type I error; trusting him when he is in fact cheating would be a Type II error.
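In the casino example these terms translate into a simple decision rule. The sketch below is one way to write it down, assuming the test statistic is the number of losses before the first win; the names `p_value`, `reject_null`, and `critical_value` are illustrative.

```python
P_LOSE = 61 / 64  # chance of losing one round under the null hypothesis (fair wheel)


def p_value(observed_losses: int) -> float:
    """P(at least `observed_losses` losses before the first win | fair wheel)."""
    return P_LOSE ** observed_losses


def reject_null(observed_losses: int, alpha: float = 0.05) -> bool:
    """Reject the null hypothesis when the p-value falls below the significance level."""
    return p_value(observed_losses) < alpha


def critical_value(alpha: float = 0.05) -> int:
    """Smallest number of consecutive losses that leads to rejection at level alpha."""
    k = 0
    while not reject_null(k, alpha):
        k += 1
    return k


if __name__ == "__main__":
    k = critical_value()
    print(f"reject 'Craig is honest' after {k} straight losses (p = {p_value(k):.4f})")
```

With this rule, the chance of a Type I error (accusing an honest Craig) is the p-value at the critical number of losses, which stays below the chosen significance level.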
Craig Venter's Lab
In 1995 Craig Venter's team sequenced the genome of the bacterium H. influenzae. If we want to detect the positions of its 1740 genes that code for proteins in its sequence of 1 830 140 base pairs, we can reason as follows: In bacteria almost all the genome codes for proteins. Let us start from position n and read triplets: (n, n+1, n+2), (n+3, n+4, n+5), …
If we read in the correct reading frame, we will read a sequence of codons that ends with a STOP codon, that is, TAA, TGA, or TAG.
Such a STOP codon will appear on average once in about 300 triplets.
If we read in one of the other five reading frames, we will read garbage, that is, a more or less random sequence of triplets, and one of the triplets TAA, TGA, TAG will be encountered on average once every 64/3 = 21.33 triplets.
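The figure 64/3 comes from counting: there are 4^3 = 64 possible triplets, and three of them are STOP codons. The sketch below makes that count explicit under the idealized assumption that all four bases are equally likely and independent, which is only an approximation for real sequences; the variable names are illustrative.

```python
from itertools import product

STOP_CODONS = {"TAA", "TGA", "TAG"}

# Enumerate every possible triplet over the four-letter DNA alphabet.
all_codons = ["".join(bases) for bases in product("ACGT", repeat=3)]

# Under the uniform, independent-bases assumption each triplet is equally likely.
p_stop = sum(codon in STOP_CODONS for codon in all_codons) / len(all_codons)
mean_spacing = 1 / p_stop  # expected number of triplets read until a STOP appears

if __name__ == "__main__":
    print(f"{len(all_codons)} triplets, P(STOP) = {p_stop:.4f}, "
          f"mean spacing = {mean_spacing:.2f} triplets")
```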
Rings a bell? This is the same problem!
With minor modifications:
Now our null hypothesis will be that
we read in the wrong reading frame, the alternative hypothesis
will be that we read a coding sequence in the correct reading
frame. If we don’t encounter a STOP codon while reading 63
successive triplets, we can reject the null hypothesis at
significance level 0.05 and conclude that we found a sequence
that codes a protein whose end is easy to find.
So we can design an easy gene-finding algorithm based on finding these so-called ORFs (open reading frames).
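The idea can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not one of the production gene finders: it only reports stretches of at least 63 consecutive triplets without a STOP codon in each of the six reading frames, and the names `OpenRun`, `stop_free_runs`, and `min_codons` are made up for this sketch.

```python
from typing import List, NamedTuple

STOP_CODONS = {"TAA", "TGA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


class OpenRun(NamedTuple):
    strand: str   # "+" for the given strand, "-" for its reverse complement
    frame: int    # offset 0, 1, or 2 within that strand
    start: int    # index of the first base of the run, counted on that strand
    codons: int   # number of consecutive non-STOP triplets in the run


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_runs(seq: str, min_codons: int = 63) -> List[OpenRun]:
    """Scan all six reading frames for long runs of triplets without a STOP codon."""
    seq = seq.upper()
    runs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start, length = frame, 0
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    if length >= min_codons:
                        runs.append(OpenRun(strand, frame, run_start, length))
                    run_start, length = i + 3, 0   # next run begins after the STOP
                else:
                    length += 1
            if length >= min_codons:               # run reaches the end of the sequence
                runs.append(OpenRun(strand, frame, run_start, length))
    return runs
```

The default threshold of 63 triplets is the one derived above for significance level 0.05; raising `min_codons` trades fewer spurious hits for more missed short genes.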
Some caveats
• The beginning of the gene is somewhat more difficult to determine, since ATG is both the START codon and the codon for methionine, and the promoter is also part of the gene.
• The "garbage" in the other five reading frames is not completely random.
• This approach will miss all genes that code for proteins shorter than 63 amino acids (Type ? error) and will sometimes discover spurious genes (Type ? error).
• This approach is unsuitable for discovering RNA-coding genes.
However, the above problems can be solved, and there exist good gene-finding algorithms based on this idea.
Craig Venter's lab in 2000
But now let us look at the genome of H. sapiens:
• Protein-coding regions constitute only a small fraction of our genome.
All by itself, this would lead to a lot more Type I errors than in prokaryotes.
• The coding sequences, exons, are interspersed with introns.
• A given codon may be split by an intron.
• Consecutive exons don't have to sit in the same reading frame.
• Introns look similar to random sequences.
So we are faced with a much more difficult problem.
Nowadays there exist pretty good algorithms for finding genes
in eukaryotes. But:
No algorithm for finding genes in prokaryotes will work here.
Mathematics and mathematicians
1. Mathematics is a great language for elucidating the common structure in apparently unrelated problems.
2. Mathematicians have a tendency to talk about complicated theories in their jargon instead of giving simple and concrete answers.
3. "Mathematical microscopes" often don't come with a simple user's manual. In order to successfully use them, one needs to understand to some extent how they work. The choice of the most appropriate "mathematical microscope" for a given biological problem often requires active cooperation between mathematicians and biologists.
4. The key to success in this type of cooperation is finding a common language and mutual understanding of and respect for the two different intellectual approaches.
5. Mathematical models form the basis for formulating hypotheses, often in the form of probabilities.
6. The final interpretation of these hypotheses and their experimental verification belongs to the biologists. Thus "mathematical microscopes" will not make the more traditional ones redundant.
In points 3-6, feel free to substitute "bioinformatics" for "mathematics."
Biomathematics vs. bioinformatics
The design of all bioinformatics tools is based on mathematical models. In order to choose the most appropriate among the available tools and draw proper inferences, one needs to understand these models.