© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
Chapter 6
The Computational Foundations of Genomics
Applying algorithms to analyze genomics data
Contents
What are computational biology and bioinformatics?
Understanding computers and algorithms
Sequence alignment
Gene prediction
Algorithms for analysis of phylogeny
Analysis of microarray data
Computer simulation
Computational Biology and Bioinformatics
Computational biology
Development of computational methods to solve problems in biology
Bioinformatics
Application of computational biology to analysis and management of real molecular biology data
Why do molecular biologists need computer science?
Discrete nature of sequence data is ideal for analysis using digital computers
Size and complexity of genomics data make the data impossible to analyze without computers
Algorithm
An algorithm is a procedure (a finite set of well-defined instructions) for accomplishing some task
A recipe to perform a task
Algorithms often have steps that repeat (iterate) or require decisions (such as logic or comparison). Algorithms can be composed to create more complex algorithms.
Concept formalized in 1936 (Turing and Church)
A historical perspective
The 1960s: the birth of bioinformatics
High-level computer languages
Protein sequence data
Academic access to computers
Margaret Oakley Dayhoff
First protein database
First program for automatic sequence assembly
IBM 7090 computer
Solving problems in computer science
Necessary parameters for assessing the difficulty of a computer science problem
Algorithmic complexity
Is the problem theoretically solvable?
If so, what is the most efficient solution?
Current state of computer technology
Memory
CPU speed
Cost
Sequencing entire genomes via the shotgun approach was not possible until the mid-1990s because the computational power needed was unavailable until that time.
Algorithmic problems
Example: searching for a number in an unordered list
If the list has N numbers, the average time the search takes is proportional to N
A more clever approach
Place the numbers in order
Do a binary search
Step 1: Pick a number in the middle of the list
Step 2: Restrict the search to the half that contains your number
Return to Step 1 until you find your number
Time for this approach is proportional to log2 N
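The two search strategies can be compared with a short sketch. The function names and the test list are illustrative assumptions, not part of the slides; each function returns the position found and the number of elements it had to inspect.

```python
def linear_search(values, target):
    """Scan a list front to back; return (index or -1, elements inspected)."""
    for steps, value in enumerate(values, start=1):
        if value == target:
            return steps - 1, steps
    return -1, len(values)


def binary_search(sorted_values, target):
    """Repeatedly halve a sorted list; return (index or -1, elements inspected)."""
    low, high = 0, len(sorted_values) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2            # Step 1: pick the middle element
        if sorted_values[mid] == target:
            return mid, steps
        if sorted_values[mid] < target:    # Step 2: keep the half that can hold the target
            low = mid + 1
        else:
            high = mid - 1
    return -1, steps


numbers = list(range(0, 2048, 2))          # 1024 sorted even numbers (made-up data)
print(linear_search(numbers, 2046))        # inspects every element in the worst case
print(binary_search(numbers, 2046))        # inspects roughly log2(1024) elements
```

Binary search only works on an ordered list, so the one-time cost of sorting pays off when the same list is searched many times.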
The digital computer
Represents everything in a code of zeros and ones
Computer architecture
CPU (Central Processing Unit)
Memory
Input / Output
Advantages of digital computers
Deterministic
Minimization of noise
[Figure: computer architecture showing input, CPU, memory, and output]
The limitations of digital computers
The limitations of digital computers are conceptual, not just technological
Digital computers are deterministic
Incapable of truly random behavior
Digital computers deal with strictly discrete values
Can only approximate continuous behavior
Many interesting biological phenomena occur in the continuous realm of space and time
Sequence databases
What is a database?
An indexed set of records
Records retrieved using a query language
Database technology is well established
Examples of sequence databases
GenBank (NCBI)
Encompasses all publicly available protein and nucleotide sequences
Protein Data Bank
Contains 3-D structures of proteins
The client-server model
From a single computer, to GCG, to the Internet
The clients and servers are software processes
Clients request data from servers
Servers and clients can reside on the same or different machines
Clients can act as servers to other processes and vice versa
[Figure: client-server chain linking a web browser, web server, BLAST search engine, and database]
Sequence alignment
Sequence alignments search for matches between sequences
Two broad classes of sequence alignments
Global (wide): maximize overall score
Local (narrow): high score in a limited area
Alignment can be performed between two or more sequences
[Figure: global alignment of QKESGPSSSYC with VQQESGLVRTTC; local alignment of the shared segment ESG]
The biological importance of sequence alignment
Sequence alignments assess the degree of similarity between sequences
Similar sequences suggest similar function
Proteins with similar sequences are likely to play similar biochemical roles
Regulatory DNA sequences that are similar will likely have similar roles in gene regulation
Sequence similarity suggests evolutionary history
Fewer differences mean more recent divergence
Orthologs versus paralogs
The algorithmic problem of aligning sequences
Comparison of similar sequences of similar length is straightforward
How does one deal with insertions and gaps that may hide true similarity?
How does one interpret minimal similarity?
Are sequences actually related?
Is alignment by chance?
[Figure: example sequence pairs QQESGPVRSTC / QKGSYQEKGYC, QQESGPVRSTC / RQQEPVRSTC, and QQESGPVRSTC / QKESGPSRSYC]
Methods of sequence alignment
Graphical methods: visual
Dynamic-programming methods: mathematically optimal but time-consuming
Heuristic methods: approximate but close to the real answer
Dot matrix analysis
A graphical method
Shows all possible alignments
Caveats
Some guesswork in picking parameters
Window size
Stringency
Not as rigorous or quantitative as other methods
[Figure: dot matrix comparing QQESGPVRSTC with RQQEPVRSTC]
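A dot matrix with window size and stringency can be sketched as follows. This is a minimal illustration under assumptions: a dot is placed where a window of residues starting at a given pair of positions contains at least `stringency` identical residues, and the function names are hypothetical.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) window starts in seq_a and seq_b that pass the filter."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical residues inside the window starting at (i, j).
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + seq_a]
    for j, residue in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


seq_a, seq_b = "QQESGPVRSTC", "RQQEPVRSTC"
loose = dot_matrix(seq_a, seq_b, window=1, stringency=1)
strict = dot_matrix(seq_a, seq_b, window=3, stringency=3)
print(render(seq_a, seq_b, loose))
print()
print(render(seq_a, seq_b, strict))
```

Raising the window size and stringency removes isolated single-residue dots and leaves the diagonal runs that mark aligned segments.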
Dot matrix analysis: a real example
Window size: 23
Stringency: 15
Window size: 1
Stringency: 1
Noise to signal ratio
Devising a scoring system
Scoring matrices allow biologists to quantify the quality of sequence alignments
Use different scoring matrices for different purposes
Score for similar structural domains in proteins
Score for evolutionary relationship
Some popular scoring matrices
PAM for evolutionary studies (Percent Accepted Mutation)
BLOSUM for finding common motifs (BLOcks amino acid SUbstitution Matrix)
An example of scoring
BLOSUM62 (excerpt)
     A   R   N   D   C   Q   E
A    4  -1  -2  -2   0  -1  -1
R   -1   5   0  -2  -3   1   0
N   -2   0   6   1  -3   0   0
D   -2  -2   1   6  -3   0   2
C    0  -3  -3  -3   9  -3  -4
Q   -1   1   0   0  -3   5   2
E   -1   0   0   2  -4   2   5
A sequence comparison
Top:      A   D   D   R   Q   C   E   R   A   D
Bottom:   A   Q   E   R   Q   E   C   Q   A   Q
Score:    4   0   2   5   5  -4  -4   1   4   0
Total score: 13
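Scoring an ungapped alignment amounts to looking up each aligned pair in the matrix and summing the results. The sketch below uses only the seven-residue BLOSUM62 excerpt from this slide; the names `SCORES` and `alignment_score` are illustrative assumptions.

```python
RESIDUES = "ARNDCQE"
ROWS = [
    [4, -1, -2, -2, 0, -1, -1],   # A
    [-1, 5, 0, -2, -3, 1, 0],     # R
    [-2, 0, 6, 1, -3, 0, 0],      # N
    [-2, -2, 1, 6, -3, 0, 2],     # D
    [0, -3, -3, -3, 9, -3, -4],   # C
    [-1, 1, 0, 0, -3, 5, 2],      # Q
    [-1, 0, 0, 2, -4, 2, 5],      # E
]
# Lookup table keyed by residue pair, built from the excerpt above.
SCORES = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(RESIDUES)
    for j, b in enumerate(RESIDUES)
}


def alignment_score(top, bottom):
    """Sum the substitution scores of an ungapped, equal-length alignment."""
    if len(top) != len(bottom):
        raise ValueError("sequences must have the same length")
    return sum(SCORES[(a, b)] for a, b in zip(top, bottom))


# The ten aligned pairs listed on the slide, column by column.
print(alignment_score("ADDRQCERAD", "AQERQECQAQ"))
```

Identical residues and conservative substitutions raise the total, while unlikely substitutions such as C with E lower it.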
Dynamic programming (DP)
Possibility of gaps (or insertions) makes the number of possible sequence alignments astronomical
Dynamic programming makes sequence alignment possible by abandoning low-scoring alignments among subsequences as the algorithm progresses
Mathematically proven to provide optimal alignments
DP algorithms for sequence alignment
Needleman-Wunsch-Gotoh algorithm for global alignments
Smith-Waterman algorithm for local alignments
DP alignment algorithms are still too slow for searching an entire sequence database
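The core recurrence shared by the global and local DP algorithms can be sketched in a few lines. This is a simplified illustration, not the full published algorithms: it assumes a match/mismatch score and a linear gap penalty (the values are arbitrary), omits Gotoh's affine gap extension, and returns only the optimal score rather than the alignment itself.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score: global (Needleman-Wunsch style) or local (Smith-Waterman style)."""
    # First row: all zeros for local alignment, accumulated gap penalties for global.
    prev = [0] * (len(b) + 1) if local else [j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)        # a local alignment may restart anywhere
                best = max(best, cell)     # and may end anywhere
            curr.append(cell)
        prev = curr                        # only the previous row is needed for the score
    return best if local else prev[-1]


print(align_score("QKESGPSSSYC", "VQQESGLVRTTC"))              # global score
print(align_score("QKESGPSSSYC", "VQQESGLVRTTC", local=True))  # local score
```

Each cell is filled once, so the work grows with the product of the two sequence lengths. That cost is acceptable for one pair but adds up quickly when a query is compared against every sequence in a database.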
Heuristic methods with k-tuples
Example: BLAST/FASTA
Using query sequence, derive a list of words (tuples) of length w (e.g., 3)
Keep high-scoring matching words
High-scoring words are compared with database sequences
Sequences with many matches to high-scoring words are used as anchors (not just the words but their order) for final alignments
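The word-based seeding step can be sketched as follows. This is a deliberately simplified illustration: it matches words exactly, whereas BLAST also accepts similar words that score above a threshold, and the function names are hypothetical. Hits that share a diagonal (the same offset between query and subject positions) preserve word order and are the natural anchors.

```python
from collections import defaultdict


def words(sequence, w=3):
    """List every overlapping word of length w with its start position."""
    return [(sequence[i:i + w], i) for i in range(len(sequence) - w + 1)]


def seed_hits(query, subject, w=3):
    """Find exact word matches between query and subject via a word index."""
    index = defaultdict(list)
    for word, pos in words(subject, w):
        index[word].append(pos)
    hits = []
    for word, qpos in words(query, w):
        for spos in index.get(word, []):
            hits.append((qpos, spos, word))
    return hits


def best_diagonal(hits):
    """Return (diagonal, hit count) for the diagonal holding the most hits, or None."""
    counts = defaultdict(int)
    for qpos, spos, _ in hits:
        counts[spos - qpos] += 1
    return max(counts.items(), key=lambda kv: kv[1]) if counts else None


hits = seed_hits("QQESGPVRSTC", "RQQEPVRSTC")
print(hits)
print(best_diagonal(hits))
```

Looking words up in an index is far cheaper than filling a DP matrix, which is why the expensive alignment step is reserved for the few database sequences with promising anchors.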
Statistical significance
Chance alignments have no biological significance
Statistical significance implies low probability of generating a chance alignment
Probability of long alignments increases with longer sequences
The extreme-value distribution
Used to calculate the probability of chance alignment (reported as the E value, the expected number of chance alignments scoring at least as well)
Generated by calculating the scores resulting from repeatedly scrambling one of the sequences being compared
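The scrambling idea can be sketched as an empirical significance test. This is an illustration under assumptions: it uses a simple local alignment score with arbitrary match, mismatch, and gap values, and it reports the fraction of shuffled comparisons that score at least as well as the real one rather than fitting an extreme-value distribution to the shuffled scores.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score under a simple scoring scheme."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best


def shuffle_p_value(a, b, trials=200, seed=0):
    """Return (observed score, empirical probability of a chance score at least as high)."""
    rng = random.Random(seed)              # fixed seed keeps the sketch reproducible
    observed = local_score(a, b)
    letters = list(b)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(letters)               # same composition, scrambled order
        if local_score(a, "".join(letters)) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (trials + 1)


print(shuffle_p_value("QQESGPVRSTC", "RQQEPVRSTC"))
```

Shuffling keeps the amino acid composition fixed, so a low probability indicates that the order of residues, not just their frequencies, produced the score.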
A practical example of sequence alignment
MASH-1, a transcription factor
http://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST results
Detailed BLAST results
A pairwise alignment with MASH-1
HASH-2, a human homolog of MASH-1
“+” indicates conservative amino acid substitution
“–” indicates gap/insertion
XXXX… shows areas of low complexity (common occurrence)
Multiple-sequence alignments
Uses of multiple-sequence alignments
Automated reconstruction of sequence fragments
Phylogenetic analysis
Identification of sequence families
The problem of multiple-sequence alignment
O(N^M), where N is the average sequence length and M is the number of sequences being aligned (optimal methods)
Dynamic programming will work only for small M
Heuristic methods are required
Some methods for global multiple-sequence alignment
Progressive methods
Align most closely related sequences, and then less related ones
Use phylogenetic trees to quantify similarities
Downside: poor results with distantly related sequences
Iterative methods
Start with progressive alignment
Realign sequences after leaving one sequence out
Add left-out sequence
Repeat until acceptable alignment is achieved
Probabilistic methods
Hidden Markov models (discussed later)
Phylogenetic analysis
Phylogenetic trees
Describe evolutionary relationships between sequences
Three common methods
Maximum parsimony
Distance
Maximum likelihood
[Figure: human immunodeficiency viruses from around the world]
Comparison of methods for phylogenetic analysis
Maximum parsimony (machine input; closely related sequences)
Finds optimal tree (or trees) requiring minimum number of substitutions to explain sequence variation
Maximum likelihood (user input; distantly related sequences)
Finds most probable tree
Similar to maximum parsimony
Distance (mix of closely and distantly related sequences)
Compare pairs of sequences for number of differences between them
Use many methods to get a consensus tree
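The first step of a distance method, counting differences between every pair of aligned sequences, can be sketched as follows. The sequences and names are made up for illustration, and the sketch stops at the distance table; building the tree from it (for example by repeatedly joining the closest pair) is not shown.

```python
from itertools import combinations


def hamming(a, b):
    """Count positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_table(aligned):
    """Map each pair of sequence names to its number of differences."""
    return {
        (n1, n2): hamming(aligned[n1], aligned[n2])
        for n1, n2 in combinations(sorted(aligned), 2)
    }


# Hypothetical aligned sequences, not taken from the slides.
aligned = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "TCGAACGA",
}
table = distance_table(aligned)
closest = min(table, key=table.get)
print(table)
print(closest)                 # the pair a distance method would join first
```

Raw difference counts underestimate divergence when the same site has changed more than once, which is one reason the substitution model matters.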
Algorithmic complexity and phylogenetic analysis
Four steps
Sequence alignment
Substitution model (scoring matrix)
Tree building
Tree evaluation
Tree building and evaluation are computationally expensive
Heuristic methods required in most cases
Gene prediction
A problem of pattern recognition
Algorithms look for features of genes, e.g., splice sites, ORFs, starting methionine
Identification of regulatory regions is very difficult
Statistical understanding of genes is ongoing
Problems of this type require machine learning algorithms, which learn the pattern from a small training dataset
Central Dogma in Molecular Biology
Artificial neural networks
Machine learning algorithms that mimic the brain
Connections between “neurons” vary in strength
Connection weights (w_ij) (strength) change while network is exposed to training set
Fully trained network recognizes pattern in novel input
GRAIL
[Figure: a feed-forward neural network with input, hidden, and output layers]
Hidden Markov models
Can be used for machine learning
Units constitute transition states
Transitions not dependent on history
Many uses in genomics
Gene prediction
Multiple sequence alignment
Finding periodic patterns
[Figure: HMM state diagram with start and end states]
HMMs
The example of a dishonest gambler is often used to illustrate this point.
The gambler may carry a loaded die that he or she occasionally substitutes
for a fair die, but not so often that the other players would notice. The fair
die has a one-in-six chance of showing any particular number. When using
the loaded die, a player will have a 50% chance of rolling a one and a 10%
chance of rolling any other number. It is in these types of situations that
stochastic models called hidden Markov models (HMMs) are useful,
because they take into account unknown (or hidden) states. For example,
exactly when the cheating gambler is using a fair or loaded die is hidden
from the other players, but insight may still be gained by looking at the
outcome of the cheater’s rolls. If he or she rolls three ones in a row, it is
more likely that the loaded die is being used: the loaded die has a 12.5%
chance of generating three ones in a row, whereas the fair die has only
about a 0.5% chance. Hidden Markov models describe the probability of
transitions between hidden states, as well as the probabilities associated
with each state. In the example of the cheating gambler, an HMM would
describe the probabilities of rolling particular numbers given the loaded or
fair die and the probability that the dishonest gambler would switch from
one die to another.
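The two percentages quoted for three ones in a row can be checked with a few lines. The dictionary and function names are illustrative; the probabilities are those given in the text.

```python
# Emission probabilities of each die, as described in the text.
FAIR = {face: 1 / 6 for face in range(1, 7)}
LOADED = {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1}


def likelihood(rolls, die):
    """Probability of a series of independent rolls from a single die."""
    p = 1.0
    for roll in rolls:
        p *= die[roll]
    return p


rolls = [1, 1, 1]
p_loaded = likelihood(rolls, LOADED)   # 0.5 ** 3
p_fair = likelihood(rolls, FAIR)       # (1 / 6) ** 3
print(f"loaded: {p_loaded:.3%}  fair: {p_fair:.3%}  ratio: {p_loaded / p_fair:.0f}")
```

These are probabilities of the outcome given each die. Turning them into the probability that the loaded die is in use also requires how often the gambler switches dice, which is what the transition probabilities of the HMM supply.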
HMMs continued
Hidden Markov models can be used to answer three types of questions. The first
type is the likelihood question: Given a particular HMM, what is the probability of
obtaining a particular outcome (e.g., rolling three ones)? The second type is the
decoding question: Given a particular HMM, what is the most likely sequence of
transitions between states for a particular outcome? In the case of the cheating
gambler, this sequence would be the order in which he or she transitioned from one
die to another. The third type is the learning question: Given a particular outcome
and set of assumptions about possible transition states, what are the best model
parameters (e.g., probabilities between transition states)? This third question allows
HMMs to be used for machine learning. The figure in the slide shows a simple
example of a hidden Markov model being used to account for the DNA sequence at
the bottom. Every HMM has a start and end state, denoted by the S and E,
respectively, in the slide. Hidden states lie between the start and end states. In the
figure, the squares are states, and the lines between them indicate the probability of
one state transitioning to another. The loops on the upper and lower states show the
probabilities associated with the state remaining the same. States transition back
and forth until the HMM reaches the end state. In this HMM, the top square
represents a state that has equal probabilities of generating A, G, C, or T. The
bottom state has probabilities of 0.1, 0.1, 0.1, and 0.7 of generating A, G, C, and T,
respectively.
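The likelihood and decoding questions can be sketched for the two-state DNA model described above. The emission probabilities come from the text; the start and transition probabilities are not given there, so the values below are assumptions chosen only for illustration, the explicit end state is omitted, and the state names are hypothetical.

```python
STATES = ("uniform", "t_rich")
START = {"uniform": 0.5, "t_rich": 0.5}                  # assumed, not from the slides
TRANS = {                                                # assumed, not from the slides
    "uniform": {"uniform": 0.9, "t_rich": 0.1},
    "t_rich": {"uniform": 0.1, "t_rich": 0.9},
}
EMIT = {                                                 # emission probabilities from the text
    "uniform": {"A": 0.25, "G": 0.25, "C": 0.25, "T": 0.25},
    "t_rich": {"A": 0.1, "G": 0.1, "C": 0.1, "T": 0.7},
}


def forward(seq):
    """Likelihood question: probability of a non-empty sequence, summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for sym in seq[1:]:
        alpha = {
            s: EMIT[s][sym] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(seq):
    """Decoding question: the single most probable state path for a non-empty sequence."""
    best = {s: (START[s] * EMIT[s][seq[0]], [s]) for s in STATES}
    for sym in seq[1:]:
        new = {}
        for s in STATES:
            # Best way to arrive in state s from any previous state.
            prob, path = max(
                ((best[r][0] * TRANS[r][s], best[r][1]) for r in STATES),
                key=lambda item: item[0],
            )
            new[s] = (prob * EMIT[s][sym], path + [s])
        best = new
    return max(best.values(), key=lambda item: item[0])[1]


print(forward("ACGTTTTT"))
print(viterbi("ACGACGTTTTTTTT"))
```

The learning question would adjust `START`, `TRANS`, and `EMIT` to fit training sequences, typically with the Baum-Welch procedure, and is not shown here.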
HMMs for gene prediction
HMMs are trained on sequences that are members of a known gene class
HMM gives probability that a particular sequence belongs to the gene class
Length of the bar indicates probability
The longer the bar, the higher the probability
Genscan: gene-predicting program
2000 human introns
Algorithms for secondary-structure determination
Chou-Fasman / GOR method
Based on experimentally determined frequency of amino acids in secondary structures
Machine learning algorithms
Neural networks: trained on proteins whose three-dimensional structures have already been determined
Nearest-neighbor methods: closest matches
Trained on previously deduced structures to detect amino acid patterns in secondary structures
Analysis of microarray data
Microarrays can measure the expression of thousands of genes simultaneously
Vast amounts of data require computers
Types of analysis
Gene-by-gene
Method: Statistical techniques
Categorizing groups of genes
Method: Clustering algorithms
Deducing patterns of gene regulation
Method: Under development
Unsupervised techniques
Make no assumptions about how the data should behave
Cluster genes based on similar patterns of gene expression
Examples
Hierarchical clustering
Principal components analysis (PCA)
[Figure panels: hierarchical clustering; PCA]
Metrics for gene expression
Need a method to measure how similar genes are based on expression
Examples
Euclidean distance
Pearson correlation coefficient
[Figure panels: Euclidean distance; Pearson correlation coefficient]
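The two metrics can behave very differently on the same pair of genes, as the following sketch shows. The expression profiles are made-up numbers chosen so that the two genes rise together but differ tenfold in magnitude.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical expression levels across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]
print(euclidean_distance(gene_a, gene_b))   # large: the magnitudes differ
print(pearson_correlation(gene_a, gene_b))  # close to 1: the pattern is the same
```

Euclidean distance is sensitive to absolute expression level, while Pearson correlation compares only the shape of the profiles, so the choice of metric determines which genes end up clustered together.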
Supervised techniques
Divide groups of genes based on sample properties
Can predict sample condition based on gene expression pattern
Examples
Support vector machine
Nearest neighbor
[Figure panels: nearest neighbor; support vector machine]
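A nearest-neighbor classifier is the simpler of the two examples to sketch: a new sample receives the label of the training sample whose expression profile is closest to it. The profiles and the labels "tumor" and "normal" below are hypothetical, and Euclidean distance is an assumed choice of metric.

```python
import math


def nearest_neighbor(training, profile):
    """Return the label of the training profile closest to `profile`.

    `training` is a list of (expression_profile, label) pairs.
    """
    closest = min(training, key=lambda item: math.dist(item[0], profile))
    return closest[1]


# Hypothetical labeled samples: expression of four genes per sample.
training = [
    ([2.0, 1.8, 0.1, 0.2], "tumor"),
    ([1.9, 2.1, 0.3, 0.1], "tumor"),
    ([0.2, 0.1, 1.7, 2.2], "normal"),
    ([0.3, 0.2, 2.0, 1.9], "normal"),
]
print(nearest_neighbor(training, [1.7, 2.0, 0.2, 0.3]))
print(nearest_neighbor(training, [0.1, 0.3, 1.8, 2.0]))
```

Unlike clustering, this approach needs samples whose condition is already known, which is what makes it a supervised technique.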
The usefulness of simulation
Why simulate when you can experiment?
Models involving many parameters may be difficult to conceptualize without simulations
A simulation may suggest ways of testing a hypothesis
Some experiments cannot be done in vivo or in vitro, and must therefore be done in silico
If a simulation is good, it can be used in place of more expensive or time-consuming experiments (example: simulated nuclear tests in the US)
Numerical methods
Numerical methods are needed because of the discrete nature of computers
Differential equations are turned into difference equations that deal with discrete rather than continuous quantities
Smaller steps lead to greater simulation accuracy
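The effect of step size can be seen with the simplest numerical scheme, Euler's method, applied to exponential decay, dx/dt = -k x. The equation and parameter values are an assumed example, chosen because the exact solution x0 * exp(-k t) is known and the error can be measured directly.

```python
import math


def euler_decay(x0, k, t_end, dt):
    """Integrate dx/dt = -k * x from t = 0 to t_end with fixed steps of size dt."""
    steps = round(t_end / dt)
    x = x0
    for _ in range(steps):
        x += dt * (-k * x)      # difference equation: x(t + dt) = x(t) + dt * dx/dt
    return x


exact = math.exp(-1.0)          # exact value of x(1) for x0 = 1, k = 1
for dt in (0.5, 0.1, 0.01):
    approx = euler_decay(1.0, 1.0, 1.0, dt)
    print(f"dt={dt:<5} approx={approx:.5f} error={abs(approx - exact):.5f}")
```

Shrinking the step tenfold reduces the error roughly tenfold here, but it also multiplies the number of calculations by ten, which is the trade-off behind the cost of accurate simulation.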
Examples of computer simulations in biology
Gene regulatory networks
Simulations of cells
Networks of neurons
Population genetics
[Figure: a model of gene regulation]
Prospects for a fully simulated cell
Limitations of computer simulation
Algorithmic
Computers can process only discrete values
Simulating continuous behavior accurately often requires an infeasible number of calculations
Experimental
Simulation is only as good as the data it is based on
Critical data are often missing from simulation
Conceptual
Overly complex simulations do not contribute to understanding of a biological system
Summary
Vast amounts of data require bioinformatics
Bioinformatics methods are limited by the following:
Algorithmic complexity of bioinformatics problems
Computer hardware performance
Heuristic methods are used to get around these limitations
Bioinformatics methods are used in the following areas:
Sequence alignment
Phylogenetic-tree construction
Gene prediction
Secondary-structure determination
Analysis of microarray data
Simulation of biological systems
Take home test
1) Define the term "bioinformatics". What are some major applications of bioinformatics in the field of genomics (give about 5)?
2) What features of digital computers are suitable and unsuitable for bioinformatics analysis? Discuss the pros and cons of using computers for simulating biological systems.
3) What is the difference between a local and a global alignment? What is the biological significance of sequence alignments? Why is it necessary to align multiple DNA/protein sequences?
4) Describe any two methods of sequence alignment: visual, dynamic programming, or heuristic.