Human Origins

powerfultennesseeBiotechnology

Oct 2, 2013


© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458

Chapter 6


The Computational Foundations of
Genomics

Applying algorithms to analyze
genomics data


Contents


What are computational biology and
bioinformatics?


Understanding computers and algorithms


Sequence alignment


Gene prediction


Algorithms for analysis of phylogeny


Analysis of microarray data


Computer simulation


Computational Biology and
Bioinformatics


Computational biology


Development of computational methods to solve
problems in biology


Bioinformatics


Application of computational biology to analysis and
management of real molecular biology data


Why do molecular biologists need computer science?


Discrete nature of sequence data is ideal for analysis
using digital computers


Size and complexity of genomics data make the data
impossible to analyze without computers


Algorithm


An algorithm is a procedure (a finite set of
well-defined instructions) for accomplishing
some task


A recipe to perform a task


Algorithms often have steps that repeat
(iterate) or require decisions (such as logic
or comparison). Algorithms can be composed to
create more complex algorithms.


Concept formalized in 1936 (Church, Turing)


A historical perspective


The 1960s: the birth of
bioinformatics


High-level computer
languages


Protein sequence data


Academic access to
computers


Margaret Oakley Dayhoff


First protein database


First program for automatic
sequence assembly

[Figure: IBM 7090 computer]


Solving problems in computer science


Necessary parameters for assessing the difficulty of a
computer science problem


Algorithmic complexity


Is the problem theoretically solvable?


If so, what is the most efficient solution?


Current state of computer technology


Memory


CPU speed


Cost


Sequencing entire genomes via the shotgun approach was
not possible until the mid-1990s because the
computational power needed was unavailable until that
time.


Algorithmic problems


Example: searching for a number in an unordered list


If the list has N numbers, the average amount of time
the search will take will be proportional to N

A more clever approach


Place the numbers in order


Do a binary search


Step 1: Pick the number in the middle of the list


Step 2: Restrict the search to the half that contains your
number


Return to Step 1 until you find your number


Time for this approach (once the list is in order) is
proportional to log2(N)
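The two search strategies on this slide can be sketched in Python as follows. This is only an illustration: the function names and the sample list are made up for the example and do not come from the slides.

```python
from typing import Optional, Sequence


def linear_search(items: Sequence[int], target: int) -> Optional[int]:
    """Check every element in turn; time grows in proportion to N."""
    for index, value in enumerate(items):
        if value == target:
            return index
    return None


def binary_search(sorted_items: Sequence[int], target: int) -> Optional[int]:
    """Halve the search range at every step; time grows with log2(N)."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        middle = (low + high) // 2          # Step 1: pick the middle element
        if sorted_items[middle] == target:
            return middle
        if sorted_items[middle] < target:   # Step 2: keep the half that can
            low = middle + 1                # still contain the target
        else:
            high = middle - 1
    return None


numbers = [42, 7, 19, 3, 88, 61, 25]        # hypothetical unordered list
ordered = sorted(numbers)                   # binary search needs ordered input

print(linear_search(numbers, 88))
print(binary_search(ordered, 61))
```

Sorting has its own cost, so the ordered approach pays off mainly when the same list is searched many times.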


The digital computer


Represents everything in
a code of zeros and ones


Computer architecture


CPU
(Central Processing Unit)


Memory


Input / Output


Advantages of digital
computer


Deterministic


Minimization of noise

[Figure: computer architecture diagram showing Input, CPU, Memory, and Output]


The limitations of digital computers


The limitations of digital computers are
conceptual, not just technological


Digital computers are deterministic


Incapable of truly random behavior


Digital computers deal with strictly discrete
values


Can only approximate continuous behavior


Many interesting biological phenomena occur
in the continuous realm of space and time
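Both limitations can be seen in a few lines of Python. This is a small sketch using the standard library; the seed value and function name are arbitrary choices for the example.

```python
import random
from typing import List


def pseudo_random_rolls(seed: int, count: int) -> List[int]:
    """Return `count` simulated die rolls from a generator started at `seed`."""
    generator = random.Random(seed)
    return [generator.randint(1, 6) for _ in range(count)]


# Determinism: the same seed always reproduces the same "random" sequence.
first_run = pseudo_random_rolls(seed=2005, count=10)
second_run = pseudo_random_rolls(seed=2005, count=10)
print(first_run == second_run)   # True

# Discreteness: real numbers are stored as finite binary approximations,
# so arithmetic on them is only approximately exact.
print(0.1 + 0.2 == 0.3)          # False
```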


Sequence databases


What is a database?


An indexed set of records


Records retrieved using a query language


Database technology is well established


Examples of sequence databases


GenBank (NCBI)


Encompasses all publicly available protein and
nucleotide sequences


Protein Data Bank


Contains 3-D structures of proteins


The client-server model

From a single computer to GCG to the Internet


The clients and servers
are software processes


Clients request data
from servers


Servers and clients can
reside on the same or
different machines


Clients can act as
servers to other
processes and vice versa

[Figure: client-server diagram with Web Browser, Web Server, BLAST Search Engine, and Database]


Sequence alignment


Sequence alignments
search for matches
between sequences


Two broad classes of
sequence alignments


Global (wide):
maximize overall score


Local (narrow): high
score in limited area


Alignment can be
performed between two
or more sequences

[Figure: QKESGPSSSYC and VQQESGLVRTTC compared by global alignment (full length) and by local alignment (shared segment ESG)]


The biological importance of sequence
alignment


Sequence alignments assess the degree of similarity
between sequences


Similar sequences suggest similar function


Proteins with similar sequences are likely to play
similar biochemical roles


Regulatory DNA sequences that are similar will likely
have similar roles in gene regulation


Sequence similarity suggests evolutionary history


Fewer differences mean more recent divergence


Orthologs versus paralogs


The algorithmic problem of aligning
sequences


Comparison of similar
sequences of similar
length is straightforward


How does one deal with
insertions and gaps that
may hide true similarity?


How does one interpret
minimal similarity?


Are sequences actually
related?


Is alignment by chance?

[Figure: example sequence pairs]
QQESGPVRSTC / QKGSYQEKGYC
QQESGPVRSTC / RQQEPVRSTC
QQESGPVRSTC / QKESGPSRSYC


Methods of sequence alignment


Graphical methods: visual


Dynamic-programming methods:
mathematically best but needs time


Heuristic methods:
approximate but close to
real answer


Dot matrix analysis


A graphical method


Shows all possible
alignments


Caveats


Some guesswork in
picking parameters


Window size


Stringency


Not as rigorous or
quantitative as other
methods

[Figure: dot matrix comparing QQESGPVRSTC with RQQEPVRSTC]
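A dot matrix for the two sequences on this slide can be sketched as below. The implementation is an assumption about how window size and stringency are applied (a dot is drawn when at least `stringency` of the `window` residues match); the function names are made up for the example.

```python
from typing import List


def dot_matrix(seq_a: str, seq_b: str,
               window: int = 1, stringency: int = 1) -> List[List[bool]]:
    """Mark cell (i, j) when the windows starting at seq_a[i] and seq_b[j]
    share at least `stringency` identical positions."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= stringency)
        matrix.append(row)
    return matrix


def render(matrix: List[List[bool]]) -> str:
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)


plot = dot_matrix("QQESGPVRSTC", "RQQEPVRSTC", window=3, stringency=3)
print(render(plot))
```

With a window of 1 and stringency of 1 every single matching residue produces a dot, which is where most of the noise in such plots comes from.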


Dot matrix analysis: a real example

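[Figure: two dot plots of the same comparison, one with window size 23 and stringency 15, the other with window size 1 and stringency 1, showing the difference in noise-to-signal ratio]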


Devising a scoring system


Scoring matrices allow biologists to quantify the
quality of sequence alignments


Use different scoring matrices for different purposes


Score for similar structural domains in proteins


Score for evolutionary relationship


Some popular scoring matrices


PAM for evolutionary studies (Percent Accepted
Mutation)


BLOSUM for finding common motifs (BLOcks amino
acid SUbstitution Matrix)


An example of scoring

BLOSUM62 (excerpt)

      A   R   N   D   C   Q   E
A     4  -1  -2  -2   0  -1  -1
R    -1   5   0  -2  -3   1   0
N    -2   0   6   1  -3   0   0
D    -2  -2   1   6  -3   0   2
C     0  -3  -3  -3   9  -3  -4
Q    -1   1   0   0  -3   5   2
E    -1   0   0   2  -4   2   5

A sequence comparison

Sequence 1:   A   D   D   R   Q   C   E   R   A   D
Sequence 2:   A   Q   E   R   Q   E   C   Q   A   Q
Score:        4   0   2   5   5  -4  -4   1   4   0

Total score: 13
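The scoring of an ungapped comparison can be sketched in Python using the seven-residue BLOSUM62 excerpt from this slide. The names `BLOSUM62_SUBSET` and `column_scores` are made up for the example, and only the residues A, R, N, D, C, Q, and E are supported.

```python
from typing import Dict, List, Tuple

RESIDUES = "ARNDCQE"
ROWS = [
    [4, -1, -2, -2, 0, -1, -1],   # A
    [-1, 5, 0, -2, -3, 1, 0],     # R
    [-2, 0, 6, 1, -3, 0, 0],      # N
    [-2, -2, 1, 6, -3, 0, 2],     # D
    [0, -3, -3, -3, 9, -3, -4],   # C
    [-1, 1, 0, 0, -3, 5, 2],      # Q
    [-1, 0, 0, 2, -4, 2, 5],      # E
]

# Look-up table keyed by residue pair, e.g. ("D", "E") -> 2.
BLOSUM62_SUBSET: Dict[Tuple[str, str], int] = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(RESIDUES)
    for j, b in enumerate(RESIDUES)
}


def column_scores(top: str, bottom: str) -> List[int]:
    """Score each aligned column of two equal-length, ungapped sequences."""
    if len(top) != len(bottom):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return [BLOSUM62_SUBSET[(a, b)] for a, b in zip(top, bottom)]


scores = column_scores("ADDRQCERAD", "AQERQECQAQ")
print(scores)
print("Total score:", sum(scores))
```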


Dynamic programming (DP)


Possibility of gaps (or insertions) makes number of
possible sequence alignments astronomical


Dynamic programming makes sequence alignment
possible by abandoning low scoring alignments
among subsequences as the algorithm progresses


Mathematically proven to provide optimal alignments


DP algorithms for sequence alignment


Needleman-Wunsch-Gotoh algorithm for global
alignments


Smith-Waterman algorithm for local alignments


DP alignment algorithms are still too slow for searching
an entire sequence database
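The two dynamic-programming recurrences can be sketched compactly in Python. This is a simplified illustration, not the algorithms as published: it uses an assumed match/mismatch score (+1/-1) instead of a scoring matrix, a linear gap penalty (-2 per gap position) rather than Gotoh's affine gaps, and it returns only the optimal score without tracing back the alignment.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch style score: both sequences aligned end to end."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal,                # align a[i-1] with b[j-1]
                               previous[j] + gap,       # gap in b
                               current[j - 1] + gap))   # gap in a
        previous = current
    return previous[-1]


def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman style score: best-scoring pair of subsequences."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, so poor prefixes are abandoned.
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(global_alignment_score("ACGT", "AGT"))
print(local_alignment_score("QKESGPSSSYC", "VQQESGLVRTTC"))
```

Each cell is computed once from three neighbors, so the work grows with the product of the two sequence lengths, which is why searching a whole database this way is slow.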


Heuristic methods with k-tuples


Example: BLAST/FASTA


Using query sequence,
derive a list of words
(tuples) of length w
(e.g., 3)


Keep high-scoring
matching words


High-scoring words are
compared with database
sequences


Sequences with many
matches to high-scoring
words are used as anchors
(the order of the words
matters, not just their
presence) for final alignments
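The word-matching step can be sketched as follows. This is a simplified illustration that keeps only exact word matches; BLAST additionally scores similar (non-identical) words against a threshold, which is not implemented here. All function names are made up for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def words(sequence: str, w: int = 3) -> List[str]:
    """All overlapping words (k-tuples) of length w."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def index_words(subject: str, w: int = 3) -> Dict[str, List[int]]:
    """Map every word in the database sequence to its start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, word in enumerate(words(subject, w)):
        index[word].append(position)
    return index


def word_hits(query: str, subject: str, w: int = 3) -> List[Tuple[int, int, str]]:
    """Return (query position, subject position, word) for each exact hit."""
    subject_index = index_words(subject, w)
    hits = []
    for q_pos, word in enumerate(words(query, w)):
        for s_pos in subject_index.get(word, []):
            hits.append((q_pos, s_pos, word))
    return hits


def hits_per_diagonal(hits: List[Tuple[int, int, str]]) -> Counter:
    """Hits sharing a diagonal (query pos - subject pos) keep the same order
    in both sequences, so a crowded diagonal is a good alignment anchor."""
    return Counter(q_pos - s_pos for q_pos, s_pos, _ in hits)


hits = word_hits("QQESGPVRSTC", "RQQEPVRSTC")
print(hits)
print(hits_per_diagonal(hits))
```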


Statistical significance


Chance alignments have no biological significance


Statistical significance implies low probability of
generating a chance alignment


Probability of long alignments increases with longer
sequences


The extreme-value distribution (basis of the E value)


Used to calculate the probability of chance alignment


Generated by calculating the scores resulting from
repeatedly scrambling one of the sequences being
compared
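The scrambling procedure can be sketched in Python as below. This is an illustration under stated assumptions: a simple local-alignment score (+1 match, -1 mismatch, -2 per gap position) stands in for a real scoring scheme, and the significance estimate is an empirical fraction of scrambled scores rather than a fitted extreme-value distribution. Function names, trial count, and seed are arbitrary.

```python
import random
from typing import List


def local_score(a: str, b: str, match: int = 1,
                mismatch: int = -1, gap: int = -2) -> int:
    """Best local-alignment score (Smith-Waterman style, linear gaps)."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


def scrambled_scores(a: str, b: str, trials: int = 200, seed: int = 0) -> List[int]:
    """Scores obtained after repeatedly shuffling the residues of b."""
    rng = random.Random(seed)
    letters = list(b)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(local_score(a, "".join(letters)))
    return scores


def empirical_p_value(observed: int, null_scores: List[int]) -> float:
    """Fraction of scrambled scores at least as high as the observed one
    (with the usual +1 correction so the estimate is never exactly zero)."""
    at_least = sum(1 for score in null_scores if score >= observed)
    return (at_least + 1) / (len(null_scores) + 1)


observed = local_score("QQESGPVRSTC", "QKESGPSRSYC")
null_scores = scrambled_scores("QQESGPVRSTC", "QKESGPSRSYC")
print(observed, empirical_p_value(observed, null_scores))
```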


A practical example of sequence alignment


MASH-1, a transcription factor

http://blast.ncbi.nlm.nih.gov/Blast.cgi



BLAST results


Detailed BLAST results


A pairwise alignment with MASH-1


HASH-2, a human homolog of MASH-1


"+" indicates conservative amino acid substitution


"-" indicates gap/insertion


XXXX… shows areas of low complexity (common occurrence)


Multiple-sequence alignments


Uses of multiple-sequence alignments


Automated reconstruction of sequence fragments


Phylogenetic analysis


Identification of sequence families


The problem of multiple-sequence alignment


O(N^M), where N is the average sequence length and M is
the number of sequences being aligned (optimal
methods)


Dynamic programming will work only for small M


Heuristic methods are required


Some methods for global
multiple-sequence alignment


Progressive methods


Align most closely related sequences, and then less
related ones


Use phylogenetic trees to quantify similarities


Downside: poor results with distantly related
sequences


Iterative methods


Start with progressive alignment


Realign sequences after leaving one sequence out


Add left-out sequence


Repeat until acceptable alignment is achieved


Probabilistic methods


Hidden Markov models (discussed later)


Phylogenetic analysis


Phylogenetic trees


Describe evolutionary
relationships between
sequences


Three common methods


Maximum parsimony


Distance


Maximum likelihood

[Figure: human immunodeficiency viruses from around the world]


Comparison of methods for
phylogenetic analysis


Maximum parsimony (machine input) (closely related
sequences)


Finds optimal tree (or trees) requiring minimum
number of substitutions to explain sequence variation


Maximum likelihood (user input) (distantly related)


Finds most probable tree


Similar to maximum parsimony


Distance (mix of close and distantly related)


Compare pairs of sequences for number of differences
between them


Use many methods to get consensus tree
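The first step of a distance method, counting differences between every pair of aligned sequences, can be sketched as below. The sequence labels are hypothetical, the sequences are taken from the alignment slides earlier in the deck, and the sketch stops at the distance table; building the tree from it is not shown.

```python
from itertools import combinations
from typing import Dict, Tuple


def count_differences(a: str, b: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def distance_table(named: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Pairwise difference counts for every pair of named sequences."""
    return {
        (name_1, name_2): count_differences(seq_1, seq_2)
        for (name_1, seq_1), (name_2, seq_2) in combinations(named.items(), 2)
    }


aligned = {                       # hypothetical labels
    "seq_a": "QQESGPVRSTC",
    "seq_b": "QKGSYQEKGYC",
    "seq_c": "QKESGPSRSYC",
}
table = distance_table(aligned)
closest = min(table, key=table.get)   # the pair a tree would join first
print(table)
print(closest)
```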


Algorithmic complexity and
phylogenetic analysis


Four steps


Sequence alignment


Substitution model (scoring matrix)


Tree building


Tree evaluation


Tree building and evaluation are
computationally expensive


Heuristic methods required in most cases


Gene prediction


A problem of pattern recognition


Algorithms look for features of genes:


E.g., Splice sites, ORFs, starting methionine


Identification of regulatory regions is very
difficult


Statistical understanding of genes is ongoing


Problems of this type require machine learning
algorithms: learn the pattern from a
small training dataset
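The simplest of the features listed above, an open reading frame that begins with a starting methionine codon, can be found by a direct scan. This is a deliberately naive sketch: it searches only the forward strand, ignores introns and splice sites, and uses a made-up test sequence and function name.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of ORFs running from ATG through a
    stop codon, in each of the three forward reading frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # starting methionine
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next ORF
    return orfs


sequence = "CCATGAAATTTGGGTAACC"   # hypothetical test sequence
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```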


Central Dogma in Molecular Biology


Artificial neural networks


Machine learning algorithms
that mimic the brain


Connections between
“neurons” vary in strength


Connection weights (w_ij, the
connection strengths) change while
the network is exposed to a
training set


Fully trained network
recognizes pattern in novel
input


GRAIL

[Figure: a feed-forward neural network with input, hidden, and output layers]
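The forward pass of a small feed-forward network can be sketched in plain Python. The layer sizes, weights, and sigmoid activation below are arbitrary choices for illustration; they are not GRAIL's architecture, and training (adjusting the weights w_ij) is not shown.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs: List[float], weights: List[List[float]],
          biases: List[float]) -> List[float]:
    """One layer: weights[i][j] is the strength of the connection from
    input j to neuron i."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


def feed_forward(inputs: List[float],
                 hidden_weights: List[List[float]], hidden_biases: List[float],
                 output_weights: List[List[float]], output_biases: List[float]
                 ) -> List[float]:
    """Signals flow in one direction: input -> hidden -> output."""
    hidden = layer(inputs, hidden_weights, hidden_biases)
    return layer(hidden, output_weights, output_biases)


# Hypothetical network: 3 inputs, 2 hidden neurons, 1 output neuron.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.2, -0.7]]
output_b = [0.05]

print(feed_forward([1.0, 0.0, 1.0], hidden_w, hidden_b, output_w, output_b))
```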


Hidden Markov models


Can be used for machine
learning


Units constitute
transition states


Transitions not
dependent on history


Many uses in genomics


Gene prediction


Multiple sequence
alignment


Finding periodic
patterns

[Figure: HMM state diagram with start and end states]


HMMs


The example of a dishonest gambler is often used to illustrate this point.
The gambler may carry a loaded die that he or she occasionally substitutes
for a fair die, but not so often that the other players would notice. The fair
die has a one-in-six chance of showing any particular number. When using
the loaded die, a player will have a 50% chance of rolling a one and a 10%
chance of rolling any other number. It is in these types of situations that
stochastic models called hidden Markov models (HMMs) are useful,
because they take into account unknown (or hidden) states. For example,
exactly when the cheating gambler is using a fair or loaded die is hidden
from the other players, but insight may still be gained by looking at the
outcome of the cheater's rolls. If he or she rolls three ones in a row, it is
more likely that the loaded die is being used, since the loaded die has a
12.5% chance of generating three ones in a row, whereas the fair one has
only about a 0.5% chance. Hidden Markov models describe the probability of
transitions between hidden states, as well as the probabilities associated
with each state. In the example of the cheating gambler, an HMM would
describe the probabilities of rolling particular numbers given the loaded or
fair die and the probability that the dishonest gambler would switch from
one die to another.
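The two percentages for three ones in a row can be checked with exact fractions. This is a small sketch using the die probabilities stated above; the names are made up for the example.

```python
from fractions import Fraction
from typing import Dict, List

# Emission probabilities from the text: a fair die, and a loaded die that
# shows a one half of the time and every other face one time in ten.
FAIR: Dict[int, Fraction] = {face: Fraction(1, 6) for face in range(1, 7)}
LOADED: Dict[int, Fraction] = {1: Fraction(1, 2),
                               **{face: Fraction(1, 10) for face in range(2, 7)}}


def probability_of_rolls(die: Dict[int, Fraction], rolls: List[int]) -> Fraction:
    """Probability of a series of independent rolls with one die."""
    probability = Fraction(1)
    for face in rolls:
        probability *= die[face]
    return probability


three_ones = [1, 1, 1]
loaded_chance = probability_of_rolls(LOADED, three_ones)   # (1/2)^3
fair_chance = probability_of_rolls(FAIR, three_ones)       # (1/6)^3

print(float(loaded_chance))          # 0.125
print(round(float(fair_chance), 4))  # 0.0046
print(loaded_chance / fair_chance)   # 27
```

Three ones in a row are therefore 27 times more probable with the loaded die than with the fair one.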


HMMs continued


Hidden Markov models can be used to answer three types of questions. The first
type is the likelihood question: Given a particular HMM, what is the probability
of obtaining a particular outcome (e.g., rolling three ones)? The second type is
the decoding question: Given a particular HMM, what is the most likely sequence
of transitions between states for a particular outcome? In the case of the
cheating gambler, this sequence would be the order in which he or she
transitioned from one die to another. The third type is the learning question:
Given a particular outcome and set of assumptions about possible transition
states, what are the best model parameters (e.g., probabilities between
transition states)? This third question allows HMMs to be used for machine
learning.

The figure in the slide shows a simple example of a hidden Markov model being
used to account for the DNA sequence at the bottom. Every HMM has a start and
end state, denoted by the S and E, respectively, in the slide. Hidden states lie
between the start and end states. In the figure, the squares are states, and the
lines between them indicate the probability of one state transitioning to
another. The loops on the upper and lower states show the probabilities
associated with the state remaining the same. States transition back and forth
until the HMM reaches the end state. In this HMM, the top square represents a
state that has equal probabilities of generating A, G, C, or T. The bottom state
has probabilities of 0.1, 0.1, 0.1, and 0.7 of generating A, G, C, and T,
respectively.
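The likelihood and decoding questions can be sketched for the dishonest-gambler model. The emission probabilities come from the earlier slide; the start and switching probabilities below are invented for the example, the explicit end state is omitted for brevity, and the learning question (re-estimating parameters from data) is not shown.

```python
from typing import Dict, List

STATES = ["fair", "loaded"]
START = {"fair": 0.5, "loaded": 0.5}                      # assumed
TRANSITION = {                                            # assumed
    "fair": {"fair": 0.95, "loaded": 0.05},
    "loaded": {"fair": 0.10, "loaded": 0.90},
}
EMISSION: Dict[str, Dict[int, float]] = {
    "fair": {face: 1 / 6 for face in range(1, 7)},
    "loaded": {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1},
}


def likelihood(rolls: List[int]) -> float:
    """Likelihood question (forward algorithm): probability of the rolls,
    summed over every possible hidden sequence of dice."""
    forward = {s: START[s] * EMISSION[s][rolls[0]] for s in STATES}
    for face in rolls[1:]:
        forward = {
            s: EMISSION[s][face] * sum(forward[p] * TRANSITION[p][s] for p in STATES)
            for s in STATES
        }
    return sum(forward.values())


def decode(rolls: List[int]) -> List[str]:
    """Decoding question (Viterbi algorithm): most probable hidden sequence."""
    best = {s: START[s] * EMISSION[s][rolls[0]] for s in STATES}
    paths = {s: [s] for s in STATES}
    for face in rolls[1:]:
        new_best, new_paths = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] * TRANSITION[p][s])
            new_best[s] = best[prev] * TRANSITION[prev][s] * EMISSION[s][face]
            new_paths[s] = paths[prev] + [s]
        best, paths = new_best, new_paths
    return paths[max(STATES, key=lambda s: best[s])]


print(likelihood([1, 1, 1]))
print(decode([1, 1, 1, 1, 1, 1]))
print(decode([2, 3, 4, 5, 6, 2]))
```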



HMMs for gene prediction


HMMs are trained on
sequences that are members
of known gene class


HMM gives probability that
a particular sequence
belongs to the gene class


Length of the bar indicates
probability


The longer the bar, the higher
the probability


Genscan: gene-prediction
program

[Figure: 2000 human introns]


Algorithms for secondary-structure
determination


Chou-Fasman / GOR method


Based on experimentally determined frequency
of amino acids in secondary structures


Machine learning algorithms


Neural networks: use proteins whose three-dimensional
structures have already been determined


Nearest-neighbor methods: closest matches


Trained on previously deduced structures to
detect amino acid patterns in secondary
structures


Analysis of microarray data


Microarrays can measure the expression of
thousands of genes simultaneously


Vast amounts of data require computers


Types of analysis


Gene-by-gene


Method: Statistical techniques


Categorizing groups of genes


Method: Clustering algorithms


Deducing patterns of gene regulation


Method: Under development


Unsupervised techniques


Make no assumptions
about how the data
should behave


Cluster genes based on
similar patterns of gene
expression


Examples


Hierarchical clustering


Principal components
analysis (PCA)


[Figure: examples of hierarchical clustering and PCA]


Metrics for gene expression


Need a method to
measure how similar
genes are based on
expression


Examples


Euclidean distance


Pearson correlation
coefficient

[Figure: illustrations of Euclidean distance and the Pearson correlation coefficient]
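Both metrics can be written in a few lines of Python. The expression profiles below are invented to show how the two measures can disagree: the two genes rise and fall together (perfect correlation) but differ greatly in magnitude (large Euclidean distance).

```python
import math
from typing import Sequence


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Correlation of the two profiles' shapes, from -1 to +1.
    Undefined (division by zero) if either profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


gene_a = [1.0, 2.0, 3.0, 4.0]       # hypothetical expression levels
gene_b = [10.0, 20.0, 30.0, 40.0]   # same pattern, ten times the magnitude

print(euclidean_distance(gene_a, gene_b))
print(pearson_correlation(gene_a, gene_b))
```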


Supervised techniques


Divide groups of genes
based on sample
properties


Can predict sample
condition based on gene
expression pattern


Examples


Support vector
machine


Nearest neighbor

[Figure: illustrations of nearest neighbor and support vector machine classification]
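A nearest-neighbor classifier is the simpler of the two examples to sketch. The training profiles and the labels "healthy" and "tumor" below are invented for illustration, and Euclidean distance is an assumed choice of metric.

```python
import math
from typing import List, Sequence, Tuple


def distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest_neighbor(sample: Sequence[float],
                     training: List[Tuple[Sequence[float], str]]) -> str:
    """Predict the label of the training profile closest to the sample."""
    best_label, best_distance = "", float("inf")
    for profile, label in training:
        d = distance(sample, profile)
        if d < best_distance:
            best_label, best_distance = label, d
    return best_label


# Hypothetical training set: expression of two genes in labeled samples.
training_set = [
    ([1.0, 1.2], "healthy"),
    ([0.8, 1.5], "healthy"),
    ([9.0, 8.5], "tumor"),
    ([8.2, 9.1], "tumor"),
]

print(nearest_neighbor([8.0, 7.0], training_set))
print(nearest_neighbor([1.1, 0.9], training_set))
```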


The usefulness of simulation


Why simulate when you can experiment?


Models involving many parameters may be
difficult to conceptualize without simulations


A simulation may suggest ways of testing a
hypothesis


Some experiments cannot be done in vivo, or in
vitro, and must therefore be done in silico


If a simulation is good, it can be used in place
of more expensive or time-consuming
experiments (e.g., nuclear experiments by the US)


Numerical methods


Numerical methods are needed because of the
discrete nature of computers


Differential equations are turned into
difference equations that deal with discrete
rather than continuous quantities


Smaller steps lead to greater simulation
accuracy
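The conversion from a differential equation to a difference equation can be sketched with Euler's method. The decay equation dx/dt = -k·x is chosen here only because its exact solution, x0·e^(-k·t), is known and can be compared with the simulated value; the function name and parameter values are made up for the example.

```python
import math


def euler_decay(x0: float, rate: float, t_end: float, steps: int) -> float:
    """Simulate dx/dt = -rate * x with the difference equation
    x[n+1] = x[n] + dt * (-rate * x[n])."""
    dt = t_end / steps
    x = x0
    for _ in range(steps):
        x += dt * (-rate * x)
    return x


exact = 1.0 * math.exp(-1.0 * 1.0)          # x0 = 1, rate = 1, t = 1

for steps in (10, 100, 1000):
    approx = euler_decay(x0=1.0, rate=1.0, t_end=1.0, steps=steps)
    print(steps, approx, abs(approx - exact))
```

The error shrinks as the step count grows, at the price of proportionally more calculations.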


Examples of computer simulations in biology


Gene regulatory
networks


Simulations of cells


Networks of neurons


Population genetics

[Figure: a model of gene regulation]


Prospects for a fully simulated cell


Limitations of computer
simulation


Algorithmic


Computers can process only discrete values


Simulating continuous behavior accurately often
requires an infeasible number of calculations


Experimental


Simulation only as good as data it is based on


Critical data often missing from simulation


Conceptual


Overly complex simulations do not contribute to
understanding of a biological system


Summary


Vast amounts of data require bioinformatics


Bioinformatics methods are limited by the following:


Algorithmic complexity of bioinformatics problems


Computer hardware performance


Heuristic methods used to get around these limitations


Bioinformatics methods used in the following areas:


Sequence alignment


Phylogenetic-tree construction


Gene prediction


Secondary-structure determination


Analysis of microarray data


Simulation of biological systems


Take home test


1) Define the term "bioinformatics". What are some
major applications of bioinformatics in the field of
genomics (give about 5)?


2) What features of digital computers are suitable and
unsuitable for bioinformatics analysis? Discuss the
pros and cons of using computers for simulating
biological systems.



3) What is the difference between a local and a global
alignment? What is the biological significance of
sequence alignments? Why is it necessary to align
multiple DNA/protein sequences?


4) Describe any two methods of sequence alignments:
visual, dynamic programming or heuristic.