Oct 2, 2013 (3 years and 10 months ago)

Introduction to bioinformatics

lecture 8

Deriving amino acid exchange matrices (II)
and Multiple sequence alignment (I)

Summary Dayhoff's PAM-matrices

Derived from global alignments of closely related sequences.

Matrices for greater evolutionary distances are extrapolated from those for lesser ones.

The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.

Several later groups have attempted to extend Dayhoff's methodology or re-apply her analysis using later databases with more examples.

Extensions of Dayhoff's methodology:

> Jones, Thornton and coworkers used the same methodology as Dayhoff but with modern databases (CABIOS 8:275).

> Gonnet and coworkers (Science 256:1443) used a slightly different (but theoretically equivalent) methodology.

> Henikoff & Henikoff (Proteins 17:49) compared these two newer versions of the PAM matrices with Dayhoff's originals.
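
The extrapolation step can be illustrated with a small sketch. In Dayhoff's scheme the mutation probability matrix for a larger distance is obtained by multiplying the one-unit matrix by itself. The 3-letter alphabet and the numbers in TOY_STEP below are invented for illustration only; they are not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    result = m
    for _ in range(power - 1):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step mutation probabilities for a toy 3-letter alphabet.
# Row i, column j: probability that residue i is replaced by residue j.
TOY_STEP = [
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.02, 0.97],
]

# Extrapolate to 250 steps, in the spirit of going from PAM1 to PAM250.
toy_far = mat_pow(TOY_STEP, 250)
for row in toy_far:
    print([round(p, 3) for p in row])
```

Each row still sums to 1 after extrapolation, but the diagonal (probability of staying unchanged) drops. Any error in the one-step matrix is carried through every multiplication.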

The BLOSUM matrices (BLOcks SUbstitution Matrix)

The BLOSUM series of matrices were created by Steve Henikoff and colleagues (PNAS 89:10915).

Derived from local, ungapped alignments of distantly related sequences.

All matrices are directly calculated; no extrapolations are used.

Again: the observed frequency of each pair is compared to the expected frequency (which is essentially the product of the frequencies of each residue in the dataset).

Then: Log-odds matrix.
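
A minimal sketch of the observed-versus-expected comparison, assuming a tiny invented block (BLOCK) over a two-letter alphabet. Scores are rounded to half-bit units, which is an assumption about scaling; the sequence clustering used for the real BLOSUM matrices is left out.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical ungapped block: each string is one aligned segment.
BLOCK = ["AGA", "AGA", "AGG", "AGG"]


def log_odds(block):
    """Return {(a, b): score in half-bits} for residue pairs seen in the block."""
    pair_counts = Counter()
    for column in zip(*block):                  # walk the block column by column
        for a, b in combinations(column, 2):    # every pair of rows in the column
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())

    # Residue frequencies implied by the counted pairs.
    residue_freq = Counter()
    for (a, b), count in pair_counts.items():
        residue_freq[a] += count / (2 * total_pairs)
        residue_freq[b] += count / (2 * total_pairs)

    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / total_pairs
        # Expected frequency: product of residue frequencies
        # (doubled for unordered pairs of two different residues).
        expected = residue_freq[a] * residue_freq[b] * (1 if a == b else 2)
        scores[(a, b)] = round(2 * math.log2(observed / expected))
    return scores


scores = log_odds(BLOCK)
print(scores)
```

Pairs seen more often than chance get a positive score, pairs seen less often get a negative one.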

The Blocks Database

The Blocks Database contains multiple alignments of conserved regions in protein families.

Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the random distribution of matches. It is these calibrated blocks that make up the BLOCKS database.

The database can be searched by e-mail and World Wide Web (WWW) servers (http://blocks.fhcrc.org/help) to classify protein and nucleotide sequences.

The Blocks Database

[Figure: gapless alignment blocks]

The BLOSUM series

BLOSUM30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90.

The number after the matrix (BLOSUM62) refers to the percent identity threshold applied to the blocks (in the BLOCKS database) used to construct the matrix: sequences within a block that are >=62% identical are clustered and counted as one, so the matrix reflects substitutions between sequences that are less than 62% identical.

No extrapolations are made in going to higher evolutionary distances.

High number: closely related sequences

Low number: distant sequences

BLOSUM62 is the most popular: best for general alignment.

The log-odds matrix for BLOSUM62


PAM versus BLOSUM

PAM:

> Based on an explicit evolutionary model

> Derived from small, closely related proteins with ~15% divergence

> Higher PAM numbers to detect more remote sequence similarities

> Errors in PAM 1 are scaled 250X in PAM 250

BLOSUM:

> Based on empirical frequencies

> Uses much larger, more diverse set of protein sequences (30-90% ID)

> Lower BLOSUM numbers to detect more remote sequence similarities

> Errors in BLOSUM arise from errors in alignment


Comparing exchange matrices

To compare amino acid exchange matrices, the "Entropy" value can be used. This is a relative entropy value (H) which describes the amount of information available per aligned residue pair.
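
As a rough sketch, H can be computed as the sum over residue pairs of the observed pair frequency times its log-odds score in bits. The two-letter alphabet and all frequencies below are invented for illustration.

```python
import math


def relative_entropy(observed, expected):
    """H = sum over pairs of q_ij * log2(q_ij / e_ij), in bits per aligned pair."""
    return sum(q * math.log2(q / expected[pair])
               for pair, q in observed.items() if q > 0)


# Hypothetical pair frequencies for a two-letter alphabet (unordered pairs).
EXPECTED = {("A", "A"): 0.25, ("A", "G"): 0.50, ("G", "G"): 0.25}
CLOSE = {("A", "A"): 0.45, ("A", "G"): 0.10, ("G", "G"): 0.45}    # few mismatches
DISTANT = {("A", "A"): 0.30, ("A", "G"): 0.40, ("G", "G"): 0.30}  # many mismatches

h_close = relative_entropy(CLOSE, EXPECTED)
h_distant = relative_entropy(DISTANT, EXPECTED)
print(f"close: {h_close:.3f} bits, distant: {h_distant:.3f} bits")
```

A matrix built from closely related sequences carries more information per aligned pair than one built from distant sequences. When the observed frequencies equal the expected ones, H is zero.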

Specialized matrices

Claverie (J.Mol.Biol 234:1140) developed a set of substitution matrices designed explicitly for finding possible frameshifts in protein sequences.

These matrices are designed solely for use in protein-protein comparisons; they should not be used with programs which blindly translate DNA (e.g. BLASTX, TBLASTN).

Specialized matrices

Rather than starting from alignments generated by sequence comparison, Risler et al (1988) and later Overington et al (1992) only considered proteins for which an experimentally determined three dimensional structure was available.

They then aligned similar proteins on the basis of their structure rather than sequence and used the resulting sequence alignments as their database from which to gather substitution statistics. In principle, the Risler or Overington matrices should give more reliable results than either PAM or BLOSUM. However, the comparatively small number of available protein structures (particularly in the Risler et al study) limited the reliability of their statistics.

Overington et al (1992) developed further matrices that consider the local environment of the amino acids.

A note on reliability

All these matrices are designed using standard evolutionary models.

It is important to understand that evolution is not the same for all proteins, not even for the same regions of proteins.

No single matrix performs best on all sequences. Some are better for sequences with few gaps, and others are better for sequences with fewer identical amino acids.

Therefore, when aligning sequences, applying a general model to all cases is not ideal. Rather, re-adjustment can be used to make the general model better fit the given data.

Pair-wise alignment quality versus sequence identity

(Vogt et al., JMB 249, 816-831, 1995)

Summary

If ORF exists, then align at protein level.

Amino acid substitution matrices reflect the log-odds ratio between the evolutionary and random model and can therefore help in determining homology via the alignment score.

The evolutionary and random models depend on the generalized data used to derive them. This is not an ideal solution.

Apart from the PAM and BLOSUM series, a great number of further matrices have been developed.

Matrices have been made based on DNA, protein structure, information content, etc.

For local alignment, BLOSUM62 is often superior; for distant (global) alignments, BLOSUM50, GONNET, or (still) PAM250 work well.

Remember that gap penalties are always a problem; unlike the matrices themselves, there is no formal way to calculate their values. You can follow recommended settings, but these are based on trial and error and not on a formal framework.

Biological definitions for related sequences

Homologues are similar sequences in two different organisms that have been derived from a common ancestor sequence. Homologues can be described as either orthologues or paralogues.

Orthologues are similar sequences in two different organisms that have arisen due to a speciation event. Orthologues typically retain identical or similar functionality throughout evolution.

Paralogues are similar sequences within a single organism that have arisen due to a gene duplication event.

Xenologues are similar sequences that do not share the same evolutionary origin, but rather have arisen out of horizontal transfer events through symbiosis, viruses, etc.

Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

So this means …

Multiple sequence alignment

Sequences can be conserved across species and perform similar or identical functions.

> hold information about which regions have high mutation rates over evolutionary time and which are evolutionarily conserved;

> identification of regions or domains that are critical to functionality.

Sequences can be mutated or rearranged to perform an altered function.

> which changes in the sequences have caused a change in the functionality.

Multiple sequence alignment: the idea is to take three or more sequences and align them so that the greatest number of similar characters are aligned in the same column of the alignment.

What to ask yourself

How do we get a multiple alignment? (three or more sequences)

What is our aim?

Do we go for max accuracy, least computational time or the best compromise?

What do we want to achieve each time?


Sequence-sequence alignment

[Figure: alignment matrix with one sequence along each axis]

Multiple alignment methods

Multi-dimensional dynamic programming

> extension of pairwise sequence alignment.

Progressive alignment

> incorporates phylogenetic information to guide the alignment process

Iterative alignment

> corrects for problems with progressive alignment by repeatedly realigning subgroups of sequences



Simultaneous multiple alignment

Multi-dimensional dynamic programming

The combinatorial explosion

2 sequences of length n

> n^2 comparisons

Comparison number increases exponentially

> i.e. n^N where n is the length of the sequences, and N is the number of sequences

Impractical for even a small number of short sequences
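
To get a feeling for the numbers, a small sketch that evaluates n^N for sequences of 250 residues. The count of predecessors per cell (2^N - 1, one for every non-empty subset of sequences that advances) is added here as background and is not taken from the slide.

```python
def dp_cells(seq_len, num_seqs):
    """Number of cells in the N-dimensional matrix: one axis of length n per sequence."""
    return seq_len ** num_seqs


def predecessors_per_cell(num_seqs):
    """Each cell can be reached from 2^N - 1 neighbouring cells."""
    return 2 ** num_seqs - 1


for num_seqs in (2, 3, 5, 8):
    cells = dp_cells(250, num_seqs)
    print(f"N={num_seqs}: {cells:.3e} cells, "
          f"{predecessors_per_cell(num_seqs)} predecessors each")
```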

Multi-dimensional dynamic programming (Murata et al., 1985)

[Figure: axes labelled Sequence 1 and Sequence 2]

The MSA approach

MSA (Lipman et al., 1989, PNAS 86, 4412)

MSA restricts the amount of memory by computing bounds that approximate the centre of a multi-dimensional hypercube.

Calculate all pair-wise alignment scores.

Use the scores to predict a tree.

Calculate pair weights based on the tree (lower bound).

Produce a heuristic alignment based on the tree.

Calculate the maximum weight for each sequence pair (upper bound).

Determine the spatial positions that must be calculated to obtain the optimal alignment.

Perform the optimal alignment.

Report the weight found compared to the maximum weight previously found (measure of divergence).

Extremely slow and memory intensive.

Max 8-9 sequences of ~250 residues.




The DCA approach

DCA (Stoye et al., 1997, Appl. Math. Lett. 10(2), 67-73)

Each sequence is cut in two behind a suitable cut position somewhere close to its midpoint.

This way, the problem of aligning one family of (long) sequences is divided into the two problems of aligning two families of (shorter) sequences.

This procedure is re-iterated until the sequences are sufficiently short.

Optimal alignment by MSA.

Finally, the resulting short alignments are concatenated.
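
The cut, align and concatenate recursion can be sketched as follows. This is not the published algorithm. Two simplifications are assumed: every sequence is cut exactly at its midpoint (the real method searches for a suitable cut position), and the exact aligner is replaced by a stand-in, pad_align, that only pads the pieces to equal length. All names are hypothetical.

```python
def pad_align(segments):
    """Stand-in for the exact aligner: right-pad segments with '-' to equal length."""
    width = max(len(s) for s in segments)
    return [s.ljust(width, "-") for s in segments]


def dca(sequences, max_len=4, align_short=pad_align):
    """Recursively cut, align the short pieces, and concatenate the results."""
    if max(len(s) for s in sequences) <= max_len:
        return align_short(sequences)
    # Assumption: cut every sequence exactly at its midpoint.
    cuts = [len(s) // 2 for s in sequences]
    left = dca([s[:c] for s, c in zip(sequences, cuts)], max_len, align_short)
    right = dca([s[c:] for s, c in zip(sequences, cuts)], max_len, align_short)
    # Concatenate the two partial alignments row by row.
    return [l + r for l, r in zip(left, right)]


family = ["ACDEFGHIK", "ACDFGHK", "ACEFGHIK"]
for row in dca(family):
    print(row)
```

With a real aligner passed as align_short, only the short pieces are ever aligned exactly, which is where the saving in time and memory comes from.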




So in effect …

[Figure: axes labelled Sequence 1 and Sequence 2]

Multiple alignment methods

Multi-dimensional dynamic programming

> extension of pairwise sequence alignment.

Progressive alignment

> incorporates phylogenetic information to guide the alignment process

Iterative alignment

> corrects for problems with progressive alignment by repeatedly realigning subgroups of sequences



The progressive alignment method

Underlying idea: usually we are interested in aligning families of sequences that are evolutionarily related.

Principle: construct an approximate phylogenetic tree for the sequences to be aligned and then build up the alignment by progressively adding sequences in the order specified by the tree.

But before going into details, some notes on multiple alignment profiles …

How to represent a block of sequences?

Historically: consensus sequence

single sequence that best represents the amino acids observed at each alignment position.

Modern methods: Alignment profile

representation that retains the information about frequencies of amino acids observed at each alignment position.
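
The difference between the two representations shows up in a small sketch. The block of aligned sequences (BLOCK) is invented for illustration; the function names are hypothetical.

```python
from collections import Counter

# Hypothetical block of aligned sequences (one string per row).
BLOCK = ["AVL", "AVL", "VIL", "VVM", "LVL"]


def column_profile(block):
    """Return one {residue: frequency} dict per alignment column."""
    n_rows = len(block)
    return [{res: count / n_rows for res, count in Counter(col).items()}
            for col in zip(*block)]


def consensus(block):
    """Return the most frequent residue of each column as a single string."""
    letters = []
    for col in zip(*block):
        counts = Counter(col)
        # On a tie, the residue seen first in the column wins.
        letters.append(max(counts, key=counts.get))
    return "".join(letters)


print(consensus(BLOCK))
for position, column in enumerate(column_profile(BLOCK)):
    print(position, column)
```

In the first column A and V are equally frequent, yet the consensus keeps only A. The profile keeps both, together with the rarer L.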


Multiple alignment profiles (Gribskov et al. 1987)

Gribskov created a probe: group of typical sequences of functionally related proteins that have been aligned by similarity in sequence or three-dimensional structure (in his case: globins & immunoglobulins).

Then he constructed a profile, which consists of a sequence position-specific scoring matrix M(p,a) composed of 21 columns and N rows (N = length of probe).

The first 20 columns of each row specify the score for finding, at that position in the target, each of the 20 amino acid residues. An additional column contains a penalty for insertions or deletions at that position (gap-opening and gap-extension).



Multiple alignment profiles

[Figure: a profile. At each alignment position i there is one frequency per amino acid (fA, fC, fD, ..., fW, fY) plus position dependent gap penalties (Gapo, gapx). The positions are grouped into a core region, a gapped region and a second core region.]

Profile building

Example: each aa is represented as a frequency; gap penalties as weights.

Position   A     C     D     ...   W     Y     Gap penalty
i          0.3   0.1   0     ...   0.3   0.3   0.5
i+1        0.5   0     0     ...   0     0.5   1.0
i+2        0     0.5   0.2   ...   0.1   0.2   1.0

Position dependent gap penalties

Profile-sequence alignment

[Figure: alignment matrix with a profile (ACD……VWY) along one axis and a sequence along the other]

Sequence to profile alignment

Alignment column: A, A, V, V, L

Profile position: 0.4 A, 0.2 L, 0.4 V

Score of amino acid L in sequence that is aligned against this profile position:

Score = 0.4 * s(L,A) + 0.2 * s(L,L) + 0.4 * s(L,V)
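
The same calculation as a short sketch. The three exchange values in S are assumed to match the commonly distributed BLOSUM62 table and should be treated as illustrative; the function names are hypothetical.

```python
# Assumed excerpt of BLOSUM62 (symmetric), keyed by alphabetically sorted pairs.
S = {("A", "L"): -1, ("L", "L"): 4, ("L", "V"): 1}


def s(x, y):
    """Look up the exchange value for the residue pair (x, y)."""
    return S[tuple(sorted((x, y)))]


def seq_to_profile_score(residue, profile_column):
    """Frequency-weighted sum of exchange values against one profile position."""
    return sum(freq * s(residue, aa) for aa, freq in profile_column.items())


score = seq_to_profile_score("L", {"A": 0.4, "L": 0.2, "V": 0.4})
print(round(score, 3))  # 0.4*(-1) + 0.2*4 + 0.4*1 = 0.8
```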

Profile-profile alignment

[Figure: alignment matrix with one profile (ACD……VWY) along each axis]

Profile to profile alignment

Profile position 1: 0.4 A, 0.2 L, 0.4 V

Profile position 2: 0.75 G, 0.25 S

Match score of these two alignment columns using the a.a. frequencies at the corresponding profile positions:

Score = 0.4*0.75*s(A,G) + 0.2*0.75*s(L,G) + 0.4*0.75*s(V,G)
      + 0.4*0.25*s(A,S) + 0.2*0.25*s(L,S) + 0.4*0.25*s(V,S)

s(x,y) is value in amino acid exchange matrix (e.g. PAM250, Blosum62) for amino acid pair (x,y)
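
A sketch of the profile to profile score for these two columns. The exchange values in S are assumed to match the commonly distributed BLOSUM62 table and are illustrative only; the function names are hypothetical.

```python
# Assumed excerpt of BLOSUM62 (symmetric), keyed by alphabetically sorted pairs.
S = {
    ("A", "G"): 0, ("G", "L"): -4, ("G", "V"): -3,
    ("A", "S"): 1, ("L", "S"): -2, ("S", "V"): -2,
}


def s(x, y):
    """Look up the exchange value for the residue pair (x, y)."""
    return S[tuple(sorted((x, y)))]


def profile_to_profile_score(column_1, column_2):
    """Sum of exchange values weighted by the product of both frequencies."""
    return sum(f1 * f2 * s(a, b)
               for a, f1 in column_1.items()
               for b, f2 in column_2.items())


score = profile_to_profile_score({"A": 0.4, "L": 0.2, "V": 0.4},
                                 {"G": 0.75, "S": 0.25})
print(round(score, 3))
```

With a single residue at frequency 1.0 in one of the columns, this reduces to the sequence to profile score.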


So, for scoring profiles …

Think of sequence-sequence alignment.

Same principles but more information for each position.

Reminder:

The sequence pair alignment score S comes from the sum of the positional scores M(aa_i, aa_j) (i.e. the substitution matrix values at each alignment position minus penalties if applicable)

Profile alignment scores are exactly the same, but the positional scores are more complex
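
The reminder can be written out as a sketch that scores a given pairwise alignment. The match and mismatch values and the gap penalties are placeholders, not recommended settings. A run of consecutive gap columns is treated as one gap, whichever row carries it.

```python
def toy_matrix(x, y):
    """Placeholder exchange values: +2 for identical residues, -1 otherwise."""
    return 2 if x == y else -1


def alignment_score(row_a, row_b, s=toy_matrix, gap_open=3, gap_extend=1):
    """Sum the positional scores and subtract gap penalties."""
    total = 0
    in_gap = False
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            # First gap column costs gap_open, each further one gap_extend.
            total -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += s(x, y)
            in_gap = False
    return total


print(alignment_score("AC--DE", "ACGGDE"))  # 2 + 2 - 3 - 1 + 2 + 2 = 4
```

For profiles only the positional term changes: s(x, y) is replaced by the frequency-weighted score of two profile positions.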