Or,
What is a correspondence set anyway?!
Topic 12
Chapter 16, Du and Bourne “Structural Bioinformatics”
Alignment vs. superposition
• Structural alignment attempts to establish homology between two or more polymer structures based on their shape and 3D structure.
• Structural alignment requires no a priori knowledge of equivalent positions.
• Structural alignment is a valuable tool for comparing proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques.
• Conversely, simple structural superposition uses knowledge of at least some equivalent residues to guide a rigid body superposition.
• Superposition is the most basic possible comparison between protein structures: it makes no attempt to align the input structures.
• It requires a precalculated alignment as input to determine which residues in the sequence are to be considered in the RMSD calculation.
Structure alignment
[Figure: First step + Second step]
Structure alignments are based on structure similarity, from which sequence alignments can be trivially extracted.
Due to computational complexity, most structural alignments are pairwise, but multiple alignment methods do exist.
Dynamic programming and sequence alignment
To really understand structure alignment, you need to understand sequence alignment...
Dynamic programming (DP) is an algorithm originally developed by Richard Bellman in the early 1950s for “multistage decision processes.” DP methods solve optimization problems and are very useful in bioinformatics applications such as sequence alignment, where there are a large number of possible solutions but only one (or a few) best solution(s).
Foundation: Any partial sub-path ending at a point along the true optimal path must itself be an optimal path leading up to that point. So the optimal path can be found by incremental extensions of optimal sub-paths, leading to a recursive algorithm that is (typically) guaranteed to produce the best answer.
There are two major types of optimal DP sequence alignments: global (Needleman-Wunsch) and local (Smith-Waterman) alignments.
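Based on the assumption of independence, where the score of a residue (mis)match is unaffected by other pairs, so the probability of the whole alignment is a joint probability (a product over the aligned pairs)! For example…
A S C T V L
A T C A V I
Thanks to the magic of logarithms, that product becomes a sum of per-pair scores.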
Substitution (scoring) matrix
Substitution matrices are composed of log-ratios that compare observed pairs to background expectation. S(ij) > 0 indicates ‘preferred’ matches. For example, the BLOSUM-62 matrix…
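As a sketch of how the independence assumption and the log-ratio scores fit together: the symbols q (observed frequency of an aligned pair) and p (background frequency of a single residue) are notation assumed here for illustration and are not taken from the slides, and any matrix-specific scaling factor is omitted.

```latex
% Independence: the odds of the whole alignment is a product over columns k
\mathrm{odds}(\text{alignment}) = \prod_{k=1}^{N} \frac{q_{a_k b_k}}{p_{a_k}\, p_{b_k}}

% Taking logarithms turns the product into a sum of matrix entries
\log \mathrm{odds}(\text{alignment})
  = \sum_{k=1}^{N} \log \frac{q_{a_k b_k}}{p_{a_k}\, p_{b_k}}
  = \sum_{k=1}^{N} S(a_k b_k)

% For the example ASCTVL / ATCAVI:
\text{score} = S(AA) + S(ST) + S(CC) + S(TA) + S(VV) + S(LI)
```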
Dynamic Programming (DP)
Match: +5
Mismatch: −2
Insertion/deletion: −6
Sean Eddy, 2004, Nature Biotechnology
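A minimal sketch of global (Needleman-Wunsch) DP scoring, assuming the simple scheme of +5 for a match, −2 for a mismatch and −6 per insertion/deletion. The function name and its defaults are illustrative, and it returns only the optimal score (no traceback).

```python
def global_alignment_score(x, y, match=5, mismatch=-2, indel=-6):
    """Return the optimal global alignment score of sequences x and y."""
    n, m = len(x), len(y)
    # F[i][j] holds the best score for aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * indel          # x[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        F[0][j] = j * indel          # y[:j] aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,   # extend by an aligned pair
                F[i - 1][j] + indel,   # gap in y
                F[i][j - 1] + indel,   # gap in x
            )
    return F[n][m]
```

Each cell extends an already optimal sub-path, which is exactly the "optimal sub-path" foundation stated earlier.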
Back to structure alignment
Independence is not a valid assumption in structure because…
Similarly, in RNA…
That is, the probability of mutating the above lysine to X, p(K→X), is NOT independent of the aspartate.
This is, of course, the reality in sequence alignment too, but we ignore this fact because we are treating the protein as a 1D sequence that doesn’t reveal those details.
Rigid body treatment ≠ independence of positions
Structure alignment treats proteins as rigid bodies, leading to an even more serious violation of independence.
That is, adjusting the position of the purple residue, for example, to maximize overlap with its target will also alter the position of the green residue because they are rigidly related.
Rotation of purple by 90° also rotates the green.
Formalizing the structure alignment problem
Given two sets of points A = (a1, a2, …, an) and B = (b1, b2, …, bm) in Cartesian space, find the optimal subsets A(P) and B(Q) with |A(P)| = |B(Q)|, and find the optimal rigid body transformation G between the two subsets A(P) and B(Q) that minimizes a given distance metric D over all possible rigid body transformations G, i.e.
min over G of D( A(P) − G(B(Q)) )
The two subsets A(P) and B(Q) define a “correspondence”, and p = |A(P)| = |B(Q)| is called the correspondence length. Naturally, the correspondence length is maximal when A(P) and B(Q) are similar.
Therefore there are essentially two problems in structure alignment:
(i.) Find the correspondence set (which is NP-hard), and
(ii.) Find the alignment transform (which is O(n)).
In the structure alignment literature, you will frequently encounter coordinate root mean squared deviation (cRMSD), which is just like RMSD except that B describes a coordinate transformation of b.
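A minimal sketch of a coordinate RMSD over an already known correspondence, assuming points are plain (x, y, z) tuples. The `transform` argument is a hypothetical stand-in for the coordinate transformation applied to b; it is not a name from the slides.

```python
import math

def crmsd(a, b, transform=lambda point: point):
    """Coordinate RMSD between paired points a[i] and transform(b[i])."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    total = 0.0
    for pa, pb in zip(a, b):
        tb = transform(pb)                      # move b into a's frame
        total += sum((u - v) ** 2 for u, v in zip(pa, tb))
    return math.sqrt(total / len(a))

# Usage: two points offset by 1 Å along z, then corrected by a translation.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
shift_down = lambda p: (p[0], p[1], p[2] - 1.0)
```

Here `crmsd(a, b)` measures the deviation before any transformation, and `crmsd(a, b, shift_down)` measures it after the translation.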
Just to clarify…
Common structure alignment methods
DALI: Uses 2D distance matrices between Cα atoms to represent each structure. Conceptually, the alignment problem is then straightforward: you must simply maximally overlay the matrices (as described in an earlier cartoon).
Holm and Sander. Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233:123-138.
CE (Combinatorial Extension): Uses characteristics of local geometry to seed structural alignments and then joins these regions of local similarity (called aligned fragment pairs, AFPs) into an “optimal” path for the full alignment. Bottom-up approach.
Shindyalov and Bourne. Protein structure alignment by incremental combinatorial extension (CE) of optimal path. Prot Eng 1998, 11:739-747.
SSAP (Sequential Structure Alignment Program): Uses a “double-dynamic programming” algorithm: high level and low level matrices. Used in CATH classification.
Taylor WR, Orengo CA. 1989b. Protein structure alignment. J Mol Biol 208:1-22.
VAST (Vector Alignment Search Tool), TM-align and many more…
Dali: The Persistence of Time
Overview of the Dali Algorithm
Starting with a contact map…
Dali attempts to maximize the overlap of the contact maps; however, doing so globally is NP-hard, so the methods focus on local comparisons.
Image from Amy Keating at MIT
Image from Mark Maciejewski at UConn
The DALI (Distance matrix alignment) algorithm is based on the matrix comparison methods that we have already introduced.
Images and content modified from Mark Maciejewski at UConn
Similarity score:
S = Σ(i = 1..L) Σ(j = 1..L) φ(i, j)
[Figure: Structure A with residues iA and jA; Structure B with residues iB and jB]
i and j are equivalent residues in A and B
L is the number of such pairs or the size of the substructure
φ(i, j) is the similarity measure based on the Cα distances dA(i, j) and dB(i, j)
The Dali Algorithm (step by step)
1. Compute distance matrices for both protein A and B
2. Extract a full set of overlapped hexapeptide (6 × 6) sub-matrices (also called contact patterns) from each matrix
3. Each 6 × 6 distance matrix from protein A is compared with the 6 × 6 distance matrix in protein B. (Really?)
6 × 6 Cα distance matrices
For example: 6.2 − 12.7 = −6.5
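Steps 1 to 3 can be sketched as follows, assuming Cα coordinates are given as (x, y, z) tuples. The function names are illustrative only, and this is a plain-Python sketch rather than Dali's actual implementation.

```python
import math

def distance_matrix(coords):
    """Step 1: all-against-all Cα distance matrix for one structure."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def contact_pattern(dmat, i, j, size=6):
    """Step 2: size x size sub-matrix relating the hexapeptide starting
    at residue i to the hexapeptide starting at residue j."""
    return [row[j:j + size] for row in dmat[i:i + size]]

def pattern_difference(pattern_a, pattern_b):
    """Step 3: element-wise difference of two contact patterns,
    e.g. 6.2 - 12.7 = -6.5 for a single element."""
    return [[x - y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(pattern_a, pattern_b)]

# Usage: an idealized straight chain with 1 Å spacing (not a real protein).
chain = [(float(k), 0.0, 0.0) for k in range(12)]
dmat = distance_matrix(chain)
pattern = contact_pattern(dmat, 0, 6)
```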
Consider protein A with 100 residues, meaning we have 100 − 5 = 95 hexapeptides. (95^2)/2 ≈ 4,512 contact pattern matrices
Consider protein B with 150 residues, meaning 150 − 5 = 145 hexapeptides. (145^2)/2 ≈ 10,512 contact pattern matrices
Even for these two relatively small proteins, there would be 4,512 × 10,512 = 47,430,144 comparisons between A and B.
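The arithmetic above can be checked with a short sketch. The helper names are made up for illustration, and the halving with rounding down simply mirrors the slide's own counting.

```python
def n_hexapeptides(n_residues, size=6):
    """Number of overlapping windows of `size` residues (100 -> 95)."""
    return n_residues - size + 1

def n_contact_patterns(n_residues, size=6):
    """Hexapeptide-vs-hexapeptide patterns within one structure,
    halved (rounding down) as in the slide's estimate."""
    k = n_hexapeptides(n_residues, size)
    return (k * k) // 2

def n_comparisons(n_residues_a, n_residues_b, size=6):
    """Exhaustive pattern-vs-pattern comparisons between two proteins."""
    return (n_contact_patterns(n_residues_a, size)
            * n_contact_patterns(n_residues_b, size))
```

The count grows roughly with the square of each chain length, which is why the exhaustive comparison is impractical.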
Step 1: For each hexapeptide, a distance matrix compares it to every other hexapeptide within its structure.
Step 2: Every distance matrix created in step 1 for one protein is compared with every distance matrix created for the other protein.
“Houston, … we have a problem!”
4. Each contact pattern in protein A is paired with its most similar pattern in protein B, a process that generates a pair list
5. The list is sorted based on the strength of pair similarity of contact patterns
A note about the similarity measure: We want to maximize the number of equivalent residues while minimizing structural variations; it is a tradeoff. That is, if the criteria are so tough that minor structure deviations are not allowed, then the number of matching contact patterns is likely to be very small.
Image from Amy Keating at MIT
Note that unmatched residues do not contribute to the overall similarity score S.
Q: How do you calculate φ(i, j)?
Method 1: Rigid residue-pair similarity score:
• 1.5 Å is the zero level of similarity.
• The only thing that matters is absolute difference, meaning that the same difference at large distances is penalized the same as at short distances.
Method 2: Elastic similarity score (default):
• Larger differences are tolerated for longer-range contact pairs.
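A sketch of the two scoring options, using the functional forms published by Holm and Sander (1993). Only the 1.5 Å rigid threshold appears in these slides. The elastic threshold of 0.2, the 20 Å envelope width and all function names are assumptions brought in from that paper for illustration.

```python
import math

THETA_RIGID = 1.5     # Å, zero level of similarity (from the slides)
THETA_ELASTIC = 0.2   # assumed: 20% relative deviation tolerated
ALPHA = 20.0          # Å, assumed width of the distance envelope

def rigid_score(d_a, d_b, theta=THETA_RIGID):
    """Rigid score: depends only on the absolute distance difference."""
    return theta - abs(d_a - d_b)

def elastic_score(d_a, d_b, theta=THETA_ELASTIC, alpha=ALPHA):
    """Elastic score: relative deviation, down-weighted at long range."""
    d_mean = 0.5 * (d_a + d_b)
    if d_mean == 0.0:              # i == j, both distances are zero
        return theta
    envelope = math.exp(-(d_mean / alpha) ** 2)
    return (theta - abs(d_a - d_b) / d_mean) * envelope

def similarity(dmat_a, dmat_b, pairs, score=elastic_score):
    """S: sum of score over all pairs of matched residues.
    `pairs` is a list of (index_in_A, index_in_B) equivalences."""
    return sum(score(dmat_a[ia][ja], dmat_b[ib][jb])
               for ia, ib in pairs for ja, jb in pairs)
```

With these forms, a 1 Å difference scores the same under the rigid measure whether the residues are 4 Å or 24 Å apart, while the elastic measure rejects it at short range and accepts it at long range.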
6. Merging contact patterns to form chains and reduce complexity
The search space is reduced because only the central contact pattern is retained (actually, the one that gives the smallest average intra-pattern distance).
7.) After removing the overlapping patterns, we are still left with way too many contact patterns to exhaustively compare all possible pairs. Start comparing pairs at random:
• Keep list of positive scores (discard negative scores)
• Keep comparing till your list has 80,000 positive scores
Sort the list and keep the best 40,000 contact pattern matches.
8.) End game: Need to find optimal alignment of the 40,000 contact patterns such that the alignment occurs over as wide a range of the structural pair as possible. Using Markov Chain Monte Carlo (MCMC), start with a random contact pattern from the list of 40,000, and then “walk” to another overlapping pattern (must extend the contact pattern by 4 residues) using the standard Metropolis criterion.
Metropolis Monte Carlo Optimization
In Dali…
The net result is that scores that improve are always kept, whereas scores that get worse are accepted with some probability.
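A minimal sketch of the standard Metropolis acceptance rule, written for score maximization. The `temperature` parameter and the function name are illustrative assumptions, and the slides do not state the values Dali uses.

```python
import math
import random

def metropolis_accept(old_score, new_score, temperature, rng=random):
    """Return True if a move from old_score to new_score is accepted."""
    if new_score >= old_score:
        return True                       # improvements are always kept
    # Worse moves are accepted with probability exp(delta / T), delta < 0.
    probability = math.exp((new_score - old_score) / temperature)
    return rng.random() < probability
```

Lower temperatures make the walk greedier, and higher temperatures let it escape local optima more easily.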
The Dali Algorithm (the reality)
Statistical significance of Dali alignments
Dali uses a Z-score to show the significance of the alignment.
A common and practical approach to the problem of assessing alignment significance is to determine if the alignment score is better than one could expect by chance.
Dali compares each alignment score against an All-to-All protein structure comparison (normalized by length), which defines the Z-score.
• Dali Z-scores > 2 are thought to be meaningful.
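The generic idea behind a Z-score can be sketched as below, assuming a list of background scores is already available. Dali's length normalization is not reproduced here, and the function name is illustrative.

```python
import statistics

def z_score(score, background_scores):
    """Number of standard deviations `score` lies above the background mean."""
    mean = statistics.mean(background_scores)
    sd = statistics.pstdev(background_scores)   # population standard deviation
    if sd == 0:
        raise ValueError("background scores have zero spread")
    return (score - mean) / sd
```

A score equal to the background mean gives Z = 0, and the Z > 2 rule of thumb corresponds to a score more than two standard deviations above that mean.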
Combinatorial Extension (a cursory look)
Similar to Dali in that it also breaks the structure down into a series of small fragments, which it then attempts to reassemble into a complete alignment.
For a pair of proteins A and B, an alignment fragment pair (AFP) is defined as a continuous segment of A aligned against a continuous segment of B of the same size (without gaps).
If n1 and n2 are the lengths of A and B, and AFP length is set to m, then there is a total of (n1 − m)(n2 − m) possible AFPs.
Only AFPs that meet given criteria for local similarity are included in the matrix as a means of restricting the search space.
An alignment path is calculated as the optimal path through the similarity matrix by linearly progressing through the sequences and extending the alignment with the next possible high-scoring AFP.
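A sketch of how the AFP search space might be enumerated and then restricted. The `distance` callable and the `cutoff` value are hypothetical placeholders for CE's local similarity criteria, which the slides do not spell out. Strict enumeration of start positions yields (n1 − m + 1)(n2 − m + 1) pairs, which for realistic chain lengths is of the same order as the count quoted above.

```python
def candidate_afps(n1, n2, m, distance, cutoff):
    """Return (i, j) start positions of gap-free length-m fragment pairs
    whose local distance(i, j) falls below cutoff."""
    return [(i, j)
            for i in range(n1 - m + 1)       # fragment start in protein A
            for j in range(n2 - m + 1)       # fragment start in protein B
            if distance(i, j) < cutoff]

# Usage with toy numbers (m = 8 is only an example value):
all_pairs = candidate_afps(10, 12, 8, lambda i, j: 0.0, 1.0)
diagonal_only = candidate_afps(10, 12, 8, lambda i, j: abs(i - j), 1.0)
```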
Goal: Find a “good” local alignment for structures of proteins A and B.
1. Select some initial AFP.
2. Build an alignment path by incrementally adding AFPs in a way that satisfies the conditions (i.e., stitch AFPs together).
3. Repeat step (2) until the length of each protein is traversed, or until no “good” AFPs remain.
4. Optimize the alignment via dynamic programming.
5. Measure statistical significance.
Questions:
How do we choose the starting AFP?
What are the criteria for adding AFPs to our alignment path?
What does the distance function look like?
When to stop? Or at what point do we know that there are no “good” AFPs left?
To assess how good the alignment produced by CE is, we can compare it to the alignment of a random pair of structures, and compute the Z-score based on the RMSD distance and number of gaps in the final alignment.
Since CE does not penalize gaps, we can perform additional optimization after the CE is completed in order to remove excess gaps using dynamic programming.
The CE method is highly configurable, which is at once its strength and weakness. Adjusting multiple parameters, such as AFP length m, cutoff distances D0 and D1, and definitions for AFP distances, can result in varying alignments and execution speeds.
In general, CE does not outperform previously existing structural alignment methods, such as Dali and VAST: it does better for some pairs of structures, and worse for others.
VAST (a cursory look)
VAST = Vector Alignment Search Tool
1.) Parse protein structures into SSEs (helices and strands).
2.) Fit vectors to SSEs.
3.) To compare a pair of proteins, attempt to superpose as many vectors as possible, subject to constraints.
4.) Evaluate the vector alignment for statistical significance (compute an E-value).
5.) If the vector alignment is significant, then proceed to a more detailed residue-to-residue alignment (“refined alignment”).
Modified from Tom Madej at GWU
VAST in pictures…
Double Dynamic Programming (a cursory look)
Use two levels of dynamic programming, a high level scoring matrix and a low level matrix for each high level matrix element.
Each F_ij in the high level scoring matrix shows how likely it is that the pair (i, j) is on an optimal alignment.
For each F_ij, the likelihood is found by a (low level) optimal alignment with the constraint that F_ij is part of the alignment.
The scores along the low level alignments are accumulated in the high level scoring matrix.
DDP cont.
Begin by constructing a series of inter-residue distance vectors between each residue and its nearest non-contiguous neighbors on each protein.
A series of matrices is then constructed containing the vector differences between neighbors for each pair of residues for which vectors were constructed.
Dynamic programming applied to each resulting matrix determines a series of optimal local alignments which are then summed into a "summary" matrix to which dynamic programming is applied again to determine the overall structural similarity.
Multiple Structure Alignment
Most multiple structure alignments are based on a pile-up combination of pairwise results; however, few algorithms do an All-to-All optimization.
One example of a multiple alignment is Combinatorial Extension Monte Carlo (CE-MC), which is based on a progressive CE multiple alignment strategy, followed by an iterative Metropolis MC refinement.