Improved pairwise alignments of proteins in the Twilight Zone using ...

wickedshortpumpΒιοτεχνολογία

1 Οκτ 2013 (πριν από 4 χρόνια και 3 μήνες)

55 εμφανίσεις

Vol.22 no.4 2006,pages 413–422
doi:10.1093/bioinformatics/bti828
BIOINFORMATICS ORIGINAL PAPER
Sequence analysis
Improved pairwise alignments of proteins in the Twilight Zone
using local structure predictions
Yao-ming Huang and Christopher Bystroff
￿
Center for Bioinformatics,Department of Biology,Rensselaer Polytechnic Institute,Troy,NY 12180,USA
Received on March 9,2005;revised on November 2,2005;accepted on December 8,2005
Advance Access publication December 13,2005
Associate Editor:Christos Ouzounis
ABSTRACT
Motivation:In recent years,advances have been made in the ability
of computational methods to discriminate between homologous and
non-homologous proteins in the ‘twilight zone’ of sequence similarity,
where the percent sequence identity is a poor indicator of homology.
To make these predictions more valuable to the protein modeler,
they must be accompanied by accurate alignments.Pairwise
sequence alignments are inferences of orthologous relationships
betweensequencepositions.Evolutionarydistanceistraditionallymod-
eled using global amino acid substitution matrices.But real differences
in the likelihood of substitutions may exist for different structural con-
textswithinproteins,sincestructural context contributestotheselective
pressure.
Results:HMMSUM(HMMSTR-based substitution matrices) is a new
model for structural context-basedamino acidsubstitutionprobabilities
consisting of a set of 281 matrices,each for a different sequence–
structure context.HMMSUM does not require the structure of the
protein to be known.Instead,predictions of local structure are made
usingHMMSTR,ahiddenMarkov model for local structure.Alignments
using the HMMSUMmatrices compare favorably to alignments carried
out using the BLOSUM matrices or structure-based substitution
matrices SDM and HSDM when validated against remote homolog
alignments from BAliBASE.HMMSUM has been implemented using
local Dynamic Programming and with the Bayesian Adaptive
alignment method.
Availability:Matrices,source codes and programs are available at
http://www.bioinfo.rpi.edu/~bystrc/downloads.html.
Contact:bystrc@rpi.edu,huangy2@rpi.edu
1 INTRODUCTION
Amino acid substitution matrices are evolutionary models that seek
to explain the cost of mutating any one of the twenty amino acids
to any other,relative to the cost of not mutating.The most widely
used substitution matrices,such as PAM(Dayhoff et al.,1978) and
BLOSUM (Henikoff and Henikoff,1992),are derived from high
confidence multiple sequence alignments (MSAs).Counting stat-
istics are used to estimate the frequency of each possible mutation,
and a ratio of the observed mutation probability to the probability
one would expect by chance is calculated.The logarithm of this
likelihood ratio is the number we use when scoring pairwise
alignments such as those generated by Dynamic Programming
algorithms [LALIGN (Huang and Miller,1991),etc.] and by
database search algorithms [SSEARCH,FASTA (Pearson,1995,
1996,1998,2000) and BLAST (Altschul et al.,1990,1997)].
The current models ignore the possible effect of the structural
context on the frequencies of mutation,instead summing the fre-
quencies over all positions in the training set alignments equally.
But the structural context may influence the cost of making a
mutation.For example,glycine is known to be strongly conserved
when it occurs in a tight turn,where the presence of any side chain
would sterically hinder the formation of the turn.In other structural
contexts,glycine is more likely to mutate.As another example,
consider the selective pressures felt by a leucine in the core of a
protein versus a leucine on the surface.Ahydrophobic side chain on
the surface may sometimes be conserved for functional reasons,but
a hydrophobic side chain in the core of a protein is probably critical
for folding,and this is arguably a stronger selective pressure.
In general,we do not know the structural context of each of the
positions in a sequence at the point when we want to use the
evolutionary model—that is,when we want to align the sequence
or search a database.A structure-based evolutionary model such as
the one we propose in this paper is useful for alignments and data-
base searches only if we can first predict the structural context of
each residue.The prediction of residue context falls far short of
predicting the 3D structure,but adds information that can be used
to improve the quality of alignments,as we show in this paper.
Here we use local structure predictions from HMMSTR [hidden
Markov model for protein structure (Bystroff et al.,2000)] to
improve the quality of the most difficult pairwise alignments,
those with <25% sequence identity—in the so-called ‘Twilight
Zone’ of homology detection (Doolittle,1981).The new method
compliments remote homology detection methods,such as
RAPTOR (Xu et al.,2003),SVM-HMMSTR (Hou et al.,2004),
3D-SHOTGUN (Sasson and Fischer,2003) and others,that can
accurately identify remote structural homology but do not contrib-
ute to improved alignments.We show that distant homologs can
be aligned more accurately when the substitution matrix considers
the local structure as predicted by HMMSTR.
In general,profile–profile alignment methods have a greater
chance of success in aligning Twilight Zone sequences (Yona
and Levitt,2002;Jaroszewski and Godzik,2002),but obtaining
the profile requires first performing pairwise alignments.Therefore,
improvements in pairwise alignments will eventually improve
MSAs and profile–profile alignments.
The new model,called HMMSUM (HMMSTR-based sub-
stitution matrices),is a set of matrices,one for each of the
281 local structure contexts defined by HMMSTR.The matrices
￿
To whom correspondence should be addressed.
 The Author 2005.Published by Oxford University Press.All rights reserved.For Permissions,please email:journals.permissions@oxfordjournals.org
413
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
were summed from a training set of MSAs in the same way as the
previous matrices PAM and BLOSUM were summed,but the
training set in this case contained HMMSTR descriptors (Markov
state probabilities) for each position,based on known structures.
When HMMSUM is used in an alignment,the target and template
HMMSTR descriptors are first predicted,then an alignment matrix
(Smith and Waterman,1981) is calculated using the HMMSUM
model.In the alignment matrix,each ‘match score’ is a linear
combination of the substitution scores from the 281 matrices.
The weights are the target and template HMMSTR descriptors
[g,Equation (1)].
The alignments were carried out using standard Smith–Waterman
local Dynamic Programming (Smith and Waterman,1981;Smith
et al.,1981),or using Bayesian Adaptive Alignment (BAA) (Zhu
et al.,1998).The results were validated against BAliBASE
(Thompson et al.,1999;Bahr et al.,2001),a curated database
of structure-based MSAs.Alignment accuracy was assessed as
the number of correctly aligned positions (correct matches in
Table 1).Pairwise alignments produced by the HMMSUM model
had significantly more correct matches than alignments using the
BLOSUM35,40,50 or 55 matrices.These results were also a
significant improvement over two previously described structure-
based alignment matrices,SDM and HSDM (Prlic
´
et al.,2000),
which are in turn significantly better than the best BLOSUM
matrix.However,SDMand HSDMare single substitution matrices,
whereas our model consists of many matrices.
In broad terms,an increase in the number of variable parameters
improved the model.An apparent increase in accuracy can result
from simple over-fitting of the data,but this is not the case here.
None of the test alignments were part of the training data,which did
not include twilight-zone alignments or structure-based alignments.
Also,jackknife statistical tests were performed to showthat the gap
parameter optimization was valid.
Analysis of the numbers (log-likelihood ratios,LLRs) in the
HMMSUM matrices,when compared with SDM,hints at the
reasons for the improved accuracy.Different local structures
have different amino acid preferences.These differences are sig-
nificant and cannot be captured in a single substitution matrix.
2 METHODS
2.1 Construction of MSAs and structure descriptors
The observed amino acid substitution frequencies were summed froma non-
redundant training set of MSAs.Amodified PDBselect-25 list (Hobohmand
Sander,1994) was used as a source of parent sequences.Sequences in the
original PDBselect-25 list were excluded if the parent structure was not well
determined,if the protein was membrane-bound,or if it contained a large
number of disulfide bonds.Disordered loop regions were also omitted.Each
of the remaining 657 parent sequences has at most 25% sequence identity
with any other sequence in the list.MSAs were produced by searching the
‘nr’ protein database using two iterations of PSI-BLAST (Altschul et al.,
1997),with an E-value cut-off of 0.001.In practice,this cut-off produces
alignments that very rarely have <25% identity,and the training set con-
tained none of the ‘Twilight Zone’ alignments that were later used for
validation in the testing phase.
HMMSTR predictions
HMMSTR (Bystroff et al.,2000) was used to
assign/predict structure descriptors for each position in each MSA during
the training phase,and also to predict the structure descriptors in target
sequences during the testing phase.HMMSTR takes as input the sequence
profile and,optionally,the protein structure expressed as backbone (w,c)
angles.We say the descriptors were assigned if the true structure was used
as input to HMMSTR in addition to the sequence profile.Otherwise,we say
the descriptors were predicted if only the sequence profile was the input
to HMMSTR.A sequence profile is a probability distribution over the
20 amino acids at each MSA position.It is calculated by summing the
frequency of each amino acid in each MSA column,after accounting for
sequence redundancy.PSI-BLAST produces a profile as part of its function,
and the profile was used as input to HMMSTR.
HMMSTR is a hidden Markov model containing 281 Markov states (q).
Each state ‘emits’ a single descriptor from a state-specific probability dis-
tribution.HMMSTR states have probability distributions for amino acid
type,secondary structure,backbone angle region and structural context.
Table 1.Comparison of HMMSUM models with other matrices
Model Gap penalty Gap number
b
Gap length
b
Alignment length
b
Correct matches Accuracy Coverage
Opening
a
Extension
a
Counts P-value
c
% P-value
c
% P-value
c
SDM 7.1 ± 0.35 0.4 ± 0.04
1310
5973
25528 11393
— 44.6 — 40.8 —
HSDM
19.5 ± 0.93
1.0 ± 0.09
1284
6134
24762 11114
0.03 44.9 0.884 39.8 0.006
BLOSUM35
11.0 ± 1.41
0.9 ± 0.39
1560
6221
25775 10410
<0.001 40.4 <0.001 37.3 <0.001
BLOSUM40
13.1 ± 0.64
1.3 ± 0.30
1377
5089
24879 10581
0.002 42.5 0.022 37.9 0.001
BLOSUM50
8.4 ± 1.06
2.1 ± 0.45
2089
5019
23674 10019
<0.001 42.3 0.046 35.9 <0.001
BLOSUM55
11.3 ± 0.71
0.9 ± 0.20
1842
7850
22857 9719
<0.001 42.5 <0.001 34.8 <0.001
HMMSUM-M
15.0 ± 0.00
0.9 ± 0.00
1457
6278
26813 11204
0.644 41.8 0.158 40.1 0.928
HMMSUM-L
11.6 ± 0.52
1.4 ± 0.27
1571
5071
25339 11089
0.526 43.8 0.627 39.7 0.443
HMMSUM-D
21.0 ± 0.53
0.4 ± 0.18
920
5145
28477 13165
<0.001 46.2 0.025 47.1 <0.001
HMMSUM-D
3
13.0 ± 0.53
0.3 ± 0.08
1173
6028
27892 12938
<0.001 46.4 0.108 46.3 <0.001
HMMSUM-D
S
14.1 ± 0.38
0.8 ± 0.08
1510
6211
27892 13269
<0.001 47.6 0.007 47.5 <0.001
HMMSUM-D
NS
13.4 ± 1.51
0.3 ± 0.12
1283
5959
28253 13930
<0.001 49.3 <0.001 49.9 <0.001
HMMSUM-D
3+NS
13.3 ± 0.46
0.2 ± 0.13
1175
6371
27759 12981
<0.001 46.8 0.037 46.5 <0.001
Optimal gap penalties are defined as the combination of opening and extension penalty that will produce the highest number of correct matches under zero matrix bias.P-value of
two group comparisons are estimated Wilcoxon signed-rank test.Accuracy and coverage were calculated as mentioned in Section 2.
a
Gap parameters were determined by taking the average of results from eight cross-validations.Standard deviations of cross-validations are indicated as ±SD.
b
Gap numbers/length and alignment length were calculated over 147 alignments under optimal gap parameters of individual methods.
c
The P-value indicates the significance of the difference in each measurement between SDMand a given method.P-value of >0.05 indicates a non-significant difference.
Y.-M.Huang and C.Bystroff
414
States are connected to each other via transitions to model the likelihood of
different emission distributions being adjacent to each other in the protein
chain.Strongly linked states in HMMSTRrepresent the I-sites local structure
motifs (Bystroff and Baker,1998),such as helix capping motifs and beta
turns,from which the HMMSTR model was derived.Some studies have
shown that I-sites motifs can fold independently and/or initiate folding (Yi
et al.,1998;Bystroff and Garde,2003;Northey et al.,2002).The forward/
backward algorithm (Rabiner,1989) was used to predict the conditional
probabilities [g,Equation (1)] of each state q given each position t in the
sequence.
g
qt
¼ Pðq j tÞ:ð1Þ
The details of this calculation were described previously (Bystroff et al.,
2000).
The g matrix may be viewed as a probabilistic representation of all local
structure motifs (I-sites) given the sequence and,optionally,the structure.
Loosely speaking,
g ¼ PðI-sites motif j sequence‚½structureÞ:
In practice,when the optional structure data were included the gamma
distribution was sharply peaked at most positions,having probability ¼1 for
one state and 0 for all others.However,some local structure types are
missing or redundantly modeled by HMMSTR.For these positions the
gamma distribution was less sharp.In the testing phase of this study,
when the structure data were not included in the forward/backward calcu-
lation,the motif predictions were more often ambiguous,and the gamma
distribution was sharply peaked at fewer positions.Because of the known
ambiguity of motif predictions,an early decision was made to treat all local
structure assignments/predictions probabilistically [i.e.Equation (1)] rather
than as a string of discrete descriptors.
2.2 Summing structure-dependent substitution
frequencies from training set MSAs
The matrices of observed and background frequencies for amino acid sub-
stitutions were summed in a manner similar to the one described earlier
for BLOSUM (Henikoff and Henikoff,1992),but in our case probabilistic
weights were used in order to separate the training set positions into
HMMSTR states.
Sequence weights,calculated using the mismatch method (Vingron
and Argos,1989),were used to eliminate the redundancy within an given
MSA.For example,in a MSA with L aligned sequences,an L · L mismatch
matrix is constructed where each matrix element is the number of mis-
matches between two sequences.The mismatch distance of a sequence
a is calculated as the sum over all the elements in row a,divided by sum
over the whole matrix.Sequences with lower mismatch distances (w) were
more redundant and should therefore contribute less to the sum of sub-
stitutions.The contribution of sequences a and b to any observed substitution
frequency was the product of the sequence weights (w
a
· w
b
).
As in previous studies,no assumption was made about the direction of the
substitution;therefore,the substitution matrices are symmetrical.
Likelihood ratios are the ratio of the observed substitution probability
and the expected substitution probability.The expected substitution
probability was calculated as the product of two background amino acid
probabilities.The background frequencies were summed from the same
training set MSAs,with each sequence a contributed w
a
to this sum.
The observed counts of state-dependent substitutions F was calculated
fromthe training set of MSAs.Each column (t) in each MSA(m) was given a
probability of being state q (g
qt
) using HMMSTR.F(i,j j q),the state-specific
observed counts of i $ j substitutions,was the sum of the g-weighted
sequence weights (w) over all sequence pairs a,b where the amino acid
at t in a was i and the amino acid at t in b was j,as follows:
Fði‚j j qÞ ¼
X
m2MSAs
in training set
X
a2m
X
b2m
b < a
X
t2a
t
¼i
&b
t
¼j
w
a
w
b
g
m
qt
:ð2Þ
Two strategies for expected frequencies
For the expected frequencies
of substitution,we considered two models,called ‘Dayhoff’ (D) and
‘Lipman’ (L) in the spirit of two classic bioinformatics experiments in
estimating expectation values for sequence alignments.Margaret Dayhoff’s
model (Dayhoff et al.,1978,1983) for random alignments was to align
scrambled sequences,whereas David Lipman aligned randomly selected
non-homologous but unscrambled sequences (Lipman et al.,1984).The
scores for random natural sequences were generally higher than for
scrambled sequences,showing the non-random nature of protein sequence
space.
Our D-model assumes that the expected frequency of substitution is not
dependent on the structure but only on the overall background frequencies
of the amino acids.Our L-model uses the structural context to estimate
the expected frequency of substitution,assigning a different background
frequency distribution to each state.The strategy applied in the L-model
should give us a better estimate of the significance of a substitution given
the structure,but with a finite-sized database we run the risk of sparse or
missing data for some of the 210 possible substitutions and 281 states.And
the poorer results for the more finely-divided L-model suggest that sparse
data were a problem.The D-model suffers less from the sparse data.The
following formulas were used to sum the background counts [Equation (3)]
or state-dependent background counts [Equation (4)] of each amino acid
from the training set.
FðiÞ ¼
X
m2MSAs
in training set
X
a2m
X
t2a
t
¼i
w
a
ðD-modelÞ ð3Þ
Fði j qÞ ¼
X
m2MSAs
in training set
X
a2m
X
t2a
t
¼i
w
a
g
m
qt
ðL-modelÞ:ð4Þ
A small pseudo-count («) was added to the expected and observed sub-
stitution frequencies to overcome the problem of sparse and missing data.
The choice of pseudo-count does not greatly a effect the resulting alignments
but prevents certain run-time errors from occurring.
The frequencies F were normalized to probabilities,P.The observed con-
ditional probability of a structure-specific substitution P(i,j j q) is as follows:
Pði‚jjqÞ ¼
Fði‚j j qÞ
P
h220 a:a:
P
k220 a:a:
Fðh‚k j qÞ
:ð5Þ
The normalized,structure-specific background amino acid probabilities
are
Pði j qÞ ¼
Fði j qÞ
P
h220a:a:
Fðh j qÞ
:ð6Þ
Structure-independent substitution probabilities P(i,j) were also calcu-
lated,in order to directly compare the added value of structure-specific
models.
Pði‚jÞ ¼
P
281
q¼1
Fði‚j j qÞ
P
h220a:a:
P
k220a:a:
P
281
q¼1
Fðh‚k j qÞ
:ð7Þ
Structure-independent background amino acid probabilities P(i) are
PðiÞ ¼
FðiÞ
P
h220a:a:
FðhÞ
:ð8Þ
2.3 Generating the alignment matrix
Pairwise alignment programs generally start by calculating the alignment
matrix (A),where each element A
tu
is the score for the two amino acids a
t
and
b
u
,where a and b are the two sequences being aligned and t and u are two
sequence positions.In traditional Dynamic Programming algorithm,A
tu
is
assigned the appropriate LLR from the global substitution matrix (Smith
and Waterman,1981).
HMMSUM:structure-based substitution matrices
415
To calculate the alignment matrix using HMMSUM,we first calculate
the gamma distribution for each sequence using the HMMSTR model,as
described in the previous section.Then,we sum the substitution probabil-
ities,P(a
t
,b
u
j q),over all values of q,weighted by the joint probability that
positions t and u are in state q.The joint probability is simply the product
of the g-values.P(a
t
,b
u
j Q) is defined as the weighted sum over all states q
of the substitution probability.
Pða
t
‚b
u
j QÞ ¼
P
281
q¼1
Pða
t
‚b
u
j qÞg
a
qt
g
b
qu
P
281
q¼1
g
a
qt
g
b
qu
:ð9Þ
We also need to find the weighted sum of the expected substitution
frequencies if we are using the L-model.We define P(a
t
j Q) to be the
weighted sum of the conditional background frequencies over all states q,
Pða
t
j QÞ ¼
P
281
q¼1
Pða
t
j qÞg
a
qt
P
281
q¼1
g
a
qt
‚ ð10Þ
where a
t
refers to the amino acid at position t in sequence a.
Nowwe can define the LLRs in the A matrix in the terms presented above
in Equations (5)–(10).Three models are shown.The M-model is inde-
pendent of structure,a single substitution matrix like BLOSUM and
PAM.The D-model depends on the predicted structure and assumes a
structure-independent background substitution probability,which is the
product of the background amino acid probabilities.The L-model depends
on predicted structure and assumes a structure-dependent background sub-
stitution probability.
A
tu
¼ log
2
Pða
t
‚b
u
Þ
Pða
t
ÞPðb
u
Þ
 
ðM-modelÞ:ð11Þ
A
tu
¼ log
2
Pða
t
‚b
u
j QÞ
Pða
t
ÞPðb
u
Þ
 
ðD-modelÞ:ð12Þ
A
tu
¼ log
2
Pða
t
‚b
u
j QÞ
Pða
t
j QÞPðb
u
j QÞ
 
ðL-modelÞ:ð13Þ
Alignments were carried out using Smith–Waterman local Dynamic
Programming and gap penalties were optimized as described below.An
overview of construction and testing of the HMMSUM model is shown
in Figure 1.
2.4 Measuring accuracy in alignment
To assess the performance of the HMMSUMmodel,we used a documented
benchmark database,called BAliBASE (Bahr et al.,2001) as a reference
set.BAliBASE consists of 167 reference MSAs,with >2100 sequences.All
alignments are based on 3Dstructural superpositions (except reference set 7)
Fig.1.Overviewof constructingHMMSUMmodels.Constructionof the HMMSUMmodel canbe separatedintotwosteps:(A) trainingphase,whichcounts the
observedsubstitutionand backgroundfrequencies weighted bymismatchdistances and g-values;(B) testingphase,which uses linear combination togenerate an
alignment matrix in LLR format and aligns two sequences by Dynamic Programming or BAA.
Y.-M.Huang and C.Bystroff
416
which we believe to be as close as possible to the ‘true’ orthologous rela-
tionships.Currently,BAliBASE is categorized into eight different reference
sets for different purposes.Within reference set 1 there are several subcate-
gories:short sequences (<100 residues);medium sequences (200–300 resi-
dues) and long sequences (>500 residues),and each subcategory is further
divided into different levels of percent identity (<25,20–40 and >35%).
For this study,we extracted all 99 of the MSAs in the ‘twilight-zone’ range
(from 7 to 25% identify) from reference set 1 of BAliBASE 2.0.A total
of 147 available reference pairwise alignments are obtained from those
sequences,containing a total of 27930 ‘true’ matches.
The performance of the alignment methods was evaluated by assessing the
accuracy (the fraction of predicted matches that were ‘true’ matches) and
coverage (the fraction of ‘true’ matches that were predicted matches) over
the 27930 matches in the 147 alignments.The statistical significance
(P-value) of the differences in ‘true’ matches and coverage was evaluated
using the Wilcoxon signed-rank test (Wilcoxon,1945) over the individual
alignments.Alignments were ranked by the number of correct matches or
by accuracy,and the P-value was calculated for the likelihood that the
differences between the methods were the result of chance.
Alignment algorithms
The sequence alignments were performed in two
ways.Local Dynamic Programming alignment is an implementation of the
Smith–Waterman algorithm,which returns the optimal alignment given the
scoring scheme.In the Smith–Waterman algorithm,we utilized the affine
gap penalty method (Altschul and Erickson,1986).Optimized gap opening
penalties and gap extension penalties were determined for each scoring
method by applying a grid search and evaluating the results by counting
the total correct matches.
Another algorithm,BAA computes the Bayesian posterior probability
of all possible a
t
,b
u
matches.In BAA,likelihood ratios,rather than
LLRs are used.For traditional substitution matrices like BLOSUM,we
used the exponent of the reported LLRs.BAA is carried out by first
performing a forward sum,similar to the forward step in the forward/
backward algorithm,over a maximum of K-aligned blocks.We used
K ¼ min (20,n/10),where n is the length of the shorter of the two proteins
being aligned.Instead of a trace-back through the optimal alignment,a
‘sample-back’ approach is taken in BAA.Probabilistic sampling of the
alignment space produces a probability distribution over all alignments,
expressed as a matrix of probabilities S,where S
tu
is the probability of seeing
positions t,u matched in an alignment.
Different scoring schemes were compared by calculating the probability
of the true alignment,i.e.the BAliBASE alignment,given the sample-back
distribution,S.
P
true
¼
X
true t‚ u
S
tu
:ð14Þ
Using BAA allows us to evaluate alignment quality on all possible sub-
optimal alignments,and also avoids the use of parameters such as gap
opening penalties and extension penalties,which are known to highly affect
the result of alignments.
3 RESULTS AND DISCUSSION
3.1 Qualitative analysis of the substitution matrices
Substitution matrices such as BLOSUM,PAM and SDM are
expressed as symmetrical 20 · 20 matrices of LLRs.HMMSUM
on the other hand,consists of 281 matrices and is not so easily
viewed and understood.We may inspect the new model by looking
at its marginal properties,and also by asking whether different
structural contexts had significantly different substitution probab-
ilities,even when the amino acid composition was similar.
3.1.1 Marginal characteristics of HMMSUM By using marginal
probabilities of substitutions [Equation (11)],a single substitution
matrix (denoted as HMMSUM-M) can be generated in the 20 · 20
format.A comparison of HMMSUM-Mand SDMscoring matrices
is shown in Figure 2.All of the structure information was ignored in
the HMMSUM-Mmodel.Overall,we see a higher score for changes
relative to identities in the HMMSUM-M model when compared
with SDM,as indicated by the relatively higher numbers off the
diagonal (after correcting for differences in the mean and standard
deviation).This increases the importance of mutations in the align-
ment score,relative to conserved residues.HMMSUM-M singles
out Gly,Asp,Pro and Asn for the highest substitution penalties,
while Thr,Gln and Met are the most mutable.SDM,on the other
hand,gives Gly,Pro,Ile and Trp the highest penalty for substitution,
and finds Gln,Thr,and Arg to be the most mutable.The normalized
mean mismatch penalties for the amino acids correlate poorly
(R ¼ 0.60) between HMMSUM-M and SDM.The correlation
over all elements of HMMSUM-M with SDM (R ¼ 0.72) and
with BLOSUM50 (R ¼ 0.72) is weaker than these two correlate
with each other (R ¼ 0.84).Overall,the diversity between models
suggests that selection pressure has multiple solutions that cannot
all be modeled by a single matrix.
3.1.2 Analysis of two structural contexts that favor glycine Two
conserved glycine states were chosen to illustrate the effect of
structural context on the substitution probabilities.For comparison
purposes only,observed and background frequencies of each state q
were transformed to LLRs according to Equation (15) to generate
substitution matrices
log
2
Pði‚j j qÞ þ«
Pði j qÞPðj j qÞ þ«
 
‚ ð15Þ
where « ¼ 0.25.Note,however,that the matrices were not used in
this form in making alignments.Instead the alignment matrices are
calculated directly from the substitution probabilities,as described
in Equations (9–13),not from LLRs.
The substitution matrix for HMMSTR state 30,a conserved
glycine in a b-turn motif (Figure 3,lower left) was found to be
Fig.2..Comparison of HMMSUM-Mand SDMsubstitution matrices.The
figure shows the comparison of marginal substitution matrix in HMMSUM
model,HMMSUM-M (upper half),with SDM (lower half).Both are
expressed as LLRs in half-bit unit (scaled by a factor of 2),and rounded
to the nearest integer.The two matrices correlate weakly (R ¼ 0.72).
HMMSUM:structure-based substitution matrices
417
dramatically different from the substitution matrix for state 102,
a conserved glycine in a b-strand conformation (Figure 3,
upper right).The LLRs were calculated using structure-
dependent expected substitution values,as in HMMSUM-L.Both
states represent common structural contexts.
The b-strand state has exactly the expected probability of G–G
substitutions (i.e.conservations),given the glycine-rich com-
position of that state (LLR ¼ 0).But the b-turn state,which is
also glycine-rich,has a much higher than expected rate of G–G
substitution (LLR ¼ 5).This reflects the energetic preference for
Gly in a b-turn.
Residues other than glycine are less likely to be found in states
30 and 102,but if they are found in these contexts,they are scored
very differently.State 102 (b-strand) favors a mismatch of almost
any sort,almost as much as an identity match.This means that
residues other than Gly mutate more often than expected in this
context.This is not surprising since the expected substitution prob-
ability would be very small for all non-Gly amino acids,and it
means that those rare occurrences of non-glycine residues in this
context occurred together in the training set.For state 30 (b-turn)
the situation is different.Glycine mutates more often than expected
to certain residues,C,D,P,Wand Y.In particular the preferred Gly
to Pro mutation makes sense for a b-turn.But other substitutions
in this turn state are either not observed (large negative LLRs) or
they occur more often than expected,for the same reasons as
mentioned before.
By modeling the structural contribution to evolutionary selective
pressure,HMMSUM should add biological meaning to the align-
ment of protein sequences.If so,an improvement should be seen
in the alignment accuracy.
3.2 Evaluation of alignment accuracy
To test whether the HMMSUM improves the alignments of
twilight-zone sequences,we compared the matrix performance of
the HMMSUMmodels with BLOSUMmatrices or structure-based
substitution matrices SDM/HSDM,which has been reported as the
best choice for aligning proteins with low percent identity (Prlic
´
et al.,2000).Three methods were used in our study to construct the
A
tu
matrix [Equations (11–13)] and alignments were performed
using local Dynamic Programming.
Gap opening/extension penalties are optimized for each matrix
before the evaluation of accuracy.An 8-fold jackknife cross-
validation was used for all methods tested when determining the
optimal penalties in order to avoid over-fitting.Evaluations are
reported here using the alignment parameters that gave us the
maximum correct matches (Table 1).The average gap opening/
extension penalties after cross-validations for SDM (7.1/0.4) and
HSDM (19.5/1.0) are comparable to previous reports (6.6/0.6 and
19.0/0.8,respectively).
3.2.1 HMMSUM-M results Using optimal gap parameters
(15.0/0.9),the structure-independent model HMMSUM-M finds
6% more of the correct matches than BLOSUM40 (P ¼ 0.053),
the best performing BLOSUMmatrix in our evaluation.This result
cannot be attributed to the use of structure,since neither structure
assignment nor prediction was used,but it may be attributed to
using a different sequence weighting scheme in summing substitu-
tion frequencies.Our weighting scheme increased the contribution
of distant homologs to the substitution scores,and the results are
seen in the increased LLRs for changes relative to conservations.
However,the increase in the number of correct matches was
accompanied by an increase in the number of incorrect matches,
and the accuracy in alignments remained unimproved overall.The
Wilcoxon test indicated that the change of accuracy in individual
alignments may have resulted by chance when compared with
BLOSUM40 (P ¼ 0.887).
3.2.2 HMMSUM-L results As mentioned in section 2,two sets
of structure-dependent matrices,HMMSUM-D and HMMSUM-L,
were made based on two different assumptions about the expected
substitutions.HMMSUM-L,with optimized gap parameters
(11.6/1.4),found 5% more correct matches than BLOSUM40
(P ¼ 0.065).This was fewer than the structure-independent
HMMSUM-M model.We believe that the L-model,although
theoretically more sound,suffered from small counts problems
and therefore did not manifest its theoretical potential.
However,the small increase in correct matches was not
accompanied by an increase in incorrect matches completely,
and the overall accuracy in alignments improved slightly.This
improvement was found to be possibly the result of chance when
compared with BLOSUM40 (P ¼ 0.371).
3.2.3 HMMSUM-D results HMMSUM-D,a structure-
dependent model with structure-independent background frequen-
cies,performed much better than expected,finding 24% more cor-
rect matches than the BLOSUM40 matrix.Both the accuracy and
the coverage of correct matches were significantly improved when
optimal gap parameters were used (21.0/0.4).Of the 147 alignments
61% were more accurate using this model,an unlikely chance
occurrence (P < 0.001).
When compared with SDM and HSDM,the state-of-the-art
matrices derived from structures,HMMSUM-D still showed a
16% and an 18% improvement in correct matches,respectively
(P < 0.001).About 54% out of the 147 alignments were more
accurate when using HMMSUM-D in both comparisons.The
Fig.3.Substitution matrices for glycine-rich states in HMMSTR.Each sub-
stitution matrix was generated using Equation (15).Substitution scores are
presented as LLRs in half-bit units.Here the figure shows the substitution
matrixof twostates inHMMSTR.Thelower half of the matrixis a b-turnstate
(state 30 in HMMSTR) and the upper half of the matrix is a b-strand state
(state 102 in HMMSTR).
Y.-M.Huang and C.Bystroff
418
alignment accuracy and coverage of HMMSUM-D were signific-
antly better than SDM (P ¼ 0.025 and P < 0.001,respectively).
3.2.4 Results using structure predictions and structure
assignments To investigate the effects of structural information
on difficult alignment problems,several more HMMSUM models
were constructed.A structure-prediction dependent model,
HMMSUM-D
NS
,was summed from the training set using the
same method as that described from HMMSUM-D,except that
the calculation of g was performed without using the structural
information.The g-values in HMMSUM-D
NS
can be expressed
loosely as follows:
g ¼ PðI-sites motif j sequenceÞ:
Figure 4.Comparisonof alignments of 1MUCAand 4ENLproteins.Alignments are producedusingthe best matrixinBLOSUMand HMMSUM-Dserials from
tested methods [i.e.BLOSUM40 (
￿
) and HMMSUM-D
NS
(‡)] and SDMmatrix (†).True alignments fromBAliBASE (#) are highlighted in gray color.Actual
secondary structures of 4ENL are indicated in cylinders as a-helices and arrows as b-strands.Two sequences have 17.1% identity.
Figure 5.Ribbon structures of remote homologs 1MUCA and 4ENL.Homologous secondary structure elements can clearly be identified despite only 17.1%
sequence identity.
HMMSUM:structure-based substitution matrices
419
The g weights are therefore structure predictions rather than
assignments.Surprisingly,the accuracy significantly improved
over the results when structure information was used in
HMMSUM-D (P ¼0.002).When measured using correct matches,
alignment accuracy or coverage,this prediction-based model con-
sistently performed best among methods tested.Figure 4 illustrates
this result using the two structurally similar proteins shown in
Figure 5.
To further understand the effect of structural information on
the alignment accuracy,we used the true structural information
for both the training phase and the testing phase.Although a
pairwise sequence alignment would not be the method of choice
for aligning two known structures,this test was thought to simulate
a ‘best case scenario’ where the structure predictions were perfect
for both sequences.As expected,the alignment accuracy for
this model,indicated as HMMSUM-D
S
,was better than that of
HMMSUM-D,where structure was ignored in the testing phase,
but surprisingly it was not as good as HMMSUM-D
NS
,where the
structure was ignored in both phases (Table 1).
3.2.5 Results using three structural states To address the
importance of precision in structure predictions and assignments,
bare minimum structure information was adapted into the
HMMSUM model.A simple three-state (helix,strand and loop)
HMMSUM-D model (denoted HMMSUM-D
3
) was made and ana-
lyzed in an analogous fashion.The performance of HMMSUM-D
3

















Figure 6.Alignments of 1R69 and 1NEQ using Dynamic Programming and BAA.(a) Alignments of two sequences were obtained by local Dynamic
Programming using BLOSUM40,SDM and different HMMSUM-D models with optimized parameters in Table 1.True alignments from BAliBASE are
highlighted in gray.Two sequences have percent identity of 12.7%.(b) A histogram of 1R69-1NEQ alignment was generated by BBA (Bayesian Adaptive
Alignment).The color of each position in histogramrepresents the probability of aligning two positions between two sequences.(red:high probability;blue,low
probability;white,zero probability).Black boxes represent the true alignment of two sequences from BAliBASE.
Y.-M.Huang and C.Bystroff
420
was comparable withother HMMSUM-Dmodels incorrect matches,
alignment accuracy or coverage (Table 1).The same improvement
was also found in another model,HMMSUM-D
3+NS
,which was
built in the same way as HMMSUM-D
NS
,using structure pre-
dictions rather than structure assignments,but using only three
structure states.However,both three-state models are better than
the best reported single-matrix model,SDM.
3.3 Evaluation of suboptimal alignments
By examining the details of alignments from HMMSUM models,
we found that some segments were aligned to the wrong regions
by shifting a few positions (Fig.6a).This suggested that the true
alignment was suboptimal but ‘nearby’ in parameter space,perhaps
accessible by a slight change in the substitution matrices.
Alignments that are off by a little are just as bad as alignments
that are off by a lot,according to our measure of accuracy for the
Dynamic Programming algorithm (Table 1) since this method can
deliver only the one optimal alignment,and arbitrarily chooses
one alignment if there are other equal-scoring possibilities.There
are several modifications to Dynamic Programming that consider
suboptimal alignments (Waterman and Eggert,1987;Huang et al.,
1990;Chao et al.,1994),however,in this study we chose to use
BAA,which models all alignments.
The BAA algorithm was also used to perform each of the 147
pairwise alignments using each substitution model,giving a pos-
terior probability distribution over all alignments for each case.To
evaluate the performance of HMMSUM-Dwhen compared to SDM
using BAA,the probabilities of reference alignments were com-
pared using P
true
[Equation (14)].Around 88%out of 147 reference
alignments have higher P
true
when using HMMSUM-D than with
SDM.Figure 6b shows the results of BAA for the sequences in
Figure 6a.
4 CONCLUSIONS
Amino acid substitution matrices that are local structure-specific
significantly improve the accuracy of pairwise alignments when the
sequences are distant homologs,even if only a simple three-state
structural model is used.The improvement can be seen in the
accuracy and coverage of correct matches when compared with
curated structure-based alignments.
Substitution matrices based on structure predictions were found
to be significantly more accurate (P < 0.001) than substitution
matrices based on structure assignments.This result runs counter-
intuitive since it suggests that the sequences alone contain more
information than the structure,even though we know that structural
homology extends beyond the reach of sequence-based alignment
methods.One possible explanation is that structure predictions con-
tain information about weak or multiple secondary structure pro-
pensities that is lost when a unique structure is assigned to each
position.Weak propensities may be conserved in some parts of the
structure in order to enforce the order of events in the folding
pathway.However,experimental studies have shown no conserva-
tion of propensity across proteins of the same family (Mun
˜
oz et al.,
1995).
An alternative explanation is that substitution data were sparse
when unique structure assignments were used but less sparse when
predictions were used,leading to less random error in substitution
scores when indexed to predicted structures,and more randomerror
in substitution scores when indexed to assigned structures.In some
cases the structure predictions were wrong,but perhaps the added
data more than compensated for the inclusion of some false data.
It was also considered that homologous proteins might tend to
have correlated wrong predictions,and that would improve align-
ments.The sparse data issue also partially explains why the more
finely divided HMMSUM-L model did not perform as well as
expected.Meanwhile,the three-state model HMMSUM-D
3
may
have suffered less fromsparse data,since using structure predictions
(HMMSUM-D
3+NS
) did not further improve alignments.
An improvement was also found in the scores of true suboptimal
alignments using the Bayesian Adaptive method.An additional
34% of 147 individual alignments were improved when using the
BAA algorithm in comparing HMMSUM-D and SDM.Accuracy
in finding true suboptimal alignments is important since optimal
alignments for twilight zone sequences are only 40–50% accurate.
ACKNOWLEDGEMENTS
We thank Bobbie-Jo Webb-Robertson for contributing to the
Bayes Aligner coding and for helpful conversations.This work
was supported by NSF award DBI-0343206 to C.B.
Conflict of Interest:none declared.
REFERENCES
Altschul,S.F.and Erickson,B.W.(1986) Optimal sequence alignment using affine gap
costs.Bull.Math.Biol.,48,603–616.
Altschul,S.F.et al.(1990) Basic local alignment search tool.J.Mol.Evol.,215,
403–410.
Altschul,S.F.et al.(1997) Gapped BLAST and PSI-BLAST:a new generation of
protein database search programs.Nucleic Acids Res.,25,3389–3402.
Bahr,A.et al.(2001) BAliBASE (Benchmark Alignment dataBASE):enhancements
for repeats,transmembrane sequences and circular permutations.Nucleic Acids
Res.,29,323–326.
Bystroff,C.and Baker,D.(1998) Prediction of local structure in proteins using a library
of sequence-structure motifs.J.Mol.Biol.,281,565–577.
Bystroff,C.and Garde,S.(2003) Helix propensities of short peptides:molecular
dynamics versus bioinformatics.Proteins,50,552–562.
Bystroff,C.et al.(2000) HMMSTR:a hidden Markov model for local sequence-
structure correlations in proteins.J.Mol.Biol.,301,173–190.
Chao,K.M.et al.(1994) Recent developments in linear-space alignment methods:
a survey.J.Comput.Biol.,1,271–291.
Dayhoff,M.O.et al.(1983) Establishing homologies in protein sequences.Methods
Enzymol.,91,524–545.
Dayhoff,M.O.,Schwartz,R.M.and Orcutt,B.C.(1978) Amodel of evolutionary change
in proteins.In Dayhoff,M.(ed.),Atlas of Protein Sequence and structure.National
Biomedical Research Foundation,Silver Spring,Maryland,USA.Vol.5 (Suppl.3),
pp.345–352.
Doolittle,R.F.(1981) Similar amino acid sequences:chance or common ancestry?
Science,,214,149–159.
Henikoff,S.and Henikoff,J.G.(1992) Amino acid substitution matrices from protein
blocks.Proc.Natl Acad.Sci.USA,89,10915–10919.
Hobohm,U.and Sander,C.(1994) Enlarged representative set of protein structures.
Protein Sci.,3,522–524.
Hou,Y.et al.(2004) Remote homolog detection using local sequence-structure
correlations.Proteins,57,518–530.
Huang,X.Q.et al.(1990) A space-efficient algorithm for local similarities.Comput.
Appl.Biosci.,6,373–381.
Huang,X.Q.and Miller,W.(1991) A time-efficient,linear-space local similarity
algorithm.Adv.Appl.Math.,12,337–357.
Jaroszewski,L.et al.(2002) In search for more accurate alignments in the twilight zone.
Protein Sci.,11,1702–1713.
Lipman,D.J.et al.(1984) On the statistical significance of nucleic acid similarities.
Nucleic Acids Res.,12,215–226.
HMMSUM:structure-based substitution matrices
421
Munoz,V.et al.(1995) The distribution of alpha-helix propensity along the polypeptide
chain is not conserved in proteins fromthe same family.Protein Sci.,4,1577–1586.
Northey,J.G.et al.(2002) Protein folding kinetics beyond the phi value:using multiple
amino acid substitutions to investigate the structure of the SH3 domain folding
transition state.J.Mol.Biol.,320,389–402.
Pearson,W.R.(1995) Comparison of methods for searching protein sequence databases.
Protein Sci.,4,1145–1160.
Pearson,W.R.(1996) Effective protein sequence comparison.Methods Enzymol.,266,
227–258.
Pearson,W.R.(1998) Empirical statistical estimates for sequence similarity searches.
J.Mol.Biol.,276,71–84.
Pearson,W.R.(2000) Flexible sequence similarity searching with the FASTA3 program
package.Methods Mol.Biol.,132,185–219.
Prlic
´
,A.et al.(2000) Structure-derived substitution matrices for alignment of distantly
related sequences.Protein Eng.,13,545–550.
Rabiner,L.R.(1989) A tutorial on hidden Markov models and selected applications in
speech recognition.Proc.IEEE,77,257–286.
Sasson,I.and Fischer,D.(2003) Modeling three-dimensional protein structures
for CASP5 using the 3D-SHOTGUN meta-predictors.Proteins,53 (Suppl.6),
389–394.
Smith,T.F.and Waterman,M.S.(1981) Identification of common molecular sub-
sequences.J.Mol.Biol.,147,195–197.
Smith,T.F.et al.(1981) Comparative biosequence metrics.J.Mol.Evol.,18,
38–46.
Thompson,J.D.et al.(1999) BAliBASE:a benchmark alignment database for the
evaluation of multiple alignment programs.Bioinformatics,15,87–88.
Vingron,M.and Argos,P.(1989) A fast and sensitive multiple sequence alignment
algorithm.Comput.Appl.Biosci.,5,115–121.
Waterman,M.S.and Eggert,M.(1987) A new algorithm for best subsequence align-
ments with application to tRNA-rRNA comparisons.J.Mol.Biol.,197,723–728.
Wilcoxon,F.(1945) ‘Individual Comparisons by Ranking Methods’.Biometrics,1,
80–83.
Xu,J.et al.(2003) RAPTOR:optimal protein threading by linear programming.
J.Bioinform.Comput.Biol.,1,95–117.
Yi,Q.et al.(1998) Prediction and structural characterization of an independently
folding substructure in the src SH3 domain.J.Mol.Biol.,283,293–300.
Yona,G.and Levitt,M.(2002) Within the twilight zone:a sensitive profile-profile
comparison tool based on information theory.J.Mol.Biol.,315,1257–1275.
Zhu,J.et al.(1998) Bayesian adaptive sequence alignment algorithms.Bioinformatics,
14,25–39.
Y.-M.Huang and C.Bystroff
422