Vol.22 no.4 2006,pages 413–422
doi:10.1093/bioinformatics/bti828
BIOINFORMATICS ORIGINAL PAPER
Sequence analysis
Improved pairwise alignments of proteins in the Twilight Zone
using local structure predictions
Yaoming Huang and Christopher Bystroff
Center for Bioinformatics,Department of Biology,Rensselaer Polytechnic Institute,Troy,NY 12180,USA
Received on March 9,2005;revised on November 2,2005;accepted on December 8,2005
Advance Access publication December 13,2005
Associate Editor:Christos Ouzounis
ABSTRACT
Motivation:In recent years,advances have been made in the ability
of computational methods to discriminate between homologous and
nonhomologous proteins in the ‘twilight zone’ of sequence similarity,
where the percent sequence identity is a poor indicator of homology.
To make these predictions more valuable to the protein modeler,
they must be accompanied by accurate alignments.Pairwise
sequence alignments are inferences of orthologous relationships
betweensequencepositions.Evolutionarydistanceistraditionallymod
eled using global amino acid substitution matrices.But real differences
in the likelihood of substitutions may exist for different structural con
textswithinproteins,sincestructural context contributestotheselective
pressure.
Results:HMMSUM(HMMSTRbased substitution matrices) is a new
model for structural contextbasedamino acidsubstitutionprobabilities
consisting of a set of 281 matrices,each for a different sequence–
structure context.HMMSUM does not require the structure of the
protein to be known.Instead,predictions of local structure are made
usingHMMSTR,ahiddenMarkov model for local structure.Alignments
using the HMMSUMmatrices compare favorably to alignments carried
out using the BLOSUM matrices or structurebased substitution
matrices SDM and HSDM when validated against remote homolog
alignments from BAliBASE.HMMSUM has been implemented using
local Dynamic Programming and with the Bayesian Adaptive
alignment method.
Availability:Matrices,source codes and programs are available at
http://www.bioinfo.rpi.edu/~bystrc/downloads.html.
Contact:bystrc@rpi.edu,huangy2@rpi.edu
1 INTRODUCTION
Amino acid substitution matrices are evolutionary models that seek
to explain the cost of mutating any one of the twenty amino acids
to any other,relative to the cost of not mutating.The most widely
used substitution matrices,such as PAM(Dayhoff et al.,1978) and
BLOSUM (Henikoff and Henikoff,1992),are derived from high
conﬁdence multiple sequence alignments (MSAs).Counting stat
istics are used to estimate the frequency of each possible mutation,
and a ratio of the observed mutation probability to the probability
one would expect by chance is calculated.The logarithm of this
likelihood ratio is the number we use when scoring pairwise
alignments such as those generated by Dynamic Programming
algorithms [LALIGN (Huang and Miller,1991),etc.] and by
database search algorithms [SSEARCH,FASTA (Pearson,1995,
1996,1998,2000) and BLAST (Altschul et al.,1990,1997)].
The current models ignore the possible effect of the structural
context on the frequencies of mutation,instead summing the fre
quencies over all positions in the training set alignments equally.
But the structural context may inﬂuence the cost of making a
mutation.For example,glycine is known to be strongly conserved
when it occurs in a tight turn,where the presence of any side chain
would sterically hinder the formation of the turn.In other structural
contexts,glycine is more likely to mutate.As another example,
consider the selective pressures felt by a leucine in the core of a
protein versus a leucine on the surface.Ahydrophobic side chain on
the surface may sometimes be conserved for functional reasons,but
a hydrophobic side chain in the core of a protein is probably critical
for folding,and this is arguably a stronger selective pressure.
In general,we do not know the structural context of each of the
positions in a sequence at the point when we want to use the
evolutionary model—that is,when we want to align the sequence
or search a database.A structurebased evolutionary model such as
the one we propose in this paper is useful for alignments and data
base searches only if we can ﬁrst predict the structural context of
each residue.The prediction of residue context falls far short of
predicting the 3D structure,but adds information that can be used
to improve the quality of alignments,as we show in this paper.
Here we use local structure predictions from HMMSTR [hidden
Markov model for protein structure (Bystroff et al.,2000)] to
improve the quality of the most difﬁcult pairwise alignments,
those with <25% sequence identity—in the socalled ‘Twilight
Zone’ of homology detection (Doolittle,1981).The new method
compliments remote homology detection methods,such as
RAPTOR (Xu et al.,2003),SVMHMMSTR (Hou et al.,2004),
3DSHOTGUN (Sasson and Fischer,2003) and others,that can
accurately identify remote structural homology but do not contrib
ute to improved alignments.We show that distant homologs can
be aligned more accurately when the substitution matrix considers
the local structure as predicted by HMMSTR.
In general,proﬁle–proﬁle alignment methods have a greater
chance of success in aligning Twilight Zone sequences (Yona
and Levitt,2002;Jaroszewski and Godzik,2002),but obtaining
the proﬁle requires ﬁrst performing pairwise alignments.Therefore,
improvements in pairwise alignments will eventually improve
MSAs and proﬁle–proﬁle alignments.
The new model,called HMMSUM (HMMSTRbased sub
stitution matrices),is a set of matrices,one for each of the
281 local structure contexts deﬁned by HMMSTR.The matrices
To whom correspondence should be addressed.
The Author 2005.Published by Oxford University Press.All rights reserved.For Permissions,please email:journals.permissions@oxfordjournals.org
413
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
were summed from a training set of MSAs in the same way as the
previous matrices PAM and BLOSUM were summed,but the
training set in this case contained HMMSTR descriptors (Markov
state probabilities) for each position,based on known structures.
When HMMSUM is used in an alignment,the target and template
HMMSTR descriptors are ﬁrst predicted,then an alignment matrix
(Smith and Waterman,1981) is calculated using the HMMSUM
model.In the alignment matrix,each ‘match score’ is a linear
combination of the substitution scores from the 281 matrices.
The weights are the target and template HMMSTR descriptors
[g,Equation (1)].
The alignments were carried out using standard Smith–Waterman
local Dynamic Programming (Smith and Waterman,1981;Smith
et al.,1981),or using Bayesian Adaptive Alignment (BAA) (Zhu
et al.,1998).The results were validated against BAliBASE
(Thompson et al.,1999;Bahr et al.,2001),a curated database
of structurebased MSAs.Alignment accuracy was assessed as
the number of correctly aligned positions (correct matches in
Table 1).Pairwise alignments produced by the HMMSUM model
had signiﬁcantly more correct matches than alignments using the
BLOSUM35,40,50 or 55 matrices.These results were also a
signiﬁcant improvement over two previously described structure
based alignment matrices,SDM and HSDM (Prlic
´
et al.,2000),
which are in turn signiﬁcantly better than the best BLOSUM
matrix.However,SDMand HSDMare single substitution matrices,
whereas our model consists of many matrices.
In broad terms,an increase in the number of variable parameters
improved the model.An apparent increase in accuracy can result
from simple overﬁtting of the data,but this is not the case here.
None of the test alignments were part of the training data,which did
not include twilightzone alignments or structurebased alignments.
Also,jackknife statistical tests were performed to showthat the gap
parameter optimization was valid.
Analysis of the numbers (loglikelihood ratios,LLRs) in the
HMMSUM matrices,when compared with SDM,hints at the
reasons for the improved accuracy.Different local structures
have different amino acid preferences.These differences are sig
niﬁcant and cannot be captured in a single substitution matrix.
2 METHODS
2.1 Construction of MSAs and structure descriptors
The observed amino acid substitution frequencies were summed froma non
redundant training set of MSAs.Amodiﬁed PDBselect25 list (Hobohmand
Sander,1994) was used as a source of parent sequences.Sequences in the
original PDBselect25 list were excluded if the parent structure was not well
determined,if the protein was membranebound,or if it contained a large
number of disulﬁde bonds.Disordered loop regions were also omitted.Each
of the remaining 657 parent sequences has at most 25% sequence identity
with any other sequence in the list.MSAs were produced by searching the
‘nr’ protein database using two iterations of PSIBLAST (Altschul et al.,
1997),with an Evalue cutoff of 0.001.In practice,this cutoff produces
alignments that very rarely have <25% identity,and the training set con
tained none of the ‘Twilight Zone’ alignments that were later used for
validation in the testing phase.
HMMSTR predictions
HMMSTR (Bystroff et al.,2000) was used to
assign/predict structure descriptors for each position in each MSA during
the training phase,and also to predict the structure descriptors in target
sequences during the testing phase.HMMSTR takes as input the sequence
proﬁle and,optionally,the protein structure expressed as backbone (w,c)
angles.We say the descriptors were assigned if the true structure was used
as input to HMMSTR in addition to the sequence proﬁle.Otherwise,we say
the descriptors were predicted if only the sequence proﬁle was the input
to HMMSTR.A sequence proﬁle is a probability distribution over the
20 amino acids at each MSA position.It is calculated by summing the
frequency of each amino acid in each MSA column,after accounting for
sequence redundancy.PSIBLAST produces a proﬁle as part of its function,
and the proﬁle was used as input to HMMSTR.
HMMSTR is a hidden Markov model containing 281 Markov states (q).
Each state ‘emits’ a single descriptor from a statespeciﬁc probability dis
tribution.HMMSTR states have probability distributions for amino acid
type,secondary structure,backbone angle region and structural context.
Table 1.Comparison of HMMSUM models with other matrices
Model Gap penalty Gap number
b
Gap length
b
Alignment length
b
Correct matches Accuracy Coverage
Opening
a
Extension
a
Counts Pvalue
c
% Pvalue
c
% Pvalue
c
SDM 7.1 ± 0.35 0.4 ± 0.04
1310
5973
25528 11393
— 44.6 — 40.8 —
HSDM
19.5 ± 0.93
1.0 ± 0.09
1284
6134
24762 11114
0.03 44.9 0.884 39.8 0.006
BLOSUM35
11.0 ± 1.41
0.9 ± 0.39
1560
6221
25775 10410
<0.001 40.4 <0.001 37.3 <0.001
BLOSUM40
13.1 ± 0.64
1.3 ± 0.30
1377
5089
24879 10581
0.002 42.5 0.022 37.9 0.001
BLOSUM50
8.4 ± 1.06
2.1 ± 0.45
2089
5019
23674 10019
<0.001 42.3 0.046 35.9 <0.001
BLOSUM55
11.3 ± 0.71
0.9 ± 0.20
1842
7850
22857 9719
<0.001 42.5 <0.001 34.8 <0.001
HMMSUMM
15.0 ± 0.00
0.9 ± 0.00
1457
6278
26813 11204
0.644 41.8 0.158 40.1 0.928
HMMSUML
11.6 ± 0.52
1.4 ± 0.27
1571
5071
25339 11089
0.526 43.8 0.627 39.7 0.443
HMMSUMD
21.0 ± 0.53
0.4 ± 0.18
920
5145
28477 13165
<0.001 46.2 0.025 47.1 <0.001
HMMSUMD
3
13.0 ± 0.53
0.3 ± 0.08
1173
6028
27892 12938
<0.001 46.4 0.108 46.3 <0.001
HMMSUMD
S
14.1 ± 0.38
0.8 ± 0.08
1510
6211
27892 13269
<0.001 47.6 0.007 47.5 <0.001
HMMSUMD
NS
13.4 ± 1.51
0.3 ± 0.12
1283
5959
28253 13930
<0.001 49.3 <0.001 49.9 <0.001
HMMSUMD
3+NS
13.3 ± 0.46
0.2 ± 0.13
1175
6371
27759 12981
<0.001 46.8 0.037 46.5 <0.001
Optimal gap penalties are defined as the combination of opening and extension penalty that will produce the highest number of correct matches under zero matrix bias.Pvalue of
two group comparisons are estimated Wilcoxon signedrank test.Accuracy and coverage were calculated as mentioned in Section 2.
a
Gap parameters were determined by taking the average of results from eight crossvalidations.Standard deviations of crossvalidations are indicated as ±SD.
b
Gap numbers/length and alignment length were calculated over 147 alignments under optimal gap parameters of individual methods.
c
The Pvalue indicates the significance of the difference in each measurement between SDMand a given method.Pvalue of >0.05 indicates a nonsignificant difference.
Y.M.Huang and C.Bystroff
414
States are connected to each other via transitions to model the likelihood of
different emission distributions being adjacent to each other in the protein
chain.Strongly linked states in HMMSTRrepresent the Isites local structure
motifs (Bystroff and Baker,1998),such as helix capping motifs and beta
turns,from which the HMMSTR model was derived.Some studies have
shown that Isites motifs can fold independently and/or initiate folding (Yi
et al.,1998;Bystroff and Garde,2003;Northey et al.,2002).The forward/
backward algorithm (Rabiner,1989) was used to predict the conditional
probabilities [g,Equation (1)] of each state q given each position t in the
sequence.
g
qt
¼ Pðq j tÞ:ð1Þ
The details of this calculation were described previously (Bystroff et al.,
2000).
The g matrix may be viewed as a probabilistic representation of all local
structure motifs (Isites) given the sequence and,optionally,the structure.
Loosely speaking,
g ¼ PðIsites motif j sequence‚½structureÞ:
In practice,when the optional structure data were included the gamma
distribution was sharply peaked at most positions,having probability ¼1 for
one state and 0 for all others.However,some local structure types are
missing or redundantly modeled by HMMSTR.For these positions the
gamma distribution was less sharp.In the testing phase of this study,
when the structure data were not included in the forward/backward calcu
lation,the motif predictions were more often ambiguous,and the gamma
distribution was sharply peaked at fewer positions.Because of the known
ambiguity of motif predictions,an early decision was made to treat all local
structure assignments/predictions probabilistically [i.e.Equation (1)] rather
than as a string of discrete descriptors.
2.2 Summing structuredependent substitution
frequencies from training set MSAs
The matrices of observed and background frequencies for amino acid sub
stitutions were summed in a manner similar to the one described earlier
for BLOSUM (Henikoff and Henikoff,1992),but in our case probabilistic
weights were used in order to separate the training set positions into
HMMSTR states.
Sequence weights,calculated using the mismatch method (Vingron
and Argos,1989),were used to eliminate the redundancy within an given
MSA.For example,in a MSA with L aligned sequences,an L · L mismatch
matrix is constructed where each matrix element is the number of mis
matches between two sequences.The mismatch distance of a sequence
a is calculated as the sum over all the elements in row a,divided by sum
over the whole matrix.Sequences with lower mismatch distances (w) were
more redundant and should therefore contribute less to the sum of sub
stitutions.The contribution of sequences a and b to any observed substitution
frequency was the product of the sequence weights (w
a
· w
b
).
As in previous studies,no assumption was made about the direction of the
substitution;therefore,the substitution matrices are symmetrical.
Likelihood ratios are the ratio of the observed substitution probability
and the expected substitution probability.The expected substitution
probability was calculated as the product of two background amino acid
probabilities.The background frequencies were summed from the same
training set MSAs,with each sequence a contributed w
a
to this sum.
The observed counts of statedependent substitutions F was calculated
fromthe training set of MSAs.Each column (t) in each MSA(m) was given a
probability of being state q (g
qt
) using HMMSTR.F(i,j j q),the statespeciﬁc
observed counts of i $ j substitutions,was the sum of the gweighted
sequence weights (w) over all sequence pairs a,b where the amino acid
at t in a was i and the amino acid at t in b was j,as follows:
Fði‚j j qÞ ¼
X
m2MSAs
in training set
X
a2m
X
b2m
b < a
X
t2a
t
¼i
&b
t
¼j
w
a
w
b
g
m
qt
:ð2Þ
Two strategies for expected frequencies
For the expected frequencies
of substitution,we considered two models,called ‘Dayhoff’ (D) and
‘Lipman’ (L) in the spirit of two classic bioinformatics experiments in
estimating expectation values for sequence alignments.Margaret Dayhoff’s
model (Dayhoff et al.,1978,1983) for random alignments was to align
scrambled sequences,whereas David Lipman aligned randomly selected
nonhomologous but unscrambled sequences (Lipman et al.,1984).The
scores for random natural sequences were generally higher than for
scrambled sequences,showing the nonrandom nature of protein sequence
space.
Our Dmodel assumes that the expected frequency of substitution is not
dependent on the structure but only on the overall background frequencies
of the amino acids.Our Lmodel uses the structural context to estimate
the expected frequency of substitution,assigning a different background
frequency distribution to each state.The strategy applied in the Lmodel
should give us a better estimate of the signiﬁcance of a substitution given
the structure,but with a ﬁnitesized database we run the risk of sparse or
missing data for some of the 210 possible substitutions and 281 states.And
the poorer results for the more ﬁnelydivided Lmodel suggest that sparse
data were a problem.The Dmodel suffers less from the sparse data.The
following formulas were used to sum the background counts [Equation (3)]
or statedependent background counts [Equation (4)] of each amino acid
from the training set.
FðiÞ ¼
X
m2MSAs
in training set
X
a2m
X
t2a
t
¼i
w
a
ðDmodelÞ ð3Þ
Fði j qÞ ¼
X
m2MSAs
in training set
X
a2m
X
t2a
t
¼i
w
a
g
m
qt
ðLmodelÞ:ð4Þ
A small pseudocount («) was added to the expected and observed sub
stitution frequencies to overcome the problem of sparse and missing data.
The choice of pseudocount does not greatly a effect the resulting alignments
but prevents certain runtime errors from occurring.
The frequencies F were normalized to probabilities,P.The observed con
ditional probability of a structurespeciﬁc substitution P(i,j j q) is as follows:
Pði‚jjqÞ ¼
Fði‚j j qÞ
P
h220 a:a:
P
k220 a:a:
Fðh‚k j qÞ
:ð5Þ
The normalized,structurespeciﬁc background amino acid probabilities
are
Pði j qÞ ¼
Fði j qÞ
P
h220a:a:
Fðh j qÞ
:ð6Þ
Structureindependent substitution probabilities P(i,j) were also calcu
lated,in order to directly compare the added value of structurespeciﬁc
models.
Pði‚jÞ ¼
P
281
q¼1
Fði‚j j qÞ
P
h220a:a:
P
k220a:a:
P
281
q¼1
Fðh‚k j qÞ
:ð7Þ
Structureindependent background amino acid probabilities P(i) are
PðiÞ ¼
FðiÞ
P
h220a:a:
FðhÞ
:ð8Þ
2.3 Generating the alignment matrix
Pairwise alignment programs generally start by calculating the alignment
matrix (A),where each element A
tu
is the score for the two amino acids a
t
and
b
u
,where a and b are the two sequences being aligned and t and u are two
sequence positions.In traditional Dynamic Programming algorithm,A
tu
is
assigned the appropriate LLR from the global substitution matrix (Smith
and Waterman,1981).
HMMSUM:structurebased substitution matrices
415
To calculate the alignment matrix using HMMSUM,we ﬁrst calculate
the gamma distribution for each sequence using the HMMSTR model,as
described in the previous section.Then,we sum the substitution probabil
ities,P(a
t
,b
u
j q),over all values of q,weighted by the joint probability that
positions t and u are in state q.The joint probability is simply the product
of the gvalues.P(a
t
,b
u
j Q) is deﬁned as the weighted sum over all states q
of the substitution probability.
Pða
t
‚b
u
j QÞ ¼
P
281
q¼1
Pða
t
‚b
u
j qÞg
a
qt
g
b
qu
P
281
q¼1
g
a
qt
g
b
qu
:ð9Þ
We also need to ﬁnd the weighted sum of the expected substitution
frequencies if we are using the Lmodel.We deﬁne P(a
t
j Q) to be the
weighted sum of the conditional background frequencies over all states q,
Pða
t
j QÞ ¼
P
281
q¼1
Pða
t
j qÞg
a
qt
P
281
q¼1
g
a
qt
‚ ð10Þ
where a
t
refers to the amino acid at position t in sequence a.
Nowwe can deﬁne the LLRs in the A matrix in the terms presented above
in Equations (5)–(10).Three models are shown.The Mmodel is inde
pendent of structure,a single substitution matrix like BLOSUM and
PAM.The Dmodel depends on the predicted structure and assumes a
structureindependent background substitution probability,which is the
product of the background amino acid probabilities.The Lmodel depends
on predicted structure and assumes a structuredependent background sub
stitution probability.
A
tu
¼ log
2
Pða
t
‚b
u
Þ
Pða
t
ÞPðb
u
Þ
ðMmodelÞ:ð11Þ
A
tu
¼ log
2
Pða
t
‚b
u
j QÞ
Pða
t
ÞPðb
u
Þ
ðDmodelÞ:ð12Þ
A
tu
¼ log
2
Pða
t
‚b
u
j QÞ
Pða
t
j QÞPðb
u
j QÞ
ðLmodelÞ:ð13Þ
Alignments were carried out using Smith–Waterman local Dynamic
Programming and gap penalties were optimized as described below.An
overview of construction and testing of the HMMSUM model is shown
in Figure 1.
2.4 Measuring accuracy in alignment
To assess the performance of the HMMSUMmodel,we used a documented
benchmark database,called BAliBASE (Bahr et al.,2001) as a reference
set.BAliBASE consists of 167 reference MSAs,with >2100 sequences.All
alignments are based on 3Dstructural superpositions (except reference set 7)
Fig.1.Overviewof constructingHMMSUMmodels.Constructionof the HMMSUMmodel canbe separatedintotwosteps:(A) trainingphase,whichcounts the
observedsubstitutionand backgroundfrequencies weighted bymismatchdistances and gvalues;(B) testingphase,which uses linear combination togenerate an
alignment matrix in LLR format and aligns two sequences by Dynamic Programming or BAA.
Y.M.Huang and C.Bystroff
416
which we believe to be as close as possible to the ‘true’ orthologous rela
tionships.Currently,BAliBASE is categorized into eight different reference
sets for different purposes.Within reference set 1 there are several subcate
gories:short sequences (<100 residues);medium sequences (200–300 resi
dues) and long sequences (>500 residues),and each subcategory is further
divided into different levels of percent identity (<25,20–40 and >35%).
For this study,we extracted all 99 of the MSAs in the ‘twilightzone’ range
(from 7 to 25% identify) from reference set 1 of BAliBASE 2.0.A total
of 147 available reference pairwise alignments are obtained from those
sequences,containing a total of 27930 ‘true’ matches.
The performance of the alignment methods was evaluated by assessing the
accuracy (the fraction of predicted matches that were ‘true’ matches) and
coverage (the fraction of ‘true’ matches that were predicted matches) over
the 27930 matches in the 147 alignments.The statistical signiﬁcance
(Pvalue) of the differences in ‘true’ matches and coverage was evaluated
using the Wilcoxon signedrank test (Wilcoxon,1945) over the individual
alignments.Alignments were ranked by the number of correct matches or
by accuracy,and the Pvalue was calculated for the likelihood that the
differences between the methods were the result of chance.
Alignment algorithms
The sequence alignments were performed in two
ways.Local Dynamic Programming alignment is an implementation of the
Smith–Waterman algorithm,which returns the optimal alignment given the
scoring scheme.In the Smith–Waterman algorithm,we utilized the afﬁne
gap penalty method (Altschul and Erickson,1986).Optimized gap opening
penalties and gap extension penalties were determined for each scoring
method by applying a grid search and evaluating the results by counting
the total correct matches.
Another algorithm,BAA computes the Bayesian posterior probability
of all possible a
t
,b
u
matches.In BAA,likelihood ratios,rather than
LLRs are used.For traditional substitution matrices like BLOSUM,we
used the exponent of the reported LLRs.BAA is carried out by ﬁrst
performing a forward sum,similar to the forward step in the forward/
backward algorithm,over a maximum of Kaligned blocks.We used
K ¼ min (20,n/10),where n is the length of the shorter of the two proteins
being aligned.Instead of a traceback through the optimal alignment,a
‘sampleback’ approach is taken in BAA.Probabilistic sampling of the
alignment space produces a probability distribution over all alignments,
expressed as a matrix of probabilities S,where S
tu
is the probability of seeing
positions t,u matched in an alignment.
Different scoring schemes were compared by calculating the probability
of the true alignment,i.e.the BAliBASE alignment,given the sampleback
distribution,S.
P
true
¼
X
true t‚ u
S
tu
:ð14Þ
Using BAA allows us to evaluate alignment quality on all possible sub
optimal alignments,and also avoids the use of parameters such as gap
opening penalties and extension penalties,which are known to highly affect
the result of alignments.
3 RESULTS AND DISCUSSION
3.1 Qualitative analysis of the substitution matrices
Substitution matrices such as BLOSUM,PAM and SDM are
expressed as symmetrical 20 · 20 matrices of LLRs.HMMSUM
on the other hand,consists of 281 matrices and is not so easily
viewed and understood.We may inspect the new model by looking
at its marginal properties,and also by asking whether different
structural contexts had signiﬁcantly different substitution probab
ilities,even when the amino acid composition was similar.
3.1.1 Marginal characteristics of HMMSUM By using marginal
probabilities of substitutions [Equation (11)],a single substitution
matrix (denoted as HMMSUMM) can be generated in the 20 · 20
format.A comparison of HMMSUMMand SDMscoring matrices
is shown in Figure 2.All of the structure information was ignored in
the HMMSUMMmodel.Overall,we see a higher score for changes
relative to identities in the HMMSUMM model when compared
with SDM,as indicated by the relatively higher numbers off the
diagonal (after correcting for differences in the mean and standard
deviation).This increases the importance of mutations in the align
ment score,relative to conserved residues.HMMSUMM singles
out Gly,Asp,Pro and Asn for the highest substitution penalties,
while Thr,Gln and Met are the most mutable.SDM,on the other
hand,gives Gly,Pro,Ile and Trp the highest penalty for substitution,
and ﬁnds Gln,Thr,and Arg to be the most mutable.The normalized
mean mismatch penalties for the amino acids correlate poorly
(R ¼ 0.60) between HMMSUMM and SDM.The correlation
over all elements of HMMSUMM with SDM (R ¼ 0.72) and
with BLOSUM50 (R ¼ 0.72) is weaker than these two correlate
with each other (R ¼ 0.84).Overall,the diversity between models
suggests that selection pressure has multiple solutions that cannot
all be modeled by a single matrix.
3.1.2 Analysis of two structural contexts that favor glycine Two
conserved glycine states were chosen to illustrate the effect of
structural context on the substitution probabilities.For comparison
purposes only,observed and background frequencies of each state q
were transformed to LLRs according to Equation (15) to generate
substitution matrices
log
2
Pði‚j j qÞ þ«
Pði j qÞPðj j qÞ þ«
‚ ð15Þ
where « ¼ 0.25.Note,however,that the matrices were not used in
this form in making alignments.Instead the alignment matrices are
calculated directly from the substitution probabilities,as described
in Equations (9–13),not from LLRs.
The substitution matrix for HMMSTR state 30,a conserved
glycine in a bturn motif (Figure 3,lower left) was found to be
Fig.2..Comparison of HMMSUMMand SDMsubstitution matrices.The
figure shows the comparison of marginal substitution matrix in HMMSUM
model,HMMSUMM (upper half),with SDM (lower half).Both are
expressed as LLRs in halfbit unit (scaled by a factor of 2),and rounded
to the nearest integer.The two matrices correlate weakly (R ¼ 0.72).
HMMSUM:structurebased substitution matrices
417
dramatically different from the substitution matrix for state 102,
a conserved glycine in a bstrand conformation (Figure 3,
upper right).The LLRs were calculated using structure
dependent expected substitution values,as in HMMSUML.Both
states represent common structural contexts.
The bstrand state has exactly the expected probability of G–G
substitutions (i.e.conservations),given the glycinerich com
position of that state (LLR ¼ 0).But the bturn state,which is
also glycinerich,has a much higher than expected rate of G–G
substitution (LLR ¼ 5).This reﬂects the energetic preference for
Gly in a bturn.
Residues other than glycine are less likely to be found in states
30 and 102,but if they are found in these contexts,they are scored
very differently.State 102 (bstrand) favors a mismatch of almost
any sort,almost as much as an identity match.This means that
residues other than Gly mutate more often than expected in this
context.This is not surprising since the expected substitution prob
ability would be very small for all nonGly amino acids,and it
means that those rare occurrences of nonglycine residues in this
context occurred together in the training set.For state 30 (bturn)
the situation is different.Glycine mutates more often than expected
to certain residues,C,D,P,Wand Y.In particular the preferred Gly
to Pro mutation makes sense for a bturn.But other substitutions
in this turn state are either not observed (large negative LLRs) or
they occur more often than expected,for the same reasons as
mentioned before.
By modeling the structural contribution to evolutionary selective
pressure,HMMSUM should add biological meaning to the align
ment of protein sequences.If so,an improvement should be seen
in the alignment accuracy.
3.2 Evaluation of alignment accuracy
To test whether the HMMSUM improves the alignments of
twilightzone sequences,we compared the matrix performance of
the HMMSUMmodels with BLOSUMmatrices or structurebased
substitution matrices SDM/HSDM,which has been reported as the
best choice for aligning proteins with low percent identity (Prlic
´
et al.,2000).Three methods were used in our study to construct the
A
tu
matrix [Equations (11–13)] and alignments were performed
using local Dynamic Programming.
Gap opening/extension penalties are optimized for each matrix
before the evaluation of accuracy.An 8fold jackknife cross
validation was used for all methods tested when determining the
optimal penalties in order to avoid overﬁtting.Evaluations are
reported here using the alignment parameters that gave us the
maximum correct matches (Table 1).The average gap opening/
extension penalties after crossvalidations for SDM (7.1/0.4) and
HSDM (19.5/1.0) are comparable to previous reports (6.6/0.6 and
19.0/0.8,respectively).
3.2.1 HMMSUMM results Using optimal gap parameters
(15.0/0.9),the structureindependent model HMMSUMM ﬁnds
6% more of the correct matches than BLOSUM40 (P ¼ 0.053),
the best performing BLOSUMmatrix in our evaluation.This result
cannot be attributed to the use of structure,since neither structure
assignment nor prediction was used,but it may be attributed to
using a different sequence weighting scheme in summing substitu
tion frequencies.Our weighting scheme increased the contribution
of distant homologs to the substitution scores,and the results are
seen in the increased LLRs for changes relative to conservations.
However,the increase in the number of correct matches was
accompanied by an increase in the number of incorrect matches,
and the accuracy in alignments remained unimproved overall.The
Wilcoxon test indicated that the change of accuracy in individual
alignments may have resulted by chance when compared with
BLOSUM40 (P ¼ 0.887).
3.2.2 HMMSUML results As mentioned in section 2,two sets
of structuredependent matrices,HMMSUMD and HMMSUML,
were made based on two different assumptions about the expected
substitutions.HMMSUML,with optimized gap parameters
(11.6/1.4),found 5% more correct matches than BLOSUM40
(P ¼ 0.065).This was fewer than the structureindependent
HMMSUMM model.We believe that the Lmodel,although
theoretically more sound,suffered from small counts problems
and therefore did not manifest its theoretical potential.
However,the small increase in correct matches was not
accompanied by an increase in incorrect matches completely,
and the overall accuracy in alignments improved slightly.This
improvement was found to be possibly the result of chance when
compared with BLOSUM40 (P ¼ 0.371).
3.2.3 HMMSUMD results HMMSUMD,a structure
dependent model with structureindependent background frequen
cies,performed much better than expected,ﬁnding 24% more cor
rect matches than the BLOSUM40 matrix.Both the accuracy and
the coverage of correct matches were signiﬁcantly improved when
optimal gap parameters were used (21.0/0.4).Of the 147 alignments
61% were more accurate using this model,an unlikely chance
occurrence (P < 0.001).
When compared with SDM and HSDM,the stateoftheart
matrices derived from structures,HMMSUMD still showed a
16% and an 18% improvement in correct matches,respectively
(P < 0.001).About 54% out of the 147 alignments were more
accurate when using HMMSUMD in both comparisons.The
Fig.3.Substitution matrices for glycinerich states in HMMSTR.Each sub
stitution matrix was generated using Equation (15).Substitution scores are
presented as LLRs in halfbit units.Here the figure shows the substitution
matrixof twostates inHMMSTR.Thelower half of the matrixis a bturnstate
(state 30 in HMMSTR) and the upper half of the matrix is a bstrand state
(state 102 in HMMSTR).
Y.M.Huang and C.Bystroff
418
alignment accuracy and coverage of HMMSUMD were signiﬁc
antly better than SDM (P ¼ 0.025 and P < 0.001,respectively).
3.2.4 Results using structure predictions and structure
assignments To investigate the effects of structural information
on difﬁcult alignment problems,several more HMMSUM models
were constructed.A structureprediction dependent model,
HMMSUMD
NS
,was summed from the training set using the
same method as that described from HMMSUMD,except that
the calculation of g was performed without using the structural
information.The gvalues in HMMSUMD
NS
can be expressed
loosely as follows:
g ¼ PðIsites motif j sequenceÞ:
Figure 4.Comparisonof alignments of 1MUCAand 4ENLproteins.Alignments are producedusingthe best matrixinBLOSUMand HMMSUMDserials from
tested methods [i.e.BLOSUM40 (
) and HMMSUMD
NS
(‡)] and SDMmatrix (†).True alignments fromBAliBASE (#) are highlighted in gray color.Actual
secondary structures of 4ENL are indicated in cylinders as ahelices and arrows as bstrands.Two sequences have 17.1% identity.
Figure 5.Ribbon structures of remote homologs 1MUCA and 4ENL.Homologous secondary structure elements can clearly be identified despite only 17.1%
sequence identity.
HMMSUM:structurebased substitution matrices
419
The g weights are therefore structure predictions rather than
assignments.Surprisingly,the accuracy signiﬁcantly improved
over the results when structure information was used in
HMMSUMD (P ¼0.002).When measured using correct matches,
alignment accuracy or coverage,this predictionbased model con
sistently performed best among methods tested.Figure 4 illustrates
this result using the two structurally similar proteins shown in
Figure 5.
To further understand the effect of structural information on
the alignment accuracy,we used the true structural information
for both the training phase and the testing phase.Although a
pairwise sequence alignment would not be the method of choice
for aligning two known structures,this test was thought to simulate
a ‘best case scenario’ where the structure predictions were perfect
for both sequences.As expected,the alignment accuracy for
this model,indicated as HMMSUMD
S
,was better than that of
HMMSUMD,where structure was ignored in the testing phase,
but surprisingly it was not as good as HMMSUMD
NS
,where the
structure was ignored in both phases (Table 1).
3.2.5 Results using three structural states To address the
importance of precision in structure predictions and assignments,
bare minimum structure information was adapted into the
HMMSUM model.A simple threestate (helix,strand and loop)
HMMSUMD model (denoted HMMSUMD
3
) was made and ana
lyzed in an analogous fashion.The performance of HMMSUMD
3
Figure 6.Alignments of 1R69 and 1NEQ using Dynamic Programming and BAA.(a) Alignments of two sequences were obtained by local Dynamic
Programming using BLOSUM40,SDM and different HMMSUMD models with optimized parameters in Table 1.True alignments from BAliBASE are
highlighted in gray.Two sequences have percent identity of 12.7%.(b) A histogram of 1R691NEQ alignment was generated by BBA (Bayesian Adaptive
Alignment).The color of each position in histogramrepresents the probability of aligning two positions between two sequences.(red:high probability;blue,low
probability;white,zero probability).Black boxes represent the true alignment of two sequences from BAliBASE.
Y.M.Huang and C.Bystroff
420
was comparable withother HMMSUMDmodels incorrect matches,
alignment accuracy or coverage (Table 1).The same improvement
was also found in another model,HMMSUMD
3+NS
,which was
built in the same way as HMMSUMD
NS
,using structure pre
dictions rather than structure assignments,but using only three
structure states.However,both threestate models are better than
the best reported singlematrix model,SDM.
3.3 Evaluation of suboptimal alignments
By examining the details of alignments from HMMSUM models,
we found that some segments were aligned to the wrong regions
by shifting a few positions (Fig.6a).This suggested that the true
alignment was suboptimal but ‘nearby’ in parameter space,perhaps
accessible by a slight change in the substitution matrices.
Alignments that are off by a little are just as bad as alignments
that are off by a lot,according to our measure of accuracy for the
Dynamic Programming algorithm (Table 1) since this method can
deliver only the one optimal alignment,and arbitrarily chooses
one alignment if there are other equalscoring possibilities.There
are several modiﬁcations to Dynamic Programming that consider
suboptimal alignments (Waterman and Eggert,1987;Huang et al.,
1990;Chao et al.,1994),however,in this study we chose to use
BAA,which models all alignments.
The BAA algorithm was also used to perform each of the 147
pairwise alignments using each substitution model,giving a pos
terior probability distribution over all alignments for each case.To
evaluate the performance of HMMSUMDwhen compared to SDM
using BAA,the probabilities of reference alignments were com
pared using P
true
[Equation (14)].Around 88%out of 147 reference
alignments have higher P
true
when using HMMSUMD than with
SDM.Figure 6b shows the results of BAA for the sequences in
Figure 6a.
4 CONCLUSIONS
Amino acid substitution matrices that are local structurespeciﬁc
signiﬁcantly improve the accuracy of pairwise alignments when the
sequences are distant homologs,even if only a simple threestate
structural model is used.The improvement can be seen in the
accuracy and coverage of correct matches when compared with
curated structurebased alignments.
Substitution matrices based on structure predictions were found
to be signiﬁcantly more accurate (P < 0.001) than substitution
matrices based on structure assignments.This result runs counter
intuitive since it suggests that the sequences alone contain more
information than the structure,even though we know that structural
homology extends beyond the reach of sequencebased alignment
methods.One possible explanation is that structure predictions con
tain information about weak or multiple secondary structure pro
pensities that is lost when a unique structure is assigned to each
position.Weak propensities may be conserved in some parts of the
structure in order to enforce the order of events in the folding
pathway.However,experimental studies have shown no conserva
tion of propensity across proteins of the same family (Mun
˜
oz et al.,
1995).
An alternative explanation is that substitution data were sparse
when unique structure assignments were used but less sparse when
predictions were used,leading to less random error in substitution
scores when indexed to predicted structures,and more randomerror
in substitution scores when indexed to assigned structures.In some
cases the structure predictions were wrong,but perhaps the added
data more than compensated for the inclusion of some false data.
It was also considered that homologous proteins might tend to
have correlated wrong predictions,and that would improve align
ments.The sparse data issue also partially explains why the more
ﬁnely divided HMMSUML model did not perform as well as
expected.Meanwhile,the threestate model HMMSUMD
3
may
have suffered less fromsparse data,since using structure predictions
(HMMSUMD
3+NS
) did not further improve alignments.
An improvement was also found in the scores of true suboptimal
alignments using the Bayesian Adaptive method.An additional
34% of 147 individual alignments were improved when using the
BAA algorithm in comparing HMMSUMD and SDM.Accuracy
in ﬁnding true suboptimal alignments is important since optimal
alignments for twilight zone sequences are only 40–50% accurate.
ACKNOWLEDGEMENTS
We thank BobbieJo WebbRobertson for contributing to the
Bayes Aligner coding and for helpful conversations.This work
was supported by NSF award DBI0343206 to C.B.
Conflict of Interest:none declared.
REFERENCES
Altschul,S.F.and Erickson,B.W.(1986) Optimal sequence alignment using afﬁne gap
costs.Bull.Math.Biol.,48,603–616.
Altschul,S.F.et al.(1990) Basic local alignment search tool.J.Mol.Evol.,215,
403–410.
Altschul,S.F.et al.(1997) Gapped BLAST and PSIBLAST:a new generation of
protein database search programs.Nucleic Acids Res.,25,3389–3402.
Bahr,A.et al.(2001) BAliBASE (Benchmark Alignment dataBASE):enhancements
for repeats,transmembrane sequences and circular permutations.Nucleic Acids
Res.,29,323–326.
Bystroff,C.and Baker,D.(1998) Prediction of local structure in proteins using a library
of sequencestructure motifs.J.Mol.Biol.,281,565–577.
Bystroff,C.and Garde,S.(2003) Helix propensities of short peptides:molecular
dynamics versus bioinformatics.Proteins,50,552–562.
Bystroff,C.et al.(2000) HMMSTR:a hidden Markov model for local sequence
structure correlations in proteins.J.Mol.Biol.,301,173–190.
Chao,K.M.et al.(1994) Recent developments in linearspace alignment methods:
a survey.J.Comput.Biol.,1,271–291.
Dayhoff,M.O.et al.(1983) Establishing homologies in protein sequences.Methods
Enzymol.,91,524–545.
Dayhoff,M.O.,Schwartz,R.M.and Orcutt,B.C.(1978) Amodel of evolutionary change
in proteins.In Dayhoff,M.(ed.),Atlas of Protein Sequence and structure.National
Biomedical Research Foundation,Silver Spring,Maryland,USA.Vol.5 (Suppl.3),
pp.345–352.
Doolittle,R.F.(1981) Similar amino acid sequences:chance or common ancestry?
Science,,214,149–159.
Henikoff,S.and Henikoff,J.G.(1992) Amino acid substitution matrices from protein
blocks.Proc.Natl Acad.Sci.USA,89,10915–10919.
Hobohm,U.and Sander,C.(1994) Enlarged representative set of protein structures.
Protein Sci.,3,522–524.
Hou,Y.et al.(2004) Remote homolog detection using local sequencestructure
correlations.Proteins,57,518–530.
Huang,X.Q.et al.(1990) A spaceefﬁcient algorithm for local similarities.Comput.
Appl.Biosci.,6,373–381.
Huang,X.Q.and Miller,W.(1991) A timeefﬁcient,linearspace local similarity
algorithm.Adv.Appl.Math.,12,337–357.
Jaroszewski,L.et al.(2002) In search for more accurate alignments in the twilight zone.
Protein Sci.,11,1702–1713.
Lipman,D.J.et al.(1984) On the statistical signiﬁcance of nucleic acid similarities.
Nucleic Acids Res.,12,215–226.
HMMSUM:structurebased substitution matrices
421
Munoz,V.et al.(1995) The distribution of alphahelix propensity along the polypeptide
chain is not conserved in proteins fromthe same family.Protein Sci.,4,1577–1586.
Northey,J.G.et al.(2002) Protein folding kinetics beyond the phi value:using multiple
amino acid substitutions to investigate the structure of the SH3 domain folding
transition state.J.Mol.Biol.,320,389–402.
Pearson,W.R.(1995) Comparison of methods for searching protein sequence databases.
Protein Sci.,4,1145–1160.
Pearson,W.R.(1996) Effective protein sequence comparison.Methods Enzymol.,266,
227–258.
Pearson,W.R.(1998) Empirical statistical estimates for sequence similarity searches.
J.Mol.Biol.,276,71–84.
Pearson,W.R.(2000) Flexible sequence similarity searching with the FASTA3 program
package.Methods Mol.Biol.,132,185–219.
Prlic
´
,A.et al.(2000) Structurederived substitution matrices for alignment of distantly
related sequences.Protein Eng.,13,545–550.
Rabiner,L.R.(1989) A tutorial on hidden Markov models and selected applications in
speech recognition.Proc.IEEE,77,257–286.
Sasson,I.and Fischer,D.(2003) Modeling threedimensional protein structures
for CASP5 using the 3DSHOTGUN metapredictors.Proteins,53 (Suppl.6),
389–394.
Smith,T.F.and Waterman,M.S.(1981) Identiﬁcation of common molecular sub
sequences.J.Mol.Biol.,147,195–197.
Smith,T.F.et al.(1981) Comparative biosequence metrics.J.Mol.Evol.,18,
38–46.
Thompson,J.D.et al.(1999) BAliBASE:a benchmark alignment database for the
evaluation of multiple alignment programs.Bioinformatics,15,87–88.
Vingron,M.and Argos,P.(1989) A fast and sensitive multiple sequence alignment
algorithm.Comput.Appl.Biosci.,5,115–121.
Waterman,M.S.and Eggert,M.(1987) A new algorithm for best subsequence align
ments with application to tRNArRNA comparisons.J.Mol.Biol.,197,723–728.
Wilcoxon,F.(1945) ‘Individual Comparisons by Ranking Methods’.Biometrics,1,
80–83.
Xu,J.et al.(2003) RAPTOR:optimal protein threading by linear programming.
J.Bioinform.Comput.Biol.,1,95–117.
Yi,Q.et al.(1998) Prediction and structural characterization of an independently
folding substructure in the src SH3 domain.J.Mol.Biol.,283,293–300.
Yona,G.and Levitt,M.(2002) Within the twilight zone:a sensitive proﬁleproﬁle
comparison tool based on information theory.J.Mol.Biol.,315,1257–1275.
Zhu,J.et al.(1998) Bayesian adaptive sequence alignment algorithms.Bioinformatics,
14,25–39.
Y.M.Huang and C.Bystroff
422
Enter the password to open this PDF file:
File name:

File size:

Title:

Author:

Subject:

Keywords:

Creation Date:

Modification Date:

Creator:

PDF Producer:

PDF Version:

Page Count:

Preparing document for printing…
0%
Comments 0
Log in to post a comment