Vol.22 no.4 2006,pages 413–422

doi:10.1093/bioinformatics/bti828

BIOINFORMATICS ORIGINAL PAPER

Sequence analysis

Improved pairwise alignments of proteins in the Twilight Zone

using local structure predictions

Yao-ming Huang and Christopher Bystroff

Center for Bioinformatics,Department of Biology,Rensselaer Polytechnic Institute,Troy,NY 12180,USA

Received on March 9,2005;revised on November 2,2005;accepted on December 8,2005

Advance Access publication December 13,2005

Associate Editor:Christos Ouzounis

ABSTRACT

Motivation:In recent years,advances have been made in the ability

of computational methods to discriminate between homologous and

non-homologous proteins in the ‘twilight zone’ of sequence similarity,

where the percent sequence identity is a poor indicator of homology.

To make these predictions more valuable to the protein modeler,

they must be accompanied by accurate alignments.Pairwise

sequence alignments are inferences of orthologous relationships

betweensequencepositions.Evolutionarydistanceistraditionallymod-

eled using global amino acid substitution matrices.But real differences

in the likelihood of substitutions may exist for different structural con-

textswithinproteins,sincestructural context contributestotheselective

pressure.

Results:HMMSUM(HMMSTR-based substitution matrices) is a new

model for structural context-basedamino acidsubstitutionprobabilities

consisting of a set of 281 matrices,each for a different sequence–

structure context.HMMSUM does not require the structure of the

protein to be known.Instead,predictions of local structure are made

usingHMMSTR,ahiddenMarkov model for local structure.Alignments

using the HMMSUMmatrices compare favorably to alignments carried

out using the BLOSUM matrices or structure-based substitution

matrices SDM and HSDM when validated against remote homolog

alignments from BAliBASE.HMMSUM has been implemented using

local Dynamic Programming and with the Bayesian Adaptive

alignment method.

Availability:Matrices,source codes and programs are available at

http://www.bioinfo.rpi.edu/~bystrc/downloads.html.

Contact:bystrc@rpi.edu,huangy2@rpi.edu

1 INTRODUCTION

Amino acid substitution matrices are evolutionary models that seek

to explain the cost of mutating any one of the twenty amino acids

to any other,relative to the cost of not mutating.The most widely

used substitution matrices,such as PAM(Dayhoff et al.,1978) and

BLOSUM (Henikoff and Henikoff,1992),are derived from high

conﬁdence multiple sequence alignments (MSAs).Counting stat-

istics are used to estimate the frequency of each possible mutation,

and a ratio of the observed mutation probability to the probability

one would expect by chance is calculated.The logarithm of this

likelihood ratio is the number we use when scoring pairwise

alignments such as those generated by Dynamic Programming

algorithms [LALIGN (Huang and Miller,1991),etc.] and by

database search algorithms [SSEARCH,FASTA (Pearson,1995,

1996,1998,2000) and BLAST (Altschul et al.,1990,1997)].

The current models ignore the possible effect of the structural

context on the frequencies of mutation,instead summing the fre-

quencies over all positions in the training set alignments equally.

But the structural context may inﬂuence the cost of making a

mutation.For example,glycine is known to be strongly conserved

when it occurs in a tight turn,where the presence of any side chain

would sterically hinder the formation of the turn.In other structural

contexts,glycine is more likely to mutate.As another example,

consider the selective pressures felt by a leucine in the core of a

protein versus a leucine on the surface.Ahydrophobic side chain on

the surface may sometimes be conserved for functional reasons,but

a hydrophobic side chain in the core of a protein is probably critical

for folding,and this is arguably a stronger selective pressure.

In general,we do not know the structural context of each of the

positions in a sequence at the point when we want to use the

evolutionary model—that is,when we want to align the sequence

or search a database.A structure-based evolutionary model such as

the one we propose in this paper is useful for alignments and data-

base searches only if we can ﬁrst predict the structural context of

each residue.The prediction of residue context falls far short of

predicting the 3D structure,but adds information that can be used

to improve the quality of alignments,as we show in this paper.

Here we use local structure predictions from HMMSTR [hidden

Markov model for protein structure (Bystroff et al.,2000)] to

improve the quality of the most difﬁcult pairwise alignments,

those with <25% sequence identity—in the so-called ‘Twilight

Zone’ of homology detection (Doolittle,1981).The new method

compliments remote homology detection methods,such as

RAPTOR (Xu et al.,2003),SVM-HMMSTR (Hou et al.,2004),

3D-SHOTGUN (Sasson and Fischer,2003) and others,that can

accurately identify remote structural homology but do not contrib-

ute to improved alignments.We show that distant homologs can

be aligned more accurately when the substitution matrix considers

the local structure as predicted by HMMSTR.

In general,proﬁle–proﬁle alignment methods have a greater

chance of success in aligning Twilight Zone sequences (Yona

and Levitt,2002;Jaroszewski and Godzik,2002),but obtaining

the proﬁle requires ﬁrst performing pairwise alignments.Therefore,

improvements in pairwise alignments will eventually improve

MSAs and proﬁle–proﬁle alignments.

The new model,called HMMSUM (HMMSTR-based sub-

stitution matrices),is a set of matrices,one for each of the

281 local structure contexts deﬁned by HMMSTR.The matrices

To whom correspondence should be addressed.

The Author 2005.Published by Oxford University Press.All rights reserved.For Permissions,please email:journals.permissions@oxfordjournals.org

413

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

were summed from a training set of MSAs in the same way as the

previous matrices PAM and BLOSUM were summed,but the

training set in this case contained HMMSTR descriptors (Markov

state probabilities) for each position,based on known structures.

When HMMSUM is used in an alignment,the target and template

HMMSTR descriptors are ﬁrst predicted,then an alignment matrix

(Smith and Waterman,1981) is calculated using the HMMSUM

model.In the alignment matrix,each ‘match score’ is a linear

combination of the substitution scores from the 281 matrices.

The weights are the target and template HMMSTR descriptors

[g,Equation (1)].

The alignments were carried out using standard Smith–Waterman

local Dynamic Programming (Smith and Waterman,1981;Smith

et al.,1981),or using Bayesian Adaptive Alignment (BAA) (Zhu

et al.,1998).The results were validated against BAliBASE

(Thompson et al.,1999;Bahr et al.,2001),a curated database

of structure-based MSAs.Alignment accuracy was assessed as

the number of correctly aligned positions (correct matches in

Table 1).Pairwise alignments produced by the HMMSUM model

had signiﬁcantly more correct matches than alignments using the

BLOSUM35,40,50 or 55 matrices.These results were also a

signiﬁcant improvement over two previously described structure-

based alignment matrices,SDM and HSDM (Prlic

´

et al.,2000),

which are in turn signiﬁcantly better than the best BLOSUM

matrix.However,SDMand HSDMare single substitution matrices,

whereas our model consists of many matrices.

In broad terms,an increase in the number of variable parameters

improved the model.An apparent increase in accuracy can result

from simple over-ﬁtting of the data,but this is not the case here.

None of the test alignments were part of the training data,which did

not include twilight-zone alignments or structure-based alignments.

Also,jackknife statistical tests were performed to showthat the gap

parameter optimization was valid.

Analysis of the numbers (log-likelihood ratios,LLRs) in the

HMMSUM matrices,when compared with SDM,hints at the

reasons for the improved accuracy.Different local structures

have different amino acid preferences.These differences are sig-

niﬁcant and cannot be captured in a single substitution matrix.

2 METHODS

2.1 Construction of MSAs and structure descriptors

The observed amino acid substitution frequencies were summed froma non-

redundant training set of MSAs.Amodiﬁed PDBselect-25 list (Hobohmand

Sander,1994) was used as a source of parent sequences.Sequences in the

original PDBselect-25 list were excluded if the parent structure was not well

determined,if the protein was membrane-bound,or if it contained a large

number of disulﬁde bonds.Disordered loop regions were also omitted.Each

of the remaining 657 parent sequences has at most 25% sequence identity

with any other sequence in the list.MSAs were produced by searching the

‘nr’ protein database using two iterations of PSI-BLAST (Altschul et al.,

1997),with an E-value cut-off of 0.001.In practice,this cut-off produces

alignments that very rarely have <25% identity,and the training set con-

tained none of the ‘Twilight Zone’ alignments that were later used for

validation in the testing phase.

HMMSTR predictions

HMMSTR (Bystroff et al.,2000) was used to

assign/predict structure descriptors for each position in each MSA during

the training phase,and also to predict the structure descriptors in target

sequences during the testing phase.HMMSTR takes as input the sequence

proﬁle and,optionally,the protein structure expressed as backbone (w,c)

angles.We say the descriptors were assigned if the true structure was used

as input to HMMSTR in addition to the sequence proﬁle.Otherwise,we say

the descriptors were predicted if only the sequence proﬁle was the input

to HMMSTR.A sequence proﬁle is a probability distribution over the

20 amino acids at each MSA position.It is calculated by summing the

frequency of each amino acid in each MSA column,after accounting for

sequence redundancy.PSI-BLAST produces a proﬁle as part of its function,

and the proﬁle was used as input to HMMSTR.

HMMSTR is a hidden Markov model containing 281 Markov states (q).

Each state ‘emits’ a single descriptor from a state-speciﬁc probability dis-

tribution.HMMSTR states have probability distributions for amino acid

type,secondary structure,backbone angle region and structural context.

Table 1.Comparison of HMMSUM models with other matrices

Model Gap penalty Gap number

b

Gap length

b

Alignment length

b

Correct matches Accuracy Coverage

Opening

a

Extension

a

Counts P-value

c

% P-value

c

% P-value

c

SDM 7.1 ± 0.35 0.4 ± 0.04

1310

5973

25528 11393

— 44.6 — 40.8 —

HSDM

19.5 ± 0.93

1.0 ± 0.09

1284

6134

24762 11114

0.03 44.9 0.884 39.8 0.006

BLOSUM35

11.0 ± 1.41

0.9 ± 0.39

1560

6221

25775 10410

<0.001 40.4 <0.001 37.3 <0.001

BLOSUM40

13.1 ± 0.64

1.3 ± 0.30

1377

5089

24879 10581

0.002 42.5 0.022 37.9 0.001

BLOSUM50

8.4 ± 1.06

2.1 ± 0.45

2089

5019

23674 10019

<0.001 42.3 0.046 35.9 <0.001

BLOSUM55

11.3 ± 0.71

0.9 ± 0.20

1842

7850

22857 9719

<0.001 42.5 <0.001 34.8 <0.001

HMMSUM-M

15.0 ± 0.00

0.9 ± 0.00

1457

6278

26813 11204

0.644 41.8 0.158 40.1 0.928

HMMSUM-L

11.6 ± 0.52

1.4 ± 0.27

1571

5071

25339 11089

0.526 43.8 0.627 39.7 0.443

HMMSUM-D

21.0 ± 0.53

0.4 ± 0.18

920

5145

28477 13165

<0.001 46.2 0.025 47.1 <0.001

HMMSUM-D

3

13.0 ± 0.53

0.3 ± 0.08

1173

6028

27892 12938

<0.001 46.4 0.108 46.3 <0.001

HMMSUM-D

S

14.1 ± 0.38

0.8 ± 0.08

1510

6211

27892 13269

<0.001 47.6 0.007 47.5 <0.001

HMMSUM-D

NS

13.4 ± 1.51

0.3 ± 0.12

1283

5959

28253 13930

<0.001 49.3 <0.001 49.9 <0.001

HMMSUM-D

3+NS

13.3 ± 0.46

0.2 ± 0.13

1175

6371

27759 12981

<0.001 46.8 0.037 46.5 <0.001

Optimal gap penalties are defined as the combination of opening and extension penalty that will produce the highest number of correct matches under zero matrix bias.P-value of

two group comparisons are estimated Wilcoxon signed-rank test.Accuracy and coverage were calculated as mentioned in Section 2.

a

Gap parameters were determined by taking the average of results from eight cross-validations.Standard deviations of cross-validations are indicated as ±SD.

b

Gap numbers/length and alignment length were calculated over 147 alignments under optimal gap parameters of individual methods.

c

The P-value indicates the significance of the difference in each measurement between SDMand a given method.P-value of >0.05 indicates a non-significant difference.

Y.-M.Huang and C.Bystroff

414

States are connected to each other via transitions to model the likelihood of

different emission distributions being adjacent to each other in the protein

chain.Strongly linked states in HMMSTRrepresent the I-sites local structure

motifs (Bystroff and Baker,1998),such as helix capping motifs and beta

turns,from which the HMMSTR model was derived.Some studies have

shown that I-sites motifs can fold independently and/or initiate folding (Yi

et al.,1998;Bystroff and Garde,2003;Northey et al.,2002).The forward/

backward algorithm (Rabiner,1989) was used to predict the conditional

probabilities [g,Equation (1)] of each state q given each position t in the

sequence.

g

qt

¼ Pðq j tÞ:ð1Þ

The details of this calculation were described previously (Bystroff et al.,

2000).

The g matrix may be viewed as a probabilistic representation of all local

structure motifs (I-sites) given the sequence and,optionally,the structure.

Loosely speaking,

g ¼ PðI-sites motif j sequence‚½structureÞ:

In practice,when the optional structure data were included the gamma

distribution was sharply peaked at most positions,having probability ¼1 for

one state and 0 for all others.However,some local structure types are

missing or redundantly modeled by HMMSTR.For these positions the

gamma distribution was less sharp.In the testing phase of this study,

when the structure data were not included in the forward/backward calcu-

lation,the motif predictions were more often ambiguous,and the gamma

distribution was sharply peaked at fewer positions.Because of the known

ambiguity of motif predictions,an early decision was made to treat all local

structure assignments/predictions probabilistically [i.e.Equation (1)] rather

than as a string of discrete descriptors.

2.2 Summing structure-dependent substitution

frequencies from training set MSAs

The matrices of observed and background frequencies for amino acid sub-

stitutions were summed in a manner similar to the one described earlier

for BLOSUM (Henikoff and Henikoff,1992),but in our case probabilistic

weights were used in order to separate the training set positions into

HMMSTR states.

Sequence weights,calculated using the mismatch method (Vingron

and Argos,1989),were used to eliminate the redundancy within an given

MSA.For example,in a MSA with L aligned sequences,an L · L mismatch

matrix is constructed where each matrix element is the number of mis-

matches between two sequences.The mismatch distance of a sequence

a is calculated as the sum over all the elements in row a,divided by sum

over the whole matrix.Sequences with lower mismatch distances (w) were

more redundant and should therefore contribute less to the sum of sub-

stitutions.The contribution of sequences a and b to any observed substitution

frequency was the product of the sequence weights (w

a

· w

b

).

As in previous studies,no assumption was made about the direction of the

substitution;therefore,the substitution matrices are symmetrical.

Likelihood ratios are the ratio of the observed substitution probability

and the expected substitution probability.The expected substitution

probability was calculated as the product of two background amino acid

probabilities.The background frequencies were summed from the same

training set MSAs,with each sequence a contributed w

a

to this sum.

The observed counts of state-dependent substitutions F was calculated

fromthe training set of MSAs.Each column (t) in each MSA(m) was given a

probability of being state q (g

qt

) using HMMSTR.F(i,j j q),the state-speciﬁc

observed counts of i $ j substitutions,was the sum of the g-weighted

sequence weights (w) over all sequence pairs a,b where the amino acid

at t in a was i and the amino acid at t in b was j,as follows:

Fði‚j j qÞ ¼

X

m2MSAs

in training set

X

a2m

X

b2m

b < a

X

t2a

t

¼i

&b

t

¼j

w

a

w

b

g

m

qt

:ð2Þ

Two strategies for expected frequencies

For the expected frequencies

of substitution,we considered two models,called ‘Dayhoff’ (D) and

‘Lipman’ (L) in the spirit of two classic bioinformatics experiments in

estimating expectation values for sequence alignments.Margaret Dayhoff’s

model (Dayhoff et al.,1978,1983) for random alignments was to align

scrambled sequences,whereas David Lipman aligned randomly selected

non-homologous but unscrambled sequences (Lipman et al.,1984).The

scores for random natural sequences were generally higher than for

scrambled sequences,showing the non-random nature of protein sequence

space.

Our D-model assumes that the expected frequency of substitution is not

dependent on the structure but only on the overall background frequencies

of the amino acids.Our L-model uses the structural context to estimate

the expected frequency of substitution,assigning a different background

frequency distribution to each state.The strategy applied in the L-model

should give us a better estimate of the signiﬁcance of a substitution given

the structure,but with a ﬁnite-sized database we run the risk of sparse or

missing data for some of the 210 possible substitutions and 281 states.And

the poorer results for the more ﬁnely-divided L-model suggest that sparse

data were a problem.The D-model suffers less from the sparse data.The

following formulas were used to sum the background counts [Equation (3)]

or state-dependent background counts [Equation (4)] of each amino acid

from the training set.

FðiÞ ¼

X

m2MSAs

in training set

X

a2m

X

t2a

t

¼i

w

a

ðD-modelÞ ð3Þ

Fði j qÞ ¼

X

m2MSAs

in training set

X

a2m

X

t2a

t

¼i

w

a

g

m

qt

ðL-modelÞ:ð4Þ

A small pseudo-count («) was added to the expected and observed sub-

stitution frequencies to overcome the problem of sparse and missing data.

The choice of pseudo-count does not greatly a effect the resulting alignments

but prevents certain run-time errors from occurring.

The frequencies F were normalized to probabilities,P.The observed con-

ditional probability of a structure-speciﬁc substitution P(i,j j q) is as follows:

Pði‚jjqÞ ¼

Fði‚j j qÞ

P

h220 a:a:

P

k220 a:a:

Fðh‚k j qÞ

:ð5Þ

The normalized,structure-speciﬁc background amino acid probabilities

are

Pði j qÞ ¼

Fði j qÞ

P

h220a:a:

Fðh j qÞ

:ð6Þ

Structure-independent substitution probabilities P(i,j) were also calcu-

lated,in order to directly compare the added value of structure-speciﬁc

models.

Pði‚jÞ ¼

P

281

q¼1

Fði‚j j qÞ

P

h220a:a:

P

k220a:a:

P

281

q¼1

Fðh‚k j qÞ

:ð7Þ

Structure-independent background amino acid probabilities P(i) are

PðiÞ ¼

FðiÞ

P

h220a:a:

FðhÞ

:ð8Þ

2.3 Generating the alignment matrix

Pairwise alignment programs generally start by calculating the alignment

matrix (A),where each element A

tu

is the score for the two amino acids a

t

and

b

u

,where a and b are the two sequences being aligned and t and u are two

sequence positions.In traditional Dynamic Programming algorithm,A

tu

is

assigned the appropriate LLR from the global substitution matrix (Smith

and Waterman,1981).

HMMSUM:structure-based substitution matrices

415

To calculate the alignment matrix using HMMSUM,we ﬁrst calculate

the gamma distribution for each sequence using the HMMSTR model,as

described in the previous section.Then,we sum the substitution probabil-

ities,P(a

t

,b

u

j q),over all values of q,weighted by the joint probability that

positions t and u are in state q.The joint probability is simply the product

of the g-values.P(a

t

,b

u

j Q) is deﬁned as the weighted sum over all states q

of the substitution probability.

Pða

t

‚b

u

j QÞ ¼

P

281

q¼1

Pða

t

‚b

u

j qÞg

a

qt

g

b

qu

P

281

q¼1

g

a

qt

g

b

qu

:ð9Þ

We also need to ﬁnd the weighted sum of the expected substitution

frequencies if we are using the L-model.We deﬁne P(a

t

j Q) to be the

weighted sum of the conditional background frequencies over all states q,

Pða

t

j QÞ ¼

P

281

q¼1

Pða

t

j qÞg

a

qt

P

281

q¼1

g

a

qt

‚ ð10Þ

where a

t

refers to the amino acid at position t in sequence a.

Nowwe can deﬁne the LLRs in the A matrix in the terms presented above

in Equations (5)–(10).Three models are shown.The M-model is inde-

pendent of structure,a single substitution matrix like BLOSUM and

PAM.The D-model depends on the predicted structure and assumes a

structure-independent background substitution probability,which is the

product of the background amino acid probabilities.The L-model depends

on predicted structure and assumes a structure-dependent background sub-

stitution probability.

A

tu

¼ log

2

Pða

t

‚b

u

Þ

Pða

t

ÞPðb

u

Þ

ðM-modelÞ:ð11Þ

A

tu

¼ log

2

Pða

t

‚b

u

j QÞ

Pða

t

ÞPðb

u

Þ

ðD-modelÞ:ð12Þ

A

tu

¼ log

2

Pða

t

‚b

u

j QÞ

Pða

t

j QÞPðb

u

j QÞ

ðL-modelÞ:ð13Þ

Alignments were carried out using Smith–Waterman local Dynamic

Programming and gap penalties were optimized as described below.An

overview of construction and testing of the HMMSUM model is shown

in Figure 1.

2.4 Measuring accuracy in alignment

To assess the performance of the HMMSUMmodel,we used a documented

benchmark database,called BAliBASE (Bahr et al.,2001) as a reference

set.BAliBASE consists of 167 reference MSAs,with >2100 sequences.All

alignments are based on 3Dstructural superpositions (except reference set 7)

Fig.1.Overviewof constructingHMMSUMmodels.Constructionof the HMMSUMmodel canbe separatedintotwosteps:(A) trainingphase,whichcounts the

observedsubstitutionand backgroundfrequencies weighted bymismatchdistances and g-values;(B) testingphase,which uses linear combination togenerate an

alignment matrix in LLR format and aligns two sequences by Dynamic Programming or BAA.

Y.-M.Huang and C.Bystroff

416

which we believe to be as close as possible to the ‘true’ orthologous rela-

tionships.Currently,BAliBASE is categorized into eight different reference

sets for different purposes.Within reference set 1 there are several subcate-

gories:short sequences (<100 residues);medium sequences (200–300 resi-

dues) and long sequences (>500 residues),and each subcategory is further

divided into different levels of percent identity (<25,20–40 and >35%).

For this study,we extracted all 99 of the MSAs in the ‘twilight-zone’ range

(from 7 to 25% identify) from reference set 1 of BAliBASE 2.0.A total

of 147 available reference pairwise alignments are obtained from those

sequences,containing a total of 27930 ‘true’ matches.

The performance of the alignment methods was evaluated by assessing the

accuracy (the fraction of predicted matches that were ‘true’ matches) and

coverage (the fraction of ‘true’ matches that were predicted matches) over

the 27930 matches in the 147 alignments.The statistical signiﬁcance

(P-value) of the differences in ‘true’ matches and coverage was evaluated

using the Wilcoxon signed-rank test (Wilcoxon,1945) over the individual

alignments.Alignments were ranked by the number of correct matches or

by accuracy,and the P-value was calculated for the likelihood that the

differences between the methods were the result of chance.

Alignment algorithms

The sequence alignments were performed in two

ways.Local Dynamic Programming alignment is an implementation of the

Smith–Waterman algorithm,which returns the optimal alignment given the

scoring scheme.In the Smith–Waterman algorithm,we utilized the afﬁne

gap penalty method (Altschul and Erickson,1986).Optimized gap opening

penalties and gap extension penalties were determined for each scoring

method by applying a grid search and evaluating the results by counting

the total correct matches.

Another algorithm,BAA computes the Bayesian posterior probability

of all possible a

t

,b

u

matches.In BAA,likelihood ratios,rather than

LLRs are used.For traditional substitution matrices like BLOSUM,we

used the exponent of the reported LLRs.BAA is carried out by ﬁrst

performing a forward sum,similar to the forward step in the forward/

backward algorithm,over a maximum of K-aligned blocks.We used

K ¼ min (20,n/10),where n is the length of the shorter of the two proteins

being aligned.Instead of a trace-back through the optimal alignment,a

‘sample-back’ approach is taken in BAA.Probabilistic sampling of the

alignment space produces a probability distribution over all alignments,

expressed as a matrix of probabilities S,where S

tu

is the probability of seeing

positions t,u matched in an alignment.

Different scoring schemes were compared by calculating the probability

of the true alignment,i.e.the BAliBASE alignment,given the sample-back

distribution,S.

P

true

¼

X

true t‚ u

S

tu

:ð14Þ

Using BAA allows us to evaluate alignment quality on all possible sub-

optimal alignments,and also avoids the use of parameters such as gap

opening penalties and extension penalties,which are known to highly affect

the result of alignments.

3 RESULTS AND DISCUSSION

3.1 Qualitative analysis of the substitution matrices

Substitution matrices such as BLOSUM,PAM and SDM are

expressed as symmetrical 20 · 20 matrices of LLRs.HMMSUM

on the other hand,consists of 281 matrices and is not so easily

viewed and understood.We may inspect the new model by looking

at its marginal properties,and also by asking whether different

structural contexts had signiﬁcantly different substitution probab-

ilities,even when the amino acid composition was similar.

3.1.1 Marginal characteristics of HMMSUM By using marginal

probabilities of substitutions [Equation (11)],a single substitution

matrix (denoted as HMMSUM-M) can be generated in the 20 · 20

format.A comparison of HMMSUM-Mand SDMscoring matrices

is shown in Figure 2.All of the structure information was ignored in

the HMMSUM-Mmodel.Overall,we see a higher score for changes

relative to identities in the HMMSUM-M model when compared

with SDM,as indicated by the relatively higher numbers off the

diagonal (after correcting for differences in the mean and standard

deviation).This increases the importance of mutations in the align-

ment score,relative to conserved residues.HMMSUM-M singles

out Gly,Asp,Pro and Asn for the highest substitution penalties,

while Thr,Gln and Met are the most mutable.SDM,on the other

hand,gives Gly,Pro,Ile and Trp the highest penalty for substitution,

and ﬁnds Gln,Thr,and Arg to be the most mutable.The normalized

mean mismatch penalties for the amino acids correlate poorly

(R ¼ 0.60) between HMMSUM-M and SDM.The correlation

over all elements of HMMSUM-M with SDM (R ¼ 0.72) and

with BLOSUM50 (R ¼ 0.72) is weaker than these two correlate

with each other (R ¼ 0.84).Overall,the diversity between models

suggests that selection pressure has multiple solutions that cannot

all be modeled by a single matrix.

3.1.2 Analysis of two structural contexts that favor glycine Two

conserved glycine states were chosen to illustrate the effect of

structural context on the substitution probabilities.For comparison

purposes only,observed and background frequencies of each state q

were transformed to LLRs according to Equation (15) to generate

substitution matrices

log

2

Pði‚j j qÞ þ«

Pði j qÞPðj j qÞ þ«

‚ ð15Þ

where « ¼ 0.25.Note,however,that the matrices were not used in

this form in making alignments.Instead the alignment matrices are

calculated directly from the substitution probabilities,as described

in Equations (9–13),not from LLRs.

The substitution matrix for HMMSTR state 30,a conserved

glycine in a b-turn motif (Figure 3,lower left) was found to be

Fig.2..Comparison of HMMSUM-Mand SDMsubstitution matrices.The

figure shows the comparison of marginal substitution matrix in HMMSUM

model,HMMSUM-M (upper half),with SDM (lower half).Both are

expressed as LLRs in half-bit unit (scaled by a factor of 2),and rounded

to the nearest integer.The two matrices correlate weakly (R ¼ 0.72).

HMMSUM:structure-based substitution matrices

417

dramatically different from the substitution matrix for state 102,

a conserved glycine in a b-strand conformation (Figure 3,

upper right).The LLRs were calculated using structure-

dependent expected substitution values,as in HMMSUM-L.Both

states represent common structural contexts.

The b-strand state has exactly the expected probability of G–G

substitutions (i.e.conservations),given the glycine-rich com-

position of that state (LLR ¼ 0).But the b-turn state,which is

also glycine-rich,has a much higher than expected rate of G–G

substitution (LLR ¼ 5).This reﬂects the energetic preference for

Gly in a b-turn.

Residues other than glycine are less likely to be found in states

30 and 102,but if they are found in these contexts,they are scored

very differently.State 102 (b-strand) favors a mismatch of almost

any sort,almost as much as an identity match.This means that

residues other than Gly mutate more often than expected in this

context.This is not surprising since the expected substitution prob-

ability would be very small for all non-Gly amino acids,and it

means that those rare occurrences of non-glycine residues in this

context occurred together in the training set.For state 30 (b-turn)

the situation is different.Glycine mutates more often than expected

to certain residues,C,D,P,Wand Y.In particular the preferred Gly

to Pro mutation makes sense for a b-turn.But other substitutions

in this turn state are either not observed (large negative LLRs) or

they occur more often than expected,for the same reasons as

mentioned before.

By modeling the structural contribution to evolutionary selective

pressure,HMMSUM should add biological meaning to the align-

ment of protein sequences.If so,an improvement should be seen

in the alignment accuracy.

3.2 Evaluation of alignment accuracy

To test whether the HMMSUM improves the alignments of

twilight-zone sequences,we compared the matrix performance of

the HMMSUMmodels with BLOSUMmatrices or structure-based

substitution matrices SDM/HSDM,which has been reported as the

best choice for aligning proteins with low percent identity (Prlic

´

et al.,2000).Three methods were used in our study to construct the

A

tu

matrix [Equations (11–13)] and alignments were performed

using local Dynamic Programming.

Gap opening/extension penalties are optimized for each matrix

before the evaluation of accuracy.An 8-fold jackknife cross-

validation was used for all methods tested when determining the

optimal penalties in order to avoid over-ﬁtting.Evaluations are

reported here using the alignment parameters that gave us the

maximum correct matches (Table 1).The average gap opening/

extension penalties after cross-validations for SDM (7.1/0.4) and

HSDM (19.5/1.0) are comparable to previous reports (6.6/0.6 and

19.0/0.8,respectively).

3.2.1 HMMSUM-M results Using optimal gap parameters

(15.0/0.9),the structure-independent model HMMSUM-M ﬁnds

6% more of the correct matches than BLOSUM40 (P ¼ 0.053),

the best performing BLOSUMmatrix in our evaluation.This result

cannot be attributed to the use of structure,since neither structure

assignment nor prediction was used,but it may be attributed to

using a different sequence weighting scheme in summing substitu-

tion frequencies.Our weighting scheme increased the contribution

of distant homologs to the substitution scores,and the results are

seen in the increased LLRs for changes relative to conservations.

However,the increase in the number of correct matches was

accompanied by an increase in the number of incorrect matches,

and the accuracy in alignments remained unimproved overall.The

Wilcoxon test indicated that the change of accuracy in individual

alignments may have resulted by chance when compared with

BLOSUM40 (P ¼ 0.887).

3.2.2 HMMSUM-L results As mentioned in section 2,two sets

of structure-dependent matrices,HMMSUM-D and HMMSUM-L,

were made based on two different assumptions about the expected

substitutions.HMMSUM-L,with optimized gap parameters

(11.6/1.4),found 5% more correct matches than BLOSUM40

(P ¼ 0.065).This was fewer than the structure-independent

HMMSUM-M model.We believe that the L-model,although

theoretically more sound,suffered from small counts problems

and therefore did not manifest its theoretical potential.

However,the small increase in correct matches was not

accompanied by an increase in incorrect matches completely,

and the overall accuracy in alignments improved slightly.This

improvement was found to be possibly the result of chance when

compared with BLOSUM40 (P ¼ 0.371).

3.2.3 HMMSUM-D results HMMSUM-D,a structure-

dependent model with structure-independent background frequen-

cies,performed much better than expected,ﬁnding 24% more cor-

rect matches than the BLOSUM40 matrix.Both the accuracy and

the coverage of correct matches were signiﬁcantly improved when

optimal gap parameters were used (21.0/0.4).Of the 147 alignments

61% were more accurate using this model,an unlikely chance

occurrence (P < 0.001).

When compared with SDM and HSDM,the state-of-the-art

matrices derived from structures,HMMSUM-D still showed a

16% and an 18% improvement in correct matches,respectively

(P < 0.001).About 54% out of the 147 alignments were more

accurate when using HMMSUM-D in both comparisons.The

Fig.3.Substitution matrices for glycine-rich states in HMMSTR.Each sub-

stitution matrix was generated using Equation (15).Substitution scores are

presented as LLRs in half-bit units.Here the figure shows the substitution

matrixof twostates inHMMSTR.Thelower half of the matrixis a b-turnstate

(state 30 in HMMSTR) and the upper half of the matrix is a b-strand state

(state 102 in HMMSTR).

Y.-M.Huang and C.Bystroff

418

alignment accuracy and coverage of HMMSUM-D were signiﬁc-

antly better than SDM (P ¼ 0.025 and P < 0.001,respectively).

3.2.4 Results using structure predictions and structure

assignments To investigate the effects of structural information

on difﬁcult alignment problems,several more HMMSUM models

were constructed.A structure-prediction dependent model,

HMMSUM-D

NS

,was summed from the training set using the

same method as that described from HMMSUM-D,except that

the calculation of g was performed without using the structural

information.The g-values in HMMSUM-D

NS

can be expressed

loosely as follows:

g ¼ PðI-sites motif j sequenceÞ:

Figure 4.Comparisonof alignments of 1MUCAand 4ENLproteins.Alignments are producedusingthe best matrixinBLOSUMand HMMSUM-Dserials from

tested methods [i.e.BLOSUM40 (

) and HMMSUM-D

NS

(‡)] and SDMmatrix (†).True alignments fromBAliBASE (#) are highlighted in gray color.Actual

secondary structures of 4ENL are indicated in cylinders as a-helices and arrows as b-strands.Two sequences have 17.1% identity.

Figure 5.Ribbon structures of remote homologs 1MUCA and 4ENL.Homologous secondary structure elements can clearly be identified despite only 17.1%

sequence identity.

HMMSUM:structure-based substitution matrices

419

The g weights are therefore structure predictions rather than

assignments.Surprisingly,the accuracy signiﬁcantly improved

over the results when structure information was used in

HMMSUM-D (P ¼0.002).When measured using correct matches,

alignment accuracy or coverage,this prediction-based model con-

sistently performed best among methods tested.Figure 4 illustrates

this result using the two structurally similar proteins shown in

Figure 5.

To further understand the effect of structural information on

the alignment accuracy,we used the true structural information

for both the training phase and the testing phase.Although a

pairwise sequence alignment would not be the method of choice

for aligning two known structures,this test was thought to simulate

a ‘best case scenario’ where the structure predictions were perfect

for both sequences.As expected,the alignment accuracy for

this model,indicated as HMMSUM-D

S

,was better than that of

HMMSUM-D,where structure was ignored in the testing phase,

but surprisingly it was not as good as HMMSUM-D

NS

,where the

structure was ignored in both phases (Table 1).

3.2.5 Results using three structural states To address the

importance of precision in structure predictions and assignments,

bare minimum structure information was adapted into the

HMMSUM model.A simple three-state (helix,strand and loop)

HMMSUM-D model (denoted HMMSUM-D

3

) was made and ana-

lyzed in an analogous fashion.The performance of HMMSUM-D

3

Figure 6.Alignments of 1R69 and 1NEQ using Dynamic Programming and BAA.(a) Alignments of two sequences were obtained by local Dynamic

Programming using BLOSUM40,SDM and different HMMSUM-D models with optimized parameters in Table 1.True alignments from BAliBASE are

highlighted in gray.Two sequences have percent identity of 12.7%.(b) A histogram of 1R69-1NEQ alignment was generated by BBA (Bayesian Adaptive

Alignment).The color of each position in histogramrepresents the probability of aligning two positions between two sequences.(red:high probability;blue,low

probability;white,zero probability).Black boxes represent the true alignment of two sequences from BAliBASE.

Y.-M.Huang and C.Bystroff

420

was comparable withother HMMSUM-Dmodels incorrect matches,

alignment accuracy or coverage (Table 1).The same improvement

was also found in another model,HMMSUM-D

3+NS

,which was

built in the same way as HMMSUM-D

NS

,using structure pre-

dictions rather than structure assignments,but using only three

structure states.However,both three-state models are better than

the best reported single-matrix model,SDM.

3.3 Evaluation of suboptimal alignments

By examining the details of alignments from HMMSUM models,

we found that some segments were aligned to the wrong regions

by shifting a few positions (Fig.6a).This suggested that the true

alignment was suboptimal but ‘nearby’ in parameter space,perhaps

accessible by a slight change in the substitution matrices.

Alignments that are off by a little are just as bad as alignments

that are off by a lot,according to our measure of accuracy for the

Dynamic Programming algorithm (Table 1) since this method can

deliver only the one optimal alignment,and arbitrarily chooses

one alignment if there are other equal-scoring possibilities.There

are several modiﬁcations to Dynamic Programming that consider

suboptimal alignments (Waterman and Eggert,1987;Huang et al.,

1990;Chao et al.,1994),however,in this study we chose to use

BAA,which models all alignments.

The BAA algorithm was also used to perform each of the 147

pairwise alignments using each substitution model,giving a pos-

terior probability distribution over all alignments for each case.To

evaluate the performance of HMMSUM-Dwhen compared to SDM

using BAA,the probabilities of reference alignments were com-

pared using P

true

[Equation (14)].Around 88%out of 147 reference

alignments have higher P

true

when using HMMSUM-D than with

SDM.Figure 6b shows the results of BAA for the sequences in

Figure 6a.

4 CONCLUSIONS

Amino acid substitution matrices that are local structure-speciﬁc

signiﬁcantly improve the accuracy of pairwise alignments when the

sequences are distant homologs,even if only a simple three-state

structural model is used.The improvement can be seen in the

accuracy and coverage of correct matches when compared with

curated structure-based alignments.

Substitution matrices based on structure predictions were found

to be signiﬁcantly more accurate (P < 0.001) than substitution

matrices based on structure assignments.This result runs counter-

intuitive since it suggests that the sequences alone contain more

information than the structure,even though we know that structural

homology extends beyond the reach of sequence-based alignment

methods.One possible explanation is that structure predictions con-

tain information about weak or multiple secondary structure pro-

pensities that is lost when a unique structure is assigned to each

position.Weak propensities may be conserved in some parts of the

structure in order to enforce the order of events in the folding

pathway.However,experimental studies have shown no conserva-

tion of propensity across proteins of the same family (Mun

˜

oz et al.,

1995).

An alternative explanation is that substitution data were sparse

when unique structure assignments were used but less sparse when

predictions were used,leading to less random error in substitution

scores when indexed to predicted structures,and more randomerror

in substitution scores when indexed to assigned structures.In some

cases the structure predictions were wrong,but perhaps the added

data more than compensated for the inclusion of some false data.

It was also considered that homologous proteins might tend to

have correlated wrong predictions,and that would improve align-

ments.The sparse data issue also partially explains why the more

ﬁnely divided HMMSUM-L model did not perform as well as

expected.Meanwhile,the three-state model HMMSUM-D

3

may

have suffered less fromsparse data,since using structure predictions

(HMMSUM-D

3+NS

) did not further improve alignments.

An improvement was also found in the scores of true suboptimal

alignments using the Bayesian Adaptive method.An additional

34% of 147 individual alignments were improved when using the

BAA algorithm in comparing HMMSUM-D and SDM.Accuracy

in ﬁnding true suboptimal alignments is important since optimal

alignments for twilight zone sequences are only 40–50% accurate.

ACKNOWLEDGEMENTS

We thank Bobbie-Jo Webb-Robertson for contributing to the

Bayes Aligner coding and for helpful conversations.This work

was supported by NSF award DBI-0343206 to C.B.

Conflict of Interest:none declared.

REFERENCES

Altschul,S.F.and Erickson,B.W.(1986) Optimal sequence alignment using afﬁne gap

costs.Bull.Math.Biol.,48,603–616.

Altschul,S.F.et al.(1990) Basic local alignment search tool.J.Mol.Evol.,215,

403–410.

Altschul,S.F.et al.(1997) Gapped BLAST and PSI-BLAST:a new generation of

protein database search programs.Nucleic Acids Res.,25,3389–3402.

Bahr,A.et al.(2001) BAliBASE (Benchmark Alignment dataBASE):enhancements

for repeats,transmembrane sequences and circular permutations.Nucleic Acids

Res.,29,323–326.

Bystroff,C.and Baker,D.(1998) Prediction of local structure in proteins using a library

of sequence-structure motifs.J.Mol.Biol.,281,565–577.

Bystroff,C.and Garde,S.(2003) Helix propensities of short peptides:molecular

dynamics versus bioinformatics.Proteins,50,552–562.

Bystroff,C.et al.(2000) HMMSTR:a hidden Markov model for local sequence-

structure correlations in proteins.J.Mol.Biol.,301,173–190.

Chao,K.M.et al.(1994) Recent developments in linear-space alignment methods:

a survey.J.Comput.Biol.,1,271–291.

Dayhoff,M.O.et al.(1983) Establishing homologies in protein sequences.Methods

Enzymol.,91,524–545.

Dayhoff,M.O.,Schwartz,R.M.and Orcutt,B.C.(1978) Amodel of evolutionary change

in proteins.In Dayhoff,M.(ed.),Atlas of Protein Sequence and structure.National

Biomedical Research Foundation,Silver Spring,Maryland,USA.Vol.5 (Suppl.3),

pp.345–352.

Doolittle,R.F.(1981) Similar amino acid sequences:chance or common ancestry?

Science,,214,149–159.

Henikoff,S.and Henikoff,J.G.(1992) Amino acid substitution matrices from protein

blocks.Proc.Natl Acad.Sci.USA,89,10915–10919.

Hobohm,U.and Sander,C.(1994) Enlarged representative set of protein structures.

Protein Sci.,3,522–524.

Hou,Y.et al.(2004) Remote homolog detection using local sequence-structure

correlations.Proteins,57,518–530.

Huang,X.Q.et al.(1990) A space-efﬁcient algorithm for local similarities.Comput.

Appl.Biosci.,6,373–381.

Huang,X.Q.and Miller,W.(1991) A time-efﬁcient,linear-space local similarity

algorithm.Adv.Appl.Math.,12,337–357.

Jaroszewski,L.et al.(2002) In search for more accurate alignments in the twilight zone.

Protein Sci.,11,1702–1713.

Lipman,D.J.et al.(1984) On the statistical signiﬁcance of nucleic acid similarities.

Nucleic Acids Res.,12,215–226.

HMMSUM:structure-based substitution matrices

421

Munoz,V.et al.(1995) The distribution of alpha-helix propensity along the polypeptide

chain is not conserved in proteins fromthe same family.Protein Sci.,4,1577–1586.

Northey,J.G.et al.(2002) Protein folding kinetics beyond the phi value:using multiple

amino acid substitutions to investigate the structure of the SH3 domain folding

transition state.J.Mol.Biol.,320,389–402.

Pearson,W.R.(1995) Comparison of methods for searching protein sequence databases.

Protein Sci.,4,1145–1160.

Pearson,W.R.(1996) Effective protein sequence comparison.Methods Enzymol.,266,

227–258.

Pearson,W.R.(1998) Empirical statistical estimates for sequence similarity searches.

J.Mol.Biol.,276,71–84.

Pearson,W.R.(2000) Flexible sequence similarity searching with the FASTA3 program

package.Methods Mol.Biol.,132,185–219.

Prlic

´

,A.et al.(2000) Structure-derived substitution matrices for alignment of distantly

related sequences.Protein Eng.,13,545–550.

Rabiner,L.R.(1989) A tutorial on hidden Markov models and selected applications in

speech recognition.Proc.IEEE,77,257–286.

Sasson,I.and Fischer,D.(2003) Modeling three-dimensional protein structures

for CASP5 using the 3D-SHOTGUN meta-predictors.Proteins,53 (Suppl.6),

389–394.

Smith,T.F.and Waterman,M.S.(1981) Identiﬁcation of common molecular sub-

sequences.J.Mol.Biol.,147,195–197.

Smith,T.F.et al.(1981) Comparative biosequence metrics.J.Mol.Evol.,18,

38–46.

Thompson,J.D.et al.(1999) BAliBASE:a benchmark alignment database for the

evaluation of multiple alignment programs.Bioinformatics,15,87–88.

Vingron,M.and Argos,P.(1989) A fast and sensitive multiple sequence alignment

algorithm.Comput.Appl.Biosci.,5,115–121.

Waterman,M.S.and Eggert,M.(1987) A new algorithm for best subsequence align-

ments with application to tRNA-rRNA comparisons.J.Mol.Biol.,197,723–728.

Wilcoxon,F.(1945) ‘Individual Comparisons by Ranking Methods’.Biometrics,1,

80–83.

Xu,J.et al.(2003) RAPTOR:optimal protein threading by linear programming.

J.Bioinform.Comput.Biol.,1,95–117.

Yi,Q.et al.(1998) Prediction and structural characterization of an independently

folding substructure in the src SH3 domain.J.Mol.Biol.,283,293–300.

Yona,G.and Levitt,M.(2002) Within the twilight zone:a sensitive proﬁle-proﬁle

comparison tool based on information theory.J.Mol.Biol.,315,1257–1275.

Zhu,J.et al.(1998) Bayesian adaptive sequence alignment algorithms.Bioinformatics,

14,25–39.

Y.-M.Huang and C.Bystroff

422

## Comments 0

Log in to post a comment