SCORE: predicting the core of protein models


BIOINFORMATICS Vol. 17 no. 6 2001, Pages 541–550
Charlotte M. Deane 1,*, Quentin Kaas 2 and Tom L. Blundell 1
1 Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA, UK and 2 ENSCM, 8 Rue de l'École Normale, 34000 Montpellier, France
Received on November 1, 2000; revised on February 19, 2001; accepted on February 23, 2001
ABSTRACT
Motivation: The prediction of the regions of homology models that can be restrained by or copied from the basis structures is a vital step in correct model generation, because these regions are the model's most accurate part. However, there is no ideal method for the identification of their limits. In most algorithms their length depends on the number of family members and definitions of secondary structure.
Results: The algorithm SCORE steps away from the conventional definitions of the core to identify, from large numbers of basis structures, those regions that can be considered structurally related to a target sequence. The use of φ,ψ constraints to accurately pinpoint the regions that are conserved across a family, and of environmentally constrained substitution tables to extend these regions, allows SCORE to rapidly (generally in under 1 s, an order of magnitude faster than methods such as MODELLER) identify and build the core of homology models from the alignments of the target sequence to the basis structures. The SCORE algorithm was used to build 114 model cores. In only two cases was the core size less than 50% of the structure and all the cores built had an RMSD of 3.7 Å or less to the target structure.
Availability: The algorithm is available upon request.
Contact: charlotte@cryst.bioc.cam.ac.uk
INTRODUCTION
Most functional restraints on evolutionary divergence operate at the level of tertiary structure, and therefore three-dimensional (3D) structures are more conserved in evolution than sequence (Bajaj and Blundell, 1984). Furthermore, most disruptive changes occur at discrete positions, and at loop regions on the surface of proteins, rather than within the solvent inaccessible hydrophobic core. These core regions are therefore structurally conserved between homologous proteins. The property was first quantified by Chothia and Lesk (1986), who compared the structural similarity of the cores for several pairs of homologous proteins and showed a dependence upon sequence identity.
* To whom correspondence should be addressed.
Therefore, to model a protein, the target sequence is aligned to structures of similar sequence (the basis structures), and the region that can be considered the core of the model is extracted from this alignment. From the alignment of the target sequence to the basis structures, the backbone fragments can be divided into two types: Structurally Conserved Regions (SCRs) and Structurally Variable Regions (SVRs) (Greer, 1980). SCRs are those regions that are structurally conserved in all members of a protein family; they are not necessarily limited to secondary structure, just as structural variability is not limited to loops.
The definition of SCR necessarily varies depending on the number of basis structures and sequence identity: families with many members tend to have short SCRs. This means that several of the commonly used comparative modelling algorithms which use SCRs to delineate the core (e.g. Bates and Sternberg, 1999; Peitsch, 1996; Sutcliffe et al., 1987a,b; Yang and Honig, 1999) are likely to underestimate the number of residues that can be carried through to the target structure.
Hilbert et al. (1993) examined in detail the behaviour of the SCRs in pairs of homologous proteins. The SCRs were defined as those regions which, after structural superposition, had Cα carbons within 3.8 Å. This distance was selected as it is the mean distance of adjacent Cα atoms in a trans-polypeptide chain, i.e. the shift of a Cα carbon by 3.8 Å completely reassigns its spatial position. They counted the number of residues contained within these SCRs and found that the fraction of residues in this common core dropped with decreasing sequence identity. Pairs whose identity within the core residues was greater than 50% had 90% or more of their residues in SCRs. But even if the sequence identity of the core was below 20%, the common cores still included 65% or more of the amino acids of the protein structures.
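The common-core measurement described above can be illustrated with a small sketch. It is not the procedure of Hilbert et al.: it assumes the two structures are already superposed and aligned, treats every aligned position independently rather than grouping positions into regions, and the function name `common_core_fraction` is hypothetical.

```python
import math


def common_core_fraction(ca_a, ca_b, cutoff=3.8):
    """Fraction of aligned positions whose C-alpha atoms lie within `cutoff` angstroms.

    ca_a and ca_b are equal-length lists of (x, y, z) C-alpha coordinates for
    aligned positions; the structures are assumed to be superposed already.
    """
    if len(ca_a) != len(ca_b):
        raise ValueError("coordinate lists must have the same length")
    if not ca_a:
        return 0.0
    within = sum(1 for p, q in zip(ca_a, ca_b) if math.dist(p, q) <= cutoff)
    return within / len(ca_a)


# Toy coordinates (made up for illustration): three of four positions are close.
structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (3.8, 0.0, 0.0), (7.6, 5.0, 0.0), (11.4, 0.0, 3.0)]
fraction = common_core_fraction(structure_a, structure_b)
```

With these toy coordinates the separations are 1, 0, 5 and 3 Å, so three of the four positions count towards the common core.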
Collar extension, extending the region copied from the basis structure beyond the designated SCR, has been shown to greatly improve protein structure prediction (Srinivasan and Blundell, 1993), because the most accurate parts of protein structure models are those that are copied from or restrained by the basis structures (Jones and Kleywegt, 1999; Martin et al., 1997). However, the procedure has never been automated, since this requires quantifying the relationship between the basis structure and target for the putative collar extension, by either local structure or sequence similarity. The latter remains problematical: even though it is commonly assumed in modelling that local sequence similarity implies local conformational similarity (e.g. Bystroff and Baker, 1998; Topham et al., 1993), there is still no satisfactory way to quantify this; even peptides as long as nine residues can adopt more than one distinct conformation (Cohen et al., 1993; Kabsch and Sander, 1984; Mezei, 1998; Sander and Schneider, 1991).
© Oxford University Press 2001
Here we describe a new method for generating the core, which establishes reliably and rapidly how far beyond the conventional SCRs it is possible to extend the fragments that are copied from the basis structures (i.e. the Basis Fragments (BFs)). This allows as much of the known structural information as possible to be included in the model, even if it lies outside a conventional SCR. The method is thoroughly benchmarked and compared to other modelling programs to give rules for its general applicability, its advantages and disadvantages, limits and pitfalls, dependence on the degree of sequence homology or identity, number of family members and class of protein.
METHOD
Overview of method
The method is designed to retain the maximum amount of information from the structural alignment. To achieve this it utilises the structural similarity of the basis structures to each other (in both Cartesian and φ,ψ space) and the similarity of each basis structure to the target sequence through environmentally constrained substitution tables (Deane and Blundell, 2001). These criteria are discussed below. A BF rather than an SCR is the basic unit of the cores built by SCORE. A BF is any part of any single basis structure that can confidently be said to be similar to the target structure, regardless of whether or not it lies within an SCR. The latter, by contrast, spans all aligned sequences, and is limited to the residues structurally similar in all sequences. Thus BFs can extend the information extracted from the alignment compared to the SCRs.
Definition of structural conservation
For structures that have been superposed and aligned along structural features (using COMPARER; Sali and Blundell, 1990, for instance), aligned residues are considered to occupy 'conserved' positions if their Cα–Cα separations (after superposition) and D_{φ,ψ} (difference in backbone torsion angles, equation (1)) both fall below defined cut-off values. An SCR is defined as a continuous stretch of three or more residues conserved in all aligned structures. (Note that the conventional SCR definition does not include the D_{φ,ψ} criterion.) For modelling purposes, all residues in the SCR must additionally have sequence similarity to the aligned position in the target sequence.
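$$D_{\phi,\psi} = \left[\tfrac{1}{2}\left(d\phi^{2} + d\psi^{2}\right)\right]^{1/2} \qquad (1)$$
where
$$d\phi = \begin{cases} |\phi^{i}_{k} - \phi^{j}_{k}| & \text{if } |\phi^{i}_{k} - \phi^{j}_{k}| < 180 \\ 360 - |\phi^{i}_{k} - \phi^{j}_{k}| & \text{if } |\phi^{i}_{k} - \phi^{j}_{k}| > 180 \end{cases}$$
and
$$d\psi = \begin{cases} |\psi^{i}_{k} - \psi^{j}_{k}| & \text{if } |\psi^{i}_{k} - \psi^{j}_{k}| < 180 \\ 360 - |\psi^{i}_{k} - \psi^{j}_{k}| & \text{if } |\psi^{i}_{k} - \psi^{j}_{k}| > 180. \end{cases}$$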
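As an illustration of equation (1), the following minimal sketch computes the torsion-angle difference for one aligned position in two structures. It assumes angles are given in degrees in the range −180 to 180; the function names are hypothetical and are not taken from the SCORE program.

```python
import math


def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b)
    return 360.0 - d if d > 180.0 else d


def d_phi_psi(phi_i, psi_i, phi_j, psi_j):
    """Equation (1): root mean square of the wrapped phi and psi differences."""
    dphi = angular_difference(phi_i, phi_j)
    dpsi = angular_difference(psi_i, psi_j)
    return math.sqrt(0.5 * (dphi ** 2 + dpsi ** 2))


# phi = -170 and phi = 170 are only 20 degrees apart once the wrap is applied.
example = d_phi_psi(-170.0, 170.0, 170.0, -170.0)
```

The wrap-around branch matters for residues near ±180°, where a naive subtraction would report a difference of 340° for two conformations that are in fact 20° apart.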
In SCORE the BFs are selected using both the structural
similarity across the basis structures and environmentally
constrained substitution tables.
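The SCR definition given above (a continuous stretch of three or more conserved positions) can be sketched as a simple run search. This is an illustrative sketch only: the helper names are hypothetical, the per-position inputs are assumed to have been computed already for every pair of aligned basis structures, and the default cut-offs are the values the Results section reports for SCORE (3.5 Å and 150).

```python
def is_conserved(ca_separations, torsion_differences, ca_cutoff=3.5, torsion_cutoff=150.0):
    """True if every pairwise value at this position falls below both cut-offs."""
    return (all(d < ca_cutoff for d in ca_separations)
            and all(t < torsion_cutoff for t in torsion_differences))


def find_scrs(conserved, min_length=3):
    """Return inclusive (start, end) index pairs for runs of conserved positions."""
    scrs = []
    start = None
    for i, flag in enumerate(conserved):
        if flag and start is None:
            start = i                      # a new run begins
        elif not flag and start is not None:
            if i - start >= min_length:    # run ended at i - 1
                scrs.append((start, i - 1))
            start = None
    if start is not None and len(conserved) - start >= min_length:
        scrs.append((start, len(conserved) - 1))
    return scrs


flags = [True, True, False, True, True, True, True, False, True, True, True]
regions = find_scrs(flags)
```

In this toy alignment the first run of two conserved positions is too short to be an SCR, so only the runs at positions 3 to 6 and 8 to 10 are reported.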
Definition of sequence similarity: environmentally constrained substitution tables
The raw environmentally constrained substitution tables were constructed by accumulating the substitutions observed in all the homologous pair-wise alignments from a high-resolution database. This database was extracted from HOMSTRAD (http://www-cryst.bioc.cam.ac.uk/~homstrad) (date 18/02/00) (Mizuguchi et al., 1998) and contained 320 homologous families with 859 proteins, all solved to a resolution better than 2.5 Å. In this case the environment of the replaced and substituted residues was taken into account, and six φ,ψ areas defined six environmentally constrained tables. The derivation of the environmentally constrained amino acid substitution tables is described in Deane and Blundell (2001). These tables are used to score BFs. The value of the environmentally constrained substitution table score for a BF is given by S_c(BF) (equation (2)). It is the sum of the values for each environmentally conserved substitution between the residues at position i in the alignment of the basis structure (B) and the target sequence (T), where i runs from residue m (start of the BF) to residue n (end of the BF).
$$S_{c}(\mathrm{BF}) = \sum_{i=m}^{n} S_{c}\big(B(i), T(i)\big). \qquad (2)$$
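Equation (2) is a plain sum over the aligned positions of the fragment. The sketch below shows the arithmetic with a single toy substitution table; the table values are invented for illustration, and the real method uses six environment-specific tables selected by the φ,ψ area of each residue, which this sketch does not model.

```python
def fragment_score(basis_seq, target_seq, table, m, n):
    """Equation (2): sum the substitution scores from position m to n inclusive."""
    return sum(table[(basis_seq[i], target_seq[i])] for i in range(m, n + 1))


# Hypothetical scores keyed by (basis residue, target residue).
toy_table = {
    ("A", "A"): 4,
    ("A", "G"): 0,
    ("L", "I"): 2,
    ("G", "G"): 6,
}

basis = "ALGA"
target = "AIGG"
whole = fragment_score(basis, target, toy_table, 0, 3)   # 4 + 2 + 6 + 0
middle = fragment_score(basis, target, toy_table, 1, 2)  # 2 + 6
```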
Basis fragment generation
Several SCRs are designated as described above and each is taken in turn. The SCR from one of the basis structures is chosen as a representative of the region. This is the SCR from the structure with the highest S_c(SCR) (equation (2)). This initial SCR is designated as a BF. The SCRs from all the basis structures (not just that chosen as the representative BF) are then extended in all possible combinations at both the N and C termini until an insertion or deletion with respect to the target sequence occurs. All BFs (extensions of the SCRs) that have an S_c(BF) greater than or equal to the S_c(SCR) of the original SCR in that basis structure are collected. No duplicates are allowed: if two (or more) BFs from two (or more) basis structures are 'identical' (structurally conserved) under the criteria described above, the BF with the higher S_c(BF) is selected.
If there are only one or two basis structures an additional step is added to BF generation. In this case SCORE also reduces the SCR size by removal of residues at the N and C termini of the SCR fragment, down to any fragment containing only three residues. The same rules for acceptance still apply: these shorter fragments must have an S_c(BF) higher than or equal to that of the SCR from their basis structure.
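The extension step for a single basis structure can be sketched as follows. This is a simplified illustration under stated assumptions rather than the SCORE implementation: `position_scores` is a hypothetical list holding the substitution score at each alignment position, with `None` marking an insertion or deletion relative to the target, and neither duplicate removal across basis structures nor the SCR-shrinking step for one or two basis structures is modelled.

```python
def extend_scr(position_scores, scr):
    """Enumerate extensions of an SCR whose score is at least the SCR score.

    scr is an inclusive (start, end) pair. Returns (start, end, score) tuples.
    """
    start, end = scr
    # Walk outwards until an indel (None) or the end of the alignment is reached.
    left = start
    while left > 0 and position_scores[left - 1] is not None:
        left -= 1
    right = end
    last = len(position_scores) - 1
    while right < last and position_scores[right + 1] is not None:
        right += 1

    scr_score = sum(position_scores[start:end + 1])
    fragments = []
    for s in range(left, start + 1):          # every N-terminal extension
        for e in range(end, right + 1):       # combined with every C-terminal one
            score = sum(position_scores[s:e + 1])
            if score >= scr_score:
                fragments.append((s, e, score))
    return fragments


scores = [None, 2, -1, 3, 4, 5, -2, 1, None, 6]
candidates = extend_scr(scores, (3, 5))
```

In this toy case the SCR at positions 3 to 5 scores 12. Extending by one residue at either end lowers the score and is rejected, whereas extending two residues towards the N terminus picks up a favourable substitution and is kept.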
Core generation
Thus a long list of overlapping BFs is produced. To create the core all possible combinations of non-overlapping BFs are generated and the core with the highest S_c(core), which is the sum of S_c(BF) for all the BFs in the core, is selected. If two cores have an identical score the core containing a greater number of residues will be selected. The SCORE program uses two other parameters, MinScore and MinSize, which relate to the minimum S_c(core) and minimum length of the core to be built respectively.
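The selection rule above (highest summed score, ties broken by residue count) can be sketched without enumerating every combination. Note that the sketch swaps in a different technique from the one the text describes: it uses the standard weighted interval scheduling dynamic programme, which reaches the same optimum as exhaustive enumeration for this objective. The function name is hypothetical and the MinScore and MinSize parameters are not modelled.

```python
from bisect import bisect_left


def best_core(fragments):
    """Pick non-overlapping (start, end, score) fragments maximising total score.

    Ties on score are broken in favour of the core covering more residues.
    Returns (total_score, total_residues, chosen_fragments).
    """
    frags = sorted(fragments, key=lambda f: f[1])   # order by end position
    ends = [f[1] for f in frags]
    # best[k] is the optimum using only the first k fragments in sorted order.
    best = [(0, 0, [])]
    for k, (start, end, score) in enumerate(frags):
        # Count earlier fragments that end strictly before this one starts.
        p = bisect_left(ends, start, 0, k)
        prev_score, prev_residues, prev_chosen = best[p]
        take = (prev_score + score,
                prev_residues + end - start + 1,
                prev_chosen + [(start, end, score)])
        skip = best[k]
        best.append(take if take[:2] > skip[:2] else skip)
    return best[-1]


overlapping = [(0, 4, 10), (3, 8, 12), (6, 9, 5), (10, 12, 3)]
core = best_core(overlapping)
```

Here the single highest-scoring fragment (3 to 8, score 12) is not part of the best core, because the two fragments it overlaps together score 15.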
Relative spatial positioning of the selected basis fragments
There are two choices for BF positioning in the algorithm. The first is to fit the BFs to a single basis structure, selected as that with the highest overall S_c(basis structure), denoted S_c(t). The fitting is performed on all residues of that basis structure (t) that are aligned to the core. The second is to fit the BFs to a weighted averaged Cα network of the SCRs (equation (3)). In the fit the Cα positions of the BFs are superposed on these averaged Cα positions or on the Cα positions of the basis structure t (Kearsley, 1989a,b). NS_c(a) is the normalised score for each basis structure a; t indicates the basis structure with the highest S_c of the q basis structures.
$$NS_{c}(a) = \frac{e^{S_{c}(a)/S_{c}(t)}}{\sum_{j=0}^{q} e^{S_{c}(j)/S_{c}(t)}}. \qquad (3)$$
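A minimal sketch of equation (3) follows. It assumes the scores of all basis structures are supplied as a list and that the highest score is positive (a zero or negative maximum would need separate handling, which the text does not discuss); the function name is hypothetical.

```python
import math


def normalised_scores(scores):
    """Equation (3): exponentiate each score relative to the best, then normalise."""
    top = max(scores)                      # S_c(t), the highest-scoring basis structure
    if top <= 0:
        raise ValueError("this sketch assumes a positive maximum score")
    weights = [math.exp(s / top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


equal = normalised_scores([300, 300])
uneven = normalised_scores([100, 50])
```

Because each score is divided by the maximum before exponentiation, the exponent of the best structure is always 1, so the weights depend on score ratios rather than on absolute values.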
The averaged Cα framework is developed by taking each residue Cα position i within the SCRs and multiplying the Cα co-ordinates (l representing x, y and z) from each basis structure j by NS_c(j) to calculate AvC^i_l (equation (4)), the Cα co-ordinate in the averaged framework for position i in the SCR and co-ordinate l (x, y or z).
$$\mathrm{AvC}^{i}_{l} = \sum_{j=0}^{q} NS_{c}(j) \times C\alpha^{i}_{l}(j). \qquad (4)$$
The SCORE results discussed in this paper relate, unless otherwise specified, to fitting of the BFs to this averaged Cα framework.
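Equation (4) is a weighted average of Cα co-ordinates, which the following sketch illustrates. It assumes the basis structures are already superposed in a common frame and that the normalised weights have been computed beforehand and sum to one; the data layout (`ca_coords[j][i]` for structure j, position i) and the function name are hypothetical.

```python
def averaged_framework(ca_coords, weights):
    """Equation (4): weighted average C-alpha position for each SCR position.

    ca_coords[j][i] is the (x, y, z) co-ordinate of position i in structure j;
    weights[j] is the normalised score of structure j.
    """
    n_positions = len(ca_coords[0])
    framework = []
    for i in range(n_positions):
        framework.append(tuple(
            sum(w * structure[i][axis] for w, structure in zip(weights, ca_coords))
            for axis in range(3)           # x, y and z in turn
        ))
    return framework


# Two toy structures with two SCR positions each (made-up co-ordinates).
coords = [
    [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
    [(2.0, 4.0, 6.0), (4.0, 4.0, 0.0)],
]
framework = averaged_framework(coords, [0.25, 0.75])
```

The higher-weighted second structure pulls each averaged position three quarters of the way towards its own co-ordinates.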
Selection of alignments for testing
A subset of alignments was selected from the HOMSTRAD database (July 2000). Structures with a resolution better than 2.5 Å solved by X-ray crystallography were selected from the database. Low-resolution structures and/or structures solved by NMR were omitted following the results of Guex and Peitsch (1997), Harrison et al. (1997) and Hilbert et al. (1993). This led to a list of 185 family alignments. Ten of these were selected for parameterisation and 14 (Table 1) for initial testing. These smaller sets were selected to represent diversity in number of family members, major secondary structural components (α, β, αβ, etc.), sequence similarity to the target and number of residues in the alignment. These alignments were used to generate models using SCORE, COMPOSER (Sutcliffe et al., 1987a) and MODELLER (Sali and Blundell, 1993).
RESULTS
Setting program parameters
Several different Cα separation and D_{φ,ψ} cut-offs were tested using the initial 10 alignments. These properties are designed to identify segments of structure that can be considered 'identical' between two or more of the basis structures. Setting the cut-offs too low will result in very small SCRs, with the possibility that entire regions of SCR will be overlooked. Very small initial SCRs can also lead to the generation of a very large number of possible BFs, which would increase the task involved in generating all possible cores, slowing down the algorithm. Conversely, if the Cα separation and D_{φ,ψ} cut-offs are too high, positions in the basis structures are considered 'identical' where they are actually showing significant variation. Thus, a balance was struck so that the program could give good predictions in a reasonable time frame. SCORE operates with a Cα–Cα separation cut-off of 3.5 Å and a D_{φ,ψ} cut-off of 150°. The D_{φ,ψ} cut-off is high given that the distribution of differences between 'random' torsion angles would be expected to peak around 90°. Thus, it was tested to see how often this cut-off caused a difference in the SCR definition. In the 14 test cases, five are affected by this cut-off; in each case one or two of the SCRs are shortened by one or two residues. The effect of the cut-off is not large, but it prevents residues with significantly different φ,ψ values being labelled as structurally conserved.

Table 1. The 14 families used for evaluating SCORE
Family† | Members | Class | Target (number of residues) | Percentage identity (maximum / minimum / average) | S_c (maximum / minimum / average)
Adh | 1cod_A, 2ohx_A | Multi domain | 1d1t_A (374) | 70.8 / 52.5 / 61.7 | 1987 / 1423 / 1705
Hip | 1hpi, 2hip_A, 1cku_A | Small | 1isu_A (62) | 25.8 / 17.7 / 22.0 | 108 / 87 / 95
Cytprime | 1a7v_A, 2ccy_A, 1bbh_A | All α | 1jaf_A (128) | 32.8 / 26.6 / 30.3 | 241 / 187 / 213
Ets | 1pue_E, 2stw_A, 1fli_A | All α | 1bc8_C (93) | 52.7 / 37.5 / 47.3 | 314 / 175 / 265
Ltb | 1lt5_D, 3chb_D | All β | 1tii_D (98) | 14.3 / 12.2 / 13.3 | 33 / 31 / 32
Ngf | 1bet, 1bnd_A, 1bnd_B | Small disulphide | 1b98_M (116) | 56.0 / 52.3 / 54.6 | 537 / 423 / 488
Cpa | 1nsa, 2ctc, 1pca | αβ | 1orb (323) | 30.3 / 29.7 / 30.1 | 448 / 421 / 437
Lamb | 2mpr_A, 1af6_A | Membrane bound α | 1a0t_P (413) | 22.8 / 22.3 / 22.6 | 219 / 203 / 211
Rub | 4rxn, 1rdg, 1rb9 | Small | 6rxn (52) | 64.4 / 44.4 / 60.0 | 226 / 204 / 216
Ferritin | 2fha, 1aew, 1rci | All α | 1bg7 (173) | 68.0 / 58.2 / 63.9 | 746 / 616 / 700
Fer2 | 1fxa_A, 4fxc, 1fxi_A, 1a70, 1pfd, 1frr_A, 1awd, 1doy, 2cjo, 1frd, 1ayf_A, 1pdx_A | Small | 1b9r_A (105) | 17.7 / 12.4 / 15.46 | 248 / −4 / 65
Gpr | 1f3g, 1gpr | All β | 2gpr (154) | 42.2 / 38.0 / 40.1 | 298 / 297 / 298
Fkbp | 1fkb, 1pbk | α+β | 1yat (113) | 57.0 / 41.6 / 49.3 | 393 / 291 / 342
Ricin | 1mrj, 1apa, 1abr_A, 1qci_A | α+β | 1mrg (246) | 63.8 / 28.0 / 38.4 |
The family name and class are those designated in the HOMSTRAD database. The family members and target are listed by their PDB codes; the chain is identified after an underscore if necessary. The maximum, minimum and average percentage identity and S_c of the family members to the target are given in the last six columns.
† Full HOMSTRAD family names: Adh = alcohol dehydrogenase, Hip = high potential iron–sulfur protein, Cytprime = cytochrome c′, Ets = ETS domain, Ltb = heat-labile enterotoxin/cholera toxin, B subunit, Ngf = nerve growth factor, Cpa = zinc carboxypeptidases, Lamb = maltoporin (LamB protein), Rub = rubredoxin, Ferritin = ferritin, Fer2 = ferredoxin (2Fe-2S), Gpr = glucose permease, Fkbp = FKBP-type peptidyl-prolyl cis-trans isomerase, Ricin = ribosome inactivating protein.

[Figure 1: (a) MinSize plotted against alignment length; (b) MinScore plotted against Maximum S_c(basis).]
Fig. 1. The correlation of MinSize with alignment length (a) and MinScore with Maximum S_c(basis) (b) for the 14 initial test families (Table 1) (circles) and for the other 100 HOMSTRAD families (squares).
Next, values of MinSize and MinScore were identified (on the set given in Table 1) that allowed rapid operation of the program without artificially lengthening the core. The optimal values were calculated using iterative testing. In this testing, the threshold of MinScore was first calculated (the highest value where a core was predicted) and then the threshold value of MinSize was selected such that it did not overextend the predicted core but allowed rapid prediction. The MinScore and MinSize parameters were then tested to see if a correlation could be found with any of the available information about the alignments (S_c(basis), alignment length). Figure 1 shows the only significant correlations: MinScore with the maximum S_c(basis) value of a sequence in the alignment to the target, and MinSize with alignment length. Unfortunately neither correlation is strong enough to be sure that the algorithm can automatically select an optimal value for the parameters. SCORE therefore automatically sets the MinSize and MinScore parameters within the bands found in Figure 1. If no answer is found, or if more than 5000 BFs are created by the algorithm (rendering the algorithm very slow), it then suggests the band within which the MinSize and MinScore parameters should be set and in which direction the user should move either one of the parameters in order to achieve a result within a reasonable time frame. SCORE is a rapid algorithm; usually it takes less than 1 s to give a prediction.
Comparison to copying from a single basis structure
In all 14 example cases there is more than one basis structure (Table 1). The performance of SCORE was therefore compared to copying the identical core from any one of the basis structures (Table 2). The core region predicted by SCORE is compared to the structural alignment, and if any single structure is fully aligned along the whole length of the SCORE core then the RMSD of the core copied from that basis structure to the real structure was calculated. In general, SCORE performs better than selection from a single basis structure (9 of 14 cases) by the selection of locally more similar conformations, and often a core comparable to that generated by SCORE is not available from some or all of the basis structures in the alignment (Table 2), for example in the case of the cytprime† family.
Comparison to COMPOSER: conventional SCRs
The accuracy of SCORE was then compared to the SCRs constructed by the version of COMPOSER (Sutcliffe et al., 1987a,b) found in SYBYL 6.6 (Tripos UK Ltd). COMPOSER generates SCRs in the 'conventional' manner, which is still commonly used by many modelling groups, using only Cα–Cα separation and structurally aligned positions (e.g. Bates et al., 1997; Bates and Sternberg, 1999; Burke et al., 1999; Dunbrack, 1999; Guex et al., 1999; Guex and Peitsch, 1997; Harrison et al., 1995, 1997; Peitsch, 1996). COMPOSER was run on the 14 test examples using their respective structure-based alignments extracted from HOMSTRAD. The comparison between SCORE and COMPOSER centres on two issues: the size of core that is predicted by the two programs and the RMSD of these cores to the real structure. This is because a program may be predicting a significantly larger core, but in doing so may extend the region that is copied beyond that which is truly 'identical' between the basis structure and the target, or give a significantly lower RMSD but predict a far smaller percentage of the structure. Table 3 shows that in general the cores are of a similar size, but they are not selecting identical residues.
† The family name given here is that found in the HOMSTRAD database and in Table 1.
Table 2. Comparing SCORE to copying from the basis structures
Family | Members: PDB code_chain (RMSD of core, Å) | SCORE RMSD (Å) | Target (% in core)
*Adh | 1cod_A (NR), 2ohx_A (0.62) | 0.62 | 1d1t_A (99)
*Hip | 1hpi (1.95), 2hip_A (2.09), 1cku_A (1.71) | 1.65 | 1isu_A (76)
*Cytprime | 1a7v_A (NR), 2ccy_A (NR), 1bbh_A (NR) | 1.44 | 1jaf_A (82)
*Ets | 1pue_E (NR), 2stw_A (2.20), 1fli_A (2.15) | 2.16 | 1bc8_C (74)
Ltb | 1lt5_D (1.68), 3chb_D (1.75) | 1.71 | 1tii_D (57)
*Ngf | 1bet (NR), 1bnd_A (1.45), 1bnd_B (1.35) | 1.29 | 1b98_M (82)
Cpa | 1nsa (1.02), 2ctc (1.17), 1pca (1.11) | 1.22 | 1orb (79)
*Lamb | 2mpr_A (2.00), 1af6_A (1.99) | 1.99 | 1a0t_P (66)
Rub | 4rxn (0.66), 1rdg (0.73), 1rb9 (0.58) | 0.70 | 6rxn (83)
*Ferritin | 2fha (0.40), 1aew (NR), 1rci (NR) | 0.40 | 1bg7 (87)
*Fer2 | 1fxa_A (NR), 4fxc (NR), 1fxi_A (NR), 1a70 (NR), 1pfd (NR), 1frr_A (NR), 1awd (NR), 1doy (NR), 2cjo (NR), 1frd (NR), 1ayf_A (NR), 1pdx_A (3.03) | 2.78 | 1b9r_A (62)
Gpr | 1f3g (NR), 1gpr (1.57) | 1.77 | 2gpr (94)
*Fkbp | 1fkb (0.81), 1pbk (0.79) | 0.80 | 1yat (91)
Ricin | 1mrj (0.68), 1apa (NR), 1abr_A (NR), 1qci_A (NR) | 0.74 | 1mrg (86)
The asterisk indicates the cases where the SCORE core is better than or identical to the best core built from only one of the basis structures. NR (No Result) indicates that it was not possible to build a core equivalent to the SCORE core from that basis structure. The RMSD is calculated on Cα atoms only.
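For reference, the Cα-only RMSD used in these comparisons can be computed as in the sketch below. It assumes the model and target co-ordinates are already superposed and listed in matching residue order; the superposition step itself is not shown, and the function name is hypothetical.

```python
import math


def ca_rmsd(model_ca, target_ca):
    """Root mean square deviation between paired C-alpha co-ordinates."""
    if len(model_ca) != len(target_ca) or not model_ca:
        raise ValueError("need two non-empty co-ordinate lists of equal length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(model_ca, target_ca)   # paired atoms
        for a, b in zip(p, q)                  # x, y and z components
    )
    return math.sqrt(squared / len(model_ca))


model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 2.0), (1.0, 0.0, 0.0)]
rmsd = ca_rmsd(model, target)
```

One atom displaced by 2 Å and one perfectly placed atom give a squared deviation of 4 over two atoms, so the RMSD is the square root of 2.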
Table 3. Comparison of SCORE to COMPOSER and MODELLER
Family | SCORE: % residues predicted | SCORE: RMSD (Å) | SCORE: RMSD(f) (Å) | COMPOSER: % residues predicted | COMPOSER: RMSD (Å) | MODELLER: RMSD (Å) over SCORE core
Adh# | 99 | $ 0.62 | 0.62 | 96 | 0.62 | 0.62
Hip | 76 | $ 1.65 | 1.73 | 87 | 3.82 | 1.84
Cytprime# | 82 | $ 1.44 | 1.24 | 81 | 2.68 | 1.56
Ets# | 74 | 2.16 | 2.23 | 44 | 1.15 | 1.84
Ltb | 57 | 1.71 | 1.75 | 96 | 3.82 | 1.52
Ngf | 82 | $ 1.29 | 1.41 | 86 | 1.42 | 1.36
Cpa | 79 | $ 1.22 | 1.22 | 91 | 4.85 | 1.23
Lamb | 66 | $ 1.99 | 2.00 | 92 | 14.48 | 2.49
Rub# | 83 | $ 0.70 | 0.69 | – | – | 0.85
Ferritin | 87 | $ 0.40 | 0.40 | 95 | 0.36 | 0.48
Fer2# | 62 | $ 2.78 | 3.03 | 21 | 7.26 | 3.03
Gpr# | 94 | 1.77 | 1.63 | 90 | 1.64 | 1.49
Fkbp | 91 | 0.80 | 0.81 | 94 | 0.84 | 0.70
Ricin# | 86 | $ 0.74 | 0.68 | 79 | 1.19 | 0.83
The two RMSD columns for SCORE relate to the two different fitting procedures: RMSD is when the core is fitted to the average Cα network and RMSD(f) when the core is fitted to a single basis structure. The hash (#) indicates the cores that are longer when built by SCORE than COMPOSER. The dollar ($) indicates where the SCORE RMSD values are lower than those achieved by MODELLER. The RMSD is calculated on Cα atoms only.
This explains the lower RMSDs in general for the SCORE cores (Table 3).
Only two of the cores built by SCORE have a slightly inferior RMSD to those constructed by COMPOSER, and both the SCORE cores are significantly longer: 69 to 41 residues in the ets family and 145 to 135 residues in the gpr family. The worst core built by SCORE has an RMSD of 2.78 Å (the fer2 family, Table 3). Five of the COMPOSER cores have an RMSD greater than 3.8 Å. In the case of the lamb family the COMPOSER core has an RMSD of 14.5 Å, which is far greater than would be expected when copying elements from such similar basis structures. The problem is caused by COMPOSER copying long loop regions from the basis structures which have diverged significantly from the target, whereas SCORE is able to identify the limits of the 'identical' regions far more precisely. In the case of the ricin family the SCORE core has both a lower RMSD and a larger number of residues. Here the opposite effect is observed: COMPOSER creates shorter SCRs, finishing where all the structures are 'identical' to one another, but SCORE performs 'collar extension' along a basis structure that is similar to the target.

[Figure 2: (a) RMSD plotted against Maximum S_c(basis); (b) residues in core (%) plotted against Maximum S_c(basis).]
Fig. 2. The relationship between maximum S_c(basis) with RMSD of core (a) and size of core (b).
Comparison to MODELLER
MODELLER (Sali and Blundell, 1993) is one of the most commonly used modelling packages available. A single model structure for the target, using the basis structures and the alignment file from HOMSTRAD, was constructed using MODELLER. To demonstrate that the methodology being developed here can compete with current programs, the core regions from the MODELLER predictions were compared to those from SCORE in terms of their Cα fit to the real structure. Once again the SCORE core was cut from a structure, this time from the MODELLER model. This comparison is not entirely fair, as MODELLER is designed to build the entire structure and the core region for comparison is that selected by SCORE. However, if SCORE is a reliable modelling program it should in general perform at least as well as MODELLER for building the core region of the protein. Of the 14 examples SCORE builds a lower RMSD core in 10 cases (Table 3). However, all the values here are close (within 0.3 Å), indicating a similarity of predictive ability.
The effect of basis structure choice
S_c is not directly correlated with percentage identity, so the algorithm was further tested to see whether percentage identity or S_c(basis) is a better guide to overall structural similarity in the context of the SCORE algorithm. Percentage identity is known not to be a powerful indicator of local structural similarity. However, of the six cases where the basis structure with the highest percentage identity to the target was different from that with the highest S_c(basis) to the target (Table 3), three built lower RMSD cores with the highest S_c(basis) and three with the highest percentage identity basis structure (the core sizes being in general similar). There is, therefore, no clear evidence as to which measure is a better guide to global similarity in these highly identical homologous families.

[Figure 3: (a) number of BFs plotted against length of BF; (b) number of gaps plotted against gap length; (c) number of terminal gaps plotted against length of N or C terminal gap.]
Fig. 3. The length variation of BFs (a), gaps (b) and N or C termini missing fragments (c) of SCORE cores.
Evaluation on a large dataset
COMPOSER is not a fully automated system and MODELLER has a run time three orders of magnitude longer than SCORE, so evaluation on a larger dataset was carried out for SCORE alone. It was run on 100 other families extracted from the HOMSTRAD database, leading to a total of 114 families. In only two cases was the core size less than 50% of the structure and all the cores built had an RMSD of 3.7 Å or less to the target structure. The MinSize and MinScore parameters set using the original 14 examples were compared to the values calculated for this larger dataset and all but a tiny proportion fall within the pre-ordained bands from the smaller test set (Figure 1).
The relationship between the average, maximum and minimum S_c(basis) or percentage identity of basis structures to target sequence with the size of core and RMSD of core was examined. The clear correlations that were observed by Hilbert et al. (1993) on pairs of homologous structures were not as obvious. There is a general increase in core size with greater maximum S_c or percentage identity up to a point where the graphs level out (greater than 50% identity or 550 S_c respectively). RMSD of the core appears to correlate better with S_c(basis) values (maximum, minimum or average) than with percentage identity. The shape of the graph is as expected in that increasing S_c(basis) decreases the RMSD of the core (Figure 2).
The larger dataset was also used to assess whether fitting to a single basis structure or to an averaged framework should be used. Overall the performance was found to be slightly better if the core was fitted to an averaged framework, but there are many cases where fitting to a single basis structure improved results. Unfortunately there is no clear indication as to what percentage identities or S_c(basis) values correlate to a more reliable result when fitting to a single basis structure rather than an averaged framework.
Counting the basis fragments and the gaps in the cores
The target structures were divided into three types of fragments:
(1) those BFs predicted by SCORE;
(2) the N and C termini fragments not predicted by SCORE;
(3) all other fragments that are not predicted by SCORE.
The number of fragments of each length of the three types is shown in Figure 3. This was done so that the average length of a BF or gap could be observed, i.e. was SCORE predicting only short fragments, all separated by one-residue gaps, or was it predicting large continuous pieces of structure separated by small gaps. As can be seen from Figure 3, the median length of a BF is nine residues and the average length is between 28 and 29 residues.
The gaps between the BFs are of interest, for if SCORE is to form the basis of a comparative modelling program these gaps will have to be predicted using SVR modelling software such as that of Deane and Blundell (2000). These methods work best for short gaps of eight residues or less. Thus it is desirable that the majority of gaps fall into this category. As Figure 3 shows, the majority of gaps are in fact only one residue in length. Over 85% are eight residues or less. In the case of the N and C termini gaps nearly 70% are less than three residues.
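The three-way division of a target into BFs, terminal gaps and internal gaps can be sketched from a per-residue mask. This is an illustrative simplification: `in_core` is a hypothetical list of booleans marking residues covered by the core, so two BFs that abut with no gap between them would be counted as one run, which may differ from how the statistics in Figure 3 were gathered.

```python
from statistics import median


def split_fragments(in_core):
    """Return (bf_lengths, terminal_gap_lengths, internal_gap_lengths)."""
    runs = []                                   # [is_core, length] for each run
    for flag in in_core:
        if runs and runs[-1][0] == flag:
            runs[-1][1] += 1
        else:
            runs.append([flag, 1])

    bf_lengths, terminal_gaps, internal_gaps = [], [], []
    for index, (flag, length) in enumerate(runs):
        if flag:
            bf_lengths.append(length)
        elif index == 0 or index == len(runs) - 1:
            terminal_gaps.append(length)        # uncovered N or C terminus
        else:
            internal_gaps.append(length)        # gap between two BFs
    return bf_lengths, terminal_gaps, internal_gaps


mask = [False, False, True, True, True, False,
        True, True, True, True, False, False, False]
bfs, termini, gaps = split_fragments(mask)
median_bf = median(bfs)
```

Reporting both the median and the mean, as the text does, is useful here because a few very long fragments can pull the mean far above the typical fragment length.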
CONCLUSION
A new approach to prediction of the core of protein models is suggested that steps away from the concept of SCRs to the more fluid definition of BFs. The algorithm is fully automatic, rapid in operation and compares well with other comparative modelling programs such as COMPOSER and MODELLER. In the case of protein families the operation of the program appears to fulfil both criteria: prediction of a reasonable percentage of the structure, coupled with low RMSD values.
REFERENCES
Bajaj, M. and Blundell, T. (1984) Evolution and the tertiary structure of proteins. Ann. Rev. Biophys. Bioeng., 13, 453–492.
Bates, P.A. and Sternberg, M.J. (1999) Model building by comparison at CASP3: using expert knowledge and computer automation. Proteins, 37, 47–54.
Bates, P.A., Jackson, R.M. and Sternberg, M.J. (1997) Model building by comparison: a combination of expert knowledge and computer automation. Proteins, (Suppl. 1), 59–67.
Burke, D.F., Deane, C.M., Nagarajaram, H.A., Campillo, N., Martin-Martinez, M., Mendes, J., Molina, F., Perry, J., Reddy, B.V., Soares, C.M., Steward, R.E., Williams, M., Carrondo, M.A., Blundell, T.L. and Mizuguchi, K. (1999) An iterative structure-assisted approach to sequence alignment and comparative modeling. Proteins, (Suppl. 3), 55–60.
Bystroff, C. and Baker, D. (1998) Prediction of local structure in proteins using a library of sequence-structure motifs. J. Mol. Biol., 281, 565–577.
Chothia, C. and Lesk, A.M. (1986) The relation between the divergence of sequence and structure in proteins. EMBO J., 5, 823–826.
Claessens, M., Van Cutsem, E., Lasters, I. and Wodak, S. (1989) Modelling the polypeptide backbone with 'spare parts' from known protein structures. Protein Eng., 2, 335–345.
Cohen, B.I., Presnell, S.R. and Cohen, F.E. (1993) Origins of structural diversity within sequentially identical hexapeptides. Protein Sci., 2, 2134–2145.
Deane, C.M. and Blundell, T.L. (2000) A novel exhaustive search algorithm for predicting the conformation of polypeptide segments in proteins. Proteins, 40, 135–144.
Deane, C.M. and Blundell, T.L. (2001) CODA: A combined algorithm for predicting the structurally variable regions of protein models. Protein Sci., 10, 599–612.
Dunbrack, Jr, R.L. (1999) Comparative modeling of CASP3 targets using PSI-BLAST and SCWRL. Proteins, (Suppl. 3), 81–87.
Greer, J. (1980) Model for haptoglobin heavy chain based upon structural homology. Proc. Natl Acad. Sci. USA, 77, 3393–3397.
Guex, N. and Peitsch, M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis, 18, 2714–2723.
Guex, N., Diemand, A. and Peitsch, M.C. (1999) Protein modelling for all. Trends Biochem. Sci., 24, 364–367.
Harrison, R.W., Chatterjee, D. and Weber, I.T. (1995) Analysis of six protein structures predicted by comparative modeling techniques. Proteins, 23, 463–471.
Harrison, R.W., Reed, C.C. and Weber, I.T. (1997) Analysis of comparative modeling predictions for CASP2 targets 1, 3, 9, and 17. Proteins, (Suppl. 1), 68–73.
Hilbert, M., Bohm, G. and Jaenicke, R. (1993) Structural relationships of homologous proteins as a fundamental principle in homology modeling. Proteins, 17, 138–151.
Jones, T.A. and Thirup, S. (1986) Using known substructures in protein model building and crystallography. EMBO J., 5, 819–822.
Jones, T.A. and Kleywegt, G.J. (1999) CASP3 comparative modeling evaluation. Protein Struct. Funct. Genet., (S3), 30–46.
Kabsch, W. and Sander, C. (1984) On the use of sequence homologies to predict protein structure: identical pentapeptides can have completely different conformations. Proc. Natl Acad. Sci. USA, 81, 1075–1078.
Kearsley, S.K. (1989a) On the orthogonal transformation used for structural comparisons. Acta Cryst., A45, 208–210.
Kearsley, S.K. (1989b) Structural comparisons using restrained inhomogeneous transformations. Acta Cryst., A45, 628–635.
Levitt, M. (1992) Accurate modeling of protein conformation by automatic segment matching. J. Mol. Biol., 226, 507–533.
Martin, A.C.R., MacArthur, M.W. and Thornton, J.M. (1997) Assessment of comparative modeling in CASP2. Protein Struct. Funct. Genet., (Suppl. 1), 14–28.
Mezei, M. (1998) Chameleon sequences in the PDB. Protein Eng., 11, 411–414.
Mizuguchi, K., Deane, C.M., Blundell, T.L. and Overington, J.P. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, 2469–2471.
Peitsch, M.C. (1996) ProMod and Swiss-Model: internet-based tools for automated comparative protein modelling. Biochem. Soc. Trans., 24, 274–279.
Sali, A. and Blundell, T.L. (1990) Definition of general topological equivalence in protein structures. A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol., 212, 403–428.
Sali, A. and Blundell, T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol., 234, 779–815.
Sander, C. and Schneider, R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68.
Srinivasan, N. and Blundell, T.L. (1993) An evaluation of the performance of an automated procedure for comparative modelling of protein tertiary structure. Protein Eng., 6, 501–512.
Sutcliffe, M.J., Haneef, I., Carney, D. and Blundell, T.L. (1987a) Knowledge based modelling of homologous proteins, Part I: Three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Eng., 1, 377–384.
Sutcliffe, M.J., Hayes, F.R.F. and Blundell, T.L. (1987b) Knowledge based modelling of homologous proteins, Part II: rules for the conformations of substituted sidechains. Protein Eng., 1, 385–392.
Topham, C.M., McLeod, A., Eisenmenger, F., Overington, J.P., Johnson, M.S. and Blundell, T.L. (1993) Fragment ranking in modelling of protein structure. Conformationally constrained environmental amino acid substitution tables. J. Mol. Biol., 229, 194–220.
Unger, R., Harel, D., Wherland, S. and Sussman, J.L. (1989) A 3D building blocks approach to analyzing and predicting structure of proteins. Proteins, 5, 355–373.
Yang, A.S. and Honig, B. (1999) Sequence to structure alignment in comparative modeling using PrISM. Proteins, 37, 66–72.