3D Results Summary









CASP5 Methods Abstracts





123D_server (P0476) - 68 predictions: 68 3D


123D: an Old Program for Fold Recognition


N.Alexandrov

Ceres, Inc. Malibu, CA, USA

nicka@ceres-inc.com


I used the 123D+ web site at http://123d.ncifcrf.gov/ for making predictions.
The predictions were completely automatic, without any manual intervention;
the only exceptions were made for multi-domain proteins. For such proteins the
strongest local hit was cut out from the query sequence and the rest of the
sequence was submitted again. The program 123D+ uses PSI-BLAST generated
profiles for both the query sequence and the fold library, secondary structure
compatibility, and contact capacity potentials for finding the optimal
sequence-structure alignment. The fold library was constructed from the 40%
non-redundant ASTRAL set of SCOP-1.59 domains.




Accelrys (P0210) - 24 predictions: 24 3D


Comparative Modeling Using GeneAtlas™


Dana Haley-Vicente, Velin Spassov, Tina Yeh, Ken Butenhof,
Christoph Schneider, Azat Badretdinov and Lisa Yan


Accelrys Inc., 9685 Scranton Road, San Diego, CA 92121, USA

dhv@accelrys.com


GeneAtlas™ (1) is a high-throughput pipeline for automated protein structure
prediction and function annotation. For template structure identification it uses
PSI-BLAST searches and our fold recognition program, SeqFold. To maximize
homology recognition, both direct and reverse PSI-BLAST searches are
performed and the hits are combined. Automated model building is carried out
with Modeler, and models are evaluated using Profiles-3D Verify scores.


For CASP5 targets, we first use GeneAtlas to help identify and select
potential PDB templates, and then the alignments are adjusted manually with
the aid of various alignment tools (e.g. Align123) in the Homology module in
InsightII. Align123 is based on ClustalW, augmented with a secondary
structure match term added to the alignment score. If multiple templates are
used to build a model, structure-structure alignments are explored using
InsightII’s structure alignment tools, as well as Modeler’s MALIGN3D and the
protein structure alignment program CE. Subsequently the sequence-structure
alignment is carried out with Modeler’s Align2D. Multiple models are built
with Modeler, including the new loop refinement routine based on the
optimization of statistical pair potentials. Models are checked for proper
stereochemistry and evaluated by comparing the restraint violations reported
by Modeler and the Profiles-3D Verify scores, which measure the
compatibility of each residue in the model with its environment.


In addition, some targets were selected to test two new methods that we have
developed, ChiRotor and Looper, for side-chain and loop prediction. ChiRotor
is a fast algorithm that predicts the conformation of all or part of the amino-acid
side chains with an average RMSD of about 1 Å for the core residues. The
loop-modeling program, Looper, produces a number of energy-minimized loop
backbone conformations ranked according to force-field energy terms. Both
algorithms combine a discrete search in dihedral angle space with
CHARMm energy minimization.


1. Kitson et al. (2002) Functional annotation of proteomic sequences based
on consensus of sequence and structural analysis. Briefings in
Bioinformatics 3(1), 1-13.



Advanced-ONIZUKA (P0214) - 92 predictions: 92 3D


Fold Selection and Patchwork Energy Minimization


Kentaro Onizuka

Advanced Technology Research Laboratories,

Matsushita Electric Industrial Co. Ltd.

onizuka@mrit.mei.co.jp


The new method developed for CASP5 consists of three units.

1) Fold recognition unit

This unit selects ten to a hundred conformations that have relatively good
compatibility with the target protein sequence from approximately two thousand
non-redundant protein structures collected from PDB release 100. The selected
conformations are aligned to the target protein sequence. The compatibility of a
conformation with the target sequence is evaluated as the sum of
multi-dimensional mean-force potentials between all possible pairs of residues
in that conformation, once the target sequence has been aligned to it.


2) Patchwork energy minimization unit

This unit builds a protein conformation by concatenating structure segments
cut out of the conformations selected by the fold recognition unit. The
selected conformations are aligned to the target protein sequence. The
concatenation is done as follows: 1) select two (i-th and j-th)
conformations, each aligned to the target protein sequence; 2) choose a residue
M in the sequence as the crossover point; 3) generate the new conformation
by concatenating the segment from the N-terminus (of the target sequence) to M
of the j-th conformation and the segment from M to the C-terminus (of the target
sequence) of the i-th conformation. The minimization algorithm is partially
analogous to a genetic algorithm and also to dynamic programming. The
minimization procedure first sets several segment core residues, which must
never be crossover points. The core residues are those having locally minimal
energy, where the energy of each residue is calculated as the average energy
(sum of potentials involving that residue) over all the selected conformations.
The first concatenation step takes crossover points M between the N-terminus
and the first segment core residue. For the i-th conformation, the combination
of M and j that gives the conformation with minimal energy is selected. The
k-th step takes M between the (k-1)-th and k-th core residues. The last step
takes M between the last core residue and the C-terminal residue. Finally, the
conformation having minimal energy is selected from the remaining new
conformations as the result of the energy minimization.


3) Gap caulking unit

The protein conformation built by the patchwork energy minimization unit
contains some gaps inserted or deleted during the alignment process. This unit
tries to caulk those gaps by searching the conformations (selected by the fold
recognition unit) for a combination of two gapless conformation segments at
that region which may substitute for the conformation segments containing gaps.
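
The crossover step of the patchwork unit (unit 2) can be sketched as follows. This is only an illustration under stated assumptions: each conformation is represented as a list of per-residue coordinates already aligned to the target, the `energy` callable is a hypothetical stand-in for the mean-force potentials, and keeping the unmodified i-th conformation as a fallback candidate is a choice made here, not something the abstract states.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed representation: one coordinate triple per target residue.
Conformation = List[Tuple[float, float, float]]


def crossover(conf_i: Conformation, conf_j: Conformation, m: int) -> Conformation:
    """Take residues N-term..M from conformation j and M..C-term from conformation i."""
    return conf_j[:m] + conf_i[m:]


def best_crossover(
    conf_i: Conformation,
    pool: Sequence[Conformation],
    points: Sequence[int],
    energy: Callable[[Conformation], float],
) -> Tuple[Conformation, float]:
    """For one conformation i, pick the partner j and crossover point M of lowest energy.

    `points` holds the crossover positions allowed in the current step, for
    example the residues between two consecutive segment core residues.
    """
    best, best_e = conf_i, energy(conf_i)  # fallback: keep conformation i unchanged
    for conf_j in pool:
        for m in points:
            candidate = crossover(conf_i, conf_j, m)
            e = energy(candidate)
            if e < best_e:
                best, best_e = candidate, e
    return best, best_e
```

Repeating `best_crossover` for successive windows of crossover points (N-terminus to first core residue, then between consecutive core residues, and so on) would reproduce the step-by-step scheme described in unit 2.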


The multi-dimensional mean force potentials E_k^ab are pairwise between two
residues with respect to the residue types a and b, the sequence separation k,
and the six-dimensional relative configuration whose components are 1) the
distance between the two residues, 2) the direction of residue b from a, and
3) the orientation of b relative to a (three Euler angles). The fold recognition
unit, however, first employs singleton potentials that depend on only one
residue type of the pair in order to generate the energy profile of each
conformation in the non-redundant conformation data set. Then the target
sequence is aligned to each profile using a dynamic programming algorithm. The
compatibility of each conformation with the target sequence is evaluated by
calculating the total energy, which is the sum of pairwise potentials according
to that alignment. The energy minimization unit employs pairwise potentials
plus attractive force potentials because energy minimization using only the
net mean force potentials [1] generates an extended conformation rather than
a compact one. The attractive potentials adopted here are proportional to the
square of the distance between residues.


The performance of the proposed minimization algorithm is intense, although
the algorithm is not logically guaranteed to generate the optimal solution. The
most difficult remaining problem is the choice of potentials for minimization.


1. Sippl M.J. (1990) Calculation of Conformational Ensembles from Potentials
of Mean Force: An Approach to the Knowledge-based Prediction of
Local Structure in Globular Proteins. J. Mol. Biol. 213, 859-883.
2. Onizuka K., Noguchi T., Akiyama Y., Matsuda H. (2002) Using Data
Compression for Multidimensional Distribution Analysis. Intelligent
Systems May/June 2002, 48-54.


ALAX (P0234) - 39 predictions: 39 3D


A New Sequence Alignment Method ALAX and Its
Application to Homology Modeling


Atsushi Hijikata (1), Tosiyuki Noguti (2) and Mitiko Go (1)

(1) Division of Biological Science, Graduate School of Science, Nagoya
University, (2) Saga Medical School

alax@bio.nagoya-u.ac.jp


One of the important issues in homology modeling is obtaining an accurate
sequence alignment. This is particularly true when the sequence identity
between the target and template proteins is low (less than 30%). At low sequence
identity, one of the difficulties lies in locating the insertions/deletions
(in/del) at the proper positions. To place the in/del at the correct locations,
we developed a new sequence alignment method for protein pairs with weak
identity in their amino acid sequences. A new gap penalty function was
introduced that is based on the solvent accessibility of the corresponding
amino acid residues of the template structure. In the new sequence alignment
method, this gap penalty function and the Position Specific Scoring Matrix
(PSSM) of PSI-BLAST [1] were combined. The alignment method we developed is
named ALAX (ALignment based on ACCessibility). We used ALAX for
template-target sequence alignment and the homology modeling software FAMS in
CASP5/CAFASP3.
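
To make the idea of an accessibility-dependent gap penalty concrete, here is a minimal sketch. The abstract does not give the functional form used by ALAX, so the linear interpolation in `gap_penalty`, the penalty values, and the simple linear-gap global alignment below are all assumptions chosen for illustration; `score[i][j]` stands in for a PSSM-derived match score between target residue i and template residue j.

```python
from typing import List, Sequence


def gap_penalty(rel_accessibility: float, buried: float = 8.0, exposed: float = 2.0) -> float:
    """Hypothetical penalty: expensive at buried template positions, cheap at exposed ones."""
    acc = min(max(rel_accessibility, 0.0), 1.0)
    return buried - (buried - exposed) * acc


def align_score(score: Sequence[Sequence[float]], accessibility: Sequence[float]) -> float:
    """Global alignment score with a template-position-specific linear gap penalty.

    score[i][j]: match score of target residue i against template residue j.
    accessibility[j]: relative solvent accessibility (0..1) of template residue j;
    the template is assumed to be non-empty.
    """
    n, m = len(score), len(accessibility)
    gp: List[float] = [gap_penalty(a) for a in accessibility]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] - gp[j - 1]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] - gp[0]
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score[i - 1][j - 1],  # align target i with template j
                dp[i - 1][j] - gp[j - 1],                # insertion next to template residue j
                dp[i][j - 1] - gp[j - 1],                # template residue j left unaligned
            )
    return dp[n][m]
```

With such a scheme the dynamic programming prefers to open gaps at exposed (loop-like) template positions rather than in the buried core, which is the behaviour the abstract aims for.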


In CASP5/CAFASP3, we obtained the target models through the following
three steps.


1) Template structure selection

To identify a template structure, we used five iterations of PSI-BLAST against
the non-redundant protein sequence database (nr) of the NCBI. All the
sequences having an e-value lower than 0.1 were included in the PSSM
construction. Then, the PSSM was used to search against the PDB sequence
database. The PDB sequence with the lowest e-value was selected as the template
structure.

2) Target-template sequence alignment

To align the template and the target sequence, we used ALAX with the solvent
accessibility of the residues of the template structure and the PSSM constructed
in step 1).


3) Model building

Finally, model building was carried out using the FAMS [2] program
according to the alignment obtained by ALAX.

All the steps of homology modeling, 1) to 3), are fully automatic.


1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res.
25(17), 3389-3402.
2. Ogata K. and Umeyama H. (2000) An automatic homology modeling
method consisting of database searches and simulated annealing. J. Mol.
Graph. Model. 18(3), 258-272, 305-306.




Aligners (P0064) - 31 predictions: 31 3D


Fold Recognition Using Only Boilerplate Methods of Database
Search and Multiple Sequence Alignment


Arcady Mushegian

Stowers Institute for Medical Research

arm@stowers-institute.org


I believe that most if not all approaches for predicting protein structure from
sequence form a continuum of methods, at the core of which lies probabilistic
modeling of evolutionarily related sequence families. (Ab initio methods may
be an exception, but they used to be practical mostly for short peptides.) Thus,
there is no “threading” really distinct from “fold recognition” really distinct
from “homology modeling”; the difference is mainly in the atomic detail of
the resulting model.


In order to falsify, and thereby scientifically test, the above statement, one has
to demonstrate that various complementary physico-chemical approaches
1) are not reducible to probabilistic modeling of protein sequence families and
2) result in a statistically significant improvement over the methods that use
alignment information alone.


In order to provide a benchmark against which the level of improvement can be
scored, I applied the “no-new-methods” approach for structure prediction of
CASP5 targets. At the first step, I removed the targets that had a statistically
significant match (arbitrary cutoff E <= 10^-4) at the first iteration of the
PSI-BLAST program [1] to a sequence with a known (PDB) structure. These are
straight homology modeling targets, where the real issue is not fold recognition
but the RMSD of the model. I know nothing about methods of reducing RMSD.
I also left out several very short peptides. The result is 37 targets where fold
recognition, i.e., identification of and alignment to an appropriate template, is a
legitimate yet non-trivial task.


The main database search program was PSI-BLAST (the cutoff for inclusion into a
profile was set at 0.05 and composition-based statistics were used when helpful).
The program was run to convergence, and the homologs were collected and used in
new rounds of similarity searches. This is an important step because none
of the existing similarity search methods is assured to recover all family
members in one, even iterative, search [2]. If a template was not discovered, the
RPS-BLAST program [3] was used and proved helpful in two cases. Several
HMM-based applications were also employed but did not give any gain in
template identification.


Sequences of multiple family members, including target, template, and several
homologs with different degrees of similarity to both, were aligned using
MACAW [4] and T-COFFEE [5], then converted to the AL format (I thank
Ognen Duzlevski for giving me a converter program). The only manual check
was to assure that the alignment makes structural sense, i.e. that the major
elements of secondary structure are aligned, and their connectivity is possible
given the distances between the aligned elements in each structure. Loops were
not modeled if they could not be aligned on the basis of sequence similarity.


I submitted 28 models for 28 targets. The assessors are invited to see whether
the results are, on average, comparable with the ones achieved by more
sophisticated approaches.


1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res.
25(17), 3389-3402.
2. Aravind L., Koonin E.V. (1999) Gleaning non-trivial structural, functional
and evolutionary information about proteins by iterative database searches.
J Mol Biol. 287(5), 1023-1040.
3. Schaffer A.A. et al. (1999) IMPALA: matching a protein sequence against
a collection of PSI-BLAST-constructed position-specific score matrices.
Bioinformatics 15, 1000-1011.
4. Schuler G.D. et al. (1991) A workbench for multiple alignment
construction and analysis. Proteins 9, 180-190.
5. Notredame C. et al. (2000) T-Coffee: A novel method for fast and accurate
multiple sequence alignment. J Mol Biol. 302, 205-217.





arby-scai (P0183) - 68 predictions: 68 3D


The Arby Automated Structure Prediction Server


Ingolf Sommer (1), Niklas von Öhsen (2)

(1) Max-Planck-Institute for Informatics,
(2) Fraunhofer Institute for Scientific Computing and Algorithms

sommer@mpi-sb.mpg.de


Our fully automated protein structure prediction server Arby combines the
results of several fold recognition methods to find suitable templates in a
database of structural representatives of protein domains.


The method starts by constructing a set of subsequences from the query
sequence, each subsequence representing a hypothesis for a possible protein
domain. This is done by scanning against the InterPro database and using hits
as domain hypotheses [1]. Additional hypotheses are constructed using a
secondary structure prediction from PSIPRED [2]. Segments of predicted loops
are used as potential domain boundaries. Finally, the set of subsequences is
reduced to a reasonable size by removing subsequences that are highly similar
or short.


For each subsequence a multiple alignment is constructed by searching the NR
database using PSI-BLAST [3]. A frequency profile is calculated from this
multiple alignment using a slightly modified version of the Henikoff-Henikoff
sequence-weighting algorithm [4].
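
As an illustration of the weighting step, the following is a minimal sketch of the standard position-based sequence weights of Henikoff and Henikoff [4]. It is the unmodified textbook scheme, not Arby's "slightly modified version", whose details the abstract does not give; treating the gap character as just another symbol is a simplifying assumption made here.

```python
from collections import Counter
from typing import List, Sequence


def henikoff_weights(alignment: Sequence[str]) -> List[float]:
    """Position-based sequence weights for an alignment of equal-length rows.

    In each column, a sequence receives 1 / (k * n), where k is the number of
    distinct symbols in the column and n is the number of sequences sharing
    this sequence's symbol. Contributions are summed over columns and the
    weights are normalized to sum to 1.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)
        for seq_index, symbol in enumerate(column):
            weights[seq_index] += 1.0 / (k * counts[symbol])
    total = sum(weights)
    return [w / total for w in weights]
```

A weighted frequency profile then follows by adding, for each column, the weight of every sequence to the count of the residue it carries at that position, so that groups of near-identical sequences do not dominate the profile.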


Each of the potential domains is then subjected to four different fold
recognition methods. Each method searches for an optimal structure in our
template database.

The template database is a representative subset of the
SCOP domains with pairwise sequence identity lower than 40% [5, 6]. For each
of these template domains, a frequency profile was constructed as described
above for the targets. The first fold recognition method is PSI-BLAST, which is
used to search through our set of template domains (augmented by the NR
sequence database). The second one is the 123D threading program. It uses
frequency profiles on the target side and 3D structural information on the
template side [7, 8]. The third one is the JProp profile-profile alignment method
recently developed in our group [9, 10]. It compares frequency profiles on the
target side with profiles on the template side using the log average scoring
approach. The fourth method is again the JProp profile-profile alignment
program, but in this version it makes use of additional secondary structure
information on the target and template side (publication in preparation).


The quality of each of these search results is assessed using confidence
measures. For PSI-BLAST, these are readily available [11]; for the other
methods, these were developed in a recent study [12].


The target sequence is then annotated with all the produced quadruplets
(subsequence, fold recognition method, search result, confidence value).
Finally, we select a set of non-overlapping annotations along the sequence by
performing combinatorial optimization of a heuristic score based on the
confidence values. For each of these selected annotations, a separate protein
domain is predicted. The structure of this domain prediction is computed by
aligning the subsequence against the template structure using JProp.
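
One way to picture the selection of non-overlapping annotations is as weighted interval scheduling, sketched below. The abstract does not specify the heuristic score or the optimization procedure, so using the plain sum of confidence values and this particular dynamic program are assumptions; the `Annotation` record simply mirrors the quadruplet described above.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Annotation(NamedTuple):
    start: int         # first residue of the subsequence (inclusive)
    end: int           # one past the last residue (exclusive)
    method: str        # fold recognition method that produced the hit
    template: str      # search result, e.g. a template domain identifier
    confidence: float  # confidence value of the hit


def select_annotations(annotations: List[Annotation]) -> List[Annotation]:
    """Choose non-overlapping annotations that maximize the summed confidence."""
    items = sorted(annotations, key=lambda a: a.end)
    ends = [a.end for a in items]
    n = len(items)
    best = [0.0] * (n + 1)    # best[i]: optimum using the first i items
    prev = [0] * (n + 1)      # where to continue backtracking after taking item i
    taken = [False] * (n + 1)
    for idx, a in enumerate(items, start=1):
        # Number of earlier items that end at or before this one starts.
        p = bisect_right(ends, a.start, 0, idx - 1)
        take = best[p] + a.confidence
        if take > best[idx - 1]:
            best[idx], prev[idx], taken[idx] = take, p, True
        else:
            best[idx] = best[idx - 1]
    chosen: List[Annotation] = []
    idx = n
    while idx > 0:
        if taken[idx]:
            chosen.append(items[idx - 1])
            idx = prev[idx]
        else:
            idx -= 1
    return chosen[::-1]
```

In this toy formulation two confident half-length domain hits can beat a single, somewhat more confident full-length hit, which is the kind of trade-off a score based on confidence values has to arbitrate.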


The underlying machinery is a Java-based data flow engine, designed for
stability. Since it is general and independent of the specific pipeline (such as
the one described above), it can be used as infrastructure for other projects as
well: we developed a component framework in which all algorithms and programs
are encapsulated in small Java classes. Each of these components specifies an
algorithm to be executed along with its input parameters, the output that it
produces, and possible error conditions. The accompanying engine provides a
number of features for the components. First of all, the input/output
dependencies of components are resolved. Once all inputs for a specific algorithm
have been determined, the algorithm itself is scheduled for execution.
The components are executed in parallel on any number of CPUs, in our case
10 CPUs of a SunFire 4800 server. A frequent problem in fully automated
systems is reliable error handling. We solve this problem by catching potential
error conditions and adaptively pruning the data-flow tree. Additionally,
persistence of the computed results is accomplished by using a relational
database, thus offering convenient and fast access to previously computed
results for identical input parameters.
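
The dependency-driven scheduling and error pruning described above can be sketched in a few lines. This is a deliberately simplified, sequential illustration written in Python rather than Java: the tuple layout of a component, the function name `run_dataflow`, and the return values are all hypothetical, and parallel execution and database persistence are left out.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

# Hypothetical component description: (name, input keys, output key, function).
Component = Tuple[str, Sequence[str], str, Callable[..., Any]]


def run_dataflow(components: Sequence[Component], initial: Dict[str, Any]):
    """Run each component once all of its inputs are available.

    A component that raises is recorded as failed and produces no output, so
    every component depending on that output can never run and is pruned.
    """
    values = dict(initial)
    pending: List[Component] = list(components)
    failed: List[str] = []
    progress = True
    while pending and progress:
        progress = False
        for comp in list(pending):
            name, inputs, output, func = comp
            if all(key in values for key in inputs):
                pending.remove(comp)
                progress = True
                try:
                    values[output] = func(*(values[key] for key in inputs))
                except Exception:
                    failed.append(name)
    pruned = [comp[0] for comp in pending]
    return values, failed, pruned
```

A real engine would dispatch ready components to worker processes and look up `(component, inputs)` in a results store before executing, but the control flow of "run when inputs are ready, prune what depends on a failure" is the same.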



The power of the structure prediction server is based on the use of modern
profile-profile algorithms for fold recognition, the quality assessment using
confidence measures, and the stable and powerful Java data flow engine. In
future work, we will use the latter technology as a basis for our bioinformatics
computing environment.


Acknowledgements.

In addition to the authors, the ARBY CAFASP3 Team
includes Mario Albrecht, Thomas Lengauer, Theo Mevissen, and Ralf Zimmer.
We thank Daniel Hanisch for providing contributions to the Java
implementation. Part of this research has been supported by BMBF grant
no. 01 SF 9984/3 (Helmholtz Network for Bioinformatics).


1. Apweiler R. et al. (2001) The InterPro database, an integrated
documentation resource for protein families, domains and functional sites.
Nucleic Acids Res. 29(1), 37-40.
2. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J Mol Biol. 292(2), 195-202.
3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res.
25(17), 3389-402.
4. Henikoff S. and Henikoff J.G. (1994) Position-based sequence weights.
J Mol Biol. 243(4), 574-8.
5. Chandonia J.M., et al. (2002) ASTRAL compendium enhancements.
Nucleic Acids Res. 30(1), 260-3.
6. Brenner S.E., Koehl P., and Levitt M. (2000) The ASTRAL compendium for
protein structure and sequence analysis. Nucleic Acids Res. 28(1), 254-6.
7. Zien A., Zimmer R., and Lengauer T. (2000) A simple iterative approach
to parameter optimization. J Comput Biol. 7(3-4), 483-501.
8. Alexandrov N.N., Nussinov R., and Zimmer R. (1996) Fast protein fold
recognition via sequence to structure alignment and contact capacity
potentials. Pac Symp Biocomput, 53-72.
9. Von Öhsen N., Sommer I., and Zimmer R. (2003) Profile-Profile Alignment:
A Powerful Tool For Protein Structure Prediction. In Pac Symp Biocomput.
10. Von Öhsen N. and Zimmer R. (2001) Improving profile-profile alignment
via log average scoring. Lecture Notes in Computer Science 2149, 11-26.
11. Karlin S. and Altschul S.F. (1990) Methods for assessing the statistical
significance of molecular sequence features by using general scoring
schemes. Proc Natl Acad Sci U S A. 87(6), 2264-8.
12. Sommer I., et al. (2002) Confidence measures for protein fold recognition.
Bioinformatics 18(6), 802-12.



AS2TS (P0081) - 26 predictions: 26 3D


AS2TS: A New Protein Structure Prediction Server


J. Zemla


Independence High School, Brentwood, CA, US

joanna_zemla@yahoo.com


We have attempted to predict structures of twenty-six CASP5 targets using a
preliminary version of a fully automated method AS2TS (Amino acid Sequence
to Tertiary Structure) [1].


The AS2TS server built 3D protein models using the top sequence-structure
alignment provided by PSI-BLAST [2] for a given target. Coordinates for loop
regions were assigned from a library of folds generated by the LGA program
(Local-Global Alignment) [3]. Side chains were added using the SCWRL program
[4]. Human intervention was limited to entering an amino acid sequence into the
AS2TS server and checking whether the process of model building went through.


Our main goal during this round of CASP was to test the ability and
effectiveness of combining two independently working processes: a sequence
alignment method and a loop building procedure. An analysis of the evaluation
results will help in further development of the AS2TS system.


1. Zemla A. http://protein.llnl.gov/as2ts
2. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W.
& Lipman D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Res. 25(17), 3389-3402.
3. Zemla A. http://PredictionCenter.llnl.gov/local/lga/lga.html
4. Bower M., Cohen F.E. and Dunbrack R.L. Jr. (1997) Sidechain prediction
from a backbone-dependent rotamer library: A new tool for homology
modeling. J. Mol. Biol. 267, 1268-1282.


ATOME (P0464) - 318 predictions: 318 3D


Evaluation of an Automatic Pipeline, ATOME for Protein
Structure Modelling


G. Labesse, V. Catherinot, J.-L. Pons, L. Martin and D. Douguet

Centre de Biochimie Structurale (CNRS), Montpellier, France

labesse@cbs.cnrs.fr


The fold compatibility between the targets and PDB entries was analyzed using
our recently developed meta-server [1]. Query sequences are sent
automatically to six distinct fold recognition or protein structure prediction
servers: 3D-PSSM [2], PDB-BLAST
(http://bioinformatics.burnham-inst.org/pdb_blast/), FUGUE [3],
GenTHREADER [4], SAM-T99 [5] and J-PRED2 [6], with default parameters
except for PDB-BLAST (10 iterations). No particular treatment was applied to
multi-domain targets, as proper domain delimitation was not yet automated.
This likely led to partially incorrect alignments or to incorrect fold
recognition for a few targets.


As most “threaders” use the “frozen approximation”, each structural alignment
was further evaluated using T.I.T.O [7]. PSI-BLAST [8] on the SWISSPROT [9]
sequence database, run on the NPSA server [10], was used to search for
homologous sequences using the target sequence as a query. The homologs and
the target sequence were used to produce a multiple alignment using CLUSTALW.
This alignment was used to assess the structural alignments.


A consensus ranking was deduced for each template, taking into account its
score and its ranking (both computed by the original server), the T.I.T.O score
and the level of sequence identity.


For all targets, three models were built directly using MODELLER 6.0 [11] for
the top-ranking structural alignments. Additional restraints to be used in
MODELLER 6.0 were deduced from the template secondary structure assignment
made using P-SEA [12]. Models were evaluated using PROSA [13] and
Verify3D [14]. Side chain modelling in the common core (as defined by the
target-template alignment) was also performed using SCWRL 2.8 [15] and
similarly evaluated but not further refined.


For each target, all the three-dimensional models were ranked according to the
scores computed by PROSA [13] and Verify3D [14]. The top five models were
deposited for each target.


1. Douguet D. et al. (2001) Easier threading through web-based comparisons
and cross-validations. Bioinformatics 17, 752-753.
2. Kelley L.A. et al. (2000) Enhanced Genome Annotation using Structural
Profiles in the Program 3D-PSSM. J. Mol. Biol. 299, 501-522.
3. Shi J. et al. (2001) FUGUE: sequence-structure homology recognition
using environment-specific substitution tables and structure-dependent gap
penalties. J. Mol. Biol. 310, 243-257.
4. McGuffin L.J. et al. (2000) The PSIPRED protein structure prediction
server. Bioinformatics 16, 404-405.
5. Karplus K. et al. (1998) Hidden Markov models for detecting remote
protein homologies. Bioinformatics 14, 846-856.
6. Cuff J.A. et al. (1998) Jpred: A Consensus Secondary Structure Prediction
Server. Bioinformatics 14, 892-893.
7. Labesse G. et al. (1998) A Tool for Incremental Threading Optimization
(T.I.T.O.) to help alignment and modelling of remote homologs.
Bioinformatics 14, 206-211.
8. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res.
25(17), 3389-3402.
9. Bairoch A. et al. (2000) The SWISS-PROT protein sequence database and
its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45-48.
10. Combet C. et al. (2000) NPS@: network protein sequence analysis.
Trends Biochem. Sci. 25, 147-150.
11. Sali A. et al. (1993) Comparative protein modelling by satisfaction of
spatial restraints. J. Mol. Biol. 234, 779-815.
12. Labesse G. et al. (1997) P-SEA: a new efficient assignment of secondary
structure from Ca trace of proteins. CABIOS 13, 291-295.
13. Sippl M.J. (1993) Recognition of errors in three-dimensional structures of
proteins. Proteins 17, 355-362.
14. Eisenberg D. et al. (1997) VERIFY3D: assessment of protein models with
three-dimensional profiles. Methods Enzymol. 277, 396-404.
15. Dunbrack R.L. et al. (1993) Backbone-dependent rotamer library for
proteins. Application to side-chain prediction. J Mol Biol. 230, 543-574.


Avbelj-Franc (P0341) - 25 predictions: 25 3D


Torsion Space Monte Carlo Simulations of Folding Using
Electrostatic Screening of Backbone and Charged Side-Chain Interactions


F. Avbelj

National Institute of Chemistry

francl@sg3.ki.si


Three-dimensional structures of small proteins are predicted ab initio using
torsion space Monte Carlo simulations from sequence alone. Protein structures
in the data bank are not used in this method. The method is based on the
electrostatic screening of main-chain and charged side-chain interactions.


The screening of main-chain electrostatic interactions by water solvation is
used to model the backbone conformational propensities (the electrostatic
screening model: ESM) [1-7]. The strongest support for the ESM has been
provided by recent experimental studies, which demonstrated that an
enthalpic factor is involved in determining the preferences for α-helices and
β-strands [8-10].


The energy function in the Monte Carlo procedure contains main-chain and
charged side-chain electrostatic interactions, electrostatic solvation free
energies of main-chain and charged side-chain groups, and the hydrophobic effect.
The screening of charged side-chain electrostatics by water solvation is used to
model the interactions of charged side-chain groups. The interactions of polar
non-charged side-chain groups are ignored. The hydrophobic effect is modeled
by the long-range interactions. The main-chain and charged side-chain
electrostatic interactions are calculated using Coulomb's law with a dielectric
constant of 1. The electrostatic solvation free energies of polar main-chain and
charged side-chain groups (ESF) are calculated using the finite difference
Poisson-Boltzmann model (DelPhi) with the PARSE parameter set [11]. The
electrostatic potential of the molecule is first calculated using a very large box
and a large grid size. This potential then provides boundary conditions for more
accurate calculations of the electrostatic potential around each residue (focusing).
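
The Coulomb term mentioned above is simple enough to write out. The sketch below is a generic pairwise Coulomb sum, not the author's code: the representation of atoms as (charge, position) pairs, the units (elementary charges, Å, kcal/mol) and the conversion constant of roughly 332.06 kcal·Å/(mol·e²) are assumptions commonly used in molecular mechanics, and exclusions for bonded pairs are omitted.

```python
import math
from typing import Sequence, Tuple

# Assumed conversion factor so that charges in e and distances in Å give kcal/mol.
COULOMB_CONSTANT = 332.0637

Atom = Tuple[float, Tuple[float, float, float]]  # (partial charge, xyz position)


def coulomb_energy(atoms: Sequence[Atom], dielectric: float = 1.0) -> float:
    """Sum q_a * q_b / (dielectric * r_ab) over all distinct atom pairs."""
    energy = 0.0
    for a in range(len(atoms)):
        q_a, pos_a = atoms[a]
        for b in range(a + 1, len(atoms)):
            q_b, pos_b = atoms[b]
            r = math.dist(pos_a, pos_b)
            energy += COULOMB_CONSTANT * q_a * q_b / (dielectric * r)
    return energy
```

With the dielectric constant set to 1, as in the abstract, the raw interactions are unscreened; in the method described, the screening comes from the separately computed solvation free energies rather than from a larger dielectric constant in this term.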


Torsion space Monte Carlo simulations of small proteins are performed using
hierarchic condensation. In the first phase of simulation only the local
electrostatic energies and backbone solvation free energies of residues are
activated. After equilibration the protein molecules display native-like local
conformational propensities and dimensions characteristic of highly
denatured proteins. The calculated NMR 3J(HN-Hα) coupling constants agree well
with those obtained from the COIL residues in experimental protein structures.
In this phase the β-strands are formed. In the second phase of simulation the
main-chain hydrogen bonds are included in the energy function. In this phase
α-helices and hairpins are formed. In the third phase of simulation the
long-range hydrophobic and electrostatic interactions between charged residues
are gradually included in the energy function. The electrostatic interactions
between charged residues are screened by the electrostatic solvation free
energies of charged side-chains. In this phase α-helices and β-strands gradually
condense into compact structures.


In order to improve sampling of the conformational space, a large number of
independent Monte Carlo simulations (~100) are performed. All heavy atoms
and polar hydrogens are included in the simulations. The geometry of amino
acids is generated using the Discover residue library. Only torsion angles are
allowed to vary during simulations. The ω peptide bond torsion angles are fixed
to 180˚. Hard sphere repulsion is enforced by discarding conformations with
steric clashes. Pairs of atoms related by torsion angles are not checked for steric
clashes. Conformational space is sampled by varying torsion angles of proteins
using different types of moves. The Metropolis criterion is used to decide
whether to accept or reject the move. The temperature is 300 K.
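
For readers unfamiliar with it, the Metropolis acceptance test can be sketched as below. This is the generic textbook criterion; the energy unit (kcal/mol, with the gas constant expressed accordingly) and the injectable random number source `rng` are assumptions made for the example, not details given in the abstract.

```python
import math
import random
from typing import Callable

GAS_CONSTANT_KCAL = 0.0019872041  # kcal/(mol*K); assumes energies in kcal/mol


def metropolis_accept(
    delta_e: float,
    temperature: float = 300.0,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Accept downhill moves always and uphill moves with probability exp(-dE / RT)."""
    if delta_e <= 0.0:
        return True
    return rng() < math.exp(-delta_e / (GAS_CONSTANT_KCAL * temperature))
```

At 300 K the product RT is about 0.6 kcal/mol, so a trial move that raises the energy by 1 kcal/mol is accepted a little under one time in five.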


1. Avbelj F. and Moult J. (1995) Role of electrostatic screening in
determining protein main-chain conformational preferences.
Biochemistry, 34, 755-764.
2. Avbelj F. and Fele L. (1998) Role of main-chain electrostatics,
hydrophobic effect, and side-chain conformational entropy in determining
the secondary structure of proteins. J. Mol. Biol., 279, 665-684.
3. Avbelj F. (2000) Amino acid conformational preferences and solvation of
polar backbone atoms in peptides and proteins. J. Mol. Biol., 300, 1337-61.
4. Avbelj F. and Moult J. (1995) The conformation of folding initiation sites
in proteins determined by computer simulation. Proteins: Struc., Funct.,
Genet., 23, 129-141.
5. Avbelj F. and Fele L. (1998) Prediction of the three dimensional structure
of proteins using the electrostatic screening model and hierarchic
condensation. Proteins: Struc., Funct., Genet., 31, 74-96.
6. Avbelj F. (1992) Use of a potential of mean force to analyze free energy
contributions in protein folding. Biochemistry, 31, 6290-6297.
7. Avbelj F. and Baldwin R. L. (2002) Role of backbone solvation in
determining thermodynamic β-propensities of the amino acids. Proc. Natl.
Acad. Sci. U.S.A., 99, 1309-1313.
8. Luo P. and Baldwin R. L. (1999) Interactions between water and polar
groups of the helix backbone: An important determinant of helix
propensities. Proc. Natl. Acad. Sci. U.S.A., 96, 4930-4935.
9. Lorch M. et al. (2000) Effects of mutants on the thermodynamics of a
protein folding reaction: Implications for the mechanism of formation of
the intermediate and transition states. Biochemistry, 39, 3480-3485.
10. Thomas S. T. et al. (2001) Hydration of the peptide backbone largely
defines the thermodynamic propensity scale of residues at the C' position
of the C-capping box of α-helices. Proc. Natl. Acad. Sci. U.S.A., 98,
10670-10675.
11. Sitkoff D. et al. (1994) Accurate calculations of hydration free energies
using macroscopic solvent models. J. Phys. Chem., 98, 1978-1988.




BAKER (P0002) - 377 predictions: 377 3D


Comparative Modeling Using Rosetta


D. Chivian1+, C.A. Rohl1+, C.E.M. Strauss2, P. Murphy1 and D. Baker1

1 - University of Washington, 2 - Los Alamos National Laboratory,
+ - authors contributed equally

dabaker@u.washington.edu


Comparative modeling using Rosetta [1] comprises up to five steps: A)
detection of the best parent for each putative domain, B) sequence alignment to
that parent, C) modeling of structurally variable regions, D) optimization to
increase the physical reasonableness of the final model, and E) re-assembling
the complete chain when domains were parsed and processed individually.


(A) Homolog Detection

Queries were initially screened for simple Blast or PSI-Blast parents. Large
regions of query sequence without parent coverage were then submitted to the
Bioinfo meta-server and candidate parents from Pcons2 and Pcons3 [2] were
selected. Occasionally, parents with functions similar to that reported for the
query were also considered. Human intervention was then employed to select
the appropriate parent. Domains for which no significant matches were found
were modeled using the Rosetta de novo prediction protocol [3, and see above
description].


(B) Sequence Alignment

We employed a "kitchen sink" approach, called "K*SYNC", which produces
large sets of candidate alignments by varying the way in which information is
derived and used by a modified Smith-Waterman alignment algorithm. The
information used includes the similarity between PSI-Blast derived residue
substitution profiles for the query and parent, supplementing the parent residue
substitution profile with counts from its FSSP, matching predicted regular
secondary structure (PSIPRED, PHD, SAM, and/or JUFO) with three-state
collapsed DSSP assigned secondary structure, and position specific
obligateness and contiguousness as defined by the occupancy and degree of
gapping for the query and parent in the PSI-Blast multiple sequence alignment
and from the parent's FSSP multiple structural alignment.
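
As a rough illustration of how different information sources can feed one
local alignment, the sketch below runs a plain Smith-Waterman recursion with a
pluggable scoring function. It is not K*SYNC: the linear gap penalty, the
dot-product profile similarity, the 0.5 baseline, and the weights `w_profile`
and `w_ss` are all assumptions chosen for the example. Varying the weights is
one simple way to obtain a set of candidate alignments.

```python
def smith_waterman(n, m, score, gap=-1.0):
    """Local alignment of query positions 0..n-1 against parent positions 0..m-1.
    score(i, j) is the match score; returns (best_score, aligned (i, j) pairs)."""
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + score(i - 1, j - 1),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0.0:  # traceback from the best cell
        if H[i][j] == H[i - 1][j - 1] + score(i - 1, j - 1):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


def make_score(q_profile, p_profile, q_ss, p_ss, w_profile=1.0, w_ss=0.5):
    """Hypothetical score: profile similarity plus secondary-structure agreement."""
    def score(i, j):
        dot = sum(a * b for a, b in zip(q_profile[i], p_profile[j]))
        ss = 1.0 if q_ss[i] == p_ss[j] else -1.0
        return w_profile * (dot - 0.5) + w_ss * ss
    return score


# Toy data: one-hot "profiles" over a four-letter alphabet.
one_hot = {"A": [1, 0, 0, 0], "B": [0, 1, 0, 0], "C": [0, 0, 1, 0], "X": [0, 0, 0, 1]}
query = [one_hot[c] for c in "ABC"]
parent = [one_hot[c] for c in "XABCX"]
best, pairs = smith_waterman(3, 5, make_score(query, parent, "HHH", "CHHHC"))
```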


The ensemble of sequence alignments was converted to an ensemble of
three-dimensional template structures, and short to medium unaligned regions
(< 17 residues) were modeled in the context of these templates using an
abbreviated insertion modeling procedure (see C below). Alignments containing
insertions that failed to produce conformations in agreement with the geometry
of the template stems were discarded from the ensemble. Remaining alignments
were ranked by evaluation of the structural models by several energy criteria.
Human intervention was employed to either select one of the high-ranking
alignments or to produce a new alignment by recombining the preferred
features of multiple high-ranking alignments.


(C) Insertion modeling

Unaligned regions corresponding to gaps in the sequence alignment as well as
regions estimated likely to show significant structural divergence from the
parent structure were modeled by the Rosetta fragment assembly strategy in the
context of the fixed template. For regions < 17 residues, ~300 initial
conformations were selected from a database of known structures using
similarity of sequence, secondary structure, and stem geometry. Initial
conformations for longer regions were built up using three and nine residue
fragments. The conformations of all variable regions were then optimized using
fragment replacement and random angle perturbations. A gap closure term in
the potential in combination with conjugate gradient minimization was used to
ensure continuity of the peptide backbone. Optimization of variable regions
was accomplished by use of the standard Rosetta potential with centroid
representation of side chains, followed by optimization with explicit side
chains. All variable regions were optimized simultaneously, starting from a
random selection of initial conformations. Generally, ~1000 independent
optimizations were carried out. Variable regions were ranked independently by
energy and low energy conformations for each variable region combined into a
final model, manually ensuring that interacting variable regions were
compatible. For the purposes of evaluating alignments (see B above), variable
regions were modeled sequentially rather than simultaneously, stricter
geometry requirements were enforced in selecting initial conformations, and
the optimization step was severely truncated.
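
The step of ranking each variable region independently and combining the
low-energy conformations can be sketched as follows. The data layout (one
dictionary per optimization run, keyed by a region name) is a hypothetical
choice for this example, and the manual compatibility check between
interacting regions is not modeled.

```python
def combine_lowest_energy(runs):
    """runs: one dict per independent optimization, mapping a region name to
    (energy, conformation). Returns the lowest-energy entry for each region."""
    best = {}
    for run in runs:
        for region, (energy, conformation) in run.items():
            if region not in best or energy < best[region][0]:
                best[region] = (energy, conformation)
    return best


# Toy usage with made-up energies and conformation labels.
runs = [
    {"loop1": (-3.0, "run0/loop1"), "loop2": (-1.0, "run0/loop2")},
    {"loop1": (-2.0, "run1/loop1"), "loop2": (-4.5, "run1/loop2")},
]
chosen = combine_lowest_energy(runs)
```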


(D) Idealization and Optimization of Template Regions

To make the models more physically reasonable, most structural models were
modified to possess ideal backbone bond lengths and angles. Additionally,
residue clashes were alleviated using a combination of small backbone
perturbations and a rotamer-repacking algorithm. For most targets, models
pre- and post-optimization were submitted. For targets for which idealization
and/or optimization resulted in significant backbone perturbation
(> ~1.5-2 Å), this step was eliminated.


(E) Domain Assembly

Domain scope models were combined into a contiguous chain by fragment
insertion in the putative linker region, and evaluated by a coarse energy
function. Finally, side-chains were repacked [4] in either the single or the
multi-domain context.


1. Simons K.T. et al. (1997) Assembly of protein tertiary structures from
   fragments with similar local sequences using simulated annealing and
   Bayesian scoring functions. J Mol. Biol. 268(1), 209-25.
2. Lundstrom J. et al. (2001) Pcons: a neural-network-based consensus
   predictor that improves fold recognition. Protein Sci. 10(11), 2354-62.
3. Bonneau R. et al. (2001) Rosetta in CASP4: progress in ab initio protein
   structure prediction. Proteins 45(S5), 119-26.
4. Kuhlman B. and Baker D. (2000) Native protein sequences are close to
   optimal for their structures. Proc. Natl. Acad. Sci. USA 97(19), 10383-8.




BAKER (P0002) - 377 predictions: 377 3D


De Novo Structure Predictions Using Rosetta


P. Bradley1+, J. Meiler1+, K.M.S. Misura1+, W.R. Schief1+, J. Schonbrun1+,
W.J. Wedemeyer1+, O. Schueler-Furman1, M. Kuhn1, P. Murphy1,
C.E.M. Strauss2, and D. Baker1

1 - University of Washington, 2 - Los Alamos National Laboratory,
+ - authors contributed equally

dabaker@u.washington.edu


De novo structure predictions for CASP5 were made using Rosetta. The basic
method has been described previously [1]. One of the fundamental
assumptions underlying Rosetta is that the conformations adopted by short
(3-9 residue) segments of the target polypeptide chain are similar to those
adopted by related sequences in fragments of experimentally determined protein
structures. Fragment libraries for each three and nine residue segment of the
target polypeptide chain were extracted from the protein data bank using a
profile-profile comparison method as described previously [2]. The
conformational space defined by these fragments is then searched using a
Monte Carlo procedure with an energy function favoring compact structures,
buried hydrophobic residues, and paired beta strands. 10,000-400,000
independent simulations were carried out for the target sequence and
homologous sequences (when available). Longer sequences were often parsed;
smaller segments were folded and served as nuclei for folding the remainder of
the chain.
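
The fragment-based Monte Carlo search can be sketched in a few lines. This is
a toy illustration and not the Rosetta code: the torsion-only chain
representation, the extended starting conformation, the Metropolis acceptance
with a fixed `temperature`, and the toy energy that merely counts non-helical
residues are all assumptions. Only the idea of replacing the torsions of a
short window with those of a library fragment comes from the text.

```python
import math
import random


def fragment_assembly(n_res, fragments, energy, n_steps=2000, temperature=1.0, seed=0):
    """fragments maps a start position to candidate fragments, each a list of
    (phi, psi) pairs copied onto residues start, start+1, ..."""
    rng = random.Random(seed)
    torsions = [(-150.0, 150.0)] * n_res  # extended starting chain (an assumption)
    e = energy(torsions)
    starts = sorted(fragments)
    for _ in range(n_steps):
        start = rng.choice(starts)
        frag = rng.choice(fragments[start])
        if start + len(frag) > n_res:  # skip fragments that would overrun the chain
            continue
        trial = list(torsions)
        trial[start:start + len(frag)] = frag  # fragment insertion move
        e_trial = energy(trial)
        if e_trial <= e or rng.random() < math.exp(-(e_trial - e) / temperature):
            torsions, e = trial, e_trial
    return torsions, e


# Toy usage: three-residue helix and strand fragments at every window.
HELIX, STRAND = (-60.0, -45.0), (-120.0, 130.0)
n_res = 9
library = {s: [[HELIX] * 3, [STRAND] * 3] for s in range(n_res - 2)}
toy_energy = lambda t: float(sum(1 for phi_psi in t if phi_psi != HELIX))
model, e_final = fragment_assembly(n_res, library, toy_energy, temperature=0.01)
```

Repeating the call with many different seeds would mimic the large number of
independent simulations described above.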


For sequences longer than 110 amino acid residues, the resulting models were
subjected to a filter which provided an even distribution of topologies
generated during the Monte Carlo search procedure, and reduced the number of
models with local contacts. The filtered models were then clustered as
described in [3]. In some cases, representative decoys from each cluster were
refined to improve the hydrogen bonding of their beta sheets. For proteins with
fewer than 110 residues, decoys were scored with the energy function as
described above, and the low free energy models were subjected to a Monte
Carlo Minimization procedure to relieve backbone atomic clashes. Following
this, sidechains were built onto the models using Dunbrack's backbone
dependent rotamer library and the method described in [4], and a similar Monte
Carlo Minimization procedure was then used to minimize an all-atom energy
function dominated by Lennard-Jones interactions, an orientation dependent
hydrogen bonding term, and an implicit solvation model. Side chain
conformations were periodically optimized using a full combinatorial
optimization procedure. Models with the lowest free energy were selected.


Recent advances in the Rosetta method have been in the areas of decoy
discrimination and improvement of the energy function for small proteins; and
formation of beta sheets, generation of complex topologies and non-local
contacts, and development of a protocol to identify decoys which have
successfully incorporated these features for larger proteins. Attempts were
made to improve secondary structure packing in all decoys. We have also
attempted to compensate for incorrect secondary structure predictions in any
given region of the polypeptide chain, and to increase the conformational space
searched in regions where secondary structure could not be assigned with
confidence. A new method, JUFO (manuscript in preparation), has been
included in efforts to improve the accuracy of secondary structure prediction
and aid generation of a more robust fragment library for a given sequence.


1. Bonneau R. et al. (2001) Rosetta in CASP4: progress in ab initio protein
   structure prediction. Proteins 45(S5), 119-26.
2. Simons K.T. et al. (1997) Assembly of protein tertiary structures from
   fragments with similar local sequences using simulated annealing and
   Bayesian scoring functions. J Mol. Biol. 268(1), 209-25.
3. Shortle D., Simons K.T. and Baker D. (1998) Clustering of low energy
   conformations near the native structures of small proteins. Proc. Natl.
   Acad. Sci. USA 95(19), 11158-62.
4. Kuhlman B. and Baker D. (2000) Native protein sequences are close to
   optimal for their structures. Proc. Natl. Acad. Sci. USA 97(19), 10383-8.



BAKER-ROBETTA (P0029) - 199 predictions: 199 3D


Automated Method for Full Chain Structure Prediction Using Rosetta


D. Chivian, D.E. Kim, C.A. Rohl, L. Malmstrom, J. Meiler, T. Robertson,
and D. Baker

University of Washington

dabaker@u.washington.edu


We have automated our basic comparative modeling and de novo protocols in
an effort to determine the ability of the Rosetta [1] method to produce full
chain models without human intervention. The server, called “Robetta”, provides
de novo, comparative, or mixed models in which the appropriate method is
selected for each putative domain. Additionally, the server provides a
secondary structure prediction from the JUFO-3D [2] method.


Regrettably, the server had several shortcomings during the CAFASP-3
experiment. Much of the code was implemented just prior to the experiment
and not properly tested. It was not entirely free of logical errors, and some of
the models are probably quite poor for this reason. Additionally, the automated
methods that were implemented for CAFASP-3 employed reduced protocols
either in an effort to meet the time demands required of a server method, or
because they could not be completed in time for the experiment. In the interest
of brevity, this abstract will only discuss differences from the full de novo and
comparative modeling protocols (for full protocols, please see the “baker
group” methods abstracts in this volume and [3]).



(A) Homolog Detection and Domain Parsing

We developed a method, called “Ginzu”, to determine domains in the full chain
of the query and assign them for de novo or comparative modeling. It consisted
of sequential processing of the sequence with Blast, PSI-Blast, and Pcons2 [4]
in order to identify regions of the query with parent PDB coverage. A Blast
e-value of 0.001 or lower, or a Pcons2 confidence value of at least 1.5, was
considered sufficient to justify comparative modeling. A single parent for each
region of coverage was then selected based on confidence and length of
coverage. Next, putative domain boundaries were determined for both
comparative modeling and de novo regions of the chain. PSI-Blast detected
homologous sequences were clustered by region of coverage and assigned to
the query as non-overlapping domain regions in order of cluster size. Cut
points between domains were assigned at positions of reduced occupancy in the
PSI-Blast MSA that PSIPRED [5] strongly predicted to be loop. Domain lengths
for de novo regions were forced to be shorter (not more than ~200 residues)
than they probably often were in the native structure in recognition of the
current limitation of Rosetta’s de novo protocol to produce good quality models
for large domains when generating a small decoy ensemble.


(B) Modeling

Domains were modeled either by the de novo or comparative modeling Rosetta
protocol. Reductions to the full protocols included generating a smaller decoy
ensemble for de novo domains, producing only one default weighted K*SYNC
alignment to the most confident parent for comparative modeling domains, and
not rebuilding short and medium loops for unaligned regions in comparative
models with our more rigorous protocol. Lastly, the final full chain model was
produced trivially by spacing the coordinates of each domain model by 100
Angstroms.
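
The trivial assembly step can be pictured with the short sketch below. The
text only states that domain coordinates were spaced by 100 Angstroms; placing
the domains along the x axis and centering each one on its own centroid first
are assumptions made here for illustration.

```python
def centroid(coords):
    """Mean x, y, z of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))


def space_domains(domains, spacing=100.0):
    """Center each domain on its centroid, then shift domain k by k * spacing
    along the x axis (axis and centering are assumptions)."""
    chain = []
    for k, coords in enumerate(domains):
        cx, cy, cz = centroid(coords)
        chain.extend((x - cx + k * spacing, y - cy, z - cz) for x, y, z in coords)
    return chain


# Toy usage: two two-atom "domains" with made-up coordinates in Angstroms.
domain_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
domain_b = [(10.0, 5.0, 5.0), (12.0, 5.0, 5.0)]
full_chain = space_domains([domain_a, domain_b])
```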


(C) Secondary Structure Prediction

JUFO-3D is a version of the JUFO neural-net secondary structure predictor that
uses Rosetta de novo decoys or comparative models in addition to PSI-Blast
multiple sequence information and an amino acid property profile to produce
three-state predictions.


1. Simons K.T. et al. (1997) Assembly of protein tertiary structures from
   fragments with similar local sequences using simulated annealing and
   Bayesian scoring functions. J Mol. Biol. 268(1), 209-25.
2. Meiler J. and Baker D. (manuscript in preparation)
3. Bonneau R. et al. (2001) Rosetta in CASP4: progress in ab initio protein
   structure prediction. Proteins 45(S5), 119-26.
4. Lundstrom J. et al. (2001) Pcons: a neural-network-based consensus
   predictor that improves fold recognition. Protein Sci. 10(11), 2354-62.
5. Jones D.T. (1999) Protein secondary structure prediction based on
   position-specific scoring matrices. J. Mol. Biol. 292(2), 195-202.



Baldi (P0021) - 61 predictions: 61 3D
Baldi-CONpro (P0022) - 62 predictions: 62 RR
Baldi-SSpro (P0023) - 63 predictions: 63 SS
CMap23Dpro (P0253) - 1 prediction: 1 3D
CMapPro (P0255) - 0 predictions
SSpro2 (P0254) - 65 predictions: 65 SS


Automated ab initio Prediction of Protein Structure Through
Contact Maps by Recurrent Neural Networks


Gianluca Pollastri and Pierre Baldi

University of California, Irvine

gpollast@ics.uci.edu


The strategy we implemented to predict protein structure splits the problem
into three stages, as described in [1]. The first stage corresponds to modules
that predict structural features including secondary structure and relative
solvent accessibility. The second stage corresponds to modules that predict the
contact map of the protein at the amino acid level, using the primary sequence
and the structural features. The final stage is the reconstruction of 3D
coordinates of Cα atoms from the predicted contact map and secondary
structure. All the steps are entirely automated and performed without any
human intervention.


The methods we use for secondary structure and relative solvent accessibility
prediction have been described in [2,3]. These methods try to overcome the
limitations of simple feed-forward networks and consist of BRNNs
(Bidirectional Recurrent Neural Networks) with the capability of capturing at
least partial long-ranged information without overfitting. The recurrent neural
networks are given as input PSI-BLAST profiles derived as described in [2].
Both in the case of secondary structure and relative solvent accessibility an
ensemble of networks is used for the final prediction. Secondary structure
predictions were submitted to CASP in two versions (SSpro2 and Baldi-SSpro),
trained on different data sets. Versions of the methods are also freely available
as web servers (SSpro, ACCpro) at the address:

http://promoter.ics.uci.edu/BRNN-PRED/


In the second step, we go from the primary sequence and the structural features
to the map of contacts between amino acids. Training a large neural network to
directly predict 3D coordinates from primary sequence information is in fact
likely to fail because the problem is highly degenerate. Translations and
rotations leave the structure invariant but greatly change the 3D coordinates.
In contrast, contact maps provide a topological representation of the structure
that is invariant under rotation and translation. Furthermore, contact maps
typically contain enough information to reconstruct the full structure even in
presence of noise [4]. A previous attempt to predict protein contact maps is
described in [5]. Our current approach to the problem rests on a generalization
of the graphical model underlying BRNNs to process one-dimensional objects.
The generalization of this architecture to two-dimensional objects, such as
contact maps, is described in [6]. In its basic version the model consists of
nodes regularly arranged in 6 planes: one input plane, one output plane, and 4
hidden planes. This graphical model is implemented with five feed-forward
neural networks, four representing transitions in the hidden planes given the
input, one representing the input-output transformation. The main advantages
of this model are that it chooses automatically an optimal context to base its
decision on, and that it can capture at least partial long-ranged information
without overfitting. We submitted contact map predictions to CASP at 8 and 12
Angstrom (respectively as CMapPro and Baldi-CONpro). The 12 Angstrom
predictor is trained on a larger data set and proves to be more reliable in our
tests, especially on proteins of length greater than 100 amino acids.
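
The invariance argument can be checked numerically with a small sketch. The
code below builds binary contact maps at 8 and 12 Angstrom from a toy set of
Cα coordinates and confirms that a rotation plus translation changes the
coordinates but not the map. The coordinates, the rotation about the z axis,
and the strict "less than cutoff" contact definition are assumptions made for
the example.

```python
import math


def contact_map(coords, cutoff):
    """Binary map: entry [i][j] is 1 if atoms i and j are closer than cutoff."""
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) < cutoff else 0
             for j in range(n)] for i in range(n)]


def rotate_z_and_translate(coords, angle, shift):
    """Rotate about the z axis by angle (radians), then translate by shift."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]


# Toy chain: five atoms spaced 3.8 Angstroms apart along the x axis.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (11.4, 0.0, 0.0), (15.2, 0.0, 0.0)]
moved = rotate_z_and_translate(coords, angle=0.7, shift=(5.0, -3.0, 2.0))
map_8 = contact_map(coords, 8.0)
map_12 = contact_map(coords, 12.0)
```

The same map therefore describes every rigid-body copy of a structure, which
is why it is a better-posed prediction target than raw coordinates.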


In contrast with the first two stages, which rely heavily on machine learning
methods, the last reconstruction step is addressed using distance geometry and
optimization techniques without learning. Our approach partly follows [4] but
with a number of significant modifications due to the fact that, in our case,
predicted maps differ from exact maps, as well as from random perturbations
of exact maps by uniform additive noise. In particular, in order to deal with the
specific properties of predicted contact maps we use: (1) semi-random moves
of variable length (combining a random vector and an attraction vector
directed towards putative contacts); (2) a bond length term in the energy
function to deal with unphysical bond lengths introduced by the moves; and (3)
a two-phase search with a first rough phase comprising large steps where only
the predicted contact map contributes to the energy, and a second refinement
phase comprising smaller steps that take into account the effects of chirality,
bond length and amino acid hard-core repulsion forces. The submitted 3D
structures are predicted using two different versions of the reconstruction
algorithm: the first version (CMap23Dpro) uses a direct in-house
implementation of [4], the other (Baldi) uses the modified version described
above.


1. Baldi P. and Pollastri G. (2002) Machine Learning Structural and
   Functional Proteomics, IEEE Intelligent Systems (Intelligent Systems in
   Biology II), March/April.
2. Pollastri G., Przybylski D., Rost B., Baldi P. (2002) Improving the
   Prediction of Protein Secondary Structure in Three and Eight Classes
   Using Recurrent Neural Networks and Profiles, Proteins, 47, 228-235.
3. Pollastri G., Baldi P., Fariselli P., Casadio R. (2002) Prediction of
   Coordination Number and Relative Solvent Accessibility in Proteins,
   Proteins, 47, 142-153.
4. Vendruscolo M., Domany E. (2000) Protein folding using contact maps.
   Vitam Horm. 58, 171-212.
5. Fariselli P, Olmea O, Valencia A, Casadio R. (2001) Prediction of contact
   maps with neural networks and correlated mutations, Protein Eng. Nov;
   14(11), 835-43.
6. Pollastri G, Baldi P. (2002) Prediction of contact maps by GIOHMMs and
   recurrent neural networks using lateral propagation from all four cardinal
   corners, Bioinformatics. Jul; 18 Suppl 1, S62-S70.




Bass-Michael (P0384) - 51 predictions: 51 3D


A Threading Approach to Structure Prediction


M. Bass and R. Luethy

Computational Biology, Amgen Inc.

mbass@amgen.com


The threading approach used here was employed to test the accuracy of a
threading method when applied to a variety of test sequences. In target
sequences that are similar to a known structure, if threading produced a
different alignment, the thread alignment was used to test if the method can
improve the residue shift error in alignments.


The threading method uses residue-based statistical potentials. The potentials
were calculated as log-odds of the interaction. Three potentials were used.
Each potential was given equal weight. The surface area potential was
evaluated by dividing the surface exposure into ten equal bins. The pairwise
interaction potential was calculated by measuring the closest atom-atom contact
between pairs of amino acids such that the pair of amino acids was at least five
amino acids apart. Only interactions between 2.5 Å and 12.5 Å were counted.
The backbone dihedral potential was calculated for each amino acid. The
dihedral angles were divided into 20 equal bins and separated according to
amino acid. The standard statistical potentials were calculated against a subset
of the Protein Data Bank (July 2002 release) such that no two proteins share
more than 35% sequence identity. This set was reduced by removing any
structures that fail a self-thread test. That is, a sequence must be able to find
its structure with the threading algorithm. This produced a unique subset of the
Protein Data Bank containing 2399 structures. A similar subset of the Protein
Data Bank was used for the query database. This subset contained proteins that
share no more than 45% sequence identity (2875 structures). The algorithm
produces gapped alignments without end penalties using an adaptation of the
Needleman-Wunsch algorithm [1]. The gap creation penalty is 3.5 and the gap
extension penalty is 0.7. The Z-score was calculated for all of the alignments
and alignments producing a Z-score in excess of 5.0 were considered.
WU-Blast [2] was also run for each target against the structural database to
provide a comparison alignment.
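
The binning, log-odds, and Z-score steps can be sketched as follows. This is
an illustrative reading of the paragraph rather than the authors' code: the
pseudocount, the choice of reference counts, the use of the population
standard deviation, and the convention that higher scores are better are all
assumptions.

```python
import math
from statistics import mean, pstdev


def bin_index(value, low, high, n_bins):
    """Index of the equal-width bin holding value; value == high goes in the last bin."""
    k = int((value - low) / (high - low) * n_bins)
    return min(max(k, 0), n_bins - 1)


def log_odds(observed, reference, pseudocount=1.0):
    """Per-bin log(P_observed / P_reference) from two lists of counts."""
    n = len(observed)
    obs_total = sum(observed) + pseudocount * n
    ref_total = sum(reference) + pseudocount * n
    return [math.log(((o + pseudocount) / obs_total) / ((r + pseudocount) / ref_total))
            for o, r in zip(observed, reference)]


def z_score(score, background_scores):
    """Distance of score from the background mean, in standard deviations."""
    return (score - mean(background_scores)) / pstdev(background_scores)


# Toy usage: ten exposure bins on [0, 1] and twenty dihedral bins on [-180, 180].
exposure_bin = bin_index(0.25, 0.0, 1.0, 10)
dihedral_bin = bin_index(0.0, -180.0, 180.0, 20)
potential = log_odds([30, 10], [20, 20], pseudocount=0.0)
z = z_score(10.0, [1.0, 2.0, 3.0, 4.0, 5.0])
```

Under these conventions, an alignment would be kept when `z` exceeds the 5.0
threshold quoted in the text.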

The sequence alignment was converted into a three-dimensional structure by
the following method. The alignments were converted to a C-alpha trace based
on the coordinates of the template structure. Residues around any gaps in the
alignment were allowed to vary according to the method of Luethy [3]. After
the structure optimization, all-atom coordinates were constructed in the
following way: first all coordinates from the PDB fragments were copied, then
missing backbone atoms were inserted by looking up the closest 5 residue
backbone fragment in PDB, and finally missing side-chain atoms were copied
from the closest 5 residue fragment from PDB with the same residue in the
middle. The structure was then minimized using TINKER [4] using a steepest
descent method with fixed C-alpha atoms.


1. Needleman S.B. and Wunsch C.D. (1970) A General Method Applicable
   to the Search for Similarities in the Amino Acid Sequence of Two
   Proteins. J. Mol. Biol., 48, 443-453.
2. Gish W. (1996-2002) http://blast.wustl.edu
3. Luethy R. (2002) Unified Prediction Approach for Comparative Modeling
   and ab initio Predictions. CASP5 Abstract.
4. Ponder J.W. and Richards F.M. (1987) An Efficient Newton-like Method
   for Molecular Mechanics Energy Minimization of Large Molecules. J.
   Comput. Chem., 8, 1016-1024 (http://dasher.wustl.edu/tinker/)



Bates-Paul (P0096) - 72 predictions: 72 3D


Comparative Modelling By In Silico Recombination of
Templates, Alignments and Models


Bruno Contreras-Moreira, Paul W. Fitzjohn, Marc Offman,
Graham R. Smith and Paul A. Bates

Biomolecular Modelling Laboratory
Cancer Research UK - London Research Institute

paul.bates@cancer.org.uk


After the CASP4 assessment it was concluded that template selection and
sequence alignment remain the main problems awaiting solution in the field of
comparative modelling [1]. Models were rarely found to be closer to the
experimental structures than the optimal template and often manual
intervention only marginally improved their quality. Similar problems were
found in the fold recognition category [2,4], suggesting that the same approach
may be applied in the search for possible solutions in both fields. During
CASP5 our group has tested a novel procedure to tackle these problems. This
new method was used to generate models for all 67 targets, with roughly half of
them classified as fold recognition targets by the CAFASP3 meta-server
(www.cs.bgu.ac.il/~dfischer/CAFASP3).

This procedure is named in silico protein recombination, as it is a
computational implementation of genetic recombination, a well known
mechanism for generating population variability, but at the protein level. For
each CASP5 target a population of models was generated from a variety of
templates and sequence alignments. Care was taken to assure that models had
similar length and were complete, adding missing loops when necessary and
smoothing their phi/psi geometry to permit later energy calculations and
minimizations. The algorithm can be outlined as:



    initial population of models
    (1) grow population: r recombination + (1-r) mutation
    (2) select best proportion according to fitness
    converged? stop : otherwise back to (1)


This is a standard genetic algorithm with two genetic operators (recombination
and mutation) and a fitness function acting as an artificial selection agent. We
will now briefly describe each step in the protocol.
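
The outline can be turned into a short runnable sketch. It is a generic
genetic-algorithm loop under stated assumptions, not the group's software: the
models are plain lists of numbers, the toy fitness is a sum of squares
standing in for an energy, and the parameter names (`r`, `growth`, `discard`,
`tol`, `max_generations`) are hypothetical. The growth limit as a multiple of
the initial size and the discarding of the worst 25% follow the description
given later in this abstract.

```python
import random


def evolve(population, fitness, recombine, mutate, r=0.5, growth=2,
           discard=0.25, tol=1e-6, max_generations=50, seed=0):
    """Grow the population with two operators, discard the worst fraction,
    and stop once the surviving energies are nearly identical."""
    rng = random.Random(seed)
    pop = list(population)
    limit = growth * len(pop)  # upper limit on population size
    for _ in range(max_generations):
        while len(pop) < limit:  # (1) grow population
            a, b = rng.sample(pop, 2)
            child = recombine(a, b, rng) if rng.random() < r else mutate(a, b, rng)
            pop.append(child)
        pop.sort(key=fitness)  # (2) rank: lower energy is fitter
        pop = pop[:max(2, int(len(pop) * (1.0 - discard)))]
        energies = [fitness(m) for m in pop]
        if energies[-1] - energies[0] < tol:  # converged?
            break
    return pop


def toy_recombine(a, b, rng):
    """Single-point crossover: head of a, tail of b."""
    k = rng.randrange(1, len(a))
    return a[:k] + b[k:]


def toy_mutate(a, b, rng):
    """Average the two parents element by element."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


toy_fitness = lambda m: sum(x * x for x in m)
_rng = random.Random(1)
start = [[_rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(8)]
final = evolve(start, toy_fitness, toy_recombine, toy_mutate)
```

Because the best-ranked member is never discarded, the lowest energy in the
final population cannot be worse than the lowest energy at the start.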


Initial population of models. Initially, our server Domain Fishing [3]
(www.bmm.icnet.uk/servers/3djigsaw/dom_fish) was used to define protein
domains within each target sequence and to find suitable modelling templates.
Resulting alignments were inspected and corrected if suspected to be incorrect.
If reasonable alternative alignments could be found they too were added to the
pool. When possible, only alignments with bit-scores (average
pssm-logodds+secondary structure agreement/residue) around 2 were selected.
In several cases annotations from the templates or their corresponding PFAM
families were used to check the correctness of the alignment in active/binding
sites. Usually several models were built using the same template changing
parts in the alignment. Models from these alignments were built using our
server 3D-JIGSAW [4] (www.bmm.icnet.uk/servers/3djigsaw). Additional
models were obtained from the CAFASP3 server after inspection of the
alignments to gain extra variability in sequence alignments, templates used and
exposed loops. These models were taken from different sources, including

    FAMS (physchem.pharm.kitasatou.ac.jp/FAMS),
    Pmodeller (www.sbc.su.se/~arne/pcons) and
    EsyPred3D (www.fundp.ac.be/urbm/bioinfo/esypred).

Models were inspected and missing parts, typically loops, added using in-house
software before going to the next step. In essence, this software explores
phi/psi space to allow a peptide (the missing loop) to connect a gap in a
protein fold.

1. Growing the population by recombination and mutation. The initial
population was grown by randomly selecting pairs of protein models and
applying one of the two possible operators. In the case of recombination, the
models were superimposed based on their sequence alignment and a crossover
point drawn. Crossover was not permitted inside secondary structure elements.
The resulting recombinant model inherits the N-terminus from one parent and
the C-terminus from the other. In mutation events (occurring with frequency
1-r, where r is the recombination probability) a new protein model was obtained
by simply averaging its parents' coordinates after superimposition. In many
cases this process obtained distorted side-chain conformations.

2. Selecting the best proportion. Fitness function. The whole idea of the
algorithm is that it should be possible to obtain optimized mosaic models by
shuffling them in a rational way. The key point in this approach is thus the
choice of an appropriate fitness function. After some benchmarking
experiments (unpublished results) we chose a function that calculates a free
energy estimate based on two terms: protein contact pair-potentials and
side-chain solvation energies estimated from their solvent accessible area. This
function seems to yield a consistent measure of protein structural quality.
When each population reaches the upper limit (between 2 and 4 times its initial
size), this energy function is used to rank its members. Only the worst 25% of
the population is discarded at this point, to assure that quality models are not
lost prematurely.


3. Convergence criterion and final refinements.

When the population has
converged to s
imilar energies, there is no room for further generation of
variability and the evolution process stops. At this point the final population is
inspected. In most cases this consists of several representations of the same
protein conformation with average b
ackbone deviations in the order of 0.1Å.

One of these representatives is then taken as the final model, which is carefully
inspected to detect unfavorable peptide conformations, and a final energy
minimization using the CHARMM22 force field is performed. This procedure
is able to fix distorted side-chains. At this point we have a CASP5 unrefined
model.

In addition, for targets T0134, T0165, T0177 and T0185 we tested a further
refinement step consisting of running an all-atom molecular dynamics
simulation inside a water box, with neutral total charge, for around 0.5 ns. For
these simulations we used the GROMACS package (www.gromacs.org) and
the OPLSAA force field. Snapshots taken from the trajectory were clustered
according to average backbone deviations and one conformation from the most
populated cluster was selected. After a few rounds of CHARMM22 energy
minimization, it was submitted as a refined model.
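
The abstract does not say which clustering algorithm was used. As an illustration only, the sketch below uses a simple neighbour-count scheme: the snapshot with the most neighbours within a cutoff defines the most populated cluster and serves as its representative. The `distance` callable (for instance a backbone RMSD) and the cutoff are assumptions.

```python
def most_populated_cluster(snapshots, cutoff, distance):
    """Return (index of representative, indices of cluster members).

    A snapshot's cluster is every snapshot within `cutoff` of it; the
    snapshot with the largest such neighbourhood is the representative.
    """
    best_center, best_members = None, []
    for i, s in enumerate(snapshots):
        members = [j for j, t in enumerate(snapshots)
                   if distance(s, t) <= cutoff]
        if len(members) > len(best_members):
            best_center, best_members = i, members
    return best_center, best_members
```

With a toy one-dimensional "trajectory" `[0, 1, 2, 50, 51]`, absolute difference as distance and a cutoff of 1.5, the representative is the snapshot at index 1 with members 0, 1 and 2.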

Insufficient computer resources prevented us from refining all targets.


1. Tramontano A., Leplae R. and Morea V. (2001) Analysis and Assessment
   of Comparative Modeling Predictions in CASP4. Proteins suppl 5, 22-38.

2. Sippl M.J., Lackner P., Domingues F.S., Prlic A., Malik R., Andreeva A.
   and Wiederstein M. (2001) Assessment of the CASP4 Fold Recognition
   Category. Proteins suppl 5, 55-67.

3. Contreras-Moreira B. and Bates P.A. (2002) Domain Fishing: a first step in
   protein comparative modelling. Bioinformatics 18, 1141-1142.

4. Bates P.A., Kelley L.A., MacCallum R.M. and Sternberg M.J.E. (2001)
   Enhancement of Protein Modelling by Human Intervention in Applying the
   Automatic Programs 3D-JIGSAW and 3D-PSSM. Proteins suppl 5, 39-46.
   (www.bmm.icnet.uk/servers/3djigsaw)



Benner-steve (P0524) - 35 predictions: 18 3D, 17 SS


Evolution-based Structure Prediction


D.W. De Kee, T.J. McCormack, and S.A. Benner

Foundation for Applied Molecular Evolution

P.O. Box 13174, Gainesville FL 32604

benner@chem.ufl.edu


Predictions for fourteen CASP5 ab initio targets were submitted in a
collaborative effort to explore the potential for predicting secondary structure
with the transparent secondary structure prediction method [1]. The targets were
selected based on the availability of homologous protein sequences in adequate
numbers and evolutionary distributions in the MasterCatalog, a commercial
naturally organized database developed in collaboration with EraGen
Biosciences (Madison, WI).


Multiple alignments were generated using the automated DARWIN-server
[2]. Secondary structures were predicted based on automated heuristics to assign
surface, interior, active site and parsing residues by analysis of patterns of
conservation and variation among homologous protein sequences in light of
evolutionary models that interpret amino acid substitutions as the consequence
of neutral variation subjected to functional constraints [3].


For the targets with a homolog whose structure has been solved, multiple
alignment trials were performed. The alignments were executed with different
gap-opening and gap-extension penalties. The alignments were then evaluated
by visualizing them in relation to the solved structure, with the assumption that
the greatest sequence variation exists outside the boundaries of conserved
secondary structure motifs, i.e., α-helices and β-strands. Also, additional
homologous sequences were added to the alignments in order to obtain a family
profile, which allowed us to optimize the alignments, since key residues are
more likely to be universally conserved.
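
The evaluation described above was done visually. A rough numerical analogue, offered only as an assumed illustration, is to compare per-column variability outside and inside the secondary structure elements of the solved homolog and to prefer the alignment trial with the larger contrast. The secondary structure string (`H`, `E`, anything else is loop) is assumed to be given per alignment column.

```python
from statistics import mean

def column_variability(alignment):
    """Number of distinct symbols (gaps included) in each alignment column."""
    return [len(set(column)) for column in zip(*alignment)]

def variation_contrast(alignment, ss):
    """Mean variability outside helices/strands minus mean variability inside.

    Requires at least one column of each kind, otherwise mean() raises.
    """
    variability = column_variability(alignment)
    inside = [v for v, s in zip(variability, ss) if s in "HE"]
    outside = [v for v, s in zip(variability, ss) if s not in "HE"]
    return mean(outside) - mean(inside)

def best_trial(trials, ss):
    """Name of the alignment trial with the largest contrast."""
    return max(trials, key=lambda name: variation_contrast(trials[name], ss))
```

Here `trials` would map a label such as a gap-penalty setting to the aligned sequences it produced.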


1. Benner S.A. and Gerloff S.D. (1990) Patterns of divergence in homologous
   proteins as indicators of secondary and tertiary structure: a prediction of
   the structure of the catalytic domain of protein kinases. Adv. Enzyme
   Regul. 31, 121-181.

2. Gonnet G.H. et al. (1992) Exhaustive matching of the entire protein
   sequence database. Science 256, 1443-1445.

3. Benner S.A. et al. (1994) Bona-fide prediction of aspects of protein
   conformation. J. Mol. Biol. 235, 926-958.



Benner-steve (P0524) - 35 predictions: 18 3D, 17 SS


Evolution-based Structure Prediction Tools


Steven A. Benner, Danny De Kee, Thomas McCormack

University of Florida, Foundation for Applied Molecular Evolution

email: benner@chem.ufl.edu


In 1992, the first convincing tools were introduced for predicting protein
conformation from sequence data. These started with a set of aligned
homologous sequences for proteins diverging under functional constraints (1).
These were applied against the two ab initio targets presented in the CASP 1
prediction contest, phospho-beta-galactosidase and synaptotagmin, and
generated correct tertiary structure models for both. The judges noted that these
represented the first two successful ab initio predictions in the CASP program
(2). In CASP 2, these tools generated another prediction, this time for the heat
shock protein 90 (3). Here, the prediction was sufficiently accurate that it
correctly assigned HSP90 as a distant homolog of gyrase, generated a
functional hypothesis for HSP90, and identified as incorrect certain
interpretations of experimental data concerning the function of HSP90.


Outside of the CASP project, the tools have been used to analyze the structures
of protein kinase, the pleckstrin homology domain, and ribonucleotide
reductase, among others, where their outcome has gone beyond modelling
the fold and has in each case answered questions of interest to biologists
and biomedical researchers working with these systems (4). A version of the
method has been applied to every protein sequence family in GenBank, and
these predictions are incorporated into the MasterCatalog, an interpretive
proteomics database marketed by EraGen Biosciences (Madison WI) (5). The
MasterCatalog helps identify diagnostic and therapeutic targets, assess the
value of animal models for human disease, and correlate genomic data with
function, starting with pathway interactions and extending to the cell, organism,
ecosystem, and planetary biosphere (6).


At the time that they were introduced, it was clear that evolution-based
structure prediction methods suffer from specific weaknesses inherent in their
formulation. These weaknesses are expected regardless of the details
of their implementation. Thus, the PHD tool, which implements the
same basic idea but in the form of a neural network, is expected to suffer from
the same weaknesses, and this has been suggested anecdotally. The purpose of
our participation in CASP5 is to generate a reference database of record of
predictions done using the 1992 method, which is described in detail both in
(1) and in the patent literature (7).


1. Benner S.A., Gerloff D.L. (1991) Patterns of divergence in homologous
   proteins as indicators of secondary and tertiary structure. The catalytic
   domain of protein kinases. Adv. Enzyme Regulat. 31, 121-181.

2. DeFay T., Cohen F.E. (1995) Evaluation of current techniques for ab initio
   protein structure prediction. Proteins 23, 431-445.

3. Gerloff D.L., Cohen F.E., Korostensky C., Turcotte M., Gonnet G.H.,
   Benner S.A. (1997) A predicted consensus structure for the N-terminal
   fragment of the heat shock protein HSP90 family. Proteins: Struct. Funct.
   Genet. 27, 450-458.

4. Benner S.A., Cannarozzi G., Chelvanayagam G., Turcotte M. (1997)
   Bona fide predictions of protein secondary structure using transparent
   analyses of multiple sequence alignments. Chem. Rev. 97, 2725-2843.

5. Benner S.A., Chamberlin S.G., Liberles D.A., Govindarajan S., Knecht L.
   (2000) Functional inferences from reconstructed evolutionary biology
   involving rectified databases. An evolutionarily-grounded approach to
   functional genomics. Research Microbiol. 151, 97-106.




Bilab (P0080) - 200 predictions: 200 3D


Tertiary Structure Prediction of Proteins Using Probability
Maps of Mainchain Torsion Angles for New Fold Targets and
Comparative Modeling Method for Other Targets


S. Nakamura¹, T. Nishimura², T. Ishida¹, T. Miki¹, J. Sasaki¹, K. Hibi¹,
T. Ishizuka¹,³ and K. Shimizu¹

1 - Department of Biotechnology, the University of Tokyo,
2 - Graduate School of Humanities and Sociology, the University of Tokyo,
3 - Faculty of Industrial Science and Technology, Tokyo University of Science

bilab@bi.a.u-tokyo.ac.jp


We have submitted tertiary structure prediction models for most of the CASP5
target proteins, except for T0136 and T0145. First we searched for structural
templates for the target sequence by using PSI-BLAST and the 3D-PSSM server
against the Protein Data Bank. When we could not find any templates for the
target, we used an ab initio protein structure modeling tool named "ABLE",
developed in our laboratory, to produce prediction models. Otherwise we used
MODELLER to build prediction models based on the alignments of the
templates and the target.


Modeling with ABLE was based on energy minimization of a statistical potential
by simulated annealing. First, we built probability maps for mainchain
torsion angles (phi-psi) at each position of the target sequence. Sequence
similarity scores between the nine-residue sub-sequence of the target at each
position and all fragments of the same length from a tertiary structure
database were calculated. The NCBI non-redundant PDB (nrpdb) was used as this
database. Proteins with irregular residues, chain breaks or missing sidechains,
and membrane proteins, were eliminated from the list with a cutoff p-value of
1.0e-7. 1164 chains were used in total. The sequence similarity score function
was similar to that of Fischer et al. [1], including sequence identity and the
matching of secondary structure. The BLOSUM62 matrix was used for the
calculation of the sequence identity. Secondary structure prediction of the
target was performed using the PSIPRED server. Weight factors emphasizing
matches at the center of the nine-residue window were used. Probability maps
of mainchain phi-psi torsion angles were obtained from the phi-psi values of the
amino acids at the center of all nine-residue fragments with similarity scores
larger than a threshold. In this procedure, the contribution of fragments with
higher similarity scores was enhanced. Gaussian smoothing was applied to these
maps.
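
A probability map of this kind for a single sequence position can be sketched as below. Several details are assumptions made for illustration and are not stated in the abstract: the 10-degree bin size, weighting each fragment by its score excess over the threshold, the Gaussian width, and the periodic wrap-around of the smoothing.

```python
import math

BIN = 10            # degrees per bin (assumed)
N = 360 // BIN      # bins per axis

def build_map(hits, threshold, sigma_bins=1.0):
    """hits: (score, phi, psi) of the central residue of each matching fragment.

    Returns an N x N grid of probabilities over (phi bin, psi bin)."""
    grid = [[0.0] * N for _ in range(N)]
    for score, phi, psi in hits:
        if score <= threshold:
            continue
        i = int((phi + 180) // BIN) % N
        j = int((psi + 180) // BIN) % N
        grid[i][j] += score - threshold   # higher scores contribute more
    # Gaussian smoothing; angles are periodic, so indices wrap around.
    radius = int(3 * sigma_bins)
    kernel = {d: math.exp(-d * d / (2 * sigma_bins ** 2))
              for d in range(-radius, radius + 1)}
    smooth = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            w = grid[i][j]
            if w == 0.0:
                continue
            for di, ki in kernel.items():
                for dj, kj in kernel.items():
                    smooth[(i + di) % N][(j + dj) % N] += w * ki * kj
    total = sum(map(sum, smooth))
    if total == 0.0:                      # no fragment passed the threshold
        return [[1.0 / (N * N)] * N for _ in range(N)]
    return [[v / total for v in row] for row in smooth]
```

Normalizing the grid to sum to one lets it be used directly as sampling weights for torsion-angle moves.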


After building probability maps for each amino acid, a number of tertiary
structure models of the target were produced by minimizing the potential energy
with simulated annealing using these maps. The potential energy function we used
was a modification of that of Simons et al. [2]. The degree of buriedness of each
amino acid, contacts between residues, and the average distance between
hydrophobic residues were used to evaluate the match between the sequence and
the structure; hydrogen bonds between mainchains, packing of secondary
structure segments, excluded volume to avoid overlap of residues, and radius
of gyration were used to evaluate the plausibility of the model as a protein
tertiary structure. When we could not obtain compact structures for a target,
restraints on the distances between several residue pairs were added to the
potential energy function. The weight factor for each energy term was adjusted
for each target to obtain compact model structures and was changed as the
simulated annealing progressed. At each simulated annealing step, we changed
the mainchain phi-psi torsion angles at a random position to random values
drawn according to the probability maps. About 200 to 5000 structures were
produced for each target by simulated annealing (about 30000 to 200000 steps
per run), followed by clustering of these structures. Up to five structures
nearest to the centers of large clusters were selected, and sidechain modeling
was performed for these structures using SCWRL. The order of
submission was determined by manual inspection of these structures.
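
The move set and acceptance rule of such a run can be sketched as follows. This is not the ABLE code: the Metropolis acceptance criterion, the geometric cooling schedule and the representation of a conformation as one (phi bin, psi bin) pair per residue are assumptions, and `energy` stands in for the multi-term potential described above.

```python
import math
import random

def sample_bin(prob_map, rng):
    """Draw one (phi bin, psi bin) cell with probability given by the map."""
    cells = [(i, j) for i in range(len(prob_map))
             for j in range(len(prob_map[i]))]
    weights = [prob_map[i][j] for i, j in cells]
    return rng.choices(cells, weights=weights, k=1)[0]

def anneal(maps, energy, steps, t_start, t_end, rng):
    """maps: one probability map per residue. Returns (best conformation, energy)."""
    conf = [sample_bin(m, rng) for m in maps]
    e = energy(conf)
    best, best_e = list(conf), e
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (assumed schedule).
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        pos = rng.randrange(len(conf))          # random position
        old = conf[pos]
        conf[pos] = sample_bin(maps[pos], rng)  # new torsion bin from the map
        new_e = energy(conf)
        if new_e <= e or rng.random() < math.exp((e - new_e) / t):
            e = new_e
            if e < best_e:
                best, best_e = list(conf), e
        else:
            conf[pos] = old                     # reject the move
    return best, best_e
```

Repeating `anneal` with different random seeds would give the pool of structures that is then clustered.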