Using Sequence Motifs for Enhanced Neural Network Prediction of Protein Distance Constraints

Jan Gorodkin¹,², Ole Lund¹,*, Claus A. Andersen¹, and Søren Brunak¹

¹ Center for Biological Sequence Analysis, Department of Biotechnology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
gorodkin@cbs.dtu.dk, olund@strubix.dk, ca2@cbs.dtu.dk, brunak@cbs.dtu.dk
Phone: +45 45 25 24 77, Fax: +45 45 93 15 85

² Department of Genetics and Ecology, The Institute of Biological Sciences, University of Aarhus, Building 540, Ny Munkegade, DK-8000 Aarhus C, Denmark

* Present address: Structural Bioinformatics Advanced Technologies A/S, Agern Alle 3, DK-2970 Hørsholm, Denmark

From: ISMB-99 Proceedings. Copyright © 1999, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
Abstract
Correlations between sequence separation (in residues) and distance (in Angstrom) of any pair of amino acids in polypeptide chains are investigated. For each sequence separation we define a distance threshold. For pairs of amino acids where the distance between Cα atoms is smaller than the threshold, a characteristic sequence (logo) motif is found. The motifs change as the sequence separation increases: for small separations they consist of one peak located in between the two residues, then additional peaks at these residues appear, and finally the center peak smears out for very large separations. We also find correlations between the residues in the center of the motif. This and other statistical analyses are used to design neural networks with enhanced performance compared to earlier work. Importantly, the statistical analysis explains why neural networks perform better than simple statistical data-driven approaches such as pair probability density functions. The statistical results also explain characteristics of the network performance for increasing sequence separation. The improvement of the new network design is significant in the sequence separation range 10–30 residues. Finally, we find that the performance curve for increasing sequence separation is directly correlated to the corresponding information content. A WWW server, distanceP, is available at http://www.cbs.dtu.dk/services/distanceP/.

Keywords: Distance prediction; sequence motifs; distance constraints; neural network; protein structure.
Introduction
Much work has over the years been put into approaches which either analyze or predict features of the three-dimensional structure using distributions of distances, correlated mutations, and more recently neural networks, or combinations of these, e.g. (Tanaka & Scheraga 1976; Miyazawa & Jernigan 1985; Bohr et al. 1990; Sippl 1990; Maiorov & Crippen 1992; Göbel et al. 1994; Mirny & Shaknovich 1996; Thomas, Casari, & Sander 1996; Lund et al. 1997; Olmea & Valencia 1997; Skolnick, Kolinski, & Ortiz 1997; Fariselli & Casadio 1999). The ability to adopt structure from sequence depends on constructing an appropriate cost function for the native structure. In search of such a function we here concentrate on finding a method to predict distance constraints that correlate well with the observed distances in proteins. As the neural network approach is so far the only approach which includes sequence context for the considered pair of amino acids, it is expected not only to perform better, but also to capture more features relating distance constraints and sequence composition.
The analysis includes investigation of the distances between amino acids, as well as sequence motifs and correlations for separated residues. We construct a prediction scheme which significantly improves on an earlier approach (Lund et al. 1997).
For each particular sequence separation, the corresponding distance threshold is computed as the average of all physical distances in a large data set between any two amino acids separated by that number of residues (Lund et al. 1997); this computation is sketched below. Here we include an analysis of the distance distributions relative to these thresholds and use them to explain the qualitative behavior of the neural network prediction scheme, thus extending earlier studies (Reese et al. 1996).
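A minimal sketch of the threshold computation, assuming each chain is supplied as an (N, 3) NumPy array of Cα coordinates; the function and variable names are our own, not those of the original implementation:

```python
import numpy as np

def separation_thresholds(chains, max_sep=100):
    """For each sequence separation, average the C-alpha/C-alpha distances
    over all residue pairs (i, i+sep) in all chains; these mean distances
    serve as the distance-constraint thresholds."""
    sums = np.zeros(max_sep + 1)
    counts = np.zeros(max_sep + 1)
    for ca in chains:                        # ca: (N, 3) C-alpha coordinates
        n = len(ca)
        for sep in range(2, min(max_sep, n - 1) + 1):
            d = np.linalg.norm(ca[sep:] - ca[:-sep], axis=1)
            sums[sep] += d.sum()
            counts[sep] += d.size
    return sums / np.maximum(counts, 1)      # threshold[sep] = mean distance
```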
For the prediction scheme used here it is essential to relate the distributions to their means. Analysis of the network weight composition reveals intriguing properties of the distance constraints: the sequence motifs can be decomposed into sub-motifs associated with each of the hidden units in the neural network.
Further, as the sequence separation increases, there is a clear correspondence between the changes in the mean value, the distance distributions, and the sequence motifs describing the distance constraints of the separated amino acids. The predicted distance constraints may be used as inputs to threading, ab initio, or loop modeling algorithms.
Materials and Method
Data extraction
The data set was extracted from the Brookhaven Protein Data Bank (Bernstein et al. 1977), release 82, containing 5762 proteins. In brief, entries were excluded if: (1) the secondary structure of the proteins could not be assigned by the program DSSP (Kabsch & Sander 1983), as the DSSP assignment is used to quantify the secondary structure identity in the pairwise alignments; (2) the proteins had any physical chain breaks (defined as neighboring amino acids in the sequence having Cα distances exceeding 4.0 Å), or the DSSP program detected chain breaks or incomplete backbones; (3) they had a resolution value greater than 2.5 Å, since proteins with worse resolution are less reliable as templates for homology modeling of the Cα trace (unpublished results).
Individual chains of entries were discarded if (1) they had a length of less than 30 amino acids; (2) they had less than 50% secondary structure assignment as defined by the program DSSP; (3) they had more than 85% non-amino acids (nucleotides) in the sequence; or (4) they had more than 10% non-standard amino acids (B, X, Z) in the sequence. A sketch of these filters is given below.
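A hedged sketch of these chain-level filters; the `chain` record and its fields are hypothetical placeholders for whatever bookkeeping precedes this step, while the thresholds are the ones quoted above:

```python
def keep_chain(chain):
    """Apply the four chain-level criteria from the text. `chain` is a
    hypothetical record carrying the sequence and precomputed DSSP and
    composition statistics."""
    seq = chain.sequence
    if len(seq) < 30:                                        # (1) too short
        return False
    if chain.fraction_dssp_assigned < 0.50:                  # (2) too little assigned structure
        return False
    if chain.fraction_non_amino_acids > 0.85:                # (3) mostly nucleotides
        return False
    if sum(seq.count(a) for a in "BXZ") / len(seq) > 0.10:   # (4) non-standard residues
        return False
    return True
```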
A representative set with low pairwise sequence similarity was selected by running algorithm #1 of Hobohm et al. (1992) as implemented in the program RedHom (Lund et al. 1997). In brief, the sequences were sorted according to resolution (all NMR structures were assigned resolution 100). Sequences with the same resolution were sorted so that higher priority was given to longer proteins.
The sequences were aligned using the local alignment program ssearch (Myers & Miller 1988; Pearson 1990) with the pam120 amino acid substitution matrix (Dayhoff & Orcutt 1978) and gap penalties −12, −4. As a cutoff for sequence similarity we applied the threshold T = 290/√L, where T is the percentage identity in the alignment and L the length of the alignment. Starting from the top of the list, each sequence was used as a probe to exclude all sequence-similar proteins further down the list.
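The selection can be sketched as below, a minimal rendering of Hobohm algorithm #1 under the stated threshold. The `align` callback is an assumption standing in for a local alignment such as ssearch; it is taken to return the percent identity and the alignment length.

```python
import math

def hobohm1(chains, align):
    """Greedy Hobohm #1 selection. `chains` must be pre-sorted by priority
    (resolution first, then length). A candidate is discarded if it aligns
    to an already kept chain with identity above T = 290 / sqrt(L)."""
    kept = []
    for cand in chains:
        for ref in kept:
            identity, length = align(ref, cand)    # local alignment scores
            if identity > 290.0 / math.sqrt(length):
                break                              # sequence-similar: drop candidate
        else:
            kept.append(cand)                      # no kept chain was similar
    return kept
```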
By visual inspection seven proteins were removed from the list, since their structure is either 'sustained' by DNA or predominantly buried in the membrane. The resulting 744 protein chains are composed of the residues for which the Cα atom position is specified in the PDB entry.
Ten cross-validation sets were selected such that they all contain approximately the same number of residues and all have the same length distribution of the chains. All data are made publicly available through the world wide web page http://www.cbs.dtu.dk/services/distanceP/.
Information Content/Relative entropy measure
Here we use the relative entropy to measure the information content (Kullback & Leibler 1951) of aligned regions between sequence separated residues. The information content is obtained by summing over the respective positions in the alignment, $I = \sum_{i=1}^{L} I_i$, where $I_i$ is the information content of position $i$ in the alignment. The information content at each position will sometimes be displayed as a sequence logo (Schneider & Stephens 1990). The position-dependent information content is given by

$$I_i = \sum_k q_{ik} \log_2 \frac{q_{ik}}{p_k} \qquad (1)$$

where $k$ refers to the symbols of the alphabet considered (here amino acids). The observed fraction of symbol $k$ at position $i$ is $q_{ik}$, and $p_k$ is the background probability of finding symbol $k$ by chance in the sequences. $p_k$ will sometimes be replaced by a position-dependent background probability, that is, the probability of observing letter $k$ at some position in the alignment in another data set one wishes to compare to. Symbols in logos turned 180 degrees indicate that $q_{ik} < p_k$.
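A small sketch of this computation, with the observed frequencies `q` and the background `p` as NumPy arrays; a position-dependent background is handled simply by passing a `p` of the same shape as `q`. The function name is our own.

```python
import numpy as np

def information_content(q, p):
    """Per-position relative entropy, Eq. (1):
    I_i = sum_k q_ik * log2(q_ik / p_k). `q` has shape (positions, 20);
    `p` is a length-20 background or a (positions, 20) array."""
    q = np.asarray(q, dtype=float)
    p = np.broadcast_to(np.asarray(p, dtype=float), q.shape)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(q > 0, q * np.log2(q / p), 0.0)  # 0 log 0 := 0
    return terms.sum(axis=1)       # I_i per position; the total I is the sum
```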
Neural networks
As in the previous work, we apply two-layer feed-forward neural networks, trained by standard back-propagation (see, e.g., Brunak, Engelbrecht, & Knudsen 1991; Bishop 1996; Baldi & Brunak 1998), to predict whether two residues are below or above a given distance threshold in space. The sequence input is sparsely encoded. In (Lund et al. 1997) the inputs were processed as two windows centered around each of the separated amino acids. Here we extend that scheme by allowing the windows to grow towards each other, and even merge into a single large window covering the complete sequence between the separated amino acids. Even though such a scheme increases the computational requirements, it allows us to search for the optimal covering between the separated amino acids.
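The merged-window input can be sketched as follows. This is a minimal one-hot encoder under our own naming; in the actual training setup the input dimension is of course fixed for a given separation and context size.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def encode_pair(seq, i, j, r, flank=4):
    """Sparse (one-hot) encoding of the context around residues i and j
    (i < j). Each window spans `flank` residues away from its residue and
    grows `r` residues toward the other one; when the two regions meet
    they merge into a single window, as described in the text."""
    left = set(range(max(0, i - flank), min(len(seq), i + r + 1)))
    right = set(range(max(0, j - r), min(len(seq), j + flank + 1)))
    positions = sorted(left | right)           # merged when the regions overlap
    x = np.zeros((len(positions), len(AA)))
    for row, pos in enumerate(positions):
        col = AA.find(seq[pos])
        if col >= 0:                           # non-standard residues stay all-zero
            x[row, col] = 1.0
    return x.ravel()
```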
As there can be a large difference in the number of positive (contact) and negative (no contact) sequence windows for a given separation, we apply the balanced learning approach (Rost & Sander 1993). Training is done by a 10-set cross-validation approach (Bishop 1996), and the result is reported as the average performance over the partitions. The performance on each partition is evaluated by the Matthews correlation coefficient (Matthews 1975)
$$C = \frac{P_t N_t - P_f N_f}{\sqrt{(N_t + N_f)(N_t + P_f)(P_t + N_f)(P_t + P_f)}} \qquad (2)$$
where $P_t$ is the number of true positives (contact, predicted contact), $N_t$ the number of true negatives (no contact, no contact predicted), $P_f$ the number of false positives (no contact, contact predicted), and $N_f$ the number of false negatives (contact, no contact predicted).
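For reference, Eq. (2) transcribed directly into code; the guard against a zero denominator is our own convention.

```python
import math

def matthews_correlation(pt, nt, pf, nf):
    """Matthews correlation coefficient from the contact/no-contact
    counts, following Eq. (2)."""
    denom = math.sqrt((nt + nf) * (nt + pf) * (pt + nf) * (pt + pf))
    return (pt * nt - pf * nf) / denom if denom else 0.0
```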
The analysis of the patterns stored in the weights of the networks is done through the saliency of the weights, that is, the cost of removing a single weight while keeping the remaining ones. Due to the sparse encoding, each weight connected to a hidden unit corresponds exactly to a particular amino acid at a given position in the sequence windows used as inputs. We can then obtain a ranking of the symbols at each position in the input field. To compute the saliencies we use the approximation for two-layer one-output networks of Gorodkin et al. (1997), who showed that the saliencies for the weights between input and hidden layer can be written as
$$s^k_{ji} = s_{ji} = w^2_{ji} W^2_j K \qquad (3)$$
where $w_{ji}$ is the weight between input $i$ and hidden unit $j$, and $W_j$ is the weight between hidden unit $j$ and the output; $K$ is a constant. The $k$th symbol is implicitly given by the sparse encoding.
[Figure 1 appears here: histograms of the number of residue pairs versus physical distance (Angstrom), one panel per sequence separation 2, 3, 4, 11, 12, 13, 16, 20, and 60.]
Figure 1: Distributions of physical distances for the corresponding sequence separations, relative to the respective mean values. Sequence separations 2, 3, 4, 11, 12, 13, 16, 20, and 60 are shown. These show the representative shapes of the distance distributions. The vertical line through zero indicates the displacement with respect to the mean distance.
If the signs of $w_{ji}$ and $W_j$ are opposite, we display the corresponding symbol upside down in the weight-saliency logo.
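A sketch of this saliency bookkeeping, dropping the constant K, which does not affect the ranking; the array names are our own.

```python
import numpy as np

def hidden_saliencies(w, W):
    """Saliencies for a two-layer, one-output network, Eq. (3):
    s_ji is proportional to w_ji^2 * W_j^2. `w` has shape
    (hidden, inputs), `W` has shape (hidden,). With sparse encoding each
    input index corresponds to one (position, amino acid) pair, and the
    sign of w_ji * W_j sets the orientation of the symbol in the logo."""
    s = (w ** 2) * (W[:, None] ** 2)           # constant K dropped
    orientation = np.sign(w * W[:, None])      # negative: drawn upside down
    return s, orientation
```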
Results
We conduct a statistical analysis of the data and the distance constraints between amino acids, and subsequently use the results to design and explain the behavior of a neural network prediction scheme with enhanced performance.
Statistical analysis
For each sequence separation (in residues), we derive the mean of all the distances between pairs of Cα atoms. We use these means as distance constraint thresholds, and in a prediction scheme we wish to predict whether the distance between any pair of amino acids is above or below the threshold corresponding to a particular separation in residues. To analyze which pairs are above and below the threshold, it is relevant to compare: (1) the distribution of distances between amino acid pairs below and above the threshold, and (2) the sequence composition of segments where the pairs of amino acids are below and above the threshold.
First we investigate the distribution of the distances as a function of the sequence separation. A complete investigation of physical distances for increasing sequence separation is given by (Reese et al. 1996). In particular, it was found that α-helices cause a distinct peak up to sequence separation 20, whereas β-strands are seen only up to separations of 5. However, when we perform a simple translation of these distributions relative to their respective means, the same plots provide essentially new qualitative information, which is anticipated to be strongly correlated with the performance of a predictor (above/below the threshold). In particular we focus on the distributions shown in Figure 1, but we also use the remaining distributions to make some important observations.
When inspecting the distance distributions relative to their mean, two main observations are made. First, the distance distribution for sequence separation 3 is the one where the data is most bimodal. Thus sequence separation 3 provides the most distinct partition of the data points. Hence, in agreement with the results in (Lund et al. 1997), we anticipate that the best prediction of distance constraints can be obtained for sequence separation 3. Furthermore, we observe that the α-helix peak shifts relative to the mean when going from sequence separation 11 to 13. The length of the helices becomes longer than the mean distances. Interestingly, this shift involves the peak being placed at the mean value itself for sequence separation 12. Due to this phenomenon, we anticipate that, for an optimized predictor, it can be slightly harder to predict distance constraints for separation 12 than for separations 11 and 13.
The peak present at the mean value for sequence separation 12 does indeed reflect the length of helices, as demonstrated clearly in Figure 2. Rather than using the simple rule that each residue in a helix increases the physical distance by 1.5 Angstrom (Branden & Tooze 1999), we computed the actual physical lengths for each size of helix to obtain a more accurate picture. The physical length of the α-helices was calculated by finding the helical axis and measuring the translation per Cα atom along this axis. The helical axis was determined by the mass centers of four consecutive Cα atoms. Helices of length four are therefore not included, since only one center of mass was present. We see that helices of 12 residues coincide with sequence separation 12. Again we use this as an indication that at sequence separation 12 it may be slightly harder to predict distance constraints than at separation 13.
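The helix-length computation can be sketched as follows, assuming `ca` is an (N, 3) array of Cα coordinates for one helix. The axis estimate below (the line through the first and last mass centers) is a simplification of the procedure in the text, kept for brevity.

```python
import numpy as np

def helix_rise_per_residue(ca):
    """Approximate translation per C-alpha along the helical axis. The
    axis is traced by mass centers of four consecutive C-alpha atoms;
    helices of length four yield a single center and are skipped."""
    centers = np.array([ca[i:i + 4].mean(axis=0) for i in range(len(ca) - 3)])
    if len(centers) < 2:
        return None                            # axis undefined for length-4 helices
    axis = centers[-1] - centers[0]
    axis /= np.linalg.norm(axis)
    rises = (ca[1:] - ca[:-1]) @ axis          # per-residue displacement along the axis
    return float(rises.mean())
```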
[Figure 2 appears here: physical distance (Angstrom) versus separation/length (residues), with curves for the average sequence separation distance and the average helix length.]
Figure 2: Mean distances for increasing sequence separation, and computed average physical helix lengths for increasing numbers of residues.
As the sequence separation increases, the distance distribution approaches a universal shape, presumably independent of structural elements, which are governed by more local distance constraints. The traits from the local elements, as mentioned by (Reese et al. 1996) (who merge large separations), vanish for separations 20–25. (Here we considered separations up to 100 residues.) In Figure 1 we see that the transition from bimodal distributions to unimodal distributions centered around their means indicates that prediction of distance constraints must become harder with increasing sequence separation. In particular, once the universal shape has been reached (sequence separation larger than 30), we should not expect a change in the prediction ability as the sequence separation increases. The universal distribution has its mode value at approximately −3.5 Angstrom, indicating that the most probable physical distance for large sequence separations is shifted below the mean by roughly the distance between two Cα atoms.

Notice that the universality only appears when the distribution is displaced by its mean distance. This is interesting since the mean of the physical distances grows as the sequence separation increases. From a predictability perspective, it therefore makes good sense to use the mean value as a threshold to decide the distance constraint for an arbitrary sequence separation.
A useful prediction scheme must rely on the information available in the sequences. To investigate whether there exists a detectable signal between the sequence separated residues, we constructed, for each sequence separation, sequence logos (Schneider & Stephens 1990) as follows: the sequence segments for which the physical distance between the separated amino acids was above the threshold were used to generate a position-dependent background distribution of amino acids. The segments with corresponding physical distance below the threshold were all aligned and displayed in sequence logos using the computed background distribution of amino acids; a sketch of this bookkeeping follows below. The pure information content curves are shown for increasing sequence separation in Figure 3. The corresponding sequence logos are displayed in Figure 4. We used a margin of 4 residues to the left and right of the logos, e.g., for sequence separation 2, the physical distance is measured between positions 5 and 7 in the logo.
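The frequency bookkeeping behind these logos can be sketched as below; segments are equal-length strings for a given separation, and the resulting matrices can be fed as `q` (below-threshold segments) and `p` (above-threshold segments) to the `information_content` sketch given earlier. Names are our own.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(segments):
    """Column-wise amino-acid frequencies of equal-length segments."""
    counts = np.zeros((len(segments[0]), len(AA)))
    for seg in segments:
        for pos, aa in enumerate(seg):
            k = AA.find(aa)
            if k >= 0:                         # ignore non-standard letters
                counts[pos, k] += 1
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1)      # rows sum to 1 where populated
```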
The change in the sequence patterns is consistent with the change in the distribution of physical distances. Up to separations 6–7, the distribution of distances (Figure 1) contains two peaks, with the β-strand peak vanishing completely for separations of 7–8 residues. For the same separations, the sequence motif changes from containing one peak to three. For larger sequence separations, the motif consists of three characteristic peaks, two of them located at positions exactly corresponding to the separated amino acids. The third peak appears in the center. This peak tells us that for physical distances within the threshold, it indeed matters whether the residues in the center can bend or turn the chain. We see that exactly such amino acids are located in the center peak: glycine, aspartic acid, glutamic acid, and lysine. All of these are medium size residues (except glycine), being hydrophilic, thereby having affinity for the solvent, but not being on the outermost surface of the protein (see, e.g., Creighton 1993, for amino acid properties).
[Figure 3 appears here: information content (bits) versus logo position, one panel per sequence separation 2, 7, 9, 12, 13, 20, 30, and 60.]
Figure 3: Sequence information profiles for increasing sequence separation.
As the sequence separation increases from about 20 to 30, the center peak smears out, in agreement with the transition of the distance distribution to the universal distribution in this range of sequence separation. The sequence motif likewise becomes universal: only the peaks located at the separated residues are left.

The composition of the single-peak motifs resembles to a large degree the composition of the center of the motifs having three peaks. However, the slightly increased amount of leucine moves from the center of the one-peak motifs to the separated amino acid positions in the three-peak motifs. The reverse happens for glycine. Interestingly, as the sequence separation increases, valine (and isoleucine) become present at the outermost peaks. In agreement with the helix length shift from below to above the threshold at separations 12 and 13, the center peak shifts from slight over- to slight underrepresentation of alanine: helices are no longer dominant for distances below the threshold.
[Figure 4 appears here: sequence logos, one panel per sequence separation 2, 7, 9, 12, 13, 20, 30, and 60.]
Figure 4: Sequence logos for increasing sequence separation. A margin of 4 residues on both ends of the logo is used. Symbols displayed upside down indicate that their frequency is lower than the background frequency. This figure is available in full color at http://www.cbs.dtu.dk/services/distanceP/.
The smearing out of the center peak for large sequence separations does not indicate a lack of bending/turn propensity, but reflects that such amino acids can be placed in a large region in between the two separated amino acids. What we see for small separations is that the signal, in addition, must be located in a very specific position relative to the separated amino acids.
Neural networks: predictions and behavior
As found above, optimized prediction of distance constraints for varying sequence separation should be performed by a non-linear predictor. Here we use neural networks. Clearly, by far most of the information is found in the sequence logos, so we expect only a few hidden neurons (units) to be necessary in the design of the neural network architecture. As earlier, we only use two-layer networks with one output unit (above/below threshold). We fix the number of hidden units to 5, but the size of the input field may vary.
We start out by quantitatively investigating the relation between the sequence motifs in the logos above and the amount of sequence context needed in the prediction scheme. First, we choose the amount of sequence context by starting with local windows around the separated amino acids and then extending the sequence region, r, to be included. We only include additional residues in the direction towards the center. The number of residues used in the direction away from the center is fixed to 4, except in the cases of r = 0 and r = 2, where that number is zero and two, respectively. At some point the two sequence regions may reach each other; whether they overlap or just connect does not affect the performance. For each r = 0, 2, 4, 6, 8, 10, 12, 14 we trained networks for sequence separations 2 to 99, resulting in the training of 8000 networks, since we used 10 cross-validation sets. The performances in fraction correct and correlation coefficient are shown in Figure 5.
Figure 5(b) shows the performance in terms of the correlation coefficient. We see that each time the sequence separation becomes so large that the two context regions do not connect, the performance drops when compared to contexts that merge into a single window. This behavior is generic; it happens consistently for increasing size of the sequence context region, though the phenomenon decreases asymptotically. An effect can be found up to a region size of about 30, see Figure 6. Several features appear on the performance curve, and they can all be explained by the statistical observations made earlier. First, the curve with no context (r = 0) corresponds to the curve one obtains using pair probability density functions (Lund et al. 1997). The poor performance of such a predictor is to be expected, due to the lack of the motif, which was not included in the model. The green curve (r = 4) corresponds to the neural network predictor in (Lund et al. 1997). As observed therein, a significant improvement is obtained when including just a little sequence context. As the size of the context region is increased, so is the performance, up to about sequence separation 30. Beyond that the performance remains the same independent of the sequence separation. We even note the drop in performance for sequence separation 12, as predicted from the statistical analysis above. It is also characteristic that, as the distribution of physical distances approaches the universal shape and the center peak of the sequence logos vanishes, the performance drops and becomes constant at approximately 0.15 in correlation coefficient. The changes in prediction performance take place at sequence separations for which the distribution of physical distances and the sequence motifs change. Due to the non-vanishing signal in the logos for large separations, it is possible to predict distance constraints significantly better than random, and better than with a no-context approach. The conclusion is not surprising: the best performing network is the one which uses as much context as possible. However, for sequence separations larger than 30–35 residues, the same performance is obtained by using just a small amount of sequence context around the separated amino acids, as in (Lund et al. 1997). We found that the performance difference between the worst and best cross-validation sets is about 0.1 for separations up to 12 residues, about 0.07 for separations up to 30 residues, and 0.1–0.2 for separations larger than 30. Hence, at sequence separation 31 the performance starts to fluctuate, indicating that the sequence motif becomes less well defined throughout the sequences in the respective cross-validation sets. Hence we can use the networks as an indicator for when a sequence motif is well defined.
As an independent test, we predicted on nine CASP3 (http://predictioncenter.llnl.gov/casp3/) targets (and submitted the predictions). None of these targets (T0053, T0067, T0071, T0077, T0081, T0082, T0083, T0084, T0085) have sequence similarity to any sequence with known structure. The previous method (Lund et al. 1997) had on average 64.5% correct predictions and an average correlation coefficient of 0.224. The method presented here has on average 70.3% correct predictions and an average correlation coefficient of 0.249. The performance of the original method corresponds to the performance previously published (Lund et al. 1997). However, the average measure is a less convenient way to report the performance, as the good performance on small separations is biased by the many large-separation networks that have essentially the same performance. Therefore we have also constructed performance curves similar to the one in Figure 5 (not shown). The main features of the prediction curve were present. In Figure 8 we show a prediction example of the distance constraints for T0067. The server (http://www.cbs.dtu.dk/services/distanceP/) provides predictions up to a sequence separation of 100. Notice that predictions up to a sequence separation of 30 clearly capture the main part of the distance constraints.
We also investigated the qualitative relation between the network performance and the information content in the sequence logos. Interestingly, and in agreement with the observations made earlier, we found that the two curves have the same qualitative behavior as the sequence separation increases (Figure 7). Both curves peak at separation 3, both curves drop at separation 12, and both curves reach the plateau at sequence separation 30. We note that the relative entropy curve increases slightly as the separation increases.
[Figure 5 appears here: (a) fraction correct and (b) correlation coefficient versus sequence separation (residues), with one mean-performance curve per context size r = 0, 2, 4, 6, 8, 10, 12, 14.]
Figure 5: The performance as a function of increasing sequence context when predicting distance constraints. The sequence context is the number of additional residues r = 0, 2, 4, 6, 8, 10, 12, 14 from the amino acid toward the other amino acid. The performance is shown as fraction correct (a) and as correlation coefficient (b). This figure is available in full color at http://www.cbs.dtu.dk/services/distanceP/.
This is due to a well known artifact of entropy measures for decreasing sample sizes. The correlation between the curves also indicates that most of the information used by the predictor stems from the motifs in the sequence logos.
[Figure 6 appears here: correlation coefficient versus sequence separation (residues) for single-window networks, with curves for plain sequence input and profile input.]
Figure 6: The performance curve using a single window only. We also show the performance when homologues are included in the respective test sets, completing a profile.
[Figure 7 appears here: relative entropy and neural network performance (arbitrary scale) versus sequence separation (residues).]
Figure 7: Information content and network performance for increasing sequence separation.
Finally, we conclude with an analysis of the neural network weight composition, with the aim of revealing significant motifs used in the learning process. In general, we found the same motifs appearing for each of the 10 cross-validation sets for a given sequence separation. As described in the Methods section, we compute saliency logos of the network parameters; that is, we display the amino acids for which the corresponding weights, if removed, cause the largest increase in network error. We made saliency logos for each of the 5 hidden neurons. For the short sequence separations the network strongly punishes having a proline in the center of the input field, that is, a proline in between the separated amino acids.
[Figure 8 appears here: predicted distance-constraint map for CASP3 target T0067; the color scale bins the predicted probability from 0.2 and below up to 1.0, and both axes run over sequence positions 20 to 180.]
Figure 8: Prediction of distance constraints for the CASP3 target T0067. The upper triangle shows the distance constraints of the published coordinates, where dark points indicate distances above the thresholds (mean values) and light points distances below the threshold. The lower triangle shows the actual neural network predictions. The scale indicates the predicted probability of the distance being below the threshold at the given sequence separation. Thus light points represent predictions of distances below the thresholds, and dark points predictions of distances above the threshold. The numbers along the edges of the plot show the position in the sequence. This figure is available in full color at http://www.cbs.dtu.dk/services/distanceP/.
input eld,that is a proline in between the separated amino
acids.This makes sense as proline is not located within sec-
ondary structure elements,and the helix lengths for small
separations are smaller than the mean distances.(The net-
work has learnt that proline is not present in helices.) In
contrast,the presence of aspartic acid can have a positive or
negative effect depending on the position in the input eld.
For larger sequence separations,some of the neurons dis-
play a three peak motif resembling what was found in the se-
quence logos.The center of such logos can be very glycine
rich,but also showa strong punishment for valine or trypto-
phan.Sometimes two neurons can share the motif of what is
present for one neuron in another cross-validation set.Other
details can be found as well.Here we just outlined the main
features:the networks not only look for favorable amino
acids,they also detects those that are not favorable at all.In
Figure 9 we show an example of the motifs for 2 of the 5
hidden neurons for a network with sequence separation 13.
[Figure 9 appears here: saliency-weight logos, saliency versus input-field position, for hidden neurons of networks at sequence separations 9 and 13.]
Figure 9: Three saliency-weight logos from the trained neural networks at sequence separations 9 and 13. For each position of the input field the saliency of the respective amino acids is displayed. Letters turned upside down contribute negatively to the weighted sum of the network input. The top logo (sequence separation 9) illustrates strong punishment of proline and glycine upstream. In the middle logo (sequence separation 13), N is turned upside down on the peaks near the edges. In the bottom logo the I's close to the edges contribute positively, and the N in the center peak also contributes positively. The I's in the center peak contribute negatively (they are upside down). This figure is available in full color at http://www.cbs.dtu.dk/services/distanceP/.
Final remarks
We studied prediction of distance constraints in proteins. We found that sequence motifs were related to the distribution of distances. Using the motifs, we showed how to construct an optimal neural network predictor which improves on an earlier approach. The behavior of the predictor could be completely explained from a statistical analysis of the data; in particular, a drop in performance for a sequence separation of 12 residues was predicted from the statistical analysis. We also correctly predicted separation 3 as having optimal performance. We found that the information content of the logos has the same qualitative behavior as the network performance for increasing sequence separation. Finally, the weight composition of the network was analyzed, and we found that the sequence motifs appearing were in agreement with the sequence logos.

The perspectives are many: the network performance may be improved further by using three windows for sequence separations between 20 and 30 residues. The predictions of the network can be used as inputs to a new collection of networks which can clean up certain types of false predictions. Combining the method with a secondary structure prediction should give a significant performance improvement on the short sequence separations. The relation between information content and network performance might be explained quantitatively through extensive algebraic considerations.
Acknowledgments
This work was supported by The Danish National Research Foundation. OL was supported by The Danish 1991 Pharmacy Foundation.
References
Baldi, P., and Brunak, S. 1998. Bioinformatics: The Machine Learning Approach. Cambridge, Mass.: MIT Press.

Bernstein, F. C.; Koetzle, T. G.; Williams, G. J. B.; Meyer, E. F.; Brice, M. D.; Rogers, J. R.; Kennard, O.; Shimanouchi, T.; and Tasumi, M. 1977. The Protein Data Bank: A computer based archival file for macromolecular structures. J. Mol. Biol. 122:535–542. (http://www.pdb.bnl.gov/).

Bishop, C. M. 1996. Neural Networks for Pattern Recognition. Oxford: Oxford University Press.

Bohr, H.; Bohr, J.; Brunak, S.; Cotterill, R. M. C.; Fredholm, H.; Lautrup, B.; and Petersen, S. B. 1990. A novel approach to prediction of the 3-dimensional structures of protein backbones by neural networks. FEBS Lett. 261:43–46.

Branden, C., and Tooze, J. 1999. Introduction to Protein Structure. New York: Garland Publishing Inc., 2nd edition.

Brunak, S.; Engelbrecht, J.; and Knudsen, S. 1991. Prediction of human mRNA donor and acceptor sites from the DNA sequence. J. Mol. Biol. 220:49–65.

Creighton, T. E. 1993. Proteins. New York: W. H. Freeman and Company, 2nd edition.

Dayhoff, M. O.; Schwartz, R. M.; and Orcutt, B. C. 1978. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5, Suppl. 3:345–352.

Fariselli, P., and Casadio, R. 1999. A neural network based predictor of residue contacts in proteins. Prot. Eng. 12:15–21.

Göbel, U.; Sander, C.; Schneider, R.; and Valencia, A. 1994. Correlated mutations and residue contacts in proteins. Proteins 18:309–317.

Gorodkin, J.; Hansen, L. K.; Lautrup, B.; and Solla, S. A. 1997. Universal distribution of saliencies for pruning in layered neural networks. Int. J. Neural Systems 8:489–498. (http://www.wspc.com.sg/journals/ijns/85_6/gorod.pdf).

Hobohm, U.; Scharf, M.; Schneider, R.; and Sander, C. 1992. Selection of a representative set of structures from the Brookhaven Protein Data Bank. Prot. Sci. 1:409–417.

Kabsch, W., and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637.

Kullback, S., and Leibler, R. A. 1951. On information and sufficiency. Ann. Math. Stat. 22:79–86.

Lund, O.; Frimand, K.; Gorodkin, J.; Bohr, H.; Bohr, J.; Hansen, J.; and Brunak, S. 1997. Protein distance constraints predicted by neural networks and probability density functions. Prot. Eng. 10:1241–1248. (http://www.cbs.dtu.dk/services/CPHmodels).

Maiorov, V. N., and Crippen, G. M. 1992. Contact potential that recognizes the correct folding of globular proteins. J. Mol. Biol. 227:876–888.

Matthews, B. W. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochem. Biophys. Acta 405:442–451.

Mirny, L. A., and Shaknovich, E. I. 1996. How to derive a protein folding potential? A new approach to an old problem. J. Mol. Biol. 264:1164–1179.

Miyazawa, S., and Jernigan, R. L. 1985. Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18:534–552.

Myers, E. W., and Miller, W. 1988. Optimal alignments in linear space. CABIOS 4:11–17.

Olmea, O., and Valencia, A. 1997. Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Folding & Design 2:S25–S32.

Pearson, W. R. 1990. Rapid and sensitive sequence comparison with FASTP and FASTA. Meth. Enzymol. 183:63–98.

Reese, M. G.; Lund, O.; Bohr, J.; Bohr, H.; Hansen, J. E.; and Brunak, S. 1996. Distance distributions in proteins: A six parameter representation. Prot. Eng. 9:733–740.

Rost, B., and Sander, C. 1993. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232:584–599.

Schneider, T. D., and Stephens, R. M. 1990. Sequence logos: a new way to display consensus sequences. Nucl. Acids Res. 18:6097–6100.

Sippl, M. J. 1990. Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol. 213:859–883.

Skolnick, J.; Kolinski, A.; and Ortiz, A. R. 1997. MONSSTER: A method for folding globular proteins with a small number of distance restraints. J. Mol. Biol. 265:217–241.

Tanaka, S., and Scheraga, H. A. 1976. Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules 9:945–950.

Thomas, D. J.; Casari, G.; and Sander, C. 1996. The prediction of protein contacts from multiple sequence alignments. Prot. Eng. 9:941–948.