Prediction of Zinc-binding Sites Using Support Vector Machine




Graduate student: Yen-Ting Hsieh (謝燕婷)
Advisor: I-Fang Chung (鍾翊方)

National Yang-Ming University
School of Medicine and School of Life Sciences
Institute of Biomedical Informatics
Master Thesis

July 2008



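Chinese abstract (translated)
Metal-binding sites participate in the biological functions of metalloproteins. Among
the metalloproteins in the protein database, zinc-binding proteins are the most
abundant. Recent studies have also shown that zinc-binding sites can be predicted from
both protein structure and sequence. To obtain manually annotated zinc-binding sites,
this study carefully selected benchmark data for machine learning from the Swiss-Prot
protein database. A support vector machine was used to predict zinc-binding sites from
sequence-derived features and structural features. For protein sequences, the
predictor reached 78% precision at 80% recall for Cysteine; 75% precision at 63%
recall for Histidine; and 78% precision at 69% recall for Cysteine and Histidine
together. The predictor can be applied to proteins whose structures have not been
determined and for which only sequences are available, as well as to proteins with
structural information in structural genomics.

Keywords: zinc-binding sites, prediction, support vector machine, machine learning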















Abstract
Metal-binding sites are involved in the biological functions of metalloproteins.
Zinc-binding proteins are the most abundant metalloproteins in the protein database,
and recent studies have shown that their binding sites are predictable from both
structure and sequence. This research carefully collected a benchmark dataset from
Swiss-Prot to obtain manually annotated zinc-binding sites for machine learning. A
support vector machine was used to predict zinc-binding sites with sequence-derived
features and structural features. For sequences, the predictor achieved 74% precision
at 71% recall for Cysteine; 66% precision at 53% recall for Histidine; and 71%
precision at 61% recall for Cysteine and Histidine together. For structures, the
predictor achieved 78% precision at 80% recall for Cysteine; 75% precision at 63%
recall for Histidine; and 78% precision at 69% recall for Cysteine and Histidine
together. This prediction tool can be applied to proteins whose structures have not
been determined, so that only sequences are available, as well as to proteins that
have structural information in structural genomics.

Keywords: zinc-binding sites, prediction, support vector machine, machine learning


Contents

Chapter 1: Introduction
1.1 Background
1.2 Related work
1.3 Motivation and specific aim
Chapter 2: Materials and methods
2.1 Benchmark dataset
2.2 Database for BLAST
2.3 Feature description
2.3.1 Sequence derived features
2.3.2 Structural features
2.3.3 Features of paired residues
2.4 Machine learning
2.4.1 SVM vector encoding
2.4.2 Performance measurement
2.4.3 Parameter selection
2.5 The flowchart of a prediction process
Chapter 3: Results and discussions
3.1 Performances of predictions of zinc-binding sites
3.2 Observation on ROC curves and RP curves
3.3 Prediction with each feature or a subset of features
3.4 Comparison with previous predictors
Chapter 4: Conclusion
Links
References
Appendix. The 127 benchmark proteins










Figure list

Figure 1. The proportion of C, H, D, and E in the 127 benchmark proteins
Figure 2. Weblogo of zinc-binding windows for (a) Cysteine and (b) Histidine
Figure 3. An example of structural neighborhood
Figure 4. The distributions of the number of zinc-binding pairs versus pair distances
corresponding to different kinds of the first residue in the paired residues
Figure 5. An example of a single residue vector and its zinc-binding label
Figure 6. The distribution of conserved residues or conserved zinc-binding sites
Figure 7. An example of a paired residues vector and its zinc-binding label
Figure 8. An example of a gated vector and its zinc-binding label
Figure 9. The flowchart of a prediction of zinc-binding sites
Figure 10. ROC curves of prediction from sequences using dataset (a) C only, (b) H
only, (c) C and H together
Figure 11. RP curves of prediction from sequences using dataset (a) C only, (b) H only,
(c) C and H together
Figure 12. ROC curves of prediction from structures using dataset (a) C only, (b) H
only, (c) C and H together
Figure 13. RP curves of prediction from structures using dataset (a) C only, (b) H only,
(c) C and H together
Figure 14. Performance of each feature or subset of features using (a) C dataset, (b) H
dataset, and (c) CH dataset


Table list

Table 1. The encoding elements of each feature
Table 2. Amino acid composition in UniProtKB statistics
Table 3. The encoding matrices of the first residue in zinc-binding pair
Table 4. The best SVM parameter used in the predictions after parameter selection






Chapter 1: Introduction
1.1 Background
Metalloproteins need to bind one or more metal ions for their biological function or
structural stability. Zinc-binding proteins are the most abundant metalloproteins
(Sodhi et al., 2004) in the Protein Data Bank (PDB) (Berman et al., 2000) and play many
kinds of roles, including enzymes, storage proteins, transcription factors, and
replication proteins (Coleman, 1992). Zinc-binding sites are involved in many
biological functions, mainly catalytic reactions and structural stabilization in
proteins. With rapidly growing protein and structural genomics databases, a
well-performing tool for predicting zinc-binding sites might help the annotation and
verification of functional sites on proteins.

For prediction problems in bioinformatics, supervised machine learning methods have
been applied very often. Petrova and Wu (2006) used a support vector machine (SVM) to
predict catalytic residues with selected protein sequence and structural properties.
Iakoucheva et al. (2004) constructed linear predictors based on logistic regression
for the prediction of protein phosphorylation sites. Among the many supervised machine
learning methods, some studies showed better performance using SVM (Chen and Zhou,
2005; Petrova and Wu, 2006).

1.2 Related work
Recently, the prediction of metal-binding sites has been studied with structural
information (Sodhi et al., 2004). Sodhi et al. (2004) used a neural network to predict
zinc-binding residues out of whole proteins in low-resolution structural models. A
residue within a distance threshold of the zinc ion was defined as a zinc-binding
residue. The features used to distinguish zinc-binding residues from the remaining
residues in the same protein included the position specific scoring matrix (PSSM) from
PSI-BLAST, solvent accessibility, and some features representing the structural
environment. For predicting zinc-binding sites from protein sequences (Passerini et
al., 2007; Shu et al., 2008), only Cysteine, Histidine, Aspartic acid, and Glutamic
acid (CHDE; see Table 2 for the full names and abbreviations of the amino acids) were
selected for prediction, since they dominate zinc-binding residues.
Passerini et al. (2007) considered pairs of sequentially close residues, called
semi-patterns, as well as local single residue information. SVM was used to build a
local predictor and a gated predictor, the latter formed by combining the prediction
results of the local predictor with those of the semi-pattern predictor. The feature
vectors of the local predictor included the PSSM from PSI-BLAST (Altschul et al.,
1997), representing a window centered at C or H. A residue pair was selected by the
semi-pattern [CHDE]x(0-7)[CHDE], which meant that another C, H, D, or E was in the
eight-residue sequential neighborhood of a C, H, D, or E. The feature vectors of pairs
were generated in a similar way as in the local predictor. A gating network combined
the predictions from the local predictor and the semi-pattern predictor to form the
gated predictor. Shu et al. (2008) provided a different strategy to select residue
pairs. A residue pair was formed by CHDEs within a range of 150 residues that had a
high residue conservation score from the PSSM. This strategy selected fewer residues
to form pairs but identified more zinc-binding CHDEs than Passerini et al. (2007).
A homology-based predictor was also built and combined with the gated predictor for
test proteins with homologs in the training dataset. In these two studies of the
prediction of zinc-binding sites from sequences, the C and H datasets were used in
most of the analyses because the datasets of D and E were small.
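The semi-pattern [CHDE]x(0-7)[CHDE] described above can be illustrated with a minimal sketch. It is not the code of Passerini et al. (2007); the function name `semi_pattern_pairs`, the 0-based indexing, and the toy sequence are assumptions made for illustration only.

```python
CANDIDATES = set("CHDE")

def semi_pattern_pairs(sequence, max_gap=7):
    """Return 0-based index pairs (i, j) that match [CHDE]x(0-max_gap)[CHDE]."""
    pairs = []
    for i, first in enumerate(sequence):
        if first not in CANDIDATES:
            continue
        # j - i - 1 residues lie between the two positions, so j <= i + max_gap + 1.
        for j in range(i + 1, min(i + max_gap + 2, len(sequence))):
            if sequence[j] in CANDIDATES:
                pairs.append((i, j))
    return pairs

# Toy sequence (hypothetical): C and H are 2 residues apart, H and D are 9 apart.
print(semi_pattern_pairs("ACAAHAAAAAAAAAD"))
```

In the toy sequence only the C and H positions form a pair, because the D lies more than seven residues beyond the H.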



1.3 Motivation and specific aim
In supervised machine learning, the benchmark dataset is very important. The datasets
used so far in research on the prediction of metal-binding sites came from PDB. The
definition of metal-binding residues depended on the shortest distance between the
metal ion and the residue atoms, with a distance threshold that is usually 3Å
(Passerini et al., 2007; Sodhi et al., 2004). However, this kind of dataset might
contain many false positive residues when metal-binding sites are identified from
whole proteins. In predicting zinc-binding sites from sequence (Passerini et al., 2007;
Shu et al., 2008), the dataset focused only on CHDEs, which removed some false
positives, but the concern remained. To eliminate false positive data, taking
benchmark data from a manually annotated database might be a good method.

In this research, the benchmark dataset was carefully selected from the Swiss-Prot
database. Zinc-binding sites were predicted for C and H by SVM predictors with local
single residue and gated information. Several bioinformatics tools were used to
generate sequence-derived features for prediction from protein sequences. A structural
predictor was also built with the addition of structural features in order to respond
to query proteins with structural information.









Chapter 2: Materials and methods
2.1 Benchmark dataset
The benchmark dataset was collected very carefully for machine learning from
Swiss-Prot, a manually annotated protein database in UniProt. Each protein was
required to have at least one PDB structure and a zinc-binding description.
Zinc-binding sites whose descriptions contain “By similarity”, “Probable”, or
“Potential” were eliminated to ensure that the annotation comes from real experiments.
Then, each Swiss-Prot protein was matched manually to one chain of a PDB structure
according to the zinc-binding information on the PDBsum website (Link 1). For a
Swiss-Prot protein with many PDB structures, only one PDB structure was chosen by
checking the reference papers of its annotation, the structure resolution, and the
binding site information on PDBsum. In order to reduce redundant and homologous data,
only one protein was picked from every PIRSF family. PIRSF is a protein
classification from superfamily to subfamily levels in UniProt. In PIRSF, protein
family members are homologous (sharing common ancestry) and homeomorphic
(sharing full-length sequence similarity with common domain architecture) (Wu et al.,
2004). After one protein was kept for each PIRSF family, proteins that shared more
than 25% sequence identity were deleted. Finally, 127 proteins formed the benchmark
dataset (Appendix).

In the 127 benchmark proteins, the zinc-binding residues were the residues annotated
to bind zinc in Swiss-Prot; the remaining residues were viewed as non-zinc-binding
residues. Among 539 zinc-binding residues, only 10 residues were of an amino acid type
other than C, H, D, or E.
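The qualifier-based filtering of annotations described above can be sketched as follows. This is only an illustration: the `(position, description)` layout, the function name, and the case-insensitive matching are assumptions, not the procedure actually used to build the dataset.

```python
# Qualifiers named in the text as marking non-experimental annotations.
EXCLUDED_QUALIFIERS = ("By similarity", "Probable", "Potential")

def experimental_zinc_sites(annotations):
    """Keep annotations whose description carries none of the excluded qualifiers.

    annotations: list of (position, description) pairs (hypothetical layout).
    Matching is case-insensitive, which is an assumption of this sketch.
    """
    kept = []
    for position, description in annotations:
        text = description.lower()
        if not any(q.lower() in text for q in EXCLUDED_QUALIFIERS):
            kept.append((position, description))
    return kept

# Hypothetical annotations for one protein.
sites = [(45, "Zinc"), (48, "Zinc (By similarity)"), (70, "Zinc (Potential)")]
print(experimental_zinc_sites(sites))
```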



The PDB structure files of the 127 benchmark proteins were used to generate the SVM
input data. However, the sequential residue numbers of the zinc-binding sites in the
Swiss-Prot annotations are based on the Swiss-Prot sequences, which may differ slightly
from the sequences in the PDB structure files. Directly labeling the zinc-binding
sites from the Swiss-Prot annotations on the PDB sequences would produce many false
positives and false negatives. Therefore, the sequential residue numbers from the
Swiss-Prot annotations were mapped to the residue numbers of the PDB file format
according to the mapping table provided by Martin (2005). Only three zinc-binding
sites were lost in this mapping process.

Several residues were lost in feature generation before the machine learning process;
503 zinc-binding sites remained in the 127 benchmark proteins. The numbers of
zinc-binding CHDEs and the remaining (non-zinc-binding) CHDEs before the prediction
from sequences are shown in Figure 1. The zinc-binding to non-zinc-binding ratios of
C, H, D, and E are 1:2.13, 1:4.36, 1:40.57, and 1:67, respectively. Since C and H
account for 82% of all zinc-binding sites, they were used in most analyses. The
information from D and E was used in the generation of some features.
Proportion of C, H, D, E in the benchmarking data (x axis: amino acid; y axis: number)

Amino acid    Zinc-binding    Non-zinc-binding
C             216             460
H             196             855
CH            412             1315
D             54              2191
E             37              2479
CHDE          503             5985

Figure 1. The proportion of C, H, D, and E in the 127 benchmark proteins. CH is C
and H; CHDE is C, H, D, and E.
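The ratios quoted in the text follow directly from the counts in Figure 1. A short check, with the counts copied from the figure's data labels and all variable names chosen here for illustration:

```python
# (zinc-binding, non-zinc-binding) counts read from the data labels of Figure 1.
counts = {"C": (216, 460), "H": (196, 855), "D": (54, 2191), "E": (37, 2479)}

# Non-zinc-binding residues per zinc-binding residue, i.e. the "x" in "1:x".
ratios = {aa: round(neg / pos, 2) for aa, (pos, neg) in counts.items()}
for aa, value in ratios.items():
    print(f"{aa}: 1:{value}")

# Share of all zinc-binding CHDE residues that are C or H.
total_binding = sum(pos for pos, _ in counts.values())
ch_share = (counts["C"][0] + counts["H"][0]) / total_binding
print(f"C and H share of zinc-binding sites: {ch_share:.0%}")
```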


2.2 Database for BLAST
PSI-BLAST was often used in the feature generation process, either to search an
assigned database for protein sequences similar to the query sequence or to build the
corresponding PSSM. The nr database from NCBI is the most popular database for
PSI-BLAST. However, the nr database is so large that PSI-BLAST searches against it take
a long time. Therefore, a non-redundant database based on PIRSF, a protein
classification in UniProt, was built with the tool in BLAST. Of the 1,425,452 proteins
in PIRSF (version 2008.02), 992,287 were in families that have seed sequences as
representative sequences. In order to further reduce the size of the search database
while keeping the characteristics of each family, two criteria were defined to create
the PIRSF-based database: (1) if a family has more than 3 seed sequences, all of its
seed sequences are collected; (2) if a family has fewer than 4 seed sequences, all of
its members are collected. After this processing, the 565,563 sequences left in the
PIRSF-based database were used in PSI-BLAST to generate the conservation score
features. The nr database (version 2008.01) from NCBI was used in PSI-BLAST to produce
the secondary structure and disulfide features, since these came from tools developed
in previous research.
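The two selection criteria for the PIRSF-based database amount to a simple per-family rule. The sketch below assumes a hypothetical in-memory layout in which each family lists its seed sequences and all of its members (seeds included); it is not the script used in this research.

```python
def select_for_database(families):
    """Pick the sequences of each family that go into the search database.

    families: dict mapping a family id to {"seeds": [...], "members": [...]}
    (hypothetical layout; "members" is assumed to include the seeds).
    """
    selected = []
    for family in families.values():
        if len(family["seeds"]) > 3:
            # Criterion (1): more than 3 seeds, keep only the seed sequences.
            selected.extend(family["seeds"])
        else:
            # Criterion (2): fewer than 4 seeds, keep every family member.
            selected.extend(family["members"])
    return selected

# Hypothetical families.
families = {
    "PIRSF_A": {"seeds": ["a1", "a2", "a3", "a4"], "members": ["a1", "a2", "a3", "a4", "a5", "a6"]},
    "PIRSF_B": {"seeds": ["b1"], "members": ["b1", "b2", "b3"]},
}
print(select_for_database(families))
```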

2.3 Feature description
In this research, the features included sequence-derived features, structural
features, and features for zinc-binding pairs. The features are listed in Table 1.
Each feature is described in detail in Section 2.3.1, 2.3.2, or 2.3.3.





Table 1. The encoding elements of each feature. The attribute number is the number of
values representing the feature. The attribute type follows the input data format of
Weka.

Feature                       Attribute Number    Attribute Type
amino acid                    1                   Nominal
secondary structure           3                   Numeric
solvent accessibility         1                   Numeric
disulfide bond probability    1                   Numeric
Scorecons                     1                   Numeric
PSSM                          22                  Numeric
Position weight matrix        21                  Numeric
SR_oxygen                     1                   Numeric
SR_nitrogen                   1                   Numeric
SR_sulfur                     1                   Numeric
SR_charged                    1                   Numeric
SR_polar                      1                   Numeric
SR_hydrophobic                1                   Numeric
Structural Alphabet           1                   Nominal
distance                      6                   Numeric
probability in range          1                   Numeric
range probability             1                   Numeric

2.3.1 Sequence derived features
Seven features were generated from the protein sequence: amino acid, secondary
structure, solvent accessibility, disulfide bond probability, conservation score,
PSSM, and position weight matrix. Secondary structure, solvent accessibility, and
disulfide bond probability were grouped into the sequence-derived features since they
were predicted from the protein sequence by existing tools.

- Amino acid: The feature is the one-letter code of the 20 amino acids (Table 2).

- Secondary structure: Secondary structure is basic and widely used structural
information in predictions for different bioinformatics problems. It can be predicted
from sequence by PSIPRED (version 2.6) (Jones, 1999). The three probabilities of
helix, sheet, and coil were used to form this feature.

- Solvent accessibility: Some zinc-binding sites are involved in catalytic reactions;
others serve structural stabilization. Theoretically, catalytic zinc-binding sites are
on the exposed area of the protein, while structural zinc-binding sites are in the
buried area. Thus, solvent accessibility might be a good feature for zinc-binding
prediction. The weighted ensemble solvent accessibility predictor (Chen and Zhou,
2005) was used to predict the solvent accessibility of each residue. The feature was
the buried or exposed confidence value, a weighted ensemble of the probabilities given
by Bayesian statistics, multiple linear regression, decision tree, neural network, and
SVM. The weighted ensemble ranges from 0 to 6, where 0 means a buried residue and 6
means an exposed residue.



- Disulfide bond probability: A disulfide bond may form between two Cysteine residues
that are close to each other, and such residues may also be part of the zinc-binding
sites in a protein. The feature is the disulfide bond probability calculated by
DiANNA (Ferre and Clote, 2005), a prediction tool for disulfide bond connectivity. If
a Cysteine had more than one disulfide bond prediction score, the highest score was
used.

Whether a residue is conserved is important because significant locations on proteins
are usually conserved during evolution. The residue conservation score was calculated
by two methods and represented in two features: Scorecons and PSSM.

- Scorecons: The first step in generating the conservation score was to build, with
ClustalW2 (Larkin et al., 2007), a multiple sequence alignment of the query protein
and the similar proteins found by a BLAST search against the PIRSF-based database with
one iteration and an E value of six. If more than 9 seed sequences were among the
proteins found for the query sequence, only the seed sequences were aligned with the
query sequence. Otherwise, the seed sequences (< 10) and at most 200 non-seed
sequences were used. Then, Scorecons (Valdar, 2002) calculated the conservation score
of each residue from the multiple sequence alignment. The value ranges from 0 to 1,
where 1 means the most conserved.
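The rule for choosing which search hits enter the multiple sequence alignment can be sketched as below. The `(sequence_id, is_seed)` layout and the function name are hypothetical, and taking the first 200 non-seed hits in search order is an assumption, since the text only states the upper limit.

```python
def alignment_inputs(query_id, hits, max_non_seed=200):
    """Choose the sequences to align with the query before running Scorecons.

    hits: list of (sequence_id, is_seed) tuples in search-result order
    (hypothetical layout).
    """
    seeds = [sid for sid, is_seed in hits if is_seed]
    if len(seeds) > 9:
        # More than 9 seed sequences: align the query with the seeds only.
        return [query_id] + seeds
    non_seeds = [sid for sid, is_seed in hits if not is_seed]
    # Otherwise: all seeds (fewer than 10) plus at most max_non_seed others.
    # Keeping the first ones in search order is an assumption of this sketch.
    return [query_id] + seeds + non_seeds[:max_non_seed]

# Hypothetical search result with 2 seed hits and 300 non-seed hits.
hits = [("seed1", True), ("seed2", True)] + [(f"hit{i}", False) for i in range(300)]
print(len(alignment_inputs("query", hits)))
```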

- Position Specific Scoring Matrix (PSSM): The PSSM of each protein was generated by
PSI-BLAST (version 2.2.17) against the PIRSF-based database with three iterations and
an E value of four. The PSSM values were normalized by the formula in Shu et al.
(2008):

\[ \mathrm{PSSMscore}_k = \frac{M_{k,aa(k)} - \mathrm{MIN\_M}_{aa(k)}}{\mathrm{MAX\_M}_{aa(k)} - \mathrm{MIN\_M}_{aa(k)}} \]

Here k denotes the kth residue in the sequence and aa(k) is its amino acid type.
M_{k,aa(k)} is the PSSM value at the kth position for amino acid aa(k). MIN_M_{aa(k)}
is the minimal value of aa(k) over the whole sequence, and MAX_M_{aa(k)} is the
maximal value of aa(k) over the whole sequence. In addition to the PSSM, the
information per position and the relative weight of gapless real matches to
pseudocounts in the PSSM profile were also included in this feature.

- Position weight matrix (PWM): 21 values were calculated via the PWM for one
zinc-binding window of 21 residues. A zinc-binding window was defined as a sequence
segment centered on the zinc-binding residue and extending 10 residues backward and
forward (Figure 2). The zinc-binding windows in the training data file were gathered
to form the PWM using the following formula:

\[ \mathrm{PWM}(aa(j), i) = \log_2 \left( \frac{\left( \mathrm{number}(aa(j), i) + 0.05 \right) / \left( \mathrm{totalNumber}(i) + 1 \right)}{F_{aa(j)}} \right) \]

Here aa(j) is one of the 20 amino acids, number(aa(j), i) is the number of occurrences
of aa(j) at position i, totalNumber(i) is the number of all amino acids at position i,
and F_{aa(j)} is the amino acid composition of aa(j) in the UniProtKB statistics
(Table 2).
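The PSSM normalization described in this section can be sketched in a few lines. The data layout (one dictionary of raw scores per sequence position) and the handling of a constant column are assumptions of this sketch, not details given by Shu et al. (2008) or by this research.

```python
def normalized_pssm_scores(sequence, pssm):
    """Min-max normalize the PSSM value of each residue's own amino acid.

    pssm[k][aa] is the raw PSSM value at position k for amino acid aa
    (hypothetical layout). MIN and MAX are taken over the whole sequence
    for the amino acid aa(k), as in the definitions above.
    """
    scores = []
    for k, aa in enumerate(sequence):
        column = [row[aa] for row in pssm]
        lowest, highest = min(column), max(column)
        if highest == lowest:
            # Assumption: a constant column is mapped to 0 to avoid dividing by zero.
            scores.append(0.0)
        else:
            scores.append((pssm[k][aa] - lowest) / (highest - lowest))
    return scores

# Toy three-residue example with made-up PSSM values.
toy_pssm = [{"C": 9, "H": -3}, {"C": -1, "H": 8}, {"C": 4, "H": -2}]
print(normalized_pssm_scores("CHC", toy_pssm))
```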


Table 2. Amino acid composition in UniProtKB statistics

Name             One letter   Composition     Name             One letter   Composition
Alanine          A            0.0857          Leucine          L            0.0985
Arginine         R            0.0555          Lysine           K            0.0521
Asparagine       N            0.0419          Methionine       M            0.0241
Aspartic acid    D            0.0526          Phenylalanine    F            0.0405
Cysteine         C            0.0135          Proline          P            0.0483
Glutamine        Q            0.039           Serine           S            0.0681
Glutamic acid    E            0.0605          Threonine        T            0.0559
Glycine          G            0.0706          Tryptophan       W            0.0133
Histidine        H            0.0222          Tyrosine         Y            0.0302
Isoleucine       I            0.0592          Valine           V            0.0665
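With the background compositions of Table 2, the PWM score defined in Section 2.3.1 can be computed as in the following sketch. Only a few background values are copied here, the windows are shortened to three residues for readability (the research used 21-residue windows), and the function name is hypothetical.

```python
import math

# Background compositions for a few amino acids, copied from Table 2.
BACKGROUND = {"A": 0.0857, "C": 0.0135, "G": 0.0706, "H": 0.0222}

def pwm_score(windows, position, amino_acid, background=BACKGROUND):
    """log2 of the smoothed frequency at a window position over the background."""
    column = [window[position] for window in windows]
    number = column.count(amino_acid)      # number(aa(j), i)
    total = len(column)                    # totalNumber(i)
    frequency = (number + 0.05) / (total + 1)
    return math.log2(frequency / background[amino_acid])

# Toy windows (hypothetical) centered on a Cysteine at position 1.
windows = ["ACG", "HCA", "GCA"]
print(pwm_score(windows, 1, "C"))  # enriched at the center, so positive
print(pwm_score(windows, 1, "H"))  # absent at the center, so negative
```

The pseudocounts 0.05 and 1 keep the logarithm finite for amino acids that never occur at a position.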














Figure 2. Weblogo (Crooks et al., 2004) of zinc-binding windows for (a) Cysteine and
(b) Histidine.

2.3.2 Structural features
Seven features were grouped into structural features: SR_oxygen, SR_nitrogen,
SR_sulfur, SR_charged, SR_polar, SR_hydrophobic, and structural alphabet.

A zinc ion interacts with more than one amino acid in three-dimensional space at the
same time. The binding atom is usually an electron-donating atom, such as oxygen,
nitrogen, or sulfur. The distance between the zinc ion and a zinc-binding atom is
usually less than 2.8Å. Therefore, the distance from one zinc-binding atom to another
one bound to the same ion is less than 5.6Å. In this research, the structural
neighborhood of a specific amino acid was defined as the region within a shortest
distance of 5.6Å of any atom of that amino acid. Figure 3 shows in green the
structural neighborhood of a Cysteine in a zinc-binding area. Six of the seven
structural features were generated in the structural neighborhood:

- SR_oxygen: the number of oxygen atoms.
- SR_nitrogen: the number of nitrogen atoms.
- SR_sulfur: the number of sulfur atoms.
- SR_charged: the number of charged residues.
- SR_polar: the number of polar residues.
- SR_hydrophobic: the number of hydrophobic residues.
The specific amino acid itself was included in the calculation of the six features.
Charged residues are H, R, K, E, and D. Polar residues are Q, T, S, N, C, Y, and W.
Hydrophobic residues are G, F, L, M, A, I, P, and V (Bartlett et al., 2002).
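The six neighborhood counts can be sketched as follows. The atom tuple layout is hypothetical, and two details are assumptions of this sketch rather than statements of the text: the cutoff is treated as inclusive, and a residue is counted once if any of its atoms lies in the neighborhood.

```python
import math

CHARGED = set("HRKED")
POLAR = set("QTSNCYW")
HYDROPHOBIC = set("GFLMAIPV")

def structural_features(target_atoms, all_atoms, cutoff=5.6):
    """Count atoms and residues within cutoff (in Å) of any atom of the target.

    Each atom is (residue_id, one_letter_residue, element, (x, y, z)), a
    hypothetical layout. target_atoms must also appear in all_atoms so that
    the target residue itself is included in the counts.
    """
    near = [a for a in all_atoms
            if any(math.dist(a[3], t[3]) <= cutoff for t in target_atoms)]
    # A residue counts once if at least one of its atoms is in the neighborhood.
    residues = {a[0]: a[1] for a in near}
    return {
        "SR_oxygen": sum(a[2] == "O" for a in near),
        "SR_nitrogen": sum(a[2] == "N" for a in near),
        "SR_sulfur": sum(a[2] == "S" for a in near),
        "SR_charged": sum(r in CHARGED for r in residues.values()),
        "SR_polar": sum(r in POLAR for r in residues.values()),
        "SR_hydrophobic": sum(r in HYDROPHOBIC for r in residues.values()),
    }

# Toy coordinates (made up): a Cysteine with a nearby Histidine and Aspartic acid.
cys = [(1, "C", "S", (0.0, 0.0, 0.0)), (1, "C", "N", (1.0, 0.0, 0.0))]
others = [(2, "H", "N", (3.0, 0.0, 0.0)), (2, "H", "C", (9.0, 0.0, 0.0)),
          (3, "L", "O", (20.0, 0.0, 0.0)), (4, "D", "O", (0.0, 5.0, 0.0))]
print(structural_features(cys, cys + others))
```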

Figure 3. An example of a structural neighborhood. The structural neighborhood of the
amino acid (Cysteine) is shown in green. The six structural features represent the
structural environment of the amino acid. The zinc-binding site is from PDB entry
1XC3, as displayed in PDBsum.

- Structural Alphabet: The local 3D structure was represented by a one-dimensional
alphabet of 16 letters, corresponding to 16 kinds of backbone shape of five-residue
segments. The encoding program is from the web tool Protein Block Expert (Tyagi et
al., 2006).



2.3.3 Features of paired residues
Previous studies indicated the importance of adding the information of zinc-binding
pairs to machine learning. Hence, the pair information of the dataset in this research
was also examined. Please see Section 2.4.1 for more information about the selection
of paired residues. In Figure 4, the y axis shows the number of zinc-binding pairs,
and the x axis shows the pair distance, that is, the distance between the first
residue (C or H) and the second residue (C, H, D, or E). The cases in which the first
residue is C or H are shown in Figure 4a and 4b, respectively. In both cases, a
residue has a high probability of pairing with another residue of the same amino acid
type, especially for Cysteine. Note that only pair distances of less than 50 residues
were considered here, because very few zinc-binding pairs lie beyond this distance
(as shown in Figure 4c, which considers both C and H as the first residue).









[Figure 4: panel (a) is titled C and panel (b) is titled H, each with the legend C, H,
D, and E; panel (c) is titled "Distribution of zinc-binding pair distance". In all
panels the x axis is Distance and the y axis is Number.]

Figure 4. The distributions of the number of zinc-binding pairs versus pair distances
corresponding to different kinds of the first residue in the paired residues. (a) The
first residue is Cysteine. (b) The first residue is Histidine. (c) All zinc-binding
pairs.

The following three features describe the characteristics of a zinc-binding pair:
distance, probability in range, and range probability.

- Distance: In addition to the pair distance itself, five binary numbers were encoded
from the pair distance to indicate which distance range the pair belongs to. A
different encoding matrix was used depending on the first residue (Table 3). For
example, this feature for a C and H pair with a distance of 15 was 15, 0, 0, 1, 0, 0;
for an H and C pair with a distance of 15 it was 15, 0, 1, 0, 0, 0.

Table 3. The encoding matrices of the first residue in a zinc-binding pair. The
distance ranges were decided according to the bell-shaped sub-distributions of the
number of zinc-binding pairs in Figure 4a and 4b.

             Cysteine    Histidine
Matrix       Distance    Distance
1,0,0,0,0    1~8         1~8
0,1,0,0,0    9~14        9~16
0,0,1,0,0    15~25       17~32
0,0,0,1,0    26~40       33~47
0,0,0,0,1    40~         48~
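The distance feature and the two worked examples from the text can be reproduced with a small sketch. The function and constant names are hypothetical. Table 3 lists both 26~40 and 40~ for Cysteine, so assigning a distance of exactly 40 to the fourth range is an assumption.

```python
# Upper bounds of the first four distance ranges in Table 3; any larger
# distance falls in the fifth range. For Cysteine, a distance of exactly 40
# is placed in the fourth range (assumption, see the lead-in).
UPPER_BOUNDS = {"C": (8, 14, 25, 40), "H": (8, 16, 32, 47)}

def encode_distance(first_residue, distance):
    """Return the six-value distance feature: the distance plus a one-hot range."""
    bounds = UPPER_BOUNDS[first_residue]
    index = next((i for i, upper in enumerate(bounds) if distance <= upper),
                 len(bounds))
    one_hot = [0] * 5
    one_hot[index] = 1
    return [distance] + one_hot

print(encode_distance("C", 15))  # C and H pair with a distance of 15
print(encode_distance("H", 15))  # H and C pair with a distance of 15
```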









- Probability in range: This feature is the occurrence probability of a pair within
its distance range. The probability is calculated using the following formula:

\[ \text{Probability in range} = \frac{\mathrm{pairNumber}(aa_1(i), dis, aa_2(j))}{\mathrm{pairNumber}(aa_1(i), \mathrm{range}(dis))} \]

Here aa_1(i) is the first residue, which is C or H, and aa_2(j) is the second residue,
which is one of C, H, D, and E. pairNumber(aa_1(i), dis, aa_2(j)) is the number of
pairs of aa_1(i) and aa_2(j) with distance dis. pairNumber(aa_1(i), range(dis)) is the
number of pairs of aa_1(i) in the distance range to which dis belongs. For example,
for a C and H pair with a distance of 15, there are 4 C and H pairs with a distance of
15 and 75 pairs in the range 15~25 of C, so the probability in range is 4/75.

- Range probability: This feature is the occurrence probability of a pair within a
specific range. The probability was calculated using the following formula:

\[ \text{Range probability} = \frac{\mathrm{pairNumber}(aa_1(i), \mathrm{range}(dis))}{\mathrm{totalNumber}(aa_1(i))} \]

pairNumber(aa_1(i), range(dis)) is the number of pairs of aa_1(i) in the distance
range to which dis belongs, and totalNumber(aa_1(i)) is the total number of pairs of
aa_1(i). For example, for a C and H pair with a distance of 15, there are 75 pairs in
the range 15~25 of C and the total pair number of C is 210, so the range probability
is 75/210.
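Both pair probabilities are simple count ratios over the zinc-binding pairs of the training data. The sketch below uses a hypothetical list of `(first_residue, distance, second_residue)` tuples and a caller-supplied `range_of` function; the toy data are constructed only so that the counts match the worked example (4, 75, and 210).

```python
def pair_probabilities(pairs, first, second, distance, range_of):
    """Return (probability in range, range probability) for one residue pair.

    pairs: (first_residue, distance, second_residue) tuples for the
    zinc-binding pairs in the training data (hypothetical layout).
    range_of(first_residue, distance) returns a label for the distance range.
    Raises ZeroDivisionError if no training pair shares the first residue or range.
    """
    same_first = [p for p in pairs if p[0] == first]
    target_range = range_of(first, distance)
    in_range = [p for p in same_first if range_of(first, p[1]) == target_range]
    exact = sum(1 for p in in_range if p[1] == distance and p[2] == second)
    return exact / len(in_range), len(in_range) / len(same_first)

def toy_range(first, distance):
    # Only the 15~25 range of Cysteine matters for this toy example.
    return "15~25" if 15 <= distance <= 25 else "other"

# Made-up pairs: 4 C-H pairs at distance 15, 71 other pairs in 15~25, 135 elsewhere.
toy_pairs = [("C", 15, "H")] * 4 + [("C", 20, "C")] * 71 + [("C", 5, "C")] * 135
print(pair_probabilities(toy_pairs, "C", "H", 15, toy_range))
```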


2.4 Machine learning
The two-class prediction of zinc-binding sites was carried out with the publicly
available program LIBSVM (version 2.8) (Chang and Lin, 2001) with a radial basis
function kernel. The WLSVM interface (EL-Manzalawy and Honavar, 2005) in Weka
(version 3.5.7) (Witten and Frank, 2005) was further used to handle the various input
data formats for LIBSVM.
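For reference, the radial basis function kernel used by LIBSVM has the form K(x, y) = exp(-gamma * ||x - y||^2). The following stand-alone sketch evaluates it for two numeric vectors; it does not call LIBSVM, and the gamma value is arbitrary rather than the value selected in this research.

```python
import math

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Identical vectors give 1; the value decays toward 0 as the vectors move apart.
print(rbf_kernel([0.8, 0.1, 0.1], [0.8, 0.1, 0.1], gamma=0.5))
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
```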

2.4.1 SVM vector encoding
The SVM vector is the input unit of the SVM classifier. Each vector contains all
features required for the prediction. In this research, each feature was represented
by one or several nominal or numeric attributes, as listed in Table 1. A vector was
labeled “+1” if the residue binds zinc; otherwise, it was labeled “-1”.

Single residue vector
Each residue of a protein had 50 attributes in its single residue vector for
prediction from sequence and 57 attributes for prediction from structure. Zinc-binding
residues were labeled as the “+1” class, and non-zinc-binding residues as the “-1”
class.
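The attribute totals can be checked against the attribute numbers in Table 1. The dictionaries below simply restate that table; the variable names are chosen here for illustration.

```python
# Attribute numbers per feature, restated from Table 1.
SEQUENCE_FEATURES = {
    "amino acid": 1, "secondary structure": 3, "solvent accessibility": 1,
    "disulfide bond probability": 1, "Scorecons": 1, "PSSM": 22,
    "position weight matrix": 21,
}
STRUCTURAL_FEATURES = {
    "SR_oxygen": 1, "SR_nitrogen": 1, "SR_sulfur": 1, "SR_charged": 1,
    "SR_polar": 1, "SR_hydrophobic": 1, "structural alphabet": 1,
}
PAIR_FEATURES = {"distance": 6, "probability in range": 1, "range probability": 1}

sequence_total = sum(SEQUENCE_FEATURES.values())
structure_total = sequence_total + sum(STRUCTURAL_FEATURES.values())
pair_total = sum(PAIR_FEATURES.values())

print(sequence_total, structure_total)  # single residue vector sizes
# A paired residues vector holds two single residue vectors plus the pair features.
print(2 * sequence_total + pair_total, 2 * structure_total + pair_total)
```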



C, 0.821, 0.144, 0.057, 2.56, 0……, -1
(fields in order: amino acid; secondary structure, three values; solvent
accessibility; disulfide bond; the other attributes; zinc-binding label)

Figure 5. An example of a single residue vector and its zinc-binding label. The
attribute numbers and attribute types are listed in Table 1.

Paired residues vector
The conservation-based method of Shu et al. (2008) was used as the principle for
selecting paired residues. Note that there were two residue conservation scores here:
one from Scorecons and the other from PSSM. Residues whose conservation score from
PSSM was > 0.9 (Figure 6a, 6b) and whose conservation score from Scorecons was > 0.3
(Figure 6c, 6d) were selected and paired within a distance range of 100 residues. In
Figure 6, the y axis shows the fraction of conserved residues out of all residues
(Figure 6a, 6c) or out of all zinc-binding residues (Figure 6b, 6d) for the different
conservation thresholds on the x axis. These selection thresholds kept about 80% of
the zinc-binding sites. After the addition of the three paired residues features,
there were 108 attributes in prediction from sequence (50 attributes for the first
residue, 50 attributes for the second residue, and eight attributes for the pair
features) and 122 attributes in prediction from structure (57 attributes for the first
residue, 57 attributes for the second residue, and eight attributes for the pair
features). A pair was labeled as the “+1” class if both residues bind zinc, even if
they bind different zinc ions.
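The selection and labeling of paired residues can be sketched as follows. The residue dictionary layout is hypothetical. Restricting the first residue to C or H and the second to C, H, D, or E follows Section 2.3.3; treating the 100-residue limit as inclusive and ordering each pair by sequence position are assumptions of this sketch.

```python
def select_pairs(residues, pssm_min=0.9, scorecons_min=0.3, max_distance=100):
    """Pair conserved candidate residues and label each pair.

    residues: dicts with keys 'pos', 'aa', 'pssm', 'scorecons', 'binds_zinc'
    (hypothetical layout), sorted by sequence position.
    """
    conserved = [r for r in residues
                 if r["aa"] in "CHDE"
                 and r["pssm"] > pssm_min and r["scorecons"] > scorecons_min]
    pairs = []
    for i, first in enumerate(conserved):
        if first["aa"] not in "CH":
            continue  # the first residue of a pair is C or H
        for second in conserved[i + 1:]:
            if second["pos"] - first["pos"] > max_distance:
                break  # later residues are even farther away
            # "+1" only if both residues bind zinc, possibly different ions.
            label = 1 if first["binds_zinc"] and second["binds_zinc"] else -1
            pairs.append((first["pos"], second["pos"], label))
    return pairs

# Made-up residues for illustration.
toy = [
    {"pos": 10, "aa": "C", "pssm": 0.95, "scorecons": 0.5, "binds_zinc": True},
    {"pos": 13, "aa": "C", "pssm": 0.95, "scorecons": 0.5, "binds_zinc": True},
    {"pos": 40, "aa": "H", "pssm": 0.50, "scorecons": 0.5, "binds_zinc": True},
    {"pos": 60, "aa": "D", "pssm": 0.95, "scorecons": 0.4, "binds_zinc": False},
    {"pos": 200, "aa": "H", "pssm": 0.99, "scorecons": 0.9, "binds_zinc": True},
]
print(select_pairs(toy))
```

In the toy data the Histidine at position 40 binds zinc but fails the PSSM threshold, so it forms no pair, which illustrates how the thresholds trade coverage of zinc-binding sites against the number of candidate pairs.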





24





Distribution of conserved residues out of all residues
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0
.
0
7
5
0
.
15
0.225
0.3
0.375
0
.
4
5
0
.
5
2
5
0.6
0
.
6
7
5
0
.
75
0.825
0.9
0.975
Conservation score from PSSM
Fraction
C
H
D
E
(a)
(b)
Distribution of conserved zinc-binding sites out of all zinc-binding sites
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0.0
75
0.1
5
0.225
0.3
0.3
75
0.45
0.525
0.6
0.675
0.75
0.8
2
5
0
.
9
0
.
975
Conservation score from PSSM
Fraction
C
H
D
E


25




Figure 6. The distribution of conserved residues and conserved zinc-binding sites. (a)(c)
The fraction of conserved residues out of all residues at different conservation
thresholds from (a) PSSM and (c) Scorecons. (b)(d) The fraction of conserved
zinc-binding sites out of all zinc-binding sites at different conservation thresholds from
(b) PSSM and (d) Scorecons.


[Figure 6, panels (c) and (d). (c) Distribution of conserved residues out of all residues. (d) Distribution of conserved zinc-binding sites out of all zinc-binding sites. x axis: conservation score from Scorecons (0 to 0.975); y axis: fraction (0 to 1); series: C, H, D, E.]




Figure 7. An example of a paired residues vector and its zinc-binding label. The
attribute number and attribute type were listed in Table 1.

Gated vector
To give a final result for each residue in the test set, a gated score combined the
zinc-binding probabilities from the single residue predictor and the paired
residues predictor, using the formula suggested in Shu et al. (2008):

$$\mathrm{gatedScore} = P_s(aa) + \bigl(1 - P_s(aa)\bigr) \times P_p(aa)$$

where $P_s(aa)$ is the zinc-binding probability of the residue aa from the single residue
predictor and $P_p(aa)$ is the maximum zinc-binding probability of the residue aa from the
paired residues predictor. If the residue aa has no paired residues information,
$P_p(aa)$ was assigned 0. The gated vector of a specific residue contained its gated
score. The gated vector was labeled as the “+1” class if the residue binds zinc;
otherwise, it was labeled as the “-1” class.
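
A minimal sketch of the gated score, assuming the formula reads gatedScore = P_s(aa) + (1 - P_s(aa)) * P_p(aa), with P_p(aa) taken as the maximum over all pairs that contain the residue and set to 0 when there are none. The function name and the example probabilities are hypothetical.

```python
def gated_score(p_single, p_paired_list):
    """Combine single-residue and paired-residue probabilities.

    p_single: zinc-binding probability from the single residue predictor.
    p_paired_list: probabilities from every pair containing the residue;
    an empty list means the residue has no paired residues information.
    """
    p_pair = max(p_paired_list) if p_paired_list else 0.0
    return p_single + (1.0 - p_single) * p_pair


# Made-up probabilities for illustration only.
no_pair = gated_score(0.3, [])            # falls back to the single score
with_pair = gated_score(0.3, [0.2, 0.5])  # 0.3 + 0.7 * 0.5
```

Because the pair term is scaled by (1 - P_s), the gated score never falls below the single residue probability and never exceeds 1.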



Figure 8. An example of a gated vector and its zinc-binding label.
Gated vector (Figure 8): C, 0.10857183782813064, -1
Field labels shown in the figure: amino acid, gated score, zinc-binding label.

Paired residues vector (Figure 7): The first residue, the second residue, 25, 0, 0, 1, 0, 0, 0.5048, 0.0189, +1
Field labels shown in the figure: single residue vector (twice), distance, probability in range, range probability, zinc-binding label.


2.4.2 Performance measurement
The following performance measures were defined in this research. TP (true
positives) is the number of zinc-binding residues that are correctly predicted; TN (true
negatives) is the number of non-zinc-binding residues that are correctly predicted; FP
(false positives) is the number of non-zinc-binding residues that are predicted as
zinc-binding residues; FN (false negatives) is the number of zinc-binding residues that
are predicted as non-zinc-binding residues. Recall, or true positive rate (TPR), is
defined as TP/(TP+FN), the proportion of zinc-binding residues that are correctly
predicted. Precision is defined as TP/(TP+FP), the proportion of true zinc-binding
residues among the predicted zinc-binding residues. False positive rate (FPR) is defined
as FP/(TN+FP), the proportion of non-zinc-binding residues that are incorrectly
predicted as zinc-binding.
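
The measures defined above, together with the Matthews correlation coefficient that this section uses for feature-level comparisons, can be sketched as follows. The counts in the example are made up for illustration, and returning 0 for MCC when its denominator is zero is a common convention assumed here, not something stated in this research.

```python
import math


def recall(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp, fp):
    """TP / (TP + FP)."""
    return tp / (tp + fp)


def false_positive_rate(fp, tn):
    """FP / (TN + FP)."""
    return fp / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical confusion counts, not results from this research.
tp, fn, fp, tn = 80, 20, 20, 180
scores = {
    "recall": recall(tp, fn),
    "precision": precision(tp, fp),
    "fpr": false_positive_rate(fp, tn),
    "mcc": mcc(tp, tn, fp, fn),
}
```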

Receiver operating characteristic (ROC) curves and recall-precision (RP) curves
represented the performance of the predictor on 10-fold cross validation. The area
under the ROC curve (AUC) and the area under the RP curve (AURPC) were calculated
by a public Java program (Jesse and Mark, 2006), where 1 was perfect and 0 was the
worst. In predictions with each feature or subset of features, performance was
measured by the Matthews correlation coefficient (MCC), defined as follows:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

where MCC is 1 for the best prediction, 0 for a random prediction, and -1 for the worst
prediction.

2.4.3 Parameter selection
Different parameters influence the performance of the SVM classifier. To simplify the
process of parameter selection, different pairs of parameter C and parameter Gamma
for the SVM with a radial basis function kernel were applied to get the corresponding
10-fold cross validation results. In a run of cross validation, each fold used the same
pair of C and Gamma. The value of C was $2^i$, where the integer i varied from -5 to 10.
The value of Gamma was $2^j$, where the integer j varied from -10 to 0. In total, 176
pairs of C and Gamma were tested to find the parameter pair with the best AURPC for a
single residue predictor, a paired residues predictor, or a gated predictor. Note that
the probability estimate option of LIBSVM was enabled to produce output suitable for
drawing ROC and RP curves.
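
The parameter grid described above can be enumerated as in the sketch below. The `select_best` helper and the `evaluate` callback are hypothetical: in practice `evaluate` would run 10-fold cross validation for one (C, Gamma) pair and return its AURPC, whereas the toy function used here only demonstrates the selection step.

```python
import math
from itertools import product

C_EXPONENTS = range(-5, 11)      # i = -5 .. 10, so C = 2**i
GAMMA_EXPONENTS = range(-10, 1)  # j = -10 .. 0, so Gamma = 2**j

# Every (C, Gamma) pair in the grid: 16 * 11 = 176 pairs.
grid = [(2.0 ** i, 2.0 ** j) for i, j in product(C_EXPONENTS, GAMMA_EXPONENTS)]


def select_best(grid, evaluate):
    """Return the (C, Gamma) pair with the highest score from `evaluate`."""
    return max(grid, key=lambda pair: evaluate(*pair))


def toy_evaluate(c, gamma):
    """Stand-in for cross-validated AURPC; peaks at C = 2**2, Gamma = 2**-2."""
    return -abs(math.log2(c) - 2) - abs(math.log2(gamma) + 2)


best_c, best_gamma = select_best(grid, toy_evaluate)
```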

2.5 The flowchart of a prediction process
The flowchart of a prediction process is shown in Figure 9. Overall, the single
residue prediction and the paired residues prediction provided their results to form
gated vectors. After the first 10-fold cross validation, the result of the single residue
predictor was kept to assess its performance, and the 10 gated vector datasets underwent
the second 10-fold cross validation. Note that in our experiment the fold separation
of the data followed the 10 gated vector datasets. The performance of the gated
predictor was obtained after the second 10-fold cross validation. Importantly, the
fold separation depended on the proteins, not on the zinc-binding residues. The reason
is the appropriate usage of the PWM feature. The PWM is formed directly from the
zinc-binding windows, and it gives a very high score to any sequence segment that is
similar to a zinc-binding window it was built from. Hence it is not appropriate to
generate the PWM from all benchmark proteins, because doing so would result in an
overestimated performance.
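
A protein-level fold separation of the kind described above can be sketched as follows. The function names, the random assignment, and the record layout are assumptions made for illustration; how proteins were actually assigned to folds in this research is not specified here.

```python
import random


def protein_folds(protein_ids, k=10, seed=0):
    """Assign each protein (not each residue) to one of k folds."""
    unique = sorted(set(protein_ids))
    random.Random(seed).shuffle(unique)
    return {pid: idx % k for idx, pid in enumerate(unique)}


def split(records, fold_of, test_fold):
    """records: list of (protein_id, vector). Returns (train, test)."""
    train = [r for r in records if fold_of[r[0]] != test_fold]
    test = [r for r in records if fold_of[r[0]] == test_fold]
    return train, test


# Hypothetical data: 20 proteins with 3 candidate residues each.
records = [(f"P{p:02d}", (p, r)) for p in range(20) for r in range(3)]
fold_of = protein_folds([pid for pid, _ in records])

# A PWM for this fold would be built from `train` only, so no test
# protein contributes zinc-binding windows to its own scoring matrix.
train, test = split(records, fold_of, test_fold=0)
```

Since every residue of a protein follows that protein into a single fold, no protein appears in both the training and the test set.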



[Figure 9 flowchart. Boxes in the flowchart: Start; Benchmark dataset (127 proteins); Generate features for each residue (except PWM); Get C, H datasets; 10-fold cross validation (training proteins, test proteins); Forming of single residue vectors (including PWM); Single residue training vectors (only from training proteins); Single residue test vectors (only from test proteins); Single residue prediction (zinc-binding probability for test vectors); Result of single residue prediction; Forming of paired residues vectors (including PWM); Paired residues training vectors (only from training proteins); Paired residues test vectors (only from test proteins); Paired residues prediction (zinc-binding probability for test vectors); Gated score calculation and vector forming; Gated vectors; 10-fold cross validation; Gated prediction; Gated result; Final result; End.]


Figure 9. The flowchart of a prediction of zinc-binding sites














Chapter 3: Results and discussions
In both prediction from sequences and prediction from structures, the gated
predictor showed a marked improvement in AURPC over the single residue predictor
after paired residues information was added to the single residue information, consistent
with previous studies. Predictions with each feature or subset of features were further
tested to see which features were useful for discriminating zinc-binding CH from all CH
in the proteins. The dataset from previous studies was also tested for the comparison
of zinc-binding predictors.

3.1 Performances of predictions of zinc-binding sites
Results of parameter selection
The predictions with different SVM parameters were tested to identify the best
parameters. The result is shown in Table 4. For details of the parameter selection,
refer to Section 2.4.3.

Table 4. The best SVM parameters used in the predictions after parameter selection.
The SVM parameters C and Gamma are shown in (C, Gamma) form as log2 values.

Prediction   Predictor         C          H          CH
sequence     Single residue    (-1, -1)   (-2, -3)   (-2, -2)
sequence     Paired residues   (2, -2)    (2, -2)    (1, -2)
sequence     gated             (-3, -3)   (1, -5)    (-5, -1)
structure    Single residue    (3, -7)    (1, -1)    (1, -2)
structure    Paired residues   (2, -7)    (9, -3)    (2, -3)
structure    gated             (-5, -1)   (-3, -2)   (-5, -1)



Prediction from sequences
The ROC and RP curves of prediction from the sequences are shown in Figure 10
and Figure 11. The single residue predictor achieved an AUC of 0.892 for C, 0.856 for H,
and 0.883 for CH, and an AURPC of 0.781 for C, 0.604 for H, and 0.706 for CH. The
gated predictor achieved an AUC of 0.906 for C, 0.855 for H, and 0.882 for CH, and an
AURPC of 0.825 for C, 0.648 for H, and 0.738 for CH.

Prediction from structures
The ROC and RP curves of prediction from the structures are shown in Figure 12
and Figure 13. The single residue predictor achieved an AUC of 0.925 for C, 0.872 for H,
and 0.904 for CH, and an AURPC of 0.825 for C, 0.692 for H, and 0.772 for CH. The
gated predictor achieved an AUC of 0.928 for C, 0.890 for H, and 0.907 for CH, and an
AURPC of 0.860 for C, 0.745 for H, and 0.787 for CH.
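
For readers unfamiliar with how an AUC value is obtained from predicted probabilities, the sketch below computes the area under an ROC curve with the trapezoidal rule. It is not the Java program used in this research, the labels and scores are made up, and the area under an RP curve requires a different interpolation that is not covered here.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR), sweeping the threshold downwards."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Handle tied scores together so they yield a single point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def area_under(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up labels (+1 zinc-binding, -1 not) and predicted probabilities.
labels = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = area_under(roc_points(labels, scores))
```

In this toy example, eight of the nine positive-negative pairs are ranked correctly, so the area is 8/9.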














Figure 10. ROC curves of prediction from sequences using dataset (a) C only, (b) H
only, (c) C and H together.

[Figure 10 plots: panels (a) C, (b) H, (c) CH. x axis: false positive rate (0 to 1); y axis: true positive rate (0 to 1); curves: single predictor and gated predictor.]






Figure 11. RP curves of prediction from sequences using dataset (a) C only, (b) H
only, (c) C and H together.


[Figure 11 plots: panels (a) C, (b) H, (c) CH. x axis: recall (0 to 1); y axis: precision (0 to 1); curves: single predictor and gated predictor.]






Figure 12. ROC curves of prediction from structures using dataset (a) C only, (b) H
only, (c) C and H together.

[Figure 12 plots: panels (a) C, (b) H, (c) CH. x axis: false positive rate (0 to 1); y axis: true positive rate (0 to 1); curves: single predictor and gated predictor.]






Figure 13. RP curves of prediction from structures using dataset (a) C only, (b) H only,
(c) C and H together.


[Figure 13 plots: panels (a) C, (b) H, (c) CH. x axis: recall (0 to 1); y axis: precision (0 to 1); curves: single predictor and gated predictor.]


3.2 Observation on ROC curves and RP curves
Comparing the AUC and AURPC values of the single residue predictor and the
gated predictor (Figures 10 to 13), the paired residues information raised the AURPC
by 4% and 3.4% on average over that of the single residue predictor for the
prediction from sequences and the prediction from structures, respectively. The
improvement in the low-recall region is especially obvious. However, the AUC rose by
only 0.4% and 1.8%. The reason is that the paired residues information can improve
the true positive rate but slightly raises the false positive rate at the same time. A
higher true positive rate with almost the same overall accuracy resulted in a clear
improvement only in the AURPC.
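
The average AURPC gains quoted above can be reproduced from the values reported in Section 3.1, reading each gain as an absolute difference in area expressed in percentage points and averaging over the C, H, and CH datasets. The variable and function names below are illustrative only.

```python
# AURPC values as reported in Section 3.1.
single = {
    "sequence": {"C": 0.781, "H": 0.604, "CH": 0.706},
    "structure": {"C": 0.825, "H": 0.692, "CH": 0.772},
}
gated = {
    "sequence": {"C": 0.825, "H": 0.648, "CH": 0.738},
    "structure": {"C": 0.860, "H": 0.745, "CH": 0.787},
}


def mean_gain(before, after):
    """Average absolute gain over the datasets, in percentage points."""
    gains = [after[name] - before[name] for name in before]
    return 100.0 * sum(gains) / len(gains)


gain = {kind: round(mean_gain(single[kind], gated[kind]), 1) for kind in single}
# gain["sequence"] is 4.0 and gain["structure"] is 3.4
```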

In the performance comparison of prediction from sequences and from structures, the
structural features raised the AURPC by 6.6% and 6.0% over the use of sequence
derived features only, for single residue predictors and gated predictors,
respectively. The AUC rose by only 1.3% and 2.7%. The reason is that the structural
features in this research improved the accuracy on zinc-binding data more than that on
non-zinc-binding data. Since the AURPC is more sensitive to accuracy on positives than
the AUC, the improvement in AURPC is larger.

The zinc-binding predictors performed better on C than on CH, and better on CH
than on H. On average, the AUC for C was 4.5% higher than that for H, and the AURPC
for C was 15.1% higher than that for H. There are two probable reasons. First,
compared to H, C has a lower amino acid composition (Table 2), which results in a less
unbalanced ratio of zinc-binding residues to non-zinc-binding residues. The ratio is
1:2.13 for C and 1:4.36 for H. In general, the more balanced the dataset is, the better
the predictor is expected to perform. Second, the proportion of zinc-binding pairs of C
with distance < 11 is 78.81% (Figure 4a), while the proportion for H is only 28.27%
(Figure 4b). This made the zinc-binding pattern more obvious in the PWM of
C (Figure 2). The less unbalanced dataset and the more discriminative
feature for C resulted in better performance.

3.3 Prediction with each feature or a subset of features
In order to identify useful features, the results of single residue predictions with
each feature or a subset of features are shown in Figure 14. Structural features are the
features described in Section 2.3.2. Sequence features are the features used in single
residue prediction from the protein sequence. All features comprise the structural
features and the sequence features, that is, the features used in single residue
prediction from the protein structure. The default SVM parameters were used.
Performance was measured by the MCC value. Features whose MCC value was less than 0.2
are not shown in the figure.

Overall, the important features are Scorecons, SR_sulfur, PSSM, and PWM (Figure
14c). Note that Scorecons and SR_sulfur each have only one attribute in the SVM input
data, while PSSM has 22 attributes and PWM has 21 attributes. The difference in MCC
between a single feature and a subset or all features can be viewed as the
improvement in performance when more than one feature is used. Compared to the
result for H (Figure 14b), C had four more useful features (Figure 14a):
SR_hydrophobic, Structural_Alphabet, SR_sulfur, and Scorecons. This might indicate
different biological characteristics of C zinc-binding sites and H zinc-binding sites.



[Figure 14 bar charts. y axis: MCC; x axis: feature.
(a) Performance of each feature or subset of features, C (MCC axis 0 to 0.7). Bars: SR_hydrophobic, Structural_Alphabet, SR_sulfur, Structural features, Scorecons, PSSM, Sequence features, PWM, All Features.
(b) Performance of each feature or subset of features, H (MCC axis 0 to 0.6). Bars: PWM, Sequence features, PSSM, All Features.
(c) Performance of each feature or subset of features, CH (MCC axis 0 to 0.7). Bars: SR_sulfur, Structural features, Scorecons, PSSM, PWM, Sequence features, All Features.]

Figure 14. Performance of each feature or subset of features using (a) the C dataset,
(b) the H dataset, and (c) the CH dataset. Brown indicates structural features and blue
indicates sequence features. The performance using all features is shown in red.

3.4 Comparison with previous predictors
Prediction from sequences
The non-redundant dataset provided by Passerini et al. (2007) was used for
comparison with the zinc-binding sites predictor developed by Shu et al. (2008). The
dataset in Passerini et al. (2007) contained 2674 PDB chains: 305 zinc-binding
chains and 2369 non-metalloproteins. In Shu et al. (2008), the same dataset was reduced
to 2428 chains. In this research, it was reduced to 2375 chains because obsolete PDB
entries were eliminated and chains whose feature generation failed were deleted.
In previous studies, the performance of the predictors was measured by AURPC on
five-fold cross validation. In Shu et al. (2008), the single residue predictor
achieved an AURPC of 0.684 for C and 0.287 for H, and the gated predictor achieved
0.767 for C and 0.342 for H. (These values came from Shu's reply to a request for the
detailed performance on C and H separately.) In this research, cross validation with
the same five-fold separation was tested. The single residue predictor achieved an
AURPC of 0.630 for C and 0.213 for H, and the gated predictor achieved 0.659 for C and
0.251 for H. On average over C and H, the performance in this research was 6.4% lower
for the single residue predictor and 10% lower for the gated predictor. The reasons
might be as follows. (1) The database used for PSI-BLAST was built from PIRSF rather
than from the nr database, which affected the values in the PSSM. (2) The usage of the
features was different. In Shu et al. (2008), some features that were originally
represented by one value each were encoded as several attributes to increase the
number of elements in the SVM input vector. This vector encoding might increase the
ability of the SVM to discriminate between classes. (3) The source of the zinc-binding
probabilities from the single residue predictor and the paired residues predictor was
also different. In Shu et al. (2008), the zinc-binding probability was calculated from
the margin of the SVM (not LIBSVM) output, and two parameters in the calculation
formula were further learned by SVM to get appropriate values. The zinc-binding
probability in this research came directly from LIBSVM. The difference in the
probability calculation might affect the gated score, which combines the zinc-binding
probabilities of the single residue predictor and the paired residues predictor. These
three reasons might explain the worse performance in this research compared with the
previous study.

Prediction from structures
Sodhi et al. (2004) developed a zinc-binding sites predictor for protein structures.
However, the prediction task differs between Sodhi et al. (2004) and this research.
The predictor built by Sodhi et al. (2004) discriminated zinc-binding residues among
all kinds of amino acids, while the predictor built in this research discriminated
zinc-binding CH from all CH in the proteins. Two predictors built on different
concepts cannot be compared directly.


Chapter 4: Conclusion
For the prediction of zinc-binding sites, a benchmark dataset was carefully selected
from a protein annotation database. Different biological features for machine learning
were generated from sequences, structures, and even existing prediction tools. The
information of zinc-binding pairs was added to improve the prediction. Two
predictors of zinc-binding sites were developed, one for protein sequences and one for
protein structures. For sequences, the predictor achieved 74% precision at 71% recall
for C; 66% precision at 53% recall for H; 71% precision at 61% recall for CH. For
structures, the predictor achieved 78% precision at 80% recall for C; 75% precision at
63% recall for H; 78% precision at 69% recall for CH. The predictors are expected to be
a useful method for identifying which amino acids bind zinc in proteins with or without
structure information.















Links:
1. PDBsum: http://www.ebi.ac.uk/pdbsum/

2. PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/

3. WESA: http://pipe.scs.fsu.edu/wesa.html

4. DiANNA: http://clavius.bc.edu/~clotelab/DiANNA/

5. Scorecons: http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/valdar/scorecons_server.pl

6. Protein Block Expert: http://bioinformatics.univ-reunion.fr/PBE/PBT.htm

7. Weka software: http://www.cs.waikato.ac.nz/ml/weka/

8. LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/


Reference:

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and
Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res 25, 3389-3402.
Bartlett, G.J., Porter, C.T., Borkakoti, N., and Thornton, J.M. (2002). Analysis of
catalytic residues in enzyme active sites. J Mol Biol 324, 105-121.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H.,
Shindyalov, I.N., and Bourne, P.E. (2000). The Protein Data Bank. Nucl Acids Res 28,
235-242.
Chang, C.-C., and Lin, C.-J. (2001). LIBSVM: a library for support vector machines.
Chen, H., and Zhou, H.X. (2005). Prediction of solvent accessibility and sites of
deleterious mutations from protein sequence. Nucleic Acids Res 33, 3193-3199.
Coleman, J.E. (1992). Zinc proteins: enzymes, storage proteins, transcription factors,
and replication proteins. Annu Rev Biochem 61, 897-946.
Crooks, G.E., Hon, G., Chandonia, J.M., and Brenner, S.E. (2004). WebLogo: a
sequence logo generator. Genome Res 14, 1188-1190.
EL-Manzalawy, Y., and Honavar, V. (2005). WLSVM:Integrating LibSVM into
Weka Environment.
Ferre, F., and Clote, P. (2005). DiANNA: a web server for disulfide connectivity
prediction. Nucleic Acids Res 33, W230-232.
Iakoucheva, L.M., Radivojac, P., Brown, C.J., O'Connor, T.R., Sikes, J.G., Obradovic,
Z., and Dunker, A.K. (2004). The importance of intrinsic disorder for protein
phosphorylation. Nucleic Acids Res 32, 1037-1049.
Jesse, D., and Mark, G. (2006). The relationship between Precision-Recall and ROC
curves. In Proceedings of the 23rd international conference on Machine learning
(Pittsburgh, Pennsylvania, ACM).
Jones, D.T. (1999). Protein secondary structure prediction based on position-specific
scoring matrices. J Mol Biol 292, 195-202.
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A.,
McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., et al. (2007).
Clustal W and Clustal X version 2.0. Bioinformatics 23, 2947-2948.
Martin, A.C. (2005). Mapping PDB chains to UniProtKB entries. Bioinformatics 21,
4297-4301.
Passerini, A., Andreini, C., Menchetti, S., Rosato, A., and Frasconi, P. (2007).
Predicting zinc binding at the proteome level. BMC Bioinformatics 8, 39.
Petrova, N.V., and Wu, C.H. (2006). Prediction of catalytic residues using Support
Vector Machine with selected protein sequence and structural properties. BMC
Bioinformatics 7, 312.
Shu, N., Zhou, T., and Hovmoller, S. (2008). Prediction of zinc-binding sites in
proteins from sequence. Bioinformatics 24, 775-782.
Sodhi, J.S., Bryson, K., McGuffin, L.J., Ward, J.J., Wernisch, L., and Jones, D.T.
(2004). Predicting metal-binding site residues in low-resolution structural models. J
Mol Biol 342, 307-320.
Tyagi, M., Sharma, P., Swamy, C.S., Cadet, F., Srinivasan, N., de Brevern, A.G., and
Offmann, B. (2006). Protein Block Expert (PBE): a web-based protein structure
analysis server using a structural alphabet. Nucleic Acids Res 34, W119-123.
Valdar, W.S. (2002). Scoring residue conservation. Proteins 48, 227-241.
Witten, I.H., and Frank, E. (2005). Data Mining: Practical machine learning tools and
techniques.
Wu, C.H., Nikolskaya, A., Huang, H., Yeh, L.S., Natale, D.A., Vinayaka, C.R., Hu,
Z.Z., Mazumder, R., Kumar, S., Kourtesis, P., et al. (2004). PIRSF: family
classification system at the Protein Information Resource. Nucleic Acids Res 32,
D112-114.








Appendix. The 127 benchmark proteins
Each protein is listed by UniProt AC, PDB ID, PDB chain, and PIRSF number,
respectively.

O05510 1XC3 A PIRSF000531
O09171 1UMY D PIRSF037505
O15344 2FFW A PIRSF001733
O26147 1EF4 A PIRSF005653
O57963 1IRX A PIRSF006544
O67050 1WWR A PIRSF004767
O70201 1M4M A PIRSF037812
P00328 1EE2 A PIRSF000091
P00733 1LBU A PIRSF001129
P00756 1SGF G PIRSF001135
P03390 1AOL A PIRSF003847
P05020 1J79 A PIRSF001237
P05458 1Q2L A PIRSF001211
P06621 1CG2 B PIRSF037238
P07268 1SRP A PIRSF001205
P07272 1PYI A PIRSF003317
P07505 1SRD A PIRSF000348
P08148 1LML A PIRSF001204
P09954 1PCA B PIRSF001128
P14141 1FLJ A PIRSF001392
P14384 1UWY A PIRSF009270
P16006 1VQ2 A PIRSF006019
P19321 2FPQ A PIRSF001932
P20165 1WNI A PIRSF001206
P21692 1FBL A PIRSF001191
P24289 1AK0 A PIRSF016092
P25491 1NLT A PIRSF002585
P26311 1FNO A PIRSF037215
P28990 1CHC A PIRSF003537
P34948 1PMI A PIRSF001480
P45494 1LFW A PIRSF005531
P45548 1BF6 A PIRSF016839
P46076 1EB6 A PIRSF016065
P50456 1Z6R D PIRSF006263
P08518 1TWF B PIRSF000765
P09237 1MMQ A PIRSF001192
P09598 1AH7 A PIRSF005483
P09960 1HS6 A PIRSF001113
P0A0L2 1SXT B PIRSF001919
P0A6C1 1QTW A PIRSF000989
P0A6F3 1GLC G PIRSF000538
P0A7F3 4AT1 B PIRSF000415
P0A8M3 1EVK A PIRSF001520
P0AB87 1FUA A PIRSF001463
P0ABF6 1CTT A PIRSF006334
P0ACS5 1Q08 A PIRSF004656
P0C0T5 1U10 A PIRSF018455
P10507 1HR6 B PIRSF001210
P13340 1EN7 A PIRSF004259
P13689 1LR5 A PIRSF002666
P15303 1M2O A PIRSF003206
P16083 1QR2 A PIRSF000221
P16370 1TWF C PIRSF000749
P16455 1EH6 A PIRSF000407
P21888 1LI5 A PIRSF001536
P22966 1UZE A PIRSF001126
P24931 1JYB A PIRSF006449
P28720 1ENU A PIRSF005973
P29473 1D0C A PIRSF000333
P29736 1E4M M PIRSF001079
P31151 3PSR A PIRSF002353
P31658 1ONS A PIRSF037798
P32169 1GT7 A PIRSF037800
P32170 1D8W A PIRSF037819
P37617 1MWZ A PIRSF001307
P40422 1TWF L PIRSF006829
P40482 1PCX A PIRSF017225
P41972 1QU2 A PIRSF001524


P55907 1XER A PIRSF000068
P61092 1K2F A PIRSF028730
P61998 1PFT A PIRSF003050
P78563 1ZY7 A PIRSF001248
P81177 1BQB A PIRSF001201
P84139 1S3G A PIRSF000734
Q29451 1O7D A PIRSF015759
Q46856 1OJ7 B PIRSF000113
Q8BW10 2CON A PIRSF037125
Q8GY31 1T3K A PIRSF016971
Q8INK6 1OHT A PIRSF037945
Q8IXJ6 1J8F A PIRSF037938
Q8ZPZ9 2AP1 A PIRSF006140
Q9LJL3 2FGE A PIRSF016207
Q9S7A9 1UL4 A PIRSF037575
Q9XBQ8 2A5H B PIRSF004911
P0AB74 1GVF A PIRSF001359
P32320 1MQ0 A PIRSF001250
P26902 1HI9 A PIRSF015853
O26708 1NJ2 A PIRSF006101
O28597 1ICI A PIRSF006013
O76074 1UDT A PIRSF000967
P00428 1OCO F PIRSF000276
P00634 1B8J A PIRSF000891
P00727 1LAM A PIRSF001116
P00806 1LBA A PIRSF001230
P02340 1HU8 A PIRSF002089
P03695 1GPC A PIRSF004310
P03958 1A4M A PIRSF001249
P04050 1TWF A PIRSF000756
P04190 1BC2 A PIRSF001241
P06134 1U8B A PIRSF000409
P07071 1HK8 A PIRSF000358
P07584 1AST A PIRSF001197
P08473 1DMT A PIRSF001194

P45148 2A8C A PIRSF001393
P47224 2FU5 B PIRSF018365
P48452 1TCO A PIRSF000911
P49356 1TN6 B PIRSF003209
P50591 1DG6 A PIRSF038013
P52700 1SML A PIRSF005457
P56406 1C7K A PIRSF016573
P56965 2CI7 A PIRSF000417
P59641 2FYG A PIRSF000833
P62877 1LDJ B PIRSF016237
P69783 1F3Z A PIRSF000697
P80256 1DFE A PIRSF002236
P80366 4KBP A PIRSF000899
P96116 1TOA A PIRSF029508
P96142 1GAX A PIRSF001527
Q02890 1X3Z A PIRSF007836
Q04609 2C6C A PIRSF002548
Q06187 1B55 B PIRSF000602
Q08602 1DCE A PIRSF037441
Q46AN5 2CJA A PIRSF019376
Q47112 7CEI B PIRSF001006
Q51669 1X6M A PIRSF033318
Q5SHQ1 1N32 N PIRSF002137
Q9SE42 1H1Z A PIRSF001461