Protein Fold Recognition Based on Data Mining
Department of Computer Science
University of Missouri
Protein fold recognition is a well-known problem in the field of bioinformatics. There are many different approaches and algorithms for predicting whether a given protein sequence belongs to a protein family, or more specifically to a fold. We discuss a classification approach using support vector machines for predicting whether two proteins belong to the same fold/superfamily/family. The radial basis function was chosen as the kernel. Through a 10-fold cross-validation approach of training and testing, the best values of the gamma and c parameters were found to be 0.0175 and 0.5, respectively. Results show that although the accuracy is above 80%, the precision is extremely low, below 5%. The area under the curve for the final model was found to be 0.874.
The project is primarily based on previously published work and is being continued as a research project. Supplementary data, this report, and the source code are available at
Protein fold recognition methods are developed to determine the structural relationship between proteins. These methods are usually prediction-based methods, structural methods, sequence methods, or a combination of these. Instead of developing one specialized alignment method for fold recognition, one could also use information retrieval methods that leverage features extracted using existing alignment and structure prediction tools (Cheng & Baldi, 2006).
This project is not about extracting feature values from the protein pairs. Instead, it is about using the already extracted features (1) to derive a model that will represent the examples, and (2) to evaluate the resulting model.
The input data used for the mining process was obtained from previously published research. The feature values in the data were computed using different methods and tools. The sequences and models (Lindahl & Elofsson, 2000) used to generate these features are derived from the SCOP database.
Of the examples in the dataset, 7,438 are labeled +1 and the rest are labeled -1. Each example is described by a set of features with different information gain values. Each example has two rows: a title row, which begins with a hash and contains the protein pair name along with the structural classification, and a feature row, which holds the feature values used to describe the pair of proteins named in the title row (see Figure 1 and the reading sketch that follows it).
Figure 1: A portion of the input dataset. Title rows begin with a hash; the other rows are feature rows.
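For illustration, a minimal Perl sketch of reading this two-row format is given below. The file name and the whitespace-separated layout of the feature row are assumptions; only the general format (a hash-prefixed title row followed by a feature row) is taken from the description above.

#!/usr/bin/perl
# Sketch: read examples that consist of a '#' title row followed by a feature row.
# Assumes a whitespace-separated feature row; adjust to the real file layout.
use strict;
use warnings;

my $file = 'fold_pairs.dat';    # hypothetical input file name
open my $fh, '<', $file or die "Cannot open $file: $!";

my @examples;
my $title;
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^\s*$/;          # skip blank lines
    if ($line =~ /^#/) {               # title row: protein pair name + classification
        $title = $line;
    }
    else {                             # feature row belonging to the last title row
        my @features = split /\s+/, $line;
        push @examples, { title => $title, features => \@features };
    }
}
close $fh;

printf "Read %d examples\n", scalar @examples;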
The examples were evenly split into 10 subsets for 10-fold cross-validation. However, this was not done randomly. At first, examples having the same query protein were grouped together. This resulted in 976 clusters, because there are 976 unique query proteins in the available data. Ten subsets of data were then created for cross-validation using these 976 clusters of examples, without breaking the clusters. This was done so that the training dataset does not contain any of the query proteins that will be used in testing, as sketched below.
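The following Perl sketch shows one way such a cluster-preserving split could be done. How the query protein identifier is obtained from the title row is an assumption; only the idea of grouping examples by query protein and distributing whole clusters over ten folds comes from the text.

#!/usr/bin/perl
# Sketch: group examples by query protein, then distribute whole clusters
# over 10 folds so no query protein appears in both training and test data.
use strict;
use warnings;

# Assumption: @examples holds hashes like { title => '# ...', features => [...] },
# as produced by the reader sketched earlier.
our @examples;

my %clusters;
for my $ex (@examples) {
    # Assumption: the query protein identifier is the first token after the hash.
    my ($query) = $ex->{title} =~ /^#\s*(\S+)/;
    push @{ $clusters{$query} }, $ex;
}

# Assign clusters to folds, always filling the currently smallest fold first.
my @folds = map { [] } 1 .. 10;
for my $query (sort { @{ $clusters{$b} } <=> @{ $clusters{$a} } } keys %clusters) {
    my ($smallest) = sort { @$a <=> @$b } @folds;
    push @$smallest, @{ $clusters{$query} };
}

printf "Fold sizes: %s\n", join ', ', map { scalar @$_ } @folds;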
To deal with the problem of biased examples, a test was conducted to see if balancing helps improve the performance. Two random training and testing sets were prepared. Using the previously known value of gamma, learning and classification were performed. The areas under the curve were computed for the two models; one was found to be 0.82, as shown in the ROC curves below.
Keeping the two test sets as they were, the training sets were balanced by filtering away most of the negative examples so that there were equal numbers of negative and positive examples. Again, using the same value of gamma, learning and classification were performed with exactly the same test sets. The areas under the curve were found to be 0.94 and 0.93. Surprisingly, balancing the examples actually improves the performance of the training and classification results. A sketch of this filtering step follows Figure 2.
Figure 2: Comparison of four models. Filtered 01 and 02 are models generated from balanced training data; unfiltered 01 and 02 are models generated from the training data as is. The same test data were used for all four models.
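The balancing step itself amounts to randomly discarding negative examples until they match the number of positive ones. A minimal Perl sketch is shown below; the assumption that each example carries a label field of +1 or -1 is an illustration choice, not a description of the project's actual data structures.

#!/usr/bin/perl
# Sketch: balance a training set by randomly keeping only as many
# negative (-1) examples as there are positive (+1) examples.
use strict;
use warnings;
use List::Util qw(shuffle);

sub balance {
    my (@train) = @_;
    # Assumption: each example hash carries a 'label' field of +1 or -1.
    my @pos = grep { $_->{label} > 0 } @train;
    my @neg = grep { $_->{label} < 0 } @train;

    # Randomly pick as many negatives as there are positives.
    my @kept_neg = (shuffle @neg)[0 .. $#pos];

    return shuffle(@pos, @kept_neg);
}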
Instead of randomly picking kernels and trying different parameters, the recommendations of a practical guide to support vector classification (Hsu, Chang, & Lin, 2010) were followed. The RBF kernel K(x, y) = exp(-γ‖x - y‖²) was considered for generating the model. Then, 10-fold cross-validation was performed to determine the values of gamma and c. In each iteration, nine of the ten subsets were used for training and the remaining subset was used as the test set.
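For reference, the RBF kernel value for two feature vectors can be computed as in the sketch below. This is just the standard formula; it is not part of the project's pipeline, since SVMlight evaluates the kernel internally.

#!/usr/bin/perl
# Sketch: the RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)
# for two numeric feature vectors of equal length.
use strict;
use warnings;

sub rbf_kernel {
    my ($x, $y, $gamma) = @_;
    my $sq_dist = 0;
    for my $i (0 .. $#$x) {
        my $d = $x->[$i] - $y->[$i];
        $sq_dist += $d * $d;
    }
    return exp(-$gamma * $sq_dist);
}

# Example with the gamma value used later in the report (illustrative vectors).
my $k = rbf_kernel([0.2, 0.7, 1.0], [0.1, 0.9, 0.8], 0.0175);
printf "K(x, y) = %.4f\n", $k;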
To find the value of gamma, at first gamma values between 0 and 1 were considered with a step size of 0.1. Better performance was observed at values close to 0.1, so more precise values of gamma were used to repeat the process. Figure 3 shows the performance of the classifier for values of gamma between 0 and 0.19 with a step size of 0.01.
Figure 3: Average values of sensitivity versus specificity for different values of gamma.
Figure 4: More precise values of gamma against average sensitivity and specificity.
Upon finding that the classifier performs best near gamma values of 0.01, another round of training and testing was conducted with gamma ranging from 0.0025 to 0.0475 to find a more precise value of gamma. Figure 4 shows the results. It is evident that there is no distinguishably best value of gamma. However, we can observe that sensitivity and specificity are highest at gamma equal to 0.0175.
To determine the best value of c, training and classification were performed with a range of values of c from 0 to 1, as shown in Figure 5, keeping gamma constant at 0.0175. We observe that, at values of c higher than 0.05, the choice has little impact on any of the measurements. Any value of c greater than 0.1 and less than 1 seems appropriate. A sketch of this parameter search is given after Figure 5.
Figure 5: Average values of sensitivity, accuracy, and precision against a range of values of the c parameter of the RBF kernel.
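The parameter search described above can be automated with a small Perl wrapper around the SVMlight programs described in Section 2.4, as sketched below. The report varied one parameter at a time; a full loop over both is shown here for compactness. File names, the fold layout, and the accuracy-parsing regex are assumptions, while -t 2 (RBF kernel), -g (gamma), and -c are standard svm_learn options.

#!/usr/bin/perl
# Sketch: evaluate gamma and c by running SVMlight on each of the 10 folds
# and averaging the accuracy reported by svm_classify.
use strict;
use warnings;

my @gammas = map { 0.0025 + 0.0025 * $_ } 0 .. 18;   # 0.0025 .. 0.0475
my @cs     = (0.05, 0.1, 0.25, 0.5, 0.75, 1.0);

for my $gamma (@gammas) {
    for my $c (@cs) {
        my $total = 0;
        for my $fold (1 .. 10) {
            # Assumed file naming: train_$fold.dat / test_$fold.dat per fold.
            system("svm_learn -t 2 -g $gamma -c $c train_$fold.dat model_$fold") == 0
                or die "svm_learn failed: $?";
            my $out = `svm_classify test_$fold.dat model_$fold pred_$fold`;
            # svm_classify prints a line like "Accuracy on test set: 81.23% ..."
            my ($acc) = $out =~ /Accuracy on test set:\s*([\d.]+)%/;
            $total += $acc // 0;
        }
        printf "gamma=%.4f c=%.2f mean accuracy=%.2f%%\n", $gamma, $c, $total / 10;
    }
}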
2.4 Tools Used
The support vector machine implementation tool used was SVMlight. The tool basically consists of two programs: svm_learn for training and svm_classify for classification. Using the tool is quite straightforward:
$svm_classify example1/test.dat example1/model example1/prediction
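The corresponding training call, with the parameter values found above, would look like the following; the file paths simply mirror the example above and are not the project's actual paths. The -t 2 option selects the RBF kernel, and -g and -c set gamma and c.

$svm_learn -t 2 -g 0.0175 -c 0.5 example1/train.dat example1/model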
Perl was used as the scripting language for all data manipulation. R was used to plot the ROC curves and to calculate the areas under the curves. For other graphs, MS Excel was used.
The final model was built using gamma equal to 0.0175 and a c value equal to 0.5. Figure 6 shows that the area under the curve for the model is 0.874.
Figure 6: ROC curve for the final model, with area under the curve equal to 0.874, shown with different cutoff values represented by different colors.
Following are the future works planned:
- Perform more precise balancing of the data instead of randomly filtering examples out. Instead of applying filtering on the whole training set, it should be applied to individual clusters of examples. This could make the examples more discriminative.
- Use a grid-search approach to find the best values of gamma and c for the RBF kernel.
- Perform classification at more specific levels: family level, superfamily level, and fold level.
- Apply neural network algorithms for classification.
- Apply the random forest algorithm for classification.
- Apply feature selection methods to the existing feature set.
- Generate new features for each pair of proteins to improve prediction accuracy.
Cheng, J., & Baldi, P. (2006). A machine
information retireval approach to protein
W., Chang, C.
C., & Lin, C.
(2010). A Practical Guide to Support Vector
National Taiwan University
Lindahl, E., & Elofsson, A. (2000). Identification
of Related Proteins on Family, Superfamily and