
Protein Fold Recognition

A Coursework Project on Data Mining

Badri Adhikari
Department of Computer Science
University of Missouri-Columbia
5/10/2012


Abstract

Protein fold recognition is a well-known problem in the field of bioinformatics. There are many different approaches and algorithms for predicting whether a given protein sequence belongs to a protein family, superfamily, or, more specifically, to a fold. Here we discuss a classification approach using support vector machines for predicting whether two proteins (a pair) belong to the same fold/superfamily/family (any), using an existing dataset. The radial basis function was chosen as the kernel. Through a 10-fold cross-validation approach of training and testing, the best values of the gamma and c parameters were found to be 0.0175 and 0.5, respectively. Results show that although the accuracy and sensitivity are above 80%, the precision is extremely low, below 5%. The area under the curve for the final model was found to be 0.874. The project is primarily based on previously published work (Cheng & Baldi, 2006) and is being continued as a research project.

Supplementary Information: Supplementary data, this report, and the source codes are available at http://web.missouri.edu/~bap54/protein_fold_recognition_2012/

1 Introduction

Protein fold recognition methods are methods developed to determine the structural relationship between proteins. These methods are usually prediction-based methods, structural methods, sequence-based methods, or a combination of these (Lindahl & Elofsson, 2000). Instead of developing one specialized alignment method for fold recognition, one could also use information retrieval methods that leverage features extracted using existing alignment tools or structure prediction tools (Cheng & Baldi, 2006).

This project is not about extracting feature values from the protein pairs. Instead, it is about using the already extracted features (1) to derive a model that will represent the examples, and (2) to evaluate the derived model.

2 Methods

2.1 Data Description

The input data used for the mining process was obtained from previously published research (Cheng & Baldi, 2006) and is available at http://casp.rnet.missouri.edu/download/linda_lob.bin2. The feature values in the data were computed using different methods and tools. The sequences and models (Lindahl & Elofsson, 2000) used to generate these features are derived from the SCOP database.

The dataset has 951,600 examples, of which 7,438 are +1-labeled examples and the remaining 944,162 are -1-labeled examples. There are 84 features, which have different information gain values. Each example has two rows: a title row, which begins with a hash and contains the protein pair name along with the structure classification, and a feature row, which holds the feature values used to describe the pair of proteins named in the title row.

Figure 1: A portion of the input dataset. Title rows begin with a hash; the others are feature rows.
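The two-row layout described above can be read with a short parser. The following is a minimal sketch, assuming the title row's first token after the hash is the pair name and the remainder is the classification annotation; the actual layout of the title row may differ.

```python
# Minimal sketch of a reader for the two-row example format: a '#' title row
# followed by a whitespace-separated feature row. The title-row layout
# (pair name, then annotation) is an assumption for illustration.
def read_examples(lines):
    """Yield (pair_name, annotation, features) triples."""
    title = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.lstrip("#").split()
            title = (parts[0], " ".join(parts[1:]))
        elif title is not None:
            features = [float(v) for v in line.split()]
            yield title[0], title[1], features
            title = None

sample = [
    "# d1abc_-d2xyz_ same-fold",
    "0.12 3.4 0.0 1.0",
    "# d1abc_-d3pqr_ different",
    "0.50 1.1 2.2 0.3",
]
examples = list(read_examples(sample))
```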

2.2 Data Preprocessing

The examples were evenly split into 10 subsets for 10-fold cross-validation. However, this was not done randomly. At first, examples having the same query protein were grouped together. This resulted in 976 clusters, because there are 976 unique query proteins in the available data. Ten subsets of data were then created for 10-fold cross-validation using these 976 clusters of examples, without breaking the clusters. This was done so that the training dataset does not contain any of the query proteins that will be used in testing.
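The cluster-preserving split can be sketched as follows. Examples sharing a query protein are grouped, and whole clusters are dealt into the 10 folds so that no query protein spans the train/test boundary; the (query_protein, example_id) tuples here are hypothetical placeholders for real examples.

```python
from collections import defaultdict

# Sketch of the cluster-preserving 10-fold split: group examples by query
# protein, then deal whole clusters into folds so no query protein appears
# in more than one fold. Example tuples are hypothetical.
def split_by_query(examples, n_folds=10):
    clusters = defaultdict(list)
    for query, example_id in examples:
        clusters[query].append((query, example_id))
    folds = [[] for _ in range(n_folds)]
    # Deal whole clusters round-robin so fold sizes stay roughly even.
    for i, query in enumerate(sorted(clusters)):
        folds[i % n_folds].extend(clusters[query])
    return folds

examples = [("q%d" % (i % 25), "ex%d" % i) for i in range(100)]
folds = split_by_query(examples)
```

Holding out one fold then guarantees its query proteins never occur in the nine training folds.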

To deal with the class imbalance in the examples, a test was conducted to see if balancing helps improve the performance. Two random training and testing sets were prepared. Using the previously known values of gamma, learning and classification were performed. The areas under the curve were computed for the two models and were found to be 0.82 and 0.87, as shown in the ROC curves below.

Keeping the two test sets as they were, the training sets were balanced by filtering away most of the negative examples so that there were equal numbers of negative and positive examples. Again, using the same value of gamma, learning and classification were performed with exactly the same test sets. The areas under the curve were found to be 0.94 and 0.93. It is observed that balancing the examples actually improves the performance of the training and classification results.
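The balancing step amounts to randomly undersampling the majority class. A minimal sketch, with made-up (label, id) examples and +1/-1 labels as in the dataset:

```python
import random

# Sketch of the balancing step: randomly filter away negative examples until
# the negatives match the positives in number. The (label, id) examples
# below are made up for illustration.
def balance(examples, seed=0):
    positives = [e for e in examples if e[0] == +1]
    negatives = [e for e in examples if e[0] == -1]
    rng = random.Random(seed)
    kept = rng.sample(negatives, len(positives))  # undersample majority class
    mixed = positives + kept
    rng.shuffle(mixed)
    return mixed

train = [(+1, i) for i in range(50)] + [(-1, i) for i in range(500)]
balanced = balance(train)
```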



Figure 2: Comparison of four models. "Filtered 01" and "filtered 02" are models generated from balanced training data; "unfiltered 01" and "unfiltered 02" are models generated from the training data as is. The same test data were used for all models.

2.3 Training and Classification

While using SVMlight, instead of randomly picking kernels and trying different parameters, a support vector classification guide (Hsu, Chang, & Lin, 2010) was followed. The RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) was considered for generating the model. Then, 10-fold cross-validation was performed to determine the values of gamma and c. In each iteration, 9 subsets were used for training and the remaining subset was used as the test set.
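Written out directly, the kernel is a single expression in which gamma controls how quickly similarity decays with distance; the default here uses the value selected later in this section.

```python
import math

# The RBF kernel K(x, y) = exp(-gamma * ||x - y||^2); gamma defaults to the
# value selected by the cross-validation described in the text (0.0175).
def rbf_kernel(x, y, gamma=0.0175):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical vectors give 1.0
k_far = rbf_kernel([0.0, 0.0], [10.0, 10.0])  # distant vectors decay toward 0
```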

To find the value of gamma, gamma values between 0 and 1 were first considered with a step size of 0.1. Better performance was observed at values close to 0.1, so more precise values of gamma were used to repeat the process. Figure 3 shows the performance of the classifier for values of gamma between 0 and 0.19 with a step size of 0.01.


Figure 3: Average values of sensitivity, accuracy, and precision for different values of gamma.


Figure 4: More precise values of gamma against average sensitivity and specificity.

Upon finding that the classifier performs best near gamma values of 0.01, another round of training and testing was conducted with gamma ranging from 0.0025 to 0.0475, to find a more precise value of gamma. Figure 4 shows the performance. It is evident that there is no single distinguishably best value of gamma. However, we can observe that the sensitivity and specificity are highest at gamma equal to 0.0175.

To determine the best value of c, training and classification were performed with a range of values of c from 0 to 1, as shown in Figure 5, keeping gamma constant at 0.0175. We observe that, at c values higher than 0.05, the value has little impact on any of the measurements. Any value of c greater than 0.1 and less than 1 seems appropriate.


Figure 5: Average values of sensitivity, accuracy, and precision against a range of values of the c parameter of the SVM.
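The coarse-then-fine parameter scan used above generalizes to a small routine: scan a coarse grid, then rescan a finer grid around the best coarse value. In this sketch, `score` is a hypothetical stand-in for one full round of 10-fold cross-validation at a given parameter value.

```python
# Sketch of the coarse-to-fine one-dimensional parameter search described in
# the text. `score` stands in for a cross-validation run at one value.
def coarse_to_fine(score, lo=0.0, hi=1.0, coarse_step=0.1, fine_step=0.01):
    coarse = [round(lo + i * coarse_step, 10)
              for i in range(int(round((hi - lo) / coarse_step)) + 1)]
    best = max(coarse, key=score)
    # Rescan a finer grid one coarse step to either side of the best value.
    fine_lo = max(lo, best - coarse_step)
    fine_hi = min(hi, best + coarse_step)
    fine = [round(fine_lo + i * fine_step, 10)
            for i in range(int(round((fine_hi - fine_lo) / fine_step)) + 1)]
    return max(fine, key=score)

# A toy score peaked near 0.0175 stands in for cross-validated performance.
best_gamma = coarse_to_fine(lambda g: -abs(g - 0.0175))
```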



2.4 Tools Used

The support vector machine implementation used was SVMlight. The tool is basically two programs: svm_learn for training and svm_classify for classification. Using the tool is quite straightforward:

$ svm_learn example1/train.dat example1/model
$ svm_classify example1/test.dat example1/model example1/prediction

Perl was used as the scripting language for all data transformations, pre-processing, and calculations. R was used to plot the ROC curves and to calculate the areas under the curves. For the other graphs, MS Excel was used.
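As an illustration of the area-under-the-curve number that R produced: AUC equals the probability that a randomly chosen positive example receives a higher decision value than a randomly chosen negative one. The labels and scores below are made-up stand-ins for SVM decision values.

```python
# Rank-based AUC: the fraction of positive-negative pairs in which the
# positive example outranks the negative one (ties count half).
def auc(labels, scores):
    pos = [s for label, s in zip(labels, scores) if label == +1]
    neg = [s for label, s in zip(labels, scores) if label == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [+1, +1, -1, -1, -1]
scores = [0.9, 0.4, 0.6, 0.2, 0.1]
model_auc = auc(labels, scores)  # 5 of the 6 pos-neg pairs ranked correctly
```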

3 Results

The final model was built using gamma equal to 0.0175 and a c value equal to 0.5. Figure 6 shows that the area under the curve for the model is 0.874.
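The combination of high accuracy and very low precision reported in the abstract is a direct consequence of the class imbalance: with roughly 127 negatives per positive, even a small false-positive rate swamps the true positives. The counts below are hypothetical, chosen only to mirror the dataset's 7,438-to-944,162 class ratio.

```python
# Confusion-matrix arithmetic showing how accuracy stays high while
# precision collapses under heavy class imbalance. Counts are hypothetical.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return accuracy, precision, sensitivity

tp, fn = 6322, 1116        # suppose ~85% of the 7,438 positives are found
fp = 141624                # and ~15% of the 944,162 negatives are flagged
tn = 944162 - fp
acc, prec, sens = metrics(tp, fp, tn, fn)
```

Under these assumed counts the accuracy and sensitivity sit near 85%, while the precision falls below 5%, the same pattern the study observed.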



Figure
6

ROC curve for the final model with area under
curve equal to 0.874 shown with different cutoff values
represented by different colors.

4 Future Work

The following future work is planned:

1. Perform more precise balancing of the data by selectively removing examples instead of randomly filtering them out. Instead of applying filtering to the whole training set, it should be applied to individual clusters of examples. This could make the examples more discriminative.

2. Perform a grid-search approach to find the best values of gamma and c for the RBF kernel.

3. Perform classification at more specific levels: the family level, superfamily level, and fold level.

4. Apply neural network algorithms for classification.

5. Apply the random forest algorithm for classification.

6. Use different feature selection methods to improve accuracy.

7. Generate new features for each pair of proteins to improve prediction accuracy.


References

Cheng, J., & Baldi, P. (2006). A machine learning information retrieval approach to protein fold recognition. Bioinformatics.

Hsu, C.-W., Chang, C.-C., & Lin, C.-J. (2010). A Practical Guide to Support Vector Classification. National Taiwan University.

Lindahl, E., & Elofsson, A. (2000). Identification of Related Proteins on Family, Superfamily and Fold Level. JMB.