PROTEIN PROFILING AND CLASSIFICATION : TECHNICAL COMPARISON



*V. Krishnan, **C.R. Balamurugan, ***S. Purushothaman

*vkrisvasan@yahoo.com,**crbalan@lycos.com, ***kspurumani@caramail.com

PG Students, Dept. of Computer Science and Engineering

College of Engineering Guindy, Anna University, Chennai-25


ABSTRACT:
Bioinformatics has to deal with exponentially growing sets of highly interrelated and rapidly evolving types of data that are heavily used by humans and computers. A bioinformatics language should therefore offer power and scalability at run time and should be based on a flexible and expressive model. The relatedness of proteins is of great importance to modern biologists, as proteins with similar structure often have similar function. The prediction of the secondary, tertiary, and quaternary structures of proteins remains a daunting problem which has been attacked by numerous methods. One method is to classify the protein into a family based on sequence, shape, or other features and assume that it has a similar function to the other members of that family. The general problem of machine learning is to search a usually very large space of potential hypotheses to determine the one that best fits the data and any prior knowledge. The data may be labelled or unlabelled. If labels are given, the problem is one of supervised learning, in that the true answer is known for a given set of data. If labels are not given, the problem is one of unsupervised learning, and the aim is to characterize the structure of the data, e.g. by identifying groups of examples that are collectively similar to each other and distinct from the rest of the data. There are many computational tools to achieve this classification; among them are the supervised learning methods of K-nearest neighbors, neural networks, decision trees, and support vector machines. We compare the performance of these four algorithms in the task of classifying protein domain profiles into the appropriate family. Our preliminary results indicate that the empirically optimal versions of each algorithm return similar results.


Keywords: SCOP, XLMINER, PFAM, PRODOM, InterPro, Prosite, PRINTS, AI technique



1. INTRODUCTION


Bioinformatics is the art and science of electronically representing and integrating biomedical information in a way that makes it accessible and usable across the various fields of biological research. Worldwide research in molecular biology is producing large amounts of data that stand in need of computerized analysis. The data include not only sequence data of DNA and of proteins but also the three-dimensional structures of some proteins, simultaneous levels of activity of large numbers of genes, and data on the binding affinities of many antibodies to many substances. Protein structure, and hence function, is of extreme scientific, medical, and economic importance. With a few exceptions, proteins are the physical and chemical machines at the heart of every major biological phenomenon. However, their complex three-dimensional structures, and the prediction of these structures, remain a challenging problem. Artificial intelligence techniques, especially supervised learning ones, are well suited for this problem, as long-range dependencies and extremely subtle nuances are involved in a protein forming its three-dimensional structure. Some AI techniques, such as support vector machines and neural networks, may be able to discern these patterns.


This project aims to analyze the performance of four different supervised learning algorithms. The algorithms were trained and tested using a set of protein sequences which were profiled using their amino acid composition. The protein sequences were obtained from the SCOP database. Three of the four algorithms (decision trees, neural networks, and k-nearest neighbors) were implemented using the XLMINER software package, while the SVMTorch package was used for the fourth algorithm, support vector machines. Evaluation was done by cross validation and by computing a ROC curve for the predicted classes given by the classifiers.


2. METHODS



The problem is to test how four different machine learning techniques classify protein domains based on a profile. The profile is a vector of attributes, where the attributes may be, but are certainly not limited to, amino acid composition, mass, volume, charge, hydrophobicity, etc. Cross validation of the learning was implemented in a 5-fold fashion, whereby 80% of the data was used for training and 20% for testing, five separate times on different partitions. A brief discussion of the four algorithms follows.
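The 5-fold partitioning described above can be sketched in a few lines of Python. This is an illustrative fragment only (the experiments themselves used XLMINER's built-in partitioning, not this code):

```python
import random

def five_fold_splits(n_items, seed=0):
    """Yield (train, test) index lists for 5-fold cross validation:
    each fold holds out a distinct 20% of the data for testing and
    trains on the remaining 80%, as described above."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    for fold in range(5):
        test = idx[fold::5]                      # every 5th shuffled index
        held = set(test)
        train = [i for i in idx if i not in held]
        yield train, test
```

Over the five folds, every item appears in exactly one test partition, so each classification is evaluated on data the learner never saw.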


2.1. Decision Trees


Decision trees are generally preferred over other nonparametric techniques because of the readability of their learned hypotheses and the efficiency of training and evaluation. Altering parameters such as those for pruning and cutoff may help to build smaller, quicker trees that are just as robust as a full decision tree.
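The cutoff idea above can be made concrete with a minimal tree-growing sketch over binary feature vectors. This is a hypothetical illustration using Gini impurity, not the XLMINER algorithm used in the experiments:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def build_tree(X, y, min_node=2):
    """Grow a tree on binary feature vectors. Growth stops early when a
    node is pure or holds fewer than `min_node` examples (the 'cutoff'
    option discussed above); the node then predicts its majority class."""
    if len(set(y)) == 1 or len(y) < min_node:
        return Counter(y).most_common(1)[0][0]
    # pick the feature whose 0/1 split leaves the least total impurity
    def split_cost(f):
        left = [lab for x, lab in zip(X, y) if x[f] == 0]
        right = [lab for x, lab in zip(X, y) if x[f] == 1]
        return len(left) * gini(left) + len(right) * gini(right)
    f = min(range(len(X[0])), key=split_cost)
    left = [(x, lab) for x, lab in zip(X, y) if x[f] == 0]
    right = [(x, lab) for x, lab in zip(X, y) if x[f] == 1]
    if not left or not right:                    # no useful split remains
        return Counter(y).most_common(1)[0][0]
    return (f,
            build_tree([x for x, _ in left], [l for _, l in left], min_node),
            build_tree([x for x, _ in right], [l for _, l in right], min_node))

def predict(tree, x):
    """Walk the tree: internal nodes are (feature, zero-branch, one-branch)."""
    while isinstance(tree, tuple):
        f, zero, one = tree
        tree = one if x[f] else zero
    return tree
```

Raising `min_node` trades a little accuracy for a smaller, quicker tree, which is exactly the pruning-versus-cutoff tradeoff examined in the results.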

2.2. Neural Networks

Inspired by the densely interconnected, parallel structure of the mammalian brain, neural networks are made up of the fundamental perceptron unit, first invented by Rosenblatt. There are a multitude of parameters that may be changed in neural networks. The number of input, hidden, and output nodes is variable. The number of hidden layers or the activation function can be changed. The step value associated with gradient descent can also be customized. There are also different learning algorithms, such as feed-forward, back propagation, and counter propagation. Depending on the nature of the data set, suitable parameters may be selected to obtain good performance.

2.3. K-Nearest Neighbors

A very intuitive method, k-nearest neighbors simply memorizes the training set; when given a test data point, it calculates the distance from the test data point to every point in the training set in the n-dimensional space. It then averages (sometimes with a weighted average) the classifications of the closest k neighbor(s). There are different ways to classify when there is more than one neighbor.
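The memorize-and-vote procedure above fits in a few lines. This is an illustrative sketch using Euclidean distance and an unweighted majority vote (the XLMINER implementation may differ in its tie-breaking and weighting):

```python
from collections import Counter
import math

def knn_classify(train, query, k=1):
    """k-nearest-neighbors sketch: rank every memorized training point
    by Euclidean distance to the query, then take a majority vote among
    the labels of the k closest points."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

With k=1 the prediction is simply the label of the single closest training point; larger k smooths the decision at the cost of more distance computations.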


2.4. Support Vector Machines (SVM)


Support vector machines (SVMs) were first suggested by Vapnik in the 1960s for classification and have recently become an area of intense research, owing to developments in the techniques and theory coupled with extensions to regression and density estimation. The choice of a kernel function is the most common way to fit an SVM to a given problem, and this remains a hot research topic. SVMs separate the inputs into positive and negative examples by calculating the hypersurface in the space of possible inputs that divides the two regions and also has the largest distance from the hypersurface to the nearest of the positive and negative examples. Intuitively, this makes the classification correct for testing data that is near, but not identical to, the training data.
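The kernel function and the resulting decision rule can be sketched as follows. This is a hypothetical illustration (not SVMTorch code): `gamma` plays the role of the radial-basis parameter, and `support` stands for the coefficients a trained SVM would have produced:

```python
import math

def rbf_kernel(x, z, gamma=0.6):
    """Radial basis function kernel: an implicit similarity measure
    used by the SVM in place of a plain dot product."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decision(support, query, kernel=rbf_kernel):
    """Decision value of a trained SVM: a signed, weighted sum of kernel
    similarities to the support vectors. `support` is a pair of
    ([(alpha_i * y_i, x_i), ...], bias); the sign classifies the query."""
    coeffs, bias = support
    return sum(c * kernel(x, query) for c, x in coeffs) + bias
```

Swapping `rbf_kernel` for a polynomial kernel changes only the similarity measure, which is why kernel choice dominates SVM performance on a given problem.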


3. RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE




An excellent method for evaluating a classifier is to use the ROC curve or score. The ROC curve of a classifier shows its performance as a trade-off between selectivity and sensitivity. Typically it is a plot of false positive rate versus true positive rate. The area under the ROC curve is a convenient way of comparing classifiers. A random classifier has an area of 0.5, while an ideal one has an area of 1.
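Such a plot can be constructed by sweeping the decision threshold over the classifier's scores. The fragment below is an illustrative sketch assuming binary labels and real-valued scores (it is not code from the study):

```python
def roc_points(scores, labels):
    """Trace an ROC curve: sort examples by descending classifier score,
    sweep the threshold past each one, and record the resulting
    (false positive rate, true positive rate) point."""
    pairs = sorted(zip(scores, labels), reverse=True)
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    points = [(0.0, 0.0)]
    for score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points
```

A perfect classifier's curve passes through (0, 1): all positives are ranked above all negatives before any false positive appears.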

3.1. An ROC curve demonstrates several things:

1. It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
2. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
3. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
4. The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test.



The graph above shows three ROC curves representing excellent, good, and worthless tests plotted on the same graph. The accuracy of the test depends on how well the test separates the group being tested into those with and without the disease in question. Accuracy is measured by the area under the ROC curve. An area of 1 represents a perfect test; an area of 0.5 represents a worthless test. A rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system:

0.90-1.00 = excellent (A)
0.80-0.90 = good (B)
0.70-0.80 = fair (C)
0.60-0.70 = poor (D)
0.50-0.60 = fail (F)







ROC curves can also be constructed from clinical prediction rules. The graph above comes from a study of how clinical findings predict strep throat (Wigton RS, Connor JL, Centor RM. Transportability of a decision rule for the diagnosis of streptococcal pharyngitis. Arch Intern Med. 1986;146:81-83.) In that study, the presence of tonsillar exudate, fever, adenopathy, and the absence of cough all predicted strep. The curves were constructed by computing the sensitivity and specificity of increasing numbers of clinical findings (from 0 to 4) in predicting strep. The study compared patients in Virginia and Nebraska and found that the rule performed more accurately in Virginia (area under the curve = 0.78) compared to Nebraska (area under the curve = 0.73). These differences turned out not to be statistically significant, however. At this point, you may be wondering what this area number really means and how it is computed. The area measures discrimination, that is, the ability of the test to correctly classify those with and without the disease. Consider the situation in which patients are already correctly classified into two groups. You randomly pick one from the disease group and one from the no-disease group and do the test on both.


The patient with the more abnormal test result should be the one from the disease group. The area under the curve is the percentage of randomly drawn pairs for which this is true (that is, the test correctly classifies the two patients in the random pair). Computing the area is more difficult to explain. Two methods are commonly used: a non-parametric method based on constructing trapezoids under the curve as an approximation of area, and a parametric method using a maximum likelihood estimator to fit a smooth curve to the data points. Both methods are available as computer programs and give an estimate of area and standard error that can be used to compare different tests, or the same test in different patient populations.
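The non-parametric trapezoid method mentioned above can be sketched as follows (an illustrative fragment, not one of the programs referred to in the text):

```python
def auc_trapezoid(points):
    """Non-parametric area under an ROC curve: sum the trapezoids
    between consecutive (fpr, tpr) points along the curve."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

The diagonal of the ROC space integrates to 0.5 (a worthless test) and the left-then-top border to 1.0 (a perfect test), matching the grading scale given earlier.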

4. DATA COLLECTION AND DATA PREPROCESSING


Domain families and individual domain sequences were obtained from SCOP (http://scop.berkeley.edu/) and the corresponding ASTRAL compendium of the genetic domain sequences found at http://astral.stanford.edu/scopseq-1.61.html. It should be noted that SCOP is a database of protein families in which the families are constructed by human experts. There is some computer assistance in the initial phases of classification, such as that of class, fold, etc., but humans ultimately place each domain into a single family. The sequences obtained were those where the E-value was greater than or equal to 10^-25. This yielded 6024 protein domain sequences. However, many of these sequences belong to domain families that have only a few members. This can severely complicate machine learning, so an arbitrary cutoff value of 24 members was set. This gave a total of 31 families, which constitute the data set for the learning algorithms. Figure 1 lists the protein domain families that constituted the selected 1250 protein domains. (Note: For clarity, some of the family names have been shortened and the fold and superfamily designations have been omitted.)



Each individual domain was then profiled according to amino acid % composition, which yielded a 20-dimensional vector for each domain. Then, using 5-fold cross validation, the supervised learning algorithms were trained on 80% of the data and tested on the remaining 20%. Apart from comparing the performance of the algorithms against each other, each of the algorithms was executed with different sets of conditions/parameters to determine which parameters contributed towards algorithm performance.




SCOP designation | Class                | Family
a.3.1.1          | All alpha            | monodomain cytochrome
a.35.1.5         | All alpha            | Bacterial repressors
a.39.1.5         | All alpha            | Calmodulin-like
a.4.1.1          | All alpha            | Homeodomain
a.43.1.1         | All alpha            | Phage repressors
a.53.1.1         | All alpha            | p53 tetramerization domain
b.1.1.1          | All beta             | V set domains (antibody variable domain-like)
b.1.1.2          | All beta             | C1 set domains (antibody constant domain-like)
b.1.1.4          | All beta             | I set domains
b.1.1.5          | All beta             | E set domains
b.1.2.1          | All beta             | Fibronectin type III
b.34.2.1         | All beta             | SH3-domain
b.34.3.1         | All beta             | Myosin S1 fragment, N-terminal domain
b.71.1.1         | All beta             | alpha-Amylases, C-terminal beta-sheet domain
c.2.1.2          | Alpha and beta (a/b) | Tyrosine-dependent oxidoreductases
c.3.1.5          | Alpha and beta (a/b) | FAD/NAD-linked reductases, N-terminal and central domains
d.9.1.1          | Alpha and beta (a+b) | Interleukin 8-like chemokines
f.2.1.2          | Membrane proteins    | Photosynthetic reaction centre, L-, M- and H-chains
f.2.1.3          | Membrane proteins    | Cytochrome c oxidase-like
f.2.1.8          | Membrane proteins    | Cytochrome bc1 transmembrane subunits
g.1.1.1          | Small proteins       | Insulin-like
g.24.1.1         | Small proteins       | TNF receptor-like
g.3.1.1          | Small proteins       | Hevein-like agglutinin (lectin) domain
g.3.11.1         | Small proteins       | EGF-type module
g.3.7.2          | Small proteins       | Short-chain scorpion toxins
g.37.1.1         | Small proteins       | Classic zinc finger, C2H2

Figure 1. SCOP protein domain families used for profiling and classification.




The following is a brief description of the parameters selected and an analysis of how the algorithms performed:


5. RESULTS


5.1. K-Nearest Neighbors:

The algorithm was run for K values of 1, 5, and 10. Performance was good for both K=1 and K=5, but with a K value of 10 it decreased. High K values increase computation time, but may yield good results if the dataset is complex; hence the nature of the data set should be considered when selecting the K value.



Figure 2. Number of families vs. ROC plot for K-nearest neighbors with K values 1, 5, and 10.






5.2. Decision Trees:

The first set of experiments was to build the complete tree with the given training set and observe its performance; the second and third sets of experiments depict the performance of the algorithm with the pruning and early-cutoff options. The performance was better when the tree was allowed to grow completely; this is due to the fact that more of the information that is learned is encoded in the tree. Early cutoff (i.e., training was stopped when the number of instances reached a minimum number at any node) seemed to be a better option than pruning if time and/or space complexity is an issue. However, in this case it wasn't.




Figure 3. Number of families vs. ROC plot for Decision Trees with standard, pruning, and cutoff options.



5.3. Neural Networks:

Five different sets of experiments were performed with neural networks. The choice of parameters was basically dependent on the nature of our data set. We found that the number of hidden layers and the number of nodes in them had a great effect on the performance of the neural networks. With 25 down to 20 nodes the performance was almost the same, but with 19 nodes the performance was drastically reduced. More than one hidden layer did not improve performance at all and in fact even decreased it. Multiple hidden layers are required when there are higher-order relationships in the dataset. The selection of an activation function is equally important; we got the second-best performance when a symmetric function was selected. One of the important features which drew our attention was perceptron convergence: the performance improved as the learning rule converged to correct weights that produced correct output.






Figure 4. Number of families vs. ROC plot for Neural Networks.


5.4. Support Vector Machines:

Kernel selection is probably one of the most important steps in obtaining good performance. The algorithm was run with three different kernel functions. The SVM with a radial basis kernel with a value of 0.6 was found to give the best performance. A polynomial kernel of degree two also performed well.










Figure 5. Number of families vs. ROC plot for SVM.



5.5. Comparison of Algorithms




Figure 6. Comparison ROC plot of SVM (yellow), NN (pink), KNN (blue), and Decision Trees (red).


6. BIOLOGICAL NOTE




The protein families which gave the worst results were 7, 8, 9, and 10. These were domains in the immunoglobulin superfamily, and immunoglobulins are known for their high sequence variability and hence variable amino acid composition. The immunoglobulin amino acid sequences may at times appear almost random, and this is because they are: combinatorial rearrangement is what enables the immune system to defend against an almost limitless number of chemical structures. Neural networks performed surprisingly well with one hidden layer of 20 nodes and perceptron convergence. Overall, the algorithms performed surprisingly similarly, and this may have been caused by the dataset. More domain families and/or more attributes should give greater robustness to the dataset.



7. CONCLUSIONS & FUTURE WORK



In the k-nearest neighbor technique, high K values increase computation time, but may yield good results if the dataset is complex. K values of 1, 5, and 10 were considered in the ROC plot, of which K=10 showed a decrease in sensitivity since the data set is not complex. So the nature of the data set should be considered while selecting the K value.

In decision trees, the performance was better when the tree was allowed to grow completely; this is due to the fact that more of the information that is learned is encoded in the tree. Early cutoff (i.e., training stopped when the number of instances reached a minimum number at any node) seemed to be a better option than pruning if time and/or space complexity is an issue. In neural networks, the number of hidden layers and the number of nodes in them had a great effect on performance. One of the important features which drew our attention was perceptron convergence: the performance improved as the learning rule converged to correct weights that produced correct output. In SVMs, kernel selection is probably one of the most important steps in obtaining good performance. The algorithm was run with three different kernel functions. The SVM with a radial basis kernel with a value of 0.6 was found to give the best performance. Too few protein domain families were subjected to classification; hence, added features in the attribute vector, such as length, mass, area, volume, charge, secondary structure propensity, number of motifs, hydrophobicity, etc., would provide interesting results. There are many protein family classification databases on the internet; PFAM, PRODOM, InterPro, and Prosite are a few. Other sites, such as PRINTS, attempt not to classify proteins into presumed evolutionary families, but instead try to fingerprint proteins based on common subsequences. MetaFam is a site which uses set and graph theory to group proteins based on their classifications in other databases.

(It should be noted that further analysis done on domains having 10 or more members may yield interesting results.)

References

1. Cathy H. Wu, "Artificial Neural Networks for Molecular Sequence Analysis", 1997.
2. Roman L. Tatusov, Eugene V. Koonin, David J. Lipman, "A Genomic Perspective on Protein Families", 1997.
3. Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Walsh Sugnet, Terrence S. Furey, Manuel Ares, Jr., and David Haussler, "Knowledge-based Analysis of Microarray Gene Expression Data by Using Support Vector Machines", 2000.
4. Li Liao, William Stafford Noble, "Combining Pairwise Sequence Similarity and Support Vector Machines for Remote Protein Homology Detection", Proceedings of the Sixth International Conference on Research in Computational Molecular Biology (RECOMB 2002), April 2002, pp. 225-232.
5. Alexey G. Murzin, Steven E. Brenner, Tim Hubbard, and Cyrus Chothia, "SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures", J. Mol. Biol. (1995) 247, 536-540.
6. Loredana Lo Conte, Steven E. Brenner, Tim J. P. Hubbard, Cyrus Chothia, and Alexey G. Murzin (2002), "SCOP Database in 2002: Refinements Accommodate Structural Genomics", Nucl. Acids Res. 30(1), 264-267.
7. Tom Fawcett, "ROC Graphs: Notes and Practical Considerations for Data Mining Researchers", (2003) HPL-2003-4.