VICMpred: SVM-based method for prediction of functional proteins of gram-negative bacteria using amino acid patterns and composition.












Sudipto Saha and G.P.S.Raghava*


Institute of Microbial Technology

Sector-39A, Chandigarh, India













*Address for Correspondence


Dr. G. P. S. Raghava, Scientist


Bioinformatics Centre


Institute of Microbial Technology


Sector 39A, Chandigarh, INDIA


Email: raghava@imtech.res.in


Web: http://www.imtech.res.in/raghava/


Phone: +91-172-2690557

Fax: +91-172-2690632












Abstract


In this study, an attempt has been made to predict the major functions of gram-negative bacterial proteins from their amino acid sequences. The dataset used for training and testing consists of 670 non-redundant gram-negative bacterial proteins (255 cellular process, 60 information molecule, 285 metabolism and 70 virulence factor proteins). First, we developed SVM-based methods using amino acid and dipeptide composition, which achieved overall accuracies of 52.39% and 47.01%, respectively. We then introduce a new concept for the classification of proteins based on tetrapeptides, in which we identify the unique tetrapeptides that occur significantly often in a class of proteins. These tetrapeptides were used as input features for predicting the function of a protein and achieved an overall accuracy of 68.66%. We also developed a hybrid method in which tetrapeptide information was combined with amino acid and dipeptide composition, achieving an overall accuracy of 70.75%. Five-fold cross-validation was used to evaluate the performance of the methods. The web server VICMpred has been developed for predicting the function of gram-negative bacterial proteins (http://www.imtech.res.in/raghava/vicmpred/).


Introduction

Although protein sequence databases have grown exponentially over the last decade, the function of only a small fraction of proteins has been experimentally characterized. Because the experimental assessment of the function of every protein of each newly sequenced genome is beyond foreseeable resources, our knowledge of most new proteins will come from predictions. Functional prediction is a major challenge in the field of bioinformatics (1). A number of methods have been developed in the past to predict the function of proteins (2,3,4), but the results obtained by analyzing a significant number of true sequence similarities point to the complexity of function prediction. Most of the methods are indirect, in that they attempt to predict the subcellular localization of proteins rather than their function. The subcellular localization methods are based on the observation that proteins belonging to the same compartment have similar amino acid composition (5,6) and similar functions. In this study, an attempt has been made to develop a direct method for predicting the major functions (virulence factors, information molecule, cellular process and metabolism molecule) of gram-negative bacterial proteins, including virulence factors that cause pathogenicity to the host system. Most of the proteins in an organism are involved in cellular processes, metabolism and information storage; the remainder can be classified as virulence factors, which allow the germs to establish themselves in the host. Virulence factors include adhesins (7), toxins (8) and hemolytic molecules (9). Identification of virulence factors is crucial for drug development. We therefore made an attempt to classify bacterial proteins into four broad functional classes. Three of these broad functional classes were taken from the COGs functional annotation (10): i) cellular process, which includes cell division, cell envelope biogenesis, cell motility and signal transduction molecules; ii) information storage and processing, which includes transcription, translation, and DNA replication and repair molecules; and iii) metabolism, which includes energy production and the transport and metabolism of carbohydrates, amino acids, nucleotides and lipids.


Similarity search tools such as BLAST or FASTA (11) and PSI-BLAST (12) are commonly used for the annotation of genomes. Besides similarity search tools, machine learning tools are also used for the classification of proteins, with amino acid, pseudo-amino acid, dipeptide and property compositions used as features of a protein. The prediction of protein function is much more complex than other classification tasks because sequence similarity is often very poor among proteins having the same function; thus most methods based on similarity search fail to predict the function of a protein (13,14).

In this study we made a systematic attempt to develop a better method for predicting the function of proteins. First, we tried traditional strategies for the classification of proteins: i) similarity search using PSI-BLAST; ii) an SVM-based method using amino acid composition; and iii) an SVM-based method using dipeptide composition, which also considers the local order of amino acids. It was observed that the performance of these traditional approaches was very poor in the functional classification of proteins. In order to improve the performance, we used tetrapeptides as features of a protein, similar to the deterministic patterns of Class A as defined by Brazma et al. (15). The approach relies on identifying short signaling patterns and groups of patterns that are present in higher numbers in each of the four broad functional classes (16). The performance of the method based on tetrapeptides was much higher than that of the traditional methods based on residue composition. The performance was further improved when the new and traditional approaches were combined. In this study we classify gram-negative bacterial proteins obtained from PSORTdb v.20 (http://www.psort.org/dataset) (17), which was used in the development of SubLoc (18). Based on our study, we have made a web server, VICMpred (http://www.imtech.res.in/raghava/vicmpred/), for predicting the function of a protein from its amino acid sequence.


Materials and Methods

Data sets

We obtained 1572 proteins from the work of Hua and Sun (2001). We examined the function of these proteins using SWISS-PROT (19) version 33.0 and kept for further processing the 1048 proteins whose functions were known. We used the PROSET software to create a dataset of non-redundant proteins in which no two proteins have more than 90% sequence identity. The final dataset consists of 670 non-redundant gram-negative bacterial proteins (255 cellular process, 60 information molecule, 285 metabolism and 70 virulence factor proteins).


Evaluation of the predictive performances

The performance of the modules constructed in this study was evaluated using a 5-fold cross-validation technique. In 5-fold cross-validation, the relevant dataset is randomly divided into five sets. Training and testing are carried out five times, each time using one distinct set for testing and the remaining four sets for training. For evaluating the performance of the various modules, accuracy and Matthew's correlation coefficient (MCC) were calculated using the following equations:


\[ \mathrm{Accuracy}(x) = \frac{p(x)}{\mathrm{exp}(x)} \]

\[ \mathrm{MCC}(x) = \frac{p(x)\,n(x) - u(x)\,o(x)}{\sqrt{[p(x)+u(x)]\,[p(x)+o(x)]\,[n(x)+u(x)]\,[n(x)+o(x)]}} \]

where x can be any functional class (cellular, information, metabolism and virulence protein), exp(x) is the number of sequences observed in function x, p(x) is the number of correctly predicted sequences of function x, n(x) is the number of correctly predicted sequences not of function x, u(x) is the number of under-predicted sequences and o(x) is the number of over-predicted sequences.
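The two measures can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: it assumes the usual form of the Matthews correlation coefficient written with the counts p, n, u and o defined above, and the function names are hypothetical.

```python
import math


def accuracy(p, exp):
    """Per-class accuracy: correctly predicted sequences of class x
    divided by the number of sequences observed in class x."""
    return p / exp


def mcc(p, n, u, o):
    """Matthews correlation coefficient for one class.

    p: correctly predicted sequences of the class
    n: correctly predicted sequences not of the class
    u: under-predicted sequences, o: over-predicted sequences
    Returns 0.0 when the denominator vanishes (a common convention,
    assumed here; the paper does not state one).
    """
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    if denom == 0:
        return 0.0
    return (p * n - u * o) / denom
```

A perfect classifier for a class (u = o = 0) gives an MCC of 1, and a classifier that is wrong on every sequence (p = n = 0) gives -1.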

Support vector machine

The SVM was implemented using the freely downloadable software package SVM_light written by Joachims (20). The software enables the user to define a number of parameters as well as to select from a choice of inbuilt kernel functions, including a radial basis function (RBF) and a polynomial kernel. Preliminary tests showed that the RBF kernel gives better results than the other kernels. Therefore, in this work we use the RBF kernel for all the experiments. The prediction of functional class is a multi-class classification problem. We developed a series of binary classifiers to handle it: we constructed N SVMs for N-class classification using the 1-v-r (one-against-rest) strategy. Here, the number of classes was four for the bacterial protein sequences. The i-th SVM was trained with all samples in the i-th class labelled positive and all other samples labelled negative. In this way, four SVMs were constructed for classifying bacterial proteins into the cellular, information, metabolic and virulence functional classes.
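The one-against-rest scheme can be illustrated with a small Python sketch. The label construction follows the description above; the rule for combining the four binary outputs (pick the class whose SVM gives the highest decision value) is an assumption, since the text does not state it, and the function names are hypothetical.

```python
CLASSES = ["cellular", "information", "metabolism", "virulence"]


def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-class training set for one binary SVM:
    +1 for samples of positive_class, -1 for all other samples."""
    return [1 if lab == positive_class else -1 for lab in labels]


def predict_class(scores):
    """Combine the binary SVM outputs for one query protein.

    scores maps each class name to the decision value returned by that
    class's SVM. Assumed rule: the class with the highest value wins.
    """
    return max(scores, key=scores.get)
```

For example, a training set labelled `["cellular", "virulence", "cellular"]` becomes `[1, -1, 1]` for the cellular SVM and `[-1, 1, -1]` for the virulence SVM.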

Protein features


Amino acid composition. Amino acid composition is the fraction of each amino acid in a protein. The fraction of each of the 20 natural amino acids was calculated using the following equation:

\[ \text{Fraction of amino acid } i = \frac{\text{total number of amino acid } i}{\text{total number of amino acids in the protein}} \]

where i can be any amino acid.
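A minimal Python sketch of this feature is given below. It assumes that each fraction is the count of one amino acid divided by the number of standard residues in the sequence, and that non-standard letters (such as X) are ignored; the alphabet order and the function name are illustrative choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acids


def amino_acid_composition(seq):
    """Return a 20-dimensional vector of amino acid fractions.

    Residues outside the 20-letter alphabet are skipped (an assumption).
    An empty or fully non-standard sequence yields a zero vector.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    total = 0
    for residue in seq.upper():
        if residue in counts:
            counts[residue] += 1
            total += 1
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

For the toy sequence `"AACD"` the vector has 0.5 at the position of A and 0.25 at the positions of C and D, and its entries sum to 1.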


Dipeptide composition. Dipeptide composition was used to encapsulate the global information about each protein sequence, which gives a fixed pattern length of 400 (20 × 20). This representation encompasses information about the amino acid composition along with the local order of amino acids. The fraction of each dipeptide was calculated using the following equation:

\[ \text{Fraction of dipep}(i) = \frac{\text{total number of dipep}(i)}{\text{total number of all dipeptides in the protein}} \]

where dipep(i) is one out of 400 dipeptides.
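The dipeptide feature can be sketched in the same way. The sketch below assumes overlapping dipeptides (a sequence of length L contains L - 1 of them) and normalises by the number of dipeptides actually observed in the sequence; both choices, and the function name, are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return a 400-dimensional vector of dipeptide fractions."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        dipep = seq[i:i + 2]
        if dipep in counts:  # skip pairs containing non-standard letters
            counts[dipep] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]
```

With only a few hundred training proteins, many of the 400 entries are zero for most sequences, which is the sparsity issue raised later in the results.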


Ab-initio patterns. We calculated the frequency of all possible tetrapeptides (20 × 20 × 20 × 20 = 160,000) in each class of proteins. We then identified the tetrapeptides that are found at least a threshold number of times in a class of proteins; these are called the significant tetrapeptides for that class. In our case we consider a tetrapeptide significant if it is found at least 6 times in cellular proteins, at least 3 times in information molecules, at least 6 times in metabolic proteins and at least 4 times in virulence proteins. In the next step we compute how many significant tetrapeptides of each class are present in a protein. Thus, a protein is represented by four features, where each feature is the number of significant tetrapeptides of one class of proteins. Finally, we used an SVM to classify proteins based on these four features. In our study, significant tetrapeptides were calculated only from the proteins in the training set, in order to avoid any bias in prediction. An outline of this method is shown in Fig. 1.
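The pattern-based features can be sketched as follows. Several details are assumptions made for illustration: tetrapeptides are taken as overlapping windows, the class frequency counts every occurrence across the training proteins of that class, and each feature counts occurrences (not distinct tetrapeptides) in the query. The function names are hypothetical.

```python
from collections import Counter


def tetrapeptides(seq):
    """All overlapping windows of length four in a sequence."""
    return [seq[i:i + 4] for i in range(len(seq) - 3)]


def significant_tetrapeptides(train_seqs, threshold):
    """Tetrapeptides occurring at least `threshold` times in the
    training sequences of one class."""
    counts = Counter()
    for seq in train_seqs:
        counts.update(tetrapeptides(seq))
    return {tet for tet, c in counts.items() if c >= threshold}


def pattern_features(seq, significant_by_class, class_order):
    """One feature per class: how many tetrapeptides of the query
    belong to that class's significant set."""
    tets = tetrapeptides(seq)
    return [sum(1 for t in tets if t in significant_by_class[c])
            for c in class_order]


# Toy usage with two made-up classes and thresholds.
train = {"a": ["AAAAA", "AAAAC"], "b": ["CCCCC"]}
thresholds = {"a": 3, "b": 2}
significant = {c: significant_tetrapeptides(train[c], thresholds[c])
               for c in train}
features = pattern_features("AAAACCCC", significant, ["a", "b"])
```

In a cross-validation setting the significant sets would be recomputed from the training folds only, as the text requires.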


PSI-BLAST

A PSI-BLAST module was designed in which the query sequences in the test dataset were searched against the proteins in the training dataset using PSI-BLAST. Three iterations of PSI-BLAST were carried out at a cut-off E-value of 0.001. PSI-BLAST was used instead of standard BLAST because PSI-BLAST has the capability to detect remote homologies. The module could predict any of the four functions (cellular, information, metabolic and virulence), depending upon the similarity of the query protein to the proteins in the training dataset.


Prediction results

The performance of all the modules developed in this study is shown in Table 1. The performance of all modules was evaluated through 5-fold cross-validation. The amino acid composition-based module (kernel = RBF, γ = 80, C = 2, j = 4) was able to predict with 52.39% accuracy. In the case of the dipeptide composition-based module, the performance of the RBF kernel (γ = 100, C = 50, j = 1) was about 5 percentage points lower than that of the amino acid composition-based module. The results of the PSI-BLAST module were also evaluated through 5-fold cross-validation. The module predicted cellular, information, metabolism and virulence protein sequences with 23.13%, 8.33%, 28.77% and 25.71% accuracy, respectively. During 5-fold cross-validation, hits were obtained for only 172 of the 670 proteins. Therefore, the performance of this module is poorer than that of the amino acid and dipeptide composition-based SVM modules.
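The overall accuracies are consistent with the per-class accuracies reported in Table 1 if overall accuracy is taken as the total number of correctly predicted proteins divided by the 670 proteins in the dataset. The short Python check below makes that arithmetic explicit; the definition of overall accuracy is an assumption, and the function name is hypothetical.

```python
CLASS_SIZES = {"cellular": 255, "information": 60,
               "metabolism": 285, "virulence": 70}


def overall_accuracy(per_class_acc):
    """Overall accuracy (%) from per-class accuracies (%).

    Each per-class accuracy is converted back to a whole number of
    correctly predicted proteins, and the counts are pooled.
    """
    correct = sum(round(per_class_acc[c] / 100 * n)
                  for c, n in CLASS_SIZES.items())
    return 100 * correct / sum(CLASS_SIZES.values())


# Per-class accuracies of the two composition-based modules (Table 1).
composition = {"cellular": 47.06, "information": 36.67,
               "metabolism": 66.67, "virulence": 27.14}
dipeptide = {"cellular": 45.10, "information": 15.00,
             "metabolism": 60.35, "virulence": 27.14}

acc_composition = round(overall_accuracy(composition), 2)  # 351 of 670
acc_dipeptide = round(overall_accuracy(dipeptide), 2)      # 315 of 670
```

The amino acid composition module corresponds to 120 + 22 + 190 + 19 = 351 correct predictions (52.39%), and the dipeptide module to 115 + 9 + 172 + 19 = 315 (47.01%).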


It is interesting to note that the performance of the dipeptide-based module was lower than that of the simple amino acid composition-based module, even though dipeptide composition provides the composition as well as the local order of amino acids. This is because the total number of dipeptide features, 400 (20 × 20), is too high: with a small number of proteins, many of these features do not occur at all. Thus the SVM is unable to learn properly from so many features. To avoid this problem, we introduce a new concept for prediction in which we consider only the peptides that occur in significant amounts in each class of proteins. Here we used the frequency of the significant tetrapeptides found in a class of proteins. This module, called the ab-initio pattern-based module, was able to predict the function of proteins with an accuracy of 68.66% (kernel = RBF, γ = 0.001, C = 50, j = 5), which is higher than that of the amino acid composition-based and dipeptide-based modules.


To further improve the prediction accuracy, hybrid modules based on various features of proteins were constructed. The first hybrid (hybrid 1) was developed on the basis of pattern information and amino acid composition. The prediction accuracy of the hybrid 1 module was 70.30%, which is better than that of any individual feature-based module. Another module (hybrid 2) was developed on the basis of pattern information and dipeptide composition; its performance was similar to that of the hybrid 1 module. Finally, a hybrid module based on pattern information, amino acid composition and dipeptide composition was developed. This hybrid used an input vector of 424 dimensions, comprising 4 for pattern information, 20 for amino acid composition and 400 for dipeptide composition. As shown in Table 1, the performance of this module is better than that of any individual feature-based module or the other hybrid modules (hybrid 1 and hybrid 2). With the RBF kernel (γ = 0.001, C = 100000, j = 1), this module achieved 70.75% overall accuracy.
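A sketch of how such a 424-dimensional input could be assembled and written in the sparse text format read by SVM_light (`<label> <index>:<value> ...`, with indices starting at 1) is shown below. The concatenation order (pattern, amino acid, dipeptide) follows the order in which the text lists the features but is otherwise an assumption, and the function names are hypothetical.

```python
def hybrid_vector(pattern_feats, aa_comp, dipep_comp):
    """Concatenate 4 pattern features, 20 amino acid fractions and
    400 dipeptide fractions into one 424-dimensional vector."""
    sizes = (len(pattern_feats), len(aa_comp), len(dipep_comp))
    if sizes != (4, 20, 400):
        raise ValueError(f"expected sizes (4, 20, 400), got {sizes}")
    return list(pattern_feats) + list(aa_comp) + list(dipep_comp)


def to_svmlight_line(label, vector):
    """Format one example as an SVM_light input line.

    Zero-valued features are omitted, which the sparse format allows.
    """
    feats = " ".join(f"{i}:{v:.6g}"
                     for i, v in enumerate(vector, start=1) if v != 0)
    return f"{label} {feats}"


# Toy usage: a positive example for one of the four binary SVMs.
vec = hybrid_vector([1, 0, 2, 0], [0.05] * 20, [0.0] * 400)
line = to_svmlight_line(1, vec)
```

Because the pattern features are raw counts while the composition features are fractions, some rescaling before training may be worth considering; the paper does not say whether any was applied.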


VICMpred SERVER

Based on our study, we have developed a web server that allows users to predict the function of a protein (virulence factor, information molecule, cellular process or metabolism molecule) from its amino acid sequence. VICMpred is freely available at http://www.imtech.res.in/raghava/vicmpred/. The common gateway interface (CGI) script for VICMpred is written in PERL version 5.03. The server is installed on a Sun Server (420E) under a UNIX (Solaris 7) environment. Users can enter the primary amino acid sequence for prediction using the file upload or cut-and-paste options.

Discussion

The functional annotation of proteins is one of the major challenges of the post-genomic era. The most widely used methods for predicting the function of a new protein involve sequence alignment, similarity search or profile search, such as FASTA, BLAST and PSI-BLAST (11,12). These methods fail in the absence of significant similarity between the query protein and annotated proteins. One reason for the failure of similarity-based methods is the variation in size among proteins, whether they belong to the same class or to different classes.

The problem with profiles is that they are complicated models with many free parameters. One is faced with a number of difficult problems, such as the best ways to set the position-specific residue scores, to score gaps and insertions, and to combine structural and multiple sequence information.

An alternative way of predicting the function of a protein is to predict its location in the cell, based on the assumption that proteins residing in the same location also have the same functions. Most of these subcellular localization methods are based on the composition (amino acid or dipeptide) of the protein.

In this study, an attempt has been made to develop a direct method for predicting the function of proteins. First we tried traditional approaches that are commonly used in the prediction of subcellular localization. It was observed that the performance of PSI-BLAST was poorer than that of the composition-based methods (Table 1). This demonstrates that similarity search-based methods are not very effective in function prediction. It was also observed that the dipeptide composition-based method performed worse than the amino acid composition-based method. This was unexpected, as dipeptide composition provides more information (composition with local order) than simple amino acid composition. In the past we observed that dipeptide composition performs better than amino acid composition in the subcellular localization of proteins. We examined our data and observed that a number of dipeptides were either rare or completely absent because of the small number of proteins used for classification. This demonstrates that higher-order composition is not successful on a small set of data. To overcome this problem we tried a new approach, in which we used tetrapeptides, which provide more local order than dipeptides and tripeptides. Instead of using the composition of all tetrapeptides, we identified the tetrapeptides found in significant numbers in each class of proteins, and used only these significant tetrapeptides for classification. We calculate the number of significant tetrapeptides of each class present in a query sequence. This information is used to classify the proteins using an SVM. We obtained much higher accuracy using this approach than with the composition-based methods (Table 1). One may compare this approach with pattern-searching approaches (such as PROSITE), in which one needs to detect known patterns in a sequence. Here the patterns are tetrapeptides instead of PROSITE patterns. The number of PROSITE patterns is limited, so a number of proteins do not have any PROSITE pattern. In our case, by contrast, we use all the tetrapeptides found in significant amounts in each class of proteins, so the number of patterns is very high (1248, 381, 1443 and 1168 for cellular, information molecule, metabolic and virulence proteins, respectively). Thus there is a good chance that each query protein will contain a large number of tetrapeptides of each class. Although the specificity of our tetrapeptides is lower than that of PROSITE patterns, their number is 100 times greater.


We also developed hybrid modules, which combine our composition-based modules and the pattern-based approach, in order to improve the performance of the method further. The performance of the hybrid methods was better than that of the individual methods. In summary, we have developed an effective method for predicting the function of bacterial proteins. This method will be very useful in the development of drugs and vaccines, as it allows the prediction of virulence proteins. Although we tried our best to improve the accuracy of prediction, it is still not very high. Another limitation of this method is that it predicts a single function for a protein, whereas in reality a protein may have multiple functions.


Acknowledgments


We acknowledge the financial support from the Council of Scientific and Industrial Research (CSIR) and Department of Biotechnology (DBT), Govt. of India.


References



1. Devos D, Valencia A. Practical limits of function prediction. Proteins. 2000 Oct 1;41(1):98-107.

2. Rost B, Liu J, Nair R, Wrzeszczynski KO, Ofran Y. Automatic prediction of protein function. Cell Mol Life Sci. 2003 Dec;60(12):2637-50. Review.

3. Panchenko AR, Kondrashov F, Bryant S. Prediction of functional sites by analysis of sequence and structure conservation. Protein Sci. 2004 Apr;13(4):884-92. Epub 2004 Mar 09.

4. Cai YD, Doig AJ. Prediction of Saccharomyces cerevisiae protein functional class from functional domain composition. Bioinformatics. 2004 May 22;20(8):1292-300. Epub 2004 Feb 19.

5. Bhasin M, Raghava GP. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W414-9.

6. Garg A, Bhasin M, Raghava GP. SVM-based method for subcellular localization of human proteins using amino acid compositions, their order and similarity search. J Biol Chem. 2005 Jan 12; [Epub ahead of print]

7. Irie Y, Mattoo S, Yuk MH. The Bvg virulence control system regulates biofilm formation in Bordetella bronchiseptica. J Bacteriol. 2004 Sep;186(17):5692-8.

8. Geric B, Rupnik M, Gerding DN, Grabnar M, Johnson S. Distribution of Clostridium difficile variant toxinotypes and strains with binary toxin genes among clinical isolates in an American hospital. J Med Microbiol. 2004 Sep;53(Pt 9):887-94.

9. Ethelberg S, Olsen KE, Scheutz F, Jensen C, Schiellerup P, Enberg J, Petersen AM, Olesen B, Gerner-Smidt P, Molbak K. Virulence factors for hemolytic uremic syndrome, Denmark. Emerg Infect Dis. 2004 May;10(5):842-7.

10. Tatusov RL, Galperin MY, Natale DA, Koonin EV. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000 Jan 1;28(1):33-6.

11. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10.

12. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997 Sep 1;25(17):3389-402.

13. Li L, Shakhnovich EI, Mirny LA. Amino acids determining enzyme-substrate specificity in prokaryotic and eukaryotic protein kinases. Proc Natl Acad Sci U S A. 2003 Apr 15;100(8):4463-8. Epub 2003 Apr 04.

14. Hannenhalli SS, Russell RB. Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol. 2000 Oct 13;303(1):61-76.

15. Brazma A, Jonassen I, Eidhammer I, Gilbert D. Approaches to the automatic discovery of patterns in biosequences. J Comput Biol. 1998 Summer;5(2):279-305.

16. Perez AJ, Rodriguez A, Trelles O, Thode G. A computational strategy for protein function assignment which addresses the multidomain problem. Comp Funct Genomics. 2002;3:423-440.

17. Gardy JL, Spencer C, Wang K, Ester M, Tusnady GE, Simon I, Hua S, deFays K, Lambert C, Nakai K, Brinkman FSL. PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res. 2003;31(13):3613-17.

18. Hua S, Sun Z. Support vector machine approach for protein subcellular localization prediction. Bioinformatics. 2001 Aug;17(8):721-8.

19. Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000 Jan 1;28(1):45-8.

20. Joachims T. Making large-scale SVM learning practical. In: Scholkopf B, Burges C, Smola A (eds), Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA and London; 1999. pp 42-56.












































Legend of Figure

Figure 1. An outline of the ab-initio patterns prediction method.

















































































Fig 1

[Flowchart] Gram-negative bacterial proteins (670) are divided into four classes: virulence factors (70), information molecule (60), cellular process (255) and metabolism molecule (285). Significant tetramers are identified for each class: 1168 (frequency >= 4), 381 (frequency >= 3), 1248 (frequency >= 6) and 1443 (frequency >= 6), respectively. The query sequence is searched for the significant tetramers of each class. The number of tetramers of each class (virulence factors, information class, cellular process, metabolism) in the query protein forms the input for the SVM, which gives the predicted functional class.


Table 1. The performance of various modules, including SVM modules based on various features of the protein sequence and PSI-BLAST.

________________________________________________________________________________________________

Approach               Cellular        Information     Metabolism      Virulence       Overall
                       ACC     MCC     ACC     MCC     ACC     MCC     ACC     MCC     ACC

Composition-based (A)  47.06   0.12    36.67   0.41    66.67   0.31    27.14   0.32    52.39
Dipeptide-based (B)    45.10   0.11    15.00   0.21    60.35   0.23    27.14   0.20    47.01
Pattern-based (C)      70.20   0.46    48.33   0.57    72.98   0.51    62.86   0.61    68.66
PSI-BLAST              23.13   -       8.33    -       28.77   -       25.71   -       -
Hybrid (C+A)           69.41   0.48    50.0    0.59    77.19   0.54    62.86   0.65    70.30
Hybrid (C+B)           69.02   0.54    48.33   0.52    74.04   0.53    58.57   0.54    68.21
Hybrid (C+B+A)         69.80   0.51    53.33   0.58    77.54   0.56    61.43   0.59    70.75

ACC: Accuracy (%); MCC: Matthew's correlation coefficient. A dash marks a value not reported.