Prediction of protein subcellular locations using fuzzy k-NN method


Vol.20 no.1 2004,pages 21–28
Ying Huang and Yanda Li
State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Institute of Bioinformatics, Tsinghua University, Beijing 100084, People's Republic of China
Received on December 10, 2002; revised on April 23, 2003; accepted on July 14, 2003
Motivation: Protein localization data are a valuable information resource helpful in elucidating protein functions. It is highly desirable to predict a protein's subcellular locations automatically from its sequence.
Results: In this paper, the fuzzy k-nearest neighbors (k-NN) algorithm is introduced to predict proteins' subcellular locations from their dipeptide composition. The prediction is performed with a new data set derived from version 41.0 of the SWISS-PROT databank, and an overall predictive accuracy of about 80% has been achieved in a jackknife test. The result demonstrates the applicability of this relatively simple method and a possible improvement of prediction accuracy for protein subcellular locations. We also applied this method to annotate six entirely sequenced proteomes, namely Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Oryza sativa, Arabidopsis thaliana and a subset of all human proteins.
Availability: Supplementary information and subcellular location annotations for eukaryotes are available at
With the progress in genome sequencing projects, an enormous amount of raw sequence data accumulates in databanks. This raises the challenge of understanding the functions of many genes from large-scale sequencing projects. Protein localization data are a valuable information resource helpful in elucidating protein functions (Chou and Elrod, 1999a,b; Chou, 2000b). Experimental determination of subcellular location is mainly accomplished by three approaches: cell fractionation, electron microscopy and fluorescence microscopy (Murphy et al., 2000). By immunolocalization of epitope-tagged gene products, Kumar et al. (2002) have determined the localization of 2744 yeast proteins. However, it is still time-consuming and costly to acquire this knowledge based solely on experimental measures. It is highly desirable to predict a protein's subcellular locations automatically from its sequence. Since the pioneering efforts of Nakai and Kanehisa (1991, 1992), there have been several attempts at systematically predicting subcellular locations from protein sequences.

To whom correspondence should be addressed.
Most of the existing prediction methods fall into two categories: one is based on prediction of individual sorting signals; the other is based on amino acid composition (Nakai, 2000). Nakai and Kanehisa (1991, 1992) were the first to propose predicting the subcellular location of proteins based on their N-terminal sorting signals. This approach was eventually integrated into the PSORT prediction system (Nakai and Horton, 1999). Von Heijne (1992) and Nielsen et al. (1997, 1999) worked extensively on identifying individual sorting signals using neural networks. They then combined these individual predictions into an integrated system, TargetP (Emanuelsson et al., 2000), for subcellular location prediction. A review of the prediction of protein signal sequences can be found in Chou (2002a). However, in systematic annotation of open reading frames found in a genome, the assignments of 5'-regions are often unreliable. Therefore, prediction based on sorting signals is problematic when signals are missing or only partially included (Reinhardt and Hubbard, 1998).
Prediction based on amino acid composition was suggested by Nakashima and Nishikawa (1994). They proposed an algorithm to discriminate between intracellular and extracellular proteins by amino acid composition. Subsequently, amino acid composition has been used for subcellular location prediction in many ways. Cedano et al. (1997) proposed an algorithm called ProtLock using the Mahalanobis distance (Chou, 1995). Reinhardt and Hubbard (1998) used neural networks. Chou and Elrod (1998, 1999a,b) proposed a covariant discrimination algorithm (Zhou and Assa-Munt, 2001). Zhou and Doctor (2003) also used it for subcellular location prediction of apoptosis proteins. Other methods were based on Markov chain models (Yuan, 1999) and the support vector machine (SVM) (Hua and Sun, 2001).
Predictions based only on amino acid composition may lose some sequence-order information, but incorporating this information may improve prediction performance. Chou (2000a) was the first to propose an augmented covariant discrimination algorithm to incorporate the quasi-sequence-order effect, and a remarkable improvement in prediction quality was achieved. Subsequently, Chou (2001) further introduced a novel concept, the so-called pseudo-amino acid composition, to reflect the protein sequence-order effect in terms of a set of discrete numbers. Recently, Cai et al. (2002) used SVM incorporating the quasi-sequence-order effect. A novel concept, the so-called functional domain composition, was also introduced by Chou and Cai (2002) for the representation of protein sequences.

Bioinformatics 20(1) © Oxford University Press 2004; all rights reserved.
In this paper we introduce the fuzzy k-NN method to predict a protein's subcellular locations based on its dipeptide composition. Dipeptide composition can be considered as another representation of proteins that incorporates neighborhood information. High prediction accuracy has been obtained in a jackknife test. The current approach can not only play an important complementary role to previous powerful methods (Chou, 2000a, 2001; Cai et al., 2002), but also be helpful for this new branch of proteomics (Chou, 2002b). Finally, we applied our method to annotate six entirely sequenced eukaryotic proteomes.
Sequence data
The data were selected from all eukaryotic proteins with annotated subcellular location in SWISS-PROT release 41.0 (Boeckmann et al., 2003). All proteins whose annotation contained ambiguous words (such as 'SIMILARITY'), and proteins with multiple location annotations, were excluded. Transmembrane proteins were also excluded because they can be predicted quite reliably by known methods (Rost et al., 1996; Hirokawa et al., 1998; Lio and Vannucci, 2000; Krogh et al., 2001). The remaining 12 865 proteins compose our raw data set (Data_SWISS). To reduce bias and to investigate the relation between prediction accuracy and sequence identity in the data set, we also established two subsets of Data_SWISS using the database search technique BLAST (Altschul et al., 1990, 1997). They are Data_80 and Data_50, with pairwise sequence identity (the number of identical residues in an alignment of two proteins divided by the alignment length, which can be obtained directly from BLAST) less than 80 and 50%, respectively. The numbers of proteins and their distributions in 11 categories are listed in Table 1. All the SWISS-PROT codes in the three data sets are available at
Here, we concentrate on the homology-reduced Data_80 and present the results derived from Data_50 in the Supplementary material.
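The identity measure defined above (identical residues divided by alignment length) can be sketched in a few lines. The paper does not say how the subsets were selected once the BLAST identities were known, so the greedy filter below is only one plausible procedure and is an assumption, as are the function names `pairwise_identity` and `greedy_filter`.

```python
def pairwise_identity(aligned_a, aligned_b):
    """Identity of two aligned sequences (gaps written as '-'):
    identical residues divided by alignment length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return identical / len(aligned_a)


def greedy_filter(protein_ids, identity, threshold):
    """Keep a protein only if its identity to every protein kept so far
    is below the threshold (assumed selection rule, not from the paper).

    `identity` is any callable returning the pairwise identity of two ids,
    for example a lookup into precomputed BLAST results.
    """
    kept = []
    for pid in protein_ids:
        if all(identity(pid, other) < threshold for other in kept):
            kept.append(pid)
    return kept
```

With `threshold=0.8` or `threshold=0.5` this would yield subsets in the spirit of Data_80 and Data_50; the result of a greedy pass depends on the order in which proteins are visited.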
We also obtained all proteins belonging to six entirely sequenced eukaryotic proteomes (Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Arabidopsis thaliana and Oryza sativa) from the SWISS-PROT + TREMBL databank for entire-proteome annotation.

Table 1. Eukaryotic sequences within each subcellular location group of the data sets used in this study

Cellular location        Data_SWISS   Data_80   Data_50
Cytoplasm                2465         1251      622
Nuclear                  3419         2152      1188
Mitochondria             1106         692       424
Extracellular            4228         2135      915
Golgi apparatus          34           31        26
Chloroplast              1145         645       225
Endoplasmic reticulum    137          82        45
Cytoskeleton             24           10        7
Vacuole                  54           41        29
Peroxisome               122          81        47
Lysosome                 131          83        44
Total proteins           12865        7203      3572

Data_SWISS: number of proteins with known localization found in version 41.0 of SWISS-PROT.
Data_80: derived from Data_SWISS with pairwise sequence identity <80%.
Data_50: derived from Data_SWISS with pairwise sequence identity <50%.
Instead of using amino acid composition, we use a protein's dipeptide composition (van Heel, 1991) to represent protein sequences with a fixed-length feature vector. The dipeptide composition representation can be considered as a kind of n-gram method, which was first proposed by Wu et al. (1992) for sequence encoding. This method extracts and counts the occurrences of n consecutive residues (n-grams) from a sequence string in a sliding-window fashion. The count of all 2-gram patterns is therefore a 400-dimensional vector (20 × 20 residue pairs), which can be used to represent the protein sequence. Dipeptide composition (the 2-gram method) has been used to predict protein family (Wang et al., 1998). Using the dipeptide composition method for sequence coding, we can incorporate some sequence-order information while the dimension of the feature vector is still not very high.
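The sliding-window 2-gram count described above can be illustrated with a minimal sketch. The ordering of the 400 dimensions, the skipping of non-standard residue codes and the optional normalization to frequencies are assumptions made here for illustration; the paper only states that the 2-gram counts form a 400-dimensional vector.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}  # 400 entries


def dipeptide_composition(sequence, normalize=True):
    """Return a 400-dimensional dipeptide (2-gram) vector for a sequence.

    Windows containing a non-standard residue code are skipped
    (an assumption; the paper does not say how they were handled).
    """
    counts = [0.0] * len(DIPEPTIDES)
    seq = sequence.upper()
    for i in range(len(seq) - 1):  # sliding window of width 2
        idx = INDEX.get(seq[i:i + 2])
        if idx is not None:
            counts[idx] += 1.0
    total = sum(counts)
    if normalize and total > 0:
        counts = [c / total for c in counts]  # counts -> frequencies
    return counts
```

Normalizing makes vectors of proteins with different lengths comparable under the Euclidean distance; whether the authors normalized is not stated, so `normalize=False` returns the raw counts.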
The k-nearest neighbors (k-NN) algorithm is a simple non-parametric classification algorithm (Duda et al., 2000). Despite its simplicity, it can give competitive performance compared to many other methods. It is widely used in machine learning and has numerous variations. Given a test sample of unknown label, it finds the k nearest neighbors in the training set and assigns a label to the test sample according to the labels of those neighbors. In biological and medical data classification problems, combining fuzzy set theory with the k-NN algorithm can often improve classification performance (Keller et al., 1985; Bezdek et al., 1993; Leszczynski et al., 1999). Zhang et al. (1995) have also used fuzzy clustering to predict protein structural class. Therefore, we used the fuzzy k-NN algorithm to predict subcellular locations. This method assigns fuzzy memberships of samples to different categories rather than a particular class as in 'k-NN'. Here class memberships are assigned to the test sample according to the following relationship:

$$u_i(x) = \frac{\sum_{j=1}^{k} u_i(x_j)\,\|x - x_j\|^{-2/(m-1)}}{\sum_{j=1}^{k} \|x - x_j\|^{-2/(m-1)}}, \qquad i = 1, \ldots, c$$

where m is a fuzzy strength parameter, which determines how heavily the distance is weighted when calculating each neighbor's contribution to the membership value. The variable k is the number of nearest neighbors, c is the number of classes, u_i(x) is the membership of the test sample x to class i, and ||x − x_j|| is the distance between the test sample x and its j-th nearest training sample x_j. Various distance measures can be used, such as the Euclidean, absolute and Mahalanobis distance measures. In the present study, we used the Euclidean distance measure. u_i(x_j) is the membership value of the j-th neighbor to the i-th class; it can be assigned in several ways. The 'crispest' way is to assign 1 if x_j belongs to the i-th class and 0 otherwise. A more 'fuzzy' alternative is to assign the training samples' memberships based on the k-NN rule. In our analysis, we define the membership in the 'crispest' way. After calculating the memberships for the test sample, it is assigned to the class with the highest membership value.
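The membership rule described above can be sketched as follows, assuming the inverse-distance weighting of Keller et al. (1985) with exponent 2/(m − 1), crisp training memberships and the Euclidean distance. The function names, the rescaling of distances by the smallest neighbor distance (which cancels in the ratio and only guards against floating-point overflow when m is close to 1) and the handling of zero distances are illustrative choices, not details taken from the paper.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def fuzzy_knn_memberships(x, train_x, train_y, k=15, m=1.05):
    """Return {class label: membership of x}, using crisp training labels."""
    neighbors = sorted(
        ((euclidean(x, xj), yj) for xj, yj in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    memberships = {label: 0.0 for label in set(train_y)}
    d_min = neighbors[0][0]
    if d_min == 0.0:
        # Exact matches would get infinite weight: let them decide alone.
        weighted = [(1.0, y) for d, y in neighbors if d == 0.0]
    else:
        exponent = -2.0 / (m - 1.0)
        # Dividing by d_min keeps every weight in (0, 1].
        weighted = [((d / d_min) ** exponent, y) for d, y in neighbors]
    total = sum(w for w, _ in weighted)
    for w, y in weighted:
        memberships[y] += w / total
    return memberships


def fuzzy_knn_predict(x, train_x, train_y, k=15, m=1.05):
    """Assign x to the class with the highest membership value."""
    memberships = fuzzy_knn_memberships(x, train_x, train_y, k, m)
    return max(memberships, key=memberships.get)
```

The defaults k = 15 and m = 1.05 are the values the paper reports using on Data_80; in practice each `x` would be the 400-dimensional dipeptide vector of a protein.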
Measurement accuracy
We use the jackknife test for cross-validation. In comparison with the subsampling test or the independent data set test, the jackknife test is thought to be more rigorous and reliable (Mardia et al., 1979). Chou and Zhang (1995) also provided a comprehensive discussion of this problem. During the jackknife test, each protein is singled out in turn as a test sample, and the remaining proteins are used as the training set to calculate the test sample's memberships and predict its class. The prediction quality was evaluated by the overall prediction accuracy and the prediction accuracy for each location:

$$\text{overall accuracy} = \frac{\sum_{s=1}^{k} p(s)}{N}$$

$$\text{accuracy}(s) = \frac{p(s)}{\text{obs}(s)}$$

where N is the total number of sequences, k is the number of classes, obs(s) is the number of sequences observed in location s and p(s) is the number of correctly predicted sequences in location s.
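The jackknife procedure and the two accuracy measures can be sketched generically. The `classify` callback and the function name `jackknife` are hypothetical; any classifier that takes a test sample plus a training set and returns a label would fit.

```python
from collections import Counter


def jackknife(samples, labels, classify):
    """Leave-one-out test.

    `classify(x, train_x, train_y)` must return a predicted label.
    Returns (overall accuracy, {location: accuracy for that location}).
    """
    observed = Counter(labels)   # obs(s)
    correct = Counter()          # p(s)
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Every sample except the i-th one forms the training set.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(x, train_x, train_y) == y:
            correct[y] += 1
    overall = sum(correct.values()) / len(labels)
    per_location = {s: correct[s] / observed[s] for s in observed}
    return overall, per_location
```

Because each test costs one pass over the remaining N − 1 proteins, the whole test is quadratic in N for a k-NN type classifier, which is still manageable for data sets of a few thousand sequences.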
The other measure of prediction accuracy is Matthews' correlation coefficient (MCC) (Matthews, 1975) between the observed and predicted locations over a data set, as given by:

$$\text{MCC}(s) = \frac{p(s)\,n(s) - u(s)\,o(s)}{\sqrt{(p(s) + u(s))\,(p(s) + o(s))\,(n(s) + u(s))\,(n(s) + o(s))}}$$

Here, p(s) is the number of properly predicted proteins in location s, n(s) is the number of correctly predicted proteins not in location s, u(s) is the number of under-predicted and o(s) is the number of over-predicted sequences.
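As a worked check, the measures above can be computed from a confusion matrix. The sketch below uses the Data_80 confusion matrix of Table 3 (rows are actual locations, columns are predicted locations); the function names are illustrative, and returning 0 when the MCC denominator vanishes is a convention chosen here.

```python
import math

# Table 3 (Data_80). Order: Cytop, Nuc, Mit, Ext, Gol, Chl,
# Endo, Cytos, Vac, Pero, Lyso.
CONFUSION = [
    [878, 136, 51, 146, 1, 29, 2, 0, 0, 5, 3],
    [107, 1762, 44, 195, 2, 40, 1, 1, 0, 0, 0],
    [71, 37, 408, 119, 0, 54, 0, 0, 0, 3, 0],
    [43, 53, 19, 2001, 0, 11, 0, 0, 2, 0, 6],
    [5, 7, 3, 9, 5, 0, 2, 0, 0, 0, 0],
    [26, 11, 30, 30, 0, 546, 0, 0, 0, 2, 0],
    [19, 6, 2, 5, 0, 2, 47, 0, 0, 0, 1],
    [1, 4, 0, 1, 0, 0, 0, 4, 0, 0, 0],
    [3, 6, 2, 12, 0, 1, 1, 0, 14, 0, 2],
    [11, 4, 6, 8, 0, 6, 0, 0, 0, 46, 0],
    [5, 5, 2, 14, 0, 1, 0, 0, 0, 0, 56],
]


def location_scores(matrix, s):
    """Return (accuracy(s), MCC(s)) for the location with index s."""
    total = sum(sum(row) for row in matrix)
    p = matrix[s][s]                        # properly predicted in s
    u = sum(matrix[s]) - p                  # under-predicted (missed)
    o = sum(row[s] for row in matrix) - p   # over-predicted (false hits)
    n = total - p - u - o                   # correctly predicted not in s
    accuracy = p / (p + u)                  # p(s) / obs(s)
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    mcc = (p * n - u * o) / denom if denom else 0.0
    return accuracy, mcc


def overall_accuracy(matrix):
    correct = sum(matrix[s][s] for s in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)
```

For cytoplasm (index 0) this gives an accuracy of about 70.2% and an MCC of about 0.67, and the overall accuracy comes to about 80.1%, in agreement with the Data_80 column of Table 2.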
Prediction accuracy of fuzzy k-NN Method
Tests were carried out with various values of the fuzzy strength parameter m and the number of nearest neighbors k. Using leave-one-out cross-validation, we found that the best result was achieved when m = 1.05. We then calculated the overall prediction accuracy with the fuzzy strength parameter fixed at m = 1.05. For Data_80, the dependence of the total prediction accuracy on the number of nearest neighbors, k, is shown in Figure 1. It can be seen that the prediction accuracy does not change significantly when k is greater than or equal to 15 while m = 1.05. Therefore, we selected k = 15 and m = 1.05 for the subsequent analysis. A similar result was obtained on Data_50, which can be found in Supplementary Figure 4.
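As a rough illustration of what m = 1.05 implies, assume the neighbor weights take the inverse-distance form of Keller et al. (1985) with exponent 2/(m − 1). The relative weight w of two neighbors at distances d_1 < d_2 from the test sample is then

$$\frac{w(d_2)}{w(d_1)} = \left(\frac{d_2}{d_1}\right)^{-2/(m-1)} = \left(\frac{d_2}{d_1}\right)^{-40} \qquad (m = 1.05)$$

so a neighbor only 10% farther away (d_2/d_1 = 1.1) carries about 1.1^{-40} ≈ 0.022 of the weight of the closer one. Under this assumption the memberships are dominated by the few closest neighbors, which may help explain why the accuracy changes little once k reaches 15.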
Performance related to thresholds of similarity
Our method relies on sequence information, so predictive accuracy is closely related to pairwise sequence identity in the data set. In order to investigate the influence of pairwise sequence identity on the prediction performance, we applied our method to two data sets with different sequence identity thresholds, Data_80 and Data_50.
The jackknife test results for Data_80 and Data_50 are listed in Table 2. The prediction applied to different data sets resulted in different overall predictive accuracy. For Data_80, our method achieved an overall accuracy of 80.1%. There are 3572 sequences in Data_50, which is about 50% of Data_80. For this data set, the predictive accuracy is 58.1%. A drop in the accuracy for cytoplasmic proteins (from 70.2% to 35.4%) is a main reason for the decrease. Mitochondrial and chloroplast proteins also have low predictive accuracy. However, the predictive accuracy for extracellular proteins only changed from 93.7 to 81.6%, a comparatively modest decrease. The prediction accuracy for nuclear proteins also remains at 71.5% in Data_50. Therefore, the influence of pairwise similarity on predictive accuracy varies between compartments. This may indicate that sequence conservation differs among these groups. This result is worthy of deeper investigation.
Confusion matrix analysis
To evaluate our approach in detail, a confusion matrix was constructed from the results of the jackknife test and is shown in Table 3. (The confusion matrix for Data_50 can be found in Supplementary Table 6.) We can see from Tables 2 and 3 that predictive accuracy varies substantially with subcellular location. Nuclear and extracellular proteins can be inferred more reliably than other classes. On the other hand, performance for cytoplasmic and mitochondrial proteins is not very good. It can be seen from Table 3 that cytoplasmic proteins are often confused with nuclear and extracellular proteins, and proteins from mitochondria are most often assigned incorrectly to the extracellular space. Accuracy for the minor classes that contained too few proteins (Golgi, endoplasmic reticulum, cytoskeleton, vacuole, peroxisome and lysosome) is not very good either. About 30% of vacuole proteins are classified as extracellular; perhaps they are involved in the secretory pathway.

Fig. 1. The dependence of the overall prediction accuracy on the number of nearest neighbors, k, used in the fuzzy k-NN classification (fuzzy strength parameter m = 1.05). These results were obtained on Data_80 using the Euclidean distance measure.

Table 2. The predictive accuracy for subcellular locations of different data sets (corresponding to different thresholds of pairwise sequence identity)

Cellular location        Data_80                Data_50
                         Accuracy (%)   MCC     Accuracy (%)   MCC
Cytoplasm                70.2           0.67    35.4           0.31
Nuclear                  81.9           0.78    71.5           0.58
Mitochondria             59.0           0.62    36.6           0.30
Extracellular            93.7           0.79    81.6           0.54
Golgi apparatus          16.1           0.32    15.4           0.27
Chloroplast              84.7           0.80    32.4           0.36
Endoplasmic reticulum    57.3           0.71    11.1           0.22
Cytoskeleton             40.0           0.57    28.6           0.44
Vacuole                  34.1           0.55    6.9            0.16
Peroxisome               56.8           0.68    14.9           0.27
Lysosome                 67.5           0.74    20.5           0.31
Overall accuracy         80.1           -       58.1           -
Table 3. Confusion matrix for prediction results of Data_80

Actual   Predicted group                                                       SUM
group    Cytop   Nuc    Mit   Ext    Gol   Chl   Endo   Cytos   Vac   Pero   Lyso
Cytop    878     136    51    146    1     29    2      0       0     5      3      1251
Nuc      107     1762   44    195    2     40    1      1       0     0      0      2152
Mit      71      37     408   119    0     54    0      0       0     3      0      692
Ext      43      53     19    2001   0     11    0      0       2     0      6      2135
Gol      5       7      3     9      5     0     2      0       0     0      0      31
Chl      26      11     30    30     0     546   0      0       0     2      0      645
Endo     19      6      2     5      0     2     47     0       0     0      1      82
Cytos    1       4      0     1      0     0     0      4       0     0      0      10
Vac      3       6      2     12     0     1     1      0       14    0      2      41
Pero     11      4      6     8      0     6     0      0       0     46     0      81
Lyso     5       5      2     14     0     1     0      0       0     0      56     83
Sum      1169    2031   567   2540   8     690   53     5       16    56     68     7203

The matrix delineates the distribution of actual compared with predicted class membership. Abbreviations for localizations: Cytop: Cytoplasm; Nuc: Nuclear; Mit: Mitochondria; Ext: Extracellular; Gol: Golgi apparatus; Chl: Chloroplast; Endo: Endoplasmic reticulum; Cytos: Cytoskeleton; Vac: Vacuole; Pero: Peroxisome and Lyso: Lysosome. Correct classifications are on the diagonal.
Reliability index calculation
When a neural network is used for subcellular location prediction, the difference between the highest and the next highest network output scores is used as a reliability index (RI) for a prediction (Reinhardt and Hubbard, 1998; Emanuelsson et al., 2000). As the fuzzy k-NN method assigns class memberships to an input pattern x rather than a particular class, the membership values of an input pattern provide a level of confidence in the resulting classification. We can define an RI in the same way. The assignment of RI is based on the difference (diff) between the highest and the next highest membership value for a prediction. RI is defined as

$$\mathrm{RI} = \begin{cases} \mathrm{INTEGER}(\mathrm{diff} \times 10) + 1, & 0 \le \mathrm{diff} < 0.9 \\ 10, & \mathrm{diff} \ge 0.9 \end{cases}$$

Fig. 2. Average predictive accuracy related to RI. We also give the fractions of sequences with various RI values. For example, about 5% of all sequences have RI = 9, and of these sequences about 80% are correctly classified. The figure is based on Data_80.
The RI assignment gives some information about the certainty of the classification decision. Figure 2 shows the expected prediction accuracy and the fractions of sequences with a given RI value (a similar figure for Data_50 can be found in Figure 5 of the Supplementary material). About 60% of all sequences have an RI of 10, with an expected prediction accuracy >95%. Average prediction accuracy was also calculated for RI above a given threshold, as shown in Figure 3 (a similar figure for Data_50 can be found in Figure 6 of the Supplementary material). For example, about 80% of sequences have RI ≥ 5, and of these sequences about 90% were correctly predicted by the fuzzy k-NN method.
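The reliability index can be sketched as a small function of the membership values. The INTEGER term is read here as truncation of 10 × diff, which is the reading that yields the RI values 1 to 10 shown in Figures 2 and 3; the function name and the treatment of a single-class input are assumptions.

```python
def reliability_index(memberships):
    """RI from a sequence of class membership values of one test sample."""
    ranked = sorted(memberships, reverse=True)
    second = ranked[1] if len(ranked) > 1 else 0.0
    diff = ranked[0] - second  # highest minus next highest membership
    if diff >= 0.9:
        return 10
    return int(diff * 10) + 1  # truncation gives RI in 1..9
```

Because the memberships are floating-point numbers, a difference that falls exactly on a multiple of 0.1 may land in either of the two adjacent RI bins.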
Entire proteome annotation
Using our method and sequences from the SWISS-PROT + TREMBL databank, we obtained subcellular location annotations for six proteomes. Because we excluded membrane proteins in our prediction, we first screened the sequences without annotated subcellular location using TMHMM (Krogh et al., 2001). Sequences with the TMHMM prediction result 'PredHel=0' were considered to be soluble proteins and were predicted with our fuzzy k-NN classifier. Predicted distributions for six major subcellular locations are listed in Table 4, and the annotation for individual proteins can be found at
Because we included chloroplast sequences in prediction, some proteins of YEAST (HOMO, CAEEL and DROME) were also predicted as chloroplast proteins. This minor mistake could be corrected by excluding plant proteins from prediction in further work.
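The soluble-protein filter described above can be sketched as a small parser. It assumes the one-line-per-protein short output of TMHMM, in which each line holds the sequence identifier followed by whitespace-separated key=value fields including PredHel; the function name and the example identifiers in the usage note are hypothetical.

```python
def soluble_ids(tmhmm_short_lines):
    """Yield identifiers of sequences with no predicted helices (PredHel=0)."""
    for line in tmhmm_short_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Collect the key=value fields that follow the identifier.
        attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
        if attrs.get("PredHel") == "0":
            yield fields[0]
```

For example, given the two lines `seqA len=120 ExpAA=0.02 First60=0.01 PredHel=0 Topology=o` and `seqB len=310 ExpAA=155.3 First60=22.1 PredHel=7 Topology=i5-27o`, only `seqA` would be passed on to the classifier.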
This prediction result gives a rough estimate of the protein distribution in the cell. The fractions of cytoplasmic and mitochondrial proteins in the total proteomes do not change significantly across organisms. However, the fractions of nuclear proteins in the YEAST and DROME proteomes are larger than in the other proteomes. Is this phenomenon just prediction bias, or does it reflect some difference in the organization of these proteomes? In a recent study of the YEAST proteome (Kumar et al., 2002), 2452 soluble cytoplasmic proteins were estimated. That result is different from a previous study (Drawid and Gerstein, 2000), and also different from our prediction. The difference may be caused by the use of different training sets and protein features. It also indicates that such genome-wide analysis would be more reliable if different experimental and prediction methods were integrated.
Fig. 3. Average prediction accuracy calculated cumulatively for RI above a given value. For example, about 75% of all sequences have RI ≥ 6, and of these sequences about 92% are correctly predicted. The result is based on Data_80.
Table 4. Distribution of predicted subcellular localization for six proteomes

Organism   Total
ORYSA      9420     1662 (17.6%)   2088 (22.2%)    1044 (11.1%)   513 (5.5%)    3410 (36.2%)    622 (6.6%)
ARATH      36528    8282 (22.7%)   7431 (20.3%)    5122 (14.0%)   2086 (5.7%)   10781 (29.5%)   2499 (6.8%)
YEAST      6905     1489 (21.6%)   1997 (28.9%)    876 (12.7%)    470 (6.8%)    1795 (26.0%)    221 (3.2%)
CAEEL      20887    6572 (31.5%)   4316 (20.7%)    2405 (11.5%)   898 (4.3%)    5833 (27.9%)    744 (3.6%)
DROME      19978    4294 (21.5%)   5914 (29.6%)    2558 (12.8%)   1018 (5.1%)   5380 (26.9%)    587 (2.9%)
HOMO       44402    9146 (20.6%)   10081 (22.7%)   4639 (10.5%)   1561 (3.5%)   17978 (40.5%)   615 (1.4%)

Abbreviations for organisms: ORYSA: Oryza sativa; ARATH: Arabidopsis thaliana; YEAST: Saccharomyces cerevisiae; CAEEL: Caenorhabditis elegans; DROME: Drosophila melanogaster; HOMO: Homo sapiens.
Only annotations for six major subcellular locations are listed.
Total: number of protein sequences in the SWISS-PROT + TREMBL databank.
Percentages: fraction of total proteins.
Comparison with other methods
We also applied our method to the data set used by other groups (Reinhardt and Hubbard, 1998; Yuan, 1999; Hua and Sun, 2001), so that we can make a direct comparison with other methods. There are 2427 eukaryotic proteins in their data set: 684 cytoplasmic, 325 extracellular, 1097 nuclear and 321 mitochondrial proteins. Reinhardt and Hubbard (1998) first used a neural network approach to achieve 66% accuracy on this data set. Yuan (1999) used Markov chain models to achieve 73% accuracy, while Hua and Sun (2001) used an SVM approach to achieve 79.4% accuracy. We achieved 85.2% accuracy in a jackknife test. The details of the comparison can be found in Table 5. The Reinhardt-Hubbard data set is relatively old and includes only four subcellular locations. Nevertheless, the result demonstrates the applicability of this relatively simple method and a possible improvement of prediction accuracy for protein subcellular locations.
Table 5. Performance comparison with other methods by a jackknife test

Location          Markov model           SVM                    Fuzzy k-NN
                  Accuracy (%)   MCC     Accuracy (%)   MCC     Accuracy (%)   MCC
Cytoplasmic       78.1           0.60    76.9           0.64    86.7           0.76
Extracellular     62.2           0.63    80.0           0.78    83.7           0.87
Mitochondrial     69.2           0.53    56.7           0.58    60.4           0.63
Nuclear           74.1           0.68    87.4           0.75    92.0           0.83
Total accuracy    73.0           -       79.4           -       85.2           -

In this test, fuzzy strength parameter m = 1.05, number of nearest neighbors k = 20.

In this paper, a fuzzy k-NN method based on a protein's dipeptide composition was proposed for the prediction of subcellular locations. An advantage of the new method is that it incorporates sequence-order effects into the prediction. The method was applied to a new data set derived from version 41.0 of the SWISS-PROT databank, and high predictive accuracy was achieved in a jackknife test. This indicates that extracting more useful information from the primary sequences can be helpful in subcellular location prediction. The method needs only raw sequence data, so we can apply it to infer the subcellular locations of proteins for which only sequence information is available. As a demonstration, we have used it to annotate six eukaryotic proteomes. Integrated with other powerful algorithms (Chou, 2000a, 2001; Cai et al., 2002), this method is anticipated to contribute to the systematic analysis of large amounts of genome data.
The authors would like to thank Dr A. Reinhardt for providing his data set. Thanks to Jun Cai for helpful discussions and Dr Liang Ji for valuable comments on the manuscript. We also thank the anonymous reviewers for their helpful comments. This work was funded by the National Natural Science Grant in China (Nos 60171038 and 60234020) and the National Basic Research Priorities Program of the Ministry of Science and Technology (No. 2001CCA0). Y.H. also thanks the Tsinghua University Ph.D. Grant for the support.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215,
Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bezdek, J.C., Hall, L.O. and Clarke, L.P. (1993) Review of MR image segmentation techniques using pattern recognition. Med. Phys.,
O'Donovan, C., Phan, I., Pilbout, S. and Schneider, M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
Cedano, J., Aloy, P., Perez-Pons, J.A. and Querol, E. (1997) Relation between amino acid composition and cellular location of proteins.
Cai, Y.D., Liu, X.J., Xu, X.B. and Chou, K.C. (2002) Support vector machines for prediction of protein subcellular location by incorporating quasi-sequence-order effect. J. Cell. Biochem., 84,
Chou, K.C. (1995) A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins
Chou, K.C. (2000a) Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem. Biophys.
Chou, K.C. (2000b) Review: prediction of protein structural classes and subcellular locations. Curr. Protein Peptide Sci., 1, 171–208.
Chou, K.C. (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct. Funct. Genet.,
Chou, K.C. (2002a) Prediction of protein signal sequences. Curr. Protein Peptide Sci., 3, 615–622.
Chou, K.C. (2002b) A new branch of proteomics: prediction of protein cellular attributes. In Weinrer, P.W. and Lu, Q. (eds), Gene Cloning & Expression Technologies, Chapter 4. Eaton Publishing,
Chou, K.C. and Cai, Y.D. (2002) Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem., 277, 45765–45769.
Chou, K.C. and Elrod, D.W. (1998) Using discriminant function for prediction of subcellular location of prokaryotic proteins.
Chou, K.C. and Elrod, D.W. (1999a) Protein subcellular location prediction. Protein Eng., 12, 107–118.
Chou, K.C. and Elrod, D.W. (1999b) Prediction of membrane protein types and subcellular locations. Proteins Struct. Funct. Genet., 34,
Chou, K.C. and Zhang, C.T. (1995) Review: prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol., 30,
Drawid, A. and Gerstein, M. (2000) A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J. Mol. Biol.,
Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification, 2nd edn. Wiley, New York.
Emanuelsson, O., Nielsen, H., Brunak, S. and von Heijne, G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005–1016.
van Heel, M. (1991) A new family of powerful multivariate statistical sequence analysis techniques. J. Mol. Biol., 220,
von Heijne, G. (1992) Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J. Mol. Biol., 225,
Hirokawa, T., Boon-Chieng, S. and Shigeki, M. (1998) SOSUI: classification and secondary structure prediction system for membrane
Hua, S. and Sun, Z. (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17,
Keller, J.M., Gray, M.R. and Givens, J.A. (1985) A fuzzy k-nearest neighbour algorithm. IEEE Trans. Syst. Man Cybern., 15,
Krogh, A., Larsson, B., von Heijne, G. and Sonnhammer, E.L. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol., 305,
Piccirillo, S., Umansky, L., Drawid, A., Jansen, R., Liu, et al. (2002) Subcellular localization of the yeast proteome. Genes Dev.,
Lio, P. and Vannucci, M. (2000) Wavelet change-point prediction of transmembrane proteins. Bioinformatics, 16, 376–382.
Loose, S. and Mvilongo, E. (1999) Application of a fuzzy pattern classifier to decision making in portal verification of radiotherapy.
Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979) Multivariate Analysis. Academic Press, London, pp. 322 and 381.
Matthews, B.W. (1975) Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta,
Murphy, R.F., Boland, M.V. and Velliste, M. (2000) Towards a systematics for protein subcellular location: quantitative description of protein localization patterns and automated analysis of fluorescence microscope images. Proc. Int. Conf. Intell. Syst. Mol. Biol.,
Nakai, K. (2000) Protein sorting signals and prediction of subcellular localization. Adv. Protein Chem., 54, 277–344.
Nakai, K. and Horton, P. (1999) PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem. Sci., 24, 34–36.
Nakai, K. and Kanehisa, M. (1991) Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins Struct.
Nakai, K. and Kanehisa, M. (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14,
Nakashima, H. and Nishikawa, K. (1994) Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol., 238,
Nielsen, H., Engelbrecht, J., Brunak, S. and von Heijne, G. (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng.,
Nielsen, H., Brunak, S. and von Heijne, G. (1999) Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng., 12, 3–9.
Reinhardt, A. and Hubbard, T. (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res.,
Rost, B., Fariselli, P. and Casadio, R. (1996) Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci., 5,
Yuan, Z. (1999) Prediction of protein subcellular locations using Markov chain models. FEBS Lett., 451,
Wang, H.C., Dopazo, J., de la Fraga, L.G., Zhu, Y.P. and Carazo, J.M. (1998) Self-organizing tree-growing network for the classification of protein sequences. Protein Sci., 7, 2613–2622.
Wu, C., Whitson, G., McLarty, J., Ermongkonchai, A. and Chang, T.C. (1992) Protein classification artificial neural system. Protein Sci.,
Zhang, C.T., Chou, K.C. and Maggiora, G.M. (1995) Predicting protein structural classes from amino acid composition: application of fuzzy clustering. Protein Eng., 8, 425–435.
Zhou, G.P. and Assa-Munt, N. (2001) Some insights into protein structural class prediction. Proteins Struct. Funct. Genet., 44,
Zhou, G.P. and Doctor, K. (2003) Subcellular location prediction of apoptosis proteins. Proteins Struct. Funct. Genet., 50,