bioinformatics original paper

Bioinformatics

Volume 21, Number 1, January 1 2005

EDITORIALS:
Alfonso Valencia and Alex Bateman
INCREASING THE IMPACT OF BIOINFORMATICS
Bioinformatics 2005 21: 1; doi:10.1093/bioinformatics/bti185
ORIGINAL PAPERS:
GENOME ANALYSIS:
Albert D. G. de Roos
Origins of introns based on the definition of exon modules and their
conserved interfaces
Bioinformatics Advance Access published on August 12, 2004
Bioinformatics 2005 21: 2-9; doi:10.1093/bioinformatics/bth475
SEQUENCE ANALYSIS:
Kuo-Chen Chou
Using amphiphilic pseudo amino acid composition to predict enzyme
subfamily classes
Bioinformatics Advance Access published on August 12, 2004
Bioinformatics 2005 21: 10-19; doi:10.1093/bioinformatics/bth466
Chin Lung Lu and Yen Pin Huang
A memory-efficient algorithm for multiple sequence alignment with
constraints
Bioinformatics Advance Access published on September 16, 2004
Bioinformatics 2005 21: 20-30; doi:10.1093/bioinformatics/bth468
Pavel Sumazin, Gengxin Chen, Naoya Hata, Andrew D. Smith, Theresa Zhang, and
Michael Q. Zhang
DWE: Discriminating Word Enumerator
Bioinformatics Advance Access published on August 27, 2004
Bioinformatics 2005 21: 31-38; doi:10.1093/bioinformatics/bth471




Åsa K. Björklund, Daniel Soeria-Atmadja, Anna Zorzet, Ulf Hammerling, and Mats G.
Gustafsson
Supervised identification of allergen-representative peptides for in silico
detection of potentially allergenic proteins
Bioinformatics Advance Access published on August 19, 2004
Bioinformatics 2005 21: 39-50; doi:10.1093/bioinformatics/bth477
STRUCTURAL BIOINFORMATICS:
Luonan Chen, Tianshou Zhou, and Yun Tang
Protein structure alignment by deterministic annealing
Bioinformatics Advance Access published on August 12, 2004
Bioinformatics 2005 21: 51-62; doi:10.1093/bioinformatics/bth467
GENE EXPRESSION:
Wenjiang J. Fu, Edward R. Dougherty, Bani Mallick, and Raymond J. Carroll
How many samples are needed to build a classifier: a general sequential
approach
Bioinformatics Advance Access published on August 5, 2004
Bioinformatics 2005 21: 63-70; doi:10.1093/bioinformatics/bth461
Min Zou and Suzanne D. Conzen
A new dynamic Bayesian network (DBN) approach for identifying gene
regulatory networks from time course microarray data
Bioinformatics Advance Access published on August 12, 2004
Bioinformatics 2005 21: 71-79; doi:10.1093/bioinformatics/bth463
A. Reverter, S. M. McWilliam, W. Barris, and B. P. Dalrymple
A rapid method for computationally inferring transcriptome coverage and
microarray sensitivity
Bioinformatics Advance Access published on August 12, 2004
Bioinformatics 2005 21: 80-89; doi:10.1093/bioinformatics/bth472
GENETICS AND POPULATION ANALYSIS:
Kui Zhang, Fengzhu Sun, and Hongyu Zhao
HAPLORE: a program for haplotype reconstruction in general pedigrees
without recombination
Bioinformatics Advance Access published on July 1, 2004
Bioinformatics 2005 21: 90-103; doi:10.1093/bioinformatics/bth388



DATA AND TEXT MINING:
Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W. Berry
Gene clustering by Latent Semantic Indexing of MEDLINE abstracts
Bioinformatics Advance Access published on August 12, 2004
Bioinformatics 2005 21: 104-115; doi:10.1093/bioinformatics/bth464
DATABASES AND ONTOLOGIES:
Yongqun He, Richard R. Vines, Alice R. Wattam, Georgiy V. Abramochkin, Allan W.
Dickerman, J. Dana Eckart, and Bruno W. S. Sobral
PIML: the Pathogen Information Markup Language
Bioinformatics Advance Access published on August 5, 2004
Bioinformatics 2005 21: 116-121; doi:10.1093/bioinformatics/bth462
APPLICATIONS NOTE:
SEQUENCE ANALYSIS:
Robert Belshaw and Aris Katzourakis
BlastAlign: a program that uses blast to align problematic nucleotide
sequences
Bioinformatics Advance Access published on August 13, 2004
Bioinformatics 2005 21: 122-123; doi:10.1093/bioinformatics/bth459
GENE EXPRESSION:
Scott J. Tebbutt, Igor V. Opushnyev, Ben W. Tripp, Ayaz M. Kassamali, Wendy L.
Alexander, and Marilyn I. Andersen
SNP Chart: an integrated platform for visualization and interpretation of
microarray genotyping data
Bioinformatics Advance Access published on August 12, 2004
Bioinformatics 2005 21: 124-127; doi:10.1093/bioinformatics/bth470
GENETICS AND POPULATION ANALYSIS:
Marie-Françoise Jourjon, Sylvain Jasson, Jacques Marcel, Baba Ngom, and Brigitte
Mangin
MCQTL: multi-allelic QTL mapping in multi-cross design
Bioinformatics Advance Access published on August 19, 2004
Bioinformatics 2005 21: 128-130; doi:10.1093/bioinformatics/bth481




Kui Zhang, Zhaohui Qin, Ting Chen, Jun S. Liu, Michael S. Waterman, and Fengzhu Sun
HapBlock: haplotype block partitioning and tag SNP selection software using
a set of dynamic programming algorithms
Bioinformatics Advance Access published on August 27, 2004
Bioinformatics 2005 21: 131-134; doi:10.1093/bioinformatics/bth482
SYSTEMS BIOLOGY:
Vincent J. Carey, Jeff Gentry, Elizabeth Whalen, and Robert Gentleman
Network structures and algorithms in Bioconductor
Bioinformatics Advance Access published on August 5, 2004
Bioinformatics 2005 21: 135-136; doi:10.1093/bioinformatics/bth458
DATABASES AND ONTOLOGIES:
Slobodan Vucetic, Zoran Obradovic, Vladimir Vacic, Predrag Radivojac, Kang Peng,
Lilia M. Iakoucheva, Marc S. Cortese, J. David Lawson, Celeste J. Brown, Jason G.
Sikes, Crystal D. Newton, and A. Keith Dunker
DisProt: a database of protein disorder
Bioinformatics Advance Access published on August 13, 2004
Bioinformatics 2005 21: 137-140; doi:10.1093/bioinformatics/bth476

BIOINFORMATICS Editorial
Vol. 21 no. 1 2005, page 1
doi:10.1093/bioinformatics/bti185
INCREASING THE IMPACT OF
BIOINFORMATICS
The year 2004 has been very successful for Bioinformatics. The journal's
latest impact factor from the Institute for Scientific Information has
increased from 4.615 to 6.701. This is quite an exceptional increase,
reflecting the increasing standard of work in the journal as well as the
increasing stature of the field. Early in 2004, we implemented a new system
for Advance online access which allows scientists to access research in
Bioinformatics as rapidly as possible.
The number of manuscripts submitted to the journal continues to grow. In
2003 we received 1300 submissions while in 2004 we received over 1800
submissions. That we have coped with this large increase is testament to
the dedication and hard work of our team of Associate Editors, referees and
the Editorial Office. To cope with future growth we are adapting our
editorial structure and processes.
From 2005 the journal will appear 24 times per year rather than the 18
issues per year previously. This will allow us to publish more of the high
quality research and applications that are being submitted to us each day.
However, the increase in submissions has outstripped the increase in
journal pages. This means that our acceptance rate is decreasing to ~20%.
We are in the process of appointing new Associate Editors to replace those
who have stepped down and to increase our coverage of new areas. We would
like to thank Gert Vriend, Debbie Marks and Fritz Roth for their invaluable
contribution to the journal. We are also in the process of expanding the
membership of our Editorial Board.
In 2005, we will be introducing a new scheme of categories for papers.
During the submission process authors will be asked to choose which
category their paper belongs to. This will improve the assignment of
manuscripts to editors as well as helping us to formulate a clear
definition of the scope of the journal within each category, thus improving
organization, in terms of layout and editorial process.
Open Access is a topic that is very important to many of our authors and
readers. Along this line, we as Editors, and Oxford University Press as a
not-for-profit academic publisher, are very much in favour of the principle
of making scientific publications freely available. For a well-established
journal, as Bioinformatics is now, it is also important to preserve the
reputation and financial security of the journal to which authors,
referees, editors and readers have contributed over the many years. During
2005 we will learn a great deal from the experience of our sister journal,
Nucleic Acids Research, which is introducing a full Open Access model. We
are also seeking the opinions of our authors and readers through a survey
exploring publication models. We will be sure to base our decision on
whether to move forward with an Open Access initiative on the response from
our readers, authors and their institutions. So please let us know your
views. If we are encouraged by the journal's community to experiment with
Open Access, and with a carefully studied new business model in place, we
see Open Access as a real opportunity for the journal in the near future.
This signicant collection of changes will make 2005 an
important year for the journal.Our new cover design reects
this spirit of change which will make our journal,and the
bioinformatics eld,more open to science.
Alfonso Valencia and Alex Bateman
Executive Editors
Bioinformatics vol. 21 issue 1 © Oxford University Press 2005; all rights reserved.
BIOINFORMATICS ORIGINAL PAPER
Vol. 21 no. 1 2005, pages 2–9
doi:10.1093/bioinformatics/bth475
Origins of introns based on the definition of exon modules and their
conserved interfaces
Albert D. G. de Roos
The Beagle Armada, Postbus 964, 4600 AZ Bergen op Zoom, The Netherlands
Received on March 4, 2004; revised on July 22, 2004; accepted on August 5, 2004
Advance Access publication August 12, 2004
ABSTRACT
Summary: Central to the unraveling of the early evolution of the genome is
the origin and role of introns. The evolution of the genome can be
characterized by a continuous expansion of functional modules that occurs
without the interruption of existing processes. The design-by-contract
methodology of software development offers a modular approach to design
that seeks to increase flexibility by focusing on the design of constant
interfaces between functional modules. Here, it is shown that
design-by-contract can offer a framework for genome evolution. The
definition of an ancient exon module with identical splice sites leads to a
relatively simple sequence of events that explains the role of introns,
intron phase differences and the evolution of multi-exon proteins in an RNA
world. An interaction of the experimentally defined six-nucleotide splicing
consensus sequence together with a limited number of primitive ribozymes
can account for a rapid creation of protein diversity.
Contact: albert.de.roos@thebeaglearmada.nl
INTRODUCTION
One of the most intriguing questions in unravelling genomic evolution is
whether the intron/exon structure of eukaryotic genes reflects their
ancient assembly by exon shuffling or whether the introns have been
inserted into preformed genes. Several theories have been put forward to
explain the role of introns and exons in evolution (reviewed in Mattick,
1994; Logsdon, 1998; Fedorova and Fedorov, 2003; Rzhetsky and Ayala, 1999).
There are now two main competing theories that try to explain the role of
introns, both based on the involvement of DNA-based introns and exons. The
‘introns early’ or exon theory of genes states that the introns are ancient
and have been subsequently lost in prokaryotes (Gilbert, 1987; Gilbert et
al., 1996, 1997). In this theory, the first exons coded for ancient protein
modules from which multi-modular proteins were assembled by means of exon
shuffling and recombination. Introns facilitated this process by providing
the actual sites of recombination. On the other hand, the ‘introns late
theory’ maintains that the spliceosomal introns were inserted into the
eukaryote genes later in evolution (Palmer and Logsdon, 1991;
Cavalier-Smith, 1991; Cho and Doolittle, 1997; Logsdon, 1998) after the
evolution of multi-modular proteins. In introns-late, the appearance of
introns could also have aided in the creation of diversity by facilitating
recombination. No conclusive evidence has been found to prove or disprove
intron-early or intron-late, although these theories are based on
completely different genome architectures and mechanisms of evolution.
The genome has evolved from a simple RNA-based self-replicating system, the
RNA world (Gilbert, 1986; Joyce, 2002) to a complex system of multi-exon
genes coding for multi-modular proteins. During this evolutionary process,
numerous new functions were added or modified without disrupting the
functioning of older systems. The evolution from strands of RNA to
multi-exon genes with sophisticated expression systems implies that the
genome was able to increase in size and complexity by many orders of
magnitude without losing flexibility. Any genome architecture meant to form
the basis of genome evolution should therefore be flexible and robust in
order to meet the requirements for virtually unlimited expansion of size
and function.
Modern software designs seek to increase flexibility by using a modular
approach which allows for the addition, replacement and changing of
operations within individual modules. Complex software architectures are
based on a methodology in which a software system is viewed as a set of
communicating modules whose interaction is based on precisely defined
interfaces. The interfaces can be viewed as specifications of the mutual
obligations or contracts. The effect of constant interfaces across modules
is a reduction of the interdependences across modules or components and a
reduction in the risk that changes within one module will create
unanticipated changes in other modules. This methodology is also known as
design-by-contract (Meyer, 1997). Since the characteristics of the
design-by-contract methodology are similar to those required in genome
evolution, it is hypothesized here that genome architecture reflects the
paradigms of design-by-contract: definition of functional modules that
interact with each other by well-defined interfaces.
Bioinformatics vol. 21 issue 1 © Oxford University Press 2004; all rights reserved.
Genome architecture
Fig. 1. From an intron-centric to an exon-centric view on the structure of
eukaryotic genes. (A) Generalized eukaryotic gene structure. Eukaryotic
genes consist at the RNA level of coding sequences (exons) interspersed
with non-coding sequences (introns). Before translation, the introns are
spliced out to form a continuous coding sequence, the messenger RNA (mRNA).
(B) The intron in detail. Introns are spliced out based on conserved
sequences at both ends of the intron. At the left-hand side, the most
common three-ribonucleotide sequence is GUG; at the right-hand side this
sequence is CAG. The boldface characters indicate residues necessary for
splicing. (C) The ancient exon viewed as a unit. In an exon-centric view on
the gene structure, the conserved parts of the intron could have functioned
as signals demarcating the end ('end') and beginning ('start') of the exon.
MODULARITY AND INTERFACES IN THE
GENOME
The basic unit of genetic information, the gene, can be regarded as a
self-contained module with a well-defined interface. A gene contains all
the necessary information from which the encoded protein can be generated,
whereas the highly conserved genetic code functions as the interface
between gene and protein. Eukaryotic genes themselves consist of parts of
coding sequences, exons, interrupted by non-coding sequence, the introns
(Fig. 1A). The introns have to be spliced out in order to form a continuous
coding sequence, mRNA, that can be recognized by the translation machinery.
In principle, an intron contains all the necessary information to be
spliced out, which enables it to function independently from the exon
sequence. The intron can therefore be regarded as a self-contained module
with a well-defined (conserved) interface, the splice recognition site
(Fig. 1B), which is located exclusively in the intron. This configuration
enables the excision of introns independent from exon sequence.
Exons are, in contrast to introns, dependent upon information that lies
outside of the exons, since the splice recognition sites of the intron
determine the span of the exon. A dependence on intron sequences would
severely hamper independent movement and exchange of coding sequences
between genes. However, extensive recombination of exons by exon shuffling
is believed to play an important role in the creation of genetic diversity
(Patthy, 2003; Sudhof et al., 1985; Kolkman and Stemmer, 2001) and many of
the proteins with functionally divergent domains were established before
the division of prokaryotes from eukaryotes (Ohno, 1987). In order to be
inserted into random nucleotide sequences, the exon module should
preferably behave like a self-contained module. The exon gains considerable
independence when the conserved intron sequences that flank both ends of
the exon are included as part of the exon (Fig. 1C), enabling it to
function as an independent coding module, or ancient exon module.
MOLECULAR VIEW ON THE ANCIENT EXON
The ends of the proposed ancient exon module were studied in more detail at
a molecular level using an intron–exon database (Clark, 2003,
http://www.maths.uq.edu.au/~fc/datasets/) generated from GenBank release
127 (Benson et al., 2002). The last nucleotides on either side of the exon
module are represented by the intron–exon boundary and possible remnants of
a consensus sequence were determined by looking at nucleotide triplets from
the intron and the exon part of the intron–exon boundary. The
tri-nucleotide sequences with the highest frequencies of several species
are shown in Tables 1 and 2. Looking at the overall similarity between the
sequences on both ends of the exon module and the conservation of these
sequences between species, a bias towards the sequence CAG|GTG can be
discerned both in the sequence preceding the exon and in the one following
the exon. No significant differences were observed between sequences from
intron–exon boundaries with different intron phases (data not shown).
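The triplet tabulation described above can be sketched in a few lines of Python. This is only an illustration of the counting step, not the author's SQL queries: the function names `boundary_triplets` and `percentages`, the `(intron, exon)` pair layout and the toy sequences are all assumptions made for the example.

```python
from collections import Counter

def boundary_triplets(pairs):
    """Count the last three nucleotides of each intron and the first
    three nucleotides of the exon that follows it."""
    intron_3p = Counter()
    exon_5p = Counter()
    for intron, exon in pairs:
        # Skip sequences too short to contain a full triplet.
        if len(intron) >= 3 and len(exon) >= 3:
            intron_3p[intron[-3:].lower()] += 1
            exon_5p[exon[:3].lower()] += 1
    return intron_3p, exon_5p

def percentages(counter):
    """Convert raw counts into percentages rounded to one decimal."""
    total = sum(counter.values())
    return {seq: round(100.0 * n / total, 1) for seq, n in counter.items()}

# Hypothetical toy data: (intron, following exon) pairs, not real GenBank records.
pairs = [
    ("gtaagtttcag", "gtgaaacag"),
    ("gtgagtcctag", "gagttcaag"),
    ("gtaagaaacag", "gtgcccgag"),
    ("gtatgtttcag", "atgaaacag"),
]
intron_3p, exon_5p = boundary_triplets(pairs)
print(intron_3p.most_common(5))
print(percentages(exon_5p))
```

The same counting, applied to the 3′ end of each exon and the 5′ end of the following intron, would give the other side of the exon module.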
Based on the data in Tables 1 and 2, it is proposed that the conserved
sequences of both ends of this ancient exon module functioned as the
ancient exon recognition site with an original sequence CAGGUG (Fig. 2A).
This consensus sequence could have served as a cleavage recognition site
enabling the splicing out of the coding sequence, creating the substrate
for the translation machinery (Fig. 2B). A cleavage in the middle of the
sequence CAGGUG would result in a spliced out coding RNA sequence that is
always surrounded by the remaining parts of the recognition sequence, GUG
at the start and CAG at the end of the exon. Concatenation of these ancient
exon modules after cleavage of the recognition sites joins the remaining
parts of the recognition sequence (Fig. 2C), forming multi-exon mRNA.
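The cleavage and concatenation steps can be made concrete with plain string operations. The sketch below is a toy model under stated assumptions: RNA is a Python string, an exon module is a coding stretch flanked by two copies of CAGGUG, and the helper names `make_module`, `excise` and `concatenate` as well as the coding sequences are invented for illustration.

```python
SITE = "CAGGUG"          # proposed ancient recognition site
HALF = len(SITE) // 2    # cleavage is assumed to fall between CAG and GUG

def make_module(coding):
    """Build an exon module: coding sequence flanked by two recognition sites."""
    return SITE + coding + SITE

def excise(module):
    """Cleave both flanking sites in the middle and return the central
    fragment, which starts with GUG and ends with CAG."""
    if (len(module) < 2 * len(SITE)
            or not module.startswith(SITE)
            or not module.endswith(SITE)):
        raise ValueError("module must be flanked by two recognition sites")
    return module[HALF:-HALF]

def concatenate(fragments):
    """Join excised modules end to end, as in multi-exon mRNA."""
    return "".join(fragments)

# Hypothetical coding sequences, chosen only to keep the example readable.
first = excise(make_module("AAACCC"))    # GUG + AAACCC + CAG
second = excise(make_module("GGGUUU"))   # GUG + GGGUUU + CAG
mrna = concatenate([first, second])
print(mrna)
```

In this toy model the junction between the two fragments regenerates the full CAGGUG sequence, which is the recurrence the text attributes to concatenated exon modules.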
Table 1. Frequency distribution of the nucleotide sequences of the
intron–exon boundary at the left-hand side or 5′ end of the ancient exon
module

                 intron 3′      #      %   ctrl %    exon 5′     #      %   ctrl %
Human            cag        16304   68.7     2.3     gtg      1767    7.4     1.7
                 tag         5272   22.2     1.0     gag       856    3.6     2.2
                 aag         1108    4.7     1.6     gaa       841    3.5     2.4
                 gag          129    0.5     2.3     ggt       818    3.4     1.7
                 agg           46    0.2     2.6     gtt       814    3.4     1.2
Drosophila       cag        25565   64.1     1.0     gtg      1465    3.7     1.4
                 tag        10063   25.2     1.1     atg      1456    3.6     1.5
                 aag         2409    6.0     1.5     atc      1370    3.4     1.6
                 gag          233    0.6     0.8     att      1335    3.3     1.7
                 taa           76    0.2     3.5     aaa      1197    3.0     2.6
Arabidopsis      cag        56151   60.8     0.7     gtt      7033    7.9     1.9
                 tag        26371   28.5     1.2     gtg      5145    5.8     1.0
                 aag         6703    7.3     1.4     gaa      4295    4.8     2.9
                 gag         1228    1.3     0.8     gta      4071    4.6     0.7
                 tga          140    0.2     2.6     gat      3505    3.9     2.2
Caenorhabditis   cag        76340   81.5     0.8     aaa      5315    5.7     2.7
                 tag        12994   13.9     1.2     att      4901    5.2     2.0
                 aag         3347    3.6     1.6     aat      3549    3.8     2.0
                 gag          578    0.6     0.7     gaa      3509    3.7     2.9
                 ttt           29    0.0     7.7     atg      3404    3.6     1.6

The table consists of the sequence of the last three RNA nucleotides of the
intron (3′ intron) followed by the first three nucleotides of the exon (5′
exon). Controls (ctrl) show nucleotide sequences in exons and introns 12 nt
up- or downstream from the splice site.
Methods: Datasets with intron and exon data were downloaded from the
Internet page of Francis Clark (Clark, 2003). These data are based on gene
annotation in DNA sequences derived from GenBank (Benson et al., 2002). The
relevant intron and exon data were extracted from these files and converted
into a tab-delimited text file that was imported into tables created in a
MySQL database (www.mysql.com). Exon tables consisted of a unique gene
identifier, the exon number and exon sequence, and the intron tables of the
gene identifier, intron number and intron sequence. A third table consisted
of the gene identifiers with their corresponding GenBank accession number,
which enabled joining with other GenBank databases. SQL queries are
available upon request. Only the five most abundant tri-nucleotide
sequences are shown. The 5′ ends of the first exon of a gene and the 3′
part of the last exon of a gene were not included in the data.
Boldface sequences indicate the putative consensus recognition sequences of
the exon module (Fig. 2).

Support for the existence of the ancient splice site can be provided by the
fact that the codon GUG still acts as a translation start site in bacteria
(Gold, 1988) and can still function as one in other organisms (Mehdi et
al., 1990; Peabody, 1989). Moreover, the most common start codon AUG
differs by only one nucleotide from GUG and a single mutation of the first
nucleotide of the hypothetical ancient end sequence CAG is needed to
convert it into the amber stop codon UAG. Other support for the role of the
ancient splice site comes from the intron-less genes of prokaryotes. It has
been shown that the coding sequences around the positions of intron
insertion in their eukaryotic counterparts also show a consensus sequence
CAG|GT, originally dubbed the proto-splice site (Dibb and Newman, 1989). If
introns were lost during evolution in an RNA world with a mechanism closely
related to splicing (cf. Fig. 2C), the proposed ancient splice site would
also be retained.
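The single-nucleotide relationships mentioned above (GUG versus the start codon AUG, and CAG versus the amber stop codon UAG) can be checked with a small Hamming-distance helper. The function names `hamming` and `differing_positions` are illustrative choices, not taken from the paper.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def differing_positions(a, b):
    """Zero-based positions at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

# Start of the exon module versus the most common start codon.
print(hamming("GUG", "AUG"), differing_positions("GUG", "AUG"))
# End of the exon module versus the amber stop codon.
print(hamming("CAG", "UAG"), differing_positions("CAG", "UAG"))
```

In both comparisons the only difference is at the first nucleotide of the triplet.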
EXON PHASE AND FRAME-SHIFT
Table 2. Frequency distribution of the nucleotide sequences of the
intron–exon boundary at the right-hand side or 3′ end of the putative
ancient exon module

                 exon 3′        #      %   ctrl %    intron 5′      #      %   ctrl %
Human            cag         5566   23.5     2.8     gta        11817   49.7     0.8
                 aag         3990   16.9     1.6     gtg        10073   42.4     2.2
                 gag         2245    9.5     2.2     gtc          720    3.0     1.3
                 ctg         1003    4.2     3.0     gtt          553    2.3     1.3
                 atg          842    3.6     1.5     gca          113    0.5     1.5
Drosophila       aag         4866   12.2     1.7     gta        21804   54.5     1.3
                 cag         4591   11.5     1.9     gtg        14279   35.7     0.9
                 gag         2553    6.4     1.6     gtt         2164    5.4     1.7
                 atg         1578    3.9     1.5     gtc          749    1.9     0.7
                 caa         1468    3.7     3.0     gca          127    0.3     1.2
Arabidopsis      aag        16985   19.1     2.3     gta        58785   63.6     1.2
                 cag        15845   17.8     1.3     gtt        16100   17.4     0.7
                 gag         9348   10.5     1.8     gtg        11588   12.5     0.8
                 atg         4114    4.6     1.6     gtc         4698    5.1     2.0
                 ctg         3482    3.9     1.1     gca          439    0.5     1.3
Caenorhabditis   aag        11184   12.0     1.8     gta        51869   55.3     1.1
                 cag         9175    9.9     1.3     gtg        23717   25.3     0.7
                 gag         6709    7.2     1.3     gtt        16020   17.1     1.8
                 aaa         4923    5.3     3.8     gtc         1526    1.6     0.5
                 atg         4149    4.5     1.8     gca          179    0.2     0.8

The table consists of the 3′ end of the exon and the 5′ end of the intron.
Boldface sequences indicate the putative consensus recognition sequences of
the exon module (Fig. 2).

The joining of two exon modules as shown in Figure 2C implies that parts of
the consensus splice sites become part of the coding sequence (Fig. 3A) and
every module would be connected by a fixed series of 6 nt, formed by the
sequence CAGGUG. The two codons in this sequence (CAG and GUG) would always
be translated into the amino acids glutamine (Q) and valine (V). In our
design-by-contract model, the recognition sequence represents the interface
for the splicing of the exons and therefore, any mutation in this sequence
would be deleterious since it would result in the inactivation of the
splice site (Fig. 3B) and a resulting loss of function of the encoded
protein. On the other hand, mutation of the amino acid sequence would be
advantageous for the evolutionary process since it would relieve the
obligatory translation of the ancient splice site into the amino acids Q
and V. A phase shift enables the reading-through of the recognition
sequence in another way (Fig. 3C), leading to a different amino acid
sequence between exons with identical recognition consensus sequences.
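The effect of a phase shift on how the fixed junction CAGGUG is read can be worked through with the standard genetic code. The sketch below is an illustration under stated assumptions: the splice point sits between CAG and GUG, `N` stands for an unknown flanking nucleotide, and the function names `possible_amino_acids` and `read_junction` are invented for the example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code; codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
SITE = "CAGGUG"

def possible_amino_acids(pattern):
    """Amino acids (or '*' for stop) encoded by a codon pattern in which
    'N' stands for any nucleotide."""
    options = [BASES if base == "N" else base for base in pattern]
    return "".join(sorted({CODON_TABLE["".join(c)] for c in product(*options)}))

def read_junction(phase):
    """Read the codons overlapping the junction for intron phase 0, 1 or 2.
    The phase is the number of nucleotides of the bridging codon that lie
    upstream of the splice point."""
    padded = "NN" + SITE + "NN"          # unknown exon nucleotides on both sides
    splice_point = 2 + 3                 # index between CAG and GUG in padded
    first_codon = splice_point - phase - 3
    return [
        possible_amino_acids(padded[i:i + 3])
        for i in range(first_codon, len(padded) - 2, 3)
    ]

for phase in (0, 1, 2):
    print(phase, read_junction(phase))
```

Under these assumptions phase 0 reads only Q followed by V, phase 1 reads a codon ending in CA (T, P, A or S) followed by the bridging codon GGU (G), and phase 2 reads the bridging codon AGG (R) followed by a codon starting with UG (C, W or a stop). These are the same amino acid groups that Table 3 tracks per phase.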
Fig. 2. An exon-centric view on gene structure. (A) Eukaryotic exon showing
the generalized sequences at the intron–exon and exon–intron boundaries.
The most common exon sequence on the 5′ side of the exon is GTG, on the 3′
side it is CAG. Both sides of the exon now have an identical six
ribonucleotide long sequence. (B) The exon unit can be viewed as a coding
sequence, surrounded by two identical recognition sequences (CAGGUG), where
actual splicing could occur in the middle of the recognition site, the
exon–intron boundary. (C) Concatenation of exons based on a single
recognition site would join exons at their splice sites while retaining
parts of the recognition sequences in the resulting mRNA at the splice
junctions. This sequence is identical to the original recognition site and
could be the equivalent of the proposed proto-splice sites in prokaryotic
intron-less genes. Note the recurrent sequence CAGGUG.

The actual distribution of amino acids at splice junctions was investigated
using an exon–intron database containing phase information (Sakharkar et
al., 2000) derived from GenBank 122 (Benson et al., 2002). Figure 4 shows
that the last amino acid of an exon has a phase-dependent preference for
specific amino acids. In each phase, the last amino acid follows closely
the ones that can be predicted from a phase shift based on a constant
splice recognition sequence (Fig. 3C). Note that at the nucleotide level,
the intron–exon boundary does not exhibit phase-dependent differences (data
not shown). Table 3 shows that the amino acid positions that would have
been affected by an ancient phase shift still show a bias towards their
predicted phase. This effect is even stronger when the effect of a phase
shift is viewed in both exons simultaneously, up to the point that almost
95% of the amino acid sequences Q|V around a splice site are in phase 0.
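The "almost 95%" figure follows directly from the counts reported in Table 3. A short check, using the Table 3 counts for junctions with Q before and V after the splice site (and, for comparison, Q before the splice site alone); the helper name `phase_shares` is an illustrative choice:

```python
def phase_shares(counts):
    """Percentage of junctions in each intron phase, one decimal place."""
    total = sum(counts.values())
    return {phase: round(100.0 * n / total, 1) for phase, n in counts.items()}

# Counts per intron phase, copied from Table 3.
pre_q_post_v = {0: 5688, 1: 132, 2: 180}
pre_q = {0: 46544, 1: 1717, 2: 2634}

print(phase_shares(pre_q_post_v))
print(phase_shares(pre_q))
```

For the combined Q|V case the three counts sum to 6000, so phase 0 accounts for 5688/6000 of the junctions.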
Fig. 3. mRNA intron phase shifts can change the amino acid sequence without
changing the ancient splice recognition site. (A) The sequence at the
splice junctions codes for the amino acids glutamine and valine that would
result from concatenation of the exons. (B) Mutations in the splice site
recognition sequence will disrupt splicing. (C) Phase shifts can change the
splice junction amino acid sequence without disruption of the recognition
site. In phase 0, there is only one amino acid sequence possible, glutamine
followed by valine. In phase 1 and 2, translation will read through the
splice junction in a different way, making various combinations of amino
acids possible. Note the conservation of the sequence at the splice
junctions (in red).

Since splicing out of introns is necessary for correct translation,
intronless mRNA can be considered as a well-conserved interface to the
translation machinery. The generation of intronless mRNA by a concatenation
of different coding RNA modules in random ‘intron’ RNA sequence (Fig. 2)
would not change this interface and could take place without affecting
translation. Also, the separate development of functional protein modules,
followed by an assembly of these modules, would be inherently less complex
and more flexible (Gilbert, 1987; Patthy, 2003). Phase shift could be
viewed as an outcome of a genomic evolution model based on the
design-by-contract methodology, since phase shifts could provide a means
for creating more protein diversity without affecting the established
splicing interface. The development of a splicing machinery that would
confine the splicing recognition sequence exclusively to the intron (as is
presently the case, cf. Fig. 1B) would ultimately enable the completely
independent evolution of the coding ends of the exon.
The degree of conservation at the boundaries of exons flanking introns has
been shown earlier and has been interpreted as a derived result of
evolution for efficient splicing (Long et al., 1997), the preferred
insertion site for introns (Dibb and Newman, 1989) or as functional splice
sites that existed in the coding sequence of genes prior to the insertion
of introns (Sadusky et al., 2004). Intron phase has been shown to be
correlated to the codon position (Long et al., 1995; Tomita et al., 1996)
and hypothesized to be related to exon shuffling between exons in the same
phase (Long et al., 1995).
[Figure 4: three bar charts (phase 0, phase 1 and phase 2, each with its
control); x-axis: one-letter amino acid code; y-axis: occurrence (%)]
Fig. 4. Amino acid frequencies around splice sites in different phases. The
last amino acid coded by each exon was determined when the exon was
followed by a phase 0 (A), phase 1 (B) and phase 2 intron (C). In phase 0,
the last amino acid just before the splice site is shown; in phase 1 and 2
the amino acid was taken that bridged the actual splice site on a
nucleotide basis. As a control, amino acids three residues downstream of
the splice site were taken (cf. Fig. 3C). The control values were similar
in each phase and were therefore averaged. In general, the amino acid
distribution follows the nucleotide triplet data presented in Tables 1 and
2, although some differences can be seen due to the fact that amino acids
can have multiple codons and that in this figure all species present in
GenBank are pooled. Methods: The ExInt database (Sakharkar et al., 2000)
containing a set of tables with exon and intron data including exon amino
acid sequence, intron phase data and GenBank accession number was kindly
provided by Dr Meena Sakharkar. SQL queries were performed, which
determined the last amino acid that was coded by the respective exon in
each phase.
Table 3. Relationship between amino acids bordering the splice site and
intron phase. Intron phase distribution with the indicated amino acid
preceding or following a splice site, and a combination of the amino acids.
The amino acid chosen reflects the proposed mechanism of intron phase
differences in Figure 3. Controls are phase distributions in exons without
the indicated sequences in that row. Methods: see Figure 4.

Phase   pre Q               post V              pre Q, post V       Control
             #       %           #       %           #       %            #       %
0        46544    91.5       28121    63.1        5688    94.8       188735    43.2
1         1717     3.4        8895    20.0         132     2.2       137145    31.4
2         2634     5.2        7537    16.9         180     3.0       110541    25.3

Phase   pre-1 TPAS          pre G               TPAS/G              Control
             #       %           #       %           #       %            #       %
0        57490    43.9        5080     8.5        1411     7.1       196553    55.4
1        45451    34.7       44377    73.8       15757    79.2        73554    20.7
2        28116    21.5       10648    17.7        2716    13.7        84484    23.8

Phase   pre R               post CW             R + CW              Control
             #       %           #       %           #       %            #       %
0        11243    23.3        6071    37.4         283    16.0       240681    52.0
1         3909     8.1        5437    33.5         158     8.9       138437    29.9
2        32999    68.5        4732    29.1        1328    75.1        84129    18.2

Boldface indicates predicted phase of used amino acid sequences based on
the model in Figure 3C.
FUNDAMENTAL STEPS IN EVOLUTION
BASED ON A SINGLE TEMPLATE
It is proposed here that the sequence CAGGUG acted as the ancient cleavage
recognition site for a ribozyme. Ribozymes can interact with their targets
by a complementary RNA sequence, primarily based on Watson–Crick base
pairing (Guerrier-Takada et al., 1989; Cech, 1987). Based on the sequence
of the ancient splice site, an antiparallel arrangement of this sequence
could interact with itself (Fig. 5A), making a single recognition sequence
act as both the target site and the target recognition sequence. At a
molecular level, this interaction could be stabilized by four Watson–Crick
base pairs while leaving the two central G–G positions unpaired.
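The count of four Watson–Crick pairs and two G–G positions can be verified by aligning CAGGUG against an antiparallel copy of itself. This is a minimal sketch that only considers canonical G–C and A–U pairs; the function name `antiparallel_pairs` is an illustrative choice.

```python
WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}

def antiparallel_pairs(strand_a, strand_b):
    """Pair strand_a read 5'->3' with strand_b read 3'->5'."""
    return list(zip(strand_a, reversed(strand_b)))

site = "CAGGUG"
pairs = antiparallel_pairs(site, site)
paired = [p for p in pairs if p in WATSON_CRICK]
unpaired = [p for p in pairs if p not in WATSON_CRICK]

print(pairs)
print(len(paired), "Watson-Crick pairs;", len(unpaired), "non-canonical positions")
```

The two non-canonical positions are the central G residues facing each other, which is where the proposed cleavage between CAG and GUG would fall.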
The splicing out of the RNA sequences between the exon modules, equivalent
to intron splicing, is an important step in genome evolution. Figure 5B
shows how the antiparallel arrangement of two adjacent exon modules could
facilitate splicing. In addition to an intra-strand cleavage between G
residues, a religation of the G's to the opposing strand would concatenate
the two exons, a process that could be facilitated by the close physical
proximity of the G's involved.
Another important step in the evolution of proteins is the
exchange of coding sequences between different genes res-
ulting in the recombination of genes.A mechanism identical
to intron splicing as shown in Figure 5B but followed by an
in trans religation would lead to the exchange of RNAstrands
(Fig.5C) between RNA molecules.In this way,ancient
ribozymes could have played an active role in the generation
of the diversity of proteins.
Thus, based on a six-nucleotide proto-splice site and relatively
simple ribozymes that could cleave and religate this
sequence, three important events in the exon-centric evolution
of multi-domain proteins can be explained: (i) the splicing
out of the exon modules yielding short exonic mRNA, (ii) the
splicing out of RNA sequences between exons, thereby
concatenating exon modules to multi-exon mRNA and (iii) the
active recombination of exons. The classes of ribozymes that
could catalyse the cleavage and ligation reactions proposed in
Figure 5 have been shown to occur naturally (Symons, 1992;
Guerrier-Takada et al., 1983). Ribozyme RNAse T1 cleaves
a double-stranded complementary RNA sequence at unpaired
G residues, and apart from several naturally occurring RNA
ligases (Yoshida, 2001; Hager et al., 1996), it has also been
shown that complex ligases can evolve from group I ribozyme
domains (Jaeger et al., 1999) and from small random RNA
sequences (Ekland et al., 1995).
Fig. 5. Fundamental events in the evolution of multifunction,
multi-exonic proteins based on a single recognition sequence.
(A) Cleavage. The sequence CAGGUG functions as the signal sequence
for cleavage while the actual recognition takes place via the same
sequence of a ribozyme using canonical (GC and UA) base pairing,
possibly combined with non-canonical GG pairing. (B) Cleavage
and in cis religation. On the basis of an anti-parallel arrangement of
recognition sites, splicing could be accomplished by a cleavage of
the two recognition sites followed by a religation between opposing
strands. (C) Cleavage and in trans religation. A religation between
different RNA strands leads to a recombination of exon modules.
The proto-splice site can act as a starting point for the
evolution of multifunction proteins when the consensus sequence
of the proposed proto-splice site arises randomly in strands of
RNA. Two splice sites in close proximity could then lead to
the first functional single-exon genes. The transformation of
the coding parts of the proto-splice site sequences into start
(GUG to AUG) and stop codons (CAG to UAG) and vice
versa back to a functional proto-splice site could facilitate a
stepwise concatenation of exons (cf. Fig. 2).
The introns that arose early in evolution as a consequence
of a concatenation of exons (Fig. 2) could be lost further in
evolution, but their presence at conserved positions would
still reflect their ancient origins. The evolution of transposons
from introns, both able to function as relatively independent
functional units, may account for many of the observations
attributed to the introns-late theory (Cho and Doolittle, 1997;
Logsdon, 1998).
EVOLUTION ON A DESIGN-BY-CONTRACT
THEORY
The application of the design-by-contract methodology, by
viewing the exon as a module that interacts with its environment
through its interface, led to a series of logical steps explaining
the intron–exon structure of genes and intron phase differences.
It suggests that evolution behaved according to a design
pattern that separates functional modules from each other by
well-defined interfaces. The dependence of vital functions on
interfaces prevents changes in the interfaces and forces evolution
into an architecture that reflects design-by-contract 'rules'.
It also proposes that the major events leading to a diversification
of proteins were situated in an RNA world. The next
fundamental step in genome evolution, the transition from the
RNA world to the RNA/DNA world, can also be explained in
line with design-by-contract. In order to keep all the interfaces
that were created in the RNA world intact, the entire
RNA genome could have been copied verbatim into DNA.
REFERENCES
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A.
and Wheeler, D.L. (2002) GenBank. Nucleic Acids Res., 30, 17–20.
Cavalier-Smith, T. (1991) Intron phylogeny: a new hypothesis. Trends
Genet., 7, 145–148.
Cech, T.R. (1987) The chemistry of self-splicing RNA and RNA
enzymes. Science, 236, 1532–1539.
Cho, G. and Doolittle, R.F. (1997) Intron distribution in ancient
paralogs supports random insertion and not random loss. J. Mol. Evol.,
44, 573–584.
Clark, F. (2003) Gene data sets derived from GenBank.
Dibb, N.J. and Newman, A.J. (1989) Evidence that introns arose at
proto-splice sites. EMBO J., 8, 2015–2021.
Ekland, E.H., Szostak, J.W. and Bartel, D.P. (1995) Structurally
complex and highly active RNA ligases derived from random RNA
sequences. Science, 269, 364–370.
Fedorova, L. and Fedorov, A. (2003) Introns in gene evolution.
Genetica, 118, 123–131.
Gilbert, W. (1986) The RNA World. Nature, 319, 618.
Gilbert, W. (1987) The exon theory of genes. Cold Spring Harb. Symp.
Quant. Biol., 52, 901–905.
Gilbert, W., Marchionni, M. and McKnight, G. (1986) On the antiquity
of introns. Cell, 46, 151–153.
Gilbert, W., de Souza, S.J. and Long, M. (1997) Origin of genes.
Proc. Natl Acad. Sci., USA, 94, 7698–7703.
Gold, L. (1988) Posttranscriptional regulatory mechanisms in
Escherichia coli. Annu. Rev. Biochem., 57, 199–233.
Guerrier-Takada, C., Gardiner, K., Marsh, T., Pace, N. and Altman, S.
(1983) The RNA moiety of ribonuclease P is the catalytic subunit
of the enzyme. Cell, 35, 849–857.
Guerrier-Takada, C., Lumelsky, N. and Altman, S. (1989) Specific
interactions in RNA enzyme–substrate complexes. Science, 246,
1578–1584.
Hager, A.J., Pollard, J.D. and Szostak, J.W. (1996) Ribozymes:
aiming at RNA replication and protein synthesis. Chem. Biol., 3,
717–725.
Jaeger, L., Wright, M.C. and Joyce, G.F. (1999) A complex ligase
ribozyme evolved in vitro from a group I ribozyme domain. Proc.
Natl Acad. Sci., USA, 96, 4712–4717.
Joyce, G.F. (2002) The antiquity of RNA-based evolution. Nature,
418, 214–221.
Kolkman, J.A. and Stemmer, W.P. (2001) Directed evolution of
proteins by exon shuffling. Nat. Biotechnol., 19, 423–428.
Logsdon, J.M. (1998) The recent origins of spliceosomal introns
revisited. Curr. Opin. Genet. Dev., 8, 637–648.
Long, M., Rosenberg, C. and Gilbert, W. (1995) Intron phase
correlations and the evolution of the intron/exon structure of genes.
Proc. Natl Acad. Sci., USA, 92, 12495–12499.
Long, M., de Souza, S.J. and Gilbert, W. (1997) The yeast splice site
revisited: new exon consensus from genomic analysis. Cell, 12,
739–740.
Mattick, J.S. (1994) Introns: evolution and function. Curr. Opin.
Genet. Dev., 4, 823–831.
Mehdi, H., Ono, E. and Gupta, K.C. (1990) Initiation of translation at
CUG, GUG, and ACG codons in mammalian cells. Gene, 91,
173–178.
Meyer, B. (1997) Object-Oriented Software Construction, 2nd ed.
Prentice-Hall, NY.
Ohno, S. (1987) Early genes that were oligomeric repeats generated a
number of divergent domains on their own. Proc. Natl Acad. Sci.,
USA, 84, 6486–6490.
Palmer, J.D. and Logsdon, J.M. (1991) The recent origins of introns.
Curr. Opin. Genet. Dev., 1, 470–477.
Patthy, L. (2003) Modular assembly of genes and the evolution of
new functions. Genetica, 118, 217–231.
Peabody, D.S. (1989) Translation initiation at non-AUG triplets in
mammalian cells. J. Biol. Chem., 264, 5031–5035.
Rzhetsky, A. and Ayala, F.J. (1999) The enigma of intron origins.
Cell. Mol. Life Sci., 55, 3–6.
Sakharkar, M., Long, M., Tan, T.W., de Souza, S.J. (2000) ExInt: an
Exon/Intron database. Nucleic Acids Res., 28, 191–192.
Sadusky, T., Newman, A.J. and Dibb, N.J. (2004) Exon junction
sequences as cryptic splice sites: implications for intron origin.
Curr. Biol., 14, 505–509.
Sudhof, T.C., Goldstein, J.L., Brown, M.S. and Russell, D.W. (1985)
The LDL receptor gene: a mosaic of exons shared with different
proteins. Science, 228, 815–822.
Symons, R.H. (1992) Small catalytic RNAs. Annu. Rev. Biochem.,
61, 641–671.
Tomita, M., Shimizu, N. and Brutlag, D.L. (1996) Introns and
reading frames: correlation between splicing sites and their codon
positions. Mol. Biol. Evol., 13, 1219–1223.
Yoshida, H. (2001) The ribonuclease T1 family. Methods Enzymol.,
341, 28–41.
BIOINFORMATICS ORIGINAL PAPER
Vol. 21 no. 1 2005, pages 10–19
doi:10.1093/bioinformatics/bth466
Using amphiphilic pseudo amino acid
composition to predict enzyme
subfamily classes
Kuo-Chen Chou
Gordon Life Science Institute, San Diego, CA 92130, USA
Received on June 30, 2004; revised on July 20, 2004; accepted on August 2, 2004
Advance Access publication August 12, 2004
ABSTRACT
Motivation: With protein sequences entering into databanks
at an explosive pace, the early determination of the family or
subfamily class for a newly found enzyme molecule becomes
important because this is directly related to the detailed
information about which specific target it acts on, as well as to
its catalytic process and biological function. Unfortunately,
it is both time-consuming and costly to do so by experiments
alone. In a previous study, the covariant-discriminant
algorithm was introduced to identify the 16 subfamily classes
of oxidoreductases. Although the results were quite encouraging,
the entire prediction process was based on the amino
acid composition alone without including any sequence-order
information. Therefore, it is worthy of further investigation.
Results: To incorporate the sequence-order effects into the
predictor, the 'amphiphilic pseudo amino acid composition'
is introduced to represent the statistical sample of a protein.
The novel representation contains 20 + 2λ discrete numbers:
the first 20 numbers are the components of the conventional
amino acid composition; the next 2λ numbers are a set of
correlation factors that reflect different hydrophobicity and
hydrophilicity distribution patterns along a protein chain. Based
on such a concept and formulation scheme, a new predictor
is developed. It is shown by the self-consistency test, jackknife
test and independent dataset tests that the success rates
obtained by the new predictor are all significantly higher than
those by the previous predictors. The significant enhancement
in success rates also implies that the distribution of
hydrophobicity and hydrophilicity of the amino acid residues along
a protein chain plays a very important role in its structure and
function.
Contact: kchou@san.rr.com
1 INTRODUCTION
According to their EC (Enzyme Commission) numbers,
enzymes are mainly classified into six families (Webb, 1992):
(1) oxidoreductases, catalyzing oxidoreduction reactions; (2)
transferases, transferring a group from one compound to
another; (3) hydrolases, catalyzing the hydrolysis of various
bonds; (4) lyases, cleaving C−C, C−O, C−N and other
bonds by means other than hydrolysis or oxidation; (5)
isomerases, catalyzing geometrical or structural changes within
one molecule; and (6) ligases, catalyzing the joining together
of two molecules coupled with the hydrolysis of a
pyrophosphate bond in ATP or a similar triphosphate. Each of
these families has its own subfamilies, and sub-subfamilies.
For a newly found protein sequence, we are often challenged
by the following two questions: is the new protein
an enzyme or non-enzyme? If it is, to which enzyme family
class should it be attributed? Both questions are very basic
and essential because they are intimately related to the function
of the protein as well as its specificity and molecular
mechanism. Although the answers can be found through various
biochemical experiments, it is both time-consuming and
costly to rely completely on experiments, particularly because
the number of newly found protein sequences is now increasing
rapidly. For instance, the number of total sequence
entries in SWISS-PROT (Bairoch and Apweiler, 2000) was
only 3939 in 1986; recently, it was expanded to 153 325
(increasing by more than 38 times in less than two decades!)
according to Release 43.6 (June 21 2004) of SWISS-PROT
(http://www.expasy.org/sprot/relotes/relstat.html). With such
a sequence explosion, it has become vitally important to
develop an automated and fast method to help deal with the
above two fundamental problems. Actually, efforts have been
made in this regard, and the results in identifying the attribute
among the six main enzyme family classes as well as between
enzymes and non-enzymes are quite promising (Chou and Cai,
2004). Since each of the main enzyme families has its own
subfamilies, the next question is: for an enzyme with a given main
family class, can we predict which subfamily it belongs to?
This is indispensable if we wish to understand the molecular
mechanism of the enzyme at a deeper level. In a previous study
(Chou and Elrod, 2003), the covariant-discriminant predictor
was adopted to identify the 16 subfamilies of oxidoreductases.
However, in that study the entire approach was based on
the protein amino acid composition alone. According to
the classical definition, the amino acid composition of a
protein consists of 20 components representing the occurrence
frequencies of the 20 native amino acids in it. Obviously, if a
protein sample is represented by its amino acid composition
alone, all the details about its sequence order and sequence
length are totally lost. Therefore, although the results obtained
in that study (Chou and Elrod, 2003) are quite encouraging,
the methodology is very preliminary and certainly worthy of
further improvement. To include all the details of its sequence
order and length, the sample of a protein must be represented
by its entire sequence. Unfortunately, it is unfeasible to establish
a predictor with such a requirement, as exemplified below.
Bioinformatics vol. 21 issue 1 © Oxford University Press 2004; all rights reserved.
As mentioned above, the total number of sequence entries,
which together contain 56 402 618 amino acids, is 153 325 according to
Release 43.6 of SWISS-PROT. Hence the average protein
length is ∼368. The number of different combinations for a
protein of 368 residues will be
$20^{368} = 10^{368 \log 20} > 10^{478}$!
For such an astronomical number, it is impracticable to construct
a reasonable training dataset that can be used for a
meaningful statistical prediction based on the current protein
data. Besides, protein sequence lengths vary widely, which
poses an additional difficulty for including the sequence-order
information, in both the dataset construction and algorithm
formulation. Faced with such a dilemma, can we find a compromise
to partially incorporate the sequence-order effects?
This problem is addressed in the next section.
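The arithmetic behind the sequence-space estimate can be reproduced in a few lines. This is only a check of the numbers quoted in the text (entry and residue counts for SWISS-PROT Release 43.6); the variable names are illustrative.

```python
import math

total_residues = 56_402_618  # amino acids in Release 43.6, as quoted in the text
total_entries = 153_325      # sequence entries in the same release

avg_length = total_residues / total_entries
length = round(avg_length)

# Exponent x such that 20**length == 10**x
exponent = length * math.log10(20)

# Exact integer arithmetic as a cross-check: number of decimal digits of 20**length
digits = len(str(20 ** length))

print(f"average length = {avg_length:.2f}, rounded to {length}")
print(f"20**{length} = 10**{exponent:.2f}, a number with {digits} digits")
```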
2 THE AMPHIPHILIC PSEUDO AMINO ACID
COMPOSITION
The sample of a protein can be represented by two different
forms: one is the discrete form and the other is the sequential
form. In the discrete form, a protein is represented by a set of
discrete numbers or a multiple dimension vector. For example,
the amino acid composition is a typical discrete form that has
been widely used in predicting protein structural class (Bahar
et al., 1997; Cai et al., 2000; Chandonia and Karplus, 1995;
Chou and Zhang, 1993; Chou and Maggiora, 1998; Chou and
Zhang, 1994; Chou, 1989; Deleage and Roux, 1987; Klein,
1986; Klein and Delisi, 1986; Kneller et al., 1990; Metfessel
et al., 1993; Nakashima et al., 1986; Zhou, 1998; Zhou and
Assa-Munt, 2001) and subcellular localization (Cedano et al.,
1997; Chou, 2000; Chou and Elrod, 1999; Hua and Sun,
2001; Nakai, 2000; Nakai and Kanehisa, 1991; Nakashima
and Nishikawa, 1994; Zhou and Doctor, 2003). The advantage
of the discrete form is that it is easy to treat in statistical
prediction, but the disadvantage is that it is hard to directly
incorporate the sequence-order information (the amino acid
composition actually contains no sequence-order information
at all, as mentioned in the last section). In the sequential form,
a protein is represented by a series of amino acids according
to the order of their positions in the protein chain. Therefore,
the sequential form can naturally reflect all the information
about the sequence order and length of a protein. However,
when used in statistical treatment, it leads to the difficulty of
dealing with an almost infinite number of possible patterns,
as illustrated above.
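That the classical amino acid composition carries no sequence-order information can be illustrated with a minimal sketch. The peptide below is a hypothetical toy sequence, and the function name is an assumption made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical single-letter codes

def aa_composition(seq):
    """Occurrence frequencies of the 20 native amino acids (a 20-D discrete form)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]

seq_a = "MKVLAAGIVK"   # hypothetical toy peptide
seq_b = seq_a[::-1]    # same residues in reverse order

same = aa_composition(seq_a) == aa_composition(seq_b)
print(seq_a != seq_b, same)
```

The two sequences differ, yet they map to the same 20-component vector, which is the limitation the discrete form has to overcome.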
To solve such a dilemma, the crux is: can we develop a
different discrete form to represent a protein that will allow
accommodation of partial, if not all, sequence-order information?
Since a protein sequence is usually represented by a
series of amino acid codes, what kind of numerical values
should be assigned to these codes in order to optimally convert
the sequence-order information into a series of numbers
for the discrete form representation? Here, we introduce the
amphiphilic pseudo amino acid composition to tackle these
problems.
Suppose a protein P with a sequence of L amino acid
residues:
$$
R_1 R_2 R_3 R_4 R_5 R_6 R_7 \cdots R_L, \tag{1}
$$
where $R_1$ represents the residue at chain position 1, $R_2$ the
residue at position 2, and so forth. Because the hydrophobicity
and hydrophilicity of the constituent amino acids in a
protein play a very important role in its folding, its interaction
with the environment and other molecules, as well as its catalytic
mechanism, these two indices may be used to effectively
reflect the sequence-order effects. For example, many helices
in proteins are amphiphilic, that is, formed by the hydrophobic
and hydrophilic amino acids according to a special
order along the helix chain, as illustrated by the 'wenxiang'
diagram (Chou et al., 1997). Actually, different types of proteins
have different amphiphilic features, corresponding to
different hydrophobic and hydrophilic order patterns. In view
of this, the sequence-order information can be indirectly and
partially, but quite effectively, reflected through the following
equations (Fig. 1):
$$
\begin{cases}
\tau_1 = \dfrac{1}{L-1} \sum_{i=1}^{L-1} H^1_{i,i+1} \\
\tau_2 = \dfrac{1}{L-1} \sum_{i=1}^{L-1} H^2_{i,i+1} \\
\tau_3 = \dfrac{1}{L-2} \sum_{i=1}^{L-2} H^1_{i,i+2} \\
\tau_4 = \dfrac{1}{L-2} \sum_{i=1}^{L-2} H^2_{i,i+2} \\
\qquad \cdots \\
\tau_{2\lambda-1} = \dfrac{1}{L-\lambda} \sum_{i=1}^{L-\lambda} H^1_{i,i+\lambda} \\
\tau_{2\lambda} = \dfrac{1}{L-\lambda} \sum_{i=1}^{L-\lambda} H^2_{i,i+\lambda}
\end{cases}
, \quad \lambda < L, \tag{2}
$$
where $H^1_{i,j}$ and $H^2_{i,j}$ are the hydrophobicity and hydrophilicity
correlation functions given by
$$
\begin{aligned}
H^1_{i,j} &= h^1(R_i) \cdot h^1(R_j), \\
H^2_{i,j} &= h^2(R_i) \cdot h^2(R_j),
\end{aligned}
\tag{3}
$$
where $h^1(R_i)$ and $h^2(R_i)$ are, respectively, the hydrophobicity
and hydrophilicity values for the ith (i = 1, 2, ..., L) amino
acid in Equation (1), and the dot (·) means the multiplication
sign. In Equation (2), $\tau_1$ and $\tau_2$ are called the first-tier
correlation factors that reflect the sequence-order correlations
between all the most contiguous residues along a protein
chain through hydrophobicity and hydrophilicity, respectively
(Fig. 1, a1 and a2); $\tau_3$ and $\tau_4$ are the corresponding second-tier
correlation factors that reflect the sequence-order correlation
between all the second-most contiguous residues (Fig. 1, b1
and b2); and so forth.
Fig. 1. A schematic diagram to show (a1/a2) the first-rank, (b1/b2) the second-rank and (c1/c2) the third-rank sequence-order-coupling
mode along a protein sequence through a hydrophobicity/hydrophilicity correlation function, where $H^1_{i,j}$ and $H^2_{i,j}$ are given by Equation (3).
Panel (a1/a2) reflects the coupling mode between all the most contiguous residues, panel (b1/b2) that between all the second-most contiguous
residues and panel (c1/c2) that between all the third-most contiguous residues.
Note that before substituting the values
of hydrophobicity and hydrophilicity into Equation (3), they
were all subjected to a standard conversion as described by
the following equation:
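$$
\begin{aligned}
h^1(R_i) &= \frac{h^1_0(R_i) - \sum_{k=1}^{20} h^1_0(R_k)/20}
{\sqrt{\sum_{u=1}^{20} \left[ h^1_0(R_u) - \sum_{k=1}^{20} h^1_0(R_k)/20 \right]^2 \Big/ 20}}, \\
h^2(R_i) &= \frac{h^2_0(R_i) - \sum_{k=1}^{20} h^2_0(R_k)/20}
{\sqrt{\sum_{u=1}^{20} \left[ h^2_0(R_u) - \sum_{k=1}^{20} h^2_0(R_k)/20 \right]^2 \Big/ 20}},
\end{aligned}
\tag{4}
$$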
where we use the $R_i$ (i = 1, 2, ..., 20) to represent the 20
native amino acids according to the alphabetical order of their
single-letter codes: A, C, D, E, F, G, H, I, K, L, M, N, P,
Q, R, S, T, V, W and Y. The symbols $h^1_0$ and $h^2_0$ represent
the original hydrophobicity and hydrophilicity values of
the amino acid in the brackets right after the symbols, and
their values are taken from Tanford (1962) and Hopp and
Woods (1981), respectively. The converted hydrophobicity
and hydrophilicity values obtained using Equation (4) will
have a zero mean value over the 20 native amino acids, and
will remain unchanged if going through the same conversion
procedure again. As we can see from Equations (1)–(4) as well as Fig. 1, a
considerable amount of sequence-order information has been
incorporated into the 2λ correlation factors through the
hydrophobic and hydrophilic values of the amino acid residues along
a protein chain. By merging the 2λ amphiphilic correlation
factors into the classical amino acid composition, we obtain
an augmented discrete form to represent a protein sample as
follows:
$$
P =
\begin{bmatrix}
p_1 \\ \vdots \\ p_{20} \\ p_{20+1} \\ \vdots \\ p_{20+\lambda} \\ p_{20+\lambda+1} \\ \vdots \\ p_{20+2\lambda}
\end{bmatrix},
\tag{5}
$$
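where
$$
p_u =
\begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{2\lambda} \tau_j}, & 1 \le u \le 20, \\[2ex]
\dfrac{w \tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{2\lambda} \tau_j}, & 20 + 1 \le u \le 20 + 2\lambda,
\end{cases}
\tag{6}
$$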
where $f_i$ (i = 1, 2, ..., 20) are the normalized occurrence
frequencies of the 20 native amino acids in the protein P, $\tau_j$
the j-tier sequence-correlation factor computed according to
Equation (2), and w the weight factor. In the current study,
we chose w = 0.5 to keep the results of Equation (6) within
a range that is easy to handle (w can of course be assigned
other values, but this would not make a significant difference
to the final results). Therefore, the first 20 numbers in
Equation (5) represent the classic amino acid composition, and
the next 2λ discrete numbers reflect the amphiphilic sequence
correlation along a protein chain. Such a protein representation
is called 'amphiphilic pseudo amino acid composition',
abbreviated as Am-Pse-AA composition: it has the same
form as the amino acid composition, but contains much more
information that is related to the sequence order of a protein
and the distribution of the hydrophobic and hydrophilic amino
acids along its chain. It should be pointed out that, according
to the definition of the classical amino acid composition, all its
components must be ≥0; this is not always true, however, for the
pseudo amino acid composition (Chou, 2001): the components
corresponding to the sequence correlation factors may
also be <0, as further discussed later.
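The construction in Equations (2)–(6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the two raw scales are hypothetical placeholders (they are not the Tanford or Hopp and Woods values, which would have to be supplied), the test sequence is made up, and the correlation factors are stored in the interleaved order of Equation (2).

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def standardize(raw):
    """Equation (4): subtract the mean over the 20 amino acids and divide by
    the standard deviation taken over the same 20 values."""
    mean = sum(raw[a] for a in AMINO_ACIDS) / 20
    sd = math.sqrt(sum((raw[a] - mean) ** 2 for a in AMINO_ACIDS) / 20)
    return {a: (raw[a] - mean) / sd for a in AMINO_ACIDS}

def correlation_factors(seq, h1, h2, lam):
    """Equations (2)-(3): tau_1 .. tau_{2*lam}, two factors (h1, h2) per tier."""
    L = len(seq)
    if not lam < L:
        raise ValueError("lambda must be smaller than the sequence length")
    taus = []
    for k in range(1, lam + 1):
        for h in (h1, h2):
            total = sum(h[seq[i]] * h[seq[i + k]] for i in range(L - k))
            taus.append(total / (L - k))
    return taus

def am_pse_aa(seq, h1, h2, lam, w=0.5):
    """Equations (5)-(6): a vector of 20 + 2*lam components that sum to 1."""
    freqs = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    taus = correlation_factors(seq, h1, h2, lam)
    denom = sum(freqs) + w * sum(taus)
    return [f / denom for f in freqs] + [w * t / denom for t in taus]

# Hypothetical placeholder scales, NOT the published hydrophobicity/hydrophilicity values.
raw_h1 = {a: float(i) for i, a in enumerate(AMINO_ACIDS)}
raw_h2 = {a: float((7 * i) % 20) for i, a in enumerate(AMINO_ACIDS)}
h1, h2 = standardize(raw_h1), standardize(raw_h2)

vec = am_pse_aa("MKVLAAGIVKDEFHNPQRSTWYC", h1, h2, lam=3)
print(len(vec), round(sum(vec), 12))
```

Because every component shares the same denominator, the vector is normalized to 1 by construction, while individual correlation components may be negative.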
3 AUGMENTED COVARIANT-DISCRIMINANT
PREDICTOR
Since the Am-Pse-AA composition [Equation (5)] has the same
mathematical frame as the amino acid composition except
that it contains more components, all the existing predictors
developed based on the classical amino acid composition can
be straightforwardly extended to cover the Am-Pse-AA
composition. For the reader's convenience, a brief description of
how to augment the covariant-discriminant predictor for the
Am-Pse-AA composition is given below. The details about
the algorithm and its development can be found in a series of
earlier papers (Chou and Zhang, 1995; Chou, 2001; Chou and
Elrod, 1999; Chou and Zhang, 1994; Liu and Chou, 1998;
Zhou, 1998; Zhou and Doctor, 2003). According to the
Am-Pse-AA composition [Equation (5)], the k-th enzyme in the
class m can be represented by a (20 + 2λ)D (dimension) vector
as follows:
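$$
P^m_k =
\begin{bmatrix}
p^m_{k,1} \\ \vdots \\ p^m_{k,20} \\ p^m_{k,20+1} \\ \vdots \\ p^m_{k,20+\lambda} \\ p^m_{k,20+\lambda+1} \\ \vdots \\ p^m_{k,20+2\lambda}
\end{bmatrix},
\quad k = 1, 2, \ldots, n_m; \; m = 1, 2, \ldots, M,
\tag{7}
$$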
where $p^m_{k,1}, p^m_{k,2}, \ldots, p^m_{k,20}$ are the amino acid compositions
for the k-th enzyme of class m, $p^m_{k,20+1}, p^m_{k,20+2}, \ldots, p^m_{k,20+2\lambda}$
the sequence correlation factors of the same enzyme
that can be easily calculated by Equations (2)–(6) according to
its amino acid sequence, and $n_m$ the total number of enzymes
in class m. The standard vector for class m is defined by Chou
and Zhang (1995) as follows:
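$$
\bar{P}^m =
\begin{bmatrix}
\bar{p}^m_1 \\ \vdots \\ \bar{p}^m_{20} \\ \bar{p}^m_{20+1} \\ \vdots \\ \bar{p}^m_{20+\lambda} \\ \bar{p}^m_{20+\lambda+1} \\ \vdots \\ \bar{p}^m_{20+2\lambda}
\end{bmatrix},
\quad m = 1, 2, \ldots, M,
\tag{8}
$$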
where
$$
\bar{p}^m_i = \frac{1}{n_m} \sum_{k=1}^{n_m} p^m_{k,i}, \quad i = 1, 2, \ldots, 20 + 2\lambda. \tag{9}
$$
Suppose P is a query enzyme whose subfamily is to be
identified. It is also represented by a point or vector in the
(20 + 2λ)D space as shown in Equation (5). The difference between
the query enzyme P and the norm of class m is measured by
the following covariant discriminant function:
$$
F(P, \bar{P}^m) = D^2_M(P, \bar{P}^m) + \ln |S_m|, \quad m = 1, 2, \ldots, M, \tag{10}
$$
where
$$
D^2_M(P, \bar{P}^m) = (P - \bar{P}^m)^T S_m^{-1} (P - \bar{P}^m) \tag{11}
$$
is the squared Mahalanobis distance (Chou and Zhang, 1995;
Mahalanobis, 1936; Pillai, 1985), T is the transposition operator,
while $|S_m|$ and $S_m^{-1}$ are the determinant and inverse
matrix, respectively, of $S_m$. The latter is the covariance matrix
for class m and defined by
$$
S_m =
\begin{bmatrix}
s^m_{1,1} & s^m_{1,2} & \cdots & s^m_{1,20+2\lambda} \\
s^m_{2,1} & s^m_{2,2} & \cdots & s^m_{2,20+2\lambda} \\
\vdots & \vdots & \ddots & \vdots \\
s^m_{20+2\lambda,1} & s^m_{20+2\lambda,2} & \cdots & s^m_{20+2\lambda,20+2\lambda}
\end{bmatrix},
\tag{12}
$$
where the matrix elements are given by
$$
s^m_{i,j} = \frac{1}{n_m - 1} \sum_{k=1}^{n_m} \left[ p^m_{k,i} - \bar{p}^m_i \right] \left[ p^m_{k,j} - \bar{p}^m_j \right],
\quad i, j = 1, 2, \ldots, 20 + 2\lambda. \tag{13}
$$
According to the principle of similarity, the smaller the difference
between the query enzyme P and the norm of class
m, the higher the probability that enzyme P belongs to class
m. Accordingly, the identification rule can be formulated
as follows:
$$
F(P, \bar{P}^{\mu}) = \mathrm{Min}\{ F(P, \bar{P}^1), F(P, \bar{P}^2), \ldots, F(P, \bar{P}^M) \}, \tag{14}
$$
where µ can be 1, 2, 3, ..., or M, and the operator Min means
taking the minimal one among those in the brackets. The value
of the superscript µ derived from Equation (14) indicates to
which class the query enzyme P belongs. If there is a tie case,
µ is not uniquely determined, but that did not happen for the
datasets studied here.
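The identification rule of Equations (8)–(14) can be sketched in plain Python as follows. This is an illustrative sketch only: the two-dimensional training data are hypothetical, the helper names are assumptions, and a small Gaussian-elimination routine stands in for a linear-algebra library. It expects a non-singular covariance matrix, so with real Am-Pse-AA vectors one component would have to be dropped first, as discussed in the text that follows.

```python
import math

def mean_vector(samples):
    """Equation (9): component-wise mean of the training vectors of one class."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def covariance(samples, mean):
    """Equation (13): covariance matrix with the n_m - 1 denominator."""
    n, d = len(samples), len(mean)
    return [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
             for j in range(d)] for i in range(d)]

def solve_and_logdet(S, b):
    """Gaussian elimination with partial pivoting; returns (S^-1 b, ln|det S|)."""
    d = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(S)]
    logdet = 0.0
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(A[r][c]))
        if abs(A[p][c]) < 1e-12:
            raise ValueError("covariance matrix is singular; drop one component first")
        A[c], A[p] = A[p], A[c]
        logdet += math.log(abs(A[c][c]))
        for r in range(c + 1, d):
            f = A[r][c] / A[c][c]
            for k in range(c, d + 1):
                A[r][k] -= f * A[c][k]
    x = [0.0] * d
    for r in range(d - 1, -1, -1):
        x[r] = (A[r][d] - sum(A[r][k] * x[k] for k in range(r + 1, d))) / A[r][r]
    return x, logdet

def discriminant(query, samples):
    """Equations (10)-(11): squared Mahalanobis distance plus ln|S_m|."""
    mean = mean_vector(samples)
    diff = [q - m for q, m in zip(query, mean)]
    y, logdet = solve_and_logdet(covariance(samples, mean), diff)
    return sum(a * b for a, b in zip(diff, y)) + logdet

def predict(query, classes):
    """Equation (14): the class whose discriminant value is smallest."""
    return min(classes, key=lambda m: discriminant(query, classes[m]))

# Hypothetical 2-D toy data standing in for Am-Pse-AA vectors.
classes = {
    "class_1": [[0.0, 0.0], [1.0, 0.2], [0.2, 1.0], [1.1, 1.2]],
    "class_2": [[5.0, 5.0], [6.0, 5.3], [5.2, 6.1], [6.2, 6.0]],
}
print(predict([0.5, 0.6], classes), predict([5.5, 5.4], classes))
```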
Before using the above equations for practical calculations,
we would like to draw attention to the following two points.
First, owing to the normalization condition [Equation (6)]
imposed on the Am-Pse-AA composition, of the 20 + 2λ
components in Equation (8), only 20 + 2λ − 1 are independent
(Chou and Zhang, 1995), and hence the covariance matrix
$S_m$ as defined by Equation (12) must be a singular one (Chou
and Zhang, 1994). This implies that the Mahalanobis distance
defined by Equation (11) and the covariant discriminant
function by Equation (10) would be divergent and meaningless.
To overcome such a difficulty, the dimension-reducing
procedure (Chou and Zhang, 1995) was adopted in practical
calculations; i.e. instead of the (20 + 2λ)D space, an
enzyme is defined in a (20 + 2λ − 1)D space by leaving out
one of its 20 + 2λ components. The remaining
20 + 2λ − 1 components would be completely independent
and hence the corresponding covariance matrix $S_m$ would
no longer be singular. In such a (20 + 2λ − 1)D space, the
Mahalanobis distance [Equation (11)] and the covariant
discriminant function [Equation (10)] can be well defined without
the divergence difficulty. However, which one of the 20 + 2λ
components can be left out? Any one. Will it lead to a different
predicted result by leaving out a different component?
No. According to the invariance theorem given in Appendix
A of Chou and Zhang (1995), the value of the Mahalanobis
distance as well as the value of the determinant of $S_m$ will
remain exactly the same regardless of which one of the 20 + 2λ
components is left out. Therefore, the value of the covariant
discriminant function [Equation (10)] can be uniquely defined
through such a dimension-reducing procedure.
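The singularity mentioned in the first point follows directly from the normalization in Equation (6). A short derivation, written with $\bar{p}^m_i$ for the class mean of Equation (9) and assuming the correlation components are $w\tau_j$ divided by the common denominator, is:

```latex
% Every Am-Pse-AA vector sums to one:
\sum_{u=1}^{20+2\lambda} p_u
  = \frac{\sum_{u=1}^{20} f_u + w \sum_{j=1}^{2\lambda} \tau_j}
         {\sum_{i=1}^{20} f_i + w \sum_{j=1}^{2\lambda} \tau_j}
  = 1 .
% Hence the deviations of any training sample from the class mean sum to zero:
\sum_{i=1}^{20+2\lambda} \left( p^m_{k,i} - \bar{p}^m_i \right) = 1 - 1 = 0 .
% Summing the covariance elements of Equation (13) over i therefore gives, for every j,
\sum_{i=1}^{20+2\lambda} s^m_{i,j}
  = \frac{1}{n_m - 1} \sum_{k=1}^{n_m}
    \Bigl[ \sum_{i=1}^{20+2\lambda} \left( p^m_{k,i} - \bar{p}^m_i \right) \Bigr]
    \left( p^m_{k,j} - \bar{p}^m_j \right)
  = 0 ,
% so the rows of S_m are linearly dependent and |S_m| = 0.
```

Leaving out one component removes this single linear dependence, which is why the reduced (20 + 2λ − 1)D covariance matrix can be inverted.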
Second, as mentioned in the last section, the components in
the Am-Pse-AA composition may be <0. Will the determinant
of $S_m$ always be >0 so as to make the term $\ln |S_m|$ in
Equation (10) always meaningful? The answer is yes if $S_m$ is
non-singular. The mathematical proof regarding this is given
in Appendix A. If $S_m$ is singular, we can always use the above
dimension-reducing procedure to redefine $S_m$ and make it
non-singular. Therefore, the determinant of $S_m$ as defined in the
(20 + 2λ − 1)D space is actually always >0.
4 RESULTS AND DISCUSSION
To demonstrate the improvement of prediction quality by
introducing the Am-Pse-AA composition, tests were conducted
on the same training dataset as used by the previous
investigators (Chou and Elrod, 2003). The dataset contains
2640 oxidoreductases, of which 314 are of subfamily 1; 216
of subfamily 2; 194 of subfamily 3; 130 of subfamily 4; 112
of subfamily 5; 305 of subfamily 6; 64 of subfamily 7; 59 of
subfamily 8; 254 of subfamily 9; 94 of subfamily 10; 154 of
subfamily 11; 94 of subfamily 12; 257 of subfamily 13; 155
of subfamily 14; 84 of subfamily 15; and 154 of subfamily
16. As shown in Figure 2, each of these 16 subfamilies acts
on a different target. The accession numbers of the 2640
oxidoreductases can be found in Table 1 of the earlier paper
(Chou and Elrod, 2003).
Furthermore, as a showcase for practical application,
an independent dataset was constructed that contains 2124
oxidoreductases, of which 626 are of subfamily 1; 216 of
subfamily 2; 25 of subfamily 3; 17 of subfamily 4; 14 of subfamily
5; 608 of subfamily 6; 7 of subfamily 7; 6 of subfamily 8; 253
of subfamily 9; 12 of subfamily 10; 20 of subfamily 11; 12 of
subfamily 12; 257 of subfamily 13; 20 of subfamily 14; 11 of
subfamily 15; and 20 of subfamily 16. The accession numbers
of the 2124 oxidoreductases are given in Online Supplementary
Materials A. None of the 2124 entries in the independent
dataset occurs in the aforementioned training dataset of the
2640 entries.
As we see from Equations (2)–(7) as well as Figure 1, the
greater the number λ, the more sequence-order effect
is incorporated. Accordingly, with an increase in λ, the rate of
correct prediction by the self-consistency test will generally
be enhanced. Note that λ does have an upper
limit; i.e. it must be smaller than the number of amino acid
residues of the shortest protein chain in the dataset studied
[Fig. 1 and Equation (2)]. Besides, owing to the information
loss during the jackknifing process, the success rate by the
jackknife test does not always monotonically increase with
λ. Since jackknife tests are deemed to be one of the most
rigorous and objective methods for cross-validation in statistics
(Mardia et al., 1979; Chou and Zhang, 1995), the optimal
value for λ should be the one that yields the highest overall
success rate by jackknifing the training dataset. For the current
study, it was found that the optimal value for λ is 9.
Fig. 2. A schematic diagram to show the 16 classes of oxidoreductases classified according to different groups acted on by the enzyme.
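The selection of λ by jackknifing can be sketched generically. In this hedged illustration, a nearest-centroid classifier stands in for the covariant-discriminant predictor, the feature vectors are hypothetical toy data, and `featurize(sample, lam)` is an assumed placeholder for whatever builds the (20 + 2λ)-component representation.

```python
def jackknife_rate(samples, labels, train, predict):
    """Leave each sample out in turn, train on the rest, score the held-out one."""
    hits = 0
    for i in range(len(samples)):
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        model = train(rest_x, rest_y)
        hits += predict(model, samples[i]) == labels[i]
    return hits / len(samples)

# Stand-in classifier (nearest class centroid), not the covariant-discriminant predictor.
def train_centroids(xs, ys):
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: [sum(col) / len(g) for col in zip(*g)] for y, g in groups.items()}

def predict_centroid(model, x):
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))

def best_lambda(featurize, raw_samples, labels, lambdas):
    """Return the lambda with the highest jackknife success rate, plus all rates."""
    rates = {}
    for lam in lambdas:
        feats = [featurize(s, lam) for s in raw_samples]
        rates[lam] = jackknife_rate(feats, labels, train_centroids, predict_centroid)
    return max(rates, key=rates.get), rates

# Hypothetical toy data: only the second feature separates the two classes.
vectors = [[0.5, 0.0], [0.4, 0.1], [0.6, 0.2], [0.5, 1.0], [0.4, 1.1], [0.6, 0.9]]
labels = ["a", "a", "a", "b", "b", "b"]
best, rates = best_lambda(lambda v, lam: v[:lam], vectors, labels, [1, 2])
print(best, rates)
```

Each held-out sample is scored by a model that never saw it, which is why the jackknife rate can fall as more components are added even when the self-consistency rate keeps rising.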
The results obtained by the self-consistency test, jackknife test
and independent dataset test are given in Tables 1, 2 and 3,
respectively. To facilitate comparison, the results from the simple
geometry predictor (Nakashima et al., 1986) and the covariant
predictor (Chou and Elrod, 2003) are also listed. Both were
performed based on the amino acid composition alone. From these
tables, we can see the following. (1) The overall success rates
obtained by the current approach, i.e. a combination of the
Am-Pse-AA composition and the augmented covariant-discriminant
algorithm, are remarkably higher than those by the other approaches.
(2) The success rates by the jackknife test are lower than those by
the self-consistency test. The decrement is more pronounced for
small subsets, such as subfamily classes 7 and 8. This is because
the cluster-tolerant capacity (Chou, 1999) for small subsets is
usually low, and hence the information loss resulting from
jackknifing has a greater impact on the small subsets than on the
large ones. Nevertheless, the overall jackknife rate by the current
approach is still >70%. It is expected that the success rate for
identifying the enzyme subfamilies can be further enhanced by
enlarging the small training subsets with new proteins that have
been found to belong to the categories defined by these subsets.
(3) The overall success rate obtained by the current approach in the
independent dataset test is 76.55%, which is lower than that of the
self-consistency test but higher than that of the jackknife test,
implying that, of the three test methods, the jackknife test is the
most rigorous and objective in reflecting the real power of a
predictor.
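As a worked check of how the overall rates in the tables follow from the per-class entries, the sketch below recomputes the overall jackknife rate of the current approach from the 'Number of samples' and 'This paper' columns of Table 2. The variable names are illustrative; the numbers are copied from the table.

```python
# Per-class sample counts and jackknife success rates (%) from Table 2,
# "This paper" column, for subfamily classes 1 to 16.
samples = [314, 216, 194, 130, 112, 305, 64, 59,
           254, 94, 154, 94, 257, 155, 84, 154]
jackknife_pct = [72.61, 66.20, 65.46, 62.31, 47.32, 77.70, 45.31, 23.73,
                 82.28, 63.83, 81.17, 51.06, 78.99, 92.90, 59.52, 73.38]

# Recover the number of correctly predicted proteins in each class.
correct = [round(n * p / 100) for n, p in zip(samples, jackknife_pct)]

total_samples = sum(samples)
total_correct = sum(correct)
overall = 100 * total_correct / total_samples

print(f"{total_correct}/{total_samples} = {overall:.2f}%")  # 1864/2640 = 70.61%
```

Because the overall rate is weighted by class size, the low rates of the small classes 7 and 8 (64 and 59 samples) pull the overall figure down less than an unweighted average of the 16 per-class rates would.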
5 CONCLUSION
The classes of newly found enzyme sequences are usually determined
either by biochemical analysis of eukaryotic and prokaryotic genomes
or by microarray chips. These experimental methods are both
time-consuming and costly. With the explosion of protein entries in
databanks, we are challenged to develop an automated method to
quickly and accurately determine the enzymatic attribute for a newly
found protein sequence: is it an enzyme or a non-enzyme?
Table 1. The success rates in identifying the 16 subfamilies of oxidoreductases with different methods by the self-consistency test
Subfamily class   Number of     Least Euclidean predictor      Covariant-discriminant predictor   This paper (b)
(Fig. 2)          samples (a)   (Nakashima et al., 1986) (%)   (Chou and Elrod, 2003) (%)         (%)
1                 314           26.75                          58.92                              89.49
2                 216           50.93                          64.81                              87.96
3                 194           24.23                          55.67                              85.57
4                 130           16.92                          69.23                              93.08
5                 112           12.50                          65.18                              83.04
6                 305           71.80                          72.79                              85.57
7                 64            29.69                          85.94                              96.88
8                 59            23.79                          96.61                              100
9                 254           70.47                          89.37                              93.70
10                94            42.55                          67.02                              95.74
11                154           51.95                          87.66                              96.75
12                94            20.21                          80.85                              97.87
13                257           70.43                          79.38                              96.50
14                155           74.19                          97.42                              100
15                84            79.76                          96.43                              100
16                154           44.81                          81.17                              93.51
Overall           2640          1279/2640 = 48.45%             1992/2640 = 75.45%                 2433/2640 = 92.16%
(a) Data taken from Table 1 of Chou and Elrod (2003).
(b) Performed using the augmented covariant-discriminant predictor and the Am-Pse-AA composition with λ = 9 and w = 0.5 [Equations (5) and (6)].
Table 2. The success rates in identifying the 16 subfamilies of oxidoreductases with different methods by the jackknife test
Subfamily class   Number of     Least Euclidean predictor      Covariant-discriminant predictor   This paper (b)
(Fig. 2)          samples (a)   (Nakashima et al., 1986) (%)   (Chou and Elrod, 2003) (%)         (%)
1                 314           25.48                          47.77                              72.61
2                 216           49.54                          54.17                              66.20
3                 194           22.68                          42.78                              65.46
4                 130           13.85                          52.31                              62.31
5                 112           8.93                           44.64                              47.32
6                 305           71.48                          72.13                              77.70
7                 64            23.44                          46.88                              45.31
8                 59            16.95                          52.54                              23.73
9                 254           70.08                          84.65                              82.28
10                94            39.36                          54.26                              63.83
11                154           51.95                          78.57                              81.17
12                94            17.02                          52.13                              51.06
13                257           69.65                          74.71                              78.99
14                155           71.61                          93.55                              92.90
15                84            77.38                          70.24                              59.52
16                154           44.81                          64.29                              73.38
Overall           2640          1237/2640 = 46.86%             1680/2640 = 63.64%                 1864/2640 = 70.61%
(a) Data taken from Table 1 of Chou and Elrod (2003).
(b) Performed using the augmented covariant-discriminant predictor and the Am-Pse-AA composition with λ = 9 and w = 0.5 [Equations (5) and (6)].
If it is, to which enzyme family and subfamily class does it belong?
The answers to these questions are important because they may help
deduce the mechanism and specificity of the query protein, providing
clues to the relevant biological function. Although it is an
extremely complicated problem and might involve the knowledge of
three-dimensional structure as well as many other physical chemistry
factors, some quite encouraging results have been obtained by a
bioinformatical method established on the basis of amino acid
composition alone (Chou and Elrod, 2003). Since the amino acid
composition of a protein does not contain any of its sequence-order
information, a logical step to further improve the method is to
incorporate the sequence-order information into the predictor. To
realize this, the most
Table 3. The success rates in identifying the 16 subfamilies of oxidoreductases (a) by various methods on an independent dataset given in Online Supplementary Materials A
Subfamily class   Number of     Least Euclidean predictor      Covariant-discriminant predictor   This paper (c)
(Fig. 2)          samples (b)   (Nakashima et al., 1986) (%)   (Chou and Elrod, 2003) (%)         (%)
1                 626           26.68                          49.68                              73.00
2                 216           47.22                          57.41                              70.37
3                 25            36.00                          48.00                              56.00
4                 17            17.65                          52.94                              70.59
5                 14            7.14                           50.00                              50.00
6                 608           71.38                          72.37                              77.30
7                 7             28.57                          57.14                              42.86
8                 6             33.33                          50.00                              50.00
9                 253           73.91                          86.17                              84.58
10                12            41.67                          58.33                              83.33
11                20            50.00                          75.00                              90.00
12                12            25.00                          75.00                              66.67
13                257           68.87                          77.82                              84.05
14                20            70.00                          90.00                              95.00
15                11            72.73                          81.82                              63.64
16                20            50.00                          60.00                              80.00
Overall           2124          1134/2124 = 53.39%             1398/2124 = 65.82%                 1626/2124 = 76.55%
(a) Conducted by the rule parameters derived from the training dataset given in Table 1 of Chou and Elrod (2003).
(b) Data taken from Online Supplementary Materials A.
(c) Performed using the augmented covariant-discriminant predictor and the Am-Pse-AA composition with λ = 9 and w = 0.5 [Equations (5) and (6)].
straightforward way is to represent the sample of a protein by its
entire sequence, the so-called sequential form. However, this leads
to the difficulty of an infinite number of sample patterns.
Accordingly, to formulate a feasible predictor, the sample of a
protein must be represented by a set of discrete numbers, the
so-called discrete form. One feasible compromise that takes care of
both aspects is to represent the sample of a protein by the
'amphiphilic pseudo amino acid composition', which contains 20 + 2λ
discrete numbers: the first 20 numbers are the components of the
conventional amino acid composition; the next 2λ numbers are a set
of sequence correlation factors with different ranks of coupling
through the hydrophobicity and hydrophilicity of the constituent
amino acids along the sequence of a protein. For different training
datasets, λ has different optimal values. For the current training
dataset, the optimal value for λ is 9, meaning that the
sequence-order information is converted into the discrete form
through the first-order correlation factor, the second-order
correlation factor and so on up to the ninth-order correlation
factor, in terms of both hydrophobicity and hydrophilicity of the
constituent amino acids along a protein chain. Based on such a
representation scheme, the covariant discriminant algorithm is
augmented to take into account partial, if not all, sequence-order
effects. The predictor thus developed is remarkably superior to
those based on the amino acid composition alone, as reflected by the
success rates in identifying the 16 subfamily classes of
oxidoreductases through the self-consistency test, jackknife test
and independent dataset test.
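To make the 20 + 2λ representation concrete, the following is a minimal sketch of how such a feature vector might be assembled. It rests on several assumptions that are not spelled out in this section: that each property scale is standardized to zero mean and unit standard deviation over the 20 amino acids, that the k-th rank correlation factor is the average product of the standardized values of residues k positions apart, and that all components share one normalizing denominator with weight w. The function names and the two toy property scales are hypothetical placeholders, not the hydrophobicity and hydrophilicity values used in the study.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def standardize(scale):
    """Rescale a 20-residue property table to zero mean and unit standard deviation."""
    values = [scale[aa] for aa in AMINO_ACIDS]
    mu, sigma = mean(values), pstdev(values)
    return {aa: (scale[aa] - mu) / sigma for aa in AMINO_ACIDS}


def am_pse_aa(sequence, hydrophobicity, hydrophilicity, lam=9, w=0.5):
    """Return 20 composition terms followed by 2*lam weighted correlation terms."""
    n = len(sequence)
    if lam >= n:
        raise ValueError("lam must be smaller than the sequence length")
    h1, h2 = standardize(hydrophobicity), standardize(hydrophilicity)
    freqs = [sequence.count(aa) / n for aa in AMINO_ACIDS]
    taus = []
    for k in range(1, lam + 1):
        # One hydrophobicity term and one hydrophilicity term per rank k.
        for h in (h1, h2):
            products = [h[sequence[i]] * h[sequence[i + k]] for i in range(n - k)]
            taus.append(mean(products))
    denom = sum(freqs) + w * sum(taus)
    return [f / denom for f in freqs] + [w * t / denom for t in taus]


# Placeholder scales for illustration only (NOT real physicochemical values).
toy_hydrophobicity = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}
toy_hydrophilicity = {aa: float((7 * i) % 20) for i, aa in enumerate(AMINO_ACIDS)}

vec = am_pse_aa("MKTAYIAKQRQISFVKSHFSRQ", toy_hydrophobicity, toy_hydrophilicity)
print(len(vec))  # 38, i.e. 20 + 2 * 9
```

The length check mirrors the upper limit on λ: a chain with N residues supports correlation ranks only up to N − 1, so the shortest chain in a dataset bounds the usable λ.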
Meanwhile, the results of the present study also imply that the
arrangement of hydrophobicity and hydrophilicity of the amino acid
residues along a protein chain plays a very important role in its
folding, as well as its interaction with other molecules and
catalytic mechanisms, and that different types of proteins will have
different amphiphilic features, corresponding to different
hydrophobic and hydrophilic sequence-order patterns.
ACKNOWLEDGEMENT
The author wishes to thank the four anonymous reviewers whose
comments were very helpful in strengthening the presentation of this
study.
SUPPLEMENTARY DATA
Supplementary data for this paper are available on
Bioinformatics online.
REFERENCES
Bahar, I., Atilgan, A.R., Jernigan, R.L. and Erman, B. (1997)
Understanding the recognition of protein structural classes by amino
acid composition. Proteins, 29, 172–185.
Bairoch, A. and Apweiler, R. (2000) The SWISS-PROT protein sequence
data bank and its supplement TrEMBL. Nucleic Acids Res., 25, 31–36.
Cai, Y.D., Li, Y.X. and Chou, K.C. (2000) Using neural networks for
prediction of domain structural classes. Biochim. Biophys. Acta,
1476, 1–2.
Cedano, J., Aloy, P., Pérez-Pons, J.A. and Querol, E. (1997) Relation
between amino acid composition and cellular location of proteins.