BIOINFORMATICS
Vol. 18 Suppl. 2 2002
Pages S65–S73
Theoretical analysis of alternative splice forms
using computational methods
Stéphanie Boué 1, Martin Vingron 1, Evgenia Kriventseva 2 and Ina Koch 1,3
1 Max-Planck-Institute for Molecular Genetics, Department Computational Molecular Biology, Ihnestr. 73, Berlin, 14195, Germany, 2 EMBL Outstation - Hinxton, European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK and 3 University of Applied Sciences Berlin - TFH, Computational Biology, Seestr. 64, Berlin, 13347, Germany
ABSTRACT
Understanding alternative splicing is one of the greatest challenges in biology today, because this genetic process has proved far more important than was thought at the time of its discovery. In this paper, we explain how the available databases and software tools can be used to start a large-scale investigation of alternative splice forms. To collect information about alternative splicing, we investigated known data in the databases using different computational methods. The investigations proceeded from the genomic sequence data to structural protein data. We then interpreted those data to find the relationship between alternative splice forms and protein function and structure. We found some interesting features of alternative splicing, which are presented here. We discuss the results for one chosen example. They concern the coverage quality of the protein sequence of a known structure, an EST analysis, the validation of splice variants, the determination of the alternative splice type, and finally the link between alternative splicing and disease.
Contact: ina.koch@molgen.mpg.de
INTRODUCTION
The story of alternative splicing began when Sambrook (1977) discovered the mosaic structure of genes in the adenovirus. After that, Gilbert (1978) called the non-coding and coding parts of genes introns and exons, respectively. Moreover, he introduced the idea that a mechanism enables exons to join in different combinations, giving rise to distinct transcripts. This mechanism was called alternative splicing, abbreviated here as AS.
For a long time it was assumed that AS was an exceptional event and that, in most cases, the exons of a gene are assembled in a single, unique combination. In the past years, numerous surveys, especially bioinformatics studies (see Table 1), have overturned the paradigm that one gene gives rise to one protein. The estimated frequency of AS in human genes increased dramatically from 5% (Sharp, 1994) to 59% (Consortium, 2001). AS is now known to be the rule and not the exception. Moreover, AS was found to occur in many eukaryotic species with similar frequency (Brett et al., 2002). These facts suggest the strong biological significance of AS.
AS represents one of the major events that lead to protein diversity. It is in fact so powerful that it can generate more transcripts from a single gene than the number of genes in the entire human genome (<32 000; Consortium, 2001). For example, the Drosophila gene dscam, which encodes the Down syndrome cell adhesion molecule, is a source of huge diversity. This axon guidance receptor plays a key role during development, especially for the migration and connection of neurons. The dscam pre-mRNA can give rise to more than 38 000 different mRNA isoforms. Each of them has the ability to form specific interactions that direct the growing axon to its proper location (Graveley, 2001). AS thus plays a fundamental role in protein specificity. A still open question is the regulation of AS. Physiological processes need strong and fine regulation with respect to both time point and tissue type. If a transcript is expressed at the wrong time point or in the wrong tissue, this can lead to major diseases, such as Alzheimer's disease or cancer.
The majority of bioinformatics studies so far have focused on the discovery of AS variants. However, in order to fully understand the consequences of AS, it is important to go further, to the proteome level. It is necessary to link the mRNA variants with their protein forms, which may have a new function, an identical function, or even no function. Thus, the following two questions arise: How is it possible to assess the function of each splice variant? How does AS affect the structure of the variant?
© Oxford University Press 2002
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on October 1, 2013
Table 1. Overview of the recent bioinformatics studies interested in AS
Authors, year | Study | Particular case | Organism | Number of AS identified | Estimated frequency of AS in human | Accessibility
Mironov et al., 1999 | ESTs are aligned to genome; creation of superstructures of EST clusters | based on TIGR Human Gene Index | H. sapiens | 133 | 35% | -
Brett et al., 2000 | ESTs are aligned to mRNAs in order to identify insertions or deletions | - | H. sapiens | 3011 | 38% | http://www.bioinf.mdc-berlin.de/splice/
Croft et al., 2000 | Creation of a database of introns from Genbank; alignment of this with ESTs | The set of proteins analysed doesn't contain any previously known AS | H. sapiens | 582 | 22% | http://isis.bituq.edu.au/
Human genome Cons., 2001 | Alignment of 632 different transcripts | chromosome 22, 245 genes | H. sapiens | 145 | 59% | -
Kan et al., 2001 | ESTs are aligned to genome to recreate transcripts | - | H. sapiens, M. musculus | 374 | 55% | http://stl.wustl.edu/∼zkan/TAP
Modrek and Lee, 2002 | ESTs are aligned to genome: detection of AS rather than prediction | - | H. sapiens | 6201 AS relationships | 42% | http://bioinformatics.ucla.edu/HASDB
Investigating AS data, Lopez (1998) distinguished two cases:
1. The AS region covers a whole protein domain.
2. The AS region covers only a part of a protein domain.
In the first case, the loss or addition of a domain will intuitively be associated with the loss or gain of the function linked to this domain. In the second case, however, there exists neither a study nor a model of the consequences for the structure and thus for the function. The aim of our study was to gain some clues that will enable us to address this second case. We focused on AS forms where only a part of an INTERPRO domain (Apweiler and the Interpro Consortium, 2000) is affected by AS.
Two papers which resolved the structures of different isoforms of a protein produced by AS (Oakley et al., 2001; Peneff et al., 2001) encouraged us to start a detailed and exhaustive analysis in order to suggest further interesting examples for structure determination.
We analyzed in silico known proteins for which a maximum of information could be accumulated. In this paper we describe the computational methods we applied to conduct such analyses in a standard and efficient manner. An overview of the methods is given in Figure 1. As a result, we discuss one interesting example in detail, beta-glucuronidase (see Figure 2).
The ultimate goal of this study is to link structure and function in the particular case of AS variants.
METHODS
Determination of the protein set
SWISS-PROT (Bairoch and Apweiler, 2000) is known as the most reliable public database and provides a great source of information and cross references. We searched for human proteins which exhibit at least one AS variant and one resolved structure, using a simple SRS6 search (Zdobnov et al., 2002) in SWISS-PROT with the keywords VARSPLIC, HUMAN, and PDB. From this set, an automated search then retrieved all proteins for which the AS region covers less than an entire INTERPRO domain. We extended this set with the new annotations in SWISS-PROT and checked them for AS forms and the availability of structural information.
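The domain filter described above can be illustrated with a minimal sketch. It is not the authors' automated search. It assumes that the AS region and the domain are each given as 1-based, inclusive residue intervals, and the function names and coordinates are hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]  # 1-based, inclusive residue positions (assumed convention)

def overlap_length(a: Interval, b: Interval) -> int:
    """Number of residues shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def classify_domain_hit(as_region: Interval, domain: Interval) -> str:
    """Return 'none', 'partial' or 'whole' for how the AS region hits the domain."""
    shared = overlap_length(as_region, domain)
    domain_length = domain[1] - domain[0] + 1
    if shared == 0:
        return "none"
    return "whole" if shared == domain_length else "partial"

# Hypothetical coordinates, for illustration only.
print(classify_domain_hit((120, 170), (100, 300)))  # partial
print(classify_domain_hit((90, 310), (100, 300)))   # whole
print(classify_domain_hit((10, 50), (100, 300)))    # none
```

A protein would enter the search set if at least one of its variants yields "partial" for at least one annotated domain.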
Data collection
Starting with the accession numbers of the proteins, we collected the information contained in the respective SWISS-PROT files. More precisely, we were interested in (i) the protein sequence, (ii) the protein sequences of the variants, (iii) the domain structure, (iv) the cross references to the EMBL database (Stoesser et al., 2002), the PDB structure (Berman et al., 2000), and the OMIM database (Hamosh et al., 2002), and (v) the literature references.
[Figure 1: flowchart. Starting point: proteins in which part of a domain is affected by alternative splicing (SWISS-PROT, Interpro). Databases: OMIM, PDB, EMBL, GeneNest, SpliceNest. Data: protein sequence, sequence of each variant, PDB coordinates, mRNAs, cDNAs, gene and genome sequence, GeneNest cluster. Tools: ClustalW, Dialign, sim4, Phd, Jpred, Wise2. Results: link AS/disease, multiple alignment, secondary structure prediction, EST coverage quality, tissue specificity, alignment with human genome, probable splice variants, protein-mRNA alignment.]
Fig. 1. Schematic outline of the methods used. The rectangles represent results and the ellipses data. The red rectangle is the starting point of the approach. The green ellipses are protein sequences, the orange ones stand for databases, the grey ones are nucleic acid sequences and protein structure coordinates. The names of the computational tools are written above the arrows.
Structure coverage quality
To be able to use the available structural data for our splice variants, the published structure had to cover at least the AS region and ideally most of the protein length. Indeed, only complete coverage can give an accurate representation of the 3D structure, one that takes into account all kinds of interactions between amino acids. This condition was checked with a multiple alignment of the sequence of the SWISS-PROT isoform, the sequences of all variants, and the sequences used during the structure resolution. We did the multiple alignments using ClustalW (Thompson et al., 1994) or Dialign (Morgenstern et al., 1998). The alignments were viewed and edited using Seqlab from the GCG package (Womble, 2000).
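Once the alignment gives the residue range covered by a structure, the coverage condition reduces to interval arithmetic. The sketch below is an illustration only and assumes an ungapped correspondence between the structure and the protein sequence. The structure range 22-633 and the protein length 651 are the values reported for beta-glucuronidase in this paper. The AS region used in the example is hypothetical.

```python
def coverage_fraction(structure_range, target_range):
    """Fraction of target_range covered by structure_range (inclusive residue ranges)."""
    s_start, s_end = structure_range
    t_start, t_end = target_range
    shared = max(0, min(s_end, t_end) - max(s_start, t_start) + 1)
    return shared / (t_end - t_start + 1)

structure = (22, 633)      # residues resolved in 1bhg, as reported in the text
protein = (1, 651)         # long isoform of beta-glucuronidase
as_region = (200, 250)     # hypothetical AS region, for illustration only

print(round(coverage_fraction(structure, protein), 3))  # 0.94
print(coverage_fraction(structure, as_region) == 1.0)   # True
```

A structure would pass the check when the second value is 1.0, i.e. the whole AS region is resolved, and the first value is as close to 1 as possible.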
EST analysis
This part of the study aims at providing information concerning the tissue and time specificity of the AS variants. We need these data not only to increase our knowledge about each variant, but also to be able to access each variant experimentally within the human organism.
ESTs, or expressed sequence tags, are derived from fully processed mRNAs, i.e. mRNAs that are already spliced, polyadenylated, and possess a 5' cap. They are sequenced in a high-throughput manner. ESTs provide a huge quantity of information concerning gene expression in many tissues and cell types at different developmental or tumoral states.
Most of the bioinformatics studies interested in AS use data from EST libraries to detect AS variants. The main problem, however, is the poor sequencing quality of the ESTs, which is due to their high rate of production. Therefore, we carefully checked the alignment of the ESTs to the genomic sequence. In the present study we used the ESTs clustered by GeneNest (Haas et al., 2002). In GeneNest, each cluster, which should represent a gene, is composed of many contigs, which are ideally the counterparts of the mRNA isoforms. A consensus sequence, which is characteristic for each contig, is created from the set of ESTs and mRNAs that form this contig. GeneNest provides information about the library each EST comes from, i.e. the tissue type and the tumoral or developmental status. We performed a BLAST search (Altschul et al., 1990) with our protein sequence against the GeneNest database of consensus sequences.
Validation of AS variants and determination of AS type
Because the exact processing of AS is not yet fully understood, the discovery of novel splice variants for a given gene using computational methods is a difficult task. Therefore, most of the methods developed to perform this analysis use EST data. AS prediction is thus not an ab initio prediction, but rather a careful observation of expression patterns. An alignment with the genomic sequence allows the discovery of novel splicing patterns. First, we looked into the SpliceNest database (Coward et al., 2002) of probable AS variants. The probable AS sites are highlighted based on an alignment of GeneNest contigs with the human genome draft sequence using sim4 (Florea et al., 1998). However, not all information available in the EMBL database is contained in UNIGENE (Schuler, 1997), on which GeneNest is based. In order to base our arguments on a maximum of data, we aligned all sequences provided by the EMBL cross references in SWISS-PROT (ESTs, cDNAs, mRNAs, and gene sequences) with the human genome using sim4.
[Figure 2: schematic of beta-glucuronidase (P08236) showing the exon structure, the long isoform (residues 1-651), the short isoform (residues 1-600), and the sequence of structure 1bhg.]
Fig. 2. From top to bottom, these are schematic representations of the exon structure, the long isoform, the short isoform, and the sequence of the beta-glucuronidase structure (1bhg). The domain affected by the AS event is colored yellow.
Additionally, to be able to jump from genomic coordinates to mRNA or protein ones, we aligned in each case one mRNA with the main isoform of the protein using Wise2 (Birney and Durbin, 1997).
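The coordinate jump mentioned above can be sketched as follows. This is a simplified illustration and not a reimplementation of Wise2 or sim4. It assumes a forward-strand gene, gap-free exon blocks given as inclusive genomic intervals, and a known 1-based start of the coding sequence on the mRNA. All names and coordinates are hypothetical.

```python
def genomic_to_mrna(pos, exons):
    """Map a genomic position to a 1-based mRNA position, or None if it is intronic.

    exons: sorted list of (start, end) genomic intervals, inclusive, forward strand.
    """
    offset = 0  # number of mRNA bases contributed by the exons already passed
    for start, end in exons:
        if start <= pos <= end:
            return offset + (pos - start) + 1
        offset += end - start + 1
    return None

def mrna_to_residue(mrna_pos, cds_start):
    """Map a 1-based mRNA position to a 1-based residue index, or None in the 5' UTR.

    The 3' UTR is not handled in this sketch.
    """
    if mrna_pos < cds_start:
        return None
    return (mrna_pos - cds_start) // 3 + 1

# Hypothetical gene with two exons and a coding sequence starting at mRNA base 21.
exons = [(100, 199), (300, 399)]
mrna_pos = genomic_to_mrna(310, exons)
print(mrna_pos)                       # 111
print(mrna_to_residue(mrna_pos, 21))  # 31
print(genomic_to_mrna(250, exons))    # None (intronic)
```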
In summary, we did the following alignments:
• protein, variants, sequences of structures
• mRNAs, cDNAs, genome
• mRNAs, protein
The comparison of the alignments of mRNAs and ESTs to the genome sequence enabled us to find out which type of AS occurs.
There are various types of AS:
1. alternative initiation: the N-terminal part of the protein differs because of the use of an alternative 5' exon
2. alternative termination: the C-terminal part of the protein differs due to a frameshift
3. exon extension (5' or 3'): an exon is extended, either in 5' or in 3', because of the use of another splice site, which normally does not give rise to a frameshift
4. exon truncation (5' or 3'): similar case to exon extension
5. cryptic exon/exon skipping: recognition of new splice sites (5' and 3'), or non-recognition of the usual ones
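Part of the typing above can be sketched by comparing the exon blocks that two transcripts produce when aligned to the genome. The code below is a rough illustration under stated assumptions, not the procedure used in this study. Exons are inclusive genomic intervals on the forward strand, only the first overlapping variant exon is considered, and the labels and coordinates are hypothetical.

```python
def compare_exons(reference, variant):
    """Label each reference exon relative to the exons of a variant transcript.

    Both arguments are lists of (start, end) genomic intervals, inclusive.
    """
    labels = []
    for r_start, r_end in reference:
        overlapping = [(s, e) for s, e in variant if s <= r_end and e >= r_start]
        if not overlapping:
            labels.append("skipped")
            continue
        v_start, v_end = overlapping[0]
        if (v_start, v_end) == (r_start, r_end):
            labels.append("identical")
        elif v_start < r_start or v_end > r_end:
            labels.append("extended")
        else:
            labels.append("truncated")
    return labels

# Hypothetical exon coordinates, for illustration only.
reference = [(100, 200), (300, 400), (500, 600), (700, 800)]
variant = [(100, 200), (300, 380), (700, 820)]
print(compare_exons(reference, variant))
# ['identical', 'truncated', 'skipped', 'extended']
```

A missing first or last exon would point to alternative initiation or termination rather than exon skipping, and an exon present only in the variant would be a cryptic exon from the point of view of the reference. Both cases would need additional checks that are left out here.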
Link to disease
Splicing defects are estimated to account for about 15% of disease-causing mutations in humans. The vast majority of known genetic lesions that affect splicing are point mutations within the 3' and the 5' splice sites (Krawczak et al., 1990; Nakai and Sakamoto, 1994). In addition, splicing regulation errors can also cause diseases. Thus, mRNA splicing is involved in a great number of pathologies of many different types, e.g. in neurological diseases such as Alzheimer's disease (A4 amyloid protein), in cancers (p73, mdm2), and in myopathy (dystrophin).
Therefore, it is of great interest to determine whether there is a link between a particular variant and a pathology. In the case of variants causing cancer, knowledge about the pro- or anti-apoptotic function of different variants could allow for the design of innovative approaches to chemotherapy. These could consist of modifying AS pathways (Mercante and Kole, 2002), thus inhibiting the overexpression of anti-apoptotic forms.
One way to link disease and AS is to go through the literature. Alternatively, we would need to analyse, for each pathology, a set of affected tissues and compare them with normal ones for each mRNA. This topic will not be considered here.
Secondary structure prediction
We wanted to get an impression of possible changes of the protein conformation caused by AS. Therefore, we performed a secondary structure prediction of the SWISS-PROT sequences and the various isoforms using PHD (Rost, 1996) and Jpred (Cuff et al., 1998). We used two prediction programs especially for cases where the known structure does not cover the complete AS region. For verification, we also predicted the secondary structure of the PDB sequences and compared the results with the known PDB secondary structures. We added the predicted secondary structures to the previously mentioned alignment of the protein, variants, and PDB sequences.
Secondary structure prediction is a fast method to get first assumptions concerning possible changes in the protein fold. One must be careful in using secondary structure prediction methods, because these methods are based on strong approximations and can fail in the prediction of beta-sheet structures.
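The verification step, comparing a predicted secondary structure with a known one, can be quantified by per-residue agreement over the three states helix (H), strand (E) and coil (C), commonly called Q3. The paper does not state which measure was used, so the following is only a sketch under that assumption, with invented example strings.

```python
def q3_agreement(predicted, observed):
    """Fraction of positions where two equal-length H/E/C strings agree."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must be aligned to the same length")
    if not predicted:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)

# Invented strings, for illustration only.
predicted = "HHHHEEEECC"
observed = "HHHCEEEECC"
print(q3_agreement(predicted, observed))  # 0.9
```

The same function could be applied to the predictions for two isoforms over their shared residues to see whether the AS event changes the predicted elements outside the spliced region.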
RESULTS AND DISCUSSION
Set of proteins
We found 113 human proteins present in SWISS-PROT that possess AS variants and at least one known structure. According to SWISS-PROT annotation, 60 proteins (53%) have a variant in which only a part of a domain is affected by AS. They constitute the search set used in the following. 27 proteins (24%) have a variant in which a whole domain is affected by AS. 14 proteins (12%) are present in both sets, i.e. they have variants with only a part of a domain affected as well as variants with a whole domain affected. In some cases of multiple domains, we even found partly and completely affected domains within a single variant. In 40 proteins (35%) the AS region covers no described domain. They are in neither of the two sets.
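The counts above can be checked with a few lines of arithmetic. The sketch assumes that the 14 proteins present in both sets are included in the 60 and in the 27, which is the reading under which the numbers add up to 113.

```python
# Counts quoted in the text for the 113 SWISS-PROT proteins.
TOTAL = 113
PARTIAL = 60      # a variant affects only part of a domain
WHOLE = 27        # a variant affects a whole domain
BOTH = 14         # proteins counted in both of the sets above
NO_DOMAIN = 40    # AS region covers no described domain

def percent(count, total=TOTAL):
    """Percentage of the total, rounded to the nearest integer."""
    return round(100 * count / total)

# Assumption: the 14 'both' proteins are part of the 60 and of the 27.
proteins_with_domain_hit = PARTIAL + WHOLE - BOTH
print(proteins_with_domain_hit + NO_DOMAIN)                     # 113
print([percent(n) for n in (PARTIAL, WHOLE, BOTH, NO_DOMAIN)])  # [53, 24, 12, 35]
```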
The 60 proteins are sorted according to their length, cellular localisation, number of alternatively spliced variants, function, and AS type. The length distribution of our set is very similar to that of all proteins present in the SWISS-PROT database.
The set studied here is well distributed among the different cellular compartments, see Figure 3. The location of the proteins breaks down as follows: 35% nucleus, 29% cytoplasm, 19% extracellular matrix, 13% membrane, and 4% mitochondrion.
It is worth noting that certain proteins can have several cellular localisations, often both nuclear and cytoplasmic. In two cases we found in the literature that different variants have mutually exclusive cellular localisations. For example, the retinoic acid receptor beta has two isoforms located in the nucleus (beta-1 and beta-2), and one isoform located in the cytoplasm (beta-4).
On average, each protein from our set has 3.6 alternatively spliced variants according to SWISS-PROT annotations. Of the 9263 human proteins in SWISS-PROT, 1167 had annotated splice variants, i.e. less than 13%. The fraction of genes affected by AS is estimated to be greater than 59% (Consortium, 2001). Therefore, there is no doubt that the number of AS annotations will increase drastically, and that this average number of isoforms is in fact a minimum.
Fig. 3. Cellular localisation distribution of the set of 60 proteins affected only in a part of a domain.
Fig. 4. Function type distribution of the set of 60 proteins affected only in a part of a domain.
We analysed the distribution of the functions of the proteins, see Figure 4. It was noticeable that there are many proteins involved in signal transduction or gene expression regulation (46%). Such events indeed have to be subtly regulated. AS is a great and economical opportunity for the cell to produce slightly different proteins that are able to interact with various partners or play diverse roles without having many genes coding for them.
AS also plays a fundamental role in the diversity needed in the neuronal system. The other side of this diversity is that AS can also lead to many neurological disorders, as reviewed in Dredge et al. (2001). In our case, only two proteins are exclusively involved in the neuronal system. But other proteins, such as the numerous transcription regulators mentioned above or signal transduction actors, may play important roles in neurological functions.
Another vital function for the organism that needs much variety is the immune system. The organism has to adapt its response to the aggressor. Proteins involved in immunity represent 11% of our set.
Last but not least, cell death regulation is a critical function in the cell. The main actors in apoptosis are the members of the p53 family (p53, p63, p73), and other proteins that can act as tumour suppressors or help tumour progression. They make up 15% of our set. Interestingly, those proteins are often linked to cancers.
The remaining 25% of our set could not be assigned to one of the previous function types.
The results obtained for one example from our set, beta-glucuronidase (SWISS-PROT accession number: P08236), are discussed below. Beta-glucuronidase, see Figure 2, is a 651 amino acid long cytoplasmic protein. The bacterial homologue of beta-glucuronidase is well known to biologists due to its wide use in molecular biology as a reporter gene. The physiological function of beta-glucuronidase in human cells is the degradation of dermatan and keratan sulfates.
Structure analysis
To address the relationship between protein structure and function, it was important for us to have good structure coverage of the AS region. Among the 60 proteins composing our set, only 34 (i.e. 57%) possess structural information of good quality. In our case, this means experimentally determined data which cover the part of the protein of interest to us.
There is almost always at least a part of a secondary structure element (i.e. a helix or a beta sheet) that is removed, inserted, or substituted. This could indicate that AS necessarily leads to structural changes which eventually give rise to a refolding of the domain or even of the entire protein, causing functional changes.
Interestingly, the first results of our secondary structure prediction show that in most cases the secondary structure of the isoforms does not change significantly. In order to visualize those changes and possible refolding, the AS variants would be interesting targets for structure determination.
The X-ray structure of beta-glucuronidase (1bhg, resolution 2.6 Å; Jain et al., 1996), see Figure 5, encompasses almost the complete protein length (residues 22-633) and thus the entire AS region. The domain removed in the short isoform is colored blue in Figure 5. It is made of beta sheets in the main isoform. The green and orange colored atoms in the spacefill model represent the ligands, N-acetyl-D-glucosamine and alpha-D-mannose, respectively. A part of the secondary structure elements is removed due to the AS event. The consequences of this removal of beta sheets for the folding of the protein are difficult to estimate. According to the PHD and Jpred predictions, there is no change to the secondary structure. So, the protein fold is preserved in the short isoform, even though four small beta-strands are missing. These strands are located far away from the ligand binding sites. We assume that the function of the protein will not be lost.
Fig. 5. Structure of the long isoform of beta-glucuronidase (1bhg) at 2.6 Å resolution. The structure is colored according to the type of secondary structure (yellow for beta sheets, magenta for helices). The two ligands, N-acetyl-D-glucosamine and alpha-D-mannose, are colored green and orange, respectively, and shown as a spacefill model. The part missing in the short isoform is colored red. The picture was made using the SYBYL software (Tripos Inc, 2000).
One option to estimate the consequences of such an AS event would be to predict the fold that the protein can adopt. However, either one proceeds by homology and risks missing an important refolding, or one predicts the structure ab initio and loses the information that could be gained from the known structure. Such predictions would not be sufficient. So, it is necessary to have resolved structural examples.
To our knowledge, there are so far only two examples where the structures of two isoforms of the same protein produced by AS have been published. The first one (Peneff et al., 2001) deals with the human pyrophosphorylase. The structures of two AS variants, AGX1 and AGX2, have been solved. AGX2, made of 522 amino acids, has an insert of 17 residues within a loop compared with AGX1. This change doesn't lead to 3D conformation changes. However, the quaternary structures of AGX1 and AGX2 are different: AGX1 has a dimeric arrangement, whereas AGX2 is exclusively monomeric. This leads to an alteration of substrate specificity. With GalNac1P as substrate, AGX1 is 2-3 times more active. AGX2 is 8 times more active with GlcNac1P. The inserted loop causes oligomeric changes responsible for the specificity shift.
[Figure 6: GeneNest contig view showing the tissue information, the consensus sequence, the beta-glucuronidase mRNA, and the ESTs.]
Fig. 6. GeneNest contig Hs183868. The consensus sequence of the contig is represented in red. All tissue and library data are on the left. Each colored box stands for a specific tissue type. Under the consensus sequence, we find the mRNA sequence of the long isoform of beta-glucuronidase (green), and the ESTs (grey and blue). This page is accessible via http://genenest.molgen.mpg.de/.
In the second paper (Oakley et al., 2001), the structures of two isoforms of the glutathione S-transferase of the mosquito Anopheles dirus have been solved. They differ by a few amino acids which form a helix or enlarge loops. These amino acids don't affect the topology of the active site. The substrate inhibition type, however, is altered.
In both cases, the availability of the structures of two isoforms of the same protein has led to a better understanding of the function of the protein and its specificity.
EST analysis
This EST analysis was performed in order to gain information about the tissues in which the variants are expressed. Whenever possible, we also wanted to check whether the AS variants are linked with a particular time or space specificity within the human organism.
It is difficult to interpret tissue specificity, because EST library coverage is not yet exhaustive. We can be sure that a particular gene is expressed in a particular tissue if there is no doubt that an EST which represents this gene comes from this tissue. This presence information is what we wanted to gain. It tells us in which tissue to search for the mRNA of the various isoforms.
On the other hand, the fact that no EST corresponding to a gene is found in a library doesn't mean that this gene is not expressed in this particular tissue. The transcript may simply not have been sequenced, or not publicly released yet.
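The asymmetry between presence and absence of ESTs can be made explicit in a small sketch. It assumes that the tissue of origin of each EST in a contig is already known as a plain string. The tissue names are invented and the function is hypothetical, not part of GeneNest.

```python
from collections import Counter

def tissue_evidence(est_tissues, tissues_of_interest):
    """Report 'expressed' where at least one EST was found, 'no evidence' otherwise.

    A missing EST is never reported as 'not expressed', because EST library
    coverage is not exhaustive.
    """
    counts = Counter(est_tissues)
    return {
        tissue: ("expressed" if counts[tissue] > 0 else "no evidence")
        for tissue in tissues_of_interest
    }

# Invented tissue annotations of the ESTs in one contig, for illustration only.
est_tissues = ["liver", "liver", "brain", "kidney"]
print(tissue_evidence(est_tissues, ["liver", "brain", "muscle"]))
# {'liver': 'expressed', 'brain': 'expressed', 'muscle': 'no evidence'}
```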
As explained in the Methods, the information relevant for this EST analysis was collected from GeneNest. The output of one GeneNest cluster representative of beta-glucuronidase is shown in Figure 6. The EST coverage is very good in this case: many ESTs together cover the entire length of the mRNA several times. In GeneNest, the colors of the boxes to the right of each EST-ID code for the library type. Looking at those colored boxes, we see that beta-glucuronidase is expressed in various tissues.
Validation of AS variants and determination of AS type
It is possible to validate a splice variant by aligning the mRNA sequences or the GeneNest contigs with the genome.
The mRNA sequences that we use are mature mRNAs that should no longer contain any intronic sequence. In addition, GeneNest contigs are based only on ESTs and mRNAs, which are also processed sequences. Consequently, the alignment highlights the exonic parts of the gene. However, errors can occur due to different kinds of experimental factors (Modrek and Lee, 2002). From the alignment, we are able to detect AS variants, because there are mRNAs (or contigs) that show an obvious deletion or insertion in comparison to others.
In the case of beta-glucuronidase, the short isoform exhibits an exon skipping when compared to the long isoform: exon 6 is skipped in the short isoform.
The AS type was analysed in the whole set, which gives the following distribution, see Figure 7: 40% exon skipping or use of a cryptic exon, 23% exon truncation, 6% exon extension, 18% alternative initiation, and 13% alternative termination. Exon skipping and the use of a cryptic exon, which are actually the same event observed from different points of view, are the most frequent types of AS in our set. There are only a few cases where the open reading frame is shifted, which corresponds to the results of Modrek et al. (2001).
Fig. 7. AS type distribution of the set of 60 proteins affected only in a part of a domain.
CONCLUSION
In summary, AS is an event of the greatest importance. On the one hand, it gives rise to the diversity that an organism needs to function properly. On the other hand, AS is involved in a great number of diseases.
In order to achieve a full understanding of the consequences of AS for protein function, it is necessary to go to the proteome level and to have data concerning the structure of the different variants. The 3D structure of a protein determines its interactions with other proteins. Ultimately, those protein networks are responsible for all possible functions in an organism.
This paper presents a strategy that can be used to infer the function of known AS forms using available databases and computational tools. By means of one example, we demonstrated how to find, validate, and analyse AS data. We discussed experimental and theoretical limitations of AS investigations.
AS events concerning only a part of a protein domain occur very often, in a broad range of cell compartments. Although in most cases the affected part of the protein is so large that a refolding could be assumed, the results of secondary structure prediction suggest the opposite. Further investigations should focus on this contradiction.
ACKNOWLEDGEMENTS
We would like to thank Dr Jörg Schultz for helpful discussions, Johanna Holbrook for critical reading of the manuscript, and the University of Potsdam for the opportunity to use SYBYL.
REFERENCES
Altschul, S., Gish, W., Miller, W., Myers, E. and Lipman, D. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.
Apweiler, R. and The Interpro Consortium (2000) Interpro - an integrated documentation resource for protein families, domains and functional sites. Bioinformatics, 16, 1145-1150.
Bairoch, A. and Apweiler, R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45-48.
Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I. and Bourne, P. (2000) The protein data bank. Nucleic Acids Res., 28, 235-242.
Birney, E. and Durbin, R. (1997) Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. ISMB 97 Proceedings. pp. 58-64.
Brett, D., Hanke, J., Lehmann, G., Haase, S., Delbruck, S., Krueger, S., Reich, J. and Bork, P. (2000) EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett., 474, 83-86.
Brett, D., Pospisil, H., Valcárcel, J., Reich, J. and Bork, P. (2002) Alternative splicing and genome complexity. Nature Genet., 30, 29-30.
Consortium, I. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860-924.
Coward, E., Haas, S. and Vingron, M. (2002) SpliceNest: visualizing gene structure and alternative splicing based on EST clusters. Trends Genet., 18, 53-55.
Croft, L., Schandorff, S., Clark, F., Burrage, K., Arctander, P. and Mattick, J. (2000) ISIS, the intron information system, reveals the high frequency of alternative splicing in the human genome. Nature Genet., 24, 340-341.
Cuff, J., Clamp, M., Siddiqui, A., Finlay, M. and Barton, G. (1998) Jpred: a consensus secondary structure prediction server. Bioinformatics, 14, 892-893.
Dredge, B., Polydorides, A. and Darnell, R. (2001) The splice of life: alternative splicing and neurological disease. Nature Reviews in Neuroscience, 2, 43-50.
Florea, L., Hartzell, G., Zhang, Z., Rubin, G. and Miller, W. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res., 8, 967-974.
Gilbert, W. (1978) Why genes in pieces? Nature, 271, 501.
Graveley, B. (2001) Alternative splicing: increasing diversity in the proteomic world. Trends Genet., 17, 100-107.
Haas, S., Beissbarth, S., Rivals, E., Krause, A. and Vingron, M. (2002) GeneNest: automated generation and visualization of gene indices. Trends Genet., 16, 521-523.
Hamosh, A., Scott, A., Amberger, J., Bocchini, C., Valle, D. and McKusick, V. (2002) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res., 30, 52-55.
Jain, S., Drendel, W., Chen, Z., Sly, W. and Grubb, J. (1996) Structure of the human beta-glucuronidase reveals candidate lysosomal targeting and active-site motifs. Nat. Struct. Biol., 3, 375-381.
Kan, Z., Rouchka, E., Gish, W.R. and States, D. (2001) Gene prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res., 11, 889-900.
Krawczak, M., Reiss, J. and Cooper, D. (1990) The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: causes and consequences. Human Genet., 90, 41-54.
Lopez, A. (1998) Alternative splicing of pre-mRNA: developmental consequences and mechanisms of regulation. Ann. Rev. Genet., 32, 279-305.
Mercante, D. and Kole, R. (2002) Modification of alternative splicing pathways as a potential approach to chemotherapy. Pharmacology and Therapeutics, 85, 237-243.
Mironov, A., Fickett, J. and Gelfand, M. (1999) Frequent alternative splicing of human genes. Genome Res., 9, 1288-1293.
Modrek, B. and Lee, C. (2002) A genomic view of alternative splicing. Nature Genet., 30, 13-19.
Modrek, B., Resch, A., Grasso, C. and Lee, C. (2001) Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res., 29, 2850-2859.
Morgenstern, B., Frech, K., Dress, A. and Werner, T. (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14, 290-294.
Nakai, K. and Sakamoto, H. (1994) Construction of a novel database containing aberrant splicing mutations in mammalian genes. Gene, 141, 171-177.
Oakley, A., Harnnoi, T., Udomsinprasert, R., Jirajaroenrat, K., Keterman, A. and Wilce, M. (2001) The crystal structures of glutathione S-transferases isozymes 1-3 and 1-4 from Anopheles dirus species B. Protein Sci., 10, 2176-2185.
Peneff, C., Ferrari, P., Charrier, V., Taburet, Y., Monnier, C., Zamboni, V., Winter, J., Harnois, M., Fassy, F. and Bourne, Y. (2001) Crystal structures of two human pyrophosphorylase isoforms in complex with UDPGlc(Gal)NAc: role of the alternatively spliced insert in the enzyme oligomeric assembly and active site architecture. EMBO J., 20, 6191-6202.
Rost, B. (1996) PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol., 266, 525-539.
Sambrook, J. (1977) Adenovirus amazes at Cold Spring Harbor. Nature, 268, 101-104.
Schuler, G. (1997) Pieces of the puzzle: expressed sequenced tags and the catalog of human genes. J. Mol. Med., 75, 694-698.
Sharp, P. (1994) Split genes and RNA splicing. Cell, 77, 805-815.
Stoesser, G., Baker, W., van den Broek, A., Camon, E., Garcia-Pastor, M., Kanz, C., Kulikova, T., Leinonen, R., Lin, Q., Lombard, V., Lopez, R., Redashi, N., Stoehr, P., Tuli, M., Tzouvara, K. and Vaughan, R. (2002) The EMBL nucleotide sequence database. Nucleic Acids Res., 30, 21-26.
Thompson, J., Higgins, D. and Gibson, T. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.
Tripos Inc (2000) SYBYL 6.7. 1699 Hanley Road, St Louis, Missouri, USA.
Womble, D. (2000) GCG: the Wisconsin package of sequence analysis programs. Meth. Mol. Biol., 132, 3-22.
Zdobnov, E., Lopez, R., Apweiler, R. and Etzold, T. (2002) The EBI SRS server - recent developments. Bioinformatics, 18, 368-373.