Application of Bayesian Decomposition for analysing microarray data

wickedshortpumpΒιοτεχνολογία

1 Οκτ 2013 (πριν από 4 χρόνια και 1 μήνα)

78 εμφανίσεις

BIOINFORMATICS
Vol.18 no.4 2002
Pages 566–575
Application of Bayesian Decomposition for
analysing microarray data
T.D.Moloshok
1
,R.R.Klevecz
2
,J.D.Grant
3
,F.J.Manion
1
,
W.F.Speier IV
3
and M.F.Ochs
1,∗
1
Bioinformatics Working Group,Fox Chase Cancer Center,Philadelphia,PA 19111,
USA,
2
Department of Biology,Beckman Research Institute of the City of Hope,
Duarte,CA 91010,USA and
3
Bioinformatics Facility,Fox Chase Cancer Center,
Philadelphia,PA 19111,USA
Received on October 1,2001;revised on October 31,2001;accepted on November 7,2001
ABSTRACT
Motivation:Microarray and gene chip technology provide
high throughput tools for measuring gene expression
levels in a variety of circumstances,including cellular
response to drug treatment,cellular growth and develop-
ment,tumorigenesis,among many other processes.In
order to interpret the large data sets generated in exper-
iments,data analysis techniques that consider biological
knowledge during analysis will be extremely useful.We
present here results showing the application of such a tool
to expression data from yeast cell cycle experiments.
Results:Originally developed for spectroscopic analysis,
Bayesian Decomposition (BD) includes two features which
make it useful for microarray data analysis:the ability to
assign genes to multiple coexpression groups and the
ability to encode biological knowledge into the system.
Here we demonstrate the ability of the algorithmto provide
insight into the yeast cell cycle,including identification of
five temporal patterns tied to cell cycle phases as well as
the identification of a pattern tied to an ∼40 min cell cycle
oscillator.The genes are simultaneously assigned to the
patterns,including partial assignment to multiple patterns
when this is required to explain the expression profile.
Availability:The application is available free to academic
users under a material transfer agreement.Go to http:
//bioinformatics.fccc.edu/for more details.
Contact:td
moloshok@fccc.edu;RKlevecz@coh.org;
jd
grant@fccc.edu;fj
manion@fccc.edu;
speier1@prodigy.net;m
ochs@fccc.edu
INTRODUCTION
Recent advances in microarray and gene chip technology
(Cho et al.,1998;Spellman et al.,1998;Iyer et al.,
1999;Ross et al.,2000) have led to an explosion in
the amount of data available for understanding cellular

To whomcorrespondence should be addressed.
function and pathways,with the potential for revealing
the underlying cellular behavior responsible for diseases
such as cancer (Young,2000).Since the amount of
data is significantly larger than in standard biological
studies,new tools are needed for analysis.While progress
has been relatively good in the use of data analysis to
identify disease states (Golub et al.,1999;Alizadeh et al.,
2000),the overall complexity of the data and underlying
biological systems will require the application of more
advanced methods from computer science to recover the
maximal information for knowledge discovery (Lockhart
and Winzeler,2000;Young,2000).
Many attempts have been made to apply standard
data mining algorithms to discover patterns within gene
expression array data.These include the use of standard
statistical methods (Claverie,1999;Alter et al.,2000),
self-organizing maps (Tamayo et al.,1999),support vector
machines (Brown et al.,2000),clustering (Eisen et al.,
1998;Lukashin and Fuchs,2001),and other methods,
reviewed in (Brazma and Vilo,2000).Progress has also
recently been made on new statistical methods which
maintain non-Euclidean relationships during reduction
of the dimensionality of the data space (Roweis and
Saul,2000;Tenenbaum et al.,2000),which may help
in defining relationships in complex data.In general,
these methods are not designed with knowledge of the
underlying biological system behavior,although there
have been some recent advances in clustering which do
take into account some experimental knowledge (Heyer
et al.,1999).However,all of the current methods still
lack an ability to recover fundamental behavior because
they force each gene within the expression experiment
into a single coexpression group,which overlooks the
biological fact that many individual genes are coexpressed
in multiple groups in response to different stimuli (Roberts
et al.,2000).This fundamental limitation impairs their
usefulness as it leads to the loss of all information related
566
c
Oxford University Press 2002
BD of expression data
to behavior arising frommultiple inputs.Such information
is critical for understanding cellular behavior (Bittner et
al.,2000).For example,if a set of cell cycle regulated
genes included a set of genes which were upregulated
in G1,a set upregulated in G2,and a set upregulated
in G1 and G2,most current algorithms would require
three clusters to model the data.However since there
are only two cell cycle phases represented,a method
which can assign genes multiply to both G1 and G2 will
more accurately reflect the biological basis for the gene
expression.
One key cellular behavior that becomes difficult to iden-
tify if responses to multiple stimuli cannot be disentangled
is the identification of the set of interacting signaling path-
ways that have been activated.This is particularly impor-
tant when the point of a study is the therapeutic response to
a drug,either for prevention or treatment.Many identified
causes of cancer development are errors in signaling pro-
teins or sets of signaling proteins with the consequent dis-
ruption of the normal operation of the signaling pathways
(e.g.p53,abl,c-kit).In order to use gene expression mea-
surements to deduce drug response or to interpret studies
aimed at identifying key differences between normal and
tumorigenic tissues,a technique of identifying the sets of
pathways undergoing changes must be used.
Here we present a new method for analysing gene
expression data which attempts to identify changes in sets
of interacting pathways from gene expression data by
encoding into the system behavior that mirrors the phys-
iological behavior more closely than standard analysis
methods.The method is an application of the Bayesian
Decomposition (BD) algorithm,originally developed for
the identification of spectral signatures in cases where
there was mixing of components in all data points (Ochs et
al.,1999).This algorithmis a matrix factorization method
which,unlike Principal Component Analysis (PCA) and
other widely used methods,identifies physically meaning-
ful,nonorthogonal basis vectors describing the data.For
gene expression time series data,such basis vectors are
the time curves associated with a physical process (such
as progression through the cell cycle or the activation of a
pathway in response to a drug treatment).The algorithm
simultaneously identifies the basis vectors and their
distributions within the data.The distributions allow the
algorithmto ascribe the behavior of each gene to multiple
patterns,so that genes that are transcribed in response to
multiple stimuli can be identified as belonging to multiple
coexpression groups,e.g.activation of multiple pathways
at different times in the cell cycle.
ALGORITHM AND IMPLEMENTATION
BD is a matrix factorization algorithm which uses prior
knowledge in the form of mathematical models of the
underlying physical process to identify the basis vectors
of the data in their minimal,physically significant form.
In our initial work (Ochs et al.,1999),the physical model
included basis vectors restricted to sets of positive spectral
components with Gaussian or Lorentzian lineshapes.In
subsequent work (Ochs et al.,2001),the physical model
was the known functional formof magnetization recovery
following inversion in magnetic resonance relaxographic
imaging (Labadie et al.,1994).For gene expression
analysis,the basis vectors are the coexpression response
due to individual pathways or linked pathways that
respond in a correlated way to stimuli.
The fundamental decomposition performed by BD is
the recovery of a distribution matrix (A) and a pattern
matrix (P) which combine to forma mock data matrix (M)
that reproduces the data matrix (D) within the noise limits,
as shown in Figure 1.For gene expression analysis,each
row of D represents the expression of a single gene
with the columns representing different measurements
(different times,different tissues,different individuals,
etc.).The matrix Mwould match the matrix Dexactly if A
and P were perfect models of the systemand if there were
no noise,i.e.
D = AP.(1)
The distribution matrix A contains rows that describe the
amount (amplitude) of each pattern within the correspond-
ing row (gene) in D,with each column being associated
with a single pattern.The rows of P are the patterns
that show the average behavior of the coexpressed genes
across the experiment (e.g.over time,across tissues,
across individuals,etc.).For example,a row of P in a cell
cycle experiment could contain the time behavior of gene
expression representing upregulation related to a single
cell cycle phase,so that the rows of A would indicate
when a single gene was upregulated,perhaps in multiple
phases (i.e.significant values in multiple columns).This
allows BD to identify multiple coexpression groups for
each gene,which permits the correct identification of a
gene as being transcribed in response to multiple different
stimuli.This is the key feature of the mathematical model,
as it permits the identification of overlapping coexpression
groups,which could confuse simpler algorithms that force
each gene into a single group.
The mean model for the data is determined by a Markov
chain Monte Carlo process using Bayesian estimates of
probability (Besag et al.,1995).From Bayes’ equation
and (1),the relative probability at each point in the Markov
chain is given by
p(A,P|D) = p(D| A,P)p(A,P) (2)
where the posterior probability distribution,p(A,P|D),
is the probability of the model (A and P matrices at
that point) given the data,p(D| A,P) is the likelihood,
and p(A,P) is the prior,which gives an indication of
567
T.D.Moloshok et al.
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
gene 1
gene N
M ~ D
A
P
condition 1
condition M
pattern 1
pattern P
pattern 1
pattern P
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
=
x
gene 1
gene N
condition 1
condition M
Fig.1.The matrices used in BD.The algorithmcreates simultaneously the A and P matrices which define the model,these are combined by
standard matrix multiplication to create the Mmatrix.The algorithm compares the Mmatrix to the D matrix,since Mis the data which the
model would generate.
the probability of the model independent of the data
(discussed below).The system creates a first model
randomly out of the vacuum in accordance with the prior
and then attempts changes to the model,determining
the suitability of each change by comparison of the
resulting M matrix (the mock data resulting from the
model) to the D matrix (the data) through computation of
the change in the likelihood.For a change in P of δP,the
change in the likelihood is given by
L(δP) =
1

2
Tr[(AδP)
T
(AP − D)
+(AP − D)
T
AδP +(AδP)
T
(AδP)] (3)
where Tr indicates the trace,
T
indicates the transpose,
and equal uncertainty σ is assumed here at each point for
clarity in writing the equation.There is a corresponding
equation for changes in A.The algorithm uses simulated
annealing (Kirkpatrick et al.,1983;Geman and Geman,
1984) to minimize the possibility of entrapment in areas
of false maxima in likelihood.The simulated annealing
process modifies (2) to
p(A,P|D) = p(D| A,P)
S
p(A,P) (4)
where S is progressively raised from 0 to 1,effectively
increasing the power of the data to determine the model
during annealing.Sampling occurs after annealing with
S = 1,so that the posterior probability distribution
(i.e.(2)) is used.
The algorithm encodes an atomic prior that seeks
the minimal structure required to fit the data,therefore
minimizing the number of points requiring computation
(Sibisi and Skilling,1997).The prior is enforced within
the atomic domain (see Figure 2),where changes to
the model are generated.The atomic domain consists of
point masses distributed along an infinitely divisible line
(in reality two 32 bit ints,one for A and one for P).These
point masses are then mapped into the model domain
(A and P),which allows incorporation of additional prior
knowledge by the use of convolution functions ( f s in
Figure 2).The minimizing of structure is important as
the algorithmhas computational complexity of O(n log n)
in the number of point masses.In the original work with
BD (Ochs et al.,1999),the prior knowledge encoded was
that a point mass in the P atomic domain defined a full
spectral line along a row of P in the model domain (i.e.in
one pattern).For this work we encode only the positivity
of gene expression (i.e.the ratio of the relative amounts
of mRNA between experiment and control is a positive
value) at this stage,allowing all possible curves.Finally
the data and mock data generated from the model are
compared by calculation of the change in likelihood given
in (3),which permits full utilization of knowledge of the
uncertainties associated with each data point.The change
in this likelihood is used to determine the probability of
accepting the randomly generated change to the model.
For this work there is a direct mapping from the atomic
domain into the model domain,so that the convolution
functions ( f s in Figure 2) take the atom flux and place it
fully into a single matrix element.The prior here includes
only the requirement of positivity and additivity,which
still effectively reduces the search space by 2
N
where
N is the number of dimensions,since only the positive
hyperquadrant is allowed for the model.However,the
mathematical structure allows for multiple coexpression
of genes,since each gene may be assigned behavior
arising fromdifferent patterns (e.g.expression in response
to drug treatment and in response to stress).The simple
mapping here makes this process similar to nonnegative
matrix factorization (Lee and Seung,1999).However,BD
is more flexible in that it allows complex correlations to
be mapped and is limited to positivity only in the atomic
domain.
BD requires an estimate of the number of dimensions
required to fit the data as an input parameter (i.e.the
number of columns in A and rows in P).This can be
provided by estimation of the number of dimensions
needed to fit the data through statistical analysis (such
568
BD of expression data
A
P
pattern 1
pattern P
pattern 1
pattern P
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * * * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
* * * * * * * *
gene 1
gene N
condition 1
condition M
ƒ
ƒ
Fig.2.The relation of the atomic domains to the model domains.
The atomic domains are computational representations of infinitely
divisible one dimensional spaces containing point masses created ex
vacuo by the algorithm.These point masses (atoms) are mapped
to the model domain (A and P matrices) by the application of
convolution functions ( f s).For the present work,these convolution
functions are merely a direct mapping of the amplitude of the atom
into a value at the appropriate element of a matrix,so that the atoms
denoted by thick lines map to a single matrix element denoted by a
large dot.
as PCA).In addition,BD is usually run using a number
of different estimates.For many analyses specific features
appear indicating that the number of patterns is excessive.
In spectroscopic studies and time domain modelling,these
are often the emergence of patterns which appear to be
unrelated to spectral features or to natural time behaviors.
In data sets which have no likely correlated structures
between points,the estimation of the dimensionality is
more problematic.
A primary feature of (1) is that it is mathematically
degenerate,permitting multiple solutions.BD searches
through the possible solutions for those which are most
probable and samples the likely solutions.In general,
if the number of elements in D is significantly larger
than the number of elements in A and P combined or
if there is a good mathematical model of the data,the
sampler will visit multiple representations of the same
solution,yielding a mean and standard deviation for each
element of the matrix.In cases where there are multiple,
significantly differing solutions,the sampler can move
between these.This will generally yield cases where there
are significant uncertainties associated with the points,
however BDsaves snapshots of individual solutions which
can be examined to verify that multiple solutions exist.In
addition,by repeating the analysis using different random
seeds in the Markov process,different Markov chains are
generated and the results can be compared.This reduces
the probability that the results represent one solution out
of many which might fit the data equally well.
METHODS
To test the ability of BD to recover biologically signifi-
cant information from microarray data,we analysed the
cdc28-mutant yeast cell cycle data (Cho et al.,1998) as
summarized in the Stanford cell cycle data set (Spellman
et al.,1998),containing 763 genes with cell cycle peri-
odicity.Since the data presented in the Stanford data set
are log
2
measurements of experimental to control expres-
sion,the data were transformed to a linear scale,giving
ratios that are always positive.BD requires a calculation
of the likelihood,so that uncertainty (or noise) estimates
for each point in the data were needed.Since it appears
that the noise level is often tied more closely to the gene
expression level,rather than being an absolute level across
all genes,we used a noise estimate of 25%of signal.In ad-
dition,the data set had data missing at many time points.
For these points we set the expression ratio to 1.0,but set
the noise estimate to 100,so that BDwould treat the mod-
els as unconstrained by these data points.
We performed 11 separate runs of BD with different
random seeds,which is equivalent to different random
starting points for the Markov chain.Standard statistics
were calculated on the results,yielding an estimate of
the uncertainty for each point in the pattern and for the
distribution of the patterns for each gene in the data set.
The results of these runs were then compared to the
results reported elsewhere.In addition,specific sets within
the data were analysed,including genes encoding known
kinases which often are activated at multiple points in
the cell cycle.These data were interpreted in light of
information within the Saccharomyces Genome Database
(SGD;Cherry et al.,1997),the Curagen Pathcalling
Database (CPD;Uetz et al.,2000),and the Proteome Yeast
Database (YPD;Costanzo et al.,2000,2001).
The patterns of time expression behavior fromBD were
further analysed by wavelet decomposition (Daubechies,
1992) at all permitted scales as described previously
(Klevecz,2000).In the six gene expression patterns,
the periodicities that are apparent in the profiles were
confirmed by the wavelet scaling.The decompositions
were done using the Daubechies daublet −6 transform.
RESULTS
Figure 3 shows the average gene expression patterns
together with error bars derived fromthe 11 repeated runs
of BD.The identification of the time curves with the cell
cycle phases is clear,as is the existence of a single solution
across the multiple runs.Specifically the solid orange
curve corresponds with high expression in the M/G1
569
T.D.Moloshok et al.
0
0.05
0.1
0.15
0.2
0.25
0.3
M/G1
G1
G1
S
S
G2
M
M
M/G1
M/G1
G1
G1
S
G2
M
M
M/G1
Cell Cycle Phase
Relative Expression
M/G1
S/G2
M
G1A
G1B
osc
Fig.3.The patterns determined by BDfor the yeast cell cycle data.There are five patterns tied to the cell cycle and one additional oscillatory
pattern.In order of peaks,the cell cycle patterns are tied to M/G1 (solid orange),G1 (G1A,solid black),S/G2 (dashed orange),M(dotted
orange),and G1 (G1B,dashed black).The oscillator pattern (osc) is shown in red.The two G1 patterns differ primarily in the presence of
expression during the first cell cycle upon release fromcell cycle arrest.
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
M/G1
G1
G1
S
S
G2
M
M
M/G1
M/G1
G1
G1
S
G2
M
M
M/G1
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
M/G1
G1
G1
S
S
G2
M
M
M/G1
M/G1
G1
G1
S
G2
M
M
M/G1
0
0.5
1
1.5
2
2.5
3
3.5
4
M/G1
G1
G1
S
S
G2
M
M
M/G1
M/G1
G1
G1
S
G2
M
M
M/G1
(a) (b)
(c)
0
0.5
1
1.5
2
2.5
3
3.5
M/G1
G1
G1
S
S
G2
M
M
M/G1
M/G1
G1
G1
S
G2
M
M
M/G1
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
M/G1
G1
G1
S
S
G2
M
M
M/G1
M/G1
G1
G1
S
G2
M
M
M/G1
0
0.5
1
1.5
2
2.5
3
M/G1
G1
G1
S
S
G2
M
M
M/G1
M/G1
G1
G1
S
G2
M
M
M/G1
(d) (e)
(f)
Cell Cycle Phase
Cell Cycle Phase
Cell Cycle Phase
Cell Cycle Phase
Cell Cycle Phase
Cell Cycle Phase
Relative Expression
Relative Expression
Relative Expression
Relative Expression
Relative Expression
Relative Expression
Fig.5.Modelling of expression by BD in complex cases.In each case,the mock data created from the model by BD is compared to the
original data,with the mock data in orange and the original data in red.The genes shown are (a) ISR1;(b) RAD53;(c) HSL7;(d) CDC5;
(e) DBF20;(f) GIN4.
transition,the solid black curve with high expression in the
G1 phase (including the first pass through the cell cycle,
G1Agroup),the dashed orange curve with high expression
in S and G2 phases,the dotted orange curve with high
expression in the M phase,and the dashed black curve
with high expression in the G1 phase (but not during the
570
BD of expression data
Table 1.All genes with more than 70%of their behavior explained by the oscillator pattern which are also identified within the Proteome YPD database
Gene name YPD Proteome role Proteome function
YOL0T6C CMK2 Mating response [E] Protein kinase [P];transferase [P]
YEL060C PRB1 Protein degradation [E] Protein folding [E];protein modification [E];
protein degradation [E]
YEL040W UTR2 Cell wall maintenance [E,P] Serine rich protein [P];hydrolase [P]
YOL060W ARG1 Amino-acid metabolism[E];other metabolism[E] Ligase [E]
YOR190C YDR190C DNA repair [P];Pol II transcription [P];RNA Hydrolase [P];helicase [P] ATPase [P];
processing/modification [E] DNA-bindng protein [P]
YOR116C YOR115C Vesicular transport [E] Unknown
YOR024C-A PMP1 Protein synthesis [E] tRNA synthetase [E];ligase [E];
RNA-binding protein [E]
YOL196W GCN1 Amino-acid metabolism[E];protein synthesis [P];other [E] Activator [E]
YMR202W ERG2 Lipid,fatty-acid and sterol metabolism[E];vesicular transport [E] Isomerase [E]
YML183C PHO84 Phosphate metabolism[E];small molecule transport [E,P] Major facilitator superfamily [P];transporter [E];
active transporter,secondary [P]
YCL043C PDI1 Protein folding [E];protein modification [E];protein degradation [E] Oxidoreductase [E];isomerase [E];chaperones [E]
YOR323C PRO2 Amino-acid metabolism[E] Oxidoreductase [E]
YOR283W YOR283W Carbohydrate metabolism[P] Hydrolase [P]
YMR246W FAA4 Lipid,fatty-acid and sterol metabolism[E] Ligase [E]
YGR124W ASN2 Amino-acid metabolism[E] Ligase [P]
YML058W YML058W Nucleotide metabolism[E];DNA repair [E];Regulatory subunit [P]
Chromatin/chromosome structure [E]
YIL078W SEC28 Vesicular transport [P] Vesicle coat protein [P]
YJR004C SAG1 Mating response [E];cell adhesion [E] Adhesin/agglutinin [E]
YLR347C KAP95 Nuclear–cytoplasmic transport [E,P] Nuclear import/export protein [E,P]
YHR208W BAT1 Amino-acid metabolism[P] Transferase [E]
YOA077W SED1 Cell structure [E] Unknown
YLR467W YLR467W Chromatin/chromosome structure [P] Hydrolase [P];helicase [E];ATPase [P];
DNA-binding protein [E]
YLR342W GLS1 Carbohydrate metabolism[E];cell wall maintenance [E] Transferase [E]
YCR002C CDC10 Cell polarity [E,P];cytokinesis [E] Hydrolase [P];GTP-binding protein/GTPase [P];
structural protein [E]
YPL232W SSO1 Vesicular transport [E];membrane fusion [E] Docking protein [E]
YOR332W VMA4 Small molecule transport [E,P] Hydrolase [P];ATPase [P];Transporter [P];
active transporter,primary [P]
For each gene its gene name fromthe SGD is given,together with its YPD identification,the role fromthe YPD,and the function fromthe YPD.In the role
and function columns a [P] indicates a predicted function and an [E] indicates an experimentally verified function.
first pass through the cell cycle after arrest,G1B group).
In addition,there is a sixth curve shown in red which
has a roughly constant level of expression but with a
small oscillator-like behavior superimposed (osc).Wavelet
decomposition of the oscillator data finds a periodicity
of ∼40 min.The ∼40 min cycle is consistent with
published reports on metabolic oscillations in S.cerevisiae
(Satroutdinov et al.,1992).It is also in agreement with a
genome-wide wavelet analysis of these data sets in which
both ∼40- and ∼80-min periodicities were uncovered
in the expression profiles of the majority of the genes
(Klevecz,2000;Klevecz and Dowse,2000).A selection
of the genes which are most strongly associated with
this pattern are shown in Table 1.Most appear to be
metabolic genes which would be expected to be involved
with routine cellular function.
Figure 4 demonstrates the ability of the algorithm to
identify the time behavior of expression (i.e.patterns).In
each case,the time behavior determined by BD is shown
with a thick solid line.Superimposed on this time curve
are the original data (thin lines) and average of the original
data (thick dashed line) for the genes which have at least
70% (80% for the G1A and the oscillator patterns) of
the time behavior explained by the single pattern shown.
These genes would typically be identified as a ‘cluster’ in
a clustering algorithm,as they essentially belong to only a
571
T.D.Moloshok et al.
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
M/G1
G1
G1
S
S
G2
M
M
M/G1
M/G1
G1
G1
S
G2
M
M
M/G1
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
M/G1
G1
G1
S
S
G2
M
M
M/G1
M/G1
G1
G1
S
G2
M
M
M/G1
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
M/G1
G1
G1
S
S
G2
M
M
M/G1
M/G1
G1
G1
S
G2
M
M
M/G1
0
0.5
1
1.5
2
2.5
3
3.5
M/G1
G1
G1
S
S
G2
M
M
M/G1
M/G1
G1
G1
S
G2
M
M
M/G1
0
1
2
3
4
5
6
7
8
9
M/G1
G1
G1
S
S
G2
M
M
M/G1
M/G1
G1
G1
S
G2
M
M
M/G1
0
0.5
1
1.5
2
2.5
3
M/G1
G1
G1
S
S
G2
M
M
M/G1
M/G1
G1
G1
S
G2
M
M
M/G1
(a) (b) (c)
(d) (e) (f)
Cell Cycle Phase
Cell Cycle Phase Cell Cycle Phase
Cell Cycle Phase
Cell Cycle Phase
Cell Cycle Phase
Relative Expression
Relative Expression
Relative Expression
Relative Expression
Relative Expression
Relative Expression
Fig.4.Genes strongly tied to a single pattern.The data for genes which have a minimum of 70% (G1B,G2/S,M,M/G1)–80% (G1A,
oscillator) of their expression behavior explained by a single pattern are plotted together with their mean (thick dashed line) and the pattern
(thick solid line) appearing in Figure 3:(a) G1B,genes tied to G1 phase with minimal expression in the first G1 phase after release fromcell
cycle arrest;(b) G1A,genes tied to G1 phase with expression in the first G1 phase;(c) genes tied to G2 and S phases;(d) genes tied to M
phase;(e) genes tied to Mand G1 phases;and (f) genes tied to the cell cycle oscillator.The genes are summarized in Tables 1 and 2.
single pattern.The identification of the specific genes for
the cell cycle clusters,a–e in Figure 4,are given in Table 2.
Figure 5 demonstrates the capabilities of the BD
algorithm to handle cases reflecting complex regula-
tion in response to multiple input signals.Since BD
simultaneously identifies patterns (e.g.time behaviors of
expression) and specifies how these must be combined
to recover a good model of expression for each gene,
genes which are multiply coregulated can be properly
assigned to multiple patterns (e.g.genes upregulated in G1
and G2 are assigned partially to G1 and partially to G2).
This means that rather than creating additional clusters
which contain genes which are multiply coregulated,a
physiologically meaningful number of patterns can be
combined to explain the expression behavior of such
genes.In Figure 5 six examples of the expression be-
havior of kinases are given.The transcriptional response
of many kinases is known to be multiply regulated (see,
for example,kinases in the SGD;Cherry et al.,1997),
so that they are upregulated in multiple parts of the cell
cycle.For each gene,the time series data is shown in
red and the model data reconstructed by BD is shown
in orange.BD has successfully matched the behavior of
each gene expression time series using only six patterns
by combining different amounts of each of the patterns.
For those genes which are turned on multiple times in
the cell cycle,BD is able to correctly assign these genes
to multiple cell cycle patterns.For example,in Figure 5a,
the expression behavior of ISR1 is shown.ISR1 is clearly
upregulated in S,G2,and G1,although its function is not
known.BD assigns this gene 56% to the G2/S pattern
(Figure 4c) and 30% to the G1 phases (Figure 4a and b).
On the other hand,RAD53 (Figure 5b),a G1 checkpoint
arrest protein,is clearly strongly tied to the G1 phase of
the cell cycle,and BD assigns RAD53 95% to the G1
phases.For each of the expression patterns shown,the
match from BD is excellent.As data quality improves,
this ability to construct a model that has physiologically
meaningful bases should increase in power.
DISCUSSION
BDhas shown its usefulness in the fields of image analysis
(Ochs et al.,2001) and spectroscopic analysis (Ochs et al.,
1999).In the analysis presented here,the ability of BD
to disentangle gene expression patterns that arise when
genes are coregulated at different points within the cell
cycle is demonstrated.For genes that are strongly tied
to a single aspect of the cell cycle,the results agree in
many ways with the recent work on the use of simulated
annealing to identify the optimal number of clusters
(Lukashin and Fuchs,2001).In that work,five clusters
from the 20 identified genes were linked strongly to the
572
BD of expression data
Table 2.All genes with more than 70%of their behavior explained by a single pattern.Identifications are standard names fromthe SGD
G1A G1B S/G2 M M/G1
YNL289W YGR152C YLR236C YNR067C YPL021W YPR157W YMR001C YPL187W
YER070W YLR235C YIL140W YOR263C YMR003W YPR156C YGR108W YHR126C
YNL102W YML060W YOR114W YHR137W YBL063W YGL116W YOL069W YOL132W
YDL003W YIL026W YNL309W YPL265W YNR009W YLR190W YPR119W YBR067C
YGR221C YBR070C YGR188C YGL089C YOR073W YGL209W YGR092W YDL037C
YPL153C YFL008W YPL241C YKL178C YDR451C YBR038W YDL048C
YBR089W YNL273W YJL181W YPL061W YEL061C YOR314W YML034W
YCL022C YHR154W YCL024W YBR054W YJR010W YOL014W YHR151C
YOL007C YNL312W YER095W YDR380W YOR315W YJR092W
YPL267W YOL090W YLR234W YNL173C YHR315W YJR058C
YML027W YHR153C YPR018W YDL179W YBR138C YGR143W
YLR103C YML102W YMR199W YCL040W YOR313C YJL051W
YKL113C YOL017W YDL018C YNL160W YOR025W YGL021W
YDR097C YCL061C YLR183C YKR039W YIL106W YGR035C
YLR313C YAR007C YNL233W YGR044C
YKR013W YCL060C YER111C
YJL074C YLL022C YDL096C
YBL035C YJL187C YLR326W
YOR074C YPL256C YGR109C
YAR008W YLR383W YCR065W
YBR088C YMR076C YDR507C
YMR078C YDL164C YNL300W
YFR027
cell cycle,similar to the five patterns (a–e) of Figure 4.
Because BD can mix these patterns together to explain
the expression behavior for each gene,it does not require
additional patterns in order to model the data.Instead BD
provides physiologically meaningful expression patterns
which together can describe all expression behavior.In
this way,BD eliminates the need for additional clusters
not required by the underlying biology,but instead needed
only because of the inability of a clustering algorithm to
partition the expression behavior of a single gene into
multiple groups.
One interesting issue with the five cell cycle clusters
identified both here and with the simulated annealing
clustering algorithm (Lukashin and Fuchs,2001) is the
lack of a peak in one pattern during the first cell cycle
phase.The first peak in the G1A cluster occurs at 20 min
and the second at 100 min,while the peak in the G1B
cluster occurs at 90–100 min with no corresponding peak
at 10–20 min.Reviewing the genes in the G1B cluster
(Table 2),it is seen that many genes encode proteins
involved in amino acid metabolism and carbohydrate
metabolism.The genes in the G1A cluster on the other
hand are primarily involved with DNA synthesis and
repair,mitosis,cell cycle control,and Pol II transcription.
Although speculative,one reasonable interpretation of the
absence of a peak at 10–20 min is that the necessary
building blocks (amino acids and carbohydrates) had been
processed prior to arrest,therefore eliminating the need
for the metabolic proteins and for the transcription of the
corresponding genes upon exit from arrest.This could be
a result of negative regulation of metabolic processes by
metabolic products (Rao and Arkin,2001).
The oscillator pattern identified by BD is of particular
interest (Figure 4f).This pattern appears to be tied to
routine metabolic processes within the cell as shown by
Table 1.It may be that this pattern has not drawn attention
previously because in clustering algorithms it appears as
one (or several) of many clusters,few of which can be
reasonably interpreted due to the problem of multiple
coregulation.BDhas the advantage of being able to assign
multiple coregulation appropriately to multiple patterns,
allowing the data to be explained with fewer,interpretable
patterns.This allows biologically relevant clusters to be
fully examined.
The ability to add additional prior knowledge through a
convolution function ( f s in Figure 2) in BD is a powerful
tool which has not yet been exploited in expression
analysis.We are presently exploring the addition of
573
T.D.Moloshok et al.
new distribution functions in two ways.First,we are
adding models of time behavior of mRNA species during
transcription in order to model expression time series data
more accurately.This is done by mapping a single atom
in the P atomic domain into a full curve (sampled at the
times in the experimental data set) in the model domain
(P matrix).This involves modelling the possible delays
between signal initiation and transcriptional initiation,the
rise time of mRNA following initiation of transcription,
and the half lives of mRNA species.These models
may be too complex in higher organisms with complex
transcriptional control,however simplifications may be
adequate to provide insight.For simpler organisms such
as yeast and bacteria,such modelling is likely to be
adequate.Second,we are introducing correlations in
the distributions by using known,reliable coregulation
(i.e.known common promoter elements) to determine
which genes must have their expression levels correlated.
Since BD is free to use multiple patterns to explain the
expression profile of a single gene and to add additional
genes to a coregulated group,this should not introduce
excessive bias.Still this method must be used carefully,
as it creates a strong prior in favor of the linking of
certain genes.With these and future improvements,BD
should continue to provide a good tool for gene expression
analysis in the future.
ACKNOWLEDGEMENTS
This work was supported by the National Institutes of
Health,National Cancer Institute (Comprehensive Cancer
Center Core grant CA06927 to R.Young) and the Pew
Foundation.This work was also supported in part by a
grant from the Beckman Foundation.We would like to
thank the referees for helpful suggestions.
REFERENCES
Alizadeh,A.A.,Eisen,M.B.,Davis,R.E.,Ma,C.,Lossos,I.S.,Rosen-
wald,A.,Boldrick,J.C.,Sabet,H.,Tran,T.,Yu,X.,Powell,J.I.,
Yang,L.,Marti,G.E.,Moore,T.,Hudson,J.Jr,Lu,L.,Lewis,D.B.,
Tibshirani,R.,Sherlock,G.,Chan,W.C.,Greiner,T.C.,Weisen-
burger,D.D.,Armitage,J.O.,Warnke,R.,Staudt,L.M.et al.(2000)
Distinct types of diffuse large B-cell lymphoma identified by
gene expression profiling.Nature,403,503–511.
Alter,O.,Brown,P.O.and Botstein,D.(2000) Singular value decom-
position for genome-wide expression data processing and mod-
eling.Proc.Natl Acad.Sci.USA,97,10 101–10 106.
Besag,J.,Green,P.,Higdon,D.and Mengersen,K.(1995) Bayesian
computation and stochastic systems.Stat.Sci.,10,3–66.
Bittner,M.,Meltzer,P.,Chen,Y.,Jiang,Y.,Seftor,E.,Hendrix,M.,
Radmacher,M.,Simon,R.,Yakhini,Z.,Ben-Dor,A.,Sampas,N.,
Dougherty,E.,Wang,E.,Marincola,F.,Gooden,C.,Lueders,J.,
Glatfelter,A.,Pollock,P.,Carpten,J.,Gillanders,E.,Leja,D.,Diet-
rich,K.,Beaudry,C.,Berens,M.,Alberts,D.and Sondak,V.(2000)
Molecular classification of cutaneous malignant melanoma by
gene expression profiling.Nature,406,536–540.
Brazma,A.and Vilo,J.(2000) Gene expression data analysis.FEBS
Lett.,480,17–24.
Brown,M.P.,Grundy,W.N.,Lin,D.,Cristianini,N.,Sugnet,C.W.,
Furey,T.S.,Ares,M.Jr and Haussler,D.(2000) Knowledge-based
analysis of microarray gene expression data by using support
vector machines.Proc.Natl Acad.Sci.USA,97,262–267.
Cherry,J.M.,Ball,C.,Weng,S.,Juvik,G.,Schmidt,R.,Adler,C.,
Dunn,B.,Dwight,S.,Riles,L.,Mortimer,R.K.and Botstein,D.
(1997) Genetic and physical maps of Saccharomyces cerevisiae.
Nature,387,67–73.
Cho,R.J.,Campbell,M.J.,Winzeler,E.A.,Steinmetz,L.,Conway,A.,
Wodicka,L.,Wolfsberg,T.G.,Gabrielian,A.E.,Landsman,D.,
Lockhart,D.J.and Davis,R.W.(1998) A genome-wide transcrip-
tional analysis of the mitotic cell cycle.Mol.Cell,2,65–73.
Claverie,J.M.(1999) Computational methods for the identification
of differential and coordinated gene expression.Hum.Mol.
Genet.,8,1821–1832.
Costanzo,M.C.,Hogan,J.D.,Cusick,M.E.,Davis,B.P.,
Fancher,A.M.,Hodges,P.E.,Kondu,P.,Lengieza,C.,Lew-
Smith,J.E.,Lingner,C.,Roberg-Perez,K.J.,Tillberg,M.,
Brooks,J.E.and Garrels,J.I.(2000) The Yeast Proteome
Database (YPD) and Caenorhabditis elegans proteome database
(WormPD):comprehensive resources for the organization and
comparison of model organism protein information.Nucleic
Acids Res.,28,73–76.
Costanzo,M.C.,Crawford,M.E.,Hirschman,J.E.,Kranz,J.E.,
Olsen,P.,Robertson,L.S.,Skrzypek,M.S.,Braun,B.R.,Hop-
kins,K.L.,Kondu,P.,Lengieza,C.,Lew-Smith,J.E.,Tillberg,M.
and Garrels,J.I.(2001) YPD,PombePD and WormPD:model
organism volumes of the BioKnowledge library,an integrated
resource for protein information.Nucleic Acids Res.,29,75–79.
Daubechies,I.(1992) Ten Lectures on Wavelets.Society for Indus-
trial and Applied Mathematics,Philadelphia,PA.
Eisen,M.B.,Spellman,P.T.,Brown,P.O.and Botstein,D.(1998)
Cluster analysis and display of genome-wide expression patterns.
Proc.Natl Acad.Sci.USA,95,14 863–14 868.
Geman,S.and Geman,D.(1984) Stochastic relaxation,Gibbs dis-
tributions,and the Bayesian restoration of images.IEEE Trans.
Pattern Anal.Mach.Intell.,PAMI 6,721–741.
Golub,T.R.,Slonim,D.K.,Tamayo,P.,Huard,C.,Gaasen-
beek,M.,Mesirov,J.P.,Coller,H.,Loh,M.L.,Downing,J.R.,
Caligiuri,M.A.,Bloomfield,C.D.and Lander,E.S.(1999) Molec-
ular classification of cancer:class discovery and class prediction
by gene expression monitoring.Science,286,531–537.
Heyer,L.J.,Kruglyak,S.and Yooseph,S.(1999) Exploring expres-
sion data:identification and analysis of coexpressed genes.
Genome Res.,9,1106–1115.
Iyer,V.R.,Eisen,M.B.,Ross,D.T.,Schuler,G.,Moore,T.,Lee,J.C.
F.,Trent,J.M.,Staudt,L.M.,Hudson,J.Jr,Boguski,M.S.,
Lashkari,D.,Shalon,D.,Botstein,D.and Brown,P.O.(1999) The
transcriptional program in the response of human fibroblasts to
serum.Science,283,83–87.
Kirkpatrick,S.,Gelatt,C.D.and Vecchi,M.P.(1983) Optimization by
simulated annealing.Science,220,671–680.
Klevecz,R.R.(2000) Dynamic architecture of the yeast cell cycle
uncovered by wavelet decomposition of expression microarray
data.Funct.Integr.Genom.,1,186–192.
574
BD of expression data
Klevecz,R.R.and Dowse,H.B.(2000) Tuning in the transcriptome:
basins of attraction in the yeast cell cycle.Cell Prolif.,33,209–
218.
Labadie,C.,Lee,J.-H.,Vetek,G.and Springer,C.S.Jr (1994) Relax-
ographic imaging.J.Magn.Reson.B,105,99–112.
Lee,D.D.and Seung,H.S.(1999) Learning the parts of objects by
non-negative matrix factorization.Nature,401,788–791.
Lockhart,D.J.and Winzeler,E.A.(2000) Genomics,gene expression
and DNA arrays.Nature,405,827–836.
Lukashin,A.V.and Fuchs,R.(2001) Analysis of temporal gene
expression profiles:clustering by simulated annealing and deter-
mining the optimal number of clusters.Bioinformatics,17,405–
414.
Ochs,M.F.,Stoyanova,R.S.,Arias-Mendoza,F.and Brown,T.R.
(1999) Anewmethod for spectral decomposition using a bilinear
Bayesian approach.J.Magn.Reson.,137,161–176.
Ochs,M.F.,Stoyanova,R.S.,Brown,T.R.,Rooney,W.D.and
Springer,C.S.Jr (2001) A Bayesian Markov chain Monte Carlo
solution of the bilinear problem.In Rychert,J.T.,Erickson,G.J.
and Smith,C.R.(eds),Bayesian Inference and Maximum En-
tropy Methods in Science and Engineering:19th International
Workshop.American Institute of Physics,Melville,pp.274–284.
Rao,C.V.and Arkin,A.P.(2001) Control motifs for intracellular
regulatory networks.Annu.Rev.Genet.,3,391–419.
Roberts,C.J.,Nelson,B.,Marton,M.J.,Stoughton,R.,Meyer,M.R.,
Bennett,H.A.,He,Y.D.,Dai,H.,Walker,W.L.,Hughes,T.R.,
Tyers,M.,Boone,C.and Friend,S.H.(2000) Signaling and cir-
cuitry of multiple MAPK pathways revealed by a matrix of
global gene expression profiles.Science,287,873–880.
Ross,D.T.,Scherf,U.,Eisen,M.B.,Perou,C.M.,Rees,C.,Spell-
man,P.,Iyer,V.,Jeffrey,S.S.,Van de Rijn,M.,Waltham,M.,Perga-
menschikov,A.,Lee,J.C.,Lashkari,D.,Shalon,D.,Myers,T.G.,
Weinstein,J.N.,Botstein,D.and Brown,P.O.(2000) Systematic
variation in gene expression patterns in human cancer cell lines.
Nature Genet.,24,227–235.
Roweis,S.T.and Saul,L.K.(2000) Nonlinear dimensionality reduc-
tion by locally linear embedding.Science,290,2323–2326.
Satroutdinov,A.D.,Kuriyama,H.and Kobayashi,H.(1992) Oscilla-
tory metabolismof Saccharomyces cerevisiae in continuous cul-
ture.FEMS Microbiol.Lett.,77,261–267.
Sibisi,S.and Skilling,J.(1997) Prior distributions on measure space.
J.R.Stat.Soc.B,59,217–235.
Spellman,P.T.,Sherlock,G.,Zhang,M.Q.,Iyer,V.R.,Anders,K.,
Eisen,M.B.,Brown,P.O.,Botstein,D.and Futcher,B.(1998)
Comprehensive identification of cell cycle-regulated genes of
the yeast Saccharomyces cerevisiae by microarray hybridization.
Mol.Biol.Cell,9,3273–3297.
Tamayo,P.,Slonim,D.,Mesirov,J.,Zhu,Q.,Kitareewan,S.,Dmitro-
vsky,E.,Lander,E.S.and Golub,T.R.(1999) Interpreting patterns
of gene expression with self-organizing maps:methods and ap-
plication to hematopoietic differentiation.Proc.Natl Acad.Sci.
USA,96,2907–2912.
Tenenbaum,J.B.,Silva,V.D.and Langford,J.C.(2000) A global
geometric framework for nonlinear dimensionality reduction.
Science,290,2319–2323.
Uetz,P.,Giot,L.,Cagney,G.,Mansfield,T.A.,Judson,R.S.,
Knight,J.R.,Lockshon,D.,Narayan,V.,Srinivasan,M.,
Pochart,P.,Qureshi-Emili,A.,Li,Y.,Godwin,B.,Conover,D.,
Kalbfleisch,T.,Vijayadamodar,G.,Yang,M.,Johnston,M.,
Fields,S.and Rothberg,J.M.(2000) A comprehensive analysis
of protein–protein interactions in Saccharomyces cerevisiae.
Nature,403,623–627.
Young,R.A.(2000) Biomedical discovery with DNA arrays.Cell,
102,9–15.
575