Comparative analysis of genomics data collections: Multi-species

ticketdonkeyΤεχνίτη Νοημοσύνη και Ρομποτική

25 Νοε 2013 (πριν από 3 χρόνια και 9 μήνες)

109 εμφανίσεις


Comparative analysis of genomics data collections:
Multi
-
species
Integrative Biclustering



Peter Waltman
1
, 3
*
, Thadeous
Kacmarczyk
2
*
, Ashley R Bate
2
, Patrick Eichenberger
2
+
, Richard
Bonneau
1
, 2, 3
+


1


Computer Science Department
, Courant Institute for
Mathematical Sciences
, New York University

2


Center for Genomics and Systems Biology,

Department

of Biology
, New York University

3


Computational Biology Program, New York University


*
These

authors contributed equally to this work:

Richard Bonneau
<
bonneau@nyu.edu
>,

Patrick Eichenberger <
pe19@nyu.edu
>



Running Title: multi
-
species biclustering


Keywords: motility, sporulation, clustering,

data
-
integration



Abstract

Extensive data from high
-
throughput experimental technologies
ha
s

enabled
new methods to
learn and
model
the

complex regulatory networks

that organisms use to respond to their ever
-
changing
environment
.

A key challenge in the a
nalysis of genomics data is the identification of modules of genes
with similar or identical regulatory controls
,
a

non
-
trivial problem due to the complexity of regulatory
networks.
Recent works that compare functional genomics data sets for closely related species suggest
that many co
-
regulated modules are conserved (in whole or in part) across several species. This
suggests that comparative
analysis

of functional genomics data
-
sets
could prove powerful in accurately
identifying biologically relevant conserved co
-
regulated groups.
Our approach to this problem integrates
collections of
data
-
types across multiple species to learn regulatory modules conserved across multiple
species. We
describe an extension of our cMonkey algorithm (a biclustering algorithm that integrates
multiple systems biology data
-
types to estimate condition depend
e
nt modules) that allows for the
simultaneous biclustering of data compendia spanning multiple species.

For each of the conserved co
-
regulated modules detected we also identify the experimental conditions (within the single species
data
-
sets supporting the module) the experimental condition over which the module is relevant.

W
e
identify conserved modules o
f orthologous genes, yielding evolutionary insights into the formation and
surprising high degree of conservation of regulatory modules. The method also provides a framework
that allows insights from well
-
studied organisms to be used to aid the analysis of

related but less well
studied organisms. We present results from the multi
-
species biclustering of a
G
ram positive group
containing
Bacillus subtilis
,
Bacillus anthracis
, and
Listeria monocytogenes
.



Introduction: Background & significance

In recent yea
rs, there has been a
tremendous increase

in
the
availability of high quality, high
-
throughput systems biology data.
F
or example, the Stanford Microarray Database

(Demeter,
Beauheim et al. 2007)

in 2000,
contained

9022
microarray and other transcriptome experiments
(of
which only 313 were public) now contains
nearly 73000

ex
periments
over 17000
of which

are publicly
available.

This ever
-
increasing supply of data presents systems biologists with
a wealth

of

new
opportunities

as wel
l as some daunting

new challenges. Specifically, more data
allows

systems
biologists to study
i
ncreasingly
more organisms in
a wide

number of experimental
perturbations

and
conditions
, and

to begin to
understand
how different organisms
coordinate genome
-
wide responses

to
their environment and extra
-
cellular

cues
.

Thus, by learning networks for sever
al organisms we can
begin to make general statements about the regulatory/control strategies used by cells and also begin
to build truly predictive dynamical models.
C
ommonly cited
advantages

of the explosion of genomics
data and data availability
revolve around
data completeness within a single organism. In this work we

address another advantage of the growing quantity and accessibility of genomics data; that we can
begin to assemble multiple
-
species data
-
sets and
,

by
comparing data for closely rel
ated organism
,

add
a critical evolutionary component to systems biology data analysis
.


As f
unctionally related genes are often
co
-
expressed, identifying correlated sets, or clusters, of genes is
frequently useful in identifying functionally related sets o
f genes, while also providing insight into the
roles of poorly understood genes.

These co
-
regulated or co
-
functi
onal groups, however, are often

expressed only under subsets of experimental conditions. Thus any attempt to detect conserved groups
of co
-
regul
ated orthologous proteins across mul
t
iple species data collections needs to simultaneously
address the following non
-
trivial challenges: 1) groups may be active or coherent in just subsets of the
conditions for each species,
2) in most case
s

there is not a

good correspondence between the
conditions in different species data
-
sets, 3
)
gene
gro
ups may not be conserved

in their regulation or
function
, 4) conserved co
-
regulated groups in many cases have extensive species speci
fic elaborations

that complicate the
ir detection
, 5) in many cases the orthology between genes in the species for which
we have data is not a one
-
to
-
one mapping and 6)

integration of additional data
-
types (networks, binding
sites, etc.) will have to be robust to essential differences in avai
lable data or annotat
i
on completeness
across species. We
describe
a
n

integrative mul
t
i
-
species biclustering algorithm that allows us to begin
addressing all of these challenges in a unified and scalable analysis.

W
e extend our previous
framework for discov
ering condition
-
depend
e
nt co
-
regulated (and often co
-
functional) modules to

enable the analysis of

multiple
species

datasets and demonstrate its use to

discover putative
conserve
d

co
-
regulated biclusters spanning
the

Gram

positive
organisms

Bacillus
subtilis, Bacillus
anthracis
, and
Listeria monocytogenes
. We demonstrate
that d
etection of co
-
regulated

groups is
greatly aided by comparative bicluster
i
ng, and that integration across multiple

species

improves the
performances of our biclustering algorith
m if followed by a species
-
specific elaboration of detected
conserved core co
-
regulated modules (where genes that have no detected orthologs across the
species set are included based on single species data sets
)
.


C
lustering algorithms

have
emerged as a
popular and effective tool in analyzing microarray data

in
identifying these correlated sets
.
However
,
many of the

clustering algorithms

routinely used in biology,
such as hierarchical agglomerative clustering (HAC)

(McQuitty 1966)
, k
-
means clustering

(MacQueen
1967)

and singular value decomposi
tion
(SVD)

(Golub and Kah
an 1965)

based methods

only allow

genes to participate in mutually exclusive clusters
.
As such
,
these algorithms
implicitly
assum
e
that
the
genes in a cluster
correlate

across all experimental
conditions

present
in
the data set being analyzed.
With

sm
aller data sets that represent

a narrow range of conditions, these
are

frequently sufficient to
capture the most significant underlying biological processes

(Alter, Brown et al. 2000; Alter, Brown et al.

2003)
.

A
s

genes may participa
t
e in multiple processes depende
nt upon an organism’s state

or cell’s
position on a developmental pathway
,
however,
these may not be effective
wh
en analyzing
data sets
that include
data

from
many

different
perturbation
experiments

(for example the datasets for many
prokaryotic species contain several hundred experiments and several multi
-
cellular organisms have
data collections with several thousand expression experiments)
.

Recently, biclustering, the
simultaneous
clustering of bo
th genes and conditions
,

was introduced
to address the biologi
cally
relevant condition depende
nce of co
-
regul
a
tion and co
-
expression patterns
(Cheng and Church 2000;
Tanay, Sharan et al. 2002; Ben
-
Dor, Ch
or et al. 2003; Bergmann, Ihmels et al. 2003; Kluger, Basri et al.
2003; Tanay, Sharan et al. 2004; Supper, Strauch et al. 2007; DiMaggio, McAllister et al. 2008; Gan,
Liew et al. 2008; Lu, Huggins et al. 2009)
. Rather than clustering
a data set into subsets of genes

over
all conditions
,
the goal of
biclustering
is to identify subsets of
genes
that correlate over a subset of
conditions
, called biclusters
.
In their
early
work introducing the concept

to the biology community
,
Cheng and Church employed a greedy algorithm
to find a set of genes and conditions that
minimize
d

the mean squared difference

from the overall signal of the bicluster.
O
ther

biclustering

approach
es
that

followed include SAMBA

(Tanay, Sharan et al. 2002; Tanay, Sharan et al. 2004)
,
which
employs a
weighted
probabilistic graph theoretic approach
;

and ISA (Iterative Signature Algorithm)
(Bergmann,
Ihmels et al. 2003)

which
also
uses
a greedy approach to find maximally coherent biclusters.

Many of
these early works
use
d

objective functions
base
d

sol
ely

upon expression data
.

Recently,
a number of
works have shown that the integration of data
-
types such as a variety of networks and binding site data
greatly improve the performance of estimation of co
-
funct
i
onal (
SAMBA
now
incorporates known
protein
-
protein

and
protein
-
DNA ass
ociations

into a Bayesian framework
)

(Tanay, Sharan et al. 2004)

and co
-
regulated groups
(Reiss, Baliga et al. 2006)
.


Recently, w
e developed a

new biclustering algorithm,
cMonkey

(Reiss, Baliga et al. 2006)
,
that

integrates evidence from multiple data types, including expression, sequence, and association
network data. This in
tegrative biclustering technique groups genes and conditions into
biclusters on the basis of 1) expression data across subsets of experimental conditions, 2)
shared putative regulatory motifs and 3)
protein
-
protein and protein
-
DNA

associations such as
meta
bolic, signaling
(Kanehisa 2002)
, protein
-
protein, and comparative genomics net
works

(Mellor, Yanai et al. 2002; Bowers, Pellegrini et al. 2004; Price, Huang et al. 2005)
. This

integration of multip
le data types allows
cMonkey

to generate biclusters with
multiple
supporting
evidences, while also
allowing

it
to avoid
overfitting based on any one data type.

W
e have
used
this biclustering method

with success in previous works for “pre
-
clustering”
genes prior to learning global regulatory influences. Briefly
,

biclusters are
built

via a Markov

chain process, in which each bicluster is iteratively optimized, and its state is updated based
upon condi
tional probability distributions computed using the
bi
cluster's previous state. This
enables
cMonkey

to
determine

the probability

that
a given
gene or condition belongs in the
bicluster,
dependent upon
the current state of the bicluster
.
The components of

this conditional
probability
(one for each of the different data
types)
are modeled independently
as
p
-
values
based upon individual data likelihoods, which are combined into a regression model to
de
termin
e the full conditional probability.

While up till
this point, the

three major distinct data
types used
have been
gene expression, upstream sequ
ences, and association networks, t
his
separation (and the MCMC optimization of the resulting score) allows for
the
straight
-
forward
integration of
additional

data
-
types

and will be the basis for our search for biclusters of
orthologous gene groups in multiple
-
species data collections
.

Here w
e
extend
cMonkey

to allow for evidence from multiple organisms to mutually inform the
biclusters that
are generated from putati
ve orthologs spanning the given species
.

Two main
motivations for this


and other


multi
-
species clustering methods are: 1) expression patterns
that are conserved across different species have a much greater chance of being biol
ogically
relevant and thu
s this

approach

can
improve accuracy and robustness by detecting such
conserved expression patterns
;

and 2) coupling clustering analysis across multiple species
gives us a window into the evolution and function of regulatory networks and co
-
regulated
modul
es.

Several groups have previously devised clustering methods that can take advantage of multiple
species data
-
sets. These p
rior methods

for
generally fall into two classes
;
for
excellent

review
s

of multiple
-
species cl
ustering
methods and their applicatio
n
see

(Tirosh, Bilu et al. 2007; Lu,
Huggins et al. 2009)
.


Matched condition multi
-
species clustering:

I
n
this class

of multi
-
species clustering
studies,

an attempt is made to match conditions
between
organism
s

in
order to identify similarities and differences for a given process
.

In many of these
studies,
the

comparisons were made between
closely related

species

such as primates

(Khaitovich, Hellmann et al. 2005; Gilad, Oshlack et al. 2006)

and fungi
(Tirosh, Weinberger et
al. 2006)
, though
, i
n one example

the genes involved in
aging
between two
distantly related
species,
Caenorhabditis elegans

and
Drosophila melanogaster
,
were detected
by matching
experiments from
matching
data points

based
on the ages of these
organisms

(McCarroll,
Murphy et al. 2004)
.
This approach, requiring matched conditions, limits studies of this type to
only those conditions for which a direct analog exists for both species, which are likely to be

only a small subset of the full
datasets
available

for any two species irrespective of their
evolutionary distance
. For less studied organisms, these

coupled experimental conditions

may
not exist at all.
Therefore this class of method is not
well
-
suited to large accumulated sets of
pub
lic experiments that were not designed

and collected expressly for the purpose of

coupled
multi
-
species analysis.

Comparative analysis on the basis of co
-
expression:


The

second
,

general class of comparative genomics analysis methods
employ a strategy wh
ere
the
datasets for each organism
are reduced
to a unit
-
less measure of co
-
expression (
for
example

Pearson’s correlation)
which are
then
used to compare co
-
expression patterns in
multiple
organisms
(Ih
mels, Bergmann et al. 2005; Tanay, Regev et al. 2005; Dutilh, Huynen et
al. 2006; Tirosh and Barkai 2007)
. These conserved co
-
expression studies have shed light on
the evolution of several species, the role of duplication in regulatory
network evolution, and the
occurrence of
convergent evolution of co
-
expression.


O
ne critical limitation to many

of the previous efforts described is that
distance metrics are

calculated

across all conditions within a data set for a given organism. For exampl
e
,

an early
method, the

Differential Clustering Algorithm, generate
s

pair
-
wise correlations matrices for each
organism,

which

are

combined into a single matrix by taking the upper
-
tr
iangles
in both species

and then clustered using a modified hierarchical clustering approach.

To
identify orthologs
, the

authors
used Inparanoid

(Remm, Storm et al. 2001)

and
constrain
ed

the search space
by
including

only
the reciprocal best BLAST matches as orthologs.
The search was

constrained by
appl
ying
the DCA

approach to only those genes
that

shar
e

the same GO annotation

or word (6
-

and 7
-
mers) in their promoter regions
.

Tirosh et al
(Tirosh and Barkai 2007)

developed a
method to calculate a measure called the
Iterative Comparison of Co
-
expression

(
I
CC
)
that

computes

a measure of c
onserved co
-
expression for each orthologous pair of genes in a given
pair of genomes
. W
ith this approach

the authors were able to shed light on the role of
duplication in the evolution of new regulatory patterns. In this work we do not directly compare
to

this method as the

ICC method is not
explicitly, as originally formulated,
a clustering
algorithm; rather it is a means for calculating a gene or ortholog
-
wise metric that gauges their
similarity of co
-
expression

for each possible pair
.

These
important early
studies suggest that uncovering the genetic and functional similarities
between different species, even those that are closely related have the potential to be fruitful

in
spite of the many additional considerations that arise when performi
ng biclustering and
clustering across multiple species
.
Multi
-
species cMonkey allows us to integrate data across

multiple species in a scalable fashion while still allowing for true biclustering, gene membership
in multiple clusters, and the integration o
f relevant data types such as up
-

and down
-
stream
binding sites, prior biological information in the form of networks (prior known regulatory effects,
binding sites, metabolic pathways, and evolutionary associations) and expression data.

M
ethods


We descr
ibe
here
the main steps in this algorithm, which is implemented in the R programming
language and freely available

(following publication, https://ms2.bio.nyu.edu/cMonkey2
-
trac/)
.
This
discussion focuses on the novel modifications to the algorithm that all
ow for
identifying

biclusters in
a
multi
-
species
context;

for a more detailed description of the multiple data
-
type cMonkey scoring
function (by which cMonkey scores biclusters and potential changes to biclusters during the
optimization) see

(Reiss, Baliga et al. 2006)
.



Determining a set of putative orthologs spanning relevant genomes:

The first step in our analysis is the identification of putative orthologs
shared between

the
organisms being
consider
ed
.

As
phylogenetic reconstruction
and
the
identific
ation of ortholog
sets between species
are not
primary
focuses of this
investigation
, we rely on publicly available
tools and
resources

to define our starting set of putative orthologs between two or more
species
.


D
ependent

upon the organism
s used
, there
may be
d
atabases that can provide these

ortholog sets,
such as the
well
-
annotated list of orthologs from the Mouse Genomics Informatics
database
(Bult, Eppig et al. 2008)

used by
Herschkowitz et al

during their comparison of human
and mouse cancer
(Herschkowitz, Simin et al. 2007)
.


In cases where a
pre
-
existing, curated

list
of orthologs is
un
available, we use the Inp
aranoid algorithm
(Remm, Storm et al. 2001)

as two
recent benchmar
ks

(Hulsen, Huynen et al. 2006; Chen
, Mackey et al. 2007)

determined it to be
among
the most accurate when identifying pairwise or
thology. Briefly described, Inp
aranoid
determines the best reciprocal BLAST hits between two
given
geno
mes considered, and when
possible, further expands these to
group putative orthologs into paralogous groups or supersets
of putative orthology
. Thus, it allows for the identification of

families
of orthologous

and
paralogous

genes that are
shared by 2
genomes
, rather just
single pair matches

(for example,
as the
cotZ

gene in
B. subtilis
has two possible orthologs in
B. anthracis
,
cotZ1

and
cotZ2
, both
the
cotZ
-
cotZ1

and
cotZ
-
cotZ2

pairs will be considered by our algorithm)
.


This feature of the
Inparano
id algorithm
allows for more

permissive sets of putative orthology
from which we can
sample
(
as our goal is to have

a low false negat
ive rate
which we can later correct

in a data
driven
manner
, if necessary
).




For the purposes of describing the core eleme
nts of the algorithm
, and in our initial
implementation

and testing

of the algorithms,

we will focus on

the case where

two
organisms
are biclustered

using the coupled algorithm
, although the algorithm
can be trivially extended to
simultaneously bicluster more than two genomes in a single
coupled
analysis
.

For any two
genomes,
G
U

and

G
V
,
we
use

OC
U

and

OC
V

to refer
to
the
portions of these genomes with one or
more orthologs in the other genome, which we

term ‘
orthologous cores


of these genome
s
.
T
hese are
t
he subset
s

of genes that have orthologs
common to

both genomes
such that
OC
U



G
U

and

OC
V



G
V
.

Furthermore, we will use
OC
UV

to refer to the list of
all possible pairings of
orthologs between the
species
, which we will refer to as

orthologous pairs


or

orthologous
gene
-
pairs

, rather than genes. In the case of

gene

families,
where genes have putative
orthology relationships with several members in the other genome,

we allow
the algorit
hm to
sepa
rately consider a

single
-
gene pair for each of the possible pair
-
wise relationships
.
Thus, if
we have a family
,
f
,

that has 4 members in genome

U
,


3
1
U
U
f
U
g
,
g
OC


,

and 3 in genome

V
,


3
1
V
V
f
V
g
,
g
OC


,
this will result in 12

possible pairs for this family, i.e.


3
4
2
4
2
1
1
1
V
U
V
U
V
U
V
U
f
UV
g
g
,
g
g
,
,
g
g
,
g
g
OC


.
Building multiple
-
species biclusters then consists of
seeding a bicluster (selecting a
starting
subset of
orthologous

pairs to define as the starting
bicluster
; this can be done via selection
of a random subset of orthologous pairs
) and then
adding and dropping ortholog pairs to and from the bicluster until convergence.


Thus, when
operating in the multi
-
species data space our algorithm operates only on these
orthologous
pairs and only
. We

dea
l with single genes in
a completely distinct later stage of the procedure
where we search for species specific

elaboration
of a set of biclusters detected in the ortholog
-
pair space
.


Optimize the multi
-
species or shared
-
space biclusters:

This step is the core of the method and the key advantage of the
cMonkey

approach in the
multi
-
species regime
,
describe
d

schematically
in figure
1
.

We start by

generating a
multi
-
species
, shared
-
space
seed
bicluster from the set of pairs in
OC
UV
, u
sing a
strategy similar to
the one employed by the original cMonkey algorithm

(described in greater detail in the
supplementary info), we begin the shared
-
space optimization.

For any pair in
OC
UV
, t
he
derivat
ion of the score
for

adding that pair to (or dropping

it

from) a given bicluster in the multi
-
species case is a straight
-
forward extension of the single
-
species version of the algorithm.
L
et
ting

X
U

and
X
V

represent the expression datasets for the two genomes

considered, a

single
-
species
bicluster is defined
a
s a

set of genes and a set of conditions in
either

X
U

or

X
V
.

Given a
multispecies bicluster with genes and conditions in

X
U
and
X
V

we

calculate
two scores for each

gene in the genome (g and g given the bicluster
k

for all genes in
OC
uv
) that are then
combined
into a single multi
-
species score that is optimized by the iterative adding and dropping o genes
and conditions
.
In the multi
-
species, shared
-
space
,

we combine the singe
-
species scores
for
the genes corresponding to those of a given pair
(calcula
ted sepa
rately for
each organism
) to
compute

the multi
-
species score
π
ik
:







V
ik
U
ik
V
ik
U
ik
ik
ik
g
g
exp
g
,
g
|
y
p





1
0
1




w
here

U
ik
g

and
V
ik
g

are the species
-
specific likelihoods for the members of pair
i

for bicluster
k
, and
β
0

and
β
1

are the parameters of a logistic regression
, fitted to the combined single
-
species scores for the
pairs in
OC
UV
.

For each organism
j





V
,
U
j

,

g

is
defined, similar to that of the purely single
-
species
cMonkey
,

as:












N
n
j
ni k
n
j
i k
j
i k
j
i k
q
~
l og
q
s
~
l og
s
r
~
l og
r
g
0
0
0

w
here

j
ik
r
~
,
j
ik
s
~
, and
j
nik
q
~

are the individual likelihoods for the expression, sequence and networks, as
defined by our earlier work.


T
hus, t
he probability that the gene pair in
OC
UV

is added to the growing bicluster is
a well balanced
function
of the
evidence derived from
the integrated data
-
set for
each species
,

formulated as the two
multi
-
data scores
,
1
ik
g

and
2
ik
g
, that represent the
individual
species support values

for
each gene in an
orthologous pair

(
1
ik
g

and
2
ik
g

represent the multi
-
data
-
type integration for each organism separately
and
π
ik

effects the multi
-
species integration).

With this formulation of

the multiple species score we
keep information from each species separated until the de
c
ision surface is estimated via
π
ik
. Thus
w
e
get the advantages of coupling the genes explicitly across the
two species
without any lossy
transformations
of the data, s
uch as
converting
the expression data in
to an inner
-
product space;
like

Pearson correlation. Once this coupled version of the
cMonkey

score is obtained
,

the algorithm
progresses
nearly
exactly as in the single species version of
cMonkey,

adding and removing pairs from
the bicluster

during
each iteration
,

and stopping when conve
rgence criteria are met

(Reiss, Baliga et al.
2006; Bonneau, Facciotti et al. 2007)
.

Note, for the sake of clarity, when optimizing in the share
-
space,
the
individual scoring components
,
j
ik
r
~
,
j
ik
s
~
, and
j
nik
q
~
, a
re only
fit to the genes in
OC
UV

rather than the full
genomes of each organism. Last, there exists one key difference

between

the shared
-

and single
-
spaced optimization
s, in that

while the method allows pairs to participate in multiple biclusters,
we limit
a
ny

given bicluster

to inclu
d
ing

only
a single

pair from any
one particular

ortholog family.




Note, this framework can easily be extended to more than 2 organisms, where the likelihood of the
multiple “n
-
pairs”
for the organisms in
N

would be defined as:


















N
n
n
ik
N
ik
ik
N
ik
ik
ik
g
exp
g
,
,
g
|
y
,
,
y
p
1
0
1
1
1
1







Identify species
-
specific elaborations of conserved core biclusters:


In th
is
step, we identify species
-
specific modifications to the shared modules that are discovered during
the shared
-
space search. To do this, we decouple the orthologs pairs from the
shared
-
space modules
to generate two biclusters, one for each organism,
which

represent the conserved cores of putative
,

conserved
,

c
o
-
regulated modules. These effe
ctively serve as 'super
-
seeds'
for this step, which
are
each separately

optimized in a mann
er similar to the original single
-
species cMonkey method,
but

now
considering the full genomes of each respective organism. Unlike the original method, though
,
in this
step,
we anchor the
se

search
es

by preventing genes from the original shared
-
space ortho
logous cores
from being dropped. In so doing, we maintain the original shared
-
space
module
, while allowing
the
addition
of
any
species
-
specific

or orthologous core genes that improve the overall score of the
bicluster. Unlike the
‘one pair for each
family’ rule

of the previous step, during this optimization, we
permit the addition of paralogous genes to biclusters containing a member of an orthologous family,
thus allowing for the potential identification of dosage selection of paralogous genes.

Fin
ally
, unlike
either
the shared
-
space or single
-
species

optimization
s

where the mixing parameters,
r
0
,
p
0

and
q
0
,
follow a structured annealing schedule during the optimization,
in this optimization
step, we hold these
constant, using the final values from
the shared optimizatio
n for these.


Once the multi
-
species analysis has been completed, as a final step, any species
-
specific modules
that
are unique to each organism
can be identified by running single
-
species cMonkey on the un
-
biclustered
genes of each o
rganism’s genomes.

W
e
direct the reader to the supplementary material for
a
more
detailed
description and

discussion
of this step

as it is not a main focus of our method
.

T
hese last two
steps (the species
-
specific steps) provide
our
method
the strength and flexibility
to identify both
conserved and species
-
specific modules
,
giving a correct limiting behavior across a wide range of
regardless of the evolutionary distance
s

between the organisms analyzed
.


Visualization and exploration of biclusters:

The results from this analysis (biclusters spanning multiple species data sets and the input data sets)
can be visualized using the Gaggle, Cytoscape, and the MeV (
h
ttp://www.tm4.org/mev.html

) as
described in
(Reiss, Baliga et al. 2006; Shannon, Reiss et al. 2006)
. Briefly, biclusters are represented

as meta
-
nodes (nodes representing all genes within the bicluster) embedded in a network of network
edges between biclusters. The Gaggle is used to pass messages to tools designed to expl
ore the
contents of biclusters (expression data, networks within biclusters, annotations, genome context). Our
main modification for this work was to modify the
operation

of the Gaggle so that messages about the
biclustering are broadcast to networks repre
sented for both species (for example if the user selects
cotZ in the
B. subtilis

network then the genes cotZ1 and cotZ2 are selected automatically in the
B.
anthracis network
by a Gaggle enabled ortholog translator tool). All tools and programs are set up

as
platform independent Java web
-
starts at ( http: ); an online tutorial for exploring the multi
-
species
biclustering using the Gaggle can be found at ( http: ).


Gauging conservation between biclusters

W
e
compare our multi
-
species method to the
single
-
species version of the algorithm and thus need to
address or quantitate how well the single species version of the algorithm would do at detecting
conserved co
-
regulation if the algorithm was run independently on each species data
-
set and then
align
ed and compared after the uncoupled single species biclusterings or clusterings. To make this
comparison
,

we
introduce a new conservation metric
, based on the F
-
statistic
(Stein, Eissen et al.
2003)

to gauge the degree of conservation between a bicluster from one organism to that from another.

Thus, for 2 organisms,
U

and
V
, we measure the degree of conservation between biclusters,
U
k
b
and

V
l
b
,
for each, as



V
l
U
k
V
l
V
l
U
k
U
k
b
,
b
b
b
,
b
b
V
l
U
k
OC
OC
OC
OC
b
,
b
Cons


2



w
here
U
k
b
OC
is the set of genes from bicluster
U
k
b

that belong to the orthologous core for genome
U
,
OC
U
, (where
U
k
b
b
OC
U
k

and
U
U
b
G
OC
OC
U
k


). Likewise
V
l
b
OC
is defined similarly for genome
V

and
OC
V
.
V
l
U
k
V
l
U
k
b
b
b
,
b
OC
OC
OC



is the set of genes in bicluster
U
k
b

with direct matching orthologs in
V
l
b
, and
vice versa. This measure
is
similar to an F
-
statistic that gauges the ability of the biclustering for one
organism to recall the biclustering of another

(after orthology based alignment/matching of the
biclusters)
.
T
o make this metric more general
,
in the case of multi
-
member orthologous families,
we
consider

all possible pairs,
includ
ing
those with
putative para
logs
, such as those

returned by
Inparanoid.



Multi
-
species k
-
means

W
e also
re
-
implemented a simple multi
-
species k
-
means method similar to that which Herschkowitz et
al used in their analysis
(Herschkowitz, Simin et al. 2007)
.
In this
simple
method,
only the
reciprocal
best Blast matches are selected as orthologous pairs
. T
hese
o
ne
-
to
-
one pairwise
relationships
are first
used to form a concatenated expression matrix, so that a row in this matrix corresponds to the
concatenation of the expression data for 2 orthologous genes. This c
oncatenated expression matrix i
s
ne
xt submitted t
o k
-
means
, using the Euclidean distance metric and

with k=
150 (as this was the same
size used for the
test of the
multi
-
species
cMonkey
method)

to generate what we will call shared k
-
means clusters. Next,
as a modification to Herschkowitz’s shared k
-
means

algorithm, we have added

a
subsequent step

that is

similar to the elaboration step of the multi
-
species cMonkey algorithm,
where
the shared k
-
means clusters
a
re used to anchor

new k
-
means clustering
s

of each organism’s full
genome.
In this step
, the

components of the

shared
k
-
means centr
o
i
ds
a
re first split
by organism to
those which correspond to the organism
-
specific conditions in the concatenated expression data set
.


For each organism, then, the
organism
-
specific
shared k
-
means
(sub
-
)
centr
o
i
ds ar
e

used to perform a
Voronoi partitioning of

that organism’s
non
-
orthologous core

expression data
.

Thus
, in this step, the
orthologous genes that belonged to the ori
ginal shared k
-
means clusters a
re required
, by this
partitioning,

to remain in their origin
al cluster. The results of this last step we will henceforth call
elaborated k
-
means clusters. In addition to this k
-
means method, we also
considered comparing

our
method to a more sophisticated multi
-
species method, however, as previously mentioned, nei
ther the
DCA
(Ihmels, Bergmann et al. 2005)
, nor ICC
(Tirosh and Barkai 2007)

methods are general clustering
algorithms, nor do they lend themselves well to comparison with
our multi
-
species cMonkey

method.


Results

In this section, we
review
the analysis
generated by our multi
-
species method when applied to the three
pairings that are possible for three closely
-
related prokaryotes. We describe the data sets collected for
each organism, as well as the results of these analyses. These results include both
significant
biological m
odules retrieved, as well benchmarking

comparisons to both our original single
-
species
cMonkey algorithm and a
simple multiple
-
species

k
-
means method that has been adapted to multi
-
species analysis
,
introduced in the methods section
. During these benchmarking comparisons, we use
several common
ly used

global metrics, as well the
conservation

metric
, also

introduced

in the methods
section
.


Data

sets used for initial test of method
:

To evaluate this method, we performed the optimizati
on on the three possible pairings between three
closely related species of
Gram
positive

bacteria
:

the
main
model organism

for Gram positive bacteria,


Bacillus subtilis
, as well as the pathogens
Bacillus anthracis

and

Listeria monocytogenes
. As one of
the earliest and best studied
bacterial
model organisms,
B. subtilis
,
was selected due to the wealth of
publicly available genomic data

for it
. Additionally, it was selected

due to the
similarity of its
life
cycle to
that of

B. anthracis

as b
oth
alternate between vegetative cell and dormant spore states
(Piggot and
Coote 1976; Stragier and Losick 1996; Errington 2003; Waltman, Kacmarczyk et al. 2009)
.

The third
member of the triplet,
L. monocytogenes
, was selected
to be a closely
-
related outgroup of the previous
two, sharing

with them a similar morphology, but
lacking

the

ability to form spores
. Evolutionarily,

the
B
acillus
and
Listeria
gen
era

are
estimated to have separated
1.
2
5 BYA

(Battistuzzi, Feijao et al. 2004)


For
B. subtilis
, we compiled an expression data matrix
that

consisted of 314 conditions from 15 studies

that

examin
e

the regulons of over 40 known transcription
al regulators

and sigma factors
(Kobayashi,
Ogura et al. 2001; Ogura, Yamaguchi et al. 2001; Yoshida, Kobayashi et al. 2001; Ogura, Yamaguchi
et al. 2002; Asai, Yamaguchi et al. 2003; Doan, Servant et al. 2003; Eichenberger, Jensen et al. 2003;
Molle
, Nakaura et al. 2003; Tojo, Matsunaga et al. 2003; Watanabe, Hamano et al. 2003; Yoshida,
Yamaguchi et al. 2003; Bunai, Ariga et al. 2004; Eichenberger, Fujita et al. 2004; Serizawa, Yamamoto
et al. 2004; Yoshida, Ohki et al. 2004; Hayashi, Ohsawa et al.
2005; Hayashi, Kensuke et al. 2006;
Wang, Setlow et al. 2006)
. For the two pathogens,
the
L. monocytogenes

express
ion matrix

contained
56

conditions that were compiled from
8 studies covering early stationary phase, salt, alkali, and cold
shocks

(Marr, Joseph et al. 2006; Hu, Oliver et al. 2007; Hu, Raengpradub et al. 2007; Severino,
Dussurget et al. 2007; Bowman, Bittencourt et al. 2008; Giotis, Muthaiyan et al. 2008; Raengpradub,
Wiedmann et al. 2008)
;

while the
B. anthracis

matrix
contained 51 conditions
from a
single study
by
Bergman et al
(Bergman, Anderson et al. 2006)

covering the full life
-
cycle of the
B. anthracis

Sterne
strain
. All data was collected from the GEO omnibus

database

(Edgar, Domrachev et al. 2002; Barrett,
Troup et al. 2007)
.


In addition to these expression data sets,
we also
included upstream

sequence

data

(200 bases upstream of the start codon)
, collected from
RSA
-
T
ools
(van Helden 2003)

as well as
network associations from KEGG

(Kanehisa and Goto 2000; Kanehisa 2002; Kanehisa, Goto et al.
2006; Kanehisa, Araki et al. 2008)
,
P
rolinks
(Bowers, Pellegrini et al. 2004)

and P
redictome

(Mellor,
Yanai et al. 2002)
.



W
e used Inparanoid to identify

putative sets of orthologs between these three species
(running
Inparanoid three times on each pair of species)
.

U
sing Inp
aranoid with the default settings (ava
ilable in
supplementary material), we identified 2225 orthologous groups between B. subtilis and B. anthracis,
1439 between B. subtilis and L. monocytogenes, and 1494 between B. anthracis and L.
m
onocytogenes
. Not to belabor the point, but we emphasize th
ese are the total number of groups,
while the total number of possible pairs and genes is larger, as we also include non
-
best
-
matching

orthologs in our analysis.


Table
s
S1

and
S2

in the supplementary material
pr
ovide

full listing
s

of the

number of genes, conditions
and edges (by network association) in our

databas
e for each organism, as well as

the

total number of
genes,
ortholog
s
and
ortholog familie
s
for each organism pairing.



Conserved modules
detect
ed
by the mult
i
-
species analysis

For each organism pair, we generated 150 biclusters in the shared space that were then elaborated.
To illustrate the strength of our method’s ability to discriminate and identify conserved transcriptional
modules, we focus on
two

processes


sporulation a
nd flagellar assembly
, which confers the ability to
swim in the direction of specific chemical attractants (chemotaxis),


as these were either known or
believed to be present in
two

of the species, but missing in the third.
In the case of sporulation, b
ot
h
B.
anthracis

and
B. subtilis

sporulate, while
L. monocytogenes

cannot

(Eichenberger, Jensen et al. 2003;
Liu, Bergman et al. 2004)
.


Similarly
,

both

B. subtilis

and
L. monocytogenes

possess
functioning
flagella
and are motile,
while
B. anthracis

is believed

to
be non
-
motile as it
lack
s

this

structure

(although
the closely related species
B. cereus

is both flagellated and motile)

(Rasko, Ravel et al. 2004)
.

In both
cases, we believed our multi
-
species analysis would only identify modules associated with these
processes when the appropriate organisms were paired together
. As expected, sporulation modules
were only identified between
B. subtilis

and
B.

anthracis
, but a key difference in their sporulation
programs emerged as well. In contrast, strong flagellar modules were identified for all three pairings,
including those with
B. anthracis
, with subsequent experimental validation indicating
that the st
rain
used in the
B. a
nthracis

microarray studies,
B. anthracis

Sterne strain
, is motile
.


Biclusters involved in sporulation

shared between

B. subtilis
and

B. anthracis

Endospore formation (hereafter: s
porulation
)

is
a
process of cell differentiation that
some

Gram
-
positive bacteria undergo as a response to hostile environmental conditions such as resource
depletion. During the
initial stages of the
process, a cell

will

cease to grow and
undergo
an
asymmetric
cell division resulting in a larger cell
termed
the mother cell,
and a smaller cell called the
forespore
.
T
he forespore will eventually become a metabolically dormant cell type called an endospore
(hereafter:
spore)
that is resistant to numerous physical extrema
.
Following the asymmetric divisi
on,

the mother
cell engulfs the forespore

and

a thick layer of
modified peptidoglycan is synthesized in the inter
-
membrane space that

surrounds
the forespore
, forming the spore cortex,

upon which a thick layer of
protein is added to create the spore coat
.

In the final step, the mother cell

lyses
, freeing the completed
endospore

(Piggot and Coote 1976; Stragier and Losick 1996; Errington 2003)
. As mentioned above,

both
Bacillus

organisms in our study


like al
l members of the
Bacillus

genus


have
the ability to
sporulate, while
L. monocytogenes

lack
s

this ability

completely.
For t
hese reasons, we considered this
process as a
useful benchmark of the method,
as
the
B. subtilis
-

B. anthracis
pairing
sh
ould
identify
modules involved in sporulation
, while
any pairing
with
L. monocytogenes

would not.


As expected, neither of the pairings involving
L. monocytogenes

generated biclusters involved with
sporulation
,
an unsurprising
, but important verification of the

method

(
as
L. monocytogenes

lacks mos
t
of the genes involved in sporu
lation
)
. However, the multi
-
species method
identifi
ed
three sporulation

biclusters from the
B. subtilis
-

B. anthracis

pairing
,

where the
B. subtilis

portion of the
se
shared
bicluster
s

(the
B. subtilis
genes from the ortholog pairs of each)
contained significant numbers of genes
regulated by σ
E
, the first regulator in the
B. subtilis
mother cell
line of gene transcription
(Eichenberger,
Fujita et al. 2004)
.
E
ach of the
se

three biclusters

(biclusters 84, 35 and 17)

contained
a

distinct set of
genes
,

with distinct biological function. For example, bicluster 84 contained gen
es involved with
metabolism and the
early stage
of sporulation; while bicluster 35 primarily contain
ed

genes
involved in

cortex
synthesis, glutamine transport
and
activation

of
the sporulation
σ

factors,

σ
G

and σ
K
,
and
bicluster 17 contained
mostly
spore coat genes. Additionally, there were four unexpected members of
the shared orthologous core of bicluster 84 for both species,
ctaCDEF
, which encode the four subunits
of cytochrome
C

oxidase. In
B. subtilis
, it is know
n

that these genes are subject to catabolite
repression by glucose
, which prevents their expression during exponential growth

(Liu and Taber
1998)
. Furthermore, while these genes are known to
be
regulated by Spo0A

(the master regulator of
the initiation of sporulation)
, they have not yet been shown to be regulated by σ
F

or σ
E

in previous
microa
rray experiments
(Eichenberger, Fujita et al. 2004; Steil, Serrano et al. 2
005; Wang, Setlow et al.
2006)
.
However, upon examination of the intercistronic region between
ctaB

and
ctaC

in
B. subtilis

it
appears

that there is a possible σ
E

binding site which may allow for σ
E

regulation of

the

ctaCDEF

operon

(Figure 2). Protracted expression of these genes is consistent with the conclusions of previous
studies indicating that Krebs cycle
function (and by extension electron transport chain) is required
during sporulation
(Ireton, Jin et al. 1995; Jin, Levin et al. 1997; Matsuno, Blais et al. 1999)
.


Additionally, our method reveals that

these biclusters have a temporal aspect that suggests a
significant
kinetic
difference between the sporulation programs of the two
Bacillus

species.
T
he
expression data we used for
B. anthracis

(Bergman, Anderson et al. 2006)
is

a transcriptional profile of
the entire 8 hour life
-
cycle, from germination through sporulation
that
i
dentified 5 waves of gene
expression, each characterized by a distinct set of genes, whose activation followed a strict temporal
signature that corresponds to different phases of the
B. anthracis

life
-
cycle
. These waves included

germination and outgrowth (
wave 1); rapid growth (wave 2); toxic environment response (wave 3); and

two
separate sets involved in sporulation (waves 4 and 5, where wave 5 includes genes also up
-
regulated during germination). As would be suggested by their functional composition, th
e
B. anthracis

biclusters corresponded to wave 3 for bicluster 84 (toxic environment response), while bicluster 35
(cortex formation) was composed of genes involved in wave 4, and bicluster 17

(spore coat)
contained
roughly an equal number of genes in wave
s 4 and 5
.


However, t
he tight coupling of the multi
-
species method
also
allowed us to
compare the expression
profiles of these three biclusters for each
organism

to investigate whether

evidence of

a similar
transcription cascade exists for
B. subtilis
.

Surprisingly, while there are

three

distinct expression profiles

for the
B. anthracis

biclusters, no
obvious

expression
difference exists for the
B. subtilis

biclusters

(Figure
##
)

as most genes in the
B. subtilis

biclusters are expressed under the control of the early
mother cell
σ

factor
σ
E

(with a few genes under the control of the late mother cell
σ

factor
σ
K

and
secondary effects caused by the activation of the mother cell
-
specific sporulation transcription fa
ctors
SpoIIID, GerR and GerE
(Eichenberger, Fujita et al. 2004)
)
.
Although

the
B. subtilis

data
set
we used
did not contain
a
similar
ly

explicit
time series

of the
B. subtilis

lifecycle, this
observation
may indicate
that there may
not
be
as many
wave
s
of gene expression needed for spore assembly

in
B. subtilis

as in
B. anthracis
.

This is further supported by the fact that
σ
E

in

B. anthracis

is not activated until wave 4,
indicating a potential transcriptional re
-
wiring between the two species for the genes in bicluster 84
(from wave 3) suggesting th
at in
B. anthracis
they
may be

under the control of a different transcription
factor

that is active
prior to
σ
E

activation
.



This difference in the expression profiles for the two organisms for these 3 biclusters was further
reflected during the elaboration step when the 3
B. anthracis

biclusters remained completely distinct
with respect to their gene membership, while the
B. su
btilis

biclusters gravitated to each other with
many of the genes from each distinct set added to each other.



Flagellar Assembly & Chemotaxis Biclusters

The bacterial flagellum assembly pathway is a well
-
known pathway
(Figure XXXA)

that has been
studied
over a wide range of prokaryotes

(Penn and Luke 1992; Cascales, Lloubes et al. 2001; Aldridge
and Hughes 2002; Charon and Goldstein 2002; Bardy, Ng et al. 2003; Berg 2003; Blocker, Komoriya et
al. 2003; Macnab 2003; Soutourina and Bertin 2003; Macnab
2004; McCarter 2006; Murphy,
Leadbetter et al. 2006; Pallen and Matzke 2006; Liu and Ochman 2007; Liu and Ochman 2007)
. This
pathway contains approximately 24 genes conserved across numerous species, though not all these
species are motile
(Liu and Ochman 2007)

T
he flagellar assembly pathway was considered as another
potentially useful benchmark of the multi
-
species method, because while
B. subtilis

and
L.

monocytogenes

are motile,
B. anthracis

is believed to be non
-
motile and non
-
flagellated

(Rasko, Ravel
et al. 2004)
.

Therefore, it was thought
prior to carrying out our analysis
that multi
-
species integrative biclustering of
B. subtilis
-
L. monocytogenes

would identify modules enriched with flagellar assembly genes, whereas
pairings between
B. subtilis
or
L. monocytogenes

with
B. anthracis

would be unable to recover such
modules.
However, in contrast to our prior hypothesis, we discovered that the multi
-
s
pecies
biclustering of the pairings that included
B. anthracis

also retrieved flagellar modules

(Figure XXXB)
;
and that their retrieval was well supported by the
B. anthracis

portion of the multi
-
data score
.

This
result was unexpected because one would presume
that due to selective pressures, the loss of flagellar
biosynthesis and motility would be rapidly followed by gene loss and certainly loss of active expression
and regulation
.

Based on genome annotat
ions for
B. anthracis

(Kanehisa 2009; Kanehisa 2009)
,i
t was found that
B.
anthracis

str. Sterne


the strain used for the expression data analyzed by our method


lacks five
genes which are present in
B. subtilis
; these include
fliK
,
fliO
,
fliJ
,
f
liT

and
flgM
.
F
our of these genes
are either essential or likely to be essential for flagellar assembly in
B. subtilis

(Dan Kearns, Indiana
University, personal communication, Table ##)
. Nonetheless, further investigation revealed that these
genes are al
so missing in
L. monocytogenes
and
B. cereus


which
are
known to be both flagellated and
motile.


Table
1
:

Predicted
B. subtilis f
lagellar Assembly Genes
that are
Missing in B. anthracis
, and their associated function
.
The
genes i
n the table are present in the B. subtilis flagellar assembly pathway as indicated by KEGG, but missing in B. anthracis.

Gene

Function

flgM

T
he anti
-
α
D

anti
-
sigma factor

fliJ

P
art of the Type II Secretion Chaperone
-
Usher Complex

fliK

T
he hook length
regulator

fliO

P
art of the Type III Secretion Apparatus

flit

C
haperone



Next, the presence or absence of flagellar assembly and chemotaxis genes was determined in
L.
monocytogenes

along with all the strains and species of
Bacilli

represented in KEGG (T
able ###) in
order to better understand the necessity of these genes for motility.
To verify that these organisms truly
lacked these genes present in
B. subtilis
, BLAST searches were performed

for each organism
, using
B.
subtilis

str. 168 protein sequences

as reference, to determine if similar coding sequences existed. In
the case of
B. amyloliqufaciens
and
B. pumilus

the BLAST search resulted in a very strong
match

for
the ‘missing’
fliK

gene
and

was
subsequently

annotated as such. In contrast, none of the other
organisms (
B. cereus
,
L. monocytogenes
,
B. anthracis
str. Sterne or
B. anthracis

str. Ames)
contained


strong match
es

to the
B. subtilis

proteins
in question
, indicating that these genes are not present in

the
genomes of these
organism
s
.

Table
###
: Comparison

of the Genes M
issing from the KEGG Flagellar Assembly

and Chemotaxis Pathways

(
Bacilli
and
L.
monocytogenes

Examined)
. As shown in the table, the genetic composition of the B. anthracis strains Stern,

A2012 and CDC
684 are almost identical to the motile species B. cereus, B. thuringiensis, L. monocytogenes, etc. In contrast, the anthraci
s
strains Ames, Ames 0581 and A0248 are lacking multiple genes present in the other motile organisms.



B.subtilis

B. amyloliquefaciens

B. clausii

B. anth
r
acis

(Sterne, A2012, CDC 684)

B. anthracis

(Ames, Ames
0581, A0248)


B. halodurans

B. pumilus


B. cereus
(all)



B. licheniformis
(all)



B. thuringiensis






B. weihenstephanensis


Genes




L
.

Monocytogenes


flgL

X

X

X

X



flgM

X

X

X




fliF

X

X

X

X



fliJ

X

X

X




fliK

X

*





fliM

X

X

X

X



fliO

X

X

X




fliT

X

X





cheC

X

X

X



cheD

X

X

X



cheV

X

X

X

X’


cheW

X

X

X

X’’


X gene is present in the KEGG flagellar assembly pathway

X’ gene is not
present in B. anthracis Sterne or A2012

X’’ gene is present in B. anthracis Sterne and A2012 but not the other organisms in the column

*
gene is not present in KEGG but is recognized by NCBI


In situations where a gene was present in the KEGG pathways for
B. cereus

but missing in
B. anthracis

BLAST searches were performed using the protein sequence for
B. cereus

as opposed to
B. subtilis
;
this was done because
B. cereus

is the closest motile re
lative to
B. anthracis

(Rasko, Altherr et al.
2005; Callahan, Castanha et al. 2008)
. None of the BLASTed proteins showed any strong matches in
B. anthracis
Ames; however, in Sterne there were two strong hits for t
he protein CheV; each which
covered a different half of the protein sequence. Upon investigation of the genetic sequence, it was
determined that both these coding sequences derived from the same gene, which underwent a
frameshift mutation via a one base
-
p
air deletion resulting in an in
-
frame stop codon shortly following the
deletion.

It is possible that the frameshift mutation of
cheV

in
B. anthracis

Sterne
is the reason for the loss of
motility in the organism.
T
his

suggests that loss of motility

is a

recent evolutionary development and
may explain why the other flagellar assembly and chemotaxis genes are being expressed. It
also
implies that Sterne
may
easily regain motility by a reversion of the frameshift mutation, e.g. by insertion

o
f a nucleotide

near the deletion which resulted in the frameshift.




However, because this frameshift causes the appearance of a stop codon,
it is expected that
a
truncated CheV protein
will
be made. This truncated protein would include the entire CheW
-
like
domain; a
domain that has been experimentally shown in
B.

subtilis

to be enough to restore motility in a
non
-
motile
cheVcheW

double mutant

(Rosario, Fredrick et al.

1994)
.

We decided to determine experimentally if
B. anthracis

Sterne was motile by performing swimming
motility assays on 0.3% agar plates (Figure XXXC). We used B. cereus ATCC 14759 as a positive
control for swimming and
DS1677 (
B. subtilis

3610 ∆
hag
)

as a negative control. Although DS1677 is
unable to swim, it can move by gliding, which explains the formation of the extended branched patterns
observed on the corresponding plate. The assay shows that
B. anthracis

Sterne is able to swim,
although in a l
imited capacity when compared to
B. cereus
, since the zone of motility zone is
significantly smaller and does not cover the entire plate.

Validation of multi
-
species bicluster performance

To validate our multi
-
species biclustering method, we

compare
d

it
to
both the multi
-
species k
-
means
method described above, as well as
our earlier single
-
species method,
as it was previously shown to be
competitive with, and in many cases outperform
ed
, other state of the art biclustering methods

(Reiss,
Baliga et al. 2006)
.
In our
evaluation, our goal was not only to
compare
the methods’

ability to
detect

statistically significant modules,
but more importantly,
their

ability to identify conserved modules.
Thus,
our goal was both to
quantify
the multi
-
species method’s

ability to ret
rieve conserved modules
, as well
as determine
if there was any

penalty
resulting from

coupling the orthologous genes so tightly during
the shared optimization step.

T
o achieve both of these goals, we
both: 1)
compared the
modules
identified by each of the

three algorithms (single
-
species cMonkey, and multi
-
species cMon
k
ey and k
-
means),
using several
commonly used
metr
ics that
gauge

the coherence and biological relevance of
the biclusters
, and 2) quantified the degree that elaborated, shared and single
species (bi)clusterings
aligned by gene orthology as a proxy for the amount of conserved co
-
regulation detected using

the

conservation
metr
ic des
cribed
in the methods section
.
We show that using a coupled analysis of
multiple species we can produce cluste
rs as good (by metrics of class 1 above)
and
much more
conserved (by the conservation metric, 2, above)
. T
hus
, the clusters obtained by this procedure are

of
greater biological significance.

Compar
ing the amount of conserved co
-
regulation detected by eac
h method

U
sing the conservation metric, described above, we evaluated all three of the methods’ ability to
identify conserved modules. For the two multi
-
species methods, we calculated the metric using the
directly paired biclusters for a given pairing bet
ween organisms, i.e. the shared bicluster for one
organism, with its bicluster counterpart in the other.
For the shared (or conserved core) biclusters all

genes in one organism have an orthologous pair in the other by construction and thus the conservatio
n
score is 1.0, for elaborated multi
-
species biclusters this score is less than one due to species specific
gene additions.
As no such direct mapping exists between the
independently run
single
-
species
results, for
the three
possible pairing
s of the orga
nism
s we employed a greedy strategy where
f
or each
bicluster in
one

organism’s set of biclusters, we select the bicluster f
rom the other
that maximizes this
metric

(so for each bicluster we find the most orthologous/conserved counterpart)
. These individua
l
bicluster scores were then averaged to produce a single conservation score for

that method, with that
pairing
(Table
6)
.
T
o make the single
-
species
analysis directly

comparable with those from
both

multi
-
species

methods
, we
only
consider
the 150
biclusters for each organism with the most genes from each
organism’s respective orthologous core.

We show, in table
6
, that as expected both the

elaboration
of
the multiple species cMonkey and the elaboration of the multi
-
species k
-
means

remain quite con
served
by
this measure of conservation. For example, with the
B. subtilis
-
B. anthracis

pairing, the average
conservation score for the biclusters from the
multi
-
species cMonkey
elaboration optimization is 0.8
52
,
while for the single
-
species
it
is only 0.
1
24
.


Table
2
: Comparison of conservation metrics for each pairing for the single
-
species and elaboration optimizations.
Note, that by definition the shared optimization has a score of 1 and has been omitted for this reason.

Conservation Metric






B. subtilis
-

B. anthracis

B. subtilis
-

L.
m
onocytogenes

B. anthracis
-

L.
m
onocytogenes


Single

0.124

0.147

0.126


Elaborated

0.852

0.884

0.906


Elaborated
K
-
means

0.956

0.963

0.943



G
iven the relatively close evolutionary proximity
of

each of these organisms,
we expected

that any
conserved modules between
any of the
two species
should be relatively distinguishable, and readily
identifiable by the single
-
species optimization
.
There are several possible explanations for why this is
not the case (for the unexpectedly low values for ortholog
-
aligned single species runs) discussed
below.
With respect to this metric, it is also clear that the modified elaborated k
-
means method
outp
erforms the multi
-
species elaboration method.
T
his
is due to the fact

that k
-
means generates a
hard partitioning that completely partitions the data space
composed only of

reciprocal best Blast
matches,
leaving few genes with orthologs

to be added during
the elaboration step of the k
-
means
method.

It should also be noted that
by completely partitioning the data space, t
he
multi
-
species
k
-
means algorithm
operates on the incorrect assumption

that there is perfect functional conservation for
all orthologous
g
enes shared between two species.

In contrast, the biclustering cMonkey performs is a
soft clustering that allows membership in multiple biclusters, while not requiring perfect coverage.
N
on
-

conserved orthologs are not forced to join any modules by multi
-
species cMonkey

and, thus
, this lower
conservation is arguably biologically relevant.

Quantitation of the expression and functional coherence of biclusters generated by each
method:


We

compared the relative performance
s

of the
two

multi
-
species methods

to the single species method

using
three

commonly used metrics that gauge

the degree of support that
is

provided
to

each bicluster
by the three

data types that cMonkey integrates

(expression, sequence and association networks).
For
comparison of the
single species cMonkey algorithm to other biclustering algorithms, and comparison
between single species biclustering and clustering algorithms see

(Prelic, Bleuler et al. 2006; Reiss,
Baliga et al. 2006)
. Our main performance metrics are: 1) expression residuals, a measure of the
coherence of expression across the two species datasets for conditions within the bicluster, 2) network
p
-
values, a measure of the

significance of the sub
-
networks within biclusters compared to the full
network, and 3) motif E
-
values, a measure of the quality/significance of the upstream binding site
motifs detected for each bicluster,

each of which will
be described in greater det
ail below as we
discuss the relative performance of the results from both cMonkey versions.

For the sake of brevity,
we
focus on the comparison of

the
elaborated multi
-
species
cMonkey
results

to those from the single
-
species

f
or

the
B. subtilis
-

B. anthr
acis

pairing
;

the full set of comparisons is available in the
supplementary material.

We focus on

the elaboration results
as they are the most
directly

comparable
to the single species result in terms of number of genes and data used in the analysis. Also

the
biclusters resulting from this elaboration step

are the main focus of this work as they contrast the both
the conserved core of co
-
regulated modules as well as biologically relevant
species
-
specific
modifications.

In brief, we find that multi
-
species

biclustering detects a much higher degree of
conserved (ortholog alignable) biclusters and that these multi
-
species biclusters are more functionally
coherent (they effect better partitioning of KEGG and GO annotations), equally coherent with respect to
ex
pression within each species data
-
set, and have more significant or equally significant sub
-
networks
(depending on the organism examined). This greater conservation, without compromising the four
performance

metrics, suggests that the multi
-
species (conse
rved) biclusters have a higher biological
significance.


To quantitate the tightness of co
-
expression within biclusters we use a measure of the error between
individual gene data points and the column and row means that define the bicluster, called a
residual
,
that’s described fully in the supplementary information and in Reiss et al
(Cheng and Church 2000;
Reiss, Baliga et al. 2006)
. Figure 4

shows the distribution of residuals for each of the methods tested
for multiple organism
-
pairings.



Table
3
: P
-
values from two
-
sided Wilcoxen’s nonparametric rank test comparisons of the distribution of residuals from
the
multi
-
sp
ecies elabor
ated and single
-
species optimizations. Here, we use MS as an abbreviation for multi
-
species,
and SS for single
-
species.

P
-
values from two
-
sided
Wilcoxen’s rank
test:





B. subtilis
-

B. anthracis


B. subtilis

B. anthracis

MS cMonkey (elab)


S
S cMonkey

0.273

0.036

MS
K

means

(elab)


S
S cMonkey

2.80E
-
11

0

MS K

-
means
(elab)
-

MS cMonkey (elab)

4.38E
-
15

0



Where necessary, due to the non
-
normality of the

distributions

of the residuals of these different
methods
,
we used pairwise two
-
sided Wilcoxen’s non
-
parametric rank tests
t
o asses
s

th
eir statistical
similarity (
Table
7
)
.

The
cMonkey
multi
-
species and single
-
species results generate biclusters of
similar quality as they
display
residual
distributions that are statistically similar
for most organism
pairings.

For example, for
B. subtilis

the distribution from the
elaboration optimization is statistically
the
same as

the single
-
species

(
mean

residual

of 0.49±0.09 for the elaborated, as com
pared to 0.49
±
0.13

for the

single
-
species optimization, statistically the same with 95% confidence)
,

while
f
or
B. anthracis

they are statistically different
(
mean residual of

0.31
±
0.12

for the

single
-
species optimization, compared
to

0.32±0.09 for the elab
orated,
which is
statistically greater with 95% confidence
)
.

When comparing
the two elaboration results from multi
-
species methods
, for
B. subtilis
,

the results of the
elaborated
k
-
means step

was

significantly better than
elaborated
cMonkey optimization
(
mean residual of

0.42
±
0.06

for multi
-
species k
-
means,
statistically less with 95% confidence

than

0.49±0.09 for multi
-
species
cMonkey
);
while for
B. anthracis,
it was
significantly worse

(
mean residual of

0.48
±
0.11

for multi
-
species k
-
means,
statistically

greater with 95% confidence

than

0.32±0.09 for multi
-
species cMonkey
)
.
As the
B. subtilis

expression data set contained nearly six times the conditions than the
B. anthracis

data set (314 compared to 51), this
illustrates the dominance of a single specie
s (
B. subtilis
) in

the k
-
means multi
-
species results
,

a key limitation of this method for multi
-
species analysis.


Similar patterns emerge for the other two
pairings when comparing the multi
-
species and single
-
species cMonkey

optimizations with each other;

as does when comparing the two multi
-
species
elaboration results
B. Subtilis and L. monocytogenes

pairing.

We direct the readers to the
supplementary material for a further discussion.


We also compared the significance o
f the biclusters with respect to networks of protein associations

(edges representing co
-
participation in a metabolic pathway, phylogenetic profile, co
-
membership in an
operon; see table S1 and S2).
Similar

to the residuals, the distributions of associati
on p
-
values
for the
cMonkey optimization
also
display
an organism
-
dependent

behavior. Briefly explained, the association
p
-
values for a bicluster are modeled using a hypergeometric distribution

(see supplementary material
for a full description of this ca
lculation)
.

T
he association p
-
values for all network types were

converted
into their negative log
10

values (to facilitate comparison),

pooled together

for each method

(Figure
5),

and

compared using
two
-
sided Wilcoxen’s non
-
parametric rank test
s (Table
8)
.
T
he overall pattern with
respect to the distributions from the different
cMonkey
optimizations is

somewhat
organism
-
dependent
.
For example, for
B. anthracis
, both of the multi
-
species optimizations were significantly better than
those from the single
-
s
pecies optimization (
mean p
-
values of the
multi
-
species

optimizations were
16.
2

11.0

and

1
2.8

11.0

for
elaborated

multi
-
species cMonkey
, as compared to 1
2.8
±
11.0

for the

single
-
species optimization, significantly greater with 95% confidence).

In contrast, for
B. subtilis

the
distributions were, with 95% confidence, statistically the same, although the single
-
species retrieved a
greater overall mean

(
mean p
-
values of the

elaborated

multi
-
species

results was 16.0±10.60
, as
compared to 1
7.2
±
10.
8

for the

single
-
species optimization
)
.

A discussion of all organism
-
pairings
tested is present in the supplementary material. In nearly all cases, the network
-
p
-
value distributions
for the elaborated multi
-
species k
-
means were significantly worse than ei
ther of the cMonkey
optimizations (mean p
-
values of 13.9±10.60 for
B. subtilis

and 8.88±9.46 for the single
-
species
optimization), showing that cMonkey is effectively optimizing the network component of the scoring
function.


Table
4
:
P
-
values from two
-
sided Wilcoxen’s nonparametric rank test comparisons of the distribution of association p
-
values from the
multi
-
species

elaborated and single
-
species optimizations.

Here, we use MS as an abbreviation for
multi
-
species, and SS for sing
le
-
species.

P
-
values from two
-
sided
Wilcoxen’s rank
test:





B. subtilis
-

B. anthracis


B. subtilis

B. anthracis

MS cMonkey (elab)


S
S cMonkey

0.083

3.58E
-
03

MS
K

means

(elab)


S
S cMonkey

1.28E
-
06

1.58E
-
03

MS K

-
means
(elab)
-

MS cMonkey (elab)

1.80E
-
03

4.19E
-
08



Lastly, we compared our ability to detect significant upstream binding motifs for these methods (roughly
measured by both the E
-
values of the motifs resulting from the MEME algorithm (representing the
ability of the motifs and the
background distribution to regenerate the observed upstream sequence)
.

The results of this analysi
s

are presented in full in the supplementary information. Note, when

comparing the motif E
-
values of the different methods, only the first motifs identified
for each bicluster
were selected (as these are generally the most reliable). With the exception of the bi/clusters
generated for
L. monocytogenes
, the single
-
species cMonkey biclusters on average contained the best
motifs, while, unsurprisingly, the k
-
me
ans results had the worst, although these differences were not
always significant. For both of the multi
-
species cMonkey and k
-
means methods, there is a consistent
increase in E
-
value quality between the shared and elaborated steps, indicating that there
is significant
species specific change at the level of binding sites, even when the gene membership in a conserved
co
-
regulated module has significant orthologous overlap.


Furthermore, analysis of the results from a
specialized single
-
species cMonkey ver
sion, run just on the orthologous core genes, with these results
later elaborated in a manner similar to multi
-
species cMonkey elaboration also displayed a similar
pattern (results not shown). One possible explanation is simply algorithmic, namely, that M
eme is able
to infer more significant motifs from the larger pool of sequences accessible to the elaborated
bi/clusters. Another, more biologically motivated reason may that as the orthologous modules
discovered during the shared step have evolved to incl
ude species
-
specific genes, they have come to
be under the control of transcription factors whose binding motifs are either entirely different or have
experienced considerable evolutionary drift. Our meth
o
dology for modeling and detecting binding sites
as
part of the multi
-
species procedure is clearly an area for future work.



Estimating functional coherence via Enrichment of function annotations
:

We compared the percentages of biclusters that were significantly enriched for both GO terms and
presence in KEGG pathways (maximum p
-
value of 0.01) (Figure ##). Again, for brevity, we only
discuss the results from the
B. subtilis
and
B. anthracis

pairin
g here, with the results for the other
pairings in the supplementary material. Focusing first on the percentages of biclusters that were
significantly enriched with GO terms, we found that for both of the multi
-
species methods, there was a
consistent incr
ease between the shared and elaboration optimizations. For example, for multi
-
species
cMonkey, the percentage of
B. subtilis

biclusters with GO term enrichments increases from 50.0% to
58.0% between the shared and elaboration optimizations, while for
B. a
nthracis

the increase is even
more dramatic with an increase from 50.7% to 70.7%. Similarly, for multi
-
species k
-
means, the
increase is from 50.0% to 63.5% for
B. subtilis

and 43.2% to 77.7% for
B. anthracis
.


The percentage of biclusters with enriched KE
GG pathways is much higher in both the multi
-
species
methods than the single species cMonkey method. For example, for
B. subtilis

the shared steps of the
multi
-
species cMonkey and k
-
means the percentages of enriched biclusters were 12.7% and 14.2%,
respec
tively, while the percentage of the single
-
species cMonkey results was 11.5%. Likewise, for
B.

anthracis
, the percentages for multi
-
species cMonkey and k
-
means were 12.7% and 14.9% versus 13%
for the single
-
species cMonkey. We also observed patterns simi
lar to what we saw with the GO terms,
where there was also a consistent increase in the percentage of biclusters with KEGG pathway
enrichments between the shared and elaboration runs for multi
-
species cMonkey. For example, with
the multi
-
species cMonkey m
ethod, the percentages increase from 12.7% to 15.3% for
B. subtilis

and
12.7% to 21.3% for
B. anthracis
. Unexpectedly, while we saw this behavior for k
-
means with the
B.
subtilis
clusters, the percentage of
B. anthracis

clusters dropped. Finally,, when c
omparing the two
multi
-
species methods the shared k
-
means had a greater percentage than the shared cMonkey results,
but this was the reverse for the elaborated results.


One potential explanation for why both the multi
-
species methods, including our simple

multi
-
species k
-
means implementation, found greater percentages of biclusters with KEGG pathway enrichments than
the single
-
species cMonkey method is that, by their very nature, both of the multi
-
species methods are
heavily determined by the conserved ort
hologous core between two species. Due to their conserved
nature, these orthologous cores are likely to contain genes in critical and highly conserved processes
such as metabolism. For this reason, it is also likely the genes in the cores are also more l
ikely to be
better known and annotated, with the result being that the biclusters from the multi
-
species
optimizations are more likely to contain well
-
annotated genes than the single
-
species. Similar logic
may also explain the difference seen between the
percentage of KEGG enrichments between the
single
-
species and multi
-
species optimizations.




Conclusion

In this investigation, we have introduced a new algorithm, the multi
-
species cMonkey. Our results
indicate that this computational method performs sign
ificantly better at identifying conserved modules of
co
-
expressed functionally related genes than the previously described methods, including the single
-
species cMonkey algorithm , We show that this new technique produces condition dependent modules
of hig
her biological significance, as they are conserved (supported by datasets for multiple closely
related species) Also, unlike the multi
-
species k
-
means method, multi
-
species cMonkey does not
allow one organism to dominate the analysis, while also integra
ting other supporting data types during
the analysis, enables it to identify more biologically relevant modules.


We
have
demonstrate
d

that our method reveals important new biology, with examples of expected
patterns that were not found, cases where differ
ences between species are revealed, and cases where

unexpected conservation reveals unexpected biology (
such as in the
B. anthracis

flagellar biclusters
).
Clearly, by being so heavily determined by the orthologous cores of two species, this type of analys
is is
more likely to identify well
-
known, critical processes. In fact, it would be more worrying if it did not.
However, as exemplified by the
B. anthracis

flagellar modules that were retrieved by multi
-
species
cMonkey, it is arguable that even in these
cases it is possible to identify modules previously believed
not
to exist. Moreover, for many of the biclusters containing GO annotation or KEGG pathway
enrichments, the genes associated with these terms or pathways are only subsets of the genes they
cont
ain. Thus, the addition of genes outside of these core processes during the elaboration step allows
for the potential annotation of genes of unknown function. It also allows for the association of functions
that were previously considered to be separate,

such as the association of the
ctaCDEF

cytochrome C
oxidase regulon with the
B. subtilis

sporulation module found by multi
-
species cMonkey
.

Last, many of
the biclusters identified by multi
-
species cMonkey that lack any GO annotation or KEGG pathway
enrichments are still highly supported by the data such as possessing either conserved expression
profiles or statistically significant motifs.


F
or all the pairings of organisms, there was a consistent increase from the shared to elaboration steps
of the multi
-
species cMonkey methods in the percentage of both enrichments. There was a similar,
though, not as consistent increase for the k
-
means me
thod as well. Most likely, this behavior results
from shared (bi)clusters that contain enrichments which do not show up statistically until genes from
outside of the orthologous core are added during the elaboration step. This might also explain why the
elaboration optimization consistently had a greater percentage of biclusters with enriched GO terms
than the single
-
species, but the shared optimization did not. As such, we argue that this illustrates the
necessity of the elaboration step in any type of
multi
-
species analysis similar to this.


References

Aldridge, P. and K. T. Hughes (2002). "Regulation of flagellar assembly."
Curr Opin Microbiol

5
(2): 160
-
5.

Alter, O., P. O. Brown, et al. (2000). "Singular value decomposition for genome
-
wide expression data
processing and modeling."
Proc Natl Acad Sci U S A

97
(18): 10101
-
6.

Alter, O., P. O. Brown, et al. (2003). "Generalized singular value decomposition for co
mparative
analysis of genome
-
scale expression data sets of two different organisms."
Proc Natl Acad Sci
U S A

100
(6): 3351
-
6.

Asai, K., H. Yamaguchi, et al. (2003). "DNA microarray analysis of Bacillus subtilis sigma factors of
extracytoplasmic function fa
mily."
FEMS Microbiol Lett

220
(1): 155
-
60.

Bardy, S. L., S. Y. Ng, et al. (2003). "Prokaryotic motility structures."
Microbiology

149
(Pt 2): 295
-
304.

Barrett, T., D. B. Troup, et al. (2007). "NCBI GEO: mining tens of millions of expression profiles
--
databa
se and tools update."
Nucleic Acids Res

35
(Database issue): D760
-
5.

Battistuzzi, F. U., A. Feijao, et al. (2004). "A genomic timescale of prokaryote evolution: insights into the
origin of methanogenesis, phototrophy, and the colonization of land."
BMC Evol

Biol

4
: 44.

Ben
-
Dor, A., B. Chor, et al. (2003). "Discovering local structure in gene expression data: the order
-

preserving submatrix problem."
J Comput Biol

10
(3
-
4): 373
-
84.

Berg, H. C. (2003). "The rotary motor of bacterial flagella."
Annu Rev Biochem

7
2
: 19
-
54.

Bergman, N. H., E. C. Anderson, et al. (2006). "Transcriptional profiling of the Bacillus anthracis life
cycle in vitro and an implied model for regulation of spore formation."
J Bacteriol

188
(17): 6092
-
100.

Bergmann, S., J. Ihmels, et al. (2003)
. "Iterative signature algorithm for the analysis of large
-
scale gene
expression data."
Phys Rev E Stat Nonlin Soft Matter Phys

67
(3 Pt 1): 031902.

Blocker, A., K. Komoriya, et al. (2003). "Type III secretion systems and bacterial flagella: insights into
t
heir function from structural similarities."
Proc Natl Acad Sci U S A

100
(6): 3027
-
30.

Bonneau, R., M. T. Facciotti, et al. (2007). "A predictive model for transcriptional control of physiology
in a free living cell."
Cell

131
(7): 1354
-
65.

Bowers, P. M., M. Pellegrini, et al. (2004). "Prolinks: a database of protein functional linkages derived
from coevolution."
Genome Biol

5
(5): R35.

Bowman, J. P., C. R. Bittencourt, et al. (2008). "Differential gene expression of Listeria monocytogenes
dur
ing high hydrostatic pressure processing."
Microbiology

154
(Pt 2): 462
-
75.

Bult, C. J., J. T. Eppig, et al. (2008). "The Mouse Genome Database (MGD): mouse biology and model
systems."
Nucleic Acids Res

36
(Database issue): D724
-
8.

Bunai, K., M. Ariga, et al
. (2004). "Profiling and comprehensive expression analysis of ABC transporter
solute
-
binding proteins of Bacillus subtilis membrane based on a proteomic approach."
Electrophoresis

25
(1): 141
-
55.

Callahan, C., E. R. Castanha, et al. (2008). "The Bacillus ce
reus containing sub
-
branch most closely
related to Bacillus anthracis, have single amino acid substitutions in small acid
-
soluble proteins,
while remaining sub
-
branches are more variable."
Mol Cell Probes

22
(3): 207
-
11.

Cascales, E., R. Lloubes, et al. (20
01). "The TolQ
-
TolR proteins energize TolA and share homologies
with the flagellar motor proteins MotA
-
MotB."
Mol Microbiol

42
(3): 795
-
807.

Charon, N. W. and S. F. Goldstein (2002). "Genetics of motility and chemotaxis of a fascinating group
of bacteria: t
he spirochetes."
Annu Rev Genet

36
: 47
-
73.

Chen, F., A. J. Mackey, et al. (2007). "Assessing performance of orthology detection strategies applied
to eukaryotic genomes."
PLoS ONE

2
(4): e383.

Cheng, Y. and G. M. Church (2000). "Biclustering of expression d
ata."
Proc Int Conf Intell Syst Mol Biol

8
: 93
-
103.

Demeter, J., C. Beauheim, et al. (2007). "The Stanford Microarray Database: implementation of new
analysis tools and open source release of software."
Nucleic Acids Res

35
(Database issue):
D766
-
70.

DiMagg
io, P. A., Jr., S. R. McAllister, et al. (2008). "Biclustering via optimal re
-
ordering of data matrices
in systems biology: rigorous methods and comparative studies."
BMC Bioinformatics

9
: 458.

Doan, T., P. Servant, et al. (2003). "The Bacillus subtilis yw
kA gene encodes a malic enzyme and its
transcription is activated by the YufL/YufM two
-
component system in response to malate."
Microbiology

149
(Pt 9): 2331
-
43.

Dutilh, B. E., M. A. Huynen, et al. (2006). "A global definition of expression context is conse
rved
between orthologs, but does not correlate with sequence conservation."
BMC Genomics

7
: 10.

Edgar, R., M. Domrachev, et al. (2002). "Gene Expression Omnibus: NCBI gene expression and
hybridization array data repository."
Nucleic Acids Res

30
(1):
207
-
10.

Eichenberger, P., M. Fujita, et al. (2004). "The program of gene transcription for a single differentiating
cell type during sporulation in Bacillus subtilis."
PLoS Biol

2
(10): e328.

Eichenberger, P., S. T. Jensen, et al. (2003). "The sigmaE regulo
n and the identification of additional
sporulation genes in Bacillus subtilis."
J Mol Biol

327
(5): 945
-
72.

Errington, J. (2003). "Regulation of endospore formation in Bacillus subtilis."
Nat Rev Microbiol

1
(2):
117
-
26.

Gan, X., A. W. Liew, et al. (2008). "
Discovering biclusters in gene expression data based on high
-
dimensional linear geometries."
BMC Bioinformatics

9
: 209.

Gilad, Y., A. Oshlack, et al. (2006). "Expression profiling in primates reveals a rapid evolution of human

transcription factors."
Natur
e

440
(7081): 242
-
5.

Giotis, E. S., A. Muthaiyan, et al. (2008). "Genomic and proteomic analysis of the Alkali
-
Tolerance
Response (AlTR) in Listeria monocytogenes 10403S."
BMC Microbiol

8
: 102.

Golub, G. and W. Kahan (1965). "Calculating the Singular Values

and Pseudo
-
Inverse of a Matrix."
Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis

2
(2): 205
-
224.

Hayashi, K., T. Kensuke, et al. (2006). "Bacillus subtilis RghR (YvaN) represses rapG and rapH, which
encode inhibi
tors of expression of the srfA operon."
Mol Microbiol

59
(6): 1714
-
29.

Hayashi, K., T. Ohsawa, et al. (2005). "The H2O2 stress
-
responsive regulator PerR positively regulates
srfA expression in Bacillus subtilis."
J Bacteriol

187
(19): 6659
-
67.

Herschkowitz, J. I., K. Simin, et al. (2007). "Identification of conserved gene expression features
between murine mammary carcinoma models and human breast tumors."
Genome Biol

8
(5):
R76.

Hu, Y., H. F. Oliver, et al. (2007). "Transcriptomic and phenotypic

analyses suggest a network between
the transcriptional regulators HrcA and sigmaB in Listeria monocytogenes."
Appl Environ
Microbiol

73
(24): 7981
-
91.

Hu, Y., S. Raengpradub, et al. (2007). "Phenotypic and transcriptomic analyses demonstrate
interactions b
etween the transcriptional regulators CtsR and Sigma B in Listeria
monocytogenes."
Appl Environ Microbiol

73
(24): 7967
-
80.

Hulsen, T., M. A. Huynen, et al. (2006). "Benchmarking ortholog identification methods using functional
genomics data."
Genome Biol

7
(4): R31.

Ihmels, J., S. Bergmann, et al. (2005). "Comparative gene expression analysis by differential clustering
approach: application to the Candida albicans transcription program."
PLoS Genet

1
(3): e39.

Ireton, K., S. Jin, et al. (1995). "Krebs cycle f
unction is required for activation of the Spo0A transcription
factor in Bacillus subtilis."
Proc Natl Acad Sci U S A

92
(7): 2845
-
9.

Jin, S., P. A. Levin, et al. (1997). "Deletion of the Bacillus subtilis isocitrate dehydrogenase gene
causes a block at stag
e I of sporulation."
J Bacteriol

179
(15): 4725
-
32.

Kanehisa, M. (2002). "The KEGG database."
Novartis Found Symp

247
: 91
-
101; discussion 101
-
3,
119
-
28, 244
-
52.

Kanehisa, M. (2009). "Bacterial chemotaxis
-

Bacillus anthracis Sterne." Retrieved September 2
4,
2009, from
http://www.genome.jp/dbget
-
bin/show_pathway?bat02030
.

Kanehisa, M. (2009). "Flagellar assembly
-

Bacillus anthracis Sterne." Retrieved September 24, 2009,
from
http://www.genome.jp/kegg
-
bin/show_pathway?bat02040
.

Kanehisa, M., M. Araki, et al. (2008). "KEGG for linking genomes to life and the environment."
Nucleic
Acids Res

36
(Database issue): D480
-
4.

Kaneh
isa, M. and S. Goto (2000). "KEGG: kyoto encyclopedia of genes and genomes."
Nucleic Acids
Res

28
(1): 27
-
30.

Kanehisa, M., S. Goto, et al. (2006). "From genomics to chemical genomics: new developments in
KEGG."
Nucleic Acids Res

34
(Database issue): D354
-
7.

Khaitovich, P., I. Hellmann, et al. (2005). "Parallel patterns of evolution in the genomes and
transcriptomes of humans and chimpanzees."
Science

309
(5742): 1850
-
4.

Kluger, Y., R. Basri, et al. (2003). "Spectral biclustering of microarray data: coclusteri
ng genes and
conditions."
Genome Res

13
(4): 703
-
16.

Kobayashi, K., M. Ogura, et al. (2001). "Comprehensive DNA microarray analysis of Bacillus subtilis
two
-
component regulatory systems."
J Bacteriol

183
(24): 7365
-
70.

Liu, H., N. H. Bergman, et al. (2004).
"Formation and composition of the Bacillus anthracis endospore."
J Bacteriol

186
(1): 164
-
78.

Liu, R. and H. Ochman (2007). "Origins of flagellar gene operons and secondary flagellar systems."
J
Bacteriol

189
(19): 7098
-
104.

Liu, R. and H. Ochman (2007). "Stepwise formation of the bacterial flagellar system."
Proc Natl Acad
Sci U S A

104
(17): 7116
-
21.

Liu, X. and H. W. Taber (1998). "Catabolite regulation of the Bacillus subtilis ctaBCDEF gene cluster."
J

Bacteriol

180
(23):
6154
-
63.

Lu, Y., P. Huggins, et al. (2009). "Cross species analysis of microarray expression data."
Bioinformatics

25
(12): 1476
-
83.

Macnab, R. M. (2003). "How bacteria assemble flagella."
Annu Rev Microbiol

57
: 77
-
100.

Macnab, R. M. (2004). "Type III flage
llar protein export and flagellar assembly."
Biochim Biophys Acta

1694
(1
-
3): 207
-
17.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations.

Marr, A. K., B. Joseph, et al. (2006). "Overexpression of PrfA leads to gro
wth inhibition of Listeria
monocytogenes in glucose
-
containing culture media by interfering with glucose uptake."
J
Bacteriol

188
(11): 3887
-
901.

Matsuno, K., T. Blais, et al. (1999). "Metabolic imbalance and sporulation in an isocitrate
dehydrogenase mutan
t of Bacillus subtilis."
J Bacteriol

181
(11): 3382
-
91.

McCarroll, S. A., C. T. Murphy, et al. (2004). "Comparing genomic expression patterns across species
identifies shared transcriptional profile in aging."
Nat Genet

36
(2): 197
-
204.

McCarter, L. L. (2006
). "Regulation of flagella."
Curr Opin Microbiol

9
(2): 180
-
6.

McQuitty, L. L. (1966). "Similarity Analysis by Reciprocal Pairs for Discrete and Continuous Data."
Educational and Psychological Measurement

26
(4): 825
-
831.

Mellor, J. C., I. Yanai, et al. (200
2). "Predictome: a database of putative functional links between
proteins."
Nucleic Acids Res

30
(1): 306
-
9.

Molle, V., Y. Nakaura, et al. (2003). "Additional targets of the Bacillus subtilis global regulator CodY
identified by chromatin immunoprecipitation

and genome
-
wide transcript analysis."
J Bacteriol

185
(6): 1911
-
22.

Murphy, G. E., J. R. Leadbetter, et al. (2006). "In situ structure of the complete Treponema primitia
flagellar motor."
Nature

442
(7106): 1062
-
4.

Ogura, M., H. Yamaguchi, et al. (2002). "W
hole
-
genome analysis of genes regulated by the Bacillus
subtilis competence transcription factor ComK."
J Bacteriol

184
(9): 2344
-
51.

Ogura, M., H. Yamaguchi, et al. (2001). "DNA microarray analysis of Bacillus subtilis DegU, ComA and
PhoP regulons: an appr
oach to comprehensive analysis of B.subtilis two
-
component regulatory
systems."
Nucleic Acids Res

29
(18): 3804
-
13.

Pallen, M. J. and N. J. Matzke (2006). "From The Origin of Species to the origin of bacterial flagella."
Nat Rev Microbiol

4
(10): 784
-
90.

Pen
n, C. W. and C. J. Luke (1992). "Bacterial flagellar diversity and significance in pathogenesis."
FEMS Microbiol Lett

79
(1
-
3): 331
-
6.

Piggot, P. J. and J. G. Coote (1976). "Genetic aspects of bacterial endospore formation."
Bacteriol Rev

40
(4): 908
-
62.

Pre
lic, A., S. Bleuler, et al. (2006). "A systematic comparison and evaluation of biclustering methods for
gene expression data."
Bioinformatics

22
(9): 1122
-
9.

Price, M. N., K. H. Huang, et al. (2005). "A novel method for accurate operon predictions in all
se
quenced prokaryotes."
Nucleic Acids Res

33
(3): 880
-
92.

Raengpradub, S., M. Wiedmann, et al. (2008). "Comparative analysis of the sigma B
-
dependent stress
responses in Listeria monocytogenes and Listeria innocua strains exposed to selected stress
conditions
."
Appl Environ Microbiol

74
(1): 158
-
71.

Rasko, D. A., M. R. Altherr, et al. (2005). "Genomics of the Bacillus cereus group of organisms."
FEMS
Microbiol Rev

29
(2): 303
-
29.

Rasko, D. A., J. Ravel, et al. (2004). "The genome sequence of Bacillus cereus ATCC

10987 reveals
metabolic adaptations and a large plasmid related to Bacillus anthracis pXO1."
Nucleic Acids
Res

32
(3): 977
-
88.

Reiss, D. J., N. S. Baliga, et al. (2006). "Integrated biclustering of heterogeneous genome
-
wide
datasets for the inference of gl
obal regulatory networks."
BMC Bioinformatics

7
: 280.

Remm, M., C. E. Storm, et al. (2001). "Automatic clustering of orthologs and in
-
paralogs from pairwise
species comparisons."
J Mol Biol

314
(5): 1041
-
52.

Rosario, M. M., K. L. Fredrick, et al. (1994). "Chemotaxis in Bacillus subtilis requires either of two

functionally redundant CheW homologs."
J Bacteriol

176
(9): 2736
-
9.

Serizawa, M., H. Yamamoto, et al. (2004). "Systematic analysis of SigD
-
regulated genes

in Bacillus
subtilis by DNA microarray and Northern blotting analyses."
Gene

329
: 125
-
36.

Severino, P., O. Dussurget, et al. (2007). "Comparative transcriptome analysis of Listeria
monocytogenes strains of the two major lineages reveals differences in virulence, cell wall, and
stress response."
Appl Environ Microbiol

73
(19): 6078
-
88.

Shannon, P
. T., D. J. Reiss, et al. (2006). "The Gaggle: an open
-
source software system for integrating
bioinformatics software and data sources."
BMC Bioinformatics

7
: 176.

Soutourina, O. A. and P. N. Bertin (2003). "Regulation cascade of flagellar expression in Gr
am
-
negative bacteria."
FEMS Microbiol Rev

27
(4): 505
-
23.

Steil, L., M. Serrano, et al. (2005). "Genome
-
wide analysis of temporally regulated and compartment
-
specific gene expression in sporulating cells of Bacillus subtilis."
Microbiology

151
(Pt 2): 399
-
42
0.

Stein, B., S. M. Eissen, et al. (2003). On Cluster Validity and the Information Need of Users.
Proceedings of the 3rd IASTED International Conference on Artificial Intelligence and
Applications (AIA 03), Benalmádena, Spain
. M. H. Hanza, ACTA Press
:
216
-
221.

Stragier, P. and R. Losick (1996). "Molecular genetics of sporulation in Bacillus subtilis."
Annu Rev
Genet

30
: 297
-
41.

Supper, J., M. Strauch, et al. (2007). "EDISA: extracting biclusters from multiple time
-
series of gene
expression profiles."
BMC Bi
oinformatics

8
: 334.

Tanay, A., A. Regev, et al. (2005). "Conservation and evolvability in regulatory networks: the evolution
of ribosomal regulation in yeast."
Proc Natl Acad Sci U S A

102
(20): 7203
-
8.

Tanay, A., R. Sharan, et al. (2004). "Revealing modul
arity and organization in the yeast molecular
network by integrated analysis of highly heterogeneous genomewide data."
Proc Natl Acad Sci
U S A

101
(9): 2981
-
6.

Tanay, A., R. Sharan, et al. (2002). "Discovering statistically significant biclusters in gene e
xpression
data."
Bioinformatics

18 Suppl 1
: S136
-
44.

Tirosh, I. and N. Barkai (2007). "Comparative analysis indicates regulatory neofunctionalization of yeast
duplicates."
Genome Biol

8
(4): R50.

Tirosh, I., Y. Bilu, et al. (2007). "Comparative biology: bey
ond sequence analysis."
Curr Opin Biotechnol

18
(4): 371
-
7.

Tirosh, I., A. Weinberger, et al. (2006). "A genetic signature of interspecies variations in gene
expression."
Nat Genet

38
(7): 830
-
4.

Tojo, S., M. Matsunaga, et al. (2003). "Organization and expre
ssion of the Bacillus subtilis sigY
operon."
J Biochem

134
(6): 935
-
46.

van Helden, J. (2003). "Regulatory sequence analysis tools."
Nucleic Acids Res

31
(13): 3593
-
6.

Waltman, P., T. Kacmarczyk, et al. (2009). Prokaryotic Systems Biology.
Plant systems biol
ogy, Annual
plant reviews
. G. Coruzzi and R. A. Gutierrez. Ames, Iowa, Blackwell Pub.
:
67
-
136.

Wang, S. T., B. Setlow, et al. (2006). "The forespore line of gene expression in Bacillus subtilis."
J Mol
Biol

358
(1): 16
-
37.

Watanabe, S., M. Hamano, et al. (2
003). "Mannitol
-
1
-
phosphate dehydrogenase (MtlD) is required for
mannitol and glucitol assimilation in Bacillus subtilis: possible cooperation of mtl and gut
operons."
J Bacteriol

185
(16): 4816
-
24.

Yoshida, K., K. Kobayashi, et al. (2001). "Combined transc
riptome and proteome analysis as a
powerful approach to study genes under glucose repression in Bacillus subtilis."
Nucleic Acids
Res

29
(3): 683
-
92.

Yoshida, K., Y. H. Ohki, et al. (2004). "Bacillus subtilis LmrA is a repressor of the lmrAB and yxaGH
opero
ns: identification of its binding site and functional analysis of lmrB and yxaGH."
J Bacteriol

186
(17): 5640
-
8.

Yoshida, K., H. Yamaguchi, et al. (2003). "Identification of additional TnrA
-
regulated genes of Bacillus
subtilis associated with a TnrA box."
Mol Microbiol

49
(1): 157
-
65.