Probabilistic Graphical Modeling of Cis-Regulatory Modules in Drosophila

kettlecatelbowcornerAI and Robotics

Nov 7, 2013 (4 years and 3 days ago)

374 views

Probabilistic
Graphical
Modeling of Cis
-
Regulatory
M
od
ul
es
in

Drosophila



Vipin Narang
1

Wing
-
Kin Sung
1
, Ankush Mittal
2



1
Department of Computer Science, National University of Singapore, Singapore


117543
.


2
Department of Ele
ctronics and Computer Engineering, Indian Institute of Technology, Roorkee
-
247667, Uttaranchal, India
.


Classification:


Physical Sciences


Statistics,

Biological Sciences


Genetics


Corresponding Author:


Vipin Narang,

c/o Graduate Office,

Department of

Computer Science,

3 Science Drive 2,

National University of Singapore,

Singapore


117543.

Email:
vipinnar@comp.nus.edu.sg


Manuscript information:


Number of text pages (including references and figure leg
ends):

32


Number of figures:

7


Number of tables:

1


Number of words in abstract (max 250):

162


Number of characters in the paper (including spaces, max 47000):

45,310


Data note:

Supplementary text included containing 10 supplementary figures

ABSTRACT



Cis
-
regulatory modules (CRMs) are enhancers or repressors that control the expression of
genes in a particular tissue at a particular deve
lopment stage. It is hypothesized

that CRMs with
similar motif module will control the same gene express
ion pattern. Such motif module is called
the regulatory code. So far few regulatory codes are known which have been determined based
on wet lab experiments. We present the first computational approach to learn regulatory codes
de
-
novo from a repository o
f CRMs. We use a probabilistic graphical model to derive the
regulatory codes and to predict novel CRMs. Using a training data of 356 non
-
redundant CRMs,
810 novel CRMs have been recovered from the Drosophila melanogaster genome regulating
gene expression

in different tissues at various stages of development.
We
also
derived
s
pecific
regulatory codes
for CRMs controlling

gene expression in the drosophila embryonic mesoderm,
the ventral nerve cord, the eye
-
antennal disc and
the larval wing imaginal disc.
Furthermore, the
taget genes of
CRMs with t
hose regulatory codes are
validat
ed to express
in the
corre
sponding

tissue at the
corresponding

development stage.

31 novel genes are implicated in the development
of th
ese tissues.


I
NDEX TERMS

Gene regulation, Cis
-
regulatory modules, Transcription factor binding motifs, Drosophila
melanogaster
, Bayesian networks
.


INTRODUCTION


The recent
sequencing and annotation of the genomes of several multicellular species
has
rev
ealed
that
there is
extensive similarity in their protein

coding sequences
.
Thus,

the
complexity of higher species
is a
ttributed

to a

more elaborate system of gene regulation
[1]
.
Transcriptional control appears
to be
the primary mechanism
of
gene
regulation
,
wh
ere the

essential feature
is
binding of
trans
-
acting
proteins
called
transcription

factors (TF) to cis
-
regulatory
DNA
flanking the

coding sequence
s
.
C
is
-
regulatory modules
(CRMs)
such as
enhancers or repressors
are
a special class of cis
-
regulatory sequences that
control

the
spatio
-
temporal specific
expression of genes
.
They are
short

(0.5
-
1 kb) sequences
with

a high density
of
transcription factor
binding sites

(TFBS) and
can control
the expression of
gene
s

located
from
several kilobases upstream
or
downstream

or
from
within the introns
.
Recent studies have
dem
onstrated their
widespread role in regulating networks of gene expression
,

especially

during
early
development

[2, 3]
.



CRMs
have been
most widely studied in Drosophila melanogaster (fruit fly), which is a
model organism for the study of development.
T
he number of
CRMs in the Drosophila genome
is
estimated to be
3
-
4 times
greater than
the number of genes themselves
[1]
.
Several
genes
expressed during Drosophila development are
known to be
regulated by
CRMs
.
CRMs regulate
gene expression in almost all
phases
of cell
differentiation
in Drosophila
from pluripotency
to
tissue
formation.
In the Drosophila
embryo, the transcription factors are non
-
uniformly
distributed in different
region
s
of the embryo.
Depending on the TF concentrations

at a given
region
, specific CRMs
are activated to express or repress their target genes. This results in
differential expression of the zygotic genes in different
region
s
, ultimately leading to tissue
differentiation. Thus CRMs are important components of developmental gene net
works.



CRMs are difficult to characterize
e
xperimental
ly

due to their short length and variable
location

in the genome
up to several kilobases upstream or downstream of the
target
gene.
I
n
-
vivo experimental validation of CRM activity using reporter gene

assays is time and resource
consuming. So far
less than
700 CRMs
have been

experimentally characterized
in
approximately 200 Drosophila genes
[4]
.
Therefore
i
n
-
silico

discovery of CRMs
is an attractive
option
.
Several recent studies have attempted c
omputational discovery of novel CRMs
[5
-
10]
.
The computational approach in essence cha
racterizes

the
CRMs as short (~1 kb) genomic
segments containing high density of binding sites for a set of co
-
acting TFs
[5
-
15]
. The binding
sites are
recognized with the help of
positional weight matrices (PWMs) for the TFs

[16]
. More
sophisticated approaches consider the contribution of both strong and weak binding sites within
a
CRM
[7, 8]
.
Probability models such as h
idden Markov models
(HMM)
have also been
employed
to recognize significant clusters of TFBS
s

within genomic sequences
, sometimes
additi
onally incorporating order and distance constraints among the TFBS
s

[13
-
15]
.

Genome
comparison with
related Drosophila species
has also
been used

to aid CRM discovery
, though
with caution due to low conservation of
the
CRM sequences in

general
[9]
.



The
main
challenge for
the
computational approach
is that the
TFBS compositions of
different CRMs regulating gene expression in different developmental stages and tissues

are
exceedingly

varied
.
Each specific expression
profile is governed by the interaction of a specific
set of
TFs. The TF set
forms a regulatory code governing the expression

pattern
.
For
example
the
TFs
bcd
,
cad
,
hb
,
kni
,
Kr
,
gt
,
dl
,
tll
,
etc
are known to
re
gulate

gene expression in the
blastoderm embr
yo

[6, 8]
.
Whereas the set of TFs
dl
,
twi
,
su(H)
, etc govern
s

gene expression in
the embrynoic
neuroectoderm

[10]
.
Presently
the regulatory codes are extracted based on
tedious wet
-
lab experiments and

biological
knowledge
.
This
has limited

their applicability to
few
specific CRM types. Another limitation
is that good
quality PWMs
,

which are
required
to
predict
the
TFBS clusters
,

are available only for a few TFs
.
Currently the
PWMs
have been
computed from
a small number of
experimental TFBS sequences determined by DNAse
footprinting
.


M
ost
of these
PWMs
lack
sufficient
sensitivity

and specifi
c
ity

[17]
.



An alternative computational approach is sequence based modeling of CRM
s

such as
using oligonucleotide fr
equencies
[18]
.

T
his approach has shown
s
ome degree of
success
in
modeling
blastoderm CRMs
.

H
owever,

it is
not applicable

to
model
ing

other

types of CRMs
[19]
. Firstly, we do not have enough
typed CRM
s

to train a sequence based model
.
Secondly
,

a
model based on
oligonucleotide motifs
give
s

large number of false positi
ves

(as shown in this
study).



We
introduce a
computational
CRM modeling and prediction approach called
Modulexplorer

to perform

de
-
novo learning of regulatory codes
for Drosophila CRMs
from
CRM
s

of unknown types
. The Modulexplorer
pipeline is shown in
Figure
1
. The input
to
Modulexplorer is a database of known CRM
s

and a set of non
-
CRM background sequences.
Modulexplorer
first
characterize
s

the
TF
BS
s

within the CRM
s

de
-
novo
[5, 20
-
22]
.
It represents
the
TFBS
s

with
dyad motifs in degenerate IUPAC alphabet to achieve high specificity
[23
-
26]
.
Th
en
using a
probabilistic Bayesian network model
, it
learns
the
TFBS
interactions

which are
over
-
represented in CRMs
while
under
-
represented in non
-
CRMs
.
The
se interactions describe

the
regulatory codes.

The trained model is then used to discove
r
novel CRMs.

This paper
reports the
novel
regulatory codes and CRMs discovered
.


RESULTS


Training of
Modulexplorer CRM Model



The Modulexplorer Bayesian network model for
a
CRM is shown in
Figure
2
.
I
n the
Bay
esian network model, the TFBS
s

are the
causal elements
or parent nodes
while

the CRM is
their common effect or child node.
The
basic idea is to
consider
a CRM
as
a cluster of TFBS
s

for cooperating TFs with certain distance and order constraints.
The TFBS

interactions are
encoded
in
the probabilities at the edges of the Bayesian network.
The
interaction probabilities

are learnt
d
e
-
novo by the Bayesian network from training CRM

and non
-
CRM

sequences with
un
supervised learning.
Distance and order constrain
ts are considered between pairs of closely
interacting motifs.
After training, the
Bayesian network model
functions
as a classifier to
discriminate between CRM and non
-
CRM sequences.

During
inference, the Bayesian network
assigns high probability of CRM
to a sequence which contains a combination of closely
interacting TFBS
s
.


To construct the Modulexplorer model,
we first developed
a method to
robustly
identify

TFBS
s

within CRMs.
Drosophila
CRMs
show
homotypic clustering

of TFBS
s
, i.e.
they contain
multi
ple binding sites for the same
transcription fact
or

[5, 20
-
22]
.
Therefore TFBS
s

often
appear
as r
edundant subsequences
with
in
the
CRM.
We studied the

correlation
between redundant sites
and TFBS
s

in
155
Drosophila CRMs

for
which TF
BS
s

have been
previously
characterized by
DNAse footprinting
experiments
[4, 17, 27]
.
T
he
se

CRMs

contained
3
-
6

binding sites per TF
in
general
(
Supplementar
y Figure 1
(a))
.
The
fluffy tail test statistic
[28]

of the CRMs
also
indicated
a high level of

sequence redundancy

in the CRMs (
Supplementary Figure 1
(b))
.

We
developed a
novel

algorithm to recover
the
redundant sites
in
a
CRM
with high accuracy

(see
Supplement
ary Figure

2
)
.
T
he
predicted redundant sites
overlapped
the
experimentally
annotated
TF
BS
s

in
the
CRMs

with
an average sensitivity of 81% and false positive rate of 22%
(
Supplementary Figure 1
(c
)
). The
p
-
value
for this correlation is
9×10
-
32

compared with random
sites of
the
same length
.


Thus
the
method robustly
identified

the
TFBS
s

in the

CRMs
de novo.


To learn

the
Modulexplorer
Bayesian network
model
,

we obtained
experimentally
validated CRM
s

from
the
REDfly database
[4]

(version 1.0, February 2006).
A n
on
-
redundant
and non
-
overlapping set of
356
training
and
58
test CRM
s

was
derived
from the database
as
described in the D
ata section
. The CRM
s

were of several different types as described in
Supplementary Figure 3
.
In addition,
we prepared
three
different
background
sets
from exon,
intron and intergenic sequences

respectively
. The background sequences were

randomly
select
ed from the Drosophila melanogaster genome
. The number and lengths of the sequences in
each background set was kept
the
same as
the training CRM
s
.


Ten
-
fold cross
-
validation training of the Bayesian network was performed with the
above
training data.
The

discrimination
achieved
between CRM and
background
sequences in cross
-
validation
training
is shown in

Figure
4
.


The result is compared with
the
current best performing
algorithm
HexDiff
[18]

and

with
a Markov model
.

Other
CRM prediction algorithms
(
such as

[7, 13, 29]
)
could
not
be included
for
comparison
since
they
require prior biological knowledge
of the CRM model
(
such as
the PWMs of
the
TFs)
and are specific to a subset of CRMs
of the
same type
.

As shown in
Figure
4
(a)
,
all three models showed
high discrimination between
CRM
and
coding (
exon
)

sequences
. However
for
non
-
coding (
intron and intergenic
)

sequences

(
Figure
4
(b))
,

the
Markov model

showed no discri
mination

(area under ROC=0.37
)
, HexDiff
showed
marginal discrimination
(area under ROC=
0.58
),
while
Modulexplorer had

the highest
discrimination (area under ROC=
0.75
)
.


Modulexplorer’s prediction performance was
then
evaluated on
a
test dataset of 58 CRM
s

and
1000
random
background sequences

different from the training set
.
The test set is unbiased
as it contains
CRMs expressed in a variety of tissues and stages
and is distinct from the training
sequences (
Figure
4
(c)
)
. The ROC
,
shown in
Figure
4
(d),
resembles the
performance
on the
training set

(area under ROC=
0.72
)
.


Pairwise TF
-
TF Interactions Learnt De
-
novo by the Modulexplorer



W
e investigated the c
onditional
probabil
ities
associated with the edges
of
the
Modulexplorer
Bayesian network
model

to obtain
insight into the
pairwise
TFBS
interactions

represented in the model
.
We took
from
the
model the
representative motifs
for 61

known TFs
which

best match
ed

their
known bi
nding sites

(
listed in Supplement
ary Figure 8)
.
The strength
of interaction between any two
motifs
M
1
,
M
2

was
measured as the
ratio of the marginal
probabilities





1 2 1 2
Pr,Pr,
CRM M M non CRM M M

. The
pairwise
interaction matrix

is
shown in
Figure
5
.

Based on the interaction matrix, the TFs
were
hierarchically clustered using
UPGMA algorithm
[30]
.




According to
their
known biological function
s
, the
61
TFs
may be
grouped into
five
broad categories.
The TF T
wist (
twi
)
and its cofactors
dl
,
sna
,
byn
,
slbo
,
prd
,
bin
,
su(H)
,

su(Hw)

in mesoderm and nervous sy
stem development

[10, 31
-
33]

are
placed in the first group. The TFs
sd
,
pan
,
nub
,
ap
,
grh

and
ara

which are involved in the development of imaginal dis
cs such as
the wing disc were placed in the second group. The third and fourth groups were formed by the
TFs of the antennapedia complex (
Antp
,
abd
-
A
,
abd
-
B
,
ubx
,
dfd
) and the blastoderm (
bcd
,
hb
,
cad
,
kni
,
Kr
,
tll
,
gt
)

respectively
.

The fifth group cons
ists

of the TFs
ey

and
toy

involved in eye
development.



In the TF
-
TF interaction matrix, t
he first four
TF
groups showed high mutual interaction
values in general.
The overlap is expected as these TFs are known to function cooperatively

[34,
35]
. However
,

closer
analysis of the interaction matrix by hierarchical clustering indicated that
the TFs
of
these four groups
form
three distinct clusters
in the TF
-
TF interaction matrix. The
first cluster
included

all TFs from the first group an
d
the
TFs

sd
,
nub
,
grh

and
pan

from the
second gro
up. The second cluster contained

the
remaining TFs from the second group and
all
TFs of the
antennapedia complex
.
In addition
,

the TFs
ems
,
vvl
,
en
,
exd
,
cf2
-
II
,
gl
,
Dref
,
zen

and
eve

also came together
with this cluster according to the hierarchical clustering.
There are some
supports that these TFs may be related to the known factors in the second cluster.
Exd
,

en
,
ems
,
zen

and
eve

are known regulators in the development of appendages including the le
gs and wings
[34]
.
Vvl

also has known function in wing development
[36]
, while little information is available
on the T
Fs
gl

and
cf2
-
II
.
The primary function of

t
he
TF
Dref
,
is DNA replication. T
hough there
is no
support of its role with antennapedia complex,
recent stu
dies have shown its role i
s

divers
ify

[37]

and hence it may
be related to

the antennapeida complex
.




The third cluster contains all the blastoderm TFs.
Moreover t
he ant
ennapedia pair
-
rule
TF
ftz

was
also
found
in this cluster
. This association is not surprising as
ftz

cooperates
with
blastoderm TFs in known CRMs
[38]
.



The
TFs
ey

and
toy

of the
fifth group appeared

as a distinct cluster by the UPGMA
clustering.

The TF
deaf1

was also found
in
this
cluster
. Little is known in the existing literature
about
deaf1
. Surprisingly we found from a survey of recent literature that
deaf1

seems to have a
role in eye development
[39]
.



We found
another separate
cluster formed by a miscellaneous set of TFs
med
,
mad
,

brk
,
adf1
,
espl
,
tin
,
hkb
,
vnd

and
ftz
-
f1
. Some of these TFs have known interactions, e.g.
mad

and
brk

are co
-
regulators of
zen

[40]
, while
mad

and
med

cooperate in the regulation of
bam

gene
[41]
.
However the clustering of these
TFs is a subject for further study.



In summary,
TF
s

with
high interaction probability
in Modulexplorer
were found to have
close

interact
ion

with each other in the same biological process and developmental st
ages.


G
enome
W
ide
S
can
for Novel CRMs



T
he Mo
dulexplorer model
was used
to search for novel CRMs within
BDGP Release 5
assembly of the
Drosophila genome.
In a sliding window like approach, the complete
120 Mb
genomic

sequence
was divided
into 24,000 windows of length 1000 bp each
with the adjacent
w
indows

overlapping
by

500 bp
. The
Modulexplorer
model assigns

to each window
a
probability
that it
may
contain

a CRM.
A

small set of
high confidence
windows that
were
assigned high probability value by the model
were shortlisted for analysis
as shown in
Figure
6
(a).
We chose a probability

threshold

so that

the model has
a
small
false positive rate of
0.1%

in cross
-
validation
. At this threshold the
expected sensitivity
is
about 20%. Thus
240
false
positive windo
ws are expected in the predicted set.

A

total of
1298 windows

were found
above
the threshold
, which is
more than 5
-
fold the number of expected false positives. The P
-
value for
this recovery
(Bonferroni corrected)
is 4.0×10
-
16
.
Out of 1298 windows,
472 w
indows
were
overlap
ping

the training CRMs and 13 windows overlap
ped

the 58 test sequences

(
Figure
6
(b))
.
The remaining 813 windows are novel predictions.

These novel predictions are listed in
Supplementary Figure

4.



As an initial validation,
the novel predictions
were compared
with computational CRM
predictions
reported by

other authors

[5, 6, 8, 9]

and with ne
w CRMs added to the REDfly
database in version 2
.

Mild overlap with these predictions was found

as shown in
Figure
6
(c).
Out of 28 predicted CRMs reported by Berman et al.
[6]
,
6
were also reported by Modulexplorer.
In a subsequent
in
-
vivo validation by Berman et al.
[9]
,

9 of the 28 predicted CRMs
were
validated
as active enhancers

while the remaining 19’s were not
.

Five
of the six common
predictions
between
Modulexplorer
and Berman et al.
corresponding to the genes
gt
,
odd
,
sqz

and

CG9650

(two overlapping Modulexplorer windows) were among the validated modules,
while one prediction corresponding to the gene
antp

was inactive.
Similarly 7
CRMs were
common between Modulexplorer and the predictions of Markstein et al.
[5]

corresponding to the
genes
run
,
zen
,
brk
,
sog
,
CG12444
,
osm
-
6

and
ady43a
. Of these, the
zen
,
brk
,
sog

and
ady43a

enhancers have been validated as active in
-
vivo.

Nine

other CRMs reported by Modulexplorer
corresponding to the genes
fkh
,
sim
,
wg
,
mir
-
309
,
grh
,
phyl
,
and
cluster_at_55C

are confirmed
by the updated REDfly database (version 2).



The Modulexplorer

predictions were over
-
represented in upstream regulatory regions of
genes (
Figure
6
(d)) indicating a strong bias towards transcriptional control. Of the 813 predicted
CRM windows
, 391

(48.1%) fell in the upstream

intergenic and promoter region, which is
significantly higher (
p
-
value

= 1.1×10
-
182
) compared to randomly distributed size
-
matched
segments (mean 26%, stdev 4.4% over 100 trials).
K
nown CRMs
show a similar

bias, with
49.6%
CRMs
overlap
ping

upstream inter
genic and promoter regions (
p
-
value = 1.1×10
-
226
). The
known and predicted CRMs also show significant under
-
representation in the exon regions as
compared to random segments.



In many cases, multiple predicted CRMs were found clustered around
a

gene
. Th
e trend
is similar to that for known CRMs, where out of
619 known Drosophila CRMs
, 398 occur as a
cluster of 4 or more CRMs around 51 genes. Monte carlo simulations were performed to
assess
the statistical significance of the clustering of CRMs in interge
nic gaps
. The number of clusters
of different sizes in 50kb windows formed by randomly distributing 813 segments of 1000 bp
length across the Drosophila genome is shown in
Supplementary Figure 5
(averaged over 100
simulations).
In comparison, the
corresp
onding distribution
s

for the predicted
and known CRMs

show
significant clustering around the
ir

target genes.



Putative target genes were assigned to the predicted CRM windows based on proximity.
Though a CRM can regulate distant genes, it is a
n uncommon

occurrence, for instance
,

81% of
the known CRMs target
their
most proximal gene. In this study,
CRMs lying within the intron of
a gene
were
ass
igned the same gene as their tar
get
, whereas CRMs lying in the intergenic region
were
assigned both
the
closest
upstream and downstream
genes as
their
possible targets.

Gene
ontology classification of the target genes
obtained using the online tool GOToolBox

[42]

is
shown in
Figure
6
(e)

with the
GO
terms sorted according to their
signif
icance (
P
-
value
)
.


The
terms show

highly
significant enrichment in the GO categories related to development and gene
regulation (morphogen activity)

[42]
.

This distribution is consistent with the GO categories of
the target genes of the training CRMs.



The G+C content

of the predicted CRM windows is shown in
Supplementary Figure 10
.
The known and the predicted CRMs have similar GC content, which is higher than those in
intron and intergenic sequences but lower than that in exons. The same trend has been
previously re
ported in
[19]
.




Feature
B
ased
C
lustering

of CRMs



T
o characterize the CRMs pred
icted by Modulexplorer

into functional categories
, the
813
predicted CRMs
and 356 training CRMs were
together
clustered based on their motif
content.
The clustering
was performed
by a
n iterative frequent itemset mining clustering procedure as
described in

the Methods section.
It was observed
that the

CRMs of
every CRM cluster

consistently regulate target genes expressed in the same tissue and development stage.

This
supports the hypothesis that

CRMs with similar motifs regulate
target
gene
s

within the sam
e
tissue and developmental stage
.


T
he
major
CRM clusters

are described below
.
For each

CRM cluster
discovered
,
first
we
validate

from the REDfly database

if
the

known CRMs in the cluster

are functional
within the
same tissue and developmental stage.
Thi
s check
also
deduce
s

the type of the CRM cluster.
Then
for the novel CRMs, we validate
if
the
ir

target genes
are expressed

in the same tissue and
developmental stage
using
in
-
situ

gene expression
profiles

from BDGP or Flybase annotation
.

Finally,

the
comm
on motifs for
the

cluster were used to derive a
concise
regulatory code
(see
Methods section)
.
We show that the

regulatory code

can specify
CRMs conferring
the
gene
expressi
on pattern associated with the cluster.



Table 1 summarize
s

the clusters.
The fi
rst
three
iterations of
the
clustering

procedure

produced a mixed set of CRMs rich in AT motifs.
T
he
se

CRMs
represented
two
major
categories with target
gene expression in the
blastoderm

embryo
and
in
the wing imaginal disc of
3
rd

instar larva. Subsequen
t iterations
produced
smaller clusters with
predominant
target gene
expression in the embryonic
mesoderm,
ventral nerve cord
and
eye
-
antennal
tissues
. The

clusters
with
their
dominant expression patterns and
regulatory codes
are
shown
in
Supplementary Fig
ure
6
.


Early
mesoderm development



The
fourth
cluster
discovered

consisted of 11

known
CRMs for 9 genes
and 34 novel
CRMs

for 2
7

genes
.
All

11

known
CRMs were
annotated

to express
their target genes
in
the
developing
mesoderm
during stages 8
-
12
. 9
CRMs

out of
these
11
were
expressed in the
visceral mesoderm while the remaining 2
were
expressed in the somatic mesoderm
.



For the novel CRMs, a
mong the
ir

2
7

target

gene
s
,

in
-
situ

gene expression profiles were
available for
19

genes including
CG
2493
,
sob
,
tr
af1
,
ush
,
eya
,
pvf2
,
wg
,
fus
,
rib
,
egfr
,
cpr49ac
,
sens
, SP
1173
,
emc
,
pxb
,
fer2lch
,
rst
,
dm

and
gnf1
,

all of which showed expression in the
mesoderm

during stages 8
-
12
. Of the
se, the

genes
traf1
,
ush
,
eya
,
pvf2
,
wg
,
rib
,
egfr
,
emc
,
fer2lch
,
and
rst

have kn
own involvement in mesoderm development
[43]

while the g
enes
CG
2493
, sob,
fus,
cpr49ac
,

sens
,
SP1173
, pxb,
dm

and
gnf1

are novel.

Of
the remaining 8
target
genes
,
four genes
sna
,
knrl
,
htl

and
fer1hch

were confirmed by Flybase annotations

for their
involvement in mesoderm development,
while the other four are
unknown to function in the
mesoderm
.



T
he

regulatory code

for the
CRMs
in this cluster
contained
12
motif
s

(see
Supplementary
Figure 6
)
.


The
o
ccurrence
of 14 or more sites for the
se

motifs within a 1 kb fragment
at a PWM
match threshold of 5.0×10
-
4

(see
Methods section)
was suff
icient to classify
a
sequence

as
a
mesoderm enhancer. The code
was found to be
specific, reporting
zero false positive

against
other REDfly CRMs and two false positives against 1000 random sequences.
The
dpp

813 bp
enhancer
was
t
he
lone
available
CRM
in this cluster
that has experimental TFBS annotation
. We
therefore
compare
d

the
known
annotation with the binding sites predicted
in this
CRM
by the
regulatory code.

A
n almost complete
and accurate
annotation of TFBS
was produced a
s shown
in
Figure
7
.



The motifs in the regulatory code
matched
closely with
the known motifs for the TFs
dl
,
twi
,
sna
,

tin
,
bin
,
abd
-
B
,
exd
,
eve
,
ftz

and others which have known role in mesoderm
development. The

TFs
d
l
and
twi

together
express
sna

in the blastoderm embryo
to
establish the
identity of mesoderm precursor cells

[44]
.
In the gastrulating embryo (stage

7), invagination
causes the mesoderm precursors to migrate and spread as a layer below the epidermis. During
stages 8
-
12,
the T
Fs which are
downstream
targets of
twi
such as
bap
,
zfh
-
1
,
mef2
,
eve
,
htl
,
tin
,
etc.

promote the differentiation

of mesoderm cells.
Thus d
ifferent muscle progenitor types
(somatic, visceral, fat body, gonadal, cardiac and tracheal)
are formed

in different

parasegments
along the anterior
-
posterior
(AP)
and dorso
-
ventral
(DV)
axes
[45]
. The gradients along
AP and
DV

axes are formed under the influence of regulatory inputs from pair
-
rule and segment polarity

genes such as
dpp
,
wg
,
hh
,
en
,
slp
,
exd
,
eve
, etc.
All of the above
regulatory inputs appear to
have independent influences which are integrated together via
CRMs

[46]
.

Thus the identities of
the motifs in the regulatory cod
e suggest that CRMs in this cluster regulate
the differen
tiation of
muscle progenitors.


Ventral nerve cord



The fifth cluster
discovered consisted of

15

known CRMs for 14 genes and 44 novel
CRMs for 44 genes
.

1
1

out

of
15

known CRMs
have known involvemen
t in ventral nerve cord
(VNC) development during stages 11
-
16.



Among the
4
4

target
s

genes of the 44 novel CRMs
,
2
3

genes
had
in
-
situ

confirmation of
expression in the VNC
during
stages 11
-
16
,

while
another 7
genes
were
annotated in Flybase as
functional
in the
VNC
.

Thus
30
out of 44
genes

were validated in VNC development. Out of
these, 1
2

genes
pdm2
,
tsh
,
traf1
,
wg
,
phyl
,
klu
,
mirr
,
D
,
Ap
-
2
,
B
-
h1
,
run
,
and
rst

have
known
function in
VNC
development
while
18

genes
ceng1a
,
slp2, ush, rx, pk, sens,
comm2,

CG6897,
fz2, CG11347, klar, ets65a, pxb, ptx1, hth, corto,
Ca
-
alpha1T
, and
CG9650

are novel.

For the
rest 14 genes, 10 have no information available while 4 show no expression in the VNC.



The regulatory code for the VNC cluster
consists of
15 motifs
.


The regulatory code
could separate the

known neuronal enhancers in the VNC cluster from other known
REDfly
CRMs and 1000 random se
quences with 100% specificity.



The
motifs
in the regulatory code
closely
match
ed

the known
consensus
for the TFs
dl
,
twi
,
gr
h
,
trl
,
ftz
,
pros

and
the
bithorax complex
TFs
which
have known involvement

in VNC
regulation
.

The TF
twi

expresses
rhomboid

gene in the ventral

lateral region of the blastoderm
embryo to specify the neuroectoderm anlagen. About a quarter of the neuroect
odermal cells
eventually differentiate as neuroblasts while the rest form the ectoderm. During gastrulation
(stage 7), the neuroectodermal cells migrate to the ventral region. The fate of neuroectodermal
cells as neuroblasts or epidermal cells is decided

during stages 8
-
12 by lateral inhibition
.
N
euroblast formation
is
promoted by the proneural genes
(
ac
,
sc
,
lsc
,

ase
)

and inhibited by
neurogenic genes

(
Notch
,
Delta
,
su(H)

etc.
).

T
he neuroblasts in different parasegments along the
AP and DV axes develop

subtypes by expressing different sets of genes

under the control of
various TFs
.
The regulatory inputs of these TFs are combined by CRMs
.
The TFs
pros
,
grh
,
ftz
,
bithorax complex TFs are known to function together in neuroblast differentiation, such as
the
formation of ganglion mother cells
[47, 48]
.
The
different
neurblasts proliferate in the interior of
the embryo during stages 13
-
16 to form the VNC.
Thus f
rom the identities of motifs in the
current regulatory code, it appears that the CRMs in this cluster regulate neuroblast
differen
tiation.


Eye
-
antennal disc



The sixth
cluster
consisted of
18

known
CRMs for 17 genes
and 21 novel CRMs
for 21
genes
.
12

of
the
18
known CRMs
confer expression
in the
eye
-
antennal disc

during stages 12
-
16
.



Among the
novel CRMs,
the
9
target genes
ceng
1a
,
lola
,
D
,
fz
,
spn
,
cas
,
fer2lch
,
opa
,
skpd

were confirmed as expressed in the eye
-
antennal disc.
For 9 other target genes, no
expression or functional annotation information was available. Three genes showed no
expression in the embryonic eye
-
antennal

disc.
Out of the 9 validated genes,
lola
,
D
,
fz
,
spn

and
cas

have known involvement in eye
-
antennal development while the genes
ceng1a
,
fer2lch
,
opa

and
skpd

represent novel targets.



The eye
-
antennal regulatory code
had
10
motifs
. The regulatory code
gave 100%
specificity over other known CRMs and
1000
random sequences.

Motifs in the regulatory code
were recognized as closely resembling the known binding sites for the TFs
Antp
/
zen
,
Exd
,
tll

and
ey
/
toy
.
The TFs
ey

and
toy

express the
sine oculis

(
so
)
gene in the
embryonic blastoderm

to
specify the
optic primordium
.

The
eye
-
antennal
imaginal disc is formed during

stage 12

by
the
invagination

of
optic
primordium cells

to produce a
monolayer
epithelium.

Commitment of the
imaginal disc cells towards eye
or antenna fates occurs in a series of steps from stage 12 embryo
until the second instar larva. Several eye or antennal determinant genes such as
eyg
,
ey
,
toy
,
dac
,
optix
,
salm
,
E
xd
,
d
ll
, etc. are expressed from embryonic stage 12 onwards in the imagi
nal disc.
The TFs
dpp
,
zen
,
tll
,
otd
,
wg

etc. are active in this process.
The current regulatory code
therefore
appears to regulate genes in this early stage of eye
-
antennal specification
.



S
urprising
ly the
deaf1

binding motif
also
appears
in the regula
tory code
.
Deaf1

was
o
bserved in the previous section

to interact with
ey

and
toy

in the TF
-
TF interaction matrix
.
Its
role in eye development is therefore a subject for further
study
.


Blastoderm

embryo



The CRMs containing AT
-
rich motifs obtained in t
he first three iterations of
the
clustering
procedure
consisted of two major
CRM types
controlling target gene expression in the
blastoderm embryo and the
wing imaginal

disc.
The blastoderm CRMs were separated manually
on the basis of their enrichment in
binding sites for
the
known blastoderm TFs
hb
,
bcd
,
cad
,
Kr
,
kni
,
dl

and
tll
. The binding sites were annotated using the PWMs for these TFs reported in
previous studies
[6, 7]
. A total of
33 known CRMs
for 29 gen
e
s and
133
novel
CRMs for 79
genes
were recovered. The
novel
blastoderm CRM windows showed a 2
-
fold enrichment of TF
-
binding as compared to their flanking
-
5 kb to +5 kb reg
ions (
Supplementary Figure 7
).



The target genes of the
novel
blastoderm CRMs were studied for zygotic expression in
stage 4
-
6 developing embryos.
Out of
79 genes
, zygotic expression could be confirmed for
at
least
50 genes
(
3
7
from
in
-
situ images
in
BDG
P in
-
situ
[49]

and Fly
-
FISH
[50]

databases,
and
13
from
microarray expression data
[51, 52]

as described in Supplement
ary Figure

8
). C
onsidering
15% of all 14
,000 Drosophila genes to be zygotically expressed in the blastoderm embryo
, which
is a generous estimate
[50]
,

this association
is statistically significant with a P
-
value of
1
.8
×
10
-
1
8

(
hypergeometric probability with Bonferroni correction
o
f
factor 14,000
).






Wing imaginal
disc



Since the wing imaginal disc specific CRMs
could not be separated
from the AT rich
cluster

by the clustering procedure
, we
attempted to
manually
create this cluster.
Since there is
no known regulatory code for w
ing imaginal disc specification, w
e
derived a regulatory code
from known CRMs of the genes
ct
,
dpp
,
kn
,
kni
,
salm
,
ser
,
vg
,
pfe

and
chn
. All these CRMs
confer gene expression in the wing disc
in 3
rd

instar larva
. We extracted

the common motifs
from
the
se

CRMs
and
derived a regulatory code using the same procedure as described in the
Method
s

section. Using this

regulatory code, 33
novel
CRMs for 31 genes were
separated from
the AT rich cluster
. 15 target genes
for the novel CRMs
including
pdm2
,
drm
,
cg25
c
,
act57b
,
dve
,
inv
,
rho
,
emc
,
c15
,
hh
,
CG12063
,
grn
,
CG8483
,
B
-
h1

and
bi

were
validated
by their
enrichment in
the wing imaginal disc in the 3
rd

instar larva using
microarray analysis
[53]
. For
the rest 16 genes, no means of validation
was

available
.

All the above validated genes have
known function in wing development.



The regulatory code included 11 distinct motifs with 7 motifs resembling TFs
which are
known to

regulate wing imaginal disc development in the larval stage
.
The wing imaginal disc

is

formed from the embryonic ectoderm by an invagination at the
compartment where DV
stripe of
wg

intersects
with
AP
stripe of
dpp
.

The primordium of the wing disc is es
tablished in late
stages 13
-
16 of the developing embryo when TFs such as
hth
,
exd
,
vg
,
sna
,
esg

become
transcriptionally active in the wing imaginal cells. Growth and pre
-
patterning of the imaginal
disc takes place in the larva with a number of genes expr
essed presaging the development of
adult structures. The TFs
en
,
hh
,
dpp
,
wg
,
Antp
,
Ubx
,
Exd
,
hth
,
Dll

etc. have been implicated in
pre
-
patterning of the imaginal disc into compartments. CRM mediated spatio
-
temporal specific
expression of genes in the wi
ng imaginal disc has been reported by the action of the TFs
ap
,
pan
,
S
u(H)
,
nub
,
mad
,
sd
,
ara
,
e(spl)
,
Ubx

etc.
The motifs in the current regulatory code resemble the
known motifs for the TFs
ubx
,
ap
,
ara
,
sd
,
mad
,
pan
,
su(H)
,
nub
, etc.







DISCUSSION



The
Modulexplorer
Bayesian network
model describes
a CRM
a
s
a cluster

of TFBS
s

for
TFs that co
-
regulate gene expression in a particular tissue and stage of development
. The TFBS
combination defines
a regulatory code
.

The regulatory codes were learnt de
-
novo in this study

from a repository of CRM
s

of unknown types
. The
trained
model
was
used to
perform a genome
scan to
discover novel
putative CRMs
containing
these
regulatory code
s
.

C
lustering of the
CRMs
based on common motifs
was performed to obtain
CR
Ms with
specific regulatory codes
.
Each cluster of CRMs was found to regulate
spatio
-
temporal specific gene expression in
a
particular tissue
.



In previous
studies
, low sequence similarity has been reported among CRMs. This is true
as the 414 CRMs comp
rising the training and test data in this study had less than 40%
sequence

similarity. However we observed similarity of CRMs
in terms
of the TFBS or motifs that they
contain. Based on their regulatory motif content, the CRMs could be clustered into func
tional
groups. Therefore a restricted notion of similarity among CRMs emerges in this study.



Furthermore in this study the common motifs among “similar” CRMs were used to
computationally derive
regulatory
codes
for gene expression in specific tissues
an
d
developmental stages
.
Here the regulatory codes were learnt in the form of probabilistic
interaction among the TFBS
s

or motifs. We studied such interaction at the most basic level in
the pairwise
TF
-
TF interaction matrix.

The matrix is a simplificatio
n of the actual codes learnt
by the model which consider mutual interactions among several TFs at the same time. The
interactions observed in the matrix could be corroborated with known biology.
The
Modulexplorer model thus gives some interesting insight
s into the nature and composition of
CRMs.



Modulexplorer also gives clues for the improvement of sequence
-
based modeling of
regulatory sequences such as using oligonucleotide motifs. We saw in this study (e.g.
Supplementary Figure 2) that oligonucleotid
e motifs produce large number of matches in non
-
TFBS segments of the CRMs, which reduces the
eff
ectiveness

of modeling based on these
motifs. On the other hand
Modulexplorer
model
showed improvement in the discrimination
between

CRMs and backgr
ound (especially
intergenic and intron sequences
)

by accurately
removing the
non
-
TFBS
segments from the
CRM
s
and using
only
TFBS segments

to train the
model
. The TFBSs were robustly characterized by exploiting the property of homotypic
clusterin
g of Drosophila CRMs and by using dyad motifs to represent the
TFBSs. Such
enhancements may be possible in other applications of sequence
-
based modeling.



Based on Modulexplorer, this study
contributes
regulatory codes
for CRM
s

associated
with the develo
pment of specific tissues
including mesoderm,
wing imaginal
disc, ventral
nerve
cord
and
eye
-
antenna
l disc
.

Furthermore,
it
also
assist
s

in the functional annotation of genes.
For instance
, the
novel
predicted
CRMs
could help us to classify 31 new ge
nes in the above
developmental
functions
. We could also identify roles of some TFs in these regulatory
mechanisms. For instance
we suggest the
novel
role
of
deaf1

in eye
-
antennal disc development.



The regulatory codes currently discovered are few in nu
mber as the application of
Modulexplorer model is restricted to
CRMs
that have been previously characterized
.

Another
reason is that t
he current
method of CRM clustering using frequent itemset mining is
not robust
enough
. It

can only
discover clusters
wh
ere a sufficient number of CRMs share a large number
of common motifs
.

Also
the model
currently relies on homotypic clustering to
accurately
discover the TFBS. This may not be
possible
in other species or
in
CRMs

where homotypic
clustering is absent
.
In

such scenarios, one of the possibilities could be
to
characterize

TFBS
by
motif discovery in
a set of
known
CRMs h
aving

the same motif module

[54]
.



DATA



Experimentally validated CRM
s

were derived from version 1 of the
REDfly database
(September 2006 release). The database contained a total of
619
CRM
s
, among which some
were redundant

or
overlap
ping.
Pairwise sequence

similarity was computed using ClustalW
[55]

and sequences with more than 40% similarity were treated as redundant.
After removing the
redundant seq
uen
ces,
414
non
-
overlapping
sequences

were

obtai
ned
.

Out of these,
58
sequences
were
too long
(>3.5 kb)
to be useful in the Modulexplorer pipeline. These
were
taken as
test
dataset. The
remaining
356
CRM sequences were selected as the training dataset.



The training CRM
s

represent

a diverse mix regu
lating gene expression in a variety of
tissues and stages.
Out of 356 training CRM
s
,
302
control gene expression

in the embryo, 193 in
the larva and 41 in the adult fly.
Of
302 CRMs active in the embryo, 87 are expressed in the
blastoderm and 215 in the
post
-
blastoderm stages.

T
he 215 post
-
blastoderm CRMs
are
expressed in one or more of the
tissues
integumentary system, imaginal precursor, nervous
system, digestive system, muscle system, circulatory system, tracheal system, reproductive
system, excretory

system, edipose system and endocrine system as shown in Supplementary
Figure 4.



Three different

background
sequence sets were created from
coding, intron and intergenic
sequences

selected
randomly from the
whole Drosophila
genome. Each of the three bac
kground
sets
consisted of
356 sequences
size
-
matched with the 356 training CRM
s
.



The BDGP
release
5 genome assembly
(2007)
was used for the whole
genome prediction
and for all other analyses.



The datasets and the whole genome prediction results are ava
ilable for free access at our
website
http://www.comp.nus.edu.sg/~bioinfo/Drosophila

.




METHOD


TFBS Characterization and Motif Extraction



T
ranscription

factors can bind to
many
differen
t DNA sequences with different binding
affinities.
Various motif representations have been developed t
o model
the
non
-
specificity of
TF
binding

[16]
.
N
o single
representation

is yet known to be optimal for all TFs
, thus the choice of
representation depends on the nature of TFBS. T
he experimentally annotated TFBS in
Drosophila CRMs
are
5 to 140 bp long footprints (
mean length 16 bp)

containing multiple
short
conserved segments

of length 6
-
10 bp with variable gaps or spacers. The conserved segments
can
appear on either strand
in the
footprint.
The f
ootprints
usually
lie close to each other

in
the
CRM
and
may share
conserved segments
. Therefore to model the TFBS in this study,

we used
dyad motif

representation
[23
-
26]

similar to the one proposed by Sinha and Tompa
[56]

to model
spaced motifs in yeast.



A dyad motif is defined as a pair of monads separated by a
distance

of at most
D

bp.
Each
monad
is
a consensus sequence in degene
rate IUPAC alphabet {A,C,G,T,R,Y,S,W,N}
[56]
.
A dyad motif is associated with an
order
.
I
f
A

and
B

are
two
monads
in a dyad motif
(
A
,
B
)
, the
dyad is said to appear in the left (L) (or right (R)) order if
A

appears to the left (or
right) of
B
.

The
monads may appear on either strand
.
This dyad motif representation is suitable
to represent cons
erved segments in the TFBS footprints in a CRM d
ue to its ability to model
their
degeneracy

as well as mutual

order, distance and strand independence
.



We
characterize
d

the TFBS in
dividually in each
CRM

using a
novel
procedure
with
the
following steps. T
he steps are illustrated by an example in Supplementary Figure 2.


Step 1: Extraction of over
-
represented
patterns
:
For all possible
nucleotide patterns
of length
6
over the alphabet {A,C,G,T}
, their number of instances in the CRM
and in a set of backgro
und
sequences
is counted

with
a maximum of 1 mismatch
.
Let the

number of instances of a pattern
in the CRM and in the background be
1
n

and
2
n

respectively. Then
its
over
-
representation in
the
CRM as compa
red to background is measured using the
Z
-
score:







1 2
1 2
ˆ ˆ
1 1
ˆ ˆ
1
p p
Z
p p
N N


 
 
 
 
,

where
1
N

is the CRM length,
2
N

is the total length of background sequences,
1
1
1
ˆ
n
p
N

,
2
2
2
ˆ
n
p
N

, and
1 2
1 2
ˆ
n n
p
N N



.

Monads whose Z
-
score is above a fixed cutoff are selected.


Step 2:
Extraction
and clustering
of over
-
represented
pairs
:
From the selected
patterns
, pairs
of
patterns
occurring at short distance from each other

(0 to 15 bp gap) are searched and
the most
over
-
represented

pairs are selected. Then the selected pairs are clustered according to their
sequence similarity.
Within each cluster,
t
he
top scoring pair

is
called
the cluster center

a
nd any
other
pair in the cluster will have at most
two mismatches

with the
cluster center.


Step 3: Identification of redundant sites:

W
e
look at the sites of occurrences of
all the
pattern
pairs in the CRM.
For each cluster, we identify
sites in the C
RM where more than 90% of the
pattern
pairs in
the
cluster
simultaneously occur
.
Such sites are the conserved sites or redundant
sites in the CRM since a number of similar patterns are found corresponding to these sites.


Step 4: Representing conserved s
ites with motifs:

For each cluster, all the pattern pairs in the
cluster are aligned together to derive a consensus motif. The consensus motif closely represents
all the pattern pairs in the cluster and matches the redundant sites in the CRM. The consen
sus
motif is written in degenerate IUPAC alphabet {A,C,G,T,R,Y,S,W,N}.

The consensus motifs
are the dyad motifs used in the modeling of Modulexplorer.


Bayesian network model



The Modulexplorer Bayesian network model describes a CRM as a combination of
d
yad
motif
s

with mutual order and distance constraints.
The parent nodes are
the dyad
motif
s

i
P
,


1,2,,
i K

,
with the states 0
(absent)
or 1

(present)
,
while the
ir

common child node
is the
CRM
node
Y

with
states Tr
ue (CRM)
or
False (non
-
CRM)
.
Each
dyad motif
node
i
P

has two
parent
monad
motif nodes


2 1 2
,
i i
M M

, where the

nodes
,1,2,,2
j
M j K


take the states
0

or
1

depending upon whether or the motif
j
M

is present or absent in
the sequence
. The node
i
P

has a
noisy
-
AND relationship with its
parent
monad
motif nodes
.

Additionally
the
nodes
O
i

and
d
i

impose distance and order constraints upon the
monad motifs in a d
yad
. The order constraint
i
O

defines a bias in the relative positions (left or right) of
the monad
motifs
2 1
i
M


and
2
i
M
, whereas
the distance node
i
d

models th
e distance between the adjacent occurrences of the pair of motifs
2 1
i
M


and
2
i
M
.
T
he distance
is
discretized into two levels


low and high


with distance

??
.
The
conditional probability table
(CPT)
for a d
istance node is shown in

Supplementary Figure
10
.


The noisy
-
AND relationship
is implemented as described in [?]. A

dyad motif
node
i
P

and its
parent monad
motif nodes
2 1 2
,
i i
M M


are related
through dummy variable
s
2 1 2
,
i i
M M

 
,
which also take the states 0 and 1. The relationship between
i
P

and the dummy variables
2 1 2
,
i i
M M

 

is
a
deterministic AND. However, the dummy variables depend stochastically upon
th
e actual variables as


0
Pr 1 0
j j j
M M


  

and


1
Pr 1 1
j j j
M M


  
. Thus
i
P

depends
stochastically upon the motif nodes as


2 1 2 2 1 2
Pr 1,
a b
i i i i i
P M a M b
 
 
   
, where


,0,1
a b

.


The Bayesian network encode
s
the joint probability
:









1
2
1 2 1 1 1
2
1 1 1 1
Pr,,,,,,,,,,,
Pr Pr,Pr,Pr
j
K K K K
K K K K
i i i i i j
i i i j
Y M M P P O O D D
P Y O Y P D Y P M P

 
 
   
 
 

 
 
 
   



In the training of the
Bayesian network model
,

first we learn the CPTs of order and
distance nodes. For each
dyad
motif
the occurrences of
its respective
monad

motif parents
in
both training CRM and
non
-
CRM sequences are identified.
T
hese occurrences

are used to
compute
the
order probabilities
CRM
left
p
,
CRM
right
p
,
nonCRM
left
p

and
nonCRM
right
p
, and the distance probabilities
CRM
low
p
,
CRM
high
p
,
nonCRM
low
p

and
nonCRM
high
p

individually for
the
dyad
. The CPTs of order and distance
nodes are henceforth kept fixed.


The
reafter the

noisy AND parameters
0 1
,
j j
 

and the parameters Pr(Y|P1,P2,…,PN)

are
estimated.
Removing
the order and distance nodes
, t
he model is shown as an undirected graph in
Supplementary Figure 10
. Each training sequence,
| 1,2,,
n
S n N

, is represented as an ordered
pa
ir


,
n n
m y
, where


1 2 2
K
n n n n
m m m m


is a binary vector representing the presence or
absence of each of the
monad
motifs in the sequence, and
n
y

is a label 1 or 0 depending upon
whether the sequence is
a CRM or non
-
CRM. The nodes
Q

in the undirected graph serve to
factorize the probability potentials as described in [Vomlel (2002)]. The expectation
maximization (EM) algorithm is used to learn the model parameters from the data


,
n n
m y
. The
EM equations are written as [Vomlel (2002), Vomlel (2003)]

M step:






1,
j j
a
j
j
n M M a
n M a


 


,






1,
Pr 1|
i
i
n P Y b
P Y b
n Y b
 
  

, where


,0,1
a b

,
1,,2
j K

,
1,2,,
i K

.

E
-
step:






1
Pr 1| if
1,
0 otherwise
N
i n n
i
n
P m y b
n P Y b


 

  




, where


2 1 2
2 1 2
Pr 1|
i i
m m
i n i i
P m
 


 






1
Pr 1| if
1,
0 otherwise
j
N
j n n
j j
n
M m m a
n M M a



 


  




,















1
Pr | Pr |,,Pr |,Pr |
i i i j i x
i i x
K
j n
j n i n Q P i i QM i j j j n QM i x x x x
P Q M
i
M m P Y y Q P Q M M M m Q M M M m
  
 


 
 
 
 
    
   
 
 
 
 
 
 
 
 
 
  

,

where
x

is the label of the motif which is the pair of
j
M
, and the potentials


are shown below:


The
Bayesian network CRM model can infer whether or not a given sequence is a CRM
based on its motif content.
The sequence
is scanned to ascertain which of the 2
K

motifs
,1,2,,2
j
M j K



occur in the sequence, as well as the order and distance bet
ween the adjacent
motif
s
. This
evidence is provided to the
Bayesian
network and a standard
inference
algorithm is
used

to
assign a
value between 0 and 1 at the “CRM” node, which is the estimated probability of
the given sequence being a CRM.

To predict C
RMs in a long uncharacterized sequence, a
sl
iding window approach is used.


Feature Based Clustering of CRMs



The aim of feature based clustering is to find clusters of CRMs having a common set of
motifs.
The dyad motifs are called “items”

and the
set of

all dyad motifs

that match a given
CRM

is called

the “itemset” for that CRM
.
The
closet

a
lgorithm
[?]

is used to
determine a
closed maximal subset of items that are common to at least
T

itemsets.
T
his translates to finding
the maximal set of motifs tha
t are common to at least
T

CRMs
. The number
T

is called the
minimum support.
Thus
the most conserved cluster of CRMs with size ≥
T

is obtain
e
d.



In each run, the
closet

algorithm

output
s

the most conserved set of items (motifs) in at
least
T

itemsets (se
quences). CRMs obtained in the first run are removed from the list and
closet

is run on the reduced set of sequences wit
h a different
T

value. Similarly to get the
next
most
conserved cluster, the CRMs in
previous
cluster are also removed from the list a
nd the algorithm
is run again.
The

iterative procedure finds several conserved clusters of CRMs.



The number of CRMs and the number of common motifs in the cluster decide the
fitness

of the cluster. Fitness is calculated as
the
inverse of the probabilit
y of obtaining the cluster by
chance. Let
S be
the total number of itemsets (or
CRM
s) and
M be
the total number of distinct
items (motifs). Then the chance probability that at least
T

itemsets have the
N

common items is
given by:








Pr at least itemsets contain all items 1
S
t S t
t T
S
T I P I P I
T


 
 
   
 
   
 

,

where


P I

is the probability that all the
N

items


1 2
,,,
N
I i i i


are selected, regardless of
whether or not the other items are selected
. If
p
k

is the frequency of
i
k
,
then


1
N
k
k
P I p



.

Fitness is ta
ken as negative log of this probability.

The fitness is a convex function of T. We
select the cluster with highest fitness across all T.


Derivation of Regulatory Code



By
f
eature
based clustering of CRMs
, we got

clusters of CRMs having a
common set of
motifs.
T
his
common set of motifs

is used

to derive the regulatory code.
The regulatory code is
obtained as a
minimal
subset of the common motifs that can
eff
ec
t
ive
ly
discriminate
between
the
CRMs in the cluster
and the
background sequences
or
other CRMs.
We f
irst
analyze all the
common motifs
with
STAMP tool
[57]
. The STAMP tool
computes
the similarities among the
motifs and hierarchically
clusters them
.
The hierarchical cluster is shown as a
phylogenetic tree
.
With a certain similarity cutoff, we separate the motifs into distinct clusters. From each c
luster,
we select one representative motif that is most over
-
represented in the CRMs in the cluster as
compared to background sequences or other CRMs. The selected motifs comprise the minimal
regulatory code.




REFERENCES


1.

Arnone,
M.I. and E.H. Davidson,
The hardwiring of development: organization and function of
genomic regulatory systems.

Development (Cambridge, England), 1997.
124
(10): p. 1851
-
64.

2.

Davidson, E.H.,
The Regulatory Genome: Gene Regulatory Networks In Development A
nd
Evolution
. First ed. 2006: Academic Press.

3.

GuhaThakurta, D.,
Computational identification of transcriptional regulatory elements in DNA
sequence.

Nucleic acids research, 2006.
34
(12): p. 3585
-
98.

4.

Gallo, S.M., L. Li, Z. Hu, and M.S. Halfon,
REDfly:

a Regulatory Element Database for Drosophila.

Bioinformatics (Oxford, England), 2006.
22
(3): p. 381
-
3.

5.

Markstein, M., P. Markstein, V. Markstein, and M.S. Levine,
Genome
-
wide analysis of clustered
Dorsal binding sites identifies putative target genes i
n the Drosophila embryo.

Proceedings of
the National Academy of Sciences of the United States of America, 2002.
99
(2): p. 763
-
8.

6.

Berman, B.P., Y. Nibu, B.D. Pfeiffer, P. Tomancak, S.E. Celniker, M. Levine
, et al.
,
Exploiting
transcription factor binding

site clustering to identify cis
-
regulatory modules involved in pattern
formation in the Drosophila genome.

Proceedings of the National Academy of Sciences of the
United States of America, 2002.
99
(2): p. 757
-
62.

7.

Rajewsky, N., M. Vergassola, U. Gaul, an
d E.D. Siggia,
Computational detection of genomic cis
-
regulatory modules applied to body patterning in the early Drosophila embryo.

BMC
bioinformatics, 2002.
3
: p. 30.

8.

Schroeder, M.D., M. Pearce, J. Fak, H. Fan, U. Unnerstall, E. Emberly
, et al.
,
Transc
riptional
control in the segmentation gene network of Drosophila.

PLoS biology, 2004.
2
(9): p. E271.

9.

Berman, B.P., B.D. Pfeiffer, T.R. Laverty, S.L. Salzberg, G.M. Rubin, M.B. Eisen
, et al.
,
Computational identification of developmental enhancers: conse
rvation and function of
transcription factor binding
-
site clusters in Drosophila melanogaster and Drosophila
pseudoobscura.

Genome Biol, 2004.
5
(9): p. R61.

10.

Markstein, M., R. Zinzen, P. Markstein, K.P. Yee, A. Erives, A. Stathopoulos
, et al.
,
A regulat
ory
code for neurogenic gene expression in the Drosophila embryo.

Development, 2004.
131
(10): p.
2387
-
94.

11.

Frech, K., J. Danescu
-
Mayer, and T. Werner,
A novel method to develop highly specific models for
regulatory units detects a new LTR in GenBank whi
ch contains a functional promoter.

J Mol Biol,
1997.
270
(5): p. 674
-
87.

12.

Wasserman, W.W. and J.W. Fickett,
Identification of regulatory regions which confer muscle
-
specific gene expression.

Journal of molecular biology, 1998.
278
(1): p. 167
-
81.

13.

Bail
ey, T.L. and W.S. Noble,
Searching for statistically significant regulatory modules.

Bioinformatics (Oxford, England), 2003.
19 Suppl 2
: p. II16
-
II25.

14.

Frith, M.C., U. Hansen, and Z. Weng,
Detection of cis
-
element clusters in higher eukaryotic DNA.

Bioi
nformatics (Oxford, England), 2001.
17
(10): p. 878
-
89.

15.

Sinha, S., E. van Nimwegen, and E.D. Siggia,
A probabilistic method to detect regulatory modules.

Bioinformatics (Oxford, England), 2003.
19 Suppl 1
: p. i292
-
301.

16.

Stormo, G.D.,
DNA binding site
s: representation and discovery.

Bioinformatics, 2000.
16
(1): p.
16
-
23.

17.

Narang, V., W.K. Sung, and A. Mittal,
Computational annotation of transcription factor binding
sites in D. Melanogaster developmental genes.

Genome informatics, 2006.
17
(2): p. 14
-
24.

18.

Chan, B.Y. and D. Kibler,
Using hexamers to predict cis
-
regulatory motifs in Drosophila.

BMC
Bioinformatics, 2005.
6
: p. 262.

19.

Li, L., Q. Zhu, X. He, S. Sinha, and M.S. Halfon,
Large
-
scale analysis of transcriptional cis
-
regulatory modules revea
ls both common features and distinct subclasses.

Genome Biol, 2007.
8
(6): p. R101.

20.

Lifanov, A.P., V.J. Makeev, A.G. Nazina, and D.A. Papatsenko,
Homotypic regulatory clusters in
Drosophila.

Genome research, 2003.
13
(4): p. 579
-
88.

21.

Ochoa
-
Espinosa, A
., G. Yucel, L. Kaplan, A. Pare, N. Pura, A. Oberstein
, et al.
,
The role of binding
site cluster strength in Bicoid
-
dependent patterning in Drosophila.

Proc Natl Acad Sci U S A,
2005.
102
(14): p. 4960
-
5.

22.

Davidson, E.H., J.P. Rast, P. Oliveri, A. Ransic
k, C. Calestani, C.H. Yuh
, et al.
,
A genomic regulatory
network for development.

Science, 2002.
295
(5560): p. 1669
-
78.

23.

van Helden, J., A.F. Rios, and J. Collado
-
Vides,
Discovering regulatory elements in non
-
coding
sequences by analysis of spaced dyads.

Nucleic acids research, 2000.
28
(8): p. 1808
-
18.

24.

Eskin, E. and P.A. Pevzner,
Finding composite regulatory patterns in DNA sequences.

Bioinformatics (Oxford, England), 2002.
18 Suppl 1
: p. S354
-
63.

25.

Favorov, A.V., M.S. Gelfand, A.V. Gerasimova, D.A.

Ravcheev, A.A. Mironov, and V.J. Makeev,
A
Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved
estimation of the signal length.

Bioinformatics (Oxford, England), 2005.
21
(10): p. 2240
-
5.

26.

Rombauts, S., K. Florq
uin, M. Lescot, K. Marchal, P. Rouze, and Y. van de Peer,
Computational
approaches to identify promoters and cis
-
regulatory elements in plant genomes.

Plant
physiology, 2003.
132
(3): p. 1162
-
76.

27.

Bergman, C.M., J.W. Carlson, and S.E. Celniker,
Drosophil
a DNase I footprint database: a
systematic genome annotation of transcription factor binding sites in the fruitfly, Drosophila
melanogaster.

Bioinformatics (Oxford, England), 2005.
21
(8): p. 1747
-
9.

28.

Abnizova, I., R. te Boekhorst, K. Walter, and W.R. Gi
lks,
Some statistical properties of regulatory
DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: the
fluffy
-
tail test.

BMC Bioinformatics, 2005.
6
: p. 109.

29.

Sharan, R., A. Ben
-
Hur, G.G. Loots, and I. Ovcharenko,
CRE
ME: Cis
-
Regulatory Module Explorer
for the human genome.

Nucleic Acids Res, 2004.
32
(Web Server issue): p. W253
-
6.

30.

Sokal, R.R. and C.D. Michener,
A statistical method for evaluating systematic relationships.

Univ.
Kansas Sci. Bull., 1958.
38
: p. 1409
-
1
438.

31.

Furlong, E.E., E.C. Andersen, B. Null, K.P. White, and M.P. Scott,
Patterns of gene expression
during Drosophila mesoderm development.

Science, 2001.
293
(5535): p. 1629
-
33.

32.

Borghese, L., G. Fletcher, J. Mathieu, A. Atzberger, W.C. Eades, R.L.
Cagan
, et al.
,
Systematic
analysis of the transcriptional switch inducing migration of border cells.

Dev Cell, 2006.
10
(4): p.
497
-
508.

33.

Kusch, T. and R. Reuter,
Functions for Drosophila brachyenteron and forkhead in mesoderm
specification and cell sign
alling.

Development, 1999.
126
(18): p. 3991
-
4003.

34.

Mann, R.S. and G. Morata,
The developmental and molecular biology of genes that subdivide the
body of Drosophila.

Annu Rev Cell Dev Biol, 2000.
16
: p. 243
-
71.

35.

Morata, G.,
How Drosophila appendages d
evelop.

Nature reviews, 2001.
2
(2): p. 89
-
97.

36.

de Celis, J.F., M. Llimargas, and J. Casanova,
Ventral veinless, the gene encoding the Cf1a
transcription factor, links positional information and cell differentiation during embryonic and
imaginal developm
ent in Drosophila melanogaster.

Development, 1995.
121
(10): p. 3405
-
16.

37.

Hirose, F., N. Ohshima, M. Shiraki, Y.H. Inoue, O. Taguchi, Y. Nishi
, et al.
,
Ectopic expression of
DREF induces DNA synthesis, apoptosis, and unusual morphogenesis in the Drosophi
la eye
imaginal disc: possible interaction with Polycomb and trithorax group proteins.

Mol Cell Biol,
2001.
21
(21): p. 7231
-
42.

38.

Zhang, C.C., J. Muller, M. Hoch, H. Jackle, and M. Bienz,
Target sequences for hunchback in a
control region conferring Ultr
abithorax expression boundaries.

Development (Cambridge,
England), 1991.
113
(4): p. 1171
-
9.

39.

Veraksa, A., J. Kennison, and W. McGinnis,
DEAF
-
1 function is essential for the early embryonic
development of Drosophila.

Genesis, 2002.
33
(2): p. 67
-
76.

40.

R
ushlow, C., P.F. Colosimo, M.C. Lin, M. Xu, and N. Kirov,
Transcriptional regulation of the
Drosophila gene zen by competing Smad and Brinker inputs.

Genes Dev, 2001.
15
(3): p. 340
-
51.

41.

Song, X., M.D. Wong, E. Kawase, R. Xi, B.C. Ding, J.J. McCarthy
, et

al.
,
Bmp signals from niche
cells directly repress transcription of a differentiation
-
promoting gene, bag of marbles, in
germline stem cells in the Drosophila ovary.

Development, 2004.
131
(6): p. 1353
-
64.

42.

Martin, D., C. Brun, E. Remy, P. Mouren, D. Th
ieffry, and B. Jacq,
GOToolBox: functional analysis
of gene datasets based on Gene Ontology.

Genome Biol, 2004.
5
(12): p. R101.

43.

The Interactive Fly,
http://www.sdbonline.org/fly/aimorph/
mesoderm.htm
.

44.

Ganguly, A., J. Jiang, and Y.T. Ip,
Drosophila WntD is a target and an inhibitor of the
Dorsal/Twist/Snail network in the gastrulating embryo.

Development, 2005.
132
(15): p. 3419
-
29.

45.

Borkowski, O.M., N.H. Brown, and M. Bate,
Anterior
-
posterior subdivision and the diversification
of the mesoderm in Drosophila.

Development, 1995.
121
(12): p. 4183
-
93.

46.

Furlong, E.E.,
Integrating transcriptional and signalling networks during muscle development.

Curr Opin Genet Dev, 2004.
14
(4): p. 343
-
50.

47.

Prokop, A., S. Bray, E. Harrison, and G.M. Technau,
Homeotic regulation of segment
-
specific
differences in neuroblast numbers and proliferation in the Drosophila central nervous system.

Mech Dev, 1998.
74
(1
-
2): p. 99
-
110.

48.

Skeath, J.B. and S. T
hor,
Genetic control of Drosophila nerve cord development.

Curr Opin
Neurobiol, 2003.
13
(1): p. 8
-
15.

49.

Tomancak, P., B.P. Berman, A. Beaton, R. Weiszmann, E. Kwan, V. Hartenstein
, et al.
,
Global
analysis of patterns of gene expression during Drosophila
embryogenesis.

Genome Biol, 2007.
8
(7): p. R145.

50.

Lecuyer, E., H. Yoshida, N. Parthasarathy, C. Alm, T. Babak, T. Cerovina
, et al.
,
Global analysis of
mRNA localization reveals a prominent role in organizing cellular architecture and function.

Cell,
200
7.
131
(1): p. 174
-
87.

51.

Arbeitman, M.N., E.E. Furlong, F. Imam, E. Johnson, B.H. Null, B.S. Baker
, et al.
,
Gene expression
during the life cycle of Drosophila melanogaster.

Science, 2002.
297
(5590): p. 2270
-
5.

52.

Pilot, F., J.M. Philippe, C. Lemmers, J.
P. Chauvin, and T. Lecuit,
Developmental control of nuclear
morphogenesis and anchoring by charleston, identified in a functional genomic screen of
Drosophila cellularisation.

Development, 2006.
133
(4): p. 711
-
23.

53.

Butler, M.J., T.L. Jacobsen, D.M. Cain
, M.G. Jarman, M. Hubank, J.R. Whittle
, et al.
,
Discovery of
genes with highly restricted expression patterns in the Drosophila wing disc using DNA
oligonucleotide microarrays.

Development, 2003.
130
(4): p. 659
-
70.

54.

Gupta, M. and J.S. Liu,
De novo cis
-
r
egulatory module elicitation for eukaryotic genomes.

Proceedings of the National Academy of Sciences of the United States of America, 2005.
102
(20): p. 7079
-
84.

55.

Pavesi, G., G. Mauri, and G. Pesole,
An algorithm for finding signals of unknown length in
DNA
sequences.

Bioinformatics, 2001.
17 Suppl 1
: p. S207
-
14.

56.

Sinha, S. and M. Tompa,
A statistical method for finding transcription factor binding sites.

Proc
Int Conf Intell Syst Mol Biol, 2000.
8
: p. 344
-
54.

57.

Mahony, S. and P.V. Benos,
STAMP: a we
b tool for exploring DNA
-
binding motif similarities.

Nucleic Acids Res, 2007.
35
(Web Server issue): p. W253
-
8.





The Modulexplorer Pipeline
S
1
S
2
S
3
S
4
S
N
Repository of uncharacterized
CRM
sequences
Background Sequences
S
1
S
2
S
3
S
4
S
N
Characterized CRM
sequences
TFBS
TFBS
TFBS
TFBS
Non
-
TFBS
Whole genome
scan to predict
CRMs
De
-
novo
motif
extraction
D
1
CRM
C
11
C
21
C
22
C
32
C
31
O
1
O
2
d
2
CRM Model
d
1
C
12
O
3
d
3
D
3
D
2


Background
Sequences
S
1
Find monad motifs that are
over
-
represented in the
CRM sequence as
compared to background
CRM sequence
M
1
M
2
M
4
M
6
M
7
M
5
M
3
M
10
M
8
M
9
M
K
Monad motifs
Find pairs of selected monad
motifs that are over
-
represented
in the CRM sequence as
compared to background
Motif Pairs
M
1
M
2
M
4
M
1
M
9
M
1
M
3
M
2
M
3
M
7
M
4
M
9
M
6
M
8
M
1
M
6
M
4
M
8
M
10
M
7
M
1
M
2
M
4
M
2
M
9
M
1
M
3
M
2
M
3
M
7
M
4
M
9
M
6
M
8
M
1
M
6
M
4
M
8
M
10
M
7
Clustered Pairs
D
11
D
1
2
D
11
D
1
2
D
11
D
1
2
D
11
D
1
2
Consensus Dyads
Cluster the similar motif
pairs together
Each cluster yields one
representative
consensus dyad motif
S
1
TFBS
TFBS
TFBS
TFBS
Non
-
TFBS
Map back the consensus dyads
to the sequence to obtain the
TFBS dissection
Extraction of TFBS from a CRM sequence using de
-
novo motif discovery


Figure
1
.

The Modulexplorer pipeline to learn a CRM model from a repository of uncharacterized CRM
s

and us
e the model to predict novel
CRMs.



D
1
CRM
M
11
M
21
M
22
M
32
M
31
O
1
O
2
d
2
d
1
M
12
O
3
d
3
D
3
D
2


1 2 3
Pr CRM,,
D D D

Figure
2
.

T
he
Modulexplorer
Bayesian network model. The model describes a CRM as a cluster of multiple interacting TFBS
s

with distance
and order constraints. T
he
nodes
i
D

are the dyad motifs representing the
TFBSs
. They have states 0 or 1 according to whether the
motif is absent or present in the CRM
. The CRM is their

common effect

or hypothesis,
represented

as
the
child node.
Each dyad
motif
i
D

has two monad components


1 2
,
i i
M M

with a spacer

of 0 to 15 bp. The
se monads are represented by individual nodes
1 2
,
i i
M M

having states 0 or 1,
i.e.

present or absent,
and are related to the
dyad
node
i
D

by a
noisy
-
AND relationship
.


The spacer
length (or d
istance
)
,

discretized as

low or high
,

is modeled by the node
d
i
. Furthermore each
i
D

is associated with an
order

either left
or right according to whether
1
i
M

appear
s to the left or to the right of
2
i
M

in the CRM.
The conditional probability tables associated
with the
edges of
order and distance nodes are
shown in
Supplementary Figure 10
.


0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Sensitivity
False Positive Rate
Overlap of TFBS and Redundant Sites in 19 Fully Annotated CRMs
1.28 DRE
Dfd_EAE
-
F2
Dfd_EAE
-
F9
dpp_dl_mel
dpp_dpp261
dpp_dpp813
eve_EME
-
B
eve_stripe_3+7
eve_stripe2
gsb_fragIV
h_h7_element
h_stripe_6
kni_KD
Kr_CD1
salm_blastoderm
Ser_minimal_wing
Ubx_pbxSB
Ubx_PRE
zen_0.7

(a)

8
9
11
12
13
14
15
4
1
2
3
5
6
7
16
18
19
10
17
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Sensitivity
False Positive Rate
Blastoderm
Dorsal
Homeotic
Pair
-
rule
1 : 1.28_DRE
2 : Dfd_EAE
-
F2
3 : Dfd_EAE
-
F9
4 : dpp_dl_mel
5 : dpp_dpp261
6 : dpp_dpp813
7 : eve_EME
-
B
8 : eve_stripe_3+7
9 : eve_stripe2
10 : gsb_fragIV
11 : h_h7_element
12 : h_stripe_6
13 : kni_KD
14 : Kr_CD1
15 : salm_blastoderm
16 : Ser_minimal_wing
17 : Ubx_pbxSB
18 : Ubx_PRE
19 : zen_0.7

(b
)



(c)


Figure
3
.

The TFBSs in
Drosophila CRMs
appear as repeated or redundant sites
. Modulexplorer locates these re
dundant sites as potential
TFBS
s
. The receiver
-
operating characteristic of predicting TFBS
s

using
redundant sites in
19 fully annotated CRM
s

is shown in (a).
Here s
ensitivity
(
y
-
axis)
refers to the % of nucleotides in TFBS
s

that are overlapped by some re
dundant site,
while
false positive rate
(
x
-
axis)
refers to the % of nucleotides in a redundant site that do not match any TFBS
. The maximum
effectiveness
of TFBS
characterization in each of the 19 CRMs is shown in
(b)
, which is

the point in the

ROC curve where Matthew’s correlation coefficient
is
maxim
ized
.
At this maximum
effectiveness
, t
he visual overlap between the
TFBS sites (
blue
boxes)
and the
redundant sites (red
boxes) in each CRM

is shown in (c)
.

0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Sensitivity
1
-
Specificity
ROC of discrimination between CRM and exon sequences
in leave
-
one
-
out cross
-
validation
Markov Model
HexDiff
Modulexplorer

(a)

0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Sensitivity
1
-
Specificity
ROC of discrimination between CRM and non
-
coding sequences
in ten fold cross
-
validation
Markov Order 2
Markov Order 3
Markov Order 4
Markov Order 5
Markov Order 6
(5,0) HexDiff
(6,0) HexDiff
(6,1) HexDiff
(7,0) HexDiff
(7,1) HexDiff
(8,0) HexDiff
Modulexplorer

(b)

Embryo
56 CRMs
Adult
13 CRMs
Larva
48 CRMs
58 Testing CRMs
blastoderm,
4
post
-
blastoderm,
52
0
5
10
15
20
25
30
35
integumentary system
imaginal precursor
nervous system
digestive system
muscle system
tracheal system
excretory system
reproductive system
circulatory system
adipose system
endocrine system
Categorization of post
-
blastoderm
CRMs in 58 testing sequences


(c)

0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Sensitivity
1
-
Specificity
Prediction accuracy of Modulexplorer on testing data
of 58 CRM and 100 background sequences

(d)


Figure
4
.

Performance of the Modulexplorer in discriminating between CRM and background sequences. Modulexplorer’s performance is
compared with two other methods: a Markov model (
orders 2 to 6
) and the
HexDiff algorithm
[18]
. The original Hexdiff algorithm
uses (6,0) motifs, but it was extended in this comparison to try several different
(
l
,
d
) motifs
.
Discrimination
achieved
between
training
CRMs and exon sequences
in 10
-
fold cross
-
validation
is shown in
(a).
The ROC shows that a
ll
three
methods
could
easily
discriminate CRM
s

from exons.

D
iscrimination between CRMs and non
-
coding sequences (intron+intergenic) is shown
in (b)
. Here
Markov model shows no discrimination, HexDiff has marginal discriminati
on, while
Modulexplorer
achieves maximum
discrimination. Modulexplorer was further evaluated on a separate testing set of 58 CRMs
. The number of CRMs of different types
in the test set according to their stage and tissue of expression is shown in
(c)
.
The
performance of Modulexplorer o
n
this test

set
,
shown in
(d
)
,
is similar to the training
performance
.

AEF1
TRL
DL
SNA
TWI
SLBO
BYN
PRD
BIN
SUH
SUHW
SD
NUB
GRH
PAN
ARA
EMS
VVL
EN
CF2-II
EXD
GL
ZEN
AP
EVE
DFD
DREF
ANTP
ABD-A
ABD-B
UBX
DSX-M
DSX-F
GT
TLL
HB
CAD
FTZ
KR
BCD
KNI
OVO
SRP
Z
HIS2B
PHO
TTK
DEAF1
EY
TOY
MED
BRK
MAD
ADF1
ESPL
TIN
HKB
VND
FTZ-F1
AEF1
0.00
0.00
0.00
0.00
0.00
1.10
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.55
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.19
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.04
0.00
0.00
0.00
0.00
TRL
0.00
0.00
0.00
1.50
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
DL
0.00
0.00
0.00
3.29
2.57
2.15
0.00
0.00
0.00
0.00
0.00
4.07
0.00
0.00
2.99
1.20
0.00
0.00
1.12
1.10
1.09
1.09
0.00
1.06
0.00
0.00
0.00
0.00
0.00
1.22
0.00
0.00
0.00
0.00
1.66
1.11
1.56
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
SNA
0.00
1.50
3.29
0.00
2.47
2.89
3.20
0.00
2.11
0.00
0.00
0.00
0.00
0.00
3.15
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.17
0.00
0.00
1.46
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.13
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
TWI
0.00
0.00
2.57
2.47
0.00
5.15
4.00
3.24
3.36
3.05
3.91
3.01
2.57
2.31
3.33
1.29
1.26
1.30
0.00
0.00
1.77
1.05
1.37
0.00
1.20
1.12
0.00
1.23
0.00
1.67
1.66
0.00
0.00
1.06
1.61
2.35
1.07
1.54
1.54
1.29
0.00
0.00
0.00
0.00
0.00
0.00
1.22
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.37
0.00
0.00
0.00
0.00
SLBO
1.10
0.00
2.15
2.89
5.15
0.00
4.72
4.42
3.46
4.44
5.76
4.82
4.30
3.10
4.89
3.97
3.14
2.96
2.48
2.47
3.09
2.39
4.22
3.12
2.67
3.44
1.97
2.98
2.90
2.64
2.16
2.93
2.93
2.68
3.03
2.06
2.01
2.82
1.55
1.88
0.00
2.18
1.39
0.00
0.00
1.78
0.00
0.00
1.13
0.00
0.00
0.00
0.00
0.00
0.00
1.40
0.00
1.55
0.00
BYN
0.00
0.00
0.00
3.20
4.00
4.72
0.00
4.46
2.77
3.21
4.62
4.57
2.61
0.00
4.65
2.39
2.05
2.55
1.73
1.32
0.00
1.41
1.87
1.70
2.01
2.41
0.00
2.11
1.85
2.24
2.43
1.77
1.77
1.96
1.84
1.78
1.89
1.75
2.46
0.00
0.00
0.00
0.00
1.34
0.00
1.17
1.03
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.37
0.00
1.18
1.07
PRD
0.00
0.00
0.00
0.00
3.24
4.42
4.46
0.00
3.70
2.79
4.51
3.20
2.62
1.79
3.92
4.10
3.40
2.91
3.87
1.29
2.91
1.90
2.98
3.07
3.09
3.01
1.83
3.23
2.39
2.86
1.42
2.10
2.10
3.16
1.80
1.55
1.92
1.12
2.13
2.02
0.00
1.59
1.37
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
BIN
0.00
0.00
0.00
2.11
3.36
3.46
2.77
3.70
0.00
3.62
5.56
3.20
4.16
3.16
5.05
4.27
3.09
3.48
2.83
3.73
3.63
3.82
3.90
3.93
4.36
2.92
2.48
3.05
3.40
3.45
2.69
3.44
3.44
3.51
3.02
3.19
3.72
2.54
2.79
1.44
0.00
0.00
1.41
0.00
0.00
1.40
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
SUH
0.00
0.00
0.00
0.00
3.05
4.44
3.21
2.79
3.62
0.00
4.35
5.41
3.80
2.84
5.75
2.91
2.96
2.62
1.63
2.46
3.19
3.32
3.17
2.69
2.88
2.56
1.69
2.84
1.61
2.27
1.28
3.44
3.44
2.72
1.53
2.34
2.80
1.26
1.72
1.76
0.00
0.00
1.95
1.66
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
SUHW
0.00
0.00
0.00
0.00
3.91
5.76
4.62
4.51
5.56
4.35
0.00
7.04
4.79
4.85
6.57
4.21
3.79
3.40
3.54
3.83
3.70
4.05
4.55
4.49
4.01
3.97
2.17
4.46
2.89
3.23
3.91
3.15
3.15
2.99
2.80
3.44
2.78
3.58
2.13
1.51
0.00
0.00
1.16
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
SD
0.00
0.00
4.07
0.00
3.01
4.82
4.57
3.20
3.20
5.41
7.04
0.00
4.04
4.57
5.49
3.94
3.40
3.78
3.14
3.27
4.30
4.18
4.25
3.65
3.44
3.45
1.72
3.57
3.14
3.29
2.32
3.60
3.60
3.44
2.79
3.67
3.54
1.60
1.86
1.70
1.54
1.25
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.05
0.00
NUB
0.00
0.00
0.00
0.00
2.57
4.30
2.61
2.62
4.16
3.80
4.79
4.04
0.00
5.13
3.96
3.61
2.08
2.13
3.16
3.85
3.09
2.30
3.28
3.59
2.45
2.84
1.75
3.12
2.67
3.55
2.81
2.96
2.96
2.97
3.22
2.27
3.09
2.04
1.58
0.00
0.00
1.42
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
GRH
0.00
0.00
0.00
0.00
2.31
3.10
0.00
1.79
3.16
2.84
4.85
4.57
5.13
0.00
5.31
3.02
2.33
2.53
2.52
3.60
2.90
2.17
2.67
3.16
2.67
2.51
1.43
2.94
2.05
3.06
2.22
2.79
2.79
2.13
2.90
2.32
2.50
2.03
1.53
1.50
0.00
1.54
0.00
2.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.03
0.00
0.00
1.18
0.00
PAN
0.00
0.00
2.99
3.15
3.33
4.89
4.65
3.92
5.05
5.75
6.57
5.49
3.96
5.31
0.00
4.66
2.90
4.22
2.72
3.07
2.81
3.35
3.33
3.72
3.85
3.25
2.96
3.92
2.95
3.01
2.71
3.42
3.42
3.53
4.04
3.42
2.97
2.17
2.43
2.14
0.00
1.52
1.79
1.60
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.27
0.00
ARA
0.00
0.00
1.20
0.00
1.29
3.97
2.39
4.10
4.27
2.91
4.21
3.94
3.61
3.02
4.66
0.00
5.42
5.36
1.40
4.52
4.80
5.75
6.02
2.88
5.20
1.57
3.86
6.09
1.63
4.47
2.41
4.67
4.67
5.71
3.90
3.61
4.35
2.54
2.41
2.99
0.00
1.83
1.64
1.76
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.02
0.00
EMS
0.00
0.00
0.00
0.00
1.26
3.14
2.05
3.40
3.09
2.96
3.79
3.40
2.08
2.33
2.90
5.42
0.00
4.49
4.61
4.10
3.89
4.01
5.10
3.94
4.55
5.45
4.13
5.71
4.90
2.77
4.25
4.84
4.84
4.59
2.33
2.97
3.98
3.62
1.71
1.42
0.00
1.63
2.13
1.01
1.27
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
VVL
0.00
0.00
0.00
0.00
1.30
2.96
2.55
2.91
3.48
2.62
3.40
3.78
2.13
2.53
4.22
5.36
4.49
0.00
4.52
3.88
3.70
4.48
4.58
5.08
4.09
4.31
2.52
4.77
4.16
3.53
4.28
4.19
4.19
3.78
3.00
3.45
3.59
3.04
2.27
1.81
1.02
1.06
0.00
1.34
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
EN
0.00
0.00
1.12
0.00
0.00
2.48
1.73
3.87
2.83
1.63
3.54
3.14
3.16
2.52
2.72
1.40
4.61
4.52
0.00
4.09
4.56
4.06
5.78
1.68
3.56
1.43
3.71
5.72
1.73
3.89
1.80
3.90
3.90
5.09
3.33
2.78
3.10
2.44
1.74
2.35
0.00
1.77
1.45
1.73
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
CF2-II
0.00
0.00
1.10
0.00
0.00
2.47
1.32
1.29
3.73
2.46
3.83
3.27
3.85
3.60
3.07
4.52
4.10
3.88
4.09
0.00
4.09
4.07
4.52
4.98
3.78
4.08
3.59
4.50
3.02
3.17
2.93
3.70
3.70
4.03
3.43
3.66
3.30
2.66
1.63
1.76
0.00
0.00
2.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
EXD
1.55
0.00
1.09
0.00
1.77
3.09
0.00
2.91
3.63
3.19
3.70
4.30
3.09
2.90
2.81
4.80
3.89
3.70
4.56
4.09
0.00
5.25
5.28
4.96
4.53
4.56
3.38
4.08
4.49
3.77
3.62
3.77
3.77
5.16
3.29
4.74
3.57
3.33
2.23
2.22
0.00
1.45
2.49
1.15
1.16
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.14
0.00
GL
0.00
0.00
1.09
0.00
1.05
2.39
1.41
1.90
3.82
3.32
4.05
4.18
2.30
2.17
3.35
5.75
4.01
4.48
4.06
4.07
5.25
0.00
4.84
5.23
4.70
4.70
3.42
4.69
4.31
3.30
3.44
3.33
3.33
5.12
2.52
3.08
3.52
3.24
1.88
2.15
0.00
1.07
1.05
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.02
0.00
ZEN
0.00
0.00
0.00
0.00
1.37
4.22
1.87
2.98
3.90
3.17
4.55
4.25
3.28
2.67
3.33
6.02
5.10
4.58
5.78
4.52
5.28
4.84
0.00
6.15
5.26
5.58
4.03
5.01
4.98
4.94
4.54
4.06
4.06
5.58
3.32
4.17
3.89
4.53
2.22
2.89
0.00
1.28
1.78
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
AP
0.00
0.00
1.06
0.00
0.00
3.12
1.70
3.07
3.93
2.69
4.49
3.65
3.59
3.16
3.72
2.88
3.94
5.08
1.68
4.98
4.96
5.23
6.15
0.00
5.90
2.19
4.72
2.58
3.87
3.11
5.09
4.22
4.22
3.95
3.23
3.75
4.15
4.13
1.93
2.40
0.00
1.71
1.68
1.24
1.01
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
EVE
0.00
0.00
0.00
0.00
1.20
2.67
2.01
3.09
4.36
2.88
4.01
3.44
2.45
2.67
3.85
5.20
4.55
4.09
3.56
3.78
4.53
4.70
5.26
5.90
0.00
2.87
3.43
3.58
4.45
4.31
3.45
4.15
4.15
4.40
3.31
3.55
4.06
2.31
2.29
2.65
0.00
1.55
1.41
1.63
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
DFD
0.00
0.00
0.00
0.00
1.12
3.44
2.41
3.01
2.92
2.56
3.97
3.45
2.84
2.51
3.25
1.57
5.45
4.31
1.43
4.08
4.56
4.70
5.58
2.19
2.87
0.00
4.44
3.66
2.57
3.84
2.63
4.55
4.55
4.62
3.36
3.02
2.94
2.56
1.74
1.84
0.00
1.81
1.55
1.40
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
DREF
0.00
0.00
0.00
0.00
0.00
1.97
0.00
1.83
2.48
1.69
2.17
1.72
1.75
1.43
2.96
3.86
4.13
2.52
3.71
3.59
3.38
3.42
4.03
4.72
3.43
4.44
0.00
4.40
2.90
3.16
2.99
3.21
3.21
3.74
2.42
2.38
3.35
2.89
0.00
1.17
0.00
1.30
1.34
0.00
1.24
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
ANTP
0.00
0.00
0.00
0.00
1.23
2.98
2.11
3.23
3.05
2.84
4.46
3.57
3.12
2.94
3.92
6.09
5.71
4.77
5.72
4.50
4.08
4.69
5.01
2.58
3.58
3.66
4.40
0.00
3.38
4.29
2.44
4.01
4.01
4.77
3.48
3.60
3.89
2.20
0.00
1.86
0.00
1.55
1.94
1.21
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
ABD-A
0.00
0.00
0.00
0.00
0.00
2.90
1.85
2.39
3.40
1.61
2.89
3.14
2.67
2.05
2.95
1.63
4.90
4.16
1.73
3.02
4.49
4.31
4.98
3.87
4.45
2.57
2.90
3.38
0.00
3.61
1.71
3.44
3.44
4.99
3.00
2.30
3.16
1.89
2.23
2.74
0.00
1.99
1.10
1.07
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
ABD-B
0.00
0.00
1.22
0.00
1.67
2.64
2.24
2.86
3.45
2.27
3.23
3.29
3.55
3.06
3.01
4.47
2.77
3.53
3.89
3.17
3.77
3.30
4.94
3.11
4.31
3.84
3.16
4.29
3.61
0.00
3.37
3.70
3.70
2.92
4.31
3.95
4.27
2.54
2.86
1.93
1.05
1.16
1.96
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
UBX
0.00
0.00
0.00
0.00
1.66
2.16
2.43
1.42
2.69
1.28
3.91
2.32
2.81
2.22
2.71
2.41
4.25
4.28
1.80
2.93
3.62
3.44
4.54
5.09
3.45
2.63
2.99
2.44
1.71
3.37
0.00
3.15
3.15
3.98
2.90
2.26
2.41
1.02
1.67
1.41
0.00
1.80
1.21
1.37
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
DSX-M
0.00
0.00
0.00
0.00
0.00
2.93
1.77
2.10
3.44
3.44
3.15
3.60
2.96
2.79
3.42
4.67
4.84
4.19
3.90
3.70
3.77
3.33
4.06
4.22
4.15
4.55
3.21
4.01
3.44
3.70
3.15
0.00
0.00
4.46
2.75
3.03
4.12
3.09
1.79
2.33
1.36
1.21
2.22
0.00
0.00
1.40
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.17
0.00
0.00
0.00
DSX-F
0.00
0.00
0.00
0.00
0.00
2.93
1.77
2.10
3.44
3.44
3.15
3.60
2.96
2.79
3.42
4.67
4.84
4.19
3.90
3.70
3.77
3.33
4.06
4.22
4.15
4.55
3.21
4.01
3.44
3.70
3.15
0.00
0.00
4.46
2.75
3.03
4.12
3.09
1.79
2.33
1.36
1.21
2.22
0.00
0.00
1.40
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.17
0.00
0.00
0.00
GT
0.00
0.00
0.00
1.17
1.06
2.68
1.96
3.16
3.51
2.72
2.99
3.44
2.97
2.13
3.53
5.71
4.59
3.78
5.09
4.03
5.16
5.12
5.58
3.95
4.40
4.62
3.74
4.77
4.99
2.92
3.98
4.46
4.46
0.00
6.71
7.33
7.50
6.22
4.87
3.89
2.17
1.79
2.18
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
TLL
0.00
0.00
1.66
0.00
1.61
3.03
1.84
1.80
3.02
1.53
2.80
2.79
3.22
2.90
4.04
3.90
2.33
3.00
3.33
3.43
3.29
2.52
3.32
3.23
3.31
3.36
2.42
3.48
3.00
4.31
2.90
2.75
2.75
6.71
0.00
6.58
7.31
4.91
4.72
2.91
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.14
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
HB
0.00
0.00
1.11
0.00
2.35
2.06
1.78
1.55
3.19
2.34
3.44
3.67
2.27
2.32
3.42
3.61
2.97
3.45
2.78
3.66
4.74
3.08
4.17
3.75
3.55
3.02
2.38
3.60
2.30
3.95
2.26
3.03
3.03
7.33
6.58
0.00
5.78
3.89
3.19
2.80
2.30
0.00
1.54
0.00
1.11
0.00
0.00
1.19
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
CAD
1.19
0.00
1.56
1.46
1.07
2.01
1.89
1.92
3.72
2.80
2.78
3.54
3.09
2.50
2.97
4.35
3.98
3.59
3.10
3.30
3.57
3.52
3.89
4.15
4.06
2.94
3.35
3.89
3.16
4.27
2.41
4.12
4.12
7.50
7.31
5.78
0.00
4.22
2.89
3.03
0.00
1.18
1.75
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
FTZ
0.00
0.00
0.00
0.00
1.54
2.82
1.75
1.12
2.54
1.26
3.58
1.60
2.04
2.03
2.17
2.54
3.62
3.04
2.44
2.66
3.33
3.24
4.53
4.13
2.31
2.56
2.89
2.20
1.89
2.54
1.02
3.09
3.09
6.22
4.91
3.89
4.22
0.00
2.54
0.00
0.00
2.11
1.79
1.07
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
KR
0.00
0.00
0.00
0.00
1.54
1.55
2.46
2.13
2.79
1.72
2.13
1.86
1.58
1.53
2.43
2.41
1.71
2.27
1.74
1.63
2.23
1.88
2.22
1.93
2.29
1.74
0.00
0.00
2.23
2.86
1.67
1.79
1.79
4.87
4.72
3.19
2.89
2.54
0.00
6.24
2.10
1.23
1.26
1.53
1.47
1.75
1.26
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.00
0.00
0.00
1.08
BCD
0.00
0.00
0.00
0.00
1.29
1.88
0.00
2.02
1.44
1.76
1.51
1.70
0.00
1.50
2.14
2.99
1.42
1.81
2.35
1.76
2.22
2.15
2.89
2.40
2.65
1.84
1.17
1.86
2.74
1.93
1.41
2.33
2.33
3.89
2.91
2.80
3.03
0.00
6.24
0.00
2.23
0.00
1.41
1.49
0.00
0.00
0.00
0.00
1.12
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.26
0.00
KNI
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.54
0.00
0.00
0.00
0.00
0.00
1.02
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.05
0.00
1.36
1.36
2.17
0.00
2.30
0.00
0.00
2.10
2.23
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
OVO
0.00
0.00
0.00
0.00
0.00
2.18
0.00
1.59
0.00
0.00
0.00
1.25
1.42
1.54
1.52
1.83
1.63
1.06
1.77
0.00
1.45
1.07
1.28
1.71
1.55
1.81
1.30
1.55
1.99
1.16
1.80
1.21
1.21
1.79
0.00
0.00
1.18
2.11
1.23
0.00
0.00
0.00
1.73
1.10
0.00
0.00
0.00
0.00
0.00
1.01
0.00
1.38
2.76
0.00
2.28
1.45
0.00
1.66
0.00
SRP
0.00
0.00
0.00
0.00
0.00
1.39
0.00
1.37
1.41
1.95
1.16
0.00
0.00
0.00
1.79
1.64
2.13
0.00
1.45
2.00
2.49
1.05
1.78
1.68
1.41
1.55
1.34
1.94
1.10
1.96
1.21
2.22
2.22
2.18
0.00
1.54
1.75
1.79
1.26
1.41
0.00
1.73
0.00
1.66
2.35
1.39
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
Z
0.00
0.00
0.00
0.00
0.00
0.00
1.34
0.00
0.00
1.66
0.00
0.00
0.00
2.00
1.60
1.76
1.01
1.34
1.73
0.00
1.15
0.00
0.00
1.24
1.63
1.40
0.00
1.21
1.07
0.00
1.37
0.00
0.00
0.00
0.00
0.00
0.00
1.07
1.53
1.49
0.00
1.10
1.66
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.17
1.13
0.00
1.84
0.00
HIS2B
0.00
0.00
0.00
1.13
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.27
0.00
0.00
0.00
1.16
0.00
0.00
1.01
0.00
0.00
1.24
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.11
0.00
0.00
1.47
0.00
0.00
0.00
2.35
0.00
0.00
0.00
1.28
0.00
0.00
0.00
0.00
0.00
1.56
1.32
0.00
0.00
0.00
0.00
0.00
PHO
0.00
0.00
0.00
0.00
0.00
1.78
1.17
0.00
1.40
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.40
1.40
0.00
0.00
0.00
0.00
0.00
1.75
0.00
0.00
0.00
1.39
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.38
0.00
0.00
0.00
0.00
TTK
0.00
0.00
0.00
0.00
1.22
0.00
1.03
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.26
0.00
0.00
0.00
0.00
0.00
1.28
0.00
0.00
0.00
0.00
1.40
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.23
DEAF1
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.14
1.19
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.65
1.81
1.49
1.46
0.00
0.00
0.00
0.00
0.00
0.00
0.00
EY
0.00
0.00
0.00
0.00
0.00
1.13
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.12
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.65
0.00
1.81
0.00
0.00
0.00
0.00
0.00
0.00
1.10
0.00
0.00
TOY
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.01
0.00
0.00
0.00
0.00
1.40
1.81
1.81
0.00
2.09
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
MED
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.49
0.00
2.09
0.00
1.86
1.34
0.00
0.00
1.71
0.00
0.00
0.00
BRK
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.38
0.00
0.00
0.00
0.00
0.00
1.46
0.00
0.00
1.86
0.00
2.11
2.34
2.99
0.00
1.16
0.00
0.00
MAD
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
2.76
0.00
0.00
1.56
0.00
0.00
0.00
0.00
0.00
1.34
2.11
0.00
6.54
4.64
1.28
2.92
1.44
1.24
ADF1
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.32
0.00
0.00
0.00
0.00
0.00
0.00
2.34
6.54
0.00
2.77
0.00
2.88
0.00
1.03
ESPL
1.04
0.00
0.00
0.00
1.37
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.03
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
2.28
0.00
1.17
0.00
1.38
0.00
0.00
0.00
0.00
0.00
2.99
4.64
2.77
0.00
1.25
1.14
2.03
1.24
TIN
0.00
0.00
0.00
0.00
0.00
1.40
1.37
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.17
1.17
0.00
0.00
0.00
0.00
0.00
1.00
0.00
0.00
1.45
0.00
1.13
0.00
0.00
0.00
0.00
0.00
0.00
1.71
0.00
1.28
0.00
1.25
0.00
1.28
1.57
1.02
HKB
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.10
0.00
0.00
1.16
2.92
2.88
1.14
1.28
0.00
2.02
0.00
VND
0.00
0.00
0.00
0.00
0.00
1.55
1.18
0.00
0.00
0.00
0.00
1.05
0.00
1.18
1.27
1.02
0.00
0.00
0.00
0.00
1.14
1.02
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.26
0.00
1.66
0.00
1.84
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.44
0.00
2.03
1.57
2.02
0.00
0.00
FTZ-F1
0.00
0.00
0.00
0.00
0.00
0.00
1.07
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.08
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.23
0.00
0.00
0.00
0.00
0.00
1.24
1.03
1.24
1.02
0.00
0.00
0.00
1
2
3
4
5
6

Figure
5
.

Pairwise
interactions
between 61 different TFs
learnt de
-
novo by the Modulexplorer probabili
ty model.
Based on th
e interaction
matrix, the TFs were hierarchically clustered.

Six f
unctionally related
groups of
TFs
were formed
: (1) cofactors of twist in
mesoderm
and nervous system development, (2) TFs involved in imaginal disc development, (3) the antennapedia complex
, (4) TFs expressed in
the blastoderm, (5) TFs
for
eye development and (6)
a miscellaneous set of TFs. Five distinct clusters
are
seen

in the interaction
matrix
. Three
of the
clusters contain mixed set of TFs from groups 1
-
4, while two other clusters cor
respond to the TF groups 5 and 6.

0
5000
10000
15000
20000
25000
30000
35000
-
3.00
-
2.50
-
2.00
-
1.50
-
1.00
-
0.50
0.00
0.50
1.00
1.50
2.00
2.50
3.00
3.50
Number of windows
Score bin
Score threshold for
0.1% false positive
rate

(a)

Total 1298 predicted windows
813
472
13
Predicted windows overlap
k
nown long
CRMs
Predicted windows overlap
t
raining
CRMs
Novel predictions

(b)

9 out of 34
New
CRMs
in
REDfly
database
16 out of 42
Predictions of Schroeder et al.
6
out of 28
Predictions of Berman et al.
7
out of 15
Predictions of
Markstein
et al.
14 out of 58
Long
CRMs
not in training set

(c)

0
5
10
15
20
25
30
35
40
45
50
Upstream
Promoter
Exon
Intron
Downstream
% Occurrence
Location relative to the target gene
Known CRMs
Predicted CRMs
Random segments

(d)

5.50E
-
32
2.93E
-
27
1.58E
-
26
6.43E
-
26
6.79E
-
26
2.10E
-
21
2.75E
-
21
1.95E
-
16
1.42E
-
15
1.56E
-
11
2.41E
-
11
3.22E
-
10
4.95E
-
10
1.34E
-
09
1.71E
-
08
2.80E
-
07
1.23E
-
06
3.02E
-
06
4.09E
-
4
2.67E
-
3
0
20
40
60
80
100
120
140
160
180
system development
regulation of cellular
process
morphogenesis
tissue development
organ development
regulation of
physiological process
cell differentiation
post
-
embryonic
development
metamorphosis
appendage
development
embryonic
development
tube development
cell communication
segmentation
pattern specification
regulation of
development
locomotory behavior
positive regulation of
biological process
sex differentiation
negative regulation of
biological process
Number of genes

(e)

Figure
6
.

Summary of Modulexplorer’s whole genome CRM predictions
:

(a) A

stringent score threshold
was used for shortlisting predicted
CRM windows
such
that
the false positive rate is
about
0.1%. (b)
A total of
1298 windows

were predicted above the chosen
threshold
,
out of which
813
are
novel predictions.
(c) Some of the
novel predictions
overlap
CRMs predicted or verified in other
studies. (d) The pre
dicted CRMs
are
significantly over
-
represented in the promoter and upstream intergenic regions. (e)
This is the
list of
l
evel
3 gene ontology (GO) categories statistically over
-
represented in the target genes of the predicted CRMs
. They

show
enrichm
ent in development and regulatory functions (
Bonferroni corrected P
-
values

of the GO associations are shown alongside).

M1
M2
M3a
M3b
M4
M5
M6
Bin
M7
M3
M8
M9a
M9b
M10
M11
M12
1
2
3
4
5
6
7
8
9
10
11
12
M7(+) M6(+)
001 GGGAT
CCG
AAA
TAGTTAGTG
TAAACAA
GGAGG
CACTCTTGAGAACGCGAG
M10(+) M8(+) M7(
-
)
051 GGGCAACTGTTGTGGAAATGCCCGAGAT
TGAAT
CGCTGG
T
TAAATATTTA
M5(+) M7(+)
101
TG
AAATC
ATAAAA
TT
TGATGTCTCC
CTTCCGTTGGCCACTTGA
CA
GTAA
T
M3a(
-
)
151
GCGA
CCATTA
C
GGCAATGTGTCGAAGAAGAACCCCTGGTCCTGAATCCCG
M11(+)
201 ACACAACCCAACTCCAGAGCGCCGGTG
CTA
ATGATG
A
TTTTGATGTGCAG
251 TCAACGGATTGGCTGCAGACCCACGAAGACCCGGCGATTACGTGGA
GTAC
M3b(
-
)M3b(+) M3b(
-
) M12 M3b(
-
)M3b(+)
301
TA
CCCATT
TGGCTT
CCCATT
TCGATT
TCCCCA
TG
CCCATT
T
GGCCGTGCA
M6(+)M5(
-
) M6(+) M10(+) M2(+)
351 A
TGTTTG
TTTTAT
GCAC
GATCC
G
TTGTTTT
AC
AAT
CGCTG
T
AAATAA
ATA
M3a(
-
) M3b(
-
)
401
GGAGCCG
CAGATCAAAGGCCT
AT
CAATTA
G
CA
CCCATT
TCGATTATGCTG
M3b(
-
)
451 CATGCTGCATATGCAGCACTTGCACTGCCTGCAATTCACA
CCCAAT
TAGT
M2(+) M10(+) M3a(
-
)
501
AATAA
AT
TTGAA
TGCG
CG
CTGC
AATTT
G
CCGCCATTCGGCTCAACAGTTA
M3a(
-
) M5(
-
)
551 TGGT
GG
CCATTA
A
G
T
TTTAT
CGAT
GGCGCTACAGCTCCCGATCCCCTACC
601 CCCGATCTTTCCTTGCCCCATGCCCAGATTTCAATTCGATTCCCGGATCT
M3a(
-
) M7(
-
)
651 GGGAG
CCAATT
TGA
TTTGTG
GCCCACTCGAGAGGGCTTCGAGCCATCCAC
M7(+)M7(+)
701 CTTTGATATTCTCGCACATAGGCC
CACAAAAA
GATACGTGCATGCTTAAC
M10(+)
751 CGAACTTAATTGCAATTGACTTTTAATGCTTATGCGGGCTGCC
CGCTGT
G
801 TTAATTCGAATTC
(a)
(c)
(b)

Figure
7
.

Regulatory code for mesoderm enhancers
. The regulatory code motifs are shown in (a).
Also shown i
s the phylogenetic tree of
similarity among the motifs.
Their
matches within the dpp

813 bp

enhancer
sequence are shown in (b) by underlines
. For
comparison,
the known TFBS
in this enhancer
are
shown in
red color text.
The known annotation covered only
the first 600 bp of the
enhancer. In this region, 31 matches of the regulatory code motifs occurred, of which 28 overlapped known TFBS while 3 were
false
positives.
The expression pattern of
the
dpp

gene
in mesoderm
during stages 8
-
9 is
shown in (c).


T
able 1.

Clusters of CRMs sharing a common regulatory code (motifs) were obtained using iterative frequent itemset mining.
Five
major clusters
are listed here
with predominant expression in the embryonic mesoderm, ventral nerve cord, eye
-
antennal disc
,
bl
astoderm embryo and the larval
wing imaginal disc.
For each cluster, the following information is shown: (i) t
he
number of known and predicted CRM target genes in each cluster,
(ii) number of
predicted
CRM target genes

which could be validated, (iii) num
ber of validated genes which are novel for their role in development,
and (iv)
specificity of the regulatory code for the cluster (given in Supplementary Figure 6) against 1000 random Drosophila sequences

(generated by
RSAT tool using a 6
th

order Markov mo
del) and other training CRMs.


Cluster name
(by dominant
gene expression
pattern)

Known
target genes

Predicted
target genes

Predicted
target genes
with validation

Validation type

Novel target
genes (out of the
predicted genes
with
validation)
*

Specificity
of regulatory code

# false positives
in 1000 random
sequences

# matches with
other CRMs (out
of 356)

Mesoderm

9

27

23

BDGP in
-
situ (19)

+Flybase (4)

9

0

2

Ventral nerve
cord

14

44

30

BDGP in
-
situ (23)

+ Flybase (7)

18

2

0

Eye
-
antennal
disc

17

21

9

BDGP in
-
situ (7)

+ Flybase (2)

4

0

0

Wing imaginal
disc

9

31

15

Microarray (12)

+ Flybase (3)

-

12

8

Blastoderm

29

79

50

BDGP (37)

+ Microarray (13)

-

-

-


*

Novel target genes

refers to those predicted target genes which have been validated
in this s
tudy
from BDGP in
-
situ images
but
have not previously
been cited in the literature for their role in development. The updated list of genes with a known role in development was o
btained from the
“Interactive Fly” website:

Mesoderm:
http://www.sdbonline.org/fly/aimorph/mesoderm.htm

Ventral nerve cord:
http://www.sdbonline.org/fly/aimorph/cns.htm

Eye
-
antennal disc:
http://www.sdbonline.org/fly/aimorph/eye.htm