APPLICATION OF MOTIF DISCOVERY TOOL FOR FOXA BINDING SITES ANALYSIS

Levitsky V.G.

Institute of Cytology & Genetics SB RAS, Novosibirsk, Russia

Novosibirsk State University, Novosibirsk, Russia

e-mail: levitsky@bionet.nsc.ru



Key words: transcription factor binding sites, motif discovery, genetic algorithm


Motivation and Aim: Motif discovery is still an important step in the annotation of sequenced genomes [1]. The task consists in detecting overrepresented motifs in a dataset of nucleotide sequences. The sequences are assumed to contain multiple occurrences of binding sites (BS) of a known or unknown transcription factor (TF). The IUPAC 15-letter code is widely used to represent an ambiguous pattern of nucleotides (a motif) in a DNA sequence, where a single character may represent more than one nucleotide. Since the typical length of a BS is 8-12 nt, an exhaustive search over the 15-letter alphabet requires examining search spaces of 15^8 to 15^12 candidate motifs, respectively (orders of magnitude from 9 to 14). The stochastic method developed here may substantially reduce this search space.
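
To make the size of this space concrete, the following Python sketch lists the 15 IUPAC symbols, checks a site against a degenerate motif, and counts the candidate motifs of length 8 and 12. The names IUPAC, matches and search_space_size are illustrative only and do not come from the original work.

```python
import math

# IUPAC 15-letter nucleotide code: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(motif, site):
    """True if every base of `site` is allowed by the motif symbol at that position."""
    return len(motif) == len(site) and all(
        base in IUPAC[symbol] for symbol, base in zip(motif, site)
    )


def search_space_size(length):
    """Number of distinct degenerate motifs of the given length."""
    return len(IUPAC) ** length


for length in (8, 12):
    size = search_space_size(length)
    # Prints the length, the exact count, and its order of magnitude.
    print(length, size, math.floor(math.log10(size)))
```

For lengths 8 and 12 the counts are 2562890625 and 129746337890625, which matches the orders of magnitude 9 and 14 quoted above.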

Methods and Algorithms: A genetic algorithm (GA) based approach is employed to search for motifs in the 15-letter degenerate code. A Markov model was used to measure the overrepresentation of motifs.
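
The abstract does not describe the GA operators, so the sketch below shows only one plausible arrangement: truncation selection with elitism, one-point crossover, and per-position mutation over the 15 IUPAC symbols. All names (evolve, mutate, crossover, toy_fitness, TARGET) and all parameter values are assumptions made for illustration. The toy fitness simply counts agreement with a hypothetical target and stands in for a real overrepresentation score.

```python
import random

ALPHABET = "ACGTRYSWKMBDHVN"  # the 15 IUPAC nucleotide symbols


def mutate(motif, rng, rate=0.1):
    """Replace each position by a random IUPAC symbol with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else symbol for symbol in motif
    )


def crossover(a, b, rng):
    """One-point crossover of two equal-length motifs (length must be >= 2)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(fitness, length=8, pop_size=50, generations=40, seed=0):
    """Return the fittest motif found; the best individual is never lost (elitism)."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(ALPHABET) for _ in range(length)) for _ in range(pop_size)
    ]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the better half unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)


TARGET = "YRTTTRYB"  # hypothetical target, used only to exercise the sketch


def toy_fitness(motif):
    """Assumed stand-in score: number of positions that agree with TARGET."""
    return sum(x == y for x, y in zip(motif, TARGET))


best = evolve(toy_fitness)
print(best, toy_fitness(best))
```

In practice the fitness argument would be replaced by a score built from the overrepresentation measure, while the search loop itself could stay unchanged.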

Results: The developed approach was applied to construct alternative alignments of the BS of the TF FoxA. Nucleotide sequences of 37 functional FoxA BS were retrieved from the Sample database [2]. The GA scored motifs based on:

(a) T, the portion of training sequences with at least one occurrence of a motif (0 < T ≤ 1);

(b) F, the estimated expected length of a background sequence that contains a motif.

Background sequences were modeled by a zero-order Markov model.
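
The two scores can be illustrated with a short Python sketch. T is computed directly as defined above. For F the abstract gives no formula, so the sketch assumes F is approximately 1/p, where p is the probability that a random window matches the motif under a zero-order model with independent base frequencies. That approximation, the single-strand scan, and the names coverage_T, match_probability and expected_length_F are assumptions, not the author's implementation.

```python
# IUPAC 15-letter code: symbol -> set of bases it stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def occurs_in(motif, seq):
    """True if the degenerate motif matches at least one window of seq (one strand)."""
    k = len(motif)
    return any(
        all(seq[i + j] in IUPAC[motif[j]] for j in range(k))
        for i in range(len(seq) - k + 1)
    )


def coverage_T(motif, sequences):
    """Portion of sequences with at least one occurrence of the motif."""
    return sum(occurs_in(motif, s) for s in sequences) / len(sequences)


def match_probability(motif, base_freqs):
    """Probability that a random window matches under a zero-order background."""
    p = 1.0
    for symbol in motif:
        p *= sum(base_freqs[b] for b in IUPAC[symbol])
    return p


def expected_length_F(motif, base_freqs):
    """Assumed approximation: expected background length per occurrence, 1 / p."""
    return 1.0 / match_probability(motif, base_freqs)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_sequences = ["ACTTTAGG", "CCCCCCCC", "GTTTGACA", "TTTT"]  # made-up data
print(coverage_T("TTTR", toy_sequences))   # 0.5
print(expected_length_F("TTTR", uniform))  # 128.0
```

Under this reading a larger F means the motif is rarer in background sequence, so a more degenerate motif raises T but lowers F.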

For the whole training dataset the value T = 0.55 was chosen, since any further increase of this parameter caused the F score to fall below a reasonable threshold. Among the motifs with the highest F scores, YRTTTRYBYWDD, BNTSTTTKBHBW and HTVTTTGBDBH were selected, since they showed the lowest mutual overlap of their respective locations in the training sequences. These motifs were found in 21, 22 and 21 sequences of the training dataset, respectively; the pairs of the 1st & 2nd, 1st & 3rd, and 2nd & 3rd motifs were found concurrently in 14, 12 and 15 sequences, respectively. Thus, 33 sequences out of the total of 37 contained at least one occurrence of any of the motifs. Analysis of ChIP-Seq data [3] confirmed that the developed approach is very promising for the prediction of potential BS of FoxA.
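
The reported counts can be cross-checked with the inclusion-exclusion principle. The sketch below derives the number of sequences that contain all three motifs; this value is not stated in the abstract and is only an inference from the figures given there.

```python
# Counts reported in the abstract.
single = (21, 22, 21)   # sequences containing motif 1, 2, 3
pairs = (14, 12, 15)    # sequences containing motifs 1&2, 1&3, 2&3
union = 33              # sequences containing at least one motif
total = 37              # size of the training dataset

# Inclusion-exclusion: union = sum(single) - sum(pairs) + triple.
triple = union - sum(single) + sum(pairs)
uncovered = total - union

print(triple)     # 10 sequences implied to contain all three motifs
print(uncovered)  # 4 sequences with none of the motifs

# Consistency check: the triple overlap cannot exceed any pairwise overlap.
assert 0 <= triple <= min(pairs)
```

The implied triple overlap of 10 is compatible with every pairwise count, so the reported figures are internally consistent.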

Conclusion: The developed method may be used for fast and effective search of degenerate motifs.

References

1. T. Marschall, S. Rahmann (2009) Efficient exact motif discovery. Bioinformatics, 25: i356-364.

2. N.A. Kolchanov, et al. (2002) Transcription Regulatory Regions Database (TRRD): its status in 2002. Nucl Acid Res, 30: 312-317.

3. V.G. Levitsky, et al. (this issue) Recognition of potential binding sites in ChIP-Seq data.