Where is the Motif???

Category: Biotechnology

Posted 4 Oct 2013

Bioinformatics

Dr. Aladdin Hamwieh

Khalid Al-shamaa

Abdulqader Jighly

2010-2011

Lecture 3

Finding Motifs

Aleppo University

Faculty of Technical Engineering

Department of Biotechnology

Main Lines


Definition


Motif types


Motifs problem


Motifs: Profiles and Consensus


Motif Logo


Motif Search in Local Database

Definition


A motif is a short conserved sequence pattern associated with distinct functions of a protein or DNA.

Motif Types

1. Regulatory sequences

Combinatorial Gene Regulation


A microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed

How can one gene have such drastic effects?

Combinatorial Gene Regulation


Gene X encodes a regulatory protein, a.k.a. a transcription factor (TF)

The 20 unexpressed genes rely on gene X’s TF to induce transcription

A single TF may regulate multiple genes


Regulatory Protein


Every gene contains a regulatory region (RR), typically stretching 100-1000 bp upstream of the transcriptional start site

Located within the RR are the Transcription Factor Binding Sites (TFBS), also known as motifs, specific for a given transcription factor

TFs influence gene expression by binding to a specific location in the respective gene’s regulatory region: the TFBS
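As a rough illustration of what "upstream of the transcriptional start site" means in practice, the sketch below slices a regulatory region out of a sequence. The function name `upstream_region`, the 0-based index convention, the forward-strand assumption, and the toy sequence are all assumptions made for this example, not part of the lecture.

```python
def upstream_region(genome: str, tss: int, length: int = 1000) -> str:
    """Return up to `length` bases immediately upstream of the TSS.

    Assumes a forward-strand gene and a 0-based TSS index (hypothetical
    conventions chosen for this sketch).
    """
    start = max(0, tss - length)  # do not run off the start of the sequence
    return genome[start:tss]


# Toy sequence: background, a binding site, a short gap, then the gene.
genome = "acgt" * 5 + "ATCCCG" + "ttt" + "ATGAAA"
tss = genome.index("ATGAAA")
region = upstream_region(genome, tss, length=12)
print(region)  # cgtATCCCGttt
```

Any motif search would then be restricted to `region` instead of the whole sequence.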


Regulatory Regions


A TFBS can be located anywhere within the Regulatory Region.

TFBS may vary slightly across different regulatory regions since non-essential bases could mutate

Transcription Factor Binding Sites

gene  ATCCCG

gene  TTCCGG

gene  ATCCCG

gene  ATGCCG

gene  ATGCCC
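The variation among these five binding sites can be quantified by counting mismatches (Hamming distance) against one of them. The sketch below uses the first site, ATCCCG, as the reference; choosing that site as the reference is an assumption made for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# The five binding sites from the slide, in order.
sites = ["ATCCCG", "TTCCGG", "ATCCCG", "ATGCCG", "ATGCCC"]
reference = sites[0]
distances = [hamming(reference, s) for s in sites]
print(distances)  # [0, 2, 0, 1, 2]
```

Every site is within two mismatches of the reference, which is the sense in which a TFBS "varies slightly".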

Motifs and Transcriptional Start Sites

TTGACA (-35 hexamer) | spacer, 15-19 bases | TATAAT (-10 hexamer) | interval, 5-9 bases | Transcription start site
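The promoter layout above (a -35 hexamer, a spacer of 15-19 bases, then a -10 hexamer) can be searched for with a regular expression. This is a minimal sketch that matches only the exact consensus hexamers TTGACA and TATAAT; real promoters deviate from the consensus, so an exact search like this would miss most of them. The function name `find_promoters` and the test sequence are hypothetical.

```python
import re

# -35 consensus, 15-19 arbitrary bases, -10 consensus.
PROMOTER = re.compile(r"TTGACA([ACGT]{15,19})TATAAT")


def find_promoters(seq: str):
    """Yield (start of -35 hexamer, spacer length, end of -10 hexamer)."""
    for m in PROMOTER.finditer(seq.upper()):
        yield m.start(), len(m.group(1)), m.end()


# Synthetic sequence with a 17-base spacer between the two hexamers.
seq = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGGGGGA"
print(list(find_promoters(seq)))  # [(2, 17, 31)]
```

The transcription start site would then be expected 5-9 bases after the reported end position.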

A weight matrix contains more information

Based on ~450 known promoters

-35 hexamer
     1    2    3    4    5    6
A   0.1  0.1  0.1  0.5  0.2  0.5
T   0.7  0.7  0.2  0.2  0.2  0.2
G   0.1  0.1  0.5  0.1  0.1  0.2
C   0.1  0.1  0.2  0.2  0.5  0.1

-10 hexamer
     1    2    3    4    5    6
A   0.1  0.7  0.2  0.6  0.5  0.1
T   0.7  0.1  0.5  0.2  0.2  0.8
G   0.1  0.1  0.1  0.1  0.1  0.0
C   0.1  0.1  0.2  0.1  0.1  0.1
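To show why a weight matrix carries more information than a consensus string, the sketch below scores candidate hexamers by multiplying the per-position frequencies. It assumes that the first four rows of numbers on the slide belong to the -35 hexamer and are ordered A, T, G, C (this reading reproduces the consensus TTGACA). The names `PWM_35`, `pwm_probability`, and `best_window` are hypothetical.

```python
import math

# Assumed reading of the slide: rows A, T, G, C for the -35 hexamer.
PWM_35 = {
    "A": [0.1, 0.1, 0.1, 0.5, 0.2, 0.5],
    "T": [0.7, 0.7, 0.2, 0.2, 0.2, 0.2],
    "G": [0.1, 0.1, 0.5, 0.1, 0.1, 0.2],
    "C": [0.1, 0.1, 0.2, 0.2, 0.5, 0.1],
}


def pwm_probability(site: str, pwm: dict) -> float:
    """Product of the per-position frequencies of the bases in `site`."""
    return math.prod(pwm[base][i] for i, base in enumerate(site.upper()))


def best_window(seq: str, pwm: dict) -> int:
    """Start index of the highest-scoring window in `seq`."""
    width = len(pwm["A"])
    return max(
        range(len(seq) - width + 1),
        key=lambda i: pwm_probability(seq[i:i + width], pwm),
    )


print(pwm_probability("TTGACA", PWM_35))  # consensus hexamer
print(pwm_probability("TTGATA", PWM_35))  # one substitution, lower score
print(best_window("ccTTGACAgg", PWM_35))  # 2
```

Unlike an exact consensus match, the score degrades gradually as a site drifts away from the consensus, so near-matches can still be ranked.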

Consensus considerations


GAL4 in Yeast

Activator of galactose-induced genes (convert galactose to glucose)

Protein structure determines motif

DNA-protein interactions require certain bases at specified locations

Motif reflects homodimer structure



Example

Motif Types

2. Motifs in protein structure

Importance


Functional relationships between proteins cannot be distinguished through simple BLAST or FASTA database searches.

Proteins often perform multiple functions that cannot be fully described using a single annotation.

To resolve these issues, identification of the motifs and domains becomes very useful.

atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg


acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca


tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga


gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga


tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag


gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa


cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat


aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta


ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag


ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

Random Sample

Implanting Motif AAAAAAAAGGGGGGG

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa

tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag

gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
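Comparing the random sample with the implanted sample suggests that implanting overwrites 15 bases in place, so the sequence length stays the same. The sketch below reproduces that operation on the first sequence; the function name `implant` and the start index 17 (read off the slide by counting the unchanged prefix) are assumptions.

```python
def implant(sequence: str, motif: str, position: int) -> str:
    """Overwrite len(motif) bases of `sequence` starting at `position`."""
    if position < 0 or position + len(motif) > len(sequence):
        raise ValueError("motif does not fit at this position")
    return sequence[:position] + motif + sequence[position + len(motif):]


# First sequence of the random sample and the 15-base motif.
RANDOM_SEQ = (
    "atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaag"
    "tacgttagactcggcgccgccg"
)
MOTIF = "AAAAAAAAGGGGGGG"

implanted = implant(RANDOM_SEQ, MOTIF, 17)
print(implanted)
```

With the motif written in upper case and the background in lower case, the implanted copy is easy to spot by eye, which is exactly what stops being true once mismatches are introduced.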


Hard to identify

Relatively short sequences (as small as 6 bases)

Many positions not well conserved

Factors improving identification

Usually localized in certain proximity of a gene (search within 3 kb upstream)

Some positions highly conserved

Use other data (Microarray?)
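Because many positions are not well conserved, a search for a known motif has to tolerate mismatches. The sketch below scans a sequence and reports every window within a given number of mismatches of a pattern. The function name `approximate_matches` and the toy upstream sequence are hypothetical; the pattern ATCCCG is borrowed from the binding-site example earlier in the lecture.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_matches(sequence: str, pattern: str, max_mismatches: int):
    """Return (position, mismatches) for each window close enough to pattern."""
    seq, pat = sequence.upper(), pattern.upper()
    k = len(pat)
    hits = []
    for i in range(len(seq) - k + 1):
        d = hamming(seq[i:i + k], pat)
        if d <= max_mismatches:
            hits.append((i, d))
    return hits


upstream = "ggATCCCGttaaATGCCGcc"
print(approximate_matches(upstream, "ATCCCG", 1))  # [(2, 0), (12, 1)]
```

This only works when the motif is already known; finding an unknown motif is the harder problem posed in the rest of the lecture.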

The Challenge


Find a motif in a sample of:

20 “random” sequences (e.g. 600 nt long)

each sequence containing an implanted pattern of length 15

each pattern appearing with 4 mismatches, as a (15,4) motif.
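A challenge instance of this kind can be generated in a few lines. The sketch below builds 20 random sequences of 600 nt, each with one copy of a length-15 pattern mutated at exactly 4 positions. The function names, the fixed random seed, the uniform base distribution, and the use of AAAAAAAAGGGGGGG as the pattern are assumptions made for illustration.

```python
import random

BASES = "acgt"


def mutate(motif: str, mismatches: int, rng: random.Random) -> str:
    """Copy of `motif` with exactly `mismatches` positions changed (lower case)."""
    chars = list(motif)
    for p in rng.sample(range(len(motif)), mismatches):
        alternatives = [b for b in "ACGT" if b != chars[p].upper()]
        chars[p] = rng.choice(alternatives).lower()
    return "".join(chars)


def make_challenge(motif: str, mismatches: int, t: int = 20, n: int = 600, seed: int = 0):
    """Return t pairs (start, sequence), each with one mutated motif copy."""
    rng = random.Random(seed)
    sample = []
    for _ in range(t):
        background = "".join(rng.choice(BASES) for _ in range(n))
        start = rng.randrange(n - len(motif) + 1)
        instance = mutate(motif, mismatches, rng)
        sample.append((start, background[:start] + instance + background[start + len(motif):]))
    return sample


MOTIF = "AAAAAAAAGGGGGGG"
challenge = make_challenge(MOTIF, 4)
start, seq = challenge[0]
print(start, seq[start:start + len(MOTIF)])
```

The start positions are returned only so the result can be checked; a motif finder would be given the sequences alone.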

Challenge Problem

atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg


acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga


tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga


gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga


tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag


gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa


cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat


aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta


ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag


ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

Where is the Motif???

AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
Why Is Finding a (15,4) Motif Difficult?

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa


             a G g t a c T t
             C c A t a c g t
Alignment    a c g t T A g t
             a c g t C c A t
             C c g t a c g G
             _________________
          A  3 0 1 0 3 1 1 0
Profile   C  2 4 0 0 1 4 0 0
          G  0 1 4 0 0 0 3 1
          T  0 0 0 5 1 0 1 4
             _________________
Consensus    A C G T A C G T


Line up the patterns by their start indexes s = (s1, s2, …, st)

Construct matrix profile with frequencies of each nucleotide in columns

Consensus nucleotide in each position has the highest score in column
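The three steps above can be sketched directly: take the aligned patterns, count each nucleotide per column to get the profile, and pick the most frequent nucleotide per column for the consensus. The five rows below are the alignment from this slide; the function names `profile` and `consensus` are chosen for this example, and ties (none occur here) would be broken in A, C, G, T order.

```python
from collections import Counter

# The five aligned patterns from the slide.
ALIGNMENT = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]


def profile(rows):
    """Per-column nucleotide counts, as a dict of base -> list of counts."""
    columns = zip(*(row.upper() for row in rows))
    counts = [Counter(col) for col in columns]
    return {base: [c[base] for c in counts] for base in "ACGT"}


def consensus(prof):
    """Most frequent nucleotide in each column of the profile."""
    width = len(prof["A"])
    return "".join(max("ACGT", key=lambda b: prof[b][i]) for i in range(width))


prof = profile(ALIGNMENT)
for base in "ACGT":
    print(base, prof[base])
print(consensus(prof))  # ACGTACGT
```

The printed counts match the profile rows on the slide (A: 3 0 1 0 3 1 1 0, and so on).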

Motifs: Profiles and Consensus

Motif Search in Local Database