CS5263 Bioinformatics
Lecture 20
Practical issues in motif finding
Final project
Motif representation
•
Collection of exact words
–
{ACGTTAC, ACGCTAC, AGGTGAC, …}
•
Consensus sequence (with wild cards)
–
{AcGTgTtAC}
–
{ASGTKTKAC} S=C/G, K=G/T (IUPAC code)
•
Position specific weight matrices
Sequence Logo
1
2
3
4
5
6
7
8
9
A
.97
.10
.02
.03
.10
.01
.05
.85
.03
C
.01
.40
.01
.04
.05
.01
.05
.05
.03
G
.01
.40
.95
.03
.40
.01
.3
.05
.03
T
.01
.10
.02
.90
.45
.97
.6
.05
.91
I
1.76
0.28
1.64
1.37
0.40
1.76
0.60
1.15
1.42
Finding Motifs
Motif finding schemes
Genome 1
Genome 2
Gene set 1
Gene set 2
Conservation
Yes
No
Whole
genome
Yes
Genome 1 & 2 & 3
Genome 1
No
Gene 1A & 1B & 1C or
Gene Set 1 & 2 & 3
Gene Set 1
Genome 3
Gene set 3
1A
1B
1C
Phylogenetic footprinting
Dictionary building
“Motif finding”
Ideally, all information should be used, at some stage.
i.e., inside algorithm vs pre

or post

processing.
Classification of approaches
•
Combinatorial search
–
Based on enumeration of words and
computing word similarities
–
Analogy to DP for sequence alignment
•
Probabilistic modeling
–
Construct models to distinguish motifs vs non

motifs
–
Analogy to HMM for sequence alignment
Combinatorial motif finding
Given a set of sequences S = {x
1
, …, x
n
}
•
A motif W is a consensus string w
1
…w
K
•
Find
motif W
*
with “best” match to x
1
, …, x
n
Definition of “best”:
d(W, x
i
) = min hamming dist. between W and a word in x
i
d(W, S) =
i
d(W, x
i
)
W* = argmin( d(W, S) )
Exhaustive searches
1.
Pattern

driven algorithm
:
For W = AA…A to TT…T
(4
K
possibilities)
Find d( W, S )
Report W* = argmin( d(W, S) )
Running time: O( K N 4
K
)
(where N =
i
x
i
)
Guaranteed to find the optimal solution.
Exhaustive searches
2.
Sample

driven algorithm
:
For W = a K

long word in some x
i
Find d( W, S )
Report W* = argmin( d( W, S ) )
OR
Report a local improvement of W
*
Running time: O( K N
2
)
Extended sample

driven (ESD)
approaches
•
Hybrid between pattern

driven and sample

driven
•
Assume each instance does not differ by more than α
bases to the motif (
usually depends on k)
motif
instance
The real motif will reside in the

neighborhood of some words in S.
Instead of searching all 4
K
patterns,
we can search the

neighborhood
of every word in S.
α

neighborhood
WEEDER
•
Naïve: N K
α
3
α
NK
# of patterns to test
# of words in sequences
Better idea
•
Using a joint suffix tree, find all patterns
that:
–
Have length
K
–
Appeared in at least
m
sequences with at
most
α
mismatches
•
Post

processing
WEEDER: algorithm sketch
•
A list containing all eligible
nodes: with at most
α
mismatches to P
•
For each node, remember
#mismatches accumulated (
e
),
and bit vector (
B
) for seq occ,
e.g. [011100010]
•
Bit OR all B’s to get seq
occurrence for P
•
Suppose
#occ >= m
–
Pattern still valid
•
Now add a letter
A
C
G
T
T
Current pattern P, P < K
(e, B)
# mismatches
Seq occ
WEEDER: algorithm sketch
•
Simple extension: no branches.
–
No change to B
–
e may increase by 1 or no
change
–
Drop node if
e >
α
•
Branches: replace a node with
its child nodes
–
Drop if
e >
α
–
B may change
•
Re

do Bit OR using all B’s
•
Try a different char if #occ < m
•
Report P when P = K
A
C
G
T
T
A
Current pattern P
(e, B)
WEEDER: complexity
•
Can get all D(P, S) in time
O(nN (K choose
α) 3
α
) ~ O(nN K
α
3
α
).
n: # sequences. Needed for Bit OR.
•
Better than O(KN 4
K
) since usually
α << K
•
K
α
3
α
may still be expensive for large K
–
E.g. K = 20,
α
= 6
WEEDER: More tricks
•
Eligible nodes: with at most
α
mismatches to P
•
Eligible nodes: with at most
min(
L,
α
)
mismatches to P
–
L:
current pattern length
–
:
error ratio
–
Require that mismatches to be
somewhat evenly distributed
among positions
•
Prune tree at length K
A
C
G
T
T
A
Current pattern P
MULTIPROFILER
W differs from W* at
positions.
The consensus sequence
for the words in the

neighborhood of W is
similar to W.
If we ignore all the chars
that are similar to W,
the rest may suggest
the difference between
W and W*
W
W*
W*: A
C
GTA
C
G
W: A
T
GTA
A
G
MULTIPROFILER: alg sketch
•
For each word P
in S
–
Find its
α

neighborhood
in S
–
List of words: C
•
For each position j from
1..K of the words in C
–
Find the most popular char
that differ from P[j]
•
Replace
α
positions in P
with the chars found above
–
Call the new word P’
•
W* = argmin D(P’, S)
W
W*
W*: A
C
GTA
C
G
W: A
T
GTA
A
G
MULTIPROFILER
•
Complexity not discussed
in the paper
•
More efficient than
WEEDER for longer
patterns: N < K
α
3
α
•
How to choose
α
is an
issue:
–
Large
α
: too many noises
in neighborhood
–
Small
α
: few true instances
in neighborhood
W
W*
W*: A
C
GTA
C
G
W: A
T
GTA
A
G
Probabilistic modeling approaches
for motif finding
Probabilistic modeling approaches
•
A motif model
–
Usually a PWM
–
M = (P
ij
), i = 1..4, j = 1..k, k: motif length
•
A background model
–
Usually the distribution of base frequencies in
the genome (or other selected subsets of
sequences)
–
B = (b
i
), i = 1..4
•
A word can be generated by M or B
Expectation

Maximization
•
For any word W,
P(W  M) = P
W[1] 1
P
W[2] 2
…P
W[K] K
P(W  B) = b
W[1]
b
W[2]
…b
W[K]
•
Let
= P(M), i.e., the probability for any word to
be generated by M.
•
Then P(B) = 1

•
Can compute the posterior probability P(MW)
and P(BW)
P(MW) ~ P(WM) *
P(BW) ~ P(WB) * (1

)
Expectation

Maximization
•
E

step: Z
xy
= P(M  X[y..y+k

1]) for all x and y
•
M

step: re

estimate M,
given Z
Initialize
E

step
M

step
probability
position
1
9
5
1
9
5
MEME
•
Multiple EM for Motif Elicitation
•
Bailey and Elkan, UCSD
•
http://meme.sdsc.edu/
•
Multiple starting points
•
Multiple modes: ZOOPS, OOPS, TCM
Gibbs Sampling
•
Another very useful technique for
estimating missing parameters
•
EM is deterministic
–
Often trapped by local optima
•
Gibbs sampling: stochastic behavior to
avoid local optima
Gibbs Sampling
•
Gibbs sampling: sample one position according to probability
–
Update prediction of one training sequence at a time
•
Viterbi: always take the highest
•
EM: take weighted average
Sampling
Simultaneously update
predictions of all sequences
position
probability
Gibbs sampling motif finders
•
Gibbs Sampler, based on C. Larence et.al.
Science, 1993
•
AlignACE, Nat Biotech 1998, developed in
Church lab, Harvard Univ
•
BioProspector, X. Liu et. al. PSB 2001 , an
improvement of AlignACE
Limits of Motif Finders
•
Given upstream regions of coregulated genes:
–
Increasing
length makes motif finding
harder
–
random motifs clutter the true ones
–
Decreasing
length makes motif finding
harder
–
true
motif missing in some sequences
•
Similar issue for number of sequences
0
gene
???
Challenging problem
•
(k, d)

motif challenge problem
•
Many algorithms fail at (15, 4)

motif for n = 20 and L = 1000
•
Combinatorial algorithms usually work better on challenge problems
–
However, they are usually designed to find (k, d)

motifs
–
Performance in real data varies
k
d mutations
n = 20
L = 1000
(15, 4)

motif
•
Information content: 11.7 bits
•
~ 6mers. Expected occurrence 1 per 7k bp
Actual
Results
by MEME
llr = 163 E

value = 3.2e+005
llr = 177
E

value = 1.5e+006
llr = 88
E

value = 2.5e+005
(15, 4’)

motif
•
Motif length: 15
•
Each instance differs at most 4 bases from consensus
•
Information content: 14.8
•
Much easier problem
sites = 2
llr = 38
E

value = 3.0e+006
sites = 5
llr = 95
E

value = 1.8e+006
sites = 20
llr = 240
E

value = 4.7e

013
Actual
Results
by MEME
Results for some real genes
llr = 394
E

value = 2.0e

023
llr = 347
E

value = 9.8e

002
llr = 110
E

value = 1.4e+004
Motif finding in practice
•
Now we’ve found some good looking
motifs
–
Easiest step?
•
What to do next?
–
Are they real?
–
How do we find more instances in the rest of
the genome?
–
What are their functional meaning?
•
Motifs => regulatory networks
To make sense about the motifs
•
Each program usually reports a number of motifs
(tens to hundreds)
–
Many motifs are variations of each other
–
Each program also report some different ones
•
Each program has its own way of scoring motifs
–
Best scored motifs often not interesting
–
AAAAAAAA
–
ACACACAC
–
TATATATAT
Strategies to improve results
•
Combine results from different algorithms
usually helpful
–
Ones that appeared multiple times are
probably more interesting
•
Except simple repeats like AAAAA or ATATATATA
–
Cluster motifs into groups according to their
similarities
Strategies to improve results
•
Compare with known motifs in database
–
TRANSFAC
–
JASPAR
•
Issues:
–
How to determine similarity between motifs?
•
Alignment between matrices
–
How similar is similar?
•
Empirically determine some threshold
Strategies to improve results
•
Statistical test of significance
–
Enrichment in target sequences vs
background sequences
Target set
T
Background set
B
Assumed to contain a
common motif, P
Assumed to not contain P,
or with very low frequency
Ideal case: every sequence in T has P, no sequence in B has P
Statistical test for significance
•
Intuitively: if n / N >> m / M
–
P is enriched (over

represented) in T
–
Statistical significance?
Target set
T
Background set + target set
B + T
N
M
P
Appeared in
n sequences
Appeared in m
sequences
Hypergeometric distribution
•
A box with M balls, of which N are
red, and the rest are blue.
•
If we
randomly
draw m balls from
the box
•
What’s the probability we’ll see n
red balls?
–
If P is very small, we are
probably not drawing randomly
•
Red
ball: target sequences
•
Blue
ball: background sequences
•
Total # of choices: (M choose m)
•
# of choices to have n red balls:
(N choose n) x (M

N choose m

n)
Cumulative hypergeometric test for
motif significance
•
We are interested in: if we
randomly pick m balls, how likely
that we’ll see
at least
n red balls?
This can be interpreted as the p

value for the null
hypothesis that we are randomly picking.
Alternative hypothesis: our selection favors red balls.
Equivalent: the target set T is enriched with motif P.
Or: P is over

represented in T.
Examples
•
Yeast genome has 6000 genes
•
Select 50 genes believed to be co

regulated by a common TF
•
Found a motif for these 50 genes
•
It appeared in 20 out of these 50 genes
•
In the whole genome, 100 genes have this motif
•
M = 6000, N = 50, m = 100+20 = 120, n = 20
•
Intuition:
–
m/M = 120/6000. In Genome, 1 out 50 genes have the motif
–
N = 50, would expect only 1 gene in the target set to have the motif
–
20

fold enrichment
•
P

value = 6 x 10

22
•
n = 5. 5

fold enrichment. P

value = 0.003
•
Normally a very low p

value is needed, e.g. 10

10
ROC curve for motif significance
•
Motif is usually a PWM
•
Any word will have a score
–
Typical scoring function: Log P(W  M) / P(W  B)
–
W: a word.
–
M: a PWM.
–
B: background model
•
To determine whether a sequence contains a
motif, a cutoff has to be decided
–
With different cutoffs, you get different number of
genes with the motif
–
Hyper

geometric test first assumes a cutoff
–
It may be better to look at a range of cutoffs
ROC curve for motif significance
•
With different score cutoff, will have different m and n
•
Assume you want to use P to classify T and B
•
Sensitivity: n / N
•
Specificity: (M

N

m+n) / (M

N)
•
False Positive Rate = 1
–
specificity: (m
–
n) / (M

N)
•
With decreasing cutoff, sensitivity
, FPR
Target set
T
Background set + target set
B + T
N
M
P
Appeared in
n
sequences
Appeared in
m
sequences
Given a score cutoff
ROC curve for motif significance
ROC

AUC: area under curve.
1: the best. 0.5: random.
Motif 1 is more enriched in motif 2.
1

specificity
sensitivity
Motif 1
Motif 2
Random
A good cutoff
Highest cutoff. No motif can pass the cutoff. Sensitivity = 0. specificity = 1.
Lowest cutoff. Every sequence
has the motif. Sensitivity = 1.
specificity = 0.
0
1
1
0
Other strategies
•
Cross

validation
–
Randomly divide sequences into 10 sets, hold 1 set
for test.
–
Do motif finding on 9 sets. Does the motif also appear
in the testing set?
•
Phylogenetic conservation information
–
Does a motif also appears in the homologous genes
of another species?
–
Strongest evidence
–
However, will not be able to find species

specific ones
Other strategies
•
Finding motif modules
–
Will two motifs always appear in the same gene?
•
Location preference
–
Some motifs appear to be in certain location
•
E.g., within 50

150bp upstream to transcription start
–
If a detect motif has strong positional bias, may be a sign of its
function
•
Evidence from other types of data sources
–
Do the genes having the motif always have similar activities
(gene expression levels) across different conditions?
–
Interact with the same set of proteins?
–
Similar functions?
–
etc.
To search for new instances
•
Usually many false positives
•
Score cutoff is critical
•
Can estimate a score cutoff from the “true”
binding sites
Motif finding
Scoring function
A set of scores for the “true” sites. Take mean

std as a cutoff.
(or a cutoff such that the majority of “true” sites can be predicted).
To search for new instances
•
Use other information, such as positional
biases of motifs to restrict the regions that
a motif may appear
•
Use gene expression data to help: the
genes having the true motif should have
similar activities
•
Phylogenetic conservation is the key
Final project
•
Write a review paper on a topic that we didn’t cover in
lectures
Or
•
Implement an algorithm and do some experiments
•
Compare several algorithms (existing implementation ok)
•
Combine several algorithms to form a pipeline (e.g. gene
expression + motif analysis)
•
Final:
–
5

10 pages report (single space, single column, 12pt) + 15
minutes presentation
Possible topics for term paper
•
Possible topics:
–
Haplotype inferencing
–
Computational challenges associated with new
microarray technologies
–
Phylogenetic footprinting
–
Small RNA gene / target prediction (siRNA, mRNA,
…)
–
Biomedical text mining
–
Protein structure prediction
–
Topology of biological networks
An example project
•
Given a gene expression data (say cell cycle)
•
Cluster genes using k

means
•
Find motifs using several algorithms
•
(Cluster and combine similar motifs)
•
Rank motifs according to their specificity to the
target sequences comparing to the other
clusters
•
Get their logos
•
Use the sequences to search the whole genome
for more genes with the motif
•
Do they have any functional significance?
Comments 0
Log in to post a comment