CS5263 Bioinformatics
Lecture 20
Practical issues in motif finding
Final project
Homework problem 3.1
•
Count separately the number of character
comparisons and the number of steps
needed to find the next matching character
using the bad character rule
•
Question: can you give an example?
Extended bad character rule
char
Position in P
a
6,
3
b
7, 4
p
2
t
1
x
5
T: xpbctbxabpqq
a
abpqz
P: tpabxab
*^^
P: tp
a
bxab
Find T(k) in P that is
immediately left to i,
shift P to align T(k)
with that position
k
In this iteration:
# of comparison = 3
Table lookup: 2
Results for some real genes
llr = 394
E

value = 2.0e

023
llr = 347
E

value = 9.8e

002
llr = 110
E

value = 1.4e+004
Strategies to improve results
•
Combine results from different algorithms
usually helpful
–
Ones that appeared multiple times are
probably more interesting
•
Except simple repeats like AAAAA or ATATATATA
–
Cluster motifs into groups according to their
similarities
Strategies to improve results
•
Compare with known motifs in database
–
TRANSFAC
–
JASPAR
•
Issues:
–
How to determine similarity between motifs?
•
Alignment between matrices
–
How similar is similar?
•
Empirically determine some threshold
Strategies to improve results
•
Statistical test of significance
–
Enrichment in target sequences vs
background sequences
Target set
T
Background set
B
Assumed to contain a
common motif, P
Assumed to not contain P,
or with very low frequency
Ideal case: every sequence in T has P, no sequence in B has P
Statistical test for significance
•
Intuitively: if n / N >> m / M
–
P is enriched (over

represented) in T
–
Statistical significance?
Target set
T
Background set + target set
B + T
Size = N
Size = M
P
Appear in
n
sequences
Appear in
m
sequences
Hypergeometric distribution
•
A box with M balls, of which N are
red, and the rest are blue.
–
Red
ball: target sequences
–
Blue
ball: background sequences
•
If we
randomly
draw
m
balls from
the box,
•
what’s the probability we’ll see
n
red
balls?
–
If probability very small, we are
probably not drawing randomly
•
Total # of choices:
(M choose m)
•
# of choices to have
n
red
balls:
(N choose n) x (M

N choose m

n)
Cumulative hypergeometric test for
motif significance
•
We are interested in: if we
randomly pick m balls, how likely
that we’ll see
at least
n
red
balls?
This can be interpreted as the p

value for the null
hypothesis that we are randomly picking.
Alternative hypothesis: our selection favors
red
balls.
Equivalent: the target set T is enriched with motif P.
Or: P is over

represented in T.
Examples
•
Yeast genome has 6000 genes
•
Select 50 genes believed to be co

regulated by a common TF
•
Found a motif for these 50 genes
•
It appeared in 20 out of these 50 genes
•
In the whole genome, 100 genes have this motif
•
M = 6000, N = 50, m = 100+20 = 120, n = 20
•
Intuition:
–
m/M = 120/6000. In Genome, 1 out 50 genes have the motif
–
N = 50, would expect only 1 gene in the target set to have the motif
–
20

fold enrichment
•
P

value = 6 x 10

22
•
n = 5. 5

fold enrichment. P

value = 0.003
•
Normally a very low p

value is needed, e.g. 10

10
ROC curve for motif significance
•
Motif is usually a PWM
•
Any word will have a score
–
Typical scoring function: Log P(W  M) / P(W  B)
–
W: a word.
–
M: a PWM.
–
B: background model
•
To determine whether a sequence contains a
motif, a cutoff has to be decided
–
With different cutoffs, you get different number of
genes with the motif
–
Hyper

geometric test first assumes a cutoff
–
It may be better to look at a range of cutoffs
ROC curve for motif significance
•
With different score cutoff, will have different
m
and
n
•
Assume you want to use P to classify T and B
•
Sensitivity:
n / N
•
Specificity: (M

N

m+n) / (M

N)
•
False Positive Rate = 1
–
specificity: (m
–
n) / (M

N)
•
With decreasing cutoff, sensitivity
, FPR
Target set
T
Background set + target set
B + T
Size = N
Size = M
P
Appeared in
n
sequences
Appeared in
m
sequences
Given a score cutoff
ROC curve for motif significance
ROC

AUC: area under curve.
•
1: perfect separation.
•
0.5: random.
Motif 1 is better than motif 2.
1

specificity
sensitivity
Motif 1
Motif 2
Random
A good cutoff
Highest cutoff. No motif can pass the cutoff. Sensitivity = 0. specificity = 1.
Lowest cutoff. Every sequence
has the motif. Sensitivity = 1.
specificity = 0.
0
1
1
0
Other strategies
•
Cross

validation
–
Randomly divide sequences into 10 sets, hold 1 set
for test.
–
Do motif finding on 9 sets. Does the motif also appear
in the testing set?
•
Phylogenetic conservation information
–
Does a motif also appears in the homologous genes
of another species?
–
Strongest evidence
–
However, will not be able to find species

specific ones
Other strategies
•
Finding motif modules
–
Will two motifs always appear in the same gene?
•
Location preference
–
Some motifs appear to be in certain location
•
E.g., within 50

150bp upstream to transcription start
–
If a detect motif has strong positional bias, may be a sign of its
function
•
Evidence from other types of data sources
–
Do the genes having the motif always have similar activities
(gene expression levels) across different conditions?
–
Interact with the same set of proteins?
–
Similar functions?
–
etc.
To search for new instances
•
Usually many false positives
•
Score cutoff is critical
•
Can estimate a score cutoff from the “true”
binding sites
Motif finding
Scoring function
A set of scores for the “true” sites. Take mean

std as a cutoff.
(or a cutoff such that the majority of “true” sites can be predicted).
To search for new instances
•
Use other information, such as positional
biases of motifs to restrict the regions that
a motif may appear
•
Use gene expression data to help: the
genes having the true motif should have
similar activities
•
Phylogenetic conservation is the key
Final project
•
Write a review paper on a topic that we didn’t cover in
lectures
Or
•
Implement an algorithm and do some experiments
•
Compare several algorithms (existing implementation ok)
•
Combine several algorithms to form a pipeline (e.g. gene
expression + motif analysis)
•
Final:
–
5

10 pages report (single space, single column, 12pt) + 15
minutes presentation
Possible topics for term paper
•
Possible topics:
–
Haplotype inferencing
–
Computational challenges associated with new
microarray technologies
–
Phylogenetic footprinting
–
Small RNA gene / target prediction (siRNA, mRNA,
…)
–
Biomedical text mining
–
Protein structure prediction
–
Topology of biological networks
An example project
•
Given a gene expression data (say cell cycle)
•
Cluster genes using k

means
•
Find motifs using several algorithms
•
(Cluster and combine similar motifs)
•
Rank motifs according to their specificity to the
target sequences comparing to the other
clusters
•
Get their logos
•
Use the sequences to search the whole genome
for more genes with the motif
•
Do they have any functional significance?
Comments 0
Log in to post a comment