# CS5263 Bioinformatics - Department of Computer Science

Biotechnology

Oct 2, 2013 (4 years and 7 months ago)

87 views

CS5263 Bioinformatics

Lecture 20

Practical issues in motif finding

Final project

Motif representation

Collection of exact words

{ACGTTAC, ACGCTAC, AGGTGAC, …}

Consensus sequence (with wild cards)

{AcGTgTtAC}

{ASGTKTKAC} S=C/G, K=G/T (IUPAC code)

Position specific weight matrices

Sequence Logo

1

2

3

4

5

6

7

8

9

A

.97

.10

.02

.03

.10

.01

.05

.85

.03

C

.01

.40

.01

.04

.05

.01

.05

.05

.03

G

.01

.40

.95

.03

.40

.01

.3

.05

.03

T

.01

.10

.02

.90

.45

.97

.6

.05

.91

I

1.76

0.28

1.64

1.37

0.40

1.76

0.60

1.15

1.42

Finding Motifs

Motif finding schemes

Genome 1

Genome 2

Gene set 1

Gene set 2

Conservation

Yes

No

Whole
genome

Yes

Genome 1 & 2 & 3

Genome 1

No

Gene 1A & 1B & 1C or
Gene Set 1 & 2 & 3

Gene Set 1

Genome 3

Gene set 3

1A

1B

1C

Phylogenetic footprinting

Dictionary building

“Motif finding”

Ideally, all information should be used, at some stage.
i.e., inside algorithm vs pre
-

or post
-
processing.

Classification of approaches

Combinatorial search

Based on enumeration of words and
computing word similarities

Analogy to DP for sequence alignment

Probabilistic modeling

Construct models to distinguish motifs vs non
-
motifs

Analogy to HMM for sequence alignment

Combinatorial motif finding

Given a set of sequences S = {x
1
, …, x
n
}

A motif W is a consensus string w
1
…w
K

Find

motif W
*

with “best” match to x
1
, …, x
n

Definition of “best”:

d(W, x
i
) = min hamming dist. between W and a word in x
i

d(W, S) =

i

d(W, x
i
)

W* = argmin( d(W, S) )

Exhaustive searches

1.
Pattern
-
driven algorithm
:

For W = AA…A to TT…T

(4
K

possibilities)

Find d( W, S )

Report W* = argmin( d(W, S) )

Running time: O( K N 4
K

)

(where N =

i

|x
i
|)

Guaranteed to find the optimal solution.

Exhaustive searches

2.
Sample
-
driven algorithm
:

For W = a K
-
long word in some x
i

Find d( W, S )

Report W* = argmin( d( W, S ) )

OR

Report a local improvement of W
*

Running time: O( K N
2

)

Extended sample
-
driven (ESD)
approaches

Hybrid between pattern
-
driven and sample
-
driven

Assume each instance does not differ by more than α
bases to the motif (

usually depends on k)

motif

instance

The real motif will reside in the

-
neighborhood of some words in S.

K

patterns,

we can search the

-
neighborhood
of every word in S.

α
-
neighborhood

WEEDER

Naïve: N K
α

3
α

NK

# of patterns to test

# of words in sequences

Better idea

Using a joint suffix tree, find all patterns
that:

Have length
K

Appeared in at least
m

sequences with at
most
α

mismatches

Post
-
processing

WEEDER: algorithm sketch

A list containing all eligible
nodes: with at most
α

mismatches to P

For each node, remember
#mismatches accumulated (
e
),
and bit vector (
B
) for seq occ,
e.g. [011100010]

Bit OR all B’s to get seq
occurrence for P

Suppose
#occ >= m

Pattern still valid

A

C

G

T

T

Current pattern P, |P| < K

(e, B)

# mismatches

Seq occ

WEEDER: algorithm sketch

Simple extension: no branches.

No change to B

e may increase by 1 or no
change

Drop node if
e >
α

Branches: replace a node with
its child nodes

Drop if
e >
α

B may change

Re
-
do Bit OR using all B’s

Try a different char if #occ < m

Report P when |P| = K

A

C

G

T

T

A

Current pattern P

(e, B)

WEEDER: complexity

Can get all D(P, S) in time

O(nN (K choose
α) 3
α
) ~ O(nN K
α

3
α
).

n: # sequences. Needed for Bit OR.

Better than O(KN 4
K
) since usually
α << K

K
α

3
α
may still be expensive for large K

E.g. K = 20,
α

= 6

WEEDER: More tricks

Eligible nodes: with at most
α

mismatches to P

Eligible nodes: with at most
min(

L,
α
)

mismatches to P

L:

current pattern length

:
error ratio

Require that mismatches to be
somewhat evenly distributed
among positions

Prune tree at length K

A

C

G

T

T

A

Current pattern P

MULTIPROFILER

W differs from W* at

positions.

The consensus sequence
for the words in the

-
neighborhood of W is
similar to W.

If we ignore all the chars
that are similar to W,
the rest may suggest
the difference between
W and W*

W

W*

W*: A
C
GTA
C
G

W: A
T
GTA
A
G

MULTIPROFILER: alg sketch

For each word P
in S

Find its
α
-
neighborhood
in S

List of words: C

For each position j from
1..K of the words in C

Find the most popular char
that differ from P[j]

Replace
α

positions in P
with the chars found above

Call the new word P’

W* = argmin D(P’, S)

W

W*

W*: A
C
GTA
C
G

W: A
T
GTA
A
G

MULTIPROFILER

Complexity not discussed
in the paper

More efficient than
WEEDER for longer
patterns: N < K
α

3
α

How to choose
α

is an
issue:

Large
α
: too many noises
in neighborhood

Small
α
: few true instances
in neighborhood

W

W*

W*: A
C
GTA
C
G

W: A
T
GTA
A
G

Probabilistic modeling approaches

for motif finding

Probabilistic modeling approaches

A motif model

Usually a PWM

M = (P
ij
), i = 1..4, j = 1..k, k: motif length

A background model

Usually the distribution of base frequencies in
the genome (or other selected subsets of
sequences)

B = (b
i
), i = 1..4

A word can be generated by M or B

Expectation
-
Maximization

For any word W,

P(W | M) = P
W[1] 1
P
W[2] 2
…P
W[K] K

P(W | B) = b
W[1]
b
W[2]
…b
W[K]

Let

= P(M), i.e., the probability for any word to
be generated by M.

Then P(B) = 1
-

Can compute the posterior probability P(M|W)
and P(B|W)

P(M|W) ~ P(W|M) *

P(B|W) ~ P(W|B) * (1
-

)

Expectation
-
Maximization

E
-
step: Z
xy

= P(M | X[y..y+k
-
1]) for all x and y

M
-
step: re
-
estimate M,

given Z

Initialize

E
-
step

M
-
step

probability

position

1

9

5

1

9

5

MEME

Multiple EM for Motif Elicitation

Bailey and Elkan, UCSD

http://meme.sdsc.edu/

Multiple starting points

Multiple modes: ZOOPS, OOPS, TCM

Gibbs Sampling

Another very useful technique for
estimating missing parameters

EM is deterministic

Often trapped by local optima

Gibbs sampling: stochastic behavior to
avoid local optima

Gibbs Sampling

Gibbs sampling: sample one position according to probability

Update prediction of one training sequence at a time

Viterbi: always take the highest

EM: take weighted average

Sampling

Simultaneously update
predictions of all sequences

position

probability

Gibbs sampling motif finders

Gibbs Sampler, based on C. Larence et.al.
Science, 1993

AlignACE, Nat Biotech 1998, developed in
Church lab, Harvard Univ

BioProspector, X. Liu et. al. PSB 2001 , an
improvement of AlignACE

Limits of Motif Finders

Given upstream regions of coregulated genes:

Increasing

length makes motif finding
harder

random motifs clutter the true ones

Decreasing

length makes motif finding
harder

true
motif missing in some sequences

Similar issue for number of sequences

0

gene

???

Challenging problem

(k, d)
-
motif challenge problem

Many algorithms fail at (15, 4)
-
motif for n = 20 and L = 1000

Combinatorial algorithms usually work better on challenge problems

However, they are usually designed to find (k, d)
-
motifs

Performance in real data varies

k

d mutations

n = 20

L = 1000

(15, 4)
-
motif

Information content: 11.7 bits

~ 6mers. Expected occurrence 1 per 7k bp

Actual

Results

by MEME

llr = 163 E
-
value = 3.2e+005

llr = 177

E
-
value = 1.5e+006

llr = 88

E
-
value = 2.5e+005

(15, 4’)
-
motif

Motif length: 15

Each instance differs at most 4 bases from consensus

Information content: 14.8

Much easier problem

sites = 2

llr = 38

E
-
value = 3.0e+006

sites = 5

llr = 95

E
-
value = 1.8e+006

sites = 20

llr = 240

E
-
value = 4.7e
-
013

Actual

Results

by MEME

Results for some real genes

llr = 394

E
-
value = 2.0e
-
023

llr = 347

E
-
value = 9.8e
-
002

llr = 110

E
-
value = 1.4e+004

Motif finding in practice

Now we’ve found some good looking
motifs

Easiest step?

What to do next?

Are they real?

How do we find more instances in the rest of
the genome?

What are their functional meaning?

Motifs => regulatory networks

To make sense about the motifs

Each program usually reports a number of motifs
(tens to hundreds)

Many motifs are variations of each other

Each program also report some different ones

Each program has its own way of scoring motifs

Best scored motifs often not interesting

AAAAAAAA

ACACACAC

TATATATAT

Strategies to improve results

Combine results from different algorithms

Ones that appeared multiple times are
probably more interesting

Except simple repeats like AAAAA or ATATATATA

Cluster motifs into groups according to their
similarities

Strategies to improve results

Compare with known motifs in database

TRANSFAC

JASPAR

Issues:

How to determine similarity between motifs?

Alignment between matrices

How similar is similar?

Empirically determine some threshold

Strategies to improve results

Statistical test of significance

Enrichment in target sequences vs
background sequences

Target set

T

Background set

B

Assumed to contain a
common motif, P

Assumed to not contain P,
or with very low frequency

Ideal case: every sequence in T has P, no sequence in B has P

Statistical test for significance

Intuitively: if n / N >> m / M

P is enriched (over
-
represented) in T

Statistical significance?

Target set

T

Background set + target set

B + T

N

M

P

Appeared in
n sequences

Appeared in m
sequences

Hypergeometric distribution

A box with M balls, of which N are
red, and the rest are blue.

If we
randomly

draw m balls from
the box

What’s the probability we’ll see n
red balls?

If P is very small, we are
probably not drawing randomly

Red

ball: target sequences

Blue

ball: background sequences

Total # of choices: (M choose m)

# of choices to have n red balls:
(N choose n) x (M
-
N choose m
-
n)

Cumulative hypergeometric test for
motif significance

We are interested in: if we
randomly pick m balls, how likely
that we’ll see
at least

n red balls?

This can be interpreted as the p
-
value for the null
hypothesis that we are randomly picking.

Alternative hypothesis: our selection favors red balls.

Equivalent: the target set T is enriched with motif P.

Or: P is over
-
represented in T.

Examples

Yeast genome has 6000 genes

Select 50 genes believed to be co
-
regulated by a common TF

Found a motif for these 50 genes

It appeared in 20 out of these 50 genes

In the whole genome, 100 genes have this motif

M = 6000, N = 50, m = 100+20 = 120, n = 20

Intuition:

m/M = 120/6000. In Genome, 1 out 50 genes have the motif

N = 50, would expect only 1 gene in the target set to have the motif

20
-
fold enrichment

P
-
value = 6 x 10
-
22

n = 5. 5
-
fold enrichment. P
-
value = 0.003

Normally a very low p
-
value is needed, e.g. 10
-
10

ROC curve for motif significance

Motif is usually a PWM

Any word will have a score

Typical scoring function: Log P(W | M) / P(W | B)

W: a word.

M: a PWM.

B: background model

To determine whether a sequence contains a
motif, a cutoff has to be decided

With different cutoffs, you get different number of
genes with the motif

Hyper
-
geometric test first assumes a cutoff

It may be better to look at a range of cutoffs

ROC curve for motif significance

With different score cutoff, will have different m and n

Assume you want to use P to classify T and B

Sensitivity: n / N

Specificity: (M
-
N
-
m+n) / (M
-
N)

False Positive Rate = 1

specificity: (m

n) / (M
-
N)

With decreasing cutoff, sensitivity

, FPR

Target set

T

Background set + target set

B + T

N

M

P

Appeared in
n

sequences

Appeared in
m

sequences

Given a score cutoff

ROC curve for motif significance

ROC
-
AUC: area under curve.

1: the best. 0.5: random.

Motif 1 is more enriched in motif 2.

1
-
specificity

sensitivity

Motif 1

Motif 2

Random

A good cutoff

Highest cutoff. No motif can pass the cutoff. Sensitivity = 0. specificity = 1.

Lowest cutoff. Every sequence
has the motif. Sensitivity = 1.
specificity = 0.

0

1

1

0

Other strategies

Cross
-
validation

Randomly divide sequences into 10 sets, hold 1 set
for test.

Do motif finding on 9 sets. Does the motif also appear
in the testing set?

Phylogenetic conservation information

Does a motif also appears in the homologous genes
of another species?

Strongest evidence

However, will not be able to find species
-
specific ones

Other strategies

Finding motif modules

Will two motifs always appear in the same gene?

Location preference

Some motifs appear to be in certain location

E.g., within 50
-
150bp upstream to transcription start

If a detect motif has strong positional bias, may be a sign of its
function

Evidence from other types of data sources

Do the genes having the motif always have similar activities
(gene expression levels) across different conditions?

Interact with the same set of proteins?

Similar functions?

etc.

To search for new instances

Usually many false positives

Score cutoff is critical

Can estimate a score cutoff from the “true”
binding sites

Motif finding

Scoring function

A set of scores for the “true” sites. Take mean
-

std as a cutoff.

(or a cutoff such that the majority of “true” sites can be predicted).

To search for new instances

Use other information, such as positional
biases of motifs to restrict the regions that
a motif may appear

Use gene expression data to help: the
genes having the true motif should have
similar activities

Phylogenetic conservation is the key

Final project

Write a review paper on a topic that we didn’t cover in
lectures

Or

Implement an algorithm and do some experiments

Compare several algorithms (existing implementation ok)

Combine several algorithms to form a pipeline (e.g. gene
expression + motif analysis)

Final:

5
-
10 pages report (single space, single column, 12pt) + 15
minutes presentation

Possible topics for term paper

Possible topics:

Haplotype inferencing

Computational challenges associated with new
microarray technologies

Phylogenetic footprinting

Small RNA gene / target prediction (siRNA, mRNA,
…)

Biomedical text mining

Protein structure prediction

Topology of biological networks

An example project

Given a gene expression data (say cell cycle)

Cluster genes using k
-
means

Find motifs using several algorithms

(Cluster and combine similar motifs)

Rank motifs according to their specificity to the
target sequences comparing to the other
clusters

Get their logos

Use the sequences to search the whole genome
for more genes with the motif

Do they have any functional significance?