CS5263 Bioinformatics - Department of Computer Science

weinerthreeforksΒιοτεχνολογία

2 Οκτ 2013 (πριν από 4 χρόνια και 1 μήνα)

84 εμφανίσεις

CS5263 Bioinformatics

Lecture 20

Practical issues in motif finding

Final project

Motif representation


Collection of exact words


{ACGTTAC, ACGCTAC, AGGTGAC, …}


Consensus sequence (with wild cards)


{AcGTgTtAC}


{ASGTKTKAC} S=C/G, K=G/T (IUPAC code)


Position specific weight matrices

Sequence Logo

1

2

3

4

5

6

7

8

9

A

.97

.10

.02

.03

.10

.01

.05

.85

.03

C

.01

.40

.01

.04

.05

.01

.05

.05

.03

G

.01

.40

.95

.03

.40

.01

.3

.05

.03

T

.01

.10

.02

.90

.45

.97

.6

.05

.91

I

1.76

0.28

1.64

1.37

0.40

1.76

0.60

1.15

1.42



Finding Motifs

Motif finding schemes

Genome 1

Genome 2

Gene set 1

Gene set 2

Conservation

Yes

No

Whole
genome

Yes

Genome 1 & 2 & 3

Genome 1

No

Gene 1A & 1B & 1C or
Gene Set 1 & 2 & 3

Gene Set 1

Genome 3

Gene set 3

1A

1B

1C

Phylogenetic footprinting

Dictionary building

“Motif finding”

Ideally, all information should be used, at some stage.
i.e., inside algorithm vs pre
-

or post
-
processing.

Classification of approaches


Combinatorial search


Based on enumeration of words and
computing word similarities


Analogy to DP for sequence alignment


Probabilistic modeling


Construct models to distinguish motifs vs non
-
motifs


Analogy to HMM for sequence alignment

Combinatorial motif finding

Given a set of sequences S = {x
1
, …, x
n
}



A motif W is a consensus string w
1
…w
K



Find

motif W
*

with “best” match to x
1
, …, x
n


Definition of “best”:

d(W, x
i
) = min hamming dist. between W and a word in x
i

d(W, S) =

i

d(W, x
i
)


W* = argmin( d(W, S) )

Exhaustive searches

1.
Pattern
-
driven algorithm
:



For W = AA…A to TT…T


(4
K

possibilities)



Find d( W, S )


Report W* = argmin( d(W, S) )


Running time: O( K N 4
K

)


(where N =

i

|x
i
|)

Guaranteed to find the optimal solution.

Exhaustive searches

2.
Sample
-
driven algorithm
:



For W = a K
-
long word in some x
i



Find d( W, S )


Report W* = argmin( d( W, S ) )


OR

Report a local improvement of W
*


Running time: O( K N
2

)

Extended sample
-
driven (ESD)
approaches


Hybrid between pattern
-
driven and sample
-
driven


Assume each instance does not differ by more than α
bases to the motif (


usually depends on k)

motif

instance



The real motif will reside in the

-
neighborhood of some words in S.

Instead of searching all 4
K

patterns,

we can search the

-
neighborhood
of every word in S.

α
-
neighborhood

WEEDER


Naïve: N K
α

3
α

NK

# of patterns to test

# of words in sequences

Better idea


Using a joint suffix tree, find all patterns
that:


Have length
K


Appeared in at least
m

sequences with at
most
α

mismatches


Post
-
processing

WEEDER: algorithm sketch


A list containing all eligible
nodes: with at most
α

mismatches to P


For each node, remember
#mismatches accumulated (
e
),
and bit vector (
B
) for seq occ,
e.g. [011100010]


Bit OR all B’s to get seq
occurrence for P


Suppose
#occ >= m


Pattern still valid


Now add a letter

A

C

G

T

T

Current pattern P, |P| < K

(e, B)

# mismatches

Seq occ

WEEDER: algorithm sketch


Simple extension: no branches.


No change to B


e may increase by 1 or no
change


Drop node if
e >
α


Branches: replace a node with
its child nodes


Drop if
e >
α


B may change


Re
-
do Bit OR using all B’s


Try a different char if #occ < m


Report P when |P| = K

A

C

G

T

T

A

Current pattern P

(e, B)

WEEDER: complexity


Can get all D(P, S) in time

O(nN (K choose
α) 3
α
) ~ O(nN K
α

3
α
).

n: # sequences. Needed for Bit OR.


Better than O(KN 4
K
) since usually
α << K



K
α

3
α
may still be expensive for large K


E.g. K = 20,
α

= 6

WEEDER: More tricks


Eligible nodes: with at most
α

mismatches to P


Eligible nodes: with at most
min(

L,
α
)

mismatches to P


L:

current pattern length



:
error ratio


Require that mismatches to be
somewhat evenly distributed
among positions


Prune tree at length K

A

C

G

T

T

A

Current pattern P

MULTIPROFILER

W differs from W* at


positions.

The consensus sequence
for the words in the

-
neighborhood of W is
similar to W.

If we ignore all the chars
that are similar to W,
the rest may suggest
the difference between
W and W*

W

W*

W*: A
C
GTA
C
G

W: A
T
GTA
A
G

MULTIPROFILER: alg sketch


For each word P
in S


Find its
α
-
neighborhood
in S


List of words: C


For each position j from
1..K of the words in C


Find the most popular char
that differ from P[j]


Replace
α

positions in P
with the chars found above


Call the new word P’


W* = argmin D(P’, S)

W

W*

W*: A
C
GTA
C
G

W: A
T
GTA
A
G

MULTIPROFILER


Complexity not discussed
in the paper


More efficient than
WEEDER for longer
patterns: N < K
α

3
α


How to choose
α

is an
issue:


Large
α
: too many noises
in neighborhood


Small
α
: few true instances
in neighborhood

W

W*

W*: A
C
GTA
C
G

W: A
T
GTA
A
G



Probabilistic modeling approaches

for motif finding

Probabilistic modeling approaches


A motif model


Usually a PWM


M = (P
ij
), i = 1..4, j = 1..k, k: motif length


A background model


Usually the distribution of base frequencies in
the genome (or other selected subsets of
sequences)


B = (b
i
), i = 1..4


A word can be generated by M or B

Expectation
-
Maximization


For any word W,


P(W | M) = P
W[1] 1
P
W[2] 2
…P
W[K] K


P(W | B) = b
W[1]
b
W[2]
…b
W[K]


Let


= P(M), i.e., the probability for any word to
be generated by M.


Then P(B) = 1
-




Can compute the posterior probability P(M|W)
and P(B|W)


P(M|W) ~ P(W|M) *



P(B|W) ~ P(W|B) * (1
-

)

Expectation
-
Maximization


E
-
step: Z
xy

= P(M | X[y..y+k
-
1]) for all x and y


M
-
step: re
-
estimate M,


given Z

Initialize

E
-
step

M
-
step

probability

position

1

9

5

1

9

5

MEME


Multiple EM for Motif Elicitation


Bailey and Elkan, UCSD


http://meme.sdsc.edu/


Multiple starting points


Multiple modes: ZOOPS, OOPS, TCM

Gibbs Sampling


Another very useful technique for
estimating missing parameters


EM is deterministic


Often trapped by local optima


Gibbs sampling: stochastic behavior to
avoid local optima

Gibbs Sampling


Gibbs sampling: sample one position according to probability


Update prediction of one training sequence at a time


Viterbi: always take the highest


EM: take weighted average

Sampling

Simultaneously update
predictions of all sequences

position

probability

Gibbs sampling motif finders


Gibbs Sampler, based on C. Larence et.al.
Science, 1993


AlignACE, Nat Biotech 1998, developed in
Church lab, Harvard Univ


BioProspector, X. Liu et. al. PSB 2001 , an
improvement of AlignACE

Limits of Motif Finders


Given upstream regions of coregulated genes:


Increasing

length makes motif finding
harder



random motifs clutter the true ones


Decreasing

length makes motif finding
harder



true
motif missing in some sequences


Similar issue for number of sequences

0

gene

???

Challenging problem


(k, d)
-
motif challenge problem


Many algorithms fail at (15, 4)
-
motif for n = 20 and L = 1000


Combinatorial algorithms usually work better on challenge problems


However, they are usually designed to find (k, d)
-
motifs


Performance in real data varies

k

d mutations

n = 20

L = 1000

(15, 4)
-
motif


Information content: 11.7 bits


~ 6mers. Expected occurrence 1 per 7k bp

Actual

Results

by MEME

llr = 163 E
-
value = 3.2e+005


llr = 177


E
-
value = 1.5e+006


llr = 88


E
-
value = 2.5e+005


(15, 4’)
-
motif


Motif length: 15


Each instance differs at most 4 bases from consensus


Information content: 14.8


Much easier problem

sites = 2


llr = 38


E
-
value = 3.0e+006


sites = 5


llr = 95


E
-
value = 1.8e+006


sites = 20


llr = 240


E
-
value = 4.7e
-
013


Actual

Results

by MEME

Results for some real genes

llr = 394


E
-
value = 2.0e
-
023


llr = 347


E
-
value = 9.8e
-
002


llr = 110


E
-
value = 1.4e+004


Motif finding in practice


Now we’ve found some good looking
motifs


Easiest step?


What to do next?


Are they real?


How do we find more instances in the rest of
the genome?


What are their functional meaning?


Motifs => regulatory networks

To make sense about the motifs


Each program usually reports a number of motifs
(tens to hundreds)


Many motifs are variations of each other


Each program also report some different ones


Each program has its own way of scoring motifs


Best scored motifs often not interesting


AAAAAAAA


ACACACAC


TATATATAT

Strategies to improve results


Combine results from different algorithms
usually helpful


Ones that appeared multiple times are
probably more interesting


Except simple repeats like AAAAA or ATATATATA


Cluster motifs into groups according to their
similarities

Strategies to improve results


Compare with known motifs in database


TRANSFAC


JASPAR


Issues:


How to determine similarity between motifs?


Alignment between matrices


How similar is similar?


Empirically determine some threshold

Strategies to improve results


Statistical test of significance


Enrichment in target sequences vs
background sequences

Target set

T

Background set

B

Assumed to contain a
common motif, P

Assumed to not contain P,
or with very low frequency

Ideal case: every sequence in T has P, no sequence in B has P

Statistical test for significance


Intuitively: if n / N >> m / M


P is enriched (over
-
represented) in T


Statistical significance?

Target set

T

Background set + target set

B + T

N

M

P

Appeared in
n sequences

Appeared in m
sequences

Hypergeometric distribution


A box with M balls, of which N are
red, and the rest are blue.


If we
randomly

draw m balls from
the box


What’s the probability we’ll see n
red balls?


If P is very small, we are
probably not drawing randomly


Red

ball: target sequences


Blue

ball: background sequences



Total # of choices: (M choose m)


# of choices to have n red balls:
(N choose n) x (M
-
N choose m
-
n)

Cumulative hypergeometric test for
motif significance


We are interested in: if we
randomly pick m balls, how likely
that we’ll see
at least

n red balls?

This can be interpreted as the p
-
value for the null
hypothesis that we are randomly picking.

Alternative hypothesis: our selection favors red balls.

Equivalent: the target set T is enriched with motif P.

Or: P is over
-
represented in T.

Examples


Yeast genome has 6000 genes


Select 50 genes believed to be co
-
regulated by a common TF


Found a motif for these 50 genes


It appeared in 20 out of these 50 genes


In the whole genome, 100 genes have this motif


M = 6000, N = 50, m = 100+20 = 120, n = 20


Intuition:


m/M = 120/6000. In Genome, 1 out 50 genes have the motif


N = 50, would expect only 1 gene in the target set to have the motif


20
-
fold enrichment


P
-
value = 6 x 10
-
22


n = 5. 5
-
fold enrichment. P
-
value = 0.003


Normally a very low p
-
value is needed, e.g. 10
-
10

ROC curve for motif significance


Motif is usually a PWM


Any word will have a score


Typical scoring function: Log P(W | M) / P(W | B)


W: a word.


M: a PWM.


B: background model


To determine whether a sequence contains a
motif, a cutoff has to be decided


With different cutoffs, you get different number of
genes with the motif


Hyper
-
geometric test first assumes a cutoff


It may be better to look at a range of cutoffs

ROC curve for motif significance


With different score cutoff, will have different m and n


Assume you want to use P to classify T and B


Sensitivity: n / N


Specificity: (M
-
N
-
m+n) / (M
-
N)


False Positive Rate = 1


specificity: (m


n) / (M
-
N)


With decreasing cutoff, sensitivity

, FPR


Target set

T

Background set + target set

B + T

N

M

P

Appeared in
n

sequences

Appeared in
m

sequences

Given a score cutoff

ROC curve for motif significance

ROC
-
AUC: area under curve.

1: the best. 0.5: random.


Motif 1 is more enriched in motif 2.

1
-
specificity

sensitivity

Motif 1

Motif 2

Random

A good cutoff

Highest cutoff. No motif can pass the cutoff. Sensitivity = 0. specificity = 1.

Lowest cutoff. Every sequence
has the motif. Sensitivity = 1.
specificity = 0.

0

1

1

0

Other strategies


Cross
-
validation


Randomly divide sequences into 10 sets, hold 1 set
for test.


Do motif finding on 9 sets. Does the motif also appear
in the testing set?


Phylogenetic conservation information


Does a motif also appears in the homologous genes
of another species?


Strongest evidence


However, will not be able to find species
-
specific ones

Other strategies


Finding motif modules


Will two motifs always appear in the same gene?


Location preference


Some motifs appear to be in certain location


E.g., within 50
-
150bp upstream to transcription start


If a detect motif has strong positional bias, may be a sign of its
function


Evidence from other types of data sources


Do the genes having the motif always have similar activities
(gene expression levels) across different conditions?


Interact with the same set of proteins?


Similar functions?


etc.

To search for new instances


Usually many false positives


Score cutoff is critical


Can estimate a score cutoff from the “true”
binding sites


Motif finding

Scoring function

A set of scores for the “true” sites. Take mean
-

std as a cutoff.

(or a cutoff such that the majority of “true” sites can be predicted).

To search for new instances


Use other information, such as positional
biases of motifs to restrict the regions that
a motif may appear


Use gene expression data to help: the
genes having the true motif should have
similar activities


Phylogenetic conservation is the key

Final project


Write a review paper on a topic that we didn’t cover in
lectures

Or


Implement an algorithm and do some experiments


Compare several algorithms (existing implementation ok)


Combine several algorithms to form a pipeline (e.g. gene
expression + motif analysis)



Final:


5
-
10 pages report (single space, single column, 12pt) + 15
minutes presentation

Possible topics for term paper


Possible topics:


Haplotype inferencing


Computational challenges associated with new
microarray technologies


Phylogenetic footprinting


Small RNA gene / target prediction (siRNA, mRNA,
…)


Biomedical text mining


Protein structure prediction


Topology of biological networks

An example project


Given a gene expression data (say cell cycle)


Cluster genes using k
-
means


Find motifs using several algorithms


(Cluster and combine similar motifs)


Rank motifs according to their specificity to the
target sequences comparing to the other
clusters


Get their logos


Use the sequences to search the whole genome
for more genes with the motif


Do they have any functional significance?