PatternHunter II


PatternHunter II: Highly Sensitive and Fast Homology Search

Bioinformatics and Computational Molecular Biology (Fall 2005): Representation

R94922059
林語君

Ming Li, Bin Ma, Derek Kisman, John Tromp


Overview


Homology search
Local alignment algorithms
PatternHunter (PH I)
PatternHunter II (PH II)
Multiple spaced seeds
Computing hit probability
Finding a good seed set
PH II design
Performance


Local alignment


Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987)
  SSearch

FastA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)

BLAST (Altschul et al., 1990; Altschul et al., 1997)
  BLAST family: BLASTN, BLASTP, etc.
  MEGABLAST


PatternHunter


Seed
  Tradeoff: sensitivity <-> computation

Consecutive k letters
  k = 11 in Blastn, k = 28 in MegaBlast

Nonconsecutive k letters
  Spaced seed
  A spaced seed is called a model; the number k of required match positions is its weight
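
To make the seed definitions concrete, here is a minimal sketch of how a seed model can be tested against an ungapped alignment. It assumes the alignment is encoded as a 0/1 string (1 = identical letters, 0 = mismatch); the function names `seed_weight` and `seed_hits` are illustrative and are not taken from PatternHunter itself.

```python
def seed_weight(seed: str) -> int:
    """Weight of a seed model = number of required match positions ('1's)."""
    return seed.count("1")


def seed_hits(seed: str, region: str) -> bool:
    """Return True if the seed hits the 0/1 match string `region`.

    In `seed`, '1' marks a position that must match and '0' a don't-care
    position. In `region`, '1' is an identity and '0' a mismatch.
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    for start in range(len(region) - span + 1):
        if all(region[start + i] == "1" for i in required):
            return True
    return False


spaced = "111010010100110111"  # weight-11 spaced seed listed in these slides
consecutive = "1" * 11         # Blastn-style seed: 11 consecutive matches

# A region whose identities fall exactly on the spaced seed's '1' positions.
region = "111010010100110111"
print(seed_weight(spaced), seed_hits(spaced, region), seed_hits(consecutive, region))
# prints: 11 True False
```

Both seeds have weight 11, but only the spaced seed tolerates the mismatches in this particular region.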



PatternHunter II


Genome Informatics 14 (2003)


Extends the single optimized spaced seed of PH to multiple seeds

Speed: comparable to BLASTN (MEGABLAST)

Sensitivity: approaches Smith-Waterman (SSearch)


Definition


A homologous region, R

A seed hits R

A seed set A = {a_1, …, a_k} hits R if at least one seed in A hits R

Similarity
  R has p = x% identities

Sensitivity
  Hit probability
  Optimal alignment by dynamic programming (DP) has sensitivity 1
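
The notion of sensitivity as hit probability can be illustrated by simulation. The sketch below assumes each position of R is an identity independently with probability p, which is a modeling assumption made here; `estimate_hit_probability` and its parameters are hypothetical names for illustration only.

```python
import random


def seed_hits(seed: str, region: str) -> bool:
    """True if `seed` ('1' = must match) hits the 0/1 match string `region`."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    for start in range(len(region) - len(seed) + 1):
        if all(region[start + i] == "1" for i in required):
            return True
    return False


def estimate_hit_probability(seeds, length, p, trials=2000, rng_seed=0):
    """Monte Carlo estimate of the probability that a seed set hits R.

    R is sampled as `length` independent positions, each an identity ('1')
    with probability `p`. The set hits R if at least one seed hits it.
    """
    rng = random.Random(rng_seed)
    hits = 0
    for _ in range(trials):
        region = "".join("1" if rng.random() < p else "0" for _ in range(length))
        if any(seed_hits(s, region) for s in seeds):
            hits += 1
    return hits / trials


# Estimated sensitivity of a consecutive and a spaced weight-11 seed
# on regions of length 64 with 70% similarity.
for name, seed in [("consecutive", "1" * 11), ("spaced", "111010010100110111")]:
    print(name, estimate_hit_probability([seed], length=64, p=0.7, trials=500))
```

A simulation like this only approximates the hit probability; an exact value requires a computation over all possible regions.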


Computing Hit Probability


NP-hard for multiple seeds

Dynamic programming (DP) for 1 seed

Extend the DP to multiple seeds


Computing Hit Probability of Multiple Seeds

Let A = {a_1, …, a_k} be a set of k seeds and R a random region of length L with similarity level p.

Let f(i, b) be the probability that A hits R[0:i], given that the binary string b is a suffix of R[0:i].

Answer: f(L, ε), where ε is the empty string
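
As an illustration of the idea, here is a sketch of an exact dynamic program for the hit probability of a seed set. It is a simplified forward variant, not the recurrence from the slides: it tracks the probability of "no hit so far" for every recent suffix of R, and it assumes that positions of R are identities independently with probability p. The name `hit_probability` is hypothetical. The state space grows exponentially with the longest seed span, so this version is only practical for short seeds.

```python
from collections import defaultdict


def hit_probability(seeds, length, p):
    """Exact probability that at least one seed hits a random 0/1 region.

    State: the last (max_span - 1) bits of the prefix read so far, restricted
    to prefixes that no seed has hit yet. Mass that produces a hit is dropped,
    so the answer is 1 minus the mass remaining after `length` steps.
    """
    max_span = max(len(s) for s in seeds)
    models = [(len(s), [i for i, c in enumerate(s) if c == "1"]) for s in seeds]

    def ends_with_hit(window):
        # Does some seed match when aligned to the end of `window`?
        for span, ones in models:
            if len(window) >= span:
                offset = len(window) - span
                if all(window[offset + i] == "1" for i in ones):
                    return True
        return False

    no_hit = {"": 1.0}
    for _ in range(length):
        nxt = defaultdict(float)
        for suffix, prob in no_hit.items():
            for bit, bit_prob in (("1", p), ("0", 1.0 - p)):
                window = suffix + bit
                if not ends_with_hit(window):
                    key = window[-(max_span - 1):] if max_span > 1 else ""
                    nxt[key] += prob * bit_prob
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


# Small example: two short seeds on a region of length 20 at 70% similarity.
print(hit_probability(["1101", "111"], length=20, p=0.7))
```

Because every seed in the set is checked at each step, the same loop handles one seed or many; only the state size (driven by the longest seed) changes.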


Finding a Good Seed Set


NP-hard for both a single optimal seed and multiple seeds

Greedy algorithm




Finding a Good Seed Set


Compute the first seed a_1, which maximizes the hit probability of {a_1}

Compute the second seed a_2, which maximizes the hit probability of {a_1, a_2}

Repeat until either
  the desired number of seeds is reached, or
  the desired hit probability is reached
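
The greedy steps above can be sketched as follows. This is an illustration under stated assumptions, not the PH II implementation: candidates are enumerated exhaustively over a small weight and span, the hit probability comes from a simple exact DP that assumes independent positions with identity probability p, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hit_probability(seeds, length, p):
    """Exact hit probability of a seed set under an i.i.d. match model."""
    max_span = max(len(s) for s in seeds)
    models = [(len(s), [i for i, c in enumerate(s) if c == "1"]) for s in seeds]
    no_hit = {"": 1.0}  # recent suffix -> probability mass with no hit so far
    for _ in range(length):
        nxt = defaultdict(float)
        for suffix, prob in no_hit.items():
            for bit, bit_prob in (("1", p), ("0", 1.0 - p)):
                window = suffix + bit
                hit = any(
                    len(window) >= span
                    and all(window[len(window) - span + i] == "1" for i in ones)
                    for span, ones in models
                )
                if not hit:
                    key = window[-(max_span - 1):] if max_span > 1 else ""
                    nxt[key] += prob * bit_prob
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


def candidate_seeds(weight, max_span):
    """All seeds of the given weight (>= 2) that start and end with '1'."""
    seeds = []
    for span in range(weight, max_span + 1):
        for inner in combinations(range(1, span - 1), weight - 2):
            chars = ["0"] * span
            chars[0] = chars[-1] = "1"
            for i in inner:
                chars[i] = "1"
            seeds.append("".join(chars))
    return seeds


def greedy_seed_set(candidates, length, p, max_seeds, target=1.0):
    """Add one seed at a time, each maximizing the combined hit probability."""
    chosen, best_prob = [], 0.0
    while len(chosen) < max_seeds and best_prob < target:
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda c: hit_probability(chosen + [c], length, p))
        best_prob = hit_probability(chosen + [best], length, p)
        chosen.append(best)
    return chosen, best_prob


# Toy instance: weight-4 seeds with span up to 7, L = 16, similarity 70%.
seeds, prob = greedy_seed_set(candidate_seeds(4, 7), length=16, p=0.7, max_seeds=3)
print(seeds, prob)
```

The toy parameters keep the exhaustive candidate search fast; realistic weights and spans would need a much smaller candidate pool or a faster probability computation.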


Finding a Good Seed Set


The greedy choice may not optimize the combined hit probability

Good enough


Optimal
  16 weight-11 seeds, L = 64, similarity = 70%, first four seeds:
  {111010010100110111, 111100110010100001011, 110100001100010101111, 1110111010001111}

Greedy
  16 weight-12 seeds, L = 64, similarity = 70%, first four seeds:
  {111010010100110111, 1111000100010011010111, 1100110100101000110111, 1110100011110010001101}



Performance of the seeds


Curves from low to high
  Solid: weight 11, k = 1, 2, 4, 8, 16 seeds
  Dashed: 1 seed, weight = 10, 9, 8, 7


Performance of the seeds


Reducing the weight by 1
  Increases the expected number of hits by a factor of 4

Doubling the number of seeds
  Increases the expected number of hits by a factor of 2

Better: multiple seeds
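
The two factors can be checked with a small calculation. The sketch assumes uniformly random, independent DNA letters (an assumption not stated on the slide), so a weight-w seed matches at a given pair of positions with probability 4^-w, and it ignores edge effects. The function `expected_random_hits` and its parameters are hypothetical.

```python
def expected_random_hits(weight, num_seeds, query_len, db_len):
    """Rough expected number of chance seed hits between two random sequences.

    Each of `num_seeds` seeds of the given weight matches at one pair of
    positions with probability (1/4) ** weight under a uniform DNA model.
    """
    return num_seeds * query_len * db_len * 0.25 ** weight


base = expected_random_hits(weight=11, num_seeds=1, query_len=1000, db_len=10**6)
print(expected_random_hits(10, 1, 1000, 10**6) / base)  # weight reduced by 1
print(expected_random_hits(11, 2, 1000, 10**6) / base)  # number of seeds doubled
```

Under this model, sixteen weight-11 seeds produce the same expected number of chance hits as a single weight-9 seed, which is the cost side of the comparison drawn on the slide.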


PH II Performance


Compared with BLAST (Blastn) and Smith-Waterman (SSearch)

Sensitivity of SSearch = 1

Alignment score
  BLAST methods (hash, DP)
  match = 1, mismatch = -1, gap open = -5, gap extension = -1
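
To show how these scoring parameters combine, here is a small sketch that scores an already aligned pair of rows. It assumes the convention that a gap of length g costs gap open + g × gap extension; the slide does not state which gap convention was used, so treat that as an assumption. The function `alignment_score` is hypothetical.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two equal-length alignment rows, with '-' marking a gap.

    Assumed convention: every gap column costs `gap_extend`, and the first
    column of each gap additionally costs `gap_open`.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0
    current_gap = None  # which row the gap currently being extended is in
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            if current_gap != gap_row:
                score += gap_open
            score += gap_extend
            current_gap = gap_row
        else:
            score += match if x == y else mismatch
            current_gap = None
    return score


print(alignment_score("ACGT", "ACGT"))      # four matches
print(alignment_score("AC--GT", "ACTTGT"))  # four matches and one gap of length 2
```

With these values a single gap is expensive relative to a mismatch, so short alignments containing gaps can easily score below zero.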


PH II Performance


Curves from low to high
  Solid: PH II with 1, 2, 4, 8 seeds of weight 11
  Dashed: Blastn, seed weight 11


Complexity Proof


Finding optimal spaced seeds: NP-hard

Finding one optimal seed: NP-hard

Computing the hit probability of multiple seeds: NP-hard