CAP5510 - Bioinformatics - CISE - University of Florida


1

CAP5510


Bioinformatics

Database Searches for
Biological Sequences or
Imperfect Alignments

Tamer Kahveci

CISE Department

University of Florida

2

Goals


Understand how major heuristic methods for sequence comparison work

  FASTA

  BLAST

Understand how search results are evaluated

3

What is Database Search ?


Find a particular (usually short) sequence in a database of sequences (or in one huge sequence).

The problem is identical to local sequence alignment, but on a much larger scale.

We must also have some idea of the significance of a database hit.

A database search always returns some kind of hit. How much attention should be paid to the result?

A similar problem is the global alignment of two large sequences.

General idea: good alignments contain high-scoring regions.

4

Imperfect Alignment


What is an imperfect alignment?


Why imperfect alignment?



The result may not be optimal.


Finding the optimal alignment is usually too costly in terms of time and memory.

5

Database Search Methods


Hash table based methods


FASTA family


FASTP, FASTA, TFASTA, FASTAX, FASTAY


BLAST family


BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ,
MegaBLAST, PsiBLAST, PhiBLAST


Others


FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS


Suffix tree based methods


Mummer, AVID, Reputer, MGA, QUASAR

6

History of sequence searching


1970: NW (Needleman-Wunsch)

1980: SW (Smith-Waterman)

1985: FASTA

1990: BLAST

7

Hash Table

8

Hash Table


K-gram = subsequence of length K

A^K entries, where A is the alphabet size

Linear-time construction

Constant lookup time
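The k-gram hash table can be sketched in a few lines. This is a minimal illustration, not the data structure of any particular tool: the function name `build_kgram_index` and the example sequence are assumptions, and a Python dictionary stores only the k-grams that actually occur rather than all A^K possible entries.

```python
from collections import defaultdict

def build_kgram_index(sequence, k):
    """Map every k-gram of `sequence` to the 0-based positions where it starts."""
    index = defaultdict(list)
    # One pass over the sequence: linear-time construction.
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

index = build_kgram_index("MCGPFILGTYC", 3)
print(index["MCG"])  # [0]
print(index["CGP"])  # [1]
```

Looking up a k-gram is a single dictionary access, which matches the constant lookup time stated above (on average, for a hash-based dictionary).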

9

FASTP

Lipman & Pearson, 1985

10

FASTP


Three-phase algorithm

1. Find short good matches using k-grams (K = 1 or 2)

2. Find start and end positions for good matches

3. Use DP to align good matches

11

FASTP: Phase 1 (1)

position    1  2  3  4  5  6  7  8  9 10 11
protein 1   n  c  s  p  t  a  .  .  .  .  .
protein 2   .  .  .  .  .  a  c  s  p  r  k

amino acid   position in   position in   offset
             protein A     protein B     (pos A - pos B)
-----------------------------------------------------
a            6             6              0
c            2             7             -5
k            -             11
n            1             -
p            4             9             -5
r            -             10
s            3             8             -5
t            5             -
-----------------------------------------------------

Note the common offset (-5) for the 3 amino acids c, s and p.

A possible alignment can be found quickly:

protein 1   n c s p t a
              | | |
protein 2   a c s p r k
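The offset table for this example can be reproduced with a short sketch. It is only an illustration of the idea, assuming 1-based positions, K = 1, and a "." padding character that is skipped; the function name `offset_histogram` is hypothetical and this is not the original FASTP code.

```python
from collections import Counter, defaultdict

def offset_histogram(seq_a, seq_b, k=1, pad="."):
    """Count k-gram matches per offset (position in A minus position in B)."""
    positions_a = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        word = seq_a[i:i + k]
        if pad not in word:
            positions_a[word].append(i + 1)  # 1-based position in A
    counts = Counter()
    for j in range(len(seq_b) - k + 1):
        word = seq_b[j:j + k]
        if pad in word:
            continue
        for pos_a in positions_a.get(word, []):
            counts[pos_a - (j + 1)] += 1  # offset = pos A - pos B
    return counts

counts = offset_histogram("ncspta.....", ".....acsprk")
print(counts[-5])  # 3  (c, s and p share offset -5)
print(counts[0])   # 1  (a)
```

The offset with the largest count is the candidate diagonal for the alignment.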

12

FASTP: Phase 1 (2)


Similar to a dot plot

Offsets range from 1 - m to n - 1

Each offset is scored as (# matches) - (# mismatches)

Diagonals (offsets) with a large score show local similarities

How does it depend on k?

13

FASTP: Phase 2


The 5 best diagonal runs are found

Rescore these 5 regions using PAM250 to obtain the initial score

Indels are not considered yet

14

FASTP: Phase 3


Sort the aligned regions in descending order of score

Optimize these alignments using Needleman-Wunsch

Report the results

15

FASTP - Discussion

Results are not optimal. Why?

How does performance compare to Smith-Waterman?

What is the impact of k?

How does this idea work for DNA?

  K = 4 or 6 for DNA

16

FASTA


Improvement Over
FASTP

Pearson 1995

17

FASTA (1)


Phase 2: Choose 10 best diagonal runs instead of 5

18

FASTA (2)


Phase 2.5


Eliminate diagonals that score less than a given threshold.

Combine nearby matches into longer matches. Each join incurs a penalty similar to a gap penalty.

19

FASTA Variations


TFASTAX and TFASTAY: protein query against a DNA library translated in all reading frames

FASTAX, FASTAY: DNA query translated in all reading frames against a protein database

20

BLAST

Altschul, Gish, Miller, Myers,
Lipman, 1990

21

BLAST (or BLASTP)


BLAST: Basic Local Alignment Search Tool

An approximation of Smith-Waterman


Designed for database searches


Short query sequence against long database
sequence or a database of many sequences


Sacrifices search sensitivity for speed

22

BLAST Algorithm (1)


Eliminate low-complexity regions from the query sequence.

  Replace them with X (protein) or N (DNA)

Build a hash table on the query sequence.

  K = 3 for proteins

Example: the query MCGPFILGTYC yields the 3-grams MCG, CGP, ...

23

BLAST Algorithm (2)


For each k-gram, find all k-grams that align with it with a score of at least the cutoff T using BLOSUM62

  20^k candidates

  ~50 on average per k-gram

  ~50n for the entire query

Build the hash table

Example: the query PQGMCGPFILGTYC yields the 3-grams PQG, QGM, ...

Neighbors of PQG with T = 13:

word   score
PQG    18
PEG    15
PRG    14
PSG    13
PQA    12   (below T, rejected)
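The neighbor scores for PQG can be checked with a small sketch. It assumes only the handful of BLOSUM62 entries needed for these five words (a real implementation would load the full matrix and enumerate all 20^k candidates), and the names `BLOSUM62_SUBSET`, `word_score` and `neighbors` are hypothetical.

```python
# Partial BLOSUM62: only the residue pairs needed for this example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "S"): 0, ("G", "A"): 0,
}

def pair_score(a, b, matrix):
    """Look up a score in a symmetric substitution matrix stored one way round."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]

def word_score(word, candidate, matrix):
    """Sum of position-wise substitution scores (no gaps)."""
    return sum(pair_score(a, b, matrix) for a, b in zip(word, candidate))

def neighbors(word, candidates, matrix, threshold):
    """Keep the candidates that score at least `threshold` against `word`."""
    return [c for c in candidates if word_score(word, c, matrix) >= threshold]

candidates = ["PQG", "PEG", "PRG", "PSG", "PQA"]
print([word_score("PQG", c, BLOSUM62_SUBSET) for c in candidates])
# [18, 15, 14, 13, 12]
print(neighbors("PQG", candidates, BLOSUM62_SUBSET, 13))
# ['PQG', 'PEG', 'PRG', 'PSG']
```

Lowering the threshold to 12 would also admit PQA, which is the trade-off between sensitivity and the size of the hash table.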

24

BLAST Algorithm (3)


Sequentially scan the database and locate each k-gram in the hash table

Each match is a seed for an ungapped alignment.
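The scanning step can be sketched as follows. This is a simplified illustration that uses exact k-gram matches only (the neighborhood words are left out), and both the function name `find_seeds` and the toy database string are assumptions made for the example.

```python
from collections import defaultdict

def find_seeds(query, database, k):
    """Return (query_pos, db_pos) pairs where a query k-gram occurs in the database."""
    # Hash table on the query: k-gram -> list of 0-based start positions.
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    seeds = []
    # Single sequential scan over the database.
    for j in range(len(database) - k + 1):
        for i in table.get(database[j:j + k], []):
            seeds.append((i, j))
    return seeds

seeds = find_seeds("MCGPFILGTYC", "AAMCGPWWILGTAA", 3)
print(seeds)  # [(0, 2), (1, 3), (5, 8), (6, 9)]
```

Seeds with the same difference `db_pos - query_pos` lie on the same diagonal, so here the first two seeds share one diagonal and the last two share another.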


25

BLAST Algorithm (4)


HSP (High-scoring Segment Pair) = a match between a query word and the database

Find a “hit”: two non-overlapping HSPs on a diagonal within distance A

Extend the hit until the score drops more than a threshold value, X, below the best score seen so far
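The role of the extension cutoff X can be illustrated with a sketch of ungapped extension in one direction. It assumes a simple +1/-1 match/mismatch score instead of a substitution matrix, and it stops once the running score falls more than X below the best score seen so far; the function name `extend_right` and its parameters are hypothetical.

```python
def extend_right(query, database, q_start, d_start, x_drop, match=1, mismatch=-1):
    """Ungapped extension to the right; returns (best_score, length_of_best_extension)."""
    score = best = 0
    best_len = 0
    length = 0
    i, j = q_start, d_start
    while i < len(query) and j < len(database):
        score += match if query[i] == database[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # the score dropped too far below the best; give up
        i += 1
        j += 1
    return best, best_len

# A single mismatch in the middle of two matching runs.
print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, x_drop=2))  # (7, 9)
print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, x_drop=0))  # (4, 4)
```

With the larger X the extension survives the mismatch and reaches the second matching run; with X = 0 it stops at the first dip, at the cost of a shorter alignment but less work.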


26

BLAST Algorithm (5)


Keep only the extended matches that have a score of at least S.

Determine the statistical significance of the result

27

What is Statistical Significance?

13 : 15

13 : 15


Two one-on-one games, two scores.

Which result is more significant?

Expected: may be a random result.

Unexpected: significant, may carry real meaning.

28

Statistical Significance


E-value: the expected number of matches with score at least S

  $E = K m n e^{-\lambda S}$

  m, n: sequence lengths

  S: alignment score

  K, $\lambda$: normalization parameters

P-value: the probability of having at least one match with score at least S

  $P = 1 - e^{-E}$

The smaller these values are, the more significant the result

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
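The two formulas translate directly into code. The sketch below is only a numerical illustration: the values chosen for K and lambda are arbitrary placeholders, not parameters of any real scoring system, and the function names are hypothetical.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of matches with score at least `score`: E = K*m*n*exp(-lambda*S)."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one such match: P = 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)

# Placeholder parameters, chosen only for illustration.
K, lam = 0.1, 0.3
for s in (20, 40, 60):
    e = e_value(s, m=300, n=1_000_000, K=K, lam=lam)
    print(s, e, p_value(e))
```

For small E the two values are nearly equal, since 1 - e^{-E} is approximately E; as the score S grows, both shrink exponentially.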


29

BLAST - Analysis

K (k-gram)

  Lower: more sensitive. Slower.

T (neighbor cutoff)

  Lower: finds distant neighbors. Introduces noise.

X (extension cutoff)

  Higher: lower chance of getting stuck in a local minimum. Slower.


30

Sample Query


http://www.ncbi.nlm.nih.gov/BLAST/



I D R A M S A A R G V F E R G D W S L S S P A K R K A V L N K L A D L M E A H
A E E L A L L E T L D T G K P I R H S L R D D I P G A A R A I R W Y A E A I D K
V Y G E V A T T S S H E L A M I V R E P V G V I A A I V P W N F P L L L T C W
K L G P A L A A G N S V I L K P S E K S P L S A I R L A G L A K E A G L P D G V
L N V V T G F G H E A G Q A L S R H N D I D A I A F T G S T R T G K Q L L K D
A G D S N M K R V W L E A G G K S A N I V F A D C P D L Q Q A A S A T A A G
I F Y N Q G Q V C I A G T R L L L E E S I A D E F L A L L K Q Q A Q N W Q P
G H P L D P A T T M G T L I D C A H A D S V H S F I R E G E S K G Q L L L D G
R N A G L A A A I G P T I F V D V D P N A S L S R E E I F G P V L V V T R F T S
E E Q A L Q L A N D S Q Y G L G A A V W T R D L S R A H R M S R R L K A G S
V F V N N Y N D G D M T V P F G G Y K Q S G N G R D K S L H A L E K F T E L K
T I W I

Dhal_ecoli

31

BLASTN


BLAST for nucleic acids


K = 11


Exact match instead of neighborhood
search.

32

BLAST Variations

Program    Query          Target         Type
BLASTP     Protein        Protein        Gapped
BLASTN     Nucleic acid   Nucleic acid   Gapped
BLASTX     Nucleic acid   Protein        Gapped
TBLASTN    Protein        Nucleic acid   Gapped
TBLASTX    Nucleic acid   Nucleic acid   Ungapped

33

Even More Variations


PsiBLAST (iterative)


BLAT, BLASTZ, MegaBLAST


FLASH, PatternHunter, SSAHA, SENSEI,
WABA, GLASS



Main differences are


Seed choice (k, gapped seeds)


Additional data structures

34

Suffix Trees

35

Suffix Tree


Tree structure that contains all suffixes of the input sequence



TGAGTGCGA


GAGTGCGA


AGTGCGA


GTGCGA


TGCGA


GCGA


CGA


GA


A

36

Suffix Tree Example

37


Suffix Tree Analysis

O(n) space and construction time

  10n to 70n space usage reported

O(m) search time for an m-letter sequence

Good for

  Small data

  Exact matches

38

Suffix Array


5 bytes per letter


O(m log n) search time



Better space usage


Slower search
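A suffix array and its binary search can be sketched as follows. The construction here is the naive sort-all-suffixes approach, chosen for brevity rather than efficiency, and the function names are hypothetical; only the search loop reflects the O(m log n) bound above.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Binary search for the block of suffixes that start with `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-letter prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-letter prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])

text = "TGAGTGCGA"
sa = build_suffix_array(text)
print(sa)                                  # [8, 2, 6, 7, 1, 5, 3, 0, 4]
print(find_occurrences(text, sa, "TG"))    # [0, 4]
print(find_occurrences(text, sa, "GA"))    # [1, 7]
```

Each binary search step compares at most m letters, and there are O(log n) steps, which gives the O(m log n) search time. The array itself stores one integer per letter, which is where the space saving over a suffix tree comes from.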

39

Mummer

40

Other Sequence Comparison
Tools


Reputer, MGA, AVID


QUASAR (suffix array)