# CAP5510 - Bioinformatics - CISE - University of Florida

Biotechnology

2 Oct 2013

1

CAP5510

Bioinformatics

Database Searches for Biological Sequences or Imperfect Alignments

Tamer Kahveci

CISE Department

University of Florida

2

Goals

Understand how major heuristic methods for sequence comparison work

FASTA

BLAST

Understand how search results are evaluated

3

What is Database Search?

Find a particular (usually short) sequence in a database of sequences (or in one huge sequence).

The problem is identical to local sequence alignment, but on a much larger scale.

We must also have some idea of the significance of a database hit.

A database search always returns some kind of hit, so how much attention should be paid to the result?

A similar problem is the global alignment of two large sequences.

General idea: good alignments contain high-scoring regions.

4

Imperfect Alignment

What is an imperfect alignment?

The result may not be optimal.

Why imperfect alignment?

Finding an optimal alignment is usually too costly in terms of time and memory.

5

Database Search Methods

Hash table based methods

FASTA family

FASTP, FASTA, TFASTA, FASTX, FASTY

BLAST family

BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ, MegaBLAST, PsiBLAST, PhiBLAST

Others

FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS

Suffix tree based methods

MUMmer, AVID, REPuter, MGA, QUASAR

6

History of sequence searching

1970: NW (Needleman-Wunsch)

1981: SW (Smith-Waterman)

1985: FASTP (Lipman & Pearson), the first of the FASTA family

1990: BLAST

7

Hash Table

8

Hash Table

k-gram = subsequence of length k

$A^k$ entries, where A is the alphabet size

Linear time construction

Constant lookup time
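As a minimal sketch of the structure described above, the following Python builds a k-gram table with a dictionary. The function name `build_kgram_table` and the example sequence are illustrative assumptions, not part of the lecture. A dictionary stores only the k-grams that actually occur, rather than all $A^k$ possible entries.

```python
from collections import defaultdict

def build_kgram_table(sequence, k):
    """Map every k-gram of `sequence` to the 0-based positions where it starts."""
    table = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        table[sequence[i:i + k]].append(i)
    return table

# Hypothetical example: index a short DNA string with k = 2.
table = build_kgram_table("TGAGTGCGA", 2)
print(table.get("GA"))  # start positions of "GA"
print(table.get("TG"))  # start positions of "TG"
print(table.get("CC"))  # None: "CC" never occurs
```

For a fixed small k, one pass over the sequence builds the table, and each lookup is a single dictionary access on average.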

9

FASTP

Lipman & Pearson, 1985

10

FASTP

Three-phase algorithm

1. Find short good matches using k-grams (k = 1 or 2)

2. Find start and end positions for good matches

3. Use DP to align good matches

11

FASTP: Phase 1 (1)

position    1  2  3  4  5  6  7  8  9 10 11
protein 1   n  c  s  p  t  a  .  .  .  .  .
protein 2   .  .  .  .  .  a  c  s  p  r  k

amino acid   position in protein A   position in protein B   offset (pos A - pos B)
-----------------------------------------------------------------------------------
a            6                       6                        0
c            2                       7                       -5
k            -                       11
n            1                       -
p            4                       9                       -5
r            -                       10
s            3                       8                       -5
t            5                       -
-----------------------------------------------------------------------------------

Note the common offset for the 3 amino acids c, s and p

A possible alignment can be quickly found:

protein 1   n  c  s  p  t  a
               |  |  |
protein 2   a  c  s  p  r  k
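The offset table for this example can be reproduced with a short Python sketch, assuming k = 1 and 1-based positions as on the slide. The helper name `shared_offsets` is a hypothetical choice, and the dots are treated as padding rather than residues.

```python
from collections import defaultdict

def shared_offsets(seq_a, seq_b):
    """Group residues shared by both sequences by offset (pos A - pos B), k = 1."""
    positions_b = defaultdict(list)
    for j, residue in enumerate(seq_b, start=1):
        if residue != ".":                    # "." is padding, not a residue
            positions_b[residue].append(j)
    offsets = defaultdict(list)
    for i, residue in enumerate(seq_a, start=1):
        if residue == ".":
            continue
        for j in positions_b.get(residue, []):
            offsets[i - j].append(residue)
    return dict(offsets)

protein_1 = "ncspta....."
protein_2 = ".....acsprk"
offsets = shared_offsets(protein_1, protein_2)
best_offset = max(offsets, key=lambda d: len(offsets[d]))
print(offsets)       # residues grouped by offset
print(best_offset)   # the offset shared by the most residues
```

The offset that collects the most residues is the diagonal worth examining further.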

12

FASTP: Phase 1 (2)

Similar to dot plot

Offsets range from 1 - m to n - 1

Each offset is scored as (# matches) - (# mismatches)

Diagonals (offsets) with large score show local similarities

How does it depend on k?
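The diagonal scoring rule can be illustrated with a brute-force sketch that visits every cell of the dot plot. This is an assumption made for clarity: it takes k = 1 and does O(nm) work, whereas FASTP uses the k-gram table to touch only matching positions. The function name and the example sequences are hypothetical.

```python
def diagonal_scores(seq_a, seq_b):
    """Score every offset d = i - j as (# matches) - (# mismatches), 0-based."""
    n, m = len(seq_a), len(seq_b)
    scores = {}
    for d in range(1 - m, n):                 # offsets 1 - m ... n - 1
        matches = mismatches = 0
        # i runs over the part of seq_a that overlaps seq_b on this diagonal
        for i in range(max(0, d), min(n, m + d)):
            if seq_a[i] == seq_b[i - d]:
                matches += 1
            else:
                mismatches += 1
        scores[d] = matches - mismatches
    return scores

scores = diagonal_scores("ACGTACGT", "CGTA")
best = max(scores, key=scores.get)
print(best, scores[best])   # highest-scoring diagonal and its score
```

Here the second sequence occurs in the first starting at index 1, so diagonal 1 collects four matches and no mismatches.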

13

FASTP: Phase 2

The 5 best diagonal runs are found

Rescore these 5 regions using PAM250 to obtain the initial score

Indels are not considered yet

14

FASTP: Phase 3

Sort the aligned regions in descending order of score

Optimize these alignments using Needleman-Wunsch

Report the results
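For reference, a minimal Needleman-Wunsch scoring sketch is given below. It assumes unit match and mismatch scores and a linear gap penalty, which are illustrative choices. FASTP itself scores residues with PAM250 and restricts the computation to the region around the selected diagonals.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (linear gap penalty)."""
    # prev[j] is the best score for aligning the prefix of a seen so far with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                      # a[:i] aligned against only gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,         # align a[i-1] with b[j-1]
                            prev[j] + gap,    # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]

print(needleman_wunsch_score("ACGT", "AGT"))
print(needleman_wunsch_score("ncspta", "acsprk"))
```

Only two rows of the table are kept because the sketch returns the score alone. Recovering the alignment itself would require the full table or a traceback structure.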

15

FASTP - Discussion

Results are not optimal. Why?

How does performance compare to Smith-Waterman?

What is the impact of k?

How does this idea work for DNA?

k = 4 or 6 for DNA

16

FASTA

Improvement over FASTP

Pearson 1995

17

FASTA (1)

Phase 2: Choose 10 best diagonal runs instead of 5

18

FASTA (2)

Phase 2.5

Eliminate diagonals that score less than some given threshold.

Combine matches to find longer matches. Each join incurs a join penalty, similar to a gap penalty.

19

FASTA Variations

TFASTX and TFASTY: protein query against a DNA library in all reading frames

FASTX, FASTY: DNA query in all reading frames against a protein database
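To make "all reading frames" concrete, the sketch below translates a DNA string in the three forward and three reverse-complement frames using the standard genetic code. It only enumerates frames and is not a description of how the FASTA programs perform translated alignment. The function names and the frame labels such as `+1` and `-1` are assumptions made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each translated frame can then be compared against protein sequences with the same machinery used for protein-protein searches.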

20

BLAST

Altschul, Gish, Miller, Myers, Lipman, 1990

21

BLAST (or BLASTP)

BLAST: Basic Local Alignment Search Tool

An approximation of Smith-Waterman

Designed for database searches

Short query sequence against long database sequence or a database of many sequences

Sacrifices search sensitivity for speed

22

BLAST Algorithm (1)

Eliminate low complexity regions from the query sequence.

Replace them with X (protein) or N (DNA)

Hash table on query sequence.

k = 3 for proteins

Example: the query MCGPFILGTYC yields the k-grams MCG, CGP, ...

23

BLAST Algorithm (2)

For each k-gram, find all k-grams that align with score at least cutoff T using BLOSUM62

$20^k$ candidates

~50 on the average per k-gram

~50n for the entire query

Build hash table

Example: the query PQGMCGPFILGTYC yields the k-grams PQG, QGM, ...

Neighbors of PQG, with T = 13:

word   score
------------
PQG    18
PEG    15
PRG    14
PSG    13
PQA    12   (below T, not kept)
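The neighborhood search for the word PQG can be sketched as follows. Only the three BLOSUM62 rows needed for this word are included. They were transcribed by hand for this example and should be checked against an authoritative copy of the matrix before reuse. The function names are hypothetical, and the brute-force enumeration of all $20^k$ candidates is for illustration only.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# Partial BLOSUM62: rows for P, Q and G only, columns in AMINO_ACIDS order.
# Hand-transcribed for this sketch; verify before relying on the values.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}

def word_score(query_word, candidate):
    """Ungapped score of candidate against query_word (query letters: P, Q, G)."""
    return sum(BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
               for q, c in zip(query_word, candidate))

def neighborhood(query_word, threshold):
    """All words of the same length scoring at least `threshold`."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits

neighbors = neighborhood("PQG", 13)
for word in ("PQG", "PEG", "PRG", "PSG", "PQA"):
    print(word, word_score("PQG", word), word in neighbors)
```

Every word kept in the neighborhood is entered into the hash table under the query position of PQG, so an exact lookup during the database scan also finds near matches.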

24

BLAST Algorithm (3)

Sequentially scan the database and locate each k-gram in the hash table

Each match is a seed for an ungapped alignment.

25

BLAST Algorithm (4)

HSP (High Scoring Pair) = a match between a query word and the database

Find a "hit": two non-overlapping HSPs on a diagonal within distance A

Extend the hit until the score falls more than a threshold value, X, below the best score seen so far
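The extension step can be sketched as an ungapped extension with a drop-off rule, where X is read as the largest allowed drop below the best score seen so far. The sketch extends a single exact seed rather than a two-word hit and uses unit match and mismatch scores, both simplifying assumptions. All names are hypothetical.

```python
def _best_extension(pair_scores, x_drop):
    """Best cumulative gain and its length; stop once the running total
    falls more than x_drop below the best total seen so far."""
    best = running = 0
    best_len = 0
    for length, s in enumerate(pair_scores, start=1):
        running += s
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break
    return best, best_len

def extend_seed(query, subject, q_pos, s_pos, k, x_drop, match=1, mismatch=-1):
    """Extend a length-k seed left and right without gaps.
    Returns (score, query_start, query_end) with query_end exclusive."""
    def pair(a, b):
        return match if a == b else mismatch

    seed = sum(pair(query[q_pos + i], subject[s_pos + i]) for i in range(k))
    right = (pair(a, b) for a, b in zip(query[q_pos + k:], subject[s_pos + k:]))
    left = (pair(a, b) for a, b in zip(reversed(query[:q_pos]),
                                       reversed(subject[:s_pos])))
    right_gain, right_len = _best_extension(right, x_drop)
    left_gain, left_len = _best_extension(left, x_drop)
    return seed + left_gain + right_gain, q_pos - left_len, q_pos + k + right_len

query = "GGGACGTACGTTT"
subject = "CCCACGTTCGTAAA"
print(extend_seed(query, subject, 3, 3, 4, x_drop=2))
print(extend_seed(query, subject, 3, 3, 4, x_drop=0))
```

With a larger X the extension survives the single mismatch after the seed and reaches the matches beyond it. With X = 0 it stops immediately and returns the seed alone.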

26

BLAST Algorithm (5)

Keep only the extended matches that have a score of at least S.

Determine the statistical significance of the result

27

What is Statistical Significance?

Two one-on-one games, two scores: 13 : 15 and 13 : 15.

Which result is more significant?

Expected: maybe a random result.

Unexpected: significant, may be meaningful.

28

Statistical Significance

E-value: the expected number of matches with score at least S

$E = K m n e^{-\lambda S}$

m, n: sequence lengths

S: alignment score

K, $\lambda$: normalization parameters

P-value: the probability of having at least one match with score at least S

$P = 1 - e^{-E}$

The smaller these values are, the more significant the result

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
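The two formulas translate directly into code. In the sketch below the values of K and lambda, the sequence lengths and the scores are placeholders chosen for illustration. In practice K and lambda depend on the scoring system and are supplied by the search program.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of matches with score >= `score`: K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one such match: 1 - exp(-e)."""
    return -math.expm1(-e)    # expm1 keeps precision when e is tiny

# Placeholder parameters, for illustration only.
K, lam = 0.13, 0.32
for score in (30, 50, 70):
    e = e_value(score, m=250, n=1_000_000, K=K, lam=lam)
    print(score, e, p_value(e))
```

Because the score sits in the exponent, each additional point of score divides E by a constant factor. For small E, the P-value and the E-value are nearly equal.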

29

BLAST - Analysis

K (k-gram length)

Lower: more sensitive. Slower.

T (neighbor cutoff)

Lower: finds distant neighbors. Introduces noise.

X (extension cutoff)

Higher: lower chance of getting stuck in a local minimum. Slower.

30

Sample Query

http://www.ncbi.nlm.nih.gov/BLAST/

I D R A M S A A R G V F E R G D W S L S S P A K R K A V L N K L A D L M E A H
A E E L A L L E T L D T G K P I R H S L R D D I P G A A R A I R W Y A E A I D K
V Y G E V A T T S S H E L A M I V R E P V G V I A A I V P W N F P L L L T C W
K L G P A L A A G N S V I L K P S E K S P L S A I R L A G L A K E A G L P D G V
L N V V T G F G H E A G Q A L S R H N D I D A I A F T G S T R T G K Q L L K D
A G D S N M K R V W L E A G G K S A N I V F A D C P D L Q Q A A S A T A A G
I F Y N Q G Q V C I A G T R L L L E E S I A D E F L A L L K Q Q A Q N W Q P
G H P L D P A T T M G T L I D C A H A D S V H S F I R E G E S K G Q L L L D G
R N A G L A A A I G P T I F V D V D P N A S L S R E E I F G P V L V V T R F T S
E E Q A L Q L A N D S Q Y G L G A A V W T R D L S R A H R M S R R L K A G S
V F V N N Y N D G D M T V P F G G Y K Q S G N G R D K S L H A L E K F T E L K
T I W I

Dhal_ecoli

31

BLASTN

BLAST for nucleic acids

K = 11

Exact match instead of neighborhood search.

32

BLAST Variations

Program    Query          Target         Type
--------------------------------------------------
BLASTP     Protein        Protein        Gapped
BLASTN     Nucleic acid   Nucleic acid   Gapped
BLASTX     Nucleic acid   Protein        Gapped
TBLASTN    Protein        Nucleic acid   Gapped
TBLASTX    Nucleic acid   Nucleic acid   Gapped

33

Even More Variations

PsiBLAST (iterative)

BLAT, BLASTZ, MegaBLAST

FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS

Main differences are

Seed choice (k, gapped seeds)

34

Suffix Trees

35

Suffix Tree

Tree structure that contains all suffixes of the input sequence

TGAGTGCGA

GAGTGCGA

AGTGCGA

GTGCGA

TGCGA

GCGA

CGA

GA

A

36

Suffix Tree Example

37

Suffix Tree Analysis

O(n) space and construction time

10n to 70n space usage reported

O(m) search time for m-letter sequence

Good for

Small data

Exact matches
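The O(m) exact-match search can be illustrated with an uncompressed suffix trie built from nested dictionaries. This is a deliberate simplification: the naive construction below takes O(n^2) time and space, whereas a true suffix tree compresses single-child paths and can be built in O(n). The function names are assumptions made for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a trie of nested dicts (naive, O(n^2))."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """True if `pattern` occurs in the indexed text; one step per pattern letter."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("TGAGTGCGA")
print(contains(trie, "GTGC"))   # substring of the text
print(contains(trie, "GAGA"))   # not a substring
```

The search cost depends only on the pattern length, since every substring of the text is a prefix of some suffix.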

38

Suffix Array

5 bytes per letter

O(m log n) search time

Better space usage

Slower search
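A suffix array search can be sketched with two binary searches over the sorted suffix start positions. The construction shown is the naive sort of all suffixes, assumed here for brevity, and is far slower than the specialized construction algorithms used in practice. Function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """All start positions of `pattern`, using O(log n) comparisons of m letters."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    lo, hi = 0, len(suffix_array)
    while lo < hi:                        # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                        # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "TGAGTGCGA"
suffix_array = build_suffix_array(text)
print(suffix_array)
print(find_occurrences(text, suffix_array, "GA"))
```

All suffixes that begin with the pattern are adjacent in sorted order, so the two searches delimit a contiguous block of the array.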

39

MUMmer

40

Other Sequence Comparison Tools

REPuter, MGA, AVID

QUASAR (suffix array)