Graph Algorithms (pptx)

Oct 2, 2013

Graph Algorithms 8.6-8.10

CS 6030 Bioinformatics

Summer II 2012

Jason Eric Johnson

Sequencing by Hybridization


DNA Array gives all strings of length l



How do we find the order?



Spectrum(s, l)

String s of length n

Spectrum is the multiset of n - l + 1 l-mers in s

Sequencing by Hybridization


s = TATGGTGC

l = 3

Spectrum(s, l) = {TAT, ATG, TGG, GGT, GTG, TGC}



Problem:


Input: Set S of all l-mers from s

Output: String s such that Spectrum(s, l) = S
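The definition of Spectrum(s, l) can be illustrated with a minimal Python sketch. The function name `spectrum` is chosen here for illustration; it returns a list so that repeated l-mers are kept, matching the multiset definition.

```python
def spectrum(s, l):
    """Return the l-mers of s in left-to-right order.

    A list is returned (not a set) so that repeated l-mers are preserved,
    as in the multiset definition of Spectrum(s, l).
    """
    return [s[i:i + l] for i in range(len(s) - l + 1)]


s = "TATGGTGC"
l = 3
print(spectrum(s, l))        # the n - l + 1 = 6 three-mers of s
print(len(spectrum(s, l)))   # n - l + 1
```

The SBH problem asks for the inverse of this function: given only the l-mers (in no particular order), recover a string that produces them.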

Hybridization on DNA Array

Sequencing by Hybridization


Special case of Shortest Superstring Problem



SBH is linear-time

SSP (NP-Complete) is more general


In SSP, no guaranteed overlap


In SBH, we know the length of the target
sequence

Sequencing by Hybridization


There is a problem with DNA Arrays



No good way to distinguish a match from a
highly stable mismatch


Mismatch could give strong hybridization signal


Need longer probes to deal with mutations

SBH: Hamiltonian Path Approach


Two l-mers p and q overlap if overlap(p, q) = l - 1

Last l - 1 letters of p are same as first l - 1 of q

Make each l-mer in Spectrum(s, l) a node

Construct a directed graph that connects every overlapping pair p and q with a directed edge from p to q

1 to 1 correspondence between paths that
visit each vertex exactly once (Hamiltonian
Paths) and DNA fragments with Spectrum(s, l)
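The graph construction described above can be sketched in a few lines of Python. The function name `overlap_graph` and the adjacency-dictionary representation are illustrative choices, and the sketch assumes the l-mers are distinct.

```python
def overlap_graph(lmers):
    """Build the overlap graph as an adjacency dictionary.

    There is a directed edge p -> q whenever the last l-1 letters of p
    equal the first l-1 letters of q. Assumes the l-mers are distinct.
    """
    return {p: [q for q in lmers if p != q and p[1:] == q[:-1]]
            for p in lmers}


S = ["ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"]
for p, successors in overlap_graph(S).items():
    print(p, "->", successors)
```

Comparing every pair of l-mers takes quadratic time, which is acceptable for a small illustration; a prefix index would be the natural choice for larger spectra.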

SBH: Hamiltonian Path Approach


S

= { ATG AGG TGC TCC GTC GGT GCA CAG }



Path visited every VERTEX once

[Figure: overlap graph H on the eight l-mers; the Hamiltonian path ATG, TGC, GCA, CAG, AGG, GGT, GTC, TCC spells ATGCAGGTCC]

SBH: Hamiltonian Path Approach


A more complicated graph:





S = { ATG TGG TGC GTG GGC GCA GCG CGT }


SBH: Hamiltonian Path Approach


S = { ATG TGG TGC GTG GGC GCA GCG CGT }


Path 1: ATGCGTGGCA

Path 2: ATGGCGTGCA

SBH: Hamiltonian Path Approach


Problem is that there is no known efficient algorithm

As the overlap graph gets larger, this is not a
useful technique since the Hamiltonian Path
problem is NP-Complete
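To make the cost concrete, here is a small backtracking sketch that enumerates every Hamiltonian path in the overlap graph. It is an illustrative brute-force search with exponential worst-case running time, not an efficient method; the names `hamiltonian_paths` and `spell` are chosen for this example, and the l-mers are assumed to be distinct.

```python
def hamiltonian_paths(lmers):
    """Yield every ordering of the l-mers in which each consecutive pair
    overlaps by l-1 letters (a Hamiltonian path in the overlap graph).

    Plain backtracking: worst-case running time is exponential.
    Assumes the l-mers are distinct.
    """
    def extend(path, remaining):
        if not remaining:
            yield path
            return
        for q in sorted(remaining):
            if path[-1][1:] == q[:-1]:          # last l-1 of p == first l-1 of q
                yield from extend(path + [q], remaining - {q})

    for start in sorted(lmers):
        yield from extend([start], set(lmers) - {start})


def spell(path):
    """Spell the string obtained by walking the path of overlapping l-mers."""
    return path[0] + "".join(q[-1] for q in path[1:])


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(sorted(spell(p) for p in hamiltonian_paths(S)))
```

On this eight-element spectrum the search finishes instantly, but the number of partial paths it may explore grows exponentially with the size of the spectrum.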

SBH: Eulerian Path Approach


This leads to a simple linear-time algorithm for
sequence reconstruction

Construct a graph whose edges correspond to l-mers

Find path(s) that visit each edge exactly once
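The edge-based construction can be sketched as follows: each l-mer contributes one edge from its (l-1)-letter prefix to its (l-1)-letter suffix. The function name `lmer_edge_graph` is an illustrative choice.

```python
from collections import defaultdict


def lmer_edge_graph(lmers):
    """Build a graph in which every l-mer is one directed edge.

    The edge for an l-mer goes from its first l-1 letters to its
    last l-1 letters, so vertices are (l-1)-mers.
    """
    graph = defaultdict(list)
    for lmer in lmers:
        graph[lmer[:-1]].append(lmer[1:])
    return dict(graph)


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
graph = lmer_edge_graph(S)
for vertex, successors in graph.items():
    print(vertex, "->", successors)
```

Building this graph takes one pass over the spectrum, in contrast to the pairwise comparisons needed for the overlap graph.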

SBH: Eulerian Path Approach


S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

Vertices correspond to (l - 1)-mers: { AT, TG, GC, GG, GT, CA, CG }

Edges correspond to l-mers from S

[Figure: graph on the vertices AT, TG, GC, GG, GT, CA, CG]

Path visited every EDGE once

SBH: Eulerian Path Approach

The graph on vertices { AT, TG, GC, GG, GT, CA, CG } contains two different Eulerian paths:

ATGGCGTGCA

ATGCGTGGCA

[Figure: the two paths traced on the graph with vertices AT, TG, GC, CA, GG, GT, CG]

SBH: Eulerian Path Approach


If for every vertex the number of incoming
edges is equal to the number of outgoing
edges, the graph is balanced

Theorem: A connected graph is Eulerian if
and only if each of its vertices is balanced

Theorem: A connected graph has an Eulerian
path if and only if it contains at most two
semi-balanced vertices and all other vertices
are balanced

A vertex is semi-balanced if its indegree and outdegree differ by exactly 1
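The balance conditions in the theorems translate directly into a check, and a path can then be found with Hierholzer's algorithm, a standard linear-time method for Eulerian paths (named here as one common choice; the slides do not say which algorithm is intended). The sketch below is a minimal illustration; the function name `eulerian_sequence` is hypothetical.

```python
from collections import defaultdict


def eulerian_sequence(lmers):
    """Return one string whose l-mers are exactly `lmers`, or None if the
    graph (edges = l-mers, vertices = (l-1)-mers) has no Eulerian path."""
    if not lmers:
        return None
    graph = defaultdict(list)
    balance = defaultdict(int)            # outdegree minus indegree
    for lmer in lmers:
        u, v = lmer[:-1], lmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1

    # At most two semi-balanced vertices; all others must be balanced.
    starts = [v for v, b in balance.items() if b == 1]
    if any(abs(b) > 1 for b in balance.values()) or len(starts) > 1:
        return None
    start = starts[0] if starts else lmers[0][:-1]

    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: vertex is finished
    path.reverse()

    if len(path) != len(lmers) + 1:       # some edge unreachable: not connected
        return None
    return path[0] + "".join(v[-1] for v in path[1:])


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(eulerian_sequence(S))
```

When several Eulerian paths exist, as for this spectrum, the sketch returns only one of them; which one depends on the order in which edges are stored.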


Some Difficulties with SBH


Fidelity of Hybridization: difficult to detect differences
between probes hybridized with perfect matches and 1
or 2 mismatches

Array Size: Effect of low fidelity can be decreased with
longer l-mers, but array size increases exponentially in l.
Array size is limited with current technology.

Practicality: SBH is still impractical. As DNA microarray
technology improves, SBH may become practical in the
future

Practicality again: Although SBH is still impractical, it
spearheaded expression analysis and SNP analysis
techniques

Fragment Assembly




Now that we have our reads sequenced, we
need to assemble them into the entire DNA
sequence


Fragment Assembly


We have some problems:


Errors in reads (1% to 3%)


Which strand did the read come from?


Did the read come from the target DNA sequence or its
Watson-Crick complement?


Repeats in DNA (this is the major problem)


See page 278 for puzzle example
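Because a read may come from either strand, an assembler usually has to consider each read together with its Watson-Crick (reverse) complement. A minimal sketch, assuming reads contain only the letters A, C, G and T; the helper names are illustrative.

```python
# Translation table for the Watson-Crick base pairing A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read):
    """Return the read as it would appear on the opposite strand."""
    return read.translate(COMPLEMENT)[::-1]


def both_orientations(reads):
    """Return every read together with its reverse complement."""
    return [seq for read in reads for seq in (read, reverse_complement(read))]


print(reverse_complement("TATGGTGC"))
print(both_orientations(["TATG", "GGTC"]))
```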

Fragment Assembly


Very difficult to put it all together if repeats
are longer than read length



Could solve this by increasing read length, but
the technology isn’t there yet

Fragment Assembly


One approach is to break the sequence into
about 30,000 Bacterial Artificial Chromosomes


Sequence each BAC individually


Put them all together


Used and shown effective (if cumbersome) by the
Human Genome Project

Fragment Assembly


Another option (used in mouse genome
assembly) is the Weber-Myers approach


Pairs reads that are separated by a fixed-size gap


Gap size L is chosen to be longer than most
repeats


Unlikely both reads lie in large repeat


Read that is in unique portion of DNA tells us
which copy of a repeat the mate is in

Fragment Assembly


Most algorithms consist of these steps:



Overlap:


Find potentially overlapping reads


Layout:


Find order of reads along DNA


Consensus:


Derive DNA sequence from layout

Overlap


Find the best match between the suffix of one
read and the prefix of another



Due to sequencing errors, need to use
dynamic programming to find the optimal
overlap alignment



Apply a filtration method to filter out pairs of
fragments that do not share a significantly
long common substring
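The suffix-prefix matching described above can be sketched as a dynamic programming overlap alignment. The scoring values (match +1, mismatch -1, gap -2) and the function name `overlap_alignment` are assumptions made for this illustration, not values from the slides.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Score the best alignment of some suffix of a with some prefix of b.

    Returns (score, j), where j is the length of the prefix of b used.
    Starting anywhere in a is free (column 0 is all zeros), and the
    alignment must run to the end of a (answer is read from the last row).
    """
    m = len(b)
    prev = [0] + [j * gap for j in range(1, m + 1)]   # row for the empty suffix
    for i in range(1, len(a) + 1):
        cur = [0] * (m + 1)                           # cur[0] = 0: free start in a
        for j in range(1, m + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diagonal, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j


score, overlap_len = overlap_alignment("ACGTTAGA", "TAGATTAC")
print(score, overlap_len)
```

The table has len(a) x len(b) cells, which is why a filtration step is applied first: running this on every pair of reads would be far too slow.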

Overlapping Reads

TAGATTACACAGATTAC

TAGATTACACAGATTAC

|||||||||||||||||


Sort all k-mers in reads (k ~ 24)

Find pairs of reads sharing a k-mer

Extend to full alignment; throw away if not >95% similar

[Figure: short alignment fragments with match bars]
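The pair-finding step can be sketched with a k-mer index. Note that this illustration uses a hash table rather than the sorting mentioned above; both group identical k-mers together. The function name `candidate_pairs` is hypothetical, and the toy example uses k = 4 rather than k ~ 24 so that short reads can be shown.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return the set of index pairs (i, j), i < j, of reads sharing a k-mer."""
    index = defaultdict(set)                 # k-mer -> indices of reads containing it
    for i, read in enumerate(reads):
        for p in range(len(read) - k + 1):
            index[read[p:p + k]].add(i)

    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["TAGATTACA", "TTACACAGA", "GGGGCCCC"]
print(candidate_pairs(reads, k=4))
```

Only the pairs returned here would then be extended to a full alignment and checked against the similarity threshold.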

Overlapping Reads and Repeats


A k-mer that appears N times initiates N^2 comparisons

For an Alu that appears 10^6 times: 10^12 comparisons (too much)

Solution:

Discard all k-mers that appear more than t × Coverage (t ~ 10)
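The discard rule can be sketched as a frequency filter: count every k-mer across the reads and drop those above the threshold t × Coverage, since a k-mer seen N times would trigger on the order of N^2 comparisons. The function name `overrepresented_kmers` and the small numbers in the example are illustrative assumptions.

```python
from collections import Counter


def overrepresented_kmers(reads, k, coverage, t=10):
    """Return the k-mers that occur more than t * coverage times.

    These are likely to come from repeats and would be left out of the
    k-mer index before looking for overlapping reads.
    """
    counts = Counter(read[p:p + k]
                     for read in reads
                     for p in range(len(read) - k + 1))
    limit = t * coverage
    return {kmer for kmer, count in counts.items() if count > limit}


reads = ["AAAAAA", "AAAAAC", "ACGTAC"]
print(overrepresented_kmers(reads, k=3, coverage=1, t=2))
```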

Finding Overlapping Reads

Create local multiple alignments from the
overlapping reads

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAG TTACACAGATTATTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAG TTACACAGATTATTGA

TAGATTACACAGATTACTGA

Layout


Repeats are a major challenge


Do two aligned fragments really overlap, or
are they from two copies of a repeat?


Solution: repeat masking


hide the repeats!!!


Masking results in high rate of misassembly
(up to 20%)


Misassembly means a lot more work at the
finishing step


Consensus


A consensus sequence is derived from a
profile of the assembled fragments



A sufficient number of reads is required to
ensure a statistically significant consensus



Reading errors are corrected

Derive Consensus Sequence

Derive multiple alignment from pairwise read
alignments

TAGATTACACAGATTACTGA TTGATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAAACTA

TAG TTACACAGATTATTGACTTCATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

Derive each consensus base by weighted
voting
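Column-wise voting can be sketched as below. The sketch assumes the rows are already aligned to equal length, writes gaps as `-` (the spaces in the alignment above are treated as gaps, which is an assumption), and uses one weight per read; how weights are derived in practice (for example from base quality) is not specified in the slides. The function name `consensus` is illustrative.

```python
from collections import defaultdict


def consensus(rows, weights=None, gap="-"):
    """Derive a consensus string from equal-length aligned rows.

    Each column is decided by a weighted vote; columns won by the gap
    symbol are dropped from the output. Ties go to the symbol seen first.
    """
    if weights is None:
        weights = [1.0] * len(rows)          # unweighted majority vote
    result = []
    for column in zip(*rows):
        votes = defaultdict(float)
        for symbol, weight in zip(column, weights):
            votes[symbol] += weight
        winner = max(votes, key=votes.get)
        if winner != gap:
            result.append(winner)
    return "".join(result)


rows = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
]
print(consensus(rows))
```

With equal weights this is a simple majority vote, so isolated read errors such as the single T, C or G substitutions above are outvoted by the other reads.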

Protein Sequencing and Identification


Protein can be digested into peptides by
proteases (such as trypsin)



Can then sequence the fragments individually
and reassemble



Mass spectrometry allows us to find proteins
involved in cell death, for example

Protein Sequencing and Identification


Tandem mass spectrometer breaks peptides
into smaller fragments


These fragments have electrical charge


Fragments are spun around in a magnetic
field until they hit a detector


Larger masses are harder to spin than smaller
ones, so mass can be determined by the
amount of energy required to fling fragments
around

Protein Sequencing and Identification


The problem we encounter is how to
reconstruct the amino acid sequence of the
peptide from the masses of the
broken pieces

References


Generated from:



An Introduction to Bioinformatics Algorithms, Neil C. Jones and Pavel A. Pevzner, A Bradford Book, The MIT Press, Cambridge, Mass., London, England, 2004

Slides 4, 8-10, 13, 14, 16, 23-29 from http://bix.ucsd.edu/bioalgorithms/slides.php#Ch8