Graph Algorithms 8.6

8.10
CS 6030
–
Bioinformatics
Summer II 2012
Jason Eric Johnson
Sequencing by Hybridization
•
DNA Array gives all strings of length l
•
How do we find the order?
•
Spectrum(
s,l
)
–
String s of length n
–
Spectrum is
multiset
of n

l+1 l

mers
in s
Sequencing by Hybridization
•
s
= TATGGTGC
•
l = 3
•
Spectrum(
s,l
) = {TAT,ATG,TGG,GGT,GTG,TGC}
•
Problem:
•
Input: Set S of all l

mers
from s
•
Output: String s
s.t.
Spectrum(
s,l
) = S
Hybridization on DNA Array
Sequencing by Hybridization
•
Special case of Shortest Superstring Problem
•
SBH is linear

time
•
SSP (NP

Complete) is more general
–
In SSP, no guaranteed overlap
–
In SBH, we know the length of the target
sequence
Sequencing by Hybridization
•
There is a problem with DNA Arrays
•
No good way to distinguish a match from a
highly stable mismatch
–
Mismatch could give strong hybridization signal
–
Need longer probes to deal with mutations
SBH: Hamiltonian Path Approach
•
Two l

mers
overlap if overlap(
p,q
) = l
–
1
–
Last l

1 letters of p are same as first l

1 of q
•
Make each l

mer
in Spectrum(
s,l
) a node
•
Construct directed graph(s) that connect every
p and q with a directed edge
•
1 to 1 correspondence between paths that
visit each vertex exactly once (Hamiltonian
Paths) and DNA fragments with Spectrum(
s,l
)
SBH: Hamiltonian Path Approach
S
= { ATG AGG TGC TCC GTC GGT GCA CAG }
Path visited every VERTEX once
ATG
AGG
TGC
TCC
H
GTC
GGT
GCA
CAG
ATG
C
A
G
G
T
C
C
SBH: Hamiltonian Path Approach
A more complicated graph:
S
= { ATG TGG TGC GTG GGC GCA GCG CGT }
SBH: Hamiltonian Path Approach
S
= { ATG TGG TGC GTG GGC GCA GCG CGT }
Path 1:
ATGCGTGGCA
ATGGCGTGCA
Path 2:
SBH: Hamiltonian Path Approach
•
Problem is that there is no efficient algorithm
•
As overlap graph gets larger, this is not a
useful technique since the Hamiltonian Path
problem is NP

Complete
SBH:
Eulerian
Path Approach
•
This leads to simple linear

time algorithm for
sequence reconstruction
•
Construct graph whose edges correspond to l

mers
•
Find path(s) that visit each edge exactly once
SBH:
Eulerian
Path Approach
S
= { ATG, TGC, GTG, GGC, GCA, GCG, CGT }
Vertices correspond to (
l
–
1 )
–
mers : { AT, TG, GC, GG, GT, CA, CG }
Edges correspond to
l
–
mers from
S
AT
GT
CG
CA
GC
TG
GG
Path visited every EDGE once
SBH:
Eulerian
Path Approach
S
= { AT, TG, GC, GG, GT, CA, CG } corresponds to two different paths:
ATGGCGTGCA
ATGCGTGGCA
AT
TG
GC
CA
GG
GT
CG
AT
GT
CG
CA
GC
TG
GG
SBH:
Eulerian
Path Approach
•
If for every vertex the number of incoming
edges is equal to the number of outgoing
edges, the graph is balanced
•
Theorem
: A connected graph is
Eulerian
if
and only if each of its vertices is balanced
•
Theorem
: A connected graph has an
Eulerian
path if and only if it contains at most two
semi

balanced vertices and all other vertices
are balanced
Some Difficulties with SBH
•
Fidelity of Hybridization:
difficult to detect differences
between probes hybridized with perfect matches and 1
or 2 mismatches
•
Array Size:
Effect of low fidelity can be decreased with
longer
l

mers, but array size increases exponentially in
l.
Array size is limited with current technology.
•
Practicality:
SBH is still impractical. As DNA microarray
technology improves, SBH may become practical in the
future
•
Practicality again
: Although SBH is still impractical, it
spearheaded expression analysis and SNP analysis
techniques
Fragment Assembly
•
Now that we have our reads sequenced, we
need to assemble them into the entire DNA
sequence
Fragment Assembly
•
We have some problems:
–
Errors in reads (1% to 3%)
–
Which strand did the read come from?
•
Did the read come from the target DNA sequence or its
Watson

Crick complement?
–
Repeats in DNA (this is the major problem)
•
See page 278 for puzzle example
Fragment Assembly
•
Very difficult to put it all together if repeats
are longer than read length
•
Could solve this by increasing read length, but
the technology isn’t there yet
Fragment Assembly
•
One approach is to break the sequence into
about 30,000 Bacterial Artificial Chromosomes
–
Sequence each BAC individually
–
Put them all together
–
Used and shown effective (if cumbersome) by the
Human Genome Project
Fragment Assembly
•
Another option (used in mouse genome
assembly) is the Weber

Meyers approach
–
Pairs reads that are separated by a fixed

size gap
–
Gap size L is chosen to be longer than most
repeats
–
Unlikely both reads lie in large repeat
–
Read that is in unique portion of DNA tells us
which copy of a repeat the mate is in
Fragment Assembly
•
Most algorithms consist of these steps:
•
Overlap
–
Find potentially overlapping reads
•
Layout:
–
Find order of reads along DNA
•
Consensus:
–
Derive DNA sequence from layout
Overlap
•
Find the best match between the suffix of one
read and the prefix of another
•
Due to sequencing errors, need to use
dynamic programming to find the optimal
overlap alignment
•
Apply a filtration method to filter out pairs of
fragments that do not share a significantly
long common substring
Overlapping Reads
TAGATTACACAGATTAC
TAGATTACACAGATTAC

•
Sort all
k

mers in reads (
k
~ 24)
•
Find pairs of reads sharing a k

mer
•
Extend to full alignment
–
throw away if not
>95% similar
T GA
TAGA
 
TACA
TAGT

Overlapping Reads and Repeats
•
A
k

mer that appears N times, initiates N
2
comparisons
•
For an
Alu
that appears 10
6
times
10
12
comparisons
–
too much
•
Solution:
Discard all
k

mers that appear more than
t
Coverage, (
t
~ 10)
Finding Overlapping Reads
Create local multiple alignments from the
overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
Layout
•
Repeats are a major challenge
•
Do two aligned fragments really overlap, or
are they from two copies of a repeat?
•
Solution: repeat masking
–
hide the repeats!!!
•
Masking results in high rate of misassembly
(up to 20%)
•
Misassembly means alot more work at the
finishing step
Consensus
•
A consensus sequence is derived from a
profile of the assembled fragments
•
A sufficient number of reads is required to
ensure a statistically significant consensus
•
Reading errors are corrected
Derive Consensus Sequence
Derive
multiple alignment
from pairwise read
alignments
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTA
T
TGACTT
C
ATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive each consensus base by weighted
voting
Protein Sequencing and Identification
•
Protein can be digested into peptides by
proteases (such as trypsin)
•
Can then sequence the fragments individually
and re

assemble
•
Mass spectrometry allows us to find proteins
involved in cell death, for example
Protein Sequencing and Identification
•
Tandem mass spectrometer breaks peptides
into smaller fragments
•
These fragments have electrical charge
•
Fragments are spun around in an magnetic
field until they hit a detector
•
Larger masses are harder to spin than smaller
ones, so mass can be determined by the
amount of energy required to fling fragments
around
Protein Sequencing and Identification
•
The problem we encounter is how to
reconstruct the amino acid sequence of the
peptide from the masses of the
broken pieces
References
•
Generated from:
•
An Introduction to Bioinformatics Algorithms, Neil C. Jones,
Pavel
A.
Pevzner
, A Bradford Book, The MIT Press, Cambridge, Mass., London,
England, 2004
•
Slides 4, 8

10, 13, 14, 16, 23

29 from
http://
bix.ucsd.edu
/
bioalgorithms
/slides.php#Ch8
Comments 0
Log in to post a comment