Efficient Algorithms for Analyzing Segmental Duplications, Deletions, and Inversions in Genomes
Crystal Kahn*, Shay Mozes*, and Ben Raphael*†
Brown University, Providence, RI
*Department of Computer Science
†Center for Computational Molecular Biology
{clkahn,shay,braphael}@cs.brown.edu
Segmental Duplications
• Long (>1 kb) duplications common in mammalian genomes
• Copied segments are often reinserted into the genome at large distances from the ancestral locus
– 50-60% interchromosomal [Bailey & Eichler, 2006], [Jiang et al., 2008]
• Duplicated material comprises complex, contiguous mosaics of duplicated fragments [Bailey et al., 2002], [Johnson et al., 2006]
Human Segmental Duplications
• [Jiang et al. 07] decomposed 437 duplication blocks (>1 kb) into 11,951 duplicons (>100 bp)
• Mosaic structure complicates ancestral reconstruction
• Conserved patterns of duplicons suggest copying of contiguous strings
[Jiang et al., Nat Gen 2007]
Combinatorial Models of Duplication
• Genome rearrangement in the presence of duplicated genes or gene families
– Reversal distance in the presence of duplicates is NP-hard [Chen et al., Bioinformatics 2005]
– Many heuristics and approximation algorithms [Marron et al., TCS 2004], [El-Mabrouk, CPM 2000], [Ratan et al., JCB 2008]
– Whole-genome duplication [El-Mabrouk & Sankoff, JComp 03], [Alekseyev & Pevzner, SICOMP 07]
• Distance metrics based on number of duplication events:
– Tandem duplication [Chaudhuri et al., SODA 06], [Elemento et al., MBE 02], [Lajoie et al., JCB 07]
– No breakpoint reuse [Zhang et al., RECOMB 2008]
Duplication Distance
Fundamental rearrangement: duplicate operation
Duplication Distance d(X,Y): minimum number of duplicate operations to transform the empty string into Y by repeated insertions of substrings of X.
[K & Raphael, Bioinformatics 2008], [K & Raphael, PSB 09]
[Figure: a duplicate operation; source string X with positions s and t, target string Y with insertion position p]
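A single duplicate operation can be sketched in a few lines of Python. The function name `duplicate`, the 1-based inclusive indices `s` and `t`, and the convention that `p` counts the characters of Y that precede the insertion point are all assumptions made for illustration; the slide does not fix an indexing convention.

```python
def duplicate(x: str, y: str, s: int, t: int, p: int) -> str:
    """Copy x[s..t] (1-based, inclusive) and insert it into y after its first p characters.

    Hypothetical helper: the index conventions are illustrative assumptions.
    """
    if not (1 <= s <= t <= len(x)):
        raise ValueError("need 1 <= s <= t <= len(x)")
    if not (0 <= p <= len(y)):
        raise ValueError("need 0 <= p <= len(y)")
    copied = x[s - 1:t]          # the substring x_s ... x_t
    return y[:p] + copied + y[p:]


# Build a target from the empty string with two duplicate operations.
x = "abcde"
y = ""
y = duplicate(x, y, 1, 5, 0)     # copy all of x into the empty string
y = duplicate(x, y, 2, 3, 4)     # insert "bc" after the first four characters
```

After these two operations `y` is `"abcdbce"`, so this particular target can be built with at most two duplicate operations.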
Duplication Distance
• Previously used to:
1. Compare similarity of strings of segmental duplications
2. Infer ancestral relationships between segmental duplications
• In this work we extend duplication distance to include certain types of deletion and inversion operations
A Sequence of Duplicate Operations
• Y can be partitioned into subsequences corresponding to copied substrings of X
[Figure: source string X and target string Y; each subsequence of Y corresponds to a substring X_{s,t}]
Non-overlapping Property
• Duplicated subsequences can be nested inside one another
• But they cannot overlap one another
Lemma: The substrings of X that are duplicated in the construction of Y appear as mutually non-overlapping subsequences of Y.
[Figure: a target string Y in which the red and blue subsequences are overlapping]
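The difference between nested and overlapping subsequences can be made concrete with a small check. This sketch assumes that "overlap" means the two subsequences interleave (positions a < c < b < d with a, b in one subsequence and c, d in the other); the slide does not spell out the formal definition, and the names `crosses` and `overlapping` are hypothetical.

```python
from itertools import combinations


def crosses(u, v):
    """True if some a < c < b < d with a, b taken from u and c, d taken from v.

    u and v are sorted lists of distinct positions in the target string Y.
    """
    for a, b in combinations(u, 2):      # a < b because u is sorted
        for c, d in combinations(v, 2):  # c < d because v is sorted
            if a < c < b < d:
                return True
    return False


def overlapping(u, v):
    """Assumed definition: two subsequences overlap if they interleave in either order."""
    return crosses(u, v) or crosses(v, u)


nested = overlapping([0, 1, 4, 5], [2, 3])       # v sits inside a gap of u
side_by_side = overlapping([0, 1], [2, 3])       # v starts after u ends
interleaved = overlapping([0, 2], [1, 3])        # positions alternate
```

Under this definition only the interleaved pair counts as overlapping; the nested and side-by-side pairs are allowed by the lemma.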
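The Basic Recurrence
[Figure: the recurrence on source string X and target string Y, with labels X_i, X_{i+1}, Y_1, Y_j, and Y_{2,j-1}]
Theorem: the recurrence computes duplication distance in time [formula not recovered]
[symbol not recovered] = max multiplicity of any character in X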
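The formula itself did not survive on this slide, so the following is only a sketch of one recurrence consistent with its labels (X_i, X_{i+1}, Y_1, Y_j, Y_{2,j-1}) and with the non-overlapping property: the first target character is the copy of some x_i, and that copy either stops there or continues with x_{i+1} at a later position j, with everything in between built by nested operations. The function and helper names are hypothetical, and no claim is made that this matches the authors' running time.

```python
from functools import lru_cache
import math


def duplication_distance(x: str, y: str) -> float:
    """Memoized sketch of a duplication-distance recurrence (illustrative, not the authors' code)."""

    @lru_cache(maxsize=None)
    def d(a: int, b: int) -> float:
        """Minimum number of duplicate operations that build y[a:b]."""
        if a >= b:
            return 0
        best = math.inf
        for i, ch in enumerate(x):
            if ch == y[a]:
                # Start a new operation whose first copied character is x[i].
                best = min(best, 1 + ext(i, a, b))
        return best

    @lru_cache(maxsize=None)
    def ext(i: int, a: int, b: int) -> float:
        """Extra operations for y[a:b] when y[a] is the copy of x[i] in an already counted operation."""
        # Option 1: the copied substring ends at x[i].
        best = d(a + 1, b)
        # Option 2: the copy continues with x[i+1], which lands at some later y[j];
        # the characters strictly between a and j come from nested operations.
        if i + 1 < len(x):
            for j in range(a + 1, b):
                if y[j] == x[i + 1]:
                    best = min(best, d(a + 1, j) + ext(i + 1, j, b))
        return best

    return d(0, len(y))
```

For example, `duplication_distance("ab", "aabb")` evaluates to 2 (copy "ab", then insert a second "ab" nested inside the first), and a target containing a character absent from X yields `math.inf`.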
Duplication-Deletion Distance
A delete operation:
[Figure: target string Y_1 with positions s and t; modified target string Y_2]
Duplication-Deletion Distance d̂(X,Y): minimum number of duplicate and delete operations, in any order, to build Y using X.
Duplication-Deletion Distance
Lemma: The minimum number of duplicate and delete operations that builds Y from X is equal to the cost of a minimum sequence of duplicate-delete operations.
A duplicate-delete operation:
[Figure: source string X with positions s and t; target string Y with insertion position p]
Cost = 1 (dup) + 2 (del)
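A duplicate-delete operation can be sketched as "copy a substring, drop some runs from the copy, insert what remains". The sketch below assumes each deleted contiguous run costs 1, which is how the "1 (dup) + 2 (del)" cost on the slide is read here; the function name, the `deletions` argument, and the 1-based index conventions are hypothetical.

```python
def duplicate_delete(x: str, y: str, s: int, t: int, deletions, p: int):
    """Copy x[s..t], delete the given runs from the copy, insert the rest into y after p characters.

    deletions: list of (i, j) pairs, 1-based inclusive positions in x, each a run inside x[s..t].
    Returns (new_y, cost) with cost = 1 + number of deleted runs (assumed unit cost per run).
    """
    if not (1 <= s <= t <= len(x)) or not (0 <= p <= len(y)):
        raise ValueError("index out of range")
    removed = set()
    prev_end = s - 2
    for i, j in sorted(deletions):
        # Runs must lie inside the copy and be separated by at least one kept character,
        # so that each listed run really is a distinct deletion.
        if not (prev_end + 1 < i <= j <= t):
            raise ValueError("deleted runs must be disjoint, non-adjacent, and inside x[s..t]")
        removed.update(range(i, j + 1))
        prev_end = j
    kept = "".join(x[k - 1] for k in range(s, t + 1) if k not in removed)
    cost = 1 + len(deletions)
    return y[:p] + kept + y[p:], cost


# Copy all of "abcdefg", delete "b" and "ef", and insert the remainder into the empty string.
result = duplicate_delete("abcdefg", "", 1, 7, [(2, 2), (5, 6)], 0)
```

Here `result` is `("acdg", 3)`: one duplication plus two deletions.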
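Duplication-Deletion Distance
[Figure: the recurrence on source string X and target string Y, with labels X_i, X_k, Y_1, Y_j, and Y_{2,j-1}]
Theorem: the recurrence computes duplication-deletion distance in time [formula not recovered]
[symbol not recovered] = max multiplicity of any character in X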
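The X_k label suggests that, once deletions are allowed, the character following x_i in a copy may be any later x_k rather than only x_{i+1}. The sketch below follows that reading and charges 1 for each skipped run of X, in line with the per-deletion cost shown for the duplicate-delete operation. Both the recurrence and the names are assumptions; the formula on the slide was not recoverable.

```python
from functools import lru_cache
import math


def duplication_deletion_distance(x: str, y: str) -> float:
    """Memoized sketch: minimum cost of duplicate-delete operations building y from x.

    Assumed cost model: 1 per operation plus 1 per deleted contiguous run of x.
    """

    @lru_cache(maxsize=None)
    def d(a: int, b: int) -> float:
        """Minimum cost to build y[a:b]."""
        if a >= b:
            return 0
        best = math.inf
        for i, ch in enumerate(x):
            if ch == y[a]:
                best = min(best, 1 + ext(i, a, b))
        return best

    @lru_cache(maxsize=None)
    def ext(i: int, a: int, b: int) -> float:
        """Extra cost for y[a:b] when y[a] is the copy of x[i] in an already counted operation."""
        best = d(a + 1, b)                       # the operation contributes nothing after x[i]
        for k in range(i + 1, len(x)):
            gap = 0 if k == i + 1 else 1         # skipping x[i+1..k-1] is one deletion
            for j in range(a + 1, b):
                if y[j] == x[k]:
                    best = min(best, gap + d(a + 1, j) + ext(k, j, b))
        return best

    return d(0, len(y))
```

For instance, `duplication_deletion_distance("abcdefg", "acdg")` evaluates to 3 under this cost model, matching a single copy of X with two deletions.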
Duplication-Inversion-Deletion Distance
A duplicate-invert operation:
[Figure: source string X with positions s and t; target string Y with insertion position p]
Duplication-Inversion-Deletion Distance d̄(X,Y): minimum number of duplicate, duplicate-invert, and delete operations, in any order, to build Y using X.
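A duplicate-invert operation can be sketched in the same style as a plain duplicate. This example assumes the strings are sequences of signed integers (one per oriented segment) and that inversion reverses the copied substring and flips each sign; that representation, the function name, and the index conventions are illustrative assumptions not stated on the slide.

```python
def duplicate_invert(x, y, s, t, p):
    """Copy x[s..t] (1-based, inclusive), invert the copy, and insert it into y after p elements.

    x and y are lists of signed integers; inversion reverses order and negates signs (assumed model).
    """
    if not (1 <= s <= t <= len(x)) or not (0 <= p <= len(y)):
        raise ValueError("index out of range")
    inverted = [-c for c in reversed(x[s - 1:t])]
    return y[:p] + inverted + y[p:]


x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]
y2 = duplicate_invert(x, y, 2, 4, 5)   # append the inverted copy of segments 2..4
```

With these inputs `y2` is `[1, 2, 3, 4, 5, -4, -3, -2]`.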
Duplication-Inversion-Deletion Distance
Lemma: The minimum number of duplicate, duplicate-invert, and delete operations that generates Y using X is equal to the cost of a minimum sequence of duplicate and duplicate-invert-delete operations.
A duplicate-invert-delete operation:
[Figure: source string X with positions s and t; target string Y with insertion position p]
Cost = 1 (dup-inv) + 2 (del)
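Other variants
• Duplication-deletions with substring inversions
• Affine cost functions for duplications, inversions, and deletions.
[Figure: two panels, each showing source string X with positions s and t and target string Y with insertion position p]
Cost = 1 + Θ_1 + (t - s + 1)Θ_2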
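The affine cost on this slide is straightforward to evaluate. The helper below simply computes the printed expression; the name `affine_cost` and the example parameter values are made up for illustration, and the slide text does not say which operation the cost is charged to.

```python
def affine_cost(s: int, t: int, theta1: float, theta2: float) -> float:
    """Evaluate 1 + theta1 + (t - s + 1) * theta2 for a segment spanning positions s..t."""
    if s > t:
        raise ValueError("need s <= t")
    length = t - s + 1               # number of characters in the segment s..t
    return 1 + theta1 + length * theta2


# Hypothetical parameters: a segment of length 4 with theta1 = 2 and theta2 = 0.5.
example = affine_cost(3, 6, 2, 0.5)
```

With these values the cost is 1 + 2 + 4 × 0.5 = 5.0, so the length-dependent term adds to the fixed per-operation cost.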
Future Work
• Extend previous analysis of human segmental duplications to involve these new models
• Probabilistic model of duplication
– Relax the parsimony assumption
– Explore suboptimal rearrangement scenarios to find a max-likelihood ancestral reconstruction of segmental duplications
Acknowledgements
• CLK, SM, BJR:
– National Science Foundation
• BJR:
– Career Award at the Scientific Interface from the Burroughs Wellcome Fund.
– ADVANCE Program at Brown University