Duplication Distance - Brown University


Oct 2, 2013


Efficient Algorithms for Analyzing Segmental Duplications, Deletions, and Inversions in Genomes

Crystal Kahn*, Shay Mozes*, and Ben Raphael*

Brown University, Providence, RI

*Department of Computer Science

Center for Computational Molecular Biology

{clkahn,shay,braphael}@cs.brown.edu

Segmental Duplications


Long (>1 kb) duplications common in mammalian genomes

Copied segments are often reinserted into the genome at large distances from the ancestral locus

50-60% interchromosomal

[Bailey & Eichler, 2006], [Jiang et al., 2008]

Duplicated material comprises complex, contiguous mosaics of duplicated fragments

[Bailey et al., 2002], [Johnson et al., 2006]

Human Segmental Duplications


[Jiang et al., 2007] decomposed 437 duplication blocks (>1 kb) into 11,951 duplicons (>100 bp)

Mosaic structure complicates ancestral reconstruction

Conserved patterns of duplicons suggest copying of contiguous strings

[Jiang et al., Nat Gen 2007]

Combinatorial Models of Duplication

Genome rearrangement in the presence of duplicated genes or gene families

  Reversal distance in the presence of duplicates is NP-hard [Chen et al., Bioinformatics 2005]

  Many heuristics and approximation algorithms [Marron et al., TCS 2004], [El-Mabrouk, CPM 2000], [Ratan et al., JCB 2008]

  Whole-genome duplication [El-Mabrouk & Sankoff, JComp 03], [Alekseyev & Pevzner, SICOMP 07]

Distance metrics based on number of duplication events:

  Tandem duplication [Chaudhuri et al., SODA 06], [Elemento et al., MBE 02], [Lajoie et al., JCB 07]

  No breakpoint reuse [Zhang et al., RECOMB 2008]



Fundamental rearrangement: duplicate operation

Duplication Distance d(X,Y): minimum number of duplicate operations to transform the empty string into Y by repeated insertions of substrings of X. [Kahn & Raphael, Bioinformatics 2008], [Kahn & Raphael, PSB 09]

[Figure: a duplicate operation, with source string X (positions s and t) and target string Y (position p).]
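
A duplicate operation as defined above can be sketched in a few lines. This is only an illustration: the function name `duplicate`, the 1-based inclusive indexing, and the reading of p as the position where the copy starts in Y are assumptions, not details taken from the slides.

```python
def duplicate(X, Y, s, t, p):
    """Insert a copy of X[s..t] into Y so that the copy starts at position p.

    Indices are 1-based and inclusive (an assumed convention).
    """
    if not (1 <= s <= t <= len(X)):
        raise ValueError("need 1 <= s <= t <= len(X)")
    if not (1 <= p <= len(Y) + 1):
        raise ValueError("need 1 <= p <= len(Y) + 1")
    return Y[:p - 1] + X[s - 1:t] + Y[p - 1:]


X = "abcde"
Y = ""                          # the target starts as the empty string
Y = duplicate(X, Y, 1, 3, 1)    # copy "abc"
Y = duplicate(X, Y, 4, 5, 2)    # copy "de" and insert it after the "a"
print(Y)                        # adebc
```

Here two duplicate operations build "adebc" from "abcde", so d(X,Y) is at most 2 for this pair.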

Duplication Distance



Previously used to:

1. Compare similarity of strings of segmental duplications

2. Infer ancestral relationships between segmental duplications

In this work we extend duplication distance to include certain types of deletion and inversion operations

A Sequence of Duplicate Operations

Y can be partitioned into subsequences corresponding to copied substrings of X

[Figure: source string X, target string Y, and a copied substring X_{s,t}.]
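
To make the partition statement concrete, the following sketch checks whether a proposed partition of the positions of Y into subsequences spells out substrings of X. The function name `is_valid_partition` and the 1-based position lists are hypothetical, and the check deliberately ignores the order in which the copies were inserted.

```python
def is_valid_partition(X, Y, parts):
    """Return True if `parts` splits the positions of Y into subsequences
    that each spell a non-empty substring of X.

    `parts` is a list of lists of 1-based positions of Y (assumed format).
    """
    covered = sorted(i for part in parts for i in part)
    if covered != list(range(1, len(Y) + 1)):
        return False  # positions must be used exactly once
    for part in parts:
        word = "".join(Y[i - 1] for i in sorted(part))
        if not word or word not in X:
            return False
    return True


X, Y = "abcde", "adebc"
print(is_valid_partition(X, Y, [[1, 4, 5], [2, 3]]))  # True: "abc" and "de"
print(is_valid_partition(X, Y, [[1, 2], [3, 4, 5]]))  # False: "ad" is not a substring of X
```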


Non-overlapping Property

Duplicated subsequences can be nested inside one another

But they cannot overlap one another

[Figure: target string Y. Red and blue subsequences are overlapping.]

Lemma: The substrings of X that are duplicated in the construction of Y appear as mutually non-overlapping subsequences of Y.
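
The difference between nested and overlapping subsequences can be checked mechanically. The sketch below assumes that two subsequences overlap when their positions interleave as A, B, A, B (or B, A, B, A), and that one is nested in the other when the pattern is A, B, A. The slides do not spell out the formal definition, so this reading and the name `relation` are assumptions.

```python
def relation(A, B):
    """Classify two position sets of Y as 'disjoint', 'nested', or 'overlapping'.

    A and B are assumed to share no position.
    """
    labelled = sorted([(i, "A") for i in A] + [(i, "B") for i in B])
    runs = []  # labels in position order, with consecutive repeats collapsed
    for _, label in labelled:
        if not runs or runs[-1] != label:
            runs.append(label)
    if len(runs) <= 2:
        return "disjoint"      # e.g. A A B B
    if len(runs) == 3:
        return "nested"        # e.g. A A B B A
    return "overlapping"       # contains A B A B or B A B A


print(relation([1, 2, 5, 6], [3, 4]))  # nested
print(relation([1, 3], [2, 4]))        # overlapping
print(relation([1, 2], [3, 4]))        # disjoint
```

Under this reading, the lemma says that the copies used to build Y never produce the "overlapping" case.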


The Basic Recurrence

[Figure: source string X and target string Y, with labels X_i, X_{i+1}, Y_1, Y_j, and Y_{2,j-1}.]

Theorem: the recurrence computes duplication distance in time […], where […] = max multiplicity of any character in X
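
The recurrence itself did not survive on this slide, so the following is a reconstruction from the figure labels (X_i, X_{i+1}, Y_1, Y_j, Y_{2,j-1}) and the non-overlapping property, not the authors' exact formulation. It assumes that the first character of the target is produced by some X_i, and that the copy either ends there or continues with X_{i+1} landing on a later Y_j, with Y_{2,j-1} built separately in between. The names `d` and `d_i` and the memoized top-down style are choices made for this sketch, with no attempt to match the running time stated in the theorem.

```python
import math
from functools import lru_cache


def duplication_distance(X, Y):
    """Minimum number of duplicate operations that build Y from substrings of X.

    Returns math.inf if Y contains a character that X lacks.
    """
    n, m = len(X), len(Y)

    @lru_cache(maxsize=None)
    def d(a, b):
        # Distance for the target substring Y[a:b] (0-based, half-open).
        if a >= b:
            return 0
        return min((d_i(i, a, b) for i in range(n) if X[i] == Y[a]),
                   default=math.inf)

    @lru_cache(maxsize=None)
    def d_i(i, a, b):
        # Same as d, but Y[a] is required to be a copy of X[i].
        # Case 1: the copy ends at X[i]; count it and solve the rest.
        best = 1 + d(a + 1, b)
        # Case 2: the copy continues, so X[i+1] lands on some later Y[j];
        # Y[a+1:j] is nested inside this copy and is solved separately.
        if i + 1 < n:
            for j in range(a + 1, b):
                if Y[j] == X[i + 1]:
                    best = min(best, d(a + 1, j) + d_i(i + 1, j, b))
        return best

    return d(0, m)


print(duplication_distance("abc", "abc"))     # 1
print(duplication_distance("abc", "abab"))    # 2
print(duplication_distance("abc", "aabcbc"))  # 2
```

In the last example, "abc" is copied once and a second copy of "abc" is inserted after the first "a", which is a nested pair of copies.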

Duplication-Deletion Distance

A delete operation:

[Figure: a delete operation, with target string Y_1 (positions s and t) and modified target string Y_2.]

Duplication-Deletion Distance d(X,Y): minimum number of duplicate and delete operations, in any order, to build Y using X.
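
A delete operation can be sketched the same way as a duplicate operation. The name `delete` and the 1-based inclusive positions s and t are assumptions made for illustration.

```python
def delete(Y, s, t):
    """Remove the substring Y[s..t] (1-based, inclusive) from the target string."""
    if not (1 <= s <= t <= len(Y)):
        raise ValueError("need 1 <= s <= t <= len(Y)")
    return Y[:s - 1] + Y[t:]


Y1 = "abcde"             # target string before the deletion
Y2 = delete(Y1, 2, 3)    # remove "bc"
print(Y2)                # ade
```

With X = "abcde", one duplicate operation followed by this delete operation builds "ade" in two operations.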



Duplication-Deletion Distance

Lemma: The minimum number of duplicate and delete operations that builds Y from X is equal to the cost of a minimum sequence of duplicate-delete operations.

A duplicate-delete operation:

[Figure: a duplicate-delete operation, with source string X (positions s and t) and target string Y (position p). Cost = 1 (dup) + 2 (del)]
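
The cost shown in the figure, 1 for the duplication plus 1 per deleted piece, suggests that a duplicate-delete operation copies X_{s,t}, drops some substrings of the copy, and inserts what remains at position p. The sketch below follows that reading. The function name, the `deletions` argument, and the 1-based inclusive ranges are all hypothetical.

```python
def duplicate_delete(X, Y, s, t, p, deletions):
    """Copy X[s..t], drop the listed sub-ranges, and insert the rest at position p of Y.

    `deletions` is a list of (a, b) ranges, 1-based and inclusive, that are
    disjoint and lie inside [s, t]. Returns (new_Y, cost).
    """
    kept = []
    pos = s
    for a, b in sorted(deletions):
        if not (pos <= a <= b <= t):
            raise ValueError("deletions must be disjoint and lie inside [s, t]")
        kept.append(X[pos - 1:a - 1])   # part of the copy before this deletion
        pos = b + 1
    kept.append(X[pos - 1:t])           # part of the copy after the last deletion
    piece = "".join(kept)
    new_Y = Y[:p - 1] + piece + Y[p - 1:]
    return new_Y, 1 + len(deletions)    # 1 (dup) + one unit per deleted range


print(duplicate_delete("abcdefg", "", 1, 7, 1, [(2, 2), (5, 6)]))  # ('acdg', 3)
```

The example deletes two pieces of the copy, so its cost is 1 (dup) + 2 (del) = 3, as in the figure.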

Duplication-Deletion Distance

[Figure: source string X and target string Y, with labels X_i, X_k, Y_1, Y_j, and Y_{2,j-1}.]

Theorem: the recurrence computes duplication-deletion distance in time […], where […] = max multiplicity of any character in X
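
One plausible reading of the labels on this slide is that the character following X_i inside a copy may now be any later X_k, with one deletion charged whenever k > i + 1. The sketch below implements that reading as the minimum cost of a sequence of duplicate-delete operations. It is a reconstruction, not the authors' recurrence, and the `delete_cost` parameter is a hypothetical knob added so that the effect of the deletion cost can be seen.

```python
import math
from functools import lru_cache


def dup_del_distance(X, Y, delete_cost=1):
    """Minimum cost of duplicate-delete operations that build Y from X.

    Each operation costs 1 plus delete_cost per deleted piece of the copy.
    """
    n, m = len(X), len(Y)

    @lru_cache(maxsize=None)
    def d(a, b):
        # Cost for the target substring Y[a:b] (0-based, half-open).
        if a >= b:
            return 0
        return min((d_i(i, a, b) for i in range(n) if X[i] == Y[a]),
                   default=math.inf)

    @lru_cache(maxsize=None)
    def d_i(i, a, b):
        # Y[a] is a copy of X[i]. Either the operation ends here ...
        best = 1 + d(a + 1, b)
        # ... or its next kept character is X[k] at Y[j]; skipping
        # X[i+1..k-1] counts as one deleted piece when k > i + 1.
        for k in range(i + 1, n):
            skip = 0 if k == i + 1 else delete_cost
            for j in range(a + 1, b):
                if Y[j] == X[k]:
                    best = min(best, d(a + 1, j) + skip + d_i(k, j, b))
        return best

    return d(0, m)


print(dup_del_distance("abcde", "abde"))                # 2
print(dup_del_distance("abcde", "ace"))                 # 3
print(dup_del_distance("abcde", "ace", delete_cost=0))  # 1
```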


Duplication-Inversion-Deletion Distance

A duplicate-invert operation:

[Figure: a duplicate-invert operation, with source string X (positions s and t) and target string Y (position p).]

Duplication-Inversion-Deletion Distance d(X,Y): minimum number of duplicate, duplicate-invert, and delete operations, in any order, to build Y using X.
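
A duplicate-invert operation can be sketched on strings of signed markers. The sketch assumes that inversion reverses the order of the copied markers and flips the sign (orientation) of each one, which is the usual convention for genome rearrangements but is not stated on the slide. The name `duplicate_invert` and the 1-based indices are also assumptions.

```python
def duplicate_invert(X, Y, s, t, p):
    """Insert an inverted copy of X[s..t] into Y, starting at position p.

    X and Y are lists of signed integers; indices are 1-based and inclusive.
    """
    if not (1 <= s <= t <= len(X)):
        raise ValueError("need 1 <= s <= t <= len(X)")
    if not (1 <= p <= len(Y) + 1):
        raise ValueError("need 1 <= p <= len(Y) + 1")
    inverted = [-x for x in reversed(X[s - 1:t])]  # reverse order, flip signs
    return Y[:p - 1] + inverted + Y[p - 1:]


X = [1, 2, 3, 4, 5]
Y = [1, 2, 3]
print(duplicate_invert(X, Y, 2, 4, 2))  # [1, -4, -3, -2, 2, 3]
```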

Duplication-Inversion-Deletion Distance

Lemma: The minimum number of duplicate, duplicate-invert, and delete operations that generates Y using X is equal to the cost of a minimum sequence of duplicate and duplicate-invert-delete operations.

A duplicate-invert-delete operation:

[Figure: a duplicate-invert-delete operation, with source string X (positions s and t) and target string Y (position p). Cost = 1 (dup-inv) + 2 (del)]

Other variants


Duplication-deletions with substring inversions

Affine cost functions for duplications, inversions, and deletions.

[Figure: two operations, each with source string X (positions s and t) and target string Y (position p). Cost = 1 + Θ_1 + (t - s + 1)Θ_2]
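
The cost expression in the figure can be evaluated directly. The slide does not say which operation the expression prices or how Θ_1 and Θ_2 are chosen, so the function below simply mirrors the formula, and the name `affine_cost` and the sample values are hypothetical.

```python
def affine_cost(s, t, theta1, theta2):
    """Cost = 1 + theta1 + (t - s + 1) * theta2, as written on the slide.

    theta1 is a fixed charge; theta2 is charged per character of the
    affected substring, which has t - s + 1 characters.
    """
    if s > t:
        raise ValueError("need s <= t")
    return 1 + theta1 + (t - s + 1) * theta2


print(affine_cost(3, 6, theta1=2, theta2=0.5))  # 5.0
```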

Future Work


Extend previous analysis of human segmental duplications to involve these new models

Probabilistic model of duplication

Relax the parsimony assumption

Explore suboptimal rearrangement scenarios to find a max-likelihood ancestral reconstruction of segmental duplications

Acknowledgements


CLK, SM, BJR:

National Science Foundation

BJR:

Career Award at the Scientific Interface from the Burroughs Wellcome Fund.

ADVANCE Program at Brown University