Computational methods for the detection of structural variation in the human genome.


Student Number: 3620557

Master’s programme: Cancer Genomics and Developmental Biology
Utrecht Graduate School of Life Sciences
Utrecht University

Supervisor: Dr. W.P. Kloosterman
Department of Medical Genetics
University Medical Center Utrecht

Master Thesis
17-10-2013

Computational methods for the detection of structural variation in the human genome.

Erik Hoogendoorn



1 Abstract

Structural variants are genomic rearrangements that contribute significantly to evolution and to natural variation between humans, and are often involved in genetic disorders. Cellular stresses and errors in repair mechanisms can lead to a large variety of structural variation events throughout the genome. Traditional microscopy- and array-based methods are used for the detection of larger events or copy number variations. Next-generation sequencing has in theory enabled the detection of all types of structural variants in the human genome at unprecedented accuracy. In practice, a significant challenge lies in the development of computational methods that are able to identify these structural variants based on the generated data. In the last several years, many tools have been developed based on four different categories of information that can be obtained from sequencing experiments: read pairs, read depths, split reads and assembled sequences.

In this thesis, I first introduce the topic of structural variation by discussing its impact in various areas, the mechanisms that can lead to its formation, and the types of structural variation that can occur. Subsequently, I describe the array-based and sequencing-based methods that can be used to detect structural variation. Finally, I give an overview of the tools that are currently available to detect signatures of structural variants in NGS data and their properties, and conclude by discussing the current capabilities of these tools, possible future directions and expectations for the future.

Keywords: Structural variation; Copy number variation; Next-generation sequencing; Detection algorithms; Read pair; Read depth; Split read; De novo assembly.




2 Contents

1 Abstract
2 Contents
1 Introduction
2 Structural variation
  2.1 The importance of structural variation
  2.2 Causes for structural variation
  2.3 Types of structural variation
3 Detecting structural variation
  3.1 Array-based methods
    ArrayCGH
    SNP arrays
    Advantages and limitations
  3.2 Sequencing-based methods
    Read pair
    Read depth
    Split read
    De novo assembly
    Advantages and limitations
4 Computational methods
  4.1 Read mapping
  4.2 Read pair
    Clustering-based methods
    Distribution-based methods
  4.3 Read depth
  4.4 Split read
  4.5 De novo assembly
    Genome assembly
    Identification of structural variation
  4.6 Combined approaches
5 Discussion
  The status quo
  Possible improvements: integration of recent advances
  Future perspectives
6 References




1 Introduction

Structural variation describes genetic variation that affects the genomic structure. Although human genomic variation was first thought to be mostly due to SNPs (Single Nucleotide Polymorphisms), it has become clear that human genomic, and probably phenotypic, differences are related more to structural variation than to SNPs [1,2]. Structural variation can range in size from several bp (base pairs) to entire chromosomes. It contributes significantly to human diversity and disease occurrence, and is an important consideration in any genetic study [3,4].

Studies of structural variation used to be limited to the detection of larger variants, such as aneuploidies and chromosomal rearrangements, using microscopic methods. The development of array-based and, more recently, sequencing-based methods has enabled the detection of smaller, submicroscopic structural variants (SVs) at greater resolution. Next-generation sequencing (NGS)-based methods are theoretically able to identify SVs of all types at previously unattainable speeds and resolution, and several different methods have been developed to detect signals in the data that indicate structural variants, each with its own advantages and disadvantages. However, these methods require extensive computational analysis and the development of various types of algorithms to filter the data, compare it to reference or other samples, and detect the signals associated with structural variation.

Here, I will introduce the effects structural variation can have in humans and other species, the mechanisms that can result in the formation of SVs, and the different types of structural variation that can occur. Subsequently, I will give an overview of the methods that can be used to detect structural variation, and provide an overview of the currently available computational tools for the detection of SVs in the human genome based on next-generation sequencing.

2 Structural variation

2.1 The importance of structural variation

Structural variation is now known to cover more nucleotide variation in the human population than SNPs, and thousands of SVs are likely to be present in each genome [1,2,5]. Many SVs span, relocate or break coding as well as regulatory elements in the genome. This may often have no observable effect, but can also induce dosage effects, gene disruption, new fusion genes, new regulatory cascades, the formation of new SNPs, and differences in epigenetic regulation due to relocation [5-7]. Thus, although many SVs may be neutral, they still introduce a large source of genetic and phenotypic variation, not just between humans but in all species [8,9].

Considering the effects of SVs on phenotypic variation, the occurrence of structural variation is also expected to significantly affect natural selection and thus evolution [5,8]. Indeed, structural variation has been suggested to be related to the evolution of new species as well as to evolution within various species [9-11]. Examples exist in plants [12] as well as primates [13-15], and also for the emergence of human-specific genes [16]. Several papers have shown recent human evolution due to structural variation in genes related to diet, reproduction and disease [17-19].

Structural variation has been characterized extensively in relation to disease. Variants affecting gene regulation or coding sequences may result in a wide variety of genomic disorders [8,20,21]. Two models for the relationship between structural variation and disease have been proposed, based on rare and common structural variation [22]. The first model describes how rare, often de novo, SVs in the population can cause various disorders, collectively accounting for a large fraction of these disorders [22]. Examples are found for various birth defects [23-25], neurological disorders [26-30] and predisposition to cancer [31,32]. The second model concerns SVs common in the population, especially copy number variable gene families, thought to collectively contribute to susceptibility to complex diseases, especially those related to the immune system [22]. Examples for this model are HIV [33], malaria [34] and various immune disorders [35,36]. Although examples can be found for both models, these are probably not comprehensive for all human disease in relation to structural variation. For example, a simple division between rare and common variants may be too simplistic [37]. However, it is clear that the detection of structural variation can have a large impact on the investigation of human disease, both in diagnosis and in treatment [38,39].

In addition to their role in disease, SVs are also essential for the normal functioning of human life. Class Switch Recombination (CSR) [40] and V(D)J recombination [41] are processes that rely on structural variation stimulated by our body itself. These processes are essential for the generation of diverse mature B cells in response to antigen stimulation, and thus for the human immune system. The study of SVs may also tell us more about the genetic mechanisms that shape genome structure as well as genome evolution. Over the last years, the need to take structural variation and its roles into account has become apparent [4]. However, essential for each of these research areas remains the accurate and unbiased identification of structural variants.

2.2 Causes for structural variation

Although first considered to occur randomly [42], structural variants form in specific situations, in response to specific environmental and cellular triggers. Various stressors, like replication, transcription and genotoxic or oxidative stress, or combinations of these, can be the trigger for structural variation [43]. These stresses can result in DNA breaks and stalled DNA replication forks that are sensitive to the formation of structural variants. Specific sequences are more sensitive to structural variation due to their structure, associated proteins or epigenetic modifications [43]. Furthermore, the proteins involved in the generation of functional recombination in the immune system may have off-target effects, leading to double-strand breaks. Subsequent errors in DNA repair or recombination then cause the structural variant to be implemented locally or between two loci in physical proximity.

For example, non-homologous end joining (NHEJ) is an error-prone repair mechanism for DNA double-strand breaks. Individual double-strand breaks are efficiently repaired by classical NHEJ; however, the presence of two double-strand breaks can result in chromosomal translocations. Alternative end joining (A-NHEJ) is a different pathway that is associated with genomic rearrangements, although the precise mechanisms are currently unknown [44]. Allelic homologous recombination repairs double-strand breaks using a template sequence and is relatively error-free. However, defects in homologous recombination can result in non-allelic homologous recombination (NAHR). In this case, non-allelic sequences, often LCRs, LINE-1 and Alu repeat elements or pseudogenes, are used as a template for repair, resulting in structural variation [8].

Additionally, repetitive and transposable elements like those involved in NAHR are considered to contribute to structural variation through the effects of retrotransposition and microhomology, which can result in Complex Chromosomal Rearrangements (CCRs). Several models exist to explain these CCRs. The MMBIR model (microhomology-mediated break-induced replication) posits that single DNA strands of collapsed replication forks anneal to any single-stranded DNA in proximity; subsequent polymerization and template switches then result in CCRs [45]. A similar model, FoSTeS (Fork Stalling and Template Switching), suggests replication fork template switching, but without breaks [15,46]. Finally, intra- and interchromosomal CCRs may result from random non-homologous end joining of fragments after an event termed chromothripsis. In this model, one or multiple chromosomes locally shatter and then fuse again randomly, possibly due to radiation or other events resulting in widespread chromosomal breakage [23,47]. For more information on this topic, please see the comprehensive review by Mani et al. [43].

2.3 Types of structural variation


Structural variation can occur in many types, among which a distinction can be made between copy number variant (CNV) and copy number balanced variants. Copy number balanced SVs include inversions and translocations. Copy number variant SVs include deletions, insertions and duplications. Insertions may involve a novel sequence or a mobile element, and can also result from translocations or duplications. Duplications can occur as tandem duplications, where the duplicated segment remains adjacent to the source DNA, or interspersed, where the duplicated DNA is incorporated elsewhere in the genome. These events may occur intrachromosomally, but also between different chromosomes, leading to interchromosomal variants. The term structural variant was traditionally used to refer to variants larger than 50 bp or 1 kb (kilobase) [22]. However, any variant other than a SNV (Single Nucleotide Variant) may be considered to alter the structure of the chromosome. As some of the methods discussed here are able to identify events of 1 to 50 bp at base pair resolution, the term structural variant is used here for any non-SNV genetic variant.

Of course, one event may include a combination of multiple types of SVs, resulting in more complex patterns or CCRs, where for example an inverted fragment may contain a deletion or an insertion, or any other combination. Detecting CCRs is more problematic for most methods. Additionally, an insertion may correspond to a deletion elsewhere in the genome, resulting in what is essentially a translocation. However, not all methods may detect both events, and they may thus infer CNVs erroneously. Accurate identification of a given SV may therefore require comprehensive identification of all structural variation in the studied genome [48]. The ability to detect these types of variants differs between the various methods used, as discussed below.


3 Detecting structural variation

As mentioned above, structural variants can differ greatly in size. Larger structural variants are considered microscopic variants, as these can be detected using traditional microscopy-based cytogenetic techniques. These include genome-wide techniques like karyotyping, chromosome painting and FISH-based methods. Still commonly used, these methods can identify aneuploidies and most types of structural variants larger than several Mb (megabases). Improvements on these techniques are still being developed, providing higher resolution and sensitivity [49].

For the detection of smaller, submicroscopic SVs with higher resolution and sensitivity, more recent molecular methods are required. These methods can be classified as either array-based or sequencing-based. Common to these methods is that SVs are identified by comparing the experimental genome to a reference or other sample genome, inferring variants from the differences. I will briefly introduce these array- and sequencing-based methods below.

3.1 Array-based methods

Microarrays were originally developed for RNA expression profiling, but now have a wide range of applications, including the detection of structural variation. Microarray-based methods rely on the design of microarray chips on glass slides, using immobilized cDNA oligonucleotides as targets for hybridization by experimental DNA. Although sequencing-based methods for the detection of CNVs are becoming more cost-effective and popular, clinical diagnostics still mainly use microarray screening [50]. Detection of CNVs with array-based methods is possible using two types of microarrays: ArrayCGH (Comparative Genomic Hybridization) and SNP arrays. Recent platforms, marketed by companies like Agilent, Illumina, Roche and Affymetrix, enable the detection of millions of probes on one chip, and new arrays are still being developed that increase the sensitivity and resolution even further.


ArrayCGH

ArrayCGH platforms can be used to detect relative CNVs by competitive hybridization of two fluorescently labeled samples to the target DNA. Experimental DNA is fragmented and fluorescently labeled prior to hybridization. By using different fluorescent dyes for each sample, for example Cy3 (green) and Cy5 (red), the measured fluorescence for each color gives an indication of the abundance of experimental DNA from each sample. It is important to use known reference samples, as a gain in one sample cannot be distinguished from a loss in the other without further information. For accurate identification of SVs, normalization is often needed due to experimental biases for GC content in the DNA and dye imbalance.
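The core of this two-channel comparison is a per-probe log2 ratio of test over reference intensity. A minimal sketch follows; the function names, the median-centering step (a crude stand-in for dye-imbalance normalization) and the ±0.3 cutoff are illustrative assumptions, not the pipeline of any specific platform.

```python
import math

def log2_ratios(test_int, ref_int):
    """Per-probe log2(test/reference) intensity ratios,
    median-centered as a crude global dye-imbalance correction."""
    raw = [math.log2(t / r) for t, r in zip(test_int, ref_int)]
    med = sorted(raw)[len(raw) // 2]
    return [x - med for x in raw]

def call_probes(ratios, threshold=0.3):
    """Label each probe as gain, loss or neutral; the +-0.3
    cutoff is illustrative, not a recommended diagnostic value."""
    calls = []
    for r in ratios:
        if r > threshold:
            calls.append("gain")
        elif r < -threshold:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls
```

In practice, consecutive probes are segmented together rather than called one by one, and GC-content correction is applied before calling.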


The first ArrayCGH experiments used large inserts, for example BACs (Bacterial Artificial Chromosomes), as targets, and were able to detect CNVs in the range of 100 kb and longer [51]. The current use of oligonucleotides allows the detection of CNVs with a resolution of only several kilobases [52,53]. An advantage of ArrayCGH is the availability of custom arrays, allowing its use as a diagnostic platform [50,54]. ArrayCGH platforms can reach high resolutions, especially using custom solutions [2], but cannot match NGS-based methods.


SNP arrays

SNP arrays were originally designed to detect single nucleotide polymorphisms, but have been adapted for the detection of CNVs. Similarly to ArrayCGH, SNP arrays rely on hybridization to target DNA. However, in SNP arrays only the test sample is hybridized, and no competitively hybridizing reference sample is used. The intensity of the fluorescence upon binding is used as a measure of the matching sequences in the sample. For the detection of CNVs, intensities measured across many spots on the slide are clustered. CNVs are detected by comparing these sample values to (a set of) reference values from a database or from a different experiment. Several algorithms have been developed for this analysis, and an overview of these can be found in a review by Winchester et al. [55].

Similar to ArrayCGH, SNP array resolution has increased significantly in the years since its first use [56]. Currently, millions of SNPs can be interrogated on one chip. In addition to improvements in resolution, the design of arrays has focused on incorporating more informative SNPs in regions with known CNVs, increasing the number of variants detected in one experiment [57]. However, this does have an important negative side-effect, as it introduces a large bias towards known CNVs. SNP arrays generally tend to have lower sensitivity in the detection of CNVs compared to ArrayCGH. However, SNP arrays provide advantages such as additional information for genotyping and the parental origin of CNVs, are more accurate in the determination of copy numbers, and allow detection of LOH (Loss Of Heterozygosity) [49].
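The comparison of sample intensities against a set of reference values can be sketched as a per-probe z-score test. This is an illustrative assumption about one simple way such a comparison could work, not a description of any published SNP-array algorithm (see Winchester et al. for the real ones).

```python
def snp_cnv_calls(sample, ref_sets, z_cut=3.0):
    """For each probe, compare the sample intensity to the mean and
    standard deviation of that probe across reference samples."""
    calls = []
    for value, refs in zip(sample, ref_sets):
        mean = sum(refs) / len(refs)
        var = sum((r - mean) ** 2 for r in refs) / len(refs)
        sd = var ** 0.5 or 1e-9  # guard against zero spread
        z = (value - mean) / sd
        if z > z_cut:
            calls.append("gain")
        elif z < -z_cut:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls
```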




Advantages and limitations

A major disadvantage of array-based compared to sequencing-based methods is that only gains and losses relative to a reference can be identified. Thus, balanced variants like translocations and large inversions cannot be identified, meaning that other experiments are needed to identify the location and type of the SV events in the test sample. Array-based methods are also unable to detect smaller variants and have a lower resolution, and thus miss a wide range of SVs that are potentially of interest. However, array-based methods are less costly and have a higher throughput than sequencing-based methods, so it is possible to genotype a larger number of individuals in less time and at a lower cost. Analysis of the data also requires less computational resources than sequencing-based methods. In addition to predesigned genome-wide solutions, it is often possible to order custom designs, allowing studies to focus on regions of interest or to increase overall resolution. Combinations of the two types of arrays have been used to detect CNVs: by integration of results [58], by using SNP arrays for fine-mapping regions identified by ArrayCGH [59], or by using hybrid CGH+SNP arrays [49,60]. These methods could provide more robust identification of structural variation, as well as additional information, compared to existing approaches. This seems prudent, as a recent assessment has shown relatively low (<70%) reproducibility for repeated experiments as well as poor (<50%) concordance between platforms [61].

3.2 Sequencing-based methods

Detection of multiple different types of structural variation based on sequencing methods was first performed using paired-end mapping by Tuzun et al. [62]. This study was based on capillary Sanger sequencing using fosmid-end sequences. Throughput and resolution based on these data are not optimal, but the longer read lengths allow the reliable identification of large variants. The development of high-throughput next-generation sequencing technologies has enabled sequencing of a full human genome within a week. Since 2005, several companies, including 454 Life Sciences, Illumina and Life Technologies, have marketed platforms with ever increasing throughput and base-calling accuracy, longer read lengths, and lower costs compared to traditional capillary methods. More recently, Single Molecule Sequencing (SMS) has become a possibility with Helicos’ HeliScope platform, and non-optical sequencing was introduced with Life Technologies’ Ion Torrent sequencer.

Among other applications, this development in sequencing technology has enabled the genome-wide detection of structural variation at unprecedented resolution and speed. Several methods have been employed for the identification of SVs using NGS data. The most self-evident method would be de novo assembly of a genome, with subsequent alignment to a reference to determine the structural differences. However, de novo assembly of a human genome remains challenging due to the relatively short read lengths generated by NGS platforms [63]. As a result, other methods were developed that use direct alignment of reads to one of the human genome reference assemblies. These methods are the read pair, read depth and split read approaches, and are based on the identification of discordant patterns in sequencing data. I will describe the basic principles of each of these approaches below.


Read pair

As mentioned earlier, the first sequencing-based identification of SVs used a read pair method, which was applied to data from capillary sequencing [62]. The first NGS-based study on the genome-wide identification of SVs applied a similar method, using the same algorithms as the earlier study but without any optimizations for the new type of data [64]. Most current sequencing technologies, excluding SMS platforms, are capable of generating paired-end or mate-paired reads. In paired-end sequencing, both ends of a linear fragment with an insert sequence are sequenced, whereas in mate-pair sequencing a circularized fragment is used. Although the method of generating the read pairs differs, the detection of SVs based on the generated data is essentially the same. An important consideration in the detection of SVs is that the insert size for mate-pair libraries (1.5-40 kb) is often larger than for paired-end libraries (300-500 bp) [65].

Read pair methods detect SVs by mapping read pairs with a predetermined insert size back to the reference genome. When the mapping locations of the reads are assessed, a discordant span or orientation of the read pair indicates the occurrence of a genomic rearrangement (Figure 1). If the reads of a pair map further apart than the insert size, this suggests a deletion, whereas reads mapping closer together, or one read failing to map, suggest a (novel) insertion. Furthermore, insertions of mobile elements or other genomic regions map to the locations in which these are present in the reference genome. Inversion breakpoints are detected by a changed orientation of one of the reads inside the inversion, as well as varying spans for the pairs. Interspersed duplications or translocations can be detected by complex patterns in which, for several pairs, one of the reads maps to a different location or chromosome. Finally, tandem duplications can be detected by read pairs that have a correct orientation but are reversed in their order and differ in their span.
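The signatures above reduce to a test on the span and orientation of each mapped pair. The following is a minimal sketch, assuming both reads map to the same chromosome and that the library's insert-size mean and standard deviation have already been estimated; the three-standard-deviation cutoff and the function name are illustrative choices, not taken from any particular tool.

```python
def classify_pair(pos1, strand1, pos2, strand2,
                  mean_insert, sd, n_sd=3):
    """Classify one read pair mapped to the same chromosome.
    Positions are leftmost mapping coordinates; a concordant
    paired-end pair has opposite strands and a span close to
    the expected library insert size."""
    span = abs(pos2 - pos1)
    if strand1 == strand2:
        return "inversion_candidate"  # one read flipped in orientation
    if span > mean_insert + n_sd * sd:
        return "deletion_candidate"   # reads map too far apart
    if span < mean_insert - n_sd * sd:
        return "insertion_candidate"  # reads map too close together
    return "concordant"
```

A real caller would additionally check read order, mapping quality and inter-chromosomal pairs, and would only report a variant once several pairs support the same signature, as described next.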


As single read pairs are not reliable on their own due to possible mismapping or ambiguous mapping, multiple read pairs belonging to the same variant are clustered to increase the reliability of detection, as well as to identify the breakpoint locations of the variant more accurately.

Libraries with larger insert sizes (several kilobases) are better at detecting larger variants, but are often not able to reliably detect smaller variants due to the distribution of insert sizes [66]. In contrast, libraries with smaller insert sizes cannot reliably detect the larger events, but have higher resolution and are able to detect smaller variants. A major disadvantage of the read pair methods is that insertions larger than the insert size cannot be detected conventionally. Although with lower power, algorithmic detection of these insertions is possible when considering a linked signature, as described by Medvedev et al. [67]. For example, for a large insertion from a region far away in the genome (a translocation or duplication), the read pair will be detected as spanning a huge range in the reference genome, as regions that were first far from each other are now relatively close and used to generate the read pair. By finding this signature for both break-ends (the sequences newly formed by the colocation of the flanking sequences and the insertion) and linking these, it is possible to determine the origin and possibly the size of the insertion. For novel sequence insertions this is more difficult, as one of the reads from the read pair will not map to the genome. In this case, additional steps like assembly or targeted sequencing of the insertion sequence would be required.

Figure 1: The four sequencing-based methods used to identify structural variation, and the signatures that can be detected for each type of SV. The top line indicates reference DNA. Red arrows indicate breakpoints. MEI = Mobile Element Insertion. RP = Read Pair. For a full legend see Alkan et al. [22]. (Copied from Nature Reviews Genetics, Alkan et al. 2011 [22].)



Read depth

Analysis of read depth, also called depth of coverage (DOC), can identify structural variants by evaluating the depth of reads mapped to the reference genome. This approach was first used in combination with NGS data to detect CNVs in healthy and tumor samples from the same individuals [68]. For this method, a uniform distribution of reads is assumed, often according to a Poisson distribution. Sufficient deviation from this distribution is expected to be due to copy number differences in the sequenced genome. Alternatively, the expected copy number of a region can be derived from a comparison of read depth to reads from a control genome. In both variants, a loss region will have fewer reads mapped to it than expected, whereas a gain region will have more reads mapped (Figure 1).
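The windowed depth test can be sketched as follows. Under a Poisson model the variance equals the mean, so the expected count per window also serves as the variance estimate; the window-based framing, function name and z-score cutoff of 3 are illustrative assumptions rather than the settings of any specific tool.

```python
import math

def depth_calls(window_counts, expected, z_cut=3.0):
    """Call gain/loss per fixed-size window. Under a Poisson model,
    counts ~ Poisson(expected), so the standard deviation is
    sqrt(expected)."""
    sd = math.sqrt(expected)
    calls = []
    for count in window_counts:
        z = (count - expected) / sd
        if z > z_cut:
            calls.append("gain")
        elif z < -z_cut:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls
```

Real read-depth callers additionally correct `expected` per window for GC content and mappability, and merge adjacent discordant windows into one event.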

The major disadvantage of read depth compared to the other sequencing-based methods is that only CNVs can be detected. The exact location of events cannot be retrieved, and copy number balanced events like inversions or translocations cannot be detected. However, it is the only sequencing-based method that can accurately predict copy numbers [69]. Larger events are more reliably detected than smaller ones, as the statistical support increases with size. The reliability of the SVs detected is directly related to the sequencing coverage. As a result, the sequencing biases of the different platforms affect SV detection as well. For example, GC-rich or GC-poor regions, as well as repeat regions, are sequenced less reliably, introducing biases [70]. These biases, as well as mismapped reads, influence the data more than in the other sequencing-based methods. Algorithms based on case versus control data suffer less from sequencing biases, as these are assumed to cancel out. However, such approaches are more costly, as they require additional genomes to be sequenced.


Split read

Split read mapping detects structural variation by using unmappable or only partially mappable reads. The breakpoint of an SV is found based on a read that can only be mapped to the genome in two parts. Detection of SVs is similar to read pair-based methods, but instead of two paired reads, two parts of one read are used (Figure 1). A deletion will show a read mapping with alignment gaps in the reference genome, whereas insertions will show alignment gaps in the test genome. As with read pairs, part of a read not mapping may indicate a novel sequence insertion, and partial mapping to a known mobile element in the reference genome indicates a MEI (Mobile Element Insertion). Reads spanning tandem duplications will have the split read mapping in reverse order. Interspersed duplications or interchromosomal translocations will show part of a read mapping to the duplicated region or to another chromosome. Like read pair methods, split read mapping may use clustering of reads to increase the reliability of the findings.

Split read mapping was originally used in combination with Sanger sequencing, which produces longer reads than current NGS platforms [71]. The shorter reads currently generated by NGS platforms significantly reduce the power of SV detection using this method, as part of a split read from NGS is rarely uniquely mappable to the genome. This results in strongly ambiguous, and often impossible, mapping of reads, especially in regions with repeats or duplications. However, it is currently still possible to map breakpoints for smaller deletions (up to ~10 kb) as well as very small insertions (1-20 bp) at base pair resolution by using an algorithm called Pindel [72]. In this method, also called anchored split mapping [67], read pairs are used to select pairs in which one read maps uniquely to the genome and the other cannot be mapped. Knowing the location and orientation of the first read, the second read can be split-mapped using local alignment based on the known insert size, significantly reducing the search space for possible mappings as well as ambiguous mapping. However, this does require that one of the reads is mapped uniquely, which is still not always possible.
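The deletion signature that split mapping looks for can be illustrated with a toy exact-match search: split the read into a prefix and a suffix, and require the suffix to land strictly downstream of the prefix in the reference window. This sketch is not Pindel's pattern-growth algorithm, and the names and the minimum-part-length parameter are hypothetical; real tools search only a window bounded by the insert size of the anchored mate.

```python
def split_map_deletion(read, ref, min_part=3):
    """Find a split of `read` into prefix+suffix that maps as
    prefix...gap...suffix in `ref` using exact matching.
    Returns (prefix_start, suffix_start, deletion_len) or None."""
    for cut in range(min_part, len(read) - min_part + 1):
        prefix, suffix = read[:cut], read[cut:]
        p = ref.find(prefix)
        if p == -1:
            continue
        # the suffix must map strictly downstream of the prefix,
        # leaving a gap in the reference: the deleted sequence
        s = ref.find(suffix, p + len(prefix) + 1)
        if s != -1:
            return p, s, s - (p + len(prefix))
    return None
```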

The advantage of this method is that it can map breakpoints of SVs at base pair resolution. However, for larger events or those involving distant genomic regions this is still problematic. Using anchored split mapping to reduce the search space for split reads is an important step towards making split mapping useful in combination with NGS platforms, but it may be hampered by insertions or deletions between the reads, which affect the mapping distance. Longer read lengths will make split read mapping even more powerful, as unique mapping of at least one end may no longer be required.


De novo assembly

Ideally, full alignment of de novo sequenced genomes against one or multiple reference genome(s) would be used to identify all structural variation in the genome. Depending on the algorithms and reference genome(s) used, this would enable unbiased detection of all types and lengths of SVs. Although studies have described assembly of human genomes based on short-read data, these and other approaches still require assembly to the reference genome. Two human genome assemblies have recently been used to identify structural variation [73]. However, this study was still limited in the identification of SVs in repetitive regions and was only able to identify homozygous SVs. Local de novo assembly is possible in more reliable genomic regions [74]. This allows alignment to the reference genome and subsequent identification of structural variants using the generated contigs. Identification of SVs is then possible using the same principles as in split read mapping, with differences only in the identification of MEIs and tandem duplications (Figure 1). As these contigs are typically much larger than read fragments, this method is much more reliable in the identification of breakpoints and larger SVs.

Although
d
e novo

assembly of genomes and subsequent pairwise comparison is expected to become the
standard method of SV detection
,
this is currently still problematic due to the limited read lengths and
assembly collapse over regions with

repeats and duplications
63
. As these regions are especially susceptible to
the formation of structural variation, this further decreases the reliability of
SV d
etection
due to

false positives
as well as false negatives.
Additionally, differences in coverage between genomic regions due t
o biases affect
assembly
, inducing gaps and complicating statistics in assembly. Finally,
de novo

assembly requires extensive
com
putational resources. In algorithms that reduce the computational requirements, tradeoffs are often
necessary in terms of sensitivity to overlaps. Although improvements in these areas have been made with
newer tools, the problems are still unsolved
74
.
Further development of algorithms and sequencing
platforms

will be required before this method will be able to detect all structural variation reliably
.


Advantages and limitations

A major advantage of sequencing-based methods over array-based methods is the possibility to detect all types of variants in a single experiment, both copy-balanced and copy-variant. Additionally, SVs of a broader range of lengths can be detected with significantly less bias, as the genomic regions measured are not predetermined, as is necessary for microarray probes. The resolution of sequencing enables breakpoint detection at base pair level given high enough coverage, allowing detailed investigation of CCRs as well. NGS-based methods are expected to replace microarrays for SV discovery and genotyping. Although the costs of whole genome sequencing have declined significantly, they are currently still a large factor. This is especially true for genome-wide detection of structural variation, as the reliability of the findings depends in large part on the sequencing coverage attained in the experiment. However, the decline in costs is expected to continue quickly over the coming years, in concert with the further development of single-molecule and third-generation sequencing platforms[65].

A problem common to all methods is the limited read length of current generation sequencing platforms, causing significant ambiguity in the mapping of reads, especially in repetitive regions. Third-generation sequencing technologies with increased read lengths and insert sizes are expected to alleviate these problems at least partially, but the development of new algorithms and the integration of information will also be an important factor.

The different sequencing-based methods each have their own strengths and weaknesses in the detection of SVs. Read pair-based methods are efficient at detecting most types of structural variation and are extensively used; however, the insert size significantly affects the length of the detected SVs. Approaches based on read depth are able to identify sequence copy numbers, but can only detect CNVs, and at poor breakpoint resolution. Although split read mapping can identify breakpoints at base pair resolution, its sensitivity is currently much lower than that of other methods due to unreliability outside of unique genomic regions. Finally, de novo assembly of genomes promises to solve most of these problems, but is currently not yet feasible and depends on the further development of sequencing techniques and algorithms. Several tools have been developed recently to integrate the information from the various methodologies. By combining algorithms, several biases or deficiencies of these methods may be alleviated. Furthermore, several strategies seem more suitable for the detection of certain classes or properties of structural variants. For example, read depth information is more suitable for copy number detection than other methods, and split read information may indicate breakpoints most reliably. Any combination of methodologies will need to take these factors into account.

4 Computational methods

Various tools have been developed for NGS-based detection of structural variation. Here, I will give an overview of the currently available tools for read pair-, read depth-, split read- and assembly-based methods of genome-wide SV detection in the human genome with NGS data. Tools combining the information from several detection methods to improve the results are discussed separately. As read mapping is an important first step for the read pair-, read depth- and split read-based methods, and assembly algorithms are similarly important in the assembly-based identification of SVs, the approaches and tools used for these steps are discussed as well.

An important distinction between the tools is the strategy that is used for alignment of the reads and how the SV identification algorithms process those alignments. The alignment processing strategies can be classified as either 'hard clustering' or 'soft clustering'[75]. Most approaches use hard clustering, considering only the best mapping of each read to the genome for the identification of SVs. This works well for unique regions of the genome, but has lower sensitivity in tandem duplication and repeat regions. Some newer approaches use soft clustering, where reads are mapped to all possible locations and all of these mappings are considered in the detection of putative SVs. Although this increases sensitivity, soft clustering may lead to more false positives and often requires careful filtering of input reads. In sample-reference analyses, these false positives are offset by an increase in true positives, as more SVs are present in total. However, in related samples the false positives may constitute a higher percentage of the total due to the low number of total SVs between the samples. Thus, it is important that the clustering strategy is appropriate for the study, and that the parameters in tools using the soft clustering strategy are well understood and set carefully. Table 1 summarizes the tools used for SV identification as discussed here, showing which clustering approach is used and what types of SVs can be detected, as well as their defining characteristics or advantages over other approaches.
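The difference between the two strategies can be made concrete with a small sketch. The read names, positions and alignment scores below are invented for illustration; real tools work on full alignment records.

```python
# Minimal illustration of hard vs soft clustering of read mappings.
# read -> [(position, alignment_score), ...]; data invented.

mappings = {
    "r1": [(1000, 60), (5000, 55)],
    "r2": [(1010, 60)],
    "r3": [(5020, 40), (1020, 40)],   # ties: two equally good locations
}

def hard_clustering(mappings):
    """Keep only the single best-scoring mapping per read."""
    return {r: [max(hits, key=lambda h: h[1])[0]]
            for r, hits in mappings.items()}

def soft_clustering(mappings):
    """Keep every candidate mapping; the SV caller weighs them later."""
    return {r: [pos for pos, _ in hits] for r, hits in mappings.items()}

hard = hard_clustering(mappings)
soft = soft_clustering(mappings)
# For r3, hard clustering commits to one arbitrary location (the first
# maximum, 5020), so any supporting evidence near ~1000 is lost; soft
# clustering keeps both 5020 and 1020 as candidates.
```

This is exactly the trade-off described above: soft clustering preserves signal in repeat regions at the cost of more candidate (and potentially false) clusters to filter downstream.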

4.1 Read mapping

Except for de novo assembly, all computational methods described here require mapping of the reads to the reference genome as a first step. Many tools have been developed for this purpose, based on several different approaches. These tools mainly differ in how they find the possible mapping locations on the genome, whereas a final alignment step on these candidate locations to determine the scoring is generally performed using the traditional Smith-Waterman[76] alignment algorithm. The first development was the classical "seed and extend" approach[77]. Here, a seed DNA sequence is found using a hash table containing all DNA words of a certain length (k-mers) present in the first DNA sequence (this can be either the reads or the reference genome). The hash table is then used to locate the k-mer sequence in the other DNA sequence. Subsequently, this seed is extended on both sides to complete the alignment. This approach is used in several tools, such as BLAT[78], SOAP[79], SeqMap[80], mrFAST[69], mrsFAST[81] and PASS[82]. This implementation is simple and quick for shorter word lengths, but becomes exponentially more memory-intensive with longer word lengths.
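The seed-and-extend scheme described above can be sketched in a few lines. This is a minimal illustration with invented sequences; real mappers use far more compact indices and proper gapped extension.

```python
# "Seed and extend" sketch: hash every k-mer of the reference, look up a
# read's first k-mer as the seed, then extend the hit by counting
# mismatches over the full read length.
from collections import defaultdict

def build_index(ref, k):
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)       # k-mer -> reference positions
    return index

def seed_and_extend(ref, read, index, k, max_mismatches=1):
    hits = []
    for start in index.get(read[:k], []):   # seed lookup in the hash table
        end = start + len(read)
        if end > len(ref):
            continue
        mism = sum(a != b for a, b in zip(ref[start:end], read))  # extend
        if mism <= max_mismatches:
            hits.append(start)
    return hits

ref = "ACGTACGTTTGCAACGT"
idx = build_index(ref, 4)
print(seed_and_extend(ref, "ACGTTT", idx, 4))   # -> [4]
```

The memory cost noted in the text is visible here: the index holds an entry for every distinct k-mer, so its size grows rapidly with the word length k.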

An improvement on this approach was introduced with PatternHunter[83], which uses "spaced seeds". This approach is similar to the "seed and extend" approach, but requires only some of the seed sequence's positions to match. Thus, if a 5-mer is used, it may be that only the 1st, 3rd and 5th positions need to match the other sequence. This approach is more sensitive and allows for mutations in the seed sequence, but may introduce false matches that slow the mapping process down, and it does not allow indels in the sequence. Many tools were developed based on this approach, including the Corona Lite mapping tool[84], ZOOM[85], BFAST[86] and MAQ[87]. Newer tools like SHRiMP[88] and RazerS[89] improve on this approach by requiring multiple seed hits and allowing indels.
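The spaced-seed idea from the text (a 5-mer in which only the 1st, 3rd and 5th positions must match) can be sketched as follows. The pattern, sequences and function names are illustrative; PatternHunter uses longer, carefully optimized patterns.

```python
# Toy spaced-seed index with "care" positions 0, 2 and 4 of each 5-mer:
# positions 1 and 3 are wildcards, so a mutation there still hits.

PATTERN = (0, 2, 4)                       # positions that must match

def spaced_key(word):
    return "".join(word[i] for i in PATTERN)

def index_spaced(ref, span=5):
    index = {}
    for i in range(len(ref) - span + 1):
        index.setdefault(spaced_key(ref[i:i + span]), []).append(i)
    return index

ref = "ACGTACGGACT"
idx = index_spaced(ref)
# Seed "AGGTA" is "ACGTA" mutated at a wildcard position (1), so the true
# location 0 is still found; position 4 ("ACGGA") shares the same key and
# illustrates the false matches mentioned in the text.
print(idx[spaced_key("AGGTA")])   # -> [0, 4]
```

Both effects described above show up at once: increased sensitivity to point mutations in the seed, and extra candidate locations that the extension step must reject.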

Other "trie-based" approaches are aimed at reducing the memory requirements for alignment and use the "Burrows-Wheeler Transform" (BWT), a technique that was first used for data compression[90]. The term trie comes from retrieval, as the structure can be used to retrieve entire sequences based on their position in a list. Different data structures can be used with this approach, based on prefix trees, suffix trees, FM-indices or suffix arrays, but the search method is essentially the same[91]. In trie-based approaches, the various k-mers are compressed into one string based on their position relative to the start of the string. These can be used to directly search the reference genome, even allowing simultaneous search of similar strings, as these are compressed together. This further decreases the memory requirements and search times, but does require more computing time for the construction of the compressed strings.
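To make the BWT idea concrete, the sketch below builds the transform naively and counts pattern occurrences with FM-index-style backward search. A real FM-index stores compressed rank structures instead of the raw tables kept here; the code is a didactic sketch only.

```python
# Burrows-Wheeler transform plus backward search (the core of BWT-based
# mappers).  Naive O(n^2) construction and O(n) rank queries, for clarity.

def bwt(text):
    text += "$"                                    # unique terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)       # last column

def backward_search(bw, pattern):
    """Count occurrences of `pattern` in the original text."""
    first = sorted(bw)                             # first column of the matrix
    lo, hi = 0, len(bw)
    for c in reversed(pattern):                    # match right to left
        if c not in bw:
            return 0
        lo = first.index(c) + bw[:lo].count(c)     # LF-mapping
        hi = first.index(c) + bw[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("GATTACATTA")
print(backward_search(b, "TTA"))   # -> 2 ("TTA" occurs twice)
```

The search never scans the text itself: each character of the pattern only narrows an interval over the sorted rotations, which is why BWT indices combine small memory footprints with fast queries.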

Several very fast tools like SSAHA2[92], BWA-SW[93], SOAP2[94], YOABS[95] and Bowtie[96] have been created based on this approach. Even faster alignment tools like SOAP3[97], BarraCUDA[98] and CUSHAW[99] combine trie-based approaches with GPGPU computing, taking advantage of parallel GPU cores to accelerate the process.

Most of the newer mapping tools are specifically designed to take into account the properties of NGS platforms: shorter reads, more data and sequencing errors. However, some tools like BLAT, SSAHA2, YOABS and BWA-SW are useful for mapping longer reads. Additionally, some mapping tools are developed specifically for certain platforms. For example, SHRiMP, BFAST and drFAST map color-space reads associated with SOLiD platforms, and the SOAP and Bowtie tools were designed for use with data from Illumina platforms. For more extensive information on this topic, a good review was written by Li et al.[91].

The selection of the mapping tool is an important consideration, also when selecting one specifically for certain SV detection methods. Split read mapping requires specific strategies, and BWA-SW and MOSAIK[100] are examples of the few mapping tools that provide split mapping information. Finally, instead of alignment as a first step, some assembly-based algorithms require (whole genome) alignment as one of the later steps in SV identification, as will be discussed in the section on de novo assembly below.

4.2 Read pair

Many tools have been developed for SV identification based on read pair data. These use algorithms that can be grouped into two categories: those based on clustering and those based on distribution. Algorithms from both categories identify discordant read pairs by differences in span and orientation, and may group read pairs for increased reliability. The difference lies in that clustering-based algorithms identify discordant read pair mapping distances by a fixed cutoff, such as a certain number of standard deviations or a value based on simulations, whereas distribution-based algorithms test the mapping span distribution of a certain cluster of read pairs and calculate the chance of these being discordant by comparison to the genome-wide distribution.


Clustering-based methods

The first read pair-based approaches, using capillary sequencing by Tuzun et al.[62] and using NGS by Korbel et al.[64], both employed a clustering-based approach where a cluster is formed based on a minimum of two read pairs. These approaches used hard clustering of the reads. The standard clustering strategy used here detects SV signatures based on read pairs with discordant span and orientation, as described above in the introduction of NGS-based methods. The span is considered discordant if it deviates four or more standard deviations (SDs) from the mean. The limitations of these studies lie in the reduced sensitivity due to the use of hard clustering, as well as the fixed cutoff for the read pair distance and the number of required read pairs for a cluster, which can affect the specificity strongly depending on the coverage attained[66].
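The standard discordance test is simple enough to sketch directly. The pair records, the "FR" orientation label and the numbers below are invented; real callers estimate the span distribution from concordant pairs only and work on alignment records.

```python
# A read pair is flagged as discordant if its mapped span deviates by
# >= 4 SDs from the library mean, or if its orientation is not the
# expected forward/reverse ("FR").  Toy data.
from statistics import mean, stdev

def discordant_pairs(pairs, n_sd=4):
    spans = [p["span"] for p in pairs]
    mu, sd = mean(spans), stdev(spans)     # toy: real tools exclude outliers
    return [p["name"] for p in pairs
            if abs(p["span"] - mu) >= n_sd * sd or p["orient"] != "FR"]

pairs = (
    [{"name": f"c{i}", "span": 500 + i, "orient": "FR"} for i in range(20)]
    + [{"name": "del1", "span": 5000, "orient": "FR"},   # long span: deletion
       {"name": "inv1", "span": 505, "orient": "FF"}]    # same strand: inversion
)
print(discordant_pairs(pairs))   # -> ['del1', 'inv1']
```

Clustering-based tools then require at least two such pairs supporting the same interval before calling an SV, which is the minimum-cluster-size cutoff criticized in the text.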

The VariationHunter[101,102] tool improves on the previous approaches by using soft clustering, thus increasing sensitivity. The same read pair distance cutoff (four SDs) as in earlier approaches is used. After mapping of all reads, a read is removed from consideration if it has at least one concordant mapping. If a read has only discordant mappings, it is classified as discordant. Each possible mapping is then assigned to each possible cluster of reads indicating an SV. Then, two algorithms may be used for the identification of SVs based on the clusters: VariationHunter-SC (Set Cover) or VariationHunter-Pr (Probabilistic). The first algorithm identifies SVs based on maximum parsimony, selecting clusters so that the number of implied SVs introduced is minimal. The second algorithm calculates the probability of a cluster representing a true SV based on the read mappings, with clusters above a certain probability (90% was used in the paper) identified as SV clusters. Evaluation by the authors showed significant improvement in detecting SVs over previous methods, especially in repeat regions. However, sensitivity was still lacking due to GC content affecting the distribution of reads. Additionally, the fixed read pair distance cutoff means that smaller differences in span with possibly good support are still ignored.
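The maximum-parsimony step can be pictured as a set-cover problem: choose the fewest candidate SV clusters that explain all discordant reads. The greedy sketch below illustrates that idea with invented read and cluster names; it is not VariationHunter-SC's actual algorithm, which operates on real mapping data.

```python
# Greedy set cover: repeatedly pick the candidate SV cluster explaining
# the most still-unexplained discordant reads.

def greedy_set_cover(reads, clusters):
    """clusters: {sv_name: set of read names it would explain}."""
    uncovered, chosen = set(reads), []
    while uncovered:
        best = max(clusters, key=lambda c: len(clusters[c] & uncovered))
        if not clusters[best] & uncovered:
            break                         # remaining reads unexplainable
        chosen.append(best)
        uncovered -= clusters[best]
    return chosen

reads = {"r1", "r2", "r3", "r4", "r5"}
clusters = {
    "del_A": {"r1", "r2", "r3"},
    "dup_B": {"r3", "r4"},
    "inv_C": {"r4", "r5"},
}
print(greedy_set_cover(reads, clusters))   # -> ['del_A', 'inv_C']
```

Note how "dup_B" is never selected: all five reads are already explained by two events, which is the parsimony principle of preferring the smallest set of implied SVs.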

PEMer[103] is a tool combining various functions in an analysis pipeline, with the purpose of SV identification. Reads are first pre-processed based on the sequencing platforms used, and optimally aligned to the reference genome (using hard clustering). Subsequently, discordant read pairs are identified based on the clustering approach. It is possible to merge clusters obtained from different experiments and with different cutoffs in a 'cluster-merging' step. This is a significant improvement over other tools, as it allows the use of multiple cutoffs for cluster formation and a custom cutoff for the calling of discordant read pairs. Furthermore, PEMer is modular and offers extensive customization, allowing improvements to certain modules without having to design an entirely new pipeline. Another advantage is that PEMer can detect linked insertions as described by Medvedev et al.[67], allowing the detection of insertions longer than the library insert size. Although the customizability is a large advantage, the parameters need to be set carefully to ensure good results. Implementation of a soft-clustering mapping algorithm might further increase the sensitivity of this tool.

Another tool using a read pair clustering-based approach is HYDRA[104]. It uses soft clustering, taking into account multiple possible mappings to specifically improve the identification of SVs arising from multi-copy sequences. Multiple mappings of the same read are considered to support the same SV if they span the same interval. Based on the support for each mapping, a variant call is generated for those with the highest support. Subsequently, SV types are identified as in a standard clustering-based approach which, in addition to the standard signatures, is able to detect several other signatures for tandem duplications and inversions that increase the sensitivity for these types of SVs. Although developed for the identification of structural variant breakpoints in the mouse genome, this approach should also be applicable to the human genome. It may be very useful if applied to the specific identification of SVs in repeat and duplication regions. However, a significant risk in using this approach is that many false positives may be introduced if the mappings are not screened properly before the HYDRA tool is used, as mapping quality is not taken into account.

SVM2[105] is a recently introduced tool that uses a read pair-based approach, including non-standard signatures typically found flanking an SV event, to increase the reliability of SV detection. SV-flanking regions have defining properties for insertions larger and smaller than the insert size, as well as for deletions. In addition to the default read pair span changes, OEA read pairs (One-End Anchored: read pairs of which only one read maps) are used. For deletions, there will be a sharp peak of OEA pairs on each strand about as long as the read length, as these reads cannot be mapped in their entirety. For insertions, this peak becomes broader with the size of the insertion until the insert size is reached. Thus, the boundaries of an insertion larger than the insert size can be detected even though no spanning read pairs are available. Statistics on the characteristics of read pairs found around insertion and deletion regions are used in a machine-learning algorithm that detects SVs. A Support Vector Machine (SVM) is trained to recognize each of these statistics so that SVs can be classified into their respective classes. Finally, a post-processing step combines clusters of these sites and identifies the types and lengths of insertions and deletions by standard comparison of the span of read pairs to the global mean insert size. Although the boundaries of insertions larger than the insert size of the library are recognized, the size of these events cannot be determined. A comparison by the authors showed increased specificity in detecting smaller (1-30 bp) insertions and deletions versus BreakDancer. However, the detection of SVs other than insertions and deletions was not implemented. Adapting this method to also consider read pairs that map at great distances might further increase the sensitivity for detecting translocations or MEIs.


Distribution-based methods

Distribution-based detection of discordant read pairs was introduced with the MoDIL tool[106]. Using discordant as well as concordant read pairs, this tool compares the distribution of mapping distances for read pairs in a specific genomic locus to the genome-wide distribution to identify SVs. A shift in the distribution towards shorter spans indicates an insertion, whereas a shift towards longer spans indicates a deletion. This enables the identification of insertions and deletions in the range of 10-50 bp using paired-end data. An advantage of this tool is that heterozygous variants may be identified by observing a shift of half of the read pairs, which is not possible in clustering-based methods. As this tool only detects a very specific length range of insertions and deletions, it is far from comprehensive. However, it is useful for detecting smaller insertions and deletions, possibly as part of a larger pipeline.
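The distribution-shift signal MoDIL exploits can be illustrated with a much simpler mean-shift test. The genome-wide mean, thresholds and spans below are invented, and MoDIL itself fits mixture distributions (which is what lets it see a heterozygous half-shift) rather than comparing means.

```python
# Distribution-based indel sketch: compare the local mean mapped span at a
# locus to the genome-wide mean.  Longer spans imply a deletion in the
# sample, shorter spans an insertion.
from statistics import mean

GENOME_MEAN = 400            # assumed genome-wide mean insert size

def classify_locus(spans, min_shift=20):
    shift = mean(spans) - GENOME_MEAN
    if shift >= min_shift:
        return "deletion", round(shift)
    if shift <= -min_shift:
        return "insertion", round(-shift)
    return "concordant", 0

print(classify_locus([430, 428, 432, 431, 429]))   # -> ('deletion', 30)
print(classify_locus([370, 372, 371, 369, 368]))   # -> ('insertion', 30)
```

Because every pair at the locus contributes, even shifts well below the 4-SD clustering cutoff become detectable once enough pairs are averaged, which is how MoDIL reaches the 10-50 bp range.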


MoGUL[107] was developed based on MoDIL, but uses sequencing data from multiple genomes to enable the detection of common SVs from low-coverage genomes. After a soft clustering step, read pairs from multiple individuals are clustered. Based on these clusters, SV calls are generated from the span distribution in a manner similar to MoDIL. Based on this data, indels of 20-100 bp can be detected if the minor allele frequency (MAF) is at least 0.06. Although rare variants cannot be detected using this method, several variants that were not detected by MoDIL could be detected with the increased power for common variants in MoGUL. Although this tool is not useful for studying a single genome, it is effective in situations where a group of individuals is studied, allowing sequencing at low coverage and thus lower cost to identify common variants. This may be useful in situations where, for example, a familial disease or population differences are studied.

BreakDancer[108] combines clustering-based and distribution-based read pair SV detection by using two different algorithms. BreakDancerMax is used to detect all types of structural variation using the standard clustering strategy. BreakDancerMini is distribution-based and is used to detect smaller insertions and deletions that are not found by BreakDancerMax, typically in the range of 10-100 bp. In addition to the insertions, deletions and inversions detected by previous methods, BreakDancerMax is able to identify inter- and intrachromosomal translocations. A comparison of BreakDancer to VariationHunter and MoDIL by the authors showed increased sensitivity and specificity due to the combination of the two methods, as well as algorithmic improvements enabling the detection of other SV types. However, the detection of variant zygosity as in MoDIL is not possible using BreakDancerMini. Another possible limitation of the BreakDancer tool lies in the detection of SVs in repeat regions, as it relies on hard clustering.



Table 1: Overview of computational tools used for the detection of SVs based on NGS data.
RP: Read Pair, RD: Read Depth, SR: Split Read, BP: Breakpoint, CN: Copy Number, TD: Tandem Duplication, MEI: Mobile Element Insertion, VH-SC: VariationHunter-Set Cover, VH-Pr: VariationHunter-Probabilistic, BDMax: BreakDancerMax, BDMini: BreakDancerMini, EWT: Event-Wise Testing, CBS: Circular Binary Segmentation, MSB: Mean-Shift Based, HMM: Hidden Markov Model, SV: Structural Variant, OEA: One End Anchored, beRD: breakend Read Depth.
*Considers ambiguously mapping reads, but maps these randomly and subsequently uses only that mapping.



4.3 Read depth

Read depth methods can be grouped into two categories: those based on differences in read depth across a single genome and those based on case versus control data. Using a single sample, reads are mapped to the reference genome and CNVs are identified based on the average read depth or the read depth in other regions. Using case versus control data, differences in copy number ratios after mapping to a reference genome are used to identify copy number differences between the two genomes. In both categories, the algorithms use genomic 'windows' in which the read depths are measured, and these windows determine the resolution at which copy number ratios are determined. Windows with similar read depths or copy number ratios are then merged to find CNV regions. Most read depth algorithms discussed here use hard clustering alignment methods, evaluating only the best mapping of each read.
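The windowed workflow shared by these tools can be sketched end to end: count reads per window, flag windows that deviate from the expected depth, then merge adjacent same-direction windows into CNV calls. The window size, thresholds and read positions below are invented for the example.

```python
# Toy windowed read-depth CNV caller: fixed windows, a simple fold-change
# threshold, and merging of adjacent gain/loss windows.

def window_depths(read_starts, genome_len, win=100):
    n = (genome_len + win - 1) // win
    depths = [0] * n
    for pos in read_starts:
        depths[pos // win] += 1          # count reads per window
    return depths

def call_cnv_windows(depths, expected, factor=1.5):
    calls, cur = [], None                # cur = [state, first_win, last_win]
    for i, d in enumerate(depths):
        state = ("gain" if d > expected * factor
                 else "loss" if d < expected / factor else None)
        if cur and state == cur[0]:
            cur[2] = i                   # extend the running call
        else:
            if cur:
                calls.append(tuple(cur))
            cur = [state, i, i] if state else None
    if cur:
        calls.append(tuple(cur))
    return calls

# Simulated reads: ~10 per window, windows 3-4 duplicated, window 7 deleted.
reads = []
for w in range(10):
    n = 25 if w in (3, 4) else 2 if w == 7 else 10
    reads += [w * 100 + i for i in range(n)]
depths = window_depths(reads, genome_len=1000, win=100)
print(call_cnv_windows(depths, expected=10))
# -> [('gain', 3, 4), ('loss', 7, 7)]
```

The algorithms below differ mainly in how they size the windows (fixed, read-count based, FDR based) and in the statistic used at the flagging and merging steps (CBS, EWT, mean shift, BIC).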

The first algorithm used to detect copy number variants from NGS read depth data was an adapted circular binary segmentation (CBS) algorithm[68], originally developed for use with arrayCGH data[109]. This was applied to a case versus control (cancer) dataset to identify somatically acquired rearrangements. The copy number ratio of the two samples was determined in genomic windows. The size of the genomic windows used was non-uniform, requiring 425 reads per window. This allows the resolution to become higher with higher sequence coverage. After mapping the reads to the reference genome, copy number change points were found by using the CBS algorithm for the segmentation of windows with differing copy numbers. The CBS algorithm segments the genome by testing for change points between different parts, testing whether an observation is significantly different from the mean of a segment. This is done recursively, and stops when no more changes can be found.

The readDepth[110] tool uses a CBS-based approach similar to those used in the first read depth-based studies. A major difference is that readDepth does not require the sequencing of a control sample, but calls CNVs based on a single sample. readDepth employs the CBS-based read depth strategy in which the genome is divided into windows and segmented by the CBS algorithm until no more differences in copy number can be detected, identifying CNV regions. However, several improvements over earlier methods are introduced. Genomic window sizes are calculated based on a desired FDR (False Discovery Rate) that can be input by the user, together with the number of reads. Heuristic thresholds for the detection of copy number gain and loss events are calculated based on the desired FDR and the number of reads as well. Furthermore, the readDepth tool is able to process bisulfite-treated reads in addition to regular sequencing reads, and can thus also be used to study epigenetic alterations. Several corrections for biases are introduced as well: the mappability of reads is corrected by multiplying the number of reads in a window by the inverse of the percent mappability determined in a mappability simulation, and regions with extremely low mappability are filtered out. Read counts in each window are also normalized by using a LOESS method to fit a regression line to the data.

RDXplorer[111] is a tool that detects CNVs based on the EWT (Event-Wise Testing) algorithm. This algorithm uses 100 bp windows to identify CNV regions based on the differences in read depth in a single sample. As a first step, all read counts mapped to each window are corrected for GC content. This is done by multiplying the read count for each window by the average deviation from the read count for all windows with the same GC percentage. This manner of GC content correction has been adopted by many other read depth-based tools. The number of reads in each window is then converted into a Z-score in a two-tailed normal distribution. Based on the desired FPR (False Positive Rate), the upper- and lower-tail probabilities identify gains and losses respectively. Afterwards, adjacent windows with a copy number change in the same direction are merged to identify the range of the CNV. The correction for GC content is a positive addition, as this is a significant bias in read depth methods. The authors state that the read counts of 100 bp windows approximate the normal distribution well at 30x coverage, but more flexible settings would be preferable, as these windows may be too small or too large in experiments with better or worse overall coverage.
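The two EWT building blocks, GC correction and Z-score conversion, can be sketched as follows. The correction follows the idea in the text (scale each window by how its GC bin deviates from the global mean); the data, bin labels and thresholds are invented, and RDXplorer's exact formulas may differ.

```python
# EWT-style GC correction and Z-score conversion for windowed read counts.
from statistics import mean, stdev

def gc_correct(counts, gc_bins):
    """Scale each window so that every GC bin has the global mean depth."""
    overall = mean(counts)
    bin_means = {b: mean(c for c, g in zip(counts, gc_bins) if g == b)
                 for b in set(gc_bins)}
    return [c * overall / bin_means[g] for c, g in zip(counts, gc_bins)]

def z_scores(counts):
    """Standardize window counts; extreme tails indicate gains/losses."""
    mu, sd = mean(counts), stdev(counts)
    return [(c - mu) / sd for c in counts]

# High-GC windows (0.6) systematically undercounted; correction levels them.
corrected = gc_correct([100, 100, 100, 50, 50, 50],
                       [0.4, 0.4, 0.4, 0.6, 0.6, 0.6])
print(corrected)             # -> [75.0, 75.0, 75.0, 75.0, 75.0, 75.0]
print(z_scores([10, 20, 30]))  # -> [-1.0, 0.0, 1.0]
```

In EWT proper, the Z-scores' upper- and lower-tail probabilities are then compared to the user's desired FPR to flag gain and loss windows before merging.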


JointSLM[112] is an algorithm that is also based on EWT, but was developed to detect common CNVs present in multiple individuals using multivariate testing. Due to the increased statistical power from including multiple genomes, JointSLM is able to detect smaller CNVs than the EWT algorithm alone. Although it was designed for multivariate testing, this tool may also be used to study a single genome in a manner similar to EWT. Like other population- or group-based algorithms, this may be useful in the detection of CNVs between populations.


CNVnator[113] uses a mean shift-based (MSB) approach to identify CNVs in single genomes. This approach is also derived from an algorithm designed for the identification of copy number shifts in arrayCGH data[114]. The optimal window size is determined as the one at which the ratio of the average read depth to its standard deviation is roughly 4:5. In the MSB approach, copy number variant regions are identified by merging each window with flanking windows with a similar read depth. If a window with a read depth significantly different from that of the merged windows is encountered, a break is detected. Subsequently, CNVs are called based on the probability in a t-test that the read depth of that segment is significantly different from the global read depth. In addition to mapping of unique reads, CNVnator maps ambiguously mapping reads randomly to clustered read placements. Thus, it is not limited to unique regions by using best mappings only, but does not consider all possible mappings either. Read counts are corrected for GC content in a method similar to the one used in RDXplorer.

CNVeM[115] uses read depth in single samples to determine CNVs by assigning ambiguously mapping reads to genomic windows fractionally. It is the only read depth-based tool that explicitly uses soft clustering. After mapping, the genome is divided into windows of 300 bp and an initial estimation of copy numbers is made based on an EM (Expectation Maximization) algorithm. A second step then evaluates all possible mappings of the reads to calculate the posterior probability of each mapping, and assigns reads fractionally to windows based on this probability. This algorithm differentiates between read assignments with differences in sequence as small as one nucleotide, and predicts the copy number of each position. Instead of classifying CNVs as gains or losses, the copy number of each base is determined based on these assigned reads, and the CNVs are derived from this copy number. In a comparison by the authors, this approach was found to have higher accuracy in detecting CNVs than CNVnator. It is also able to detect whether paralogous regions are copied or deleted.

BICseq[116] is a tool that uses the MSB approach for the identification of CNVs, but is designed for use with case versus control data instead of single samples. The definition of windows, merging of windows, and calling of CNVs are done similarly to the process in CNVnator. However, BICseq uses the Bayesian Information Criterion (BIC) as the merging and stopping criterion. By using the BIC, no bias is introduced by assuming a Poisson distribution of reads on the chromosome, increasing the reliability of the results. Furthermore, the case versus control approach is used to correct for the GC content bias.

CNV-seq[117] is a tool for CNV detection based on the case versus control approach. This tool contains a module for calculating the best window size based on the desired significance level, a copy ratio threshold and the attained coverage level. After mapping of the reads to the genome, genomic regions with potential CNVs are identified by sliding non-overlapping windows across the genome and measuring the copy number ratio in each window. The probability of a random occurrence of these ratios is calculated with a t-statistic, based on the hypothesis that no copy number variation is present. The hypothesis is rejected if the probability of a CNV exceeds the user-defined threshold, and a difference in copy number between the two genomes is inferred.
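The per-window signal that CNV-seq and similar case-versus-control tools test is the (log) ratio of case to control read counts. A minimal sketch with invented counts:

```python
# Windowed case-vs-control copy number ratio: log2(case/control) per
# window after library-size normalization.  0 means balanced, +1 a
# doubling, -1 a halving.  Counts invented.
from math import log2

def log2_ratios(case, control):
    scale = sum(control) / sum(case)       # normalize for library size
    return [round(log2((c * scale) / ctrl), 2)
            for c, ctrl in zip(case, control)]

case    = [100, 100, 200, 50, 100, 50]
control = [100, 100, 100, 100, 100, 100]
print(log2_ratios(case, control))
# -> [0.0, 0.0, 1.0, -1.0, 0.0, -1.0]: gain in window 2, losses in 3 and 5
```

CNV-seq then asks, via its t-statistic, whether each window's ratio could have arisen by chance under the no-CNV hypothesis, rather than applying a fixed fold-change cutoff as done here.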

Segseq[118] uses a strategy that focuses on the CBS-based identification of CNV breakpoints from copy number ratios in case versus control data. Similar to CNV-seq, sliding windows are used to compare copy number ratios. However, Segseq has a variable window size based on a user-specified number of required reads. Segseq identifies breakpoints by comparing the copy number ratio in each window to those in the adjacent windows. A significant change in the ratio versus either neighboring window identifies a breakpoint and copy number change. Subsequently, all windows with the same copy number ratio are merged to identify copy number variant and copy number balanced regions.


rSW-seq [119] is a tool that, similar to SegSeq, uses case versus control read depths to identify changes in copy number ratio. However, rSW-seq directly identifies CNV regions by registering cumulative changes in the ratio as breakpoints of CNVs. Reads for each sample are sorted according to their mapping position on the genome, and the read depths of the two samples are subtracted from each other. Local positive or negative sums indicate copy number gains or losses. Regions with equal read depths are ignored, and regions where read depth differences are detected are defined as CNVs. This gives a very intuitive overview of where CNVs are found, and can also identify CNV regions within other CNVs. The resolution of rSW-seq depends on the sequencing depth, but seems limited, as CNVs smaller than 10 kb were not reported. It is the only read depth-based tool discussed here that does not require the specification of genomic windows.
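The cumulative-sum idea behind rSW-seq can be illustrated with a Smith-Waterman-like local scan. This sketch works on binned depths rather than the sorted read positions the actual tool uses, and the function name is invented:

```python
def local_gain_segment(case_depth, control_depth):
    """Find the maximum-scoring contiguous segment of case-minus-control
    read depth: local sums that stay positive suggest a copy number gain.
    The score resets to zero when it goes non-positive, as in
    Smith-Waterman local alignment."""
    best = (0, None, None)            # (score, start, end)
    score, start = 0, 0
    for i, (a, b) in enumerate(zip(case_depth, control_depth)):
        score += a - b
        if score <= 0:                # reset the local sum
            score, start = 0, i + 1
        elif score > best[0]:
            best = (score, start, i)
    return best
```

Running the same scan on the negated differences would locate the strongest loss segment instead.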

CNAseg [120] is another tool that uses genomic windows to identify differences in copy number between case and control data. In addition to LOWESS regression normalization for GC content, CNAseg uses the Discrete Wavelet Transform (DWT) to de-noise the data, smoothing out regions with low mappability. This is necessary because a novel Hidden Markov Model (HMM)-based segmentation step is introduced to segment the windows based on the read depth. An additional algorithm then uses Pearson's χ² test to merge segments with a similar copy number ratio, and the copy number state is estimated by comparing the log ratio of read depths. This identifies segments of contiguous windows with similar read depth, which are then defined as CNVs. The authors showed that this tool increases the specificity and lowers the number of false positives versus CNV-seq without affecting sensitivity.

Unless specified otherwise, the single sample read depth-based tools discussed here assume a uniform Poisson distribution of reads over the whole genome, thus considering any aberration in read depth an effect of copy number. As read depths do in fact vary over the genome due to various biases [70], more accurate models like the BIC better approximate the distribution of reads over the genome. Although all tools described here are able to detect differences in copy number within or between genomes, the actual copy number of these regions is not always automatically determined. In most studies, the copy number is estimated by normalizing the median read depth in a copy number variant region to that of copy number 2 and rounding to the nearest integer [68,111,112]. This has been shown to work well for most platforms by comparison with regions of known copy number; however, the copy numbers did not correlate well for the SOLiD platform [121].
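The median-normalization estimate described above amounts to a one-liner; a minimal sketch (the helper name is invented):

```python
def estimate_copy_number(region_depths, diploid_median_depth):
    """Estimate the integer copy number of a CNV region by normalizing
    its median read depth to the depth expected at copy number 2 and
    rounding to the nearest integer, as described in the text."""
    region_median = sorted(region_depths)[len(region_depths) // 2]
    return round(2 * region_median / diploid_median_depth)
```

For example, a region at 1.5x the diploid depth rounds to copy number 3.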

In a recent review of read depth approaches [121], it was found that the EWT-based tools provide the best results in terms of both sensitivity and specificity. CBS- and MSB-based tools are better at detecting CNVs spanning a large number of windows (50-100), but worse at detecting those spanning a smaller number of windows (5-10). CNAseg performs better on high coverage data, but worse on low coverage data. CNV-seq seems to perform poorer overall. In combination with high coverage data, the EWT-based tools detect CNVs as small as 500 bp, while the CBS- and MSB-based tools identify CNVs with a size of 2-5 kb. Thus, there seems to be a great deal of variation between the performance of different tools, depending also on the type of data that is used.

4.4 Split read

Few tools have yet been developed for the identification of SVs using split read methods with NGS data. Most of these rely on specific alignment strategies to identify breakpoints. Pindel [72] uses a pattern growth algorithm to identify the breakpoints of deletions and insertions. As described above, this tool uses anchored split mapping: read pairs are selected where one read maps uniquely and the other cannot be mapped under a certain threshold. With the uniquely mapping read as the anchor point, the direction of that read as well as the user-specified maximum deletion size are used to define a region where Pindel will attempt to map the other read. This is done using the pattern growth algorithm, which searches for minimum (to find the 5' end) and maximum (to find the 3' end) unique substrings to map both sides of the read. The read is then broken into either two (deletion) or three (short insertion, with the inserted fragment in the middle) fragments. At least two supporting reads are required for each event. Pindel is able to identify breakpoints at base pair accuracy, even for deletions as large as 10 kb. Although the sensitivity of this approach is still problematic in repeat regions, allowing mismatches in the mapping of the anchor read may increase the sensitivity in the future. By reducing the search space, the chance of mapping partial reads to the human genome is significantly increased, making split read mapping possible for NGS platforms. However, the search space may be affected by insertions or deletions in between the reads; combining this approach with information on the mapping distance of surrounding read pairs may increase the accuracy.
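The anchored search-space construction can be sketched as follows. This is a hypothetical simplification: Pindel also accounts for the insert size and its variance, which are omitted here, and the function name and half-open interval convention are invented.

```python
def split_read_search_space(anchor_pos, anchor_strand, read_len, max_deletion):
    """Return the (start, end) reference interval an anchored split-read
    mapper would scan for the unmapped mate. A forward-strand anchor
    implies the mate lies downstream, within the user-specified maximum
    deletion size; a reverse-strand anchor implies it lies upstream."""
    if anchor_strand == "+":
        return anchor_pos + read_len, anchor_pos + read_len + max_deletion
    return max(0, anchor_pos - max_deletion - read_len), anchor_pos
```

Restricting the mapping attempt to this window is what makes exhaustive split-read placement tractable for short reads.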

The AGE (Alignment with Gap Excision) tool [122] adopts a strictly alignment-based approach to split read mapping. Given two sequences in the approximate location of SV breakpoints, it simultaneously aligns the 5' and 3' ends of both sequences, similar to Smith-Waterman local alignment. The final alignment is constructed by tracing back from the maximum position in the matrix of each alignment and then aligning the 5' and 3' ends. The SV region is the unaligned region in between. This approach is able to identify SV breakpoints with base pair accuracy, and also the exact SV length and sequence if the whole sequence is supplied. However, it requires external identification of the SV region as well as two sequences as input. These sequences need to be unique enough for proper alignment, which means that either the putative SV region needs to be small enough or the provided sequences long enough, which is often difficult to obtain with current NGS platforms. The SV type needs to be determined by additional processing of the results. Considering the input and additional processing needed, the alignment algorithm is only useful for SV identification as part of a larger pipeline.

ClipCrop [123] detects SVs using soft-clipping information. Soft-clipped sequences are defined as partially unmatched fragments in a mapped read. The unmapped parts of partially mapped sequences are used, with a minimum length of 10 bp. Subsequently, these clipped reads are mapped to the reference genome within at most 1000 bp on either side of the mapped part. Sequences mapping further ahead indicate deletions, inversely mapping sequences indicate inversions, sequences mapping before the mapped read indicate tandem duplications, and a cluster of unmapped reads from both sides indicates insertions. Similarly to read pair methods, additional tandem duplications over those present in the reference genome cannot be detected. Remapping of unmapped reads is used to differentiate between novel insertions and mobile element insertions/translocations, with novel insertions not expected to map to the reference genome. Clipped reads are clustered if they support the same event, and a reliability score based on this support is used to determine the most likely event. ClipCrop is able to detect a larger variety of signatures, and is not limited by the direction of the search space like Pindel. Furthermore, ClipCrop was shown to detect short duplications (<170 bp) more efficiently than CNVnator, BreakDancer and Pindel based on simulated data. However, the detection of larger events was worse than with other methods.
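Extracting clip candidates from a CIGAR string, the first step of any soft-clip-based caller, can be sketched as below. The representation (side labels, the 10 bp minimum taken from the text) is illustrative, not ClipCrop's actual data model.

```python
import re

def soft_clips(cigar, seq, min_len=10):
    """Extract soft-clipped fragments (CIGAR 'S' operations) from a read,
    as ClipCrop-style callers do before remapping them near the anchor.
    Returns (side, clipped_sequence) pairs for clips of >= min_len bp."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    clips = []
    if ops and ops[0][1] == "S" and int(ops[0][0]) >= min_len:
        clips.append(("left", seq[:int(ops[0][0])]))
    if len(ops) > 1 and ops[-1][1] == "S" and int(ops[-1][0]) >= min_len:
        clips.append(("right", seq[-int(ops[-1][0]):]))
    return clips
```

Each returned fragment would then be remapped within the 1000 bp window around the anchored part to classify the event.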

4.5 De novo assembly

Assembly-based identification of structural variation requires two steps: the assembly of the sequence, and the alignment of this sequence against a reference genome for detection of the variants. Assembly can be performed either completely de novo, or by using varying degrees of information from a reference assembly. Assembly can currently be used to identify SVs in two ways: local sequence assembly allows the reconstruction of loci with possible variants, and whole genome assembly would provide the most comprehensive identification of structural variation in a genome by aligning (large parts of) whole genomes. Alignment to the reference genome may then identify all types of SVs as well as CCRs using methods similar to split read mapping.


Genome assembly

The first step, genome assembly, is not a trivial task. Several recent reviews have been published on this topic, explaining it in more detail [63,74,124]. Repeat sequences, read errors and heterozygosity present the greatest challenges here, and the short read length of NGS platforms complicates them even more. Assemblers previously used for Sanger sequencing reads were insufficient for use with NGS data, so several new assemblers have been developed to deal with these problems. NGS assemblers can be divided into four categories: greedy algorithms, Overlap/Layout/Consensus (OLC) methods, de Bruijn Graph (DBG) methods and string graphs [74,124].

Most early assemblers used greedy algorithms. These operate by simply extending the seed sequence with the next highest-scoring overlap until extension is no longer possible, where the score is calculated based on the amount of overlapping sequence. A problem with these algorithms is that false positives are easily added to a contig, especially with shorter reads: two identical overlapping sequences in the genome may lead to the incorporation of unrelated sequences, producing a chimera. Several assemblers using greedy algorithms are SSAKE [125], SHARCGS [126] and VCAKE [127]. This category of assemblers is generally not used for NGS platforms, except when assembly is performed in combination with Sanger sequencing data.

Overlap/Layout/Consensus assembly was used extensively for Sanger data, but some assemblers have been adapted for use with NGS data. OLC assembly involves three steps: first, all reads are aligned to each other in a pair-wise comparison using the seed and extend algorithm. Then, an overlap graph is constructed and manipulated to get an approximate read layout. Finally, multiple sequence alignment (MSA) determines the consensus sequence. Examples of assemblers that use this approach are Newbler [128], which is distributed by 454 Life Sciences, and the Celera Assembler [129], which was first used for Sanger data and subsequently revised for 454 data, now called CABOG. Edena [130] and Shorty [131] use the OLC approach for the assembly of shorter reads from Solexa and SOLiD platforms.

The de Bruijn graph approach has been widely adopted and is mostly applied to shorter reads from Solexa and SOLiD platforms. Instead of calculating all alignments and overlaps, this approach relies on k-mers of a certain length that are present in any of the reads. k-mers must be shorter than the read length, and are represented by nodes in the graph. These nodes have connections (edges) to other nodes based on which other k-mers they co-occur with in the same read. Ideally, the k-mers would form one path that can be traveled to reconstruct the entire genome. However, this method is more sensitive to repeats and sequencing errors than OLC, and many branches can be found in these graphs. Disadvantages of DBG assembly are that information from reads longer than the k-mers is lost, and that the choice of k-mer size has a large effect on the results. Some assemblers use approaches that still include read information during assembly, but require more computational power. Euler [132] was the first assembler to use the DBG approach. Velvet [133] and ALLPATHS [134] were introduced later, offering improved assembly speed and contig length and allowing the use of read-pair data; these assemblers are able to assemble entire bacterial genomes. ABySS [135] was the first assembler used to assemble a human genome from short reads. SOAPdenovo [136] was introduced later and is also able to assemble larger (and human) genomes.
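The k-mer graph idea can be made concrete with a toy example. This is purely illustrative: it omits the error correction, coverage tracking and repeat resolution that real assemblers such as Velvet add on top.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Toy de Bruijn graph: nodes are (k-1)-mers, and each k-mer seen
    in a read adds an edge from its (k-1)-mer prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk(graph, start):
    """Follow unambiguous edges from `start` to spell out a contig;
    branches (caused by repeats or errors) stop the walk."""
    contig, node = start, start
    while len(graph.get(node, ())) == 1 and len(contig) < 1000:
        node = next(iter(graph[node]))
        contig += node[-1]
    return contig
```

Two overlapping reads from the genome ACGTGCA produce a single unbranched path, so the walk recovers the full sequence; a repeat would create a node with two outgoing edges and break the contig there.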

Finally, string graphs can be used to compress read and overlap data in assembly [124]. The primary advantages of string graphs over DBGs are that the data is compressed further, so assembly can be performed more efficiently, and that the full reads can be used instead of k-mers. String graphs are based on the overlap between reads or k-mers. Similarly to DBG assembly, each sequence is represented by a node, and nodes have edges to other nodes with overlapping sequence. In this case, the edges are represented by the non-overlapping sequence between the nodes. Thus, all possible paths are constructed while the entire sequence remains retrievable by following the edges. This approach is used by the String Graph Assembler (SGA) [137], which is able to assemble an entire human genome on one machine and corrects single-base sequencing errors.

Several updated assemblers like ALLPATHS-LG [138], Velvet 1.1 and Euler-USR [139] show significant improvements over their predecessors. For example, they allow the incorporation of longer reads and mate-paired reads to enhance the assembly of shorter reads, are able to assemble larger genomes, and accept input data from more NGS platforms. Although de novo assembly of human genomes using shorter reads is now possible, several limitations remain. In addition to significant sequence contamination, it was found that de novo assemblies are significantly shorter than the reference genome, and that large parts of repeated (420.2 Mb) and duplicated sequence (99.1% of the total duplicated sequence in the reference genome) are missing from genomes assembled from NGS data [63]. Until the introduction of more reliable 'third-generation' sequencing with longer read lengths, it remains important to include data from established large-molecule sequencing methods to inform and control the data generated with NGS platforms. Using information from alignment to the reference genome may also help to increase the reliability of assembly. For example, the Cortex assembler [140,141] used in the 1000 Genomes Project can use varying degrees of information from a reference genome for assembly. However, using a reference genome may bias the data, and the problems in repeat and duplication regions will remain due to alignment problems in these regions.


Identification of structural variation

Although much work has been done to improve assembly algorithms, the identification of structural variation using this data has been studied far less. This is partially due to the problems and costs still involved with de novo assembly, prohibiting the use of assembly methods to detect SVs [142]. Ideally, a fully accurate sample genome could simply be compared to a reference genome by alignment, with differences in the alignment indicating SVs, as indicated in Figure 1. However, in addition to full de novo assembly currently not being possible, proper alignment of genomes and detection of these signatures remain significant challenges.

Currently, the assemblers discussed here may also be used to reconstruct smaller genomic regions and identify structural variation by alignment of those regions. The AGE tool [122] that was discussed for split read mapping is able to align large contigs, even with multiple SVs, potentially enabling it to identify SV regions based on de novo assembled contigs as well. As the methodology for the identification of SVs using de novo assembly data is similar, other split read-based methods may also be adapted for use with assembly data.

Another tool called LASTZ [143], based on BLASTZ [144], was optimized specifically for aligning whole genomes. It was recently used in the detection of structural variation in two de novo assembled human genomes [73,136]. After whole genome alignment, the gaps and rearrangements in the alignment were extracted as putative SVs. Subsequently, over 130,000 SVs of several types (inversions, insertions, deletions and complex rearrangements) were identified in each genome. However, the methodology for the identification of specific variants was not discussed.

A tool called NovelSeq [145] was designed specifically for the detection of novel sequence insertions in the genome. The first step in this process is the mapping of all read pairs to the reference genome using mrFAST. Read pairs of which neither read can be aligned are classified as orphan reads, and if only one read can be aligned the read pair is classified as one-end anchored (OEA). The hypothesis is that these orphan and OEA reads belong to novel sequence insertions. Subsequently, all orphan reads are assembled into longer contigs using available assembly tools such as EULER-USR and ABySS. The OEA reads are then clustered on the reference genome to find reads supporting the same insertion. Clustering is performed by an algorithm with a maximum parsimony objective, implying as few insertions as possible while explaining all OEA reads. Finally, these OEA clusters are assembled into longer contigs as well, which are used to anchor the orphan (insertion) contigs by aligning overlapping sequences. The identification of novel sequence insertions is an important step in the characterization of all structural variation in the human genome. Several insertion breakpoints could not be identified conclusively or at base pair resolution, as multiple insertion breakpoints may be identified due to OEA clusters mapping ambiguously to the genome. However, the information provided could significantly reduce the search space for these breakpoints, allowing other methods (e.g., split read mapping) to validate them reliably.

The Cortex assembler [141] introduces a novel way to detect SVs based on DBG assembly. Colored de Bruijn graphs (CDBGs) are an extension of the classical DBGs: in CDBGs, multiple samples are displayed in one graph, and the nodes and edges are colored based on the sample they derive from. These samples may be different sequenced genomes, reference sequences, known variant sequences or a combination of those. The alignment of these samples will show 'bubbles' where the sequences differ, with different types of bubbles indicating different variants. The simplest bubbles to detect are those for SNPs, which can be detected using the Bubble-Calling (BC) algorithm. Deletions and insertions, where either the reference (deletion) or the sample (insertion) shows a bubble, are also detectable using the Path-Divergence (PD) algorithm. Although other types of SVs can theoretically be detected as well, these signatures are more complicated, confounded by branching paths in the assembly due to repeats or duplications. Thus, the identification of SVs currently seems reliable only in non-repetitive regions. SVs defined as complex have been reported, but these were not classified further. Cortex also allows population-based investigation by aligning multiple genomes, and can identify novel insertions based on this information. Assemblies could still be improved by using read pair information, and SV classification does not yet seem to be fully implemented. Although the reliability of this method has not yet been investigated thoroughly, this tool is an important step toward complete assembly-based identification of SVs.

4.6 Combined approaches

Genome STRiP [146] uses read pair and read depth information to identify SVs in populations, and identifies breakpoints by using assemblies of unmapped reads from another study [140] to span potential breakpoints. This tool was designed for use with 1000 Genomes Project [147] data, specifically to reduce false positives in SV identification, especially in population studies. After read pair-based detection of discordant read pairs, those in the same genomic region sharing a similar difference in insert size are clustered across different samples to increase the power of SV detection. Furthermore, heterogeneity in a population is used to filter out possible false positives that appear thinly in many genomes, while keeping variants with a high signal in one or multiple genomes. The correlation between read depth and read pair information is also used to filter out false positive SVs: if read pairs indicate a possible deletion, it should be supported by a lower read depth in those samples with the detected deletion, but not in general. The approximate breakpoints based on read pair and read depth data could be resolved by assembly of unmapped reads, allowing the identification of breakpoint locations at base pair resolution. In a comparison by the authors, Genome STRiP was found to detect fewer false positives and more deletions in total than other methods. The detection of rare alleles in the population with a sensitivity comparable to single-sample methods required higher than average coverage (8x, against an average of 4x). For the detection of smaller deletions (<300 bp), Pindel was more effective. This approach is currently only able to identify deletions in large populations, but the identification of other types of SVs is being worked on. Further development of these methods may allow reliable population-based identification of structural variation by integrating many different signals.
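The read-depth concordance check described above can be sketched as a simple filter. The threshold and function name are invented for illustration; Genome STRiP's actual model is a statistical framework rather than a fixed ratio cutoff.

```python
def concordant_deletion(carrier_depths, noncarrier_depths, max_ratio=0.75):
    """Concordance filter sketch: a deletion supported by discordant
    read pairs in some samples should also show reduced read depth in
    exactly those samples, not genome-wide. Returns True when the mean
    depth in putative carriers is clearly below that of non-carriers."""
    carrier = sum(carrier_depths) / len(carrier_depths)
    other = sum(noncarrier_depths) / len(noncarrier_depths)
    return carrier < max_ratio * other
```

A read-pair cluster whose carrier samples show normal depth would fail this check and be discarded as a likely false positive.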

CNVer [148] is a tool that combines read depth information with read pair mapping for the accurate identification of CNVs. Without typing the SVs, discordant read pair mapping information is used to identify regions that differ between the reference and the donor genome. Independently of this, the read depth signal is used to identify regions with losses or gains. These signals are considered jointly in a framework termed a 'donor graph'. Reads that map to several locations are considered for each location, and connected if adjacent in the reference genome or connected by read pairs. Based on traditional differences in read depth and the presence of discordant mate pairs, CNV calls are made, and this data is also used to predict the copy count of each region. This method has several advantages over read depth- or read pair-only methods. The location of deletions detected by read depth can be determined by using the read pair signature. The tool uses soft clustering, which increases the sensitivity in repeat and duplication regions, and requires information from both read depth and read pair signals to reduce false positives. Furthermore, it is possible to detect regions with additional tandem duplications over those already present in the reference genome, as well as insertions larger than the insert size, which is not possible using traditional read pair methods. However, SVs other than deletions can only be called as CNVs without a specific location. A comparison to other read depth and read pair-based approaches by the authors shows that the method is less sensitive to noise and false positives, but many confirmed SVs are still detected by either read pair or read depth methods alone and not by CNVer, which indicates that the sensitivity is not maximized.

Another tool that uses both read depth and read pair information is GASVPro [149], which integrates both signals into a probabilistic model. In read mapping, GASVPro is able to consider all possible alignments by using a Markov Chain Monte Carlo (MCMC) approach that calculates the posterior probability of each variant over all possible alignments of each read (soft clustering), but a hard clustering approach (GASVPro-HQ) is also available. In addition to the standard read pair signatures, the read depth is used in the form of localized coverage drops that occur at the breakpoints of both copy number-variant and -invariant SVs. This signal is called breakend read depth (beRD), and is also used to predict the zygosity of variants. GASVPro uses both the number of discordant read pairs and the beRD signatures at each breakpoint to determine the probability of each potential SV and remove false positives. A comparison to other SV detection methods on low coverage data, including BreakDancer, Hydra and CNVer, showed comparable sensitivity but much higher specificity in detecting deletions for lower quality data when using GASVPro, as far fewer false positives were predicted. For insertions, GASVPro was the most sensitive method, but at the cost of many possible false positives. On higher coverage data, tools that use a hard clustering approach (BreakDancer, Hydra-HQ, GASVPro-HQ) performed better. The increased specificity when considering both read pair and read depth signals is effective for detecting large deletions reliably, and with further implementation may be useful in the detection of other types of variants. However, the detection of inversions was not significantly improved by using the beRD signal, and the detection of SV types other than insertions and deletions has not been implemented.

inGAP-SV [150] uses read depth and read pair data to detect and visualize large and complex SVs. After alignment of reads to the genome, the read depth signature is used to detect SV 'hotspots' by gap signatures, the drops in read depth that are also called beRD signatures in GASVPro. In these hotspots, SVs are called and classified based on a heuristic cutoff for discordantly mapping read pairs. The called variants are then evaluated based on information on mapping quality and read depth. inGAP-SV can identify different types of SVs, including large insertions and deletions, translocations and (tandem) duplications. The beRD is also used to determine the zygosity of variants, as the regions flanking homozygous SVs are expected to have a read depth of zero, whereas for heterozygous events it would be reduced to about half the local read depth. Novel insertions larger than the insert size are also detected by looking for OEA reads. Finally, the results are visualized in a genome browser-like display, with different representations for different signatures; the user can then inspect this information and annotate the putative events. The authors compared the detection of deletions for a confirmed reference set against other tools, including BreakDancer and VariationHunter, and found that inGAP-SV's combined approach was more sensitive. A comprehensive comparison of the detection of other types of SVs was not performed. The visualization supplied by inGAP-SV is unique among the investigated tools, and may be very useful for researchers to investigate regions of interest in more detail.
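The flanking-depth zygosity reasoning described above can be sketched as follows; the tolerance and the function name are invented for illustration.

```python
def call_zygosity(breakpoint_depth, local_depth, tol=0.25):
    """Classify a deletion from the read depth at its breakpoints:
    ~0x the local depth suggests a homozygous event, ~0.5x suggests
    a heterozygous one, anything else is left unclassified."""
    ratio = breakpoint_depth / local_depth
    if ratio < tol:
        return "homozygous"
    if abs(ratio - 0.5) < tol:
        return "heterozygous"
    return "unclear"
```

In practice a tool would use counts aggregated over the whole flanking window rather than a single-position depth.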

A recently introduced tool called CREST [151] uses soft-clipped reads in a method similar to the one used by ClipCrop [123]. However, CREST uses a case versus control approach that first filters out any soft-clipped reads that also occur in the control genome. This removes artifacts and allows the detection of somatic variants in, for example, cancer samples. After mapping to the reference genome, all soft-clipped reads mapping to the same location are first assembled into a contig; thus, CREST uses a combination of split read and assembly methods. The contigs are then remapped to the genome iteratively using BLAT to identify candidate partner breakpoints on the genome. For each such breakpoint, the reads are similarly assembled into a contig. Based on the alignment between the two assembled breakpoints and their mapping locations, a putative SV is called, and the SV type can then be identified in a method similar to the one used by ClipCrop. CREST is able to detect all signatures except tandem duplications. Unlike other split read-based methods, CREST is designed for the detection of larger events. In a comparative analysis using simulated data by the authors, CREST outperforms both BreakDancer [108] and Pindel [72] in terms of sensitivity and specificity. This may be due to the nature of the data, as CREST is designed to detect somatic events and the other tools are not.

Finally, SVMerge [152] attempts the integration of data from several SV calling tools into one pipeline to maximize the number of SV calls in one run. BreakDancerMax is used to call deletions, insertions, inversions and translocations based on read pair mapping. Pindel is used to call insertions of 1-20 bp and deletions smaller than 1 Mb. RDXplorer is used to detect CNVs based on read depth information and to determine the zygosity of events. SECluster and RetroSeq are two tools that were developed specifically for implementation in this pipeline to detect insertions. SECluster detects large insertions by identifying clusters of OEA reads, similarly to NovelSeq. RetroSeq detects MEIs by looking for read pairs of which one read maps to the reference and the other read can be aligned to a known mobile element in Repbase [153]. After all calls have been made, they are validated by de novo assembly of the reads at predicted SV breakpoints. These contigs are aligned to the reference genome, and the results of this alignment are used to confirm breakpoints, increase the resolution, and filter out false positives if the alignment does not match the predicted SV. As heterozygous events cannot be validated by de novo assembly, read depth information is also used in this step. This pipeline is able to determine CNVs as well as the location of deletions detected by read depth analysis, as well as insertions and inversions that are confirmed by assembly. The pipeline was found to significantly decrease the false discovery rate of the individual tools used. SVMerge is the first meta SV caller and introduces important validation steps after the merging of results from different tools. However, a 50% overlap is used as the requirement for merging calls from different tools, which is a rather permissive cutoff and may result in the merging of different events. Although SVMerge still only detects a subset of structural variants, the pipeline is modular, which means that the sensitivity may be raised even more by the integration of other tools. Actual integration of the signals before calling the SVs would enhance the specificity of detection even more, and would be the next logical step in combining the NGS-based signals that can be used for the detection of SVs.
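A 50% overlap rule like the one applied when merging calls can be sketched as follows. Whether the overlap must hold reciprocally for both calls is not stated in the text, so this reciprocal variant is an assumption.

```python
def reciprocal_overlap(a, b, frac=0.5):
    """Decide whether two SV calls, given as (start, end) intervals,
    should be merged: the shared interval must cover at least `frac`
    of each call (reciprocal overlap)."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    return shared >= frac * (a[1] - a[0]) and shared >= frac * (b[1] - b[0])
```

The reciprocal form prevents a short call from being absorbed into a much longer one that it barely overlaps, one way such a cutoff can merge genuinely different events.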


5 Discussion


The status quo

Here, I have given an overview of the currently available tools for the detection of structural variation with NGS platforms. As discussed in the introduction of the four NGS-based methods of SV detection, each method has its own advantages and limitations. An evaluation of the performance of each of the discussed tools is beyond the scope of this thesis, but a comprehensive comparison would be difficult, as most tools focus on the detection of a different or new class of structural variation by introducing a new method or algorithm. Most papers accompanying the introduction of a new tool provide a comparison against other tools, but mostly focus on a comparison of their own abilities as a proof of principle, without considering the full spectrum of structural variation. Read depth approaches alone seem to have matured enough that most tools aim at detecting the same range of SVs. This is possibly due to the fact that many of these are based on algorithms first applied to microarray data, and that only a limited spectrum of variation is detectable from the read depth information. A good review of the performance of read depth-based tools was recently published by Magi et al. [121].

From the information gathered here, we can see trends for the development of tools in each of the four NGS-based methods of SV detection. Most of the new tools for each of the methods have focused on strengthening the advantages and minimizing the disadvantages inherent to the information that is used. Read pair methods are able to detect the broadest range of SV types and sizes, but the detection of larger insertions is limited by the insert size, and the variance in library insert size limits the resolution. Newer read pair-based tools have focused on removing the limitations of this method by using both clustering- and distribution-based algorithms to increase the range of detectable SV sizes, and have developed algorithmic strategies to detect additional signatures associated with structural variation.
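The core read pair signature is easy to sketch: pairs whose mapped distance deviates strongly from the library insert size distribution are flagged as discordant, and the flagged pairs are then clustered by position to call an event. The threshold and function below are a generic illustration of the idea, not the algorithm of any specific tool mentioned here.

```python
from statistics import mean, stdev

def flag_discordant(pairs, k=3.0):
    """pairs: list of (left_pos, right_pos) mapped positions of read pairs
    on one chromosome. Flags pairs whose apparent insert size lies more
    than k standard deviations from the library mean, the classic
    deletion/insertion signature of read pair methods."""
    sizes = [right - left for left, right in pairs]
    mu, sd = mean(sizes), stdev(sizes)
    return [(left, right) for left, right in pairs
            if abs((right - left) - mu) > k * sd]
```

A real caller would additionally require several independent pairs to support the same putative breakpoint before emitting a call, which is what the clustering step provides.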
Read depth methods are able to detect CNVs efficiently and can determine the local copy number, but are unable to identify copy number balanced variants or the location of the detected CNVs. Most read depth-based tools focus on minimizing experimental biases like the GC content and the mappability of reads, and on using more advanced statistical models to increase the accuracy and the resolution.
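A common form of the GC correction mentioned above rescales each window's read count by the median count of all windows sharing the same GC fraction. The sketch below (names and binning are my own) illustrates the principle used, in some variant, by several read depth tools.

```python
from statistics import median

def gc_correct(counts, gc, bins=20):
    """Rescale per-window read counts by the median count of windows in
    the same GC bin, so that GC-rich and GC-poor windows become
    comparable; gc values are fractions in [0, 1]."""
    overall = median(counts)
    by_bin = {}
    for count, frac in zip(counts, gc):
        by_bin.setdefault(int(frac * bins), []).append(count)
    bin_median = {b: median(v) for b, v in by_bin.items()}
    return [count * overall / bin_median[int(frac * bins)]
            if bin_median[int(frac * bins)] > 0 else count
            for count, frac in zip(counts, gc)]
```

After correction, a window whose depth still deviates from the genome-wide expectation is a CNV candidate rather than a GC artifact.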

Split read methods are able to determine breakpoints at base pair resolution, but are currently only effective in unique regions of the genome due to ambiguous mapping of short read lengths. Several tools have now been developed to use split read mapping signatures for the identification of SVs. Algorithms that use split read mappings are able to detect most SV types at high resolution. However, larger and more complex events cannot be detected yet, and will require longer read lengths than those available from current generation sequencing platforms.
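The split read idea can be illustrated with a toy deletion: the prefix of a read maps up to the breakpoint and the suffix maps further downstream, and the gap between the two mapped segments gives the deleted length. This is a naive exact-matching sketch for illustration only; real tools such as Pindel use anchored pattern growth rather than brute-force search.

```python
def split_map_deletion(read, ref):
    """Find a split such that read[:i] and read[i:] both match ref exactly,
    with the suffix mapping strictly downstream of the prefix. Returns
    (breakpoint_in_ref, deletion_length) or None."""
    for i in range(1, len(read)):
        prefix, suffix = read[:i], read[i:]
        p = ref.find(prefix)
        if p < 0:
            continue
        # the suffix must map past the end of the prefix to imply a deletion
        q = ref.find(suffix, p + len(prefix) + 1)
        if q >= 0:
            return p + len(prefix), q - (p + len(prefix))
    return None
```

Because both halves must map uniquely, the approach degrades quickly in repetitive sequence, which is the limitation noted above.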

Finally, reliable de novo assembly of a full human genome is still not possible due to technical limitations in repetitive regions. Due to these limitations, significant biases in the detection of SVs in these regions are present in current assembly-based approaches63. Current tools for assembly-based SV identification rely on the assembly of shorter contigs or focus on non-repetitive regions, but are only able to detect a limited range of structural variation. However, as the technical limitations are expected to be reduced significantly in the third generation of sequencing platforms, the new algorithms and improvements introduced in these tools provide an important first step towards comprehensive identification of structural variation based on de novo assembly of genomes.

Clearly there have been great advances in the development of computational methods for NGS-based SV detection in recent years. However, none of the four NGS-based methods is comprehensive, and strong biases are still present in each of the methods. Application of read pair, split read, and read depth methods to the same dataset has shown that a significant fraction of the SVs detected remains unique to a single method22. Thus, the sources of information need to be combined in order to maximize the identification of structural variation in a human genome. This is true at least until complete de novo genome assembly becomes a viable option, but probably even afterwards, as assembly-based methods alone are not able to identify the zygosity of a structural variant or the copy number of a sequence.

Several approaches have been developed to incorporate signals from various methods. These combined approaches have succeeded especially well in increasing the resolution and reliability of existing methods by requiring confirmation by other signals. Some tools, like inGAP-SV, SVMerge and Genome STRiP, already combine several algorithms so that a wider range of structural variation can be detected in one experiment.

For multiple methods, population-based approaches have been developed. These approaches increase the statistical power for the detection of common structural variants by pooling data, while filtering out experimental artifacts at the same time. These tools are less powerful for the identification of personal structural variation, but may be extremely useful in a clinical setting with familial or larger case-control studies. Read pair or combined methods seem most suited to this strategy, as these have the potential to detect the widest range of SVs and will thus profit most from the increased statistical power.

Possible improvements: integration of recent advances

Most tools described here have introduced the detection of a new type of signature or a new way to increase the reliability of the findings. However, a comprehensive solution that incorporates all of the recent knowledge with the aim to identify all structural variation in a human genome is currently not available. As one sequencing experiment can generate the required information, methods using only one or two of the signals do not optimize the use of the data. The SVMerge pipeline combines signals by using various tools that are able to detect different ranges of SVs, and implements a filtering step based on local de novo assembly. However, SVMerge does not integrate the signals, but merges the results from each approach. This represents a significant step towards an integrated solution, but a comprehensive algorithm would ideally combine signals from each of the four NGS-based methods into one model.


A lot of the knowledge gained in the development of previous approaches could be used in the development of such an algorithm, taking into account the advantages and limitations of each method, as well as newly discovered signatures that can be used to enhance the detection of SVs. The use of soft clustering methods will allow maximal sensitivity for the detection of SVs in repetitive regions, but extensive confirmation and filtering will be required to reduce false positives. This could be achieved by integrating all signals before the calling of SVs, preferably into a probabilistic model, as these have been found to be more accurate than heuristics in most cases66,101,121. From read pair data, discordantly mapping reads should be used in clustering- and distribution-based models as in BreakDancer to maximize the information obtained from this step, but concordantly mapping reads should also be included as in MoDIL and MoGUL, as this provides additional information on events and also enables the determination of zygosity. The read depth signal may be used to inform the detection of deletions, duplications and insertions by using the traditional differences in read depth across the genome, but read depth signals also provide important information that should be considered in the detection of any variant that forms a new sequence at the breakend, as evidenced in GASVPro. Furthermore, NovelSeq, inGAP-SV and SVMerge have shown that OEA and orphan reads should also be considered, especially in the prediction of insertions, and OEA reads can also be used to confirm the location of other events.
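As a toy illustration of integrating signals before calling, rather than merging calls afterwards, independent per-signal likelihood ratios can be combined into one posterior probability. The numbers, names and naive independence assumption below are invented for illustration and are not drawn from any published model discussed here.

```python
from math import exp, log

def posterior_sv(likelihood_ratios, prior=1e-4):
    """Combine per-signal likelihood ratios P(signal|SV)/P(signal|no SV)
    under a naive independence assumption into a posterior P(SV|signals).
    The prior reflects how rare a true SV is at any given locus."""
    log_odds = log(prior / (1 - prior)) + sum(log(lr) for lr in likelihood_ratios)
    return 1 / (1 + exp(-log_odds))
```

The point of such a model is visible immediately: one moderately strong signal barely moves a low prior, while several concordant signals (for example read pair, read depth and split read support at the same locus) push the posterior close to certainty.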

Split read information can be used to detect the breakpoints of various types of SVs by using both anchored split mapping (Pindel) and soft clipping information (ClipCrop), as these approaches detect different classes of variants, and the predicted breakpoints can be identified at higher resolution (AGE). Local assembly may currently be used in several of these steps, for example by assembling novel insertions and the linking reads (NovelSeq), by increasing the reliability of split read mapping (CREST), and for confirmation of breakpoints by assembling unmapped reads (Genome STRiP).

Finally, an example of true integration of signals would be de novo assembly of contigs while considering multiple mappings; retaining the mapping, read depth, linked read and insert size information may allow the use of larger sequences while still considering the traditional signatures, integrating all possible signals into one source of information. The Cortex assembler's CDBG may be a good starting point for this, as it allows the integration of several tracks of information. However, this approach would probably require significant computational power.

Algorithmic improvement by integration of all four SV detection methods may significantly increase both the sensitivity and the specificity of detection. However, the library insert size has also been found to play a large role. Whereas large insert sizes are better for detection of structural variation, smaller insert sizes allow for a higher resolution66,154. Thus, a combination of two insert sizes, while integrating the data to keep the statistical power in detection, may be the best solution155. However, the root of the major problems common to each of the four NGS-based SV detection methods will still remain: technical limitations.


Future perspectives

Although NGS-based methods can theoretically identify all types and lengths of structural variation, this is currently not possible using any algorithm due to the technical limitations of the current sequencing platforms. It is estimated that about 50% of all SVs in the human genome are currently missed due to these limitations142. The short reads generated by the current generation of sequencing platforms, and the relatively high error rates in those with longer reads, significantly reduce the usefulness of any method used for the detection of SVs in repeat and duplication regions156. As the human genome contains many such regions, and SVs are predicted to be strongly enriched in these regions, this is a significant gap in the data157,158. The use of read pairs and soft clustering are good ways to minimize the effects as much as possible, but they do not provide a solution.
The only way to really attempt to solve these problems is by using sequencing platforms with longer read lengths and fewer biases and errors due to the PCR steps. Third generation sequencing platforms promise read lengths in the kilobases, decreased error rates and real-time single-molecule sequencing as fast as the nucleotides are processed, thus increasing throughput159,160. Currently available platforms like the Ion Torrent and HeliScope have several improvements over earlier platforms, but are still between second and third generation platforms. Further developments in the coming years will allow significant improvements in both read mapping and de novo assembly, at the same time reducing computational requirements as these processes become less complex and thus require less processing time. This will probably enable the detection of a whole range of new SVs, possibly requiring new algorithms to evaluate these more complex regions.
However, whether this will solve all of the problems remains to be seen. Estimations indicate that more than 1.5% of the genome cannot be mapped uniquely even with read lengths of 1 kb, which means that some repetitive regions may remain elusive154.

It is clear that sequencing-based methods will replace other methods for the detection of structural variation in the human genome. With the potential to detect a much broader variety of SVs with more power, the declines in costs, and the significant recent algorithmic developments, it is only a matter of time. Still, until the technical requirements can be met for the development of one unbiased solution, the development and integration of algorithms will remain important for the detection of structural variation. Even after complete de novo assembly of a full human genome has become a possibility, the development of computational methods used for alignment and the detection of signatures associated with structural variation will still be of great importance and influence the results significantly.



6 References

1. Check, E. Human genome: patchwork people. Nature 437, 1084–6 (2005).
2. Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704–12 (2010).
3. Fanciulli, M., Petretto, E. & Aitman, T. J. Gene copy number variation and common human disease. Clinical Genetics 77, 201–13 (2010).
4. Feuk, L., Marshall, C. R., Wintle, R. F. & Scherer, S. W. Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No, R57–66 (2006).
5. Hurles, M. E., Dermitzakis, E. T. & Tyler-Smith, C. The functional impact of structural variation in humans. Trends in Genetics 24, 238–45 (2008).
6. Buchanan, J. A. & Scherer, S. W. Contemplating effects of genomic structural variation. Genetics in Medicine 10, 639–47 (2008).
7. Lupski, J. R. & Stankiewicz, P. Genomic disorders: molecular mechanisms for rearrangements and conveyed phenotypes. PLoS Genetics 1, e49 (2005).
8. Stankiewicz, P. & Lupski, J. R. Genome architecture, rearrangements and genomic disorders. Trends in Genetics 18, 74–82 (2002).
9. De, S. & Babu, M. M. A time-invariant principle of genome evolution. Proceedings of the National Academy of Sciences of the United States of America 107, 13004–9 (2010).
10. Schmitz, J. SINEs as driving forces in genome evolution. Genome Dynamics 7, 92–107 (2012).
11. Ball, S., Colleoni, C., Cenci, U., Raj, J. N. & Tirtiaux, C. The evolution of glycogen and starch metabolism in eukaryotes gives molecular clues to understand the establishment of plastid endosymbiosis. Journal of Experimental Botany 62, 1775–801 (2011).
12. McHale, L. K. et al. Structural variants in the soybean genome localize to clusters of biotic stress response genes. Plant Physiology (2012). doi:10.1104/pp.112.194605
13. Samuelson, L. C., Phillips, R. S. & Swanberg, L. J. Amylase gene structures in primates: retroposon insertions and promoter evolution. Molecular Biology and Evolution 13, 767–79 (1996).
14. Bailey, J. A. & Eichler, E. E. Primate segmental duplications: crucibles of evolution, diversity and disease. Nature Reviews Genetics 7, 552–64 (2006).
15. Xing, J. et al. Mobile elements create structural variation: analysis of a complete human genome. Genome Research 19, 1516–26 (2009).
16. Nahon, J.-L. Birth of "human-specific" genes during primate evolution. Genetica 118, 193–208 (2003).
17. Perry, G. H. et al. Diet and the evolution of human amylase gene copy number variation. Nature Genetics 39, 1256–60 (2007).
18. Coyne, J. A. & Hoekstra, H. E. Evolution of protein expression: new genes for a new diet. Current Biology 17, R1014–6 (2007).
19. Beck, C. R., Garcia-Perez, J. L., Badge, R. M. & Moran, J. V. LINE-1 elements in structural variation and disease. Annual Review of Genomics and Human Genetics 12, 187–215 (2011).

20. Stankiewicz, P. & Lupski, J. R. Structural variation in the human genome and its role in disease. Annual Review of Medicine 61, 437–55 (2010).
21. Zhang, F., Gu, W., Hurles, M. E. & Lupski, J. R. Copy number variation in human health, disease, and evolution. Annual Review of Genomics and Human Genetics 10, 451–81 (2009).
22. Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nature Reviews Genetics 12, 363–76 (2011).
23. Kloosterman, W. P. et al. Chromothripsis as a mechanism driving complex de novo structural rearrangements in the germline. Human Molecular Genetics 20, 1916–24 (2011).
24. Hochstenbach, R. et al. Discovery of variants unmasked by hemizygous deletions. European Journal of Human Genetics 20, 748–53 (2012).
25. Southard, A. E., Edelmann, L. J. & Gelb, B. D. Role of copy number variants in structural birth defects. Pediatrics 129, 755–63 (2012).
26. Poduri, A. & Lowenstein, D. Epilepsy genetics: past, present, and future. Current Opinion in Genetics & Development 21, 325–32 (2011).
27. Garofalo, S., Cornacchione, M. & Di Costanzo, A. From genetics to genomics of epilepsy. Neurology Research International 2012, 876234 (2012).
28. Pfundt, R. & Veltman, J. A. Structural genomic variation in intellectual disability. Methods in Molecular Biology 838, 77–95 (2012).
29. Sebat, J. et al. Strong association of de novo copy number mutations with autism. Science 316, 445–9 (2007).
30. Stefansson, H. et al. Large recurrent microdeletions associated with schizophrenia. Nature 455, 232–6 (2008).
31. Kuiper, R. P., Ligtenberg, M. J. L., Hoogerbrugge, N. & Geurts van Kessel, A. Germline copy number variation and cancer risk. Current Opinion in Genetics & Development 20, 282–9 (2010).
32. Shlien, A. et al. Excessive genomic DNA copy number variation in the Li-Fraumeni cancer predisposition syndrome. Proceedings of the National Academy of Sciences of the United States of America 105, 11264–9 (2008).
33. Gonzalez, E. et al. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science 307, 1434–40 (2005).
34. Hedrick, P. W. Population genetics of malaria resistance in humans. Heredity 107, 283–304 (2011).
35. Janssens, W. et al. Genomic copy number determines functional expression of beta-defensin 2 in airway epithelial cells and associates with chronic obstructive pulmonary disease. American Journal of Respiratory and Critical Care Medicine 182, 163–9 (2010).
36. Bentley, R. W. et al. Association of higher DEFB4 genomic copy number with Crohn's disease. The American Journal of Gastroenterology 105, 354–9 (2010).
37. Hindorff, L. A., Gillanders, E. M. & Manolio, T. A. Genetic architecture of cancer and other complex diseases: lessons learned and future directions. Carcinogenesis 32, 945–54 (2011).

38. Rodriguez-Revenga, L., Mila, M., Rosenberg, C., Lamb, A. & Lee, C. Structural variation in the human genome: the impact of copy number variants on clinical diagnosis. Genetics in Medicine 9, 600–6 (2007).
39. Rasmussen, H. B. & Dahmcke, C. M. Genome-wide identification of structural variants in genes encoding drug targets: possible implications for individualized drug therapy. Pharmacogenetics and Genomics 22, 471–83 (2012).
40. Stavnezer, J., Guikema, J. E. J. & Schrader, C. E. Mechanism and regulation of class switch recombination. Annual Review of Immunology 26, 261–92 (2008).
41. Bassing, C. H., Swat, W. & Alt, F. W. The mechanism and regulation of chromosomal V(D)J recombination. Cell 109 Suppl, S45–55 (2002).
42. Savage, J. R. Interchange and intra-nuclear architecture. Environmental and Molecular Mutagenesis 22, 234–44 (1993).
43. Mani, R.-S. & Chinnaiyan, A. M. Triggers for genomic rearrangements: insights into genomic, cellular and environmental influences. Nature Reviews Genetics 11, 819–29 (2010).
44. Lieber, M. R. The mechanism of double-strand DNA break repair by the nonhomologous DNA end-joining pathway. Annual Review of Biochemistry 79, 181–211 (2010).
45. Hastings, P. J., Ira, G. & Lupski, J. R. A microhomology-mediated break-induced replication model for the origin of human copy number variation. PLoS Genetics 5, e1000327 (2009).
46. Burns, K. H. & Boeke, J. D. Human transposon tectonics. Cell 149, 740–752 (2012).
47. Stephens, P. J. et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell 144, 27–40 (2011).
48. Quinlan, A. R. & Hall, I. M. Characterizing complex structural variation in germline and somatic genomes. Trends in Genetics 28, 43–53 (2012).
49. Le Scouarnec, S. & Gribble, S. M. Characterising chromosome rearrangements: recent technical advances in molecular cytogenetics. Heredity 108, 75–85 (2012).
50. Miller, D. T. et al. Consensus statement: chromosomal microarray is a first-tier clinical diagnostic test for individuals with developmental disabilities or congenital anomalies. American Journal of Human Genetics 86, 749–64 (2010).
51. Pinkel, D. et al. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genetics 20, 207–11 (1998).
52. Carvalho, B. High resolution microarray comparative genomic hybridisation analysis using spotted oligonucleotides. Journal of Clinical Pathology 57, 644–646 (2004).
53. Brennan, C. et al. High-resolution global profiling of genomic alterations with long oligonucleotide microarray. Cancer Research 64, 4744–8 (2004).
54. Armengol, L. et al. Clinical utility of chromosomal microarray analysis in invasive prenatal diagnosis. Human Genetics 131, 513–23 (2012).
55. Winchester, L., Yau, C. & Ragoussis, J. Comparing CNV detection methods for SNP arrays. Briefings in Functional Genomics & Proteomics 8, 353–66 (2009).

56. Wang, D. G. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 280, 1077–1082 (1998).
57. LaFramboise, T. Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances. Nucleic Acids Research 37, 4181–93 (2009).
58. Kloth, J. N. et al. Combined array-comparative genomic hybridization and single-nucleotide polymorphism-loss of heterozygosity analysis reveals complex genetic alterations in cervical cancer. BMC Genomics 8, 53 (2007).
59. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444–54 (2006).
60. Abbey, D., Hickman, M., Gresham, D. & Berman, J. High-resolution SNP/CGH microarrays reveal the accumulation of loss of heterozygosity in commonly used Candida albicans strains. G3 (Bethesda, Md.) 1, 523–30 (2011).
61. Pinto, D. et al. Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nature Biotechnology 29, 512–20 (2011).
62. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nature Genetics 37, 727–32 (2005).
63. Alkan, C., Sajjadian, S. & Eichler, E. E. Limitations of next-generation genome sequence assembly. Nature Methods 8, 61–5 (2011).
64. Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420–6 (2007).
65. Mardis, E. R. A decade's perspective on DNA sequencing technology. Nature 470, 198–203 (2011).
66. Bashir, A., Volik, S., Collins, C., Bafna, V. & Raphael, B. J. Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer. PLoS Computational Biology 4, e1000051 (2008).
67. Medvedev, P., Stanciu, M. & Brudno, M. Computational methods for discovering structural variation with next-generation sequencing. Nature Methods 6, S13–20 (2009).
68. Campbell, P. J. et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature Genetics 40, 722–9 (2008).
69. Alkan, C. et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genetics 41, 1061–7 (2009).
70. Harismendy, O. et al. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biology 10, R32 (2009).
71. Mills, R. E. et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Research 16, 1182–90 (2006).
72. Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–71 (2009).
73. Li, Y. et al. Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly. Nature Biotechnology 29, 723–30 (2011).
74. Miller, J. R., Koren, S. & Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 95, 315–27 (2010).

75. Hormozdiari, F., Hajirasouliha, I., McPherson, A., Eichler, E. E. & Sahinalp, S. C. Simultaneous structural variation discovery among multiple paired-end sequenced genomes. Genome Research 21, 2203–12 (2011).
76. Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. Journal of Molecular Biology 147, 195–7 (1981).
77. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of Molecular Biology 215, 403–10 (1990).
78. Kent, W. J. BLAT: the BLAST-like alignment tool. Genome Research 12, 656–64 (2002).
79. Li, R., Li, Y., Kristiansen, K. & Wang, J. SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713–4 (2008).
80. Jiang, H. & Wong, W. H. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 24, 2395–6 (2008).
81. Hach, F. et al. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nature Methods 7, 576–7 (2010).
82. Campagna, D. et al. PASS: a program to align short sequences. Bioinformatics 25, 967–8 (2009).
83. Ma, B., Tromp, J. & Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 440–5 (2002).
84. McKernan, K. J. et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Research 19, 1527–41 (2009).
85. Lin, H., Zhang, Z., Zhang, M. Q., Ma, B. & Li, M. ZOOM! Zillions of oligos mapped. Bioinformatics 24, 2431–7 (2008).
86. Homer, N., Merriman, B. & Nelson, S. F. BFAST: an alignment tool for large scale genome resequencing. PLoS ONE 4, e7767 (2009).
87. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18, 1851–8 (2008).
88. Rumble, S. M. et al. SHRiMP: accurate mapping of short color-space reads. PLoS Computational Biology 5, e1000386 (2009).
89. Weese, D., Emde, A.-K., Rausch, T., Döring, A. & Reinert, K. RazerS: fast read mapping with sensitivity control. Genome Research 19, 1646–54 (2009).
90. Burrows, M. & Wheeler, D. J. A block-sorting lossless data compression algorithm. Digital Equipment Corporation, Research Report 124 (1994).
91. Li, H. & Homer, N. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11, 473–83 (2010).
92. Ning, Z., Cox, A. J. & Mullikin, J. C. SSAHA: a fast search method for large DNA databases. Genome Research 11, 1725–9 (2001).
93. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–60 (2009).

94. Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–7 (2009).
95. Galinsky, V. L. YOABS: yet other aligner of biological sequences, an efficient linearly scaling nucleotide aligner. Bioinformatics 28, 1070–7 (2012).
96. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10, R25 (2009).
97. Liu, C.-M. et al. SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics 28, 878–9 (2012).
98. Klus, P. et al. BarraCUDA: a fast short read sequence aligner using graphics processing units. BMC Research Notes 5, 27 (2012).
99. Liu, Y., Schmidt, B. & Maskell, D. L. CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform. Bioinformatics 28, 1830–7 (2012).
100. Stromberg, M., Lee, W. & Marth, G. MOSAIK: a next-generation reference-guided aligner. at <https://github.com/wanpinglee/MOSAIK>
101. Hormozdiari, F., Alkan, C., Eichler, E. E. & Sahinalp, S. C. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Research 19, 1270–8 (2009).
102. Hormozdiari, F. et al. Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics 26, i350–7 (2010).
103. Korbel, J. O. et al. PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biology 10, R23 (2009).
104. Quinlan, A. R. et al. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Research 20, 623–35 (2010).
105. Chiara, M., Pesole, G. & Horner, D. S. SVM2: an improved paired-end-based tool for the detection of small genomic structural variations using high-throughput single-genome resequencing data. Nucleic Acids Research (2012). doi:10.1093/nar/gks606
106. Lee, S., Hormozdiari, F., Alkan, C. & Brudno, M. MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nature Methods 6, 473–4 (2009).
107. Lee, S. MoGUL: detecting common insertions and deletions in a population. Research in Computational Molecular Biology 1–12 (2010). at <http://www.springerlink.com/index/32W7184R7057461W.pdf>
108. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature Methods 6, 677–81 (2009).
109. Venkatraman, E. S. & Olshen, A. B. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23, 657–63 (2007).
110. Miller, C. A., Hampton, O., Coarfa, C. & Milosavljevic, A. ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads. PLoS ONE 6, e16327 (2011).
111. Yoon, S., Xuan, Z., Makarov, V., Ye, K. & Sebat, J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Research 19, 1586–92 (2009).

112.

Magi, A., Benelli, M., Yoon, S., Roviello, F. &

Torricelli, F. Detecting common copy number variants in high
-
throughput sequencing data by using JointSLM algorithm.
Nucleic acids research

39
, e65 (2011).

113.

Abyzov, A., Urban, A. E., Snyder, M. &

Gerstein, M. CNVnator: an approach to discover, genotype, and
characterize typical and atypical CNVs from family and population genome sequencing.
Genome research

21
, 974

84 (2011).

114.

Wang, L.
-
Y., Abyzov, A., Korbel, J. O., Snyder, M. & Gerstein, M. MS
B: a mean
-
shift
-
based approach for the
analysis of structural variation in the genome.
Genome research

19
, 106

17 (2009).

115.

Wang, Z., Hormozdiari, F. & Yang, W. CNVeM: copy number variation detection using uncertainty of read
mapping.
Research in

326

34
0 (2012).at
<http://www.springerlink.com/index/P622187L42V41243.pdf>

116.

Xi, R.
et al.

Copy number variation detection in whole
-
genome sequencing data using the Bayesian
information criterion.
Proceedings of the National Academy of Sciences of the United
States of America

108
, E1128

36 (2011).

117.

Xie, C. & Tammi, M. T. CNV
-
seq, a new method to detect copy number variation using high
-
throughput
sequencing.
BMC bioinformatics

10
, 80 (2009).

118.

Chiang, D. Y.
et al.

High
-
resolution mapping of copy
-
number a
lterations with massively parallel
sequencing.
Nature methods

6
, 99

103 (2009).

119. Kim, T.-M., Luquette, L. J., Xi, R. & Park, P. J. rSW-seq: algorithm for detection of copy number alterations in deep sequencing data. BMC Bioinformatics 11, 432 (2010).

120. Ivakhno, S. et al. CNAseg--a novel framework for identification of copy number changes in cancer from second-generation sequencing data. Bioinformatics (Oxford, England) 26, 3051–8 (2010).

121. Magi, A., Tattini, L., Pippucci, T., Torricelli, F. & Benelli, M. Read count approach for DNA copy number variants detection. Bioinformatics (Oxford, England) 28, 470–8 (2012).

122. Abyzov, A. & Gerstein, M. AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision. Bioinformatics (Oxford, England) 27, 595–603 (2011).

123. Suzuki, S., Yasuda, T., Shiraishi, Y., Miyano, S. & Nagasaki, M. ClipCrop: a tool for detecting structural variations with single-base resolution using soft-clipping information. BMC Bioinformatics 12 Suppl 1, S7 (2011).

124. Henson, J., Tischler, G. & Ning, Z. Next-generation sequencing and large genome assemblies. Pharmacogenomics 13, 901–15 (2012).

125. Warren, R. L., Sutton, G. G., Jones, S. J. M. & Holt, R. A. Assembling millions of short DNA sequences using SSAKE. Bioinformatics (Oxford, England) 23, 500–1 (2007).

126. Dohm, J. C., Lottaz, C., Borodina, T. & Himmelbauer, H. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Research 17, 1697–706 (2007).

127. Jeck, W. R. et al. Extending assembly of short DNA sequences to handle error. Bioinformatics (Oxford, England) 23, 2942–4 (2007).

128. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–80 (2005).

129. Myers, E. W. et al. A whole-genome assembly of Drosophila. Science (New York, N.Y.) 287, 2196–204 (2000).

130. Hernandez, D., François, P., Farinelli, L., Osterås, M. & Schrenzel, J. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Research 18, 802–9 (2008).

131. Hossain, M. S., Azimi, N. & Skiena, S. Crystallizing short-read assemblies around seeds. BMC Bioinformatics 10 Suppl 1, S16 (2009).

132. Pevzner, P. A., Tang, H. & Tesler, G. De novo repeat classification and fragment assembly. Genome Research 14, 1786–96 (2004).

133. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18, 821–9 (2008).

134. Butler, J. et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Research 18, 810–20 (2008).

135. Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Research 19, 1117–23 (2009).

136. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Research 20, 265–72 (2010).

137. Simpson, J. T. & Durbin, R. Efficient de novo assembly of large genomes using compressed data structures. Genome Research 22, 549–56 (2012).

138. Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences of the United States of America 108, 1513–8 (2011).

139. Chaisson, M. J., Brinza, D. & Pevzner, P. A. De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Research 19, 336–46 (2009).

140. Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65 (2011).

141. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genetics 44, 226–32 (2012).

142. Baker, M. Structural variation: the genome's hidden architecture. Nature Methods 9, 133–7 (2012).

143. Harris, R. S. Improved pairwise alignment of genomic DNA. (2007).

144. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Research 13, 103–7 (2003).

145. Hajirasouliha, I. et al. Detection and characterization of novel sequence insertions using paired-end next-generation sequencing. Bioinformatics (Oxford, England) 26, 1277–83 (2010).

146. Handsaker, R. E., Korn, J. M., Nemesh, J. & McCarroll, S. A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nature Genetics 43, 269–76 (2011).

147. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–73 (2010).

148. Medvedev, P., Fiume, M., Dzamba, M., Smith, T. & Brudno, M. Detecting copy number variation with mated short reads. Genome Research 20, 1613–22 (2010).

149. Sindi, S. S., Onal, S., Peng, L., Wu, H.-T. & Raphael, B. J. An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biology 13, R22 (2012).

150. Qi, J. & Zhao, F. inGAP-sv: a novel scheme to identify and visualize structural variation from paired end mapping data. Nucleic Acids Research 39, W567–75 (2011).

151. Wang, J. et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nature Methods 8, 652–4 (2011).

152. Wong, K., Keane, T. M., Stalker, J. & Adams, D. J. Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biology 11, R128 (2010).

153. Jurka, J. Repbase update: a database and an electronic journal of repetitive elements. Trends in Genetics 16, 418–20 (2000).

154. Schatz, M. C., Delcher, A. L. & Salzberg, S. L. Assembly of large genomes using second-generation sequencing. Genome Research 20, 1165–73 (2010).

155. Bashir, A., Bansal, V. & Bafna, V. Designing deep sequencing experiments: detecting structural variation and estimating transcript abundance. BMC Genomics 11, 385 (2010).

156. Metzker, M. L. Sequencing technologies - the next generation. Nature Reviews Genetics 11, 31–46 (2010).

157. Kim, P. M. et al. Analysis of copy number variants and segmental duplications in the human genome: Evidence for a change in the process of formation in recent evolutionary history. Genome Research 18, 1865–74 (2008).

158. Wong, K. K. et al. A comprehensive analysis of common copy-number variations in the human genome. American Journal of Human Genetics 80, 91–104 (2007).

159. Pareek, C. S., Smoczynski, R. & Tretyn, A. Sequencing technologies and genome sequencing. Journal of Applied Genetics 52, 413–35 (2011).

160. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science (New York, N.Y.) 323, 133–8 (2009).