PPT - Canadian Bioinformatics Workshops

weinerthreeforksBiotechnology

Oct 2, 2013 (3 years and 8 months ago)


Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 3
Structural and Copy Number Variants
Module 3
bio
informatics
.ca
Diversity of Humans

Humans are diverse

Genomic Variation

Single Nucleotide Polymorphisms

SNPs occur ~1/1000 positions

Find by comparing reads from one individual to the
reference human genome

Structural variations are large scale genomic alterations

Insertions, deletions, inversions, translocations
and changes in copy numbers
G: 798 GAACCCCTTACAACTGAACCCCTTAC
       |||||||||| |||||||||||||||
R:     GAACCCCTTATAACTGAACCCCTTAC
What are structural variations?
Various examples of structural variations

What are Paired Reads?
ATCAA
CTAAG
Insert size
DNA fragment
Matepair
Detecting Structural Variants With Matepairs
A
REF

No structural variants
Mapped distance
Insert size
Insert size = Mapped distance
Concordant matepair
Detecting Structural Variants With Matepairs - Insertion
A
REF

Insertion
Mapped distance
Insert size
Insert size > Mapped distance
Size of insertion = Insert size - Mapped distance
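The rule on this slide can be written as a small function. This is a minimal sketch, not any tool's implementation: the function name, the `tolerance` parameter, and the deletion branch (the mirror case, mapped distance greater than insert size) are assumptions added for illustration.

```python
def classify_matepair(insert_size, mapped_distance, tolerance=0):
    """Classify one matepair from its insert size and mapped distance.

    Returns a (label, size) tuple. `tolerance` (hypothetical) is how far
    the two values may differ before the pair stops counting as concordant.
    """
    difference = insert_size - mapped_distance
    if abs(difference) <= tolerance:
        # Insert size = mapped distance: concordant matepair.
        return ("concordant", 0)
    if difference > 0:
        # Insert size > mapped distance: the donor has extra sequence.
        return ("insertion", difference)
    # Assumed mirror case: mapped distance > insert size.
    return ("deletion", -difference)


print(classify_matepair(208, 188))  # insertion of 208 - 188 bp
```

Real libraries have a spread of insert sizes, so a nonzero `tolerance` would normally be needed.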
Consistency - Insertion
A
REF
1. Size of insertion explained by $X_i$ = size of insertion explained by $X_j$
2. $X_i$ and $X_j$ overlap
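The two consistency conditions for insertions can be checked as follows. This is a sketch under assumptions: each matepair is represented as a hypothetical `(left, right, insert_size)` tuple of reference coordinates, "overlap" is read as the mapped spans overlapping on the reference, and `size_tolerance` is an added parameter.

```python
def explained_insertion_size(pair):
    """Size of insertion = insert size - mapped distance."""
    left, right, insert_size = pair
    return insert_size - (right - left)


def consistent_insertion(pair_i, pair_j, size_tolerance=0):
    """Return True if two matepairs can support the same insertion."""
    # Condition 1: both pairs explain the same insertion size.
    size_gap = abs(explained_insertion_size(pair_i)
                   - explained_insertion_size(pair_j))
    sizes_agree = size_gap <= size_tolerance
    # Condition 2: the mapped spans overlap on the reference.
    left_i, right_i, _ = pair_i
    left_j, right_j, _ = pair_j
    spans_overlap = max(left_i, left_j) < min(right_i, right_j)
    return sizes_agree and spans_overlap


x_i = (1000, 1188, 208)  # explains a 20 bp insertion
x_j = (1050, 1238, 208)  # explains a 20 bp insertion, overlapping span
print(consistent_insertion(x_i, x_j))
```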
Detecting Structural Variants With Matepairs - Inversion
A
REF
5’
3’
5’
3’
5’
3’

Inversion
Range of The Size of Inversion
(Figure: matepair $X_i$ near an inversion in A relative to REF, with $m$ and the insert size of $X_i$ marked.)
$|m - \text{insert size of } X_i| < \text{size of inversion}$
Range of The Size of Inversion
(Figure: matepair $X_i$ near an inversion in A relative to REF, with $m$ and the insert size of $X_i$ marked.)
$\text{size of inversion} < m + \text{insert size of } X_i$
Range of The Size of Inversion
$|m - \text{insert size of } X_i| < \text{size of inversion} < m + \text{insert size of } X_i$
(Figure: A and REF with $m$ and the insert size of $X_i$ marked.)
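The two bounds combine into one open interval per matepair. A minimal sketch, assuming `m` is the distance labelled m in the figure, measured in the same units (bp) as the insert size; the function name is hypothetical.

```python
def inversion_size_range(m, insert_size):
    """Return (lower, upper) for |m - insert| < inversion size < m + insert.

    Both bounds are exclusive.
    """
    if m < 0 or insert_size < 0:
        raise ValueError("distances must be non-negative")
    return abs(m - insert_size), m + insert_size


lower, upper = inversion_size_range(m=1000, insert_size=208)
print(lower, upper)  # bounds on the inversion size supported by this pair
```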
Consistency - Inversion
1. Mapped distance A = Mapped distance B
2. Range of the size of inversion explained by $X_i$ overlaps the range of the size of inversion explained by $X_j$
3. $X_i$ and $X_j$ overlap
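The three conditions can be combined into one predicate. This is a sketch only: the size ranges are passed in as `(lower, upper)` tuples with exclusive bounds, whether the two matepairs overlap is supplied by the caller as a boolean, and `distance_tolerance` is an added, hypothetical parameter.

```python
def ranges_overlap(range_a, range_b):
    """True if two open intervals (lower, upper) share any point."""
    low_a, high_a = range_a
    low_b, high_b = range_b
    return max(low_a, low_b) < min(high_a, high_b)


def consistent_inversion(distance_a, distance_b,
                         size_range_i, size_range_j,
                         pairs_overlap, distance_tolerance=0):
    """Check the three consistency conditions for two matepairs."""
    # 1. Mapped distance A = mapped distance B (within a tolerance).
    distances_agree = abs(distance_a - distance_b) <= distance_tolerance
    # 2. The inversion-size ranges explained by the two pairs overlap.
    sizes_compatible = ranges_overlap(size_range_i, size_range_j)
    # 3. The matepairs themselves overlap (decided by the caller).
    return distances_agree and sizes_compatible and pairs_overlap


print(consistent_inversion(300, 300, (792, 1208), (900, 1300), True))
```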
Structural Variants and Split Reads
HTS Short Reads (Paired-end)
Short Reads Aligner (BWA, BOWTIE, SHRIMP, BFAST …)
Most of these pairs can be aligned to the reference genome.
For some paired-end reads, one read of the pair may not be mapped because it spans the breakpoint of a structural variant. We call such reads split reads.
Structural Variants and Split Reads
(Figure: split reads aligned between ref and donor, for a Deletion and an Insertion.)
Structural Variants and Split Reads
(Figure: split reads aligned between ref and donor, for an Inversion and a Tandem Duplication.)
Pair informed split mapping
(Figure: a Deletion in ref; Split SW aligns the read to reference region 1 and reference region 2.)
Pair informed split mapping
(Figure: an Inversion in ref; Split SW aligns the read to reference region 1 and reference region 2.)
Split mapping in non-pair region
(Figure: a Small Deletion in ref; Split SW aligns the read to reference region 1 and reference region 2.)
Repeats & Pair clusters
donor
ref
No repeat : ideal situation
Repeats & Pair clusters
donor
ref
Short repeat: a few singleton reads disappear
Repeats & Pair clusters
donor
ref
Long repeat: all singleton reads disappear, a few discordant pairs disappear
Repeats & Pair clusters
donor
ref
Large repeat: all singleton reads disappear, all discordant pairs disappear
Repeats & Pair clusters
donor
ref
Inversion
X
Y
Split Read Mapping

Can detect insertions 1bp or larger

Can exactly estimate breakpoint and indel sizes.

Can only detect insertions up to read length

Does not work with repetitive breakpoints

Detecting Structural Variants
(Figure: a Deletion in the donor relative to ref, as seen by depth-of-coverage (DOC) methods and by matepair mapping methods.)

Input

reference human genome

sequenced donor genome

Output

Variant annotations in ref
Distribution of Insert Sizes

In reality, insert sizes of matepairs are not perfect

Nor are they Gaussian

Nor are they consistent between libraries
How to detect “Discordant” matepairs?
Haploid Case – Alignment
REF
Donor
Haploid Case – Alignment
REF
Mapped distance
Donor
Cluster
Haploid Case – Distribution
Make a distribution of mapped distances in each cluster
=> The distribution shifts if there is an INDEL
(Figure: distributions of mapped distance: no indel at 208bp, 20bp insertion at 188bp; 228bp is also marked.)
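The shift of a cluster's distribution gives a direct estimate of the indel size. A minimal sketch, assuming the library's mean insert size is known and using the sign convention from the insertion slide (insert size minus mapped distance, so positive means insertion); the function name and the sample values are hypothetical.

```python
from statistics import mean


def estimate_indel_size(mapped_distances, library_mean):
    """Estimate indel size from the shift of a cluster's mean mapped distance.

    Positive result: insertion (pairs map closer together than expected).
    Negative result: deletion under the same convention (an assumption).
    """
    return library_mean - mean(mapped_distances)


cluster = [186, 188, 190]  # hypothetical mapped distances in one cluster
print(estimate_indel_size(cluster, library_mean=208))
```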
Accuracy of INDEL Estimation
Central limit theorem
The mean of $n$ independent random variables with finite mean $\mu$ and variance $\sigma^2$ approximately follows the Gaussian distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.

$Z_1, \dots, Z_n$: random variables for size of indels supported by each matepair

$\bar{Z} = \dfrac{1}{n}\sum_{i=1}^{n} Z_i$

Distribution of mean of random variables $Z_1, \dots, Z_n$: Mean $= \mu_Z$, STD $= \sigma_Z/\sqrt{n}$
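As a worked example (the numbers are assumptions chosen for illustration, not results from the slides), take a per-matepair spread of $\sigma_Z = 13$bp and a cluster of $n = 25$ matepairs:

```latex
\mathrm{STD}(\bar{Z}) = \frac{\sigma_Z}{\sqrt{n}} = \frac{13}{\sqrt{25}} = 2.6\ \text{bp}
```

Quadrupling the cluster to $n = 100$ matepairs halves this to 1.3bp, which is why larger clusters allow smaller indels to be resolved.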
P-value (assigning a confidence)
P-value
Probability that a cluster is generated from a region without an indel

$\text{P-value} = p(\bar{Z}' \ge \bar{z} \mid \mu_Z = 0)$

(Figure: Gaussian with Mean $\mu_Z = 0$ and STD $\sigma_Z/\sqrt{n}$.)
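A Gaussian tail probability of this kind can be computed with the standard library alone. This is a sketch under assumptions: the null distribution is Gaussian with mean 0 and standard deviation sigma/sqrt(n), and a two-sided tail is used so that insertions and deletions are treated alike. The function and parameter names are hypothetical.

```python
import math


def indel_p_value(mean_indel_size, sigma, n):
    """Two-sided P-value for a cluster mean under the no-indel hypothesis.

    mean_indel_size: observed mean indel size supported by the cluster
    sigma: per-matepair standard deviation of the indel size estimate
    n: number of matepairs in the cluster
    """
    if sigma <= 0 or n <= 0:
        raise ValueError("sigma and n must be positive")
    standard_error = sigma / math.sqrt(n)
    z = abs(mean_indel_size) / standard_error
    # For a standard normal variable, P(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


print(indel_p_value(mean_indel_size=20, sigma=13, n=25))
```

A small P-value means the cluster is unlikely to come from a region without an indel.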
The Theory
(Figure: axis running from 0% through 50% to 100%.)
Diploid Case – Alignment & Clustering
REF
Donor C1
Donor C2
Heterozygous insertion
Diploid Case – Alignment & Clustering
REF
Donor C1
Donor C2
Heterozygous insertion
Cluster
Diploid Case – Distributions
You expect to see matepairs from two distributions.
(Figure: two distributions of mapped distance, no indel at 208bp and the 20bp insertion at 188bp.)
MoDIL Algorithm
1. Randomly initialize $\mu_1$ and $\mu_2$
2. E step: Assign each matepair, $M_i$, to one of two distributions.
   Assign $M_i$ to $p_1$ with probability $\dfrac{p_1(M_i)}{p_1(M_i) + p_2(M_i)}$, $p_2$ with probability $1 - \dfrac{p_1(M_i)}{p_1(M_i) + p_2(M_i)}$
3. M step: Update $\mu_1$ and $\mu_2$ by searching for the optimal $\mu_1$ and $\mu_2$ that minimize the Kolmogorov–Smirnov statistic
   $D = \sup_z \lvert F_o(z) - F_l(z) \rvert$
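The role of the Kolmogorov–Smirnov statistic can be illustrated with a much simpler, single-distribution version of the M step. This is not the MoDIL implementation: it fits only one shift by grid search, skips the E step entirely, and every name and value in it is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: sup over z of |F_a(z) - F_b(z)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The supremum of two step functions is reached at a sample point.
    for z in a + b:
        f_a = bisect_right(a, z) / len(a)
        f_b = bisect_right(b, z) / len(b)
        largest_gap = max(largest_gap, abs(f_a - f_b))
    return largest_gap


def best_shift(observed, library, candidate_shifts):
    """Pick the shift of the library distribution that minimizes KS."""
    return min(
        candidate_shifts,
        key=lambda s: ks_statistic(observed, [x + s for x in library]),
    )


library = [200, 204, 208, 212, 216]    # hypothetical library distances
observed = [180, 184, 188, 192, 196]   # same shape, shifted down by 20 bp
print(best_shift(observed, library, range(-30, 31)))
```

A shift of -20 in mapped distance corresponds to the 20bp insertion case shown on the distribution slides.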

MoDIL Comparison with Kidd et al. study
≥20bp INDELs: FNR = 0.05
15-19bp INDELs: FNR = 0.3
10-14bp INDELs: FNR = 0.65

NA18507 (40x Illumina coverage, 208±13bp pairs)
Paired-end Mapping

Can detect INDELs ≥ 20bp, depending on insert size/variance.

Can reliably estimate breakpoint locations (to within a few bp) and indel sizes.

Can only detect insertions up to the insert size.

Gives unreliable results with duplications (see next section).
Calling CNVs
(Figure: Ref segments annotated with Ref, Depth, and Call values; two segments are marked CNV.)
Back to… Structural Variants

What if the inserted segment is present elsewhere?
Ref
Donor
The Linking Signature
Ref
Donor
Step 1 – Build Repeat Graph
Step 2 – Capture Donor Adjacencies
Ref
Donor
Step 3 – Defining Walk Costs
Each function represents the probability that the segment of length $l$ appearing $x$ times in the donor had $k$ reads sampled from it:

$P(k \mid x, l) = \dfrac{(x\lambda)^k e^{-x\lambda}}{k!}$, where $\lambda = N \cdot l / G$

(Figure: $P$ against $k$ reads for a segment of length $l$, for each $x$.)
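The Poisson model P(k | x, l) = (xλ)^k e^(-xλ) / k! with λ = N·l/G can be evaluated directly. A minimal sketch, assuming N is the total number of sequenced reads and G is the genome length (the slide does not spell these out); function names, `max_copies`, and the example numbers are hypothetical.

```python
import math


def read_count_probability(k, x, segment_length, total_reads, genome_length):
    """Poisson probability of k reads from a segment present x times."""
    lam = total_reads * segment_length / genome_length  # lambda = N * l / G
    expected = x * lam
    if expected == 0:
        # A segment absent from the donor can only yield zero reads.
        return 1.0 if k == 0 else 0.0
    # Work in log space so large k does not overflow k!.
    log_p = k * math.log(expected) - expected - math.lgamma(k + 1)
    return math.exp(log_p)


def most_likely_copy_count(k, segment_length, total_reads, genome_length,
                           max_copies=6):
    """Copy count x in 0..max_copies that maximizes P(k | x, l)."""
    return max(
        range(max_copies + 1),
        key=lambda x: read_count_probability(
            k, x, segment_length, total_reads, genome_length),
    )


# Hypothetical numbers: lambda = 1_000_000 * 50 / 1_000_000 = 50 reads per copy.
print(most_likely_copy_count(k=100, segment_length=50,
                             total_reads=1_000_000, genome_length=1_000_000))
```

This picks the best copy count for one segment in isolation; the slides instead use these probabilities as walk costs over the whole graph.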
Calling CNVs
(Figure: Ref segment values compared with the values along the chosen Path; one segment is marked CNV.)

Finds a path “most faithful” to the DOC

Can be solved via Network Flow!
CNVer Results

NA18507; ~40x coverage, 35bp reads, 208bp insert size

Total of 4338 CNV calls (>1k): 2304 losses, 2076 gains

Approximately 1.6% of the genome in CNVs

25% of Seg Dups

0.6% Elsewhere
Some Example CNVs
DOC-based Methods

Highly dependent on uniformity of sequencing & knowledge of biases

Breakpoint resolution is poor unless combined with other methods

Gains are harder to find than losses, especially in Segmental Duplications

Only hope for finding really large variants
People Who Did The Work
Recep
Andrew
Vlad
Mike
Yue
Marc
Vanessa
Orion
Joe
Nilgun
Paul
Vera
Misko
Yoni
Seunghak