SNP calling from Next Generation
Sequencing data
Hugo Willy
Feb 1, 2012
Overview
Sequencing of
samples
Mapping of the
reads to Ref
Genome
Genotype
determination
for each locus
Comparison
against
reference/normal
Call SNPs
MAQ
Li
Heng
et al. Genome Res. 2008
•
Main concept:
–
Mapping quality
where
u
is the mapping position,
x
is the reference genome and
z
is the read.
p
s
(
ux,z
)
is the correctness probability of the
mapping location u given x and the read z.
p(
zx,u
)
= product of error probabilities of the mismatches of the read
z
mapped to pos
u
. The error is defined by the base quality e.g. base
quality 20 denotes a probability of 10

(20/10)
of miscalling the base.
Assume p(
ux
) is uniform
MAQ (1)
•
In practice, instead of summing over the
genome, they use
•
which effectively says
“read the supplementary for details”
MAQ (2)
•
Uses paired end information to help mapping
–
The mapping quality of both reads are set to the
sum of their S score when there is a unique hit on
both ends.
–
Otherwise, just use the original S score for each
read.
–
The paired ends is also used to identify short
indel
•
when there is a perfect mapping on one side and a very
poor mapping on the other, use (local) Smith Waterman
to find
a gapped alignment
within the paired end gap
distance.
MAQ (3)
•
MAQ call SNPs separately for each library. Somatic
SNPs are found by checking the genotype of Normal
and Tumor after SNPs are called.
•
No paired end feature is considered for SNP calling.
•
At any genomic position, at
most two different
nucleotides can be legitimately
seen (assuming
biallelic
SNPs). Therefore, we
can
consider only the two most
frequent nucleotides
at any
position
and ignore others
as errors.
•
Assume
we are
observing data
D
which consist of
k
nucleotides
b
and
n

k
nucleotides
b’
.
with
(
b,b
’)
∈
{A,C,G,T}
and
b
≠ b’
. Then
the three possible
reference genotypes are
<
b,b
>, <
b,b
’>,
and <
b’,b
’>.
MAQ (4)
•
If the true genotype is <
b,b
>, we have
n

k
errors from
n
bases. Denote it using
P(D<
b,b
>) =
α
n,n

k
•
If the true genotype is <
b’,b
’>, we have
k
errors from
n
bases. Denote it using
P(D<
b’,b
’>) =
α
nk
•
If the true genotype is <
b,b
’>, the probability
can be approximated with a binomial
distribution:
MAQ (5)
•
Assuming that the errors are not independent,
–
where
ε
i
is the
i
th
smallest base error probability.
–
c’
nk
is a function of
ε
i
but varies little with
ε
i
–
θ is empirically set to 0.85
•
Prior probabilities
Let P(<b,b’>) = r, then P(<b,b>) = P(<b’,b’>) = (1

r)/2
–
r is the probability of observing a heterozygote
–
if there is a known SNP in the position (dbSNP or
OMIM), r = 0.2. Otherwise, r=0.001. One can also use
site specific allele frequency.
MAQ (6)
•
Then the genotype is selected by
with a quality of
where
P(
gD
) is the posterior
probability
of
genotype g given the
observation D
.
MAQ (7)
•
Rule based filtering
–
Discard SNPs within the 3

bp flanking region
around a potential
indel
;
–
Discard SNPs covered by three or fewer reads;
–
Discard SNPs covered by no read with a mapping
quality higher than 60;
–
In any 10

bp window, if there are three or more
SNPs, discard them all; and
–
Discard SNPs with consensus quality smaller than
10
MAQ (8)
•
Validation
–
Simulation data on Human
Chr
X with 45X reads.
•
The reference genome is duplicated and a set of variant
positions are randomly picked based on a
predetermined mutation rate of 0.001
•
For each position, 2/3 of the time it will be mutated
into a <
b,b
’> genotype and 1/3 of the time it will be
mutated in to <
b’,b
’>.
•
Indels
are created similarly.
MAQ (9)
–
On bacterial genome (S.
paratyphi
)
•
The dataset is single end reads on
S.
paratyphi
strain
AKU12601 (haploid genome).
•
They checked the SNP calls against the strain’s
reference genome. They found 6 SNPs that looks highly
convincing
–
2
homozygotes
variant
–
one confirmed using
capillary sequencing, one have 19 read cover, all
having the variant allele)
–
4
heterozygotes
variant

they suspected there exist
a repeat region that is not included in current
reference.
MAQ (10)
–
They use another strain’s genome (ATCC9150) as the
reference genome for False Negative estimation.
–
The SNPs and
indels
are determined from the
differences between the two genomes.
–
Recovered 175/211 SNPs.
Indels
are not identified
because of single end read data.
GATK
•
Aaron McKenna, Matthew Hanna, Eric Banks
•
Genome Res. 2010.
•
Main selling points:
–
MapReduce framework: allow users to code own
functions. By following the specifications, the function
will be parallelizable and also able to use shared
memory for performance.
–
Implemented using JAVA 1.6
–
Is a general / platform independent tool for
processing Next Generation Sequencing (NGS) data.
GATK (1)
•
Implements the traversals in
Samtools
–
TraverseLoci
: each single base loci with its
associated read, reference ordered data and ref.
base are presented
–
Traverse reads: each reads is traversed
–
TraverseDuplicates
: returns list of unique and
duplicate reads on the position
–
TraverseLocusWindow
: just like
traverseLoci
for
intervals
GATK (2)
•
Simple Bayesian
Genotyper
G is the genotype, D is the observed data
•
p(D) is constant while
b is the base observed on the reads
•
No mention on the P(G) in the paper. These
can be estimated from known SNP data (?)
GATK (3)
•
Where the probability of seeing a base
b
given
a diploid genotype
G
is the average of the
probabilities of seeing
b
given the haploid
genotype {A
1
,A
2
} constituting G i.e.
•
These probabilities are computed based on
the base calling quality
e
(converted back to
probability value)
GATK (4)
•
The model was validated using the chr1 data
of one of the 1000 genome sequence
(NA12878).
•
Out of the 315,202 SNPs called (variant when
compared to the ref. genome), 81.7% are
listed in
dbSNP
(concordance 99.76%).
•
The more rigorous implementation is done in
the newest
samtools
paper (Li
Heng
, 2011,
Bioinformatics).
SOAPsnp
•
Ruiqiang
, Li. Genome Research, 2009
•
BGI SNP caller
•
Same Bayesian based for genotype calling
•
The priors for each genotype T
i
ϵ
{AA,AC, …,TT}
is defined as follows
SOAPsnp
(2)
I don’t understand this…
SOAPsnp
(3)
•
Transitions (A<

>C and G<

>T) are
four times
more likely as compared to
transversions
among substitution mutations. Hence,
assuming the ref is G,
I don’t understand this either…
SOAPsnp
(4)
•
For a total of n observed alleles/reads on a
locus
•
where,
for a diploid genotype T = {
H
m
,H
n
}
SOAPsnp
(5)
•
How to compute P(
d
k
H
)?
–
This is the probability of observing an allele in the
set of reads covering the position with a Haploid
genotype H
–
They use the following equations
–
where
•
o
k
= observed nucleotide
•
q
k
= the quality of the read
•
c
k
= the position on the read (computed from 5’)
SOAPsnp
(6)
SOAPsnp
(7)
•
They assume that P(
q
k
H
) are independent
from the underlying genotype H
•
That is, it is purely a function of
q
k
•
When there are
t
reads mapping to the same
genomic location, they are ordered by their
base quality on the locus in question.
•
The
t
th
observation will have its quality score
recalibrated by
SOAPsnp
(8)
•
“Real” reads mapping locations should follow
a Poisson distribution
•
The penalty for duplicate reads mapping to
the same location is set such that the “sum” of
the read’s score contribution mimics the
contribution of reads whose mapping location
follows such distribution
SOAPsnp
(9)
SOAPsnp
(10)
•
For predicted heterozygous sites, the genotype
probability is corrected by a
ranksum
p

value on the
base qualities of the two most frequent alleles.
•
Suppose the predicted genotype is AG where the reads
with A have base call qualities
{70,60,60,50,30,20,10,10} and reads with G have
{50,20,10,10}.
•
If the distributions of base qualities are deemed to be
significantly different, this means that one of them has
worse overall qualities than the other.
•
This may arise from errors instead of real
heterozygosity
SOAPsnp
(11)
•
They did a pretty impressive analysis to further
improve their results. They use two tests
–
Simulation:
•
Use Chromosome 12 as template,
•
Generate SNPs and
indels
with a rate of 0.001
•
Generate single end and paired end reads, 36X
•
Base qualities from real sequencing are selected at
random and read errors are introduced correspondingly.
–
Real Genome of an unknown Han Chinese.
•
They use microarray to validate 1,038,923 SNPs as
references (ground

truth).
SOAPsnp
(12)
•
They find that 95% SNP arising from read errors
occurs only in one read (simulation).
–
They set a minimum read support >= 4
•
When the error has also high frequency, its
frequency discrepancy (
w.r.t
the most frequent
allele) must also have P<=0.0001 according to the
binomial distribution (what about the
ranksum
?)
•
They also found that
indels
may generate a small
amount of SNP call error (0.03%, simulation)
SOAPsnp
(13)
•
Rule based filtering
–
We required at least two reads for haploid chromosome
X/Y and four reads for diploid
autosomes
. The reads need
to have base quality scores of 20 (Q20) at the position.
–
The overall depth, including randomly placed repetitive
hits, had to be less than 100.
–
The approximate copy number of flanking sequences had
to be less than two. This was done in order to avoid
misreading SNPs as
heterozygotes
caused by the alignment
of similar reads from repeat units or by copy number
variations (CNVs).
–
There had to be at least one paired end read.
–
The SNPs had to be at least 5
bp
away from each other.
SOAPsnp
(14)
•
For positions with known SNP (listed in
dbSNP
) they use a prior SNP rate of 0.1
(compared to 0.2 in MAQ).
•
For regions with low depth, they use Q10
filtering instead of Q20.
•
These steps are shown to improve coverage
on the genotyped ground truth SNPs.
VarScan
•
Koboldt
et al. Bioinformatics App Note, 2009
•
Rule based filtering
–
For each predicted variant,
VarScan
determines the overall
coverage, as well as the number of supporting reads, average
base quality, and number of strands observed for each allele.
–
Thresholds for coverage, quality, variant frequency, and/or
number of reads required to call a variant are set automatically
with the
easyrun
command, but can be manually adjusted by
the user.
•
Selling point: can use output from different
sequencing technology (454, Solid,
Illumina
) and
different
mapper
(BLAT,
Newbler
, BOWTIE,
cross_match
and
Novoalign
).
SNVMix
•
Goya et al. Bioinformatics 2010.
•
Shah SP group, Univ. British Columbia.
•
This is the first program that takes into
account both the Tumor and Normal reads
simultaneously
•
Selling point: remove
hard thresholds
that
they argue will be different for Tumor and
Normals
. Instead, learn it using EM.
SNVMix
(1)
•
The EM details is NOT going to be covered here
•
They showed that (using simulation) the model
performs equally well for datasets where the
mean certainty of the base calls is
∼
80% and
higher.
•
This suggests that
thresholding
base calls at
Phred
Q20 [99% certainty (Morin et al., 2008)] or
Q30 [99.9% certainty (
Ley
et al., 2008)] may be
overly stringent.
SNVMix
(2)
•
Validation
–
We evaluate the model based on real data derived from 16
ovarian cancer
transcriptomes
sequenced using NGS
(
RNAseq
), and a lobular breast cancer genome sequenced
to
>40x
coverage (Shah
et al., 2009b).
–
For all cases, we obtained high density genotyping array
data for orthogonal comparisons.
–
Finally, we demonstrate results on 497 positions from the
breast cancer genome that were subjected to Sanger
sequencing and thus constitute a ‘ground truth’ dataset for
benchmarking.
–
The 497 positions are predictions of SNVMix1 (a more
basic model) that are
resequenced
using Sanger
sequencing. 305 is confirmed as SNPs while 192 is not.
SNVMix
(3)
•
The
RNAseq
data on ovarian cancer is used for
cross validation.
•
The result of the CV indicated that sequencing
depth
thresholding
reduces the AUC score
–
the best performance comes from the model
without the
thresholding
•
Overfitting
?
SNVMix
(3)
•
They further obtained 14649 SNP positions on the
breast cancer genome using SNP array
•
They tested the model on the 497 Sanger

sequenced
SNPs.
MutationSeq
•
Ding et al, Bioinformatics 2012.
•
Shah SP group, Univ. British Columbia.
•
Selling points: machine learning on
Samtools
and GATK features is better than hard
thresholding
•
They (rather, the machine learning) found that
thresholds for Normal and Tumor genomes
are different (sound familiar?
SNVMix
?)
MutationSeq
(1)
•
They use 106 features, 20 from
Samtools
, 20 from
GATK and 26 of their own features.
•
They use:
–
Random forest, Bayesian additive regression tree, SVM and
LI regularized Logistic Regression
•
Datasets:
–
Exome
capture of 48 triple negative breast cancer, 76
bp
pair end reads
–
3369 are selected using hard thresholds and
resequenced
up to 6000X with 1015 somatic, 471
germline
and 1883
false positives.
–
Another set of
SOLiD
25

50bp pair end reads are prepared
on the whole genome. Using orthogonal
exome
capture,
they get 113 somatic, 57
germline
and 337 FP sites.
MutationSeq
(2)
•
Best Classifier =
BART & SVM
MutationSeq
(3)
•
Using the features, they identify the source of
FPs
–
Low mapping quality on Normal’s variant while
High mapping quality for tumor
–
Strand bias by PCR
–
GGT

>GGG misreading
Σχόλια 0
Συνδεθείτε για να κοινοποιήσετε σχόλιο