Supplementary Materials:
Supp Figure
1. The functional components in
PriVar
Annotation:
Many databases ha
ve
been
used
for variant
annotation, including:
Annotation for known
sites and
a
lternative allele frequency
1.
dbSNP
v
ersion
132
from NCBI.
2.
Allele frequency from
1000 genome project
based on
different populations:
Asian, American, African, European
.
The samples are from
phase I of
the
1000
Genome project
with
low c
overage
whole genome sequencing
data
.
3.
The allele frequency
for
different population
s
:
African Americans, European
Americans
from ESP6500
project
and all
subjects
are
from
NHLBI GO Exome
Sequencing
Project
.
Amino Acid Change
1.
The
UCSC
known
Gene
database
is adopted to define the boundaries of genes
and exons as well as annotate the potential amino acid changes.
Functional impact
of variants and genes
The results from the following algorithms are
used to evaluate the functional
impact of a variant:
1.
SIFT: SIFT score
2.
Polyphen2_HDIV
: Polyphen2 score based on HumDiv
3.
Polyphen2_HVAR:
Polyphen2 score based on HumVar
4.
LRT
likelihood ratio test
5.
MutationTaster: MutationTaster score
6.
GERP_NR
: GERP++ neutral r
ate
7.
GERP_RS
: GERP++ RS score
8.
PhyloP
: PhyloP score
9.
29way_logOdds
:
SiPhy score based on 29 mammal
genomes
10.
prospectr: prospectr
(
Adie, et al., 2005
)
score
Quality summary
Variant filtering based on
quality score
s
are determined by
mapping quality
(MQ)
,
variant quality
(VQ)
,
read depth
(DP)
.
The default valu
es for MQ
, VQ and DP are 0,
because some variant calling software do not provide all
three
quality scores
.
Summary of overall quality
1.
Mapping quality
for
Exonic
(NonExonic) variants: minimum value, maximize
value, ave
rage value, standard deviation and
median
.
2.
S
equencing
d
epth
for
Exonic
(
NonExonic) variants
: minimum value, maximize
value, av
erage value, standard deviation and
median
.
3.
Genotype quality
for
Exonic
(NonExonic) variants: minimum value, maximize
value, av
erage value, standard deviation and
median.
4.
Variant quality
for
Exonic
(NonExonic) variants: minimum value, maximize
val
ue, av
erage value, standard deviation and
median.
5.
Total number of Exonic
(NonExonic) variants.
6.
Total number of Exon
ic (NonExonic) homozygous
/
heterozygous
variants.
7.
Total number of indel, Total number of deletion, Total number of insertion, The
ratio of ins
ertion to deletion, The ratio of
indel to SNV
.
8.
Total number of known Exonic (NonExonic) indel, Total number of known
Exonic (NonExonic) insertion, Total number of known Exonic (NonExonic)
deletion,
Total number of novel Exonic (NonExonic) indel, Total numb
er of
novel Exonic (NonExonic) insertion, Total number of novel Exonic (NonExonic)
deletion.
9.
L
ength distribution of
coding deletion
/
insertion
from 1bp to 10bp
C
oncordance with SNP
chip
data
Concordance between calls from SNP chip and sequencing are used
to
evaluate SNP
calls from sequencing data by six evaluation metrics.
A
(ref)/B
(non
-
ref)
AA
(comparison)
AB
(comparison)
BB
(comparison)
AA
(evaluation)
AAAA
ABAA
BBAA
AB
(evaluation)
AAAB
ABAB
BBAB
BB
(evaluation)
AABB
ABBB
BBBB
1.
homo
_
ref
=
AAAA
AAAA
+
AAAB
+
AABB
2.
hetero
=
ABAB
ABAA
+
ABAB
+
ABBB
3.
homo
_
var
=
BBBB
BBAA
+
BBAB
+
BBBB
4.
concordance
=
AAAA
+
ABAB
+
BBBB
AAAA
+
AAAB
+
AABB
+
ABAA
+
BBAA
+
ABAB
+
BBAB
+
ABBB
+
BBBB
5.
non
_
ref
_
sensitivity
=
ABAB
+
BBAB
+
ABBB
+
BBBB
ABAA
+
BBAA
+
ABAB
+
BBAB
+
ABBB
+
BBBB
6.
non
_
ref
_
discrepancy
_
rate
=
ABAA
+
BBAA
+
AAAB
+
BBAB
+
AABB
+
ABBB
ABAA
+
BBAA
+
AAAB
+
ABAB
+
BBAB
+
AABB
+
ABBB
+
BBBB
T
he proportion of
variant
calls that exist in dbSNP
Based on
1000 genome project, for a single European, nearly 90% of true variants
should
appear in dbSNPv129 and the rate should
be
consistent across
different
individuals.
1.
T
ransition to
T
ransversion ratio (Ti/Tv
ratio
)
T
he Ti/Tv ratio for validated human SNPs is approximate 2.1 for whole genome
sequenc
es
and 2.8 for
whole
exome sequenc
es
, while the value
for
false positive
SNPs shoul
d be near 0.5.
S
ummary
parameters
includ
es:
Total number of Exonic (NonExonic) SNVs, Total number of Transition Exonic
(NonExonic) SNVs,
Total number of Trans
version Exonic (NonExonic)
SNVs
, Ti/Tv ratio of Exonic (NonExonic) SNV
s
, False positive
detection rate for
Exonic (NonExonic) SNV
s
.
Total nu
mber of
known
Exonic (NonEx
onic) SNV
s
, Total number of
known
Exonic (NonExonic) Transi
tion SNVs, Total number of
known
Exonic
(Non
Exonic) Transversion SNVs,
known
Ti/Tv ratio of Exonic (NonExonic)
SNV
s
.
Total number of novel Exonic (NonExonic) SNV
s
, Total number of novel Exonic
(NonExonic) Transition SNVs, Total number of novel Exonic (NonExonic)
Transversion SNVs, novel Ti/Tv ratio of Exonic (NonExonic) SNV
s
.
2.
I
ndel to SNV ra
tio
The number of SNV calls s
hould be close to 1 (0.78) in 1000 bases for
an
African
(European). The indel should be 1 in 8000 bases for whole genome sequencing
.
Therefore, the expected value
of indel to SNV ratio should be
about
1/8.
I
nser
tion to deletion ratio
The
expected value of this
ratio is around 1 for whole genome sequencing d
ata and
smaller than 1 for whole exome sequencing
.
3.
I
ndel length distribution
The indel length distribution is symmetric for whole genome sequencing and even
counts are higher than odd
counts.
The number of indels with
lengths
that
can be
divided by 3
are relatively
larger than others in coding region
s
since most of the
indel
s
are
non
-
f
rameshift
.
4.
Mendelian error
i
dentification
This module is used to detect the
variants
that violated
Me
n
delian inheritance whe
n
parental data are available.
The m
ajor
reason
for Me
n
delian inheritance violation
is
genotyping
error
s
although this can also be used to detect de novo mutations.
Candidate gene identification
Linkage
-
based strategy:
Haplotype
-
sharing among patients
The following steps are used to identify individuals who share a long haplotype
likely IBD.
First,
regions shared by affected family member
s
(or unrelated patients if a recent
founder mutation is suspected)
are
identified
,
following
a
no
-
violation
rule.
If
the
genotypes
of
a
common
SNV
(default
value
: minor allele frequency (MAF) >0.1
in 1000 genome project
calculated from
all subjects)
f
or
two
individuals
are
homozygous
on
different alleles
,
this situation
is
considered
as
v
iolation
of allele
sharing. Certain error rate is also allowed.
Second, the candidate IBD regions
longer than a
cutoff
length
(default: 1Mb)
are
selected
,
since
s
hort
er
shared regions
are likely due to chance.
Third,
the haplotype allele frequency
(HAF)
for each IBD region is
estimated
by
Markov model
based on
the
homozygous SNV
s
in
th
e
region
. Only
SNP
s
from
Hap
M
ap phase
II (default: Caucasian)
are used for the calculation
.
The variants in the selected region would be prioritized for further
analysis.
Run of homozygosity
(ROH)
This strategy is very similar
to a method we introduced before (Zhang, et al,
2011).
T
he length and
HAF are calculated for
long
homozygous stretches
identified
by
common SNVs from
a
single patient
(default length
cutoff: 1Mb,
MAF > 0.1 in 1000 genome project for all the subjects)
.
The variants in the
selected region would be prioritized for further analysis.
The method can be used
in the place of homozygosity mapping, or for detection of homozygous regions
when bot
h alleles were likely derived from a common recent founder. In other
scenarios such as large deletion, or uniparental disomy, the data also appear as
long homozygous stretch.
Certain error rate is also allowed in this evaluation.
"Double
-
hit"
This
strategy is designed for recessive disease
s
,
which may be caused by
homozygous mutation
s
or compound heterozygous mutations
.
PriVar would keep
the genes
carrying
two
or more
mutations
(rare and predicted to have deleterious
effect
on protein function
)
.
De novo
mutations
The
candidate
de novo
mutations should satisfy
the
two conditions
below
:
1. All of the genotypes (father, mother and child) should
be with relatively high
quality
and good coverage depth.
2. The mutation
should
only appear in the child's genome.
Disease candidate genes
Disease
candidate gene list is downloaded from Huge navigator
(
Yu, et al., 2008
)
,
which is derived by
literature search
. C
ustomized
gene list
is also
supported.
Prediction of functional impact of variants
prediction
for
SNV
We used the data set
"ExoVar
"
as
training set
(
http://statgenpro.psychiatry.hku.hk/limx/kggseq/download
/ExoVar.xls
)
to fit
a
logistic regression model
,
which
was
also adopted
by
KGGSeq
,
but PriVar used more
weak correlated features.
The correlation coefficients for the ten features can be found
in
Supp
Table 1.
,
most of the
features
are weakly correlated.
Supp Table 1. The correlation coefficients among ten candidate features
used to
fit
logistic regression model.
r^2
SIFT
Polyphen2_HDIV
Polyphen2_HVAR
LRT
MutationTaster
GERP_NR
GERP_RS
PhyloP
29way_logOdds
prospectr
SIFT
1
0.2855
0.2814
0.061
0.1379
0.0069
0.0981
0.1123
0.1148
0.0178
Polyphen2_HDIV
0.2855
1
0.8715
0.1447
0.3016
0.0269
0.237
0.2728
0.2932
0.0377
Polyphen2_HVAR
0.2814
0.8715
1
0.16
0.3526
0.0357
0.2404
0.2838
0.3297
0.0493
LRT
0.061
0.1447
0.16
1
0.1744
0.0136
0.1377
0.1313
0.1345
0.0187
MutationTaster
0.1379
0.3016
0.3526
0.1744
1
0.0794
0.2295
0.2433
0.3017
0.0835
GERP_NR
0.0069
0.0269
0.0357
0.0136
0.0794
1
0.1421
0.1501
0.1786
0.0841
GERP_RS
0.0981
0.237
0.2404
0.1377
0.2295
0.1421
1
0.8348
0.416
0.0586
PhyloP
0.1123
0.2728
0.2838
0.1313
0.2433
0.1501
0.8348
1
0.5447
0.0648
29way_logOdds
0.1148
0.2932
0.3297
0.1345
0.3017
0.1786
0.416
0.5447
1
0.0918
prospectr
0.0178
0.0377
0.0493
0.0187
0.0835
0.0841
0.0586
0.0648
0.0918
1
Therefore,
the
logistic regression
model was fitted
by incorporating
ten candidate
features, the
model information can be found
in
Supp Table2
.
Variables in logistic regression
B
S.
E
.
Wald
df
Sig.
Exp(B)
95% C.I.for EXP(B)
Lower
Upper
SIFT
-
1.765
.148
142.218
1
.000
.171
.128
.229
Polyphen2_HDIV
-
.858
.190
20.364
1
.000
.424
.292
.615
Polyphen2_HVAR
2.447
.187
171.911
1
.000
11.548
8.011
16.647
LRT
.160
.197
.659
1
.417
1.173
.798
1.726
MutationTaster
1.863
.087
463.127
1
.000
6.443
5.438
7.635
GERP_NR
.724
.207
12.238
1
.000
2.063
1.375
3.096
GERP_RS
2.497
.443
31.708
1
.000
12.148
5.094
28.972
PhyloP
-
.435
.942
.214
1
.644
.647
.102
4.097
29way_logOdds
1.377
.337
16.650
1
.000
3.963
2.045
7.679
P
rospectr
3.042
.193
249.309
1
.000
20.945
14.358
30.553
Constant
-
6.387
.561
129.484
1
.000
.002
Supp Table 2. The
logistic regression model
fitted
by ten features
. B: coefficient value
for eac
h feature
, S.E.: standard error, Wald: Wald test
for each coefficient
value
,df:
degrees of freedom,
Sig: statistical significant of coefficient values, CI: confidence
interval
.
Supp Figure
2. The comparison of prediction performance between KGGSeq
and PriVar.
We
compare
d the prediction performance
between PriVar and
KGGS
eq
,
the AUC
(area under
curve
) value of Pri
Var is higher (8.5%) than KGGS
eq, which means more
comprehensive information
incorporated in
PriVar will provide
more accurate
prediction result.
By default, the SNV will be regarded as "
deleterious
" i
f
the
prediction value is
higher than 0.
5
. This
l
ead
s
to identif
ication of
most
(86.7%)
mutations
with
deleterious effect
and only 20.3%
of the
common
SNVs
are wrongly
classified
.
P
rediction
for
InDel
s
PriVar assume
s
animo acid changes caused by
i
n
d
el
s
are independent
(
Zia and Moses,
2011
)
.
P
remature stop codon
s
are
usually introduced by indels.
G
enes with indels are
still translated and amino acid changes introduced by an indel are evaluated
independently.
A
Real Case
#of SNV
&Indel
#
of Gene
Total number
147328
13149
Removing SNVs with depth
<10
6 5 7 4
5
1 0 3 2 0
K
e e p i n g e x o n i c
a n d s p l i c i n g
v a r i a n t s
1 5 7 8 8
7 3 4 7
k e e p i n g n o n
S y n o n y mo u s
S N V
a n d F r a me S h i f t I n D e l
R e mo v i n g c o mmo n
v a r i a n t s
wi t h MAF > 1 %
7 6 4 9
4 4 6 2
1 6 9 5
1 1 4 9
Ke e p i n g
v a r i a n t s
i n c a n d i d a t e g e n e s
9
1
4
2
Ke e p i n g
v a r i a n t s
i n g e n e s wi t h "Do u b l e
-
h i t"
59
10
Keeping
variants
with
prediction
score >0.
8
8
5
Supp Table
3.
A step
-
wise analysis for filtering and prioritizing
variants
found in a
patient with early
-
onset Crohn's disease. Minor allele frequency (MAF) represe
nts the
population frequency
of all the subjects
from
the 1000 genome project.
(A)
(
B
)
(C)
Supp Figure
3.
Transition/Transversion
ratio
(A),
and SNP call
concordance
(
B
)
and InDel length distribution
for
quality summary
using 10 fold
depth
as cutoff
.
Command Line used for variant calling
java
-
jar GenomeAnalysisTK.jar
-
R Homo_sapiens_assembly19.fasta
-
T UnifiedGenotyper
-
I
Father
.bam
-
I
Mother
.bam
-
I
Child
.bam
-
o
Result
.vcf
-
stand_call_conf 50
-
stand_emit_conf 10
-
dcov 500
REFERENCE
Adie, E.A.
, et al.
(2005) Speeding disease gene discovery by sequence based
candidate prioritization,
BMC bioinformatics
,
6
, 55.
Hu, J. and Ng, P.C. (2012) Predicting the effects of frameshifting indels,
Genome
biology
,
13
, R9.
Yu, W.
, et al.
(2008) A navigator for human genome epidemiology,
Nature genetics
,
40
, 124
-
125.
Zia, A. and Moses, A.M. (2011) Ranking insertion, deletion and nonsense mutations
based on their effect on genetic information,
BMC bioinformatics
,
12
, 299.
Enter the password to open this PDF file:
File name:
-
File size:
-
Title:
-
Author:
-
Subject:
-
Keywords:
-
Creation Date:
-
Modification Date:
-
Creator:
-
PDF Producer:
-
PDF Version:
-
Page Count:
-
Preparing document for printing…
0%
Comments 0
Log in to post a comment