Supplementary Data - Bioinformatics

raviolirookeryΒιοτεχνολογία

2 Οκτ 2013 (πριν από 3 χρόνια και 2 μήνες)

118 εμφανίσεις

Supplementary Materials:


Supp Figure

1. The functional components in

PriVar


Annotation:


Many databases ha
ve

been
used

for variant

annotation, including:




Annotation for known

sites and
a
lternative allele frequency


1.

dbSNP

v
ersion
132

from NCBI.


2.

Allele frequency from
1000 genome project

based on
different populations:
Asian, American, African, European
.

The samples are from
phase I of
the
1000
Genome project
with
low c
overage
whole genome sequencing

data
.



3.

The allele frequency
for

different population
s
:

African Americans, European
Americans
from ESP6500

project
and all

subjects
are
from
NHLBI GO Exome
Sequencing

Project
.





Amino Acid Change


1.

The
UCSC
known

Gene

database

is adopted to define the boundaries of genes
and exons as well as annotate the potential amino acid changes.




Functional impact

of variants and genes


The results from the following algorithms are

used to evaluate the functional
impact of a variant:


1.

SIFT: SIFT score

2.

Polyphen2_HDIV
: Polyphen2 score based on HumDiv

3.

Polyphen2_HVAR:

Polyphen2 score based on HumVar

4.

LRT

likelihood ratio test

5.

MutationTaster: MutationTaster score

6.

GERP_NR
: GERP++ neutral r
ate

7.

GERP_RS
: GERP++ RS score

8.

PhyloP
: PhyloP score

9.

29way_logOdds
:

SiPhy score based on 29 mammal

genomes

10.

prospectr: prospectr

(
Adie, et al., 2005
)

score


Quality summary


Variant filtering based on
quality score
s

are determined by

mapping quality

(MQ)
,
variant quality

(VQ)
,

read depth

(DP)
.

The default valu
es for MQ
, VQ and DP are 0,
because some variant calling software do not provide all
three

quality scores
.




Summary of overall quality

1.

Mapping quality
for
Exonic

(NonExonic) variants: minimum value, maximize
value, ave
rage value, standard deviation and
median
.

2.

S
equencing
d
epth

for

Exonic

(
NonExonic) variants
: minimum value, maximize
value, av
erage value, standard deviation and

median
.

3.

Genotype quality
for
Exonic

(NonExonic) variants: minimum value, maximize
value, av
erage value, standard deviation and

median.

4.

Variant quality
for
Exonic

(NonExonic) variants: minimum value, maximize
val
ue, av
erage value, standard deviation and

median.

5.

Total number of Exonic

(NonExonic) variants.

6.

Total number of Exon
ic (NonExonic) homozygous
/
heterozygous

variants.

7.

Total number of indel, Total number of deletion, Total number of insertion, The
ratio of ins
ertion to deletion, The ratio of

indel to SNV
.

8.

Total number of known Exonic (NonExonic) indel, Total number of known
Exonic (NonExonic) insertion, Total number of known Exonic (NonExonic)
deletion,

Total number of novel Exonic (NonExonic) indel, Total numb
er of
novel Exonic (NonExonic) insertion, Total number of novel Exonic (NonExonic)
deletion.

9.

L
ength distribution of

coding deletion
/
insertion

from 1bp to 10bp




C
oncordance with SNP
chip

data

Concordance between calls from SNP chip and sequencing are used
to

evaluate SNP
calls from sequencing data by six evaluation metrics.


A

(ref)/B

(non
-
ref)

AA

(comparison)

AB

(comparison)

BB

(comparison)

AA

(evaluation)

AAAA

ABAA

BBAA

AB

(evaluation)

AAAB

ABAB

BBAB

BB

(evaluation)

AABB

ABBB

BBBB

1.
homo
_
ref
=
AAAA
AAAA
+
AAAB
+
AABB

2.

hetero
=
ABAB
ABAA
+
ABAB
+
ABBB

3.

homo
_
var
=
BBBB
BBAA
+
BBAB
+
BBBB

4.

concordance
=
AAAA
+
ABAB
+
BBBB
AAAA
+
AAAB
+
AABB
+
ABAA
+
BBAA
+
ABAB
+
BBAB
+
ABBB
+
BBBB

5.

non
_
ref
_
sensitivity
=
ABAB
+
BBAB
+
ABBB
+
BBBB
ABAA
+
BBAA
+
ABAB
+
BBAB
+
ABBB
+
BBBB

6.

non
_
ref
_
discrepancy
_
rate
=
ABAA
+
BBAA
+
AAAB
+
BBAB
+
AABB
+
ABBB
ABAA
+
BBAA
+
AAAB
+
ABAB
+
BBAB
+
AABB
+
ABBB
+
BBBB




T
he proportion of
variant
calls that exist in dbSNP

Based on
1000 genome project, for a single European, nearly 90% of true variants
should
appear in dbSNPv129 and the rate should
be
consistent across
different
individuals.

1.

T
ransition to

T
ransversion ratio (Ti/Tv

ratio
)

T
he Ti/Tv ratio for validated human SNPs is approximate 2.1 for whole genome
sequenc
es

and 2.8 for
whole
exome sequenc
es
, while the value

for
false positive
SNPs shoul
d be near 0.5.


S
ummary
parameters
includ
es:




Total number of Exonic (NonExonic) SNVs, Total number of Transition Exonic
(NonExonic) SNVs,

Total number of Trans
version Exonic (NonExonic)
SNVs
, Ti/Tv ratio of Exonic (NonExonic) SNV
s
, False positive
detection rate for
Exonic (NonExonic) SNV
s
.



Total nu
mber of
known

Exonic (NonEx
onic) SNV
s
, Total number of
known

Exonic (NonExonic) Transi
tion SNVs, Total number of
known

Exonic
(Non
Exonic) Transversion SNVs,
known

Ti/Tv ratio of Exonic (NonExonic)
SNV
s
.



Total number of novel Exonic (NonExonic) SNV
s
, Total number of novel Exonic
(NonExonic) Transition SNVs, Total number of novel Exonic (NonExonic)
Transversion SNVs, novel Ti/Tv ratio of Exonic (NonExonic) SNV
s
.


2.

I
ndel to SNV ra
tio

The number of SNV calls s
hould be close to 1 (0.78) in 1000 bases for
an
African
(European). The indel should be 1 in 8000 bases for whole genome sequencing
.
Therefore, the expected value

of indel to SNV ratio should be
about

1/8.



I
nser
tion to deletion ratio

The
expected value of this
ratio is around 1 for whole genome sequencing d
ata and
smaller than 1 for whole exome sequencing
.


3.

I
ndel length distribution

The indel length distribution is symmetric for whole genome sequencing and even
counts are higher than odd
counts.

The number of indels with
lengths

that
can be
divided by 3
are relatively

larger than others in coding region
s

since most of the
indel
s

are

non
-
f
rameshift
.


4.

Mendelian error
i
dentification

This module is used to detect the
variants

that violated

Me
n
delian inheritance whe
n
parental data are available.
The m
ajor
reason

for Me
n
delian inheritance violation
is
genotyping
error
s

although this can also be used to detect de novo mutations.


Candidate gene identification




Linkage
-
based strategy:
Haplotype
-
sharing among patients

The following steps are used to identify individuals who share a long haplotype
likely IBD.


First,
regions shared by affected family member
s

(or unrelated patients if a recent
founder mutation is suspected)
are

identified
,

following
a
no
-
violation

rule.
If
the
genotypes
of
a

common
SNV

(default

value
: minor allele frequency (MAF) >0.1
in 1000 genome project
calculated from

all subjects)
f
or
two

individuals
are
homozygous
on

different alleles
,
this situation
is
considered
as

v
iolation

of allele
sharing. Certain error rate is also allowed.


Second, the candidate IBD regions

longer than a

cutoff
length
(default: 1Mb)

are
selected
,
since
s
hort
er

shared regions

are likely due to chance.


Third,
the haplotype allele frequency

(HAF)

for each IBD region is

estimated
by
Markov model

based on

the
homozygous SNV
s
in
th
e

region
. Only
SNP
s

from
Hap
M
ap phase

II (default: Caucasian)

are used for the calculation
.


The variants in the selected region would be prioritized for further
analysis.





Run of homozygosity

(ROH)


This strategy is very similar
to a method we introduced before (Zhang, et al,
2011).

T
he length and
HAF are calculated for
long
homozygous stretches

identified
by

common SNVs from

a

single patient
(default length
cutoff: 1Mb,
MAF > 0.1 in 1000 genome project for all the subjects)
.

The variants in the
selected region would be prioritized for further analysis.

The method can be used
in the place of homozygosity mapping, or for detection of homozygous regions
when bot
h alleles were likely derived from a common recent founder. In other
scenarios such as large deletion, or uniparental disomy, the data also appear as
long homozygous stretch.

Certain error rate is also allowed in this evaluation.




"Double
-
hit"


This
strategy is designed for recessive disease
s
,

which may be caused by
homozygous mutation
s

or compound heterozygous mutations
.

PriVar would keep
the genes
carrying
two
or more
mutations

(rare and predicted to have deleterious
effect

on protein function
)
.





De novo

mutations

The
candidate

de novo

mutations should satisfy

the
two conditions

below
:

1. All of the genotypes (father, mother and child) should
be with relatively high
quality

and good coverage depth.

2. The mutation
should

only appear in the child's genome.





Disease candidate genes

Disease
candidate gene list is downloaded from Huge navigator

(
Yu, et al., 2008
)
,
which is derived by

literature search
. C
ustomized

gene list
is also

supported.


Prediction of functional impact of variants



prediction
for
SNV

We used the data set
"ExoVar
"

as
training set

(
http://statgenpro.psychiatry.hku.hk/limx/kggseq/download
/ExoVar.xls
)

to fit

a
logistic regression model
,

which
was

also adopted

by

KGGSeq
,

but PriVar used more
weak correlated features.
The correlation coefficients for the ten features can be found
in
Supp

Table 1.
,
most of the

features
are weakly correlated.


Supp Table 1. The correlation coefficients among ten candidate features
used to

fit
logistic regression model.

r^2

SIFT

Polyphen2_HDIV

Polyphen2_HVAR

LRT

MutationTaster

GERP_NR

GERP_RS

PhyloP

29way_logOdds

prospectr

SIFT

1

0.2855

0.2814

0.061

0.1379

0.0069

0.0981

0.1123

0.1148

0.0178

Polyphen2_HDIV

0.2855

1

0.8715

0.1447

0.3016

0.0269

0.237

0.2728

0.2932

0.0377

Polyphen2_HVAR

0.2814

0.8715

1

0.16

0.3526

0.0357

0.2404

0.2838

0.3297

0.0493

LRT

0.061

0.1447

0.16

1

0.1744

0.0136

0.1377

0.1313

0.1345

0.0187

MutationTaster

0.1379

0.3016

0.3526

0.1744

1

0.0794

0.2295

0.2433

0.3017

0.0835

GERP_NR

0.0069

0.0269

0.0357

0.0136

0.0794

1

0.1421

0.1501

0.1786

0.0841

GERP_RS

0.0981

0.237

0.2404

0.1377

0.2295

0.1421

1

0.8348

0.416

0.0586

PhyloP

0.1123

0.2728

0.2838

0.1313

0.2433

0.1501

0.8348

1

0.5447

0.0648

29way_logOdds

0.1148

0.2932

0.3297

0.1345

0.3017

0.1786

0.416

0.5447

1

0.0918

prospectr

0.0178

0.0377

0.0493

0.0187

0.0835

0.0841

0.0586

0.0648

0.0918

1


Therefore,
the

logistic regression

model was fitted

by incorporating

ten candidate
features, the
model information can be found

in
Supp Table2
.


Variables in logistic regression



B

S.
E
.

Wald

df

Sig.

Exp(B)

95% C.I.for EXP(B)



Lower

Upper


SIFT

-
1.765

.148

142.218

1

.000

.171

.128

.229

Polyphen2_HDIV

-
.858

.190

20.364

1

.000

.424

.292

.615

Polyphen2_HVAR

2.447

.187

171.911

1

.000

11.548

8.011

16.647

LRT

.160

.197

.659

1

.417

1.173

.798

1.726

MutationTaster

1.863

.087

463.127

1

.000

6.443

5.438

7.635

GERP_NR

.724

.207

12.238

1

.000

2.063

1.375

3.096

GERP_RS

2.497

.443

31.708

1

.000

12.148

5.094

28.972

PhyloP

-
.435

.942

.214

1

.644

.647

.102

4.097

29way_logOdds

1.377

.337

16.650

1

.000

3.963

2.045

7.679

P
rospectr

3.042

.193

249.309

1

.000

20.945

14.358

30.553

Constant

-
6.387

.561

129.484

1

.000

.002




Supp Table 2. The
logistic regression model
fitted

by ten features
. B: coefficient value
for eac
h feature
, S.E.: standard error, Wald: Wald test
for each coefficient
value
,df:
degrees of freedom,
Sig: statistical significant of coefficient values, CI: confidence
interval
.


Supp Figure

2. The comparison of prediction performance between KGGSeq
and PriVar.


We
compare
d the prediction performance

between PriVar and

KGGS
eq
,
the AUC
(area under
curve
) value of Pri
Var is higher (8.5%) than KGGS
eq, which means more
comprehensive information
incorporated in

PriVar will provide
more accurate
prediction result.

By default, the SNV will be regarded as "
deleterious
" i
f

the
prediction value is
higher than 0.
5
. This

l
ead
s

to identif
ication of

most

(86.7%)

mutations
with

deleterious effect
and only 20.3%
of the
common
SNVs

are wrongly
classified
.




P
rediction

for
InDel
s

PriVar assume
s

animo acid changes caused by
i
n
d
el
s

are independent

(
Zia and Moses,
2011
)
.
P
remature stop codon
s

are

usually introduced by indels.
G
enes with indels are
still translated and amino acid changes introduced by an indel are evaluated
independently.



A
Real Case




#of SNV
&Indel


#
of Gene

Total number


147328


13149

Removing SNVs with depth
<10


6 5 7 4
5

1 0 3 2 0

K
e e p i n g e x o n i c

a n d s p l i c i n g
v a r i a n t s


1 5 7 8 8


7 3 4 7

k e e p i n g n o n
S y n o n y mo u s

S N V

a n d F r a me S h i f t I n D e l

R e mo v i n g c o mmo n
v a r i a n t s

wi t h MAF > 1 %



7 6 4 9

4 4 6 2


1 6 9 5

1 1 4 9

Ke e p i n g
v a r i a n t s

i n c a n d i d a t e g e n e s


9
1

4
2

Ke e p i n g
v a r i a n t s

i n g e n e s wi t h "Do u b l e
-
h i t"




59

10

Keeping
variants

with
prediction

score >0.
8


8

5


Supp Table

3.

A step
-
wise analysis for filtering and prioritizing
variants

found in a
patient with early
-
onset Crohn's disease. Minor allele frequency (MAF) represe
nts the
population frequency
of all the subjects
from
the 1000 genome project.



(A)



(
B
)


(C)

Supp Figure

3.

Transition/Transversion

ratio

(A),
and SNP call
concordance

(
B
)

and InDel length distribution

for

quality summary

using 10 fold

depth

as cutoff
.



Command Line used for variant calling


java
-
jar GenomeAnalysisTK.jar


-
R Homo_sapiens_assembly19.fasta


-
T UnifiedGenotyper



-
I
Father
.bam



-
I
Mother
.bam



-
I
Child
.bam



-
o
Result
.vcf



-
stand_call_conf 50



-
stand_emit_conf 10



-
dcov 500


REFERENCE

Adie, E.A.
, et al.

(2005) Speeding disease gene discovery by sequence based
candidate prioritization,
BMC bioinformatics
,
6
, 55.

Hu, J. and Ng, P.C. (2012) Predicting the effects of frameshifting indels,
Genome
biology
,
13
, R9.

Yu, W.
, et al.

(2008) A navigator for human genome epidemiology,
Nature genetics
,
40
, 124
-
125.

Zia, A. and Moses, A.M. (2011) Ranking insertion, deletion and nonsense mutations
based on their effect on genetic information,
BMC bioinformatics
,
12
, 299.