1
European Journal of Human Genetics

Supplementary Materials for
Pathway

based identification of SNP
s
predictive of survival
Herbert Pang,
Michael Hauser,
St
é
phane Minvielle
Contents:
Page
A) Survival Support Vector Machine
2
B) Conditional inference sur
vival forests
2
C) Random Survival Forest

Conserve and Random split criteria
3
D) Cox Boosting
4
E
)
Unadjusted P

values
5
F) Table 5

SNPs in
LD with
SNPs in
Table 4
5
G
) Figure 3

Pairwise r^2 of top genes in Stress Induction of HSP Regulation Pathway
6
H)
Table 6

Prediction results for St
ress Induction of HSP Reg.
with r^2 > 0.8 restriction
7
I)
Table
7

Prediction results for
Cytokine Network
with r^2 > 0.8 restriction
7
J)
Table 8

Top genes for Stress Induction of HSP Regulation and their impor
tance ranking
8
K)
Table 9

Top genes for
Cytokines Network
and their importance ranking
8
L) Notes associated with Tables 8 and 9
9

10
M
) Table
10

Un
i
variate Cox Regression
Results for HSP Regulation Pathway
11
N
) Table
11

Univariate Cox Regression R
esults for
Cytokine Pathway
12
O
) Notes associated with Tables 10
and 11
13
2
Survival Support Vector Machine
The goal of the support vector machine for survival is to find a non

linear hyperplane that
separates the individuals who had events before tim
e t and those without an event at that
point. The non

linear relationship is
obtained
from a reproducing kernel Hilbert space with
Gaussian kernel of width 1. This procedure is repeated at every event time. As in soft

margin
support vector machines, the
survival counterpart allows fo
r cases when the hyperplane do
not
exist by penalizing observations that lie on the different side of the margin. The minimization
problem also involves a sum that is based on the concordance index (Evers and Messow, 2008).
The penalty term lambda is set to 1. The optimization problem is a positive

definite quadratic
problem and is solved by using sequential minimal optimization algorithm.
Conditional inference survival forests
Th
e conditional inference forests,
cforest
implemented in R differs from original random
forests in two ways. The base learners used are conditional inference trees (Hothorn et al.,
2006). Also, the aggregation scheme works by averaging observation weights extracted from
each of the trees built
instead of averaging predictions as in the original version.
CIFs consists of the construction of many conditional inference trees, which are built using a
regression framework by binary recursive partitioning algorithm. This algorithm has three main
steps
. First, variable selection by testing the global null hypothesis of independence between
any of the predictors and the survival outcome. Second, if the hypothesis cannot be rejected,
then terminate the algorithm, otherwise select the predictor with the s
trongest association to
the survival outcome. Third, Implement a binary split based on split criteria using the selected
predictors. Finally, recursively repeat the steps.
For the case of censored data, the association is measured by
a
P

value correspo
nding to the
following linear test statistic
of a single predictor and the survival outcome:
where
1
(in node)
= 1 if the individual is
in node, 0 otherwise.
X
ji
is the
j

th covariate for the
i

th
subject.
a
i
is the log

rank score (Ho
thorn and Lausen,
2003
) of subject
i
defined as follows:
Let
X
(1)
X
(2)
∙∙∙
X
(
n
)
be the set of ordered
predictors,
Y
i
be the response for individual
i
, and for
each
survival time of observation
i
,
and a
i
is as defined in the main manuscript.
Since the
di
stribution of
T
j
is unknown in most situations,
permutation tests are employed. The
conditional expectation
µ
j
and covariance
j
under the null hypothesis were derived
by Strasser
and Weber
(1999).
A standardized form can
be obtained and the default in the
cforest
3
algorithm is in quadratic
form
Q
= (
t
–
µ)
+
(t
–
µ)
T
, where
t
is the observed linear test statistics
and
+
is the Moore

Penrose
inverse of
. The algorithm terminates if
P

value
falls below one
minus the mincriterion as prespecified.
A split is
implemented once the predictor has been selected from Step 1) described above and
that it meets its criterion. Let X
J
be the predictor J that was chosen, and let R
k
be one of all
possible subsets of the sample space of the predictor X
J
. The linear statis
tic
,
which measures the difference between the samples
for i=1,...,n and
and
for i=1,...,n and
where
if the individual is
in node, 0 otherwise;
if X
Ji
R, 0 otherwise;
and a
i
is the log

rank scores of subject i as defined above.
The best split B
K
is chosen from B such that
for all possible subsets B
K
. A split is established whe
n the sum of the weights in both child
nodes is larger than a pre

specified minsplit. This helps avoid splits that are borderline and
reduces the number of subsets to be evaluated. One advantage as stated by the
cforest
authors is that the approach taken
ensures that the right sized tree is grown without the need
of pruning.
Random Survival Forest
Conserve split criterion
Another type of splitting rule is the conservation of events (Naftel et al.,
1985). Denote the
Nelson

Aalen cumulative hazard estimato
r for child
j
as:
4
where
t
i,j
are the ordered event times for child
j
.
The conservation of events asserts that the total number of events is conserved in each child,
i.e.
where
is the cens
oring indicator.
Let
be the ordered time points for child
j
, and
be the corresponding censoring indicator for
for k=1,...,n
j
where
The Con(X,c)
measures whether the two groups are well separated and (1+Con(X,c))

1
is used in
the program It finds the best split by finding children closest to the conservation of events
principle.
Random split criterion
The last splitting rule is "random" which imp
lements a purely random uniform splitting. For
each node, a variable is randomly selected from a random set of
m variables. For a chosen
split variable X, a random split point is chosen among all possible split points on that variable.
Cox Boosting
Give
n a set of outcome and predictors, the goal is to approximate the outcome with a function.
To optimize a loss function, Cox boosting proceeds as follows. First, initialize the function to
offset. Second, compute the residuals defined as the negative par
tial log

likelihood for Cox
5
models . Third, fit the negative gradient vector by a base procedure, in this case, a component

wise univariate linear model as described in
Buhlmann and Hothorn (2007). Fourth, update the
function by taking a step of size 0 t
o 1 of the base learner. The second to fourth steps are
repeated until a predetermined number of iterations is completed. The boosting algorithm is
based on functional gradient descent, for more details please refer to Friedman (2001).
Unadjusted P

val
ues
corresponding to Table 3 in main manuscript.
Methods
p

val
<0.05
p

val
<0.025
p

val
<0.01
p

val
<0.001
p

val
<0.0001
Random Survival Forest (log rank score
split)
17
10
4
2
1
Random Survival Forest (log rank split)
23
12
7
1
0
Conditional Inference
Forest
39
23
11
1
0
Cox Boosting
26
12
7
0
0
Table 5

SNPs i
n
linkage disequilibrium
(
r^2
> 0.8)
with
SNPs in
Table 4
within the same
pathway
Pathway
Gene
chr
dbSNP ID
In LD with
dbSNPIDs
r^2
Stress Induction of HSP Regulation
CASP3
4
rs4647669
None
Stress Induction of HSP Regulation
*
BCL2
18
rs4941195
rs7240326
rs4941187
rs12457700
rs7228914
0.87
0.86
0.90
0.89
Stress Induction of HSP Regulation
*
BCL2
18
rs1381548
None
Stress Induction of HSP Regulation
*
BCL2
18
rs10503078
None
Stress Inductio
n of HSP Regulation
*
BCL2
18
rs4987839
None
Cyctokine Network
IL18
11
rs7106524
None
Cyctokine Network
IL15
4
rs4956404
rs360718
0.87
Cyctokine Network
IL5/IRF1
5
rs739718
rs3181224
0.89
Cyctokine Network
IL12A
3
rs640039
rs2115176
0.95
* Please ref
er to Supplementary Figure 3
6
Figure 3
Pairwise r^2

Top genes
from Table 5
in Stress Induction of HSP Regulation Pathway on
chromosome 18
7
Table 6

Table of
prediction results
from ten
independent
10

fold CV runs
for Stress Induction
of HSP Regulation
with the restriction
that SNPs with
r^2 > 0.8
are not allowed in the same
CV.
10

fold cross validation
p

value
#1
2.7e

10
#2
1.71e

08
#3
1.24e

06
#4
2.69e

10
#5
1.38e

09
#6
6e

08
#7
6.28e

11
#8
5.35e

10
#9
3.45e

10
#10
1.64e

07
Table 7

Tabl
e of
prediction results
from ten independent 10

fold CV runs
for Cyctokine
Network
with the restriction that SNPs with
r^2 > 0.8
are not allowed in the same CV.
10

fold cross validation
p

value
#1
3.96e

09
#2
1.06e

10
#3
8.83e

10
#4
4.56e

09
#5
1.07e

11
#6
6.93e

08
#7
1.8e

09
#8
3.47e

14
#9
1.3e

08
#10
3.71e

07
8
Table 8

T
op genes in Table 4
for
Stress Induction of HSP Regulation
and their importance
ranking
when restriction that SNPs with
r^2 > 0.8
are not allowed in the same forests is
impose
d.
Forest
rs4647669
rs4941195
rs1381548
rs10503078
rs4987839
CASP3
BCL2
BCL2
BCL2
BCL2
Original rank
1
2
3
4
5
#1
2
n/a
5
13
4
#2
1
n/a
4
6
5
#3
2
n/a
1
14
6
#4
1
n/a
2
9
4
#5
4
n/a
3
2
7
#6
3
1
5
4
7
#7
1
n/a
3
6
7
#8
1
n/a
2
5
13
#9
9
n/a
1
3
5
#10
2
1
3
5
7
n/a
=
when that SNP was not selected in
building that
forest.
Table 9

Top genes in Table 4 for
Cytokine Network
and their importance ranking when
restriction that SNPs with
r^2 > 0.8
are not allowed in the same forests is imposed.
Fo
rest
rs7106524
rs4956404
rs739718
rs640039
IL18
IL15
IL5/IRF1
IL12A
Original rank
1
2
3
4
#1
1
n/a
9
n/a
#2
n/a
1
n/a
n/a
#3
n/a
1
6
2
#4
n/a
2
10
n/a
#5
n/a
1
n/a
2
#6
1
n/a
7
n/a
#7
n/a
n/a
n/a
1
#8
1
n/a
5
n/a
#9
2
n/a
n/a
1
#10
1
n/a
6
2
n/a
= when that SNP was not selected in building that forest.
We discuss the biological plausibility of
IL5/IRF1
here as this gene has dropped out of the top 4
when we account for LD.
The
IRF1
gene, localized on chromosome 5q31.1, is mutated in
several ca
ncers
(Cavalli et al. (2010))
. These authors
have
discovered that low
IRF1
expression
9
is strongly correlated with both risk of recurrence and risk of death which suggest a tumor
suppressor role for the gene
(Amiel et al. (1999))
. For stable melphalan

tre
ated multiple
myeloma patients, 5q31.1 is a critical region that is consistently affected in deletion of 5q and it
contains many genes associated with hematopoiesis including
IRF1
(Amiel et al. (1999))
.
Furthermore,
IL5
has been
hypothesized as one of the
factors that can influence terminal
differentiation to memory or plasma cells in B cell lymphoid malignancies, such as multiple
myeloma
(Barker et al. (2000)
)
. In relation to the genomic region
5q31.1 we discovered,
researchers have
found that hyperdiplo
idy with 5q31 gain is a distinct entity that drives a more
favorable prognosis
(Avet

Loiseau et al. (2009))
.
Notes associated with Tables 8 and 9
(Cont'd from main manuscript Results

LD section)
S
econd, we investigated whether this affects the ranking o
f important SNPs. We ran ten
random forests with 3000 trees with the restriction that SNPs with r
2
> 0.8 are not allowed in
the same run. Tables 8 and 9 show the original rank of the important SNPs and their ranks in
the different random forests for Stre
ss Induction of HSP Regulation and Cytokine Network,
respectively. For Stress Induction of HSP Regulation, the top ranked SNPs remain within +/

2
ranks for all the top 5 SNPs at least 70% of the time. For Cytokine Network, SNPs rs7106524,
rs4956404 and
rs640039 remain the top 2 SNPs across different runs. However, rs739718 is
less significant than the original rank of 3, but remains in top 10. Indeed, consistent with what
was found in the classification setting, the ranks do differ when SNPs with high
LD are removed,
but not too severely. Given the variations
and
depending
on
how the SNPs within a pathway
10
are
correlated
,
we recommend investigating the ranks for significant pathways just like our
presentation here.
Stress Induction of HSP Regulation is
an example in which a single gene with many associated
independent SNPs is identified as important for predicting survival. Four out of the five top
SNPs are associated with the gene
BCL2
. As seen from Figure 3 in Supplementary materials, the
top SNPs on
chromosome 18 corresponding to
BCL2
identified in Table 4 have r
2
of less than .4
among them. Again, we point out that this does not affect the prediction as described in the
previous paragraph, but this may affect the ranking of important SNPs. Table 5
in
Supplementary materials show the list of SNPs that have high r
2
with the important SNPs
identified in Table 4.
11
Table 10

Univariate
Cox
R
egression
on Survival Outcome
for
Stress Induction HSP
Regulation
Pathway
12
Table
11

Univariate Cox Regression
on Survival Outcome
for Cytokine Pathway
13
Notes associated with Tables 10 and 11
In Tables 10 and 11 of Supplementary Materials, we presented the single SNP survival
association based on univariate Cox Proportional Hazard Regression since it does not ma
ke
sense to look at single associations using random forests which is an ensemble method for non

parametric multivariate modeling. One of the motivations for using a pathway

based approach
is similar to
the motivation given in the
study by Mootha et al.;
a study in which
the authors
have
not
been
able to identify
gene
s predictive of binary outcome using a single gene based
approach, but
were able to
identify
a statistically and biologically significant pathway using
gene set based methods
(Mootha et al. (2
003))
. As pointed out by several authors, regression
has been shown to be more powerful than single

SNP approach in the case control setting
(Chapman et al. (
2003), Tzeng et al. (2006
), Ballard et al. (2009)
)
.
The ability to discover high
order and non

l
inear effects is another motivation for using random forests for case

control
GWAS study in Goldstein et al
(2010)
.
14
References
Amiel A, Fridman K, Elis A et al: Deletion 5q31 in patients with stable, melphalan

treated
multiple myeloma. Cancer Genet Cytoge
net 1999; 113:45

8.
Avet

Loiseau H, Li C, Magrangeas F et al: Prognostic significance of copy

number alterations in
multiple myeloma. J Clin Oncol 2009; 27:4585

90.
Ballard DH, Aporntewan C, Lee JY, Lee JS, Wu Z, Zhao H: A pathway analysis applied to Genet
ic
Analysis Workshop 16 genome

wide rheumatoid arthritis data. BMC Proc 2009: 3 Suppl 7:S91.
Barker J, Verfaillie CM: A novel in vitro model of early human adult B lymphopoiesis that allows
proliferation of pro

B cells and differentiation to mature B lymph
ocytes. Leukemia 2000;
14:1614

20.
Buhlmann P, Hothorn T. 2007. Boosting algorithms: regularization, prediction and model fitting.
Stat Sci 22:477
–
505.
Carter K., McCaskie P.
,
and Palmer L.
.
2006.
JLIN: A Ja
va based Linkage Disequilibrium
plot
ter.
BMC Bio
informatics 7
:60
.
Cavalli LR, Riggins RB, Wang A, Clarke R, Haddad BR: Frequent loss of heterozygosity at the
interferon regulatory factor

1 gene locus in breast cancer. Breast Cancer Res Treat 2010;
121:227

31.
Chapman J, Cooper J, Todd J, Clayton D: Det
ecting disease associations due to linkage
disequilibrium using haplotype tags: a class of tests and determinants of statistical power. Hum
Hered 2003; 18

31.
Evers L, Messow CM. Sparse kernel methods for high

dimensional survival data. Bioinformatics.
2
008 Jul 15;24(14):1632

8.
Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statist
29, 1180.
Goldstein B, Hubbard A, Cutler A, Barcellos L: An application of Random Forests to a genome

wide association dataset: method
ological considerations & new findings. BMC Genet. 2010;
11:49.
Hothorn T, et al. Unbiased recursive partitioning: a conditional inference framework. J. Comput.
Graph. Stat. (2006a) 15:651
–
674.
15
Hothorn T, Lausen B. On the exact distribution of maximally s
elected rank statistics. Comput.
Stat. Data Anal. (2003) 43:121
–
137.
Mootha V, Lindgren C, Eriksson K: PGC

1alpha

responsive genes involved in oxidative
phosphorylation are coordinately downregulated in human diabetes. Nature Genetics 2003;
34:267

273.
Naf
tel D, et al. Conservation of events (1985) unpublished.
Strasser H, Weber C. On the asymptotic theory of permutation statistics. Math. Methods Stat.
(1999) 8:220
–
250.
Tzeng J, Wang C, Kao J, Hsiao C: Regression

based association analysis with clustered
h
aplotypes through use of genotypes. Am J Hum Genet 2006; 78:231

241.
Enter the password to open this PDF file:
File name:

File size:

Title:

Author:

Subject:

Keywords:

Creation Date:

Modification Date:

Creator:

PDF Producer:

PDF Version:

Page Count:

Preparing document for printing…
0%
Comments 0
Log in to post a comment