European Journal of Human Genetics -

chardfriendlyAI and Robotics

Oct 16, 2013 (3 years and 9 months ago)

77 views

1


European Journal of Human Genetics

-

Supplementary Materials for
Pathway
-
based identification of SNP
s

predictive of survival

Herbert Pang,
Michael Hauser,
St
é
phane Minvielle


Contents:

Page

A) Survival Support Vector Machine

2

B) Conditional inference sur
vival forests

2

C) Random Survival Forest
-

Conserve and Random split criteria

3

D) Cox Boosting

4

E
)
Unadjusted P
-
values

5

F) Table 5
-

SNPs in
LD with
SNPs in
Table 4

5

G
) Figure 3
-

Pairwise r^2 of top genes in Stress Induction of HSP Regulation Pathway

6

H)
Table 6

-

Prediction results for St
ress Induction of HSP Reg.

with r^2 > 0.8 restriction

7

I)
Table
7

-

Prediction results for
Cytokine Network
with r^2 > 0.8 restriction

7

J)

Table 8
-

Top genes for Stress Induction of HSP Regulation and their impor
tance ranking

8

K)

Table 9
-

Top genes for
Cytokines Network
and their importance ranking

8

L) Notes associated with Tables 8 and 9

9
-
10

M
) Table
10

-

Un
i
variate Cox Regression

Results for HSP Regulation Pathway

11

N
) Table
11

-

Univariate Cox Regression R
esults for
Cytokine Pathway

12

O
) Notes associated with Tables 10
and 11

13





2


Survival Support Vector Machine

The goal of the support vector machine for survival is to find a non
-
linear hyperplane that
separates the individuals who had events before tim
e t and those without an event at that
point. The non
-
linear relationship is
obtained
from a reproducing kernel Hilbert space with
Gaussian kernel of width 1. This procedure is repeated at every event time. As in soft
-
margin
support vector machines, the

survival counterpart allows fo
r cases when the hyperplane do

not
exist by penalizing observations that lie on the different side of the margin. The minimization
problem also involves a sum that is based on the concordance index (Evers and Messow, 2008).

The penalty term lambda is set to 1. The optimization problem is a positive
-
definite quadratic
problem and is solved by using sequential minimal optimization algorithm.

Conditional inference survival forests

Th
e conditional inference forests,
cforest
implemented in R differs from original random
forests in two ways. The base learners used are conditional inference trees (Hothorn et al.,
2006). Also, the aggregation scheme works by averaging observation weights extracted from
each of the trees built
instead of averaging predictions as in the original version.

CIFs consists of the construction of many conditional inference trees, which are built using a
regression framework by binary recursive partitioning algorithm. This algorithm has three main
steps
. First, variable selection by testing the global null hypothesis of independence between
any of the predictors and the survival outcome. Second, if the hypothesis cannot be rejected,
then terminate the algorithm, otherwise select the predictor with the s
trongest association to
the survival outcome. Third, Implement a binary split based on split criteria using the selected
predictors. Finally, recursively repeat the steps.

For the case of censored data, the association is measured by

a
P
-
value correspo
nding to the
following linear test statistic

of a single predictor and the survival outcome:


where
1
(in node)

= 1 if the individual is

in node, 0 otherwise.
X
ji

is the
j
-
th covariate for the
i
-
th

subject.
a
i

is the log
-
rank score (Ho
thorn and Lausen,

2003
) of subject
i

defined as follows:


Let
X
(1)

X
(2)
∙∙∙
X
(
n
)

be the set of ordered

predictors,
Y
i

be the response for individual
i
, and for
each

survival time of observation
i
,
and a
i

is as defined in the main manuscript.

Since the
di
stribution of
T
j

is unknown in most situations,

permutation tests are employed. The
conditional expectation

µ
j

and covariance
j

under the null hypothesis were derived

by Strasser
and Weber

(1999).
A standardized form can

be obtained and the default in the

cforest

3


algorithm is in quadratic

form
Q

= (
t



µ)
+

(t


µ)
T
, where
t

is the observed linear test statistics
and
+

is the Moore
-
Penrose

inverse of
. The algorithm terminates if
P
-
value

falls below one
minus the mincriterion as prespecified.

A split is
implemented once the predictor has been selected from Step 1) described above and
that it meets its criterion. Let X
J

be the predictor J that was chosen, and let R
k

be one of all
possible subsets of the sample space of the predictor X
J
. The linear statis
tic

,

which measures the difference between the samples

for i=1,...,n and
and
for i=1,...,n and
where

if the individual is
in node, 0 otherwise;

if X
Ji



R, 0 otherwise;

and a
i

is the log
-
rank scores of subject i as defined above.

The best split B
K

is chosen from B such that


for all possible subsets B
K
. A split is established whe
n the sum of the weights in both child
nodes is larger than a pre
-
specified minsplit. This helps avoid splits that are borderline and
reduces the number of subsets to be evaluated. One advantage as stated by the
cforest

authors is that the approach taken

ensures that the right sized tree is grown without the need
of pruning.


Random Survival Forest

Conserve split criterion

Another type of splitting rule is the conservation of events (Naftel et al.,

1985). Denote the
Nelson
-
Aalen cumulative hazard estimato
r for child
j

as:

4



where
t
i,j

are the ordered event times for child
j
.

The conservation of events asserts that the total number of events is conserved in each child,
i.e.


where

is the cens
oring indicator.

Let
be the ordered time points for child
j
, and
be the corresponding censoring indicator for

for k=1,...,n
j


where

The Con(X,c)

measures whether the two groups are well separated and (1+Con(X,c))
-
1

is used in
the program It finds the best split by finding children closest to the conservation of events
principle.

Random split criterion

The last splitting rule is "random" which imp
lements a purely random uniform splitting. For
each node, a variable is randomly selected from a random set of

m variables. For a chosen
split variable X, a random split point is chosen among all possible split points on that variable.

Cox Boosting

Give
n a set of outcome and predictors, the goal is to approximate the outcome with a function.
To optimize a loss function, Cox boosting proceeds as follows. First, initialize the function to
offset. Second, compute the residuals defined as the negative par
tial log
-
likelihood for Cox
5


models . Third, fit the negative gradient vector by a base procedure, in this case, a component
-
wise univariate linear model as described in

Buhlmann and Hothorn (2007). Fourth, update the
function by taking a step of size 0 t
o 1 of the base learner. The second to fourth steps are
repeated until a predetermined number of iterations is completed. The boosting algorithm is
based on functional gradient descent, for more details please refer to Friedman (2001).


Unadjusted P
-
val
ues

corresponding to Table 3 in main manuscript.

Methods

p
-
val
<0.05

p
-
val
<0.025

p
-
val
<0.01

p
-
val
<0.001

p
-
val
<0.0001

Random Survival Forest (log rank score
split)

17

10

4

2

1

Random Survival Forest (log rank split)

23

12

7

1

0

Conditional Inference
Forest

39

23

11

1

0

Cox Boosting

26

12

7

0

0


Table 5

-

SNPs i
n
linkage disequilibrium
(
r^2

> 0.8)
with
SNPs in
Table 4

within the same
pathway


Pathway

Gene

chr

dbSNP ID

In LD with
dbSNPIDs

r^2

Stress Induction of HSP Regulation

CASP3

4

rs4647669

None


Stress Induction of HSP Regulation
*

BCL2

18

rs4941195

rs7240326

rs4941187

rs12457700

rs7228914

0.87

0.86

0.90

0.89

Stress Induction of HSP Regulation
*

BCL2

18

rs1381548

None


Stress Induction of HSP Regulation
*

BCL2

18

rs10503078

None


Stress Inductio
n of HSP Regulation
*

BCL2

18

rs4987839

None


Cyctokine Network

IL18

11

rs7106524

None


Cyctokine Network

IL15

4

rs4956404

rs360718

0.87

Cyctokine Network

IL5/IRF1

5

rs739718

rs3181224

0.89

Cyctokine Network

IL12A

3

rs640039

rs2115176

0.95

* Please ref
er to Supplementary Figure 3

6


Figure 3

Pairwise r^2
-

Top genes
from Table 5
in Stress Induction of HSP Regulation Pathway on
chromosome 18


7


Table 6
-

Table of
prediction results
from ten
independent
10
-
fold CV runs
for Stress Induction
of HSP Regulation

with the restriction
that SNPs with
r^2 > 0.8

are not allowed in the same
CV.


10
-
fold cross validation

p
-
value

#1

2.7e
-
10

#2

1.71e
-
08

#3

1.24e
-
06

#4

2.69e
-
10

#5

1.38e
-
09

#6

6e
-
08

#7

6.28e
-
11

#8

5.35e
-
10

#9

3.45e
-
10

#10

1.64e
-
07


Table 7

-

Tabl
e of
prediction results
from ten independent 10
-
fold CV runs
for Cyctokine
Network
with the restriction that SNPs with
r^2 > 0.8

are not allowed in the same CV.

10
-
fold cross validation

p
-
value

#1

3.96e
-
09

#2

1.06e
-
10

#3

8.83e
-
10

#4

4.56e
-
09

#5

1.07e
-
11

#6

6.93e
-
08

#7

1.8e
-
09

#8

3.47e
-
14

#9

1.3e
-
08

#10

3.71e
-
07


8


Table 8
-

T
op genes in Table 4
for
Stress Induction of HSP Regulation
and their importance
ranking
when restriction that SNPs with
r^2 > 0.8

are not allowed in the same forests is
impose
d.


Forest

rs4647669

rs4941195

rs1381548

rs10503078

rs4987839


CASP3

BCL2

BCL2

BCL2

BCL2

Original rank

1

2

3

4

5

#1

2

n/a

5

13

4

#2

1

n/a

4

6

5

#3

2

n/a

1

14

6

#4

1

n/a

2

9

4

#5

4

n/a

3

2

7

#6

3

1

5

4

7

#7

1

n/a

3

6

7

#8

1

n/a

2

5

13

#9

9

n/a

1

3

5

#10

2

1

3

5

7

n/a

=
when that SNP was not selected in
building that
forest.

Table 9
-

Top genes in Table 4 for
Cytokine Network

and their importance ranking when
restriction that SNPs with
r^2 > 0.8

are not allowed in the same forests is imposed.


Fo
rest

rs7106524

rs4956404

rs739718

rs640039


IL18

IL15

IL5/IRF1

IL12A

Original rank

1

2

3

4

#1

1

n/a

9

n/a

#2

n/a

1

n/a

n/a

#3

n/a

1

6

2

#4

n/a

2

10

n/a

#5

n/a

1

n/a

2

#6

1

n/a

7

n/a

#7

n/a

n/a

n/a

1

#8

1

n/a

5

n/a

#9

2

n/a

n/a

1

#10

1

n/a

6

2

n/a

= when that SNP was not selected in building that forest.

We discuss the biological plausibility of
IL5/IRF1

here as this gene has dropped out of the top 4
when we account for LD.
The
IRF1

gene, localized on chromosome 5q31.1, is mutated in
several ca
ncers

(Cavalli et al. (2010))
. These authors
have
discovered that low
IRF1

expression
9


is strongly correlated with both risk of recurrence and risk of death which suggest a tumor
suppressor role for the gene

(Amiel et al. (1999))
. For stable melphalan
-
tre
ated multiple
myeloma patients, 5q31.1 is a critical region that is consistently affected in deletion of 5q and it
contains many genes associated with hematopoiesis including
IRF1

(Amiel et al. (1999))
.
Furthermore,
IL5

has been
hypothesized as one of the

factors that can influence terminal
differentiation to memory or plasma cells in B cell lymphoid malignancies, such as multiple
myeloma

(Barker et al. (2000)
)
. In relation to the genomic region
5q31.1 we discovered,
researchers have
found that hyperdiplo
idy with 5q31 gain is a distinct entity that drives a more
favorable prognosis

(Avet
-
Loiseau et al. (2009))
.

Notes associated with Tables 8 and 9

(Cont'd from main manuscript Results
-

LD section)

S
econd, we investigated whether this affects the ranking o
f important SNPs. We ran ten
random forests with 3000 trees with the restriction that SNPs with r
2

> 0.8 are not allowed in
the same run. Tables 8 and 9 show the original rank of the important SNPs and their ranks in
the different random forests for Stre
ss Induction of HSP Regulation and Cytokine Network,
respectively. For Stress Induction of HSP Regulation, the top ranked SNPs remain within +/
-

2
ranks for all the top 5 SNPs at least 70% of the time. For Cytokine Network, SNPs rs7106524,
rs4956404 and

rs640039 remain the top 2 SNPs across different runs. However, rs739718 is
less significant than the original rank of 3, but remains in top 10. Indeed, consistent with what
was found in the classification setting, the ranks do differ when SNPs with high
LD are removed,
but not too severely. Given the variations

and

depending
on
how the SNPs within a pathway
10


are

correlated
,

we recommend investigating the ranks for significant pathways just like our
presentation here.

Stress Induction of HSP Regulation is
an example in which a single gene with many associated
independent SNPs is identified as important for predicting survival. Four out of the five top
SNPs are associated with the gene
BCL2
. As seen from Figure 3 in Supplementary materials, the
top SNPs on

chromosome 18 corresponding to
BCL2

identified in Table 4 have r
2

of less than .4
among them. Again, we point out that this does not affect the prediction as described in the
previous paragraph, but this may affect the ranking of important SNPs. Table 5

in
Supplementary materials show the list of SNPs that have high r
2

with the important SNPs
identified in Table 4.

11


Table 10

-

Univariate
Cox
R
egression
on Survival Outcome

for
Stress Induction HSP
Regulation

Pathway



12


Table
11

-

Univariate Cox Regression

on Survival Outcome
for Cytokine Pathway


13



Notes associated with Tables 10 and 11

In Tables 10 and 11 of Supplementary Materials, we presented the single SNP survival
association based on univariate Cox Proportional Hazard Regression since it does not ma
ke
sense to look at single associations using random forests which is an ensemble method for non
-
parametric multivariate modeling. One of the motivations for using a pathway
-
based approach
is similar to
the motivation given in the
study by Mootha et al.;
a study in which
the authors
have
not
been
able to identify
gene
s predictive of binary outcome using a single gene based
approach, but
were able to
identify

a statistically and biologically significant pathway using
gene set based methods

(Mootha et al. (2
003))
. As pointed out by several authors, regression
has been shown to be more powerful than single
-
SNP approach in the case control setting

(Chapman et al. (
2003), Tzeng et al. (2006
), Ballard et al. (2009)
)
.

The ability to discover high
order and non
-
l
inear effects is another motivation for using random forests for case
-
control
GWAS study in Goldstein et al

(2010)
.

14


References

Amiel A, Fridman K, Elis A et al: Deletion 5q31 in patients with stable, melphalan
-
treated
multiple myeloma. Cancer Genet Cytoge
net 1999; 113:45
-
8.

Avet
-
Loiseau H, Li C, Magrangeas F et al: Prognostic significance of copy
-
number alterations in
multiple myeloma. J Clin Oncol 2009; 27:4585
-
90.

Ballard DH, Aporntewan C, Lee JY, Lee JS, Wu Z, Zhao H: A pathway analysis applied to Genet
ic
Analysis Workshop 16 genome
-
wide rheumatoid arthritis data. BMC Proc 2009: 3 Suppl 7:S91.

Barker J, Verfaillie CM: A novel in vitro model of early human adult B lymphopoiesis that allows
proliferation of pro
-
B cells and differentiation to mature B lymph
ocytes. Leukemia 2000;
14:1614
-
20.

Buhlmann P, Hothorn T. 2007. Boosting algorithms: regularization, prediction and model fitting.
Stat Sci 22:477

505.

Carter K., McCaskie P.
,
and Palmer L.
.

2006.
JLIN: A Ja
va based Linkage Disequilibrium
plot
ter.
BMC Bio
informatics 7
:60
.

Cavalli LR, Riggins RB, Wang A, Clarke R, Haddad BR: Frequent loss of heterozygosity at the
interferon regulatory factor
-
1 gene locus in breast cancer. Breast Cancer Res Treat 2010;
121:227
-
31.

Chapman J, Cooper J, Todd J, Clayton D: Det
ecting disease associations due to linkage
disequilibrium using haplotype tags: a class of tests and determinants of statistical power. Hum
Hered 2003; 18
-
31.

Evers L, Messow CM. Sparse kernel methods for high
-
dimensional survival data. Bioinformatics.
2
008 Jul 15;24(14):1632
-
8.

Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statist
29, 1180.

Goldstein B, Hubbard A, Cutler A, Barcellos L: An application of Random Forests to a genome
-
wide association dataset: method
ological considerations & new findings. BMC Genet. 2010;
11:49.

Hothorn T, et al. Unbiased recursive partitioning: a conditional inference framework. J. Comput.
Graph. Stat. (2006a) 15:651

674.

15


Hothorn T, Lausen B. On the exact distribution of maximally s
elected rank statistics. Comput.
Stat. Data Anal. (2003) 43:121

137.

Mootha V, Lindgren C, Eriksson K: PGC
-
1alpha
-
responsive genes involved in oxidative
phosphorylation are coordinately downregulated in human diabetes. Nature Genetics 2003;
34:267
-
273.

Naf
tel D, et al. Conservation of events (1985) unpublished.

Strasser H, Weber C. On the asymptotic theory of permutation statistics. Math. Methods Stat.
(1999) 8:220

250.

Tzeng J, Wang C, Kao J, Hsiao C: Regression
-
based association analysis with clustered
h
aplotypes through use of genotypes. Am J Hum Genet 2006; 78:231
-
241.