The Complexity of Linkage Analysis with Neural Networks


Marin Marinov and Daniel E. Weeks

Department of Human Genetics

Graduate School of Public Health

University of Pittsburgh


Running head: Linkage analysis with neural networks



Corresponding author:

Daniel E. Weeks

University of Pittsburgh

Department of Human Genetics

A310 Crabtree Hall

130 DeSoto Street

Pittsburgh, PA 15261

USA


Tel: 1-412-624-5388

Fax: 1-412-624-3020

E-mail: dweeks@watson.hgen.pitt.edu


Key words: Genetic mapping, neural networks, affected sib pairs, complex traits, linkage analysis




Abstract

As the focus of genome-wide scans for disease loci has shifted from simple Mendelian traits to genetically complex traits, researchers have begun to consider alternative ways to detect linkage that consider more than the marginal effects of a single disease locus at a time. One interesting new method is to train a neural network on a genome-wide data set in order to search for the best non-linear relationship between identity-by-descent sharing among affected siblings at markers and their disease status. We investigate here the repeatability of the neural network results from run to run, and show that the results obtained by multiple runs of the neural network method may differ quite a bit. This is most likely due to the fact that training a neural network involves minimizing an error function with a multitude of local minima.


Introduction

Linkage analysis of complex genetic traits has typically been done by searching for the marginal effects of a single putative trait locus. However, complex traits may be the result of the interactions of several trait loci, and so power to detect linkage may be increased by searching for several trait loci at once [1]. Methods for searching for multiple trait loci include assuming a multidimensional joint penetrance matrix [2], or stratifying on evidence for linkage at one locus while searching for another [3-5]. Many of these methods are restricted to considering two trait loci at a time, usually under the assumption that these trait loci are not linked to each other. In contrast, Lucek and colleagues [6, 7] have proposed an alternative approach based on artificial neural networks that searches for patterns of interacting multiple trait loci across different chromosomes. Basically, their method searches for a set of non-linearly interacting markers that generates a joint linkage signal that is distinguishable from random noise; the input data are the identity-by-descent sharing at markers throughout the genome among affected siblings. Their method appears to be very promising, as neural networks are very good at detecting complex interactions that might not be detectable by conventional methodology [8].


Intrigued by this new methodology, we have endeavored to apply it to our own data sets, and have found that, while such methodology is intrinsically appealing, its behavior is more complex than we anticipated. In particular, we found that the ranking of marker loci obtained by multiple runs of the neural network method may differ quite a bit. We investigate here the repeatability of the neural network results from run to run.


Methods

We attempted to duplicate, as closely as possible, the methods described by Lucek et al. [7]. Like them, we applied the neural network method to Dr. John Todd's initial genome-wide scan with 277 markers on 96 affected sib-pair families for genes influencing Type I diabetes; these data are available from http://diesel.cimr.cam.ac.uk/todd/ [9].


We used the Stuttgart Neural Network Simulator (SNNS version 4.1; [10]) to implement the feed-forward neural network of Lucek et al. [7]. The SNNS software is very flexible, with a great variety of user-defined settings, which we describe here. The network had three layers: 277 input nodes (one for each marker), 17 middle-layer 'hidden' nodes, and 2 output nodes. Like Lucek et al. [7], we used the default SNNS functions: Logistic for the activation function, Identity for the output function, and Back Propagation for the learning function. The data on the 96 affected sib-pair families were coded as 192 input training patterns, one per parent. The maternal or paternal identity-by-descent status for each affected sib pair was coded as 1 for sharing, -1 for non-sharing, and 0 for non-informative. These 'real' data were supplemented by a nine-times larger set of 'random' data. This set of 1,728 (= 9 x 192) training patterns was randomly generated with an equal number of 1's and -1's and with the number of zeros equal to the number of zeros found in the real data. There were two target output patterns: a (1,1) pattern representing 'signal' plus 'noise' was assigned to each 'real' input pattern; and a (0,1) pattern representing just 'noise' was assigned to each 'random' input pattern.
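As a concrete illustration of this coding scheme, the following sketch (using numpy; all names and the zero count of 51 per pattern are our own illustrative choices, not taken from SNNS or from Lucek et al.) generates the 'random' training patterns and the two target output patterns. We read "an equal number of 1's and -1's" as holding within each random pattern, which is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MARKERS = 277    # one input node per marker
N_REAL = 192       # 96 affected sib-pair families x 2 parents
NOISE_FACTOR = 9   # the 'random' set is nine times the size of the 'real' set

def random_pattern(n_zeros):
    """One 'random' IBD-sharing pattern: n_zeros non-informative markers (0),
    the remaining markers split evenly between sharing (+1) and non-sharing (-1)."""
    pat = np.zeros(N_MARKERS)
    informative = rng.permutation(N_MARKERS)[n_zeros:]
    half = len(informative) // 2
    pat[informative[:half]] = 1.0
    pat[informative[half:]] = -1.0
    return pat

# 51 is an illustrative zero count (chosen so the informative count is even);
# in the paper it is matched to the zero count of the real data
random_inputs = np.array([random_pattern(51) for _ in range(NOISE_FACTOR * N_REAL)])

# target output patterns: (1,1) = 'signal' plus 'noise' for each real pattern,
# (0,1) = 'noise' only for each random pattern
real_targets = np.tile([1.0, 1.0], (N_REAL, 1))
random_targets = np.tile([0.0, 1.0], (NOISE_FACTOR * N_REAL, 1))
```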


The behavior of a neural network is a function of the 4,709 (= 277 x 17) weights from input node to hidden node, and the 34 (= 17 x 2) weights from hidden node to output node. Starting with random initial weights chosen uniformly from the interval (-1, +1), the weights are iteratively updated to minimize the average sum of the squared deviations of the actual outputs from the target outputs, as measured by the 'mean squared error' (MSE). Like Lucek et al. [7], we set the 'learning rate', which controls how quickly the weights change at each iteration, to 0.2. We continued updating weights until the learning error had decreased to a value of 0.005 or 50,000 iterations had been tried. (Note: Based on a sample SNNS instruction file that Dr. Lucek provided us with, we had 'setShuffle' set to 'true', which means that each time the network updates, it is presented with a randomly chosen pattern from the set of input training patterns. When 'setShuffle' is set to 'false', the training patterns are presented in order and, in some cases, we obtain much faster convergence. However, we still observe similar disparate results from run to run.)
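The training procedure just described can be sketched in plain numpy. This is a simplified re-implementation under our own naming, not SNNS itself (it omits SNNS details such as bias units): online back-propagation for a three-layer network with logistic hidden units, identity outputs, uniform (-1, +1) initial weights, a learning rate of 0.2, and an MSE stopping threshold.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(patterns, targets, n_hidden=17, lr=0.2,
          max_iter=50_000, target_mse=0.005, seed=0):
    """Online back-propagation for an input->hidden->output network with a
    logistic hidden layer and identity output layer, minimizing the MSE."""
    rng = np.random.default_rng(seed)
    n_in, n_out = patterns.shape[1], targets.shape[1]
    # initial weights drawn uniformly from (-1, +1)
    W1 = rng.uniform(-1.0, 1.0, (n_in, n_hidden))
    W2 = rng.uniform(-1.0, 1.0, (n_hidden, n_out))
    for it in range(max_iter):
        # 'setShuffle' behavior: present one randomly chosen pattern per update
        i = rng.integers(len(patterns))
        x, t = patterns[i], targets[i]
        h = sigmoid(x @ W1)          # logistic activation function
        y = h @ W2                   # identity output function
        err = y - t
        delta_h = (err @ W2.T) * h * (1.0 - h)   # back-propagated error
        W2 -= lr * np.outer(h, err)
        W1 -= lr * np.outer(x, delta_h)
        if it % 1000 == 999:         # check the stopping criterion periodically
            mse = np.mean((sigmoid(patterns @ W1) @ W2 - targets) ** 2)
            if mse < target_mse:
                break
    return W1, W2
```

Calling train twice on the same data with different seed values reproduces, in miniature, the "initial weights" experiment described below.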


To measure the importance of each input node, which represents a marker, towards distinguishing 'signal' from 'noise', we computed a 'contribution value' from input node i to output node k by first computing:

u(i -> k) = sum over j = 1 to H of w(i -> j) * w(j -> k),

where H is the number of hidden nodes, w(i -> j) represents the weight from input node i to hidden node j, and w(j -> k) represents the weight from hidden node j to output node k. The 'contribution value' was then computed as:

CV = u(i -> 1) - u(i -> 2).

A large contribution value is supposed to be indicative of a marker (input node) that is linked to the trait, since it is large only if its contribution to output node 1 ('signal' plus 'noise') is much greater than its contribution to output node 2 ('noise').
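In matrix form, with W1 holding the input-to-hidden weights and W2 the hidden-to-output weights, the sum over hidden nodes collapses to a single matrix product. A minimal sketch of this definition (our own code, not SNNS output):

```python
import numpy as np

def contribution_values(W1, W2):
    """CV for each input node i: u(i -> 1) - u(i -> 2), where
    u(i -> k) = sum_j w(i -> j) * w(j -> k) is the (i, k) entry of W1 @ W2."""
    u = W1 @ W2                # shape: (n_inputs, n_outputs)
    return u[:, 0] - u[:, 1]   # output node 1 is 'signal'+'noise', node 2 is 'noise'
```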


The set of contribution values is well-defined for a given data set only if the same weight estimates (up to symmetry) are obtained from run to run. Therefore, we checked the repeatability of the results by running the neural network software several times under several different settings:


Initialization of the Weights

Using exactly the same training data, we carried out eight runs while varying only the initial weights. This measures the repeatability of the results generated by the neural network software when only the initial weights are varied.


'Random' data

We carried out eight runs holding everything constant except for the 'random' data; again, the 'random' data represent the input patterns one would see if none of the markers were linked to the disease loci. Thus, from run to run, the 'real' data and the initial weights remain exactly the same, but the 'random' data portion of the training patterns varies.




Results

While few of the runs were stopped by the MSE dropping below 0.005 (Table 1), the majority of the runs had similar final mean squared errors in the range 0.005 to 0.009. However, within each run, the relative rank of the CVs for each marker was very stable (data not shown) by the time 50,000 iterations had been reached; that is, within a given run, the relative ranks of the CVs stabilize well before the limit of 50,000 iterations and change very little from iteration to iteration thereafter. Thus, it is unlikely that running the neural network software for many more iterations would alter the results presented here. Note that the mean CV and its standard deviation remained relatively constant from run to run (Table 1). However, for individual markers, there is substantial and often dramatic variation in the CV from run to run. In Table 1, we have displayed the CVs for three relevant markers: TNFA from the HLA region of chromosome 6, THINS from the insulin region of chromosome 11, and D11S35 from the IDDM-4 region of chromosome 11. D11S35 is relevant because Lucek et al. [7] found a strong signal at D11S35 in their neural network results; they point out that IDDM-4 was originally identified by John Todd's group only by subdividing the families by their sharing status at HLA.



Initialization of the Weights

Figure 1 graphs the contribution values (CVs) for each marker from two of the eight runs where only the initial weights varied from run to run; the two are representative of the variation seen from run to run. Note that both runs contain markers with high CVs in the HLA region on chromosome 6 (i.e., D6S258 and TNFA), which is the region that is very strongly linked to diabetes. However, the marker D9S158, which has the highest CV (35.86) in Figure 1A, has a very low CV (2.72) in Figure 1B. Similarly, the marker D13S153 has a high CV (27.55) in Figure 1A, but a minuscule CV of 0.02 in Figure 1B. In Table 1A, note that TNFA has a high CV with an extreme rank for almost all the replicates. However, the CV varies in magnitude from 17.6 to 35.4. THINS shows even more variation, with a minimum CV of 5.0 and a maximum of 23.2. And D11S35 would not be indicated as a marker of interest, since its CV is always relatively small across all the replicates.



'Random' data

Figure 2 graphs the CVs for each marker from two of the eight runs where only the 'random' data varied from run to run; the two are representative of the variation seen from run to run. Again, both runs contain markers with high CVs in the HLA region on chromosome 6, as expected. However, the marker D13S120 has a very high CV of 40.21 in Figure 2A, but a very small CV of 0.32 in Figure 2B. Similarly, marker D9S154 has a CV of 20.41 in Figure 2A, but a CV of only 0.39 in Figure 2B. As we saw in Table 1A, Table 1B shows that the three example markers display quite a bit of variation in their CVs from run to run. The largest CV of TNFA is more than three times larger than the smallest CV seen in these eight replicates. THINS has a large CV with a small rank in just one of the eight replicates. D11S35 never obtains a large enough CV to indicate that this should be a marker of interest.


Discussion

The idea of applying neural network technology to genome-wide linkage data is attractive in that it should be able to detect complex non-linear interactions between multiple trait loci. In most applications, neural networks are used to identify patterns. In such applications, one does not care if there are multiple different ways (i.e., sets of weights) to identify the pattern well. In contrast, in our application, the actual values of the weights are crucial, since we are trying to make inferences based on them. Here, we attempted to "repeat" the results of the single run presented in Lucek et al. [7], but were unable to, despite careful efforts to duplicate their analyses. We found that, in our hands, the neural network technology does not generate repeatable results. While the HLA area was consistently identified across multiple runs, we found that, elsewhere in the genome, a marker in one run may have a very high CV and in the next run may have a very low CV (Table 1, Figures 1 and 2). This occurred even when we held all the input training patterns constant and only varied the initial weights (Figure 1). Given the amount of variation in CVs from run to run, it seems that any attempt to assign p-values, even by simulation, to a given marker's CV is futile.


One might argue that we are interested in identifying regions rather than particular markers, so that the variation from run to run might not matter so much if, in each 'significant' region, at least one marker has an elevated CV. This approach was used by Bhat et al. [11], who looked for consistent results across 10 different runs. However, both Bhat et al.'s results and our results indicate that it is not necessarily the case that one sees consistently elevated regions from run to run. For example, consider the marker D13S120, which in Figure 2A has a very high CV, yet in Figure 2B lies in a very low trough with no adjacent markers with dramatically elevated CVs. Perhaps this is because the current approach totally ignores any information about relative marker order and position. Since this is a genetic study, it seems that any method of analysis that ignores the marker map information is likely to perform quite poorly.


Given a set of training patterns (input and output) and a feedforward network with a single hidden layer, there are several sets of weights that can generate the same answers. For example, one could permute the hidden nodes, or change the signs of all the weights associated with a particular hidden node, and still have the same input-output map and the same CVs [12]. Sussmann [12] showed that these weights are uniquely determined (up to permutational symmetry) if the activation function is tanh, but it is not clear if his proof applies when the logistic activation function is used. In fact, when the logistic activation function is used with a single neuron, the number of local minima of the MSE function can grow exponentially as a function of the dimension of the training patterns [13]. Indeed, it is a "characteristic property of neural networks that their empirical and generalization errors possess multiple minima" [14, p. 747]. It seems likely that the unrepeatable behavior we have seen here from run to run is due to the fact that the MSE we are minimizing "is likely to have many local minima and even multiple global minima" [14, p. 751]. This is not surprising at all, given that we are trying to estimate 4,743 weights. We attempted to fit a much simpler network (three input nodes, two hidden nodes, and two output nodes) with only 10 weights to a sample data set, and, even with this smaller network and 1,920 training patterns, we still encountered a significant lack of repeatability from run to run (data not shown); this indicates that network complexity is not the cause of the lack of repeatability. The typical solution to such convergence problems is to carry out many searches until one is confident that the global minimum has been obtained. However, carrying out many searches may not be computationally feasible in this case, as each neural network run on a genome-wide data set can be quite computationally intensive.
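The permutation and sign-flip symmetries are easy to demonstrate numerically. In this toy example of our own, the sign flip is shown with tanh, since tanh is odd; for the logistic function the flip is only equivalent up to bias terms, which is precisely why the uniqueness question is subtler there.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.uniform(-1, 1, (5, 3))   # input -> hidden weights
W2 = rng.uniform(-1, 1, (3, 2))   # hidden -> output weights
x = rng.uniform(-1, 1, 5)         # an arbitrary input pattern

logistic = lambda z: 1.0 / (1.0 + np.exp(-z))

# 1) Permuting the hidden nodes (and the matching rows of W2) leaves the
#    input-output map unchanged for any activation function; it also leaves
#    W1 @ W2, and hence the CVs, unchanged.
perm = np.array([2, 0, 1])
assert np.allclose(logistic(x @ W1) @ W2,
                   logistic(x @ W1[:, perm]) @ W2[perm, :])

# 2) With tanh (an odd function), flipping the signs of all weights attached
#    to one hidden node also preserves the input-output map.
W1f, W2f = W1.copy(), W2.copy()
W1f[:, 0] *= -1.0
W2f[0, :] *= -1.0
assert np.allclose(np.tanh(x @ W1) @ W2, np.tanh(x @ W1f) @ W2f)
```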


In addition to carrying out many searches per data set, a complementary approach to solving convergence problems is to try different estimation algorithms. We applied two other estimation algorithms (as implemented in the SNNS software package): the backpropagation algorithm with a momentum term and flat spot elimination, and the resilient propagation algorithm. Both of these algorithms converged much more quickly than the simple backpropagation algorithm that we (and Lucek et al.) originally used. And the resilient propagation algorithm appears to generate more consistent ranks for TNFA and THINS when we analyze the same data set eight times, varying only the initial weights. For example, the range of ranks for THINS is 5 to 27, versus the range of 4 to 136 returned in the original runs. However, there is still substantial variation from run to run, both in the CVs for a given locus and in the peaks seen in the graphs. For example, D13S120 is ranked first in two of the eight runs (with CVs of 29.9 and 24.1), but is ranked 95th in another run, with a CV of only 6.75; in this run, D13S120 is in a low region of the graph that would not indicate that any locus in this region is of interest. While our exploration of alternative estimation routines was not exhaustive, we conclude that the convergence problems are so intractable (due to a plethora of local minima, as well as problems of unidentifiability) that they are unlikely to be solved simply by use of the optimal estimation algorithm (should one exist!).


One might wonder if the unrepeatable behavior we have seen here from run to run is due to the variation in the randomly generated portion of the input training patterns. However, if variation in the random data were the sole source of the instability, then when we held all the input training patterns constant, we should have obtained repeatable results from run to run.


As others have pointed out [11, 15, 16], the definition of the contribution value used here is ad hoc, as there are many different ways to measure the importance of a given input node [17], all of which have some drawbacks. However, the CV definition is not directly relevant to the lack of repeatability, since the CVs are computed after the weight estimation process is completed. So choosing a different CV definition will not alter, from run to run, the estimates of the weights. Thus, it is likely that we will see non-repeatable results no matter which CV definition we use. To verify this, we computed the rankings (i.e., re-created Table 1) using the alternative definition CV = | |u(i -> 1)| - |u(i -> 2)| |. When we did this, we saw wider variation in the rankings from run to run than with the original CV measurements (data not shown). Finally, perhaps we should be investigating methods to measure the importance of a subset of input nodes (rather than the marginal importance of single input nodes). This approach was implemented by Fu [18] on experimental animal data, but has not, to our knowledge, been applied to human data yet.
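For reference, the alternative definition differs from the original CV only in taking absolute values before and after differencing; a one-line sketch of our own:

```python
import numpy as np

def alt_contribution_values(W1, W2):
    """Alternative CV: | |u(i -> 1)| - |u(i -> 2)| |,
    with u = W1 @ W2 as in the original definition."""
    u = W1 @ W2
    return np.abs(np.abs(u[:, 0]) - np.abs(u[:, 1]))
```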


Previously, Congdon et al. [19] tried a similar approach, using genetic algorithms to identify combinations of genes and other risk factors associated with coronary artery disease. While genetic algorithms are known for their ability to search complex domains, Congdon et al. [19] found that the genetic algorithms failed to provide repeatable solutions. However, by looking at the distribution of the results of 1,000 runs, they were able to discern some interesting patterns. It is likely that neural networks need to be used in a similar manner.




Acknowledgement

We thank Paul Lucek for his patience in discussing the details of his work with us. We thank the two anonymous reviewers, whose comments helped us improve this paper. This work was supported in part by funds from NIH grants HG00932 and AG16989, the University of Pittsburgh, and the W.M. Keck Center for Advanced Training in Computational Biology at the University of Pittsburgh, Carnegie Mellon University, and the Pittsburgh Supercomputing Center.


References

1 Dupuis J, Brown PO, Siegmund D: Statistical methods for linkage analysis of complex traits from high-resolution maps of identity by descent. Genetics 1995;140:843-856.

2 Lathrop GM, Ott J: Analysis of complex diseases under oligogenic models and intrafamilial heterogeneity by the LINKAGE programs. Am J Hum Genet 1990;47:A188.

3 MacLean CJ, Sham PC, Kendler KS: Joint linkage of multiple loci for a complex disorder. Am J Hum Genet 1993;53:353-366.

4 Cox NJ, Frigge M, Nicolae DL, Concannon P, Hanis CL, Bell GI, Kong A: Loci on chromosomes 2 (NIDDM1) and 15 interact to increase susceptibility to diabetes in Mexican Americans. Nat Genet 1999;21:213-215.

5 Hauser ER, Watanabe RM, Duren WL, Boehnke M: Stratified linkage analysis of complex genetic traits using related covariates. Am J Hum Genet 1998;63:A45.

6 Lucek PR, Ott J: Neural network analysis of complex traits. Genet Epidemiol 1997;14:1101-1106.

7 Lucek P, Hanke J, Reich J, Solla SA, Ott J: Multi-locus nonparametric linkage analysis of complex trait loci with neural networks. Hum Hered 1998;48:275-284.

8 Ripley BD: Pattern Recognition and Neural Networks. Cambridge, Cambridge University Press, 1996.

9 Davies JL, Kawaguchi Y, Bennett ST, Copeman JB, Cordell HJ, Pritchard LE, Reed PW, Gough SC, Jenkins SC, Palmer SM, Balfour KM, Rowe BR, Farrall M, Barnett AH, Bain SC, Todd JA: A genome-wide search for human type 1 diabetes susceptibility genes. Nature 1994;371:130-136.

10 Zell A, Mamier G, Vogt M, Mache N, Hübner R, Döring S, Herrmann K-U, Soyez T, Schmalzl M, Sommer T, Hatzigeorgiou A, Posselt D, Schreiner T, Kett B, Clemente G, Wieland J: SNNS: Stuttgart Neural Network Simulator. Stuttgart, Germany, University of Stuttgart, 1995.

11 Bhat A, Lucek PR, Ott J: Analysis of complex traits using neural networks. Genet Epidemiol 1999;17:S503-S507.

12 Sussmann HJ: Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks 1992;5:589-593.

13 Auer P, Herbster M, Warmuth MK: Exponentially many local minima for single neurons; in Touretzky D, Mozer M, Hasselmo M (eds): Advances in Neural Information Processing Systems (Vol. 8). Cambridge, MA, MIT Press, 1996, pp 316-322.

14 Fine TL, Mukherjee S: Parameter convergence and learning curves for neural networks. Neural Comput 1999;11:747-770.

15 Li W, Haghighi F, Falk CT: Design of artificial neural network and its applications to the analysis of alcoholism data. Genet Epidemiol 1999;17:S223-S228.

16 Saccone NL, Downey TJ Jr, Meyer DJ, Neuman RJ, Rice JP: Mapping genotype to phenotype for linkage analysis. Genet Epidemiol 1999;17:S703-S708.

17 Sarle WS: How to measure the importance of inputs? 1999; SAS Institute Inc. ftp://ftp.sas.com/pub/neural/importance.html

18 Fu L: Polygenic trait analysis by neural network learning. Artif Intell Med 1994;6:51-65.

19 Congdon CB, Sing CF, Reilly SL: Genetic algorithms for identifying combinations of genes and other risk factors associated with coronary artery disease; in International Joint Conference on Artificial Intelligence. Chambery, France, 1993.





Table and Figure Legends


Table 1: For each of the 8 replicates, the number of iterations, the mean squared error (MSE), the mean CV, the standard deviation of the CVs, and, for three markers, the marker's CV and the rank of that CV among the 277 CVs for that replicate. For Part A, we changed only the initial weights from run to run; for Part B, only the 'random' data varied.


Figure 1: Two graphs of the contribution values for each of the markers, where we only changed the initial weights from run to run. The markers are presented in chromosomal order, from p-arm to q-arm. The run numbers correspond to those used in Table 1.


Figure 2: Two graphs of the contribution values for each of the markers, where only the 'random' data varied from run to run. The markers are presented in chromosomal order, from p-arm to q-arm. The run numbers correspond to those used in Table 1.



Table 1:

A)

                                       TNFA          THINS         D11S35
Run    ITER   MSE     MEAN CV  STDEV   CV     Rank   CV     Rank   CV     Rank
1       5450  0.0050  5.20     5.14    27.74    2    14.90   16     4.12  163
2      50000  0.0052  5.67     5.24    35.37    1    16.83   13     0.71  259
3      50000  0.0089  4.97     4.77    30.90    1     5.02  136     4.59  148
4      50000  0.0073  5.40     5.90    29.00    3    16.44   17     8.88   83
5      50000  0.0052  5.89     5.42    26.54    3    18.54    8     3.92  189
6      50000  0.0057  5.28     5.65    21.05    8    18.51   12     8.17   82
7      50000  0.0063  5.25     4.81    17.64   11    18.85    8     4.14  168
8      50000  0.0068  6.46     5.52    27.72    3    23.18    4    11.98   48

B)

                                       TNFA          THINS         D11S35
Run    ITER   MSE     MEAN CV  STDEV   CV     Rank   CV     Rank   CV     Rank
1      50000  0.0052  4.81     4.35     8.99   61     9.91   47     5.94  108
2      50000  0.0052  5.31     5.29    16.24   10    14.42   25     1.66  224
3      50000  0.0063  5.41     5.49    13.66   32    12.72   38     0.20  268
4      50000  0.0063  5.62     5.07    18.04   10     4.27  162     2.23  210
5       6600  0.0050  4.65     5.05    13.17   22     5.05  130    11.65   32
6      50000  0.0110  6.34     5.57    20.91    6     8.83   98     6.42  135
7      20700  0.0050  5.17     5.40    14.79   21    24.99    3    13.62   29
8      50000  0.0062  4.84     5.11    29.25    2    11.41   36     9.71   55

Figure 1A

Figure 1B

Figure 2A

Figure 2B