An Implementation of Artificial Neural Network Approach
Towards Gene Pathway Analysis
Project Report
By
Rishi R. Gupta
Vinay Gupta
University of Connecticut
Storrs, CT 06269
Table of Contents
1. Introduction
2. Background
3. Review of various methods
a). Directed and Undirected Graphs
b). Bayesian Networks
c). Boolean Networks
d). Stochastic Equations
e). Artificial Neural Networks
4. Motivation
5. Problem statement
6. Data specifications
7. Work Performed
8. Results and Discussion
9. Conclusions
10. References
Introduction:
Background:
According to the Central Dogma of Molecular Biology, genes made of Deoxyribonucleic Acid (DNA) are the basic units of heredity [1]. Genes can also be called the information molecules of life, because the genetic code, represented as a sequence of chemical monomers, is decoded in each living cell into the functional molecular network of proteins and RNA (Ribonucleic Acid) molecules. Proteins and RNA are termed the functional molecules of life. These functional molecules are responsible for the physical, chemical and biological properties of the cell, which in turn are manifested globally in the behavior of a living organism such as a human, which is made up of trillions of cells. Figure 1 illustrates this principle of genetic information flow from DNA to proteins.
Figure 1. Genetic information flow in a cell: Genes (DNA) → RNA intermediate → Protein. Gene expression refers to both transcription and translation.
Gene Expression:
The biochemical process by which genes are first transcribed into RNA molecules (transcription) and then converted into protein molecules (translation) is referred to as gene expression. This transfer of information from gene to protein is not a static phenomenon. It varies dynamically with time depending on several factors, such as the stage of development of the cell (or organism) and environmental conditions. For example, the heat shock genes are overexpressed (that is, more protein is synthesized) as soon as the cell is subjected to a high-temperature environment. The rate of expression then decreases as the cell recovers from the shock. Thus the expression level of heat shock genes remains at a basal rate under normal or standard conditions, rises as soon as these conditions change, and finally returns to the normal level as the cell recovers. This simple example illustrates how the gene expression level, by varying at the molecular level, represents the chemical (and to some extent behavioral) response of the cell to changing environmental conditions, the environment being one of several factors that affect the rate of gene expression.
The state of gene expression under normal (or average) conditions is referred to as the normal level or basal rate. When the expression level of a gene suddenly increases during an interval of time (as in the heat shock example above), the gene is said to be switched “ON” during that time interval. Similarly, a gene is said to be switched “OFF” when its expression level decreases below the normal rate [2].
Microarray Experiments:
DNA microarray technology is a recent advancement in biotechnology. It provides biologists with the ability to measure the expression levels of thousands of genes in a single experiment. These arrays consist of large numbers of specific oligonucleotides or cDNA sequences, each corresponding to a different gene, affixed to a solid surface at very precise locations. When an array chip is hybridized to labeled cDNA derived from a particular tissue of interest, it yields simultaneous measurements of the mRNA levels in the sample for each gene represented on the chip. Since mRNA levels are expected to correlate roughly with the levels of their translation products, the active molecules of interest, array results can be used as a crude approximation to the protein content and thus the ‘state’ of the sample. Ideally, one would also like to measure the levels of proteins in a cell directly, and such technology is currently being developed. The intensity of each point in the array reflects the expression level of the gene at the corresponding location. In short, DNA microarrays yield a global view of gene expression.
Microarray Experimental Data:
Each data point produced by a DNA microarray hybridization experiment represents the ratio of expression levels of a particular gene under two different experimental conditions. Typically one of the experiments is carried out under standard conditions while the other is done under the varying conditions of interest (for example, a disease condition or a certain stage of growth of the cell). Thus the ratio reflects the increase (up-regulation) or decrease (down-regulation) in the level of gene expression when the cell is under the probing conditions, relative to the basal expression level.
The resulting data from a single experiment with n genes on a single chip is a series of n expression-level ratios. The numerator of each ratio is the expression level of the gene in the varying condition of interest, whereas the denominator is the expression level of the gene in the reference condition. The data from a series of m such experiments may be represented as a gene expression matrix, in which each of the n rows consists of an m-element expression vector for a single gene. The values of each gene expression vector are normalized on a logarithmic scale, so that the vector of log ratios has a total norm of 1.
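The normalization described above can be illustrated with a minimal sketch. It assumes that "norm" means the Euclidean norm and that natural logarithms are used; neither choice is stated in the text, and the ratio values and function name are hypothetical.

```python
import math

def normalize_log_ratios(ratios):
    """Log-transform expression ratios and scale the vector to unit Euclidean norm.

    Assumption: natural log and Euclidean norm; the report does not specify either.
    """
    logs = [math.log(r) for r in ratios]
    norm = math.sqrt(sum(v * v for v in logs))
    if norm == 0.0:
        return logs  # all ratios equal 1: no change relative to the reference
    return [v / norm for v in logs]

# Hypothetical expression ratios for one gene across m = 4 experiments.
gene_ratios = [2.0, 0.5, 1.0, 4.0]
vector = normalize_log_ratios(gene_ratios)
```

A ratio above 1 (up-regulation) gives a positive entry, a ratio below 1 (down-regulation) gives a negative entry, and a ratio of exactly 1 gives zero.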
Review of various methods:
1). Directed and Undirected Graphs: Viewing genetic regulatory networks as directed graphs is one of the most straightforward methods for modeling networks. A directed graph G is defined as a tuple $\langle V, E \rangle$, with V a set of vertices and E a set of edges. A directed edge is a tuple $\langle i, j \rangle$ of vertices, where i denotes the head and j the tail of the edge. The vertices of a directed graph correspond to genes or other elements of interest in the regulatory system, while the edges denote interactions among the genes. The graph representation of a regulatory network can be generalized in several ways. The vertices and edges could be labeled, for instance, to allow information about genes and their interactions to be expressed. By defining a directed edge as a tuple $\langle i, j, s \rangle$, with s equal to + or −, it can be indicated whether i is activated or inhibited by j. Hypergraphs can be used to deal with situations in which proteins cooperatively regulate the expression of a gene, for instance by forming a heterodimer. The edges are then defined by $\langle i, J, S \rangle$, where J represents a list of regulating genes and S a corresponding list of signs indicating their regulatory influence. Figure 1 shows simple regulatory networks of three genes.
Databases and knowledge bases of regulatory interactions are usually supplemented by applications to compose and edit networks by selecting and manipulating individual interactions. At the very least, they permit the user to visualize complete or partial networks, often on different levels of detail, and to navigate the networks.
From a comparison of regulatory networks of different organisms, it might be possible to establish to what extent parts of regulatory networks have been evolutionarily conserved.
Fig. 1. (a) Directed graph representing a genetic regulatory network and (b) its definition. The plus and minus symbols in the pictorial representation can be omitted and replaced by → and ⊣ edges, respectively. (c)–(d) Directed hypergraph representation of a regulatory network with cooperative interactions.
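The signed-edge and hyperedge tuples defined in this section map naturally onto simple data structures. The following is a minimal sketch only; the gene names, the particular interactions, and the helper function are hypothetical and are not taken from the figure.

```python
from collections import defaultdict

# Signed edges <i, j, s>: gene i is activated (+) or inhibited (-) by gene j.
# Gene names and interactions are hypothetical.
edges = [
    ("gene2", "gene1", "+"),
    ("gene3", "gene2", "-"),
    ("gene1", "gene3", "-"),
]

# Hyperedges <i, J, S>: gene i is regulated cooperatively by the genes in J,
# with S giving the sign of each regulator's influence.
hyperedges = [
    ("gene3", ["gene1", "gene2"], ["+", "-"]),
]

def regulators(edge_list):
    """Map each regulated gene to its list of (regulator, sign) pairs."""
    table = defaultdict(list)
    for target, source, sign in edge_list:
        table[target].append((source, sign))
    return dict(table)

regulator_table = regulators(edges)
```

Storing the regulated gene first mirrors the tuple order used in the text, where i is the head of the edge and j its tail.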
2). Bayesian Networks: In the formalism of Bayesian networks (Friedman et al., 2000; Pearl, 1988), the structure of a genetic regulatory system is modeled by a directed acyclic graph $G = \langle V, E \rangle$ (Fig. 2). The vertices $i \in V$, $1 \le i \le n$, represent genes or other elements and correspond to random variables $X_i$. If i is a gene, then $X_i$ will describe the expression level of i. For each $X_i$, a conditional distribution $p(X_i \mid \mathrm{parents}(X_i))$ is defined, where $\mathrm{parents}(X_i)$ denotes the variables corresponding to the direct regulators of i in G.
The graph G and the conditional distributions $p(X_i \mid \mathrm{parents}(X_i))$, together defining the Bayesian network, uniquely specify a joint probability distribution $p(X)$. Let a conditional independency $I(X_i; Y \mid Z)$ express the fact that $X_i$ is independent of Y given Z, where Y and Z denote sets of variables. The graph encodes the Markov assumption, stating that for every gene i in G, $I(X_i; \mathrm{nondescendants}(X_i) \mid \mathrm{parents}(X_i))$. By means of the Markov assumption, the joint probability distribution can be decomposed into
$$p(X) = \prod_{i=1}^{n} p(X_i \mid \mathrm{parents}(X_i))$$
Fig. 2. Example Bayesian network consisting of a graph, conditional probability distributions for the random variables, the joint probability distribution, and conditional independencies (Friedman et al., 2000).
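The decomposition of the joint distribution into a product of conditional distributions can be checked numerically on a toy network. The sketch below assumes a hypothetical three-gene network with binary variables; the graph and all probabilities are made up for illustration and do not reproduce the example of Fig. 2.

```python
from itertools import product

# Hypothetical network: gene A regulates B, and A and B together regulate C.
# Each variable is binary (0 = not expressed, 1 = expressed).
parents = {"A": [], "B": ["A"], "C": ["A", "B"]}

# Conditional probability tables: cpt[gene][parent values] = P(gene = 1 | parents).
# All numbers are invented for illustration.
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.9},
    "C": {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95},
}

def joint_probability(state):
    """p(X) as the product over genes of p(X_i | parents(X_i))."""
    prob = 1.0
    for gene, gene_parents in parents.items():
        key = tuple(state[p] for p in gene_parents)
        p_on = cpt[gene][key]
        prob *= p_on if state[gene] == 1 else 1.0 - p_on
    return prob

# Enumerate all 2^3 joint states; their probabilities should sum to 1.
states = [dict(zip("ABC", values)) for values in product([0, 1], repeat=3)]
total = sum(joint_probability(s) for s in states)
```

Because each conditional table is a valid distribution, the product defines a valid joint distribution over all eight states.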
The graph implies additional conditional independencies, as shown in Fig. 2 for the example network. Two graphs, and hence two Bayesian networks, are said to be equivalent if they imply the same set of independencies. The graphs in an equivalence class cannot be distinguished by observation on X. Equivalent graphs can be formally characterized as having the same underlying undirected graph, but may disagree on the direction of some of the edges (see Friedman et al. [2000] for details and references).
A Bayesian network approach towards modeling regulatory networks is attractive because of its solid basis in statistics, which enables it to deal with the stochastic aspects of gene expression and noisy measurements in a natural way. Moreover, Bayesian networks can be used when only incomplete knowledge about the system is available. Although Bayesian networks and the graph models in the previous section are intuitive representations of genetic regulatory networks, they have the disadvantage of leaving dynamical aspects of gene regulation implicit.
3). Boolean Networks: As a first approximation, the state of a gene can be described by a Boolean variable expressing that it is active (on, 1) or inactive (off, 0) and hence that its products are present or absent. Moreover, interactions between elements can be represented by Boolean functions, which calculate the state of a gene from the activation of other genes. The result is a Boolean network, an example of which is shown in Fig. 3. Modeling regulatory networks by means of Boolean networks has become popular in the wake of a groundbreaking study by Kauffman (1969b).
Let the n-vector $\hat{x}$ of variables in a Boolean network represent the state of a regulatory system of n elements. Each $\hat{x}_i$ has the value 1 or 0, so that the state space of the system consists of $2^n$ states. The state $\hat{x}_i$ of an element at time-point t+1 is computed by means of a Boolean function or rule $\hat{b}_i$ from the state of k of the n elements at the previous time-point t. (Notice that k may be different for each $\hat{x}_i$.) The variable $\hat{x}_i$ is also referred to as the output of the element and the k variables from which it is calculated the inputs. For k inputs, the total number of possible Boolean functions $\hat{b}_i$ mapping the inputs to the output is $2^{2^k}$. This means that for k = 2 there are 16 possible functions, including the nand, or, and nor in Fig. 3. In summary, the dynamics of a Boolean network describing a regulatory system are given by
$$\hat{x}_i(t+1) = \hat{b}_i(\hat{x}(t)), \quad 1 \le i \le n,$$
where $\hat{b}_i$ maps k inputs to an output value.
The structure of a Boolean network can be recast in the form of a wiring diagram (Fig. 3(c)). The upper row lists the state at t and the lower row the state at t+1, while the Boolean function calculating the output from the input is shown below each element. The wiring diagram is a convenient representation for computing transitions between states. The transition from one state to the next is usually determined in a parallel fashion, applying the Boolean function of each element to its inputs. For instance, given a state vector 000 at t = 0, the system in the example will move to a state 011 at the next time-point t = 1. That is, if all three genes are inactive at t = 0, the second and the third gene will become active at the next time-point. Hence, transitions between states in a network are deterministic, with a single output state for a given input, and synchronous, in the sense that the outputs of the elements are updated simultaneously. A sequence of states connected by transitions forms a trajectory of the system. Because the number of states in the state space is finite, the number of distinct states in a trajectory will be finite as well. More specifically, every trajectory, whatever its initial state, will eventually reach a steady state or a state cycle, also referred to as a point attractor or a dynamic attractor, respectively. The states that are not part of an attractor are called transient states. The attractor states and the transient states leading to the attractor together constitute the basin of attraction. In the example, both attractors have a basin of attraction consisting of four states.
Fig. 3. (a) Example Boolean network and (b) the corresponding equations. In this case, n = 3 and k = 2. (c) Wiring diagram of the Boolean network.
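Synchronous updating and attractor detection can be sketched in a few lines. The update rules below are hypothetical: they use nand, or and nor with n = 3 and k = 2 and happen to map 000 to 011, but they are not claimed to be the rules of the network in the figure, and their attractor structure differs from the one described above.

```python
from itertools import product

# Hypothetical update rules for n = 3 genes, each with k = 2 inputs.
# Illustrative only; not claimed to match the network in the figure.
rules = [
    lambda x: x[1] | x[2],          # gene 1: x2 OR x3
    lambda x: 1 - (x[0] & x[2]),    # gene 2: x1 NAND x3
    lambda x: 1 - (x[0] | x[1]),    # gene 3: x1 NOR x2
]

def step(state):
    """Synchronous update: every rule reads the state at t, giving the state at t+1."""
    return tuple(rule(state) for rule in rules)

def attractor(state):
    """Follow the trajectory until a state repeats; return the cycle it enters."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen[seen.index(state):]

def count_boolean_functions(k):
    """Number of distinct Boolean functions of k inputs: 2^(2^k)."""
    return 2 ** (2 ** k)

all_states = list(product([0, 1], repeat=3))
attractors = {tuple(attractor(s)) for s in all_states}
```

With these particular rules every one of the eight states ends in the single steady state 110, so the whole state space forms one basin of attraction.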
Boolean networks allow large regulatory networks to be analyzed in an efficient way, by making strong simplifying assumptions on the structure and dynamics of a genetic regulatory system. In the Boolean network formalism, a gene is considered to be either on or off, and intermediate expression levels are neglected. Also, transitions between the activation states of the genes are assumed to occur synchronously. When transitions do not take place simultaneously, as is usually the case, the simulation algorithm may not predict certain behaviors. There are situations in which the idealizations underlying Boolean networks are not appropriate, and more general methods are required.
4). Stochastic Equations: In principle, differential equations allow gene regulation to be described in great detail, down to the level of individual reaction steps like the binding of a transcription factor to a regulatory site, or the transcription of a gene by the stepwise advancement of RNA polymerase along the DNA molecule. However, a number of implicit assumptions underlying differential equation formalisms are no longer valid on the molecular level.
Differential equations presuppose that concentrations of substances vary continuously and deterministically, both of which assumptions may be questionable in the case of gene regulation. In the first place, the small numbers of molecules of some of the components of the regulatory system compromise the continuity assumption. There may be only a few tens of molecules of a transcription factor in the cell nucleus, and a single DNA molecule carrying the gene. Second, deterministic change presupposed by the use of the differential operator d/dt may be questionable due to fluctuations in the timing of cellular events, such as the delay between start and finish of transcription. As a consequence, two regulatory systems having the same initial conditions may ultimately settle into different states, a phenomenon strengthened by the small numbers of molecules involved.
Instead of taking a continuous and deterministic approach, some authors have proposed to use discrete and stochastic models of gene regulation. Discrete amounts X of molecules are taken as state variables, and a joint probability distribution $p(X, t)$ is introduced to express the probability that at time t the cell contains $X_1$ molecules of the first species, $X_2$ molecules of the second species, etc. The time evolution of the function $p(X, t)$ can now be specified as follows:
$$p(X, t + \Delta t) = p(X, t)\left(1 - \sum_{j=1}^{m} \alpha_j \Delta t\right) + \sum_{j=1}^{m} \beta_j \Delta t$$
where m is the number of reactions that can occur in the system, $\alpha_j \Delta t$ the probability that reaction j will occur in the interval $[t, t + \Delta t]$ given that the system is in the state X at t, and $\beta_j \Delta t$ the probability that reaction j will bring the system in state X from another state in $[t, t + \Delta t]$. Rearranging the above equation, and taking the limit as $\Delta t \to 0$, gives the master equation:
$$\frac{\partial}{\partial t} p(X, t) = \sum_{j=1}^{m} \left(\beta_j - \alpha_j \, p(X, t)\right)$$
Compare this equation with the rate equations ($dx_i/dt = f_i(x)$, $1 \le i \le n$). Whereas the latter determine how the state of the system changes with time, the former describes how the probability of the system being in a certain state changes with time. Notice that the state variables in the stochastic formulation can be reformulated as concentrations by dividing the number of molecules $X_i$ by a volume factor.
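To make the time evolution of the probability distribution concrete, the discrete-time update can be iterated directly for a very small system. The sketch below assumes a hypothetical single-species model with two reactions (production at a constant rate and degradation proportional to the number of molecules); the rate constants, the truncation of the state space, and the step size are all illustrative assumptions, not values from the report.

```python
def master_equation_step(p, k_prod, k_deg, dt):
    """One update p(X, t) -> p(X, t + dt) for a single molecular species.

    Assumed reactions (hypothetical): production at constant rate k_prod and
    degradation at rate k_deg * X. The state space is truncated at len(p) - 1.
    """
    x_max = len(p) - 1
    new_p = [0.0] * len(p)
    for x in range(len(p)):
        # Rates of leaving state x (the outflow terms).
        leave_by_production = k_prod if x < x_max else 0.0
        leave_by_degradation = k_deg * x
        stay = 1.0 - (leave_by_production + leave_by_degradation) * dt
        # Probability flow into state x from neighbouring states (the inflow terms).
        enter_by_production = k_prod * p[x - 1] if x > 0 else 0.0
        enter_by_degradation = k_deg * (x + 1) * p[x + 1] if x < x_max else 0.0
        new_p[x] = p[x] * stay + (enter_by_production + enter_by_degradation) * dt
    return new_p

# Start with zero molecules (probability 1 on X = 0) and integrate to t = 15.
p = [1.0] + [0.0] * 40
for _ in range(15000):
    p = master_equation_step(p, k_prod=5.0, k_deg=1.0, dt=0.001)

mean_molecules = sum(x * prob for x, prob in enumerate(p))
```

For this model the mean number of molecules should settle near k_prod / k_deg, which is the steady state that the corresponding deterministic rate equation would predict, while the distribution p also describes the fluctuations around that mean.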
Although the master equation provides an intuitively clear picture of the stochastic processes governing the dynamics of a regulatory system, it is even more difficult to solve by analytical means than the deterministic rate equation. Stochastic simulation results in closer approximations to the molecular reality of gene regulation, but its benefits are not always evident. In the first place, the approach requires detailed knowledge of the reaction mechanisms to be available, including estimates of the probability density function p(τ, ρ). Moreover, stochastic simulation is costly, due to the large number of simulations that need to be carried out to calculate an approximate value of p(X, t) and due to the large number of reactions that need to be traced in each of these simulations. Whether the costs always balance the expected benefits depends on the level of granularity at which one wishes to study regulatory processes. On a larger time-scale, stochastic effects may level out, so that continuous and deterministic models form a good approximation.
5). Artificial Neural Networks: One can simply view a neural network as a parallel computational model comprised of a large number of adaptive processing units (neurons). The neurons communicate through a large set of interconnections with variable strengths (weights) in which the learned information is stored. Neural networks have several unique characteristics and advantages as tools for the molecular sequence analysis problem. A very important feature of these networks is their adaptive nature, where learning by example replaces conventional programming in solving problems. This feature makes such computational models very appealing in application domains where one has little or incomplete understanding of the problem to be solved, but where training data are readily available. Owing to the large number of interconnections between their basic processing units, neural networks are error tolerant and can deal with noisy data. Neural network architecture encodes information in a distributed fashion. This inherent parallelism makes it easy to optimize the network to deal with a large volume of data and to analyze numerous input parameters. Flexible encoding schemes can be used to combine heterogeneous sequence features for network input. Finally, a multilayer network is capable of capturing and discovering high-order correlations and relationships in the input data.
A neural network is characterized by (i) its pattern of connections between the neurons (called its architecture), (ii) its method of determining the weights on the connections (called its training, or learning, algorithm) and (iii) its activation function.
Neural Networks Architecture:
A neural network consists of a large number of simple processing elements called neurons. The arrangement of neurons into layers and the connection patterns within and between layers is called the network architecture. Each neuron is connected to other neurons by means of directed communication links, each with an associated weight. The weights represent information being used by the net to solve a problem. Each neuron has an internal state, called the activation level, which is a function of the inputs it has received. An activation function is used to map any real input into a usually bounded range, often 0 to 1 or −1 to 1. In feedforward (FF) nets, the signals flow from the input units to the output units, in a forward direction: the input units receive signals from the outside world; the output units present the response of the net. A multilayer FF net is a net with one or more hidden layers between the input units and the output units. In a fully connected net every node in each layer is connected to every node in the adjacent forward layer. If, however, some of the communication links are missing from the network, we say that the network is partially connected.
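The forward flow of signals through a fully connected multilayer FF net can be sketched as follows. This is a minimal illustration assuming a logistic (sigmoid) activation with range 0 to 1; the layer sizes, weights and function names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_output(inputs, weights, biases):
    """Activation of each neuron in a fully connected layer."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def feedforward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Signals flow forward: input units -> hidden layer -> output units."""
    hidden = layer_output(inputs, hidden_w, hidden_b)
    return layer_output(hidden, output_w, output_b)

# Hypothetical weights: 2 inputs, 2 hidden neurons, 1 output neuron.
hidden_w = [[0.5, -0.4], [0.3, 0.8]]
hidden_b = [0.0, -0.1]
output_w = [[1.2, -0.7]]
output_b = [0.05]
response = feedforward([1.0, 0.0], hidden_w, hidden_b, output_w, output_b)
```

A partially connected net could be represented in the same way by fixing the weights of the missing links at zero.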
The Simplest Neural Network:
The simplest example of a neural network is the perceptron (Rosenblatt), used for the classification of a special type of patterns characterized as linearly separable. A perceptron has only two layers: input and output layers. It computes a linear combination of the network inputs and applies a threshold output function to this net input to produce the output. An elementary perceptron consists of a single output neuron with adjustable synaptic weights and a threshold. The threshold can be treated as a synaptic weight connected to a fixed input of value 1. Such a fixed input unit is called a bias unit. One can use the elementary perceptron to solve a pattern classification problem with only two classes. To perform classification with more than two classes requires the use of more output neurons. The weights of the perceptron can be adapted on an iteration-by-iteration basis, using an error-correction rule known as the perceptron learning rule. The perceptron convergence theorem (Minsky and Papert) guarantees that if a solution exists, the perceptron learning rule will, in a finite number of steps, converge to the correct weights that produce correct output values for all training patterns.
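The error-correction rule for an elementary perceptron can be sketched on a small two-class problem. The example below assumes a learning rate of 1, zero initial weights, and the logical AND function as the linearly separable training set; all of these are illustrative choices.

```python
def predict(weights, inputs):
    """Threshold output: 1 if the weighted sum (including bias) is positive, else 0."""
    net = sum(w * x for w, x in zip(weights, [1.0] + list(inputs)))
    return 1 if net > 0 else 0

def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    """Error-correction learning; weights[0] is the bias (fixed input of 1)."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * (n_inputs + 1)
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in samples:
            error = target - predict(weights, inputs)
            if error != 0:
                errors += 1
                # Move the weights toward the target for the misclassified pattern.
                for i, x in enumerate([1.0] + list(inputs)):
                    weights[i] += learning_rate * error * x
        if errors == 0:  # every training pattern is classified correctly
            break
    return weights

# Logical AND is linearly separable, so the rule should converge.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(and_samples)
```

For a pattern set that is not linearly separable (exclusive OR, for instance) the loop would simply stop at max_epochs without reaching zero errors.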
Application to the Gene Identification Problem:
One important area of application of the neural network model is gene identification. The gene identification problem is tackled by two complementary approaches: gene search by signal and gene search by content (Staden). The search by content methods use various coding measures to determine the protein-coding potential of sequences. The search by signal methods identify signal sequences, such as splice sites, which delimit coding regions. Neural networks provide an attractive model in which sequence features for both signal and content can be combined and weighted to improve predictive accuracy. The identification and analysis of other signals, binding sites or regulatory sites, such as promoters, ribosome binding sites, and transcriptional initiation and termination sites, are also important for the study of gene regulation and expression. Common approaches to finding functional signals include the consensus sequence method, the weight matrix method and the neural network method. Neural networks allow incorporation of both positive and negative examples and the detection of higher-order and long-range correlations, and are not based on the assumption of positional independence. As a result, neural networks are found to be superior to other methods in many studies.
Motivation for the project:
The pattern of gene expression in a cell characterizes its current state. Virtually all differences in cell state or type are correlated with changes in the mRNA levels of many genes. Expression patterns of many uncharacterized genes provide clues to their possible function by comparison [1]. This leads to a great many potential applications in medicine and molecular biology, especially in the identification of metabolic pathways, complex genetic diseases, drug discovery and toxicology analysis.
One major application of microarray data is in the area of functional genomics, where the functional significance of genes is studied. This application is based on the observation that genes of similar function yield similar expression patterns. Thus, based on their microarray expression profiles, genes can be grouped into classes of genes that are functionally related.
As data from such experiments accumulate, it will be essential to have accurate means for extracting biological significance and using the data to assign functions to genes. Also, with the completion of the genome projects, we have the sequence information of the genes. What is missing in the current setup is the functional information about the different genes. Thus data obtained from microarray experiments could be used extensively for the assignment of functions to unknown genes by correlating their expression profiles with the profiles of genes whose function is already known.
Problem statement:
As we observed earlier, microarray data can be used to determine the function of unknown genes by correlating the expression profile of these genes with the profiles of genes whose function is already known. At first glance, this problem can be recognized as a pattern classification problem in which we assign a functional class to each gene based on its microarray expression profile. The knowledge for performing the classification is contained in the mapping between expression profile and functional class for genes whose functions are already known. In short, the problem is to determine whether or not a gene belongs to a functional class, based on its expression profile at a given moment in time together with the knowledge base of expression profiles of genes whose function is already known. In this project we have tried to solve this pattern classification problem using Artificial Neural Networks.
As this model is well known for its ability to encode a knowledge base and perform pattern classification based on it, we decided to apply it to the gene functional class determination problem described in earlier sections. Currently the problem is treated more as one of predicting the time evolution of genes than as classification, but if we extend the problem, we can predict the gene expression of an unknown set of genes and then perform a clustering to classify the genes on the basis of their expression.
Chosen problem of gene function analysis:
A well-known problem in this area is the determination of the time evolution of some unknown genes by using the knowledge of the time evolution of known genes. By comparing the expression at the end of a certain time ‘t’, we can group (cluster) together genes that behave as if they have the same functionality. With the continuous discovery of new genes in various species, the functional class of these genes is not known, and their expression is also not known at every moment. Using the described method, we can assign functionality to the genes and predict their behavior at any time point. The microarray expression profiles of these genes can then be used to find their functionality within the domain of the functional classes known so far, as well as their time evolution.
Data specifications:
The first data set chosen is a subset of the human fibroblast response to serum data set (Iyer et al., 1999). The authors reported the response of human fibroblasts to serum, using cDNA microarrays representing about 8600 distinct human genes. This data set was generated using cDNA microarray hybridization to measure the temporal changes in mRNA levels of 8613 human genes at 12 different times, ranging from 0 to 24 hours. Cells growing exponentially were labeled “unsync” and included in their study. The authors reported that only 517 genes were observed to have significant changes. In this report we employed the gene expression matrix for the same 517-gene subset (http://genome-www.stanford.edu/serum/data/fig2clusterdata.txt). While applying the ANN model to this data set, we removed the data taken at 0 hr, as it did not add anything significant to our analysis.
The second data set is yeast microarray expression data obtained from the Stanford Microarray Database and the Munich Information Center for Protein Sequences Yeast Genome Database (MYGD). The yeast gene expression data consist of gene expression ratios for 2,467 genes from the budding yeast (Saccharomyces cerevisiae) measured in 79 different DNA microarray hybridization experiments, plus the normal expression value, giving rise to 80 different column values. From these data, we learn to recognize the functional class TCA (tricarboxylic acid cycle) as defined in MYGD. In this data set we took only the “alpha” subset and removed the rest to relieve the load on the software and avoid overfitting.
Work Performed:
The project work consisted of implementing an artificial neural network model on the data sets discussed above and showing the resulting gene expression predictions. To carry out this work we used software readily available as a trial version. The software used is Pathfinder (Z solutions, LLC 1998, http://www.zsolutions.com/pathfind.htm). The steps followed to perform the analysis are:
1). Select the training, test, validation, and scaling data sets.
2). Design the network by selecting the number of outputs we need, the number of hidden nodes and the epoch size required for the training.
Number of Output Nodes: The number of output nodes is controlled by the problem. In our case we selected one particular time point at which we wanted to see the expression value of a set of genes. In the model we applied the sigmoid model, as it is known to give better convergence.
Number of Hidden Nodes: The number of hidden nodes controls the number of weights in the model. Generally, the greater the complexity of the problem, the more nodes are needed. However, too many nodes may lead to overfitting, where the network “picks up” noise in the data. In our case we experimented with different numbers of hidden nodes and found 10 hidden nodes to be more than sufficient to give us the best fit.
Epoch Size: The epoch size is the number of observations seen by the learning algorithm before weight adjustments. The idea is to choose an epoch size large enough that noisy outliers will not unduly influence the results and small enough that specific details can be determined.
3). The network was trained on the data and an output was obtained for a particular time point.
4). With the help of the software we plotted the error vs. predicted and actual vs. predicted plots to see how well our model predicts the gene expression values.
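The steps listed above can be sketched in code. The following is a stand-in using plain backpropagation on a one-hidden-layer sigmoid network; it is not Pathfinder's algorithm, whose internals are not described here. The data are synthetic, and the network size, learning rate, number of passes and all function names are assumptions chosen for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Return hidden activations and network output; index 0 of each weight row is the bias."""
    hidden = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in w_hidden]
    output = sigmoid(w_out[0] + sum(wi * hi for wi, hi in zip(w_out[1:], hidden)))
    return hidden, output

def rmse(data, w_hidden, w_out):
    errors = [(forward(x, w_hidden, w_out)[1] - y) ** 2 for x, y in data]
    return math.sqrt(sum(errors) / len(errors))

def train(data, n_hidden, epoch_size, n_passes, rate, rng):
    n_in = len(data[0][0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(n_passes):
        for start in range(0, len(data), epoch_size):
            batch = data[start:start + epoch_size]
            g_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
            g_out = [0.0] * (n_hidden + 1)
            for x, y in batch:
                hidden, out = forward(x, w_hidden, w_out)
                d_out = (out - y) * out * (1.0 - out)  # gradient at the output unit
                g_out[0] += d_out
                for h in range(n_hidden):
                    g_out[h + 1] += d_out * hidden[h]
                    d_h = d_out * w_out[h + 1] * hidden[h] * (1.0 - hidden[h])
                    g_hidden[h][0] += d_h
                    for i in range(n_in):
                        g_hidden[h][i + 1] += d_h * x[i]
            # Weights are adjusted only after epoch_size observations have been seen.
            scale = rate / len(batch)
            w_out = [w - scale * g for w, g in zip(w_out, g_out)]
            w_hidden = [[w - scale * g for w, g in zip(ws, gs)]
                        for ws, gs in zip(w_hidden, g_hidden)]
    return w_hidden, w_out

rng = random.Random(0)
# Synthetic stand-in for an expression matrix: 3 predictor columns, 1 response column.
rows = []
for _ in range(120):
    x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    rows.append((x, sigmoid(1.5 * x[0] - x[1])))
training, test, validation = rows[:60], rows[60:90], rows[90:]

w_hidden, w_out = train(training, n_hidden=4, epoch_size=10, n_passes=300, rate=0.5, rng=rng)
test_rmse = rmse(test, w_hidden, w_out)
```

Comparing the RMSE on the test and validation subsets for different values of n_hidden and epoch_size corresponds to the trial-and-error selection of the network design described in step 2.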
In Fig. 5, the bottom layer represents the input layer, in this case with 5 inputs labeled X1 through X5. In the middle is the hidden layer, with a variable number of nodes. It is the hidden layer that performs much of the work of the network. The output layer in this case has two nodes, Z1 and Z2, representing the output values we are trying to determine from the inputs.
Fig. 5. Structure of a neural network (figure taken from the Pathfinder literature)
Results and Discussion:
With reference to the attached screen shots of the software, the following results were obtained:
1). Human Fibroblast: In this case the data was chosen as discussed above. The network structure chosen for this data set is as follows:
Input Nodes = 11;
Hidden Nodes = 10;
Output Node = 1;
We ran the software for two different outputs and compared the predicted output (from the ANN model) with the actual data obtained from the experiment. The number of hidden nodes and the epoch size were chosen after some trial and error to see which network architecture gives the best prediction and the closest model. In this case the data set was split as follows:
Training = 249;
Test = 150;
Validation = 118;
Shown below is a screen shot of the Pathfinder window and the work area where
we specify the data to be read from an Excel spreadsheet. The training, test and
validation sets are chosen to get the best-fit model. Since the network needs a lot of
training before we get a good model, we use a large part of the data for training and the
rest for test and validation. The variables in this case are:
Predictor (INPUT) variables: gene expression at 15min, 30 min, 1 HR, 2HR,
4HR, 6HR, 12HR, 16HR, 20HR, 24HR, UNSYN.
Response (OUTPUT) variables: gene expression at UNSYN (CASE 1) and
20HR (CASE 2).
In the above screen shot we can see the best test and the last test converging on
each other, while the RMSE (root mean square error) changes continuously until it is
minimized for the best-fit model.
From this run we obtain the set of weights that the network has learned.
The screen shot showing the weights is shown below.
From this window we can extract the weights for the layer between the input and hidden
nodes and the weights between the hidden and output nodes. Using these weights we can
form our own sigmoid model and calculate the output for an
unknown set of genes.
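Such a sigmoid model can be sketched as a forward pass through the two weight layers. This is a minimal sketch under stated assumptions: the report does not give the exact functional form Pathfinder uses, so the bias terms, the sigmoid on the output node, and the names `sigmoid` and `predict` are hypothetical, and the weights shown are placeholders rather than values from the screen shot.

```python
import math


def sigmoid(t):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def predict(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: inputs -> sigmoid hidden layer -> one sigmoid output."""
    # One row of w_hidden (and one bias) per hidden node.
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Placeholder weights for a tiny 2-input, 2-hidden-node network (illustration only).
w_hidden = [[0.4, -0.2], [0.1, 0.3]]
b_hidden = [0.0, -0.1]
w_out = [0.7, -0.5]
b_out = 0.05
print(predict([0.3, 0.7], w_hidden, b_hidden, w_out, b_out))
```

For the network described here, `w_hidden` would hold 10 rows of 11 weights and `w_out` would hold 10 weights.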
Summarizing the results, we can observe the next two screen shots and the attached
graphs and draw some inferences. The calculation was performed for two output
variables and the analysis was done separately for each case. In the first case, we
used “UNSYN” as our output and compared the predicted value (from
the ANN model) with the actual experimental value. The RMSE value was 0.0602, which
is quite small and thus indicates a good fit for the model.
In the case of the output variable being “20 HR” the RMSE value is 0.0535 (which again
is quite small) and we obtain a very high correlation coefficient (95.12%). This
result tells us that the model obtained for this output variable is very close to the actual
data and the prediction is very good.
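The two figures of merit quoted above can be computed from paired actual and predicted values as sketched below. The sketch assumes that the correlation coefficient reported by Pathfinder is Pearson's r expressed as a percentage, which the report does not confirm, and the sample values are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between paired values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def correlation(actual, predicted):
    """Pearson correlation coefficient between paired values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Made-up expression values (illustration only).
actual = [0.10, 0.40, 0.35, 0.80]
predicted = [0.12, 0.38, 0.30, 0.85]
print(rmse(actual, predicted), 100.0 * correlation(actual, predicted))
```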
The attached plots give the following information:
1). Error vs. Predicted: Comparing the plots obtained for both cases, we find that
when the output variable is “UNSYN” the error vs. predicted plot shows an
uneven distribution about the x-axis (horizontal line): there is a higher concentration of
points above the line than below it. Thus the relationship between the predictors and the
response is not so strong. When the response variable is “20 HR” we see a more even
distribution about the x-axis, which tells us that the relationship between the predictor
variables and the response variable is very strong. It implies that the model fits a linear
change in the data better than an exponential change. As we remarked earlier, the “UNSYN”
data represents exponential growth of the cells whereas all the others represent regular
cell growth. Thus we got a better fit for the model with “20 HR” as
our response variable than with “UNSYN”.
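The visual check described above, whether the errors scatter evenly about the horizontal line, can also be expressed as a simple count. A minimal sketch, assuming the error is defined as actual minus predicted (the report does not state the sign convention) and using made-up values; `residual_balance` is a hypothetical helper name.

```python
def residual_balance(actual, predicted):
    """Count errors above and below zero (error = actual - predicted, assumed)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    above = sum(1 for e in errors if e > 0)
    below = sum(1 for e in errors if e < 0)
    return above, below


# Made-up values (illustration only): three errors above the line, one below.
above, below = residual_balance([1.0, 2.0, 3.0, 4.0], [0.5, 1.5, 3.5, 3.0])
print(above, below)
```

Roughly equal counts correspond to the even distribution seen for “20 HR”, while a clear imbalance corresponds to the pattern seen for “UNSYN”.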
2). Predicted vs. Actual: Observing the plots for both cases we find that the points lie on
a tight diagonal for the 2nd case (output variable = 20HR) and that they are evenly
distributed on both sides of the diagonal, which shows a good fit. Thus we
can further conclude that the model for the data representing regular cell growth is
much better than the fit for the exponential growth data.
3). Predicted vs. Actual: Once again the plot shows the predicted values lying closer to the
actual values when the response variable is 20 HR. Thus we can finally
conclude that the model fits the non-exponential data more accurately than the
exponential data. For the exponential data we could perhaps have used a bigger training
data set to obtain a robust model that fits the data. But so that we can show a comparison
between the models, we used the same data set for all the cases and observed a better model
for the 20HR case than for the UNSYN case.
2). Yeast Data: As discussed above, only a part of the whole data set was chosen to ease
the calculation. The network structure chosen for this data set is as follows:
Input Nodes = 17;
Hidden Nodes = 15;
Output Node = 1;
In this case the data set was divided as follows:
Training = 1000;
Test = 1000;
Validation = 466;
The training, test and validation sets are chosen to get the best-fit model. Since the
network needs a lot of training before we get a good model, we use a large part of the data
for training and the rest for test and validation. The variables in this case are:
Predictor (INPUT) variables: gene expression at 0 min. (alpha 0) to 119 min.
(alpha 119) in steps of 7 min.
Response (OUTPUT) variables: gene expression at alpha 119 (CASE 1) and
alpha 112 (CASE 2).
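The time points and variable counts above can be checked with a short sketch. It assumes that, in each case, the response column is left out of the predictors; the report does not say this explicitly, but it is consistent with the 17 input nodes listed. The column labels and the helper `split_columns` are hypothetical.

```python
# Alpha-factor time points: 0 to 119 minutes in steps of 7 minutes.
time_points = list(range(0, 120, 7))
columns = [f"alpha {t}" for t in time_points]


def split_columns(columns, response):
    """Return (predictor columns, response column) for one case."""
    # Assumed: every column except the response is used as a predictor.
    predictors = [c for c in columns if c != response]
    return predictors, response


case1_predictors, _ = split_columns(columns, "alpha 119")
case2_predictors, _ = split_columns(columns, "alpha 112")
print(len(columns), len(case1_predictors), len(case2_predictors))
```

This gives 18 time points and 17 predictor columns in each case.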
Summary of the results:
Response variable    RMSE      Correlation coefficient (%)
Alpha 119            0.0405    61.95
Alpha 112            0.0585    59.93
The attached plots give the following information:
1). Error vs. Predicted: Comparing the plots obtained for both cases, we find that in
both cases the error vs. predicted plot shows an even distribution about the x-axis
(horizontal line). Thus the relationship between the predictors and the response is strong,
and we can say that the error in predicting the given response variable was
low. So we can take the weights obtained in both cases and use them to predict the time
evolution for any unknown set of genes from budding yeast.
2). Predicted vs. Actual: Observing the plots for both cases we find that the points lie on
a tight diagonal and that they are evenly distributed on both sides of the diagonal, which
shows a good fit. Thus we can further conclude that the model for the data,
which represents cell growth for budding yeast, is good.
3). Predicted vs. Actual: Once again the plots show the predicted values lying close to the
actual values in both cases, when the response variable is Alpha 119 and Alpha 112.
Thus we can finally conclude that the model found from this data set fits the data
accurately. We can also conclude that with a bigger training data set and
a larger test data set the model obtained predicts the
response variable more accurately.
Conclusions:
Identifying the functional class of genes from their gene expression profiles,
obtained from microarray experiments, has a great many applications in finding
the function of unknown genes, which in turn can be used for further study of the genes'
involvement in a particular functional class. Artificial neural network models have been
built and tested for this particular problem. Using two completely different data sets we
have obtained good results. In the results obtained for the first case (human fibroblast data
set) we can see a clear distinction between the two models. The model fitting the
exponential data is not as good as the model we obtained for the regular cell
growth. Still, both models give a very good prediction of the experimental data.
In order to get a good prediction for the response variable in the exponential data we can
either choose a larger training data set or change the network architecture
somewhat to fit a better model. To change the network structure we can choose to have
more hidden nodes and a larger epoch size. In the second case (budding yeast
data set) we found a more accurate model, fitting both response variables (Alpha 119
and Alpha 112).
The Pathfinder software gave us the weights corresponding to the two layers. We
can use these weights to prepare a sigmoid model for an unknown set of genes. In
conclusion, this project showed that we can obtain quite accurate models for the time
evolution of genes, classify them according to their gene expression, and further predict
their functionality.
References:
Arkin, A., and Ross, J. 1995. Statistical construction of chemical reaction mechanisms
from measured time-series. J. Phys. Chem. 99, 970–979.
Arkin, A., Ross, J., and McAdams, H.H. 1998. Stochastic kinetic analysis of
developmental pathway bifurcation in phage λ-infected Escherichia coli cells.
Genetics 149, 1633–1648.
Arkin, A., Shen, P., and Ross, J. 1997. A test case of correlation metric construction of a
reaction pathway from measurements. Science 277, 1275–1279.
Brown et al. 1999. Knowledge-based analysis of microarray gene expression data.
Proceedings of National Academy of Sciences, 1997, 262–267.
de Jong, H., Page, M., Hernandez, C., and Geiselmann, J. 2001. Qualitative simulation of
genetic regulatory networks: Method and application. In B. Nebel, ed. Proc. 17th Int.
Joint Conf. Artif. Intell. (IJCAI-01), 67–73, San Mateo, CA. Morgan Kaufmann.
de Jong, H., and Rip, A. 1997. The computer revolution in science: Steps towards the
realization of computer-supported discovery environments. Artif. Intell. 91, 225–256.
DeRisi et al. 1997. Exploring the metabolic and genetic control of gene expression on a
genomic scale. Science 278, 680–686.
Eisen 1995. Cluster analysis and display of genome-wide expression patterns.
Proceedings of National Academy of Sciences, 1995, 14863–14868.
Friedman, N., Linial, M., Nachman, I., and Pe’er, D. 2000. Using Bayesian networks to
analyze expression data. J. Comput. Biol. 7, 601–620.
Gillespie, D.T. 1977. Exact stochastic simulation of coupled chemical reactions.
J. Phys. Chem. 81(25), 2340–2361.
Gillespie, D.T. 1992. A rigorous derivation of the chemical master equation.
Physica D 188, 404–425.
Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J.C.F., Trent, J.M.,
Staudt, L.M., Hudson, J. Jr., Boguski, M.S., Lashkari, D., Shalon, D., Botstein, D., and
Brown, P.O. 1999. The transcriptional program in the response of human fibroblasts to
serum. Science 283, 83–87.
Karp, P.D. 1991. Hypothesis formation as design. In J. Shrager and P. Langley, eds.
Computational Models of Scientific Discovery and Theory Formation, 275–317, Morgan
Kaufmann, San Mateo, CA.
Kauffman, S.A. 1969a. Homeostasis and differentiation in random genetic control
networks. Nature 224, 177–178.
Kauffman, S.A. 1969. Metabolic stability and epigenesis in randomly constructed genetic
nets. J. Theor. Biol. 22, 437–467.
Kauffman, S.A. 1974. The large-scale structure and dynamics of gene control circuits: An
ensemble approach. J. Theor. Biol. 44, 167–190.
Kauffman, S.A. 1977. Gene regulation networks: A theory for their global structure and
behaviors. In Current Topics in Developmental Biology, vol. 6, 145–182, Academic
Press, New York.
Lander, E.S. 1996. The new genomics: global view of biology. Science 274, 536–539.
McAdams, H.H., and Arkin, A. 1997. Stochastic mechanisms in gene expression.
Proc. Natl. Acad. Sci. USA 94, 814–819.
McAdams, H.H., and Arkin, A. 1998. Simulation of prokaryotic genetic circuits.
Ann. Rev. Biophys. Biomol. Struct. 27, 199–224.
McAdams, H.H., and Arkin, A. 1999. It’s a noisy business! Genetic regulation at the
nanomolar scale. Trends Genet. 15(2), 65–69.
McAdams, H.H., and Shapiro, L. 1995. Circuit simulation of genetic networks.
Science 269, 650–656.
Pathfinder (Z solutions, LLC 1998, http://www.zsolutions.com/pathfind.htm).
Shatkay, H., Edwards, S., Wilbur, W.J., and Boguski, M. 2000. Genes, themes, and
microarray: using information retrieval for large-scale gene analysis. In Proceedings of
8th International Conference on Intelligent Systems for Molecular Biology (ISMB). AAAI
Press, 317–328.
Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. 1995. Quantitative monitoring of
gene expression patterns with a complementary DNA microarray. Science 270, 467–470.
Samsonova, M.G., Savostyanova, E.G., Serov, V.N., Spirov, A.V., and Reinitz, J. 1998.
GeNet: A database of genetic networks. In Proc. First Int. Conf. Bioinformatics Genome
Regul. Struct., BGRS’98, Novosibirsk, ICG.
Samsonova, M., and Serov, V.N. 1999. NetWork: An interactive interface to the tools
for analysis of genetic network structure and dynamics. In R.B. Altman, K. Lauderdale,
A.K. Dunker, L. Hunter, and T.E. Klein, eds. Proc. Pac. Symp. Biocomput. (PSB’99),
vol. 4, 102–111, Singapore, World Scientific Publishing.
Sanchez, C., Lachaize, C., Janody, F., Bellon, B., Röder, L., Euzenat, J., Rechenmann, F.,
and Jacq, B. 1999. Grasping at molecular interactions and genetic networks in
Drosophila melanogaster. Nucleic Acids Res. 27(1), 89–94.
Sánchez, L., and Thieffry, D. 2001. A logical analysis of the Drosophila gap genes.
J. Theor. Biol. 211, 115–141.
Sánchez, L., van Helden, J., and Thieffry, D. 1997. Establishment of the dorso-ventral
pattern during embryonic development of Drosophila melanogaster: A logical analysis.
J. Theor. Biol. 189, 377–389.
Sugita, M. 1961. Functional analysis of chemical systems in vivo using a logical circuit
equivalent. J. Theor. Biol. 1, 179–192.
Sugita, M. 1963. Functional analysis of chemical systems in vivo using a logical circuit
equivalent: II. The idea of a molecular automaton. J. Theor. Biol. 4, 179–192.
Somogyi, R., and Sniegoski, C.A. 1996. Modeling the complexity of genetic networks:
Understanding multigenic and pleiotropic regulation. Complexity 1(6), 45–63.
Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., and Church, G.M. 1999.
Systematic determination of genetic network architecture. Nat. Genet. 22, 281–285.
Walter, C., Parker, R., and Yčas, M. 1967. A model for binary logic in biochemical
systems. J. Theor. Biol. 15, 208–217.