Reverse Engineering of
Genetic Networks
(Final presentation)
Ji Won Yoon
(s0344084)
supervised by Dr. Dirk Husmeier.
MSc in Informatics
at Edinburgh University,
J.Yoon@sms.ed.ac.uk
Reverse Engineering
What is reverse engineering of gene
network?

Relevance Network

My own method

MCMC for Bayesian network
Missing gene + "up" and "down" data from microarray
Past works
Comparison of existing approaches to the reverse
engineering of genetic networks,
Mutual information relevance networks
My own method
Bayesian networks using Markov Chain Monte Carlo method
Applying all methods to synthetic data generated
from a gene network simulator.
Applying to Biological data
Diffuse large B cell Lymphoma gene expression data
Arabidopsis gene expression data
Relevance Network (Butte, 2000)
Using mutual information
MI(A, B) = H(A) - H(A|B) = H(B) - H(B|A) = MI(B, A)  => symmetric
MI(A, B) = H(A) + H(B) - H(A, B)
Mutual information is zero if two genes are independent.
Pairwise relation
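As a minimal sketch of the identity MI(A, B) = H(A) + H(B) - H(A, B), the following computes mutual information from discretised expression profiles. The function names and the two example profiles are illustrative assumptions, not part of the original work.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(a, b):
    """MI(A, B) = H(A) + H(B) - H(A, B) for two equally long discrete sequences."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

# Hypothetical discretised profiles of two genes over six experiments
gene_a = ["up", "up", "down", "down", "up", "down"]
gene_b = ["up", "up", "down", "down", "down", "up"]

print(mutual_information(gene_a, gene_b))
print(mutual_information(gene_b, gene_a))  # same value: MI is symmetric
```

A relevance network would then keep a pair of genes only if this value exceeds the chosen threshold.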
Relevance Network (cont.)
Useful only for local relations, because only pairwise relations are considered.
Selecting a proper threshold is important for obtaining good relations.
Bootstrapping (comparison of results on real data and on randomly permuted data)
Difficult to identify a relation with multiple parents because of this locality:
MI(A, [B, C, D]) can exceed the threshold while MI(A, B), MI(A, C) and MI(A, D) each fall below it.
Cannot detect an XOR operation:
MI(A, C) = MI(B, C) = 0
No edge directions, because MI is symmetric
Fast and light computation.
Useful for a large number of genes
[Figure: example networks over nodes A, B, C, D]
XOR example (C = A XOR B):
A B C
0 0 0
0 1 1
1 0 1
1 1 0
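The XOR limitation can be checked numerically. This is a small sketch, assuming the four input combinations of A and B are equally likely; the helper names are illustrative. Each pairwise MI with C is zero, while the joint pair (A, B) fully determines C.

```python
import math
from collections import Counter
from itertools import product

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mi(x, y):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# Truth table of C = A XOR B, each row counted once (equally likely inputs)
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]
A = [r[0] for r in rows]
B = [r[1] for r in rows]
C = [r[2] for r in rows]

print(mi(A, C), mi(B, C))      # 0.0 0.0: no pairwise signal
print(mi(list(zip(A, B)), C))  # 1.0: the pair (A, B) determines C
```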
My method
(using mutual information)
Based on Scale free network
Crucial genes will have more connections than other genes.
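[Figure: two example networks, one over nodes A, B, C, D and one over nodes A, B, C, D, E]
When a new gene F is inserted, A has a higher chance of being connected to it than the other genes.
G = (N, E)  ->  G' = (N, E, L)
(L is level information)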
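The scale-free idea, that well-connected genes are more likely to receive a new gene, can be illustrated with degree-proportional attachment probabilities. This is a generic preferential-attachment sketch, not the method of the slides (which uses mutual information), and the edge list is a hypothetical example rather than the network drawn in the figure.

```python
from collections import Counter

# Hypothetical edge list in which gene A is the hub
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("B", "C")]

# Degree of each node = number of edge endpoints it appears in
degree = Counter(node for edge in edges for node in edge)
total = sum(degree.values())

# Probability that a new gene F attaches to each existing gene
attach_prob = {node: d / total for node, d in degree.items()}

print(attach_prob)
print(max(attach_prob, key=attach_prob.get))  # A
```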
My method (Insertion step)
Finding better parents and merging Clusters
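Example: inserting a new gene a
MI(1, a) = 0.34
MI(5, a) = 0.35
MI(6, a) = 0.31
MI(7, a) = 0.4
MI(4, a) = 0.28
Threshold = 0.3
[Figure: network of genes 1 to 12 grouped into clusters S1, S2, S3, S4, with the new gene a compared against genes 1, 4, 5, 6 and 7]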
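Using the MI values and threshold from this slide, the candidate parents of the new gene a can be filtered as below. The selection rule shown (keep genes above the threshold, prefer the largest MI) is an assumption made for illustration; the slide does not spell out the exact rule.

```python
# MI values between existing genes and the new gene a (taken from the slide)
mi_with_a = {1: 0.34, 5: 0.35, 6: 0.31, 7: 0.40, 4: 0.28}
threshold = 0.3

# Assumed rule: genes whose MI with a exceeds the threshold are candidates
candidates = {gene: value for gene, value in mi_with_a.items() if value > threshold}

# Assumed rule: the candidate with the largest MI is the preferred parent
best = max(candidates, key=candidates.get)

print(sorted(candidates))  # [1, 5, 6, 7]
print(best)                # 7
```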
My method (Deletion step)
Assumption
The network generated by the insertion step of my method is assumed to be
in a stationary state with respect to the marginal log likelihood, except for
one edge, which is investigated to check the connection.
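Three cases for an edge e between X and Y:
X -> Y, X <- Y, and X  Y (no edge)
P(D | M) = U * P(X | pa(X)) * P(Y | pa(Y))
(U collects the factors of all other nodes and is the same in all three cases)
Case 1 (X -> Y): U * P(X | pa_1(X)) * P(Y | pa_1(Y))
Case 2 (X <- Y): U * P(X | pa_2(X)) * P(Y | pa_2(Y))
Case 3 (X  Y): U * P(X | pa_3(X)) * P(Y | pa_3(Y))
The cases are compared through P(X | pa_i(X)) * P(Y | pa_i(Y)), i = 1, 2, 3.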
My method (Deletion step)
[Figure: graph G over nodes A, B, C, D, E, F, H, I, X and Y, with the investigated edge e between X and Y]
Case 1 (X -> Y): pa_1(X) = {A, B}, pa_1(Y) = {X, E, H}
Case 2 (X <- Y): pa_2(X) = {A, B, Y}, pa_2(Y) = {E, H}
Case 3 (X  Y, no edge): pa_3(X) = {A, B}, pa_3(Y) = {E, H}
My method
Mainly two steps
Insertion and deletion steps
[Figure: networks over genes a to h as the threshold t is lowered (t = 0.5, 0.4, 0.3), with panels labelled "Insertion step" and "Deletion step"]
Continue down to t = 0.
My method
Advantage
• Based on biological facts (scale-free network)
• No need for thresholds
• Online approach
Scalability
Easy to explore sub-networks
• Fast computation
My method
Disadvantage
Input order dependency
Risky when exploring parents in very noisy data
(it can over-fit the training data)
61% of edges are less order dependent (in part B)
Bayesian network with MCMC
Bayesian network
Problem 1
Problem 2
[Figure: example network over nodes A, B, C, D, E]
$P(A, B, C, D, E) = \prod_i P(X_i \mid pa(X_i)) = P(A)\,P(B \mid A)\,P(C \mid A)\,P(D \mid B)\,P(E \mid B)$
$P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}$
[Figure: posterior P(M | D) over models M. Left: in large data set. Right: in small data set.]
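The factorisation P(A, B, C, D, E) = P(A) P(B|A) P(C|A) P(D|B) P(E|B) can be evaluated directly once conditional probability tables are given. The sketch below assumes binary nodes, and every probability value in it is hypothetical, chosen only to show that the factorised joint sums to one.

```python
from itertools import product

# Hypothetical parameters: probability that each binary node equals 1
P_A = 0.3
P_B_GIVEN_A = {0: 0.2, 1: 0.9}
P_C_GIVEN_A = {0: 0.5, 1: 0.1}
P_D_GIVEN_B = {0: 0.4, 1: 0.7}
P_E_GIVEN_B = {0: 0.6, 1: 0.25}

def bernoulli(p_one, value):
    """Probability of `value` (0 or 1) when P(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """P(A, B, C, D, E) as the product of each node given its parents."""
    return (bernoulli(P_A, a)
            * bernoulli(P_B_GIVEN_A[a], b)
            * bernoulli(P_C_GIVEN_A[a], c)
            * bernoulli(P_D_GIVEN_B[b], d)
            * bernoulli(P_E_GIVEN_B[b], e))

# Summing over all 2^5 states should give 1 (up to rounding)
total = sum(joint(*state) for state in product([0, 1], repeat=5))
print(total)
```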
Bayesian network with MCMC
MCMC (Markov Chain Monte Carlo)
Inference rule for Bayesian Network
Sample from the posterior distribution
Proposal move:
Given M_old, propose a new network M_new with probability $Q(M_{new} \mid M_{old})$
Acceptance and rejection:
$P_{accept} = \min\left\{1,\ \frac{P(D \mid M_{new})\,P(M_{new})}{P(D \mid M_{old})\,P(M_{old})} \times \frac{Q(M_{old} \mid M_{new})}{Q(M_{new} \mid M_{old})}\right\}$
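The Metropolis-Hastings acceptance rule, min{1, [P(D|M_new) P(M_new) / P(D|M_old) P(M_old)] x [Q(M_old|M_new) / Q(M_new|M_old)]}, is usually evaluated in log space to avoid underflow. A minimal sketch follows; the function and argument names are illustrative, and each `log_post` argument is assumed to hold log[P(D|M) P(M)].

```python
import math
import random

def acceptance_probability(log_post_new, log_post_old,
                           log_q_old_given_new, log_q_new_given_old):
    """Metropolis-Hastings acceptance probability computed from log values."""
    log_ratio = ((log_post_new - log_post_old)
                 + (log_q_old_given_new - log_q_new_given_old))
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def accept(log_post_new, log_post_old,
           log_q_old_given_new, log_q_new_given_old, rng=random):
    """Draw the accept/reject decision for one proposed move."""
    p = acceptance_probability(log_post_new, log_post_old,
                               log_q_old_given_new, log_q_new_given_old)
    return rng.random() < p

# A better-scoring proposal with a symmetric proposal distribution is always accepted
print(acceptance_probability(-10.0, -12.0, 0.0, 0.0))  # 1.0
# A worse-scoring proposal is accepted with probability exp(-2)
print(acceptance_probability(-12.0, -10.0, 0.0, 0.0))
```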
Bayesian network with MCMC
MCMC in Bayes Net toolbox
Hastings factor
The proposal probability is calculated from the
number of neighbours of the model.
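If the proposal picks uniformly among the neighbours of the current model (an assumption made here; the slide only says the proposal probability is calculated from the number of neighbours), then Q(M_new|M_old) = 1/N(M_old) and the Hastings factor reduces to a ratio of neighbourhood sizes. The function name and the example counts are hypothetical.

```python
def hastings_factor(n_neighbours_old, n_neighbours_new):
    """Q(M_old | M_new) / Q(M_new | M_old) for a uniform proposal over neighbours.

    Q(M_new | M_old) = 1 / N(M_old) and Q(M_old | M_new) = 1 / N(M_new),
    so the ratio equals N(M_old) / N(M_new).
    """
    return n_neighbours_old / n_neighbours_new

# Hypothetical neighbourhood sizes of the old and the proposed network
print(hastings_factor(20, 25))  # 0.8
```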
Improvement of MCMCs
Fan-in
Sparse data gives the prior probability a non-negligible influence on the posterior P(M | D).
Limit the maximum number of edges converging on a node (the fan-in).
If FI(M) > a, P(M) = 0. Otherwise, P(M) = 1.
The time complexity is greatly reduced.
Acceptable configurations of child and parents with fan-in 3
[Figure: example child-parent configurations over nodes A, B, C, D, E]
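The fan-in restriction, P(M) = 0 if FI(M) > a and P(M) = 1 otherwise, can be sketched as an unnormalised structure prior. The representation of a network as a dictionary mapping each node to its list of parents, and the example network, are assumptions made for illustration.

```python
def fan_in(parents):
    """Largest number of parents of any node in the network."""
    return max((len(p) for p in parents.values()), default=0)

def structure_prior(parents, max_fan_in):
    """Unnormalised prior: 0 if any node exceeds the fan-in limit, else 1."""
    return 0.0 if fan_in(parents) > max_fan_in else 1.0

# Hypothetical networks: node -> list of parents
allowed = {"A": [], "B": ["A"], "C": ["A"], "D": ["A", "B", "C"]}
too_dense = {"A": [], "B": [], "C": [], "D": [], "E": ["A", "B", "C", "D"]}

print(structure_prior(allowed, 3))    # 1.0
print(structure_prior(too_dense, 3))  # 0.0, because E has four parents
```

Rejecting such structures before scoring them is one way the restriction can save computation.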
Improvement of MCMCs
DAG to CPDAG
(DAG : Directed Acyclic Graph, CPDAG : Completed Partially Directed Acyclic Graph)
[Figure: two equivalent graphs over X and Y]
P(X, Y) = P(X) P(Y | X) = P(Y) P(X | Y)
Set of all equivalent DAGs
[Figure: DAG to CPDAG]
The edge between D and E is reversible;
the others are compelled.
Improvement of MCMCs
This CPDAG concept bring several advantages:
The space of equivalent classes is more reduced.
It is easy to trap in local optimum in moving DAG spaces.
Incorporating CPDAG to MCMC
MCMCMC
Trapping
[Figure: landscape with two optima, A and B]
A: global optimum
B: local optimum
It is easy to get trapped in the local optimum B.
Multiple chains with different temperatures are useful for escaping from it.
MCMCMC
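A super chain, S
$P(S_i \mid D) = \prod_j \left[ P(D \mid M_j^i)\,P(M_j^i) \right]^{1/T_j}$
Proposal probabilities: $Q(S_{i+1} \mid S_i)$ and $Q(S_i \mid S_{i+1})$
Acceptance ratios of a super chain (for chains k and l)
$P_{accept} = \min\left\{1,\ \frac{\left[P(D \mid M_l)\,P(M_l)\right]^{1/T_k}\,\left[P(D \mid M_k)\,P(M_k)\right]^{1/T_l}}{\left[P(D \mid M_k)\,P(M_k)\right]^{1/T_k}\,\left[P(D \mid M_l)\,P(M_l)\right]^{1/T_l}}\right\}$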
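A swap between two chains k and l of a Metropolis-coupled sampler is accepted with probability min{1, [pi(M_l)^(1/T_k) pi(M_k)^(1/T_l)] / [pi(M_k)^(1/T_k) pi(M_l)^(1/T_l)]}, where pi(M) = P(D|M) P(M). A minimal log-space sketch follows; the function name, argument names and the example values are illustrative assumptions.

```python
import math

def swap_acceptance(log_post_k, log_post_l, temp_k, temp_l):
    """Probability of accepting a swap of the states of chains k and l.

    log_post_k and log_post_l are log[P(D|M) P(M)] of the current states
    of chains k and l; temp_k and temp_l are the chain temperatures.
    """
    log_ratio = ((log_post_l / temp_k + log_post_k / temp_l)
                 - (log_post_k / temp_k + log_post_l / temp_l))
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

# The hot chain (T = 2) holds the better state: the swap is always accepted
print(swap_acceptance(log_post_k=-10.0, log_post_l=-8.0, temp_k=1.0, temp_l=2.0))  # 1.0
# The cold chain (T = 1) already holds the better state: accepted with probability exp(-1)
print(swap_acceptance(log_post_k=-8.0, log_post_l=-10.0, temp_k=1.0, temp_l=2.0))
```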
Importance Sampling
Partition function
Proposal distribution
Acceptance probability
We only use the prior distribution for acceptance.
Importance Sampling is also combined with MCMCMC.
MCMCMC with Importance Sampling
: Likelihood for configuration of a node n and its parents
Order MCMC
It samples over total orders, not over structures.
Example: the nodes A, B, C have 3! = 6 total orders:
A B C
A C B
B C A
B A C
C A B
C B A
Proposal move
Flipping (swapping) two nodes of the previous order
Computational limitations
Using candidate sets
Sets of parents with the highest likelihood scores for each node
Reduces the computation time.
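The proposal move of order MCMC, swapping two nodes of the previous order, can be sketched as follows, together with a check of whether a structure is consistent with an order (every parent precedes its child). The function names and the dictionary representation of a network are assumptions made for illustration.

```python
import random

def propose_order(order, rng=random):
    """Return a new total order with two randomly chosen positions swapped."""
    i, j = rng.sample(range(len(order)), 2)
    new_order = list(order)
    new_order[i], new_order[j] = new_order[j], new_order[i]
    return new_order

def consistent(parents, order):
    """True if every parent appears before its child in the order."""
    position = {node: index for index, node in enumerate(order)}
    return all(position[p] < position[child]
               for child, ps in parents.items() for p in ps)

# Hypothetical chain A -> B -> C
chain = {"A": [], "B": ["A"], "C": ["B"]}
print(consistent(chain, ["A", "B", "C"]))  # True
print(consistent(chain, ["C", "A", "B"]))  # False
print(propose_order(["A", "B", "C"]))
```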
Order MCMC
Selection features
We can extract the edges by approximating
and averaging under the stationary distribution,
where
Synthetic data
The 41st to 50th genes are not connected.
Synthetic data

MCMCMC with Importance Sampling has the best performance.
Order MCMC is the second best.
Order MCMC is much faster than MCMCMC with Importance Sampling.
Synthetic data
I changed one parameter at a time for the MCMC simulation.
1) Standard application (using the standard parameters)
2) Change the noise value (decrease the noise value to 0.1)
3) Change the training data size (decrease the size to 50)
4) Change the number of iterations (increase the number to 50000)
Standard parameters (MCMC in Bayes Net Toolbox)
training data size: 200, noise value: 0.3,
number of iterations: 5000 (5000 samples and 5000 burn-ins)
Synthetic data
Convergence
[Figure: four convergence panels]
1) MCMC in BNT
2) MCMCMC Importance Sampling (IM)
3) MCMCMC Importance Sampling (ID)
4) Order MCMC
Training set size: 200, noise: 0.3, 5000 iterations.
Synthetic data
MCMCMC (burn-in # + sample #)
Left: 5000 + 5000
Right: 100000 + 100000
Acceptance ratios
Left: MCMC in BNT, Middle: MCMCMC with Importance Sampling, Right: Order MCMC
Diffuse large B cell lymphoma Data
Data discretisation
I used the K-means algorithm to discretise the expression levels of each
gene into three states (up, down and normal), since the stationary level
of each gene can differ from that of the others.
Problem of this discretisation
If the data are too noisy, the noise itself produces fluctuations that are discretised as if they were real changes.
As a result, this method cannot work well for gene3.
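A per-gene three-level discretisation with K-means can be sketched in plain Python as below. This is a minimal one-dimensional version of Lloyd's algorithm, not the implementation used in the project; the initialisation, the mapping of the lowest cluster to "down" and the highest to "up", and the example expression values are all assumptions.

```python
def kmeans_1d(values, k=3, iterations=100):
    """Lloyd's algorithm in one dimension; returns the sorted cluster centres."""
    lo, hi = min(values), max(values)
    # Assumed initialisation: centres spread evenly over the observed range
    centres = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            clusters[nearest].append(v)
        # Empty clusters keep their previous centre
        new_centres = [sum(c) / len(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:
            break
        centres = new_centres
    return sorted(centres)

def discretise(values, centres, labels=("down", "normal", "up")):
    """Label each value by its nearest centre (centres sorted ascending)."""
    return [labels[min(range(len(centres)), key=lambda c: abs(v - centres[c]))]
            for v in values]

# Hypothetical expression levels of a single gene
expression = [-2.1, -1.9, -2.0, 0.1, -0.1, 0.0, 2.0, 1.8, 2.2]
centres = kmeans_1d(expression)
print(discretise(expression, centres))
```

Running the clustering separately for each gene lets every gene have its own "normal" level, which matches the motivation given above.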
Diffuse large B cell lymphoma Data
Comparison of convergence
MCMC in BNT
MCMCMC with
Importance Sampling(ID)
Order MCMC
# of genes : 27
Training data size : 105
Iterations : 20000
Diffuse large B cell lymphoma Data
Comparison of Acceptance Ratios
The number of genes : 27, Training data size : 105, Iterations : 20000
Gene expression inoculated by viruses
in susceptible Arabidopsis thaliana plants
Viruses
• Cucumber mosaic cucumovirus
• Oil seed rape tobamovirus
• Turnip vein clearing tobamovirus
• Potato virus X potexvirus
• Turnip mosaic potyvirus
[Figure: time course for viruses 1) to 5), from inoculation through 1DAI, 2DAI, 3DAI, 4DAI, 5DAI and 7DAI, marking where the symptom occurs; example expression profile of a gene a]
DAI = days after inoculation
Training data: 127 genes with 20 data points (4 DAIs × 5 viruses)
Gene expression inoculated by viruses
in susceptible Arabidopsis thaliana plants
only for 20 genes (1DAI and 2DAI)
1000 samples from my method
10000 samples from MCMCMC
with Importance Sampling(ID)
Gene expression inoculated by viruses
in susceptible Arabidopsis thaliana plants
127 genes
Genes with higher connectivity
Average global connectivity = 1.5847
Gene expression inoculated by viruses
in susceptible Arabidopsis thaliana plants
127 genes
p-value check for transcription function
f is the number of genes with the j-th function among the 127 genes.
m is the number of genes with the j-th function among the 14 genes.
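The p-value formula itself is not preserved in this text. One common choice for this kind of enrichment check is the hypergeometric tail probability, sketched below purely as an assumption: it gives the probability of seeing at least m genes with the j-th function among 14 genes drawn from 127, of which f carry that function. The values f = 20 and m = 6 are hypothetical.

```python
from math import comb

def hypergeom_tail(total, with_function, selected, observed):
    """P(X >= observed) for X ~ Hypergeometric(total, with_function, selected)."""
    upper = min(with_function, selected)
    favourable = sum(comb(with_function, x) * comb(total - with_function, selected - x)
                     for x in range(observed, upper + 1))
    return favourable / comb(total, selected)

# Hypothetical counts: f = 20 of the 127 genes and m = 6 of the 14 genes
print(hypergeom_tail(total=127, with_function=20, selected=14, observed=6))
```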
Gene expression inoculated by viruses
in susceptible Arabidopsis thaliana plants
for 127 genes from my method (100 samples)
Conclusion
We need to select methodologies depending on the
characteristics of the training data.
To obtain the result closest to the real network, MCMCMC with
Importance Sampling and Order MCMC are suitable.
MCMCMC with Importance Sampling has the best performance
but it is slower than the other MCMCs.
Order MCMC has the second-best performance but it is four times
faster than MCMCMC with Importance Sampling.
If we want to process large-scale data and we do not have
enough time to run MCMCs,
Relevance Network and my method are appropriate.
Also, different methods generate different networks, so
combining them may give better results.
Conclusion
Biological meaning
Genes with a transcription function have higher connectivity
than other genes (according to my method). That is, genes
with a transcription function may act as hubs in the network
that responds to viruses in Arabidopsis thaliana plants.