
Reverse Engineering of
Genetic Networks
(Final presentation)

Ji Won Yoon (s0344084)
Supervised by Dr. Dirk Husmeier
MSc in Informatics, Edinburgh University
J.Yoon@sms.ed.ac.uk

Reverse Engineering

What is reverse engineering of a gene network?
- Relevance Network
- My own method
- MCMC for Bayesian networks

[Figure: missing genes + up- and down-regulation data from microarrays]

Past works

Comparison of existing approaches to the reverse engineering of genetic networks:
- Mutual information relevance networks
- My own method
- Bayesian networks using Markov Chain Monte Carlo methods

Applying all methods to synthetic data generated from a gene network simulator.

Applying all methods to biological data:
- Diffuse large B cell lymphoma gene expression data
- Arabidopsis gene expression data

Relevance Network (Butte, 2000)

Using mutual information:
- MI(A, B) = H(A) - H(A|B) = H(B) - H(B|A) = MI(B, A)  ->  symmetric
- MI(A, B) = H(A) + H(B) - H(A, B)
- Mutual information is zero if two genes are independent.
- Pairwise relation.
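As an illustration of the idea above, here is a minimal sketch of a relevance network built from discretised expression data: pairwise mutual information is estimated from empirical frequencies and an edge is kept whenever MI exceeds a threshold. The genes, the simulated data and the threshold of 0.3 are invented for this example; this is not the original implementation.

from collections import Counter
from itertools import combinations
import numpy as np

def entropy(labels):
    # Empirical entropy H of a sequence of discrete labels, in bits
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    # MI(X, Y) = H(X) + H(Y) - H(X, Y), estimated from the samples
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# Toy discretised expression matrix: rows = samples, columns = genes (0 = down, 1 = up)
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=200)
b = np.where(rng.random(200) < 0.1, 1 - a, a)   # gene B mostly follows gene A
c = rng.integers(0, 2, size=200)
d = rng.integers(0, 2, size=200)
data = np.column_stack([a, b, c, d])
genes = ["A", "B", "C", "D"]

threshold = 0.3                                  # illustrative value only
edges = []
for i, j in combinations(range(len(genes)), 2):
    mi = mutual_information(data[:, i], data[:, j])
    if mi > threshold:                           # keep the undirected edge
        edges.append((genes[i], genes[j], round(mi, 3)))
print(edges)                                     # only the A-B edge should survive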

Relevance Network (cont.)

Relevance Network (Butte, 2000):
- Useful only for local relations, because it only considers pairwise relations.
- Selecting a proper threshold is important for obtaining good relations.
- Bootstrapping (comparison of results on the real data and on randomly permuted data).
- Difficulty identifying a relation with two parents, due to this locality: MI(A, [B, C, D]) may exceed the threshold while MI(A, B), MI(A, C) and MI(A, D) each fall below it.
- Cannot detect an XOR operation: MI(A, C) = MI(B, C) = 0.
- No direction on the edges, due to the symmetric property.
- Fast and light computation.
- Useful for a large number of genes.


[Figures: example relevance networks over genes A, B, C, D]

XOR example (C = A XOR B):
A  B  C
0  0  0
0  1  1
1  0  1
1  1  0
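A quick numerical check of the XOR limitation noted above, under the assumption that A and B are independent and uniform and C = A XOR B: the pairwise mutual informations MI(A, C) and MI(B, C) are (close to) zero even though the pair (A, B) determines C completely.

from collections import Counter
import numpy as np

def H(labels):
    p = np.array(list(Counter(labels).values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def MI(x, y):
    return H(list(x)) + H(list(y)) - H(list(zip(x, y)))

rng = np.random.default_rng(1)
a = rng.integers(0, 2, size=10000)
b = rng.integers(0, 2, size=10000)
c = a ^ b                                        # C = A XOR B

print(round(MI(a, c), 3), round(MI(b, c), 3))    # both ~0: pairwise MI misses the relation
print(round(MI(list(zip(a, b)), c), 3))          # ~1 bit: the joint (A, B) explains C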




My method
(using mutual information)

Based on a scale-free network:
- Crucial genes will have more connections than other genes.

[Figure: a hub gene A connected to B, C, D and E]

When a new gene F is inserted, A has a higher chance of acquiring it than the other genes.

G = (N, E)  ->  G' = (N, E, L)   (L is level information)
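The scale-free intuition stated above (a new gene is more likely to attach to an already highly connected gene) can be illustrated with a tiny preferential-attachment simulation. This is only an illustration of that prior assumption; the actual insertion step of the method uses mutual information, as described on the next slides.

import random
from collections import Counter

random.seed(5)
# Seed network: hub A connected to B, C, D
edges = [("A", "B"), ("A", "C"), ("A", "D")]

# Attach each new gene to an existing gene chosen in proportion to its degree
for i in range(50):
    degrees = Counter(g for e in edges for g in e)
    nodes, weights = zip(*degrees.items())
    target = random.choices(nodes, weights=weights, k=1)[0]
    edges.append((target, f"G{i}"))

print(Counter(g for e in edges for g in e).most_common(3))  # hubs keep accumulating links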

My method (Insertion step)

Finding better parents and merging clusters.

Example (threshold = 0.3): a new gene a is compared with candidate genes using mutual information:
MI(1, a) = 0.34, MI(4, a) = 0.28, MI(5, a) = 0.35, MI(6, a) = 0.31, MI(7, a) = 0.4

[Figure: gene a is connected to genes 1, 5, 6 and 7 (MI above the threshold) but not to gene 4, merging the clusters S1-S4 that contain genes 1-12]
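A rough sketch of what the insertion step above could look like in code: the new gene is linked to every candidate whose mutual information with it exceeds the current threshold, and the clusters containing those candidates are merged into one. The function name, the cluster contents and the candidate list are my own illustrative assumptions, not the author's exact algorithm; only the MI values and the threshold are taken from the slide.

def insert_gene(new_gene, mi_to_candidates, clusters, threshold):
    # mi_to_candidates: {candidate gene: MI(candidate, new_gene)}
    # clusters: list of disjoint sets of genes
    parents = [g for g, mi in mi_to_candidates.items() if mi > threshold]
    edges = [(p, new_gene) for p in parents]

    # Merge every cluster that contains one of the chosen parents
    merged, untouched = {new_gene}, []
    for cluster in clusters:
        if cluster & set(parents):
            merged |= cluster
        else:
            untouched.append(cluster)
    return edges, untouched + [merged]

mi = {1: 0.34, 4: 0.28, 5: 0.35, 6: 0.31, 7: 0.40}          # values from the slide
clusters = [{1, 2, 3}, {4}, {5, 8, 10}, {6, 9, 11}, {7, 12}] # invented cluster contents
edges, clusters = insert_gene("a", mi, clusters, threshold=0.3)
print(edges)      # gene a is connected to 1, 5, 6 and 7, but not to 4
print(clusters)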

My method (Deletion step)

Assumption:
- The network generated by the insertion step of my method is in a stationary state of the marginal log likelihood, except for the one edge that is currently being investigated to check the connection.

Three cases for an edge e between X and Y: X -> Y, X <- Y, and no edge.

In every case the marginal likelihood factorises as
P(D|M) = U * P(X|pa(X)) * P(Y|pa(Y)),
where U = prod_i P(Xi|pa(Xi)) over all other nodes is common to the three cases:
- Case 1: P1(D|M1) = U * P(X|pa1(X)) * P(Y|pa1(Y))
- Case 2: P2(D|M2) = U * P(X|pa2(X)) * P(Y|pa2(Y))
- Case 3: P3(D|M3) = U * P(X|pa3(X)) * P(Y|pa3(Y))
My method (Deletion step)

Example graph G with nodes A, B, C, D, E, F, H, I and an edge e between X and Y, where pa(X) = {A, B} and pa(Y) = {E, H} apart from e:
- Case 1 (X -> Y): pa1(X) = {A, B},    pa1(Y) = {E, H, X}
- Case 2 (X <- Y): pa2(X) = {A, B, Y}, pa2(Y) = {E, H}
- Case 3 (no edge): pa3(X) = {A, B},   pa3(Y) = {E, H}
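A hedged sketch of the edge check above: because the factor U over all other nodes is identical in the three cases, only the two local terms P(X|pa(X)) and P(Y|pa(Y)) need to be scored and compared. As a stand-in for the marginal likelihood used in the presentation, the sketch uses the standard K2 score (log marginal likelihood of a discrete node under uniform Dirichlet priors); the toy data and parent sets are invented.

from collections import Counter
from math import lgamma
import numpy as np

def local_score(data, child, parents, r=2):
    # K2 score: log marginal likelihood of one node given its parents
    # under uniform Dirichlet priors (stand-in for log P(X | pa(X)))
    n_jk = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    n_j = Counter(tuple(row[p] for p in parents) for row in data)
    score = sum(lgamma(r) - lgamma(n + r) for n in n_j.values())
    return score + sum(lgamma(n + 1) for n in n_jk.values())

def check_edge(data, x, y, pa_x, pa_y):
    # Compare the three cases for the edge between x and y; U cancels out
    return {
        "X->Y":    local_score(data, x, pa_x)       + local_score(data, y, pa_y + [x]),
        "X<-Y":    local_score(data, x, pa_x + [y]) + local_score(data, y, pa_y),
        "no edge": local_score(data, x, pa_x)       + local_score(data, y, pa_y),
    }

# Toy binary data: columns 0..3 = A, B, X, Y, with Y a noisy copy of X
rng = np.random.default_rng(2)
a, b, x = (rng.integers(0, 2, 300) for _ in range(3))
y = np.where(rng.random(300) < 0.9, x, 1 - x)
data = np.column_stack([a, b, x, y])

scores = check_edge(data, x=2, y=3, pa_x=[0, 1], pa_y=[])
print(max(scores, key=scores.get), scores)       # the "no edge" case should lose clearly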
My method

Mainly two steps:
- Insertion and deletion steps, applied repeatedly while the threshold is lowered.

[Figure: networks over genes a-h after successive insertion and deletion steps at thresholds T5 = 0.5, T6 = 0.4, T7 = 0.3, ..., continuing until the threshold reaches T10 = 0]

My method

Advantages:
- Based on biological facts (scale-free networks)
- No need to select a threshold
- Online approach
- Scalability
- Easy to explore sub-networks
- Fast computation

My method

Disadvantages:
- Input order dependency
  (61% of edges are less order dependent, in part B)
- Risky when exploring parents in data with large noise values
  (it can be over-fitted to the training data)

Bayesian network with MCMC

Bayesian network:
- The joint distribution factorises over the DAG; for the example network A -> B, A -> C, B -> D, B -> E:
  P(A, B, C, D, E) = P(A) P(B|A) P(C|A) P(D|B) P(E|B) = prod_i P(Xi | pa(Xi))
- Learning a structure M from data D uses the posterior
  P(M|D) = P(D|M) P(M) / P(D)

Problem 1 / Problem 2
[Figure. Left: with a large data set. Right: with a small data set.]
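To make the factorisation concrete, here is a small sketch that evaluates the joint probability of the example network A -> B, A -> C, B -> D, B -> E from its conditional probability tables. All CPT numbers are invented purely for illustration.

from itertools import product

# Invented CPTs for binary nodes; keys of conditional tables are (child value, parent value)
p_a = {1: 0.3, 0: 0.7}
p_b_a = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}
p_c_a = {(1, 1): 0.6, (0, 1): 0.4, (1, 0): 0.3, (0, 0): 0.7}
p_d_b = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.2, (0, 0): 0.8}
p_e_b = {(1, 1): 0.5, (0, 1): 0.5, (1, 0): 0.1, (0, 0): 0.9}

def joint(a, b, c, d, e):
    # P(A, B, C, D, E) = P(A) P(B|A) P(C|A) P(D|B) P(E|B)
    return p_a[a] * p_b_a[(b, a)] * p_c_a[(c, a)] * p_d_b[(d, b)] * p_e_b[(e, b)]

print(joint(1, 1, 0, 1, 0))                                   # one complete configuration
print(sum(joint(*cfg) for cfg in product([0, 1], repeat=5)))  # sanity check: sums to 1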

Bayesian network with MCMC

MCMC (Markov Chain Monte Carlo):
- Inference rule for Bayesian networks
- Sample from the posterior distribution
- Proposal move: given M_old, propose a new network M_new with probability Q(M_new | M_old)
- Acceptance and rejection:
  P_accept = min{ 1, [ P(D|M_new) P(M_new) Q(M_old|M_new) ] / [ P(D|M_old) P(M_old) Q(M_new|M_old) ] }
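A minimal, generic sketch of the acceptance rule above: one Metropolis-Hastings step that accepts a proposed network with probability min{1, ratio}, working in log space. The toy "structures" M0-M2 and their unnormalised posteriors are invented; in the real setting log_post would be the log marginal likelihood plus log prior of a network and propose would generate a neighbouring DAG.

import math, random

def mh_step(m_old, log_post, propose):
    # log_post(M): log[ P(D|M) P(M) ] up to a constant
    # propose(M): returns (M_new, log Q(M_new|M_old), log Q(M_old|M_new))
    m_new, log_q_fwd, log_q_rev = propose(m_old)
    log_ratio = (log_post(m_new) + log_q_rev) - (log_post(m_old) + log_q_fwd)
    if math.log(random.random()) < min(0.0, log_ratio):
        return m_new          # accept the proposed network
    return m_old              # reject and keep the current one

# Toy example: three candidate "structures" with made-up unnormalised posteriors 1 : 3 : 6
log_posts = {"M0": math.log(1.0), "M1": math.log(3.0), "M2": math.log(6.0)}
def propose(m):
    m_new = random.choice([k for k in log_posts if k != m])   # uniform over the others
    return m_new, math.log(0.5), math.log(0.5)                # symmetric proposal

random.seed(0)
m, counts = "M0", {k: 0 for k in log_posts}
for _ in range(20000):
    m = mh_step(m, log_posts.get, propose)
    counts[m] += 1
print(counts)   # visit frequencies are roughly proportional to 1 : 3 : 6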
Bayesian network with MCMC

MCMC in the Bayes Net Toolbox:
- Hastings factor: the proposal probability is calculated from the number of neighbours of the model.

Improvement of MCMCs

Fan-in:
- Sparse data lets the prior probability have a non-negligible influence on the posterior P(M|D).
- Limit the maximum number of edges converging on a node (the fan-in).
- If FI(M) > a, then P(M) = 0; otherwise P(M) = 1.
- The time complexity is largely reduced.

[Figure: acceptable configurations of a child node E and parents A, B, C, D with fan-in 3]
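The fan-in prior described above is simple enough to state directly in code; a minimal sketch, assuming a structure is represented as a mapping from each node to its set of parents:

def structure_prior(parent_sets, max_fan_in=3):
    # P(M) = 1 if no node has more than max_fan_in parents, otherwise P(M) = 0
    fan_in = max((len(parents) for parents in parent_sets.values()), default=0)
    return 1.0 if fan_in <= max_fan_in else 0.0

print(structure_prior({"E": {"A", "B", "C"}, "D": {"A"}}))       # 1.0 (accepted)
print(structure_prior({"E": {"A", "B", "C", "D"}, "D": {"A"}}))  # 0.0 (rejected)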

Improvement of MCMCs

DAG to CPDAG
(DAG: Directed Acyclic Graph, CPDAG: Completed Partially Directed Acyclic Graph)

X -> Y and X <- Y are equivalent: P(X, Y) = P(X) P(Y|X) = P(Y) P(X|Y)

[Figure: the set of all equivalent DAGs maps to one CPDAG; the edge D -> E is reversible, the others are compelled.]

Improvement of MCMCs

The CPDAG concept brings several advantages:
- The space of equivalence classes is much smaller.
- Moving in DAG space makes it easy to get trapped in a local optimum.
- Hence CPDAGs are incorporated into the MCMC.





MCMCMC

Trapping:
[Figure: posterior landscape with A (global optimum) and B (local optimum)]
- It is easy to get trapped in the local optimum B.
- Multiple chains with different temperatures are useful for escaping from it.

MCMCMC

Trapping:
[Figure: escape from the local optimum using multiple chains]

MCMCMC

A super chain S consists of the states of all chains, where chain j runs at temperature T_j:
P(S | D) ~ prod_j P(D | M_j)^(1/T_j) P(M_j)

Proposal: swap the states of two chains k and l (with a proposal ratio Q(S_old | S_new) / Q(S_new | S_old)).

Acceptance ratio of a super chain (swap of chains k and l):
P_accept = min{ 1, [ P(D|M_k)^(1/T_l) P(M_k) * P(D|M_l)^(1/T_k) P(M_l) ] / [ P(D|M_k)^(1/T_k) P(M_k) * P(D|M_l)^(1/T_l) P(M_l) ] }
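A small sketch of the chain-swap move behind the acceptance ratio above (MCMCMC / parallel tempering): chain j targets P(D|M)^(1/T_j) P(M), and when the states of chains k and l are swapped the prior terms cancel, so only the tempered likelihoods enter the ratio. The log-likelihood values and temperatures below are made up.

import math, random

def swap_accept_log(log_lik_k, log_lik_l, t_k, t_l):
    # Log acceptance ratio for swapping the current networks of chains k and l
    return min(0.0, (log_lik_k / t_l + log_lik_l / t_k)
                    - (log_lik_k / t_k + log_lik_l / t_l))

random.seed(0)
log_liks = [-150.0, -120.0, -180.0]   # current log P(D|M) of three chains (invented)
temps    = [1.0, 2.0, 4.0]            # chain temperatures: chain 0 is the cold chain

k, l = 0, 1                           # propose swapping the cold chain with chain 1
if math.log(random.random()) < swap_accept_log(log_liks[k], log_liks[l], temps[k], temps[l]):
    log_liks[k], log_liks[l] = log_liks[l], log_liks[k]
    print("swap accepted")            # the better network moves to the cold chain
else:
    print("swap rejected")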
Importance Sampling

- Partition function
- Proposal distribution
- Acceptance probability
- Only the prior distribution is used for the acceptance.
- Importance Sampling is also combined with MCMCMC: MCMCMC with Importance Sampling.
  (A term in these formulas is the likelihood for a configuration of a node n and its parents.)

Order MCMC

It samples over total orders, not over structures.

[Figure: three nodes A, B, C and their six possible orders: A B C, A C B, B C A, B A C, C A B, C B A]

Order MCMC

It samples over total orders, not over structures.
- Proposal move: flipping two nodes of the previous order.
- Computational limitations: use candidate sets,
  i.e. the sets of parents with the highest likelihood scores for each node.
  This reduces the computation time.
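A minimal sketch of the proposal move described above: a new order is generated by flipping (swapping) two randomly chosen nodes of the previous order. Scoring an order (summing, for each node, over the candidate parent sets that precede it in the order) is omitted here; the node names are arbitrary.

import random

def propose_order(order):
    # Order-MCMC proposal: swap two randomly chosen positions of the current order
    i, j = random.sample(range(len(order)), 2)
    new_order = list(order)
    new_order[i], new_order[j] = new_order[j], new_order[i]
    return new_order

random.seed(3)
order = ["A", "B", "C", "D", "E"]
print(propose_order(order))   # e.g. two of the five genes change places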

Order MCMC

[Figure: sampled orders]

Order MCMC

Selecting features:
- We can extract the edges by approximating and averaging the edge features under the stationary distribution over orders.
Synthetic data

The 41st to 50th genes are not connected.

Synthetic data

- MCMCMC with Importance Sampling has the best performance.
- Order MCMC is the second best.
- Order MCMC is much faster than MCMCMC with Importance Sampling.

Synthetic data

I changed one parameter at a time for the MCMC simulations:
1) Standard application (using the standard parameters)
2) Change the noise value (decrease the noise value to 0.1)
3) Change the training data size (decrease the size to 50)
4) Change the number of iterations (increase the number to 50000)

Standard parameters (MCMC in the Bayes Net Toolbox):
training data size: 200, noise value: 0.3,
number of iterations: 5000 (5000 samples and 5000 burn-ins)

Synthetic data

[Figure: results]

Synthetic data

Convergence:
1) MCMC in BNT
2) MCMCMC with Importance Sampling (IM)
3) MCMCMC with Importance Sampling (ID)
4) Order MCMC
[Figure: convergence of chains 1-4; training set size: 200, noise: 0.3, 5000 iterations]

Synthetic data

MCMCMC (burn-in # + sample #)
[Figure. Left: 5000 + 5000. Right: 100000 + 100000.]

Acceptance ratios
[Figure. Left: MCMC in BNT. Middle: MCMCMC with Importance Sampling. Right: Order MCMC.]

Diffuse large B cell lymphoma Data

Data discretisation:
- I used the K-means algorithm to discretise the gene expression levels of each gene separately, since the stationary level of each gene can differ from that of the others (up, down and normal).
- Problem with this discretisation: if there is too much noise, the noise can cause fluctuations. As a result, this method cannot work well for gene 3.
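A sketch of the per-gene K-means discretisation described above: each gene's expression profile is clustered into three levels independently, so genes with different baseline levels get their own cut points, and the levels are relabelled as down / normal / up by sorting the cluster centres. The data are simulated and scikit-learn is assumed to be available; this is not the original code.

import numpy as np
from sklearn.cluster import KMeans

def discretise_gene(expression, n_levels=3, seed=0):
    # Cluster one gene's expression values into n_levels and relabel so that
    # 0 = lowest level (down), n_levels - 1 = highest level (up)
    km = KMeans(n_clusters=n_levels, n_init=10, random_state=seed)
    labels = km.fit_predict(expression.reshape(-1, 1))
    order = np.argsort(km.cluster_centers_.ravel())
    remap = {old: new for new, old in enumerate(order)}
    return np.array([remap[label] for label in labels])

# Simulated expression of one gene across 105 samples, with three shifted modes
rng = np.random.default_rng(4)
expr = np.concatenate([rng.normal(-2, 0.5, 35), rng.normal(0, 0.5, 35), rng.normal(2, 0.5, 35)])
print(discretise_gene(expr)[:10])   # values in {0: down, 1: normal, 2: up}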

Diffuse large B cell lymphoma Data

Comparison of convergence:
[Figures: MCMC in BNT, MCMCMC with Importance Sampling (ID), Order MCMC]
Number of genes: 27, training data size: 105, iterations: 20000

Diffuse large B cell lymphoma Data

Comparison of acceptance ratios:
[Figure] Number of genes: 27, training data size: 105, iterations: 20000

Gene expression in susceptible Arabidopsis thaliana plants inoculated with viruses

Viruses:
1) Cucumber mosaic cucumovirus
2) Oilseed rape tobamovirus
3) Turnip vein clearing tobamovirus
4) Potato virus X potexvirus
5) Turnip mosaic potyvirus

[Figure: time course for gene a, sampled at 1, 2, 3, 4, 5 and 7 DAI after inoculation; symptoms occur during the time course]
DAI = day after inoculation
Training data: 127 genes with 20 data points (4 DAIs * 5 viruses)

Gene expression in susceptible Arabidopsis thaliana plants inoculated with viruses

Only 20 genes (1 DAI and 2 DAI):
- 1000 samples from my method
- 10000 samples from MCMCMC with Importance Sampling (ID)

Gene expression in susceptible Arabidopsis thaliana plants inoculated with viruses

127 genes
[Figure: genes with higher connectivity]
Average global connectivity = 1.5847

Gene expression in susceptible Arabidopsis thaliana plants inoculated with viruses

127 genes

p-value check for the transcription function:
- f is the number of genes with the j-th function among the 127 genes.
- m is the number of genes with the j-th function among the 14 genes (those with higher connectivity).
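The slide defines f and m but does not show the formula of the p-value check; a common choice for this kind of functional-enrichment check is a hypergeometric test, sketched below under that assumption and with invented counts (scipy is assumed to be available).

from scipy.stats import hypergeom

N = 127   # all genes in the network
n = 14    # genes with higher connectivity
f = 30    # genes with the j-th function among the 127 genes (invented)
m = 8     # genes with the j-th function among the 14 genes (invented)

# p-value: probability of seeing at least m genes with function j among the
# 14 highly connected genes if they were drawn at random from the 127 genes
p_value = hypergeom.sf(m - 1, N, f, n)
print(p_value)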

Gene expression in susceptible Arabidopsis thaliana plants inoculated with viruses

[Figure: network for the 127 genes from my method (100 samples)]
Conclusion

- We need to select a methodology depending on the characteristics of the training data.
- To obtain results closest to the real networks, MCMCMC with Importance Sampling and Order MCMC are suitable:
  - MCMCMC with Importance Sampling has the best performance, but it is slower than the other MCMCs.
  - Order MCMC has the second-best performance, but it is four times faster than MCMCMC with Importance Sampling.
- If we want to process large-scale data and do not have enough time to run MCMCs, the Relevance Network and my method are appropriate.
- Also, the different methods generate different networks, so combining them will give better results.

Conclusion

Biological meaning:
- Transcription genes have higher connectivities than other genes (from my method). That is, genes with a transcription function may act as hubs in a network for the response against viruses in Arabidopsis thaliana plants.