Effective Enrichment of Gene Expression Data Sets


Utku Sirin (a), Utku Erdogdu (a), Faruk Polat (a), Mehmet Tan (b), and Reda Alhajj (c)

(a) Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
(b) Department of Computer Engineering, TOBB University of Economics and Technology, Ankara, Turkey
(c) Department of Computer Science, University of Calgary, Alberta, Canada


IEEE 11th International Conference on Machine Learning and Applications
December 13, 2012




Outline


Background and Motivation



Multi-Model Framework & Evaluation Metrics



Generative Models


Probabilistic Boolean Networks


Ordinary Differential Equations



Experimental Evaluation



Conclusion & Future Work



Background & Motivation


Gene expression data is the main source of
information for many applications in computational
systems biology


However, the datasets suffer from the problem of
skewed data matrices


There are thousands of genes but only a few tens of samples

Such a small number of samples significantly lowers the confidence level of any computational method



Background & Motivation


How can we confidently enrich the available gene expression datasets?


There are several tools that generate synthetic gene expression samples, such as GeneNetWeaver (Schaffter et al., 2011) or SynTReN (Van den Bulcke et al., 2006)


However, all of them use a single model, such as ordinary differential/stochastic equations or Boolean networks, which restricts how they can model gene regulation


Our idea is to integrate different machine learning techniques into a single unified multi-model framework so that we can benefit from different models concurrently


In this way, we aim to generate synthetic gene expression samples more confidently and to mitigate the low-sample-size problem of gene expression datasets by producing high-quality data


Multi-Model Gene Regulation Model


Construct different gene
regulation models from
available gene expression
samples


Sample from each of them
equally and pool the generated
samples


Select the
best
samples from
the pool and output them


Each model contributes its own characteristics, and we utilize all of them concurrently (a rough pipeline sketch is given after the diagram below)


How do we select the best samples?


[Diagram: the original gene expression data is fed to Model 1 … Model N; each model generates k samples, which are pooled and passed through multi-objective selection; k samples are output.]
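To make the flow above concrete, here is a minimal Python sketch of the multi-model generation loop. All names are hypothetical (the slides do not prescribe an API); the metric and selection helpers are sketched on the following slides.

```python
import numpy as np

def enrich(original, models, k):
    """Hypothetical sketch: pool k candidates from each generative model,
    score them with the three metrics, and keep the best k."""
    pool = []
    for model in models:                 # e.g. a PBN model and an ODE model
        model.fit(original)              # learn gene regulation from the real samples
        pool.extend(model.sample(k))     # each model contributes k candidate samples
    pool = np.asarray(pool)
    # one metric vector (compatibility, diversity, coverage) per candidate
    scores = np.column_stack([compatibility(pool, original),
                              diversity(pool, original),
                              coverage(pool)])
    keep = multi_objective_select(scores, k)   # strict-dominance selection (later slide)
    return pool[keep]
```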

Evaluation Metrics


After generating samples from each model, it is very important to select the highest-quality samples to output; otherwise our method would be impractical


To judge the quality of the generated samples, we defined three metrics that measure it from different aspects: Compatibility, Diversity and Coverage (a small code sketch of all three follows at the end of this slide)


Compatibility


How close are the generated samples to the original samples?


To ensure that the generated samples are similar to the original samples

Mean of the Euclidean distances of each generated sample to the original samples


Diversity


How different are the generated samples from the original samples?


To ensure that the generated samples are not duplicates of the original samples but always carry new information

We calculate the entropy value of each sample in the original dataset and sum the differences. For each generated sample, we add it to the original dataset and again sum the differences of the entropy values. The diversity value of that sample is the ratio of the latter sum to the former.


Coverage


How well do the generated samples cover the sample space?

To ensure that the sample space is covered as much as possible

Mean of the Euclidean distances of each generated sample to the already generated samples
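A minimal sketch of how the three metrics could be computed, assuming generated and original are NumPy arrays with one sample per row. The entropy-difference term for diversity is one plausible reading of the slide (sum of pairwise absolute differences between per-sample entropies); the authors' exact formulation may differ.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import entropy

def _entropy_diff_sum(data):
    # One reading of the slide: per-sample entropy, then the sum of pairwise
    # absolute differences between those entropy values (assumption).
    h = np.array([entropy(np.abs(row) + 1e-12) for row in data])
    return np.abs(h[:, None] - h[None, :]).sum() / 2.0

def compatibility(generated, original):
    """Mean Euclidean distance from each generated sample to the original samples."""
    return cdist(generated, original).mean(axis=1)

def diversity(generated, original):
    """Ratio of the entropy-difference sum after adding the sample to the sum before."""
    base = _entropy_diff_sum(original)
    return np.array([_entropy_diff_sum(np.vstack([original, g])) / base
                     for g in generated])

def coverage(generated):
    """Mean Euclidean distance from each generated sample to the other generated samples."""
    d = cdist(generated, generated)
    np.fill_diagonal(d, np.nan)
    return np.nanmean(d, axis=1)
```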

Evaluation Metrics, Multi-Objective Selection


After calculating three metric values, we have a vector of metric results
for each sample


To select the best samples among the generated ones, we apply a multi-objective selection mechanism to the vectors of metric results, using the strict dominance rule


Strict dominance rule: a sample is of higher quality than another sample if all of its metric results are greater than those of the other sample.


We sort all of the generated samples multi-objectively and select the best k to output (a small sketch follows below)


Mutually non-dominant samples are grouped together and selected randomly
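A small sketch of the strict-dominance selection, assuming each row of scores holds the three metric values of one candidate and that all metrics are oriented so that larger is better; the random tie-breaking among mutually non-dominant samples follows the slide. The function name is hypothetical.

```python
import numpy as np

def multi_objective_select(scores, k, seed=0):
    """Rank candidates by how many others they strictly dominate; keep the best k.
    A candidate strictly dominates another if ALL of its metric values are greater."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    dominated = np.array([
        sum(np.all(scores[i] > scores[j]) for j in range(n) if j != i)
        for i in range(n)
    ])
    # primary key: dominance count (descending); ties broken randomly
    order = np.lexsort((rng.random(n), -dominated))
    return order[:k]
```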

Generative Models


In our framework there may be any number of gene
regulation models


The important point is that the models should be as independent as possible, so that the generated gene expression samples cover different parts of the sample space


In this study we representatively selected two generative models for our multi-model data generation framework


Probabilistic Boolean Networks (PBNs) (Shmulevich et al., 2002)


Ordinary Differential Equations (ODEs) (Bansal et al., 2006)

Probabilistic Boolean Networks (PBNs)


Probabilistic versions of Boolean Networks (Kauffman, 1993)


Each gene is either ON or OFF (binary values)


Each gene is associated with a set of Boolean functions, and each Boolean function is associated with a set of genes (its input variables)


Each set of Boolean functions is also associated with a probability distribution, so that at each time step the value of each gene is determined by a Boolean function selected according to its probability

Probabilistic Boolean Networks (PBNs)

[Diagram: the gene values g_1, …, g_n at time t determine the values at time t+1. Example for one gene: its Boolean function set is F_1 = {f_1, f_2, f_3} with selection probabilities P_1 = {0.25, 0.45, 0.30}, where each function reads a small subset of the genes, e.g. f_3 = f_3(g_2, g_4, g_6).]





We construct the PBNs by adapting the MATLAB PBN Toolbox

Then, we run the PBN and generate synthetic gene expression samples to feed into our multi-model framework (a toy update step is sketched below)
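The slides use the MATLAB PBN Toolbox; purely for illustration, here is a toy Python version of one synchronous PBN update step under an assumed representation: for each gene, a list of (input gene indices, truth table, selection probability) triples. All names and the example network are hypothetical.

```python
import numpy as np

def pbn_step(state, gene_functions, rng=None):
    """One synchronous PBN update.
    state: 0/1 vector of current gene values.
    gene_functions[i]: list of (inputs, truth_table, prob) for gene i,
    where truth_table has 2**len(inputs) binary entries."""
    rng = rng or np.random.default_rng()
    nxt = np.empty_like(state)
    for i, candidates in enumerate(gene_functions):
        probs = [p for _, _, p in candidates]
        chosen = rng.choice(len(candidates), p=probs)   # pick a function by its probability
        inputs, table, _ = candidates[chosen]
        key = int("".join(str(int(state[j])) for j in inputs), 2)
        nxt[i] = table[key]                             # look up the Boolean output
    return nxt

# Hypothetical 3-gene example: gene 0 is OR of genes 1,2 with prob 0.7, AND with prob 0.3.
funcs = [
    [([1, 2], [0, 1, 1, 1], 0.7), ([1, 2], [0, 0, 0, 1], 0.3)],
    [([0], [1, 0], 1.0)],          # gene 1: NOT gene 0
    [([2], [0, 1], 1.0)],          # gene 2: copies itself
]
print(pbn_step(np.array([1, 0, 1]), funcs))
```

Starting from binarized original profiles and iterating pbn_step yields the synthetic trajectories that are pooled by the multi-model framework.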

Ordinary Differential Equations (ODEs)


One of the oldest methods to model gene regulation


Each gene's expression value is associated with the other genes' expression values through a regulation matrix, denoted A below






The time derivative of each gene's expression value is determined by a linear combination of the expression values of all the other genes


There are many algorithms modeling gene regulation with ODEs


"Network Identification by Multiple Regression" (NIR), applying multiple linear regression (Gardner et al., 2003)


"Differential Equation-based Local Dynamic Bayesian Network" (DELDBN), combining differential equations and dynamic Bayesian networks (Li et al., 2011)


In our study, we use the "Time Series Network Identification" (TSNI) algorithm because of its advantages over the other methods (Bansal et al., 2006)


It can handle both time-series and steady-state gene expression datasets



It can easily be applied to large datasets due to its utilization of principal component
analysis


It can determine external perturbation automatically from the data


dx/dt = A · x

Ordinary Differential Equations (ODEs)

dX/dt = A · X(t_k) + B · U(t_k)





The only unknowns are the A and B matrices

If we write the equation by concatenating the known and unknown matrices:

In this equation, the differentiation term dX/dt, the expression values X, and the perturbation values U are KNOWN, while the regulation matrix A and the perturbation matrix B are UNKNOWN.

Ordinary Differential Equations (ODEs)


It is easy to find the unknown matrix H in this scheme

However, the number of equations should be greater than the number of unknowns, which does not always hold.

At this point, TSNI applies Principal Component Analysis (PCA) to the Y matrix, reduces its dimension, and solves the equation.


Then, the unknown matrices A and B can
be obtained easily


By running the ODE model we generated, it is easy to produce synthetic gene expression samples to feed into our multi-model framework (a rough numerical sketch follows after the equation below)











dX/dt = H · Y, where H = [A  B] and Y = [X ; U] (X stacked on top of U)
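This is not the authors' TSNI code, but a rough NumPy sketch of the idea on these slides: estimate the derivative, stack Y = [X; U], reduce Y with PCA so the regression is well-posed, solve for H in the reduced space, and map back to A and B. The function and variable names are mine.

```python
import numpy as np

def estimate_regulation(X, U, dt, n_components):
    """X: genes x timepoints expression matrix, U: perturbations x timepoints,
    dt: sampling interval. Returns estimates of A (regulation) and B (perturbation)."""
    Xdot = np.gradient(X, dt, axis=1)            # finite-difference derivative estimate
    Y = np.vstack([X, U])                        # concatenated known matrix Y = [X; U]
    # PCA on the rows of Y so the system has more equations than unknowns
    Yc = Y - Y.mean(axis=1, keepdims=True)
    P, _, _ = np.linalg.svd(Yc, full_matrices=False)
    P = P[:, :n_components]                      # leading principal directions
    Hr = Xdot @ np.linalg.pinv(P.T @ Y)          # least squares in the reduced space
    H = Hr @ P.T                                 # back-project: H = [A  B]
    A, B = H[:, :X.shape[0]], H[:, X.shape[0]:]
    return A, B
```

Once A and B are estimated, integrating the ODE forward produces the synthetic samples that are pooled by the multi-model framework.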
Experimental Evaluation, Datasets


We evaluated our framework using three different real-life biological datasets


The first dataset is the gene expression profile of metastatic melanoma cells (Bittner et al., 2000)


It originally includes 8067 genes and 31 samples. We used its reduced form, composed of the 7 most important genes and 31 samples (Datta et al., 2003)


The second dataset is the gene expression dataset of the yeast cell cycle (Spellman et al., 1998)

It includes 25 genes and 77 samples


The third dataset is an siRNA disruptant dataset in human umbilical vein endothelial cells (HUVECs) (Hurley et al., 2011)


It includes 379 genes and
400 samples


It is a newly published and very useful resource for our model


Experimental Evaluation, Results


We evaluated our framework, based on the three metrics we defined, in two different settings


In the first setting, we used the melanoma and yeast datasets without
partitioning the datasets into training and testing sets


This is because the melanoma and yeast datasets have relatively few samples, so dividing them into training and testing sets would be meaningless


In the second setting, we used the yeast and HUVECs datasets, partitioning them into training and testing sets. Here we have enough samples to divide; the HUVECs dataset has 400 samples.


In the first set of experiments the results are always somewhat suspect, since the training and testing sets are the same. The second set of experiments gives a clearer picture and increases the confidence in our framework


Note that because the yeast dataset is middle-sized, we used it in both the first and second sets of experiments to compare the results


Figure 2: Diversity


In this set of experiments, we increased the number of samples generated by our framework (10, 20, …, 500) and checked the mean of the metric results w.r.t. the training samples.


Figures 1 and 2 show the compatibility and diversity results. The compatibility results show that the data generated by our framework converges to the original dataset, since it gets closer and closer to it


The diversity results, on the other hand, show that although the generated samples get closer to the original dataset, they always carry new information with respect to it.


This means that our multi-model gene expression data generation framework always produces high-quality samples that are both very close to the original dataset and bring new information


For the melanoma dataset, the newly generated samples bring almost 30% new information, which is a very good result


Experimental Evaluation, Results, Setting #1

Figure 1: Compatibility

Experimental Evaluation, Results, Setting #1


The coverage results conclude our first set of experiments


As seen in Figure 3, the coverage results decrease for both datasets. This is consistent with the compatibility results: the system converges to generating samples similar to the original dataset, and hence similar to each other as well.

Figure 3: Coverage


In the previous experiments the testing and training datasets were the same due to the low sample sizes, which lowers the confidence in the experimental results


In this set of experiments, we divide the yeast and HUVECs datasets into training and testing sets. We construct our generative models from the training samples and compute the metric results on the testing samples


Note that we also computed the metric results on the training samples to compare the results


We used the first 50 samples for training and the last 27 samples for testing in the yeast dataset


We used the first 300 samples for training and the last 100 samples for testing in the HUVECs dataset (a sketch of this protocol follows below)
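As an illustration of this protocol only, and reusing the hypothetical helpers sketched on the earlier slides, the second setting amounts to fitting on the training split and scoring against the held-out split; yeast, pbn_model and ode_model are placeholder names.

```python
import numpy as np

# Hypothetical sketch of the second evaluation setting for the yeast dataset:
# fit the generative models on the first 50 samples, generate synthetic samples,
# then score them against both the held-out and the training samples.
train, test = yeast[:50], yeast[50:]          # 50 training, 27 testing samples
synthetic = enrich(train, models=[pbn_model, ode_model], k=50)
scores_test = np.column_stack([compatibility(synthetic, test),
                               diversity(synthetic, test),
                               coverage(synthetic)])
scores_train = np.column_stack([compatibility(synthetic, train),
                                diversity(synthetic, train),
                                coverage(synthetic)])   # baseline for comparison
```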

Experimental Evaluation, Results, Setting #2

Experimental Evaluation, Results, Setting #2

Figure 4: Compatibility for Yeast

Figure 5: Diversity for Yeast


First we generated 50 samples and checked the metric results for each generated sample separately


Figures 4 and 5 show the results for the yeast dataset in terms of compatibility and diversity


They confirm our concern about the low confidence of the first set of experiments: the generated data is less close to the original samples, and more diverse than them, when evaluated w.r.t. the testing data

Experimental Evaluation, Results, Setting #2

Figure 6: Compatibility for HUVECs

Figure 7: Diversity for HUVECs


Figures 6 and 7 show the results for the HUVECs dataset in terms of compatibility and diversity


They again confirm our concern about the low confidence of the first set of experiments: the generated data is less close to the original samples, and more diverse than them, when evaluated w.r.t. the testing data


So we can say that our generated samples are actually of higher quality than the first set of experiments suggests: we still have very good compatibility values of around 93%, and the diversity values are greater than their previous values



Experimental Evaluation, Results, Setting #2


Now we know that our generated data is less close to, and more diverse from, the original dataset


So what happens when we generate a large number of samples? To understand this, we generated 10, 20, …, 500 samples and checked the difference of the mean metric results w.r.t. the testing and training sets


That is, for each generated sample set, we evaluate it w.r.t. the testing dataset and w.r.t. the training dataset, and we plot the difference


The results w.r.t. training form a baseline for us, and we try to understand how the metric results w.r.t. the testing samples change relative to it

Experimental Evaluation, Results, Setting #2


Figures 8 and 9 show the results for the yeast dataset


These results show that our generated samples are very close to the original dataset, because there is only a 5% difference between the compatibility values


Moreover, they always carry new information because the diversity values are always
greater than zero


Nonetheless, the results for the yeast dataset are not promising, because as we generate more and more samples they do not show a regular pattern

Figure 8: Compatibility for Yeast

Figure 9: Diversity for Yeast

Experimental Evaluation, Results, Setting #2


Figures 10 and 11 show the results for the HUVECs dataset


Now we actually see much better results. First of all, the compatibility difference is less than that of yeast: only 2%, which is a very good result


Secondly, and more importantly, the diversity values are always increasing. That means that as we generate more and more samples, our generated samples are not only very close to the original dataset but also keep bringing new, and even more, information to it


This is actually a very important result: it shows that we can computationally generate gene expression samples just as if they were original samples. Hence, the complex internal dynamics of gene regulation can be successfully simulated by superposing different methods and generating data as if it had been produced by those internal dynamics themselves.


We think the reason for this result is the number of training samples we have in HUVECs dataset


This not only shows the power of computational methods but also provides the practically very valuable result of generating high-quality gene expression data

Figure 10: Compatibility for HUVECs

Figure 11: Diversity for HUVECs

Conclusion & Future Work


By integrating different machine learning methods, we can successfully simulate a complex gene regulation system


The system always produces samples that are both similar to the original gene expression dataset and carry new information


Our system can be used as a preprocessor for any computational
approach requiring gene expression data


As future work:

The framework can be extended by integrating more models

Moreover, the produced samples may be studied under a predetermined analysis task to verify the effectiveness of our system

Furthermore, a bound can be determined for the number of samples required to train our multi-model framework

References


T. Schaffter, D. Marbach, and D. Floreano, "GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods," Bioinformatics, vol. 27, no. 16, pp. 2263-2270, 2011.

T. Van den Bulcke, K. Van Leemput, B. Naudts, P. van Remortel, H. Ma, A. Verschoren, B. De Moor, and K. Marchal, "SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms," BMC Bioinformatics, 7:43, 2006.

I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, "Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks," Bioinformatics, vol. 18, no. 2, pp. 261-274, 2002.

M. Bansal, G. D. Gatta, and D. Di Bernardo, "Inference of gene regulatory networks and compound mode of action from time course gene expression profiles," Bioinformatics, vol. 22, no. 7, pp. 815-822, Apr. 2006.

S. A. Kauffman, The Origins of Order: Self-Organization and Selection in Evolution, 1st ed. Oxford University Press, USA, June 1993.

T. S. Gardner, D. di Bernardo, D. Lorenz, and J. J. Collins, "Inferring Genetic Networks and Identifying Compound Mode of Action via Expression Profiling," Science, vol. 301, no. 5629, pp. 102-105, 2003.

Z. Li, P. Li, A. Krishnan, and J. Liu, "Large-scale dynamic gene regulatory network inference combining differential equation models with local dynamic Bayesian network analysis," Bioinformatics, vol. 27, no. 19, pp. 2686-2691, Oct. 2011.

A. Datta, A. Choudhary, M. L. Bittner, and E. R. Dougherty, "External control in Markovian genetic regulatory networks," Machine Learning, vol. 52, no. 1-2, pp. 169-191, Jul. 2003.

P. Spellman, G. Sherlock, M. Zhang, V. Iyer, K. Anders, M. Eisen, P. Brown, D. Botstein, and B. Futcher, "Comprehensive identification of cell cycle regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization," 1998.

D. Hurley, H. Araki, Y. Tamada, B. Dunmore, D. Sanders, S. Humphreys, M. Affara, S. Imoto, K. Yasuda, Y. Tomiyasu, K. Tashiro, C. Savoie, V. Cho, S. Smith, S. Kuhara, S. Miyano, D. S. Charnock-Jones, E. J. Crampin, and C. G. Print, "Gene network inference and visualization tools for biologists: application to new human transcriptome datasets," Nucleic Acids Research, 2011.

Any Questions or Comments?






This research is partially supported by The Scientific and Technological
Research Council of Turkey (TUBITAK), with project #110E179.