University of Crete
CS483
1
The use of Minimum Spanning Trees
in microarray expression data
Gkirtzou Ekaterini
Introduction
Classic clustering algorithms, like K-means, self-organizing maps, etc., have certain drawbacks:
No guarantee of globally optimal results
Depend on the geometric shape of cluster boundaries (K-means)
Introduction
MST clustering algorithms:
Expression data clustering analysis (Xu et al. 2001)
Iterative clustering algorithm (Varma et al. 2004)
Dynamically growing self-organizing tree (DGSOT) (Luo et al. 2004)
Definitions
A minimum spanning tree (MST) of a weighted, undirected graph G = (V, E) with edge weights w(e) is an acyclic subset T ⊆ E that contains all of the vertices and whose total weight w(T) = Σ_{e∈T} w(e) is minimum.
Definitions
The DNA microarray technology enables the massively parallel measurement of the expression of thousands of genes simultaneously. Its usefulness:
compare the activity of genes in diseased and healthy cells
categorize a disease into subgroups
aid drug discovery and toxicology studies
Definitions
Clustering is a common technique for
data analysis. Clustering partitions the
data set into subsets (clusters), so that
the data in each subset share some
common trait.
MST clustering algorithms
Expression data clustering analysis (Xu et al. 2001)
Iterative clustering algorithm (Varma et al. 2004)
Dynamically growing self-organizing tree (DGSOT) (Luo et al. 2004)
Expression data clustering analysis
Let D = {d_i} be a set of expression data, with each d_i = (e_i^1, …, e_i^t) representing the expression levels at time 1 through time t of gene i. We define a weighted, undirected graph G = (V, E) as follows. The vertex set V = {d_i | d_i ∈ D} and the edge set E = {(d_i, d_j) | d_i, d_j ∈ D and i ≠ j}.
Expression data clustering analysis
G is a complete graph.
The weight of an edge is the distance between its two vertices, e.g. the Euclidean distance, the correlation coefficient, etc.
Each cluster corresponds to one subtree of the MST.
No essential information is lost for clustering.
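The construction above can be sketched in plain Python: a minimal Prim's algorithm over the complete graph of expression profiles, with the Euclidean distance as the edge weight. The data and function names here are illustrative, not from the paper.

```python
import math

def euclidean(a, b):
    # Distance between two expression profiles (time series of one gene)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mst_prim(profiles):
    """Prim's algorithm on the complete graph of expression profiles.
    Returns the MST as a list of (i, j, weight) edges."""
    n = len(profiles)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest edge weight connecting each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                w = euclidean(profiles[u], profiles[v])
                if w < best[v]:
                    best[v] = w
                    parent[v] = u
    return edges
```

For example, four toy "genes" forming two tight pairs yield an MST with two short edges and one long bridging edge.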
Clustering through removing long MST edges
Based on the intuition of what a cluster is
Works very well when inter-cluster edges are longer than intra-cluster ones
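A minimal Python sketch of this idea (names assumed for illustration): delete the K−1 longest MST edges, then read off the connected components as clusters.

```python
def cluster_by_edge_removal(n, mst_edges, k):
    """Partition n vertices into k clusters by deleting the k-1
    longest edges of the MST (given as (i, j, weight) tuples) and
    taking the connected components of what remains."""
    kept = sorted(mst_edges, key=lambda e: e[2])[: n - k]  # drop the k-1 longest
    # Union-find over the kept edges
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j, _ in kept:
        parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    # Relabel components 0..k-1 in order of first appearance
    remap = {}
    return [remap.setdefault(r, len(remap)) for r in roots]
```

With two short edges and one long edge, asking for k = 2 cuts the long edge and separates the two pairs.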
An iterative clustering
Minimizes the distance between the center of a cluster and its data.
Starts with K arbitrary clusters of the MST; for each pair of adjacent clusters, finds the edge to cut that optimizes

Σ_{i=1}^{K} Σ_{d∈T_i} dist(d, center(T_i))   (1)
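Quantity (1) can be evaluated for a candidate partition as follows; this is an illustrative sketch in which the cluster center is taken to be the component-wise mean of the cluster's points.

```python
import math

def objective_1(clusters):
    """Sum over clusters of the distances from each point to its
    cluster center, i.e. the quantity minimized in (1).
    `clusters` is a list of lists of points (tuples)."""
    total = 0.0
    for pts in clusters:
        dim = len(pts[0])
        center = [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
        for p in pts:
            total += math.sqrt(sum((p[d] - center[d]) ** 2 for d in range(dim)))
    return total
```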
A globally optimal clustering
Tries to partition the tree into K subtrees T_1, …, T_K
Selects K representatives d_i, one per subtree, to optimize

Σ_{i=1}^{K} Σ_{d∈T_i} dist(d, d_i)   (2)
MST clustering algorithms
Expression data clustering analysis (Xu et al. 2001)
Iterative clustering algorithm (Varma et al. 2004)
Dynamically growing self-organizing tree (DGSOT) (Luo et al. 2004)
Iterative clustering algorithm
The clustering measure used here is the Fukuyama-Sugeno measure

FS(S) = Σ_{k=1}^{2} Σ_{j=1}^{N_k} ( ‖x_j^k − μ_k‖² − ‖μ_k − μ‖² )

where S_1, S_2 are the two partitions of the set S, with each S_k containing N_k samples; μ_k denotes the mean of the samples in S_k, μ the global mean of all samples, and x_j^k the j-th sample in the cluster S_k.
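A direct Python transcription of the measure above for a binary partition, assuming Euclidean geometry; a more separated partition gives a lower (more negative) FS value.

```python
def fs_measure(s1, s2):
    """Fukuyama-Sugeno measure for a binary partition (S_1, S_2):
    sum over both clusters of ||x_j^k - mu_k||^2 - ||mu_k - mu||^2,
    where mu_k is the cluster mean and mu the global mean."""
    def mean(pts):
        dim = len(pts[0])
        return [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    mu = mean(s1 + s2)            # global mean of all samples
    total = 0.0
    for pts in (s1, s2):
        mu_k = mean(pts)
        for p in pts:
            total += sqdist(p, mu_k) - sqdist(mu_k, mu)
    return total
```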
Iterative clustering algorithm
Feature selection counts the gene’s
support to a partition
Feature selection used here is t

statistic
with pooled variance. T

statistic is
heuristic measure
Genes with absolute t

statistic greater
than a threshold are selected
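A minimal sketch of the pooled-variance t-statistic for one gene, where x and y are that gene's expression values in the two partitions (names are illustrative).

```python
import math

def pooled_t(x, y):
    """Two-sample t-statistic with pooled variance for one gene's
    expression values across the two partitions."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    # Pooled (unbiased) variance estimate
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    sp2 = (ssx + ssy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))
```

Genes whose |t| exceeds the chosen threshold are the ones kept for the next iteration.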
Iterative clustering algorithm
Create an MST from all genes.
Delete edges from the MST to obtain binary partitions. Select the one with the minimum F-S clustering measure.
Feature selection is then used to select a subset of genes that discriminate between the clusters.
Iterative clustering algorithm
In the next iteration the clustering is
done in this selected set of genes
Until the selected gene subset converges
Remove them form the pool and
continue.
MST clustering algorithms
Expression data clustering analysis (Xu et al. 2001)
Iterative clustering algorithm (Varma et al. 2004)
Dynamically growing self-organizing tree (DGSOT) (Luo et al. 2004)
Dynamically growing self-organizing tree (DGSOT)
In the previous algorithms the MST is constructed on the original set of data and used to test the intra-cluster property, while here the MST is used as a criterion to test the inter-cluster property.
DGSOT algorithm
A tree-structured self-organizing neural network
Grows vertically and horizontally
Starts with a single root leaf node
In every vertical growing phase, two descendants are created for every leaf node whose heterogeneity H exceeds a threshold T_R, and the learning process takes place
DGSOT algorithm
Heterogeneity:
Variability (maximum distance between the input data and the node)
Average distortion d of a leaf i:

d = (1/D) Σ_{j=1}^{D} d(x_j, n_i)

D: total number of input data of leaf i
d(x_j, n_i): distance between data j and leaf i
n_i: reference vector of leaf i
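The average distortion of a leaf is a one-liner; a sketch with the Euclidean distance as d(x_j, n_i) and illustrative names:

```python
import math

def average_distortion(data, ref):
    """Average distortion of a leaf: the mean distance between the
    leaf's input data and its reference vector n_i (`ref`)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sum(dist(x, ref) for x in data) / len(data)
```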
DGSOT algorithm
In every horizontal growing for every
lowest non

leaf node a child is added
until the validation criterion is satisfied
and the learning process take place
The learning process distributes the data
to the leaves in the best way. The best
matching node has the minimum
distance to the input data
The validation criterion of DGSOT
Calculated without human intervention
Based on geometric characteristics of the clusters
Create the Voronoi diagram for the input data. The Voronoi diagram divides the data set D into n regions V(p):

V(p) = { x ∈ D | dist(x, p) ≤ dist(x, q) for all q ≠ p }
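For a finite data set, the Voronoi regions V(p) are just nearest-centroid assignments; a minimal sketch (assumed names):

```python
import math

def voronoi_regions(data, centroids):
    """Assign each point of D to the Voronoi region V(p) of its
    nearest centroid p (ties go to the first nearest centroid)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    regions = [[] for _ in centroids]
    for x in data:
        nearest = min(range(len(centroids)), key=lambda i: dist(x, centroids[i]))
        regions[nearest].append(x)
    return regions
```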
The validation criterion of DGSOT
Define a weighted, undirected graph G = (V, E). The vertices are the centroids of the Voronoi cells V(p), and the edge set is defined as E = {(p_i, p_j) | p_i, p_j centroids and i ≠ j}.
Create the MST for the graph G = (V, E).
Voronoi diagram of a 2D dataset
In A, the dataset is partitioned into three Voronoi cells. The MST of the centroids is 'even'.
In B, the dataset is partitioned into four Voronoi cells. The MST of the centroids is not 'even'.
The validation criterion of DGSOT
Cluster separation:

CS = E_min / E_max

where E_min is the minimum-length edge and E_max the maximum-length edge of the MST.
A low CS value means that two centroids are too close to each other and the Voronoi partition is not valid, while a high CS value means that the Voronoi partition is valid.
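Once the edge lengths of the centroids' MST are known, CS is trivial to compute; a sketch:

```python
def cluster_separation(mst_edge_weights):
    """CS = E_min / E_max over the edge lengths of the centroids' MST.
    CS near 1 means the tree is 'even' (valid partition);
    CS near 0 means two centroids are too close (invalid partition)."""
    return min(mst_edge_weights) / max(mst_edge_weights)
```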
Example of DGSOT
Conclusions
The three algorithms presented in this report have provided results comparable to those obtained by classic clustering algorithms, without their drawbacks, and superior to those obtained by standard hierarchical clustering.
Questions