The use of Minimum Spanning Trees
in microarray expression data

Gkirtzou Ekaterini

University of Crete
CS483

Introduction

- Classic clustering algorithms, like K-means, self-organizing maps, etc., have certain drawbacks:
  - No guarantee of globally optimal results
  - Dependence on the geometric shape of cluster boundaries (K-means)

Introduction

- MST clustering algorithms:
  - Expression data clustering analysis (Xu et al., 2001)
  - Iterative clustering algorithm (Varma et al., 2004)
  - Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)

Definitions

- A minimum spanning tree (MST) of a weighted, undirected graph $G = (V, E)$ with edge weights $w(e)$ is an acyclic subset $T \subseteq E$ that contains all of the vertices and whose total weight

  $w(T) = \sum_{e \in T} w(e)$

  is minimum (see the sketch below).
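A minimal sketch, assuming SciPy is available: build an MST over a small complete weighted graph and check its total weight $w(T)$.

```python
# Build an MST over a complete weighted graph on 4 vertices with SciPy.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Symmetric weight matrix (zero entries = no self-edges).
W = np.array([[0, 2, 6, 3],
              [2, 0, 5, 4],
              [6, 5, 0, 1],
              [3, 4, 1, 0]], dtype=float)

mst = minimum_spanning_tree(W)   # sparse matrix holding the MST edges
print(mst.toarray())             # kept edges and their weights
print("w(T) =", mst.sum())       # total weight, minimal by construction
```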

Definitions

- DNA microarray technology enables the massively parallel measurement of the expression of thousands of genes simultaneously. Its uses include:
  - comparing the activity of genes in diseased and healthy cells
  - categorizing a disease into subgroups
  - supporting new drug discovery and toxicology studies

Definitions

- Clustering is a common technique for data analysis. It partitions the data set into subsets (clusters) so that the data in each subset share some common trait.


Expression data clustering analysis

- Let $D = \{d_i\}$ be a set of expression data, with each $d_i = (e_i^1, \ldots, e_i^t)$ representing the expression levels of gene $i$ at time 1 through time $t$. We define a weighted, undirected graph $G = (V, E)$ as follows: the vertex set is $V = \{d_i \mid d_i \in D\}$ and the edge set is $E = \{(d_i, d_j) \mid d_i, d_j \in D \text{ and } i \neq j\}$.

Expression data clustering analysis

- G is a complete graph.
- The weight of each edge is the distance between its two vertices, e.g., Euclidean distance, correlation coefficient, etc.
- Each cluster corresponds to one subtree of the MST.
- No essential information is lost for clustering (see the sketch below).
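A minimal sketch, assuming each row of a toy matrix is a gene's expression profile $d_i$: weight the complete graph by pairwise Euclidean distance and take its MST.

```python
# Complete expression graph -> MST; the MST's subtrees are what get clustered.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

D = np.random.rand(6, 10)                     # 6 genes, 10 time points (toy data)
W = squareform(pdist(D, metric="euclidean"))  # complete-graph weight matrix
mst = minimum_spanning_tree(W)                # n-1 edges; subtrees = clusters
```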

Clustering through removing long MST edges

- Based on the intuitive notion of a cluster (see the sketch below)
- Works very well when inter-cluster edges are longer than intra-cluster ones
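A minimal sketch of this idea: cutting the $K-1$ longest MST edges leaves $K$ connected components, which serve as the clusters.

```python
# Cluster by deleting the k-1 heaviest edges of the MST of weight matrix W.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_cluster(W, k):
    mst = minimum_spanning_tree(W).tocoo()
    order = np.argsort(mst.data)             # edges by weight, ascending
    keep = order[: len(mst.data) - (k - 1)]  # drop the k-1 heaviest edges
    pruned = np.zeros(W.shape)
    pruned[mst.row[keep], mst.col[keep]] = mst.data[keep]
    _, labels = connected_components(pruned, directed=False)
    return labels                            # cluster label per vertex
```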

An iterative clustering

- Minimize the distance between the center of a cluster and its data:

  $\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, \mathrm{center}(T_i))$   (1)

- Starts with K arbitrary clusters of the MST
- For each pair of adjacent clusters, finds the edge to cut that optimizes (1); a sketch of evaluating (1) follows.
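A minimal sketch of evaluating objective (1), assuming center($T_i$) is taken as the mean of the cluster's data:

```python
# Objective (1): total distance of each point to the center of its cluster.
import numpy as np

def objective_1(X, labels):
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        center = members.mean(axis=0)   # center(T_i) taken as the mean
        total += np.linalg.norm(members - center, axis=1).sum()
    return total
```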

A globally optimal clustering

- Tries to partition the tree into K subtrees
- Selects K representatives $d_i$ to optimize

  $\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, d_i)$   (2)


Iterative clustering algorithm

- The clustering measure used here is the Fukuyama-Sugeno measure (sketched below):

  $FS(S) = \sum_{k=1}^{2} \sum_{j=1}^{N_k} \left( \lVert x_j^k - \mu_k \rVert^2 - \lVert \mu_k - \mu \rVert^2 \right)$

  where $S_1$, $S_2$ are the two partitions of the set $S$, each $S_k$ contains $N_k$ samples, $\mu_k$ denotes the mean of the samples in $S_k$, $\mu$ the global mean of all samples, and $x_j^k$ the $j$-th sample in the cluster $S_k$.
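A minimal sketch of the Fukuyama-Sugeno measure for a binary partition, taken directly from the formula above:

```python
# FS(S): within-cluster scatter minus each cluster's separation from the
# global mean; lower values indicate a better binary split.
import numpy as np

def fukuyama_sugeno(X, labels):
    mu = X.mean(axis=0)                           # global mean of all samples
    fs = 0.0
    for k in (0, 1):                              # the two partitions S_1, S_2
        Xk = X[labels == k]
        mu_k = Xk.mean(axis=0)                    # mean of samples in S_k
        fs += ((Xk - mu_k) ** 2).sum()            # sum_j ||x_j^k - mu_k||^2
        fs -= len(Xk) * ((mu_k - mu) ** 2).sum()  # N_k * ||mu_k - mu||^2
    return fs
```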

Iterative clustering algorithm

- Feature selection scores each gene's support for a partition
- The feature selection used here is the t-statistic with pooled variance; the t-statistic is a heuristic measure (sketched below)
- Genes whose absolute t-statistic exceeds a threshold are selected
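A minimal sketch of the pooled-variance t-statistic for one gene, given its expression values in the two clusters:

```python
# |t| above a chosen threshold means the gene supports the partition.
import numpy as np

def pooled_t(a, b):
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))
```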

Iterative clustering algorithm

- Create an MST from all genes
- Delete edges from the MST to obtain binary partitions; select the one with the minimum F-S clustering measure
- Feature selection is then used to select the subset of genes that discriminates between the clusters

Iterative clustering algorithm

- In the next iteration the clustering is done on this selected set of genes
- Iterate until the selected gene subset converges
- Remove the selected genes from the pool and continue (see the sketch below)
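A minimal sketch of the overall loop; `best_binary_mst_split` is a hypothetical helper standing in for the F-S-optimal MST bipartitioning of the previous slides, and `pooled_t` is the statistic sketched above:

```python
import numpy as np

def iterative_clustering(X, threshold):
    """X: genes x samples matrix; yields (converged gene subset, sample labels)."""
    pool = np.arange(X.shape[0])                  # gene indices still available
    while len(pool) > 1:
        selected = pool
        while True:
            # hypothetical helper: binary F-S-optimal MST split of the samples
            labels = best_binary_mst_split(X[selected].T)
            t = np.array([abs(pooled_t(X[g, labels == 0], X[g, labels == 1]))
                          for g in selected])
            new = selected[t > threshold]         # genes supporting the split
            if len(new) == 0 or np.array_equal(new, selected):
                break                             # gene subset has converged
            selected = new
        yield selected, labels
        pool = np.setdiff1d(pool, selected)       # remove them from the pool
```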


Dynamically growing self-organizing tree (DGSOT)

- In the previous algorithms the MST is constructed on the original data and used to test the intra-cluster quality; here the MST is used as a criterion to test the inter-cluster property.

DGSOT algorithm

- A tree-structured self-organizing neural network
- Grows vertically and horizontally
- Starts with a single root-leaf node
- In every vertical growing phase, two descendants are created for every leaf node whose heterogeneity exceeds a threshold, and the learning process takes place

DGSOT algorithm

- Heterogeneity measures:
  - Variability (maximum distance between the input data and the node)
  - Average distortion $\bar{d}$ of a leaf (sketched below):

    $\bar{d} = \frac{\sum_{j=1}^{D} d(x_j, n_i)}{D}$

    where $D$ is the total number of input data of leaf $i$, $d(x_j, n_i)$ is the distance between datum $j$ and leaf $i$, and $n_i$ is the reference vector of leaf $i$.
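A minimal sketch of the average distortion, with the distance $d(x_j, n_i)$ assumed Euclidean:

```python
# Mean distance between a leaf's reference vector n_i and its assigned data.
import numpy as np

def average_distortion(X_leaf, n_i):
    """X_leaf: (D, t) data assigned to the leaf; n_i: its reference vector."""
    return np.linalg.norm(X_leaf - n_i, axis=1).mean()
```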

DGSOT algorithm

- In every horizontal growing phase, a child is added to every lowest non-leaf node until the validation criterion is satisfied, and the learning process takes place
- The learning process distributes the data to the leaves in the best way: the best matching node is the one with the minimum distance to the input datum

The validation criterion of DGSOT

- Calculated without human intervention
- Based on geometric characteristics of the clusters
- Create the Voronoi diagram for the input data. The Voronoi diagram divides the data set $D$ into $n$ regions $V(p)$:

  $V(p) = \{ x \in D \mid \mathrm{dist}(x, p) \leq \mathrm{dist}(x, q) \ \forall q \}$

The validation criterion of DGSOT

- Define a weighted, undirected graph $G = (V, E)$: the vertices are the centroids $p_i$ of the Voronoi cells, and the edge set is $E = \{(p_i, p_j) \mid p_i, p_j \text{ centroids and } i \neq j\}$
- Create the MST of the graph $G = (V, E)$

Voronoi diagram of a 2D dataset

- In A, the dataset is partitioned into three Voronoi cells; the MST of the centroids is 'even'.
- In B, the dataset is partitioned into four Voronoi cells; the MST of the centroids is not 'even'.

The validation criterion of DGSOT

- Cluster separation (sketched below):

  $CS = \frac{E_{\min}}{E_{\max}}$

  where $E_{\min}$ is the minimum-length edge and $E_{\max}$ is the maximum-length edge of the MST.

- A low CS value means that two centroids are too close to each other and the Voronoi partition is not valid, while a high CS value means that the Voronoi partition is valid.
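A minimal sketch of computing CS from the MST over the cell centroids:

```python
# CS = E_min / E_max over the MST edges of the centroids.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def cluster_separation(centroids):
    W = squareform(pdist(centroids))              # complete graph on centroids
    edges = minimum_spanning_tree(W).tocoo().data
    return edges.min() / edges.max()              # low CS -> partition not valid
```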

Example of DGSOT


Conclusions

- The tree-based algorithms presented here have provided results comparable to those obtained by classic clustering algorithms, without their drawbacks, and superior to those obtained by standard hierarchical clustering.

Questions