# The use of Minimum Spanning Trees in microarray expression data

University of Crete

CS483


Gkirtzou Ekaterini

## Introduction

Classic clustering algorithms, such as K-means, self-organizing maps, etc., have certain drawbacks:

- No guarantee of globally optimal results
- Dependence on the geometric shape of cluster boundaries (K-means)


## Introduction

MST clustering algorithms:

- Expression data clustering analysis (Xu et al., 2001)
- Iterative clustering algorithm (Varma et al., 2004)
- Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)


## Definitions

A minimum spanning tree (MST) $T$ of a weighted, undirected graph $G = (V, E)$ with edge weights $w(e)$ is an acyclic subset $T \subseteq E$ that contains all of the vertices and whose total weight

$$w(T) = \sum_{e \in T} w(e)$$

is minimum.

## Definitions

DNA microarray technology enables the massively parallel measurement of the expression of thousands of genes simultaneously. Its uses include:

- comparing the activity of genes in diseased and healthy cells
- categorizing a disease into subgroups
- supporting drug discovery and toxicology studies


## Definitions

Clustering is a common technique for data analysis: it partitions the data set into subsets (clusters) so that the data in each subset share some common trait.



## Expression data clustering analysis

Let $D = \{d_i\}$ be a set of expression data, with each $d_i = (e_i^1, \ldots, e_i^t)$ representing the expression levels of gene $i$ at time 1 through time $t$. We define a weighted, undirected graph $G = (V, E)$ as follows: the vertex set is $V = \{d_i \mid d_i \in D\}$ and the edge set is $E = \{(d_i, d_j) \mid d_i, d_j \in D \text{ and } i \neq j\}$.

## Expression data clustering analysis

- $G$ is a complete graph.
- The weight of an edge is the distance between its two vertices, e.g. the Euclidean distance, the correlation coefficient, etc.
- Each cluster corresponds to one subtree of the MST.
- No essential information is lost for clustering.


## Clustering through removing long MST edges

- Based on the intuitive notion of a cluster
- Works very well when inter-cluster edges are longer than intra-cluster ones
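As a concrete illustration, here is a minimal Python sketch of this idea (not the paper's implementation; the function name `mst_clusters` is hypothetical): build the MST of the complete Euclidean-distance graph with Kruskal's algorithm, then delete the K−1 longest MST edges and read off the remaining connected components as clusters.

```python
import math
from itertools import combinations

def mst_clusters(points, k):
    """Cluster `points` into k groups: build the MST of the complete
    Euclidean-distance graph (Kruskal's algorithm), drop its k-1
    longest edges, and label the resulting connected components."""
    n = len(points)
    # All edges of the complete graph, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    # Union-find with path halving, for Kruskal's algorithm.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))
    # Keep only the n-k shortest MST edges (i.e. remove the k-1 longest).
    kept = sorted(mst)[: n - k]
    parent = list(range(n))
    for w, i, j in kept:
        parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    # Relabel components as 0..k-1 in order of first appearance.
    ids = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]
```

For instance, on two tight pairs of points far apart, the single long inter-cluster edge is the one removed, so each pair becomes its own cluster.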


## An iterative clustering

- Minimizes the distance between the center of a cluster and its data
- Starts with K arbitrary clusters of the MST
- For each pair of adjacent clusters, finds the edge to cut which optimizes

$$\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, \mathrm{center}(T_i)) \qquad (1)$$
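Objective (1) can be evaluated directly. Below is a small Python sketch (the helper name `within_cluster_cost` is mine, not the paper's): it sums, over all clusters, the distance from each point to its cluster's mean.

```python
import math

def within_cluster_cost(clusters):
    """Objective (1): sum over clusters T_i of dist(d, center(T_i))
    for every point d in T_i, taking the center to be the mean."""
    total = 0.0
    for pts in clusters:
        dim = len(pts[0])
        # Component-wise mean of the cluster's points.
        center = tuple(sum(p[d] for p in pts) / len(pts) for d in range(dim))
        total += sum(math.dist(p, center) for p in pts)
    return total
```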

## A globally optimal clustering

- Tries to partition the tree into K subtrees
- Selects K representatives $d_i$ to optimize

$$\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, d_i) \qquad (2)$$


## Iterative clustering algorithm

The clustering measure used here is the Fukuyama-Sugeno measure

$$FS(S) = \sum_{k=1}^{2} \sum_{j=1}^{N_k} \left( \left\| x_j^k - \mu_k \right\|^2 - \left\| \mu_k - \mu \right\|^2 \right)$$

where $S_1$, $S_2$ are the two partitions of the set $S$, each $S_k$ containing $N_k$ samples; $\mu_k$ denotes the mean of the samples in $S_k$, $\mu$ the global mean of all samples, and $x_j^k$ the $j$-th sample in cluster $S_k$.
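The measure can be computed directly. Here is a NumPy sketch of the crisp two-cluster form given above (the function name is mine, not the paper's); negative values arise when the separation term dominates the within-cluster scatter.

```python
import numpy as np

def fukuyama_sugeno(s1, s2):
    """Fukuyama-Sugeno measure for a crisp binary partition:
    FS = sum over clusters k and samples x in S_k of
         ||x - mu_k||^2 - ||mu_k - mu||^2,
    where mu_k is the cluster mean and mu the global mean.
    Lower values indicate tighter, better-separated clusters."""
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    mu = np.vstack([s1, s2]).mean(axis=0)           # global mean
    fs = 0.0
    for sk in (s1, s2):
        mu_k = sk.mean(axis=0)                      # cluster mean
        fs += np.sum((sk - mu_k) ** 2)              # within-cluster scatter
        fs -= len(sk) * np.sum((mu_k - mu) ** 2)    # separation term
    return fs
```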

## Iterative clustering algorithm

- Feature selection counts each gene's support for a partition
- The feature selection used here is the t-statistic with pooled variance; the t-statistic is a heuristic measure
- Genes with an absolute t-statistic greater than a threshold are selected
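The selection rule above can be sketched in Python as follows (function names are mine; the thresholding is as described, while the exact threshold value is left as a parameter):

```python
import math

def pooled_t_statistic(a, b):
    """Two-sample t-statistic with pooled variance for one gene:
    t = (mean(a) - mean(b)) / (s_p * sqrt(1/n_a + 1/n_b)),
    where s_p^2 is the pooled sample variance."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / (sp * math.sqrt(1 / na + 1 / nb))

def select_genes(expr_a, expr_b, threshold):
    """Keep indices of genes whose |t| exceeds the threshold.
    expr_a[g] and expr_b[g] hold gene g's samples in each cluster."""
    return [g for g in range(len(expr_a))
            if abs(pooled_t_statistic(expr_a[g], expr_b[g])) > threshold]
```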


## Iterative clustering algorithm

- Create an MST from all genes
- Delete edges from the MST to obtain binary partitions; select the one with the minimum F-S clustering measure
- Feature selection is then used to select a subset of genes that discriminate between the clusters


Iterative clustering algorithm

In the next iteration the clustering is
done in this selected set of genes

Until the selected gene subset converges

Remove them form the pool and
continue.



## Dynamically growing self-organizing tree (DGSOT)

In the previous algorithms the MST is constructed on the original set of data and used to test the intra-cluster property, while here the MST is used as a criterion to test the inter-cluster property.


## DGSOT algorithm

- A tree-structured self-organizing neural network
- Grows vertically and horizontally
- Starts with a single root-leaf node
- In every vertical-growing phase, two descendants are created for every leaf node whose heterogeneity $H_{et}$ exceeds a threshold $T_R$, and the learning process takes place


## DGSOT algorithm

Heterogeneity:

- Variability (maximum distance between the input data and the node)
- Average distortion $d$ of a leaf:

$$d = \frac{1}{D} \sum_{j=1}^{D} d(x_j, n_i)$$

where $D$ is the total number of input data of leaf $i$, $d(x_j, n_i)$ is the distance between datum $j$ and leaf $i$, and $n_i$ is the reference vector of leaf $i$.
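The average distortion is a one-liner in practice; here is a minimal Python sketch assuming Euclidean distance (the function name is mine):

```python
import math

def average_distortion(data, ref):
    """Average distortion of a leaf: the mean Euclidean distance
    between the leaf's reference vector `ref` and each input datum
    assigned to the leaf."""
    return sum(math.dist(x, ref) for x in data) / len(data)
```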

## DGSOT algorithm

- In every horizontal-growing phase, a child is added to every lowest non-leaf node until the validation criterion is satisfied, and the learning process takes place
- The learning process distributes the data to the leaves in the best way; the best matching node is the one with the minimum distance to the input datum
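The best-matching-node rule of the learning step amounts to a nearest-reference-vector lookup; a minimal Python sketch (assuming Euclidean distance, function name mine):

```python
import math

def best_matching_node(x, leaves):
    """Return the index of the leaf whose reference vector is
    closest to the input datum x (the 'best matching node')."""
    return min(range(len(leaves)), key=lambda i: math.dist(x, leaves[i]))
```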


## The validation criterion of DGSOT

- Calculated without human intervention
- Based on geometric characteristics of the clusters
- Create the Voronoi diagram for the input data. The Voronoi diagram divides the data set $D$ into $n$ regions $V(p)$:

$$V(p) = \{\, x \in D \mid \mathrm{dist}(x, p) \le \mathrm{dist}(x, q) \ \forall q \neq p \,\}$$

## The validation criterion of DGSOT

Define a weighted, undirected graph $G = (V, E)$ whose vertices are the centroids of the Voronoi cells $V(p)$, and whose edge set is

$$E = \{\, (p_i, p_j) \mid p_i, p_j \in V \text{ and } i \neq j \,\}$$

Then create the MST of this graph.

## Voronoi diagram of a 2D dataset

- In A, the dataset is partitioned into three Voronoi cells; the MST of the centroids is 'even'.
- In B, the dataset is partitioned into four Voronoi cells; the MST of the centroids is not 'even'.


## The validation criterion of DGSOT

Cluster separation:

$$CS = \frac{E_{\min}}{E_{\max}}$$

where $E_{\min}$ is the minimum-length edge and $E_{\max}$ the maximum-length edge of the MST. A low CS value means that two centroids are too close to each other and the Voronoi partition is not valid, while a high CS value means that the Voronoi partition is valid.
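Given the MST's edge lengths, the criterion is a single ratio; a minimal sketch (function name mine):

```python
def cluster_separation(mst_edge_lengths):
    """Cluster separation CS = E_min / E_max over the MST's edge
    lengths: values near 1 suggest an 'even' MST (valid partition),
    values near 0 an uneven one."""
    return min(mst_edge_lengths) / max(mst_edge_lengths)
```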

## Example of DGSOT


## Conclusions

The tree algorithms presented in this report have provided results comparable to those obtained by classic clustering algorithms, without their drawbacks, and superior to those obtained by standard hierarchical clustering.


## Questions