Clustering analysis of microarray gene expression data
Ping Zhang
November 19th, 2008
Outline
Gene expression
Similarity between gene expression profiles
Concept of clustering
K-Means clustering
Hierarchical clustering
Minimum spanning tree-based clustering
What is a DNA Microarray?
DNA microarray technology measures the expression of tens of thousands of genes at a time
Scanning/Signal Detection
[Figure: scanned microarray image with Cy3 and Cy5 channels; spots show equal expression, higher expression in Cy3, or higher expression in Cy5]
Data-flow schema of microarray data analysis
Gene expression profiles
[Figure: gene expression profiles. x-axis: time/condition; y-axis: expression (levels relative to a reference point at 0)]
Similarity between Profiles
Similarity measure:
Euclidean distance
Correlation coefficient
Trend
…
Correlation coefficient often works better.
[Figure: expression profile. x-axis: time; y-axis: expression]
Pearson Correlation Coefficient
Compares scaled profiles!
Can detect inverse relationships
Most commonly used
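$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$
n = number of conditions
$\bar{x}$ = average expression of gene x in all n conditions
$\bar{y}$ = average expression of gene y in all n conditions
$s_x$ = standard deviation of x
$s_y$ = standard deviation of y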
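The formula above can be checked numerically with a small sketch. The profiles `gene_x`, `gene_y`, and `gene_z` below are made-up values for illustration, not data from the slides, and the sample standard deviation (n - 1 denominator) is assumed so that it matches the 1/(n - 1) factor in the formula.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (sample standard deviations)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    s_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    # Average product of the standardized deviations, as in the formula above.
    return sum(((a - mean_x) / s_x) * ((b - mean_y) / s_y)
               for a, b in zip(x, y)) / (n - 1)

# Hypothetical profiles over four conditions.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_x, ten times the scale
gene_z = [4.0, 3.0, 2.0, 1.0]      # inverse of gene_x

r_scaled = pearson(gene_x, gene_y)   # close to +1: scale is ignored
r_inverse = pearson(gene_x, gene_z)  # close to -1: inverse relationship detected
```

Because each profile is centered and divided by its own standard deviation, the scaled copy correlates at about +1 and the mirrored profile at about -1, which is what "compares scaled profiles" and "can detect inverse relationships" refer to.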
Correlation Pitfalls
[Figure: two line charts of Gene A and Gene B across chips 1 to 7. Panel 1: Raw Data. Panel 2: Normalized Data. Correlation = 0.97]
Euclidean Distance
Scaled versus unscaled
Cannot detect inverse relationships
$$d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$$
For Gene $X = (x_1, x_2, \dots, x_n)$ and Gene $Y = (y_1, y_2, \dots, y_n)$
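A minimal sketch of the Euclidean distance, using made-up profiles (`gene_x`, `gene_y`, `gene_z` are illustrative values only), shows both properties listed on this slide: the result depends on scale, and an inverse profile is not flagged as related.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical profiles over three conditions.
gene_x = [1.0, 2.0, 3.0]
gene_y = [10.0, 20.0, 30.0]  # same trend as gene_x, ten times the scale
gene_z = [3.0, 2.0, 1.0]     # inverse trend, similar magnitude

d_xy = euclidean(gene_x, gene_y)  # large, although the trend is identical
d_xz = euclidean(gene_x, gene_z)  # small, although the trend is reversed
```

On unscaled data the inverse profile comes out closer to `gene_x` than its scaled copy does, so profiles are often normalized before Euclidean distance is used.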
Data-Mining through Clustering
[Figure: gene clusters labeled Degradation, Synthesis, Chromatin, and Glycolysis]
Assumptions for clustering analysis:
Expression level of a gene reflects the gene’s activity.
Genes involved in the same biological process exhibit a statistical relationship in their expression profiles.
Idea of Clustering
Clustering: group objects into clusters so that
o objects in each cluster have "similar" features;
o objects of different clusters have "dissimilar" features
Methods of Clustering
• discriminant analysis (Fisher, 1931)
• K-means (Lloyd, 1948)
• support vector machines (Vapnik, 1985)
• self-organizing maps (Kohonen, 1980)
• hierarchical clustering
Issues in Cluster Analysis
A lot of clustering algorithms
A lot of distance/similarity metrics
Which clustering algorithm runs faster and uses less memory?
How many clusters after all?
Are the clusters stable?
Are the clusters meaningful?
Which Clustering Method Should I Use?
What is the biological question?
Do I have a preconceived notion of how many clusters there should be?
How strict do I want to be? Split or join?
Can a gene be in multiple clusters?
Hard or soft boundaries between clusters?
K-means clustering for expression profiles
Step 1: Transform the n (genes) * m (experiments) matrix into an n (genes) * n (genes) distance matrix
Step 2: Cluster genes based on a k-means clustering algorithm

Expression matrix (n * m):
          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

Distance matrix (n * n):
          Gene A  Gene B  Gene C
Gene A    0
Gene B    ?       0
Gene C    ?       ?       0

To transform the n*m matrix into an n*n matrix, use a similarity (distance) metric.
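Step 1 can be sketched as follows. The expression values in `profiles` are placeholders (the slide leaves the matrix cells empty), Euclidean distance is assumed as the metric, and `distance_matrix` is a hypothetical helper name.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(profiles, metric):
    """Turn {gene: [expression per experiment]} into a nested {gene: {gene: distance}} table."""
    names = list(profiles)
    return {a: {b: metric(profiles[a], profiles[b]) for b in names} for a in names}

# Placeholder n * m expression matrix: 3 genes, 4 experiments.
profiles = {
    "Gene A": [1.0, 2.0, 3.0, 4.0],
    "Gene B": [1.0, 2.0, 3.0, 5.0],
    "Gene C": [4.0, 3.0, 2.0, 1.0],
}

D = distance_matrix(profiles, euclidean)  # n * n, symmetric, zero diagonal
```

Any other metric with the same two-argument signature, such as one minus the Pearson correlation, could be passed in place of `euclidean`.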
K-means algorithm
The most popular algorithm for clustering
What is so attractive?
• Simple
• Mathematically correct
• Fast
• Invariant to dimension
• Easy to implement
K-Means Clustering
Basic idea: use the cluster centre (mean) to represent each cluster
Assign each data element to the closest cluster (centre).
Goal: minimize the square error (intra-class dissimilarity):
$$\text{square error} = \sum_i d^2(x_i, C(x_i))$$
where $C(x_i)$ is the centre of the cluster to which $x_i$ is assigned.
There is no hierarchy.
Must supply the number of clusters (k) into which the data are to be grouped.
K-means Clustering: Procedure (1)
Initialization 1
Specify the number of clusters k, for example, k = 4
[Figure: expression matrix (genes by conditions)]
Each point is called a "gene"
K-means Clustering: Procedure (2)
Initialization 2
Genes are randomly assigned to one of k clusters, or choose random starting centers
K-means Clustering: Procedure (3)
Calculate the mean of each cluster
$$m_C = \frac{1}{N_C} \sum_{i=1}^{N_C} g_i, \quad g_i \in C$$
[Figure: labeled points (1,2), (3,2), (3,4), (6,7)]
$$m_{BLUE} = \frac{1}{4}\,[(6,7) + (3,4) + \dots]$$
K-means Clustering: Procedure (4)
Each gene is reassigned to the nearest cluster
$$c_i = \arg\min_j \lVert g_i - m_j \rVert^2$$
Gene $g_i$ goes to cluster $c_i$, where $m_j$ is the mean of cluster $j$
K-means Clustering: Procedure (5)
Iterate until the means converge
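Procedure steps (1) to (5) can be put together in a short sketch. This is one plain reading of the slides rather than a reference implementation: it assumes the "random starting centers" variant of initialization, squared Euclidean distance, and a fixed random seed; the sample `points` are invented for illustration.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means on a list of equal-length tuples; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialization: k random starting centers
    clusters = []
    for _ in range(max_iter):
        # Reassign each point to the nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep the old center if empty).
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # the means have converged
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D "genes": two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = kmeans(points, k=2)
```

On data this well separated the result does not depend on the starting centers; on harder data k-means can settle in different local minima, so it is commonly run from several random starts.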
Hierarchical clustering (1)
Step 1: Transform the genes * experiments matrix into a genes * genes distance matrix
Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains

Expression matrix:
          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

Distance matrix:
          Gene A  Gene B  Gene C
Gene A    0
Gene B    ?       0
Gene C    ?       ?       0
Initial distance matrix:
          G 1   G 2   G 3   G 4   G 5
G 1       0
G 2       2     0
G 3       6     5     0
G 4       10    9     4     0
G 5       9     8     5     3     0

After merging G 1 and G 2:
          G (12)  G 3   G 4   G 5
G (12)    0
G 3       6       0
G 4       10      4     0
G 5       9       5     3     0

After merging G 4 and G 5:
          G (12)  G 3   G (45)
G (12)    0
G 3       6       0
G (45)    10      5     0

Stage   Groups
P5      [1], [2], [3], [4], [5]
P4      [1 2], [3], [4], [5]
P3      [1 2], [3], [4 5]
P2      [1 2], [3 4 5]
P1      [1 2 3 4 5]

[Figure: dendrogram with leaves 1, 2, 3, 4, 5]
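The worked example can be replayed with a short agglomerative sketch. The slides do not name the linkage rule; the merged distances in the tables (for example, G(12) to G 3 is 6, the larger of 6 and 5) are consistent with complete linkage, so `max` is assumed here. The function name `agglomerate` is hypothetical.

```python
def agglomerate(dist, linkage=max):
    """Merge the closest pair of clusters until one remains; return the grouping at each stage.

    dist maps frozenset({a, b}) to the distance between items a and b.
    """
    clusters = [(i,) for i in sorted({x for pair in dist for x in pair})]
    stages = [list(clusters)]

    def between(c1, c2):
        # Cluster-to-cluster distance under the chosen linkage rule.
        return linkage(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: between(clusters[ij[0]], clusters[ij[1]]))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = sorted([c for n, c in enumerate(clusters) if n not in (i, j)] + [merged])
        stages.append(list(clusters))
    return stages

# Lower triangle of the G 1 to G 5 distance matrix from the slide.
rows = {2: [2], 3: [6, 5], 4: [10, 9, 4], 5: [9, 8, 5, 3]}
dist = {frozenset((r, c + 1)): v for r, vals in rows.items() for c, v in enumerate(vals)}

stages = agglomerate(dist)  # stages[0] is P5, stages[-1] is P1
```

The successive entries of `stages` correspond to rows P5 through P1 of the stage table.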
Hierarchical clustering (2)
Hierarchical Clustering Results
Graph Representation
Represent a set of n-dimensional points as a graph
o each data point (gene) represented as a node
o each pair of genes represented as an edge with a weight defined by the "dissimilarity" between the two genes
[Figure: n-D data points, their distance matrix, and the graph representation]
distance matrix:
0     1     1.5   2     5     6     7     9
1     0     2     1     6.5   6     8     8
1.5   2     0     1     4     4     6     5.5
...
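A minimal sketch of this representation, assuming Euclidean distance as the "dissimilarity" and using three invented 2-D points: each point becomes a node index, and every pair of points becomes a weighted edge.

```python
import math
from itertools import combinations

def complete_graph(points):
    """Return one (weight, u, v) edge per pair of points; nodes are list indices."""
    return [(math.dist(points[u], points[v]), u, v)
            for u, v in combinations(range(len(points)), 2)]

# Hypothetical data points.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
edges = complete_graph(points)  # n * (n - 1) / 2 edges for n points
```

Storing the weight first lets the edge list be sorted by distance directly, which is convenient for building a minimum spanning tree.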
Minimum Spanning Tree
Spanning tree: a sub-graph that has all nodes connected and has no cycles
Minimum spanning tree (MST): a spanning tree with the minimum total distance
[Figure: panels (a), (b), (c)]
How to Construct Minimum Spanning Tree
Prim's algorithm and Kruskal's algorithm
Kruskal's algorithm
step 1: select the edge with the smallest distance from the graph
step 2: add it to the tree as long as no cycle is formed
step 3: remove the edge from the graph
step 4: repeat steps 1-3 until all nodes are connected in the tree.
[Figure: (a) weighted graph; (b) tree after adding edge 3; (c) edges 3, 4; (d) edges 3, 4, 5; (e) edges 3, 4, 5, 7]
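Kruskal's steps can be sketched with a union-find structure for the cycle check. The edge weights 3 to 8 below echo the figure, but the graph topology is an assumption made for illustration, since the figure itself is not recoverable from the text.

```python
def kruskal(n_nodes, edges):
    """Build an MST from (weight, u, v) edges over nodes 0..n_nodes-1."""
    parent = list(range(n_nodes))

    def find(u):
        # Root of u's component, with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for w, u, v in sorted(edges):      # step 1: smallest remaining edge first
        ru, rv = find(u), find(v)
        if ru != rv:                   # step 2: add only if no cycle is formed
            parent[ru] = rv
            tree.append((w, u, v))
            if len(tree) == n_nodes - 1:
                break                  # step 4: all nodes are connected
    return tree

# Hypothetical 5-node graph using edge weights like those in the figure.
edges = [(3, 0, 1), (4, 1, 2), (5, 2, 3), (6, 0, 2), (7, 3, 4), (8, 0, 4)]
mst = kruskal(5, edges)
```

In this example the edge of weight 6 is skipped because its endpoints are already connected through the edges of weight 3 and 4.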
Foundation of MST Approach
The MST representation significantly simplifies the data clustering problem, while losing very little essential information for clustering.
We have mathematically proved: a multi-dimensional clustering problem is equivalent to a tree-partitioning problem!
Clustering by Cutting Long Edge
1. Hierarchical cutting
1st cut: longest edge
2nd cut: second longest edge
…
Works well for "easy" cases.
Produces many clusters with a single element for some "difficult" cases.
2. Tree-Based Clustering
For each edge, calculate the assessment value
Find the edge that gives the minimum assessment value as the place to cut
[Figure label: g*]
Clustering using iterative method
Guaranteed to find the global optimum using tree-based dynamic programming
Clustering through Removing Long MST-Edges
Objective: partition an MST into K subtrees so that the total edge-distance of all the K subtrees is minimized
Finding the K-1 longest MST-edges and cutting them => we get K clusters
This works as long as the inter-cluster edge-distances are clearly larger than the intra-cluster edge-distances
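A minimal sketch of this cutting rule: drop the K-1 longest MST edges and read the clusters off as the connected components that remain. The MST edge list below is invented for illustration, and `cut_longest_edges` is a hypothetical helper name.

```python
def cut_longest_edges(n_nodes, mst_edges, k):
    """Remove the k-1 longest MST edges and return the resulting clusters of node indices."""
    kept = sorted(mst_edges)[: len(mst_edges) - (k - 1)]
    parent = list(range(n_nodes))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Nodes joined by a kept edge end up in the same component.
    for _, u, v in kept:
        parent[find(u)] = find(v)

    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())

# Hypothetical MST over 6 nodes as (weight, u, v) edges.
mst_edges = [(1.0, 0, 1), (1.5, 1, 2), (9.0, 2, 3), (1.2, 3, 4), (8.0, 4, 5)]
clusters = cut_longest_edges(6, mst_edges, k=3)
```

Node 5 ends up alone here because it hangs off the tree by a single long edge, which is the kind of single-element cluster that plain edge cutting can produce in difficult cases.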
An Iterative Clustering Algorithm
Find K subtrees $T_i$ of an MST so as to minimize:
$$\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, \mathrm{center}(T_i))$$
Informally, the total distance between the center of each cluster and its data points is minimized
The center c of a cluster C is defined as the point for which the sum of the distances between c and all the data points in C is minimized
Does not work well if the cluster boundary is not convex
A Globally Optimal Clustering Algorithm
Given an MST T, partition T into K subtrees $T_i$ and find a set of data points $d_i$, i = 1…K, $d_i$ in D, so as to minimize:
$$\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, d_i)$$
Informally, group data points around the "best" representatives rather than around the "center"
This algorithm uses dynamic programming
Automated Selection of Number of Clusters
Select the "transition point" in the assessment value as the "correct" number of clusters.
Transition Profiles
indicator[n] = (A[n-1] - A[n]) / (A[n] - A[n+1])
A[k] is the assessment value for the partition with k clusters
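The indicator can be computed directly from a table of assessment values. The values in `A` below are invented for illustration, and taking the n with the largest indicator as the "transition point" is an assumption about how the profile is read, not something the slide states.

```python
def transition_indicators(A):
    """Map n to (A[n-1] - A[n]) / (A[n] - A[n+1]) wherever both neighbours exist."""
    indicators = {}
    for n in sorted(A):
        if n - 1 in A and n + 1 in A and A[n] != A[n + 1]:
            indicators[n] = (A[n - 1] - A[n]) / (A[n] - A[n + 1])
    return indicators

# Hypothetical assessment values A[k] for k = 1..6 clusters.
A = {1: 100.0, 2: 60.0, 3: 20.0, 4: 18.0, 5: 16.0, 6: 15.0}
indicators = transition_indicators(A)

# Assumed reading: the transition point is where the indicator peaks.
best_k = max(indicators, key=indicators.get)
```

With these values the assessment drops sharply up to three clusters and only slightly afterwards, so the indicator peaks at n = 3.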
Our clustering of yeast data
Acknowledgement
Using slides from:
Michael Hongbo Xie, Temple University (in 2006)
Vipin Kumar, University of Minnesota
Dong Xu, University of Missouri