Cluster analysis for microarray data
Anja von Heydebreck
Aim of clustering: Group objects according to their similarity

Cluster: a set of objects that are similar to each other and separated from the other objects.

[Figure: example in which green/red data points were generated from two different normal distributions]
Clustering microarray data
• Genes and experiments/samples are given as the row and column vectors of a gene expression data matrix.
• Clustering may be applied either to genes or experiments (regarded as vectors in $R^p$ or $R^n$).

[Figure: gene expression data matrix with p genes (rows) and n experiments (columns)]
Why cluster genes?
• Identify groups of possibly co-regulated genes (e.g. in conjunction with sequence data).
• Identify typical temporal or spatial gene expression patterns (e.g. cell cycle data).
• Arrange a set of genes in a linear order that is at least not totally meaningless.
Why cluster experiments/samples?
• Quality control: detect experimental artifacts/bad hybridizations.
• Check whether samples are grouped according to known categories (though this might be better addressed using a supervised approach: statistical tests, classification).
• Identify new classes of biological samples (e.g. tumor subtypes).
Alizadeh et al., Nature 403:503–511, 2000
Cluster analysis
Generally, cluster analysis is based on two ingredients:
• Distance measure: quantification of (dis)similarity of objects.
• Cluster algorithm: a procedure to group objects. Aim: small within-cluster distances, large between-cluster distances.
Some distance measures
Given vectors $x = (x_1, \dots, x_n)$, $y = (y_1, \dots, y_n)$:
• Euclidean distance: $d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Manhattan distance: $d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
• Correlation distance: $d_C(x, y) = 1 - \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$
Which distance measure to use?

[Figure: the profiles x, y, z plotted against the index]

x = (1, 1, 1.5, 1.5)
y = (2.5, 2.5, 3.5, 3.5) = 2x + 0.5
z = (1.5, 1.5, 1, 1)

$d_C(x, y) = 0$, $d_C(x, z) = 2$.
$d_E(x, z) = 1$, $d_E(x, y) \approx 3.54$.

• The choice of distance measure should be based on the application area. What sort of similarities would you like to detect?
• Correlation distance $d_C$ measures trends/relative differences: $d_C(x, y) = d_C(ax + b, y)$ if $a > 0$.
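The three distance measures and the example values above can be checked with a short pure-Python sketch (the function names are mine, not from the slides):

```python
# Sketch of the three distance measures from the slides (pure Python).
import math

def d_euclidean(x, y):
    # d_E(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def d_manhattan(x, y):
    # d_M(x, y) = sum_i |x_i - y_i|
    return sum(abs(a - b) for a, b in zip(x, y))

def d_correlation(x, y):
    # d_C(x, y) = 1 - Pearson correlation of x and y
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)

x = (1, 1, 1.5, 1.5)
y = (2.5, 2.5, 3.5, 3.5)            # y = 2x + 0.5
z = (1.5, 1.5, 1, 1)
print(d_correlation(x, y))           # 0.0: shifting/scaling leaves d_C unchanged
print(d_correlation(x, z))           # 2.0: perfectly anti-correlated
print(d_euclidean(x, z))             # 1.0
print(round(d_euclidean(x, y), 2))   # 3.54
```

Note how $d_C(x, y) = 0$ although x and y differ substantially in absolute terms, while the anti-correlated z is maximally distant under $d_C$ but close under $d_E$.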
Which distance measure to use?
• Euclidean and Manhattan distance both measure absolute differences between vectors. Manhattan distance is more robust against outliers.
• One may apply standardization to the observations, subtracting the mean and dividing by the standard deviation: $\hat{x} = (x - \bar{x}) / \hat{\sigma}_x$
• After standardization, Euclidean and correlation distance are equivalent: $d_E^2(x_1, x_2) = 2n \, d_C(x_1, x_2)$.
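The equivalence can be verified numerically. In the sketch below (my code, not from the slides), standardization divides by the population standard deviation (divisor n), which is what makes $\sum_i \hat{x}_i^2 = n$ and hence the identity exact:

```python
# Numerical check of d_E(x̂, ŷ)² = 2n · d_C(x, y) on random data.
# Standardization uses the population standard deviation (divide by n).
import math, random

def standardize(v):
    n = len(v)
    m = sum(v) / n
    s = math.sqrt(sum((a - m) ** 2 for a in v) / n)
    return [(a - m) / s for a in v]

def d_corr(x, y):
    # correlation distance: 1 - Pearson correlation
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)

random.seed(0)
n = 50
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
xs, ys = standardize(x), standardize(y)
lhs = sum((a - b) ** 2 for a, b in zip(xs, ys))  # squared Euclidean distance
rhs = 2 * n * d_corr(x, y)
print(abs(lhs - rhs) < 1e-9)  # True
```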
K-means clustering
• Input: N objects given as data points in $R^p$.
• Specify the number k of clusters.
• Initialize k cluster centers. Iterate until convergence:
  – Assign each object to the cluster with the closest center (w.r.t. Euclidean distance).
  – The centroids/mean vectors of the obtained clusters are taken as new cluster centers.
• K-means can be seen as an optimization problem: minimize the sum of squared within-cluster distances,
  $W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i) = C(j) = k} d_E^2(x_i, x_j)$
• Results depend on the initialization. Use several starting points and choose the "best" solution (with minimal W(C)).
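The two iterated steps and the restart heuristic can be sketched as follows (a minimal illustration, not an optimized implementation; the cost tracked here is the within-cluster sum of squared distances to the centers, which K-means minimizes and which corresponds to W(C) up to cluster-size weighting):

```python
# Minimal K-means with random restarts, following the two iterated steps
# on the slide. Pure Python, Euclidean distance.
import random

def kmeans(points, k, n_starts=10, n_iter=100, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        centers = [list(p) for p in rng.sample(points, k)]
        for _ in range(n_iter):
            # step 1: assign each object to the closest center
            labels = [min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
                      for p in points]
            # step 2: recompute centers as the cluster means
            new_centers = []
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                if not members:                      # guard: empty cluster
                    members = [rng.choice(points)]
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            if new_centers == centers:               # converged
                break
            centers = new_centers
        # within-cluster sum of squared distances to the centers
        cost = sum(sum((a - b) ** 2 for a, b in zip(p, centers[l]))
                   for p, l in zip(points, labels))
        if best is None or cost < best[0]:           # keep the best restart
            best = (cost, labels)
    return best[1]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(pts, 2))   # the two tight groups end up in separate clusters
```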

means/PAM: How to choose
K
(the number of clusters)?
•
There is no easy answer.
•
Many heuristic approaches try to compare the
quality of clustering results for different values
of
K
(for an overview see Dudoit/Fridlyand
2002).
•
The problem can be better addressed in model

based clustering, where each cluster represents
a probability distribution, and a likelihood

based framework can be used.
Hierarchical clustering
• Similarity of objects is represented in a tree structure (dendrogram).
• Advantage: no need to specify the number of clusters in advance. Nested clusters can be represented.

[Figure: dendrogram of the Golub data, different types of leukemia. Clustering based on the 150 genes with highest variance across all samples.]
Agglomerative hierarchical clustering
• Bottom-up algorithm (top-down (divisive) methods are less common).
• Start with the single objects as clusters.
• In each iteration, merge the two clusters with the minimal distance from each other – until you are left with a single cluster comprising all objects.
• But what is the distance between two clusters?
Distances between clusters used for hierarchical clustering
Calculation of the distance between two clusters is based on the pairwise distances between members of the clusters:
• Complete linkage: largest distance
• Average linkage: average distance
• Single linkage: smallest distance
Complete linkage gives preference to compact/spherical clusters. Single linkage can produce long stretched clusters.
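The three linkage rules can be illustrated with a tiny O(n³) agglomerative sketch (my code, not from the slides; 1-D points for brevity, so Euclidean distance reduces to the absolute difference):

```python
# Tiny agglomerative clustering sketch with the three linkage rules.
# Pure Python, 1-D points, O(n^3) -- for illustration only.

def dist(a, b):
    return abs(a - b)

def cluster_distance(c1, c2, linkage):
    d = [dist(a, b) for a in c1 for b in c2]
    if linkage == "single":   return min(d)           # smallest pairwise distance
    if linkage == "complete": return max(d)           # largest pairwise distance
    if linkage == "average":  return sum(d) / len(d)  # average pairwise distance
    raise ValueError(linkage)

def agglomerate(points, linkage="complete", stop_at=1):
    # start with the single objects as clusters
    clusters = [[p] for p in points]
    while len(clusters) > stop_at:
        # find and merge the pair of clusters with minimal distance
        i, j = min(((i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_distance(clusters[ij[0]],
                                                   clusters[ij[1]], linkage))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerate([1.0, 1.1, 1.2, 5.0, 5.1], "single", stop_at=2))
```

Stopping at two clusters recovers the two obvious groups; swapping in "complete" or "average" gives the same partition here, but the rules differ on less clearly separated data.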
Hierarchical clustering
• The height of a node in the dendrogram represents the distance of the two children clusters.
• Loss of information: n objects have n(n−1)/2 pairwise distances, but the tree has only n−1 inner nodes.
• The ordering of the leaves is not uniquely defined by the dendrogram: there are $2^{n-1}$ possible choices.
Alternative: direct visualization of similarity/distance matrices
Useful if one wants to investigate a specific factor (advantage: no loss of information). Sort the experiments according to that factor.

[Figure: distance matrix with experiments sorted by array batch (batch 1 vs. batch 2)]
Clustering of time course data
• Suppose we have expression data from different time points $t_1, \dots, t_n$, and want to identify typical temporal expression profiles by clustering the genes.
• Usual clustering methods/distance measures don't take the ordering of the time points into account – the result would be the same if the time points were permuted.
• Simple modification: consider the difference $y_{ij} = x_{i(j+1)} - x_{ij}$ between consecutive time points as an additional observation. Then apply a clustering algorithm such as K-means to the augmented data matrix.
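A minimal sketch of the augmentation step (my code, not from the slides):

```python
# Augmented data matrix for time-course clustering: append the consecutive
# differences y_ij = x_i(j+1) - x_ij to each gene's profile, so that distance
# computations also see the ordering of the time points.
def augment(profile):
    diffs = [profile[j + 1] - profile[j] for j in range(len(profile) - 1)]
    return profile + diffs

x = [1.0, 2.0, 4.0, 3.0]   # expression of one gene at t1..t4
print(augment(x))          # [1.0, 2.0, 4.0, 3.0, 1.0, 2.0, -1.0]
```

In practice one may want to weight the difference terms relative to the raw values, since appending them changes the scale of the resulting distances.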
Biclustering
• Usual clustering algorithms are based on global similarities of rows or columns of an expression data matrix.
• But the similarity of the expression profiles of a group of genes may be restricted to certain experimental conditions.
• Goal of biclustering: identify "homogeneous" submatrices.
• Difficulties: computational complexity, assessing the statistical significance of results.
• Example: Tanay et al. 2002.
The role of feature selection
• Sometimes, people first select genes that appear to be differentially expressed between groups of samples. Then they cluster the samples based on the expression levels of these genes. Is it remarkable if the samples then cluster into the two groups?
• No, this doesn't prove anything, because the genes were selected with respect to the two groups! Such effects can even be obtained with a matrix of i.i.d. random numbers.
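This selection effect is easy to reproduce (a sketch with i.i.d. N(0, 1) noise; all names are mine): generate a random matrix, split the samples arbitrarily, keep the rows with the largest group-mean difference, and the two groups separate in the selected submatrix even though the data are pure noise:

```python
# Demonstration of the selection effect with i.i.d. random numbers.
import random

random.seed(1)
p, n = 1000, 10                              # 1000 "genes", 10 "samples"
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
group1, group2 = range(0, 5), range(5, 10)   # an arbitrary sample split

def score(row):
    # mean difference between the two groups for one gene
    m1 = sum(row[j] for j in group1) / 5
    m2 = sum(row[j] for j in group2) / 5
    return m1 - m2

# "feature selection": keep the 20 most differential genes
top = sorted(X, key=lambda r: abs(score(r)), reverse=True)[:20]

# average the selected rows, oriented so that group 1 is "high"
signature = [sum((r[j] if score(r) > 0 else -r[j]) for r in top) / 20
             for j in range(n)]
print([round(v, 2) for v in signature])
# the group-1 entries tend to come out systematically larger than the
# group-2 entries, although nothing distinguishes the groups in truth
```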
Classification: Additional class information given

[Figure: objects in space, with known class membership indicated]
Classification methods
• Linear Discriminant Analysis (LDA, Fisher)
• Nearest neighbor procedures
• Neural nets
• Support vector machines
Embedding methods
• Attempt to represent high-dimensional objects as points in a low-dimensional space (2 or 3 dimensions).
• Principal Component Analysis
• Correspondence Analysis
• (Multidimensional Scaling: represents objects for which distances are given)
Principal Component Analysis
• Finds a low-dimensional projection such that the sum of squares of the residuals is minimized.
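A small sketch of this idea (my code, not from the slides): for 2-D toy data, the first principal component can be found by power iteration on the covariance matrix; projecting onto it gives the 1-D representation with minimal squared residuals:

```python
# PCA sketch for 2-D toy data: center the data, then find the first
# principal component by power iteration on the 2x2 covariance matrix.
import math

pts = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]
n = len(pts)
mx = sum(p[0] for p in pts) / n
my = sum(p[1] for p in pts) / n
X = [(p[0] - mx, p[1] - my) for p in pts]   # centered data

# covariance matrix entries
cxx = sum(a * a for a, b in X) / n
cyy = sum(b * b for a, b in X) / n
cxy = sum(a * b for a, b in X) / n

# power iteration: repeatedly apply the covariance matrix and normalize;
# converges to the dominant eigenvector (the first principal direction)
v = (1.0, 0.0)
for _ in range(100):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(*w)
    v = (w[0] / norm, w[1] / norm)

print(v)  # roughly (0.71, 0.71): the data lie close to the diagonal
scores = [a * v[0] + b * v[1] for a, b in X]   # 1-D representation
```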
Contingency Tables
Results from a poll among people of different nationality:

favorite dish        nationality
                     Ge   Au   It
Pasta                 7   23   12
Wiener Schnitzel     35   11    9
Sauerbraten          19   18   11
Microarray experiment

Experimental             Genes
condition                Cdc14  Gle2  Gal1
wt yeast                     7    23    12
wt yeast + galactose        35    11     9
yeast transgene             19    18    11
Correspondence Analysis and Principal Component Analysis
• Objects are depicted as points in a plane or in three-dimensional space, trying to maintain their "proximity".
• CA is a variant of PCA, although based on another distance measure: the χ²-distance.
• The embedding tries to preserve the χ²-distance.
Correspondence analysis: Interpretation
• Shows both gene vectors and condition vectors as dots in the plane.
• Genes that are nearby are similar.
• Conditions that are nearby are similar.
• When a gene and a condition point in the same direction, the gene is up-regulated in that condition.

χ²-distance between rows k and l ($f_{kj}$: absolute frequencies, $f_{k+}$: row sums, $f_{+j}$: column sums, n: grand total):
$d_{kl}^2 = \sum_{j} \frac{n}{f_{+j}} \left( \frac{f_{kj}}{f_{k+}} - \frac{f_{lj}}{f_{l+}} \right)^2$
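Assuming this reading of the formula (f as absolute counts, n the grand total), the χ²-distance between row profiles can be computed for the dish/nationality table from the contingency-table slide (my code, not from the slides):

```python
# χ²-distance between row profiles of the dish × nationality table,
# with counts f_kj, row sums f_k+, column sums f_+j, and grand total n.
table = {
    "Pasta":            [ 7, 23, 12],
    "Wiener Schnitzel": [35, 11,  9],
    "Sauerbraten":      [19, 18, 11],
}

counts = list(table.values())
col_sums = [sum(col) for col in zip(*counts)]   # f_+j
n = sum(col_sums)                               # grand total

def chi2_dist(row_k, row_l):
    fk, fl = sum(row_k), sum(row_l)             # row sums f_k+, f_l+
    return sum(n / cj * (xk / fk - xl / fl) ** 2
               for xk, xl, cj in zip(row_k, row_l, col_sums)) ** 0.5

print(round(chi2_dist(table["Pasta"], table["Wiener Schnitzel"]), 3))
print(round(chi2_dist(table["Pasta"], table["Sauerbraten"]), 3))
```

Dividing by the column sums down-weights differences in frequent columns, which is what distinguishes the χ²-distance from the plain Euclidean distance between row profiles.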
[Figure: correspondence analysis of cell cycle data; conditions labeled by cell cycle phase G1, S, G2, M]

Spellman et al. took several samples per time point and hybridized the RNA to glass chips with all yeast genes.