
Cluster analysis for microarray data

Anja von Heydebreck

Aim of clustering: Group objects according to their similarity

Cluster: a set of objects that are similar to each other and separated from the other objects.

[Figure: example in which green/red data points were generated from two different normal distributions]

Clustering microarray data

- Genes and experiments/samples are given as the row and column vectors of a gene expression data matrix.
- Clustering may be applied either to genes or experiments (regarded as vectors in $\mathbb{R}^p$ or $\mathbb{R}^n$).

[Figure: gene expression data matrix with p genes as rows and n experiments as columns]

Why cluster genes?

- Identify groups of possibly co-regulated genes (e.g. in conjunction with sequence data).
- Identify typical temporal or spatial gene expression patterns (e.g. cell cycle data).
- Arrange a set of genes in a linear order that is at least not totally meaningless.

Why cluster experiments/samples?

- Quality control: detect experimental artifacts/bad hybridizations.
- Check whether samples are grouped according to known categories (though this might be better addressed using a supervised approach: statistical tests, classification).
- Identify new classes of biological samples (e.g. tumor subtypes).

Alizadeh et al., Nature 403:503-511, 2000

Cluster analysis

Generally, cluster analysis is based on two ingredients:

- Distance measure: quantification of the (dis)similarity of objects.
- Cluster algorithm: a procedure to group objects. Aim: small within-cluster distances, large between-cluster distances.

Some distance measures

Given vectors $x = (x_1, \ldots, x_n)$, $y = (y_1, \ldots, y_n)$:

- Euclidean distance: $d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- Manhattan distance: $d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
- Correlation distance: $d_C(x, y) = 1 - \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
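To make the three measures concrete, here is a minimal NumPy sketch (function names are my own, not from the slides); it uses the vectors x, y, z from the example on the next slide to check the stated values:

```python
import numpy as np

def d_euclidean(x, y):
    # d_E(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def d_manhattan(x, y):
    # d_M(x, y) = sum_i |x_i - y_i|
    return np.sum(np.abs(x - y))

def d_correlation(x, y):
    # d_C(x, y) = 1 - Pearson correlation of x and y
    xc, yc = x - x.mean(), y - y.mean()
    return 1 - np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

x = np.array([1.0, 1.0, 1.5, 1.5])
y = 2 * x + 0.5                      # same trend as x, shifted and scaled
z = np.array([1.5, 1.5, 1.0, 1.0])   # opposite trend

print(d_correlation(x, y))  # 0.0  (perfectly correlated)
print(d_correlation(x, z))  # 2.0  (perfectly anti-correlated)
print(d_euclidean(x, z))    # 1.0
print(d_euclidean(x, y))    # ~3.54
```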
Which distance measure to use?

[Figure: the vectors x, y, z plotted against their index]

$x = (1, 1, 1.5, 1.5)$
$y = (2.5, 2.5, 3.5, 3.5) = 2x + 0.5$
$z = (1.5, 1.5, 1, 1)$

$d_C(x, y) = 0$, $d_C(x, z) = 2$.
$d_E(x, z) = 1$, $d_E(x, y) \approx 3.54$.

- The choice of distance measure should be based on the application area. What sort of similarities would you like to detect?
- Correlation distance $d_C$ measures trends/relative differences: $d_C(x, y) = d_C(ax + b, y)$ if $a > 0$.

Which distance measure to use?

- Euclidean and Manhattan distance both measure absolute differences between vectors. Manhattan distance is more robust against outliers.
- One may apply standardization to the observations: subtract the mean and divide by the standard deviation,

  $\hat{x} = \dfrac{x - \bar{x}}{\hat{\sigma}_x}$

- After standardization, Euclidean and correlation distance are equivalent:

  $d_E^2(x_1, x_2) = 2n \, d_C(x_1, x_2)$
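A quick numerical check of this equivalence, as a sketch (assuming the standard deviation uses the 1/n convention, which is what makes the identity exact):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)

def standardize(v):
    # subtract the mean, divide by the standard deviation (ddof=0, i.e. 1/n)
    return (v - v.mean()) / v.std(ddof=0)

def d_corr(u, v):
    return 1 - np.corrcoef(u, v)[0, 1]

s1, s2 = standardize(x1), standardize(x2)
lhs = np.sum((s1 - s2) ** 2)     # squared Euclidean distance after standardization
rhs = 2 * n * d_corr(x1, x2)     # 2n times the correlation distance
print(np.isclose(lhs, rhs))      # True
```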

K-means clustering

- Input: N objects given as data points in $\mathbb{R}^p$.
- Specify the number k of clusters.
- Initialize k cluster centers. Iterate until convergence:
  - Assign each object to the cluster with the closest center (w.r.t. Euclidean distance).
  - The centroids/mean vectors of the obtained clusters are taken as new cluster centers.
- K-means can be seen as an optimization problem: minimize the sum of squared within-cluster distances,

  $W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i) = C(j) = k} d_E^2(x_i, x_j)$

- Results depend on the initialization. Use several starting points and choose the "best" solution (with minimal $W(C)$).
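A minimal NumPy sketch of this procedure (my own illustration, not the original course code): it runs the assignment/update iteration from several random initializations and keeps the solution with the smallest within-cluster sum of squares, the standard k-means objective:

```python
import numpy as np

def kmeans(X, k, n_restarts=10, max_iter=100, seed=0):
    """Plain k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _ in range(n_restarts):
        # initialize centers on k randomly chosen data points
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # assign each point to the closest center (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute centers as cluster means (keep old center if a cluster empties)
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # within-cluster sum of squared distances to the centroids
        cost = np.sum((X - centers[labels]) ** 2)
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels, best_cost
```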
K-means/PAM: How to choose K (the number of clusters)?

- There is no easy answer.
- Many heuristic approaches try to compare the quality of clustering results for different values of K (for an overview see Dudoit/Fridlyand 2002); one such heuristic is sketched below.
- The problem can be better addressed in model-based clustering, where each cluster represents a probability distribution, and a likelihood-based framework can be used.
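One common heuristic of this kind (my illustration, assuming scikit-learn is available; the slides do not prescribe this particular method) compares the within-cluster sum of squares across values of K and looks for an "elbow":

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# toy data: three Gaussian clusters in R^2
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 0], [0, 3])])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances to the centroids
    print(k, round(km.inertia_, 1))
# the decrease typically flattens ("elbow") near the true number of clusters
```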


Hierarchical clustering

- Similarity of objects is represented in a tree structure (dendrogram).
- Advantage: no need to specify the number of clusters in advance. Nested clusters can be represented.

[Figure: dendrogram for the Golub data (different types of leukemia); clustering based on the 150 genes with highest variance across all samples.]

Agglomerative hierarchical clustering

- Bottom-up algorithm (top-down (divisive) methods are less common).
- Start with the objects as clusters.
- In each iteration, merge the two clusters with the minimal distance from each other, until you are left with a single cluster comprising all objects.
- But what is the distance between two clusters?


Distances between clusters used for hierarchical clustering

Calculation of the distance between two clusters is based on the pairwise distances between members of the clusters:

- Complete linkage: largest pairwise distance
- Average linkage: average pairwise distance
- Single linkage: smallest pairwise distance

Complete linkage gives preference to compact/spherical clusters. Single linkage can produce long, stretched clusters.
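All three linkage rules are implemented in SciPy's hierarchical clustering; a small sketch on toy data (of my own construction) shows how the rule changes the merge heights:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
# two well-separated point clouds in R^2
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
               rng.normal(4, 0.5, size=(10, 2))])

D = pdist(X)  # condensed vector of pairwise Euclidean distances
for method in ("single", "average", "complete"):
    Z = linkage(D, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    # height of the final merge = distance between the last two clusters
    print(method, round(Z[-1, 2], 2), np.bincount(labels)[1:])
```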


Hierarchical clustering

- The height of a node in the dendrogram represents the distance of the two children clusters.
- Loss of information: n objects have n(n-1)/2 pairwise distances, but the tree has only n-1 inner nodes.
- The ordering of the leaves is not uniquely defined by the dendrogram: $2^{n-2}$ possible choices.

[Figure: dendrogram for the Golub data (different types of leukemia); clustering based on the 150 genes with highest variance across all samples.]

Alternative: direct visualization of similarity/distance matrices

Useful if one wants to investigate a specific factor (advantage: no loss of information). Sort the experiments according to that factor.

[Figure: distance matrix with experiments sorted into array batch 1 and array batch 2]

Clustering of time course data

- Suppose we have expression data from different time points $t_1, \ldots, t_n$, and want to identify typical temporal expression profiles by clustering the genes.
- Usual clustering methods/distance measures don't take the ordering of the time points into account: the result would be the same if the time points were permuted.
- Simple modification: consider the difference $x_{i(j+1)} - x_{ij}$ between consecutive time points as an additional observation $y_{ij}$. Then apply a clustering algorithm such as K-means to the augmented data matrix (sketched below).
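A sketch of this augmentation step (my own illustration; any standard K-means implementation could then be applied to `X_aug`):

```python
import numpy as np

def augment_with_differences(X):
    """X: genes x time points. Append the consecutive-timepoint
    differences y_ij = x_i(j+1) - x_ij as extra columns."""
    Y = np.diff(X, axis=1)        # shape: genes x (n - 1)
    return np.hstack([X, Y])      # shape: genes x (2n - 1)

# toy example: 3 genes measured at 4 time points
X = np.array([[1.0, 2.0, 3.0, 4.0],    # steadily increasing
              [4.0, 3.0, 2.0, 1.0],    # steadily decreasing
              [1.0, 4.0, 1.0, 4.0]])   # oscillating
X_aug = augment_with_differences(X)
# X_aug also encodes the direction of change between time points,
# so permuting the time points would now change the clustering input.
```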

Biclustering

- Usual clustering algorithms are based on global similarities of rows or columns of an expression data matrix.
- But the similarity of the expression profiles of a group of genes may be restricted to certain experimental conditions.
- Goal of biclustering: identify "homogeneous" submatrices.
- Difficulties: computational complexity, assessing the statistical significance of results.
- Example: Tanay et al. 2002.

The role of feature selection

- Sometimes, people first select genes that appear to be differentially expressed between groups of samples. Then they cluster the samples based on the expression levels of these genes. Is it remarkable if the samples then cluster into the two groups?
- No, this doesn't prove anything, because the genes were selected with respect to the two groups! Such effects can even be obtained with a matrix of i.i.d. random numbers, as the simulation below illustrates.
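A minimal simulation of this selection effect (my own sketch): the data are pure noise, yet after selecting the most "differential" genes, the two arbitrary sample groups separate cleanly.

```python
import numpy as np

rng = np.random.default_rng(3)
n_genes, n_samples = 5000, 20
X = rng.normal(size=(n_genes, n_samples))   # i.i.d. noise, no real structure
groups = np.array([0] * 10 + [1] * 10)      # arbitrary split of the samples

# select the 50 genes with the largest mean difference between the groups
diff = X[:, groups == 0].mean(axis=1) - X[:, groups == 1].mean(axis=1)
selected = np.argsort(np.abs(diff))[-50:]
X_sel = X[selected]

# score each sample by its mean over the selected genes, oriented by the
# sign of the observed difference
score = (np.sign(diff[selected])[:, None] * X_sel).mean(axis=0)
print(score[groups == 0].mean(), score[groups == 1].mean())
# the two groups now separate clearly, although the data are pure noise
```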

Classification: Additional class information given

[Figure: objects with known class labels in object space]

Classification methods

- Linear Discriminant Analysis (LDA, Fisher)
- Nearest neighbor procedures
- Neural nets
- Support vector machines

Embedding methods

- Attempt to represent high-dimensional objects as points in a low-dimensional space (2 or 3 dimensions).
- Principal Component Analysis
- Correspondence Analysis
- (Multidimensional Scaling: represents objects for which distances are given)

Principal Component Analysis

Finds a low-dimensional projection such that the sum-of-squares of the residuals is minimized.
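A compact sketch of that least-squares property via the singular value decomposition (my own illustration on random data): projecting onto the top q principal axes gives the rank-q approximation with minimal residual sum of squares.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
Xc = X - X.mean(axis=0)              # center the columns

# SVD: the rows of Vt are the principal axes
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

q = 2                                 # target dimension of the embedding
scores = Xc @ Vt[:q].T                # coordinates of the objects in 2D
X_hat = scores @ Vt[:q]               # rank-q reconstruction

residual = np.sum((Xc - X_hat) ** 2)
print(residual, np.sum(s[q:] ** 2))   # equal: residual SS = discarded variance
```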

Contingency Tables

Results from a poll among people of different nationality (favorite dish by nationality):

favorite dish     | Ge | Au | It
------------------|----|----|----
Pasta             |  7 | 23 | 12
Wiener Schnitzel  | 35 | 11 |  9
Sauerbraten       | 19 | 18 | 11

Microarray experiment

Experimental condition  | Cdc14 | Gle2 | Gal1
------------------------|-------|------|-----
Wt yeast                |   7   |  23  |  12
wt yeast + galactose    |  35   |  11  |   9
Yeast transgene         |  19   |  18  |  11

Correspondence Analysis and Principal Component Analysis

- Objects are depicted as points in a plane or in three-dimensional space, trying to maintain their "proximity".
- CA is a variant of PCA, although based on another distance measure: the χ²-distance.
- The embedding tries to preserve the χ²-distance.


Correspondence analysis: Interpretation

- Shows both gene vectors and condition vectors as dots in the plane.
- Genes that are nearby are similar.
- Conditions that are nearby are similar.
- When a gene and a condition point in the same direction, the gene is up-regulated in that condition.

χ²-distance between rows k and l (with relative frequencies $f$):

$d_{kl}^2 = \sum_{j=1}^{n} \frac{\left( f_{kj}/f_{k+} - f_{lj}/f_{l+} \right)^2}{f_{+j}}$
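A direct implementation of this χ²-distance (a sketch, using the poll table reconstructed above; rows are compared via their profiles):

```python
import numpy as np

# the poll table from above: rows = dishes, columns = nationalities
N = np.array([[ 7, 23, 12],
              [35, 11,  9],
              [19, 18, 11]], dtype=float)

F = N / N.sum()           # relative frequencies f_kj
f_row = F.sum(axis=1)     # row margins f_k+
f_col = F.sum(axis=0)     # column margins f_+j

def chi2_dist(k, l):
    # d_kl = sqrt( sum_j (f_kj/f_k+ - f_lj/f_l+)^2 / f_+j )
    profile_k = F[k] / f_row[k]
    profile_l = F[l] / f_row[l]
    return np.sqrt(np.sum((profile_k - profile_l) ** 2 / f_col))

print(chi2_dist(0, 1))  # distance between the Pasta and Wiener Schnitzel rows
```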

Correspondence analysis: Interpretation

[Figure: CA plot of the yeast cell-cycle data; conditions labeled by cell-cycle phase G1, S, G2, M]

Spellman et al. took several samples per time point and hybridized the RNA to glass chips containing all yeast genes.