Advanced Topics in Bioinformatics
Microarrays: Clustering
Data Clustering
Lecture Overview
Introduction: What is Data Clustering
Key Terms & Concepts
– Dimensionality
– Centroids & Distance
– Distance & Similarity measures
– Data Structures Used
– Hierarchical & non-hierarchical
Hierarchical Clustering
– Algorithm
– Single/complete/average linkage
– Dendrograms
K-means Clustering
– Algorithm
Other Related Concepts
– Self Organising Maps (SOM)
– Dimensionality Reduction: PCA & MDS
Introduction
Analysis of Gene Expression Matrices
[Figure: a gene expression matrix, with genes as rows, samples as columns, and gene expression levels as the matrix values]
In a gene expression matrix, rows represent
genes and columns represent
measurements from different experimental
conditions, each measured on an individual array.
The value at each position in the matrix
characterises the expression level (absolute
or relative) of a particular gene under a
particular experimental condition.
Introduction
Identifying Similar Patterns
The goal of microarray data analysis is to find relationships and
patterns in the data to gain insight into the underlying biology.
Clustering algorithms can be applied to the resulting data to find
groups of similar genes or groups of similar samples.
– e.g. groups of genes with “similar expression profiles” (co-expressed
genes), i.e. similar rows in the gene expression matrix
– or groups of samples (disease cell lines/tissues/toxicants) with “similar
effects” on gene expression, i.e. similar columns in the gene expression
matrix
Introduction
What is Data Clustering
Clustering is a method by which a large set of data is grouped into
clusters (groups) of smaller sets of similar data.
Example: there are 10 balls of three different colours, and we are
interested in clustering the balls into three different groups.
An intuitive solution is to group the balls by colour.
Identifying similarity by colour was easy; however, we want to extend this
to numerical values so that we can deal with gene expression matrices,
and also to cases where there are more features (not just colour).
Introduction
Clustering Algorithms
A clustering algorithm attempts to find natural groups of components (or
data) based on some notion of similarity over the features describing them.
A clustering algorithm also finds the centroid of each group of data points.
To determine cluster membership, many algorithms evaluate the distance
between a point and the cluster centroids.
The output from a clustering algorithm is essentially a statistical description
of the cluster centroids, together with the number of components in each cluster.
Key Terms and Concepts
Dimensionality of gene expression matrix
Clustering algorithms work by calculating
distances (or, alternatively, similarities) in higher-dimensional
spaces, i.e. when the elements
are described by many features (e.g. colour,
size, smoothness, etc. for the balls example).
A gene expression matrix of N genes x M
samples can be viewed as:
– N genes, each represented in an M-dimensional space.
– M samples, each represented in an N-dimensional space.
We will show graphical examples mainly in 2-D
spaces
– i.e. when N = 2 or M = 2
Key Terms and Concepts
Centroid and Distance
[Figure: two scatter plots of expression values for gene A vs. gene B, with the centroid marked]
In the first example (2 genes & 25 samples), the expression values of the 2
genes are plotted for 25 samples, and the centroid is shown.
In the second example (2 genes & 2 samples), the distance between the
expression values of the 2 genes is shown.
Key Terms and Concepts
Centroid and Distance
Cluster centroid:
The centroid of a cluster is a point whose parameter values are
the mean of the parameter values of all the points in the cluster.
Distance:
Generally, the distance between two points is taken as a common
metric to assess the similarity among the components of a
population. The most commonly used distance measure is the
Euclidean metric, which defines the distance between two points
p = (p1, p2, ...) and q = (q1, q2, ...) as:
d(p, q) = sqrt( (p1 − q1)² + (p2 − q2)² + ... )
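As a minimal sketch (Python, not from the slides), the two definitions above can be written directly:

```python
import math

def centroid(points):
    """Centroid: the mean of each coordinate across all points in the cluster."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def euclidean(p, q):
    """Euclidean distance between points p and q."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Example: three 2-D expression profiles
pts = [(1.0, 2.0), (3.0, 2.0), (2.0, 5.0)]
print(centroid(pts))              # (2.0, 3.0)
print(euclidean((0, 0), (3, 4)))  # 5.0
```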
Key Terms and Concepts
Properties of Distance Metrics
There are many possible distance metrics.
Some theoretical (and intuitive) properties of distance metrics:
– The distance between two profiles must be greater than or equal to zero;
distances cannot be negative.
– The distance between a profile and itself must be zero.
– Conversely, if the distance between two profiles is zero, then the
profiles must be identical.
– The distance between profile A and profile B must be the same as the
distance between profile B and profile A.
– The distance between profile A and profile C must be less than or
equal to the sum of the distance between profiles A and B and the
distance between profiles B and C.
Key Terms and Concepts
Distance/Similarity Measures
For two points (x1, y1) and (x2, y2):
Euclidean (L2) distance: sqrt( (x1 − x2)² + (y1 − y2)² )
Manhattan (L1) distance: |x1 − x2| + |y1 − y2|
Lm distance: ( |x1 − x2|^m + |y1 − y2|^m )^(1/m)
L∞ distance: max( |x1 − x2|, |y1 − y2| )
Inner product: x1·x2 + y1·y2
Correlation coefficient
Spearman rank correlation coefficient
For simplicity we will concentrate on Euclidean and Manhattan
distances in this course.
Key Terms and Concepts
Distance Measures: Minkowski Metric
The Minkowski metric generalises the distances above: for points
p = (p1, ..., pn) and q = (q1, ..., qn),
d(p, q) = ( Σ |pi − qi|^m )^(1/m)
Commonly used Minkowski metrics are m = 2 (Euclidean), m = 1
(Manhattan), and the limit m → ∞ (maximum distance).
Key Terms and Concepts
Examples of Minkowski Metrics
[Figure: two points separated by 4 units in x and 3 units in y]
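A short Python sketch (not from the slides) of the Minkowski family, using the two points separated by 4 units in x and 3 units in y:

```python
def minkowski(p, q, m):
    """L_m distance: (sum |p_i - q_i|^m)^(1/m)."""
    return sum(abs(a - b) ** m for a, b in zip(p, q)) ** (1.0 / m)

def chebyshev(p, q):
    """L_inf distance: the limit of L_m as m goes to infinity."""
    return max(abs(a - b) for a, b in zip(p, q))

p, q = (1.0, 1.0), (5.0, 4.0)  # differ by 4 in x, 3 in y
print(minkowski(p, q, 1))      # Manhattan: 7.0
print(minkowski(p, q, 2))      # Euclidean: 5.0
print(chebyshev(p, q))         # L-infinity: 4.0
```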
Key Terms and Concepts
Distance/Similarity Matrices
Gene Expression Matrix
– N Genes x M Samples
Clustering is based on distances; this
leads to a new, useful data structure: the
similarity/dissimilarity matrix
– Represents the distance between
either N genes (NxN) or M samples
(MxM)
– Only half the matrix is needed, since it is
symmetrical
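A sketch (Python, not from the slides) of building such a matrix; only the upper triangle is computed, then mirrored, since d(i, j) = d(j, i):

```python
import math

def distance_matrix(profiles):
    """Symmetric NxN Euclidean distance matrix over N profiles."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # compute each pair once and mirror it across the diagonal
            d[i][j] = d[j][i] = math.dist(profiles[i], profiles[j])
    return d

d = distance_matrix([(0, 0), (3, 4), (0, 8)])
print(d[0][1], d[1][0])  # 5.0 5.0 (symmetry)
```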
Key Terms
Hierarchical vs. Non-hierarchical
Hierarchical clustering is the most commonly used method for
identifying groups of closely related genes or tissues. Hierarchical
clustering successively links genes or samples
with similar profiles to form a tree structure,
much like a phylogenetic tree.
K-means clustering is a method for non-hierarchical (flat)
clustering that requires the analyst to supply the number of
clusters in advance; it then allocates genes and samples to
clusters appropriately.
Hierarchical Clustering
Algorithm
Given a set of N items to be clustered, and an NxN distance (or
similarity) matrix, the basic process of hierarchical clustering is:
1. Start by assigning each item to its own cluster, so that if you have N
items, you now have N clusters, each containing just one item.
2. Find the closest (most similar) pair of clusters and merge them into a
single cluster, so that you now have one fewer cluster.
3. Compute the distances (similarities) between the new cluster and each of
the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of
size N.
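The four steps above can be sketched in pure Python (an illustrative naive implementation, not the course's own code):

```python
def hierarchical(dist, linkage=min):
    """Naive agglomerative clustering over an NxN distance matrix.
    linkage=min gives single-link, linkage=max gives complete-link.
    Returns the merge history as (members_a, members_b, distance)."""
    clusters = {i: (i,) for i in range(len(dist))}   # step 1: one item per cluster
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    while len(clusters) > 1:
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])  # step 2: closest pair
        merges.append((clusters[i], clusters[j], dij))
        clusters[i] += clusters[j]                          # merge j into i
        del clusters[j]
        # step 3: distances from the new cluster to every remaining cluster
        new_d = {k: v for k, v in d.items() if j not in k}
        for k in clusters:
            if k != i:
                new_d[(min(i, k), max(i, k))] = linkage(d[(min(i, k), max(i, k))],
                                                        d[(min(j, k), max(j, k))])
        d = new_d                                           # step 4: repeat
    return merges

# four items on a line at positions 0, 1, 5, 7
dist = [[0, 1, 5, 7], [1, 0, 4, 6], [5, 4, 0, 2], [7, 6, 2, 0]]
print(hierarchical(dist))
```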
Hierarchical Cluster Analysis
[Figure sequence: worked example on a distance matrix, repeating the steps —
scan the matrix for the minimum distance, join the two closest genes
into one node, then update the matrix with distances to the new node]
Hierarchical Clustering
Distance Between Two Clusters
Whereas it is straightforward to
calculate the distance between two
points, we have various options
when calculating the distance between
clusters:
– Min distance: Single-Link Method / Nearest Neighbour
– Max distance: Complete-Link Method / Furthest Neighbour
– Average distance: average of all cross-cluster pairs, or the distance
between the cluster centroids
Key Terms
Linkage Methods for Hierarchical Clustering
Single-link clustering (also called the connectedness or minimum
method): we consider the distance between one cluster and another
cluster to be equal to the shortest distance from any member of one
cluster to any member of the other cluster. If the data consist of
similarities, we consider the similarity between one cluster and another
cluster to be equal to the greatest similarity from any member of one
cluster to any member of the other cluster.
Complete-link clustering (also called the diameter or maximum method):
we consider the distance between one cluster and another cluster to be
equal to the longest distance from any member of one cluster to any
member of the other cluster.
Average-link clustering: we consider the distance between one cluster
and another cluster to be equal to the average distance from any member
of one cluster to any member of the other cluster.
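The three linkage rules can be sketched in one small Python function (an illustration, not from the slides):

```python
def cluster_distance(A, B, dist, method="single"):
    """Distance between clusters A and B (lists of points) under the
    three linkage rules described above."""
    pairs = [dist(a, b) for a in A for b in B]
    if method == "single":      # nearest neighbour: shortest cross-cluster distance
        return min(pairs)
    if method == "complete":    # furthest neighbour: longest cross-cluster distance
        return max(pairs)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown linkage method: {method}")

# 1-D example with absolute difference as the point distance
d = lambda a, b: abs(a - b)
print(cluster_distance([0, 1], [4, 6], d, "single"))    # 3
print(cluster_distance([0, 1], [4, 6], d, "complete"))  # 6
print(cluster_distance([0, 1], [4, 6], d, "average"))   # 4.5
```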
Single-Link Method
[Worked example: starting from a Euclidean distance matrix over points
a, b, c, d, the closest pair is merged at each step, giving the merge
sequence {a,b}, then {a,b,c}, then {a,b,c,d}]
Complete-Link Method
[Worked example: on the same distance matrix, the furthest-neighbour
rule gives the merge sequence {a,b}, then {c,d}, then {a,b,c,d}]
Key Terms and Concepts
Dendrograms and Linkage
[Figure: single-link and complete-link dendrograms for the same data,
with a distance axis running from 0 to 6]
The resulting tree structure is usually referred to as a dendrogram.
In a dendrogram, the length of each tree branch represents the distance
between the clusters it joins.
Different dendrograms may arise when different linkage methods are used.
Two-Way Hierarchical Clustering
Note that we can do two-way
clustering by performing
clustering on both the rows and
the columns.
It is common to visualise the
data using a heatmap, as shown.
Don’t confuse the heatmap
with the colours of a
microarray image.
They are different!
Why?
K-Means Clustering
Basic idea: use cluster centroids (means) to represent each cluster,
assigning data elements to the closest cluster (centroid).
Goal: minimise the square error (intra-class dissimilarity).
K-means Clustering
Algorithm
1) Select an initial partition of k clusters
2) Assign each object to the cluster with the closest centroid
3) Compute the new centroid of each cluster
4) Repeat steps 2 and 3 until no object changes cluster
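The four steps above (Lloyd's algorithm) can be sketched in pure Python; this is an illustrative implementation, not the course's own code:

```python
import random

def kmeans(points, k, seed=0):
    """K-means following the four steps above."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                # step 1: initial centroids
    while True:
        # step 2: assign each point to the cluster with the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # step 3: recompute each centroid as the mean of its cluster
        new = [tuple(sum(vals) / len(vals) for vals in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:                         # step 4: stop when stable
            return new, clusters
        centroids = new

# two well-separated groups of 2-D "genes"
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
```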
The K-Means Clustering Method: Example
k-means Clustering: Procedure (1)
Step 1a: Specify the number of clusters k, e.g. k = 4.
Each point is called a “gene”.
k-means Clustering: Procedure (2)
Step 1b: Assign k random centroids.
k-means Clustering: Procedure (3)
Step 1c: Calculate the centroid (mean) of each cluster.
k-means Clustering: Procedure (4)
Step 2: Each gene is reassigned to the nearest cluster (gene i to cluster c).
k-means Clustering: Procedure (5)
Step 3: Calculate the centroid (mean) of each cluster.
Step 4: Iterate until the means have converged.
Comparison
K-means vs. Hierarchical Clustering
Computation Time
– Hierarchical clustering: O( m n² log(n) )
– K-means clustering: O( k t m n ), where t is the number of iterations
Memory Requirements
– Hierarchical clustering: O( mn + n² )
– K-means clustering: O( mn + kn )
Other
– Hierarchical clustering: need to select a linkage method, and then a
sensible split threshold
– K-means: need to select k
– In both cases: need to select a distance/similarity measure
Other Related Concepts
Self Organising Maps
Self Organising Maps (SOM) algorithm is similar to k

means in that the user specifies a predefined number
of clusters as a seed.
However, as opposed to k

means, the clusters related
to another via a spatial topology

Usually the
clusters are arranged in a square or hexagonal grid.
Initially, elements are allocated to the clusters at
random. The algorithm iteratively recalculates the
cluster centroids based on the elements assigned to
each cluster as well as those assigned to its
neighbours, and then re

allocates the data elements
to the clusters.
Since the clusters are spatially related, neighbouring
clusters can generally be merged at the end of a run
based on a threshold value.
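As an illustration only (not the slides' exact batch formulation), here is the standard online SOM variant in pure Python: units sit on a square grid, and each input pulls its best-matching unit (BMU) and the BMU's grid neighbours toward it, which is what ties neighbouring clusters together:

```python
import random

def som(points, rows=2, cols=2, epochs=30, lr=0.3, radius=1, seed=0):
    """Minimal online self-organising map on a rows x cols grid."""
    rng = random.Random(seed)
    dim = len(points[0])
    units = {(r, c): [rng.random() for _ in range(dim)]
             for r in range(rows) for c in range(cols)}
    for _ in range(epochs):
        for p in points:
            # BMU: the unit whose weight vector is closest in data space
            bmu = min(units, key=lambda u: sum((a - b) ** 2
                                               for a, b in zip(p, units[u])))
            for u, w in units.items():
                # the *grid* distance decides who counts as a neighbour
                if abs(u[0] - bmu[0]) + abs(u[1] - bmu[1]) <= radius:
                    for i in range(dim):
                        w[i] += lr * (p[i] - w[i])  # pull toward the input
    return units

units = som([(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)])
```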
Other Related Concepts
Dimensionality Reduction
Clustering of data is a form of data reduction, since it allows us to
describe large data sets (a large number of points) by smaller sets.
A related concept is that of dimensionality reduction.
– Each point in a data set is a point in a large multi-dimensional space
(the dimensions can be either genes or samples).
– Dimensionality reduction methods aim to map the same data points to a lower
dimensional space (e.g. 2-D or 3-D) in a way that preserves their inter-relationships.
– Dimensionality reduction methods are very useful for data visualisation, and
also as a pre-processing step before applying data analysis algorithms, such
as clustering or classification, that cannot cope with a very large number of
dimensions.
– The maths behind these methods is beyond this course, and the following
slides introduce only the basic idea.
If you take genes to be dimensions,
you may end up with up to 30,000
dimensions describing each sample!
Dimensionality Reduction
Multi

dimensional Scaling (MDS)
MDS algorithms work by finding co

ordinates
in 2

D or 3

D space that preserve the
distance ranking between the points in the
high dimensional space.
The staring point of MDS algorithm is the
distance or similarity matrix between the data
points and work through an optimisation
algorithm.
MDS preserve the notion of nearness, and
therefore clusters in the high dimensional
space still look like cluster on an MDS plot.
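A sketch of the idea in Python. Note the swap: the slides describe non-metric MDS, which preserves distance rankings via optimisation; the classical (Torgerson) variant below is easier to show in a few lines because it preserves the distances themselves through an eigendecomposition:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) metric MDS: recover k-dimensional
    co-ordinates from a matrix D of pairwise Euclidean distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]      # keep the k largest
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))

# sanity check: distances computed from the embedding match the input
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
Y = classical_mds(D, k=2)
D2 = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
```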
Dimensionality Reduction
Principal Component Analysis (PCA)
PCA aims to identify the direction(s) of greatest variation in the data.
Conceptually, this is as if you rotate the data to find the 1st dimension of
greatest variation, then the 2nd, and so on.
Once the 1st dimension is found, a recursive procedure is applied to the
remaining dimensions.
The resulting PCA dimensions are ordered: the first dimension captures most of
the variation, the second dimension captures most of the remaining variation,
etc.
PCA algorithms work using linear algebra (by calculating eigenvectors).
After calculating all the PCA components, you keep only the top k
components. In general, the first few can usually capture about 90% of the
variation in the data.
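The eigenvector computation above can be sketched with numpy (an illustration, not the course's code): the eigenvectors of the covariance matrix, sorted by eigenvalue, are exactly the ordered directions of greatest variation:

```python
import numpy as np

def pca(X, k=2):
    """Project X (samples x features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)              # centre each feature
    cov = Xc.T @ Xc / (len(X) - 1)       # sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]       # largest variation first
    components = vecs[:, order[:k]]
    explained = vals[order] / vals.sum() # fraction of variation per component
    return Xc @ components, explained

# data lying almost on a line: one component captures nearly all variation
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t]) + 0.01 * rng.normal(size=(200, 2))
proj, expl = pca(X, k=1)
```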
Summary
Clustering algorithms are used to find similarity relationships between genes,
diseases, tissues or samples.
Different similarity metrics can be used
– mainly Euclidean and Manhattan
Hierarchical clustering
– Similarity matrix
– Algorithm
– Linkage methods
K-means clustering algorithm
SOM, MDS, and PCA (only for reference)