Mentee: Joonoh Lim
Mentor: Sanketh Shetty
Background
Cluster analysis is an unsupervised method of
determining groupings (clusters) in data sets.
In biology, cluster analysis is used to study genes and
gene expression.
There are three categories of gene expression data
clustering: gene-based, sample-based, and subspace
clustering.
The data set is usually obtained from DNA microarrays.
DNA Microarray
Establishing Data Set
Gene-based: 15 x 15 x 8 → 225 x 8
Sample-based: 15 x 15 x 8 → 8 x 225
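The reshaping above can be sketched in a few lines of Python. This is only an illustration: the cube is filled with dummy values, and reading the 15 x 15 x 8 shape as a 15 x 15 grid of genes measured in 8 samples is an assumption, as are all variable names.

```python
# Hypothetical stand-in for the 15 x 15 x 8 data cube. Assumption: a
# 15 x 15 grid of genes (spots), each measured in 8 samples.
ROWS, COLS, SAMPLES = 15, 15, 8

# cube[r][c][s] holds a dummy value that encodes its own position.
cube = [[[(r * COLS + c) * SAMPLES + s for s in range(SAMPLES)]
         for c in range(COLS)]
        for r in range(ROWS)]

# Gene-based layout: one row per gene (225), one column per sample (8).
gene_based = [cube[r][c] for r in range(ROWS) for c in range(COLS)]

# Sample-based layout: the transpose, one row per sample (8),
# one column per gene (225).
sample_based = [list(column) for column in zip(*gene_based)]

print(len(gene_based), len(gene_based[0]))      # 225 8
print(len(sample_based), len(sample_based[0]))  # 8 225
```

In the gene-based layout the clustering algorithm groups rows that are genes; in the sample-based layout it groups rows that are samples.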
Types of Clustering Algorithms
Partitional Methods
K-means Clustering
Affinity Propagation
Spectral Clustering
Mean-shift Clustering
Normalized cuts
Gaussian Mixture Models
Hierarchical Methods
Single linkage
Complete linkage
Average Linkage
Proximity measure
Defines the similarity between data objects
Examples: Euclidean distance, Pearson’s correlation
coefficient, jackknife correlation, Spearman’s rank-order
correlation, city block (Manhattan) distance, angular
separation, etc.
We use Euclidean distance.
The Euclidean distance between points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ is defined as:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
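As a quick check of the definition, here is a minimal Python sketch for two points given as coordinate lists. The function name and the example values are illustrative, not taken from the data set.

```python
import math

def euclidean_distance(p, q):
    """Square root of the summed squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Two hypothetical expression profiles measured over three samples.
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
print(euclidean_distance(a, b))  # 5.0
```

In Python 3.8 and later the standard library function `math.dist` computes the same quantity.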
Hierarchical Clustering
Single linkage: at each step, merge the two clusters whose closest members are separated by the minimum distance
http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
Hierarchical Clustering
Ex) Colon Cancer data
Dendrogram
Using complete linkage
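To make the linkage rules concrete, the following is a naive sketch of agglomerative (bottom-up) hierarchical clustering using Euclidean distance. It is not the code behind the dendrogram slide: the function names and toy points are assumptions, and it stops at a requested number of clusters instead of building a full dendrogram.

```python
import math

def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under the chosen linkage rule."""
    pair_distances = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    if method == "single":      # closest pair of members
        return min(pair_distances)
    if method == "complete":    # farthest pair of members
        return max(pair_distances)
    if method == "average":     # mean over all pairs of members
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")

def agglomerate(points, n_clusters, method="single"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts on its own
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage_distance(
                clusters[ab[0]], clusters[ab[1]], method),
        )
        clusters[i].extend(clusters[j])  # j > i, so deleting j is safe
        del clusters[j]
    return clusters

# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
for group in agglomerate(points, n_clusters=2, method="complete"):
    print(group)
```

Recording the order and distance of each merge, rather than stopping early, is what yields the tree drawn as a dendrogram.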
K-means Clustering
www.cs.cmu.edu/~awm
K-means clustering
Ex) Colon Cancer data, shown for K = 5, K = 10, K = 15, and K = 30
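For reference, a minimal sketch of the standard K-means iteration (Lloyd's algorithm) is given below. It is an illustration under stated assumptions: the initial centroids are passed in by the caller so that the result is reproducible, and the toy points are hypothetical rather than taken from the colon cancer data.

```python
import math

def k_means(points, centroids, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # converged: nothing moved
            break
        centroids = updated
    return centroids, clusters

# Hypothetical 2-D points; K = 2, seeded with one point from each group.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
for center, members in zip(centroids, clusters):
    print(center, members)
```

The number of clusters K is simply the number of initial centroids supplied, which is why the choice of K has to be made before the algorithm runs.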
Determining cluster numbers
One widely used method is the “elbow” method.
The elbow method plots the percentage of variance
explained against the number of clusters and picks
the point beyond which adding more clusters no longer
adds much information.
The percentage of variance explained is the ratio of the
between-group variance to the total variance.
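The ratio can be computed directly from a clustering. The sketch below uses sums of squared distances, which give the same ratio as variances because both share the same normalizing count. The function names and the four toy points are assumptions for illustration only.

```python
def mean_point(points):
    """Coordinate-wise mean of a list of equal-length points."""
    return [sum(axis) / len(points) for axis in zip(*points)]

def sum_of_squares(points, center):
    """Total squared Euclidean distance from the points to a center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

def variance_explained(clusters):
    """Between-group sum of squares divided by total sum of squares."""
    all_points = [p for cluster in clusters for p in cluster]
    total = sum_of_squares(all_points, mean_point(all_points))
    within = sum(sum_of_squares(c, mean_point(c)) for c in clusters if c)
    return (total - within) / total

# The same four hypothetical 1-D points split into 1, 2, and 4 clusters.
one = [[(0.0,), (2.0,), (10.0,), (12.0,)]]
two = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
four = [[(0.0,)], [(2.0,)], [(10.0,)], [(12.0,)]]
for clustering in (one, two, four):
    print(len(clustering), round(variance_explained(clustering), 3))
```

On this toy example almost all of the variance is already explained at two clusters, so going from two to four adds little; that flattening is the "elbow".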
Elbow Method (Criterion)
wikipedia
Challenges and Future Research Directions
No single “best” algorithm.
The performance of different clustering algorithms
strongly depends on both the data distribution and
the application requirements.
Clustering is generally an “unsupervised” learning
problem.
However, often some “partial” knowledge is available,
such as the functions of some genes.
If a clustering algorithm could integrate such partial
knowledge as “clustering constraints”, we could expect
more biologically meaningful and reliable results.
Questions?
Thank you!