# Bioinformatics Cluster Analysis

Biotechnology

2 Oct 2013

Mentee: Joonoh Lim

Mentor: Sanketh Shetty

Background

Cluster analysis is an unsupervised method of
determining groupings (clusters) in data sets.

In biology, cluster analysis is used to study genes and
gene expression.

There are three categories of gene expression data
clustering: gene-based, sample-based, and subspace
clustering.

The data set is usually obtained from DNA microarrays.

DNA Microarray

Establishing Data Set

Gene-based: 15 x 15 x 8 reshaped to 225 x 8

Sample-based: 15 x 15 x 8 reshaped to 8 x 225
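The reshaping of a 15 x 15 x 8 data cube into a 225 x 8 (gene-based) or 8 x 225 (sample-based) matrix can be sketched in plain Python. This is only an illustration: it assumes the cube is a 15 x 15 grid of spots (genes) measured under 8 samples, the values are synthetic placeholders, and the names `make_cube`, `gene_based`, and `sample_based` are hypothetical.

```python
# Assumed layout: a 15 x 15 grid of microarray spots (genes),
# each measured under 8 samples. This layout is an assumption.
ROWS, COLS, SAMPLES = 15, 15, 8


def make_cube():
    """Build a placeholder 15 x 15 x 8 cube (synthetic values, not real data)."""
    return [[[float(r * COLS + c + s) for s in range(SAMPLES)]
             for c in range(COLS)]
            for r in range(ROWS)]


def gene_based(cube):
    """Flatten the grid so that each row is one gene: 225 x 8."""
    return [spot for row in cube for spot in row]


def sample_based(cube):
    """Transpose the gene-based matrix so that each row is one sample: 8 x 225."""
    return [list(column) for column in zip(*gene_based(cube))]


cube = make_cube()
genes = gene_based(cube)
samples = sample_based(cube)
print(len(genes), len(genes[0]))      # 225 8
print(len(samples), len(samples[0]))  # 8 225
```

In the gene-based matrix the objects being clustered are genes; in the sample-based matrix they are samples.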

Types of Clustering Algorithms

Partitional Methods

K-means Clustering

Affinity Propagation

Spectral Clustering

Mean-shift Clustering

Normalized-cuts

Gaussian Mixture Models

Hierarchical Methods

Proximity measure

Defines the similarity between data objects

Examples: Euclidean distance, Pearson’s correlation
coefficient, Jackknife correlation, Spearman’s rank-order
correlation, City block distance (Manhattan distance),
Angular separation, etc.

We use Euclidean distance.

The Euclidean distance between points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ is defined as:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
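A minimal Python sketch of the Euclidean distance, assuming points are given as equal-length sequences of coordinates. The function name `euclidean_distance` and the argument names `p` and `q` are illustrative, not taken from the slides.

```python
import math


def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# A 3-4-5 right triangle as a sanity check.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```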

Hierarchical Clustering

Single linkage: at each step, merge the two clusters whose closest members are separated by the minimum distance

http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
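Single-linkage hierarchical clustering can be sketched as a naive agglomerative loop. The sketch below assumes Euclidean distance, stops when a requested number of clusters remains (a simplification of building the full dendrogram), and uses toy 2-D points. The name `single_linkage` and the toy data are hypothetical.

```python
import math


def single_linkage(points, num_clusters):
    """Naive agglomerative clustering with single linkage.

    Starts with every point in its own cluster and repeatedly merges the
    two clusters whose closest members are nearest to each other.
    Returns clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]


toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(single_linkage(toy_points, 3))  # [[0, 1], [2, 3], [4]]
```

This version recomputes all pairwise cluster distances on every merge, which is fine for a toy example but slow for real expression data.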

Hierarchical Clustering

Example: colon cancer data

[Figure: dendrogram]

K-means Clustering

www.cs.cmu.edu/~awm
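K-means can be sketched with the standard Lloyd iteration: assign each point to its nearest centroid, then move each centroid to the mean of its points. The sketch below is not the procedure from the slides. It assumes Euclidean distance, and it takes the first `k` points as initial centroids, a simplification chosen here to keep the example deterministic. The name `kmeans` and the toy data are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with the first k points as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return labels, centroids


toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels, centroids = kmeans(toy_points, 2)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(0.0, 0.5), (5.0, 5.5)]
```

The result depends on the initial centroids, so practical runs usually try several random initializations and keep the best one.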

K-means clustering

Example: colon cancer data

[Figures: clustering results for K = 5, K = 10, K = 15, and K = 30]

Determining cluster numbers

One widely used method is the “elbow” method.

The elbow method plots the percent variance
explained against the number of clusters and finds
the point where increasing the number of clusters does
not substantially increase the variance explained.
The percentage of variance explained is the ratio of the
between-group variance to the total variance.
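The quantity plotted by the elbow method can be sketched as follows. The sketch computes the ratio of the between-group sum of squares to the total sum of squares for several values of K. It assumes Euclidean distance and a simple K-means with the first K points as initial centroids. All function names and the toy data are hypothetical, and no plotting is done.

```python
import math


def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(x) / len(points) for x in zip(*points))


def kmeans_labels(points, k, max_iter=100):
    """Simple Lloyd iteration; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_point(members)
    return labels


def variance_explained(points, labels):
    """Between-group sum of squares divided by total sum of squares."""
    grand_mean = mean_point(points)
    total = sum(math.dist(p, grand_mean) ** 2 for p in points)
    between = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        between += len(members) * math.dist(mean_point(members), grand_mean) ** 2
    return between / total


# Toy data: three well-separated groups, interleaved so that the first
# three points come from different groups.
toy_points = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0),
              (0.0, 1.0), (10.0, 11.0), (20.0, 1.0),
              (1.0, 0.0), (11.0, 10.0), (21.0, 0.0)]

explained = {k: variance_explained(toy_points, kmeans_labels(toy_points, k))
             for k in range(1, 6)}
for k, value in explained.items():
    print(k, round(100 * value, 1))
```

For this toy data the explained variance jumps sharply up to K = 3 and changes little afterwards, which is the "elbow" one would look for in the plot.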

Elbow Method (Criterion)

Source: Wikipedia

Challenges and Future Research Directions

No single “best” algorithm.

The performance of different clustering algorithms
strongly depends on both the data distribution and
the application requirements.

Clustering is generally an “unsupervised” learning
problem.

However, often some “partial” knowledge is available,
such as the functions of some genes.

If a clustering algorithm could integrate such partial
knowledge as “clustering constraints”, we could expect more
biologically meaningful and reliable results.

Questions?

Thank you!