Bioinformatics Cluster Analysis

2 Oct 2013

Mentee: Joonoh Lim

Mentor: Sanketh Shetty

Background


Cluster analysis is an unsupervised method of
determining groupings (clusters) in data sets.


In biology, cluster analysis is used to study genes and
their expression patterns.


There are three categories of gene expression data
clustering: gene-based, sample-based, and subspace
clustering.


The data set is usually obtained from DNA microarrays.

DNA Microarray

Establishing Data Set

Gene-based: 15 x 15 x 8 -> 225 x 8

Sample-based: 15 x 15 x 8 -> 8 x 225
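
The two layouts above amount to a reshape of the same array. The following is a minimal sketch with NumPy, assuming the raw data is a 15 x 15 grid of spots measured in 8 samples. The meaning of each axis is an assumption, since the slide gives only the shapes, and random numbers stand in for real microarray intensities.

```python
import numpy as np

# Hypothetical stand-in for microarray data: a 15 x 15 grid of spots,
# each measured in 8 samples. The values are random, not real data.
rng = np.random.default_rng(0)
raw = rng.normal(size=(15, 15, 8))

# Gene-based view: one row per spot (225 objects), one column per sample.
gene_based = raw.reshape(225, 8)

# Sample-based view: one row per sample (8 objects), one column per spot.
sample_based = gene_based.T

print(gene_based.shape)    # (225, 8)
print(sample_based.shape)  # (8, 225)
```

In the gene-based layout the rows being clustered are genes. In the sample-based layout the rows are samples, each described by all 225 measurements.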

Types of Clustering Algorithms


Partitional Methods

K-means Clustering

Affinity Propagation

Spectral Clustering

Mean-shift Clustering

Normalized cuts

Gaussian Mixture Models


Hierarchical Methods


Single linkage


Complete linkage


Average linkage

Proximity measure


Defines the similarity between data objects


Examples: Euclidean distance, Pearson’s correlation
coefficient, jackknife correlation, Spearman’s rank-order
correlation, city block (Manhattan) distance, angular
separation, etc.



We use Euclidean distance.

The Euclidean distance between points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ is defined as:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
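
The Euclidean distance is the square root of the sum of squared coordinate differences, and it can be computed in a few lines. The sketch below uses NumPy. The function names `euclidean_distance` and `pairwise_euclidean` and the sample values are illustrative choices, not part of the original slides.

```python
import numpy as np

def euclidean_distance(p, q):
    """Return the Euclidean distance between two equal-length vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(np.sum((q - p) ** 2))

def pairwise_euclidean(X):
    """Return the matrix of Euclidean distances between all rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

# Three made-up profiles (illustrative values only).
profiles = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(euclidean_distance(profiles[0], profiles[1]))  # 5.0
print(pairwise_euclidean(profiles))
```

The pairwise matrix is symmetric with zeros on the diagonal. Hierarchical methods work from exactly this kind of matrix.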

Hierarchical Clustering


Single linkage: at each step, merge the two clusters whose closest members are separated by the minimum distance.

http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
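
The single-linkage merge rule can be sketched directly. This is a deliberately naive NumPy version meant to show the rule, not an efficient implementation. The function name `single_linkage` and the toy data are assumptions made for illustration.

```python
import numpy as np

def single_linkage(X, n_clusters):
    """Naive agglomerative clustering with single linkage.

    Starts with every row of X as its own cluster and repeatedly merges
    the two clusters whose closest members are at minimum Euclidean
    distance, until n_clusters clusters remain.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise distances between all objects.
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy one-dimensional data (illustrative values only).
points = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])
print(single_linkage(points, n_clusters=3))  # [[0, 1], [2, 3], [4]]
```

Replacing `.min()` with `.max()` in the inner loop would give complete linkage, and `.mean()` would give average linkage.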

Hierarchical Clustering

Ex) Colon Cancer data

Dendrogram using complete linkage
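
A complete-linkage tree like the one on this slide can be built with SciPy's `scipy.cluster.hierarchy` module. The sketch below assumes SciPy is available and uses synthetic data, since the colon cancer data set is not included here. It is not the code that produced the original figure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Synthetic stand-in for expression data: two well-separated groups of
# 10 profiles with 8 measurements each. This is not the colon cancer data.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(10, 8)),
    rng.normal(loc=5.0, scale=0.5, size=(10, 8)),
])

# Complete linkage: the distance between two clusters is the distance
# between their farthest members.
Z = linkage(X, method="complete", metric="euclidean")

# Cut the tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")

# Compute the dendrogram layout without drawing it (no plotting needed).
tree = dendrogram(Z, no_plot=True)

print(labels)
print(tree["leaves"])  # order of the leaves along the dendrogram axis
```

Dropping `no_plot=True` draws the dendrogram with matplotlib, if it is installed.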

K-means Clustering

www.cs.cmu.edu/~awm

K-means clustering

Ex) Colon Cancer data, shown for K = 5, K = 10, K = 15, and K = 30
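
Runs like the ones above follow the standard K-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its points. Below is a minimal sketch of that loop (Lloyd's algorithm) in NumPy. It assumes random initialisation from the data points and uses synthetic data of shape 225 x 8 in place of the colon cancer data. The function name `kmeans` and its parameters are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points;
        # a centroid that lost all its points is left where it is.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Synthetic stand-in with the gene-based shape (225 objects, 8 features).
rng = np.random.default_rng(1)
X = rng.normal(size=(225, 8))

for k in (5, 10, 15, 30):
    centroids, labels = kmeans(X, k)
    # Number of objects assigned to each of the k clusters.
    print(k, np.bincount(labels, minlength=k))
```

The result depends on the random starting centroids, so in practice the algorithm is usually run several times and the best run is kept.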

Determining cluster numbers


One widely used method is the “elbow” method.


The elbow method plots the percentage of variance
explained against the number of clusters and picks
the point beyond which adding more clusters no longer
adds much information.


The percentage of variance explained is the ratio of the
between-group variance to the total variance.
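
The quantity plotted in the elbow method can be computed as follows. This is a sketch under stated assumptions: it uses a small NumPy K-means with a single random start, synthetic data made of three blobs, and hypothetical helper names (`kmeans_labels`, `variance_explained`). The variance ratio is computed as the between-group sum of squares divided by the total sum of squares.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm with random initial centroids; returns labels."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty).
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels

def variance_explained(X, labels):
    """Ratio of between-group to total sum of squares."""
    overall_mean = X.mean(axis=0)
    total_ss = ((X - overall_mean) ** 2).sum()
    between_ss = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        between_ss += len(members) * ((members.mean(axis=0) - overall_mean) ** 2).sum()
    return between_ss / total_ss

# Synthetic data: three blobs in two dimensions (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.7, size=(50, 2)) for c in centers])

ks = range(1, 9)
ratios = [variance_explained(X, kmeans_labels(X, k)) for k in ks]
for k, r in zip(ks, ratios):
    print(f"K = {k}: {100 * r:.1f}% of variance explained")
```

Plotting `ratios` against `ks` gives the elbow curve. Because each K uses a single random start, an individual run can land in a poor local optimum, so the curve is smoother when each K is repeated with several seeds.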

Elbow Method (Criterion)

Wikipedia

Challenges and Future Research Directions


No single “best” algorithm.


The performance of a clustering algorithm
strongly depends on both the data distribution and
the application requirements.


Clustering is generally an “unsupervised” learning
problem.


However, often some “partial” knowledge is available,
such as the functions of some genes.


If a clustering algorithm could integrate such partial knowledge
as “clustering constraints”, we could expect more
biologically meaningful and reliable results.

Questions?

Thank you!