# Clustering and Classification

Nov 8, 2013

Introduction to Machine Learning

BMI 730

Kun Huang

Department of Biomedical Informatics

Ohio State University

How do we use microarray?

Profiling

Clustering

Cluster to detect patient subgroups

Cluster to detect gene clusters and regulatory networks

Clustering and Classification

- Preprocessing
- Distance measures
- Popular algorithms (not necessarily the best ones)
- More sophisticated ones
- Evaluation
- Data mining

- Clustering or classification?
- Is training data available?
- What domain-specific knowledge can be applied?
- What preprocessing of data is needed?
- Log / data scale and numerical stability
- Filtering / denoising
- Nonlinear kernel
- Feature selection (do I need to use all the data?)
- Is the dimensionality of the data too high?

How do we process microarray data (clustering)?

- Feature selection: genes, transformations of expression levels.
- Genes discovered in the class comparison (t-test). Risk: missing genes.
- Iterative approach: select genes under different p-value cutoffs, then select the one with good performance using cross-validation.
- Principal components (pro and con).
- Discriminant analysis (e.g., LDA).
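The class-comparison step above can be sketched as a per-gene t statistic followed by a ranking. This is a minimal illustration, not the course's own code: the gene names, the toy expression values, and the helper names `welch_t` and `rank_genes` are all hypothetical, and Welch's unequal-variance form of the t statistic is an assumption.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two groups of expression values."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def rank_genes(expr, group_a, group_b):
    """Return gene names sorted by decreasing |t|, plus the raw t scores."""
    scores = {}
    for gene, values in expr.items():
        a = [values[i] for i in group_a]
        b = [values[i] for i in group_b]
        scores[gene] = welch_t(a, b)
    ranking = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return ranking, scores

# Hypothetical toy data: 3 genes x 6 samples (0-2 = class A, 3-5 = class B)
expr = {
    "geneX": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # clearly different
    "geneY": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no difference in means
    "geneZ": [5.0, 4.0, 6.0, 6.5, 5.5, 7.0],  # weak difference
}
ranking, scores = rank_genes(expr, [0, 1, 2], [3, 4, 5])
print(ranking)
```

Keeping only the top of such a ranking is where the "missing genes" risk comes from: a gene with a modest t statistic may still matter biologically.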

- Dimensionality Reduction
  - Principal component analysis (PCA)
  - Singular value decomposition (SVD)
  - Karhunen-Loève transform (KLT)

[Figure labels: Basis for P; SVD]
Principal Component Analysis (PCA) - Other things to consider

- Numerical balance/data normalization
- Noisy direction
- Continuous vs. discrete data
- Principal components are orthogonal to each other; biological data, however, are not
- Principal components are linear combinations of the original data
- Prior knowledge is important
- PCA is not clustering!
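To make the "linear combination" point concrete, here is a small sketch that finds the first principal component by power iteration on the sample covariance matrix. It is an illustration under stated assumptions: the data are made up, the function names are hypothetical, and a real analysis would normally use a library SVD routine instead.

```python
from math import sqrt

def first_principal_component(rows, iters=200):
    """Leading eigenvector of the sample covariance matrix via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, centered

def project(centered, v):
    """Score of each sample along direction v (a linear combination of features)."""
    return [sum(c[j] * v[j] for j in range(len(v))) for c in centered]

# Hypothetical data lying roughly along the line y = 2x
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
pc1, centered = first_principal_component(rows)
scores = project(centered, pc1)
print(pc1, scores)
```

Because the covariance matrix depends on the scale of each feature, rescaling one column changes the result, which is why normalization appears in the list above.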

- Dimensionality reduction: linear discriminant analysis (LDA)

[Figure: scatter plot of two classes, A and B, with projection direction w; both axes run from 0.5 to 2.0. (From S. Wu’s website)]

Linear Discriminant Analysis

[Figure: scatter plot of classes A and B with projection direction w; both axes run from 0.5 to 2.0. (From S. Wu’s website)]
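The projection direction w in a two-class picture like this is commonly taken to be Fisher's discriminant, w = Sw^{-1}(m_A - m_B), where Sw is the pooled within-class scatter matrix. The sketch below assumes that definition and 2-D data; the points and function names are hypothetical and are not read off the figure.

```python
def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter2(points, m):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy

def fisher_direction(class_a, class_b):
    """w = Sw^{-1} (mean_a - mean_b) for two 2-D classes."""
    ma, mb = mean2(class_a), mean2(class_b)
    a, b = scatter2(class_a, ma), scatter2(class_b, mb)
    sxx, sxy, syy = a[0] + b[0], a[1] + b[1], a[2] + b[2]
    det = sxx * syy - sxy * sxy
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # Closed-form inverse of the symmetric matrix [[sxx, sxy], [sxy, syy]]
    return ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)

# Hypothetical classes: a unit square and the same square shifted by 2 along x
class_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
class_b = [(2.0, 0.0), (3.0, 0.0), (2.0, 1.0), (3.0, 1.0)]
w = fisher_direction(class_a, class_b)
proj_a = [w[0] * p[0] + w[1] * p[1] for p in class_a]
proj_b = [w[0] * p[0] + w[1] * p[1] for p in class_b]
print(w, proj_a, proj_b)
```

Unlike PCA, this direction uses the class labels, so it is a supervised way of reducing dimension.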

Visualization of Microarray Data

Multidimensional scaling (MDS)

- High-dimensional coordinates unknown
- Distances between the points are known
- The distance may not be Euclidean, but the embedding maintains the distance in a Euclidean space
- Try different dimensions (from one to ???)
- At each dimension, perform optimal embedding to minimize embedding error
- Plot embedding error (residual) vs. dimension
- Pick the knee point


Distance Measure (Metric?)

- What do you mean by “similar”?
- Euclidean
- Uncentered correlation
- Pearson correlation

Distance Metric - Euclidean

| Probe ID | Gene | Expression values |
| --- | --- | --- |
| 102123_at | Lip1 | 1596.000, 2040.900, 1277.000, 4090.500, 1357.600, 1039.200, 1387.300, 3189.000, 1321.300, 2164.400, 868.600, 185.300, 266.400, 2527.800 |
| 160552_at | Ap1s1 | 4144.400, 3986.900, 3083.100, 6105.900, 3245.800, 4468.400, 7295.000, 5410.900, 3162.100, 4100.900, 4603.200, 6066.200, 5505.800, 5702.700 |

$d_E(\mathrm{Lip1}, \mathrm{Ap1s1}) = 12883$
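The Euclidean distance quoted for Lip1 and Ap1s1 can be checked directly from the fourteen values listed for each gene. The variable and function names below are illustrative only.

```python
from math import sqrt

# Expression profiles copied from the slide
lip1 = [1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3,
        3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8]
ap1s1 = [4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0,
         5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7]

def euclidean(x, y):
    """Square root of the summed squared differences between two profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d_e = euclidean(lip1, ap1s1)
print(round(d_e))  # 12883
```

The distance is dominated by the overall difference in expression level between the two genes, not by whether their profiles rise and fall together.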

Distance Metric - Pearson Correlation

(Same Lip1 and Ap1s1 expression profiles as in the Euclidean example.)

$d_P(\mathrm{Lip1}, \mathrm{Ap1s1}) = 0.904$

[Figure: scatter plot of the two profiles; one axis runs from 0 to 8000, the other from 0 to 4500.]
Distance Metric - Pearson Correlation

[Figure: example plots labeled $r = 1$ and $r = -1$.]

Ranges from 1 to -1.
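For reference, the standard definition of the Pearson correlation between two expression profiles $x$ and $y$ measured on $n$ arrays is given below. The symbols $x_i$, $y_i$, $\bar{x}$ and $\bar{y}$ are notation introduced here, not taken from the slide.

```latex
r(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
               {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
                \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
```

The value $r = 1$ is reached when $y$ is an increasing linear function of $x$, and $r = -1$ when it is a decreasing one.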

Distance Metric - Uncentered Correlation

(Same Lip1 and Ap1s1 expression profiles as in the Euclidean example.)

$d_u(\mathrm{Lip1}, \mathrm{Ap1s1}) = 0.835$

[Figure: the two profiles drawn as vectors from the origin $o$, separated by the angle $\theta$.]

Distance Metric - Difference between Pearson correlation and uncentered correlation

(Same Lip1 and Ap1s1 expression profiles as in the Euclidean example.)

[Figure: two scatter plots of the profiles, one per measure; axes run from 0 to 8000 and from 0 to 4500.]

- Pearson correlation: baseline expression possible
- Uncentered correlation: all are considered signals
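The difference can be seen on a toy pair of profiles that share the same shape but sit on different baselines. The sketch assumes the usual definitions: the uncentered correlation is the cosine of the angle between the two vectors, and the Pearson correlation is the same quantity after subtracting each profile's mean. The profiles are hypothetical.

```python
from math import sqrt

def uncentered(x, y):
    """Uncentered correlation: cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def pearson(x, y):
    """Pearson correlation: uncentered correlation of the mean-subtracted profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return uncentered([a - mx for a in x], [b - my for b in y])

# Hypothetical profiles: identical shape, but gene_b has a baseline of 10
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [11.0, 12.0, 13.0, 12.0, 11.0]
r_centered = pearson(gene_a, gene_b)       # baseline removed, shape only
r_uncentered = uncentered(gene_a, gene_b)  # baseline counted as signal
print(r_centered, r_uncentered)
```

Here the Pearson value is essentially 1 while the uncentered value is lower, because the constant offset changes the angle between the two vectors.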

Distance Metric - Difference between Euclidean and correlation

Distance Metric - Missing: negative correlation may also mean “close” in signal pathway ($1 - |\mathrm{PCC}|$, $1 - \mathrm{PCC}^2$)
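The two alternatives in parentheses can be compared with the plain correlation distance 1 - PCC on a perfectly anti-correlated pair. The profiles and the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_distances(x, y):
    """Three distances derived from the PCC."""
    r = pearson(x, y)
    return {"1 - r": 1 - r, "1 - |r|": 1 - abs(r), "1 - r^2": 1 - r * r}

# Hypothetical pair: one gene rises while the other falls
activator = [1.0, 2.0, 3.0, 4.0]
repressed = [8.0, 6.0, 4.0, 2.0]
d = correlation_distances(activator, repressed)
print(d)
```

With 1 - PCC the pair is as far apart as possible, whereas both absolute-value variants place it at distance zero, which matches the idea that a repressor and its target are "close" in a pathway.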


How do we process microarray data (clustering)?

Unsupervised Learning - Hierarchical Clustering

- Single linkage: the linking distance is the minimum distance between two clusters.
- Complete linkage: the linking distance is the maximum distance between two clusters.
- Average linkage: the linking distance is the average of all pair-wise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).

Properties:

- Single linkage: prone to chaining and sensitive to noise
- Complete linkage: tends to produce compact clusters
- Sensitive to distance metric
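Single, complete and average linkage differ only in how the distance between two clusters is summarized from the pairwise distances of their members. The following is a deliberately naive agglomerative sketch on made-up 1-D data; the function names are hypothetical and no attempt is made at efficiency.

```python
def linkage_distance(c1, c2, dist, method):
    """Distance between two clusters (lists of points) under a linkage rule."""
    pairs = [dist(p, q) for p in c1 for q in c2]
    if method == "single":
        return min(pairs)
    if method == "complete":
        return max(pairs)
    if method == "average":  # UPGMA
        return sum(pairs) / len(pairs)
    raise ValueError(method)

def agglomerate(points, dist, method="average"):
    """Merge the two closest clusters until one remains; return merge history."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], dist, method)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

def abs_diff(a, b):
    return abs(a - b)

points = [1.0, 2.0, 4.0, 8.0]  # hypothetical 1-D data
heights = {m: [h for _, _, h in agglomerate(points, abs_diff, m)]
           for m in ("single", "complete", "average")}
print(heights)
```

The merge heights are what a dendrogram draws on its distance axis; on this data the three rules merge the same groups but at different heights.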

Unsupervised Learning - Hierarchical Clustering: Dendrograms

- Distance: the height of each horizontal line represents the distance between the two groups it merges.
- Order: open-source R uses the convention that the tighter clusters are on the left. Others have proposed using expression values, loci on chromosomes, and other ranking criteria.

Unsupervised Learning - K-means

- Vector quantization
- K-D trees
- Need to try different K, sensitive to initialization

% K = 4 clusters; metric = correlation distance; 20 replicates, lowest-cost run kept
[cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'Distance', 'correlation', 'Replicates', 20);

- Number of classes K needs to be specified
- Does not always converge
- Sensitive to initialization
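One common response to the sensitivity to initialization is to run K-means several times from random starts and keep the run with the lowest within-cluster cost. The sketch below uses Lloyd's algorithm with squared Euclidean distance on made-up 2-D points; the function names, the seed and the restart count are assumptions for illustration.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, iters=100):
    """One run of Lloyd's algorithm from k randomly chosen data points."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: stop
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    cost = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, cost

def best_of(points, k, restarts=20, seed=0):
    """Repeat K-means from different starts and keep the lowest-cost run."""
    rng = random.Random(seed)
    runs = (kmeans(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[2])

# Hypothetical data: two well-separated groups of three points
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers, cost = best_of(points, k=2)
print(labels, cost)
```

K still has to be chosen by the user; rerunning `best_of` for several values of `k` and comparing the costs is one simple way to explore it.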

-
Unsupervised Learning
-

K
-
means

Unsupervised Learning - Self-organizing maps (SOM)

- Neural network based method
- Originally used as a method for visualizing (embedding) high-dimensional data
- Also related to vector quantization
- The idea is to map close data points to the same discrete level

Issues

- Lack of consistency or representative features (5.3 TP53 + 0.8 PTEN doesn’t make sense)
- Data structure is missing
- Not robust to outliers and noise

D’Haeseleer 2005 Nat. Biotechnol 23(12):1499-501


- Model-based clustering methods

(Han) http://www.cs.umd.edu/~bhhan/research2.html

Pan et al. Genome Biology 2002 3:research0009.1 doi:10.1186/gb-2002-3-2-research0009

- Structure-based clustering methods

Supervised Learning - Support vector machines (SVM) and Kernels

- Only a (binary) classifier, no data model
- Kernel: nonlinear mapping
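A kernel lets a linear method work in a nonlinearly mapped feature space without computing the mapping explicitly. As a small check of that idea, the sketch below compares the degree-2 polynomial kernel with the dot product of an explicit feature map for 2-D inputs. The choice of kernel, the sample vectors and the function names are illustrative assumptions; nothing here is specific to any SVM package.

```python
from math import sqrt

def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit nonlinear map phi with phi(x) . phi(y) = (x . y)^2."""
    return (x[0] ** 2, sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
k_direct = poly_kernel(x, y)                    # computed in the input space
k_mapped = dot(feature_map(x), feature_map(y))  # computed in the feature space
print(k_direct, k_mapped)
```

In the mapped space the squared radius x1^2 + x2^2 is a linear function of the features, so a circular class boundary in the original plane becomes a linear one there.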

Supervised Learning - Naïve Bayesian classifier

- Bayes rule
- Maximum a posteriori (MAP)

$P(c \mid x) = \dfrac{P(x \mid c)\,P(c)}{P(x)}$, where $P(c)$ is the prior prob. and $P(x \mid c)$ is the conditional prob.
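The MAP decision picks the class with the largest product of prior and conditional probabilities; the "naive" part is treating the features as independent given the class. The sketch below estimates both by counting on made-up binary features. The data, the function names and the add-one smoothing are assumptions added for illustration.

```python
from collections import Counter, defaultdict

def train(samples, labels):
    """Count class frequencies and per-feature value frequencies per class."""
    priors = Counter(labels)
    cond = defaultdict(Counter)  # (class, feature index) -> value counts
    for x, c in zip(samples, labels):
        for j, value in enumerate(x):
            cond[(c, j)][value] += 1
    return priors, cond, len(labels)

def posterior_scores(x, priors, cond, n, n_values=2):
    """Unnormalized posteriors P(c) * prod_j P(x_j | c), add-one smoothed."""
    scores = {}
    for c, count in priors.items():
        score = count / n  # prior probability
        for j, value in enumerate(x):
            # smoothed conditional probability of this feature value
            score *= (cond[(c, j)][value] + 1) / (count + n_values)
        scores[c] = score
    return scores

def classify(x, priors, cond, n):
    """MAP decision: the class with the highest posterior score."""
    scores = posterior_scores(x, priors, cond, n)
    return max(scores, key=scores.get)

# Hypothetical data: (gene A high?, gene B high?) -> class
samples = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
model = train(samples, labels)
print(classify((1, 1), *model), classify((0, 0), *model))
```

The denominator P(x) is the same for every class, so it can be dropped when only the arg max is needed.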


- Accuracy vs. generality
- Overfitting
- Model selection

[Figure: prediction error versus model complexity for the training sample and the testing sample (reproduced from Hastie et al.)]

Assessing the Validity of Clusters

- Most clustering algorithms do not assume any structure or prior relationship among the genes. However, the clusters found should more or less reflect the underlying structures (e.g., pathways). (An interesting research problem is to develop new algorithms that can accommodate such relationships.)
- If different patients are grouped into clusters, it implies that there are subtypes of the disease, which is a big claim and must be validated using other methods (e.g., pathology).
- Relationship with external variables is important. E.g., clustering of cells from different tissue types may correspond to the relationship among the tissues.

- Where should we cut the dendrograms?
- Which clustering results should we believe? Different (or even the same) clustering algorithms may find different clustering results.
- Many tests are flawed, e.g., by circular reasoning: using genes that differ significantly between two classes as features for clustering, then using the clusters to detect signatures, which are genes that are significantly changed.

- Most clustering algorithms can find clusters even in random data.
- The clusters found by clustering algorithms should exhibit greater intra-cluster similarity (homogeneity) and larger inter-cluster distance (separation).
- How can we be sure that the clustering is not from random data?
- How do we find a good partition among all possible partitions of the data?
- How do we assess the reproducibility of the partitioning?

- Global tests of clustering (meaningful cluster vs. random cluster)
- Check the distributions of the nearest-neighbor (NN) distances and of the pairwise distances; they are very different for a uniform distribution and for a mixture of multiple distributions.

[Figure: distance distributions; panels labeled NN and Pairwise]

- Reproducibility of clustering
- Global perturbation methods (McShane et al., Bioinformatics, 2002, 1462-1469)
  - Using only the first three principal components (the observation is that they convey the clustering information well enough)
  - Adding Gaussian noise and checking whether the clustering relationship is still preserved
  - Indices R and D
    - R: the ratio of same-cluster data pairs that are preserved after the perturbation
    - D: discrepancy between best-matched clusters
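Read literally, the R index counts how many pairs of samples that share a cluster before the perturbation still share one afterwards. The sketch below implements that reading on two label vectors; it is an interpretation of the one-line description above, not a reproduction of the published procedure, and the label vectors are hypothetical.

```python
from itertools import combinations

def robustness_index(original, perturbed):
    """Fraction of pairs co-clustered in `original` that remain co-clustered."""
    same = [(i, j) for i, j in combinations(range(len(original)), 2)
            if original[i] == original[j]]
    kept = sum(1 for i, j in same if perturbed[i] == perturbed[j])
    return kept / len(same)

original = [0, 0, 0, 1, 1, 1]
perturbed = [1, 1, 0, 0, 0, 0]  # hypothetical labels after adding noise
r_index = robustness_index(original, perturbed)
print(r_index)
```

Because only co-membership is compared, the index does not depend on how the cluster labels happen to be numbered in each run.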

How do we process microarray data (clustering)?

- Cross-validation: assessment of the classifier. The key is to strike a balance between accurate classification of the training data and prediction power.
- Training vs. testing (10%)
- Leave-one-out cross-validation: for small sample sizes; the score is the ratio of correct predictions on the left-out samples.
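Leaving one sample out at a time can be sketched with any classifier; a nearest-centroid rule is used here only because it is short. The data, the labels and the function names are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Assign x to the class whose training mean is closest (squared Euclidean)."""
    best_label, best_dist = None, None
    for label in set(train_y):
        members = [p for p, c in zip(train_x, train_y) if c == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        d = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if best_dist is None or d < best_dist:
            best_label, best_dist = label, d
    return best_label

def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, score the held-out one."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        correct += nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]
    return correct / len(xs)

# Hypothetical samples; the last "B" sample lies between the two groups
xs = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9),
      (3.0, 3.1), (2.9, 3.3), (3.2, 2.8), (1.9, 2.0)]
ys = ["A", "A", "A", "B", "B", "B", "B"]
acc = leave_one_out_accuracy(xs, ys)
print(acc)
```

The borderline sample is misclassified when it is the one left out, so the estimate is 6/7 even though a classifier trained on all seven samples may fit them better; that gap is the training-versus-prediction trade-off mentioned above.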

Validation

- cDNA or Affymetrix chips measure mRNA levels, which may not reflect final protein concentrations
- Various splice variants exist; the expressed protein may not be active
- Post-translational modification
- Quantitative real-time PCR (RT-PCR) is widely used for this purpose

Other high-level consideration

- Correlation does not mean causation


Data Mining is searching for knowledge in data

- Knowledge mining from databases
- Knowledge extraction
- Data/pattern analysis
- Data dredging
- Knowledge Discovery in Databases (KDD)

The process of discovery

- Interactive + Iterative
- Scalable approaches

Popular Data Mining Techniques

- Clustering: most dominant technique in use for gene expression analysis in particular and bioinformatics in general.
  - Partition data into groups of similarity
- Classification:
  - Supervised version of clustering: technique to model class membership
  - Can subsequently classify unseen data.
- Frequent Pattern Analysis
  - A method for identifying frequently recurring patterns (structural and transactional).
- Temporal/Sequence Analysis
  - Model temporal data: wavelets, FFT etc.
- Statistical Methods
  - Regression, Discriminant analysis

Summary

- A good clustering method will produce high quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
  - Other metrics include: density, information entropy
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Recommended Literature

1. Bioinformatics: The Machine Learning Approach by P. Baldi & S. Brunak, 2nd edition, The MIT Press, 2001
2. Data Mining: Concepts and Techniques by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001
3. Pattern Classification by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001
4. The Elements of Statistical Learning by T. Hastie, R. Tibshirani, J. Friedman, Springer-Verlag, 2001