# Clustering

Artificial Intelligence and Robotics

8 Nov 2013

http://creativecommons.org/licenses/by-sa/2.0/

Principal Component Analysis &
Clustering

Prof: Rui Alves

ralves@cmb.udl.es

973702406

Dept Ciencies Mediques Basiques,

1st Floor, Room 1.08

Website of the Course: http://web.udl.es/usuaris/pg193845/Courses/Bioinformatics_2007/

Course: http://10.100.14.36/Student_Server/

Complex Datasets

When studying complex biological samples there are sometimes too many variables.

For example, when studying Medaka development using phospho-metabolomics you may have measurements of many different amino acids, metabolites, etc.

Question: Can we find markers of development
using these metabolites?

Question: How do we analyze the data?

Problems

How do you visually represent the data?

The sample has many dimensions, so plots
are not a good solution

How do you make sense or extract
information from it?

With so many variables, how do you know which ones are important for identifying signatures?

Two possible ways (out of many) to address these problems:

PCA

Clustering

Solution 1: Try data reduction method

If we can combine the different columns in
specific ways, then maybe we can find a
way to reduce the number of variables that
we need to represent and analyze:

Principal Component Analysis

Variation in data is what identifies
signatures

|              | Metabolite 1 | Metabolite 2 | Metabolite 3 |
|--------------|--------------|--------------|--------------|
| Condition C1 | 0.01         | 3            | 0.1          |
| Condition C2 | 0.02         | 0.01         | 5            |
| Condition C3 | 0.015        | 0.8          | 1.3          |

Variation in data is what identifies
signatures

Virtual Metabolite: Metabolite 2 + 1/Metabolite 3

This signal is much stronger and separates conditions C1, C2, and C3.

[Figure: virtual-metabolite values on a 0 to 20 scale; C2, C3, and C1 fall at clearly separated positions.]
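Using the table's values, the virtual metabolite can be computed directly. A minimal sketch (the dictionary layout and names are illustrative):

```python
# Values from the example table; m1..m3 stand for Metabolites 1..3.
data = {
    "C1": {"m1": 0.01,  "m2": 3.0,  "m3": 0.1},
    "C2": {"m1": 0.02,  "m2": 0.01, "m3": 5.0},
    "C3": {"m1": 0.015, "m2": 0.8,  "m3": 1.3},
}

# Virtual metabolite = Metabolite 2 + 1/Metabolite 3
virtual = {cond: m["m2"] + 1.0 / m["m3"] for cond, m in data.items()}
# C1 = 3 + 1/0.1  = 13.0
# C2 = 0.01 + 1/5 = 0.21
# C3 = 0.8 + 1/1.3 ≈ 1.57  -> the three conditions now separate cleanly
```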

Principal component analysis

From k "old" variables define k "new" variables that are linear combinations of the old variables:

y_1 = a_11 x_1 + a_12 x_2 + ... + a_1k x_k
y_2 = a_21 x_1 + a_22 x_2 + ... + a_2k x_k
...
y_k = a_k1 x_1 + a_k2 x_2 + ... + a_kk x_k

(y: new variables; x: old variables)

Defining the New Variables Y

The y_k's are uncorrelated (orthogonal).

y_1 explains as much as possible of the original variance in the data set.

y_2 explains as much as possible of the remaining variance, and so on.

Principal Components Analysis on:

Covariance Matrix:

- Variables must be in the same units
- Emphasizes variables with the most variance
- Mean eigenvalue is the mean of the variances (not necessarily 1.0)

Correlation Matrix:

- Variables are standardized (mean 0.0, SD 1.0)
- Variables can be in different units
- All variables have the same impact on the analysis
- Mean eigenvalue = 1.0
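The practical consequence of this choice can be checked numerically: PCA on standardized variables is the same as PCA on the correlation matrix. A minimal check, reusing the metabolite values from the earlier table:

```python
import numpy as np

# Rows: conditions C1..C3; columns: Metabolites 1..3 (values from the table)
X = np.array([[0.01,  3.0,  0.1],
              [0.02,  0.01, 5.0],
              [0.015, 0.8,  1.3]])

# Standardize each column to mean 0, SD 1; the covariance matrix of the
# standardized data equals the correlation matrix of the original data.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
same = np.allclose(Z.T @ Z / len(Z), np.corrcoef(X, rowvar=False))
print(same)  # True
```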

Covariance Matrix

Covariance is the measure of how much two random variables vary together:

Cov(X_1, X_2) = (1/n) Σ_{i=1}^{n} (x_{1,i} - x̂_1)(x_{2,i} - x̂_2)

where x̂_1 and x̂_2 are the sample means of X_1 and X_2.

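The covariance definition can be turned into a short function. A self-contained sketch with made-up numbers:

```python
def covariance(x, y):
    """Cov(X, Y) = (1/n) * sum_i (x_i - mean(x)) * (y_i - mean(y))."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

# Two variables that increase together have positive covariance:
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 8.0]
print(covariance(x1, x2))  # 2.5
```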
|    | X1   | X2   | X3   |
|----|------|------|------|
| X1 | s1²  | 0.03 | 0.05 |
| X2 | 0.03 | s2²  | 3    |
| X3 | 0.05 | 3    | s3²  |


Diagonalize the covariance matrix:

CovM = D · Diag_CovM · D^T

The eigenvalues are the principal components.

D = EigenVectors = ( a_11 a_12 ... ; ... ; a_k1 ... a_kk )

Diag_CovM = diag(λ_1, λ_2, ...), which tells us how much each PC contributes to a data point.
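The diagonalization step can be sketched with NumPy on toy data (the data, sample size, and dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))       # 50 samples, 3 variables
X[:, 1] += 2.0 * X[:, 0]           # make two variables covary

Xc = X - X.mean(axis=0)            # centre each column
cov = Xc.T @ Xc / len(Xc)          # covariance matrix CovM
eigvals, D = np.linalg.eigh(cov)   # CovM = D @ diag(eigvals) @ D.T

# Sort by decreasing eigenvalue: column 0 of D is the 1st PC direction
order = np.argsort(eigvals)[::-1]
eigvals, D = eigvals[order], D[:, order]

scores = Xc @ D                    # data expressed in PC coordinates
print(eigvals / eigvals.sum())     # fraction of variance per PC
```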

Principal Components are Eigenvalues

[Figure: data scatter with the 1st principal component y_1 (eigenvalue λ_1) and the 2nd principal component y_2 (eigenvalue λ_2) drawn as orthogonal axes through the data.]

Now we have reduced the problem to two variables

[Figure: PC1 vs PC2 scatter plot of the samples, labelled Day 1 through Day 8.]

What if things are still a mess?

Days 3, 4, 5 and 6 do not separate very well

What could we do to try and improve this?

Maybe add an extra PC axis to the plot!

Days separate well with three variables

[Figure: 3-D scatter plot on PC1, PC2, and PC3; the samples, labelled Day 1 through Day 8, now separate clearly.]

Two possible ways to address the
problems

PCA

Clustering

Complex Datasets

Solution 2: Try using all the data and representing it in a low-dimensional figure

If we can cluster the different days
according to some distance function
between all amino acids, we can represent
the data in an intuitive way.

What is data clustering?

Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait.

The number of clusters is usually chosen in advance.

Types of data clustering

Hierarchical

Find successive clusters using previously
established clusters

Agglomerative algorithms begin with each element as a
separate cluster and merge them into successively
larger clusters

Divisive algorithms begin with the whole set and
proceed to divide it into successively smaller clusters

Partitional

Find all clusters at once

First things first: distance is
important

Selecting a distance measure will determine how the data are agglomerated.

Euclidean distance: d(x, y) = sqrt( Σ_{i=1}^{dim} (x_i - y_i)² )

Manhattan distance: d(x, y) = Σ_{i=1}^{dim} |x_i - y_i|

Mahalanobis distance: d(x, y) = sqrt( (x - y)^T Cor⁻¹ (x - y) )

Chebyshev distance: d(x, y) = max_i |x_i - y_i|

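The four distance measures can be written out in a few lines (the example points are illustrative; for Mahalanobis, an inverse covariance or correlation matrix must be supplied):

```python
import numpy as np

def euclidean(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x, y):
    return float(np.sum(np.abs(x - y)))

def chebyshev(x, y):
    return float(np.max(np.abs(x - y)))

def mahalanobis(x, y, inv_cov):
    d = x - y
    return float(np.sqrt(d @ inv_cov @ d))

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(euclidean(a, b))               # 5.0
print(manhattan(a, b))               # 7.0
print(chebyshev(a, b))               # 4.0
print(mahalanobis(a, b, np.eye(2)))  # 5.0 (identity matrix: same as Euclidean)
```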
Reducing the data and finding amino acid
signatures in development

Decide on number of clusters: Three clusters

Do a PCA of the dataset (20 variables, 35 data points)

Use Euclidean Distance

Use a Hierarchical, Divisive Algorithm

Hierarchical, Divisive Clustering: Step 1

One Cluster

Consider all data points as members of a single cluster

Hierarchical, Divisive Clustering:

Step 1.1

Building the Second Cluster

[Diagram: the point furthest from the cluster centroid becomes the seed of a new cluster.]

Hierarchical, Divisive Clustering:

Step 1.1

Building the Second Cluster

Recalculate the centroids; each point that is now further from the old centroid and closer to the new one moves to the new cluster.

Rinse and repeat until…

Hierarchical, Divisive Clustering:

Step 1.2

Finishing a Cluster

Recalculate the centroids: if both centroids become closer, do not move the point to the new cluster.

[Diagram: a candidate point that is further from the old centroid and closer to the new one.]

Hierarchical, Divisive Clustering:

Step 2

Two
Clusters

Use an optimization algorithm to divide the data points so that the Euclidean distance between all points within each of the two clusters is minimal.

Hierarchical, Divisive Clustering:

Step 3

Three
Clusters

Continue dividing the data points until all clusters have been defined.

Reducing the data and finding amino acid
signatures in development

Decide on number of clusters: Three clusters

Use Euclidean Distance

Use a Hierarchical, Agglomerative Algorithm

Hierarchical, Agglomerative Clustering:

Step 1

35 Clusters

Consider each data point as a cluster

Hierarchical, Agglomerative Clustering:

Step 2

Decreasing the number of clusters

Search for the two data points that are closest to each other.

Collapse them into a cluster.

Repeat until you have only three clusters.
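The agglomerative steps above can be sketched as a single-linkage merge loop. A minimal, unoptimized illustration on toy 2-D points:

```python
import numpy as np

def agglomerative(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]   # every point is a cluster
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-linkage distance: closest pair across the clusters
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)             # collapse the closest pair
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
clusters = agglomerative(pts, 3)
print(clusters)  # [[0, 1], [2, 3], [4]]
```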

Reducing the data and finding amino acid
signatures in development

Decide on number of clusters: Three clusters

Use Euclidean Distance

Use a Partitional Algorithm

Partitional Clustering

Search for the three data points that are farthest from each other.

Add points to each of these, according to shortest distance.

Repeat until all points have been assigned to a cluster.
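The partitional steps can be sketched the same way: pick the three mutually farthest points as seeds, then assign every point to its nearest seed. The exhaustive seed search below is only practical for small datasets:

```python
import numpy as np
from itertools import combinations

def partitional(points, k=3):
    """Seed k clusters with the k mutually farthest points, then assign
    each point to the nearest seed."""
    # Choose the k-point subset with the largest sum of pairwise distances
    seeds = max(combinations(range(len(points)), k),
                key=lambda idx: sum(np.linalg.norm(points[i] - points[j])
                                    for i, j in combinations(idx, 2)))
    # Assign every point to the closest seed
    return [min(range(k), key=lambda s: np.linalg.norm(p - points[seeds[s]]))
            for p in points]

pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [10.0, 0.0]])
labels = partitional(pts)
print(labels)  # [0, 0, 1, 1, 2]
```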

Clustering the days of development with
amino acid signatures

Use Euclidean Distance

Use a Clustering Algorithm

Final Notes on Clustering

If more than three PCs are needed to separate the data, we could have used the principal-components matrix and clustered from there.

Clustering can be fuzzy.

Using algorithms such as genetic algorithms, neural networks, or Bayesian networks, one can extract clusters that are completely non-obvious.

Summary

PCA allows for data reduction and decreases the dimensionality of the datasets to be analyzed.

Clustering allows for classification (independent of PCA) and for good visual representations.