Clustering


Principal Component Analysis & Clustering

Prof: Rui Alves

ralves@cmb.udl.es

973702406

Dept Ciencies Mediques Basiques,

1st Floor, Room 1.08

Website of the Course: http://web.udl.es/usuaris/pg193845/Courses/Bioinformatics_2007/

Course: http://10.100.14.36/Student_Server/

Complex Datasets




When studying complex biological samples,
sometimes there are too many variables


For example, when studying Medaka
development using phospho-metabolomics you
may have measurements of many different
amino acids and other metabolites



Question: Can we find markers of development
using these metabolites?


Question: How do we analyze the data?


Problems


How do you visually represent the data?


The sample has many dimensions, so plots
are not a good solution



How do you make sense or extract
information from it?


With so many variables, how do you know
which ones are important for identifying
signatures?

Two possible ways (out of many) to
address the problems


PCA




Clustering

Solution 1: Try a data reduction method


If we can combine the different columns in
specific ways, then maybe we can find a
way to reduce the number of variables that
we need to represent and analyze:



Principal Component Analysis

Variation in data is what identifies
signatures

              Metabolite 1   Metabolite 2   Metabolite 3
Condition C1      0.01           3              0.1
Condition C2      0.02           0.01           5
Condition C3      0.015          0.8            1.3

Variation in data is what identifies
signatures

Virtual Metabolite:

Metabolite 2 + 1/Metabolite 3

The signal is much stronger and separates conditions C1, C2, and C3.

[Plot: values of the virtual metabolite on a single 0 to 20 axis, with C2, C3, and C1 appearing at increasing values.]
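To make this concrete, here is a minimal sketch (plain Python, no libraries needed) that computes the virtual metabolite from the table above:

    # Values from the table: (Metabolite 2, Metabolite 3) for each condition.
    data = {"C1": (3.0, 0.1), "C2": (0.01, 5.0), "C3": (0.8, 1.3)}

    # Virtual metabolite: Metabolite 2 + 1/Metabolite 3
    for condition, (m2, m3) in data.items():
        print(condition, round(m2 + 1.0 / m3, 3))
    # C1 13.0, C2 0.21, C3 1.569 -- one combined axis now separates the conditions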

Principal component analysis


From k “old” variables define k “new”
variables that are linear combinations of
the old variables:

y_1 = a_11 x_1 + a_12 x_2 + ... + a_1k x_k
y_2 = a_21 x_1 + a_22 x_2 + ... + a_2k x_k
...
y_k = a_k1 x_1 + a_k2 x_2 + ... + a_kk x_k

(the y's are the new variables; the x's are the old variables)

Defining the New Variables Y


The y_k's are uncorrelated (orthogonal)

y_1 explains as much as possible of the original variance in the data set

y_2 explains as much as possible of the remaining variance

etc.

Principal Components Analysis on:



Covariance Matrix:


Variables must be in same units


Emphasizes variables with most variance


Mean eigenvalue ≠ 1.0



Correlation Matrix:


Variables are standardized (mean 0.0, SD 1.0)


Variables can be in different units


All variables have same impact on analysis


Mean eigenvalue = 1.0
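As a sketch of what this choice means in practice (assuming numpy is available; X is a toy stand-in for the data matrix): PCA on the correlation matrix is the same as standardizing every variable first and then using the covariance matrix.

    import numpy as np

    X = np.random.rand(35, 20)            # toy data: 35 samples, 20 variables

    # Covariance matrix: variables keep their units, so the ones with
    # the largest variance dominate the analysis.
    cov = np.cov(X, rowvar=False)

    # Correlation matrix: standardize each variable (mean 0, SD 1) first,
    # so every variable has the same impact.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.cov(Z, rowvar=False)

    print(np.allclose(corr, np.corrcoef(X, rowvar=False)))  # True
    print(corr.diagonal().mean())   # 1.0: trace/k, i.e. the mean eigenvalue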

Covariance Matrix

Covariance is the measure of how much two random variables vary together:

Cov(X_1, X_2) = (1/n) Σ_{i=1}^{n} (x_{1,i} − x̂_1)(x_{2,i} − x̂_2)

(x̂_1 and x̂_2 are the means of X_1 and X_2)

        X1     X2     X3
X1      s1²    0.03   0.05
X2             s2²    3
X3                    s3²

Diagonalize matrix

CovM = D · Diag_CovM · Dᵀ

D = EigenVectors =
[ a_11  a_12  ...  ]
[ ...   ...   ...  ]
[ a_k1  ...   a_kk ]

Diag_CovM = diag(λ_1, λ_2, ...)

The eigenvectors in D define the principal components: they tell us how much each variable contributes to each PC, and hence how much each PC contributes to a data point. The eigenvalues λ_i tell us how much variance each PC explains.
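A minimal numpy sketch of this diagonalization (np.linalg.eigh returns the eigenvalues in ascending order, so we reverse them to get the PCs in order of importance):

    import numpy as np

    X = np.random.rand(35, 20)              # toy data: 35 samples, 20 variables
    cov = np.cov(X, rowvar=False)           # CovM, a k x k matrix

    eigvals, eigvecs = np.linalg.eigh(cov)  # CovM = D . diag(lambda) . D^T
    order = np.argsort(eigvals)[::-1]       # sort PCs by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Project the centered data onto the PCs: y = a_1 x_1 + ... + a_k x_k
    Y = (X - X.mean(axis=0)) @ eigvecs
    print(eigvals / eigvals.sum())          # fraction of variance per PC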

Principal Components are Eigenvalues

[Figure: two-variable scatter plot; the 1st principal component, y_1 (eigenvalue λ_1), points along the direction of greatest variance, and the 2nd principal component, y_2 (eigenvalue λ_2), is orthogonal to it.]

Now we have reduced the problem to two variables

[Figure: scatter plot of the samples projected onto PC1 and PC2, labeled Day 1 through Day 8.]

What if things are still a mess?

Days 3, 4, 5 and 6 do not separate very well.

What could we do to try and improve this?

Maybe add an extra PC axis to the plot!!!
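Before adding axes it is worth checking how much variance each PC actually explains; a sketch with scikit-learn (assuming it is installed; X stands in for the metabolite matrix):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(35, 20)        # toy stand-in for the data matrix

    pca = PCA(n_components=5)
    scores = pca.fit_transform(X)     # samples projected onto the first 5 PCs

    # Cumulative fraction of variance explained: the third entry tells
    # you how much a PC3 axis would add to the PC1/PC2 plot.
    print(np.cumsum(pca.explained_variance_ratio_))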

Days separate well with three variables

[Figure: 3-D scatter plot of the samples on PC1, PC2 and PC3, labeled Day 1 through Day 8; the days now separate.]

Two possible ways to address the
problems


PCA




Clustering

Complex Datasets



Solution 2: Try using all the data and representing it
in a low-dimensional figure


If we can cluster the different days
according to some distance function
between all amino acids, we can represent
the data in an intuitive way.

What is data clustering?


Clustering is the classification of objects
into different groups or, more precisely,
the partitioning of a data set into subsets
(clusters), so that the data in each subset
(ideally) share some common trait


The number of clusters is usually
defined in advance

Types of data clustering


Hierarchical


Find successive clusters using previously
established clusters



Agglomerative algorithms begin with each element as a
separate cluster and merge them into successively
larger clusters



Divisive algorithms begin with the whole set and
proceed to divide it into successively smaller clusters




Partitional



Find all clusters at once


First things first: distance is
important


Selecting a distance measure determines
how the data are agglomerated:


Euclidean distance: d(x, x′) = sqrt( Σ_{i=1}^{dim} (x_i − x′_i)² )

Manhattan distance: d(x, x′) = Σ_{i=1}^{dim} |x_i − x′_i|

Mahalanobis distance: d(x) = sqrt( (x − x̂)ᵀ Cor⁻¹ (x − x̂) )

Chebyshev distance: d(x, x′) = max_i |x_i − x′_i|
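A minimal numpy sketch of these four distances (cov stands for the covariance/correlation matrix of the data, estimated elsewhere):

    import numpy as np

    def euclidean(a, b):
        return np.sqrt(np.sum((a - b) ** 2))

    def manhattan(a, b):
        return np.sum(np.abs(a - b))

    def chebyshev(a, b):
        return np.max(np.abs(a - b))

    def mahalanobis(a, b, cov):
        # Distance weighted by the inverse covariance of the data,
        # so strongly correlated directions count less.
        d = a - b
        return np.sqrt(d @ np.linalg.inv(cov) @ d)

    a, b = np.array([0.0, 3.0, 0.1]), np.array([0.02, 0.01, 5.0])
    print(euclidean(a, b), manhattan(a, b), chebyshev(a, b))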
Reducing the data and finding amino acid
signatures in development



Decide on number of clusters: Three clusters


Do a PCA of the dataset (20 variables, 35
datapoints)


Use Euclidean Distance


Use a Hierarchical, Divisive Algorithm


Hierarchical, Divisive Clustering: Step 1


One Cluster



Consider all data points as members of one cluster

Hierarchical, Divisive Clustering:

Step 1.1


Building the Second Cluster


[Figure: the point furthest from the cluster centroid becomes the seed of a new cluster.]

Hierarchical, Divisive Clustering:

Step 1.1


Building the Second Cluster


[Figure: recalculate the centroids; add the next point that is further from the old centroid and closer to the new one. Rinse and repeat until…]

Hierarchical, Divisive Clustering:

Step 1.2


Finishing a Cluster


[Figure: keep adding points that are further from the old centroid and closer to the new one, recalculating the centroids after each addition; if both centroids move closer together, do not add the point and stop adding to the cluster.]

Hierarchical, Divisive Clustering:

Step 2


Two
Clusters



Use an optimization algorithm to divide the data
points so that the Euclidean distance between all
points within each of the two clusters is minimal

Hierarchical, Divisive Clustering:

Step 3


Three
Clusters



Continue dividing datapoints until all clusters
have been defined
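A compact numpy sketch of this divisive scheme (simplified: points are reassigned in one pass rather than one at a time, and the function names are illustrative, not a library API):

    import numpy as np

    def split(cluster):
        # Seed a new cluster with the point furthest from the centroid,
        # then move over every point that ends up closer to the seed
        # than to the old centroid (one-pass version of the slides).
        centroid = cluster.mean(axis=0)
        seed = cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]
        closer = (np.linalg.norm(cluster - seed, axis=1)
                  < np.linalg.norm(cluster - centroid, axis=1))
        return cluster[~closer], cluster[closer]

    def divisive(X, n_clusters):
        clusters = [X]
        while len(clusters) < n_clusters:
            # Divide the cluster with the largest spread around its centroid.
            spread = [np.linalg.norm(c - c.mean(axis=0), axis=1).sum()
                      for c in clusters]
            clusters += split(clusters.pop(int(np.argmax(spread))))
        return clusters

    X = np.random.rand(35, 3)               # e.g. 35 samples on three PCs
    for c in divisive(X, 3):
        print(len(c), "points")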

Reducing the data and finding amino acid
signatures in development



Decide on number of clusters: Three clusters


Use Euclidean Distance


Use a Hierarchical, Agglomerative Algorithm


Hierarchical, Agglomerative Clustering:

Step 1


35 Clusters



Consider each data point as a cluster

Hierarchical, Agglomerative Clustering:

Step 2


Decreasing the number of clusters



Search for the two data points that are closest to each
other


Collapse them into a cluster


Repeat until you have only three clusters
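This is what standard libraries implement directly; a sketch with scipy (assuming it is installed), where single linkage makes "closest" mean the smallest Euclidean distance between clusters:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(35, 20)        # 35 samples x 20 amino acids

    # Repeatedly merge the two closest clusters, recording every
    # merge in the tree Z (agglomerative clustering).
    Z = linkage(X, method="single", metric="euclidean")

    # Cut the tree at the level where exactly three clusters remain.
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(labels)                     # cluster id (1..3) for each sample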

Reducing the data and finding amino acid
signatures in development



Decide on number of clusters: Three clusters


Use Euclidean Distance


Use a Partitional Algorithm


Partitional Clustering



Search for the three data points that are farthest from each
other


Add points to each of these, according to shortest distance


Repeat until all points have been partitioned into a cluster
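A numpy sketch of this scheme; the greedy farthest-first seeding is an assumption on my part, since finding the exactly farthest triple would mean checking every combination:

    import numpy as np

    def partitional(X, k=3):
        # Greedy farthest-first seeding: start with the point farthest
        # from the overall centroid, then repeatedly take the point
        # farthest from all seeds chosen so far.
        seeds = [int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))]
        while len(seeds) < k:
            d = np.min([np.linalg.norm(X - X[s], axis=1) for s in seeds], axis=0)
            seeds.append(int(np.argmax(d)))
        # Partition: each point joins the seed it is closest to.
        dists = np.stack([np.linalg.norm(X - X[s], axis=1) for s in seeds])
        return np.argmin(dists, axis=0)

    X = np.random.rand(35, 20)
    print(partitional(X, 3))          # cluster id (0..2) for each sample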

Clustering the days of development with
amino acid signatures


Get your data matrix


Use Euclidean Distance


Use a Clustering Algorithm


Final Notes on Clustering


If more than three PCs are needed to
separate the data, we could have used the
principal components matrix and clustered from
there


Clustering can be fuzzy


Using algorithms such as genetic algorithms,
neural networks, or Bayesian networks, one
can extract clusters that are completely
non-obvious

Summary


PCA allows for data reduction and
decreases the dimensionality of the
datasets to be analyzed



Clustering allows for classification
(independent of PCA) and for
good visual representations