http://creativecommons.org/licenses/by-sa/2.0/
Principal Component Analysis & Clustering

Prof: Rui Alves
ralves@cmb.udl.es
973702406
Dept Ciencies Mediques Basiques, 1st Floor, Room 1.08

Website of the course: http://web.udl.es/usuaris/pg193845/Courses/Bioinformatics_2007/
Course: http://10.100.14.36/Student_Server/
Complex Datasets

• When studying complex biological samples there are sometimes too many variables.
• For example, when studying Medaka development using phospho-metabolomics you may have measurements of many different amino acids and other metabolites.
• Question: Can we find markers of development using these metabolites?
• Question: How do we analyze the data?
Problems

• How do you visually represent the data?
  – The sample has many dimensions, so simple plots are not a good solution.
• How do you make sense of the data, or extract information from it?
  – With so many variables, how do you know which ones are important for identifying signatures?
Two possible ways (out of many) to address the problems

• PCA
• Clustering
Solution 1: Try a data reduction method

• If we can combine the different columns in specific ways, then maybe we can reduce the number of variables that we need to represent and analyze:
  – Principal Component Analysis
Variation in data is what identifies signatures

                Metabolite 1   Metabolite 2   Metabolite 3   …
Condition C1        0.01           3              0.1
Condition C2        0.02           0.01           5
Condition C3        0.015          0.8            1.3
Variation in data is what identifies signatures

Virtual metabolite: Metabolite 2 + 1/Metabolite 3
The signal is much stronger and separates conditions 1, 2, and 3.

[Plot: values of the virtual metabolite on a 0–20 axis; C2 and C3 sit near the low end, with C1 well above them.]
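A quick way to see this is to redo the arithmetic from the table above; a minimal sketch in Python:

```python
# Virtual metabolite: Metabolite 2 + 1/Metabolite 3, for each condition
# in the table above.
metabolite2 = {"C1": 3.0, "C2": 0.01, "C3": 0.8}
metabolite3 = {"C1": 0.1, "C2": 5.0, "C3": 1.3}

for cond in ("C1", "C2", "C3"):
    virtual = metabolite2[cond] + 1.0 / metabolite3[cond]
    print(cond, round(virtual, 2))
# C1 13.0, C2 0.21, C3 1.57 -- the combined variable separates the three
# conditions much more cleanly than any single metabolite does.
```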
Principal component analysis

• From k "old" variables define k "new" variables that are linear combinations of the old variables:

  y_1 = a_{11} x_1 + a_{12} x_2 + ... + a_{1k} x_k
  y_2 = a_{21} x_1 + a_{22} x_2 + ... + a_{2k} x_k
  ...
  y_k = a_{k1} x_1 + a_{k2} x_2 + ... + a_{kk} x_k

  (y: new variables; x: old variables)
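In practice the coefficients a_{ij} come out of a PCA fit; a minimal sketch with scikit-learn, using an illustrative random 35 × 20 matrix in place of the real metabolite table:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(35, 20))      # 35 samples x 20 metabolites (illustrative)

pca = PCA()
Y = pca.fit_transform(X)           # scores: the new variables y_1..y_k

# Each row of pca.components_ holds the coefficients a_i1..a_ik that
# define one new variable as a linear combination of the old ones.
print(pca.components_.shape)       # (20, 20)
print(Y.shape)                     # (35, 20)
```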
Defining the New Variables Y

• The y_k's are uncorrelated (orthogonal)
• y_1 explains as much as possible of the original variance in the data set
• y_2 explains as much as possible of the remaining variance
• etc.
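These properties can be checked numerically; a small sketch on illustrative random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(35, 20))
Y = PCA().fit_transform(X)

# The covariance matrix of the scores is (numerically) diagonal: the
# y's are uncorrelated, and the diagonal variances decrease because
# each PC explains as much as possible of the remaining variance.
score_cov = np.cov(Y, rowvar=False)
print(np.allclose(score_cov, np.diag(np.diag(score_cov))))  # True
print(np.all(np.diff(np.diag(score_cov)) <= 1e-10))         # decreasing
```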
Principal Components Analysis on:

• Covariance Matrix:
  – Variables must be in the same units
  – Emphasizes the variables with the most variance
  – Mean eigenvalue ≠ 1.0
• Correlation Matrix:
  – Variables are standardized (mean 0.0, SD 1.0)
  – Variables can be in different units
  – All variables have the same impact on the analysis
  – Mean eigenvalue = 1.0
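In code the choice boils down to whether you standardize the columns first; a minimal sketch (the mixed-scale data and the use of StandardScaler are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(35, 20)) * rng.uniform(1, 100, size=20)  # mixed scales

# PCA on the covariance matrix: raw columns; high-variance variables dominate.
pca_cov = PCA().fit(X)

# PCA on the correlation matrix: standardize first (mean 0.0, SD 1.0),
# so every variable has the same impact regardless of its units.
pca_cor = PCA().fit(StandardScaler().fit_transform(X))

print(np.mean(pca_cov.explained_variance_))  # driven by big-variance columns
print(np.mean(pca_cor.explained_variance_))  # ~1.0 for standardized data
```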
Covariance Matrix

• Covariance is the measure of how much two random variables vary together:

  Cov(X_1, X_2) = \frac{1}{n} \sum_{i=1}^{n} (x_{i,1} - \hat{x}_1)(x_{i,2} - \hat{x}_2)

       X1      X2      X3     …
  X1   s_1²    0.03    0.05   …
  X2   …       s_2²    3      …
  X3   …       …       s_3²   …
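numpy computes the whole matrix in one call; a minimal sketch on illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(35, 20))          # 35 samples x 20 metabolites

# Entry (i, j) measures how much metabolites i and j vary together;
# the diagonal holds the per-metabolite variances s_i^2.
cov_matrix = np.cov(X, rowvar=False)
print(cov_matrix.shape)                       # (20, 20)
print(np.allclose(cov_matrix, cov_matrix.T))  # True: symmetric
```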
Covariance Matrix

Diagonalize the matrix:

  CovM = D \cdot Diag_{CovM} \cdot D^T

Eigenvalues are the principal components

  D = EigenVectors =
      ( a_{11}  a_{12}  ...
        ...     ...     ...
        a_{k1}  ...     a_{kk} )

  Diag_{CovM} =
      ( λ_1  0    0
        0    λ_2  0
        0    0    ... )

The diagonal of eigenvalues tells us how much each PC contributes to a data point.
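A sketch of the diagonalization with numpy (np.linalg.eigh is the right tool because covariance matrices are symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(35, 20))
cov_matrix = np.cov(X, rowvar=False)

# Diagonalize: CovM = D @ diag(eigenvalues) @ D.T. eigh returns the
# eigenvalues in ascending order, so flip them to put the
# largest-variance component first.
eigenvalues, D = np.linalg.eigh(cov_matrix)
eigenvalues, D = eigenvalues[::-1], D[:, ::-1]

print(np.allclose(D @ np.diag(eigenvalues) @ D.T, cov_matrix))  # True
```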
Principal Components are Eigenvalues

[Scatter plot of the data with the eigenvector directions overlaid: λ_1 lies along the 1st principal component, y_1, and λ_2 along the 2nd principal component, y_2.]
Now we have reduced the problem to two variables

[Plot: PC1 vs PC2 scores of the development samples, labeled Day 1 through Day 8.]
What if things are still a mess?

• Days 3, 4, 5, and 6 do not separate very well
• What could we do to try and improve this?
• Maybe add an extra PC axis to the plot!

[Plot: 3D scatter of the PC1, PC2, and PC3 scores.]
Days separate well with three variables

[Plot: the same 3D PC1/PC2/PC3 scatter with samples labeled Day 1 through Day 8; the days now separate clearly.]
Two possible ways to address the problems

• PCA
• Clustering
Complex Datasets

Solution 2: Try using all the data and representing it in a low-dimensional figure

• If we can cluster the different days according to some distance function between all amino acids, we can represent the data in an intuitive way.
What is data clustering?

• Clustering is the classification of objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters) so that the data in each subset (ideally) share some common trait
• The number of clusters is usually defined in advance
Types of data clustering

• Hierarchical
  – Find successive clusters using previously established clusters
    • Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters
    • Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters
• Partitional
  – Find all clusters at once
First things first: distance is important

• Selecting a distance measure will determine how the data are agglomerated:

  Euclidean distance:    d(x, y) = \sqrt{\sum_{i=1}^{dim} (x_i - y_i)^2}
  Manhattan distance:    d(x, y) = \sum_{i=1}^{dim} |x_i - y_i|
  Mahalanobis distance:  d(x, y) = \sqrt{(x - y)^T Cor^{-1} (x - y)}
  Tchebyshev distance:   d(x, y) = \max_i |x_i - y_i|
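All four are available in scipy; a minimal sketch (the inverse-covariance matrix VI used for Mahalanobis is an illustrative placeholder):

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])

print(distance.euclidean(x, y))    # sqrt of summed squared differences
print(distance.cityblock(x, y))    # Manhattan: summed absolute differences
print(distance.chebyshev(x, y))    # Tchebyshev: largest absolute difference

# Mahalanobis needs the inverse covariance of the data set the points
# come from; here it is estimated from illustrative random data.
data = np.random.default_rng(0).normal(size=(50, 3))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```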
Reducing the data and finding amino acid signatures in development

• Decide on the number of clusters: three clusters
• Do a PCA of the dataset (20 variables, 35 data points)
• Use Euclidean distance
• Use a hierarchical, divisive algorithm
Hierarchical, Divisive Clustering: Step 1 – One Cluster

• Consider all data points as members of one cluster
Hierarchical, Divisive Clustering: Step 1.1 – Building the Second Cluster

[Diagram: the point furthest from the current centroid becomes the seed of a new cluster.]
Hierarchical, Divisive Clustering: Step 1.1 – Building the Second Cluster

[Diagram: recalculate the centroid, then add the next point that is further from the old centroid and closer to the new one. Rinse and repeat until…]
Hierarchical, Divisive Clustering: Step 1.2 – Finishing a Cluster

[Diagram: keep adding points that are further from the old centroid and closer to the new one, recalculating the centroids each time; if the two centroids move closer together, do not add the point and stop adding to the cluster.]
Hierarchical, Divisive Clustering: Step 2 – Two Clusters

• Use an optimization algorithm to divide the data points in such a way that the Euclidean distance between all points within each of the two clusters is minimal
Hierarchical, Divisive Clustering: Step 3 – Three Clusters

• Continue dividing the data points until all clusters have been defined
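One reading of this seed-growing split, sketched in code; the exact stopping rule is an assumption based on the diagrams above, not a standard library routine:

```python
import numpy as np

def divisive_split(X):
    """One divisive step: grow a second cluster from the point that is
    furthest from the centroid (an illustrative sketch of the slides'
    procedure; the stopping rule is an assumption)."""
    a = list(range(len(X)))                 # everything starts in cluster A
    centroid_a = X[a].mean(axis=0)
    seed = max(a, key=lambda i: np.linalg.norm(X[i] - centroid_a))
    b = [seed]                              # furthest point seeds cluster B
    a.remove(seed)
    while len(a) > 1:
        centroid_a = X[a].mean(axis=0)
        centroid_b = X[b].mean(axis=0)
        # candidates: points now closer to B's centroid than to A's
        cands = [i for i in a
                 if np.linalg.norm(X[i] - centroid_b)
                 < np.linalg.norm(X[i] - centroid_a)]
        if not cands:
            break                           # no point prefers the new cluster
        i = max(cands, key=lambda j: np.linalg.norm(X[j] - centroid_a))
        b.append(i)
        a.remove(i)
    return a, b
```

Calling divisive_split again on each resulting half continues the hierarchy until the desired three clusters have been defined.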
Reducing the data and finding amino acid signatures in development

• Decide on the number of clusters: three clusters
• Use Euclidean distance
• Use a hierarchical, agglomerative algorithm
Hierarchical, Agglomerative Clustering: Step 1 – 35 Clusters

• Consider each data point as a cluster
Hierarchical, Agglomerative Clustering: Step 2 – Decreasing the number of clusters

• Search for the two data points that are closest to each other
• Collapse them into a cluster
• Repeat until you have only three clusters
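scipy's hierarchical clustering implements exactly this merge-the-closest loop; a minimal sketch, assuming an illustrative 35 × 20 data matrix like the one in the slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(35, 20))          # 35 days/samples x 20 amino acids

# 'single' linkage repeatedly merges the two closest clusters,
# starting from 35 singleton clusters, using Euclidean distance.
Z = linkage(X, method="single", metric="euclidean")

# Cut the merge tree so that exactly three clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)                          # cluster id (1-3) for each sample
```

The linkage method is a design choice: "single" merges by closest pair, while "complete" or "average" measure cluster-to-cluster distance differently and can give different trees.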
Reducing the data and finding amino acid signatures in development

• Decide on the number of clusters: three clusters
• Use Euclidean distance
• Use a partitional algorithm
Partitional Clustering

• Search for the three data points that are farthest from each other
• Add points to each of these, according to the shortest distance
• Repeat until all points have been partitioned into a cluster
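k-means is the classic partitional algorithm; it seeds the three clusters differently from the farthest-points recipe above but follows the same assign-to-the-nearest-centroid idea. A minimal sklearn sketch on illustrative data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(35, 20))           # 35 days/samples x 20 amino acids

# Partition all samples into three clusters at once: each point joins
# the nearest of three centroids (Euclidean distance), and centroids
# are re-estimated until the assignments stop changing.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                   # cluster id (0-2) for each sample
print(kmeans.cluster_centers_.shape)    # (3, 20)
```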
Clustering the days of development with amino acid signatures

• Get your data matrix
• Use Euclidean distance
• Use a clustering algorithm
Final Notes on Clustering

• If more than three PCs are needed to separate the data, we could have used the principal-components matrix and clustered from there
• Clustering can be fuzzy
• Using algorithms such as genetic algorithms, neural networks, or Bayesian networks, one can extract clusters that are completely non-obvious
Summary

• PCA allows for data reduction and decreases the dimensions of the datasets to be analyzed
• Clustering allows for classification (independent of PCA) and allows for good visual representations