Cluster analysis

dealerdeputyAI and Robotics

Nov 25, 2013 (3 years and 9 months ago)


Lecture 15: Cluster analysis

Species  Sequence
P.sym    AAATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTTCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.xan    AAATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTAATATTCCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.pola   AAATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTCCGTATGCTATGTAGCTGGAGGGTACTGACGGTAG
C.plat   AAATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTAAGGGTACTGATTTTAG
C.grad   AAATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTCCGGGTACTGATTTTAG
D.sym    TTATGCGAGACGTGAAAAATCTTTAGGGCTAAGGTGATTATTTCGGTTGCTATGTAGAGGAAGGGTACTGACGGTAG

Cluster analysis is a two-step process that requires the choice of (a) a distance metric and (b) a linkage algorithm.

Cluster analysis tries to minimize within-cluster distances and to maximize between-cluster distances.


The distance metric

         P.sym  P.xan  P.pola  C.plat  C.grad  D.sym
P.sym      0      2      3       7       9      13
P.xan      2      0      4      11      11      15
P.pola     3      4      0      10      10      12
C.plat     7     11     10       0       2      19
C.grad     9     11     10       2       0      19
D.sym     13     15     12      19      19       0

In the simplest case, a distance matrix counts the number of differences between two data sets.
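Counting differences between two aligned sequences can be sketched as follows. This is only an illustration: the function names and the three short toy sequences are assumptions and are not the lecture data.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs):
    """Pairwise difference counts for a dict mapping name -> sequence."""
    names = list(seqs)
    return {(p, q): hamming(seqs[p], seqs[q]) for p in names for q in names}


# Hypothetical toy sequences, chosen only to keep the example short
toy = {"sp1": "AAATGC", "sp2": "AAATCC", "sp3": "TTATCC"}
D_toy = distance_matrix(toy)

for (p, q), d in D_toy.items():
    if p < q:
        print(p, q, d)
```

The resulting matrix is symmetric with zeros on the diagonal, like the species matrix above.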

Species presence-absence matrix A

         Site 1  Site 2  Site 3  Site 4
P.sym      1       0       1       1
P.xan      1       0       0       1
P.pola     0       1       0       1
C.plat     0       1       1       1
C.grad     1       0       0       0
D.sym      1       0       1       1
Sum        4       2       3       5

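Distance matrix D = A^T A

         Site 1  Site 2  Site 3  Site 4
Site 1     4       0       2       3
Site 2     0       2       1       2
Site 3     2       1       3       3
Site 4     3       2       3       5

The diagonal of D holds the number of species at each site; the off-diagonal entries hold the number of species shared by two sites (joint).

Soerensen index

$$S = \frac{2\,\text{joint}}{\text{Site A} + \text{Site B}}$$

Jaccard index

$$S = \frac{\text{joint}}{\text{Site A} + \text{Site B} - \text{joint}}$$

For sites 1 and 2 the Soerensen index is

$$S_{1,2} = \frac{2 \cdot 0}{4 + 2} = 0$$

Soerensen

         Site 1    Site 2    Site 3    Site 4
Site 1   1         0         0.571429  0.666667
Site 2   0         1         0.4       0.571429
Site 3   0.571429  0.4       1         0.75
Site 4   0.666667  0.571429  0.75      1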
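The step from the presence-absence matrix A to the Soerensen and Jaccard indices can be retraced with a short sketch. The matrix is the one from the lecture; the function names are assumptions made for this example.

```python
# Presence-absence matrix A from the lecture (rows: species, columns: sites 1-4)
A = [
    [1, 0, 1, 1],  # P.sym
    [1, 0, 0, 1],  # P.xan
    [0, 1, 0, 1],  # P.pola
    [0, 1, 1, 1],  # C.plat
    [1, 0, 0, 0],  # C.grad
    [1, 0, 1, 1],  # D.sym
]


def joint_matrix(a):
    """Compute A^T A: entry (i, j) counts the species shared by sites i and j."""
    n_sites = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n_sites)]
            for i in range(n_sites)]


def soerensen(d, i, j):
    """2 * joint / (species at site i + species at site j)."""
    return 2 * d[i][j] / (d[i][i] + d[j][j])


def jaccard(d, i, j):
    """joint / (species at site i + species at site j - joint)."""
    return d[i][j] / (d[i][i] + d[j][j] - d[i][j])


D = joint_matrix(A)
print(D)
print(soerensen(D, 0, 2), jaccard(D, 0, 2))
```

Site indices are zero-based here, so `soerensen(D, 0, 2)` compares Site 1 with Site 3.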
Abundance data

         Site 1  Site 2  Site 3  Site 4
P.sym     0.31    0.12    0.24    0.05
P.xan     0.20    0.65    0.54    0.44
P.pola    0.38    0.81    0.28    0.52
C.plat    0.35    0.69    0.86    0.30
C.grad    0.07    0.99    0.64    0.84
D.sym     0.43    0.78    0.73    0.21
Sum       1.75    4.04    3.30    2.36







Euclidean distance

$$D_{ij} = \sqrt{\sum_{k=1}^{n} \left(a_{ik} - a_{jk}\right)^{2}}$$

Manhattan distance

$$D_{ij} = \sum_{k=1}^{n} \left|a_{ik} - a_{jk}\right|$$

Correlation distance

$$D_{ij} = r_{ij}$$

Correlation distance matrix

         Site 1    Site 2    Site 3    Site 4
Site 1    1        -0.27534  -0.04805  -0.71587
Site 2   -0.27534   1         0.519139  0.807173
Site 3   -0.04805   0.519139  1         0.157251
Site 4   -0.71587   0.807173  0.157251  1
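The three metrics can be tried on the abundance table with a small sketch. The helper names are assumptions. The table values are rounded to two decimals, so the correlations computed here need not match the printed correlation matrix exactly.

```python
import math

# Abundance data from the lecture (rows: species, columns: sites 1-4), two decimals
abundance = [
    [0.31, 0.12, 0.24, 0.05],  # P.sym
    [0.20, 0.65, 0.54, 0.44],  # P.xan
    [0.38, 0.81, 0.28, 0.52],  # P.pola
    [0.35, 0.69, 0.86, 0.30],  # C.plat
    [0.07, 0.99, 0.64, 0.84],  # C.grad
    [0.43, 0.78, 0.73, 0.21],  # D.sym
]


def site(j):
    """Abundances of all species at site j (zero-based column index)."""
    return [row[j] for row in abundance]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation coefficient r between two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


d_euclid = euclidean(site(0), site(1))
d_manhattan = manhattan(site(0), site(1))
r_12 = pearson(site(0), site(1))
print(d_euclid, d_manhattan, r_12)
```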

Bray-Curtis distance

$$D_{ij} = \frac{\sum_{k=1}^{n} \left|a_{ik} - a_{jk}\right|}{\sum_{k=1}^{n} a_{ik} + \sum_{k=1}^{n} a_{jk}}$$
Due to squaring, Euclidean distances put particular weight on outliers. The Euclidean distance needs a linear scale.

The Manhattan distance needs linear scales. Despite a large distance, the metric might be zero.

Correlations are sensitive to non-linearities in the data.

The Bray-Curtis distance is equivalent to the Soerensen index for presence-absence data. It suffers from the same shortcoming as the Manhattan distance.
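The link between Bray-Curtis and Soerensen can be checked on presence-absence data. The sketch below assumes the usual dissimilarity form of Bray-Curtis (sum of absolute differences divided by the sum of both totals), so its complement 1 - D is what matches the Soerensen index. The two vectors are the Site 3 and Site 4 columns of the presence-absence matrix A; the function names are assumptions.

```python
def bray_curtis(x, y):
    """Sum of absolute differences divided by the sum of both totals."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))


def soerensen_binary(x, y):
    """Soerensen index for two 0/1 vectors: 2 * joint / (total x + total y)."""
    joint = sum(a * b for a, b in zip(x, y))
    return 2 * joint / (sum(x) + sum(y))


# Site 3 and Site 4 columns of the presence-absence matrix A
site3 = [1, 0, 0, 1, 0, 1]
site4 = [1, 1, 1, 1, 0, 1]

bc = bray_curtis(site3, site4)
s = soerensen_binary(site3, site4)
print(bc, s, 1 - bc)
```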


Linkage algorithm

We first combine the species that are nearest to each other to form an inner cluster. In the next step we look for the species or cluster that is closest to the average distance of the initial cluster. We continue this procedure until all species are grouped.

The single linkage algorithm tends to produce many small clusters.

[Figure: clustering of P.sym, P.xan, P.pola, C.plat, C.grad and D.sym]
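The stepwise procedure can be sketched in plain Python using the species distance matrix from the lecture. This is a minimal illustration, not an optimized implementation. The function names are assumptions, and ties are broken by taking the first pair found.

```python
from itertools import combinations

names = ["P.sym", "P.xan", "P.pola", "C.plat", "C.grad", "D.sym"]
# Species distance matrix from the lecture, same order as `names`
dist = [
    [0, 2, 3, 7, 9, 13],
    [2, 0, 4, 11, 11, 15],
    [3, 4, 0, 10, 10, 12],
    [7, 11, 10, 0, 2, 19],
    [9, 11, 10, 2, 0, 19],
    [13, 15, 12, 19, 19, 0],
]


def single_linkage(d, c1, c2):
    """Minimum distance between any member of c1 and any member of c2."""
    return min(d[i][j] for i in c1 for j in c2)


def average_linkage(d, c1, c2):
    """Mean pairwise distance between the members of c1 and c2."""
    return sum(d[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))


def agglomerate(d, linkage):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(i,) for i in range(len(d))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage(d, pair[0], pair[1]))
        merges.append((a, b, linkage(d, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


single = agglomerate(dist, single_linkage)
average = agglomerate(dist, average_linkage)
for a, b, height in average:
    print([names[i] for i in a], "+", [names[i] for i in b], "at", round(height, 3))
```

With either linkage rule, D.sym is the last species to join the others for this matrix.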

Sequential versus simultaneous algorithms
In simultaneous algorithms the final solution is obtained in a single step and not stepwise as in the single linkage above.

Agglomeration versus division algorithms
Agglomerative procedures operate bottom-up, divisive procedures top-down.

Monothetic versus polythetic algorithms
Polythetic procedures use several descriptors of linkage; monothetic procedures use the same one at each step (for instance maximum association).

Hierarchical versus non-hierarchical algorithms
Hierarchical methods proceed in a non-overlapping way. During the linkage process all members of lower clusters are members of the next higher cluster. Non-hierarchical methods proceed by optimizing within-group homogeneity. Hence they might include members not contained in higher-order clusters.

The single linkage algorithm uses the minimum distance between the members of two clusters as the measure of cluster distance. It favours chains of small clusters.

The average linkage algorithm uses average distances between clusters. It frequently gives larger clusters. The most often used average linkage algorithm is the Unweighted Pair Group Method with Arithmetic Mean (UPGMA).

The Ward algorithm calculates the total sum of squared deviations from the mean of a cluster and assigns members so as to minimize this sum. The method often gives clusters of rather equal size.

Median clustering tries to minimize within-cluster variance.
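The cluster distances behind the single, average and Ward rules can be written compactly. The notation is an assumption and does not appear in the lecture: A and B are two clusters, δ(x, y) is the chosen distance metric, and μ_A is the mean (centroid) of cluster A. For Ward, the quantity shown is the increase in the total within-cluster sum of squares caused by merging A and B; the pair with the smallest increase is merged.

```latex
\begin{align*}
d_{\text{single}}(A,B)  &= \min_{x \in A,\; y \in B} \delta(x,y) \\
d_{\text{average}}(A,B) &= \frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} \delta(x,y) \\
\Delta_{\text{Ward}}(A,B) &= \frac{|A|\,|B|}{|A| + |B|} \, \lVert \mu_A - \mu_B \rVert^{2}
\end{align*}
```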

To check the performance of different cluster algorithms and distance metrics we use a matrix of random numbers.

Which clusters to accept?

Different cluster algorithms give different results. We accept those clusters that are stable irrespective of algorithm. In the case of our random numbers, clustering is very unstable.

Two methods detected the clusters OP and ABC. All other items are not clearly separated. The position of item F remains unclear.
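The acceptance rule, keeping only clusters that every algorithm finds, can be expressed as a set intersection. The two cluster lists below are hypothetical outputs invented for illustration. Only the shared clusters OP and ABC are taken from the text.

```python
def stable_clusters(*results):
    """Return the clusters (as frozensets of items) found by every method."""
    as_sets = [set(frozenset(cluster) for cluster in result) for result in results]
    return set.intersection(*as_sets)


# Hypothetical outputs of two clustering methods (not the lecture's results)
method_1 = [{"O", "P"}, {"A", "B", "C"}, {"D", "E", "F"}]
method_2 = [{"O", "P"}, {"A", "B", "C"}, {"D", "E"}, {"F", "G"}]

stable = stable_clusters(method_1, method_2)
print(sorted(sorted(cluster) for cluster in stable))
```

In this toy case item F sits in a different cluster in each method, so no cluster containing F is accepted.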

Clustering using a predefined number of clusters: K-means

[Figure: items in the order O, P, A, B, D, C, F, E, H, K, I, L, N, M, J, G]

K-means clustering starts from a predefined number of clusters and then arranges the items in such a way that the distances between clusters are maximized with respect to the distances within the clusters.

Technically, the algorithm first randomly assigns cluster means and then places items (each time calculating new cluster means) until an optimal solution (convergence) has been reached.

K-means always uses Euclidean distances.
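A minimal sketch of the iteration, in the common batch form (Lloyd's algorithm): all items are assigned first and the means are then recomputed, rather than updating the means after each single item. The starting means are fixed here instead of random so that the run is reproducible. The points and the function name are assumptions.

```python
import math


def kmeans(points, means, max_iter=100):
    """Alternate assignment and mean update until the means stop changing."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean
        labels = [min(range(len(means)), key=lambda c: math.dist(p, means[c]))
                  for p in points]
        # Update step: recompute each mean from its current members
        new_means = []
        for c in range(len(means)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_means.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_means.append(means[c])  # keep the mean of an empty cluster
        if new_means == means:
            break  # convergence: nothing moved
        means = new_means
    return labels, means


# Hypothetical two-dimensional items forming two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, means = kmeans(points, means=[(0, 0), (10, 10)])
print(labels, means)
```

Different starting means can lead to different solutions, so the algorithm is usually run several times from different random starts.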

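Neighbour joining

[Figure: three trees with a root and the leaves A to F; the inner nodes X and Y are added step by step]

Neighbour joining is particularly used to generate phylogenetic trees.

Dissimilarities

You need dissimilarities (phylogenetic distances) δ(X,Y) between all elements X and Y.

Calculate the sum of dissimilarities for each element X and the Q-value for each pair:

$$\Delta(X) = \sum_{i=1}^{n} \delta(X, Y_i)$$

$$Q(X,Y) = (n-2)\,\delta(X,Y) - \Delta(X) - \Delta(Y)$$

Select the pair with the lowest value of Q. Its members A and B are joined in a new node U.

Calculate the distances from the new node:

$$\delta(A,U) = \frac{(n-2)\,\delta(A,B) + \Delta(A) - \Delta(B)}{2(n-2)}$$

$$\delta(B,U) = \frac{(n-2)\,\delta(A,B) - \Delta(A) + \Delta(B)}{2(n-2)}$$

Calculate new dissimilarities:

$$\delta(X,U_{AB}) = \frac{\delta(X,A) + \delta(X,B) - \delta(A,B)}{2}$$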

Distance matrix

              Mouse  Raven  Octopus  Lumbricus
Mouse          0      0.2    0.6      0.7
Raven          0.2    0      0.6      0.8
Octopus        0.6    0.6    0        0.5
Lumbricus      0.7    0.8    0.5      0
Delta values   1.5    1.6    1.7      2

Q-values
Mouse/Raven        -2.7
Mouse/Octopus      -2
Mouse/Lumbricus    -2.1
Raven/Octopus      -2.1
Raven/Lumbricus    -2
Octopus/Lumbricus  -2.7

Distance matrix

              Mouse  Raven  Protostomia
Mouse          0      0.2    0.4
Raven          0.2    0      0.45
Protostomia    0.4    0.45   0
Delta values   0.6    0.65   0.85

Q-values
Mouse/Raven        -1.25
Mouse/Protostomia  -1.05
Raven/Protostomia  -0.6

Distance matrix

              Vertebrata  Protostomia
Vertebrata     0           0.075
Protostomia    0.075       0
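The first neighbour-joining step of the worked example can be retraced with a short sketch. It uses the four-taxon distance matrix from the lecture and covers only the first join (Octopus with Lumbricus, giving the node called Protostomia). The function names are assumptions.

```python
names = ["Mouse", "Raven", "Octopus", "Lumbricus"]
# Four-taxon distance matrix from the lecture, same order as `names`
d = [
    [0.0, 0.2, 0.6, 0.7],
    [0.2, 0.0, 0.6, 0.8],
    [0.6, 0.6, 0.0, 0.5],
    [0.7, 0.8, 0.5, 0.0],
]


def delta_sums(d):
    """Delta(X): sum of the dissimilarities from X to all other elements."""
    return [sum(row) for row in d]


def q_values(d):
    """Q(X,Y) = (n - 2) * delta(X,Y) - Delta(X) - Delta(Y) for every pair."""
    n, s = len(d), delta_sums(d)
    return {(i, j): (n - 2) * d[i][j] - s[i] - s[j]
            for i in range(n) for j in range(i + 1, n)}


def new_dissimilarities(d, a, b):
    """Dissimilarity of every remaining element X to the node U joining a and b."""
    return {x: (d[x][a] + d[x][b] - d[a][b]) / 2
            for x in range(len(d)) if x not in (a, b)}


q = q_values(d)
for (i, j), value in q.items():
    print(f"{names[i]}/{names[j]}: {value:.2f}")

# Join Octopus (index 2) and Lumbricus (index 3), as in the lecture
to_protostomia = new_dissimilarities(d, 2, 3)
print({names[x]: round(v, 3) for x, v in to_protostomia.items()})
```

Mouse/Raven and Octopus/Lumbricus tie for the lowest Q-value in this matrix, so either pair could be joined first.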
Homework and literature

Refresh:
Distance metrics
Euclidean distance
Manhattan distance
UPGMA
Ward clustering
Neighbor joining
K-means clustering

Literature:
http://en.wikipedia.org/wiki/Cluster_analysis
http://statsoft.com/textbook/