Chapter 12 Clustering, Distance Methods, and Ordination


25 Nov 2013


12.2 Similarity Measures

Commonly used distances between two p-dimensional observations x and y:

Euclidean Distance:

$$d(\mathbf{x},\mathbf{y}) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_p-y_p)^2} = \sqrt{(\mathbf{x}-\mathbf{y})^t(\mathbf{x}-\mathbf{y})}$$


Statistical Distance:

$$d(\mathbf{x},\mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})^t\,\mathbf{S}^{-1}(\mathbf{x}-\mathbf{y})},$$

where $\mathbf{S}$ is the sample variance-covariance matrix.
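
As a rough illustration of the two distances above, the following sketch computes both for small vectors. The function names are hypothetical, and the statistical distance is written only for the case p = 2 so that the inverse of S can be spelled out by hand; it is a sketch under those assumptions, not a general implementation.

```python
import math


def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def statistical_2d(x, y, S):
    """Statistical distance sqrt((x-y)^t S^{-1} (x-y)), assuming p = 2.

    S is a 2x2 variance-covariance matrix given as nested lists.
    The 2x2 inverse is written out explicitly, so this does not
    generalize to p > 2.
    """
    (a, b), (c, d) = S
    det = a * d - b * c
    if det == 0:
        raise ValueError("S must be invertible")
    u, v = x[0] - y[0], x[1] - y[1]
    # S^{-1} = (1/det) * [[d, -b], [-c, a]], so the quadratic form is:
    q = (d * u * u - (b + c) * u * v + a * v * v) / det
    return math.sqrt(q)


x, y = [3.0, 4.0], [0.0, 0.0]
print(euclidean(x, y))                                   # 5.0
print(statistical_2d(x, y, [[1.0, 0.0], [0.0, 1.0]]))    # equals the Euclidean distance
print(statistical_2d(x, y, [[4.0, 0.0], [0.0, 1.0]]))    # first coordinate is down-weighted
```

When S is the identity matrix the two distances coincide; a larger variance in one coordinate shrinks that coordinate's contribution to the statistical distance.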

Minkowski Distance:

$$d(\mathbf{x},\mathbf{y}) = \left[\sum_{i=1}^{p} |x_i - y_i|^m\right]^{1/m}$$
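
A minimal sketch of the Minkowski distance follows; the function name `minkowski` is an illustrative choice. With m = 1 the formula gives the city-block distance, and with m = 2 it reduces to the Euclidean distance.

```python
def minkowski(x, y, m):
    """Minkowski distance: (sum_i |x_i - y_i|^m)^(1/m), for m > 0."""
    if m <= 0:
        raise ValueError("m must be positive")
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)


x, y = [3.0, 4.0], [0.0, 0.0]
print(minkowski(x, y, 1))  # city-block distance: 3 + 4
print(minkowski(x, y, 2))  # Euclidean distance
```

Larger values of m give more weight to the coordinates with the largest differences.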


Canberra Metric:

$$d(\mathbf{x},\mathbf{y}) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{x_i + y_i}$$

Czekanowski Coefficient:

$$d(\mathbf{x},\mathbf{y}) = 1 - \frac{2\sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)}$$
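
The Canberra metric and the Czekanowski coefficient are intended for nonnegative variables. The sketch below uses hypothetical function names, and skipping a Canberra term whose denominator is zero is an assumed convention for the sketch, not something the notes specify.

```python
def canberra(x, y):
    """Canberra metric: sum_i |x_i - y_i| / (x_i + y_i), nonnegative inputs."""
    total = 0.0
    for a, b in zip(x, y):
        if a + b != 0:  # assumption: a 0/0 term contributes nothing
            total += abs(a - b) / (a + b)
    return total


def czekanowski(x, y):
    """Czekanowski coefficient: 1 - 2*sum_i min(x_i, y_i) / sum_i (x_i + y_i)."""
    numerator = 2.0 * sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return 1.0 - numerator / denominator


x, y = [1.0, 3.0], [3.0, 1.0]
print(canberra(x, y))      # 2/4 + 2/4
print(czekanowski(x, y))   # 1 - 2*(1 + 1)/(4 + 4)
print(czekanowski(x, x))   # identical vectors give 0
```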

12.3 Hierarchical Clustering Methods

Agglomerative Hierarchical Clustering Algorithm (Grouping $N$ Objects):

1. Start with $N$ clusters, each containing a single entity, and an $N \times N$ symmetric matrix of distances (or similarities) $\mathbf{D} = \{d_{ik}\}$.




2. Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between the "most similar" clusters $U$ and $V$ be $d_{UV}$.

3. Merge clusters $U$ and $V$. Label the newly formed cluster $(UV)$. Update the entries in the distance matrix by
   (a) deleting the rows and columns corresponding to clusters $U$ and $V$, and
   (b) adding a row and column giving the distances between cluster $(UV)$ and the remaining clusters.

4. Repeat Steps 2 and 3 a total of $N-1$ times. (All objects will be in a single cluster after the algorithm terminates.) Record the identity of the clusters that are merged and the levels (distances) at which the merges take place.
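
The four steps above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration: the function name `agglomerate`, the tuple cluster labels, and the `combine` parameter (the rule that turns the two old distances to a cluster W into the new distance from the merged cluster to W) are assumptions of the sketch. The default `min` corresponds to single linkage; passing `max` gives complete linkage.

```python
def agglomerate(D, combine=min):
    """Agglomerative clustering on an N x N symmetric distance matrix D.

    Returns a list of (cluster_u, cluster_v, level) merges, where each
    cluster is a tuple of original object indices.
    """
    n = len(D)
    # Step 1: N singleton clusters and their pairwise distances.
    labels = [(i,) for i in range(n)]
    dist = {}
    for i in range(n):
        for k in range(i + 1, n):
            dist[frozenset((labels[i], labels[k]))] = D[i][k]

    merges = []
    while len(labels) > 1:
        # Step 2: find the nearest pair of clusters.
        pair = min(dist, key=dist.get)
        u, v = sorted(pair)
        level = dist[pair]

        # Step 3: merge U and V, then update the distance table.
        uv = tuple(sorted(u + v))
        labels.remove(u)
        labels.remove(v)
        new_dist = {
            w: combine(dist[frozenset((u, w))], dist[frozenset((v, w))])
            for w in labels
        }
        # (a) delete every entry involving U or V ...
        dist = {p: d for p, d in dist.items() if u not in p and v not in p}
        # (b) ... and add the distances from (UV) to the remaining clusters.
        for w, d in new_dist.items():
            dist[frozenset((uv, w))] = d
        labels.append(uv)

        # Step 4: record which clusters merged and at what level.
        merges.append((u, v, level))
    return merges


# A small illustrative distance matrix for five objects.
D = [
    [0, 9, 3, 6, 11],
    [9, 0, 7, 5, 10],
    [3, 7, 0, 9, 2],
    [6, 5, 9, 0, 8],
    [11, 10, 2, 8, 0],
]
for u, v, level in agglomerate(D):
    print(u, v, level)
```

The loop runs exactly N - 1 times, since each pass reduces the number of clusters by one. Ties in Step 2 are broken here by dictionary order, which is an arbitrary choice.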


There are three linkage methods. They differ in how the distance between the merged cluster $(UV)$ and any other cluster $W$ is defined.

(I) Single Linkage:

$$d_{(UV)W} = \min\{d_{UW},\, d_{VW}\}$$


(II) Complete Linkage:

$$d_{(UV)W} = \max\{d_{UW},\, d_{VW}\}$$


(III) Average Linkage:

$$d_{(UV)W} = \frac{\sum_{i}\sum_{k} d_{ik}}{N_{(UV)}\, N_W},$$

where $d_{ik}$ is the distance between object $i$ in the cluster $(UV)$ and object $k$ in the cluster $W$, and $N_{(UV)}$ and $N_W$ are the number of items in clusters $(UV)$ and $W$, respectively.
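
To compare the three linkages on concrete numbers, the sketch below evaluates each one directly from object-level distances. The function names are hypothetical. Single and complete linkage are computed as the minimum and maximum over all object pairs, which agrees with the update rules $\min\{d_{UW}, d_{VW}\}$ and $\max\{d_{UW}, d_{VW}\}$ when those rules are applied at every merge.

```python
def single_linkage(D, A, B):
    """Smallest distance between an object in cluster A and one in cluster B."""
    return min(D[i][k] for i in A for k in B)


def complete_linkage(D, A, B):
    """Largest distance between an object in cluster A and one in cluster B."""
    return max(D[i][k] for i in A for k in B)


def average_linkage(D, A, B):
    """Sum of all between-cluster distances divided by N_A * N_B."""
    return sum(D[i][k] for i in A for k in B) / (len(A) * len(B))


# A small illustrative distance matrix for five objects.
D = [
    [0, 9, 3, 6, 11],
    [9, 0, 7, 5, 10],
    [3, 7, 0, 9, 2],
    [6, 5, 9, 0, 8],
    [11, 10, 2, 8, 0],
]
UV = [2, 4]   # merged cluster (UV), as object indices
W = [1, 3]    # another cluster W
print(single_linkage(D, UV, W))    # min of 7, 9, 10, 8
print(complete_linkage(D, UV, W))  # max of 7, 9, 10, 8
print(average_linkage(D, UV, W))   # (7 + 9 + 10 + 8) / (2 * 2)
```

The average-linkage value always lies between the single-linkage and complete-linkage values for the same pair of clusters.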