Chapter 5. Cluster Analysis
Section 5.1. Introduction
Cluster analysis is probably one of the most interesting fields in computer-oriented data analysis. It was called numerical taxonomy by biologists [Jardine and Sibson 1971] and is now widely used in marketing research [Green and Rao 1972], data compression [Kang, Lee, Chang and Chang 1976], computer operating system design [Ferrari 1974], information retrieval [Salton 1965], and data base systems design [Subas, Kashyap and Yao 1974].
Roughly speaking, the problem of cluster analysis is defined as follows: We are given a set of samples and we would like to group these samples into homogeneous clusters. Consider Fig. 5.1. Every reasonable person would agree that there are three clusters. That we can detect these clusters is due to the fact that there are only two variables and we can therefore display the data easily.
Fig. 5.1 (scatter plot of the samples; axes $X_1$ and $X_2$)
Consider the data in Table 5.1 and Table 5.2.
             Variables
Samples    X1    X2    X3    X4
   1       5.7   2.8   4.5   1.3
   2       6.3   3.3   4.7   1.6
   3       4.9   2.4   3.3   1.0
   4       6.6   2.9   4.6   1.3
   5       5.0   3.4   1.5   0.2
   6       6.9   3.1   4.9   1.5
   7       5.5   2.3   4.0   1.3
   8       6.5   2.8   4.6   1.5
   9       5.0   3.6   1.4   0.2
  10       5.1   3.5   1.4   0.2
  11       5.4   3.9   1.7   0.4
  12       4.6   3.4   1.4   0.3
  13       4.9   3.1   1.5   0.1
  14       7.0   3.2   4.7   1.4
  15       4.9   3.0   1.4   0.2
  16       4.7   3.2   1.3   0.2
  17       6.4   3.2   4.5   1.5
  18       4.6   3.1   1.5   0.2
  19       4.4   2.9   1.4   0.2
  20       5.2   2.7   3.9   1.4
Table 5.1
Position     1   9  10  13  28  36  41  43  44  52  54  55  58  66  68  70  91  96  97 100 108 110 111 112
Human        V   I   M   S   V   T   H   L   F   P   Y   S   A   I   G   D   V   K   E   A   K   T   N   E
Monkey       V   I   M   S   V   T   H   L   F   P   Y   S   A   T   G   D   V   K   E   A   K   T   N   E
Horse        V   V   Q   A   V   T   H   L   F   P   F   T   D   T   K   E   A   K   T   E   K   T   N   E
Pig          V   V   Q   A   V   T   H   L   F   P   F   S   D   T   G   E   A   K   G   E   K   T   N   E
Dog          V   V   Q   A   V   T   H   L   F   P   F   S   D   T   G   E   A   T   G   A   K   T   N   E
Whale        V   V   Q   A   V   T   H   L   F   P   F   S   D   T   G   E   A   K   G   A   K   T   N   E
Rabbit       V   V   Q   A   V   T   H   L   F   P   F   S   D   T   G   D   A   K   D   A   K   T   N   E
Kangaroo     V   V   Q   A   V   T   N   L   F   P   F   T   D   I   G   D   A   K   G   A   K   T   N   E
Chicken      I   V   Q   S   V   T   H   L   F   E   F   S   D   T   G   D   A   K   S   V   D   T   S   K
Penguin      I   V   Q   S   V   T   H   L   F   E   F   S   D   T   G   D   A   K   S   A   D   T   S   K
Duck         V   V   Q   S   V   T   H   L   F   E   F   S   D   T   G   D   A   K   S   A   D   T   A   K
Turtle       V   V   Q   A   V   T   H   L   I   E   F   S   D   T   G   E   A   K   A   A   D   T   S   K
Bullfrog     V   V   Q   A   C   V   Y   L   I   A   F   S   E   T   G   D   A   K   G   Q   S   C   S   K
Table 5.2
It is difficult to group these data into clusters. For the case in Table 5.1, the samples are points in 4-space and no ordinary person can perceive points in 4-space very well. The data in Table 5.2 are even more peculiar; they are not numerical. In fact, they are data from Dayhoff [1969] and are concerned with sequences of amino acids in a protein molecule, cytochrome c, found in the mitochondria of animals and higher plants. Only the positions shown in the table are recorded.
In this book, we shall discuss several cluster analysis methods and the reader will gradually see how clusters can be found in both kinds of data.
We shall first assume that every sample $Y_i$ is of the following form:
$Y_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$.
Unlike the data discussed in Chapters 2, 3 and 4, the data which will be discussed in this chapter do not have to be numerical. In general, we may have three types of data:
(1) Type I data: every variable assumes numerical values. A typical sample is: (1.0, 3.1, -1.5, 11.3).
(2) Type II data: every variable is symbolic. A typical sample is: (a, f, g, h).
(3) Type III data: some variables are symbolic and some are numerical. A typical sample is: (b, [illegible], [illegible], 5.3, 3.6).
The reader may now wonder how one can possibly have Type III data. Actually, it is not extraordinary at all to find this kind of samples. Imagine that every sample is related to a person and that we have five features as follows:
color of hair,
body weight,
body height,
age,
religion.
There are possibly several distinct colors for a person’s hair, such as black, brown, white, red and so on. The most natural way to represent these colors is to code them differently as follows:
black: a,
brown: b,
white: c,
red: d,
and so on.
The reader may still wonder why one cannot code these colors as numbers. For instance, we may code them as follows:
black: 1
brown: 2
white: 3
red: 4
There is a severe problem associated with this kind of approach. Note that the difference between “black” and “white” would then be 2, while the difference between “black” and “brown” would be 1. This is of course not reasonable, because the codes carry no order or magnitude. Therefore, we shall try our best to avoid using numbers for symbolic variables.
Section 5.2. Distance Functions Used in Cluster Analysis.
The basic concept used in cluster analysis is the concept of “similarity”. If two samples are similar, they should be grouped together. To measure the similarity between two samples, it suffices to measure the distance between them. If the distance is small, these samples must be similar. Thus, the job of cluster analysis is to group together samples whose distances are small.
Since there are three different types of data, we have to define three types of distance functions. In the rest of the section, we shall assume that
$Y_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$ and $Y_j = (x_{j1}, x_{j2}, \ldots, x_{jN})$.
(1) Type I data:
This kind of data are numerical data. We can use at least two well-known distance functions:
(a) Euclidean distances:
$d_{ij} = \left( \sum_{k=1}^{N} (x_{ik} - x_{jk})^2 \right)^{1/2}$.
For instance, if $Y_i = (3.0, 4.1, -1.5, 9.1)$ and $Y_j = (4.0, 3.1, -1.7, 8.5)$, then
$d_{ij} = \left( (3.0 - 4.0)^2 + (4.1 - 3.1)^2 + (-1.5 + 1.7)^2 + (9.1 - 8.5)^2 \right)^{1/2} = \left( 1.0^2 + 1.0^2 + 0.2^2 + 0.6^2 \right)^{1/2} = 1.55$.
(b) City block distances:
$d_{ij} = \sum_{k=1}^{N} |x_{ik} - x_{jk}|$.
For instance, in the above example,
$d_{ij} = |3.0 - 4.0| + |4.1 - 3.1| + |-1.5 + 1.7| + |9.1 - 8.5| = 1.0 + 1.0 + 0.2 + 0.6 = 2.8$.
The Euclidean distances require calculating the square roots of numbers, which is quite time consuming. In cluster analysis, it is often necessary to calculate a large number of distances. Any kind of saving would therefore be desirable. As the reader will see, we can often avoid calculating the square roots because we are only interested in comparing two distances, not in the exact values of the two distances, and the comparison gives the same result for the squared distances.
No matter whether one uses the Euclidean distances or the city block distances, one should be careful to make sure that all of the variables are properly normalized, so that the units of measurement will not play dominating roles. One popular and good method of normalization is to normalize the variables with respect to their variances, which was discussed in Chapter 2.
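The two Type I distance functions can be sketched in a few lines of Python. This is only an illustration: the function names and the third sample `y_far` are assumptions introduced here, and the sample values are the ones that reproduce the worked results 1.55 and 2.8. The last lines show that comparing squared distances gives the same ordering as comparing Euclidean distances, so the square root can be skipped when only a comparison is needed.

```python
import math

def squared_euclidean(y_i, y_j):
    """Sum of squared component differences (no square root taken)."""
    return sum((a - b) ** 2 for a, b in zip(y_i, y_j))

def euclidean(y_i, y_j):
    """Euclidean distance between two numerical samples."""
    return math.sqrt(squared_euclidean(y_i, y_j))

def city_block(y_i, y_j):
    """City block distance: sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(y_i, y_j))

y_i = (3.0, 4.1, -1.5, 9.1)
y_j = (4.0, 3.1, -1.7, 8.5)
y_far = (9.0, 9.0, 9.0, 9.0)  # hypothetical third sample, for comparison only

print(round(euclidean(y_i, y_j), 2))   # 1.55
print(round(city_block(y_i, y_j), 1))  # 2.8

# Comparing squared distances ranks the samples the same way.
closer_without_root = squared_euclidean(y_i, y_j) < squared_euclidean(y_i, y_far)
closer_with_root = euclidean(y_i, y_j) < euclidean(y_i, y_far)
print(closer_without_root == closer_with_root)  # True
```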
(2) Type II data:
This type of data are non-numerical and we shall introduce the so-called Hamming distances. For Hamming distances,
$d_{ij} = \sum_{k=1}^{N} |x_{ik}, x_{jk}|$,
where
$|x_{ik}, x_{jk}| = 1$ if $x_{ik} \neq x_{jk}$, and $|x_{ik}, x_{jk}| = 0$ otherwise.
For instance, if $Y_i = (a, b, c, a)$ and $Y_j = (a, c, c, d)$, then
$d_{ij} = |a, a| + |b, c| + |c, c| + |a, d| = 0 + 1 + 0 + 1 = 2$.
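As a minimal sketch, the Hamming distance simply counts the positions at which two symbolic samples disagree. The function name `hamming` is an assumption introduced here for illustration.

```python
def hamming(y_i, y_j):
    """Number of variables on which two symbolic samples differ."""
    return sum(1 for a, b in zip(y_i, y_j) if a != b)

y_i = ("a", "b", "c", "a")
y_j = ("a", "c", "c", "d")
print(hamming(y_i, y_j))  # 2
```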
Just as with the Type I data, we sometimes have to normalize, in some sense, the variables also. For instance, consider the following set of samples:
$Y_1$: (A, A, B)
$Y_2$: (A, B, C)
$Y_3$: (A, C, C)
$Y_4$: (A, D, B)
$Y_5$: (B, D, C)
According to the definition of Hamming distances, the distance between $Y_1$ and $Y_2$ is the same as that between $Y_4$ and $Y_5$ (both are equal to 2). However, by examining the samples more carefully, one would find that sample $Y_5$ is actually quite different from all other samples in one respect: the value of $X_1$ is B for $Y_5$ while the values of $X_1$ are A for all other samples.
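To make sure that $Y_5$ appears more distinct, we have to modify the definition of Hamming distances mentioned above by giving $X_1$ more weight than the other variables. There are probably many methods to assign more weight. The method we shall introduce uses the average difference concept: a variable on which the samples rarely differ receives a large weight. That is, the distance can now be defined as follows:
$d(Y_i, Y_j) = \sum_{k=1}^{N} w_k \, |x_{ik}, x_{jk}|$,
where
$|x_{ik}, x_{jk}| = 1$ if $x_{ik} \neq x_{jk}$, and $|x_{ik}, x_{jk}| = 0$ otherwise,
and
$w_k = \dfrac{M(M-1)/2}{\sum_{i=1}^{M-1} \sum_{j=i+1}^{M} |x_{ik}, x_{jk}|}$
(M is the total number of samples.) In other words, $w_k$ is the reciprocal of the average difference of variable $X_k$ over all $M(M-1)/2$ distinct pairs of samples.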
Let us illustrate the foregoing idea through an example. Consider the following set of samples:
$Y_1$: (A, A, B)
$Y_2$: (A, B, C)
$Y_3$: (A, C, C)
$Y_4$: (A, D, B)
$Y_5$: (B, D, C)
The weights are calculated as follows:
$w_1 = 10/4 = 2.5$
$w_2 = 10/9 = 1.1$
$w_3 = 10/6 = 1.66$
The distance between $Y_1$ and $Y_2$ and the distance between $Y_1$ and $Y_5$ are therefore:
$d_{12} = 2.5 \times 0 + 1.1 \times 1 + 1.66 \times 1 = 2.76$
and
$d_{15} = 2.5 \times 1 + 1.1 \times 1 + 1.66 \times 1 = 5.26$
respectively.
The reader can now see that we have achieved our goal because the distance between $Y_1$ and $Y_5$ is much greater than that between $Y_1$ and $Y_2$.
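The weighted Hamming distance can be sketched as follows. The function names are assumptions introduced here, and the sketch assumes that no variable has the same value in every sample (otherwise the average difference is zero and the weight is undefined). Because the weights are not rounded, the results differ slightly from hand calculations that use rounded weights such as 1.1 and 1.66.

```python
from itertools import combinations

def mismatch(a, b):
    """|a, b| in the text: 1 if the two symbols differ, 0 otherwise."""
    return 0 if a == b else 1

def variable_weights(samples):
    """w_k = (number of distinct pairs) / (number of pairs that differ on variable k)."""
    m = len(samples)
    n_pairs = m * (m - 1) // 2
    weights = []
    for k in range(len(samples[0])):
        differing = sum(mismatch(s[k], t[k]) for s, t in combinations(samples, 2))
        weights.append(n_pairs / differing)  # assumes differing > 0
    return weights

def weighted_hamming(y_i, y_j, weights):
    return sum(w * mismatch(a, b) for w, a, b in zip(weights, y_i, y_j))

samples = [
    ("A", "A", "B"),
    ("A", "B", "C"),
    ("A", "C", "C"),
    ("A", "D", "B"),
    ("B", "D", "C"),
]
w = variable_weights(samples)
print([round(x, 2) for x in w])                               # [2.5, 1.11, 1.67]
print(round(weighted_hamming(samples[0], samples[1], w), 2))  # 2.78
print(round(weighted_hamming(samples[0], samples[4], w), 2))  # 5.28
```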
We now proceed to define distances among Type III data.
(3) Type III data:
As discussed before, Type III data are of mixed type because variables can be both numerical and non-numerical. In this case, we can still have a reasonable way to define distances:
$d_{ij} = \left( \sum_{k=1}^{N} w_k \, c(x_{ik}, x_{jk})^2 \right)^{1/2}$,
where $w_k$ is the reciprocal of the average difference of variable $x_k$ among all distinct pairs of samples,
$c(x_{ik}, x_{jk}) = |x_{ik}, x_{jk}|$ if $x_k$ is non-numerical
and
$c(x_{ik}, x_{jk}) = |x_{ik} - x_{jk}|$ if $x_k$ is numerical.
For instance, consider the following set of samples:
$Y_1$: (A, B, 5.0, 10.0)
$Y_2$: (A, C, 6.0, 11.0)
$Y_3$: (B, C, 7.0, 9.0).
In this case,
$w_1 = 3/2 = 1.5$,
$w_2 = 3/2 = 1.5$,
$w_3 = \dfrac{3}{|5.0 - 6.0| + |6.0 - 7.0| + |5.0 - 7.0|} = 3/4 = 0.75$
and
$w_4 = \dfrac{3}{|10 - 11| + |10 - 9| + |11 - 9|} = 3/4 = 0.75$.
Some distances are calculated as follows:
$d_{12} = \left( 1.5 \times 0^2 + 1.5 \times 1^2 + 0.75 \times 1^2 + 0.75 \times 1^2 \right)^{1/2} = (1.5 + 0.75 + 0.75)^{1/2} = (3.0)^{1/2} = 1.732$
$d_{13} = \left( 1.5 \times 1^2 + 1.5 \times 1^2 + 0.75 \times 2^2 + 0.75 \times 1^2 \right)^{1/2} = (1.5 + 1.5 + 3.0 + 0.75)^{1/2} = (6.75)^{1/2} = 2.6$
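A sketch of the mixed-type distance is given below. It assumes, for illustration only, that symbolic values are stored as Python strings and numerical values as floats; the function names are likewise assumptions. Each weight is taken as the number of distinct pairs divided by the summed pairwise differences, which is what the worked values 1.5 and 0.75 correspond to.

```python
import math
from itertools import combinations

def component_difference(a, b):
    """c(a, b): 0/1 mismatch for symbols, absolute difference for numbers."""
    if isinstance(a, str) or isinstance(b, str):
        return 0.0 if a == b else 1.0
    return abs(a - b)

def variable_weights(samples):
    """Reciprocal of the average pairwise difference of each variable."""
    m = len(samples)
    n_pairs = m * (m - 1) // 2
    weights = []
    for k in range(len(samples[0])):
        total = sum(component_difference(s[k], t[k])
                    for s, t in combinations(samples, 2))
        weights.append(n_pairs / total)  # assumes the variable is not constant
    return weights

def mixed_distance(y_i, y_j, weights):
    return math.sqrt(sum(w * component_difference(a, b) ** 2
                         for w, a, b in zip(weights, y_i, y_j)))

samples = [
    ("A", "B", 5.0, 10.0),
    ("A", "C", 6.0, 11.0),
    ("B", "C", 7.0, 9.0),
]
w = variable_weights(samples)
print(w)                                                    # [1.5, 1.5, 0.75, 0.75]
print(round(mixed_distance(samples[0], samples[1], w), 3))  # 1.732
print(round(mixed_distance(samples[0], samples[2], w), 3))  # 2.598
```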
Section 5.3. The Distance Matrix and the Similarity Matrix
Since all of our cluster analysis techniques are based upon the distance concept, it is often appropriate to display all of the distances in the form of a distance matrix D(M, M), where M is the number of samples and D(i, j) is the distance between the ith sample and the jth sample.
Consider Table 5.1. The distance matrix D is as follows:
       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
  1  0.0  0.9  1.5  0.9  3.3  1.3  0.7  0.8  3.5  3.4  3.2  3.5  3.3  1.4  3.4  3.6  0.8  3.4  3.5  0.8
  2  0.9  0.0  2.3  0.6  3.7  0.7  1.5  0.6  3.8  3.8  3.4  3.9  3.8  0.7  3.8  4.0  0.3  3.9  4.0  1.5
  3  1.5  2.3  0.0  2.2  2.2  2.7  0.9  2.1  2.4  2.4  2.3  2.3  2.1  2.7  2.2  2.3  2.1  2.1  2.1  0.8
  4  0.9  0.6  2.2  0.0  3.7  0.5  1.4  0.2  3.8  3.8  3.4  3.9  3.7  0.5  3.8  4.0  0.4  3.9  4.0  1.6
  5  3.3  3.7  2.2  3.7  0.0  4.1  3.0  3.7  0.2  0.2  0.7  0.4  0.3  4.0  0.4  0.4  3.6  0.5  0.8  2.8
  6  1.3  0.7  2.7  0.5  4.1  0.0  1.9  0.6  4.2  3.8  4.4  4.2  4.2  0.3  4.2  4.4  0.7  4.3  4.5  2.0
  7  0.7  1.5  1.0  1.4  3.0  1.9  0.0  1.3  3.2  3.1  2.9  3.1  3.0  1.9  3.0  3.2  1.4  3.0  3.1  0.5
  8  0.8  0.6  2.1  0.2  3.7  0.6  1.3  0.0  3.9  3.8  3.5  4.0  3.8  0.7  3.8  4.0  0.4  3.9  4.0  1.5
  9  3.5  3.8  2.4  3.8  0.2  4.2  3.2  3.9  0.0  0.1  0.6  0.5  0.5  4.1  0.6  0.5  3.7  0.7  0.9  2.9
 10  3.4  3.8  2.4  3.8  0.2  4.2  3.2  3.8  0.1  0.0  0.6  0.5  0.5  4.0  0.5  0.5  3.6  0.7  0.9  2.9
 11  3.5  3.9  2.3  3.9  0.4  4.4  3.1  4.0  0.5  0.5  1.0  0.0  0.5  4.2  0.5  0.3  3.8  0.3  0.6  2.9
 12  3.5  3.9  2.3  3.9  0.4  4.4  3.1  4.0  0.5  0.5  1.0  0.0  0.5  4.2  0.5  0.3  3.8  0.3  0.6  2.9
 13  4.5  3.9  2.3  3.9  0.3  4.2  3.0  3.8  0.5  0.5  1.0  0.5  0.0  4.0  0.2  0.3  3.6  0.3  0.6  2.8
 14  1.4  0.7  2.7  0.5  4.0  0.3  1.9  0.7  4.1  4.0  3.6  4.2  4.0  0.0  4.1  4.3  0.6  4.2  4.4  2.0
 15  3.4  3.9  2.2  3.8  0.4  4.2  2.9  3.8  0.6  0.5  1.0  0.5  0.2  4.1  0.0  0.3  3.7  0.3  0.5  2.8
 16  3.6  4.0  2.3  4.0  0.4  4.4  3.2  4.0  0.5  0.5  1.1  0.3  0.3  4.3  0.3  0.0  3.9  0.2  0.4  3.0
 17  0.8  0.3  2.1  0.4  3.6  0.7  1.4  0.4  3.7  3.6  3.3  3.8  3.6  0.6  3.7  3.9  0.0  3.7  3.9  1.4
 18  3.4  3.9  2.1  3.9  0.5  4.3  3.0  3.9  0.7  0.7  1.2  0.3  0.3  4.2  0.3  0.2  3.7  0.0  0.3  2.8
 19  3.5  4.1  2.2  4.0  0.8  4.5  3.1  4.0  0.9  0.9  1.5  0.6  0.6  4.4  0.5  0.4  3.9  0.3  0.0  2.9
 20  0.8  1.5  0.8  1.6  2.8  2.0  0.5  1.5  2.9  2.9  2.7  2.9  2.8  2.0  2.8  3.0  1.4  2.8  2.9  0.0
Table 5.3
We now define the concept of similarity $s_{ij}$ between two samples. The question is: What do we mean by saying that two samples are similar? One way of defining similarity is to use the concept of threshold. That is, we shall say that
$s_{ij} = 1$ if $d_{ij} \leq T$
and
$s_{ij} = 0$ otherwise.
Again we may use a similarity matrix S to display the similarity relationships among the samples. The similarity matrix S of the set of data in Table 5.1, for T=2.0, is shown in Table 5.4.
      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
  1   1  1  1  1  0  1  1  1  0  0  0  0  0  1  0  0  1  0  0  1
  2   1  1  0  1  0  1  1  1  0  0  0  0  0  1  0  0  1  0  0  1
  3   1  0  1  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  1
  4   1  1  0  1  0  0  1  1  0  0  0  0  0  1  0  0  1  0  0  1
  5   0  0  0  0  1  0  0  0  1  1  1  1  1  0  1  1  0  1  1  0
  6   1  1  0  1  0  1  1  1  0  0  0  0  0  1  0  0  1  0  0  1
  7   1  1  1  1  0  1  1  1  0  0  0  0  0  1  0  0  1  0  0  1
  8   1  1  0  1  0  1  1  1  0  0  0  0  0  1  0  0  1  0  0  1
  9   0  0  0  0  1  0  0  0  1  1  1  1  1  0  1  1  0  1  1  0
 10   0  0  0  0  1  0  0  0  1  1  1  1  1  0  1  1  0  1  1  0
 11   0  0  0  0  1  0  0  0  1  1  1  1  1  0  1  1  0  1  1  0
 12   0  0  0  0  1  0  0  0  1  1  1  1  1  0  1  1  0  1  1  0
 13   0  0  0  0  1  0  0  0  1  1  1  1  1  0  1  1  0  1  1  0
 14   1  1  0  1  0  1  1  1  0  0  0  0  0  1  0  0  1  0  0  0
 15   0  0  0  0  1  0  0  0  1  1  1  1  1  0  1  1  0  1  1  0
 16   0  0  0  0  1  0  0  0  1  1  1  1  1  0  1  1  0  1  1  0
 17   1  1  0  1  0  1  1  1  0  0  0  0  0  1  0  0  1  0  0  1
 18   0  0  0  0  1  0  0  0  1  1  1  1  1  0  1  1  0  1  1  0
 19   0  0  0  0  1  0  0  0  1  1  1  1  1  0  1  1  0  1  1  0
 20   1  1  1  1  1  1  1  1
Table 5.4
The trouble is: We still cannot easily detect clusters by examining Table 5.4. Actually, we deliberately show Table 5.4 in this section because in Table 5.8, it will be shown how we can visualize clusters, if they do exist, by rearranging the above matrix. The reader has to be a little bit patient at this moment.
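The construction of a distance matrix and of a thresholded similarity matrix can be sketched as below, using the samples of Table 5.1 and unnormalized Euclidean distances. The names `SAMPLES`, `distance_matrix` and `similarity_matrix` are assumptions introduced for this illustration, and the comparison uses "less than or equal to T", which is itself an assumption about how ties at the threshold are treated.

```python
import math

# Samples 1-20 of Table 5.1 (variables X1..X4).
SAMPLES = [
    (5.7, 2.8, 4.5, 1.3), (6.3, 3.3, 4.7, 1.6), (4.9, 2.4, 3.3, 1.0),
    (6.6, 2.9, 4.6, 1.3), (5.0, 3.4, 1.5, 0.2), (6.9, 3.1, 4.9, 1.5),
    (5.5, 2.3, 4.0, 1.3), (6.5, 2.8, 4.6, 1.5), (5.0, 3.6, 1.4, 0.2),
    (5.1, 3.5, 1.4, 0.2), (5.4, 3.9, 1.7, 0.4), (4.6, 3.4, 1.4, 0.3),
    (4.9, 3.1, 1.5, 0.1), (7.0, 3.2, 4.7, 1.4), (4.9, 3.0, 1.4, 0.2),
    (4.7, 3.2, 1.3, 0.2), (6.4, 3.2, 4.5, 1.5), (4.6, 3.1, 1.5, 0.2),
    (4.4, 2.9, 1.4, 0.2), (5.2, 2.7, 3.9, 1.4),
]

def distance_matrix(samples):
    """D[i][j] is the Euclidean distance between samples i and j (0-based)."""
    return [[math.dist(a, b) for b in samples] for a in samples]

def similarity_matrix(d, t):
    """S[i][j] is 1 when D[i][j] <= t and 0 otherwise."""
    return [[1 if value <= t else 0 for value in row] for row in d]

D = distance_matrix(SAMPLES)
S = similarity_matrix(D, 2.0)

# Distances from sample 1 to samples 2 and 5, rounded as in a printed table.
print(round(D[0][1], 1), round(D[0][4], 1))  # 0.9 3.3
print(S[0][1], S[0][4])                      # 1 0
```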
Section 5.4. T-clusters.
Up to now, we have never given “cluster” a clear definition. Perhaps we shall never do this because no matter how we define it, we will find cases where the definition is not appropriate. However, we shall still try to give cluster a formal definition with the understanding that this is not necessarily an ideal one.
Definition. A point $Y_i$ is said to be T-connected to a point $Y_j$ if there exists a sequence of points $Y_i = Z_1, Z_2, \ldots, Z_k = Y_j$ such that the distance between $Z_l$ and $Z_{l+1}$ is smaller than or equal to T for l = 1, 2, …, k-1.
For instance, in Fig. 5.2, point 8 is 1-connected to point 4 through points 5, 6 and 7.
Fig. 5.2
Definition. A T-cluster C of a set S of points is a subset of S which satisfies the following conditions:
(1) Every two points in C are T-connected.
(2) If point $Y_i$ is T-connected to some point $Y_j$ in C, then $Y_i$ is in C.
Consider Fig. 5.2 again. Let T=3.0. Then we can connect points 1, 2 and 3 into one cluster $C_1$ and points 4, 5, 6, 7 and 8 into cluster $C_2$ as shown in Fig. 5.3. Both $C_1$ and $C_2$ are clearly 3-clusters. In fact, since they are connected by links whose lengths are equal to 1, they are both 1-clusters.
Fig. 5.3
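Directly from the two conditions of the definition, T-clusters are the groups of points that can be reached from one another through links no longer than T. The sketch below grows each cluster from a seed point; the function name and the coordinates are hypothetical (they are not the points of Fig. 5.2) and only mimic a group of three points and a group of five points with unit spacing.

```python
import math

def t_clusters(points, t):
    """Return the T-clusters as sorted lists of point indices."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            p = frontier.pop()
            # Every unvisited point within distance t of p is T-connected to the seed.
            near = {q for q in unvisited if math.dist(points[p], points[q]) <= t}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return sorted(clusters)

# Hypothetical coordinates: two groups with unit spacing, separated by a gap of 4.
points = [(0, 0), (1, 0), (2, 0), (6, 0), (7, 0), (8, 0), (9, 0), (10, 0)]
print(t_clusters(points, 1.0))  # [[0, 1, 2], [3, 4, 5, 6, 7]]
print(t_clusters(points, 3.0))  # [[0, 1, 2], [3, 4, 5, 6, 7]]
print(t_clusters(points, 4.0))  # [[0, 1, 2, 3, 4, 5, 6, 7]]
```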
There is one special property of T-clusters which is worth discussing here. Note that, for instance, point 4 is 1-connected to point 8. However, we can never say that point 4 is similar to point 8 if the threshold is 1. Somehow, the definition of T-clusters tacitly assumes that the property of similarity is transitive. That is, if X is similar to Y and Y is similar to Z, then X is similar to Z. For many types of data, this assumption is not valid [Gotlied and Kuhns], and some other definition of clusters has to be used.
Another serious problem of this kind of approach is the so-called “chaining” effect. Consider Fig. 5.4. Clusters $C_1$ and $C_2$ are two separate clusters. But it is possible that the points between the two clusters may “chain” these two clusters into one cluster. This is sometimes called the “chaining effect” and is often considered an undesirable property.
Fig. 5.4
Evidently, this chaining effect is caused by ignoring the “density” of the points. Note that between the two clusters, the density of points is especially low. These points, in fact, can be simply viewed as “noise” and some “noise elimination” mechanism is needed.
Almost every noise elimination mechanism involves some kind of “mode” analysis. This will be discussed elsewhere. Meanwhile, the reader is encouraged to read Wishart [1969] for an excellent analysis of this problem. Temporarily, we can use the following noise elimination mechanism:
(1) Around every point, draw a circle with some appropriate radius and count the number of points falling into this circle.
(2) Those points which do not have many neighbors falling into their circles are considered as “noise”.
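The two steps above can be sketched as a simple neighbor count. The function name, the radius, the neighbor threshold `min_neighbors` and the coordinates are all assumptions chosen for illustration; in practice both parameters have to be tuned to the data.

```python
import math

def remove_noise(points, radius, min_neighbors):
    """Keep only points with at least min_neighbors other points within radius."""
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept

# Two dense groups and one isolated point lying between them.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),   # first group
    (3, 0.5),                         # sparse "chaining" point
    (5, 0), (5, 1), (6, 0), (6, 1),   # second group
]
kept = remove_noise(points, radius=1.5, min_neighbors=2)
print(len(kept))          # 8
print((3, 0.5) in kept)   # False
```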
Despite the fact that a T-cluster is conceptually easy to define, it is by no means easy to find all such T-clusters. Fig. 5.5 shows an algorithm (Algorithm 5.1) which exhaustively searches through all possibilities to construct T-clusters.
It is obvious that the above algorithm is very time-consuming. In the following section, we shall introduce a new concept, the minimal spanning tree, and we shall show that we can easily obtain T-clusters after constructing a minimal spanning tree from a set of samples.
Section 5.5. Minimal Spanning Trees.
Given a set of M points, there are $M(M-1)/2$ distances among these points. However, many distances are quite irrelevant to us. For instance, if two points are really far away from each other, we will never be too concerned with the distance between these two points. In this section, we shall discuss the concept of minimal spanning trees which, in some sense, contain only the relevant information.
Fig. 5.5 Flowchart of Algorithm 5.1 (An algorithm to find all T-clusters)
Definition. Given a set S of points, a spanning tree of S is a connected graph of S which satisfies the following conditions:
(1) Every point of S is in the graph.
(2) There is no loop in the graph.
For instance, the graph in Fig. 5.6a is a spanning tree while the graphs in Fig. 5.6b and Fig. 5.6c are not. The graph in Fig. 5.6b is not a spanning tree because there is a point not on the tree and thus condition 1 is not satisfied. The graph in Fig. 5.6c is not a spanning tree because the links connecting 1, 2 and 3 form a loop and thus violate condition 2.
Definition. A minimal spanning tree is a spanning tree whose total length is a minimum among all possible spanning trees.
For the set of data in Fig. 5.7a, we can construct many minimal spanning trees, one of which is shown in Fig. 5.7b. Note that there are two long links in Fig. 5.7b: the link between c and d and the link between c and k. Let us assume that the lengths of all other links on the minimal spanning tree are equal to 1. If we break links cd and ck, we shall obtain the sub-trees as shown in Fig. 5.8. Each sub-tree can be considered as corresponding to a cluster.
Fig. 5.7a
Fig. 5.7b
Fig. 5.8
The reader is encouraged to verify that each cluster is a 1-cluster. This interesting property will be discussed in the next section. Meanwhile we shall discuss how a minimal spanning tree can be constructed.
Algorithm 5.2 (An algorithm to produce a minimal spanning tree)
Input: A set S of points.
Step 1: Let A={X} and B=S-{X} where X is any point in S.
Step 2: From set B, find a point Y whose distance to some X in A is the shortest among all pairs of points in A and points in B.
Step 3: Connect Y to X. Let A=A∪{Y} and B=B-{Y}.
Step 4: If B is empty, exit. Otherwise, go to Step 2.
That this algorithm does correctly produce a minimal spanning tree was proved by Prim [1957].
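A direct, unoptimized transcription of the four steps might look as follows. It is a sketch only: the function name, the use of point indices for the sets A and B, and the sample coordinates are assumptions. Step 2 is implemented by examining every pair of a point in A and a point in B, which is exactly the costly part.

```python
import math

def minimal_spanning_tree(points):
    """Return the tree as a list of links (x, y, length), following Steps 1-4."""
    in_a = [0]                             # Step 1: A = {X}, with X the first point
    in_b = list(range(1, len(points)))     #         B = S - {X}
    links = []
    while in_b:                            # Step 4: stop when B is empty
        # Step 2: the pair (x in A, y in B) with the shortest distance.
        x, y = min(
            ((a, b) for a in in_a for b in in_b),
            key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
        )
        # Step 3: connect y to x and move y from B to A.
        links.append((x, y, math.dist(points[x], points[y])))
        in_a.append(y)
        in_b.remove(y)
    return links

points = [(0, 0), (1, 0), (2, 0), (6, 0), (7, 0)]  # hypothetical sample points
links = minimal_spanning_tree(points)
print(links)
print(sum(length for _, _, length in links))  # 7.0
```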
In Algorithm 5.2, Step 2 seems to be the step that involves a lot of computation. Suppose $A = \{Y_1, Y_2, \ldots, Y_k\}$, $B = \{Y_{k+1}, \ldots, Y_p\}$ and $Y_k$ is the last point added to A. Our task is to find a point $Y_i$ in A and a point $Y_j$ in B such that $d(Y_i, Y_j)$ is the smallest for all $Y_i$’s in A and all $Y_j$’s in B. There are two methods:
(1) We can store all of the distances in memory. This method will save computer time, but will demand a large amount of memory space. Note that we have to store $M(M-1)/2$ distances where M is the number of samples.
(2) We can recompute all of the distances. This is, of course, exceedingly time consuming.
We shall now point out that we really have a method which is efficient both in computer time and memory space. It uses only two vectors: $V_1$ and $V_2$. Both $V_1$ and $V_2$ are of dimension M, where M is the number of samples. Assume that point $P_i$ is in set B, its nearest point in set A is $P_j$ and the distance between $P_i$ and $P_j$ is b. We then set $V_1(i) = b$ and $V_2(i) = j$. By retaining only these vectors, we shall show that we can have a very efficient algorithm to find minimal spanning trees.
Let us assume that a point $Y_1$ is selected at the very beginning to be the point in set A. $V_2(i)$ is now set to 1 for all i. We then calculate the distances from point $Y_1$ to all of the points in set B. $V_1(i)$, $i \geq 2$, will contain the distance between $Y_i$ and $Y_1$. Assume that $Y_2$ is the point which is closest to $Y_1$. $Y_2$ will now be selected to be the point added to set A. We now begin the process of finding the pair of point X in A and point Y in B such that the distance between X and Y is the shortest for all points in A and all points in B. This can be easily done now. We simply calculate all of the distances from $Y_2$ to all of the remaining points in B. Let us assume that the distance between $Y_3$ and $Y_2$ is b and $V_1(3) = a$. If b<a, we do the following:
(1) We set $V_1(3)$ to be b.
(2) We set $V_2(3)$ to be 2.
This is an updating process. If $b \geq a$, we do not do anything. Note that in this step, the distances with respect to $Y_1$ are not calculated. In fact, if Y is the last point added to A, then only the distances with respect to Y are calculated. This process can be continued until all of the points are put into A.
We shall now present Algorithm 5.2*. Algorithm 5.2* is based upon the idea of keeping these two vectors and keeping them updated.
Algorithm 5.2* (An efficient algorithm to find a minimal spanning tree.)
Input: A set S of points.
Step 1. Let A={X} and B=S-{X} where X is any point in S.
Step 2. Set $V_1(Y) = \infty$ and $V_2(Y) = X$ for all Y in B.
Step 3. Calculate the distances from X to all points in B. If the distance between X and $Y_j$ ($Y_j$ is in B) is b and b is smaller than $V_1(Y_j) = a$, set $V_1(Y_j) = b$ and set $V_2(Y_j) = X$. Otherwise, do nothing.
Step 4. Let Y be a point in B where $V_1(Y)$ is the smallest. Let $Z = V_2(Y)$ (Z must be a point in A). Connect Y to Z. Set A=A∪{Y}. Set B=B-{Y}. Set $V_1(Y) = \infty$.
Step 5. If B is empty, exit. Otherwise, set X=Y and go to Step 3.
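The two-vector idea can be sketched as below. The names `v1`, `v2` and `in_b` are assumptions standing for $V_1$, $V_2$ and membership in set B, and the initial value of `v1` is taken to be infinity so that the first comparison always succeeds. Only the distances from the most recently added point are computed in each round, so about $M^2$ distance evaluations and two vectors of length M suffice.

```python
import math

def mst_two_vectors(points):
    """Minimal spanning tree using the nearest-distance vector v1 and nearest-point vector v2."""
    m = len(points)
    in_b = [True] * m
    v1 = [math.inf] * m        # v1[j]: distance from j to its nearest point in A
    v2 = [0] * m               # v2[j]: index of that nearest point in A
    x = 0                      # Step 1: start with A = {point 0}
    in_b[x] = False
    links = []
    for _ in range(m - 1):
        # Step 3: update v1 and v2 using only distances from the last added point x.
        for j in range(m):
            if in_b[j]:
                b = math.dist(points[x], points[j])
                if b < v1[j]:
                    v1[j], v2[j] = b, x
        # Step 4: the point of B closest to A joins the tree.
        y = min((j for j in range(m) if in_b[j]), key=lambda j: v1[j])
        links.append((v2[y], y, v1[y]))
        in_b[y] = False
        x = y                  # Step 5: continue from the newly added point
    return links

points = [(0, 0), (1, 0), (2, 0), (6, 0), (7, 0)]  # hypothetical sample points
links = mst_two_vectors(points)
print(links)
print(sum(length for _, _, length in links))  # 7.0
```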
We believe that the reader can now appreciate the elegance of minimal spanning trees. They not only contain important information, as we shall see, but also have the advantage of being easy to obtain.
Section 5.6. Producing T-clusters from a Minimal Spanning Tree.
Let us consider the minimal spanning tree in Fig. 5.9. Suppose we break all links longer than 3 (link cf is the only such link); we would obtain two separate clusters of points as shown in Fig. 5.10.
Fig. 5.9
Fig. 5.10
In each cluster, there is no link that is longer than 3. This is not surprising because we have already broken all of the links longer than 3. We shall now try to answer the following questions:
(1) Are $C_1$ and $C_2$ 3-clusters?
(2) Are $C_1$ and $C_2$ the only two 3-clusters?
To answer Question 1, we note that we only know that $C_1$ and $C_2$ satisfy the first condition of being 3-clusters: since no link within them is longer than 3, every two points in the same cluster are 3-connected. The reader can easily prove that the second condition of being 3-clusters is also satisfied. Thus, the answer to the first question is “Yes, $C_1$ and $C_2$ are 3-clusters.” The answer to the second question is also affirmative. That is, $C_1$ and $C_2$ are indeed the only 3-clusters. It is not easy to see this. Fortunately, this fact was established by Prim [1957].
The theorem that Prim proved in 1957 is now formally stated as follows:
Theorem. Let S be a set of points. Let g be a minimal spanning tree of S. Let $C_1, C_2, \ldots, C_k$ be the sets of points of the sub-trees obtained by breaking all links of g that are longer than T. Then $C_1, C_2, \ldots, C_k$ must be all T-clusters of S and they are the only T-clusters of S.
The above theorem is a very important theorem and it powerfully illustrates the significance of minimal spanning trees.
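Breaking the long links of a minimal spanning tree can be sketched as follows. The tree is given as a list of links (point, point, length); the function name and the example tree are hypothetical. Links no longer than T are merged with a small union-find structure, and the remaining groups are the sub-trees.

```python
def clusters_from_tree(n_points, links, t):
    """Break every link longer than t and return the resulting groups of points."""
    parent = list(range(n_points))

    def find(p):
        # Follow parent pointers to the representative of p's group.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b, length in links:
        if length <= t:              # links longer than t are simply not merged
            parent[find(a)] = find(b)

    groups = {}
    for p in range(n_points):
        groups.setdefault(find(p), []).append(p)
    return sorted(groups.values())

# Hypothetical minimal spanning tree on five points, with one long link.
links = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 4.0), (3, 4, 1.0)]
print(clusters_from_tree(5, links, 3.0))  # [[0, 1, 2], [3, 4]]
print(clusters_from_tree(5, links, 4.0))  # [[0, 1, 2, 3, 4]]
```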
Section 5.7. Minimal Spanning Trees and Hierarchical Clusters.
In many instances, what we need is not only a set of clusters. Rather, we would like to know the hierarchical structure of the data. Consider Fig. 5.9 again. Let us denote the entire set of samples by $C_0$. If we break the longest link in the minimal spanning tree, we would obtain two clusters:
$C_1 = \{a, b, c, d, e\}$ and $C_2 = \{f, g, h, i, j, k\}$.
This relationship can be depicted by the hierarchical tree shown in Fig. 5.11.
                 $C_0$ = {a, b, c, d, e, f, g, h, i, j, k}
    $C_1$ = {a, b, c, d, e}          $C_2$ = {f, g, h, i, j, k}
Fig. 5.11
Note that this hierarchical tree clearly describes the structure of the data. If there are not too many points, we can expand the tree to such an extent that every cluster contains only one point. Note that we may be satisfied, for example, with the tree in Fig. 5.12.
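One way to build such a hierarchy, sketched under the assumption that it is obtained by repeatedly breaking the longest remaining link of each sub-tree, is shown below. The function name, the `depth` parameter that limits how far the tree is expanded, and the example links are all hypothetical.

```python
def split_hierarchy(nodes, links, depth):
    """Split a sub-tree at its longest link, recursively, down to the given depth."""
    nodes = set(nodes)
    if depth == 0 or not links:
        return sorted(nodes)
    longest = max(links, key=lambda link: link[2])
    rest = [link for link in links if link is not longest]
    # Collect the points still reachable from one end of the broken link.
    side = {longest[0]}
    grew = True
    while grew:
        grew = False
        for a, b, _ in rest:
            if (a in side) != (b in side):
                side |= {a, b}
                grew = True
    other = nodes - side
    left = [link for link in rest if link[0] in side]
    right = [link for link in rest if link[0] in other]
    return [split_hierarchy(side, left, depth - 1),
            split_hierarchy(other, right, depth - 1)]

# Hypothetical minimal spanning tree on five points.
links = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 4.0), (3, 4, 1.0)]
print(split_hierarchy(range(5), links, depth=1))  # [[0, 1, 2], [3, 4]]
print(split_hierarchy(range(5), links, depth=2))
```

With `depth=1` the result corresponds to a two-level tree of the kind shown in Fig. 5.11; larger depths keep splitting until clusters contain single points.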
Section 5.8. Short Spanning Paths.
Although a minimal spanning tree can be very informative, it nevertheless suffers from one drawback: After we break the long links, it is by no means easy to find all of the resulting clusters. In this section, we shall introduce the concept of short spanning paths which can be used to produce clusters very quickly.
Definition. Given a set S of points, a spanning path is a connected graph that satisfies the following conditions:
(1) Every point of S is in the graph.
(2) There are two points of S that have only one link connected to them.
(3) All other points in S have exactly two links connected to them.
Definition. A shortest spanning path is a spanning path whose total length is the shortest among all possible spanning paths.
Since it is difficult to find a shortest spanning path, we shall only be discussing “short spanning paths” instead. A short spanning path is thus not necessarily the shortest one; yet it should be quite short.
Consider Fig. 5.14. A short spanning path may be the one shown in Fig. 5.15. The link between e and f is especially long. By breaking this link, we can obtain two clusters: {a, b, c, d, e} and {f, g, h, i, j, k, l}.
Fig. 5.14
Fig. 5.15
A spanning path is a special form of spanning tree. It is very easy to use a spanning path to produce clusters. For instance, the short spanning path in Fig. 5.15 can now be displayed in the following form:
Fig. 5.16
We can then quickly detect these two clusters.
In the following, we shall give an algorithm which produces a short spanning path out of a set of points.
Algorithm 5.3. (An algorithm to obtain a short spanning path)
Input: $Y_1, Y_2, \ldots, Y_M$.
Step 1. Set i=1. Place one of the input points arbitrarily.
Step 2. Arbitrarily select a point from the remaining M-i points. Try placing the selected point in each of the i+1 positions of the current path and compute the point’s contribution to the length of the short spanning path. Place the point at the position that gives the smallest contribution to the length of the short spanning path. Set i=i+1.
Step 3. If i=M, exit. Otherwise, go to Step 2.
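The insertion procedure can be sketched as follows. The function names and coordinates are hypothetical, points are taken in their input order rather than arbitrarily, and the "contribution" of a point is taken to be the increase in total path length caused by inserting it at a given position (an interpretation, since the text does not spell it out).

```python
import math

def short_spanning_path(points):
    """Build a short spanning path by inserting each point where it adds the least length."""
    path = [0]                                   # Step 1: place the first point
    for new in range(1, len(points)):            # Step 2, repeated for every remaining point
        best_pos, best_cost = 0, math.inf
        for pos in range(len(path) + 1):         # the i+1 candidate positions
            before = path[pos - 1] if pos > 0 else None
            after = path[pos] if pos < len(path) else None
            cost = 0.0
            if before is not None:
                cost += math.dist(points[before], points[new])
            if after is not None:
                cost += math.dist(points[new], points[after])
            if before is not None and after is not None:
                # Inserting between two neighbors removes their direct link.
                cost -= math.dist(points[before], points[after])
            if cost < best_cost:
                best_pos, best_cost = pos, cost
        path.insert(best_pos, new)
    return path

def path_length(points, path):
    return sum(math.dist(points[a], points[b]) for a, b in zip(path, path[1:]))

points = [(0, 0), (5, 0), (1, 0), (6, 0), (2, 0)]  # hypothetical points on a line
path = short_spanning_path(points)
print(path)
print(path_length(points, path))  # 6.0
```

For these points the longest link of the resulting path joins the points at x=5 and x=2, so breaking it separates the two groups {5, 6} and {0, 1, 2}.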
Example 5.1.
To give the reader some feeling about short spanning paths, we applied the short spanning path algorithm to the data in Table 5.1. The result is shown in Fig. 5.17. It is worth noting that the first cluster corresponds to Iris setosa and the second cluster corresponds to Iris versicolor.
Fig. 5.17
In the following sections, we shall introduce some visual clustering algorithms. By visual clustering, we mean that the samples are rearranged in such a way that one can visualize the relationships among points quite easily and then decide whether there are clusters or not. The visual cluster analysis algorithms became popular because of the progress made in computer display techniques. However, for all of the algorithms described here, computer graphics display systems are not absolutely necessary. Any kind of ordinary output device, such as printers and teletypes, will suffice.
Section 5.9. Matrix Reorganization through Short Spanning Paths.
We have assumed all the way that the input data are in the form of a matrix. Let us consider the following input matrix:
        X1   X2   X3   X4
  S1    a    c    d    a
  S2    c    b    b    m
  S3    a    e    f    a
  S4    d    b    b    c
Suppose we consider every row vector as a point and apply the short spanning path algorithm to $S_1$, $S_2$, $S_3$ and $S_4$; we will obtain the short spanning path $S_1 S_3 S_2 S_4$. If we rearrange the row vectors according to this order, we’ll obtain the following matrix:
        X1   X2   X3   X4
  S1    a    c    d    a
  S3    a    e    f    a
  S2    c    b    b    m
  S4    d    b    b    c
Suppose we further consider the column (feature) vectors as points. This time, we shall obtain $X_1 X_4 X_2 X_3$ as a short spanning path. After rearranging the column vectors, we obtain:
        X1   X4   X2   X3
  S1    a    a    c    d
  S3    a    a    e    f
  S2    c    m    b    b
  S4    d    c    b    b
Clearly, the rearrangement reveals more information than the original unordered array. We can readily see that $S_1$ and $S_3$ constitute a cluster with common features $X_1$ and $X_4$, and that $S_2$ and $S_4$ constitute a cluster with common features $X_2$ and $X_3$.
In general, our matrix reorganizing algorithm is as follows:
Algorithm 5.4. (A matrix reorganization algorithm)
Step 1. Find a short spanning path for the row vectors. Arrange the row vectors according to the order determined by the short spanning path.
Step 2. Find a short spanning path for the column vectors. Arrange the column vectors according to the order determined by this short spanning path.
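The two steps can be sketched for a small symbolic matrix as below. The sketch assumes Hamming distances between rows and between columns and an insertion-based short spanning path that processes vectors in input order; the function names are hypothetical. A path and its reverse are equivalent orderings, so the result may appear reversed relative to a hand-derived order.

```python
def hamming(u, v):
    """Number of positions at which two vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)

def short_spanning_path(vectors, dist):
    """Insert each vector at the position that lengthens the path the least."""
    path = [0]
    for new in range(1, len(vectors)):
        best_pos, best_cost = 0, float("inf")
        for pos in range(len(path) + 1):
            before = path[pos - 1] if pos > 0 else None
            after = path[pos] if pos < len(path) else None
            cost = 0
            if before is not None:
                cost += dist(vectors[before], vectors[new])
            if after is not None:
                cost += dist(vectors[new], vectors[after])
            if before is not None and after is not None:
                cost -= dist(vectors[before], vectors[after])
            if cost < best_cost:
                best_pos, best_cost = pos, cost
        path.insert(best_pos, new)
    return path

def reorganize(matrix):
    """Step 1 orders the rows, Step 2 orders the columns."""
    row_order = short_spanning_path(matrix, hamming)
    columns = list(zip(*matrix))
    col_order = short_spanning_path(columns, hamming)
    rearranged = [[matrix[r][c] for c in col_order] for r in row_order]
    return row_order, col_order, rearranged

matrix = [
    ["a", "c", "d", "a"],   # S1
    ["c", "b", "b", "m"],   # S2
    ["a", "e", "f", "a"],   # S3
    ["d", "b", "b", "c"],   # S4
]
row_order, col_order, rearranged = reorganize(matrix)
print([f"S{r + 1}" for r in row_order])
print([f"X{c + 1}" for c in col_order])
for row in rearranged:
    print(" ".join(row))
```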
Example 5.2.
Table 5.5(a) shows a matrix before the data reorganization mechanism is applied to it. Table 5.5(b) shows the result of applying the reorganizing algorithm. It is obvious that the reorganized data is more informative.
5.
24
1
0
1
1
0
1
0
0
1
1
0
1
0
0
1
1
1
0
0
1
0
0
1
1
1
0
1
0
1
1
0
0
1
1
0
0
0
1
1
0
0
1
1
0
1
0
1
0
0
0
0
0
0
0
0
1
1
0
1
0
0
1
0
0
0
0
1
0
0
1
0
0
1
0
1
0
1
0
1
1
0
0
1
1
0
0
0
1
1
0
0
1
1
0
1
0
1
0
0
0
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
1
0
1
0
1
0
1
0
0
0
0
0
0
0
1
1
0
1
0
0
1
0
0
0
0
1
0
0
1
0
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
1
0
1
0
1
0
1
0
0
0
0
0
0
0
1
1
0
1
0
0
1
0
0
0
0
1
0
0
1
0
0
1
0
1
0
1
0
1
1
0
0
1
1
0
0
0
1
1
0
0
1
1
0
1
0
1
0
0
0
0
0
0
0
0
1
1
0
1
0
0
1
0
0
0
0
1
0
0
1
0
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
1
0
1
0
1
0
1
0
0
0
1
0
1
1
0
0
1
1
0
0
0
1
1
0
0
1
1
0
1
0
1
0
0
0
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
1
0
1
0
0
1
1
0
1
0
0
1
1
1
0
0
1
0
0
1
1
1
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
1
0
1
0
1
0
1
0
0
0
1
0
1
1
0
0
1
1
0
0
0
1
1
0
0
1
1
0
1
0
1
0
0
0
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
1
0
1
0
1
0
1
0
0
(a)
1
0
0
0
0
1
1
1
1
1
1
1
1
1
0
0
0
0
1
0
0
0
1
1
1
1
0
0
0
0
1
1
1
1
1
1
1
1
1
0
0
0
0
1
0
0
0
1
1
1
0
0
0
0
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
0
0
0
0
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
0
0
0
0
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
0
0
0
0
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
(b)
Table 5.5
Example 5.3
Table 5.6(a) shows some United Nations voting records [Hartigan 1972]. We applied the reorganization algorithm to both the nations and the issues and obtained Table 5.6(b).
                        Issues
Nations                  1   2   3   4   5   6   7   8   9  10  11  12  13  14
Australia                1   3   3   3   1   3   1   3   3   1   1   3   3   1
Brazil                   1   2   2   2   1   3   3   1   3   1   1   3   1   1
Bulgaria                 1   1   1   1   3   1   1   3   1   3   2   2   1   3
Dahomey                  1   3   3   3   1   3   1   3   5   1   3   1   2   2
France                   1   3   3   3   3   1   2   3   3   1   1   3   2   2
Kenya                    1   3   3   3   3   1   1   3   2   5   3   1   1   3
Mexico                   1   2   2   2   1   3   3   1   3   1   1   1   2   1
New Zealand              1   3   3   3   1   3   1   1   3   1   1   3   3   1
Norway                   1   3   3   3   3   1   3   2   3   1   1   3   3   1
Senegal                  1   3   3   3   1   2   2   2   2   1   3   1   1   2
Sweden                   1   3   3   3   3   1   2   3   3   1   1   3   3   1
Syria                    1   2   2   2   3   1   1   3   1   2   3   1   1   3
Tanzania                 1   2   2   2   3   1   1   3   2   5   3   1   1   3
United Arab Republic     1   3   3   3   3   1   1   3   2   2   3   1   1   3
United Kingdom           1   3   3   3   1   1   3   2   3   1   1   3   1   1
USA                      1   3   3   3   1   3   3   1   3   1   1   3   3   1
U. S. S. R.              1   1   1   1   3   1   2   3   1   3   2   2   1   3
Venezuela                1   2   2   2   1   3   3   1   3   1   2   1   1   1
Yugoslavia               1   3   3   3   3   1   1   3   1   2   3   1   1   2
1=Yes 2=Abstain 3=No 5=Absent
Table 5.6(a)
                        Issues
Nations                 13  12   7   6   1  10  11  14   5   8   2   3   4   9
Senegal                  1   1   2   2   1   1   3   2   1   2   3   3   3   2
Dahomey                  1   2   2   2   1   3   3   1   3   1   1   3   1   1
Australia                3   3   1   3   1   1   1   1   1   3   3   3   3   3
New Zealand              3   3   1   3   1   1   1   1   1   1   3   3   3   3
Brazil                   1   3   3   3   1   1   1   1   1   1   2   2   2   3
Venezuela                1   1   3   3   1   1   2   1   1   1   2   2   2   3
Mexico                   2   1   3   3   1   1   1   1   1   1   2   2   2   3
USA                      3   3   3   3   1   1   1   1   1   1   3   3   3   3
United Kingdom           3   3   3   1   1   1   1   1   1   2   3   3   3   3
Norway                   3   3   3   1   1   1   1   1   3   2   3   3   3   3
Sweden                   3   3   2   1   1   1   1   1   3   3   3   3   3   3
France                   2   3   2   1   1   1   1   2   3   3   3   3   3   3
Yugoslavia               1   1   1   1   1   2   3   2   3   3   3   3   3   1
United Arab Republic     1   1   1   1   1   2   3   3   3   3   3   3   3   2
Kenya                    1   1   1   1   1   5   3   3   3   3   3   3   3   2
Tanzania                 1   1   1   1   1   5   3   3   3   3   2   2   2   2
Syria                    1   1   1   1   1   2   3   3   3   3   2   2   2   1
Bulgaria                 1   2   1   1   1   3   2   3   3   3   1   1   1   1
U. S. S. R.              1   2   2   1   1   3   2   3   3   3   1   1   1   1
Table 5.6(b)
Example 5.4
In Section 5.2, we introduced the similarity matrix. A typical similarity matrix was
shown in Table 5.4. It is by no means easy to detect the existence of clusters by
examining Table 5.4. Let us now rearrange the matrix in Table 5.4. The reorganized
matrix is shown below:
Iris setosa
     13 15 18 19 16 12  9 10  5 11  3 20  7  1 17  2  8  4 14  6
 13   1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
 15   1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
 18   1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
 19   1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
 16   1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
 12   1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
  9   1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
 10   1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
  5   1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
 11   1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
  3   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1
 20   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1
  7   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1
  1   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1
 17   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1
  2   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1
  8   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1
  4   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1
 14   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1
  6   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1
                                                  Iris versicolor
Table 5.7
We can now see two clear clusters. In fact, one cluster corresponds to Iris setosa and
the other corresponds to Iris versicolor.
Section 5.10 Linear Mapping
Given a set S of N-dimensional (N > 2) points, we can map these points to 2-space
by simply projecting them onto a 2-dimensional plane. The most common projection
uses principal component analysis, which was discussed in Chapter 3. We shall briefly
review the concept of principal component analysis here.
Given a set S of N-dimensional points, we can find the first and the second principal
components by finding the eigenvectors of the covariance matrix of the set S of
samples. The first (second) principal component direction corresponds to the
eigenvector with the largest (second largest) eigenvalue. After finding these two
directions, we can project the N-dimensional points onto them. Since the two
directions preserve a large amount of variance, we would intuitively expect clusters
to show up in the projection if there are clusters in N-space.
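The projection just described can be sketched in a few lines. The following is only a minimal illustration under our own assumptions: the function names are hypothetical, the eigenvectors are found by power iteration with deflation (one of several possible methods, and not necessarily the one used in Chapter 3), and the covariance matrix is assumed to have a clearly largest and second largest eigenvalue.

```python
import math

def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]

def dominant_eigenpair(m, iterations=200):
    """Power iteration: (eigenvalue, unit eigenvector) of a symmetric matrix."""
    v = [1.0 + 0.1 * k for k in range(len(m))]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in any direction
            return 0.0, [0.0] * len(m)
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v

def project_onto_principal_plane(points):
    """Project N-dimensional points onto the first two principal directions."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(dim)]
    centered = [[p[k] - mean[k] for k in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    lam1, e1 = dominant_eigenpair(cov)
    # Deflation: remove the first component so that the next dominant
    # eigenpair is the second principal component.
    deflated = [[cov[a][b] - lam1 * e1[a] * e1[b] for b in range(dim)]
                for a in range(dim)]
    lam2, e2 = dominant_eigenpair(deflated)
    coords = [(sum(r[k] * e1[k] for k in range(dim)),
               sum(r[k] * e2[k] for k in range(dim))) for r in centered]
    return coords, (lam1, lam2)
```

The two returned eigenvalues can be compared with the total variance to judge how much information the 2-space picture preserves.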
Example 5.5
In Section 3.4, we found the first and the second principal component directions for
the United Nations data. In this example, Fig. 3.4 is redrawn as Fig. 5.18 below:
Fig. 5.18 (axes D1 and D2; nations plotted: Venezuela, Mexico, Brazil, USA, New Zealand,
UK, Australia, Norway, Sweden, France, Dahomey, Senegal, U. A. R., Kenya, Yugoslavia,
Syria, Tanzania, Bulgaria, U. S. S. R.)
To use principal component analysis, our data must be numerical. In the following
sections, we shall introduce some mapping techniques which do not have to adhere to
this restriction. Since they no longer use a projection mechanism, these mapping
mechanisms are called nonlinear mapping mechanisms. We shall introduce two such
mechanisms. The first one tries to preserve all of the distances. The second one
preserves a subset of all of the distances exactly.
Section 5.11 Nonlinear Mapping I (Trying to Preserve All of the Distances)
Let us assume that we are given a set of L-dimensional points $Y_1, Y_2, \ldots, Y_M$.
Let $Z_1, Z_2, \ldots, Z_M$ be the 2-dimensional points resulting from the mapping
mechanism. The aim of our mapping is to preserve the geometrical structure of these
points. That is, we would like to have $Z_i$ and $Z_j$ such that
$$d(Z_i, Z_j) = d(Y_i, Y_j)$$
for all $i$ and $j$.
It should be pointed out here that in many instances, it is impossible to map a set of
points to 2-space and preserve their original distances. Consider, for instance, the
case where there is a set of points on a 3-dimensional sphere and a point occupying
the center of the sphere. The reader can easily see that it is impossible to preserve
all of the distances in this case. In other words, the structure of the original data
will necessarily be somewhat destroyed after the mapping. We therefore can only try
to preserve the distances as much as possible. If the original distance in the
high-dimensional space is large (small), we would like the corresponding distance in
the 2-space to be also large (small).
Let $Y_i$ and $Y_j$ be two points in the high-dimensional space. Let $Z_i$ and $Z_j$
be the corresponding points in the 2-space. Let $d_{ij}^*$ be the distance between
$Y_i$ and $Y_j$ and $d_{ij}$ be the distance between $Z_i$ and $Z_j$. The value
$(d_{ij} - d_{ij}^*)^2$ denotes the error caused by the mapping. The nonlinear mapping
which we are going to describe in this section is an iterative method which tries to
correct this error at each step.
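Let $Z_i'$ and $Z_j'$ be the new points corresponding to $Z_i$ and $Z_j$ respectively
after one step of iteration. Using the gradient method, $Z_i'$ and $Z_j'$ are
expressed as follows:
$$Z_i' = Z_i - c\,\nabla_{Z_i}(d_{ij} - d_{ij}^*)^2$$
$$Z_j' = Z_j - c\,\nabla_{Z_j}(d_{ij} - d_{ij}^*)^2 \qquad (1)$$
Since $d_{ij}^2 = \|Z_i - Z_j\|^2$, we can easily see that
$$\nabla_{Z_i}(d_{ij} - d_{ij}^*)^2 = 4(d_{ij} - d_{ij}^*)(Z_i - Z_j)$$
and
$$\nabla_{Z_j}(d_{ij} - d_{ij}^*)^2 = -4(d_{ij} - d_{ij}^*)(Z_i - Z_j).$$
(The exact gradients are these expressions divided by $2d_{ij}$. Since $c$ is a free
positive scalar which is fixed below, this factor can be absorbed into $c$.)
We then have
$$Z_i' = Z_i - 4c(d_{ij} - d_{ij}^*)(Z_i - Z_j)$$
$$Z_j' = Z_j + 4c(d_{ij} - d_{ij}^*)(Z_i - Z_j) \qquad (2)$$
We now have to determine $c$ such that $d_{ij}' = d_{ij}^*$. From (2),
$$d_{ij}'^2 = \|Z_i' - Z_j'\|^2 = \|(Z_i - Z_j)(1 - 8c(d_{ij} - d_{ij}^*))\|^2 = d_{ij}^2\,(1 - 8c(d_{ij} - d_{ij}^*))^2.$$
If $d_{ij}' = d_{ij}^*$, then
$$d_{ij}^{*2} = d_{ij}^2\,(1 - 8c(d_{ij} - d_{ij}^*))^2.$$
Taking the positive root, $1 - 8c(d_{ij} - d_{ij}^*) = d_{ij}^*/d_{ij}$, so that
$$c = \frac{1}{8}\,\frac{1 - d_{ij}^*/d_{ij}}{d_{ij} - d_{ij}^*}. \qquad (3)$$
Substituting (3) into (2), we obtain
$$Z_i' = Z_i - \frac{1}{2}\,(1 - d_{ij}^*/d_{ij})(Z_i - Z_j)$$
$$Z_j' = Z_j + \frac{1}{2}\,(1 - d_{ij}^*/d_{ij})(Z_i - Z_j) \qquad (4)$$
Geometrically, Eq. (4) means that the adjustment is made along the straight line
connecting $Z_i$ and $Z_j$, as shown in Fig. 5.19. If $d_{ij} > d_{ij}^*$, the two
points will be moved closer to each other. Otherwise, the two points are moved
farther apart.
Fig. 5.19 ($Z_i$ and $Z_j$ are moved along the line joining them: toward each other if
$d_{ij}$ is too large, away from each other if $d_{ij}$ is too small.)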
If we adjust $Z_i$ and $Z_j$ to achieve a better $d_{ij}$, we may also change
$d_{ik}$ and $d_{jk}$ for $k \ne i$ and $k \ne j$. Since it is generally impossible to
satisfy $d_{ij} = d_{ij}^*$ for all $i$ and $j$, we have to prefer preserving some
distances over others. Our strategy is to preserve local distances. That is, $d_{ij}$
should be very close to $d_{ij}^*$ if $d_{ij}^*$ is small. If $d_{ij}^*$ is large,
$d_{ij}$ does not have to be very close to $d_{ij}^*$. For this reason, we introduce
$1/(1 + d_{ij}^*)$ into the correction factor in (4). For large distances, the
corrections will not be too large. Thus,
$$Z_i' = Z_i - \frac{1}{2}\,\frac{1 - d_{ij}^*/d_{ij}}{1 + d_{ij}^*}\,(Z_i - Z_j)$$
$$Z_j' = Z_j + \frac{1}{2}\,\frac{1 - d_{ij}^*/d_{ij}}{1 + d_{ij}^*}\,(Z_i - Z_j) \qquad (5)$$
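To make the two correction rules concrete, the following sketch applies one adjustment to a single pair of 2-space points. The function name adjust_pair and the flag damped are our own; leaving coincident points unchanged is an assumption, since the direction of adjustment is undefined when $d_{ij} = 0$.

```python
import math

def adjust_pair(zi, zj, d_star, damped=True):
    """One correction of (zi, zj): Eq. (4) if damped is False, Eq. (5) otherwise."""
    d = math.hypot(zi[0] - zj[0], zi[1] - zj[1])
    if d == 0.0:
        return tuple(zi), tuple(zj)  # assumption: skip coincident points
    factor = 0.5 * (1.0 - d_star / d)
    if damped:
        factor /= (1.0 + d_star)     # the 1/(1 + d*) term favoring local distances
    shift = [factor * (a - b) for a, b in zip(zi, zj)]
    new_zi = tuple(a - s for a, s in zip(zi, shift))
    new_zj = tuple(b + s for b, s in zip(zj, shift))
    return new_zi, new_zj
```

With zi = (0, 0), zj = (4, 0) and d* = 2, the undamped rule moves the points to (1, 0) and (3, 0), so the distance is preserved exactly after one step. The damped rule moves them only part of the way.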
We start with a randomly generated 2-space configuration. We then systematically
select two high-dimensional points $Y_i$ and $Y_j$, $Y_i \ne Y_j$. Let $Z_i$ and
$Z_j$ be the two 2-space points corresponding to $Y_i$ and $Y_j$ respectively. We
then adjust $Z_i$ and $Z_j$ according to Eq. (5). We again select two points $Y_i$
and $Y_j$ and adjust the corresponding $Z_i$ and $Z_j$. We repeat this process until
some termination criteria are met. For instance, we may terminate if, in the last
cycle, no adjustment is necessary. We may also terminate if a prespecified computer
time limit is exceeded or the total error is smaller than a prespecified threshold.
The entire algorithm is depicted in Fig. 5.20. We call this algorithm Algorithm 5.5.
Fig. 5.20 Algorithm 5.5 (The Nonlinear Mapping I Algorithm)
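The loop described above may be sketched as follows. This is not a transcription of the flow chart in Fig. 5.20. It is a minimal version under our own assumptions: Euclidean distances in the original space, every pair (i, j) visited once per cycle, the total error taken as the sum of $(d_{ij} - d_{ij}^*)^2$ over all pairs, and termination by an error threshold or a cycle limit. All names are hypothetical.

```python
import math
import random

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def total_error(z, d_star):
    """Sum of squared distance errors over all pairs."""
    m = len(z)
    return sum((distance(z[i], z[j]) - d_star[i][j]) ** 2
               for i in range(m) for j in range(i + 1, m))

def nonlinear_mapping_1(points, max_cycles=500, threshold=1e-12, seed=0):
    rng = random.Random(seed)
    m = len(points)
    d_star = [[distance(p, q) for q in points] for p in points]
    z = [[rng.random(), rng.random()] for _ in range(m)]  # random 2-space start
    errors = [total_error(z, d_star)]                     # errors[0] is the initial error
    for _ in range(max_cycles):
        for i in range(m):
            for j in range(i + 1, m):
                d = distance(z[i], z[j])
                if d == 0.0:
                    continue  # assumption: skip coincident points
                # Correction factor of Eq. (5).
                factor = 0.5 * (1.0 - d_star[i][j] / d) / (1.0 + d_star[i][j])
                for k in range(2):
                    step = factor * (z[i][k] - z[j][k])
                    z[i][k] -= step
                    z[j][k] += step
        errors.append(total_error(z, d_star))
        if errors[-1] < threshold:
            break
    return z, errors
```

The returned list of errors allows the user to check whether the map improved from cycle to cycle before trusting the picture.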
Section 5.12 Nonlinear Mapping I for a Large Number of Samples
If there is a large number of samples, it is necessary to calculate and store a large
number of distances. This iterative method may become so time consuming and
require so much memory that it is impractical to implement. We therefore use a
heuristic to overcome the difficulty arising from a large number of samples.
Assume that we have $M$ samples. We first select $L$ samples $Y_1, Y_2, \ldots, Y_L$,
$L < M$. We apply Algorithm 5.5 to these samples to obtain $L$ 2-space points
$Z_1, Z_2, \ldots, Z_L$. These $L$ 2-space points are now fixed. To obtain the other
$(M - L)$ 2-space points, we do not let these points interact with each other. That
is, every $Z_i$ of these points, $L < i \le M$, is only adjusted with respect to
$Z_j$, $1 \le j \le L$, but not with respect to any $Z_k$, $L < k \le M$. In other
words, we ignore the distance between $Z_i$ and $Z_j$ if both $i$ and $j$ are larger
than $L$. We call this Algorithm 5.5*.
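The second stage of Algorithm 5.5* can be sketched as below. We assume that the L fixed points have already been produced by Algorithm 5.5, and that a remaining point is moved by the $Z_i$ half of Eq. (5) while the fixed point stays where it is. The text does not spell out this last detail, so it is our assumption, as are all of the names.

```python
import math
import random

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def place_against_fixed(fixed_high, fixed_2d, other_high, cycles=200, seed=0):
    """Map each remaining sample using only its distances to the fixed samples."""
    rng = random.Random(seed)
    placed = []
    for y in other_high:
        targets = [distance(y, f) for f in fixed_high]  # d* to every fixed sample
        z = [rng.random(), rng.random()]                # random start
        for _ in range(cycles):
            for anchor, d_star in zip(fixed_2d, targets):
                d = distance(z, anchor)
                if d == 0.0:
                    continue  # assumption: skip coincident points
                factor = 0.5 * (1.0 - d_star / d) / (1.0 + d_star)
                # Only z moves; the fixed 2-space point is left untouched.
                z = [zk - factor * (zk - ak) for zk, ak in zip(z, anchor)]
        placed.append(tuple(z))
    return placed
```

Since the remaining points never interact, only L(M - L) distances are used in this stage instead of all distances among the (M - L) remaining points.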
Algorithm 5.5* does not, of course, preserve the structure of the data as faithfully
as Algorithm 5.5. In Fig. 5.21(a), we show the result of applying Algorithm 5.5* to
the JU data. (For a detailed figures, we can conclude that it is worthwhile to try using
Algorithm 5.5* if one has a large number of samples to process.
Fig. 5.21
(a) Result of using Algorithm 5.5*.
(b) Result of using Algorithm 5.5.
Section 5.13 Nonlinear Mapping II (The Triangulation Method: Preserving a Subset
of Distances Exactly)
The nonlinear mapping I technique introduced above tries to preserve as many
distances as possible. This might be too ambitious because, in the final result, it is
quite possible that no distance is preserved exactly. In the following sections, we
shall introduce another nonlinear mapping technique which adopts a different
approach. Instead of trying to preserve all distances, we shall preserve a subset of
the distances exactly.
Consider three high-dimensional points $P_i^*$, $P_j^*$ and $P_k^*$. Suppose that, in
the 2-space, $P_i$ and $P_j$ exactly preserve the distance between $P_i^*$ and
$P_j^*$. That is, $d_{ij} = d_{ij}^*$. Then the third point $P_k^*$ can be mapped to a
point $P_k$ in the 2-space such that the distances among $P_i^*$, $P_j^*$ and $P_k^*$
are all exactly preserved. This can be done by drawing two circles with $P_i$ and
$P_j$ as centers and $d_{ik}^*$ and $d_{jk}^*$ as radii, as shown in Fig. 5.22. Note
that, because of the triangle inequality, the circles either intersect at two points
or are tangent. Through this basic principle, we can gradually map a set of $M$
high-dimensional points onto a 2-dimensional plane. On the resulting map, for every
point $P_k$, there exist two points $P_i$ and $P_j$ such that the distances among
$P_i$, $P_j$ and $P_k$ are all exactly preserved.
Fig. 5.22
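The construction with two circles can be carried out numerically. The sketch below uses the standard formula for the intersection of two circles. The function name and the small tolerance for the tangent case are our own choices.

```python
import math

def circle_intersections(p_i, p_j, r_ik, r_jk, tolerance=1e-9):
    """Return the two candidate locations of P_k (identical if the circles are tangent)."""
    dx, dy = p_j[0] - p_i[0], p_j[1] - p_i[1]
    d = math.hypot(dx, dy)
    if d == 0.0:
        raise ValueError("P_i and P_j must be distinct")
    # Distance from P_i to the foot of the common chord, measured along P_i -> P_j.
    a = (r_ik ** 2 - r_jk ** 2 + d ** 2) / (2.0 * d)
    h_squared = r_ik ** 2 - a ** 2
    if h_squared < -tolerance:
        raise ValueError("circles do not intersect")
    h = math.sqrt(max(h_squared, 0.0))  # half length of the common chord
    ux, uy = dx / d, dy / d             # unit vector from P_i to P_j
    mx, my = p_i[0] + a * ux, p_i[1] + a * uy
    return (mx - h * uy, my + h * ux), (mx + h * uy, my - h * ux)
```

For instance, with P_i = (0, 0), P_j = (5, 0), and radii 3 and 4, the two candidates are (1.8, 2.4) and (1.8, -2.4).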
Let us assume that there are $M$ data points. For the first three points being mapped,
three distances are exactly preserved. For each of the subsequent $(M - 3)$ points,
two distances are preserved. Therefore the number of distances being exactly
preserved is $3 + 2(M - 3) = 2M - 3$. Since there are in total $M(M - 1)/2$ distances
available, we should preserve only the informative ones. In this book, we shall
preserve all of the distances on a minimal spanning tree constructed from the data
points.
Note that there are only $M - 1$ distances on a minimal spanning tree while our
method can preserve $2M - 3$ distances. We therefore must provide additional
information. Two separate approaches to provide the additional information required
were developed: the second nearest neighbor approach and the reference point
approach.
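As a reminder of how the M - 1 tree distances can be obtained from a distance matrix, here is a short sketch of Prim's method. The function name and the choice of sample 0 as the starting point are our own assumptions.

```python
def minimal_spanning_tree(dist):
    """Prim's method on a full distance matrix; returns M - 1 edges (parent, child, length)."""
    m = len(dist)
    in_tree = [False] * m
    best = [float("inf")] * m   # shortest known link from the tree to each point
    parent = [None] * m
    best[0] = 0.0               # start growing the tree from sample 0
    edges = []
    for _ in range(m):
        # Pick the point outside the tree that is closest to the tree.
        u = min((k for k in range(m) if not in_tree[k]), key=lambda k: best[k])
        in_tree[u] = True
        if parent[u] is not None:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(m):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges
```

For four points located at 0, 1, 3 and 6 on a line, the tree consists of the three links of length 1, 2 and 3, while the triangulation method can preserve 2M - 3 = 5 of the 6 available distances.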
(1) The second nearest neighbor approach:
Let $P_j$ be a point already mapped. Let $P_k$ be directly linked to $P_j$ on the
minimal spanning tree and be the next point to be mapped. $P_j$ must be the closest
point to $P_k$ among all of the points already mapped. Among all of the mapped
points whose distances to $P_j$ are exactly preserved, let $P_i$ be the point closest
to $P_k$. $P_i$ is usually directly linked to $P_j$. Then $P_k$ is mapped to the
2-space such that its distances to $P_i$ and $P_j$ are exactly preserved.
For instance, in Fig. 5.23, assume points 3 and 1 are already mapped and point 2 is
the next point to be mapped. Point 2 will be mapped in such a way that its distances
to point 1 and point 3 are exactly preserved.
Fig. 5.23
(2) The reference point approach:
The reference point approach is designed to provide some global constraint on the
resulting map. In this method, an initial point is selected as the reference point for
all other points: the distances to this reference point are always preserved.
Consequently, for every point, two distances are preserved: a local one on the
minimal spanning tree and a global one with respect to the reference point.
This reference point approach allows the user to analyze the data from different
points and obtain different maps. These reference points may even be points outside
of the data set, such as the center of the data set. In the experiments described
later, we shall show some interesting properties of this approach.
Section 5.14 Minimal Spanning Trees and the Ordering of Points Being Mapped
Let us consider Fig. 5.23 again. Suppose point 6 is chosen as the initial point to be
mapped. Since all of the distances on the minimal spanning tree have to be preserved,
some ordering of the points being mapped must be maintained. For instance, point 8
must be mapped before point 9 because the distance between point 8 and point 9 must
be preserved. On the other hand, if point 11 is used as the initial point, then point 9
has to be mapped before point 8.
Given a minimal spanning tree, we can choose any point as its root. Assume point 6
is chosen. The directed tree in Fig. 5.24 shows the ancestor-descendent (tail to
pointer) relationship. Thus point 6 is the immediate ancestor of point 5 and point 9
is the immediate descendent of point 8.
Fig. 5.24
We shall say that point 6, the root of the tree, has level one. The level of every
other point on the tree is determined by the level of its immediate ancestor. If
$P_i$ is the immediate ancestor of $P_j$ and the level of $P_i$ is $L$, the level of
$P_j$ will be $L + 1$. For instance, the level of point 5 is 2 and the level of
point 9 is 3.
Since the points are on a directed tree, the process of mapping points can now be
viewed as a tree searching process [Slagle 1971, Nilsson 1971]. There are two basic
tree searching approaches: depth first and breadth first. Both can be applied in this
case. We have conducted a number of experiments to see if these two approaches
result in different maps and could not detect any significant difference. Therefore,
for simplicity, we only introduce the breadth-first approach.
Our first rule is: Points on level $L$ are mapped before points on level $L + 1$. For
instance, for the tree in Fig. 5.24, points 7, 3, 5 and 8 are to be mapped before
points 1, 2, 4 and 9.
Our second rule is: If a point $P_i$ is mapped before a point $P_j$ on the same level,
the immediate descendents of $P_i$ will be mapped before the immediate descendents
of $P_j$.
Our third rule is: Suppose point $P_i$ has points $P_j$ and $P_k$ as its immediate
descendents and $P_j$ is closer to $P_i$ than $P_k$. Then $P_j$ will be mapped before
$P_k$.
For the minimal spanning tree in Fig. 5.24, the sequence of points to be mapped is 6, 7,
3, 5, 8, 1, 2, 4, 9, 10, 11 and 12.
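The three rules amount to a breadth-first traversal in which the immediate descendents of each point are visited in order of increasing distance. The sketch below illustrates this. Fig. 5.24 is not reproduced here, so the tree and the link lengths in the usage note are hypothetical. They were chosen only so that the resulting sequence agrees with the one quoted above.

```python
from collections import deque

def mapping_order(children, root):
    """children[p] is a list of (link_length, child) pairs for the directed tree."""
    order = [root]
    queue = deque([root])
    while queue:
        p = queue.popleft()                 # first-in first-out gives rules 1 and 2
        for _, child in sorted(children.get(p, [])):  # closer descendents first: rule 3
            order.append(child)
            queue.append(child)
    return order

# Hypothetical tree rooted at point 6 (link lengths are invented for illustration).
example_tree = {
    6: [(2.5, 8), (1.0, 7), (2.0, 5), (1.5, 3)],
    7: [(1.0, 1)],
    3: [(1.0, 2)],
    5: [(1.0, 4)],
    8: [(1.0, 9)],
    9: [(1.0, 10), (2.0, 11), (3.0, 12)],
}
```

Calling mapping_order(example_tree, 6) on this hypothetical tree yields 6, 7, 3, 5, 8, 1, 2, 4, 9, 10, 11, 12.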
When we map a point to the 2-space, we always use two points as references. Using
these two reference points as centers, two circles are drawn. Unless tangent, these
two circles intersect at two points, each of which preserves the distances with
respect to these reference points. This gives us another choice that can be used to
improve the quality of the final map. Consider Fig. 5.22 again. $P_k^1$ and $P_k^2$
are two possible locations of $P_k$ in the 2-space. For every possible location, we
can now find its nearest neighbor (excluding $P_i$ and $P_j$) on the map already
obtained. Let us denote the next nearest neighbors of $P_k^1$ and $P_k^2$ by $Q_1$
and $Q_2$ respectively. We cannot preserve the original distances between $P_k$ and
$Q_1$ and between $P_k$ and $Q_2$ exactly. However, we now calculate the error to
$Q_1$ caused by putting $P_k$ at location $P_k^1$ and the error to $Q_2$ caused by
putting $P_k$ at $P_k^2$. We then put $P_k$ at the location where the smaller error
will be caused. We have thus extended the neighborhood by one final bit.
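The choice between the two candidate locations can be sketched as follows. The text does not give a formula for the error, so we assume it is the absolute difference between the mapped distance and the original distance to the nearest mapped neighbor. All names are hypothetical.

```python
import math

def choose_location(candidates, mapped, exclude, original_distance):
    """Pick the candidate location of P_k whose nearest mapped neighbor has the smaller error.

    candidates:        the two possible 2-space locations of P_k
    mapped:            dict name -> 2-space location of the points already mapped
    exclude:           names of the two reference points P_i and P_j
    original_distance: dict name -> original distance from P_k to that point
    """
    best = None
    others = [name for name in mapped if name not in exclude]
    for location in candidates:
        def map_distance(name):
            return math.hypot(location[0] - mapped[name][0],
                              location[1] - mapped[name][1])
        q = min(others, key=map_distance)      # nearest neighbor, excluding P_i and P_j
        error = abs(map_distance(q) - original_distance[q])  # assumed error measure
        if best is None or error < best[0]:
            best = (error, location, q)
    return best[1], best[2], best[0]
```

The function returns the chosen location, the neighbor used for the comparison, and the error, so that the user can inspect how much the extra neighbor is distorted.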
Example 5.7
To give the reader some feeling for the second nearest neighbor approach, we
mapped the protein data in Table 5.2. The result is shown in Fig. 5.25, where
Human was used as the initial point and Hamming distances were used.
Fig. 5.25
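For readers who wish to reproduce the distances used in this example, the Hamming distance between two equally long sequences is simply the number of positions at which they differ. The sketch below uses the Human, Monkey and Horse rows of Table 5.2. The function and variable names are our own.

```python
def hamming(a, b):
    """Number of positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Rows of Table 5.2, written as strings of amino acid codes.
human = "VIMSVTHLFPYSAIGDVKEAKTNE"
monkey = "VIMSVTHLFPYSATGDVKEAKTNE"
horse = "VVQAVTHLFPFTDTKEAKTEKTNE"
```

Human and Monkey differ at one of the listed positions only, whereas Human and Horse differ at twelve.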
Example 5.8
We now demonstrate the difference between the second nearest neighbor approach
and the reference point approach. Here, we used the voting record data [Hartigan
1972]. Again Hamming distances were used. Fig. 5.26a shows the result of the second
nearest neighbor approach where the initial point was chosen to be the Soviet Union.
Fig. 5.26a
Fig. 5.26b shows the result of using the reference point approach where the reference
point was chosen to be the Soviet Union. One can thus use the map to analyze the
relationship between the Soviet Union and other countries. For instance, one can see
that the USA and New Zealand are farthest away from the USSR, with the United
Kingdom, Norway, Sweden and France gradually deviating from the USA towards
the USSR. The diplomatic relationship of the United States with other countries can
be analyzed through Fig. 5.26c.
Fig. 5.26b
The difference between the second nearest neighbor approach and the reference
point approach is now easily seen. If the second nearest neighbor approach is used,
there is no global information. Therefore, the part of the map added later may have
important global information seriously misrepresented. If the reference point approach
is used, all of the points are restrained by some global information.
Fig. 5.26c
Example 5.9
This experiment shows another way that the user can make use of the reference
point approach. We used the now classical Iris data [Fisher 1936], which consists of
Iris setosa, Iris versicolor and Iris virginica. Iris versicolor and Iris virginica are
quite similar to each other and both are rather different from Iris setosa. We first
used a sample from Iris setosa as the reference point and obtained the map shown in
Fig. 5.29a. As the reader can see, Iris versicolor and Iris virginica are so close to
each other that they almost form one cluster. We then used a sample from Iris
versicolor as the reference point and obtained Fig. 5.29b. This time, the clusters
corresponding to Iris versicolor and Iris virginica became more pronounced. The
reference point method allows us to focus on a particular area so that we can see the
clustering properties of a particular region more clearly.
A = Iris setosa, B = Iris versicolor, C = Iris virginica
Fig. 5.29a
A = Iris setosa, B = Iris versicolor, C = Iris virginica
Fig. 5.29b
Section 5.15 Summary
Although we have introduced many cluster analysis techniques, they are still only a
fraction of the clustering algorithms published. We have carefully selected these
algorithms because we believe they are the best ones, both in their theoretical
content and in their ease of implementation. If the reader is interested in knowing
more about clustering algorithms, he is encouraged to read [Hartigan 1974].
The reader may now wonder: If one is given a set of data, which clustering
algorithm should be used? This problem will be discussed in the following paragraphs.
First of all, we recommend that some visual clustering technique be applied to the
data. Note that visual clustering algorithms never arbitrarily divide samples into
clusters. They are rather like microscopes: one can use them to display the data and
get some preliminary idea about the data. The question is: Which visual clustering
algorithm should one use?
We have altogether introduced four visual clustering algorithms: the data
reorganizing algorithm, the linear mapping algorithm, the nonlinear mapping
algorithm I and the nonlinear mapping algorithm II. Each algorithm has its own
characteristics. That is, one algorithm might be suitable for one kind of data and not
suitable for some other kind of data. In the following, we shall give a more detailed
description of how to use these algorithms.
(1) The data reorganizing algorithm:
The input to this algorithm must be a set of vectors. It does not matter whether the
data are numerical or not because Hamming distances can be used if the data are
non-numerical.
(2) The linear mapping algorithm:
The input to this algorithm must be numerical because we have to compute the
covariance matrix. It is cautioned here that we should first analyze the eigenvalues.
If all of the eigenvalues are approximately equal, we cannot expect the first two
principal component directions to preserve a lot of information.
(3) The nonlinear mapping techniques:
Both nonlinear mapping techniques can be applied to numerical data and
non-numerical data. However, these two algorithms have another advantage: If we
only know the distance matrix of the samples, we can still use these two algorithms,
because they are based upon distance information. The reader may wonder: Under
what condition would the input data be a distance matrix, instead of a set of vectors?
This can be explained by the following example: Imagine that we are interested in
knowing what the public thinks of the automobiles being sold in the market, and
whether some of them look similar to the public or not. We may, of course, use a set
of features, such as the weight, size, horse power, price etc., to characterize the
cars. We may also simply ask a person to compare the cars and indicate how two cars
appear to him. If car $P_i$ and car $P_j$ look absolutely alike, he may say the
distance between them is 0. If they appear totally different to him, he may set the
distance to 1. Otherwise, he may set the distance to be somewhere between 0 and 1.
In this case, the input to the nonlinear mapping algorithm is only a distance
matrix.
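The caution about eigenvalues under (2) can be checked with a one-line computation. Given the eigenvalues of the covariance matrix, the sketch below reports the fraction of the total variance carried by the two largest ones. The function name is our own, and what counts as "a lot of information" is left to the user.

```python
def variance_preserved(eigenvalues, k=2):
    """Fraction of the total variance carried by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    return sum(ordered[:k]) / total
```

With eigenvalues 5, 3, 1 and 1 the first two directions carry 80 percent of the variance. With four equal eigenvalues they carry only 50 percent, and a linear map onto 2-space should not be expected to reveal much.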
The above can be summarized as follows:
(1) Before any other clustering algorithm is applied to the data, apply at least one
visual clustering algorithm to the data.
(2) If the data are a set of numerical vectors, then all of the visual clustering
algorithms can be applied.
(3) If the data are a set of non-numerical vectors, then the data reorganizing
algorithm and the two nonlinear mapping algorithms can be used. The linear mapping
algorithm cannot be used.
(4) If the data are in the form of a distance matrix, then the two nonlinear mapping
algorithms can be applied. The other visual clustering algorithms cannot be
applied.
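Rules (2) to (4) can be written down as a small lookup table, which may be convenient when the choice has to be made inside a program. The names of the three data forms used as keys are our own labels.

```python
VISUAL_ALGORITHMS = {
    "numerical vectors": [
        "data reorganizing",
        "linear mapping",
        "nonlinear mapping I",
        "nonlinear mapping II",
    ],
    "non-numerical vectors": [
        "data reorganizing",
        "nonlinear mapping I",
        "nonlinear mapping II",
    ],
    "distance matrix": [
        "nonlinear mapping I",
        "nonlinear mapping II",
    ],
}

def applicable_visual_algorithms(data_form):
    """Return the visual clustering algorithms that can be applied to the given form of data."""
    if data_form not in VISUAL_ALGORITHMS:
        raise ValueError("unknown form of data: " + repr(data_form))
    return list(VISUAL_ALGORITHMS[data_form])
```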
After some visual clustering is done, we may decide that no more clustering is
necessary if we cannot detect any clusters visually. Otherwise, we may use either the
minimal spanning tree or the short spanning path approach. Note that both algorithms
can be applied to numerical or non-numerical data. If we only have a distance matrix,
we may still apply these two algorithms. This is why we selected them.
Finally, we would like to give a brief history of the algorithms introduced in this
chapter. The minimal spanning tree algorithm was first discussed by Prim [1957] and
later by Zahn [1971]. The short spanning path algorithm was first proposed by Slagle,
Chang and Heller [1975]. Slagle, Chang and Lee [1974] applied it to clustering
analysis. The data reorganizing algorithm was first discussed in [McCormick,
Schweitzer and White 1972], and later by Slagle, Chang and Heller [1975]. Recently,
another data reorganizing algorithm was proposed by Ling [1973]. As for the linear
mapping technique, the reader should consult Chapter 2 of this book. The nonlinear
mapping I technique was inspired by multidimensional scaling, which was developed
by Shepard [1962] and improved by Kruskal [1964]. The nonlinear mapping I technique
introduced in this book was proposed by Chang [1973] and is a modified version of the
algorithm proposed by Sammon [1969]. The nonlinear mapping II technique was
developed by Lee, Slagle and Blum [1976].
10. Kruskal, J. B. (1964) : Nonmetric multidimensional scaling, Psychometrika, Vol. 29,
June 1964, pp. 115-129.
11. Lee, R. C. T., Slagle, J. R. and Blum, H. (1976) : A triangulation method for
mapping of points from N-space to 2-space, to appear in IEEE Trans. on Computers.
12. Ling, R. F. (1973) : A computer generated aid for clustering analysis, Comm. of
the ACM, Vol. 16, No. 6, Jan. 1973, pp. 355-361.
13. McCormick, W. T., Schweitzer, P. J. and White, T. W. (1972) : Problem
decomposition and data reorganization by a clustering technique, Operations
Research, Vol. 20, No. 5, Sept.-Oct., 1972, pp. 993-1009.
14. Nilsson, N. J. (1971) : Problem Solving Methods in Artificial Intelligence,
McGraw-Hill, N. Y., 1971.
15. Prim, R. C. (1957) : Shortest connection network and some generalizations,
Bell System Technical Journal, Nov. 1957, pp. 1389-1401.
16. Salton, G. (1968) : Automatic Information Organization and Retrieval,
McGraw-Hill, N. Y., 1968.
17. Sammon, J. W. Jr. (1969) : A nonlinear mapping for data structure analysis, IEEE
Trans. on Computers, Vol. C-18, Jan. 1969, pp. 401-409.
18. Shepard, R. N. (1962) : The analysis of proximities: multidimensional scaling with
an unknown distance function, Psychometrika, Vol. 27, 1962, pp. 125-139, pp. 219-246.
19. Slagle, J. R. (1971) : Artificial Intelligence: a Heuristic Programming Approach,
McGraw-Hill, N. Y., 1971.
20. Slagle, J. R., Chang, C. L. and Heller, S. (1975) : A clustering and
data-reorganization algorithm, IEEE Trans. on Systems, Man and Cybernetics, Jan.
1975, pp. 121-128.
21. Slagle, J. R., Chang, C. L. and Lee, R. C. T. (1974) : Experiments with some
clustering analysis algorithms, Pattern Recognition, Vol. 6, 1974, pp. 181-187.
22. Slagle, J. R. and Lee, R. C. T. (1971) : Application of game tree searching to
generalized pattern recognition, Comm. of the ACM, Vol. 14, No. 2, Feb. 1971, pp.
107-110.
23. Subas, S., Kashyap, R. L. and Yao, S. B. (1975) : The clustering concept and
secondary key retrieval for on-line system, School of Electrical Engr., Purdue
University, 1975.
24. Wishart, D. (1969) : Mode analysis: a generalization of nearest neighbor which
reduces chaining effects, in Numerical Taxonomy (edited by A. J. Coles),
Academic Press, N. Y., 1969, pp. 282-308.
25. Zahn, C. T. (1971) : Graph-theoretical methods for detecting and describing
gestalt clusters, IEEE Trans. on Computers, Vol. C-20, No. 1, Jan. 1971, pp. 68-86.