Performance Functions and Clustering Algorithms
Wesam Barbakh and Colin Fyfe
Abstract. We investigate the effect of different performance functions for measuring the performance of clustering algorithms and derive different algorithms depending on which performance function is used. In particular, we show that two algorithms may be derived which do not exhibit the dependence on initial conditions (and hence the tendency to get stuck in local optima) that the standard K-Means algorithm exhibits.
INTRODUCTION
The K-Means algorithm is one of the most frequently used investigatory algorithms in data analysis. The algorithm attempts to locate K prototypes or means throughout a data set in such a way that the K prototypes in some way best represent the data. The algorithm is one of the first which a data analyst will use to investigate a new data set because it is algorithmically simple, relatively robust and gives 'good enough' answers over a wide variety of data sets: it will often not be the single best algorithm on any individual data set, but it will be close to the optimal over a wide range of data sets.
However, the algorithm is known to suffer from the defect that the means or prototypes found depend on the initial values given to them at the start of the simulation. There are a number of heuristics in the literature which attempt to address this issue but, at heart, the fault lies in the performance function on which K-Means is based. Recently, there have been several investigations of alternative performance functions for clustering algorithms.
One of the most effective updates of K-Means has been K-Harmonic Means, which minimises

$$J_{HM} = \sum_{i=1}^{N} \frac{K}{\sum_{k=1}^{K} \frac{1}{\|x_i - m_k\|^2}}$$

for data samples {x_1, ..., x_N} and prototypes {m_1, ..., m_K}. This performance function can be shown to be minimised when

$$m_k = \frac{\sum_{i=1}^{N} \frac{1}{d_{ik}^4}\left(\sum_{l=1}^{K}\frac{1}{d_{il}^2}\right)^{-2} x_i}{\sum_{i=1}^{N} \frac{1}{d_{ik}^4}\left(\sum_{l=1}^{K}\frac{1}{d_{il}^2}\right)^{-2}}, \qquad d_{ik} = \|x_i - m_k\|.$$
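This fixed-point update can be sketched in Python (our own sketch, not the authors' implementation; the name `khm_update` and the `eps` guard against division by zero are ours):

```python
import numpy as np

def khm_update(X, m, eps=1e-12):
    """One K-Harmonic-Means fixed-point update: every prototype moves to a
    weighted mean of ALL data points, with weights
    w_ik = 1 / (d_ik^4 * (sum_l 1/d_il^2)^2),   d_ik = ||x_i - m_k||."""
    d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2) + eps   # (N, K)
    s = (1.0 / d**2).sum(axis=1, keepdims=True)                       # (N, 1)
    w = 1.0 / (d**4 * s**2)                                           # (N, K)
    return (w[:, :, None] * X[:, None, :]).sum(axis=0) / w.sum(axis=0)[:, None]
```

Because every distance d_ik enters every weight, even a prototype that is currently closest to no data point receives a non-zero pull towards the data.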
In this paper, we investigate two alternative performance functions and show the effect the different functions have on the effectiveness of the resulting algorithms. We are specifically interested in developing algorithms which are effective in a worst-case scenario: when the prototypes are initialised at the same position, very far from the data points. If an algorithm can cope with this scenario, it should be able to cope with a more benevolent initialisation.
PERFORMANCE FUNCTIONS FOR CLUSTERING
The performance function for K-Means may be written as

$$J_K = \sum_{i=1}^{N} \min_{k \in \{1,\ldots,M\}} \|X_i - m_k\|^2 \qquad (1)$$

which we wish to minimise by moving the prototypes to the appropriate positions. Note that (1) detects only the centres closest to data points and then distributes them to give the minimum performance, which determines the clustering.
Any prototype which is still far from the data is not utilised and does not enter any calculation that gives the minimum performance, which may result in dead prototypes: prototypes which are never appropriate for any cluster. Thus initialising the centres appropriately can have a big effect in K-Means.
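The dead-prototype effect can be sketched in a few lines of Python (an illustrative sketch, not the authors' code; the data set and the `kmeans` helper are ours):

```python
import numpy as np

def kmeans(X, m, n_iter=100):
    """Plain batch K-Means: assign each point to its nearest prototype,
    then move each prototype to the mean of the points it won.
    A prototype that wins no points (a 'dead' prototype) never moves."""
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        labels = d2.argmin(axis=1)
        for k in range(len(m)):
            pts = X[labels == k]
            if len(pts):              # dead prototypes are left untouched
                m[k] = pts.mean(axis=0)
    return m, labels

# two tight clusters, three prototypes, one initialised far from the data
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
m = np.array([[0.0, 0.1], [5.0, 5.1], [100.0, 100.0]])
m, labels = kmeans(X, m)
# m[2] is never the closest prototype, so it stays at (100, 100)
```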
We can illustrate this effect with the following toy example: assume we have 3 data points (X_1, X_2 and X_3) and 3 prototypes (m_1, m_2 and m_3), and the distances between them are as follows. Let $d_{i,k} = \|X_i - m_k\|^2$. We consider the situation in which

m_1 is closest to X_1,
m_1 is closest to X_2,
m_2 is closest to X_3,

and so m_3 is not closest to any data point.

          m_1        m_2        m_3
X_1     d_{1,1}    d_{1,2}    d_{1,3}
X_2     d_{2,1}    d_{2,2}    d_{2,3}
X_3     d_{3,1}    d_{3,2}    d_{3,3}

Then

$$\mathrm{Perf} = J_K = d_{1,1} + d_{2,1} + d_{3,2}$$

which we minimise by changing the positions of the prototypes, m_1, m_2 and m_3.
Then

$$\frac{\partial \mathrm{Perf}}{\partial m_1} = \frac{\partial d_{1,1}}{\partial m_1} + \frac{\partial d_{2,1}}{\partial m_1} \neq 0, \qquad \frac{\partial \mathrm{Perf}}{\partial m_2} = \frac{\partial d_{3,2}}{\partial m_2} \neq 0, \qquad \frac{\partial \mathrm{Perf}}{\partial m_3} = 0.$$
So it is possible now to find new locations for m_1 and m_2 to minimise the performance function which determines the clustering, but it is not possible to find a new location for prototype m_3 as it is far from the data and is not used as a minimum for any data point.
We might consider the following performance function:

$$J_A = \sum_{i=1}^{N} \sum_{L=1}^{M} \|X_i - m_L\|^2 \qquad (2)$$
which provides a relationship between all the data points and prototypes, but it doesn't provide clustering at minimum performance since

$$\frac{\partial J_A}{\partial m_k} = -2 \sum_{i=1}^{N} (X_i - m_k) = 0 \;\Longrightarrow\; m_k = \frac{1}{N} \sum_{i=1}^{N} X_i.$$

Minimising the performance function groups all the prototypes at the centre of the data set regardless of the initial position of the prototypes, which is useless for identification of clusters.
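The collapse is easy to check numerically (an illustrative sketch; the data set and step size are arbitrary choices of ours):

```python
import numpy as np

# Gradient descent on J_A = sum_i sum_L ||X_i - m_L||^2.
# dJ_A/dm_k = -2 sum_i (X_i - m_k) is the same for every k, so every
# prototype is pulled to the data mean and they all collapse there.
X = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 6.0]])      # mean is (2, 2)
m = np.array([[10.0, 10.0], [-7.0, 3.0]])               # two different starts
for _ in range(200):
    grad = -2.0 * (X[None, :, :] - m[:, None, :]).sum(axis=1)   # (K, 2)
    m -= 0.05 * grad
# both prototypes end at the data mean (2, 2): no clustering
```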
A combined performance function

We wish to form a performance function with the following properties:

- its minimum gives a good clustering;
- it creates a relationship between all data points and all prototypes.

(2) provides an attempt to reduce the sensitivity to the centres' initialisation by making a relationship between all data points and all centres, while (1) provides an attempt to cluster data points at minimum performance. Therefore it may seem that what we want is to combine features of (1) and (2) to make a performance function such as:
$$J_1 = \sum_{i=1}^{N} \left( \sum_{j=1}^{M} \|X_i - m_j\| \right) \min_{k} \|X_i - m_k\|^2 \qquad (3)$$
We derive the clustering algorithm associated with this performance function by calculating the partial derivatives of (3) with respect to the prototypes. Consider the presentation of a specific data point, X_a, and the prototype m_k closest to X_a, i.e.

$$\min_k \|X_a - m_k\|^2 = \|X_a - m_k\|^2.$$
Then

$$\frac{\partial \mathrm{Perf}(i_a)}{\partial m_k} = -\frac{X_a - m_k}{\|X_a - m_k\|}\,\|X_a - m_k\|^2 - 2\,(X_a - m_k)\left(\|X_a - m_1\| + \cdots + \|X_a - m_M\|\right)$$

$$= -(X_a - m_k)\left[\|X_a - m_k\| + 2\left(\|X_a - m_1\| + \cdots + \|X_a - m_M\|\right)\right] = -(X_a - m_k)\, A_{ak} \qquad (4)$$

where

$$A_{ak} = \|X_a - m_k\| + 2\left(\|X_a - m_1\| + \cdots + \|X_a - m_M\|\right).$$
Now consider a second data point X_b for which m_k is not the closest prototype, i.e. the min() function gives the distance with respect to a prototype other than m_k:

$$\min_k \|X_b - m_k\|^2 = \|X_b - m_r\|^2,$$

where r ≠ k. Then
$$\mathrm{Perf}(i_b) = \left(\|X_b - m_1\| + \cdots + \|X_b - m_M\|\right)\|X_b - m_r\|^2$$

and so

$$\frac{\partial \mathrm{Perf}(i_b)}{\partial m_k} = -(X_b - m_k)\, B_{bk} \qquad (5)$$

where

$$B_{bk} = \frac{\|X_b - m_r\|^2}{\|X_b - m_k\|}.$$
For the algorithm, the partial derivative with respect to m_k over all data points, ∂Perf/∂m_k, is based on (4), or (5), or both of them. Consider the specific situation in which m_k is closest to X_2 but not closest to X_1 or X_3. Then we have
$$\frac{\partial \mathrm{Perf}}{\partial m_k} = -(X_1 - m_k)\, B_{1k} - (X_2 - m_k)\, A_{2k} - (X_3 - m_k)\, B_{3k}.$$
Setting this to 0 and solving for m_k gives

$$m_k = \frac{X_1 B_{1k} + X_2 A_{2k} + X_3 B_{3k}}{B_{1k} + A_{2k} + B_{3k}}. \qquad (6)$$
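Generalising (6) from three data points to N, one update of this first algorithm can be sketched as follows (our own vectorised reading; the name `alg1_update`, the `eps` guard and the synchronous update scheme are ours):

```python
import numpy as np

def alg1_update(X, m, eps=1e-12):
    """One synchronous update of algorithm 1. The weight of X_i on m_k is
      A_ik = d_ik + 2 * sum_j d_ij        if m_k is closest to X_i,
      B_ik = (min_r d_ir)^2 / d_ik        otherwise,
    with d_ik = ||X_i - m_k||, and each m_k moves to the weighted mean."""
    d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2) + eps   # (N, K)
    closest = d.argmin(axis=1)
    W = d.min(axis=1)[:, None] ** 2 / d                               # B_ik
    rows = np.arange(len(X))
    W[rows, closest] = d[rows, closest] + 2.0 * d.sum(axis=1)         # A_ik
    return (W[:, :, None] * X[:, None, :]).sum(axis=0) / W.sum(axis=0)[:, None]
```

With all prototypes started at one location, the A and B weight vectors are proportional, so every prototype is sent to the same weighted centre.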
Consider the previous example with 3 data points (X_1, X_2 and X_3) and 3 centres (m_1, m_2 and m_3), and the distances between them such that m_1 is closest to X_1, m_1 is closest to X_2, and m_2 is closest to X_3. We will write
          m_1        m_2        m_3
X_1      min_1     d_{1,2}    d_{1,3}
X_2      min_2     d_{2,2}    d_{2,3}
X_3     d_{3,1}     min_3     d_{3,3}
Then after training

$$m_1 = \frac{X_1 A_{11} + X_2 A_{21} + X_3 B_{31}}{A_{11} + A_{21} + B_{31}}$$

where

$$A_{11} = \min{}_1 + 2(\min{}_1 + d_{1,2} + d_{1,3}), \quad A_{21} = \min{}_2 + 2(\min{}_2 + d_{2,2} + d_{2,3}), \quad B_{31} = \frac{(\min{}_3)^2}{d_{3,1}},$$
$$m_2 = \frac{X_1 B_{12} + X_2 B_{22} + X_3 A_{32}}{B_{12} + B_{22} + A_{32}}$$

where

$$B_{12} = \frac{(\min{}_1)^2}{d_{1,2}}, \quad B_{22} = \frac{(\min{}_2)^2}{d_{2,2}}, \quad A_{32} = \min{}_3 + 2(d_{3,1} + \min{}_3 + d_{3,3}),$$
$$m_3 = \frac{X_1 B_{13} + X_2 B_{23} + X_3 B_{33}}{B_{13} + B_{23} + B_{33}}$$

where

$$B_{13} = \frac{(\min{}_1)^2}{d_{1,3}}, \quad B_{23} = \frac{(\min{}_2)^2}{d_{2,3}}, \quad B_{33} = \frac{(\min{}_3)^2}{d_{3,3}}.$$
This algorithm will cluster the data, with the prototypes which are closest to the data points being positioned in such a way that the clusters can be identified. However, there are some potential prototypes (such as m_3 in the example) which are not sufficiently responsive to the data and so never move to identify a cluster. In fact, as illustrated in the example, these points move to the centre of the data set (actually a weighted centre, as shown in the example). This may be an advantage in some cases, in that we can easily identify redundancy in the prototypes; however, it does waste computational resources unnecessarily.
A second algorithm

To solve this, we need to move these unused prototypes towards the data so that they may become the closest prototypes to at least one data sample and thus take advantage of the whole performance function. We do this by changing

$$B_{bk} = \frac{\|X_b - m_r\|^2}{\|X_b - m_k\|}$$

in (5) to

$$B_{bk} = \frac{\|X_b - m_r\|^2}{\|X_b - m_k\|^2}$$

which allows centres to move continuously until they are in a position to be closest to some data points. This change allows the algorithm to work very well in the case that all centres are initialised in the same location and very far from the data points.
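The modified update differs from algorithm 1 only in the B weight; as a sketch (again our own vectorised reading, with names of our choosing):

```python
import numpy as np

def alg2_update(X, m, eps=1e-12):
    """One synchronous update of algorithm 2: identical to algorithm 1
    except B_ik = (min_r d_ir)^2 / d_ik^2, so far, unused prototypes
    keep being pulled towards the data instead of freezing."""
    d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2) + eps   # (N, K)
    closest = d.argmin(axis=1)
    W = d.min(axis=1)[:, None] ** 2 / d**2                  # note the square
    rows = np.arange(len(X))
    W[rows, closest] = d[rows, closest] + 2.0 * d.sum(axis=1)   # A_ik as before
    return (W[:, :, None] * X[:, None, :]).sum(axis=0) / W.sum(axis=0)[:, None]
```

When all prototypes start at one location, the B weights of the undetected prototypes all equal 1, so those prototypes move to the plain data mean while the detected prototype moves elsewhere, which is exactly the separation the next example works through.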
Example: assume we have 3 data points (X_1, X_2 and X_3) and 3 centres (m_1, m_2 and m_3) initialised at the same location.

Note: we assume every data point has only one minimum distance to the centres; in other words, it is closest to one centre only. We treat the other centres as distant centres even if they have the same minimum value.
This step is optional, but it is very important if we want the algorithm to work very well in the case that all centres are initialised in the same location. Without this assumption, in this example we would find that the centres (m_1, m_2 and m_3) use the same equation and hence go to the same location!
Let

          m_1    m_2    m_3
X_1        a      a      a
X_2        b      b      b
X_3        c      c      c

Note: if we have 3 different data points, it is not possible to have a = b = c.
For algorithm 1, we have

$$m_1 = \frac{X_1 (7a) + X_2 (7b) + X_3 (7c)}{7a + 7b + 7c} = \frac{X_1 a + X_2 b + X_3 c}{a + b + c}.$$
Similarly,

$$m_2 = \frac{X_1 a + X_2 b + X_3 c}{a + b + c}, \qquad m_3 = \frac{X_1 a + X_2 b + X_3 c}{a + b + c}.$$
Note: all centres will go to the same location even if they are calculated by using two different types of equation.
For algorithm 2, we have

$$m_1 = \frac{X_1 (7a) + X_2 (7b) + X_3 (7c)}{7a + 7b + 7c} = \frac{X_1 a + X_2 b + X_3 c}{a + b + c}$$
while

$$m_2 = \frac{X_1 \frac{a^2}{a^2} + X_2 \frac{b^2}{b^2} + X_3 \frac{c^2}{c^2}}{\frac{a^2}{a^2} + \frac{b^2}{b^2} + \frac{c^2}{c^2}} = \frac{X_1 + X_2 + X_3}{3}$$
and similarly,

$$m_3 = \frac{X_1 + X_2 + X_3}{3}.$$
Notice that the centre m_1, which is detected by the minimum function, goes to a new location, while all the other centres, m_2 and m_3, are grouped together in another location. This change to B_{bk} makes separation between centres possible even if all of them start in the same location.
For algorithm 1, if the new location for m_2 and m_3 is still very far from the data and neither of them is detected as a minimum, the clustering algorithm stops without taking these centres into account in clustering the data.
For algorithm 2, if the new locations for m_2 and m_3 are still very far from the data points and neither of them is detected as a minimum, the clustering algorithm moves these undetected centres continually towards new locations. Thus the algorithm provides a clustering of the data that is insensitive to the initialisation of the centres.
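Iterating this second update from the worst-case initialisation (all centres at one point far from the data) can be sketched end to end; the data set, iteration count and names here are illustrative choices of ours, assuming the update is applied synchronously to every prototype each iteration:

```python
import numpy as np

def alg2(X, m, n_iter=50, eps=1e-12):
    """Repeatedly apply the algorithm-2 update: A weights for the closest
    prototype of each point, B_ik = (min_r d_ir)^2 / d_ik^2 for the rest."""
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2) + eps
        closest = d.argmin(axis=1)
        W = d.min(axis=1)[:, None] ** 2 / d**2
        rows = np.arange(len(X))
        W[rows, closest] = d[rows, closest] + 2.0 * d.sum(axis=1)
        m = (W[:, :, None] * X[:, None, :]).sum(axis=0) / W.sum(axis=0)[:, None]
    return m

# two tight clusters; both prototypes start together, very far away
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
m = alg2(X, np.array([[50.0, 50.0], [50.0, 50.0]]))
# each prototype settles near a different cluster
```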
Simulations

We illustrate these algorithms with a few simulations on artificial two-dimensional data sets, since the results are easily visualised. Consider first the data set in Figure 1: the prototypes have all been initialised within one of the four clusters.
Figure 2 shows the final positions of the prototypes when K-Means is used: two clusters are not identified. Figure 3 shows the final positions of the prototypes when K-Harmonic Means and algorithm 2 are used. K-Harmonic Means takes 5 iterations to find this position while algorithm 2 takes only three iterations. In both cases, all four clusters were reliably and stably identified.
Even with a good initialisation (for example with all the prototypes in the centre of the data, around the point (2,2)), K-Means will not guarantee to find all the clusters; K-Harmonic Means takes 5 iterations to move the prototypes to appropriate positions while algorithm 2 takes only one iteration to stably find appropriate positions for the prototypes.
Consider now the situation in which the prototypes are initialised very far from the data; this is an unlikely situation to happen in general, but it may be that all the prototypes are in fact initialised very far from a particular cluster. The question arises as to whether this cluster would be found by any algorithm.
We show the initial positions of the data and a set of four prototypes in Figure 4. The four prototypes are in slightly different positions. Figure 5 shows the final positions of the prototypes for K-Harmonic Means and algorithm 2. Again K-Harmonic Means took rather longer (24 iterations) to appropriately position the prototypes than algorithm 2 (5 iterations). K-Means moved all four prototypes to a central location (approximately the point (2,2)) and did not subsequently find the 4 clusters.

Figure 1: Data set shown as 4 clusters of red '+'s; prototypes are initialised to lie within one cluster and shown as blue '*'s.

Figure 2: Prototypes' positions when using K-Means.

Figure 3: Top: K-Harmonic Means after 5 iterations. Bottom: algorithm 2 after 3 iterations.
Figure 4: The prototypes are positioned very far from the four clusters.
Figure 5: Top: K-Harmonic Means after 24 iterations. Bottom: algorithm 2 after 5 iterations.
Figure 6: Results when prototypes are initialised very far from the data and all in the same position. Top: K-Harmonic Means. Bottom: algorithm 2.
Figure 7: Top: initial prototypes and data. Bottom: after 123 iterations all prototypes are situated on a data point.
Of course, the above situation was somewhat unrealistic since the number of prototypes was exactly equal to the number of clusters, but we have similar results with e.g. 20 prototypes and the same data set. We now go to the other extreme and have the same number of prototypes as data points. We show in Figure 7 a simulation in which we had 40 data points from the same four clusters and 40 prototypes which were initialised to a single location far from the data. The bottom diagram shows that each prototype is eventually located at a data point. This took 123 iterations of the algorithm, which is rather a lot; however, neither K-Means nor K-Harmonic Means performed well: both of these located every prototype in the centre of the data, i.e. at approximately the point (2,2). Even with a good initialisation (random locations throughout the data set), K-Means was unable to perform as well, typically having 28 prototypes redundant in that they had moved only to the centre of the data.
CONCLUSION

We have developed a new algorithm for data clustering and have shown that this algorithm is clearly superior to K-Means, the standard workhorse for clustering. We have also compared our algorithm to K-Harmonic Means, which is state-of-the-art in clustering algorithms, and have shown that under typical conditions it is comparable, while under extreme conditions it is superior. Future work will investigate convergence of these algorithms on real data sets.
REFERENCES

B. Zhang, M. Hsu, U. Dayal, K-Harmonic Means – a data clustering algorithm, Technical Report, HP Palo Alto Laboratories, Oct 1999.