# K-Means & K-Harmonic Means: A Comparison of Two Unsupervised Clustering Algorithms.

Τεχνίτη Νοημοσύνη και Ρομποτική

25 Νοε 2013 (πριν από 4 χρόνια και 6 μήνες)

165 εμφανίσεις

K
-
Means & K
-
Harmonic Means: A Comparison of Two Unsupervised
Clustering Algorithms.

Douglas Turnbull

CSE 202

Project

November 2002

1.

Introduction

Gaining a relational understanding of information is important to biology, human
cognition, artificial int
elligence, and many other data
-
intensive fields of research. In many
instances finding relationships may not be obvious by inspection due to numerous data
points or high dimensionality. In these cases, computation may be useful in order to divide
subsets o
f the information into groups called
clusters
. To effectively divide the information
we must first define a criterion for creating groups, and second, find an
optimal grouping

based on that criterion.

The problem can be generalized as follows: Given a se
t N of n data points in d dimensional
space, we must determine how to assign a set K of k points, called centers, in N so as to
optimize based on some criterion. In most cases, it is natural to assume that N is much
greater than K and d is relatively small
. This formulation is an example of unsupervised
learning. The system will create groupings based only the criterion and the information
contained in the n data point.

K
-
means

and
K
-
harmonic

means

are two center
-
based algorithm that have been developed
to

solve this problem.
K
-
means

(KM) is a popular algorithm that was first presented over
three decades ago [1]. The criterion it uses minimizes the total mean
-
squared distance from
each point in N to that point’s closest center in K.
K
-
harmonic means

(KHM) i
s a more
recent algorithm presented by Zhang in 2000[2]. This algorithm minimizes the harmonic
average from all points in N to all centers in K.

This paper will give a description of the general KM algorithm in section 2 and a
description of KHM in secti
on 3. Section 4 will compare KM and KHM by first looking
first looking a test case where KHM outperforms KM and then discussing some of the
fundamental differences between the two algorithms. Section 5 will present one current
application involving the cla
ssification of music and how it can use unsupervised
clustering. Lastly, section 6 will look at some potential research ideas for KHM.

2.

K
-
Means

2.1 The General KM Algorithm

An outline for this algorithm is based on an algorithm presented by Elkan [3
].

Data Structures:

N: n by (d+1) array
-

contains static information about data set

and a dynamic pointer to closest center

K: k by d array that holds information about centers

M: k by n array that holds distance from all point in N to all po
ints in K

Initialization

1.

Create an initial K:

Choose any k points from N

2.

Initialized pointers in N:

Assign pointers to any element in K

Main Loop

3.

Fill Matrix M:

Calculate distances from all points in N to all centers in K

4.

Update Array N:

Chang
e the pointer N[i] to point to a new center if that new center is
closer than pervious center

5.

If no pointer to a center is updated in step 4 then stop

The algorithm has converged

6.

Otherwise Recompute and Update K:

For each center c in K, compute average v
alue for all points in N that
point to c and replace c with new point c’ from data set that is closest
to average value. This can all be done with a linear pass though N.

2.2 K
-
means Analysis

2.2.1 Objective Function and Membership Constraint

This algori
thm minimizes the total mean squared distance for each point, x
i
, and the closest
center, c
j
.

minimize [SUM over i = 0 to N [min( ||x
i

c
j
||
2

for all c
j

in K)]]

Because the objective function only uses the minimum distances, each point x
i

is implicit
ly
assigned to exactly one center c
j
. The property of being assigned to one center is referred to
as
hard membership
membership

in section 4.)

2.2.2 Storage Requirements

Storage requirements assuming that n >> k >> d:

N + K + M = n * (d+1) + n * d + n * k = O(nk)

The size of matrix M dominates the storage

2.2.3 Running Time

The running time for each iteration algorithm is dominated by distance calculations in step
4. If we can assume that a d dimension distance calcu
lation is takes O(d) time, than the
running time per iteration is given by:

O(nkd)

The convergence rate in not just a function of n, k, and d but depends on the nature of the
data set. Therefore, it is difficult to discuss a runtime for the algorithm. F
or this reason
center
-
based clustering algorithms are usually compared by the runtime of a single
iteration.

2.2.4 Correctness

It has been shown in [5] that the problem of finding an optimal grouping given a d
dimensional set of n data points into k group
s is NP
-
hard. K
-
means uses a greedy heuristic
that seeks to find a local minimum given an initial condition.

In each iteration, we find a better local solution. We only include new centers that reduce
the value of the objective function at each step. The

algorithm converges when it is not
possible to exchange any center such that the total cost can be reduced. Although this gives
us a local minimum, we would have to rerun the algorithm to check every initial condition
in order to get a global minimum. Sin
ce there are an exponential number of possible
initialization combinations, rerunning with all possibilities would take exponential time.

3.

K
-
Harmonic Mean

KHM is similar to KM in that they are both center
-
based iterative algorithm. The main
difference
is how the centers are updated in each iteration. The formula for updating center
comes from a derivation found in [2]. In his paper, Zhang introduces a class of KHM with
parameter
p

that is power associated with the distance calculation. In the standard K
M
algorithm p would be 2 because the distance calculation is given by
squared
distance ||x
i

c
j
||
2
. It was found that KHM works better with values of p > 2. This will be discussed more
in section 4.

3.1 Harmonic Average Function

The harmonic average is

defined as

HA({a
1
,..,a
K
}) = K / [SUM over k = 1 to K ( 1 / a
k
) ]

This function has the property that if any one element in a
1
..a
K

is small, the Harmonic
Average will also be small. If there are no small values the harmonic average will be large.
It
behaves like a minimum function but also gives some weight to all the other values.

3.2

The KHM Algorithm

Data Structures:

N: n by d+1 array
-

contains static information about data set

K: k by d array that holds information about centers

Initialization

1.

Create an initial K:

Choose any k points from N

Main Loop

2.

Compute Harmonic Averages and Update K: See Appendix A

3.

If no center is updated in step 4 then stop

The algorithm has converged

(Note: That a more complete version of the pseudocode is gi
ven in Appendix A.)

3.3 K
-
harmonic Means Analysis

3.3.1 Objective Function and Membership Constraint

The objective function of KHM is given by:

minimize [SUM over i = 0 to N [HA( ||x
i

c
j
||
2

for all c
j

in K)]]

where HA() is the harmonic average for ea
ch data point. Unlike KM, this algorithm uses
information from all of the centers in K to calculate the harmonic average for each point in
N. This means that no center completely owns a point, but rather partially influences the
harmonic average for each p
oint. This condition is referred to as
soft membership.

3.3.2 Space Requirements, Running Time, and Correctness

The space requirements, O(nd), and running time, O(nkd), for KHM are given in [2]. The
space requirement for KHM is given as somewhat less than

KM because the algorithm
never requires the storage of large matrix M that stores the distance calculations from all
centers to all point. However, the all of these calculations must me made can be stored
temporarily.

Like KM, KHM finds a local optimal
solution based on initial conditions using a greedy
heuristic. Each iteration improves the objective function until convergence.

4. Comparison of K
-
Means with K
-
Harmonic means

A recent study by Hamerly and Elkan[4] found that KHM significantly outperfo
rms KM in
a number of experiments. They also found that KHM was much less sensitive to initial
conditions. This section will give a specific case in which KHM will finds a better
grouping. We will then discuss two properties of KHM that make it superior to

KM.

4.1 Test Case

Given a one
-
dimensional set of 8 data points, we wish to find an optimal three clustering.
In this case d=1, n = 8 and k = 3. The points are located at (1,2,3,4,6,7,9,10) and the initial
centers are located at (1,4,9).

1

2

3

4

5

6

7

8

9

10

x
1

x
2

x
3

x
4

x
5

x
6

x
7

x
8

c
1

c
2

c
3

4.1.1 KM on Test Case

Running KM, we find that after the first iteration, (x
1,
x
2
) is assigned to c
1
, (x
3,
x
4,

x
5
) is
assigned to c
2,
and (x
6,
x
7,

x
8
) is assigned to c
3.
When recomputing the averages

for each
center in step 6 of the KM algorithm we would find that no centers would need to be
changed and the algorithm would converge with an objective function to equal to 11. (We
will assume that an average value that is equidistant from two data point
s will be assigned
to point with lesser absolute value. This is an unimportant detail but a needed policy for the
reassignment of c
1.
)

However, a better local minimum would have been found given an initialization in which
the centers were assigned to (x
2
, x
5
, x
9
). This would yield and local objective of 8.

4.1.2 KHM on Test Case

Given the same initial set up, the KHM algorithm, with p =3, given in Appendix A will
converge after 3 iterations. The weight of each data point on the objective function is
calc
ulated in each iteration.

The weight function is given as:

Weight
i

= 1 / [ Sum from l = 1 to K ( 1 / ||x
i

c
l
||
p
) ]
2

Once calculated, these weights are normalized and used for calculating the new values for
updating the centers.

The state after each

iteration is as follows:

Iteration 1:

1

2

3

4

5

6

7

8

9

10

x
1

x
2

x
3

x
4

x
5

x
6

x
7

x
8

c
1

c
2

c
3

Weight:

0

.79

.78

0

34

36

0

.99

Iteration 2:

1

2

3

4

5

6

7

8

9

10

x
1

x
2

x
3

x
4

x
5

x
6

x
7

x
8

c
1

c
2

c
3

Weight

:

.91

0

.24

0

.76

0

54

524

Iteration 2:

1

2

3

4

5

6

7

8

9

10

x
1

x
2

x
3

x
4

x
5

x
6

x
7

x
8

c
1

c
2

c
3

Weight

:

.99

0

.92

15

0

.77

0

.96

Note that the final set of center does find a good local (and in this case, global) optimal
solution both in resp
ect to the KHM objective and KM objective function. Some of the
weights are relatively high (524, 54, 36) for points far from a center and low for point lose
to a center (epsilon represented by 0, .24, .78).

4.2 Hard vs. Soft Membership

As we saw in sect
ion 2.2.1, KM imposes hard membership on it data points: each data
point is assigned to exactly one center. This means that each data point only has an
influence over the center to which it is assigned. In area with high local density of data
points and ce
nters, a center maybe unable move away from a data point despite the fact that
there is a second center nearby. This second center might yield a slightly worse local
solution, but the global effect that repositioning one of the two centers would be benefic
ial
to the clustering. However, this swapping cannot be done in using KM.

KHM differs in that, for each data point, the objective function uses the distances to all
centers. The harmonic average is sensitive to the fact that there exist two or more center
s
are close to a data point. The algorithm will naturally shift one or more of these centers
away to areas where there data point that have no close center. This will create a lower
value for the objective function.

4.3 Static vs. Dynamic Weights

In ea
ch iteration of KM, the objective function gives equal weight to all of the data
points. KHM, on the other hand, assigns dynamic weights to each data points based on a
harmonic average. As described in section 3.1, the harmonic average will assign a large
weight to a data point that is not close to any centers and a small weight to data point that
is close to one or more centers. This principle is important because we want to avoid
creating densely packed area of multiple centers. By increasing the weight o
f centers that
are not close to any center, the algorithm can attract centers away from those dense areas
without increasing the weight of those data points that are in more dense areas. This
principal is central to KHM being less sensitive to initializati
on than KM. In KM centers
tend to get trapped in dense area yielding poor clustering results.

5 Application: Music Genre Classification

One application that involves the unstructured clustering is the automatic classification of
music for information re
trieval(IR). The problem involves large libraries of digitally
recorded sound files. In [6], Tzanetakis and Cook use signal processing techniques and
statistical pattern recognition algorithms to extract
features

from recorded sound files.

In this case,
the sound files represent data point and the features represent the
dimensionality of data. Much energy has been put into supervised clustering based on
nearest neighbor algorithms with a predefined audio classification hierarchy and
associated training se
ts. However, in this method, notions of human cognition and
psychoacoustics play a large role in determining the success of a classification system.

Using a fully automatic clustering algorithm such as KHM could limit the subjective
nature of musical clas
sification. Although this new classification may not make sense to a
human, it could be used to automatically store an retrieve sound information with a much
higher degree of accuracy.

6 Future Work

6.1

Original Goal

Over the past 30 years, many papers
have presented ways in which to improve KM.
Some focus on creating good initialization methods while other look at finding optimal
values for k. My thought was that using some of these ideas but with KHM in mind.

In [3], Elkan describes ways to improve KM

using geometric insights from the triangle
inequality. He presents two lemma, one for an upper bound and one for a lower bound, in
order to reduce the number of distance calculations needed. Although the improvements
do not reduce the theoretical computat
ional complexity, they do greatly reduce the
number of distance calculation and achieve significant speedups in practice.

My original goals in writing this paper was to try to use the same principles involving the
triangle inequality and apply them to t
he KHM algorithm. This was found to be not
possible due to fact that the number of calculation cannot be reduced since all
calculations between centers and points are needed for soft membership algorithms.

6.2

Improving on KHM

Despite Zhang claim in [2] that

KHM is “essentially insensitive to initialization,” both [2]
and [4] show that there is still remains some variance in the local optimal solution given
by different initialization.

One idea to further reduce this variance would be to use the centers fou
nd after one pass
KHM as an initialization to a second pass through KHM. We could artificially move
centers from more dense clusters to locations between sparse clusters.

The Algorithm is as follows:

1.

Run KHM

2.

Run KM to determine hard membership

3.

For each
center, calculate variance of cluster.

4.

Transplant centers from the
m

densest clusters to the
m

locations between sparse
clusters.

5.

Rerun KHM

This algorithm could be evaluated by testing to see if the variance of the solution set is
less than that of one p
ass through KHM given a series of different initializations.

Appendix A: Complete Pseudo Code for KHM

Data Structures:

N: n by d+1 array
-

contains static information about data set

Nmin: n element array which holds the minimum distance to any center

K: k by d array that holds information about centers

M: n by k array that holds distance from all point in N to all points in K

Temporary Arrays (Could be reduced but shown for simplicity)

U: n element array

Q: n by k temporary array

QQ:k element array

R: n by k temporary array

RR:k element array

T: n by k

p: KHM parameter discussed in beginning of section 3

Initialization

4.

Create an initial K:

Choose any k points from N

Main Loop

5.

Fill Matrix M:

Calculate distances from all points in N to all
centers in K

6.

Compute Nmin:

Find minimum distance for to any center for each point in N

7.

Recompute Harmonic Averages and Update K:

For each point (j = 0 to n)

For each center (i = 0 to k)

U[j] = U[j] + (Nmin[j]/N[j,I])

U[j] = U[j]

1;

For each center (i = 0 to k)

For each point (j = 0 to n)

Q[j,I] = [(Nmin[i]^(p
-
2) * (Nmin[i]/N[j,i])^(p+2)] /

[(1 + U[j]^p)^2]

For each center (i = 0 to k)

For each point (j = 0 to n)

QQ[i] += Q[j,I]

For each center (i = 0 to k)

For each

point (j = 0 to n)

R[j,I] = Q[j,i] / QQ[i]

For each center (i = 0 to k)

For each point (j = 0 to n)

K[i] = K[i] + R[j,i]*N[j]

8.

If no center is updated in step 4 then stop

The algorithm has converged

The algorithm above is given in a simplified

form as to show all the temporary values that
are calculated by taking partial derivative for the Harmonic Average Function. Each nested
loop in Step 4 represents the one of the five decomposed equations from Equation 6 of [3].

References

[1] J. MacQue
en.

Some methods for classification and analysis of multivariate
observations.

Proceeding of the fifth Berkeley symposium on mathematical statistics and
probability. Pp 281
-
297, 1967.

[2] B. Zhang.
Generalized k
-
harmonic means

boosting in unsupervised l
earning.

Technical Report HPL
-
2000
-
137, Hewlett
-
Packard Labs, 2000.

[3] C. Elkan.
K
-
means: fast, faster, fastest.

UCSD AI Seminar Talk, October 2002.

[4] G. Hamerly and C. Elkan.

Alternati
ves to the k
-
means algorithm that find better
clusterings

(pdf).

To appear in
Proceedings of the Eleventh International Conference on
Information and Knowledge Management

(CIKM'02), November 2002.

[5] Drineas, Frieze, Kannan, Vempala, and Vinay. “Cluste
ring in large graphs and
matrices.”

ACM
-
SIAM Symposium on Discrete Algorithms, 1999.

[6] Tzanetakis, and Cook “Musical Genre Classification of Audio Signals.”
IEEE
Transactions on Speech and Audio Processing
. 2002.