Approaching the Optimal Solution by Combining Clusterings
Mohammad Rezaei, Pasi Fränti
Abstract

Applying K-means to a dataset with different initial centroids usually produces different clusterings. Each clustering may capture some of the correct centroids. In this paper we propose a method for constructing a new clustering from two clusterings that combines the benefits of both. Based on merge and split costs, computed for each cluster in one clustering with respect to the other, some centroids are replaced with centroids from the other clustering. In an iterative algorithm, the resulting clustering is then combined with a new clustering produced by K-means. If there are naturally separated clusters in a dataset, the proposed method can find the optimal clustering in a few iterations.
Keywords – Clustering, optimal solution, K-means
I. Introduction
Data clustering, or unsupervised learning, is an important but extremely difficult problem. The objective of clustering is to partition a set of unlabeled objects into homogeneous groups or clusters. A number of application areas use clustering techniques for organizing or discovering structure in data, such as data mining, information retrieval, image segmentation, and machine learning [1].
A large number of clustering algorithms exist, yet no single algorithm is able to identify all sorts of cluster shapes and structures that are encountered in practice [1]. Most of the proposed techniques utilize an optimization procedure tuned to a particular cluster shape, or emphasize cluster compactness [1].
From an optimization perspective, clustering can be formally considered as a particular kind of NP-hard grouping problem [6]. Evolutionary algorithms in particular are metaheuristics widely believed to be effective on NP-hard problems, being able to provide near-optimal solutions to such problems in reasonable time [6]. Under this assumption, a large number of evolutionary algorithms for solving clustering problems have been proposed in the literature. These algorithms are based on the optimization of some objective function (i.e., the so-called fitness function) that guides the evolutionary search [6]. Surveys on evolutionary algorithms for clustering can be found in [6], [7].
The most widely used criterion is the clustering error, which for each point computes its squared distance from the corresponding cluster center and then takes the sum of these distances over all points in the dataset. A popular clustering method that minimizes the clustering error is the k-means algorithm. However, k-means is a local search procedure, and it is well known to suffer from the serious drawback that its performance heavily depends on the initial starting conditions [4].
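As a concrete illustration of this criterion, the following is a minimal 1-D sketch of k-means together with the clustering error (sum of squared errors); the function names and toy data are ours, not from the cited works:

```python
import random

def sse(points, centroids, labels):
    # Clustering error: sum of squared distances of each point
    # to its assigned cluster center.
    return sum((p - centroids[l]) ** 2 for p, l in zip(points, labels))

def kmeans(points, k, iters=100, seed=0):
    # Plain k-means on 1-D data: random initial centroids, then
    # alternating assignment and centroid-update steps.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: (p - centroids[j]) ** 2)
                  for p in points]
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new.append(sum(members) / len(members) if members else centroids[j])
        if new == centroids:  # converged to a local minimum
            break
        centroids = new
    return centroids, labels, sse(points, centroids, labels)

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centroids, labels, err = kmeans(data, k=2)
```

With two well-separated groups in the toy data, a single run already reaches a low clustering error; on harder data the result depends on the random initial centroids, which is exactly the drawback discussed above.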
To address this problem, several other techniques have been developed that are based on stochastic global optimization methods (e.g., simulated annealing, genetic algorithms) [4].
The simplest way to obtain a better result from k-means is repeated k-means: the clustering is simply repeated with different initial centroids and the best result is selected.
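Repeated k-means can be sketched in a few lines. This is a toy 1-D version with a compact k-means routine inlined so the snippet is self-contained; the names and data are ours:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    # Minimal 1-D k-means returning (centroids, SSE); included here
    # only to make the repeated-k-means sketch runnable.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: (p - centroids[j]) ** 2)
                  for p in points]
        new = [sum(ps) / len(ps)
               if (ps := [p for p, l in zip(points, labels) if l == j])
               else centroids[j]
               for j in range(k)]
        if new == centroids:
            break
        centroids = new
    err = sum((p - centroids[l]) ** 2 for p, l in zip(points, labels))
    return centroids, err

def repeated_kmeans(points, k, repeats=10):
    # Run k-means from several random initializations and keep
    # the solution with the lowest clustering error.
    results = [kmeans(points, k, seed=r) for r in range(repeats)]
    return min(results, key=lambda r: r[1])

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
best_centroids, best_err = repeated_kmeans(data, k=2)
```

Each repeat is independent, so the expected quality improves with the number of repeats, but no information is shared between runs.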
The global k-means algorithm [4] is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N (with N being the size of the dataset) executions of the k-means algorithm from suitable initial positions.
K-means++ [5] is a way of initializing k-means by choosing random starting centers with very specific probabilities: a point p is chosen as a center with probability proportional to p's contribution to the overall potential. Different variants of K-means++ have also been proposed.
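The seeding step can be sketched as follows (a toy 1-D version; `kmeanspp_seeds` is our own name, and the snippet covers only the initialization, not the subsequent k-means run):

```python
import random

def kmeanspp_seeds(points, k, seed=0):
    # k-means++ seeding [5]: the first center is chosen uniformly at
    # random; each further center is a data point chosen with probability
    # proportional to its squared distance to the nearest center so far.
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
seeds = kmeanspp_seeds(data, k=2)
```

Points far from all existing centers carry large weights, so the seeds tend to spread across the natural groups before k-means even starts.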
In [2], a randomly selected centroid is replaced by a random data point, and k-means is then applied to the new centroids. The new solution is used for the next iteration if a lower MSE is achieved; otherwise another swap is applied to the previous solution.
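A simplified sketch of the swap step described above follows (our own toy 1-D version; unlike [2], it omits the k-means fine-tuning after each swap to keep the snippet short):

```python
import random

def assign_err(points, centroids):
    # Nearest-centroid assignment and the resulting SSE (1-D data).
    labels = [min(range(len(centroids)),
                  key=lambda j: (p - centroids[j]) ** 2)
              for p in points]
    return sum((p - centroids[l]) ** 2 for p, l in zip(points, labels))

def random_swap(points, k, trials=50, seed=0):
    # Random-swap search in the spirit of [2]: replace a randomly chosen
    # centroid with a random data point and keep the swap only if the
    # clustering error decreases.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    err = assign_err(points, centroids)
    for _ in range(trials):
        cand = list(centroids)
        cand[rng.randrange(k)] = rng.choice(points)
        cand_err = assign_err(points, cand)
        if cand_err < err:  # accept only improving swaps
            centroids, err = cand, cand_err
    return centroids, err

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centroids, err = random_swap(data, k=2)
```

A single accepted swap can move a centroid across the whole dataset, which is how the method escapes the local minima that trap plain k-means.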
In [3], a random swap EM is proposed for the initialization of EM. Instead of starting from a completely new solution in each repeat, as in repeated EM, a random perturbation is made on the solution before continuing the EM iterations. The removal and addition in random swap are simpler and more natural than split and merge or crossover and mutation operations [3].
Many different approaches have been proposed in the field of clustering ensembles, in which the partitions from multiple clusterings are combined to infer the final clusters from all partitions [1].
In this paper we propose a new scheme to combine two centroid-based clusterings to achieve a better MSE. We also propose a simple iterative algorithm in which the result of combining two clusterings is iteratively combined with a new clustering to approach the optimal solution. The main idea is that …
References
[1] A. L. N. Fred and A. K. Jain, "Combining multiple clusterings using evidence accumulation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 835–850, 2005.
[2] P. Fränti, O. Virmajoki, and V. Hautamäki, "Probabilistic clustering by random swap algorithm," in ICPR, 2008, pp. 1–4.
[3] Q. Zhao, V. Hautamäki, I. Kärkkäinen, and P. Fränti, "Random swap EM algorithm for Gaussian mixture models," Pattern Recognition Letters, vol. 33, pp. 2120–2126, 2012.
[4] A. Likas, N. Vlassis, and J. J. Verbeek, "The global k-means clustering algorithm," Pattern Recognition, vol. 36, no. 2, pp. 451–461, Feb. 2003.
[5] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in SODA, 2007, pp. 1027–1035.
[6] E. R. Hruschka, R. J. G. B. Campello, A. A. Freitas, and A. C. P. L. F. de Carvalho, "A survey of evolutionary algorithms for clustering," IEEE Trans. Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 39, no. 2, pp. 133–155, March 2009.
[7] M. C. Naldi, A. C. P. L. F. de Carvalho, R. J. G. B. Campello, and E. R. Hruschka, "Genetic clustering for data mining," in Soft Computing for Knowledge Discovery and Data Mining, O. Maimon and L. Rokach, Eds. New York: Springer-Verlag, 2007, pp. 113–132.