Approaching Optimal Solution by Combining Clusterings


8 Nov 2013



Mohammad Rezaei, Pasi Fränti


Abstract - Applying K-means to a dataset with different initial centroids usually
produces different clusterings. Each clustering may succeed in finding some of the
correct centroids. In this paper we propose a method for constructing a new clustering
from two clusterings that combines the benefits of both. According to merge and split
costs (computed for each cluster in one clustering with respect to the other), some
centroids are replaced with centroids from the other clustering. In an iterative
algorithm, the resulting clustering is then combined with a new clustering produced by
K-means. If the dataset contains naturally separated clusters, the proposed method can
find the optimal clustering in a few iterations.


Keywords - Clustering, optimal solution, K-means

I. Introduction

Data clustering, or unsupervised learning, is an important but extremely difficult
problem. The objective of clustering is to partition a set of unlabeled objects into
homogeneous groups or clusters. A number of application areas use clustering
techniques for organizing or discovering structure in data, such as data mining,
information retrieval, image segmentation, and machine learning [1].


A large number of clustering algorithms exist, yet no single algorithm is able to
identify all sorts of cluster shapes and structures that are encountered in practice [1].
Most of the proposed techniques utilize an optimization procedure tuned to a
particular cluster shape, or emphasize cluster compactness [1].

From an optimization perspective, clustering can be formally considered as a
particular kind of NP-hard grouping problem [6]. In particular, evolutionary
algorithms are metaheuristics widely believed to be effective on NP-hard problems,
being able to provide near-optimal solutions to such problems in reasonable time [6].
Under this assumption, a large number of evolutionary algorithms for solving
clustering problems have been proposed in the literature. These algorithms are based
on the optimization of some objective function (the so-called fitness function)
that guides the evolutionary search [6]. Surveys on evolutionary algorithms for
clustering can be found in [6], [7].

The most widely used criterion is the clustering error criterion, which for each
point computes its squared distance from the corresponding cluster center and then
takes the sum of these distances over all points in the data set. A popular clustering
method that minimizes the clustering error is the k-means algorithm. However, the
k-means algorithm is a local search procedure, and it is well known that it suffers
from the serious drawback that its performance heavily depends on the initial starting
conditions [4].
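As an illustration of the clustering error criterion, the following minimal sketch sums the squared Euclidean distances from each point to its cluster center. It assumes that the "corresponding" center of a point is its nearest centroid; the function names and the toy data are hypothetical and not part of the original work.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def clustering_error(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid.

    Assumption: each point belongs to the cluster of its nearest centroid.
    """
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Toy data (hypothetical): two pairs of points, each pair 2 units apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

# Every point lies exactly 1 unit from its centroid, so the error is 4 * 1^2.
error = clustering_error(points, centroids)
print(error)  # 4.0
```

Dividing this sum by the number of points gives the mean squared error (MSE), which ranks solutions for a fixed data set in the same order.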

To treat this problem several other techniques have been developed that are based
on stochastic global optimization methods (e.g. simulated annealing, genetic
algorithms) [4].

The simplest way to obtain a better result from k-means is repeated k-means: the
clustering is repeated with different initial centroids and the best result is selected.
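Repeated k-means can be sketched as follows. This is only an illustrative outline under stated assumptions: initial centroids are drawn as random data points, "best" means lowest sum of squared errors, and the names `kmeans`, `repeated_kmeans`, `repeats`, and `seed` are hypothetical.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, centroids, max_iter=100):
    """Lloyd's iterations from given initial centroids; returns (centroids, sse)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Assumption: an empty cluster simply keeps its previous centroid.
        updated = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


def repeated_kmeans(points, k, repeats=20, seed=0):
    """Run k-means from `repeats` random initializations and keep the best run."""
    rng = random.Random(seed)
    best = None
    for _ in range(repeats):
        initial = rng.sample(points, k)  # k distinct data points as centroids
        result = kmeans(points, initial)
        if best is None or result[1] < best[1]:
            best = result
    return best


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
        (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
best_centroids, best_sse = repeated_kmeans(data, k=3)
```

Each repeat is independent of the others, so nothing learned in one run is carried over to the next.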

The global k-means algorithm [4] is an incremental approach to clustering that
dynamically adds one cluster center at a time through a deterministic global search
procedure consisting of N executions of the k-means algorithm (N being the size of
the data set) from suitable initial positions.
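The incremental procedure described above might be sketched as follows. This is a simplified reading, not the reference implementation of [4]: it assumes that the solution for one cluster is the data mean, and that each new center is tried at every data point in turn while the centers found so far serve as the remaining initial positions. All function names are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def sse(points, centroids):
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def kmeans(points, centroids, max_iter=100):
    """Lloyd's iterations from given initial centroids; returns (centroids, sse)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Assumption: an empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, sse(points, centroids)


def global_kmeans(points, k_max):
    """Add one center at a time, trying every data point as its start position."""
    centroids = [mean(points)]  # the one-cluster solution is the data mean
    error = sse(points, centroids)
    for _ in range(2, k_max + 1):
        # N runs of k-means: previous centers plus one candidate data point.
        candidates = [kmeans(points, centroids + [x]) for x in points]
        centroids, error = min(candidates, key=lambda result: result[1])
    return centroids, error


data = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0),
        (10.0, 1.0), (20.0, 0.0), (20.0, 1.0)]
centers, error = global_kmeans(data, k_max=3)
```

The search is deterministic because no random choice is involved, at the price of N k-means runs for every added center.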

K-means++ [5] is a way of initializing k-means by choosing random starting centers
with very specific probabilities. Specifically, a point p is chosen as a center with
probability proportional to p's contribution to the overall potential. Different
variants of K-means++ have also been proposed.
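The seeding rule can be illustrated with the short sketch below. It assumes that the first center is drawn uniformly at random and that a point's contribution to the potential is its squared distance to the nearest center chosen so far; the name `kmeanspp_init` and the `seed` parameter are hypothetical.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_init(points, k, seed=0):
    """Choose k initial centers with probability proportional to squared distance."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # assumption: first center is uniform
    while len(centers) < k:
        # Contribution of each point to the potential: squared distance
        # to its nearest already-chosen center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


data = [(0.0, 0.0)] * 3 + [(5.0, 5.0)] * 3
initial_centers = kmeanspp_init(data, k=2)
```

In this toy data set, points that coincide with an already chosen center have zero weight, so the second center necessarily falls on the other location. The returned centers would then be passed to ordinary k-means as its starting point.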


In [2], a randomly selected centroid is replaced with a random data point. K-means
is then applied to the new centroids. The new solution is used for the next iteration
if it achieves a lower MSE; otherwise, another swap is applied to the previous solution.
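The swap-and-accept loop described above can be outlined as follows. This is a rough sketch, not the implementation of [2]: it runs k-means to convergence after every swap, compares solutions by the sum of squared errors (which differs from the MSE only by the constant factor N), and uses hypothetical names such as `random_swap`, `swaps`, and `seed`.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, centroids, max_iter=100):
    """Lloyd's iterations from given initial centroids; returns (centroids, sse)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Assumption: an empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    error = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, error


def random_swap(points, k, swaps=50, seed=0):
    """Replace a random centroid with a random point; keep the swap if it helps."""
    rng = random.Random(seed)
    centroids, error = kmeans(points, rng.sample(points, k))
    for _ in range(swaps):
        trial = list(centroids)
        trial[rng.randrange(k)] = rng.choice(points)  # the swap
        trial, trial_error = kmeans(points, trial)    # fine-tune by k-means
        if trial_error < error:                       # accept only improvements
            centroids, error = trial, trial_error
    return centroids, error


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
        (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
swapped_centroids, swapped_error = random_swap(data, k=3)
```

Unlike repeated k-means, each trial starts from the current solution, so the centroids that are already well placed are retained.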


In [3], a random swap EM is proposed for the initialization of EM. Instead of starting
from a completely new solution in each repeat, as in repeated EM, a random
perturbation is made to the solution before continuing the EM iterations. The removal
and addition in random swap are simpler and more natural than split and merge or
crossover and mutation operations.



Many different approaches have been proposed in the field of clustering ensembles,
in which the partitions from multiple clusterings are combined to statistically infer
the final clusters from all partitions [1].


In this paper we propose a new scheme to combine two centroid-based clusterings to
achieve a better MSE. We also propose a simple iterative algorithm in which the
result of combining two clusterings is iteratively combined with a new clustering to
approach the optimal solution.

The main idea is that …



References

[1] A. L. N. Fred and A. K. Jain, "Combining multiple clusterings using evidence
accumulation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 835-850, 2005.

[2] P. Fränti, O. Virmajoki, and V. Hautamäki, "Probabilistic clustering by random
swap algorithm," in ICPR, 2008, pp. 1-4.

[3] Q. Zhao, V. Hautamäki, I. Kärkkäinen, and P. Fränti, "Random swap EM
algorithm for Gaussian mixture models," Pattern Recognition Letters, vol. 33,
pp. 2120-2126, 2012.

[4] A. Likas, N. Vlassis, and J. J. Verbeek, "The global k-means clustering
algorithm," Pattern Recognition, vol. 36, no. 2, pp. 451-461, Feb. 2003.

[5] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding,"
in SODA, 2007, pp. 1027-1035.


[6] E. R. Hruschka, R. J. G. B. Campello, A. A. Freitas, and A. C. P. L. F. de
Carvalho, "A survey of evolutionary algorithms for clustering," IEEE Trans. Systems,
Man, and Cybernetics, Part C: Applications and Reviews, vol. 39, no. 2, pp. 133-155,
Mar. 2009.

[7] M. C. Naldi, A. C. P. L. F. de Carvalho, R. J. G. B. Campello, and E. R.
Hruschka, "Genetic clustering for data mining," in Soft Computing for Knowledge
Discovery and Data Mining, O. Maimon and L. Rokach, Eds. New York: Springer-Verlag,
2007, pp. 113-132.