An efficient hybrid data clustering method based on K-harmonic means and Particle Swarm Optimization

Fengqin Yang a,b,*, Tieli Sun a, Changhai Zhang b

a College of Computer Science, Northeast Normal University, Changchun, Jilin 130117, China
b College of Computer Science and Technology, Jilin University, Changchun, Jilin 130012, China
Article info

Keywords: Data clustering; K-means; K-harmonic means; Particle Swarm Optimization
Abstract

Clustering is the process of grouping data objects into a set of disjoint classes, called clusters, so that objects within a class are highly similar to one another and dissimilar to the objects in other classes. The K-means (KM) algorithm is one of the most popular clustering techniques because it is easy to implement and works fast in most situations. However, it is sensitive to initialization and is easily trapped in local optima. K-harmonic means (KHM) clustering solves the problem of initialization using a built-in boosting function, but it also easily runs into local optima. Particle Swarm Optimization (PSO) is a stochastic global optimization technique. A hybrid data clustering algorithm based on PSO and KHM (PSO-KHM) is proposed in this research, which makes full use of the merits of both algorithms. The PSO-KHM algorithm not only helps KHM clustering escape from local optima but also overcomes the shortcoming of the slow convergence speed of the PSO algorithm. The performance of the PSO-KHM algorithm is compared with those of PSO and KHM clustering on seven data sets. Experimental results indicate the superiority of the PSO-KHM algorithm.

© 2009 Elsevier Ltd. All rights reserved.
1. Introduction

Clustering is a search for hidden patterns that may exist in data sets. It is a process of grouping data objects into disjoint clusters so that the objects in each cluster are similar, yet different from those in other clusters. Clustering techniques are applied in many areas such as pattern recognition (Halberstadt & Douglas, 2008; Webb, 2002; Zhou & Liu, 2008), machine learning (Alpaydin, 2004), data mining (Tan, Steinbach, & Kumar, 2005; Tjhi & Chen, 2008), information retrieval (Hu, Zhou, Guan, & Hu, 2008; Li, Chung, & Holt, 2008) and bioinformatics (He, Pan, & Lin, 2006; Kerr, Ruskin, Crane, & Doolan, 2008). The K-means (KM) algorithm is one of the most popular and widespread partitioning clustering algorithms because of its feasibility and efficiency in dealing with large amounts of data. KM partitions data objects into k clusters, where the number of clusters, k, is decided in advance according to application purposes. The most common clustering objective in KM is to minimize the sum of dissimilarities between all objects and their corresponding cluster centers. The main drawback of the KM algorithm is that the clustering result is sensitive to the selection of the initial cluster centers and may converge to local optima (Cui & Potok, 2005; Kao, Zahara, & Kao, 2008).

The K-harmonic means (KHM) algorithm is a more recent algorithm proposed by Zhang, Hsu, and Dayal (1999, 2000) and modified by Hammerly and Elkan (2002). The clustering objective in this algorithm is to minimize the harmonic average of the distances from all points in the data set to all cluster centers. The KHM algorithm solves the initialization problem of the KM algorithm, but it also easily runs into local optima. Particle Swarm Optimization (PSO) is a population-based stochastic optimization technique developed by Kennedy and Eberhart (1995), inspired by the social behavior of bird flocking and fish schooling. Nowadays, PSO has gained much attention and wide application in a variety of fields (Feng, Chen, & Ye, 2007; Liu, Wang, & Jin, 2008). In this paper, we explore the application of PSO to help the KHM algorithm escape from local optima. A hybrid data clustering algorithm based on KHM and PSO, called PSO-KHM, is proposed. The experimental results on a variety of data sets drawn from several artificial and real-life situations indicate that the PSO-KHM algorithm is superior to both the KHM algorithm and the PSO algorithm.

The rest of the paper is organized as follows. Section 2 introduces KHM clustering. Section 3 briefly describes the PSO search technique. Section 4 presents our hybrid clustering algorithm. Section 5 illustrates the experimental results. Finally, Section 6 draws conclusions.
0957-4174/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2009.02.003

* Corresponding author. Address: College of Computer Science, Northeast Normal University, 5268 Renmin Street, Changchun, Jilin 130117, China.
E-mail address: yangfq147@nenu.edu.cn (F. Yang).
Expert Systems with Applications 36 (2009) 9847–9852
2. K-harmonic means clustering

KM clustering is a simple and fast method used commonly due to its straightforward implementation and small number of iterations. The KM algorithm attempts to find the cluster centers (c_1, ..., c_k) such that the sum of the squared distances of each data point x_i to its nearest cluster center c_j is minimized. The dependency of KM performance on the initialization of the centers has been a major problem. This is due to its winner-takes-all partitioning strategy, which results in a strong association between the data points and the nearest center and prevents the centers from moving out of a local density of data.

KHM clustering addresses this intrinsic problem by replacing the minimum distance from a data point to the centers, used in KM, with the harmonic mean of the distances from each data point to all centers. The harmonic mean gives a good (low) score for each data point when that data point is close to any one center. This property of the harmonic mean is similar to the minimum function used by KM, but the harmonic mean is a smooth differentiable function. The following notation is used to formulate the KHM algorithm (Hammerly & Elkan, 2002; Ünler & Güngör, 2008):

X = {x_1, ..., x_n}: the data to be clustered.
C = {c_1, ..., c_k}: the set of cluster centers.
m(c_j | x_i): the membership function defining the proportion of data point x_i that belongs to center c_j.
w(x_i): the weight function defining how much influence data point x_i has in recomputing the center parameters in the next iteration.
The basic algorithm for KHM clustering is as follows:

1. Initialize the algorithm with guessed centers C, i.e., randomly choose the initial centers.
2. Calculate the objective function value according to

   KHM(X, C) = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} 1/\|x_i - c_j\|^{p}},   (1)

   where p is an input parameter, typically p >= 2.
3. For each data point x_i, compute its membership m(c_j | x_i) in each center c_j according to

   m(c_j \mid x_i) = \frac{\|x_i - c_j\|^{-p-2}}{\sum_{j=1}^{k} \|x_i - c_j\|^{-p-2}}.   (2)

4. For each data point x_i, compute its weight w(x_i) according to

   w(x_i) = \frac{\sum_{j=1}^{k} \|x_i - c_j\|^{-p-2}}{\big(\sum_{j=1}^{k} \|x_i - c_j\|^{-p}\big)^{2}}.   (3)

5. For each center c_j, recompute its location from all data points x_i according to their memberships and weights:

   c_j = \frac{\sum_{i=1}^{n} m(c_j \mid x_i)\, w(x_i)\, x_i}{\sum_{i=1}^{n} m(c_j \mid x_i)\, w(x_i)}.   (4)

6. Repeat steps 2–5 for a predefined number of iterations or until KHM(X, C) does not change significantly.
7. Assign each data point x_i to the cluster j with the biggest m(c_j | x_i).
It has been demonstrated that KHM is essentially insensitive to the initialization of the centers (Zhang et al., 1999). However, it tends to converge to local optima (Güngör & Ünler, 2008).
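Assuming Euclidean distance, the loop over steps 2–5 above can be sketched in Python as follows (a minimal illustration; the function name `khm`, the parameter defaults, and the numerical floor on distances are ours, not the paper's):

```python
import numpy as np

def khm(X, k, p=3.5, iters=50, seed=0):
    """Sketch of the K-harmonic means algorithm (steps 1-7 above)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 1: random initial centers chosen from the data points
    C = X[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(iters):
        # distances ||x_i - c_j||, bounded away from zero for stability
        D = np.maximum(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), 1e-8)
        # Eq. (2): memberships proportional to ||x_i - c_j||^(-p-2)
        M = D ** (-p - 2)
        M /= M.sum(axis=1, keepdims=True)
        # Eq. (3): weights of the data points
        W = (D ** (-p - 2)).sum(axis=1) / (D ** (-p)).sum(axis=1) ** 2
        # Eq. (4): each center becomes a weighted mean of all points
        MW = M * W[:, None]
        C = (MW.T @ X) / MW.sum(axis=0)[:, None]
    # Eq. (1): objective value, and hard assignment by largest membership
    D = np.maximum(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), 1e-8)
    obj = (k / (1.0 / D ** p).sum(axis=1)).sum()
    return C, D.argmin(axis=1), obj
```

Note that, unlike KM, every point influences every center through the soft memberships and weights, which is what makes the update insensitive to initialization.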
3. Particle Swarm Optimization

The PSO method was developed by Kennedy and Eberhart (1995) and has been successfully applied to many scientific and practical fields (Aupetit, Monmarché, & Slimane, 2007; Liao, Tseng, & Luarn, 2007; Maitra & Chatterjee, 2008; Pan, Wang, & Liu, 2006). PSO is a sociologically inspired population-based optimization algorithm. Each particle is an individual, and the swarm is composed of particles. In PSO, the solution space of the problem is formulated as a search space; each position in the search space corresponds to a candidate solution of the problem. Particles cooperate to find the best position (best solution) in the search space (solution space). Each particle moves according to its velocity. At each iteration, the particle movement is computed as follows:

   x_i(t+1) = x_i(t) + v_i(t),   (5)

   v_i(t+1) = \omega v_i(t) + c_1\,\mathrm{rand}_1\,(pbest_i(t) - x_i(t)) + c_2\,\mathrm{rand}_2\,(gbest(t) - x_i(t)).   (6)

In Eqs. (5) and (6), x_i(t) is the position of particle i at time t, v_i(t) is the velocity of particle i at time t, pbest_i(t) is the best position found by particle i itself so far, gbest(t) is the best position found by the whole swarm so far, ω is an inertia weight scaling the previous time-step velocity, c_1 and c_2 are two acceleration coefficients that scale the influence of the particle's best personal position (pbest_i(t)) and the best global position (gbest(t)), and rand_1 and rand_2 are random variables between 0 and 1. The process of PSO is shown in Fig. 1.
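A minimal sketch of Eqs. (5) and (6) in Python, here minimizing a generic objective `f` over a box (the function name, bounds, and parameter defaults are illustrative assumptions, not from the paper):

```python
import numpy as np

def pso(f, dim, n_particles=18, iters=200, w=0.7298, c1=1.49618, c2=1.49618,
        lo=-5.0, hi=5.0, seed=0):
    """Sketch of the PSO loop: minimize f over [lo, hi]^dim."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, (n_particles, dim))     # positions
    v = rng.uniform(-1.0, 1.0, (n_particles, dim))  # velocities
    pbest = x.copy()
    pbest_f = np.apply_along_axis(f, 1, x)
    g = pbest_f.argmin()
    gbest, gbest_f = pbest[g].copy(), pbest_f[g]
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Eq. (6): inertia + cognitive + social components
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v                                   # Eq. (5)
        fx = np.apply_along_axis(f, 1, x)
        better = fx < pbest_f                       # update personal bests
        pbest[better], pbest_f[better] = x[better], fx[better]
        g = pbest_f.argmin()                        # update global best
        if pbest_f[g] < gbest_f:
            gbest, gbest_f = pbest[g].copy(), pbest_f[g]
    return gbest, gbest_f
```

The defaults mirror the parameter values used later in the paper (ω = 0.7298, c_1 = c_2 = 1.49618, 18 particles).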
4. The proposed hybrid clustering algorithm

The KHM algorithm tends to converge faster than the PSO algorithm because it requires fewer function evaluations, but it usually gets stuck in local optima. We integrate KHM with PSO to form a hybrid clustering algorithm called PSO-KHM, which maintains the merits of both KHM and PSO. More specifically, PSO-KHM applies KHM with four iterations to the particles in the swarm every eight generations, so that the fitness value of each particle is improved. A particle is a vector of real numbers of dimension k * d, where k is the number of clusters and d is the dimension of the data to be clustered. A particle can be represented as in Fig. 2. The fitness function of the PSO-KHM algorithm is the objective function of the KHM algorithm. Fig. 3 summarizes the hybrid PSO-KHM algorithm.
Initialize a population of particles with random positions and velocities in the search space.
While (termination conditions are not met)
{
    For each particle i do
        Update the position of particle i according to Eq. (5).
        Update the velocity of particle i according to Eq. (6).
        Map the position of particle i into the solution space and evaluate its fitness value according to the fitness function.
        Update pbest_i(t) and gbest(t) if necessary.
    End for
}

Fig. 1. The process of the PSO algorithm.
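The interleaving just described (eight PSO generations followed by four KHM iterations per particle, repeated IterCount times) can be sketched as follows. All names and defaults are illustrative; the one-step KHM helper is our own factoring, not the authors' code:

```python
import numpy as np

def khm_step(X, C, p):
    """One KHM center update (Eqs. (2)-(4)); returns new centers and the
    objective (Eq. (1)) of the *input* centers."""
    D = np.maximum(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), 1e-8)
    M = D ** (-p - 2)
    M /= M.sum(axis=1, keepdims=True)
    W = (D ** (-p - 2)).sum(axis=1) / (D ** (-p)).sum(axis=1) ** 2
    MW = M * W[:, None]
    newC = (MW.T @ X) / MW.sum(axis=0)[:, None]
    obj = (C.shape[0] / (1.0 / D ** p).sum(axis=1)).sum()
    return newC, obj

def pso_khm(X, k, p=3.0, iter_count=5, p_size=18,
            w=0.7298, c1=1.49618, c2=1.49618, seed=0):
    """Sketch of the hybrid loop: 8 PSO generations, then 4 KHM iterations."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # each particle encodes k*d center coordinates, seeded from data points
    pos = np.stack([X[rng.choice(n, k, replace=False)].ravel() for _ in range(p_size)])
    vel = rng.uniform(-1.0, 1.0, pos.shape)
    fit = np.array([khm_step(X, q.reshape(k, d), p)[1] for q in pos])
    pbest, pbest_f = pos.copy(), fit.copy()
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iter_count):                 # outer loop (Step 7)
        for _ in range(8):                      # PSO method (Step 5)
            r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
            vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
            pos = pos + vel
            fit = np.array([khm_step(X, q.reshape(k, d), p)[1] for q in pos])
            better = fit < pbest_f
            pbest[better], pbest_f[better] = pos[better], fit[better]
            gbest = pbest[pbest_f.argmin()].copy()
        for i in range(p_size):                 # KHM method (Step 6)
            C = pos[i].reshape(k, d)
            for _ in range(4):
                C, _ = khm_step(X, C, p)
            pos[i] = C.ravel()
            _, obj = khm_step(X, C, p)          # objective of refined centers
            if obj < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i].copy(), obj
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest.reshape(k, d), pbest_f.min()
```

The periodic KHM refinement is what accelerates convergence relative to plain PSO, while the PSO moves let particles leave the basins where KHM alone would stall.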
5. Experimental results

Seven data sets are employed to validate our method. These data sets, named ArtSet1, ArtSet2, Wine, Glass, Iris, breast-cancer-wisconsin (denoted as Cancer), and Contraceptive Method Choice (denoted as CMC), cover examples of data of low, medium and high dimension. All data sets except ArtSet1 and ArtSet2 are available at ftp://ftp.ics.uci.edu./pub/machine-learning-databases/. Table 1 summarizes the characteristics of these data sets. Table 2 shows the parameter settings of our algorithm.
5.1. Data sets

(1) ArtSet1 (n = 300, d = 2, k = 3): This is an artificial data set. It is a two-featured problem with three unique classes. A total of 300 patterns are drawn from three independent bivariate normal distributions, where the classes are distributed according to

    N_2\!\left(\mu = \begin{pmatrix}\mu_{i1}\\ \mu_{i2}\end{pmatrix},\; \Sigma = \begin{pmatrix}0.4 & 0.04\\ 0.04 & 0.4\end{pmatrix}\right), \quad i = 1, 2, 3,

    with \mu_{11} = \mu_{12} = 2, \mu_{21} = \mu_{22} = -2, \mu_{31} = \mu_{32} = 6, where μ and Σ are the mean vector and covariance matrix, respectively. The data set is illustrated in Fig. 4.
(2) ArtSet2 (n = 300, d = 3, k = 3): This is an artificial data set. It is a three-featured problem with three classes and 300 patterns, where every feature of the classes is distributed according to Class 1 ~ Uniform(10, 25), Class 2 ~ Uniform(25, 40), Class 3 ~ Uniform(40, 55). The data set is illustrated in Fig. 5.
(3) Fisher's iris data set (n = 150, d = 4, k = 3), which consists of three different species of iris flower: Iris Setosa, Iris Versicolour and Iris Virginica. For each species, 50 samples with four features (sepal length, sepal width, petal length, and petal width) were collected.
(4) Glass (n = 214, d = 9, k = 6), which consists of six different types of glass: building windows float processed (70 objects), building windows non-float processed (76 objects), vehicle windows float processed (17 objects), containers (13 objects), tableware (9 objects), and headlamps (29 objects). Each object has nine features: refractive index, sodium, magnesium, aluminum, silicon, potassium, calcium, barium, and iron.
(5) Wisconsin breast cancer (n = 683, d = 9, k = 2), which consists of 683 objects characterized by nine features: clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. There are two categories in the data: malignant (444 objects) and benign (239 objects).
(6) Contraceptive Method Choice (n = 1473, d = 9, k = 3): This data set is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who either were not pregnant or did not know if they were at the time of the interview. The problem is to predict the choice of current contraceptive method (no use: 629 objects, long-term methods: 334 objects, short-term methods: 510 objects) of a woman based on her demographic and socio-economic characteristics.
(7) Wine (n = 178, d = 13, k = 3): These data, consisting of 178 objects characterized by 13 features (alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline), are the results of a chemical analysis of wines produced in the same region in Italy but derived from three different cultivars. There are three categories in the data: class 1 (59 objects), class 2 (71 objects), and class 3 (48 objects).

Fig. 2. The representation of a particle: (x_11, x_12, ..., x_1d, ..., x_k1, x_k2, ..., x_kd).

Step 1: Set the initial parameters, including the maximum iteration count IterCount, the population size P_size, ω, c_1 and c_2.
Step 2: Initialize a population of size P_size.
Step 3: Set iteration count Gen1 = 0.
Step 4: Set iteration counts Gen2 = Gen3 = 0.
Step 5 (PSO method):
    Step 5.1: Apply the PSO operator to update the P_size particles.
    Step 5.2: Gen2 = Gen2 + 1. If Gen2 < 8, go to Step 5.1.
Step 6 (KHM method): For each particle i do
    Step 6.1: Take the position of particle i as the initial cluster centers of the KHM algorithm.
    Step 6.2: Recalculate each cluster center using the KHM algorithm.
    Step 6.3: Gen3 = Gen3 + 1. If Gen3 < 4, go to Step 6.2.
Step 7: Gen1 = Gen1 + 1. If Gen1 < IterCount, go to Step 4.
Step 8: Assign each data point x_i to the cluster j with the biggest m(c_j | x_i).

Fig. 3. The hybrid PSO-KHM algorithm.

Table 1
Characteristics of the data sets considered.

Name of data set   No. of classes   No. of features   Size of data set (size of classes in parentheses)
ArtSet1            3                2                 300 (100, 100, 100)
ArtSet2            3                3                 300 (100, 100, 100)
Iris               3                4                 150 (50, 50, 50)
Glass              6                9                 214 (70, 17, 76, 13, 9, 29)
Cancer             2                9                 683 (444, 239)
CMC                3                9                 1473 (629, 334, 510)
Wine               3                13                178 (59, 71, 48)

Table 2
The PSO-KHM algorithm parameter setup.

Parameter   Value
P_size      18
ω           0.7298
c_1         1.49618
c_2         1.49618
IterCount   5
5.2. Experimental results

In this section, we evaluate and compare the performance of the KHM, PSO and PSO-KHM algorithms as means of optimizing the objective function of the KHM algorithm. The quality of the respective clusterings will also be compared, where quality is measured by the following two criteria:

(1) The sum over all data points of the harmonic average of the distances from a data point to all the centers, as defined in Eq. (1). Clearly, the smaller this sum is, the higher the quality of the clustering.

(2) The F-Measure, which uses the ideas of precision and recall from information retrieval (Dalli, 2003; Handl, Knowles, & Dorigo, 2003). Each class i (as given by the class labels of the benchmark data set) is regarded as the set of n_i items desired for a query; each cluster j (generated by the algorithm) is regarded as the set of n_j items retrieved for a query; n_ij gives the number of elements of class i within cluster j. For each class i and cluster j, precision and recall are then defined as p(i, j) = n_ij / n_j and r(i, j) = n_ij / n_i, and the corresponding F-Measure value is

    F(i, j) = \frac{(b^2 + 1)\, p(i, j)\, r(i, j)}{b^2\, p(i, j) + r(i, j)},

    where we chose b = 1 to obtain equal weighting for p(i, j) and r(i, j). The overall F-Measure for a data set of size n is given by

    F = \sum_i \frac{n_i}{n} \max_j \{F(i, j)\}.

    Obviously, the bigger the F-Measure is, the higher the quality of the clustering.
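The overall F-Measure in criterion (2) can be sketched as a small helper (illustrative code, not the authors' implementation):

```python
import numpy as np

def f_measure(classes, clusters, b=1.0):
    """Overall F-Measure of a clustering against true class labels."""
    classes = np.asarray(classes)
    clusters = np.asarray(clusters)
    n = len(classes)
    total = 0.0
    for i in np.unique(classes):
        in_class = classes == i
        n_i = in_class.sum()
        best = 0.0
        for j in np.unique(clusters):
            in_cluster = clusters == j
            n_ij = (in_class & in_cluster).sum()
            if n_ij == 0:
                continue
            prec = n_ij / in_cluster.sum()   # p(i, j) = n_ij / n_j
            rec = n_ij / n_i                 # r(i, j) = n_ij / n_i
            f = (b ** 2 + 1) * prec * rec / (b ** 2 * prec + rec)
            best = max(best, f)              # max_j F(i, j)
        total += (n_i / n) * best            # weighted by class size
    return total
```

A perfect clustering scores 1.0 regardless of how cluster indices are permuted relative to class labels, since each class is matched to its best cluster.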
The experimental results are averages over 10 simulation runs. The algorithms are implemented in Visual C++ on a Pentium(R) D CPU at 2.66 GHz with 1.00 GB of RAM. It is known that p is a key parameter for obtaining good objective function values. For this reason we conduct our experiments with different p values. Tables 3–5 give the means and standard deviations (over 10 runs) obtained for each of these measures when p is 2.5, 3 and 3.5, respectively. Additionally, they show the runtimes of the algorithms.

For ArtSet1, the averages of KHM(X, C) for KHM, PSO and PSO-KHM are almost the same, and the F-Measures of all three algorithms are 1. For ArtSet2 and the other five real data sets, the average KHM(X, C) of PSO-KHM is much better than those of KHM and PSO. With the exception of two cases (CMC, p = 2.5 and Iris, p = 3.5), the average F-Measure of PSO-KHM is equal to or better than that of KHM, while that of PSO is relatively poor. This indicates that for a low-dimensional data set whose clusters are spatially well separated (as is the case for ArtSet1), the performance of the three algorithms is nearly the same, while PSO-KHM outperforms the other two methods in all other cases. With the exception of ArtSet1, the runtimes of PSO-KHM are higher than those of KHM and lower than those of PSO.
Fig. 4. ArtSet1.
Fig. 5. ArtSet2.
Table 3
Results of KHM, PSO, and PSO-KHM clustering on two artificial and five real data sets when p = 2.5. The quality of clustering is evaluated using KHM(X, C) and the F-Measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs. Bold face indicates the best and italic face the second best result of the three algorithms.

            KHM                  PSO                      PSO-KHM
ArtSet1
KHM(X,C)    703.867 (0.000)      703.528 (0.190)          703.509 (0.050)
F-Measure   1.000 (0.000)        1.000 (0.000)            1.000 (0.000)
Runtime     0.106 (0.006)        1.648 (0.008)            1.921 (0.007)
ArtSet2
KHM(X,C)    111,852 (0)          1,910,696 (915,890)      111,813 (2)
F-Measure   1.000 (0.000)        0.665 (0.088)            1.000 (0.000)
Runtime     0.223 (0.008)        3.650 (0.031)            2.859 (0.000)
Iris
KHM(X,C)    149.333 (0.000)      230.340 (98.180)         149.058 (0.074)
F-Measure   0.750 (0.000)        0.711 (0.062)            0.753 (0.005)
Runtime     0.192 (0.008)        3.117 (0.020)            1.842 (0.005)
Glass
KHM(X,C)    1203.554 (16.231)    9551.095 (1933.211)      1196.798 (0.439)
F-Measure   0.421 (0.011)        0.387 (0.044)            0.424 (0.003)
Runtime     4.064 (0.010)        44.249 (0.431)           17.669 (0.018)
Cancer
KHM(X,C)    60,189 (0)           60,244 (563)             59,844 (22)
F-Measure   0.829 (0.000)        0.819 (0.005)            0.829 (0.000)
Runtime     2.017 (0.009)        16.046 (0.138)           9.525 (0.013)
CMC
KHM(X,C)    96,520 (0)           115,096 (33,014)         96,193 (25)
F-Measure   0.335 (0.000)        0.298 (0.019)            0.333 (0.002)
Runtime     8.639 (0.009)        54.163 (0.578)           39.825 (0.072)
Wine
KHM(X,C)    18,386,505 (0)       19,795,542 (2,007,722)   18,386,285 (5)
F-Measure   0.516 (0.000)        0.512 (0.020)            0.516 (0.000)
Runtime     2.059 (0.010)        35.642 (0.282)           6.539 (0.008)
Table 4
Results of KHM, PSO, and PSO-KHM clustering on two artificial and five real data sets when p = 3. The quality of clustering is evaluated using KHM(X, C) and the F-Measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs. Bold face indicates the best and italic face the second best result of the three algorithms.

            KHM                        PSO                          PSO-KHM
ArtSet1
KHM(X,C)    742.110 (0.000)            741.6927 (0.080)             741.455 (0.002)
F-Measure   1.000 (0.000)              1.0000 (0.000)               1.000 (0.000)
Runtime     0.001 (0.006)              1.633 (0.008)                1.921 (0.007)
ArtSet2
KHM(X,C)    278,758 (0)                8,675,830 (6,626,165)        278,541 (33)
F-Measure   1.000 (0.000)              0.681 (0.093)                1.000 (0.000)
Runtime     0.220 (0.005)              3.575 (0.030)                2.844 (0.010)
Iris
KHM(X,C)    126.517 (0.000)            147.217 (22.896)             125.951 (0.052)
F-Measure   0.744 (0.000)              0.740 (0.025)                0.744 (0.000)
Runtime     0.190 (0.007)              3.096 (0.010)                1.826 (0.009)
Glass
KHM(X,C)    1535.198 (0.000)           18191.700 (1870.044)         1442.847 (35.871)
F-Measure   0.422 (0.000)              0.378 (0.030)                0.427 (0.003)
Runtime     4.042 (0.007)              43.594 (0.338)               17.609 (0.015)
Cancer
KHM(X,C)    119,458 (0)                119,333 (3770)               117,418 (237)
F-Measure   0.834 (0.000)              0.817 (0.033)                0.834 (0.000)
Runtime     2.027 (0.007)              16.150 (0.144)               9.594 (0.023)
CMC
KHM(X,C)    187,525 (0)                205,548 (60,798)             186,722 (111)
F-Measure   0.303 (0.000)              0.250 (0.028)                0.303 (0.000)
Runtime     8.627 (0.009)              148.985 (0.933)              39.485 (0.056)
Wine
KHM(X,C)    298,230,848 (24,270,951)   276,508,278 (23,807,035)     252,522,504 (766)
F-Measure   0.538 (0.007)              0.519 (0.021)                0.553 (0.000)
Runtime     2.084 (0.010)              35.284 (0.531)               6.598 (0.008)
Table 5
Results of KHM, PSO, and PSO-KHM clustering on two artificial and five real data sets when p = 3.5. The quality of clustering is evaluated using KHM(X, C) and the F-Measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs. Bold face indicates the best and italic face the second best result of the three algorithms.

            KHM                      PSO                            PSO-KHM
ArtSet1
KHM(X,C)    807.536 (0.028)          806.811 (0.079)                806.617 (0.007)
F-Measure   1.000 (0.000)            1.000 (0.000)                  1.000 (0.000)
Runtime     0.106 (0.006)            1.628 (0.006)                  1.921 (0.007)
ArtSet2
KHM(X,C)    697,006 (0.000)          80,729,943 (33,400,802)        696,049 (78)
F-Measure   1.000 (0.000)            0.660 (0.081)                  1.000 (0.000)
Runtime     0.220 (0.005)            3.601 (0.025)                  2.842 (0.005)
Iris
KHM(X,C)    113.413 (0.085)          255.763 (117.388)              110.004 (0.260)
F-Measure   0.770 (0.024)            0.660 (0.057)                  0.762 (0.004)
Runtime     0.194 (0.008)            3.078 (0.013)                  1.873 (0.005)
Glass
KHM(X,C)    1871.812 (0.000)         32933.349 (1398.602)           1857.152 (4.937)
F-Measure   0.396 (0.000)            0.373 (0.020)                  0.396 (0.000)
Runtime     4.056 (0.008)            43.350 (0.332)                 17.651 (0.013)
Cancer
KHM(X,C)    243,440 (0)              240,634 (8842)                 235,441 (696)
F-Measure   0.832 (0.000)            0.820 (0.046)                  0.835 (0.003)
Runtime     2.072 (0.008)            15.097 (0.095)                 9.859 (0.015)
CMC
KHM(X,C)    381,444 (0)              423,562 (43,932)               379,678 (247)
F-Measure   0.332 (0.000)            0.298 (0.016)                  0.332 (0.000)
Runtime     8.528 (0.012)            49.881 (0.256)                 42.7017 (0.250)
Wine
KHM(X,C)    8,568,319,639 (2075)     3,637,575,952 (202,759,448)    3,546,930,579 (1,214,985)
F-Measure   0.502 (0.000)            0.530 (0.039)                  0.535 (0.004)
Runtime     2.040 (0.008)            35.072 (0.385)                 6.508 (0.017)
6. Conclusions

This paper investigates a hybrid clustering algorithm (PSO-KHM) based on the KHM algorithm and the PSO algorithm. Experiments were conducted on seven data sets. The PSO-KHM algorithm robustly searches for the data cluster centers using, as a metric, the sum over all data points of the harmonic average of the distances from a data point to all the centers. Using the same metric, PSO is shown to need more runtime to reach the global optimum, while KHM may run into local optima. That is to say, the PSO-KHM algorithm not only improves the convergence speed of PSO but also helps KHM escape from local optima. Experimental results also show that PSO-KHM is at least comparable to KHM and is better than PSO in terms of the F-Measure.

One drawback of PSO-KHM is that it requires more runtime than KHM. PSO-KHM is therefore not applicable when runtime is critical.
Acknowledgements

I would like to express my thanks and deepest appreciation to Prof. Jigui Sun. This work is partially supported by the Science Foundation for Young Teachers of Northeast Normal University (No. 20061006) and the Specialized Research Fund for the Doctoral Program of Higher Education (Nos. 20050183065 and 20070183057).
References

Alpaydin, E. (2004). Introduction to machine learning (pp. 133–150). Cambridge: The MIT Press.
Aupetit, S., Monmarché, N., & Slimane, M. (2007). Hidden Markov models training by a particle swarm optimization algorithm. Journal of Mathematical Modelling and Algorithms, 6, 175–193.
Cui, X., & Potok, T. E. (2005). Document clustering using particle swarm optimization. In IEEE swarm intelligence symposium. Pasadena, California.
Dalli, A. (2003). Adaptation of the F-measure to cluster-based lexicon quality evaluation. In EACL 2003. Budapest.
Feng, H. M., Chen, C. Y., & Ye, F. (2007). Evolutionary fuzzy particle swarm optimization vector quantization learning scheme in image compression. Expert Systems with Applications, 32(1), 213–222.
Güngör, Z., & Ünler, A. (2008). K-harmonic means data clustering with tabu-search method. Applied Mathematical Modelling, 32, 1115–1125.
Halberstadt, W., & Douglas, T. S. (2008). Fuzzy clustering to detect tuberculous meningitis-associated hyperdensity in CT images. Computers in Biology and Medicine, 38(2), 165–170.
Hammerly, G., & Elkan, C. (2002). Alternatives to the k-means algorithm that find better clusterings. In Proceedings of the 11th international conference on information and knowledge management (pp. 600–607).
Handl, J., Knowles, J., & Dorigo, M. (2003). On the performance of ant-based clustering. Design and application of hybrid intelligent systems. Frontiers in Artificial Intelligence and Applications, 104, 204–213.
He, Y., Pan, W., & Lin, J. (2006). Cluster analysis using multivariate normal mixture models to detect differential gene expression with microarray data. Computational Statistics and Data Analysis, 51(2), 641–658.
Hu, G., Zhou, S., Guan, J., & Hu, X. (2008). Towards effective document clustering: A constrained K-means based approach. Information Processing and Management, 44(4), 1397–1409.
Kao, Y. T., Zahara, E., & Kao, I. W. (2008). A hybridized approach to data clustering. Expert Systems with Applications, 34(3), 1754–1762.
Kennedy, J., & Eberhart, R. C. (1995). Particle swarm optimization. In Proceedings of the 1995 IEEE international conference on neural networks (pp. 1942–1948). New Jersey: IEEE Press.
Kerr, G., Ruskin, H. J., Crane, M., & Doolan, P. (2008). Techniques for clustering gene expression data. Computers in Biology and Medicine, 38(3), 283–293.
Liao, C. J., Tseng, C. T., & Luarn, P. (2007). A discrete version of particle swarm optimization for flowshop scheduling problems. Computers and Operations Research, 34, 3099–3111.
Li, Y. J., Chung, S. M., & Holt, J. D. (2008). Text document clustering based on frequent word meaning sequences. Data and Knowledge Engineering, 64(1), 381–404.
Liu, B., Wang, L., & Jin, Y. H. (2008). An effective hybrid PSO-based algorithm for flow shop scheduling with limited buffers. Computers and Operations Research, 35(9), 2791–2806.
Maitra, M., & Chatterjee, A. (2008). A hybrid cooperative–comprehensive learning based PSO algorithm for image segmentation using multilevel thresholding. Expert Systems with Applications, 34, 1341–1350.
Pan, H., Wang, L., & Liu, B. (2006). Particle swarm optimization for function optimization in noisy environment. Applied Mathematics and Computation, 181, 908–919.
Tan, P. N., Steinbach, M., & Kumar, V. (2005). Introduction to data mining (pp. 487–559). Boston: Addison-Wesley.
Tjhi, W. C., & Chen, L. H. (2008). A heuristic-based fuzzy co-clustering algorithm for categorization of high-dimensional data. Fuzzy Sets and Systems, 159(4), 371–389.
Ünler, A., & Güngör, Z. (2008). Applying K-harmonic means clustering to the part-machine classification problem. Expert Systems with Applications. doi:10.1016/j.eswa.2007.11.048.
Webb, A. (2002). Statistical pattern recognition (pp. 361–406). New Jersey: John Wiley & Sons.
Zhang, B., Hsu, M., & Dayal, U. (1999). K-harmonic means – a data clustering algorithm. Technical Report HPL-1999-124. Hewlett-Packard Laboratories.
Zhang, B., Hsu, M., & Dayal, U. (2000). K-harmonic means. In International workshop on temporal, spatial and spatio-temporal data mining, TSDM2000. Lyon, France, September 12.
Zhou, H., & Liu, Y. H. (2008). Accurate integration of multi-view range images using k-means clustering. Pattern Recognition, 41(1), 152–175.