An efficient hybrid data clustering method based on K-harmonic means
and Particle Swarm Optimization
Fengqin Yang a,b,*, Tieli Sun a, Changhai Zhang b

a College of Computer Science, Northeast Normal University, Changchun, Jilin 130117, China
b College of Computer Science and Technology, Jilin University, Changchun, Jilin 130012, China
Keywords: Data clustering; K-means; K-harmonic means; Particle Swarm Optimization

Abstract
Clustering is the process of grouping data objects into a set of disjoint classes, called clusters, so that objects within a class are highly similar to one another and dissimilar to objects in other classes. The K-means (KM) algorithm is one of the most popular clustering techniques because it is easy to implement and works fast in most situations. However, it is sensitive to initialization and is easily trapped in local optima. K-harmonic means (KHM) clustering solves the initialization problem using a built-in boosting function, but it also easily runs into local optima. The Particle Swarm Optimization (PSO) algorithm is a stochastic global optimization technique. In this research we propose a hybrid data clustering algorithm based on PSO and KHM (PSOKHM), which makes full use of the merits of both algorithms. The PSOKHM algorithm not only helps KHM clustering escape from local optima but also overcomes the slow convergence of the PSO algorithm. The performance of the PSOKHM algorithm is compared with that of PSO and KHM clustering on seven data sets. Experimental results indicate the superiority of the PSOKHM algorithm.

© 2009 Elsevier Ltd. All rights reserved.
1. Introduction

Clustering is a search for hidden patterns that may exist in data sets. It is a process of grouping data objects into disjoint clusters so that the objects in each cluster are similar to one another yet different from the objects in other clusters. Clustering techniques are applied in many areas, such as pattern recognition (Halberstadt & Douglas, 2008; Webb, 2002; Zhou & Liu, 2008), machine learning (Alpaydin, 2004), data mining (Tan, Steinbach, & Kumar, 2005; Tjhi & Chen, 2008), information retrieval (Hu, Zhou, Guan, & Hu, 2008; Li, Chung, & Holt, 2008) and bioinformatics (He, Pan, & Lin, 2006; Kerr, Ruskin, Crane, & Doolan, 2008). The K-means (KM) algorithm is one of the most popular and widespread partitioning clustering algorithms because of its feasibility and efficiency in dealing with large amounts of data. KM partitions data objects into k clusters, where the number of clusters, k, is decided in advance according to application purposes. The most common clustering objective in KM is to minimize the sum of dissimilarities between all objects and their corresponding cluster centers. The main drawback of the KM algorithm is that the clustering result is sensitive to the selection of the initial cluster centers and may converge to local optima (Cui & Potok, 2005; Kao, Zahara, & Kao, 2008).
The K-harmonic means (KHM) algorithm is a more recent algorithm proposed by Zhang, Hsu, and Dayal (1999, 2000) and modified by Hammerly and Elkan (2002). The clustering objective in this algorithm is to minimize the harmonic average of the distances from all points in the data set to all cluster centers. The KHM algorithm solves the initialization problem of the KM algorithm, but it also easily runs into local optima. Particle Swarm Optimization (PSO) is a population-based stochastic optimization technique developed by Kennedy and Eberhart (1995), inspired by the social behavior of bird flocking and fish schooling. Nowadays, PSO has gained much attention and wide application in a variety of fields (Feng, Chen, & Ye, 2007; Liu, Wang, & Jin, 2008). In this paper, we explore the application of PSO to help the KHM algorithm escape from local optima. A hybrid data clustering algorithm based on KHM and PSO, called PSOKHM, is proposed. Experimental results on a variety of data sets drawn from several artificial and real-life situations indicate that the PSOKHM algorithm is superior to both the KHM algorithm and the PSO algorithm.
The rest of the paper is organized as follows. Section 2 introduces KHM clustering. Section 3 briefly describes the PSO search technique. Section 4 presents our hybrid clustering algorithm. Section 5 presents experimental results. Finally, Section 6 concludes the paper.
0957-4174/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2009.02.003
* Corresponding author. Address: College of Computer Science, Northeast Normal University, 5268 Renmin Street, Changchun, Jilin 130117, China.
E-mail address: yangfq147@nenu.edu.cn (F. Yang).
Expert Systems with Applications 36 (2009) 9847–9852
2. K-harmonic means clustering

KM clustering is a simple and fast method used commonly due to its straightforward implementation and small number of iterations. The KM algorithm attempts to find the cluster centers $(c_1, \ldots, c_k)$ such that the sum of the squared distances of each data point $x_i$ to its nearest cluster center $c_j$ is minimized. The dependency of the KM performance on the initialization of the centers has been a major problem. This is due to its winner-takes-all partitioning strategy, which results in a strong association between the data points and the nearest center and prevents the centers from moving out of a local density of data.
KHM clustering addresses this intrinsic problem by replacing the minimum distance from a data point to the centers, used in KM, with the harmonic mean of the distances from each data point to all centers. The harmonic mean gives a good (low) score for each data point when that data point is close to any one center. This is a property of the harmonic mean: it behaves like the minimum function used by KM, but it is a smooth, differentiable function. For example, if a point lies at distances 0.1 and 10 from two centers, the harmonic mean 2/(1/0.1 + 1/10) ≈ 0.198 stays close to the minimum distance 0.1. The following notation is used to formulate the KHM algorithm (Hammerly & Elkan, 2002; Ünler & Güngör, 2008):
$X = \{x_1, \ldots, x_n\}$: the data to be clustered.
$C = \{c_1, \ldots, c_k\}$: the set of cluster centers.
$m(c_j \mid x_i)$: the membership function defining the proportion of data point $x_i$ that belongs to center $c_j$.
$w(x_i)$: the weight function defining how much influence data point $x_i$ has in re-computing the center parameters in the next iteration.
The basic algorithm for KHM clustering is as follows:

1. Initialize the algorithm with guessed centers C, i.e., randomly choose the initial centers.
2. Calculate the objective function value according to
$$\mathrm{KHM}(X,C)=\sum_{i=1}^{n}\frac{k}{\sum_{j=1}^{k}\frac{1}{\lVert x_i-c_j\rVert^{p}}},\qquad(1)$$
where p is an input parameter, typically $p \geq 2$.
3. For each data point $x_i$, compute its membership $m(c_j \mid x_i)$ in each center $c_j$ according to
$$m(c_j\mid x_i)=\frac{\lVert x_i-c_j\rVert^{-p-2}}{\sum_{j=1}^{k}\lVert x_i-c_j\rVert^{-p-2}}.\qquad(2)$$
4. For each data point $x_i$, compute its weight $w(x_i)$ according to
$$w(x_i)=\frac{\sum_{j=1}^{k}\lVert x_i-c_j\rVert^{-p-2}}{\left(\sum_{j=1}^{k}\lVert x_i-c_j\rVert^{-p}\right)^{2}}.\qquad(3)$$
5. For each center $c_j$, re-compute its location from all data points $x_i$ according to their memberships and weights:
$$c_j=\frac{\sum_{i=1}^{n}m(c_j\mid x_i)\,w(x_i)\,x_i}{\sum_{i=1}^{n}m(c_j\mid x_i)\,w(x_i)}.\qquad(4)$$
6. Repeat steps 2–5 for a predefined number of iterations or until KHM(X, C) does not change significantly.
7. Assign data point $x_i$ to the cluster $j$ with the biggest $m(c_j \mid x_i)$.
It has been demonstrated that KHM is essentially insensitive to the initialization of the centers (Zhang et al., 1999). However, it tends to converge to local optima (Güngör & Ünler, 2008).
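A minimal NumPy sketch of the iteration above may make the updates concrete. It follows Eqs. (1)–(4) directly; the function name, parameter defaults, and the optional `init` argument (convenient for the hybrid of Section 4) are illustrative choices, not part of the original implementation:

```python
import numpy as np

def khm(X, k, p=3.5, max_iter=50, tol=1e-6, init=None, seed=None):
    """K-harmonic means sketch following Eqs. (1)-(4) of Section 2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 1: start from given centers, else pick k random data points.
    C = init.copy() if init is not None else X[rng.choice(n, size=k, replace=False)].copy()
    prev_obj = np.inf
    for _ in range(max_iter):
        # Pairwise distances ||x_i - c_j||, floored to avoid division by zero.
        dist = np.maximum(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), 1e-12)
        obj = np.sum(k / np.sum(dist ** -p, axis=1))               # Step 2, Eq. (1)
        m = dist ** (-p - 2)                                        # Step 3, Eq. (2)
        m /= m.sum(axis=1, keepdims=True)
        w = np.sum(dist ** (-p - 2), axis=1) / np.sum(dist ** -p, axis=1) ** 2  # Step 4, Eq. (3)
        mw = m * w[:, None]                                         # Step 5, Eq. (4)
        C = (mw.T @ X) / mw.sum(axis=0)[:, None]
        if abs(prev_obj - obj) < tol * max(obj, 1.0):               # Step 6: stop on stagnation
            break
        prev_obj = obj
    return C, np.argmax(m, axis=1), obj                             # Step 7: hard labels
```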
3. Particle Swarm Optimization

The PSO method was developed by Kennedy and Eberhart (1995) and has been successfully applied to many scientific and practical fields (Aupetit, Monmarché, & Slimane, 2007; Liao, Tseng, & Luarn, 2007; Maitra & Chatterjee, 2008; Pan, Wang, & Liu, 2006). PSO is a sociologically inspired population-based optimization algorithm. Each particle is an individual, and the swarm is composed of particles. In PSO, the solution space of the problem is formulated as a search space; each position in the search space corresponds to a candidate solution of the problem. Particles cooperate to find the best position (best solution) in the search space (solution space). Each particle moves according to its velocity. At each iteration, the particle movement is computed as follows:
$$x_i(t+1)=x_i(t)+v_i(t),\qquad(5)$$
$$v_i(t+1)=\omega\,v_i(t)+c_1\,\mathrm{rand}_1\,\big(pbest_i(t)-x_i(t)\big)+c_2\,\mathrm{rand}_2\,\big(gbest(t)-x_i(t)\big).\qquad(6)$$

In Eqs. (5) and (6), $x_i(t)$ is the position of particle $i$ at time $t$, $v_i(t)$ is the velocity of particle $i$ at time $t$, $pbest_i(t)$ is the best position found by particle $i$ itself so far, $gbest(t)$ is the best position found by the whole swarm so far, $\omega$ is an inertia weight scaling the previous time-step velocity, $c_1$ and $c_2$ are two acceleration coefficients that scale the influence of the particle's best personal position ($pbest_i(t)$) and the best global position ($gbest(t)$), and $\mathrm{rand}_1$ and $\mathrm{rand}_2$ are random variables between 0 and 1. The process of PSO is shown in Fig. 1.
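As a hedged illustration, one swarm generation per Eqs. (5) and (6) might look as follows in NumPy. It keeps Fig. 1's ordering (position update with the current velocity, then velocity update), takes the parameter values later listed in Table 2 as defaults, and the function name and best-position bookkeeping are assumptions for the sketch:

```python
import numpy as np

def pso_step(pos, vel, pbest, gbest, fitness, omega=0.7298, c1=1.49618, c2=1.49618, rng=None):
    """One PSO generation per Eqs. (5)-(6); pos/vel/pbest have shape (P_size, dims)."""
    if rng is None:
        rng = np.random.default_rng()
    pos = pos + vel                                   # Eq. (5): move along current velocity
    rand1, rand2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = (omega * vel                                # Eq. (6): inertia term
           + c1 * rand1 * (pbest - pos)               # cognitive pull toward pbest_i
           + c2 * rand2 * (gbest - pos))              # social pull toward gbest
    # Update personal and global bests where the new positions improve (minimization).
    f = np.apply_along_axis(fitness, 1, pos)
    better = f < np.apply_along_axis(fitness, 1, pbest)
    pbest = np.where(better[:, None], pos, pbest)
    if f.min() < fitness(gbest):
        gbest = pos[f.argmin()].copy()
    return pos, vel, pbest, gbest
```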
4. The proposed hybrid clustering algorithm

The KHM algorithm tends to converge faster than the PSO algorithm because it requires fewer function evaluations, but it usually gets stuck in local optima. We integrate KHM with PSO to form a hybrid clustering algorithm called PSOKHM, which maintains the merits of both KHM and PSO. More specifically, PSOKHM applies KHM with four iterations to the particles in the swarm every eight generations, so that the fitness value of each particle is improved. A particle is a vector of real numbers of dimension k*d, where k is the number of clusters and d is the dimension of the data to be clustered. A particle is illustrated in Fig. 2. The fitness function of the PSOKHM algorithm is the objective function of the KHM algorithm, Eq. (1). Fig. 3 summarizes the hybrid PSOKHM algorithm.
Initialize a population of particles with random positions and velocities in the search space.
While (termination conditions are not met)
{
  For each particle i do
    Update the position of particle i according to Eq. (5).
    Update the velocity of particle i according to Eq. (6).
    Map the position of particle i into the solution space and evaluate its fitness value according to the fitness function.
    Update pbest_i(t) and gbest(t) if necessary.
  End for
}

Fig. 1. The process of the PSO algorithm.
5. Experimental results

Seven data sets are employed to validate our method. These data sets, named ArtSet1, ArtSet2, Wine, Glass, Iris, breast-cancer-wisconsin (denoted as Cancer), and Contraceptive Method Choice (denoted as CMC), cover examples of data of low, medium and high dimension. All data sets except ArtSet1 and ArtSet2 are available at ftp://ftp.ics.uci.edu/pub/machine-learning-databases/. Table 1 summarizes the characteristics of these data sets. Table 2 shows the parameter settings of our algorithm.
5.1. Data sets

(1) ArtSet1 (n = 300, d = 2, k = 3): This is an artificial data set. It is a two-featured problem with three unique classes. A total of 300 patterns are drawn from three independent bivariate normal distributions, where the classes are distributed according to
$$N_2\!\left(\mu_i=\begin{pmatrix}\mu_{i1}\\ \mu_{i2}\end{pmatrix},\;\Sigma=\begin{pmatrix}0.4 & 0.04\\ 0.04 & 0.4\end{pmatrix}\right),\quad i=1,2,3,$$
$$\mu_{11}=\mu_{12}=-2,\quad \mu_{21}=\mu_{22}=2,\quad \mu_{31}=\mu_{32}=6,$$
with $\mu$ and $\Sigma$ being the mean vector and covariance matrix, respectively. The data set is illustrated in Fig. 4.
(2) ArtSet2 (n = 300, d = 3, k = 3): This is an artificial data set. It is a three-featured problem with three classes and 300 patterns, where every feature of the classes is distributed according to Class 1 ~ Uniform(10, 25), Class 2 ~ Uniform(25, 40), Class 3 ~ Uniform(40, 55). The data set is illustrated in Fig. 5. (A generation sketch for both artificial sets is given at the end of this subsection.)
(3) Fisher's iris data set (n = 150, d = 4, k = 3), which consists of three different species of iris flower: Iris Setosa, Iris Versicolour and Iris Virginica. For each species, 50 samples with four features (sepal length, sepal width, petal length, and petal width) were collected.
(4) Glass (n = 214, d = 9, k = 6), which consists of six different types of glass: building windows float processed (70 objects), building windows non-float processed (76 objects), vehicle windows float processed (17 objects), containers (13 objects), tableware (9 objects), and headlamps (29 objects). Each object has nine features: refractive index, sodium, magnesium, aluminum, silicon, potassium, calcium, barium, and iron.
(5) Wisconsin breast cancer (n = 683, d = 9, k = 2), which consists of 683 objects characterized by nine features: clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. There are two categories in the data: malignant (444 objects) and benign (239 objects).
(6) Contraceptive Method Choice (n = 1473, d = 9, k = 3): This data set is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who either were not pregnant or did not know if they were at the time of interview. The problem is to predict the choice of current contraceptive method (no use: 629 objects, long-term methods: 334 objects, short-term methods: 510 objects) of a woman based on her demographic and socioeconomic characteristics.
(7) Wine (n = 178, d = 13, k = 3): These data, consisting of 178 objects characterized by 13 features (alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline), are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. There are three categories in the data: class 1 (59 objects), class 2 (71 objects), and class 3 (48 objects).
$x_{11}\; x_{12}\; \ldots\; x_{1d}\; \ldots\; x_{k1}\; x_{k2}\; \ldots\; x_{kd}$

Fig. 2. The representation of a particle.
Step 1: Set the initial parameters, including the maximum iterative count IterCount, the population size P_size, ω, c_1 and c_2.
Step 2: Initialize a population of size P_size.
Step 3: Set iterative count Gen1 = 0.
Step 4: Set iterative counts Gen2 = Gen3 = 0.
Step 5 (PSO method):
  Step 5.1: Apply the PSO operator to update the P_size particles.
  Step 5.2: Gen2 = Gen2 + 1. If Gen2 < 8, go to Step 5.1.
Step 6 (KHM method): For each particle i do
  Step 6.1: Take the position of particle i as the initial cluster centers of the KHM algorithm.
  Step 6.2: Recalculate each cluster center using the KHM algorithm.
  Step 6.3: Gen3 = Gen3 + 1. If Gen3 < 4, go to Step 6.2.
Step 7: Gen1 = Gen1 + 1. If Gen1 < IterCount, go to Step 4.
Step 8: Assign data point x_i to the cluster j with the biggest m(c_j | x_i).

Fig. 3. The hybrid PSOKHM algorithm.
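The following sketch wires Fig. 3 together, reusing the illustrative khm() and pso_step() helpers from Sections 2 and 3. The particle encoding follows Fig. 2 (k*d reals per particle), the defaults come from Table 2, and all names remain assumptions rather than the authors' implementation:

```python
import numpy as np

def psokhm(X, k, p=3.5, iter_count=5, p_size=18, seed=None):
    """Hybrid PSOKHM per Fig. 3: 8 PSO generations, then 4 KHM iterations per particle."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)

    def fitness(flat):                      # KHM objective, Eq. (1)
        dist = np.maximum(np.linalg.norm(X[:, None] - flat.reshape(k, d)[None], axis=2), 1e-12)
        return np.sum(k / np.sum(dist ** -p, axis=1))

    # Steps 1-2: random particles in the data's bounding box (Fig. 2 encoding).
    pos = rng.uniform(lo, hi, size=(p_size, k, d)).reshape(p_size, k * d)
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    gbest = pos[min(range(p_size), key=lambda i: fitness(pos[i]))].copy()
    for _ in range(iter_count):             # Step 7: outer loop
        for _ in range(8):                  # Step 5: eight PSO generations
            pos, vel, pbest, gbest = pso_step(pos, vel, pbest, gbest, fitness, rng=rng)
        for i in range(p_size):             # Step 6: refine each particle with 4 KHM iterations
            C, _, _ = khm(X, k, p=p, max_iter=4, init=pos[i].reshape(k, d))
            pos[i] = C.reshape(-1)
            if fitness(pos[i]) < fitness(pbest[i]):
                pbest[i] = pos[i].copy()
        gbest = pbest[min(range(p_size), key=lambda i: fitness(pbest[i]))].copy()
    # Step 8: assign each point to the center with the largest membership
    # (membership is proportional to dist**(-p-2), so this is the nearest center).
    C = gbest.reshape(k, d)
    dist = np.maximum(np.linalg.norm(X[:, None] - C[None], axis=2), 1e-12)
    return C, np.argmax(dist ** (-p - 2), axis=1)
```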
Table 1
Characteristics of the data sets considered.

Name of data set  No. of classes  No. of features  Size of data set (size of classes in parentheses)
ArtSet1 3 2 300 (100,100,100)
ArtSet2 3 3 300 (100,100,100)
Iris 3 4 150 (50,50,50)
Glass 6 9 214 (70,17,76,13,9,29)
Cancer 2 9 683 (444,239)
CMC 3 9 1473 (629,334,510)
Wine 3 13 178 (59,71,48)
Table 2
The PSOKHM algorithm parameter setup.

Parameter  Value
P_size     18
ω          0.7298
c_1        1.49618
c_2        1.49618
IterCount  5
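For reproducibility, the two artificial sets could be generated along the following lines; this is a sketch under the parameters stated in items (1) and (2), including the reconstructed ArtSet1 class means (−2, −2), (2, 2) and (6, 6):

```python
import numpy as np

rng = np.random.default_rng(0)

# ArtSet1: 100 points from each of three bivariate normals (item (1)).
cov = np.array([[0.4, 0.04], [0.04, 0.4]])
art1 = np.vstack([rng.multivariate_normal([m, m], cov, size=100) for m in (-2.0, 2.0, 6.0)])

# ArtSet2: three classes, every feature uniform on the stated interval (item (2)).
art2 = np.vstack([rng.uniform(lo, hi, size=(100, 3)) for lo, hi in ((10, 25), (25, 40), (40, 55))])
```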
5.2. Experimental results

In this section we evaluate and compare the performance of the KHM, PSO and PSOKHM algorithms as means of optimizing the objective function of the KHM algorithm. The quality of the respective clusterings is also compared, where quality is measured by the following two criteria:

(1) The sum over all data points of the harmonic average of the distances from a data point to all the centers, as defined in Eq. (1). Clearly, the smaller this sum, the higher the quality of the clustering.
(2) The F-Measure, which uses the ideas of precision and recall from information retrieval (Dalli, 2003; Handl, Knowles, & Dorigo, 2003). Each class $i$ (as given by the class labels of the benchmark data set) is regarded as the set of $n_i$ items desired for a query; each cluster $j$ (generated by the algorithm) is regarded as the set of $n_j$ items retrieved for a query; $n_{ij}$ gives the number of elements of class $i$ within cluster $j$. For each class $i$ and cluster $j$, precision and recall are then defined as $p(i,j) = n_{ij}/n_j$ and $r(i,j) = n_{ij}/n_i$, and the corresponding value under the F-Measure is
$$F(i,j)=\frac{(b^{2}+1)\,p(i,j)\,r(i,j)}{b^{2}\,p(i,j)+r(i,j)},$$
where we chose $b = 1$ to obtain equal weighting for $p(i,j)$ and $r(i,j)$. The overall F-Measure for a data set of size $n$ is given by
$$F=\sum_i \frac{n_i}{n}\,\max_j\{F(i,j)\}.$$
Obviously, the bigger the F-Measure, the higher the quality of the clustering.
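A small sketch of this criterion may help; it computes the overall F-Measure exactly as defined above from two integer label arrays (the function name is illustrative):

```python
import numpy as np

def f_measure(labels_true, labels_pred, b=1.0):
    """Overall F-Measure as defined above; both arguments are integer arrays of length n."""
    n = len(labels_true)
    total = 0.0
    for i in np.unique(labels_true):
        in_class = labels_true == i
        n_i = in_class.sum()
        best = 0.0
        for j in np.unique(labels_pred):
            in_cluster = labels_pred == j
            n_ij = np.sum(in_class & in_cluster)
            if n_ij == 0:
                continue
            p = n_ij / in_cluster.sum()       # precision p(i, j)
            r = n_ij / n_i                    # recall r(i, j)
            best = max(best, (b**2 + 1) * p * r / (b**2 * p + r))
        total += (n_i / n) * best             # weight best match by class size
    return total
```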
The experimental results are averages over 10 runs of simulation. The algorithms were implemented in Visual C++ and run on a Pentium(R) D 2.66 GHz CPU with 1.00 GB RAM. It is known that p is a key parameter for obtaining good objective function values, so we conduct our experiments with different p values. Tables 3–5 give the means and standard deviations (over 10 runs) obtained for each of these measures when p is 2.5, 3 and 3.5, respectively. They additionally show the runtimes of the algorithms.
For ArtSet1 the averages of KHM(X,C) for KHM, PSO and PSOKHM are almost the same, and all of the F-Measures of the algorithms are 1. For ArtSet2 and the other five real data sets, the average KHM(X,C) of PSOKHM is much better than those of KHM and PSO. With the exception of two cases (CMC, p = 2.5 and Iris, p = 3.5), the average F-Measure of PSOKHM is equal to or better than that of KHM, while that of PSO is relatively bad. This indicates that for a low-dimensional data set whose clusters are spatially well separated (as is the case in ArtSet1), the performance of the three algorithms is nearly the same, and that PSOKHM outperforms the other two methods in all other cases. With the exception of ArtSet1, the runtimes of PSOKHM are higher than those of KHM and lower than those of PSO.
Fig. 4. ArtSet1.
Fig. 5. ArtSet2.
Table 3
Results of KHM, PSO, and PSOKHM clustering on two artificial and five real data sets when p = 2.5. The quality of clustering is evaluated using KHM(X,C) and the F-Measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs. Bold face indicates the best and italic face the second-best result of the three algorithms.
KHM PSO PSOKHM
ArtSet1
KHM(X,C) 703.867 (0.000) 703.528 (0.190) 703.509 (0.050)
F-Measure 1.000 (0.000) 1.000 (0.000) 1.000 (0.000)
Runtime 0.106 (0.006) 1.648 (0.008) 1.921 (0.007)
ArtSet2
KHM(X,C) 111,852 (0) 1,910,696 (915,890) 111,813 (2)
F-Measure 1.000 (0.000) 0.665 (0.088) 1.000 (0.000)
Runtime 0.223 (0.008) 3.650 (0.031) 2.859 (0.000)
Iris
KHM(X,C) 149.333 (0.000) 230.340 (98.180) 149.058 (0.074)
F-Measure 0.750 (0.000) 0.711 (0.062) 0.753 (0.005)
Runtime 0.192 (0.008) 3.117 (0.020) 1.842 (0.005)
Glass
KHM(X,C) 1203.554 (16.231) 9551.095 (1933.211) 1196.798 (0.439)
F-Measure 0.421 (0.011) 0.387 (0.044) 0.424 (0.003)
Runtime 4.064 (0.010) 44.249 (0.431) 17.669 (0.018)
Cancer
KHM(X,C) 60,189 (0) 60,244 (563) 59,844 (22)
F-Measure 0.829 (0.000) 0.819 (0.005) 0.829 (0.000)
Runtime 2.017 (0.009) 16.046 (0.138) 9.525 (0.013)
CMC
KHM(X,C) 96,520 (0) 115,096 (33,014) 96,193 (25)
F-Measure 0.335 (0.000) 0.298 (0.019) 0.333 (0.002)
Runtime 8.639 (0.009) 54.163 (0.578) 39.825 (0.072)
Wine
KHM(X,C) 18,386,505 (0) 19,795,542 (2,007,722) 18,386,285 (5)
F-Measure 0.516 (0.000) 0.512 (0.020) 0.516 (0.000)
Runtime 2.059 (0.010) 35.642 (0.282) 6.539 (0.008)
Table 4
Results of KHM, PSO, and PSOKHM clustering on two artificial and five real data sets when p = 3. The quality of clustering is evaluated using KHM(X,C) and the F-Measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs. Bold face indicates the best and italic face the second-best result of the three algorithms.
KHM PSO PSOKHM
ArtSet1
KHM(X,C) 742.110 (0.000) 741.6927 (0.080) 741.455 (0.002)
F-Measure 1.000 (0.000) 1.0000 (0.000) 1.000 (0.000)
Runtime 0.001 (0.006) 1.633 (0.008) 1.921 (0.007)
ArtSet2
KHM(X,C) 278,758 (0) 8,675,830 (6,626,165) 278,541 (33)
F-Measure 1.000 (0.000) 0.681 (0.093) 1.000 (0.000)
Runtime 0.220 (0.005) 3.575 (0.030) 2.844 (0.010)
Iris
KHM(X,C) 126.517 (0.000) 147.217 (22.896) 125.951 (0.052)
F-Measure 0.744 (0.000) 0.740 (0.025) 0.744 (0.000)
Runtime 0.190 (0.007) 3.096 (0.010) 1.826 (0.009)
Glass
KHM(X,C) 1535.198 (0.000) 18191.700 (1870.044) 1442.847 (35.871)
F-Measure 0.422 (0.000) 0.378 (0.030) 0.427 (0.003)
Runtime 4.042 (0.007) 43.594 (0.338) 17.609 (0.015)
Cancer
KHM(X,C) 119,458 (0) 119,333 (3770) 117,418 (237)
F-Measure 0.834 (0.000) 0.817 (0.033) 0.834 (0.000)
Runtime 2.027 (0.007) 16.150 (0.144) 9.594 (0.023)
CMC
KHM(X,C) 187,525 (0) 205,548 (60,798) 186,722 (111)
F-Measure 0.303 (0.000) 0.250 (0.028) 0.303 (0.000)
Runtime 8.627 (0.009) 148.985 (0.933) 39.485 (0.056)
Wine
KHM(X,C) 298,230,848 (24,270,951) 276,508,278 (23,807,035) 252,522,504 (766)
F-Measure 0.538 (0.007) 0.519 (0.021) 0.553 (0.000)
Runtime 2.084 (0.010) 35.284 (0.531) 6.598 (0.008)
Table 5
Results of KHM, PSO, and PSOKHM clustering on two artificial and five real data sets when p = 3.5. The quality of clustering is evaluated using KHM(X,C) and the F-Measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs. Bold face indicates the best and italic face the second-best result of the three algorithms.
KHM PSO PSOKHM
ArtSet1
KHM(X,C) 807.536 (0.028) 806.811 (0.079) 806.617 (0.007)
F-Measure 1.000 (0.000) 1.000 (0.000) 1.000 (0.000)
Runtime 0.106 (0.006) 1.628 (0.006) 1.921 (0.007)
ArtSet2
KHM(X,C) 697,006 (0.000) 80,729,943 (33,400,802) 696,049 (78)
F-Measure 1.000 (0.000) 0.660 (0.081) 1.000 (0.000)
Runtime 0.220 (0.005) 3.601 (0.025) 2.842 (0.005)
Iris
KHM(X,C) 113.413 (0.085) 255.763 (117.388) 110.004 (0.260)
F-Measure 0.770 (0.024) 0.660 (0.057) 0.762 (0.004)
Runtime 0.194 (0.008) 3.078 (0.013) 1.873 (0.005)
Glass
KHM(X,C) 1871.812 (0.000) 32933.349 (1398.602) 1857.152 (4.937)
F-Measure 0.396 (0.000) 0.373 (0.020) 0.396 (0.000)
Runtime 4.056 (0.008) 43.350 (0.332) 17.651 (0.013)
Cancer
KHM(X,C) 243,440 (0) 240,634 (8842) 235,441 (696)
F-Measure 0.832 (0.000) 0.820 (0.046) 0.835 (0.003)
Runtime 2.072 (0.008) 15.097 (0.095) 9.859 (0.015)
CMC
KHM(X,C) 381,444 (0) 423,562 (43,932) 379,678 (247)
F-Measure 0.332 (0.000) 0.298 (0.016) 0.332 (0.000)
Runtime 8.528 (0.012) 49.881 (0.256) 42.7017 (0.250)
Wine
KHM(X,C) 8,568,319,639 (2075) 3,637,575,952 (202,759,448) 3,546,930,579 (1,214,985)
F-Measure 0.502 (0.000) 0.530 (0.039) 0.535 (0.004)
Runtime 2.040 (0.008) 35.072 (0.385) 6.508 (0.017)
6. Conclusions

This paper investigates a hybrid clustering algorithm (PSOKHM) based on the KHM algorithm and the PSO algorithm. Experiments were conducted on seven data sets. The PSOKHM algorithm robustly searches for the data cluster centers, using as its metric the sum over all data points of the harmonic average of the distances from a data point to all the centers. Using the same metric, PSO is shown to need more runtime to reach the global optimum, while KHM may run into local optima. In other words, the PSOKHM algorithm not only improves the convergence speed of PSO but also helps KHM escape from local optima. Experimental results also show that PSOKHM is at least comparable to KHM and better than PSO in terms of the F-Measure.

One drawback of PSOKHM is that it requires more runtime than KHM; PSOKHM is therefore not suitable when runtime is critical.
Acknowledgements

I would like to express my thanks and deepest appreciation to Prof. Jigui Sun. This work is partially supported by the Science Foundation for Young Teachers of Northeast Normal University (No. 20061006) and the Specialized Research Fund for the Doctoral Program of Higher Education (Nos. 20050183065 and 20070183057).
References
Alpaydin, E. (2004). Introduction to machine learning (pp. 133–150). Cambridge: The MIT Press.
Aupetit, S., Monmarché, N., & Slimane, M. (2007). Hidden Markov models training by a Particle Swarm Optimization algorithm. Journal of Mathematical Modelling and Algorithms, 6, 175–193.
Cui, X., & Potok, T. E. (2005). Document clustering using Particle Swarm Optimization. In IEEE swarm intelligence symposium. Pasadena, California.
Dalli, A. (2003). Adaptation of the F-measure to cluster-based lexicon quality evaluation. In EACL 2003. Budapest.
Feng, H. M., Chen, C. Y., & Ye, F. (2007). Evolutionary fuzzy particle swarm optimization vector quantization learning scheme in image compression. Expert Systems with Applications, 32(1), 213–222.
Güngör, Z., & Ünler, A. (2008). K-harmonic means data clustering with tabu-search method. Applied Mathematical Modelling, 32, 1115–1125.
Halberstadt, W., & Douglas, T. S. (2008). Fuzzy clustering to detect tuberculous meningitis-associated hyperdensity in CT images. Computers in Biology and Medicine, 38(2), 165–170.
Hammerly, G., & Elkan, C. (2002). Alternatives to the k-means algorithm that find better clusterings. In Proceedings of the 11th international conference on information and knowledge management (pp. 600–607).
Handl, J., Knowles, J., & Dorigo, M. (2003). On the performance of ant-based clustering. Design and application of hybrid intelligent systems. Frontiers in Artificial Intelligence and Applications, 104, 204–213.
He, Y., Pan, W., & Lin, J. (2006). Cluster analysis using multivariate normal mixture models to detect differential gene expression with microarray data. Computational Statistics and Data Analysis, 51(2), 641–658.
Hu, G., Zhou, S., Guan, J., & Hu, X. (2008). Towards effective document clustering: A constrained K-means based approach. Information Processing and Management, 44(4), 1397–1409.
Kao, Y. T., Zahara, E., & Kao, I. W. (2008). A hybridized approach to data clustering. Expert Systems with Applications, 34(3), 1754–1762.
Kennedy, J., & Eberhart, R. C. (1995). Particle swarm optimization. In Proceedings of the 1995 IEEE international conference on neural networks (pp. 1942–1948). New Jersey: IEEE Press.
Kerr, G., Ruskin, H. J., Crane, M., & Doolan, P. (2008). Techniques for clustering gene expression data. Computers in Biology and Medicine, 38(3), 283–293.
Liao, C. J., Tseng, C. T., & Luarn, P. (2007). A discrete version of particle swarm optimization for flowshop scheduling problems. Computers and Operations Research, 34, 3099–3111.
Li, Y. J., Chung, S. M., & Holt, J. D. (2008). Text document clustering based on frequent word meaning sequences. Data and Knowledge Engineering, 64(1), 381–404.
Liu, B., Wang, L., & Jin, Y. H. (2008). An effective hybrid PSO-based algorithm for flow shop scheduling with limited buffers. Computers and Operations Research, 35(9), 2791–2806.
Maitra, M., & Chatterjee, A. (2008). A hybrid cooperative–comprehensive learning based PSO algorithm for image segmentation using multilevel thresholding. Expert Systems with Applications, 34, 1341–1350.
Pan, H., Wang, L., & Liu, B. (2006). Particle swarm optimization for function optimization in noisy environment. Applied Mathematics and Computation, 181, 908–919.
Tan, P. N., Steinbach, M., & Kumar, V. (2005). Introduction to data mining (pp. 487–559). Boston: Addison-Wesley.
Tjhi, W. C., & Chen, L. H. (2008). A heuristic-based fuzzy co-clustering algorithm for categorization of high-dimensional data. Fuzzy Sets and Systems, 159(4), 371–389.
Ünler, A., & Güngör, Z. (2008). Applying K-harmonic means clustering to the part-machine classification problem. Expert Systems with Applications. doi:10.1016/j.eswa.2007.11.048.
Webb, A. (2002). Statistical pattern recognition (pp. 361–406). New Jersey: John Wiley & Sons.
Zhang, B., Hsu, M., & Dayal, U. (1999). K-harmonic means – a data clustering algorithm. Technical Report HPL-1999-124. Hewlett-Packard Laboratories.
Zhang, B., Hsu, M., & Dayal, U. (2000). K-harmonic means. In International workshop on temporal, spatial and spatio-temporal data mining, TSDM 2000. Lyon, France, September 12.
Zhou, H., & Liu, Y. H. (2008). Accurate integration of multi-view range images using k-means clustering. Pattern Recognition, 41(1), 152–175.