An efficient hybrid data clustering method based on K-harmonic means and Particle Swarm Optimization

Fengqin Yang a,b,*, Tieli Sun a, Changhai Zhang b

a College of Computer Science, Northeast Normal University, Changchun, Jilin 130117, China
b College of Computer Science and Technology, Jilin University, Changchun, Jilin 130012, China

Article info

Keywords:

Data clustering

K-means

K-harmonic means

Particle Swarm Optimization

Abstract

Clustering is the process of grouping data objects into a set of disjoint classes, called clusters, so that objects within a class are highly similar to one another and dissimilar to the objects in other classes. The K-means (KM) algorithm is one of the most popular clustering techniques because it is easy to implement and works fast in most situations. However, it is sensitive to initialization and is easily trapped in local optima. K-harmonic means (KHM) clustering solves the problem of initialization using a built-in boosting function, but it also easily runs into local optima. The Particle Swarm Optimization (PSO) algorithm is a stochastic global optimization technique. A hybrid data clustering algorithm based on PSO and KHM (PSOKHM) is proposed in this research, which makes full use of the merits of both algorithms. The PSOKHM algorithm not only helps KHM clustering escape from local optima but also overcomes the shortcoming of the slow convergence speed of the PSO algorithm. The performance of the PSOKHM algorithm is compared with those of PSO and KHM clustering on seven data sets. Experimental results indicate the superiority of the PSOKHM algorithm.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Clustering is a search for hidden patterns that may exist in data sets. It is a process of grouping data objects into disjoint clusters so that the objects in each cluster are similar to one another, yet different from the others. Clustering techniques are applied in many application areas such as pattern recognition (Halberstadt & Douglas, 2008; Webb, 2002; Zhou & Liu, 2008), machine learning (Alpaydin, 2004), data mining (Tan, Steinbach, & Kumar, 2005; Tjhi & Chen, 2008), information retrieval (Hu, Zhou, Guan, & Hu, 2008; Li, Chung, & Holt, 2008) and bioinformatics (He, Pan, & Lin, 2006; Kerr, Ruskin, Crane, & Doolan, 2008). The K-means (KM) algorithm is one of the most popular and widespread partitioning clustering algorithms because of its feasibility and efficiency in dealing with large amounts of data. KM partitions data objects into k clusters, where the number of clusters, k, is decided in advance according to application purposes. The most common clustering objective in KM is to minimize the sum of dissimilarities between all objects and their corresponding cluster centers. The main drawback of the KM algorithm is that the clustering result is sensitive to the selection of the initial cluster centers and may converge to local optima (Cui & Potok, 2005; Kao, Zahara, & Kao, 2008).

The K-harmonic means (KHM) algorithm is a more recent algorithm proposed by Zhang, Hsu, and Dayal (1999, 2000) and modified by Hammerly and Elkan (2002). The clustering objective in this algorithm is to minimize the harmonic average from all points in the data set to all cluster centers. The KHM algorithm solves the initialization problem of the KM algorithm, but it also easily runs into local optima. Particle Swarm Optimization (PSO) is a population-based stochastic optimization technique developed by Kennedy and Eberhart (1995), inspired by the social behavior of bird flocking and fish schooling. Nowadays, PSO has gained much attention and wide application in a variety of fields (Feng, Chen, & Ye, 2007; Liu, Wang, & Jin, 2008). In this paper, we explore the application of PSO to help the KHM algorithm escape from local optima. A hybrid data clustering algorithm based on KHM and PSO, called PSOKHM, is proposed. The experimental results on a variety of data sets drawn from several artificial and real-life situations indicate that the PSOKHM algorithm is superior to both the KHM algorithm and the PSO algorithm.

The rest of the paper is organized as follows. Section 2 introduces KHM clustering. Section 3 briefly describes the PSO search technique. Section 4 presents our hybrid clustering algorithm. Section 5 illustrates the experimental results. Finally, Section 6 draws conclusions.

0957-4174/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2009.02.003

* Corresponding author. Address: College of Computer Science, Northeast Normal University, 5268 Renmin Street, Changchun, Jilin 130117, China. E-mail address: yangfq147@nenu.edu.cn (F. Yang).

Expert Systems with Applications 36 (2009) 9847–9852


2. K-harmonic means clustering

KM clustering is a simple and fast method, commonly used due to its straightforward implementation and small number of iterations. The KM algorithm attempts to find the cluster centers (c_1, ..., c_k) such that the sum of the squared distances of each data point x_i to its nearest cluster center c_j is minimized. The dependency of the KM performance on the initialization of the centers has been a major problem. This is due to its winner-takes-all partitioning strategy, which results in a strong association between the data points and the nearest center and prevents the centers from moving out of a local density of data.

KHM clustering addresses this intrinsic problem by replacing the minimum distance from a data point to the centers, used in KM, with the harmonic mean of the distances from each data point to all centers. The harmonic mean gives a good (low) score for each data point when that data point is close to any one center. This is a property of the harmonic mean: it behaves like the minimum function used by KM, but it is a smooth, differentiable function. The following notation is used to formulate the KHM algorithm (Hammerly & Elkan, 2002; Ünler & Güngör, 2008):

X = {x_1, ..., x_n}: the data to be clustered.
C = {c_1, ..., c_k}: the set of cluster centers.
m(c_j|x_i): the membership function defining the proportion of data point x_i that belongs to center c_j.
w(x_i): the weight function defining how much influence data point x_i has in re-computing the center parameters in the next iteration.

The basic algorithm for KHM clustering is as follows:

1. Initialize the algorithm with guessed centers C, i.e., randomly choose the initial centers.
2. Calculate the objective function value according to

   KHM(X, C) = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} 1 / \|x_i - c_j\|^{p}},   (1)

   where p is an input parameter, typically p >= 2.
3. For each data point x_i, compute its membership m(c_j|x_i) in each center c_j according to

   m(c_j|x_i) = \frac{\|x_i - c_j\|^{-p-2}}{\sum_{j=1}^{k} \|x_i - c_j\|^{-p-2}}.   (2)

4. For each data point x_i, compute its weight w(x_i) according to

   w(x_i) = \frac{\sum_{j=1}^{k} \|x_i - c_j\|^{-p-2}}{\left(\sum_{j=1}^{k} \|x_i - c_j\|^{-p}\right)^{2}}.   (3)

5. For each center c_j, re-compute its location from all data points x_i according to their memberships and weights:

   c_j = \frac{\sum_{i=1}^{n} m(c_j|x_i) \, w(x_i) \, x_i}{\sum_{i=1}^{n} m(c_j|x_i) \, w(x_i)}.   (4)

6. Repeat steps 2–5 for a predefined number of iterations or until KHM(X, C) does not change significantly.
7. Assign each data point x_i to the cluster j with the biggest m(c_j|x_i).

It has been demonstrated that KHM is essentially insensitive to the initialization of the centers (Zhang et al., 1999). However, it tends to converge to local optima (Güngör & Ünler, 2008).
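As a concrete illustration, steps 1–7 above can be sketched in Python with NumPy. This is an illustrative rendering of Eqs. (1)–(4), not the authors' implementation (which was written in Visual C++); the function name `khm` and its default arguments are our own choices.

```python
import numpy as np

def khm(X, k, p=3.5, iters=50, rng=None):
    """K-harmonic means clustering (steps 1-7, Eqs. (1)-(4)).

    X: (n, d) array of data points; k: number of clusters; p: KHM exponent.
    Returns the centers, hard cluster labels, and the KHM(X, C) objective.
    """
    rng = np.random.default_rng(rng)
    # Step 1: initialize centers as k distinct randomly chosen data points.
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # d[i, j] = ||x_i - c_j||, floored to avoid division by zero.
        d = np.maximum(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), 1e-8)
        q = d ** (-p - 2)
        m = q / q.sum(axis=1, keepdims=True)              # Eq. (2): memberships
        w = q.sum(axis=1) / (d ** (-p)).sum(axis=1) ** 2  # Eq. (3): weights
        mw = m * w[:, None]                               # combined (n, k) factors
        C = (mw.T @ X) / mw.sum(axis=0)[:, None]          # Eq. (4): new centers
    d = np.maximum(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), 1e-8)
    obj = (k / (1.0 / d ** p).sum(axis=1)).sum()          # Eq. (1): objective
    labels = d.argmin(axis=1)                             # step 7: largest membership
    return C, labels, obj
```

Note that the hard assignment in step 7 reduces to choosing the nearest center, since m(c_j|x_i) is largest for the smallest distance.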

3. Particle Swarm Optimization

The PSO method was developed by Kennedy and Eberhart (1995) and has been successfully applied in many scientific and practical fields (Aupetit, Monmarché, & Slimane, 2007; Liao, Tseng, & Luarn, 2007; Maitra & Chatterjee, 2008; Pan, Wang, & Liu, 2006). PSO is a sociologically inspired population-based optimization algorithm. Each particle is an individual, and the swarm is composed of particles. In PSO, the solution space of the problem is formulated as a search space, and each position in the search space corresponds to a solution of the problem. Particles cooperate to find the best position (best solution) in the search space (solution space). Each particle moves according to its velocity. At each iteration, the particle movement is computed as follows:

x_i(t+1) = x_i(t) + v_i(t),   (5)

v_i(t+1) = \omega v_i(t) + c_1 \mathrm{rand}_1 (pbest_i(t) - x_i(t)) + c_2 \mathrm{rand}_2 (gbest(t) - x_i(t)).   (6)

In Eqs. (5) and (6), x_i(t) is the position of particle i at time t, v_i(t) is the velocity of particle i at time t, pbest_i(t) is the best position found by particle i itself so far, gbest(t) is the best position found by the whole swarm so far, \omega is an inertia weight scaling the previous time-step velocity, c_1 and c_2 are two acceleration coefficients that scale the influence of the particle's best personal position (pbest_i(t)) and the best global position (gbest(t)), and rand_1 and rand_2 are random variables between 0 and 1. The process of PSO is shown in Fig. 1.
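A minimal Python sketch of these update rules, applied to an arbitrary objective function to be minimized, might look as follows. This is illustrative only: the function name `pso_minimize` and the box bounds are our own, while the default values of \omega, c_1, c_2 and the swarm size follow the settings reported later in Table 2. The loop updates positions first and velocities second, as in Fig. 1.

```python
import numpy as np

def pso_minimize(f, dim, n_particles=18, iters=100, w=0.7298,
                 c1=1.49618, c2=1.49618, bounds=(-5.0, 5.0), rng=None):
    """Minimize f over a box using the PSO updates of Eqs. (5) and (6)."""
    rng = np.random.default_rng(rng)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))   # positions
    v = np.zeros((n_particles, dim))              # velocities start at rest
    fit = np.array([f(p) for p in x])
    pbest, pbest_fit = x.copy(), fit.copy()       # personal bests
    g = pbest[pbest_fit.argmin()].copy()          # global best
    for _ in range(iters):
        # Eq. (5): move each particle by its current velocity.
        x = x + v
        # Eq. (6): inertia + cognitive pull (pbest) + social pull (gbest).
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        # Evaluate and update the personal and global bests.
        fit = np.array([f(p) for p in x])
        better = fit < pbest_fit
        pbest[better], pbest_fit[better] = x[better], fit[better]
        g = pbest[pbest_fit.argmin()].copy()
    return g, pbest_fit.min()
```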

4. The proposed hybrid clustering algorithm

The KHM algorithm tends to converge faster than the PSO algorithm because it requires fewer function evaluations, but it usually gets stuck in local optima. We integrate KHM with PSO to form a hybrid clustering algorithm called PSOKHM, which maintains the merits of both KHM and PSO. More specifically, PSOKHM applies KHM with four iterations to the particles in the swarm every eight generations, so that the fitness value of each particle is improved. A particle is a vector of real numbers of dimension k*d, where k is the number of clusters and d is the dimension of the data to be clustered. A particle is represented as in Fig. 2. The fitness function of the PSOKHM algorithm is the objective function of the KHM algorithm. Fig. 3 summarizes the hybrid PSOKHM algorithm.

Initialize a population of particles with random positions and velocities in the search space.
While (termination conditions are not met)
{
    For each particle i do
        Update the position of particle i according to Eq. (5).
        Update the velocity of particle i according to Eq. (6).
        Map the position of particle i into the solution space and evaluate its fitness value according to the fitness function.
        Update pbest_i(t) and gbest(t) if necessary.
    End for
}

Fig. 1. The process of the PSO algorithm.


5. Experimental results

Seven data sets are employed to validate our method. These data sets, named ArtSet1, ArtSet2, Wine, Glass, Iris, breast-cancer-wisconsin (denoted Cancer), and Contraceptive Method Choice (denoted CMC), cover examples of data of low, medium and high dimension. All data sets except ArtSet1 and ArtSet2 are available at ftp://ftp.ics.uci.edu/pub/machine-learning-databases/. Table 1 summarizes the characteristics of these data sets. Table 2 shows the parameter settings of our algorithm.

5.1. Data sets

(1) ArtSet1 (n = 300, d = 2, k = 3): This is an artificial data set. It is a two-featured problem with three unique classes. A total of 300 patterns are drawn from three independent bivariate normal distributions, where the classes are distributed according to

N_2( \mu = (\mu_{i1}, \mu_{i2}), \Sigma = [0.4, 0.04; 0.04, 0.4] ),  i = 1, 2, 3,

with \mu_{11} = \mu_{12} = -2, \mu_{21} = \mu_{22} = 2, \mu_{31} = \mu_{32} = 6, where \mu and \Sigma are the mean vector and covariance matrix, respectively. The data set is illustrated in Fig. 4.

(2) ArtSet2 (n = 300, d = 3, k = 3): This is an artificial data set. It is a three-featured problem with three classes and 300 patterns, where every feature of the classes is distributed according to Class 1 ~ Uniform(10, 25), Class 2 ~ Uniform(25, 40), Class 3 ~ Uniform(40, 55). The data set is illustrated in Fig. 5.

(3) Fisher's iris data set (n = 150, d = 4, k = 3), which consists of three different species of iris flower: Iris Setosa, Iris Versicolour and Iris Virginica. For each species, 50 samples with four features (sepal length, sepal width, petal length, and petal width) were collected.

(4) Glass (n = 214, d = 9, k = 6), which consists of six different types of glass: building windows float processed (70 objects), building windows non-float processed (76 objects), vehicle windows float processed (17 objects), containers (13 objects), tableware (9 objects), and headlamps (29 objects). Each type has nine features: refractive index, sodium, magnesium, aluminum, silicon, potassium, calcium, barium, and iron.

(5) Wisconsin breast cancer (n = 683, d = 9, k = 2), which consists of 683 objects characterized by nine features: clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. There are two categories in the data: malignant (444 objects) and benign (239 objects).

(6) Contraceptive Method Choice (n = 1473, d = 9, k = 3): This data set is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who either were not pregnant or did not know if they were at the time of interview. The problem is to predict the choice of current contraceptive method (no use, 629 objects; long-term methods, 334 objects; short-term methods, 510 objects) of a woman based on her demographic and socioeconomic characteristics.

(7) Wine (n = 178, d = 13, k = 3): These data, consisting of 178 objects characterized by 13 features (alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline), are the results of a chemical analysis of wines brewed in the same region in Italy but derived from three different cultivars. There are three categories in the data: class 1 (59 objects), class 2 (71 objects), and class 3 (48 objects).

Fig. 2. The representation of a particle: (x_11, x_12, ..., x_1d, ..., x_k1, x_k2, ..., x_kd).

Step 1: Set the initial parameters, including the maximum iterative count IterCount, the population size P_size, \omega, c_1 and c_2.
Step 2: Initialize a population of size P_size.
Step 3: Set iterative count Gen1 = 0.
Step 4: Set iterative counts Gen2 = Gen3 = 0.
Step 5 (PSO method):
    Step 5.1: Apply the PSO operator to update the P_size particles.
    Step 5.2: Gen2 = Gen2 + 1. If Gen2 < 8, go to Step 5.1.
Step 6 (KHM method): For each particle i do
    Step 6.1: Take the position of particle i as the initial cluster centers of the KHM algorithm.
    Step 6.2: Recalculate each cluster center using the KHM algorithm.
    Step 6.3: Gen3 = Gen3 + 1. If Gen3 < 4, go to Step 6.2.
Step 7: Gen1 = Gen1 + 1. If Gen1 < IterCount, go to Step 4.
Step 8: Assign each data point x_i to the cluster j with the biggest m(c_j|x_i).

Fig. 3. The hybrid PSOKHM algorithm.

Table 1
Characteristics of the data sets considered.

Name of data set | No. of classes | No. of features | Size of data set (size of classes in parentheses)
ArtSet1 | 3 | 2 | 300 (100, 100, 100)
ArtSet2 | 3 | 3 | 300 (100, 100, 100)
Iris | 3 | 4 | 150 (50, 50, 50)
Glass | 6 | 9 | 214 (70, 17, 76, 13, 9, 29)
Cancer | 2 | 9 | 683 (444, 239)
CMC | 3 | 9 | 1473 (629, 334, 510)
Wine | 3 | 13 | 178 (59, 71, 48)

Table 2
The PSOKHM algorithm parameter setup.

Parameter | Value
P_size | 18
\omega | 0.7298
c_1 | 1.49618
c_2 | 1.49618
IterCount | 5
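Putting the pieces together, the schedule of Fig. 3 (eight PSO generations, then four KHM iterations per particle, repeated IterCount times) can be sketched as follows. This is an illustrative, self-contained Python rendering under our own naming, not the authors' Visual C++ implementation; each particle encodes k centers as a flat vector of length k*d, as in Fig. 2.

```python
import numpy as np

def _dist(X, C):
    # d[i, j] = ||x_i - c_j||, floored to avoid division by zero.
    return np.maximum(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), 1e-8)

def khm_objective(X, C, p):
    d = _dist(X, C)
    return (len(C) / (1.0 / d ** p).sum(axis=1)).sum()        # Eq. (1)

def khm_iteration(X, C, p):
    d = _dist(X, C)
    q = d ** (-p - 2)
    m = q / q.sum(axis=1, keepdims=True)                      # Eq. (2)
    w = q.sum(axis=1) / (d ** (-p)).sum(axis=1) ** 2          # Eq. (3)
    mw = m * w[:, None]
    return (mw.T @ X) / mw.sum(axis=0)[:, None]               # Eq. (4)

def psokhm(X, k, p=3.5, iter_count=5, pop=8, w=0.7298,
           c1=1.49618, c2=1.49618, rng=None):
    rng = np.random.default_rng(rng)
    n, dim = X.shape
    # Initialize each particle from randomly chosen data points (Fig. 2 encoding).
    pos = X[rng.integers(0, n, (pop, k))].reshape(pop, k * dim).astype(float)
    vel = np.zeros_like(pos)
    fit = np.array([khm_objective(X, P.reshape(k, dim), p) for P in pos])
    pbest, pbest_fit = pos.copy(), fit.copy()
    gbest = pbest[pbest_fit.argmin()].copy()
    for _ in range(iter_count):                               # Step 7 outer loop
        for _ in range(8):                                    # Step 5: 8 PSO generations
            pos = pos + vel                                   # Eq. (5)
            r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
            vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)  # Eq. (6)
            fit = np.array([khm_objective(X, P.reshape(k, dim), p) for P in pos])
            better = fit < pbest_fit
            pbest[better], pbest_fit[better] = pos[better], fit[better]
            gbest = pbest[pbest_fit.argmin()].copy()
        for i in range(pop):                                  # Step 6: 4 KHM iterations
            C = pos[i].reshape(k, dim)
            for _ in range(4):
                C = khm_iteration(X, C, p)
            pos[i] = C.reshape(-1)
            fit[i] = khm_objective(X, C, p)
            if fit[i] < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i].copy(), fit[i]
        gbest = pbest[pbest_fit.argmin()].copy()
    C = gbest.reshape(k, dim)
    labels = _dist(X, C).argmin(axis=1)                       # Step 8
    return C, labels, pbest_fit.min()
```

The KHM refinement in Step 6 is what gives the hybrid its fast convergence, while the PSO generations keep particles exploring beyond the local optimum each particle's KHM run would settle into.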

5.2. Experimental results

In this section, we evaluate and compare the performance of the following methods: the KHM, PSO and PSOKHM algorithms, as means of optimizing the objective function of the KHM algorithm. The quality of the respective clusterings is also compared, where quality is measured by the following two criteria:

(1) The sum over all data points of the harmonic average of the distance from a data point to all the centers, as defined in Eq. (1). Clearly, the smaller this sum, the higher the quality of the clustering.

(2) The F-Measure, which uses the ideas of precision and recall from information retrieval (Dalli, 2003; Handl, Knowles, & Dorigo, 2003). Each class i (as given by the class labels of the benchmark data set) is regarded as the set of n_i items desired for a query; each cluster j (generated by the algorithm) is regarded as the set of n_j items retrieved for a query; n_{ij} gives the number of elements of class i within cluster j. For each class i and cluster j, precision and recall are then defined as

p(i, j) = n_{ij} / n_j  and  r(i, j) = n_{ij} / n_i,

and the corresponding value under the F-Measure is

F(i, j) = \frac{(b^2 + 1) \, p(i, j) \, r(i, j)}{b^2 \, p(i, j) + r(i, j)},

where we chose b = 1 to weight p(i, j) and r(i, j) equally. The overall F-Measure for a data set of size n is given by

F = \sum_{i} \frac{n_i}{n} \max_{j} \{ F(i, j) \}.

Obviously, the bigger the F-Measure, the higher the quality of the clustering.
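The overall F-Measure is straightforward to compute from true class labels and predicted cluster labels. A short Python sketch (the function name `f_measure` is our own) follows the definition above:

```python
from collections import Counter

def f_measure(true_labels, cluster_labels, b=1.0):
    """Overall F-Measure: F = sum_i (n_i / n) * max_j F(i, j)."""
    n = len(true_labels)
    n_i = Counter(true_labels)                        # class sizes n_i
    n_j = Counter(cluster_labels)                     # cluster sizes n_j
    n_ij = Counter(zip(true_labels, cluster_labels))  # overlaps n_ij
    total = 0.0
    for i, ni in n_i.items():
        best = 0.0
        for j, nj in n_j.items():
            nij = n_ij.get((i, j), 0)
            if nij == 0:
                continue
            pre, rec = nij / nj, nij / ni             # p(i, j), r(i, j)
            best = max(best, (b * b + 1) * pre * rec / (b * b * pre + rec))
        total += (ni / n) * best                      # weight by class size
    return total
```

A perfect clustering (each cluster equals one class, up to relabeling) yields F = 1, and F decreases as clusters mix classes.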

The experimental results are averages over 10 runs of simulation. The algorithms are implemented using Visual C++ on a Pentium(R) D CPU at 2.66 GHz with 1.00 GB RAM. It is known that p is a key parameter for obtaining good objective function values. For this reason we conducted our experiments with different p values. Tables 3–5 give the means and standard deviations (over 10 runs) obtained for each of these measures when p is 2.5, 3 and 3.5, respectively. Additionally, they show the runtimes of the algorithms.

For ArtSet1 the averages of KHM(X, C) for KHM, PSO and PSOKHM are almost the same, and the F-Measures of all three algorithms are 1. For ArtSet2 and the other five real data sets the average KHM(X, C) of PSOKHM is much better than those of KHM and PSO. With the exception of two cases (CMC, p = 2.5 and Iris, p = 3.5), the average F-Measure of PSOKHM is equal to or better than that of KHM, while that of PSO is relatively bad. This indicates that for a low-dimensional data set whose clusters are spatially well separated (as is the case for ArtSet1), the performance of the three algorithms is nearly the same, while PSOKHM outperforms the other two methods in the remaining cases. With the exception of ArtSet1, the runtimes of PSOKHM are higher than those of KHM and lower than those of PSO.

Fig. 4. ArtSet1.

Fig. 5. ArtSet2.

Table 3
Results of KHM, PSO, and PSOKHM clustering on two artificial and five real data sets when p = 2.5. The quality of clustering is evaluated using KHM(X, C) and the F-Measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs. Bold face indicates the best and italic face the second best result of the three algorithms.

KHM PSO PSOKHM

ArtSet1

KHM(X,C) 703.867 (0.000) 703.528 (0.190) 703.509 (0.050)

F-Measure 1.000 (0.000) 1.000 (0.000) 1.000 (0.000)

Runtime 0.106 (0.006) 1.648 (0.008) 1.921 (0.007)

ArtSet2

KHM(X,C) 111,852 (0) 1,910,696 (915,890) 111,813 (2)

F-Measure 1.000 (0.000) 0.665 (0.088) 1.000 (0.000)

Runtime 0.223 (0.008) 3.650 (0.031) 2.859 (0.000)

Iris

KHM(X,C) 149.333 (0.000) 230.340 (98.180) 149.058 (0.074)

F-Measure 0.750 (0.000) 0.711 (0.062) 0.753 (0.005)

Runtime 0.192 (0.008) 3.117 (0.020) 1.842 (0.005)

Glass

KHM(X,C) 1203.554 (16.231) 9551.095 (1933.211) 1196.798 (0.439)

F-Measure 0.421 (0.011) 0.387 (0.044) 0.424 (0.003)

Runtime 4.064 (0.010) 44.249 (0.431) 17.669 (0.018)

Cancer

KHM(X,C) 60,189 (0) 60,244 (563) 59,844 (22)

F-Measure 0.829 (0.000) 0.819 (0.005) 0.829 (0.000)

Runtime 2.017 (0.009) 16.046 (0.138) 9.525 (0.013)

CMC

KHM(X,C) 96,520 (0) 115,096 (33,014) 96,193 (25)

F-Measure 0.335 (0.000) 0.298 (0.019) 0.333 (0.002)

Runtime 8.639 (0.009) 54.163 (0.578) 39.825 (0.072)

Wine

KHM(X,C) 18,386,505 (0) 19,795,542 (2,007,722) 18,386,285 (5)

F-Measure 0.516 (0.000) 0.512 (0.020) 0.516 (0.000)

Runtime 2.059 (0.010) 35.642 (0.282) 6.539 (0.008)


Table 4
Results of KHM, PSO, and PSOKHM clustering on two artificial and five real data sets when p = 3. The quality of clustering is evaluated using KHM(X, C) and the F-Measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs. Bold face indicates the best and italic face the second best result of the three algorithms.

KHM PSO PSOKHM

ArtSet1

KHM(X,C) 742.110 (0.000) 741.6927 (0.080) 741.455 (0.002)

F-Measure 1.000 (0.000) 1.0000 (0.000) 1.000 (0.000)

Runtime 0.001 (0.006) 1.633 (0.008) 1.921 (0.007)

ArtSet2

KHM(X,C) 278,758 (0) 8,675,830 (6,626,165) 278,541 (33)

F-Measure 1.000 (0.000) 0.681 (0.093) 1.000 (0.000)

Runtime 0.220 (0.005) 3.575 (0.030) 2.844 (0.010)

Iris

KHM(X,C) 126.517 (0.000) 147.217 (22.896) 125.951 (0.052)

F-Measure 0.744 (0.000) 0.740 (0.025) 0.744 (0.000)

Runtime 0.190 (0.007) 3.096 (0.010) 1.826 (0.009)

Glass

KHM(X,C) 1535.198 (0.000) 18191.700 (1870.044) 1442.847 (35.871)

F-Measure 0.422 (0.000) 0.378 (0.030) 0.427 (0.003)

Runtime 4.042 (0.007) 43.594 (0.338) 17.609 (0.015)

Cancer

KHM(X,C) 119,458 (0) 119,333 (3770) 117,418 (237)

F-Measure 0.834 (0.000) 0.817 (0.033) 0.834 (0.000)

Runtime 2.027 (0.007) 16.150 (0.144) 9.594 (0.023)

CMC

KHM(X,C) 187,525 (0) 205,548 (60,798) 186,722 (111)

F-Measure 0.303 (0.000) 0.250 (0.028) 0.303 (0.000)

Runtime 8.627 (0.009) 148.985 (0.933) 39.485 (0.056)

Wine

KHM(X,C) 298,230,848 (24,270,951) 276,508,278 (23,807,035) 252,522,504 (766)

F-Measure 0.538 (0.007) 0.519 (0.021) 0.553 (0.000)

Runtime 2.084 (0.010) 35.284 (0.531) 6.598 (0.008)

Table 5
Results of KHM, PSO, and PSOKHM clustering on two artificial and five real data sets when p = 3.5. The quality of clustering is evaluated using KHM(X, C) and the F-Measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs. Bold face indicates the best and italic face the second best result of the three algorithms.

KHM PSO PSOKHM

ArtSet1

KHM(X,C) 807.536 (0.028) 806.811 (0.079) 806.617 (0.007)

F-Measure 1.000 (0.000) 1.000 (0.000) 1.000 (0.000)

Runtime 0.106 (0.006) 1.628 (0.006) 1.921 (0.007)

ArtSet2

KHM(X,C) 697,006 (0.000) 80,729,943 (33,400,802) 696,049 (78)

F-Measure 1.000 (0.000) 0.660 (0.081) 1.000 (0.000)

Runtime 0.220 (0.005) 3.601 (0.025) 2.842 (0.005)

Iris

KHM(X,C) 113.413 (0.085) 255.763 (117.388) 110.004 (0.260)

F-Measure 0.770 (0.024) 0.660 (0.057) 0.762 (0.004)

Runtime 0.194 (0.008) 3.078 (0.013) 1.873 (0.005)

Glass

KHM(X,C) 1871.812 (0.000) 32933.349 (1398.602) 1857.152 (4.937)

F-Measure 0.396 (0.000) 0.373 (0.020) 0.396 (0.000)

Runtime 4.056 (0.008) 43.350 (0.332) 17.651 (0.013)

Cancer

KHM(X,C) 243,440 (0) 240,634 (8842) 235,441 (696)

F-Measure 0.832 (0.000) 0.820 (0.046) 0.835 (0.003)

Runtime 2.072 (0.008) 15.097 (0.095) 9.859 (0.015)

CMC

KHM(X,C) 381,444 (0) 423,562 (43,932) 379,678 (247)

F-Measure 0.332 (0.000) 0.298 (0.016) 0.332 (0.000)

Runtime 8.528 (0.012) 49.881 (0.256) 42.7017 (0.250)

Wine

KHM(X,C) 8,568,319,639 (2075) 3,637,575,952 (202,759,448) 3,546,930,579 (1,214,985)

F-Measure 0.502 (0.000) 0.530 (0.039) 0.535 (0.004)

Runtime 2.040 (0.008) 35.072 (0.385) 6.508 (0.017)


6. Conclusions

This paper investigates a hybrid clustering algorithm (PSOKHM) based on the KHM algorithm and the PSO algorithm. Experiments are carried out on seven data sets. The PSOKHM algorithm robustly searches for the data cluster centers using, as its metric, the sum over all data points of the harmonic average of the distance from a data point to all the centers. Using the same metric, PSO is shown to need more runtime to reach the global optimum, while KHM may run into local optima. That is to say, the PSOKHM algorithm not only improves the convergence speed of PSO but also helps KHM escape from local optima. Experimental results also show that PSOKHM is at least comparable to KHM and is better than PSO in terms of the F-Measure.

One drawback of PSOKHM is that it requires more runtime than KHM. PSOKHM is therefore not applicable when runtime is critical.

Acknowledgements

I would like to express my thanks and deepest appreciation to Prof. Jigui Sun. This work is partially supported by the Science Foundation for Young Teachers of Northeast Normal University (No. 20061006) and the Specialized Research Fund for the Doctoral Program of Higher Education (Nos. 20050183065 and 20070183057).

References

Alpaydin, E. (2004). Introduction to machine learning (pp. 133–150). Cambridge: The MIT Press.
Aupetit, S., Monmarché, N., & Slimane, M. (2007). Hidden Markov models training by a Particle Swarm Optimization algorithm. Journal of Mathematical Modelling and Algorithms, 6, 175–193.
Cui, X., & Potok, T. E. (2005). Document clustering using Particle Swarm Optimization. In IEEE swarm intelligence symposium. Pasadena, California.
Dalli, A. (2003). Adaptation of the F-measure to cluster-based lexicon quality evaluation. In EACL 2003. Budapest.
Feng, H. M., Chen, C. Y., & Ye, F. (2007). Evolutionary fuzzy particle swarm optimization vector quantization learning scheme in image compression. Expert Systems with Applications, 32(1), 213–222.
Güngör, Z., & Ünler, A. (2008). K-harmonic means data clustering with tabu-search method. Applied Mathematical Modelling, 32, 1115–1125.
Halberstadt, W., & Douglas, T. S. (2008). Fuzzy clustering to detect tuberculous meningitis-associated hyperdensity in CT images. Computers in Biology and Medicine, 38(2), 165–170.
Hammerly, G., & Elkan, C. (2002). Alternatives to the k-means algorithm that find better clusterings. In Proceedings of the 11th international conference on information and knowledge management (pp. 600–607).
Handl, J., Knowles, J., & Dorigo, M. (2003). On the performance of ant-based clustering. Design and application of hybrid intelligent systems. Frontiers in Artificial Intelligence and Applications, 104, 204–213.
He, Y., Pan, W., & Lin, J. (2006). Cluster analysis using multivariate normal mixture models to detect differential gene expression with microarray data. Computational Statistics and Data Analysis, 51(2), 641–658.
Hu, G., Zhou, S., Guan, J., & Hu, X. (2008). Towards effective document clustering: A constrained K-means based approach. Information Processing and Management, 44(4), 1397–1409.
Kao, Y. T., Zahara, E., & Kao, I. W. (2008). A hybridized approach to data clustering. Expert Systems with Applications, 34(3), 1754–1762.
Kennedy, J., & Eberhart, R. C. (1995). Particle swarm optimization. In Proceedings of the 1995 IEEE international conference on neural networks (pp. 1942–1948). New Jersey: IEEE Press.
Kerr, G., Ruskin, H. J., Crane, M., & Doolan, P. (2008). Techniques for clustering gene expression data. Computers in Biology and Medicine, 38(3), 283–293.
Li, Y. J., Chung, S. M., & Holt, J. D. (2008). Text document clustering based on frequent word meaning sequences. Data and Knowledge Engineering, 64(1), 381–404.
Liao, C. J., Tseng, C. T., & Luarn, P. (2007). A discrete version of particle swarm optimization for flowshop scheduling problems. Computers and Operations Research, 34, 3099–3111.
Liu, B., Wang, L., & Jin, Y. H. (2008). An effective hybrid PSO-based algorithm for flow shop scheduling with limited buffers. Computers and Operations Research, 35(9), 2791–2806.
Maitra, M., & Chatterjee, A. (2008). A hybrid cooperative–comprehensive learning based PSO algorithm for image segmentation using multilevel thresholding. Expert Systems with Applications, 34, 1341–1350.
Pan, H., Wang, L., & Liu, B. (2006). Particle swarm optimization for function optimization in noisy environment. Applied Mathematics and Computation, 181, 908–919.
Tan, P. N., Steinbach, M., & Kumar, V. (2005). Introduction to data mining (pp. 487–559). Boston: Addison-Wesley.
Tjhi, W. C., & Chen, L. H. (2008). A heuristic-based fuzzy co-clustering algorithm for categorization of high-dimensional data. Fuzzy Sets and Systems, 159(4), 371–389.
Ünler, A., & Güngör, Z. (2008). Applying K-harmonic means clustering to the part-machine classification problem. Expert Systems with Applications. doi:10.1016/j.eswa.2007.11.048.
Webb, A. (2002). Statistical pattern recognition (pp. 361–406). New Jersey: John Wiley & Sons.
Zhang, B., Hsu, M., & Dayal, U. (1999). K-harmonic means – a data clustering algorithm. Technical Report HPL-1999-124. Hewlett-Packard Laboratories.
Zhang, B., Hsu, M., & Dayal, U. (2000). K-harmonic means. In International workshop on temporal, spatial and spatio-temporal data mining, TSDM2000. Lyon, France, September 12.
Zhou, H., & Liu, Y. H. (2008). Accurate integration of multi-view range images using k-means clustering. Pattern Recognition, 41(1), 152–175.

