20 March 2013

K Means Clustering Algorithms: A Comparative Study

Govind Maheswaran, Jayarajan P, Johnes Jose, Joshy Joseph

M120432CS, M120229CS, M120088CS, M120082CS

National Institute of Technology Calicut

Abstract: K Means clustering is a method of clustering which partitions a set of n observations into k clusters, such that each element belongs to the cluster with the nearest mean. Although it is a computationally difficult problem (considered to be NP-hard), there exist many improvements to the traditional algorithm proposed by Stuart Lloyd in 1957. The aim of this paper is to implement Lloyd's algorithm and two improved versions of it, and to carry out a comparative study of the three.

Index Terms: K Means, Clustering, Clustering Algorithms, K Means Clustering, Enhanced K Means, Harmony Search K Means Algorithm

I. Introduction

K Means clustering is a method of clustering which partitions a set of n observations into k clusters, such that each element belongs to the cluster with the nearest mean. K Means clustering finds application across various fields, including Computer Vision, Business Analysis, Astronomy and Agriculture. Our interest here is its application in Bioinformatics, where it can be used effectively for biological sequence analysis and genetic clustering. These processes can give valuable and often predictive insights regarding the state of a disease, the probability of the occurrence of a disease, or the effect of a certain medication on the disease. Although very useful, the first K Means clustering algorithm, proposed by Stuart Lloyd in 1957, was not computationally optimal. This led to many improved versions of the K Means clustering algorithm, each building upon the previous version and improving it while keeping the computational complexity in check. The algorithms considered here are:

K Means Algorithm: The first algorithm, proposed by Stuart Lloyd in 1957. The algorithm suffers from high computational complexity. Also, the initial choice of centroids is random, so different executions of the same algorithm can give different results.

Enhanced K Means Algorithm: Proposed by K A Abdul Nazeer, S D Madhu Kumar and M P Sebastian, this algorithm improves the K Means algorithm using a heuristic approach. It is computationally less complex, as it uses an O(n log n) heuristic method to calculate the initial centroids, which improves the accuracy of the results and reduces computational complexity.

Harmony Search K Means Algorithm: Inspired by the way musicians come up with fresh music by improvising upon a basic motif, this algorithm was proposed by Kaing Seok Lee and Z N Green, based on a paper by Forsati et al.

The remainder of the paper is organized as follows. Section II presents the algorithms being considered. The implementation setup is described in Section III. Section IV presents our results and their analysis. Finally, we conclude in Section V.

II. The Algorithms

In this section, the algorithms are presented. We start with the K Means Algorithm, followed by the Enhanced K Means Algorithm and finally the Harmony Search K Means Algorithm.

A. K Means Clustering Algorithm

The K Means Clustering algorithm works in two phases. In the first phase, k initial centroids are chosen at random. In the second phase, each point in the data set is allotted to the cluster whose centroid is nearest to it. At the end of this phase, the centroid of each cluster is recalculated, and phase 2 is repeated with the new centroids until the centroid values converge, that is, until they no longer change between iterations. The complexity of this algorithm is O(nkl), where n is the number of data items, k is the number of clusters and l is the number of iterations until convergence.

The clustering algorithm is as follows.

Input: n data items, k
Output: n data items partitioned into k clusters

1. Choose k points as initial centroids of the required clusters.
2. Assign each point to the cluster which has the closest centroid.
3. Calculate the new mean for each cluster.
4. Set it as the new centroid of the cluster.
5. Repeat steps 2 to 4 until the centroids no longer change, at which point all points are divided into k clusters.
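The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this study: the function names `euclidean`, `mean_point` and `k_means`, the `max_iter` and `seed` parameters, and the rule that an empty cluster keeps its old centroid are all assumptions made for the example.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(data, k, max_iter=100, seed=None):
    """Lloyd's algorithm on a list of tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: choose k distinct data points as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centroid.
        labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in data
        ]
        # Steps 3 and 4: the mean of each cluster becomes its new centroid.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            # Assumption of this sketch: an empty cluster keeps its old centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Step 5: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

Each pass of the loop compares all n points with all k centroids, which is where the O(nkl) cost over l iterations comes from. Because the initial centroids are drawn at random, different values of `seed` may lead to different clusterings on less well separated data.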


B. Enhanced K Means Clustering Algorithm

The Enhanced K Means Clustering Algorithm is an improved version of the K Means Algorithm. The improvement comes from improving both phases of the algorithm. The initial centroid selection phase is improved using sorting: the data is sorted on the column with the maximum range and then partitioned into k equal parts, and the mean of each part is taken as an initial centroid. In the second phase, the distance between each data point and each chosen centroid is calculated, and the point is assigned to the cluster whose centroid is at the shortest distance.
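The initial centroid selection described above might look as follows in Python. This is a sketch under stated assumptions, not the code used in this study: the name `initial_centroids` is hypothetical, and when k does not divide n the parts are made as equal as possible, which the text does not specify.

```python
def initial_centroids(data, k):
    """Sort on the widest column, split into k parts, return the part means."""
    n_cols = len(data[0])
    # Find the column (attribute) with the maximum range.
    widest = max(
        range(n_cols),
        key=lambda c: max(p[c] for p in data) - min(p[c] for p in data),
    )
    # Sort the data on that column; this O(n log n) sort dominates the phase.
    ordered = sorted(data, key=lambda p: p[widest])
    n = len(ordered)
    centroids = []
    for j in range(k):
        # Contiguous slices whose sizes differ by at most one (assumes k <= n).
        part = ordered[j * n // k:(j + 1) * n // k]
        centroids.append(tuple(sum(col) / len(part) for col in zip(*part)))
    return centroids
```

Unlike a random choice, this selection is deterministic, so repeated runs on the same data start from the same centroids.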

The clustering algorithm is as follows.

Input: n data items, k
Output: n data items partitioned into k clusters

Phase 1
1. Find the column with the maximum range.
2. Sort the data based upon this column.
3. Partition the sorted data into k equal partitions.
4. The mean of each partition forms an initial centroid.

Phase 2
1. Compute the distance of each data-point di (1 <= i <= n) to all the centroids cj (1 <= j <= k) as d(di, cj);
2. For each data-point di, find the closest centroid cj and assign di to cluster j.
3. Set ClusterId[i] = j; // j: Id of the closest cluster
4. Set NearestDist[i] = d(di, cj);
5. For each cluster j (1 <= j <= k), recalculate the centroids;
6. Repeat
   (a) For each data-point di
      i. Compute its distance from the centroid of the present nearest cluster;
      ii. If this distance is less than or equal to the present nearest distance, the data-point stays in the cluster;
      iii. Else
         A. For every centroid cj (1 <= j <= k)
         B. Compute the distance d(di, cj)
         C. Endfor;
         D. Assign the data-point di to the cluster with the nearest centroid cj
         E. Set ClusterId[i] = j
         F. Set NearestDist[i] = d(di, cj)
   (b) Endfor;
   (c) For each cluster j (1 <= j <= k), recalculate the centroids
7. Until the convergence criterion is met.
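Phase 2 of the listing above can be sketched in Python as follows. The listing does not define the convergence criterion, so this example assumes "the centroids did not change in the last pass" together with a `max_iter` cap; the names `euclidean`, `recompute` and `enhanced_assign` are likewise hypothetical. As in the listing, NearestDist is only updated when a point is searched against all centroids.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def recompute(data, cluster_id, centroids):
    """Mean of each cluster; an empty cluster keeps its old centroid (assumption)."""
    result = []
    for j, old in enumerate(centroids):
        members = [p for p, c in zip(data, cluster_id) if c == j]
        if members:
            result.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            result.append(old)
    return result


def enhanced_assign(data, centroids, max_iter=100):
    """Phase 2 sketch: returns (centroids, cluster_id)."""
    k = len(centroids)
    centroids = [tuple(c) for c in centroids]
    # Steps 1 to 4: full nearest-centroid search for every point.
    cluster_id, nearest_dist = [], []
    for p in data:
        dists = [euclidean(p, c) for c in centroids]
        j = min(range(k), key=dists.__getitem__)
        cluster_id.append(j)
        nearest_dist.append(dists[j])
    # Step 5: recalculate the centroids.
    centroids = recompute(data, cluster_id, centroids)
    for _ in range(max_iter):  # Steps 6 and 7: repeat until convergence.
        for i, p in enumerate(data):
            d = euclidean(p, centroids[cluster_id[i]])
            if d <= nearest_dist[i]:
                continue  # O(1) case: the point stays in its cluster.
            # O(k) case: search all centroids and reassign.
            dists = [euclidean(p, c) for c in centroids]
            j = min(range(k), key=dists.__getitem__)
            cluster_id[i], nearest_dist[i] = j, dists[j]
        new_centroids = recompute(data, cluster_id, centroids)
        if new_centroids == centroids:  # assumed convergence test
            break
        centroids = new_centroids
    return centroids, cluster_id
```

The saving over the basic algorithm comes from the early `continue`: a point that has not moved further from its own centroid is not compared with the other k - 1 centroids in that pass.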

The first phase of the algorithm runs in O(n log n) time, because of the sort. In phase 2, some data elements may remain in the same cluster, whereas some may switch clusters. A point that remains in the same cluster costs O(1), whereas a point that changes cluster costs O(k), so one pass over the data costs at most O(nk). Considering that k is very small compared to n, this phase can safely be treated as O(n). Thus, the overall complexity is dominated by the first phase and is O(n log n).

C. Harmony Search K Means Clustering Algorithm

This algorithm was inspired by the improvisation process of musicians. In the Harmony Search algorithm, each musician (decision variable) plays (generates) a note (a value) so that together they find the best harmony (global optimum). The algorithm starts with a random set of clusterings and improves it iteratively so as to get an overall improved result. The following parameters are used by the algorithm.

Harmony Memory Size (HMS)

Harmony Memory Considering Rate (HMCR)

Pitch Adjustment Rate (PAR)

Maximum Iterations (MI)

The clustering algorithm is as follows.

Input: n data items, k, HMS, HMCR, PAR, MI
Output: n data items partitioned into k clusters

1. Initialize the Harmony Memory with HMS random solutions.
2. Evaluate the fitness of all solutions in the Harmony Memory.
3. For i in 1 to MI, do
   (a) Improvise a new solution from the Harmony Memory using the PAR value.
   (b) Evaluate the fitness of the new solution.
   (c) If the new solution is better than the worst solution in the Harmony Memory, replace the worst solution with the new solution.
   (d) Calculate the cluster centroids for the new solution.
4. End For
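The listing above leaves the solution encoding, the fitness function and the improvisation rule open, so the following Python sketch fills them in with assumptions that should not be read as the implementation used in this study. A solution is assumed to be a list of n cluster labels, fitness is assumed to be the total distance of points to their own centroid (lower is better), HMCR is applied in the way standard Harmony Search applies it, and pitch adjustment is assumed to move a label to a neighbouring cluster number. All function names are hypothetical.

```python
import math
import random


def centroids_of(data, labels, k):
    """Mean of each cluster; None marks an empty cluster."""
    result = []
    for j in range(k):
        members = [p for p, lab in zip(data, labels) if lab == j]
        result.append(
            tuple(sum(col) / len(members) for col in zip(*members)) if members else None
        )
    return result


def fitness(data, labels, k):
    """Assumed fitness: total distance of points to their own centroid."""
    cents = centroids_of(data, labels, k)
    return sum(math.dist(p, cents[lab]) for p, lab in zip(data, labels))


def harmony_search_kmeans(data, k, hms, hmcr, par, mi, seed=None):
    """Returns (labels, centroids, fitness) of the best solution in memory."""
    rng = random.Random(seed)
    n = len(data)
    # Step 1: fill the Harmony Memory with HMS random label vectors.
    memory = [[rng.randrange(k) for _ in range(n)] for _ in range(hms)]
    # Step 2: evaluate the fitness of every stored solution.
    scores = [fitness(data, h, k) for h in memory]
    for _ in range(mi):  # Step 3
        # (a) Improvise a new solution one position at a time.
        new = []
        for i in range(n):
            if rng.random() < hmcr:
                label = rng.choice(memory)[i]  # memory consideration
                if rng.random() < par:
                    # Pitch adjustment (assumed form): neighbouring cluster number.
                    label = (label + rng.choice((-1, 1))) % k
            else:
                label = rng.randrange(k)  # random selection
            new.append(label)
        # (b) Evaluate the new solution; (d) its centroids are computed inside fitness().
        new_score = fitness(data, new, k)
        # (c) Replace the worst stored solution if the new one is better.
        worst = max(range(hms), key=scores.__getitem__)
        if new_score < scores[worst]:
            memory[worst], scores[worst] = new, new_score
    best = min(range(hms), key=scores.__getitem__)
    return memory[best], centroids_of(data, memory[best], k), scores[best]
```

Because the worst solution is only ever replaced by a better one, the best fitness in memory cannot get worse as MI grows, which is the property the sketch is meant to illustrate.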

III. Implementation

The aim was to compare the efficiency of the three versions of the K Means clustering algorithm. The data set used for benchmarking was obtained from the Gene Expression Omnibus (GEO): the gene expression data for pancreatic cancer from the medical research organisation Mayo Clinic. The value of k determines the quality of clustering to a very great extent, in terms of data insights. We have chosen the value of k as 6. All distances are measured as Euclidean distances, as two-dimensional spacing needs to be considered. The quality metrics used to analyse the clustering are intra-cluster distance, inter-cluster distance and the Silhouette index.
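For illustration, the three quality metrics could be computed as in the Python sketch below. The Silhouette index follows its usual definition (the mean over all points of (b - a) / max(a, b)). The text does not define the intra-cluster and inter-cluster distances, so the definitions used here (mean distance to the own centroid, and smallest distance between two centroids) are assumptions, as are all function names.

```python
import math


def cluster_means(data, labels):
    """Map each cluster label to the mean of its points."""
    groups = {}
    for p, lab in zip(data, labels):
        groups.setdefault(lab, []).append(p)
    return {
        lab: tuple(sum(col) / len(pts) for col in zip(*pts))
        for lab, pts in groups.items()
    }


def intra_cluster_distance(data, labels):
    """Assumed definition: mean Euclidean distance of a point to its own centroid."""
    cents = cluster_means(data, labels)
    return sum(math.dist(p, cents[lab]) for p, lab in zip(data, labels)) / len(data)


def inter_cluster_distance(data, labels):
    """Assumed definition: smallest Euclidean distance between two centroids."""
    cents = list(cluster_means(data, labels).values())
    return min(
        math.dist(a, b) for i, a in enumerate(cents) for b in cents[i + 1:]
    )


def silhouette_index(data, labels):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    groups = {}
    for p, lab in zip(data, labels):
        groups.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(data, labels):
        own = groups[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a coefficient of 0
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in pts) / len(pts)
            for other, pts in groups.items() if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(data)
```

A good clustering has a small intra-cluster distance, a large inter-cluster distance and a Silhouette index close to 1.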

IV. Results and Analysis

All three algorithms successfully output the clustered data set for the same input. The figures show the outputs of the clustering algorithms in graphical form. Each cluster is shown in a separate colour. The X axis shows the data point values, and the Y axis corresponds to the different clusters. Figure 1 shows the output of the K Means Algorithm, the Enhanced K Means Algorithm and the Harmony Search K Means Algorithm.

The clusters were analysed and the following observations were made. Figure 2 shows the quality metrics of the three algorithms that were implemented. The intra-cluster distance decreased from the K Means algorithm to the Enhanced algorithm and then to the Harmony Search algorithm. The Silhouette index, for which a higher value indicates better clustering, also favoured the Harmony Search algorithm. Thus, the accuracy of the algorithms was found to be in the order Harmony Search > Enhanced K Means > K Means Algorithm.

SHORT NAMES:BIOINFORMATICS TERM PAPER 3

Fig. 1. K Means, Enhanced K Means, Harmony Search K Means

Fig. 2. Quality Metrics of Clustering Algorithms

Our next analysis concerned the running time of the algorithms. Figure 3 shows the running time statistics obtained for k=6. K Means was observed to run in O(nkl) time for all the inputs, and the Enhanced K Means algorithm in O(n log n) time. The running time of Harmony Search increased as the size of the input increased, although the extra time taken is compensated by the better accuracy it provides. Thus, in terms of running time, the algorithms were found to be in the order Enhanced K Means > K Means Algorithm, where, as in the accuracy ordering, > marks the better (here, faster) algorithm. Harmony Search was not included in this comparison because its basic concept of iterations is different: its running time depends upon the number of iterations, which is input by the user.

Fig. 3. Running time of Clustering Algorithms (in s)

Figure 4 shows the running time plots.

Fig. 4. Running time Plot of the Clustering Algorithms

V. Conclusion

Three K Means clustering algorithms, the K Means Algorithm, the Enhanced K Means Algorithm and the Harmony Search K Means Algorithm, were implemented and analysed. The analysis showed the better accuracy of the Harmony Search algorithm (for larger values of MI) and the better time complexity of the Enhanced K Means Algorithm.

References

[1] K A Abdul Nazeer; M P Sebastian, Improving the Accuracy and Efficiency of the k-means Clustering Algorithm, Proceedings of the World Congress on Engineering 2009, Vol I, WCE 2009, July 1-3, 2009, London, U.K.

[2] Lekshmi P Chandran; K A Abdul Nazeer, An Improved Clustering Algorithm based on K-Means and Harmony Search Optimization.

[3] Fahim A.M; Salem A.M; Torkey A; Ramadan M.A, An Efficient enhanced k-means clustering algorithm, Journal of Zhejiang University, 10(7):1626-1633, 2006.
