20,MARCH 2013 1
K Means Clustering Algorithms: A Comparative Study
Govind Maheswaran, Jayarajan P, Johnes Jose, Joshy Joseph
M120432CS, M120229CS, M120088CS, M120082CS
National Institute of Technology Calicut
Abstract: K Means clustering is a method of clustering
which partitions a set of n observations into k clusters, such
that each element belongs to the cluster with the nearest
mean. Although this is a computationally difficult problem
(considered to be NP-hard), there exist many improvements
to the traditional algorithm proposed by Stuart Lloyd in
1957. The aim of this paper is to implement Lloyd's
algorithm and two improved versions of it, and to carry out
a comparison study among the three.
Index Terms: K Means, Clustering, Clustering Algorithms,
K Means Clustering, Enhanced K Means, Harmony
Search K Means Algorithm
I. Introduction
K Means clustering is a method of clustering which partitions
a set of n observations into k clusters, such that each
element belongs to the cluster with the nearest mean. As a
result, K Means clustering finds application across various
fields, including Computer Vision, Business Analysis,
Astronomy and Agriculture. Our interest here is its
application in Bioinformatics, where it can be used
effectively for biological sequence analysis and genetic
clustering. These processes can give valuable and often
predictive insights regarding the state of a disease, the
probability of the occurrence of a disease, or the effect of
a certain medication on the disease. Although very useful,
the first K Means clustering algorithm, proposed by Stuart
Lloyd in 1957, was not computationally optimal. This led to
many improved versions of the K Means clustering algorithm,
each building upon the previous version, improving it and
keeping the computational complexity in check. The
algorithms considered here are:
K Means Algorithm:
The first algorithm, proposed by Stuart Lloyd in 1957.
The algorithm suffers from high computational complexity.
Also, the initial choice of centroids is random, so
different executions of the same algorithm can produce
different results.
Enhanced K Means Algorithm:
Proposed by K A Abdul Nazeer, S D Madhu Kumar and
M P Sebastian, this algorithm improves the K Means
algorithm using a heuristic approach. An O(n log n)
heuristic method calculates the initial centroids, which
improves the accuracy of the results and reduces the
computational complexity.
Harmony Search K Means Algorithm:
Inspired by the way musicians come up with fresh music by
improvising upon a basic motif, this algorithm was
proposed by Kaing Seok Lee and Z N Green, based
on a paper by Forsati et al.
The remainder of the paper is organized as follows. Section II
presents the algorithms being considered. The implementation
setup is described in Section III. Section IV
presents our results and their analysis. Finally, we conclude
in Section V.
II. The Algorithms
In this section, the algorithms are presented. We start
with the K Means Algorithm, followed by the Enhanced K
Means Algorithm and finally the Harmony Search K Means
Algorithm.
A. K Means Clustering Algorithm
The K Means clustering algorithm works in two phases. In
the first phase, k initial centroids are chosen at random.
In the second phase, each point in the data set is allotted
to the cluster whose centroid is nearest to it. At the end
of this phase, the centroid of each cluster is recalculated
and, using the new centroids, phase 2 is repeated until the
centroid values converge, that is, until they no longer
change between iterations. The complexity of this algorithm
is O(nkl), where n is the number of data items, k
is the number of clusters and l is the number of iterations
until convergence.
The clustering algorithm is as follows.
Input: n data items, k
Output: n data items partitioned into k clusters
1. Choose k points as initial centroids of the required clusters.
2. Assign each point to the cluster which has the closest centroid.
3. Calculate the new mean for each cluster.
4. Set it as the new centroid of the cluster.
5. Repeat steps 2 to 4 until the centroids stop changing, at which point all points are divided into k clusters.
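The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation benchmarked in this paper: the function name `kmeans`, the optional `init`, `max_iter` and `seed` parameters, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=None):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: use the given centroids, or pick k data points at random.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Steps 3 and 4: the mean of each cluster becomes its new centroid.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids
```

Each pass of the loop compares all n points with all k centroids, which is where the O(nkl) cost over l iterations comes from. Passing `init` makes a run repeatable, whereas the default random choice can give different clusterings on different executions.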
B. Enhanced K Means Clustering Algorithm
The Enhanced K Means clustering algorithm is an improved
version of the K Means algorithm. It improves both of the
individual phases of the algorithm. The initial centroid
selection phase is improved by sorting: the data is sorted
on the column with the maximum range. This data is then partitioned
into k equal parts, and the mean of each part is taken as an
initial centroid. In the second phase, the distance between
each data point and each chosen centroid is calculated, and
the point is assigned to the cluster whose centroid is
nearest.
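The initial centroid selection described above might be sketched as follows. The function name `initial_centroids` is hypothetical, and the sketch assumes n >= k and that "equal parts" may differ in size by one when k does not divide n.

```python
def initial_centroids(points, k):
    """Sort on the widest column, split into k parts, average each part."""
    dims = range(len(points[0]))

    def value_range(c):
        column = [p[c] for p in points]
        return max(column) - min(column)

    # Column whose values span the largest range (max minus min).
    widest = max(dims, key=value_range)
    # The sort is the O(n log n) step of this phase.
    ordered = sorted(points, key=lambda p: p[widest])
    n = len(ordered)
    centroids = []
    for j in range(k):
        # Integer bounds spread any remainder, so part sizes differ by at most one.
        part = ordered[j * n // k:(j + 1) * n // k]
        centroids.append(tuple(sum(col) / len(part) for col in zip(*part)))
    return centroids
```

Because nothing here is random, the same data always yields the same starting centroids, which removes the run-to-run variation of the random initial choice.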
The clustering algorithm is as follows.
Input: n data items, k
Output: n data items partitioned into k clusters
Phase 1
1. Find the column with the maximum range.
2. Sort the data on this column.
3. Partition the sorted data into k equal partitions.
4. The mean of each partition forms an initial centroid.
Phase 2
1. Compute the distance of each data point d_i (1 <= i <= n) to all the centroids c_j (1 <= j <= k) as d(d_i, c_j);
2. For each data point d_i, find the closest centroid c_j and assign d_i to cluster j.
3. Set ClusterId[i] = j; // j: Id of the closest cluster
4. Set NearestDist[i] = d(d_i, c_j);
5. For each cluster j (1 <= j <= k), recalculate the centroids;
6. Repeat
   (a) For each data point d_i
       i. Compute its distance from the centroid of the present nearest cluster;
       ii. If this distance is less than or equal to the present nearest distance, the data point stays in the cluster;
       iii. Else
            A. For every centroid c_j (1 <= j <= k)
            B. Compute the distance d(d_i, c_j)
            C. Endfor;
            D. Assign the data point d_i to the cluster with the nearest centroid c_j
            E. Set ClusterId[i] = j
            F. Set NearestDist[i] = d(d_i, c_j)
   (b) Endfor;
   (c) For each cluster j (1 <= j <= k), recalculate the centroids
7. Until the convergence criterion is met.
The first phase of the algorithm runs in O(n log n) time,
dominated by the sort. In phase 2, some data points remain in
the same cluster, whereas others switch clusters. A point that
remains in its cluster costs O(1), whereas a point that changes
cluster costs O(k). One pass of phase 2 therefore costs between
O(n) and O(nk) and, since k is very small compared to n, can
safely be taken as O(n). Thus, provided the number of passes
stays small, the overall complexity is O(n log n), set by phase 1.
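Phase 2 and its O(1) and O(k) cases can be illustrated with the sketch below. The names `enhanced_phase2` and `centroid_of` are hypothetical, and since the pseudocode leaves the convergence criterion open, the sketch assumes it means "no point changed cluster in a full pass" (with `max_iter` as a safety cap).

```python
import math


def centroid_of(members):
    return tuple(sum(col) / len(members) for col in zip(*members))


def enhanced_phase2(points, centroids, max_iter=100):
    """Phase 2 sketch: re-scan all centroids only when a point drifts away."""
    k = len(centroids)
    centroids = list(centroids)

    def nearest(p):
        # Full O(k) scan over all centroids.
        j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        return j, math.dist(p, centroids[j])

    def recompute():
        for j in range(k):
            members = [p for p, cid in zip(points, cluster_id) if cid == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = centroid_of(members)

    # Steps 1 to 5: initial assignment, stored distances, first centroid update.
    cluster_id, nearest_dist = [], []
    for p in points:
        j, d = nearest(p)
        cluster_id.append(j)
        nearest_dist.append(d)
    recompute()

    for _ in range(max_iter):  # Step 6
        moved = False
        for i, p in enumerate(points):
            d = math.dist(p, centroids[cluster_id[i]])
            if d <= nearest_dist[i]:
                continue  # O(1) case: the point stays in its cluster
            j, dj = nearest(p)  # O(k) case: compare against every centroid
            moved = moved or j != cluster_id[i]
            cluster_id[i], nearest_dist[i] = j, dj
        recompute()
        if not moved:  # Step 7: assumed convergence criterion
            break
    return cluster_id, centroids
```

As in the pseudocode, the stored nearest distance is updated only when a point is re-scanned, not when it stays in place.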
C. Harmony Search K Means Clustering Algorithm
This algorithm was inspired by the improvisation process
of musicians. In the Harmony Search algorithm, each musician
(decision variable) plays (generates) a note (a value)
in order to find the best harmony (global optimum) all together.
The algorithm starts with a random set of clusterings and
improves it iteratively so as to get an overall improved
result. The following parameters are used by the algorithm.
Harmony Memory Size (HMS)
Harmony Memory Considering Rate (HMCR)
Pitch Adjustment Rate (PAR)
Maximum Iterations (MI)
The clustering algorithm is as follows.
Input: n data items, k, HMS, HMCR, PAR, MI
Output: n data items partitioned into k clusters
1. Initialize the Harmony Memory with HMS random solutions.
2. Evaluate the fitness of all solutions in the Harmony Memory.
3. For i in 1 to MI, do
   (a) Improvise a new solution from the Harmony Memory using the HMCR and PAR values.
   (b) Evaluate the fitness of the new solution.
   (c) If the new solution is better than the worst solution in the Harmony Memory, replace that worst solution with the new solution.
   (d) Calculate the cluster centroids for the new solution.
4. End For
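A minimal sketch of the loop above is given below. Several details are assumptions made only for illustration, because the steps do not fix them: a solution is encoded as one cluster label per data point, fitness is the total distance of points to their cluster centroid (lower is better), and pitch adjustment replaces a label with a uniformly random one. The default parameter values are likewise arbitrary.

```python
import math
import random


def centroids_of(points, labels, k):
    """Mean of each cluster; an empty cluster yields None."""
    cents = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        mean = tuple(sum(c) / len(members) for c in zip(*members)) if members else None
        cents.append(mean)
    return cents


def fitness(points, labels, k):
    """Assumed fitness: total distance of points to their own centroid."""
    cents = centroids_of(points, labels, k)
    return sum(math.dist(p, cents[lab]) for p, lab in zip(points, labels))


def harmony_search_kmeans(points, k, hms=10, hmcr=0.9, par=0.3, mi=2000, seed=None):
    rng = random.Random(seed)
    n = len(points)
    # Step 1: HMS random solutions, each giving every point a cluster label.
    memory = [[rng.randrange(k) for _ in range(n)] for _ in range(hms)]
    # Step 2: evaluate the fitness of every stored solution.
    scores = [fitness(points, h, k) for h in memory]
    for _ in range(mi):  # Step 3
        new = []
        for i in range(n):  # (a) improvise one label per data point
            if rng.random() < hmcr:
                label = rng.choice(memory)[i]  # memory consideration
                if rng.random() < par:
                    label = rng.randrange(k)  # pitch adjustment (assumed form)
            else:
                label = rng.randrange(k)  # random selection
            new.append(label)
        score = fitness(points, new, k)  # (b); centroids (d) are computed here
        worst = max(range(hms), key=scores.__getitem__)
        if score < scores[worst]:  # (c) replace the worst stored solution
            memory[worst], scores[worst] = new, score
    best = min(range(hms), key=scores.__getitem__)
    return memory[best], centroids_of(points, memory[best], k)
```

The running time of this sketch grows linearly with MI, since each iteration evaluates exactly one new solution.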
III. Implementation
The aim was to compare the efficiency of the three versions
of the K Means clustering algorithm. The data set used
for benchmarking was obtained from the Gene Expression
Omnibus (GEO): gene expression data for pancreatic
cancer from the medical research organisation Mayo
Clinic. The value of k determines the quality of clustering
to a very great extent, in terms of data insights. We chose
k = 6. All distances are measured as Euclidean distances,
as two dimensional spacing needs to be considered. The
quality metrics used to analyze the clustering are intra
cluster distance, inter cluster distance and Silhouette index.
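The three quality metrics could be computed along the following lines. The exact definitions used for the experiments are not stated in this paper, so the sketch assumes common ones: intra cluster distance as the mean distance from a point to its own centroid, inter cluster distance as the mean distance between pairs of centroids, and the Silhouette index as the mean of (b - a) / max(a, b) over all points. All function names are hypothetical, and at least two non-empty clusters are assumed.

```python
import math
from itertools import combinations


def centroid(members):
    return tuple(sum(c) / len(members) for c in zip(*members))


def group(points, labels):
    """Map each cluster label to the list of its points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def intra_cluster_distance(points, labels):
    """Mean Euclidean distance from each point to its own cluster centroid."""
    cents = {lab: centroid(m) for lab, m in group(points, labels).items()}
    total = sum(math.dist(p, cents[lab]) for p, lab in zip(points, labels))
    return total / len(points)


def inter_cluster_distance(points, labels):
    """Mean Euclidean distance between every pair of cluster centroids."""
    cents = [centroid(m) for m in group(points, labels).values()]
    pairs = list(combinations(cents, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def silhouette_index(points, labels):
    """Mean silhouette value; closer to 1 means tighter, better separated clusters."""
    clusters = group(points, labels)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)
```

A good clustering has a low intra cluster distance, a high inter cluster distance and a Silhouette index close to 1.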
IV. Results and Analysis
The clustered data set was successfully output by all
three algorithms for the same input. The figures show the
outputs of the clustering algorithms in graphical form.
Each cluster is shown in a separate colour. The X axis
shows the data point values, and the Y axis corresponds
to the different clusters. Figure 1 shows the output of the
K Means Algorithm, the Enhanced K Means Algorithm and the
Harmony Search K Means Algorithm.
The clusters were analysed and the following observations
were made. Figure 2 below shows the quality metrics of the
three algorithms that were implemented. The intra cluster
distance decreased from the K Means algorithm to the
Enhanced algorithm and then to the Harmony Search
algorithm. The Silhouette index, for which a higher value
indicates better clustering, also favoured the Harmony
Search algorithm. Thus, the accuracy of the algorithms was
found to be in the order Harmony Search > Enhanced K Means
> K Means Algorithm.
SHORT NAMES:BIOINFORMATICS TERM PAPER 3
Fig. 1. K Means, Enhanced K Means, Harmony Search K Means
Fig. 2. Quality Metrics of Clustering Algorithms
Our next analysis concerned the running time of the
algorithms. Figure 3 shows the running time statistics
obtained for k = 6. The running time of K Means was observed
to be consistent with O(nkl) for all the inputs, and that of
the Enhanced K Means algorithm with O(n log n). The
running time of Harmony Search increased as the size of the
input increased, although the extra time taken is compensated
by the better accuracy it provides. Thus, in terms of running
time, the algorithms were found to rank in the order
Enhanced K Means > K Means Algorithm, with Enhanced K Means
the faster of the two. Harmony Search was not included in
this comparison because its basic concept of iterations is
different: its running time depends upon the number of
iterations, which is input by the user.
Fig. 3. Running time of Clustering Algorithms (in s)
Figure 4 shows the running time plots.
Fig. 4. Running time Plot of the Clustering Algorithms
V. Conclusion
Three K Means clustering algorithms, the K Means Algorithm,
the Enhanced K Means Algorithm and the Harmony Search
K Means Algorithm, were implemented and analysed. The
analysis showed the better accuracy of the Harmony Search
algorithm (for larger values of MI) and the better time
complexity of the Enhanced K Means Algorithm.
References
[1] K A Abdul Nazeer; M P Sebastian, Improving the Accuracy and Efficiency of the k-means Clustering Algorithm, Proceedings of the World Congress on Engineering 2009, Vol I, WCE 2009, July 1-3, 2009, London, U.K.
[2] Lekshmi P Chandran; K A Abdul Nazeer, An Improved Clustering Algorithm based on K-Means and Harmony Search Optimization
[3] Fahim A.M; Salem A.M; Torkey A; Ramadan M.A, An Efficient enhanced k-means clustering algorithm, Journal of Zhejiang University, 10(7):1626-1633, 2006.