K Means Clustering Algorithms: A Comparative Study


20 March 2013
Govind Maheswaran, Jayarajan P, Johnes Jose, Joshy Joseph
M120432CS, M120229CS, M120088CS, M120082CS
National Institute of Technology Calicut
Abstract: K Means clustering is a method of clustering which partitions a set of n observations into k clusters, such that each element belongs to the cluster with the nearest mean. Although the problem is computationally difficult (it is considered to be NP-hard), many improvements exist to the traditional algorithm proposed by Stuart Lloyd in 1957. The aim of this paper is to implement Lloyd's algorithm and two improved versions of it, and to carry out a comparative study of the three.
Index Terms: K Means, Clustering, Clustering Algorithms, K Means Clustering, Enhanced K Means, Harmony Search K Means Algorithm
I. Introduction
K Means clustering is a method of clustering which partitions a set of n observations into k clusters, such that each element belongs to the cluster with the nearest mean. It finds application across various fields, including computer vision, business analysis, astronomy and agriculture. Our current interest is its use in bioinformatics, for biological sequence analysis and genetic clustering. These processes can give valuable and often predictive insights regarding the state of a disease, the probability of occurrence of a disease, or the effect of a certain medication on the disease. Although very useful, the first K Means clustering algorithm, proposed by Stuart Lloyd in 1957, was not computationally optimal. This led to many improved versions of the K Means clustering algorithm, each building upon the previous version and improving it while keeping the computational complexity in check. The algorithms considered here are the following.
K Means Algorithm: The first algorithm, proposed by Stuart Lloyd in 1957. The algorithm suffers from high computational complexity. Also, the initial centroids are chosen at random, so the same algorithm can produce different results on different executions.
Enhanced K Means Algorithm: Proposed by K A Abdul Nazeer, S D Madhu Kumar and M P Sebastian, this algorithm improves the K Means algorithm using a heuristic approach. It uses an O(nlogn) heuristic method to calculate the initial centroids, which improves the accuracy of the results and reduces the computational complexity.
Harmony Search K Means Algorithm: Inspired by the way musicians come up with fresh music by improvising upon a basic motif, this algorithm was proposed by Kang Seok Lee and Z W Geem, based on a paper by Forsati et al.
The remainder of the paper is organized as follows. Section II presents the algorithms being considered. The implementation setup is described in Section III. Section IV presents our results and their analysis. Finally, we conclude in Section V.
II. The Algorithms
In this section, the algorithms are presented. We start with the K Means Algorithm, followed by the Enhanced K Means Algorithm and finally the Harmony Search K Means Algorithm.
A. K Means Clustering Algorithm
The K Means clustering algorithm works in two phases. In the first phase, k initial centroids are chosen at random. In the second phase, each point in the data set is allotted to the cluster whose centroid is nearest to it. At the end of this phase, the centroid of each cluster is recalculated, and phase 2 is repeated with the new centroids until the centroid values converge. The complexity of this algorithm is O(nkl), where n is the number of data items, k is the number of clusters and l is the number of iterations until convergence.
The clustering algorithm is as follows.
Input: n data items, k
Output: n data items partitioned into k clusters
1. Choose k points as initial centroids of the required clusters.
2. Assign each point to the cluster which has the closest centroid.
3. Calculate the new mean for each cluster.
4. Set it as the new centroid of the cluster.
5. Repeat steps 2 to 4 until the centroids no longer change, at which point all points are divided into k clusters.
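The five steps above can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the implementation used in the study. The function names, the `max_iter` safety cap, the `seed` parameter and the rule that an empty cluster keeps its previous centroid are all assumptions of this sketch.

```python
import math
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def k_means(data, k, max_iter=100, seed=None):
    """Lloyd's algorithm on a list of tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)                      # step 1: random start
    labels = []
    for _ in range(max_iter):
        labels = [nearest(p, centroids) for p in data]   # step 2: assign
        new_centroids = []
        for j in range(k):                               # steps 3 and 4: update
            members = [p for p, c in zip(data, labels) if c == j]
            # Assumption: an empty cluster keeps its old centroid.
            new_centroids.append(mean(members) if members else centroids[j])
        if new_centroids == centroids:                   # step 5: converged
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `k_means(points, 2, seed=0)` on a list of 2-D tuples returns one cluster label per point together with the final centroids. Because step 1 is random, different seeds may give different partitions, which is the drawback noted in the Introduction.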
B. Enhanced K Means Clustering Algorithm
The Enhanced K Means clustering algorithm is an improved version of the K Means Algorithm. It improves both phases of the algorithm. The initial centroid selection phase is improved using sorting: the data is sorted on the column with the maximum range, then partitioned into k equal parts, and the mean of each part is taken as an initial centroid. In the second phase, the distance between each data point and each chosen centroid is calculated, and the point is assigned to the cluster whose centroid is nearest.
The clustering algorithm is as follows.
Input: n data items, k
Output: n data items partitioned into k clusters
Phase 1
1. Find the column with the maximum range.
2. Sort the data based upon this column.
3. Partition the sorted data into k equal partitions.
4. The mean of each partition forms an initial centroid.
Phase 2
1. Compute the distance of each data-point di (1 <= i <= n) to all the centroids cj (1 <= j <= k) as d(di, cj);
2. For each data-point di, find the closest centroid cj and assign di to cluster j.
3. Set ClusterId[i] = j; // j: Id of the closest cluster
4. Set NearestDist[i] = d(di, cj);
5. For each cluster j (1 <= j <= k), recalculate the centroids;
6. Repeat
   (a) For each data-point di
       i. Compute its distance from the centroid of the present nearest cluster;
       ii. If this distance is less than or equal to the present nearest distance, the data-point stays in the cluster;
       iii. Else
            A. For every centroid cj (1 <= j <= k)
            B. Compute the distance d(di, cj)
            C. Endfor;
            D. Assign the data-point di to the cluster with the nearest centroid cj
            E. Set ClusterId[i] = j
            F. Set NearestDist[i] = d(di, cj)
   (b) Endfor;
   (c) For each cluster j (1 <= j <= k), recalculate the centroids
7. Until the convergence criterion is met.
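Phase 1 above can be sketched in a few lines of Python. This is an illustrative reading of the four steps, not the authors' code. The function name `initial_centroids` and the use of integer division to place the partition boundaries are assumptions, and the sketch assumes n >= k so that no partition is empty.

```python
def initial_centroids(data, k):
    """Phase 1: sort on the widest column, cut into k parts, average each part."""
    n, dims = len(data), len(data[0])
    # Step 1: find the column whose values span the largest range.
    widest = max(range(dims),
                 key=lambda c: max(p[c] for p in data) - min(p[c] for p in data))
    # Step 2: sort the rows on that column (the O(n log n) part).
    ordered = sorted(data, key=lambda p: p[widest])
    # Step 3: k contiguous partitions of (nearly) equal size.
    bounds = [i * n // k for i in range(k + 1)]
    parts = [ordered[bounds[i]:bounds[i + 1]] for i in range(k)]
    # Step 4: the mean of each partition is an initial centroid.
    return [tuple(sum(col) / len(part) for col in zip(*part)) for part in parts]
```

For example, `initial_centroids([(1, 0), (9, 1), (2, 0), (8, 1)], 2)` sorts on the first column, whose range is 8, and returns `[(1.5, 0.0), (8.5, 1.0)]`. Unlike a random start, this initialization is deterministic for a given data set.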
The first phase of the algorithm runs in O(nlogn) time. In phase 2, some data elements remain in the same cluster, whereas others switch clusters. A point that remains in its cluster costs O(1), whereas a point that changes cluster costs O(k). Since k is very small compared to n, the complexity of phase 2 can safely be assumed to be O(n). Thus, the overall complexity is O(nlogn).
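The O(1) and O(k) cases of phase 2 can be seen in the following Python sketch of the loop. It is a hedged illustration rather than the implementation used in the study. The name `enhanced_phase2`, the `max_iter` cap, the choice of "no point changed cluster" as the convergence criterion, and the rule that an empty cluster keeps its centroid are assumptions of this sketch.

```python
import math


def enhanced_phase2(data, centroids, max_iter=100):
    """Phase 2 of the enhanced algorithm; returns (cluster_id, centroids)."""
    k = len(centroids)
    centroids = list(centroids)

    def closest(p):
        # Full scan over all k centroids: the O(k) case.
        j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        return j, math.dist(p, centroids[j])

    def recompute():
        for j in range(k):
            members = [p for p, c in zip(data, cluster_id) if c == j]
            if members:  # assumption: an empty cluster keeps its centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))

    # Steps 1 to 5: initial assignment and first centroid update.
    pairs = [closest(p) for p in data]
    cluster_id = [j for j, _ in pairs]
    nearest_dist = [d for _, d in pairs]
    recompute()

    for _ in range(max_iter):                       # step 6: repeat
        moved = False
        for i, p in enumerate(data):
            d = math.dist(p, centroids[cluster_id[i]])
            if d <= nearest_dist[i]:
                continue                            # O(1) case: the point stays
            j, d_new = closest(p)                   # O(k) case: rescan centroids
            moved = moved or j != cluster_id[i]
            cluster_id[i], nearest_dist[i] = j, d_new
        recompute()
        if not moved:                               # step 7: assumed criterion
            break
    return cluster_id, centroids
```

The saving comes from the early `continue`: a point whose distance to its own centroid has not grown is kept without comparing it against the other k - 1 centroids.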
C. Harmony Search K Means Clustering Algorithm
This algorithm was inspired by the improvisation process of musicians. In the Harmony Search algorithm, each musician (decision variable) plays (generates) a note (a value), and together they search for the best harmony (global optimum). The algorithm starts with a random set of clusterings and improves it iteratively to get an overall improved result. The following parameters are used by the algorithm.

Harmony Memory Size (HMS)
Harmony Memory Considering Rate (HMCR)
Pitch Adjustment Rate (PAR)
Maximum Iterations (MI)
The clustering algorithm is as follows.
Input: n data items, k, HMS, HMCR, PAR, MI
Output: n data items partitioned into k clusters
1. Initialize the Harmony memory with HMS random solutions.
2. Evaluate the fitness of all solutions in the Harmony memory.
3. For i in 1 to MI, do
   (a) Improvise a new solution from the Harmony memory using the PAR value.
   (b) Evaluate the fitness of the new solution.
   (c) If the new solution is better than the worst solution in the Harmony memory, replace the worst solution with the new solution.
   (d) Calculate the cluster centroids for the new solution.
4. End For
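The listing leaves several details open, so the following Python sketch fills them with explicit assumptions and should not be read as the implementation used in the study. A solution is assumed to be a list holding one cluster label per data item. Fitness is assumed to be the total distance of the points to their own centroids, with lower values being better. Memory consideration is assumed to copy a label from a randomly chosen stored solution with probability HMCR, and pitch adjustment is assumed to move the point, with probability PAR, to the nearest centroid of that stored solution. All function names are hypothetical.

```python
import math
import random


def centroids_of(data, labels, k):
    """Mean of each cluster; an empty cluster yields None."""
    result = []
    for j in range(k):
        members = [p for p, c in zip(data, labels) if c == j]
        result.append(tuple(sum(col) / len(members) for col in zip(*members))
                      if members else None)
    return result


def fitness(data, labels, cents):
    """Assumed fitness: total distance to own centroid (lower is better)."""
    return sum(math.dist(p, cents[c]) for p, c in zip(data, labels))


def harmony_search_kmeans(data, k, hms, hmcr, par, mi, seed=None):
    rng = random.Random(seed)
    n = len(data)

    def entry(labels):
        cents = centroids_of(data, labels, k)
        return (fitness(data, labels, cents), labels, cents)

    # Steps 1 and 2: HMS random solutions, each stored with its fitness.
    memory = [entry([rng.randrange(k) for _ in range(n)]) for _ in range(hms)]

    for _ in range(mi):                                    # step 3
        new = []
        for i in range(n):                                 # (a) improvise
            if rng.random() < hmcr:
                _, labels, cents = rng.choice(memory)      # memory consideration
                label = labels[i]
                if rng.random() < par:                     # pitch adjustment
                    live = [j for j in range(k) if cents[j] is not None]
                    label = min(live, key=lambda j: math.dist(data[i], cents[j]))
            else:
                label = rng.randrange(k)                   # random label
            new.append(label)
        candidate = entry(new)                             # (b) and (d)
        worst = max(range(hms), key=lambda h: memory[h][0])
        if candidate[0] < memory[worst][0]:                # (c) replace the worst
            memory[worst] = candidate
    best = min(memory, key=lambda e: e[0])
    return best[1], best[2]
```

The result depends on MI: with few iterations the memory is still close to its random starting point, which matches the later observation that the running time is governed by the number of iterations supplied by the user.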
III. Implementation
The aim was to compare the efficiency of the three versions of the K Means clustering algorithm. The data set used for benchmarking was obtained from the Gene Expression Omnibus (GEO): gene expression data for pancreatic cancer from the medical research organisation Mayo Clinic. The value of k determines the quality of the clustering to a very great extent, in terms of the insights the data yields. We have chosen the value of k as 6. All distances are measured as Euclidean distances, as two-dimensional spacing needs to be considered. The quality metrics used to analyze the clustering are intra-cluster distance, inter-cluster distance and the Silhouette index.
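The paper does not give formulas for the three metrics, so the Python sketch below uses common definitions and labels them as assumptions. Intra-cluster distance is taken as the mean distance of each point to its own centroid, inter-cluster distance as the mean pairwise distance between centroids, and the Silhouette index as the usual mean of (b - a) / max(a, b) over all points, where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The sketch assumes at least two clusters and that the points are not all identical.

```python
import math
from itertools import combinations


def intra_cluster_distance(data, labels, centroids):
    """Mean distance from each point to the centroid of its cluster."""
    return sum(math.dist(p, centroids[c]) for p, c in zip(data, labels)) / len(data)


def inter_cluster_distance(centroids):
    """Mean pairwise distance between cluster centroids."""
    pairs = list(combinations(centroids, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def silhouette_index(data, labels):
    """Mean silhouette value over all points (range -1 to 1, higher is better)."""
    clusters = {}
    for p, c in zip(data, labels):
        clusters.setdefault(c, []).append(p)
    total = 0.0
    for p, c in zip(data, labels):
        own = clusters[c]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # The point itself adds 0 to the sum, hence the division by len(own) - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for label, other in clusters.items() if label != c)
        total += (b - a) / max(a, b)
    return total / len(data)
```

A good clustering has a small intra-cluster distance, a large inter-cluster distance and a Silhouette index close to 1.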
IV. Results and Analysis
All three algorithms successfully output the clustered data set for the same input. The figures show the outputs of the clustering algorithms in graphical format. Each cluster is shown in a separate colour. The X axis shows the data point values, and the Y axis corresponds to the different clusters. Figure 1 shows the output of the K Means Algorithm, the Enhanced K Means Algorithm and the Harmony Search K Means Algorithm.
The clusters were analysed and the following observations were made. Figure 2 shows the quality metrics of the three algorithms that were implemented. The intra-cluster distance decreased from the K Means algorithm to the Enhanced algorithm and then to the Harmony Search algorithm. The Silhouette index, which increases with the quality of the clustering, also favoured the Harmony Search algorithm. Thus, the accuracy of the algorithms was found to be in the order Harmony Search > Enhanced K Means > K Means Algorithm.
Fig. 1. K Means, Enhanced K Means, Harmony Search K Means
Fig. 2. Quality Metrics of Clustering Algorithms
Our next analysis concerned the running time of the algorithms. Figure 3 shows the running time statistics obtained for k=6. It was observed that K Means runs in O(nkl) time for all the inputs, and that the Enhanced K Means algorithm runs in O(nlogn) time. The running time of Harmony Search increased as the size of the input increased, although the extra time taken is compensated by the better accuracy it provides. Thus, in terms of running-time performance, the algorithms were found to be in the order Enhanced K Means > K Means Algorithm, that is, Enhanced K Means was the faster of the two. Harmony Search was not included in this comparison because its basic concept of iterations is different: its running time depends upon the number of iterations, which is input by the user.
Fig. 3. Running time of Clustering Algorithms (in s)
Figure 4 shows the running time plots.
Fig. 4. Running time Plot of the Clustering Algorithms
V. Conclusion
Three K Means clustering algorithms, namely the K Means Algorithm, the Enhanced K Means Algorithm and the Harmony Search K Means Algorithm, were implemented and analysed. The analysis showed the better accuracy of the Harmony Search algorithm (for larger values of MI) and the better time complexity of the Enhanced K Means Algorithm.
References
[1] K A Abdul Nazeer; M P Sebastian, Improving the Accuracy and Efficiency of the k-means Clustering Algorithm, Proceedings of the World Congress on Engineering 2009, Vol I, WCE 2009, July 1-3, 2009, London, U.K.
[2] Lekshmi P Chandran; K A Abdul Nazeer, An Improved Clustering Algorithm based on K-Means and Harmony Search Optimization
[3] Fahim A.M; Salem A.M; Torkey A; Ramadan M.A, An Efficient enhanced k-means clustering algorithm, Journal of Zhejiang University, 10(7):1626-1633, 2006.