Determining Number & Initial Seeds of K-Means Clustering Using GA

Muhammed U. Mahdi
Babylon University, College of Science for Women, Computer Department
Abstract
Cluster analysis has been widely used in several disciplines, such as statistics, software engineering, biology, psychology and other social sciences, in order to identify natural groups in large amounts of data. Clustering has also been widely adopted by researchers within computer science and especially the database community. K-means is the most famous clustering algorithm, but it suffers from a drawback: the number and initial seeds of the clusters expected in the dataset must be determined in advance. In this paper we introduce a technique for automatic detection of cluster seeds and cluster number using a genetic algorithm (GA).
Keywords: K-means, Genetic Algorithm, Clustering
1. Introduction
Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification: it models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and were successfully applied to real-life data mining problems [Jiawei Han and Michelle Kamber, 2001].
2. Related works
General references regarding clustering include [Ghosh 2002]. There is a close relationship between clustering techniques and many other disciplines. Clustering has always been used in statistics [Arabie & Hubert 1996] and science [Massart & Kaufman 1983]. Typical applications include speech and character recognition. Machine learning clustering algorithms were applied to image segmentation and computer vision [Jain & Flynn 1996]. For statistical approaches to pattern recognition see [Fukunaga 1990]. Clustering can be viewed as a density estimation problem; this is the subject of traditional multivariate statistical estimation [Scott, 1992].
Clustering is also widely used for data compression in image processing, where it is known as vector quantization [Gersho & Gray 1992]. Data fitting in numerical analysis provides still another venue in data modeling [Daniel & Wood 1980].
3. K-means and K-median
The K-means algorithm, probably the first clustering algorithm proposed, is based on a very simple idea: given a set of initial clusters, assign each point to one of them, and then replace each cluster center by the mean point of the respective cluster. These two simple steps are repeated until convergence. A point is assigned to the cluster whose center is closest to it in Euclidean distance.
Although K-means has the great advantage of being easy to implement, it has two big drawbacks. First, it can be really slow, since in each step the distance between each point and each cluster center has to be calculated, which can be really expensive in the presence of a large dataset. Second, this method is really sensitive to the provided initial clusters; however, in recent years this problem has been addressed with some degree of success.
If instead of the Euclidean distance the 1-norm distance is used,

d1(x, y) = Σi |xi − yi| ……………..(1)

a variation of K-means is obtained, called K-median. The authors claim that this variation is less sensitive to outliers than traditional K-means due to the characteristics of the 1-norm.
The algorithm is as follows:
1. Select k objects as initial centers;
2. Assign each data object to the closest center;
3. Recalculate the centers of each cluster;
4. Repeat steps 2 and 3 until the centers do not change.
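As a minimal sketch, the four steps above can be written in Python (the dataset, k, and the iteration cap here are illustrative placeholders, not part of the paper):

```python
import random

def kmeans(points, k, max_iter=100):
    """Minimal K-means sketch; points is a list of coordinate tuples."""
    # Step 1: select k objects as initial centers.
    centers = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each data object to the closest center
        # (squared Euclidean distance gives the same assignment).
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # Step 3: recalculate the center (mean point) of each cluster.
        new_centers = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: repeat steps 2 and 3 until the centers do not change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Replacing the squared Euclidean distance with the 1-norm and the mean with the component-wise median would give the K-median variant described above.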
The main weak points of K-means are that the number of clusters may not be known in advance, and that the random initialization of the cluster seeds affects the result of the clustering [Jiawei Han and Michelle Kamber, 2001].
4. Suggested Solution to the K-Means Weak Points
To determine the number of clusters K and the optimal cluster center seeds with which to initialize K-means, a genetic algorithm (GA) based clustering technique can automatically evolve the appropriate number of clusters for a data set. The chromosome encodes the centers of the clusters, whose number may vary. Modified versions of the crossover and mutation operators are used.
4.1 Genetic Algorithm
Genetic Algorithms (GA) belong to a class of search techniques that mimic the principles of natural selection to develop solutions of large optimization problems. GAs operate by maintaining and manipulating a population of potential solutions called chromosomes [Sivanandam & Deepa, 2008]. Each chromosome has an associated fitness value, which is a qualitative measure of the goodness of the solution encoded in it. This fitness value is used to guide the stochastic selection of chromosomes, which are then used to generate new candidate solutions through crossover and mutation. Crossover generates new chromosomes by combining sections of two or more selected parents. Mutation acts by randomly selecting genes which are then altered, thereby preventing suboptimal solutions from persisting and increasing diversity in the population. The process of selection, crossover and mutation continues for a fixed number of generations or until a termination condition is satisfied. GAs have applications in fields as diverse as VLSI design, pattern recognition, image processing, neural networks, machine learning, etc. [Mitra, Sushmita, 2003].
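The selection-crossover-mutation cycle described above can be sketched as a generic loop (the operators, probabilities, and elitism placement below are illustrative assumptions, not the paper's exact procedure):

```python
import random

def run_ga(pop, fitness, crossover, mutate, max_gen=50, p_c=0.8, p_m=0.1):
    """Generic GA: proportional selection, crossover, mutation, elitism."""
    for _ in range(max_gen):
        best = max(pop, key=fitness)  # remember the elite chromosome
        # Fitness-proportional (roulette-wheel) selection of parents.
        weights = [fitness(ch) for ch in pop]
        parents = random.choices(pop, weights=weights, k=len(pop))
        # Crossover combines sections of two selected parents.
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            if random.random() < p_c:
                a, b = crossover(a, b)
            children += [a, b]
        # Mutation alters randomly selected genes, preserving diversity.
        pop = [mutate(ch) if random.random() < p_m else ch for ch in children]
        pop[0] = best  # elitism: the best solution found so far is never lost
    return max(pop, key=fitness)
```

With elitism the best fitness never decreases from one generation to the next, which is what terminating "after a fixed number of generations" relies on.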
The GA used to find the optimal cluster seeds and their number follows the flow chart of the algorithm shown in figure (1).
Fig. (1) Adapted GA for cluster seeds detection:
Start
  Set Kmin and Kmax to the minimum and maximum expected number of clusters, respectively
  Set MaxGen to the maximum number of iterations allowed; Gen = 1
  Population initialization:
    For each chromosome in the population:
      Generate a number K in the range Kmin to Kmax
      Choose Ki points (rows) randomly from the dataset
      Distribute these points randomly in the chromosome
      Set unfilled positions to null
  Fitness calculation:
    For each chromosome in the population:
      Extract the Ki centers stored in it
      Perform clustering by assigning each point to the cluster corresponding to the closest center
      Compute the DB index (DBi) by Eqs. (2)-(5)
      Compute fitness as 1/DBi
  Genetic operations:
    Selection
    Single-point crossover with probability µc
    Mutation performed with probability µm:
      Randomly choose one position of the chromosome;
      if this position is null, randomly choose a point from the data and make it a center,
      else make this position null
    Elitism
  If Gen < MaxGen: Gen = Gen + 1 and repeat from the fitness calculation
  Else: use the detected centers in the K-means algorithm
Stop
4.2 GA Steps
A. Representation (encoding of solution)
The value of K is assumed to lie in the range [Kmin, Kmax], where Kmin is chosen to be 2 unless specified otherwise. The length of a string is taken to be Kmax, where each individual gene position represents either a pointer to an actual center or a null.
B. Population initialization
For initializing these centers, Ki points are chosen randomly from the dataset. These points are distributed randomly in the chromosome. Let us consider the following example.
Example: Let Kmin = 2 and Kmax = 10. Let the random number Ki be equal to 4 for chromosome i. Then this chromosome will encode the centers of 4 clusters. Let the 4 cluster centers (4 randomly chosen points from the data set) be (10.0, 5.0), (20.4, 13.2), (15.8, 2.9), (22.7, 17.7). On random distribution of these centers in the chromosome, it may look like:
[null, (20.4, 13.2), null, null, (15.8, 2.9), null, (10.0, 5.0), (22.7, 17.7), null, null].
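The encoding and the random placement in the example can be sketched as follows (a hypothetical helper; `None` stands in for the paper's null genes, and the Kmin/Kmax defaults are the ones from the example):

```python
import random

def init_chromosome(dataset, k_min=2, k_max=10):
    """Encode Ki randomly placed cluster centers in a length-Kmax string."""
    # Generate a random number Ki in the range [Kmin, Kmax].
    k_i = random.randint(k_min, k_max)
    # Choose Ki points (rows) randomly from the dataset as centers.
    centers = random.sample(dataset, k_i)
    # Distribute these centers randomly in the chromosome;
    # unfilled positions stay null (None).
    chromosome = [None] * k_max
    for pos, c in zip(random.sample(range(k_max), k_i), centers):
        chromosome[pos] = c
    return chromosome
```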
C. Fitness computation
The fitness of a chromosome is computed using the Davies-Bouldin index (DBi). The DBi is determined as follows [C.C. Bojarczuk, H.S. Lopes, A.A. Freitas, 2000]:
Given a partition of the N points into K clusters, one first defines the following measure of within-to-between cluster spread for two clusters, Cj and Ck, for 1 ≤ j, k ≤ K and j ≠ k:

Rjk = (ej + ek) / Djk ……………..(2)

where ej and ek are the average dispersion of Cj and Ck, and Djk is the Euclidean distance between Cj and Ck. If mj and mk are the centers of Cj and Ck, consisting of Nj and Nk points respectively:

ej = (1/Nj) Σ_{x∈Cj} ||x − mj|| ……………..(3)

Djk = ||mj − mk|| ……………..(4)

After that the DBi is defined as:

DB = (1/K) Σ_{j=1..K} max_{k≠j} Rjk ……………..(5)

The objective is to minimize the DB index for achieving proper clustering. The fitness function for chromosome j is defined as 1/DBj, where DBj is the Davies-Bouldin index computed for this chromosome.
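Equations (2)-(5) can be computed directly from a partition; the sketch below is a plain-Python illustration (the function names and data layout are assumptions, not the paper's):

```python
import math

def davies_bouldin(clusters, centers):
    """DB index: clusters[j] holds the points of Cj, centers[j] its center mj."""
    # Eq. (3): ej, the average dispersion of each cluster Cj.
    e = [sum(math.dist(p, m) for p in cl) / len(cl)
         for cl, m in zip(clusters, centers)]
    K = len(clusters)
    total = 0.0
    for j in range(K):
        # Eq. (2): Rjk = (ej + ek) / Djk, with Djk the Euclidean
        # distance between centers (Eq. (4)); take the worst k != j.
        total += max((e[j] + e[k]) / math.dist(centers[j], centers[k])
                     for k in range(K) if k != j)
    return total / K  # Eq. (5): average of the worst-case ratios

def fitness(clusters, centers):
    """The chromosome fitness of section 4.2 C: 1 / DBi."""
    return 1.0 / davies_bouldin(clusters, centers)
```

Tight, well-separated clusters give a small DB index and hence a large fitness, which is why minimizing DB yields proper clustering.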
D. Genetic Operations
a. Selection: Conventional proportional selection is applied on the population of strings.
b. Crossover: Single-point crossover, applied stochastically with probability µc, is explained below with an example.
Example: Suppose crossover occurs between the following two strings, where * denotes a non-null center pointer:
[null, *, null, null, * | null, *, *, null, null]
[null, *, null, *, null | *, *, null, *, null]
Let the crossover position be 5, as marked above. Then the offspring are:
[null, *, null, null, *, *, *, null, *, null]
[null, *, null, *, null, null, *, *, null, null]
c. Mutation: Each position in a chromosome is mutated with probability µm in the following way: if the value at that position is not null, then it becomes null; else a new cluster center is created by selecting a random point from the dataset and making the pointer point to it.
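The two operators on the null-padded encoding can be sketched as follows (the crossover point and dataset are illustrative; `None` plays the role of null):

```python
import random

def single_point_crossover(p1, p2):
    """Swap the tails of two equal-length chromosomes at a random point."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(chromosome, dataset):
    """Flip one random position: a null becomes a random data point
    (a new center), and an existing center becomes null."""
    ch = list(chromosome)
    i = random.randrange(len(ch))
    ch[i] = random.choice(dataset) if ch[i] is None else None
    return ch
```

Because a mutation can create or delete a center, the number of non-null genes, and hence the number of clusters, varies during the search.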
5. Results
We applied the approach to the following datasets:
1- Wisconsin Breast Cancer Database (January 8, 1991), Dr. William H. Wolberg (physician), University of Wisconsin Hospitals.
2- Heart Disease Databases, Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D. Date: July, 1988.
3- Car Evaluation Database, Donors: Marko Bohanec (marko.bohanec@ijs.si), Blaz Zupan (blaz.zupan@ijs.si). Date: June, 1997.
4- Johns Hopkins University Ionosphere database, Donor: Vince Sigillito (vgs@aplcen.apl.jhu.edu). Date: 1989.
We obtained the following table:
Table 1. Results of the suggested method

Dataset number | Objects in dataset | Clusters detected by GA | Iterations of K-means without GA | Iterations of K-means with GA
1              | 696                | 2                       | 109                              | 15
2              | 72                 | 2                       | 25                               | 5
3              | 1728               | 4                       | 313                              | 65
4              | 351                | 2                       | 52                               | 11
6. Conclusion
In this paper we addressed the main problems of K-means clustering, namely determining the number and the seeds of the clusters. We introduced a method to find the cluster seeds in a dataset using GA; these parameters play an important role in most clustering algorithms. They also reduce the number of iterations needed to reach the cluster centers.
7. References
Arabie, P. and Hubert, L.J. (1996). An overview of combinatorial data analysis. In: Arabie, P., Hubert, L.J., and Soete, G.D. (Eds.), Clustering and Classification, 5-63, World Scientific Publishing Co., NJ.
Bojarczuk, C.C., Lopes, H.S. and Freitas, A.A. (2000). Genetic programming for knowledge discovery in chest pain diagnosis. IEEE Engineering in Medicine and Biology Magazine, special issue on data mining and knowledge discovery, 19(4), 38-44.
Daniel, C. and Wood, F.C. (1980). Fitting Equations to Data: Computer Analysis of Multifactor Data. John Wiley & Sons, New York, NY.
Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Academic Press, San Diego, CA.
Gersho, A. and Gray, R.M. (1992). Vector Quantization and Signal Compression. Communications and Information Theory. Kluwer Academic Publishers, Norwell, MA.
Ghosh, J. (2002). Scalable clustering methods for data mining. In: Nong Ye (Ed.), Handbook of Data Mining, Lawrence Erlbaum, to appear.
Jain, A.K. and Flynn, P.J. (1996). Image segmentation using clustering. In: Advances in Image Understanding: A Festschrift for Azriel Rosenfeld, IEEE Press, 65-83.
Jiawei Han and Michelle Kamber (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann.
Massart, D. and Kaufman, L. (1983). The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. John Wiley & Sons, New York, NY.
Mitra, Sushmita (2003). Data Mining: Multimedia, Soft Computing, and Bioinformatics. John Wiley & Sons, Inc., Hoboken, New Jersey.
Scott, D.W. (1992). Multivariate Density Estimation. Wiley, New York, NY.
Sivanandam, S.N. and Deepa, S.N. (2008). Introduction to Genetic Algorithms. Springer, Berlin Heidelberg New York.