Determining Number & Initial Seeds of K-Means Clustering Using GA


Muhammed U. Mahdi

Babylon University, College of Science for Women, Computer Department

Abstract

Cluster analysis has been widely used in several disciplines, such as statistics, software engineering, biology, psychology and other social sciences, in order to identify natural groups in large amounts of data. Clustering has also been widely adopted by researchers within computer science and especially the database community. K-means is the most famous clustering algorithm, but it suffers from some drawbacks, namely determining the number and initial seeds of the clusters expected in the dataset. In this paper we introduce a technique for automatic detection of cluster seeds and cluster number using a GA.

Keywords: K-means, Genetic Algorithm, Clustering

ا
1. Introduction

Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification: it models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types, which imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and were successfully applied to real-life data mining problems [Jiawei Han and Michelle Kamber, 2001].

2. Related works

General references regarding clustering include [Ghosh 2002]. There is a close relationship between clustering techniques and many other disciplines. Clustering has always been used in statistics [Arabie & Hubert 1996] and science [Massart & Kaufman 1983]. Typical applications include speech and character recognition. Machine learning clustering algorithms were applied to image segmentation and computer vision [Jain & Flynn 1996]. For statistical approaches to pattern recognition see [Fukunaga 1990]. Clustering can be viewed as a density estimation problem; this is the subject of traditional multivariate statistical estimation [Scott, 1992].




Clustering is also widely used for data compression in image processing, which is also known as vector quantization [Gersho & Gray 1992]. Data fitting in numerical analysis provides still another venue in data modeling [Daniel & Wood 1980].

3. K-means and K-median

The K-means algorithm, probably the first clustering algorithm proposed, is based on a very simple idea: given a set of initial clusters, assign each point to one of them, and then replace each cluster center by the mean point of the respective cluster. These two simple steps are repeated until convergence. A point is assigned to the cluster whose center is closest to it in Euclidean distance.

Although K-means has the great advantage of being easy to implement, it has two big drawbacks. First, it can be really slow, since in each step the distance between each point and each cluster center has to be calculated, which can be really expensive in the presence of a large dataset. Second, this method is really sensitive to the provided initial clusters; however, in recent years this problem has been addressed with some degree of success.


If instead of the Euclidean distance the 1-norm distance is used:

d1(x, c) = Σ_i |x_i − c_i| ……………..(1)

a variation of K-means is obtained, called K-median. The authors claim that this variation is less sensitive to outliers than traditional K-means due to the characteristics of the 1-norm.

The algorithm is as follows:

1. Select k objects as initial centers;

2. Assign each data object to the closest center;

3. Recalculate the centers of each cluster;

4. Repeat steps 2 and 3 until the centers do not change.
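The four steps above can be sketched in Python. This is a minimal illustrative implementation, not the paper's own code; all function and variable names are our own, and squared Euclidean distance is used for the assignment step:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: assign each point to the nearest center, then move
    each center to the mean of its cluster, until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: k objects as initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # step 2: assign each data object to the closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # step 3: recalculate the center of each cluster (keep old center if empty)
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # step 4: stop when the centers do not change
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Swapping the squared Euclidean distance for the 1-norm in the assignment step, and the mean for the coordinate-wise median in the update step, would give the K-median variant described above.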

The main weakness points of K-means are that the number of clusters may or may not be known in advance, and that the random initialization of the cluster seeds affects the result of clustering [Jiawei Han and Michelle Kamber, 2001].

4. Suggested Solution of K-Means Weakness Points

To determine the number of clusters K and the optimal cluster-center seeds with which K-means can be initialized, a genetic algorithm (GA) based clustering technique can automatically evolve the appropriate number of clusters for a data set. The chromosome encodes the centers of the clusters, and the number of encoded centers may vary. Modified versions of the crossover and mutation operators are used.

4.1 Genetic Algorithm

Genetic Algorithms (GA) belong to a class of search techniques that mimic the principles of natural selection to develop solutions to large optimization problems. GAs operate by maintaining and manipulating a population of potential solutions called chromosomes [Sivanandam & Deepa, 2008]. Each chromosome has an associated fitness value, which is a qualitative measure of the goodness of the solution encoded in it. This fitness value is used to guide the stochastic selection of chromosomes, which are then used to generate new candidate solutions through crossover and mutation. Crossover generates new chromosomes by combining sections of two or more selected parents. Mutation acts by randomly selecting genes which are then altered, thereby preventing suboptimal solutions from persisting and increasing diversity in the population. The process of selection, crossover and mutation continues for a fixed number of generations or until a termination condition is satisfied. GAs have applications in fields as diverse as VLSI design, pattern recognition, image processing, neural networks, machine learning, etc. [Mitra, Sushmita, 2003].

The GA used to find the optimal cluster seeds and their number follows the flowchart of the algorithm shown in Figure (1).


Fig. (1) Adapted GA for cluster-seed detection. The flowchart steps are:

Start

Set Kmin, Kmax to the minimum and maximum number of clusters expected, respectively. Set MaxGen to the maximum iterations allowed; Gen = 1.

Population initialization: for each chromosome in the population,
- Generate a number Ki in the range Kmin to Kmax
- Choose Ki points (rows) randomly from the dataset
- Distribute these points randomly in the chromosome
- Set unfilled positions to null

Fitness calculation: for each chromosome in the population,
- Extract the Ki centers stored in it
- Perform clustering by assigning each point to the cluster corresponding to the closest center
- Compute the DB index (DBi) by Eqs. (2)-(5)
- Compute fitness as 1/DBi

Genetic operations:
- Selection
- Single-point crossover with probability μc
- Mutation performed with probability μm: randomly choose one position of the chromosome; if this position is null, randomly choose a point from the data and make it a center, else make this position null
- Elitism

If Gen < MaxGen, set Gen = Gen + 1 and repeat from the fitness calculation; otherwise, use the detected centers in the K-means algorithm.

Stop
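As a rough sketch, the overall generational loop of Figure (1) might look as follows in Python. All names are illustrative, not from the paper; for brevity, the paper's proportional selection is replaced here by a simple 2-way tournament, and the fitness function (1/DB in the paper) is passed in as a parameter:

```python
import random

def run_ga(data, fitness, k_min=2, k_max=10, pop_size=20,
           max_gen=50, mu_c=0.8, mu_m=0.05, seed=0):
    """Skeleton of the adapted GA: initialize a population of fixed-length
    chromosomes, then loop selection -> crossover -> mutation with elitism."""
    rng = random.Random(seed)

    def new_chromosome():
        # Ki centers from the data scattered over a length-k_max chromosome
        ki = rng.randint(k_min, k_max)
        ch = [None] * k_max
        for pos, c in zip(rng.sample(range(k_max), ki), rng.sample(data, ki)):
            ch[pos] = c
        return ch

    def mutate(ch):
        # null position -> random data point as new center; filled -> null
        return [rng.choice(data) if (g is None and rng.random() < mu_m)
                else None if (g is not None and rng.random() < mu_m)
                else g
                for g in ch]

    def pick():
        # simplified 2-way tournament in place of proportional selection
        a, b = rng.sample(range(len(pop)), 2)
        return pop[a] if fitness(pop[a]) >= fitness(pop[b]) else pop[b]

    pop = [new_chromosome() for _ in range(pop_size)]
    for _ in range(max_gen):
        elite = max(pop, key=fitness)          # elitism: carry over the best
        nxt = [elite]
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < mu_c:            # single-point crossover
                cut = rng.randrange(1, k_max)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            nxt.append(mutate(p1))
            if len(nxt) < pop_size:
                nxt.append(mutate(p2))
        pop = nxt
    return max(pop, key=fitness)
```

The non-null genes of the returned chromosome are the detected centers, and their count is the detected K, which would then seed a K-means run.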

4.2 GA Steps

A. Representation (encoding of the solution)

The value of K is assumed to lie in the range [Kmin, Kmax], where Kmin is chosen to be 2 unless specified otherwise. The length of a string is taken to be Kmax, where each individual gene position represents either a pointer to an actual center or a null.

B. Population initialization

For initializing these centers, Ki points are chosen randomly from the dataset. These points are distributed randomly in the chromosome. Let us consider the following example.

Example: Let Kmin = 2 and Kmax = 10, and let the random number Ki be equal to 4 for chromosome i. Then this chromosome will encode the centers of 4 clusters. Let the 4 cluster centers (4 randomly chosen points from the data set) be (10.0, 5.0), (20.4, 13.2), (15.8, 2.9), (22.7, 17.7). On random distribution of these centers in the chromosome, it may look like:

[null, (20.4, 13.2), null, null, (15.8, 2.9), null, (10.0, 5.0), (22.7, 17.7), null, null]
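A minimal sketch of this initialization in Python (names are illustrative, not from the paper): a chromosome is a fixed-length list of length Kmax whose entries are either a data row or None for null:

```python
import random

def init_chromosome(data, k_min=2, k_max=10, seed=None):
    """Build one chromosome: pick Ki in [k_min, k_max], draw Ki rows from
    the dataset as centers, scatter them over the chromosome, and leave
    the remaining positions null (None)."""
    rng = random.Random(seed)
    ki = rng.randint(k_min, k_max)
    centers = rng.sample(data, ki)                 # Ki random rows as centers
    chromosome = [None] * k_max
    for pos, c in zip(rng.sample(range(k_max), ki), centers):
        chromosome[pos] = c                        # distribute centers randomly
    return chromosome
```

For the example above (Ki = 4), one possible result is a list such as [None, (20.4, 13.2), None, None, (15.8, 2.9), None, (10.0, 5.0), (22.7, 17.7), None, None].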

C. Fitness computation

The fitness of a chromosome is computed using the Davies-Bouldin index (DBi). The DBi is determined as follows [C.C. Bojarczuk, H.S. Lopes, A.A. Freitas, 2000]: given a partition of the N points into K clusters, one first defines the following measure of within-to-between cluster spread for two clusters, C_j and C_k, for 1 ≤ j, k ≤ K and j ≠ k:

R_jk = (e_j + e_k) / D_jk ……………..(2)

where e_j and e_k are the average dispersions of C_j and C_k, and D_jk is the Euclidean distance between C_j and C_k. If m_j and m_k are the centers of C_j and C_k, consisting of N_j and N_k points respectively:

e_j = (1/N_j) Σ_{x ∈ C_j} ||x − m_j|| ……………..(3)

D_jk = ||m_j − m_k|| ……………..(4)

After that the DBi is defined as:

DB = (1/K) Σ_{j=1..K} max_{k ≠ j} R_jk ……………..(5)


The objective is to minimize the DB index for achieving proper clustering. The fitness function for chromosome j is defined as 1/DBj, where DBj is the Davies-Bouldin index computed for this chromosome.
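The Davies-Bouldin index and the 1/DB fitness can be sketched as follows (an illustrative Python version, not the paper's code; `clusters` is a list of point lists and `centers` the corresponding center points):

```python
import math

def db_index(clusters, centers):
    """Davies-Bouldin index: average, over clusters, of the worst
    within-to-between spread ratio R_jk; lower is better."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # e_j: average dispersion of cluster j around its center m_j
    e = [sum(dist(p, m) for p in cl) / len(cl)
         for cl, m in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for j in range(k):
        # R_jk = (e_j + e_k) / D_jk, with D_jk the distance between centers
        total += max((e[j] + e[i]) / dist(centers[j], centers[i])
                     for i in range(k) if i != j)
    return total / k

def chromosome_fitness(clusters, centers):
    """GA fitness as defined in the paper: 1 / DB."""
    return 1.0 / db_index(clusters, centers)
```

For two tight, well-separated clusters the index is small and the fitness correspondingly large, which is what drives the GA toward good seed sets.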



854

D. Genetic Operations

a. Selection: Conventional proportional selection is applied to the population of strings.

b. Crossover: Single-point crossover, applied stochastically with probability μc, is explained below with an example.

Example: Suppose crossover occurs between the following two strings, where c1, ..., c9 stand for encoded cluster centers (the specific values were lost in the source and are shown symbolically):

[null, c1, null, null, c2, | null, c3, c4, null, null]
[null, c5, null, c6, null, | c7, c8, null, c9, null]

Let the crossover position be 5, as shown above. Then the offspring are:

[null, c1, null, null, c2, c7, c8, null, c9, null]
[null, c5, null, c6, null, null, c3, c4, null, null]

c. Mutation: Each position in a chromosome is mutated with probability μm in the following way: if the value at that position is not null, it becomes null; otherwise, a new cluster center is created by selecting a random point from the dataset and making the pointer point to it.
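Both operators can be sketched in a few lines of Python (illustrative names, not from the paper; `mu_m` is the per-gene mutation probability):

```python
import random

def crossover(p1, p2, pos):
    """Single-point crossover: swap the tails of two chromosomes at pos."""
    return p1[:pos] + p2[pos:], p2[:pos] + p1[pos:]

def mutate(chromosome, data, mu_m, rng=random):
    """Mutate each gene with probability mu_m: a null (None) position gets a
    random data point as a new center; a filled position becomes null."""
    out = list(chromosome)
    for i in range(len(out)):
        if rng.random() < mu_m:
            out[i] = rng.choice(data) if out[i] is None else None
    return out
```

Note that mutation changes the number of non-null genes, which is how the GA explores different values of K within the fixed-length encoding.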

5. Results

We applied the approach to the following datasets:

1- Wisconsin Breast Cancer Database (January 8, 1991), Dr. William H. Wolberg (physician), University of Wisconsin Hospitals.

2- Heart Disease Databases, Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D., July 1988.

3- Car Evaluation Database, Donors: Marko Bohanec (marko.bohanec@ijs.si), Blaz Zupan (blaz.zupan@ijs.si), June 1997.

4- Johns Hopkins University Ionosphere database, Donor: Vince Sigillito (vgs@aplcen.apl.jhu.edu), 1989.

We obtained the following table:

Table 1. Results of the suggested method

Dataset   Objects in   Clusters detected   K-means iterations   K-means iterations
number    dataset      by GA               without GA           with GA
   1          696             2                    109                  15
   2           72             2                     25                   5
   3         1728             4                    313                  65
   4          351             2                     52                  11

6. Conclusion

In this paper we addressed the main problems of K-means clustering: the number of clusters and their seeds. We introduced a method to find the cluster seeds in a dataset using a GA. These parameters play an important role in most clustering algorithms, and they reduce the number of iterations needed to reach the cluster centers.



7. References

ARABIE, P. and HUBERT, L.J. (1996). An overview of combinatorial data analysis. In: Arabie, P., Hubert, L.J., and Soete, G.D. (Eds.), Clustering and Classification, 5-63, World Scientific Publishing Co., NJ.

BOJARCZUK, C.C., LOPES, H.S. and FREITAS, A.A. (2000). Genetic programming for knowledge discovery in chest pain diagnosis. IEEE Engineering in Medicine and Biology Magazine, special issue on data mining and knowledge discovery, 19(4), 38-44.

DANIEL, C. and WOOD, F.C. (1980). Fitting Equations to Data: Computer Analysis of Multifactor Data. John Wiley & Sons, New York, NY.

FUKUNAGA, K. (1990). Introduction to Statistical Pattern Recognition. Academic Press, San Diego, CA.

GERSHO, A. and GRAY, R.M. (1992). Vector Quantization and Signal Compression. Communications and Information Theory. Kluwer Academic Publishers, Norwell, MA.

GHOSH, J. (2002). Scalable Clustering Methods for Data Mining. In: Nong Ye (Ed.), Handbook of Data Mining, Lawrence Erlbaum, to appear.

HAN, J. and KAMBER, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann.

JAIN, A.K. and FLYNN, P.J. (1996). Image segmentation using clustering. In: Advances in Image Understanding: A Festschrift for Azriel Rosenfeld, IEEE Press, 65-83.

MASSART, D. and KAUFMAN, L. (1983). The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. John Wiley & Sons, New York, NY.

MITRA, S. (2003). Data Mining: Multimedia, Soft Computing, and Bioinformatics. John Wiley & Sons, Inc., Hoboken, New Jersey.

SCOTT, D.W. (1992). Multivariate Density Estimation. Wiley, New York, NY.

SIVANANDAM, S.N. and DEEPA, S.N. (2008). Introduction to Genetic Algorithms. Springer, Berlin Heidelberg New York.