A genetic approach to the automatic clustering problem



Authors: Lin Yu Tseng, Shiueng Bien Yang

Graduate student: Chien-Ming Hsiao

Outline


Motivation


Objective


Introduction


The basic concept of the genetic strategy


The genetic clustering algorithm


The heuristic to find a good clustering


Conclusion


Personal Opinion

Motivation


Some clustering algorithms require the user to
provide the number of clusters as input


It is not easy for the user to guess how many clusters there should be.


The user in general has no idea about the number of
clusters.


The clustering result may be poor


Especially when the number of clusters is large and not
easy to guess


Objective


Propose a genetic clustering algorithm


Will automatically search for a proper number of clusters


Classify the objects into these clusters

Introduction


The clustering methods


Hierarchical


The agglomerative methods


The divisive methods


Non-hierarchical


The K-means algorithm


Is an iterative hill-climbing algorithm


The solution obtained depends on the initial clustering


The basic concept of the genetic strategy

The genetic clustering algorithm


The algorithm CLUSTERING consists of two stages



The nearest-neighbor algorithm.


To group those data that are close to one another.


To reduce the size of the data to a moderate one that is
suitable for the genetic clustering algorithm.


Genetic clustering algorithm.


To group the small clusters into larger clusters.


A heuristic strategy is then used to find a good clustering.

The nearest-neighbor algorithm


The distance threshold

Based on the average of the nearest-neighbor distances



Steps

1. For each object O_i, find the distance between O_i and its nearest neighbor.

The nearest-neighbor algorithm


Steps

2. Compute d_av, the average of the nearest-neighbor distances found in step 1.




3. View the n objects as nodes of a graph. Compute the n×n adjacency matrix A.


The nearest-neighbor algorithm


Steps

4. Find the connected components of this graph.


Let the data sets represented by these connected components be denoted by B_1, B_2, …, B_m.


Let the center of each set be denoted by V_i, 1 ≤ i ≤ m.
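A minimal sketch of this first stage in Python, assuming the objects are points in Euclidean space and assuming, since the slides do not state the connection rule, that two objects are joined by an edge whenever their distance is at most d_av. The function and variable names are illustrative, not from the paper.

import numpy as np

def nearest_neighbor_stage(X):
    """Stage 1: group objects into small sets B_1..B_m and return the sets and their centers V_i."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances between all objects.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    # Steps 1-2: nearest-neighbor distance of each object, and their average d_av.
    d_av = d.min(axis=1).mean()
    # Step 3: adjacency matrix (assumed rule: connect objects whose distance is <= d_av).
    A = d <= d_av
    # Step 4: connected components of the graph, found by a simple depth-first search.
    labels = np.full(n, -1)
    comp = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack, labels[start] = [start], comp
        while stack:
            u = stack.pop()
            for v in np.nonzero(A[u])[0]:
                if labels[v] == -1:
                    labels[v] = comp
                    stack.append(v)
        comp += 1
    sets_B = [np.nonzero(labels == c)[0] for c in range(comp)]
    centers_V = np.array([X[idx].mean(axis=0) for idx in sets_B])
    return sets_B, centers_V

Each B_i is returned as an array of object indices and each V_i as the mean of its points, matching the notation of step 4.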


The genetic algorithm


Initialization step


Iterative generations


Reproduction phase


Crossover phase


Mutation phase

The genetic algorithm


Initialization step


A population of N strings is randomly generated


The length of each string is m


m is the number of sets obtained in the first stage.


Each string represents a subset of {B_1, B_2, …, B_m}. If B_i is in this subset, the i-th position of the string will be 1; otherwise, it will be 0.


Each B_i in the subset is used as a seed to generate a cluster.
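A short sketch of this initialization step, assuming each of the N strings is a uniformly random 0/1 vector of length m, and assuming (not stated in the slides) that a string selecting no seed at all is repaired by setting one random bit. Names are illustrative.

import numpy as np

def init_population(N, m, rng=None):
    """Generate N random binary strings of length m; bit i = 1 means B_i is used as a seed."""
    rng = rng or np.random.default_rng(0)
    pop = rng.integers(0, 2, size=(N, m))
    for row in pop:
        # Assumption: avoid strings that select no seed, so every string yields at least one cluster.
        if row.sum() == 0:
            row[rng.integers(m)] = 1
    return pop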


The genetic algorithm


How to generate a set of clusters from the seeds


Let T = {T_1, T_2, …, T_s} be the subset corresponding to a string.


The initial clusters C_i's are T_i's and the initial centers S_i's of the clusters are V_i's, for i = 1, 2, …, s.


The size of cluster C_i is |C_i| = |T_i| for i = 1, 2, …, s, where |T_i| denotes the number of objects belonging to T_i.

The genetic algorithm


The B_i's in {B_1, B_2, …, B_m} − T are taken one by one, and the distance between the center V_i of the taken B_i and the center S_j of each cluster C_j is calculated. B_i is then classified into the cluster whose center is nearest to V_i.




If B_i is classified into the cluster C_j, the center S_j and the size of the cluster C_j will be recomputed.
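A loose sketch of decoding one string into a clustering. It assumes each remaining B_i joins the cluster whose current center S_j is nearest to V_i, and that S_j is then recomputed as a size-weighted mean; the slides do not give the exact update rule, so both choices are assumptions. sets_B and centers_V are the outputs of the first stage.

import numpy as np

def decode_string(string, sets_B, centers_V):
    """Turn one binary string into clusters of the sets B_i, following the seeding scheme above."""
    seeds = [i for i in range(len(string)) if string[i] == 1]
    clusters = [[i] for i in seeds]                              # C_j starts as the seed set T_j
    centers = [np.asarray(centers_V[i], dtype=float) for i in seeds]  # S_j = V_i initially
    sizes = [len(sets_B[i]) for i in seeds]                      # |C_j| = |T_j| initially
    for i in range(len(sets_B)):
        if string[i] == 1:
            continue                                             # seeds are already placed
        # Distance from V_i of the taken B_i to every current cluster center S_j.
        dists = [np.linalg.norm(centers_V[i] - S) for S in centers]
        j = int(np.argmin(dists))
        clusters[j].append(i)
        # Recompute S_j and |C_j| after B_i is classified into C_j (assumed weighted-mean update).
        w = len(sets_B[i])
        centers[j] = (sizes[j] * centers[j] + w * np.asarray(centers_V[i], dtype=float)) / (sizes[j] + w)
        sizes[j] += w
    return clusters, centers, sizes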


The genetic algorithm


Reproduction phase


The intra-distance within the cluster C_i





The inter-distance between this cluster C_i and the set of all other clusters.




The fitness function of a string R
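The slides leave out the formulas here, so the following is only a hedged sketch of what a fitness of this shape could look like: it rewards each cluster's inter-distance and penalizes w times its intra-distance, with simple stand-ins (mean distance to the own center, distance to the nearest other center) for the paper's exact definitions, which may differ.

import numpy as np

def intra_distance(points, center):
    # Assumed measure: average distance of the cluster's objects to its own center.
    return float(np.mean(np.linalg.norm(np.asarray(points, dtype=float) - center, axis=1)))

def inter_distance(center, other_centers):
    # Assumed measure: distance from this center to the nearest other cluster center.
    if len(other_centers) == 0:
        return 0.0
    return float(min(np.linalg.norm(center - c) for c in other_centers))

def fitness(cluster_points, centers, w):
    """Assumed fitness of a string R: sum over clusters of inter-distance minus w times intra-distance."""
    score = 0.0
    for i, (pts, c) in enumerate(zip(cluster_points, centers)):
        others = [s for j, s in enumerate(centers) if j != i]
        score += inter_distance(c, others) - w * intra_distance(pts, c)
    return score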




The genetic algorithm


Crossover phase


Two random numbers p and q in [1, m] are generated to decide which pieces of the string are to be interchanged.


The crossover operator is applied with probability p_c.



Mutation phase


Each chosen bit will be changed from 0 to 1 or from 1
to 0.
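A small sketch of these two operators on the binary strings, assuming two-point crossover between the cut points p and q and bit-flip mutation where each bit is chosen with probability p_m; the slides give the rates used later (80% and 5%) but not the operator details, so the specifics below are assumptions. rng is a numpy Generator, e.g. np.random.default_rng().

import numpy as np

def crossover(a, b, p_c, rng):
    """Two-point crossover: with probability p_c, interchange the piece of the strings between p and q."""
    a, b = a.copy(), b.copy()
    if rng.random() < p_c:
        p, q = sorted(rng.integers(0, len(a), size=2))
        a[p:q + 1], b[p:q + 1] = b[p:q + 1].copy(), a[p:q + 1].copy()
    return a, b

def mutate(s, p_m, rng):
    """Bit-flip mutation: each chosen bit is changed from 0 to 1 or from 1 to 0."""
    flip = rng.random(len(s)) < p_m
    return np.where(flip, 1 - s, s)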

The heuristic strategy to find a good clustering


D_1(w) estimates the closeness of the clusters in the clustering.



D_2(w) estimates the compactness of the clusters in the clustering.

The heuristic strategy to find a good clustering


The values of w are chosen from [w_1, w_2] by a kind of binary search.



It finds the greatest jump in the values of D_1(w) and the greatest jump in the values of D_2(w).


Based on these jumps, it then decides which clustering is a good one.
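The slides describe this step only loosely, so the sketch below replaces the binary search with a plain scan over candidate w values in [w_1, w_2]; run_ga is a hypothetical callable that runs the genetic algorithm for a given w and returns the resulting D_1(w) and D_2(w).

import numpy as np

def greatest_jumps(run_ga, w1=1.0, w2=3.0, steps=9):
    """Scan w over [w1, w2] and report where D_1(w) and D_2(w) jump the most."""
    ws = np.linspace(w1, w2, steps)
    d1, d2 = zip(*(run_ga(w) for w in ws))      # run_ga(w) -> (D_1(w), D_2(w)), assumed
    j1 = int(np.argmax(np.abs(np.diff(d1))))    # largest change in D_1 between neighbouring w values
    j2 = int(np.argmax(np.abs(np.diff(d2))))    # largest change in D_2 between neighbouring w values
    # Assumption: the clustering at the w just after the greatest jump is taken as the good one.
    return ws[j1 + 1], ws[j2 + 1]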

Experiments


The population size is 50


The crossover rate is 80%


The mutation rate is 5%


[w_1, w_2] = [1, 3]


w_1 is the smallest value and w_2 is the largest value of w.


Three sets of data were used


Fig. (a)


The first set of data consists of three groups of points on the plane.

The densities of the three groups are not the same.


Fig. (b), (c)


K-means algorithm


Fig. (d)


Complete-link method


Fig. (e)


Single-link method


Fig. (a)


The original data set with five groups of points


Fig. (b), (c) and (d)


K-means algorithm


Fig. (e)


By CLUSTERING, complete-link, single-link and K-means

Conclusion and Personal Opinion


The experimental results show that CLUSTERING is effective.


It can automatically search for a proper number of clusters.