A genetic approach to the automatic clustering problem

Artificial Intelligence and Robotics

23 Oct 2013


Authors: Lin Yu Tseng, Shiueng Bien Yang

Ming Hsiao

Outline

Motivation

Objective

Introduction

The basic concept of the genetic strategy

The genetic clustering algorithm

The heuristic to find a good clustering

Conclusion

Personal Opinion

Motivation

Some clustering algorithms require the user to provide the number of clusters as input.

It is not easy for the user to guess how many clusters there should be; in general the user has no idea about the number of clusters.

The clustering result may then be poor, especially when the number of clusters is large and hard to guess.

Objective

Propose a genetic clustering algorithm that

Automatically searches for a proper number of clusters

Classifies the objects into these clusters

Introduction

The clustering methods

Hierarchical

The agglomerative methods

The divisive methods

Non-hierarchical

The K-means algorithm

Is an iterative hill-climbing algorithm

The solution obtained depends on the initial clustering

The basic concept of the genetic
strategy

The genetic clustering algorithm

The algorithm CLUSTERING consists of two stages:

The nearest-neighbor algorithm

Groups data that are close to one another.

Reduces the size of the data to a moderate one that is suitable for the genetic clustering algorithm.

The genetic clustering algorithm

Groups the small clusters into larger clusters.

A heuristic strategy is then used to find a good clustering.

The nearest-neighbor algorithm

The distance threshold is based on the average of the nearest-neighbor distances.

Steps

1. For each object O_i, find the distance between O_i and its nearest neighbor.

2. Compute d_av, the average of the nearest-neighbor distances found in step 1.

3. View the n objects as nodes of a graph and compute the n*n pairwise distances; connect two objects with an edge when their distance is not larger than d_av.

4. Find the connected components of this graph. Let the data sets represented by these connected components be denoted B_1, B_2, …, B_m, and let the center of each set be denoted V_i, 1 ≤ i ≤ m.
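The steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, and the edge criterion (connect two objects when their distance is at most d_av) is an assumption, since the slide's threshold detail is garbled.

```python
import math

def nearest_neighbor_stage(points):
    """Stage 1 sketch: group objects into connected components B_1..B_m
    using the average nearest-neighbor distance d_av as the edge threshold."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Steps 1-2: average nearest-neighbor distance d_av
    d_av = sum(min(dist[i][j] for j in range(n) if j != i) for i in range(n)) / n
    # Step 3: union-find over edges with distance <= d_av (assumed threshold)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= d_av:
                parent[find(i)] = find(j)
    # Step 4: connected components B_i and their centers V_i
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    B = [[points[i] for i in idxs] for idxs in comps.values()]
    V = [tuple(sum(c) / len(b) for c in zip(*b)) for b in B]
    return B, V
```

On two well-separated pairs of points, this yields two components, each with its mean as center.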

The genetic algorithm

Initialization step

Iterative generations

Reproduction phase

Crossover phase

Mutation phase

The genetic algorithm

Initialization step

A population of N strings is randomly generated.

The length of each string is m, where m is the number of sets obtained in the first stage.

If B_i is in the subset represented by the string, the i-th position of the string will be 1; otherwise, it will be 0.

Each B_i in the subset is used as a seed to generate a cluster.
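A minimal sketch of this initialization step (the function name is hypothetical, and independent uniform bits are an assumption):

```python
import random

def init_population(N, m, rng=None):
    """Randomly generate a population of N binary strings of length m.
    Bit i set to 1 means B_i is chosen as a seed for a cluster."""
    rng = rng or random.Random()
    return [[rng.randint(0, 1) for _ in range(m)] for _ in range(N)]
```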

The genetic algorithm

How to generate a set of clusters from the seeds

Let T = {T_1, T_2, …, T_s} be the subset corresponding to a string.

The initial clusters C_i are the T_i, and the initial centers S_i of the clusters are the V_i, for i = 1, 2, …, s.

The size of cluster C_i is |C_i| = |T_i| for i = 1, 2, …, s, where |T_i| denotes the number of objects belonging to T_i.
The genetic algorithm

The B_i's in {B_1, B_2, …, B_m} − T are taken one by one, and the distance between the center V_i of the taken B_i and the center S_j of each cluster C_j is calculated.

If B_i is classified as belonging to the cluster C_j, the center S_j and the size of the cluster C_j will be recomputed.
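The seed-growing procedure above might look like this sketch (names are hypothetical; assigning B_i to the cluster with the nearest center, and recomputing S_j as the plain mean of all points in C_j, are assumptions):

```python
import math

def clusters_from_string(string, B, V):
    """Grow clusters from the seeds selected by a binary string.
    B and V are the stage-1 sets and their centers."""
    seeds = [i for i, bit in enumerate(string) if bit == 1]
    clusters = [list(B[i]) for i in seeds]   # C_j starts as T_j
    centers = [V[i] for i in seeds]          # S_j starts as V_j
    for i, bit in enumerate(string):
        if bit == 1:
            continue
        # assign B_i to the cluster whose center S_j is nearest to V_i
        j = min(range(len(centers)), key=lambda k: math.dist(V[i], centers[k]))
        clusters[j].extend(B[i])
        # recompute the center S_j (assumed: mean of all points now in C_j)
        centers[j] = tuple(sum(c) / len(clusters[j]) for c in zip(*clusters[j]))
    return clusters, centers
```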

The genetic algorithm

Reproduction phase

The fitness of a string is computed from two quantities:

The intra-distance within each cluster C_i.

The inter-distance between each cluster C_i and the set of all other clusters.

The fitness function of a string R is defined in terms of these distances.

The genetic algorithm

Crossover phase

Two random numbers p and q in [1, m] are generated to decide which pieces of the strings are to be interchanged.

The crossover operator is applied with probability p_c.

Mutation phase

Each chosen bit is changed from 0 to 1 or from 1 to 0.
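These two operators can be sketched as standard two-point crossover and bit-flip mutation (the exact way p and q are drawn, and the bit-selection rule for mutation, are assumptions):

```python
import random

def crossover(a, b, pc=0.8, rng=None):
    """With probability pc, pick two cut points p < q and swap
    the piece of the strings between them (two-point crossover)."""
    rng = rng or random.Random()
    if rng.random() >= pc:
        return a[:], b[:]
    p, q = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:p] + b[p:q] + a[q:], b[:p] + a[p:q] + b[q:]

def mutate(s, pm=0.05, rng=None):
    """Flip each bit from 0 to 1 or from 1 to 0 with probability pm."""
    rng = rng or random.Random()
    return [1 - bit if rng.random() < pm else bit for bit in s]
```

Because equal-length pieces are exchanged, the total number of 1-bits across the two parents is preserved by crossover.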

The heuristic strategy to find a good clustering

D_1(w) estimates the closeness of the clusters in the clustering.

D_2(w) estimates the compactness of the clusters in the clustering.

The values of w are chosen from [w_1, w_2] by a kind of binary search, which finds the greatest jump in the values of D_1(w) and the greatest jump in the values of D_2(w).

Based on these jumps, it then decides which clustering is a good one.
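The slides give only the idea of this heuristic. As a sketch, locating the "greatest jump" reduces to finding the largest increase between consecutive values; here the D_1(w) or D_2(w) values are assumed to have been precomputed for a sequence of w's, rather than probed by the paper's binary search:

```python
def greatest_jump(values):
    """Return the index i at which values[i + 1] - values[i] is largest,
    i.e. where the 'greatest jump' in D_1(w) or D_2(w) occurs."""
    return max(range(len(values) - 1), key=lambda i: values[i + 1] - values[i])
```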

Experiments

The population size is 50.

The crossover rate is 80%.

The mutation rate is 5%.

[w_1, w_2] = [1, 3], where w_1 is the smallest and w_2 the largest value of w considered.

Three sets of data were used.

Fig. (a): the first set of data, which consists of three groups of points on the plane; the densities of the three groups are not the same.

Fig. (b), (c): results of the K-means algorithm.

Fig. (d): result of the complete-link method.

Fig. (e): result of the single-link method.
Fig. (a): the original data set with five groups of points.

Fig. (b), (c), (d): results of the K-means algorithm.

Fig. (e): result of CLUSTERING, compared with complete-link and K-means.

Conclusion and Personal Opinion

The experimental results show that CLUSTERING is effective.

It can automatically search for a proper number of clusters and use it as the number of clusters.