k-medoid clustering with genetic algorithm

Artificial Intelligence and Robotics

Wei-Ming Chen

2012.12.06


Outline

- k-medoids clustering
- Famous works
- GCA: clustering with the aid of a genetic algorithm
- Clustering genetic algorithm: also judges the number of clusters
- Conclusion



What is k-medoid clustering?

- Proposed in 1987 (L. Kaufman and P.J. Rousseeuw)
- There are N points in the space
- k of the points are chosen as centers (medoids)
- The other points are classified into the k groups
- Which k points should be chosen to minimize the sum of the distances from each point to its medoid?

Difficulty

- NP-hard
- Genetic algorithms can be applied
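The objective can be stated directly in code. A minimal sketch (function names are my own), assuming Euclidean points: it scores a candidate medoid set and solves tiny instances by brute force over all C(N, k) subsets, which illustrates why exhaustive search is hopeless at scale.

```python
import itertools
import math

def total_cost(points, medoids):
    # Sum of each point's Euclidean distance to its nearest medoid.
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def exact_k_medoids(points, k):
    # Brute force over all C(N, k) medoid subsets -- exponential growth,
    # which is exactly why heuristics such as GAs are used instead.
    return min(itertools.combinations(points, k),
               key=lambda ms: total_cost(points, ms))
```

For four points forming two obvious pairs, the optimal two medoids pick one point from each pair.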




Partitioning Around Medoids (PAM)

- Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley
- Groups N data into k sets
- In every generation, consider every pair (Oi, Oj) where Oi is a medoid and Oj is not; if replacing Oi by Oj would reduce the total distance, replace Oi by Oj
- Computation time: O(k(N−k)²) [one generation]
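One PAM generation can be sketched as follows (a simplified sketch, not the book's exact routine; names are mine): every (medoid, non-medoid) swap is tried and accepted whenever it lowers the total cost, giving the O(k(N−k)²) distance evaluations per generation quoted above.

```python
import math

def pam_swap_pass(points, medoids):
    """One PAM-style generation: try every (medoid, non-medoid) pair and
    accept any swap that lowers the total cost."""
    def cost(ms):
        return sum(min(math.dist(p, m) for m in ms) for p in points)
    medoids = list(medoids)
    for i in range(len(medoids)):
        for oj in points:
            if oj in medoids:
                continue
            trial = medoids[:i] + [oj] + medoids[i + 1:]
            if cost(trial) < cost(medoids):
                medoids = trial  # accept the improving swap Oi -> Oj
    return medoids
```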

Clustering LARge Applications (CLARA)

- Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley
- Reduces the calculation time
- Only selects a sample of s points from the original N data
- s = 40 + 2k seems a good choice
- Computation time: O(ks² + k(N−k)) [one generation]
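A sketch of the CLARA idea (assuming Euclidean points; the inner search here is a single PAM-style pass rather than the book's full PAM run): medoids are searched on small random samples of size 40 + 2k, and each candidate set is scored on the full data.

```python
import math
import random

def clara(points, k, n_samples=5):
    """CLARA sketch: search for medoids on small random samples (the 40 + 2k
    rule of thumb), score each candidate set on the full data, keep the best."""
    s = min(len(points), 40 + 2 * k)
    def full_cost(ms):
        return sum(min(math.dist(p, m) for m in ms) for p in points)
    best = None
    for _ in range(n_samples):
        sample = random.sample(points, s)
        medoids = sample[:k]  # crude initialization for the sketch
        def sample_cost(ms):
            # Swap costs are evaluated on the sample only -- that is the saving.
            return sum(min(math.dist(p, m) for m in ms) for p in sample)
        for i in range(k):
            for oj in sample:
                if oj in medoids:
                    continue
                trial = medoids[:i] + [oj] + medoids[i + 1:]
                if sample_cost(trial) < sample_cost(medoids):
                    medoids = trial
        if best is None or full_cost(medoids) < full_cost(best):
            best = medoids
    return best
```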

Clustering Large Applications based upon RANdomized Search (CLARANS)

- Ng, R., & Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th international conference on very large databases, Santiago, Chile (pp. 144–155)
- Does not try all pairs of (Oi, Oj)
- Tries max(0.0125·k(N−k), 250) different Oj for each Oi
- Computation time: O(N²) [one generation]
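The randomized step can be sketched as follows (a minimal sketch; names are mine): instead of enumerating all (medoid, non-medoid) pairs, a bounded number of random swaps is examined and the first improving one is taken.

```python
import math
import random

def clarans_step(points, medoids, max_neighbors):
    """One CLARANS-style local-search step: examine random (medoid, non-medoid)
    swaps instead of all of them, moving to the first improving neighbour."""
    def cost(ms):
        return sum(min(math.dist(p, m) for m in ms) for p in points)
    current = cost(medoids)
    non_medoids = [p for p in points if p not in medoids]
    for _ in range(max_neighbors):
        i = random.randrange(len(medoids))
        oj = random.choice(non_medoids)
        trial = medoids[:i] + [oj] + medoids[i + 1:]
        if cost(trial) < current:
            return trial  # move to the better neighbour
    return medoids  # local optimum w.r.t. the sampled neighbours
```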





GCA

- Lucasius, C. B., Dane, A. D., & Kateman, G. (1993). On k-medoid clustering of large data sets with the aid of a genetic algorithm: Background, feasibility and comparison. Analytica Chimica Acta, 282, 647–669.


Chromosome encoding

- N data, clustered into k groups
- Problem size = k (the number of groups)
- Each location of the string is an integer (1~N), the index of a medoid

Initialization

- Each string in the population uniquely encodes a candidate solution of the target problem
- The initial candidates are chosen at random


Selection

- Select the M worst individuals in the population and throw them out

Crossover

- Select some individuals to reproduce M new offspring
- Building-block-like crossover
- Mutation

Crossover

- For example, k = 3, p1 = 2 3 7, p2 = 4 8 2
- 1. Mix p1 and p2 (subscripts mark which parent each gene came from):
  Q = 2₁ 3₁ 7₁ 4₂ 8₂ 2₂
  Randomly scramble: Q = 4₂ 2₂ 2₁ 8₂ 7₁ 3₁
- 2. Add new material: the first k elements may be changed
  Q = 5 2₂ 7 8₂ 7₁ 3₁
- 3. Randomly scramble again:
  Q = 2₂ 7₁ 7 3₁ 5 8₂
- 4. The offspring are read off from the left and from the right (skipping duplicates):
  C1 = 2 7 3, C2 = 8 5 3
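The steps above can be sketched in code (a sketch under my own naming; the scramble, mutation rate, and duplicate handling are assumptions, since the slide only gives one worked example):

```python
import random

def gca_crossover(p1, p2, n_points, k):
    """GCA building-block crossover sketch: mix both parents, scramble,
    possibly replace the first k genes with new material, scramble again,
    then read one child from the left and one from the right."""
    q = p1 + p2
    random.shuffle(q)  # steps 1: mix and scramble
    for i in range(k):  # step 2: the first k genes may be changed
        if random.random() < 0.5:
            q[i] = random.randrange(1, n_points + 1)
    random.shuffle(q)  # step 3: scramble again
    def take(seq):
        # Step 4: collect k distinct medoid indices, skipping duplicates.
        child = []
        for g in seq:
            if g not in child:
                child.append(g)
            if len(child) == k:
                break
        return child
    return take(q), take(reversed(q))
```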



Experiment

- Under the limit of NFE (number of function evaluations) < 100,000
- N = 1000, k = 15
- [Figures: GCA versus random search; GCA versus CLARA (k = 15); GCA versus CLARA (k = 50)]

Paper's conclusion

- GCA can handle both large and small values of k
- GCA outperforms CLARA, especially when k is large
- GCA lends itself excellently to parallelization
- GCA can be combined with CLARA to obtain a hybrid search system with better performance



Motivation

- In some cases we do not actually know the number of clusters
- What if we only know an upper limit?
- Hruschka, E. R., & Ebecken, N. F. F. (2003). "A Genetic Algorithm for Cluster Analysis." Intelligent Data Analysis, 7, 15–25.



Fitness function

- a(i): the average distance from individual i to the other individuals in its own cluster C(i):
  a(i) = ( Σ_{j ∈ C(i), j ≠ i} d(i, j) ) / ( |C(i)| − 1 )
- d(i, C): the average distance from individual i to the individuals of a different cluster C:
  d(i, C) = ( Σ_{j ∈ C} d(i, j) ) / |C|
- b(i): the smallest of the d(i, C):
  b(i) = min_{C ≠ C(i)} d(i, C)
- Silhouette:
  s(i) = ( b(i) − a(i) ) / max{ a(i), b(i) }
- fitness = s̄ = ( Σ_{i=1}^{N} s(i) ) / N
- This value will be high when…
- the a(i) values are small
- the b(i) values are high
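The fitness s̄ can be computed directly from these definitions (a sketch with my own names; it assumes no cluster is a singleton, since a(i) divides by |C(i)| − 1):

```python
import math

def silhouette_fitness(points, labels):
    """Average silhouette width s-bar, used as the GA fitness:
    values near 1 mean tight, well-separated clusters."""
    n = len(points)
    d = lambda i, j: math.dist(points[i], points[j])
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    s_sum = 0.0
    for i in range(n):
        own = clusters[labels[i]]
        # a(i): mean distance to the other members of i's own cluster
        a = sum(d(i, j) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean distance to any other cluster
        b = min(sum(d(i, j) for j in members) / len(members)
                for lab, members in clusters.items() if lab != labels[i])
        s_sum += (b - a) / max(a, b)
    return s_sum / n
```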


Chromosome encoding

- N data, clustered into at most k groups
- Problem size = N + 1 (one gene per object, plus a final gene giving the number of clusters)
- Each location of the string is an integer (1~k) saying which cluster the object belongs to
- Genotype1: 22345123453321454552 5
- To avoid the following problem (the same partition can be labeled in different ways, so naive crossover produces inconsistent children):
  Genotype2: 2|2222|111113333344444 4
  Genotype3: 4|4444|333335555511111 4
  Child2: 2 4444 111113333344444 4
  Child3: 4 2222 333335555511111 5
- Consistent algorithm (relabel clusters in order of first appearance): 11234512342215343441 5
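The consistent relabeling can be sketched in a few lines (function name is mine): cluster ids are renumbered in order of first appearance, so equivalent partitions get identical strings.

```python
def renumber(genotype):
    """Relabel cluster ids in order of first appearance, so that equivalent
    partitions map to the same canonical string."""
    mapping = {}
    out = []
    for g in genotype:
        if g not in mapping:
            mapping[g] = len(mapping) + 1  # next unused canonical label
        out.append(mapping[g])
    return out
```

Applied to Genotype1 above, it yields exactly the canonical string shown on the slide.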


Initialization

- Population size = 20
- The first genotype represents two clusters, the second three clusters, the third four clusters, …, and the last one 21 clusters

Selection

- Roulette wheel selection
- Since −1 ≤ s(i) ≤ 1, normalize to 0 ≤ s(i) + 1 ≤ 2 so that all fitness values are non-negative
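The shifted-fitness roulette wheel can be written as a short sketch (names are mine):

```python
import random

def roulette_select(population, silhouettes):
    """Roulette-wheel selection on the shifted fitness s(i) + 1 in [0, 2]:
    raw silhouettes in [-1, 1] can be negative, so they are shifted first."""
    weights = [s + 1.0 for s in silhouettes]
    return random.choices(population, weights=weights, k=1)[0]
```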


Crossover

- Uniform crossover does not work
- Use the Grouping Genetic Algorithm (GGA), proposed by Falkenauer (1998)
- First, two strings are selected:
  A − 1123245125432533424
  B − 1212332124423221321
- Randomly select groups to preserve in A (for example, groups 2 and 3):
  C − 0023200020032033020
- Check the unchanged groups in B (those whose objects are all still free) and place them in C:
  C − 0023200024432033020
- The other child is formed from the groups of B (without the group actually placed in C):
  D − 1212332120023221321
- Check the unchanged groups in A and place them in D
- The remaining objects (whose alleles are zeros) are placed in the nearest cluster
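The construction of child C can be sketched as follows (names are mine; it assumes, as in the slide's example, that the injected labels do not clash, and leaves the zero alleles for the later nearest-cluster assignment):

```python
def gga_crossover(a, b, groups_from_a):
    """GGA-style child: keep the chosen groups of A, then inject every group
    of B that fits entirely into still-empty (zero) slots."""
    n = len(a)
    # Preserve the selected groups of A; all other alleles become 0.
    c = [a[i] if a[i] in groups_from_a else 0 for i in range(n)]
    for g in set(b):
        members = [i for i in range(n) if b[i] == g]
        # A group of B is "unchanged" if none of its objects is taken yet.
        if all(c[i] == 0 for i in members):
            for i in members:
                c[i] = g
    return c
```

On the slide's A and B with groups 2 and 3 preserved, this reproduces C exactly.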


Mutation

- Two ways to mutate:
- 1. Randomly choose a group and place all of its objects in the remaining cluster with the nearest centroid
- 2. Divide a randomly selected group into two new ones
- Both just change the genotype in the smallest possible way
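The first operator can be sketched as follows (a sketch with my own names, assuming Euclidean points and at least two clusters): one cluster is dissolved and each of its objects joins the surviving cluster whose centroid is nearest.

```python
import math
import random

def eliminate_group(genotype, points):
    """Mutation operator 1: dissolve one randomly chosen cluster and move each
    of its objects to the surviving cluster with the nearest centroid."""
    labels = sorted(set(genotype))
    victim = random.choice(labels)
    survivors = [g for g in labels if g != victim]
    def centroid(g):
        members = [points[i] for i, lab in enumerate(genotype) if lab == g]
        return tuple(sum(c) / len(members) for c in zip(*members))
    cents = {g: centroid(g) for g in survivors}
    out = list(genotype)
    for i, lab in enumerate(out):
        if lab == victim:
            out[i] = min(survivors, key=lambda g: math.dist(points[i], cents[g]))
    return out
```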

Experiment

- 4 test problems (N = 75, 200, 699, 150)
- [Figure: Ruspini data (N = 75)]

Paper's conclusion

- The number of groups does not need to be known in advance
- The algorithm found the correct answer on all four test problems
- Tested only with a small population size



Conclusion

- Genetic algorithms are an acceptable method for clustering problems
- The crossover operator needs to be designed carefully
- Maybe EDAs (estimation of distribution algorithms) can be applied
- Some thesis topics? Or final projects!