k-medoid clustering with genetic algorithms
WEI MING CHEN
2012.12.06
Outline
k-medoids clustering
Famous works
GCA: clustering with the aid of a genetic algorithm
Clustering genetic algorithm: also judges the number of clusters
Conclusion
k-medoids clustering
What is k-medoid clustering?
Proposed in 1987 (L. Kaufman and P.J. Rousseeuw)
There are N points in the space
k points are chosen as centers (medoids)
The other points are classified into k groups
Which k points should be chosen to minimize the sum of the distances from each point to its medoid?
Difficulty: NP-hard
Genetic algorithms can be applied
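A minimal Python sketch of this objective (the helper names and toy points are illustrative, Euclidean distance is assumed) makes the difficulty concrete: the exact answer requires trying every k-subset of medoids, which is only feasible for tiny N.

```python
import math
from itertools import combinations

def cost(points, medoids):
    # Sum of distances from every point to its nearest medoid
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def exact_k_medoids(points, k):
    # Brute force over all C(N, k) medoid subsets -- only feasible for
    # tiny N, which is why heuristics such as PAM or GAs are needed
    return min(combinations(points, k), key=lambda m: cost(points, m))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
best = exact_k_medoids(points, 2)
```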
Famous works
Partitioning Around Medoids (PAM)
Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley
Groups N data into k sets
In every generation, consider every pair (Oi, Oj) where Oi is a medoid and Oj is not; if replacing Oi with Oj would reduce the total distance, replace Oi with Oj
Computation time: O(k(N−k)²) [one generation]
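One PAM generation as described above might look like this sketch (a toy-scale, hedged reading of the swap rule; Euclidean distance assumed):

```python
import math

def total_cost(points, medoids):
    # Sum of distances from every point to its nearest medoid
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam_generation(points, medoids):
    # One generation: for each medoid slot O_i, try every non-medoid O_j
    # and keep the swap whenever it lowers the total distance
    medoids = list(medoids)
    for i in range(len(medoids)):
        for oj in points:
            if oj in medoids:
                continue
            trial = medoids.copy()
            trial[i] = oj
            if total_cost(points, trial) < total_cost(points, medoids):
                medoids = trial
    return medoids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
medoids = pam_generation(points, [(0, 1), (1, 0)])
```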
Clustering LARge Applications (CLARA)
Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley
Reduces the computation time
Only selects a sample of s data points from the original N data
s = 40 + 2k seems a good choice
Computation time: O(ks² + k(N−k)) [one generation]
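The sampling idea can be sketched as follows (an assumption-laden toy: brute force stands in for running PAM on each sample, and the sample size is capped at N for the tiny data set):

```python
import math
import random
from itertools import combinations

def cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def clara(points, k, n_samples=5, seed=0):
    # Cluster several random samples of size s = 40 + 2k (capped at N for
    # this toy data) and keep the medoid set that is best on the FULL data.
    # Brute force stands in here for running PAM on each sample.
    rng = random.Random(seed)
    s = min(len(points), 40 + 2 * k)
    best = None
    for _ in range(n_samples):
        sample = rng.sample(points, s)
        medoids = min(combinations(sample, k), key=lambda m: cost(sample, m))
        if best is None or cost(points, medoids) < cost(points, best):
            best = medoids
    return best

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
best = clara(points, 2)
```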
Clustering Large Applications based upon RANdomized Search (CLARANS)
Ng, R., & Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th international conference on very large databases, Santiago, Chile (pp. 144–155)
Does not try all pairs of (Oi, Oj)
Tries max(0.0125·k(N−k), 250) different Oj for each Oi
Computation time: O(N²) [one generation]
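A hedged sketch of the randomized search (the restart count and seed are illustrative choices, not from the paper's experiments):

```python
import math
import random

def cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def clarans(points, k, numlocal=2, seed=1):
    # Instead of scanning all k*(N-k) swaps like PAM, examine at most
    # maxneighbor random swaps from the current medoid set, and restart
    # from a fresh random set numlocal times
    rng = random.Random(seed)
    maxneighbor = max(int(0.0125 * k * (len(points) - k)), 250)
    best = None
    for _ in range(numlocal):
        current = rng.sample(points, k)
        improved = True
        while improved:
            improved = False
            for _ in range(maxneighbor):
                i = rng.randrange(k)
                oj = rng.choice([p for p in points if p not in current])
                trial = current.copy()
                trial[i] = oj
                if cost(points, trial) < cost(points, current):
                    current, improved = trial, True
                    break
        if best is None or cost(points, current) < cost(points, best):
            best = current
    return best

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
best = clarans(points, 2)
```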
GCA: clustering with the aid of a genetic algorithm
GCA
Lucasius, C. B., Dane, A. D., & Kateman, G. (1993). On k-medoid clustering of large data sets with the aid of a genetic algorithm: Background, feasibility and comparison. Analytica Chimica Acta, 282, 647–669.
Chromosome encoding
N data, clustered into k groups
Problem size = k (the number of groups)
Each location of the string is an integer (1~N) (a medoid)
Initialization
Each string in the population uniquely encodes a candidate solution of the target problem
Randomly choose the candidates
Selection
Select the M worst individuals in the population and throw them out
Crossover
Select some individuals to reproduce M new individuals
Building-block-like crossover
Mutation
For example, k =3, p
1
= 2 3 7, p
2
= 4 8 2
1. Mix
p
1
and p
2
Q = 2
1
3
1
7
1
4
2
8
2
2
2
randomly scramble : Q =
4
2
2
2
2
1
8
2
7
1
3
1
2. Add new material : first k elements may be changed
Q = 5
2
2
7
8
2
7
1
3
1
3. randomly scramble again
Q = 2
2
7
1
7 3
1
5
8
2
4. The offspring are selected from left or from right
C
1
= 2 7 3 , C
2
= 8 5 3
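The four steps above can be sketched as one function (the seeded RNG and the 0.5 replacement probability are illustrative assumptions; the padding handles the corner case, not shown on the slide, where duplicates leave fewer than k distinct medoids):

```python
import random

def first_k_distinct(seq, k, n, rng):
    # Collect the first k distinct values; pad with random unused medoid
    # indices in the rare case duplicates leave fewer than k of them
    out = []
    for x in seq:
        if x not in out:
            out.append(x)
    while len(out) < k:
        x = rng.randrange(1, n + 1)
        if x not in out:
            out.append(x)
    return out[:k]

def gca_crossover(p1, p2, n, seed=3):
    # Building-block-like crossover: mix, scramble, inject new material
    # into the first k slots, scramble again, then read one child from
    # the left and one from the right
    rng = random.Random(seed)
    k = len(p1)
    q = list(p1) + list(p2)          # 1. mix both parents
    rng.shuffle(q)                   #    and scramble
    for i in range(k):               # 2. add new material
        if rng.random() < 0.5:
            q[i] = rng.randrange(1, n + 1)
    rng.shuffle(q)                   # 3. scramble again
    # 4. children from the left and from the right
    return first_k_distinct(q, k, n, rng), first_k_distinct(q[::-1], k, n, rng)

c1, c2 = gca_crossover([2, 3, 7], [4, 8, 2], n=10)
```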
Experiment
Under the limit of NFE < 100000, N = 1000, k = 15
Experiment: GCA versus random search (figure)
Experiment: GCA versus CLARA (k = 15) (figure)
Experiment: GCA versus CLARA (k = 50) (figure)
Paper’s conclusion
GCA can handle both large and small values of k
GCA outperforms CLARA, especially when k is large
GCA lends itself excellently to parallelization
GCA can be combined with CLARA to obtain a hybrid search system with better performance
Clustering genetic algorithm: also judges the number of clusters
Motivation
In some cases, we do not actually know the number of clusters
What if we only know an upper limit?
Hruschka, E.R. and F.F.E. Nelson. (2003). “A Genetic Algorithm for Cluster Analysis.” Intelligent Data Analysis 7, 15–25.
Fitness function
a(i): the average distance of individual i to the other individuals in the same cluster
a(i) = ( Σ_{j ∈ C(i), j ≠ i} d(i, j) ) / ( |C(i)| − 1 )
d(i, C): the average distance of individual i to the individuals in a different cluster C
d(i, C) = ( Σ_{j ∈ C} d(i, j) ) / |C|
b(i): the smallest of d(i, C) over the clusters C other than i's own
b(i) = min_{C ≠ C(i)} d(i, C)
Fitness function
Silhouette:
s(i) = ( b(i) − a(i) ) / max{ a(i), b(i) }
fitness = mean silhouette = ( Σ_{i=1}^{N} s(i) ) / N
This value will be high when…
a(i) values are small
b(i) values are high
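The fitness can be computed directly from the definitions above; a minimal sketch, assuming Euclidean points, at least two clusters, and no duplicate points (the singleton convention a(i) = 0 is my assumption, the paper may treat singletons differently):

```python
import math
from collections import defaultdict

def silhouette_fitness(points, labels):
    # Mean silhouette over all individuals; needs at least two clusters
    clusters = defaultdict(list)
    for i, g in enumerate(labels):
        clusters[g].append(i)
    total = 0.0
    for i, g in enumerate(labels):
        same = [j for j in clusters[g] if j != i]
        # a(i): average distance to the rest of i's own cluster
        a = (sum(math.dist(points[i], points[j]) for j in same) / len(same)
             if same else 0.0)
        # b(i): smallest average distance to any OTHER cluster
        b = min(sum(math.dist(points[i], points[j]) for j in members) / len(members)
                for g2, members in clusters.items() if g2 != g)
        total += (b - a) / max(a, b)
    return total / len(points)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_fitness(points, [1, 1, 2, 2])   # well-separated clusters
bad = silhouette_fitness(points, [1, 2, 1, 2])    # clusters mixed up
```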
Chromosome encoding
N data, clustered into at most k groups
Problem size = N + 1 (the last position stores the number of clusters)
Each of the first N positions of the string is an integer (1~k) (which cluster the object belongs to)
Genotype1: 22345123453321454552 5
To avoid the following problems, where equivalent labelings produce poor offspring:
Genotype2: 22222111113333344444 4
Genotype3: 44444333335555511111 4
Child2: 24444111113333344444 4
Child3: 42222333335555511111 5
Consistent algorithm (renumber clusters in order of first appearance):
11234512342215343441 5
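The consistent algorithm can be read off the example: clusters are renumbered in order of first appearance, so equivalent labelings collapse to one canonical string. A sketch (single-digit labels assumed, as on the slide):

```python
def renumber(genotype):
    # Relabel clusters 1, 2, 3, ... in order of first appearance,
    # so that equivalent labelings map to the same canonical string
    mapping = {}
    out = []
    for g in genotype:
        if g not in mapping:
            mapping[g] = str(len(mapping) + 1)
        out.append(mapping[g])
    return "".join(out)
```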
Initialization
Population size = 20
The first genotype represents two clusters, the second represents three clusters, the third represents four clusters, …, and the last one represents 21 clusters
Selection
Roulette wheel selection
Since −1 ≤ s(i) ≤ 1, shift by +1 to normalize to 0 ≤ s(i) + 1 ≤ 2, so every fitness value is non-negative
Crossover
Uniform crossover does not work
Use the Grouping Genetic Algorithm (GGA), proposed by Falkenauer (1998)
First, two strings are selected
A − 1123245125432533424
B − 1212332124423221321
Randomly select groups to preserve in A (for example, groups 2 and 3)
Crossover
A − 1123245125432533424
B − 1212332124423221321
C − 0023200020032033020
Check the groups in B that are unchanged (all of their members still sit on zeros) and place them in C
C − 0023200024432033020
Another child D is formed from the groups in B (without the group actually placed in C)
D − 1212332120023221321
Crossover
A − 1123245125432533424
B − 1212332124423221321
C − 0023200024432033020
Another child D is formed from the groups in B (without the group actually placed in C)
D − 1212332120023221321
Check the groups in A that are unchanged and place them in D
The other objects (whose alleles are zeros) are placed into the nearest cluster
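The construction of C above can be sketched as one function (label collisions between A's kept groups and B's copied groups are ignored here, and the final nearest-cluster assignment of the zeros is omitted; the full GGA handles both):

```python
def gga_child(a, b, keep_from_a):
    # Keep the chosen groups of parent A; every other allele becomes 0
    c = [g if g in keep_from_a else 0 for g in a]
    # Copy each group of B whose members all still sit on zeros in C
    groups_b = {}
    for i, g in enumerate(b):
        groups_b.setdefault(g, []).append(i)
    for g, members in groups_b.items():
        if all(c[i] == 0 for i in members):
            for i in members:
                c[i] = g
    return c

a = [int(ch) for ch in "1123245125432533424"]
b = [int(ch) for ch in "1212332124423221321"]
c = gga_child(a, b, {2, 3})   # reproduces the slide's C string
```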
Mutation
Two ways of mutation:
1. Randomly choose a group and place all of its objects into the remaining cluster that has the nearest centroid
2. Divide a randomly selected group into two new ones
Both change the genotypes in the smallest possible way
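Both operators can be sketched as follows (Euclidean points, seeded RNG, and the even split in operator 2 are my illustrative assumptions):

```python
import math
import random

def centroid(indices, points):
    xs = [points[i] for i in indices]
    return tuple(sum(coord) / len(xs) for coord in zip(*xs))

def mutate_merge(labels, points, seed=5):
    # Mutation 1: dissolve one random group and send each of its objects
    # to the remaining cluster with the nearest centroid
    rng = random.Random(seed)
    labels = list(labels)
    groups = sorted(set(labels))
    victim = rng.choice(groups)
    survivors = [g for g in groups if g != victim]
    cents = {g: centroid([i for i, l in enumerate(labels) if l == g], points)
             for g in survivors}
    for i, l in enumerate(labels):
        if l == victim:
            labels[i] = min(survivors,
                            key=lambda g: math.dist(points[i], cents[g]))
    return labels

def mutate_split(labels, seed=6):
    # Mutation 2: divide a randomly selected group into two new ones
    rng = random.Random(seed)
    labels = list(labels)
    groups = sorted(set(labels))
    victim = rng.choice(groups)
    members = [i for i, l in enumerate(labels) if l == victim]
    new = max(groups) + 1
    for i in rng.sample(members, len(members) // 2):
        labels[i] = new
    return labels

points = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 5)]
merged = mutate_merge([1, 1, 2, 2, 3], points)
split = mutate_split([1, 1, 1, 1, 2, 2])
```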
Experiment
4 test problems (N = 75, 200, 699, 150)
Experiment: Ruspini data (N = 75) (figure)
Paper’s conclusion
Does not need to know the number of groups in advance
Found the answer to the four different test problems successfully
Only tested with a small population size
Conclusion
Genetic algorithms are an acceptable method for clustering problems
The crossover needs to be designed carefully
Maybe EDAs can be applied
Some theses? Or final projects!