Clustering Without Prior Knowledge Based on Gene Expression Programming*
Yu Chen^1, Changjie Tang^1, Jun Zhu^2, Chuan Li^1, Shaojie Qiao^1, Rui Li^3, Jiang Wu^1
^1 College of Computer Science, Sichuan University, Chengdu, China
{chenyu05, tangchangjie}@cs.scu.edu.cn
^2 National Center for Birth Defects Monitoring, Chengdu, China
zhu_jun1@163.com
^3 Department of Electrical Engineering, University of California Riverside, CA, USA
ruili@ee.ucr.edu
Abstract
Most existing clustering methods require prior knowledge, such as the number of clusters or thresholds, which is difficult to determine accurately in practice. To solve this problem, this study proposes GEPCluster, a novel clustering algorithm based on Gene Expression Programming (GEP) that needs no prior knowledge. The main contributions are: (1) a new concept named Clustering Algebra, which treats clustering as an algebraic operation; (2) the GEPCluster algorithm, which uses GEP to find the best clustering solution automatically, without any prior knowledge; and (3) AMCA (Automatic Merging Cluster Algorithm), which merges clusters automatically. Experiments on two synthetic data sets show that GEPCluster clusters effectively without any prior knowledge.
1. Introduction
Clustering is an important knowledge discovery technique widely used in scientific practice. Its goal is to partition a set of data in a d-dimensional feature space into subgroups, maximizing the similarity within each group and minimizing the similarity between different groups. Many clustering algorithms have been proposed, including partitioning, hierarchical, density-based, grid-based, and model-based methods [1], as well as clustering based on support vectors [2]. However, most existing algorithms need prior knowledge, such as the number of clusters in K-Means [3]. Unfortunately, such prior knowledge is usually difficult to obtain accurately in practice.
* This work was supported by the National Natural Science Foundation of China under Grant No. 60473071, the 11th Five Years Key Programs for Sci. & Tech. Development of China under Grant No. 2006038002003, and the Youth Foundation, Sichuan University, 06036: JS20070405506408.
Clustering without any prior knowledge is an urgent need in many applications. To address it, this study proposes the GEPCluster algorithm, which is based on GEP. Compared with traditional clustering algorithms, GEPCluster has the following characteristics: (a) it does not need prior knowledge, and (b) it divides or merges clusters automatically. Experiments show that GEPCluster is effective in solving the clustering problem without any prior knowledge.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the preliminary terminology and concepts of GEP. Section 4 presents the GEPCluster algorithm and proposes a new concept named Clustering Algebra, which treats the clustering process as an algebraic operation. Section 5 describes the data structure of GEPCluster in detail. Section 6 presents the AMCA algorithm. Section 7 gives the experimental results. Section 8 concludes the paper and discusses future work.
2. Related work
K-Means [3] is a popular clustering algorithm. It is simple, efficient, and widely applied in practice. However, it has two main limitations: (a) it is sensitive to the initial cluster centers and often converges to a local minimum; (b) it requires users to specify the number of clusters. Many algorithms have been proposed to overcome these problems, and evolutionary strategies have recently been explored as well. Babu and Murty proposed a clustering algorithm based on evolution strategies [4], but it needs the number of clusters as prior knowledge. Murthy and Chowdhury proposed a clustering algorithm based on genetic algorithms (GA) [5]. It encodes the chromosome as a binary string in which each bit represents one object in the data set. Because the chromosome length is limited, it works well only on small data sets. In [6], Bandyopadhyay and Maulik encode every cluster center into the chromosome to deal with larger data sets, but the number of clusters must still be preset.
In this study, we propose GEPCluster, a clustering algorithm based on GEP that is free of prior knowledge. It has a high probability of obtaining the optimal solution of the clustering problem without any prior knowledge.
Third International Conference on Natural Computation (ICNC 2007)
0769528759/07 $25.00 © 2007
3. Preliminary concepts in GEP
GEP was proposed by Candida Ferreira in 2001 and is a new member of the evolutionary computation family. GEP combines the advantages of both GA and Genetic Programming (GP) while overcoming some of their individual limitations, and it can handle more difficult problems than GA and GP [7, 8]. The individual chromosomes in GEP are encoded as simple linear strings of fixed length, and each chromosome is composed of one or more genes. Each gene is divided into a gene head and a gene tail.
To evaluate a chromosome's fitness, the chromosome must first be mapped into an expression tree (ET) in breadth-first fashion. The mapping starts from the first position in the chromosome, which corresponds to the root of the ET. If the root node is a terminal, the mapping stops; otherwise, its child branches are processed from left to right. The number of child branches of a node is determined by the number of arguments of that node, and the children are read from the chromosome one by one. This process continues layer by layer until all leaf nodes in the ET are elements of the terminal set. After a chromosome is mapped into an ET, its fitness can be calculated by decoding the ET through inorder traversal.
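The breadth-first mapping described above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the function names `kexpr_to_tree` and `to_infix`, the `(symbol, children)` node layout, and the arithmetic symbols in the example (with `Q` standing for a one-argument function such as a square root) are all assumptions made here.

```python
from collections import deque

def kexpr_to_tree(symbols, arity):
    """Map a linear gene (sequence of symbols) to a nested expression tree.

    arity maps each function symbol to its argument count; any symbol
    missing from arity is treated as a terminal (arity 0).
    Each node is a (symbol, children) tuple.
    """
    it = iter(symbols)
    root = (next(it), [])
    queue = deque([root])            # nodes whose children are still unread
    while queue:
        symbol, children = queue.popleft()
        for _ in range(arity.get(symbol, 0)):
            child = (next(it), [])   # read the next unread gene position
            children.append(child)
            queue.append(child)
    return root                      # trailing unread symbols are non-coding

def to_infix(node):
    """Inorder traversal of the tree, written out as an infix string."""
    symbol, children = node
    if not children:
        return symbol
    if len(children) == 1:
        return f"{symbol}({to_infix(children[0])})"
    left, right = children
    return f"({to_infix(left)}{symbol}{to_infix(right)})"

# Hypothetical arithmetic gene: Q takes one argument, the rest take two.
ARITY = {"Q": 1, "+": 2, "-": 2, "*": 2}
tree = kexpr_to_tree("Q*+-abcd", ARITY)
print(to_infix(tree))  # Q(((a+b)*(c-d)))
```

The queue holds the nodes of the current layer, so children are attached layer by layer in the order the symbols appear in the gene. Symbols left over after the last leaf is filled are simply never read.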
4. GEPCluster algorithm
Clustering can be viewed as an optimization problem in the feature space, and GEP performs well in global optimization. We therefore propose the GEPCluster algorithm based on GEP.
In GEPCluster, two special clustering operators are introduced, namely ‘∩’ and ‘∪’, described in Definition 2.
Definition 1 (Centroid). Let $O_1$ and $O_2$ be points in the d-dimensional space, $O_1 = \{x_1, x_2, \ldots, x_d\}$ and $O_2 = \{y_1, y_2, \ldots, y_d\}$. The centroid of $O_1$ and $O_2$ is calculated by the Centroid function, which is defined as
$$\mathrm{Centroid}(O_1, O_2) = \left\{ \frac{x_1 + y_1}{2}, \frac{x_2 + y_2}{2}, \ldots, \frac{x_d + y_d}{2} \right\}.$$
Definition 2 (Clustering Algebra). Let S be a d-dimensional space and $P = 2^S$ be the power set of S. Let $O_1$ and $O_2$ be the centers of clusters $C_1$ and $C_2$.
(1) The aggregation operator ‘∩’ over P is defined as $O_1 \cap O_2 = \mathrm{Centroid}(O_1, O_2)$.
(2) The segmentation operator ‘∪’ over P is defined as $O_1 \cup O_2 = \{O_1, O_2\}$.
(3) The 3-tuple CA = (P, ∩, ∪) is called the Clustering Algebra over P.
Note that in case (1), $C_1$ and $C_2$ are aggregated into one cluster C with center $O = \mathrm{Centroid}(O_1, O_2)$. In case (2), $C_1$ and $C_2$ remain two distinct clusters, with centers $O_1$ and $O_2$, respectively.
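Definitions 1 and 2 translate almost directly into code. The following is a minimal sketch, assuming points are tuples of floats and that each operator returns the list of cluster centers that survive it; the function names `centroid`, `aggregate`, and `segment` are chosen here for illustration and do not come from the paper.

```python
def centroid(o1, o2):
    """Definition 1: coordinate-wise midpoint of two d-dimensional points."""
    if len(o1) != len(o2):
        raise ValueError("points must have the same dimension")
    return tuple((x + y) / 2 for x, y in zip(o1, o2))

def aggregate(o1, o2):
    """Operator '∩': merge two clusters; a single center remains."""
    return [centroid(o1, o2)]

def segment(o1, o2):
    """Operator '∪': keep the two clusters apart; both centers remain."""
    return [o1, o2]

o1, o2 = (0.0, 0.0), (2.0, 4.0)
print(aggregate(o1, o2))  # [(1.0, 2.0)]
print(segment(o1, o2))    # [(0.0, 0.0), (2.0, 4.0)]
```

Returning a list in both cases makes the difference between the operators visible as a difference in the number of centers: one after aggregation, two after segmentation.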
These operations make it possible to describe the automatic aggregation and segmentation of clusters algebraically. The following lemma describes the properties of the Clustering Algebra.
Lemma 1. Let CA = (P, ∩, ∪) be the Clustering Algebra over P. Then
(1) ∩ and ∪ are commutative.
(2) ∩ and ∪ are associative.
Proof. (1) The commutative law is obvious, since each clustering operator is symmetric with respect to its two operands.
(2) Let $O_1$, $O_2$ and $O_3$ be the centers of three clusters $C_1$, $C_2$ and $C_3$. By Definition 2, the formula $(O_1 \cap O_2) \cap O_3$ indicates that $C_1$ and $C_2$ are aggregated into a cluster C′ whose center is $\mathrm{Centroid}(O_1, O_2)$. Then C′ and $C_3$ are aggregated into a cluster C″ whose center is $\mathrm{Centroid}(\mathrm{Centroid}(O_1, O_2), O_3)$. That is to say, $C_1$, $C_2$ and $C_3$ are three subclusters of C″, so all three are aggregated.
Similarly, the formula $O_1 \cap (O_2 \cap O_3)$ indicates that $O_1$, $O_2$ and $O_3$ are in the same cluster C″. Therefore, the equation $(O_1 \cap O_2) \cap O_3 \equiv O_1 \cap (O_2 \cap O_3)$ is satisfied. In the same way, the equation $(O_1 \cup O_2) \cup O_3 \equiv O_1 \cup (O_2 \cup O_3)$ is satisfied.
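As a worked check on Lemma 1(2), it may help to note that the proof argues the equivalence in terms of cluster membership. Expanding the pairwise Centroid of Definition 1 numerically (a short calculation added here, not taken from the paper) shows that the two groupings weight the operands differently:

```latex
\begin{align*}
(O_1 \cap O_2) \cap O_3
  &= \mathrm{Centroid}(\mathrm{Centroid}(O_1, O_2), O_3)
   = \tfrac{1}{4} O_1 + \tfrac{1}{4} O_2 + \tfrac{1}{2} O_3, \\
O_1 \cap (O_2 \cap O_3)
  &= \mathrm{Centroid}(O_1, \mathrm{Centroid}(O_2, O_3))
   = \tfrac{1}{2} O_1 + \tfrac{1}{4} O_2 + \tfrac{1}{4} O_3 .
\end{align*}
```

Both expressions place $C_1$, $C_2$ and $C_3$ in one cluster, which is the sense of ≡ used in the proof. The decoding step of Algorithm 2 computes the centroid over all the data in both subtrees of a ‘∩’ node, so the grouping order does not change the center obtained there.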
Based on the Clustering Algebra, clusters are easily defined in the chromosome within the GEP framework.
Definition 3 (ClusterET). Let t be a valid expression tree in GEP. If all of the non-leaf nodes in t are ‘∩’ or ‘∪’, then t is called a ClusterET (Clustering Expression Tree).
An example of a ClusterET t is shown in Figure 1.
GEPCluster can cluster without any prior knowledge for the following reasons. (a) The initial number of clusters in the chromosomes is generated stochastically and need not be accurate. (b) The cluster centers in the chromosomes change during the evolution process, and the Clustering Algebra aggregates or segments them automatically. (c) The gene in each chromosome may mutate or recombine with other individuals in each round of evolution. As a result, some cluster centers are removed and some new cluster centers are generated. After several generations of evolution, GEPCluster can find the best cluster centers, from which the best clustering solution can be derived. The details of GEPCluster are described in Algorithm 1.
Algorithm 1 Clustering based on GEP (GEPCluster)
Input: the original data set S; the GEP parameters.
Output: the list of clusters.
(1) Initialize a population;
(2) Repeat
(3)   For each individual
(4)     Cluster S by the cluster centers hidden in the individual's chromosome;
(5)     Calculate the fitness value of the individual;
(6)   Keep the best individual;
(7)   Select individuals by roulette wheel selection;
(8)   Recombine individuals at rate $p_c$;
(9)   Mutate individuals at rate $p_m$;
(10) Until the maximum number of iterations is reached;
(11) Select the best individual;
(12) Call the AMCA function to optimize the clustering result; // details in Section 6
(13) Output the list of clusters;
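The control flow of steps (1) to (11) can be sketched as a generic evolutionary loop. This is a skeleton under stated assumptions, not the authors' implementation: the function `evolve` and the signatures of the four callables it receives are hypothetical, and the demonstration at the bottom uses a toy bit-string problem (maximize the number of ones) rather than clustering, purely to show that the loop runs. The rates 0.77 and 0.044 are the values reported in Section 5.5.

```python
import random

def evolve(population, fitness, select, recombine, mutate,
           p_c, p_m, max_iter, rng):
    """Generic loop following steps (1)-(11) of Algorithm 1.

    fitness, select, recombine and mutate are caller-supplied callables
    (hypothetical signatures chosen for this sketch).
    """
    best = max(population, key=fitness)
    for _ in range(max_iter):
        scores = [fitness(ind) for ind in population]         # steps (3)-(5)
        gen_best = population[scores.index(max(scores))]
        if fitness(gen_best) > fitness(best):                 # step (6)
            best = gen_best
        parents = select(population, scores, rng)             # step (7)
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            if rng.random() < p_c:                            # step (8)
                a, b = recombine(a, b, rng)
            children += [a, b]
        children += parents[len(children):]   # odd leftover parent, if any
        population = [mutate(c, p_m, rng) for c in children]  # step (9)
        population[0] = best                                  # elitism
    return max(population, key=fitness)                       # step (11)

# Toy operators on bit tuples, used only to exercise the loop.
def roulette(pop, scores, rng):
    return rng.choices(pop, weights=[s + 1 for s in scores], k=len(pop))

def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def flip(ind, p_m, rng):
    return tuple(1 - g if rng.random() < p_m else g for g in ind)

rng = random.Random(0)
start = [tuple(rng.randint(0, 1) for _ in range(12)) for _ in range(20)]
winner = evolve(start, sum, roulette, one_point, flip,
                p_c=0.77, p_m=0.044, max_iter=30, rng=rng)
```

Because the best individual found so far is written back into the population after every generation, the fitness of the returned individual can never fall below that of the best individual in the initial population.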
The advantages of GEPCluster come from two sources. On the one hand, a data object in the gene tail may or may not act as a cluster center, which frees us from setting the initial number of clusters in advance. On the other hand, the cluster centers can be merged or divided automatically by the Clustering Algebra during the evolution process. Therefore, GEPCluster avoids the need to predefine the number of clusters that arises in [3-6].
5. The data structure in GEPCluster
5.1. Encoding the chromosome
In GEPCluster, the chromosome of an individual is encoded with integer numbers, and each chromosome contains only one gene, which consists of a gene head and a gene tail. The gene head is created with elements from both the Function Set and the Terminal Set, whereas the gene tail is created using only terminals. The Function Set (FS) is {∪, ∩}, and the Terminal Set (TS) is $\{x_1, x_2, \ldots, x_n\}$, where $x_i$ ($i \in [1, n]$) is the sequence number of the i-th object in the data set.
Example 1. Consider a chromosome $R_0$ given by “$\cup\,\cap\,\cup\,x_1\,x_2\,\cup\,\cap\,x_3\,x_4\,x_5\,x_6\,x_7\,x_8\,x_9\,x_{10}$”. $R_0$ denotes the state of clustering in some round of evolution. Here, the symbol ‘$x_i$’ ($i \in [1, 10]$) represents the center of a cluster in the data set.
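A small sketch of this encoding follows. The paper does not state the head length; the code assumes the standard GEP sizing rule t = h(n_max − 1) + 1 for head length h and maximum arity n_max, which gives t = h + 1 for the two binary operators. $R_0$ is consistent with that rule for h = 7. The helper names and the 50/50 choice between functions and terminals in the head are assumptions made for illustration.

```python
import random

FUNCTIONS = ("∪", "∩")           # FS; both operators take two arguments

def tail_length(head_len, max_arity=2):
    """Standard GEP sizing rule: t = h * (n_max - 1) + 1."""
    return head_len * (max_arity - 1) + 1

def random_chromosome(head_len, n_objects, rng):
    """One-gene chromosome: the head mixes FS and TS, the tail holds TS only.

    Terminals are the integers 1..n_objects (object sequence numbers).
    The 50/50 split between functions and terminals is an assumption.
    """
    terminals = range(1, n_objects + 1)
    head = [rng.choice(FUNCTIONS) if rng.random() < 0.5
            else rng.choice(terminals) for _ in range(head_len)]
    tail = [rng.choice(terminals) for _ in range(tail_length(head_len))]
    return head + tail

# R0 from Example 1, with x_i written as the integer i.
R0 = ["∪", "∩", "∪", 1, 2, "∪", "∩", 3, 4, 5, 6, 7, 8, 9, 10]
head, tail = R0[:7], R0[7:]
print(len(tail) == tail_length(len(head)))      # True
print(all(isinstance(s, int) for s in tail))    # True
```

With this sizing rule, any head of length h can be completed into a valid tree using only tail symbols, whatever mixture of operators the head contains.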
5.2. Mapping the ClusterET
In GEPCluster, each chromosome in the population is mapped into a ClusterET in the same way as in GEP, so that its fitness can be calculated. $R_0$ in Example 1 can be mapped into the ClusterET t shown in Figure 1.
Figure 1. The ClusterET t for $R_0$.
5.3. Decoding the ClusterET
The decoding process is performed by preorder traversal of the ClusterET. The main steps are as follows.
(a) Start from the root of the ClusterET. If the root node is ‘∪’, divide its left subtree and right subtree into two different sets by Definition 2, that is, segment the clusters. If the root node is ‘∩’, traverse its subtrees from left to right, collect the nodes in these subtrees, and calculate their centroid by the Centroid function. Then put these nodes into the same cluster, whose center is represented by a special object in the data set: the data object nearest to the centroid of the data in the subtrees.
(b) Decode all the subtrees recursively in the same way.
After decoding, all the cluster centers encoded in the chromosome are known. The complete decoding process is shown in Algorithm 2 and illustrated in Example 2.
Algorithm 2 ClusterET Decoding (finding the centers of clusters)
Input: the ClusterET
Output: the cluster centers
(0) Proc DecodeClusterET(ClusterET);
(1) If (root IS NOT NULL) then
(2)   Switch (root):
(3)   Case ‘∪’:
(4)     Put Left SubTree into a set I;
(5)     Put Right SubTree into another set I;
(6)     For each set I: DecodeClusterET(set I);
(7)   Case ‘∩’:
(8)     DecodeClusterET(Left SubTree);
(9)     DecodeClusterET(Right SubTree);
(10)    centerpoint ← Centroid(Left SubTree and Right SubTree); // the centroid of all the data in both Left SubTree and Right SubTree
(11)    Find the clustercenter that is closest to centerpoint;
(12) Return all clustercenter;
(13) Endp;
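The decoding procedure can be sketched as follows, using $R_0$ from Example 1 with $x_i$ written as the integer i. Two readings of Algorithm 2 are assumed here and are not confirmed by the paper: a ‘∩’ node gathers every terminal beneath it into one group, following the comment on step (10), and the representative object is searched over the whole data set. The names `build_tree`, `leaves`, `decode`, and `cluster_center` are illustrative.

```python
from collections import deque

def build_tree(gene):
    """Breadth-first mapping of a gene to (symbol, children) nodes."""
    it = iter(gene)
    root = (next(it), [])
    queue = deque([root])
    while queue:
        symbol, children = queue.popleft()
        if symbol in ("∪", "∩"):             # both operators are binary
            for _ in range(2):
                child = (next(it), [])
                children.append(child)
                queue.append(child)
    return root

def leaves(node):
    """All terminals below a node, left to right."""
    symbol, children = node
    if not children:
        return [symbol]
    return [leaf for child in children for leaf in leaves(child)]

def decode(node):
    """Return groups of object indices; each group yields one cluster."""
    symbol, children = node
    if symbol == "∪":                         # segmentation: decode each side
        return decode(children[0]) + decode(children[1])
    return [leaves(node)]                     # '∩' node or a single terminal

def cluster_center(group, data):
    """Index of the object in `data` nearest to the group's centroid."""
    pts = [data[i] for i in group]
    dim = len(pts[0])
    c = [sum(p[k] for p in pts) / len(pts) for k in range(dim)]
    return min(data, key=lambda i: sum((data[i][k] - c[k]) ** 2
                                       for k in range(dim)))

R0 = ["∪", "∩", "∪", 1, 2, "∪", "∩", 3, 4, 5, 6, 7, 8, 9, 10]
print(decode(build_tree(R0)))  # [[1, 2], [3], [4], [5, 6]]
```

The four groups correspond to the four clusters discussed in Example 2: the pair (x1, x2) and the pair (x5, x6) are each aggregated, while x3 and x4 stay separate. The symbols x7 to x10 are never read by the mapping.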
Example 2. The ClusterET t in Figure 1 is decoded into the sub-ETs shown in Figure 2, and the cluster centers hidden in t are found by decoding it.
Figure 2. Decoding the ClusterET t. (a) The original ClusterET. (b)-(e) The process of decoding.
Figure 2 shows how t in Figure 1 is decoded step by step. By Algorithm 2, the root node 1 in t (Figure 2(a)) is ‘∪’, so its two child branches are first divided into two sub-ClusterETs, $t_1$ and $t_2$ in Figure 2(a). The root node 2 of $t_1$ is ‘∩’, so we aggregate its two child branches into node $x_1'$ by $\mathrm{Centroid}(x_1, x_2)$, shown in Figure 2(b). $t_2$ is processed in the same way. Finally, four sub-ClusterETs are obtained, shown in Figure 2(b), (c), (d) and (e), which means that t in Figure 1 contains four clusters.
We have now found all the cluster centers in Figure 2(a). That is to say, four clusters are hidden in chromosome $R_0$, and their centers are $x_1'$, $x_3$, $x_4$ and $x_5'$.
5.4. Fitness evaluation
After the cluster centers hidden in the chromosome are obtained, the fitness of each individual is evaluated. The basic steps are as follows.
(a) Cluster the data set by the center points obtained in Section 5.3.
(b) Recalculate the center of each cluster, and search for a new center point to represent the cluster.
(c) Calculate the sum of the mean square error in each cluster. The total mean square error E is defined [1] as
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \left| p - m_i \right|^2, \qquad m_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j,$$
where k is the number of clusters, $m_i$ is the center of cluster $C_i$, and $n_i$ is the number of data points in cluster $C_i$. The smaller E is, the better the clustering result. Since in GEP the best individual is usually the one with the maximum fitness, we let the fitness function be f = 1/E in GEPCluster.
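Steps (a) to (c) and the fitness f = 1/E can be sketched as below. The function names are illustrative, the center $m_i$ is taken as the cluster mean exactly as in the formula, and returning infinity when E = 0 is an assumption of this sketch, since the paper does not say how that case is handled.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def fitness(points, centers):
    """Assign points to the nearest center, then return (E, f = 1/E)."""
    clusters = [[] for _ in centers]
    for p in points:                                   # step (a)
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(p, centers[i]))
        clusters[nearest].append(p)
    error = 0.0
    for members in clusters:
        if not members:
            continue                                   # empty cluster adds nothing
        dim = len(members[0])
        mean = [sum(p[k] for p in members) / len(members)   # m_i, step (b)
                for k in range(dim)]
        error += sum(squared_distance(p, mean) for p in members)  # step (c)
    # Assumption: a perfect fit (E = 0) gets infinite fitness.
    return error, (1.0 / error if error > 0 else float("inf"))

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(fitness(points, [(0, 1), (10, 1)]))  # (4.0, 0.25)
```

In the example each cluster holds two points at distance 1 from its mean, so each contributes 2 to E, giving E = 4 and f = 0.25.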
5.5. Genetic operators in GEPCluster
(1) Selection operator
In GEPCluster, individuals are selected according to their fitness by roulette wheel sampling. An elitist strategy is also used: the best individual of each generation is always carried over to the next generation.
(2) Recombination operator
Both one-point and two-point recombination are used in GEPCluster. In our experience, GEPCluster performs better when the recombination rate is set to 0.77. The recombination position in a chromosome is selected randomly. If exchanging the alleles would produce an invalid gene, the alleles are left unchanged; otherwise, they are exchanged.
(3) Mutation operator
As is common in the literature [7], the mutation rate is 0.044 in GEPCluster. We first select the mutation position and the allele in the chromosome, and then mutate the allele. There are two cases. If the allele is ‘∩’ or ‘∪’, it is mutated directly according to Rule 1. If the allele is in TS, we select one data point from the data set at random and replace the allele at the mutation position with it. If this leads to an invalid gene, we abandon the operation and select another data point at random until no invalid gene appears.
Rule 1 (∩ → ∪, ∪ → ∩). During the mutation operation, the allele ‘∩’ is changed into ‘∪’, and the allele ‘∪’ is changed into ‘∩’.
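The three operators can be sketched as follows for chromosomes stored as Python lists, with terminals written as integers 1..n. The function names are illustrative. The sketch assumes both parents share the same head length, so swapping position-aligned segments can never move an operator into the tail; for that reason the paper's invalid-gene check is not modeled here.

```python
import random

FLIP = {"∪": "∩", "∩": "∪"}      # Rule 1: each operator turns into the other

def roulette_select(population, fitnesses, rng):
    """Fitness-proportional sampling; the best individual is always kept."""
    best = population[max(range(len(population)), key=fitnesses.__getitem__)]
    picks = rng.choices(population, weights=fitnesses, k=len(population) - 1)
    return [best] + picks

def one_point_recombine(a, b, rng):
    """Swap everything after one randomly chosen cut position."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def two_point_recombine(a, b, rng):
    """Swap the segment between two randomly chosen cut positions."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def mutate(chromosome, n_objects, p_m, rng):
    """Flip operators by Rule 1; replace terminals by a random object."""
    out = []
    for allele in chromosome:
        if rng.random() < p_m:
            # Operators are looked up in FLIP; integers fall through to
            # a fresh random object index.
            allele = FLIP.get(allele) or rng.randint(1, n_objects)
        out.append(allele)
    return out
```

A typical call would be `mutate(chromosome, n_objects, 0.044, random.Random())`, using the mutation rate given above. Mutation keeps operators as operators and terminals as terminals, so the shape of the decoded tree is unchanged and only its labels vary.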
6. AMCA
Generally, all the clusters can be identified distinctly after several rounds of evolution. However, some of the clusters found may still need to be adjusted. For this purpose we propose AMCA (Automatic Merging Cluster Algorithm), with which some clusters can be merged. The details of AMCA are shown in Algorithm 3.
Algorithm 3 AMCA
Input: the initial cluster list
Output: the most proper cluster list
(1) Repeat
(2)   Arbitrarily choose one cluster $C_i$, and sort the remaining clusters by $C_i$;
(3)   Find the point $x_i$ in $C_i$ that is nearest to $C_j$;
(4)   Find the point $x_j$ in $C_j$ that is nearest to $C_i$;
(5)   $D_{ij}$ ← the Euclidean distance between $x_i$ and $x_j$;
(6)   $D_i$ ← the mean value of cluster $C_i$;
(7)   $D_j$ ← the mean value of cluster $C_j$;
(8)   If ($D_{ij} \le D_i$ or $D_{ij} \le D_j$) then
(9)     $C_{ij}$ ← merge clusters $C_i$ and $C_j$;
(10)    Remove $C_j$ from the list;
(11) Until no change in the cluster list;
(12) Output the cluster list;
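One possible reading of Algorithm 3 is sketched below. Several details are assumptions, because the pseudocode leaves them open: $D_{ij}$ is taken as the distance between the closest pair of points across the two clusters, "the mean value of cluster $C_i$" is interpreted as the mean nearest-neighbour distance inside the cluster, and all pairs are scanned in list order instead of choosing $C_i$ arbitrarily. The function names are illustrative.

```python
from math import dist   # Euclidean distance, Python 3.8+

def closest_gap(ci, cj):
    """D_ij: distance between the closest pair (x_i in C_i, x_j in C_j)."""
    return min(dist(p, q) for p in ci for q in cj)

def mean_spacing(cluster):
    """Assumed D_i: mean nearest-neighbour distance inside the cluster."""
    if len(cluster) < 2:
        return 0.0
    return sum(min(dist(p, q) for b, q in enumerate(cluster) if b != a)
               for a, p in enumerate(cluster)) / len(cluster)

def amca(clusters):
    """Merge clusters until no pair satisfies D_ij <= D_i or D_ij <= D_j."""
    clusters = [list(c) for c in clusters]
    changed = True
    while changed:
        changed = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d_ij = closest_gap(clusters[i], clusters[j])
                if (d_ij <= mean_spacing(clusters[i])
                        or d_ij <= mean_spacing(clusters[j])):
                    clusters[i] += clusters.pop(j)    # merge, remove C_j
                    changed = True
                    break
            if changed:
                break
    return clusters

a = [(0, 0), (1, 0), (2, 0)]
b = [(3, 0), (4, 0)]
c = [(20, 0), (21, 0)]
print(len(amca([a, b, c])))  # 2
```

In the example the gap between a and b equals the typical spacing inside each of them, so they are merged, while c lies far away and is left alone. Under a different interpretation of the "mean value" the merge decisions could differ.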
7. Empirical evaluation
We tested GEPCluster on two data sets, both of which are synthetic.
(1) Results on Data Set 1
Data set 1, shown in Figure 3(a), contains 270 two-dimensional points and is composed of four clusters: two spherical clusters, a convex cluster, and a line-shaped cluster. All of these clusters have almost the same density.
Figure 3(b) shows the clustering result on data set 1. We ran GEPCluster 50 times; 48 runs succeeded and only 2 failed, a success rate of 96%.
Figure 3. Data set 1 with four clusters: (a) the original data set, (b) the result of clustering.
(2) Results on Data Set 2
Data set 2, shown in Figure 4(a), contains 254 two-dimensional points and is composed of three clusters: a rectangle-shaped cluster and two circle-shaped clusters. Note that the two circle-shaped clusters differ distinctly in size and density.
Figure 4(b) shows the clustering result on data set 2. We ran GEPCluster 50 times; 49 runs succeeded and only one failed, a success rate of 98%.
Figure 4. Data set 2 with three clusters: (a) the original data set, (b) the result of clustering.
8. Conclusions
In this study, we have proposed a new concept named Clustering Algebra, which treats clustering as an algebraic operation. We have also proposed GEPCluster, a novel clustering algorithm based on GEP that needs no prior knowledge. Moreover, we have proposed the AMCA algorithm to merge clusters automatically. Experiments on synthetic data sets demonstrate that GEPCluster clusters effectively and automatically without any prior knowledge.
In practice we have found that GEPCluster still has some small flaws, such as sensitivity to noise and limited efficiency on high-dimensional data. These problems are under study, and we expect to report on them in future work.
9. References
[1] J. W. Han, M. Kamber. Data Mining: Concepts and Techniques. Beijing: Higher Education Press, 2001.
[2] Jaewook Lee, Daewon Lee. Dynamic Characterization of Cluster Structures for Robust and Inductive Support Vector Clustering. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, pp. 1869-1874, 2006.
[3] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp., vol. I, 1967, pp. 281-297.
[4] G. P. Babu, M. N. Murty. Clustering with evolution strategies. Pattern Recognition, 27(2), pp. 321-329, 1994.
[5] C. A. Murthy, N. Chowdhury. In search of optimal clusters using genetic algorithms. Pattern Recognition Letters, 17, pp. 825-832, 1996.
[6] S. Bandyopadhyay, U. Maulik. An evolutionary technique based on K-means algorithm for optimal clustering in R^N. Information Sciences: An International Journal, vol. 146, no. 1, pp. 221-237, 2002.
[7] C. Ferreira. Gene Expression Programming: A New Adaptive Algorithm for Solving Problems. Complex Systems, 13(2): 87-129, 2001.
[8] Shaojie Qiao, Changjie Tang, Jing Peng, Jianjun Hu, Huan Zhang. BPGEP: Robot Path Planning based on Backtracking Parallel-Chromosome GEP. Proceedings of the International Conference on Sensing, Computing and Automation, DCDIS series B: Application and Algorithm, Watam Press, pp. 439-444, 2006.