Clustering Without Prior Knowledge Based on Gene Expression Programming*



Yu Chen(1), Changjie Tang(1), Jun Zhu(2), Chuan Li(1), Shaojie Qiao(1), Rui Li(3), Jiang Wu(1)

(1) College of Computer Science, Sichuan University, Chengdu, China
{chenyu05, tangchangjie}@cs.scu.edu.cn
(2) National Center for Birth Defects Monitoring, Chengdu, China
zhu_jun1@163.com
(3) Department of Electrical Engineering, University of California Riverside, CA, USA
ruili@ee.ucr.edu
Abstract

Most existing clustering methods require prior knowledge, such as the number of clusters or thresholds, which is difficult to determine accurately in practice. To solve this problem, this study proposes a novel clustering algorithm named GEP-Cluster, based on Gene Expression Programming (GEP), that requires no prior knowledge. The main contributions include: (1) a new concept named Clustering Algebra, which treats clustering as an algebraic operation; (2) the GEP-Cluster algorithm, which uses GEP to find the best cluster centers automatically and discovers the best clustering solution without any prior knowledge; (3) an AMCA (Automatic Merging Cluster Algorithm), which merges clusters automatically. Extensive experiments demonstrate that GEP-Cluster is effective for clustering without any prior knowledge on various data sets.

1. Introduction

Clustering is an important knowledge discovery technique widely used in scientific practice. Its key task is to partition a set of data in a d-dimensional feature space into sub-groups while maximizing the similarity within each group and minimizing the similarity between different groups. Many clustering algorithms have been proposed, such as partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods [1], and clustering based on Support Vectors [2]. However, most existing algorithms need prior knowledge, such as the number of clusters in K-Means [3]. Unfortunately, such prior knowledge is usually difficult to obtain accurately in practice.


* This work was supported by the National Natural Science Foundation of China under Grant No. 60473071, the 11th Five-Year Key Programs for Sci. & Tech. Development of China under Grant No. 2006038002003, and the Youth Foundation, Sichuan University, 06036: JS20070405506408.
Clustering without any prior knowledge is an urgent need in many applications. To address this problem, this study proposes the GEP-Cluster algorithm, which is based on GEP. Compared with traditional clustering algorithms, GEP-Cluster has the following characteristics: (a) it does not need prior knowledge, and (b) it divides or merges clusters automatically. Extensive experiments show that the GEP-Cluster algorithm is effective in solving the clustering problem without any prior knowledge.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the preliminary terminology and concepts of GEP. Section 4 presents the GEP-Cluster algorithm and proposes a new concept named Clustering Algebra, which treats the clustering process as an algebraic operation. Section 5 describes the data structures of GEP-Cluster in detail. Section 6 presents the AMCA algorithm. Section 7 gives the experimental results. Section 8 concludes the paper and discusses future work.

2. Related work

K-Means [3] is a popular clustering algorithm. It is simple, efficient, and widely applied in practice. However, it has two main limitations: (a) it is sensitive to the initial cluster centers and often converges to a local minimum; (b) it requires users to specify the number of clusters. Many algorithms have been proposed to overcome these problems, and recently evolutionary strategies have also been explored. Babu and Murty proposed a clustering algorithm based on evolution strategies [4], but it needs the number of clusters as prior knowledge. Murthy and Chowdhury proposed a clustering algorithm based on genetic algorithms (GA) [5]; it encodes each chromosome as a binary string in which every bit represents one object in the data set, and due to this limit on chromosome length it only works well on small data sets. In [6], Bandyopadhyay and Maulik encode every cluster center into the chromosome to deal with larger data sets, but their method still requires the number of clusters to be preset.
In this study, we propose a prior-knowledge-free clustering algorithm named GEP-Cluster, based on GEP. It has a high probability of obtaining the optimal solution to the clustering problem without any prior knowledge.

3. Preliminary concepts in GEP

GEP was proposed by Candida Ferreira in 2001 and is a new member of the evolutionary computation family. GEP combines the advantages of both GA and Genetic Programming (GP) while overcoming some of their individual limitations, and it is better suited to difficult problems than GA and GP [7-8]. The individual chromosomes in GEP are encoded as simple linear strings of fixed length, and each chromosome is composed of one or more genes. Each gene in the chromosome is divided into a gene head and a gene tail.
When evaluating a chromosome's fitness, the chromosome must first be mapped into an expression tree (ET) in a breadth-first fashion. The mapping starts from the first position in the chromosome, which corresponds to the root of the ET. If the root node is a terminal, the mapping stops; otherwise, its child branches are processed from left to right. The number of child branches is determined by the number of arguments of their parent node. All child branches are read from the chromosome one by one. This process continues layer by layer until all leaf nodes in the ET are elements of the terminal set. After mapping a chromosome into an ET, the fitness of the chromosome can be calculated by decoding the ET through inorder traversal.
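For concreteness, the breadth-first mapping can be sketched in Python as follows (a minimal illustration of our own, using the '∩'/'∪' function set introduced later in Section 4; the class and function names are not from the paper):

from collections import deque

class Node:
    def __init__(self, symbol):
        self.symbol = symbol      # a function ('∩', '∪') or a terminal
        self.children = []

ARITY = {'∩': 2, '∪': 2}          # both clustering operators take two arguments

def map_to_et(chromosome):
    """Build an expression tree from a linear chromosome, breadth-first."""
    root = Node(chromosome[0])
    queue, pos = deque([root]), 1
    while queue:
        node = queue.popleft()
        for _ in range(ARITY.get(node.symbol, 0)):   # terminals have arity 0
            child = Node(chromosome[pos])
            pos += 1
            node.children.append(child)
            queue.append(child)   # children filled layer by layer, left to right
    return root

For example, the chromosome R0 of Example 1 (Section 5.1) maps to the Cluster-ET of Figure 1 under this procedure; its last few tail symbols are simply never read, which is standard GEP behavior.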

4. GEP-Cluster algorithm

Clustering can be viewed as an optimization problem in
the feature space. GEP performs well in global
optimization. Thus we propose a GEP-Cluster algorithm
based on GEP.
In GEP-Cluster, two special clustering operators are introduced, namely '∩' and '∪', described in Definition 2.
Definition 1 (Centroid(O1, O2)). Let O1 and O2 be points in the d-dimensional space, O1 = {x1, x2, ..., xd} and O2 = {y1, y2, ..., yd}. The centroid of O1 and O2 is computed by the Centroid function, defined as Centroid(O1, O2) = {(x1+y1)/2, (x2+y2)/2, ..., (xd+yd)/2}.
Definition 2 (Clustering Algebra). Let S be a d-dimensional space, and let P = 2^S be the power set of S. Let O1 and O2 be the centers of clusters C1 and C2.
(1) The aggregation operator '∩' over P is defined as O1 ∩ O2 = Centroid(O1, O2).
(2) The segmentation operator '∪' over P is defined as O1 ∪ O2 = {O1, O2}.
(3) The 3-tuple CA = (P, ∩, ∪) is called a Clustering Algebra over P.
Note that, in case (1), C1 and C2 are aggregated into a cluster C with center O = Centroid(O1, O2). In case (2), C1 and C2 remain two distinct clusters whose centers are O1 and O2, respectively.
The significance of these operations is that they describe the automatic aggregation and segmentation of clusters algebraically. The following lemma describes the properties of the Clustering Algebra.
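For illustration, the Centroid function and the two operators can be written as plain Python functions (a minimal sketch; the names centroid, aggregate, and segment are ours, since the paper defines the operators abstractly):

def centroid(o1, o2):
    """Definition 1: component-wise midpoint of two d-dimensional points."""
    return tuple((x + y) / 2.0 for x, y in zip(o1, o2))

def aggregate(o1, o2):
    """The '∩' operator: merge two cluster centers into one center."""
    return centroid(o1, o2)

def segment(o1, o2):
    """The '∪' operator: keep two cluster centers distinct."""
    return [o1, o2]

# Aggregating (0, 0) and (2, 4) yields the single center (1.0, 2.0),
# while segmenting them keeps both centers.
assert aggregate((0, 0), (2, 4)) == (1.0, 2.0)
assert segment((0, 0), (2, 4)) == [(0, 0), (2, 4)]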
Lemma 1. Let CA = (P, ∩, ∪) be a Clustering Algebra over P. Then
(1) ∩ and ∪ are commutative;
(2) ∩ and ∪ are associative.
Proof. (1) The commutative law is obvious, since each clustering operator is symmetric with respect to its two operands.
(2) Let O1, O2 and O3 be the centers of three clusters C1, C2 and C3. By Definition 2, the formula (O1 ∩ O2) ∩ O3 indicates that C1 and C2 are aggregated into a cluster C' whose center is Centroid(O1, O2), and then C' and C3 are aggregated into a cluster C'' whose center is Centroid(Centroid(O1, O2), O3). That is to say, C1, C2 and C3 are three sub-clusters of C'', and thus C1, C2 and C3 are aggregated.
Similarly, the formula O1 ∩ (O2 ∩ O3) indicates that O1, O2 and O3 lie in the same cluster C''. Therefore, the equation (O1 ∩ O2) ∩ O3 ≡ O1 ∩ (O2 ∩ O3) holds. In the same way, (O1 ∪ O2) ∪ O3 ≡ O1 ∪ (O2 ∪ O3) holds.
Based on the Clustering Algebra, clusters can easily be defined in a chromosome within the GEP framework.
Definition 3 (Cluster-ET). Let t be a valid expression tree in GEP. If all of the non-leaf nodes in t are '∩' or '∪', then t is called a Cluster-ET (Clustering Expression Tree).
An example of a Cluster-ET t is shown in Figure 1.
GEP-Cluster can perform clustering without any prior knowledge for the following reasons. (a) The initial number of clusters in the chromosomes is stochastic and need not be accurate. (b) The cluster centers in the chromosomes change during the evolution process; the Clustering Algebra aggregates or segments the cluster centers automatically. (c) The gene in each chromosome may mutate or recombine with other individuals during each round of evolution; as a result, some cluster centers are removed and new cluster centers are generated. After several generations of evolution, GEP-Cluster can find the best cluster centers, and the best clustering solution can be derived from these centers. The details of GEP-Cluster are described in Algorithm 1.
Algorithm 1 Clustering based on GEP (GEP-Cluster)
Input: the original data set S; the GEP parameters.
Output: the list of clusters
(1) Initialize a population;
(2) Repeat
(3) For each individual
(4) Cluster S by the cluster centers hidden in the individual's chromosome;
(5) Calculate the fitness value of each individual;
(6) Keep the best individual;
(7) Select individuals by roulette-wheel selection;
(8) Recombine individuals at rate pc;
(9) Mutate individuals at rate pm;
(10) Until (maximum iterations are attained);
(11) Select the best individual;
(12) Call the AMCA function to optimize the clustering result; // detailed in Section 6
(13) Output the list of clusters;
The advantages of GEP-Cluster derive from two sources. On one hand, a datum in the gene tail may or may not be a cluster center; this feature frees us from setting the initial number of clusters in advance. On the other hand, cluster centers can be merged or divided automatically by the Clustering Algebra during the evolution process. Therefore, GEP-Cluster avoids the problem of predefining the number of clusters found in [3-6].

5. The data structure in GEP-Cluster

5.1. Encoding the chromosome

In GEP-Cluster, the chromosome of an individual is encoded with integer numbers, and each chromosome contains only one gene, which consists of a gene head and a gene tail. The gene head may contain functions or terminals, whereas only terminals may appear in the gene tail. The Function Set (FS) is {∩, ∪}, and the Terminal Set (TS) is {x1, x2, ..., xn}, where xi is the sequence number of the i-th object in the data set. Thus, the gene head is created with elements from both FS and TS, but the gene tail is created using only terminals.
Example 1. Consider a chromosome R0 given by "∪∩∪x1x2∪∩x3x4x5x6x7x8x9x10". R0 denotes the state of clustering in some round of evolution. Here, the symbol xi (i ∈ [1, 10]) represents the center of a cluster in the data set.
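For illustration, such a gene can be generated as follows (a sketch of our own; it applies the standard GEP tail-length rule t = h(n-1) + 1, which for maximum arity n = 2 gives a tail of h + 1 symbols, matching the 7-symbol head and 8-symbol tail of R0):

import random

def random_gene(head_len, n_objects):
    """Generate one valid gene: head drawn from FS ∪ TS, tail from TS only."""
    fs = ['∩', '∪']
    ts = [f'x{i}' for i in range(1, n_objects + 1)]
    tail_len = head_len + 1          # t = h*(n-1) + 1 with maximum arity n = 2
    head = [random.choice(fs + ts) for _ in range(head_len)]
    tail = [random.choice(ts) for _ in range(tail_len)]
    return head + tail

# e.g. random_gene(7, 10) can produce a chromosome with the shape of R0.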

5.2. Mapping the Cluster-ET

In GEP-Cluster, each chromosome in the population is mapped into a Cluster-ET, similarly to the standard method in GEP, so that its fitness can be calculated. The chromosome R0 of Example 1 maps into the Cluster-ET t shown in Figure 1.


Figure 1. The Cluster-ET t for R0.

5.3. Decoding the Cluster-ET

The decoding process is performed by a preorder traversal of the Cluster-ET. The main steps are as follows. (a) Start from the root of the Cluster-ET. If the value of the root node is '∪', divide its left sub-tree and right sub-tree into two different sets according to Definition 2, i.e., achieve segmentation among the clusters. If the value of the root node is '∩', traverse its sub-trees from left to right, collect their nodes, and calculate their centroid with the Centroid function; then put these nodes into the same cluster, whose center is denoted by a special object in the data set, namely the data point nearest to the centroid of the data in the sub-trees. (b) Decode all sub-trees recursively in the same manner.
Then all the cluster centers encoded in the chromosome are found. The complete decoding process is shown in Algorithm 2 and explained in Example 2.
Algorithm 2 Cluster-ET decoding (finding the centers of clusters)
Input: the Cluster-ET
Output: the cluster centers
(0) Proc DecodeClusterET(Cluster-ET);
(1) If (root IS NOT NULL) then
(2) Switch (root):
(3) Case '∪':
(4) Put the left sub-tree into a set I;
(5) Put the right sub-tree into a set I;
(6) For each set I: DecodeClusterET(I);
(7) Case '∩':
(8) DecodeClusterET(left sub-tree);
(9) DecodeClusterET(right sub-tree);
(10) center-point ← Centroid(left sub-tree, right sub-tree); // the centroid of all the data in both sub-trees
(11) Find the cluster center closest to the center-point;
(12) Return all cluster centers;
(13) Endp;
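A Python sketch of Algorithm 2 might look as follows (our own reading, reusing the Node tree of the Section 3 sketch; the points table stands in for the data set and its values are placeholders):

# 'points' maps each terminal to its coordinates; the values are placeholders.
points = {f'x{i}': (float(i), 0.0) for i in range(1, 11)}

def leaves(node):
    """All data points at the leaves of a (sub-)Cluster-ET."""
    if not node.children:
        return [points[node.symbol]]
    return [p for c in node.children for p in leaves(c)]

def nearest_object(target, data):
    """The data point closest to the target (squared Euclidean distance)."""
    return min(data, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, target)))

def decode(node):
    """Return the list of cluster centers hidden in a Cluster-ET."""
    if not node.children:              # a lone terminal is itself a center
        return [points[node.symbol]]
    if node.symbol == '∪':             # segmentation: decode the branches apart
        return decode(node.children[0]) + decode(node.children[1])
    pts = leaves(node)                 # '∩' aggregation: centroid of all leaf data,
    c = tuple(sum(dim) / len(pts) for dim in zip(*pts))
    return [nearest_object(c, pts)]    # nearest real object becomes the center

Applied to the tree t of Figure 1, decode returns four centers, matching Example 2 below.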
Example 2. The Cluster-ET t in Figure 1 is decoded into the sub-ETs shown in Figure 2, and the cluster centers hidden in t are found by decoding it.


Figure 2. Decode the Cluster-ET t. (a) The
original Cluster-ET. (b) - (e) The process of
decoding.
Figure 2 shows how t from Figure 1 is decoded step by step. By Algorithm 2, the root node 1 of t (Figure 2(a)) is '∪', so its two child branches are first divided into two sub-Cluster-ETs, t1 and t2 in Figure 2(a). Next, the root node 2 of t1 is '∩', so its two child branches are aggregated into a node x1' by Centroid(x1, x2), as marked in Figure 2(b). The same procedure applies to t2. Finally, four sub-Cluster-ETs are obtained, shown in Figure 2(b), (c), (d) and (e); this shows that four clusters exist in the t of Figure 1.
We have now found all the cluster centers in Figure 2(a). That is to say, there are four clusters hidden in chromosome R0, and the centers of these clusters are x1', x3, x4 and x5'.

5.4. Fitness evaluation

After obtaining the cluster centers hidden in the chromosome, the fitness of each individual must be evaluated. The basic steps are as follows.
(a) Cluster the data set by the center points obtained in Section 5.3.
(b) Recalculate the center of each cluster, and search for a new center point to denote the cluster.
(c) Calculate the sum of the mean square error in each cluster; the total mean square error E is defined [1] as:
E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2, \quad m_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j,

where k is the number of clusters, m_i is the center of cluster C_i, and n_i is the number of data points in cluster C_i. The smaller E is, the better the clustering result. As is usual in GEP, the best individual is considered to be the one with the maximum fitness; therefore, we let the fitness function be f = 1/E in GEP-Cluster.
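Under the assumption that clusters are formed by nearest-center assignment, this fitness evaluation can be sketched as follows (a minimal sketch; function names are ours, and centers and data points are coordinate tuples):

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fitness(centers, data):
    """f = 1/E, where E is the total squared error after nearest-center assignment."""
    clusters = {c: [] for c in centers}
    for p in data:                                   # step (a): cluster the data
        clusters[min(centers, key=lambda c: sq_dist(p, c))].append(p)
    E = 0.0
    for members in clusters.values():
        if not members:
            continue
        m = tuple(sum(dim) / len(members) for dim in zip(*members))  # step (b)
        E += sum(sq_dist(p, m) for p in members)     # step (c): squared error
    return 1.0 / E if E > 0 else float('inf')       # guard the degenerate E = 0 case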

5.5. Genetic operators in GEP-Cluster

(1) Selection operator
In GEP-Cluster, individuals are selected according to fitness by roulette-wheel sampling, and an elitist strategy is also used; that is, the best individual of each generation is always carried over to the succeeding generation.
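A minimal sketch of this selection scheme (function and parameter names are ours):

import random

def roulette_select(population, fitnesses, n):
    """Fitness-proportionate sampling; the best individual always survives."""
    best = max(range(len(population)), key=lambda i: fitnesses[i])
    survivors = [population[best]]                 # elitist strategy
    total = sum(fitnesses)
    while len(survivors) < n:
        r, acc = random.uniform(0, total), 0.0
        for ind, f in zip(population, fitnesses):
            acc += f                               # spin the wheel
            if acc >= r:
                survivors.append(ind)
                break
    return survivors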
(2) Recombination operator
Both one-point and two-point recombination are used in GEP-Cluster. In our experience, GEP-Cluster performs best when the recombination rate is set to 0.77. The recombination position in a chromosome is selected randomly. If exchanging the alleles would produce an invalid gene, the alleles are kept unchanged; otherwise, they are exchanged.
(3) Mutation operator
As often used in the literature [7], the mutation rate is set to 0.044 in GEP-Cluster. We first select the mutation position and the allele in the chromosome, and then mutate the allele. There are two cases. If the allele is '∩' or '∪', the mutation is made directly based on Rule 1. Otherwise, if the allele is in TS, we select one data point from the data set at random and replace the allele at the mutation position with it. If this operation leads to an invalid gene, we abandon it and select another data point at random until no invalid gene appears.
Rule 1 (∩ → ∪, ∪ → ∩). During the mutation operation, the allele '∩' is changed into '∪', and the allele '∪' is changed into '∩'.
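A sketch of the mutation operator follows (our own simplification: since both functions are binary and a terminal always mutates into another terminal, gene validity is preserved automatically here, so the retry loop of the text is not needed):

import random

def mutate(chromosome, n_objects, p_m=0.044):
    """Point mutation: Rule 1 swaps '∩' and '∪'; a terminal allele is
    replaced by a randomly chosen object from the data set."""
    out = list(chromosome)
    for i, allele in enumerate(out):
        if random.random() >= p_m:
            continue
        if allele == '∩':
            out[i] = '∪'                   # Rule 1: ∩ → ∪
        elif allele == '∪':
            out[i] = '∩'                   # Rule 1: ∪ → ∩
        else:
            out[i] = f'x{random.randint(1, n_objects)}'   # new random data point
    return out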

6. AMCA


Generally, all the clusters can be identified distinctly after several rounds of evolution. However, some of the clusters found may still need adjustment. For this purpose, AMCA (Automatic Merging Cluster Algorithm) is proposed; with AMCA, some clusters can be merged. The details of AMCA are shown in Algorithm 3.
Algorithm 3 AMCA
Input: the initial cluster list
Output: the most proper cluster list
(1) Repeat
(2) Arbitrarily choose one cluster Ci, and sort the remaining clusters by their distance to Ci;
(3) Find the point xi in Ci that is nearest to Cj;
(4) Find the point xj in Cj that is nearest to Ci;
(5) Dij ← the Euclidean distance between xi and xj;
(6) Di ← the mean value of cluster Ci;
(7) Dj ← the mean value of cluster Cj;
(8) If (Dij ≤ Di or Dij ≤ Dj) then
(9) Cij ← merge clusters Ci and Cj;
(10) Remove Cj from the list;
(11) Until no change in the cluster list;
(12) Output the cluster list;
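A Python sketch of AMCA follows. It assumes that "the mean value of a cluster" denotes the mean pairwise distance within the cluster, and that Dij is the distance between the closest pair of points from the two clusters; both readings are ours, since the paper leaves these steps informal:

from itertools import combinations
from math import dist   # Python 3.8+

def mean_intra_dist(cluster):
    """Mean pairwise Euclidean distance within a cluster (0 for singletons)."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def amca(clusters):
    """Repeatedly merge any pair of clusters whose gap lies within either
    cluster's internal spread, until the cluster list stabilizes."""
    changed = True
    while changed:
        changed = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = clusters[i], clusters[j]
                d_ij = min(dist(a, b) for a in ci for b in cj)  # closest points
                if d_ij <= mean_intra_dist(ci) or d_ij <= mean_intra_dist(cj):
                    clusters[i] = ci + cj        # merge Cj into Ci
                    del clusters[j]
                    changed = True
                    break
            if changed:
                break
    return clusters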

7. Empirical evaluation

We test GEP-Cluster on two synthetic data sets.
(1) Results on Data Set 1
Data set 1, shown in Figure 3(a), contains 270 two-dimensional points forming four clusters: two spherical clusters, a convex cluster, and a line-shaped cluster. All of these clusters have almost the same density.
Figure 3(b) shows the clustering result on data set 1. We ran GEP-Cluster 50 times with random initializations; 48 trials succeeded and only 2 failed, a success rate of 96%.

Figure 3. Data set 1 with four clusters: (a) the original data set; (b) the clustering result.
(2) Results on Data Set 2
Data set 2, shown in Figure 4(a), contains 254 two-dimensional points forming three clusters: a rectangular cluster and two circular clusters. Notice that the two circular clusters differ distinctly in size and density.
Figure 4(b) shows the clustering result on data set 2. We ran GEP-Cluster 50 times; 49 trials succeeded and only one failed, a success rate of 98%.

Figure 4. Data set 2 with three clusters: (a) the original data set; (b) the clustering result.

8. Conclusions

In this study, we have proposed a new concept named Clustering Algebra, which treats clustering as an algebraic operation. Based on it, we have proposed a novel clustering algorithm named GEP-Cluster that is built on GEP and requires no prior knowledge, together with an AMCA algorithm that merges clusters automatically. Extensive experiments demonstrate that GEP-Cluster clusters effectively and automatically, without any prior knowledge, on various data sets.
In practice we have found that GEP-Cluster still has some weaknesses, such as sensitivity to noise and limited efficiency on high-dimensional data. These problems are under study, and we expect to address them in future work.

9. References

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Beijing: Higher Education Press, 2001.

[2] J. Lee and D. Lee, Dynamic Characterization of Cluster Structures for Robust and Inductive Support Vector Clustering, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, pp. 1869-1874, 2006.

[3] J. MacQueen, Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability, vol. I, 1967, pp. 281-297.

[4] G. P. Babu and M. N. Murty, Clustering with evolution strategies, Pattern Recognition, vol. 27, no. 2, pp. 321-329, 1994.

[5] C. A. Murthy and N. Chowdhury, In search of optimal clusters using genetic algorithms, Pattern Recognition Letters, vol. 17, pp. 825-832, 1996.

[6] S. Bandyopadhyay and U. Maulik, An evolutionary technique based on K-means algorithm for optimal clustering in R^N, Information Sciences: An International Journal, vol. 146, no. 1, pp. 221-237, 2002.

[7] C. Ferreira, Gene Expression Programming: A New Adaptive Algorithm for Solving Problems, Complex Systems, vol. 13, no. 2, pp. 87-129, 2001.

[8] S. Qiao, C. Tang, J. Peng, J. Hu and H. Zhang, BPGEP: Robot Path Planning based on Backtracking Parallel-Chromosome GEP, in Proc. of the International Conference on Sensing, Computing and Automation, DCDIS Series B: Application and Algorithm, Watam Press, pp. 439-444, 2006.