Clustering Without Prior Knowledge Based on Gene Expression Programming*

Yu Chen¹, Changjie Tang¹, Jun Zhu², Chuan Li¹, Shaojie Qiao¹, Rui Li³, Jiang Wu¹

¹ College of Computer Science, Sichuan University, Chengdu, China
{chenyu05, tangchangjie}@cs.scu.edu.cn

² National Center for Birth Defects Monitoring, Chengdu, China
zhu_jun1@163.com

³ Department of Electrical Engineering, University of California Riverside, CA, USA
ruili@ee.ucr.edu

Abstract

Most existing clustering methods require prior knowledge, such as the number of clusters or thresholds, which is difficult to determine accurately in practice. To solve this problem, this study proposes a novel clustering algorithm named GEP-Cluster, based on Gene Expression Programming (GEP), that needs no prior knowledge. The main contributions include: (1) a new concept named Clustering Algebra, which casts clustering as algebraic operations; (2) the GEP-Cluster algorithm, which finds the best cluster centers automatically by GEP and discovers the best clustering solution without any prior knowledge; (3) an AMCA (Automatic Merging Cluster Algorithm), which merges clusters automatically. Extensive experiments demonstrate that GEP-Cluster is effective for clustering various data sets without any prior knowledge.

1. Introduction

Clustering is an important knowledge discovery technique widely used in scientific practice. Its key task is to partition a set of data in a d-dimensional feature space into sub-groups, maximizing the similarity within each group while minimizing the similarity between groups. Many clustering algorithms have been proposed, such as partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods [1], and clustering based on support vectors [2]. However, most existing algorithms need prior knowledge, such as the number of clusters in K-Means [3]. Unfortunately, such prior knowledge is usually difficult to obtain accurately in practice.

* This work was supported by the National Natural Science Foundation of China under Grant No. 60473071, the 11th Five-Year Key Programs for Sci. & Tech. Development of China under Grant No. 2006038002003, and the Youth Foundation, Sichuan University, 06036: JS20070405506408.

Clustering without any prior knowledge is an urgent need in many applications. To address this problem, this study proposes the GEP-Cluster algorithm, which is based on GEP. Compared with traditional clustering algorithms, GEP-Cluster has the following characteristics: (a) it does not need prior knowledge, and (b) it divides or merges clusters automatically. Extensive experiments show that GEP-Cluster is effective in solving the clustering problem without any prior knowledge.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the preliminary terminology and concepts of GEP. Section 4 presents the GEP-Cluster algorithm and proposes a new concept named Clustering Algebra, which casts the clustering process as algebraic operations. Section 5 gives the data structures of GEP-Cluster in detail. Section 6 presents the AMCA algorithm. Section 7 gives the experimental results. Section 8 concludes this paper and discusses future work.

2. Related work

K-Means [3] is a popular clustering algorithm. It is simple, efficient, and widely applied in practice. However, it has two main limitations: (a) it is susceptible to the initial cluster centers and often converges to a local minimum; (b) it requires users to specify the number of clusters. Many algorithms have been proposed to overcome these problems. Recently, evolutionary strategies have also been explored for this task. Babu and Murty proposed a clustering algorithm based on evolutionary strategies [4], but it needs the number of clusters as prior knowledge. Murthy and Chowdhury proposed a clustering algorithm based on genetic algorithms (GA) [5]. It encodes the chromosome as a binary string in which each bit represents one object in the data set; due to the limited chromosome length, it works well only on small data sets. In [6], Bandyopadhyay and Maulik encode every cluster center into the chromosome to deal with larger data sets, but the method still requires presetting the number of clusters.

Third International Conference on Natural Computation (ICNC 2007)
0-7695-2875-9/07 $25.00 © 2007

In this study, we propose a prior-knowledge-free clustering algorithm named GEP-Cluster, based on GEP. It has a high probability of obtaining the optimal solution of the clustering problem without any prior knowledge.

3. Preliminary concepts in GEP

GEP was proposed by Candida Ferreira in 2001 as a new member of the evolutionary computation family. GEP combines the advantages of both GA and Genetic Programming (GP) while overcoming some of their individual limitations, and it handles difficult problems better than GA and GP [7-8]. The individual chromosomes in GEP are encoded as simple linear strings of fixed length, and each chromosome is composed of one or more genes. Each gene in the chromosome is divided into a gene head and a gene tail.

When evaluating a chromosome's fitness, the chromosome must first be mapped into an expression tree (ET) in a breadth-first fashion. The mapping starts from the first position in the chromosome, which corresponds to the root of the ET. If the root node is a terminal, the mapping stops; otherwise, its child branches are processed from left to right. The number of child branches is determined by the number of arguments of the parent node, and the children are read from the chromosome one by one. This process continues layer by layer until all leaf nodes in the ET are elements of the terminal set. After mapping a chromosome into an ET, the fitness of the chromosome can be calculated by traversing and decoding the ET.
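The breadth-first mapping described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Node` class and `arity` helper are our assumptions, and we use the binary clustering operators '∪' and '∩' of Section 4 as the only functions.

```python
from collections import deque

class Node:
    """A node of the expression tree (illustrative helper, not from the paper)."""
    def __init__(self, symbol):
        self.symbol = symbol
        self.children = []

def arity(symbol):
    # Both clustering operators are binary; terminals take no arguments.
    return 2 if symbol in ("∪", "∩") else 0

def map_to_et(chromosome):
    """Map a linear chromosome (sequence of symbols) to an expression tree,
    reading symbols left to right and attaching children breadth-first."""
    symbols = iter(chromosome)
    root = Node(next(symbols))
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _ in range(arity(node.symbol)):
            child = Node(next(symbols))
            node.children.append(child)
            queue.append(child)
    return root
```

Applied to the chromosome of Example 1 (Section 5.1), this yields the tree of Figure 1; tail symbols that are never read simply stay unused, which is exactly how GEP tolerates genes longer than the expressed tree.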

4. GEP-Cluster algorithm

Clustering can be viewed as an optimization problem in

the feature space. GEP performs well in global

optimization. Thus we propose a GEP-Cluster algorithm

based on GEP.

In GEP-Cluster, two special clustering operators are introduced, namely '∩' and '∪', described in Definition 2.

Definition 1 (Centroid(O1, O2)) Let O1 and O2 be points in the d-dimensional space, O1 = {x1, x2, …, xd} and O2 = {y1, y2, …, yd}. The centroid of O1 and O2 is given by the Centroid function, defined as Centroid(O1, O2) = {(x1+y1)/2, (x2+y2)/2, …, (xd+yd)/2}.
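Definition 1 is simply the coordinate-wise midpoint of two points; a minimal sketch:

```python
def centroid(o1, o2):
    """Centroid of two d-dimensional points (Definition 1): the
    coordinate-wise midpoint ((x1+y1)/2, ..., (xd+yd)/2)."""
    if len(o1) != len(o2):
        raise ValueError("points must have the same dimension")
    return tuple((x + y) / 2 for x, y in zip(o1, o2))
```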

Definition 2 (Clustering Algebra) Let S be a d-dimensional space, and let P = 2^S be the power set of S. Let O1 and O2 be the centers of clusters C1 and C2.

(1) The aggregation operator '∩' over P is defined as O1 ∩ O2 = Centroid(O1, O2).
(2) The segmentation operator '∪' over P is defined as O1 ∪ O2 = {O1, O2}.
(3) The 3-tuple CA = (P, ∩, ∪) is called the Clustering Algebra over P.
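The two operators translate directly from Definition 2. The following sketch is ours, not the paper's; in particular, representing a segmentation result as a Python set of centers is an illustrative choice.

```python
def centroid(o1, o2):
    # Definition 1: coordinate-wise midpoint of two points.
    return tuple((x + y) / 2 for x, y in zip(o1, o2))

def aggregate(o1, o2):
    """'∩' of Definition 2(1): the two clusters merge; the new center
    is the centroid of the old centers."""
    return centroid(o1, o2)

def segment(o1, o2):
    """'∪' of Definition 2(2): the clusters stay apart; both centers survive."""
    return {o1, o2}
```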

Note that in case (1), C1 and C2 are aggregated into a cluster C with center O = Centroid(O1, O2). In case (2), C1 and C2 remain two distinct clusters whose centers are O1 and O2, respectively.

The significance of these operations is that they describe the automatic aggregation and segmentation of clusters algebraically. The following lemma describes the properties of the Clustering Algebra.

Lemma 1 Let CA = (P, ∩, ∪) be a Clustering Algebra over P. Then

(1) ∩ and ∪ are commutative.
(2) ∩ and ∪ are associative.

Proof. (1) The commutative law is obvious, since each clustering operator is symmetric with respect to its two operands.

(2) Let O1, O2 and O3 be the centers of three clusters C1, C2 and C3. By Definition 2, the formula (O1 ∩ O2) ∩ O3 indicates that C1 and C2 are aggregated into a cluster C′ whose center is Centroid(O1, O2); C′ and C3 are then aggregated into a cluster C″ whose center is Centroid(Centroid(O1, O2), O3). That is to say, C1, C2 and C3 are three sub-clusters aggregated into C″. Similarly, the formula O1 ∩ (O2 ∩ O3) indicates that O1, O2 and O3 lie in the same cluster C″. Therefore the equation (O1 ∩ O2) ∩ O3 ≡ O1 ∩ (O2 ∩ O3) is satisfied. In the same way, we can show that (O1 ∪ O2) ∪ O3 ≡ O1 ∪ (O2 ∪ O3).

Based on the Clustering Algebra, clusters are easily encoded in a chromosome within the GEP framework.

Definition 3 (Cluster-ET) Let t be a valid expression tree in GEP. If all of the non-leaf nodes of t are '∩' or '∪', then t is called a Cluster-ET (Clustering Expression Tree). An example of a Cluster-ET t is shown in Figure 1.

GEP-Cluster can work without any prior knowledge for the following reasons. (a) The initial number of clusters in the chromosomes is stochastic and need not be accurate. (b) The cluster centers in the chromosomes change during the evolution process; the Clustering Algebra aggregates or segments them automatically. (c) The gene in each chromosome may mutate or recombine with other individuals in each round of evolution, so some cluster centers are removed and new ones are generated. After several generations of evolution, GEP-Cluster finds the best cluster centers, through which the best clustering solution can be obtained. The details of GEP-Cluster are described in Algorithm 1.

Algorithm 1 Clustering based on GEP (GEP-Cluster)


Input: original data set S; the GEP parameters.
Output: the list of clusters
(1) Initialize a population;
(2) Repeat
(3)   For each individual
(4)     Cluster S by the cluster centers hidden in the individual's chromosome;
(5)   Calculate the fitness value of each individual;
(6)   Keep the best individual;
(7)   Select individuals by roulette-wheel selection;
(8)   Recombine individuals at rate pc;
(9)   Mutate individuals at rate pm;
(10) Until (maximum iterations are attained);
(11) Select the best individual;
(12) Call AMCA to optimize the clustering result; // detailed in Section 6
(13) Output the list of clusters;

The advantages of GEP-Cluster derive from two features. On one hand, a datum in the gene tail may or may not be a cluster center; this frees us from setting the initial number of clusters in advance. On the other hand, cluster centers can be merged or divided automatically by the Clustering Algebra during the evolution process. Therefore, GEP-Cluster avoids the problem of predefining the number of clusters found in [3-6].

5. The data structure in GEP-Cluster

5.1. Encoding the chromosome

In GEP-Cluster, the chromosome of an individual is encoded with integer numbers, and each chromosome contains only one gene, consisting of a gene head and a gene tail. The gene head may contain functions or terminals; only terminals may appear in the gene tail. The function set (FS) is {∪, ∩}, and the terminal set (TS) is {x1, x2, …, xn}, where xi is the sequence number of the i-th object in the data set. Thus, the gene head is created with elements from both FS and TS, while the gene tail is created using only terminals.

Example 1. Consider a chromosome R0 given as "∪∩∪x1x2∪∩x3x4x5x6x7x8x9x10". R0 denotes the state of clustering in some round of evolution. Here the symbol xi (i ∈ [1, 10]) represents the center of a cluster in the data set.

5.2. Mapping the Cluster-ET

In GEP-Cluster, each chromosome in the population is mapped into a Cluster-ET for fitness calculation, using a method similar to that of GEP. The chromosome R0 of Example 1 maps into the Cluster-ET t shown in Figure 1.

Figure 1. The Cluster-ET t for R0.

5.3. Decoding the Cluster-ET

The decoding process is a preorder traversal of the Cluster-ET. The main steps are as follows. (a) Start from the root of the Cluster-ET. If the value of the root node is '∪', divide its left sub-tree and right sub-tree into two different sets by Definition 2; that is, achieve segmentation among the clusters. If the value of the root node is '∩', traverse its sub-trees from left to right, collect their nodes, and calculate their centroid by the Centroid function. Then put these nodes into the same cluster, whose center is denoted by a special object in the data set: the object nearest to the centroid of the data in the sub-trees. (b) Decode all the sub-trees recursively in the same way.

In this way, we find all the cluster centers encoded in the chromosome. The complete decoding process is shown in Algorithm 2 and explained in Example 2.

Algorithm 2 Cluster-ET decoding (finding the centers of clusters)
Input: the Cluster-ET
Output: the cluster centers
(0) Proc DecodeClusterET(Cluster-ET);
(1) If (root IS NOT NULL) then
(2)   Switch (root):
(3)   Case '∪':
(4)     Put the left sub-tree into set I1;
(5)     Put the right sub-tree into set I2;
(6)     For each set I ∈ {I1, I2}: DecodeClusterET(I);
(7)   Case '∩':
(8)     DecodeClusterET(left sub-tree);
(9)     DecodeClusterET(right sub-tree);
(10)    center-point ← Centroid(left sub-tree, right sub-tree); // centroid of all the data in both sub-trees
(11)    Find the cluster center closest to the center-point;
(12) Return all cluster centers;
(13) Endp;
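Algorithm 2 can be sketched recursively in Python. This is a simplified reading of the algorithm, not the authors' code: the `Node` class and `leaves` helper are our assumptions, and Euclidean distance (`math.dist`) is assumed for "nearest".

```python
import math

class Node:
    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.children = list(children)

def leaves(node, data):
    """All data points under a sub-tree; 'data' maps terminal names to points."""
    if not node.children:
        return [data[node.symbol]]
    return [p for c in node.children for p in leaves(c, data)]

def decode_cluster_et(node, data):
    """Preorder decoding of a Cluster-ET into its cluster centers (Algorithm 2)."""
    if node.symbol == "∪":
        # Segmentation: each sub-tree contributes its own clusters.
        return (decode_cluster_et(node.children[0], data)
                + decode_cluster_et(node.children[1], data))
    if node.symbol == "∩":
        # Aggregation: everything below forms one cluster, represented by
        # the data point nearest to the centroid of the sub-tree's points.
        pts = leaves(node, data)
        c = [sum(v) / len(pts) for v in zip(*pts)]
        return [min(pts, key=lambda p: math.dist(p, c))]
    return [data[node.symbol]]  # a lone terminal is itself a cluster center
```

On a tree shaped like Figure 1, this returns one representative center per decoded sub-tree, matching the four centers of Example 2.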

Example 2. The Cluster-ET t in Figure 1 is decoded into the sub-ETs shown in Figure 2, and the cluster centers hidden in t are found by decoding it.


Figure 2. Decoding the Cluster-ET t: (a) the original Cluster-ET; (b)-(e) the decoding process.

Figure 2 shows how t in Figure 1 is decoded step by step. By Algorithm 2, the root node 1 of t (Figure 2(a)) is '∪', so its two child branches are first divided into two sub-Cluster-ETs, t1 and t2 in Figure 2(a). The root node 2 of t1 is '∩', so its two child branches are aggregated into a node x1′ by Centroid(x1, x2), as shown in Figure 2(b). We handle t2 similarly. Finally, four sub-Cluster-ETs are obtained, shown in Figures 2(b), (c), (d) and (e); thus there are four clusters in the t of Figure 1.

We have now found all the cluster centers in Figure 2(a). That is to say, there are four clusters hidden in chromosome R0, and their centers are x1′, x3, x4 and x5′.

5.4. Fitness evaluation

After obtaining the cluster centers hidden in the chromosome, the fitness of each individual is evaluated. The basic steps are as follows.

(a) Cluster the data set by the center points obtained in Section 5.3.
(b) Recalculate the center of each cluster, and search for a new center point to denote the cluster.
(c) Calculate the sum of the squared errors in each cluster; the total squared error E is defined [1] as:

E = Σ_{i=1}^{k} Σ_{p∈Ci} |p − mi|²,  where mi = (1/ni) Σ_{xj∈Ci} xj,

where k is the number of clusters, mi is the center of cluster Ci, and ni is the number of data points in cluster Ci. The smaller E is, the better the clustering result. As in GEP, the best individual is the one with the maximum fitness; therefore we use the fitness function f = 1/E in GEP-Cluster.
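Step (c) and the fitness f = 1/E can be sketched as follows, assuming each cluster is given as a list of points (an illustrative data layout, not the paper's):

```python
def total_squared_error(clusters):
    """E = sum over clusters C_i of sum over p in C_i of |p - m_i|^2,
    where m_i is the mean of C_i."""
    E = 0.0
    for pts in clusters:
        m = [sum(v) / len(pts) for v in zip(*pts)]  # cluster center m_i
        E += sum(sum((x - c) ** 2 for x, c in zip(p, m)) for p in pts)
    return E

def fitness(clusters):
    # A smaller E means a better clustering, so fitness is its reciprocal.
    E = total_squared_error(clusters)
    return 1.0 / E if E > 0 else float("inf")
```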

5.5. Genetic operators in GEP-Cluster

(1) Selection operator

In GEP-Cluster, individuals are selected according to their fitness by roulette-wheel sampling, and an elitist strategy is also used; that is, the best individual of each generation is always carried into the succeeding generation.
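Roulette-wheel sampling with elitism can be sketched as below. This is our illustration of the standard technique, not the authors' implementation; the injectable `rng` parameter is an assumption added for testability.

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Fitness-proportionate (roulette-wheel) sampling of one individual."""
    total = sum(fitnesses)
    r = rng.uniform(0, total)
    acc = 0.0
    for individual, f in zip(population, fitnesses):
        acc += f
        if r <= acc:
            return individual
    return population[-1]  # guard against floating-point round-off

def next_generation(population, fitnesses, rng=random):
    """Elitist strategy: the best individual always survives; the remaining
    slots are filled by roulette-wheel sampling."""
    best = max(zip(population, fitnesses), key=lambda pair: pair[1])[0]
    rest = [roulette_select(population, fitnesses, rng)
            for _ in range(len(population) - 1)]
    return [best] + rest
```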

(2) Recombination operator

One-point and two-point recombination are used in GEP-Cluster. By experience, a recombination rate of 0.77 gives better performance. The recombination position in a chromosome is selected randomly. If the recombined alleles would produce an invalid gene, the alleles are kept unchanged; otherwise, they are exchanged.

(3) Mutation operator

As often used in the literature [7], the mutation rate is 0.044 in GEP-Cluster. We first select the mutation position and its allele in the chromosome, and then mutate the allele. There are two cases. If the allele is '∩' or '∪', the mutation is made directly by Rule 1. Otherwise, if the allele is in TS, we randomly select one data point from the data set and replace the allele at the mutation position with it. If this operation yields an invalid gene, we abandon it and randomly select another data point until no invalid gene appears.

Rule 1 (∩→∪, ∪→∩) During the mutation operation, the allele '∩' is changed into '∪', and the allele '∪' is changed into '∩'.
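One mutation step under Rule 1 can be sketched as follows. Since operators map to operators and terminals to terminals, the head/tail validity of the gene is preserved, so this simplified sketch needs no repair loop; the injectable `rng` parameter is our addition for testability.

```python
import random

def mutate(chromosome, data_ids, rng=random):
    """Mutate one randomly chosen allele: '∩' and '∪' swap (Rule 1);
    a terminal is replaced by a randomly chosen data-point id."""
    out = list(chromosome)
    pos = rng.randrange(len(out))
    allele = out[pos]
    if allele == "∩":
        out[pos] = "∪"
    elif allele == "∪":
        out[pos] = "∩"
    else:
        out[pos] = rng.choice(data_ids)
    return out
```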

6. AMCA

Generally, all the clusters can be identified after several rounds of evolution. However, sometimes clusters found earlier need to be adjusted. For this purpose we propose AMCA (Automatic Merging Cluster Algorithm), which merges clusters where appropriate. The details of AMCA are shown in Algorithm 3.

Algorithm 3 AMCA
Input: the initial cluster list
Output: the most proper cluster list
(1) Repeat
(2)   Arbitrarily choose one cluster Ci, and sort the remaining clusters Cj by their distance to Ci;
(3)   Find the point xi in Ci that is nearest to Cj;
(4)   Find the point xj in Cj that is nearest to Ci;
(5)   Dij ← the Euclidean distance between xi and xj;
(6)   Di ← the mean value of cluster Ci;
(7)   Dj ← the mean value of cluster Cj;
(8)   If (Dij ≤ Di or Dij ≤ Dj) then
(9)     Cij ← merge clusters Ci and Cj;
(10)    Remove Cj from the list;
(11) Until no change in the cluster list;
(12) Output the cluster list;
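A sketch of Algorithm 3 follows. The paper leaves "the mean value of cluster Ci" underspecified; here we read Di as the mean distance of Ci's points to their center, which is one plausible interpretation rather than necessarily the authors'. We also simply compare every pair of clusters instead of sorting them by distance.

```python
import math

def nearest_gap(a, b):
    """D_ij: smallest distance between a point of cluster a and one of cluster b."""
    return min(math.dist(p, q) for p in a for q in b)

def mean_spread(c):
    """Our reading of D_i: mean distance from the cluster's points to its center."""
    m = [sum(v) / len(c) for v in zip(*c)]
    return sum(math.dist(p, m) for p in c) / len(c)

def amca(clusters):
    """Repeatedly merge any pair of clusters whose nearest-point gap D_ij does
    not exceed the spread of either cluster, until nothing changes."""
    clusters = [list(c) for c in clusters]
    changed = True
    while changed:
        changed = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = nearest_gap(clusters[i], clusters[j])
                if d <= mean_spread(clusters[i]) or d <= mean_spread(clusters[j]):
                    clusters[i] = clusters[i] + clusters[j]
                    del clusters[j]
                    changed = True
                    break
            if changed:
                break
    return clusters
```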


7. Empirical evaluation

We test GEP-Cluster on two data sets, both synthetic.

(1) Results on data set 1
Data set 1 contains 270 two-dimensional points and is shown in Figure 3(a). It is composed of four clusters: two spherical clusters, a convex cluster, and a line-shaped cluster, all of almost the same density.

Figure 3(b) shows the clustering result on data set 1. We ran GEP-Cluster 50 times with random initialization; 48 trials succeeded and only 2 failed, a success rate of 96%.

Figure 3. Data set 1 with four clusters: (a) The

original data set, (b) result of clustering.

(2) Results on data set 2
Data set 2, shown in Figure 4(a), contains 254 two-dimensional points and is composed of three clusters: a rectangular cluster and two circular clusters. Notice that the two circular clusters differ distinctly in size and density.

Figure 4(b) shows the clustering result on data set 2. We ran GEP-Cluster 50 times; 49 trials succeeded and only one failed, a success rate of 98%.

Figure 4. Data set 2 with three clusters: (a) The

original data set, (b) result of clustering.

8. Conclusions

In this study, we have proposed a new concept named Clustering Algebra that casts clustering as algebraic operations. We have also proposed a novel clustering algorithm named GEP-Cluster, based on GEP, that requires no prior knowledge, together with the AMCA algorithm for merging clusters automatically. Extensive experiments demonstrate that GEP-Cluster clusters various data sets effectively and automatically without any prior knowledge.

In practice, GEP-Cluster still has some flaws, such as sensitivity to noise and limited efficiency on high-dimensional data. These problems are under study, and we expect to address them in future work.

9. References

[1] Han JW, Kamber M. Data Mining: Concepts and Techniques. Beijing: Higher Education Press, 2001.
[2] Jaewook Lee, Daewon Lee. Dynamic Characterization of Cluster Structures for Robust and Inductive Support Vector Clustering. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, pp. 1869-1874, 2006.
[3] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp., vol. I, 1967, pp. 281-297.
[4] Babu, G.P., Murty, M.N. Clustering with evolution strategies. Pattern Recognition 27(2), pp. 321-329, 1994.
[5] C.A. Murthy, N. Chowdhury. In search of optimal clusters using genetic algorithms. Pattern Recognition Letters 17 (1996), pp. 825-832.
[6] Bandyopadhyay, S. and Maulik, U. An evolutionary technique based on K-means algorithm for optimal clustering in R^N. Information Sciences: An International Journal, vol. 146, no. 1, pp. 221-237, 2002.
[7] Ferreira, C. Gene Expression Programming: A New Adaptive Algorithm for Solving Problems. Complex Systems, 13(2): 87-129, 2001.
[8] Shaojie Qiao, Changjie Tang, Jing Peng, Jianjun Hu and Huan Zhang. BPGEP: Robot Path Planning based on Backtracking Parallel-Chromosome GEP. Proceedings of the International Conference on Sensing, Computing and Automation, DCDIS Series B: Application and Algorithm, Watam Press, pp. 439-444, 2006.
