Gene Expression Programming Based on Simulated Annealing*

JIANG Siwei, CAI Zhihua, ZENG Dan, LIU Yadong, LI Qu

College of Computer, China University of Geosciences, Wuhan 430074

Corresponding author’s Email: amosonic@163.com

Abstract--Gene Expression Programming (GEP) is a genotype/phenotype system that evolves computer programs of different sizes and shapes encoded in linear chromosomes of fixed length. However, the performance of basic GEP depends heavily on the rates of its genetic operators. In this work, we present a new algorithm called GEPSA that combines GEP and Simulated Annealing (SA); GEPSA decreases the dependence on the genetic operators’ rates without impairing the performance of GEP. GEPSA is tested on three function finding problems, including the benchmark problem of predicting sunspots, and the results show that incorporating Simulated Annealing can improve the performance of GEP.

Key words--Gene Expression Programming; Simulated Annealing; Function Finding; Regression Analysis

I. INTRODUCTION

Gene Expression Programming is a novel genetic algorithm in which individuals are encoded as symbolic strings of fixed length (genotype) and then expressed as expression trees (phenotype) of different sizes and shapes. It combines the characteristics of genetic algorithms (GA) and genetic programming (GP), and overcomes some of their drawbacks. Structurally, genetic algorithms can be subdivided into three fundamental groups: 1) GA, with linear chromosomes of fixed length; 2) GP, with individuals consisting of ramified structures of different sizes and shapes; 3) GEP. In essence, GEP is a subgroup of GAs [1,2,3].

According to our previous experiments [4,5,6], basic GEP can get good results in regression and prediction. However, as with other basic genetic algorithms, the performance of GEP depends on the genetic operators’ rates, and suitable rates differ from problem to problem. To address this problem, some researchers have incorporated SA into GAs, and the resulting hybrid algorithms achieve higher performance [7,8]. Motivated by this approach, we present a new algorithm called GEPSA, which combines GEP and SA. Experimental results show that the new algorithm performs better than basic GEP.

The remaining sections of this paper are organized as follows. In Section 2, we introduce the background of GEP and some related work. In Section 3, we describe our new algorithm. In Section 4, we apply the new algorithm to three experiments and analyze the results in detail. The conclusion is presented in Section 5.

* Supported by the Humanities & Social Sciences Open Foundation of Hubei Province (No. 2004B0011) and the National Natural Science Foundation of Hubei Province (No. 2003ABA043).

II. RELATED WORKS

In GEP, each gene is composed of a head and a tail, and one or more genes make up a chromosome. GEP is a genotype/phenotype system in which the search space remains separated from the solution space, and it therefore performs better than GAs [1,9,10]. On the other hand, GEP is a totally unconstrained genotype/phenotype system, as all modifications made in the genotype always result in a correct phenotype, and it is therefore better than GP [1,10].
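The head/tail structure fixes the chromosome length, which the following minimal sketch computes. It assumes the standard GEP relation t = h(n - 1) + 1 between head length h, maximum function arity n and tail length t [1], and it assumes that a gene using random constants carries an additional Dc domain as long as the tail [3]. The helper name `chromosome_length` is ours, not the paper's. Under these assumptions the values agree with the chromosome lengths listed in Table 2.

```python
def chromosome_length(head: int, genes: int, max_arity: int = 2,
                      with_constants: bool = False) -> int:
    """Length of a multigenic GEP chromosome (illustrative helper)."""
    tail = head * (max_arity - 1) + 1      # t = h(n - 1) + 1
    gene = head + tail
    if with_constants:
        gene += tail                       # assumption: Dc domain as long as the tail
    return gene * genes


# Settings taken from Table 2 (head length, number of genes).
for name, head, genes in [("SI", 6, 7), ("V", 6, 5), ("SS", 8, 3)]:
    print(name,
          chromosome_length(head, genes),
          chromosome_length(head, genes, with_constants=True))
```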

To improve the performance of GAs, some researchers have combined the idea of Simulated Annealing with genetic algorithms. WANG Xuemei [7] used the average fitness metric with Simulated Annealing to increase the diversity of the whole population. WANG Ling [8] exploited the complementary aspects of the structure and behavior of GA and SA, and presented a global hybrid strategy.

III. GENE EXPRESSION PROGRAMMING BASED ON SIMULATED ANNEALING

Basic GEP has nine genetic operators: mutation, three kinds

of transposition (IS, RIS and gene transposition), three kinds of

recombination (one-point, two-point and gene recombination),

random constants mutation, and random constants transposition

[3]. Among these operators, mutation is the most important and

powerful one [10].

Similar to GAs, the performance of GEP is sensitive to the settings of the genetic operators’ rates. Here we present a new algorithm called GEPSA, which uses only three genetic operators: mutation, multiple-point recombination and random constants mutation. One-point crossover is applied to each gene, which amounts to multiple-point recombination for a chromosome made of several genes. GEPSA is depicted as follows:

Algorithm: GEPSA (f, P, P_m, P_c, P_Dc, T_0, a)

Input: f: the fitness function to evaluate the individuals; P: the population for evaluation; P_m: the mutation rate; P_c: the multiple-point crossover rate; P_Dc: the random constants mutation rate; T_0: the initial temperature of SA; a: the temperature alter rate.

Output: The model with the highest fitness.

1. Initialize the population P randomly;

2. Evaluate: for each individual p, compute f(p);

3. for (i = 0; i < n; i++)

   Generate the new population:

   (1) Mutation: generate a new individual by mutating the old individual with rate P_m. Replace the old individual with the new individual if min{1, exp(-Δf/T_i)} > random[0,1], where Δf = f(old) - f(new);

   (2) Multiple-point crossover and random constants mutation: generate a new individual from the old individual with rates P_c and P_Dc. Accept the new individual according to the method in (1);

   (3) Apply the elitism method;

4. Return the best model, i.e. the one with the highest fitness.
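The acceptance rule in step 3(1) can be sketched as follows. This is a simplified illustration and not the authors' implementation: individuals are plain floats rather than GEP chromosomes, only mutation is applied, the cooling schedule is assumed to be geometric (T_{i+1} = a * T_i, since the listing only names a as the temperature alter rate), and the form of elitism is our own guess. The names `accept` and `gepsa_sketch` are hypothetical.

```python
import math
import random


def accept(f_old, f_new, temperature, rng):
    """Metropolis-style test of step 3(1); fitness is maximized."""
    delta = f_old - f_new                 # Δf = f(old) - f(new)
    if delta <= 0:                        # new individual is at least as fit,
        return True                       # so min{1, exp(-Δf/T)} = 1
    if temperature <= 0:                  # fully cooled: reject worse individuals
        return False
    return math.exp(-delta / temperature) > rng.random()


def gepsa_sketch(population, fitness, mutate, t0, a, generations, seed=0):
    """Mutation-only annealing loop with elitism (simplified, hypothetical)."""
    rng = random.Random(seed)
    population = list(population)
    best = max(population, key=fitness)
    temperature = t0
    for _ in range(generations):
        for k, old in enumerate(population):
            new = mutate(old, rng)
            if accept(fitness(old), fitness(new), temperature, rng):
                population[k] = new
        champion = max(population, key=fitness)
        if fitness(champion) > fitness(best):
            best = champion
        else:
            # Elitism (assumed form): the best-so-far replaces the current worst.
            worst = min(range(len(population)),
                        key=lambda k: fitness(population[k]))
            population[worst] = best
        temperature *= a                  # assumed geometric cooling
    return best


# Toy usage: maximize -(x - 3)^2 with Gaussian mutation.
toy_fitness = lambda x: -(x - 3.0) ** 2
toy_mutate = lambda x, rng: x + rng.gauss(0.0, 0.5)
start = [0.0, 10.0, -4.0, 7.5]
result = gepsa_sketch(start, toy_fitness, toy_mutate,
                      t0=20.0, a=0.95, generations=100)
```

At high temperature a worse individual is accepted fairly often, which keeps the population diverse; as the temperature falls, the rule approaches plain hill climbing.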

IV. EXPERIMENT AND RESULT

In this paper, we compare GEPSA with basic GEP in three

problems [3].

The first one is a problem of sequence induction, where $a_n$ consists of the nonnegative integers. The nth term N of the chosen sequence is given by the formula:

$$N = 5a_n^4 + 4a_n^3 + 3a_n^2 + 2a_n + 1 \qquad (1)$$
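As an illustration, the target sequence N = 5a_n^4 + 4a_n^3 + 3a_n^2 + 2a_n + 1 can be tabulated as below. The text does not say which ten integers serve as fitness cases; the choice a_n = 1, ..., 10 here is an assumption, and `sequence_term` is a name introduced only for this sketch.

```python
def sequence_term(a_n: int) -> int:
    """Target value N for the sequence induction problem, equation (1)."""
    return 5 * a_n**4 + 4 * a_n**3 + 3 * a_n**2 + 2 * a_n + 1


# Assumed fitness cases: a_n = 1..10 (the paper only says "10 nonnegative integers").
fitness_cases = [(a_n, sequence_term(a_n)) for a_n in range(1, 11)]
for a_n, target in fitness_cases:
    print(a_n, target)
```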

The second one is a problem of function finding, where the independent variable is a floating-point value chosen from the interval [-1, 1]. In this case, the following “V” shaped function was chosen:

$$y = 4.251a^2 + \ln(a^2) + 7.243e^a \qquad (2)$$
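A minimal sketch of how fitness cases for the function y = 4.251a^2 + ln(a^2) + 7.243e^a might be drawn is given below. The sampling scheme (uniform draws with a fixed seed, skipping a = 0 where ln(a^2) is undefined) is our assumption, as is the name `v_function`.

```python
import math
import random


def v_function(a: float) -> float:
    """The "V" shaped target function, equation (2)."""
    return 4.251 * a**2 + math.log(a**2) + 7.243 * math.exp(a)


rng = random.Random(0)                    # fixed seed, for reproducibility only
cases = []
while len(cases) < 20:
    a = rng.uniform(-1.0, 1.0)
    if a != 0.0:                          # ln(a^2) is undefined at a = 0
        cases.append((a, v_function(a)))
```

The logarithmic term drives y towards minus infinity as a approaches 0 from either side, which gives the curve its sharp “V” shape.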

The third one is the well-studied benchmark problem of

predicting sunspots [11]. In this case, 100 observations of the

Wolfer sunspots series were used (Table 1) with an embedding

dimension of 10 and a delay time of one.

A. Fitness Function

We use fitness functions based on the absolute error (equation 3) and the relative error (equation 4), where M is the range of selection, $C_{(i,j)}$ is the value returned by the individual chromosome i for fitness case j (out of $C_t$ fitness cases), and $T_j$ is the target value for fitness case j [1]:

$$f_i = \sum_{j=1}^{C_t} \left( M - \left| C_{(i,j)} - T_j \right| \right) \qquad (3)$$

$$f_i = \sum_{j=1}^{C_t} \left( M - \left| \frac{C_{(i,j)} - T_j}{T_j} \cdot 100 \right| \right) \qquad (4)$$
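The two fitness functions translate directly into code. The sketch below is illustrative; the function names are ours, and the relative version assumes that no target value is zero. A perfect model scores M on every fitness case, so the maximum fitness is C_t * M.

```python
def fitness_absolute(predicted, targets, selection_range):
    """Equation (3): sum over cases of M - |C_ij - T_j|."""
    return sum(selection_range - abs(c - t)
               for c, t in zip(predicted, targets))


def fitness_relative(predicted, targets, selection_range):
    """Equation (4): sum over cases of M - |(C_ij - T_j) / T_j * 100|."""
    return sum(selection_range - abs((c - t) / t * 100.0)
               for c, t in zip(predicted, targets))


# A perfect model on 10 cases with M = 20 reaches the maximum fitness 10 * 20.
targets = [float(k) for k in range(1, 11)]
print(fitness_relative(targets, targets, 20))
```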

We use R-square to evaluate the evolved model, where $y_i$ is the actual value, $\bar{y}$ is the average of all $y_i$, and $\hat{y}_i$ is the value returned by GEP. A higher R-square indicates a better regression model [5].

$$R^2 = \frac{SS_{Reg}}{SS_{Total}} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (5)$$
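The R-square measure, the ratio of the regression sum of squares to the total sum of squares, can be computed as in the sketch below. The function name `r_square` is ours, and the code assumes the actual values are not all identical, so that the total sum of squares is non-zero.

```python
def r_square(actual, predicted):
    """R^2 = SS_Reg / SS_Total, following equation (5)."""
    mean = sum(actual) / len(actual)
    ss_reg = sum((p - mean) ** 2 for p in predicted)    # spread of model outputs
    ss_total = sum((y - mean) ** 2 for y in actual)     # spread of actual values
    return ss_reg / ss_total


print(r_square([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # perfect model
print(r_square([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))   # constant model at the mean
```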

TABLE 1

WOLFER SUNSPOTS SERIES (READ BY ROWS)

101 82 66 35 31 7 20 92

154 125 85 68 38 23 10 24

83 132 131 118 90 67 60 47

41 21 16 6 4 7 14 34

45 43 48 42 28 10 8 2

0 1 5 12 14 35 46 41

30 24 16 7 4 2 8 17

36 50 62 67 71 48 28 8

13 57 122 138 103 86 63 37

24 11 15 40 62 98 124 96

66 64 54 39 21 7 4 23

55 94 96 77 59 44 47 30

16 7 37 74

B. Setting the System

For the sequence induction problem, the 10 nonnegative integers $a_n$ were used as fitness cases. The fitness function was based on the relative error with a selection range of 20%, giving maximum fitness $f_{max}$ = 200.

For the “V” shaped function problem, a set of 20 random fitness cases chosen from the interval [-1, 1] was used. The fitness function was based on the relative error with a selection range of 100%, giving $f_{max}$ = 2000.

For the time series prediction problem, using an embedding dimension of 10 and a delay time of one, the sunspots series presented in Table 1 results in 90 fitness cases. In this case, the fitness function was based on the absolute error with a selection range of 1000%, giving $f_{max}$ = 90,000.
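The count of 90 fitness cases follows from the embedding: each case pairs 10 consecutive observations with the observation that follows them, so a series of 100 values yields 100 - 10 = 90 cases. The sketch below shows one way to build such cases; the function name `embed` is ours, and the idea that the ten inputs map to the terminals a to j of the evolved model is an assumption.

```python
def embed(series, dimension, delay=1):
    """Build (inputs, target) pairs by time-delay embedding."""
    span = dimension * delay
    cases = []
    for t in range(span, len(series)):
        # Inputs are the `dimension` earlier values, oldest first.
        inputs = [series[t - (dimension - k) * delay] for k in range(dimension)]
        cases.append((inputs, series[t]))
    return cases


# First 12 observations of Table 1 give two fitness cases.
first_values = [101, 82, 66, 35, 31, 7, 20, 92, 154, 125, 85, 68]
for inputs, target in embed(first_values, dimension=10):
    print(inputs, "->", target)

print(len(embed(list(range(100)), dimension=10)))   # number of cases for 100 points
```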

C. Experimental Analysis

The function set, terminal set and genetic operators for the three problems are all shown in Table 2. Columns 2, 5 and 8 give the results of basic GEP with random constants; columns 3, 6 and 9 give the results of basic GEP without random constants [3]; columns 4, 7 and 10 give the results of GEPSA.

To test the algorithm’s robustness, we evaluate the three problems over 100 independent runs and report the average best-of-run fitness and the average best-of-run R-square. The results in Table 2

TABLE 2

GENERAL SETTINGS USED IN THE SEQUENCE INDUCTION (SI), THE “V” FUNCTION, AND SUNSPOTS (SS) PROBLEMS.

SI* SI SI** V* V V** SS* SS SS**

Number of runs 100 100 100 100 100 100 100 100 100

Number of Generations 100 100 100 5000 5000 5000 5000 5000 5000

Population size 100 100 100 100 100 100 100 100 100

Number of fitness Cases 10 10 10 20 20 20 90 90 90

Function set +-*/ +-*/ +-*/ +-*/LEK~SC +-*/LEK~SC +-*/LEK~SC +-* +-* +-*

Terminal set a, ? a a, ? a, ? a a, ? a, ? a a, ?

Random constants array length 10 --- 10 10 --- 10 10 --- 10

Random constants range 0,1,2,3 --- 0,1,2,3 [-1,1] --- [-1,1] [-1,1] --- [-1,1]

Head length 6 6 6 6 6 6 8 8 8

Number of genes 7 7 7 5 5 5 3 3 3

Linking function + + + + + + + + +

Chromosome length 140 91 140 100 65 100 78 51 78

Mutation rate 0.044 0.044 0.044 0.044 0.044 0.044 0.044 0.044 0.044

One-point recombination rate 0.3 0.3 --- 0.3 0.3 --- 0.3 0.3 ---

Two-point recombination rate 0.3 0.3 --- 0.3 0.3 --- 0.3 0.3 ---

Multi-point recombination rate --- --- 0.3 --- --- 0.3 --- --- 0.3

Gene recombination rate 0.1 0.1 --- 0.1 0.1 --- 0.1 0.1 ---

IS transposition rate 0.1 0.1 --- 0.1 0.1 --- 0.1 0.1 ---

IS element length 1,2,3 1,2,3 --- 1,2,3 1,2,3 --- 1,2,3 1,2,3 ---

RIS transposition rate 0.1 0.1 --- 0.1 0.1 --- 0.1 0.1 ---

RIS element length 1,2,3 1,2,3 --- 1,2,3 1,2,3 --- 1,2,3 1,2,3 ---

Gene transposition rate 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

Random constants mutation 0.01 --- 0.01 0.01 --- 0.01 0.01 --- 0.01

Dc specific transposition rate 0.1 --- --- 0.1 --- --- 0.1 --- ---

Dc specific elements length 1,2,3 --- --- 1,2,3 --- --- 1,2,3 --- ---

The initial temperature --- --- 20 --- --- 50 --- --- 50

The alterative temperature rate --- --- 0.95 --- --- 0.8 --- --- 0.8

Selection range 20% 20% 20% 100% 100% 100% 1000% 1000% 1000%

Precision 0% 0% 0% 0% 0% 0% 0% 0% 0%

Average best-of-run fitness 179.827 197.232 197.72 1914.8 1931.84 1972.46 86215.3 89033.3 89166.3

Average best-of-run R-square 0.97761 0.99935 0.99999 0.95726 0.99534 0.99663 0.71337 0.81186 0.87510

Success rate 16% 81% 89% --- --- --- --- --- ---

Notes: “*” indicates the explicit use of random constants; “**” indicates the use of GEPSA.

show that the average best-of-run fitness and the average best-of-run R-square obtained by GEPSA are both higher than those obtained by basic GEP.

The first problem, sequence induction, can be solved exactly by both basic GEP and GEPSA; with GEPSA, the first perfect solution was found in generation 65 of run 1. To solve this problem, Ferreira [3] used two approaches (basic GEP with and without random constants), whose success rates are 16% and 81%, respectively. The basic GEP methods use more than seven genetic operators, whereas our approach uses only three and still reaches a success rate of 89%.

To find the “V” shaped function, we use the function set F = {+, -, *, /, L, E, K, ~, S, C}, where “L” represents the natural logarithm, “E” represents exp(x), “K” represents the logarithm of base 10, “~” represents 10^x, “S” represents the sine function, and “C” represents the cosine function. Using basic GEP, Ferreira [3] obtained a best solution with a fitness of 1990.023 and an R-square of 0.9999313; using GEPSA, the best solution was found in generation 3897 of run 87:

[0.068*cos(0.538)/0.114]

[exp(exp( 0.806* sin(0.216)))]

[(ln ) *0.584/cos(sin )] [ln 1 cos( ) ]

[exp((cos( 0.066) 0.5 )/0.792)]

y a

a

a a a

a

= −

+ − −

+ + −

+ − + +

It has a fitness of 1995.7 and an R-square of 0.999972

evaluated over the set of 20 fitness cases, and thus is better than

the model evolved with the basic GEP.

For the benchmark problem of predicting sunspots, the best solution found with basic GEP has a fitness of 89176.61 and an R-square of 0.882831; using GEPSA, the best solution was found in generation 4754 of run 71:

*

[ ] [ * * ]

0.808* * * * *

[ 0.262*( )]

b h a c

y j d j

b e g a d i j h

a b e

i

e i

+

= + +

+ +

+ +

+ − +

−

It has a fitness of 89386 and an R-square of 0.924051

evaluated over the set of 90 fitness cases, and thus it is better

than the model evolved with the basic GEP.

From the comparisons of success rate, fitness and R-square, we can see that the best solution found by GEPSA is better than the one found by basic GEP for each of the three problems.

V. CONCLUSIONS

To decrease the dependence of GEP on the genetic operators’ rates across different problems, we present a new algorithm, GEPSA, that combines GEP and Simulated Annealing. Three problems are tested with two approaches, basic GEP and GEPSA. The results suggest that GEPSA is more efficient: the best evolved models are more accurate than those of basic GEP, and the average best-of-run fitness and R-square over 100 independent runs are also higher. In the future, we plan to use parallel computation and a multi-population strategy to further improve the performance of GEP.

REFERENCES

[1] Ferreira, C., Gene Expression Programming: A New Adaptive Algorithm for Solving Problems [J], Complex Systems, vol. 13, No. 2, pp. 87-129, 2001.

[2] Ferreira, C., Genetic Representation and Genetic Neutrality in Gene Expression Programming, Advances in Complex Systems, vol. 5, No. 4, pp. 389-408, 2002.

[3] Ferreira, C., Function Finding and the Creation of Numerical Constants in Gene Expression Programming, the 7th Online World Conference on Soft Computing in Industrial Applications, September 23-October 4, 2002.

[4] LI Qu, CAI Zhihua, ZHU Li, ZHAO Yunsheng, Application of Gene Expression Programming in Predicting the Amount of Gas Emitted from Coal Face [J], Journal of Basic Science and Engineering, vol. 2, No. 1, March 2004.

[5] LI Qu, CAI Zhihua, JIANG Siwei, ZHU Li, Gene Expression Programming in Prediction, Proceedings of the 5th World Congress on Intelligent Control and Automation, pp. 15-19, June 2004.

[6] CAI Zhihua, LI Qu, JIANG Siwei, Symbolic Regression Based on GEP and Its Application in Predicting the Amount of Gas Emitted from Coal Face, Proceedings of the 2004 International Symposium on Safety Science and Technology, October 2004.

[7] WANG Xuemei, WANG Yihe, The Combination of Simulated Annealing and Genetic Algorithm, Chinese Journal of Computers, vol. 20, No. 4, pp. 381-384, 1997.

[8] WANG Ling, ZHENG Dazhong, Unified Framework for Neighbor Search Algorithms and Hybrid Optimization Strategies, Tsinghua University (Sci & Tech), vol. 40, No. 9, pp. 125-128, 2000.

[9] Banzhaf, W., Genotype-phenotype-mapping and Neutral Variation – A Case Study in Genetic Programming. In Y. Davidor, H.-P. Schwefel, and R. Männer, eds., Parallel Problem Solving from Nature III, vol. 866 of Lecture Notes in Computer Science, Springer-Verlag, 1994.

[10] Ferreira, C., Mutation, Transposition, and Recombination: An Analysis of the Evolutionary Dynamics, Proceedings of the 6th Joint Conference on Information Sciences, 4th International Workshop on Frontiers in Evolutionary Algorithms, pp. 614-617, 2002.

[11] Weigend, A. S., B. A. Huberman, and D. E. Rumelhart, Predicting Sunspots and Exchange Rates with Connectionist Networks. In S. Eubank and M. Casdagli, eds., Nonlinear Modeling and Forecasting, pp. 395-432, Redwood City, CA, Addison-Wesley, 1992.
