Gene Expression Programming Based on Simulated Annealing


JIANG Siwei, CAI Zhihua, ZENG Dan, LIU Yadong, LI Qu
College of Computer, China University of Geosciences, Wuhan 430074
Corresponding author’s Email: amosonic@163.com

Abstract--Gene Expression Programming (GEP) is a
genotype/phenotype system that evolves computer programs of
different sizes and shapes encoded in linear chromosomes of fixed
length. However, the performance of basic GEP depends strongly on
the rates chosen for its genetic operators. In this work, we present
a new algorithm called GEPSA that combines GEP with Simulated
Annealing (SA). GEPSA reduces this dependence on the genetic
operator rates without impairing the performance of GEP. GEPSA is
tested on three function-finding problems, including the benchmark
problem of predicting sunspots, and the results show that
incorporating Simulated Annealing can improve the performance of
GEP.
Key words--Gene Expression Programming; Simulated Annealing;
Function Finding; Regression Analysis


I. INTRODUCTION

Gene Expression Programming (GEP) is a novel genetic algorithm
in which individuals are encoded as symbolic strings of fixed
length (the genotype) and then expressed as expression trees
(the phenotype) of different sizes and shapes. It combines
characteristics of genetic algorithms (GA) and genetic
programming (GP) and overcomes some of their drawbacks.
Structurally, genetic algorithms can be subdivided into three
fundamental groups: 1) GA, with linear chromosomes of fixed
length; 2) GP, with individuals consisting of ramified structures
of different sizes and shapes; 3) GEP. In essence, GEP is a
subgroup of GAs [1,2,3].
According to our previous experiments [4,5,6], basic GEP
obtains good results in regression and prediction. However, as
with other basic genetic algorithms, the performance of GEP
depends on the genetic operator rates, and suitable rates differ
from problem to problem. To address this, some researchers have
incorporated SA into GAs, and the resulting hybrid algorithms
perform better [7,8]. Motivated by this approach, we present a new
algorithm called GEPSA, which combines GEP and SA.
Experimental results show that the new algorithm performs better
than basic GEP.
The remaining sections of this paper are organized as follows.
In Section 2, we introduce the background of GEP and some
related work. In Section 3, we describe our new algorithm. In
Section 4, we apply the new algorithm to three problems and
analyze the results in detail. The conclusion is presented in
Section 5.

This work was supported by the Humanities & Social Sciences Open
Foundation of Hubei province (NO: 2004B0011) and the National
Natural Science Foundation of Hubei province (NO: 2003ABA043).

II. RELATED WORKS

In GEP, each gene is composed of a head and a tail, and
one or more genes make up a chromosome. GEP is a
genotype/phenotype system in which the search space remains
separate from the solution space, and it therefore performs better
than GAs [1,9,10]. Moreover, GEP is a totally unconstrained
genotype/phenotype system: every modification made to the
genotype results in a correct phenotype, which gives it an
advantage over GP [1,10].
To improve the performance of GAs, some researchers have
combined Simulated Annealing with genetic algorithms.
WANG Xuemei [7] used an average fitness metric together with
Simulated Annealing to increase the diversity of the whole
population. WANG Ling [8] exploited the complementary structure
and behavior of GA and SA and proposed a global hybrid strategy.

III. GENE EXPRESSION PROGRAMMING BASED ON SIMULATED ANNEALING
Basic GEP has nine genetic operators: mutation, three kinds
of transposition (IS, RIS and gene transposition), three kinds of
recombination (one-point, two-point and gene recombination),
random constants mutation, and random constants transposition
[3]. Among these operators, mutation is the most important and
powerful [10].
As with GAs, the performance of GEP is sensitive to the
settings of the genetic operator rates. Here we present a new
algorithm called GEPSA, which uses only three genetic
operators: mutation, multiple-point recombination and random
constants mutation. One-point crossover is applied to each gene,
which amounts to multi-point recombination for a chromosome
that contains several genes. GEPSA is described as follows:
Algorithm: GEPSA ($f$, $P$, $P_m$, $P_c$, $P_{Dc}$, $T_0$, $a$)
Input: $f$: the fitness function used to evaluate the individuals;
$P$: the population to be evolved; $P_m$: the mutation rate;
$P_c$: the multiple-point crossover rate; $P_{Dc}$: the random
constants mutation rate; $T_0$: the initial temperature of SA;
$a$: the temperature alteration rate.
Output: The model with the highest fitness.
1. Initialize the population $P$ randomly;
2. Evaluate: for each individual $p$, compute $f(p)$;
3. for (i = 0; i < n; i++)
   Generate the new population:
   (1) Mutation: generate a new individual by mutating the
       old individual with rate $P_m$. Replace the old individual
       with the new individual if
       $\min\{1, \exp(-\Delta f / T_i)\} > \mathrm{random}[0,1]$, where
       $\Delta f = f(\mathrm{old}) - f(\mathrm{new})$;
   (2) Multiple-point crossover and random constants
       mutation: generate a new individual from the old
       individual with rates $P_c$ and $P_{Dc}$. Accept the new
       individual according to the rule in (1);
   (3) Apply elitism;
4. Return the model with the highest fitness.
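
The acceptance test in step 3 is the Metropolis rule from Simulated Annealing, applied here to fitness maximization. The following Python sketch illustrates that rule together with the per-gene one-point crossover described above. The function names, the representation of chromosomes as strings, and the geometric cooling update $T_{i+1} = a \cdot T_i$ are assumptions made for illustration: the paper lists the parameter $a$ but does not state the update rule explicitly.

```python
import math
import random


def accept(f_old, f_new, temperature, rng=random):
    """Metropolis test for maximization: min{1, exp(-df/T)} > random[0,1)."""
    delta_f = f_old - f_new  # <= 0 when the new individual is at least as fit
    if delta_f <= 0:
        return True  # exp(-df/T) >= 1, so the minimum is 1
    if temperature <= 0:
        return False  # fully cooled: a worse individual is never accepted
    return math.exp(-delta_f / temperature) > rng.random()


def cooling_schedule(t0, a, generations):
    """Assumed geometric cooling, T_i = t0 * a**i (not stated in the paper)."""
    return [t0 * a ** i for i in range(generations)]


def per_gene_one_point_crossover(parent_a, parent_b, gene_length, rng=random):
    """One cut point per gene, i.e. multi-point recombination per chromosome."""
    child_a, child_b = [], []
    for start in range(0, len(parent_a), gene_length):
        gene_a = parent_a[start:start + gene_length]
        gene_b = parent_b[start:start + gene_length]
        cut = rng.randrange(1, gene_length)  # cut strictly inside the gene
        child_a.append(gene_a[:cut] + gene_b[cut:])
        child_b.append(gene_b[:cut] + gene_a[cut:])
    return "".join(child_a), "".join(child_b)
```

Under this assumed schedule, $T_0 = 50$ and $a = 0.8$ bring the temperature below 1 after 18 cooling steps, so worse individuals are accepted mainly in the earliest generations.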

IV. EXPERIMENT AND RESULT
In this paper, we compare GEPSA with basic GEP on three
problems [3].
The first is a sequence induction problem, where $a_n$
ranges over the nonnegative integers. The $n$th term $N$ of the
chosen sequence is given by the formula:
$$N = 5a_n^4 + 4a_n^3 + 3a_n^2 + 2a_n + 1 \qquad (1)$$
The second is a function-finding problem, where the
independent variable $a$ is a floating-point value chosen from the
interval [-1, 1]. In this case, the following “V” shaped function
was chosen:
$$y = 4.251a^2 + \ln(a^2) + 7.243e^a \qquad (2)$$
The third is the well-studied benchmark problem of
predicting sunspots [11]. In this case, 100 observations of the
Wolfer sunspots series were used (Table 1), with an embedding
dimension of 10 and a delay time of one.
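
With an embedding dimension of 10 and a delay time of one, each fitness case takes 10 consecutive observations as inputs and the following observation as the target. The Python sketch below illustrates this windowing. The function name `embed` and the placeholder series are assumptions for illustration; the actual inputs would be the 100 values of Table 1.

```python
def embed(series, dimension=10, delay=1):
    """Turn a time series into (inputs, target) fitness cases."""
    span = dimension * delay
    cases = []
    for t in range(span, len(series)):
        # Inputs are the `dimension` observations spaced `delay` apart before t.
        inputs = tuple(series[t - span + k * delay] for k in range(dimension))
        cases.append((inputs, series[t]))
    return cases


# Placeholder data standing in for the 100 observations of Table 1.
placeholder_series = list(range(100))
cases = embed(placeholder_series)
print(len(cases))  # 90
```

A series of 100 observations therefore yields 100 - 10 = 90 fitness cases.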
A. Fitness Function
We use fitness functions based on the absolute error (equation 3)
and the relative error (equation 4), where $M$ is the selection
range, $C_{(i,j)}$ is the value returned by the individual
chromosome $i$ for fitness case $j$ (out of $C_t$ fitness cases),
and $T_j$ is the target value for fitness case $j$ [1]:
$$f_i = \sum_{j=1}^{C_t} \left( M - \left| C_{(i,j)} - T_j \right| \right) \qquad (3)$$
$$f_i = \sum_{j=1}^{C_t} \left( M - \left| \frac{C_{(i,j)} - T_j}{T_j} \cdot 100 \right| \right) \qquad (4)$$
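
As a minimal sketch of equations 3 and 4, the two fitness functions can be written in Python as follows. The function names are hypothetical, and the example fitness cases (the sequence of equation 1 evaluated at $a_n = 1, \dots, 10$) are an assumption, since the text does not list which 10 integers were used.

```python
def fitness_absolute(predictions, targets, selection_range):
    """Equation 3: sum over fitness cases of M - |C - T|."""
    return sum(selection_range - abs(c - t)
               for c, t in zip(predictions, targets))


def fitness_relative(predictions, targets, selection_range):
    """Equation 4: sum over fitness cases of M - |(C - T) / T * 100|.

    Targets must be nonzero because the error is divided by T.
    """
    return sum(selection_range - abs((c - t) / t * 100)
               for c, t in zip(predictions, targets))


def sequence_term(a):
    """Equation 1: N = 5a^4 + 4a^3 + 3a^2 + 2a + 1."""
    return 5 * a**4 + 4 * a**3 + 3 * a**2 + 2 * a + 1


# Assumed fitness cases: a_n = 1..10 (illustrative only).
targets = [sequence_term(a) for a in range(1, 11)]
print(fitness_relative(targets, targets, 20))  # 200.0
```

A model that reproduces every target exactly earns the full selection range on each case, which is where the maximum fitness values quoted for each problem come from.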
We use R-square to evaluate the model, where $y_i$ is the
observed value, $\bar{y}$ is the average of all $y_i$, and
$\hat{y}_i$ is the value returned by GEP. A higher R-square
indicates a better regression model [5].
$$R^2 = \frac{SS_{Reg}}{SS_{Total}} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (5)$$
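
Equation 5 translates directly into code. The Python sketch below follows the definition as written in the paper (regression sum of squares over total sum of squares); the function name `r_square` is hypothetical.

```python
def r_square(observed, predicted):
    """Equation 5: SS_Reg / SS_Total, both taken about the observed mean."""
    mean = sum(observed) / len(observed)
    ss_reg = sum((p - mean) ** 2 for p in predicted)
    ss_total = sum((y - mean) ** 2 for y in observed)
    return ss_reg / ss_total


print(r_square([1, 2, 3], [1, 2, 3]))  # 1.0
```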

TABLE 1. WOLFER SUNSPOTS SERIES (READ BY ROWS)
101 82 66 35 31 7 20 92
154 125 85 68 38 23 10 24
83 132 131 118 90 67 60 47
41 21 16 6 4 7 14 34
45 43 48 42 28 10 8 2
0 1 5 12 14 35 46 41
30 24 16 7 4 2 8 17
36 50 62 67 71 48 28 8
13 57 122 138 103 86 63 37
24 11 15 40 62 98 124 96
66 64 54 39 21 7 4 23
55 94 96 77 59 44 47 30
16 7 37 74

B. Setting the System
For the sequence induction problem, 10 nonnegative
integers $a_n$ were used as fitness cases. The fitness function
was based on the relative error with a selection range of 20%,
giving a maximum fitness $f_{max} = 200$.
For the “V” shaped function problem, a set of 20 random
fitness cases chosen from the interval [-1, 1] was used. The
fitness function was based on the relative error with a selection
range of 100%, giving $f_{max} = 2000$.
For the time series prediction problem, using an embedding
dimension of 10 and a delay time of one, the sunspots series
presented in Table 1 results in 90 fitness cases. In this case, the
fitness function was based on the absolute error with a selection
range of 1000%, giving $f_{max} = 90,000$.
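
These maximum fitness values can be checked against equations 3 and 4: when every fitness case is matched exactly, each of the $C_t$ terms contributes the full selection range $M$. The general formula below is inferred from those equations rather than stated in the text.

```latex
f_{max} = C_t \cdot M: \qquad
10 \cdot 20 = 200, \qquad
20 \cdot 100 = 2000, \qquad
90 \cdot 1000 = 90{,}000
```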
C. Experimental Analysis
The function set, terminal set and genetic operators for the three
problems are all shown in Table 2. Columns 2, 5 and 8 give the
results of basic GEP with random constants; columns 3, 6 and 9
give the results of basic GEP without random constants [3];
columns 4, 7 and 10 give the results of GEPSA.
To test the robustness of the algorithm, we ran each of the three
problems 100 times independently and report the average
best-of-run fitness and the average best-of-run R-square. The
results in Table 2 show that both the average best-of-run fitness
and the average best-of-run R-square obtained by GEPSA are
higher than those obtained by basic GEP.

TABLE 2. GENERAL SETTINGS USED IN THE SEQUENCE INDUCTION (SI), THE “V” FUNCTION, AND SUNSPOTS (SS) PROBLEMS.
Setting                          SI*         SI          SI**        V*          V           V**         SS*         SS          SS**
Number of runs                   100         100         100         100         100         100         100         100         100
Number of generations            100         100         100         5000        5000        5000        5000        5000        5000
Population size                  100         100         100         100         100         100         100         100         100
Number of fitness cases          10          10          10          20          20          20          90          90          90
Function set                     +-*/        +-*/        +-*/        +-*/LEK~SC  +-*/LEK~SC  +-*/LEK~SC  +-*         +-*         +-*
Terminal set                     a, ?        a           a, ?        a, ?        a           a, ?        a, ?        a           a, ?
Random constants array length    10          ---         10          10          ---         10          10          ---         10
Random constants range           0,1,2,3     ---         0,1,2,3     [-1,1]      ---         [-1,1]      [-1,1]      ---         [-1,1]
Head length                      6           6           6           6           6           6           8           8           8
Number of genes                  7           7           7           5           5           5           3           3           3
Linking function                 +           +           +           +           +           +           +           +           +
Chromosome length                140         91          140         100         65          100         78          51          78
Mutation rate                    0.044       0.044       0.044       0.044       0.044       0.044       0.044       0.044       0.044
One-point recombination rate     0.3         0.3         ---         0.3         0.3         ---         0.3         0.3         ---
Two-point recombination rate     0.3         0.3         ---         0.3         0.3         ---         0.3         0.3         ---
Multi-point recombination rate   ---         ---         0.3         ---         ---         0.3         ---         ---         0.3
Gene recombination rate          0.1         0.1         ---         0.1         0.1         ---         0.1         0.1         ---
IS transposition rate            0.1         0.1         ---         0.1         0.1         ---         0.1         0.1         ---
IS element length                1,2,3       1,2,3       ---         1,2,3       1,2,3       ---         1,2,3       1,2,3       ---
RIS transposition rate           0.1         0.1         ---         0.1         0.1         ---         0.1         0.1         ---
RIS element length               1,2,3       1,2,3       ---         1,2,3       1,2,3       ---         1,2,3       1,2,3       ---
Gene transposition rate          0.1         0.1         0.1         0.1         0.1         0.1         0.1         0.1         0.1
Random constants mutation rate   0.01        ---         0.01        0.01        ---         0.01        0.01        ---         0.01
Dc specific transposition rate   0.1         ---         ---         0.1         ---         ---         0.1         ---         ---
Dc specific elements length      1,2,3       ---         ---         1,2,3       ---         ---         1,2,3       ---         ---
Initial temperature              ---         ---         20          ---         ---         50          ---         ---         50
Temperature alteration rate      ---         ---         0.95        ---         ---         0.8         ---         ---         0.8
Selection range                  20%         20%         20%         100%        100%        100%        1000%       1000%       1000%
Precision                        0%          0%          0%          0%          0%          0%          0%          0%          0%
Average best-of-run fitness      179.827     197.232     197.72      1914.8      1931.84     1972.46     86215.3     89033.3     89166.3
Average best-of-run R-square     0.97761     0.99935     0.99999     0.95726     0.99534     0.99663     0.71337     0.81186     0.87510
Success rate                     16%         81%         89%         ---         ---         ---         ---         ---         ---
Notes: “*” indicates the explicit use of random constants; “**” indicates the use of GEPSA.
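
The chromosome lengths listed in Table 2 can be checked from the head length and the number of genes. In GEP the tail length is $t = h(n - 1) + 1$, where $h$ is the head length and $n$ is the largest arity in the function set (2 for all three problems), and when random constants are used each gene carries an additional Dc domain as long as the tail. These two rules come from Ferreira's description of GEP [1,3] rather than from this paper, and the function name below is hypothetical.

```python
def chromosome_length(head, genes, max_arity=2, random_constants=False):
    """Length of a GEP chromosome built from equally sized genes."""
    tail = head * (max_arity - 1) + 1          # t = h(n - 1) + 1
    dc = tail if random_constants else 0       # Dc domain matches the tail
    return (head + tail + dc) * genes


print(chromosome_length(6, 7, random_constants=True))  # 140
print(chromosome_length(6, 7))                          # 91
print(chromosome_length(8, 3, random_constants=True))  # 78
```

The same function reproduces the remaining entries of the "Chromosome length" row (100, 65 and 51).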
The sequence induction problem can be solved exactly by both
basic GEP and GEPSA; with GEPSA, the first perfect solution was
found in generation 65 of run 1. To solve this problem,
Ferreira [3] used two approaches (basic GEP with and without
random constants), whose success rates are 16% and 81%,
respectively. The basic GEP methods use more than seven
genetic operators, whereas our approach uses only three and
still reaches a success rate of 89%.
To find the “V” shaped function, we use the function set F = {+,
-, *, /, L, E, K, ~, S, C} (“L” represents the natural logarithm,
“E” represents exp(x), “K” represents the logarithm of base 10,
“~” represents 10^x, “S” represents the sine function, and
“C” represents the cosine function). Using basic GEP,
Ferreira [3] reported a best solution with a fitness of 1990.023
and an R-square of 0.9999313. Using GEPSA, the best solution
was found in generation 3897 of run 87:
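[Evolved expression for $y$: a sum of five bracketed sub-expressions in $a$. The typeset formula did not survive extraction; the recoverable fragments are 0.068*cos(0.538)/0.114, exp(exp(0.806*sin(0.216))), (ln ...)*0.584/cos(sin ...), ln(1 ... cos(...)) and exp((cos(... 0.066) ... 0.5 ...)/0.792), with the positions of $a$ and several operators lost.]
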
It has a fitness of 1995.7 and an R-square of 0.999972
evaluated over the set of 20 fitness cases, and thus is better than
the model evolved with the basic GEP.
For the benchmark problem of predicting sunspots, the best
solution found with basic GEP has a fitness of 89176.61 and an
R-square of 0.882831. Using GEPSA, the best solution was
found in generation 4754 of run 71:
[Evolved expression for $y$: a sum of three bracketed sub-expressions in the lagged inputs $a$ through $j$, with the constants 0.808 and 0.262. The typeset formula did not survive extraction.]


It has a fitness of 89386 and an R-square of 0.924051
evaluated over the set of 90 fitness cases, and thus it is better
than the model evolved with the basic GEP.

From the comparisons of success rate, fitness and
R-square, we can see that the best solutions found by GEPSA are
better than those found by basic GEP on all three problems.
V. CONCLUSIONS
To decrease the dependence of GEP on the genetic operator rates
across different problems, we present a new algorithm, GEPSA, that
combines GEP and Simulated Annealing. Three problems were
tested with two approaches, basic GEP and GEPSA. The results
suggest that GEPSA is more efficient: the best evolved models are
more accurate than those of basic GEP, and the average fitness
and average R-square over 100 independent runs are also higher.
In the future, parallel computation and multi-population
strategies could be used to further improve the performance of GEP.
REFERENCES
[1] Ferreira, C., Gene Expression Programming: A New Adaptive Algorithm
for Solving Problems [J], Complex Systems, vol. 13, No. 2,
pp. 87-129, 2001.
[2] Ferreira, C., Genetic Representation and Genetic Neutrality in
Gene Expression Programming, Advances in Complex Systems, vol. 5,
No. 4, pp. 389-408, 2002.
[3] Ferreira, C., Function Finding and the Creation of Numerical Constants
in Gene Expression Programming, the 7th Online World Conference on
Soft Computing in Industrial Applications, September 23-October 4,
2002.
[4] LI Qu, CAI Zhihua, ZHU Li, ZHAO Yunsheng, Application of Gene
Expression Programming in Predicting the Amount of Gas Emitted from
Coal Face [J], Journal of Basic Science and Engineering, vol. 2, No. 1,
March 2004.
[5] LI Qu, CAI Zhihua, JIANG Siwei, ZHU Li, Gene Expression
Programming in Prediction, Proceedings of the 5th World Congress on
Intelligent Control and Automation, June, pp. 15-19, 2004.
[6] CAI Zhihua, LI Qu, JIANG Siwei, Symbolic Regression Based on GEP
and Its Application in Predicting the Amount of Gas Emitted from Coal
Face, Proceedings of the 2004 International Symposium on Safety
Science and Technology, October 2004.
[7] WANG Xuemei, WANG Yihe, The Combination of Simulated Annealing
and Genetic Algorithm, Chinese Journal of Computers, vol. 20, No. 4,
pp. 381-384, 1997.
[8] WANG Ling, ZHENG Dazhong, Unified Framework for Neighbor Search
Algorithms and Hybrid Optimization Strategies, Tsinghua University (Sci
& Tech), vol. 40, No. 9, pp. 125-128, 2000.
[9] Banzhaf, W., Genotype-phenotype-mapping and Neutral Variation – A
Case Study in Genetic Programming. In Y. Davidor, H.-P. Schwefel, and
R. Männer, eds., Parallel Problem Solving from Nature III, vol. 866 of
Lecture Notes in Computer Science, Springer-Verlag, 1994.
[10] Ferreira, C., Mutation, Transposition, and Recombination: An Analysis of
the Evolutionary Dynamics, Proceedings of the 6th Joint Conference on
Information Sciences, 4th International Workshop on Frontiers in
Evolutionary Algorithms, pp. 614-617, 2002.
[11] Weigend, A. S., B. A. Huberman, and D. E. Rumelhart, Predicting
Sunspots and Exchange Rates with Connectionist Networks. In S. Eubank
and M. Casdagli, eds., Nonlinear Modeling and Forecasting, pp. 395-432,
Redwood City, CA, Addison-Wesley, 1992.