A survey of Evolutionary Algorithms for
Data Mining and Knowledge Discovery
Alex Freitas (2001)
A look at Genetic Algorithms(GA) and Genetic Programming(GP)
also from the point
view of the user.
An overview of Data Mining and Knowledge Discovery
The three desirable properties of KD:
1)Accuracy. We would like a high predictive accuracy rate
for our data mining task of classification
2)Comprehensible.The user should
be presented with a
“black box” predictor that spouts out a number of IF_THEN rules
and say this is the output. Prediction rules should be
3)Interesting. In many cases the user should be presented
with an ensemble of discovered rules not all of which will be
interesting in practice.
Show the user some schematics:
“We believe that ideally a combination of
approaches should be used to try to solve the
very hard problem of returning interesting knowledge to
Objective approaches: clearly the algorithms and
computer processing give us high
Subjective approaches: user helps to define and re
those formulas or IF_THEN rules that are of interest to the
Data Mining Task: Classification
Many of the KD procedures are to predict the value (the class) of a
specified goal attribute based on the predicting attributes.
E.G. IF( Unpaid_Loan =“NO”) AND (Overdrafts=“Yes”)
THEN ( Credit = “Bad”)
For a comprehensive discussion about how to measure predictive
accuracy of classification rules we are referred to  Hand, DJ ,
Construction and Assessment of Classification Rules
Tom Mitchell’s book also has good information.
Data Mining : Other tasks
Dependence Modelling :an extension or
generalization of classification
Clustering: discovering groups: unsupervised
Discovery of Association Rules: more than one
item in the consequent attribute is possible; the
classification task may be assymmetric with
respect to the predicting attributes and the
The Knowledge Discovery Process
Discretization: transforming a “continuous” attribute into a
discrete one. (For example, “low”, “medium”, “high”)
Attribute Selection: selecting a set of attributes relevant for
The motivation for attribute selection may be obvious: it has
been found that irrelavant attributes can somehow “confuse”
the data mining algorithm, leading to the discovery of
inaccurate or useless knowledge. ( A wrapper method can
help select attributes.)
Discovered Knowledge Postprocessing
Two main motivations:
First , when the discovered rule set is large, we want to
simplify it. (The user may or may not help at this stage.)
Some other techniques will be addressed.
Second, we often want to extract a subset of interesting
rules, among all the discovered ones. We will look at
methods later on(GA) but
methods involving a user/data miner collaboration may
also be important.
Genetic Algorithms (GA) for Rule Discovery
Michigan approach: population consists of individuals(“chromosomes”)
where each individual encodes a single prediction rule.
Pittsburgh approach: each individual encodes a set of prediction rules
Pluses and minuses: the Pittsburgh approach directly takes into account
rule interaction when computing the fitness function of an individual. This
approach leads to syntactically
longer individuals. In the Michigan
approach the individuals are simpler and syntactically shorter. It
simplifies the design of genetic operators. (Interactions in Michigan
approach are not taken into account.)
Take the rule: IF cond#1 AND cond#2 AND …cnd#n….THEN class= c(i)
Representation of the rule antecedent
Representation of rule consequent ( the THEN part)
The Rule Antecedent (Using GA)
Often there is a conjunction of conditions.
Usually use binary encoding.
A given attribute can take on k discrete values. Encoding can
consist of k bits.( for “on” or “off”) Allows for internal disjunctions.
0 0 1 1 1 0 1 1 0 0………0
All bits can be turned into “1” ’s in order to “turn off” this
binary encoding is possible. Variable
length individuals will
arise. May have to modify crossover to be able to cope with
Representing the Rule Consequent (Predicted Class)
Three ways of representing the predicted class. (the THEN part)
First, encode it in the genome of an individual (possibly
making it subject to evolution.)
Second, associate all individuals of the population with the
same predicted class, which is never modified during the
running of the algorithm.
Third, choose the predicted class most suitable for a rule (a
deterministic way) as soon as the corresponding rule
antecedent is formed. ( E.G. Maximize fitness.)
Author believes the third possibility to be the most
Genetic Operators for Rule Discovery
Selection: Each individual represents a single rule.(Michigan
approach). An approach called REGAL can be used. Individuals
to be “mated” are “elected” by training examples. (Use a fitness
operator or a probabilistic model.)
Generalizing/specializing crossover: basic idea of this special
kind of crossover is to generalize or specialize a given rule,
depending on whether it is currently overfitting or underfitting
condition operator. The g/s of a rule
can be done in a way independent of crossover. (Tweaking the
antecedent conditions especially if contnuous conditions exist.)
Fitness Function for Rule Discovery
Remember:Accuracy, comprehensibility + interesting
How to get these 3 rule quality criteria incorporated
into a fitness function.
Let a rule be IF A THEN C , the calculate the
confidence factor CF = TP / ( TP + FP) using chart:
C not C
Predicted C TP FP
. Class not C FN TN
Comp=completeness measure = TP / (TP + FN)
Fitness = CF * COMP = (TP)(TP) / (TP+FP)(TP+FN)
Fitness = w1 X (CF * COMP) + w2 X (Simp) where Simp is a
measure of rule simplicity 0< Simp<1 and w1 and w2 are weights
Genetic Algorithms (GAs) for Pre
“The use of GAs for attribute selection seems natural.
The main reason is that the major source of difficulty in
attribute selection is attribute interaction, and one of
the strengths of Gas is that they usually cope well with
We can use very simple genetic encoding where each
individual represents a candidate attribute subset. A
candidate attribute subset can be represented as a
string with m binary genes where m is the number of
attributes and each gene can take on a “0” or “1”.
Follow crossover and mutation procedures.
GA can be used with nearest neighbour algorithms
(NNA) to “tweak” for better results.
Genetic Algorithms (GAs) for Post
GAs can be used in the post
processing step when there is an
ensemble of classifiers (e.g. rule sets) created.
Generating an ensemble of classifiers is useful since it has been
shown that in several cases an ensemble of classifiers has a better
predictive accuracy than a single classifier.
A fitness function may be created using weights for each classifier
in the ensemble. (A user may help.) There are also GA schemes to
optimize the weights of the classifiers.
Genetic Programming (GP) for Rule Discovery
Individual representation: attributes are often numeric
Functions such as +,
,*,/ , < , >,=, AND, OR,… are used
as well as input arguments.
An individual is often represented as a tree diagram.
Once we apply the functions in the internal nodes of a
GP individual to the values of the attributes in the leaf
nodes of the individual, the system computes a
numerical value that is output at the root of the tree.
Discovering comprehensible rules using GP: these
rules could be similar to GA rules but there are some
Simplicity = (MaxNodes
0.5) / (MaxNodes
This could lead to discovery of short, simple rules that
may be required in the Medical field.
Genetic Programming (GP) for Data Pre
A Major problem in attribute construction is that the search space
tends to be huge. If the search can be accomplished especially
with the relational operator “>” then many good candidate
operations may evolve.
Sometimes there are GA/GP methods for pre
In his Chapter on evolutionary algorithms Alex Freitas has shown
us where the emphasis should be laid in Knowledge learning. His
goal of “transparency” of pre
processing, rule learning and post
processing methods should be taken into account. A user would
like to know where the classifier or the IF_THEN rule came from
and if he can help influence the process to get a more intelligent
High Classification Accuracy does not imply
Effective Genetic Search (Tim Kovacs;Manfred Kerber)
The authors are publishing their experimental results which they
believe will help clear up some of the limitations of GA methods. In
particular they examine XCS, a popular classification system
which uses genetic algorithms.
The paper by K+K refers us to work by Stewart Wilson(1995)
entitled “ Classifier Fitness Based on Accuracy”. In XCS each
classifier maintains a prediction of expected payoff, but the
classifier’s fitness is given by a measure of the predictor’s
accuracy. Wilson’s example shows some individuals in the
Where p=prediction є = prediction error F= fitness parameter
XCS and XCS
NGA are compared
The authors XCS
NGA (XCS with no GA) uses XCS modified so that
genetic search does not operate on the initial rule population. In all
other respects XCS
NGA functions as XCS.
XCS classifies data points by a vote among the rules which match
it, with each vote weighted both by the rule’s fitness. In this way a
accuracy rule and a high
accuracy rule is given the
classification of the high
In XCS, the rules (region shapes and sizes) are adapted by the GA.
NGA lacks a GA and its region shapes and sizes do not
NGA relies on there being enough rules to adequately solve
the rule improvement problem (rule discovery) by random chance.
Roughly speaking, XCS
NGA’s approach is to generate many
random rules and ignore those which happen to have low
High Accuracy implications of XCS
The authors have obtained very good results with XCS
agree that this alternative has its limitations but they want to
address the publication of classification accuracy as a goal onto
Many published papers use 6
bit multiplexer functions only.
(Strings are of length L= k + 2^k so that if k=2bits then L=6 . For a
bit multiplexer we have : k=6 bits, and L= 70.)
Because of its random nature the XCS
NGA thrives better with
more initial rules and then give excellent results.
The claim is that they argue that only those studies which claim
effective genetic research based on results with small functions
are demonstrated invalid by their results with XCS
A More Powerful Metric for Evaluating Genetic Search
This metric is symbolized by %[O] where %[O] = the proportion of
the Optimal solution in the rule population on a given time step.
This metric has been shown to have greater discriminatory power
than the performance metric introduced by Wilson.
Wilson’s “performance” is defined as a moving average of the
proportion of the last n trials in which the system has responded
with the correct action. ( n is traditionally equal to 50.)
There are often 400 rules to start with. The XCS
NGA does better
with more rules.
%[O] is better able to discern the progress of genetic search than
the performance metric.
The new metric extends the utility of small tests.
%[O] has disadvantages including the need to compute the optimal
solution in advance as well as the computational expense of
Finally, replacing GA with an interactive random rule generator
would provide a baseline against which to compare genetic search.