An Application of Genetic Algorithms to Uplift Modelling

David P. Hofmeyr
Master of Science by Coursework
University of Edinburgh
2011
Declaration
I declare that this thesis was composed by myself and that the work contained
therein is my own, except where explicitly stated otherwise in the text.
(David P. Hofmeyr)
Abstract
This paper tackles the problem of uplift modelling, i.e. modelling change
in behaviour as a direct result of treatment, using randomised methods, namely
evolutionary algorithms, both for variable generation and variable selection. We
give a detailed description of the evolutionary methods entailed as well as some
of the key aspects of uplift modelling, such as the Qini coefficient and some
current methods of modelling. We then apply this evolutionary approach to an
example problem published by Kevin Hillstrom in his blog (MineThatData) and
discuss how our results compare favourably with those from the winning submission.
I would like to thank my supervisor, Dr. Nicholas J.
Radcliffe, for his support, guidance and expertise,
without which this dissertation would not have been
possible. His extensive knowledge in interests we share
has been a great boon, and has aided hugely my
understanding of these subjects.
I would also like to thank Stochastic Solutions Ltd. for
the use of their uplift software.
Contents
Abstract
1 Introduction:
2 Uplift Modelling
2.1 The Basics of Uplift Modelling
2.2 Significance Based Uplift Trees
2.3 Evaluating Uplift Models (The Qini Coefficient)
3 Evolutionary Algorithms (Genetic Algorithms)
3.1 The Evolutionary Process
3.2 Genetic Programs
3.2.1 Trees
3.2.2 Terminals and Functions
3.2.3 The Initial Population
3.2.4 Evolutionary Operators for Genetic Programs
3.2.5 A Word on LISP and Symbolic Expressions
3.3 The Power of Genetic Algorithms
3.3.1 The Schema Theorem
3.3.2 Forma Analysis
3.3.3 Convergence of Genetic Algorithms
4 Application of Genetic Algorithms to Uplift Modelling and the Hillstrom Challenge
4.1 Application of Genetic Algorithms to Uplift Modelling
4.1.1 The Genetic Programming Approach
4.1.2 The Polynomial Approach:
4.2 The Hillstrom Challenge and Results
4.2.1 The Data and Questions
4.2.2 Summary of Results from Winning Submission
4.2.3 Results of the Genetic Algorithm Approach
5 Conclusion:
A Variables and Models
A.1 Men's Model. Genetic Programming Approach
A.2 Women's Model. Genetic Programming Approach
A.3 Men's Model. Fitness Proportionate Selection. Combined Genetic Programming and Uplift Tree
A.4 Women's Model. Fitness Proportionate Selection. Combined Genetic Programming and Uplift Tree
Structure. Beginning in chapter 2, we direct our attention to the basic
components of uplift modelling, including the inspiration for its development,
its application, some of the mechanical aspects of model building and a means
for evaluating models. In chapter 3 we turn our focus to the topic of
evolutionary algorithms (specifically genetic algorithms): their medium,
structure, use and application, with particular emphasis on genetic programming.
In chapter 4 we merge the two and show directly how the evolutionary processes
described in chapter 3 can be applied to the field of uplift modelling. We go on
to apply these methods to a challenge published by Kevin Hillstrom in his blog,
MineThatData, and compare our results with those from the winning submission.
Finally we conclude in chapter 5 with a discussion of our findings, the
shortcomings of our method and advice for future implementation.
Chapter 1
Introduction:
Uplift modelling is a class of predictive modelling techniques in its relative
infancy compared with many others. It follows a lineage of modelling methods in
the field of customer relations management, particularly in the field of direct
marketing, dating back to the introduction of data mining in the 1950s.
Aimed at predicting the incremental impact of a targeted action, uplift modelling
addresses the shortcomings of its predecessors in this area. These pre-existing
techniques do well, through the use of control groups, to measure the incremental
effects of an action after the event but fail to model it predictively.
Due to its being a new discipline, and the difficulty in modelling the second
order nature of incrementality, there is a shortage of documented modelling
techniques that currently predict incremental impact. In this paper we will
attempt to build uplift models using evolutionary algorithms.
Inspired by John Holland's Adaptation in Natural and Artificial Systems, the
practice of evolving solutions by "genetic" interaction has itself evolved well
beyond its canonical form as presented therein.
Genetic programming allows us to build models of unbounded complexity, and
we hope to utilise this power in building uplift models. First, by developing the
models directly within the genetic programming paradigm in its pure form and
second, by combining it with an existing modelling technique in an interactive
evolutionary process, we hope to obtain a useful and effective way of developing
uplift models.
We are given a useful metric for comparing uplift models, namely the Qini
coefficient, and using this we will be able to compare our findings with models
built using deterministic methods alone, as well as with some built using a
simpler evolutionary process.
Chapter 2
Uplift Modelling
Uplift is defined as a measure of the change in behaviour of an individual (or
group of individuals) as a direct result of some action. For example, in the
context of marketing, the uplift associated with some campaign can be understood
as the incremental sales volume generated by it.
Uplift modelling is concerned with identifying individuals or subsets of a
population for which a (usually binary) influence variable has the greatest
incremental impact. As is always the case with predictive modelling, accuracy is
paramount (i.e. we seek to have manageable error in the model), but in addition
to this we seek to isolate the effect of the influence variable as well as have
its predicted effect be as non-uniform as possible. A fairly simple regression
model will handle such a variable and be able to isolate its effect adequately;
however, in this case (for a binary influence variable) the two submodels will
be parallel (i.e. the predicted effect of the variable will be uniform across
the population) and so will be unable to identify those for whom the effect is
greatest.
Uplift modelling was born out of the failure of traditional marketing strategy
models to recognise the distinction between "targeting people who are likely to
buy if they are included in a campaign" and "targeting people who are only likely
to buy if they are included in a campaign." Modelling on the former will at best
do well to measure and assess incremental sales as a result of some campaign. On
the other hand, understanding the latter allows one to actually maximise them.
The traditional models concentrate on associating purchases with treatment by
some marketing campaign. While this method can certainly give evidence that a
customer was in some way influenced by the campaign, there is no guarantee of
it. It may be that such a customer would have bought whether or not they were
targeted by the campaign, and so any increment in purchasing is not recognised.
Moreover, the potential for negative influencing by campaigns is completely
ignored. While it may seem absurd that targeted marketing could significantly
negatively influence a potential customer's likelihood of purchasing, these
phenomena are well documented ([1]).
In this section we will discuss some of the basic ideas behind uplift modelling,
modelling methods that preceded it (so-called response models) and some of their
shortcomings, as well as a metric for evaluating uplift models (namely the Qini
coefficient) which will be used in our analysis later on.
2.1 The Basics of Uplift Modelling
Consider a population P divided into two disjoint subpopulations, T and C.
Suppose then that members of T are exposed to some treatment, while members of C
are not (we refer to T as the treated population and C as the control population).
We will begin with the binary case and consider the outcome variable
$O \in \{0, 1\}$, where $O = 1$ is seen as the "desirable" outcome.
A conventional response model attempts to model
$$P(O = 1 \mid x; T),$$
that is, the probability of an individual, described by the variables x, returning
the desirable outcome given that they are in subpopulation T (i.e. were treated).
Notice that this technique ignores the subpopulation C (i.e. those not treated).
Uplift models, on the other hand, instead model
$$U(x) := P(O = 1 \mid x; T) - P(O = 1 \mid x; C),$$
the increase in probability that an individual will return the desired response
given that they are in subpopulation T (i.e. were treated) over the relevant
probability were they in subpopulation C (i.e. were not treated).
When O is continuous (or at least discrete but not binary) then we can instead
consider expectations. In this case, uplift models attempt to model
$$U(x) := E[O \mid x; T] - E[O \mid x; C],$$
the increase in expected value of the outcome O given that the individual,
described by the variables x, is in subpopulation T over that were they in
subpopulation C.
The difficulty in modelling this arises when we realise that the uplift for an
individual is not an observable quantity, since no individual can be in both T
and C. The most obvious way to model this is to fit a response model for each of
the terms independently (that is, a model for individuals in T and another for
those in C) and then to subtract the one from the other. In theory there is
nothing intrinsically wrong with this method; however, in practice it is not
always reliable, and can in fact be quite poor ([1]).
Instead we can consider models that impose a segmentation on the population
and base expectations on averages within those segments. In this case we
encounter difficulties surrounding reliability, as estimates for every segment
need to be statistically robust and thus considerably more observations are
required than in some other modelling methods.
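The segment-based idea can be illustrated with a minimal Python sketch. It is not taken from the thesis: it assumes every individual has already been assigned to a segment, and it estimates U for each segment as the treated response rate minus the control response rate. The function name `segment_uplift` and the toy records are hypothetical.

```python
from collections import defaultdict


def segment_uplift(records):
    """Estimate uplift per segment as treated response rate minus control response rate.

    `records` is an iterable of (segment, group, outcome) triples, where group is
    "T" or "C" and outcome is 0 or 1. Segments missing either group are skipped,
    because no uplift estimate is possible for them.
    """
    # For each segment and group, keep [number of responders, number of individuals].
    counts = defaultdict(lambda: {"T": [0, 0], "C": [0, 0]})
    for segment, group, outcome in records:
        counts[segment][group][0] += outcome
        counts[segment][group][1] += 1

    uplift = {}
    for segment, by_group in counts.items():
        (resp_t, n_t), (resp_c, n_c) = by_group["T"], by_group["C"]
        if n_t and n_c:
            uplift[segment] = resp_t / n_t - resp_c / n_c
    return uplift


# Toy data: the treatment helps the "young" segment and hurts the "old" one.
records = [
    ("young", "T", 1), ("young", "T", 1), ("young", "C", 1), ("young", "C", 0),
    ("old", "T", 0), ("old", "T", 1), ("old", "C", 1), ("old", "C", 1),
]
print(segment_uplift(records))
```

With only two observations per cell the estimates are of course far from statistically robust, which is exactly the reliability problem noted above.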
2.2 Significance Based Uplift Trees
When we consider a segmentation of the population as suggested above, a natural
choice of model structure is tree-based models, as they are intrinsically based on
segments. In fact the only packaged uplift modelling software currently available
is based on a tree building algorithm, the basics of which will be discussed herein.
The key features of the tree based model are:
• Significance based splitting:
The goal of generating useful splits during the growth of the tree model is
to both maximise the difference in uplift between the two subpopulations
and minimise the difference in their sizes. In general, however, these two
objectives are at odds with one another, and a method of satisfying both
satisfactorily is not trivial. The significance-based splitting criterion fits a
linear model to each of a set of potential splits and uses the significance of
the interaction term as a measure for selection. The outcome variable is
modelled as
$$O_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij},$$
where $O_{ij}$ indicates the expected outcome returned by an individual in
subpopulation i (either treated (T) or control (C)) on side j (L or R) of
the split being considered. $\mu$ is a constant related to the mean overall
outcome across the entire population. $\alpha_i$ relates to the effect of
treatment in subpopulation i (naturally we set $\alpha_C$ to be zero).
$\beta_j$ quantifies the effect of the split by indicating the base difference
in outcome between the two sides. Finally, $\gamma_{ij}$ measures the
interaction between the treatment and the split.
There is no loss of generality in setting
$$\beta_L = \gamma_{CL} = \gamma_{CR} = \gamma_{TL} = 0,$$
which leaves $\gamma_{TR}$ as the difference in uplift between the
subpopulations either side of the split, which is precisely the quantity of
interest. The usefulness of taking the significance of this estimate rather than
its magnitude as a selection criterion lies in the fact that the t-statistic on
which it is based implicitly combines both magnitude and population size.
• Variance-based pruning
The additional difficulties encountered in modelling uplift, compared with
many other quantities, lie in the facts that overall uplift due to the
treatment is often small compared with other relationships in the data, that
the control population is often significantly smaller than the treated, and
finally that uplift is second-order in nature (i.e. it is a measure of the
change in outcome, rather than a basic measure of magnitude of an observable
quantity), so that larger errors tend to arise in estimation. As a result,
stability of models is often difficult to achieve.
Radcliffe ([1]) recommends resampling the training population k times (k =
8 is common), training on the first sample and evaluating the stability with
reference to the other k - 1 samples. At each node in the model, the uplift
is measured in population 1 and its standard deviation is estimated from
the other k - 1 samples. Any split for which a child node exhibits standard
deviation greater than some threshold is discarded, along with the subtree
descending from it.
• Bagging
The practice of bagging is a further concession to the rife instability inherent
in building uplift models. It refers to building a number of models (typically
10 or 20) as described above, each using different resamplings, and basing
predictions on the averages achieved over all the models. Radcliffe ([1])
notes that this approach can succeed in cases where building only a single
tree failed.
• Pessimistic Qini-based variable selection
The major factors regarding variable selection in any model building technique
are reducing dimensionality (particularly to prevent overfitting on the data
used for building the model), avoiding multicollinearity (strong correlations
between predictor variables can lead to unstable results and to
misinterpretations of the relationships inherent in the data that are
recognised by the model), improving model quality and stability, and
improving model interpretability.
Again the stability issue is more apparent in uplift modelling than in general.
Basing variable selection on pessimistic Qini-based estimates refers to ranking
the candidate variables according to a quality measure (the Qini coefficient,
see below) and choosing either a fixed number of variables, or all
which are above a specified threshold level. Radcliffe ([1]) recommends an
adjustment to the basic Qini coefficient to be more pessimistic and, in doing
so, reduce the likelihood of selecting variables which might lead to instability
in the model. As in the case of bagging, this is done by resampling the
training population repeatedly and subtracting a multiple of the standard
deviation of Qini values obtained from the initial Qini estimate.
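The significance-based splitting criterion described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure used by the packaged software: it assumes the linear model is fitted by ordinary least squares with a common residual variance, in which case the interaction estimate is the uplift on the right of the split minus the uplift on the left, and its standard error follows from the four cell sizes. The function name `interaction_t_statistic` and the toy data are hypothetical.

```python
import math


def interaction_t_statistic(cells):
    """Interaction estimate and its t-statistic for one candidate split.

    `cells` maps each of ("T", "L"), ("T", "R"), ("C", "L"), ("C", "R") to the
    list of outcomes observed in that cell. Returns (gamma_TR, t).
    """
    means, n_total, sse, inverse_sizes = {}, 0, 0.0, 0.0
    for key, values in cells.items():
        n = len(values)
        mean = sum(values) / n
        means[key] = mean
        n_total += n
        sse += sum((v - mean) ** 2 for v in values)   # residual sum of squares
        inverse_sizes += 1.0 / n

    # Uplift on the right-hand side minus uplift on the left-hand side.
    gamma = (means[("T", "R")] - means[("C", "R")]) - (means[("T", "L")] - means[("C", "L")])
    residual_variance = sse / (n_total - 4)           # four fitted cell means
    standard_error = math.sqrt(residual_variance * inverse_sizes)
    return gamma, gamma / standard_error


cells = {
    ("T", "L"): [1, 0, 0, 0], ("C", "L"): [0, 1, 0, 0],
    ("T", "R"): [1, 1, 1, 0], ("C", "R"): [0, 0, 1, 0],
}
gamma, t = interaction_t_statistic(cells)
print(gamma, t)
```

Because the standard error shrinks as the smallest cell grows, a split with a large uplift difference but a tiny subpopulation receives a modest t-statistic, which is the trade-off between magnitude and population size that the criterion is meant to capture.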
2.3 Evaluating Uplift Models (The Qini Coefficient)
We have discussed a small amount about what is desirable in an uplift model, but
given some collection of models how do we determine which is "best"? There is a
tendency in practice to resort to ad hoc methods, most often graphical. Several
authors suggest comparing uplift in the top k deciles to the overall uplift;
however, different choices of k can reverse the comparison ([2]), and without
some conformity in choosing a cut-off level this method does not appear very
robust. This method also ignores the opposite end of the spectrum, those for
which uplift is considerably lower than the overall, or is even negative.
Moreover, it is very possible, and has been observed in practice ([2]), that
decile segmentation does not generate any statistically significant differences
and a coarser segmentation is needed.
Radcliffe ([2]) proposes the Qini coefficient, an analogue of the Gini
coefficient ([11]) for uplift. Where the Gini coefficient measures the (anti-)
correlation between outcome and targeting depth (based on an ordering induced by
the model), the Qini instead measures the (anti-) correlation between uplift
rate and targeting depth.
Let $M$ be a model that induces a total ordering on a population, divided into T
and C as above. Define $f^T_M(x) : [0, 1] \to \mathbb{R}$ to be the response
rate at depth x, according to the ordering induced by $M$, for the treated
population. Similarly, define $f^C_M$ for the control population. (Note that for
a finite population, f will not be defined over the continuous interval [0, 1];
in such a case we consider the piecewise continuous approximation formed by
spreading the probability density associated with discrete scores over the
equivalent interval of continuous ranks.)
We can then define the predicted uplift pointwise, for a given depth x,
$$u_M(x) := f^T_M(x) - f^C_M(x).$$
And with this we can define the cumulative outcome and uplift rates
$$F^T_M(x) = \int_0^x f^T_M(y)\,dy,$$
$$F^C_M(x) = \int_0^x f^C_M(y)\,dy,$$
$$U_M(x) := F^T_M(x) - F^C_M(x).$$
Consider a model, R, which imposes a random ordering on the population.
We have
$$E[U_R(x)] = \bar{u}x, \quad \forall x \in [0, 1],$$
where $\bar{u} = \int_0^1 u(x)\,dx$ is the average uplift in the population. The
Qini coefficient, Q, is defined by the area between $U_M$ and this diagonal,
i.e. the average excess cumulative uplift resulting from the model $M$,
normalised by the equivalent value for the best possible ordering.
Radcliffe acknowledges ([2]) that a model inducing such a best possible ordering
might not even be theoretically achievable: for example, no model can separate
individuals described identically by the predictor variables whose outcomes are
independent of treatment. To incorporate this, he also defines $q_0$, which is
based on the "best case" model with no negative uplift.
Note that the transformation from Q to $q_0$ is monotone, and so the choice of
one over the other is essentially down to interpretation.
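To make the definitions above concrete, the following Python sketch computes the area between $U_M$ and the random-ordering diagonal for a finite population. It is a simplified illustration, not Radcliffe's reference implementation: it omits the normalisation by the best possible ordering (so it returns an unnormalised value), it approximates the integral with the trapezium rule on an assumed grid of `steps` points, and the function names are hypothetical.

```python
def cumulative_rate(outcomes_in_order, depth):
    """F(x): responders in the top `depth` fraction of one group, divided by the group size.

    Each individual occupies an equal-width slice of [0, 1]; a partially covered
    slice contributes proportionally, giving a piecewise linear F.
    """
    n = len(outcomes_in_order)
    position = depth * n
    whole = int(position)
    total = sum(outcomes_in_order[:whole])
    if whole < n:
        total += (position - whole) * outcomes_in_order[whole]
    return total / n


def qini_area(scores, treated, outcomes, steps=1000):
    """Unnormalised Qini: integral over [0, 1] of U_M(x) - x * U_M(1)."""
    # Order the population from highest to lowest model score.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    treated_outcomes = [outcomes[i] for i in order if treated[i]]
    control_outcomes = [outcomes[i] for i in order if not treated[i]]

    def cumulative_uplift(x):
        return cumulative_rate(treated_outcomes, x) - cumulative_rate(control_outcomes, x)

    overall = cumulative_uplift(1.0)   # average uplift; slope of the random diagonal
    excess = [cumulative_uplift(k / steps) - overall * k / steps for k in range(steps + 1)]
    return sum((a + b) / 2 for a, b in zip(excess, excess[1:])) / steps


# Toy population: high scores mark treated responders and control non-responders.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
treated = [1, 0, 1, 0, 1, 0, 1, 0]
outcomes = [1, 0, 1, 0, 0, 1, 0, 1]
print(qini_area(scores, treated, outcomes))                    # approximately 0.25
print(qini_area([-s for s in scores], treated, outcomes))      # approximately -0.25
```

Reversing the ordering flips the sign of the area, which illustrates why the measure also penalises models that rank negatively affected individuals first.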
Chapter 3
Evolutionary Algorithms
(Genetic Algorithms)
The concept behind evolutionary algorithms stems from our recognition of
nature's ability to produce (evolve) species and individuals which appear well
adapted to their environment. Regardless of how harsh an environment may be,
life finds a way to propagate, and even flourish.
The evolutionary algorithm in its essence is a search method (heuristic) which
draws on ideas akin to Darwinian evolution to produce increasingly appealing
(fit) solutions to a problem. The power of evolutionary algorithms (specifically
genetic algorithms) was initially formalised in John Holland's pioneering work
([3]), in which he describes the notion of schema analysis. Schema analysis
operates within the genotype (representation) space and shows how individuals
sharing similarities in genetic structure (schemata) which display sufficiently
above average fitness* have a tendency to proliferate. While this has given us
tremendous insight into the usefulness of genetic algorithms, a crucial
limitation of schema analysis lies in the specificity of the genotype space,
that is, the requirement that "chromosomes" (representations) be linearly
arranged strings of a fixed number of genes, each of which comes from a
predefined set of alleles. Greene ([4]) discusses the usefulness of considering
nonlinear arrangements of chromosomes and how Holland's Schema Theorem can be
extended to the case of a general connected graph arrangement. Forma analysis
([5]) tackles very similar ideas from a much more general perspective and allows
for a far more universal application. Formae refer to equivalence classes
implied by any equivalence relation over the phenotype (solution) space, and so
similarities between subscribing individuals need not be restricted to schemata;
thus many of the limitations associated with schema analysis are handily avoided.
In this chapter we will introduce and briefly discuss some important issues and
ideas in the scope of evolutionary algorithms that are relevant to this paper. We
will then give a slightly more detailed account of genetic programming, a
particular type of evolutionary algorithm. Finally we will touch on some theory
showing the effectiveness of genetic algorithms.
* The level of fitness above average depends on the nature of the schema in terms
of its length and degree of specificity.
3.1 The Evolutionary Process
In this section we outline the broad structure of the evolutionary process
associated with the algorithms. We begin with some necessary terminology.
Definition. The Search Space, S, is the set of all possible solutions to a problem.
S is also referred to as the solution space and the phenotype space.
Definition. The Representation Space, G, is the set of all (gene) representations
of members of S.
G is also referred to as the genotype space and the chromosome space.
Definition. The Solution Map, $F : G \to S$, associates each element of the
representation space with its associated solution.
F should be onto, i.e. $\forall s \in S : F^{-1}(\{s\}) \neq \emptyset$. In
other words, every possible solution has a representation in G. Ideally we would
want F also to be one to one, i.e. $\forall s \in S : |F^{-1}(\{s\})| = 1$,
where $|\cdot|$ denotes the cardinality of a set. However, this uniqueness of
representations is, in general, not satisfied.
Definition. The Evaluation Function, $E : S \to \prod_{i \in I} P_i$, where I is
some index set, maps from the solution space to a product of totally ordered sets.
For the purposes of this paper we only consider real-valued evaluation
functions and refer to E(s) as the fitness of s. Weise devotes a significant
part of ([6]) to showing how higher dimensional evaluation functions can be
reduced by the use of Pareto dominance. In the field of optimisation these
higher dimensional functions usually arise in the presence of multi-objective
problems.
Definition. An Evolutionary Process is a 5-tuple, (S, G, O, F, E), where S is the
solution space, G is the representation space, F is the solution map, E is the
evaluation function and O is a collection of so-called evolutionary operators.
Evolutionary operators take collections of elements from the representation
space (or solution space), along with some parameter(s) (from some predefined
control set) and return a collection of elements from the representation space
(or solution space). The nature of the control set depends on the operator and
should be designed in a way that, for a given operator, collection of operands
and parameters, the output is unique.
In this paper we will consider three types of evolutionary operators. Firstly,
selection operators, which take an entire population of individuals and return
some subset of that population. Selection operators can be used to determine
breeding pools (collections of individuals chosen to undergo recombination) as
well as for culling populations in order to control the population size.
Definition. An adjustment to the culling operator referred to as elitism ensures
that individuals defined as elite cannot be selected for removal from the
population.
In general elitism refers specifically to retaining the "fittest" member of the
population; however, we prefer this slightly more general definition, which
allows retention of any predefined set of individuals.
Secondly, recombination operators, which take pairs of individuals and return
(according to the control parameters) some number of "child" solutions. Much
of ([5]) is devoted to useful properties of recombination operators in the field
of forma analysis. In general these properties require background beyond the
scope of this paper; however, one such property which does not require any
background knowledge and which is important in the context of convergence is the
notion of purity.
Definition. A recombination operator, R, is said to be Pure if
$\forall g \in G, c \in C_R : R(g, g, c) = g$.
That is, a recombination operator is pure if the offspring of (genetically)
identical parents are identical to those parents, regardless of the control
parameters.
Finally, mutation operators, which take single individuals and return, according
again to control parameters, another single individual. In general mutation
will change the make-up of an individual, and may result in an individual which
possesses characteristics otherwise not present in the population. It is this
that is both the bane and the boon of genetic algorithms. Mutation allows for a
much broader search of the solution space and so the power of an algorithm is
greatly amplified by it. On the other hand, it is a great thorn in the side of
analysts, as understanding convergence becomes far more difficult.
Note: Recombination and Mutation are also referred to as genetic operators.
The mechanism of the evolution process is the iterative application of
evolutionary operators to some initial subset of the representation space. The
age of the computer has allowed us to perform a vast number of simple tasks in a
short space of time, and it is this power that allows us to utilise evolutionary
algorithms in a meaningful way. Of course we cannot emulate the vastness of time
that life has taken to evolve from single celled organisms to the biologically
diverse and complex world we live in, but we are able to observe improvements in
even the most complex of problems.
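The iterative application of selection, recombination and mutation described in this section can be sketched as a short Python loop. This is a generic illustration on fixed-length bit strings, not the configuration used later in the paper: the choice of binary tournament selection, one-point crossover, bit-flip mutation, the parameter `mutation_rate` and the toy fitness function (the number of ones) are all assumptions made for the example.

```python
import random


def evolve(fitness, population, generations, rng, mutation_rate=0.05):
    """Generational loop with tournament selection, one-point crossover,
    bit-flip mutation and elitism (the fittest individual always survives)."""
    size = len(population)
    for _ in range(generations):
        elite = max(population, key=fitness)

        def select():
            # Selection operator: the fitter of two randomly drawn individuals.
            return max(rng.sample(population, 2), key=fitness)

        children = [elite]                      # elitism: exempt from culling
        while len(children) < size:
            parent_a, parent_b = select(), select()
            cut = rng.randrange(1, len(parent_a))          # control parameter
            child = parent_a[:cut] + parent_b[cut:]        # recombination
            # Mutation: flip each bit independently with small probability.
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        population = children
    return population


rng = random.Random(0)
start = [[rng.randint(0, 1) for _ in range(20)] for _ in range(30)]
final = evolve(sum, start, generations=40, rng=rng)
print(max(map(sum, start)), max(map(sum, final)))
```

One-point crossover is pure in the sense defined above, since `g[:cut] + g[cut:]` reproduces `g` for any cut point, and elitism guarantees that the best fitness in the population never decreases from one generation to the next.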
3.2 Genetic Programs
Genetic programs are a very specific type of genetic algorithm in which the
individuals undergoing evolution are, themselves, computer programs. The
relevance of genetic programming to this project is that the variables built in
our evolution of uplift models can be expressed as tree-based structures like
those used to describe the computer programs herein. As such, the mechanical
workings of the evolutionary processes of the two are essentially the same.
Koza ([7]) discusses how extremely varied problems can be solved by the
discovery of a computer program which produces a particular output from a
collection of inputs. It will be this ability to build highly non-smooth
transformations (ones which might not be possible using conventional modelling
methods nor simpler evolutionary processes) that will be useful in constructing
models and variables in our application of genetic programming to uplift
modelling.
3.2.1 Trees
As mentioned above, the medium for representation of computer programs is via
tree graphs, and we will now devote a small amount of time to defining the nature
of these trees and how the evolution thereof takes place.
Definition. A tree is a connected directed graph with no cycles. That is, it is a
graph in which any two nodes are connected by a unique path*.
* We use path to mean a sequence of nodes, with an edge connecting each pair of
consecutive nodes, that does not visit the same node more than once. This is
sometimes referred to as a simple path.
In genetic programming we consider hierarchical trees, in which order of nodes
is important, and we refer to immediate superiors (i.e. those adjacent and higher
up the hierarchy) as parent nodes, and immediate inferiors as child nodes.
Definition. We refer to a node with no child nodes as a terminal, a node with
no parent nodes as a root and a node with at least one parent and at least one
child node as an intermediate node.
Definition. The descendants of a node N include all the children of N as well as
all descendants of those children.
Definition. A subtree of a tree T subtended by a node N is the tree consisting of
N and all the descendants of N, connected in the same manner as they are in T.
Definition. The depth of a tree is the greatest distance (in terms of number of
nodes passed through) along a path from a root node to a terminal node.
Figure 3.1: Tree
In the tree of Figure 3.1, the node labelled "1" is the root node; it has 2 child
nodes, "2" and "3". Nodes "2", "4" and "5" are terminal nodes. "3" is an
intermediate node and is the parent of "4" and "5". The tree has depth 3 and
size 5.
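The example tree and the definitions of size, depth and terminal can be written down directly in Python. This is only an illustrative sketch; the class `Node` and the helper names are hypothetical and not part of the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def size(node):
    """Number of nodes in the subtree subtended by `node`."""
    return 1 + sum(size(child) for child in node.children)


def depth(node):
    """Greatest number of nodes on a path from `node` down to a terminal."""
    return 1 + max((depth(child) for child in node.children), default=0)


def terminals(node):
    """Labels of all nodes with no children, from left to right."""
    if not node.children:
        return [node.label]
    return [label for child in node.children for label in terminals(child)]


# Root "1" has children "2" and "3"; "3" is the parent of "4" and "5".
tree = Node("1", [Node("2"), Node("3", [Node("4"), Node("5")])])
print(size(tree), depth(tree), terminals(tree))   # 5 3 ['2', '4', '5']
```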
3.2.2 Terminals and Functions
Of crucial importance to the evolutionary process is the correct selection of
potential terminals and intermediate (and root) nodes. In order to have the
output of our programs vary according to input (something which is obviously
desirable, as a constant output is far from interesting) we include in the set
of potential terminals a collection of variables to represent the inputs, as
well as any necessary interpretations thereof (such as the state of some system
in the presence of the input variables). One can also include numerical values,
the boolean values True and False, etc. Especially in the case where different
inputs range on very different scales, it is sometimes useful to include a
constant terminal which can relate one to the other by some operator (e.g.
dividing the one with the larger scale or multiplying the one with the smaller
by some appropriately sized constant).
Intermediate nodes contain elements from a set of functions, usually comprised
of some collection of:
• arithmetic operators (+, -, *, etc.),
• mathematical functions (trigonometric functions, logarithms, etc.),
• boolean operators (AND, OR, etc.),
• conditional operators (IF-THEN-ELSE, etc.),
• relations (=, $\leq$, etc.),
• any functions that induce iteration or recursion, etc.
The representation space is then the collection of all potential structures that
can be composed recursively from the collection of terminals (T) and functions
(F). We will denote this space $G_{(T,F)}$.
The terminal and function sets should also be designed so as to satisfy the
closure and sufficiency properties.
Definition. A terminal, function pair (T, F) is said to be closed if each
$f \in F$ is defined over T as well as all possible outputs of members of F.
An example of a non-closed pair is any with function set containing both $-$
and $/$, since for any $t_1, t_2 \in T$, $t_1 / (t_2 - t_2)$ does not exist. In
other words, the function $/$ is not defined for the pair of values
$(t_1, t_2 - t_2) = (t_1, 0)$. Mathematical functions such as square root and
logarithm are not defined for negative values, and so here too possible
non-closure is a problem. In such instances it is common to replace these
functions with ones that are equivalent where the function is defined, but take
on some other form elsewhere (often merely taking the value zero).
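A minimal Python sketch of such replacement functions is given below. The names `protected_div`, `protected_sqrt` and `protected_log` are hypothetical, and the fallback value of zero follows the convention mentioned in the text; other fallback values are possible.

```python
import math


def protected_div(a, b):
    """Ordinary division where it is defined, zero when the divisor is zero."""
    return a / b if b != 0 else 0.0


def protected_sqrt(a):
    """Square root for non-negative arguments, zero otherwise."""
    return math.sqrt(a) if a >= 0 else 0.0


def protected_log(a):
    """Natural logarithm for positive arguments, zero otherwise."""
    return math.log(a) if a > 0 else 0.0


t1, t2 = 3.0, 7.0
print(protected_div(t1, t2 - t2))   # 0.0 rather than a ZeroDivisionError
print(protected_sqrt(-4.0), protected_log(0.0))
```

With these definitions every function returns a number for every numerical input, so any tree composed from them and from numerical terminals can be evaluated.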
Definition. A terminal, function pair (T, F) is sufficient for a problem P if the
solution to P lies in $G_{(T,F)}$.
This point may appear trivial; however, the identification of the requirements
of the problem is not always obvious. Borrowing an example from ([7]), Kepler's
Third Law, which was discovered in 1618, states that the cube of a planet's
distance from the sun is proportional to the square of its period around the sun.
If $F = \{+, -\}$ then a computer program that predicts the period of a planet
around the sun could not result. Similarly, without knowledge of Kepler's Law,
one would not know that distance from the sun is the sole predictive variable
when determining the period of a planet, and so might construct a terminal set
comprised of only information about the planet itself, say diameter, density and
rotational speed, which would not be sufficient for the problem.
3.2.3 The Initial Population
The genetic programming evolutionary process produces increasingly complex
individuals due to the nature of the evolutionary operators used. Initialising
the process with the inaugural population is an important step, and there are
two commonly used methods for generating initial structures, namely "grow" and
"full".
The "grow" method creates trees of varying shape for which the distance between
a single root node and any terminal node is no greater than some predetermined
depth. Starting with a randomly selected root node (from the set $T \cup F$),
whenever a node is added that is in F, if the current depth is strictly less
than the maximum, attach a number of randomly selected nodes (from $T \cup F$)
equal to the number of arguments that the node takes. If the depth is equal to
the maximum, instead append nodes only from T.
The "full" method creates trees which are dense, i.e. the distance from the
single root node to every terminal node is equal to some predetermined depth.
They are more uniform in shape and size than grown trees, with variations only
arising as a result of functions taking a different number of arguments. They are
constructed in the same way as "grown" trees, except that when a function node is
added, if the current distance to the root is strictly less than the predetermined
depth then only nodes from F are appended.
The "ramped half-and-half" method suggested by Koza ([7]) generates an initial
population using a combination of the two generative methods. For every
depth between 2 and some chosen maximum, the method generates an equal
number of grown and full trees of that depth and adds them to the population.
This creates an initial population varied in both size and shape.
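The three initialisation methods can be sketched in a few lines of Python. This is only an illustrative sketch: the terminal and function sets, the nested-list tree encoding and all function names are assumptions made for the example, not part of the method as published.

```python
import random

# Hypothetical terminal and function sets, chosen only for illustration.
TERMINALS = ["a", "b", "c"]
FUNCTIONS = {"+": 2, "*": 2, "Q": 1}  # symbol -> number of arguments


def random_tree(max_depth, method, rng, depth=0):
    """Return a terminal (a string) or a nested list [function, child, ...]."""
    if depth == max_depth:
        # At the maximum depth only terminals may be appended.
        return rng.choice(TERMINALS)
    if method == "full":
        # "full": below the maximum depth only function nodes are appended.
        symbol = rng.choice(list(FUNCTIONS))
    else:
        # "grow": below the maximum depth any node from T or F may be appended.
        symbol = rng.choice(TERMINALS + list(FUNCTIONS))
    if symbol in FUNCTIONS:
        children = [random_tree(max_depth, method, rng, depth + 1)
                    for _ in range(FUNCTIONS[symbol])]
        return [symbol] + children
    return symbol


def leaf_depths(tree, depth=0):
    """Distances from the root to every terminal node."""
    if isinstance(tree, str):
        return [depth]
    depths = []
    for child in tree[1:]:
        depths.extend(leaf_depths(child, depth + 1))
    return depths


def ramped_half_and_half(max_depth, pairs_per_depth, rng):
    """For each depth 2..max_depth add equally many grown and full trees."""
    population = []
    for d in range(2, max_depth + 1):
        for _ in range(pairs_per_depth):
            population.append(random_tree(d, "grow", rng))
            population.append(random_tree(d, "full", rng))
    return population


rng = random.Random(0)
population = ramped_half_and_half(max_depth=4, pairs_per_depth=5, rng=rng)
```

Every tree built with "full" has all of its terminals at exactly the requested depth, whereas "grow" only bounds that distance from above, which is what gives the ramped population its variety of shapes.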
3.2.4 Evolutionary Operators for Genetic Programs
For the purposes of this paper we consider the following recombination operator:
$R: G \times G \times \mathbb{Z}_+^2 \to G$, where $R(g_1, g_2, n, m)$ is $g_1$ with the subtree subtended
by the $(n \bmod \mathrm{size}(g_1))$-th node replaced with the subtree of $g_2$ subtended by the
$(m \bmod \mathrm{size}(g_2))$-th node. Here the size of a tree is the number of nodes it contains,
and the nodes are numbered depth first (since in general the control operands are
chosen randomly, the numbering is in this case arbitrary; in practice the set of
control operands is bounded, since generating random elements from an infinite
set is problematic). Notice that in this case the control set is $\mathbb{Z}_+^2$.
Figure 3.2: Crossover
Mutation will be defined as:
$M: G \times G \times \mathbb{Z} \to G$, where $M(g, h, n)$ is $g$ with the subtree subtended by the
$(n \bmod \mathrm{size}(g))$-th node replaced with $h$. We have taken a slight liberty in the
notation here since in general the size of h will depend on g and n, in that we
will restrict the depth of h to the depth of the subtree it replaces. We would
prefer not to have operands depend on one another in this way, however there
should be no loss of interpretation. Furthermore, because again the operands are
chosen randomly, it would be possible to insert some pruned version of h in the
case where it is larger than the desired size. This would bias its size upwards,
but would actually result in mutated trees closer in size to the original. The
control set in this case is $G \times \mathbb{Z}$.
Figure 3.3: Mutation
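A minimal Python sketch of the recombination and mutation operators follows. It assumes trees are stored as nested lists of the form [function, child, ...] with terminals as strings, and that nodes are numbered depth first starting from 0; the helper names are hypothetical. The depth restriction on the inserted subtree h is not enforced here.

```python
def size(tree):
    """Number of nodes in a tree (terminal string or nested list)."""
    if isinstance(tree, str):
        return 1
    return 1 + sum(size(child) for child in tree[1:])


def subtree_at(tree, index):
    """Subtree rooted at the index-th node, numbered depth first from 0."""
    if index == 0:
        return tree
    index -= 1
    for child in tree[1:]:
        child_size = size(child)
        if index < child_size:
            return subtree_at(child, index)
        index -= child_size
    raise IndexError("node index out of range")


def replace_at(tree, index, new_subtree):
    """Copy of tree with the subtree at the index-th node replaced.

    Untouched branches are shared with the input, so trees should be
    treated as immutable.
    """
    if index == 0:
        return new_subtree
    index -= 1
    result = [tree[0]]
    for child in tree[1:]:
        child_size = size(child)
        if 0 <= index < child_size:
            result.append(replace_at(child, index, new_subtree))
        else:
            result.append(child)
        index -= child_size
    return result


def recombine(g1, g2, n, m):
    """R(g1, g2, n, m): graft a subtree of g2 into g1."""
    donor = subtree_at(g2, m % size(g2))
    return replace_at(g1, n % size(g1), donor)


def mutate(g, h, n):
    """M(g, h, n): replace the subtree at node n (mod size(g)) with h."""
    return replace_at(g, n % size(g), h)


g1 = ["+", ["*", "a", "b"], ["Q", ["*", "c", "d"]]]
g2 = ["*", "x", "y"]
child = recombine(g1, g2, 4, 1)   # node 4 of g1 is the Q subtree
mutant = mutate(g1, "z", 2)       # node 2 of g1 is the terminal a
```

Taking the integer operands modulo the tree size means any non-negative integer is a valid control operand, so the operands can be drawn without knowing the sizes of the parents in advance.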
3.2.5 A Word on LISP and Symbolic Expressions
LISP (LISt Processing) is a programming language based on two types of entities:
atoms and lists. Atoms in LISP are things like constants, like the number
7, and variables, like TIME. Lists in LISP are ordered sets of items enclosed in
parentheses. Examples of lists are (A 3 TIME) and (< 2 7).
A Symbolic Expression (S-expression) is a list or atom in LISP. S-expressions
are the only syntactic form in LISP, and so the programs of LISP are themselves
S-expressions.
LISP evaluates constant atoms to themselves (e.g., 7), and variable atoms to their
current value (e.g., TIME). LISP evaluates lists by treating the first element as a
function and applying it to the evaluations of the remaining elements of the list.
These elements may be atoms, or lists themselves.
For example, the S-expression (< (+ 1 (* 2 3)) 4) is evaluated by LISP as applying
the comparison function "<" to the entities (+ 1 (* 2 3)) and 4, which it
evaluates in turn. The first is evaluated as applying the numerical operator "+"
to the entities 1 and (* 2 3). The second of these is evaluated as applying the
numerical operator "*" to the entities 2 and 3. Working back up this chain, LISP
evaluates *(2,3) = 2 * 3 = 6; +(1,6) = 1 + 6 = 7; and finally <(7,4) = 7 < 4
= False.
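The evaluation rule can be mimicked by a very small interpreter. The Python sketch below is an illustration of the rule only, not a LISP implementation: it supports integer constants, variables looked up in a dictionary, and a hypothetical table OPS of four binary functions.

```python
import operator

# Hypothetical table of supported functions (all binary here).
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "<": operator.lt}


def tokenize(text):
    """Split an S-expression string into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Consume tokens from the front; return an atom or a nested list."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # discard the closing parenthesis
        return items
    try:
        return int(token)   # constant atom
    except ValueError:
        return token        # variable or function name


def evaluate(expr, env=None):
    """Evaluate a parsed S-expression given variable values in env."""
    env = env or {}
    if isinstance(expr, list):
        # First element names the function; the rest are evaluated first.
        func = OPS[expr[0]]
        args = [evaluate(item, env) for item in expr[1:]]
        return func(*args)
    if isinstance(expr, str):
        return env[expr]    # variable atom: current value
    return expr             # constant atom: itself


result = evaluate(parse(tokenize("(< (+ 1 (* 2 3)) 4)")))
```

Here `result` is `False`, matching the chain of evaluations traced above, and a variable such as TIME is supplied through `env`, e.g. `evaluate(parse(tokenize("(* TIME 2)")), {"TIME": 5})`.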
Symbolic Expressions as Trees
In order for the programs we are evolving as trees to be understood by a computer,
we need to express them in a language that the computer can interpret. There is
a simple transformation from the tree structure to S-expression as follows:
If the root is a terminal
    return the atom equivalent
else
    create a list with first element equivalent to the root
    for each subtree subtended by child nodes, repeat process
    and append the result to the list.
Example Consider the tree depicted below
Figure 3.4: Example Tree
Root node "+" is not a terminal
create list (+)
    first child "*" is not a terminal
    create list (*)
        first child "a" is a terminal
        append (* a)
        second child "b" is a terminal
        append (* a b)
    append (+ (* a b))
    second child "Q" is not a terminal
    create list (Q)
        first child "*" is not a terminal
        create list (*)
            first child "c" is a terminal
            append (* c)
            second child "d" is a terminal
            append (* c d)
        append (Q (* c d))
    append (+ (* a b) (Q (* c d)))
So the tree above has S-expression (+ (* a b) (Q (* c d))), where Q represents
a function which takes only one argument.
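The transformation is a short recursion. As a sketch, assuming the tree is stored as a nested Python list [root, child, ...] with terminals as strings (an encoding chosen for this example only):

```python
def to_sexpr(tree):
    """Convert a nested-list tree into its S-expression string."""
    if isinstance(tree, str):
        # The root is a terminal: return the atom equivalent.
        return tree
    # Otherwise start a list with the root and append each child's result.
    parts = [tree[0]]
    for child in tree[1:]:
        parts.append(to_sexpr(child))
    return "(" + " ".join(parts) + ")"


example_tree = ["+", ["*", "a", "b"], ["Q", ["*", "c", "d"]]]
sexpr = to_sexpr(example_tree)
```

For the example tree this gives the string (+ (* a b) (Q (* c d))), in agreement with the trace above.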
3.3 The Power of Genetic Algorithms
As mentioned before, the power of genetic algorithms was initially formalised
by John Holland in the form of schema analysis. In this section we will cover
briefly the notion of schemata and how the Schema Theorem shows how improvements
in the evolution arise. We will then introduce formae as an alternative
to schemata and show how forma analysis yields similar improvement results. Finally,
we will touch on some ideas of convergence of evolutionary algorithms from a
topological perspective.
3.3.1 The Schema Theorem
Definition Let G be a representation space in which individuals are strings of
some fixed length. A schema, H, is a template which defines a subset of G
based on its members sharing genes at specific positions. The defining length
of a schema, $\delta(H)$, is the difference in index between the first and last specified
position. The order of a schema, o(H), is the number of specified positions.
For example, if G is the collection of binary strings of length 6, then 1*0**0
(writing * for an unspecified position) is the schema consisting of all elements
of G with 1 in the first position and zero
in the third and sixth positions. It has order 3 and defining length 6 - 1 = 5.
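The order and defining length of a schema are easy to compute once an unspecified position is given an explicit symbol. The sketch below writes the schema of the example as "1*0**0"; the use of "*" as the wildcard and the function names are assumptions made for illustration.

```python
from itertools import product

WILDCARD = "*"  # assumed symbol for an unspecified position


def order(schema):
    """Number of specified positions."""
    return sum(1 for symbol in schema if symbol != WILDCARD)


def defining_length(schema):
    """Difference in index between first and last specified position."""
    fixed = [i for i, symbol in enumerate(schema) if symbol != WILDCARD]
    return fixed[-1] - fixed[0] if fixed else 0


def matches(schema, string):
    """True if the string is a member of the subset the schema defines."""
    return len(schema) == len(string) and all(
        s == WILDCARD or s == c for s, c in zip(schema, string))


schema = "1*0**0"
members = ["".join(bits) for bits in product("01", repeat=6)
           if matches(schema, "".join(bits))]
```

With three of the six positions left free, the schema contains 2^3 = 8 of the 64 binary strings of length 6.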
Theorem 3.3.1. The Schema Theorem:
Let $N_H(t)$ be the number of members of a population displaying schema H at
generation (iteration) t. Let $f_H(t)$ be the average fitness of those displaying H at
generation t and $f(t)$ be the average fitness in the entire population. Then
$$E[N_H(t+1)] \ge N_H(t)\,\frac{f_H(t)}{f(t)}\Big(1 - \sum_{\omega \in \Omega} d_\omega\Big)$$
where $d_\omega$ represents the likelihood of disruption of the schema H by the
evolutionary operator $\omega$, for each operator in the set $\Omega$ being applied in the evolution.
The Schema Theorem in its usual form requires that selection be fitness
proportionate, and also assumes certain things about the operators in use, such as
transmission (cf. [5]) of genes by recombination operators. For many recombination
operators, such as the commonly used one-point crossover (cf. [3]), the
likelihood of disruption is proportional to the defining length of the schema, and
similarly the likelihood of disruption is generally proportional to the order of the
schema. Thus short schemata with low order which are highly fit have the greatest
likelihood of proliferation.
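To make the bound concrete, the following Python sketch evaluates the right-hand side of the Schema Theorem for given disruption likelihoods. The two helper estimates are the standard textbook forms for one-point crossover and pointwise mutation on strings of a fixed length; they are assumptions added for illustration and are not taken from this text, as are all parameter names.

```python
def one_point_crossover_disruption(p_crossover, defining_length, length):
    """Textbook estimate: crossover rate times the chance that the cut
    point falls inside the schema's defining length."""
    return p_crossover * defining_length / (length - 1)


def mutation_disruption(p_mutation, order):
    """Textbook estimate: chance that at least one specified position
    is altered by independent pointwise mutation."""
    return 1.0 - (1.0 - p_mutation) ** order


def schema_lower_bound(n_h, f_h, f_bar, disruptions):
    """Lower bound on E[N_H(t+1)] given the disruption likelihoods."""
    survive = max(0.0, 1.0 - sum(disruptions))
    return n_h * (f_h / f_bar) * survive


# Hypothetical numbers: schema of order 3, defining length 5, strings of length 6.
d_c = one_point_crossover_disruption(0.6, defining_length=5, length=6)
d_m = mutation_disruption(0.01, order=3)
bound = schema_lower_bound(n_h=10, f_h=1.2, f_bar=1.0, disruptions=[d_c, d_m])
```

With these illustrative numbers the long defining length makes crossover the dominant source of disruption, which is the sense in which short, low order schemata are favoured.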
In the case that tournament selection is used (in which pairs (or groups) of
individuals are selected to "compete" with one another directly and those of higher
fitness are given some predefined probability of victory; cf. section 4.1.1 later),
more knowledge about the distribution of fitness is needed for an analytic result.
With only knowledge of the average fitness we can argue the absurd case that all
the fitness among members of a schema H is attributable to a single individual,
and so if that schema is of any considerable size, the likelihood of a randomly
chosen member of H "beating" some other random individual not in H is approximately
$(1 - \beta)$, where $\beta$ is the competition bias. We can, however, see how if the
average fitness of members of a certain schema, H, is above the average in the
overall population and the distribution is not too skewed, then that schema will
have a tendency to proliferate in future generations (as described in the schema
theorem). With more thought we realise that in any case a distribution of fitness
among H members which is not too skewed is far more desirable, since we then
believe more strongly that above average fitness arises from the schema rather
than the high fitness being incidental.
It can easily be shown that if $p_k$ is the k-th percentile of fitnesses among members
of $P \setminus H$, then H tends to proliferate if the proportion of H members with
fitness above $p_k$ is sufficiently above 50/k. The problem with this result is that
it is far less precise than the schema theorem, in that the inequality will generally
be quite slack. Furthermore, this result is meaningless for quantiles below the
median since 50/k > 1 for k < 50.
Holland (among others) describes the short length, highly fit schemata as "building
blocks", the combinations of which are intended to produce highly effective
individuals within the search space. This is well described by David Goldberg,
another proponent of genetic algorithms:
"Short, low order, and highly fit schemata are sampled, recombined [crossed
over], and resampled to form strings of potentially higher fitness. In a way by
working with these particular schemata [the building blocks], we have reduced the
complexity of our problem; instead of building high-performance strings by trying
every conceivable combination, we construct better and better strings from the
best partial solutions of past samplings."
- Goldberg 1989
3.3.2 Forma Analysis
It has been shown that the choice of describing subsets of the representation
space by schemata (with regards to the improvement shown by the schema theorem)
is essentially arbitrary, and in fact a very similar result is provable where
H instead describes any subset of the representation space, provided the disruption
coefficients are tractable for the chosen subset ([9]). The importance of
this result is that, where in the case of schemata there may be no recognisable
similarities within the solution space between subscribing members, here the
designer/implementer of the algorithm is able both to identify, via the progress
of the algorithm, regions of the search space which might be fruitful and to
prescribe subsets to be exploited based on prior knowledge of the problem at hand.
Forma analysis is concerned with defining a collection of equivalence relations
on the search space, the equivalence classes of which give rise to genetic
representations of members of the search space.
Definition An equivalence relation $\sim$ on a set X is a relation which is
1. reflexive: $x \sim x \;\; \forall x \in X$
2. symmetric: $x \sim y \Rightarrow y \sim x \;\; \forall x, y \in X$
3. transitive: $x \sim y$ and $y \sim z \Rightarrow x \sim z \;\; \forall x, y, z \in X$
The canonical equivalence relation is the comparison "=", since trivially equality
satisfies reflexivity (x = x), symmetry (if x = y then y = x) and transitivity
(if x = y and y = z then x = z).
If $\sim$ is an equivalence relation on the set X, then for $x \in X$ we define the
equivalence class of x with respect to $\sim$ as
$[x]_\sim = \{ y \in X \mid x \sim y \}$
The intersection of two equivalence relations, $\sim_{1,2} = \sim_1 \cap \sim_2$, is defined by
$x \sim_{1,2} y$ iff $x \sim_1 y$ and $x \sim_2 y$.
Given a collection of equivalence relations E we can use this to define the
span of E: $S(E) := \{ \bigcap_{\sim \in I} \sim \mid I \in \mathcal{P}(E) \}$, where $\mathcal{P}(E)$ denotes the power set of
E. If we consider equivalence relations over the search space S then E induces
a representation space in which the representation of an $s \in S$ is given by the
vector of equivalence classes to which s belongs, with respect to the elements of E.
Radcliffe refers to the equivalence classes created by elements of S(E) as formae.
Ideally one designs E in a way that S(E) separates the points of S; i.e.
$\forall s \in S: [s]_{\bigcap_{\sim \in E} \sim} = \{s\}$, or equivalently $\forall s, t \in S$ with $s \neq t$: $\exists \sim \in E$ s.t. $s \not\sim t$.
By allowing the freedom to choose the elements of E, with some restrictions, the
designer of the evolutionary algorithm is able to predefine regions of the search
space which are believed to potentially possess useful solutions. This definition
also allows for improved interpretation of results as the equivalence relations used
are generally more meaningful than schemata at a higher level.
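A small Python sketch may help fix these ideas. It assumes each equivalence relation is given by a "key" function, so that two points are equivalent exactly when their keys agree; the toy search space (the integers 0 to 11) and the three relations are hypothetical choices made only for illustration.

```python
def representation(s, relations):
    """Vector of equivalence-class labels of s, one per relation in E."""
    return tuple(rel(s) for rel in relations)


def forma(points, relations, s):
    """Equivalence class of s under the intersection of the relations."""
    target = representation(s, relations)
    return {t for t in points if representation(t, relations) == target}


def separates(points, relations):
    """True if no two distinct points share a full representation."""
    reps = {representation(s, relations) for s in points}
    return len(reps) == len(points)


# Hypothetical search space and equivalence relations.
points = list(range(12))
parity = lambda x: x % 2     # same parity
residue = lambda x: x % 3    # same remainder mod 3
half = lambda x: x // 6      # same half of the range

class_of_zero = forma(points, [parity, residue], 0)
```

Here parity and residue alone leave 0 and 6 in the same forma, so they do not separate the points; adding the third relation does, and each point then has a unique representation vector.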
In the context of genetic programming, where the variable size and shape
of individuals in the representation space precludes the use of schema analysis,
forma analysis potentially allows for a better understanding of a similar progress
result to the schema theorem. In particular, we may wish to consider a forma,
say F, members of which contain a specific subtree. We would like to analyse
its tendency to exist and grow in future generations. The difficulty arises in
understanding the disruption rates, as additional knowledge about the members
of this forma is required, such as the maximum size of the members of F (or
at least an upper bound thereof). In some cases one might wish to impose
a maximum size on the programs being evolved for the sake of computational
efficiency, in which case the maximum size of members of F at least has an upper
bound and so pessimistic disruption rates with respect to the genetic operators
defined previously can easily be determined.
3.3.3 Convergence of Genetic Algorithms
Even though genetic algorithms are widely used in the field of optimisation, in
general problems to which they are applied are not actually solved to optimum.
Instead the algorithms are often used in highly unstable and large spaces as partial
search algorithms in which we rely on progress suggested by analogues of the
schema theorem to find solutions which are "better" than those achievable with
more analytically based algorithms.
In spite of their often being used for improvement rather than optimisation,
it is important to understand their convergence properties (should they exist),
in particular when they can guarantee the discovery of an (at least local) optimum
within the search space.
Holland shows in ([3]) that if crossover is the only genetic operator employed
then the evolutionary process, described as a stochastic process, has a stationary
distribution; however we seek a more general result.
Rudolph ([10]) also uses a stochastic process representation to show that in
the presence of mutation, genetic algorithms like those which Holland describes
will never converge to the global optimum. However, he also shows that a modification
of the algorithm in which the best solution within the population is always
maintained will converge to the global optimum.
Chapter 4
Application of Genetic
Algorithms to Uplift Modelling
and the Hillstrom Challenge
In this chapter we will show how genetic algorithms can be applied to the problem
of uplift modelling. We will then apply these algorithms to a specific problem,
posed by Kevin Hillstrom in his MineThatData blog ([12]), for which uplift modelling
is ideally suited, and compare our results with those from the winning
entry.
4.1 Application of Genetic Algorithms to Uplift
Modelling
In order to implement a genetic algorithm, we need to describe firstly each element
of the evolutionary process and secondly any adjustments and additional steps
and methods used.
4.1.1 The Genetic Programming Approach
• The Solution Space
The solution space S will consist of all possible variables with values in
$\langle X \cup \mathbb{Q}; (+, -, \times, /) \rangle$*, the set generated by X (the collection of all values
the basic descriptive variables of the problem can assume) and $\mathbb{Q}$ with the
operators (+, -, *, /), capable of at most separating data points which are
discernible by at least one basic descriptive variable. (In other words, if
two data points are described by the same set of variables x, then every
member of S will assign the same value to them.)
*If variables take on values in $\mathbb{R} \setminus \mathbb{Q}$, we tend in practice to only consider
their rational approximations, in which case S can be described as all rational
valued variables. However, as a concession to being concise regarding
the sufficiency of the problem, we choose to be more explicit.
• The Representation Space
Our intention is to generate composite variables (of the basic descriptive
variables defined in the problem) for use in building uplift models. As we
do not wish to restrict the level of complexity these variables entail, we
will use genetic programming to evolve them as symbolic expressions. The
terminal set T will be the collection of basic variables of the problem as
well as constant terms which relate the scale of the variables where applicable
and those which divide the variables in an apparently useful way
(e.g. measures of the "centre" (usually mean or median) of variables, or
in the case of low order discrete variables, values which separate explicitly
the different values). In the case where these variables are non quantitative,
we convert them using collections of indicator variables into discrete
numerical variables. The function set F will consist of the basic binary
numerical operators (+, -, *, /), with division safeguarded by the added
identity $a/0 = 0 \;\; \forall a \in \mathbb{R}$, the relations which exploit the ordered structure
of $\mathbb{R}$ ($=, \le, <, \ge, >$), accepting the convention of numerical binary logic
variables True = 1 and False = 0, and the conditional operator (IF-THEN-ELSE).
It is easy to see that (T, F) is closed. To show that it is sufficient, we
can adopt the absurd argument that if a genetic program can manifest the
constant 1 and the function set contains +, - and / then every rational
number can occur. Explicitly, the expression (= a a) for any $a \in T$ has
the value 1, and so we can generate every rational number and hence every
number in the set generated by $\mathbb{Q} \cup X$.
• The Evolutionary Operators
The evolutionary operators will contain recombination and mutation as described
in section 3.2.4 and the selection operators described below.
Our selection operators, both for culling of the population and for choosing
breeding pairs, will be based on so-called tournament selection, defined as:
$C_\beta: G \times G \times [0,1] \to G$, which returns the fitter of the two individuals
if the final operand is less than the selection bias $\beta \in [0,1]$, otherwise it
returns the less fit. In general the final operand is selected from a U(0,1)
distribution and so this can be understood as a competition in which the
fitter individual has a probability $\beta$ of winning.
The culling operator is then defined by:
$K_\beta: \mathcal{P}(G) \times \mathbb{Z}^2 \times [0,1] \to \mathcal{P}(G)$, where
$K_\beta(P, n, m, r) = P \setminus \{ C_{1-\beta}(P[n], P[m], 1 - r) \}$,
that is, it removes the "loser" of the competition between the n-th and m-th
members of a population (if n is greater than the size of P then we consider
n (mod |P|), and similarly for m).
The operator for selecting a breeding pair is:
$B_\beta: \mathcal{P}(G) \times \mathbb{Z}^4 \times [0,1]^2 \to G \times G$, where
$B_\beta(P, n, m, o, p, r, s) = (C_\beta(P[n], P[m], r), C_\beta(P[o], P[p], s))$,
that is, it returns the winners of two tournaments between members of P
indexed by the integer operands.
• The Solution Map
The solution map F takes each $g \in G$ and assigns it the variable g(x), the
output of g for a given vector of basic descriptive variables x.
• The Evaluation Function
The evaluation function E is defined by $E(s) = q_0(s)$, where $q_0(s)$ is the
"no negative-uplift" adjusted Qini coefficient from the ordering imposed on
the data set by the variable s.
• In addition to the usual structure of the evolutionary process, we include a
dynamic adjustment to the evaluation function as follows:
At regular intervals we select subsets of the population (using various selection
methods) to be passed into a significance based uplift-tree building
algorithm. In the event of discovering a "better" model than has so far
been found (we will formalise what we mean by better in the next section),
we will adjust the fitness of the variables used in the model by a factor
according to how "good" the model is. The intention is that by adjusting
the fitness of variables which have proven useful in building models, we
will extend their longevity, increase their likelihood of reproducing and also
increase their likelihood of being selected for future models.
• Finally, we will grant temporary elitism to the variable with the highest
basic fitness level (as described by the static evaluation function) and those
variables included in the "best" model found so far.
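The tournament, culling and breeding-pair operators defined earlier in this section can be sketched in Python as follows. The sketch is illustrative only: the population is assumed to be a list, `fitness` is any function returning a number (the adjusted Qini coefficient in our setting), and `bias` stands for the selection bias; all names are hypothetical.

```python
import random


def tournament(a, b, u, bias, fitness):
    """Return the fitter of a and b if u < bias, otherwise the less fit."""
    fitter, weaker = (a, b) if fitness(a) >= fitness(b) else (b, a)
    return fitter if u < bias else weaker


def cull(population, n, m, r, bias, fitness):
    """Remove the loser of a tournament between members n and m.

    The loser is found by running the tournament with bias 1 - bias and
    operand 1 - r; indices are taken modulo the population size.
    """
    i, j = n % len(population), m % len(population)
    loser = tournament(i, j, 1.0 - r, 1.0 - bias,
                       lambda k: fitness(population[k]))
    return population[:loser] + population[loser + 1:]


def breeding_pair(population, n, m, o, p, r, s, bias, fitness):
    """Winners of two tournaments indexed by the integer operands."""
    size = len(population)
    first = tournament(population[n % size], population[m % size],
                       r, bias, fitness)
    second = tournament(population[o % size], population[p % size],
                        s, bias, fitness)
    return first, second


# Toy usage: individuals are numbers and fitness is the identity.
rng = random.Random(0)
population = [3.0, 1.0, 2.0]
identity = lambda g: g
survivors = cull(population, 0, 1, r=0.5, bias=0.8, fitness=identity)
parents = breeding_pair(population, rng.randrange(100), rng.randrange(100),
                        rng.randrange(100), rng.randrange(100),
                        rng.random(), rng.random(), 0.8, identity)
```

In the culling call above the uniform operand 0.5 is below the bias 0.8, so the less fit of the two contestants (the value 1.0) is the one removed; with an operand above the bias the fitter contestant would be removed instead.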
4.1.2 The Polynomial Approach:
• The Solution Space
The solution space consists of all two dimensional polynomial transformations
of the basic descriptive variables of degree at most 4 with zero constant term.
As before, non quantitative variables will be redefined numerically.
• The Representation Space
The representation space consists of all strings in $\mathbb{R}^{14}$. Note that
$\sum_{i=0}^{4} \binom{i+1}{i} - 1 = 14$, the number of two dimensional combinations of order less than or
equal to 4 excluding the zero order constant (there are i + 1 such combinations
of order exactly i in two variables).
• The Evolutionary Operators
Crossover will be defined as follows:
$C: G \times G \times [0,1]^{14} \to G$, where $C(g, h, x)_i = x_i g_i + (1 - x_i) h_i$. We select each
[0,1] operand from a U(0,1) distribution and so C can be seen as selecting
a point randomly between the two points in G.
Mutation applies the following function to each element of an individual
in G independently:
$M_\mu: \mathbb{R} \times [0,1] \times \mathbb{R} \to \mathbb{R}$, where $M_\mu(x, \rho, r) = x + r$ if $\rho$ is less than the
mutation rate $\mu \in [0,1]$, and $M_\mu(x, \rho, r) = x$ otherwise. We choose r from
a N(0,1) distribution, and $\rho$ from a U(0,1) distribution.
Selection operators will be as in the Genetic Programming approach above.
• The Solution Map
The solution map F assigns the elements of $g \in G$ as coefficients in a
polynomial in S.
• The Evaluation Function
The evaluation function will be as in the Genetic Programming approach
above.
• As before we include a few adjustments to the standard evolutionary process.
Because the polynomials being built are two dimensional, we simultaneously
evolve $\binom{N}{2}$ separate populations, where N is the number of basic
variables in the problem.
• At regular intervals we combine the populations and select a subset
probabilistically (according to fitness) to be passed into a significance based
uplift-tree algorithm. As before, whenever a new "best" model is found,
the fitnesses of the variables used are adjusted.
• Once again we grant temporary elitism to the variables with the highest
basic fitness level in each population, as well as those included in the "best"
model so far.
• Initial coefficients were chosen randomly between -10 and 10. Because
the models are evaluated based on an ordering, the scale of coefficients is
arbitrary and so we could have chosen any range centered at 0.
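The representation and operators of the polynomial approach can be sketched as below. This is a hedged illustration rather than the implementation used for the results: the ordering of the 14 monomials, the function names and the `rate` parameter (the mutation rate) are assumptions made for the example.

```python
import random

# Exponent pairs (a, b) with 1 <= a + b <= 4: the 14 non-constant
# monomials x^a * y^b of a two dimensional polynomial of degree <= 4.
EXPONENTS = [(a, d - a) for d in range(1, 5) for a in range(d + 1)]


def polynomial(coefficients, x, y):
    """Solution map: use the 14 genes as polynomial coefficients."""
    return sum(c * x**a * y**b
               for c, (a, b) in zip(coefficients, EXPONENTS))


def crossover(g, h, rng):
    """Per-gene convex combination with independent U(0,1) weights."""
    child = []
    for gi, hi in zip(g, h):
        xi = rng.random()
        child.append(xi * gi + (1.0 - xi) * hi)
    return child


def mutate(g, rate, rng):
    """Add N(0,1) noise to each gene independently with probability rate."""
    return [gi + rng.gauss(0.0, 1.0) if rng.random() < rate else gi
            for gi in g]


def random_individual(rng):
    """Initial coefficients drawn uniformly between -10 and 10."""
    return [rng.uniform(-10.0, 10.0) for _ in EXPONENTS]


rng = random.Random(0)
g, h = random_individual(rng), random_individual(rng)
child = mutate(crossover(g, h, rng), rate=0.1, rng=rng)
score = polynomial(child, 2.0, 3.0)
```

Because each gene of the child is weighted independently, the child lies in the box spanned by its two parents rather than on the straight line between them.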
4.2 The Hillstrom Challenge and Results
In March 2008 Kevin Hillstrom made available, through his blog, MineThatData,
a dataset describing two email campaigns and a control group and issued a challenge
to analyse the data with respect to a set of questions. In this section we
will cover briefly the nature of the data set, the questions posed and a summary
of the results from the winning submission ([8]). We will then discuss the results
obtained using our genetic algorithm, its shortcomings and where it was able to
produce improved results on those published.
4.2.1 The Data and Questions
The data represent 64,000 entries each describing a customer. A third of the
customers were chosen randomly to receive an email called the Men's Email, a
second randomly chosen third to receive the Women's Email, and the remaining
customers served as a control group, receiving neither email.
The individuals were described by three outcome (dependent) variables:
- Visit: A binary variable indicating whether the customer visited the site within
a two week outcome period.
- Conversion: A binary variable indicating whether or not the customer
purchased at the site during that period. Obviously Visit being false implies
Conversion is false.
- Spend: A real valued variable indicating the amount spent during the outcome
period. Obviously Spend = 0 whenever Conversion is false.
They were also described by 8 independent variables (referred to previously as
basic descriptive variables):
- Recency: An integer valued variable indicating the number of months since each
customer's most recent purchase prior to the outcome period.
- History Segment: A categorical variable dividing the population according to
disjoint ranges of money spent in the preceding year.
- History: The actual monetary amount spent by each customer in the preceding
year.
- Mens: A binary variable indicating if the customer had purchased in the Men's
department in the last year.
- Womens: As for Mens, but regarding the Women's department.
- Zip Code: Classifies customers as either Urban, Suburban or Rural.
- Newbie: A binary variable indicating if the customer made their first purchase
within the previous year.
- Channel: Describes the channels through which the customer bought previously
(either by Phone, Internet, or both).
Finally, a variable indicates which customers were treated with the Mens-Email,
which with the Womens-Email and which formed part of the control group.
The questions posed in the challenge were as follows:
1. Which e-mail campaign performed the best, the Mens version, or the Womens
version?
2. How much incremental sales per customer did the Mens version of the e-mail
campaign drive? How much incremental sales per customer did the Womens version
of the e-mail campaign drive?
3. If you could only send an e-mail campaign to the best 10,000 customers, which
customers would receive the e-mail campaign? Why?
4. If you had to eliminate 10,000 customers from receiving an e-mail campaign,
which customers would you suppress from the campaign? Why?
5. Did the Mens version of the e-mail campaign perform different than the Womens
version of the e-mail campaign, across various customer segments?
6. Did the campaigns perform different when measured across different metrics,
like Visitors, Conversion, and Total Spend?
7. Did you observe any anomalies, or odd findings?
8. Which audience would you target the Mens version to, and the Womens version
to, given the results of the test? What data do you have to support your
recommendation?
Questions 3 and 4 relate directly to uplift modelling since uplift models seek
to maximise incremental sales by means of imposing an order structure on a
population according to their incremental spend behaviour as a result of some
treatment. We will focus our analysis on developing reliable uplift models and in
doing so implicitly answer these two questions. The remainder of the questions
above refer more to measuring incremental sales than to maximising them, and
so we defer in those cases to the results obtained by Radcliffe in his winning
submission, summarised in the next section.
4.2.2 Summary of Results from Winning Submission
Radcliffe uses uplift modelling to tackle three different formulations of the problem
posed by Hillstrom. He successfully analyses the effectiveness of the campaigns
and identifies those for whom they were most (and least) effective.
A brief summary of these analyses will be presented here.
It was found that both campaigns had a positive effect overall, however in the
case of the Women's Email, there appeared to be segments of the population for
which the campaign served to decrease spend behaviour, rather than increase
it. The average spend (among those who purchased) increased dramatically with
the Women's campaign, but decreased with the Men's. That the Men's campaign
drove more incremental sales overall was attributed to its increasing the purchase rate
sufficiently to compensate for this decrease.
The three formulations of the problem are driven by the three response variables
present in Hillstrom's data: modelling incremental visit frequency, modelling
incremental purchase frequency and finally modelling incremental spend volume.
It was found that the Men's campaign outperformed the Women's on all three
counts.
Table 4.1: Uplift summary
As can be seen in the above table, the Men's campaign outperformed the
Women's with an increase in visit rate of 7.66% (over 4.52% for the Women's),
in purchase frequency of 0.68% (over 0.31% for the Women's) and an increase in
average spend of 77c (over 42c for the Women's).
These monetary amounts per head sound small; however, because the overall
purchase rate is low, the Men's campaign in fact more than doubled
average spend.
A splitting of the population into random segments (of 10% each) showed, however,
that these estimates are quite unstable, which is likely caused by the low frequency of
both visits and purchases. This poses a potential problem when identifying the
best (and worst) 10,000 candidates.
The final formulation (regarding increased spend behaviour) is the one most
closely associated with campaign success in general (and implicitly includes the
other two). It is the objective we will focus on in our evolutionary approach,
and so will form the main focus henceforth. Details of the other two models can
be found in ([8]).
The objective is to build a reliable uplift model and choose the 10,000 population
members with the highest spend uplift and the 10,000 with the lowest (or
most negative, should that be the case).
Two approaches were considered: one a direct continuous approach and the other
based on creating a binary outcome variable 'spend over x', for some amount
x. The continuous approach faces the difficulty of massive skewness in the spend
variable, where close to 99% of people have zero spend. Tree models, however, are
fairly well equipped to handle skewness in the variables and this method proved
superior for the Women's campaign. The binary response model was preferable
for the Men's.
The model for the Men's campaign is described as follows:
The model assigns 1 point for each of the following criteria
• historic spend over $160.
• historic spend over $350.
• customer is a multi-channel user.
The model splits the population into 4 segments according to this score, with
the highest being those multi-channel users with historic spend over $350. The
spend behaviour identified by this model is summarised below.
Table 4.2: Spend behaviour identified by Men's model
Figure 4.1: Spend uplift by score for Men's model
The $q_0$ for the model is 19.80% over the entire data set. Following is the Qini
graph described by the model ordering.
Figure 4.2: Qini graph for Men's model
The model for the Women's campaign predicts the incremental spend directly,
however it is not as easy to describe. The following table allows for some
interpretation by showing the average model score assigned to each bin of the variables
used.
Table 4.3: Mean score by binned variables for Women's model
The table shows that the model favours especially those with high historical
spend (above $500), and to a lesser extent those with relatively low historical
spend, multichannel users and newbies.
The model has a $q_0$ value of 118.00% on the training set and 60.30% on the
validation set. The figure below shows the average uplift by model score band.
Figure 4.3: Spend uplift by score band for Women's model
Selecting the best 10,000 customers to target naturally begins with those scoring
3 according to the Men's model. This amounts to 5,249 people, and estimating
their uplift from the 3,498 not included in the Women's campaign we estimate
the average spend uplift at $1.54.
Next are those with score 2 according to the Men's model. This includes more
people than are needed to select 10,000 and so we consider the segregation of this
segment:
Table 4.4: Spend uplift for different segments with score 2 by Men's model
We can see that Web users with historic spend between $160 and $350 provide
the highest average uplift. This adds a further 4,684 people with estimated uplift
of $1.61. The remaining 67 customers are then chosen randomly from the Phone
users with historic spend between $160 and $350. The estimated average uplift
overall for these 10,000 people is $1.58.
Next, the worst 10,000 customers were chosen as follows. Because the Men's campaign
performed better, the worst 10,000 according to that model were selected.
The bottom score (0) identifies 32,237 people with an average uplift estimated
at 55c. Of those, trying to select the 10,000 with the lowest score according to
the Women's model (estimated uplift of 44c on average, corresponding roughly
to bands 0 to 4) results in 12,952 people. Estimating the uplift as a result of the
Men's campaign among these by removing those who received the Women's email
comes to just 5c. So the choice of 10,000 people to exclude from the campaign is
to select 10,000 randomly from those with a score of 0 from the Men's model and a score
< 5 from the Women's model.
From the Men's model these are people who
• have historical spend less than $160
• are not multichannel
and from the Women's model, indicated by Table 4.3
• are not newbies
• use the Phone channel.
If we consider this subpopulation exactly, we find 8,023 people for whom it
was estimated that the Men's Email would depress their spend by an average of
just over 25c.
4.2.3 Results of the Genetic Algorithm Approach
Because of the high computational time of running genetic algorithms it is
costly to generate statistically robust results. However, we believe those
presented herein are sufficient for drawing meaningful conclusions and for
making relevant comparisons with the results discussed above.
For each technique employed, we did ten runs and include a summary of those
as well as a detailed account of the best found in each case.
For both the Men's and Women's emailing campaigns we used the following
common parameters and methodology:
• Maximum Population size: 100. Initial tests showed overwhelming
homogeneity in the upper regions of the population (among the fittest
individuals). This appeared to be due to large redundant subtrees in many
population members, changes within which did not affect the individuals'
performances, meaning that recombining these individuals often produced
offspring essentially equivalent to them. To combat this we first tried
pruning methods to remove these subtrees, but these proved unnecessarily
costly for the length of our runs and so were abandoned. Instead we excluded
individuals with basic fitness levels equal to that of any existing
population member. We recognise full well that this method limits the power
of the algorithm by reducing the region of the search space that is
accessible; however, for the purposes of this problem it proved to be the most
effective method in terms of computational cost. Population sizes tend to
be between 60 and 80 members, with the majority centred between these
two numbers.
We generated initial populations using the ramped-half-and-half method
described in section 3.2.3 for depths ranging from 2 to 6.
• Terminal Set: consisted of the variables history (historic spend), recency
(months since the most recent purchase), newbie (indicator variable showing
whether the customer's first purchase was within the last year), mens
(indicator variable showing whether the customer bought in the men's
department within the last year), womens (similar to mens, for the women's
department), binHist (a binned version of history with cuts at 50, 100, 200
and 400), NumZip (a numerical version of zip code: Rural = 0, Suburban = 1,
Urban = 2) and NumChannel (a numerical version of channel: Web = 0,
Phone = 1, Multichannel = 2). In addition we included the constant values 2
(to separate Multichannel and Urban explicitly; 0 and 1 occur frequently
through the use of numerical True and False and so weren't included
explicitly; it also allows us to obtain ascending low order constants more
rapidly than we would otherwise), 6 (a measure of the centre of the recency
variable) and 200 (a measure of the centre of the history variable).
• Competitive bias for selecting breeding pairs = 0.7.
• Competitive bias for survival = 0.95.
• Mutation rate = 0.05.
• We terminated the process after 1500 iterations.
• The results for each method used are summarised in graphical form as
follows:
The "Average" line indicates the average q_0 value over the entire data set
for the best model found in each run. The dotted lines show this average
adjusted by one standard deviation to give an indication of a reasonable
range to expect using this method. The "Maximum" and "Minimum" lines
show the highest and lowest of these q_0 values. Finally, the "Lower Average"
line shows the average q_0 of the worst split from the models selected from
each run. This line is intended to give an indication of an extreme worst
case scenario using this method.
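As a rough illustration of how the plotted lines relate to the per-run values, the following Python sketch computes them for a single iteration. The function name, the input layout and the use of the sample standard deviation are assumptions made for this example; the numbers are invented and are not results from the runs.

```python
from statistics import mean, stdev

def summarise_runs(full_q0, worst_split_q0):
    """Summarise per-run Qini values at one iteration.

    full_q0[i]        : q_0 over the entire data set for run i's best model
    worst_split_q0[i] : q_0 on the worst validation split for that model
    """
    avg = mean(full_q0)
    sd = stdev(full_q0)  # sample standard deviation (an assumption)
    return {
        "average": avg,
        "upper_band": avg + sd,   # upper dotted line
        "lower_band": avg - sd,   # lower dotted line
        "maximum": max(full_q0),
        "minimum": min(full_q0),
        "lower_average": mean(worst_split_q0),
    }

# Invented values for three runs, for illustration only.
summary = summarise_runs([0.30, 0.32, 0.28], [0.20, 0.22, 0.18])
```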
Method 1: Building Models Directly via Genetic Algorithms. Initially
we used our genetic algorithms directly to build uplift models. Observing the
effectiveness of models arising directly from the evolutionary process is
important in understanding the relevance of evolutionary algorithms in the
field of uplift modelling at a basic level.
Genetic Programming Approach
• Men's
Figure 4.4: Men's Mailing. Building models directly via Genetic Programming
We can see that the q_0 values are a vast improvement over that of the
model built deterministically and presented in section 4.2.2. Even on the
worst splits, in the latter part of the evolution these exceed 19.80%. The
average q_0 on the entire data set is approximately 31%.
The Symbolic Expression form of the model does not interpret well due to
its complexity, and so instead (as in section 4.2.2) we defer to its
relationships with the basic variables independently. An explicit expression
of the model can be found in the appendix.
The model ranges from 0 to 4 taking only integer values, and Table 4.5
below shows a summary of the spend behaviour associated with each score.
Table 4.5: Spend behaviour identified by Men's model
While a higher Qini value indicates that we are better able to separate the
population according to uplift, the degree of improvement is not obvious
from it. Table 4.5 shows this uplift in absolute terms, and so we
can already compare the outcome of this model with the deterministic one.
The model identifies two entire segments (amounting to 2,912 customers
between the Men's Emailing group and the control set, a full 4,375 in the
entire data set) for whom the estimated uplift is significantly above any
segment identified before.
Figure 4.5: Spend uplift by score for Men's model
The Qini graph below also shows this improvement over the deterministic
model graphically.
Figure 4.6: Qini graph for Men's model
Table 4.6 below shows the average model score for ranges of the basic
variables. We can see that the model particularly favours those with high
historic spend (over $400), newbies, those who have shopped in the men's
department within the previous year, those who haven't shopped in the
women's department in the previous year, Multichannel users and those
who live in urban areas.
Table 4.6: Mean model score by binned variables
• Women's
Figure 4.7: Women's Mailing. Building models directly via Genetic Programming
While the winning submission to Hillstrom's challenge did not explicitly
include a Qini coefficient for the model over the entire data set, it reported
q_0 values of 118% and 60.30% on the training and validation sets
respectively. Though the value over the entire data set certainly needn't be
the average of these two numbers, experience from experimenting shows that it
is a reasonable estimate for the sake of making comparisons. The Genetic
Program evolution was able to produce individual models with comparable q_0
values (in the region of 88%), but they were discarded by the algorithm for
having too high a variance over the validation splits. The resulting models,
though not as powerful as the deterministic one, do appear to offer improved
robustness. The best model found using this method had a q_0 of 67.47%
and a lower estimate (average minus standard deviation over the validation
splits) of 55.08%.
The model takes on integer values between 6 and 12 inclusive. Table 4.7
below shows that the model identified a large segment of the population
for whom the uplift is strongly negative (-$0.93). We can see that uplift
is not strictly increasing with model score, and simply reordering the scores
would vastly improve the Qini coefficient. Reordering the binned segments
in the table below improves it to 81.17%. Simply reordering the values should
not affect the robustness of the model; in fact, combining segments as we
have done in a few cases here should improve it.
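The reordering described here amounts to relabelling each score by the rank of its segment's estimated uplift. A minimal Python sketch, assuming per-score uplift estimates are already available as a dictionary; the function name and the uplift values are hypothetical and are not taken from Table 4.7.

```python
def reorder_scores_by_uplift(uplift_by_score):
    """Map each original score to its rank when segments are sorted by
    estimated uplift (lowest uplift gets rank 0)."""
    ordered = sorted(uplift_by_score, key=lambda s: uplift_by_score[s])
    return {score: rank for rank, score in enumerate(ordered)}

# Hypothetical per-score uplift estimates, for illustration only.
uplift_by_score = {6: 0.40, 7: -0.93, 8: 0.10, 9: 1.20}
remap = reorder_scores_by_uplift(uplift_by_score)
# remap == {7: 0, 8: 1, 6: 2, 9: 3}
```

Because the mapping only relabels existing segments, the segmentation itself is unchanged; only the order in which segments would be targeted differs.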
Table 4.7: Spend behaviour identified by Women's model
Figure 4.8: Spend uplift by score for Women's model
The Qini graph below shows this bad ordering by the model clearly, in that
the gradient is not uniformly decreasing.
Figure 4.9: Qini graph for Women's model (left) and adjusted version (right)
Our interpretation of the model is similar to that offered for the
deterministic model, in that it emphasises most strongly those with high
historical spend and newbies, as well as multichannel users. We have included
more comparisons than before and observe strong emphasis on those who have
shopped in the women's department previously and those who have not
shopped in the men's department.
Table 4.8: Mean model score by binned variables
For both campaigns the algorithm was able to build useful models which
compare well with those built using deterministic methods. In the case of the
Women's campaign, the final model found a useful segmentation of the
population but failed to order the segments correctly. It is perhaps an
oversight not to have included in the process (at least in the evaluation
function) a transformation which enforces increasing ordering by uplift, as
models which produce even more useful segmentations might have been lost.
We now turn our attention to questions 3 and 4 of the Hillstrom challenge
using these models. Again our focus is on the Men's campaign, as it was the
more effective. We naturally choose those with score ≥ 3 first (4,375
individuals, for whom we estimate uplift from those not receiving the Women's
Email to be $2.38). Adding those with a score of 2 by the Men's model gives us
an additional 7,406, giving us more than 10,000 overall. Choosing a random
5,625 from this group gives us a collection of 10,000 customers for whom we
estimate the average uplift to be $1.79. This is an improvement of 21c over
that found using the deterministic modelling methods.
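The selection rule used here (take whole score bands from the top, then sample at random within the band that straddles the cut-off) can be sketched in Python as follows. The function name, the dictionary layout and the fixed seed are assumptions for illustration; this is not the code used to produce the figures above.

```python
import random

def select_top_n(scores, n, seed=0):
    """Pick n customer ids, taking whole score bands from the top down and
    sampling at random within the band that straddles the cut-off."""
    rng = random.Random(seed)
    chosen = []
    for band in sorted(set(scores.values()), reverse=True):
        members = [cid for cid, s in scores.items() if s == band]
        needed = n - len(chosen)
        if len(members) <= needed:
            chosen.extend(members)                      # whole band fits
        else:
            chosen.extend(rng.sample(members, needed))  # random fill
        if len(chosen) == n:
            break
    return chosen

# Toy example: two customers score 3, three score 2, one scores 0.
scores = {"a": 3, "b": 3, "c": 2, "d": 2, "e": 2, "f": 0}
picked = select_top_n(scores, 3)  # "a", "b" and one of "c", "d", "e"
```

With the counts quoted in the text, the same rule takes all 4,375 customers scoring 3 or more and then 10,000 - 4,375 = 5,625 at random from the 7,406 scoring 2.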
To nd the worst 10,000 people to target,we begin with the worst segment iden-
tied by the Men's model.As this amounts to more than half the population,
however,we need to consider the Women's model as well in order to separate out
the worst group.Choosing the lowest segment according to the Women's model
(score < 4) gives us 6,841 customers with an estimated average uplift (from the
4,512 who did not receive the Women's email) of 55c.The next lowest segment
from the Women's model corresponds to a score between 2 and 3.This comes to
5,892 people in total with estimated uplift caused by the Men's Email of 41c.
Selecting the requisite 5,488 randomly to satisfy the 10,000 cut o gives us a sam-
ple for whom the Men's Email decreased spending by and average of 47c.This
compares very favourably to the sample found using the deterministic model for
whom decreased spend as a result of the Men's campaign was estimated at 25c
on average.
Polynomial approach. As we restricted the polynomials both in dimension
and order, we do not expect results as good as those from the Genetic
Programming approach. We restricted them thus to indicate the usefulness of
polynomials at a basic level for building uplift models. Because of their
simplistic form the computational cost compared with genetic programming is
far lower. If we did not restrict the order of these polynomials, they too
would become increasingly complex and so the cost benefit would be mitigated
(and potentially lost completely). Later on we will combine the two
dimensional polynomials to increase the dimensionality of the models built,
to better understand their usefulness.
• Men's
The best polynomial model found for the Men's campaign is given by
\[
\begin{aligned}
&0.27x + 2.84x^2 - 3.5x^3 + 0.6x^4 + 3.62y - 4.83xy + 0.09x^2y - 0.67x^3y \\
&\quad - 2.21y^2 + 3.22xy^2 + 4.93x^2y^2 - 2.92y^3 - 3.95xy^3 - 2.09y^4
\end{aligned}
\]
where x = binHist and y = NumZip. The q_0 value for this model is
29.58%, which is comparable with those built via genetic programming and
an improvement over the deterministic model.
• Women's
For the Women's campaign the best model is given by
\[
2.2053351821x + 1.96775115108x^2 - 3.62381607147x^3 + 0.674810214017x^4
\]
where here x = binHist. The q_0 coefficient is 61.29%. It should be noted
that a polynomial which generated the same output over the data set arose
in all 10 runs, sometimes appearing already in the initial population. While
the model has some merit, with a fairly high Qini value, it appears that
building continuous models of low dimension using discrete variables in this
way might not be appropriate, as a random search would have performed at
least as well in this instance.
In order to consider the usefulness of this method we will have to wait until
we combine the polynomials to produce higher dimensional models, since we have
observed that building low dimensional models of discrete variables in this way
potentially does not perform better than random search.
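To make the restricted polynomial form concrete, the sketch below enumerates the monomials of a two-variable polynomial with total degree between 1 and 4 and no constant term, and evaluates such a polynomial from a coefficient dictionary. That this is exactly the restriction used is an assumption, inferred from the fourteen terms of the Men's model above; the function names and coefficients are hypothetical.

```python
from itertools import product

def monomial_exponents(max_degree=4):
    """Exponent pairs (i, j) for x**i * y**j with 1 <= i + j <= max_degree."""
    return [(i, j) for i, j in product(range(max_degree + 1), repeat=2)
            if 1 <= i + j <= max_degree]

def evaluate(coeffs, x, y):
    """Evaluate the sum of c * x**i * y**j over a {(i, j): c} dictionary."""
    return sum(c * x**i * y**j for (i, j), c in coeffs.items())

# Hypothetical coefficients, not those of the fitted models.
coeffs = {(1, 0): 1.0, (0, 1): -2.0, (2, 2): 0.5}
score = evaluate(coeffs, x=2, y=1)  # 1*2 - 2*1 + 0.5*4*1 = 2.0
```

Under this assumption there are fourteen candidate monomials, so an individual in the polynomial population can be represented by a vector of at most fourteen coefficients.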
Method 2: Combining the Genetic Algorithms with Significance-Based
Uplift Tree. We now consider a combined evolutionary algorithm which
couples the genetic evolutions used in Method 1 with an algorithm for building
significance-based uplift trees.
A warm-up period of 120 iterations was used, whereafter models were built every
20 iterations. We allow the initial warm-up period so that models are only built
once more interesting variables should have begun to emerge. By allowing the
population to evolve between each model building we hope to achieve more
dynamism with lower computational cost.
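The build schedule can be written down directly. In this Python sketch the constant and function names are hypothetical, and it is an assumption that the first model is built at the end of the warm-up (iteration 120) rather than 20 iterations later.

```python
WARM_UP = 120      # iterations before the first model is built
BUILD_EVERY = 20   # gap between model builds after the warm-up
MAX_ITER = 1500    # run length used in these experiments

def is_build_iteration(it):
    """True when a model should be built at iteration `it` (1-based).
    Assumes the first build happens at the end of the warm-up."""
    return it >= WARM_UP and (it - WARM_UP) % BUILD_EVERY == 0

build_points = [it for it in range(1, MAX_ITER + 1) if is_build_iteration(it)]
```

Under these assumptions a 1500-iteration run builds 70 models, compared with 1500 if one were built at every iteration.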
Models were built over the entire data set and validated by randomly splitting
the data in half 10 times and evaluating the Qini coefficient of the model on
(both halves of) each split. The selection of one model over another was based
on the average minus standard deviation of the resulting Qini values. By
incorporating both the location (average) and spread (standard deviation) of
these values we hope to obtain models which are both useful and robust in the
face of variability.
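A minimal Python sketch of this validation criterion follows. The `metric` argument stands in for a routine that evaluates the Qini coefficient of an already built model on a subset of rows; it, the function name and the use of the sample standard deviation are assumptions for illustration.

```python
import random
from statistics import mean, stdev

def q_low(rows, metric, n_splits=10, seed=0):
    """Average minus standard deviation of `metric` over both halves of
    `n_splits` random half/half splits of `rows`."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_splits):
        shuffled = rows[:]           # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        values.append(metric(shuffled[:half]))
        values.append(metric(shuffled[half:]))
    return mean(values) - stdev(values)

# Stand-in metric: a constant, so the spread is zero and q_low equals it.
example = q_low(list(range(100)), metric=lambda part: 0.5)  # 0.5
```

A model whose Qini value varies widely between halves is penalised through the standard deviation term, which is how high-variance models come to be rejected in favour of more stable ones.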
Whenever a better model was found, the fitnesses of the individuals used in
the model were adjusted by multiplying their current fitness by (1 + Q_low),
where Q_low represents the average minus standard deviation of Qini values
discussed above for that model. Such adjustments were only made in the event
of a strictly superior (by our evaluation metric) model being found. By
incorporating the level of "goodness" of the model, variables useful in the
better models will have their fitness adjusted by a greater factor. It could
be argued that this increase is too large; however, as the general fitness of
the population increases, it is necessary in order to improve longevity.
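The adjustment rule can be sketched as below. The function name and data layout are hypothetical, and expressing Q_low as a fraction (0.25 for 25%) is an assumption.

```python
def reward_model_members(fitness, used_ids, q_low, best_q_low):
    """Multiply the fitness of individuals used in a new model by
    (1 + q_low), but only if the model strictly beats the best so far.
    Returns the (possibly updated) best q_low."""
    if q_low <= best_q_low:
        return best_q_low            # not strictly better: no adjustment
    for ind in used_ids:
        fitness[ind] *= 1.0 + q_low
    return q_low

# Toy example: individuals "a" and "c" appear in an improved model.
fitness = {"a": 1.0, "b": 2.0, "c": 3.0}
best = reward_model_members(fitness, ["a", "c"], q_low=0.25, best_q_low=0.10)
# fitness is now {"a": 1.25, "b": 2.0, "c": 3.75} and best == 0.25
```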
Passing the Entire Population into the Model Building Algorithm. At
each 20th iteration we allow the model building algorithm to select from the
entire population of individuals. The results are as follows:
• Men's
Figure 4.10: Men's Mailing. Full population passed into tree building algorithm
This method shows a considerable improvement over Method 1 and an
improvement over the results using the deterministic modelling techniques in
the winning entry. The fairly narrow range of best model Qini coefficients
suggests that this method is reliable in finding good results, and the
closeness of the lower average to the absolute average suggests that these
models are fairly robust.
In spite of these encouraging points, we can see from the graph that little
improvement is made beyond about iteration 900. When passing the entire
population into the tree building algorithm, there will be a tendency to
select the same individuals more often since the trees are built using greedy
algorithms, which is a likely cause of this stagnation.
• Women's
Figure 4.11: Women's Mailing. Full population passed into tree building
algorithm
In the case of the Women's campaign, unlike the Men's above, this method
produced quite varied results, as we see from the final values for the
different lines. That said, it still represents a significant improvement in
terms of power over Method 1 and, especially in the case of the best model
found through this method, presents a more stable model than was found via
deterministic methods.
The potential for the process to become bogged down by having the model
building algorithm select the same variables repeatedly is real, and the graph
for the Men's campaign (Figure 4.10) indicates this might have affected
progress already. It is useful then to explore the option of only passing
parts of the population into the algorithm.
Passing the Fittest 10 Population Members Into the Model Building
Algorithm. Instead of passing the entire population into the model building
algorithm as above, we now restrict this selection to only the ten fittest
members. By restricting the selection thus, we hope to "force" the algorithm
to consider newly generated highly fit individuals. Immediately, however, we
have a concern that our fitness adjustments based on finding better models
will, at least for a period, outweigh the effects of improved fitness through
evolution, and as a result the 10 individuals with the highest (adjusted)
fitness might remain unchanged for extended periods (until "evolution catches
up"). This could result in a step-like process including extended periods of
non-improvement and essentially wasted computation time.
• Men's
Figure 4.12: Men's Mailing. Fittest 10 passed into tree building algorithm
The lack of improvement after about iteration 400 could be a result of
the concerns raised above. For runs of this length (1500 iterations) this
method appears inferior to the previous one in this instance. Furthermore,
the higher variability of q_0 values on different splits indicated by the
lower average line does not instil confidence that (even given more time) this
will find an adequately robust model.
• Women's
Figure 4.13: Women's Mailing. Fittest 10 passed into tree building algorithm
In the case of the Women's campaign there is very little to choose between
this and the previous method in terms of outcome. The results here are
marginally better, and factoring in the lower computational cost (associated
with passing fewer variables to the model building algorithm) it appears
preferable. However, the inconsistent results in the Men's case give us pause
in recommending it.
While we are able to increase the likelihood of newly spawned individuals who
are highly fit being included in the models, in doing so we exclude a large
part of the population almost absolutely (except in the rare event that for a
period significantly more tournaments are lost by fitter individuals). Further,
while we expect variables with a high basic fitness to be useful in building
models, it is often the case that combinations of "weaker" individuals will
vastly outperform them when combined in an uplift tree, and this method almost
entirely excludes this possibility.
Passing the Fittest 10 of a Random Sample of Size 20 Into the Model
Building Algorithm. In order to improve the dynamism of the process we
consider generating random samples of size 20 from the population and
selecting the fittest 10 of those to be passed into the model building
algorithm. We hope that by doing this a greater variety of individuals will be
considered by the model building algorithm while maintaining an emphasis on
fitter individuals.
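This selection step can be sketched in a few lines of Python. The function name and the representation of the population as an id-to-fitness dictionary are assumptions for illustration.

```python
import random

def fittest_of_random_sample(fitness, sample_size=20, keep=10, seed=None):
    """Draw `sample_size` distinct individuals uniformly at random, then
    return the `keep` fittest of them, fittest first."""
    rng = random.Random(seed)
    sample = rng.sample(list(fitness), min(sample_size, len(fitness)))
    return sorted(sample, key=fitness.get, reverse=True)[:keep]

# Toy population of 70 individuals whose fitness equals their id.
fitness = {i: float(i) for i in range(70)}
chosen = fittest_of_random_sample(fitness, seed=1)
```

Setting `sample_size` to the population size reduces this to the previous method of always passing the ten fittest members.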
• Men's
Figure 4.14: Men's Mailing. Fittest 10 of a Random Sample of size 20 passed
into tree building algorithm
The graph shows continued improvement throughout the run, unlike in the
previous method; ultimately, however, the improvement is only marginal,
and the results are still worse than those from passing the entire
population into the model building algorithm. The decrease in computational
cost when passing so many fewer variables into the algorithm is certainly not
negligible, however, and this method does perhaps have merit.
• Women's
Figure 4.15: Women's Mailing. Fittest 10 of a Random Sample of size 20 passed
into tree building algorithm
For the Women's campaign this method shows a vast improvement over
those considered so far. The expected range is fairly narrow, suggesting
reliability, and the lower average is also reasonably close to it, indicating
robustness in the models built.
Ideally we would have a single method which provides a good model for both
campaigns, and so we persevere with our ideas for improving the dynamism of
the process.
Passing a Probabilistic Sample of Sizes 10 and 15 Into the Model Building
Algorithm. An alternative approach to increasing the diversity of individuals
passed into the model building algorithm is to select subsets of the population
using a fitness proportionate selection method. The benefit of this over the
last method is that when the fitness levels in the population are fairly close
there is a greater emphasis on diversity of selection, but as the fitnesses
become more varied (as the fitness adjustments become greater) there is more
of an emphasis on fitter individuals. These fitter individuals later on
represent those which have already proven useful in building models as well as
newly introduced highly (basically) fit individuals. This shift in emphasis
will hopefully prove useful for the progress of the algorithm.
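One way to realise fitness proportionate selection of a subset is sketched below in Python. Drawing without replacement, so that the subset contains distinct individuals, is an assumption of this example, as are the function name and the dictionary layout; fitness values are assumed to be non-negative with at least one positive value remaining at each draw.

```python
import random

def proportionate_sample(fitness, k, seed=None):
    """Draw k distinct individuals, each draw made with probability
    proportional to fitness among those not yet chosen."""
    rng = random.Random(seed)
    pool = dict(fitness)             # copy so the caller's dict is untouched
    chosen = []
    for _ in range(min(k, len(pool))):
        ids = list(pool)
        pick = rng.choices(ids, weights=[pool[i] for i in ids], k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen

# Toy population: "c" is far fitter, so it is very likely to be drawn.
fitness = {"a": 1.0, "b": 1.0, "c": 100.0, "d": 1.0}
subset = proportionate_sample(fitness, k=3, seed=0)
```

When all fitnesses are nearly equal the draws are close to uniform, which favours diversity; once the multiplicative fitness adjustments spread the values out, the adjusted individuals dominate the draws.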
• Men's
Figure 4.16: Men's Mailing. Probabilistic Sample of size 10 passed into tree
building algorithm
Figure 4.17: Men's Mailing. Probabilistic Sample of size 15 passed into tree
building algorithm
• Women's
Figure 4.18: Women's Mailing. Probabilistic Sample of size 10 passed into tree
building algorithm
Figure 4.19: Women's Mailing. Probabilistic Sample of size 15 passed into tree
building algorithm
For both the Men's and Women's campaigns this method has generated improved
results. In the case of the Men's this manifested as a shift upwards while
maintaining a similar level of reliability, whereas in the Women's the
narrowing of the ranges suggests improved reliability of the models found.
Allowing the model building algorithm slightly more selection (15 over 10
variables) appears to produce better results. As higher numbers are passed in,
however, there will be a
tendency for the algorithm to resemble the method of passing the entire popu-