An Application of Genetic Algorithms to Uplift Modelling

David P. Hofmeyr

Master of Science by Coursework

University of Edinburgh

2011

Declaration

I declare that this thesis was composed by myself and that the work contained therein is my own, except where explicitly stated otherwise in the text.

(David P. Hofmeyr)


Abstract

This paper tackles the problem of uplift modelling, i.e. modelling change in behaviour as a direct result of treatment, using randomised methods, namely evolutionary algorithms, both for variable generation and variable selection. We give a detailed description of the evolutionary methods entailed as well as some of the key aspects of uplift modelling, such as the Qini coefficient and some current methods of modelling. We then apply this evolutionary approach to an example problem published by Kevin Hillstrom in his blog (MineThatData) and discuss how our results compare favourably with those from the winning submission.


I would like to thank my supervisor, Dr. Nicholas J. Radcliffe, for his support, guidance and expertise, without which this dissertation would not have been possible. His extensive knowledge in interests we share has been a great boon, and has aided hugely my understanding of these subjects.

I would also like to thank Stochastic Solutions Ltd. for the use of their uplift software.


Contents

Abstract

1 Introduction:

2 Uplift Modelling
2.1 The Basics of Uplift Modelling
2.2 Significance Based Uplift Trees
2.3 Evaluating Uplift Models (The Qini Coefficient)

3 Evolutionary Algorithms (Genetic Algorithms)
3.1 The Evolutionary Process
3.2 Genetic Programs
3.2.1 Trees
3.2.2 Terminals and Functions
3.2.3 The Initial Population
3.2.4 Evolutionary Operators for Genetic Programs
3.2.5 A Word on LISP and Symbolic Expressions
3.3 The Power of Genetic Algorithms
3.3.1 The Schema Theorem
3.3.2 Forma Analysis
3.3.3 Convergence of Genetic Algorithms

4 Application of Genetic Algorithms to Uplift Modelling and the Hillstrom Challenge
4.1 Application of Genetic Algorithms to Uplift Modelling
4.1.1 The Genetic Programming Approach
4.1.2 The Polynomial Approach:
4.2 The Hillstrom Challenge and Results
4.2.1 The Data and Questions
4.2.2 Summary of Results from Winning Submission
4.2.3 Results of the Genetic Algorithm Approach

5 Conclusion:

A Variables and Models
A.1 Men's Model. Genetic Programming Approach
A.2 Women's Model. Genetic Programming Approach
A.3 Men's Model. Fitness Proportionate Selection. Combined Genetic Programming and Uplift Tree
A.4 Women's Model. Fitness Proportionate Selection. Combined Genetic Programming and Uplift Tree


Structure Beginning in chapter 2, we direct our attention to the basic components of uplift modelling, including the inspiration for its development, its application, some of the mechanical aspects of model building and a means for evaluating models. In chapter 3 we turn our focus to the topic of evolutionary algorithms (specifically genetic algorithms): their medium, structure, use and application, with particular emphasis on genetic programming. In chapter 4 we merge the two and show directly how the evolutionary processes described in chapter 3 can be applied to the field of uplift modelling. We go on to apply these methods to a challenge published by Kevin Hillstrom in his blog, MineThatData, and compare our results with those from the winning submission. Finally we conclude in chapter 5 with a discussion of our findings, the shortcomings of our method and advice for future implementation.


Chapter 1

Introduction:

Uplift modelling is a class of predictive modelling techniques in its relative infancy compared with many others. It follows a lineage of modelling methods in the field of customer relationship management, particularly in direct marketing, dating back to the introduction of data mining in the 1950s.

Aimed at predicting the incremental impact of a targeted action, uplift modelling addresses the shortcomings of its predecessors in this area. These pre-existing techniques do well, through the use of control groups, to measure the incremental effects of an action after the event but fail to model it predictively.

Because it is a new discipline, and because of the difficulty in modelling the second-order nature of incrementality, there is a shortage of documented modelling techniques that predict incremental impact. In this paper we will attempt to build uplift models using evolutionary algorithms.

Inspired by John Holland's Adaptation in Natural and Artificial Systems, the practice of evolving solutions by "genetic" interaction has itself evolved well beyond its canonical form as presented therein.

Genetic programming allows us to build models of unbounded complexity, and we hope to utilise this power in building uplift models. First, by developing the models directly within the genetic programming paradigm in its pure form and second, by combining it with an existing modelling technique in an interactive evolutionary process, we hope to obtain a useful and effective way of developing uplift models.

We are given a useful metric for comparing uplift models, namely the Qini coefficient, and using this we will be able to compare our findings with models built using deterministic methods alone, as well as with some built using a simpler evolutionary process.


Chapter 2

Uplift Modelling

Uplift is defined as a measure of the change in behaviour of an individual (or group of individuals) as a direct result of some action. For example, in the context of marketing, the uplift associated with some campaign can be understood as the incremental sales volume generated by it.

Uplift modelling is concerned with identifying individuals or subsets of a population for which a (usually binary) influence variable has the greatest incremental impact. As is always the case with predictive modelling, accuracy is paramount (i.e. we seek to have manageable error in the model), but in addition to this we seek to isolate the effect of the influence variable as well as have its predicted effect be as non-uniform as possible. A fairly simple regression model will handle such a variable and be able to isolate its effect adequately; however, in this case (for a binary influence variable) the two submodels will be parallel (i.e. the predicted effect of the variable will be uniform across the population) and so will be unable to identify those for whom the effect is greatest.

Uplift modelling was born out of the failure of traditional marketing strategy models to recognise the distinction between "targeting people who are likely to buy if they are included in a campaign" and "targeting people who are only likely to buy if they are included in a campaign." Modelling on the former will at best do well to measure and assess incremental sales as a result of some campaign. On the other hand, understanding the latter allows one to actually maximise them.

The traditional models concentrate on associating purchases with treatment by some marketing campaign. While this method can certainly give evidence that a customer was in some way influenced by the campaign, there is no guarantee of it. It may be that such a customer would have bought whether or not they were targeted by the campaign, and so any increment in purchasing is not recognised. Moreover, the potential for negative influencing by campaigns is completely ignored. While it may seem absurd that targeted marketing could significantly negatively influence a potential customer's likelihood of purchasing, these phenomena are well documented ([1]).

In this section we will discuss some of the basic ideas behind uplift modelling, modelling methods that preceded it (so-called response models) and some of their shortcomings, as well as a metric for evaluating uplift models (namely the Qini coefficient) which will be used in our analysis later on.

2.1 The Basics of Uplift Modelling

Consider a population P divided into two disjoint subpopulations, T and C. Suppose then that members of T are exposed to some treatment, while members of C are not (we refer to T as the treated population and C as the control population). We will begin with the binary case and consider the outcome variable $O \in \{0, 1\}$, where $O = 1$ is seen as the "desirable" outcome.

A conventional response model attempts to model

$$P(O = 1 \mid x; T),$$

that is, the probability of an individual, described by the variables x, returning the desirable outcome given that they are in subpopulation T (i.e. were treated). Notice that this technique ignores the subpopulation C (i.e. those not treated).

Uplift models, on the other hand, instead model

$$U(x) := P(O = 1 \mid x; T) - P(O = 1 \mid x; C),$$

the increase in probability that an individual will return the desired response given that they are in subpopulation T (i.e. were treated) over the relevant probability were they in subpopulation C (i.e. were not treated).

When O is continuous (or at least discrete but not binary) then we can instead consider expectations. In this case, uplift models attempt to model

$$U(x) := E[O \mid x; T] - E[O \mid x; C],$$

the increase in expected value of the outcome O given that the individual, described by the variables x, is in subpopulation T over that were they in subpopulation C.

The difficulty in modelling this arises when we realise that the uplift for an individual is not an observable quantity, since no individual can be in both T and C. The most obvious way to model this is to fit a response model for each of the terms independently (that is, a model for individuals in T and another for those in C) and then to subtract the one from the other. In theory there is nothing intrinsically wrong with this method; however, in practice it is not always reliable, and can in fact be quite poor ([1]).

Instead we can consider models that impose a segmentation on the population and base expectations on averages within those segments. In this case we encounter difficulties surrounding reliability, as estimates for every segment need to be statistically robust and thus considerably more observations are required than in some other modelling methods.
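As a minimal sketch of the segment-based estimate described above (an illustration only, not the software used in this work), the uplift in each segment can be taken as the mean outcome among treated members minus the mean outcome among control members. The function name `segment_uplift`, the record layout and the toy data are assumptions made for this example.

```python
from collections import defaultdict


def segment_uplift(records):
    """Estimate uplift per segment as mean(O | T) - mean(O | C).

    records: iterable of (segment, treated, outcome) triples (hypothetical layout).
    Segments lacking either treated or control members are skipped.
    """
    # For every segment keep [sum of outcomes, count] for treated and control.
    cells = defaultdict(lambda: {True: [0.0, 0], False: [0.0, 0]})
    for segment, treated, outcome in records:
        cell = cells[segment][bool(treated)]
        cell[0] += outcome
        cell[1] += 1

    uplift = {}
    for segment, cell in cells.items():
        (t_sum, t_n), (c_sum, c_n) = cell[True], cell[False]
        if t_n == 0 or c_n == 0:
            continue  # uplift cannot be estimated without both groups
        uplift[segment] = t_sum / t_n - c_sum / c_n
    return uplift


# Toy data: segment "A" responds to treatment, segment "B" does not.
toy = [
    ("A", True, 1), ("A", True, 1), ("A", True, 0), ("A", True, 0),
    ("A", False, 0), ("A", False, 0), ("A", False, 1), ("A", False, 0),
    ("B", True, 1), ("B", True, 0), ("B", False, 1), ("B", False, 0),
]
estimates = segment_uplift(toy)
```

With so few observations per cell the estimates are of course far from statistically robust, which is exactly the reliability concern raised above.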


2.2 Significance Based Uplift Trees

When we consider a segmentation of the population as suggested above, a natural choice of model structure is tree-based models, as they are intrinsically based on segments. In fact the only packaged uplift modelling software currently available is based on a tree building algorithm, the basics of which will be discussed herein.

The key features of the tree based model are:

Significance-based splitting:

The goal of generating useful splits during the growth of the tree model is both to maximise the difference in uplift between the two subpopulations and to minimise the difference in their sizes. In general these two objectives are at odds with one another, however, and a method of satisfying both satisfactorily is not trivial. The significance-based splitting criterion fits a linear model to each of a set of potential splits and uses the significance of the interaction term as a measure for selection. The outcome variable is modelled as

$$O_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij},$$

where $O_{ij}$ indicates the expected outcome returned by an individual in subpopulation i (either treated (T) or control (C)) on side j (L or R) of the split being considered. $\mu$ is a constant related to the mean overall outcome across the entire population. $\alpha_i$ relates to the effect of treatment in subpopulation i (naturally we set $\alpha_C$ to be zero). $\beta_j$ quantifies the effect of the split by indicating the base difference in outcome between the two sides. Finally, $\gamma_{ij}$ measures the interaction between the treatment and the split.

There is no loss of generality in setting

$$\beta_L = \gamma_{CL} = \gamma_{CR} = \gamma_{TL} = 0,$$

which leaves $\gamma_{TR}$ as the difference in uplift between the subpopulations either side of the split, which is precisely the quantity of interest. The usefulness of taking the significance of this estimate rather than its magnitude as a selection criterion lies in the fact that the t-statistic on which it is based implicitly combines both magnitude and population size.

Variance-based pruning

Uplift is harder to model than many other quantities for three reasons: the overall uplift due to the treatment is often small compared with other relationships in the data; the control population is often significantly smaller than the treated population; and uplift is second-order in nature (i.e. it is a measure of the change in outcome, rather than a basic measure of the magnitude of an observable quantity). As such, larger errors tend to arise in estimation and, as a result, stability of models is often difficult to achieve.

Radcliffe ([1]) recommends resampling the training population k times (k = 8 is common), training on the first sample and evaluating the stability with reference to the other $k - 1$ samples. At each node in the model, the uplift is measured in the first sample and its standard deviation is estimated from the other $k - 1$ samples. Any split for which a child node exhibits standard deviation greater than some threshold is discarded, along with the subtree descending from it.

Bagging

The practice of bagging is a further concession to the instability inherent in building uplift models. It refers to building a number of models (typically 10 or 20) as described above, each using different resamplings, and basing predictions on the averages achieved over all the models. Radcliffe ([1]) notes that this approach can succeed in cases where building only a single tree failed.

Pessimistic qini-based variable selection

The major factors regarding variable selection in any model building technique are reducing dimensionality (particularly to prevent overfitting on the data used for building the model), avoiding multicorrelation (strong correlations between predictor variables can lead to unstable results and to misinterpretations of the relationships inherent in the data that are recognised by the model), improving model quality and stability, and improving model interpretability.

Again the stability issue is more apparent in uplift modelling than in general.

Basing variable selection on pessimistic Qini-based estimates refers to ranking the candidate variables according to a quality measure (the Qini coefficient, see below) and choosing either a fixed number of variables, or all which are above a specified threshold level. Radcliffe ([1]) recommends an adjustment to the basic Qini coefficient to be more pessimistic and, in doing so, reduce the likelihood of selecting variables which might lead to instability in the model. As in the case of bagging, this is done by resampling the training population repeatedly and subtracting from the initial Qini estimate a multiple of the standard deviation of the Qini values obtained.
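Two of the ingredients above lend themselves to a short sketch: the t-statistic of the interaction term used for significance-based splitting, and the resampling-based standard deviation used both for pruning and for the pessimistic Qini adjustment. The code below is an illustration under our own assumptions, not the packaged software: it fits the saturated four-cell linear model by ordinary least squares (for which the fitted values are simply the cell means), and the names `interaction_t`, `is_unstable`, `pessimistic_qini` and the parameter `multiple` are hypothetical.

```python
import math
import statistics

CELLS = (("T", "L"), ("T", "R"), ("C", "L"), ("C", "R"))


def interaction_t(cells):
    """Return (gamma_TR, t) for O_ij = mu + alpha_i + beta_j + gamma_ij.

    cells maps (group, side) to a list of observed outcomes. With four cells
    and four free parameters the least-squares fit equals the cell means, so
    gamma_TR is the difference in uplift between the right and left sides.
    """
    means, sse, n_total, inv_n = {}, 0.0, 0, 0.0
    for key in CELLS:
        ys = cells[key]
        mean = sum(ys) / len(ys)
        means[key] = mean
        sse += sum((y - mean) ** 2 for y in ys)
        n_total += len(ys)
        inv_n += 1.0 / len(ys)
    gamma_tr = (means[("T", "R")] - means[("C", "R")]) - (
        means[("T", "L")] - means[("C", "L")]
    )
    residual_var = sse / (n_total - 4)  # four fitted parameters
    return gamma_tr, gamma_tr / math.sqrt(residual_var * inv_n)


def is_unstable(resampled_uplifts, threshold):
    """Pruning test for one child node: uplift varies too much across samples."""
    return statistics.stdev(resampled_uplifts) > threshold


def pessimistic_qini(initial_q, resampled_qs, multiple=1.0):
    """Initial Qini estimate minus a multiple of the resampled standard deviation."""
    return initial_q - multiple * statistics.stdev(resampled_qs)


# Toy candidate split: uplift of 0.5 on the right, none on the left.
split = {
    ("T", "L"): [1, 0, 0, 0], ("T", "R"): [1, 1, 1, 0],
    ("C", "L"): [0, 1, 0, 0], ("C", "R"): [0, 0, 1, 0],
}
gamma_tr, t_stat = interaction_t(split)
```

Because the standard error shrinks as the cell sizes grow, the same difference in uplift yields a larger t-statistic on a larger, more balanced split, which is the sense in which the statistic combines magnitude and population size.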

2.3 Evaluating Uplift Models (The Qini Coefficient)

We have discussed a small amount about what is desirable in an uplift model, but given some collection of models how do we determine which is "best"? There is a tendency in practice to resort to ad hoc methods, most often graphical. Several authors suggest comparing uplift in the top k deciles to the overall uplift; however, different choices of k can reverse the comparison ([2]), and without some conformity in choosing a cut-off level this method does not appear very robust. This method also ignores the opposite end of the spectrum, those for which uplift is considerably lower than the overall, or is even negative. Moreover, it is very possible, and has been observed in practice ([2]), that decile segmentation does not generate any statistically significant differences and a coarser segmentation is needed.

Radcliffe ([2]) proposes the Qini coefficient, an analogue of the Gini coefficient ([11]) for uplift. Where the Gini coefficient measures the (anti-)correlation between outcome and targeting depth (based on an ordering induced by the model), the Qini instead measures the (anti-)correlation between uplift rate and targeting depth.

Let M be a model that induces a total ordering on a population, divided into T and C as above. Define $f^T_M : [0, 1] \to \mathbb{R}$ so that $f^T_M(x)$ is the response rate at depth x, according to the ordering induced by M, for the treated population. Similarly, define $f^C_M$ for the control population. (Note that for a finite population, f will not be defined over the continuous interval [0, 1]; in such a case we consider the piecewise continuous approximation formed by spreading the probability density associated with discrete scores over the equivalent interval of continuous ranks.)

We can then define the predicted uplift pointwise, for a given depth x,

$$u_M(x) := f^T_M(x) - f^C_M(x).$$

And with this we can define the cumulative outcome and uplift rates

$$F^T_M(x) = \int_0^x f^T_M(y)\,dy,$$

$$F^C_M(x) = \int_0^x f^C_M(y)\,dy,$$

$$U_M(x) := F^T_M(x) - F^C_M(x).$$

Consider a model R which imposes a random ordering on the population. We have

$$E[U_R(x)] = \bar{u}x, \quad \forall x \in [0, 1],$$

where $\bar{u} = \int_0^1 u(x)\,dx$ is the average uplift in the population. The Qini coefficient, Q, is defined by the area between $U_M$ and this diagonal, i.e. the average excess cumulative uplift resulting from the model M, normalised by the equivalent value for the best possible ordering.
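The numerator of Q, the area between $U_M$ and the diagonal, can be approximated on a finite population as in the following sketch. It is an illustration under simplifying assumptions of our own: depth is measured within the treated and control groups separately, the area is computed by the trapezium rule on an evenly spaced grid, and the normalisation by the best possible ordering is omitted. The names `cumulative_rate`, `qini_area` and `steps` are hypothetical.

```python
def cumulative_rate(scores, outcomes, steps):
    """Approximate F_M at depths 0, 1/steps, ..., 1 for one group.

    Members are ordered by descending model score; the value at depth x is the
    total outcome among the top fraction x, divided by the group size.
    """
    ordered = [o for _, o in sorted(zip(scores, outcomes), key=lambda p: -p[0])]
    n = len(ordered)
    prefix = [0.0]
    for outcome in ordered:
        prefix.append(prefix[-1] + outcome)
    return [prefix[(i * n) // steps] / n for i in range(steps + 1)]


def qini_area(t_scores, t_outcomes, c_scores, c_outcomes, steps=100):
    """Area between the cumulative uplift curve U_M and the random diagonal."""
    f_t = cumulative_rate(t_scores, t_outcomes, steps)
    f_c = cumulative_rate(c_scores, c_outcomes, steps)
    uplift = [a - b for a, b in zip(f_t, f_c)]
    u_bar = uplift[-1]  # overall uplift, reached at full depth
    excess = [uplift[i] - u_bar * i / steps for i in range(steps + 1)]
    # Trapezium rule over the grid of depths.
    return sum((excess[i] + excess[i + 1]) / 2 for i in range(steps)) / steps


# Toy populations: treated responders and control non-responders score highest.
scores = [0.9, 0.8, 0.2, 0.1]
good = qini_area(scores, [1, 1, 0, 0], scores, [0, 0, 1, 1], steps=4)
bad = qini_area(scores, [0, 0, 1, 1], scores, [1, 1, 0, 0], steps=4)
```

A model that ranks positive uplift first gives a positive area, and reversing the ranking flips its sign; a random ordering would be expected to give an area close to zero.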

Radcliffe acknowledges ([2]) that a model inducing such a best possible ordering might not even be theoretically achievable: for example, no model can separate individuals described identically by the predictor variables whose outcomes are independent of treatment. To incorporate this, he also defines $q_0$, which is based on the "best case" model with no negative uplift.


Note that the transformation from Q to $q_0$ is monotone, and so the choice of one over the other is essentially down to interpretation.


Chapter 3

Evolutionary Algorithms (Genetic Algorithms)

The concept behind evolutionary algorithms stems from our recognition of nature's ability to produce (evolve) species and individuals which appear well adapted to their environment. Regardless of how harsh an environment may be, life finds a way to propagate, and even flourish.

The evolutionary algorithm in its essence is a search method (heuristic) which draws on ideas akin to Darwinian evolution to produce increasingly appealing (fit) solutions to a problem. The power of evolutionary algorithms (specifically genetic algorithms) was initially formalised in John Holland's pioneering work ([3]), in which he describes the notion of schema analysis. Schema analysis operates within the genotype (representation) space and shows how individuals sharing similarities in genetic structure (schemata) which display sufficiently above average fitness* have a tendency to proliferate. While this has given us tremendous insight into the usefulness of genetic algorithms, a crucial limitation of schema analysis lies in the specificity of the genotype space, that is, the requirement that "chromosomes" (representations) be linearly arranged strings of a fixed number of genes, each drawn from a predefined set of alleles. Greene ([4]) discusses the usefulness of considering nonlinear arrangements of chromosomes and how Holland's Schema Theorem can be extended to the case of a general connected graph arrangement. Forma analysis ([5]) tackles very similar ideas from a much more general perspective and allows for a far more universal application. Formae refer to equivalence classes implied by any equivalence relation over the phenotype (solution) space, and so similarities between subscribing individuals need not be restricted to schemata; thus many of the limitations associated with schema analysis are handily avoided.

In this chapter we will introduce and briefly discuss some important issues and ideas in the scope of evolutionary algorithms that are relevant to this paper. We will then delve into a slightly more detailed account of Genetic Programming, a particular type of evolution. Finally we will touch on some theory showing the effectiveness of genetic algorithms.


* The level of fitness above average depends on the nature of the schema in terms of its length and degree of specificity.

3.1 The Evolutionary Process

In this section we outline the broad structure of the evolutionary process associated with the algorithms. We begin with some necessary terminology.

Definition The Search Space, S, is the set of all possible solutions to a problem.

S is also referred to as the solution space and the phenotype space.

Definition The Representation Space, G, is the set of all (gene) representations of members of S.

G is also referred to as the genotype space and the chromosome space.

Definition The Solution Map, $F : G \to S$, associates each element of the representation space with its associated solution.

F should be onto, i.e. $\forall s \in S : F^{-1}(\{s\}) \neq \emptyset$. In other words every possible solution has a representation in G. Ideally we would want F also to be one-to-one, i.e. $\forall s \in S : |F^{-1}(\{s\})| = 1$, where $|\cdot|$ denotes the cardinality of a set. However, this uniqueness of representations is, in general, not satisfied.

Definition The Evaluation Function, $E : S \to \prod_{i \in I} P_i$, where I is some index set, maps from the solution space to a product of totally ordered sets.

For the purposes of this paper we only consider real-valued evaluation functions and refer to E(s) as the fitness of s. Weise devotes a significant part of ([6]) to showing how higher dimensional evaluation functions can be reduced by the use of Pareto dominance. In the field of optimisation these higher dimensional functions usually arise in the presence of multi-objective problems.

Definition An Evolutionary Process is a 5-tuple, $(S, G, O, F, E)$, where S is the solution space, G is the representation space, F is the solution map, E is the evaluation function and O is a collection of so-called evolutionary operators.

Evolutionary operators take collections of elements from the representation space (or solution space), along with some parameter(s) (from some predefined control set) and return a collection of elements from the representation space (or solution space). The nature of the control set depends on the operator and should be designed in such a way that, for a given operator, collection of operands and parameters, the output is unique.

In this paper we will consider three types of evolutionary operators. Firstly, selection operators, which take an entire population of individuals and return some subset of that population. Selection operators can be used to determine breeding pools (collections of individuals chosen to undergo recombination) as well as for culling populations in order to control the population size.


Definition An adjustment to the culling operator referred to as elitism ensures that individuals defined as elite cannot be selected for removal from the population.

In general elitism refers specifically to retaining the "fittest" member of the population; however, we prefer this slightly more general definition, which allows retention of any predefined set of individuals.

Secondly, recombination operators, which take pairs of individuals and return (according to the control parameters) some number of "child" solutions. Much of ([5]) is devoted to useful properties of recombination operators in the field of forma analysis. In general these properties require background beyond the scope of this paper; however, one such property which does not require any background knowledge, and which is important in the context of convergence, is the notion of purity.

Definition A recombination operator, R, with control set $C_R$, is said to be Pure if

$$\forall g \in G,\ c \in C_R : R(g, g, c) = g.$$

That is, a recombination operator is pure if the offspring of (genetically) identical parents are identical to those parents, regardless of the control parameters.
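To make the definition concrete, the following sketch checks purity by brute force for two toy operators on fixed-length strings. One-point crossover is a standard example of a pure operator; `flip_first` is a deliberately impure variant invented for this illustration, and `is_pure_on` is a hypothetical helper that tests only the genotypes and control values it is given.

```python
def one_point(g1, g2, cut):
    """One-point crossover: the head of g1 followed by the tail of g2."""
    return g1[:cut] + g2[cut:]


def flip_first(g1, g2, cut):
    """Crossover that also flips the first gene, so identical parents change."""
    child = one_point(g1, g2, cut)
    return ("1" if child[0] == "0" else "0") + child[1:]


def is_pure_on(recombine, genotypes, controls):
    """True if R(g, g, c) == g for every supplied genotype g and control c."""
    return all(recombine(g, g, c) == g for g in genotypes for c in controls)


genotypes = ["0000", "0101", "1110"]
controls = range(5)  # every possible cut point for strings of length 4
pure = is_pure_on(one_point, genotypes, controls)
impure = is_pure_on(flip_first, genotypes, controls)
```

Here `pure` is `True` and `impure` is `False`, since flipping a gene alters the child even when both parents are identical.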

Finally, mutation operators, which take single individuals and return, according again to control parameters, another single individual. In general mutation will change the make-up of an individual, and may result in an individual which possesses characteristics otherwise not present in the population. It is this that is both the bane and the boon of genetic algorithms. Mutation allows for a much broader search of the solution space and so the power of an algorithm is greatly amplified by it. On the other hand, it is a great thorn in the side of analysts, as understanding convergence becomes far more difficult.

Note: Recombination and Mutation are also referred to as genetic operators.

The mechanism of the evolution process is the iterative application of evolutionary operators to some initial subset of the representation space. The age of the computer has allowed us to perform absurdly many simple tasks in a short space of time and it is this power that allows us to utilise evolutionary algorithms in a meaningful way. Of course we cannot emulate the vastness of time that life has taken to evolve from single-celled organisms to the biologically diverse and complex world we live in, but we are able to observe improvements in even the most complex of problems.
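The iterative application of selection, recombination and mutation can be sketched as follows. This is a toy illustration on fixed-length bit strings rather than the process used later in this paper: the truncation-style breeding pool, one-point crossover, bit-flip mutation, single-elite retention and all parameter values are assumptions chosen for brevity, and `evolve` is a hypothetical name.

```python
import random


def evolve(fitness, length=20, pop_size=30, generations=40, p_mut=0.02, seed=0):
    """Evolve bit strings towards higher fitness; returns (best, best-per-generation)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best_per_gen = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        best_per_gen.append(fitness(pop[0]))
        pool = pop[: pop_size // 2]        # selection: fitter half forms the breeding pool
        children = [pop[0][:]]             # elitism: the fittest individual is retained
        while len(children) < pop_size:
            a, b = rng.sample(pool, 2)
            cut = rng.randrange(1, length)  # control parameter of the recombination
            child = a[:cut] + b[cut:]       # recombination: one-point crossover
            # mutation: flip each bit independently with probability p_mut
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            children.append(child)
        pop = children                      # culling: children replace the old population
    pop.sort(key=fitness, reverse=True)
    best_per_gen.append(fitness(pop[0]))
    return pop[0], best_per_gen


# Fitness here is simply the number of ones in the string.
best, history = evolve(fitness=sum)
```

Because the elite individual is copied unchanged into each new generation, the best fitness recorded in `history` can never decrease from one generation to the next.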

3.2 Genetic Programs

Genetic programs are a very specific type of genetic algorithm in which the individuals undergoing evolution are, themselves, computer programs. The relevance of genetic programming to this project is that the variables built in our evolution of uplift models can be expressed as tree-based structures like those used to describe the computer programs herein. As such, the mechanical workings of the evolutionary processes of the two are essentially the same.

Koza ([7]) discusses how extremely varied problems can be solved by the discovery of a computer program which produces a particular output from a collection of inputs. It will be this ability to build highly non-smooth transformations (ones which might not be possible using conventional modelling methods nor simpler evolutionary processes) that will be useful in constructing models and variables in our application of genetic programming to uplift modelling.

3.2.1 Trees

As mentioned above, the medium for representation of computer programs is via tree graphs and we will now devote a small amount of time to defining the nature of these trees and how the evolution thereof takes place.

Definition A tree is a connected directed graph with no cycles. That is, it is a graph in which any two nodes are connected by a unique path*.

* We use path to mean a sequence of nodes, none of which is visited more than once, in which each pair of consecutive nodes is connected by an edge. This is sometimes referred to as a simple path.

In genetic programming we consider hierarchical trees, in which the order of nodes is important, and we refer to immediate superiors (i.e. those adjacent and higher up the hierarchy) as parent nodes, and immediate inferiors as child nodes.

Definition We refer to a node with no child nodes as a terminal, a node with no parent nodes as a root and a node with at least one parent and at least one child node as an intermediate node.

Definition The descendants of a node N include all the children of N as well as all descendants of those children.

Definition A subtree of a tree T subtended by a node N is the tree consisting of N and all the descendants of N, connected in the same manner as they are in T.

Definition The depth of a tree is the greatest distance (in terms of number of nodes passed through) along a path from a root node to a terminal node.
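These definitions translate directly into a few recursive functions. In the sketch below a node is represented as a `(label, children)` pair, which is merely one convenient encoding assumed for illustration; the example tree is the five-node tree of Figure 3.1, and depth is counted in nodes, as in the definition above.

```python
# The tree of Figure 3.1: root 1 with children 2 and 3; node 3 has children 4 and 5.
tree = (1, [(2, []), (3, [(4, []), (5, [])])])


def size(node):
    """Number of nodes in the subtree subtended by node."""
    _, children = node
    return 1 + sum(size(child) for child in children)


def depth(node):
    """Greatest number of nodes on a path from node down to a terminal."""
    _, children = node
    return 1 + max((depth(child) for child in children), default=0)


def terminals(node):
    """Labels of all nodes without children, in depth-first order."""
    label, children = node
    if not children:
        return [label]
    return [t for child in children for t in terminals(child)]


tree_size, tree_depth, tree_terminals = size(tree), depth(tree), terminals(tree)
```

For this tree the functions give size 5, depth 3 and terminals 2, 4 and 5.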


Figure 3.1: Tree

In the above tree, the node labelled "1" is the root node; it has 2 child nodes, "2" and "3". Nodes "2", "4" and "5" are terminal nodes. "3" is an intermediate node and is the parent of "4" and "5". The tree has depth 3 and size 5.

3.2.2 Terminals and Functions

Of crucial importance to the evolutionary process is the correct selection of potential terminals and intermediate (and root) nodes. In order to have the output of our programs vary according to input (something which is obviously desirable, as a constant output is far from interesting) we include in the set of potential terminals a collection of variables to represent the inputs, as well as any necessary interpretations thereof (such as the state of some system in the presence of the input variables). One can also include numerical values, the boolean values True and False, etc. Especially in the case where different inputs range on very different scales, it is sometimes useful to include a constant terminal which can relate one to the other by some operator (e.g. dividing the one with the larger scale or multiplying the one with the smaller by some appropriately sized constant). Intermediate nodes contain elements from a set of functions, usually comprising some collection of:

arithmetic operators (+, -, *, etc.),

mathematical functions (trigonometric functions, logarithms, etc.),

boolean operators (AND, OR, etc.),

conditional operators (IF-THEN-ELSE, etc.),

relations (=, <, etc.),

any functions that induce iteration or recursion, etc.

The representation space is then the collection of all potential structures that can be composed recursively from the collection of terminals (T) and functions (F). We will denote this space $G_{(T,F)}$.

The terminal and function sets should also be designed so as to satisfy the closure and sufficiency properties.


Definition A terminal, function pair $(T, F)$ is said to be closed if each $f \in F$ is defined over T as well as all possible outputs of members of F.

An example of a non-closed pair is any with function set containing both $-$ and $/$, since for any $t_1, t_2 \in T$, $t_1 / (t_2 - t_2)$ does not exist. In other words, the function $/$ is not defined for the pair of values $(t_1, t_2 - t_2) = (t_1, 0)$. Mathematical functions such as square root and logarithm are not defined for negative values, and so here too possible non-closure is a problem. In such instances it is common to replace these functions with ones that are equivalent where the original function is defined, but take on some other form elsewhere (often merely taking the value zero).
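Such replacement functions are often called "protected" functions in the genetic programming literature. The sketch below shows one possible set, assuming the fallback value zero mentioned above; the names `protected_div`, `protected_sqrt` and `protected_log` are hypothetical.

```python
import math


def protected_div(a, b):
    """Ordinary division where defined; zero when the divisor is zero."""
    return a / b if b != 0 else 0.0


def protected_sqrt(a):
    """Square root for non-negative arguments; zero otherwise."""
    return math.sqrt(a) if a >= 0 else 0.0


def protected_log(a):
    """Natural logarithm for positive arguments; zero otherwise."""
    return math.log(a) if a > 0 else 0.0


# t1 / (t2 - t2) is now well defined for any numeric terminals t1 and t2.
t1, t2 = 3.0, 5.0
value = protected_div(t1, t2 - t2)
```

With these in place every function in the set returns a real number for any real inputs, so arbitrary compositions of them can be evaluated without error.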

Definition A terminal, function pair $(T, F)$ is sufficient for a problem P if the solution to P lies in $G_{(T,F)}$.

This point may appear trivial; however, the identification of the requirements of the problem is not always obvious. Borrowing an example from ([7]), Kepler's Third Law, which was discovered in 1618, states that the cube of a planet's distance from the sun is proportional to the square of its period around the sun. If $F = \{+, -\}$ then a computer program that predicts the period of a planet around the sun could not result. Similarly, without knowledge of Kepler's Law, one would not know that distance from the sun is the sole predictive variable when determining the period of a planet and so might construct a terminal set comprised of only information about the planet itself, say diameter, density and rotational speed, which would not be sufficient for the problem.

3.2.3 The Initial Population

The genetic programming evolutionary process produces increasingly complex individuals due to the nature of the evolutionary operators used. Initialising the process with the inaugural population is an important step, and there are two commonly used methods for generating initial structures, namely "grow" and "full".

The "grow" method creates trees of varying shape for which the distance between the single root node and any terminal node is no greater than some predetermined depth. Starting with a randomly selected root node (from the set $T \cup F$), whenever a node is added that is in F, if the current depth is strictly less than the maximum, attach a number of randomly selected nodes (from $T \cup F$) equal to the number of arguments that the node takes. If the depth is equal to the maximum, instead append nodes only from T.

The "full" method creates trees which are dense, i.e. the distance from the single root node to every terminal node is equal to some predetermined depth. They are more uniform in shape and size than grown trees, with variations only arising as a result of functions taking a different number of arguments. They are constructed in the same way as "grown" trees, except that when a function node is added, if the current distance to the root is strictly less than the predetermined depth then only nodes from $F$ are appended.

The "ramped half-and-half" method suggested by Koza ([7]) generates an initial population using a combination of the two generative methods. For every depth between 2 and some chosen maximum, the method generates an equal number of grown and full trees of that depth and adds them to the population. This creates an initial population varied in both size and shape.
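The three initialisation methods can be sketched as below. This is an illustrative Python sketch under stated assumptions: trees are nested lists of the form [function, child, ...], the primitive sets `FUNCTIONS` and `TERMINALS` are hypothetical, and depth is counted as the number of edges from the root to the deepest terminal.

```python
import random

# Hypothetical primitive sets for illustration: function name -> arity.
FUNCTIONS = {"+": 2, "-": 2, "*": 2, "neg": 1}
TERMINALS = ["x", "y", 1.0]


def random_tree(depth_left, method, rng):
    """Build a tree as a terminal or a list [function, child, ...]."""
    if depth_left == 0:
        return rng.choice(TERMINALS)          # maximum depth reached: terminals only
    if method == "grow":
        pool = list(FUNCTIONS) + TERMINALS    # grow: draw from T ∪ F
    else:
        pool = list(FUNCTIONS)                # full: draw from F only
    node = rng.choice(pool)
    if node not in FUNCTIONS:
        return node                           # a terminal ends this branch early
    return [node] + [random_tree(depth_left - 1, method, rng)
                     for _ in range(FUNCTIONS[node])]


def depth(tree):
    """Number of edges from the root to the deepest terminal."""
    if not isinstance(tree, list):
        return 0
    return 1 + max(depth(child) for child in tree[1:])


def ramped_half_and_half(max_depth, per_depth, rng):
    """Equal numbers of grown and full trees for each depth 2..max_depth."""
    population = []
    for d in range(2, max_depth + 1):
        for method in ("grow", "full"):
            population += [random_tree(d, method, rng) for _ in range(per_depth)]
    return population
```

For example, `ramped_half_and_half(4, 3, random.Random(0))` yields 18 trees: three grown and three full trees at each of the depths 2, 3 and 4.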

3.2.4 Evolutionary Operators for Genetic Programs

For the purposes of this paper we consider the following recombination operator:

$R: G \times G \times \mathbb{Z}_+^2 \to G$, where $R(g_1, g_2, n, m)$ is $g_1$ with the subtree subtended by the $(n \bmod \mathrm{size}(g_1))$th node replaced with the subtree of $g_2$ subtended by the $(m \bmod \mathrm{size}(g_2))$th node. Here the size of a tree is the number of nodes it contains, and the nodes are numbered depth first (since in general the control operands are chosen randomly, the numbering is in this case arbitrary; in practice the set of control operands is bounded, since generating random elements from an infinite set is problematic). Notice that in this case the control set is $\mathbb{Z}_+^2$.

Figure 3.2:Crossover

Mutation will be defined as:

$M: G \times G \times \mathbb{Z} \to G$, where $M(g, h, n)$ is $g$ with the subtree subtended by the $(n \bmod \mathrm{size}(g))$th node replaced with $h$. We have taken a slight liberty in the notation here, since in general the size of $h$ will depend on $g$ and $n$, in that we will restrict the depth of $h$ to the depth of the subtree it replaces. We would prefer not to have operands depend on one another in this way; however, there should be no loss of interpretation. Furthermore, because again the operands are chosen randomly, it would be possible to insert some pruned version of $h$ in the case where it is larger than the desired size. This would bias its size upwards, but would still result in mutated trees closer in size to the original. The control set in this case is $G \times \mathbb{Z}$.

Figure 3.3:Mutation
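The recombination and mutation operators can be sketched on nested-list trees as follows. This is an illustrative Python sketch, not the code used in this work: nodes are numbered depth first starting from 0 (an assumption), the helper names are hypothetical, and the depth restriction on the inserted subtree is omitted for brevity.

```python
def subtrees(tree, path=()):
    """Yield (path, subtree) pairs in depth-first (pre-order) order."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))


def size(tree):
    """Number of nodes in the tree."""
    return sum(1 for _ in subtrees(tree))


def replace_at(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    copy = list(tree)
    copy[path[0]] = replace_at(tree[path[0]], path[1:], new)
    return copy


def recombine(g1, g2, n, m):
    """g1 with its (n mod size(g1))th subtree replaced by the (m mod size(g2))th subtree of g2."""
    nodes1, nodes2 = list(subtrees(g1)), list(subtrees(g2))
    path, _ = nodes1[n % len(nodes1)]
    _, donor = nodes2[m % len(nodes2)]
    return replace_at(g1, path, donor)


def mutate(g, h, n):
    """g with its (n mod size(g))th subtree replaced by h."""
    nodes = list(subtrees(g))
    path, _ = nodes[n % len(nodes)]
    return replace_at(g, path, h)
```

Taking the integer operands modulo the tree size means any randomly drawn integers select a valid node, which is why the control set can be stated without reference to the particular trees involved.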

3.2.5 A Word on LISP and Symbolic Expressions

LISP (LISt Processing) is a programming language based on two types of entities: atoms and lists. Atoms in LISP are constants, like the number 7, and variables, like TIME. Lists in LISP are ordered sets of items enclosed in parentheses. Examples of lists are (A 3 TIME) and (< 2 7).

A Symbolic Expression (S-expression) is a list or atom in LISP.S-expressions

are the only syntactic form in LISP,and so the programs of LISP are themselves

S-expressions.

LISP evaluates constant atoms to themselves (e.g., 7), and variable atoms to their current value (e.g., TIME). LISP evaluates lists by treating the first element as a function and applying it to the evaluations of the remaining elements of the list. These elements may be atoms, or lists themselves.

For example, the S-expression (< (+ 1 (* 2 3)) 4) is evaluated by LISP as applying the comparison function "<" to the entities (+ 1 (* 2 3)) and 4, which it evaluates in turn. The first is evaluated as applying the numerical operator "+" to the entities 1 and (* 2 3). The second of these is evaluated as applying the numerical operator "*" to the entities 2 and 3. Working back up this chain, LISP evaluates *(2,3) = 2 * 3 = 6; +(1,6) = 1 + 6 = 7; and finally <(7,4) = 7 < 4 = False.
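The evaluation rule just described can be mimicked in a few lines of Python. This is only a sketch of the idea, not a LISP interpreter: S-expressions are represented as nested tuples, variable atoms as strings looked up in a dictionary `env`, and the table `FUNCTIONS` is a hypothetical stand-in for LISP's built-in functions.

```python
import operator

# Hypothetical function table standing in for LISP built-ins.
FUNCTIONS = {"+": operator.add, "*": operator.mul, "<": operator.lt}


def evaluate(expr, env):
    """Evaluate a nested-tuple S-expression in the variable environment env."""
    if isinstance(expr, tuple):
        func = FUNCTIONS[expr[0]]                       # first element is the function
        args = [evaluate(arg, env) for arg in expr[1:]]  # evaluate the remaining elements
        return func(*args)
    if isinstance(expr, str):
        return env[expr]                                # variable atom: current value
    return expr                                         # constant atom: itself


example = ("<", ("+", 1, ("*", 2, 3)), 4)
result = evaluate(example, {})   # 7 < 4, i.e. False
```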

Symbolic Expressions as Trees

In order for the programs we are evolving as trees to be understood by a computer, we need to express them in a language that the computer can interpret. There is a simple transformation from the tree structure to S-expression as follows:

If the root is a terminal
    return the atom equivalent
else
    create a list with first element equivalent to the root
    for each subtree subtended by child nodes, repeat process
    append.

Example Consider the tree depicted below

Figure 3.4:Example Tree

Root node "+" is not a terminal
create list (+)
    first child "*" is not a terminal
    create list (*)
        first child "a" is a terminal
        append (* a)
        second child "b" is a terminal
        append (* a b)
    append (+ (* a b))
    second child "Q" is not a terminal
    create list (Q)
        first child "*" is not a terminal
        create list (*)
            first child "c" is a terminal
            append (* c)
            second child "d" is a terminal
            append (* c d)
        append (Q (* c d))
    append (+ (* a b) (Q (* c d)))

So the tree above has S-expression (+ (* a b) (Q (* c d))),where Q represents

a function which takes only one argument.
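The transformation from tree to S-expression can be written recursively. The sketch below is illustrative only and assumes the tree is stored as a nested list [root, child, ...] with terminals as plain values; the function name `to_sexpr` is hypothetical.

```python
def to_sexpr(tree):
    """Convert a nested-list tree into its S-expression string."""
    if not isinstance(tree, list):
        return str(tree)                      # terminal: return the atom equivalent
    parts = [str(tree[0])]                    # list whose first element is the root
    parts += [to_sexpr(child) for child in tree[1:]]  # repeat for each child subtree
    return "(" + " ".join(parts) + ")"


example_tree = ["+", ["*", "a", "b"], ["Q", ["*", "c", "d"]]]
sexpr = to_sexpr(example_tree)
```

Applied to the example tree, this produces the same S-expression as the step-by-step trace.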

3.3 The Power of Genetic Algorithms

As mentioned before, the power of genetic algorithms was initially formalised by John Holland in the form of schema analysis. In this section we will cover briefly the notion of schema and how the Schema Theorem shows how improvements in the evolution arise. We will then introduce Formae as alternative ideas to schema and how forma analysis yields similar improvement results. Finally, we will touch on some ideas of convergence of evolutionary algorithms from a topological perspective.

3.3.1 The Schema Theorem

Definition Let $G$ be a representation space in which individuals are strings of some fixed length. A Schema, $H$, is a template which defines a subset of $G$ based on its members sharing genes at specific positions. The defining length of a schema, $\delta(H)$, is the difference in index between the first and last specified position. The order of a schema, $o(H)$, is the number of specified positions.

For example, if $G$ is the collection of binary strings of length 6, then 1*0**0 (where * marks an unspecified position) is the schema consisting of all elements of $G$ with 1 in the first position and zero in the third and sixth positions. It has order 3 and defining length $6 - 1 = 5$.
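The two quantities in the definition are easy to compute from a schema string. The sketch below is illustrative and assumes a schema is written with "*" as the wildcard for unspecified positions, so the example schema with 1 in the first position and 0 in the third and sixth is written "1*0**0"; the function names are hypothetical.

```python
def order(schema, wildcard="*"):
    """Number of specified (non-wildcard) positions."""
    return sum(1 for gene in schema if gene != wildcard)


def defining_length(schema, wildcard="*"):
    """Difference in index between the first and last specified positions."""
    fixed = [i for i, gene in enumerate(schema) if gene != wildcard]
    return fixed[-1] - fixed[0] if fixed else 0


def matches(schema, string, wildcard="*"):
    """True if string is a member of the subset defined by schema."""
    return len(schema) == len(string) and all(
        s == wildcard or s == c for s, c in zip(schema, string)
    )
```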

Theorem 3.3.1. The Schema Theorem:

Let $N_H(t)$ be the number of members of a population displaying schema $H$ at generation (iteration) $t$. Let $f_H(t)$ be the average fitness of those displaying $H$ at generation $t$ and $\bar{f}(t)$ be the average fitness in the entire population. Then

$$E[N_H(t+1)] \geq N_H(t) \, \frac{f_H(t)}{\bar{f}(t)} \left(1 - \sum_{\omega \in \Omega} d_\omega\right)$$

where $d_\omega$ represents the likelihood of disruption of the schema $H$ by the evolutionary operator $\omega$, for each operator in the set $\Omega$ being applied in the evolution.


The Schema Theorem in its usual form requires that selection be fitness proportionate, and also assumes certain things about the operators in use, such as transmission (cf. [5]) of genes by recombination operators. For many recombination operators, such as the commonly used one-point crossover (cf. [3]), the likelihood of disruption is proportional to the defining length of the schema, and similarly the likelihood of disruption is generally proportional to the order of the schema. Thus short schemata with low order which are highly fit have the greatest likelihood of proliferation.

In the case that tournament selection is used (in which pairs (or groups) of individuals are selected to "compete" with one another directly and those of higher fitness are given some predefined probability of victory; cf. section 4.1.1 later), more knowledge about the distribution of fitness is needed for an analytic result. With only knowledge of the average fitness we can argue the absurd case that all the fitness among members of a schema $H$ is attributable to a single individual, and so if that schema is of any considerable size, the likelihood of a randomly chosen member of $H$ "beating" some other random individual not in $H$ is approximately $(1 - \beta)$, where $\beta$ is the competition bias. We can, however, see how if the average fitness of members of a certain schema, $H$, is above the average in the overall population and the distribution is not too skewed, then that schema will have a tendency to proliferate in future generations (as described in the schema theorem). With more thought we realise that in any case a distribution of fitness among $H$ members which is not too skewed is far more desirable, since we then believe more strongly that above average fitness arises from the schema rather than the high fitness being incidental.

It can easily be shown that if $p_k$ is the $k$th percentile of fitnesses among members of $P \setminus H$, then $H$ tends to proliferate if the proportion of $H$ members with fitness above $p_k$ is sufficiently above $50/k$. The problem with this result is that it is far less precise than the schema theorem, in that the inequality will generally be quite slack. Furthermore, this result is meaningless for quantiles below the median, since $50/k > 1$ for $k < 50$.

Holland (among others) describes the short length, highly fit schemata as "building blocks", the combinations of which are intended to produce highly effective individuals within the search space. This is well described by David Goldberg, another proponent of genetic algorithms:

"Short, low order, and highly fit schemata are sampled, recombined [crossed over], and resampled to form strings of potentially higher fitness. In a way by working with these particular schemata [the building blocks], we have reduced the complexity of our problem; instead of building high-performance strings by trying every conceivable combination, we construct better and better strings from the best partial solutions of past samplings."

Goldberg 1989


3.3.2 Forma Analysis

It has been shown that the choice of describing subsets of the representation space by schema (with regards to the improvement shown by the schema theorem) is essentially arbitrary, and in fact a very similar result is provable where $H$ instead describes any subset of the representation space, provided the disruption coefficients are tractable for the chosen subset ([9]). The importance of this result is that, where in the case of schemata there may be no recognisable similarities within the solution space between subscribing members, here the designer/implementer of the algorithm is able both to identify, via the progress of the algorithm, regions of the search space which might be fruitful and to prescribe subsets to be exploited based on prior knowledge of the problem at hand.

Forma analysis is concerned with defining a collection of equivalence relations on the search space, the equivalence classes of which give rise to genetic representations of members of the search space.

Definition An equivalence relation $\sim$ on a set $X$ is a relation which is
1. reflexive: $x \sim x \ \forall x \in X$
2. symmetric: $x \sim y \Rightarrow y \sim x \ \forall x, y \in X$
3. transitive: $x \sim y$ and $y \sim z \Rightarrow x \sim z \ \forall x, y, z \in X$

The canonical equivalence relation is the comparison "=", since trivially equality satisfies reflexivity ($x = x$), symmetry (if $x = y$ then $y = x$) and transitivity (if $x = y$ and $y = z$ then $x = z$).

If $\sim$ is an equivalence relation on the set $X$, then for $x \in X$ we define the equivalence class of $x$ with respect to $\sim$ as

$[x]_\sim = \{y \in X \mid x \sim y\}$

The intersection of two equivalence relations, $\sim_{1,2} = \sim_1 \cap \sim_2$, is defined by $x \sim_{1,2} y$ iff $x \sim_1 y$ and $x \sim_2 y$.

Given a collection of equivalence relations $E$ we can use this to define the span of $E$; $S(E) := \{\bigcap_{\sim \in I} \sim \mid I \in \mathcal{P}(E)\}$, where $\mathcal{P}(E)$ denotes the power set of $E$. If we consider equivalence relations over the search space $S$ then $E$ induces a representation space in which the representation of an $s \in S$ is given by the vector of equivalence classes to which $s$ belongs, with respect to the elements of $E$.

Radcliffe refers to the equivalence classes created by elements of $S(E)$ as formae. Ideally one designs $E$ in a way that $S(E)$ separates the points of $S$; i.e. $\forall s \in S: [s]_{\bigcap_{\sim \in E} \sim} = \{s\}$, or equivalently $\forall s, t \in S$ with $s \neq t$: $\exists \sim \in E$ s.t. $s \not\sim t$.
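A toy illustration may help fix these ideas. The sketch below is hypothetical and not drawn from this work: the search space is the integers 0 to 11, and each equivalence relation is given by a key function, with two points equivalent when their keys agree. The representation of a point is its vector of keys, and the collection separates the points when no two points share a representation.

```python
# Hypothetical equivalence relations on a toy search space:
# s ~ t under a relation iff its key function agrees on s and t.
RELATIONS = [lambda s: s % 3, lambda s: s % 4]


def representation(s, relations):
    """Vector of equivalence classes (identified by key) to which s belongs."""
    return tuple(key(s) for key in relations)


def forma(s, space, relations):
    """Members of space equivalent to s under every relation given."""
    target = representation(s, relations)
    return [t for t in space if representation(t, relations) == target]


def separates(space, relations):
    """True if no two distinct points share a representation."""
    reps = [representation(s, relations) for s in space]
    return len(set(reps)) == len(reps)
```

Here residues modulo 3 and modulo 4 together separate the points 0 to 11, while either relation alone defines formae containing several points.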

By allowing the freedom to choose the elements of $E$, with some restrictions, the designer of the evolutionary algorithm is able to predefine regions of the search space which are believed to potentially possess useful solutions. This definition also allows for improved interpretation of results, as the equivalence relations used are generally more meaningful than schema at a higher level.

In the context of genetic programming, where the variable size and shape of individuals in the representation space precludes the use of schema analysis, forma analysis potentially allows for a better understanding of a similar progress result to the schema theorem. In particular, we may wish to consider a forma, say $F$, members of which contain a specific subtree. We would like to analyse its tendency to exist and grow in future generations. The difficulty arises in understanding the disruption rates, as additional knowledge about the members of this forma is required, such as the maximum size of the members of $F$ (or at least an upper bound thereof). In some cases one might wish to impose a maximum size on the programs being evolved for the sake of computational efficiency, in which case the maximum size of members of $F$ at least has an upper bound, and so pessimistic disruption rates with respect to the genetic operators defined previously can easily be determined.

3.3.3 Convergence of Genetic Algorithms

Even though genetic algorithms are widely used in the field of optimisation, in general the problems to which they are applied are not actually solved to optimum. Instead the algorithms are often used in highly unstable and large spaces as partial search algorithms, in which we rely on progress suggested by analogues of the schema theorem to find solutions which are "better" than those achievable with more analytically based algorithms.

In spite of their often being used for improvement rather than optimisation, it is important to understand their convergence properties (should they exist), in particular when they can guarantee the discovery of an (at least local) optimum within the search space.

Holland shows in ([3]) that if crossover is the only genetic operator employed then the evolutionary process, described as a stochastic process, has a stationary distribution; however, we seek a more general result.

Rudolph ([10]) also uses a stochastic process representation to show that in the presence of mutation, genetic algorithms like those which Holland describes will never converge to the global optimum. However, he also shows that a modification of the algorithm in which the best solution within the population is always maintained will converge to the global optimum.


Chapter 4

Application of Genetic

Algorithms to Uplift Modelling

and the Hillstrom Challenge

In this section we will show how genetic algorithms can be applied to the problem of uplift modelling. We will then apply these algorithms to a specific problem, posed by Kevin Hillstrom in his MineThatData blog ([12]), for which uplift modelling is ideally suited, and compare our results with those from the winning entry.

4.1 Application of Genetic Algorithms to Uplift

Modelling

In order to implement a genetic algorithm, we need to describe firstly each element of the evolutionary process and secondly any adjustments and additional steps and methods used.

4.1.1 The Genetic Programming Approach

The Solution Space

The solution space $S$ will consist of all possible variables with values in $\langle X \cup \mathbb{Q}; (+, -, \times, /) \rangle$*, the set generated by $X$ (the collection of all values the basic descriptive variables of the problem can assume) and $\mathbb{Q}$ with the operators (+, -, *, /), capable of at most separating data points which are discernible by at least one basic descriptive variable. (In other words, if two data points are described by the same set of variables $x$, then every member of $S$ will assign the same value to them.)

*If variables take on values in $\mathbb{R} \setminus \mathbb{Q}$, we tend in practice to only consider their rational approximations, in which case $S$ can be described as all rational valued variables. However, as a concession to being concise regarding the sufficiency of the problem, we choose to be more explicit.


The Representation Space

Our intention is to generate composite variables (of the basic descriptive variables defined in the problem) for use in building uplift models. As we do not wish to restrict the level of complexity these variables entail, we will use genetic programming to evolve them as symbolic expressions. The terminal set $T$ will be the collection of basic variables of the problem as well as constant terms which relate the scale of the variables where applicable and those which divide the variables in an apparently useful way (e.g. measures of the "centre" (usually mean or median) of variables, or in the case of low order discrete variables, values which separate explicitly the different values). In the case where these variables are non quantitative, we convert them using collections of indicator variables into discrete numerical variables. The function set $F$ will consist of the basic binary numerical operators (+, -, *, /), with division safeguarded by the added identity $a/0 = 0 \ \forall a \in \mathbb{R}$, the relations which exploit the ordered structure of $\mathbb{R}$ $(=, \leq, <, \geq, >)$, accepting the convention of numerical binary logic variables True = 1 and False = 0, and the conditional operator (IF-THEN-ELSE).

It is easy to see that $(T, F)$ is closed. To show that it is sufficient, we can adopt the absurd argument that if a genetic program can manifest the constant 1 and the function set contains $+$, $-$ and $/$, then every rational number can occur. Explicitly, the expression (= a a) for any $a \in T$ has the value 1, and so we can generate every rational number and hence every number in the set generated by $\mathbb{Q} \cup X$.

The Evolutionary Operators

The evolutionary operators will contain recombination and mutation as described in section 3.2.4 and the selection operators described below.

Our selection operators, both for culling of the population and for choosing breeding pairs, will be based on so-called tournament selection, defined as:

$C_\beta: G \times G \times [0,1] \to G$, which returns the fitter of the two individuals if the final operand is less than the selection bias $\beta \in [0,1]$, otherwise it returns the less fit. In general the final operand is selected from a $U(0,1)$ distribution, and so this can be understood as a competition in which the fitter individual has a probability $\beta$ of winning.

The culling operator is then defined by:

$K_\beta: \mathcal{P}(G) \times \mathbb{Z}^2 \times [0,1] \to \mathcal{P}(G)$, where
$K_\beta(P, n, m, r) = P \setminus \{C_{1-\beta}(P[n], P[m], 1 - r)\}$;
that is, it removes the "loser" of the competition between the $n$th and $m$th members of a population (if $n$ is greater than the size of $P$ then we consider $n \bmod |P|$, and similarly for $m$).

The operator for selecting a breeding pair is:

$B_\beta: \mathcal{P}(G) \times \mathbb{Z}^4 \times [0,1]^2 \to G \times G$, where
$B_\beta(P, n, m, o, p, r, s) = (C_\beta(P[n], P[m], r), C_\beta(P[o], P[p], s))$;
that is, it returns the winners of two tournaments between members of $P$ indexed by the integer operands.
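The three selection operators can be sketched as follows. This is an illustrative Python sketch rather than the code used in this work: the population is a list, `fitness` is any function returning a number, `bias` plays the role of the selection bias, and the function names are hypothetical. The random operands are passed in explicitly, as in the operator definitions; in use they would be drawn at random.

```python
def tournament(g1, g2, u, fitness, bias):
    """Return the fitter individual if u < bias, otherwise the less fit."""
    fitter, weaker = (g1, g2) if fitness(g1) >= fitness(g2) else (g2, g1)
    return fitter if u < bias else weaker


def cull(population, n, m, r, fitness, bias):
    """Remove the loser of a tournament between the nth and mth members."""
    a = population[n % len(population)]
    b = population[m % len(population)]
    # Running the tournament with complementary operand and bias
    # returns the loser rather than the winner.
    loser = tournament(a, b, 1 - r, fitness, 1 - bias)
    survivors = list(population)
    survivors.remove(loser)
    return survivors


def breeding_pair(population, n, m, o, p, r, s, fitness, bias):
    """Return the winners of two independent tournaments."""
    k = len(population)
    return (
        tournament(population[n % k], population[m % k], r, fitness, bias),
        tournament(population[o % k], population[p % k], s, fitness, bias),
    )
```

For instance, with `population = [3, 8, 5, 1]`, fitness equal to the value itself and a bias of 0.7, `cull(population, 0, 1, 0.2, ...)` removes 3 because the fitter individual 8 wins the tournament.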

The Solution Map

The solution map $F$ takes each $g \in G$ and assigns it the variable $g(x)$, the output of $g$ for a given vector of basic descriptive variables $x$.

The Evaluation Function

The evaluation function $E$ is defined by $E(s) = q_0(s)$, where $q_0(s)$ is the "no negative-uplift" adjusted Qini coefficient from the ordering imposed on the data set by the variable $s$.

In addition to the usual structure of the evolutionary process, we include a dynamic adjustment to the evaluation function as follows:

At regular intervals we select subsets of the population (using various selection methods) to be passed into a significance based uplift-tree building algorithm. In the event of discovering a "better" model than has so far been found (we will formalise what we mean by better in the next section), we will adjust the fitness of the variables used in the model by a factor according to how "good" the model is. The intention is that by adjusting the fitness of variables which have proven useful in building models, we will extend their longevity, increase their likelihood of reproducing and also increase their likelihood of being selected for future models.

Finally, we will grant temporary elitism to the variable with the highest basic fitness level (as described by the static evaluation function) and those variables included in the "best" model found so far.

4.1.2 The Polynomial Approach:

The Solution Space

The solution space consists of all two dimensional polynomial transformations of the basic descriptive variables of degree at most 4 with zero constant term. As before, non quantitative variables will be redefined numerically.

The Representation Space

The representation space consists of all strings in $\mathbb{R}^{14}$. Note that $\sum_{i=0}^{4} (i + 1) - 1 = 14$, the number of two dimensional combinations of order less than or equal to 4 excluding the zero order constant (there are $i + 1$ monomials $x^a y^b$ of total order $a + b = i$).
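The count of 14 coefficients can be checked by enumerating the monomials directly. The sketch below is illustrative only: the ordering of the monomials within the string and the names `MONOMIALS` and `polynomial` are assumptions, not taken from this work.

```python
# Exponent pairs (a, b) of the monomials x^a * y^b with 1 <= a + b <= 4.
MONOMIALS = [(a, b) for a in range(5) for b in range(5) if 1 <= a + b <= 4]


def polynomial(coefficients, x, y):
    """Evaluate the two-variable polynomial whose coefficients are the given string."""
    return sum(c * x**a * y**b for c, (a, b) in zip(coefficients, MONOMIALS))


number_of_coefficients = len(MONOMIALS)  # 14
```

An individual in the representation space is then simply a list of 14 real numbers, one per monomial.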

The Evolutionary Operators

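Crossover will be defined as follows:

$C: G \times G \times [0,1]^{14} \to G$, where $C(g, h, x)_i = x_i g_i + (1 - x_i) h_i$. We select each $[0,1]$ operand from a $U(0,1)$ distribution, and so $C$ can be seen as selecting a point randomly between the two points in $G$ (more precisely, within the axis-aligned box they span, since each coordinate is interpolated independently).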


Mutation applies the following function to each element of an individual in $G$ independently:

$M_\mu: \mathbb{R} \times [0,1] \times \mathbb{R} \to \mathbb{R}$, where $M_\mu(x, \rho, r) = x + r$ if $\rho$ is less than the mutation rate $\mu \in [0,1]$, and $M_\mu(x, \rho, r) = x$ otherwise. We choose $r$ from a $N(0,1)$ distribution, and $\rho$ from a $U(0,1)$ distribution.
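The crossover and mutation operators for coefficient strings can be sketched as below. This is an illustrative Python sketch under stated assumptions: individuals are lists of floats, and the parameter names (`x` for the vector of uniform weights, `rho` for the uniform draw compared against the mutation rate `rate`, `r` for the normal perturbation) are labels chosen here for illustration.

```python
import random


def crossover(g, h, x):
    """Coordinate-wise blend: child_i = x_i * g_i + (1 - x_i) * h_i."""
    return [xi * gi + (1 - xi) * hi for gi, hi, xi in zip(g, h, x)]


def mutate_gene(x, rho, r, rate):
    """Add the perturbation r to x if rho falls below the mutation rate."""
    return x + r if rho < rate else x


def mutate(g, rate, rng):
    """Mutate each element independently, with rho ~ U(0,1) and r ~ N(0,1)."""
    return [mutate_gene(x, rng.random(), rng.gauss(0, 1), rate) for x in g]


def random_crossover(g, h, rng):
    """Crossover with each weight drawn from U(0,1)."""
    return crossover(g, h, [rng.random() for _ in g])
```

Because each weight lies in [0, 1], every coordinate of the child lies between the corresponding coordinates of its parents.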

Selection operators will be as in the Genetic Programming approach above.

The Solution Map

The solution map $F$ assigns the elements of $g \in G$ as coefficients in a polynomial in $S$.

The Evaluation Function

The evaluation function will be as in the Genetic Programming approach

above.

As before we include a few adjustments to the standard evolutionary process. Because the polynomials being built are two dimensional, we simultaneously evolve $\binom{N}{2}$ separate populations, where $N$ is the number of basic variables in the problem.

At regular intervals we combine the populations and select a subset probabilistically (according to fitness) to be passed into a significance based uplift-tree algorithm. As before, whenever a new "best" model is found, the fitnesses of the variables used are adjusted.

Once again we grant temporary elitism to the variables with the highest basic fitness level in each population, as well as those included in the "best" model so far.

Initial coefficients were chosen randomly between $-10$ and $10$. Because the models are evaluated based on an ordering, the scale of coefficients is arbitrary and so we could have chosen any range centred at 0.

4.2 The Hillstrom Challenge and Results

In March 2008 Kevin Hillstrom made available, through his blog, MineThatData, a dataset describing two email campaigns and a control group, and issued a challenge to analyse the data with respect to a set of questions. In this section we will cover briefly the nature of the data set, the questions posed and a summary of the results from the winning submission ([8]). We will then discuss the results obtained using our genetic algorithm, its shortcomings and where it was able to produce improved results on those published.

4.2.1 The Data and Questions

The data represent 64,000 entries each describing a customer.A third of the

customers were chosen randomly to receive an email called the Men's Email,a


second randomly chosen third to receive the Women's Email,and the remaining

customers served as a control group,receiving neither email.

The individuals were described by three outcome (dependent) variables:

- Visit:A binary variable indicating whether the customer visited the site within

a two week outcome period.

- Conversion: A binary variable indicating whether or not the customer purchased at the site during that period. Obviously Visit being false implies Conversion is false.

- Spend:A real valued variable indicating the amount spent during the outcome

period.Obviously spend = 0 whenever Conversion is false.

They were also described by 8 independent variables (referred to previously as basic descriptive variables):
- Recency: An integer valued variable indicating the number of months since each customer's most recent purchase prior to the outcome period.
- History Segment: A categorical variable dividing the population according to disjoint ranges of money spent in the preceding year.
- History: The actual monetary amount spent by each customer in the preceding year.
- Mens: A binary variable indicating if the customer had purchased in the Men's department in the last year.
- Womens: As for Mens, but regarding the Women's department.
- Zip Code: Classifies customers as either Urban, Suburban or Rural.
- Newbie: A binary variable indicating if the customer made their first purchase within the previous year.
- Channel: Describes the channels through which the customer bought previously (either by Phone, Internet, or both).

Finally, there is a variable indicating which customers were treated with the Mens-Email, which with the Womens-Email, and which formed part of the control group.

The questions posed in the challenge were as follows:

1. Which e-mail campaign performed the best, the Mens version, or the Womens version?
2. How much incremental sales per customer did the Mens version of the e-mail campaign drive? How much incremental sales per customer did the Womens version of the e-mail campaign drive?
3. If you could only send an e-mail campaign to the best 10,000 customers, which customers would receive the e-mail campaign? Why?
4. If you had to eliminate 10,000 customers from receiving an e-mail campaign, which customers would you suppress from the campaign? Why?
5. Did the Mens version of the e-mail campaign perform different than the Womens version of the e-mail campaign, across various customer segments?
6. Did the campaigns perform different when measured across different metrics, like Visitors, Conversion, and Total Spend?
7. Did you observe any anomalies, or odd findings?
8. Which audience would you target the Mens version to, and the Womens version to, given the results of the test? What data do you have to support your recommendation?


Questions 3 and 4 relate directly to uplift modelling, since uplift models seek to maximise incremental sales by means of imposing an order structure on a population according to their incremental spend behaviour as a result of some treatment. We will focus our analysis on developing reliable uplift models and in doing so implicitly answer these two questions. The remainder of the questions above refer more to measuring incremental sales than to maximising them, and so we defer in those cases to the results obtained by Radcliffe in his winning submission, summarised in the next section.

4.2.2 Summary of Results from Winning Submission

Radcliffe uses uplift modelling to tackle three different formulations of the problem posed by Hillstrom. He successfully analyses the effectiveness of the campaigns and identifies those customers for whom they were most (and least) effective. A brief summary of these analyses will be presented here.

It was found that both campaigns had a positive effect overall; however, in the case of the Women's Email, there appeared to be segments of the population for which the campaign served to decrease spend behaviour rather than increase it. The average spend (among those who purchased) increased dramatically with the Women's campaign, but decreased with the Men's. That the Men's campaign drove more incremental sales overall was attributed to its increasing the purchase rate sufficiently to compensate for this decrease.

The three formulations of the problem are driven by the three response variables present in Hillstrom's data: modelling incremental visit frequency, modelling incremental purchase frequency and finally modelling incremental spend volume. It was found that the Men's campaign outperformed the Women's on all three counts.

Table 4.1:Uplift summary

As can be seen in the above table, the Men's campaign outperformed the Women's with an increase in visit rate of 7.66% (over 4.52% for the Women's), in purchase frequency of 0.68% (over 0.31% for the Women's) and in average spend of 77c (over 42c for the Women's).

While these monetary amounts per head sound small, because of the low overall purchase rate the Men's campaign in fact more than doubled average spend.


A splitting of the population into random segments (of 10% each) showed that

these estimates are quite unstable,however,likely caused by the low frequency of

both visits and purchases.This poses a potential problem when identifying the

best (and worst) 10,000 candidates.

The final formulation (regarding increased spend behaviour) is the one most closely associated with campaign success in general (and implicitly includes the other two). It is the objective we will focus on in our evolutionary approach, and so will form the main focus henceforth. Details of the other two models can be found in ([8]).

The objective is to build a reliable uplift model and choose the 10,000 population members with the highest spend uplift and the 10,000 with the lowest (or most negative, should that be the case).

Two approaches were considered: one a direct continuous approach and the other based on creating a binary outcome variable 'spend over x', for some amount x. The continuous approach faces the difficulty of massive skewness in the spend variable, where close to 99% of people have zero spend. Tree models, however, are fairly well equipped to handle skewness in the variables and this method proved superior for the Women's campaign. The binary response model was preferable for the Men's.

The model for the Men's campaign is described as follows:

The model assigns 1 point for each of the following criteria:
- historic spend over $160.
- historic spend over $350.
- customer is a multi-channel user.
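The scoring rule above amounts to a simple sum of indicators. The following Python sketch is illustrative only; the function and argument names are hypothetical.

```python
def mens_model_score(historic_spend, multichannel):
    """One point for each criterion met: spend over 160, spend over 350, multi-channel."""
    return int(historic_spend > 160) + int(historic_spend > 350) + int(multichannel)
```

A customer with historic spend over $350 necessarily meets the first two criteria, so the score ranges from 0 to 3.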

The model splits the population into 4 segments according to this score, with the highest being those multi-channel users with historic spend over $350. The spend behaviour identified by this model is summarised below.

Table 4.2: Spend behaviour identified by Men's model

Figure 4.1:Spend uplift by score for Men's model

The $q_0$ for the model is 19.80% over the entire data set. Following is the Qini graph described by the model ordering.

Figure 4.2:Qini graph for Men's model

The model for the Women's campaign predicts the incremental spend directly; however, it is not as easy to describe. The following table allows for some interpretation by showing the average model score assigned to each bin of the variables used.


Table 4.3:Mean score by binned variables for Women's model

The table shows that the model favours especially those with high historical

spend (above $500),and to a lesser extent those with relatively low historical

spend,multichannel users and newbies.

The model has a $q_0$ value of 118.00% on the training set and 60.30% on the validation set. The figure below shows the average uplift by model score band.

Figure 4.3:Spend uplift by score band for Women's model

Selecting the best 10,000 customers to target naturally begins with those scoring 3 according to the Men's model. This amounts to 5,249 people, and estimating their uplift from the 3,498 not included in the Women's campaign we estimate the average spend uplift at $1.54.

Next are those with score 2 according to the Men's model.This includes more

people than are needed to select 10,000 and so we consider the segregation of this

segment:


Table 4.4: Spend uplift for different segments with score 2 by Men's model

We can see that Web users with historic spend between $160 and $350 provide the highest average uplift. This adds a further 4,684 people with estimated uplift of $1.61. The remaining 67 customers are then chosen randomly from the Phone users with historic spend between $160 and $350. The estimated average uplift overall for these 10,000 people is $1.58.

Next, the worst 10,000 customers were chosen as follows. Because the Men's campaign performed better, the worst 10,000 according to that model were selected. The bottom score (0) identifies 32,237 people with an average uplift estimated at 55c. Of those, trying to select the 10,000 with the lowest score according to the Women's model (estimated uplift of 44c on average, corresponding roughly to bands 0 to 4) results in 12,952 people. Estimating the uplift as a result of the Men's campaign among these by removing those who received the Women's email comes to just 5c. So the choice of 10,000 people to exclude from the campaign is to select 10,000 randomly from those with a score of 0 from the Men's model and a score < 5 from the Women's model.

From the Men's model these are people who

have historical spend less than $160

are not multichannel

and from the Women's model,indicated by Table 4.3

are not newbies

use the Phone channel.

If we consider this subpopulation exactly,we nd 8,023 people for whom it

was estimated that the Men's Email would depress their spend by an average of

just over 25c.

4.2.3 Results of the Genetic Algorithm Approach

Because of the high computational cost of running genetic algorithms it is expensive to generate statistically robust results; however, we believe those presented here are sufficient for drawing meaningful conclusions and for making relevant comparisons with the results discussed above.

For each technique employed, we did ten runs and include a summary of those as well as a detailed account of the best found in each case.

For both the Men's and Women's emailing campaigns we used the following

common parameters and methodology:

• Maximum population size: 100. Initial tests showed overwhelming homogeneity in the upper regions of the population (among the fittest individuals). This appeared to be due to large redundant subtrees in many population members, changes within which did not affect the individuals' performances, meaning that recombining these individuals often produced offspring essentially equivalent to them. To combat this we tried pruning methods to remove these subtrees, but these proved unnecessarily costly for the length of our runs and so were abandoned. Instead we excluded any individual whose basic fitness was equal to that of an existing population member. We recognise that this method limits the power of the algorithm by reducing the region of the search space that is accessible; for the purposes of this problem, however, it proved to be the most effective method in terms of computational cost. Population sizes tended to be between 60 and 80 members, with the majority well inside that range.

• We generated initial populations using the ramped-half-and-half method described in section 3.2.3 for depths ranging from 2 to 6.

• The terminal set consisted of the variables history (historic spend), recency (most recent purchase in months), newbie (indicator variable showing whether the customer's first purchase was within the last year), mens (indicator variable showing whether the customer bought in the men's department within the last year), womens (similar to mens for the women's department), binHist (a binned version of history with cuts at 50, 100, 200 and 400), NumZip (a numerical version of zip code: Rural = 0, Suburban = 1, Urban = 2) and NumChannel (a numerical version of channel: Web = 0, Phone = 1, Multichannel = 2). In addition we included the constant values 2, 6 and 200. The constant 2 separates Multichannel and Urban explicitly and allows us to obtain ascending low-order constants more rapidly than we would otherwise; 0 and 1 occur frequently through the use of numerical True and False and so were not included explicitly. The constant 6 is a measure of the centre of the recency variable and 200 is a measure of the centre of the history variable.

• Competitive bias for selecting breeding pairs = 0.7.

• Competitive bias for survival = 0.95.

• Mutation rate = 0.05.

• We terminated the process after 1500 iterations.
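The derived terminals in the list above can be sketched in Python as follows. This is a minimal illustration rather than the code used in this work: the function names, the integer bin labels 0 to 4 for binHist, and the treatment of a spend exactly equal to a cut point (placed in the higher bin) are all assumptions.

```python
from bisect import bisect_right

# Cut points for binHist as stated in the text.
HIST_CUTS = [50, 100, 200, 400]

# Numerical codings as stated in the text.
ZIP_CODES = {"Rural": 0, "Suburban": 1, "Urban": 2}
CHANNEL_CODES = {"Web": 0, "Phone": 1, "Multichannel": 2}


def bin_hist(history):
    """Map historic spend to a bin label 0..4 (the labels are an assumption)."""
    # bisect_right counts how many cut points are <= history, so a spend
    # exactly on a cut point falls into the higher bin (assumption).
    return bisect_right(HIST_CUTS, history)


def encode(history, zip_code, channel):
    """Return the (binHist, NumZip, NumChannel) terminals for one customer."""
    return bin_hist(history), ZIP_CODES[zip_code], CHANNEL_CODES[channel]
```

For example, under these assumptions a Multichannel customer in an Urban area with historic spend of 250 would be encoded as (3, 2, 2).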

The results for each method used are summarised in graphical form as follows:

The "Average" line indicates the average $q_0$ value over the entire data set for the best model found in each run. The dotted lines show this average adjusted by one standard deviation, to give an indication of a reasonable range to expect using this method. The "Maximum" and "Minimum" lines show the highest and lowest of these $q_0$ values. Finally, the "Lower Average" line shows the average $q_0$ of the worst split from the models selected from each run. This line is intended to give an indication of an extreme worst-case scenario using this method.
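The lines described above can be computed from per-run results as in the following sketch. The input layout (for each run, the best model's $q_0$ on the entire data set together with its $q_0$ on each validation split), the function name, the use of the sample standard deviation and the numbers in the usage lines are all assumptions made for illustration; they are not results from this work.

```python
from statistics import mean, stdev


def summarise_runs(full_q0, split_q0):
    """Summarise the best model of each run at one point in the evolution.

    full_q0  -- q0 of each run's best model on the entire data set
    split_q0 -- for each run, the q0 values of that model on the splits
    """
    avg = mean(full_q0)
    sd = stdev(full_q0)  # sample standard deviation across runs (assumption)
    return {
        "average": avg,
        "upper": avg + sd,  # dotted line above the average
        "lower": avg - sd,  # dotted line below the average
        "maximum": max(full_q0),
        "minimum": min(full_q0),
        # mean over runs of each selected model's worst split
        "lower_average": mean(min(splits) for splits in split_q0),
    }


# Hypothetical values for three runs, for illustration only.
summary = summarise_runs(
    [0.30, 0.32, 0.28],
    [[0.25, 0.33], [0.27, 0.35], [0.22, 0.31]],
)
```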

Method 1: Building Models Directly via Genetic Algorithms

Initially we used our genetic algorithms directly to build uplift models. Observing the effectiveness of models arising directly from the evolutionary process is important in understanding the relevance of evolutionary algorithms in the field of uplift modelling at a basic level.

Genetic Programming Approach

Mens

Figure 4.4: Men's Mailing. Building models directly via Genetic Programming

We can see that the $q_0$ values are a vast improvement over that for the model built deterministically and presented in section 4.2.2. Even on the worst splits, in the latter part of the evolution these exceed 19.80%. The average $q_0$ on the entire data set is approximately 31%.

The Symbolic Expression form of the model is difficult to interpret due to its complexity, and so instead we (as in section 4.2.2) defer to relationships with the basic variables independently. An explicit expression of the model can be found in the appendix.

The model ranges from 0 to 4 taking only integer values, and Table 4.5 below shows a summary of the spend behaviour associated with each score.

Table 4.5: Spend behaviour identified by Men's model

While a higher Qini value indicates that we are better able to separate the population according to uplift, the degree of improvement is not obvious from it. Table 4.5 shows this uplift in absolute terms, and so we can already compare the outcome of this model with the deterministic one. The model identifies two entire segments (amounting to 2,912 customers between the Men's Emailing group and the control set, and a full 4,375 in the entire data set) for whom the estimated uplift is significantly above any segment identified before.

Figure 4.5: Spend uplift by score for Men's model

The Qini graph below also shows this improvement over the deterministic model graphically.

Figure 4.6: Qini graph for Men's model

Table 4.6 below shows the average model score for ranges of the basic variables. We can see that the model particularly favours those with high historic spend (over $400), newbies, those who have shopped in the men's department within the previous year, those who haven't shopped in the women's department in the previous year, Multichannel users and those who live in urban areas.

Table 4.6: Mean model score by binned variables

Womens

Figure 4.7: Women's Mailing. Building models directly via Genetic Programming

While the winning submission to Hillstrom's challenge did not explicitly include a Qini coefficient for the model over the entire data set, a training/validation split showed $q_0$ values of 118% and 60.30% respectively. Though the value over the entire data set certainly need not be the average of these two numbers, experience from experimenting shows that it is a reasonable estimate for the sake of making comparisons. The Genetic Program evolution was able to produce individual models with comparable $q_0$ values (in the region of 88%), but they were discarded by the algorithm for having too high a variance over the validation splits. The resulting models, though not as powerful as the deterministic one, do appear to offer improved robustness. The best model found using this method had a $q_0$ of 67.47% and a lower estimate (average minus standard deviation over the validation splits) of 55.08%.

The model takes on integer values between 6 and 12 inclusive. Table 4.7 below shows that the model identified a large segment of the population for whom the uplift is strongly negative (-$0.93). We can see that uplift is not strictly increasing with model score, and simply reordering the scores would vastly improve the Qini coefficient. Reordering the binned segments in the table below improves this to a value of 81.17%. Simply reordering the values should not affect the robustness of the model; in fact combining segments, as we have done in a few cases here, should improve it.
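A minimal sketch of such a reordering is given below. It assumes the estimated uplift of each score segment is already available (for example from a table of uplift by score) and relabels the scores so that a higher label always corresponds to a higher estimated uplift. The function name and the figures in the usage lines are hypothetical and are not taken from Table 4.7.

```python
def reorder_scores(uplift_by_score):
    """Map each original score to its rank by estimated uplift.

    uplift_by_score -- dict from model score to estimated uplift
    Returns a dict from original score to new score (0 = lowest uplift).
    """
    ranked = sorted(uplift_by_score, key=lambda s: uplift_by_score[s])
    return {score: rank for rank, score in enumerate(ranked)}


# Hypothetical uplift estimates that are not monotone in the score.
uplift = {6: 0.10, 7: -0.50, 8: 0.40, 9: 0.25}
mapping = reorder_scores(uplift)

# Relabel some customers' original scores with the new ordering.
new_scores = [mapping[s] for s in [7, 6, 9, 8]]
```

Because the relabelling only permutes existing segments, customers who shared a score before still share one afterwards; only the order in which segments would be targeted changes.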

Table 4.7: Spend behaviour identified by Women's model

Figure 4.8: Spend uplift by score for Women's model

The Qini graph below clearly shows this bad ordering by the model, in that the gradient is not monotonically decreasing.

Figure 4.9: Qini graph for Women's model (left) and adjusted version (right)

Our interpretation of the model is similar to that offered for the deterministic model, in that it emphasises most strongly those with high historical spend and newbies, as well as multichannel users. We have included more comparisons than before and observe a strong emphasis on those who have shopped in the women's department previously and those who have not shopped in the men's department.

Table 4.8: Mean model score by binned variables

For both campaigns the algorithm was able to build useful models which compare well with those built using deterministic methods. In the case of the Women's campaign, the final model found a useful segmentation of the population but failed to order the segments correctly. It is perhaps an oversight not to have included in the process (at least in the evaluation function) a transformation which enforces increasing ordering by uplift, as models which produce even more useful segmentations might have been lost.

We now turn our attention to questions 3 and 4 of the Hillstrom challenge using these models. Again our focus is on the Men's campaign, as it was the more effective. We naturally choose those with score 3 first (4,375 individuals, for whom we estimate the uplift, from those not receiving the Women's Email, to be $2.38). Adding those with a score of 2 by the Men's model gives us an additional 7,406, giving us more than 10,000 overall. Choosing a random 5,625 from this group gives us a collection of 10,000 customers for whom we estimate the average uplift to be $1.79. This is an improvement of 21c over that found using the deterministic modelling methods.
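The selection rule used here (take whole score segments from the top, then fill the remaining places at random from the boundary segment) can be sketched as follows. The function name, the data layout and the fixed seed are assumptions made for illustration, and the 100 lower-scoring customers in the usage lines are hypothetical padding.

```python
import random


def select_top(scores, n, seed=0):
    """Pick n customer indices, highest model score first.

    Whole score segments are taken while they fit; the boundary segment
    is sampled uniformly at random to fill the remaining places.
    """
    rng = random.Random(seed)
    by_score = {}
    for idx, score in enumerate(scores):
        by_score.setdefault(score, []).append(idx)

    chosen = []
    for score in sorted(by_score, reverse=True):
        segment = by_score[score]
        remaining = n - len(chosen)
        if remaining == 0:
            break
        if len(segment) <= remaining:
            chosen.extend(segment)
        else:
            chosen.extend(rng.sample(segment, remaining))
    return chosen


# Segment sizes from the text: 4,375 customers with score 3 and 7,406
# with score 2; the 100 customers with score 1 are hypothetical.
scores = [3] * 4375 + [2] * 7406 + [1] * 100
picked = select_top(scores, 10_000)
```

With these segment sizes the sketch takes all 4,375 customers with score 3 and a random 5,625 of those with score 2, matching the counts quoted above.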

To find the worst 10,000 people to target, we begin with the worst segment identified by the Men's model. As this amounts to more than half the population, however, we need to consider the Women's model as well in order to separate out the worst group. Choosing the lowest segment according to the Women's model (score < 4) gives us 6,841 customers with an estimated average uplift (from the 4,512 who did not receive the Women's email) of 55c. The next lowest segment from the Women's model corresponds to a score between 2 and 3. This comes to 5,892 people in total, with estimated uplift caused by the Men's Email of 41c. Selecting the requisite 5,488 randomly to satisfy the 10,000 cut-off gives us a sample for whom the Men's Email decreased spending by an average of 47c. This compares very favourably to the sample found using the deterministic model, for whom the decrease in spend as a result of the Men's campaign was estimated at 25c on average.

Polynomial approach

As we restricted the polynomials both in dimension and order, we do not expect results as good as those from the Genetic Programming approach. We restricted them thus to indicate the usefulness of polynomials at a basic level for building uplift models. Because of their simple form the computational cost is far lower than that of genetic programming. If we did not restrict the order of these polynomials, they too would become increasingly complex and so the cost benefit would be reduced (and potentially lost completely). Later on we will combine the two-dimensional polynomials to increase the dimensionality of the models built, to better understand their usefulness.

Mens

The best polynomial model found for the Men's campaign is given by

$$0.27x + 2.84x^2 - 3.5x^3 + 0.6x^4 + 3.62y - 4.83xy + 0.09x^2y - 0.67x^3y - 2.21y^2 + 3.22xy^2 + 4.93x^2y^2 - 2.92y^3 - 3.95xy^3 - 2.09y^4$$

where x = binHist and y = NumZip. The $q_0$ value for this model is 29.58%, which is comparable with those built via genetic programming and an improvement over the deterministic model.

Womens

For the Women's campaign the best model is given by

$$2.2053351821x + 1.96775115108x^2 - 3.62381607147x^3 + 0.674810214017x^4$$

where here x = binHist. The $q_0$ coefficient is 61.29%. It should be noted that a polynomial which generated the same output over the data set arose in all 10 runs, sometimes appearing already in the initial population. While the model offers some merit with a fairly high Qini value, it appears that building continuous models of low dimension using discrete variables in this way might not be appropriate, as a random search would have performed at least as well in this instance.

In order to assess the usefulness of this method we will have to wait until we combine the polynomials to produce higher dimensional models, since we have observed that building low dimensional models of discrete variables in this way potentially does not perform better than random search.
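One way to see why this can happen is that a polynomial in a single discrete variable can produce only as many distinct scores as the variable has values, so many different coefficient vectors order the customers identically. The sketch below illustrates this. It assumes, hypothetically, that binHist takes the integer labels 0 to 4; the coefficient values are arbitrary and chosen for illustration only.

```python
def poly(coeffs, x):
    """Evaluate c1*x + c2*x^2 + ... (no constant term) at x."""
    return sum(c * x ** (k + 1) for k, c in enumerate(coeffs))


def induced_ordering(coeffs, values):
    """Return the values sorted by polynomial score, lowest first."""
    return sorted(values, key=lambda v: poly(coeffs, v))


BIN_LABELS = [0, 1, 2, 3, 4]  # assumed labels for binHist

# Two arbitrary quartics with different coefficients ...
a = [1.0, 0.5, 0.0, 0.0]
b = [3.0, 0.1, 0.2, 0.01]

# ... that rank the five bins identically, and so place every
# customer in the same order.
same = induced_ordering(a, BIN_LABELS) == induced_ordering(b, BIN_LABELS)

# Number of distinct scores the first model can assign: at most five.
n_scores = len({poly(a, v) for v in BIN_LABELS})
```

Since five bins admit only a small number of distinct orderings, it is plausible that a randomly generated initial population already contains a polynomial inducing the best of them.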

Method 2: Combining the Genetic Algorithms with Significance-Based Uplift Tree

We now consider a combined evolutionary algorithm which couples the genetic evolutions used in Method 1 with an algorithm for building significance-based uplift trees.

A warm-up period of 120 iterations was used, after which models were built every 20 iterations. We allow the initial warm-up period so that models are only built once more interesting variables should have begun to emerge. By allowing the population to evolve between each model building we hope to achieve more dynamism with lower computational cost.
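Under these settings the model building schedule can be written down explicitly. The sketch below assumes that the first model is built at iteration 120 itself and that iterations are counted from 1 to 1500; the text does not fix either convention, so the exact count is illustrative.

```python
WARM_UP = 120        # warm-up iterations before any model is built
BUILD_EVERY = 20     # iterations between successive model builds
N_ITERATIONS = 1500  # length of each run


def builds_model(iteration):
    """True if a model is built at this iteration (assumed convention)."""
    return iteration >= WARM_UP and (iteration - WARM_UP) % BUILD_EVERY == 0


build_iterations = [i for i in range(1, N_ITERATIONS + 1) if builds_model(i)]
```

Under this convention a run of 1500 iterations builds 70 models, compared with 1500 if a model were built at every iteration.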

Models were built over the entire data set and validated by randomly splitting the data in half 10 times and evaluating the Qini coefficient of the model on both halves of each split. The selection of one model over another was based on the average minus the standard deviation of these Qini values. By incorporating both the location (average) and spread (standard deviation) of these values we hope to obtain models which are both useful and robust in the face of variability.
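The validation criterion can be sketched as below. The Qini evaluation itself is passed in as a function, since its definition is given in section 2.3; the function names, the pooling of all 20 values into one average and one standard deviation, the use of the sample standard deviation and the fixed seed are assumptions.

```python
import random
from statistics import mean, stdev


def q_low(n_rows, qini_on, n_splits=10, seed=0):
    """Average minus standard deviation of Qini values over random halves.

    n_rows  -- number of records in the data set
    qini_on -- callable taking a list of row indices and returning the
               model's Qini coefficient on those rows (supplied by caller)
    """
    rng = random.Random(seed)
    rows = list(range(n_rows))
    half = n_rows // 2
    values = []
    for _ in range(n_splits):
        rng.shuffle(rows)
        # Evaluate on both halves of each split: 2 * n_splits values in all.
        values.append(qini_on(rows[:half]))
        values.append(qini_on(rows[half:]))
    return mean(values) - stdev(values)
```

A model whose Qini values vary widely between halves is penalised by the subtraction, so of two models with the same average the more stable one is preferred.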

Whenever a better model was found, the fitnesses of the individuals used in the model were adjusted by multiplying their current fitness by $(1 + Q_{low})$, where $Q_{low}$ represents the average minus standard deviation of Qini values discussed above for that model. Such adjustments were only made in the event of a strictly superior (by our evaluation metric) model being found. By incorporating the level of "goodness" of the model, variables useful in the better models will have their fitness adjusted by a greater factor. It could be argued that this increase is too large; however, as the general fitness of the population increases, it is necessary to improve longevity.
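A minimal sketch of this update rule follows. The names are hypothetical; fitness is held in a dictionary keyed by individual, and $Q_{low}$ is expressed as a fraction rather than a percentage, which is an assumption about units.

```python
def update_fitness(fitness, used, q_low_new, best_q_low):
    """Reward the individuals used in a strictly better model.

    fitness    -- dict from individual id to current fitness (modified in place)
    used       -- ids of the individuals appearing in the new model
    q_low_new  -- average minus standard deviation of Qini for the new model
    best_q_low -- the best such value seen so far
    Returns the (possibly updated) best value.
    """
    if q_low_new <= best_q_low:
        return best_q_low  # not strictly superior: no adjustment
    for ind in used:
        fitness[ind] *= 1 + q_low_new
    return q_low_new


fitness = {"a": 1.0, "b": 2.0, "c": 0.5}
best = update_fitness(fitness, ["a", "c"], 0.5, 0.25)  # better model: a and c rewarded
best = update_fitness(fitness, ["b"], 0.4, best)       # worse model: b left unchanged
```

Because the factor depends on the model's own $Q_{low}$, each successive improvement rewards its variables more strongly than the last.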

Passing the Entire Population into the Model Building Algorithm

At each 20th iteration we allow the model building algorithm to select from the entire population of individuals. The results are as follows:

Mens

Figure 4.10: Men's Mailing. Full population passed into tree building algorithm

This method shows a considerable improvement over Method 1 and an improvement over the results using the deterministic modelling techniques in the winning entry. The fairly narrow range of best-model Qini coefficients suggests that this method is reliable in finding good results, and the closeness of the lower average to the absolute average suggests that these models are fairly robust.

In spite of these encouraging points, we can see from the graph that little improvement is made beyond about iteration 900. When passing the entire population into the tree building algorithm, there will be a tendency to select the same individuals more often, since the trees are built using greedy algorithms; this is a likely cause of the stagnation.

Womens

Figure 4.11: Women's Mailing. Full population passed into tree building algorithm

In the case of the Women's campaign, unlike the Men's above, this method produced quite varied results, as we see from the final values for the different lines. That said, it still represents a significant improvement in terms of power over Method 1 and, especially in the case of the best model found through this method, presents a more stable model than was found via deterministic methods.

The potential for the process to become bogged down by having the model building algorithm select the same variables repeatedly is real, and the graph for the Men's campaign (Figure 4.10) indicates that this might already have affected progress. It is useful then to explore the option of only passing parts of the population into the algorithm.

Passing the Fittest 10 Population Members Into the Model Building Algorithm

Instead of passing the entire population into the model building algorithm as above, we now restrict this selection to only the ten fittest members. By restricting the selection thus, we hope to "force" the algorithm to consider newly generated highly fit individuals. Immediately, however, we have a concern that our fitness adjustments based on finding better models will, at least for a period, outweigh the effects of improved fitness through evolution, and as a result the 10 individuals with the highest (adjusted) fitness might remain unchanged for extended periods (until "evolution catches up"). This could result in a step-like process including extended periods of non-improvement and essentially wasted computation time.

Mens

Figure 4.12: Men's Mailing. Fittest 10 passed into tree building algorithm

The lack of improvement after about iteration 400 could be a result of the concerns raised above. For runs of this length (1500 iterations) this method appears inferior to the previous one in this instance. Furthermore, the higher variability of $q_0$ values on different splits, indicated by the lower average line, does not instil confidence that (even given more time) this will find an adequately robust model.

Womens

Figure 4.13: Women's Mailing. Fittest 10 passed into tree building algorithm

In the case of the Women's campaign there is very little to choose between this and the previous method in terms of outcome. The results here are marginally better and, factoring in the lower computational cost (associated with passing fewer variables to the model building algorithm), it appears preferable. However, the inconsistent results in the Men's case give us pause in recommending it outright.

While we are able to increase the likelihood of newly spawned individuals who are highly fit being included in the models, in doing so we exclude a large part of the population almost absolutely (except in the rare event that for a period significantly more tournaments are lost by fitter individuals). Further, while we expect variables with a high basic fitness to be useful in building models, it is often the case that combinations of "weaker" individuals will vastly outperform them when combined in an uplift tree, and this method almost entirely excludes this possibility.

Passing the Fittest 10 of a Random Sample of Size 20 Into the Model Building Algorithm

In order to improve the dynamism of the process we consider generating random samples of size 20 from the population and selecting the fittest 10 of those to be passed into the model building algorithm. We hope that by doing this a greater variety of individuals will be considered by the model building algorithm while maintaining an emphasis on fitter individuals.
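This selection step can be sketched as follows. The function and parameter names are hypothetical, and the population is represented simply as a dictionary from individual id to (adjusted) fitness, which is an assumption about the data layout.

```python
import random


def fittest_of_sample(fitness, sample_size=20, keep=10, rng=random):
    """Draw a uniform random sample and keep its fittest members.

    fitness -- dict mapping individual id to (adjusted) fitness
    rng     -- source of randomness (the random module or a random.Random)
    """
    ids = list(fitness)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    sample.sort(key=lambda ind: fitness[ind], reverse=True)
    return sample[:keep]
```

Setting sample_size to the population size recovers the rule of always passing in the ten fittest members, so the sample size controls how strongly the selection is biased towards the top of the population.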

Mens

Figure 4.14: Men's Mailing. Fittest 10 of a Random Sample of size 20 passed into tree building algorithm

The graph shows continued improvement throughout the run, unlike in the previous method; ultimately, however, there is only marginal improvement, and the results are still worse than those from passing the entire population into the model building algorithm. The decrease in computational cost when passing so many fewer variables into the algorithm is certainly not negligible, however, and this method does perhaps offer merit.

Womens

Figure 4.15: Women's Mailing. Fittest 10 of a Random Sample of size 20 passed into tree building algorithm

For the Women's campaign this method shows a vast improvement over those considered so far. The expected range is fairly narrow, suggesting reliability, and the lower average is also reasonably close to it, indicating robustness in the models built.

Ideally we would have a single method which provides a good model for both campaigns, and so we persevere with our ideas for improving the dynamism of the process.

Passing a Probabilistic Sample of Sizes 10 and 15 Into the Model Building Algorithm

An alternative approach to increasing the diversity of individuals passed into the model building algorithm is to select subsets of the population using a fitness-proportionate selection method. The benefit of this over the last method is that when the fitness levels in the population are fairly close there is a greater emphasis on diversity of selection, but as the fitnesses become more varied (as the fitness adjustments become greater) there is more of an emphasis on fitter individuals. Later on, these fitter individuals represent those which have already proven useful in building models, as well as newly introduced individuals with high basic fitness. This shift in emphasis will hopefully prove useful for the progress of the algorithm.
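A fitness-proportionate sample of this kind can be sketched as follows. The text does not say whether individuals may be drawn more than once; this sketch assumes sampling without replacement, so that the model building algorithm receives distinct individuals. The function name and data layout are likewise assumptions.

```python
import random


def proportionate_sample(fitness, k, rng=random):
    """Draw k distinct individuals with probability proportional to fitness.

    fitness -- dict mapping individual id to a positive (adjusted) fitness
    rng     -- source of randomness (the random module or a random.Random)
    Sampling is without replacement (an assumption): after each draw the
    chosen individual is removed, which renormalises the remaining weights.
    """
    pool = dict(fitness)
    chosen = []
    for _ in range(min(k, len(pool))):
        ids = list(pool)
        weights = [pool[i] for i in ids]
        pick = rng.choices(ids, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen
```

When all fitnesses are nearly equal the weights are nearly uniform and the draw behaves like a plain random sample; as the adjusted fitnesses spread out, the same code increasingly favours the fitter individuals, which is the shift in emphasis described above.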

Mens

Figure 4.16: Men's Mailing. Probabilistic Sample of size 10 passed into tree building algorithm

Figure 4.17: Men's Mailing. Probabilistic Sample of size 15 passed into tree building algorithm

Womens

Figure 4.18: Women's Mailing. Probabilistic Sample of size 10 passed into tree building algorithm

Figure 4.19: Women's Mailing. Probabilistic Sample of size 15 passed into tree building algorithm

For both Men's and Women's campaigns this method has generated improved results. In the case of the Men's this manifested as a shift upwards while maintaining a similar level of reliability, whereas in the Women's the narrowing of the ranges suggests improved reliability of the models found. Allowing the model building algorithm slightly more selection (15 over 10 variables) appears to produce better results. As higher numbers are passed in, however, there will be a tendency for the algorithm to resemble the method of passing the entire popu-
