Lesson 4: Customer Attrition Analysis (Teacher)

randombroadAI and Robotics

Oct 15, 2013 (3 years and 11 months ago)

90 views

Alberta Ingenuity & CMASTE

randombroad_e406384a
-
d6cb
-
4d59
-
8da4
-
631d1b99be90.doc

Centre for Machine Learning

1
/
8

Lesson 4: Customer Attrition

Analysis

(Teacher
s’ Resource
)


Purpose
: A
nalysis

of customer attrition (how long customers stay with a company and
why they leave)

is very important to subscription
-
based businesses such as cable TV,
insurance, credit cards,
magazine subscription, and more recently,
cell phone programs
and residential
utilitie
s. For a large company
millions of database records of customers
must be analyzed. This is where the Machine Learning strategy of data mining takes
over.
Data mining u
ses computer
-
based queries to find patterns and relevant information
in a large data set.


Problem
: Computer programs use a set of algorithms to analyze the customer data for
hazard probability (the chance that subscriber who has survived a certain amount

of time
is going to stop, cancel, or expire before the next unit of time) and also for survival
probability,

the chance that a random customer will still be with the company after a
specific amount of time
.

We want to see if there is a mathematical patte
rn in this analysis
t
hat we can use to make inferences and projections about customer attrition
.


Hypothesis
:
The data mining technique will provide rapid feedback about customer
behavior and a way to quantify customer loyalty.


Design
: The assumption is

that time is discrete, in units of days, months, etc., whereas
traditional statistical analysis treats time as continuous.
The hazards in a typical
subscription business are shown in
Figure 1

as: 1) customers who don’t start

2) customers who start bu
t never pay, which occurs at about 60 days and 3) customers
who stop when the promotion ends, which occurs at about 90 days.




(Figure 1)

W
e can define hazard probability as
.


See Appendix for
the
data mining program
used
to calculate
these
probabilities
.

Alberta Ingenuity & CMASTE

randombroad_e406384a
-
d6cb
-
4d59
-
8da4
-
631d1b99be90.doc

Centre for Machine Learning

2
/
8



The gradual decline in hazards over time implies that the longer a customer stays with a
company, the less likely
they are to leave.


Procedure
:

1) From Figure 1, estimate these hazard probabilities as percentages:

a)

probability of a customer not starting the subscription ________

Answer: P ~ 4%

b)

probability of a customer being lost due to non
-
payment _________

Answ
er: P ~ 10%

c)

probability of a customer being lost at the end of the promotion _________

Answer: P ~ 6%


This leads us to the concept of survivor probability
or retention rate,
which is a more
holistic picture of the likelihood tha
t a random customer will
stay with a company to a
certain

point in time. Rather than using a probability formula here, we will use graph and
data analysis

to help quantify survivor probability.



Figure 2

shows three examples of survival curves for credit card users for a particu
lar
company: the top curve is for customers who start as
card holders,
paying customers who

are charged automatically every month,

while

the bottom curve is for customers
who did
not have a credit card previously and
who are billed monthly and pay by cheq
ue. The
middle cur
ve is the “average” of the other

two curves.



(Figure 2)


We notice that the curves have the basic shape of an exponential
decay curve and we can
study them from a Math 30 P perspective with equation type
or from a Math 31
point of view with equation type
.



Alberta Ingenuity & CMASTE

randombroad_e406384a
-
d6cb
-
4d59
-
8da4
-
631d1b99be90.doc

Centre for Machine Learning

3
/
8



2)

a)
U
sing the tables of data read from the graph of

the
upper

survivor
(
retention)
curve
, generate a regression equation

of the form
.

________________


t
(months)

0

1

2

3

4

5

6

7

8

9

10

11

12

Survivor

Prob.
(%)

97

92

88

77

68

62

57

54

50

48

45

43

42


Answer:


b)

Use your equation

from part a) to calculate
:

i) the retention

probability after 18 months

ii) the number of m
onths or
days until the retention

probability was 40%
(graphically)


Answer:


Answer:




c) (* for Math 31 studen
ts) use an algebraic process to rewrite your regression

equation in the form

Answer:



Alberta Ingenuity & CMASTE

randombroad_e406384a
-
d6cb
-
4d59
-
8da4
-
631d1b99be90.doc

Centre for Machine Learning

4
/
8

d) (* for Math 31 students) Find the derivative
for your equation in part c)
and specifically th
e value of the derivative at
t

= 5 months. Explain the
meaning of that number with respect to the data.

Answer:




e
) Using the tables of data read from the graph of

the
lower

survivor curve,
generate a regression equation of the f
orm
.

___________________



t
(months)

0

1

2

3

4

5

6

7

8

9

10

11

12

Survivor

Prob.
(%)

97

74

59

30

21

19

17

15

14

13

12

11

10


Answer:



f) Use your equation from part e
) to calculate
:

i) the
retention
pr
obability after 18 months

ii) the number of m
onths or
days until the retention

probability was 50%
(graphically)

Answer:


Answer:






Alberta Ingenuity & CMASTE

randombroad_e406384a
-
d6cb
-
4d59
-
8da4
-
631d1b99be90.doc

Centre for Machine Learning

5
/
8


g
) (* for Math 31 students) use an algebraic process to rewrite
your
regression equation in the form

Answer:



h)

(* for Math 31 students) Find the derivative
for your equation in
part g) and specifically the value of the derivative at
t

= 5 months.
E
xplain the meaning of that number with respect to the data.


Answer:



2)

Did you notice the steep drop off in the lower graph at about 60 days; what
factor(s) do you think might

account for that?

Answer: This drop off would correspond

to non
-
payment of credit card bills for those
customers new to credit cards and having to pay by cheque.


3)

Why do you think the upper graph does not have the same drop
-
off at that time?

Answer: These customers had a credit card previously, so they are use
d to monthly
payments. As well, their payment is done by direct debit, which is more convenient
.










Alberta Ingenuity & CMASTE

randombroad_e406384a
-
d6cb
-
4d59
-
8da4
-
631d1b99be90.doc

Centre for Machine Learning

6
/
8

We can quantify the difference between the two groups in another way, using a common
industry measure called
customer half
-
life

or
median customer li
fetime
.

This is the
time when exactly half

or 50%

of the original customers would still be subscribers.

(
Math 30 P students are familiar with this term from work with radioactive decay
problems.
)

Figure 3

shows this for each of our two groups.




(Figure 3)


4)

From Figure 3 identify the customer half
-
life in days, for each group.

Answer: The half
-
life for the “credit card group” appears to be a
bout 245 days, and
the half
-
life for the “non
-
credit card group” appears to be about 65 days.


Evaluation
: The data mining techniques provide very useful information that can be
analyzed mathematically
,

lead
ing

to inferences about subscription customers
a
nd their
retention rates
that would aid in decision
-
making for the company.



Synthesis
: Data mining is very versatile and can be adapted for a wide variety of
different industries.







Alberta Ingenuity & CMASTE

randombroad_e406384a
-
d6cb
-
4d59
-
8da4
-
631d1b99be90.doc

Centre for Machine Learning

7
/
8

Appendix:

Data Mining program to calculate hazard probabilities:

Calculating Hazards in a Database

Let's take a closer look at how survival data mining works with a database
-
in this
case,
running with Oracle
. Assume that a database contains one row for each customer with the
following information:



Start_date



Stop_date
(NULL is not stopped)



Other interesting variables such as stop reason, channel, and so on.

How is this data used to calculate hazards?

Oracle extensions make the full calculation
possible. The first thing is to calculate the
time with company (
tenure
)

a
nd the stop flag:

SELECT ((case when stop_date is NULL then < today >

else stop_date end)
-

start_date) as tenure,

(case when stop_date is NULL then 0

else 1 end) as is_stopped

FROM customers

The next step is to
aggregate these fields by time with the comp
any or tenure
. This gives
the number of
customers with exactly each amount of time

and the num
ber that stopped
at a certain time (some customers with that tenure

will still be active):

SELECT ((case when stop_date is NULL then < today >

else stop_date end)

-

start_date) as tenure,

count(*) as pop_at_t,

sum(case when stop_date is NULL then 0

else 1 end) as num_stopped

FROM customers

GROUP BY ((case when stop_date is NULL then < today >

else stop_date end)
-

start_date)

Alberta Ingenuity & CMASTE

randombroad_e406384a
-
d6cb
-
4d59
-
8da4
-
631d1b99be90.doc

Centre for Machine Learning

8
/
8

At this point, you could continue the
calculation in a

spreadsheet. However, the
analytic
functions make it possible to calculate the total population at risk, and thus the hazard.
The total population is the sum of pop_at_t

for all times

greater than or equal to
t
. The
hazard is num_stopped d
ivided by this total. The following query does this calculation

SELECT tenure,

sum(pop_at_t) over

(order by tenure desc range unbounded preceding),

num_stopped /

(sum(pop_at_t) over

(order by tenure desc range unbounded preceding))

FROM < subquery >

GROUP

BY tenure

ORDER BY tenure

*I felt that it was worthwhile for teachers and/or students to see the type of program used
in a machine learning setting. It is not expected that the program be used or even
understood by teachers and/or students.


Sources

1)

http://www.intelligententerprise.com
, Data Mining and Hazard Survival, Linoff,
Gordon S., 2004

2)

http://www.cs.ualberta.ca/~greiner/C
-
466/SLIDES/syllabus.html