PAKDD 2006 Data Mining Competition


Write-Up

Participant Name: Nguyen Hoang Anh

Problem Summary

An Asian telco operator which has successfully launched a third-generation (3G) mobile telecommunications network would like to make use of existing customer usage and demographic data to identify which customers are likely to switch to using its 3G network.

An original sample dataset of 20,000 2G network customers and 4,000 3G network
customers has been provided with more than 200 data fields.


The target categorical variable is "Customer_Type" (2G/3G). A 3G customer is defined as a customer who has a 3G Subscriber Identity Module (SIM) card and is currently using a 3G-network-compatible mobile phone.

Three-quarters of the dataset (15K 2G, 3K 3G) has the target field available and is meant to be used for training/testing. The remaining portion (5K 2G, 1K 3G) is made available with the target field missing and is meant to be used for prediction.

The data mining task is a classification problem for which the objective is to accurately predict as many current 3G customers as possible (i.e. true positives) from the "holdout" sample provided.

Understanding of the problem:

As the problem statement says, the data mining task is a classification problem for which the objective is to accurately predict as many current 3G customers as possible.

This classification or prediction task is done by a model generated from the 18,000 customer records that are already classified.

The problem becomes easier when we know that, in the prediction data, there are 5,000 2G customers and 1,000 3G customers.

Therefore, we can tune the settings of the algorithm so that it produces the best-predicted 1,000 3G customers.

In real life, this prediction task can be used for marketing purposes. If the company knows which customers are likely to switch to 3G, it can devise better marketing strategies for these targeted customers. That is why it is better to classify a 2G customer into the 3G customer type than to classify a 3G customer into the 2G customer type.

Approaching the problem:

Support Vector Machines (SVMs) were used for this classification task. The following is the general idea of the algorithm.

1. Introduction to SVMs:

Support Vector Machines were developed by Vapnik in 1995 based on the Structural Risk Minimization principle from statistical learning theory.

The idea of structural risk minimization is to find a hypothesis h from a hypothesis space H for which one can guarantee the lowest probability of error Err(h) for a given set of training examples S:

(x_1, y_1), …, (x_n, y_n),   x_i ∈ R^N,   y_i ∈ {-1, +1}

For simplicity, let us assume that the training data can be plotted on a plane and can be separated by at least one hyperplane h'.






[Figure: linearly separable training examples with the optimal hyperplane, the margin δ, and the support vectors marked.]

This means that there is a weight vector w' and a threshold b', so that all positive training examples are on one side of the hyperplane while all negative training examples lie on the other side. This is equivalent to requiring y_i [w'^T · x_i + b'] > 0 for each training example (x_i, y_i). In other words, the equation of the hyperplane which does this separation is:

    w^T · x + b = 0

so,

    w^T · x_i + b ≥ 0  for y_i = +1
    w^T · x_i + b < 0  for y_i = -1

In general, there can be multiple hyperplanes that separate the training data without error. From these hyperplanes, the Support Vector Machine chooses the Optimal Hyperplane with the largest margin δ. This particular hyperplane h(x) is shown in the figure above. The margin δ is the distance from the hyperplane to the closest training examples. For each training set, there is only one hyperplane with maximum margin. The examples closest to the hyperplane are called Support Vectors; they lie at a distance of exactly δ.
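The separating condition and the margin can be checked numerically on a toy example. The 2-D points and the candidate hyperplane below are made up for illustration only:

```python
import math

# Toy data (made up): verify that a hyperplane w·x + b = 0 separates the
# examples, i.e. y_i (w·x_i + b) > 0 for every (x_i, y_i), and compute the
# margin δ = min_i |w·x_i + b| / ||w||.
points = [((2.0, 2.0), +1), ((3.0, 3.0), +1),
          ((0.0, 0.0), -1), ((1.0, 0.0), -1)]
w, b = (1.0, 1.0), -3.0  # candidate hyperplane: x1 + x2 - 3 = 0

def decision(x):
    return w[0] * x[0] + w[1] * x[1] + b

# every example lies on the correct side of the hyperplane
assert all(y * decision(x) > 0 for x, y in points)

margin = min(abs(decision(x)) / math.hypot(*w) for x, y in points)
print(round(margin, 4))  # -> 0.7071
```

The support vector here is the point (2, 2), which attains the minimum distance δ exactly.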


2. SVM Light:

Although there are many implementations of Support Vector Machines (SVMs) available, SVM Light, an implementation of SVMs in C, seems to be the most popular for its high precision rate. SVM Light has been used as the basic binary classifier for this classification task.

SVM Light can be downloaded from: http://svmlight.joachims.org/





Full technical details of algorithm(s) used

The training and testing data were provided in an Excel sheet with more than 250 data fields. Each of these fields was represented by a feature number, and the feature value represented the value of that field.

1. Data cleaning and relevance analysis:

Data cleaning refers to the preprocessing of data in order to remove or reduce noise. As not all of the 250 data fields are useful or relevant, removing some of them helps reduce the number of dimensions for the SVM. After observing the data, the following fields were removed because they are unlikely to affect the result much:

a) Nationality: most customers are from the same country (702).

b) OCCUP_CD: most of the data are missing.

c) SubPlan_Previous: the author decided to remove this field as there is already a SubPlan_Change_Flag and most customers do not change their plan.

d) NUM_DELINQ_TEL, PAY_METD, PAY_METD_PREV, PAY_METD_CHG, and HS_CHANGE: data are not useful or relevant.

e) HS_MANUFACTURER: removed as there is already a handset model field.

f) BLACK_LIST_FLAG, TELE_CHANGE_FLAG, and ID_CHANGE_FLAG: the data in all records are largely unchanged.
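The field-removal step can be sketched as below. The dict-based record layout and exact column spellings are assumptions for illustration; the write-up does not specify how the cleaning program was implemented:

```python
# Columns removed during data cleaning (names per the list above).
DROP = {"NATIONALITY", "OCCUP_CD", "SUBPLAN_PREVIOUS", "NUM_DELINQ_TEL",
        "PAY_METD", "PAY_METD_PREV", "PAY_METD_CHG", "HS_CHANGE",
        "HS_MANUFACTURER", "BLACK_LIST_FLAG", "TELE_CHANGE_FLAG",
        "ID_CHANGE_FLAG"}

def clean(record):
    """Return a copy of a customer record without the removed fields."""
    return {k: v for k, v in record.items() if k.upper() not in DROP}

row = {"GENDER": "M", "NATIONALITY": 702, "OCCUP_CD": None, "NUM_TEL": 2}
print(clean(row))  # -> {'GENDER': 'M', 'NUM_TEL': 2}
```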

2. Transforming:

a) Input to SVMs:

As input to SVMs must be in numeric form, all the data needed to be transformed. Each data field was represented by a feature number, and the feature value represented the value of that field.



The input to SVM Light must be in the following format:

<line> = <target> <feature>:<value> <feature>:<value> … <feature>:<value> # <info>
<target> = +1 | -1 | 0 | <float>
<feature> = <integer>
<value> = <float>
<info> = <string>

The target value and each of the feature/value pairs are separated by a space character. Feature/value pairs must be in increasing order of feature numbers. Features with value zero can be skipped.

Example of a training line: +1 1:0.23 3:0.56 8:1
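As a sanity check on this grammar, a minimal parser for one input line might look as follows (the function name is illustrative and not part of SVM Light):

```python
def parse_line(line):
    """Parse one SVM-Light line into (target, {feature: value})."""
    body = line.split("#", 1)[0].split()  # strip the optional #info comment
    target = float(body[0])
    feats = {}
    for pair in body[1:]:
        f, v = pair.split(":")
        feats[int(f)] = float(v)
    # feature numbers must appear in strictly increasing order
    assert list(feats) == sorted(feats)
    return target, feats

print(parse_line("+1 1:0.23 3:0.56 8:1 # example"))
# -> (1.0, {1: 0.23, 3: 0.56, 8: 1.0})
```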

b) Transforming program:

A program was written to assign each field a representative number (its feature number) and to use the value of the data field as the feature value. +1 is the target value for 3G customers and -1 is the target value for 2G customers.

* Feature value of unmeasured data fields:

For some of the data fields we cannot assign a value number directly because their values are not numeric, for example age, gender, marital status, etc. As these types of data are too important to ignore, each such field value was given its own feature number, and its value was set to 0.5.


For example, one 3G customer can be transformed to:

+1 1:0.5 4:0.5 15:0.5 20:0.5 26:0.5 32:0.5 39:0.359551 40:0.340561 41:0.484932 42:0.000758 43:0.003082 ...

Feature number 1 represents gender Male, 4 represents marital_status Single, etc.

For measured (numeric) data fields, we just need to assign the field's value to its feature number, scaled into the range [0, 1].

The following are some of the feature numbers that the program assigned automatically to the data fields:

1: Male Gender
2: Female Gender
3: Married
4: Single
5: Divorced
40: LINE_TENURE
41: DAYSTO_CONTRACT_EXPIRY
42: NUM_TEL
43: NUM_ACT_TEL
44: NUM_SUSP_TEL
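The transforming step above can be sketched as follows. This is a simplified reconstruction, not the author's program: the record layout, the reduced feature map, and the per-column maxima used for scaling are all assumptions.

```python
# Simplified feature map following the numbering above.
CATEGORICAL = {("gender", "M"): 1, ("gender", "F"): 2,
               ("marital_status", "married"): 3,
               ("marital_status", "single"): 4,
               ("marital_status", "divorced"): 5}
NUMERIC = {"LINE_TENURE": 40, "DAYSTO_CONTRACT_EXPIRY": 41,
           "NUM_TEL": 42, "NUM_ACT_TEL": 43, "NUM_SUSP_TEL": 44}

def to_svmlight(record, is_3g, max_values, cat_value=0.5):
    """Return one SVM-Light line for a customer record (a dict)."""
    target = "+1" if is_3g else "-1"
    pairs = {}
    # categorical ("unmeasured") fields get the fixed value cat_value
    for (field, value), feat in CATEGORICAL.items():
        if record.get(field) == value:
            pairs[feat] = cat_value
    # numeric fields are scaled into [0, 1] by their column maximum
    for field, feat in NUMERIC.items():
        if field in record and max_values.get(field):
            pairs[feat] = record[field] / max_values[field]
    # feature/value pairs must appear in increasing feature order
    body = " ".join(f"{f}:{pairs[f]:g}" for f in sorted(pairs))
    return f"{target} {body}"

line = to_svmlight({"gender": "M", "marital_status": "single",
                    "LINE_TENURE": 34, "NUM_TEL": 2},
                   is_3g=True,
                   max_values={"LINE_TENURE": 100, "NUM_TEL": 8})
print(line)  # -> "+1 1:0.5 4:0.5 40:0.34 42:0.25"
```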

3. Building Model:

SVM Learn, the learning module of SVM Light, was used to build the model for classification. As indicated above, it is better to classify a 2G customer into the 3G customer type than to classify a 3G customer into the 2G customer type. In other words, training errors on positive (3G) examples outweigh training errors on negative (2G) examples, so the j parameter (cost factor) of the SVM Light learning module was set higher than 1.

As we already know that there will be 1,000 3G customers in the holdout dataset, the author decided to use all 18,000 classified records as training data for the learning model. These 18,000 customer records were transformed into the numeric form described above and passed to the SVM Light learning module.
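The two commands below sketch how the model might have been built and applied with SVM Light's command-line tools; the file names are placeholders, and the j value shown is the one the author reports later.

```shell
# -j > 1 makes errors on positive (3G) training examples costlier
# than errors on negative (2G) ones.
svm_learn -j 1.5 train.dat model.dat

# Score the 6,000 holdout records with the learned model.
svm_classify holdout.dat model.dat predictions.txt
```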

4. Prediction:

The SVM Classify module of SVM Light was used to predict the new examples. There are two parameters that we can tune here in order to obtain the best-predicted 1,000 3G customers:

a) The value assigned to unmeasured features (gender, marital status).

b) The j parameter (cost factor) of SVM Light.

After experimenting with these two parameters, the author set the value of unmeasured features to 0.4 and the j parameter to 1.5. With these values, the classification yielded 1,086 3G customers and 4,914 2G customers.

Discussion:

The following are some of the decision values returned by the SVM Light classification module:


1)  -1.2406089
2)  -0.28477878  (potential)
3)  -1.5060007
4)  -1.6569823
5)   1.2099758
6)   0.58005892
7)  -0.4054586
8)  -0.2487679   (potential)
9)   0.9694503
10) -0.88280532
11) -0.3529557
12)  0.42899052
13) -1.0195925
14) -1.1603308
15)  1.232341

Values higher than 0 indicate 3G customers; values below 0 indicate 2G customers. From this model, we can also easily identify which customers have the potential to switch to the 3G network: these are the customers whose values are slightly less than 0 (e.g. customers 2 and 8).
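The decision rule can be sketched as follows. The 15 values are those listed above; the -0.3 cutoff for "potential" customers is an assumption chosen only to reproduce the two flagged examples, since the write-up says only "slightly less than 0":

```python
# Decision values returned by the classifier for 15 holdout customers.
scores = [-1.2406089, -0.28477878, -1.5060007, -1.6569823, 1.2099758,
          0.58005892, -0.4054586, -0.2487679, 0.9694503, -0.88280532,
          -0.3529557, 0.42899052, -1.0195925, -1.1603308, 1.232341]

# customers (1-indexed) classified as 3G: decision value above 0
predicted_3g = [i + 1 for i, s in enumerate(scores) if s > 0]
# 2G customers close to the boundary: likely to switch to 3G
potential = [i + 1 for i, s in enumerate(scores) if -0.3 < s <= 0]

print(predicted_3g)  # -> [5, 6, 9, 12, 15]
print(potential)     # -> [2, 8]
```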