Analysis of the Literature


Introduction and Motivation

One of the most important tasks in the field of Machine Learning is the classification of data into disjoint groups. A machine can learn a new classifier from a set of training examples with existing class labels, or from a set of examples with no class labels; these techniques are known as supervised and unsupervised learning respectively. In supervised learning, the machine uses the knowledge it has gained from the assignment of class labels in the training data to classify further data instances.

There are a number of common ways to do this, such as the Naïve Bayes Classifier, Rule Induction and Decision Tree Induction. However, these and many other similar approaches seek to minimise the number of classification errors: they ignore the possible variety in the cost of each misclassification, instead assuming that every classification error has equal cost. This is known as cost-insensitive learning.

This method of classification would not be practical in a great many, if not most, real-world applications. More often than not, there will be one set of classifications it is critical to be sure about, and one where a mistake is not as costly. For example, when classifying patients for cancer, it is far more harmful to get a false negative (identifying a patient as healthy when in fact they are at risk of death) than it is to get a false positive and treat a patient who is actually healthy.

Machine Learning approaches which take this into account use cost-sensitive learning.

The class imbalance problem arises in many real-world applications, and occurs when one of the classes vastly outnumbers the others. The cancer example above is also an instance of the class imbalance problem: there are far more people in the world without cancer than there are with it.

Theory of Cost-Sensitive Learning

Much of the theory of cost-sensitive learning centres on the respective costs of false positive and false negative classifications, and the 'reward' of a true positive and a true negative. It is common practice (and simplifies matters) to normalise the costs so that a true positive and a true negative simply have cost zero; Tables 1 and 2 illustrate this in a binary case.


                      Actual Negative     Actual Positive
Negative Prediction   C(0,0)              C(0,1)
Positive Prediction   C(1,0)              C(1,1)

Table 1. Matrix showing the respective costs, C(i,j), of each pair of prediction and actuality.


                      Actual Negative     Actual Positive
Negative Prediction   0                   C(0,1) - C(1,1)
Positive Prediction   C(1,0) - C(0,0)     0

Table 2. 'Normalised' matrix with values of 0 for true positives and true negatives. (This is not the same as regular mathematical normalisation.)

These cost values will be supplied beforehand by various means, and are not calculated during the training process; it is common practice (without any loss of generality) to label the minority class as positive and the majority as negative. Then, given a matrix analogous to the ones above, it is possible to calculate the expected cost, $R(i \mid x)$, of classifying a data instance $x$ into class $i$ (where, as in the tables above, $i = 1$ is the positive class and $i = 0$ the negative):

$$R(i \mid x) = \sum_{j} P(j \mid x)\, C(i, j) \qquad (1)$$
where $P(j \mid x)$ is an estimate of the probability that instance $x$ belongs to class $j$.
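As an illustration, the following minimal Python sketch (with made-up costs, not taken from any of the cited papers) evaluates equation (1) for both candidate predictions and picks the cheaper one:

```python
import numpy as np

# Cost matrix C[i][j]: cost of predicting class i when the true class is j.
# Here 0 = negative and 1 = positive; a false negative C(0,1) is made far
# more expensive than a false positive C(1,0), as in the cancer example.
C = np.array([[0.0, 50.0],   # predict negative: TN cost, FN cost
              [1.0,  0.0]])  # predict positive: FP cost, TP cost

def expected_costs(posteriors, cost_matrix):
    """Equation (1): R(i|x) = sum_j P(j|x) * C(i, j) for each prediction i."""
    return cost_matrix @ posteriors

# Suppose a classifier estimates P(0|x) = 0.9 and P(1|x) = 0.1.
p = np.array([0.9, 0.1])
R = expected_costs(p, C)        # R(0|x) = 5.0, R(1|x) = 0.9
prediction = int(np.argmin(R))  # 1: predicting positive is cheaper
```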

From this point forward, we will discuss purely binary cases, where instances are classified into either a positive or a negative category. It follows, then, that a data instance $x$ will be classified into the positive class if and only if the expected cost of predicting positive is no greater than that of predicting negative:

$$P(0 \mid x)\, C(1,0) + P(1 \mid x)\, C(1,1) \;\le\; P(0 \mid x)\, C(0,0) + P(1 \mid x)\, C(0,1)$$
Due to the binary nature of the problem, it follows immediately that $P(0 \mid x) = 1 - P(1 \mid x)$, and hence it is possible to define a threshold parameter $p^*$ which determines that an instance $x$ will be classified as positive if $P(1 \mid x) \ge p^*$.

Then, using the normalised costs of Table 2:

$$p^* = \frac{C(1,0)}{C(1,0) + C(0,1)} \qquad (2)$$

From this equation, if a cost-insensitive classifier can produce a posterior probability for $P(1 \mid x)$, it can be made into a cost-sensitive classifier by calculating the threshold parameter $p^*$ as above and classifying any data instance as positive whenever $P(1 \mid x) \ge p^*$.
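For example, in the cancer scenario, suppose (purely for illustration) that a false negative is fifty times as costly as a false positive, so $C(0,1) = 50$ and $C(1,0) = 1$. Then $p^* = 1/(1+50) \approx 0.02$, and any patient with even a 2% estimated probability of cancer would be classified as positive and treated.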

Not all cost-insensitive classifiers are capable of producing a posterior probability, though (decision tree algorithms like C4.5, for example), since they are designed to predict the class accurately rather than probabilistically. These cost-insensitive classifiers are designed to predict the class with an effective $p^*$ value of 0.5, the value expected if class imbalance were not an issue. By means of sampling, it is possible to alter the class proportions to produce whatever effective threshold $p^*$ is required.

Assuming, as before, that positive examples are in the minority, leave them as they are and take a proportion $\frac{p^*}{1 - p^*}$ of the negative examples, ignoring the rest. Since the cost of a false positive is (almost always) smaller than the cost of a false negative, this proportion is less than 1.
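A minimal sketch of this sampling procedure, assuming binary labels with 1 as the minority positive class (the function name and signature are illustrative, not taken from the cited papers):

```python
import numpy as np

def cost_based_undersample(X, y, c_fp, c_fn, seed=0):
    """Keep every positive example and a random fraction
    p*/(1 - p*) = C(1,0)/C(0,1) of the negatives, discarding the rest."""
    rng = np.random.default_rng(seed)
    keep_fraction = c_fp / c_fn        # < 1 when false negatives cost more
    pos = np.flatnonzero(y == 1)       # minority class: keep all of it
    neg = np.flatnonzero(y == 0)
    kept_neg = rng.choice(neg, size=int(len(neg) * keep_fraction),
                          replace=False)
    idx = np.concatenate([pos, kept_neg])
    return X[idx], y[idx]
```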

This method of sampling is known as under-sampling, and is effectively equivalent to a method called proportional sampling, in which positive and negative examples are chosen in the ratio

$$p(1)\, C(0,1) \;:\; p(0)\, C(1,0) \qquad (3)$$

where $p(1)$ and $p(0)$ are the proportions of positive and negative examples in the original training data.
Methods of Cost-Sensitive Learning

There are two main categories of cost-sensitive learning: direct cost-sensitive learning and cost-sensitive meta-learning. In direct cost-sensitive learning the classifiers are cost-sensitive in themselves, whereas in cost-sensitive meta-learning a process is devised that takes an existing cost-insensitive classifier and makes it cost-sensitive through various methods, usually sampling or thresholding. Both are explained in more detail below.

Direct Cost-Sensitive Learning

In direct cost-sensitive learning, the classification algorithm effectively calculates a threshold value $p^*$ as part of its standard operation, thus addressing both the class imbalance problem and the problem of differing misclassification costs. Such algorithms aim to introduce the misclassification costs into the learning process itself and utilise them to assist in correct classification.

Two well-known algorithms that implement direct cost-sensitive learning are ICET and CSTree. The former utilises a genetic algorithm to create a population, which it then uses as part of a decision tree algorithm, using the average classification cost as the fitness function of the genetic algorithm. The latter uses misclassification cost as part of the process of creating the tree: instead of minimising attribute entropy, CSTree selects an attribute based on minimising misclassification cost.
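A toy sketch of CSTree-style attribute selection (illustrative only, not code from the CSTree paper): each candidate split is scored by the total misclassification cost it would incur if every partition were labelled with its cheaper class, and the tree builder branches on the attribute with the smallest score.

```python
import numpy as np

def leaf_cost(n_pos, n_neg, c_fp, c_fn):
    """Cost of labelling one partition entirely negative or positive."""
    return min(n_pos * c_fn,   # predict negative: pay C(0,1) per positive
               n_neg * c_fp)   # predict positive: pay C(1,0) per negative

def split_cost(values, labels, c_fp, c_fn):
    """Total misclassification cost of splitting on a discrete attribute."""
    total = 0.0
    for v in np.unique(values):
        part = labels[values == v]
        total += leaf_cost(np.sum(part == 1), np.sum(part == 0), c_fp, c_fn)
    return total

# The tree builder would evaluate split_cost for every candidate attribute
# and branch on the one with the smallest total cost.
```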

Cost-Sensitive Meta-Learning

Cost-sensitive meta-learning utilises existing cost-insensitive algorithms without altering them, instead pre- and post-processing the inputs and outputs of these cost-insensitive algorithms, ensuring that the system as a whole is cost-sensitive. Cost-sensitive meta-learning algorithms are generally split into two categories, thresholding and sampling, making use of equations (2) and (3) respectively.

Providing the cost-insensitive classifier can generate a posterior probability, thresholding uses the threshold parameter $p^*$ to classify data instances into positive or negative classes. These thresholding methods rely on the accuracy of the calculated posterior probability estimates, and it is therefore important that the estimates be as reliable as possible.

MetaCost uses an ensemble of decision trees to find accurate estimates of the required posterior probabilities of the training examples, then uses the information gathered to re-label the training examples with their minimum-cost classes and retrain, producing a cost-sensitive classifier.

One of the more intuitive approaches to thresholding is Cost-Sensitive Naïve Bayes: since the Naïve Bayes algorithm already calculates posterior probabilities, it is a simple matter to utilise the threshold parameter to classify data instances.
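A minimal sketch of that idea, using scikit-learn's GaussianNB as the underlying probability estimator (the wrapper below is our own illustration, not code from the cited papers; any classifier with reliable posteriors would do):

```python
from sklearn.naive_bayes import GaussianNB

def cost_sensitive_nb(X_train, y_train, X_test, c_fp, c_fn):
    """Train plain Naive Bayes, then threshold its posteriors at p*."""
    p_star = c_fp / (c_fp + c_fn)             # equation (2), normalised costs
    clf = GaussianNB().fit(X_train, y_train)
    p_pos = clf.predict_proba(X_test)[:, 1]   # P(1|x), assuming labels {0, 1}
    return (p_pos >= p_star).astype(int)
```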

Sampling pre-processes the training examples in accordance with equation (3), under-sampling the data set and then applying cost-insensitive classifiers to the sampled data. Estimates of probabilities are unnecessary, providing positive and negative examples can be classified with a reasonable degree of accuracy.

Conclusion

Three differing methods of cost-sensitive learning have been covered: direct cost-sensitive learning, thresholding and sampling. All three of these methods aim to combat problems in the standard cost-insensitive algorithms that occur when misclassification costs are not uniform and class membership is imbalanced.

When faced with a vast majority of negative examples and a minority of positive ones, cost-insensitive methods tend to predict almost all examples they are presented with as negative. Whilst this may seem an immediate problem, it has actually been shown that, providing misclassification costs are uniform and the primary objective for the classifier is to achieve optimum accuracy of prediction, this is the best thing to do.

Therefore, the cost-sensitive methods discussed above only become useful when misclassification costs differ, and have no place in data sets with merely imbalanced classes.

Investigation suggests that, for the most part, all three methods perform relatively evenly across a wide variety of sample data sets. However, for very small data sets the over-sampling method performs best, and for large data sets (in the region of over 10,000 instances) the direct cost-sensitive algorithms emerged as clear winners.

Between the two size constraints, however, the optimal approach varies drastically from data set to data set. It is of interest to note that 'wrapper'-based approaches that use sampling as a pre- or post-process of an existing algorithm work just as well as, or even better than, the more thorough and technically correct direct cost-sensitive algorithms, despite the inherent disadvantages of both over- and under-sampling.

The compatibility of sampling methods with a great many existing cost-insensitive algorithms means they can be much easier to use in a variety of situations, and raises the question of whether the direct cost-sensitive algorithms are really necessary if they cannot be made to perform better on small and medium-sized data sets.

Data Pre-Processing

Data Discretizing

Discretizing data is necessary for some types of machine learning algorithms, but it can have benefits even for algorithms where it is not required. It works by taking a continuous attribute, such as MonthlyIncome, and organising the available values into categories chosen by the discretization algorithm.

There are two types of these algorithms, unsupervised and supervised, and each functions in a way similar to unsupervised and supervised machine learning algorithms. The former creates categories without using the class labels of the training examples, either separating the values into a set number of categories based on the range of the data (equal-width binning), or into categories that each contain the same number of values (equal-frequency binning). Supervised discretization, on the other hand, utilises the class labels of the data to split it into categories in a more intuitive and useful way. Supervised discretization is almost always superior, and so it is the method we have used in our pre-processing; a sketch of the idea follows.
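One common supervised approach (not Weka's exact filter, which is based on the Fayyad-Irani MDL criterion) is to fit a shallow decision tree to a single continuous attribute and reuse its split thresholds as category boundaries, so that the bins are chosen with the help of the class labels. A minimal sketch:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def supervised_bins(values, labels, max_bins=4):
    """Discretize one continuous attribute (a 1-D NumPy array) using labels."""
    tree = DecisionTreeClassifier(max_leaf_nodes=max_bins)
    tree.fit(values.reshape(-1, 1), labels)
    # Leaves carry the sentinel threshold -2; internal nodes hold real cuts.
    cuts = sorted(t for t in tree.tree_.threshold if t != -2)
    return np.digitize(values, cuts)   # integer category for each value
```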

We used supervised discretization to pre-process the data for four of our five algorithms; the superior data categorisation improved the accuracy of the models. It was not necessary for the decision tree model that we used, J48, since that algorithm chooses split points on continuous attributes itself as it builds the tree.

Sampling

Methods of sampling were explored thoroughly in the analysis section, but Weka only has one inbuilt sampling filter, 'SpreadSubsample'. We applied this to all of the algorithms, and it achieved a very slight improvement in accuracy for Naïve Bayes when applied before the discretization. However, the size of our data set made it unreliable, and it often caused a drastic decrease in accuracy, so it was not included as part of our model. For smaller data sets, though, under-sampling would be much more useful and would almost always be used.

Evaluation of Models

As advised in the brief, we measured the accuracy of our models by the area under their ROC curve (ROC AUC). This area gives the probability that the classifier will assign a higher score to a randomly chosen positive instance than to a randomly chosen negative instance. In an ideal world, this probability would be equal to 1, but it suffices that higher values for the area are better than lower values.
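As a toy illustration of that probability interpretation (invented numbers, not our experimental data; the figures below came from Weka's own evaluation), scikit-learn can compute the area directly:

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]                # actual labels (toy data)
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # estimated P(1|x)

# 8 of the 9 (positive, negative) pairs are ranked correctly, so AUC = 8/9.
print(roc_auc_score(y_true, y_score))       # 0.888...
```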

The optimal values that we achieved for the ROC AUC are as follows:

    Naïve Bayes                  0.864
    Logistic Regression          0.865
    Decision Tree (J48)          0.842
    K-Nearest Neighbour (IBk)    0.842
    Rule Induction (Prism)       0.835

As can be clearly seen, Naïve Bayes and Logistic Regression performed the best, with Logistic Regression slightly superior. J48, the decision tree algorithm, and IBk, the K-Nearest Neighbour method, were also reasonably effective. Prism, the Rule Induction algorithm, was the least effective, but not by too large a margin.

From this data alone, it would appear that Naïve Bayes is a slightly inferior classifier compared to Logistic Regression, but the margin is so small that it is hard to tell without further evidence. Therefore, we have used additional statistics generated by running our models on the data set given (in this data set, negative was the minority class, and thus a false negative is more expensive than a false positive):

Naïve Bayes:

    Correctly Classified Instances    3194 (87.0547%)
    Number of False Positives         285
    Number of False Negatives         190
    Total Number of Instances         3669
    Kappa Statistic                   0.4953
    Root Mean Squared Error           0.3185

Logistic Regression:

    Correctly Classified Instances    8408 (93.4222%)
    Number of False Positives         503
    Number of False Negatives         89
    Total Number of Instances         9000
    Kappa Statistic                   0.2527
    Root Mean Squared Error           0.2289

These additional data would further support the theory that Logistic Regression is superior: it has a higher percentage of correct classifications, a lower number of false negatives (the expensive misclassification) and a lower RMS error.

The only anomaly is the Kappa Statistic, which would imply that Naïve Bayes is superior. Our values for Kappa are vastly smaller than the corresponding ROC AUC values, a typical result, since the Kappa Statistic is usually an underestimate of accuracy. However, Kappa values typically correlate very well with ROC AUC, better than they do with any other accuracy measure. Since our Kappa values are unexpectedly different, I would hesitate before using Kappa as a measure of accuracy in this case.

Ignoring the Kappa Statistic, then, the evidence points towards Logistic Regression being the optimal model for the data set given: it has the largest ROC AUC, the smallest number of false negatives and the highest correct classification percentage.

It is, of course, impossible to determine which of the five models is optimal in general, since their performances could change dramatically given different training examples or a different test data set. For the training examples available, however, the Logistic Regression model is optimal.

References

[1] Chawla, Japkowicz, Kolcz: Editorial: Special Issue on Learning from Imbalanced Data Sets

[2] Ling, Sheng: Cost-Sensitive Learning and the Class Imbalance Problem

[3] Weiss, McCarthy, Zabar: Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?

[4] Turney: Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm

[5] Davis, Jungwoo, Rossbach, Ramadan, Witchel: Cost-Sensitive Decision Tree Learning for Forensic Classification

[6] Domingos: MetaCost: A General Method for Making Classifiers Cost-Sensitive

[7] Lustgarten, Gopalakrishnan, Grover, Visweswaran: Improving Classification Performance with Discretization on Biomedical Datasets