# An Empirical Comparison of Supervised Learning Algorithms


Rich Caruana and Alexandru Niculescu-Mizil

Presented by Varun Sudhakar

Importance:

An empirical comparison of different learning algorithms answers questions such as:

Which is the best learning algorithm?

How well does a particular learning algorithm perform compared to another algorithm on the same data?

The last comprehensive empirical comparison was STATLOG (1995).

Several new learning algorithms have been developed since STATLOG (random forests, bagging, SVMs).

There has been no extensive evaluation of these new methods.

Learning Algorithms Compared:

SVMs

ANN

Logistic Regression

Naïve Bayes

KNN

Random Forests

Decision Trees (Bayes, CART, CART0, ID3, C4.5, MML, SMML)

Bagged Trees

Boosted Trees

Boosted Stumps

Threshold Metrics:

Accuracy - the proportion of correct predictions the classifier makes relative to the size of the dataset.

F-score - the harmonic mean of precision and recall at a given threshold.

Lift - (% of true positives above the threshold) / (% of the dataset above the threshold).
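A minimal sketch of these three metrics in Python (assuming binary 0/1 labels and a fixed decision threshold; the function name is illustrative):

```python
import numpy as np

def threshold_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, F-score, and lift at a fixed decision threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)

    # Accuracy: proportion of correct predictions over the whole dataset.
    accuracy = np.mean(y_pred == y_true)

    # F-score: harmonic mean of precision and recall at this threshold.
    tp = np.sum((y_pred == 1) & (y_true == 1))
    precision = tp / max(np.sum(y_pred == 1), 1)
    recall = tp / max(np.sum(y_true == 1), 1)
    f_score = 2 * precision * recall / max(precision + recall, 1e-12)

    # Lift: (% of true positives above the threshold)
    #     / (% of the dataset above the threshold).
    lift = (tp / max(np.sum(y_true == 1), 1)) / max(np.mean(y_pred == 1), 1e-12)
    return accuracy, f_score, lift
```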

Ordering/Rank Metrics:

ROC curve - a plot of sensitivity vs. (1 - specificity) for all possible thresholds.

APR - average precision.

BEP (Break-Even Point) - the precision at the point (threshold value) where precision and recall are equal.
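A sketch of the rank metrics using scikit-learn's standard metric functions; the break-even point is approximated here as the point on the precision-recall curve where |precision - recall| is smallest:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

def rank_metrics(y_true, y_score):
    """Area under the ROC curve, average precision, and break-even point."""
    auc = roc_auc_score(y_true, y_score)
    apr = average_precision_score(y_true, y_score)

    # BEP: precision where precision and recall are (approximately) equal,
    # taken at the curve point minimizing |precision - recall|.
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    bep = precision[np.argmin(np.abs(precision - recall))]
    return auc, apr, bep
```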

Probability Metrics:

RMSE (Root Mean Squared Error) - a measure of total error, defined as the square root of the sum of the variance and the square of the bias.

MXE (Mean Cross Entropy) - used in the probabilistic setting, when we are interested in predicting the probability that an example is positive:

MXE = -1/N Σ [ true(c) · ln(pred(c)) + (1 - true(c)) · ln(1 - pred(c)) ]
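A sketch of both probability metrics, assuming predicted positive-class probabilities clipped away from 0 and 1 so the logarithms stay finite:

```python
import numpy as np

def probability_metrics(y_true, p_pred, eps=1e-12):
    """RMSE and mean cross-entropy of predicted positive-class probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)

    # RMSE between the predicted probability and the 0/1 label.
    rmse = np.sqrt(np.mean((p_pred - y_true) ** 2))

    # MXE = -1/N Σ [ true(c)·ln(pred(c)) + (1 - true(c))·ln(1 - pred(c)) ]
    mxe = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return rmse, mxe
```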

Lift is appropriate for marketing.

Medicine prefers ROC.

Precision/recall is used in information retrieval.

It is also possible for an algorithm to perform well on one metric and poorly on another.

Data Sets:

Letter

Cover Type

Protein coding

MEDIS

MG

IndianPine92

California Housing

Bacteria

SLAC (Stanford Linear Accelerator)

For each data set, 5000 random instances are used for training and the rest are used as one large test set.

5-fold cross-validation on the 5000 training instances is used to select the best parameters for each learning algorithm (a sketch of this protocol follows below).

The same 5-fold cross-validation also serves to calibrate the different algorithms using either Platt scaling or isotonic regression.
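A sketch of this protocol with scikit-learn on synthetic data; logistic regression is shown only as an example, and note that sklearn's C is an inverse regularization strength rather than the ridge parameter itself:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for one of the problems (illustrative only).
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

# 5000 random instances for training; everything else is one large test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=5000, random_state=0)

# 5-fold cross-validation on the 5000 training instances picks the best
# regularization setting for this learner.
grid = {"C": [10.0 ** k for k in range(-8, 5)]}
search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```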

SVM predictions are transformed into posterior probabilities by passing them through a sigmoid (Platt scaling).

Platt's method also works well for boosted trees and boosted stumps.

A sigmoid, however, might not be the correct transformation for all learning algorithms.

Isotonic regression provides a more general solution, since the only restriction it makes is that the mapping function be isotonic (monotonically increasing).
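A sketch of the two calibration methods, assuming held-out scores and labels from the cross-validation folds. The common approximation of Platt scaling as a one-dimensional logistic fit is used here; scikit-learn's CalibratedClassifierCV packages the same two ideas as method="sigmoid" and method="isotonic":

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

def platt_scale(scores, y_true):
    """Platt scaling: fit a sigmoid from raw scores to probabilities."""
    lr = LogisticRegression()  # 1-D logistic fit stands in for Platt's sigmoid
    lr.fit(np.asarray(scores).reshape(-1, 1), y_true)
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

def isotonic_calibrate(scores, y_true):
    """Isotonic regression: any monotonically increasing mapping is allowed."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(scores, y_true)
    return iso.predict
```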

SVMs:

The regularization parameter is varied by factors of ten from 10^-7 to 10^3.

ANN:

hidden units: {1, 2, 4, 8, 32, 128}

momentum: {0, 0.2, 0.5, 0.9}

Logistic Regression:

The ridge (regularization) parameter is varied by factors of ten from 10^-8 to 10^4.

KNN:

26 values of K, ranging from K = 1 to K = |trainset|.

Random Forests:

The size of the feature set considered at each split is 1, 2, 4, 6, 8, 12, 16, or 20.

Boosted Trees:

2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048 steps of boosting.

Boosted Stumps:

Single-level decision trees generated with 5 different splitting criteria, each boosted for 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, or 8192 steps.
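The search spaces above, written out as plain Python grids; the geometric spacing of the KNN values is an assumption, since the slides give only the range and count:

```python
import numpy as np

param_grids = {
    "svm":       {"C": [10.0 ** k for k in range(-7, 4)]},      # 10^-7 .. 10^3
    "ann":       {"hidden_units": [1, 2, 4, 8, 32, 128],
                  "momentum": [0.0, 0.2, 0.5, 0.9]},
    "logreg":    {"ridge": [10.0 ** k for k in range(-8, 5)]},  # 10^-8 .. 10^4
    # 26 values of K from 1 to |trainset| (5000 here); spacing is assumed.
    "knn":       {"k": np.unique(np.logspace(0, np.log10(5000), 26).astype(int))},
    "rf":        {"features_per_split": [1, 2, 4, 6, 8, 12, 16, 20]},
    "bst_dt":    {"steps": [2 ** k for k in range(1, 12)]},     # 2 .. 2048
    "bst_stump": {"steps": [2 ** k for k in range(1, 14)]},     # 2 .. 8192
}
```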

Without calibration, the best algorithms were bagged trees, random forests, and neural nets.

After calibration, the best algorithms were calibrated boosted trees, calibrated random forests, bagged trees, PLT-calibrated SVMs, and neural nets.

SVMs and boosted trees improve their rankings with calibration.

Interestingly, calibrating neural nets with PLT or ISO hurts their calibration.

Some algorithms, such as memory-based methods (e.g. KNN), are unaffected by calibration.


Best algorithm per data set:

Letter - Boosted DT (PLT)

Cover Type - Boosted DT (PLT), Boosted Stumps (PLT)

Protein coding - Boosted DT (PLT)

MEDIS - Random Forest (PLT)

MG - Bagged DT

IndianPine92 - Boosted DT (PLT)

California Housing - Boosted DT (PLT)

Bacteria - Bagged DT

SLAC - Random Forest (ISO)

Neural nets perform well on all metrics on 10 of 11 problems, but perform poorly on COD.

If the COD problem had not been included, neural nets would move up 1-2 places in the rankings.

Bootstrap analysis:

Randomly select a bootstrap sample from the original 11 problems.

Randomly select a bootstrap sample of 8 metrics from the original 8 metrics.

Rank the ten algorithms by mean performance across the sampled problems and metrics.

Repeat the bootstrap sampling 1000 times, yielding 1000 potentially different rankings of the learning methods.
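A sketch of this bootstrap procedure in NumPy, assuming a precomputed table of mean performances with shape (algorithms, problems, metrics):

```python
import numpy as np

def bootstrap_rankings(scores, n_boot=1000, seed=0):
    """scores: array of shape (n_algorithms, n_problems, n_metrics) holding
    each algorithm's mean performance per problem and metric.
    Returns an (n_boot, n_algorithms) array of ranks (1 = best)."""
    rng = np.random.default_rng(seed)
    n_alg, n_prob, n_met = scores.shape
    rankings = np.empty((n_boot, n_alg), dtype=int)
    for b in range(n_boot):
        # Sample problems and metrics with replacement.
        probs = rng.integers(0, n_prob, size=n_prob)
        mets = rng.integers(0, n_met, size=n_met)
        mean_perf = scores[:, probs][:, :, mets].mean(axis=(1, 2))
        # Rank algorithms by mean performance (higher is better).
        order = np.argsort(-mean_perf)
        rankings[b, order] = np.arange(1, n_alg + 1)
    return rankings

# E.g. the fraction of samples where algorithm 0 ranks first:
# (bootstrap_rankings(scores)[:, 0] == 1).mean()
```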

[Charts: bootstrap rank distributions for the top models (BST-DT, RF, BAG-DT) and the bottom models (Naïve Bayes, LOGREG, DT)]
Probability that each model ranks 1st through 10th over the 1000 bootstrap samples:

| Model | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BST-DT | .580 | .228 | .160 | .023 | .009 | .0 | .0 | .0 | .0 | .0 |
| RF | .390 | .525 | .084 | .001 | .0 | .0 | .0 | .0 | .0 | .0 |
| BAG-DT | .030 | .232 | .571 | .150 | .017 | .0 | .0 | .0 | .0 | .0 |
| SVM | .0 | .008 | .148 | .574 | .240 | .029 | .001 | .0 | .0 | .0 |
| ANN | .0 | .007 | .035 | .230 | .606 | .122 | .0 | .0 | .0 | .0 |
| KNN | .0 | .0 | .0 | .009 | .114 | .592 | .245 | .038 | .002 | .0 |
| BST-STMP | .0 | .0 | .002 | .013 | .014 | .257 | .710 | .004 | .0 | .0 |
| DT | .0 | .0 | .0 | .0 | .0 | .0 | .004 | .616 | .291 | .089 |
| LOGREG | .0 | .0 | .0 | .0 | .0 | .0 | .040 | .312 | .423 | .225 |
| NB | .0 | .0 | .0 | .0 | .0 | .0 | .0 | .030 | .284 | .686 |

The models that performed poorest were naïve Bayes, logistic regression, decision trees, and boosted stumps.

Bagged trees, random forests, and neural nets give the best average performance without calibration.

After calibration with Platt's method, boosted trees predict better probabilities than all other methods.

At the same time, boosted stumps and logistic regression, which perform poorly on average, are the best models for some metrics.

The effectiveness of an algorithm depends on the metric used and the data set.

The End