Rich Caruana, Alexandru Niculescu-Mizil

Presented by Varun Sudhakar

Importance:

- An empirical comparison of different learning algorithms provides answers to questions such as:
  - Which is the best learning algorithm?
  - How well does a particular learning algorithm perform compared to another algorithm on the same data?
- The last comprehensive empirical comparison was STATLOG in 1995.
- Several new learning algorithms have been developed since STATLOG (random forests, bagging, SVMs).
- There has been no extensive evaluation of these new methods.


Algorithms compared:

- SVMs
- ANN (artificial neural nets)
- Logistic regression
- Naïve Bayes
- KNN
- Random forests
- Decision trees (Bayes, CART, CART0, ID3, C4, MML, SMML)
- Bagged trees
- Boosted trees
- Boosted stumps

Threshold Metrics:

- Accuracy - the proportion of correct predictions the classifier makes, relative to the size of the dataset.
- F-score - the harmonic mean of precision and recall at a given threshold.
- Lift - (% of true positives above the threshold) / (% of dataset above the threshold).
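A minimal numpy sketch of these three threshold metrics (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def threshold_metrics(y_true, scores, thresh=0.5):
    """Accuracy, F-score, and lift at a fixed decision threshold."""
    pred = (scores >= thresh).astype(int)
    acc = np.mean(pred == y_true)                  # correct predictions / dataset size

    tp = np.sum((pred == 1) & (y_true == 1))
    precision = tp / max(np.sum(pred == 1), 1)     # guard against empty predictions
    recall = tp / max(np.sum(y_true == 1), 1)
    f_score = 2 * precision * recall / max(precision + recall, 1e-12)

    # Lift: recall is the % of true positives above the threshold;
    # divide by the % of the dataset above the threshold.
    lift = recall / max(np.mean(pred == 1), 1e-12)
    return acc, f_score, lift
```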

Ordering/Rank Metrics:

- ROC curve - plot of sensitivity vs. (1 - specificity) for all possible thresholds.
- APR - average precision.
- BEP (Break-Even Point) - the precision at the point (threshold value) where precision and recall are equal.
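These metrics depend only on how the predicted scores order the examples. A sketch with scikit-learn (assumed available; the toy data is illustrative):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
scores = y_true + rng.normal(scale=1.0, size=200)  # noisy scores ranking positives higher

auc = roc_auc_score(y_true, scores)                # area under the ROC curve
apr = average_precision_score(y_true, scores)      # average precision (APR)

# BEP: precision at the threshold where precision and recall (nearly) cross.
precision, recall, _ = precision_recall_curve(y_true, scores)
bep = precision[np.argmin(np.abs(precision - recall))]
```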

Probability Metrics:

- RMSE (Root Mean Squared Error) - a measure of total error, defined as the square root of the sum of the variance and the square of the bias.
- MXE (Mean Cross Entropy) - used in the probabilistic setting, when we are interested in predicting the probability that an example is positive:

  MXE = -(1/N) Σ [ true(c)·ln(pred(c)) + (1 - true(c))·ln(1 - pred(c)) ]
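Both metrics can be computed directly from the predicted probabilities; a numpy sketch (RMSE is written in its equivalent squared-error form, which decomposes into variance plus squared bias):

```python
import numpy as np

def rmse(true, pred):
    # Root mean squared error between 0/1 labels and predicted probabilities.
    return np.sqrt(np.mean((pred - true) ** 2))

def mxe(true, pred, eps=1e-12):
    # Mean cross entropy, the formula above; eps guards against log(0).
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(true * np.log(pred) + (1 - true) * np.log(1 - pred))
```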



- Lift is appropriate for marketing.
- Medicine prefers ROC.
- Precision/recall is used in information retrieval.
- It is also possible for an algorithm to perform well on one metric and poorly on another.


Data sets:

- Letter
- Cover Type
- Adult
- Protein coding (COD)
- MEDIS
- MG
- IndianPine92
- California Housing
- Bacteria
- SLAC (Stanford Linear Accelerator)


- For each data set, 5000 random instances are used for training and the rest are used as one large test set.
- 5-fold cross-validation is used on the 5000 training instances.
- The 5-fold cross-validation is used to select the best parameters for each learning algorithm.
- The same 5-fold cross-validation is also used to calibrate the different algorithms, using either Platt scaling or isotonic regression.


- Platt scaling: SVM predictions are transformed to posterior probabilities by passing them through a sigmoid.
- Platt's method also works well for boosted trees and boosted stumps, but a sigmoid might not be the correct transformation for all learning algorithms.
- Isotonic regression provides a more general solution, since the only restriction it makes is that the mapping function be isotonic (monotonically increasing).
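A minimal sketch of the two calibration methods (scikit-learn assumed; in the paper the calibration functions are fit on cross-validation predictions, while the scores here are a toy stand-in):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)                     # raw model outputs, e.g. SVM margins
y = (scores + rng.normal(size=1000) > 0).astype(int)

# Platt scaling: fit a sigmoid p = 1 / (1 + exp(A*s + B)) to the scores.
platt = LogisticRegression().fit(scores.reshape(-1, 1), y)
platt_probs = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic regression: fit a monotonically increasing step function.
iso = IsotonicRegression(out_of_bounds="clip")
iso_probs = iso.fit_transform(scores, y)
```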


Parameter settings:

SVMs:
- radial width {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2}
- the regularization parameter is varied by factors of ten from 10^-7 to 10^3

ANN:
- hidden units {1, 2, 4, 8, 32, 128}
- momentum {0, 0.2, 0.5, 0.9}

Logistic Regression:
- the ridge (regularization) parameter is varied by factors of 10 from 10^-8 to 10^4

KNN:
- 26 values of K ranging from K = 1 to K = |trainset|

Random Forests:
- the size of the feature set considered at each split is 1, 2, 4, 6, 8, 12, 16, or 20

Boosted Trees:
- 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048 steps of boosting

Boosted Stumps:
- single-level decision trees generated with 5 different splitting criteria, each boosted for 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192 steps
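A sketch of how one of these grids could be searched with the 5-fold cross-validation described earlier, using the SVM settings above (scikit-learn assumed; its RBF `gamma` is only a stand-in reparameterization of the radial width, and the data is a toy stand-in):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))               # stand-in training data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

param_grid = {
    "gamma": [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2],  # plays the role of radial width
    "C": [10.0 ** k for k in range(-7, 4)],               # regularization, 1e-7 .. 1e3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_)
```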


- Without calibration, the best algorithms were bagged trees, random forests, and neural nets.
- After calibration, the best algorithms were calibrated boosted trees, calibrated random forests, bagged trees, PLT-calibrated SVMs, and neural nets.
- SVMs and boosted trees improve in the rankings with calibration.
- Interestingly, calibrating neural nets with PLT or ISO hurts their calibration.
- Some algorithms, such as memory-based methods (e.g. KNN), are unaffected by calibration.


Best model per data set:

- Letter - Boosted DT (PLT)
- Cover Type - Boosted DT (PLT)
- Adult - Boosted STMP (PLT)
- Protein coding - Boosted DT (PLT)
- MEDIS - Random Forest (PLT)
- MG - Bagged DT
- IndianPine92 - Boosted DT (PLT)
- California Housing - Boosted DT (PLT)
- Bacteria - Bagged DT
- SLAC - Random Forest (ISO)


- Neural nets perform well on all metrics on 10 of the 11 problems, but perform poorly on COD.
- If the COD problem had not been included, neural nets would move up 1-2 places in the rankings.



Bootstrap analysis:

- Randomly select a bootstrap sample from the original 11 problems.
- Randomly select a bootstrap sample of 8 metrics from the original 8 metrics.
- Rank the ten algorithms by mean performance across the sampled problems and metrics.
- Repeat the bootstrap sampling 1000 times, yielding 1000 potentially different rankings of the learning methods.
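A sketch of this procedure (numpy only; `perf` is a hypothetical algorithms x problems x metrics array of scores, higher being better):

```python
import numpy as np

rng = np.random.default_rng(0)
n_alg, n_prob, n_met = 10, 11, 8
perf = rng.random((n_alg, n_prob, n_met))     # stand-in for the measured performances

rank_counts = np.zeros((n_alg, n_alg))        # rank_counts[a, r]: times algorithm a placed (r+1)-th
for _ in range(1000):
    probs = rng.integers(0, n_prob, n_prob)   # bootstrap sample of problems
    mets = rng.integers(0, n_met, n_met)      # bootstrap sample of metrics
    mean_perf = perf[:, probs][:, :, mets].mean(axis=(1, 2))
    for r, a in enumerate(np.argsort(-mean_perf)):   # best algorithm first
        rank_counts[a, r] += 1

print(rank_counts / 1000)                     # empirical probability of each rank
```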

[Charts: bootstrap rank distributions for the best models (BST-DT, RF, BG-DT) and the worst models (Naïve Bayes, Log reg, DT)]
Probability of each rank (1st-10th) over the 1000 bootstrap samples:

Model      1st    2nd    3rd    4th    5th    6th    7th    8th    9th    10th
BST-DT     .580   .228   .160   .023   .009   .0     .0     .0     .0     .0
RF         .390   .525   .084   .001   .0     .0     .0     .0     .0     .0
BAG-DT     .030   .232   .571   .150   .017   .0     .0     .0     .0     .0
SVM        .0     .008   .148   .574   .240   .029   .001   .0     .0     .0
ANN        .0     .007   .035   .230   .606   .122   .0     .0     .0     .0
KNN        .0     .0     .0     .009   .114   .592   .245   .038   .002   .0
BST-STMP   .0     .0     .002   .013   .014   .257   .710   .004   .0     .0
DT         .0     .0     .0     .0     .0     .0     .004   .616   .291   .089
LOGREG     .0     .0     .0     .0     .0     .0     .040   .312   .423   .225
NB         .0     .0     .0     .0     .0     .0     .0     .030   .284   .686


Conclusions:

- The models that performed poorest were naïve Bayes, logistic regression, decision trees, and boosted stumps.
- Bagged trees, random forests, and neural nets give the best average performance without calibration.
- After calibration with Platt's method, boosted trees predict better probabilities than all other methods.
- At the same time, boosted stumps and logistic regression, which perform poorly on average, are the best models for some metrics.
- The effectiveness of an algorithm depends on the metric used and the data set.

The End