Slides of CounterExample Talk


Dick De Veaux

My JSM Presentation

Williams College

Note to self: Basically the same talk I gave somewhere else where I had about 10 times longer to give it. I hope it goes better this time where I only have 15 minutes.







SOME RECENT RESULTS IN STATISTICS THAT I’VE EITHER COME UP WITH MYSELF OR
BORROWED HEAVILY FROM THE LITERATURE.



The theory behind boosting is easy to understand via a binary classification problem.
For the time being, assume that the goal is to classify the members of some
population into two categories. For instance, the goal might be to determine whether a
medical patient has a certain disease or not. Typically these two categories are given
numerical representations such that the positive outcome (the patient has the disease)
equals 1 and the negative outcome (the patient does not have the disease) equals
−1. Using this notation, each example can be represented as a pair (y, x), where y ∈ {−1, 1} and x ∈ ℝ^p.

The boosting algorithm starts with a constant function, e.g. the mean or median of the
response values. After this, the algorithm proceeds iteratively. During every iteration it
trains a weak learner (defined as a rule that can classify examples slightly better than
random guessing) on a training set that weights more heavily those examples that the
previous weak learners found difficult to classify correctly. Iterating in this manner
produces a set of weak learners that can be viewed as a committee of classifiers working
together to correctly classify each training example. Within the committee each weak
learner has a vote on the final prediction. These votes are typically weighted such that
weak learners that perform well with respect to the training set have more relative
influence on the final prediction. The weighted predictions are then added together. The
sign of this sum forms the final prediction of the committee (resulting in a prediction of either +1 or −1).
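To make the committee idea concrete, here is a minimal AdaBoost-style sketch with decision stumps as the weak learners. It is an illustrative from-scratch version (assuming numpy and scikit-learn are available); the function names, the number of rounds, and the stopping rule are my own choices rather than anything from the talk.

```python
# Minimal AdaBoost-style committee of decision stumps (illustrative sketch,
# not the speaker's code). Assumes numpy and scikit-learn are installed.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """y must be coded as -1/+1. Returns a list of (alpha, stump) pairs."""
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)          # example weights, start uniform
    committee = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)   # weak learner
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)     # weighted training error
        if err >= 0.5 or err == 0:                    # stop if no longer "weak but useful"
            break
        alpha = 0.5 * np.log((1 - err) / err)         # vote weight for this learner
        w *= np.exp(-alpha * y * pred)                # up-weight misclassified examples
        w /= w.sum()
        committee.append((alpha, stump))
    return committee

def adaboost_predict(committee, X):
    """Sign of the weighted sum of votes gives the final -1/+1 prediction."""
    votes = sum(alpha * stump.predict(X) for alpha, stump in committee)
    return np.sign(votes)
```

The final committee prediction is simply the sign of the weighted sum of votes, exactly as described above.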



And a GREAT space and time consumer is to put lots of unnecessary bibliographic material on your visuals, especially if they point to your own previous work and you can get them into a microscopically small font like this one that even you can’t read and have NO IDEA why you ever even put it in there in the first place!!

Averaging

Statisticians and Averages




[1] Yang P.; Yang Y. H.; Zomaya A. Y. A Review of Ensemble Methods in Bioinformatics. Current Bioinformatics, 2010, 5, 296–308.

[2] Okun O. Feature Selection and Ensemble Methods for Bioinformatics: Algorithmic Classification and Implementations; SMARTTECCO: Malmö, 2011.

[3] Dettling M.; Buhlmann P. Boosting for Tumor Classification with Gene Expression Data. Seminar für Statistik, 2002, 19, 1061–1069.

[4] Politis D. N. In: Bagging Multiple Comparisons from Microarray Data, Proceedings of the 4th International Conference on Bioinformatics Research and Applications; Mandoiu I.; Sunderraman R.; Zelikovsky A., Eds.; Berlin/Heidelberg, Germany, 2008; pp. 492–503.

[5] Hastie T.; Tibshirani R.; Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, 2009.

[6] Fumera G.; Fabio R.; Alessandra S. A Theoretical Analysis of Bagging as a Linear Combination of Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 30, 1293–1299.

[7] Duffy N.; Helmbold D. Boosting Methods for Regression. Machine Learning, 2002, 47, 153–200.

[8] Breiman L. Random Forests. Machine Learning, 2001, 45, 5–32.

[9] Freund Y.; Schapire R. E. A Decision-Theoretic Generalization of Online Learning and an Application to Boosting. Journal of Computer and System Sciences, 1997, 55, 119–139.









 #  ResponseID         Good._rep  Residence  Advance  Hours  Demo_diverse  Rep_race  Rep_gender  Diversity_manage  Affirm_action  Heath_i
 1  R_86cF9x3ftcd2deY       2000       5000     3000    5000         1000      1000       1000             2000           1000    20000
 2  R_5pUnS9bwcjzcdG4      10000       5000    10000   20000        10000         0       5000            10000           5000    20000
 3  R_7V8JD1kpLPQ7WsY          0      10000    10000    5000          100       100       2000                0              0    10000
 4  R_ag75IPlrwZGHEsQ       5000       5000    10000   10000            0         0          0                0            500    10000
 5  R_2ifNjmOwO7PFkRm       3000       5000     7000    2000         4000      2000       1000             2000           1000     3000
 6  R_aeDSqohzkLu1712      10000       1000    20000    5000          300       300          0              500            100    10000
 7  R_eX8APsaQt7kCnVG      10000        500    30000   10000         5000         0        500                0              0    12000
 8  R_1SyE3EJUMq4o0Hq      10000       2000    15000    5000            0         0          0                0              0    15000
 9  R_aYn8a7ZAYwdlTYU       1000       1000    10000   10000            0         0          0              500           2000    15000


Our method


Performed really well


In fact, in all the data sets we found


Our method


Was the best


Was better than the other methods


In all the data sets we simulated


Our method


Was the best


Outperformed the other methods


Our method


Is faster

and easier to compute and has the smallest asymptotic variance compared to the other method that we found in the literature


So, now I’d like to show some of the results from our simulation studies, where we simulated data sets and tuned our method to optimize performance, comparing it to the other method, which we really didn’t know how to use

Penalized Regression

Least squares

Ridge Regression


Variations on a Regression Theme

Least squares

Ridge Regression

Lasso
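The equations on these slides were images and did not survive extraction. For reference, the standard objectives behind these three labels (usual notation, with λ ≥ 0 a tuning parameter) are:

```latex
% Standard objectives (reconstructed from the slide titles, not the original images)
\begin{align*}
\text{Least squares:} \quad & \hat\beta = \arg\min_{\beta} \; \|y - X\beta\|_2^2 \\
\text{Ridge:}         \quad & \hat\beta = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \\
\text{Lasso:}         \quad & \hat\beta = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
\end{align*}
```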







Forward Stepwise Regression



If we standardize x’s, start with r = y

Find x most correlated with r

Add x to fit, r = y - fit

Find x most correlated with r

Continue until no x is correlated enough



LAR is similar (Forward Stagewise)


Only enter as much of x as it “deserves”


Find xj most correlated with r,

βj ← βj + δj, where δj = ε · sign⟨xj, r⟩, until another variable is equally correlated

Move βj and βk in the direction defined by their joint least squares coefficient of the current residual on (xj, xk), until some other competitor x has as much correlation with the current residual.

Set r ← r − new predictions and repeat these steps many times
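Here is a minimal sketch of the incremental forward-stagewise procedure just described, assuming standardized predictors and a centered response; the step size eps and the fixed number of steps are illustrative assumptions, not values from the talk. (Plain forward stepwise differs in that it adds the chosen variable's full least-squares contribution at each step rather than a small increment.)

```python
# Incremental forward stagewise regression (illustrative sketch).
# Assumes the columns of X are standardized and y is centered.
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                      # start with the residual equal to y
    for _ in range(n_steps):
        corr = X.T @ r                # inner products <xj, r> for every predictor
        j = np.argmax(np.abs(corr))   # predictor most correlated with the residual
        delta = eps * np.sign(corr[j])
        beta[j] += delta              # nudge beta_j in the direction of the correlation
        r -= delta * X[:, j]          # update the residual with the new prediction
    return beta
```

As eps shrinks toward zero, the coefficient paths this produces are closely related to the ones LAR computes directly.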


LAR and Lasso


Boosting fits an Additive Model


Forward Stagewise Additive Modeling

1. f0(x) = 0
2. For m = 1, …, M
   a. Compute
   b. Set
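The update formulas in steps a and b were images and did not extract; the standard forward stagewise additive modeling steps this outline refers to (as in Hastie, Tibshirani & Friedman [5]) are:

```latex
% Standard FSAM steps (reconstructed; the slide's own images did not extract)
\begin{align*}
\text{a.}\quad (\beta_m, \gamma_m) &= \arg\min_{\beta,\gamma} \sum_{i=1}^{N} L\bigl(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\bigr) \\
\text{b.}\quad f_m(x) &= f_{m-1}(x) + \beta_m\, b(x;\gamma_m)
\end{align*}
```

Here L is the loss, b(x; γ) is the basis function (the weak learner), and f_m is the fit after m steps.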



Adaboost is a stagewise AM with




Basis functions are just the classifiers



Wait? What?


So Adaboost

Finds the next “weak learner” that minimizes the sum of the weighted exponential misclassifications


With overall weights equal to


Adds this to previous estimates
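The weight formula on this slide was an image; under exponential loss the standard expressions (again following [5]) are:

```latex
% Standard AdaBoost quantities under exponential loss (reconstructed, not the slide's image)
\begin{align*}
G_m &= \arg\min_{G} \sum_{i=1}^{N} w_i^{(m)}\, I\bigl(y_i \neq G(x_i)\bigr),
  \qquad w_i^{(m)} = \exp\bigl(-y_i\, f_{m-1}(x_i)\bigr), \\
f_m(x) &= f_{m-1}(x) + \beta_m\, G_m(x),
  \qquad \beta_m = \tfrac{1}{2}\log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}.
\end{align*}
```

Here err_m is the weighted misclassification rate of G_m on the training set.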


Adaboost is:

Forward stagewise additive model

With exponential loss function

Sensitive to misclassification since we use exponential (not misclassification) loss



Why exponential loss?


Squared error loss is least squares


What’s more robust?


Absolute Error loss


Huber loss


Truncated loss
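For reference (the formulas on this slide were images), the usual definitions behind these labels are below; exponential loss acts on the margin y·f(x), the others on the residual, and the "truncated" form shown is one common choice that may differ from the talk's own definition:

```latex
% Common loss functions behind these labels (standard forms; the truncated version
% shown here is one common choice and may differ from the talk's own definition)
\begin{align*}
\text{Exponential (margin):}\quad & L(y, f) = \exp\bigl(-y f(x)\bigr) \\
\text{Squared error:}\quad        & L(y, f) = \bigl(y - f(x)\bigr)^2 \\
\text{Absolute error:}\quad       & L(y, f) = \bigl|\,y - f(x)\,\bigr| \\
\text{Huber:}\quad                & L(y, f) =
  \begin{cases}
    \tfrac12\bigl(y - f(x)\bigr)^2, & |y - f(x)| \le \delta \\
    \delta\bigl(|y - f(x)| - \tfrac{\delta}{2}\bigr), & \text{otherwise}
  \end{cases} \\
\text{Truncated (one form):}\quad & L(y, f) = \min\!\bigl\{(y - f(x))^2,\; c^2\bigr\}
\end{align*}
```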




Gradient Boosting Machine


Take these ideas:


Loss function



Find



Just solve this by steepest descent





Oops --- one small problem

The derivatives are based only on the training data


They won’t generalize


So calculate them and “fit” them using trees
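A minimal sketch of the fix the slide describes: compute the negative gradient (the pseudo-residuals) on the training data and fit a small regression tree to it, so the step generalizes to new x. This is an illustrative least-squares version; the function names and parameters are my own, not the talk's code.

```python
# Gradient boosting with trees under squared-error loss (illustrative sketch).
# The negative gradient of 0.5*(y - f)^2 is simply the residual y - f.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=200, learning_rate=0.1, max_depth=3):
    y = np.asarray(y, dtype=float)
    f = np.full(len(y), y.mean())     # start from a constant (the mean)
    trees = []
    for _ in range(n_rounds):
        residual = y - f              # pseudo-residuals = negative gradient at the training points
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)         # "fit" the gradient with a tree so it generalizes
        f += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def gb_predict(init, trees, X, learning_rate=0.1):
    return init + learning_rate * sum(t.predict(X) for t in trees)
```

Swapping the squared-error residual for the gradient of an absolute-error or Huber loss gives the more robust variants mentioned above.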


Boosted Trees

Making it more Robust



Can handle 25% contamination of the original data set

Wine Data Set


11 predictors and a rating

Input variables (based on physicochemical tests):


1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data):

12 - quality (score between 0 and 10)
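To make the setup reproducible, here is a hedged sketch of fitting boosted trees to these data. It assumes the data are the UCI Wine Quality CSVs (semicolon-separated, at the URL shown, which is my assumption about the repository layout) and uses scikit-learn's gradient boosting rather than the talk's own robust method.

```python
# Fit gradient-boosted trees to the red wine quality data (illustrative sketch;
# the URL assumes the UCI Wine Quality repository layout).
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(url, sep=";")                 # 11 predictors plus 'quality'

X = wine.drop(columns="quality")
y = wine["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, max_depth=3)
gbm.fit(X_train, y_train)
print("Test R^2:", gbm.score(X_test, y_test))
```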

Performance Under Contamination - Red Wine

Performance Under Contamination - White Wine


Our method


Performed really well


In fact, in all the data sets we found


Our method


Was the best


Was better than the other methods


In all the data sets we simulated


Our method


Was the best


Outperformed the other methods


Our method


Is faster

and easier to compute and has the smallest asymptotic variance compared to the other method that we found in the literature