# Slides of the CounterExample Talk


Dick De Veaux

My JSM Presentation

Williams College

Note to self: it basically took 10 times longer to give it. I hope it goes better this time, where I only have 15 minutes.

SOME RECENT RESULTS IN STATISTICS THAT I'VE EITHER COME UP WITH MYSELF OR BORROWED HEAVILY FROM THE LITERATURE.

The theory behind boosting is easy to understand via a binary classification problem. Therefore, for the time being, assume that the goal is to classify the members of some population into two categories. For instance, the goal might be to determine whether a medical patient has a certain disease or not. Typically these two categories are given numerical representations such that the positive outcome (the patient has the disease) equals 1 and the negative outcome (the patient does not have the disease) equals −1. Using this notation, each example can be represented by a pair (y, x), where y ∈ {−1, 1} and x ∈ ℝ^p.

The boosting algorithm starts with a constant function, e.g. the mean or median of the response values. After this, the algorithm proceeds iteratively. During every iteration it trains a weak learner (defined as a rule that can classify examples slightly better than random guessing) on a training set that weights more heavily those examples that the previous weak learners found difficult to classify correctly. Iterating in this manner produces a set of weak learners that can be viewed as a committee of classifiers working together to correctly classify each training example. Within the committee each weak learner has a vote on the final prediction. These votes are typically weighted such that weak learners that perform well with respect to the training set have more relative influence on the final prediction. The weighted predictions are then added together. The sign of this sum forms the final prediction of the committee (resulting in a prediction of either +1 or −1).
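To make the weighted-committee idea concrete, here is a minimal sketch in Python with NumPy, using decision stumps as the weak learners. The stump search, the vote weights, and the re-weighting of hard examples follow the usual AdaBoost recipe [9]; the function names and the number of rounds are illustrative, not anything taken from the talk.

```python
import numpy as np

def fit_stump(X, y, w):
    """Find the threshold stump (feature, cutpoint, sign) with the
    smallest weighted misclassification error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, s, -s)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, M=50):
    """Train a committee of M stumps; y must be coded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # example weights, start uniform
    committee = []                             # list of (alpha, feature, cut, sign)
    for _ in range(M):
        err, j, t, s = fit_stump(X, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this weak learner
        pred = np.where(X[:, j] <= t, s, -s)
        w *= np.exp(-alpha * y * pred)         # up-weight the examples it missed
        w /= w.sum()
        committee.append((alpha, j, t, s))
    return committee

def predict(committee, X):
    """Sign of the weighted sum of the weak learners' votes."""
    F = np.zeros(len(X))
    for alpha, j, t, s in committee:
        F += alpha * np.where(X[:, j] <= t, s, -s)
    return np.sign(F)
```

Weak learners with small weighted error get large vote weights, and the examples they misclassify get up-weighted for the next round, exactly as described above.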

And a GREAT space and time consumer is to put lots of unnecessary bibliographic material on your visuals, especially if they point to your own previous work and you can get them into a microscopically small font like this one that even you can't read and have NO IDEA why you ever even put it in there in the first place!!

Averaging

Statisticians and Averages

[1] Yang P.; Yang Y. H.; Zomaya A. Y. A Review of Ensemble Methods in Bioinformatics. Current Bioinformatics, 2010, 5, 296–308.

[2] Okun O. Feature Selection and Ensemble Methods for Bioinformatics: Algorithmic Classification and Implementations; SMARTTECCO: Malmö, 2011.

[3] Dettling M.; Buhlmann P. Boosting for Tumor Classification with Gene Expression Data. Seminar fur Statistik, 2002, 19, 1061–1069.

[4] Politis D. N. In: Bagging Multiple Comparisons from Microarray Data, Proceedings of the 4th International Conference on Bioinformatics Research and Applications; Mandoiu I.; Sunderraman R.; Zelikovsky A., Eds.; Berlin/Heidelberg, Germany, 2008; pp. 492–503.

[5] Hastie T.; Tibshirani R.; Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, 2009.

[6] Fumera G.; Fabio R.; Alessandra S. A Theoretical Analysis of Bagging as a Linear Combination of Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 30, 1293–1299.

[7] Duffy N.; Helmbold D. Boosting Methods for Regression. Machine Learning, 2002, 47, 153–200.

[8] Breiman L. Random Forests. Machine Learning, 2001, 45, 5–32.

[9] Freund Y.; Schapire R. E. A Decision-Theoretic Generalization of Online Learning and an Application to Boosting. Journal of Computer and System Sciences, 1997, 55, 119–139.

|   | ResponseID        | Good._rep | Residence | Hours | Demo_diverse | Rep_race | Rep_gender | Diversity_manage | Affirm_action | Heath_i |       |
|---|-------------------|-----------|-----------|-------|--------------|----------|------------|------------------|---------------|---------|-------|
| 1 | R_86cF9x3ftcd2deY | 2000  | 5000  | 3000  | 5000  | 1000  | 1000 | 1000 | 2000  | 1000 | 20000 |
| 2 | R_5pUnS9bwcjzcdG4 | 10000 | 5000  | 10000 | 20000 | 10000 | 0    | 5000 | 10000 | 5000 | 20000 |
| 3 | R_7V8JD1kpLPQ7WsY | 0     | 10000 | 10000 | 5000  | 100   | 100  | 2000 | 0     | 0    | 10000 |
| 4 | R_ag75IPlrwZGHEsQ | 5000  | 5000  | 10000 | 10000 | 0     | 0    | 0    | 0     | 500  | 10000 |
| 5 | R_2ifNjmOwO7PFkRm | 3000  | 5000  | 7000  | 2000  | 4000  | 2000 | 1000 | 2000  | 1000 | 3000  |
| 6 | R_aeDSqohzkLu1712 | 10000 | 1000  | 20000 | 5000  | 300   | 300  | 0    | 500   | 100  | 10000 |
| 7 | R_eX8APsaQt7kCnVG | 10000 | 500   | 30000 | 10000 | 5000  | 0    | 500  | 0     | 0    | 12000 |
| 8 | R_1SyE3EJUMq4o0Hq | 10000 | 2000  | 15000 | 5000  | 0     | 0    | 0    | 0     | 0    | 15000 |
| 9 | R_aYn8a7ZAYwdlTYU | 1000  | 1000  | 10000 | 10000 | 0     | 0    | 0    | 500   | 2000 | 15000 |

Our method

Performed really well

In fact, in all the data sets we found

Our method

Was the best

Was better than the other methods

In all the data sets we simulated

Our method

Was the best

Outperformed the other methods

Our method

Is faster

and easier to compute and has the smallest
asymptotic variance compared to the other method that
we found in the literature

So, now I'd like to show some of the results from our simulation studies, where we simulated data sets and tuned our method to optimize performance, comparing it to the other method, which we really didn't know how to use

Penalized Regression

Variations on a Regression Theme

Least squares

Ridge Regression

Lasso
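The formulas behind these slides did not survive extraction; presumably they are the standard objectives, reproduced here for reference (λ is the penalty parameter):

```latex
\begin{aligned}
\text{Least squares:}\quad & \hat\beta = \arg\min_{\beta} \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \textstyle\sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 \\
\text{Ridge:}\quad         & \hat\beta = \arg\min_{\beta} \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \textstyle\sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \\
\text{Lasso:}\quad         & \hat\beta = \arg\min_{\beta} \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \textstyle\sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
\end{aligned}
```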

Forward Stepwise Regression

Find the x most correlated with r; fit

Find the next x most correlated with r

Continue until no x is correlated enough
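A minimal sketch of that greedy loop in Python with NumPy; the correlation cutoff `tol` and the refit on the selected columns are illustrative choices, not something specified on the slide.

```python
import numpy as np

def forward_stepwise(X, y, tol=0.1):
    """Greedy forward stepwise regression: repeatedly add the column most
    correlated with the current residual, then refit on all selected columns."""
    n, p = X.shape
    selected = []
    beta = np.array([y.mean()])             # intercept-only model to start
    r = y - y.mean()                        # residual starts at centered y
    while True:
        corrs = np.array([0.0 if j in selected else
                          abs(np.corrcoef(X[:, j], r)[0, 1]) for j in range(p)])
        j = int(np.argmax(corrs))
        if corrs[j] < tol:                  # no remaining x is correlated enough
            break
        selected.append(j)
        Xs = np.column_stack([np.ones(n), X[:, selected]])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)   # the "fit" step
        r = y - Xs @ beta                   # new residual
    return selected, beta
```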

LAR is similar (Forward Stagewise)

Only enter as much of x as it “deserves”

Find the xj most correlated with r

βj ← βj + δj, where δj = ε · sign⟨r, xj⟩, until another variable xk has as much correlation with the current residual

Move βj and βk in the direction defined by their joint least squares coefficient of the current residual on (xj, xk), until some other competitor x has as much correlation with the current residual.

Set r ← r − new predictions and repeat steps many times
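And a sketch of the incremental (forward stagewise) version just described, again in NumPy; the step size `eps` and the number of steps are illustrative.

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    """Forward stagewise regression: at each step nudge the coefficient of the
    predictor most correlated with the current residual by a tiny amount eps."""
    n, p = X.shape
    Xs = (X - X.mean(0)) / X.std(0)     # standardize so correlations are comparable
    beta = np.zeros(p)                  # coefficients on the standardized scale
    r = y - y.mean()                    # current residual
    for _ in range(n_steps):
        c = Xs.T @ r                    # inner products <r, x_j> with the residual
        j = int(np.argmax(np.abs(c)))
        delta = eps * np.sign(c[j])     # delta_j = eps * sign<r, x_j>
        beta[j] += delta
        r -= delta * Xs[:, j]           # r <- r - new predictions
    return beta
```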

LAR and Lasso

Forward Stagewise

1. f0(x) = 0
2. For m = 1, …, M
    a. Compute
    b. Set
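The "Compute" and "Set" formulas did not survive extraction; presumably they are the usual forward stagewise additive modeling steps, as in [5]:

```latex
\begin{aligned}
&\text{(a) Compute:}\quad (\beta_m, \gamma_m)
   = \arg\min_{\beta,\gamma} \sum_{i=1}^{n}
     L\bigl(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\bigr) \\
&\text{(b) Set:}\quad f_m(x) = f_{m-1}(x) + \beta_m\, b(x;\gamma_m)
\end{aligned}
```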

Boosting is a stagewise AM (additive model) with exponential loss

The basis functions are just the classifiers

Wait? What?

So

It finds the next "weak learner" that minimizes the sum of the weighted exponential misclassifications, with overall weights as given below

That is: forward stagewise with an exponential loss function

Sensitive to misclassification, since we use exponential (not misclassification) loss
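The loss and weights being referred to are presumably the standard AdaBoost quantities, again following [5]:

```latex
\begin{aligned}
&\text{Exponential loss:}\quad L\bigl(y, f(x)\bigr) = \exp\bigl(-y\, f(x)\bigr) \\
&\text{Weak learner at step } m:\quad
   G_m = \arg\min_{G} \sum_{i=1}^{n} w_i^{(m)}\, I\bigl(y_i \neq G(x_i)\bigr),
   \qquad w_i^{(m)} = \exp\bigl(-y_i\, f_{m-1}(x_i)\bigr)
\end{aligned}
```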

Why exponential loss?

Squared error loss is least squares

What’s more robust?

Absolute Error loss

Huber loss

Truncated loss
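For reference, the usual forms of these losses (δ is the Huber cutoff; the truncated loss additionally caps the penalty beyond some cutoff):

```latex
\begin{aligned}
\text{Squared error:}\quad & L(y, f) = (y - f)^2 \\
\text{Absolute error:}\quad & L(y, f) = |y - f| \\
\text{Huber:}\quad & L(y, f) =
  \begin{cases}
    \tfrac{1}{2}(y - f)^2 & |y - f| \le \delta \\
    \delta\bigl(|y - f| - \tfrac{\delta}{2}\bigr) & |y - f| > \delta
  \end{cases}
\end{aligned}
```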

Take these ideas:

Loss function

Find

Just solve this by steepest descent
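"Find" and "solve by steepest descent" presumably refer to the usual functional-gradient step: minimize the total training loss over the fitted values and, at each iteration, compute the negative gradient (pseudo-residual) at every training point:

```latex
\begin{aligned}
&\hat f = \arg\min_{f} \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr), \\
&r_{im} = -\left[\frac{\partial L\bigl(y_i, f(x_i)\bigr)}{\partial f(x_i)}\right]_{f = f_{m-1}}
\qquad\text{(the pseudo-residual fitted at step } m\text{)}
\end{aligned}
```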

Oops: one small problem

The derivatives are based only on the training data

They won't generalize

So calculate them and "fit" them using trees
Boosted Trees
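A minimal sketch of that idea in Python, fitting small regression trees to the negative gradients; with squared-error loss the gradients are just the residuals. It uses scikit-learn's DecisionTreeRegressor, and the learning rate, depth, and number of rounds are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_trees(X, y, M=200, lr=0.1, depth=3):
    """Gradient boosting with squared-error loss: each round fits a small tree
    to the current residuals (the negative gradient of squared-error loss)."""
    f0 = np.mean(y)                     # start from a constant function
    F = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        resid = y - F                   # negative gradient for squared error
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, resid)
        F += lr * tree.predict(X)       # take a small step in that direction
        trees.append(tree)
    return f0, trees

def predict(f0, trees, X, lr=0.1):
    """Sum of the shrunken tree predictions; lr must match the training value."""
    return f0 + lr * sum(t.predict(X) for t in trees)
```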

Making it more Robust

Can handle 25% contamination of the original data set

Wine Data Set

11 predictors and a rating

Input variables (based on physicochemical tests):

1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data):

12 - quality (score between 0 and 10)
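A sketch of how one might fit a boosted model to this data set with a robust loss. The file name assumes a local copy of the UCI wine-quality CSV (semicolon-separated), and scikit-learn's Huber loss stands in here for whatever robustification the talk actually used.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Assumes a local copy of the UCI red-wine quality file (semicolon-separated).
wine = pd.read_csv("winequality-red.csv", sep=";")
X, y = wine.drop(columns="quality"), wine["quality"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Huber loss is less sensitive to contaminated or outlying ratings than squared error.
gbm = GradientBoostingRegressor(loss="huber", n_estimators=500,
                                learning_rate=0.05, max_depth=3)
gbm.fit(X_tr, y_tr)
print("test R^2:", gbm.score(X_te, y_te))
```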

Performance Under Contamination - Red Wine

Performance Under Contamination - White Wine

Our method

Performed really well

In fact, in all the data sets we found

Our method

Was the best

Was better than the other methods

In all the data sets we simulated

Our method

Was the best

Outperformed the other methods

Our method

Is faster

and easier to compute and has the smallest asymptotic variance
compared to the other method that we found in the literature