CSIS8502

9.Boosting


Boosting

A kind of ensemble method for supervised learning.

Creates a single strong learner from a set of weak learners.

Weak learner: one that is only slightly better than random guessing.

Strong learner: a highly accurate classifier.


Ensemble Methods

Combining classifiers.

Simplest case: by simple majority vote.

Consider 3 independent classifiers, each with accuracy 55%.

The error rate of the majority vote of the 3 classifiers is

  C(3,2) (0.55)(0.45)^2 + C(3,3) (0.45)^3 = 42.5%

For 5 classifiers:

  C(5,3) (0.55)^2 (0.45)^3 + C(5,4) (0.55)(0.45)^4 + C(5,5) (0.45)^5 = 40.7%

As n → ∞, this error becomes arbitrarily small.


Ensemble Methods

Error rate of a majority vote of i independent classifiers, each with accuracy p:

  i          3      5      7      9     11     13     15     17     19
  p = 0.55  42.5%  40.7%  39.2%  37.9%  36.7%  35.6%  34.6%  33.7%  32.9%
  p = 0.65  28.2%  23.5%  20.0%  17.2%  14.9%  12.9%  11.3%   9.9%   8.7%
  p = 0.75  15.6%  10.4%   7.1%   4.9%   3.4%   2.4%   1.7%   1.2%   0.9%
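The table values can be reproduced by summing the binomial tail, i.e. the probability that more than half of the i classifiers err. A minimal sketch (the function name is illustrative):

```python
from math import comb

def majority_vote_error(n, p):
    """Error rate of a majority vote of n independent classifiers,
    each with individual accuracy p (n odd): the vote errs when
    more than half of the classifiers are wrong."""
    q = 1 - p  # individual error rate
    return sum(comb(n, k) * q**k * p**(n - k) for k in range(n // 2 + 1, n + 1))

for p in (0.55, 0.65, 0.75):
    row = [round(100 * majority_vote_error(n, p), 1) for n in range(3, 21, 2)]
    print(f"p = {p}: {row}")
```

For example, `majority_vote_error(3, 0.55)` gives 0.42525, matching the 42.5% entry above.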


Voting Classiﬁers

Intuitively, if the classifiers are very different from each other, combining multiple classifiers can be helpful.

Consider a two-class problem with class labels {+1, -1} and a set of classifiers f_i(x) ∈ {+1, -1}. We have the majority voting decision:

  F(x) = sign( Σ_{i=1}^{M} f_i(x) )

where

  sign(x) = +1 if x > 0, -1 if x < 0

This is the main idea of ensemble methods.
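The voting decision translates directly into code; a minimal sketch (names are illustrative; ties are broken toward -1 here, and with an odd number of classifiers the sum is never 0):

```python
def majority_vote(classifiers, x):
    """F(x) = sign(sum_i f_i(x)) for two-class classifiers f_i(x) in {+1, -1}.
    `classifiers` is a list of functions mapping a sample x to +1 or -1."""
    total = sum(f(x) for f in classifiers)
    return 1 if total > 0 else -1
```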


Why Ensemble Methods

Looking for a set of weak learners (accuracy slightly > 0.5) is much easier, and more efficient, than finding a strong learner (one with high accuracy).

There are different classes of base models that can be chosen:

Naive Bayes Classiﬁer

Decision Tree Classiﬁer

k-Nearest Neighbors

Neural Networks

SVMs...


How to ﬁnd Base Models

Train a diverse set of models on the same dataset.

Train a set of models from a specific class of learners by using diversity in datasets, parameters, or initial conditions.

Examples are bagging and boosting.


Bagging

The name Bagging comes from Bootstrap Aggregation.

Use multiple versions of a training set, by drawing subsets of the original training data randomly with replacement, and train a classifier on each of them.

Combine the classifiers using majority vote.
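A minimal sketch of the procedure, assuming a user-supplied `train_fn` that fits one classifier on a sample (all names here are illustrative):

```python
import random
from collections import Counter

def bagging(train_fn, data, n_models=25, seed=0):
    """Bootstrap Aggregation: train n_models classifiers, each on a
    bootstrap sample of the data (same size, drawn with replacement),
    and combine their predictions by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # bootstrap sample
        models.append(train_fn(sample))

    def predict(x):
        votes = Counter(m(x) for m in models)  # majority vote over models
        return votes.most_common(1)[0][0]

    return predict
```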


Boosting

In bagging, the training sets are randomly drawn, and the weights of all samples are the same.

However, if a data point is already correctly classified by the majority of the classifiers, then even if it is misclassified by the next one, this will not affect the overall performance.

Hence we need only concentrate on the misclassified data.

This is the idea of boosting.


PAC Learning

Probably Approximately Correct Learning.

Strong PAC learning: demand small error with high probability:

  for any distribution, with arbitrarily small ε > 0 and δ > 0, and given polynomially many random examples, find a classifier with error ≤ ε with probability ≥ 1 - δ.

Weak PAC learning: same as strong PAC learning, except that instead of requiring arbitrarily small ε and δ, we only require

  error ≤ (1/2)(1 - γ), for some fixed γ > 0

[Note: sometimes this is written as error ≤ 1/2 - γ.]


Boosting

Assume that a given weak learning algorithm can consistently find classifiers at least slightly better than random (e.g. accuracy 55% in a two-class setting).

Given sufficient data, a boosting algorithm can provably construct a single classifier (e.g. by majority voting of the ensemble classifiers) with very high accuracy, say, 99%.


Formal Description of Boosting

Given a training set {(x_1, y_1), …, (x_m, y_m)},

y_i ∈ {-1, +1} are the correct labels of instances x_i ∈ X.

For t = 1, …, T (iterations):

  construct a distribution D_t on {1, …, m}, a weight on every sample;

  find a weak classifier

    h_t : X → {-1, +1}

  with a small error ε_t on D_t:

    ε_t = Pr_{D_t}[ h_t(x_i) ≠ y_i ]

Output the final classifier H_final as a combination of the h_t, t = 1, …, T.


Boosting Algorithm

Early boosting algorithms:

Boosting by Filtering (by Schapire)

Boosting by Majority (by Freund)

These algorithms are complex, and are predecessors to the AdaBoost algorithm.


AdaBoost Algorithm

Initialization: D_1(i) = 1/m.

Iterate for t = 1, …, T: given D_t and h_t,

  D_{t+1}(i) = D_t(i)/Z_t × e^{-α_t}  if y_i = h_t(x_i)
  D_{t+1}(i) = D_t(i)/Z_t × e^{α_t}   if y_i ≠ h_t(x_i)

i.e.

  D_{t+1}(i) = D_t(i)/Z_t × exp(-α_t y_i h_t(x_i))

where

  Z_t = normalization constant                                            (1)

  α_t = (1/2) ln( (1 - ε_t)/ε_t ) > 0                                     (2)

  ε_t = Σ_{i=1}^{m} D_t(i) [h_t(x_i) ≠ y_i], the weighted error rate of h_t   (3)

Final classifier:

  H_final(x) = sign( Σ_t α_t h_t(x) )
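The algorithm above can be sketched in Python. The decision-stump weak learner and all names here are illustrative, not part of the slides:

```python
import numpy as np

def stump(X, y, D):
    """Illustrative weak learner: the one-feature threshold classifier
    with the smallest weighted error under the distribution D."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sgn in (1, -1):
                pred = np.where(X[:, j] <= thr, sgn, -sgn)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sgn)
    _, j, thr, sgn = best
    return lambda Z: np.where(Z[:, j] <= thr, sgn, -sgn)

def adaboost(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    hs, alphas = [], []
    for _ in range(T):
        h = stump(X, y, D)
        pred = h(X)
        eps = D[pred != y].sum()               # weighted error, eqn (3)
        if eps == 0:                           # perfect weak classifier
            hs.append(h)
            alphas.append(1.0)
            break
        if eps >= 0.5:                         # no longer better than random
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # eqn (2)
        D = D * np.exp(-alpha * y * pred)      # reweight the samples
        D = D / D.sum()                        # normalize by Z_t, eqn (1)
        hs.append(h)
        alphas.append(alpha)

    def H(Z):
        """H_final(x) = sign(sum_t alpha_t h_t(x))."""
        return np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))

    return H
```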


Toy Example (R.Schapire)

Simple weak learning algorithm: split on a single axis, using vertical or horizontal half-planes.


Toy Example – Round 1


Toy Example – Round 2


Toy Example – Round 3


Toy Example – Final Classiﬁer


Training Error in Adaboost

Theorem:

Let γ_t = 1/2 - ε_t. Then

  training error(H_final) ≤ Π_t [ 2 √(ε_t (1 - ε_t)) ] = Π_t √(1 - 4γ_t²) ≤ exp( -2 Σ_t γ_t² )

Hence, if γ is a lower bound of the γ_t, i.e. γ_t ≥ γ > 0 ∀t, then

  training error(H_final) ≤ e^{-2γ²T}
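The identity and the exponential bound in the theorem can be checked numerically (the ε values below are arbitrary):

```python
import math

for eps in (0.1, 0.25, 0.4, 0.45):
    gamma = 0.5 - eps
    factor = 2 * math.sqrt(eps * (1 - eps))
    # identity: 2*sqrt(eps*(1-eps)) == sqrt(1 - 4*gamma^2)
    assert abs(factor - math.sqrt(1 - 4 * gamma**2)) < 1e-12
    # each product factor is bounded by exp(-2*gamma_t^2)
    assert factor <= math.exp(-2 * gamma**2)
```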


Proof

Let f(x) = Σ_t α_t h_t(x); then H_final(x) = sign(f(x)).

Note that D_t is defined recursively. Hence, by unwrapping the recurrence, we get

  D_final(i) = (1/m) exp( -y_i Σ_t α_t h_t(x_i) ) / Π_t Z_t

             = (1/m) exp( -y_i f(x_i) ) / Π_t Z_t


Proof

training error(H_final) = (1/m) Σ_i [ 1 if y_i ≠ H_final(x_i); 0 otherwise ]

  = (1/m) Σ_i [ 1 if y_i f(x_i) ≤ 0; 0 otherwise ]

  ≤ (1/m) Σ_i exp( -y_i f(x_i) )

  = Σ_i D_final(i) Π_t Z_t

  = Π_t Z_t


Proof

Finally,

  Z_t = Σ_i D_t(i) exp( -α_t y_i h_t(x_i) )

      = Σ_{i: y_i ≠ h_t(x_i)} D_t(i) e^{α_t} + Σ_{i: y_i = h_t(x_i)} D_t(i) e^{-α_t}

      = ε_t e^{α_t} + (1 - ε_t) e^{-α_t}

      = 2 √( ε_t (1 - ε_t) )    by eqn (2)
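The last step can be checked numerically: substituting α_t from eqn (2) into ε_t e^{α_t} + (1 - ε_t) e^{-α_t} yields 2√(ε_t(1 - ε_t)) (the ε values below are arbitrary):

```python
import math

for eps in (0.1, 0.3, 0.45):
    alpha = 0.5 * math.log((1 - eps) / eps)  # eqn (2)
    Z = eps * math.exp(alpha) + (1 - eps) * math.exp(-alpha)
    assert abs(Z - 2 * math.sqrt(eps * (1 - eps))) < 1e-12
```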


Expected Performance

Expected performance for typical classifiers:

We would expect:

the training error will continue to drop (even reaching 0);

however, the test error will first decrease and then increase, when H_final becomes too complex.

Occam's Razor – a simpler rule is better.

Overfitting is expected to occur.


Actual Performance (Schapire)

Experiment: using a letter dataset, with the decision tree method C4.5 as the weak learner in AdaBoost.

Actual performance:

Note that the test error does not increase, even after 1000 rounds (where the total # of nodes in all C4.5 trees is > 2M).

The test error continues to drop even after the training error reaches 0.

Occam's Razor wrongly predicts that the "simpler" rule is better.


Margins

Training error only measures whether classifications are right or wrong.

Margins tell us the confidence of a classification.

H_final is a weighted majority vote of weak classifiers.

Define margin = strength of the vote
  = (weighted fraction voting correctly) - (weighted fraction voting incorrectly)

Margin:

  margin(x, y) = y Σ_t α_t h_t(x) / Σ_t α_t ∈ [-1, +1]
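The verbal definition translates directly into code; normalizing by Σ_t α_t puts the margin in [-1, +1] (names are illustrative):

```python
def margin(alphas, preds, y):
    """Margin of one sample: (weighted fraction of weak classifiers
    voting correctly) minus (weighted fraction voting incorrectly).
    preds[t] is h_t(x) in {+1, -1}; y is the true label."""
    total = sum(alphas)
    # y * preds[t] is +1 when h_t votes correctly and -1 otherwise,
    # so this weighted sum is exactly (correct weight - incorrect weight)
    return sum(a * (y * p) for a, p in zip(alphas, preds)) / total
```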


Margins

Cumulative distribution of margins of training examples:


Margins

If all margins are large, we can approximate the final classifier by a much smaller classifier (compare: polls can predict a not-too-close election).

Large margin → better generalization.

Boosting tends to increase the margins of the training data.

Hence,

although the final classifier is getting larger,

the margins are likely to be increasing,

so the final classifier is actually getting close to a simpler classifier,

thus driving down the test error.


Boosting, SVM & Margins

We can map the data to a high-dimensional feature space by the classifiers:

  φ(x) = ( h_1(x), h_2(x), …, h_T(x) )ᵀ


Boosting, SVM & Margins

The final classifier is just a linear decision function in the high-dimensional space ⇒ SVM.

The classification score is the margin of this linear classifier.

This is also the definition of margin in the previous slide.


Multiclass Problem

y ∈ Y = {1, …, k}

Direct approach:

  h_t : X → Y

  D_{t+1}(i) = D_t(i)/Z_t × e^{-α_t}  if y_i = h_t(x_i)
  D_{t+1}(i) = D_t(i)/Z_t × e^{α_t}   if y_i ≠ h_t(x_i)

  H_final(x) = arg max_{y ∈ Y} Σ_{t: h_t(x) = y} α_t

It can be proved that this has the same error bound if ε_t ≤ 1/2 for all t.

Can also use standard multiclass-to-two-class techniques.
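The multiclass combination rule above can be sketched as follows (names are illustrative):

```python
from collections import defaultdict

def multiclass_vote(alphas, preds):
    """H_final(x) = argmax_y of the total alpha_t over rounds t
    where h_t(x) = y.  preds[t] is the class predicted by h_t
    for one sample x."""
    score = defaultdict(float)
    for a, y in zip(alphas, preds):
        score[y] += a  # accumulate the weight of each class's votes
    return max(score, key=score.get)
```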


Ranking Problem

Other problems can be handled by reducing them to a binary problem,

e.g. ranking of objects from examples:

it can be reduced to multiple binary questions: "Is Object A preferred to Object B?",

and AdaBoost is then applied.


Practical Advantages of Adaboost

fast, simple and easy to program

no parameters to tune (except T)

flexible – can be combined with any learning algorithm

no prior knowledge needed about the weak learner

provably effective, provided that a simple weak classifier can be found consistently

(a shift of mindset – the goal is now merely to find a mediocre classifier barely better than random guessing)

versatile – can be used with data that is textual, numeric or discrete


Caveats

Performance depends on the data and on the weak learner.

As predicted by theory, AdaBoost can fail if:

the weak classifier is too complex (causing overfitting)

the weak classifier is too weak (gives low margins, and hence causes overfitting)

there is noise in the data

Empirical results show that AdaBoost seems susceptible to uniform noise.

