# CSIS8502 - 9. Boosting

AI and Robotics

Nov 7, 2013
## Boosting

- A kind of ensemble method for supervised learning.
- Creates a single strong learner from a set of weak learners.
- Weak learner: one that is slightly better than random guessing.
- Strong learner: a highly accurate classifier.
## Ensemble Methods

- Combining classifiers.
- Simplest case: by simple majority vote.
- Consider 3 independent classifiers, each with accuracy 55%.
- The error rate of the majority vote of the 3 classifiers (at least 2 of the 3 are wrong) is:

$$\binom{3}{2}(0.55)(0.45)^2 + \binom{3}{3}(0.45)^3 = 42.5\%$$
- For 5 classifiers (at least 3 of the 5 are wrong):

$$\binom{5}{3}(0.55)^2(0.45)^3 + \binom{5}{4}(0.55)(0.45)^4 + \binom{5}{5}(0.45)^5 = 40.7\%$$

- As $n \to \infty$, this error becomes arbitrarily small.
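The computation above is easy to check with a short script (a sketch; the function name `majority_vote_error` is chosen here, not from the slides):

```python
from math import comb

def majority_vote_error(p, n):
    """Probability that a majority of n independent classifiers,
    each with accuracy p, votes for the wrong class (n odd)."""
    q = 1 - p  # individual error rate
    # The ensemble errs when at least (n + 1)/2 of the n voters are wrong.
    return sum(comb(n, k) * q**k * p**(n - k)
               for k in range((n + 1) // 2, n + 1))

print(f"{majority_vote_error(0.55, 3):.1%}")  # 42.5%
print(f"{majority_vote_error(0.55, 5):.1%}")  # 40.7%
```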
## Ensemble Methods

Majority-vote error rate for $n$ independent classifiers, each with individual accuracy $p$:

| $n$ | 3 | 5 | 7 | 9 | 11 | 13 | 15 | 17 | 19 |
|---|---|---|---|---|---|---|---|---|---|
| $p = 0.55$ | 42.5% | 40.7% | 39.2% | 37.9% | 36.7% | 35.6% | 34.6% | 33.7% | 32.9% |
| $p = 0.65$ | 28.2% | 23.5% | 20.0% | 17.2% | 14.9% | 12.9% | 11.3% | 9.9% | 8.7% |
| $p = 0.75$ | 15.6% | 10.4% | 7.1% | 4.9% | 3.4% | 2.4% | 1.7% | 1.2% | 0.9% |
## Voting Classifiers

- Intuitively, if the classifiers are very different from each other, combining multiple classifiers can be helpful.
- Consider a two-class problem with class labels $\{+1, -1\}$ and a set of classifiers $f_i(x) \in \{+1, -1\}$. The majority voting decision is:

$$F(x) = \mathrm{sign}\left(\sum_{i=1}^{M} f_i(x)\right)$$

where

$$\mathrm{sign}(x) = \begin{cases} +1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \end{cases}$$

- This is the main idea of ensemble methods.
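The voting decision is one line of code (a minimal sketch; the three classifier predictions are made up for illustration):

```python
import numpy as np

# Majority vote F(x) = sign(sum_i f_i(x)); rows are classifiers f_1..f_3,
# columns are two hypothetical inputs.
preds = np.array([[+1, -1],
                  [+1, -1],
                  [-1, -1]])
F = np.sign(preds.sum(axis=0))
print(F)  # [ 1 -1]
```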
## Why Ensemble Methods

- Finding a set of weak learners (accuracy slightly > 0.5) is much easier, and more efficient, than finding a strong learner (one with high accuracy).
- There are different classes of base models that can be chosen:
  - Naive Bayes classifier
  - Decision tree classifier
  - k-nearest neighbors
  - Neural networks
  - SVMs, ...
## How to Find Base Models

- Train a diverse set of models on the same dataset.
- Train a set of models from a specific class of learners by using diversity in datasets, parameters, or initial conditions. Examples are bagging and boosting.
## Bagging

- The name Bagging comes from Bootstrap Aggregation.
- Use multiple versions of the training set, each drawn as a subset of the original training data randomly with replacement, and train a classifier over each of them.
- Combine the classifiers using a majority vote.
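A minimal sketch of the procedure (the toy `train_stump` base learner, the data, and all names here are illustrative, not from the slides):

```python
import numpy as np

def train_stump(Xs, ys):
    """Toy base learner: threshold halfway between the two class means."""
    pos, neg = Xs[ys > 0], Xs[ys < 0]
    if len(pos) == 0 or len(neg) == 0:          # degenerate resample
        c = 1 if ys.sum() >= 0 else -1
        return lambda Xt: np.full(len(Xt), c)
    th = (pos.mean() + neg.mean()) / 2.0
    return lambda Xt: np.where(Xt > th, 1, -1)

def bagging_predict(X, y, X_test, B, rng):
    """Train B classifiers on bootstrap resamples; combine by majority vote."""
    m = len(y)
    votes = np.zeros(len(X_test))
    for _ in range(B):
        idx = rng.integers(0, m, size=m)        # draw m samples with replacement
        model = train_stump(X[idx], y[idx])
        votes += model(X_test)
    return np.sign(votes)

X = np.array([0., 1., 2., 4., 5., 6.])
y = np.array([-1, -1, -1, 1, 1, 1])
print(bagging_predict(X, y, np.array([-1., 7.]), B=25,
                      rng=np.random.default_rng(0)))  # [-1.  1.]
```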
## Boosting

- In bagging, the training sets are randomly drawn, and all samples carry the same weight.
- However, if a data point is already correctly classified by the majority of the classifiers, then even if it is misclassified by the next one, the overall performance is unaffected.
- Hence we need only concentrate on the misclassified data.
- This is the idea of boosting.
## PAC Learning

Probably Approximately Correct learning.

- Strong PAC learning: demand small error with high probability:
  - For any distribution, with arbitrarily small $\epsilon > 0$ and $\delta > 0$, and given polynomially many random examples, find a classifier with error $\leq \epsilon$ with probability $\geq 1 - \delta$.
- Weak PAC learning: same as strong PAC learning, except that instead of requiring arbitrarily small $\epsilon$ and $\delta$, we only require

$$\epsilon \leq \frac{1}{2}(1 - \gamma), \quad \text{for some fixed } \gamma > 0$$

  [Note: sometimes this is written as $\epsilon \leq \frac{1}{2} - \gamma$.]
## Boosting

- Assume that a given weak learning algorithm can consistently find classifiers at least slightly better than random (e.g. accuracy ≥ 55% in a two-class setting).
- Given sufficient data, a boosting algorithm can then provably construct a single classifier (e.g. by a majority vote of the ensemble classifiers) with very high accuracy, say, 99%.
## Formal Description of Boosting

- Given a training set $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, where $y_i \in \{-1, +1\}$ are the correct labels of instances $x_i \in X$.
- For $t = 1, \ldots, T$ (iterations):
  - Construct a distribution $D_t$ on $\{1, \ldots, m\}$, i.e. a weight on every sample.
  - Find a weak classifier $h_t : X \to \{-1, +1\}$ with a small error $\epsilon_t$ on $D_t$:

$$\epsilon_t = \Pr_{D_t}[h_t(x_i) \neq y_i]$$

- Output the final classifier $H_{\mathrm{final}}$ as a combination of $h_t,\ t = 1, \ldots, T$.
## Boosting Algorithms

Early boosting algorithms:

- Boosting by Filtering (by Schapire)
- Boosting by Majority (by Freund)

These algorithms are complex, and are predecessors to AdaBoost.
## AdaBoost

Initialization: $D_1(i) = 1/m$.

Iterate for $t = 1, \ldots, T$: given $D_t$ and $h_t$,

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases} = \frac{D_t(i)}{Z_t} \exp(-\alpha_t y_i h_t(x_i))$$

where

$$Z_t = \text{normalization constant} \tag{1}$$

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) > 0 \tag{2}$$

$$\epsilon_t = \sum_{i=1}^{m} D_t(i)\,[h_t(x_i) \neq y_i], \quad \text{the weighted error rate of } h_t \tag{3}$$

Final classifier:

$$H_{\mathrm{final}}(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$$
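The update rule above can be sketched in a few lines (a minimal illustration, not a production implementation; the pool of threshold stumps and the toy data are made up for this example):

```python
import numpy as np

def adaboost(X, y, weak_learners, T):
    """Minimal AdaBoost: at each round pick the weak learner with the
    smallest weighted error eps_t, weight it by alpha_t, and reweight D."""
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    alphas, chosen = [], []
    for _ in range(T):
        errs = [D[h(X) != y].sum() for h in weak_learners]
        t = int(np.argmin(errs))
        eps, h = errs[t], weak_learners[t]
        if eps == 0 or eps >= 0.5:               # perfect, or no better than random
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # eqn (2)
        D = D * np.exp(-alpha * y * h(X))        # up-weight the mistakes
        D /= D.sum()                             # Z_t normalization, eqn (1)
        alphas.append(alpha)
        chosen.append(h)
    # H_final(x) = sign(sum_t alpha_t h_t(x))
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, chosen)))

# Toy 1-D data that no single threshold stump can separate:
X = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([1, 1, -1, -1, 1, 1])
stumps = [lambda x, th=th, s=s: np.where(x < th, s, -s)
          for th in np.arange(-0.5, 6.0) for s in (+1, -1)]
H = adaboost(X, y, stumps, T=5)
print(H(X))  # [ 1.  1. -1. -1.  1.  1.]
```

After a few rounds the weighted combination of stumps classifies the training set perfectly, even though each individual stump errs on at least a third of it.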
## Toy Example (R. Schapire)

Simple weak learning algorithm: splits along a single axis, using vertical or horizontal half-planes.
## Toy Example – Round 1
## Toy Example – Round 2
## Toy Example – Round 3
## Toy Example – Final Classifier
## Theorem

Let $\gamma_t = \frac{1}{2} - \epsilon_t$. Then

$$\text{training error}(H_{\mathrm{final}}) \leq \prod_t \left[2\sqrt{\epsilon_t(1 - \epsilon_t)}\right] = \prod_t \sqrt{1 - 4\gamma_t^2} \leq \exp\left(-2\sum_t \gamma_t^2\right)$$

Hence, if $\gamma$ is a lower bound on the $\gamma_t$, i.e. $\gamma_t \geq \gamma > 0\ \forall t$, then

$$\text{training error}(H_{\mathrm{final}}) \leq e^{-2\gamma^2 T}$$
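To get a feel for the bound: with weak learners that are only 55% accurate, $\gamma = 0.05$, and the bound drops below 1% after roughly 900 rounds (a quick check, using nothing beyond the formula above):

```python
import math

gamma = 0.05                      # edge over random guessing (55% accuracy)
# Smallest T with exp(-2 * gamma**2 * T) <= 0.01:
T = math.ceil(math.log(0.01) / (-2 * gamma**2))
print(T)  # 922
```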
## Proof

Let $f(x) = \sum_t \alpha_t h_t(x)$, so that $H_{\mathrm{final}}(x) = \mathrm{sign}(f(x))$.

Note that $D_t$ is defined recursively. Hence, by unwrapping the recurrence, we get

$$D_{\mathrm{final}}(i) = \frac{1}{m} \cdot \frac{\exp\left(-y_i \sum_t \alpha_t h_t(x_i)\right)}{\prod_t Z_t} = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$$
## Proof (cont.)

$$\text{training error}(H_{\mathrm{final}}) = \frac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i \neq H_{\mathrm{final}}(x_i) \\ 0 & \text{otherwise} \end{cases} = \frac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i f(x_i) \leq 0 \\ 0 & \text{otherwise} \end{cases}$$

$$\leq \frac{1}{m} \sum_i \exp(-y_i f(x_i)) = \sum_i D_{\mathrm{final}}(i) \prod_t Z_t = \prod_t Z_t$$
## Proof (cont.)

Finally,

$$Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) = \sum_{i:\, y_i \neq h_t(x_i)} D_t(i)\, e^{\alpha_t} + \sum_{i:\, y_i = h_t(x_i)} D_t(i)\, e^{-\alpha_t}$$

$$= \epsilon_t e^{\alpha_t} + (1 - \epsilon_t) e^{-\alpha_t} = 2\sqrt{\epsilon_t(1 - \epsilon_t)} \quad \text{by eqn (2)}$$
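The last identity is easy to verify numerically (a quick check of the algebra, using nothing beyond eqn (2); the value of `eps` is arbitrary):

```python
import math

eps = 0.3                                        # any 0 < eps < 1/2
alpha = 0.5 * math.log((1 - eps) / eps)          # eqn (2)
Z = eps * math.exp(alpha) + (1 - eps) * math.exp(-alpha)
print(abs(Z - 2 * math.sqrt(eps * (1 - eps))) < 1e-12)  # True
```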
## Expected Performance

Expected performance for typical classifiers. We would expect:

- Training error will continue to drop (even reaching 0).
- However, testing error will first decrease and then increase when $H_{\mathrm{final}}$ becomes too complex.
- Occam's Razor – a simpler rule is better.
- Overfitting is expected to occur.
## Actual Performance (Schapire)

Experiment: using a letter dataset, and using decision trees (C4.5).

Actual performance:

- Test error does not increase, even after 1000 rounds (where the total # of nodes in all C4.5 trees is > 2M).
- Test error continues to drop even after training error reaches 0.
- Occam's Razor wrongly predicts that the "simpler" rule is better.
## Margins

- Training error only measures whether classifications are right or wrong.
- Margins tell us the confidence of a classification.
- $H_{\mathrm{final}}$ is a weighted majority vote of weak classifiers.
- Define margin = strength of the vote = (weighted fraction voting correctly) $-$ (weighted fraction voting incorrectly):

$$\text{margin}(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}$$
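As a small check that this formula matches the verbal definition (the weights and votes below are hypothetical):

```python
import numpy as np

alphas = np.array([0.6, 0.3, 0.1])   # weak-classifier weights (made up)
h_vals = np.array([+1, +1, -1])      # their votes on one example x
y = +1                               # true label

margin = y * (alphas @ h_vals) / alphas.sum()
# weighted fraction correct = 0.9, incorrect = 0.1, difference = 0.8
print(round(margin, 6))  # 0.8
```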
## Margins

Cumulative distribution of margins of the training examples (figure).
## Margins

- If all margins are large, we can approximate the final classifier by a much smaller classifier (compare: polls can predict a not-too-close election).
- Large margin → better generalization.
- Boosting tends to increase the margins of the training data.
- Hence:
  - although the final classifier is getting larger,
  - margins are likely to be increasing,
  - so the final classifier is actually getting close to a simpler classifier,
  - thus driving down the test error.
## Boosting, SVM & Margins

We can map the data to a high-dimensional feature space via the weak classifiers:

$$\Phi(x) = (h_1(x)\ \ h_2(x)\ \ \cdots\ \ h_T(x))^T$$
## Boosting, SVM & Margins

- The final classifier is just a linear decision function in this high-dimensional space $\Rightarrow$ SVM.
- The classification score is the margin of this linear classifier.
- This is also the definition of margin in the previous slide.
## Multiclass Problem

$y \in Y = \{1, \ldots, k\}$

Direct approach:

$$h_t : X \to Y$$

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases}$$

$$H_{\mathrm{final}}(x) = \arg\max_{y \in Y} \sum_{t:\, h_t(x) = y} \alpha_t$$

- It can be proved that this has the same error bound if $\epsilon_t \leq \frac{1}{2}\ \forall t$.
- Can also use standard multi-class to 2-class reduction techniques.
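The combination rule is a weighted vote over labels (the weights and per-round predictions below are hypothetical, for illustration only):

```python
import numpy as np

k = 3
alphas = np.array([0.8, 0.5, 0.4])    # alpha_t for rounds t = 1..3 (made up)
preds  = np.array([2, 0, 2])          # h_t(x) for one input x, labels in {0..2}

# H_final(x) = argmax_y of the sum of alpha_t over rounds where h_t(x) = y
votes = np.zeros(k)
np.add.at(votes, preds, alphas)       # votes[y] += alpha_t for each vote
print(int(np.argmax(votes)))  # 2
```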
## Ranking Problem

- Other problems can be handled by reducing them to a binary problem.
- E.g. ranking of objects from examples.
- It can be reduced to multiple binary questions: "Is object A preferred to object B?"
## Advantages of AdaBoost

- Fast, simple and easy to program.
- No parameters to tune (except $T$).
- Flexible: can combine with any learning algorithm.
- No prior knowledge needed about the weak learner.
- Provably effective, provided that a simple weak classifier can be found consistently. (A shift of mindset – the goal is now merely to find a mediocre classifier barely better than random guessing.)
- Versatile – can be used with data that is textual, numeric or discrete.
## Caveats

- Performance depends on the data and the weak learner.
- As predicted by theory, AdaBoost can fail if:
  - the weak classifier is too complex (causing overfitting);
  - the weak classifier is too weak (giving low margins, and hence causing overfitting);
  - there is noise.
- Empirical results show that AdaBoost seems susceptible to uniform noise.