CSIS8502 - 9. Boosting

Boosting
A kind of ensemble method for supervised learning.
Create a single strong learner from a set of weak learners.
Weak learner: one that is slightly better than random guessing.
Strong learner: a highly accurate classifier.
Ensemble Methods
Combining classifiers.
Simplest case: combine by a simple majority vote.
Consider 3 independent classifiers, each with accuracy 55%.
The error rate of the majority vote over the 3 classifiers is

$$ \binom{3}{2}(0.55)(0.45)^2 + \binom{3}{3}(0.45)^3 = 42.5\% $$

For 5 classifiers:

$$ \binom{5}{3}(0.55)^2(0.45)^3 + \binom{5}{4}(0.55)(0.45)^4 + \binom{5}{5}(0.45)^5 = 40.7\% $$

As $n \to \infty$, this error becomes arbitrarily small.
Ensemble Methods
Majority-vote error rate as a function of the number of classifiers i and the individual accuracy p:

i        | 3     | 5     | 7     | 9     | 11    | 13    | 15    | 17    | 19
p = 0.55 | 42.5% | 40.7% | 39.2% | 37.9% | 36.7% | 35.6% | 34.6% | 33.7% | 32.9%
p = 0.65 | 28.2% | 23.5% | 20.0% | 17.2% | 14.9% | 12.9% | 11.3% |  9.9% |  8.7%
p = 0.75 | 15.6% | 10.4% |  7.1% |  4.9% |  3.4% |  2.4% |  1.7% |  1.2% |  0.9%
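To make the arithmetic concrete, here is a minimal Python sketch (the function name majority_vote_error is illustrative, not from the slides) that evaluates the binomial sum from the previous slide and reproduces the figures in this table.

```python
from math import comb

def majority_vote_error(n: int, p: float) -> float:
    """Probability that a majority of n independent classifiers,
    each with accuracy p, is wrong (n assumed odd)."""
    q = 1.0 - p                 # per-classifier error rate
    k_min = n // 2 + 1          # smallest number of wrong votes that loses the vote
    return sum(comb(n, k) * q**k * p**(n - k) for k in range(k_min, n + 1))

for n in (3, 5, 7):
    print(n, f"{majority_vote_error(n, 0.55):.1%}")   # 42.5%, 40.7%, 39.2%
```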
Voting Classifiers
Intuitively, if the classifiers are very different from each other, combining multiple classifiers can be helpful.
Consider a two-class problem with class labels $\{+1, -1\}$ and a set of classifiers $f_i(x) \in \{+1, -1\}$. The majority voting decision is

$$ F(x) = \mathrm{sign}\left( \sum_{i=1}^{M} f_i(x) \right) $$

where

$$ \mathrm{sign}(x) = \begin{cases} +1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \end{cases} $$

This is the main idea of ensemble methods.
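As a rough illustration only, the voting rule can be written in a few lines of Python; the threshold classifiers below are made-up examples, and ties (a vote sum of exactly 0) are broken arbitrarily.

```python
def majority_vote(classifiers, x):
    """Two-class majority vote F(x) = sign(sum_i f_i(x)) with labels in {-1, +1}."""
    votes = sum(f(x) for f in classifiers)   # an odd number of voters avoids ties
    return 1 if votes > 0 else -1

# Three made-up threshold classifiers on a scalar input:
fs = [lambda x: 1 if x > 0.2 else -1,
      lambda x: 1 if x > 0.5 else -1,
      lambda x: 1 if x > 0.8 else -1]
print(majority_vote(fs, 0.6))   # two of the three vote +1, so F(x) = +1
```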
Why Ensemble Methods
Finding a set of weak learners (accuracy slightly above 0.5) is much easier and more efficient than finding a strong learner (one with high accuracy).
There are different classes of base models that can be chosen:
Naive Bayes Classifier
Decision Tree Classifier
k-Nearest Neighbors
Neural Networks
SVMs...
How to find Base Models
Train a diverse set of models on the same dataset.
Train a set of models from a specific class of learners by using diversity in datasets, parameters, or initial conditions.
Examples are bagging and boosting.
Bagging
The name Bagging comes from Bootstrap Aggregation.
Use multiple versions of the training set, each drawn from the original training data randomly with replacement, and train a classifier on each of them.
Combine the classifiers using a majority vote.
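A minimal sketch of bagging, assuming scikit-learn decision trees as the base learner (an illustrative choice; the slides do not prescribe one) and labels in {-1, +1}; the function names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, seed=0):
    """Train n_models trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, m, size=m)      # draw m indices with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the trees by a simple majority vote (labels in {-1, +1})."""
    votes = np.stack([clf.predict(X) for clf in models])   # shape (n_models, n_samples)
    return np.sign(votes.sum(axis=0))                       # a tied vote yields 0 here
```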
Boosting
In Bagging, the training sets are randomly drawn, and all samples carry the same weight.
However, if a data point is already correctly classified by the majority of the classifiers, then even if it is misclassified by the next one, the overall performance is not affected.
Hence we need only concentrate on the misclassified data.
This is the idea of Boosting.
PAC Learning
Probably Approximately Correct (PAC) Learning.
Strong PAC learning: demand small error with high probability:
for any distribution, with arbitrarily small $\epsilon > 0$ and $\delta > 0$, and given polynomially many random examples, find a classifier with error $\le \epsilon$ with probability $\ge 1 - \delta$.
Weak PAC learning: same as strong PAC learning, except that instead of requiring arbitrarily small $\epsilon$ and $\delta$, we only require

$$ \epsilon \le \tfrac{1}{2}(1 - \gamma), \quad \text{for some fixed } \gamma > 0 $$

[Note: sometimes this is written as $\epsilon \le \tfrac{1}{2} - \gamma$.]
Boosting
Assume that a given weak learning algorithm can consistently find classifiers that are at least slightly better than random (e.g. accuracy $\ge$ 55% in a two-class setting).
Given sufficient data, a boosting algorithm can then provably construct a single classifier (e.g. by majority voting over the ensemble of classifiers) with very high accuracy, say 99%.
Formal Description of Boosting
Given a training set $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, where $y_i \in \{-1, +1\}$ are the correct labels of the instances $x_i \in X$.
For $t = 1, \ldots, T$ (iterations):
construct a distribution $D_t$ on $\{1, \ldots, m\}$, i.e. a weight on every sample;
find a weak classifier $h_t : X \to \{-1, +1\}$ with a small error $\epsilon_t$ on $D_t$:

$$ \epsilon_t = \Pr_{D_t}[h_t(x_i) \ne y_i] $$

Output the final classifier $H_{\mathrm{final}}$ as a combination of $h_t$, $t = 1, \ldots, T$.
Boosting Algorithm
Early boosting algorithms:
Boosting by Filtering (by Schapire)
Boosting by Majority (by Freund)
These algorithms are complex, and are predecessors of the AdaBoost algorithm.
AdaBoost Algorithm
Initialization: $D_1(i) = 1/m$.
Iterate for $t = 1, \ldots, T$: given $D_t$ and $h_t$,

$$ D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \ne h_t(x_i) \end{cases} = \frac{D_t(i)}{Z_t} \exp(-\alpha_t y_i h_t(x_i)) $$

where

$$ Z_t = \text{normalization constant} \qquad (1) $$
$$ \alpha_t = \frac{1}{2} \ln\left( \frac{1 - \epsilon_t}{\epsilon_t} \right) > 0 \qquad (2) $$
$$ \epsilon_t = \sum_{i=1}^{m} D_t(i)\,[h_t(x_i) \ne y_i], \ \text{the weighted error rate of } h_t \qquad (3) $$

Final classifier:

$$ H_{\mathrm{final}}(x) = \mathrm{sign}\left( \sum_t \alpha_t h_t(x) \right) $$
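The update rules above translate almost line by line into code. Below is a minimal sketch assuming scikit-learn decision stumps as the weak learners and labels in {-1, +1}; the function names and the early-stopping guard are illustrative choices, not part of the algorithm as stated.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost with decision stumps; y is a numpy array with entries in {-1, +1}."""
    m = len(X)
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()                   # weighted error rate, eqn (3)
        eps = max(eps, 1e-12)                      # guard against a perfect stump
        if eps >= 0.5:                             # weak-learning assumption violated
            break
        alpha = 0.5 * np.log((1 - eps) / eps)      # eqn (2)
        D *= np.exp(-alpha * y * pred)             # up-weight mistakes, down-weight correct points
        D /= D.sum()                               # normalization by Z_t, eqn (1)
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """H_final(x) = sign(sum_t alpha_t * h_t(x))."""
    f = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
    return np.sign(f)
```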
Toy Example (R. Schapire)
Simple weak learning algorithm: split along one axis, i.e. classify by a vertical or horizontal half-plane.
Toy Example – Round 1
Toy Example – Round 2
Toy Example – Round 3
Toy Example – Final Classifier
Training Error in AdaBoost
Theorem: let $\gamma_t = \frac{1}{2} - \epsilon_t$. Then

$$ \text{training error}(H_{\mathrm{final}}) \le \prod_t \left[ 2\sqrt{\epsilon_t(1 - \epsilon_t)} \right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left( -2 \sum_t \gamma_t^2 \right) $$

Hence, if $\gamma$ is a lower bound on the $\gamma_t$, i.e. $\gamma_t \ge \gamma > 0$ for all $t$, then

$$ \text{training error}(H_{\mathrm{final}}) \le e^{-2\gamma^2 T} $$
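As a quick worked example with illustrative numbers (not from the slides): weak learners that are 55% accurate give $\gamma = 0.05$, so after $T = 1000$ rounds the bound is $e^{-5} \approx 0.7\%$.

```python
import math

gamma, T = 0.05, 1000                   # 55%-accurate weak learners, 1000 rounds
print(math.exp(-2 * gamma**2 * T))      # ~0.0067: training error bounded below 1%
```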
Proof
Let $f(x) = \sum_t \alpha_t h_t(x)$, so that $H_{\mathrm{final}}(x) = \mathrm{sign}(f(x))$.
Note that $D_t$ is defined recursively. Hence, by unwrapping the recurrence, we get

$$ D_{\mathrm{final}}(i) = \frac{1}{m} \cdot \frac{\exp\left( -y_i \sum_t \alpha_t h_t(x_i) \right)}{\prod_t Z_t} = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t} $$
Proof
$$ \text{training error}(H_{\mathrm{final}}) = \frac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i \ne H_{\mathrm{final}}(x_i) \\ 0 & \text{otherwise} \end{cases} $$

$$ = \frac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i f(x_i) \le 0 \\ 0 & \text{otherwise} \end{cases} $$

$$ \le \frac{1}{m} \sum_i \exp(-y_i f(x_i)) $$

$$ = \sum_i D_{\mathrm{final}}(i) \prod_t Z_t = \prod_t Z_t $$
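As an optional empirical check of this bound (reusing the hypothetical adaboost_fit sketch from earlier), one can compare the 0-1 training error with $\prod_t Z_t$; since the distributions $D_t$ are not stored, $\epsilon_t$ is recovered here by inverting eqn (2).

```python
import numpy as np

def check_bound(stumps, alphas, X, y):
    """Compare the 0-1 training error of H_final with the bound prod_t Z_t."""
    f = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
    train_err = np.mean(np.sign(f) != y)
    eps = 1.0 / (1.0 + np.exp(2.0 * np.asarray(alphas)))   # invert eqn (2) to recover eps_t
    bound = np.prod(2.0 * np.sqrt(eps * (1.0 - eps)))       # prod_t Z_t
    return train_err, bound
```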
Proof
Finally,

$$ Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) = \sum_{i: y_i \ne h_t(x_i)} D_t(i) e^{\alpha_t} + \sum_{i: y_i = h_t(x_i)} D_t(i) e^{-\alpha_t} = \epsilon_t e^{\alpha_t} + (1 - \epsilon_t) e^{-\alpha_t} $$

By eqn (2), $e^{\alpha_t} = \sqrt{(1 - \epsilon_t)/\epsilon_t}$, so

$$ Z_t = \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} + (1 - \epsilon_t) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} = 2\sqrt{\epsilon_t(1 - \epsilon_t)} $$

Combining this with the bound $\text{training error}(H_{\mathrm{final}}) \le \prod_t Z_t$ from the previous slide completes the proof.
Expected Performance
Expected performance for a normal classifier:
We would expect that:
the training error will continue to drop (possibly reaching 0);
the test error, however, will first decrease and then increase as $H_{\mathrm{final}}$ becomes too complex.
Occam's Razor: a simpler rule is better.
Overfitting is expected to occur.
Actual Performance (Schapire)
Experiment: a letter-recognition dataset, using the decision-tree method C4.5 as the weak learner for AdaBoost.
Actual performance:
the test error does not increase, even after 1000 rounds (at which point the total number of nodes in all the C4.5 trees exceeds 2 million);
the test error continues to drop even after the training error reaches 0.
Occam's Razor wrongly predicts that the "simpler" rule is better.
Margins
Training error only measures whether classifications are right or wrong.
The margin tells the confidence of a classification.
$H_{\mathrm{final}}$ is a weighted majority vote of weak classifiers.
Define margin = strength of the vote = (weighted fraction voting correctly) - (weighted fraction voting incorrectly).
In symbols (with the AdaBoost weights $\alpha_t > 0$):

$$ \mathrm{margin}(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t} \in [-1, +1] $$
Margins
Cumulative distribution of margins of training examples:
Margins
If all margins are large, we can approximate the final classifier by a much smaller classifier (compare: polls can predict a not-too-close election).
Large margins → better generalization.
Boosting tends to increase the margins of the training data.
Hence:
although the final classifier is getting larger,
the margins are likely to be increasing,
so the final classifier is actually getting close to a simpler classifier,
thus driving down the test error.
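For intuition, the normalised margins of the training examples can be computed directly from the ensemble. The sketch below reuses the hypothetical adaboost_fit output from the earlier AdaBoost sketch; margins near +1 mean a confident, correct vote.

```python
import numpy as np

def margins(stumps, alphas, X, y):
    """Normalised voting margin y * sum_t(alpha_t h_t(x)) / sum_t(alpha_t), in [-1, +1]."""
    alphas = np.asarray(alphas)
    votes = np.stack([h.predict(X) for h in stumps])   # shape (T, n_samples), entries in {-1, +1}
    f = alphas @ votes                                  # weighted vote for each sample
    return y * f / alphas.sum()                         # large positive value = confident and correct
```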
Boosting, SVM & Margins
We can map the data to a high-dimensional feature space using the weak classifiers:

$$ \Phi(x) = (h_1(x),\, h_2(x),\, \ldots,\, h_T(x))^{\mathsf{T}} $$
Boosting, SVM & Margins
The final classifier is then just a linear decision function in this high-dimensional space, as in an SVM.
The classification score is the margin of this linear classifier.
This is also the definition of margin used on the previous slides.
Multiclass Problem
$y \in Y = \{1, \ldots, k\}$
Direct approach:

$$ h_t : X \to Y $$

$$ D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \ne h_t(x_i) \end{cases} $$

$$ H_{\mathrm{final}}(x) = \arg\max_{y \in Y} \sum_{t: h_t(x) = y} \alpha_t $$

It can be proved that this has the same error bound, provided that $\epsilon_t \le \frac{1}{2}$ for all $t$.
One can also use standard multi-class to two-class reduction techniques.
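A minimal sketch of this direct multiclass variant (in the spirit of AdaBoost.M1), again assuming scikit-learn shallow trees as weak learners; the function names and the choice of weak learner are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, T=50):
    """Direct multiclass boosting with the reweighting rule above; y holds labels in {1, ..., k}."""
    m = len(X)
    D = np.full(m, 1.0 / m)
    models, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=2).fit(X, y, sample_weight=D)  # shallow tree as weak learner
        pred = h.predict(X)
        eps = D[pred != y].sum()
        if eps >= 0.5 or eps == 0:                     # the bound needs eps_t <= 1/2 in every round
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        D *= np.where(pred == y, np.exp(-alpha), np.exp(alpha))   # same reweighting rule as before
        D /= D.sum()
        models.append(h)
        alphas.append(alpha)
    return models, alphas

def adaboost_m1_predict(models, alphas, X, classes):
    """H_final(x) = argmax_y of the total alpha_t over rounds whose weak classifier voted y."""
    scores = np.zeros((len(X), len(classes)))
    for a, h in zip(alphas, models):
        pred = h.predict(X)
        for j, c in enumerate(classes):
            scores[pred == c, j] += a
    return np.asarray(classes)[scores.argmax(axis=1)]
```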
Ranking Problem
Other problems can be handled by reducing them to a binary problem.
E.g. ranking of objects from examples:
this can be reduced to multiple binary questions of the form "Is object A preferred to object B?",
to which AdaBoost can then be applied.
Practical Advantages of AdaBoost
fast, simple, and easy to program
no parameters to tune (except the number of rounds T)
flexible: can be combined with any learning algorithm
no prior knowledge needed about the weak learner
provably effective, provided that a simple weak classifier can be found consistently
(a shift of mindset: the goal is now merely to find a mediocre classifier barely better than random guessing)
versatile: can be used with data that is textual, numeric, or discrete
Caveats
Performance depends on the data and the weak learner.
As predicted by theory, AdaBoost can fail if:
the weak classifier is too complex (causing overfitting);
the weak classifier is too weak (giving low margins, and hence also causing overfitting);
there is noise in the data.
Empirical results show that AdaBoost seems susceptible to uniform noise.