A DECISION-THEORETIC GENERALIZATION OF ON-LINE LEARNING AND AN APPLICATION TO BOOSTING

By Yoav Freund and Robert E. Schapire

Presented by David Leach
Original slides by Glenn Rachlin

OUTLINE:

- Background
- On-line allocation of resources
  - Introduction
  - The Problem
  - The Hedge Algorithm
  - Analysis
- Boosting
  - Introduction
  - The Problem
  - The AdaBoost Algorithm
  - Analysis
  - Applications
- Extensions
- Conclusions
- Questions for the final exam

USEFUL DEFINITIONS:

- On-Line Learning: Information comes in one step at a time; the learner must apply the model, make a prediction, observe the true value, then adjust the model accordingly.
- Weak Learner: An algorithm with higher accuracy than random guessing, but impractical by itself for most real-world applications.
- PAC: Probably Approximately Correct; most of the time the prediction returned will be close to the actual result.

ENSEMBLE LEARNING:

A machine learning paradigm where multiple learners are used to solve the problem.

[Diagram: previously, a single learner per problem; in an ensemble, many learners are applied to the same problem.]

- The generalization ability of the ensemble is usually significantly better than that of an individual learner.
- Boosting is one of the most important families of ensemble methods.

BOOSTING: A BACKGROUND

- Significant advantages:
  - Solid theoretical foundation
  - High level of accuracy
  - Simple to implement
  - Wide range of applications
- R. Schapire and Y. Freund won the 2003 Gödel Prize (one of the most prestigious awards in theoretical computer science).
- Prize-winning paper (which introduced AdaBoost): "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 1997, 55: 119-139.

HOW WAS ADABOOST BORN?

- In 1988, M. Kearns and L. G. Valiant posed an interesting question:
  - Can a "weak" learning algorithm that performs just slightly better than random guessing be "boosted" into an arbitrarily accurate "strong" learning algorithm?
- More simply, can we transform one or more weak learners into a single strong learner?

HOW WAS ADABOOST BORN?

- In R. Schapire's MLJ90 paper, Rob said "yes" and gave a proof. The proof is a construction, which became the first boosting algorithm ("Recursive Majority Gate Formulation").
- Then, in Y. Freund's PhD thesis (1993), Yoav gave a scheme for combining weak learners by "Majority Vote".
- Though theoretically strong, both algorithms relied on knowledge of each weak learner's accuracy.
- Later, at AT&T Bell Labs, they published the 1997 paper (in fact the work was done in 1995), which proposed the AdaBoost algorithm, a practical, "adaptable" algorithm.

BOOSTING TIMELINE

- 1990: Boost-by-majority algorithm (Freund)
- 1995: AdaBoost (Freund & Schapire)
- 1997: Generalized version of AdaBoost (Schapire & Singer)
- 2001: AdaBoost in face detection (Viola & Jones)

ON-LINE ALLOCATION OF RESOURCES: INTRODUCTION

- Problem: "... dynamically apportioning resources among a set of options ..."
- In other words, "Given a set of individual predictions, how much should we value each one?"
- The Gambler Example (a recurring theme)

THE GAMBLER:

A gambler wants to make money on horse racing by consulting a group of experts.

- He discovers that experts tend to use certain "rules of thumb" for races that dictate results to some degree ("horse with the best odds", etc.).
- It is hard to find one particular rule that works for multiple circumstances.
- How can he use the network of various predictions (each of which tends to use a given rule of thumb) to win money?
- More specifically, how should he split his money among the experts?

ON-LINE ALLOCATION OF RESOURCES: PROBLEM FORMULATION

The on-line allocation model:
- Allocation agent A: the gambler
- A strategy i: one expert's behavior
- Number of options/strategies {1, 2, ..., N}: the number of experts to choose from
- Number of time steps {1, 2, ..., T}: the number of races
- Distribution over strategies p_t: how much money he spends on each expert
- Loss ℓ_t: money lost (or not gained)

ON-LINE ALLOCATION OF RESOURCES: HEDGE(β)

- Basis: "The algorithm and its analysis are direct generalizations of Littlestone and Warmuth's weighted majority algorithm."
- Assumptions:
  - The loss suffered by any strategy is bounded
  - All weights are nonnegative
  - Initial weights sum to 1 (optional)


Algorithm Hedge(β)

Parameters: β ∈ [0,1]; initial weight vector w_1 ∈ [0,1]^N with Σ_i w_1(i) = 1; number of trials T.

Do for t = 1, 2, ..., T:
1. Choose allocation p_t = w_t / Σ_i w_t(i)
2. Receive loss vector ℓ_t ∈ [0,1]^N from the environment
3. Suffer loss p_t · ℓ_t
4. Set the new weight vector to w_{t+1}(i) = w_t(i) · β^{ℓ_t(i)}

Goal: minimize the difference between the expected total loss and the minimal total loss of repeating one action (the best single strategy in hindsight).
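A minimal Python sketch of this loop, assuming the losses have already been scaled into [0, 1]; the function name and arguments are illustrative, not from the paper:

```python
import numpy as np

def hedge(loss_rows, beta=0.5):
    """Hedge(beta) over a sequence of loss vectors with entries in [0, 1].

    loss_rows: a (T, N) array-like of per-strategy losses.
    Returns the total loss suffered by the allocator.
    """
    loss_rows = np.asarray(loss_rows, dtype=float)
    n = loss_rows.shape[1]
    w = np.full(n, 1.0 / n)            # initial weights sum to 1
    total_loss = 0.0
    for loss in loss_rows:
        p = w / w.sum()                # step 1: choose allocation p_t
        total_loss += p @ loss         # steps 2-3: receive loss vector, suffer p_t . l_t
        w = w * beta ** loss           # step 4: w_{t+1}(i) = w_t(i) * beta^{l_t(i)}
    return total_loss
```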

THE GAMBLER REVISITED

The gambler uses his fancy new algorithm as follows:

1. The gambler splits his money evenly between 3 experts, giving $5 to each:
   p_1 = <.33, .33, .33>
2. The gambler records the loss to each expert:
   - Expert 1 loses $2
   - Expert 2 loses $1
   - Expert 3 loses $4
   loss vector ℓ_t = <2, 1, 4>
   total loss = .33 x 2 + .33 x 1 + .33 x 4 = 2.33

THE GAMBLER REVISITED

3. The gambler sets new weights using this data and a beta of .5:
   - Expert 1 is weighted .33 x .5^2 = .083
   - Expert 2 is weighted .33 x .5^1 = .167
   - Expert 3 is weighted .33 x .5^4 = .021
   Total weight = .083 + .167 + .021 = .271
4. The gambler repeats the process, now "hedging" his bets as follows:
   p_2 = <.083/.271, .167/.271, .021/.271> ≈ <.31, .62, .08>
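A quick numeric check of this update, using exact thirds for the initial weights (the rounded values above differ only in the last digit):

```python
import numpy as np

beta = 0.5
w1 = np.full(3, 1/3)               # even initial split
loss = np.array([2, 1, 4])         # dollars lost by experts 1-3
w2 = w1 * beta ** loss             # Hedge update: w <- w * beta^loss
p2 = w2 / w2.sum()                 # renormalize to get the next allocation
print(w2.round(3))                 # [0.083 0.167 0.021]
print(p2.round(3))                 # [0.308 0.615 0.077]
```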

ON-LINE ALLOCATION OF RESOURCES: BOUNDS OF HEDGE(β)

The total loss of Hedge(β) satisfies

L_{Hedge(β)} ≤ ( ln(1/β) · min_i L_i + ln N ) / (1 − β) = c · min_i L_i + a · ln N

BOUNDS, CON'T

- β = given parameter
- T = total number of time steps or trials
- N = number of options
- c = ln(1/β) / (1 − β) and a = 1 / (1 − β)

CHOOSING BETA

Set

β = 1 / (1 + sqrt(2 ln N / L~))

where L~ is an upper bound on the loss of the best strategy (L~ ≥ min_i L_i).

Then:

L_{Hedge(β)} ≤ min_i L_i + sqrt(2 L~ ln N) + ln N

And if we know T, we can take L~ = T:

L_{Hedge(β)} ≤ min_i L_i + sqrt(2 T ln N) + ln N

ON-LINE ALLOCATION OF RESOURCES: EVALUATION

The authors show that the Hedge(β) algorithm "yield[s] bounds that are slightly weaker in some cases [than those produced by the algorithm proposed by Littlestone and Warmuth, 1994], but applicable to a considerably more general class of learning problems."

- Not only binary decisions
- Not only discrete loss

MORE GAMBLING

- The gambler now wants to avoid the experts, and opts to write a program that predicts the winner.
- He must take input data:
  - Odds
  - Previous results
  - Track conditions
- And predict the outcome:
  - Win or loss
- He notices that "rules of thumb" once again emerge, where simple heuristics can provide some predictive accuracy, but not enough.
- How can he use this information to make money?

BOOSTING: INTRODUCTION

- Aim: "... converting a weak learning algorithm that performs just slightly better than random guessing into one with arbitrarily high accuracy."
- Example: constructing an expert computer program
- Two problems:
  - Choosing data
  - Combining rules
- "Boosting refers to this general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules-of-thumb."

TRADITIONAL BOOSTING

1. Split a training data set into multiple overlapping subsets.
2. Train a weak learner on one equally weighted example set until accuracy is > 50%.
3. Train a weak learner on a new example set, now weighted to focus on errors.
4. Repeat until all example sets are exhausted.
5. Apply all learners to the test set to determine the final hypothesis.

THE PROBLEM

- Previous algorithms by the same authors "work by calling a given weak learning algorithm WeakLearn multiple times, each time presenting it with a different distribution [of examples], and finally combining all the generated hypotheses into a single hypothesis."
- Problems:
  - Too much has to be known in advance
  - Improvement of the overall performance depends on the weakest rules

ADABOOST: ADAPTIVE BOOSTING

- Instead of sampling, re-weight
- Can be used to train weak classifiers
- Final classification based on weighted vote of weak classifiers

ADABOOST:

If the underlying classifiers are linear networks, then AdaBoost builds multilayer perceptrons one node at a time.

However, the underlying classifier can be anything: decision trees, neural networks, etc.

THE ADABOOST ALGORITHM:
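The algorithm itself appeared on this slide as a figure. As a stand-in, here is a minimal Python sketch of the paper's AdaBoost for labels in {0, 1}; the weak_learn callable and other names are illustrative assumptions, not part of the slides:

```python
import numpy as np

def adaboost(X, y, weak_learn, T):
    """Sketch of AdaBoost (1997 paper version), labels y in {0, 1}.

    weak_learn(X, y, p) is a hypothetical callable returning a hypothesis
    h: x -> {0, 1} trained under the example distribution p.
    """
    m = len(y)
    y = np.asarray(y, dtype=float)
    w = np.full(m, 1.0 / m)                          # initial example weights
    hyps, betas = [], []
    for _ in range(T):
        p = w / w.sum()                              # distribution over examples
        h = weak_learn(X, y, p)
        preds = np.array([h(x) for x in X], dtype=float)
        err = float(np.sum(p * np.abs(preds - y)))   # weighted error
        if err >= 0.5:                               # no better than chance: stop early
            break
        beta = max(err, 1e-12) / (1.0 - err)
        w = w * beta ** (1.0 - np.abs(preds - y))    # shrink weights of correct examples
        hyps.append(h)
        betas.append(beta)

    def final_hypothesis(x):
        votes = np.log(1.0 / np.array(betas))        # each learner votes log(1/beta)
        outputs = np.array([h(x) for h in hyps], dtype=float)
        return int(votes @ outputs >= 0.5 * votes.sum())

    return final_hypothesis
```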

INITIAL ANALYSIS:

- The weight of each example is adjusted so that its multiplier is β_t (< 1) if the example was classified correctly, or 1 (β_t^0) if it was classified incorrectly. Remember that the weights are then normalized, so a weight that does not decrease is effectively increased.
- Each learner gets a vote of log(1/β_t); since β_t grows with its error, learners with lower error get larger votes.
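In symbols, the update and vote described above (as defined in the paper):

```latex
% Example-weight update and final vote in the paper's AdaBoost
% (y_i, h_t(x_i) \in \{0,1\},\ \epsilon_t = \sum_i p_t(i)\,|h_t(x_i) - y_i|)
w_{t+1}(i) = w_t(i)\,\beta_t^{\,1 - |h_t(x_i) - y_i|},
\qquad \beta_t = \frac{\epsilon_t}{1 - \epsilon_t}

h_{\mathrm{fin}}(x) = 1 \iff
\sum_{t=1}^{T} \Big(\log\tfrac{1}{\beta_t}\Big)\, h_t(x)
\;\ge\; \tfrac{1}{2}\sum_{t=1}^{T} \log\tfrac{1}{\beta_t}
```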

BETA

What implications does this have?

1. If the error is .5, equivalent to random guessing, no information is gained and the time step isn't used.
2. For the common case of error < .5, we can weight examples in proportion to the error, and weight votes inversely to the error.
3. In the final hypothesis, if the error was > .5, the time step will actually contribute an inverse (negative) vote.

STILL MORE GAMBLING

- The gambler now has a pretty good scheme to make money, and downloads the entire race history from the track's database.
1. He finds that odds are a PAC predictor, and comes up with hypotheses accordingly.
2. He calculates the error of using this predictor.
3. He looks at the data in a different way, focusing on examples that odds could not easily predict, and comes up with a new heuristic (when the track is muddy, the horse with the most experienced jockey wins).
4. He repeats this process until no more viable heuristics can be determined.
5. When enough of the heuristics indicate a win for a given horse, he places a bet.

THEORETICAL PROPERTIES:

Y. Freund and R. Schapire [JCSS97] have proved that the training error of AdaBoost is bounded by:

∏_t 2 sqrt(ε_t (1 − ε_t)) = ∏_t sqrt(1 − 4 γ_t²), where γ_t = 1/2 − ε_t.

Thus, if each base classifier is slightly better than random, so that γ_t ≥ γ for some γ > 0, then the training error drops exponentially fast in T, since the above bound is at most exp(−2 T γ²).

THEORETICAL PROPERTIES, CON'T

Y. Freund and R. Schapire [JCSS97] have tried to bound the generalization error as:

Pr[H(x) ≠ y] ≤ P̂r[H(x) ≠ y] + Õ( sqrt(T d / s) )

where P̂r[·] denotes empirical probability on the training sample, s is the sample size, and d is the VC-dimension of the base learner.

The above bound suggests that boosting will overfit if T is large. However, empirical studies show that boosting often does not overfit.

R. Schapire et al. [AnnStat98] gave a margin-based bound:

Pr[H(x) ≠ y] ≤ P̂r[ margin_f(x, y) ≤ θ ] + Õ( sqrt(d / (s θ²)) )

for any θ > 0, with high probability, where margin_f(x, y) = y Σ_t α_t h_t(x) / Σ_t α_t.

TOY EXAMPLE (TAKEN FROM ANTONIO TORRALBA @MIT)

Weak learners come from the family of lines.
h => p(error) = 0.5: the classifier is at chance.

Each data point has a class label y_t = +1 or −1 and a weight w_t = 1.

[Figure: 2-D points labeled +1 and −1.]

TOY EXAMPLE

This one seems to be the best. This is a 'weak classifier': it performs slightly better than chance.

[Figure: the best line on the weighted points.]

TOY EXAMPLE, CON'T

At each subsequent round we update the weights, w_t ← w_t exp{−y_t H_t}. This sets a new problem for which the previous weak classifier performs at chance again, and a new weak classifier is trained on the reweighted points.

[Figures: four rounds of reweighting and the corresponding weak classifiers.]
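A tiny numeric illustration of this reweighting step; H_t here stands for round t's weighted output α_t h_t(x), and the numbers are made up:

```python
import numpy as np

y   = np.array([+1, +1, -1, -1, +1])             # class labels
w   = np.ones(5)                                 # current example weights
H_t = np.array([+0.8, -0.8, -0.8, +0.8, +0.8])   # round-t outputs; points 2 and 4 are misclassified

w = w * np.exp(-y * H_t)   # correct points shrink, misclassified points grow
w = w / w.sum()            # renormalize to a distribution
print(w.round(3))          # misclassified points now carry most of the weight
```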

TOY EXAMPLE, CON'T

The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers.

[Figure: the combined decision boundary formed from f_1, f_2, f_3, f_4.]

FORMAL PROCEDURE OF ADABOOST:
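The procedure itself was given as a figure on this slide. A standard rendering of AdaBoost with labels y_i ∈ {−1, +1} (the form assumed by the toy example's update) is sketched below; this is the usual formulation, not a transcription of the slide:

```latex
% AdaBoost with labels y_i in {-1,+1}; h_t is trained under distribution D_t
D_1(i) = \tfrac{1}{m}, \qquad i = 1,\dots,m

\epsilon_t = \Pr_{i \sim D_t}\big[h_t(x_i) \ne y_i\big],
\qquad \alpha_t = \tfrac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}

D_{t+1}(i) = \frac{D_t(i)\,\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t}
\qquad (Z_t \text{ normalizes } D_{t+1})

H(x) = \operatorname{sign}\!\Big(\sum_{t=1}^{T} \alpha_t\, h_t(x)\Big)
```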

ERROR ON TRAINING SET:

OVERFITTING

- Will AdaBoost eventually fail once the combined classifier becomes large and complex?
- Occam's razor: simpler is better.
- Overfitting: should we stop before overfitting occurs? Only if overfitting actually happens.

ACTUAL TYPICAL RUN

AN EXPLANATION BY MARGIN

- This margin is not the margin in SVM.

MARGIN DISTRIBUTION

- Although the final classifier is getting larger, the margins are still increasing.
- The final classifier is actually becoming a simpler classifier.

PRACTICAL ADVANTAGES OF ADABOOST:

- Simple and easy to program.
- No parameters to tune (except T).
- Effective, provided it can consistently find rough rules of thumb.
- Goal is to find a hypothesis barely better than guessing.
- Can combine with any (or many) classifiers to find weak hypotheses: neural networks, decision trees, simple rules of thumb, nearest-neighbor classifiers, etc.

EXTENSIONS

- AdaBoost.M1: First Multiclass
- AdaBoost.M2: Second Multiclass
- AdaBoost.R: Weak Regression

ADABOOST.M1

- We modify the error calculation as follows:
  ε_t = Σ_{i: h_t(x_i) ≠ y_i} p_t(i), i.e., the probability under p_t of misclassifying an example.
- With the caveat: if ε_t > 1/2, the loop is aborted (set T = t − 1).
- And come up with a final hypothesis by:
  h_fin(x) = argmax_{y ∈ Y} Σ_{t: h_t(x) = y} log(1/β_t)
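A minimal sketch of that final vote in Python (hyps and betas as produced by a training loop like the one sketched earlier; names are illustrative):

```python
import numpy as np

def m1_final_hypothesis(x, hyps, betas, labels):
    """AdaBoost.M1 combination: pick the label with the largest log(1/beta) vote."""
    votes = {y: 0.0 for y in labels}
    for h, beta in zip(hyps, betas):
        votes[h(x)] += np.log(1.0 / beta)
    return max(votes, key=votes.get)
```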

CONCLUDING REMARKS

- This paper was the introduction of AdaBoost, an award-winning, widely used algorithm featured in the top 10 algorithms.
- The paper also included the Hedge algorithm, a much less widely known algorithm.
- It should be noted that Hedge was a solution to a problem that prevented adaptable boosting for a long time, and has therefore had a significant impact on data mining since.

EXAM QUESTIONS:

1. What are we seeking to minimize in resource allocation?
2. What is the goal of boosting?
3. What makes AdaBoost adaptable?

EXAM QUESTION I: WHAT ARE WE SEEKING TO MINIMIZE IN RESOURCE ALLOCATION?

- We seek to minimize the net loss of the allocator, L_A − min_i L_i, i.e., the allocator's total loss relative to the loss of the best strategy.
- This gives us a consistent "worst case scenario", effectively hedging our bets.

EXAM QUESTION II: WHAT IS THE GOAL OF BOOSTING?

- The goal is to use one or more weak learners to build an arbitrarily accurate strong learner.
- In other words, to combine better-than-chance heuristics into an ensemble with high predictive accuracy.

EXAM QUESTION III: WHAT MAKES ADABOOST ADAPTABLE?

- The classifiers used in the final decision function have all been modified to account for the weaknesses of the preceding classifiers.
- As long as we have at least one initial learner that tells us something about the data, the algorithm will infer everything else it needs to.