A Black-Box approach to machine learning


1

A Black-Box approach to machine learning

Yoav Freund

2

Why do we need learning?


Computers need functions that map highly variable data:
Speech recognition: audio signal -> words
Image analysis: video signal -> objects
Bio-informatics: micro-array images -> gene function
Data mining: transaction logs -> customer classification

For accuracy, functions must be tuned to fit the data source.
For real-time processing, function computation has to be very fast.

3

The complexity/accuracy tradeoff

[Plot: error as a function of model complexity, with a horizontal line marking trivial performance.]

4

The speed/flexibility tradeoff

[Plot: flexibility vs. speed. Matlab code, Java code, machine code, digital hardware, and analog hardware trade flexibility for speed along this curve.]

5

Theory vs. Practice

Theoretician: I want a polynomial-time algorithm which is guaranteed to perform arbitrarily well in "all" situations.
- I prove theorems.

Practitioner: I want a real-time algorithm that performs well on my problem.
- I experiment.

My approach: I want combining algorithms whose performance and speed are guaranteed relative to the performance and speed of their components.
- I do both.

6

Plan of talk


The black-box approach
Boosting
Alternating decision trees
A commercial application
Boosting the margin
Confidence-rated predictions
Online learning


7

The black-box approach

Statistical models are not generators, they are predictors.

A predictor is a function from observation X to action Z.

After the action is taken, outcome Y is observed, which implies a loss L (a real-valued number).

Goal: find a predictor with small loss (in expectation, with high probability, cumulative, …).

8

Main software components

[Diagram: a predictor maps an input x to a prediction z; a learner maps training examples to a predictor.]

We assume the predictor will be applied to examples similar to those on which it was trained.

9

Learning in a system

[Diagram: a learning system receives training examples and produces a predictor; the predictor is embedded in a target system, mapping sensor data to actions, with feedback from the target system flowing back to the learning system.]

10

Special case: Classification

Observation X: an arbitrary (measurable) space
Prediction Z: {1,…,K}; usually K=2 (binary classification)
Outcome Y: the finite set {1,…,K}

11

Batch learning for binary classification

Data distribution:
Generalization error:
Training set:
Training error:
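The formulas for these quantities did not survive extraction; a standard reconstruction, assuming i.i.d. examples with labels in {-1,+1} (the slide's own notation may differ), is:

```latex
(x, y) \sim \mathcal{D}                                                        % data distribution
\varepsilon(h) = \Pr_{(x,y)\sim\mathcal{D}}\!\left[h(x) \neq y\right]          % generalization error
S = \{(x_1, y_1), \dots, (x_n, y_n)\} \ \text{drawn i.i.d.\ from } \mathcal{D}  % training set
\hat{\varepsilon}(h) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\!\left[h(x_i) \neq y_i\right]  % training error
```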

12

Boosting

Combining weak learners

13

A weighted training set

14

A weak learner

The weak requirement:

[Diagram: a weak learner maps a weighted training set to a weak rule h; the rule h maps instances to predictions.]
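The requirement itself appeared as a formula; in the usual notation it asks that the rule beat random guessing on the weighted training set by some edge γ > 0:

```latex
\sum_{i=1}^{n} w_i\,\mathbf{1}\!\left[h(x_i) \neq y_i\right] \;\le\; \frac{1}{2} - \gamma
\qquad \text{for some } \gamma > 0, \quad \sum_i w_i = 1
```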

15

The boosting process

[Diagram: the weak learner is called repeatedly. Round 1 runs on the uniformly weighted set (x1,y1,1/n), …, (xn,yn,1/n) and returns h1; each subsequent round reweights the examples to (x1,y1,w1), …, (xn,yn,wn) and returns the next rule h2, h3, …, hT.]

Final rule:
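The final rule was shown as a formula; in standard AdaBoost notation it is the weighted majority vote of the weak rules:

```latex
f_T(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right)
```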

16

Adaboost
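The algorithm box on this slide was an image. Below is a minimal Python sketch of the standard AdaBoost update for labels in {-1,+1}; the weak_learner(X, y, w) callable is a hypothetical interface standing in for whatever weak learner is actually used.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Minimal AdaBoost sketch for labels y in {-1, +1}.

    `weak_learner(X, y, w)` is a hypothetical interface: it should return a
    callable h with h(X) in {-1, +1}, trained to have small weighted error
    under the example weights w.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start from uniform weights
    rules, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)
        pred = h(X)
        eps = float(np.sum(w[pred != y]))        # weighted training error
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # guard against 0 or 1
        alpha = 0.5 * np.log((1 - eps) / eps)    # weight of this weak rule
        w = w * np.exp(-alpha * y * pred)        # up-weight the mistakes
        w = w / w.sum()
        rules.append(h)
        alphas.append(alpha)

    def final_rule(X_new):
        # weighted majority vote: sign of the combined score
        return np.sign(sum(a * h(X_new) for a, h in zip(alphas, rules)))

    return final_rule
```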

17

Main property of Adaboost

If the advantages of the weak rules over random guessing are γ1, γ2, …, γT, then the training error of the final rule is at most:
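The bound itself was an image; in the classical AdaBoost analysis it is

```latex
\text{training error}(f_T) \;\le\; \prod_{t=1}^{T}\sqrt{1 - 4\gamma_t^{2}}
\;\le\; \exp\!\left(-2\sum_{t=1}^{T}\gamma_t^{2}\right)
```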

18

Boosting block diagram

[Diagram: the booster and the weak learner exchange example weights and weak rules; together they form a strong learner that outputs an accurate rule.]

19

What is a good weak learner?

The set of weak rules (features) should be:
Flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
Simple enough to allow efficient search for a rule with non-trivial weighted training error.
Small enough to avoid over-fitting.

Calculation of a prediction from the observations should be very fast.

20

Alternating decision trees

Freund, Mason 1997

21

Decision Trees

[Figure: a decision tree that splits on X>3 and then Y>5, labeling its three leaves -1, +1, -1, shown next to the corresponding axis-parallel partition of the (X, Y) plane.]

22

A decision tree as a sum of weak rules

[Figure: the same tree rewritten as a sum of real-valued weak rules: a baseline score of -0.2 plus contributions from the X>3 and Y>5 tests (values such as ±0.1 and +0.2/-0.3); the predicted label is the sign of the total.]

23

An alternating decision tree

[Figure: an alternating decision tree over the same data: a root prediction of -0.2, decision nodes X>3, Y>5, and Y<1, each with prediction-node children (e.g. ±0.1, +0.2/-0.3, 0.0/+0.7). An example's score is the sum of the prediction nodes on every path it satisfies, and the label is the sign of that sum, shown alongside the resulting partition of the (X, Y) plane.]
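To make the "sum of rules" reading concrete, here is a small Python sketch that scores a point the way the figures above describe; the thresholds and contribution values are illustrative stand-ins, not the exact numbers from the slide.

```python
def adt_score(x, y):
    """Score a point (x, y) as a sum of prediction-node contributions.
    The thresholds and values below are illustrative stand-ins."""
    score = -0.2                      # root prediction node (baseline)
    score += 0.1 if x > 3 else -0.1   # contribution attached to the X>3 test
    score += 0.2 if y > 5 else -0.3   # contribution attached to the Y>5 test
    score += 0.7 if y < 1 else 0.0    # contribution attached to the Y<1 test
    return score

def adt_predict(x, y):
    # the predicted label is the sign of the total score
    return 1 if adt_score(x, y) >= 0 else -1
```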

24

Example: Medical Diagnostics


Cleve dataset from the UC Irvine database.
Heart-disease diagnostics (+1 = healthy, -1 = sick).
13 features from tests (real-valued and discrete).
303 instances.


25

AD-tree for heart-disease diagnostics

[Figure: the learned alternating decision tree; a total score >0 means Healthy, <0 means Sick.]

26

Commercial Deployment.



27

AT&T “buisosity” problem


Distinguish business from residence customers using call-detail information (time of day, length of call, …).
230M telephone numbers, label unknown for ~30%.
260M calls per day.

Freund, Mason, Rogers, Pregibon, Cortes 2000

Required computer resources:
Huge: counting log entries to produce statistics; uses specialized I/O-efficient sorting algorithms (Hancock).
Significant: calculating the classification for ~70M customers.
Negligible: learning (2 hours on 10K training examples on an off-line computer).

28

AD-tree for “buisosity”

29

AD-tree (detail)

30

Quantifiable results


At 94% accuracy, increased coverage from 44% to 56%.

Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.

[Figures: accuracy as a function of score, and precision/recall.]

31

Adaboost’s resistance to overfitting

Why statisticians find Adaboost interesting.

32

A very curious phenomenon

Boosting decision trees

Using < 10,000 training examples we fit > 2,000,000 parameters.

33

Large margins

Thesis: large margins => reliable predictions.

Very similar to SVM.
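The margin formula on the slide was an image; for a weighted vote over weak rules the normalized margin of an example (x, y) is usually defined as

```latex
\operatorname{margin}(x, y) \;=\; \frac{y \sum_{t} \alpha_t\, h_t(x)}{\sum_{t} |\alpha_t|} \;\in\; [-1, 1]
```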

34

Experimental Evidence

35

Theorem

Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998

H: a set of binary functions with VC-dimension d
C: convex combinations of functions from H

No dependence on the number of combined functions!
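The bound itself was an image; up to constants, the result states that with probability at least 1-δ over a training sample of size m, for every f in C and every margin threshold θ > 0:

```latex
\Pr_{\mathcal{D}}\!\big[\, y f(x) \le 0 \,\big]
\;\le\;
\Pr_{S}\!\big[\, y f(x) \le \theta \,\big]
\;+\;
O\!\left(\sqrt{\frac{d \,\log^{2}(m/d)}{m\,\theta^{2}} \;+\; \frac{\log(1/\delta)}{m}}\right)
```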

36

Idea of Proof

37

Confidence-rated predictions

Agreement gives confidence

38

A motivating example

[Figure: a two-dimensional scatter of + and - training examples; three query points marked "?" are shown, and those that fall in ambiguous regions are labeled "Unsure".]

39

The algorithm

Parameters:
Hypothesis weight:
Empirical log ratio:
Prediction rule:

Freund, Mansour, Schapire 2001
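The formulas were not preserved. Roughly, each hypothesis is weighted by exp(-η · its empirical error), the empirical log ratio compares the total weight of hypotheses predicting +1 against those predicting -1, and the rule predicts the sign of that ratio when its magnitude clears a threshold and says "unsure" otherwise. One plausible rendering, with η and Δ as the parameters:

```latex
w(h) \propto e^{-\eta\,\hat{\varepsilon}(h)}
\qquad
\hat{l}(x) = \frac{1}{\eta}\,\ln\frac{\sum_{h:\,h(x)=+1} w(h)}{\sum_{h:\,h(x)=-1} w(h)}
\qquad
\hat{p}(x) =
\begin{cases}
\operatorname{sign}\big(\hat{l}(x)\big) & |\hat{l}(x)| > \Delta \\
\text{unsure} & \text{otherwise}
\end{cases}
```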

40

Suggested tuning

Suppose H is a finite set.

Yields:


41

Confidence-rating block diagram

[Diagram: a rater-combiner takes candidate rules and training examples and outputs a confidence-rated rule.]

42

Face Detection


Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).

Viola & Jones 1999

43

Using confidence to save time

The detector combines 6000 simple features using Adaboost.
In most boxes, only 8-9 features are calculated.

[Diagram: a cascade. All boxes are first scored on Feature 1; boxes that are definitely not a face are rejected immediately, and only boxes that might be a face go on to Feature 2, and so on.]
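A minimal sketch of the early-rejection idea in Python; the feature functions and thresholds are hypothetical stand-ins, and the real Viola-Jones detector uses boosted stages rather than single features.

```python
def cascade_score(box, stages):
    """Score an image box with an early-rejection cascade.

    `stages` is a list of (feature_fn, reject_threshold) pairs -- hypothetical
    stand-ins for the boosted features.  Returns None if the box is rejected
    ("definitely not a face") before all stages are evaluated.
    """
    score = 0.0
    for feature_fn, reject_threshold in stages:
        score += feature_fn(box)
        if score < reject_threshold:   # confident rejection: stop early
            return None
    return score                       # survived all stages: might be a face
```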

44

Using confidence to train car detectors

45

Original image vs. difference image

46

Co-training

[Diagram: highway images are fed to two partially trained classifiers, one based on the B/W image and one based on the difference image; each classifier's confident predictions are used to further train the other.]

Blum and Mitchell 98
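A schematic sketch of the co-training loop in Python, assuming hypothetical train and confident_labels helpers; it is meant only to show how confident predictions are exchanged between the two views.

```python
def co_train(labeled_a, labeled_b, unlabeled, rounds, train, confident_labels):
    """Schematic co-training loop over two views of the same images.

    `train(data)` fits a classifier and `confident_labels(clf, unlabeled)`
    returns (example, label) pairs the classifier is confident about;
    both are hypothetical placeholders for the real detectors.
    """
    for _ in range(rounds):
        clf_a = train(labeled_a)                   # e.g. B/W-image view
        clf_b = train(labeled_b)                   # e.g. difference-image view
        # each view's confident predictions become labels for the other view
        labeled_b = labeled_b + confident_labels(clf_a, unlabeled)
        labeled_a = labeled_a + confident_labels(clf_b, unlabeled)
    return train(labeled_a), train(labeled_b)
```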

47

Co-training results

[Figures: detection results for the raw-image detector and the difference-image detector, before and after co-training.]

Levin, Freund, Viola 2002

48

Selective sampling

[Diagram: unlabeled data is passed through a partially trained classifier; a sample of the unconfident examples is sent out for labeling, and the resulting labeled examples are fed back to continue training.]

Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby

49

Online learning

Adapting to changes

50

Online learning

So far, the only statistical assumption was that the data is generated i.i.d. Can we get rid of that assumption?

Yes, if we consider prediction as a repeated game.

An expert is an algorithm that maps the past to a prediction.

Suppose we have a set of experts; we believe one is good, but we don't know which one.

51

Online prediction game

For t = 1, …, T:
Experts generate predictions:
Algorithm makes its own prediction:
Nature generates outcome:

Total loss of expert i:
Total loss of algorithm:

Goal: for any sequence of events,
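The per-round quantities and the goal appeared as formulas; a standard rendering is:

```latex
\text{expert } i \text{ predicts } z_i^t, \quad
\text{the algorithm predicts } z_A^t, \quad
\text{nature reveals } y^t

L_i^T = \sum_{t=1}^{T} \ell\!\left(z_i^t, y^t\right),
\qquad
L_A^T = \sum_{t=1}^{T} \ell\!\left(z_A^t, y^t\right)

\text{Goal: } \ L_A^T \;\le\; \min_i L_i^T + o(T) \quad \text{for any sequence of outcomes}
```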

52

A very simple example


Binary classification
N experts
One expert is known to be perfect

Algorithm: predict like the majority of the experts that have made no mistake so far.

Bound: at most log2(N) mistakes.
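A small Python sketch of this "halving" strategy; the expert interface (a function from the observed history and the current instance to a {0,1} prediction) is a schematic assumption.

```python
def halving_predict(history, x, experts, alive):
    """Majority vote of the experts that are still consistent with the past.
    Each expert maps (history, x) to a prediction in {0, 1}."""
    votes = [experts[i](history, x) for i in alive]
    return int(2 * sum(votes) >= len(votes))

def halving_update(history, x, y, experts, alive):
    """Keep only the experts that predicted y correctly.  Whenever the
    majority is wrong, at least half of `alive` is removed, and the perfect
    expert is never removed, so the algorithm makes at most log2(N) mistakes."""
    return [i for i in alive if experts[i](history, x) == y]
```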

53

History of online learning


Littlestone & Warmuth
Vovk
Vovk and Shafer's recent book: "Probability and Finance: It's Only a Game!"

Innumerable contributions from many fields: Hannan, Blackwell, Davison, Gallager, Cover, Barron, Foster & Vohra, Fudenberg & Levine, Feder & Merhav, Shtarkov, Rissanen, Cesa-Bianchi, Lugosi, Blum, Freund, Schapire, Valiant, Auer, …

54

Lossless compression

Observation X: an arbitrary input space
Prediction Z: [0,1]
Outcome Y: {0,1}

Entropy, lossless compression, MDL; statistical likelihood, standard probability theory.

Log loss:
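The log-loss formula was not preserved; for an outcome y in {0,1} and a prediction z in [0,1] it is

```latex
\ell(z, y) \;=\; -\,y \ln z \;-\; (1-y)\ln(1-z)
```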

55

Bayesian averaging

Folk theorem in Information Theory
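The statement appeared as a formula; the folk theorem says that the cumulative log loss of the Bayes mixture over N experts (with a uniform prior) exceeds that of the best expert by at most ln N:

```latex
L_{\mathrm{Bayes}}^{T} \;\le\; \min_{i} L_{i}^{T} \;+\; \ln N
```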

56

Game theoretical loss

Observation X: an arbitrary space
Outcome Y: a loss for each of the N actions
Prediction Z: a distribution over the N actions

Loss:
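The loss formula was not preserved; with Y a vector of per-action losses and Z a distribution over the actions, the natural reading is the expected loss

```latex
\ell(z, y) \;=\; \sum_{a=1}^{N} z(a)\, y(a)
```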

57

Learning in games

An algorithm which knows T in advance guarantees:
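The guarantee was a formula; for losses in [0,1], bounds of this type (e.g. for the Hedge algorithm with the learning rate tuned using T) have the form

```latex
L_{A}^{T} \;\le\; \min_{i} L_{i}^{T} \;+\; O\!\left(\sqrt{T \ln N}\right)
```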

Freund and Schapire 94

58

Multi-arm bandits

The algorithm cannot observe the full outcome. Instead, a single action is chosen at random according to the algorithm's distribution, and only that action's loss is observed.

Auer, Cesa-Bianchi, Freund, Schapire 95

We describe an algorithm that guarantees, with probability 1-δ:
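The bound was a formula on the slide; the Exp3-style guarantees of Auer et al. are of roughly this form: with probability at least 1-δ,

```latex
L_{A}^{T} \;\le\; \min_{i} L_{i}^{T} \;+\; O\!\left(\sqrt{T\,N\,\ln(N/\delta)}\right)
```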

59

Why isn’t online learning practical?

Prescriptions too similar to the Bayesian approach.
Implementing low-level learning requires a large number of experts.
Computation increases linearly with the number of experts.
Potentially very powerful for combining a few high-level experts.

60

Online learning for detector deployment

[Diagram: a face-detector library holds packaged detectors, e.g. "Merl frontal 1.0": code for a B/W frontal face detector (indoor, neutral background, direct front-right lighting). A detector is downloaded into an online-learning (OL) module that processes images, outputs face detections, and receives feedback, yielding an adaptive real-time face detector.]

The detector can be adaptive!

61

Summary


By combining predictors we can:
Improve accuracy.
Estimate prediction confidence.
Adapt on-line.

To make machine learning practical:
Speed up the predictors.
Concentrate human feedback on hard cases.
Fuse data from several sources.
Share predictor libraries.