MACHINE LEARNING
PAC-LEARNING
FRANÇOIS FLEURET
MAY 11TH, 2011
Introduction
Classification
The usual setting for learning for classification:
- A training set,
- a family of classifiers,
- a test set.
Learning means choosing a classifier according to its performance
on the training set in order to get good performance on the test
set.
2/36
Introduction
Topic of this lecture
The goal of this lecture is to give an intuitive understanding of the
Probably Approximately Correct learning (PAC learning for short)
theory.
- Concentration inequalities,
- basic PAC results,
- relation with Occam’s principle,
- relation to Vapnik-Chervonenkis dimension.
The figures are supposed to help. If they do not, ignore them.
3/36
Introduction
Notation
We will use the following notation:
- X the space of the objects to classify (for instance images)
- C the family of classifiers
- $S = ((X_1, Y_1), \ldots, (X_{2N}, Y_{2N}))$ a random variable on
  $(X \times \{0,1\})^{2N}$ standing for the training and test samples.
- $F$ a random variable on $C$ standing for the learned classifier. It
  can be a deterministic function of $S$ or not.
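To make the notation concrete, here is a minimal sketch under my own toy assumptions (a finite family of threshold classifiers on the real line, a noisy threshold labelling, and empirical risk minimization; none of this is from the slides):

    import random

    N = 100

    # Family C of classifiers: thresholds on the real line, f_t(x) = 1{x > t}.
    C = [lambda x, t=t: int(x > t) for t in [i / 10 for i in range(-20, 21)]]

    # Sample S: 2N i.i.d. pairs (X_i, Y_i); the first N are the training set,
    # the last N the test set.  The "true" labelling is 1{x > 0} plus noise.
    def draw_sample(n=2 * N):
        S = []
        for _ in range(n):
            x = random.uniform(-2, 2)
            y = int(x > 0) if random.random() > 0.1 else 1 - int(x > 0)
            S.append((x, y))
        return S

    S = draw_sample()

    # F: here a deterministic function of the training half of S
    # (empirical risk minimization over C).
    def train_error(f, S):
        return sum(f(x) != y for x, y in S[:N]) / N

    F = min(C, key=lambda f: train_error(f, S))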
4/36
Introduction
Remarks
- The set C contains all the classifiers obtainable with the
  learning algorithm.
  For an ANN for instance, there is one element of C for every
  single configuration of the synaptic weights.
- The variable S is not one sample, but a family of 2N samples
  with their labels. It contains both the training and the test set.
5/36
Gap between training and test error
One fixed f
For every $f \in C$, let $\Delta(f, S)$ denote the difference between the
test and the training errors of $f$ estimated on
$S = ((X_1, Y_1), \ldots, (X_{2N}, Y_{2N}))$.

$$\Delta(f, S) = \underbrace{\frac{1}{N} \sum_{i=1}^{N} 1\{f(X_{N+i}) \neq Y_{N+i}\}}_{\text{test error}} - \underbrace{\frac{1}{N} \sum_{i=1}^{N} 1\{f(X_i) \neq Y_i\}}_{\text{training error}}$$

where $1\{t\}$ is equal to 1 if $t$ is true, and 0 otherwise. Since $S$ is
random, this is a random quantity.
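As an illustration, a minimal sketch of how $\Delta(f, S)$ could be computed for a fixed classifier (the threshold classifier and the noisy data are my own toy choices, not from the slides):

    import random

    def gap(f, S):
        """Delta(f, S): test error minus training error for a fixed classifier f,
        where S is a list of 2N (x, y) pairs, first half train, second half test."""
        n = len(S) // 2
        train_err = sum(f(x) != y for x, y in S[:n]) / n
        test_err  = sum(f(x) != y for x, y in S[n:]) / n
        return test_err - train_err

    # Toy example: a fixed threshold classifier and noisy threshold data.
    f = lambda x: int(x > 0.0)
    S = [(x, int(x > 0) if random.random() > 0.1 else 1 - int(x > 0))
         for x in (random.uniform(-1, 1) for _ in range(2 * 1000))]
    print(gap(f, S))   # a small random quantity, concentrated around 0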
6/36
Gap between the test and the training error
Data-dependent f
Given ,we want to bound the probability that the test error is less
than the training error plus .
P ((F;S)  ) ?
Here F is not constant anymore and depends on the X
1
;:::;X
2N
and the Y
1
;:::;Y
N
.
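A quick simulation suggests why this is harder than the fixed-$f$ case: because $F$ is picked to look good on the training half, its gap is biased upwards. This is a sketch under my own assumptions (a hypothetical family of threshold classifiers and a noisy threshold labelling):

    import random

    N, EPS, TRIALS = 50, 0.1, 2000
    thresholds = [i / 20 for i in range(-20, 21)]

    def err(t, pairs):
        return sum(int(x > t) != y for x, y in pairs) / len(pairs)

    count_fixed, count_selected = 0, 0
    for _ in range(TRIALS):
        S = [(x, int(x > 0) if random.random() > 0.3 else 1 - int(x > 0))
             for x in (random.uniform(-1, 1) for _ in range(2 * N))]
        train, test = S[:N], S[N:]
        # Fixed f: threshold chosen before seeing the data.
        count_fixed += (err(0.0, test) - err(0.0, train)) > EPS
        # Data-dependent F: threshold minimizing the training error.
        t_star = min(thresholds, key=lambda t: err(t, train))
        count_selected += (err(t_star, test) - err(t_star, train)) > EPS

    print("P(gap > eps), fixed f   :", count_fixed / TRIALS)
    print("P(gap > eps), selected F:", count_selected / TRIALS)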
7/36
Do figures help?
Violations of the error gap
[Figure: a grid with one row per classifier F and one column per sample S.]
Each row corresponds to a classifier, each column to a pair
training/test set. Gray squares indicate $\Delta(F, S) > \epsilon$.
8/36
Do figures help?
A training algorithm
[Figure: the same grid, with one dot per column marking the classifier chosen for that sample.]
A training algorithm associates an F to every S, here shown with
dots. We want to bound the number of dots on gray cells.
9/36
Concentration Inequality
Introduction
Where we see that for any fixed f, the test and
training errors are likely to be similar...
10/36
Concentration Inequality
Hœffding’s inequality (1963)
Given a family of independent random variables $Z_1, \ldots, Z_N$,
bounded $\forall i,\ Z_i \in [a_i, b_i]$, if $S$ denotes $\sum_i Z_i$, we have Hœffding's
inequality (1963):

$$P(S - E(S) > t) \leq \exp\left(-\frac{2 t^2}{\sum_i (b_i - a_i)^2}\right)$$

This is a concentration result: it tells how much $S$ is concentrated
around its average value.
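A small Monte Carlo check, as a sketch under my own assumptions (uniform $Z_i$ on $[0,1]$, so $a_i = 0$, $b_i = 1$), comparing the empirical tail probability with the Hœffding bound:

    import math, random

    N, t, TRIALS = 100, 5.0, 20000

    # Z_i uniform on [0, 1], so a_i = 0, b_i = 1 and E(S) = N / 2.
    exceed = 0
    for _ in range(TRIALS):
        S = sum(random.random() for _ in range(N))
        exceed += (S - N / 2) > t

    bound = math.exp(-2 * t * t / N)          # sum_i (b_i - a_i)^2 = N
    print("empirical P(S - E(S) > t):", exceed / TRIALS)
    print("Hoeffding bound          :", bound)

The bound holds (and is typically quite loose for any particular distribution).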
11/36
Concentration Inequality
Application to the error
Note that the $1\{f(X_i) \neq Y_i\}$ are i.i.d. Bernoulli, and we have

$$\begin{aligned}
\Delta(f, S) &= \frac{1}{N} \sum_{i=1}^{N} 1\{f(X_{N+i}) \neq Y_{N+i}\} - \frac{1}{N} \sum_{i=1}^{N} 1\{f(X_i) \neq Y_i\} \\
&= \frac{1}{N} \sum_{i=1}^{N} \underbrace{1\{f(X_{N+i}) \neq Y_{N+i}\} - 1\{f(X_i) \neq Y_i\}}_{\delta_i}
\end{aligned}$$

Thus $\Delta$ is the averaged sum of the $\delta_i$, which are i.i.d. random
variables on $\{-1, 0, 1\}$ of zero mean.
12/36
Concentration Inequality
Application to the error
Hence, when f is fixed we have (Hœffding):

$$\forall f,\ \forall \epsilon, \quad P(\Delta(f, S) > \epsilon) \leq \exp\left(-\frac{1}{2}\epsilon^2 N\right)$$

(On our graph, we have an upper bound on the number of gray
cells per row.)
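Filling in the intermediate step: applying Hœffding's inequality to $\sum_i \delta_i$ with $a_i = -1$, $b_i = 1$ gives exactly this bound.

    % Each \delta_i lies in [-1, 1], so (b_i - a_i)^2 = 4 and \sum_i (b_i - a_i)^2 = 4N.
    % The \delta_i have zero mean, and \Delta(f, S) > \epsilon iff \sum_i \delta_i > N\epsilon, hence
    P\!\left(\Delta(f, S) > \epsilon\right)
      = P\!\left(\sum_{i=1}^{N} \delta_i > N\epsilon\right)
      \leq \exp\!\left(-\frac{2 (N\epsilon)^2}{4N}\right)
      = \exp\!\left(-\frac{1}{2}\epsilon^2 N\right).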
13/36
Union bound
Introduction
Where we realize that the probability that the chosen F fails is
lower than the probability that there exists an f that fails...
14/36
Union bound
A first generalization bound
We have

$$\begin{aligned}
P(\Delta(F, S) > \epsilon) &= \sum_f P(F = f,\ \Delta(F, S) > \epsilon) \\
&= \sum_f P(F = f,\ \Delta(f, S) > \epsilon) \\
&\leq \sum_f P(\Delta(f, S) > \epsilon) \\
&\leq \|C\| \exp\left(-\frac{1}{2}\epsilon^2 N\right)
\end{aligned}$$

This is our first generalization bound!
15/36
Do figures help?
The union bound
[Figure: the grid again, with dots placed so that every gray square is hit.]
We can see that graphically as a situation where the dots meet all
the gray squares.
16/36
Union bound
We can fix the probability
If we define

$$\delta^* = \|C\| \exp\left(-\frac{1}{2}\epsilon^2 N\right)$$

we have

$$\sqrt{\frac{2\left(\log \|C\| + \log \frac{1}{\delta^*}\right)}{N}} = \epsilon$$
17/36
Union bound
We can fix the probability
Hence from

$$P(\Delta(F, S) > \epsilon) \leq \|C\| \exp\left(-\frac{1}{2}\epsilon^2 N\right)$$

we get

$$P\left(\Delta(F, S) > \sqrt{\frac{2\left(\log \|C\| + \log \frac{1}{\delta^*}\right)}{N}}\right) \leq \delta^*$$

Thus, with probability $1 - \delta^*$, we know that the gap between the
train and test error grows like the square root of the log of the
number of classifiers $\|C\|$.
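As a sketch (a helper of my own, not from the slides), the guaranteed margin for a finite family follows directly from this formula:

    import math

    def union_bound_gap(card_C, N, delta_star):
        """Margin eps such that, with probability at least 1 - delta_star,
        the test error exceeds the training error by at most eps,
        for a finite family of card_C classifiers and N train / N test samples."""
        return math.sqrt(2 * (math.log(card_C) + math.log(1 / delta_star)) / N)

    # E.g. 10^6 classifiers, N = 10000, 99% confidence:
    print(union_bound_gap(1e6, 10000, 0.01))   # ~ 0.06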
18/36
Prior on C
Introduction
Where we realize that we can arbitrarily distribute allowed errors
on the fs before looking at the training data...
19/36
Prior on C
What do we control
At that point, the only quantity we control is $\|C\|$.
If we know that some of the mappings can be removed without
hurting the train error, we can remove them and get a better
bound.
Can we do something better than that?
We introduce $\epsilon(f)$ as the control we want between the train and
test error if f is chosen. Until now, this was constant.
20/36
Prior on C
Let's make $\epsilon$ depend on F
Let $\delta(f)$ denote the (bound on the) probability that the constraint is
not verified for $f$:

$$\begin{aligned}
P(\Delta(F, S) > \epsilon(F)) &\leq P\left(\exists f \in C,\ \Delta(f, S) > \epsilon(f)\right) \\
&\leq \sum_f P\left(\Delta(f, S) > \epsilon(f)\right) \\
&\leq \sum_f \delta(f)
\end{aligned}$$

and we have

$$\forall f, \quad \epsilon(f) = \sqrt{\frac{2 \log \frac{1}{\delta(f)}}{N}}$$
21/36
Prior on C
Let's make $\epsilon$ depend on F
Let's define $\delta^* = \sum_f \delta(f)$ and $\pi(f) = \delta(f)/\delta^*$. The latter is a
distribution on C.
Note that both can be fixed arbitrarily, and we have

$$\forall f, \quad \epsilon(f) = \sqrt{\frac{2\left(\log \frac{1}{\pi(f)} + \log \frac{1}{\delta^*}\right)}{N}}$$
22/36
Do figures help?
When $\epsilon$ depends on f
[Figure: the grid again, with a per-row margin $\epsilon(f)$.]
If the margin $\epsilon$ depends on F, the proportion of gray squares is
not the same on every row.
23/36
Prior on C
Let’s put everything together
Our final result is that, if
- we choose a distribution $\pi$ on C arbitrarily,
- we choose $0 < \delta^* < 1$ arbitrarily,
- we sample a pair S of training and test sets, each of size N,
- we choose an F after looking at the training set,
then, with probability greater than $1 - \delta^*$, we have:

$$\Delta(F, S) \leq \sqrt{\frac{2\left(\log \frac{1}{\pi(F)} + \log \frac{1}{\delta^*}\right)}{N}}$$

where $\Delta(F, S)$ is the difference between the test and train errors.
24/36
Prior on C
This is a philosophical theorem!
If we see log(f ) as the “description” length of f (think
Huffman).Our result true with probability 
?
(F;S) 
s
2
log
1
(F)
+log
1

?
N
says that picking a classifier with a long description leads to a bad
control on the test error.
Entities should not be multiplied unnecessarily.
Principle of parsimony of William of Occam (1280 – 1349). Also
known as Occam's Razor.
25/36
Exchangeable selection
How we see that the family of classifiers can
be a function of both the training and the test
Xs...
26/36
Exchangeable selection
Consider a family of classifiers which are functions of the sample
$\{X_1, \ldots, X_{2N}\}$ in an exchangeable way.
For instance with Xs in $\mathbb{R}^k$, one could rank the $X_i$ according to the
lexicographic order, and make the $f$s functions of the ordered Xs.
Under such a constraint, the $\delta_i$ remain independent, bounded
between $-1$ and $1$, and our concentration result holds.
27/36
Vapnik-Chervonenkis
Where we realize that our classifier sets are not
as rich as we thought...
28/36
Vapnik-Chervonenkis
Definition of the VC dimension
The Vapnik-Chervonenkis dimension of C is the largest D so that
there exists a family

$$(x_1, \ldots, x_D) \in X^D$$

which can be arbitrarily labeled with a classifier from C.
29/36
Vapnik-Chervonenkis
Example
Consider for C the characteristic functions of axis-aligned
rectangles. We can find families of 1, 2, 3 or 4 points which can be
labelled arbitrarily.
30/36
Vapnik-Chervonenkis
Example
However, given a family of 5 points, if the four external points are
labelled 1 and the center point labelled 0, then no function from C
can predict that labelling. Hence here D = 4.
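A brute-force check of this example, as a sketch assuming axis-aligned rectangles (the points and helper names are my own): a labelling is achievable iff the bounding box of the points labelled 1 contains no point labelled 0.

    from itertools import product

    def achievable(points, labels):
        """True iff some axis-aligned rectangle contains exactly the points labelled 1."""
        pos = [p for p, l in zip(points, labels) if l == 1]
        neg = [p for p, l in zip(points, labels) if l == 0]
        if not pos:
            return True        # a rectangle placed away from all points labels everything 0
        x0 = min(x for x, _ in pos); x1 = max(x for x, _ in pos)
        y0 = min(y for _, y in pos); y1 = max(y for _, y in pos)
        return not any(x0 <= x <= x1 and y0 <= y <= y1 for x, y in neg)

    def shattered(points):
        return all(achievable(points, labels)
                   for labels in product([0, 1], repeat=len(points)))

    # A diamond of 4 points is shattered...
    print(shattered([(0, 1), (1, 0), (0, -1), (-1, 0)]))          # True
    # ...but adding the center point breaks it (the labelling 1,1,1,1,0 fails).
    print(shattered([(0, 1), (1, 0), (0, -1), (-1, 0), (0, 0)]))  # False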
31/36
Vapnik-Chervonenkis
Sauer’s lemma
The VC-dimension is mainly useful because it allows one to bound
the number of possible labellings of a family of N points.
Let $S_C(N)$ be this bound. We have (Sauer's lemma)

$$S_C(N) \leq (N + 1)^D$$

This is far smaller than the number of arbitrary labellings $2^N$.
As far as we are concerned, many $f$s behave the same on our Xs.
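A quick numerical comparison (my own sketch) of the polynomial growth $(N+1)^D$ against $2^N$:

    D = 4                                   # e.g. axis-aligned rectangles
    for N in (10, 50, 100, 1000):
        print(N, (N + 1) ** D, 2 ** N)
    # For N = 100: at most 104060401 labellings, versus 2^100 ~ 1.3e30.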
32/36
Vapnik-Chervonenkis
A generalization bound
Let $\mathbf{X}$ denote the non-ordered set $\{X_1, \ldots, X_{2N}\}$.
For $\mathbf{x} \subset X$, let $C_{\mathbf{x}}$ denote a subset of C such that

$$\forall f \in C,\ \exists!\, g \in C_{\mathbf{x}}\ \text{s.t.}\ \forall x \in \mathbf{x},\ f(x) = g(x)$$

This means that $C_{\mathbf{x}}$ has one, and only one, mapping per
“signature” on $\mathbf{x}$.
33/36
Vapnik-Chervonenkis
A generalization bound
We have:

$$\begin{aligned}
P(\Delta(F, S) > \epsilon)
&= \sum_{\mathbf{x}} P(\Delta(F, S) > \epsilon \mid \mathbf{X} = \mathbf{x})\, P(\mathbf{X} = \mathbf{x}) \\
&= \sum_{\mathbf{x}} \sum_{f \in C_{\mathbf{x}}} P(F_{|\mathbf{x}} = f_{|\mathbf{x}},\ \Delta(F, S) > \epsilon \mid \mathbf{X} = \mathbf{x})\, P(\mathbf{X} = \mathbf{x}) \\
&\leq \sum_{\mathbf{x}} \sum_{f \in C_{\mathbf{x}}} P(\Delta(f, S) > \epsilon \mid \mathbf{X} = \mathbf{x})\, P(\mathbf{X} = \mathbf{x}) \\
&\leq \sum_{\mathbf{x}} S_C(2N) \exp\left(-\frac{1}{2}\epsilon^2 N\right) P(\mathbf{X} = \mathbf{x}) \\
&= S_C(2N) \exp\left(-\frac{1}{2}\epsilon^2 N\right)
\end{aligned}$$
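As a sketch (a helper of my own, not from the slides), combining this with Sauer's lemma and inverting the bound exactly as for the finite-family case gives a margin that depends on the VC dimension D rather than on $\|C\|$:

    import math

    def vc_gap(D, N, delta_star):
        """Margin eps such that P(Delta(F, S) > eps) <= delta_star, obtained by
        inverting S_C(2N) * exp(-eps^2 * N / 2) <= delta_star with Sauer's bound
        S_C(2N) <= (2N + 1)^D (same inversion as for the finite-family bound)."""
        return math.sqrt(2 * (D * math.log(2 * N + 1) + math.log(1 / delta_star)) / N)

    # Axis-aligned rectangles (D = 4), N = 10000 train / 10000 test, 99% confidence:
    print(vc_gap(4, 10000, 0.01))   # ~ 0.09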
34/36
Vapnik-Chervonenkis
Figures may definitely not help...
[Figure: the grid again, with columns sharing the same X grouped together and rows giving the same responses on those X grouped together.]
The training algorithm meets the same number of gray cells as
another one which goes only through the elements of $C_{\mathbf{x}}$.
35/36
François Fleuret
IDIAP Research Institute
francois.fleuret@idiap.ch
http://www.idiap.ch/~fleuret
36/36