MACHINE LEARNING
PAC LEARNING
FRANÇOIS FLEURET
MAY 11TH, 2011
Introduction
Classification
The usual setting for learning for classification:
• a training set,
• a family of classifiers,
• a test set.
Learning means choosing a classifier according to its
performance on the training set, so as to get good performance on the
test set.
Topic of this lecture
The goal of this lecture is to give an intuitive understanding of the
Probably Approximately Correct learning (PAC learning for short)
theory.
• concentration inequalities,
• basic PAC results,
• relation with Occam’s principle,
• relation to the Vapnik-Chervonenkis dimension.
The figures are supposed to help. If they do not, ignore them.
Notation
We will use the following notation:
• $\mathcal{X}$: the space of the objects to classify (for instance images),
• $\mathcal{C}$: the family of classifiers,
• $S = ((X_1, Y_1), \dots, (X_{2N}, Y_{2N}))$: a random variable on
$(\mathcal{X} \times \{0, 1\})^{2N}$ standing for the training and test samples,
• $F$: a random variable on $\mathcal{C}$ standing for the learned classifier. It
can be a deterministic function of $S$ or not.
Remarks
• The set $\mathcal{C}$ contains all the classifiers obtainable with the
learning algorithm.
For an ANN, for instance, there is one element of $\mathcal{C}$ for every
single configuration of the synaptic weights.
• The variable $S$ is not one sample, but a family of $2N$ samples
with their labels. It contains both the training and the test set.
Gap between training and test error
One fixed f
For every $f \in \mathcal{C}$, let $\Delta(f, S)$ denote the difference between the
test and the training errors of $f$ estimated on
$S = ((X_1, Y_1), \dots, (X_{2N}, Y_{2N}))$:
$$
\Delta(f, S) = \underbrace{\frac{1}{N} \sum_{i=1}^{N} 1\{f(X_{N+i}) \neq Y_{N+i}\}}_{\text{test error}} - \underbrace{\frac{1}{N} \sum_{i=1}^{N} 1\{f(X_i) \neq Y_i\}}_{\text{training error}}
$$
where $1\{t\}$ is equal to 1 if $t$ is true, and 0 otherwise. Since $S$ is
random, this is a random quantity.
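As a minimal sketch of this definition, the snippet below computes the gap (test error minus training error) for one fixed classifier on several random draws of the sample. The toy data distribution, the threshold classifier, and the names `error_gap` and `draw_sample` are assumptions made for illustration only.

```python
import random

def error_gap(f, sample):
    """Return the gap: test error minus training error of f.

    `sample` is a list of 2N (x, y) pairs; the first N pairs are the
    training set and the last N pairs are the test set.
    """
    n = len(sample) // 2
    train, test = sample[:n], sample[n:]
    train_error = sum(f(x) != y for x, y in train) / n
    test_error = sum(f(x) != y for x, y in test) / n
    return test_error - train_error

# Hypothetical toy problem: x uniform in [0, 1], label 1{x > 0.5},
# flipped with probability 0.1.
rng = random.Random(0)

def draw_sample(n):
    sample = []
    for _ in range(2 * n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y
        sample.append((x, y))
    return sample

def f(x):
    """One fixed classifier, chosen before seeing any sample."""
    return int(x > 0.4)

gaps = [error_gap(f, draw_sample(100)) for _ in range(5)]
print(gaps)  # one value per draw of the sample
```

Each draw of the sample gives a different gap, which is what makes it a random quantity.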
Data-dependent f
Given $\epsilon$, we want to lower-bound the probability that the test error is less
than the training error plus $\epsilon$:
$$
P(\Delta(F, S) \leq \epsilon) \geq \; ?
$$
Here $F$ is not constant anymore and depends on the $X_1, \dots, X_{2N}$
and the $Y_1, \dots, Y_N$.
Do figures help?
Violations of the error gap
[Figure: grid with one row per classifier F and one column per sample S.]
Each row corresponds to a classifier, each column to a pair
training/test set. Gray squares indicate $\Delta(F, S) > \epsilon$.
A training algorithm
[Figure: the same grid of classifiers F against samples S, with one dot per column.]
A training algorithm associates an $F$ to every $S$, here shown with
dots. We want to bound the number of dots on gray cells.
Concentration Inequality
Introduction
Where we see that for any fixed f, the test and
training errors are likely to be similar...
Hoeffding’s inequality (1963)
Given a family of independent random variables $Z_1, \dots, Z_N$,
bounded so that $\forall i, Z_i \in [a_i, b_i]$, if $S$ denotes $\sum_i Z_i$
(on this slide only, not the sample), we have Hoeffding’s
inequality (1963):
$$
P(S - E(S) > t) \leq \exp\left( - \frac{2 t^2}{\sum_i (b_i - a_i)^2} \right)
$$
This is a concentration result: it tells how much $S$ is concentrated
around its average value.
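The inequality can be checked numerically. The sketch below, under the assumption that the variables are N fair Bernoulli variables (so each range is [0, 1]), compares an empirical estimate of the tail probability with the bound. The function names and the chosen values of N and t are illustrative.

```python
import math
import random

def hoeffding_bound(t, ranges):
    """Upper bound on P(S - E(S) > t) for independent Z_i in [a_i, b_i]."""
    return math.exp(-2.0 * t * t / sum((b - a) ** 2 for a, b in ranges))

def empirical_tail(t, n, trials, rng):
    """Estimate P(S - E(S) > t) for S a sum of n fair Bernoulli variables."""
    mean = n * 0.5
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() < 0.5 for _ in range(n))
        if s - mean > t:
            hits += 1
    return hits / trials

rng = random.Random(0)
n, t = 100, 10.0
bound = hoeffding_bound(t, [(0.0, 1.0)] * n)  # exp(-2) here
freq = empirical_tail(t, n, 2000, rng)
print(f"empirical tail {freq:.4f} <= Hoeffding bound {bound:.4f}")
```

The bound is not tight in this setting, but it holds for any bounded independent variables, which is all the rest of the argument needs.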
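Application to the error
Note that the $1\{f(X_i) \neq Y_i\}$ are i.i.d. Bernoulli, and we have
$$
\begin{aligned}
\Delta(f, S) &= \frac{1}{N} \sum_{i=1}^{N} 1\{f(X_{N+i}) \neq Y_{N+i}\} - \frac{1}{N} \sum_{i=1}^{N} 1\{f(X_i) \neq Y_i\} \\
&= \frac{1}{N} \sum_{i=1}^{N} \underbrace{\left( 1\{f(X_{N+i}) \neq Y_{N+i}\} - 1\{f(X_i) \neq Y_i\} \right)}_{\Delta_i}
\end{aligned}
$$
Thus $\Delta$ is the averaged sum of the $\Delta_i$, which are i.i.d. random
variables on $\{-1, 0, 1\}$ of zero mean.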
Hence, when $f$ is fixed, we have (Hoeffding):
$$
\forall f, \forall \epsilon, \quad P(\Delta(f, S) > \epsilon) \leq \exp\left( - \frac{1}{2} \epsilon^2 N \right)
$$
(On our graph, we have an upper bound on the number of gray
cells per row.)
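The step from Hoeffding's inequality to this bound can be spelled out as follows, writing $\epsilon$ for the margin and $\Delta_i$ for the per-sample difference between test and training indicators. Each $\Delta_i$ lies in $[-1, 1]$, so $(b_i - a_i)^2 = 4$, and the sum of the $\Delta_i$ has zero mean:

```latex
\begin{aligned}
P(\Delta(f, S) > \epsilon)
  &= P\left( \sum_{i=1}^{N} \Delta_i > N \epsilon \right) \\
  &\leq \exp\left( - \frac{2 (N \epsilon)^2}{\sum_{i=1}^{N} (1 - (-1))^2} \right)
   = \exp\left( - \frac{2 N^2 \epsilon^2}{4 N} \right)
   = \exp\left( - \frac{1}{2} \epsilon^2 N \right)
\end{aligned}
```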
Union bound
Introduction
Where we realize that the probability that the chosen
F fails is lower than the probability that
there exists an f that fails...
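A first generalization bound
We have
$$
\begin{aligned}
P(\Delta(F, S) > \epsilon) &= \sum_f P(F = f, \Delta(F, S) > \epsilon) \\
&= \sum_f P(F = f, \Delta(f, S) > \epsilon) \\
&\leq \sum_f P(\Delta(f, S) > \epsilon) \\
&\leq \|\mathcal{C}\| \exp\left( - \frac{1}{2} \epsilon^2 N \right)
\end{aligned}
$$
This is our first generalization bound!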
The union bound
[Figure: the grid of classifiers F against samples S, with the dots of a training algorithm.]
We can see that graphically as the worst case, in which the dots meet all
the gray squares.
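We can fix the probability
If we define
$$
\eta^\star = \|\mathcal{C}\| \exp\left( - \frac{1}{2} \epsilon^2 N \right)
$$
we have
$$
\sqrt{\frac{2 \left( \log \|\mathcal{C}\| - \log \eta^\star \right)}{N}} = \epsilon
$$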
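Hence from
$$
P(\Delta(F, S) > \epsilon) \leq \|\mathcal{C}\| \exp\left( - \frac{1}{2} \epsilon^2 N \right)
$$
we get
$$
P\left( \Delta(F, S) > \sqrt{\frac{2 \left( \log \|\mathcal{C}\| + \log \frac{1}{\eta^\star} \right)}{N}} \right) \leq \eta^\star
$$
Thus, with probability at least $1 - \eta^\star$, we know that the gap between the
train and test error grows at most like the square root of the log of the
number of classifiers $\|\mathcal{C}\|$.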
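To get a feel for the numbers, the sketch below evaluates the union bound and the margin it yields for a few family sizes, writing `epsilon` for the margin and `eta_star` for the allowed failure probability. The function names and the values of N and `eta_star` are illustrative assumptions.

```python
import math

def union_bound(num_classifiers, epsilon, n):
    """Bound on P(gap > epsilon): ||C|| exp(-epsilon^2 N / 2)."""
    return num_classifiers * math.exp(-0.5 * epsilon ** 2 * n)

def gap_for_confidence(num_classifiers, eta_star, n):
    """Margin epsilon at which the union bound equals eta_star."""
    return math.sqrt(
        2.0 * (math.log(num_classifiers) + math.log(1.0 / eta_star)) / n
    )

n, eta_star = 1000, 0.05
for size in (10, 1000, 10 ** 6):
    eps = gap_for_confidence(size, eta_star, n)
    print(f"||C|| = {size:>7}: gap <= {eps:.3f} "
          f"with probability >= {1 - eta_star}")
```

Multiplying the number of classifiers by 100,000 only moderately widens the margin, since the dependence is through the square root of a logarithm.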
Prior on C
Introduction
Where we realize that we can arbitrarily distribute
allowed errors on the fs before looking
at the training data...
What do we control
At that point, the only quantity we control is $\|\mathcal{C}\|$.
If we know that some of the mappings can be removed without
hurting the train error, we can remove them and get a better
bound.
Can we do something better than that?
We introduce $\epsilon(f)$ as the control we want between the train and
test error if $f$ is chosen. Until now, this was constant.
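Let’s make $\epsilon$ depend on F
Let $\eta(f)$ denote the (bound on the) probability that the constraint is
not verified for $f$:
$$
\begin{aligned}
P(\Delta(F, S) > \epsilon(F)) &\leq P(\exists f \in \mathcal{C}, \Delta(f, S) > \epsilon(f)) \\
&\leq \sum_f P(\Delta(f, S) > \epsilon(f)) \\
&\leq \sum_f \eta(f)
\end{aligned}
$$
and we have
$$
\forall f, \quad \epsilon(f) = \sqrt{\frac{2 \log \frac{1}{\eta(f)}}{N}}
$$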
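Let us define $\eta^\star = \sum_f \eta(f)$ and $\mu(f) = \frac{\eta(f)}{\eta^\star}$. The latter is a distribution
on $\mathcal{C}$.
Note that both can be fixed arbitrarily, and we have
$$
\forall f, \quad \epsilon(f) = \sqrt{\frac{2 \left( \log \frac{1}{\mu(f)} + \log \frac{1}{\eta^\star} \right)}{N}}
$$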
When $\epsilon$ depends on f
[Figure: grid with rows f and columns S; the rows are annotated with $\eta(f)$.]
If the margin $\epsilon$ depends on $F$, the proportion of gray squares is
not the same on every row.
Let’s put everything together
Our final result is that, if
• we choose a distribution $\mu$ on $\mathcal{C}$ arbitrarily,
• we choose $0 < \eta^\star < 1$ arbitrarily,
• we sample a pair $S$ training set/test set, each of size $N$,
• we choose an $F$ after looking at the training set,
then we have, with probability greater than $1 - \eta^\star$:
$$
\Delta(F, S) \leq \sqrt{\frac{2 \left( \log \frac{1}{\mu(F)} + \log \frac{1}{\eta^\star} \right)}{N}}
$$
where $\Delta(F, S)$ is the difference between the test and train errors.
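A small numerical sketch of this result follows. It assumes a hypothetical prior over four classifiers, fixed before seeing any data, and evaluates the margin each classifier would get if it were the one selected. The names `prior_gap_bound` and `mu`, and the numerical values, are illustrative.

```python
import math

def prior_gap_bound(mu_f, eta_star, n):
    """sqrt(2 (log 1/mu(f) + log 1/eta_star) / N) for the selected f."""
    return math.sqrt(
        2.0 * (math.log(1.0 / mu_f) + math.log(1.0 / eta_star)) / n
    )

# Hypothetical prior over four classifiers, chosen before seeing any data.
mu = {"f1": 0.7, "f2": 0.2, "f3": 0.09, "f4": 0.01}
assert abs(sum(mu.values()) - 1.0) < 1e-12  # mu must be a distribution

n, eta_star = 500, 0.05
for name, weight in mu.items():
    print(name, round(prior_gap_bound(weight, eta_star, n), 3))
```

Classifiers with a large prior weight get a tighter margin, at the price of a looser one for the others. With a uniform prior over a finite family, the expression reduces to the bound in the log of the number of classifiers.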
This is a philosophical theorem!
If we see $-\log \mu(f)$ as the “description” length of $f$ (think
Huffman), our result, true with probability at least $1 - \eta^\star$,
$$
\Delta(F, S) \leq \sqrt{\frac{2 \left( \log \frac{1}{\mu(F)} + \log \frac{1}{\eta^\star} \right)}{N}}
$$
says that picking a classifier with a long description leads to a bad
control on the test error.
“Entities should not be multiplied unnecessarily.”
Principle of parsimony of William of Occam (1280 – 1349). Also
known as Occam’s Razor.
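To illustrate the description-length reading, the sketch below assumes each classifier is described by a binary prefix code and sets its prior weight to 2 to the power of minus its code length in bits. By the Kraft inequality these weights sum to at most one, so the total failure probability stays below the chosen level. The function name and the values are illustrative assumptions.

```python
import math

def gap_from_description_length(bits, eta_star, n):
    """Gap bound when the prior weight of f is 2 ** (-bits)."""
    return math.sqrt(
        2.0 * (bits * math.log(2.0) + math.log(1.0 / eta_star)) / n
    )

n, eta_star = 1000, 0.05
for bits in (1, 8, 64, 512):
    gap = gap_from_description_length(bits, eta_star, n)
    print(f"{bits:>3} bits -> gap <= {gap:.3f}")
```

Once the description length approaches the number of samples, the bound exceeds 1 and says nothing, since the gap is never larger than 1.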
Exchangeable selection
Where we see that the family of classifiers can
be a function of both the training and the test
Xs...
Consider a family of classifiers which are functions of the sample
$\{X_1, \dots, X_{2N}\}$ in an exchangeable way.
For instance, with $X$s in $\mathbb{R}^k$, one could rank the $X_i$ according to the
lexicographic order, and make the $f$ functions of the ordered $X$s.
Under such a constraint, the $\Delta_i$ remain independent, with values between $-1$
and $1$, and our concentration result holds.
Vapnik-Chervonenkis
Where we realize that our classifier sets are not
as rich as we thought...
Definition of the VC dimension
The Vapnik-Chervonenkis dimension of $\mathcal{C}$ is the largest $D$ such that
there exists a family
$$
(x_1, \dots, x_D) \in \mathcal{X}^D
$$
which can be arbitrarily labeled with a classifier from $\mathcal{C}$.
Example
Consider for $\mathcal{C}$ the characteristic functions of axis-aligned rectangles. We can
find families of 1, 2, 3 or 4 points which can be labelled arbitrarily.
However, given a family of 5 points, if the four external points are
labelled 1 and the center point labelled 0, then no function from $\mathcal{C}$
can predict that labelling. Hence here $D = 4$.
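The example can be checked by brute force. The sketch below assumes axis-aligned rectangles in the plane and uses the fact that a labelling is realizable exactly when the bounding box of the points labelled 1 contains no point labelled 0. The function names and the point configurations are illustrative.

```python
from itertools import product

def realizable(points, labels):
    """True if an axis-aligned rectangle contains exactly the points labelled 1."""
    positives = [p for p, l in zip(points, labels) if l == 1]
    if not positives:
        return True  # a rectangle away from all the points labels them all 0
    x_min = min(x for x, _ in positives)
    x_max = max(x for x, _ in positives)
    y_min = min(y for _, y in positives)
    y_max = max(y for _, y in positives)
    # The bounding box of the positives is the smallest candidate rectangle:
    # the labelling is realizable iff that box contains no point labelled 0.
    return not any(
        l == 0 and x_min <= x <= x_max and y_min <= y <= y_max
        for (x, y), l in zip(points, labels)
    )

def shattered(points):
    """True if every one of the 2^n labellings of `points` is realizable."""
    return all(
        realizable(points, labels)
        for labels in product((0, 1), repeat=len(points))
    )

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
print(shattered(diamond))             # True: four points in a diamond
print(shattered(diamond + [(0, 0)]))  # False: the center cannot be excluded
```

This checks one particular configuration of five points. The argument in the text covers every configuration, since some point always lies in the bounding box of the four extreme ones.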
Sauer’s lemma
The VC dimension is mainly useful because it allows us to bound the
number of possible labellings of a family of $N$ points.
Let $S_{\mathcal{C}}(N)$ be this bound. We have (Sauer’s lemma)
$$
S_{\mathcal{C}}(N) \leq (N + 1)^D
$$
For $N$ large compared to $D$, this is far smaller than the number of arbitrary labellings $2^N$.
As far as we are concerned, many $f$s behave the same on our $X$s.
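A quick numerical comparison shows when the polynomial bound becomes smaller than the number of arbitrary labellings. The sketch takes D = 4, as for the rectangles of the example; the function name is illustrative.

```python
def sauer_bound(n, d):
    """(N + 1)^D, the bound on the number of labellings of N points."""
    return (n + 1) ** d

d = 4  # VC dimension of the rectangles in the example
for n in (4, 10, 20, 50, 100):
    print(f"N = {n:>3}: (N+1)^D = {sauer_bound(n, d):.3e}   "
          f"2^N = {2 ** n:.3e}")
```

For small N the bound is larger than 2^N and carries no information. The polynomial growth only pays off once N is well above D.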
A generalization bound
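Let $\mathbf{X}$ denote the non-ordered set $\{X_1, \dots, X_{2N}\}$.
For $\mathbf{x} \subset \mathcal{X}$, let $\mathcal{C}_{\mathbf{x}}$ denote a subset of $\mathcal{C}$ such that
$$
\forall f \in \mathcal{C}, \ \exists! g \in \mathcal{C}_{\mathbf{x}}, \ \text{s.t.} \ \forall x \in \mathbf{x}, \ f(x) = g(x)
$$
This means that $\mathcal{C}_{\mathbf{x}}$ has one, and only one, mapping per
“signature” on $\mathbf{x}$.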
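We have:
$$
\begin{aligned}
P(\Delta(F, S) > \epsilon)
&= \sum_{\mathbf{x}} P(\Delta(F, S) > \epsilon \mid \mathbf{X} = \mathbf{x}) \, P(\mathbf{X} = \mathbf{x}) \\
&= \sum_{\mathbf{x}} \sum_{f \in \mathcal{C}_{\mathbf{x}}} P(F_{|\mathbf{x}} = f_{|\mathbf{x}}, \Delta(F, S) > \epsilon \mid \mathbf{X} = \mathbf{x}) \, P(\mathbf{X} = \mathbf{x}) \\
&\leq \sum_{\mathbf{x}} \sum_{f \in \mathcal{C}_{\mathbf{x}}} P(\Delta(f, S) > \epsilon \mid \mathbf{X} = \mathbf{x}) \, P(\mathbf{X} = \mathbf{x}) \\
&\leq \sum_{\mathbf{x}} S_{\mathcal{C}}(2N) \exp\left( - \frac{1}{2} \epsilon^2 N \right) P(\mathbf{X} = \mathbf{x}) \\
&= S_{\mathcal{C}}(2N) \exp\left( - \frac{1}{2} \epsilon^2 N \right)
\end{aligned}
$$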
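Combining this bound with Sauer's lemma gives a margin that depends only on the VC dimension D and on N. The sketch below assumes D = 4 and an allowed failure probability `eta_star` of 0.05; the function names and values are illustrative.

```python
import math

def vc_bound(n, d, epsilon):
    """(2N + 1)^D exp(-epsilon^2 N / 2), with Sauer's lemma applied to 2N points."""
    return (2 * n + 1) ** d * math.exp(-0.5 * epsilon ** 2 * n)

def vc_gap(n, d, eta_star):
    """Margin epsilon at which the bound above equals eta_star."""
    return math.sqrt(
        2.0 * (d * math.log(2 * n + 1) + math.log(1.0 / eta_star)) / n
    )

d, eta_star = 4, 0.05
for n in (100, 1000, 10000, 100000):
    print(f"N = {n:>6}: gap <= {vc_gap(n, d, eta_star):.3f}")
```

Unlike the first generalization bound, this one stays finite when the family of classifiers is infinite, as long as its VC dimension is finite.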
Figures may definitely not help...
[Figure: the grid of classifiers F against samples S, annotated "Same responses" and "Same X".]
The training algorithm meets the same number of gray cells as
another one which goes only through the elements of $\mathcal{C}_{\mathbf{x}}$.
François Fleuret
IDIAP Research Institute
francois.fleuret@idiap.ch
http://www.idiap.ch/~fleuret