Foundations of Machine Learning
Lecture 2
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
PAC Learning
Concentration Bounds
Mehryar Mohri  Foundations of Machine Learning
Motivation
Some computational learning questions:
• What can be learned efficiently?
• What is inherently hard to learn?
• A general model of learning?
Complexity:
• Computational complexity: time and space.
• Sample complexity: amount of training data needed to learn successfully.
• Mistake bounds: number of mistakes before learning successfully.
This lecture
• PAC Model
• Sample complexity, finite hypothesis space: consistent case
• Sample complexity, finite hypothesis space: inconsistent case
• Concentration bounds
Definitions
$X$: set of all possible instances or examples, e.g., the set of all men and women characterized by their height and weight.
$c : X \to \{0, 1\}$: the target concept to learn, e.g., $c(x) = 1$ for a male, $c(x) = 0$ for a female example.
$C$: concept class, a set of target concepts $c$.
$D$: target distribution, a fixed probability distribution over $X$. Training and test examples are drawn according to $D$.
Definitions
$S$: training sample.
$H$: set of concept hypotheses, e.g., the set of all linear classifiers.
The learning algorithm receives sample $S$ and selects a hypothesis $h_S$ from $H$ approximating $c$.
Errors
True error or generalization error of $h$ with respect to the target concept $c$ and distribution $D$:
$$\mathrm{error}_D(h) = \Pr_{x \sim D}[h(x) \neq c(x)] = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq c(x)}\big].$$
Empirical error: average error of $h$ on the training sample $S$ drawn according to distribution $D$:
$$\mathrm{error}_S(h) = \Pr_{x \sim \widehat{D}}[h(x) \neq c(x)] = \frac{1}{m} \sum_{i=1}^{m} 1_{h(x_i) \neq c(x_i)}.$$
Note:
$$\mathrm{error}_D(h) = \mathbb{E}_{S \sim D^m}\big[\mathrm{error}_S(h)\big].$$
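These two error notions can be made concrete with a short simulation (an illustration, not part of the original slides; the threshold concept and uniform distribution below are invented for the example):

```python
import random

def true_error(h, c, sample_from_D, n=200_000, seed=0):
    # Monte Carlo estimate of error_D(h) = Pr_{x~D}[h(x) != c(x)].
    rng = random.Random(seed)
    return sum(h(x := sample_from_D(rng)) != c(x) for _ in range(n)) / n

def empirical_error(h, c, sample):
    # error_S(h): average disagreement of h with c on the sample S.
    return sum(h(x) != c(x) for x in sample) / len(sample)

# Hypothetical setup: D uniform on [0, 1], target c = 1_{x >= 0.5},
# hypothesis h = 1_{x >= 0.6}, so h errs exactly on [0.5, 0.6): error_D(h) = 0.1.
c = lambda x: 1 if x >= 0.5 else 0
h = lambda x: 1 if x >= 0.6 else 0
draw = lambda rng: rng.random()

rng = random.Random(1)
S = [draw(rng) for _ in range(1000)]
print(empirical_error(h, c, S))   # fluctuates around the true error 0.1
print(true_error(h, c, draw))     # close to 0.1
```

Drawing many independent samples $S$ and averaging the empirical errors recovers the true error, matching the note above.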
PAC Model
(Valiant, 1984)
PAC learning: Probably Approximately Correct learning.
Definition: a concept class $C$ is PAC-learnable if there exists a learning algorithm $L$ such that:
• for all $c \in C$, $\epsilon > 0$, $\delta > 0$, and all distributions $D$,
$$\Pr_{S \sim D}\big[\mathrm{error}(h_S) \le \epsilon\big] \ge 1 - \delta,$$
• for samples $S$ of size $m = \mathrm{poly}(1/\epsilon, 1/\delta)$ for a fixed polynomial.
Remarks
Concept class $C$ is known to the algorithm.
Distribution-free model: no assumption on $D$.
Both training and test examples are drawn i.i.d. according to $D$.
Probably: confidence $1 - \delta$.
Approximately correct: accuracy $1 - \epsilon$.
Efficient PAC-learning: $L$ runs in time $\mathrm{poly}(1/\epsilon, 1/\delta)$.
What about the cost of the representation of $c \in C$?
PAC Model - New Definition
Computational representation:
• cost for $x \in X$ in $O(n)$.
• cost for $c \in C$ in $O(\mathrm{size}(c))$.
Extension: running time in $O(\mathrm{poly}(1/\epsilon, 1/\delta))$ becomes $O(\mathrm{poly}(1/\epsilon, 1/\delta, n, \mathrm{size}(c)))$.
Example - Rectangle Learning
Problem: learn an unknown axis-aligned rectangle $R$ using as small a labeled sample as possible.
Hypothesis: rectangle $R'$. In general, there may be false positive and false negative points.
[Figure: target rectangle $R$ and hypothesis rectangle $R'$.]
Example - Rectangle Learning
Simple method: choose the tightest consistent rectangle $R'$ for a large enough sample. How large a sample? Is this class PAC-learnable?
What is the probability that $\mathrm{error}_D(R') > \epsilon$?
[Figure: target rectangle $R$ and tightest consistent rectangle $R'$.]
Example - Rectangle Learning
Fix $\epsilon > 0$ and assume $\Pr_D[R] > \epsilon$ (otherwise the result is trivial).
Let $r_1, r_2, r_3, r_4$ be four rectangular regions along the sides of $R$ such that $\Pr_D[r_i] = \epsilon/4$, $i = 1, \ldots, 4$.
[Figure: rectangle $R$ with side regions $r_1, r_2, r_3, r_4$, and hypothesis $R'$.]
Example - Rectangle Learning
Errors can only occur in $R \setminus R'$. Thus (geometry),
$$\mathrm{error}_D(R') > \epsilon \implies R' \text{ misses at least one region } r_i.$$
Thus,
$$\Pr[\mathrm{error}_D(R') > \epsilon] \le \Pr\Big[\bigcup_{i=1}^{4} \{R' \text{ misses } r_i\}\Big] \le \sum_{i=1}^{4} \Pr[R' \text{ misses } r_i] \le 4(1 - \epsilon/4)^m \le 4 e^{-m\epsilon/4}.$$
Example - Rectangle Learning
Set $\delta$ to match the upper bound:
$$\delta = 4 e^{-m\epsilon/4} \iff m = \frac{4}{\epsilon} \log \frac{4}{\delta}.$$
Then, for $m \ge \frac{4}{\epsilon} \log \frac{4}{\delta}$, with probability at least $1 - \delta$,
$$\mathrm{error}_D(R') \le \epsilon.$$
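The tightest-rectangle learner and its guarantee are easy to check empirically (a sketch, not from the slides; the target rectangle and the uniform distribution on the unit square are arbitrary choices):

```python
import math
import random

def learn_tightest_rectangle(sample):
    # Tightest axis-aligned rectangle containing the positive points.
    pos = [p for p, label in sample if label == 1]
    if not pos:
        return None  # empty hypothesis: predict 0 everywhere
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, p):
    if rect is None:
        return 0
    x0, x1, y0, y1 = rect
    return 1 if (x0 <= p[0] <= x1 and y0 <= p[1] <= y1) else 0

# Hypothetical target R = [0.2, 0.8] x [0.3, 0.7], D uniform on [0, 1]^2.
target = lambda p: 1 if (0.2 <= p[0] <= 0.8 and 0.3 <= p[1] <= 0.7) else 0

eps, delta = 0.1, 0.05
m = math.ceil(4 / eps * math.log(4 / delta))  # sample size from the bound
rng = random.Random(0)
test = [(rng.random(), rng.random()) for _ in range(20_000)]

trials, failures = 100, 0
for _ in range(trials):
    pts = [(rng.random(), rng.random()) for _ in range(m)]
    S = [(p, target(p)) for p in pts]
    rect = learn_tightest_rectangle(S)
    err = sum(predict(rect, p) != target(p) for p in test) / len(test)
    failures += err > eps

print(m)                    # 176 points suffice for eps = .1, delta = .05
print(failures / trials)    # empirical failure rate, well below delta
```

As expected, the bound is conservative: the observed failure rate is far below $\delta$.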
Notes
Infinite hypothesis set, but simple proof.
Does this proof readily apply to other similar concept classes?
Geometric properties:
• key in this proof.
• in general non-trivial to extend to other classes, e.g., non-concentric circles (see HW2, 2006).
Need for more general proofs and results.
This lecture
• PAC Model
• Sample complexity, finite hypothesis space: consistent case
• Sample complexity, finite hypothesis space: inconsistent case
• Concentration bounds
Sample Complexity for Finite H - Consistent Case
Theorem: let $H$ be a finite set of functions from $X$ to $\{0, 1\}$. Let $L$ be an algorithm that for any target concept $c \in H$ and sample $S$ returns a consistent hypothesis: $\mathrm{error}_S(h_S) = 0$. Then, for any $\epsilon, \delta > 0$, for a sample size
$$m \ge \frac{1}{\epsilon} \Big( \log |H| + \log \frac{1}{\delta} \Big),$$
$$\Pr_{S \sim D}\big[\mathrm{error}(h_S) \le \epsilon\big] \ge 1 - \delta.$$
Equivalent statement: with probability at least $1 - \delta$, the following error bound holds:
$$\mathrm{error}_D(h_S) \le \frac{1}{m} \Big( \log |H| + \log \frac{1}{\delta} \Big).$$
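The sample-size condition is straightforward to evaluate numerically; a small helper (an illustrative sketch, reusing the numbers from the Boolean-conjunction example later in the lecture):

```python
import math

def consistent_case_sample_size(h_size, eps, delta):
    # Smallest integer m with m >= (1/eps) * (ln|H| + ln(1/delta)).
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# Example: |H| = 3^10 (conjunctions over n = 10 Boolean variables),
# eps = 0.1, delta = 0.02.
print(consistent_case_sample_size(3**10, 0.1, 0.02))  # 149
```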
Sample Complexity for Finite H - Consistent Case - Proof
Proof: let $h \in H$ be such that $\mathrm{error}_D(h) > \epsilon$; then
$$\Pr[h \text{ consistent}] \le (1 - \epsilon)^m.$$
Thus,
$$\begin{aligned}
\Pr\big[\exists h \in H : h \text{ consistent} \wedge \mathrm{error}_D(h) > \epsilon\big]
&= \Pr\big[(h_1 \in H \text{ consistent} \wedge \mathrm{error}_D(h_1) > \epsilon) \vee (h_2 \in H \text{ consistent} \wedge \mathrm{error}_D(h_2) > \epsilon) \vee \cdots\big] \\
&\le \sum_{h \in H} \Pr\big[h \text{ consistent} \wedge \mathrm{error}_D(h) > \epsilon\big] \qquad \text{(by the union bound)} \\
&\le \sum_{h \in H} \Pr\big[h \text{ consistent} \mid \mathrm{error}_D(h) > \epsilon\big] \\
&\le \sum_{h \in H} (1 - \epsilon)^m = |H| (1 - \epsilon)^m \le |H|\, e^{-m\epsilon}.
\end{aligned}$$
Remarks
Error bound linear in $1/m$ and only logarithmic in $1/\delta$.
$\log_2 |H|$ is the number of bits for the representation of $H$.
Bound is loose for large $|H|$.
Uninformative for infinite $|H|$.
Conjunctions of Boolean Literals
Example: for $n = 6$, hypothesis $x_1 \wedge \overline{x}_2 \wedge x_5 \wedge x_6$.
Algorithm: start with $x_1 \wedge \overline{x}_1 \wedge \cdots \wedge x_n \wedge \overline{x}_n$ and rule out literals incompatible with positive examples.
[Table of labeled training examples over the six variables.]
Conjunctions of Boolean Literals
Problem: learning the class $C_n$ of conjunctions of Boolean literals with at most $n$ variables (e.g., for $n = 3$, $x_1 \wedge \overline{x}_2 \wedge x_3$).
Algorithm: choose learner $h$ consistent with $S$.
• Since $|H| = |C_n| = 3^n$, sample complexity:
$$m \ge \frac{1}{\epsilon} \Big( (\log 3)\, n + \log \frac{1}{\delta} \Big).$$
For $\delta = .02$, $\epsilon = .1$, $n = 10$: $m \ge 149$.
• Computational complexity: polynomial, since the algorithmic cost per training example is in $O(n)$.
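The elimination algorithm for conjunctions can be sketched in a few lines (an illustrative implementation, not the lecture's code; the tiny dataset is invented):

```python
def learn_conjunction(sample, n):
    # Start from all 2n literals: (i, True) stands for x_i, (i, False) for not x_i.
    # A positive example x contradicts exactly the literals requiring the value 1 - x[i].
    literals = {(i, v) for i in range(n) for v in (True, False)}
    for x, label in sample:
        if label == 1:
            literals -= {(i, not x[i]) for i in range(n)}  # O(n) work per example
    return literals

def predict(literals, x):
    # The conjunction of the surviving literals.
    return 1 if all(x[i] == v for i, v in literals) else 0

# Hypothetical target over n = 3 variables: x_0 AND (not x_1).
S = [((1, 0, 0), 1), ((1, 0, 1), 1), ((0, 0, 1), 0), ((1, 1, 0), 0)]
h = learn_conjunction(S, 3)
print(sorted(h))  # surviving literals, consistent with both positive examples
```

On this sample the algorithm retains exactly $x_0 \wedge \overline{x}_1$, and the returned hypothesis is consistent with every training example, as the theorem requires.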
Universal Concept Class
Problem: each $x \in X$ is defined by $n$ Boolean features. Let $C$ be the set of all subsets of $X$.
Question: is $C$ PAC-learnable?
Sample complexity: $H$ must contain $C$. Thus,
$$|H| \ge |C| = 2^{(2^n)}.$$
The bound gives
$$m \ge \frac{1}{\epsilon} \Big( (\log 2)\, 2^n + \log \frac{1}{\delta} \Big).$$
It can be proved that $C$ is not PAC-learnable: it requires an exponential sample size.
k-Term DNF Formulae
Definition: expressions of the form $T_1 \vee \cdots \vee T_k$, with each term $T_i$ a conjunction of Boolean literals with at most $n$ variables.
Problem: learning k-term DNF formulae.
Sample complexity: $|H| = |C| = 3^{nk}$. Thus, polynomial sample complexity:
$$\frac{1}{\epsilon} \Big( (\log 3)\, nk + \log \frac{1}{\delta} \Big).$$
Time complexity: intractable if $RP \neq NP$: the class is then not efficiently PAC-learnable (proof by reduction from graph 3-coloring). But, a strictly larger class is!
k-CNF Expressions
Definition: expressions $T_1 \wedge \cdots \wedge T_j$ of arbitrary length $j$, with each term $T_i$ a disjunction of at most $k$ Boolean attributes.
Algorithm: reduce the problem to that of learning conjunctions of Boolean literals, using $(2n)^k$ new variables:
$$(u_1, \ldots, u_k) \mapsto Y_{u_1, \ldots, u_k}.$$
• the transformation is a bijection;
• the effect of the transformation on the distribution is not an issue: PAC-learning allows any distribution $D$.
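The variable substitution can be sketched directly (illustrative code, not from the slides): each new feature $Y_{u_1, \ldots, u_k}$ takes the value of the disjunction $u_1 \vee \cdots \vee u_k$ on the example.

```python
from itertools import product

def literal_value(x, lit):
    # lit = (i, positive): the literal x_i if positive, else its negation.
    i, positive = lit
    return bool(x[i]) if positive else not x[i]

def kcnf_features(x, n, k):
    # Map x in {0,1}^n to the (2n)^k new Boolean variables
    # Y_{u_1,...,u_k} = u_1 OR ... OR u_k, u_j ranging over all 2n literals.
    literals = [(i, v) for i in range(n) for v in (True, False)]
    return {
        us: any(literal_value(x, u) for u in us)
        for us in product(literals, repeat=k)
    }

feats = kcnf_features((1, 0), n=2, k=2)
print(len(feats))  # (2n)^k = 16 new variables for n = 2, k = 2
```

A k-CNF over the original variables is then exactly a conjunction over these new variables, so the consistent-case bound for conjunctions applies.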
k-Term DNF Terms and k-CNF Expressions
Observation: any k-term DNF formula can be written as a k-CNF expression. By associativity,
$$\bigvee_{i=1}^{k} u_{i,1} \wedge \cdots \wedge u_{i,n_i} = \bigwedge_{j_1 \in [1, n_1], \ldots, j_k \in [1, n_k]} u_{1, j_1} \vee \cdots \vee u_{k, j_k}.$$
• Example:
$$(u_1 \wedge u_2 \wedge u_3) \vee (v_1 \wedge v_2 \wedge v_3) = \bigwedge_{i, j = 1}^{3} (u_i \vee v_j).$$
• But, in general, converting a k-CNF (equivalent to a k-term DNF) to a k-term DNF is intractable.
Key aspects of the PAC-learning definition:
• cost of representation of the concept $c$.
• choice of hypothesis set $H$.
This lecture
• PAC Model
• Sample complexity, finite hypothesis space: consistent case
• Sample complexity, finite hypothesis space: inconsistent case
• Concentration bounds
Inconsistent Case
No $h \in H$ is a consistent hypothesis.
The typical case in practice: difficult problems, complex concept class.
But, inconsistent hypotheses with a small number of errors on the training set can be useful.
Need a more powerful tool: Hoeffding's inequality.
Hoeffding's Inequality
Corollary: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h : X \to \{0, 1\}$, the following inequalities hold:
$$\Pr\big[\mathrm{error}_D(h) - \mathrm{error}_S(h) \ge \epsilon\big] \le e^{-2m\epsilon^2}$$
$$\Pr\big[\mathrm{error}_D(h) - \mathrm{error}_S(h) \le -\epsilon\big] \le e^{-2m\epsilon^2}.$$
Combining these one-sided inequalities yields
$$\Pr\big[\big|\mathrm{error}_D(h) - \mathrm{error}_S(h)\big| \ge \epsilon\big] \le 2 e^{-2m\epsilon^2}.$$
Application: Bound for Single Hypothesis
Theorem: fix a hypothesis $h : X \to \{0, 1\}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\mathrm{error}_D(h) \le \mathrm{error}_S(h) + \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
Proof: apply Hoeffding's inequality,
$$\Pr\big[\big|\mathrm{error}_D(h) - \mathrm{error}_S(h)\big| \ge \epsilon\big] \le 2 e^{-2m\epsilon^2}.$$
Setting $\delta$ to match the upper bound gives the result.
Example: Tossing a Coin
Problem: estimate the bias of a coin from a sample $S = (H, T, T, H, T, H, H, T, H, H, H, T, T, \ldots, H)$.
With heads labeled 1, let $h$ be the constant hypothesis predicting 0. Then $p = \mathrm{error}_D(h)$ is the bias of the coin and $\hat{p} = \mathrm{error}_S(h)$ is the percentage of heads in the sample. Thus, with probability at least $1 - \delta$,
$$|p - \hat{p}| \le \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
Thus, choosing $\delta = .02$ and $m = 1000$ implies that with probability at least 98%,
$$|p - \hat{p}| \le \sqrt{\log(10)/1000} \approx .048.$$
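A simulation (illustrative, with an arbitrarily chosen true bias) confirms that the $.048$ band holds far more often than the guaranteed 98%:

```python
import math
import random

m, delta = 1000, 0.02
band = math.sqrt(math.log(2 / delta) / (2 * m))  # ~0.048

p = 0.37  # hypothetical true bias, unknown to the estimator
rng = random.Random(0)

trials, inside = 2000, 0
for _ in range(trials):
    heads = sum(rng.random() < p for _ in range(m))
    p_hat = heads / m
    inside += abs(p - p_hat) <= band

print(band)             # 0.048...
print(inside / trials)  # at least 0.98 by the bound; much higher in practice
```

The bound is distribution-free, so the same guarantee would hold for any bias $p$.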
Application to Learning Algorithm?
Can we apply that bound to the hypothesis $h_S$ returned by our learning algorithm when training on sample $S$?
No, because $h_S$ is not a fixed hypothesis: it depends on the training sample. Note also that $\mathbb{E}[\mathrm{error}_S(h_S)]$ is not a simple quantity such as $\mathrm{error}_D(h_S)$.
Instead, we need a bound that holds simultaneously for all hypotheses $h \in H$, a uniform convergence bound.
Generalization Bound - Finite H
Theorem: let $H$ be a finite hypothesis set. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\forall h \in H, \quad R(h) \le \widehat{R}(h) + \sqrt{\frac{\log |H| + \log \frac{2}{\delta}}{2m}},$$
writing $R(h)$ for the generalization error $\mathrm{error}_D(h)$ and $\widehat{R}(h)$ for the empirical error $\mathrm{error}_S(h)$.
Proof: by the union bound,
$$\begin{aligned}
\Pr\Big[\exists h \in H, \big|R(h) - \widehat{R}(h)\big| > \epsilon\Big]
&= \Pr\Big[\big|R(h_1) - \widehat{R}(h_1)\big| > \epsilon \,\vee \cdots \vee\, \big|R(h_{|H|}) - \widehat{R}(h_{|H|})\big| > \epsilon\Big] \\
&\le \sum_{h \in H} \Pr\Big[\big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \\
&\le 2 |H| \exp(-2m\epsilon^2).
\end{aligned}$$
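Numerically, moving from a single fixed hypothesis to a uniform statement over $H$ only inflates the confidence term by $\log |H|$ (a sketch; the sizes below are arbitrary):

```python
import math

def finite_h_bound(h_size, m, delta):
    # Uniform-convergence term: sqrt((ln|H| + ln(2/delta)) / (2m)).
    return math.sqrt((math.log(h_size) + math.log(2 / delta)) / (2 * m))

def single_h_bound(m, delta):
    # Fixed-hypothesis Hoeffding term: sqrt(ln(2/delta) / (2m)).
    return math.sqrt(math.log(2 / delta) / (2 * m))

m, delta = 1000, 0.05
for h_size in (10, 10**3, 10**6):
    print(h_size, single_h_bound(m, delta), finite_h_bound(h_size, m, delta))
```

Even $|H| = 10^6$ only roughly doubles the confidence term at $m = 1000$, reflecting the logarithmic dependence on $|H|$.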
Summary & Comments
Thus, for a finite hypothesis set, with high probability,
$$\forall h \in H, \quad R(h) \le \widehat{R}(h) + O\Bigg(\sqrt{\frac{\log |H|}{m}}\Bigg).$$
$\log_2 |H|$ can be interpreted as the number of bits needed to encode $H$.
Occam's Razor principle (theologian William of Occam): "plurality should not be posited without necessity".
How do we deal with infinite hypothesis sets?
Occam's Razor
Principle formulated by controversial theologian William of Occam: "plurality should not be posited without necessity", rephrased as "the simplest explanation is best".
Invoked in a variety of contexts, e.g., syntax.
Kolmogorov complexity can be viewed as the corresponding framework in information theory.
In this context: to minimize true error, choose the most parsimonious explanation (smallest $|H|$). We will see later other applications of this principle.
This lecture
• PAC Model
• Sample complexity for finite hypothesis space: consistent case
• Sample complexity for finite hypothesis space: inconsistent case
• Concentration bounds
Concentration Inequalities
Some general tools for error analysis and bounds:
• Hoeffding's inequality (additive).
• Chernoff bounds (multiplicative).
• McDiarmid's inequality (more general).
Hoeffding's Lemma
Lemma: let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $a \le X \le b$ with $b \neq a$. Then, for $t > 0$,
$$\mathbb{E}\big[e^{tX}\big] \le e^{\frac{t^2 (b - a)^2}{8}}.$$
Proof: by convexity of $x \mapsto e^{tx}$, for all $a \le x \le b$,
$$e^{tx} \le \frac{b - x}{b - a} e^{ta} + \frac{x - a}{b - a} e^{tb}.$$
Thus, using $\mathbb{E}[X] = 0$,
$$\mathbb{E}\big[e^{tX}\big] \le \mathbb{E}\Big[\frac{b - X}{b - a} e^{ta} + \frac{X - a}{b - a} e^{tb}\Big] = \frac{b}{b - a} e^{ta} + \frac{-a}{b - a} e^{tb} = e^{\phi(t)},$$
with
$$\phi(t) = \log\Big(\frac{b}{b - a} e^{ta} + \frac{-a}{b - a} e^{tb}\Big) = ta + \log\Big(\frac{b}{b - a} + \frac{-a}{b - a} e^{t(b - a)}\Big).$$
• Taking the derivative gives:
$$\phi'(t) = a - \frac{a\, e^{t(b - a)}}{\frac{b}{b - a} - \frac{a}{b - a} e^{t(b - a)}} = a - \frac{a}{\frac{b}{b - a} e^{-t(b - a)} - \frac{a}{b - a}}.$$
• Note that $\phi(0) = 0$ and $\phi'(0) = 0$. Furthermore, with $\alpha = \frac{-a}{b - a}$,
$$\phi''(t) = \frac{-ab\, e^{t(b - a)}}{\Big[\frac{b}{b - a} - \frac{a}{b - a} e^{t(b - a)}\Big]^2} = \frac{\alpha (1 - \alpha)(b - a)^2 e^{t(b - a)}}{\big[(1 - \alpha) + \alpha e^{t(b - a)}\big]^2} = u(1 - u)(b - a)^2 \le \frac{(b - a)^2}{4},$$
with $u = \frac{\alpha e^{t(b - a)}}{(1 - \alpha) + \alpha e^{t(b - a)}}$. By Taylor's theorem, there exists $0 \le \theta \le t$ such that:
$$\phi(t) = \phi(0) + t \phi'(0) + \frac{t^2}{2} \phi''(\theta) \le \frac{t^2 (b - a)^2}{8}.$$
Hoeffding's Theorem
Theorem: let $X_1, \ldots, X_m$ be independent random variables with $X_i \in [a_i, b_i]$. Then, for any $\epsilon > 0$, the following inequalities hold for $S_m = \sum_{i=1}^{m} X_i$:
$$\Pr\big[S_m - \mathbb{E}[S_m] \ge \epsilon\big] \le e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2}$$
$$\Pr\big[S_m - \mathbb{E}[S_m] \le -\epsilon\big] \le e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2}.$$
• Proof: the proof is based on Chernoff's bounding technique: for any random variable $X$ and $t > 0$, apply Markov's inequality and select $t$ to minimize:
$$\Pr[X \ge \epsilon] = \Pr\big[e^{tX} \ge e^{t\epsilon}\big] \le \frac{\mathbb{E}\big[e^{tX}\big]}{e^{t\epsilon}}.$$
• Using this scheme and the independence of the random variables gives
$$\begin{aligned}
\Pr\big[S_m - \mathbb{E}[S_m] \ge \epsilon\big]
&\le e^{-t\epsilon}\, \mathbb{E}\big[e^{t(S_m - \mathbb{E}[S_m])}\big] \\
&= e^{-t\epsilon} \prod_{i=1}^{m} \mathbb{E}\big[e^{t(X_i - \mathbb{E}[X_i])}\big] \\
&\le e^{-t\epsilon} \prod_{i=1}^{m} e^{t^2 (b_i - a_i)^2 / 8} \qquad (\text{lemma applied to } X_i - \mathbb{E}[X_i]) \\
&= e^{-t\epsilon}\, e^{t^2 \sum_{i=1}^{m} (b_i - a_i)^2 / 8} \le e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2},
\end{aligned}$$
choosing $t = 4\epsilon / \sum_{i=1}^{m} (b_i - a_i)^2$.
• The second inequality is proved in a similar way.
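The two-sided form of the theorem is easy to stress-test empirically (a sketch with arbitrarily chosen ranges $[a_i, b_i]$ and uniform variables):

```python
import math
import random

rng = random.Random(0)
ranges = [(0, 1), (-2, 1), (0, 3)] * 50  # m = 150 variables with mixed ranges
denom = sum((b - a) ** 2 for a, b in ranges)

def draw_sum():
    # S_m = sum of independent uniforms on the given ranges.
    return sum(rng.uniform(a, b) for a, b in ranges)

mean = sum((a + b) / 2 for a, b in ranges)  # E[S_m] for uniform X_i
eps = 25.0
bound = 2 * math.exp(-2 * eps**2 / denom)  # two-sided Hoeffding bound

trials = 5000
hits = sum(abs(draw_sum() - mean) >= eps for _ in range(trials))
print(hits / trials, "<=", bound)
```

The empirical deviation frequency sits far below the bound, which only uses the ranges and ignores the actual variances.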
Hoeffding's Inequality
Corollary: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h : X \to \{0, 1\}$, the following inequalities hold:
$$\Pr\big[\widehat{R}(h) - R(h) \ge \epsilon\big] \le e^{-2m\epsilon^2}$$
$$\Pr\big[\widehat{R}(h) - R(h) \le -\epsilon\big] \le e^{-2m\epsilon^2}.$$
Proof: follows directly from Hoeffding's theorem.
Combining these one-sided inequalities yields
$$\Pr\big[\big|\widehat{R}(h) - R(h)\big| \ge \epsilon\big] \le 2 e^{-2m\epsilon^2}.$$
Chernoff's Inequality
Theorem: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h : X \to \{0, 1\}$, the following inequalities hold:
$$\Pr\big[\widehat{R}(h) \ge (1 + \epsilon) R(h)\big] \le e^{-m R(h) \epsilon^2 / 3}$$
$$\Pr\big[\widehat{R}(h) \le (1 - \epsilon) R(h)\big] \le e^{-m R(h) \epsilon^2 / 2}.$$
Proof: proof based on Chernoff's bounding technique.
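The additive (Hoeffding) and multiplicative (Chernoff) bounds trade off differently when $R(h)$ is small; a quick comparison on one event (illustrative numbers):

```python
import math

def hoeffding_bound(m, eps_add):
    # Additive: Pr[R_hat - R >= eps_add] <= exp(-2 m eps_add^2).
    return math.exp(-2 * m * eps_add**2)

def chernoff_upper_bound(m, r, eps_mult):
    # Multiplicative: Pr[R_hat >= (1 + eps_mult) R] <= exp(-m R eps_mult^2 / 3).
    return math.exp(-m * r * eps_mult**2 / 3)

m, r = 1000, 0.01
# Same event: R_hat >= 2 R(h), i.e. additive deviation r, multiplicative 1.
print(hoeffding_bound(m, r))            # exp(-0.2) ~ 0.82: nearly vacuous
print(chernoff_upper_bound(m, r, 1.0))  # exp(-10/3) ~ 0.036: much sharper
```

For rare-error hypotheses the multiplicative bound exploits the small value of $R(h)$, while the additive bound depends only on the deviation size.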
McDiarmid's Inequality
(McDiarmid, 1989)
Theorem: let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f : U^m \to \mathbb{R}$ a function verifying for all $i \in [1, m]$,
$$\sup_{x_1, \ldots, x_m, x_i'} \big| f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m) \big| \le c_i.$$
Then, for all $\epsilon > 0$,
$$\Pr\Big[\big| f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)] \big| > \epsilon\Big] \le 2 \exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$$
Comments:
• Proof: uses Hoeffding's lemma.
• Hoeffding's inequality is a special case of McDiarmid's with
$$f(x_1, \ldots, x_m) = \frac{1}{m} \sum_{i=1}^{m} x_i \quad \text{and} \quad c_i = \frac{|b_i - a_i|}{m}.$$
Summary
$C$ is PAC-learnable if there exist $L$ and a polynomial $P$ such that for all $c \in C$, $\epsilon, \delta > 0$, and $m = P(1/\epsilon, 1/\delta)$,
$$\Pr_{S \sim D^m}\big[R(h_S) \le \epsilon\big] \ge 1 - \delta.$$
Learning bound, finite consistent case:
$$R(h_S) \le \frac{1}{m} \Big( \log |H| + \log \frac{1}{\delta} \Big).$$
Learning bound, finite inconsistent case:
$$R(h) \le \widehat{R}(h) + \sqrt{\frac{\log |H| + \log \frac{2}{\delta}}{2m}}.$$
McDiarmid's inequality:
$$\Pr\big[| f - \mathbb{E}[f] | > \epsilon\big] \le 2 \exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$$
References
• Michael Kearns and Umesh Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
• Leslie G. Valiant. A Theory of the Learnable. Communications of the ACM 27(11):1134–1142 (1984).