# Foundations of Machine Learning Lecture 2

Oct 12, 2013
Mehryar Mohri
mohri@cims.nyu.edu

PAC Learning
Concentration Bounds
## Motivation

Some computational learning questions:

- What can be learned efficiently?
- What is inherently hard to learn?
- Is there a general model of learning?

Notions of complexity:

- **Computational complexity**: time and space.
- **Sample complexity**: amount of training data needed to learn successfully.
- **Mistake bounds**: number of mistakes made before learning successfully.
## This Lecture

- PAC model
- Sample complexity, finite hypothesis space: consistent case
- Sample complexity, finite hypothesis space: inconsistent case
- Concentration bounds
## Definitions

- $X$: set of all possible instances or examples, e.g., the set of all men and women characterized by their height and weight.
- $c \colon X \to \{0, 1\}$: the target concept to learn, e.g., $c(x) = 0$ for a male, $c(x) = 1$ for a female example.
- $C$: concept class, a set of target concepts $c$.
- $D$: target distribution, a fixed probability distribution over $X$. Training and test examples are drawn according to $D$.
## Definitions

- $S$: training sample.
- $H$: set of concept hypotheses, e.g., the set of all linear classifiers.
- The learning algorithm receives the sample $S$ and selects a hypothesis $h_S$ from $H$ approximating $c$.
## Errors

**True error** (or generalization error) of $h$ with respect to the target concept $c$ and distribution $D$:

$$\mathrm{error}_D(h) = \Pr_{x \sim D}\big[h(x) \neq c(x)\big] = \mathop{\mathbb{E}}_{x \sim D}\big[1_{h(x) \neq c(x)}\big].$$

**Empirical error**: average error of $h$ on the training sample $S$ drawn according to distribution $D$:

$$\widehat{\mathrm{error}}_S(h) = \Pr_{x \sim \widehat{D}}\big[h(x) \neq c(x)\big] = \frac{1}{m}\sum_{i=1}^{m} 1_{h(x_i) \neq c(x_i)}.$$

Note: for a fixed $h$,

$$\mathrm{error}_D(h) = \mathop{\mathbb{E}}_{S \sim D^m}\big[\widehat{\mathrm{error}}_S(h)\big].$$
## PAC Model

**PAC learning**: Probably Approximately Correct learning (Valiant, 1984).

**Definition**: a concept class $C$ is **PAC-learnable** if there exists a learning algorithm $L$ such that:

- for all $c \in C$, $\epsilon > 0$, $\delta > 0$, and all distributions $D$,

$$\Pr_{S \sim D}\big[\mathrm{error}(h_S) \leq \epsilon\big] \geq 1 - \delta,$$

- for samples $S$ of size $m = \mathrm{poly}(1/\epsilon, 1/\delta)$ for a fixed polynomial.
## Remarks

- The concept class $C$ is known to the algorithm.
- Distribution-free model: no assumption on $D$.
- Both training and test examples are drawn $\sim D$.
- Probably: confidence $1 - \delta$.
- Approximately correct: accuracy $1 - \epsilon$.
- **Efficient PAC-learning**: $L$ runs in time $\mathrm{poly}(1/\epsilon, 1/\delta)$.
- What about the cost of the representation of $c \in C$?
## PAC Model - New Definition

**Computational representation**:

- cost for $x \in X$ in $O(n)$;
- cost for $c \in C$ in $O(\mathrm{size}(c))$.

**Extension**: running time $O(\mathrm{poly}(1/\epsilon, 1/\delta)) \to O(\mathrm{poly}(1/\epsilon, 1/\delta, n, \mathrm{size}(c)))$.
## Example - Rectangle Learning

**Problem**: learn an unknown axis-aligned rectangle $R$ using as small a labeled sample as possible.

**Hypothesis**: rectangle $R'$. In general, there may be false positive and false negative points.

*(Figure: target rectangle $R$ and hypothesis $R'$.)*
## Example - Rectangle Learning

**Simple method**: choose the tightest consistent rectangle $R'$ for a large enough sample. How large a sample? Is this class PAC-learnable?

What is the probability that $\mathrm{error}_D(R') > \epsilon$?
## Example - Rectangle Learning

Fix $\epsilon > 0$ and assume $\Pr_D[R] > \epsilon$ (otherwise the result is trivial).

Let $r_1, r_2, r_3, r_4$ be four rectangles along the sides of $R$ such that $\Pr_D[r_i] = \epsilon/4$ for $i = 1, \ldots, 4$.
## Example - Rectangle Learning

Errors can only occur in $R \setminus R'$. Thus (geometry),

$$\mathrm{error}_D(R') > \epsilon \implies R' \text{ misses at least one region } r_i.$$

Thus,

$$\Pr\big[\mathrm{error}_D(R') > \epsilon\big] \leq \Pr\Big[\bigcup_{i=1}^{4} \{R' \text{ misses } r_i\}\Big] \leq \sum_{i=1}^{4} \Pr\big[\{R' \text{ misses } r_i\}\big] \leq 4\,(1 - \epsilon/4)^m \leq 4 e^{-m\epsilon/4}.$$
## Example - Rectangle Learning

Set $\delta$ to match the upper bound:

$$\delta = 4 e^{-m\epsilon/4} \iff m = \frac{4}{\epsilon}\log\frac{4}{\delta}.$$

Then, for $m \geq \frac{4}{\epsilon}\log\frac{4}{\delta}$, with probability at least $1 - \delta$,

$$\mathrm{error}_D(R') \leq \epsilon.$$
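This bound is easy to check numerically. A minimal simulation sketch, under assumed choices not in the lecture (uniform distribution on $[0,1]^2$, an arbitrary target rectangle, and illustrative values of $\epsilon$ and $\delta$):

```python
import math
import random

def tightest_rectangle(sample):
    """Smallest axis-aligned rectangle containing every positive point."""
    pos = [p for p, label in sample if label]
    if not pos:
        return None  # no positive points: predict negative everywhere
    xs, ys = [p[0] for p in pos], [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def contains(rect, p):
    return (rect is not None
            and rect[0] <= p[0] <= rect[1]
            and rect[2] <= p[1] <= rect[3])

random.seed(0)
R = (0.2, 0.7, 0.3, 0.9)                      # unknown target rectangle (illustrative)
eps, delta = 0.1, 0.05
m = math.ceil(4 / eps * math.log(4 / delta))  # sample size from the bound

failures, trials = 0, 200
for _ in range(trials):
    S = [(p, contains(R, p))
         for p in ((random.random(), random.random()) for _ in range(m))]
    R_prime = tightest_rectangle(S)
    # estimate error_D(R') on a fresh test set
    test = [(random.random(), random.random()) for _ in range(5000)]
    err = sum(contains(R, p) != contains(R_prime, p) for p in test) / len(test)
    failures += err > eps

print(m)                   # sample size prescribed by the bound
print(failures / trials)   # observed failure rate; the bound guarantees <= delta
```

In practice the observed failure rate is far below $\delta$: the bound is valid but loose.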
## Notes

- Infinite hypothesis set, but simple proof.
- Does this proof readily apply to other similar concept classes?
- Geometric properties are key in this proof; in general they are non-trivial to extend to other classes, e.g., non-concentric circles (see HW2, 2006).
- Hence the need for more general proofs and results.
## This Lecture

- PAC model
- Sample complexity, finite hypothesis space: consistent case
- Sample complexity, finite hypothesis space: inconsistent case
- Concentration bounds
## Sample Complexity for Finite H - Consistent Case

**Theorem**: let $H$ be a finite set of functions from $X$ to $\{0, 1\}$. Let $L$ be an algorithm that for any target concept $c \in H$ and sample $S$ returns a consistent hypothesis $h_S$: $\widehat{\mathrm{error}}_S(h_S) = 0$. Then, for any $\epsilon, \delta > 0$, for a sample size

$$m \geq \frac{1}{\epsilon}\Big(\log|H| + \log\frac{1}{\delta}\Big), \qquad \Pr_{S \sim D}\big[\mathrm{error}(h_S) \leq \epsilon\big] \geq 1 - \delta.$$

Equivalent statement: with probability at least $1 - \delta$, the following error bound holds:

$$\mathrm{error}_D(h_S) \leq \frac{1}{m}\Big(\log|H| + \log\frac{1}{\delta}\Big).$$
## Sample Complexity for Finite H - Consistent Case - Proof

**Proof**: let $h \in H$ be such that $\mathrm{error}_D(h) > \epsilon$; then

$$\Pr[h \text{ consistent}] \leq (1 - \epsilon)^m.$$

Thus,

$$
\begin{aligned}
\Pr\big[\exists h \in H \colon h \text{ consistent} \wedge \mathrm{error}_D(h) > \epsilon\big]
&= \Pr\big[(h_1 \in H \text{ consistent} \wedge \mathrm{error}_D(h_1) > \epsilon) \vee (h_2 \in H \text{ consistent} \wedge \mathrm{error}_D(h_2) > \epsilon) \vee \cdots\big] \\
&\leq \sum_{h \in H} \Pr\big[h \text{ consistent} \wedge \mathrm{error}_D(h) > \epsilon\big] \qquad \text{(by the union bound)} \\
&\leq \sum_{h \in H} \Pr\big[h \text{ consistent} \mid \mathrm{error}_D(h) > \epsilon\big] \\
&\leq |H| (1 - \epsilon)^m \leq |H| e^{-m\epsilon}.
\end{aligned}
$$
## Remarks

- The error bound is linear in $1/m$ and only logarithmic in $1/\delta$.
- $\log_2|H|$ is the number of bits for the representation of $H$.
- The bound is loose for large $|H|$.
- Uninformative for infinite $|H|$.
## Conjunctions of Boolean Literals

**Example** for $n = 6$: learned hypothesis $\overline{x}_1 \wedge x_2 \wedge x_5 \wedge x_6$.

| $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ | label |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 1 | + |
| 0 | 1 | 1 | 1 | 1 | 1 | + |
| 0 | 0 | 1 | 1 | 0 | 1 | − |
| 0 | 1 | 1 | 1 | 1 | 1 | + |
| 1 | 0 | 0 | 1 | 1 | 0 | − |
| 0 | 1 | 0 | 0 | 1 | 1 | + |
| 0 | 1 | ? | ? | 1 | 1 |   |

**Algorithm**: start with $x_1 \wedge \overline{x}_1 \wedge \cdots \wedge x_n \wedge \overline{x}_n$ and rule out literals incompatible with positive examples.
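The elimination algorithm above can be sketched in a few lines on the table's data (a minimal sketch; the literal encoding `(i, True)` for $x_{i+1}$ and `(i, False)` for its negation is an illustrative choice, not the lecture's notation):

```python
def learn_conjunction(examples):
    """Keep every literal compatible with all positive examples.

    A literal is a pair (i, v): (i, True) stands for x_{i+1},
    (i, False) for its negation.  Start from the conjunction of all
    2n literals and discard those contradicted by a positive example.
    """
    n = len(examples[0][0])
    literals = {(i, v) for i in range(n) for v in (True, False)}
    for bits, label in examples:
        if label:  # negative examples are ignored by this algorithm
            for i, b in enumerate(bits):
                literals.discard((i, not bool(b)))
    return literals

# the training sample from the table (n = 6); True = positive label
S = [
    ((0, 1, 1, 0, 1, 1), True),
    ((0, 1, 1, 1, 1, 1), True),
    ((0, 0, 1, 1, 0, 1), False),
    ((0, 1, 1, 1, 1, 1), True),
    ((1, 0, 0, 1, 1, 0), False),
    ((0, 1, 0, 0, 1, 1), True),
]

h = learn_conjunction(S)
terms = [("x" if v else "not x") + str(i + 1) for i, v in sorted(h)]
print(" and ".join(terms))  # not x1 and x2 and x5 and x6
```

This recovers exactly the `0 1 ? ? 1 1` pattern in the last row of the table.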
## Conjunctions of Boolean Literals

**Problem**: learning the class $C_n$ of conjunctions of boolean literals with at most $n$ variables (e.g., for $n = 3$, $x_1 \wedge \overline{x}_2 \wedge x_3$).

**Algorithm**: choose a hypothesis $h$ consistent with $S$.

- Since $|H| = |C_n| = 3^n$, the sample complexity is
$$m \geq \frac{1}{\epsilon}\Big((\log 3)\, n + \log\frac{1}{\delta}\Big).$$
  For $\delta = .02$, $\epsilon = .1$, $n = 10$: $m \geq 149$.
- Computational complexity: polynomial, since the algorithmic cost per training example is in $O(n)$.
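The numbers above can be reproduced directly from the consistent-case bound (a quick sketch; function and variable names are illustrative):

```python
import math

def consistent_sample_bound(log_H, eps, delta):
    """m >= (1/eps) * (log|H| + log(1/delta)), finite-H consistent case."""
    return math.ceil((log_H + math.log(1 / delta)) / eps)

# conjunctions of boolean literals over n variables: |H| = |C_n| = 3^n
n, eps, delta = 10, 0.1, 0.02
m = consistent_sample_bound(n * math.log(3), eps, delta)
print(m)  # 149, as on the slide
```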
## Universal Concept Class

**Problem**: each $x \in X$ is defined by $n$ boolean features. Let $C$ be the set of all subsets of $X$.

**Question**: is $C$ PAC-learnable?

**Sample complexity**: $H$ must contain $C$. Thus,
$$|H| \geq |C| = 2^{(2^n)}.$$
The bound gives
$$m = \frac{1}{\epsilon}\Big((\log 2)\, 2^n + \log\frac{1}{\delta}\Big).$$

It can be proved that $C$ is **not PAC-learnable**: it requires an exponential sample size.
## k-Term DNF Formulae

**Definition**: expressions of the form $T_1 \vee \cdots \vee T_k$, with each term $T_i$ a conjunction of boolean literals with at most $n$ variables.

**Problem**: learning $k$-term DNF formulae.

**Sample complexity**: $|H| = |C| = 3^{nk}$. Thus, polynomial sample complexity
$$\frac{1}{\epsilon}\Big((\log 3)\, nk + \log\frac{1}{\delta}\Big).$$

**Time complexity**: intractable if $RP \neq NP$: the class is then not efficiently PAC-learnable (proof by reduction from graph 3-coloring). But, a strictly larger class is!
## k-CNF Expressions

**Definition**: expressions $T_1 \wedge \cdots \wedge T_j$ of arbitrary length $j$, with each term $T_i$ a disjunction of at most $k$ boolean attributes.

**Algorithm**: reduce the problem to that of learning conjunctions of boolean literals, using $(2n)^k$ new variables:
$$(u_1, \ldots, u_k) \to Y_{u_1, \ldots, u_k}.$$

- The transformation is a bijection.
- The effect of the transformation on the distribution is not an issue: PAC-learning allows any distribution $D$.
## k-Term DNF Terms and k-CNF Expressions

**Observation**: any $k$-term DNF formula can be written as a $k$-CNF expression. By associativity,

$$\bigvee_{i=1}^{k} u_{i,1} \wedge \cdots \wedge u_{i,n_i} = \bigwedge_{j_1 \in [1, n_1], \ldots, j_k \in [1, n_k]} u_{1,j_1} \vee \cdots \vee u_{k,j_k}.$$

**Example**:

$$(u_1 \wedge u_2 \wedge u_3) \vee (v_1 \wedge v_2 \wedge v_3) = \bigwedge_{i,j=1}^{3} (u_i \vee v_j).$$

But, in general, converting a $k$-CNF (equivalent to a $k$-term DNF) to a $k$-term DNF is intractable.

Key aspects of the PAC-learning definition:

- cost of the representation of the concept $c$;
- choice of the hypothesis set $H$.
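The example identity can be verified exhaustively over all $2^6$ truth assignments (a small sanity check, not part of the lecture):

```python
from itertools import product

# (u1 ∧ u2 ∧ u3) ∨ (v1 ∧ v2 ∧ v3)  versus  ∧_{i,j} (u_i ∨ v_j)
mismatches = 0
for bits in product((False, True), repeat=6):
    u, v = bits[:3], bits[3:]
    dnf = all(u) or all(v)
    cnf = all(ui or vj for ui in u for vj in v)
    mismatches += dnf != cnf

print(mismatches)  # 0: the two formulae agree on all 64 assignments
```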
## This Lecture

- PAC model
- Sample complexity, finite hypothesis space: consistent case
- Sample complexity, finite hypothesis space: inconsistent case
- Concentration bounds
## Inconsistent Case

- No $h \in H$ is a consistent hypothesis.
- This is the typical case in practice: difficult problems, complex concept classes.
- But inconsistent hypotheses with a small number of errors on the training set can be useful.
- We need a more powerful tool: Hoeffding's inequality.
## Hoeffding's Inequality

**Corollary**: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h \colon X \to \{0, 1\}$, the following inequalities hold:

$$\Pr\big[\widehat{\mathrm{error}}_S(h) - \mathrm{error}_D(h) \geq \epsilon\big] \leq e^{-2m\epsilon^2}$$
$$\Pr\big[\widehat{\mathrm{error}}_S(h) - \mathrm{error}_D(h) \leq -\epsilon\big] \leq e^{-2m\epsilon^2}.$$

Combining these one-sided inequalities yields

$$\Pr\big[|\widehat{\mathrm{error}}_S(h) - \mathrm{error}_D(h)| \geq \epsilon\big] \leq 2 e^{-2m\epsilon^2}.$$
## Application: Bound for a Single Hypothesis

**Theorem**: fix a hypothesis $h \colon X \to \{0, 1\}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,

$$\mathrm{error}(h) \leq \widehat{\mathrm{error}}(h) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

**Proof**: apply Hoeffding's inequality,

$$\Pr\big[|\widehat{\mathrm{error}}(h) - \mathrm{error}(h)| \geq \epsilon\big] \leq 2 e^{-2m\epsilon^2}.$$

Setting $\delta$ to match the upper bound gives the result.
## Example: Tossing a Coin

**Problem**: estimate the bias of a coin from a sample

$$H, T, T, H, T, H, H, T, H, H, H, T, T, \ldots, H.$$

This is a special case of the previous bound, with $p = \mathrm{error}(h)$ the bias of the coin and $\widehat{p} = \widehat{\mathrm{error}}(h)$ the percentage of heads in the sample. Thus, with probability at least $1 - \delta$,

$$|p - \widehat{p}| \leq \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

Thus, choosing $\delta = .02$ and $m = 1000$ implies that with probability at least 98%,

$$|p - \widehat{p}| \leq \sqrt{\log(10)/1000} \approx .048.$$
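A quick simulation of this coin example (a sketch; the true bias `0.37` is an arbitrary illustrative choice, not from the lecture):

```python
import math
import random

m, delta = 1000, 0.02
bound = math.sqrt(math.log(2 / delta) / (2 * m))  # Hoeffding deviation bound

random.seed(1)
p = 0.37                        # true (unknown) bias, arbitrary choice
trials, violations = 2000, 0
for _ in range(trials):
    heads = sum(random.random() < p for _ in range(m))
    violations += abs(heads / m - p) > bound

print(round(bound, 3))          # 0.048, as on the slide
print(violations / trials)      # observed rate; guaranteed <= delta = 0.02
```

As with the rectangle example, the observed violation rate sits well below $\delta$, reflecting the slack in the bound.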
## Application to a Learning Algorithm?

Can we apply that bound to the hypothesis $h_S$ returned by our learning algorithm when training on sample $S$?

No, because $h_S$ is not a fixed hypothesis: it depends on the training sample. Note also that $\mathbb{E}\big[\widehat{\mathrm{error}}(h_S)\big]$ is not a simple quantity such as $\mathrm{error}(h_S)$.

Instead, we need a bound that holds simultaneously for all hypotheses $h \in H$: a **uniform convergence bound**.
## Generalization Bound - Finite H

**Theorem**: let $H$ be a finite hypothesis set. Then, for any $\delta > 0$, with probability at least $1 - \delta$ (writing $R(h) = \mathrm{error}_D(h)$ and $\widehat{R}(h) = \widehat{\mathrm{error}}_S(h)$),

$$\forall h \in H, \quad R(h) \leq \widehat{R}(h) + \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}.$$

**Proof**: by the union bound,

$$
\begin{aligned}
\Pr\Big[\exists h \in H \colon \big|\widehat{R}(h) - R(h)\big| > \epsilon\Big]
&= \Pr\Big[\big|\widehat{R}(h_1) - R(h_1)\big| > \epsilon \vee \cdots \vee \big|\widehat{R}(h_{|H|}) - R(h_{|H|})\big| > \epsilon\Big] \\
&\leq \sum_{h \in H} \Pr\Big[\big|\widehat{R}(h) - R(h)\big| > \epsilon\Big] \\
&\leq 2\,|H| \exp(-2m\epsilon^2).
\end{aligned}
$$
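The uniform bound can be illustrated on a small finite class, e.g., threshold functions on $[0, 1]$. A sketch under assumed choices that are not in the lecture: uniform $D$, target threshold $1/2$, and $|H| = 21$ grid thresholds (under uniform $D$, the true error of $h_t(x) = 1_{x \geq t}$ against $c = h_{0.5}$ is $|t - 0.5|$):

```python
import math
import random

# H: threshold classifiers h_t(x) = 1{x >= t}, t on a grid; target c = h_{0.5}.
thresholds = [k / 20 for k in range(21)]      # |H| = 21
m, delta = 500, 0.05
bound = math.sqrt((math.log(len(thresholds)) + math.log(2 / delta)) / (2 * m))

random.seed(2)
trials, violations = 500, 0
for _ in range(trials):
    xs = [random.random() for _ in range(m)]
    # sup over h in H of |R_hat(h) - R(h)|; under uniform D, R(h_t) = |t - 0.5|
    worst = max(
        abs(sum((x >= t) != (x >= 0.5) for x in xs) / m - abs(t - 0.5))
        for t in thresholds
    )
    violations += worst > bound

print(round(bound, 3))        # deviation allowed by the finite-H bound
print(violations / trials)    # observed rate; guaranteed <= delta
```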
Thus, for a finite hypothesis set, with high probability,

$$\forall h \in H, \quad R(h) \leq \widehat{R}(h) + O\Bigg(\sqrt{\frac{\log|H|}{m}}\Bigg).$$

- $\log_2|H|$ can be interpreted as the number of bits needed to encode $H$.
- Occam's Razor principle (theologian William of Occam): "plurality should not be posited without necessity".
- How do we deal with infinite hypothesis sets?
## Occam's Razor

- Principle formulated by the controversial theologian William of Occam: "plurality should not be posited without necessity", rephrased as "the simplest explanation is best".
- Invoked in a variety of contexts, e.g., syntax.
- Kolmogorov complexity can be viewed as the corresponding framework in information theory.
- In this context: to minimize true error, choose the most parsimonious explanation (smallest $|H|$). We will see other applications of this principle later.
## This Lecture

- PAC model
- Sample complexity, finite hypothesis space: consistent case
- Sample complexity, finite hypothesis space: inconsistent case
- Concentration bounds
## Concentration Inequalities

Some general tools for error analysis and bounds:

- Hoeffding's inequality;
- Chernoff bounds (multiplicative);
- McDiarmid's inequality (more general).
## Hoeffding's Lemma

**Lemma**: let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $a \leq X \leq b$, with $b \neq a$. Then for $t > 0$,

$$\mathbb{E}\big[e^{tX}\big] \leq e^{\frac{t^2 (b - a)^2}{8}}.$$

**Proof**: by convexity of $x \mapsto e^{tx}$, for all $a \leq x \leq b$,

$$e^{tx} \leq \frac{b - x}{b - a} e^{ta} + \frac{x - a}{b - a} e^{tb}.$$

Thus, using $\mathbb{E}[X] = 0$,

$$\mathbb{E}\big[e^{tX}\big] \leq \mathbb{E}\Big[\frac{b - X}{b - a} e^{ta} + \frac{X - a}{b - a} e^{tb}\Big] = \frac{b}{b - a} e^{ta} + \frac{-a}{b - a} e^{tb} = e^{\phi(t)},$$

with

$$\phi(t) = \log\Big(\frac{b}{b - a} e^{ta} + \frac{-a}{b - a} e^{tb}\Big) = ta + \log\Big(\frac{b}{b - a} + \frac{-a}{b - a} e^{t(b - a)}\Big).$$
Taking the derivative gives:

$$\phi'(t) = a - \frac{a\, e^{t(b - a)}}{\frac{b}{b - a} - \frac{a}{b - a} e^{t(b - a)}} = a - \frac{a}{\frac{b}{b - a} e^{-t(b - a)} - \frac{a}{b - a}}.$$

Note that $\phi(0) = 0$ and $\phi'(0) = 0$. Furthermore, with $\alpha = \frac{-a}{b - a}$,

$$
\begin{aligned}
\phi''(t) &= \frac{-a b\, e^{-t(b - a)}}{\big[\frac{b}{b - a} e^{-t(b - a)} - \frac{a}{b - a}\big]^2}
= \frac{\alpha (1 - \alpha)\, e^{-t(b - a)} (b - a)^2}{\big[(1 - \alpha) e^{-t(b - a)} + \alpha\big]^2} \\
&= \frac{\alpha}{(1 - \alpha) e^{-t(b - a)} + \alpha} \cdot \frac{(1 - \alpha) e^{-t(b - a)}}{(1 - \alpha) e^{-t(b - a)} + \alpha}\, (b - a)^2
= u (1 - u) (b - a)^2 \leq \frac{(b - a)^2}{4},
\end{aligned}
$$

where $u = \frac{\alpha}{(1 - \alpha) e^{-t(b - a)} + \alpha}$. By the second-order Taylor expansion, there exists $0 \leq \theta \leq t$ such that:

$$\phi(t) = \phi(0) + t \phi'(0) + \frac{t^2}{2} \phi''(\theta) \leq \frac{t^2 (b - a)^2}{8}.$$
## Hoeffding's Theorem

**Theorem**: let $X_1, \ldots, X_m$ be independent random variables with $X_i \in [a_i, b_i]$. Then for $\epsilon > 0$, the following inequalities hold for $S_m = \sum_{i=1}^{m} X_i$:

$$\Pr\big[S_m - \mathbb{E}[S_m] \geq \epsilon\big] \leq e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2}$$
$$\Pr\big[S_m - \mathbb{E}[S_m] \leq -\epsilon\big] \leq e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2}.$$

**Proof**: the proof is based on **Chernoff's bounding technique**: for any random variable $X$ and $t > 0$, apply Markov's inequality,

$$\Pr[X \geq \epsilon] = \Pr\big[e^{tX} \geq e^{t\epsilon}\big] \leq \frac{\mathbb{E}[e^{tX}]}{e^{t\epsilon}},$$

and select $t$ to minimize the resulting bound.
Using this scheme and the independence of the random variables gives

$$
\begin{aligned}
\Pr\big[S_m - \mathbb{E}[S_m] \geq \epsilon\big]
&\leq e^{-t\epsilon}\, \mathbb{E}\big[e^{t(S_m - \mathbb{E}[S_m])}\big]
= e^{-t\epsilon} \prod_{i=1}^{m} \mathbb{E}\big[e^{t(X_i - \mathbb{E}[X_i])}\big] \\
&\leq e^{-t\epsilon} \prod_{i=1}^{m} e^{t^2 (b_i - a_i)^2 / 8} \qquad \text{(lemma applied to } X_i - \mathbb{E}[X_i]) \\
&= e^{-t\epsilon}\, e^{t^2 \sum_{i=1}^{m} (b_i - a_i)^2 / 8}
\leq e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2},
\end{aligned}
$$

choosing $t = 4\epsilon / \sum_{i=1}^{m} (b_i - a_i)^2$.

The second inequality is proved in a similar way.
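The choice of $t$ in the last step can be checked numerically: the exponent $-t\epsilon + t^2 \sum_i (b_i - a_i)^2 / 8$ is minimized at $t = 4\epsilon / \sum_i (b_i - a_i)^2$, where it equals $-2\epsilon^2 / \sum_i (b_i - a_i)^2$. A small sketch with arbitrary illustrative values:

```python
eps = 0.3
spans = [1.0, 0.5, 2.0]                # illustrative (b_i - a_i) values
S2 = sum(s * s for s in spans)

def exponent(t):
    """Log of the bound after applying the lemma: -t*eps + t^2 * S2 / 8."""
    return -t * eps + t * t * S2 / 8

t_star = 4 * eps / S2                   # claimed minimizer
t_grid = min((i / 10000 for i in range(1, 20000)), key=exponent)

print(round(t_star, 4), round(t_grid, 4))                 # both 0.2286
print(round(exponent(t_star), 6), round(-2 * eps**2 / S2, 6))  # equal minima
```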
## Hoeffding's Inequality

**Corollary**: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h \colon X \to \{0, 1\}$, the following inequalities hold:

$$\Pr\big[\widehat{R}(h) - R(h) \geq \epsilon\big] \leq e^{-2m\epsilon^2}$$
$$\Pr\big[\widehat{R}(h) - R(h) \leq -\epsilon\big] \leq e^{-2m\epsilon^2}.$$

**Proof**: follows directly from Hoeffding's theorem.

Combining these one-sided inequalities yields

$$\Pr\big[|\widehat{R}(h) - R(h)| \geq \epsilon\big] \leq 2 e^{-2m\epsilon^2}.$$
## Chernoff's Inequality

**Theorem**: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h \colon X \to \{0, 1\}$, the following inequalities hold:

$$\Pr\big[\widehat{R}(h) \geq (1 + \epsilon) R(h)\big] \leq e^{-m R(h) \epsilon^2 / 3}$$
$$\Pr\big[\widehat{R}(h) \leq (1 - \epsilon) R(h)\big] \leq e^{-m R(h) \epsilon^2 / 2}.$$

**Proof**: based on Chernoff's bounding technique.
## McDiarmid's Inequality

**Theorem** (McDiarmid, 1989): let $X_1, \ldots, X_m$ be independent random variables taking values in $U$, and $f \colon U^m \to \mathbb{R}$ a function verifying, for all $i \in [1, m]$,

$$\sup_{x_1, \ldots, x_m, x_i'} \big|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)\big| \leq c_i.$$

Then, for all $\epsilon > 0$,

$$\Pr\Big[\big|f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]\big| > \epsilon\Big] \leq 2 \exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$$
- **Proof**: uses Hoeffding's lemma.
- Hoeffding's inequality is a special case of McDiarmid's with

$$f(x_1, \ldots, x_m) = \frac{1}{m} \sum_{i=1}^{m} x_i \quad \text{and} \quad c_i = \frac{|b_i - a_i|}{m}.$$
## Summary

- $C$ is **PAC-learnable** if $\exists L, \forall c \in C, \forall \epsilon, \delta > 0, m = P\big(\frac{1}{\epsilon}, \frac{1}{\delta}\big)$:
$$\Pr_{S \sim D^m}\big[R(h_S) \leq \epsilon\big] \geq 1 - \delta.$$
- Learning bound, finite consistent case:
$$R(h) \leq \frac{1}{m}\Big(\log|H| + \log\frac{1}{\delta}\Big).$$
- Learning bound, finite inconsistent case:
$$R(h) \leq \widehat{R}(h) + \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}.$$
- McDiarmid's inequality:
$$\Pr\big[|f - \mathbb{E}[f]| > \epsilon\big] \leq 2 \exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$$
## References

- Michael Kearns and Umesh Vazirani. *An Introduction to Computational Learning Theory*. MIT Press, 1994.
- Leslie G. Valiant. A Theory of the Learnable. *Communications of the ACM* 27(11):1134–1142 (1984).