Probably Approximately Correct Learning (PAC)
Leslie G. Valiant. A Theory of the Learnable. Comm. ACM 27(11):1134–1142 (1984)
Recall: Bayesian learning
• Create a model based on some parameters
• Assume some prior distribution on those parameters
• Learning problem:
– Adjust the model parameters so as to maximize the likelihood of the model given the data
– Utilize the Bayes formula for that.
PAC Learning
• Given a distribution D of observables X
• Given a family of functions (concepts) F
• For each x ∈ X and f ∈ F: f(x) provides the label for x
• Given a family of hypotheses H, seek a hypothesis h such that
  Error(h) = Pr_{x ~ D}[f(x) ≠ h(x)]
  is minimal
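The error of a hypothesis under D can be estimated empirically by sampling. A minimal Python sketch, assuming we can draw from D; the toy instance and all names here are illustrative, not from the slides:

```python
import random

def estimate_error(f, h, sample_from_D, n_samples=10_000, seed=0):
    """Monte Carlo estimate of Error(h) = Pr_{x~D}[f(x) != h(x)]."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(n_samples):
        x = sample_from_D(rng)
        mistakes += f(x) != h(x)
    return mistakes / n_samples

# Toy instance: D uniform on [0,1], target f labels x >= 0.5, hypothesis h labels x >= 0.6.
err = estimate_error(lambda x: x >= 0.5, lambda x: x >= 0.6,
                     lambda rng: rng.random())
# err approximates D([0.5, 0.6)) = 0.1
```

With 10,000 samples the estimate concentrates tightly around the true error mass of the region where f and h disagree.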
PAC New Concepts
• Large family of distributions D
• Large family of concepts F
• Family of hypotheses H
• Main questions:
– Is there a hypothesis h ∈ H that can be learned?
– How fast can it be learned?
– What is the error that can be expected?
Estimation vs. approximation
• Note:
– The distribution D is fixed
– There is no noise in the system (currently)
– F is a space of binary functions (concepts)
• This is thus an approximation problem, as the function is given exactly for each x ∈ X
• Estimation problem: f is not given exactly but estimated from noisy data
Example (PAC)
• Concept: average body-size person
• Inputs: for each person:
– height
– weight
• Sample: labeled examples of persons
– label +: average body-size
– label −: not average body-size
• Two-dimensional inputs
(Figure: observable space X with concept f)
Example (PAC)
• Assumption: the target concept is a rectangle.
• Goal:
– Find a rectangle h that “approximates” the target
– Hypothesis family H of rectangles
• Formally:
– with high probability,
– output a rectangle such that
– its error is low.
Example (Modeling)
• Assume:
– a fixed distribution over persons.
• Goal:
– low error with respect to THIS distribution!
• What does the distribution look like?
– Highly complex.
– Each parameter is not uniform.
– Parameters are highly correlated.
Model-based approach
• First try to model the distribution.
• Given a model of the distribution:
– find an optimal decision rule.
• This is Bayesian learning.
PAC approach
• Assume that the distribution is fixed.
• Samples are drawn i.i.d.:
– independently
– identically distributed
• Concentrate on the decision rule rather than the distribution.
PAC Learning
• Task: learn a rectangle from examples.
• Input: points (x, f(x)) with classification + or −, classified by a rectangle R
• Goal:
– using the fewest examples,
– compute h,
– where h is a good approximation for f.
PAC Learning: Accuracy
• Testing the accuracy of a hypothesis:
– using the distribution D of examples.
• Error = h Δ f (symmetric difference)
• Pr[Error] = D(Error) = D(h Δ f)
• We would like Pr[Error] to be controllable.
• Given a parameter ε:
– find h such that Pr[Error] < ε.
PAC Learning: Hypothesis
• Which rectangle should we choose?
– Similar to parametric modeling?
Setting up the Analysis:
• Choose the smallest consistent rectangle.
• Need to show:
– for any distribution D and target rectangle f,
– given input parameters ε and δ,
– select m(ε, δ) examples,
– let h be the smallest consistent rectangle;
– then with probability 1 − δ (over the m examples):
  D(f Δ h) < ε
More general case (no rectangle)
• A distribution: D (unknown)
• Target function: c_t from C
– c_t: X → {0,1}
• Hypothesis: h from H
– h: X → {0,1}
• Error probability:
– error(h) = Pr_D[h(x) ≠ c_t(x)]
• Oracle: EX(c_t, D)
PAC Learning: Definition
• C and H are concept classes over X.
• C is PAC learnable by H if there exists an algorithm A such that:
– for any distribution D over X and c_t in C,
– for every input ε and δ,
– A outputs a hypothesis h in H,
– while having access to EX(c_t, D),
– such that with probability 1 − δ we have error(h) < ε.
• Complexity (sample size and running time) matters as well.
Finite Concept class
• Assume C = H and finite.
• h is ε-bad if error(h) > ε.
• Algorithm:
– Sample a set S of m(ε, δ) examples.
– Find h in H which is consistent with S.
• The algorithm fails if h is ε-bad.
PAC learning: formalization (1)
• X is the set of all possible examples
• D is the distribution from which the examples are drawn
• H is the set of all possible hypotheses, c ∈ H
• m is the number of training examples
• error(h) = Pr[h(x) ≠ c(x) | x is drawn from X with D]
• h is approximately correct if error(h) ≤ ε

PAC learning: formalization (2)
To show: after m examples, with high probability, all consistent hypotheses are approximately correct.
All consistent hypotheses lie in an ε-ball around c.
(Figure: the hypothesis space H, with the ε-ball around c; H_bad lies outside it.)
Complexity analysis (1)
• The probability that a hypothesis h_bad ∈ H_bad is consistent with the first m examples:
– error(h_bad) > ε by definition.
– The probability that it agrees with one example is thus at most (1 − ε), and with m examples at most (1 − ε)^m.
Complexity analysis (2)
• For H_bad to contain a consistent hypothesis, at least one hypothesis in it must be consistent:
  Pr(H_bad has a consistent hypothesis) ≤ |H_bad| (1 − ε)^m ≤ |H| (1 − ε)^m
• To reduce the probability of error below δ:
  |H| (1 − ε)^m ≤ δ
• This is possible when at least
  m ≥ (1/ε)(ln(1/δ) + ln |H|)
  examples are seen (using (1 − ε)^m ≤ e^{−εm}).
• This is the sample complexity.
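The finite-class bound is easy to evaluate numerically. A small Python helper; the instance size below is just an illustration:

```python
from math import ceil, log

def sample_size(eps, delta, h_size):
    """Finite-class PAC bound: m >= (1/eps) * (ln(1/delta) + ln|H|)."""
    return ceil((log(1 / delta) + log(h_size)) / eps)

# |H| = 3**10 (e.g. disjunctions over 10 variables), eps = delta = 0.1
m = sample_size(0.1, 0.1, 3 ** 10)  # 133 examples suffice
```

Note that m grows only logarithmically in |H|, so even this exponentially large class needs few examples.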
Complexity analysis (3)
• “At least m examples are necessary to build a consistent hypothesis h that is wrong with probability at most ε, with confidence 1 − δ.”
• For the class of all boolean functions on n attributes, |H| = 2^(2^n), so the sample complexity grows exponentially with the number of attributes n.
• Conclusion: learning an arbitrary boolean function is no better in the worst case than table lookup!
Complexity analysis (4): PAC learning observations
• “Hypothesis h(X) is consistent with m examples and has an error of at most ε with probability 1 − δ.”
• This is a worst-case analysis. Note that the result is independent of the distribution D!
• Growth-rate analysis:
– as ε → 0, m grows proportionally to 1/ε
– as δ → 0, m grows logarithmically in 1/δ
– as |H| grows, m grows logarithmically in |H|
PAC: comments
• We only assumed that examples are i.i.d.
• We have two independent parameters:
– accuracy ε
– confidence δ
• No assumption about the likelihood of a hypothesis.
• The hypothesis is tested on the same distribution as the sample.
PAC: non-feasible case
• What happens if c_t is not in H?
• We need to redefine the goal.
• Let h* in H minimize the error: β = error(h*)
• Goal: find h in H such that
  error(h) ≤ error(h*) + ε = β + ε
Analysis*
• For each h in H:
– let obs-error(h) be the average error on the sample S.
• Bound the probability of a large deviation:
  Pr{|obs-error(h) − error(h)| ≥ ε/2} < exp(−(ε/2)² m)   (Chernoff bound)
• Union bound over the entire H:
  Pr < |H| exp(−(ε/2)² m)
• Sample size:
  m > (4/ε²) ln(|H|/δ)
Correctness
• Assume that for all h in H:
– |obs-error(h) − error(h)| < ε/2
• In particular:
– obs-error(h*) < error(h*) + ε/2
– error(h) − ε/2 < obs-error(h)
• For the output h:
– obs-error(h) ≤ obs-error(h*)
• Conclusion: error(h) < error(h*) + ε
Sample size issue
• Due to the use of the Chernoff bound:
  Pr{|obs-error(h) − error(h)| ≥ ε/2} < exp(−(ε/2)² m)
  and over the entire H:
  Pr < |H| exp(−(ε/2)² m)
• It follows that the sample size is
  m > (4/ε²) ln(|H|/δ)
  and not (1/ε) ln(|H|/δ) as before.
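The gap between the realizable and non-feasible bounds is easy to see numerically. A sketch with illustrative numbers (the function names are my own):

```python
from math import ceil, log

def m_realizable(eps, delta, h_size):
    """Realizable case: m > (1/eps) * ln(|H|/delta)."""
    return ceil(log(h_size / delta) / eps)

def m_agnostic(eps, delta, h_size):
    """Non-feasible case via Chernoff: m > (4/eps**2) * ln(|H|/delta)."""
    return ceil(4 * log(h_size / delta) / eps ** 2)

# With |H| = 1000 and eps = delta = 0.1, the agnostic bound
# is roughly 4/eps = 40 times larger.
m1 = m_realizable(0.1, 0.1, 1000)  # 93
m2 = m_agnostic(0.1, 0.1, 1000)   # 3685
```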
Example: Learning OR of literals
• Inputs: x_1, …, x_n
• Literals: x_i and their negations ¬x_i
• OR functions, e.g. x_1 ∨ ¬x_4 ∨ x_7
• For each variable, the target disjunction may contain x_i, ¬x_i, or neither; thus the number of disjunctions is 3^n

ELIM: Algorithm for learning OR
• Keep a list of all 2n literals
• For every example whose classification is 0:
– erase all the literals that evaluate to 1 on it.
• Example: c(00110) = 0 results in deleting ¬x_1, ¬x_2, x_3, x_4, ¬x_5
• Correctness:
– Our hypothesis h: an OR of our remaining set of literals.
– Our set of literals always includes the target OR's literals.
– Every time h predicts zero, we are correct.
• Sample size:
  m > (1/ε) ln(3^n/δ)
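ELIM is simple enough to state directly in code. A Python sketch, where a literal is encoded as a pair (i, polarity), polarity True meaning x_i and False its negation; the encoding and toy target are my own:

```python
def elim(examples, n):
    """ELIM: learn an OR of literals over n boolean variables.

    A literal is (i, pol): pol=True means x_i, pol=False means its negation.
    """
    # start with all 2n literals
    literals = {(i, pol) for i in range(n) for pol in (True, False)}
    for x, label in examples:
        if label == 0:
            # erase every literal that evaluates to 1 on a negative example
            literals -= {(i, pol) for (i, pol) in literals if (x[i] == 1) == pol}
    return literals

def predict(literals, x):
    return int(any((x[i] == 1) == pol for (i, pol) in literals))

# Toy target: x_0 OR (not x_2), over n = 3 variables
examples = [((0, 1, 1), 0), ((0, 0, 1), 0), ((1, 0, 1), 1), ((0, 1, 0), 1)]
h = elim(examples, 3)  # leaves exactly {(0, True), (2, False)}
```

Positive examples are never used; as the slide notes, the surviving literals always contain the target's literals, so h never errs on a zero prediction.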
Learning parity
• Functions: e.g. x_1 ⊕ x_7 ⊕ x_9
• Number of parity functions: 2^n
• Algorithm:
– Sample a set of examples
– Solve the resulting linear equations over GF(2) (a consistent solution exists)
• Sample size:
  m > (1/ε) ln (2^n/δ)
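Recovering a parity function amounts to Gaussian elimination over GF(2): each labeled example (x, b) is one linear equation. A minimal Python sketch, assuming the sample is consistent with some parity; the toy instance is my own:

```python
def learn_parity(examples, n):
    """Recover a parity c(x) = XOR of {x_i : a_i = 1} by Gaussian
    elimination over GF(2); each example (x, b) is one equation."""
    rows = [list(x) + [b] for x, b in examples]
    pivots, r = [], 0
    for col in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue  # free variable; any value is consistent
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col]:
                rows[i] = [u ^ v for u, v in zip(rows[i], rows[r])]
        pivots.append((r, col))
        r += 1
    a = [0] * n
    for ri, col in pivots:
        a[col] = rows[ri][-1]
    return a  # a[i] == 1 iff x_i participates in the parity

# Toy target: x_0 XOR x_2, over n = 3 variables
examples = [((1, 0, 0), 1), ((0, 1, 0), 0), ((0, 0, 1), 1), ((1, 1, 1), 0)]
a = learn_parity(examples, 3)  # recovers [1, 0, 1]
```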
Infinite Concept class
• X = [0,1] and H = {c_θ | θ ∈ [0,1]}
• c_θ(x) = 0 iff x < θ
• Assume C = H.
• Which c_θ should we choose in [min, max]?
(Figure: sampled points on [0,1]; [min, max] is the gap between the largest negative and the smallest positive example.)
Proof I
• Define max = min{x | c(x) = 1}, min = max{x | c(x) = 0} over the sample
• Show that
– Pr[D([min, max]) > ε] < δ
• Proof: by contradiction.
– The probability that x is in [min, max] is at least ε
– The probability that we do not sample from [min, max] is (1 − ε)^m
– This needs m > (1/ε) ln(1/δ)
• There is something wrong!
Proof II (correct):
• Let max′ be such that D([θ, max′]) = ε/2
• Let min′ be such that D([min′, θ]) = ε/2
• Goal: show that with high probability
– some positive example x⁺ is in [θ, max′], and
– some negative example x⁻ is in [min′, θ]
• In such a case any threshold in [x⁻, x⁺] is good.
• Compute the sample size!
Proof II (correct):
• Pr{none of x_1, x_2, …, x_m falls in an interval of mass ε/2 next to θ} = (1 − ε/2)^m < exp(−mε/2)
• Similarly for the other side.
• We require 2 exp(−mε/2) < δ
• Thus m > (2/ε) ln(2/δ)
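The threshold learner and its sample-size bound can be exercised in a few lines of Python. A sketch, with the uniform distribution on [0,1] standing in for D; the concrete θ and seed are my own:

```python
import math
import random

def learn_threshold(sample):
    """Any consistent threshold works; take the smallest positive point."""
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else 1.0

rng = random.Random(1)
theta = 0.3                                    # target: c(x) = 0 iff x < theta
eps, delta = 0.1, 0.1
m = math.ceil(2 / eps * math.log(2 / delta))   # m > (2/eps) ln(2/delta) -> 60
sample = [(x, int(x >= theta)) for x in (rng.random() for _ in range(m))]
h = learn_threshold(sample)
# under uniform D the error of c_h is |h - theta|, which should be below eps
```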
Comments
• The hypothesis space was very simple: H = {c_θ | θ ∈ [0,1]}
• There was no noise in the data or labels.
• So learning was trivial in some sense (an analytic solution exists).
Non-Feasible case: Label Noise
• Suppose the sampled labels are noisy.
• Algorithm:
– Find the function h with the lowest observed error!
Analysis
• Define {z_i} as an ε/4-net (w.r.t. D)
• For the optimal h* and our h there are
– z_j: |error(h[z_j]) − error(h*)| < ε/4
– z_k: |error(h[z_k]) − error(h)| < ε/4
• Show that with high probability:
– |obs-error(h[z_i]) − error(h[z_i])| < ε/4
Exercise (Submission Mar 29, 04)
1. Assume there is Gaussian (0, σ) noise on x_i. Apply the same analysis to compute the required sample size for PAC learning.
   Note: class labels are determined by the non-noisy observations.
General ε-net approach
• Given a class H, define a class G:
– for every h in H
– there exists a g in G such that
– D(g Δ h) < ε/4
• Algorithm: find the best h in H.
• Compute the confidence and sample size.
Occam Razor
W. Occam (c. 1320): “Entities should not be multiplied unnecessarily.”
A. Einstein: “Simplify a problem as much as possible, but no simpler.”
Is there an information-theoretic ground?
Occam Razor
Finding the shortest consistent hypothesis.
• Definition: (α, β)-Occam algorithm, α > 0 and β < 1
– Input: a sample S of size m
– Output: hypothesis h
– for every (x, b) in S: h(x) = b (consistency)
– size(h) < size(c_t)^α · m^β
• Efficiency is also required.
Occam algorithm and compression
(Figure: A sends B information about the labels b_i of a shared sample S = {(x_i, b_i)}, x_1, …, x_m.)
Compression
• Option 1:
– A sends B the values b_1, …, b_m
– m bits of information
• Option 2:
– A sends B the hypothesis h
– Occam: for large enough m, size(h) < m
• Option 3 (MDL):
– A sends B a hypothesis h and “corrections”
– complexity: size(h) + size(errors)
Occam Razor Theorem
• A: an (α, β)-Occam algorithm for C using H
• D: a distribution over inputs X
• c_t in C: the target function, n = size(c_t)
• Sample size:
  m ≥ (2/ε) ln(1/δ) + ((2 n^α)/ε)^{1/(1−β)}
• Then with probability 1 − δ, A(S) = h has error(h) < ε
Occam Razor Theorem
• Use the bound for a finite hypothesis class.
• Effective hypothesis class size: 2^{size(h)}
• size(h) < n^α m^β
• Sample size: from
  m ≥ (1/ε)(ln(1/δ) + ln 2^{n^α m^β}) = (1/ε)(ln(1/δ) + n^α m^β ln 2),
  solving for m gives the stated bound.
• The VC dimension will later replace 2^{size(h)}.
Exercise (Submission Mar 29, 04)
2. For an (α, β)-Occam algorithm, given noisy data with noise ~ (0, σ²), find the limitations on m.
   Hint: use an ε-net and the Chernoff bound.
Learning OR with few attributes
• Target function: OR of k literals
• Goal: learn in time
– polynomial in k and log n,
– with ε and δ constant
• ELIM makes “slow” progress:
– it may disqualify only one literal per round,
– and may remain with O(n) literals
Set Cover: Definition
• Input: S_1, …, S_t with S_i ⊆ U
• Output: S_{i_1}, …, S_{i_k} such that ∪_j S_{i_j} = U
• Question: are there k sets that cover U?
• NP-complete
Set Cover: Greedy algorithm
• j = 0; U_j = U; C = ∅
• While U_j ≠ ∅:
– let S_i be arg max_i |S_i ∩ U_j|
– add S_i to C
– let U_{j+1} = U_j − S_i
– j = j + 1
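The greedy loop translates directly to Python; the small instance below is just for illustration:

```python
def greedy_set_cover(universe, sets):
    """Repeatedly pick the set covering the most uncovered elements."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the sets do not cover the universe")
        cover.append(best)
        uncovered -= sets[best]
    return cover

U = set(range(6))
S = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 5}]
c = greedy_set_cover(U, S)  # picks indices [0, 2]
```

Here the greedy choice happens to be optimal; in general it uses at most k ln |U| sets when a cover of size k exists, as the next slide shows.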
Set Cover: Greedy Analysis
• At termination, C is a cover.
• Assume there is a cover C′ of size k.
• C′ is a cover for every U_j
• Some S in C′ covers at least |U_j|/k elements of U_j
• Analysis of |U_j|:
  |U_{j+1}| ≤ |U_j| − |U_j|/k
• Solving the recursion gives:
• number of sets j < k ln |U|
Ex 2 Solve
•
Lists of arbitrary size can represent any boolean
function. Lists with tests of at most with at
most
k < n
literals define the
k

DL boolean
language
. For
n
attributes, the language is
k

DL(
n
).
•
The language of tests
Conj(n,k)
has at most
3

Conj(n,k)
distinct component sets
(Y,N,absent)
•

k

DL(
n
)

3

Conj(n,k)

Conj(n,k)!
(any order)
•

Conj(n,k) =
i
=0
(
) = O(n
k
)
Learning decision lists
2
n
i
k
Building an Occam algorithm
• Given a sample S of size m:
– Run ELIM on S
– Let LIT be the set of surviving literals
– There exist k literals in LIT that classify all of S correctly
• Negative examples:
– any subset of LIT classifies them correctly
Building an Occam algorithm
• Positive examples:
– Search for a small subset of LIT which classifies S⁺ correctly
– For a literal z build T_z = {x | z satisfies x}
– There are k sets T_z that cover S⁺
– Greedy finds k ln m sets that cover S⁺
• Output h = the OR of these k ln m literals
• size(h) < k ln m · log(2n)
• Sample size: m = O(k log n · log(k log n))
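Putting ELIM and greedy set cover together gives the Occam algorithm sketched above. A compact Python sketch; the literal encoding (i, polarity) and the toy target are my own:

```python
def learn_sparse_or(sample, n):
    """Occam-style learner for a k-literal OR: ELIM, then greedy set cover.

    A literal is (i, pol): pol=True means x_i, pol=False means its negation.
    """
    # ELIM on the negative examples
    lits = {(i, p) for i in range(n) for p in (True, False)}
    for x, y in sample:
        if y == 0:
            lits -= {(i, p) for (i, p) in lits if (x[i] == 1) == p}
    # greedy set cover of the positives: T_z = positives satisfied by literal z
    pos = [x for x, y in sample if y == 1]
    chosen, uncovered = [], list(range(len(pos)))
    while uncovered:
        z = max(lits, key=lambda l: sum((pos[j][l[0]] == 1) == l[1]
                                        for j in uncovered))
        chosen.append(z)
        uncovered = [j for j in uncovered if (pos[j][z[0]] == 1) != z[1]]
    return chosen

# Toy target: x_0 OR x_3, over n = 5 variables
sample = [((1, 0, 0, 0, 0), 1), ((0, 0, 0, 1, 0), 1),
          ((0, 1, 1, 0, 1), 0), ((0, 0, 0, 0, 0), 0)]
h = learn_sparse_or(sample, 5)  # covers the positives with x_0 and x_3
```

The output is an OR of few literals that is consistent with the sample, which is exactly the short hypothesis the Occam argument needs.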
Criticism of PAC model
• The worst-case emphasis makes it unusable:
– it is useful for analysis of computational complexity,
– and there are methods to estimate the cardinality of the space of concepts (VC dimension), but unfortunately they are not sufficiently practical.
• The notions of target concepts and noise-free training are too restrictive:
– true; the switch to concept approximation is weak,
– and there are some extensions for label noise, fewer for variable noise.
Summary
• PAC model
– confidence and accuracy
– sample size
• Finite (and infinite) concept classes
• Occam Razor
References
• L. Valiant. A theory of the learnable. Comm. ACM 27(11):1134–1142, 1984. (Original work)
• D. Haussler. Probably approximately correct learning. (Review)
• M. Kearns. Efficient noise-tolerant learning from statistical queries. (Review of noise methods)
• F. Denis et al. PAC learning with simple examples. (Simple examples)
Learning algorithms
• OR function
• Parity function
• OR of a few literals
• Open problems:
– OR in the non-feasible case
– parity of a few literals