# Probably Approximately Correct Learning (PAC)


Leslie G. Valiant. A Theory of the Learnable. Comm. ACM 27(11):1134-1142, 1984.

Recall: Bayesian learning

Create a model based on some parameters

Assume some prior distribution on those parameters

Learning problem:

Adjust the model parameters so as to maximize the likelihood of the model given the data

Use Bayes' formula for that.

PAC Learning

Given a distribution D over observables X

Given a family of functions (concepts) F

For each x ∈ X and f ∈ F: f(x) provides the label for x

Given a family of hypotheses H, seek a hypothesis h such that

Error(h) = Pr_{x ~ D}[f(x) ≠ h(x)]

is minimal
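Since Error(h) is just a probability of disagreement under D, it can be illustrated with a tiny Monte Carlo sketch (mine, not the slides'; the distribution, concept, and hypothesis below are toy stand-ins):

```python
import random

def estimate_error(f, h, sample_d, m=10_000):
    """Monte Carlo estimate of Error(h) = Pr_{x~D}[f(x) != h(x)]."""
    xs = [sample_d() for _ in range(m)]
    return sum(f(x) != h(x) for x in xs) / m

# Toy stand-ins: D uniform on [0,1], concept = threshold at 0.5,
# hypothesis = a slightly misplaced threshold at 0.55.
print(estimate_error(lambda x: x >= 0.5,
                     lambda x: x >= 0.55,
                     random.random))   # about 0.05, the mass of [0.5, 0.55)
```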

PAC New Concepts

Large family of distributions D

Large family of concepts F

Family of hypotheses H

Main questions:

Is there a hypothesis h ∈ H that can be learned?

How fast can it be learned?

What is the error that can be expected?
Estimation vs. approximation

Note:

The distribution D is fixed

There is no noise in the system (currently)

F is a space of binary functions (concepts)

This is thus an approximation problem, as the function is given exactly for each x ∈ X

Estimation problem: f is not given exactly but is estimated from noisy data

Example (PAC)

Concept: average body-size person

Inputs: for each person:

height

weight

Sample: labeled examples of persons

label +: average body-size

label −: not average body-size

Two-dimensional inputs

Observable space X with concept f

Example (PAC)

Assumption: the target concept is a rectangle.

Goal: find a rectangle h that "approximates" the target.

Hypothesis family H of rectangles

Formally: with high probability, output a rectangle whose error is low.

Example (Modeling)

Assume: a fixed distribution over persons.

Goal: low error with respect to THIS distribution!

What does the distribution look like?

Highly complex.

Each parameter is not uniform.

Highly correlated.

Model-based approach

First try to model the distribution.

Given a model of the distribution, find an optimal decision rule.

This is the Bayesian learning approach.

PAC approach

Assume that the distribution is fixed.

Samples are drawn i.i.d. (independent and identically distributed).

Concentrate on the decision rule rather than on the distribution.

PAC Learning

Learn a rectangle from examples.

Input: points (x, f(x)) with classification + or −, classified by a rectangle R

Goal: using the fewest examples, compute an h that is a good approximation of f

PAC Learning: Accuracy

Testing the accuracy of a hypothesis: use the distribution D of examples.

Error region = h Δ f (the symmetric difference)

Pr[Error] = D(Error) = D(h Δ f)

We would like Pr[Error] to be controllable.

Given a parameter ε: find h such that Pr[Error] < ε.

PAC Learning: Hypothesis

Which rectangle should we choose?

Similar to parametric modeling?

Setting up the analysis:

Choose the smallest consistent rectangle.

Need to show: for any distribution D and target rectangle,

given input parameters ε and δ,

select m(ε, δ) examples

and let h be the smallest consistent rectangle;

then with probability 1 − δ (over the sample from X):

D(f Δ h) < ε
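As a concrete sketch of this learner (representation choices mine: points are 2-D tuples, labels are booleans), the smallest consistent hypothesis is the tightest axis-aligned rectangle around the positive examples:

```python
def smallest_consistent_rectangle(samples):
    """samples: iterable of ((x, y), label), label True for positives.
    Returns the tightest axis-aligned rectangle containing all positives."""
    xs = [p[0] for p, label in samples if label]
    ys = [p[1] for p, label in samples if label]
    if not xs:
        return None  # no positive examples: predict negative everywhere
    return (min(xs), max(xs)), (min(ys), max(ys))

def classify(rect, point):
    if rect is None:
        return False
    (x_lo, x_hi), (y_lo, y_hi) = rect
    return x_lo <= point[0] <= x_hi and y_lo <= point[1] <= y_hi
```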

More general case (no rectangle)

A distribution D (unknown)

Target function: c_t from C, with c_t: X → {0,1}

Hypothesis: h from H, with h: X → {0,1}

Error probability: error(h) = Pr_D[h(x) ≠ c_t(x)]

Oracle: EX(c_t, D)

PAC Learning: Definition

C and H are concept classes over X.

C is PAC learnable by H if there exists an algorithm A such that:

for any distribution D over X and any c_t in C,

for every input ε and δ,

A, given access to EX(c_t, D), outputs a hypothesis h in H

such that with probability 1 − δ we have error(h) < ε.

Complexities: sample size and running time.

Finite Concept class

Assume C = H and finite.

A hypothesis h is ε-bad if error(h) > ε.

Algorithm:

Sample a set S of m(ε, δ) examples.

Find an h in H which is consistent with S.

The algorithm fails if the output h is ε-bad.

PAC learning: formalization (1)

X is the set of all possible examples

D is the distribution from which the examples are drawn

H is the set of all possible hypotheses, with c ∈ H

m is the number of training examples

error(h) = Pr[h(x) ≠ c(x) | x drawn from X according to D]

h is approximately correct if error(h) ≤ ε

PAC learning: formalization (2)

To show: after m examples, with high probability, all consistent hypotheses are approximately correct,

i.e. all consistent hypotheses lie in an ε-ball around c inside H.

Complexity analysis (1)

Consider the probability that a hypothesis h ∈ H with error(h) > ε (an ε-bad hypothesis) is consistent with the first m examples.

The probability that it agrees with a single example is at most (1 − ε), and hence with m independent examples at most (1 − ε)^m.

Complexity analysis (2)

For the set of ε-bad hypotheses to contain a consistent one, at least one of them must be consistent, so by the union bound:

Pr(some ε-bad hypothesis is consistent) ≤ |H|(1 − ε)^m

To reduce the probability of error below δ we require

|H|(1 − ε)^m ≤ δ

Since (1 − ε) ≤ e^{−ε}, this holds when at least

m ≥ (1/ε)(ln(1/δ) + ln |H|)

examples are seen.

This is the sample complexity.
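A small helper (illustrative, not part of the slides) that evaluates this bound:

```python
import math

def pac_sample_size(epsilon, delta, h_size):
    """m >= (1/eps) * (ln(1/delta) + ln|H|): finite class, realizable case."""
    return math.ceil((math.log(1 / delta) + math.log(h_size)) / epsilon)

# e.g. the 3^20 disjunctions over 20 variables, 5% error, 99% confidence:
print(pac_sample_size(0.05, 0.01, 3 ** 20))   # 532
```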

Complexity analysis (3)

At least m examples are necessary to build a consistent hypothesis h that is wrong on at most an ε fraction of inputs with probability 1 − δ.

Since |H| = 2^{2^n} for the class of all boolean functions over n attributes, the sample complexity grows exponentially with n.

Conclusion: learning an arbitrary boolean function is no better in the worst case than table lookup!

Complexity analysis (4)

PAC learning: observations

A hypothesis h that is consistent with the m examples has error at most ε, with probability 1 − δ.

This is a worst-case analysis. Note that the result is independent of the distribution D!

Growth rate analysis:

as ε → 0, m grows proportionally to 1/ε

as δ → 0, m grows logarithmically in 1/δ

as |H| grows, m grows logarithmically in |H|

We only assumed that the examples are i.i.d.

We have two independent parameters:

accuracy ε

confidence δ

No assumption about the likelihood of a hypothesis.

The hypothesis is tested on the same distribution as the sample.

PAC: non-feasible case

What happens if c_t is not in H?

We need to redefine the goal.

Let h* in H minimize the error: β = error(h*)

Goal: find h in H such that

error(h) ≤ error(h*) + ε = β + ε

Analysis*

For each h in H, let obs-error(h) be the average error on the sample S.

Compute the probability that

|obs-error(h) − error(h)| ≥ ε/2.

Chernoff bound: Pr < exp(−(ε/2)² m)

Over the entire H (union bound):

Pr < |H| exp(−(ε/2)² m)

Sample size: m > (4/ε²) ln(|H|/δ)
Correctness

Assume that for all h in H:

|obs-error(h) − error(h)| < ε/2

In particular:

obs-error(h*) < error(h*) + ε/2

error(h) − ε/2 < obs-error(h)

For the output h (the empirical minimizer):

obs-error(h) ≤ obs-error(h*)

Conclusion: error(h) < error(h*) + ε

Sample size issue

Due to the use of the Chernoff bound on

|obs-error(h) − error(h)| ≥ ε/2,

Pr < exp(−(ε/2)² m),

and over the entire H:

Pr < |H| exp(−(ε/2)² m),

it follows that the sample size is

m > (4/ε²) ln(|H|/δ):

logarithmic in |H| and 1/δ as before, but now quadratic in 1/ε.
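For contrast with the realizable case, the same kind of helper for the agnostic bound (again illustrative):

```python
import math

def agnostic_sample_size(epsilon, delta, h_size):
    """m > (4/eps^2) * ln(|H|/delta): the non-feasible (agnostic) case."""
    return math.ceil(4 * math.log(h_size / delta) / epsilon ** 2)

# Same class and parameters as before, now ~80x more examples:
print(agnostic_sample_size(0.05, 0.01, 3 ** 20))   # 42524
```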
Example: Learning OR of literals

Inputs: x1, … , xn

Literals: x1, ¬x1, … , xn, ¬xn

OR functions: each variable may appear in the target disjunction positively, negated, or not at all,

thus the number of disjunctions is 3^n

(e.g. x1 ∨ ¬x4 ∨ x7)

ELIM: Algorithm for learning OR

Keep a list of all 2n literals.

For every example whose classification is 0:

erase all the literals that evaluate to 1 on it.

Example: c(00110) = 0 results in deleting ¬x1, ¬x2, x3, x4, ¬x5.

Correctness:

Our hypothesis h: an OR of our remaining set of literals.

Our set of literals always includes the target OR's literals.

Every time h predicts zero, we are correct.

Sample size: m > (1/ε) ln(3^n/δ)
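A runnable sketch of ELIM (the encoding of literals as (index, value) pairs is my own choice, not the slides'):

```python
def elim(examples, n):
    """Learn an OR of literals from labeled boolean vectors.
    examples: list of (bits, label), bits a 0/1 tuple of length n.
    A literal (i, v) evaluates to 1 on x iff x[i] == v,
    so (i, 1) is x_i and (i, 0) is its negation."""
    literals = {(i, v) for i in range(n) for v in (0, 1)}
    for bits, label in examples:
        if label == 0:
            # erase every literal that is 1 on a negative example
            literals -= {(i, bits[i]) for i in range(n)}
    return literals

def predict(literals, bits):
    return int(any(bits[i] == v for i, v in literals))

# Target: x3 OR x4 (0-indexed positions 2 and 3)
data = [((0, 0, 1, 1, 0), 1), ((1, 0, 0, 0, 0), 0), ((0, 1, 0, 0, 1), 0)]
h = elim(data, 5)
print(predict(h, (0, 0, 1, 0, 0)))   # 1
```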

Learning parity

Functions: XORs of subsets of the variables, e.g. x1 ⊕ x7 ⊕ x9

Number of functions: 2^n

Algorithm:

Sample a set of examples.

Solve the resulting linear equations over GF(2) (a consistent system exists, since the target is a parity).

Sample size: m > (1/ε) ln(2^n/δ)
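The equation-solving step can be sketched as Gaussian elimination over GF(2) (a minimal illustrative implementation; each example (x, label) contributes the equation Σᵢ aᵢxᵢ ≡ label (mod 2)):

```python
def solve_parity(examples, n):
    """Find a in {0,1}^n with sum(a[i]*x[i]) % 2 == label for all examples,
    by Gaussian elimination over GF(2). Returns None if inconsistent."""
    rows = [(list(x), label) for x, label in examples]
    pivot_of = {}                        # column -> pivot row index
    r = 0
    for col in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i][0][col]), None)
        if pivot is None:
            continue                     # free variable, leave it 0
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][0][col]:
                rows[i] = ([a ^ b for a, b in zip(rows[i][0], rows[r][0])],
                           rows[i][1] ^ rows[r][1])
        pivot_of[col] = r
        r += 1
    if any(not any(row) and label for row, label in rows):
        return None                      # equation 0 = 1: inconsistent
    a = [0] * n
    for col, i in pivot_of.items():
        a[col] = rows[i][1]
    return a

# Target parity x1 XOR x3 (0-indexed positions 0 and 2), 4 variables:
exs = [((1, 0, 0, 0), 1), ((0, 1, 0, 0), 0),
       ((1, 1, 1, 0), 0), ((0, 0, 0, 1), 0)]
print(solve_parity(exs, 4))   # [1, 0, 1, 0]
```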

Infinite Concept class

X = [0,1] and H = {c_θ | θ ∈ [0,1]}

c_θ(x) = 0 iff x < θ

Assume C = H:

Which c_θ should we choose in [min, max]?

(min = the largest sampled point labeled 0, max = the smallest sampled point labeled 1)
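A minimal sketch of such a learner (the midpoint choice is mine; any consistent threshold in the gap works):

```python
def learn_threshold(samples):
    """samples: list of (x, label) with label = 0 iff x < theta.
    Any value in (min, max] is consistent; we take the midpoint."""
    lo = max((x for x, label in samples if label == 0), default=0.0)
    hi = min((x for x, label in samples if label == 1), default=1.0)
    return (lo + hi) / 2

print(learn_threshold([(0.2, 0), (0.7, 1), (0.4, 0), (0.9, 1)]))   # 0.55
```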

Proof I

Define max = min{x | c(x) = 1}, min = max{x | c(x) = 0}

Show that Pr[ D([min, max]) > ε ] < δ

Suppose the probability that x lies in [min, max] is at least ε.

The probability that we never sample from [min, max] is (1 − ε)^m,

which needs m > (1/ε) ln(1/δ).

There is something wrong: [min, max] is defined by the sample itself, so it cannot be treated as a fixed interval of mass ε.

Proof II (correct):

Let max′ be such that D([θ, max′]) = ε/2

Let min′ be such that D([min′, θ]) = ε/2

Goal: show that with high probability

some positive sample x⁺ lands in [θ, max′] and

some negative sample x⁻ lands in [min′, θ].

In such a case any threshold in [x⁻, x⁺] is good.

Compute the sample size!

Proof II (correct):

Pr[none of x1, x2, …, xm lies in [min′, θ]] = (1 − ε/2)^m < exp(−m ε/2)

Similarly for the other side.

We require 2 exp(−m ε/2) < δ.

Thus, m > (2/ε) ln(2/δ).

The hypothesis space was very simple: H = {c_θ | θ ∈ [0,1]}

There was no noise in the data or labels

So learning was trivial in some sense (an analytic solution exists)

Non-feasible case: label noise

Suppose the labels we sample are noisy (some labels are flipped).

Algorithm: find the function h with the lowest observed error!

Analysis

Define {z_i} as an ε/4-net (w.r.t. D).

For the optimal h* and for our output h there are

z_j with |error(h[z_j]) − error(h*)| < ε/4 and

z_k with |error(h[z_k]) − error(h)| < ε/4.

Show that with high probability:

|obs-error(h[z_i]) − error(h[z_i])| < ε/4 for all i.

Exercise (submission Mar 29, 04)

1. Assume there is Gaussian (0, σ) noise on the x_i. Apply the same analysis to compute the required sample size for PAC learning.

Note: class labels are determined by the non-noisy observations.

General ε-net approach

Given a class H, define a class G such that

for every h in H

there exists a g in G with

D(g Δ h) < ε/4.

Algorithm: find the best h in H.

Then compute the confidence and sample size.

Occam Razor

W. Occam (c. 1320): "Entities should not be multiplied unnecessarily"

A. Einstein: "Simplify a problem as much as possible, but no simpler"

Information-theoretic grounds?

Occam Razor

Finding the shortest consistent hypothesis.

Definition: (a, b)-Occam algorithm, for a > 0 and b < 1:

Input: a sample S of size m

Output: a hypothesis h such that

for every (x, y) in S: h(x) = y (consistency)

size(h) < size(c_t)^a · m^b

Runs efficiently.

Occam algorithm and compression

A and B both see the points x_1, … , x_m; A also knows the labels b_1, … , b_m and wants to transmit them to B.

Option 1:

A sends B the values b_1, … , b_m:

m bits of information.

Option 2:

A sends B the hypothesis h:

Occam: for large enough m, size(h) < m.

Option 3 (MDL):

A sends B a hypothesis h and "corrections":

complexity: size(h) + size(errors).

Occam Razor Theorem

A: an (a, b)-Occam algorithm for C using H

D: a distribution over inputs X

c_t in C: the target function, n = size(c_t)

Sample size: with probability 1 − δ, A(S) = h has error(h) < ε, provided

m = O( (1/ε) ln(1/δ) + ( (n^a)/ε )^{1/(1 − b)} )
Occam Razor Theorem (cont.)

Use the bound for a finite hypothesis class.

Effective hypothesis class size: 2^{size(h)}, with size(h) < n^a m^b.

Sample size:

m ≥ (1/ε)(ln(1/δ) + n^a m^b ln 2)

Solving for m yields the bound above.

The VC dimension will later replace 2^{size(h)}.

Exercise (submission Mar 29, 04)

2. For an (a, b)-Occam algorithm, given noisy data with noise ~ (0, σ²), find the limitations on m.

Hint: ε-net and Chernoff bound.

Learning OR with few attributes

Target function: an OR of k literals

Goal: learn in time polynomial in k and log n

(with ε and δ constant)

ELIM makes "slow" progress:

it disqualifies one literal per round,

and may remain with O(n) literals.

Set Cover: definition

Input: S_1, … , S_t with S_i ⊆ U

Output: S_{i_1}, … , S_{i_k} with ∪_j S_{i_j} = U

Question: are there k sets that cover U?

NP-complete

Set Cover: greedy algorithm

j = 0; U_j = U; C = ∅

While U_j ≠ ∅:

Let S_i be arg max_i |S_i ∩ U_j|

Add i to C

Let U_{j+1} = U_j − S_i

j = j + 1
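The same greedy procedure as a short Python sketch (data layout mine):

```python
def greedy_set_cover(sets, universe):
    """sets: dict name -> set of elements. Returns names of chosen sets."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # pick the set covering the most still-uncovered elements
        best = max(sets, key=lambda name: len(sets[name] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6}}
print(greedy_set_cover(sets, {1, 2, 3, 4, 5, 6}))   # ['A', 'C']
```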

Set Cover: Greedy Analysis

At termination, C is a cover.

Assume there is a cover C′ of size k.

C′ is a cover of every U_j,

so some S in C′ covers at least |U_j|/k elements of U_j.

Analysis of U_j: |U_{j+1}| ≤ |U_j| − |U_j|/k

Solving the recursion:

Number of sets j < k ln |U|
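Making the recursion-solving step explicit:

```latex
% One greedy step removes at least a 1/k fraction of U_j, so
|U_j| \le |U| \left(1 - \tfrac{1}{k}\right)^{j} \le |U|\, e^{-j/k} < 1
\quad \text{once } j > k \ln |U| .
```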

Learning decision lists

Lists of arbitrary size can represent any boolean function. Lists whose tests contain at most k < n literals define the k-DL boolean language. For n attributes, the language is k-DL(n).

Each test in the language Conj(n, k) plays at most 3 roles in a list (output Y, output N, or absent), giving 3^{|Conj(n,k)|} distinct component assignments.

|k-DL(n)| ≤ 3^{|Conj(n,k)|} · |Conj(n,k)|!  (any order)

|Conj(n,k)| = Σ_{i=0}^{k} C(2n, i) = O(n^k)

Building an Occam algorithm

Given a sample S of size m:

Run ELIM on S.

Let LIT be the resulting set of literals.

There exist k literals in LIT that classify all of S correctly.

Negative examples: any subset of LIT classifies them correctly.

Positive examples: search for a small subset of LIT that classifies S⁺ correctly.

For a literal z build T_z = {x | z satisfies x}.

There are k sets T_z that cover S⁺;

greedy set cover finds k ln m sets that cover S⁺.

Output h = the OR of the k ln m chosen literals.

size(h) < k ln m · log 2n

Sample size: m = O(k log n log(k log n))
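Putting the pieces together, a sketch (mine) of this Occam algorithm, reusing the elim and greedy_set_cover sketches from above:

```python
def occam_or_learner(examples, n):
    """Learn an OR of few literals: ELIM, then greedy cover of the positives.
    Reuses elim() and greedy_set_cover() defined in the earlier sketches."""
    lit = elim(examples, n)                   # literals consistent with S-
    positives = [bits for bits, label in examples if label == 1]
    # T_z = positive examples that literal z = (i, v) satisfies
    t = {z: {p for p in positives if p[z[0]] == z[1]} for z in lit}
    chosen = greedy_set_cover(t, set(positives))   # ~ k ln m literals
    return set(chosen)

data = [((0, 0, 1, 1, 0), 1), ((1, 0, 0, 0, 0), 0),
        ((0, 1, 0, 0, 1), 0), ((0, 0, 1, 0, 0), 1)]
print(occam_or_learner(data, 5))   # a small set such as {(2, 1)}, i.e. x3
```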

Criticism of the PAC model

"The worst-case emphasis makes it unusable."

Still useful for the analysis of computational complexity.

There are methods to estimate the cardinality of the space of concepts (the VC dimension); unfortunately not sufficiently practical.

"The notions of target concepts and noise-free training are too restrictive."

True. The switch to concept approximation is a weak remedy.

There are some extensions to label noise, and fewer to variable noise.

Summary

PAC model

Confidence and accuracy

Sample size

Finite (and infinite) concept class

Occam Razor

References

L. G. Valiant. A theory of the learnable. Comm. ACM 27(11):1134-1142, 1984. (Original work)

D. Haussler. Probably approximately correct learning. (Review)

M. Kearns. Efficient noise-tolerant learning from statistical queries. (Review of noise methods)

F. Denis et al. PAC learning with simple examples. (Simple examples)

Learning algorithms

OR function

Parity function

OR of a few literals

Open problems

OR in the non-feasible case

Parity of a few literals