
Probably Approximately Correct Learning (PAC)

Leslie G. Valiant. A Theory of the Learnable. Comm. ACM 27(11):1134-1142, 1984.

Recall: Bayesian learning


Create a model based on some parameters

Assume some prior distribution on those parameters

Learning problem

Adjust the model parameters so as to maximize the likelihood of the model given the data

Utilize the Bayesian formula for that.

PAC Learning


Given a distribution D of observables X

Given a family of functions (concepts) F

For each x ∈ X and f ∈ F: f(x) provides the label for x

Given a family of hypotheses H, seek a hypothesis h such that

Error(h) = Pr_{x ~ D}[ f(x) ≠ h(x) ]

is minimal
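As a concrete illustration (not part of the original slides), Error(h) can be estimated empirically by drawing points from D and comparing the labels of f and h; the sampler, target, and hypothesis below are hypothetical stand-ins.

```python
import random

def estimate_error(h, f, sample_from_D, n_samples=10_000, seed=0):
    """Monte Carlo estimate of Error(h) = Pr_{x ~ D}[f(x) != h(x)].

    h, f          -- callables mapping an observable x to its label
    sample_from_D -- callable taking an RNG and returning one x drawn from D
    """
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(n_samples):
        x = sample_from_D(rng)
        mistakes += f(x) != h(x)
    return mistakes / n_samples

# Toy usage: D is uniform on [0, 1], f labels x >= 0.5, h uses a slightly-off threshold.
if __name__ == "__main__":
    f = lambda x: x >= 0.5
    h = lambda x: x >= 0.55
    print(estimate_error(h, f, lambda rng: rng.random()))  # roughly 0.05
```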

PAC New Concepts


Large family of distributions D

Large family of concepts F

Family of hypotheses H

Main questions:

Is there a hypothesis h ∈ H that can be learned?

How fast can it be learned?

What error can be expected?

Estimation vs. approximation

Note:

The distribution D is fixed

There is no noise in the system (currently)

F is a space of binary functions (concepts)

This is thus an approximation problem, as the function is given exactly for each x ∈ X

Estimation problem: f is not given exactly but is estimated from noisy data


Example (PAC)


Concept: average body-size person

Inputs: for each person:

height

weight

Sample: labeled examples of persons

label + : average body-size

label − : not average body-size

Two-dimensional inputs

Observable space X with concept f

Example (PAC)


Assumption: the target concept is a rectangle.

Goal:

Find a rectangle h that "approximates" the target

Hypothesis family H of rectangles

Formally:

With high probability, output a rectangle such that its error is low.

Example (Modeling)



Assume:

Fixed distribution over persons.

Goal:

Low error with respect to THIS distribution!

What does the distribution look like?

Highly complex.

Each parameter is not uniform.

Highly correlated.

Model Based approach


First try to model the distribution.


Given a model of the distribution:


find an optimal decision rule.




Bayesian Learning

PAC approach



Assume that the distribution is fixed.


Samples are drawn i.i.d.:

independent

identically distributed

Concentrate on the decision rule rather than on the distribution.

PAC Learning


Task: learn a rectangle from examples.

Input: points (x, f(x)) with classification + or −,

classified by a rectangle R

Goal:

Using the fewest examples,

compute h,

where h is a good approximation for f

PAC Learning: Accuracy


Testing the accuracy of a hypothesis:

using the distribution D of examples.

Error = h Δ f (symmetric difference)

Pr[Error] = D(Error) = D(h Δ f)

We would like Pr[Error] to be controllable.

Given a parameter ε:

Find h such that Pr[Error] < ε.

PAC Learning: Hypothesis


Which Rectangle should we choose?


Similar to parametric modeling?

Setting up the Analysis:


Choose the smallest rectangle.

Need to show:

For any distribution D and rectangle h,

input parameters ε and δ,

Select m(ε, δ) examples

Let h be the smallest consistent rectangle.

Such that with probability 1 − δ (over the m examples drawn from D):

D(f Δ h) < ε
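A minimal sketch (our addition) of the learner analyzed here: return the smallest axis-aligned rectangle consistent with the positive examples. All names are illustrative.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]

def smallest_consistent_rectangle(sample: Iterable[Tuple[Point, bool]]):
    """Return (xmin, xmax, ymin, ymax) of the tightest axis-aligned rectangle
    containing all positively labeled points, or None if there are none."""
    positives = [p for p, label in sample if label]
    if not positives:
        return None
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return min(xs), max(xs), min(ys), max(ys)

def h(rect, point):
    """Hypothesis induced by the rectangle: predict + iff the point is inside."""
    if rect is None:
        return False
    xmin, xmax, ymin, ymax = rect
    x, y = point
    return xmin <= x <= xmax and ymin <= y <= ymax
```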

More general case (no rectangle)


A distribution: D (unknown)

Target function: c_t from C

c_t : X → {0,1}

Hypothesis: h from H

h : X → {0,1}

Error probability:

error(h) = Pr_D[ h(x) ≠ c_t(x) ]

Oracle: EX(c_t, D)

PAC Learning: Definition


C and H are concept classes over X.

C is PAC learnable by H if

there exists an algorithm A such that:

for any distribution D over X and any c_t in C,

for every input ε and δ,

it outputs a hypothesis h in H,

while having access to EX(c_t, D),

such that with probability 1 − δ we have error(h) < ε.

Complexities.

Finite Concept class


Assume C = H and finite.

h is ε-bad if error(h) > ε.

Algorithm:

Sample a set S of m(ε, δ) examples.

Find an h in H which is consistent.

The algorithm fails if h is ε-bad.


PAC learning: formalization (1)

X is the set of all possible examples

D is the distribution from which the examples are drawn

H is the set of all possible hypotheses, c ∈ H

m is the number of training examples

error(h) = Pr( h(x) ≠ c(x) | x is drawn from X with D )

h is approximately correct if error(h) ≤ ε

PAC learning: formalization (2)

To show: after m examples, with high probability, all consistent hypotheses are approximately correct.

All consistent hypotheses lie in an ε-ball around c.

[Figure: hypothesis space H; the hypotheses inside the ε-ball around c are approximately correct, while H_bad lies outside the ball]

Complexity analysis (1)


The probability that a hypothesis h_bad ∈ H_bad is consistent with the first m examples:

error(h_bad) > ε by definition.

The probability that it agrees with a single example is thus at most (1 − ε), and with m examples it is at most (1 − ε)^m.

Complexity analysis (2)


For H_bad to contain a consistent hypothesis, at least one hypothesis in it must be consistent.

Pr(H_bad has a consistent hypothesis) ≤ |H_bad| (1 − ε)^m ≤ |H| (1 − ε)^m

To reduce the probability of error below δ:

|H| (1 − ε)^m ≤ δ

This is possible when at least

m ≥ (1/ε) ( ln(1/δ) + ln |H| )

examples are seen.

This is the sample complexity.
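A small helper (our addition; the function name is ours) that evaluates this sample-complexity bound numerically:

```python
import math

def pac_sample_size(epsilon: float, delta: float, h_size: int) -> int:
    """Smallest m with m >= (1/epsilon) * (ln(1/delta) + ln|H|):
    the sample-complexity bound for a consistent learner over a finite H."""
    return math.ceil((math.log(1 / delta) + math.log(h_size)) / epsilon)

# Example: the 3^n disjunctions over n = 20 variables, epsilon = delta = 0.05.
print(pac_sample_size(0.05, 0.05, 3 ** 20))  # 500 examples
```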

Complexity analysis (3)

Seeing at least m examples guarantees, with probability 1 − δ, that a consistent hypothesis h is wrong with probability at most ε.

Since |H| = 2^(2^n), the complexity grows exponentially with the number of attributes n.

Conclusion: learning an arbitrary boolean function is no better in the worst case than table lookup!

Complexity analysis (4)

PAC learning -- observations



A hypothesis h that is consistent with m examples has an error of at most ε, with probability 1 − δ.

This is a worst-case analysis. Note that the result is independent of the distribution D!

Growth rate analysis:

for ε → 0, m → ∞ proportionally (to 1/ε)

for δ → 0, m → ∞ logarithmically (in 1/δ)

for |H| → ∞, m → ∞ logarithmically (in |H|)



PAC: comments



We only assumed that examples are i.i.d.

We have two independent parameters:

Accuracy ε

Confidence δ

No assumption about the likelihood of a hypothesis.

The hypothesis is tested on the same distribution as the sample.

PAC: non-feasible case

What happens if c_t is not in H?

We need to redefine the goal.

Let h* in H minimize the error: β = error(h*)

Goal: find h in H such that

error(h) ≤ error(h*) + ε = β + ε

Analysis*



For each h in H:

let obs-error(h) be the average error on the sample S.

Compute the probability that

|obs-error(h) − error(h)| ≥ ε/2

Chernoff bound: Pr < exp(−(ε/2)² m)

Consider the entire H:

Pr < |H| exp(−(ε/2)² m)

Sample size: m > (4/ε²) ln(|H|/δ)
Correctness


Assume that for all h in H:

|obs-error(h) − error(h)| < ε/2

In particular:

obs-error(h*) < error(h*) + ε/2

error(h) − ε/2 < obs-error(h)

For the output h:

obs-error(h) ≤ obs-error(h*)

Conclusion: error(h) < error(h*) + ε
Sample size issue



Due to the use of the Chernoff bound:

Pr{ |obs-error(h) − error(h)| ≥ ε/2 } < exp(−(ε/2)² m)

and on the entire H:

Pr < |H| exp(−(ε/2)² m)

It follows that the sample size is

m > (4/ε²) ln(|H|/δ)

and not (1/ε) ln(|H|/δ) as before.
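For intuition (our addition), the agnostic bound can be compared numerically with the earlier realizable bound; the helper below is hypothetical.

```python
import math

def agnostic_sample_size(epsilon: float, delta: float, h_size: int) -> int:
    """Smallest m satisfying m >= (4/epsilon^2) * ln(|H|/delta)
    (non-feasible / agnostic case, via the Chernoff bound)."""
    return math.ceil(4 / epsilon ** 2 * math.log(h_size / delta))

# With epsilon = 0.05, delta = 0.05, |H| = 3^20:
# realizable bound (1/eps)(ln(1/delta) + ln|H|)  ~ 5e2 examples,
# agnostic bound   (4/eps^2) ln(|H|/delta)       ~ 4e4 examples.
print(agnostic_sample_size(0.05, 0.05, 3 ** 20))
```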
Example: Learning OR of literals


Inputs: x_1, …, x_n

Literals: x_i or its negation ¬x_i

OR functions:

For each variable, the target disjunction may contain x_i, contain ¬x_i, or omit the variable; thus

the number of disjunctions is 3^n

[Example: a disjunction over variables such as x_1, x_4, x_7]

ELIM: Algorithm for learning OR


Keep a list of all literals.

For every example whose classification is 0:

Erase all the literals that are 1.

Example: c(00110) = 0 results in deleting ¬x_1, ¬x_2, x_3, x_4, ¬x_5.

Correctness:

Our hypothesis h: an OR of our set of literals.

Our set of literals includes the target OR literals.

Every time h predicts zero, we are correct.

Sample size: m > (1/ε) ln(3^n / δ)

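A runnable sketch (our addition) of ELIM as described above, with literals encoded as (index, polarity) pairs:

```python
def elim(examples, n):
    """ELIM for learning an OR of literals over x_1..x_n.

    examples -- iterable of (bits, label) where bits is a tuple of 0/1
                values for x_1..x_n and label is 0 or 1.
    Returns the surviving set of literals (i, pol): the literal is x_{i+1}
    if pol == 1 and its negation if pol == 0.
    """
    literals = {(i, pol) for i in range(n) for pol in (0, 1)}
    for bits, label in examples:
        if label == 0:
            # Erase every literal that evaluates to 1 on this negative example.
            literals -= {(i, pol) for (i, pol) in literals if bits[i] == pol}
    return literals

def predict(literals, bits):
    """h(x): OR of the surviving literals."""
    return int(any(bits[i] == pol for (i, pol) in literals))

# Toy run: target is x_3 OR x_4 (1-indexed).
sample = [((0, 0, 1, 1, 0), 1), ((0, 0, 0, 0, 1), 0), ((1, 0, 0, 0, 0), 0)]
hyp = elim(sample, 5)
print(predict(hyp, (0, 0, 1, 0, 0)))  # 1, since x_3 survives
```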
Learning parity



Functions: parities such as x_1 ⊕ x_7 ⊕ x_9

Number of functions: 2^n

Algorithm:

Sample a set of examples

Solve the linear equations over GF(2) (a consistent solution exists)

Sample size: m > (1/ε) ln(2^n / δ)
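A minimal sketch (our addition) of the parity learner: Gaussian elimination over GF(2) on the sampled examples; it returns some coefficient vector consistent with the sample.

```python
def learn_parity(examples, n):
    """Find a in {0,1}^n with (a . x) mod 2 == label for all examples,
    by Gaussian elimination over GF(2). Returns None if inconsistent."""
    rows = [list(x) + [label] for x, label in examples]
    pivots = []  # (row_index, column) pairs
    r = 0
    for col in range(n):
        piv = next((i for i in range(r, len(rows)) if rows[i][col]), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivots.append((r, col))
        r += 1
    if any(not any(row[:n]) and row[n] for row in rows):
        return None  # inconsistent sample (cannot happen for a true parity target)
    a = [0] * n
    for row_i, col in pivots:
        a[col] = rows[row_i][n]
    return a

# Toy usage: target parity x_1 XOR x_3 over n = 3 variables.
data = [((1, 0, 0), 1), ((0, 1, 0), 0), ((1, 1, 1), 0)]
print(learn_parity(data, 3))  # [1, 0, 1]
```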


Infinite Concept class


X = [0,1] and H = { c_θ | θ ∈ [0,1] }

c_θ(x) = 0 iff x < θ

Assume C = H:

Which c_θ should we choose in [min, max]?

Proof I


Define

max = min{ x | c(x) = 1 },  min = max{ x | c(x) = 0 }

Show that

Pr[ D([min, max]) > ε ] < δ

Proof: by contradiction.

Suppose the probability that x ∈ [min, max] is at least ε.

The probability that we do not sample from [min, max] is (1 − ε)^m.

This needs m > (1/ε) ln(1/δ).

There is something wrong: the interval [min, max] is itself determined by the sample, so this argument does not directly apply.

Proof II (correct):


Let max' be such that D([θ, max']) = ε/2

Let min' be such that D([min', θ]) = ε/2

Goal: show that with high probability

some x^+ lies in [θ, max'] and

some x^- lies in [min', θ]

In such a case any value in [x^-, x^+] is good.

Compute the sample size!

Proof II (correct):


Pr[ none of x_1, x_2, …, x_m lies in [min', θ] ] = (1 − ε/2)^m < exp(−m ε/2)

Similarly for the other side.

We require 2 exp(−m ε/2) < δ.

Thus, m > (2/ε) ln(2/δ).
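A small simulation (our addition) of this threshold learner under a uniform D; the target θ and the parameters are arbitrary illustrative choices.

```python
import math
import random

def learn_threshold(sample):
    """Return a threshold consistent with the sample: any value between the
    largest negative point and the smallest positive point."""
    negatives = [x for x, y in sample if y == 0]
    positives = [x for x, y in sample if y == 1]
    lo = max(negatives, default=0.0)
    hi = min(positives, default=1.0)
    return (lo + hi) / 2

if __name__ == "__main__":
    rng = random.Random(1)
    theta, eps, delta = 0.37, 0.05, 0.05          # hypothetical target and parameters
    m = math.ceil(2 / eps * math.log(2 / delta))  # m > (2/eps) ln(2/delta), about 148
    sample = [(x, int(x >= theta)) for x in (rng.random() for _ in range(m))]
    theta_hat = learn_threshold(sample)
    print(m, abs(theta_hat - theta))  # error = D-mass between theta_hat and theta (D uniform)
```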



Comments


The hypothesis space was very simple: H = { c_θ | θ ∈ [0,1] }

There was no noise in the data or labels

So learning was trivial in some sense (analytic solution)


Non-Feasible case: Label Noise


Suppose we sample:


Algorithm:


Find the function h with lowest error!

Analysis


Define: { z_i } as an ε/4-net (w.r.t. D)

For the optimal h* and our h there are

z_j : |error(h[z_j]) − error(h*)| < ε/4

z_k : |error(h[z_k]) − error(h)| < ε/4

Show that with high probability:

|obs-error(h[z_i]) − error(h[z_i])| < ε/4


Exercise (Submission Mar 29, 04)

1. Assume there is Gaussian (0, σ) noise on the x_i. Apply the same analysis to compute the required sample size for PAC learning.

Note: class labels are determined by the non-noisy observations.


General ε-net approach

Given a class H, define a class G:

For every h in H

there exists a g in G such that

D(g Δ h) < ε/4

Algorithm: Find the best h in H.

Compute the confidence and sample size.

Occam Razor

W. Occam (1320): "Entities should not be multiplied unnecessarily."

A. Einstein: "Simplify a problem as much as possible, but no simpler."

Information-theoretic grounds?


Occam Razor

Finding the shortest consistent hypothesis.

Definition: (a, b)-Occam algorithm

a > 0 and b < 1

Input: a sample S of size m

Output: hypothesis h

for every (x, b) in S: h(x) = b  (consistency)

size(h) < size(c_t)^a · m^b

Efficiency.

Occam algorithm and compression

[Figure: A holds the labeled sample S = (x_i, b_i), i = 1..m, and must communicate the labels to B, who already knows x_1, …, x_m]

Compression


Option 1:

A sends B the values b_1, …, b_m

m bits of information

Option 2:

A sends B the hypothesis h

Occam: for large enough m, size(h) < m

Option 3 (MDL):

A sends B a hypothesis h and "corrections"

complexity: size(h) + size(errors)

Occam Razor Theorem



A: (
a
,
b
)
-
Occam algorithm for C using H


D distribution over inputs X


c
t

in C the target function,
n=size(c
t
)


Sample size:





with probability 1
-
d

A(S)=h has error(h) <
e

















+


)
1
(
1
2
1
ln
1

2
b
a
e
d
e
n
m
Occam Razor Theorem



Use the bound for finite hypothesis
class.


Effective hypothesis class size 2
size(h)


size(h) < n
a

m
b


Sample size:










+








=









d
e
d
e
d
e
b
a
b
a
1
ln
1
2
ln
ln
1

2
m
n
m
m
n
The VC dimension will replace 2
size(h)

Exercise (Submission Mar 29, 04)

2. For an (a, b)-Occam algorithm, given noisy data with noise ~ (0, σ²), find the limitations on m.

Hint: ε-net and Chernoff bound.

Learning OR with few attributes

Target function: an OR of k literals

Goal: learn in time

polynomial in k and log n

with ε and δ constant

ELIM makes "slow" progress:

it disqualifies one literal per round

and may remain with O(n) literals

Set Cover - Definition

Input: S_1, …, S_t with S_i ⊆ U

Output: S_{i_1}, …, S_{i_k} with ∪_j S_{i_j} = U

Question: are there k sets that cover U?

NP-complete



Set Cover: Greedy algorithm

j = 0; U_j = U; C = ∅

While U_j ≠ ∅:

Let S_i be arg max_i |S_i ∩ U_j|

Add S_i to C

Let U_{j+1} = U_j − S_i

j = j + 1
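A direct Python rendering (our addition) of the greedy algorithm above; the toy sets are illustrative.

```python
def greedy_set_cover(universe, sets):
    """Greedy set cover: repeatedly pick the set covering the most
    still-uncovered elements. Returns indices of the chosen sets."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        i = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[i] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(i)
        uncovered -= sets[i]
    return chosen

# Toy usage: an optimal cover uses 2 sets; greedy finds at most ~k ln|U| sets.
U = set(range(6))
S = [{0, 1, 2}, {3, 4, 5}, {0, 3}, {1, 4}, {2, 5}]
print(greedy_set_cover(U, S))  # e.g. [0, 1]
```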

Set Cover: Greedy Analysis


At termination, C is a cover.

Assume there is a cover C' of size k.

C' is a cover for every U_j.

Some S in C' covers at least |U_j|/k elements of U_j.

Analysis of U_j:  |U_{j+1}| ≤ |U_j| − |U_j|/k

Solving the recursion (see the step below):

Number of sets: j < k ln |U|
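The omitted step of solving the recursion, written out (our addition):

```latex
% From |U_{j+1}| \le |U_j| - |U_j|/k = |U_j|(1 - 1/k):
\[
   |U_j| \;\le\; |U|\Bigl(1 - \tfrac1k\Bigr)^{j} \;\le\; |U|\,e^{-j/k} \;<\; 1
   \quad\text{as soon as}\quad j > k \ln |U| ,
\]
% so the greedy algorithm terminates after fewer than k ln|U| iterations.
```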


Ex 2: Solve

Learning decision lists

Lists of arbitrary size can represent any boolean function. Lists with tests of at most k < n literals define the k-DL boolean language. For n attributes, the language is k-DL(n).

The language of tests Conj(n, k) has at most 3^|Conj(n,k)| distinct component sets (Y, N, absent).

|k-DL(n)| ≤ 3^|Conj(n,k)| · |Conj(n,k)|!  (any order)

|Conj(n, k)| = Σ_{i=0}^{k} C(2n, i) = O(n^k)

Building an Occam algorithm


Given a sample S of size m

Run ELIM on S

Let LIT be the set of surviving literals

There exist k literals in LIT that classify all of S correctly

Negative examples:

any subset of LIT classifies them correctly

Building an Occam algorithm


Positive examples:

Search for a small subset of LIT

which classifies S^+ correctly

For a literal z build T_z = { x | z satisfies x }

There are k sets that cover S^+

Find k ln m sets that cover S^+ (greedy set cover, as sketched below)

Output h = the OR of the k ln m literals

size(h) < k ln m · log 2n

Sample size: m = O( k log n · log(k log n) )
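A sketch (our addition) of this Occam algorithm for an OR of k literals; it assumes the elim() and greedy_set_cover() functions from the earlier sketches are in scope.

```python
def occam_or_learner(examples, n):
    """Learn a short OR of literals: run ELIM, then greedily pick surviving
    literals whose T_z sets cover all positive examples.
    Reuses elim() and greedy_set_cover() from the sketches above."""
    lit = sorted(elim(examples, n))                    # surviving literals (i, pol)
    positives = [bits for bits, label in examples if label == 1]
    universe = set(range(len(positives)))
    # T_z restricted to the positives: indices of positive examples the literal satisfies.
    T = [{j for j, bits in enumerate(positives) if bits[i] == pol} for (i, pol) in lit]
    chosen = greedy_set_cover(universe, T)
    return [lit[c] for c in chosen]                    # OR of roughly k ln m literals

sample = [((0, 0, 1, 1, 0), 1), ((0, 0, 0, 0, 1), 0), ((1, 0, 0, 0, 0), 0)]
print(occam_or_learner(sample, 5))  # a small set of literals covering the positives
```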

Criticism of PAC model


The worst-case emphasis makes it unusable

Useful for analysis of computational complexity

Methods to estimate the cardinality of the space of concepts (VC dimension). Unfortunately not sufficiently practical

The notions of target concepts and noise-free training are too restrictive

True. The switch to concept approximation is weak.

Some extensions to label noise, and fewer to variable noise


Summary


PAC model


Confidence and accuracy


Sample size


Finite (and infinite) concept class


Occam Razor

References


A theory of the learnable. L. G. Valiant. Comm. ACM 27(11):1134-1142, 1984. (Original work)

Probably approximately correct learning. D. Haussler. (Review)

Efficient noise-tolerant learning from statistical queries. M. Kearns. (Review of noise methods)

PAC learning with simple examples. F. Denis et al. (Simple)


Learning algorithms



OR function


Parity function


OR of a few literals


Open problems


OR in the non-feasible case


Parity of a few literals