Computational Learning Theory


CS 9633 Machine Learning

Computational Learning Theory

Adapted from notes by Tom Mitchell

http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html

Theoretical Characterization
of Learning Problems


Under what conditions is successful
learning possible and impossible?


Under what conditions is a particular
learning algorithm assured of learning
successfully?

Two Frameworks


PAC (Probably Approximately Correct)
Learning Framework: Identify classes
of hypotheses that can and cannot be
learned from a polynomial number of
training examples


Define a natural measure of complexity for
hypothesis spaces that allows bounding
the number of training examples needed


Mistake Bound Framework

Theoretical Questions of
Interest


Is it possible to identify classes of learning problems
that are inherently difficult or easy, independent of
the learning algorithm?


Can one characterize the number of training
examples necessary or sufficient to assure
successful learning?


How is the number of examples affected


If observing a random sample of training data?


If the learner is allowed to pose queries to the trainer?


Can one characterize the number of mistakes that a
learner will make before learning the target function?


Can one characterize the inherent computational
complexity of a class of learning algorithms?


Computational Learning
Theory


Relatively recent field


Area of intense research


For some of the questions on the previous slide, the partial answer is yes.


Will generally focus on certain types of
learning problems.

Inductive Learning of Target
Function


What we are given


Hypothesis space


Training examples


What we want to know


How many training examples are sufficient
to successfully learn the target function?


How many mistakes will the learner make
before succeeding?

Questions for Broad Classes of
Learning Algorithms


Sample complexity


How many training examples do we need to
converge to a successful hypothesis with a high
probability?


Computational complexity


How much computational effort is needed to
converge to a successful hypothesis with a high
probability?


Mistake Bound


How many training examples will the learner
misclassify before converging to a successful
hypothesis?


PAC Learning


Probably Approximately Correct
Learning Model


Will restrict discussion to learning boolean-valued concepts in noise-free data.

Problem Setting:

Instances and Concepts


X is the set of all possible instances over which the target function may be defined

C is the set of target concepts the learner is to learn

Each target concept c in C is a subset of X

Equivalently, each target concept c in C is a boolean function c: X → {0,1}

c(x) = 1 if x is a positive example of the concept

c(x) = 0 otherwise

Problem Setting: Distribution


Instances are generated at random using some probability distribution D

D may be any distribution

D is generally not known to the learner

D is required to be stationary (does not change over time)

Training examples x are drawn at random from X according to D and presented with the target value c(x) to the learner.

Problem Setting:
Hypotheses


Learner L considers a set of hypotheses H

After observing a sequence of training examples of the target concept c, L must output some hypothesis h from H which is its estimate of c

Example Problem

(Classifying Executables)


Three classes: Malicious (M), Boring (B), Funny (F)

Features

a1: GUI present (yes/no)

a2: Deletes files (yes/no)

a3: Allocates memory (yes/no)

a4: Creates new thread (yes/no)

Distribution?

Hypotheses?

Instance   a1    a2    a3    a4    Class
1          Yes   No    No    Yes   B
2          Yes   No    No    No    B
3          No    Yes   Yes   No    F
4          No    No    Yes   Yes   M
5          Yes   No    No    Yes   B
6          Yes   No    No    No    F
7          Yes   Yes   Yes   No    M
8          Yes   Yes   No    Yes   M
9          No    No    No    Yes   B
10         No    No    Yes   No    M


True Error


Definition: The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:

$$\mathrm{error}_D(h) \equiv \Pr_{x \in D}\big[c(x) \neq h(x)\big]$$
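Since error_D(h) is a probability over D, it can be approximated by sampling. A minimal sketch, assuming a uniform distribution over four boolean features and an illustrative target c and hypothesis h (all names hypothetical, not from the slides):

```python
import random

def estimate_true_error(c, h, sample_instance, n_samples=100_000):
    """Monte Carlo estimate of error_D(h) = Pr_{x ~ D}[c(x) != h(x)]."""
    disagree = 0
    for _ in range(n_samples):
        x = sample_instance()        # draw x according to D
        disagree += c(x) != h(x)     # count disagreements between c and h
    return disagree / n_samples

# Illustrative setup: 4 boolean features drawn uniformly (this is D).
c = lambda x: x[0] and not x[1]      # hypothetical target concept
h = lambda x: x[0]                   # hypothesis that ignores feature a2
sample = lambda: tuple(random.random() < 0.5 for _ in range(4))

print(estimate_true_error(c, h, sample))  # close to 0.25 under uniform D
```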




Error of h with respect to c

[Figure: instance space X with the regions labeled positive by c and by h; the true error is the probability mass of the region where c and h disagree.]


Key Points


True error is defined over the entire instance space, not just the training data

Error depends strongly on the unknown probability distribution D

The error of h with respect to c is not directly observable to the learner; L can only observe performance with respect to the training data (training error)

Question: How probable is it that the observed training error for h gives a misleading estimate of the true error?


PAC Learnability


Goal: characterize classes of target concepts that can be reliably learned

from a reasonable number of randomly drawn training examples, and

using a reasonable amount of computation

It is unreasonable to expect perfect learning, where error_D(h) = 0

We would need to provide training examples corresponding to every possible instance

With a random sample of training examples, there is always a non-zero probability that the training examples will be misleading


Weaken Demand on
Learner


Hypothesis error (Approximately)

Will not require a zero-error hypothesis

Require only that the error is bounded by some constant ε that can be made arbitrarily small

ε is the error parameter

Error on training data (Probably)

Will not require that the learner succeed on every sequence of randomly drawn training examples

Require only that its probability of failure is bounded by a constant δ that can be made arbitrarily small

δ is the confidence parameter


Definition of PAC-Learnability

Definition: Consider a concept class C defined over a set of instances X of length n and a learner L using hypothesis space H. C is PAC-learnable by L using H if for all c ∈ C, all distributions D over X, all ε such that 0 < ε < 1/2, and all δ such that 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).





Requirements of Definition


L must, with arbitrarily high probability (1 − δ), output a hypothesis having arbitrarily low error (ε).

L's learning must be efficient: time grows polynomially in terms of

the strength requirements on the output hypothesis (1/ε, 1/δ)

the inherent complexity of the instance space (n) and concept class C (size(c))



Block Diagram of PAC
Learning Model

[Block diagram: a training sample {(x_i, c(x_i))}_{i=1}^{n} and control parameters ε, δ are input to learning algorithm L, which outputs a hypothesis h.]


Examples of second
requirement


Consider the executables problem where instances are conjunctions of boolean features:

a1=yes ∧ a2=no ∧ a3=yes ∧ a4=no

Concepts are conjunctions of a subset of the features:

a1=yes ∧ a3=yes ∧ a4=yes


Using the Concept of PAC
Learning in Practice


We often want to know how many training
instances we need in order to achieve a
certain level of accuracy with a specified
probability.


If L requires some minimum processing time per training example, then for C to be PAC-learnable by L, L must learn from a polynomial number of training examples.


Sample Complexity


Sample complexity of a learning
problem

is
the growth in the required training examples
with problem size.


Will determine the sample complexity for
consistent learners.


A learner is consistent if it outputs hypotheses
which perfectly fit the training data whenever
possible.


All algorithms in Chapter 2 are consistent learners.


Recall definition of VS


The
version space
, denoted VS
H,D
, with
respect to hypothesis space H and
training examples D, is the subset of
hypotheses from H consistent with the
training examples in D

$$VS_{H,D} = \{h \in H \mid \mathrm{Consistent}(h, D)\}$$
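A brute-force sketch of this definition for a toy finite H (conjunction hypotheses over two boolean features, encoded as '1'/'0'/'?'; the setup is illustrative, not from the slides):

```python
from itertools import product

def predicts(hyp, x):
    """A conjunction hypothesis: each slot requires true ('1'),
    requires false ('0'), or doesn't care ('?')."""
    return all(s == '?' or (s == '1') == xi for s, xi in zip(hyp, x))

H = list(product('01?', repeat=2))                  # |H| = 9
D = [((True, False), True), ((True, True), False)]  # pairs (x, c(x))

# VS_{H,D}: the hypotheses consistent with every training example.
VS = [h for h in H if all(predicts(h, x) == y for x, y in D)]
print(VS)  # [('1', '0'), ('?', '0')]
```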



VS and PAC learning by
consistent learners


Every consistent learner outputs a hypothesis
belonging to the version space, regardless of
the instance space X, hypothesis space H, or
training data D.


To bound the number of examples needed by
any consistent learner, we need only to
bound the number of examples needed to
assure that the version space contains no
unacceptable hypotheses.



ε-exhausted

Definition: Consider a hypothesis space H, target concept c, instance distribution D, and set of training examples D of c. The version space VS_{H,D} is said to be ε-exhausted with respect to c and D if every hypothesis h in VS_{H,D} has error less than ε with respect to c and D:

$$(\forall h \in VS_{H,D})\; \mathrm{error}_D(h) < \epsilon$$

Exhausting the version
space

[Figure: hypothesis space H containing the version space VS_{H,D}; each hypothesis is annotated with (true error, training error r). Inside VS_{H,D}, r = 0 with true errors 0.1 and 0.2; outside are hypotheses with (error 0.1, r 0.2), (error 0.3, r 0.2), (error 0.3, r 0.4), and (error 0.2, r 0.3).]


Exhausting the Version
Space


Only an observer who knows the identity of the target concept can determine with certainty whether the version space is ε-exhausted.

But we can bound the probability that the version space will be ε-exhausted after a given number of training examples

without knowing the identity of the target concept

without knowing the distribution from which the training examples were drawn


Theorem 7.1


Theorem 7.1: ε-exhausting the version space. If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent randomly drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space VS_{H,D} is not ε-exhausted (with respect to c) is less than or equal to

$$|H|\,e^{-\epsilon m}$$


Proof of theorem


See text


Number of Training
Examples (Eq. 7.2)

Starting from Theorem 7.1, require the failure probability to be at most δ and solve for m:

$$|H|\,e^{-\epsilon m} \le \delta$$

$$\ln|H| - \epsilon m \le \ln\delta$$

$$m \ge \frac{1}{\epsilon}\big(\ln|H| + \ln(1/\delta)\big) \qquad \text{(Eq. 7.2)}$$
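A minimal sketch of Eq. 7.2 as code (the function name and interface are mine):

```python
from math import ceil, log

def sample_complexity(h_size, eps, delta):
    """Eq. 7.2: number of examples sufficient for any consistent learner
    over a finite hypothesis space of size h_size."""
    return ceil((log(h_size) + log(1 / delta)) / eps)

# e.g. |H| = 2**10 hypotheses, 10% error bound, 95% confidence:
print(sample_complexity(2**10, eps=0.10, delta=0.05))  # 100
```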

Summary of Result


The inequality on the previous slide provides a general bound on the number of training examples sufficient for any consistent learner to successfully learn any target concept in H, for any desired values of ε and δ.

This number m of training examples is sufficient to assure that any consistent hypothesis will be

probably (with probability 1 − δ)

approximately (within error ε) correct.

The value of m grows

linearly with 1/ε

logarithmically with 1/δ

logarithmically with |H|

The bound can be a substantial overestimate.


Problem


Suppose we have the instance space described for the EnjoySport problem:


Sky (Sunny, Cloudy, Rainy)


AirTemp (Warm, Cold)


Humidity (Normal, High)


Wind (Strong, Weak)


Water (Warm, Cold)


Forecast (Same, Change)



Hypotheses can be as before, e.g. (?, Warm, Normal, ?, ?, Same) or (∅, ∅, ∅, ∅, ∅, ∅)


How many training examples do we need to have an
error rate of less than 10% with a probability of 95%?
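A worked answer, under the assumption (from Chapter 2) that this conjunctive hypothesis space has |H| = 1 + 4·3⁵ = 973 semantically distinct hypotheses:

```python
from math import ceil, log

H_SIZE = 1 + 4 * 3**5    # 973: one all-empty hypothesis, plus 4 choices
                         # for Sky and 3 for each of the other 5 attributes
eps, delta = 0.10, 0.05  # error below 10%, with probability at least 95%

m = ceil((log(H_SIZE) + log(1 / delta)) / eps)
print(m)  # 99 training examples suffice by Eq. 7.2
```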


Limits of Equation 7.2


Equation 7.2 tells us how many training examples suffice to ensure, with probability (1 − δ), that every hypothesis having zero training error will have a true error of at most ε.

Problem: there may be no hypothesis that is consistent with the training data if the concept is not in H. In this case, we want the minimum-error hypothesis.


Agnostic Learning and
Inconsistent Hypotheses


An Agnostic Learner does not make the
assumption that the concept is contained in
the hypothesis space.


We may want to consider the hypothesis with
the minimum error


Can derive a bound similar to the previous
one:

$$m \ge \frac{1}{2\epsilon^2}\big(\ln|H| + \ln(1/\delta)\big)$$
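This bound can be derived the same way as Eq. 7.2, with Hoeffding's inequality playing the role of Theorem 7.1 (a sketch of the standard argument; error_train denotes training-set error):

$$\Pr\big[\mathrm{error}_D(h) > \mathrm{error}_{\mathrm{train}}(h) + \epsilon\big] \le e^{-2m\epsilon^2}$$

$$|H|\,e^{-2m\epsilon^2} \le \delta \quad\Longrightarrow\quad m \ge \frac{1}{2\epsilon^2}\big(\ln|H| + \ln(1/\delta)\big)$$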

Concepts that are PAC-Learnable

Proofs that a type of concept is PAC-learnable usually consist of two steps:

Show that each target concept in C can be learned from a polynomial number of training examples

Show that the processing time per training example is also polynomially bounded


PAC Learnability of Conjunctions
of Boolean Literals


Class C of target concepts described by conjunctions of boolean literals, e.g.:

GUI_Present ∧ ¬Opens_files

Is C PAC-learnable? Yes.

Will prove by

showing that a polynomial number of training examples is needed to learn each concept

demonstrating an algorithm that uses polynomial time per training example


Examples Needed to Learn
Each Concept


Consider a consistent learner that uses hypothesis space H = C

Compute the number m of random training examples sufficient to ensure that L will, with probability (1 − δ), output a hypothesis with maximum error ε.

We will use m ≥ (1/ε)(ln|H| + ln(1/δ))

What is the size of the hypothesis space?
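To answer the question: each of the n boolean variables may appear as a positive literal, appear as a negative literal, or be absent, so |H| = 3ⁿ. Substituting into the bound above gives Equation 7.4:

$$m \ge \frac{1}{\epsilon}\big(n\ln 3 + \ln(1/\delta)\big)$$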


Complexity Per Example


We just need to show that for some algorithm, we can spend a polynomial amount of time per training example.

One way to do this is to give an algorithm; in this case, we can use FIND-S as the learning algorithm.

FIND-S incrementally computes the most specific hypothesis consistent with each training example.

[Example trace: a sequence of positive and negative training instances over the literals Old, Tired, Rich, and Happy, processed one at a time by FIND-S.]

What is a bound on the time per example?
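A minimal sketch of FIND-S for boolean conjunctions (the representation is mine: each slot is '1', '0', or '?'); each positive example is processed in time linear in the number of features n:

```python
def find_s(examples):
    """FIND-S: start with the most specific hypothesis and minimally
    generalize it on each positive example; negatives are ignored."""
    h = None                                  # "no hypothesis yet": rejects all
    for x, positive in examples:
        if not positive:
            continue
        if h is None:
            h = list(x)                       # first positive, taken verbatim
        else:                                 # O(n) generalization step
            h = [s if s == xi else '?' for s, xi in zip(h, x)]
    return h

examples = [(('1', '0', '1'), True),
            (('0', '1', '1'), False),
            (('1', '1', '1'), True)]
print(find_s(examples))  # ['1', '?', '1']  i.e. a1 AND a3
```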


Theorem 7.2

PAC-learnability of boolean conjunctions. The class C of conjunctions of boolean literals is PAC-learnable by the FIND-S algorithm using H = C.


Proof of Theorem 7.2


Equation 7.4 shows that the sample complexity for this concept class is polynomial in n, 1/ε, and 1/δ, and independent of size(c). To incrementally process each training example, the FIND-S algorithm requires effort linear in n and independent of 1/ε, 1/δ, and size(c). Therefore, this concept class is PAC-learnable by the FIND-S algorithm.


Interesting Results


Unbiased learners are not PAC-learnable because they require an exponential number of examples.

k-term Disjunctive Normal Form (k-term DNF) is not PAC-learnable.

k-Conjunctive Normal Form (k-CNF) is a superset of k-term DNF, but it is PAC-learnable.


Sample Complexity with
Infinite Hypothesis Spaces


Two drawbacks to the previous result

It often does not give a very tight bound on the sample complexity

It only applies to finite hypothesis spaces

Vapnik-Chervonenkis dimension of H (VC dimension)

Will give tighter bounds

Applies to many infinite hypothesis spaces


Shattering a Set of
Instances


Consider a subset of instances S from the instance space X.

Every hypothesis h imposes a dichotomy on S:

{x ∈ S | h(x) = 1}

{x ∈ S | h(x) = 0}

Given a set of instances S, there are 2^{|S|} possible dichotomies.

The ability of H to shatter a set of instances is a measure of its capacity to represent target concepts defined over these instances.


Shattering a Hypothesis
Space


Definition: A set of instances S is
shattered by hypothesis space H if and
only if for every dichotomy of S there
exists some hypothesis in H consistent
with this dichotomy.


Vapnik-Chervonenkis Dimension

Ability to shatter a set of instances is closely related to the inductive bias of the hypothesis space.

An unbiased hypothesis space is one that shatters the instance space X.

Sometimes X cannot be shattered, but a large subset of it can.


Vapnik-Chervonenkis Dimension

Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) = ∞.


Shattered Instance Space

[Figure: an instance space shattered by hypothesis space H.]


Example 1 of VC Dimension


Instance space X is the set of real numbers, X = R.


H is the set of intervals on the real
number line. Form of H is:

a < x < b


What is VC(H)?


Shattering the real number
line

[Figure: points −1.2 and 3.4 on one number line; points −1.2, 3.4, and 6.7 on a second. No interval can contain −1.2 and 6.7 while excluding 3.4.]

What is VC(H)?

What is |H|?
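A brute-force check of both questions (a sketch; candidate endpoints are placed in the gaps between the points and beyond the ends, which suffices because moving an endpoint within a gap never changes which points an interval contains): every dichotomy of two points is realizable, but the "in, out, in" dichotomy of three points is not, so VC(H) = 2 even though |H| is infinite.

```python
from itertools import product

def shattered_by_intervals(points):
    """True iff every dichotomy of `points` is realized by some a < x < b."""
    pts = sorted(points)
    # One candidate endpoint per gap plus one beyond each end; pairs with
    # a >= b give the empty interval (the all-negative dichotomy).
    cuts = ([pts[0] - 1]
            + [(u + v) / 2 for u, v in zip(pts, pts[1:])]
            + [pts[-1] + 1])
    dichotomies = {tuple(a < x < b for x in points)
                   for a, b in product(cuts, repeat=2)}
    return len(dichotomies) == 2 ** len(points)

print(shattered_by_intervals([-1.2, 3.4]))       # True:  VC(H) >= 2
print(shattered_by_intervals([-1.2, 3.4, 6.7]))  # False: (in, out, in) fails
```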


Example 2 of VC Dimension


Set X of instances corresponding to points in the x-y plane

H is the set of all linear decision surfaces in the plane


What is VC(H)?


Shattering the x-y plane

[Figure: all dichotomies of 2 instances, and of 3 non-collinear instances, each separated by a line.]

VC(H) = ?

|H| = ?


Proving limits on VC
dimension


If we find any set of instances of size d that can be shattered, then VC(H) ≥ d.

To show that VC(H) < d, we must show that no set of size d can be shattered.



General result for r
dimensional space



The VC dimension of linear decision
surfaces in an r dimensional space is
r+1.


Example 3 of VC dimension


Set X of instances consists of conjunctions of exactly three boolean literals, e.g.

young ∧ happy ∧ single

H is the set of hypotheses described by a conjunction of up to 3 boolean literals.


What is VC(H)?



Shattering conjunctions of
literals


Approach: construct a set of instances of size 3 that can be shattered. Let instance i have positive literal l_i and all other literals negative. Representing instances that are conjunctions of the literals l_1, l_2, and l_3 as bit strings:

Instance 1: 100

Instance 2: 010

Instance 3: 001

Construction of a dichotomy: to exclude an instance i, add the appropriate ¬l_i to the hypothesis.

Extend the argument to n literals.

Can VC(H) be greater than n (the number of literals)?
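A quick check of this construction for n = 3 (a sketch; instances encoded as bit tuples): for every dichotomy, the conjunction of ¬l_i over the excluded instances classifies all three instances correctly.

```python
from itertools import product

instances = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]  # instance i has only l_i true

for labels in product([0, 1], repeat=3):
    # Hypothesis: conjunction of NOT l_i for each instance i labeled negative.
    excluded = [i for i, y in enumerate(labels) if y == 0]
    h = lambda x: all(x[i] == 0 for i in excluded)
    assert [h(x) for x in instances] == [bool(y) for y in labels]

print("all 8 dichotomies realized -> VC(H) >= 3")
```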


Sample Complexity and the
VC dimension


Can derive a new bound on the number of randomly drawn training examples that suffice to probably approximately learn a target concept (how many examples do we need to ε-exhaust the version space with probability (1 − δ)?):

$$m \ge \frac{1}{\epsilon}\Big(4\log_2(2/\delta) + 8\,\mathrm{VC}(H)\log_2(13/\epsilon)\Big)$$



Comparing the Bounds



VC-dimension bound:

$$m \ge \frac{1}{\epsilon}\Big(4\log_2(2/\delta) + 8\,\mathrm{VC}(H)\log_2(13/\epsilon)\Big)$$

Finite-|H| bound:

$$m \ge \frac{1}{2\epsilon^2}\big(\ln|H| + \ln(1/\delta)\big)$$
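A quick numeric side-by-side (illustrative values only: for conjunctions of n boolean literals, |H| = 3ⁿ and VC(H) = n; note the two bounds rest on different assumptions, so this is a rough comparison, not an equivalence):

```python
from math import ceil, log, log2

def m_vc(vc, eps, delta):
    """VC-dimension bound on sufficient sample size."""
    return ceil((4 * log2(2 / delta) + 8 * vc * log2(13 / eps)) / eps)

def m_finite(h_size, eps, delta):
    """Finite-|H| bound on sufficient sample size."""
    return ceil((log(h_size) + log(1 / delta)) / (2 * eps**2))

n, eps, delta = 10, 0.10, 0.05
print(m_vc(n, eps, delta))         # VC-based bound
print(m_finite(3**n, eps, delta))  # |H|-based bound
```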

Lower Bound on Sample
Complexity


Theorem 7.3: Lower bound on sample complexity. Consider any concept class C such that VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8 and 0 < δ < 1/100. Then there exists a distribution D and a target concept in C such that if L observes fewer examples than

$$\max\left[\frac{1}{\epsilon}\log(1/\delta),\; \frac{\mathrm{VC}(C)-1}{32\epsilon}\right]$$

then with probability at least δ, L outputs a hypothesis h having error_D(h) > ε.