Introduction to Machine Learning (in NLP)


Probably Approximately Correct learning model

Section 1. Background

The PAC learning framework is part of computational learning theory (CLT).

CLT is a mathematical field that analyzes machine learning algorithms.

Training data is finite and the future is uncertain, so probabilistic bounds on the performance of machine learning algorithms are quite common. Time complexity and feasibility of learning are also important.

In CLT, a computation is considered feasible if it can be done in polynomial time.

Sample complexity: How many training examples are needed for a learner to converge with high probability to a successful hypothesis?

Computational complexity: How much computational effort is needed for a learner to converge with high probability to a successful hypothesis?

Section 2. The problem setting

Input data $X$.

Output values $Y = \{+1, -1\}$.

Training data $Data = \{\langle x_i, c(x_i) = y_i \rangle\}_{i=1}^{m}$, $x_i \in X$, $y_i \in Y$.

Set of target concepts $C$, $c \in C$, $c: X \to \{0, 1\}$.

Instances are generated at random from $X$ according to some probability distribution $D$. In general, $D$ may be any distribution and it will be unknown to the learner. $D$ must be stationary, i.e. it does not change over time.

$H$: a set of possible hypotheses.

A learner $L$ outputs some hypothesis $h$ from $H$ as a model of $c$.

What are the capabilities of learning algorithms? We will not concentrate on individual learning algorithms, but rather on broad classes of them.

Section 3. Error of a hypothesis

How closely does the learner's output hypothesis $h$ approximate the target concept $c$?

Definition. The true error $error_D(h)$ of the hypothesis $h$ with respect to the target function $c$ and the probability distribution $D$ is the probability that the hypothesis $h$ wrongly classifies a randomly selected instance drawn according to $D$:

$$error_D(h) \equiv \Pr_{x \in D}[c(x) \neq h(x)]$$

(You are already familiar with this definition.)

Section 4. PAC learnability

Aim


To characterize classes of target concepts that can be reliably learned from a reasonable

number of randomly drawn training examples and a reasonable amount of computation.

Definition. Consider a concept class $C$ defined over a set of instances $X$ of length $n$ ($n$ is the size of instances, i.e. the size of their representation) and a learner $L$ using hypothesis space $H$. $C$ is PAC-learnable by $L$ using $H$ if for all $c \in C$, all distributions $D$ over $X$, all $\epsilon$ such that $0 < \epsilon < 1/2$, and all $\delta$ such that $0 < \delta < 1/2$, learner $L$ will with probability at least $(1 - \delta)$ (confidence) output a hypothesis $h \in H$ such that $error_D(h) \leq \epsilon$, in time that is polynomial in $1/\epsilon$, $1/\delta$, $n$, and $size(c)$ ($size(c)$ is the encoding length of $c$, assuming some representation for $C$).

I.e., two things are required from $L$:

1. $L$ must output, with arbitrarily high probability $(1 - \delta)$, a hypothesis having arbitrarily low error $\epsilon$.
2. It must do so efficiently, in time that grows at most polynomially with $1/\epsilon$, $1/\delta$, and with $n$ and $size(c)$ (which define the inherent complexity of the underlying instance space $X$ and concept class $C$).

Section 5. Sample complexity for FINITE hypothesis spaces

Sample complexity

How many training examples are needed for a learner to converge (with high probability) to a

successful hypothesis? We will express it in terms of the size $|H|$ of the hypothesis space and the so-called Vapnik-Chervonenkis dimension.

Can we derive a bound on the number of training examples required by any consistent

learner?

Recall the definition of version space:

Version space $VS_{H,Data}$ with respect to $H$ and training data $Data$ is the subset of $H$ consistent with the training examples in $Data$:

$$VS_{H,Data} = \{h \in H \mid Consistent(h, Data)\}$$

To bound the number of examples needed by any consistent learner, we need only bound the

number of examples needed to assure that the Version Space contains no unacceptable

hypotheses. The following definition states this condition precisely:

Definition. Consider a hypothesis space $H$, target concept $c$, instance distribution $D$, and set of training examples $Data$ of $c$. The version space $VS_{H,Data}$ is said to be $\epsilon$-exhausted with respect to $c$ and $D$ if every hypothesis $h$ in $VS_{H,Data}$ has true error less than $\epsilon$ with respect to $c$ and $D$:

$$(\forall h \in VS_{H,Data})\; error_D(h) < \epsilon$$

We know that every consistent learner will output a hypothesis from the version space. So, to bound the number of training examples that the learner needs, we just bound the number of training examples needed to be sure that the version space contains no hypothesis whose true error exceeds $\epsilon$. The following theorem provides such a bound:

Theorem ($\epsilon$-exhausting the version space)

If the hypothesis space $H$ is finite, and $Data$ is a sequence of $m \geq 1$ independent randomly drawn examples of some target concept $c$, then for any $0 \leq \epsilon \leq 1$, the probability that the version space $VS_{H,Data}$ is not $\epsilon$-exhausted (with respect to $c$) is less than or equal to

$$|H| e^{-\epsilon m}$$

Proof

Let $h_1, h_2, \ldots, h_k$ be all the hypotheses in $H$ that have true error greater than $\epsilon$ with respect to $c$.


We fail to $\epsilon$-exhaust the version space if and only if at least one of these $k$ hypotheses happens to be consistent with all $m$ independent random training examples. The probability that any single hypothesis having true error greater than $\epsilon$ would be consistent with one randomly drawn example is at most $(1 - \epsilon)$. Therefore the probability that this hypothesis will be consistent with $m$ independently drawn examples is at most $(1 - \epsilon)^m$. Given that we have $k$ hypotheses with error greater than $\epsilon$, the probability that at least one of these will be consistent with all $m$ training examples is at most $k(1 - \epsilon)^m$. Since $k \leq |H|$, this is at most $|H|(1 - \epsilon)^m$.

Finally, we use a general inequality stating that if $0 \leq \epsilon \leq 1$ then $(1 - \epsilon) \leq e^{-\epsilon}$. Thus,

$$k(1 - \epsilon)^m \leq |H|(1 - \epsilon)^m \leq |H| e^{-\epsilon m},$$

which proves the theorem.

In other words, this bounds the probability that $m$ training examples will fail to eliminate all "bad" hypotheses for any consistent learner using hypothesis space $H$.

We use this result to determine the number of training examples required to reduce this probability of failure below some desired level $\delta$:

$$|H| e^{-\epsilon m} \leq \delta \quad\Rightarrow\quad m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$$

So, this number of training examples is sufficient to assure that any consistent hypothesis will be probably (with probability $(1 - \delta)$) approximately (within error $\epsilon$) correct. $m$ grows linearly in $1/\epsilon$ and logarithmically in $1/\delta$.
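A small sketch evaluating this bound (the function name `sample_bound` is mine):

```python
import math

def sample_bound(h_size, eps, delta):
    """Smallest m satisfying m >= (1/eps)(ln|H| + ln(1/delta))
    for a finite hypothesis space of size h_size."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# m grows linearly in 1/eps ...
m1 = sample_bound(1000, 0.1, 0.05)
m2 = sample_bound(1000, 0.05, 0.05)   # halving eps roughly doubles m
# ... but only logarithmically in 1/delta
m3 = sample_bound(1000, 0.1, 0.005)   # 10x more confidence costs little
print(m1, m2, m3)
```

Note how requesting ten times more confidence ($\delta$ from 0.05 to 0.005) barely changes $m$, while halving $\epsilon$ doubles it.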

Section 6. Agnostic learning and inconsistent hypotheses

If $H$ does not contain the target concept $c$, then a zero-training-error hypothesis cannot always be found. We then ask $L$ to output the hypothesis with the minimum error over the training examples.

Agnostic learner

An agnostic learner makes no prior commitment about whether or not $C \subseteq H$. The equation $m \geq \frac{1}{\epsilon}(\ln|H| + \ln\frac{1}{\delta})$ is based on the assumption of a zero-training-error hypothesis. Let's generalize it for nonzero-training-error hypotheses: let

$$h_{best} = \arg\min_{h \in H} error_{Data}(h).$$

How many training examples $m$ suffice to ensure (with high probability) that the true error $error_D(h_{best})$ will be no more than $error_{Data}(h_{best}) + \epsilon$? (In the previous case $error_{Data}(h_{best}) = 0$.)

Proof:

The proof is analogous to the setting we considered when estimating the true error based on the sample error: the probability of the coin coming up heads corresponds to the probability that the hypothesis will misclassify a randomly drawn instance. The $m$ independent coin flips correspond to the $m$ drawn instances. The frequency of heads over the $m$ examples corresponds to the frequency of misclassification over the $m$ instances.

The Hoeffding bounds state that if $error_{Data}(h)$ is measured over the set $Data$ containing $m$ randomly drawn examples, then

$$\Pr[error_D(h) > error_{Data}(h) + \epsilon] \leq e^{-2m\epsilon^2}$$

This gives us a bound on the probability that an arbitrarily chosen single hypothesis has a misleading training error.
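This one-sided bound can be checked with a quick Monte Carlo simulation (a sketch with arbitrary numbers: a hypothesis with true error 0.3, $m = 100$, $\epsilon = 0.1$):

```python
import math
import random

random.seed(0)
p, m, eps, trials = 0.3, 100, 0.1, 10_000

# Count trials where the sample error underestimates the true error
# by more than eps, i.e. error_D(h) > error_Data(h) + eps.
bad = 0
for _ in range(trials):
    emp = sum(random.random() < p for _ in range(m)) / m
    if p > emp + eps:
        bad += 1

bound = math.exp(-2 * m * eps**2)   # Hoeffding: e^{-2 m eps^2}
print(f"observed rate {bad / trials:.4f} <= bound {bound:.4f}")
```

The observed failure rate (around 1% here) sits well below the Hoeffding bound $e^{-2} \approx 0.135$; the bound is distribution-free, so it is loose for any particular $p$.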

To assure that the best hypothesis found by $L$ has an error bounded in this way, we must consider that any of the $|H|$ hypotheses $h \in H$ could have a large error:

$$\Pr[(\exists h \in H)(error_D(h) > error_{Data}(h) + \epsilon)] \leq |H| e^{-2m\epsilon^2}$$


If we call $\delta = |H| e^{-2m\epsilon^2}$ and solve for $m$, then

$$m \geq \frac{1}{2\epsilon^2}\left(\ln|H| + \ln\frac{1}{\delta}\right)$$

In this less restrictive case $m$ grows as the square of $1/\epsilon$, rather than linearly with $1/\epsilon$.
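To see the difference concretely, the realizable and agnostic bounds can be compared for an arbitrary fixed $|H| = 1000$, $\epsilon = 0.1$, $\delta = 0.05$ (a sketch):

```python
import math

def realizable(h_size, eps, delta):
    # zero-training-error case: m >= (1/eps)(ln|H| + ln(1/delta))
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

def agnostic(h_size, eps, delta):
    # nonzero-training-error case: m >= (1/(2 eps^2))(ln|H| + ln(1/delta))
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * eps**2))

print(realizable(1000, 0.1, 0.05), agnostic(1000, 0.1, 0.05))
```

With $\epsilon = 0.1$ the agnostic bound is $1/(2\epsilon) = 5$ times larger, and the gap widens as $\epsilon$ shrinks.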

Conjunctions of Boolean literals, i.e. AND-formulas, are PAC-learnable

Consider the class $C$ of target concepts described by conjunctions of up to $n$ literals (a literal is either a Boolean variable or its negation), for example $c = l_1 \wedge l_2 \wedge l_4 \wedge \ldots \wedge l_n$ ($l_3$ is missing). Is $C$ PAC-learnable?

To answer yes, we have to show that any consistent learner will require only a polynomial number of training examples to learn any $c$ in $C$, and then suggest a specific algorithm that uses polynomial time per training example.

Consider any consistent learner $L$ using a hypothesis space $H$ identical to $C$. We need only determine the size $|H|$.

Consider $H$ defined by conjunctions of literals based on $n$ boolean variables. Then $|H| = 3^n$ (include the variable as a literal in the hypothesis, include its negation as a literal, or ignore it).

Example: for $n = 2$, $|H| = 3^2 = 9$: $h_1 = x_1$, $h_2 = \neg x_1$, $h_3 = x_2$, $h_4 = \neg x_2$, $h_5 = x_1 \wedge x_2$, $h_6 = x_1 \wedge \neg x_2$, $h_7 = \neg x_1 \wedge x_2$, $h_8 = \neg x_1 \wedge \neg x_2$, $h_9 = x_1 \wedge \neg x_1 \wedge x_2 \wedge \neg x_2$.

So

$$m \geq \frac{1}{\epsilon}\left(n \ln 3 + \ln\frac{1}{\delta}\right)$$

For example, if a consistent learner attempts to learn a target concept described by conjunctions of up to 10 literals, and we desire 95% probability that it will learn a hypothesis with error less than 0.1, then it suffices to present $m$ randomly drawn training examples, where

$$m = \frac{1}{0.1}\left(10 \ln 3 + \ln\frac{1}{0.05}\right) \approx 140.$$
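A quick numerical check of this example (standard library only):

```python
import math

eps, delta, n = 0.1, 0.05, 10

# m >= (1/eps)(n ln 3 + ln(1/delta)) with |H| = 3^n
m = (n * math.log(3) + math.log(1 / delta)) / eps
print(round(m))
```

The exact value is about 139.8, so 140 randomly drawn examples suffice.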


Recall the FIND-S algorithm.

What is the FIND-S algorithm doing? For each new positive example, the algorithm computes the intersection of the literals shared by the current hypothesis and the new training example, i.e. for a positive example $a = \langle a_1, a_2, \ldots, a_n \rangle$, it removes literals from $h$ to make it consistent with $a$. That is, if $a_i = 0$, then remove $x_i$ from $h$, otherwise remove $\neg x_i$ from $h$.

The most specific hypothesis: $x_1 \wedge \neg x_1 \wedge x_2 \wedge \neg x_2 \wedge \ldots \wedge x_n \wedge \neg x_n$.
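A minimal sketch of FIND-S for this representation (the `(index, value)` literal encoding and the example data are mine):

```python
def find_s(examples, n):
    """FIND-S for conjunctions of boolean literals.

    examples: list of (x, label), x a tuple of n bits, label 0/1.
    A hypothesis is a set of literals (i, v) meaning "x[i] must equal v";
    we start from the most specific hypothesis containing all 2n literals.
    """
    h = {(i, v) for i in range(n) for v in (0, 1)}
    for x, label in examples:
        if label == 1:
            # drop every literal the positive example contradicts
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h

# target concept: x1 AND NOT x3 (0-indexed: x[0] and not x[2])
data = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0)]
h = find_s(data, 3)
print(sorted(h))
```

Each positive example is processed in $O(n)$ time, which is the polynomial-time-per-example requirement noted above; negative examples are simply ignored.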

Theorem (PAC-learnability of boolean conjunctions)

The class $C$ of conjunctions of boolean literals is PAC-learnable by the FIND-S algorithm using $H = C$.

Proof

Do it yourself.

3-CNF formulas are PAC-learnable

A 3-CNF formula is a conjunction of clauses, each of which is a disjunction of at most 3 literals. That is, each $h \in H$ can be written $h = C_1 \wedge C_2 \wedge \ldots \wedge C_m$, where $C_i = l_1 \vee l_2 \vee l_3$.

For each of the $(2n)^3$ 3-tuples of literals $(a, b, c)$, one can create a variable $x_{abc}$ corresponding to the clause $a \vee b \vee c$.
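A sketch of this reduction (names are mine): map each instance to one derived boolean feature per 3-tuple of literals, valued as that clause; learning a 3-CNF over the original variables then reduces to learning a conjunction over the derived features.

```python
from itertools import product

def clause_features(x):
    """Map an instance x (tuple of n bits) to (2n)^3 derived features,
    one per 3-tuple of literals (a, b, c), valued as the clause a OR b OR c."""
    n = len(x)
    # a literal is (index, positive?); it evaluates to x[i] or its negation
    lits = [(i, pos) for i in range(n) for pos in (True, False)]
    val = lambda lit: bool(x[lit[0]]) if lit[1] else not x[lit[0]]
    return {(a, b, c): val(a) or val(b) or val(c)
            for a, b, c in product(lits, repeat=3)}

feats = clause_features((1, 0))   # n = 2, so (2n)^3 = 64 derived features
print(len(feats))
```

The blow-up is polynomial in $n$, which is why the sample-complexity argument for conjunctions carries over to 3-CNF.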

k-term DNF is not PAC-learnable

A 3-term DNF formula is the disjunction of three terms, each of which is a conjunction of literals. That is, each $h \in H$ can be written $h = T_1 \vee T_2 \vee T_3$, where each $T_i$ is a conjunction. An example of such a hypothesis is

$$h = (x_1 \wedge x_2 \wedge \neg x_7) \vee (x_3 \wedge \neg x_7 \wedge x_8) \vee (x_4 \wedge \neg x_5 \wedge x_9)$$

Assume $H = C$. Then $|H| \leq 3^{nk}$ ($k$ terms, each of which may take on $3^n$ possible values). However, $3^{nk}$ is an overestimate of $|H|$, because it double-counts the cases where $T_i = T_j$ and where $T_i$ is more general than $T_j$. We can write

$$m \geq \frac{1}{\epsilon}\left(nk \ln 3 + \ln\frac{1}{\delta}\right)$$

This indicates that the sample complexity of $k$-term DNF is polynomial in $1/\epsilon$, $1/\delta$, $n$ and $k$. BUT ... it can be shown that the computational complexity is not polynomial, since this problem is equivalent to other problems that are known to be unsolvable in polynomial time.

Section 7. Sample complexity for INFINITE hypothesis space

We can state bounds on sample complexity that use the Vapnik-Chervonenkis dimension of $H$, $VC(H)$, rather than $|H|$. Moreover, these bounds allow us to characterize the sample complexity of many infinite hypothesis spaces.

Shattering a set of instances

Definition: A dichotomy of a set $S$ is a partition of $S$ into two disjoint subsets.

Let's assume a sample set $S \subseteq X$. Each hypothesis $h \in H$ imposes some dichotomy on $S$, i.e. partitions $S$ into the two subsets $\{x \in S \mid h(x) = 1\}$ and $\{x \in S \mid h(x) = 0\}$.

Definition: A set of instances $S$ is shattered by hypothesis space $H$ if and only if for every


dichotomy of $S$ there exists some hypothesis in $H$ with this dichotomy.

What if $H$ cannot shatter $X$, but can shatter some large subset $S$ of $X$? Intuitively, it is reasonable to say that the larger the subset of $X$ that can be shattered, the more expressive $H$ is. The Vapnik-Chervonenkis dimension of $H$ is precisely this measure.

The Vapnik-Chervonenkis Dimension

Definition: The Vapnik-Chervonenkis dimension, $VC(H)$, of hypothesis space $H$ defined over instance space $X$ is the size of the largest finite subset of $X$ shattered by $H$. If arbitrarily large finite sets of $X$ can be shattered by $H$, then $VC(H) = \infty$.

Note

For any finite $H$, $VC(H) \leq \log_2 |H|$. To see this, suppose $VC(H) = d$. Then $H$ will require $2^d$ distinct hypotheses to shatter $d$ instances, hence $2^d \leq |H|$ and $d \leq \log_2 |H|$.

Examples

Example 1. Consider $X = \mathbb{R}$ and $H$ the set of real intervals $a < x < b$. What is $VC(H)$?

We must find the largest subset of $X$ that can be shattered by $H$.

Consider $S = \{3.1, 5.7\}$. Can $S$ be shattered by $H$? Yes; for example, the four hypotheses $1 < x < 2$, $1 < x < 4$, $4 < x < 7$ and $1 < x < 7$ will do.

So we know that $VC(H) \geq 2$. Is $VC(H) \geq 3$?

Consider $S = \{x_1, x_2, x_3\}$; without loss of generality assume $x_1 < x_2 < x_3$. Clearly, this set cannot be shattered, because the dichotomy that includes $x_1$ and $x_3$ but not $x_2$ cannot be represented by a single interval. So $VC(H) = 2$.
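Both claims can be checked by brute force: a dichotomy of real points is realizable by one interval exactly when the positive points are contiguous in sorted order (a small sketch):

```python
def shattered_by_intervals(points):
    """True iff every dichotomy of `points` is realizable by a single
    open interval a < x < b."""
    pts = sorted(points)
    n = len(pts)
    for mask in range(2 ** n):
        pos = [i for i in range(n) if (mask >> i) & 1]
        # realizable iff the positive indices form one contiguous run
        if pos and pos != list(range(pos[0], pos[-1] + 1)):
            return False
    return True

print(shattered_by_intervals([3.1, 5.7]))       # any 2 points work
print(shattered_by_intervals([1.0, 2.0, 3.0]))  # {x1, x3} dichotomy fails
```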

Example 2. Each instance in $X$ is described by the conjunction of exactly three boolean literals, and each hypothesis in $H$ is described by the conjunction of up to three boolean literals. What is $VC(H)$?

Represent each instance by a 3-bit string of values of the literals $l_1, l_2, l_3$. Consider three instances: $i_1 = 100$, $i_2 = 010$, $i_3 = 001$. This set can be shattered by $H$, because a hypothesis can be constructed for any desired dichotomy as follows: if the dichotomy is to exclude the instance $i_j$, add the literal $\neg l_j$ to the hypothesis. For example, to include $i_2$ and exclude $i_1$ and $i_3$, use the hypothesis $\neg l_1 \wedge \neg l_3$. This can be extended from three features to $n$. Thus, the VC dimension for conjunctions of $n$ boolean variables is at least $n$.
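The construction above can be verified mechanically (a small sketch; instances encoded as bit-tuples):

```python
from itertools import combinations

instances = {1: (1, 0, 0), 2: (0, 1, 0), 3: (0, 0, 1)}

def h_for(include):
    """Hypothesis: conjunction of NOT l_j for every excluded instance j."""
    negated = [j for j in instances if j not in include]
    return lambda x: all(x[j - 1] == 0 for j in negated)

# every dichotomy (choice of instances to include) is realized
shattered = True
for r in range(4):
    for include in combinations(instances, r):
        h = h_for(set(include))
        if any(h(x) != (j in include) for j, x in instances.items()):
            shattered = False
print(shattered)
```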

Example 3. What is the VC-dimension of axis-parallel rectangles in the plane $X = \mathbb{R}^2$? The target function is specified by a rectangle, and it labels an example positive iff the example lies inside that rectangle.

Sample complexity and the VC dimension

Recall the question: how many randomly drawn training examples $m$ suffice to probably approximately correctly learn any target concept in $C$?

Let's derive the analogue of the earlier bound $m \geq \frac{1}{\epsilon}(\ln|H| + \ln\frac{1}{\delta})$ (recall $VC(H) \leq \log_2 |H|$):

$$m \geq \frac{1}{\epsilon}\left(4 \log_2\frac{2}{\delta} + 8\, VC(H) \log_2\frac{13}{\epsilon}\right)$$

This equation provides an upper bound.
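Plugging numbers into this bound (a sketch; $VC(H) = 10$, $\epsilon = 0.1$, $\delta = 0.05$ are arbitrary choices):

```python
import math

def vc_sample_bound(vc, eps, delta):
    """m >= (1/eps)(4 log2(2/delta) + 8 VC(H) log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc * math.log2(13 / eps)) / eps)

print(vc_sample_bound(10, 0.1, 0.05))
```

The term proportional to $VC(H)$ dominates, so this bound, like the finite-$|H|$ one, scales linearly with the expressiveness measure of the hypothesis space.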


Theorem: Lower bound on sample complexity

Consider any concept class $C$ such that $VC(C) \geq 2$, any learner $L$, and any $0 < \epsilon < 1/8$ and $0 < \delta < 1/100$. Then there exists a distribution $D$ and a target concept in $C$ such that if $L$ observes fewer examples than

$$\max\left[\frac{1}{\epsilon}\log\frac{1}{\delta},\; \frac{VC(C) - 1}{32\epsilon}\right]$$

then with probability at least $\delta$, $L$ outputs a hypothesis $h$ having $error_D(h) > \epsilon$.

