Probably Approximately Correct learning model

Section 1. Background
The PAC learning framework is part of computational learning theory (CLT), a mathematical field that analyzes machine learning algorithms.
Training data is finite and the future is uncertain, so probabilistic bounds on the performance of machine learning algorithms are quite common. Time complexity and the feasibility of learning are also important.
In CLT, a computation is considered feasible if it can be done in polynomial time.
Sample complexity: How many training examples are needed for a learner to converge with high probability to a successful hypothesis?
Computational complexity: How much computational effort is needed for a learner to converge with high probability to a successful hypothesis?
Section 2. The problem setting
Input data: $X$.
Output values: $Y = \{-1, +1\}$.
Training data: $Data = \{\langle x_i, c(x_i) = y_i \rangle\}_{i=1}^{m}$, where $x_i \in X$ and $y_i \in Y$.
Set of target concepts: $C$, with each $c \in C$ a function $c : X \to \{0, 1\}$.
Instances are generated at random from $X$ according to some probability distribution $D$. In general, $D$ may be any distribution and it will be unknown to the learner. $D$ must be stationary, i.e. it does not change over time.
$H$: a set of possible hypotheses.
A learner $L$ outputs some hypothesis $h$ from $H$ as a model of $c$.
What are the capabilities of learning algorithms? We will not concentrate on individual learning algorithms, but rather on broad classes of them.
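To make the setting concrete, here is a minimal Python sketch (illustrative only: the instance length, the concept, the distribution, and the sample size are toy choices, not anything fixed by these notes). Instances are bit vectors, the target concept is a fixed conjunction, and training data is drawn i.i.d. from a distribution $D$.

    import random

    n = 4  # length of each instance (a toy choice)

    def c(x):
        # target concept: a fixed conjunction, here x_1 AND (NOT x_3) -- an arbitrary toy concept
        return 1 if x[0] == 1 and x[2] == 0 else 0

    def draw_instance(rng):
        # the distribution D over X = {0,1}^n; here uniform, but D could be any fixed distribution
        return tuple(rng.randint(0, 1) for _ in range(n))

    rng = random.Random(0)
    m = 20
    data = [(x, c(x)) for x in (draw_instance(rng) for _ in range(m))]   # Data = {<x_i, c(x_i)>}
    print(data[:3])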
Section 3. Error of a hypothesis
How closely does the learner's output hypothesis $h$ approximate the target concept $c$?
Definition: The true error $error_D(h)$ of the hypothesis $h$ with respect to the target concept $c$ and the probability distribution $D$ is the probability that $h$ wrongly classifies an instance drawn at random according to $D$:
$error_D(h) = \Pr_{x \in D}[c(x) \neq h(x)]$
(You are already familiar with this definition.)
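The true error is defined with respect to $D$, not the training sample, so in practice it can only be estimated. A minimal sketch (the concept, hypothesis, and distribution below are assumptions made up for illustration) that estimates $error_D(h)$ by Monte Carlo sampling:

    import random

    def c(x):
        # toy target concept over the real line: positive inside (0, 5)
        return 1 if 0.0 < x < 5.0 else 0

    def h(x):
        # toy hypothesis: positive inside (1, 5) -- deliberately imperfect
        return 1 if 1.0 < x < 5.0 else 0

    def true_error_estimate(c, h, draw, n_samples=100_000, seed=0):
        # Monte Carlo estimate of Pr_{x ~ D}[c(x) != h(x)]
        rng = random.Random(seed)
        mistakes = sum(c(x) != h(x) for x in (draw(rng) for _ in range(n_samples)))
        return mistakes / n_samples

    draw = lambda rng: rng.uniform(0.0, 10.0)   # D: uniform on [0, 10] (an assumption)
    print(true_error_estimate(c, h, draw))      # ~0.1, the mass of the disagreement region (0, 1]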
Section 4. PAC learnability
Aim
To characterize classes of target concepts that can be reliably learned from a reasonable number of randomly drawn training examples and a reasonable amount of computation.
Definition: Consider a concept class $C$ defined over a set of instances $X$ of length $n$ ($n$ is the size of instances, i.e. the size of their representation) and a learner $L$ using hypothesis space $H$. $C$ is PAC-learnable by $L$ using $H$ if for all $c \in C$, all distributions $D$ over $X$, all $\epsilon$ such that $0 < \epsilon < 1/2$, and all $\delta$ such that $0 < \delta < 1/2$, learner $L$ will with probability at least $(1 - \delta)$ (confidence) output a hypothesis $h \in H$ such that $error_D(h) \le \epsilon$, in time that is polynomial in $1/\epsilon$, $1/\delta$, $n$, and $size(c)$ ($size(c)$ is the encoding length of $c$, assuming some representation for $C$).
I.e., two things are required from $L$:
1. $L$ must output, with arbitrarily high probability $(1 - \delta)$, a hypothesis having arbitrarily low error $\epsilon$.
2. It must do so efficiently, in time that grows at most polynomially with $1/\epsilon$, $1/\delta$, and with $n$ and $size(c)$ (which define the inherent complexity of the underlying instance space $X$ and concept class $C$).
Section 5. Sample complexity for FINITE hypothesis spaces
Sample complexity
How many training examples are needed for a learner to converge (with high probability) to a successful hypothesis? We will express it in terms of the size $|H|$ of the hypothesis space and the so-called Vapnik-Chervonenkis dimension.
Can we derive a bound on the number of training examples required by any consistent learner?
Recall the definition of version space:
The version space $VS_{H,Data}$ with respect to $H$ and training data $Data$ is the subset of $H$ consistent with the training examples in $Data$:
$VS_{H,Data} = \{h \in H \mid Consistent(h, Data)\}$
To bound the number of examples needed by any consistent learner, we need only bound the number of examples needed to assure that the version space contains no unacceptable hypotheses. The following definition states this condition precisely:
Definition: Consider a hypothesis space $H$, target concept $c$, instance distribution $D$, and set of training examples $Data$ of $c$. The version space $VS_{H,Data}$ is said to be $\epsilon$-exhausted with respect to $c$ and $D$ if every hypothesis $h$ in $VS_{H,Data}$ has true error less than $\epsilon$ with respect to $c$ and $D$:
$(\forall h \in VS_{H,Data})\; error_D(h) < \epsilon$
We know that every consistent learner outputs a hypothesis from the version space. So to bound the number of training examples the learner needs, we only have to bound the number of training examples needed to be sure that the version space contains no hypotheses whose true error exceeds $\epsilon$. The following theorem provides such a bound:
Theorem ($\epsilon$-exhausting the version space)
If the hypothesis space $H$ is finite, and $Data$ is a sequence of $m \ge 1$ independent randomly drawn examples of some target concept $c$, then for any $0 \le \epsilon \le 1$, the probability that the version space $VS_{H,Data}$ is not $\epsilon$-exhausted (with respect to $c$) is less than or equal to
$|H| e^{-\epsilon m}$
Proof
Let $h_1, h_2, \ldots, h_k$ be all the hypotheses in $H$ that have true error greater than $\epsilon$ with respect to $c$.
We fail to $\epsilon$-exhaust the version space if and only if at least one of these $k$ hypotheses happens to be consistent with all $m$ independent random training examples. The probability that any single hypothesis having true error greater than $\epsilon$ is consistent with one randomly drawn example is at most $(1 - \epsilon)$. Therefore the probability that this hypothesis is consistent with $m$ independently drawn examples is at most $(1 - \epsilon)^m$. Given that we have $k$ hypotheses with error greater than $\epsilon$, the probability that at least one of them is consistent with all $m$ training examples is at most $k(1 - \epsilon)^m$. Since $k \le |H|$, this is at most $|H|(1 - \epsilon)^m$. Finally, we use the general inequality stating that if $0 \le \epsilon \le 1$ then $(1 - \epsilon) \le e^{-\epsilon}$. Thus,
$k(1 - \epsilon)^m \le |H|(1 - \epsilon)^m \le |H| e^{-\epsilon m},$
which proves the theorem.
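The theorem can be checked empirically. A small simulation sketch (the finite hypothesis class of integer thresholds, the uniform distribution, and all parameter values are toy assumptions): it repeatedly draws $m$ examples, builds the version space, and compares the observed frequency of a non-$\epsilon$-exhausted version space with the bound $|H|e^{-\epsilon m}$.

    import math, random

    # Toy finite hypothesis space: thresholds h_t(x) = 1 iff x >= t, for t in {0, ..., 100}.
    # The target concept is h_30, so it lies in H and a consistent learner always exists.
    thresholds = list(range(101))
    target_t = 30

    def label(t, x):
        return 1 if x >= t else 0

    def version_space_not_exhausted(m, eps, rng):
        # Draw m examples from the uniform distribution on {0, ..., 100}, build the version space,
        # and report whether it still contains a hypothesis with true error greater than eps.
        data = [(x, label(target_t, x)) for x in (rng.randint(0, 100) for _ in range(m))]
        vs = [t for t in thresholds if all(label(t, x) == y for x, y in data)]
        # True error of h_t with respect to h_30 under this uniform D is |t - 30| / 101.
        return any(abs(t - target_t) / 101 > eps for t in vs)

    rng = random.Random(0)
    m, eps, trials = 50, 0.1, 2000
    failures = sum(version_space_not_exhausted(m, eps, rng) for _ in range(trials))
    print("observed P(not eps-exhausted):", failures / trials)
    print("bound |H| * exp(-eps * m):   ", len(thresholds) * math.exp(-eps * m))

The observed failure rate comes out well below the theorem's (loose) bound, as expected.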
In other words, this bounds the probability that $m$ training examples will fail to eliminate all "bad" hypotheses for any consistent learner using hypothesis space $H$.
We use this result to determine the number of training examples required to reduce this probability of failure below some desired level $\delta$:
$|H| e^{-\epsilon m} \le \delta$
Rearranging,
$m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$
So, this number of training examples is sufficient to assure that any consistent hypothesis will be probably (with probability $1 - \delta$) approximately (within error $\epsilon$) correct. $m$ grows linearly in $1/\epsilon$ and logarithmically in $1/\delta$ (and in $|H|$).
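As a quick sanity check, the bound is easy to compute. A minimal helper (the function name and the example numbers are illustrative, not from the notes):

    import math

    def consistent_learner_sample_size(eps, delta, h_size):
        # m >= (1/eps) * (ln|H| + ln(1/delta)), rounded up to a whole number of examples
        return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

    # e.g. |H| = 1000 hypotheses, error at most 0.05 with probability at least 0.99
    print(consistent_learner_sample_size(eps=0.05, delta=0.01, h_size=1000))   # 231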
Section 6. Agnostic learning and inconsistent hypotheses
If $H$ does not contain the target concept $c$, then a zero-training-error hypothesis cannot always be found. We then ask $L$ to output the hypothesis with the minimum error over the training examples.
Agnostic learner
$L$ makes no prior commitment about whether or not $C \subseteq H$. The equation $m \ge \frac{1}{\epsilon}(\ln|H| + \ln\frac{1}{\delta})$ is based on the assumption of a zero-training-error hypothesis. Let's generalize it to hypotheses with nonzero training error $error_{Data}(h)$, and let
$h_{best} = \arg\min_{h \in H} error_{Data}(h).$
How many training examples $m$ suffice to ensure (with high probability) that the true error $error_D(h_{best})$ will be no more than $\epsilon + error_{Data}(h_{best})$? (In the previous case, $error_{Data}(h_{best}) = 0$.)
Proof:
The proof is analogous to the setting we considered when estimating the true error based on the sample error: the probability of a coin coming up heads corresponds to the probability that the hypothesis misclassifies a randomly drawn instance. The $m$ independent coin flips correspond to the $m$ drawn instances. The frequency of heads over the $m$ flips corresponds to the frequency of misclassification over the $m$ instances.
The Hoeffding bounds state that if $error_{Data}(h)$ is measured over a set $Data$ containing $m$ randomly drawn examples, then
$\Pr[error_D(h) > error_{Data}(h) + \epsilon] \le e^{-2m\epsilon^2}.$
This gives us a bound on the probability that an arbitrarily chosen single hypothesis has a misleading training error.
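The Hoeffding bound above can be illustrated by simulation: treat each drawn instance as a biased coin flip that comes up heads when the (fixed) hypothesis misclassifies it. A small sketch, with the misclassification probability, sample size, and $\epsilon$ chosen arbitrarily for illustration:

    import math, random

    def hoeffding_check(p_mistake=0.3, m=100, eps=0.1, trials=20000, seed=0):
        # Empirical frequency of the event error_D(h) > error_Data(h) + eps, where
        # error_D(h) = p_mistake and error_Data(h) is the observed mistake rate on m draws.
        rng = random.Random(seed)
        bad = 0
        for _ in range(trials):
            sample_error = sum(rng.random() < p_mistake for _ in range(m)) / m
            if p_mistake > sample_error + eps:
                bad += 1
        return bad / trials, math.exp(-2 * m * eps ** 2)

    observed, bound = hoeffding_check()
    print("observed:", observed, " Hoeffding bound:", bound)   # observed stays below the bound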
To assure that the best hypothesis found by $L$ has an error bounded in this way, we must consider the probability that any of the $|H|$ hypotheses could have a large error:
$\Pr[(\exists h \in H)(error_D(h) > error_{Data}(h) + \epsilon)] \le |H| e^{-2m\epsilon^2}.$
If we call this probability $\delta$ and solve for $m$, we obtain
$m \ge \frac{1}{2\epsilon^2}\left(\ln|H| + \ln\frac{1}{\delta}\right).$
In this less restrictive case, $m$ grows as the square of $1/\epsilon$, rather than linearly with $1/\epsilon$.
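To see what the agnostic setting costs, the two bounds can be compared directly. A small sketch (the helper names and the |H|, $\epsilon$, $\delta$ values are arbitrary examples):

    import math

    def consistent_bound(eps, delta, h_size):
        # realizable case: m >= (1/eps) * (ln|H| + ln(1/delta))
        return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

    def agnostic_bound(eps, delta, h_size):
        # agnostic case: m >= (1/(2*eps^2)) * (ln|H| + ln(1/delta))
        return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * eps ** 2))

    for eps in (0.2, 0.1, 0.05):
        print(eps, consistent_bound(eps, 0.05, 1000), agnostic_bound(eps, 0.05, 1000))
    # Halving eps roughly doubles the first column but quadruples the second.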
Conjunctions of Boolean literals, i.e. AND-formulas, are PAC-learnable
Consider the class $C$ of target concepts described by conjunctions of up to $n$ literals (a literal is either a Boolean variable or its negation), for example $c = l_1 \wedge l_2 \wedge l_4 \wedge \ldots \wedge l_n$ ($l_3$ is missing). Is $C$ PAC-learnable?
To answer yes, we have to show that any consistent learner will require only a polynomial number of training examples to learn any $c$ in $C$, and then suggest a specific algorithm that uses polynomial time per training example.
Consider any consistent learner $L$ using a hypothesis space $H$ identical to $C$. We need only determine the size $|H|$.
Consider $H$ defined by conjunctions of literals based on $n$ Boolean variables. Then $|H| = 3^n$ (for each variable, either include it as a literal in the hypothesis, include its negation as a literal, or ignore it).
Example ($n = 2$)
$h_1 = x_1$, $h_2 = \neg x_1$, $h_3 = x_2$, $h_4 = \neg x_2$, $h_5 = x_1 \wedge x_2$, $h_6 = x_1 \wedge \neg x_2$, $h_7 = \neg x_1 \wedge x_2$, $h_8 = \neg x_1 \wedge \neg x_2$, $h_9 = x_1 \wedge \neg x_1 \wedge x_2 \wedge \neg x_2$.
So
$m \ge \frac{1}{\epsilon}\left(n \ln 3 + \ln\frac{1}{\delta}\right).$
For example, if a consistent learner attempts to learn a target concept described by conjunctions of up to 10 literals, and we desire 95% probability that it will learn a hypothesis with error less than 0.1, then it suffices to present $m$ randomly drawn training examples, where
$m \ge \frac{1}{0.1}\left(10 \ln 3 + \ln\frac{1}{0.05}\right) \approx 140.$
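A short sketch that enumerates the hypothesis space for small $n$ (following the include / negate / ignore choice described above) and re-computes the $m \approx 140$ figure; the helper names are ad hoc:

    import itertools, math

    def conjunctions(n):
        # each variable is either included positively, included negated, or ignored
        for choice in itertools.product(("pos", "neg", "ignore"), repeat=n):
            yield tuple((i, c) for i, c in enumerate(choice) if c != "ignore")

    print(len(list(conjunctions(2))))        # 9 = 3^2 syntactically distinct conjunctions

    def sample_size(eps, delta, n):
        # m >= (1/eps) * (n*ln 3 + ln(1/delta))
        return (n * math.log(3) + math.log(1 / delta)) / eps

    print(sample_size(0.1, 0.05, 10))        # about 139.8, i.e. 140 examples suffice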
Recall the FIND-S algorithm.
What is the FIND-S algorithm doing? For each new positive example, the algorithm computes the intersection of the literals shared by the current hypothesis and the new training example, i.e. for a positive example $a = \langle a_1, a_2, \ldots, a_n \rangle$ it removes literals from $h$ to make it consistent with $a$: if $a_i = 0$, then remove $x_i$ from $h$, otherwise remove $\neg x_i$ from $h$.
The most specific (initial) hypothesis: $x_1 \wedge \neg x_1 \wedge x_2 \wedge \neg x_2 \wedge \ldots \wedge x_n \wedge \neg x_n$.
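FIND-S for Boolean conjunctions can be written in a few lines. A sketch of the procedure just described (literals are stored as signed variable indices; the small data set is a made-up illustration):

    def find_s(examples, n):
        # Start from the most specific hypothesis: every literal x_i and not-x_i.
        # Literals are pairs (index, value): (i, 1) means x_i, (i, 0) means not-x_i.
        h = {(i, v) for i in range(n) for v in (0, 1)}
        for a, label in examples:
            if label == 1:
                # Keep only the literals the positive example satisfies:
                # drop x_i if a_i == 0, drop not-x_i if a_i == 1.
                h &= {(i, a[i]) for i in range(n)}
        return h

    # Toy data over n = 3 variables; target concept x_1 AND not-x_3 (indices 0 and 2).
    data = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0)]
    print(sorted(find_s(data, 3)))   # [(0, 1), (2, 0)]  i.e.  x_1 AND not-x_3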
Theorem (PAC-learnability of Boolean conjunctions)
The class $C$ of conjunctions of Boolean literals is PAC-learnable by the FIND-S algorithm using $H = C$.
Proof
Do it yourself.
3-CNF formulas are PAC-learnable
A 3-CNF formula is a conjunction of clauses, each of which is a disjunction of at most 3 literals. That is, each $h \in H$ can be written $h = C_1 \wedge C_2 \wedge \ldots \wedge C_m$, where $C_i = l_1 \vee l_2 \vee l_3$.
For each of the $(2n)^3$ 3-tuples of literals $(a, b, c)$, one can create a new variable $x_{abc}$ corresponding to the clause $a \vee b \vee c$. Learning a 3-CNF formula over the original $n$ variables then reduces to learning a conjunction over these $(2n)^3$ new variables, which we have already shown to be PAC-learnable.
k-term DNF is not PAC-learnable
A 3-term DNF formula is the disjunction of three terms, each of which is a conjunction of literals. That is, each $h \in H$ can be written $h = T_1 \vee T_2 \vee T_3$, where each $T_i$ is a conjunction. An example of such a hypothesis is
$h = (x_1 \wedge x_2 \wedge \neg x_7) \vee (x_3 \wedge \neg x_7 \wedge x_8) \vee (x_4 \wedge \neg x_5 \wedge x_9).$
Assume $H = C$. Then $|H| \le 3^{nk}$ ($k$ terms, each of which may take on $3^n$ possible values). In fact, $3^{nk}$ is an overestimate of $|H|$, because it double-counts the cases where $T_i = T_j$ and where $T_i$ is more general than $T_j$. We can write
$m \ge \frac{1}{\epsilon}\left(nk \ln 3 + \ln\frac{1}{\delta}\right).$
This indicates that the sample complexity of $k$-term DNF is polynomial in $1/\epsilon$, $1/\delta$, $n$, and $k$. BUT it can be shown that the computational complexity is not polynomial, since this learning problem (with $H = C$) is equivalent to other problems that are believed not to be solvable in polynomial time.
Section 7. Sample complexity for INFINITE hypothesis spaces
We can state bounds on sample complexity that use the Vapnik-Chervonenkis dimension of $H$ rather than $|H|$. Moreover, these bounds allow us to characterize the sample complexity of many infinite hypothesis spaces.
Shattering a set of instances
Definition: A dichotomy of a set $S$ is a partition of $S$ into two disjoint subsets.
Let's assume a sample set $S \subseteq X$. Each hypothesis $h \in H$ imposes some dichotomy on $S$, i.e. it partitions $S$ into the two subsets $\{x \in S \mid h(x) = 1\}$ and $\{x \in S \mid h(x) = 0\}$.
Definition: A set of instances $S$ is shattered by hypothesis space $H$ if and only if for every dichotomy of $S$ there exists some hypothesis in $H$ consistent with this dichotomy.
What if $H$ cannot shatter $X$, but can shatter some large subset $S$ of $X$? Intuitively, it is reasonable to say that the larger the subset of $X$ that can be shattered, the more expressive $H$ is. The Vapnik-Chervonenkis dimension of $H$ is precisely this measure.
The Vapnik-Chervonenkis dimension
Definition: The Vapnik-Chervonenkis dimension, $VC(H)$, of hypothesis space $H$ defined over instance space $X$ is the size of the largest finite subset of $X$ shattered by $H$. If arbitrarily large finite sets of $X$ can be shattered by $H$, then $VC(H) = \infty$.
Note
For any finite $H$, $VC(H) \le \log_2 |H|$. To see this, suppose $VC(H) = d$. Then $H$ will require $2^d$ distinct hypotheses to shatter $d$ instances, so $2^d \le |H|$ and hence $d \le \log_2 |H|$.
Examples
1. Consider $X = \mathbb{R}$ and $H$ the set of real intervals $a < x < b$. What is $VC(H)$?
We must find the largest subset of $X$ that can be shattered by $H$.
Consider $S = \{3.1, 5.7\}$. Can $S$ be shattered by $H$? Yes: for example, the four hypotheses $1 < x < 2$, $1 < x < 4$, $4 < x < 7$, $1 < x < 7$ will do.
So we know that $VC(H) \ge 2$. Is $VC(H) \ge 3$?
Consider $S = \{x_1, x_2, x_3\}$ and, without loss of generality, assume $x_1 < x_2 < x_3$. Clearly, this set cannot be shattered, because the dichotomy that includes $x_1$ and $x_3$ but not $x_2$ cannot be represented by a single interval. So $VC(H) = 2$. (A brute-force check of this example is sketched after these examples.)
2. Each instance in $X$ is described by the conjunction of exactly three Boolean literals, and each hypothesis in $H$ is described by the conjunction of up to three Boolean literals. What is $VC(H)$?
Represent each instance by a 3-bit string of values of the literals $l_1, l_2, l_3$. Consider three instances: $i_1: 100$, $i_2: 010$, $i_3: 001$. This set can be shattered by $H$, because a hypothesis can be constructed for any desired dichotomy as follows: if the dichotomy is to exclude the instance $i_j$, add the literal $\neg l_j$ to the hypothesis. For example, to include $i_2$ and exclude $i_1$ and $i_3$, use the hypothesis $\neg l_1 \wedge \neg l_3$. This argument extends from three features to $n$. Thus, the VC dimension for conjunctions of $n$ Boolean variables is at least $n$.
3. What is the VC dimension of axis-parallel rectangles in the plane $X = \mathbb{R}^2$? The target function is specified by a rectangle and labels an example positive iff it lies inside that rectangle.
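Shattering can be checked by brute force for small sets. A sketch for Example 1 (the candidate hypothesis grid is a simplification: interval endpoints are restricted to a finite set, which is enough for these particular points):

    import itertools

    def interval_h(a, b):
        return lambda x: 1 if a < x < b else 0

    # finite grid of candidate intervals a < x < b (an assumption sufficient for these points)
    grid = [k * 0.5 for k in range(0, 21)]              # 0.0, 0.5, ..., 10.0
    H = [interval_h(a, b) for a in grid for b in grid if a < b]

    def shattered(S, H):
        # S is shattered iff every labelling of S is realized by some hypothesis in H
        for labels in itertools.product((0, 1), repeat=len(S)):
            if not any(all(h(x) == y for x, y in zip(S, labels)) for h in H):
                return False
        return True

    print(shattered([3.1, 5.7], H))          # True  -> VC >= 2
    print(shattered([3.1, 4.2, 5.7], H))     # False -> the outer points cannot be included
                                             #          while the middle one is excluded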
Sample complexity and the VC dimension
Recall the question: How many randomly drawn training examples suffice to probably approximately correctly learn any target concept in $C$?
Let's state the analogue of the earlier bound $m \ge \frac{1}{\epsilon}(\ln|H| + \ln\frac{1}{\delta})$, using $VC(H)$ in place of $\log_2|H|$ (recall $VC(H) \le \log_2|H|$):
$m \ge \frac{1}{\epsilon}\left(4 \log_2\frac{2}{\delta} + 8\, VC(H) \log_2\frac{13}{\epsilon}\right).$
This equation provides an upper bound on the number of training examples that suffice.
Theorem: Lower bound on sample complexity
Consider any concept class $C$ such that $VC(C) \ge 2$, any learner $L$, and any $0 < \epsilon < \frac{1}{8}$ and $0 < \delta < \frac{1}{100}$. Then there exists a distribution $D$ and a target concept in $C$ such that if $L$ observes fewer examples than
$\max\left[\frac{1}{\epsilon}\log\frac{1}{\delta},\ \frac{VC(C) - 1}{32\epsilon}\right],$
then with probability at least $\delta$, $L$ outputs a hypothesis $h$ having $error_D(h) > \epsilon$.
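Put side by side, the upper bound from the previous section and this lower bound bracket the sample complexity. A small numeric sketch (the VC dimension and the $\epsilon$, $\delta$ values are arbitrary example inputs; the lower bound's log term is taken as base 2 here, an assumption that only affects constants):

    import math

    def vc_upper_bound(eps, delta, vc):
        # sufficient: m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))
        return (4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / eps)) / eps

    def vc_lower_bound(eps, delta, vc):
        # necessary: m >= max[(1/eps)*log(1/delta), (VC(C)-1)/(32*eps)]
        return max(math.log2(1 / delta) / eps, (vc - 1) / (32 * eps))

    eps, delta, vc = 0.1, 0.01, 10
    print("lower bound:", vc_lower_bound(eps, delta, vc))   # ~66
    print("upper bound:", vc_upper_bound(eps, delta, vc))   # roughly 5.9e3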