A Sucient Condition for Polynomial
Distribution-Dependent Learnability
Martin Anthony
Department of Mathematics
London School of Economics
Houghton Street
London WC2A 2AE,UK
m.anthony@lse.ac.uk
John Shawe-Taylor
Department of Computer Science
Royal Holloway,University of London
Egham Hill
Egham
Surrey TW20 0EX,UK
john@dcs.rhbnc.ac.uk.
Abstract
We investigate upper bounds on the sample size sufficient for `solid' learnability with respect to a probability distribution. We obtain a sufficient condition for feasible (polynomially bounded) sample-size bounds for distribution-specific (solid) learnability.
1 Introduction
There have been extensive studies of probabilistic models of machine learning; see the books [3, 11, 12], for example. In the standard `PAC' model of learning, the definition of successful learning is `distribution-free'. A number of researchers have examined learning where the probability distribution generating the examples is known; see [6, 5], for example. In this paper we seek conditions under which such distribution-specific learning can be achieved with a feasible (polynomial) number of training examples.
2 The PAC learning framework
In this section, we describe a probabilistic model of learning, introduced by Valiant [15] and developed by many researchers (see for example [8]). It has come to be known as the probably approximately correct learning model [1].
Throughout, we have an example space $X$, which is either countable or is the Euclidean space $\mathbb{R}^n$ for some $n$. We have a probability space $(X, \Sigma, \mu)$ defined on $X$, where we assume that when $X$ is countable, $\Sigma$ is the set of all subsets of $X$ and that when $X$ is $\mathbb{R}^n$, $\Sigma$ is the Borel $\sigma$-algebra. A hypothesis is a $\Sigma$-measurable $\{0,1\}$-valued function on $X$. The hypothesis space $H$ is a set of hypotheses, and the target, $c$, is one particular concept from $H$. A labelled example of $c$ is an ordered pair $(x, c(x))$. If $c(x) = 1$, we say $x$ is a positive example of $c$, while if $c(x) = 0$, we say $x$ is a negative example of $c$. A sample $y$ of $c$ of length (or size) $m$ is a sequence of $m$ labelled examples of $c$. When the target concept is clear, we will denote the sample simply by the vector $x \in X^m$, so that if $x = (x_1, \ldots, x_m)$ then the corresponding sample of $c$ is $((x_1, a_1), \ldots, (x_m, a_m))$, where $a_i = c(x_i)$. The learning problem is to find a good approximation to $c$ from $H$, this approximation being based solely on a sample of $c$, each example in the sample being chosen independently and at random, according to the distribution $\mu$.
Fix a particular target $c \in H$. For any hypothesis $h$ of $H$, the error of $h$ (with respect to $c$) is $\mathrm{er}_\mu(h) = \mu(h \triangle c)$, where $h \triangle c$ is the set $\{x : h(x) \neq c(x)\}$, the symmetric difference of $h$ and $c$. We say that a hypothesis $h$ is $\epsilon$-close to $c$ if $\mathrm{er}_\mu(h) \le \epsilon$. For any set $F$ of measurable subsets of $X$, we define the haziness of $F$ (with respect to $c$) as
$$\mathrm{haz}_\mu(F) = \sup\{\mathrm{er}_\mu(h) : h \in F\}.$$
The set $H[x; c]$ of hypotheses consistent with $c$ on $x$ is
$$H[x; c] = \{h \in H : h(x_i) = c(x_i) \ (1 \le i \le m)\},$$
which we shall usually denote by $H[x]$ when $c$ is understood. Now we can define what is meant by solid learnability. (This terminology comes from [5].)
Denition 2.1 The hypothesis space H is solidly learnable if,for any , 2 (0;1),
there is m
0
= m
0
(;) such that given any c 2 H,for all probability measures  on
X,
m> m
0
=)
m
fx 2 X
m
:haz

(H[x]) < g > 1 :
Here,
m
is the product measure on X.
In words, $H$ is solidly learnable if for a given accuracy parameter $\epsilon$ and a given certainty parameter $\delta$, there is a sample size, independent of the distribution and the target concept, such that any hypothesis consistent with that many random examples will "probably" be "approximately" correct. (In this case, a learning algorithm which returns a consistent hypothesis will perform well.) From now on, `learnability' shall mean `solid learnability'.
We assume throughout that the spaces satisfy certain measurability requirements, namely that they are universally separable, so that the probabilities in the definitions and proofs are indeed defined. See [13, 8] for details.
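To make the definitions above concrete, here is a small illustrative Python sketch (ours, not part of the paper): it takes a toy finite example space, a uniform distribution and an arbitrary target, enumerates the hypotheses consistent with the target on a random sample, and evaluates $\mathrm{haz}_\mu(H[x])$. The particular space, hypothesis class and target are arbitrary choices made only for illustration.

    import itertools
    import random

    # Toy instance of the definitions: X finite, mu uniform, H = all subsets of size <= 2.
    X = list(range(8))
    mu = {x: 1.0 / len(X) for x in X}
    H = [frozenset(s) for r in range(3) for s in itertools.combinations(X, r)]
    target = frozenset({1, 5})          # the target concept c (arbitrary)

    def error(h, c):
        # er_mu(h) = mu(h symmetric-difference c)
        return sum(mu[x] for x in h ^ c)

    def haziness(sample, c):
        # haz_mu(H[x]) = sup of er_mu(h) over hypotheses consistent with c on the sample
        consistent = [h for h in H if all((x in h) == (x in c) for x in sample)]
        return max(error(h, c) for h in consistent)

    random.seed(0)
    for m in (2, 5, 10, 20):
        sample = [random.choice(X) for _ in range(m)]    # m points drawn i.i.d. from mu
        print(m, round(haziness(sample, target), 3))

As the sample length grows, the haziness of the consistent hypotheses typically decreases; this is the behaviour that the learnability definitions quantify.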
3 Distribution-independent sample sizes
The Vapnik-Chervonenkis dimension (or VC dimension) [16] has been widely used in order to obtain some measure of the degree of expressibility of a hypothesis space, and hence to obtain learnability results [9, 8, 4]. Given a hypothesis space $H$, define, for each $x = (x_1, \ldots, x_m) \in X^m$, a function $x^* : H \to \{0,1\}^m$ by
$$x^*(h) = (h(x_1), \ldots, h(x_m)).$$
The growth function $\Pi_H$ from the set of integers to itself is defined by
$$\Pi_H(m) = \max\{|\{x^*(h) : h \in H\}| : x \in X^m\} \le 2^m.$$
If $|\{x^*(h) : h \in H\}| = 2^m$ then we say that $x$ is shattered by $H$. If $\Pi_H(m) = 2^m$ for all $m$ then the Vapnik-Chervonenkis dimension of $H$ is infinite. Otherwise, the Vapnik-Chervonenkis dimension is the largest positive integer $m$ for which $\Pi_H(m) = 2^m$; that is, the largest integer $m$ such that some sample $x$ of length $m$ is shattered. We remark that any finite hypothesis space certainly has finite VC dimension.
It can be shown that if $\mathrm{VCdim}(H) = d$ and $m \ge d \ge 1$ then $\Pi_H(m) \le (em/d)^d$ [14].
This is useful in obtaining bounds on the sufficient sample size $m_0(\epsilon, \delta)$. Following [10], it can be proved [8] that if the hypothesis space $H$ has finite VC dimension $d$, then $H$ is learnable. Further, if $H$ is learnable then $H$ must have finite VC dimension [8]. Specifically, the sufficiency result of Blumer et al. follows from the following, which is a refinement of a result from [16].
Theorem 3.1 (Blumer et al. [8]) For any distribution $\mu$,
$$\mu^m\{x \in X^m : \mathrm{haz}_\mu(H[x]) > \epsilon\} < 2\,\Pi_H(2m)\,2^{-\epsilon m/2}.$$
This bound has been tightened [4], resulting in the following bound on sufficient sample-size.
Theorem 3.2 ([4]) The hypothesis space $H$ is learnable if $H$ has finite VC dimension. If $d = \mathrm{VCdim}(H) > 1$ is finite then a suitable $m_0$ is
$$m_0 = m_0(\epsilon, \delta) = \left\lceil \frac{1}{\epsilon(1 - \sqrt{\epsilon})} \left( \ln\left(\frac{d}{\delta(d-1)}\right) + 2d \ln\left(\frac{6}{\epsilon}\right) \right) \right\rceil,$$
where $\ln$ denotes the natural logarithm.
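As a quick numerical illustration (ours, not part of the original text), the following Python sketch evaluates the sample-size bound of Theorem 3.2 as reconstructed above for a few values of $d$, $\epsilon$ and $\delta$; the function name and the chosen parameter values are ours.

    from math import ceil, log, sqrt

    def m0(d, eps, delta):
        # ceil( (1/(eps(1 - sqrt(eps)))) * ( ln(d/(delta(d-1))) + 2 d ln(6/eps) ) )
        return ceil((log(d / (delta * (d - 1))) + 2 * d * log(6 / eps))
                    / (eps * (1 - sqrt(eps))))

    for d in (2, 10):
        for eps in (0.1, 0.01):
            print("d =", d, " eps =", eps, " m0 =", m0(d, eps, delta=0.05))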
4 Distribution-dependent learning
Recall the definition of learnability of a hypothesis space $H$. $H$ is learnable if for any accuracy parameter $\epsilon$, any confidence parameter $\delta$, any target concept $c \in H$ and any probability measure $\mu$ on $X$, there is a sample-size $m_0$, which is a function of $\epsilon$ and $\delta$ alone, such that the following holds: with probability at least $1 - \delta$, if some hypothesis $h$ is consistent with $c$ on at least $m_0$ inputs chosen randomly according to the distribution $\mu$, then $h$ has actual error less than $\epsilon$. As emphasised earlier, the value of $m_0$ must depend on neither the target concept $c$ nor the distribution (probability measure) $\mu$. In many realistic learning problems, the distribution on the input space is fixed but unknown. This is the primary reason for proving learnability results and finding sufficient sample-sizes which are independent of the distribution; results that are independent of the distribution certainly hold for any particular distribution. If something is known of the distribution or if the distribution is of a special type, it may be possible to say more, obtaining positive results even when the hypothesis space has infinite VC dimension.
In order to introduce distribution-dependent learnability, we may define learnability of a particular concept $c$ from a hypothesis space $H$, with respect to a particular probability measure $\mu$ on the input space $X$. We say that $c$ is $\mu$-learnable in $H$ if given any $\epsilon, \delta \in (0,1)$, there is an integer $m_0 = m_0(\epsilon, \delta, c, \mu)$ such that for all $m \ge m_0$, $\mu^m\{x \in X^m : \mathrm{haz}_\mu(H[x; c]) > \epsilon\} < \delta$. In addition, we say that $H$ itself is $\mu$-learnable if every $c \in H$ is $\mu$-learnable and if there is a sufficient sample-size $m_0$ which is independent of the hypothesis $c$. If $H$ is $\mu$-learnable for every distribution $\mu$ on $X$, then we say that $H$ is distribution-dependent learnable, abbreviated as dd-learnable.
If one examines closely the proof in [8] of Theorem 3.1 then it is clear that the term $\Pi_H(2m)$ in the bound can be replaced by the expectation over $X^{2m}$ of the function $\Pi_H$, where $\Pi_H(x) = |\{x^*(h) : h \in H\}|$. (This will be a random variable if we assume that $H$ is universally separable; see [2].) Thus, for distribution-dependent analysis, we can use $\mathbf{E}_{2m}(\Pi_H(x))$ in place of $\Pi_H(2m)$, where $\mathbf{E}_{2m}(\cdot)$ denotes expected value with respect to $\mu^{2m}$ and over $X^{2m}$. This yields
$$\mu^m\{x \in X^m : \mathrm{haz}_\mu(H[x]) > \epsilon\} < 2\,\mathbf{E}_{2m}(\Pi_H(x))\,2^{-\epsilon m/2},$$
for $m \ge 8/\epsilon$.
A function $f$ is said to be subexponential if, for all $\alpha > 0$, $f(x)\exp(-\alpha x)$ tends to zero as $x$ tends to infinity. With this definition, we have the following.
Theorem 4.1 Let $\mu$ be any probability measure on $X$. If $\mathbf{E}_n(\Pi_H(x))$, the expected value of $\Pi_H(x)$ over $X^n$ (with respect to $\mu^n$), is a subexponential function of $n$, then $H$ is $\mu$-learnable.
Proof: For $m \ge 8/\epsilon$, $\mu^m\{x \in X^m : \mathrm{haz}_\mu(H[x]) > \epsilon\} < 2\,\mathbf{E}_{2m}(\Pi_H(x))\,2^{-\epsilon m/2}$. If
$$\mathbf{E}_{2m}(\Pi_H(x))\,2^{-\epsilon m/2} \to 0 \text{ as } m \to \infty$$
for all $\epsilon > 0$, which is the case if $\mathbf{E}_n(\Pi_H(x))$ is a subexponential function of $n$, then the quantity on the right-hand side can be made less than any $\delta > 0$ by choosing $m \ge m_0$, where $m_0$ depends only on $\epsilon$ and $\delta$, and not on the hypothesis $c$. The result follows. $\Box$
It is fairly easy to see that demanding that $\mathbf{E}_n(\Pi_H(x))$ be subexponential is equivalent to demanding that $n^{-1}\log \mathbf{E}_n(\Pi_H(x)) \to 0$ as $n \to \infty$. In fact, results of Vapnik and Chervonenkis [16] show that the weaker condition $n^{-1}\mathbf{E}_n(\log \Pi_H(x)) \to 0$ as $n \to \infty$ is sufficient.
We give two examples of this theorem: one discrete and the other continuous.
Example 1: Let $\{B_i\}_{i \ge 1}$ be any sequence of disjoint sets such that $|B_i| = i$ ($i \ge 1$) and take as example space the countably infinite set $X = \bigcup_{i=1}^\infty B_i$. Let the probability measure $\mu$ be defined on the $\sigma$-algebra of all subsets of $X$ by
$$\mu(\{x\}) = \frac{1}{i}\,\frac{1}{2^i} \quad (x \in B_i).$$
Let the hypothesis space $H$ be the set of functions $H = \bigcup_{i=1}^\infty \{I_C : C \subseteq B_i\}$, where $I_C : X \to \{0,1\}$ is the characteristic function of the subset $C$. Then it is easy to see that $H$ has infinite VC dimension and thus is not learnable. However, we can use Theorem 4.1 to prove that $H$ is $\mu$-learnable. For $x \in X^n$, let $I(x)$ be the set of entries of $x$; that is, $I(x) = \{x_i : 1 \le i \le n\}$. Then it is not difficult to see that
$$\Pi_H(x) = \sum 2^{|I(x) \cap B_i|},$$
where the sum is over all $i$ such that $I(x) \cap B_i \neq \emptyset$. Therefore,
$$I(x) \subseteq S_k = \bigcup_{i=1}^k B_i \Longrightarrow \Pi_H(x) \le 2 + 2^2 + \cdots + 2^k < 2^{k+1}.$$
Further, $\Pi_H(x) \le 2^n$ for all $x \in X^n$.
Let $\sigma_k$ be the probability that $I(x) \subseteq S_k$; that is, $\sigma_k = \mu^n(S_k^n)$. Then
$$\sigma_k = (\mu(S_k))^n = \left(1 - \frac{1}{2^k}\right)^n.$$
For any $0 < x < 1$, $(1-x)^n \ge 1 - nx$ and so, for $k \ge 2$,
$$\sigma_k - \sigma_{k-1} \le 1 - \left(1 - \frac{1}{2^{k-1}}\right)^n \le \frac{n}{2^{k-1}}.$$
Since the sets $S_k^n$ cover $X^n$, we therefore have
$$\mathbf{E}_n(\Pi_H(x)) < 2\sigma_1 + \sum_{k=2}^{n-1} (\sigma_k - \sigma_{k-1})\,2^{k+1} + 2^n\left(1 - \mu^n(S_{n-1}^n)\right)$$
$$\le 1 + \sum_{k=2}^{n-1} \frac{n}{2^{k-1}}\,2^{k+1} + 2^n\,\frac{n}{2^{n-1}} = 1 + 4n(n-2) + 2n < 4n^2.$$
It follows that the expected value of $\Pi_H(x)$ is polynomial and therefore $H$ is $\mu$-learnable.
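The following Monte Carlo sketch (ours, for illustration only) estimates $\mathbf{E}_n(\Pi_H(x))$ for Example 1 directly from the formula $\Pi_H(x) = \sum 2^{|I(x) \cap B_i|}$ and compares the estimate with the $4n^2$ bound just derived; the representation of points as pairs $(i, j)$ with $x \in B_i$ is an implementation choice of ours.

    import random

    def sample_point(rng):
        # Block i is chosen with probability 2^{-i}; within B_i the i points are equally
        # likely, so each x in B_i has probability 1/(i 2^i), as in Example 1.
        i = 1
        while rng.random() >= 0.5:
            i += 1
        return (i, rng.randrange(i) + 1)

    def pi_H(points):
        # Pi_H(x) = sum over blocks B_i met by the sample of 2^{|I(x) intersect B_i|}
        counts = {}
        for (i, _j) in set(points):
            counts[i] = counts.get(i, 0) + 1
        return sum(2 ** c for c in counts.values())

    rng = random.Random(1)
    trials = 2000
    for n in (5, 10, 20, 40):
        est = sum(pi_H([sample_point(rng) for _ in range(n)]) for _ in range(trials)) / trials
        print("n =", n, " estimate =", round(est, 1), " bound 4n^2 =", 4 * n * n)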
Example 2: Let $X$ be the set of non-negative reals and let the distribution have probability density function $p(x) = e^{-x}$, so that $\mu([0, y]) = 1 - e^{-y}$. Let the hypothesis space $H$ consist of all (characteristic functions of) finite unions of closed intervals, at most $k$ of which intersect the interval $[0, k^2]$ for each positive integer $k$. Thus, for example, $[1,2] \cup [3,5]$ is in $H$, but $[0,1] \cup [2,3] \cup [3,5] \cup [7,9] \cup [17,18]$ is not, since four of the intervals in this union intersect the interval $[0, 3^2]$. Let us denote the interval $[0, k^2]$ by $S_k$. Then $\mu(S_k) = 1 - e^{-k^2}$ and (see [8]) $H|S_k$ has VC dimension $2k$. If $x \in S_k^n$ then $\Pi_H(x) \le n^{2k+1}$, by a crude form of Sauer's result. In any case, $\Pi_H(x) \le 2^n$ and it follows that
$$\mathbf{E}_n(\Pi_H(x)) \le \sum_{k=1}^n n^{2k+1}\left(\mu^n(S_k^n) - \mu^n(S_{k-1}^n)\right) + 2^n\left(1 - \mu^n(S_n^n)\right)$$
$$< \sum_{k=1}^n n^{2k+1}\left(1 - \left(1 - e^{-(k-1)^2}\right)^n\right) + 2^n\left(1 - \left(1 - e^{-n^2}\right)^n\right)$$
$$< \sum_{k=1}^n n^{2k+2} e^{-(k-1)^2} + 2^n n e^{-n^2}.$$
The second quantity tends to 0. Further, $n^{2x+2} e^{-(x-1)^2} \le n^4 \exp((\ln n)^2)$, as can easily be checked by calculus, so that
$$\sum_{k=1}^n n^{2k+2} e^{-(k-1)^2} \le n^5 \exp\left((\ln n)^2\right),$$
which is subexponential. It follows that $H$ is $\mu$-learnable.
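The two estimates used at the end of Example 2 can be checked numerically; the short sketch below (ours) works with logarithms to avoid overflow and verifies, for a few values of $n$, that the largest term of the sum is at most $n^4\exp((\ln n)^2)$ and that $\log(n^5\exp((\ln n)^2))$ is eventually far smaller than $\alpha n$ (here $\alpha = 0.1$ is an arbitrary choice).

    from math import log

    for n in (10, 100, 1000, 10000):
        # log of the largest term n^(2k+2) e^{-(k-1)^2}, maximised over integer k
        log_term_max = max((2 * k + 2) * log(n) - (k - 1) ** 2 for k in range(1, n + 1))
        log_claim = 4 * log(n) + log(n) ** 2          # log of n^4 exp((ln n)^2)
        log_sum_bound = 5 * log(n) + log(n) ** 2      # log of n^5 exp((ln n)^2)
        print(n, log_term_max <= log_claim + 1e-9,
              round(log_sum_bound, 1), "vs alpha*n =", round(0.1 * n, 1))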
5 Polynomial learnability
Suppose that $H$ is $\mu$-learnable. For learning to be efficient in any sense, we certainly need a sample-size bound which, as well as being independent of $c$, does not increase too dramatically as $\epsilon$ and $\delta$ decrease (and the learning task becomes, consequently, more difficult). It is appropriate to demand that, for efficiency, the sample-size (and hence the running time of any efficient learning algorithm) be polynomial in $1/\epsilon$. Furthermore, since doubling the size of a sample should roughly square the probability that a bad hypothesis is consistent with the sample, we require the sample-size to vary polynomially in $\ln(1/\delta)$. We therefore make the following definition:
Denition 5.1 Hypothesis space H is polynomially -learnable if for any ; in
(0;1),there is m
0
= m
0
(;),polynomial in 1= and ln(1=),such that,given any
c 2 H,
m m
0
=)
m
fx 2 X
m
:haz

(H[x]) < g > 1 :
We have observed that if the expectation of $\Pi_H(x)$ is subexponential then $H$ is $\mu$-learnable. We have the following result.
Theorem 5.2 Suppose $H$ is a hypothesis space on $X$ and $\mu$ is a distribution on $X$. If there is $0 < \beta < 1$ such that (for large $n$) $\log \mathbf{E}_n(\Pi_H(x)) < n^{1-\beta}$, then $H$ is polynomially $\mu$-learnable.
Proof: Let $n = 2^{1/\beta}(4/\epsilon)^{1/\beta}\log(2/\delta)$, where log denotes the binary logarithm, and suppose that $\epsilon < 1/4$. Then $n \ge (4/\epsilon)\log(2/\delta)$ and so $\epsilon n/4 \ge \log(2/\delta)$. But, also, $n \ge 2^{1/\beta}(4/\epsilon)^{1/\beta}$ and hence $\epsilon n/4 \ge (2n)^{1-\beta}$. It follows that
$$\frac{\epsilon n}{2} \ge \log\left(\frac{2}{\delta}\right) + (2n)^{1-\beta} > \log\left(\frac{2}{\delta}\right) + \log \mathbf{E}_{2n}(\Pi_H(x)),$$
and so
$$2\,\mathbf{E}_{2n}(\Pi_H(x))\,2^{-\epsilon n/2} < \delta.$$
The value of $n$ is polynomial in $1/\epsilon$ and $\ln(1/\delta)$, so $H$ is polynomially $\mu$-learnable. $\Box$
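For concreteness, the sketch below (ours) evaluates the sample size used in the proof, $n = 2^{1/\beta}(4/\epsilon)^{1/\beta}\log_2(2/\delta)$, for a few parameter values; the growth is polynomial in $1/\epsilon$ and in $\log(1/\delta)$, but the degree of the polynomial increases as $\beta$ decreases.

    from math import ceil, log2

    def n_sufficient(eps, delta, beta):
        # n = 2^(1/beta) * (4/eps)^(1/beta) * log2(2/delta), as in the proof of Theorem 5.2
        return ceil(2 ** (1 / beta) * (4 / eps) ** (1 / beta) * log2(2 / delta))

    for beta in (0.9, 0.5):
        for eps in (0.1, 0.01):
            print("beta =", beta, " eps =", eps,
                  " n =", n_sufficient(eps, delta=0.05, beta=beta))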
The above result is essentially the best that can be obtained by using the bound
$$\mu^n\{x \in X^n : \mathrm{haz}_\mu(H[x]) > \epsilon\} < 2\,\mathbf{E}_{2n}(\Pi_H(x))\,2^{-\epsilon n/2},$$
since if the condition of the theorem is not satisfied (for example, if the expectation is of order $2^{n/\log n}$), the resulting sample-size bound will be exponential.
Bertoni et al. [7] studied the question of polynomial sample complexity for distribution-dependent learning. For $x = (x_1, \ldots, x_m) \in X^m$, let $C_m(x)$ be the size of the largest subset of $\{x_1, \ldots, x_m\}$ shattered by $H$. Then, following on from the work of Vapnik and Chervonenkis, Bertoni et al. showed that if there is a positive constant $\alpha$ such that
$$\mathbf{E}_m\left(\frac{C_m(x)}{m}\right) = O(m^{-\alpha}),$$
then $H$ is polynomially $\mu$-learnable.
We now take a different approach, extending work of Ben-David et al. [5] to determine a sufficient condition for $H$ to be polynomially $\mu$-learnable. In [5], the following definition was made.
Definition 5.3 A hypothesis space $H$ over an input space $X$ is said to have $X$-finite dimension if $X = \bigcup_{i=1}^\infty B_i$ where the restriction $H|B_i$ of $H$ to domain $B_i$ has finite VC dimension, for each $i$.
Ben-David et al. [5] proved that if a hypothesis space $H$ has $X$-finite dimension then $H$ is dd-learnable. The spaces in the examples of the previous section are easily seen to have $X$-finite dimension and hence are dd-learnable; that is, they are $\mu$-learnable for all probability distributions $\mu$ (and not just for the particular distributions discussed). (Indeed, if $X$ is countable then any hypothesis space on $X$ has $X$-finite dimension, and the first example is a special case of this.) It follows also that the notion of dd-learnability is not a vacuous one, since these same hypothesis spaces are dd-learnable but, being of infinite VC dimension, are not learnable.
It is straightforward to give an example of a hypothesis space $H$ over a (necessarily) uncountable input space $X$ such that $H$ does not have $X$-finite dimension. Take $X$ to be the closed interval $X = [0,1]$, and let $H$ be the space of all (characteristic functions of) finite unions of closed subintervals of $X$. Now, for any $Y \subseteq X$, $\mathrm{VCdim}(H|Y) \ge k$ if and only if $|Y| \ge k$. It follows that if $X$ were the countable union $X = \bigcup_{i=1}^\infty B_i$ of sets $B_i$ such that $H$ had finite VC dimension on $B_i$ then, in particular, each $B_i$ would be finite and $X$, as the countable union of finite sets, would be countable. However, $X$ is uncountable and we therefore deduce that $H$ does not have $X$-finite dimension.
The result of Ben-David et al. provides a positive distribution-dependent learnability result. However, it does not address the size of sample required for learnability to given degrees of accuracy and confidence. A closer analysis of the proof of this result in [5] shows that the resulting sufficient sample-size will not be polynomial in $1/\epsilon$ and $\log(1/\delta)$ for many distributions. To introduce the approach taken here, we first have the following result, in which to say that a sequence $\{S_k\}_{k=1}^\infty$ of subsets of $X$ is increasing means that $S_1 \subseteq S_2 \subseteq S_3 \subseteq \cdots$.
Proposition 5.4 $H$ has $X$-finite dimension if and only if there exists an increasing sequence $\{S_k\}_{k=1}^\infty$ of subsets of $X$ such that $\bigcup_{k=1}^\infty S_k = X$ and $\mathrm{VCdim}(H|S_k) \le k$.
Proof: Suppose that $H$ has $X$-finite dimension, and let the sets $B_i$ be as in the definition. Let $x_0 \in B_1$ and set $B_0 = \{x_0\}$. For $k \ge 1$ let $S_k = \bigcup_{i=0}^{m(k)} B_i$, where $m(k)$ is the maximum integer $m$ such that the restriction of $H$ to $\bigcup_{i=0}^m B_i$ has VC dimension at most $k$. Given any $x \in X$, there is an $m$ such that $x \in \bigcup_{i=0}^m B_i$. Suppose that $H$ restricted to $\bigcup_{i=0}^m B_i$ has VC dimension $k$. Then $m(k) \ge m$, so $x \in S_k$. Conversely, if such sets $S_i$ exist, take $B_i = S_i$. Then $\mathrm{VCdim}(H|B_i)$ is finite, and $\bigcup_{i=1}^\infty B_i = X$. $\Box$
If $H$ "nearly" has finite VC dimension, in some sense, we might hope to get polynomially bounded sample-sizes. Motivated by the above result, we make the following definition.
Definition 5.5 Hypothesis space $H$ has polynomial $X$-finite dimension with respect to $\mu$ if $X = \bigcup_{k=1}^\infty S_k$ where $\{S_k\}_{k=1}^\infty$ is increasing, $\mathrm{VCdim}(H|S_k) \le k$, and
$$1 - \mu(S_k) = O\left(\frac{1}{k^c}\right)$$
for some constant $c > 0$.
Benedek and Itai [6] have gone some way towards investigating sufficient sample-sizes for distribution-dependent learnability in the case of discrete distributions (that is, distributions nonzero on only countably many elements of the example space). With the definition of polynomial $X$-finite dimension, we can develop a theory for both continuous and discrete distributions. We have the following result, which we prove by a method similar to that used in [5].
Theorem 5.6 Let $H$ be a hypothesis space over $X$, and $\mu$ a probability measure defined on $X$. If $H$ has polynomial $X$-finite dimension with respect to $\mu$, then $H$ is dd-learnable and polynomially $\mu$-learnable.
Proof: Suppose that $H$ has polynomial $X$-finite dimension with respect to $\mu$. Suppose that $0 < \epsilon < 1/4$ and $S \subseteq X$ is such that $\mu(S) \ge 1 - \epsilon/2$. The probability (with respect to $\mu^m$) that a sample of length $m = 2l$, chosen according to $\mu$, has at least half of its members in $S$ is at least $1 - \sum_{k=0}^{l} \binom{2l}{k} \left(\frac{\epsilon}{2}\right)^{2l-k} \left(1 - \frac{\epsilon}{2}\right)^{k}$. Now,
$$\sum_{k=0}^{l} \binom{2l}{k} \left(\frac{\epsilon}{2}\right)^{2l-k} \left(1 - \frac{\epsilon}{2}\right)^{k} \le \sum_{k=0}^{l} \binom{2l}{k} \left(\frac{\epsilon}{2}\right)^{2l-k} \le \epsilon^l 2^{-l} \sum_{k=0}^{l} \binom{2l}{k} = \epsilon^l 2^{l-1}.$$
Therefore, this probability is at least $1 - \epsilon^l 2^{l-1}$. If $l \ge l_0 = \log(1/\delta)$ (where log denotes logarithm to base 2) then
$$l(\log\epsilon + 1) \le \log\left(\frac{1}{\delta}\right)(\log\epsilon + 1) = \log\delta\left(\log\left(\frac{1}{\epsilon}\right) - 1\right) < \log\delta$$
and this implies that the above probability is greater than $1 - \delta/2$. (Note that we have used the fact that, since $\epsilon < 1/4$, $\log\epsilon + 1$ is negative.)
Let $k(\epsilon) = \min\{k : \mu(S_k) \ge 1 - \epsilon/2\}$. The above shows that, with probability at least $1 - \delta/2$, a random sample of length $m \ge 2l_0$ has at least half of its members in $S = S_{k(\epsilon)}$. Let
$$m_\epsilon = 2\left\lceil \frac{2\sqrt{2}}{(\sqrt{2} - \sqrt{\epsilon})\,\epsilon} \left( \ln\left(\frac{2d}{\delta(d-1)}\right) + 2k(\epsilon)\ln\left(\frac{12}{\epsilon}\right) \right) \right\rceil.$$
Suppose $c \in H$ is the target concept. Since $H|S$ has VC dimension at most $k(\epsilon)$, $m_\epsilon$ is, by Theorem 3.2 (with $d$ the VC dimension of $H|S$), twice a sufficient sample size for the learnability of $H|S$ with accuracy $\epsilon/2$ and confidence $1 - \delta/2$. Let $m \ge m_\epsilon$, and let $l = \lfloor m/2 \rfloor \ge l_0$. If $x \in X^m$ is such that $x$ has at least $l$ of its entries from $S = S_{k(\epsilon)}$, then we shall denote by $x_S$ the unique vector of length $l$ whose entries are precisely the first $l$ entries of $x$ from $S$, appearing in the same order as in $x$. Let $\mu_1$ be the probability measure induced on $S$ by $\mu$. Thus, for any measurable subset $A$ of $X$,
$$\mu_1(A \cap S) = \frac{\mu(A)}{\mu(S)}.$$
Observe that if $h \in H[x]$ and $\mathrm{er}_\mu(h) > \epsilon$ then, since $\mu(S) \ge 1 - \epsilon/2$, the function $h|S$ ($h$ restricted to $S$) is such that $h|S \in (H|S)[x_S]$ and
$$\mathrm{er}_{\mu_1}(h|S) = \frac{1}{\mu(S)}\,\mu(\{x \in X : h(x) \neq c(x)\} \cap S) > \frac{1}{\mu(S)}\left(\epsilon - \frac{\epsilon}{2}\right) > \frac{\epsilon}{2}.$$
Therefore, denoting the number of entries of a vector $x$ which lie in $S$ by $s(x)$, we have
$$\mu^m\{x \in X^m : \mathrm{haz}_\mu(H[x]) > \epsilon\}$$
$$= \mu^m\{x : \mathrm{haz}_\mu(H[x]) > \epsilon,\ s(x) \ge l\} + \mu^m\{x : \mathrm{haz}_\mu(H[x]) > \epsilon,\ s(x) < l\}.$$
The second measure here is at most $\delta/2$ since with probability at least $1 - \delta/2$, $s(x)$ is at least $l$. Further,
$$\mu^m\{x \in X^m : \mathrm{haz}_\mu(H[x]) > \epsilon \text{ and } s(x) \ge l\}$$
$$= \mu^m\{x \in X^m : \mathrm{haz}_\mu(H[x]) > \epsilon \mid s(x) \ge l\}\ \mu^m\{x \in X^m : s(x) \ge l\}$$
$$\le \mu^m\{x \in X^m : \exists h \in H[x] \text{ with } \mathrm{er}_\mu(h) > \epsilon \mid s(x) \ge l\}$$
$$\le \mu^m\{x \in X^m : \exists f \in (H|S)[x_S] \text{ with } \mathrm{er}_{\mu_1}(f) > \epsilon/2\},$$
where, for any events $A$ and $B$, $\mu^m(A \mid B)$ is the conditional probability (with respect to $\mu^m$) of $A$ given $B$. Now, if $s(x) \ge l$ and $x$ is $\mu$-randomly chosen, then $x_S$ is a $\mu_1$-randomly chosen sample of length $l$. Therefore this last measure is at most $\delta/2$, since $l$ is a sufficient sample-size for the learnability of $H|S$ to accuracy $\epsilon/2$ with confidence $\delta/2$.
Note that the preceding analysis, since it holds true for any distribution $\mu$, shows that $H$ is dd-learnable. Now, since $H$ has polynomial $X$-finite dimension with respect to $\mu$, there are $c, R > 0$ such that $1 - \mu(S_k) \le R/k^c$, so that
$$k(\epsilon) \le \left\lceil \left(\frac{2R}{\epsilon}\right)^{1/c} \right\rceil,$$
which is polynomial in $1/\epsilon$. Therefore $m_\epsilon$ is a sufficient sample-size which is polynomial in $1/\epsilon$ and in $\ln(1/\delta)$, and hence $H$ is polynomially $\mu$-learnable. $\Box$
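To see the shape of the resulting bound, the sketch below (ours) computes $k(\epsilon) \le \lceil (2R/\epsilon)^{1/c} \rceil$ and the corresponding $m_\epsilon$ from the proof, under the assumption $1 - \mu(S_k) \le R/k^c$; taking $d = k(\epsilon)$ in the first logarithmic term is an illustrative choice of ours (any $d \ge 2$ alters that term only slightly, since $d/(d-1) \le 2$).

    from math import ceil, log, sqrt

    def k_eps(eps, R, c):
        # smallest k guaranteed by 1 - mu(S_k) <= R/k^c to satisfy mu(S_k) >= 1 - eps/2
        return ceil((2 * R / eps) ** (1 / c))

    def m_eps(eps, delta, R, c):
        k = k_eps(eps, R, c)
        d = k                               # illustrative stand-in for VCdim(H|S_k)
        return 2 * ceil(2 * sqrt(2) / ((sqrt(2) - sqrt(eps)) * eps)
                        * (log(2 * d / (delta * (d - 1))) + 2 * k * log(12 / eps)))

    for eps in (0.1, 0.01):
        print("eps =", eps, " k(eps) =", k_eps(eps, R=1.0, c=1.0),
              " m_eps =", m_eps(eps, delta=0.05, R=1.0, c=1.0))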
To illustrate the idea of polynomial $X$-finite dimension, consider again the examples of the previous section. For the first example, we see that the space has polynomial $X$-finite dimension by taking $S_k$ to be the union of the sets $B_1$ through to $B_k$. The sequence $\{S_k\}_{k=1}^\infty$ is increasing and $\bigcup_{k=1}^\infty S_k = X$. Further, if $x \in S_k^m$ is shattered, the entries of $x$ must lie entirely within one of the $B_i$ ($1 \le i \le k$) and hence
$$\mathrm{VCdim}(H|S_k) = \max\{\mathrm{VCdim}(H|B_j) : j \le k\} = \mathrm{VCdim}(H|B_k) = k.$$
Now, $1 - \mu(S_k) = 1/2^k$, so $H$ has polynomial $X$-finite dimension with respect to $\mu$ and $H$ is polynomially $\mu$-learnable.
For the second example, let $S_k = [0, k^2]$. Then $\{S_k\}_{k=1}^\infty$ is an increasing sequence with union $X$ and $\mathrm{VCdim}(H|S_k) = 2k$. (Clearly, the factor 2 here is of no consequence.) Further, $1 - \mu(S_k) = e^{-k^2}$ and so $H$ has polynomial $X$-finite dimension with respect to $\mu$.
It remains to give an example of a hypothesis space $H$ over an input space $X$, together with a probability distribution $\mu$ on $X$, such that $H$ has $X$-finite dimension but does not have polynomial $X$-finite dimension with respect to $\mu$. To this end, let $X$ be the set of all positive integers and $H$ the set of all (characteristic functions of) subsets of $X$. The input space is countable, and therefore $H$ has $X$-finite dimension. Define the probability measure $\mu$ on $X$ by
$$\mu(\{x\}) = \frac{1}{\log(x+1)} - \frac{1}{\log(x+2)}.$$
Suppose that the sequence of sets $\{S_k\}_{k=1}^\infty$ is such that
$$X = \bigcup_{k=1}^\infty S_k \quad \text{and} \quad \mathrm{VCdim}(H|S_k) \le k.$$
Clearly, $\mathrm{VCdim}(H|S_k) = |S_k|$. But $H$ restricted to $S_k$ is supposed to have VC dimension at most $k$. Therefore, for each integer $k$, $S_k$ has cardinality at most $k$. It follows that
$$\mu(S_k) \le \mu(\{1, 2, \ldots, k\}) = 1 - \frac{1}{\log(k+2)},$$
and $1 - \mu(S_k) \ge 1/\log(k+2)$. Thus, $H$ does not have polynomial $X$-finite dimension with respect to $\mu$. In fact, one can show directly that $H$ is not polynomially $\mu$-learnable. For suppose that the target is the identically-0 function and that a sample $x$ of size $m$ is given. There is a hypothesis consistent with the target on $x$ and with error at least $\epsilon$ unless $\mu(\{x_i : 1 \le i \le m\}) > 1 - \epsilon$. We therefore need to have
$$1 - \epsilon < \mu(\{x_i : 1 \le i \le m\}) \le 1 - \frac{1}{\log(m+2)},$$
so that $m \ge e^{1/\epsilon} - 2$, which is exponential in $1/\epsilon$.
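The exponential blow-up in the last example can also be seen numerically; the sketch below (ours) finds, for a few values of $\epsilon$, the smallest $m$ for which $1 - 1/\log(m+2)$ can exceed $1 - \epsilon$, interpreting log as the natural logarithm here purely to match the displayed bound.

    from math import exp, log

    def smallest_m(eps):
        # smallest m with 1/log(m+2) < eps, i.e. with 1 - 1/log(m+2) > 1 - eps
        m = 1
        while 1 / log(m + 2) >= eps:
            m += 1
        return m

    for eps in (1.0, 0.5, 0.25, 0.125):
        print("eps =", eps, " m =", smallest_m(eps),
              " e^(1/eps) - 2 =", round(exp(1 / eps) - 2, 1))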
References
[1] Dana Angluin, Queries and concept learning, Machine Learning, 2(4), 1988: 319-342.
[2] Martin Anthony, Uniform Convergence and Learnability, PhD thesis, University of London, 1991.
[3] Martin Anthony and Norman Biggs, Computational Learning Theory: An Introduction, Cambridge University Press, Cambridge, UK, 1992.
[4] Martin Anthony, Norman Biggs and John Shawe-Taylor, The learnability of formal concepts, Proceedings of the Third Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 1990.
[5] Shai Ben-David, Gyora M. Benedek and Yishay Mansour, A parameterization scheme for classifying models of learnability, Proceedings of the Second Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 1989.
[6] Gyora M. Benedek and Alon Itai, Learnability with respect to fixed distributions, to appear, Theoretical Computer Science.
[7] A. Bertoni, P. Campadelli, A. Morpurgo and S. Panizza, Polynomial uniform convergence and polynomial-sample learnability, in Proceedings of the 5th Annual Workshop on Computational Learning Theory, pages 265-271, ACM Press, New York, NY, 1992.
[8] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler and Manfred Warmuth, Learnability and the Vapnik-Chervonenkis dimension, Journal of the ACM, 36(4), 1989: 929-965.
[9] David Haussler, Quantifying inductive bias: AI learning algorithms and Valiant's learning framework, Artificial Intelligence, 36, 1988: 177-221.
[10] David Haussler and Emo Welzl, ε-nets and simplex range queries, Discrete Comput. Geom., 2, 1987: 127-151.
[11] Michael J. Kearns and Umesh Vazirani, Introduction to Computational Learning Theory, MIT Press, 1995.
[12] Balas K. Natarajan, Machine Learning: A Theoretical Approach, Morgan Kaufmann, San Mateo, California, 1991.
[13] David Pollard, Convergence of Stochastic Processes, Springer-Verlag, 1984.
[14] N. Sauer, On the density of families of sets, J. Comb. Theory (A), 13, 1972: 145-147.
[15] Leslie G. Valiant, A theory of the learnable, Communications of the ACM, 27(11), 1984: 1134-1142.
[16] V. N. Vapnik and A. Ya. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory of Probability and its Applications, 16(2), 1971: 264-280.