Covering Numbers for Support Vector Machines
Ying Guo
Department of Engineering
Australian National University
Canberra 0200, Australia
guo@hilbert.anu.edu.au

Peter L. Bartlett
RSISE
Australian National University
Canberra 0200, Australia
Peter.Bartlett@anu.edu.au

John Shawe-Taylor
Department of Computer Science
Royal Holloway College, University of London
Egham, TW20 0EX, UK
jst@dcs.rhbnc.ac.uk

Robert C. Williamson
Department of Engineering
Australian National University
Canberra 0200, Australia
Bob.Williamson@anu.edu.au
Abstract

Support vector machines are a type of learning machine related to the maximum margin hyperplane. Until recently, the only bounds on the generalization performance of SV machines (within the PAC framework) were via bounds on the fat-shattering dimension of maximum margin hyperplanes. This result took no account of the kernel used. More recently, it has been shown [8] that one can bound the relevant covering numbers using some tools from functional analysis. The resulting bound is quite complex and seemingly difficult to compute. In this paper we show that the bound can be greatly simplified and as a consequence we are able to determine some interesting quantities (such as the effective number of dimensions used). The new bound is quite a simple formula involving the eigenvalues of the integral operator induced by the kernel. We present an explicit calculation of covering numbers for an SV machine using a Gaussian kernel which is significantly better than that implied by the maximum margin fat-shattering result.
1 INTRODUCTION
Support Vector (SV) Machines [5] are learning algorithms based on maximum margin hyperplanes [4] which make use of an implicit mapping into feature space by using a more general kernel function in place of the standard inner product. Consequently one can apply an analysis for the maximum margin algorithm directly to SV machines. However such a process completely ignores the effect of the kernel. Intuitively one would expect that a "smoother" kernel would somehow reduce the capacity of the learning machine, thus leading to better bounds on generalization error if the machine can attain a small training error.

In [9, 8] it has been shown that this intuition is justified. The main result there (quoted below) gives a bound on the covering numbers for the class of functions computed with support vector machines. This bound, along with statistical results of the form given in [7], results in bounds that do explicitly depend on the kernel used.

In the traditional viewpoint of statistical learning theory, one is given a class of functions $F$, and the generalization performance attainable using $F$ is determined via the covering numbers $\mathcal{N}(\epsilon, F)$ (precise definitions are given below). Many generalization error bounds can be expressed in terms of $\mathcal{N}(\epsilon, F)$. The main method of bounding $\mathcal{N}(\epsilon, F)$ has been to use the Vapnik-Chervonenkis dimension or one of its generalizations (see [1] for an overview).

In [9, 8] an alternative viewpoint is taken where the class $F$ is viewed as being generated by an integral operator induced by the kernel. Properties of this operator are used to bound the required covering numbers. The result is in a form that is not particularly easy to use (see (13) below).

The main technical result of this paper is an explicit reformulation of this bound which is amenable to direct calculation. We illustrate the new result by bounding the covering numbers of SV machines which use Gaussian RBF kernels. The result shows the influence of $\sigma^2$ on the covering numbers: the covering numbers will decrease when $\sigma^2$ increases. Here $\sigma^2$ is the variance of the Gaussian function used for the kernel. More generally, the main result makes model order selection possible using any parametrized family of kernel functions: we can describe precisely how the capacity of the class is affected by changes to the kernel.
For $d \in \mathbb{N}$, $\mathbb{R}^d$ denotes the $d$-dimensional space of vectors $x = (x_1, \ldots, x_d)$. For $1 \le p \le \infty$, define the spaces $\ell_p^d = \{ x \in \mathbb{R}^d : \|x\|_{\ell_p^d} < \infty \}$, where the $p$-norms are
$$\|x\|_{\ell_p^d} = \Big( \sum_{j=1}^{d} |x_j|^p \Big)^{1/p} \quad \text{for } 1 \le p < \infty, \qquad \|x\|_{\ell_\infty^d} = \max_{j=1,\ldots,d} |x_j| \quad \text{for } p = \infty.$$
For $d = \infty$, we write $\ell_p = \ell_p^\infty$ and the norms are defined similarly (by formally substituting $\infty$ for $d$ in the above definitions).
The $\epsilon$-covering number of $F$ with respect to the metric $d$, denoted $\mathcal{N}(\epsilon, F, d)$, is the size of the smallest $\epsilon$-cover for $F$ using the metric $d$. Given $m$ points $x_1, \ldots, x_m \in \ell_p^d$, we use the shorthand $X^m = (x_1, \ldots, x_m)$. Suppose $F$ is a class of functions defined on $\mathbb{R}^d$. The $\ell_\infty^{X^m}$ norm with respect to $X^m$ of $f \in F$ is defined as $\|f\|_{\ell_\infty^{X^m}} = \max_{i=1,\ldots,m} |f(x_i)|$. The input space is taken to be $X$, a compact subset of $\mathbb{R}^d$.
Our main result is a bound for the covering number of SV machines. We only discuss the case when $d = 1$. (In fact the result does hold for general $d$; see the discussion in the conclusion.)
Let $k \colon X \times X \to \mathbb{R}$ be a kernel satisfying the hypotheses of Mercer's theorem (Theorem 2). Given $m$ points $x_1, \ldots, x_m \in X$, denote by $F_{R_w}$ the hypothesis class implemented by SV machines on an $m$-sample with weight vector (in feature space) bounded by $R_w$:
$$F_{R_w} = \Big\{ x \mapsto \sum_i \alpha_i k(x, x_i) \,:\, \sum_i \sum_j \alpha_i \alpha_j k(x_i, x_j) \le R_w^2 \Big\}. \tag{1}$$
Let $\lambda_1 \ge \lambda_2 \ge \cdots$ be the eigenvalues of the integral operator $T_k \colon L_2(X) \to L_2(X)$,
$$(T_k f)(\cdot) = \int_X k(\cdot, y) f(y)\, dy,$$
and denote by $\psi_n(\cdot)$, $n \in \mathbb{N}$, the corresponding eigenfunctions. (See the next section for a reminder of what this means.) For translation invariant kernels (such as $k(x, y) = \exp(-(x - y)^2/\sigma^2)$), the eigenvalues are given by
$$\lambda_j = \sqrt{2\pi}\, K(j \Omega_0) \tag{2}$$
for $j \in \mathbb{Z}$, where $K(\omega) = F[k(x)](\omega)$ is the Fourier transform of $k(\cdot)$ and $\Omega_0$ is the fundamental frequency arising from periodizing the kernel (see Section 4 and [9, 8] for further details). For a smooth kernel, the Fourier transform $K(\omega)$ decreases faster. (There are fewer "high frequency components.") Thus for smooth kernels, $\lambda_i$ decreases to zero rapidly with increasing $i$.
Theorem 1 (Main Result) Suppose $k$ is a kernel satisfying the hypotheses of Mercer's theorem. The hypothesis class $F_{R_w}$, the eigenfunctions $(\psi_n(\cdot))$ and the eigenvalues $(\lambda_i)$ are defined as above. Let $x_1, \ldots, x_m \in X$ be $m$ data points. Let
$$C_k = \sup_n \|\psi_n\|_{L_\infty}.$$
For $n \in \mathbb{N}$ set
$$\tilde\epsilon_n = 6 R_w C_k \sqrt{ j^* \big( \lambda_1 \cdots \lambda_{j^*}\, n^{-2} \big)^{1/j^*} + \sum_{i = j^*+1}^{\infty} \lambda_i } \tag{3}$$
with
$$j^* = \min\Big\{ j : \lambda_{j+1} \le \big( \lambda_1 \cdots \lambda_j\, n^{-2} \big)^{1/j} \Big\}.$$
Then $C_k < \infty$ and
$$\sup_{x_1, \ldots, x_m \in X} \mathcal{N}\big( \tilde\epsilon_n, F_{R_w}, \ell_\infty^{X^m} \big) \le n.$$
The quantity $\tilde\epsilon_n$ is an upper bound on the entropy number of $F_{R_w}$, which is the functional inverse of the covering number. In this theorem, the number $j^*$ has a natural interpretation: for a given value of $n$, it can be viewed as the effective dimension of the function class. Clearly, this effective dimension depends on the rate of decay of the eigenvalues. As expected, for smooth kernels (which have rapidly decreasing eigenvalues), the effective dimension is small. It turns out that all kernels satisfying Mercer's conditions are sufficiently smooth for $j^*$ to be finite.
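To make the bound concrete, the following minimal Python sketch (ours, not part of the paper) evaluates $j^*$ and $\tilde\epsilon_n$ from a truncated, non-increasing list of eigenvalues; the eigenvalue list, $R_w$ and $C_k$ are assumed inputs, the infinite tail is approximated by the truncated sum, and the absolute constant simply follows the statement of Theorem 1 as reconstructed above.

```python
import math

def effective_dimension(lams, n):
    """Smallest j with lam_{j+1} <= (lam_1*...*lam_j * n^-2)^(1/j).

    `lams` is a non-increasing list approximating the kernel eigenvalues.
    """
    log_prod = 0.0
    for j in range(1, len(lams)):
        log_prod += math.log(lams[j - 1])           # log(lam_1 ... lam_j)
        rhs = (log_prod - 2.0 * math.log(n)) / j    # log of the geometric-mean term
        if math.log(lams[j]) <= rhs:
            return j
    raise ValueError("truncate the eigenvalue list further out")

def entropy_number_bound(lams, n, R_w, C_k):
    """Evaluate the bound of Theorem 1 (eigenvalue tail truncated at len(lams))."""
    j_star = effective_dimension(lams, n)
    log_prod = sum(math.log(l) for l in lams[:j_star])
    lead = j_star * math.exp((log_prod - 2.0 * math.log(n)) / j_star)
    tail = sum(lams[j_star:])
    return 6.0 * R_w * C_k * math.sqrt(lead + tail)

# Example with assumed Gaussian-like eigenvalues lam_j = exp(-j**2 / 4).
lams = [math.exp(-(j + 1) ** 2 / 4.0) for j in range(50)]
print(entropy_number_bound(lams, n=1000, R_w=1.0, C_k=math.sqrt(2)))
```

Working with logarithms of the eigenvalue products avoids numerical underflow when the spectrum decays quickly.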
The remainder of the paper is organized as follows. We start by introducing notation and definitions (Section 2). Section 3 contains the main result (the proof is in Appendix A). Section 4 contains an example application of the main result. Section 5 concludes.
2 DEFINITIONS AND PREVIOUS RESULTS
Let $L(E, F)$ be the set of all bounded linear operators $T$ between the normed spaces $(E, \|\cdot\|_E)$ and $(F, \|\cdot\|_F)$, i.e. operators such that the image of the (closed) unit ball
$$U_E = \{ x \in E : \|x\|_E \le 1 \} \tag{4}$$
is bounded. The smallest such bound is called the operator norm,
$$\|T\| = \sup_{x \in U_E} \|T x\|_F. \tag{5}$$
The $n$th entropy number of a set $M \subset E$, for $n \in \mathbb{N}$, is
$$\epsilon_n(M) = \inf\{ \epsilon > 0 : \text{there exists an } \epsilon\text{-cover for } M \text{ in } E \text{ containing } n \text{ or fewer points} \}. \tag{6}$$
(The function $n \mapsto \epsilon_n(M)$ can be thought of as the functional inverse of the function $\epsilon \mapsto \mathcal{N}(\epsilon, M, d)$, where $d$ is the metric induced by $\|\cdot\|_E$.) The entropy numbers of an operator $T \in L(E, F)$ are defined as
$$\epsilon_n(T) = \epsilon_n\big( T(U_E) \big). \tag{7}$$
Note that $\epsilon_1(T) = \|T\|$, and that $\epsilon_n(T)$ certainly is well defined for all $n \in \mathbb{N}$ if $T$ is a compact operator, i.e. if $T(U_E)$ is compact.
In the following, $k$ will always denote a kernel, and $d$ and $m$ will be the input dimensionality and the number of training examples, respectively, so that the training data is a sequence
$$(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^d \times \mathbb{R}. \tag{8}$$
Let $\log$ denote the logarithm to base 2.

We will map the input data into a feature space $S$ via a mapping $\Phi$. We let $\bar{x} = \Phi(x)$, and
$$F_{R_w} = \{ \langle w, \bar{x} \rangle : \bar{x} \in S, \ \|w\| \le R_w \} \subset \mathbb{R}^S.$$
Given a class of functions $F$, the generalization performance attainable using $F$ can be bounded in terms of the covering numbers of $F$. More precisely, for some set $X$, and $x_i \in X$ for $i = 1, \ldots, m$, define the $\epsilon$-growth function of the function class $F$ on $X$ as
$$\mathcal{N}^m(\epsilon, F) = \sup_{(x_1, \ldots, x_m) \in X^m} \mathcal{N}\big( \epsilon, F, \ell_\infty^{X^m} \big), \tag{9}$$
where $\mathcal{N}(\epsilon, F, \ell_\infty^{X^m})$ is the $\epsilon$-covering number of $F$ with respect to $\ell_\infty^{X^m}$. Many generalization error bounds can be expressed in terms of $\mathcal{N}^m(\epsilon, F)$.
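As a concrete illustration of the quantity in (9) (not from the paper), the following Python sketch upper-bounds the $\epsilon$-covering number of a small, discretized function class with respect to $\ell_\infty^{X^m}$ by building a greedy cover; the sample, the class and $\epsilon$ are toy assumptions, and a greedy cover only gives an upper bound on the minimal cover size.

```python
import numpy as np

def linf_dist(f_vals, g_vals):
    """ell_infty distance between two functions restricted to the sample X^m."""
    return np.max(np.abs(f_vals - g_vals))

def greedy_cover_size(values, eps):
    """Greedy upper bound on N(eps, F, ell_infty^{X^m}).

    `values` is a (num_functions, m) array: each row is (f(x_1), ..., f(x_m)).
    """
    centers = []
    for row in values:
        if not any(linf_dist(row, c) <= eps for c in centers):
            centers.append(row)
    return len(centers)

# Toy example: functions f_a(x) = a * x on a sample of m points,
# with the coefficients a playing the role of a (discretized) class F.
rng = np.random.default_rng(0)
sample = rng.uniform(-1.0, 1.0, size=20)                  # the sample X^m
coeffs = np.linspace(-1.0, 1.0, 201)                      # discretized class
values = np.outer(coeffs, sample)                         # rows are f_a on X^m
print(greedy_cover_size(values, eps=0.1))
```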
Given some set $X$, some $1 \le p < \infty$ and a function $f \colon X \to \mathbb{R}$, we define $\|f\|_{L_p(X)} = \big( \int |f(x)|^p\, dx \big)^{1/p}$ if the integral exists, and $\|f\|_{L_\infty(X)} = \operatorname{ess\,sup}_{x \in X} |f(x)|$. For $1 \le p \le \infty$, we let $L_p(X) = \{ f \colon X \to \mathbb{R} : \|f\|_{L_p(X)} < \infty \}$. We sometimes write $L_p = L_p(X)$.
Suppose $T \colon E \to E$ is a linear operator mapping a normed space $E$ into itself. We say that $x \in E$ is an eigenvector if for some scalar $\lambda$, $T x = \lambda x$. Such a $\lambda$ is called the eigenvalue associated with $x$. When $E$ is a function space (e.g. $E = L_2(X)$), the eigenvectors are of course functions, and are usually called eigenfunctions. Thus $\psi_n$ is an eigenfunction of $T \colon L_2(X) \to L_2(X)$ if $T \psi_n = \lambda \psi_n$. In general $\lambda$ is complex, but in this paper all eigenvalues are real (because of the symmetry of the kernels used to induce the operators).

We will make use of Mercer's theorem. The version stated below is a special case of the theorem proven in [6, p. 145].
Theorem 2 (Mercer) Suppose $k \in L_\infty(X^2)$ is a symmetric kernel such that the integral operator $T_k \colon L_2(X) \to L_2(X)$,
$$(T_k f)(\cdot) = \int_X k(\cdot, y) f(y)\, dy, \tag{10}$$
is positive. Let $\psi_j \in L_2(X)$ be the eigenfunction of $T_k$ associated with the eigenvalue $\lambda_j \neq 0$ and normalized such that $\|\psi_j\|_{L_2} = 1$, and let $\bar\psi_j$ denote its complex conjugate. Suppose $\psi_j$ is continuous for all $j \in \mathbb{N}$. Then

1. $(\lambda_j(T))_j \in \ell_1$.
2. $\psi_j \in L_\infty(X)$ and $\sup_j \|\psi_j\|_{L_\infty} < \infty$.
3. $k(x, y) = \sum_{j \in \mathbb{N}} \lambda_j \bar\psi_j(x) \psi_j(y)$ holds for all $(x, y)$, where the series converges absolutely and uniformly for all $(x, y)$.
We will call a kernel satisfying the conditions of this theorem a Mercer kernel. From statement 2 of Mercer's theorem there exists some constant $C_k \in \mathbb{R}^+$ depending on $k(\cdot, \cdot)$ such that
$$|\psi_j(x)| \le C_k \quad \text{for all } j \in \mathbb{N} \text{ and } x \in X. \tag{11}$$
This conclusion is the only reason we have added the condition that $\psi_n$ is continuous; it is not necessary for the theorem as stated, but it is convenient to bundle all of our assumptions into the one place. In any case it is not a very restrictive assumption: if $X$ is compact and $k$ is continuous, then $\psi_j$ is automatically continuous (see e.g. [3]). Alternatively, if $k$ is translation invariant, then the $\psi_j$ are scaled cosine functions and thus continuous.
In [8] an upper bound on the entropy numbers was given in terms of the eigenvalues of the kernel used. The result is in terms of the entropy numbers of a scaling operator $A$. The notation $(a_s)_s \in \ell_p$ denotes a sequence $a_1, a_2, \ldots$ such that $\sum_{s=1}^{\infty} |a_s|^p < \infty$.
Theorem 3 (Entropy numbers for $A \colon \ell_2 \to \ell_2$) Let $k \colon X \times X \to \mathbb{R}$ be a Mercer kernel. Choose $a_j > 0$ for $j \in \mathbb{N}$ such that $(\sqrt{\lambda_s}\, a_s)_s \in \ell_2$, and define
$$A \colon (x_j)_j \mapsto R_A (a_j x_j)_j \tag{12}$$
with $R_A = C_k \big\| (\sqrt{\lambda_j}\, a_j)_j \big\|_{\ell_2}$. Then
$$\epsilon_n(A \colon \ell_2 \to \ell_2) \le \sup_{j \in \mathbb{N}} 6\, C_k \big\| (\sqrt{\lambda_s}\, a_s)_s \big\|_{\ell_2} \big( a_1 a_2 \cdots a_j\, n \big)^{-1/j}. \tag{13}$$
This result leads to the following bounds for SV classes.
Theorem 4 (Bounds for SV classes) Let $k$ be a Mercer kernel. Then for all $n \in \mathbb{N}$,
$$\epsilon_n(F_{R_w}) \le R_w \inf_{(a_s)_s :\, (\sqrt{\lambda_s} a_s)_s \in \ell_2} \epsilon_n(A \colon \ell_2 \to \ell_2), \tag{14}$$
where $A$ is defined as in Theorem 3.
Combining Equations (13) and (14) gives effective bounds on $\mathcal{N}^m(\epsilon, F_{R_w})$, since if the resulting entropy number of $F_{R_w}$ with respect to $\ell_\infty^{X^m}$ is at most $\epsilon$, then $\mathcal{N}^m(\epsilon, F_{R_w}) \le n$.
These results thus give a method to obtain bounds on the entropy numbers for kernel machines. In Inequality (14), we can choose $(a_s)_s$ to optimize the bound. The key technical contribution of this paper is the explicit determination of the best choice of $(a_s)_s$.

We assume henceforth that $(\lambda_s)_s$ is fixed and sorted in non-increasing order, and $a_i > 0$ for all $i$. For $j \in \mathbb{N}$, we define the set
$$A_j = \Big\{ (a_s)_s : \sup_{i \in \mathbb{N}} \big( a_1 \cdots a_i\, n \big)^{-1/i} = \big( a_1 \cdots a_j\, n \big)^{-1/j} \Big\}. \tag{15}$$
In other words, $A_j$ is the set of $(a_s)_s$ such that the $\sup_{i \in \mathbb{N}} (a_1 a_2 \cdots a_i\, n)^{-1/i}$ is attained at $i = j$.

Let
$$B\big( (a_s), n, j \big) = \big\| (\sqrt{\lambda_s}\, a_s)_s \big\|_{\ell_2}^2 \, \big( a_1 \cdots a_j\, n \big)^{-2/j},$$
where for notational simplicity, we write $(a_s)$ for $(a_s)_s$.
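To make the objects in (15) and the definition of $B$ tangible, here is a short Python sketch (our illustration, not from the paper) that evaluates $B((a_s), n, j)$ for a truncated sequence and locates the index at which $\sup_i (a_1 \cdots a_i\, n)^{-1/i}$ is attained, which is how membership of $A_j$ is checked; the eigenvalues and the trial sequence are assumed toy inputs.

```python
import math

def B(a, lams, n, j):
    """B((a_s), n, j) with the infinite sums truncated at len(a)."""
    norm_sq = sum(l * x * x for l, x in zip(lams, a))          # ||(sqrt(lam_s) a_s)||_2^2
    log_prod = sum(math.log(x) for x in a[:j]) + math.log(n)    # log(a_1 ... a_j n)
    return norm_sq * math.exp(-2.0 * log_prod / j)

def attaining_index(a, n):
    """Index j at which sup_i (a_1...a_i n)^(-1/i) is attained (truncated)."""
    best_j, best_val, log_prod = 1, -math.inf, 0.0
    for i, x in enumerate(a, start=1):
        log_prod += math.log(x)
        val = -(log_prod + math.log(n)) / i                     # log of (a_1...a_i n)^(-1/i)
        if val > best_val:
            best_j, best_val = i, val
    return best_j

# Toy eigenvalues and a trial weighting sequence (both assumptions).
lams = [math.exp(-0.5 * (j + 1)) for j in range(40)]
a = [1.0 / math.sqrt(l) for l in lams[:10]] + [1.0 / math.sqrt(lams[9])] * 30
n = 100
j = attaining_index(a, n)
print(j, B(a, lams, n, j))
```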
3 THE OPTIMAL CHOICE OF $(a_s)_s$ AND $j$

Our main aim in this section is to show that the infimum in (14) and the supremum in (13) can be achieved and to give an explicit recipe for the sequence $(a_s^*)$ and number $j^*$ that achieve them. The main technical theorem is as follows.
Theorem 5 Let $k \colon X \times X \to \mathbb{R}$ be a Mercer kernel. Suppose $\lambda_1 \ge \lambda_2 \ge \cdots$ are the eigenvalues of $T_k$. For any $n \in \mathbb{N}$, the minimum
$$j^* = \min\Big\{ j : \lambda_{j+1} \le \big( \lambda_1 \cdots \lambda_j\, n^{-2} \big)^{1/j} \Big\} \tag{16}$$
always exists, and
$$\inf_{(a_s)_s :\, (\sqrt{\lambda_s} a_s)_s \in \ell_2} \ \sup_{j \in \mathbb{N}} B\big( (a_s), n, j \big) = B\big( (a_s^*), n, j^* \big),$$
where
$$a_i^* = \begin{cases} \dfrac{1}{\sqrt{\lambda_i}} & \text{when } i \le j^*, \\[2mm] \Big( \big( \lambda_1 \cdots \lambda_{j^*}\, n^{-2} \big)^{1/j^*} \Big)^{-1/2} & \text{when } i > j^*. \end{cases} \tag{17}$$
This choice of $(a_s^*)$ results in a simple form for the bound of (14) in terms of $n$ and $(\lambda_i)$:

Corollary 6 Let $k \colon X \times X \to \mathbb{R}$ be a Mercer kernel and let $A$ be given by (12). Then for any $n \in \mathbb{N}$, the entropy numbers satisfy
$$\inf_{(a_s)_s :\, (\sqrt{\lambda_s} a_s)_s \in \ell_2} \epsilon_n(A \colon \ell_2 \to \ell_2) \le 6\, C_k \sqrt{ j^* \big( \lambda_1 \cdots \lambda_{j^*}\, n^{-2} \big)^{1/j^*} + \sum_{i = j^*+1}^{\infty} \lambda_i } \tag{18}$$
with
$$j^* = \min\Big\{ j : \lambda_{j+1} \le \big( \lambda_1 \cdots \lambda_j\, n^{-2} \big)^{1/j} \Big\}.$$
This corollary, together with (14), implies Theorem 1.
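As a numerical sanity check of Theorem 5 and Corollary 6 (again a sketch of ours, not part of the paper), one can compare $\sup_j B((a_s), n, j)$ for the sequence $a^*$ of (17) against a few other admissible sequences; with the assumed toy spectrum below, the optimal choice should not be beaten.

```python
import math

def sup_B(a, lams, n):
    """sup_j B((a_s), n, j), truncating all sums and products at len(a)."""
    norm_sq = sum(l * x * x for l, x in zip(lams, a))
    best, log_prod = -math.inf, math.log(n)
    for j, x in enumerate(a, start=1):
        log_prod += math.log(x)
        best = max(best, norm_sq * math.exp(-2.0 * log_prod / j))
    return best

def optimal_a(lams, n):
    """The sequence a* of (17), using the j* of (16)."""
    log_prod, j_star = 0.0, None
    for j in range(1, len(lams)):
        log_prod += math.log(lams[j - 1])
        if math.log(lams[j]) <= (log_prod - 2.0 * math.log(n)) / j:
            j_star = j
            break
    beta = math.exp((sum(math.log(l) for l in lams[:j_star]) - 2 * math.log(n)) / j_star)
    return [1.0 / math.sqrt(l) for l in lams[:j_star]] + \
           [1.0 / math.sqrt(beta)] * (len(lams) - j_star)

lams = [math.exp(-0.3 * (j + 1) ** 2) for j in range(30)]   # assumed toy spectrum
n = 50
a_star = optimal_a(lams, n)
print("optimal a*:", sup_B(a_star, lams, n))
print("constant a_s = 1:", sup_B([1.0] * len(lams), lams, n))
print("a_s = 1/sqrt(lam_s):", sup_B([1.0 / math.sqrt(l) for l in lams], lams, n))
```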
PROOF OUTLINE

The proof of Theorem 5 is quite long and is in Appendix A. It involves the following four steps.

1. We first prove that for all $n \in \mathbb{N}$,
$$\tilde{j} = \min\Big\{ j : \lambda_{j+1} \le \big( \lambda_1 \cdots \lambda_j\, n^{-2} \big)^{1/j} \Big\} \tag{19}$$
exists, whenever $(\lambda_i)$ are the eigenvalues of a Mercer kernel.

2. We then prove that for any $n \in \mathbb{N}$,
$$\inf_{(a_s)_s :\, (\sqrt{\lambda_s} a_s)_s \in \ell_2} \ \sup_{j \in \mathbb{N}} B\big( (a_s), n, j \big) \le \inf_{j \in \mathbb{N}} \ \inf_{(a_s) \in A_j} B\big( (a_s), n, j \big). \tag{20}$$

3. The next step is to prove that the choice of $(a_s^*)$ and $j^*$ described by (16) and (17) is optimal. It is separated into two parts:

(a) For any $j_1 \le j^*$ and any $(a_s) \in A_{j_1}$, the inequality $B((a_s), n, j_1) \ge B((a_s^*), n, j^*)$ holds.

(b) For any $j_1 > j^*$ and any $(a_s) \in A_{j_1}$, the inequality $B((a_s), n, j_1) \ge B((a_s^*), n, j^*)$ also holds.

4. Finally we show that $(a_s^*) \in A_{j^*}$ and $(\sqrt{\lambda_s}\, a_s^*)_s \in \ell_2$ when $(a_s^*)$ is chosen according to (17).
4 EXAMPLE
We illustrate the results of this paper with an example. Consider the kernel $k(x, y) = k(x - y)$ where $k(x) = e^{-x^2/\sigma^2}$. For such kernels (RBF kernels), $k(x, x) = 1$ for all $x \in X$. Thus the class (1) can be written as
$$F_{R_w} = \{ \langle w, \bar{x} \rangle : \bar{x} \in S, \ \|\bar{x}\|_{\ell_2} = 1, \ \|w\|_{\ell_2} \le R_w \}.$$
One can use the fat-shattering dimension to bound the covering number of the class of functions $F_{R_w}$ (see [2]).
Lemma 7 With $F_{R_w}$ as above,
$$\mathrm{fat}_{F_{R_w}}(\gamma) \le \Big( \frac{R_w}{\gamma} \Big)^2. \tag{21}$$
Theorem 8 If $F$ is a class of functions mapping from a set $X$ into the interval $[0, B]$, then for any $\gamma > 0$, if $m \ge \mathrm{fat}_F(\gamma/4) \ge 1$,
$$\log \mathcal{N}^m(\gamma, F) \le 3\, \mathrm{fat}_F(\gamma/4) \log^2\!\Big( \frac{4 e B m}{\mathrm{fat}_F(\gamma/4)\, \gamma} \Big). \tag{22}$$
Combining these results we have the following bound, with which we shall compare our new bound:
$$\log \mathcal{N}^m(\gamma, F_{R_w}) \le 3 \Big( \frac{R_w}{\gamma} \Big)^2 \log^2\!\Big( \frac{4 e B m}{\gamma} \Big). \tag{23}$$
In order to determine the eigenvalues of $T_k$, we need to periodize the kernel. This periodization is necessary in order to get a discrete set of eigenvalues, since $k(x)$ has infinite support (see [9] for further details). For the purpose of the present paper, we can assume a fixed period $2\pi/\Omega_0$ for some $\Omega_0 > 0$. Since the kernel is translation invariant, the eigenfunctions are $\psi_n(x) = \sqrt{2} \cos(n \Omega_0 x)$ and so $C_k = \sqrt{2}$. The $\sqrt{2}$ comes from the requirement in Theorem 2 that $\|\psi_j\|_{L_2} = 1$. The eigenvalues are
$$\lambda_j = \sqrt{\pi}\, \sigma\, e^{-\sigma^2 \Omega_0^2 j^2 / 4}.$$
Setting $c_1 = \sqrt{\pi}\, \sigma$ and $c_2 = \sigma^2 \Omega_0^2 / 4$, the eigenvalues can be written as
$$\lambda_j = c_1 e^{-c_2 j^2}. \tag{24}$$
From (16), we know that
$$\lambda_{j+1} \le \big( \lambda_1 \cdots \lambda_j\, n^{-2} \big)^{1/j} \quad \text{implies} \quad j^* \le j.$$
But (24) shows that this condition on the eigenvalues is equivalent to
$$c_1 e^{-c_2 (j+1)^2} \le \Big( n^{-2}\, c_1^{j}\, e^{-c_2 \sum_{i=1}^{j} i^2} \Big)^{1/j}, \tag{25}$$
which is equivalent to
$$c_2 \Big( (j+1)^2 - \frac{(j+1)(2j+1)}{6} \Big) \ge \frac{2}{j} \ln n, \quad \text{i.e.} \quad c_2\, \frac{j(j+1)(4j+5)}{12} \ge \ln n,$$
which follows from
$$j \ge \Big( \frac{3 \ln n}{c_2} \Big)^{1/3}.$$
Hence,
$$j^* \le \Big\lceil \Big( \frac{3 \ln n}{c_2} \Big)^{1/3} \Big\rceil + 1. \tag{26}$$
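To see how the estimate (26) behaves, the short Python check below (ours, with arbitrary assumed values of $\sigma$, $\Omega_0$ and $n$) compares it with the exact $j^*$ obtained by scanning the condition of (16) directly for the eigenvalues (24).

```python
import math

def j_star_exact(c1, c2, n, j_max=200):
    """Smallest j with lam_{j+1} <= (lam_1...lam_j n^-2)^(1/j), for lam_j = c1*exp(-c2*j^2)."""
    for j in range(1, j_max):
        lhs = math.log(c1) - c2 * (j + 1) ** 2
        sum_sq = j * (j + 1) * (2 * j + 1) // 6
        rhs = (j * math.log(c1) - c2 * sum_sq - 2.0 * math.log(n)) / j
        if lhs <= rhs:
            return j
    raise ValueError("increase j_max")

def j_star_estimate(c2, n):
    """The closed-form upper bound of (26)."""
    return math.ceil((3.0 * math.log(n) / c2) ** (1.0 / 3.0)) + 1

sigma, omega0 = 1.0, 1.0                      # assumed kernel and periodization parameters
c1, c2 = math.sqrt(math.pi) * sigma, (sigma * omega0) ** 2 / 4.0
for n in (10, 100, 10_000, 10 ** 8):
    print(n, j_star_exact(c1, c2, n), j_star_estimate(c2, n))
```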
We can now use (18) to give an upper bound on $\epsilon_n$. The tail $\sum_{i=j^*+1}^{\infty} \lambda_i$ in (18) is dominated by the first term, hence we obtain the following bound:
$$\tilde\epsilon_n^{\,2} = O\Big( j^*\, n^{-2/j^*}\, c_1 \exp\!\Big( -c_2\, \frac{(j^*+1)(2j^*+1)}{6} \Big) \Big).$$
Substituting (26) shows that
$$\log n = O\Big( \log\log n \cdot \log\frac{1}{\epsilon_n} + \frac{1}{\sigma \Omega_0} \log^{3/2}\frac{1}{\epsilon_n} \Big). \tag{27}$$
1e13
1e12
1e11
1e10
1e09
1e08
1e07
1e06
1e05
.1e3
.1e2
.1e1
.1
1.
.1e2
Figure 1: $\epsilon_n$ versus $n$ for a Gaussian kernel as given by Corollary 6.
Figure 2: $j^*$ versus $n$ for a Gaussian kernel.
We can get several results from Equation (27).

The relationship between $\epsilon_n$ and $n$. For fixed $\sigma$, (27) shows that
$$\log \epsilon_n \approx -\log^{2/3} n,$$
which implies
$$\log \mathcal{N}^m(\epsilon, F_{R_w}) = O\Big( \log^{3/2} \frac{1}{\epsilon} \Big), \tag{28}$$
which is considerably better than (23). This can also be seen in Figure 1.
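The gap between (23) and (28) is easy to see numerically; the sketch below (our illustration, with the constants in front of both bounds treated as rough assumptions) tabulates the two growth rates as $\epsilon$ shrinks.

```python
import math

def fat_shattering_bound(eps, R_w=1.0, B=1.0, m=10_000):
    """Scaling of the fat-shattering based bound (23), up to its constant."""
    return 3.0 * (R_w / eps) ** 2 * math.log(4 * math.e * B * m / eps, 2) ** 2

def eigenvalue_bound(eps):
    """Scaling of the new bound (28) for a Gaussian kernel, up to its constant."""
    return math.log(1.0 / eps, 2) ** 1.5

for eps in (0.5, 0.1, 0.01, 0.001):
    print(f"eps={eps:6}: old ~ {fat_shattering_bound(eps):12.1f}, new ~ {eigenvalue_bound(eps):8.2f}")
```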
Figure 3: $j^*$ versus $\epsilon$ for a Gaussian kernel. Since $j^*$ can be interpreted as an "effective number of dimensions", this clearly illustrates why the bound on the covering numbers for Gaussian kernels grows so slowly as $\epsilon \to 0$. Even when $\epsilon = 10^{-13}$, $j^*$ is only 13.
The relationship between $\sigma^2$ and $\epsilon_n$. Here, $\sigma^2$ is the variance of the Gaussian functions. When $\sigma^2$ increases, the kernel function will be wider, so the class $F_{R_w}$ should be simpler. In Equation (27), we notice that if $\sigma$ increases, $\epsilon_n$ decreases for fixed $n$. Similarly, if $\sigma$ increases, $n$ decreases for fixed $\epsilon_n$. Since the entropy number (and the covering number) indicates the capacity of the learning machine, the more complicated the machine is, the bigger the covering number for fixed $\epsilon_n$. Specifically, we see from Equation (27) that
$$\log \epsilon_n \approx -\sigma^{2/3}$$
and that
$$\log \mathcal{N}^m(\epsilon, F_{R_w}) = O(1/\sigma).$$
Figures 1 to 3 illustrate our bounds (for $\sigma^2 = 1$).
5 CONCLUSIONS
We have presented a new formula for bounding the covering numbers of support vector machines in terms of the eigenvalues of an integral operator induced by the kernel. We showed, by way of an example using a Gaussian kernel, that the new bound is easily computed and considerably better than previous results that did not take account of the kernel. We showed explicitly the effect of the choice of width of the kernel in this case.

The "effective number of dimensions", $j^*$, can illustrate the characteristics of the kernel functions clearly. For a smooth kernel, the "effective number of dimensions" $j^*$ is small. The value of $j^*$ depends on $n$, which in turn depends on $\epsilon$. Thus $j^*$ can be considered analogous to existing "scale-sensitive" dimensions, such as the fat-shattering dimension. A key difference is that we now have bounds for $j^*$ that explicitly depend on the kernel.

We have discussed the result for the situation where the input dimension is 1. The main complication arising when $d > 1$ is that repeated eigenvalues become generic for isotropic translation invariant kernels. This does not break the bounds as stated (as long as one properly counts the multiplicity of eigenvalues). However, it is possible to obtain bounds that can be tighter in some cases, by using a slightly more refined argument [9].
References

[1] M. Anthony. Probabilistic analysis of learning in artificial neural networks: The PAC model and its variants. Neural Computing Surveys, 1:1-47, 1997. http://www.icsi.berkeley.edu/~jagota/NCS.

[2] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[3] Robert Ash. Information Theory. Interscience Publishers, New York, 1965.

[4] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, 5th Annual ACM Workshop on COLT, pages 144-152, Pittsburgh, PA, 1992. ACM Press.

[5] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

[6] H. König. Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel, 1986.

[7] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.

[8] R. Williamson, A. Smola, and B. Schölkopf. Entropy numbers, operators and support vector kernels. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 127-144, Cambridge, MA, 1999. MIT Press.

[9] R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. NeuroCOLT Technical Report NC-TR-98-019, Royal Holloway College, 1998.
A PROOF OF THEOREM 1

STEP ONE

As indicated above, we will first prove the existence of $\tilde{j}$, which is defined in (19).
Lemma 9 Suppose $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$ is a non-increasing sequence of non-negative numbers and $\lim_{j \to \infty} \lambda_j = 0$. Then for all $n \in \mathbb{N}$, there exists $\tilde{j} \in \mathbb{N}$ such that
$$\lambda_{\tilde{j}+1} \le \big( \lambda_1 \cdots \lambda_{\tilde{j}}\, n^{-2} \big)^{1/\tilde{j}}. \tag{29}$$
Proof Let $P_j := \lambda_{j+1}^{-j}\, \lambda_1 \cdots \lambda_j$. Observe that (29) can be written as $P_{\tilde{j}} \ge n^2$, and hence for all $n$ there is a $\tilde{j}$ such that (29) is true iff $\lim_{j \to \infty} P_j = \infty$. But
$$P_j = \lambda_{j+1}^{-j}\, \lambda_1 \cdots \lambda_j = \prod_{i=1}^{j} \frac{\lambda_i}{\lambda_{j+1}} \ge \frac{\lambda_1}{\lambda_{j+1}},$$
since $(\lambda_i)$ is non-increasing. Since $\lim_{j \to \infty} \lambda_j = 0$, we get $\lim_{j \to \infty} P_j = \infty$. Thus for any $n \in \mathbb{N}$ there is a $\tilde{j}$ such that (29) is true.
Corollary 10 Suppose $k$ is a Mercer kernel and $T_k$ the associated integral operator. If $\lambda_i = \lambda_i(T_k)$, then the minimum $\tilde{j}$ from (19) always exists.

Proof By Mercer's theorem, $(\lambda_i)_i \in \ell_1$ and so $\lim_{i \to \infty} \lambda_i = 0$. Lemma 9 can thus be applied.
STEP TWO

Lemma 11 Suppose $A_j$ and $B((a_s), n, j)$ are defined as above, $(\sqrt{\lambda_s}\, a_s^*)_s \in \ell_2$, and $j^*$ and $(a_s^*) \in A_{j^*}$ satisfy
$$B\big( (a_s^*), n, j^* \big) = \inf_{j \in \mathbb{N}} \ \inf_{(a_s) \in A_j} B\big( (a_s), n, j \big). \tag{30}$$
Then
$$\inf_{(a_s)_s :\, (\sqrt{\lambda_s} a_s)_s \in \ell_2} \ \sup_{j \in \mathbb{N}} B\big( (a_s), n, j \big) \le \inf_{j \in \mathbb{N}} \ \inf_{(a_s) \in A_j} B\big( (a_s), n, j \big). \tag{31}$$
Proof Since $(\sqrt{\lambda_s}\, a_s^*)_s \in \ell_2$,
$$\inf_{(a_s)_s :\, (\sqrt{\lambda_s} a_s)_s \in \ell_2} \ \sup_{j \in \mathbb{N}} B\big( (a_s), n, j \big) \le \sup_{j \in \mathbb{N}} B\big( (a_s^*), n, j \big). \tag{32}$$
But $(a_s^*) \in A_{j^*}$, so following the definition of $A_j$ and equality (30) we get
$$\sup_{j \in \mathbb{N}} B\big( (a_s^*), n, j \big) = B\big( (a_s^*), n, j^* \big) = \inf_{j \in \mathbb{N}} \ \inf_{(a_s) \in A_j} B\big( (a_s), n, j \big).$$

In fact, we can show that the inequality in (31) is in fact an equality. The proof is in Appendix B. It is now easier to calculate the optimal bound on the entropy number using Lemma 11.
STEP THREE

In this step, we will prove that the choices of $(a_s^*)$ and $j^*$ given in Theorem 5 are optimal. We will first prove a useful technical result.

Lemma 12 Suppose $A_j$ and $(\lambda_i)$ are defined as above, and $(a_s) \in A_{j_1}$. Then we have
$$\Big( \sum_{i=j_1+1}^{\infty} \lambda_i a_i^2 \Big) \big( a_1 \cdots a_{j_1}\, n \big)^{-2/j_1} - \sum_{i=j_1+1}^{\infty} \lambda_i \ge 0. \tag{33}$$
Proof Since $(a_s) \in A_{j_1}$, the following inequality must be true for every $k \in \mathbb{N}$:
$$\big( a_1 \cdots a_{j_1} \cdots a_{j_1+k}\, n \big)^{-1/(j_1+k)} \le \big( a_1 \cdots a_{j_1}\, n \big)^{-1/j_1}, \tag{34}$$
which implies
$$a_{j_1+1} \cdots a_{j_1+k} \ge \big( a_1 \cdots a_{j_1}\, n \big)^{k/j_1} \quad \text{for all } k \in \mathbb{N}. \tag{35}$$
Set
$$\alpha := \big( a_1 \cdots a_{j_1}\, n \big)^{2/j_1}.$$
Then (35) can be rewritten as
$$a_{j_1+1}^2 \cdots a_{j_1+k}^2 \ge \alpha^k \quad \text{for all } k \in \mathbb{N}. \tag{36}$$
Hence, the left-hand side of (33) can be rewritten as
$$\frac{1}{\alpha} \sum_{i=j_1+1}^{\infty} \lambda_i a_i^2 - \sum_{i=j_1+1}^{\infty} \lambda_i = \sum_{i=j_1+1}^{\infty} \lambda_i \Big( \frac{a_i^2}{\alpha} - 1 \Big). \tag{37}$$
From (36) we get $a_{j_1+1}^2 \ge \alpha$, so $a_{j_1+1}^2/\alpha - 1 \ge 0$. If $a_i^2/\alpha - 1 \ge 0$ for every $i$, the sum in (37) is trivially non-negative, so suppose $a_i^2/\alpha - 1 < 0$ for some $i \in \mathbb{N}$. We separate the sum into maximal blocks of consecutive indices on which the sign of $a_i^2/\alpha - 1$ is constant: set $k_0 = j_1$ and, for $m \ge 1$, let $l_m$ be the last index of the $m$th non-negative run and $k_m$ the last index of the following negative run (with $l_m$ or $k_m$ set to $\infty$ if the corresponding run does not terminate). (38)

Since $(\lambda_i)$ is a non-increasing sequence, within the $m$th pair of runs we have $\lambda_i \ge \lambda_{l_m}$ whenever $a_i^2/\alpha - 1 \ge 0$ and $\lambda_i \le \lambda_{l_m}$ whenever $a_i^2/\alpha - 1 < 0$. Hence, if $l_m$ is finite,
$$\sum_{i=k_{m-1}+1}^{k_m} \lambda_i \Big( \frac{a_i^2}{\alpha} - 1 \Big) \ge \lambda_{l_m} \sum_{i=k_{m-1}+1}^{k_m} \Big( \frac{a_i^2}{\alpha} - 1 \Big), \tag{39}$$
and if $l_m$ is infinite, this inequality is clearly true. We will exploit the inequality of the arithmetic and geometric means,
$$x_1 + x_2 + \cdots + x_m \ge m \big( x_1 \cdots x_m \big)^{1/m} \quad \text{for } x_i \ge 0. \tag{40}$$
Now (40) and (36) imply that for any $j > k_0$,
$$\sum_{i=k_0+1}^{j} \frac{a_i^2}{\alpha} \ge (j - k_0) \Big( \prod_{i=k_0+1}^{j} \frac{a_i^2}{\alpha} \Big)^{1/(j-k_0)} \ge j - k_0, \tag{41}$$
and hence every partial sum satisfies
$$\sum_{i=k_0+1}^{j} \Big( \frac{a_i^2}{\alpha} - 1 \Big) \ge 0; \tag{43}$$
in particular this holds with $j = k_m$ for every $m$ (finite or infinite). Now, using (39) block by block together with (43) and the fact that the weights $\lambda_{l_1} \ge \lambda_{l_2} \ge \cdots \ge 0$ are non-increasing (so that weighting non-negative partial sums by non-increasing weights preserves non-negativity), we get
$$\sum_{i=k_0+1}^{k_m} \lambda_i \Big( \frac{a_i^2}{\alpha} - 1 \Big) \ge \lambda_{l_m} \sum_{i=k_0+1}^{k_m} \Big( \frac{a_i^2}{\alpha} - 1 \Big) \ge 0$$
for all $m \in \mathbb{N}$. Hence
$$\sum_{i=j_1+1}^{\infty} \lambda_i \Big( \frac{a_i^2}{\alpha} - 1 \Big) \ge 0. \tag{44}$$
Noticing (37), inequality (33) is true.
Now, let us prove the main result.

Lemma 13 Let $A_j$ and $B((a_s), n, j)$ be defined as above. Then we have
$$B\big( (a_s^*), n, j^* \big) = \inf_{j_1 \in \mathbb{N}} \ \inf_{(a_s) \in A_{j_1}} B\big( (a_s), n, j_1 \big), \tag{45}$$
where
$$a_i^* = \begin{cases} \dfrac{1}{\sqrt{\lambda_i}} & \text{when } i \le j^*, \\[2mm] \Big( \big( \lambda_1 \cdots \lambda_{j^*}\, n^{-2} \big)^{1/j^*} \Big)^{-1/2} & \text{when } i > j^*, \end{cases} \tag{46}$$
$$j^* = \min\Big\{ j : \lambda_{j+1} \le \big( \lambda_1 \cdots \lambda_j\, n^{-2} \big)^{1/j} \Big\}. \tag{47}$$
Proof The main idea is to compare $B((a_s), n, j_1)$ with $B((a_s^*), n, j^*)$ and show $B((a_s), n, j_1) \ge B((a_s^*), n, j^*)$ for all $j_1 \in \mathbb{N}$ and any $(a_s) \in A_{j_1}$. For convenience, set
$$\beta_j := \big( \lambda_1 \cdots \lambda_j\, n^{-2} \big)^{1/j}, \qquad \beta := \beta_{j^*}, \qquad D_j := \big( a_1 \cdots a_j\, n \big)^{-2/j}.$$
From the definition of $B((a_s), n, j)$ we know
$$B\big( (a_s), n, j_1 \big) = \Big( \sum_{i=1}^{\infty} \lambda_i a_i^2 \Big) D_{j_1}, \qquad B\big( (a_s^*), n, j^* \big) = j^* \beta + \sum_{i=j^*+1}^{\infty} \lambda_i,$$
the latter because $\sqrt{\lambda_i}\, a_i^* = 1$ for $i \le j^*$, $\lambda_i (a_i^*)^2 = \lambda_i/\beta$ for $i > j^*$, and $(a_1^* \cdots a_{j^*}^*\, n)^{-2/j^*} = \beta$. Hence
$$B\big( (a_s), n, j_1 \big) - B\big( (a_s^*), n, j^* \big) = \Big( \sum_{i=1}^{\infty} \lambda_i a_i^2 \Big) D_{j_1} - j^* \beta - \sum_{i=j^*+1}^{\infty} \lambda_i. \tag{48}$$

Part a: the case $j_1 \le j^*$. Rewrite (48) as
$$B\big( (a_s), n, j_1 \big) - B\big( (a_s^*), n, j^* \big) = E_1 + E_2 + E_3, \tag{49}$$
where
$$E_1 = \sum_{i=1}^{j_1} \lambda_i a_i^2\, D_{j_1} - j_1 \beta_{j_1}, \qquad E_2 = \sum_{i=j_1+1}^{\infty} \lambda_i a_i^2\, D_{j_1} - \sum_{i=j_1+1}^{\infty} \lambda_i, \qquad E_3 = \Big( j_1 \beta_{j_1} + \sum_{i=j_1+1}^{\infty} \lambda_i \Big) - \Big( j^* \beta + \sum_{i=j^*+1}^{\infty} \lambda_i \Big).$$
We will show $E_1 \ge 0$, $E_2 \ge 0$ and $E_3 \ge 0$.

To prove $E_1 \ge 0$: since $\lambda_i > 0$ and $a_i > 0$, the inequality of the arithmetic and geometric means (40) gives
$$\sum_{i=1}^{j_1} \lambda_i a_i^2 \ge j_1 \big( \lambda_1 \cdots \lambda_{j_1} a_1^2 \cdots a_{j_1}^2 \big)^{1/j_1} = j_1\, \beta_{j_1} / D_{j_1},$$
and multiplying by $D_{j_1}$ gives $E_1 \ge 0$.

To prove $E_2 \ge 0$: applying Lemma 12 directly shows $E_2 \ge 0$.

To prove $E_3 \ge 0$: define the function
$$g(j) := j \beta_j + \sum_{i=j+1}^{\infty} \lambda_i,$$
so that $E_3 = g(j_1) - g(j^*)$. We show that $g(j)$ is a non-increasing function of $j$ for $j \le j^*$. Indeed, since $\beta_{j+1}^{j+1} = \beta_j^{j}\, \lambda_{j+1}$, the inequality (40) gives
$$j \beta_j + \lambda_{j+1} \ge (j+1) \big( \beta_j^{j}\, \lambda_{j+1} \big)^{1/(j+1)} = (j+1) \beta_{j+1},$$
and hence $g(j+1) - g(j) = (j+1)\beta_{j+1} - j\beta_j - \lambda_{j+1} \le 0$. Since $j_1 \le j^*$, we get $E_3 = g(j_1) - g(j^*) \ge 0$.

Combining the above results, we get
$$B\big( (a_s), n, j_1 \big) - B\big( (a_s^*), n, j^* \big) \ge 0 \quad \text{for } j_1 \le j^*. \tag{57}$$

Part b: the case $j_1 > j^*$. Rewrite (48) as
$$B\big( (a_s), n, j_1 \big) - B\big( (a_s^*), n, j^* \big) = F_1 + F_2, \tag{58}$$
where
$$F_1 = \sum_{i=1}^{j_1} \lambda_i a_i^2\, D_{j_1} - j^* \beta - \sum_{i=j^*+1}^{j_1} \lambda_i, \qquad F_2 = \sum_{i=j_1+1}^{\infty} \lambda_i a_i^2\, D_{j_1} - \sum_{i=j_1+1}^{\infty} \lambda_i.$$
We will show $F_1 \ge 0$ and $F_2 \ge 0$.

To prove $F_1 \ge 0$, write $F_1 = P_1 + P_2$ with
$$P_1 = D_{j_1} \sum_{i=1}^{j^*} \lambda_i a_i^2 - j^* \beta, \qquad P_2 = \sum_{i=j^*+1}^{j_1} \lambda_i \big( D_{j_1} a_i^2 - 1 \big). \tag{59}$$
Consider $P_1$ first. By (40), $\sum_{i=1}^{j^*} \lambda_i a_i^2 \ge j^* \beta / D_{j^*}$; since $(a_s) \in A_{j_1}$ we have $D_{j_1} \ge D_i$ for every $i$, in particular $D_{j_1} \ge D_{j^*}$; and $\lambda_{j^*+1} \le \beta$ by (47). Hence
$$P_1 \ge j^* \beta \Big( \frac{D_{j_1}}{D_{j^*}} - 1 \Big) \ge j^* \lambda_{j^*+1} \Big( \frac{D_{j_1}}{D_{j^*}} - 1 \Big) \ge 0. \tag{60}$$
If $P_2 \ge 0$ we are done, so suppose some of the terms $D_{j_1} a_i^2 - 1$, $j^* < i \le j_1$, are negative. Observing that $a_i^2 = D_{i-1}^{i-1} / D_i^{i}$, for any $1 \le l \le j_1 - j^*$ we have $a_{j^*+1}^2 \cdots a_{j^*+l}^2 = D_{j^*}^{j^*} / D_{j^*+l}^{j^*+l}$, so (40) and $D_{j^*+l} \le D_{j_1}$ give
$$\sum_{i=j^*+1}^{j^*+l} \big( D_{j_1} a_i^2 - 1 \big) \ge l \Big( \Big( \frac{D_{j_1}^{l}\, D_{j^*}^{j^*}}{D_{j^*+l}^{j^*+l}} \Big)^{1/l} - 1 \Big) \ge l \Big( \Big( \frac{D_{j^*}}{D_{j_1}} \Big)^{j^*/l} - 1 \Big).$$
Writing $t := D_{j_1}/D_{j^*} \ge 1$, the function $t \mapsto j^*(t - 1) + l(t^{-j^*/l} - 1)$ vanishes at $t = 1$ and has non-negative derivative for $t \ge 1$, so
$$j^* \Big( \frac{D_{j_1}}{D_{j^*}} - 1 \Big) + \sum_{i=j^*+1}^{j^*+l} \big( D_{j_1} a_i^2 - 1 \big) \ge 0 \quad \text{for every } 1 \le l \le j_1 - j^*.$$
Since $\lambda_{j^*+1} \ge \lambda_{j^*+2} \ge \cdots \ge \lambda_{j_1}$ and, by the last display, every partial sum of the sequence $j^*(D_{j_1}/D_{j^*} - 1),\ D_{j_1} a_{j^*+1}^2 - 1,\ D_{j_1} a_{j^*+2}^2 - 1, \ldots$ is non-negative, weighting these terms by the non-increasing weights $\lambda_{j^*+1}, \lambda_{j^*+1}, \lambda_{j^*+2}, \ldots$ (exactly as in the proof of Lemma 12) yields
$$F_1 = P_1 + P_2 \ge j^* \lambda_{j^*+1} \Big( \frac{D_{j_1}}{D_{j^*}} - 1 \Big) + \sum_{i=j^*+1}^{j_1} \lambda_i \big( D_{j_1} a_i^2 - 1 \big) \ge 0. \tag{70}$$

To prove $F_2 \ge 0$: using Lemma 12 again, we get
$$F_2 \ge 0. \tag{71}$$
Combining (70) and (71), we get
$$B\big( (a_s), n, j_1 \big) - B\big( (a_s^*), n, j^* \big) \ge 0 \quad \text{for } j_1 > j^*. \tag{72}$$
Combining (57) and (72), (45) is proved.
STEP FOUR

We supposed that $(a_s^*) \in A_{j^*}$ in the above proof. Now let us show it. Write $D_j^* := (a_1^* \cdots a_j^*\, n)^{-2/j}$. First, for $j > j^*$: since $a_i^* = \beta^{-1/2}$ for $i > j^*$ and $(a_1^* \cdots a_{j^*}^*\, n) = \beta^{-j^*/2}$, we get $(a_1^* \cdots a_j^*\, n) = \beta^{-j/2}$ and hence
$$D_j^* = \beta = D_{j^*}^*.$$
Second, for $j \le j^*$: here $D_j^* = (\lambda_1 \cdots \lambda_j)^{1/j}\, n^{-2/j} = \beta_j$, and by the minimality in (47) we have $\lambda_{j+1} > \beta_j$ for $j < j^*$, whence $\beta_{j+1}^{j+1} = \beta_j^{j}\, \lambda_{j+1} > \beta_j^{j+1}$, i.e. $\beta_j < \beta_{j+1}$. Thus
$$D_j^* = \beta_j \le \beta_{j^*} = D_{j^*}^* \quad \text{for } j \le j^*.$$
Together, $\sup_{i \in \mathbb{N}} (a_1^* \cdots a_i^*\, n)^{-1/i}$ is attained at $i = j^*$, so $(a_s^*) \in A_{j^*}$.

We can also show $(\sqrt{\lambda_s}\, a_s^*)_s \in \ell_2$:
$$\big\| (\sqrt{\lambda_s}\, a_s^*)_s \big\|_{\ell_2} = \sqrt{ \sum_{i=1}^{\infty} \lambda_i (a_i^*)^2 } = \sqrt{ j^* + \frac{1}{\beta} \sum_{i=j^*+1}^{\infty} \lambda_i }. \tag{73}$$
When $k(x, y)$ and $n$ are given, $(\lambda_i)$ and $j^*$ are determined, so $\beta = (\lambda_1 \cdots \lambda_{j^*}\, n^{-2})^{1/j^*}$ is a constant. By Mercer's theorem, $(\lambda_i)_i \in \ell_1$ and thus $\sum_{i=j^*+1}^{\infty} \lambda_i$ is finite. So (73) is finite. Hence $(\sqrt{\lambda_s}\, a_s^*)_s \in \ell_2$ is proved.
CONCLUSION

Following the proof above, we get

Corollary 14 Suppose $A_j$ and $B((a_s), n, j)$ are defined as above. Then we have
$$B\big( (a_s^*), n, j^* \big) = \inf_{j \in \mathbb{N}} \ \inf_{(a_s) \in A_j} B\big( (a_s), n, j \big), \tag{74}$$
where
$$a_i^* = \begin{cases} \dfrac{1}{\sqrt{\lambda_i}} & \text{when } i \le j^*, \\[2mm] \Big( \big( \lambda_1 \cdots \lambda_{j^*}\, n^{-2} \big)^{1/j^*} \Big)^{-1/2} & \text{when } i > j^*, \end{cases} \tag{75}$$
$$j^* = \min\Big\{ j : \lambda_{j+1} \le \big( \lambda_1 \cdots \lambda_j\, n^{-2} \big)^{1/j} \Big\}. \tag{76}$$
Theorem 1 is then established.
B THE PROOF THAT INEQUALITY (31) CANNOT BE IMPROVED

Lemma 15 Suppose $A_j$ and $B((a_s), n, j)$ are defined as above. Let $j \in \mathbb{N}$ and $(a_s) \in A_j$. Suppose $j^*$ and $(a_s^*)$ exist. Then
$$\inf_{(a_s)_s :\, (\sqrt{\lambda_s} a_s)_s \in \ell_2} \ \sup_{j \in \mathbb{N}} B\big( (a_s), n, j \big) = \inf_{j \in \mathbb{N}} \ \inf_{(a_s) \in A_j} B\big( (a_s), n, j \big). \tag{77}$$

Proof Let us prove
$$\inf_{(a_s)_s :\, (\sqrt{\lambda_s} a_s)_s \in \ell_2} \ \sup_{j \in \mathbb{N}} B\big( (a_s), n, j \big) \ge \inf_{j \in \mathbb{N}} \ \inf_{(a_s) \in A_j} B\big( (a_s), n, j \big). \tag{78}$$
Choose an $(a_s^\circ)$ to realise the infimum on the left-hand side; then $(a_s^\circ)_s \in A_{j^\circ}$, where $j^\circ$ is the $j$ that realises the inner supremum. Then
$$\inf_{(a_s)_s :\, (\sqrt{\lambda_s} a_s)_s \in \ell_2} \ \sup_{j \in \mathbb{N}} B\big( (a_s), n, j \big) = \sup_{j \in \mathbb{N}} B\big( (a_s^\circ), n, j \big) = B\big( (a_s^\circ), n, j^\circ \big) \ge \inf_{(a_s) \in A_{j^\circ}} B\big( (a_s), n, j^\circ \big) \ge \inf_{j \in \mathbb{N}} \ \inf_{(a_s) \in A_j} B\big( (a_s), n, j \big).$$
We have already proved (Lemma 11) that
$$\inf_{(a_s)_s :\, (\sqrt{\lambda_s} a_s)_s \in \ell_2} \ \sup_{j \in \mathbb{N}} B\big( (a_s), n, j \big) \le \inf_{j \in \mathbb{N}} \ \inf_{(a_s) \in A_j} B\big( (a_s), n, j \big).$$
So equation (77) is proved to be true.
Acknowledgements

This work was supported by the Australian Research Council. Thanks to Bernhard Schölkopf and Alex J. Smola for useful discussions.