Covering Numbers for Support Vector Machines

Ying Guo
Department of Engineering
Australian National University
Canberra 0200, Australia
guo@hilbert.anu.edu.au

Peter L. Bartlett
RSISE
Australian National University
Canberra 0200, Australia
Peter.Bartlett@anu.edu.au

John Shawe-Taylor
Department of Computer Science
Royal Holloway College,
University of London
Egham, TW20 0EX, UK
jst@dcs.rhbnc.ac.uk

Robert C. Williamson
Department of Engineering
Australian National University
Canberra 0200, Australia
Bob.Williamson@anu.edu.au
Abstract

Support vector machines are a type of learning machine based on maximum margin hyperplanes. Until recently, the only bounds on the generalization performance of SV machines (within the PAC framework) were obtained via bounds on the fat-shattering dimension of maximum margin hyperplanes. This result took no account of the kernel used. More recently, it has been shown [8] that one can bound the relevant covering numbers using some tools from functional analysis. The resulting bound is quite complex and seemingly difficult to compute. In this paper we show that the bound can be greatly simplified and, as a consequence, we are able to determine some interesting quantities (such as the effective number of dimensions used). The new bound is a quite simple formula involving the eigenvalues of the integral operator induced by the kernel. We present an explicit calculation of covering numbers for an SV machine using a Gaussian kernel which is significantly better than that implied by the maximum margin fat-shattering result.
1 INTRODUCTION

Support Vector (SV) machines [5] are learning algorithms based on maximum margin hyperplanes [4] which make use of an implicit mapping into feature space by using a more general kernel function in place of the standard inner product. Consequently one can apply an analysis for the maximum margin algorithm directly to SV machines. However, such a process completely ignores the effect of the kernel. Intuitively one would expect that a "smoother" kernel would somehow reduce the capacity of the learning machine, thus leading to better bounds on generalization error if the machine can attain a small training error.

In [9, 8] it has been shown that this intuition is justified. The main result there (quoted below) gives a bound on the covering numbers for the class of functions computed by support vector machines. This bound, along with statistical results of the form given in [7], yields bounds that do explicitly depend on the kernel used.

In the traditional viewpoint of statistical learning theory, one is given a class of functions $F$, and the generalization performance attainable using $F$ is determined via the covering numbers $N(\epsilon, F)$ (precise definitions are given below). Many generalization error bounds can be expressed in terms of $N(\epsilon, F)$. The main method of bounding $N(\epsilon, F)$ has been to use the Vapnik-Chervonenkis dimension or one of its generalizations (see [1] for an overview).

In [9, 8] an alternative viewpoint is taken, where the class $F$ is viewed as being generated by an integral operator induced by the kernel. Properties of this operator are used to bound the required covering numbers. The result is in a form that is not particularly easy to use (see (13) below).

The main technical result of this paper is an explicit reformulation of this bound which is amenable to direct calculation. We illustrate the new result by bounding the covering numbers of SV machines which use Gaussian RBF kernels. The result shows the influence of $\sigma$ on the covering numbers: the covering numbers decrease when $\sigma$ increases. Here $\sigma^2$ is the variance of the Gaussian function used for the kernel. More generally, the main result makes model order selection possible using any parametrized family of kernel functions: we can describe precisely how the capacity of the class is affected by changes to the kernel.
For $d \in \mathbb{N}$, $\mathbb{R}^d$ denotes the $d$-dimensional space of vectors $x = (x_1, \ldots, x_d)$. For $1 \le p \le \infty$, define the spaces

\[ \ell_p^d := \{ x \in \mathbb{R}^d : \|x\|_{\ell_p^d} < \infty \}, \]

where the $\ell_p$ norms are

\[ \|x\|_{\ell_p^d} := \Bigl( \sum_{j=1}^{d} |x_j|^p \Bigr)^{1/p} \quad \text{for } 1 \le p < \infty, \qquad \|x\|_{\ell_\infty^d} := \max_{1 \le j \le d} |x_j| \quad \text{for } p = \infty. \]

For $d = \infty$, we write $\ell_p := \ell_p^\infty$ and the norms are defined similarly (by formally substituting $\infty$ for $d$ in the above definitions).
The $\epsilon$ covering number of $F$ with respect to a metric $d$, denoted $N(\epsilon, F, d)$, is the size of the smallest $\epsilon$-cover for $F$ using the metric $d$. Given $m$ points $x_1, \ldots, x_m \in \ell_p^d$, we use the shorthand $X^m = (x_1, \ldots, x_m)$. Suppose $F$ is a class of functions defined on $\mathbb{R}^d$. The $\ell_\infty$ norm with respect to $X^m$ of $f \in F$ is defined as $\|f\|_{\ell_\infty^{X^m}} := \max_{1 \le i \le m} |f(x_i)|$. The input space is taken to be $X$, a compact subset of $\mathbb{R}^d$.
Our main result is a bound for the covering number of SV machines. We only discuss the case when $d = 1$. (In fact the result does hold for general $d$; see the discussion in the conclusion.)

Let $k : X \times X \to \mathbb{R}$ be a kernel satisfying the hypotheses of Mercer's theorem (Theorem 2). Given $m$ points $x_1, \ldots, x_m \in X$, denote by $F_{R_w}$ the hypothesis class implemented by SV machines on an $m$-sample with weight vector (in feature space) bounded by $R_w$:

\[ F_{R_w} := \Bigl\{ x \mapsto \sum_{i=1}^{m} \alpha_i k(x, x_i) : \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) \le R_w^2 \Bigr\}.   (1) \]

Let $\lambda_1 \ge \lambda_2 \ge \cdots$ be the eigenvalues of the integral operator $T_k : L_2(X) \to L_2(X)$,

\[ (T_k f)(x) = \int_X k(x, y) f(y)\, dy, \]

and denote by $\phi_n$, $n \in \mathbb{N}$, the corresponding eigenfunctions. (See the next section for a reminder of what this means.) For translation invariant kernels (such as $k(x, y) = \exp(-(x - y)^2/\sigma^2)$), the eigenvalues are given by

\[ \lambda_j = \sqrt{2\pi}\, K(j)   (2) \]

for $j \in \mathbb{Z}$, where $K = F[k]$ is the Fourier transform of $k$ (see [9, 8] for further details). For a smooth kernel, the Fourier transform decreases faster. (There are fewer "high frequency components.") Thus for smooth kernels, $\lambda_i$ decreases to zero rapidly with increasing $i$.
Theorem 1 (Main Result) Suppose $k$ is a kernel satisfying the hypotheses of Mercer's theorem. The hypothesis class $F_{R_w}$, eigenfunctions $\phi_n$ and eigenvalues $\lambda_i$ are defined as above. Let $x_1, \ldots, x_m \in X$ be $m$ data points. Let

\[ C_k := \sup_{n} \|\phi_n\|_{L_\infty}. \]

For $n \in \mathbb{N}$ set

\[ \epsilon_n^* := 6 R_w C_k \sqrt{ j^* \bigl( \lambda_1 \cdots \lambda_{j^*} \bigr)^{1/j^*} n^{-2/j^*} + \sum_{i = j^*+1}^{\infty} \lambda_i }   (3) \]

with

\[ j^* := \min\bigl\{ j : \lambda_{j+1} \le (\lambda_1 \cdots \lambda_j)^{1/j} n^{-2/j} \bigr\}. \]

Then $C_k < \infty$ and

\[ \sup_{x_1, \ldots, x_m \in X} N(\epsilon_n^*, F_{R_w}, \ell_\infty^{X^m}) \le n. \]

The quantity $\epsilon_n^*$ is an upper bound on the $n$-th entropy number of $F_{R_w}$; the entropy numbers are the functional inverse of the covering numbers. In this theorem, the number $j^*$ has a natural interpretation: for a given value of $n$, it can be viewed as the effective dimension of the function class. Clearly, this effective dimension depends on the rate of decay of the eigenvalues. As expected, for smooth kernels (which have rapidly decreasing eigenvalues), the effective dimension is small. It turns out that all kernels satisfying Mercer's conditions are sufficiently smooth for $j^*$ to be finite.
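To make the formula concrete, the following sketch (an illustration only, not from the paper; the eigenvalue sequence, $n$, $R_w$ and $C_k$ below are arbitrary choices) computes $j^*$ and the bound $\epsilon_n^*$ in the form of (3) from a truncated nonincreasing eigenvalue sequence:

```python
import math

def effective_dim_and_eps(lambdas, n, R_w=1.0, C_k=1.0):
    """Compute j* and the entropy-number bound eps_n* in the form of (3)
    from a truncated, nonincreasing eigenvalue sequence [lambda_1, ...]."""
    log_prod = 0.0
    for j in range(1, len(lambdas)):
        log_prod += math.log(lambdas[j - 1])
        geo = math.exp(log_prod / j)          # (lambda_1 ... lambda_j)^(1/j)
        # j* is the smallest j with lambda_{j+1} <= geo * n^(-2/j)
        if lambdas[j] <= geo * n ** (-2.0 / j):
            tail = sum(lambdas[j:])           # approximates sum_{i > j*} lambda_i
            eps = 6.0 * R_w * C_k * math.sqrt(j * geo * n ** (-2.0 / j) + tail)
            return j, eps
    raise ValueError("eigenvalue truncation too short to locate j*")

# Example: a geometrically decaying spectrum lambda_i = 2^(-i).
j_star, eps = effective_dim_and_eps([2.0 ** -(i + 1) for i in range(60)], n=10)
```

For rapidly decaying spectra the effective dimension $j^*$ stays tiny even for moderately large $n$, which is exactly the behaviour the theorem describes.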
The remainder of the paper is organized as follows. We start by introducing notation and definitions (Section 2). Section 3 contains the main result (the proof is in Appendix A). Section 4 contains an example application of the main result. Section 5 concludes.
2 DEFINITIONS AND PREVIOUS
RESULTS
Let $L(E, F)$ be the set of all bounded linear operators $T$ between the normed spaces $(E, \|\cdot\|_E)$ and $(F, \|\cdot\|_F)$, i.e. operators such that the image of the (closed) unit ball

\[ U_E := \{ x \in E : \|x\|_E \le 1 \}   (4) \]

is bounded. The smallest such bound is called the operator norm,

\[ \|T\| := \sup_{x \in U_E} \|Tx\|_F.   (5) \]

The $n$-th entropy number of a set $M \subset E$, for $n \in \mathbb{N}$, is

\[ \epsilon_n(M) := \inf\{ \epsilon : \text{there exists an } \epsilon\text{-cover for } M \text{ in } E \text{ containing } n \text{ or fewer points} \}.   (6) \]

(The function $n \mapsto \epsilon_n(M)$ can be thought of as the functional inverse of the function $\epsilon \mapsto N(\epsilon, M, d)$, where $d$ is the metric induced by $\|\cdot\|_E$.) The entropy numbers of an operator $T \in L(E, F)$ are defined as

\[ \epsilon_n(T) := \epsilon_n(T(U_E)).   (7) \]

Note that $\epsilon_1(T) = \|T\|$, and that $\epsilon_n(T)$ certainly is well defined for all $n \in \mathbb{N}$ if $T$ is a compact operator, i.e. if $\overline{T(U_E)}$ is compact.
In the following, $k$ will always denote a kernel, and $d$ and $m$ will be the input dimensionality and the number of training examples, respectively, so that the training data is a sequence

\[ (x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^d \times \mathbb{R}.   (8) \]

Let $\log$ denote the logarithm to base 2.

We will map the input data into a feature space $S$ via a mapping $\Phi$. We let $\bar{x} := \Phi(x)$, and

\[ F_{R_w} = \{ \langle w, \bar{x} \rangle : \bar{x} \in S,\ \|w\| \le R_w \} \subset \mathbb{R}^S. \]
Given a class of functions $F$, the generalization performance attainable using $F$ can be bounded in terms of the covering numbers of $F$. More precisely, for some set $X$, and $x_i \in X$ for $i = 1, \ldots, m$, define the growth function of the function class $F$ on $X$ as

\[ \mathcal{N}_m(\epsilon, F) := \sup_{(x_1, \ldots, x_m) \in X^m} N(\epsilon, F, \ell_\infty^{X^m}),   (9) \]

where $N(\epsilon, F, \ell_\infty^{X^m})$ is the $\epsilon$ covering number of $F$ with respect to $\ell_\infty^{X^m}$. Many generalization error bounds can be expressed in terms of $\mathcal{N}_m(\epsilon, F)$.
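For a finite class of functions evaluated on the $m$ points, the covering number appearing in (9) can be estimated directly. The following sketch (an illustration, not part of the paper) computes a greedy upper bound on $N(\epsilon, F, \ell_\infty^{X^m})$, representing each function by its vector of values on $X^m$:

```python
def greedy_cover_size(values, eps):
    """Greedy upper bound on the covering number N(eps, F, l_inf^{X^m}) of a
    finite class: values[f] is the vector (f(x_1), ..., f(x_m))."""
    dist = lambda u, v: max(abs(a - b) for a, b in zip(u, v))  # l_inf metric
    remaining = list(range(len(values)))
    size = 0
    while remaining:
        # pick the member function whose eps-ball covers most uncovered ones
        best = max(remaining,
                   key=lambda c: sum(dist(values[c], values[r]) <= eps
                                     for r in remaining))
        remaining = [r for r in remaining if dist(values[best], values[r]) > eps]
        size += 1
    return size
```

The centers are restricted to the class itself, which still yields a valid $\epsilon$-cover and hence an upper bound on $N(\epsilon, F, \ell_\infty^{X^m})$.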
Given some set $X$, some $1 \le p < \infty$ and a function $f : X \to \mathbb{R}$, we define $\|f\|_{L_p(X)} := \bigl( \int_X |f(x)|^p\, dx \bigr)^{1/p}$ if the integral exists, and $\|f\|_{L_\infty(X)} := \operatorname{ess\,sup}_{x \in X} |f(x)|$. For $1 \le p \le \infty$, we let $L_p(X) := \{ f : X \to \mathbb{R} : \|f\|_{L_p(X)} < \infty \}$. We sometimes write $L_p := L_p(X)$.
Suppose $T : E \to E$ is a linear operator mapping a normed space $E$ into itself. We say that $x \in E$ is an eigenvector if for some scalar $\lambda$, $Tx = \lambda x$. Such a $\lambda$ is called the eigenvalue associated with $x$. When $E$ is a function space (e.g. $E = L_2(X)$), the eigenvectors are of course functions, and are usually called eigenfunctions. Thus $\phi_n$ is an eigenfunction of $T : L_2(X) \to L_2(X)$ if $T\phi_n = \lambda_n \phi_n$. In general $\lambda$ is complex, but in this paper all eigenvalues are real (because of the symmetry of the kernels used to induce the operators).

We will make use of Mercer's theorem. The version stated below is a special case of the theorem proven in [6, p. 145].
Theorem 2 (Mercer) Suppose $k \in L_\infty(X^2)$ is a symmetric kernel such that the integral operator $T_k : L_2(X) \to L_2(X)$,

\[ (T_k f)(x) := \int_X k(x, y) f(y)\, dy,   (10) \]

is positive. Let $\phi_j \in L_2(X)$ be the eigenfunction of $T_k$ associated with the eigenvalue $\lambda_j \neq 0$, normalized such that $\|\phi_j\|_{L_2} = 1$, and let $\bar{\phi}_j$ denote its complex conjugate. Suppose $\phi_j$ is continuous for all $j \in \mathbb{N}$. Then

1. $(\lambda_j(T_k))_j \in \ell_1$.

2. $\phi_j \in L_\infty(X)$ and $\sup_j \|\phi_j\|_{L_\infty} < \infty$.

3. $k(x, y) = \sum_{j \in \mathbb{N}} \lambda_j \bar{\phi}_j(x) \phi_j(y)$ holds for all $(x, y)$, where the series converges absolutely and uniformly for all $(x, y)$.
We will call a kernel satisfying the conditions of this theorem
a Mercer kernel.Fromstatement 2 of MercerÕs theoremthere
exists some constant
C
k
R
depending on
k
such that
j
j
x j C
k
for all
j N
and
x X
(11)
This conclusion is the only reason we have added the condi
tion that
n
is continuous;it is not necessary for the theorem
as stated,but it is convenient to bundle all of our assumptions
into the one place.In any case it is not a very restrictive as
sumption:if
X
is compact and
k
is continuous,then
j
is
automatically continuous (see e.g.[3]).Alternatively,if
k
is
translation invariant,then
j
are scaled cosine functions and
thus continuous.
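The eigenvalues $\lambda_j$ and the constant $C_k$ of (11) are the only kernel-dependent quantities in what follows. For intuition they can be approximated numerically; the sketch below (an illustration, not from the paper; the kernel, domain and grid size are arbitrary choices) uses the standard Nystrom idea: the eigenvalues of the Gram matrix on a uniform grid, scaled by the quadrature weight, approximate the eigenvalues of $T_k$.

```python
import numpy as np

def approx_operator_spectrum(kernel, a, b, m):
    """Nystrom approximation: eigenvalues of the Gram matrix on a uniform
    grid of [a, b), scaled by the quadrature weight (b - a)/m, approximate
    the eigenvalues of T_k f = int_a^b k(., y) f(y) dy."""
    x = np.linspace(a, b, m, endpoint=False)
    K = kernel(x[:, None], x[None, :])       # m x m symmetric Gram matrix
    eig = np.linalg.eigvalsh(K)              # real spectrum (symmetric matrix)
    return np.sort(eig)[::-1] * (b - a) / m  # nonincreasing, quadrature-scaled

# Example: a Gaussian kernel on [0, 1); its spectrum decays very rapidly.
lam = approx_operator_spectrum(lambda s, t: np.exp(-(s - t) ** 2), 0.0, 1.0, 200)
```

Since $k(x, x) = 1$ here, the approximate eigenvalues sum to the trace $\int_0^1 k(x, x)\,dx = 1$, which gives a quick sanity check on the computation.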
In [8] an upper bound on the entropy numbers was given in terms of the eigenvalues of the kernel used. The result is in terms of the entropy numbers of a scaling operator $A$. The notation $(a_s)_s \in \ell_p$ denotes a sequence $(a_1, a_2, \ldots)$ such that $\sum_s |a_s|^p < \infty$.
Theorem 3 (Entropy numbers) Let $k : X \times X \to \mathbb{R}$ be a Mercer kernel. Choose $a_j > 0$ for $j \in \mathbb{N}$ such that $(\sqrt{\lambda_s}/a_s)_s \in \ell_2$, and define $A : \ell_2 \to \ell_2$ by

\[ A : (x_j)_j \mapsto R_A\, (a_j x_j)_j   (12) \]

with $R_A := C_k \bigl\| (\sqrt{\lambda_j}/a_j)_j \bigr\|_{\ell_2}$. Then

\[ \epsilon_n(A) \le \sup_{j \in \mathbb{N}} 6 C_k \bigl\| (\sqrt{\lambda_s}/a_s)_s \bigr\|_{\ell_2} \bigl( a_1 a_2 \cdots a_j \bigr)^{1/j} n^{-1/j}.   (13) \]
This result leads to the following bounds for SV classes.

Theorem 4 (Bounds for SV classes) Let $k$ be a Mercer kernel. Then for all $n \in \mathbb{N}$,

\[ \epsilon_n(F_{R_w}) \le R_w \inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2} \epsilon_n(A),   (14) \]

where $A$ is defined as in Theorem 3.

Combining Equations (13) and (14) gives effective bounds on $\mathcal{N}_m(\epsilon, F_{R_w})$, since

\[ \epsilon_n(F_{R_w}) \le \epsilon \quad \Longrightarrow \quad \mathcal{N}_m(\epsilon, F_{R_w}) \le n. \]

These results thus give a method to obtain bounds on the entropy numbers for kernel machines. In Inequality (14), we can choose $(a_s)_s$ to optimize the bound. The key technical contribution of this paper is the explicit determination of the best choice of $(a_s)_s$.
We assume henceforth that $(\lambda_s)_s$ is fixed and sorted in nonincreasing order, and $a_i > 0$ for all $i$. For $j \in \mathbb{N}$, we define the set

\[ A_j := \Bigl\{ (a_s)_s : \sup_{i \in \mathbb{N}} (a_1 \cdots a_i)^{1/i} n^{-1/i} = (a_1 \cdots a_j)^{1/j} n^{-1/j} \Bigr\}.   (15) \]

In other words, $A_j$ is the set of $(a_s)_s$ such that the $\sup_{i \in \mathbb{N}} (a_1 \cdots a_i)^{1/i} n^{-1/i}$ is attained at $i = j$.

Let

\[ B((a_s), n, j) := \bigl\| (\sqrt{\lambda_s}/a_s)_s \bigr\|_{\ell_2} (a_1 \cdots a_j)^{1/j} n^{-1/j}, \]

where for notational simplicity, we write $(a_s)$ for $(a_s)_s$.
3 THE OPTIMAL CHOICE OF $(a_s)_s$ AND $j$

Our main aim in this section is to show that the infimum in (14) and the supremum in (13) can be achieved, and to give an explicit recipe for the sequence $(a_s^*)$ and number $j^*$ that achieve them. The main technical theorem is as follows.
Theorem 5 Let $k : X \times X \to \mathbb{R}$ be a Mercer kernel. Suppose $\lambda_1 \ge \lambda_2 \ge \cdots$ are the eigenvalues of $T_k$. For any $n \in \mathbb{N}$, the minimum

\[ j^* := \min\bigl\{ j : \lambda_{j+1} \le (\lambda_1 \cdots \lambda_j)^{1/j} n^{-2/j} \bigr\}   (16) \]

always exists, and

\[ \inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2}\ \sup_{j \in \mathbb{N}} B((a_s), n, j) = B((a_s^*), n, j^*), \]

where

\[ a_i^* := \begin{cases} \sqrt{\lambda_i} & \text{when } i \le j^* \\ (\lambda_1 \cdots \lambda_{j^*})^{1/(2 j^*)}\, n^{-1/j^*} & \text{when } i > j^*. \end{cases}   (17) \]

This choice of $(a_s^*)$ results in a simple form for the bound of (14) in terms of $n$ and $\lambda_i$:
Corollary 6 Let $k : X \times X \to \mathbb{R}$ be a Mercer kernel and let $A$ be given by (12). Then for any $n \in \mathbb{N}$, the entropy numbers satisfy

\[ \inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2} \epsilon_n(A) \le 6 C_k \sqrt{ j^* (\lambda_1 \cdots \lambda_{j^*})^{1/j^*} n^{-2/j^*} + \sum_{i = j^*+1}^{\infty} \lambda_i }   (18) \]

with $j^* = \min\{ j : \lambda_{j+1} \le (\lambda_1 \cdots \lambda_j)^{1/j} n^{-2/j} \}$.

This corollary, together with (14), implies Theorem 1.
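The bound (18) and the optimal sequence (17) are straightforward to evaluate numerically. The following sketch (an illustration, not from the paper; the eigenvalue sequence, $n$ and the truncation length are arbitrary choices) evaluates $6 C_k \|(\sqrt{\lambda_s}/a_s)_s\|_{\ell_2} \sup_j (a_1 \cdots a_j)^{1/j} n^{-1/j}$ for a given sequence $(a_s)$ and compares the choice (17) with a naive constant choice:

```python
import math

def bound_B(lambdas, a, n, C_k=1.0):
    """6 C_k ||(sqrt(lambda_s)/a_s)_s||_2 * sup_j (a_1...a_j)^(1/j) n^(-1/j),
    evaluated over a common finite truncation of the two sequences."""
    norm = math.sqrt(sum(l / (ai * ai) for l, ai in zip(lambdas, a)))
    log_prod, sup_term = 0.0, 0.0
    for j, aj in enumerate(a, start=1):
        log_prod += math.log(aj)
        sup_term = max(sup_term, math.exp(log_prod / j) * n ** (-1.0 / j))
    return 6.0 * C_k * norm * sup_term

def optimal_a(lambdas, n):
    """The sequence of (17): a_i = sqrt(lambda_i) for i <= j*, then the
    constant (lambda_1...lambda_{j*})^(1/(2 j*)) n^(-1/j*)."""
    log_prod = 0.0
    for j in range(1, len(lambdas)):
        log_prod += math.log(lambdas[j - 1])
        if lambdas[j] <= math.exp(log_prod / j) * n ** (-2.0 / j):  # j* found
            const = math.exp(log_prod / (2 * j)) * n ** (-1.0 / j)
            return [math.sqrt(l) for l in lambdas[:j]] + [const] * (len(lambdas) - j)
    raise ValueError("truncation too short to find j*")

lam = [2.0 ** -(i + 1) for i in range(60)]
naive = bound_B(lam, [1.0] * len(lam), n=10)   # a_i = 1 for all i
opt = bound_B(lam, optimal_a(lam, 10), n=10)   # the choice of (17)
```

For $\lambda_i = 2^{-i}$ and $n = 10$, the sequence of (17) gives a strictly smaller bound than $a_i = 1$, as Theorem 5 predicts.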
PROOF OUTLINE

The proof of Theorem 5 is quite long and is in Appendix A. It involves the following four steps.

1. We first prove that for all $n \in \mathbb{N}$,

\[ j^* = \min\bigl\{ j : \lambda_{j+1} \le (\lambda_1 \cdots \lambda_j)^{1/j} n^{-2/j} \bigr\}   (19) \]

exists, whenever $(\lambda_i)$ are the eigenvalues of a Mercer kernel.

2. We then prove that for any $n \in \mathbb{N}$,

\[ \inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2}\ \sup_{j \in \mathbb{N}} B((a_s), n, j) \le \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j).   (20) \]

3. The next step is to prove that the choice of $(a_s^*)$ and $j^*$ described by (16) and (17) is optimal. It is separated into two parts:

(a) For any $j \le j^*$ and any $(a_s) \in A_j$, $B((a_s^*), n, j^*) \le B((a_s), n, j)$ holds.

(b) For any $j > j^*$ and any $(a_s) \in A_j$, $B((a_s^*), n, j^*) \le B((a_s), n, j)$ also holds.

4. Finally we show that $(a_s^*) \in A_{j^*}$ and $(\sqrt{\lambda_s}/a_s^*)_s \in \ell_2$ when $(a_s^*)$ is chosen according to (17).
4 EXAMPLE

We illustrate the results of this paper with an example. Consider the kernel $k(x, y) = k(x - y)$ where $k(x) = e^{-x^2/\sigma^2}$. For such kernels (RBF kernels), $k(x, x) = 1$ for all $x \in X$. Thus the class (1) can be written as

\[ F_{R_w} = \{ \langle w, \bar{x} \rangle : \bar{x} \in S,\ \|\bar{x}\| = 1,\ \|w\| \le R_w \}. \]

One can use the fat-shattering dimension to bound the covering number of the class of functions $F_{R_w}$ (see [2]).

Lemma 7 With $F_{R_w}$ as above,

\[ \mathrm{fat}_{F_{R_w}}(\epsilon) \le \Bigl( \frac{R_w}{\epsilon} \Bigr)^2.   (21) \]

Theorem 8 If $F$ is a class of functions mapping from a set $X$ into the interval $[0, B]$, then for any $\epsilon > 0$, if $m \ge \mathrm{fat}_F(\epsilon/4)$,

\[ \log \mathcal{N}_m(\epsilon, F) = O\Bigl( \mathrm{fat}_F(\epsilon/4)\, \log^2 \frac{eBm}{\epsilon} \Bigr).   (22) \]

Combining these results we have the following bound, with which we shall compare our new bound:

\[ \log \mathcal{N}_m(\epsilon, F_{R_w}) = O\Bigl( \Bigl( \frac{R_w}{\epsilon} \Bigr)^2 \log^2 \frac{eBm}{\epsilon} \Bigr).   (23) \]
In order to determine the eigenvalues of $T_k$, we need to periodize the kernel. This periodization is necessary in order to get a discrete set of eigenvalues, since $k(x)$ has infinite support (see [9] for further details). For the purpose of the present paper, we can assume a fixed period $\Lambda$, for some $\Lambda > 0$. Since the kernel is translation invariant, the eigenfunctions are $\phi_n(x) = \sqrt{2/\Lambda} \cos(2\pi n x/\Lambda)$, and so $C_k \le \sqrt{2/\Lambda}$. The $\sqrt{2/\Lambda}$ comes from the requirement in Theorem 2 that $\|\phi_j\|_{L_2} = 1$. The eigenvalues are

\[ \lambda_j = \frac{\sqrt{\pi}\, \sigma}{\Lambda}\, e^{-(\pi \sigma j/\Lambda)^2}. \]

Setting $c_1 := \sqrt{\pi}\sigma/\Lambda$ and $c_2 := (\pi\sigma/\Lambda)^2$, the eigenvalues can be written as

\[ \lambda_j = c_1 e^{-c_2 j^2}.   (24) \]
From (16), we know that $\lambda_{j+1} \le (\lambda_1 \cdots \lambda_j)^{1/j} n^{-2/j}$ implies $j \ge j^*$. But (24) shows that this condition on the eigenvalues is equivalent to

\[ c_1 e^{-c_2 (j+1)^2} \le \Bigl( \prod_{i=1}^{j} c_1 e^{-c_2 i^2} \Bigr)^{1/j} n^{-2/j} = c_1 e^{-c_2 (j+1)(2j+1)/6}\, n^{-2/j},   (25) \]

which is equivalent to

\[ c_2 (j+1)^2 \ge c_2 \frac{(j+1)(2j+1)}{6} + \frac{2 \ln n}{j}, \quad \text{i.e.} \quad c_2\, j (j+1)(4j+5) \ge 12 \ln n, \]

which follows from $j \ge (3 \ln n / c_2)^{1/3}$. Hence,

\[ j^* = O\bigl( (\ln n)^{1/3} \bigr).   (26) \]
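The scaling (26) can be checked numerically by computing $j^*$ directly from (16) for the eigenvalues (24). The sketch below is an illustration only ($c_1$, $c_2$ and the values of $n$ are arbitrary choices):

```python
import math

def j_star_gaussian(n, c1, c2, j_max=200):
    """Smallest j with lambda_{j+1} <= (lambda_1...lambda_j)^(1/j) n^(-2/j),
    computed in log-space for eigenvalues lambda_j = c1 * exp(-c2 * j**2)."""
    log_lam = lambda j: math.log(c1) - c2 * j * j
    log_prod = 0.0
    for j in range(1, j_max):
        log_prod += log_lam(j)
        if log_lam(j + 1) <= log_prod / j - 2.0 * math.log(n) / j:
            return j
    raise ValueError("no j* found below j_max")
```

For $c_2 = 1$, the computed $j^*$ stays within one of $(3 \ln n / c_2)^{1/3}$ over many decades of $n$, in line with (26).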
We can now use (18) to give an upper bound on $\epsilon_n$. The tail $\sum_{i > j^*} \lambda_i$ in (18) is dominated by its first term, hence we obtain the following bound:

\[ \epsilon_n = O\Bigl( \bigl( j^* (\lambda_1 \cdots \lambda_{j^*})^{1/j^*} n^{-2/j^*} + c_1 \exp(-c_2 (j^*+1)^2) \bigr)^{1/2} \Bigr). \]

Substituting (26) shows that

\[ \log \epsilon_n = O\bigl( -\log^{2/3} n + \log \log n \bigr).   (27) \]
Figure 1: $\epsilon_n$ versus $n$ for a Gaussian kernel, as given by Corollary 6.
Figure 2: $j^*$ versus $n$ for a Gaussian kernel.
We can get several results from Equation (27).

The relationship between $\epsilon_n$ and $n$. For fixed $\sigma$, (27) shows that

\[ \log(1/\epsilon_n) = \Omega\bigl( \log^{2/3} n \bigr), \]

which implies

\[ \log \mathcal{N}_m(\epsilon, F_{R_w}) = O\bigl( \log^{3/2}(1/\epsilon) \bigr),   (28) \]

which is considerably better than (23). This can also be seen in Figure 1.
Figure 3: $j^*$ versus $\sigma$ for a Gaussian kernel. Since $j^*$ can be interpreted as an "effective number of dimensions", this clearly illustrates why the bound on the covering numbers for Gaussian kernels grows so slowly as $\sigma$ decreases. Even at the smallest value of $\sigma$ shown, $j^*$ is only 13.
The relationship between $\sigma$ and $\epsilon_n$. Here, $\sigma^2$ is the variance of the Gaussian functions. When $\sigma$ increases, the kernel function will be wider, so the class $F_{R_w}$ should be simpler. In Equation (27), we notice that if $\sigma$ decreases, $\epsilon_n$ increases for fixed $n$; similarly, if $\sigma$ increases, $\epsilon_n$ decreases for fixed $n$. Since the entropy number (and the covering number) indicates the capacity of the learning machine, the more complicated the machine is, the bigger the covering number for fixed $n$. Specifically, we see from Equation (27) that, for fixed $n$, $\log \epsilon_n = O(-\sigma^{2/3})$ and that

\[ \log \mathcal{N}_m(\epsilon, F_{R_w}) = O\bigl( \sigma^{-1} \log^{3/2}(1/\epsilon) \bigr). \]

Figures 1 to 3 illustrate our bounds (for fixed values of the remaining parameters).
5 CONCLUSIONS

We have presented a new formula for bounding the covering numbers of support vector machines in terms of the eigenvalues of an integral operator induced by the kernel. We showed, by way of an example using a Gaussian kernel, that the new bound is easily computed and considerably better than previous results that did not take account of the kernel. We showed explicitly the effect of the choice of width of the kernel in this case.

The "effective number of dimensions", $j^*$, can illustrate the characteristics of the kernel functions clearly. For a smooth kernel, the "effective number of dimensions" $j^*$ is small. The value of $j^*$ depends on $n$, which in turn depends on $\epsilon$. Thus $j^*$ can be considered analogous to existing "scale-sensitive" dimensions, such as the fat-shattering dimension. A key difference is that we now have bounds for $j^*$ that explicitly depend on the kernel.

We have discussed the result for the situation where the input dimension is 1. The main complication arising when $d > 1$ is that repeated eigenvalues become generic for isotropic translation invariant kernels. This does not break the bounds as stated (as long as one properly counts the multiplicity of eigenvalues). However, it is possible to obtain bounds that can be tighter in some cases, by using a slightly more refined argument [9].
References

[1] M. Anthony. Probabilistic analysis of learning in artificial neural networks: The PAC model and its variants. Neural Computing Surveys, 1:1-47, 1997. http://www.icsi.berkeley.edu/~jagota/NCS.

[2] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[3] Robert Ash. Information Theory. Interscience Publishers, New York, 1965.

[4] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, 5th Annual ACM Workshop on COLT, pages 144-152, Pittsburgh, PA, 1992. ACM Press.

[5] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

[6] H. König. Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel, 1986.

[7] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.

[8] R. Williamson, A. Smola, and B. Schölkopf. Entropy numbers, operators and support vector kernels. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 127-144, Cambridge, MA, 1999. MIT Press.

[9] R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. NeuroCOLT NC-TR-98-019, Royal Holloway College, 1998.
A PROOF OF THEOREM 1

STEP ONE

As indicated above, we will first prove the existence of $j^*$, which is defined in (19).

Lemma 9 Suppose $\lambda_1 \ge \lambda_2 \ge \cdots$ is a nonincreasing sequence of nonnegative numbers and $\lim_{j \to \infty} \lambda_j = 0$. Then for all $n \in \mathbb{N}$, there exists $j \in \mathbb{N}$ such that

\[ \lambda_{j+1} \le (\lambda_1 \cdots \lambda_j)^{1/j} n^{-2/j}.   (29) \]

Proof Let $P_j := \lambda_{j+1}^{\,j} / (\lambda_1 \cdots \lambda_j)$. Observe that (29) can be written as $P_j \le n^{-2}$, and hence for all $n$ there is a $j$ such that (29) is true iff $\lim_{j \to \infty} P_j = 0$. But

\[ P_j = \frac{\lambda_{j+1}^{\,j}}{\lambda_1 \cdots \lambda_j} = \prod_{i=1}^{j} \frac{\lambda_{j+1}}{\lambda_i} \le \frac{\lambda_{j+1}}{\lambda_1}, \]

since $(\lambda_i)$ is nonincreasing (so that every factor is at most 1). Since $\lim_{j \to \infty} \lambda_j = 0$, we get $\lim_{j \to \infty} P_j = 0$. Thus for any $n \in \mathbb{N}$ there is a $j$ such that (29) is true. $\square$

Corollary 10 Suppose $k$ is a Mercer kernel and $T_k$ the associated integral operator. If $\lambda_i = \lambda_i(T_k)$, then the minimum $j^*$ from (19) always exists.

Proof By Mercer's theorem, $(\lambda_i)_i \in \ell_1$ and so $\lim_{i \to \infty} \lambda_i = 0$. Lemma 9 can thus be applied. $\square$
STEP TWO

Lemma 11 Suppose $A_j$ and $B((a_s), n, j)$ are defined as above, and suppose $j^\circ \in \mathbb{N}$ and $(a_s^\circ) \in A_{j^\circ}$ with $(\sqrt{\lambda_s}/a_s^\circ)_s \in \ell_2$ satisfy

\[ B((a_s^\circ), n, j^\circ) = \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j).   (30) \]

Then

\[ \inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2}\ \sup_{j \in \mathbb{N}} B((a_s), n, j) \le \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j).   (31) \]

Proof Since $(\sqrt{\lambda_s}/a_s^\circ)_s \in \ell_2$,

\[ \inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2}\ \sup_{j \in \mathbb{N}} B((a_s), n, j) \le \sup_{j \in \mathbb{N}} B((a_s^\circ), n, j).   (32) \]

But $(a_s^\circ) \in A_{j^\circ}$; following the definition of $A_{j^\circ}$ and equality (30) we get

\[ \sup_{j \in \mathbb{N}} B((a_s^\circ), n, j) = B((a_s^\circ), n, j^\circ) = \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j). \quad \square \]

In fact, we can show that the inequality in (31) is in fact an equality. The proof is in Appendix B.

It is now easier to calculate the optimal bound on the entropy numbers using Lemma 11.
STEP THREE

In this step, we will prove that the choice of $(a_s^*)$ and $j^*$ given in Theorem 5 is optimal. We will first prove a useful technical result.

Lemma 12 Suppose $A_j$ and $\lambda_i$ are defined as above, and $(a_s) \in A_j$. Then we have

\[ \sum_{i > j} \frac{\lambda_i}{a_i^2} \ge \Bigl[ (a_1 \cdots a_j)^{1/j} n^{-1/j} \Bigr]^{-2} \sum_{i > j} \lambda_i.   (33) \]
Proof Since $(a_s) \in A_j$, the following inequality must be true for $k \in \mathbb{N}$:

\[ (a_1 \cdots a_{j+k})^{1/(j+k)} n^{-1/(j+k)} \le (a_1 \cdots a_j)^{1/j} n^{-1/j},   (34) \]

which implies

\[ a_{j+1} \cdots a_{j+k} \le \bigl[ (a_1 \cdots a_j)^{1/j} n^{-1/j} \bigr]^{k} \quad \text{for all } k \in \mathbb{N}.   (35) \]

Set

\[ \Theta := (a_1 \cdots a_j)^{1/j} n^{-1/j}. \]

Then (35) can be rewritten as

\[ a_{j+1} \cdots a_{j+k} \le \Theta^{k} \quad \text{for all } k \in \mathbb{N}.   (36) \]

Hence, the left hand side of (33) can be rewritten as

\[ \sum_{i > j} \frac{\lambda_i}{a_i^2} = \Theta^{-2} \Bigl( \sum_{i > j} \lambda_i + \sum_{i > j} \lambda_i \bigl( \Theta^2/a_i^2 - 1 \bigr) \Bigr),   (37) \]

so that it suffices to show that $\sum_{i > j} \lambda_i ( \Theta^2/a_i^2 - 1 ) \ge 0$. We will exploit the inequality of the arithmetic and geometric means,

\[ \frac{x_1 + x_2 + \cdots + x_m}{m} \ge ( x_1 x_2 \cdots x_m )^{1/m} \quad \text{for } x_i \ge 0.   (40) \]

Now (36) and (40) imply that for any $k > j$,

\[ \sum_{i=j+1}^{k} \frac{\Theta^2}{a_i^2} \ge (k - j) \Bigl( \prod_{i=j+1}^{k} \frac{\Theta^2}{a_i^2} \Bigr)^{1/(k-j)} \ge k - j,   (41) \]

and hence every partial sum satisfies

\[ S_k := \sum_{i=j+1}^{k} \bigl( \Theta^2/a_i^2 - 1 \bigr) \ge 0.   (42) \]

Since $(\lambda_i)$ is a nonincreasing sequence of nonnegative numbers, summation by parts gives, for every $K > j$,

\[ \sum_{i=j+1}^{K} \lambda_i \bigl( \Theta^2/a_i^2 - 1 \bigr) = \sum_{k=j+1}^{K-1} ( \lambda_k - \lambda_{k+1} ) S_k + \lambda_K S_K \ge 0.   (43) \]

Letting $K \to \infty$ yields

\[ \sum_{i > j} \lambda_i \bigl( \Theta^2/a_i^2 - 1 \bigr) \ge 0.   (44) \]

Noticing (37), inequality (33) is true. $\square$
Now, let us prove the main result.

Lemma 13 Let $A_j$ and $B((a_s), n, j)$ be defined as above. Then we have

\[ B((a_s^*), n, j^*) = \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j),   (45) \]

where

\[ a_i^* := \begin{cases} \sqrt{\lambda_i} & \text{when } i \le j^* \\ (\lambda_1 \cdots \lambda_{j^*})^{1/(2 j^*)}\, n^{-1/j^*} & \text{when } i > j^*, \end{cases}   (46) \]

\[ j^* := \min\bigl\{ j : \lambda_{j+1} \le (\lambda_1 \cdots \lambda_j)^{1/j} n^{-2/j} \bigr\}.   (47) \]

Proof The main idea is to compare $B^2((a_s), n, j)$ with $B^2((a_s^*), n, j^*)$ and show that $B^2((a_s), n, j) \ge B^2((a_s^*), n, j^*)$ for all $j \in \mathbb{N}$ and any $(a_s) \in A_j$. From the definition of $B((a_s), n, j)$, we know

\[ B^2((a_s), n, j) = \Bigl( \sum_{i} \frac{\lambda_i}{a_i^2} \Bigr) (a_1 \cdots a_j)^{2/j} n^{-2/j} \]

and, writing $x_j := (\lambda_1 \cdots \lambda_j)^{1/j} n^{-2/j}$ for convenience,

\[ B^2((a_s^*), n, j^*) = j^* x_{j^*} + \sum_{i > j^*} \lambda_i. \]

Hence,

\[ B^2((a_s), n, j) - B^2((a_s^*), n, j^*) = \Bigl( \sum_{i} \frac{\lambda_i}{a_i^2} \Bigr) (a_1 \cdots a_j)^{2/j} n^{-2/j} - j^* x_{j^*} - \sum_{i > j^*} \lambda_i.   (48) \]
Part a: the case $j \le j^*$.

Rewrite (48), with $D_j := (a_1 \cdots a_j)^{1/j} n^{-1/j}$ and $x_j := (\lambda_1 \cdots \lambda_j)^{1/j} n^{-2/j}$:

\[ B^2((a_s), n, j) - B^2((a_s^*), n, j^*) = \Bigl[ \Bigl( \sum_{i \le j} \frac{\lambda_i}{a_i^2} \Bigr) D_j^2 - j x_j \Bigr] + \Bigl[ \Bigl( \sum_{i > j} \frac{\lambda_i}{a_i^2} \Bigr) D_j^2 - \sum_{i > j} \lambda_i \Bigr] + \Bigl[ j x_j + \sum_{i > j} \lambda_i - j^* x_{j^*} - \sum_{i > j^*} \lambda_i \Bigr] =: E_1 + E_2 + E_3.   (49) \]

We will show $E_1 \ge 0$, $E_2 \ge 0$ and $E_3 \ge 0$.

To prove $E_1 \ge 0$. Since $\lambda_i \ge 0$ and $a_i > 0$, we exploit the inequality of the arithmetic and geometric means (40). Noting that $\bigl( \prod_{i \le j} \lambda_i/a_i^2 \bigr)^{1/j} = x_j / D_j^2$, we get

\[ E_1 \ge j \Bigl( \prod_{i \le j} \frac{\lambda_i}{a_i^2} \Bigr)^{1/j} D_j^2 - j x_j = 0.   (50) \]

To prove $E_2 \ge 0$. Applying Lemma 12 shows $E_2 \ge 0$.

To prove $E_3 \ge 0$. In order to prove $E_3 \ge 0$, let us define the function

\[ g(j) := j x_j + \sum_{i > j} \lambda_i.   (51) \]

Then $E_3 = g(j) - g(j^*)$, and it suffices to show that $g(j)$ is a nonincreasing function of $j$ for $j \le j^*$. We have

\[ g(j) - g(j+1) = j x_j + \lambda_{j+1} - (j+1) x_{j+1}.   (52) \]

Noticing $x_{j+1}^{j+1} = x_j^{j} \lambda_{j+1}$, (52) can be modified to

\[ g(j) - g(j+1) = j x_j + \lambda_{j+1} - (j+1) \bigl( x_j^{j} \lambda_{j+1} \bigr)^{1/(j+1)}.   (53) \]

Note that, since $j < j^*$, following (47) we have

\[ \lambda_{j+1} > (\lambda_1 \cdots \lambda_j)^{1/j} n^{-2/j} = x_j \quad \text{for } j < j^*,   (54) \]

so that $x_{j+1}^{j+1} = x_j^{j} \lambda_{j+1} > x_j^{j+1}$, i.e. the sequence $x_j$ is increasing for $j \le j^*$ (a fact we will use again in Step Four). Applying the inequality of the arithmetic and geometric means (40) to the $j+1$ numbers $x_j, \ldots, x_j, \lambda_{j+1}$ gives

\[ j x_j + \lambda_{j+1} \ge (j+1) \bigl( x_j^{j} \lambda_{j+1} \bigr)^{1/(j+1)}, \]

hence $g(j) - g(j+1) \ge 0$. Since $j \le j^*$, we get

\[ E_3 = g(j) - g(j^*) \ge 0.   (56) \]

Combining the above results, we get

\[ B^2((a_s), n, j) \ge B^2((a_s^*), n, j^*) \quad \text{for } j \le j^*.   (57) \]
Part b: the case $j > j^*$.

Rewrite (48), with $D_i := (a_1 \cdots a_i)^{1/i} n^{-1/i}$ and $x_{j^*} := (\lambda_1 \cdots \lambda_{j^*})^{1/j^*} n^{-2/j^*}$:

\[ B^2((a_s), n, j) - B^2((a_s^*), n, j^*) = \Bigl[ \Bigl( \sum_{i \le j} \frac{\lambda_i}{a_i^2} \Bigr) D_j^2 - j^* x_{j^*} - \sum_{i=j^*+1}^{j} \lambda_i \Bigr] + \Bigl[ \Bigl( \sum_{i > j} \frac{\lambda_i}{a_i^2} \Bigr) D_j^2 - \sum_{i > j} \lambda_i \Bigr] =: F_1 + F_2.   (58) \]

We will show $F_1 \ge 0$ and $F_2 \ge 0$.

To prove $F_1 \ge 0$. Split $F_1$ as

\[ F_1 = \Bigl[ \Bigl( \sum_{i \le j^*} \frac{\lambda_i}{a_i^2} \Bigr) D_j^2 - j^* x_{j^*} \Bigr] + \Bigl[ \Bigl( \sum_{i=j^*+1}^{j} \frac{\lambda_i}{a_i^2} \Bigr) D_j^2 - \sum_{i=j^*+1}^{j} \lambda_i \Bigr] =: P_1 + P_2.   (59) \]

Let us consider $P_1$ at first. By the inequality of the arithmetic and geometric means (40), $\sum_{i \le j^*} \lambda_i/a_i^2 \ge j^* \bigl( \prod_{i \le j^*} \lambda_i/a_i^2 \bigr)^{1/j^*} = j^* x_{j^*}/D_{j^*}^2$, and hence, writing $t := D_j/D_{j^*}$,

\[ P_1 \ge j^* x_{j^*} \bigl( t^2 - 1 \bigr).   (60) \]

Since $(a_s) \in A_j$, we have $D_{j^*} \le D_j$, i.e. $t \ge 1$, so $P_1 \ge 0$; we will use the stronger consequence $P_1 \ge 2 j^* x_{j^*} \ln t$, which follows from the elementary inequality $t^2 - 1 \ge 2 \ln t$ for $t \ge 1$.

Let us consider $P_2$ now. If $P_2 \ge 0$, then $F_1 \ge 0$ holds directly, so suppose $P_2 < 0$. For $j^* < k \le j$, the condition $(a_s) \in A_j$ gives $D_k \le D_j$, i.e. $a_1 \cdots a_k \le D_j^{k}\, n$, and hence

\[ \prod_{i=j^*+1}^{k} a_i = \frac{a_1 \cdots a_k}{a_1 \cdots a_{j^*}} \le \frac{D_j^{k}}{D_{j^*}^{j^*}},   (61) \]

so that, with $s := k - j^*$,

\[ \prod_{i=j^*+1}^{k} \frac{D_j^2}{a_i^2} \ge t^{-2 j^*}.   (62) \]

By the inequality of the arithmetic and geometric means (40) and the bound $e^{-x} \ge 1 - x$,

\[ \sum_{i=j^*+1}^{k} \Bigl( \frac{D_j^2}{a_i^2} - 1 \Bigr) \ge s \bigl( t^{-2 j^*/s} - 1 \bigr) \ge -2 j^* \ln t.   (63) \]

Since $(\lambda_i)$ is nonincreasing and these partial sums are bounded below by $-2 j^* \ln t$, summation by parts (as in the proof of Lemma 12) gives

\[ P_2 = \sum_{i=j^*+1}^{j} \lambda_i \Bigl( \frac{D_j^2}{a_i^2} - 1 \Bigr) \ge \lambda_{j^*+1} \min_{j^* < k \le j} \sum_{i=j^*+1}^{k} \Bigl( \frac{D_j^2}{a_i^2} - 1 \Bigr) \ge -2 j^* \lambda_{j^*+1} \ln t.   (64) \]

By the definition (47) of $j^*$, $\lambda_{j^*+1} \le x_{j^*}$, hence

\[ P_2 \ge -2 j^* x_{j^*} \ln t.   (65) \]

Combining (60) and (65), we have

\[ F_1 = P_1 + P_2 \ge j^* x_{j^*} \bigl( t^2 - 1 - 2 \ln t \bigr) \ge 0.   (70) \]

To prove $F_2 \ge 0$. Using Lemma 12 again, we get

\[ F_2 \ge 0.   (71) \]

Combining (70) and (71), we get

\[ B^2((a_s), n, j) \ge B^2((a_s^*), n, j^*) \quad \text{for } j > j^*.   (72) \]

Combining (57) and (72), (45) is proved true. $\square$
STEP FOUR

We supposed that $(a_s^*) \in A_{j^*}$ in the above proof. Now let us show it. First, for $j \le j^*$, with $x_j := (\lambda_1 \cdots \lambda_j)^{1/j} n^{-2/j}$,

\[ (a_1^* \cdots a_j^*)^{1/j} n^{-1/j} = (\lambda_1 \cdots \lambda_j)^{1/(2j)} n^{-1/j} = x_j^{1/2} \le x_{j^*}^{1/2} = (a_1^* \cdots a_{j^*}^*)^{1/j^*} n^{-1/j^*}, \]

since by (54) the sequence $x_j$ is increasing for $j \le j^*$. Second, for $j > j^*$, a direct calculation using $a_i^* = x_{j^*}^{1/2}$ for $i > j^*$ gives

\[ (a_1^* \cdots a_j^*)^{1/j} n^{-1/j} = x_{j^*}^{1/2} = (a_1^* \cdots a_{j^*}^*)^{1/j^*} n^{-1/j^*}. \]

Thus $\sup_{i \in \mathbb{N}} (a_1^* \cdots a_i^*)^{1/i} n^{-1/i}$ is attained at $i = j^*$, i.e. $(a_s^*) \in A_{j^*}$.

We can also show $(\sqrt{\lambda_s}/a_s^*)_s \in \ell_2$:

\[ \bigl\| (\sqrt{\lambda_s}/a_s^*)_s \bigr\|_{\ell_2}^2 = \sum_{i \le j^*} \frac{\lambda_i}{\lambda_i} + \sum_{i > j^*} \frac{\lambda_i}{x_{j^*}} = j^* + x_{j^*}^{-1} \sum_{i > j^*} \lambda_i.   (73) \]

When $k(x, y)$ and $n$ are given, $\lambda_i$ and $j^*$ are determined, so $x_{j^*} = (\lambda_1 \cdots \lambda_{j^*})^{1/j^*} n^{-2/j^*}$ is a constant. By Mercer's theorem, $(\lambda_i)_i \in \ell_1$ and thus $\sum_{i > j^*} \lambda_i$ is finite. So (73) is finite. Hence $(\sqrt{\lambda_s}/a_s^*)_s \in \ell_2$ is proved.
CONCLUSION

Following the proof above, we get:

Corollary 14 Suppose $A_j$ and $B((a_s), n, j)$ are defined as above. Then we have

\[ B((a_s^*), n, j^*) = \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j),   (74) \]

where

\[ a_i^* = \begin{cases} \sqrt{\lambda_i} & \text{when } i \le j^* \\ (\lambda_1 \cdots \lambda_{j^*})^{1/(2 j^*)}\, n^{-1/j^*} & \text{when } i > j^*, \end{cases}   (75) \]

\[ j^* = \min\bigl\{ j : \lambda_{j+1} \le (\lambda_1 \cdots \lambda_j)^{1/j} n^{-2/j} \bigr\}.   (76) \]

Theorem 1 is then established.
B THE PROOF THAT INEQUALITY (31) CANNOT BE IMPROVED

Lemma 15 Suppose $A_j$ and $B((a_s), n, j)$ are defined as above, and suppose $j^*$ and $(a_s^*)$ exist. Then

\[ \inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2}\ \sup_{j \in \mathbb{N}} B((a_s), n, j) = \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j).   (77) \]

Proof Let us prove

\[ \inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2}\ \sup_{j \in \mathbb{N}} B((a_s), n, j) \ge \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j).   (78) \]

Choose an $(a_s^\circ)$ to realise the infimum on the left hand side; then $(a_s^\circ)_s \in A_{j^\circ}$, where $j^\circ$ is the $j$ that realises the inner supremum. Then

\[ \inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2}\ \sup_{j \in \mathbb{N}} B((a_s), n, j) = \sup_{j \in \mathbb{N}} B((a_s^\circ), n, j) = B((a_s^\circ), n, j^\circ) \ge \inf_{(a_s) \in A_{j^\circ}} B((a_s), n, j^\circ) \ge \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j). \]

We have already proved (Lemma 11) that

\[ \inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2}\ \sup_{j \in \mathbb{N}} B((a_s), n, j) \le \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j). \]

So, equation (77) is proved to be true. $\square$
Acknowledgements

This work was supported by the Australian Research Council. Thanks to Bernhard Schölkopf and Alex J. Smola for useful discussions.