Covering Numbers for Support Vector Machines

Ying Guo
Department of Engineering
Australian National University
Canberra 0200, Australia
guo@hilbert.anu.edu.au

Peter L. Bartlett
RSISE
Australian National University
Canberra 0200, Australia
Peter.Bartlett@anu.edu.au

John Shawe-Taylor
Department of Computer Science
Royal Holloway College, University of London
Egham, TW20 0EX, UK
jst@dcs.rhbnc.ac.uk

Robert C. Williamson
Department of Engineering
Australian National University
Canberra 0200, Australia
Bob.Williamson@anu.edu.au

Abstract

Support vector machines are a type of learning machine related to the maximum margin hyperplane. Until recently, the only bounds on the generalization performance of SV machines (within the PAC framework) were via bounds on the fat-shattering dimension of maximum margin hyperplanes. This result took no account of the kernel used. More recently, it has been shown [8] that one can bound the relevant covering numbers using some tools from functional analysis. The resulting bound is quite complex and seemingly difficult to compute. In this paper we show that the bound can be greatly simplified and as a consequence we are able to determine some interesting quantities (such as the effective number of dimensions used). The new bound is quite a simple formula involving the eigenvalues of the integral operator induced by the kernel. We present an explicit calculation of covering numbers for an SV machine using a Gaussian kernel which is significantly better than that implied by the maximum margin fat-shattering result.

1 INTRODUCTION

Support Vector (SV) Machines [5] are learning algorithms based on maximum margin hyperplanes [4] which make use of an implicit mapping into feature space by using a more general kernel function in place of the standard inner product. Consequently one can apply an analysis for the maximum margin algorithm directly to SV machines. However, such a process completely ignores the effect of the kernel. Intuitively one would expect that a "smoother" kernel would somehow reduce the capacity of the learning machine, thus leading to better bounds on generalization error if the machine can attain a small training error.

In [9, 8] it has been shown that this intuition is justified. The main result there (quoted below) gives a bound on the covering numbers for the class of functions computed with support vector machines. This bound, along with statistical results of the form given in [7], results in bounds that do explicitly depend on the kernel used.

In the traditional viewpoint of statistical learning theory, one is given a class of functions $F$, and the generalization performance attainable using $F$ is determined via the covering numbers $\mathcal{N}(\epsilon, F)$ (precise definitions are given below). Many generalization error bounds can be expressed in terms of $\mathcal{N}(\epsilon, F)$. The main method of bounding $\mathcal{N}(\epsilon, F)$ has been to use the Vapnik-Chervonenkis dimension or one of its generalizations (see [1] for an overview).

In [9, 8] an alternative viewpoint is taken where the class $F$ is viewed as being generated by an integral operator induced by the kernel. Properties of this operator are used to bound the required covering numbers. The result is in a form that is not particularly easy to use (see (13) below).

The main technical result of this paper is an explicit reformulation of this bound which is amenable to direct calculation. We illustrate the new result by bounding the covering numbers of SV machines which use Gaussian RBF kernels. The result shows the influence of $\sigma$ on the covering numbers: the covering numbers will decrease when $\sigma$ increases. Here $\sigma$ is the variance of the Gaussian function used for the kernel. More generally, the main result makes model order selection possible using any parametrized family of kernel functions: we can describe precisely how the capacity of the class is affected by changes to the kernel.

For $d \in \mathbb{N}$, $\mathbb{R}^d$ denotes the $d$-dimensional space of vectors $x = (x_1, \ldots, x_d)$. For $1 \le p \le \infty$, define the spaces

  $\ell_p^d := \{ x \in \mathbb{R}^d : \|x\|_{\ell_p^d} < \infty \},$

where the $p$-norms are

  $\|x\|_{\ell_p^d} := \Big( \sum_{j=1}^{d} |x_j|^p \Big)^{1/p}$ for $1 \le p < \infty$,

  $\|x\|_{\ell_\infty^d} := \max_{1 \le j \le d} |x_j|$ for $p = \infty$.

For $d = \infty$, we write $\ell_p := \ell_p^\infty$ and the norms are defined similarly (by formally substituting $\infty$ for $d$ in the above definitions).

The $\epsilon$-covering number of $F$ with respect to the metric $d$, denoted $\mathcal{N}(\epsilon, F, d)$, is the size of the smallest $\epsilon$-cover for $F$ using the metric $d$. Given $m$ points $x_1, \ldots, x_m \in \ell_p^d$, we use the shorthand $X_m = (x_1, \ldots, x_m)$. Suppose $F$ is a class of functions defined on $\mathbb{R}^d$. The $\ell_\infty^{X_m}$ norm with respect to $X_m$ of $f \in F$ is defined as $\|f\|_{\ell_\infty^{X_m}} := \max_{1 \le i \le m} |f(x_i)|$. The input space is taken to be $X$, a compact subset of $\mathbb{R}^d$.

Our main result is a bound for the covering number of SV machines. We only discuss the case $d = 1$. (In fact the result does hold for general $d$; see the discussion in the conclusion.)

Let $k \colon X \times X \to \mathbb{R}$ be a kernel satisfying the hypotheses of Mercer's theorem (Theorem 2), and suppose we are given $m$ points $x_1, \ldots, x_m \in X$. Denote by $F_{R_w}$ the hypothesis class implemented by SV machines on an $m$-sample with weight vector (in feature space) bounded by $R_w$:

  $F_{R_w} := \Big\{ x \mapsto \sum_i \alpha_i k(x, x_i) : \sum_i \sum_j \alpha_i \alpha_j k(x_i, x_j) \le R_w^2 \Big\}.$   (1)
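To make the class (1) concrete, the following sketch (plain Python; the Gaussian kernel and the numerical coefficients are illustrative assumptions, not values taken from the paper) evaluates an SV function $f(x) = \sum_i \alpha_i k(x, x_i)$ and the squared feature-space weight norm $\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$ that (1) requires to be at most $R_w^2$.

```python
import math

def k(x, y, sigma=1.0):
    # Gaussian RBF kernel (an illustrative choice of Mercer kernel)
    return math.exp(-(x - y) ** 2 / sigma ** 2)

def f(x, alphas, xs):
    # An element of F_{R_w}: f(x) = sum_i alpha_i k(x, x_i)
    return sum(a * k(x, xi) for a, xi in zip(alphas, xs))

def weight_norm_sq(alphas, xs):
    # ||w||^2 = sum_{i,j} alpha_i alpha_j k(x_i, x_j) in feature space
    return sum(ai * aj * k(xi, xj)
               for ai, xi in zip(alphas, xs)
               for aj, xj in zip(alphas, xs))

alphas = [0.5, -0.2, 0.7]   # made-up expansion coefficients
xs = [0.0, 1.0, 2.5]        # made-up sample points
print(f(1.0, alphas, xs), weight_norm_sq(alphas, xs))
```

Membership of $f$ in $F_{R_w}$ is then just the condition `weight_norm_sq(alphas, xs) <= R_w ** 2`; positive semi-definiteness of the kernel guarantees the computed norm is non-negative.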

Let $\lambda_1 \ge \lambda_2 \ge \cdots$ be the eigenvalues of the integral operator $T_k \colon L_2(X) \to L_2(X)$,

  $(T_k f)(x) := \int_X k(x, y) f(y)\, dy,$

and denote by $\psi_n$, $n \in \mathbb{N}$, the corresponding eigenfunctions. (See the next section for a reminder of what this means.) For translation invariant kernels (such as $k(x, y) = k(x - y) = \exp(-(x - y)^2/\sigma^2)$), the eigenvalues are given by

  $\lambda_j = \sqrt{2\pi}\, K(\omega_j)$   (2)

for $j \in \mathbb{Z}$, where $K = F[k]$ is the Fourier transform of $k$ and the $\omega_j$ are the corresponding frequencies (see [9, 8] for further details). For a smooth kernel, the Fourier transform decays more rapidly. (There are fewer "high frequency components.") Thus for smooth kernels, $\lambda_i$ decreases to zero rapidly with increasing $i$.

Theorem 1 (Main Result) Suppose $k$ is a kernel satisfying the hypotheses of Mercer's theorem. Let the hypothesis class $F_{R_w}$, the eigenfunctions $\psi_n$ and the eigenvalues $\lambda_i$ be defined as above. Let $x_1, \ldots, x_m \in X$ be $m$ data points. Let

  $C_k := \sup_n \|\psi_n\|_{L_\infty}.$

For $n \in \mathbb{N}$ set

  $\epsilon_n := 6 R_w C_k \sqrt{ j^* \Big( \frac{\lambda_1 \cdots \lambda_{j^*}}{n^2} \Big)^{1/j^*} + \sum_{i > j^*} \lambda_i }$   (3)

with

  $j^* := \min\Big\{ j : \lambda_{j+1} \le \Big( \frac{\lambda_1 \cdots \lambda_j}{n^2} \Big)^{1/j} \Big\}.$

Then $C_k < \infty$ and

  $\sup_{X_m \in X^m} \mathcal{N}(\epsilon_n, F_{R_w}, \ell_\infty^{X_m}) \le n.$

The quantity $\epsilon_n$ is an upper bound on the entropy number of $F_{R_w}$, which is the functional inverse of the covering number. In this theorem, the number $j^*$ has a natural interpretation: for a given value of $n$, it can be viewed as the effective dimension of the function class. Clearly, this effective dimension depends on the rate of decay of the eigenvalues. As expected, for smooth kernels (which have rapidly decreasing eigenvalues), the effective dimension is small. It turns out that all kernels satisfying Mercer's conditions are sufficiently smooth for $j^*$ to be finite.
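As a numeric sketch of the theorem, the Python below computes the effective dimension $j^*$ and the bound $\epsilon_n$ from a non-increasing eigenvalue sequence. The truncation of the eigenvalue tail and the example spectrum $\lambda_i = 2^{-i}$ are assumptions made purely for illustration; the prefactor $6 R_w C_k$ follows the statement of Theorem 1 as given here.

```python
import math

def j_star(lams, n):
    # j* = min{ j : lam_{j+1} <= ((lam_1 ... lam_j) / n^2)^(1/j) },
    # computed in log-space to avoid underflow.
    log_prod = 0.0
    for j in range(1, len(lams)):
        log_prod += math.log(lams[j - 1])            # log(lam_1 ... lam_j)
        threshold = (log_prod - 2.0 * math.log(n)) / j
        if math.log(lams[j]) <= threshold:           # lams[j] is lam_{j+1}
            return j
    raise ValueError("truncated sequence too short")

def eps_n(lams, n, R_w=1.0, C_k=1.0):
    # eps_n = 6 R_w C_k sqrt( j* ((lam_1...lam_j*)/n^2)^(1/j*) + tail sum )
    j = j_star(lams, n)
    log_prod = sum(math.log(l) for l in lams[:j])
    dominant = j * math.exp((log_prod - 2.0 * math.log(n)) / j)
    tail = sum(lams[j:])   # truncated tail (assumption: the rest is negligible)
    return 6.0 * R_w * C_k * math.sqrt(dominant + tail)

lams = [2.0 ** -(i + 1) for i in range(60)]   # example spectrum lam_i = 2^-i
print(j_star(lams, 4), eps_n(lams, 4), eps_n(lams, 16))
```

As the theorem suggests, the effective dimension grows and the bound shrinks as the covering budget $n$ increases.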

The remainder of the paper is organized as follows. We start by introducing notation and definitions (Section 2). Section 3 contains the main result (the proof is in Appendix A). Section 4 contains an example application of the main result. Section 5 concludes.

2 DEFINITIONS AND PREVIOUS RESULTS

Let $\mathcal{L}(E, F)$ be the set of all bounded linear operators $T$ between the normed spaces $(E, \|\cdot\|_E)$ and $(F, \|\cdot\|_F)$, i.e. operators such that the image of the (closed) unit ball

  $U_E := \{ x \in E : \|x\|_E \le 1 \}$   (4)

is bounded. The smallest such bound is called the operator norm,

  $\|T\| := \sup_{x \in U_E} \|T x\|_F.$   (5)

The $n$th entropy number of a set $M \subset E$, for $n \in \mathbb{N}$, is

  $\epsilon_n(M) := \inf \{ \epsilon : \text{there exists an } \epsilon\text{-cover for } M \text{ in } E \text{ containing } n \text{ or fewer points} \}.$   (6)

(The function $n \mapsto \epsilon_n(M)$ can be thought of as the functional inverse of the function $\epsilon \mapsto \mathcal{N}(\epsilon, M, d)$, where $d$ is the metric induced by $\|\cdot\|_E$.) The entropy numbers of an operator $T \in \mathcal{L}(E, F)$ are defined as

  $\epsilon_n(T) := \epsilon_n(T U_E).$   (7)

Note that $\epsilon_1(T) = \|T\|$, and that $\epsilon_n(T)$ certainly is well defined for all $n \in \mathbb{N}$ if $T$ is a compact operator, i.e. if $\overline{T U_E}$ is compact.
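The functional-inverse relationship between (6) and the covering number can be seen in a toy case. The sketch below (plain Python; the set $[0, 1]$ with the absolute-value metric is an assumed toy example, not from the paper) computes $\mathcal{N}(\epsilon, [0,1]) = \lceil 1/(2\epsilon) \rceil$ and the matching entropy number $\epsilon_n([0,1]) = 1/(2n)$, and checks that composing the two recovers $n$.

```python
import math

def covering_number(eps):
    # Smallest number of points whose eps-balls cover [0, 1]:
    # centers at eps, 3*eps, 5*eps, ... each cover a length-2*eps interval.
    return math.ceil(1.0 / (2.0 * eps))

def entropy_number(n):
    # eps_n([0,1]): the smallest eps admitting an n-point eps-cover.
    return 1.0 / (2.0 * n)

# Functional inverse: covering [0,1] at scale eps_n needs exactly n points.
for n in (1, 2, 4, 8):   # powers of two keep the float arithmetic exact
    print(n, entropy_number(n), covering_number(entropy_number(n)))
```

The same duality is what lets the paper state its results interchangeably as bounds on $\epsilon_n$ or on $\mathcal{N}(\epsilon, \cdot)$.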

In the following, $k$ will always denote a kernel, and $d$ and $m$ will be the input dimensionality and the number of training examples, respectively, so that the training data is a sequence

  $(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^d \times \mathbb{R}.$   (8)

Let $\log$ denote the logarithm to base 2.

We will map the input data into a feature space $S$ via a mapping $\Phi$. We let $\mathbf{x} := \Phi(x)$, and

  $F_{R_w} = \{ \langle w, \mathbf{x} \rangle : \mathbf{x} \in S,\ \|w\| \le R_w \}, \quad w \in S.$

Given a class of functions $F$, the generalization performance attainable using $F$ can be bounded in terms of the covering numbers of $F$. More precisely, for some set $X$, and $x_i \in X$ for $i = 1, \ldots, m$, define the $\epsilon$-growth function of the function class $F$ on $X$ as

  $\mathcal{N}^\epsilon_m(F) := \sup_{(x_1, \ldots, x_m) \in X^m} \mathcal{N}(\epsilon, F, \ell_\infty^{X_m}),$   (9)

where $\mathcal{N}(\epsilon, F, \ell_\infty^{X_m})$ is the $\epsilon$-covering number of $F$ with respect to $\ell_\infty^{X_m}$. Many generalization error bounds can be expressed in terms of $\mathcal{N}^\epsilon_m(F)$.

Given some set $X$, some $1 \le p < \infty$ and a function $f \colon X \to \mathbb{R}$, we define $\|f\|_{L_p(X)} := \big( \int_X |f(x)|^p\, dx \big)^{1/p}$ if the integral exists, and $\|f\|_{L_\infty(X)} := \operatorname{ess\,sup}_{x \in X} |f(x)|$. For $1 \le p \le \infty$, we let $L_p(X) := \{ f \colon X \to \mathbb{R} : \|f\|_{L_p(X)} < \infty \}$. We sometimes write $L_p := L_p(X)$.

Suppose $T \colon E \to E$ is a linear operator mapping a normed space $E$ into itself. We say that $x \in E$ is an eigenvector if, for some scalar $\lambda$, $T x = \lambda x$. Such a $\lambda$ is called the eigenvalue associated with $x$. When $E$ is a function space (e.g. $E = L_2(X)$), the eigenvectors are of course functions, and are usually called eigenfunctions. Thus $\psi_n$ is an eigenfunction of $T \colon L_2(X) \to L_2(X)$ if $T \psi_n = \lambda_n \psi_n$. In general $\lambda$ is complex, but in this paper all eigenvalues are real (because of the symmetry of the kernels used to induce the operators).

We will make use of Mercer's theorem. The version stated below is a special case of the theorem proven in [6, p. 145].

Theorem 2 (Mercer) Suppose $k \in L_\infty(X^2)$ is a symmetric kernel such that the integral operator $T_k \colon L_2(X) \to L_2(X)$,

  $(T_k f)(x) := \int_X k(x, y) f(y)\, dy,$   (10)

is positive. Let $\psi_j \in L_2(X)$ be the eigenfunction of $T_k$ associated with the eigenvalue $\lambda_j \ne 0$ and normalized such that $\|\psi_j\|_{L_2} = 1$, and let $\bar{\psi}_j$ denote its complex conjugate. Suppose $\psi_j$ is continuous for all $j \in \mathbb{N}$. Then

1. $(\lambda_j(T_k))_j \in \ell_1$.
2. $\psi_j \in L_\infty(X)$ and $\sup_j \|\psi_j\|_{L_\infty} < \infty$.
3. $k(x, y) = \sum_{j \in \mathbb{N}} \lambda_j \bar{\psi}_j(x) \psi_j(y)$ holds for all $(x, y)$, where the series converges absolutely and uniformly for all $(x, y)$.

We will call a kernel satisfying the conditions of this theorem a Mercer kernel. From statement 2 of Mercer's theorem there exists some constant $C_k \in \mathbb{R}$ depending on $k$ such that

  $|\psi_j(x)| \le C_k$ for all $j \in \mathbb{N}$ and $x \in X.$   (11)

This conclusion is the only reason we have added the condition that $\psi_n$ is continuous; it is not necessary for the theorem as stated, but it is convenient to bundle all of our assumptions into the one place. In any case it is not a very restrictive assumption: if $X$ is compact and $k$ is continuous, then $\psi_j$ is automatically continuous (see e.g. [3]). Alternatively, if $k$ is translation invariant, then the $\psi_j$ are scaled cosine functions and thus continuous.
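The claim that translation-invariant kernels have cosine eigenfunctions can be checked numerically. The sketch below (plain Python; the period $2\pi$, the width $\sigma = 1$, the truncation of the periodization, and the grid size are all illustrative assumptions) applies the integral operator $T_k$ of a periodized Gaussian to $\cos(2y)$ by a Riemann sum and verifies that the result equals $\lambda \cos(2x)$, with $\lambda$ the corresponding Fourier coefficient of $k$.

```python
import math

PERIOD, SIGMA, N = 2.0 * math.pi, 1.0, 400
GRID = [PERIOD * i / N for i in range(N)]

def k(t):
    # Gaussian periodized over a few wrap-arounds (truncation is an assumption)
    return sum(math.exp(-((t - m * PERIOD) / SIGMA) ** 2) for m in range(-3, 4))

def apply_Tk(f, x):
    # (T_k f)(x) = integral over one period of k(x - y) f(y) dy (Riemann sum)
    return sum(k(x - y) * f(y) for y in GRID) * (PERIOD / N)

freq = 2
lam = sum(k(t) * math.cos(freq * t) for t in GRID) * (PERIOD / N)  # Fourier coeff.

# T_k cos(freq * .) should equal lam * cos(freq * .) pointwise.
err = max(abs(apply_Tk(lambda y: math.cos(freq * y), x) - lam * math.cos(freq * x))
          for x in GRID[::40])
print(lam, err)
```

Since the kernel is even and periodic, the sine component of $T_k \cos(\mathrm{freq}\cdot)$ cancels exactly, so the residual `err` is at the level of floating-point round-off, and `lam` matches the Gaussian's Fourier transform $\sqrt{\pi}\,\sigma\, e^{-(\sigma\,\mathrm{freq}/2)^2}$.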

In [8] an upper bound on the entropy numbers was given in terms of the eigenvalues of the kernel used. The result is in terms of the entropy numbers of a scaling operator $A$. The notation $(a_s)_s \in \ell_p$ denotes a sequence $(a_1, a_2, \ldots)$ such that $\sum_s |a_s|^p < \infty$.

Theorem 3 (Entropy numbers for $\Phi(X)$) Let $k \colon X \times X \to \mathbb{R}$ be a Mercer kernel. Choose $a_j > 0$ for $j \in \mathbb{N}$ such that $(\sqrt{\lambda_s}/a_s)_s \in \ell_2$, and define $A \colon \ell_2 \to \ell_2$,

  $A \colon (x_j)_j \mapsto R_A\, (a_j x_j)_j$   (12)

with $R_A := C_k \| (\sqrt{\lambda_j}/a_j)_j \|_{\ell_2}$. Then

  $\epsilon_n(A) \le \sup_{j \in \mathbb{N}} 6 C_k \big\| (\sqrt{\lambda_s}/a_s)_s \big\|_{\ell_2}\, n^{-1/j} (a_1 \cdots a_j)^{1/j}.$   (13)

This result leads to the following bounds for SV classes.

Theorem 4 (Bounds for SV classes) Let $k$ be a Mercer kernel. Then for all $n \in \mathbb{N}$,

  $\epsilon_n(F_{R_w}) \le R_w \inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2} \epsilon_n(A)$   (14)

where $A$ is defined as in Theorem 3.

Combining Equations (13) and (14) gives effective bounds on $\mathcal{N}^\epsilon_m(F_{R_w})$, since $\epsilon_n(F_{R_w}) \le \epsilon$ implies

  $\mathcal{N}^\epsilon_m(F_{R_w}) \le n.$

These results thus give a method to obtain bounds on the entropy numbers for kernel machines. In Inequality (14), we can choose $(a_s)_s$ to optimize the bound. The key technical contribution of this paper is the explicit determination of the best choice of $(a_s)_s$.

We assume henceforth that $(\lambda_s)_s$ is fixed and sorted in non-increasing order, and $a_i > 0$ for all $i$. For $j \in \mathbb{N}$, we define the set

  $A_j := \Big\{ (a_s)_s : \sup_{i \in \mathbb{N}} (a_1 \cdots a_i)^{1/i} n^{-1/i} = (a_1 \cdots a_j)^{1/j} n^{-1/j} \Big\}.$   (15)

In other words, $A_j$ is the set of $(a_s)_s$ such that the supremum $\sup_{i \in \mathbb{N}} (a_1 \cdots a_i)^{1/i} n^{-1/i}$ is attained at $i = j$. Let

  $B((a_s), n, j) := \big\| (\sqrt{\lambda_s}/a_s)_s \big\|_{\ell_2}\, (a_1 \cdots a_j)^{1/j} n^{-1/j},$

where for notational simplicity, we write $(a_s)$ for $(a_s)_s$.

3 THE OPTIMAL CHOICE OF $(a_s)_s$ AND $j$

Our main aim in this section is to show that the infimum in (14) and the supremum in (13) can be achieved, and to give an explicit recipe for the sequence $(a^*_s)$ and the number $j^*$ that achieve them. The main technical theorem is as follows.

Theorem 5 Let $k \colon X \times X \to \mathbb{R}$ be a Mercer kernel. Suppose $\lambda_1 \ge \lambda_2 \ge \cdots$ are the eigenvalues of $T_k$. For any $n \in \mathbb{N}$, the minimum

  $j^* := \min\Big\{ j : \lambda_{j+1} \le \Big( \frac{\lambda_1 \cdots \lambda_j}{n^2} \Big)^{1/j} \Big\}$   (16)

always exists, and

  $\inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2}\ \sup_{j \in \mathbb{N}} B((a_s), n, j) = B((a^*_s), n, j^*),$

where

  $a^*_i = \begin{cases} \sqrt{\lambda_i} & \text{when } i \le j^* \\ \big( \sqrt{\lambda_1 \cdots \lambda_{j^*}} / n \big)^{1/j^*} & \text{when } i > j^*. \end{cases}$   (17)

This choice of $(a^*_s)$ results in a simple form for the bound of (14) in terms of $n$ and $\lambda_i$:

Corollary 6 Let $k \colon X \times X \to \mathbb{R}$ be a Mercer kernel and let $A$ be given by (12). Then for any $n \in \mathbb{N}$, the entropy numbers satisfy

  $\inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2} \epsilon_n(A) \le 6 C_k \sqrt{ j^* \Big( \frac{\lambda_1 \cdots \lambda_{j^*}}{n^2} \Big)^{1/j^*} + \sum_{i > j^*} \lambda_i }$   (18)

with

  $j^* = \min\Big\{ j : \lambda_{j+1} \le \Big( \frac{\lambda_1 \cdots \lambda_j}{n^2} \Big)^{1/j} \Big\}.$

This corollary, together with (14), implies Theorem 1.

PROOF OUTLINE

The proof of Theorem 5 is quite long and is in Appendix A. It involves the following four steps.

1. We first prove that for all $n \in \mathbb{N}$,

  $j^* := \min\Big\{ j : \lambda_{j+1} \le \Big( \frac{\lambda_1 \cdots \lambda_j}{n^2} \Big)^{1/j} \Big\}$   (19)

exists, whenever the $\lambda_i$ are the eigenvalues of a Mercer kernel.

2. We then prove that for any $n \in \mathbb{N}$,

  $\inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2}\ \sup_{j \in \mathbb{N}} B((a_s), n, j) \ge \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j).$   (20)

3. The next step is to prove that the choice of $(a^*_s)$ and $j^*$ described by (16) and (17) is optimal. It is separated into two parts:

(a) For any $j \le j^*$ and any $(a_s) \in A_j$, $B((a_s), n, j) \ge B((a^*_s), n, j^*)$ holds.

(b) For any $j > j^*$ and any $(a_s) \in A_j$, $B((a_s), n, j) \ge B((a^*_s), n, j^*)$ also holds.

4. Finally we show that $(a^*_s) \in A_{j^*}$ and $(\sqrt{\lambda_s}/a^*_s)_s \in \ell_2$ when $(a^*_s)$ is chosen according to (17).

4 EXAMPLE

We illustrate the results of this paper with an example. Consider the kernel $k(x, y) = k(x - y)$, where $k(x) = e^{-x^2/\sigma^2}$. For such kernels (RBF kernels) $k(x, x) = 1$ for all $x \in X$. Thus the class (1) can be written as

  $F_{R_w} = \{ \langle w, \mathbf{x} \rangle : \mathbf{x} \in S,\ \|\mathbf{x}\| = 1,\ \|w\| \le R_w \}.$

One can use the fat-shattering dimension to bound the covering number of the class of functions $F_{R_w}$ (see [2]).

Lemma 7 With $F_{R_w}$ as above,

  $\mathrm{fat}_{F_{R_w}}(\gamma) \le \Big( \frac{R_w}{\gamma} \Big)^2.$   (21)

Theorem 8 If $F$ is a class of functions mapping from a set $X$ into the interval $[0, B]$, then for any $\epsilon > 0$, if $m \ge \mathrm{fat}_F(\epsilon/4)$,

  $\log \mathcal{N}^\epsilon_m(F) \le 1 + \mathrm{fat}_F(\epsilon/4)\, \log^2\Big( \frac{e B m}{\epsilon} \Big).$   (22)

Combining these results we have the following bound, with which we shall compare our new bound:

  $\log \mathcal{N}^\epsilon_m(F_{R_w}) \le 1 + \Big( \frac{4 R_w}{\epsilon} \Big)^2 \log^2\Big( \frac{e B m}{\epsilon} \Big).$   (23)

In order to determine the eigenvalues of $T_k$, we need to periodize the kernel. This periodization is necessary in order to get a discrete set of eigenvalues, since $k(x)$ has infinite support (see [9] for further details). For the purpose of the present paper, we can assume a fixed period $\Lambda = 2\pi$. Since the kernel is translation invariant, the eigenfunctions are

  $\psi_n(x) = \frac{1}{\sqrt{\pi}} \cos(n x)$

and so $C_k = 1/\sqrt{\pi}$. The $1/\sqrt{\pi}$ comes from the requirement in Theorem 2 that $\|\psi_j\|_{L_2} = 1$. The eigenvalues are

  $\lambda_j = \sqrt{\pi}\, \sigma\, e^{-\sigma^2 j^2 / 4}.$

Setting $c_1 = \sqrt{\pi}\, \sigma$ and $c_2 = \sigma^2/4$, the eigenvalues can be written as

  $\lambda_j = c_1 e^{-c_2 j^2}.$   (24)

From (16), we know that

  $\lambda_{j+1} \le \Big( \frac{\lambda_1 \cdots \lambda_j}{n^2} \Big)^{1/j}$ implies $j^* \le j$.

But (24) shows that this condition on the eigenvalues is equivalent to

  $c_1 e^{-c_2 (j+1)^2} \le \Big( \frac{c_1^{\,j}\, e^{-c_2 \sum_{i=1}^{j} i^2}}{n^2} \Big)^{1/j},$   (25)

which is equivalent to

  $c_2 \Big( (j+1)^2 - \frac{(j+1)(2j+1)}{6} \Big) \ge \frac{2 \ln n}{j}, \quad\text{i.e.}\quad c_2\, \frac{j(j+1)(4j+5)}{6} \ge 2 \ln n,$

which follows from

  $j^3 \ge \frac{3 \ln n}{c_2}.$

Hence,

  $j^* \le \Big\lceil \Big( \frac{3 \ln n}{c_2} \Big)^{1/3} \Big\rceil.$   (26)
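The slow growth of $j^*$ implied by (26) can be checked directly from (16) and (24). The sketch below (plain Python, log-space arithmetic; the values $c_1 = \sqrt{\pi}$ and $c_2 = 1/4$, i.e. $\sigma = 1$, are illustrative assumptions) computes $j^*$ for increasing $n$ and confirms it stays small, growing roughly like $(\ln n)^{1/3}$.

```python
import math

C1, C2 = math.sqrt(math.pi), 0.25        # lam_j = C1 * exp(-C2 * j^2), sigma = 1

def log_lam(j):
    return math.log(C1) - C2 * j * j

def j_star(n):
    # j* = min{ j : lam_{j+1} <= ((lam_1...lam_j)/n^2)^(1/j) }, in log-space
    log_prod, j = 0.0, 0
    while True:
        j += 1
        log_prod += log_lam(j)
        if log_lam(j + 1) <= (log_prod - 2.0 * math.log(n)) / j:
            return j

for n in (10, 10**3, 10**6, 10**9):
    print(n, j_star(n), (3.0 * math.log(n) / C2) ** (1.0 / 3.0))
```

Even for $n = 10^9$ the effective dimension remains in single digits, tracking the $(3 \ln n / c_2)^{1/3}$ estimate from (26).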

Figure 1: $\epsilon_n$ versus $n$ for a Gaussian kernel, as given by Corollary 6.

Figure 2: $j^*$ versus $n$ for a Gaussian kernel.

We can get several results from Equation (27).

The relationship between $\epsilon_n$ and $n$. For fixed $\sigma$, (27) shows that $\log \epsilon_n^{-1}$ grows like $(\log n)^{2/3}$, which implies

  $\log \mathcal{N}^\epsilon_m(F_{R_w}) = O\Big( \log^{3/2} \frac{1}{\epsilon} \Big),$   (28)

which is considerably better than (23). This can also be seen in Figure 1.

Figure 3: $j^*$ versus $\epsilon$ for a Gaussian kernel. Since $j^*$ can be interpreted as an "effective number of dimensions", this clearly illustrates why the bound on the covering numbers for Gaussian kernels grows so slowly as $\epsilon \to 0$. Even when $\epsilon = 10^{-13}$, $j^*$ is only 13.

The relationship between $\sigma$ and $\epsilon_n$. Here, $\sigma$ is the variance of the Gaussian functions. When $\sigma$ increases, the kernel function will be wider, so the class $F_{R_w}$ should be simpler. In Equation (27), we notice that if $c_1$ decreases, $\epsilon_n$ decreases for fixed $n$. Similarly, if $c_2$ increases, $\epsilon_n$ decreases for fixed $n$. Since the entropy number (and the covering number) indicates the capacity of the learning machine, the more complicated the machine is, the bigger the covering number for fixed $n$. Specifically, we see from Equation (27) that $\log \epsilon_n^{-1} = \Theta\big( \sigma^{2/3} (\log n)^{2/3} \big)$ and that

  $\log \mathcal{N}^\epsilon_m(F_{R_w}) = O\Big( \frac{1}{\sigma} \log^{3/2} \frac{1}{\epsilon} \Big).$

Figures 1 to 3 illustrate our bounds.

5 CONCLUSIONS

We have presented a new formula for bounding the covering numbers of support vector machines in terms of the eigenvalues of an integral operator induced by the kernel. We showed, by way of an example using a Gaussian kernel, that the new bound is easily computed and considerably better than previous results that did not take account of the kernel. We showed explicitly the effect of the choice of width of the kernel in this case.

The "effective number of dimensions", $j^*$, clearly illustrates the characteristics of the kernel functions. For a smooth kernel, the "effective number of dimensions" $j^*$ is small. The value of $j^*$ depends on $n$, which in turn depends on $\epsilon$. Thus $j^*$ can be considered analogous to existing "scale-sensitive" dimensions, such as the fat-shattering dimension. A key difference is that we now have bounds for $j^*$ that explicitly depend on the kernel.

We have discussed the result for the situation where the input dimension is 1. The main complication arising when $d > 1$ is that repeated eigenvalues become generic for isotropic translation invariant kernels. This does not break the bounds as stated (as long as one properly counts the multiplicity of eigenvalues). However, it is possible to obtain bounds that can be tighter in some cases, by using a slightly more refined argument [9].

References

[1] M. Anthony. Probabilistic analysis of learning in artificial neural networks: The PAC model and its variants. Neural Computing Surveys, 1:1-47, 1997. http://www.icsi.berkeley.edu/~jagota/NCS.

[2] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[3] Robert Ash. Information Theory. Interscience Publishers, New York, 1965.

[4] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, 5th Annual ACM Workshop on COLT, pages 144-152, Pittsburgh, PA, 1992. ACM Press.

[5] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

[6] H. König. Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel, 1986.

[7] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.

[8] R. Williamson, A. Smola, and B. Schölkopf. Entropy numbers, operators and support vector kernels. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 127-144, Cambridge, MA, 1999. MIT Press.

[9] R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. NeuroCOLT NC-TR-98-019, Royal Holloway College, 1998.

A PROOF OF THEOREM 1

STEP ONE

As indicated above, we will first prove the existence of $j^*$, which is defined in (19).

Lemma 9 Suppose $\lambda_1 \ge \lambda_2 \ge \cdots$ is a non-increasing sequence of non-negative numbers and $\lim_{j \to \infty} \lambda_j = 0$. Then for all $n \in \mathbb{N}$, there exists $j \in \mathbb{N}$ such that

  $\lambda_{j+1} \le \Big( \frac{\lambda_1 \cdots \lambda_j}{n^2} \Big)^{1/j}.$   (29)

Proof Let $P(j) := \lambda_{j+1}^j / (\lambda_1 \cdots \lambda_j)$. Observe that (29) can be written as $P(j) \le n^{-2}$, and hence for all $n$ there is a $j$ such that (29) is true iff $\lim_{j \to \infty} P(j) = 0$. But

  $P(j) = \frac{\lambda_{j+1}^j}{\lambda_1 \cdots \lambda_j} = \prod_{i=1}^{j} \frac{\lambda_{j+1}}{\lambda_i} \le \frac{\lambda_{j+1}}{\lambda_1},$

since $(\lambda_i)$ is non-increasing. Since $\lim_{j \to \infty} \lambda_{j+1} = 0$, we get $\lim_{j \to \infty} P(j) = 0$. Thus for any $n \in \mathbb{N}$ there is a $j$ such that (29) is true.

Corollary 10 Suppose $k$ is a Mercer kernel and $T_k$ the associated integral operator. If $(\lambda_i)_i = (\lambda_i(T_k))_i$, then the minimum $j^*$ from (19) always exists.

Proof By Mercer's theorem, $(\lambda_i)_i \in \ell_1$ and so $\lim_{i \to \infty} \lambda_i = 0$. Lemma 9 can thus be applied.
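The convergence $P(j) \to 0$ at the heart of Lemma 9 can be watched numerically. The sketch below (plain Python; the polynomially decaying test sequence $\lambda_i = 1/i$ is an assumption for illustration, chosen because it decays far more slowly than a typical Mercer spectrum) evaluates $\log P(j)$ and checks that it decreases without bound, so (29) eventually holds for every $n$.

```python
import math

def log_P(j, log_lam):
    # log P(j) = j * log(lam_{j+1}) - sum_{i <= j} log(lam_i)
    return j * log_lam(j + 1) - sum(log_lam(i) for i in range(1, j + 1))

log_lam = lambda i: -math.log(i)   # lam_i = 1/i: non-increasing, tends to 0

values = [log_P(j, log_lam) for j in (1, 5, 10, 30)]
print(values)   # strictly decreasing, heading to minus infinity
```

Working with $\log P(j)$ rather than $P(j)$ itself avoids floating-point underflow, mirroring how one would evaluate (16) in practice.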

STEP TWO

Lemma 11 Suppose $A_j$ and $B((a_s), n, j)$ are defined as above, and that $j^\dagger$ and $(a^\dagger_s) \in A_{j^\dagger}$ satisfy

  $B((a^\dagger_s), n, j^\dagger) = \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j).$   (30)

Then

  $\inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2}\ \sup_{j \in \mathbb{N}} B((a_s), n, j) \ge \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j).$   (31)

Proof Consider any $(a_s)$ with $(\sqrt{\lambda_s}/a_s)_s \in \ell_2$. Then $(a_s) \in A_{j'}$ for some $j' \in \mathbb{N}$, i.e. the supremum $\sup_i (a_1 \cdots a_i)^{1/i} n^{-1/i}$ is attained at some $i = j'$. Following the definition of $A_{j'}$ and equality (30), we get

  $\sup_{j \in \mathbb{N}} B((a_s), n, j) \ge B((a_s), n, j') \ge \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j).$   (32)

Taking the infimum over all such $(a_s)$ yields (31).

In fact, we can show that the inequality in (31) is in fact an equality. The proof is in Appendix B.

It is now easier to calculate the optimal bound on the entropy number using Lemma 11.

STEP THREE

In this step, we will prove that the choice of $(a^*_s)$ and $j^*$ given in Theorem 5 is optimal. We will first prove a useful technical result.

Lemma 12 Suppose $A_j$ and $(\lambda_i)$ are defined as above, and $(a_s) \in A_j$. Then we have

  $\Big( \sum_{i > j} \frac{\lambda_i}{a_i^2} \Big) \Big( (a_1 \cdots a_j)^{1/j} n^{-1/j} \Big)^2 \ge \sum_{i > j} \lambda_i.$   (33)

Proof Since $(a_s) \in A_j$, the following inequality must be true for all $k \in \mathbb{N}$:

  $(a_1 \cdots a_{j+k})^{1/(j+k)}\, n^{-1/(j+k)} \le (a_1 \cdots a_j)^{1/j}\, n^{-1/j},$   (34)

which implies

  $a_{j+1} a_{j+2} \cdots a_{j+k} \le \Big( (a_1 \cdots a_j)^{1/j} n^{-1/j} \Big)^{k}, \quad k \in \mathbb{N}.$   (35)

Set $\Delta := (a_1 \cdots a_j)^{1/j} n^{-1/j}$. Then (35) can be rewritten as

  $a_{j+1} \cdots a_{j+k} \le \Delta^k, \quad k \in \mathbb{N}.$   (36)

Hence, the left hand side of (33) minus its right hand side can be rewritten as

  $\sum_{i > j} \lambda_i \Big( \frac{\Delta^2}{a_i^2} - 1 \Big),$   (37)

and it suffices to show that this sum is non-negative. From (36) with $k = 1$, $a_{j+1} \le \Delta$, so the first term is non-negative. If $a_i \le \Delta$ for all $i > j$, every term of (37) is non-negative and we are done; so suppose $a_i > \Delta$ for some $i > j$. We separate the sum into several parts: maximal blocks of consecutive indices on which $a_i \le \Delta$ and on which $a_i > \Delta$, delimited by the indices $k_m$ and $l_m$ of (38) (with $k_m$ or $l_m$ set to $\infty$ if the corresponding extremum does not exist). On these blocks one applies the inequality of the arithmetic and geometric means,

  $x_1 + x_2 + \cdots + x_m \ge m\, (x_1 x_2 \cdots x_m)^{1/m} \quad \text{for } x_i \ge 0,$   (40)

together with the product constraint (36) and the fact that $(\lambda_i)$ is non-increasing (so that the non-negative terms of (37) carry the larger eigenvalues), to show that each group of blocks contributes a non-negative amount; see the chains (39)-(43). Summing over the blocks gives

  $\sum_{i > j} \lambda_i \Big( \frac{\Delta^2}{a_i^2} - 1 \Big) \ge 0.$   (44)

Noticing (37), inequality (33) is true.

Now, let us prove the main result.

Lemma 13 Let $A_j$ and $B((a_s), n, j)$ be defined as above. Then we have

  $B((a^*_s), n, j^*) = \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j),$   (45)

where

  $a^*_i = \begin{cases} \sqrt{\lambda_i} & \text{when } i \le j^* \\ \big( \sqrt{\lambda_1 \cdots \lambda_{j^*}} / n \big)^{1/j^*} & \text{when } i > j^*, \end{cases}$   (46)

  $j^* = \min\Big\{ j : \lambda_{j+1} \le \Big( \frac{\lambda_1 \cdots \lambda_j}{n^2} \Big)^{1/j} \Big\}.$   (47)

Proof The main idea is to compare $B((a_s), n, j)$ with $B((a^*_s), n, j^*)$ and show $B((a_s), n, j) \ge B((a^*_s), n, j^*)$ for all $j \in \mathbb{N}$ and any $(a_s) \in A_j$. From the definition of $B((a_s), n, j)$, we know

  $B^2((a_s), n, j) = \Big( \sum_i \frac{\lambda_i}{a_i^2} \Big) \Big( (a_1 \cdots a_j)^{1/j} n^{-1/j} \Big)^2$

and

  $B^2((a^*_s), n, j^*) = j^* \Big( \frac{\lambda_1 \cdots \lambda_{j^*}}{n^2} \Big)^{1/j^*} + \sum_{i > j^*} \lambda_i.$

Hence,

  $B^2((a_s), n, j) - B^2((a^*_s), n, j^*) = \Big( \sum_i \frac{\lambda_i}{a_i^2} \Big) \Big( (a_1 \cdots a_j)^{1/j} n^{-1/j} \Big)^2 - j^* \Big( \frac{\lambda_1 \cdots \lambda_{j^*}}{n^2} \Big)^{1/j^*} - \sum_{i > j^*} \lambda_i.$   (48)

Part a: the case $j \le j^*$.

Rewrite (48) as

  $B^2((a_s), n, j) - B^2((a^*_s), n, j^*) = E_1 + E_2 + E_3$   (49)

where

  $E_1 := \Big( \sum_{i \le j} \frac{\lambda_i}{a_i^2} \Big) \big( (a_1 \cdots a_j)^{1/j} n^{-1/j} \big)^2 - j \Big( \frac{\lambda_1 \cdots \lambda_j}{n^2} \Big)^{1/j},$

  $E_2 := \Big( \sum_{i > j} \frac{\lambda_i}{a_i^2} \Big) \big( (a_1 \cdots a_j)^{1/j} n^{-1/j} \big)^2 - \sum_{i > j} \lambda_i,$

  $E_3 := j \Big( \frac{\lambda_1 \cdots \lambda_j}{n^2} \Big)^{1/j} + \sum_{i > j} \lambda_i - j^* \Big( \frac{\lambda_1 \cdots \lambda_{j^*}}{n^2} \Big)^{1/j^*} - \sum_{i > j^*} \lambda_i.$

We will show $E_1 \ge 0$, $E_2 \ge 0$ and $E_3 \ge 0$.

To prove $E_1 \ge 0$: since $\lambda_i \ge 0$ and $a_i > 0$, we exploit the inequality of the arithmetic and geometric means (40). Hence

  $\sum_{i \le j} \frac{\lambda_i}{a_i^2} \ge j \Big( \prod_{i \le j} \frac{\lambda_i}{a_i^2} \Big)^{1/j} = j\, \frac{(\lambda_1 \cdots \lambda_j)^{1/j}}{(a_1 \cdots a_j)^{2/j}},$   (50)

and multiplying both sides by $\big( (a_1 \cdots a_j)^{1/j} n^{-1/j} \big)^2$ shows $E_1 \ge 0$.

To prove $E_2 \ge 0$: applying Lemma 12 shows $E_2 \ge 0$.

To prove $E_3 \ge 0$: let us define the function

  $g(j) := j \Big( \frac{\lambda_1 \cdots \lambda_j}{n^2} \Big)^{1/j} + \sum_{i > j} \lambda_i.$   (51)

We will show that $g(j)$ is a non-increasing function of $j$ for $j \le j^*$. Set $C_j := ((\lambda_1 \cdots \lambda_j)/n^2)^{1/j}$, so that $C_{j+1}^{j+1} = C_j^{\,j} \lambda_{j+1}$; then

  $g(j) - g(j+1) = j C_j + \lambda_{j+1} - (j+1) C_{j+1}.$   (52)

By the AM-GM inequality (40), applied to the $j+1$ numbers $C_j, \ldots, C_j, \lambda_{j+1}$,

  $j C_j + \lambda_{j+1} \ge (j+1) \big( C_j^{\,j} \lambda_{j+1} \big)^{1/(j+1)} = (j+1) C_{j+1},$   (53)

hence $g(j) \ge g(j+1)$. (Note also that, by the minimality defining $j^*$ in (47),

  $\lambda_{j+1} > \Big( \frac{\lambda_1 \cdots \lambda_j}{n^2} \Big)^{1/j} \quad \text{for } j < j^*,$   (54)

and recall the elementary factorization

  $x^n - y^n = (x - y) \sum_{i=0}^{n-1} x^{n-1-i} y^i,$   (55)

both of which are used again below.) Since $j \le j^*$, we get

  $E_3 = g(j) - g(j^*) \ge 0.$   (56)

Combining the above results, we get

  $B^2((a_s), n, j) \ge B^2((a^*_s), n, j^*) \quad \text{for } j \le j^*.$   (57)

Part b: the case $j > j^*$.

For convenience, set $D_i := (a_1 \cdots a_i)^{1/i} n^{-1/i}$. Rewrite (48) as

  $B^2((a_s), n, j) - B^2((a^*_s), n, j^*) = F_1 + F_2$   (58)

where

  $F_1 := \Big( \sum_{i \le j} \frac{\lambda_i}{a_i^2} \Big) D_j^2 - j^* \Big( \frac{\lambda_1 \cdots \lambda_{j^*}}{n^2} \Big)^{1/j^*} - \sum_{j^* < i \le j} \lambda_i,$

  $F_2 := \Big( \sum_{i > j} \frac{\lambda_i}{a_i^2} \Big) D_j^2 - \sum_{i > j} \lambda_i.$

We will show $F_1 \ge 0$ and $F_2 \ge 0$.

To prove $F_1 \ge 0$ (the chains (59)-(69) of the argument): one splits $F_1$ into a part $P_1$ carried by the indices $i \le j^*$ and a part $P_2$ carried by $j^* < i \le j$. The membership $(a_s) \in A_j$ gives $D_i \le D_j$ for all $i \in \mathbb{N}$, and the index range is separated into maximal blocks according to whether $a_i \le D_j$ or $a_i > D_j$ (the indices $k_m$ and $l_m$ of (61)-(63)). Repeatedly applying the AM-GM inequality (40), the definition (47) of $j^*$ (through inequality (54)) and the factorization (55) across these blocks, in the same manner as in the proof of Lemma 12, yields $P_1 + P_2 \ge 0$; combining the resulting bounds (60) and (64)-(69) gives

  $F_1 \ge 0.$   (70)

To prove $F_2 \ge 0$: using Lemma 12 again, we get

  $F_2 \ge 0.$   (71)

Combining (70) and (71), we get

  $B^2((a_s), n, j) \ge B^2((a^*_s), n, j^*) \quad \text{for } j > j^*.$   (72)

Since $B \ge 0$, combining (57) and (72), (45) is proved true.

STEP FOUR

We supposed that $(a^*_s) \in A_{j^*}$ in the above proof. Now let us show it. First, for $j < j^*$: from (54), $\lambda_{j+1} > C_j := ((\lambda_1 \cdots \lambda_j)/n^2)^{1/j}$, so $C_{j+1}^{j+1} = C_j^{\,j} \lambda_{j+1} > C_j^{\,j+1}$, i.e. $C_{j+1} > C_j$. Since

  $(a^*_1 \cdots a^*_j)^{1/j} n^{-1/j} = \sqrt{C_j} \quad \text{for } j \le j^*,$

the sequence $j \mapsto (a^*_1 \cdots a^*_j)^{1/j} n^{-1/j}$ is increasing for $j \le j^*$. Second, for $j > j^*$: writing $\beta := (\sqrt{\lambda_1 \cdots \lambda_{j^*}}/n)^{1/j^*}$, we get

  $(a^*_1 \cdots a^*_j)^{1/j} n^{-1/j} = \Big( \frac{\sqrt{\lambda_1 \cdots \lambda_{j^*}}}{n}\, \beta^{\,j - j^*} \Big)^{1/j} = \big( \beta^{\,j^*} \beta^{\,j - j^*} \big)^{1/j} = \beta = (a^*_1 \cdots a^*_{j^*})^{1/j^*} n^{-1/j^*}.$

Thus $\sup_i (a^*_1 \cdots a^*_i)^{1/i} n^{-1/i}$ is attained at $i = j^*$, and $(a^*_s) \in A_{j^*}$.

We can also show $(\sqrt{\lambda_s}/a^*_s)_s \in \ell_2$:

  $\big\| (\sqrt{\lambda_s}/a^*_s)_s \big\|_{\ell_2} = \sqrt{ \sum_i \frac{\lambda_i}{(a^*_i)^2} } = \sqrt{ j^* + \Big( \frac{n^2}{\lambda_1 \cdots \lambda_{j^*}} \Big)^{1/j^*} \sum_{i > j^*} \lambda_i }.$   (73)

When $k(x, y)$ and $n$ are given, the $\lambda_i$ and $j^*$ are determined, so $(n^2/(\lambda_1 \cdots \lambda_{j^*}))^{1/j^*}$ is a constant. By Mercer's theorem, $(\lambda_i)_i \in \ell_1$ and thus $\sum_{i > j^*} \lambda_i$ is finite. So (73) is finite. Hence $(\sqrt{\lambda_s}/a^*_s)_s \in \ell_2$ is proved.
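Both claims of this step (the membership $(a^*_s) \in A_{j^*}$ and the finiteness of (73)) are easy to sanity-check numerically. The sketch below (plain Python; the example spectrum $\lambda_i = 2^{-i}$ and $n = 4$ are assumptions for illustration) builds $a^*$ according to (17)/(46), evaluates $D_i = (a^*_1 \cdots a^*_i)^{1/i} n^{-1/i}$ in log-space and confirms the supremum over $i$ is attained at $j^*$, with the $\ell_2$ norm in (73) finite.

```python
import math

lams = [2.0 ** -(i + 1) for i in range(40)]   # example spectrum, lam_i = 2^-i
n = 4

def find_j_star():
    # j* = min{ j : lam_{j+1} <= ((lam_1...lam_j)/n^2)^(1/j) }
    log_prod = 0.0
    for j in range(1, len(lams)):
        log_prod += math.log(lams[j - 1])
        if math.log(lams[j]) <= (log_prod - 2.0 * math.log(n)) / j:
            return j

j_star = find_j_star()
log_beta = (0.5 * sum(math.log(l) for l in lams[:j_star]) - math.log(n)) / j_star
a_star = [0.5 * math.log(lams[i]) if i < j_star else log_beta
          for i in range(len(lams))]          # log a*_i, per (17)/(46)

def log_D(i):
    # log of D_i = (a*_1 ... a*_i)^(1/i) * n^(-1/i)
    return (sum(a_star[:i]) - math.log(n)) / i

Ds = [log_D(i) for i in range(1, len(lams) + 1)]
norm_sq = j_star + sum(lams[j_star:]) * math.exp(-2.0 * log_beta)  # (73) squared

print(j_star, max(Ds), Ds[j_star - 1], norm_sq)
```

The run shows $\max_i D_i$ coinciding (up to round-off) with $D_{j^*}$, exactly as the argument above predicts: $D_i$ increases up to $j^*$ and is constant thereafter.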

CONCLUSION

Following the proof above, we get

Corollary 14 Suppose $A_j$ and $B((a_s), n, j)$ are defined as above. Then we have

  $B((a^*_s), n, j^*) = \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j),$   (74)

where

  $a^*_i = \begin{cases} \sqrt{\lambda_i} & \text{when } i \le j^* \\ \big( \sqrt{\lambda_1 \cdots \lambda_{j^*}} / n \big)^{1/j^*} & \text{when } i > j^*, \end{cases}$   (75)

  $j^* = \min\Big\{ j : \lambda_{j+1} \le \Big( \frac{\lambda_1 \cdots \lambda_j}{n^2} \Big)^{1/j} \Big\}.$   (76)

Theorem 1 is then established.

B THE PROOF THAT INEQUALITY (31) CANNOT BE IMPROVED

Lemma 15 Suppose $A_j$ and $B((a_s), n, j)$ are defined as above. Suppose $j^\dagger \in \mathbb{N}$ and $(a^\dagger_s) \in A_{j^\dagger}$ realising (30) exist. Then

  $\inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2}\ \sup_{j \in \mathbb{N}} B((a_s), n, j) = \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j).$   (77)

Proof Let us prove

  $\inf_{(a_s)_s :\, (\sqrt{\lambda_s}/a_s)_s \in \ell_2}\ \sup_{j \in \mathbb{N}} B((a_s), n, j) \le \inf_{j \in \mathbb{N}}\ \inf_{(a_s) \in A_j} B((a_s), n, j).$   (78)

Choose $(a^\dagger_s) \in A_{j^\dagger}$ to realise the infimum on the right hand side, i.e. $B((a^\dagger_s), n, j^\dagger) = \inf_j \inf_{(a_s) \in A_j} B((a_s), n, j)$. Since $(a^\dagger_s) \in A_{j^\dagger}$, the supremum $\sup_j B((a^\dagger_s), n, j)$ is attained at $j = j^\dagger$. Then

  $\inf_{(a_s)_s}\ \sup_{j} B((a_s), n, j) \le \sup_{j} B((a^\dagger_s), n, j) = B((a^\dagger_s), n, j^\dagger) = \inf_{j}\ \inf_{(a_s) \in A_j} B((a_s), n, j).$

We have already proved (Lemma 11)

  $\inf_{(a_s)_s}\ \sup_{j} B((a_s), n, j) \ge \inf_{j}\ \inf_{(a_s) \in A_j} B((a_s), n, j).$

So equation (77) is proved to be true.

Acknowledgements

This work was supported by the Australian Research Council. Thanks to Bernhard Schölkopf and Alex J. Smola for useful discussions.
