SVM Soft Margin Classifiers:
Linear Programming versus Quadratic Programming

Qiang Wu (wu.qiang@student.cityu.edu.hk)
Ding-Xuan Zhou (mazhou@cityu.edu.hk)
Department of Mathematics, City University of Hong Kong,
Tat Chee Avenue, Kowloon, Hong Kong, China
Support vector machine soft margin classifiers are important learning algorithms for classification problems. They can be stated as convex optimization problems and are suitable for a large data setting. The linear programming SVM classifier is especially efficient for very large sample sizes. But little is known about its convergence, compared with the well-understood quadratic programming SVM classifier. In this paper, we point out the difficulty and provide an error analysis. Our analysis shows that the convergence behavior of the linear programming SVM is almost the same as that of the quadratic programming SVM. This is achieved by setting a stepping stone between the linear programming SVM and the classical 1-norm soft margin classifier. An upper bound for the misclassification error is presented for general probability distributions. Explicit learning rates are derived for deterministic and weakly separable distributions, and for distributions satisfying some Tsybakov noise condition.
1 Introduction

Support vector machines (SVMs) form an important subject in learning theory. They are very efficient for many applications, especially for classification problems.
The classical SVM model, the so-called 1-norm soft margin SVM, was introduced with polynomial kernels by Boser et al. (1992) and with general kernels by Cortes and Vapnik (1995). Since then many different forms of SVM algorithms have been introduced for different purposes (e.g., Niyogi and Girosi 1996; Vapnik 1998). Among them the linear programming (LP) SVM (Bradley and Mangasarian 2000; Kecman and Hadzic 2000; Niyogi and Girosi 1996; Pedroso and Murata 2001; Vapnik 1998) is an important one because of its linearity and flexibility for a large data setting. The term "linear programming" means that the algorithm is based on linear programming optimization. Correspondingly, the 1-norm soft margin SVM is also called the quadratic programming (QP) SVM, since it is based on quadratic programming optimization (Vapnik 1998). Many experiments demonstrate that the LP-SVM is efficient and performs even better than the QP-SVM for some purposes: solving huge sample size problems (Bradley and Mangasarian 2000), improving computational speed (Pedroso and Murata 2001), and reducing the number of support vectors (Kecman and Hadzic 2000).

While the convergence of the QP-SVM has become well understood thanks to recent works (Steinwart 2002; Zhang 2004; Wu and Zhou 2003; Scovel and Steinwart 2003; Wu et al. 2004), little is known for the LP-SVM. The purpose of this paper is to point out the main difficulty and then provide an error analysis for the LP-SVM.
Consider the binary classification setting. Let (X, d) be a compact metric space and Y = {1, −1}. A binary classifier is a function f: X → Y which labels every point x ∈ X with some y ∈ Y.

Both the LP-SVM and the QP-SVM considered here are kernel-based classifiers. A function $K: X \times X \to \mathbb{R}$ is called a Mercer kernel if it is continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points $\{x_1, \ldots, x_\ell\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^{\ell}$ is positive semidefinite.
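For concreteness, here is a minimal sketch (ours, not part of the original text) that builds the kernel matrix of a Gaussian Mercer kernel on a few points and spot-checks symmetry and the positive semidefiniteness condition via random quadratic forms; the function names are our own:

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    """Mercer kernel K(x, y) = exp(-|x - y|^2 / sigma^2) on X, a subset of R."""
    return math.exp(-((x - y) ** 2) / sigma ** 2)

# Kernel matrix on a finite set of distinct points.
points = [0.0, 0.5, 1.3, 2.0]
n = len(points)
K = [[gaussian_kernel(xi, xj) for xj in points] for xi in points]

# Symmetry: K(x_i, x_j) = K(x_j, x_i).
assert all(abs(K[i][j] - K[j][i]) < 1e-12 for i in range(n) for j in range(n))

# Positive semidefiniteness requires v^T K v >= 0 for every v; here we
# spot-check the quadratic form on random vectors (a necessary condition).
random.seed(0)
for _ in range(100):
    v = [random.uniform(-1.0, 1.0) for _ in range(n)]
    quad = sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))
    assert quad >= -1e-12
```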
Let $z = \{(x_1, y_1), \ldots, (x_m, y_m)\} \subset (X \times Y)^m$ be the sample. Motivated by reducing the number of support vectors of the 1-norm soft margin SVM, Vapnik (1998) introduced the LP-SVM algorithm associated with a Mercer kernel K. It is based on the following linear programming optimization problem:
$$\min_{\alpha \in \mathbb{R}^m_+,\, b \in \mathbb{R}} \ \frac{1}{m}\sum_{i=1}^m \xi_i + \frac{1}{C}\sum_{i=1}^m \alpha_i$$
$$\text{subject to}\quad y_i\Big(\sum_{j=1}^m \alpha_j y_j K(x_i, x_j) + b\Big) \ge 1 - \xi_i, \qquad \xi_i \ge 0,\ i = 1, \ldots, m. \quad (1.1)$$
Here $\alpha = (\alpha_1, \ldots, \alpha_m)$ and the $\xi_i$'s are slack variables. The trade-off parameter $C = C(m) > 0$ depends on m and is crucial. If $\alpha_z = (\alpha_{1,z}, \ldots, \alpha_{m,z}),\ b_z$ solves the optimization problem (1.1), the LP-SVM classifier is given by $\mathrm{sgn}(f_z)$ with
$$f_z(x) = \sum_{i=1}^m \alpha_{i,z}\, y_i K(x, x_i) + b_z. \quad (1.2)$$
For a real-valued function $f: X \to \mathbb{R}$, its sign function is defined as sgn(f)(x) = 1 if f(x) ≥ 0 and sgn(f)(x) = −1 otherwise.
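As an illustration of (1.2), the following sketch (ours; the coefficient vector is a hypothetical LP solution, not the output of an actual solver run) evaluates the kernel expansion plus offset and takes the sign:

```python
def decision_function(alpha, y, X, b, kernel):
    """Evaluate f_z(x) = sum_i alpha_i y_i K(x, x_i) + b, as in (1.2)."""
    def f(x):
        return sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, y, X)) + b
    return f

def sgn(t):
    """The sign function defined above: +1 if t >= 0, -1 otherwise."""
    return 1 if t >= 0 else -1

# Toy one-dimensional data with the linear kernel K(x, x') = x * x'.
kernel = lambda u, v: u * v
X, y = [-1.0, 1.0], [-1, 1]
alpha, b = [0.5, 0.5], 0.0          # a hypothetical solution of (1.1)
f_z = decision_function(alpha, y, X, b, kernel)
print(sgn(f_z(2.0)), sgn(f_z(-2.0)))   # prints: 1 -1
```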
The QP-SVM is based on a quadratic programming optimization problem:
$$\min_{\alpha \in \mathbb{R}^m_+,\, b \in \mathbb{R}} \ \frac{1}{m}\sum_{i=1}^m \xi_i + \frac{1}{2\tilde C}\sum_{i,j=1}^m \alpha_i y_i K(x_i, x_j)\,\alpha_j y_j$$
$$\text{subject to}\quad y_i\Big(\sum_{j=1}^m \alpha_j y_j K(x_i, x_j) + b\Big) \ge 1 - \xi_i, \qquad \xi_i \ge 0,\ i = 1, \ldots, m. \quad (1.3)$$
Here $\tilde C = \tilde C(m) > 0$ is also a trade-off parameter depending on the sample size m. If $\tilde\alpha_z = (\tilde\alpha_{1,z}, \ldots, \tilde\alpha_{m,z}),\ \tilde b_z$ solves the optimization problem (1.3), then the 1-norm soft margin classifier is defined by $\mathrm{sgn}(\tilde f_z)$ with
$$\tilde f_z(x) = \sum_{i=1}^m \tilde\alpha_{i,z}\, y_i K(x, x_i) + \tilde b_z. \quad (1.4)$$
Observe that both the LP-SVM classifier (1.1) and the QP-SVM classifier (1.3) are implemented by convex optimization problems. In contrast, neural network learning algorithms are often performed by nonconvex optimization problems.
The reproducing kernel property of Mercer kernels ensures the nice approximation power of SVM classifiers. Recall that the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_K$ associated with a Mercer kernel K is defined (Aronszajn 1950) to be the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_y \rangle_K = K(x, y)$. The reproducing property is given by
$$\langle f, K_x \rangle_K = f(x), \qquad \forall x \in X,\ f \in \mathcal{H}_K. \quad (1.5)$$
The QP-SVM is well understood. It has attractive approximation properties (see (2.2) below) because the learning scheme can be represented as a Tikhonov regularization (Evgeniou et al. 2000), modified by an offset, associated with the RKHS:
$$\tilde f_z = \arg\min_{f = f^* + b \in \mathcal{H}_K + \mathbb{R}} \ \frac{1}{m}\sum_{i=1}^m \big(1 - y_i f(x_i)\big)_+ + \frac{1}{2\tilde C}\|f^*\|_K^2, \quad (1.6)$$
where $(t)_+ = \max\{0, t\}$. Set $\overline{\mathcal{H}}_K := \mathcal{H}_K + \mathbb{R}$. For a function $f = f_1 + b_1 \in \overline{\mathcal{H}}_K$, we denote $f^* = f_1$ and $b_f = b_1$. Write $b_{\tilde f_z}$ as $\tilde b_z$.
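The objective of (1.6) is easy to evaluate once the kernel expansion of $f^*$ is fixed; the small sketch below is ours (the helper `qp_objective` and the toy data are assumptions, and no optimization is performed), using the reproducing property to compute the RKHS norm:

```python
def hinge(t):
    """(t)_+ = max{0, t}."""
    return max(0.0, t)

def qp_objective(c, b, X, y, kernel, C_tilde):
    """Objective of (1.6) for f = f* + b with f* = sum_i c_i K_{x_i}:
    (1/m) sum_i (1 - y_i f(x_i))_+ + (1/(2 C_tilde)) ||f*||_K^2."""
    m = len(X)
    def f(x):
        return sum(ci * kernel(x, xi) for ci, xi in zip(c, X)) + b
    empirical = sum(hinge(1.0 - yi * f(xi)) for xi, yi in zip(X, y)) / m
    # By the reproducing property, ||f*||_K^2 = sum_{i,j} c_i c_j K(x_i, x_j).
    norm_sq = sum(c[i] * c[j] * kernel(X[i], X[j])
                  for i in range(m) for j in range(m))
    return empirical + norm_sq / (2.0 * C_tilde)

X, y = [-1.0, 1.0], [-1, 1]
k = lambda u, v: u * v
print(qp_objective([0.0, 1.0], 0.0, X, y, k, C_tilde=1.0))   # prints: 0.5
```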
It turns out that (1.6) is the same as (1.3) together with (1.4). To see this, we first note that $\tilde f_z^*$ must lie in the span of $\{K_{x_i}\}_{i=1}^m$ according to the representation theorem (Wahba 1990). Next, the dual problem of (1.6) shows (Vapnik 1998) that the coefficient of $K_{x_i}$, namely $\alpha_i y_i$, has the same sign as $y_i$. Finally, the definition of the $\mathcal{H}_K$ norm yields
$$\|\tilde f_z^*\|_K^2 = \Big\|\sum_{i=1}^m \alpha_i y_i K_{x_i}\Big\|_K^2 = \sum_{i,j=1}^m \alpha_i y_i K(x_i, x_j)\,\alpha_j y_j.$$
The rich knowledge of Tikhonov regularization schemes and the idea of the bias-variance trade-off developed in the neural network literature provide a mathematical foundation for the QP-SVM. In particular, its convergence is well understood thanks to the work done within the last few years. Here the form (1.6) illustrates some advantages of the QP-SVM: the minimization is taken over the whole space $\overline{\mathcal{H}}_K$, so we expect the QP-SVM to have good approximation power, comparable to the approximation error of the space $\overline{\mathcal{H}}_K$.
Things are totally different for the LP-SVM. Set
$$\mathcal{H}_{K,z} = \Big\{\sum_{i=1}^m \alpha_i y_i K(x, x_i) : \alpha = (\alpha_1, \ldots, \alpha_m) \in \mathbb{R}^m_+\Big\}.$$
Then the LP-SVM scheme (1.1) can be written as
$$f_z = \arg\min_{f = f^* + b \in \mathcal{H}_{K,z} + \mathbb{R}} \ \Big\{\frac{1}{m}\sum_{i=1}^m \big(1 - y_i f(x_i)\big)_+ + \frac{1}{C}\,\Omega(f^*)\Big\}. \quad (1.7)$$
Here we have denoted $\Omega(f^*) = \|\alpha\|_{\ell^1} = \sum_{i=1}^m \alpha_i$ for $f^* = \sum_{i=1}^m \alpha_i y_i K_{x_i}$ with $\alpha_i \ge 0$. It plays the role of a norm of $f^*$ in some sense. However, it is not a Hilbert space norm, which raises technical difficulty for the mathematical analysis. More seriously, the hypothesis space $\mathcal{H}_{K,z}$ depends on the sample z: the "centers" $x_i$ of the basis functions in $\mathcal{H}_{K,z}$ are determined by the sample z, not free. One might consider regularization schemes in the space of all linear combinations with free centers, but whether the minimization can be reduced to a convex optimization problem of size m, like (1.1), is unknown. Also, it is difficult to relate the corresponding optimum (in a ball with radius C) to $f_z^*$ with respect to the estimation error. Thus, separating the error for the LP-SVM into a sample error term and an approximation error term is not as immediate as for the QP-SVM or for neural network methods (Niyogi and Girosi 1996), where the centers are free. In this paper, we shall overcome this difficulty by setting a stepping stone.
We now turn to the error analysis. Let ρ be a Borel probability measure on $Z := X \times Y$ and let (X, Y) be the corresponding random variable. The prediction power of a classifier f is measured by its misclassification error, i.e., the probability of the event $\{f(X) \ne Y\}$:
$$\mathcal{R}(f) = \mathrm{Prob}\{f(X) \ne Y\} = \int_X P(Y \ne f(x) \mid x)\, d\rho_X. \quad (1.8)$$
Here $\rho_X$ is the marginal distribution and $\rho(\cdot \mid x)$ is the conditional distribution of ρ. The classifier minimizing the misclassification error is called the Bayes rule $f_c$. It takes the form
$$f_c(x) = \begin{cases} 1, & \text{if } P(Y = 1 \mid x) \ge P(Y = -1 \mid x), \\ -1, & \text{if } P(Y = 1 \mid x) < P(Y = -1 \mid x). \end{cases}$$
If we define the regression function of ρ as
$$f_\rho(x) = \int_Y y\, d\rho(y \mid x) = P(Y = 1 \mid x) - P(Y = -1 \mid x), \qquad x \in X,$$
then $f_c = \mathrm{sgn}(f_\rho)$. Note that for a real-valued function f, sgn(f) gives a classifier, and its misclassification error will be denoted by $\mathcal{R}(f)$ for abbreviation.
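To make (1.8) concrete, here is a small Monte Carlo sketch (ours, with a toy conditional probability of our choosing) that estimates the misclassification error of the Bayes rule and of its sign-flipped competitor:

```python
import random

random.seed(1)

def p_pos(x):
    """A toy conditional probability P(Y = 1 | x) on X = [-1, 1]."""
    return 0.9 if x >= 0 else 0.1

def sample(m):
    """Draw m i.i.d. pairs (x, y) with x uniform on X and y ~ P(. | x)."""
    z = []
    for _ in range(m):
        x = random.uniform(-1.0, 1.0)
        y = 1 if random.random() < p_pos(x) else -1
        z.append((x, y))
    return z

def misclassification(f, z):
    """Empirical estimate of R(f) = Prob{f(X) != Y} from a sample z."""
    return sum(1 for x, y in z if f(x) != y) / len(z)

bayes = lambda x: 1 if x >= 0 else -1   # the Bayes rule f_c = sgn(f_rho) here
flip = lambda x: -bayes(x)
z = sample(20000)
r_bayes, r_flip = misclassification(bayes, z), misclassification(flip, z)
assert r_bayes < r_flip                 # the Bayes rule is optimal
```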
Though the Bayes rule exists, it cannot be found directly since ρ is unknown. Instead, we have in hand a set of samples $z = \{z_i\}_{i=1}^m = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ ($m \in \mathbb{N}$). Throughout the paper we assume that $\{z_1, \ldots, z_m\}$ are independently and identically distributed according to ρ. A classification algorithm constructs a classifier $f_z$ based on z.
Our goal is to understand how to choose the parameter C = C(m) in the algorithm (1.1) so that the LP-SVM classifier $\mathrm{sgn}(f_z)$ can approximate the Bayes rule $f_c$ with satisfactory convergence rates (as m → ∞). Our approach provides clues for studying learning algorithms with penalty functionals different from the RKHS norm (Niyogi and Girosi 1996; Evgeniou et al. 2000). It can be extended to schemes with general loss functions (Rosasco et al. 2004; Lugosi and Vayatis 2004; Wu et al. 2004).
2 Main Results

In this paper we investigate learning rates, i.e., the decay of the excess misclassification error $\mathcal{R}(f_z) - \mathcal{R}(f_c)$ as m and C(m) become large.
Consider the QP-SVM classification algorithm $\tilde f_z$ defined by (1.3). Steinwart (2002) showed that $\mathcal{R}(\tilde f_z) - \mathcal{R}(f_c) \to 0$ (as m and $\tilde C = \tilde C(m) \to \infty$) when $\mathcal{H}_K$ is dense in C(X), the space of continuous functions on X with the norm $\|\cdot\|_\infty$. Lugosi and Vayatis (2004) found that for the exponential loss, the excess misclassification error of regularized boosting algorithms can be estimated by the excess generalization error. An important result on the relation between the misclassification error and the generalization error for a convex loss function is due to Zhang (2004); see Bartlett et al. (2003) and Chen et al. (2004) for extensions to general loss functions. Here we consider the hinge loss $V(y, f(x)) = (1 - yf(x))_+$. The generalization error is defined as
$$\mathcal{E}(f) = \int_Z V(y, f(x))\, d\rho.$$
Note that $f_c$ is a minimizer of $\mathcal{E}(f)$. Zhang's result asserts that
$$\mathcal{R}(f) - \mathcal{R}(f_c) \le \mathcal{E}(f) - \mathcal{E}(f_c), \qquad \forall f : X \to \mathbb{R}. \quad (2.1)$$
Thus, the excess misclassification error $\mathcal{R}(\tilde f_z) - \mathcal{R}(f_c)$ can be bounded by the excess generalization error $\mathcal{E}(\tilde f_z) - \mathcal{E}(f_c)$, and the following error decomposition (Wu and Zhou 2003) holds:
$$\mathcal{E}(\tilde f_z) - \mathcal{E}(f_c) \le \Big\{\big[\mathcal{E}(\tilde f_z) - \mathcal{E}_z(\tilde f_z)\big] + \big[\mathcal{E}_z(f_{K,\tilde C}) - \mathcal{E}(f_{K,\tilde C})\big]\Big\} + \mathcal{D}(\tilde C). \quad (2.2)$$
Here $\mathcal{E}_z(f) = \frac{1}{m}\sum_{i=1}^m V(y_i, f(x_i))$. The function $f_{K,\tilde C}$ depends on $\tilde C$ and is defined as
$$f_{K,C} := \arg\min_{f \in \overline{\mathcal{H}}_K} \Big\{\mathcal{E}(f) + \frac{1}{2C}\|f^*\|_K^2\Big\}, \qquad C > 0. \quad (2.3)$$
The decomposition (2.2) makes the error analysis for the QP-SVM easy, similar to that in Niyogi and Girosi (1996). The second term of (2.2) measures the approximation power of $\overline{\mathcal{H}}_K$ for ρ.
Definition 2.1. The regularization error of the system (K, ρ) is defined by
$$\widetilde{\mathcal{D}}(C) := \inf_{f \in \overline{\mathcal{H}}_K} \Big\{\mathcal{E}(f) - \mathcal{E}(f_c) + \frac{1}{2C}\|f^*\|_K^2\Big\}. \quad (2.4)$$
The regularization error for a regularizing function $f_{K,C} \in \overline{\mathcal{H}}_K$ is defined as
$$\mathcal{D}(C) := \mathcal{E}(f_{K,C}) - \mathcal{E}(f_c) + \frac{1}{2C}\|f^*_{K,C}\|_K^2. \quad (2.5)$$

In Wu and Zhou (2003) we showed that $\mathcal{E}(f) - \mathcal{E}(f_c) \le \|f - f_c\|_{L^1_{\rho_X}}$. Hence the regularization error can be estimated by the approximation in a weighted $L^1$ space, as done in Smale and Zhou (2003) and Chen et al. (2004).
Definition 2.2. We say that the probability measure ρ can be approximated by $\overline{\mathcal{H}}_K$ with exponent $0 < \beta \le 1$ if there exists a constant $c_\beta$ such that
$$\text{(H1)}\qquad \widetilde{\mathcal{D}}(C) \le c_\beta\, C^{-\beta}, \qquad \forall C > 0.$$
The first term of (2.2) is called the sample error. It has been well understood in learning theory by means of concentration inequalities; see, e.g., Vapnik (1998), Devroye et al. (1997), Niyogi (1998), Cucker and Smale (2001), Bousquet and Elisseeff (2002).
The approaches developed in Barron (1990), Bartlett (1998), Niyogi and Girosi (1996), and Zhang (2004) separate the regularization error and the sample error concerning $\tilde f_z$. In particular, for the QP-SVM, Zhang (2004) proved that
$$E_{z \in Z^m}\big[\mathcal{E}(\tilde f_z)\big] \le \inf_{f \in \overline{\mathcal{H}}_K}\Big\{\mathcal{E}(f) + \frac{1}{2\tilde C}\|f^*\|_K^2\Big\} + \frac{2\tilde C}{m}. \quad (2.6)$$
It follows that $E_{z \in Z^m}\big[\mathcal{E}(\tilde f_z) - \mathcal{E}(f_c)\big] \le \widetilde{\mathcal{D}}(\tilde C) + \frac{2\tilde C}{m}$. When (H1) holds, Zhang's bound in connection with (2.1) yields $E_{z \in Z^m}\big[\mathcal{R}(\tilde f_z) - \mathcal{R}(f_c)\big] = O(\tilde C^{-\beta}) + \frac{2\tilde C}{m}$. This is similar to some well-known bounds for neural network learning algorithms; see, e.g., Theorem 3.1 in Niyogi and Girosi (1996). The best learning rate derived from (2.6), obtained by choosing $\tilde C = m^{1/(\beta+1)}$, is
$$E_{z \in Z^m}\big[\mathcal{R}(\tilde f_z) - \mathcal{R}(f_c)\big] = O(m^{-\alpha}), \qquad \alpha = \frac{\beta}{\beta + 1}. \quad (2.7)$$
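The exponent in (2.7) comes from balancing the two terms of the preceding bound; the short computation below (ours, spelled out for the reader) verifies the choice of $\tilde C$:

```latex
% Minimize B(\tilde C) = \tilde C^{-\beta} + 2\tilde C/m over \tilde C > 0:
B'(\tilde C) = -\beta\,\tilde C^{-\beta-1} + \frac{2}{m} = 0
\quad\Longleftrightarrow\quad
\tilde C = \Big(\frac{\beta m}{2}\Big)^{\frac{1}{\beta+1}} \asymp m^{\frac{1}{\beta+1}}.
% At this choice both terms have the same order,
\tilde C^{-\beta} \asymp m^{-\frac{\beta}{\beta+1}},
\qquad
\frac{2\tilde C}{m} \asymp m^{\frac{1}{\beta+1}-1} = m^{-\frac{\beta}{\beta+1}},
% which is the exponent \alpha = \beta/(\beta+1) in (2.7).
```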
Observe that the sample error bound $\frac{2\tilde C}{m}$ in (2.6) is independent of the kernel K and of the distribution ρ. If some information about K or ρ is available, the sample error bound, and hence the excess misclassification error bound, can be improved. The information we need about K is its capacity, measured by covering numbers.
Definition 2.3. Let F be a subset of a metric space. For any ε > 0, the covering number $\mathcal{N}(F, \varepsilon)$ is defined to be the minimal integer $N \in \mathbb{N}$ such that there exist N balls with radius ε covering F.

In this paper we only use the uniform covering number. Covering numbers measured by empirical distances are also used in the literature (van der Vaart and Wellner 1996); for comparisons, see Pontil (2003).
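For intuition, a sketch of ours for the simplest case: covering the interval $F = [0, 1]$ with balls (intervals) of radius ε takes $\lceil 1/(2\varepsilon)\rceil$ balls, since each ball covers length 2ε. The helper names below are our own:

```python
import math

def cover_interval(eps):
    """Centers of intervals of radius eps covering F = [0, 1]:
    centers at eps, 3*eps, 5*eps, ... give ceil(1/(2*eps)) balls."""
    n = math.ceil(1.0 / (2.0 * eps))
    return [(2 * k + 1) * eps for k in range(n)]

def is_cover(centers, eps, grid=1000):
    """Check that every grid point of [0, 1] is within eps of some center."""
    return all(min(abs(i / grid - c) for c in centers) <= eps + 1e-12
               for i in range(grid + 1))

eps = 0.05
centers = cover_interval(eps)
assert len(centers) == 10 and is_cover(centers, eps)
```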
Let $B_R = \{f \in \mathcal{H}_K : \|f\|_K \le R\}$. It is a subset of C(X), and its covering number is well defined. We denote the covering number of the unit ball $B_1$ as
$$\mathcal{N}(\varepsilon) := \mathcal{N}(B_1, \varepsilon), \qquad \varepsilon > 0. \quad (2.8)$$
Definition 2.4. The RKHS $\mathcal{H}_K$ is said to have logarithmic complexity exponent $s \ge 1$ if there exists a constant $c_s > 0$ such that
$$\text{(H2)}\qquad \log \mathcal{N}(\varepsilon) \le c_s \big(\log(1/\varepsilon)\big)^s.$$
It has polynomial complexity exponent $s > 0$ if there is some $c_s > 0$ such that
$$\text{(H2$'$)}\qquad \log \mathcal{N}(\varepsilon) \le c_s (1/\varepsilon)^s.$$
The uniform covering number has been extensively studied in learning theory. In particular, we know that for the Gaussian kernel $K(x, y) = \exp\{-|x - y|^2/\sigma^2\}$ with σ > 0 on a bounded subset X of $\mathbb{R}^n$, (H2) holds with s = n + 1; see Zhou (2002). If K is $C^r$ with r > 0 (Sobolev smoothness), then (H2$'$) is valid with s = 2n/r; see Zhou (2003).
The information we need about ρ is a Tsybakov noise condition (Tsybakov 2004).

Definition 2.5. Let $0 \le q \le \infty$. We say that ρ has Tsybakov noise exponent q if there exists a constant $c_q > 0$ such that
$$\text{(H3)}\qquad \rho_X\big(\{x \in X : |f_\rho(x)| \le c_q t\}\big) \le t^q.$$
All distributions have noise exponent at least 0. Deterministic distributions (which satisfy $|f_\rho(x)| \equiv 1$) have the noise exponent q = ∞ with $c_\infty = 1$.
Using the above conditions on K and ρ, Scovel and Steinwart (2003) showed that when (H1), (H2$'$) and (H3) hold, for every $\epsilon > 0$ and every δ > 0, with confidence 1 − δ,
$$\mathcal{R}(\tilde f_z) - \mathcal{R}(f_c) = O(m^{-\alpha}), \qquad \alpha = \frac{4\beta(q+1)}{(2q + sq + 4)(1 + \beta)} - \epsilon. \quad (2.9)$$
When no conditions are assumed on the distribution (i.e., q = 0) or s = 2 for the kernel (the worst case when empirical covering numbers are used; see van der Vaart and Wellner 1996), the rate reduces to $\alpha = \frac{\beta}{\beta+1} - \epsilon$, arbitrarily close to Zhang's rate (2.7).
Recently, Wu et al. (2004) improved the rate (2.9) and showed that under the same assumptions (H1), (H2$'$) and (H3), for every $\epsilon, \delta > 0$, with confidence 1 − δ,
$$\mathcal{R}(\tilde f_z) - \mathcal{R}(f_c) = O(m^{-\alpha}), \qquad \alpha = \min\Big\{\frac{\beta(q+1)}{\beta(q+2) + (q+1-\beta)s/2} - \epsilon,\ \frac{2\beta}{\beta+1}\Big\}. \quad (2.10)$$
When some condition is assumed on the kernel but not on the distribution, i.e., s < 2 but q = 0, the rate (2.10) has power $\alpha = \min\big\{\frac{\beta}{2\beta + (1-\beta)s/2} - \epsilon,\ \frac{2\beta}{\beta+1}\big\}$. This is better than (2.7) or (2.9) (or the rates given in Bartlett et al. 2003 and Blanchard et al. 2004; see Chen et al. 2004 and Wu et al. 2004 for detailed comparisons) if β < 1. This improvement is possible due to the projection operator.
Definition 2.6. The projection operator π is defined on the space of measurable functions $f: X \to \mathbb{R}$ as
$$\pi(f)(x) = \begin{cases} 1, & \text{if } f(x) > 1, \\ -1, & \text{if } f(x) < -1, \\ f(x), & \text{if } -1 \le f(x) \le 1. \end{cases}$$
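In code, the projection is simply a truncation of function values to [−1, 1]; a minimal sketch of ours, which also checks that projecting never changes the induced classifier:

```python
def project(f):
    """pi(f): truncate the values of f to [-1, 1], as in Definition 2.6."""
    def pf(x):
        v = f(x)
        return 1.0 if v > 1 else (-1.0 if v < -1 else v)
    return pf

def sgn(t):
    return 1 if t >= 0 else -1

f = lambda x: 3.0 * x
pf = project(f)
# The projection never changes the induced classifier: sgn(pi(f)) = sgn(f).
for x in [-2.0, -0.1, 0.0, 0.2, 5.0]:
    assert sgn(pf(x)) == sgn(f(x))
assert pf(5.0) == 1.0 and pf(-2.0) == -1.0 and abs(pf(0.2) - 0.6) < 1e-12
```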
The idea of projections has appeared in margin-based bound analysis, e.g., Bartlett (1998), Lugosi and Vayatis (2004), Zhang (2002), Anthony and Bartlett (1999). We used the projection operator for the purpose of bounding misclassification and generalization errors in Chen et al. (2004). It helps us to get sharper bounds on the sample error: probability inequalities are applied to random variables involving the functions $\pi(\tilde f_z)$ (bounded by 1), not to $\tilde f_z$ (whose corresponding bound increases to infinity as $\tilde C$ becomes large). In this paper we apply the projection operator to the LP-SVM.
We now turn to our main goal, the LP-SVM classification algorithm $f_z$ defined by (1.1). To our knowledge, the convergence of this algorithm has not been verified, even for distributions strictly separable by a universal kernel. What is the main difficulty in the error analysis?
One difficulty lies in the error decomposition: nothing like (2.2) exists for the LP-SVM in the literature. Bounds for the regularization or approximation error independent of z are not available. We do not know whether it can be bounded by a norm in the whole space $\mathcal{H}_K$ or by a norm similar to those in Niyogi and Girosi (1996).

In this paper we overcome the difficulty by means of a stepping stone from the QP-SVM to the LP-SVM. Then we can provide an error analysis for general distributions. In particular, explicit learning rates will be presented. To this end, we first make an error decomposition.
Theorem 1. Let C > 0, 0 < η ≤ 1 and $f_{K,C} \in \overline{\mathcal{H}}_K$. There holds
$$\mathcal{R}(f_z) - \mathcal{R}(f_c) \le 2\eta\,\mathcal{R}(f_c) + \mathcal{S}(m, C, \eta) + 2\mathcal{D}(\eta C),$$
where $\mathcal{S}(m, C, \eta)$ is the sample error defined by
$$\mathcal{S}(m, C, \eta) := \big\{\mathcal{E}(\pi(f_z)) - \mathcal{E}_z(\pi(f_z))\big\} + (1 + \eta)\big\{\mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C})\big\}. \quad (2.11)$$
Theorem 1 will be proved in Section 4. The term $\mathcal{D}(\eta C)$ is the regularization error (Smale and Zhou 2004) defined for a regularizing function $f_{K,C}$ (arbitrarily chosen) by (2.5). In Chen et al. (2004), we showed that
$$\mathcal{D}(C) \ge \widetilde{\mathcal{D}}(C) \ge \frac{\tilde\kappa^2}{2C}, \quad (2.12)$$
where $\tilde\kappa := \mathcal{E}_0/(1 + \kappa)$, $\kappa := \sup_{x \in X}\sqrt{K(x, x)}$, and $\mathcal{E}_0 := \inf_{b \in \mathbb{R}}\big\{\mathcal{E}(b) - \mathcal{E}(f_c)\big\}$.
Also, $\tilde\kappa = 0$ holds only for very special distributions. Hence the decay of $\mathcal{D}(C)$ cannot be faster than O(1/C) in general. Thus, to have satisfactory convergence rates, C cannot be too small; it usually takes the form $m^\tau$ for some τ > 0. The constant κ is the norm of the inclusion $\mathcal{H}_K \subset C(X)$:
$$\|f\|_\infty \le \kappa \|f\|_K, \qquad \forall f \in \mathcal{H}_K. \quad (2.13)$$
Next we focus on analyzing the learning rates. Since a uniform rate is impossible for all probability distributions, as shown by Theorem 7.2 of Devroye et al. (1997), we need to consider subclasses.

The choice of η is important in the upper bound of Theorem 1. If the distribution is deterministic, i.e., $\mathcal{R}(f_c) = 0$, we may choose η = 1. When $\mathcal{R}(f_c) > 0$, we must choose η = η(m) → 0 as m → ∞ in order to get a convergence rate. Of course, the latter choice may lead to a slightly worse rate. Thus, we will consider these two cases separately.
The following proposition gives the bound for deterministic distributions.

Proposition 2.1. Suppose $\mathcal{R}(f_c) = 0$. If $f_{K,C}$ is a function in $\overline{\mathcal{H}}_K$ satisfying $\|V(y, f_{K,C}(x))\|_\infty \le M$, then for every 0 < δ < 1, with confidence 1 − δ there holds
$$\mathcal{R}(f_z) \le 32\,\varepsilon_{m,C} + \frac{20M\log(2/\delta)}{3m} + 8\mathcal{D}(C),$$
where, with a constant $c'_s$ depending on $c_s$, κ and s, $\varepsilon_{m,C}$ is given by
$$\varepsilon_{m,C} = \begin{cases} \dfrac{22}{m}\Big\{\log\dfrac{2}{\delta} + c'_s\Big(\log\big(CM\log\tfrac{2}{\delta}\big) + \log\big(mC\mathcal{D}(C)\big)\Big)^s\Big\}, & \text{if (H2) holds;} \\[3mm] \dfrac{35\log(2/\delta)}{m}\Big\{1 + (c'_s)^{\frac{1}{1+s}}(CM)^{\frac{s}{1+s}}\Big\} + \dfrac{32\,c'_s\big(C\mathcal{D}(C)\big)^{\frac{s}{1+s}}}{3\,m^{1/(1+s)}}, & \text{if (H2$'$) holds.} \end{cases}$$
Proposition 2.1 will be proved in Section 6. As corollaries we obtain learning rates for strictly separable distributions and for weakly separable distributions.

Definition 2.7. We say that ρ is strictly separable by $\overline{\mathcal{H}}_K$ with margin γ > 0 if there is some function $f_\gamma \in \overline{\mathcal{H}}_K$ such that $\|f^*_\gamma\|_K = 1$ and $y f_\gamma(x) \ge \gamma$ almost everywhere.
For the QP-SVM, the strictly separable case is well understood; see, e.g., Vapnik (1998), Cristianini and Shawe-Taylor (2000), and the vast references therein. For the LP-SVM, we have

Corollary 2.1. If ρ is strictly separable by $\overline{\mathcal{H}}_K$ with margin γ > 0 and (H2) holds, then
$$\mathcal{R}(f_z) \le \frac{704}{m}\Big\{\log\frac{2}{\delta} + c'_s\Big(\log m + \log\frac{1}{\gamma^2}\Big)^s\Big\} + \frac{4}{C\gamma^2}.$$
In particular, this yields the learning rate $O\big(\frac{(\log m)^s}{m}\big)$ by taking $C = m/\gamma^2$.
Proof. Take $f_{K,C} = f_\gamma/\gamma$. Then $V(y, f_{K,C}(x)) \equiv 0$, and $\mathcal{D}(C)$ equals $\frac{1}{2C}\|f^*_\gamma/\gamma\|_K^2 = \frac{1}{2C\gamma^2}$. The conclusion follows from Proposition 2.1 by choosing M = 0.
Remark 2.1. For strictly separable distributions, we verify the optimal rate when (H2) holds. Similar rates are true for more general kernels, but we omit the details here.
Definition 2.8. We say that ρ is (weakly) separable by $\overline{\mathcal{H}}_K$ if there is some function $f_{sp} \in \overline{\mathcal{H}}_K$, called the separating function, such that $\|f^*_{sp}\|_K = 1$ and $y f_{sp}(x) > 0$ almost everywhere. It has separating exponent $\theta \in (0, \infty]$ if for some $\gamma_\theta > 0$ there holds
$$\rho_X\big(\{0 < |f_{sp}(x)| < \gamma_\theta t\}\big) \le t^\theta. \quad (2.14)$$
Corollary 2.2. Suppose that ρ is separable by $\overline{\mathcal{H}}_K$ with (2.14) valid.
(i) If (H2) holds, then
$$\mathcal{R}(f_z) = O\Big(\frac{(\log m + \log C)^s}{m} + C^{-\frac{\theta}{\theta+2}}\Big).$$
This gives the learning rate $O\big(\frac{(\log m)^s}{m}\big)$ by taking $C = m^{(\theta+2)/\theta}$.
(ii) If (H2$'$) holds, then
$$\mathcal{R}(f_z) = O\Big(\frac{C^{\frac{s}{1+s}}}{m} + \Big(\frac{C^{\frac{2s}{\theta+2}}}{m}\Big)^{\frac{1}{1+s}} + C^{-\frac{\theta}{\theta+2}}\Big).$$
This yields the learning rate $O\big(m^{-\frac{\theta}{s\theta+2s+\theta}}\big)$ by taking $C = m^{\frac{\theta+2}{s\theta+2s+\theta}}$.
Proof. Take $f_{K,C} = C^{\frac{1}{\theta+2}} f_{sp}/\gamma_\theta$. By the definition of $f_{sp}$, we have $y f_{K,C}(x) \ge 0$ almost everywhere. Hence $0 \le V(y, f_{K,C}(x)) \le 1$. Moreover,
$$\mathcal{E}(f_{K,C}) = \int_X \Big(1 - \frac{C^{\frac{1}{\theta+2}}}{\gamma_\theta}\,|f_{sp}(x)|\Big)_+ d\rho_X \le \rho_X\Big(\big\{0 < |f_{sp}(x)| < \gamma_\theta\, C^{-\frac{1}{\theta+2}}\big\}\Big),$$
which is bounded by $C^{-\frac{\theta}{\theta+2}}$ by (2.14) with $t = C^{-\frac{1}{\theta+2}}$. Therefore, $\mathcal{D}(C) \le \big(1 + \frac{1}{2\gamma_\theta^2}\big) C^{-\frac{\theta}{\theta+2}}$. Then the conclusion follows from Proposition 2.1 by choosing M = 1.
Example. Let X = [−1/2, 1/2] and let ρ be the Borel probability measure on Z such that $\rho_X$ is the Lebesgue measure on X and
$$f_\rho(x) = \begin{cases} -1, & \text{if } -1/2 \le x < 0, \\ 1, & \text{if } 0 < x < 1/2. \end{cases}$$
If we take the linear kernel $K(x, y) = x \cdot y$, then θ = 1 and $\gamma_\theta = 1/2$. Since (H2) is satisfied with s = 1, the learning rate is $O\big(\frac{\log m}{m}\big)$ by taking $C = m^3$.
Remark 2.2. The condition (2.14) with θ = ∞ is exactly the definition of a strictly separable distribution, and $\gamma_\theta$ is the margin.
The choice of $f_{K,C}$ and the regularization error play essential roles in obtaining our error bounds. It influences the strategy for choosing the regularization parameter (model selection) and determines the learning rates. For weakly separable distributions we chose $f_{K,C}$ to be a multiple of a separating function in Corollary 2.2. For the general case, it can be the choice (2.3).

Let us analyze learning rates for distributions having polynomially decaying regularization error, i.e., (H1) with β ≤ 1. This restriction is reasonable because of (2.12).
Theorem 2. Suppose that $\mathcal{R}(f_c) = 0$ and that the hypotheses (H1) and (H2$'$) hold with 0 < s < ∞ and 0 < β ≤ 1, respectively. Take $C = m^\zeta$ with $\zeta := \min\big\{\frac{1}{s+\beta}, \frac{2}{1+\beta}\big\}$. Then for every 0 < δ < 1 there exists a constant c depending on s, β, δ such that with confidence 1 − δ,
$$\mathcal{R}(f_z) \le c\, m^{-\alpha}, \qquad \alpha = \min\Big\{\frac{2\beta}{1+\beta},\ \frac{\beta}{s+\beta}\Big\}.$$
Next we consider general distributions satisfying the Tsybakov noise condition (Tsybakov 2004).

Theorem 3. Assume the hypotheses (H1), (H2$'$) and (H3) with 0 < s < ∞, 0 < β ≤ 1, and 0 ≤ q ≤ ∞. Take $C = m^\zeta$ with
$$\zeta := \min\Big\{\frac{2}{\beta+1},\ \frac{(q+1)(\beta+1)}{s(q+1) + \beta(q+2+qs+s)}\Big\}.$$
For every $\epsilon > 0$ and every 0 < δ < 1 there exists a constant c depending on s, q, β, δ, and $\epsilon$ such that with confidence 1 − δ,
$$\mathcal{R}(f_z) - \mathcal{R}(f_c) \le c\, m^{-\alpha}, \qquad \alpha = \min\Big\{\frac{2\beta}{\beta+1},\ \frac{\beta(q+1)}{s(q+1) + \beta(q+2+qs+s)} - \epsilon\Big\}.$$
Remark 2.3. Since $\mathcal{R}(f_c)$ is usually small for a meaningful classification problem, the upper bound in Theorem 1 tells us that the performance of the LP-SVM is similar to that of the QP-SVM. However, to obtain convergence rates, we need to choose η = η(m) → 0 as m becomes large. This makes our rate worse than that of the QP-SVM when the capacity index s is large. When s is very small, the rate is $O(m^{-\alpha})$ with α close to $\min\big\{\frac{q+1}{q+2}, \frac{2\beta}{\beta+1}\big\}$, which coincides with the rate (2.10) and is better than the rates (2.7) or (2.9) for the QP-SVM. As any $C^\infty$ kernel satisfies (H2$'$) for an arbitrarily small s > 0 (Zhou 2003), this is the case for the polynomial and Gaussian kernels usually used in practice.
Remark 2.4. Here we use a stepping stone from the QP-SVM to the LP-SVM, so the derived learning rates for the LP-SVM are essentially no worse than those of the QP-SVM. It would be interesting to introduce different tools to get learning rates for the LP-SVM that are better than those of the QP-SVM. Also, the choice of the trade-off parameter C in Theorem 3 depends on the indices β (approximation), s (capacity), and q (noise condition). This gives a rate which is optimal for our approach. One can take other choices ζ > 0 (for $C = m^\zeta$), independent of β, s, q, and then derive learning rates according to the proof of Theorem 3, but the derived rates are worse than the one stated in Theorem 3. It would be of importance to give some methods for choosing C adaptively.

Remark 2.5. When empirical covering numbers are used, the capacity index can be restricted to s ∈ [0, 2]. Similar learning rates can be derived, as done in Blanchard et al. (2004) and Wu et al. (2004).
3 Stepping Stone

Recall that in (1.7) the penalty term Ω(f*) is usually not a norm. This makes the scheme difficult to analyze. Since the solution $f_z$ of the LP-SVM has a representation similar to that of $\tilde f_z$ in the QP-SVM, we expect close relations between these schemes; hence the latter may play a role in the analysis of the former. To this end, we need to estimate $\Omega(\tilde f_z^*)$, the $\ell^1$-norm of the coefficients of the solution $\tilde f_z^*$ to (1.4).
Lemma 3.1. For every $\tilde C > 0$, the function $\tilde f_z$ defined by (1.3) and (1.4) satisfies
$$\Omega(\tilde f_z^*) = \sum_{i=1}^m \tilde\alpha_{i,z} \le \tilde C\, \mathcal{E}_z(\tilde f_z) + \|\tilde f_z^*\|_K^2.$$
Proof. The dual problem of the 1-norm soft margin SVM (Vapnik 1998) tells us that the coefficients $\tilde\alpha_{i,z}$ in the expression (1.4) of $\tilde f_z$ satisfy
$$0 \le \tilde\alpha_{i,z} \le \frac{\tilde C}{m} \qquad \text{and} \qquad \sum_{i=1}^m \tilde\alpha_{i,z}\, y_i = 0. \quad (3.1)$$
The definition of the loss function V implies that $1 - y_i \tilde f_z(x_i) \le V\big(y_i, \tilde f_z(x_i)\big)$.
Then
$$\sum_{i=1}^m \tilde\alpha_{i,z} - \sum_{i=1}^m \tilde\alpha_{i,z}\, y_i \tilde f_z(x_i) \le \sum_{i=1}^m \tilde\alpha_{i,z}\, V\big(y_i, \tilde f_z(x_i)\big).$$
Applying the upper bound for $\tilde\alpha_{i,z}$ in (3.1), we can bound the right-hand side above as
$$\sum_{i=1}^m \tilde\alpha_{i,z}\, V\big(y_i, \tilde f_z(x_i)\big) \le \frac{\tilde C}{m}\sum_{i=1}^m V\big(y_i, \tilde f_z(x_i)\big) = \tilde C\, \mathcal{E}_z(\tilde f_z).$$
Applying the second relation in (3.1) yields $\sum_{i=1}^m \tilde\alpha_{i,z}\, y_i \tilde b_z = 0$. It follows that
$$\sum_{i=1}^m \tilde\alpha_{i,z}\, y_i \tilde f_z(x_i) = \sum_{i=1}^m \tilde\alpha_{i,z}\, y_i \big(\tilde f_z^*(x_i) + \tilde b_z\big) = \sum_{i=1}^m \tilde\alpha_{i,z}\, y_i \tilde f_z^*(x_i).$$
But $\tilde f_z^*(x_i) = \sum_{j=1}^m \tilde\alpha_{j,z}\, y_j K(x_i, x_j)$. We have
$$\sum_{i=1}^m \tilde\alpha_{i,z}\, y_i \tilde f_z(x_i) = \sum_{i,j=1}^m \tilde\alpha_{i,z}\, y_i\, \tilde\alpha_{j,z}\, y_j\, K(x_i, x_j) = \|\tilde f_z^*\|_K^2.$$
Hence the bound for $\Omega(\tilde f_z^*)$ follows.
Remark 3.1. Dr. Yiming Ying pointed out to us that equality actually holds in Lemma 3.1. This follows from the KKT conditions, but we only need the inequality here.
4 Error Decomposition

In this section, we estimate $\mathcal{R}(f_z) - \mathcal{R}(f_c)$.
Since sgn(π(f)) = sgn(f), we have $\mathcal{R}(f) = \mathcal{R}(\pi(f))$. Applying (2.1) to π(f), we obtain
$$\mathcal{R}(f) - \mathcal{R}(f_c) = \mathcal{R}(\pi(f)) - \mathcal{R}(f_c) \le \mathcal{E}(\pi(f)) - \mathcal{E}(f_c). \quad (4.1)$$
It is easy to see that $V(y, \pi(f)(x)) \le V(y, f(x))$. Hence
$$\mathcal{E}(\pi(f)) \le \mathcal{E}(f) \qquad \text{and} \qquad \mathcal{E}_z(\pi(f)) \le \mathcal{E}_z(f). \quad (4.2)$$
We are now in a position to prove Theorem 1, which, by (4.1), is an easy consequence of the following result.

Proposition 4.1. Let C > 0, 0 < η ≤ 1 and $f_{K,C} \in \overline{\mathcal{H}}_K$. Then
$$\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_c) + \frac{1}{C}\Omega(f_z^*) \le 2\eta\,\mathcal{R}(f_c) + \mathcal{S}(m, C, \eta) + 2\mathcal{D}(\eta C),$$
where $\mathcal{S}(m, C, \eta)$ is defined by (2.11).
Proof. Take $\tilde f_z$ to be the solution of (1.4) with $\tilde C = \eta C$. We see from the definition of $f_z$ and (4.2) that
$$\Big\{\mathcal{E}_z\big(\pi(f_z)\big) + \frac{1}{C}\Omega(f_z^*)\Big\} - \Big\{\mathcal{E}_z(\tilde f_z) + \frac{1}{C}\Omega(\tilde f_z^*)\Big\} \le 0.$$
This enables us to decompose $\mathcal{E}(\pi(f_z)) + \frac{1}{C}\Omega(f_z^*)$ as
$$\mathcal{E}(\pi(f_z)) + \frac{1}{C}\Omega(f_z^*) \le \Big\{\mathcal{E}\big(\pi(f_z)\big) - \mathcal{E}_z\big(\pi(f_z)\big)\Big\} + \Big\{\mathcal{E}_z(\tilde f_z) + \frac{1}{C}\Omega(\tilde f_z^*)\Big\}.$$
Lemma 3.1 gives $\Omega(\tilde f_z^*) \le \tilde C\,\mathcal{E}_z(\tilde f_z) + \|\tilde f_z^*\|_K^2$. But $\tilde C = \eta C$. Hence
$$\mathcal{E}(\pi(f_z)) + \frac{1}{C}\Omega(f_z^*) \le \Big\{\mathcal{E}\big(\pi(f_z)\big) - \mathcal{E}_z\big(\pi(f_z)\big)\Big\} + (1 + \eta)\,\mathcal{E}_z(\tilde f_z) + \frac{1}{C}\|\tilde f_z^*\|_K^2.$$
Next we use the function $f_{K,C}$ to analyze the second term of the above bound and get
$$\mathcal{E}_z(\tilde f_z) + \frac{1}{(1+\eta)C}\|\tilde f_z^*\|_K^2 \le \mathcal{E}_z(\tilde f_z) + \frac{1}{2\tilde C}\|\tilde f_z^*\|_K^2 \le \mathcal{E}_z(f_{K,C}) + \frac{1}{2\tilde C}\|f_{K,C}^*\|_K^2.$$
This bound can be written as
$$\Big\{\mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C})\Big\} + \Big\{\mathcal{E}(f_{K,C}) + \frac{1}{2\tilde C}\|f_{K,C}^*\|_K^2\Big\}.$$
Combining the above two steps, we find that $\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_c) + \frac{1}{C}\Omega(f_z^*)$ is bounded by
$$\Big\{\mathcal{E}\big(\pi(f_z)\big) - \mathcal{E}_z\big(\pi(f_z)\big)\Big\} + (1+\eta)\Big\{\mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C})\Big\} + (1+\eta)\Big\{\mathcal{E}(f_{K,C}) - \mathcal{E}(f_c) + \frac{1}{2\eta C}\|f_{K,C}^*\|_K^2\Big\} + \eta\,\mathcal{E}(f_c).$$
By the fact that $\mathcal{E}(f_c) = 2\mathcal{R}(f_c)$ and the definition of $\mathcal{D}(\eta C)$, we draw our conclusion.
5 Probability Inequalities

In this section we give some probability inequalities. They modify the Bernstein inequality and extend our previous work in Chen et al. (2004), which was motivated by sample error estimates for the square loss (e.g., Barron 1990; Bartlett 1998; Cucker and Smale 2001; Mendelson 2002). Recall the Bernstein inequality:

Let ξ be a random variable on Z with mean µ and variance σ². If $|\xi - \mu| \le M$, then
$$\mathrm{Prob}\Big\{\Big|\mu - \frac{1}{m}\sum_{i=1}^m \xi(z_i)\Big| > \varepsilon\Big\} \le 2\exp\Big\{-\frac{m\varepsilon^2}{2\big(\sigma^2 + \frac{1}{3}M\varepsilon\big)}\Big\}.$$
The one-sided Bernstein inequality holds without the leading factor 2.
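The Bernstein inequality can be sanity-checked numerically; the following sketch (ours, with a bounded uniform variable and parameters of our choosing) compares the empirical deviation probability of a sample mean against the bound:

```python
import math
import random

random.seed(2)

# A bounded variable: xi ~ Uniform(0, 1), mean mu = 0.5, |xi - mu| <= M = 0.5.
mu, M, var = 0.5, 0.5, 1.0 / 12.0
m, eps, trials = 200, 0.15, 2000

# Empirical probability that the sample mean deviates from mu by more than eps.
bad = sum(1 for _ in range(trials)
          if abs(mu - sum(random.random() for _ in range(m)) / m) > eps)
empirical = bad / trials

# The Bernstein bound: 2 exp(-m eps^2 / (2 (sigma^2 + M eps / 3))).
bound = 2.0 * math.exp(-m * eps ** 2 / (2.0 * (var + M * eps / 3.0)))
assert empirical <= bound   # the deviation probability is tiny here
```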
Proposition 5.1. Let ξ be a random variable on Z satisfying µ ≥ 0, $\xi - \mu \le M$ almost everywhere, and $\sigma^2 \le c\mu^\tau$ for some 0 ≤ τ ≤ 2. Then for every ε > 0 there holds
$$\mathrm{Prob}\Big\{\frac{\mu - \frac{1}{m}\sum_{i=1}^m \xi(z_i)}{(\mu^\tau + \varepsilon^\tau)^{1/2}} > \varepsilon^{1-\frac{\tau}{2}}\Big\} \le \exp\Big\{-\frac{m\varepsilon^{2-\tau}}{2\big(c + \frac{1}{3}M\varepsilon^{1-\tau}\big)}\Big\}.$$
Proof. The one-sided Bernstein inequality tells us that
$$\mathrm{Prob}\Big\{\frac{\mu - \frac{1}{m}\sum_{i=1}^m \xi(z_i)}{(\mu^\tau + \varepsilon^\tau)^{1/2}} > \varepsilon^{1-\frac{\tau}{2}}\Big\} \le \exp\Big\{-\frac{m\varepsilon^{2-\tau}(\mu^\tau + \varepsilon^\tau)}{2\big(\sigma^2 + \frac{1}{3}M\varepsilon^{1-\frac{\tau}{2}}(\mu^\tau + \varepsilon^\tau)^{1/2}\big)}\Big\}.$$
Since $\sigma^2 \le c\mu^\tau$, we have
$$\sigma^2 + \frac{M}{3}\varepsilon^{1-\frac{\tau}{2}}(\mu^\tau + \varepsilon^\tau)^{1/2} \le c\mu^\tau + \frac{M}{3}\varepsilon^{1-\tau}(\mu^\tau + \varepsilon^\tau) \le (\mu^\tau + \varepsilon^\tau)\Big(c + \frac{1}{3}M\varepsilon^{1-\tau}\Big).$$
This yields the desired inequality.
Note that $f_z$ depends on z and thus runs over a set of functions as z changes. We need a probability inequality concerning uniform convergence. Denote $Eg := \int_Z g(z)\, d\rho$.

Lemma 5.1. Let 0 ≤ τ ≤ 1, M > 0, c ≥ 0, and let G be a set of functions on Z such that for every g ∈ G, Eg ≥ 0, $|g - Eg| \le M$ and $Eg^2 \le c(Eg)^\tau$. Then for ε > 0,
$$\mathrm{Prob}\Big\{\sup_{g \in G}\frac{Eg - \frac{1}{m}\sum_{i=1}^m g(z_i)}{\big((Eg)^\tau + \varepsilon^\tau\big)^{1/2}} > 4\varepsilon^{1-\frac{\tau}{2}}\Big\} \le \mathcal{N}\big(G, \varepsilon\big)\exp\Big\{-\frac{m\varepsilon^{2-\tau}}{2\big(c + \frac{1}{3}M\varepsilon^{1-\tau}\big)}\Big\}.$$
Proof. Let $\{g_j\}_{j=1}^N \subset G$ with $N = \mathcal{N}(G, \varepsilon)$ be such that for every g ∈ G there is some j ∈ {1, …, N} satisfying $\|g - g_j\|_\infty \le \varepsilon$. Then, by Proposition 5.1, a standard procedure (Cucker and Smale 2001; Mukherjee et al. 2002; Chen et al. 2004) leads to the conclusion.
Remark 5.1. Various forms of probability inequalities using empirical covering numbers can be found in the literature. For simplicity, we give the current form in Lemma 5.1, which is enough for our purpose.

Let us find the hypothesis space covering $f_z$ when z runs over all possible samples. This is implemented in the following two lemmas.
By the idea of bounding the offset from Wu and Zhou (2003) and Chen et al. (2004), we can prove the following.

Lemma 5.2. For any C > 0, m ∈ ℕ and $z \in Z^m$, we can find a solution $f_z$ of (1.7) satisfying $\min_{1 \le i \le m} |f_z(x_i)| \le 1$. Hence $|b_z| \le 1 + \|f_z^*\|_\infty$.
We shall always choose $f_z$ as in Lemma 5.2. In fact, the only restriction we need to make on the minimizer $f_z$ is to choose $\alpha_i = 0$ and $b_z = y^*$, i.e., $f_z(x) \equiv y^*$, whenever $y_i = y^*$ for all 1 ≤ i ≤ m with some $y^* \in Y$.
Lemma 5.3. For every C > 0, we have $f_z^* \in \mathcal{H}_K$ and $\|f_z^*\|_K \le \kappa\,\Omega(f_z^*) \le \kappa C$.

Proof. It is trivial that $f_z^* \in \mathcal{H}_K$. By the reproducing property (1.5),
$$\|f_z^*\|_K = \Big(\sum_{i,j=1}^m \alpha_{i,z}\alpha_{j,z}\, y_i y_j\, K(x_i, x_j)\Big)^{1/2} \le \kappa\Big(\sum_{i,j=1}^m \alpha_{i,z}\alpha_{j,z}\Big)^{1/2} = \kappa\,\Omega(f_z^*).$$
Comparing the solution of (1.7) with the choice f = 0 + 0, we have $\mathcal{E}_z(f_z) + \frac{1}{C}\Omega(f_z^*) \le \mathcal{E}_z(0) + 0 = 1$. This gives $\Omega(f_z^*) \le C$ and completes the proof.
By Lemma 5.3 and Lemma 5.2 we know that $\pi(f_z)$ lies in
$$\mathcal{F}_R := \Big\{\pi(f) : f \in B_R + \big[-(1+\kappa R),\ 1+\kappa R\big]\Big\} \quad (5.1)$$
with R = κC. The following lemma (Chen et al. 2004) gives the covering number estimate for $\mathcal{F}_R$.

Lemma 5.4. Let $\mathcal{F}_R$ be given by (5.1) with R > 0. For any ε > 0 there holds
$$\mathcal{N}(\mathcal{F}_R, \varepsilon) \le \Big(\frac{2(1+\kappa R)}{\varepsilon} + 1\Big)\,\mathcal{N}\Big(\frac{\varepsilon}{2R}\Big).$$
Using the function set $\mathcal{F}_R$ defined by (5.1), we set, for R > 0,
$$\mathcal{G}_R = \Big\{V\big(y, f(x)\big) - V\big(y, f_c(x)\big) : f \in \mathcal{F}_R\Big\}. \quad (5.2)$$
By Lemma 5.4 and the additive property of the logarithm, we have

Lemma 5.5. Let $\mathcal{G}_R$ be given by (5.2) with R > 0.
(i) If (H2) holds, then there exists a constant $c'_s > 0$ such that
$$\log \mathcal{N}(\mathcal{G}_R, \varepsilon) \le c'_s\Big(\log\frac{R}{\varepsilon}\Big)^s.$$
(ii) If (H2$'$) holds, then there exists a constant $c'_s > 0$ such that
$$\log \mathcal{N}(\mathcal{G}_R, \varepsilon) \le c'_s\Big(\frac{R}{\varepsilon}\Big)^s.$$
The following lemma was proved by Scovel and Steinwart (2003) for general functions f: X → ℝ. With the projection, here f has range [−1, 1] and a simpler proof is given.

Lemma 5.6. Assume (H3). For every function f: X → [−1, 1] there holds
$$E\Big(V\big(y, f(x)\big) - V\big(y, f_c(x)\big)\Big)^2 \le 8\Big(\frac{1}{2c_q}\Big)^{q/(q+1)}\Big(\mathcal{E}(f) - \mathcal{E}(f_c)\Big)^{\frac{q}{q+1}}.$$
Proof. Since f(x) ∈ [−1, 1], we have $V(y, f(x)) - V(y, f_c(x)) = y\big(f_c(x) - f(x)\big)$. It follows that
$$\mathcal{E}(f) - \mathcal{E}(f_c) = \int_X \big(f_c(x) - f(x)\big) f_\rho(x)\, d\rho_X = \int_X \big|f_c(x) - f(x)\big|\, \big|f_\rho(x)\big|\, d\rho_X$$
and
$$E\Big(V\big(y, f(x)\big) - V\big(y, f_c(x)\big)\Big)^2 = \int_X \big(f_c(x) - f(x)\big)^2\, d\rho_X.$$
Let t > 0 and separate the domain X into two sets:X
+
t
:= {x ∈ X:f
ρ
(x) >
c
q
t} and X
−
t
:= {x ∈ X:f
ρ
(x) ≤ c
q
t}.On X
+
t
we have f
c
(x) −f(x)
2
≤
2f
c
(x) − f(x)
f
ρ
(x)
c
q
t
.On X
−
t
we have f
c
(x) − f(x)
2
≤ 4.It follows from
Assumption (H3) that
X
f
c
(x)−f(x)
2
dρ
X
≤
2
E(f) −E(f
c
)
c
q
t
+4ρ
X
(X
−
t
) ≤
2
E(f) −E(f
c
)
c
q
t
+4t
q
.
Choosing t =
(E(f) −E(f
c
))/(2c
q
)
1/(q+1)
yields the desired bound.
Take the function set $\mathcal{G}$ in Lemma 5.1 to be $\mathcal{G}_R$. Then a function $g$ in $\mathcal{G}_R$ takes the form $g(x,y) = V(y,\pi(f)(x)) - V(y,f_c(x))$ with $\pi(f) \in \mathcal{F}_R$. Obviously we have $\|g\|_\infty \le 2$, $Eg = \mathcal{E}(\pi(f)) - \mathcal{E}(f_c)$ and $\frac{1}{m}\sum_{i=1}^m g(z_i) = \mathcal{E}_z(\pi(f)) - \mathcal{E}_z(f_c)$. When Assumption (H3) is valid, Lemma 5.6 tells us that $Eg^2 \le c\,(Eg)^\tau$ with $\tau = \frac{q}{q+1}$ and $c = 8\big(\frac{1}{2c_q}\big)^{q/(q+1)}$. Applying Lemma 5.1 and solving the equation
$$\log \mathcal{N}\big(\mathcal{G}_R,\varepsilon\big) - \frac{m\varepsilon^{2-\tau}}{2\big(c + \frac{2}{3}\varepsilon^{1-\tau}\big)} = \log\delta,$$
we see the following corollary from Lemma 5.5 and Lemma 5.6.
Corollary 5.1. Let $\mathcal{G}_R$ be defined by (5.2) with $R > 0$ and let (H3) hold with $0 \le q \le \infty$. For every $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\big(\mathcal{E}(f) - \mathcal{E}(f_c)\big) - \big(\mathcal{E}_z(f) - \mathcal{E}_z(f_c)\big) \le 4\varepsilon_{m,R} + 4\,\varepsilon_{m,R}^{\frac{q+2}{2(q+1)}}\big(\mathcal{E}(f) - \mathcal{E}(f_c)\big)^{\frac{q}{2(q+1)}}$$
for all $f \in \mathcal{F}_R$, where $\varepsilon_{m,R}$ is given by
$$\varepsilon_{m,R} = \begin{cases}
5\Big(8\big(\frac{1}{2c_q}\big)^{q/(q+1)} + \frac{1}{3}\Big)\Big(\dfrac{\log\frac{1}{\delta} + c_s(\log R + \log m)^s}{m}\Big)^{\frac{q+1}{q+2}}, & \text{if (H2) holds,}\\[2ex]
8\Big(8\big(\frac{1}{2c_q}\big)^{q/(q+1)} + \frac{1}{3}\Big)\Big(\Big(\dfrac{c_s R^s}{m}\Big)^{\frac{q+1}{q+2+qs+s}} + \Big(\dfrac{\log\frac{1}{\delta}}{m}\Big)^{\frac{q+1}{q+2}}\Big), & \text{if (H2$'$) holds.}
\end{cases}$$
6 Rate Analysis

Let us now prove the main results stated in Section 2. We first prove Proposition 2.1.

Proof of Proposition 2.1. Since $\mathcal{R}(f_c) = 0$, $V(y,f_c(x)) = 0$ almost everywhere and $\mathcal{E}(f_c) = 0$. Take $\eta = 1$ in Proposition 4.1.
We first consider the random variable $\xi = V(y,f_{K,C}(x))$. Since $0 \le \xi \le M$ and $E\xi = \mathcal{E}(f_{K,C}) \le \mathcal{D}(C)$, we have
$$\sigma^2(\xi) \le E\xi^2 \le M\,E\xi \le M\mathcal{D}(C).$$
Applying the one-sided Bernstein inequality to $\xi$, we see by solving the quadratic equation $-\frac{m\varepsilon^2}{2(\sigma^2 + M\varepsilon/3)} = \log(\delta/2)$ that with probability $1 - \delta/2$,
$$\mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C}) \le \frac{2M\log\frac{2}{\delta}}{3m} + \sqrt{\frac{2\sigma^2(\xi)\log\frac{2}{\delta}}{m}} \le \frac{5M\log\frac{2}{\delta}}{3m} + \mathcal{D}(C). \qquad (6.1)$$
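The last inequality in (6.1) absorbs the square-root term; spelling out this step (our addition), using $\sigma^2(\xi) \le M\mathcal{D}(C)$ and $\sqrt{xy} \le \frac{x}{2} + \frac{y}{2}$:

```latex
\sqrt{\frac{2\sigma^2(\xi)\log\frac{2}{\delta}}{m}}
\le \sqrt{\frac{2M\log\frac{2}{\delta}}{m}\cdot \mathcal{D}(C)}
\le \frac{M\log\frac{2}{\delta}}{m} + \frac{\mathcal{D}(C)}{2},
```

and $\frac{2M\log(2/\delta)}{3m} + \frac{M\log(2/\delta)}{m} = \frac{5M\log(2/\delta)}{3m}$, which gives the right-hand side of (6.1).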
Next we estimate $\mathcal{E}(\pi(f_z)) - \mathcal{E}_z(\pi(f_z))$. By the definition of $f_z$, there holds
$$\frac{1}{C}\Omega(f_z^*) \le \mathcal{E}_z(f_z) + \frac{1}{C}\Omega(f_z^*) \le \mathcal{E}_z(\tilde{f}_z) + \frac{1}{C}\Omega(\tilde{f}_z^*).$$
According to Lemma 3.1, this is bounded by $2\big(\mathcal{E}_z(\tilde{f}_z) + \frac{1}{2C}\|\tilde{f}_z^*\|_K^2\big)$. This in connection with the definition of $\tilde{f}_z$ yields
$$\frac{1}{C}\Omega(f_z^*) \le 2\Big(\mathcal{E}_z(\tilde{f}_z) + \frac{1}{2C}\|\tilde{f}_z^*\|_K^2\Big) \le 2\Big(\mathcal{E}_z(f_{K,C}) + \frac{1}{2C}\|f_{K,C}^*\|_K^2\Big).$$
Since $\mathcal{E}(f_c) = 0$, $\mathcal{D}(C) = \mathcal{E}(f_{K,C}) + \frac{1}{2C}\|f_{K,C}^*\|_K^2$. It follows that
$$\frac{1}{C}\Omega(f_z^*) \le 2\Big(\mathcal{E}_z(f_{K,C}) - \mathcal{E}(f_{K,C}) + \mathcal{D}(C)\Big).$$
Together with Lemma 5.3 and (6.1), this tells us that with probability $1 - \delta/2$,
$$\|f_z^*\|_K \le \kappa\,\Omega(f_z^*) \le R := 2\kappa C\Big(\frac{5M\log\frac{2}{\delta}}{3m} + 2\mathcal{D}(C)\Big).$$
As we are considering a deterministic case, (H3) holds with $q = \infty$ and $c_\infty = 1$. Recall the definition of $\mathcal{G}_R$ in (5.2). Corollary 5.1 with $q = \infty$ and $R$ given as above implies that
$$\mathcal{E}(\pi(f_z)) - \mathcal{E}_z(\pi(f_z)) \le 4\varepsilon_{m,C} + 4\sqrt{\varepsilon_{m,C}\,\mathcal{E}(\pi(f_z))}$$
with confidence $1 - \delta$, where $\varepsilon_{m,C}$ is defined in the statement.

Putting the above two estimates into Proposition 4.1, we have with confidence $1 - \delta$,
$$\mathcal{E}(\pi(f_z)) \le 4\varepsilon_{m,C} + 4\sqrt{\varepsilon_{m,C}\,\mathcal{E}(\pi(f_z))} + \frac{10M\log\frac{2}{\delta}}{3m} + 4\mathcal{D}(C).$$
Solving the quadratic inequality for $\mathcal{E}(\pi(f_z))$ leads to
$$\mathcal{E}(\pi(f_z)) \le 32\varepsilon_{m,C} + \frac{20M\log\frac{2}{\delta}}{3m} + 8\mathcal{D}(C).$$
Then our conclusion follows from (4.1).
Finally, we turn to the proof of Theorems 2 and 3. To this end, we need a bound for $\|f_{K,C}^*\|_K$. According to the definition, $\frac{1}{2C}\|f_{K,C}^*\|_K^2 \le \mathcal{D}(C)$. Then we have

Lemma 6.1. For every $C > 0$, there hold
$$\|f_{K,C}^*\|_K \le \big(2C\mathcal{D}(C)\big)^{1/2} \quad\text{and}\quad \|f_{K,C}\|_\infty \le 1 + 2\kappa\big(2C\mathcal{D}(C)\big)^{1/2}.$$
Proof of Theorem 2. Take $\tilde{f}_{K,C} = f_{K,C}$ in Proposition 4.1. Then by Lemma 6.1 we may take $M = 2 + 2\kappa\big(2C\mathcal{D}(C)\big)^{1/2}$. Proposition 2.1 with Assumption (H2$'$) yields
$$\mathcal{R}(f_z) \le c_{s,\beta,\delta}\Bigg\{\Big(\frac{C^{(1-\beta)s/(s+1)}}{m}\Big)^{\frac{1}{1+s}} + \Big(\frac{C^{(1-\beta)s/(s+1)}}{m}\Big)^{\frac{1}{1+s}}\Big(\frac{C^{(1+\beta)/2}}{m}\Big)^{\frac{s}{1+s}} + \frac{C^{\frac{1-\beta}{2}}}{m} + C^{-\beta}\Bigg\}.$$
Take $C = \min\big\{m^{\frac{1}{s+\beta}},\,m^{\frac{2}{1+\beta}}\big\}$. Then $\frac{C^{(1+\beta)/2}}{m} \le 1$ and the proof is complete.
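Spelling out the final check (our verification): since $C \le m^{2/(1+\beta)}$,

```latex
\frac{C^{(1+\beta)/2}}{m} \le \frac{\big(m^{2/(1+\beta)}\big)^{(1+\beta)/2}}{m} = \frac{m}{m} = 1,
```

so the middle term is dominated by the first and the remaining terms give the stated rate.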
Proof of Theorem 3. Denote $\Delta_z = \mathcal{E}(\pi(f_z)) - \mathcal{E}(f_c) + \frac{1}{C}\Omega(f_z^*)$. Then we have $\Omega(f_z^*) \le C\Delta_z$. This in connection with Lemma 5.3 yields
$$\|f_z^*\|_K \le \kappa\,\Omega(f_z^*) \le \kappa C\Delta_z. \qquad (6.2)$$
Take $\tilde{f}_{K,C} = f_{K,\tilde{C}}$ with $\tilde{C} = \eta C$ in Proposition 4.1. It tells us that
$$\Delta_z \le 2\eta\mathcal{R}(f_c) + S(m,C,\eta) + 2\mathcal{D}(\eta C).$$
Set $\eta = C^{-\beta/(\beta+1)}$. Then $\tilde{C} = \eta C = C^{1/(\beta+1)}$. By the fact $\mathcal{R}(f_c) \le \frac{1}{2}$ and Assumption (H1),
$$\Delta_z \le S(m,C,\eta) + (1 + 2c_\beta)\,C^{-\frac{\beta}{\beta+1}}. \qquad (6.3)$$
Recall the expression (2.11) for $S(m,C,\eta)$. Here $\tilde{f}_{K,C} = f_{K,\tilde{C}}$. So we have
$$\begin{aligned}
S(m,C,\eta) ={}& \Big\{\big(\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_c)\big) - \big(\mathcal{E}_z(\pi(f_z)) - \mathcal{E}_z(f_c)\big)\Big\}\\
&+ (1+\eta)\Big\{\big(\mathcal{E}_z(f_{K,\tilde{C}}) - \mathcal{E}_z(f_c)\big) - \big(\mathcal{E}(f_{K,\tilde{C}}) - \mathcal{E}(f_c)\big)\Big\}\\
&+ \eta\big(\mathcal{E}_z(f_c) - \mathcal{E}(f_c)\big) =: S_1 + (1+\eta)S_2 + \eta S_3.
\end{aligned}$$
Take $t \ge 1$, $C \ge 1$ to be determined later. For $R \ge 1$, denote
$$W(R) := \{z \in Z^m : \|f_z^*\|_K \le R\}. \qquad (6.4)$$
For $S_1$, we apply Corollary 5.1 with $\delta = e^{-t} \le 1/e$. We know that there is a set $V_R^{(1)} \subset Z^m$ of measure at most $\delta = e^{-t}$ such that
$$S_1 \le c_{s,q}\,t\,\Bigg\{\Big(\frac{R^s}{m}\Big)^{\frac{q+1}{q+2+qs+s}} + \Big(\frac{R^s}{m}\Big)^{\frac{q+1}{q+2+qs+s}\cdot\frac{q+2}{2(q+1)}}\Delta_z^{\frac{q}{2(q+1)}}\Bigg\}, \qquad \forall z \in W(R)\setminus V_R^{(1)}.$$
Here $c_{s,q} := 32\big(8(\frac{1}{2c_q})^{q/(q+1)} + \frac{1}{3}\big)(c_s + 1) \ge 1$ is a constant depending only on $q$ and $s$.
To estimate $S_2$, consider $\xi = V\big(y,f_{K,\tilde{C}}(x)\big) - V(y,f_c(x))$ on $(Z,\rho)$. By Lemma 6.1, we have
$$\|f_{K,\tilde{C}}\|_\infty \le 1 + 2\kappa\sqrt{2\tilde{C}\mathcal{D}(\tilde{C})} \le 1 + 2\kappa\sqrt{2c_\beta}\,C^{\frac{1-\beta}{2(\beta+1)}}.$$
Write $\xi = \xi_1 + \xi_2$, where
$$\xi_1 := V\big(y,f_{K,\tilde{C}}(x)\big) - V\big(y,\pi(f_{K,\tilde{C}})(x)\big), \qquad \xi_2 := V\big(y,\pi(f_{K,\tilde{C}})(x)\big) - V(y,f_c(x)).$$
It is easy to check that $0 \le \xi_1 \le 2\kappa\sqrt{2c_\beta}\,C^{\frac{1-\beta}{2(\beta+1)}}$. Hence $\sigma^2(\xi_1)$ is bounded by $2\kappa\sqrt{2c_\beta}\,C^{\frac{1-\beta}{2(\beta+1)}}E\xi_1$. Then the one-sided Bernstein inequality with $\delta = e^{-t}$ tells us that there is a set $V^{(2)} \subset Z^m$ of measure at most $\delta = e^{-t}$ such that for every $z \in Z^m\setminus V^{(2)}$, there holds
$$\frac{1}{m}\sum_{i=1}^m \xi_1(z_i) - E\xi_1 \le \frac{4\kappa\sqrt{2c_\beta}\,C^{\frac{1-\beta}{2(\beta+1)}}\,t}{3m} + \sqrt{\frac{2\sigma^2(\xi_1)t}{m}} \le \frac{10\kappa\sqrt{2c_\beta}\,C^{\frac{1-\beta}{2(\beta+1)}}\,t}{3m} + E\xi_1.$$
For $\xi_2$, by Lemma 5.6,
$$\sigma^2(\xi_2) \le 8\Big(\frac{1}{2c_q}\Big)^{q/(q+1)}(E\xi_2)^{\frac{q}{q+1}}.$$
But $|\xi_2| \le 2$. So the one-sided Bernstein inequality tells us again that there is a set $V^{(3)} \subset Z^m$ of measure at most $\delta = e^{-t}$ such that for every $z \in Z^m\setminus V^{(3)}$, there holds
$$\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E\xi_2 \le \frac{4t}{3m} + \sqrt{\frac{4\sigma^2(\xi_2)t}{m}} \le \frac{4t}{3m} + 32\Big(\frac{1}{2c_q}\Big)^{\frac{q}{q+2}}\Big(\frac{t}{m}\Big)^{\frac{q+1}{q+2}} + E\xi_2.$$
Here we have used the following elementary inequality with $b := (E\xi_2)^{\frac{q}{2q+2}}$ and $a := \big(32(\frac{1}{2c_q})^{q/(q+1)}\,t/m\big)^{1/2}$:
$$a\cdot b \le \frac{q+2}{2q+2}\,a^{\frac{2q+2}{q+2}} + \frac{q}{2q+2}\,b^{\frac{2q+2}{q}}, \qquad \forall a,b > 0.$$
Combining the two estimates for $\xi_1, \xi_2$ with the fact that $E\xi = E\xi_1 + E\xi_2 = \mathcal{E}(f_{K,\tilde{C}}) - \mathcal{E}(f_c) \le \mathcal{D}(\tilde{C}) \le c_\beta C^{-\beta/(\beta+1)}$, we see that
$$S_2 \le c_{q,\beta}\,t\,\Bigg\{\frac{C^{\frac{1-\beta}{2(\beta+1)}}}{m} + \Big(\frac{1}{m}\Big)^{\frac{q+1}{q+2}} + C^{-\frac{\beta}{\beta+1}}\Bigg\}, \qquad \forall z \in Z^m\setminus V^{(2)}\setminus V^{(3)},$$
where $c_{q,\beta} := 10\kappa\sqrt{2c_\beta}/3 + \frac{4}{3} + 32(\frac{1}{2c_q})^{q/(q+1)} + c_\beta$ is a constant depending on $q$ and $\beta$.

The last term is $S_3 \le 1$.
Putting the above three estimates for $S_1, S_2, S_3$ into (6.3), we find that for every $z \in W(R)\setminus V_R^{(1)}\setminus V^{(2)}\setminus V^{(3)}$ there holds
$$\Delta_z \le 2c_{s,q}\,t\,\Big(\frac{R^s}{m}\Big)^{\frac{q+1}{q+2+qs+s}} + 8c_{q,\beta}\,t\,\Bigg\{\Big(\frac{1}{m}\Big)^{\frac{q+1}{q+2}} + C^{-\frac{\beta}{\beta+1}}\Big(\frac{C^{1/2}}{m} + 1\Big)\Bigg\}. \qquad (6.5)$$
Here we have used another elementary inequality for $\alpha = q/(2q+2) \in (0,1)$ and $x = \Delta_z$:
$$x \le ax^\alpha + b,\quad a,b,x > 0 \ \Longrightarrow\ x \le \max\big\{(2a)^{1/(1-\alpha)},\,2b\big\}.$$
Now we can choose $C$ to be
$$C := \min\Big\{m^2,\ m^{\frac{(q+1)(\beta+1)}{s(q+1)+\beta(q+2+qs+s)}}\Big\}. \qquad (6.6)$$
It ensures that $\big(\frac{1}{m}\big)^{\frac{q+1}{q+2}} \le C^{-\frac{\beta}{\beta+1}}$ and $\big(\frac{1}{m}\big)^{\frac{q+1}{q+2+qs+s}} \le C^{-\frac{s(q+1)+\beta(q+2+qs+s)}{(\beta+1)(q+2+qs+s)}}$. With this choice of $C$, (6.5) implies that, outside a set $V_R := V_R^{(1)} \cup V^{(2)} \cup V^{(3)}$ of measure at most $3e^{-t}$,
$$\Delta_z \le C^{-\frac{\beta}{\beta+1}}\Big\{2c_{s,q}\,t\,\big(C^{-\frac{1}{\beta+1}}R\big)^{\frac{s(q+1)}{q+2+qs+s}} + 24c_{q,\beta}\,t\Big\}, \qquad \forall z \in W(R)\setminus V_R. \qquad (6.7)$$
We shall finish our proof by using (6.2) and (6.7) iteratively.

Start with the bound $R = R^{(0)} := \kappa C$. Lemma 5.3 verifies $W(R^{(0)}) = Z^m$. At this first step, by (6.7) and (6.2) we have $Z^m = W(R^{(0)}) \subseteq W(R^{(1)}) \cup V_{R^{(0)}}$, where
$$R^{(1)} := \kappa C^{\frac{1}{\beta+1}}\Big\{2c_{s,q}\,t(\kappa+1)\,C^{\frac{\beta}{\beta+1}\cdot\frac{s(q+1)}{q+2+qs+s}} + 24c_{q,\beta}\,t\Big\}.$$
Now we iterate. For $n = 2, 3, \ldots$, we derive from (6.7) and (6.2) that
$$Z^m = W(R^{(0)}) \subseteq W(R^{(1)}) \cup V_{R^{(0)}} \subseteq \cdots \subseteq W(R^{(n)}) \cup \Big(\bigcup_{j=0}^{n-1} V_{R^{(j)}}\Big),$$
where each set $V_{R^{(j)}}$ has measure at most $3e^{-t}$ and the number $R^{(n)}$ is given by
$$R^{(n)} = \kappa C^{\frac{1}{\beta+1}}\Big\{\big(2c_{s,q}\,t(\kappa+1)\big)^n\,C^{\frac{\beta}{\beta+1}\cdot\big(\frac{s(q+1)}{q+2+qs+s}\big)^n} + 24c_{q,\beta}\,t(\kappa+1)n\Big\}.$$
Note that $\epsilon > 0$ is fixed. We choose $n_0 \in \mathbb{N}$ to be large enough such that
$$\Big(\frac{s(q+1)}{q+2+qs+s}\Big)^{n_0+1} \le \frac{\epsilon}{s + \frac{2s}{\beta} + \frac{q+2}{q+1}}.$$
In the $n_0$-th step of our iteration we have shown that for $z \in W(R^{(n_0)})$,
$$\|f_z^*\|_K \le \kappa C^{\frac{1}{\beta+1}}\Big\{\big(2c_{s,q}\,t(\kappa+1)\big)^{n_0}\,C^{\frac{\beta}{\beta+1}\cdot\big(\frac{s(q+1)}{q+2+qs+s}\big)^{n_0}} + 24c_{q,\beta}\,t(\kappa+1)n_0\Big\}.$$
This together with (6.5) gives
$$\Delta_z \le c(s,q,\beta,\epsilon)\,t^{n_0}\max\Big\{m^{-\frac{2\beta}{\beta+1}},\ m^{-\frac{\beta(q+1)}{s(q+1)+\beta(q+2+qs+s)}+\epsilon}\Big\}.$$
This is true for $z \in W(R^{(n_0)})\setminus V_{R^{(n_0)}}$. Since the set $\bigcup_{j=0}^{n_0} V_{R^{(j)}}$ has measure at most $3(n_0+1)e^{-t}$, we know that the set $W(R^{(n_0)})\setminus V_{R^{(n_0)}}$ has measure at least $1 - 3(n_0+1)e^{-t}$. Note that $\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_c) \le \Delta_z$. Take $t = \log\big(\frac{3(n_0+1)}{\delta}\big)$. Then the proof is finished by (4.1).
Acknowledgments

This work is partially supported by the Research Grants Council of Hong Kong [Project No. CityU 103704] and by City University of Hong Kong [Project No. 7001442]. The corresponding author is Ding-Xuan Zhou.
References
Anthony, M., and Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press.

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68, 337–404.

Barron, A. R. (1990). Complexity regularization with applications to artificial neural networks. In Nonparametric Functional Estimation (G. Roussas, ed.), 561–576. Dordrecht: Kluwer.

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory, 44, 525–536.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2003). Convexity, classification, and risk bounds. Preprint.

Blanchard, B., Bousquet, O., and Massart, P. (2004). Statistical performance of support vector machines. Preprint.

Boser, B. E., Guyon, I., and Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop of Computational Learning Theory, Vol. 5, 144–152. Pittsburgh: ACM.

Bousquet, O., and Elisseeff, A. (2002). Stability and generalization. J. Machine Learning Research, 2, 499–526.

Bradley, P. S., and Mangasarian, O. L. (2000). Massive data discrimination via linear support vector machines. Optimization Methods and Software, 13, 1–10.

Chen, D. R., Wu, Q., Ying, Y., and Zhou, D. X. (2004). Support vector machine soft margin classifiers: error analysis. J. Machine Learning Research, 5, 1143–1175.

Cortes, C., and Vapnik, V. (1995). Support-vector networks. Mach. Learning, 20, 273–297.

Cristianini, N., and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge University Press.
Cucker, F., and Smale, S. (2001). On the mathematical foundations of learning. Bull. Amer. Math. Soc., 39, 1–49.

Devroye, L., Györfi, L., and Lugosi, G. (1997). A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag.

Evgeniou, T., Pontil, M., and Poggio, T. (2000). Regularization networks and support vector machines. Adv. Comput. Math., 13, 1–50.

Kecman, V., and Hadzic, I. (2000). Support vector selection by linear programming. Proc. of IJCNN, 5, 193–198.

Lugosi, G., and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Ann. Statist., 32, 30–55.

Mendelson, S. (2002). Improving the sample complexity using global data. IEEE Trans. Inform. Theory, 48, 1977–1991.

Mukherjee, S., Rifkin, R., and Poggio, T. (2002). Regression and classification with regularization. In Lecture Notes in Statistics: Nonlinear Estimation and Classification, D. D. Denison, M. H. Hansen, C. C. Holmes, B. Mallick, and B. Yu (eds.), 107–124. New York: Springer-Verlag.

Niyogi, P. (1998). The Informational Complexity of Learning. Kluwer.

Niyogi, P., and Girosi, F. (1996). On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Comp., 8, 819–842.

Pedroso, J. P., and Murata, N. (2001). Support vector machines with different norms: motivation, formulations and results. Pattern Recognition Letters, 22, 1263–1272.

Pontil, M. (2003). A note on different covering numbers in learning theory. J. Complexity, 19, 665–671.

Rosasco, L., De Vito, E., Caponnetto, A., Piana, M., and Verri, A. (2004). Are loss functions all the same? Neural Comp., 16, 1063–1076.

Scovel, C., and Steinwart, I. (2003). Fast rates for support vector machines. Preprint.

Smale, S., and Zhou, D. X. (2003). Estimating the approximation error in learning theory. Anal. Appl., 1, 17–41.
Smale, S., and Zhou, D. X. (2004). Shannon sampling and function reconstruction from point values. Bull. Amer. Math. Soc., 41, 279–305.

Steinwart, I. (2002). Support vector machines are universally consistent. J. Complexity, 18, 768–791.

Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32, 135–166.

van der Vaart, A. W., and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical Learning Theory. John Wiley & Sons.

Wahba, G. (1990). Spline Models for Observational Data. SIAM.

Wu, Q., Ying, Y., and Zhou, D. X. (2004). Multi-kernel regularized classifiers. Preprint.

Wu, Q., and Zhou, D. X. (2004). Analysis of support vector machine classification. Preprint.

Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist., 32, 56–85.

Zhang, T. (2002). Covering number bounds of certain regularized linear function classes. J. Machine Learning Research, 2, 527–550.

Zhou, D. X. (2002). The covering number in learning theory. J. Complexity, 18, 739–767.

Zhou, D. X. (2003). Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inform. Theory, 49, 1743–1752.