DEPARTMENT OF STATISTICS
University of Wisconsin
1210 West Dayton St.
Madison, WI 53706
TECHNICAL REPORT NO. 1029
November 6, 2000
On the Support Vector Machine
by
Yi Lin
Key words: Support Vector Machine, Bayes Rule, Classification, Sobolev Hilbert Space, Reproducing Kernel, Reproducing Kernel Hilbert Space, Regularization Method.
Supported by Wisconsin Alumni Research Foundation.
Yi Lin
Department of Statistics
University of Wisconsin, Madison
1210 West Dayton Street
Madison, WI 53706-1685
Phone: (608) 262-6399
Fax: (608) 262-0032
Email: yilin@stat.wisc.edu
Abstract
Classification is a fundamental problem at the intersection of machine learning and statistics. Machine learning methods have enjoyed considerable empirical success. However, they often have an ad hoc quality. It is desirable to have hard theoretical results which might highlight specific quantitative advantages of these methods. The statistical methods often tackle the classification problem through density estimation or regression. Theoretical properties of these statistical methods can be established, but only under the assumption of a fixed order of smoothness. Whether these methods work well when the assumptions are violated is not clear.
The support vector machine (SVM) methodology is a rapidly growing area in machine learning, and has received considerable attention in recent years. The SVM has proved highly successful in a number of practical classification studies. In this paper we show that the SVM enjoys excellent theoretical properties which explain its good performance. We show that the SVM approaches the theoretically optimal classification rule (the Bayes rule) in a direct fashion, and its expected misclassification rate quickly converges to that of the Bayes rule. The results are established under very general conditions allowing discontinuity. They testify to the fact that classification is easier than density estimation and regression, and show that the SVM works by taking advantage of this. The results pinpoint the exact mechanism behind the SVM, and clarify the advantages and limitations of the SVM, thus giving insight into how the SVM can be extended systematically.
1 Introduction
In the classification problem, we are given a training data set of $n$ subjects, and for each subject $i \in \{1, 2, \ldots, n\}$ in the training data set, we observe an explanatory vector $x_i \in \mathbb{R}^d$, and a label $y_i$ indicating one of several given classes to which the subject belongs. The observations in the training set are assumed to be i.i.d. from an unknown probability distribution $P(x, y)$, or equivalently, they are independent random realizations of the random pair $(X, Y)$ that has cumulative probability distribution $P(x, y)$. The task of classification is to derive from the training set a good classification rule, so that once we are given the $x$ value of a new subject, we can assign a class label to it. One common criterion for assessing the quality of a classification rule is the generalization error rate (expected misclassification rate), though other loss functions are also possible. The situation where there are only two classes and where the misclassification rate is used as the criterion is the one most commonly encountered in practice. In the following we concentrate on this situation. This binary classification problem (or pattern recognition) has been studied by many authors. See, for example, Devroye, Györfi and Lugosi (1996) and Vapnik (1995) and the references cited therein.
In this paper, the two classes will be called the positive class and the negative class, and will be coded as $+1$ and $-1$ respectively. Any classification rule $\eta$ can then be seen as a mapping from $\mathbb{R}^d$ to $\{-1, 1\}$. Denote the generalization error rate of a classification rule $\eta$ as $R(\eta)$. Then $R(\eta) = \int \frac{|y - \eta(x)|}{2} \, dP$. It is often the case that $\eta(\cdot) = \mathrm{sign}[g(\cdot)]$ for some real valued function $g$. That is, the rule $\eta$ assigns the subject to the positive class if $g(x)$ is positive, and to the negative class otherwise. In this case, we will use the notations $R(\eta)$ and $R(g)$ interchangeably.
1.1 The Bayes Rule
In the classification problem, if we knew the underlying probability distribution $P(x, y)$, we could derive the optimal classification rule with respect to any given loss function. This optimal rule is usually called the Bayes rule. For the binary classification problem, the Bayes rule that minimizes the generalization error rate is
$$\eta^*(x) = \mathrm{sign}[p(x) - 1/2], \qquad (1)$$
where
$$p(x) = \Pr\{Y = 1 \mid X = x\}$$
is the conditional probability of the positive class at a given point $x$. Let $R^* = R(\eta^*)$. Then $R^*$ is the minimal possible value for $R(\cdot)$.
Let $g_+(x)$ be the probability density function of $X$ for the positive population, that is, the conditional density of $X$ given $Y = 1$. Let $g_-(x)$ be the probability density function of $X$ for the negative population. The unconditional ("prior") probabilities of the positive class and negative class in the target population are denoted by $\pi_+$ and $\pi_-$ respectively. Then $p(x)$ can be obtained by the Bayes formula
$$p(x) = \frac{\pi_+ g_+(x)}{\pi_+ g_+(x) + \pi_- g_-(x)}. \qquad (2)$$
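As an illustration of formulas (1) and (2), the following minimal sketch computes $p(x)$ and the Bayes rule when the class densities are known. The normal densities and the prior used in the example are hypothetical choices made only for this illustration; they are not taken from the paper.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_positive(x, pi_pos, g_pos, g_neg):
    """Formula (2): p(x) = pi_+ g_+(x) / (pi_+ g_+(x) + pi_- g_-(x))."""
    pi_neg = 1.0 - pi_pos
    num = pi_pos * g_pos(x)
    return num / (num + pi_neg * g_neg(x))

def bayes_rule(x, pi_pos, g_pos, g_neg):
    """Formula (1): sign[p(x) - 1/2], coded as +1 / -1.
    Ties (p(x) = 1/2) go to the negative class, as in the text."""
    return 1 if posterior_positive(x, pi_pos, g_pos, g_neg) > 0.5 else -1

# Hypothetical example: unit-variance normal densities centred at +1 and -1.
g_pos = lambda x: normal_pdf(x, 1.0, 1.0)
g_neg = lambda x: normal_pdf(x, -1.0, 1.0)

p_at_zero = posterior_positive(0.0, 0.5, g_pos, g_neg)
label_right = bayes_rule(2.0, 0.5, g_pos, g_neg)
label_left = bayes_rule(-2.0, 0.5, g_pos, g_neg)
```

With equal priors and symmetric densities the two classes are equally likely at $x = 0$, so the decision boundary of the Bayes rule sits there.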
1.2 Common classiﬁcation methods and the SVM
The statistical approach to classification estimates the conditional class probability $p(x)$ (or equivalently, the log odds $\log[p(x)/(1 - p(x))]$). Once an estimate of $p(x)$ is obtained, we can plug it in (1) to get an approximate Bayes rule. The estimation is often done by logistic regression; or by estimating the densities $g_+(x)$ and $g_-(x)$, and then using (2). It is possible to establish the statistical properties of these methods by making use of the extensive existing results on density estimation and regression. However, these methods posit a fixed order of smoothness assumptions on the conditional probability function $p(x)$ or the densities $g_+(x)$ and $g_-(x)$. This leaves the applicability of these methods in doubt since we never know such smoothness to be the case in practice.
The machine learning community places great emphasis on algorithms and handling large data sets. Many machine learning methods, such as the neural network, the classification tree, and recently the support vector machine, have enjoyed remarkable empirical success, and have attracted tremendous interest. The machine learning methods are often motivated by their heuristic plausibility, and justified by empirical evidence rather than hard theoretical results that might demonstrate specific quantitative advantages of such methods. In order to have a clear understanding of where and when these methods work well, it is desirable to have theoretical results that pinpoint the advantages and limitations of these methods.
The support vector machine is a new addition to the machine learning toolbox. It was first proposed in Boser, Guyon and Vapnik (1992), and is going through rapid development. The SVM is best developed in the binary classification situation, even though several studies attempted to use the SVM for classifying multiple classes. Since the statistics community is largely unfamiliar with the SVM, in the following we give a brief description of the derivation of the SVM, starting from the simple linear support vector machine and moving on to the nonlinear support vector machine. For a more detailed tutorial on the support vector machine, see Burges (1998).
The SVM is motivated by the intuitive geometric interpretation of maximizing the margin. When the two classes of points in the training set can be separated by a linear hyperplane, it is natural to use the hyperplane that separates the two groups of points in the training set by the largest margin. This amounts to the hard margin linear support vector machine: Find $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, to minimize $\|w\|^2$, subject to
$$x_i \cdot w + b \ge +1 \quad \text{for } y_i = +1; \qquad (3)$$
$$x_i \cdot w + b \le -1 \quad \text{for } y_i = -1. \qquad (4)$$
Once such $w$ and $b$ are found, the SVM classification rule is $\mathrm{sign}(w \cdot x + b)$.
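The link between minimizing $\|w\|^2$ and maximizing the margin can be made concrete with a small sketch. It checks constraints (3) and (4) in the combined form $y_i(x_i \cdot w + b) \ge 1$ and reports the width $2/\|w\|$ of the band between the two constraint hyperplanes. The width formula is the standard geometric fact and is not stated explicitly in the text; the function names and the toy data are hypothetical.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def satisfies_hard_margin(points, labels, w, b):
    """Constraints (3) and (4), written jointly as y_i (x_i . w + b) >= 1."""
    return all(y * (dot(x, w) + b) >= 1.0 for x, y in zip(points, labels))

def margin_width(w):
    """Distance between the hyperplanes x.w + b = +1 and x.w + b = -1."""
    return 2.0 / math.sqrt(dot(w, w))

# Toy separable data (hypothetical).
points = [(2.0, 0.0), (-2.0, 0.0), (3.0, 1.0)]
labels = [1, -1, 1]

feasible = satisfies_hard_margin(points, labels, (0.5, 0.0), 0.0)
too_short = satisfies_hard_margin(points, labels, (0.25, 0.0), 0.0)
width = margin_width((0.5, 0.0))
```

A smaller $\|w\|$ gives a wider band, but shrinking $w$ too far violates the constraints, which is why the problem is a constrained minimization.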
When the points in the training data set are not linearly separable, constraints (3) and (4) cannot be satisfied simultaneously. We can introduce nonnegative slack variables $\xi_i$ to overcome this difficulty. This results in the soft margin linear support vector machine: Find $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, and $\xi_i$, $i = 1, 2, \ldots, n$, to minimize $\frac{1}{n} \sum_i \xi_i + \lambda \|w\|^2$, under the constraints
$$x_i \cdot w + b \ge +1 - \xi_i \quad \text{for } y_i = +1; \qquad (5)$$
$$x_i \cdot w + b \le -1 + \xi_i \quad \text{for } y_i = -1; \qquad (6)$$
$$\xi_i \ge 0, \ \forall i.$$
Here $\lambda$ is a control parameter to be chosen by the user. Notice (5) and (6) can be combined as
$$\xi_i \ge 1 - y_i (x_i \cdot w + b).$$
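Because the objective only rewards small slacks, for fixed $w$ and $b$ the best feasible slack is the smallest value allowed by the combined constraint and by $\xi_i \ge 0$, namely $\xi_i = [1 - y_i(x_i \cdot w + b)]_+$. The sketch below evaluates the soft margin objective in this way; the function names and the toy data are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def optimal_slacks(points, labels, w, b):
    """Smallest xi_i satisfying xi_i >= 1 - y_i (x_i . w + b) and xi_i >= 0."""
    return [max(0.0, 1.0 - y * (dot(x, w) + b)) for x, y in zip(points, labels)]

def soft_margin_objective(points, labels, w, b, lam):
    """(1/n) sum_i xi_i + lam * ||w||^2, with the slacks set to their optimum."""
    slacks = optimal_slacks(points, labels, w, b)
    return sum(slacks) / len(slacks) + lam * dot(w, w)

# Toy data with one point on the wrong side of the hyperplane (hypothetical).
points = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, -1, -1]
w, b, lam = (0.5, 0.0), 0.0, 0.1

slacks = optimal_slacks(points, labels, w, b)
objective = soft_margin_objective(points, labels, w, b, lam)
```

Only the third point needs a positive slack here, since it is the only one that violates the margin.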
The nonlinear support vector machine maps the input variable into a high dimensional (often infinite dimensional) feature space, and applies the linear support vector machine in the feature space. It turns out that the computation of this linear SVM in the feature space can be carried out in the original space through a (reproducing) kernel trick. Therefore we do not really need to know the feature space and the transformation to the feature space. The nonlinear support vector machine with kernel $K$ is equivalent to a regularization problem in the reproducing kernel Hilbert space (RKHS) $H_K$: Find $f(x) = h(x) + b$ with $h \in H_K$, $b \in \mathbb{R}$, and $\xi_i$, $i = 1, 2, \ldots, n$, to minimize
$$\frac{1}{n} \Big( \sum_i \xi_i \Big) + \lambda \|h\|^2_{H_K}, \qquad (7)$$
under the constraints
$$\xi_i \ge 1 - y_i f(x_i), \qquad (8)$$
$$\xi_i \ge 0, \ \forall i. \qquad (9)$$
Once the solution $\hat f$ is found, the SVM classification rule is $\mathrm{sign}(\hat f)$.
Commonly used kernels include Gaussian kernels, spline kernels, and polynomial kernels. Wahba (1990) contains a detailed introduction to reproducing kernels and reproducing kernel Hilbert spaces.
The theory of RKHS ensures that the solution to (7), (8), and (9) lies in a finite dimensional space, even when the RKHS $H_K$ is of infinite dimension. See Wahba (1990). Hence the SVM problem (7), (8), and (9) becomes a mathematical programming problem in a finite dimensional space. See Wahba, Lin and Zhang (1999). The computation of the SVM is often done with the dual formulation of this mathematical programming problem. This dual formulation is a quadratic programming problem with a simple form. It turns out that the SVM solution enjoys certain sparsity: usually the final solution depends only on a small proportion of the data points. These points are called support vectors. This sparsity can be exploited for fast computation, and the SVM has been applied to very large datasets. See, for example, Vapnik (1979), Osuna, Freund and Girosi (1997), Platt (1999), for some basic ideas of fast computation of the SVM.
Denote $l_n(f) = \frac{1}{n} \sum_{i=1}^n [1 - y_i f(x_i)]_+$. Here $a_+ = a$ if $a \ge 0$, and $a_+ = 0$ if $a < 0$. The limit functional of $l_n(f)$ is $l(f) = \int [1 - y f(x)]_+ \, dP$. We can see that (7), (8), (9) is equivalent to minimizing
$$l_n(f) + \lambda \|h\|^2_{H_K}. \qquad (10)$$
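To make (10) concrete, the sketch below evaluates the objective for a candidate of the finite dimensional form $h = \sum_j c_j K(x_j, \cdot)$, for which $\|h\|^2_{H_K} = c^\top \mathbf{K} c$ with $\mathbf{K}$ the Gram matrix. This form is the usual representer-type parametrization and is assumed here rather than quoted from the text; the Gaussian kernel bandwidth, the function names, and the data are hypothetical. The code only evaluates the objective and does not solve the optimization problem.

```python
import math

def gaussian_kernel(s, t, bandwidth=1.0):
    """Gaussian kernel on the real line (the bandwidth is an illustrative choice)."""
    return math.exp(-((s - t) ** 2) / (2.0 * bandwidth ** 2))

def svm_objective(xs, ys, coef, b, lam, kernel=gaussian_kernel):
    """Objective (10): l_n(f) + lam * ||h||^2, with f = h + b and
    h = sum_j coef[j] * K(x_j, .), so that ||h||^2 = coef' K coef."""
    n = len(xs)
    gram = [[kernel(s, t) for t in xs] for s in xs]
    f_vals = [sum(coef[j] * gram[i][j] for j in range(n)) + b for i in range(n)]
    hinge = sum(max(0.0, 1.0 - y * f) for y, f in zip(ys, f_vals)) / n
    penalty = sum(coef[i] * gram[i][j] * coef[j] for i in range(n) for j in range(n))
    return hinge + lam * penalty

xs = [-1.0, 1.0]
ys = [-1, 1]

value_at_zero = svm_objective(xs, ys, [0.0, 0.0], 0.0, 0.1)
value_at_guess = svm_objective(xs, ys, [-1.0, 1.0], 0.0, 0.1)
```

At $f \equiv 0$ every observation contributes a hinge loss of 1 and the penalty vanishes, so the objective equals 1; any useful candidate has to beat this value.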
Several authors have studied the generalization error rate of the SVM; see Vapnik (1995), and Shawe-Taylor and Cristianini (1998). These authors established bounds on generalization error based on VC dimension, fat shattering dimension, and the proportion of the training data achieving a certain margin. However, the VC dimension or the fat shattering dimension of the nonlinear SVM is often very large, even infinite. Hence the bounds established are often very loose, or even trivial (larger than 1), and do not provide a satisfactory explanation as to why the SVM often has good generalization performance.
Due to the heuristic fashion in which the SVM is derived, it has not been clear how the SVM is related to the Bayes rule, and how the generalization error rate of the SVM compares with the minimal possible value $R^*$. Some confusion exists in practice on what to do with the SVM when the appropriate measure of risk is not the expected misclassification rate, and how the SVM can be used for multiclass classification. In this paper, we clarify matters by pinpointing the exact mechanism behind the SVM. This will enable us to extend the SVM methodology and develop new algorithms based on the basic ideas of the SVM.
2 Statements of Our Results
From (10) we see the nonlinear SVM is another example of the penalized method very often used in statistics. Lin (1999) showed that the minimizer of $l(f)$ is $\mathrm{sign}[p(x) - 1/2]$, which is exactly the Bayes rule. This strongly suggests that the SVM solution is aiming at approaching the Bayes rule. Lin (1999) demonstrated with simulations that with the Gaussian kernel and the spline kernel the solution to (10) approaches the function $\mathrm{sign}[p(x) - 1/2]$. One point worth mentioning is that the function $\mathrm{sign}[p(x) - 1/2]$ is usually discontinuous and does not belong to any RKHS commonly used in practice, while the solution to (10) is in the RKHS $H_K$. This is different from the situation of many penalized methods.
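The claim that $l(f)$ is minimized by $\mathrm{sign}[p(x) - 1/2]$ can be checked pointwise: at a fixed $x$ with $p = p(x)$, the conditional expectation of $[1 - Y f]_+$ is $p(1 - f)_+ + (1 - p)(1 + f)_+$. The following sketch minimizes this expression over a grid of candidate values $f$; the grid range and step are arbitrary choices made for illustration, and the check is numerical rather than a proof.

```python
def conditional_hinge_risk(f, p):
    """E[(1 - Y f)_+ | X = x] when Pr{Y = 1 | X = x} = p."""
    return p * max(0.0, 1.0 - f) + (1.0 - p) * max(0.0, 1.0 + f)

def grid_minimizer(p, lo=-2.0, hi=2.0, steps=400):
    """Value of f on an evenly spaced grid that minimizes the conditional risk."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda f: conditional_hinge_risk(f, p))

best_for_p07 = grid_minimizer(0.7)
best_for_p02 = grid_minimizer(0.2)
```

For any $p > 1/2$ the minimizer is $+1$ and for any $p < 1/2$ it is $-1$, regardless of how far $p$ is from $1/2$; the minimizer carries the sign of $p - 1/2$ but not its magnitude.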
In this paper we consider the first order spline kernel in the situation $d = 1$. The simple reproducing kernel under consideration facilitates the proofs. However, in principle the same line of argument can be applied to the SVM with other commonly used reproducing kernels. We show that under very general conditions without any smoothness assumption, the solution to (10) converges to $\mathrm{sign}[p(x) - 1/2]$. We further show that under very mild boundary conditions on $p(x)$, the generalization error rate of the SVM converges to $R^*$ at a certain rate. These conditions are much weaker than the usual smoothness conditions imposed in regression and density estimation, and can easily be satisfied by nonsmooth, even discontinuous functions. Also the implementation of the SVM does not require any a priori information about the conditions.
Assumption 1 The density $d(x)$ of $X$ is supported on $[-1, 1]$, and it is bounded away from zero and infinity in this interval. That is, there exist constants $D_2 > D_1 > 0$, such that $D_1 \le d(x) \le D_2$ for all $x \in [-1, 1]$.
The first order Sobolev Hilbert space of functions on any interval $[b_1, b_2]$, denoted by $H^1[b_1, b_2]$, is defined by
$$H^1[b_1, b_2] = \{ f \mid f \text{ abs. cont.}; \ f' \in L_2[b_1, b_2] \},$$
with the Sobolev Hilbert norm
$$\|f\|^2 = \int_{b_1}^{b_2} f^2 \, dx + \int_{b_1}^{b_2} (f')^2 \, dx.$$
In this paper we will write $H^1[-1, 1]$ as simply $H^1$. It is well known that this is a RKHS and the corresponding RK is called the first order spline kernel. With this RK, (10) becomes
$$\min_{f \in H^1} \ l_n(f) + \lambda \int_{-1}^{1} (f')^2 \, dx, \qquad (11)$$
or equivalently,
$$\min_{f \in H^1} \ l_n(f) \quad \text{subject to} \quad \int_{-1}^{1} (f')^2 \, dx \le M. \qquad (12)$$
Here $\lambda$ or $M$ is the smoothing parameter. Let the solution to (11) be denoted by $\hat f$. Let the solution to (12) be denoted by $\hat f_M$. We will allow the smoothing parameters to vary with the sample size $n$. In this paper, any function with a hat, such as $\hat f$ and $\hat f_M$, is random (depending on the training sample). We use the notation $E_c$ for the expectation conditional on the training sample $(X_i, Y_i)$, $i = 1, 2, \ldots, n$. Then $E_c[\hat g(X)] = \int \hat g(x) \, dP$ for any random function $\hat g$ depending on the training sample.
Theorem 1 Under Assumption 1, suppose $p(x)$ is bounded away from $1/2$ from below by some positive constant $D_3$ in an interval $[x_0 - \delta, x_0 + \delta]$. Then there exists a positive number $\Lambda$ depending only on $D_3$ and $\delta$, such that for any fixed $\lambda < \Lambda$, or for any fixed sequence $\lambda^{(n)}$ going to zero, we have
$$\big| \mathrm{sign}[p(x_0) - 1/2] - \hat f(x_0) \big| = O_p(n^{-1/3} \lambda^{-2/3}).$$
The same result is valid for $p(x)$ bounded away from $1/2$ from above.
The result is uniform over all the functions $p(x)$ and points $x_0$ satisfying that $p(x)$ is bounded from below by $1/2 + D_3$ (or from above by $1/2 - D_3$) in the interval $[x_0 - \delta, x_0 + \delta]$. Theorem 1 shows that the SVM solution converges to the Bayes rule $\mathrm{sign}[p(x) - 1/2]$. This uncovers the mechanism by which the SVM works. Notice that $\hat f$ is absolutely continuous, whereas $\mathrm{sign}[p(x) - 1/2]$ is usually discontinuous.
To investigate the global performance of the SVM, we consider the generalization error rate. For any classification rule $\eta$, it is natural to assess its quality by looking at how fast $R(\eta)$ converges to the minimal possible value $R^*$. The convergence $R(\eta) \to R^*$ was proved for various classification rules (not including the SVM though). Furthermore, certain bounds on the difference $E(R(\eta) - R^*)$ are known for finite sample sizes. See, for example, Devroye, Györfi and Lugosi (1996) and the references therein.
For the rate of such convergence, the only studies in the literature that we know of are those of Marron (1983) and Mammen and Tsybakov (1999). Under smoothness assumptions on the densities $g_+(x)$ and $g_-(x)$, Marron (1983) proved that the optimal rates of convergence are the same as those of the mean integrated squared error in density estimation. He also showed that under these assumptions the density estimation approach to the classification problem is asymptotically optimal. The error criterion he used is the integrated (over all prior probabilities $q$ from 0 to 1) difference $R_q(\eta) - R^*_q$, where $R_q(\cdot)$ and $R^*_q$ are the generalization error rates when $\pi_+ = q$. Mammen and Tsybakov (1999) impose conditions on the decision region and assume that the decision region belongs to a known class $G$ of possible "candidate" regions. The $\delta$-entropy with bracketing of the class $G$ is assumed to be finite and to vary with $\delta$ at a certain rate. They studied the asymptotic properties of direct minimum contrast estimators and sieve estimators, and found the optimal rate of $R(\eta) - R^*$ for classes of boundary fragments. The rates they obtained are faster than those in Marron (1983). They concluded that direct estimation procedures such as empirical risk minimization can achieve better performance in terms of the generalization error rate than the density estimation based method. However, the direct minimum contrast estimators and sieve estimators are hard to implement and need a priori knowledge of the class $G$. Our second result (to be stated) is in spirit closer to the results in Mammen and Tsybakov (1999), but we study the asymptotic properties of the SVM, which does not assume a priori knowledge of a candidate class with finite $\delta$-entropy (with bracketing).
For any function $g$, if we classify according to the sign of $g(x)$, then it is easy to see the misclassification rate $R(g)$ is equal to $l[\mathrm{sign}(g)]/2$. By Theorem 1 we see that $\mathrm{sign}(\hat f) \approx \hat f$. So $l(\hat f) \approx 2R(\hat f)$. Therefore it is also reasonable to use $l(\cdot)$ to assess the performance of the SVM. In fact, $l(\cdot)$ is called the GCKL in Wahba, Lin, and Zhang (1999), and was used to adaptively tune the smoothing parameter for the SVM. It might be advantageous in many situations to consider $l(\cdot)$ rather than $R(\cdot)$, since $l(\cdot)$ is continuous and convex, whereas $R(\cdot)$ is discontinuous and not convex. It is shown in Lin (1999) that the Bayes rule $\eta^*$ is the minimizer of $l(f)$ over all functions $f$. In this paper we also consider how fast the GCKL of the SVM converges to $l(\eta^*)$.
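The identity $R(g) = l[\mathrm{sign}(g)]/2$ holds observation by observation: when $s = \mathrm{sign}(g(x)) \in \{-1, 1\}$, the term $[1 - y s]_+$ equals 0 if $s = y$ and 2 otherwise, while $|y - s|/2$ equals 0 or 1. A minimal sketch of the empirical version follows; the sample values are hypothetical and chosen only to exercise both cases.

```python
def sign(u):
    """Sign coded as +1 / -1, with non-positive values mapped to -1 as in the text."""
    return 1 if u > 0 else -1

def empirical_error_rate(g_vals, ys):
    """Sample analogue of R(g): average of |y - sign(g(x))| / 2."""
    return sum(abs(y - sign(g)) / 2.0 for g, y in zip(g_vals, ys)) / len(ys)

def empirical_hinge_of_sign(g_vals, ys):
    """Sample analogue of l[sign(g)]: average of [1 - y sign(g(x))]_+."""
    return sum(max(0.0, 1.0 - y * sign(g)) for g, y in zip(g_vals, ys)) / len(ys)

# Hypothetical values of g at four sample points, with their labels.
g_vals = [0.3, -2.0, 1.5, -0.1]
ys = [1, 1, -1, -1]

error_rate = empirical_error_rate(g_vals, ys)
hinge_of_sign = empirical_hinge_of_sign(g_vals, ys)
```

Two of the four points are misclassified, so the error rate is 0.5 and the hinge functional of the sign is exactly twice that.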
Before we can state our second result, we need to characterize the behavior of $p(x)$ at its cross points with $1/2$. We say a point $r$ is a positive cross point if there exists a positive number $a > 0$, such that $p(x) > 1/2$ in $(r, r + a]$ and $p(x) < 1/2$ in $[r - a, r)$. Negative cross points are defined likewise.
Assumption 2 The function $p(x)$ crosses $1/2$ finitely many ($k$) times, that is, $\mathrm{sign}[p(x) - 1/2]$ has finitely many pieces; and there exist $\zeta > 0$ and $D_4 > 0$ such that for any cross point $r_j$, $j = 1, 2, \ldots, k$, there exist $\alpha_j \ge 0$, and $D_{6j} > D_{5j} > 0$, satisfying
$$D_{5j} |x - r_j|^{\alpha_j} \le |p(x) - 1/2| \le D_{6j} |x - r_j|^{\alpha_j}, \quad \forall x \in (r_j - \zeta, r_j + \zeta), \qquad (13)$$
and $p(x)$ is bounded away from $1/2$ by $D_4$ when $x$ is more than $\zeta$ away from all the cross points. Denote $\max_j D_{6j} = \bar D$, $\min_j D_{5j} = \underline{D}$, $\max_j \alpha_j = \bar\alpha$, and $\min_j \alpha_j = \underline{\alpha}$.
It follows immediately from this assumption that $k \le 2/\zeta$. Assumption 2 is related to the condition (4) in Mammen and Tsybakov (1999). Assumption 2 could, in particular, be satisfied by nonsmooth, even discontinuous functions. Notice also that we allow the possibility that some $\alpha_j$ is zero. This represents possible discontinuity at the cross points. Notice the implementation of the SVM does not require any a priori information about the quantities in Assumption 2.
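We will consider the setup (12) and its solution $\hat f_M$ in our second result for technical convenience. For any $\theta > 0$, denote $\rho(\theta) = \min(\underline{\alpha} + 1 - \theta, \ \theta/\bar\alpha, \ (\underline{\alpha} + 2)/(\bar\alpha + 2))$. ($\theta/\bar\alpha = +\infty$ if $\bar\alpha = 0$.)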
Theorem 2 Under Assumption 1 and Assumption 2, for any fixed $\theta > 0$, suppose $M^{(n)} \sim n^t$ for some $0 < t \le 2/[3(1 + \rho(\theta))]$, then for any fixed $s > 0$, there exist a finite constant $D(s)$ depending on $s$, and $N > 0$, such that for any $n > N$,
$$n^{\gamma s} E[l(\hat f_M) - l(\eta^*)]^s \le D(s); \qquad (14)$$
$$n^{\gamma s} E[R(\hat f_M) - R(\eta^*)]^s \le D(s); \qquad (15)$$
where $\gamma = \min\{ t(\rho(\theta) + \theta), \ 2/3 - t(1 + \theta)/3 \}$. The constants $D(s)$ and $N$ depend on $p(x)$ only through $\zeta$, $\bar D$, $\underline{D}$, $D_4$, and $\bar\alpha$.
For example, if $\bar\alpha = 0$, $\theta = 1/2$, then $\rho(\theta) = 1/2$, and $\gamma = 4/9$ with $t = 4/9$; if $\bar\alpha = \underline{\alpha} = 2$, $\theta = 2$, then $\rho(\theta) = 1$, and $\gamma = 1/2$ with $t = 1/6$.
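The two numerical examples can be reproduced with exact rational arithmetic. The sketch below implements $\rho(\theta)$ and $\gamma$ exactly as defined above; the function and argument names are hypothetical.

```python
from fractions import Fraction

def rho(theta, alpha_low, alpha_high):
    """rho(theta) = min(alpha_low + 1 - theta, theta / alpha_high,
    (alpha_low + 2) / (alpha_high + 2)); the middle term is +infinity
    when alpha_high = 0 and is then simply left out of the minimum."""
    theta, lo, hi = Fraction(theta), Fraction(alpha_low), Fraction(alpha_high)
    candidates = [lo + 1 - theta, (lo + 2) / (hi + 2)]
    if hi != 0:
        candidates.append(theta / hi)
    return min(candidates)

def largest_t(theta, alpha_low, alpha_high):
    """Upper end of the admissible range 0 < t <= 2 / [3 (1 + rho(theta))]."""
    return Fraction(2) / (3 * (1 + rho(theta, alpha_low, alpha_high)))

def gamma(t, theta, alpha_low, alpha_high):
    """gamma = min{ t (rho(theta) + theta), 2/3 - t (1 + theta) / 3 }."""
    t, theta = Fraction(t), Fraction(theta)
    r = rho(theta, alpha_low, alpha_high)
    return min(t * (r + theta), Fraction(2, 3) - t * (1 + theta) / 3)

# First example: alpha_high = 0 (so alpha_low = 0 as well), theta = 1/2, t = 4/9.
rho_1 = rho(Fraction(1, 2), 0, 0)
gamma_1 = gamma(Fraction(4, 9), Fraction(1, 2), 0, 0)

# Second example: alpha_high = alpha_low = 2, theta = 2, t = 1/6.
rho_2 = rho(2, 2, 2)
gamma_2 = gamma(Fraction(1, 6), 2, 2, 2)
```

In the first example $t = 4/9$ is the largest value permitted by the theorem, while in the second $t = 1/6$ lies strictly inside the permitted range $(0, 1/3]$ and makes the two terms in the minimum equal.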
The proof of Theorem 2 uses general results from empirical process theory, and follows an argument employed in the proof of Theorem 1 of Mammen and Tsybakov (1999). One complication is that the $L_2$ norm is not readily bounded by the difference measured by $l(\cdot)$. We derive (29) to overcome this difficulty.
3 Discussion
The SVM makes no a priori assumption of a fixed order of smoothness or a fixed class of possible "candidate" decision regions. It easily accommodates discontinuity. Its generalization error rate converges to that of the Bayes rule at a fast rate.
The reason why we can obtain results under very general conditions is that the function $\mathrm{sign}[p(x) - 1/2]$, though it may be discontinuous, is often simpler than the function $p(x)$ or $g_+(x)$ and $g_-(x)$. The SVM takes advantage of this fact by aiming directly at the simpler function $\mathrm{sign}[p(x) - 1/2]$ which is more directly related to the decision rule. Several authors have observed that classification is easier than regression and density estimation. See Devroye, Györfi and Lugosi (1996) and Mammen and Tsybakov (1999). Our results further confirm this.
It is important to understand the mechanism behind the SVM. The SVM implements the Bayes rule in an interesting way: Instead of estimating $p(x)$, it estimates $\mathrm{sign}[p(x) - 1/2]$. This has advantages when our goal is binary classification with minimal expected misclassification rate. However, this also means that in some other situations the SVM needs to be modified, and should not be used as is.
In practice it is often the case that the costs of false positives and false negatives are different. It is also possible that the fractions of members of the classes in the training set differ from those in the general population (sampling bias). In such nonstandard situations the Bayes rule that minimizes the expected misclassification cost can be expressed as $\mathrm{sign}[p(x) - c]$, where $c \in (0, 1)$ is not equal to $1/2$. Hence the SVM as is will not perform optimally in this situation, and there is no direct way of getting $\mathrm{sign}[p(x) - c]$ from $\mathrm{sign}[p(x) - 1/2]$. Lin, Lee, and Wahba (2000) contains some extensions of the SVM to such nonstandard situations.
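For the case of unequal costs alone (no sampling bias), the threshold can be written down from the standard decision-theoretic argument, which is not spelled out in the text: predicting $+1$ at $x$ has expected cost $(1 - p(x)) c_{FP}$ and predicting $-1$ has expected cost $p(x) c_{FN}$, so the cost-minimizing rule is $\mathrm{sign}[p(x) - c]$ with $c = c_{FP}/(c_{FP} + c_{FN})$. The sketch below uses hypothetical cost values and function names to show a point where this rule and $\mathrm{sign}[p(x) - 1/2]$ disagree.

```python
def cost_threshold(cost_fp, cost_fn):
    """Threshold c such that predicting +1 is cheaper exactly when p(x) > c."""
    return cost_fp / (cost_fp + cost_fn)

def expected_cost(p, decision, cost_fp, cost_fn):
    """Expected cost at a point with Pr{Y = 1 | X = x} = p for a given decision."""
    return (1.0 - p) * cost_fp if decision == 1 else p * cost_fn

def threshold_rule(p, c):
    """sign[p - c], coded as +1 / -1."""
    return 1 if p > c else -1

# Hypothetical costs: a false negative is four times as costly as a false positive.
cost_fp, cost_fn = 1.0, 4.0
c = cost_threshold(cost_fp, cost_fn)

p = 0.3
cost_sensitive_decision = threshold_rule(p, c)
misclassification_decision = threshold_rule(p, 0.5)
```

Knowing only that $p(x) < 1/2$ at this point does not reveal whether $p(x)$ exceeds $c = 0.2$, which is the sense in which $\mathrm{sign}[p(x) - c]$ cannot be recovered from $\mathrm{sign}[p(x) - 1/2]$.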
The multiclass ($N$-class) classification problem arises naturally in practice. The Bayes rule in this case assigns the class label corresponding to the largest conditional class probability. Some authors suggested training $N$ one-versus-rest SVMs and taking the class for a test subject to be that corresponding to the largest value of the classification functions. Our results show that this approach should work well when one of the conditional class probabilities is larger than $1/2$ (there is a majority class), but will not approach the Bayes rule when there is no majority class.
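The majority-class caveat can be illustrated by looking at the limits of the $N$ one-versus-rest classifiers. Following Theorem 1, the classifier for class $j$ against the rest is taken here to approach $\mathrm{sign}[p_j(x) - 1/2]$, where $p_j(x)$ is the conditional probability of class $j$; this idealized limit, the function names, and the probability vectors are assumptions made for illustration.

```python
def one_vs_rest_limits(class_probs):
    """Idealized limits sign[p_j - 1/2] of the N one-versus-rest classifiers."""
    return [1 if p > 0.5 else -1 for p in class_probs]

def bayes_class(class_probs):
    """Index of the largest conditional class probability."""
    return max(range(len(class_probs)), key=lambda j: class_probs[j])

def limits_identify_bayes_class(class_probs):
    """True when exactly one limit equals +1 and it sits at the Bayes class."""
    limits = one_vs_rest_limits(class_probs)
    return limits.count(1) == 1 and limits.index(1) == bayes_class(class_probs)

with_majority = [0.6, 0.3, 0.1]       # class 0 has probability above 1/2
without_majority = [0.4, 0.35, 0.25]  # no class reaches 1/2

majority_ok = limits_identify_bayes_class(with_majority)
no_majority_ok = limits_identify_bayes_class(without_majority)
no_majority_limits = one_vs_rest_limits(without_majority)
```

Without a majority class every limit equals $-1$, so the largest classification function no longer singles out the Bayes class.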
4 The Proof of Our Results
The notation $a_n \sim b_n$ means $c_1 a_n \le b_n \le c_2 a_n$ for all $n$, and some positive constants $c_1$ and $c_2$. Any constants here and later in the proofs are generic positive constants not depending on $n$, $\lambda$, $M$, or the sample, and may depend on $p(x)$ and $d(x)$ only through $D_1$, $D_2$, $D_3$, $\delta$, $\zeta$, $D_4$, $\bar D$, $\underline{D}$, and $\bar\alpha$. Consecutive appearances of $c$ without subscript may stand for different positive constants.
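Lemma 4.1 Under Assumption 1, suppose $p(x)$ is bounded away from $1/2$ from below by some positive constant $D_3$ in $[x_0 - \delta, x_0]$. For any fixed number $a \in [-1, 1]$, let $f_a$ be the solution to the variational problem:
$$\min_{f \in H^1[x_0 - \delta, x_0], \ f(x_0 - \delta) = a} \ E[(1 - Y f(X))_+ 1_{\{x_0 - \delta \le X \le x_0\}}] + \lambda \int_{x_0 - \delta}^{x_0} (f')^2 \, dx, \qquad (16)$$
then when $\lambda$ is small enough, there exists $\epsilon \in [0, \delta)$, such that $\epsilon \sim \lambda^{1/2} (1 - a)^{1/2}$, and $f_a = 1$ for $x \in [x_0 - \delta + \epsilon, x_0]$. Also,
$$\int_{x_0 - \delta}^{x_0 - \delta + \epsilon} (f_a')^2 \, dx \sim \lambda^{-1/2} (1 - a)^{3/2},$$
$$\int_{x_0 - \delta}^{x_0 - \delta + \epsilon} (1 - f_a) \, dx \sim \lambda^{1/2} (1 - a)^{3/2},$$
$$E[(1 - Y f_a(X))_+ 1_{\{x_0 - \delta \le X \le x_0\}}] + \lambda \int_{x_0 - \delta}^{x_0} (f_a')^2 \, dx - E[(1 - Y)_+ 1_{\{x_0 - \delta \le X \le x_0\}}] \sim \lambda^{1/2} (1 - a)^{3/2}.$$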
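Proof: Without loss of generality, we assume $x_0 = 0$.
It is easy to see that $f_a(x) \in [-1, 1]$, $\forall x \in [-\delta, 0]$. Otherwise the truncation of $f_a$ into $[-1, 1]$ would still be in $H^1[-\delta, 0]$, and would have a smaller value for (16). Now let us restrict our attention to functions satisfying $|f(x)| \le 1$, $\forall x \in [-\delta, 0]$. Under this constraint we have
$$\begin{aligned}
& E\{[1 - Y f(X)]_+ 1_{\{-\delta \le X \le 0\}}\} + \lambda \int_{-\delta}^{0} (f')^2 \, dx \\
&= E[(1 - Y f(X)) 1_{\{-\delta \le X \le 0\}}] + \lambda \int_{-\delta}^{0} (f')^2 \, dx \\
&= \int_{-\delta}^{0} d(x) \, dx - \int_{-\delta}^{0} g(x) f(x) \, dx + \lambda \int_{-\delta}^{0} (f')^2 \, dx, \qquad (17)
\end{aligned}$$
where $g(x) = d(x)[2p(x) - 1]$.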
From (17) we can see that $f_a$ must be monotone increasing. Otherwise suppose $f_a(x_1) > f_a(x_2)$, $-\delta \le x_1 < x_2 \le 0$. Consider the function defined as
$$\tilde f_a(x) = \begin{cases} f_a(x), & x \in [-\delta, x_1], \\ \max\{f_a(x_1), f_a(x)\}, & x \in [x_1, 0]. \end{cases}$$
Then $\tilde f_a \in H^1[-\delta, 0]$, $\tilde f_a(-\delta) = a$, $|\tilde f_a| \le 1$ for any $x \in [-\delta, 0]$, and $\tilde f_a$ gives a smaller value of (17).
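Let $G(x) = \int_0^x g(t) \, dt$. Then $G(0) = 0$, and $G(x)$ is continuous and strictly monotone increasing in $[-\delta, 0]$. Integrating by parts, we have (17) is the same as
$$\begin{aligned}
& \int_{-\delta}^{0} d(x) \, dx + a G(-\delta) + \int_{-\delta}^{0} G(x) f'(x) \, dx + \lambda \int_{-\delta}^{0} (f')^2 \, dx \\
&= \lambda \int_{-\delta}^{0} [f' + G/(2\lambda)]^2 \, dx - \int_{-\delta}^{0} G^2/(4\lambda) \, dx + \int_{-\delta}^{0} d(x) \, dx + a G(-\delta).
\end{aligned}$$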
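From the above we know $h_a = f_a'$ solves the problem
$$\min_{h \in L_2[-\delta, 0], \ h \ge 0} \ \lambda \int_{-\delta}^{0} [h + G/(2\lambda)]^2 \, dx,$$
subject to the constraint
$$1 - a - \int_{-\delta}^{0} h(x) \, dx \ge 0. \qquad (18)$$
Since $p(x) \ge 1/2 + D_3$, and $D_1 \le d(x) \le D_2$, by the definition of $G(\cdot)$, there exists a positive constant $\Lambda$, such that for any $\lambda \le \Lambda$, we have
$$\int_{-\delta}^{0} -G/(2\lambda) \, dx > 2 \ge 1 - a. \qquad (19)$$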
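So the constraint (18) is not trivial. Introducing the Lagrange multiplier $\mu > 0$ for the constraint (18), we get
$$\begin{aligned}
& \lambda \int_{-\delta}^{0} [h + G/(2\lambda)]^2 \, dx - \mu \Big[ 1 - a - \int_{-\delta}^{0} h(x) \, dx \Big] \\
&= \lambda \int_{-\delta}^{0} \big( h + (G + \mu)/(2\lambda) \big)^2 \, dx - \mu (1 - a) - \int_{-\delta}^{0} (\mu^2 + 2G\mu)/(4\lambda) \, dx.
\end{aligned}$$
So $h_a = [-(G + \mu)/(2\lambda)]_+$. We also have $1 - a - \int_{-\delta}^{0} h_a(x) \, dx = 0$, which means $f_a(0) = 1$.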
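Recall that $-G$ is continuous and strictly decreasing to 0 on $[-\delta, 0]$. Let $-G$ cross $\mu$ at $-\delta + \epsilon$, where $-\delta < -\delta + \epsilon < 0$. Then
$$f_a(x) = 1, \ x \in [-\delta + \epsilon, 0]; \qquad f_a(x) \text{ strictly increases from } a \text{ to } 1 \text{ in } [-\delta, -\delta + \epsilon]. \qquad (20)$$
By the definition of $g(x)$ and $G(x)$, there exist positive constants $c_1$ and $c_2$, such that $c_1 < g(x) = G'(x) < c_2$, $\forall x \in [-\delta, 0]$. It is easy to see that $2\lambda(1 - a) = \int_{-\delta}^{0} 2\lambda h_a \, dx = \int_{-\delta}^{-\delta + \epsilon} (-G - \mu) \, dx$ is in between $\frac{1}{2} c_1 \epsilon^2$ and $\frac{1}{2} c_2 \epsilon^2$. Therefore $\epsilon$ is in between $2 c_2^{-1/2} \lambda^{1/2} (1 - a)^{1/2}$ and $2 c_1^{-1/2} \lambda^{1/2} (1 - a)^{1/2}$.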
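By the definition of $\epsilon$, we have $c_1 (-\delta + \epsilon - x)/(2\lambda) \le h_a(x) \le c_2 (-\delta + \epsilon - x)/(2\lambda)$, $x \in [-\delta, -\delta + \epsilon]$. So
$$\int_{-\delta}^{-\delta + \epsilon} h_a^2 \, dx \in \Big( \frac{c_1^2}{12} \epsilon^3 \lambda^{-2}, \ \frac{c_2^2}{12} \epsilon^3 \lambda^{-2} \Big),$$
$$\int_{-\delta}^{-\delta + \epsilon} g(x)(1 - f_a(x)) \, dx \in \Big( \frac{c_1^2}{12} \epsilon^3 \lambda^{-1}, \ \frac{c_2^2}{12} \epsilon^3 \lambda^{-1} \Big).$$
$$\begin{aligned}
& E[(1 - Y f_a(X))_+ 1_{\{-\delta \le X \le 0\}}] + \lambda \int_{-\delta}^{0} (f_a')^2 \, dx - E[(1 - Y)_+ 1_{\{-\delta \le X \le 0\}}] \\
&= \int_{-\delta}^{-\delta + \epsilon} g(x)(1 - f_a(x)) \, dx + \lambda \int_{-\delta}^{-\delta + \epsilon} h_a^2 \, dx \\
&\in \big( c_1^2 \epsilon^3/(6\lambda), \ c_2^2 \epsilon^3/(6\lambda) \big) \\
&\subset \Big( \tfrac{4}{3} \lambda^{1/2} (1 - a)^{3/2} c_1^2 c_2^{-3/2}, \ \tfrac{4}{3} \lambda^{1/2} (1 - a)^{3/2} c_2^2 c_1^{-3/2} \Big).
\end{aligned}$$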
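The scalings in Lemma 4.1 can be checked numerically in the simplest special case, where $g(x) \equiv c$ is constant on $[-\delta, 0]$ (this constant-$g$ case is an illustrative assumption, not a case singled out in the paper). Then $G(x) = cx$, the proof's formula gives $h_a(x) = [(-cx - \mu)/(2\lambda)]_+$ with $\mu = c(\delta - \epsilon)$, and the constraint $\int h_a = 1 - a$ yields $\epsilon = 2\sqrt{\lambda(1 - a)/c}$. The sketch verifies the constraint and the value of $\int h_a^2$ by a midpoint rule; all names and parameter values are hypothetical.

```python
import math

def lemma_quantities(lam, a, c=1.0, delta=1.0, grid=20000):
    """Constant-g special case of Lemma 4.1 (x_0 = 0, g = c on [-delta, 0]).
    Returns eps, the integral of h_a, and the integral of h_a^2."""
    eps = 2.0 * math.sqrt(lam * (1.0 - a) / c)   # closed form from the constraint
    if eps >= delta:
        raise ValueError("lam is not small enough for this delta")
    mu = c * (delta - eps)                       # level at which -G crosses mu
    step = delta / grid
    int_h = 0.0
    int_h2 = 0.0
    for k in range(grid):
        x = -delta + (k + 0.5) * step            # midpoint of the k-th cell
        h = max(0.0, (-c * x - mu) / (2.0 * lam))
        int_h += h * step
        int_h2 += h * h * step
    return eps, int_h, int_h2

eps, int_h, int_h2 = lemma_quantities(lam=0.01, a=0.0)
# In this special case int h_a^2 = (2/3) c^{1/2} lam^{-1/2} (1 - a)^{3/2}.
predicted_int_h2 = (2.0 / 3.0) * 0.01 ** -0.5
```

The computed $\int h_a$ matches $1 - a$, so $f_a$ reaches 1 at $-\delta + \epsilon$, and $\int h_a^2$ grows like $\lambda^{-1/2}$ as the lemma states.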
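Proof of Theorem 1: Without loss of generality, assume $x_0 = 0$, and that $p(x)$ is bounded away from $1/2$ from below. It is easy to see that $|\hat f| \le 1$. Denote $a = \hat f(-\delta)$, $b = \hat f(\delta)$, and $F_\delta = \{ f \in H^1[-\delta, \delta], \ f(-\delta) = a, \ f(\delta) = b \}$. Consider problems
$$\min_{f \in F_\delta} \ \frac{1}{n} \sum_{i=1}^{n} [(1 - Y_i f(X_i))_+ 1_{\{-\delta \le X_i \le \delta\}}] + \lambda \int_{-\delta}^{\delta} (f')^2 \, dx \qquad (21)$$
$$\min_{f \in F_\delta} \ E[(1 - Y f(X))_+ 1_{\{-\delta \le X \le \delta\}}] + \lambda \int_{-\delta}^{\delta} (f')^2 \, dx \qquad (22)$$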
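Then the restriction of $\hat f$ to $[-\delta, \delta]$ is a solution to (21). Let $\bar f_\delta$ be the solution to (22), then by Lemma 4.1, for small enough $\lambda$, we have $\bar f_\delta(x) = 1$, $\forall x \in (-\delta + \epsilon_1, \delta - \epsilon_2)$; and $\bar f_\delta$ strictly increases from $a$ to 1 in $[-\delta, -\delta + \epsilon_1]$, strictly decreases from 1 to $b$ in $[\delta - \epsilon_2, \delta]$, where $\epsilon_1 < \delta$, and $\epsilon_2 < \delta$.
Denote $1 - \hat f(0)$ by $\omega$, and $F_\delta^\omega = \{ f \in H^1[-\delta, \delta], \ f(-\delta) = a, \ f(\delta) = b, \ f(0) = 1 - \omega \}$. Let $\bar f_{\delta\omega}$ be the solution to
$$\min_{f \in F_\delta^\omega} \ E\{[1 - Y f(X)]_+ 1_{\{-\delta \le X \le \delta\}}\} + \lambda \int_{-\delta}^{\delta} (f')^2 \, dx \qquad (23)$$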
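From Lemma 4.1 we can see that for $\lambda$ small enough, we have $\bar f_{\delta\omega}(x) = 1$, $\forall x \in (-\delta + \epsilon_1, -\epsilon_3) \cup (\epsilon_4, \delta - \epsilon_2)$; and $\bar f_{\delta\omega}$ strictly increases from $a$ to 1 in $[-\delta, -\delta + \epsilon_1]$, strictly decreases from 1 to $b$ in $[\delta - \epsilon_2, \delta]$, strictly decreases from 1 to $1 - \omega$ in $(-\epsilon_3, 0)$, and strictly increases from $1 - \omega$ to 1 in $(0, \epsilon_4)$; and $\bar f_{\delta\omega}$ is identical to $\bar f_\delta$ other than on $[-\epsilon_3, \epsilon_4]$. And
$$E_c[(1 - Y \bar f_{\delta\omega}(X))_+ 1_{\{-\delta \le X \le \delta\}}] + \lambda \int_{-\delta}^{\delta} (\bar f_{\delta\omega}')^2 \, dx - E_c[(1 - Y \bar f_\delta(X))_+ 1_{\{-\delta \le X \le \delta\}}] - \lambda \int_{-\delta}^{\delta} (\bar f_\delta')^2 \, dx \ge c_3 \lambda^{1/2} \omega^{3/2}.$$
Since $\bar f_{\delta\omega}$ is the solution to (23), we get
$$E_c[(1 - Y \hat f(X))_+ 1_{\{-\delta \le X \le \delta\}}] + \lambda \int_{-\delta}^{\delta} (\hat f')^2 \, dx - E_c[(1 - Y \bar f_\delta(X))_+ 1_{\{-\delta \le X \le \delta\}}] - \lambda \int_{-\delta}^{\delta} (\bar f_\delta')^2 \, dx \ge c_3 \lambda^{1/2} \omega^{3/2}. \qquad (24)$$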
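On the other hand, the left hand side of (24) is equal to
$$\begin{aligned}
& -\frac{1}{n} \sum_{i=1}^{n} [(1 - Y_i \hat f(X_i)) 1_{\{-\delta \le X_i \le \delta\}}] + E_c[(1 - Y \hat f(X)) 1_{\{-\delta \le X \le \delta\}}] \\
& \quad + \frac{1}{n} \sum_{i=1}^{n} [(1 - Y_i \hat f(X_i))_+ 1_{\{-\delta \le X_i \le \delta\}}] + \lambda \int_{-\delta}^{\delta} (\hat f')^2 \, dx \\
& \quad - E_c[(1 - Y \bar f_\delta(X)) 1_{\{-\delta \le X \le \delta\}}] - \lambda \int_{-\delta}^{\delta} (\bar f_\delta')^2 \, dx \\
&\le -\frac{1}{n} \sum_{i=1}^{n} [(1 - Y_i \hat f(X_i)) 1_{\{-\delta \le X_i \le \delta\}}] + E_c[(1 - Y \hat f(X)) 1_{\{-\delta \le X \le \delta\}}] \\
& \quad + \frac{1}{n} \sum_{i=1}^{n} [(1 - Y_i \bar f_\delta(X_i))_+ 1_{\{-\delta \le X_i \le \delta\}}] + \lambda \int_{-\delta}^{\delta} (\bar f_\delta')^2 \, dx \\
& \quad - E_c[(1 - Y \bar f_\delta(X)) 1_{\{-\delta \le X \le \delta\}}] - \lambda \int_{-\delta}^{\delta} (\bar f_\delta')^2 \, dx \\
&= \frac{1}{n} \sum_{i=1}^{n} [(Y_i (\hat f - \bar f_\delta)(X_i)) 1_{\{-\delta \le X_i \le \delta\}}] - E_c[(Y (\hat f - \bar f_\delta)(X)) 1_{\{-\delta \le X \le \delta\}}] \\
&= \frac{1}{n} \sum_{i=1}^{n} [(Y_i q(X_i)) 1_{\{-\delta \le X_i \le \delta\}}] - E_c[(Y q(X)) 1_{\{-\delta \le X \le \delta\}}],
\end{aligned}$$
where $q = \hat f - \bar f_\delta \in H^1[-1, 1]$. ($\bar f_\delta$ is extended to the interval $[-1, 1]$ continuously. It is constant in $[-1, -\delta]$ and in $[\delta, 1]$.) The first inequality comes from the fact that $\hat f$ solves (21).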
Therefore we have
$$E \Big\{ \frac{1}{n} \sum_{i=1}^{n} [(Y_i q(X_i)) 1_{\{-\delta \le X_i \le \delta\}}] - E_c[(Y q(X)) 1_{\{-\delta \le X \le \delta\}}] \Big\}^2 \ge c_3^2 E(\lambda^{1/2} \omega^{3/2})^2. \qquad (25)$$
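Consider an orthonormal basis $\{\phi_j\}$ in $L_2[-1, 1]$, such that
$$\langle \phi_j, \phi_k \rangle_{L_2} = \delta_{jk}; \qquad \langle \phi_j, \phi_k \rangle_{H^1} = \lambda_j \delta_{jk},$$
then $\lambda_j \sim j^2$. See Silverman (1982), or Cox and O'Sullivan (1990), or Lin (2000).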
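Let $q_j$ be the coefficients of $q$ with respect to $\{\phi_j\}$. Then
$$\begin{aligned}
& \Big| \frac{1}{n} \sum_{i=1}^{n} [(Y_i q(X_i)) 1_{\{-\delta \le X_i \le \delta\}}] - E_c[(Y q(X)) 1_{\{-\delta \le X \le \delta\}}] \Big| \\
&= \Big| \sum_j q_j \Big\{ \frac{1}{n} \sum_{i=1}^{n} [(Y_i \phi_j(X_i)) 1_{\{-\delta \le X_i \le \delta\}}] - E[(Y \phi_j(X)) 1_{\{-\delta \le X \le \delta\}}] \Big\} \Big| \\
&\le \Big( \sum_j \lambda_j q_j^2 \Big)^{1/2} \Big( \sum_j \lambda_j^{-1} \Big\{ \frac{1}{n} \sum_{i=1}^{n} [(Y_i \phi_j(X_i)) 1_{\{-\delta \le X_i \le \delta\}}] - E[(Y \phi_j(X)) 1_{\{-\delta \le X \le \delta\}}] \Big\}^2 \Big)^{1/2} \\
&= \|q\|_{H^1} \Big( \sum_j \lambda_j^{-1} \Big\{ \frac{1}{n} \sum_{i=1}^{n} [(Y_i \phi_j(X_i)) 1_{\{-\delta \le X_i \le \delta\}}] - E[(Y \phi_j(X)) 1_{\{-\delta \le X \le \delta\}}] \Big\}^2 \Big)^{1/2}.
\end{aligned}$$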
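But we have
$$\begin{aligned}
& E \Big\{ \frac{1}{n} \sum_{i=1}^{n} [(Y_i \phi_j(X_i)) 1_{\{-\delta \le X_i \le \delta\}}] - E[(Y \phi_j(X)) 1_{\{-\delta \le X \le \delta\}}] \Big\}^2 \\
&\le \frac{1}{n} E \big[ Y \phi_j(X) 1_{\{-\delta \le X \le \delta\}} \big]^2 \\
&\le \frac{1}{n} E[\phi_j(X)]^2 \\
&= \frac{1}{n} \int_{-1}^{1} \phi_j^2(x) d(x) \, dx \\
&\le D_2/n.
\end{aligned}$$
Therefore,
$$E \sum_j \lambda_j^{-1} \Big\{ \frac{1}{n} \sum_{i=1}^{n} [(Y_i \phi_j(X_i)) 1_{\{-\delta \le X_i \le \delta\}}] - E[(Y \phi_j(X)) 1_{\{-\delta \le X \le \delta\}}] \Big\}^2 \le \frac{D_2}{n} \sum_j \lambda_j^{-1} \sim \frac{1}{n} \sum_j j^{-2} \sim \frac{1}{n}.$$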
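Since $\hat f$ is the solution to (11), comparing with the zero function, we get $\lambda \int_{-1}^{1} (\hat f')^2 \le 1$. Since $|\hat f| \le 1$, we have $\|\hat f\|^2_{H^1} \le 2 + 1/\lambda$. Also we have shown that
$$\int_{-1}^{1} (\bar f_\delta')^2 = \int_{-\delta}^{\delta} (\bar f_\delta')^2 \le c_4 \lambda^{-1/2}.$$
So we can see $\|q\|^2_{H^1} \le 4 + 1/\lambda + c_4 \lambda^{-1/2}$. Therefore we get
$$E \Big\{ \frac{1}{n} \sum_{i=1}^{n} [(Y_i q(X_i)) 1_{\{-\delta \le X_i \le \delta\}}] - E_c[(Y q(X)) 1_{\{-\delta \le X \le \delta\}}] \Big\}^2 \le c_5 \lambda^{-1}/n.$$
Combining the last inequality with (25), we get $E\omega^3 \le c_6 n^{-1} \lambda^{-2}$. This is a little stronger than the conclusion of Theorem 1.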
Proof of Theorem 2: Let $\bar f_M$ be the solution to
$$\min_{\int_{-1}^{1} (f')^2 \le M} \ E[1 - Y f(X)]_+. \qquad (26)$$
We will need the following lemma in the proof.
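Lemma 4.2 For sufficiently large $M$, we have
$$\int_{-1}^{1} |\mathrm{sign}(p - 1/2) - \bar f_M| \, dx \le c_7 M^{-(\underline{\alpha} + 2)/(\bar\alpha + 2)}, \qquad (27)$$
$$l(\bar f_M) - l(\eta^*) = \int_{-1}^{1} |\mathrm{sign}(p - 1/2) - \bar f_M| \, |2p - 1| \, d(x) \, dx \le c_8 M^{-(\underline{\alpha} + 1)}. \qquad (28)$$
Furthermore, for any $f \in H^1$ satisfying $|f(x)| \le 1$, $\forall x \in [-1, 1]$, and fixed $\theta > 0$, we have
$$\int_{-1}^{1} (f - \bar f_M)^2 d(x) \, dx \le c_9 \big\{ [l(f) - l(\bar f_M)] M^{\theta} + M^{-\rho(\theta)} \big\}. \qquad (29)$$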
Proof of Lemma 4.2: (26) is equivalent to
$$\min_{f \in H^1} \ E[1 - Y f(X)]_+ + \lambda \int_{-1}^{1} (f')^2 \, dx \qquad (30)$$
for some $\lambda^{(M)}$ depending on $M$. It is easy to see $|\bar f_M| \le 1$.
Let us first concentrate on one interval $[r_j, r_{j+1}]$ for some fixed $j$. Without loss of generality, assume $p(x) > 1/2$ in $(r_j, r_{j+1})$, and $r_j = -\delta$, $r_{j+1} = \delta$ for some $\delta > \zeta/2$.
Denote $a = \bar f_M(-\delta)$, $b = \bar f_M(\delta)$, and $F_\delta = \{ f \in H^1[-\delta, \delta], \ f(-\delta) = a, \ f(\delta) = b \}$. Then the restriction of $\bar f_M$ to the interval $[-\delta, \delta]$ is the solution to the following variational problem:
$$\min_{f \in F_\delta} \ E[(1 - Y f(X))_+ 1_{\{-\delta \le X \le \delta\}}] + \lambda \int_{-\delta}^{\delta} (f')^2 \, dx.$$
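Now we can follow a proof that is similar to the proof of Lemma 4.1 [the proof is identical to the proof of Lemma 4.1 up to (18); after that, use the boundary condition (13) at $r_j$ and $r_{j+1}$]. Some tedious but straightforward calculation yields that, for $M$ large enough, there exist $\epsilon_1 \in [0, \zeta)$, $\epsilon_2 \in [0, \zeta)$, such that $\epsilon_1 \sim [\lambda(1 - a)]^{1/(\alpha_j + 2)}$, $\epsilon_2 \sim [\lambda(1 - b)]^{1/(\alpha_{j+1} + 2)}$; and $\bar f_M(x) = 1$, $\forall x \in (-\delta + \epsilon_1, \delta - \epsilon_2)$; $\bar f_M$ strictly increases from $a$ to 1 in $[-\delta, -\delta + \epsilon_1]$, strictly decreases from 1 to $b$ in $[\delta - \epsilon_2, \delta]$; and
$$\int_{-\delta}^{-\delta + \epsilon_1} (\bar f_M')^2 \, dx \sim \lambda^{-1/(\alpha_j + 2)} (1 - a)^{(2\alpha_j + 3)/(\alpha_j + 2)} \qquad (31)$$
$$\int_{-\delta}^{-\delta + \epsilon_1} (1 - \bar f_M) \, dx \sim \lambda^{1/(\alpha_j + 2)} (1 - a)^{(\alpha_j + 3)/(\alpha_j + 2)} \qquad (32)$$
$$\int_{-\delta}^{-\delta + \epsilon_1} (1 - \bar f_M)(2p - 1) \sim \lambda^{(\alpha_j + 1)/(\alpha_j + 2)} (1 - a)^{(2\alpha_j + 3)/(\alpha_j + 2)} \qquad (33)$$
$$E[(1 - Y \bar f_M(X))_+ 1_{\{-\delta \le X \le 0\}}] + \lambda \int_{-\delta}^{0} (\bar f_M')^2 \, dx - E[(1 - Y)_+ 1_{\{-\delta \le X \le 0\}}] \sim \lambda^{(\alpha_j + 1)/(\alpha_j + 2)} (1 - a)^{(2\alpha_j + 3)/(\alpha_j + 2)}$$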
Consider the two sides of the cross point $r_j$. Since $\bar f_M$ solves (30), we can see that $1-a\sim 1+a\sim 1$. Summing up over all the intervals, we get from (31),
$$
\int_{-1}^{1}(\bar f_M')^2\,dx\sim\lambda^{-1/(\underline{\alpha}+2)},
$$
but the left hand side must be equal to $M$. Therefore we get $M\sim\lambda^{-1/(\underline{\alpha}+2)}$.
Summing up over all the intervals, we obtain (27) and (28) from (32) and (33).
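The exponent bookkeeping behind this summation can be checked mechanically. The sketch below is an illustration and not part of the original proof. It assumes a few hypothetical values for the exponents $\alpha_j$ at the crossing points, treats the two exponents in (27) as the smallest and largest of them, and uses $1-a\sim 1$ so that only powers of $\lambda$ matter. As $\lambda\to 0$, a finite sum of powers of $\lambda$ is dominated by the term with the smallest exponent; exact rational arithmetic then recovers the powers of $M$ stated in (27) and (28).

```python
from fractions import Fraction


def lambda_exponents(alpha):
    """Exponents of lambda in (31), (32), (33) for one crossing point."""
    a = Fraction(alpha)
    return {
        "31": -1 / (a + 2),       # integral of (f')^2 over the transition layer
        "32": 1 / (a + 2),        # integral of (1 - f)
        "33": (a + 1) / (a + 2),  # integral of (1 - f)(2p - 1)
    }


def dominant_m_exponents(alphas):
    """Exponents of M for the summed versions of (32) and (33).

    For small lambda each sum behaves like its smallest power of lambda.
    The sum of (31) equals M, so M ~ lambda**e31 and lambda ~ M**(1/e31).
    """
    per_point = [lambda_exponents(a) for a in alphas]
    e31 = min(p["31"] for p in per_point)
    e32 = min(p["32"] for p in per_point)
    e33 = min(p["33"] for p in per_point)
    return e32 / e31, e33 / e31


alphas = [1, 2, Fraction(7, 2)]  # hypothetical values of alpha_j
lo, hi = Fraction(min(alphas)), Fraction(max(alphas))
m27, m28 = dominant_m_exponents(alphas)

assert m27 == -(lo + 2) / (hi + 2)  # power of M in (27)
assert m28 == -(lo + 1)             # power of M in (28)
print(m27, m28)
```

The asymmetry is the point of the check: (31) and (33) are dominated by the smallest $\alpha_j$, while (32) is dominated by the largest, which is why both extremes appear in (27).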
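For (29), we have
\begin{align*}
\int_{-\delta}^{\delta}(f-\bar f_M)^2\,d(x)\,dx
&=\int_{\substack{[-\delta,\delta]\\ f>\bar f_M}}(f-\bar f_M)^2\,d(x)\,dx+\int_{\substack{[-\delta,\delta]\\ f<\bar f_M}}(f-\bar f_M)^2\,d(x)\,dx\\
&\le c\Big\{\int_{\substack{[-\delta,\delta]\\ f>\bar f_M}}(1-\bar f_M)\,dx+\int_{\substack{[-\delta,\delta]\\ f<\bar f_M}}(\bar f_M-f)\,d(x)\,dx\Big\}\\
&\le c\Big\{\int_{-\delta}^{\delta}(1-\bar f_M)\,dx+\int_{\substack{[-\delta,\delta]\\ f<\bar f_M}}(\bar f_M-f)\,d(x)\,dx\Big\}\\
&\le c\Big\{M^{-(\underline{\alpha}+2)/(\bar\alpha+2)}+\int_{\substack{[-\delta,\delta],\,f<\bar f_M\\ |p-1/2|\le M^{-\theta}}}(\bar f_M-f)\,d(x)\,dx+\int_{\substack{[-\delta,\delta],\,f<\bar f_M\\ |p-1/2|>M^{-\theta}}}(\bar f_M-f)\,d(x)\,dx\Big\}\\
&\le c\Big\{M^{-\rho(\theta)}+M^{-\theta/\bar\alpha}+M^{\theta}\int_{\substack{[-\delta,\delta],\,f<\bar f_M\\ |p-1/2|>M^{-\theta}}}(\bar f_M-f)[2p(x)-1]\,d(x)\,dx\Big\} \tag{34}\\
&\le c\Big\{M^{-\rho(\theta)}+M^{\theta}\Big[\int_{[-\delta,\delta]}(\bar f_M-f)[2p(x)-1]\,d(x)\,dx-\int_{\substack{[-\delta,\delta]\\ f>\bar f_M}}(\bar f_M-f)[2p(x)-1]\,d(x)\,dx\Big]\Big\}\\
&\le c\Big\{M^{-\rho(\theta)}+M^{\theta}\Big[\int_{[-\delta,\delta]}(\bar f_M-f)[2p(x)-1]\,d(x)\,dx+\int_{[-\delta,\delta]}(1-\bar f_M)(2p-1)\,dx\Big]\Big\}\\
&\le c\Big\{M^{-\rho(\theta)}+M^{\theta}\int_{[-\delta,\delta]}(\bar f_M-f)[2p(x)-1]\,d(x)\,dx+M^{\theta-(\underline{\alpha}+1)}\Big\}\\
&\le c\Big\{M^{-\rho(\theta)}+M^{\theta}\int_{[-\delta,\delta]}(\bar f_M-f)[2p(x)-1]\,d(x)\,dx\Big\}.
\end{align*}
Here (34) follows from the boundary condition (13) in Assumption 2.
Summing up over all the intervals, we get (29). □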
We now prove Theorem 2.
In Lemma 1 of Mammen and Tsybakov (1999), put in $Z=(X,Y)$, $z=(x,y)$, where $y\in\{-1,1\}$, $x\in[-1,1]$. Let $\mathcal{H}=\{h(z)=-M^{-1/2}yf(x): f\in H^1,\ \int_{-1}^{1}(f')^2\,dx\le M;\ |f(x)|\le 1,\ \forall x\in[-1,1]\}$.
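Let $H_B(\delta,\mathcal{H},P)$ be the $\delta$-entropy with bracketing of $\mathcal{H}$. Let $H_\infty(\delta,\mathcal{H})$ be the $\delta$-entropy of $\mathcal{H}$ for the supremum norm, and $\bar H_\infty(\delta,\mathcal{H})$ be the $\delta$-entropy of $\mathcal{H}$ for the supremum norm requiring the centers of the covering balls be in $\mathcal{H}$. For a definition of these concepts, see van de Geer (1999). Define $\mathcal{F}=\{f\in H^1: \int_{-1}^{1}(f')^2\,dx\le 1;\ |f(x)|\le M^{-1/2},\ \forall x\in[-1,1]\}$.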
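Then for any $\delta>0$, we have
\begin{align}
H_B(\delta,\mathcal{H},P)&\le H_\infty(\delta/2,\mathcal{H}) \tag{35}\\
&\le \bar H_\infty(\delta/2,\mathcal{H}) \tag{36}\\
&= \bar H_\infty(\delta/2,\mathcal{F}) \tag{37}\\
&\le H_\infty(\delta/4,\mathcal{F}) \tag{38}\\
&\le c\delta^{-1}, \tag{39}
\end{align}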
where (35) follows from Lemma 2.1 of van de Geer (1999). (36) is by definition. For (37), notice that any function $h$ in $\mathcal{H}$ can be written as $-yf(x)$ with $f\in\mathcal{F}$, and vice versa, and that for $f_1,f_2\in\mathcal{F}$, we have $|[-yf_1(x)]-[-yf_2(x)]|=|f_1(x)-f_2(x)|$, for any $x\in[-1,1]$, $y\in\{-1,1\}$. (38) is easy to check, and (39) is well known. See, for example, Theorem 2.4 of van de Geer (1999).
Write $h_0(z)=-M^{-1/2}y\bar f_M(x)$. Then by Lemma 1 of Mammen and Tsybakov (1999), there exist constants $c_{10}>0$, $c_{11}>0$ such that
$$
\Pr\Bigg(\sup_{h\in\mathcal{H}}\frac{\big|n^{-1/2}\sum_{i=1}^{n}\{(h-h_0)(Z_i)-E(h-h_0)(Z_i)\}\big|}{\{\|h-h_0\|_{L_2(P)}\vee n^{-1/3}\}^{1/2}}>c_{10}\nu\Bigg)\le c_{11}e^{-\nu}
$$
for $\nu\ge 1$. Here $a\vee b=\max(a,b)$. This is equivalent to
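$$
\Pr\Bigg(\sup_{\substack{|f|\le 1\\ \int_{-1}^{1}(f')^2dx\le M}}\frac{\big|n^{-1/2}\sum_{i=1}^{n}\{Y_i(\bar f_M-f)(X_i)-EY_i(\bar f_M-f)(X_i)\}\big|}{\{M^{1/2}\|f-\bar f_M\|_{L_2(P)}\vee Mn^{-1/3}\}^{1/2}}>c_{10}\nu\Bigg)\le c_{11}e^{-\nu} \tag{40}
$$
for $\nu\ge 1$.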
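The equivalence with (40) is a rescaling by $M^{1/2}$. The lines below are a worked version of that step, added for illustration; they use only the definitions of $h$ and $h_0$ given above together with $Y^2=1$.

```latex
% h(z) = -M^{-1/2} y f(x),  h_0(z) = -M^{-1/2} y \bar f_M(x),  Y^2 = 1.
\begin{align*}
(h-h_0)(Z_i) &= M^{-1/2}\,Y_i(\bar f_M-f)(X_i), \\
\|h-h_0\|_{L_2(P)} &= M^{-1/2}\,\|f-\bar f_M\|_{L_2(P)}, \\
\frac{M^{-1/2}\,|S_n(f)|}
     {\{M^{-1/2}\|f-\bar f_M\|_{L_2(P)}\vee n^{-1/3}\}^{1/2}}
 &= \frac{|S_n(f)|}
     {\{M\,(M^{-1/2}\|f-\bar f_M\|_{L_2(P)}\vee n^{-1/3})\}^{1/2}}
  = \frac{|S_n(f)|}
     {\{M^{1/2}\|f-\bar f_M\|_{L_2(P)}\vee Mn^{-1/3}\}^{1/2}},
\end{align*}
% where S_n(f) is shorthand (introduced here) for
% n^{-1/2}\sum_{i=1}^n \{Y_i(\bar f_M-f)(X_i) - E\,Y_i(\bar f_M-f)(X_i)\}.
```

The middle equality multiplies numerator and denominator by $M^{1/2}$, and the last one uses $M(a\vee b)=(Ma)\vee(Mb)$ for $M>0$.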
Now define
$$
V_n=n^{1/2}M^{-(1+\theta)/4}\,\frac{[l(\hat f_M)-l(\bar f_M)]-[l_n(\hat f_M)-l_n(\bar f_M)]}{[l(\hat f_M)-l(\bar f_M)]^{1/4}}.
$$
Since $\hat f_M$ solves (12), we have
$$
V_n\ge n^{1/2}M^{-(1+\theta)/4}[l(\hat f_M)-l(\bar f_M)]^{3/4}. \tag{41}
$$
Now consider the event $A=\{l(\hat f_M)-l(\bar f_M)>M^{-(\rho(\theta)+\theta)}\}$. If $A$ holds, then
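\begin{align*}
V_n&\le\sup_{\substack{\int_{-1}^{1}(f')^2dx\le M;\ |f(x)|\le 1;\\ l(f)-l(\bar f_M)>M^{-(\rho(\theta)+\theta)}}}n^{1/2}M^{-(1+\theta)/4}\,\frac{[l(f)-l(\bar f_M)]-[l_n(f)-l_n(\bar f_M)]}{[l(f)-l(\bar f_M)]^{1/4}}\\
&\le\sup_{\substack{\int_{-1}^{1}(f')^2dx\le M;\ |f(x)|\le 1;\\ l(f)-l(\bar f_M)>M^{-(\rho(\theta)+\theta)}}}n^{1/2}M^{-(1+\theta)/4}\,\frac{\big|\frac{1}{n}\sum_{i=1}^{n}\{Y_i(f-\bar f_M)(X_i)-EY_i(f-\bar f_M)(X_i)\}\big|}{[l(f)-l(\bar f_M)]^{1/4}}\\
&\le\sup_{\substack{\int_{-1}^{1}(f')^2dx\le M;\ |f(x)|\le 1;\\ l(f)-l(\bar f_M)>M^{-(\rho(\theta)+\theta)}}}\frac{\big|n^{-1/2}\sum_{i=1}^{n}\{Y_i(f-\bar f_M)(X_i)-EY_i(f-\bar f_M)(X_i)\}\big|}{[M^{1+\theta}(l(f)-l(\bar f_M))]^{1/4}}\\
&\le\sup_{\substack{\int_{-1}^{1}(f')^2dx\le M;\ |f(x)|\le 1;\\ l(f)-l(\bar f_M)>M^{-(\rho(\theta)+\theta)}}}\frac{\big|n^{-1/2}\sum_{i=1}^{n}\{Y_i(f-\bar f_M)(X_i)-EY_i(f-\bar f_M)(X_i)\}\big|}{\{M^{1+\theta}[(cM^{-\theta}\|f-\bar f_M\|^2_{L_2(P)}-M^{-(\rho(\theta)+\theta)})\vee M^{-(\rho(\theta)+\theta)}]\}^{1/4}} \tag{42}\\
&\le\sup_{\int_{-1}^{1}(f')^2dx\le M;\ |f(x)|\le 1}\frac{\big|n^{-1/2}\sum_{i=1}^{n}\{Y_i(f-\bar f_M)(X_i)-EY_i(f-\bar f_M)(X_i)\}\big|}{\{M^{1+\theta}[(cM^{-\theta}\|f-\bar f_M\|^2_{L_2(P)})\vee M^{-(\rho(\theta)+\theta)}]\}^{1/4}} \tag{43}\\
&=c\sup_{\int_{-1}^{1}(f')^2dx\le M;\ |f(x)|\le 1}\frac{\big|n^{-1/2}\sum_{i=1}^{n}\{Y_i(f-\bar f_M)(X_i)-EY_i(f-\bar f_M)(X_i)\}\big|}{[(M^{1/2}\|f-\bar f_M\|_{L_2(P)})\vee M^{(1-\rho(\theta))/2}]^{1/2}}\\
&\le c\sup_{\int_{-1}^{1}(f')^2dx\le M;\ |f(x)|\le 1}\frac{\big|n^{-1/2}\sum_{i=1}^{n}\{Y_i(f-\bar f_M)(X_i)-EY_i(f-\bar f_M)(X_i)\}\big|}{[(M^{1/2}\|f-\bar f_M\|_{L_2(P)})\vee Mn^{-1/3}]^{1/2}},
\end{align*}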
where (42) follows from (29), and (43) follows from the fact that $a\vee b\ge(a-b)\vee b\ge(a\vee b)/2$ for any $a>0$, $b>0$. The last step follows from the fact that $M^{(n)}\sim n^{t}$ for some $0<t\le 2/[3(1+\rho(\theta))]$.
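Both elementary facts used here are easy to sanity-check numerically. The sketch below is illustrative only: the helper names and the value chosen for $\rho(\theta)$ are hypothetical. It tests the chain $a\vee b\ge(a-b)\vee b\ge(a\vee b)/2$ on random positive pairs, and confirms that with $M=n^{t}$ the inequality $M^{(1-\rho(\theta))/2}\ge Mn^{-1/3}$ needed in the last step holds exactly when $t\le 2/[3(1+\rho(\theta))]$.

```python
import random


def max_chain_holds(a, b):
    """Check a v b >= (a - b) v b >= (a v b) / 2 for positive a, b."""
    top = max(a, b)
    mid = max(a - b, b)
    return top >= mid >= top / 2


def last_step_holds(n, t, rho):
    """Check M^{(1 - rho)/2} >= M * n^{-1/3} for M = n^t.

    A tiny relative tolerance absorbs floating-point rounding at equality.
    """
    m = n ** t
    return m ** ((1 - rho) / 2) >= m * n ** (-1 / 3) * (1 - 1e-12)


random.seed(0)
assert all(
    max_chain_holds(random.uniform(1e-6, 10), random.uniform(1e-6, 10))
    for _ in range(10_000)
)

rho = 0.4                    # hypothetical value of rho(theta)
t_max = 2 / (3 * (1 + rho))  # largest admissible growth exponent for M
assert last_step_holds(n=10**6, t=t_max, rho=rho)            # boundary case
assert last_step_holds(n=10**6, t=0.5 * t_max, rho=rho)      # slower growth
assert not last_step_holds(n=10**6, t=1.5 * t_max, rho=rho)  # too fast
```

At $t=2/[3(1+\rho(\theta))]$ the two sides carry the same power of $n$, which is why the condition on $t$ is stated with a non-strict inequality.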
Therefore it follows from (40) that
$$
\limsup_{n\to\infty}E[V_n^{s}1_A]\le C(s), \tag{44}
$$
for all $s>0$ and finite constants $C(s)$ depending on $s$.
By (41) and (44), we have
$$
E\{[l(\hat f_M)-l(\bar f_M)]^{s}1_A\}\le C(4s/3)\,n^{-\gamma s}. \tag{45}
$$
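The argument $4s/3$ in the constant comes from inverting the power $3/4$ in (41). A worked version of that step is sketched below for illustration. The final line, which reads off an exponent from $M\sim n^{t}$, is an inference from this calculation and is not quoted from the statement of Theorem 2.

```latex
\begin{align*}
[l(\hat f_M)-l(\bar f_M)]^{3/4}
  &\le n^{-1/2} M^{(1+\theta)/4}\, V_n
  && \text{by (41)}, \\
[l(\hat f_M)-l(\bar f_M)]^{s}\,1_A
  &\le n^{-2s/3} M^{(1+\theta)s/3}\, V_n^{4s/3}\,1_A
  && \text{raising to the power } 4s/3, \\
E\{[l(\hat f_M)-l(\bar f_M)]^{s}\,1_A\}
  &\le n^{-2s/3} M^{(1+\theta)s/3}\, E[V_n^{4s/3}1_A]
  && \text{then apply (44) with } s \to 4s/3, \\
n^{-2s/3} M^{(1+\theta)s/3}
  &\sim n^{-s\,[2 - t(1+\theta)]/3}
  && \text{when } M \sim n^{t}.
\end{align*}
```

Since (44) is a statement about the limit superior, the resulting bound holds for all sufficiently large $n$.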
On $A^{c}$, we have $l(\hat f_M)-l(\bar f_M)\le M^{-(\rho(\theta)+\theta)}$, so
$$
E\{[l(\hat f_M)-l(\bar f_M)]^{s}1_{A^{c}}\}\le M^{-\gamma s}. \tag{46}
$$
By the definition of $\rho(\theta)$, we have $\underline{\alpha}+1\ge\rho(\theta)+\theta$. Noticing $(a+b)^{s}\le 2^{s}(a^{s}+b^{s})$ for any $a>0$, $b>0$, we see that (45) and (46) combined with (28) give (14). Then (15) follows directly from the following lemma.
Lemma 4.3 $R(g)-R(\eta^*)\le l(g)-l(\eta^*)$ for any function $g$ satisfying $|g(x)|\le 1$, $\forall x\in[-1,1]$.
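Proof of Lemma 4.3: We have
\begin{align*}
R(g)-R(\eta^*)&=\tfrac{1}{2}\{l[\mathrm{sign}(g)]-l[\mathrm{sign}(2p-1)]\}\\
&=\int_{-1}^{1}\tfrac{1}{2}[\mathrm{sign}(2p-1)-\mathrm{sign}(g)][2p(x)-1]\,d(x)\,dx\\
&\le\int_{-1}^{1}[\mathrm{sign}(2p-1)-g][2p(x)-1]\,d(x)\,dx\\
&=l(g)-l[\mathrm{sign}(2p-1)]=l(g)-l(\eta^*).
\end{align*}
□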
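The only inequality in this proof is pointwise in $x$, so it can be spot-checked numerically. The following sketch is an illustration with hypothetical helper names, not part of the original argument. It draws random values of $p\in[0,1]$ and $g\in[-1,1]$ and confirms that the integrand on the right is never smaller than the integrand on the left.

```python
import random


def sign(v):
    """Sign function with sign(0) = 0."""
    return (v > 0) - (v < 0)


def pointwise_gap(p, g):
    """Right-hand integrand minus left-hand integrand in the proof of Lemma 4.3.

    p is the conditional probability P(Y = 1 | x) at a point x, and g is the
    value of the decision function there, assumed to satisfy |g| <= 1.
    """
    s = sign(2 * p - 1)
    lhs = 0.5 * (s - sign(g)) * (2 * p - 1)
    rhs = (s - g) * (2 * p - 1)
    return rhs - lhs


random.seed(0)
worst = min(
    pointwise_gap(random.random(), random.uniform(-1, 1))
    for _ in range(100_000)
)
assert worst >= -1e-12  # the gap is nonnegative up to rounding
```

When $g$ has the same sign as $2p-1$ the left integrand vanishes and the right one equals $(1-|g|)|2p-1|\ge 0$; when the signs differ the left integrand is $|2p-1|$ and the right one is $(1+|g|)|2p-1|$.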
Acknowledgements: This work was partly supported by Wisconsin Alumni Research Foundation. The author would like to thank Professor Grace Wahba for introducing him to the field of the SVM, and for stimulating discussions.
References
[1] Boser, B., Guyon, I. and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Fifth Annual Conference on Computational Learning Theory, Pittsburgh ACM, pp. 142-152.
[2] Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121-167.
[3] Cox, D. D. and O'Sullivan, F. (1990). Asymptotic analysis of penalized likelihood and related estimators. The Annals of Statistics 18 1676-1695.
[4] Devroye, L., Györfi, L., and Lugosi, G. (1996). A probabilistic theory of pattern recognition. Springer, New York.
[5] Lin, Y. (1999). Support vector machines and the Bayes rule in classification. University of Wisconsin-Madison technical report 1014. Submitted.
[6] Lin, Y. (2000). Tensor product space ANOVA models. To appear in Ann. Statist. 27.
[7] Lin, Y., Lee, Y. and Wahba, G. (2000). Support vector machines for classification in nonstandard situations. Technical Report 1016. Department of Statistics, University of Wisconsin, Madison. Submitted.
[8] Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808-1829.
[9] Marron, J. S. (1983). Optimal rates of convergence to Bayes risk in nonparametric discrimination. Ann. Statist. 11 1142-1155.
[10] Osuna, E., Freund, R. and Girosi, F. (1997). An improved training algorithm for support vector machines. In J. Principe, L. Giles, N. Morgan, and E. Wilson, editors, Neural networks for signal processing VII: Proceedings of the 1997 IEEE workshop, pages 276-285, New York. IEEE.
[11] Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in kernel methods: Support vector learning, pages 185-208, Cambridge, MA, MIT Press.
[12] Shawe-Taylor, J. and Cristianini, N. (1998). "Robust Bounds on the Generalization from the Margin Distribution". NeuroCOLT Technical Report TR-1998-029.
[13] Silverman, B. W. (1982). On the estimation of a probability density function by the maximum penalised likelihood method. The Annals of Statistics 10 795-810.
[14] van de Geer, S. (1999). Empirical processes in M-estimation. Cambridge University Press.
[15] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
[16] Wahba, G. (1990). Spline Models for Observational Data. Philadelphia, PA: Society for Industrial and Applied Mathematics.
[17] Wahba, G., Lin, Y. and Zhang, H. (1999). GACV for support vector machines, or, another way to look at margin-like quantities. To appear in A. J. Smola, P. Bartlett, B. Schölkopf & D. Schuurmans (Eds.), Advances in Large Margin Classifiers. Cambridge, MA & London, England: MIT Press.