Multicategory ψ-Learning*

Yufeng Liu and Xiaotong Shen
Summary

In binary classification, margin-based techniques usually deliver high performance. As a result, a multicategory problem is often treated as a sequence of binary classifications. In the absence of a dominating class, this treatment may be suboptimal and may yield poor performance, such as for the Support Vector Machine (SVM). We propose a novel multicategory generalization of ψ-learning which treats all classes simultaneously. The new generalization eliminates this potential problem and, at the same time, retains the desirable properties of its binary counterpart. We develop a statistical learning theory for the proposed methodology and obtain fast convergence rates for both linear and nonlinear learning examples. The operational characteristics of this method are demonstrated via simulation. Our results indicate that the proposed methodology can deliver accurate class prediction and is more robust against extreme observations than its SVM counterpart.

Key Words and Phrases: Generalization error, nonconvex minimization, supervised learning, support vectors.
1 Introduction

Classification has become increasingly important as a means for facilitating information extraction. Among binary classification techniques, significant developments have been seen in margin-based methodologies, including Support Vector Machine (SVM, Boser, Guyon, and Vapnik, 1992; Cortes and Vapnik, 1995), Penalized Logistic Regression (PLR, Lin et al., 2000), Import Vector Machine (IVM, Zhu and Hastie, 2001), and Distance Weighted Discrimination (DWD, Marron and Todd, 2002).

* Yufeng Liu is Assistant Professor, Department of Statistics and Operations Research, Carolina Center for Genome Sciences, University of North Carolina, CB 3260, Chapel Hill, NC 27599 (Email: yfliu@email.unc.edu). He would like to thank Professor George Fisherman for his helpful comments. Xiaotong Shen is Professor, School of Statistics, University of Minnesota, 224 Church Street S.E., Minneapolis, MN 55455 (Email: xshen@stat.umn.edu). His research was supported in part by NSF grants IIS-0328802 and DMS-0072635. The authors would like to thank the editor, the associate editor, and two anonymous referees for their helpful comments and suggestions.
Among many margin-based techniques, the ones that focus on estimating the decision boundary yield higher performance than those that focus on conditional probabilities, because the former is an easier problem than the latter. For instance, binary SVM directly estimates the Bayes classifier $\mathrm{sign}(P(Y=+1 \mid x) - 1/2)$ rather than $P(Y=+1 \mid x)$ itself, with input vector $x$ and class label $Y \in \{\pm 1\}$, as shown in Lin (2002). However, this aspect of the methodology makes its generalization to the multicategory case highly nontrivial. One popular approach, known as "one-versus-rest", solves $k$ binary problems via sequential training. As argued by Lee, Lin, and Wahba (2004), an approach of this sort performs poorly in the absence of a dominating class, since the conditional probability of each class is no greater than $1/2$.

Shen, Tseng, Zhang, and Wong (2003) proposed another margin-based technique called ψ-learning, which replaces the convex SVM loss function by a nonconvex ψ-loss function. They show that more accurate class prediction can be achieved, while the margin interpretation is retained. The present article generalizes binary ψ-learning to the multicategory case. Since ψ-learning, like SVM, does not directly yield $P(Y=+1 \mid x)$, we need to take a new approach.
To treat all classes simultaneously, we generalize the concept of margins and support vectors via multiple comparisons among different classes. Multicategory ψ-learning has the advantage of retaining the desired properties of its binary counterpart while not suffering from the aforementioned difficulty of one-versus-rest SVM with regard to the dominating class.

To provide insight into multicategory ψ-learning, we develop a statistical learning theory. Specifically, the theory quantifies the performance of multicategory ψ-learning with respect to the choice of tuning parameters, the size of the training sample, and the number of classes involved in classification. It also indicates that our multicategory ψ-learning directly estimates the true decision boundary regardless of the presence or absence of a dominating class.

Simulation experiments indicate that ψ-learning outperforms its counterpart SVM in generalization, as in the binary case. Moreover, multicategory ψ-learning is more robust against extreme instances that are wrongly classified than its counterpart SVM. Interestingly, in linear learning problems it exhibits behavior with respect to the tuning parameter that is similar to that of nonlinear learning problems, which differs from the binary case.

Section 2.1 motivates our approach. Section 2.2 describes our proposal for multicategory ψ-learning, and Section 2.3 briefly discusses computational issues. Section 3 studies the statistical properties of the proposed methodology and develops its statistical learning theory. Section 4 presents numerical examples, followed by conclusions and discussion in Section 5. The Appendix contains the lemmas and technical proofs.
2 Methodology

The primary goal of classification is to predict the class label $Y$ for a given input vector $x \in S$ via a classifier, where $S$ is an input space. For $k$-class classification, a classifier partitions $S$ into $k$ disjoint and exhaustive regions $S_1, \ldots, S_k$, with $S_j$ corresponding to class $j$. A good classifier is one that predicts the class label $Y$ for a given $x$ accurately, as measured by its accuracy of prediction.
Before proceeding, let $x \in S \subset \mathbb{R}^d$ be an input vector and $y$ be an output (label) variable. We code $y$ as $\{1, \ldots, k\}$ and define $f = (f_1, \ldots, f_k)$ as a decision function vector. Here $f_j$, mapping from $S$ to $\mathbb{R}$, represents class $j$; $j = 1, \ldots, k$. A classifier $\mathrm{argmax}_{j=1,\ldots,k} f_j(x)$, induced by $f$, is employed to assign a label to any input vector $x \in S$. In other words, $x \in S$ is assigned to the class with the highest value of $f_j(x)$, which indicates the strength of evidence that $x$ belongs to class $j$. A classifier is trained via a training sample $\{(x_i, y_i);\ i = 1, \ldots, n\}$, independently and identically distributed according to an unknown probability distribution $P(x, y)$. Throughout the paper, we use $X$ and $Y$ to denote random variables and $x$ and $y$ to represent the corresponding observations.
The generalization error (GE) quantifies the accuracy of generalization and is defined as $\mathrm{Err}(f) = P[Y \neq \mathrm{argmax}_j f_j(X)]$, the probability of misclassifying a new input vector $X$. To simplify the expression, we introduce $g(f(x), y) = (f_y(x) - f_1(x), \ldots, f_y(x) - f_{y-1}(x),\ f_y(x) - f_{y+1}(x), \ldots, f_y(x) - f_k(x))$, which performs multiple comparisons of class $y$ versus the rest of the classes. The vector $g(f(x), y)$ describes the unique feature of a multicategory problem and is directly related to the generalized margins to be introduced shortly. Furthermore, for $u = (u_1, \ldots, u_{k-1})$, we define the multivariate sign function $\mathrm{sign}(u) = 1$ if $u_{\min} = \min(u_1, \ldots, u_{k-1}) > 0$, and $-1$ if $u_{\min} \le 0$. With $\mathrm{sign}(\cdot)$ and $g(f(x), y)$ in place, $f$ indicates correct classification for a given instance $(x, y)$ if $g(f(x), y) > 0_{k-1}$, where $0_{k-1}$ is a $(k-1)$-dimensional vector of 0's. Consequently, the GE reduces to $\mathrm{Err}(f) = \frac{1}{2} E[1 - \mathrm{sign}(g(f(X), Y))]$, with the empirical generalization error (EGE) $(2n)^{-1} \sum_{i=1}^n (1 - \mathrm{sign}(g(f(x_i), y_i)))$.
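To make these definitions concrete, here is a minimal numpy sketch (the function names `g`, `msign`, and `ege` are ours, not part of the paper) of the comparison vector, the multivariate sign, and the EGE:

```python
import numpy as np

def g(f_vals, y):
    """Comparison vector g(f(x), y): f_y(x) minus every other component.

    f_vals : length-k array (f_1(x), ..., f_k(x))
    y      : class label in {1, ..., k} (1-based, as in the paper)
    """
    fy = f_vals[y - 1]
    return fy - np.delete(f_vals, y - 1)  # (k-1)-vector

def msign(u):
    """Multivariate sign: +1 if min(u) > 0, else -1."""
    return 1.0 if np.min(u) > 0 else -1.0

def ege(f, xs, ys):
    """Empirical generalization error (2n)^{-1} sum_i (1 - sign(g(f(x_i), y_i)))."""
    n = len(ys)
    return sum(1.0 - msign(g(f(x), y)) for x, y in zip(xs, ys)) / (2.0 * n)
```

For example, with $k = 3$ and a constant decision vector $f(x) = (1, 0, -1)$, an instance with label 1 is correctly classified (contributes 0) while an instance with label 2 is not (contributes 2), giving an EGE of $0.5$ over those two points.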
For motivation, we first discuss our setting in the binary case and then generalize it to the multicategory case. In particular, we review binary ψ-learning with the usual coding $\{-1, 1\}$, and then derive it via the coding $\{1, 2\}$.
2.1 Motivation

With $y \in \{\pm 1\}$, a margin-based classifier estimates a single function $f$ and uses $\mathrm{sign}(f)$ as the classification rule. Within the regularization framework, it solves $\mathrm{argmin}_f\, J(f) + C \sum_{i=1}^n l(y_i f(x_i))$, where $J(f)$, a regularization term, controls the complexity of $f$; a loss function $l$ measures the data fit; and $C > 0$ is a tuning parameter balancing the two terms. For example, SVM uses the hinge loss $l(u) = [1 - u]_+$, where $[v]_+ = v$ if $v \ge 0$ and 0 otherwise; PLR and IVM adopt the logistic loss $l(u) = \log(1 + e^{-u})$; and the ψ-loss can be any nonincreasing function satisfying $R \ge \psi(u) > 0$ if $u \in (0, \tau)$ and $\psi(u) = 1 - \mathrm{sign}(u)$ otherwise, where $\tau \in (0, 1]$ and $R > 0$. For simplicity, we discuss the linear case in which $f(x) = w^T x + b$, with $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$, represents a $d$-dimensional hyperplane. In this case, $J(f) = \frac{1}{2}\|w\|^2$ is defined by the geometric margin $\frac{2}{\|w\|}$, the vertical Euclidean distance between the hyperplanes $f = \pm 1$. Here $y_i f(x_i)$ is the functional margin of instance $(x_i, y_i)$.
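The three binary losses above can be sketched in a few lines of Python. Note that `psi_binary` uses $2(1 - u/\tau)$ on $(0, \tau)$, which is only one valid member of the ψ family defined above; the names and this particular choice are ours:

```python
import numpy as np

def hinge(u):
    # SVM hinge loss [1 - u]_+
    return max(1.0 - u, 0.0)

def logistic(u):
    # PLR/IVM logistic loss log(1 + e^{-u})
    return np.log1p(np.exp(-u))

def psi_binary(u, tau=1.0):
    """One member of the binary psi family: equals 1 - sign(u) outside (0, tau)
    and takes a positive, nonincreasing value (here 2(1 - u/tau)) on (0, tau)."""
    if u <= 0:
        return 2.0          # 1 - sign(u) = 2 for u <= 0
    if u < tau:
        return 2.0 * (1.0 - u / tau)
    return 0.0              # 1 - sign(u) = 0 for u >= tau
```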
For linear binary ψ-learning with the coding $\{1, 2\}$, we now derive a parallel formulation using the argmax rule, noting that $x$ is classified as class 2 if $f_2(x) > f_1(x)$ and as class 1 otherwise, where $f_j(x) = w_j^T x + b_j$, $j = 1, 2$. Evidently, this classification rule depends only on $\mathrm{sign}((f_2 - f_1)(x))$. To eliminate redundancy in $(f_1, f_2)$, we invoke the sum-to-zero constraint $f_1 + f_2 = 0$. This type of constraint was previously used by Guermeur (2002) and Lee et al. (2004) in two different SVM formulations. Under this constraint, $\|w_1\| = \|w_2\|$. Binary ψ-learning then solves:

$$\min_{b_1, b_2, w_1, w_2} \Big( \frac{1}{2} \sum_{j=1}^{2} \|w_j\|^2 + C \sum_{i=1}^{n} \psi\big(g(f(x_i), y_i)\big) \Big) \quad \text{subject to } \sum_{j=1}^{2} f_j(x) = 0 \ \ \forall x \in S, \qquad (1)$$

where $g(f(x_i), y_i) = f_{y_i}(x_i) - f_{3 - y_i}(x_i)$.
With the coding $\{1, 2\}$, instances from classes 1 and 2 that lie respectively in the halfspaces $\{x : g(f(x), 2) \ge -1\}$ and $\{x : g(f(x), 2) \le 1\}$ are defined as "support vectors". In the separable case, support vectors are instances on the hyperplanes $g(f(x), 2) = \pm 1$. Furthermore, the functional margin of $(x_i, y_i)$ can be defined as $g(f(x_i), y_i)$, indicating the correctness and strength of classification of $x_i$ by $f$.
2.2 Multicategory ψ-Learning

As suggested in Shen et al. (2003), the role of a binary ψ-function is twofold. First, it eliminates the scaling problem of the sign function, which is scale invariant. Second, with a positive penalty defined by the positive value of $\psi(u)$ for $u \in (0, \tau)$, it pushes correctly classified instances away from the boundary. As a remark, we note that $1 - \mathrm{sign}$ as a loss is numerically undesirable, since the solution $f$ is approximately 0 under regularization.

Using the coding $\{1, \ldots, k\}$, we define multivariate ψ-functions on $k - 1$ arguments as follows:

$$R \ge \psi(u) > 0 \text{ if } u_{\min} \in (0, \tau); \quad \psi(u) = 1 - \mathrm{sign}(u) \text{ otherwise}, \qquad (2)$$

where $0 < \tau \le 1$ and $0 < R \le 2$ are constants, and $\psi(u)$ is nonincreasing in $u_{\min}$. We note that this multivariate version preserves the desired properties of its univariate counterpart. In particular, the multivariate ψ assigns a positive penalty to any instance with $\min(g(f(x_i), y_i)) \in (0, \tau)$ to eliminate the scaling problem. To utilize our computational strategy based on a difference convex (d.c.) decomposition, we use a specific ψ in implementation:

$$\psi(u) = \begin{cases} 0 & \text{if } u_{\min} \ge 1; \\ 2 & \text{if } u_{\min} < 0; \\ 2(1 - u_{\min}) & \text{if } 0 \le u_{\min} < 1. \end{cases} \qquad (3)$$

A plot of this ψ-function for $k = 3$ is displayed in Figure 1.
Insert Figure 1 about here
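The specific ψ of (3) depends on its argument only through $u_{\min}$ and is straightforward to evaluate; the following sketch (our naming) implements it:

```python
import numpy as np

def psi(u):
    """The specific multivariate psi-loss of equation (3), driven by u_min."""
    umin = np.min(u)
    if umin >= 1.0:
        return 0.0                 # correctly classified with margin at least 1
    if umin < 0.0:
        return 2.0                 # misclassified (or tied): full penalty
    return 2.0 * (1.0 - umin)      # correct but inside the margin: linear penalty
```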
Linear multicategory ψ-learning solves $\min_{b, w} \big( \frac{1}{2} \sum_{j=1}^{k} \|w_j\|^2 + C \sum_{i=1}^{n} \psi(g(f(x_i), y_i)) \big)$ subject to $\sum_{j=1}^{k} f_j(x) = 0$ for all $x \in S$, where $w = \mathrm{vec}(w_1, \ldots, w_k)$ is a $kd$-dimensional vector with its $(d(i_2 - 1) + i_1)$-th element $w_{i_2}(i_1)$, and $b = (b_1, \ldots, b_k)^T \in \mathbb{R}^k$. By Theorem 2.1 of Liu et al. (2005), the minimization with the sum-to-zero constraint over all $x \in S$ is equivalent to that with the constraint over the $n$ training inputs $\{x_i;\ i = 1, \ldots, n\}$ only. That is, the infinite constraint $\sum_{j=1}^{k} f_j(x) = 0$ for all $x \in S$ can be reduced to $\sum_{j=1}^{k} b_j 1_n + X \sum_{j=1}^{k} w_j = 0$, where $X = (x_1, \ldots, x_n)^T$ is the design matrix and $1_n$ is an $n$-dimensional vector of 1's. This yields linear multicategory ψ-learning:

$$\min_{b, w} \Big( \frac{1}{2} \sum_{j=1}^{k} \|w_j\|^2 + C \sum_{i=1}^{n} \psi\big(g(f(x_i), y_i)\big) \Big) \quad \text{subject to } \sum_{j=1}^{k} b_j 1_n + X \sum_{j=1}^{k} w_j = 0, \qquad (4)$$

where the value of $C$ ($C > 0$) in (4) reflects the relative importance of the geometric margin versus the EGE.
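The reduction of the pointwise sum-to-zero constraint to the finite constraint in (4) can be checked numerically for linear decision functions, since $\sum_j f_j(x) = \sum_j b_j + (\sum_j w_j)^T x$. The following is our own sanity-check sketch, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 3, 2, 5

# Draw arbitrary slopes and intercepts, then center them across classes so that
# sum_j w_j = 0 and sum_j b_j = 0.
W = rng.normal(size=(k, d)); W -= W.mean(axis=0)
b = rng.normal(size=k);      b -= b.mean()

X = rng.normal(size=(n, d))      # design matrix (x_1, ..., x_n)^T
F = X @ W.T + b                  # F[i, j] = f_j(x_i) = w_j^T x_i + b_j

# The pointwise sum-to-zero constraint holds at every input ...
assert np.allclose(F.sum(axis=1), 0.0)
# ... and coincides with the finite constraint in (4): sum_j b_j 1_n + X sum_j w_j = 0.
assert np.allclose(b.sum() * np.ones(n) + X @ W.sum(axis=0), 0.0)
```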
In the present context, we define the generalized functional margin of an instance $(x_i, y_i)$ as $\min(g(f(x_i), y_i))$, and the generalized geometric margin as $\gamma = \min_{1 \le j_1 < j_2 \le k} \gamma_{j_1 j_2}$, with $\gamma_{j_1 j_2} = \frac{2}{\|w_{j_1} - w_{j_2}\|}$ the vertical Euclidean distance between the hyperplanes $f_{j_1} - f_{j_2} = \pm 1$. Here $\gamma_{j_1 j_2}$ measures the separation between classes $j_1$ and $j_2$; see Figure 2 for an illustration of the role of $\gamma$. When $k = 2$, (4) reduces to the binary case of Shen et al. (2003). As a technical remark, we note that (4) uses $\sum_{j=1}^{k} \|w_j\|^2$ rather than $\max_{1 \le j_1 < j_2 \le k} \|w_{j_1} - w_{j_2}\|^2$ in the minimization. This is because $\sum_{j=1}^{k} \|w_j\|^2$ plays a role similar to that of $\max_{1 \le j_1 < j_2 \le k} \|w_{j_1} - w_{j_2}\|^2$ and is easier to implement.
Insert Figure 2 about here
Kernel-based learning can be achieved via a proper kernel $K(\cdot, \cdot)$, mapping from $S \times S$ to $\mathbb{R}$. The kernel is required to satisfy Mercer's condition (Mercer, 1909), which ensures that the kernel matrix $K$ is positive definite, where $K$ is an $n \times n$ matrix with its $i_1 i_2$-th element $K(x_{i_1}, x_{i_2})$. Then each $f_j$ can be represented as $h_j(x) + b_j$ with $h_j = \sum_{i=1}^{n} v_{ji} K(x_i, x)$ by the theory of reproducing kernel Hilbert spaces, cf., Wahba (1998). Kernel-based multicategory ψ-learning then solves:

$$\min_{b, v} \Big( \frac{1}{2} \sum_{j=1}^{k} \|h_j\|^2_{\mathcal{H}_K} + C \sum_{i=1}^{n} \psi\big(g(f(x_i), y_i)\big) \Big) \quad \text{subject to } \sum_{j=1}^{k} b_j 1_n + K \sum_{j=1}^{k} v_j = 0, \qquad (5)$$

where $v_j = (v_{j1}, \ldots, v_{jn})^T$ and $v = \mathrm{vec}(v_1, \ldots, v_k)$. Using the reproducing kernel property, $\|h_j\|^2_{\mathcal{H}_K}$ can be written as $v_j^T K v_j$.
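A small numpy sketch (helper names are ours) of the kernel quantities in (5): the Gaussian kernel matrix $K$ used later in Example 3.3.2, and the penalty $\|h_j\|^2_{\mathcal{H}_K} = v_j^T K v_j$ obtained from the reproducing property:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """n x n Gaussian kernel matrix: K[i1, i2] = exp(-||x_i1 - x_i2||^2 / sigma^2)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
K = gaussian_kernel_matrix(X)

v_j = rng.normal(size=6)
rkhs_norm_sq = v_j @ K @ v_j   # ||h_j||^2_{H_K} via the reproducing property

# K should be symmetric and (numerically) positive semidefinite,
# so the RKHS norm is nonnegative.
assert np.allclose(K, K.T) and np.all(np.linalg.eigvalsh(K) > -1e-10)
assert rkhs_norm_sq >= 0.0
```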
The concept of support vectors can also be extended to multicategory problems. In the separable case, the instances on the boundaries of the polyhedra $D_j$ are the support vectors, where the polyhedron $D_j$ is the solution set of the finite system of linear inequalities defined by $\min(g(f(x), j)) \ge 1$. In the nonseparable case, the instances belonging to class $j$ that do not fall inside $D_j$ are the support vectors.
2.3 Computational Development of ψ-Learning

To treat the nonconvex minimization involved in (4) and (5), we utilize state-of-the-art technology in global optimization: the difference convex algorithm (DCA) of An and Tao (1997). The reader is referred to Liu et al. (2005) for the details of the algorithm.

The key to efficient computation is a d.c. decomposition $\psi = \psi_1 + \psi_2$, where $\psi_1(u) = 0$ if $u_{\min} \ge 1$ and $2(1 - u_{\min})$ otherwise, and $\psi_2(u) = 0$ if $u_{\min} \ge 0$ and $2u_{\min}$ otherwise. Here $\psi_1$ can be viewed as a multivariate generalization of the univariate hinge loss. This d.c. decomposition connects the ψ-loss to the hinge loss $\psi_1$ of SVM. In fact, the multivariate ψ mimics the GE defined by $1 - \mathrm{sign}$, while the generalized hinge loss $\psi_1$ is a convex upper envelope of $1 - \mathrm{sign}$. With this d.c. decomposition, ψ corrects the bias introduced by the imposed convexity of $\psi_1$, and is expected to yield higher generalization accuracy.
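The decomposition can be checked pointwise. In this sketch (our code, not the authors'), `psi1` is the convex generalized hinge and `psi2` the concave remainder, and $\psi = \psi_1 + \psi_2$ holds for every argument:

```python
import numpy as np

def psi(u):
    # The specific psi of equation (3).
    umin = np.min(u)
    return 0.0 if umin >= 1 else (2.0 if umin < 0 else 2.0 * (1.0 - umin))

def psi1(u):
    # Convex part: a multivariate generalization of the hinge loss.
    umin = np.min(u)
    return 0.0 if umin >= 1 else 2.0 * (1.0 - umin)

def psi2(u):
    # Remaining part; psi2(u) = -2[-u_min]_+ is concave, so psi = psi1 + psi2 is d.c.
    umin = np.min(u)
    return 0.0 if umin >= 0 else 2.0 * umin

# psi = psi1 + psi2 on all three branches of (3).
for u in ([-1.3, 0.4], [0.2, 0.9], [1.5, 2.0], [0.0, 3.0]):
    assert abs(psi(u) - (psi1(u) + psi2(u))) < 1e-12
```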
3 Statistical Learning Theory

In the literature, there has been considerable interest in the generalization accuracy of margin-based classifiers. In the binary case, Lin (2000) investigated rates of convergence of SVM with a spline kernel. Bartlett, Jordan, and McAuliffe (2003) studied rates of convergence for certain convex margin losses. Shen et al. (2003) derived a learning theory for ψ-learning. Zhang (2004a) obtained consistency for general convex margin-based losses. For the multicategory case, Zhang (2004b) has recently studied consistency of several large margin classifiers using convex losses. To our knowledge, no results are available for rates of convergence in the multicategory case. In this section, we quantify the generalization error rates of the proposed multicategory ψ-learning, as measured by the Bayesian regret, to be introduced.
3.1 Statistical Properties

The generalization performance of a classifier defined by $f$ is measured by the Bayesian regret $e(f, \bar f) = \mathrm{Err}(f) - \mathrm{Err}(\bar f) \ge 0$, the difference between the actual performance and the ideal performance. Here $\bar f$ is the Bayes rule, yielding the ideal performance if the true distribution of $(X, Y)$ had been known in advance; it is obtained by minimizing $\mathrm{Err}(f) = \frac{1}{2} E[1 - \mathrm{sign}(g(f(X), Y))]$ with respect to all $f$, with $g(f(x), j) = \{f_j(x) - f_l(x);\ l \neq j\}$. Note that the Bayes rule is not unique, because any $\bar f$ satisfying $\mathrm{argmax}_j \bar f_j(x) = \mathrm{argmax}_j P_j(x)$, with $P_j(x) = P(Y = j \mid x)$, yields the minimum. Without loss of generality, in what follows we use the specific $\bar f = (\bar f_1, \ldots, \bar f_k)$ with

$$\bar f_j(x) = \tfrac{k-1}{k}\, I\big(\mathrm{sign}(P_j(x) - P_l(x);\ l \neq j) = 1\big) - \tfrac{1}{k}\, I\big(\mathrm{sign}(P_j(x) - P_l(x);\ l \neq j) \neq 1\big);$$

that is, $\bar f_l(x) = \frac{k-1}{k}$ if $l = \mathrm{argmax}_j P_j(x)$, and $-\frac{1}{k}$ otherwise.

Theorem 3.1 below gives expressions for the Bayesian regret, which are critical for establishing our learning theory.
Theorem 3.1. For any decision function vector $f$,

$$e(f, \bar f) = \frac{1}{2} E\Big[\sum_{j=1}^{k} P_j(X)\big(\mathrm{sign}(\bar g(\bar f(X), j)) - \mathrm{sign}(g(f(X), j))\big)\Big] \qquad (6)$$
$$= E\big[\max_j P_j(X) - P_{\mathrm{argmax}_j f_j(X)}(X)\big] \ge 0$$
$$= E\Big[\sum_{j \neq l} |P_l(X) - P_j(X)|\, I\big(\mathrm{sign}(\bar g(\bar f(X), l)) = 1,\ \mathrm{sign}(g(f(X), j)) = 1\big)\Big], \qquad (7)$$

where $\bar g(\bar f(x), j) = \{\bar f_j(x) - \bar f_l(x);\ l \neq j\}$.
Equation (6) in Theorem 3.1 expresses $e(f, \bar f)$ as a weighted sum of the individual misclassification errors, weighted by the conditional probabilities $P_j(X)$. Equation (7) expresses $e(f, \bar f)$ in terms of the misclassification resulting from $\binom{k}{2}$ multiple comparisons.
Equation (7) suggests that a multicategory problem differs dramatically from its binary counterpart. For a binary problem, (7) reduces to $e(f_2, \bar f_2) = E\big[|P_2(X) - 1/2|\, |\mathrm{sign}(f_2(X)) - \mathrm{sign}(\bar f_2(X))|\big]$ because $P_2(x) - P_1(x) = 2(P_2(x) - 1/2)$. This means that a comparison between $P_1(x)$ and $P_2(x)$ in the binary case is equivalent to examining whether $P_2(x)$ exceeds $1/2$. For a multicategory problem, however, this no longer holds, since multiple pairwise comparisons are necessary in order to determine the argmax. In fact, there may not exist a dominating class; that is, $\max_l P_l(x) < 1/2$ for some $x \in S$. Therefore, $k$ comparisons of $P_j(x)$ with $1/2$ may not be sufficient to determine the correct classification rule. Indeed, the issue of the existence of a dominating class is important in the multicategory case but not in the binary case.
The ultimate goal of classification is to minimize $E[1 - \mathrm{sign}(g(f(X), Y))]$. To avoid the scale-invariance problem of the sign function, we apply the ψ-loss here as a surrogate loss and minimize $E[\psi(g(f(X), Y))]$. The following theorem says that a ψ-loss yields the same Bayes rule as the $1 - \mathrm{sign}$ loss. Thus, consistency of multicategory ψ-learning can be established.
Theorem 3.2. The Bayes decision vector $\bar f$ satisfies $\bar g_{\min}(\bar f(x), \mathrm{argmax}_{j=1,\ldots,k} P_j(x)) = 1$, where $\bar g_{\min}$ is the minimum of the $k - 1$ elements of the vector $\bar g$. For any ψ satisfying (2), $\bar f$ minimizes both $E[\psi(g(f(X), Y))]$ and $E[1 - \mathrm{sign}(g(f(X), Y))]$, in the sense that

$$E[\psi(g(f(X), Y))] \ge E[\psi(\bar g(\bar f(X), Y))] = E[1 - \mathrm{sign}(\bar g(\bar f(X), Y))] \le E[1 - \mathrm{sign}(g(f(X), Y))]$$

for any $f$. Furthermore, the minimizers of $E[\psi(g(f(X), Y))]$ and $E[1 - \mathrm{sign}(g(f(X), Y))]$ are not unique; e.g., $c\bar f$ is also a minimizer of both quantities for any $c \ge 1$.
Theorem 3.2 says that ψ-learning estimates the Bayes classifier defined by $\bar f$, as opposed to the conditional probabilities $(P_1(x), \ldots, P_k(x))$, and that the ψ-loss plays the same role as $1 - \mathrm{sign}$. Furthermore, the optimal performance of $\bar f$, with $\bar g_{\min}(\bar f(x), \mathrm{argmax}_j P_j(x)) = 1$, is realized via the ψ-loss function although ψ differs from $1 - \mathrm{sign}$.
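A small numerical illustration of Theorem 3.2 (our sketch; a single fixed $x$ with $k = 3$): the conditional ψ-risk of the specific Bayes rule $\bar f$ equals its $1 - \mathrm{sign}$ risk, $2(1 - \max_j P_j(x))$, and is not beaten by randomly drawn decision vectors:

```python
import numpy as np

def psi(u):
    # The specific psi of equation (3).
    umin = np.min(u)
    return 0.0 if umin >= 1 else (2.0 if umin < 0 else 2.0 * (1.0 - umin))

def g(f_vals, j):
    # Comparisons of class j (0-based here) versus the rest.
    return f_vals[j] - np.delete(f_vals, j)

k = 3
P = np.array([0.5, 0.3, 0.2])   # conditional probabilities P_j(x) at a fixed x
# The specific Bayes rule: (k-1)/k on the argmax class, -1/k elsewhere.
f_bar = np.where(np.arange(k) == np.argmax(P), (k - 1) / k, -1 / k)

def psi_risk(f_vals):
    """Conditional psi-risk sum_j P_j(x) * psi(g(f(x), j))."""
    return sum(P[j] * psi(g(f_vals, j)) for j in range(k))

risk_bar = psi_risk(f_bar)      # should equal 2 * (1 - max_j P_j(x)) = 1.0

rng = np.random.default_rng(2)
assert all(psi_risk(rng.normal(size=k)) >= risk_bar - 1e-12 for _ in range(200))
```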
3.2 Statistical Learning Theory
Let
F
be a function class of candidate function vectors, which is allowed to depend on
n
.
Note
that the Bayes decision function
¹
f
is not required to belong to
F
. For any functi
on vector
f
2 F
, classi¯cation is performed by partitioning
S
into
k
disjoint sets (
G
f
1
;
¢ ¢ ¢
;G
f
k
) = (
f
x
:
sign(
g
(
f
(
x
)
;
1) = 1
g
;
¢ ¢ ¢
;
f
x
: sign(
g
(
f
(
x
)
; k
) = 1
g
).
In this section, we generalize the learning theory of Shen et al. (2003) to the multica
t

egory case. Our learning theory quanti¯es the magnitude of
e
(
f
;
¹
f
) as a function of
n
,
k
,
the tuning parameter
C
, and the complexity of a class of candidate classi¯cation
partitions
G
(
F
) =
f
(
G
f
1
;
¢ ¢ ¢
;G
f
k
);
f
2 Fg
induced by
F
.
Denote by $e_\psi(f, \bar f) = \frac{1}{2}\big(E\psi(g(f(X), Y)) - E\psi(\bar g(\bar f(X), Y))\big)$ the approximation error, which measures the degree of approximation of $\mathcal{G}(\mathcal{F})$ to $(G_{\bar f,1}, \ldots, G_{\bar f,k})$. Let $J_0 = \max(J(f_0), 1)$. The following technical assumptions are made.
Assumption A (Approximation error): For some positive sequence $s_n \to 0$ as $n \to \infty$, there exists $f_0 \in \mathcal{F}$ such that $e_\psi(f_0, \bar f) \le s_n$; equivalently, $\inf_{f \in \mathcal{F}} e_\psi(f, \bar f) \le s_n$. Like $\mathcal{F}$, $f_0$ may depend on $n$.
Assumption B (Boundary behavior): There exist constants $0 < \alpha \le +\infty$ and $c_1 > 0$ such that $P\big(X \in S : \max_l P_l(X) - P_{j \neq \mathrm{argmax}_l P_l(X)}(X) < 2\delta\big) \le c_1 \delta^\alpha$ for any small $\delta \ge 0$.

Assumption B describes the behavior of the conditional probabilities $P_j$ near the decision boundary $\{x \in S : \max_l P_l(x) = P_j(x) \text{ for some } l \neq j \in \{1, 2, \ldots, k\}\}$. It is equivalent to $P\big(X \in S : \max_l P_l(X) - \text{second max}_l\, P_l(X) < 2\delta\big) \le c_1 \delta^\alpha$, by the fact that $\{X : \max_l P_l(X) - P_{j \neq \mathrm{argmax}_l P_l(X)}(X) < 2\delta\} \subset \{X : \max_l P_l(X) - \text{second max}_l\, P_l(X) < 2\delta\}$.
To specify Assumption C, we define the metric entropy for partitions. For a class of partitions $\mathcal{B} = \{(B_1, \ldots, B_k) : B_j \cap B_l = \emptyset \ \forall j \neq l,\ \cup_{1 \le j \le k} B_j = S\}$ and any $\epsilon > 0$, call $\{(G^v_{j1}, G^u_{j1}), \ldots, (G^v_{jm}, G^u_{jm})\}$, $j = 1, \ldots, k$, an $\epsilon$-bracketing set of $\mathcal{B}$ if for any $(G_1, \ldots, G_k) \in \mathcal{B}$ there exists an $h$ such that $G^v_{jh} \subset G_j \subset G^u_{jh}$ and

$$\max_{1 \le h \le m} \max_{1 \le j \le k} P(G^u_{jh} \setminus G^v_{jh}) \le \epsilon, \qquad (8)$$

where $G^u_{jh} \setminus G^v_{jh}$ is the set difference between $G^u_{jh}$ and $G^v_{jh}$. The metric entropy $H_B(\epsilon, \mathcal{B})$ of $\mathcal{B}$ with bracketing is then defined as the logarithm of the cardinality of the $\epsilon$-bracketing set of $\mathcal{B}$ of the smallest size.
Let $\mathcal{F}(\ell) = \{f \in \mathcal{F} : J(f) \le \ell\} \subset \mathcal{F}$ and $\mathcal{G}(\ell) = \{(G_{f,1}, \ldots, G_{f,k});\ f \in \mathcal{F}(\ell)\} \subset \mathcal{G}(\mathcal{F})$. Then $\mathcal{G}(\ell)$ is the set of classification partitions under the regularization $J(f) \le \ell$. For instance, $J(f)$ is $\frac{1}{2}\sum_j \|w_j\|^2$ in (4) and $\frac{1}{2}\sum_j \|h_j\|^2_{\mathcal{H}_K}$ in (5). To measure the complexity of $\mathcal{G}(\ell)$ via the metric entropy, the following assumption is made.
Assumption C (Metric entropy for partitions): For some positive constants $c_i$, $i = 2, 3, 4$, there exists some $\epsilon_n > 0$ such that

$$\sup_{\ell \ge 2} \phi(\epsilon_n, \ell) \le c_2 n^{1/2}, \qquad (9)$$

where $\phi(\epsilon_n, \ell) = \int_{c_4 L}^{c_3^{1/2} L^{\alpha/(2(\alpha+1))}} H_B^{1/2}\big(u^2/4, \mathcal{G}(\ell)\big)\, du \big/ L$ and $L = L(\epsilon_n, C, \ell) = \min\big(\epsilon_n^2 + (Cn)^{-1}(\ell/2 - 1) J_0,\ 1\big)$.
Assumption D (ψ-function): The ψ-function satisfies (2).

As a technical remark, we note that to simplify the function entropy calculation in Assumption C required in Theorem 3.4, an additional condition on the ψ-function may be imposed. For instance, we may restrict the ψ-loss functions in (2) to satisfy a multivariate Lipschitz condition:

$$|\psi(u^*) - \psi(u^{**})| \le D\, |u^*_{\min} - u^{**}_{\min}|, \qquad (10)$$

where $D > 0$ is a constant. Condition (10) is satisfied by the specific ψ-function in (3), with $D = 2$. This aspect is illustrated in Example 3.3.2. However, (10) is irrelevant to the set entropy in Assumption C required in Theorem 3.3; see Example 3.3.1.
Theorem 3.3. Suppose that Assumptions A–D are met. Then, for any classifier $\mathrm{argmax}(\hat f)$ of ψ-learning, there exists a constant $c_5 > 0$ such that

$$P\big(e(\hat f, \bar f) \ge \delta_n^2\big) \le 3.5 \exp\Big(-c_5\, n\, (nC)^{-\frac{\alpha+2}{\alpha+1}} J_0^{\frac{\alpha+2}{\alpha+1}}\Big),$$

provided that $Cn \ge 2 \delta_n^{-2} J_0$, where $\delta_n^2 = \min(\max(\epsilon_n^2, 2 s_n), 1)$.
Corollary 3.1. Under the assumptions of Theorem 3.3, $|e(\hat f, \bar f)| = O_p(\delta_n^2)$ and $E|e(\hat f, \bar f)| = O(\delta_n^2)$, provided that $n^{-\frac{1}{\alpha+1}} (C^{-1} J_0)^{\frac{\alpha+2}{\alpha+1}}$ is bounded away from zero.
To obtain the error rate $\delta_n^2$ in Theorem 3.3, we need to compute the metric entropy for $\mathcal{G}(\ell)$. It may not be easy to compute the metric entropy for partitions, because $\mathcal{G}(\ell)$ is induced by the class of functions $\mathcal{F}(\ell)$. Moreover, it is also of interest to establish an upper bound for $e(\hat f, \bar f)$ using the corresponding function entropy as opposed to the set entropy. In what follows, we develop such results in Theorem 3.4.
To proceed, we define the $L_2$-metric entropy with bracketing for $\mathcal{F}$ as follows. For any $\epsilon > 0$, call $\{(g^v_1, g^u_1), \ldots, (g^v_m, g^u_m)\}$ an $\epsilon$-bracketing set if for any $g \in \mathcal{F}$ there is an $h$ such that $g^v_h \le g \le g^u_h$ and $\max_{1 \le h \le m} \|g^u_h - g^v_h\|_2 \le \epsilon$, where $\|\cdot\|_2$ is the usual $L_2$-norm, defined by $\|g\|_2^2 = \int g^2\, dP$. Then the $L_2$-metric entropy with bracketing $H_B(\epsilon, \mathcal{F})$ of $\mathcal{F}$ is defined as the logarithm of the cardinality of the $\epsilon$-bracketing set of the smallest size. Now define a new function set $\mathcal{F}_\psi(\ell) = \{\psi(g(f(x), y)) - \psi(g(f_0(x), y)) : f \in \mathcal{F}(\ell)\}$ and $\phi^*(\epsilon^*_n, \ell) = \int_{c_4 L^*}^{c_3^{1/2} L^{* \alpha/(2(\alpha+1))}} H_B^{1/2}\big(u, \mathcal{F}_\psi(\ell)\big)\, du \big/ L^*$ with $L^* = \min\big(\epsilon^{*2}_n + (Cn)^{-1}(\ell/2 - 1) J_0,\ 1\big)$.
Theorem 3.4. Suppose that Assumptions A–D are met with $\phi^*(\epsilon^*_n, \ell)$ replacing $\phi(\epsilon_n, \ell)$ in Assumption C. Then, for any classifier $\mathrm{argmax}(\hat f)$ of ψ-learning, there exists a constant $c_5 > 0$ such that

$$P\big(e(\hat f, \bar f) \ge \delta^{*2}_n\big) \le P\big(e_\psi(\hat f, \bar f) \ge \delta^{*2}_n\big) \le 3.5 \exp\Big(-c_5\, n\, (nC)^{-\frac{\alpha+2}{\alpha+1}} J_0^{\frac{\alpha+2}{\alpha+1}}\Big),$$

provided that $Cn \ge 2 \delta^{*-2}_n J_0$, where $\delta^{*2}_n = \min(\max(\epsilon^{*2}_n, 2 s_n), 1)$.
Corollary 3.2. Under the assumptions of Theorem 3.4, $|e_\psi(\hat f, \bar f)| = O_p(\delta^{*2}_n)$ and $E|e_\psi(\hat f, \bar f)| = O(\delta^{*2}_n)$, provided that $n^{-\frac{1}{\alpha+1}} (C^{-1} J_0)^{\frac{\alpha+2}{\alpha+1}}$ is bounded away from zero.
Note that $e_\psi(\hat f, \bar f) \ge e(\hat f, \bar f)$. The rate $\delta^{*2}_n$ obtained from Theorem 3.4 using the metric entropy for functions yields an upper bound for $e(\hat f, \bar f)$; thus $e(\hat f, \bar f) \le \min(\delta^2_n, \delta^{*2}_n)$ with probability tending to 1 by Theorems 3.3–3.4. In application, one may calculate either $\delta^2_n$ or $\delta^{*2}_n$, depending on which entropy is easier to compute.
Theorems 3.3–3.4 reveal distinct characteristics of multicategory problems, although they cover the binary case. First, a multicategory problem generally has a higher level of complexity, and hence the number of classes $k$ may have an impact on performance. In fact, Theorems 3.3–3.4 permit studying the dependency of $e(\hat f, \bar f)$ on $k$ and $n$ simultaneously; see Examples 3.3.1 and 3.3.2. Second, some properties of binary linear learning no longer hold in the multicategory case when $k > 2$. For instance, the decision boundaries generated by linear learning with $k > 2$ can be piecewise-linear hyperplanes.
3.3 Illustrative Examples

To illustrate our learning theory, we study specific learning examples and apply the theory to derive error bounds for multicategory ψ-learning.

3.3.1. Linear classification: We consider linear classification involving a class of $k$ hyperplanes $\mathcal{F} = \{f : f_j(x) = w_j^T x + b_j,\ \sum_{j=1}^k f_j = 0,\ x \in S = [0,1]^d\}$, where $d$ is a constant. To generate the training sample, we specify $P(Y = j) = 1/k$, and $P(x \mid Y = j) = k - 1$ for $\{x : x_1 \in [(j-1)/k, j/k)\}$ and $1/(k-1)$ otherwise, where $x_1$ is the first coordinate of $x$. Then the Bayes classifier yields the sets $\{x : x_1 \in [0, 1/k)\}, \ldots, \{x : x_1 \in [(k-1)/k, 1]\}$ for the corresponding $k$ classes.
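The distribution of Example 3.3.1 is easy to simulate. In this sketch (our function names, not the paper's code), given $Y = j$, the first coordinate falls in the high-density slab $[(j-1)/k, j/k)$ with probability $(k-1)/k$, so the Bayes classifier's accuracy should be close to $(k-1)/k$:

```python
import numpy as np

def sample_example_331(n, k, d, rng):
    """Draw (x, y) as in Example 3.3.1: P(Y=j) = 1/k; given Y=j, x_1 has density
    k-1 on [(j-1)/k, j/k) and 1/(k-1) elsewhere; other coordinates are uniform."""
    y = rng.integers(1, k + 1, size=n)
    x = rng.uniform(size=(n, d))
    in_slab = rng.uniform(size=n) < (k - 1) / k   # mass of the high-density slab
    lo = (y - 1) / k
    x[:, 0] = np.where(
        in_slab,
        rng.uniform(lo, lo + 1 / k),              # inside [(j-1)/k, j/k)
        (rng.uniform(size=n) * (k - 1) / k + lo + 1 / k) % 1.0,  # complement, uniform
    )
    return x, y

def bayes_predict(x, k):
    # Bayes classifier: class j on the slab x_1 in [(j-1)/k, j/k).
    return np.minimum(np.floor(x[:, 0] * k).astype(int) + 1, k)

rng = np.random.default_rng(3)
x, y = sample_example_331(4000, k=3, d=2, rng=rng)
acc = np.mean(bayes_predict(x, 3) == y)   # should be near (k-1)/k = 2/3
```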
We now verify Assumptions A–C. For Assumption A, it is easy to find $f_t = (w_{11} x_1 + b_1, \ldots, w_{1k} x_1 + b_k)$ such that the $w_{1j}$'s are increasing, $\sum_{j=1}^k w_{1j} = 0$, $\sum_{j=1}^k b_j = 0$, and $w_{1j}\, j/k + b_j = w_{1,j+1}\, j/k + b_{j+1}$, $j = 1, \ldots, k-1$. Let $f_0 = n f_t \in \mathcal{F}$; then $e_\psi(f_0, \bar f) \le s_n = c_1 n^{-1}$ for some constant $c_1 > 0$. This implies Assumption A with $s_n = c_1 n^{-1}$. Assumption B is satisfied with $\alpha = +\infty$, since $P\big(X \in S : \max_l P_l(X) - P_j(X) < 2\delta\big) = P\big(X_1 \in \{1/k, \ldots, (k-1)/k\}\big) = 0$ for any sufficiently small $\delta > 0$. To verify Assumption C, we note that $H_B(u, \mathcal{G}(\ell)) \le O(k^2 \log(k/u))$ for any given $\ell$ by Lemma 1. Let $\phi_1(\epsilon_n, \ell) = c_3 (k^2 \log(k/L^{1/2}))^{1/2} / L^{1/2}$, where $L = \min(\epsilon_n^2 + (Cn)^{-1}(\ell/2 - 1), 1)$. This in turn yields $\sup_{\ell \ge 2} \phi(\epsilon_n, \ell) \le \phi_1(\epsilon_n, 2) = c (k^2 \log(k/\epsilon_n))^{1/2} / \epsilon_n$ for some $c > 0$ and a rate $\epsilon_n = \big(\frac{k^2 \log n}{n}\big)^{1/2}$ when $C/J_0 \sim \delta_n^{-2} n^{-1} \sim \frac{1}{k^2 \log n}$, provided that $\frac{k^2 \log n}{n} \to 0$.

By Corollary 3.1, we conclude that $e(\hat f, \bar f) \le O\big(\frac{k^2 \log n}{n}\big)$ except on a set of probability tending to zero, and $E e(\hat f, \bar f) \le O\big(\frac{k^2 \log n}{n}\big)$, when $\frac{k^2 \log n}{n} \to 0$ as $n \to \infty$. It is interesting to note that $E e(\hat f, \bar f) \le O(n^{-1} \log n)$ when $k$ is a fixed constant. This conclusion holds generally for any ψ-function satisfying Assumption D.
3.3.2. Gaussian-kernel classification: In this example, we consider nonlinear learning with the same $P(x, y)$ as in Example 3.3.1. Let $\mathcal{F} = \{f : f_j(x) = \sum_{i=1}^n v_{ji} K(x_i, x) + b_j,\ \sum_{j=1}^k f_j = 0,\ x \in S = [0,1]^d\}$ with Gaussian kernel $K(s, t) = \exp(-\|s - t\|^2 / \sigma^2)$.

For Assumption A, we note that $\mathcal{F}$ is a rich function space for large $n$. In fact, any continuous function can be well approximated by Gaussian-kernel representations under the sup-norm, cf., Steinwart (2001). Thus there exists an $f_t = (f_{1t}, \ldots, f_{kt}) \in \mathcal{F}$ such that $f_{jt}(x) \ge 0$ for $x_1 \in [(j-1)/k, j/k]$ and $< 0$ otherwise. With the choice $f_0 = \epsilon_n^{-2} f_t$, $e_\psi(f_0, \bar f) \le s_n = c_1 \epsilon_n^2$, where $c_1$ is a constant and $\epsilon_n$ is defined below. Assumption B is satisfied with $\alpha = +\infty$, as in Example 3.3.1. In this case, the metric entropy of $\mathcal{F}_\psi(\ell)$ appears to be easier to compute, so we apply Theorem 3.4 to obtain the convergence rate. Consider any ψ-function in (2) satisfying (10). Then by Lemma 2, $H_B(u, \mathcal{F}_\psi(\ell)) \le O(k (\log(\ell/u))^{d+1})$ for any given $\ell$. Let $\phi^*_1(\epsilon^*_n, \ell) = c_3 (k (\log(\ell/L^{1/2}))^{d+1})^{1/2} / L^{1/2}$, where $L = \min(\epsilon^{*2}_n + (Cn)^{-1}(\ell/2 - 1), 1)$. Then $\sup_{\ell \ge 2} \phi^*(\epsilon^*_n, \ell) \le \phi^*_1(\epsilon^*_n, 2) = c (k (\log(1/\epsilon^*_n))^{d+1})^{1/2} / \epsilon^*_n$ for some $c > 0$. Solving (9) yields a rate $\epsilon^*_n = \big(\frac{k (\log(n k^{-1}))^{d+1}}{n}\big)^{1/2}$ when $C/J_0 \sim \delta^{*-2}_n n^{-1} \sim \frac{1}{k (\log(n k^{-1}))^{d+1}}$, under the condition that $\frac{k (\log(n k^{-1}))^{d+1}}{n} \to 0$ as $n \to \infty$.

By Theorem 3.4, we conclude that $e(\hat f, \bar f) \le e_\psi(\hat f, \bar f) \le O\big(\frac{k (\log(n k^{-1}))^{d+1}}{n}\big)$ except on a set of probability tending to zero. By Corollary 3.2, $E e(\hat f, \bar f) \le O\big(\frac{k (\log(n k^{-1}))^{d+1}}{n}\big)$. This resulting rate reflects the dependence of the rate on the number of classes $k$. If $k$ is treated as a fixed constant, then we have $E e(\hat f, \bar f) \le O(n^{-1} (\log n)^{d+1})$. This conclusion holds generally for any ψ-function satisfying Assumption D and Condition (10).
In summary, Examples 3.3.1–3.3.2 provide insight into the generalization error of the proposed methodology. In view of the lower bound $n^{-1}$ result (cf., Tsybakov, 2004) in the binary case, we conjecture that the rates obtained in Examples 3.3.1 and 3.3.2 are nearly optimal, although, to our knowledge, a lower bound result for a general classifier has not yet been established in the multicategory case. Further investigation is necessary.
4 Numerical Examples
In this section, we examine the performance of multicategory ψ-learning in terms of generalization and compare it with its SVM counterpart. In the literature, there are a number of different multicategory SVM generalizations; for instance, Lee et al. (2004), Crammer and Singer (2001), and Weston and Watkins (1998), among others. To make a fair comparison, we use a version of multicategory SVM that is parallel to our multicategory ψ-learning. Specifically, we replace the ψ function in (4) and (5) by $\psi_1$. Then for the linear case, this version of multicategory SVM solves
$$\min_{\mathbf{b}, \mathbf{w}} \Big( \frac{1}{2} \sum_{j=1}^k \|\mathbf{w}_j\|^2 + C \sum_{i=1}^n \psi_1(g(\mathbf{x}_i, y_i)) \Big) \quad \text{subject to} \quad \sum_{j=1}^k b_j \mathbf{1}_n + X \sum_{j=1}^k \mathbf{w}_j = 0. \quad (11)$$
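To make (11) concrete, the sketch below evaluates its objective for a linear $k$-class rule. Two conventions here are our own assumptions rather than statements from the paper: we take $\psi_1$ to be the hinge function $u \mapsto (1-u)_+$, and we take the comparison function to be $g(f(\mathbf{x}), y) = \min_{l \ne y}(f_y(\mathbf{x}) - f_l(\mathbf{x}))$, the $g_{\min}$ form that appears later in the Appendix; the toy numbers are arbitrary.

```python
import numpy as np

def g_min(f_vals, y):
    # g(f(x), y) = min over l != y of (f_y(x) - f_l(x)); positive iff class y wins
    return float(np.min(f_vals[y] - np.delete(f_vals, y)))

def hinge(u):
    # psi_1 assumed here to be the hinge loss (1 - u)_+
    return max(0.0, 1.0 - u)

def svm_objective(W, b, X, y, C):
    # (1/2) sum_j ||w_j||^2 + C sum_i psi_1(g(f(x_i), y_i)), cf. (11)
    penalty = 0.5 * float(np.sum(W ** 2))
    loss = sum(hinge(g_min(W @ xi + b, yi)) for xi, yi in zip(X, y))
    return penalty + C * loss

# Toy 3-class linear rule; columns of W sum to zero (sum-to-zero structure)
W = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]])
b = np.zeros(3)
X = np.array([[2.0, 0.0], [0.1, 0.0]])
y = [0, 0]
obj = svm_objective(W, b, X, y, C=1.0)
```

Here the first point is correctly classified with margin at least 1 and contributes nothing, while the second lies inside the margin and contributes 0.9, so the objective equals 1.9.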
This version of SVM is closely related to that of Crammer and Singer (2001). In their formulation, all $b_j$'s are set to be 0 rather than employing the sum-to-zero constraint, which is in contrast to (11). As argued by Guermeur (2002), the sum-to-zero constraint is necessary to ensure uniqueness of the solution when a $k$-dimensional vector of decision functions with intercepts $b_j$'s is used for a $k$-class problem.
4.1 Simulation
Two linear examples are considered. In these examples, the GE is approximated by the testing error using a testing sample, independent of training. In what follows, all calculations are carried out using the IMSL C routines.
Three-class linear problem. The training data are generated as follows. First, generate pairs $(t_1, t_2)$ from a bivariate $t$-distribution with degrees of freedom $\nu$, where $\nu = 1, 3$ in Examples 1 and 2, respectively. Second, randomly assign $\{1, 2, 3\}$ to its label index for each $(t_1, t_2)$. Third, calculate $(x_1, x_2)$ as follows: $x_1 = t_1 + a_1$ and $x_2 = t_2 + a_2$ with three different values of $(a_1, a_2) = (\sqrt{3}, 1), (-\sqrt{3}, 1), (0, -2)$ for corresponding classes 1–3. In these examples, the testing and Bayes errors are computed via independent testing samples of size $10^6$ for classifiers obtained from training samples of size 150.
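For readers who want to replay the experiment, here is a sketch of the generator described above. One simplification is ours: the two $t_\nu$ coordinates are drawn independently, which is one reading of "bivariate $t$-distribution"; the paper's exact recipe may differ.

```python
import numpy as np

def simulate_three_class(n, df, rng):
    # Class-specific shifts (a1, a2) for classes 1-3, as in the text
    shifts = np.array([[np.sqrt(3.0), 1.0], [-np.sqrt(3.0), 1.0], [0.0, -2.0]])
    labels = rng.integers(0, 3, size=n)      # random class assignment (0-based)
    t = rng.standard_t(df, size=(n, 2))      # heavy-tailed (t1, t2) pairs
    X = t + shifts[labels]                   # x = t + a, shifted by class
    return X, labels + 1                     # report labels in {1, 2, 3}

rng = np.random.default_rng(1)
X1, y1 = simulate_three_class(150, df=1, rng=rng)   # Example 1: nu = 1
X2, y2 = simulate_three_class(150, df=3, rng=rng)   # Example 2: nu = 3
```

With df = 1 the noise is Cauchy-like (no first moment), which is exactly the regime where the text reports SVM losing its data-reduction ability.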
To eliminate the dependence on $C$, we maximize the performances of ψ-learning and SVM by optimizing $C$ over a discrete set in $[10^{-3}, 10^3]$. For each method, the testing error for the optimal $C$ is averaged over 100 repeated simulations. The simulation results are summarized in Table 1.
Insert Table 1 about here
As shown in Table 1, ψ-learning usually has a smaller testing error, and thus better generalization, than its SVM counterpart. The amount of improvement, however, varies across examples. In Example 1, the percent improvement of multicategory ψ-learning over SVM is 43.22% when the corresponding $t$-distribution has one degree of freedom. In Example 2, it decreases to 20.41% when the $t$-distribution with 3 degrees of freedom is employed.
Further, ψ-learning yields a smaller number of support vectors. This suggests that ψ-learning has an even more "sparse" solution than SVM, and hence a stronger ability of data reduction. On a related matter, SVM fails to give data reduction in Example 1, since almost all the instances are support vectors, in contrast to the much smaller number of support vectors of ψ-learning. One plausible explanation is that the first moment of the standard bivariate $t$-distribution does not exist, and thus the corresponding SVM does not work well. In general, any classifier with an unbounded loss such as SVM may suffer difficulty from extreme outliers as in this example. This reinforces our view that ψ-learning is more robust against outliers.
4.2 Application
We now examine the performance of ψ-learning and its SVM counterpart on a benchmark example, letter, obtained from Statlog. In this example, each sample contains 16 primitive numerical attributes converted from its corresponding letter image, with a response variable representing 26 categories. The main goal here is to identify each letter image as one of the 26 capital letters in the English alphabet. A detailed description can be found in www.liacc.up.pt/ML/statlog/datasets/letter/letter.doc.html.
For illustration, we use the data for letters D, O, Q with 805, 753, and 783 cases, respectively. A random sample of $n = 200$ is selected for training, while leaving the rest for testing. For each training dataset, we seek the best performance of linear ψ-learning and SVM over a set of $C$-values in $[10^{-3}, 10^3]$. The corresponding results with respect to the smallest testing errors for each method in ten different cases are reported in Table 2. Since the Bayes error is unknown, the improvement of ψ-learning over SVM is computed via $(T(\mathrm{SVM}) - T(\psi))/T(\mathrm{SVM})$.
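The improvement figures in Tables 1 and 2 come from two related formulas: with the Bayes error known (simulation), the denominator is $\hat{e}(\mathrm{SVM}, \bar{f}) = T(\mathrm{SVM}) - \text{Bayes error}$; when it is unknown, as in this benchmark, the denominator is simply $T(\mathrm{SVM})$. The helper below, our own restatement of that arithmetic, illustrates both; the published tables use unrounded errors, so recomputing from rounded table entries gives slightly different percentages.

```python
def improvement(t_svm, t_psi, bayes=None):
    # (T(SVM) - T(psi)) / T(SVM), or, with a known Bayes error,
    # (T(SVM) - T(psi)) / (T(SVM) - Bayes error)
    denom = t_svm if bayes is None else t_svm - bayes
    return (t_svm - t_psi) / denom

# Case 2 of Table 2 (rounded entries): about 13.7% here vs the reported 12.24%
rel = improvement(0.073, 0.063)
# Example 1 of Table 1 (rounded entries, Bayes error 0.2470): about 44.2%
rel_bayes = improvement(0.4305, 0.3494, bayes=0.2470)
```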
Insert Table 2 about here
Table 2 indicates that multicategory ψ-learning has a smaller testing error than its SVM counterpart, although the amount of improvement varies from sample to sample. In addition, on average, multicategory ψ-learning has a smaller number of support vectors than SVM. In conclusion, ψ-learning has better generalization and achieves further data reduction than SVM in this example.
5 Discussion
In this article, we propose a new methodology that generalizes ψ-learning from the binary case to the multicategory case. A statistical learning theory is developed for ψ-learning in terms of the Bayesian regret. In simulations, we show that the proposed methodology performs well and is more robust against outliers than its SVM counterpart. In addition, we discover some interesting phenomena that do not arise in the binary case.
Recently, there has been considerable interest in studying the variable selection problem using the $L_1$ norm to replace the conventional $L_2$ norm. In the binary case, Zhu et al. (2003) studied properties of the $L_1$ SVM and showed that the corresponding regularized solution path is piecewise linear. It is therefore natural to investigate variable selection via $L_1$ ψ-learning. Further developments are necessary in order to make multicategory ψ-learning more useful in practice, particularly methodologies for a data-driven choice of $C$, variable selection, the regularized solution path, as well as nonstandard situations including unequal loss assignments.
Appendix
Proof of Theorem 3.1: By the definition of $\mathrm{Err}(f)$, it is easy to obtain via conditioning that $e(f, \bar{f}) = \frac{1}{2} E[\sum_{l=1}^k P_l(X)(\mathrm{sign}(\bar{g}(\bar{f}(X), l)) - \mathrm{sign}(g(f(X), l)))]$. Then it suffices to consider the situation where $\mathrm{sign}(\bar{g}(\bar{f}(X), l)) - \mathrm{sign}(g(f(X), l))$ is nonzero, that is, when the two classifiers disagree. Equivalently, for any given $X = x$, we can write $e(f, \bar{f})$ using all possible different classifications produced by $\bar{f}$ and $f$ jointly, where $\mathrm{sign}(\bar{g}(\bar{f}(x), l)) = 1$ and $\mathrm{sign}(g(f(x), j)) = 1$ imply that $\bar{f}$ classifies $x$ into class $l$ while $f$ classifies $x$ into class $j$, for $1 \le l \ne j \le k$. Thus, we have
$$e(f, \bar{f}) = E\Big[\sum_{l=1}^k \sum_{j \ne l} (P_l(X) - P_j(X))\, I(\mathrm{sign}(\bar{g}(\bar{f}(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big]$$
$$= E\Big[\sum_{l=1}^k \sum_{j \ne l} |P_l(X) - P_j(X)|\, I(\mathrm{sign}(\bar{g}(\bar{f}(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big],$$
where the second equality follows from the fact that $\bar{f}$ is the optimal (Bayes) decision function vector, so that $P_l(X) \ge P_j(X)$ when $\mathrm{sign}(\bar{g}(\bar{f}(X), l)) = 1$. The desired result then follows.
Proof of Theorem 3.2: Write $E[1 - \mathrm{sign}(g(f(X), Y)) \mid X = x]$ as $\sum_{j=1}^k (1 - \mathrm{sign}(g(f(x), j))) P_j(x) = 1 - \sum_{j=1}^k \mathrm{sign}(g(f(x), j)) P_j(x)$. Note that for any given $x$, one and only one of $\mathrm{sign}(g(f(x), j))$ can be 1, with the rest equal to $-1$. Consequently, $E[1 - \mathrm{sign}(g(f(X), Y))]$ is minimized when $\mathrm{sign}(g(f(x), \mathrm{argmax}_j \bar{f}_j(x))) = 1$, i.e., $f = \bar{f}$. Evidently, the minimizer is not unique, as $c\bar{f}$ for $c \ge 1$ is also a minimizer. Then the desired result follows from the fact that $\psi(u) \ge (1 - \mathrm{sign}(u))$ and $\psi(\bar{g}) = 1 - \mathrm{sign}(\bar{g})$.
Proof of Theorem 3.3: Before proceeding, we introduce some notation to be used below. Let $\tilde{l}_\psi(f, Z_i) = l_\psi(f, Z_i) + \lambda J(f)$ be the cost function to be minimized, as in (4) or (5), where $l_\psi(f, Z_i) = \psi(g(f(X_i), Y_i))$ and $\lambda = 1/(Cn)$. Let $\tilde{l}(f, Z_i) = l(f, Z_i) + \lambda J(f)$, where $l(f, Z_i) = 1 - \mathrm{sign}(g(f(X_i), Y_i))$. Define the scaled empirical process $E_n(\tilde{l}(f, Z) - \tilde{l}_\psi(f_0, Z))$ as
$$n^{-1} \sum_{i=1}^n \big(\tilde{l}(f, Z_i) - \tilde{l}_\psi(f_0, Z_i) - E[\tilde{l}(f, Z_i) - \tilde{l}_\psi(f_0, Z_i)]\big) = E_n[l(f, Z) - l_\psi(f_0, Z)],$$
where $Z = (X, Y)$. Let $A_{i,j} = \{f \in \mathcal{F}: 2^{i-1} \delta_n^2 \le e(f, \bar{f}) < 2^i \delta_n^2,\ 2^{j-1} J_0 \le J(f) < 2^j J_0\}$ and $A_{i,0} = \{f \in \mathcal{F}: 2^{i-1} \delta_n^2 \le e(f, \bar{f}) < 2^i \delta_n^2,\ J(f) < J_0\}$, for $j = 1, 2, \cdots$, and $i = 1, 2, \cdots$. Without loss of generality, we assume $J(f_0) \ge 1$ and $\max(\epsilon_n^2, 2 s_n) < 1$ in the sequel.
The proof uses the treatment of Shen et al. (2003) and Shen (1998), together with the results in Theorem 3.1 and Assumption B. In what follows, we shall omit any detail that can be referred to the proof of Theorem 1 of Shen et al. (2003).
Using the connection between $e(\hat{f}, \bar{f})$ and the cost function as in Shen et al. (2003), we have
$$P(e(\hat{f}, \bar{f}) \ge \delta_n^2) \le P^*\Big(\sup_{\{f \in \mathcal{F}:\, e(f, \bar{f}) \ge \delta_n^2\}} n^{-1} \sum_{i=1}^n (\tilde{l}_\psi(f_0, Z_i) - \tilde{l}(f, Z_i)) \ge 0\Big) = I,$$
where $P^*$ denotes the outer probability measure.
To bound $I$, it suffices to bound $P(A_{ij})$ for each $i, j = 1, \cdots$. To this end, we need some inequalities regarding the first and second moments of $\tilde{l}(f, Z) - \tilde{l}_\psi(f_0, Z)$ for $f \in A_{ij}$.
For the first moment, note that $E[l(f, Z) - l_\psi(f_0, Z)] = E[l(f, Z) - l_\psi(\bar{f}, Z)] - E[l_\psi(f_0, Z) - l_\psi(\bar{f}, Z)]$, which is equal to $2(e(f, \bar{f}) - e_\psi(f_0, \bar{f}))$ since $E l_\psi(\bar{f}, Z) = E l(\bar{f}, Z)$ by Theorem 3.2.
By Assumption A and the definition of $\delta_n^2$, $2 e_\psi(f_0, \bar{f}) \le 2 s_n \le \delta_n^2$. Then, using the assumption that $\lambda J_0 \le \delta_n^2/2$, we have, for any integers $i, j \ge 1$,
$$\inf_{A_{i,j}} E(\tilde{l}(f, Z) - \tilde{l}_\psi(f_0, Z)) \ge M(i, j) = (2^{i-1} \delta_n^2) + \lambda (2^{j-1} - 1) J(f_0), \quad (12)$$
and
$$\inf_{A_{i,0}} E(\tilde{l}(f, Z) - \tilde{l}_\psi(f_0, Z)) \ge (2^{i-1} - 1/2) \delta_n^2 \ge M(i, 0) = 2^{i-2} \delta_n^2, \quad (13)$$
where the fact that $2^i - 1 \ge 2^{i-1}$ has been used.
For the second moment, it follows from Theorem 3.1 and Assumption B that, for any $f \in \mathcal{F}$,
$$e(f, \bar{f}) = E\Big[\sum_{l=1}^k \sum_{j \ne l} |P_l(X) - P_j(X)|\, I(\mathrm{sign}(\bar{g}(\bar{f}(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big]$$
$$\ge 2\delta\, E\Big[\sum_{l=1}^k \sum_{j \ne l} I(\mathrm{sign}(\bar{g}(\bar{f}(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\, I(|P_l(X) - P_j(X)| \ge 2\delta)\Big]$$
$$\ge \delta \Big( E\Big[2 \sum_{l=1}^k \sum_{j \ne l} I(\mathrm{sign}(\bar{g}(\bar{f}(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big] - 2 c_1 \delta^\alpha \Big)$$
$$= \frac{1}{2} (4 c_1)^{-1/\alpha} E\Big[2 \sum_{l=1}^k \sum_{j \ne l} I(\mathrm{sign}(\bar{g}(\bar{f}(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big]^{(\alpha+1)/\alpha} \quad (14)$$
with a choice of $\delta = \big( E[2 \sum_{l=1}^k \sum_{j \ne l} I(\mathrm{sign}(\bar{g}(\bar{f}(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)] / (4 c_1) \big)^{1/\alpha}$.
Now we establish a connection between the first and second moments. By Theorem 3.2, $E[\psi(\bar{g}(\bar{f}(X), Y)) - (1 - \mathrm{sign}(\bar{g}(\bar{f}(X), Y)))] = 0$. Since $\psi(u) \ge 1 - \mathrm{sign}(u)$ for any $u \in R^{k-1}$, $E|\psi(g_0(f_0(X), Y)) - (1 - \mathrm{sign}(g_0(f_0(X), Y)))| = E[\psi(g_0(f_0(X), Y)) - (1 - \mathrm{sign}(g_0(f_0(X), Y)))] \le 2 e_\psi(f_0, \bar{f})$. By the triangle inequality,
$$E[l(f, Z) - l_\psi(f_0, Z)]^2 \le 2 E|1 - \mathrm{sign}(g(f(X), Y)) - \psi(g_0(f_0(X), Y))|$$
$$\le 2 \big( 2 e_\psi(f_0, \bar{f}) + E|\mathrm{sign}(\bar{g}(\bar{f}(X), Y)) - \mathrm{sign}(g(f(X), Y))| + E|\mathrm{sign}(\bar{g}(\bar{f}(X), Y)) - \mathrm{sign}(g_0(f_0(X), Y))| \big). \quad (15)$$
Note that for any $f \in \mathcal{F}$,
$$E|\mathrm{sign}(\bar{g}(\bar{f}(X), Y)) - \mathrm{sign}(g(f(X), Y))| = E\Big[\sum_{l=1}^k I(Y = l)\, |\mathrm{sign}(\bar{g}(\bar{f}(X), l)) - \mathrm{sign}(g(f(X), l))|\Big]$$
$$= E\Big[2 \sum_{l=1}^k I(Y = l) \sum_{j \ne l} I(\mathrm{sign}(\bar{g}(\bar{f}(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big]$$
$$\le E\Big[2 \sum_{l=1}^k \sum_{j \ne l} I(\mathrm{sign}(\bar{g}(\bar{f}(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big].$$
This, together with (14), implies that
$$E|\mathrm{sign}(\bar{g}(\bar{f}(X), Y)) - \mathrm{sign}(g(f(X), Y))| \le c^* e(f, \bar{f})^{\alpha/(\alpha+1)}, \quad (16)$$
where $c^* = 2^{\alpha/(\alpha+1)} (4 c_1)^{1/(\alpha+1)}$.
For any $f \in A_{i,j}$, $e(f, \bar{f})^{\alpha/(\alpha+1)} \ge (2^{-1} \delta_n^2)^{\alpha/(\alpha+1)} \ge 2^{-1} \delta_n^2 \ge s_n \ge e_\psi(f_0, \bar{f})$ and $e(f, \bar{f}) \ge e(f_0, \bar{f})$, which together with (15) and (16) imply that
$$E[l(f, Z) - l_\psi(f_0, Z)]^2 \le 2 \big( 2 e_\psi(f_0, \bar{f}) + c^* (e(f, \bar{f})^{\alpha/(\alpha+1)} + e(f_0, \bar{f})^{\alpha/(\alpha+1)}) \big) \le c_3' (e(f, \bar{f})/2)^{\alpha/(\alpha+1)},$$
with $c_3' = 16 c_1^{1/(\alpha+1)} + 8$. Consequently, for $i = 1, \cdots$ and $j = 0, 1, \cdots$,
$$\sup_{A_{i,j}} E[l_\psi(f_0, Z) - l(f, Z)]^2 \le c_3' (2^{i-1} \delta_n^2)^{\alpha/(\alpha+1)} \le c_3 M(i, j)^{\alpha/(\alpha+1)} = v(i, j)^2,$$
where $c_3 = 2 c_3'$.
We are now ready to bound $I$. Using the assumption that $\lambda J_0 \le \delta_n^2/2$, together with (12) and (13), we have $I \le \sum_{i \ge 1, j \ge 0} P^*\big( \sup_{A_{i,j}} E_n(l_\psi(f_0, Z) - l(f, Z)) \ge M(i, j) \big)$. By definition, $l_\psi(f_0, Z)$ and $l(f, Z)$ lie between 0 and 2. Then $E[l_\psi(f_0, Z) - l(f, Z)]^2 \le 4$ and $E_n(l_\psi(f_0, Z) - l(f, Z)) \le 4$. For convenience, we scale the empirical process by a constant $t = (4 c_3^{1/2})^{-1}$ in what follows. Then
$$I \le \sum_{i,j} P^*\Big( \sup_{A_{i,j}} E_n(t[l_\psi(f_0, Z) - l(f, Z)]) \ge M_c(i, j) \Big) + \sum_i P^*\Big( \sup_{A_{i,0}} E_n(t[l_\psi(f_0, Z) - l(f, Z)]) \ge M_c(i, 0) \Big) = I_1 + I_2 \quad (17)$$
and $\sup_{A_{i,j}} E[l_\psi(f_0, Z) - l(f, Z)]^2 \le v_c(i, j)^2$, where $v_c(i, j) = \min(t^{1/2} v(i, j), 1)$ and $M_c(i, j) = \min(t M(i, j), c_3^{-1/2})$. Note that $v_c(i, j) < 1$ implies $M_c(i, j) = t M(i, j)$.
Next we bound the $I_i$ separately. For $I_1$, we verify the required conditions (4.5)–(4.7) in Theorem 3 of Shen and Wong (1994). To compute the metric entropy in (4.7) there, we need to construct a bracketing function of $l_\psi(f_0, Z) - l(f, Z)$. Denote an $\epsilon$-bracketing set for $\{(G^f_1, \ldots, G^f_k); f \in A_{ij}\}$ by $\{(G_{v_{p1}}, \cdots, G_{v_{pm}}), (G_{u_{p1}}, \cdots, G_{u_{pm}})\}$, $p = 1, \ldots, k$. Let $s_{v_{ph}}(\mathbf{x})$ be $-1$ if $\mathbf{x} \in G_{u_{ph}}$ and 1 otherwise, and let $s_{u_{ph}}(\mathbf{x})$ be $-1$ if $\mathbf{x} \in G_{v_{ph}}$ and 1 otherwise; $p = 1, \cdots, k$, $h = 1, \cdots, m$. Then $\{(s_{v_{p1}}, \cdots, s_{v_{pm}}), (s_{u_{p1}}, \cdots, s_{u_{pm}})\}$ forms an $\epsilon$-bracketing function of $-\mathrm{sign}(g(f(\mathbf{x}), p))$ for $f \in A_{ij}$ and $p = 1, \cdots, k$. This implies that for any $\epsilon \ge 0$ and $f \in A_{ij}$, there exists an $h$ ($1 \le h \le m$) such that
$l_{v_h}(z) \le l(f, z) - l_\psi(f_0, z) \le l_{u_h}(z)$ for any $z = (\mathbf{x}, y)$, where $l_{u_h}(z) = 1 + \sum_{p=1}^k s_{u_{ph}}(\mathbf{x}) I(y = p) - l_\psi(f_0, z)$, $l_{v_h}(z) = 1 + \sum_{p=1}^k s_{v_{ph}}(\mathbf{x}) I(y = p) - l_\psi(f_0, z)$, and $(E[l_{u_h} - l_{v_h}]^2)^{1/2} = (\sum_{p=1}^k E[(s_{u_{ph}}(\mathbf{x}) - s_{v_{ph}}(\mathbf{x})) I(y = p)]^2)^{1/2} \le 2 (\max_p P(G_{u_{ph}} \Delta G_{v_{ph}}))^{1/2} \le 2 \epsilon^{1/2}$. So $(E[l_{u_h} - l_{v_h}]^2)^{1/2} \le \min(2 \epsilon^{1/2}, 2)$. Hence,
$H_B(\epsilon, \mathcal{F}^*(2^j)) \le H(\epsilon^2/4, \mathcal{G}(2^j))$ for any $\epsilon > 0$ and $j = 0, \cdots$, where $\mathcal{F}^*(2^j) = \{l(f, z) - l_\psi(f_0, z): f \in \mathcal{F}, J(f) \le 2^j\}$.
Using the fact that $\int_{a M_c(i,j)}^{v_c(i,j)} H_B^{1/2}(u^2/4, \mathcal{G}(2^j))\, du / M_c(i, j)$ is non-increasing in $i$ and $M_c(i, j)$, $i = 1, \cdots$, we have
$$\int_{a M_c(i,j)}^{v_c(i,j)} H_B^{1/2}(u^2/4, \mathcal{G}(2^j))\, du \Big/ M_c(i, j) \le \int_{a M_c(1,j)}^{c_3^{1/2} M_c(1,j)^{\alpha/(2(\alpha+1))}} H_B^{1/2}(u^2/4, \mathcal{G}(2^j))\, du \Big/ M_c(1, j) \le \phi(\epsilon_n, 2^j),$$
where $a = \varepsilon/32$ with $\varepsilon$ defined below. Thus (4.7) of Shen and Wong (1994) holds with $M = n^{1/2} M_c(i, j)$ and $v = v_c(i, j)^2$, and so does (4.5). In addition, with $T = 1$, $M_c(i, j)/v_c(i, j)^2 \le \max(c_3^{-1/2}, c_3^{-(2\alpha+3)/(2\alpha+2)}) = c_3^{-1/2} \le \varepsilon/(4T)$ implies (4.6) with $\varepsilon = 4 c_3^{-1/2} < 1$.
Note that $0 < \delta_n \le 1$ and $\lambda J_0 \le \delta_n^2/2$. Using a similar argument as in Shen et al. (2003), an application of Theorem 3 of Shen and Wong (1994) yields that
$$I_1 \le 3 \exp\big(-c_5 n (\lambda J(f_0))^{\frac{\alpha+2}{\alpha+1}}\big) \Big/ \Big[1 - \exp\big(-c_5 n (\lambda J(f_0))^{\frac{\alpha+2}{\alpha+1}}\big)\Big]^2.$$
Here and in the sequel $c_5$ is a positive generic constant. Similarly, $I_2$ can be bounded. Finally, $I \le 6 \exp(-c_5 n (\lambda J(f_0))^{(\alpha+2)/(\alpha+1)}) / [1 - \exp(-c_5 n (\lambda J(f_0))^{(\alpha+2)/(\alpha+1)})]^2$. This implies that $I^{1/2} \le (5/2 + I^{1/2}) \exp(-c_5 n (\lambda J(f_0))^{(\alpha+2)/(\alpha+1)})$. The result then follows from the fact that $I \le I^{1/2} \le 1$.
Proof of Corollary 3.1: The result follows from the assumptions and the exponential inequality in Theorem 3.3.
Proof of Theorem 3.4: The proof is similar to that of Theorem 3.3. For simplicity, we only sketch the parts that require modifications. Consider the scaled empirical process $E_n(\tilde{l}_\psi(f, Z) - \tilde{l}_\psi(f_0, Z))$ and let $A_{i,j} = \{f \in \mathcal{F}: 2^{i-1} \delta_n^{*2} \le e_\psi(f, \bar{f}) < 2^i \delta_n^{*2},\ 2^{j-1} J_0 \le J(f) < 2^j J_0\}$ and $A_{i,0} = \{f \in \mathcal{F}: 2^{i-1} \delta_n^{*2} \le e_\psi(f, \bar{f}) < 2^i \delta_n^{*2},\ J(f) < J_0\}$, for $j = 1, 2, \cdots$, and $i = 1, 2, \cdots$. Using an analogous argument, we have
$$P(e_\psi(\hat{f}, \bar{f}) \ge \delta_n^{*2}) \le P^*\Big( \sup_{\{f \in \mathcal{F}:\, e_\psi(f, \bar{f}) \ge \delta_n^{*2}\}} n^{-1} \sum_{i=1}^n (\tilde{l}_\psi(f_0, Z_i) - \tilde{l}_\psi(f, Z_i)) \ge 0 \Big) = I.$$
To bound $I$, we consider the first and second moments of $\tilde{l}_\psi(f, Z) - \tilde{l}_\psi(f_0, Z)$ for $f \in A_{ij}$. For the first moment, it is straightforward to show that, for any integers $i, j \ge 1$, $\inf_{A_{i,j}} E(\tilde{l}_\psi(f, Z) - \tilde{l}_\psi(f_0, Z)) \ge M(i, j) = (2^{i-1} \delta_n^{*2}) + \lambda (2^{j-1} - 1) J(f_0)$, and $\inf_{A_{i,0}} E(\tilde{l}_\psi(f, Z) - \tilde{l}_\psi(f_0, Z)) \ge M(i, 0) = 2^{i-2} \delta_n^{*2}$.
For the second moment, $e_\psi(f, \bar{f}) = e(f, \bar{f}) + \frac{1}{2} E[\psi(g(f(X), Y))\, I(g(f(X), Y) \in (0, \tau))]$ and $e_\psi(f, \bar{f}) \le 1$. Thus
$$\frac{1}{2} E[\psi(g(f(X), Y))\, I(g(f(X), Y) \in (0, \tau))] \le e_\psi(f, \bar{f}) \le (e_\psi(f, \bar{f}))^{\frac{\alpha}{\alpha+1}}. \quad (18)$$
For any $f \in A_{i,j}$, $e_\psi(f, \bar{f}) \ge 2^{-1} \delta_n^2 \ge s_n \ge e_\psi(f_0, \bar{f})$, which together with (16) and (18) implies that
$$E[l_\psi(f, Z) - l_\psi(f_0, Z)]^2 \le 2 E|\mathrm{sign}(g(f(X), Y)) - \mathrm{sign}(g_0(f_0(X), Y))|$$
$$+ 2 E[\psi(g_0(f_0(X), Y))\, I(g_0(f_0(X), Y) \in (0, \tau))] + 2 E[\psi(g(f(X), Y))\, I(g(f(X), Y) \in (0, \tau))]$$
$$\le 2 \big( c^* [e_\psi(f, \bar{f})^{\alpha/(\alpha+1)} + e_\psi(f_0, \bar{f})^{\alpha/(\alpha+1)}] \big) + 4 [e_\psi(f, \bar{f})^{\alpha/(\alpha+1)} + e_\psi(f_0, \bar{f})^{\alpha/(\alpha+1)}]$$
$$\le c_3' (e_\psi(f, \bar{f})/2)^{\alpha/(\alpha+1)},$$
with $c_3' = 16 c_1^{1/(\alpha+1)} + 8$. Therefore, $\sup_{A_{i,j}} E(l_\psi(f_0, Z) - l_\psi(f, Z))^2 \le c_3 M(i, j)^{\alpha/(\alpha+1)} = v(i, j)^2$ for $i = 1, \cdots$ and $j = 0, 1, \cdots$, where $c_3 = 2 c_3'$.
To bound $I$, note $I \le I_1 + I_2$, where $I_1 = \sum_{i,j} P^*\big( \sup_{A_{i,j}} E_n(l_\psi(f_0, Z) - l_\psi(f, Z)) \ge M(i, j) \big)$ and $I_2 = \sum_i P^*\big( \sup_{A_{i,0}} E_n(l_\psi(f_0, Z) - l_\psi(f, Z)) \ge M(i, 0) \big)$. Thus we can bound the $I_i$ separately.
Using the fact that $\int_{a M(i,j)}^{v(i,j)} H_B^{1/2}(u, \mathcal{F}_\psi(2^j))\, du / M(i, j)$ is non-increasing in $i$ and $M(i, j)$, $i = 1, \cdots$, we have $\int_{a M(i,j)}^{v(i,j)} H_B^{1/2}(u, \mathcal{F}_\psi(2^j))\, du / M(i, j) \le \phi^*(\epsilon_n^*, 2^j)$. The result then follows from the same argument as that in the proof of Theorem 3.3.
Proof of Corollary 3.2: The result follows from the assumptions and the exponential inequality in Theorem 3.4.
Lemma 1: (Metric entropy in Example 3.3.1) Under the assumptions in Example 3.3.1, we have
$$H_B(\epsilon, \mathcal{G}(\ell)) \le O(k^2 \log(k/\epsilon)).$$
Proof: Let $(G_1, \ldots, G_k)$ be a classification partition induced by $f$ and let $G_{j_1 j_2}$ be $\{\mathbf{x}: f_{j_1} - f_{j_2} > 0,\ \mathbf{x} \in S\}$; $j_1 \ne j_2 \in \{1, \cdots, k\}$. For discussion, we first construct a bracket for $G_{j_1 j_2}$. To this end, we determine the $d$ points at which the plane $f_{j_1} - f_{j_2} = 0$ intersects $d$ out of the $d 2^{d-1}$ edges of the cube $[0, 1]^d$. For each of these $d$ points, we use a bracket of length $\epsilon^*$ to cover it on the edge where the point lies. Given an edge, the covering number for this point is no greater than $1/\epsilon^*$. Hence the covering number for the $d$ points on $d$ of the $d 2^{d-1}$ edges is at most $\binom{d 2^{d-1}}{d} (1/\epsilon^*)^d$.
After the $d$ intersecting points of $f_{j_1} - f_{j_2} = 0$ on the edges of $S$ are covered, we then connect the endpoints of the $d$ brackets to form bracket planes $v_{j_1 j_2} = 0$ and $u_{j_1 j_2} = 0$ such that $\{\mathbf{x}: v_{j_1 j_2} > 0\} \subset \{\mathbf{x}: f_{j_1} - f_{j_2} > 0\} \subset \{\mathbf{x}: u_{j_1 j_2} > 0\}$. Since the longest segment in $S$ has length $\sqrt{d}$, corresponding to the diagonal segment between $(0, \ldots, 0)$ and $(1, \ldots, 1)$, we have $P(\mathbf{x}: v_{j_1 j_2} < 0 < u_{j_1 j_2}) \le (\sqrt{d})^{d-1} \epsilon^*$ since $\mathbf{x}$ is uniformly distributed on $S$. Consequently, $G_{v_{j_1 j_2}} \subset G_{j_1 j_2} \subset G_{u_{j_1 j_2}}$ and $P(G_{v_{j_1 j_2}} \Delta G_{u_{j_1 j_2}}) \le (\sqrt{d})^{d-1} \epsilon^*$, where $G_{v_{j_1 j_2}} = \{\mathbf{x}: v_{j_1 j_2} > 0\}$ and $G_{u_{j_1 j_2}} = \{\mathbf{x}: u_{j_1 j_2} > 0\}$. Since $G_{j_1} = \cap_{j_2} G_{j_1 j_2}$, we have $G_{v_{j_1}} \subset G_{j_1} \subset G_{u_{j_1}}$ and $P(G_{v_{j_1}} \Delta G_{u_{j_1}}) \le P(\cup_{j_2} G_{v_{j_1 j_2}} \Delta G_{u_{j_1 j_2}}) \le (k-1)(\sqrt{d})^{d-1} \epsilon^*$, where $G_{v_{j_1}} = \cap_{j_2} G_{v_{j_1 j_2}}$ and $G_{u_{j_1}} = \cap_{j_2} G_{u_{j_1 j_2}}$; $j_1 \ne j_2 \in \{1, \cdots, k\}$.
With $\epsilon = (k-1)(\sqrt{d})^{d-1} \epsilon^*$, $\{(G_{v_1}, G_{u_1}), \cdots, (G_{v_k}, G_{u_k})\}$ satisfies $\max_{j_1} P(G_{v_{j_1}} \Delta G_{u_{j_1}}) \le \epsilon$ and thus forms an $\epsilon$-bracketing set for $(G_1, \ldots, G_k)$. Therefore the $\epsilon$-covering number for all partitions induced by $f$ is at most
$$\binom{d 2^{d-1}}{d} \Big( \frac{(k-1)(\sqrt{d})^{d-1}}{\epsilon} \Big)^d \cdot k(k-1).$$
Since $d$ is a constant, the bracketing metric entropy $H_B(\epsilon, \mathcal{G}(\ell))$ is bounded by $O(k^2 \log(k/\epsilon))$ for any $\ell$, yielding the desired result.
Lemma 2: (Metric entropy in Example 3.3.2) Under the assumptions in Example 3.3.2, we have
$$H_B(\epsilon, \mathcal{F}_\psi(\ell)) \le O(k (\log(\ell/\epsilon))^{d+1}).$$
Proof: In order to obtain an upper bound for $H_B(\epsilon, \mathcal{F}_\psi(\ell))$, we use the sup-norm entropy bound for a single function set in Zhou (2002), that is, $H_\infty(\epsilon, \mathcal{F}(\ell)) \le O((\log(\ell/\epsilon))^{d+1})$ under the $L_\infty$ metric $\|g\|_\infty = \sup_{\mathbf{x} \in S} |g(\mathbf{x})|$. Consider an arbitrary function vector $f = (f_1, \ldots, f_k) \in \mathcal{F}(\ell)$. Then the metric entropy for all $k$-dimensional function vectors in $\mathcal{F}(\ell)$ is bounded by $O(k (\log(\ell/\epsilon))^{d+1})$, in order to cover $k$ functions simultaneously. Let $[f_{v_j}, f_{u_j}]$ be an $\epsilon$-bracket for $f_j$. Then $[f_{v_j} - f_{u_l}, f_{u_j} - f_{v_l}]$ forms a $2\epsilon$-bracket for $f_j - f_l$. Denote $g_{v_j} = \min_{l \in \{1, \cdots, k\} \setminus j} (f_{v_j} - f_{u_l})$ and $g_{u_j} = \min_{l \in \{1, \cdots, k\} \setminus j} (f_{u_j} - f_{v_l})$. Then $[g_{v_j}, g_{u_j}]$ becomes a $2\epsilon$-bracket for $g_{\min}(f, j) = \min_{l \ne j} (f_j - f_l)$.
Consequently, $\psi(g_{v_j}) \ge \psi(g_{\min}(f, j)) \ge \psi(g_{u_j})$ by the non-increasing property of the ψ function. By (10), we have $|\psi(g_{v_j}) - \psi(g_{u_j})| \le 2 D \epsilon$. Since $g_{\min}(f, y) = \sum_{j=1}^k I(y = j) g_{\min}(f, j)$, we have $g_{\min}(f, y) \in [\sum_{j=1}^k I(y = j) g_{v_j}, \sum_{j=1}^k I(y = j) g_{u_j}]$ and $|\psi(\sum_{j=1}^k I(y = j) g_{v_j}) - \psi(\sum_{j=1}^k I(y = j) g_{u_j})| \le 2 D \epsilon$. Consequently, $[\psi(\sum_{j=1}^k I(y = j) g_{u_j}(\mathbf{x})) - \psi(g_0(f_0(\mathbf{x}), y)),\ \psi(\sum_{j=1}^k I(y = j) g_{v_j}(\mathbf{x})) - \psi(g_0(f_0(\mathbf{x}), y))]$ forms a bracket of length $2 D \epsilon$ for $\psi(g(f(\mathbf{x}), y)) - \psi(g_0(f_0(\mathbf{x}), y))$. The desired result then follows.
References
An, H. L. T., and Tao, P. D. (1997). Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. J. Global Optim., 11, 253–285.
Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2003). Convexity, classification, and risk bounds. Technical Report 638, Department of Statistics, U.C. Berkeley.
Boser, B., Guyon, I., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. The Fifth Annual Conference on Computational Learning Theory, Pittsburgh ACM, 142–152.
Cortes, C., and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–279.
Crammer, K., and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265–292.
Guermeur, Y. (2002). Combining discriminant models with new multiclass SVMs. Pattern Analysis and Applications (PAA), 5, 168–179.
Lee, Y., Lin, Y., and Wahba, G. (2004). Multicategory Support Vector Machines, theory, and application to the classification of microarray data and satellite radiance data. J. Amer. Statist. Assoc., 99, 465: 67–81.
Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., and Klein, B. (2000). Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Annals of Statistics, 28, 1570–1600.
Lin, Y. (2000). Some asymptotic properties of the support vector machine. Technical Report 1029, Department of Statistics, University of Wisconsin-Madison.
Lin, Y. (2002). Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6, 259–275.
Liu, Y., Shen, X., and Doss, H. (2005). Multicategory ψ-learning and support vector machine: computational tools. J. Comput. Graph. Statist., 14, 1, 219–236.
Mammen, E., and Tsybakov, A. (1999). Smooth discrimination analysis. Ann. Statist., 27, 1808–1829.
Marron, J. S., and Todd, M. J. (2002). Distance Weighted Discrimination. Technical Report No. 1339, School of Operations Research and Industrial Engineering, Cornell University.
Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London A, 209, 415–446.
Shen, X. (1998). On the method of penalization. Statistica Sinica, 8, 337–357.
Shen, X., Tseng, G. C., Zhang, X., and Wong, W. H. (2003). On ψ-learning. J. Amer. Statist. Assoc., 98, 724–734.
Shen, X., and Wong, W. H. (1994). Convergence rate of sieve estimates. Ann. Statist., 22, 580–615.
Steinwart, I. (2001). On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2, 67–93.
Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32, 135–166.
Wahba, G. (1998). Support vector machines, reproducing kernel Hilbert spaces, and the randomized GACV. In: B. Schölkopf, C. J. C. Burges, and A. J. Smola (eds), Advances in Kernel Methods: Support Vector Learning, MIT Press, 125–143.
Weston, J., and Watkins, C. (1999). Support vector machines for multi-class pattern recognition. Proceedings of the Seventh European Symposium on Artificial Neural Networks.
Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist., 32, 56–85.
Zhang, T. (2004b). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 1225–1251.
Zhou, D. X. (2002). The covering number in learning theory. Journal of Complexity, 18, 739–767.
Zhu, J., and Hastie, T. (2005). Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14, 1, 185–205.
Zhu, J., Hastie, T., Rosset, S., and Tibshirani, R. (2003). 1-norm support vector machines. Neural Information Processing Systems, 16.
Table 1: Testing and training errors, and their $\hat{e}(\cdot, \bar{f})$, of SVM and ψ-learning using the best $C$ in Examples 1 and 2 with $n = 150$, averaged over 100 simulation replications, with their standard errors in parentheses. In Example 1 (d.f. = 1), the Bayes error is 0.2470, with an improvement of ψ-learning over SVM of 43.22%. In Example 2 (d.f. = 3), the Bayes error is 0.1456, with an improvement of ψ-learning over SVM of 20.41%. Here, the improvement of ψ-learning over SVM is defined by $(T(\mathrm{SVM}) - T(\psi))/\hat{e}(\mathrm{SVM}, \bar{f})$, where $\hat{e}(\cdot, \bar{f}) = T(\cdot) - \text{Bayes error}$, and $T(\cdot)$ denotes the testing error of a given method.

Example  Method  Training(s.e.)   Testing(s.e.)    $\hat{e}(\cdot,\bar{f})$(s.e.)  No. SV(s.e.)
d.f.=1   SVM     0.4002(0.1469)   0.4305(0.1405)   0.1835(0.1405)                  141.76(10.97)
         ψ-L     0.3199(0.1237)   0.3494(0.1209)   0.1024(0.1209)                  64.64(15.43)
d.f.=3   SVM     0.1447(0.0267)   0.1505(0.0045)   0.0049(0.0045)                  71.81(11.02)
         ψ-L     0.1429(0.0285)   0.1495(0.0033)   0.0039(0.0033)                  41.29(13.51)
Table 2: Testing errors for problem letter. Each training dataset is of size 200 and selected from a total of 2341 samples.

Case  SVM   ψ-L   Improvement (%)
1     .083  .079  3.39
2     .073  .063  12.24
3     .086  .076  11.41
4     .072  .072  0
5     .088  .085  3.74
6     .077  .073  5.45
7     .075  .072  4.39
8     .079  .075  5.92
9     .093  .091  1.51
10    .090  .086  4.11
Average #SVs: SVM 51.1, ψ-L 40.8
Figure 1: Perspective plot of the 3-class ψ function $\psi(u_1, u_2)$ defined in (3), over $(u_1, u_2) \in [-2, 2]^2$.
Figure 2: Illustration of the concept of margins and support vectors in a 3-class separable example: The instances for classes 1–3 fall respectively into the polyhedrons $D_j$, $j = 1, 2, 3$, where $D_1$ is $\{\mathbf{x}: f_1(\mathbf{x}) - f_2(\mathbf{x}) \ge 1,\ f_1(\mathbf{x}) - f_3(\mathbf{x}) \ge 1\}$, $D_2$ is $\{\mathbf{x}: f_2(\mathbf{x}) - f_1(\mathbf{x}) \ge 1,\ f_2(\mathbf{x}) - f_3(\mathbf{x}) \ge 1\}$, and $D_3$ is $\{\mathbf{x}: f_3(\mathbf{x}) - f_1(\mathbf{x}) \ge 1,\ f_3(\mathbf{x}) - f_2(\mathbf{x}) \ge 1\}$. The generalized geometric margin $\gamma$, defined as $\min\{\gamma_{12}, \gamma_{13}, \gamma_{23}\}$, is maximized to obtain the decision boundary. There are five support vectors on the boundaries of the three polyhedrons. Among the five support vectors, one is from class 1, one is from class 2, and the other three are from class 3.