
Multicategory ψ-Learning*

Yufeng Liu and Xiaotong Shen

Summary

In binary classification, margin-based techniques usually deliver high performance. As a result, a multicategory problem is often treated as a sequence of binary classifications. In the absence of a dominating class, this treatment may be suboptimal and may yield poor performance, as it can for the Support Vector Machine (SVM). We propose a novel multicategory generalization of ψ-learning which treats all classes simultaneously. The new generalization eliminates this potential problem and, at the same time, retains the desirable properties of its binary counterpart. We develop a statistical learning theory for the proposed methodology and obtain fast convergence rates for both linear and nonlinear learning examples. The operational characteristics of the method are demonstrated via simulation. Our results indicate that the proposed methodology can deliver accurate class prediction and is more robust against extreme observations than its SVM counterpart.

Key Words and Phrases: Generalization error, nonconvex minimization, supervised learning, support vectors.

1 Introduction

Classification has become increasingly important as a means for facilitating information extraction. Among binary classification techniques, significant developments have been seen in margin-based methodologies, including the Support Vector Machine (SVM; Boser, Guyon, and Vapnik, 1992; Cortes and Vapnik, 1995), Penalized Logistic Regression (PLR; Lin et al., 2000), the Import Vector Machine (IVM; Zhu and Hastie, 2001), and Distance Weighted Discrimination (DWD; Marron and Todd, 2002).

* Yufeng Liu is Assistant Professor, Department of Statistics and Operations Research, Carolina Center for Genome Sciences, University of North Carolina, CB 3260, Chapel Hill, NC 27599 (Email: yfliu@email.unc.edu). He would like to thank Professor George Fisherman for his helpful comments. Xiaotong Shen is Professor, School of Statistics, University of Minnesota, 224 Church Street S.E., Minneapolis, MN 55455 (Email: xshen@stat.umn.edu). His research was supported in part by NSF grants IIS-0328802 and DMS-0072635. The authors would like to thank the editor, the associate editor, and two anonymous referees for their helpful comments and suggestions.

Among the many margin-based techniques, those that focus on estimating the decision boundary yield higher performance than those that focus on conditional probabilities, because the former is an easier problem than the latter. For instance, the binary SVM directly estimates the Bayes classifier $\mathrm{sign}(P(Y=+1\mid x)-1/2)$ rather than $P(Y=+1\mid x)$ itself, with input vector $x$ and class label $Y\in\{\pm 1\}$, as shown in Lin (2002). However, this aspect of the methodology makes its generalization to the multicategory case highly nontrivial. One popular approach, known as "one-versus-rest", solves $k$ binary problems via sequential training. As argued by Lee, Lin, and Wahba (2004), an approach of this sort performs poorly in the absence of a dominating class, since the conditional probability of each class is then no greater than $1/2$.

Shen, Tseng, Zhang, and Wong (2003) proposed another margin-based technique called ψ-learning, which replaces the convex SVM loss function by a nonconvex ψ-loss function. They show that more accurate class prediction can be achieved while the margin interpretation is retained. The present article generalizes binary ψ-learning to the multicategory case. Since ψ-learning, like SVM, does not directly yield $P(Y=+1\mid x)$, we need to take a new approach.

To treat all classes simultaneously, we generalize the concepts of margins and support vectors via multiple comparisons among different classes. Multicategory ψ-learning has the advantage of retaining the desired properties of its binary counterpart while not suffering from the aforementioned difficulty of one-versus-rest SVM with regard to the dominating class.

To provide insight into multicategory ψ-learning, we develop a statistical learning theory. Specifically, the theory quantifies the performance of multicategory ψ-learning with respect to the choice of tuning parameters, the size of the training sample, and the number of classes involved in classification. It also indicates that our multicategory ψ-learning directly estimates the true decision boundary regardless of the presence or absence of a dominating class.

Simulation experiments indicate that ψ-learning outperforms its SVM counterpart in generalization, as in the binary case. Moreover, multicategory ψ-learning is more robust than its SVM counterpart against extreme instances that are wrongly classified. Interestingly, in linear learning problems it exhibits behavior with respect to the tuning parameter that is similar to that of nonlinear learning problems, which differs from the binary case.

Section 2.1 motivates our approach. Section 2.2 describes our proposal for multicategory ψ-learning, and Section 2.3 briefly discusses computational issues. Section 3 studies the statistical properties of the proposed methodology and develops its statistical learning theory. Section 4 presents numerical examples, followed by conclusions and discussions in Section 5. The Appendix contains the lemmas and technical proofs.

2 Methodology

The primary goal of classification is to predict the class label $Y$ for a given input vector $x \in S$ via a classifier, where $S$ is an input space. For $k$-class classification, a classifier partitions $S$ into $k$ disjoint and exhaustive regions $S_1, \ldots, S_k$, with $S_j$ corresponding to class $j$. A good classifier is one that predicts the class label $Y$ for a given $x$ accurately, as measured by its accuracy of prediction.

Before proceeding, let $x \in S \subset \mathbb{R}^d$ be an input vector and $y$ be an output (label) variable. We code $y$ as $\{1, \ldots, k\}$, and define $f = (f_1, \ldots, f_k)$ as a decision function vector. Here $f_j$, mapping from $S$ to $\mathbb{R}$, represents class $j$; $j = 1, \ldots, k$. A classifier $\mathrm{argmax}_{j=1,\ldots,k} f_j(x)$, induced by $f$, is employed to assign a label to any input vector $x \in S$. In other words, $x \in S$ is assigned to a class with the highest value of $f_j(x)$, which indicates the strength of evidence that $x$ belongs to class $j$. A classifier is trained via a training sample $\{(x_i, y_i);\ i = 1, \ldots, n\}$, independently and identically distributed according to an unknown probability distribution $P(x, y)$. Throughout the paper, we use $X$ and $Y$ to denote random variables and $x$ and $y$ to represent the corresponding observations.

The generalization error (GE) quantifies the accuracy of generalization and is defined as $\mathrm{Err}(f) = P[Y \neq \mathrm{argmax}_j f_j(X)]$, the probability of misclassifying a new input vector $X$. To simplify the expression, we introduce $g(f(x), y) = (f_y(x) - f_1(x), \ldots, f_y(x) - f_{y-1}(x), f_y(x) - f_{y+1}(x), \ldots, f_y(x) - f_k(x))$, which performs multiple comparisons of class $y$ versus the rest of the classes. The vector $g(f(x), y)$ describes the unique feature of a multicategory problem and is directly related to the generalized margins to be introduced shortly. Furthermore, for $u = (u_1, \ldots, u_{k-1})$, we define the multivariate sign function: $\mathrm{sign}(u) = 1$ if $u_{\min} = \min(u_1, \ldots, u_{k-1}) > 0$, and $-1$ if $u_{\min} \leq 0$. With $\mathrm{sign}(\cdot)$ and $g(f(x), y)$ in place, $f$ indicates correct classification for any given instance $(x, y)$ if $g(f(x), y) > 0_{k-1}$, where $0_{k-1}$ is a $(k-1)$-dimensional vector of zeros. Consequently, the GE reduces to $\mathrm{Err}(f) = \frac{1}{2} E[1 - \mathrm{sign}(g(f(X), Y))]$, with the empirical generalization error (EGE) $(2n)^{-1} \sum_{i=1}^{n} (1 - \mathrm{sign}(g(f(x_i), y_i)))$.

For motivation, we first discuss our setting in the binary case and then generalize it to the multicategory case. In particular, we review binary ψ-learning with the usual coding $\{\pm 1\}$, and then derive it via the coding $\{1, 2\}$.

2.1 Motivation

With $y \in \{\pm 1\}$, a margin-based classifier estimates a single function $f$ and uses $\mathrm{sign}(f)$ as the classification rule. Within the regularization framework, it solves $\mathrm{argmin}_f\, J(f) + C \sum_{i=1}^{n} l(y_i f(x_i))$, where $J(f)$, a regularization term, controls the complexity of $f$; a loss function $l$ measures the data fit; and $C > 0$ is a tuning parameter balancing the two terms. For example, SVM uses the hinge loss $l(u) = [1 - u]_+$, where $[v]_+ = v$ if $v \geq 0$, and $0$ otherwise; PLR and IVM adopt the logistic loss $l(u) = \log(1 + e^{-u})$; and the ψ-loss can be any nonincreasing function satisfying $R \geq \psi(u) > 0$ if $u \in (0, \tau)$ and $\psi(u) = 1 - \mathrm{sign}(u)$ otherwise, where $\tau \in (0, 1]$ and $R > 0$. For simplicity, we discuss the linear case in which $f(x) = w^T x + b$, $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$, represents a $d$-dimensional hyperplane. In this case, $J(f) = \frac{1}{2}\|w\|^2$ is defined by the geometric margin $\frac{2}{\|w\|}$, the vertical Euclidean distance between the hyperplanes $f = \pm 1$. Here $y_i f(x_i)$ is the functional margin of instance $(x_i, y_i)$.

For linear binary ψ-learning with coding $\{1, 2\}$, we now derive a parallel formulation using the argmax rule, by noting that $x$ is classified as class 2 if $f_2(x) > f_1(x)$ and as class 1 otherwise, where $f_j(x) = w_j^T x + b_j$, $j = 1, 2$. Evidently, this classification rule depends only on $\mathrm{sign}((f_2 - f_1)(x))$. To eliminate redundancy in $(f_1, f_2)$, we invoke a sum-to-zero constraint $f_1 + f_2 = 0$.

This type of constraint was previously used by Guermeur (2002) and Lee et al. (2004) in two different SVM formulations. Under this constraint, $\|w_1\| = \|w_2\|$. Binary ψ-learning then solves:

\[
\min_{b_1, b_2, w_1, w_2} \Big( \frac{1}{2} \sum_{j=1}^{2} \|w_j\|^2 + C \sum_{i=1}^{n} \psi\big(g(f(x_i), y_i)\big) \Big) \quad \text{subject to} \quad \sum_{j=1}^{2} f_j(x) = 0 \ \ \forall x \in S, \tag{1}
\]

where $g(f(x_i), y_i) = f_{y_i}(x_i) - f_{3 - y_i}(x_i)$.

With coding $\{1, 2\}$, instances from classes 1 and 2 that lie respectively in the halfspaces $\{x : g(f(x), 2) \geq -1\}$ and $\{x : g(f(x), 2) \leq 1\}$ are defined as "support vectors". In the separable case, the support vectors are instances on the hyperplanes $g(f(x), 2) = \pm 1$. Furthermore, the functional margin of $(x_i, y_i)$ can be defined as $g(f(x_i), y_i)$, indicating the correctness and strength of classification of $x_i$ by $f$.

2.2 Multicategory ψ-Learning

As suggested in Shen et al. (2003), the role of a binary ψ-function is twofold. First, it eliminates the scaling problem of the sign function, which is scale invariant. Second, with a positive penalty defined by the positive value of $\psi(u)$ for $u \in (0, \tau)$, it pushes correctly classified instances away from the boundary. As a remark, we note that $1 - \mathrm{sign}$ as a loss is numerically undesirable, since the solution $f$ is then approximately 0 under regularization.

Using the coding $\{1, \ldots, k\}$, we define multivariate ψ-functions on $k - 1$ arguments as follows:

\[
R \geq \psi(u) > 0 \ \text{ if } \ u_{\min} \in (0, \tau); \qquad \psi(u) = 1 - \mathrm{sign}(u) \ \text{ otherwise}, \tag{2}
\]

where $0 < \tau \leq 1$ and $0 < R \leq 2$ are constants, and $\psi(u)$ is nonincreasing in $u_{\min}$. We note that this multivariate version preserves the desired properties of its univariate counterpart. In particular, the multivariate ψ assigns a positive penalty to any instance with $\min(g(f(x_i), y_i)) \in (0, \tau)$, in order to eliminate the scaling problem. To utilize our computational strategy based on a difference convex (d.c.) decomposition, we use a specific ψ in implementation:

\[
\psi(u) = \begin{cases} 0 & \text{if } u_{\min} \geq 1; \\ 2 & \text{if } u_{\min} < 0; \\ 2(1 - u_{\min}) & \text{if } 0 \leq u_{\min} < 1. \end{cases} \tag{3}
\]

A plot of this ψ function for $k = 3$ is displayed in Figure 1.

Insert Figure 1 about here
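Since the ψ of (3) depends on its vector argument only through $u_{\min}$, it can be transcribed directly (a minimal sketch, our own illustration):

```python
import numpy as np

def psi(u):
    """Piecewise-linear multivariate psi-loss of (3)."""
    u_min = float(np.min(u))        # the loss depends on u only through u_min
    if u_min >= 1.0:
        return 0.0                  # confidently correct: no penalty
    if u_min < 0.0:
        return 2.0                  # wrong (or ambiguous): full penalty
    return 2.0 * (1.0 - u_min)      # correct but too close to the boundary
```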

Linear multicategory ψ-learning solves $\min_{b, w} \big( \frac{1}{2} \sum_{j=1}^{k} \|w_j\|^2 + C \sum_{i=1}^{n} \psi(g(f(x_i), y_i)) \big)$ subject to $\sum_{j=1}^{k} f_j(x) = 0$ for all $x \in S$, where $w = \mathrm{vec}(w_1, \ldots, w_k)$ is a $kd$-dimensional vector with its $(d(i_2 - 1) + i_1)$-th element $w_{i_2}(i_1)$, and $b = (b_1, \ldots, b_k)^T \in \mathbb{R}^k$. By Theorem 2.1 of Liu et al. (2005), the minimization with the sum-to-zero constraint for all $x \in S$ is equivalent to that with the constraint for the $n$ training inputs $\{x_i;\ i = 1, \ldots, n\}$ only. That is, the infinite constraint $\sum_{j=1}^{k} f_j(x) = 0$ for all $x \in S$ can be reduced to $\sum_{j=1}^{k} b_j 1_n + X \sum_{j=1}^{k} w_j = 0$, where $X = (x_1, \ldots, x_n)^T$ is the design matrix and $1_n$ is an $n$-dimensional vector of ones. This yields linear multicategory ψ-learning:

\[
\min_{b, w} \Big( \frac{1}{2} \sum_{j=1}^{k} \|w_j\|^2 + C \sum_{i=1}^{n} \psi\big(g(f(x_i), y_i)\big) \Big) \quad \text{subject to} \quad \sum_{j=1}^{k} b_j 1_n + X \sum_{j=1}^{k} w_j = 0, \tag{4}
\]

where the value of $C > 0$ in (4) reflects the relative importance between the geometric margin and the EGE.


In the present context, we define the generalized functional margin of an instance $(x_i, y_i)$ as $\min(g(f(x_i), y_i))$, and the generalized geometric margin as $\gamma = \min_{1 \leq j_1 < j_2 \leq k} \gamma_{j_1 j_2}$, with $\gamma_{j_1 j_2} = \frac{2}{\|w_{j_1} - w_{j_2}\|}$ the vertical Euclidean distance between the hyperplanes $f_{j_1} - f_{j_2} = \pm 1$. Here $\gamma_{j_1 j_2}$ measures the separation between classes $j_1$ and $j_2$; see Figure 2 for an illustration of the role of $\gamma$. When $k = 2$, (4) reduces to the binary case of Shen et al. (2003). As a technical remark, we note that (4) uses $\sum_{j=1}^{k} \|w_j\|^2$ rather than $\max_{1 \leq j_1 < j_2 \leq k} \|w_{j_1} - w_{j_2}\|^2$ in the minimization. This is because $\sum_{j=1}^{k} \|w_j\|^2$ plays a similar role to $\max_{1 \leq j_1 < j_2 \leq k} \|w_{j_1} - w_{j_2}\|^2$ and is easier to implement.

Insert Figure 2 about here

Kernel-based learning can be achieved via a proper kernel $K(\cdot, \cdot)$, mapping from $S \times S$ to $\mathbb{R}$. The kernel is required to satisfy Mercer's condition (Mercer, 1909), which ensures that the kernel matrix $K$ is positive definite, where $K$ is an $n \times n$ matrix with $i_1 i_2$-th element $K(x_{i_1}, x_{i_2})$. Then each $f_j$ can be represented as $h_j(x) + b_j$ with $h_j = \sum_{i=1}^{n} v_{ji} K(x_i, x)$ by the theory of reproducing kernel Hilbert spaces, c.f., Wahba (1998). Kernel-based multicategory ψ-learning then solves:

\[
\min_{b, v} \Big( \frac{1}{2} \sum_{j=1}^{k} \|h_j\|_{\mathcal{H}_K}^2 + C \sum_{i=1}^{n} \psi\big(g(f(x_i), y_i)\big) \Big) \quad \text{subject to} \quad \sum_{j=1}^{k} b_j 1_n + K \sum_{j=1}^{k} v_j = 0, \tag{5}
\]

where $v_j = (v_{j1}, \ldots, v_{jn})^T$ and $v = \mathrm{vec}(v_1, \ldots, v_k)$. Using the reproducing kernel property, $\|h_j\|_{\mathcal{H}_K}^2$ can be written as $v_j^T K v_j$.

The concept of support vectors can also be extended to multicategory problems. In the separable case, the instances on the boundaries of the polyhedra $D_j$ are the support vectors, where polyhedron $D_j$ is the set of solutions of a finite system of linear inequalities defined by $\min(g(f(x), j)) \geq 1$. In the nonseparable case, the instances belonging to class $j$ that do not fall inside $D_j$ are the support vectors.

2.3 Computational Development of ψ-Learning

To treat the nonconvex minimization involved in (4) and (5), we utilize state-of-the-art technology in global optimization: the difference convex algorithm (DCA) of An and Tao (1997). We refer the reader to Liu et al. (2005) for the details of the algorithm.

The key to efficient computation is a d.c. decomposition $\psi = \psi_1 + \psi_2$, where $\psi_1(u) = 0$ if $u_{\min} \geq 1$ and $2(1 - u_{\min})$ otherwise, and $\psi_2(u) = 0$ if $u_{\min} \geq 0$ and $2 u_{\min}$ otherwise. Here $\psi_1$ can be viewed as a multivariate generalization of the univariate hinge loss. This d.c. decomposition connects the ψ-loss to the hinge loss $\psi_1$ of SVM. In fact, the multivariate ψ mimics the GE defined by $1 - \mathrm{sign}$, while the generalized hinge loss $\psi_1$ is a convex upper envelope of $1 - \mathrm{sign}$. With this d.c. decomposition, ψ corrects the bias introduced by the imposed convexity of $\psi_1$, and is expected to yield higher generalization accuracy.
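The decomposition can be checked numerically (our own sketch): $\psi_1$ is convex and $\psi_2$ is concave as functions of $u_{\min}$, and they sum to the ψ of (3) everywhere.

```python
import numpy as np

def psi1(u):
    """Convex part: the generalized hinge loss."""
    u_min = float(np.min(u))
    return 0.0 if u_min >= 1.0 else 2.0 * (1.0 - u_min)

def psi2(u):
    """Concave part: nonzero only when u_min < 0."""
    u_min = float(np.min(u))
    return 0.0 if u_min >= 0.0 else 2.0 * u_min

def psi(u):
    """The psi-loss of (3)."""
    u_min = float(np.min(u))
    if u_min >= 1.0:
        return 0.0
    return 2.0 if u_min < 0.0 else 2.0 * (1.0 - u_min)

# Verify psi = psi1 + psi2 over a grid of u_min values.
for u_min in np.linspace(-3.0, 3.0, 121):
    u = np.array([u_min, u_min + 1.0])
    assert abs(psi(u) - (psi1(u) + psi2(u))) < 1e-12
```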

3 Statistical Learning Theory

In the literature, there has been considerable interest in the generalization accuracy of margin-based classifiers. In the binary case, Lin (2000) investigated rates of convergence of SVM with a spline kernel. Bartlett, Jordan, and McAuliffe (2003) studied rates of convergence for certain convex margin losses. Shen et al. (2003) derived a learning theory for ψ-learning. Zhang (2004) obtained consistency for general convex margin-based losses. For the multicategory case, Zhang (2004b) has recently studied consistency of several large margin classifiers using convex losses. To our knowledge, no results are available for rates of convergence in the multicategory case. In this section, we quantify the generalization error rates of the proposed multicategory ψ-learning, as measured by the Bayesian regret, to be introduced.

3.1 Statistical Properties

The generalization performance of a classifier defined by $f$ is measured by the Bayesian regret $e(f, \bar{f}) = \mathrm{Err}(f) - \mathrm{Err}(\bar{f}) \geq 0$, the difference between the actual performance and the ideal performance. Here $\bar{f}$ is the Bayes rule, yielding the ideal performance assuming that the true distribution of $(X, Y)$ had been known in advance, obtained by minimizing $\mathrm{Err}(f) = \frac{1}{2} E[1 - \mathrm{sign}(g(f(X), Y))]$ with respect to all $f$, with $g(f(x), j) = \{f_j(x) - f_l(x);\ l \neq j\}$.

Note that the Bayes rule is not unique, because any $\bar{f}$ satisfying $\mathrm{argmax}_j \bar{f}_j(x) = \mathrm{argmax}_j P_j(x)$ with $P_j(x) = P(Y = j \mid x)$ yields the minimum. Without loss of generality, in what follows we use a specific $\bar{f} = (\bar{f}_1, \ldots, \bar{f}_k)$ with

\[
\bar{f}_j(x) = \frac{k-1}{k}\, I\big(\mathrm{sign}(P_j(x) - P_l(x);\ l \neq j) = 1\big) - \frac{1}{k}\, I\big(\mathrm{sign}(P_j(x) - P_l(x);\ l \neq j) \neq 1\big),
\]

that is, $\bar{f}_l(x) = \frac{k-1}{k}$ if $l = \mathrm{argmax}_j P_j(x)$, and $-\frac{1}{k}$ otherwise.

Theorem 3.1 below gives expressions for the Bayesian regret, which are critical for establishing our learning theory.

Theorem 3.1. For any decision function vector $f$,

\[
e(f, \bar{f}) = \frac{1}{2} E\Big[\sum_{j=1}^{k} P_j(X)\big(\mathrm{sign}(\bar{g}(\bar{f}(X), j)) - \mathrm{sign}(g(f(X), j))\big)\Big] \tag{6}
\]
\[
= E\big[\max_j P_j(X) - P_{\mathrm{argmax}_j f_j(X)}(X)\big] \geq 0
\]
\[
= E\Big[\sum_{j \neq l} |P_l(X) - P_j(X)|\, I\big(\mathrm{sign}(\bar{g}(\bar{f}(X), l)) = 1,\ \mathrm{sign}(g(f(X), j)) = 1\big)\Big], \tag{7}
\]

where $\bar{g}(\bar{f}(x), j) = \{\bar{f}_j(x) - \bar{f}_l(x);\ l \neq j\}$.

Equation (6) in Theorem 3.1 expresses $e(f, \bar{f})$ as a weighted sum of the individual misclassification errors, weighted by the conditional probabilities $P_j(X)$. Equation (7) expresses $e(f, \bar{f})$ in terms of the misclassification resulting from the $\binom{k}{2}$ multiple comparisons.

Equation (7) suggests that a multicategory problem differs dramatically from its binary counterpart. For a binary problem, (7) reduces to $e(f_2, \bar{f}_2) = E\big[|P_2(X) - 1/2|\,|\mathrm{sign}(f_2(X)) - \mathrm{sign}(\bar{f}_2(X))|\big]$, because $P_2(x) - P_1(x) = 2(P_2(x) - 1/2)$. This means a comparison between $P_1(x)$ and $P_2(x)$ in the binary case is equivalent to examining whether $P_2(x)$ exceeds $1/2$. For a multicategory problem, however, this no longer holds, since multiple pairwise comparisons are necessary in order to determine the argmax. In fact, there may not exist a dominating class; that is, $\max_l P_l(x) < 1/2$ for some $x \in S$. Therefore, $k$ comparisons of $P_j(x)$ with $1/2$ may not be sufficient to determine the correct classification rule. Indeed, the issue of the existence of a dominating class is important in the multicategory case but not in the binary case.
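A small numerical illustration of this point, using hypothetical probabilities of our own choosing: with no dominating class, every one-versus-rest threshold test fails, while the pairwise argmax still recovers the Bayes class.

```python
import numpy as np

p = np.array([0.40, 0.35, 0.25])   # conditional class probabilities at some x

# One-versus-rest asks "is P_j(x) > 1/2?"; with no dominating class,
# every answer is no, so no class is selected.
one_vs_rest = [bool(pj > 0.5) for pj in p]

# The pairwise comparisons behind the argmax still identify the Bayes class.
bayes_class = int(np.argmax(p))
```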

The ultimate goal of classification is to minimize $E[1 - \mathrm{sign}(g(f(X), Y))]$. To avoid the scale-invariance problem of the sign function, we apply the ψ-loss here as a surrogate loss and minimize $E[\psi(g(f(X), Y))]$. The following theorem says that a ψ-loss yields the same Bayes rule as the $1 - \mathrm{sign}$ loss. Thus, consistency of multicategory ψ-learning can be established.

Theorem 3.2. The Bayes decision vector $\bar{f}$ satisfies $\bar{g}_{\min}(\bar{f}(x), \mathrm{argmax}_{j=1,\ldots,k} P_j(x)) = 1$, where $\bar{g}_{\min}$ is the minimum of the $k - 1$ elements of the vector $\bar{g}$. For any ψ satisfying (2), $\bar{f}$ minimizes $E[\psi(g(f(X), Y))]$ and $E[1 - \mathrm{sign}(g(f(X), Y))]$, in the sense that $E[\psi(g(f(X), Y))] \geq E[\psi(\bar{g}(\bar{f}(X), Y))] = E[1 - \mathrm{sign}(\bar{g}(\bar{f}(X), Y))] \leq E[1 - \mathrm{sign}(g(f(X), Y))]$ for any $f$. Furthermore, the minimizers of $E[\psi(g(f(X), Y))]$ and $E[1 - \mathrm{sign}(g(f(X), Y))]$ are not unique; e.g., $c\bar{f}$ is also a minimizer of both quantities for any $c \geq 1$.

Theorem 3.2 says that ψ-learning estimates the Bayes classifier defined by $\bar{f}$, as opposed to the conditional probabilities $(P_1(x), \ldots, P_k(x))$, so the ψ-loss plays the same role as $1 - \mathrm{sign}$. Furthermore, the optimal performance of $\bar{f}$, with $\bar{g}_{\min}(\bar{f}(x), \mathrm{argmax}_j P_j(x)) = 1$, is realized via the ψ-loss function although it differs from $1 - \mathrm{sign}$.

3.2 Statistical Learning Theory

Let $\mathcal{F}$ be a function class of candidate function vectors, which is allowed to depend on $n$. Note that the Bayes decision function $\bar{f}$ is not required to belong to $\mathcal{F}$. For any function vector $f \in \mathcal{F}$, classification is performed by partitioning $S$ into $k$ disjoint sets $(G_{f1}, \ldots, G_{fk}) = (\{x : \mathrm{sign}(g(f(x), 1)) = 1\}, \ldots, \{x : \mathrm{sign}(g(f(x), k)) = 1\})$.

In this section, we generalize the learning theory of Shen et al. (2003) to the multicategory case. Our learning theory quantifies the magnitude of $e(f, \bar{f})$ as a function of $n$, $k$, the tuning parameter $C$, and the complexity of the class of candidate classification partitions $\mathcal{G}(\mathcal{F}) = \{(G_{f1}, \ldots, G_{fk});\ f \in \mathcal{F}\}$ induced by $\mathcal{F}$.

Denote by $e_\psi(f, \bar{f}) = \frac{1}{2}\big(E[\psi(g(f(X), Y))] - E[\psi(\bar{g}(\bar{f}(X), Y))]\big)$ the approximation error, which measures the degree of approximation of $\mathcal{G}(\mathcal{F})$ to $(G_{\bar{f}1}, \ldots, G_{\bar{f}k})$. Let $J_0 = \max(J(f_0), 1)$. The following technical assumptions are made.

Assumption A: (Approximation error) For some positive sequence $s_n \to 0$ as $n \to \infty$, there exists $f_0 \in \mathcal{F}$ such that $e_\psi(f_0, \bar{f}) \leq s_n$; equivalently, $\inf_{f \in \mathcal{F}} e_\psi(f, \bar{f}) \leq s_n$. Like $\mathcal{F}$, $f_0$ may depend on $n$.

Assumption B: (Boundary behavior) There exist constants $0 < \alpha \leq +\infty$ and $c_1 > 0$ such that $P\big(X \in S : \max_l P_l(X) - P_{j \neq \mathrm{argmax}_l P_l(X)}(X) < 2\delta\big) \leq c_1 \delta^\alpha$ for any small $\delta \geq 0$.

Assumption B describes the behavior of the conditional probabilities $P_j$ near the decision boundary $\{x \in S : \max_l P_l(x) = P_j(x),\ \text{for some } l \neq j \in \{1, 2, \ldots, k\}\}$. It is equivalent to $P\big(X \in S : \max_l P_l(X) - \mathrm{second\ max}_l\, P_l(X) < 2\delta\big) \leq c_1 \delta^\alpha$, by the fact that $\{X : \max_l P_l(X) - P_{j \neq \mathrm{argmax}_l P_l(X)}(X) < 2\delta\} \subset \{X : \max_l P_l(X) - \mathrm{second\ max}_l\, P_l(X) < 2\delta\}$.

To specify Assumption C, we define the metric entropy for partitions. For a class of partitions $\mathcal{B} = \{(B_1, \ldots, B_k) : B_j \cap B_l = \emptyset\ \forall j \neq l,\ \cup_{1 \leq j \leq k} B_j = S\}$ and any $\epsilon > 0$, call $\{(G_{j1}^v, G_{j1}^u), \ldots, (G_{jm}^v, G_{jm}^u)\}$, $j = 1, \ldots, k$, an $\epsilon$-bracketing set of $\mathcal{B}$ if for any $(G_1, \ldots, G_k) \in \mathcal{B}$ there exists an $h$ such that $G_{jh}^v \subset G_j \subset G_{jh}^u$ and

\[
\max_{1 \leq h \leq m} \max_{1 \leq j \leq k} P(G_{jh}^u \,\triangle\, G_{jh}^v) \leq \epsilon, \tag{8}
\]

where $G_{jh}^u \,\triangle\, G_{jh}^v$ is the set difference between $G_{jh}^u$ and $G_{jh}^v$. The metric entropy $H_B(\epsilon, \mathcal{B})$ of $\mathcal{B}$ with bracketing is then defined as the logarithm of the cardinality of the $\epsilon$-bracketing set of $\mathcal{B}$ of smallest size.

Let $\mathcal{F}(\ell) = \{f \in \mathcal{F} : J(f) \leq \ell\} \subset \mathcal{F}$ and $\mathcal{G}(\ell) = \{(G_{f1}, \ldots, G_{fk});\ f \in \mathcal{F}(\ell)\} \subset \mathcal{G}(\mathcal{F})$. Then $\mathcal{G}(\ell)$ is the set of classification partitions under the regularization $J(f) \leq \ell$. For instance, $J(f)$ is $\frac{1}{2}\sum_j \|w_j\|^2$ in (4), or $\frac{1}{2}\sum_j \|h_j\|_{\mathcal{H}_K}^2$ in (5). To measure the complexity of $\mathcal{G}(\ell)$ via the metric entropy, the following assumption is made.

Assumption C: (Metric entropy for partitions) For some positive constants $c_i$, $i = 2, 3, 4$, there exists some $\epsilon_n > 0$ such that

\[
\sup_{\ell \geq 2} \phi(\epsilon_n, \ell) \leq c_2 n^{1/2}, \tag{9}
\]

where $\phi(\epsilon_n, \ell) = \int_{c_4 L}^{c_3^{1/2} L^{\alpha/(2(\alpha+1))}} H_B^{1/2}\big(u^2/4, \mathcal{G}(\ell)\big)\, du \,/\, L$ and $L = L(\epsilon_n, C, \ell) = \min\big(\epsilon_n^2 + (Cn)^{-1}(\ell/2 - 1) J_0,\ 1\big)$.

Assumption D: (ψ-function) The ψ-function satisfies (2).

As a technical remark, we note that to simplify the function entropy calculation in Assumption C required in Theorem 3.4, an additional condition on the ψ-function may be imposed. For instance, we may restrict the ψ-loss functions in (2) to satisfy a multivariate Lipschitz condition:

\[
|\psi(u^*) - \psi(u^{**})| \leq D\, |u^*_{\min} - u^{**}_{\min}|, \tag{10}
\]

where $D > 0$ is a constant. Condition (10) is satisfied by the specific ψ function in (3), with $D = 2$. This aspect is illustrated in Example 3.3.2. However, (10) is irrelevant to the set entropy in Assumption C required in Theorem 3.3; see Example 3.3.1.

Theorem 3.3. Suppose that Assumptions A-D are met. Then, for any classifier $\mathrm{argmax}(\hat{f})$ of ψ-learning, there exists a constant $c_5 > 0$ such that

\[
P\big(e(\hat{f}, \bar{f}) \geq \delta_n^2\big) \leq 3.5 \exp\Big(-c_5\, n\, (nC)^{-\frac{\alpha+2}{\alpha+1}} J_0^{\frac{\alpha+2}{\alpha+1}}\Big),
\]

provided that $Cn \geq 2 \delta_n^{-2} J_0$, where $\delta_n^2 = \min(\max(\epsilon_n^2, 2 s_n), 1)$.

Corollary 3.1. Under the assumptions of Theorem 3.3,

\[
|e(\hat{f}, \bar{f})| = O_p(\delta_n^2), \qquad E|e(\hat{f}, \bar{f})| = O(\delta_n^2),
\]

provided that $n^{-\frac{1}{\alpha+1}} (C^{-1} J_0)^{\frac{\alpha+2}{\alpha+1}}$ is bounded away from zero.

To obtain the error rate $\delta_n^2$ in Theorem 3.3, we need to compute the metric entropy of $\mathcal{G}(\ell)$. Computing the metric entropy for partitions may not be easy, because $\mathcal{G}(\ell)$ is induced by the class of functions $\mathcal{F}(\ell)$. Moreover, it is also of interest to establish an upper bound on $e(\hat{f}, \bar{f})$ using the corresponding function entropy as opposed to set entropy. In what follows, we develop such results in Theorem 3.4.

To proceed, we define the $L_2$-metric entropy with bracketing for $\mathcal{F}$ as follows. For any $\epsilon > 0$, call $\{(g_1^v, g_1^u), \ldots, (g_m^v, g_m^u)\}$ an $\epsilon$-bracketing function set if for any $g \in \mathcal{F}$ there is an $h$ such that $g_h^v \leq g \leq g_h^u$ and $\max_{1 \leq h \leq m} \|g_h^u - g_h^v\|_2 \leq \epsilon$, where $\|\cdot\|_2$ is the usual $L_2$-norm, defined by $\|g\|_2^2 = \int g^2\, dP$. Then the $L_2$-metric entropy of $\mathcal{F}$ with bracketing, $H_B(\epsilon, \mathcal{F})$, is defined as the logarithm of the cardinality of the $\epsilon$-bracketing set of smallest size. Now define a new function set $\mathcal{F}_\psi(\ell) = \{\psi(g(f(x), y)) - \psi(g(f_0(x), y)) : f \in \mathcal{F}(\ell)\}$ and $\phi^*(\epsilon_n^*, \ell) = \int_{c_4 L^*}^{c_3^{1/2} L^{*\,\alpha/(2(\alpha+1))}} H_B^{1/2}\big(u, \mathcal{F}_\psi(\ell)\big)\, du \,/\, L^*$ with $L^* = \min\big(\epsilon_n^{*2} + (Cn)^{-1}(\ell/2 - 1) J_0,\ 1\big)$.

Theorem 3.4. Suppose that Assumptions A-D are met with $\phi^*(\epsilon_n^*, \ell)$ replacing $\phi(\epsilon_n, \ell)$ in Assumption C. Then, for any classifier $\mathrm{argmax}(\hat{f})$ of ψ-learning, there exists a constant $c_5 > 0$ such that

\[
P\big(e(\hat{f}, \bar{f}) \geq \delta_n^{*2}\big) \leq P\big(e_\psi(\hat{f}, \bar{f}) \geq \delta_n^{*2}\big) \leq 3.5 \exp\Big(-c_5\, n\, (nC)^{-\frac{\alpha+2}{\alpha+1}} J_0^{\frac{\alpha+2}{\alpha+1}}\Big),
\]

provided that $Cn \geq 2 \delta_n^{*-2} J_0$, where $\delta_n^{*2} = \min(\max(\epsilon_n^{*2}, 2 s_n), 1)$.

Corollary 3.2. Under the assumptions of Theorem 3.4,

\[
|e_\psi(\hat{f}, \bar{f})| = O_p(\delta_n^{*2}), \qquad E|e_\psi(\hat{f}, \bar{f})| = O(\delta_n^{*2}),
\]

provided that $n^{-\frac{1}{\alpha+1}} (C^{-1} J_0)^{\frac{\alpha+2}{\alpha+1}}$ is bounded away from zero.

Note that $e_\psi(\hat{f}, \bar{f}) \geq e(\hat{f}, \bar{f})$. The rate $\delta_n^{*2}$ obtained from Theorem 3.4 using the metric entropy for functions thus yields an upper bound on $e(\hat{f}, \bar{f})$, so that $e(\hat{f}, \bar{f}) \leq \min(\delta_n^2, \delta_n^{*2})$ with probability tending to 1 by Theorems 3.3-3.4. In application, one may calculate either $\delta_n^2$ or $\delta_n^{*2}$, depending on which entropy is easier to compute.

Theorems 3.3-3.4 reveal distinct characteristics of multicategory problems, although they cover the binary case. First, a multicategory problem generally has a higher level of complexity, and hence the number of classes $k$ may have an impact on performance. In fact, Theorems 3.3-3.4 permit studying the dependency of $e(\hat{f}, \bar{f})$ on $k$ and $n$ simultaneously; see Examples 3.3.1 and 3.3.2. Second, some properties of binary linear learning no longer hold in the multicategory case when $k > 2$. For instance, the decision boundaries generated by linear learning with $k > 2$ can be piecewise linear hyperplanes.

3.3 Illustrative Examples

To illustrate our learning theory, we study specific learning examples and apply the theory to derive error bounds for multicategory ψ-learning.

3.3.1. Linear classification: We consider linear classification involving a class of $k$ hyperplanes $\mathcal{F} = \{f : f_j(x) = w_j^T x + b_j,\ \sum_{j=1}^{k} f_j = 0,\ x \in S = [0, 1]^d\}$, where $d$ is a constant. To generate the training sample, we specify $P(Y = j) = 1/k$, and let the conditional density of $x$ given $Y = j$ be $k - 1$ on $\{x : x_1 \in [(j-1)/k, j/k)\}$ and $1/(k-1)$ otherwise, where $x_1$ is the first coordinate of $x$. Then the Bayes classifier yields the sets $\{x : x_1 \in [0, 1/k)\}, \ldots, \{x : x_1 \in [(k-1)/k, 1]\}$ for the corresponding $k$ classes.

We now verify Assumptions A-C. For Assumption A, it is easy to find $f_t = (w_{11} x_1 + b_1, \ldots, w_{1k} x_1 + b_k)$ such that the $w_{1j}$'s are increasing, $\sum_{j=1}^{k} w_{1j} = 0$, $\sum_{j=1}^{k} b_j = 0$, and $w_{1j}\, j/k + b_j = w_{1,j+1}\, j/k + b_{j+1}$ for $j = 1, \ldots, k-1$. Let $f_0 = n f_t \in \mathcal{F}$; then $e_\psi(f_0, \bar{f}) \leq s_n = c_1 n^{-1}$ for some constant $c_1 > 0$. This implies Assumption A with $s_n = c_1 n^{-1}$. Assumption B is satisfied with $\alpha = +\infty$, since $P\big(X \in S : \max_l P_l(X) - P_j(X) < 2\delta\big) = P\big(X_1 \in \{1/k, \ldots, (k-1)/k\}\big) = 0$ for any sufficiently small $\delta > 0$. To verify Assumption C, we note that $H_B(u, \mathcal{G}(\ell)) \leq O(k^2 \log(k/u))$ for any given $\ell$ by Lemma 1. Let $\phi_1(\epsilon_n, \ell) = c_3 \big(k^2 \log(k/L^{1/2})\big)^{1/2} / L^{1/2}$, where $L = \min(\epsilon_n^2 + (Cn)^{-1}(\ell/2 - 1), 1)$. This in turn yields $\sup_{\ell \geq 2} \phi(\epsilon_n, \ell) \leq \phi_1(\epsilon_n, 2) = c \big(k^2 \log(k/\epsilon_n)\big)^{1/2} / \epsilon_n$ for some $c > 0$, and a rate $\epsilon_n = \big(\frac{k^2 \log n}{n}\big)^{1/2}$ when $C/J_0 \sim \delta_n^{-2} n^{-1} \sim \frac{1}{k^2 \log n}$, provided that $\frac{k^2 \log n}{n} \to 0$.

By Corollary 3.1, we conclude that $e(\hat{f}, \bar{f}) \leq O\big(\frac{k^2 \log n}{n}\big)$ except on a set of probability tending to zero, and $E e(\hat{f}, \bar{f}) \leq O\big(\frac{k^2 \log n}{n}\big)$, when $\frac{k^2 \log n}{n} \to 0$ as $n \to \infty$. It is interesting to note that $E e(\hat{f}, \bar{f}) \leq O(n^{-1} \log n)$ when $k$ is a fixed constant. This conclusion holds generally for any ψ-function satisfying Assumption D.

3.3.2. Gaussian-kernel classification: In this example, we consider nonlinear learning with the same $P(x, y)$ as in Example 3.3.1. Let $\mathcal{F} = \{f : f_j(x) = \sum_{i=1}^{n} v_{ji} K(x_i, x) + b_j,\ \sum_{j=1}^{k} f_j = 0,\ x \in S = [0, 1]^d\}$ with the Gaussian kernel $K(s, t) = \exp(-\|s - t\|_2^2)$.

For Assumption A, we note that $\mathcal{F}$ is a rich function space for large $n$. In fact, any continuous function can be well approximated by Gaussian-kernel representations under the sup-norm, c.f., Steinwart (2001). Thus there exists an $f_t = (f_{1t}, \ldots, f_{kt}) \in \mathcal{F}$ such that $f_{jt}(x) \geq 0$ for $x_1 \in [(j-1)/k, j/k]$ and $f_{jt}(x) < 0$ otherwise. With the choice $f_0 = \epsilon_n^{-2} f_t$, we have $e_\psi(f_0, \bar{f}) \leq s_n = c_1 \epsilon_n^2$, where $c_1$ is a constant and $\epsilon_n$ is defined below. Assumption B is satisfied with

$\alpha = +\infty$, as in Example 3.3.1. In this case, the metric entropy of $\mathcal{F}_\psi(\ell)$ appears to be easier to compute, so we apply Theorem 3.4 to obtain the convergence rate. Consider any ψ-function in (2) that satisfies (10). Then by Lemma 2, $H_B(u, \mathcal{F}_\psi(\ell)) \leq O\big(k (\log(\ell/u))^{d+1}\big)$ for any given $\ell$. Let $\phi_1^*(\epsilon_n^*, \ell) = c_3 \big(k (\log(\ell/L^{1/2}))^{d+1}\big)^{1/2} / L^{1/2}$, where $L = \min(\epsilon_n^{*2} + (Cn)^{-1}(\ell/2 - 1), 1)$. Then $\sup_{\ell \geq 2} \phi^*(\epsilon_n^*, \ell) \leq \phi_1^*(\epsilon_n^*, 2) = c \big(k (\log(1/\epsilon_n^*))^{d+1}\big)^{1/2} / \epsilon_n^*$ for some $c > 0$. Solving (9) yields a rate $\epsilon_n^* = \big(\frac{k (\log(n k^{-1}))^{d+1}}{n}\big)^{1/2}$ when $C/J_0 \sim \delta_n^{*-2} n^{-1} \sim \frac{1}{k (\log(n k^{-1}))^{d+1}}$, under the condition that $\frac{k (\log(n k^{-1}))^{d+1}}{n} \to 0$ as $n \to \infty$.

By Theorem 3.4, we conclude that $e(\hat f,\bar f)\le e_\psi(\hat f,\bar f)\le O\big(k(\log(nk^{-1}))^{d+1}/n\big)$ except on a set of probability tending to zero. By Corollary 3.2, $Ee(\hat f,\bar f)\le O\big(k(\log(nk^{-1}))^{d+1}/n\big)$. The resulting rate reflects its dependence on the number of classes $k$. If $k$ is treated as a fixed constant, then we have $Ee(\hat f,\bar f)\le O\big(n^{-1}(\log n)^{d+1}\big)$. This conclusion holds generally for any $\psi$-function satisfying Assumption D and Condition (10).

In summary, Examples 3.3.1 and 3.3.2 provide insight into the generalization error of the proposed methodology. In view of the lower-bound $n^{-1}$ result in the binary case (cf. Tsybakov, 2004), we conjecture that the rates obtained in Examples 3.3.1 and 3.3.2 are nearly optimal, although, to our knowledge, a lower-bound result for a general classifier has not yet been established in the multicategory case. Further investigation is necessary.

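For quick reference, the two generalization-error bounds obtained above can be displayed side by side (this simply restates the rates from Examples 3.3.1 and 3.3.2):

```latex
% Rates from Example 3.3.1 (linear) and Example 3.3.2 (Gaussian kernel):
\[
\text{linear:}\quad
Ee(\hat f,\bar f)\le O\!\left(\frac{k^{2}\log n}{n}\right),
\qquad
\text{Gaussian kernel:}\quad
Ee(\hat f,\bar f)\le O\!\left(\frac{k\,(\log(nk^{-1}))^{d+1}}{n}\right),
\]
% both reducing to a near-$n^{-1}$ rate, up to logarithmic factors, for fixed $k$.
```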
4 Numerical Examples

In this section, we examine the performance of multicategory $\psi$-learning in terms of generalization and compare it with its SVM counterpart. In the literature, there are a number of different multicategory SVM generalizations; for instance, Lee et al. (2004), Crammer and Singer (2001), and Weston and Watkins (1998), among others. To make a fair comparison, we use a version of multicategory SVM that is parallel to our multicategory $\psi$-learning. Specifically, we replace the $\psi$ function in (4) and (5) by $\psi_1$. Then for the linear case, this version of multicategory SVM solves

\[
\min_{\mathbf b,\mathbf w}\ \Big(\frac{1}{2}\sum_{j=1}^{k}\|w_j\|^2+C\sum_{i=1}^{n}\psi_1\big(g(f(x_i),y_i)\big)\Big)
\quad\text{subject to}\quad
\sum_{j=1}^{k}b_j\mathbf 1_n+X\sum_{j=1}^{k}w_j=0. \qquad(11)
\]

This version of SVM is closely related to that of Crammer and Singer (2001). In their formulation, all $b_j$'s are set to 0 rather than employing the sum-to-zero constraint, in contrast to (11). As argued by Guermeur (2002), the sum-to-zero constraint is necessary to ensure uniqueness of the solution when a $k$-dimensional vector of decision functions with intercepts $b_j$ is used for a $k$-class problem.

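The objective in (11) can be sketched numerically. The snippet below assumes $\psi_1$ is the hinge-type loss $\psi_1(u)=(1-u)_+$ of the binary SVM and uses the generalized functional margin $g(f(x),y)=\min_{j\ne y}(f_y(x)-f_j(x))$ (cf. Lemma 2); it evaluates the cost only and ignores the sum-to-zero constraint, and all inputs are hypothetical.

```python
import numpy as np

def margins(X, W, b, y):
    """g(f(x_i), y_i) for linear decision functions f_j(x) = w_j'x + b_j."""
    F = X @ W.T + b                       # (n, k) matrix of f_j(x_i)
    n = len(y)
    fy = F[np.arange(n), y]
    F_rest = F.copy()
    F_rest[np.arange(n), y] = -np.inf     # exclude the true class
    return fy - F_rest.max(axis=1)        # min over j != y of f_y - f_j

def svm_objective(X, y, W, b, C):
    """(1/2) sum_j ||w_j||^2 + C sum_i psi_1(g(f(x_i), y_i)), psi_1(u) = (1-u)_+."""
    hinge = np.maximum(0.0, 1.0 - margins(X, W, b, y))
    return 0.5 * np.sum(W**2) + C * hinge.sum()
```

For example, with all weights and intercepts zero every margin is 0, so each point contributes $\psi_1(0)=1$ to the loss term.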
4.1 Simulation

Two linear examples are considered. In these examples, the GE is approximated by the testing error computed on a testing sample independent of the training data. In what follows, all calculations are carried out using the IMSL C routines.

Three-class linear problem. The training data are generated as follows. First, generate pairs $(t_1,t_2)$ from a bivariate $t$-distribution with $\nu$ degrees of freedom, where $\nu=1,3$ in Examples 1 and 2, respectively. Second, randomly assign a label from $\{1,2,3\}$ to each $(t_1,t_2)$. Third, calculate $(x_1,x_2)$ as $x_1=t_1+a_1$ and $x_2=t_2+a_2$, with the three different values $(a_1,a_2)=(\sqrt3,1),(-\sqrt3,1),(0,-2)$ for the corresponding classes 1-3. In these examples, the testing and Bayes errors are computed via independent testing samples of size $10^6$ for classifiers obtained from training samples of size 150.

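The three-step simulation design above can be sketched as follows. The bivariate $t$ draw uses the standard normal-over-chi-square construction; the seed and the exact random-number mechanism are hypothetical, not the IMSL routines used in the paper.

```python
import numpy as np

SHIFTS = np.array([[np.sqrt(3.0), 1.0],    # class 1: (a1, a2) = (sqrt(3), 1)
                   [-np.sqrt(3.0), 1.0],   # class 2: (-sqrt(3), 1)
                   [0.0, -2.0]])           # class 3: (0, -2)

def simulate(n, nu, rng):
    """Generate n points: bivariate t_nu noise plus a class-dependent shift."""
    z = rng.standard_normal((n, 2))
    # Bivariate t: normal scaled by sqrt(nu / chi^2_nu), one chi-square draw per point.
    scale = np.sqrt(nu / rng.chisquare(nu, size=(n, 1)))
    t = z * scale
    y = rng.integers(0, 3, size=n)         # labels 1-3, coded here as 0-2
    return t + SHIFTS[y], y

X, y = simulate(150, nu=1, rng=np.random.default_rng(0))
```

With $\nu=1$ the noise is bivariate Cauchy-like, which is what produces the extreme observations discussed below.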
To eliminate the dependence on $C$, we maximize the performance of $\psi$-learning and SVM by optimizing $C$ over a discrete set in $[10^{-3},10^3]$. For each method, the testing error at the optimal $C$ is averaged over 100 repeated simulations. The simulation results are summarized in Table 1.

Insert Table 1 about here

As shown in Table 1, $\psi$-learning usually has a smaller testing error, and thus better generalization, than its SVM counterpart. The amount of improvement, however, varies across examples. In Example 1, the percent improvement of multicategory $\psi$-learning over SVM is 43.22% when the corresponding $t$-distribution has one degree of freedom. In Example 2, it decreases to 20.41% when the $t$-distribution with 3 degrees of freedom is employed. Further, $\psi$-learning yields a smaller number of support vectors. This suggests that $\psi$-learning has an even more "sparse" solution than SVM, and hence a stronger ability for data reduction. On a related matter, SVM fails to give data reduction in Example 1, since almost all the instances are support vectors, in contrast to the much smaller number of support vectors of $\psi$-learning. One plausible explanation is that the first moment of the standard bivariate $t$-distribution with one degree of freedom does not exist, and thus the corresponding SVM does not work well. In general, any classifier with an unbounded loss, such as SVM, may have difficulty with extreme outliers as in this example. This reinforces our view that $\psi$-learning is more robust against outliers.

4.2 Application

We now examine the performance of $\psi$-learning and its SVM counterpart on the benchmark example letter, obtained from Statlog. In this example, each sample contains 16 primitive numerical attributes converted from its corresponding letter image, with a response variable representing 26 categories. The main goal here is to identify each letter image as one of the 26 capital letters in the English alphabet. A detailed description can be found at www.liacc.up.pt/ML/statlog/datasets/letter/letter.doc.html.

For illustration, we use the data for letters D, O, and Q, with 805, 753, and 783 cases, respectively. A random sample of $n=200$ is selected for training, leaving the rest for testing. For each training dataset, we seek the best performance of linear $\psi$-learning and SVM over a set of $C$-values in $[10^{-3},10^3]$. The corresponding results, with respect to the smallest testing errors for each method in ten different cases, are reported in Table 2. Since the Bayes error is unknown, the improvement of $\psi$-learning over SVM is computed via $(T(\mathrm{SVM})-T(\psi))/T(\mathrm{SVM})$.

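The relative-improvement metric just defined is a one-liner; the sketch below uses hypothetical testing errors, not the (unrounded) values behind Table 2.

```python
def improvement(t_svm, t_psi):
    """Relative improvement (T(SVM) - T(psi)) / T(SVM) of Section 4.2."""
    return (t_svm - t_psi) / t_svm

# e.g., a testing error of 0.10 for SVM vs 0.08 for psi-learning
rel = improvement(0.10, 0.08)   # a 20% relative reduction in testing error
```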
Insert Table 2 about here

Table 2 indicates that multicategory $\psi$-learning has a smaller testing error than its SVM counterpart, although the amount of improvement varies from sample to sample. In addition, on average, multicategory $\psi$-learning has a smaller number of support vectors than SVM. In conclusion, $\psi$-learning generalizes better and achieves further data reduction than SVM in this example.

5 Discussion

In this article, we propose a new methodology that generalizes $\psi$-learning from the binary case to the multicategory case. A statistical learning theory is developed for $\psi$-learning in terms of the Bayesian regret. In simulations, we show that the proposed methodology performs well and is more robust against outliers than its SVM counterpart. In addition, we discover some interesting phenomena that do not occur in the binary case.

Recently, there has been considerable interest in studying the variable selection problem using the $L_1$ norm in place of the conventional $L_2$ norm. In the binary case, Zhu et al. (2003) studied properties of the $L_1$ SVM and showed that the corresponding regularized solution path is piecewise linear. It is therefore natural to investigate variable selection for $L_1$ $\psi$-learning.

Further developments are necessary to make multicategory $\psi$-learning more useful in practice, particularly methodologies for a data-driven choice of $C$, variable selection, regularized solution paths, as well as nonstandard situations including unequal loss assignments.

Appendix

Proof of Theorem 3.1: By the definition of $\mathrm{Err}(f)$, it is easy to obtain via conditioning that $e(f,\bar f)=\frac12E\big[\sum_{l=1}^kP_l(X)\big(\mathrm{sign}(\bar g(\bar f(X),l))-\mathrm{sign}(g(f(X),l))\big)\big]$. It then suffices to consider the situation where $\mathrm{sign}(\bar g(\bar f(X),l))-\mathrm{sign}(g(f(X),l))$ is nonzero, that is, when the two classifiers disagree. Equivalently, for any given $X=x$, we can write $e(f,\bar f)$ using all possible different classifications produced by $\bar f$ and $f$ jointly, where $\mathrm{sign}(\bar g(\bar f(x),l))=1$ and $\mathrm{sign}(g(f(x),j))=1$ imply that $\bar f$ classifies $x$ into class $l$ while $f$ classifies $x$ into class $j$, for $1\le l\ne j\le k$. Thus, we have
\[
\begin{aligned}
e(f,\bar f)&=E\Big[\sum_{l=1}^{k}\sum_{j\ne l}\big(P_l(X)-P_j(X)\big)\,I\big(\mathrm{sign}(\bar g(\bar f(X),l))=1,\ \mathrm{sign}(g(f(X),j))=1\big)\Big]\\
&=E\Big[\sum_{l=1}^{k}\sum_{j\ne l}\big|P_l(X)-P_j(X)\big|\,I\big(\mathrm{sign}(\bar g(\bar f(X),l))=1,\ \mathrm{sign}(g(f(X),j))=1\big)\Big],
\end{aligned}
\]
where the second equality follows from the fact that $\bar f$ is the optimal (Bayes) decision function vector, so that $P_l(X)\ge P_j(X)$ when $\mathrm{sign}(\bar g(\bar f(X),l))=1$. The desired result then follows.

Proof of Theorem 3.2: Write $E[1-\mathrm{sign}(g(f(X),Y))\mid X=x]$ as $\sum_{j=1}^k\big(1-\mathrm{sign}(g(f(x),j))\big)P_j(x)=1-\sum_{j=1}^k\mathrm{sign}(g(f(x),j))P_j(x)$. Note that for any given $x$, one and only one of the $\mathrm{sign}(g(f(x),j))$ can be 1, with the rest equal to $-1$. Consequently, $E[1-\mathrm{sign}(g(f(X),Y))]$ is minimized when $\mathrm{sign}\big(g(f(x),\arg\max_j\bar f_j(x))\big)=1$, i.e., $f=\bar f$. Evidently, the minimizer is not unique, as $c\bar f$ for $c\ge1$ is also a minimizer. The desired result then follows from the facts that $\psi(u)\ge1-\mathrm{sign}(u)$ and $\psi(\bar g)=1-\mathrm{sign}(\bar g)$.

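The "one and only one positive sign" fact used in the proof above can be checked numerically. The sketch below uses the generalized functional margin $g(f(x),j)=\min_{l\ne j}(f_j-f_l)$ (cf. Lemma 2); the decision-function values are hypothetical.

```python
import numpy as np

def g_min(f, j):
    """Generalized functional margin: min over l != j of f_j - f_l."""
    rest = np.delete(f, j)
    return np.min(f[j] - rest)

# Decision-function values at some x for k = 3 classes (hypothetical, distinct).
f = np.array([0.7, -0.2, -0.5])
signs = [np.sign(g_min(f, j)) for j in range(len(f))]
# Exactly one margin is positive -- the one for argmax_j f_j.
```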
Proof of Theorem 3.3: Before proceeding, we introduce some notation to be used below. Let $\tilde l_\psi(f,Z_i)=l_\psi(f,Z_i)+\lambda J(f)$ be the cost function to be minimized, as in (4) or (5), where $l_\psi(f,Z_i)=\psi(g(f(X_i),Y_i))$ and $\lambda=1/(Cn)$. Let $\tilde l(f,Z_i)=l(f,Z_i)+\lambda J(f)$, where $l(f,Z_i)=1-\mathrm{sign}(g(f(X_i),Y_i))$. Define the scaled empirical process $E_n(\tilde l(f,Z)-\tilde l_\psi(f_0,Z))$ as
\[
n^{-1}\sum_{i=1}^{n}\Big(\tilde l(f,Z_i)-\tilde l_\psi(f_0,Z_i)-E\big[\tilde l(f,Z_i)-\tilde l_\psi(f_0,Z_i)\big]\Big)=E_n\big[l(f,Z)-l_\psi(f_0,Z)\big],
\]
where $Z=(X,Y)$. Let $A_{i,j}=\{f\in\mathcal F:\ 2^{i-1}\delta_n^2\le e(f,\bar f)<2^i\delta_n^2,\ 2^{j-1}J_0\le J(f)<2^jJ_0\}$ and $A_{i,0}=\{f\in\mathcal F:\ 2^{i-1}\delta_n^2\le e(f,\bar f)<2^i\delta_n^2,\ J(f)<J_0\}$, for $j=1,2,\dots$ and $i=1,2,\dots$. Without loss of generality, we assume $J(f_0)\ge1$ and $\max(\epsilon_n^2,2s_n)<1$ in the sequel.

The proof uses the treatment of Shen et al. (2003) and Shen (1998), together with the results in Theorem 3.1 and Assumption B. In what follows, we shall omit any details that can be referred to the proof of Theorem 1 of Shen et al. (2003).

Using the connection between $e(\hat f,\bar f)$ and the cost function as in Shen et al. (2003), we have
\[
P\big(e(\hat f,\bar f)\ge\delta_n^2\big)\le P^*\Big(\sup_{\{f\in\mathcal F:\ e(f,\bar f)\ge\delta_n^2\}}\ n^{-1}\sum_{i=1}^{n}\big(\tilde l_\psi(f_0,Z_i)-\tilde l(f,Z_i)\big)\ge0\Big)=I,
\]
where $P^*$ denotes the outer probability measure.

To bound $I$, it suffices to bound $P(A_{ij})$ for each $i,j=1,\dots$. To this end, we need some inequalities regarding the first and second moments of $\tilde l(f,Z)-\tilde l_\psi(f_0,Z)$ for $f\in A_{ij}$.

For the first moment, note that $E[l(f,Z)-l_\psi(f_0,Z)]=E[l(f,Z)-l_\psi(\bar f,Z)]-E[l_\psi(f_0,Z)-l_\psi(\bar f,Z)]$, which equals $2\big(e(f,\bar f)-e_\psi(f_0,\bar f)\big)$, since $El_\psi(\bar f,Z)=El(\bar f,Z)$ by Theorem 3.2. By Assumption A and the definition of $\delta_n^2$, $2e_\psi(f_0,\bar f)\le2s_n\le\delta_n^2$. Then, using the assumption that $\lambda J_0\le\delta_n^2/2$, we have, for any integers $i,j\ge1$,
\[
\inf_{A_{i,j}}E\big(\tilde l(f,Z)-\tilde l_\psi(f_0,Z)\big)\ge M(i,j)=(2^{i-1}\delta_n^2)+\lambda(2^{j-1}-1)J(f_0),\qquad(12)
\]
and
\[
\inf_{A_{i,0}}E\big(\tilde l(f,Z)-\tilde l_\psi(f_0,Z)\big)\ge(2^{i-1}-1/2)\delta_n^2\ge M(i,0)=2^{i-2}\delta_n^2,\qquad(13)
\]
where the fact that $2^{i-1}-1/2\ge2^{i-2}$ has been used.

For the second moment, it follows from Theorem 3.1 and Assumption B that, for any $f\in\mathcal F$,
\[
\begin{aligned}
e(f,\bar f)&=E\Big[\sum_{l=1}^{k}\sum_{j\ne l}\big|P_l(X)-P_j(X)\big|\,I\big(\mathrm{sign}(\bar g(\bar f(X),l))=1,\ \mathrm{sign}(g(f(X),j))=1\big)\Big]\\
&\ge2\delta\,E\Big[\sum_{l=1}^{k}\sum_{j\ne l}I\big(\mathrm{sign}(\bar g(\bar f(X),l))=1,\ \mathrm{sign}(g(f(X),j))=1\big)\,I\big(|P_l(X)-P_j(X)|\ge2\delta\big)\Big]\\
&\ge\delta\Big(E\Big[2\sum_{l=1}^{k}\sum_{j\ne l}I\big(\mathrm{sign}(\bar g(\bar f(X),l))=1,\ \mathrm{sign}(g(f(X),j))=1\big)\Big]-2c_1\delta^{\alpha}\Big)\\
&=\frac12(4c_1)^{-1/\alpha}\,E\Big[2\sum_{l=1}^{k}\sum_{j\ne l}I\big(\mathrm{sign}(\bar g(\bar f(X),l))=1,\ \mathrm{sign}(g(f(X),j))=1\big)\Big]^{(\alpha+1)/\alpha}\qquad(14)
\end{aligned}
\]
with the choice $\delta=\Big(E\big[2\sum_{l=1}^{k}\sum_{j\ne l}I(\mathrm{sign}(\bar g(\bar f(X),l))=1,\ \mathrm{sign}(g(f(X),j))=1)\big]/(4c_1)\Big)^{1/\alpha}$.

Now we establish a connection between the first and second moments. By Theorem 3.2, $E\big[\psi(\bar g(\bar f(X),Y))-(1-\mathrm{sign}(\bar g(\bar f(X),Y)))\big]=0$. Since $\psi(u)\ge1-\mathrm{sign}(u)$ for any $u\in R^{k-1}$, $E\big|\psi(g_0(f_0(X),Y))-(1-\mathrm{sign}(g_0(f_0(X),Y)))\big|=E\big[\psi(g_0(f_0(X),Y))-(1-\mathrm{sign}(g_0(f_0(X),Y)))\big]\le2e_\psi(f_0,\bar f)$. By the triangle inequality,
\[
\begin{aligned}
E\big[l(f,Z)-l_\psi(f_0,Z)\big]^2&\le2E\big|1-\mathrm{sign}(g(f(X),Y))-\psi(g_0(f_0(X),Y))\big|\\
&\le2\Big(2e_\psi(f_0,\bar f)+E\big|\mathrm{sign}(\bar g(\bar f(X),Y))-\mathrm{sign}(g(f(X),Y))\big|\\
&\qquad+E\big|\mathrm{sign}(\bar g(\bar f(X),Y))-\mathrm{sign}(g_0(f_0(X),Y))\big|\Big).\qquad(15)
\end{aligned}
\]

Note that for any $f\in\mathcal F$,
\[
\begin{aligned}
E\big|\mathrm{sign}(\bar g(\bar f(X),Y))-\mathrm{sign}(g(f(X),Y))\big|
&=E\Big[\sum_{l=1}^{k}I(Y=l)\,\big|\mathrm{sign}(\bar g(\bar f(X),l))-\mathrm{sign}(g(f(X),l))\big|\Big]\\
&=E\Big[2\sum_{l=1}^{k}I(Y=l)\sum_{j\ne l}I\big(\mathrm{sign}(\bar g(\bar f(X),l))=1,\ \mathrm{sign}(g(f(X),j))=1\big)\Big]\\
&\le E\Big[2\sum_{l=1}^{k}\sum_{j\ne l}I\big(\mathrm{sign}(\bar g(\bar f(X),l))=1,\ \mathrm{sign}(g(f(X),j))=1\big)\Big].
\end{aligned}
\]
This, together with (14), implies that
\[
E\big|\mathrm{sign}(\bar g(\bar f(X),Y))-\mathrm{sign}(g(f(X),Y))\big|\le c^*\,e(f,\bar f)^{\alpha/(\alpha+1)},\qquad(16)
\]

where $c^*=2^{\alpha/(\alpha+1)}(4c_1)^{1/(\alpha+1)}$. For any $f\in A_{i,j}$, $e(f,\bar f)^{\alpha/(\alpha+1)}\ge(2^{-1}\delta_n^2)^{\alpha/(\alpha+1)}\ge2^{-1}\delta_n^2\ge s_n\ge e_\psi(f_0,\bar f)$ and $e(f,\bar f)\ge e(f_0,\bar f)$, which together with (15) and (16) imply that
\[
E\big[l(f,Z)-l_\psi(f_0,Z)\big]^2\le2\Big(2e_\psi(f_0,\bar f)+c^*\big[e(f,\bar f)^{\alpha/(\alpha+1)}+e(f_0,\bar f)^{\alpha/(\alpha+1)}\big]\Big)\le c_3'\big(e(f,\bar f)/2\big)^{\alpha/(\alpha+1)},
\]
with $c_3'=16c_1^{1/(\alpha+1)}+8$. Consequently, for $i=1,\dots$ and $j=0,1,\dots$,
\[
\sup_{A_{i,j}}E\big[l_\psi(f_0,Z)-l(f,Z)\big]^2\le c_3'(2^{i-1}\delta_n^2)^{\alpha/(\alpha+1)}\le c_3M(i,j)^{\alpha/(\alpha+1)}=v(i,j)^2,
\]
where $c_3=2c_3'$.

We are now ready to bound $I$. Using the assumption that $\lambda J_0\le\delta_n^2/2$, together with (12) and (13), we have $I\le\sum_{i\ge1,\,j\ge0}P^*\big(\sup_{A_{i,j}}E_n(l_\psi(f_0,Z)-l(f,Z))\ge M(i,j)\big)$. By definition, $l_\psi(f_0,Z)$ and $l(f,Z)$ are bounded between 0 and 2. Then $E\big[l_\psi(f_0,Z)-l(f,Z)\big]^2\le4$ and $E_n\big(l_\psi(f_0,Z)-l(f,Z)\big)\le4$. For convenience, we scale the empirical process by the constant $t=(4c_3^{1/2})^{-1}$ in what follows. Then
\[
I\le\sum_{i,j}P^*\Big(\sup_{A_{i,j}}E_n\big(t[l_\psi(f_0,Z)-l(f,Z)]\big)\ge M_c(i,j)\Big)+\sum_{i}P^*\Big(\sup_{A_{i,0}}E_n\big(t[l_\psi(f_0,Z)-l(f,Z)]\big)\ge M_c(i,0)\Big)=I_1+I_2\qquad(17)
\]
and $\sup_{A_{i,j}}E\big[l_\psi(f_0,Z)-l(f,Z)\big]^2\le v_c(i,j)^2$, where $v_c(i,j)=\min(t^{1/2}v(i,j),1)$ and $M_c(i,j)=\min\big(tM(i,j),\,c_3^{-1/2}\big)$. Note that $v_c(i,j)<1$ implies $M_c(i,j)=tM(i,j)$.

Next we bound $I_1$ and $I_2$ separately. For $I_1$, we verify the required conditions (4.5)-(4.7) in Theorem 3 of Shen and Wong (1994). To compute the metric entropy in (4.7) there, we need to construct a bracketing function of $l_\psi(f_0,Z)-l(f,Z)$. Denote an $\epsilon$-bracketing set for $\{(G_{f_1},\dots,G_{f_k}):\ f\in A_{ij}\}$ by $\{(G^v_{p1},\dots,G^v_{pm}),\,(G^u_{p1},\dots,G^u_{pm})\}$, $p=1,\dots,k$. Let $s^v_{ph}(x)$ be $-1$ if $x\in G^u_{ph}$ and 1 otherwise, and $s^u_{ph}(x)$ be $-1$ if $x\in G^v_{ph}$ and 1 otherwise; $p=1,\dots,k$, $h=1,\dots,m$.

Then $\{(s^v_{p1},\dots,s^v_{pm}),\,(s^u_{p1},\dots,s^u_{pm})\}$ forms an $\epsilon$-bracketing function of $-\mathrm{sign}(g(f(x),p))$ for $f\in A_{ij}$ and $p=1,\dots,k$. This implies that for any $\epsilon\ge0$ and $f\in A_{ij}$, there exists an $h$ ($1\le h\le m$) such that $l^v_h(z)\le l(f,z)-l_\psi(f_0,z)\le l^u_h(z)$ for any $z=(x,y)$, where $l^u_h(z)=1+\sum_{p=1}^{k}s^u_{ph}(x)I(y=p)-l_\psi(f_0,z)$, $l^v_h(z)=1+\sum_{p=1}^{k}s^v_{ph}(x)I(y=p)-l_\psi(f_0,z)$, and $\big(E[l^u_h-l^v_h]^2\big)^{1/2}=\big(\sum_{p=1}^{k}E[(s^u_{ph}(x)-s^v_{ph}(x))I(y=p)]^2\big)^{1/2}\le2\big(\max_pP(G^u_{ph}\,\Delta\,G^v_{ph})\big)^{1/2}\le2\epsilon^{1/2}$. So $\big(E[l^u_h-l^v_h]^2\big)^{1/2}\le\min(2\epsilon^{1/2},2)$. Hence, $H_B(\epsilon,\mathcal F^*(2^j))\le H(\epsilon^2/4,\mathcal G(2^j))$ for any $\epsilon>0$ and $j=0,\dots$, where $\mathcal F^*(2^j)=\{l(f,z)-l_\psi(f_0,z):\ f\in\mathcal F,\ J(f)\le2^j\}$.

R
v
c
(
i;j
)

aM
c
(
i;j
)
H
1
=
2

B
(
u
2
=
4
;
G
(2
j
))
du=M
c
(
i; j
) is non
-
increasing in
i
and
M
c
(
i; j
);
i
= 1
;
¢ ¢ ¢
;
we have,

Z
v
c
(
i;j
)

aM
c
(
i;j
)

H
1
=
2

B
(
u
2
=
4
;
G
(2
j
))
du=M
c
(
i; j
)


Z
c
1
=
2

3
M
c
(1
;j
)

®

2(
®
+1)

aM
c
(1
;j
)

H
1
=
2

B
(
u
2
=
4
;
G
(2
j
))
du=M
c
(1
; j
)

Á
(
²
n
;
2
j
)
;

where
a
=
"=
32 with
"
de¯ned below. Thus (4.7) of Shen and Wong (1994) holds with
M
=

n
1
=
2
M
c
(
i; j
) and
v
=
v
c
(
i; j
)
2
, so does (4.5). In addition, with
T
= 1,

M
c
(
i; j
)
=
v
c
(
i; j
)
2

max(
c
¡
1
=
2

3
; c
¡
(2
®
+3)
=
(2
®
+2)

3
) =
c
¡
1
=
2

3

"=
(4
T
)

implies (4.6) with
"
= 4
c
¡
1
=
2

3
<
1.

Note that $0<\delta_n\le1$ and $\lambda J_0\le\delta_n^2/2$. Using a similar argument as in Shen et al. (2003), an application of Theorem 3 of Shen and Wong (1994) yields that
\[
I_1\le3\exp\Big(-c_5n\big(\lambda J(f_0)\big)^{\frac{\alpha+2}{\alpha+1}}\Big)\Big/\Big[1-\exp\Big(-c_5n\big(\lambda J(f_0)\big)^{\frac{\alpha+2}{\alpha+1}}\Big)\Big]^2.
\]

Here and in the sequel, $c_5$ is a positive generic constant. Similarly, $I_2$ can be bounded. Finally, $I\le6\exp\big(-c_5n(\lambda J(f_0))^{(\alpha+2)/(\alpha+1)}\big)\big/\big[1-\exp\big(-c_5n(\lambda J(f_0))^{(\alpha+2)/(\alpha+1)}\big)\big]^2$. This implies that $I^{1/2}\le(5/2+I^{1/2})\exp\big(-c_5n(\lambda J(f_0))^{(\alpha+2)/(\alpha+1)}\big)$. The result then follows from the fact that $I\le I^{1/2}\le1$.

Proof of Corollary 3.1: The result follows from the assumptions and the exponential inequality in Theorem 3.3.

Proof of Theorem 3.4: The proof is similar to that of Theorem 3.3. For simplicity, we only sketch the parts that require modification. Consider the scaled empirical process $E_n(\tilde l_\psi(f,Z)-\tilde l_\psi(f_0,Z))$ and let $A_{i,j}=\{f\in\mathcal F:\ 2^{i-1}\delta_n^{*2}\le e_\psi(f,\bar f)<2^i\delta_n^{*2},\ 2^{j-1}J_0\le J(f)<2^jJ_0\}$ and $A_{i,0}=\{f\in\mathcal F:\ 2^{i-1}\delta_n^{*2}\le e_\psi(f,\bar f)<2^i\delta_n^{*2},\ J(f)<J_0\}$, for $j=1,2,\dots$ and $i=1,2,\dots$. Using an analogous argument, we have

\[
P\big(e_\psi(\hat f,\bar f)\ge\delta_n^{*2}\big)\le P^*\Big(\sup_{\{f\in\mathcal F:\ e_\psi(f,\bar f)\ge\delta_n^{*2}\}}\ n^{-1}\sum_{i=1}^{n}\big(\tilde l_\psi(f_0,Z_i)-\tilde l_\psi(f,Z_i)\big)\ge0\Big)=I.
\]
To bound $I$, we consider the first and second moments of $\tilde l_\psi(f,Z)-\tilde l_\psi(f_0,Z)$ for $f\in A_{ij}$. For the first moment, it is straightforward to show that, for any integers $i,j\ge1$, $\inf_{A_{i,j}}E\big(\tilde l_\psi(f,Z)-\tilde l_\psi(f_0,Z)\big)\ge M(i,j)=(2^{i-1}\delta_n^{*2})+\lambda(2^{j-1}-1)J(f_0)$, and $\inf_{A_{i,0}}E\big(\tilde l_\psi(f,Z)-\tilde l_\psi(f_0,Z)\big)\ge M(i,0)=2^{i-2}\delta_n^{*2}$.

For the second moment, $e_\psi(f,\bar f)=e(f,\bar f)+\frac12E\big[\psi(g(f(X)))\,I(g(f(X))\in(0,\tau))\big]$ and $e_\psi(f,\bar f)\le1$. Thus
\[
\frac12E\big[\psi(g(f(X)))\,I\big(g(f(X),Y)\in(0,\tau)\big)\big]\le e_\psi(f,\bar f)\le\big(e_\psi(f,\bar f)\big)^{\frac{\alpha}{\alpha+1}}.\qquad(18)
\]
For any $f\in A_{i,j}$, $e_\psi(f,\bar f)\ge2^{-1}\delta_n^{*2}\ge s_n\ge e_\psi(f_0,\bar f)$, which together with (16) and (18) implies that

\[
\begin{aligned}
E\big[l_\psi(f,Z)-l_\psi(f_0,Z)\big]^2&\le2E\big|\mathrm{sign}(g(f(X),Y))-\mathrm{sign}(g_0(f_0(X),Y))\big|\\
&\quad+2E\big[\psi(g_0(f_0(X)))\,I\big(g(f(X),Y)\in(0,\tau)\big)\big]+2E\big[\psi(g(f(X)))\,I\big(g(f(X),Y)\in(0,\tau)\big)\big]\\
&\le2\Big(c^*\big[e_\psi(f,\bar f)^{\alpha/(\alpha+1)}+e_\psi(f_0,\bar f)^{\alpha/(\alpha+1)}\big]\Big)+4\big[e_\psi(f,\bar f)^{\alpha/(\alpha+1)}+e_\psi(f_0,\bar f)^{\alpha/(\alpha+1)}\big]\\
&\le c_3'\big(e_\psi(f,\bar f)/2\big)^{\alpha/(\alpha+1)},
\end{aligned}
\]

with $c_3'=16c_1^{1/(\alpha+1)}+8$. Therefore, $\sup_{A_{i,j}}E\big(l_\psi(f_0,Z)-l_\psi(f,Z)\big)^2\le c_3M(i,j)^{\alpha/(\alpha+1)}=v(i,j)^2$ for $i=1,\dots$ and $j=0,1,\dots$, where $c_3=2c_3'$.

To bound $I$, note that $I\le I_1+I_2$, where $I_1=\sum_{i,j}P^*\big(\sup_{A_{i,j}}E_n(l_\psi(f_0,Z)-l_\psi(f,Z))\ge M(i,j)\big)$ and $I_2=\sum_iP^*\big(\sup_{A_{i,0}}E_n(l_\psi(f_0,Z)-l_\psi(f,Z))\ge M(i,0)\big)$. Thus we can bound each $I_i$ separately.

Using the fact that $\int_{aM(i,j)}^{v(i,j)}H_B^{1/2}\big(u,\mathcal F_\psi(2^j)\big)\,du\big/M(i,j)$ is non-increasing in $i$ and $M(i,j)$, $i=1,\dots$, we have $\int_{aM(i,j)}^{v(i,j)}H_B^{1/2}\big(u,\mathcal F_\psi(2^j)\big)\,du\big/M(i,j)\le\phi^*(\epsilon_n^*,2^j)$. The result then follows from the same argument as in the proof of Theorem 3.3.

Proof of Corollary 3.2: The result follows from the assumptions and the exponential inequality in Theorem 3.4.

Lemma 1: (Metric entropy in Example 3.3.1) Under the assumptions of Example 3.3.1, we have
\[
H_B\big(\epsilon,\mathcal G(\ell)\big)\le O\big(k^2\log(k/\epsilon)\big).
\]

Proof: Let $(G_1,\dots,G_k)$ be a classification partition induced by $f$ and let $G_{j_1j_2}=\{x:\ f_{j_1}-f_{j_2}>0,\ x\in S\}$, $j_1\ne j_2\in\{1,\dots,k\}$. We first construct a bracket for $G_{j_1j_2}$. To this end, we determine the $d$ points at which the plane $f_{j_1}-f_{j_2}=0$ intersects $d$ out of the $d2^{d-1}$ edges of the cube $[0,1]^d$. For each of these $d$ points, we use a bracket of length $\epsilon^*$ to cover it on the edge to which the point belongs. Given an edge, the covering number for this point is no greater than $1/\epsilon^*$. Hence the covering number for the $d$ points on $d$ of the $d2^{d-1}$ edges is at most $\binom{d2^{d-1}}{d}(1/\epsilon^*)^d$.

After the $d$ intersecting points of $f_{j_1}-f_{j_2}=0$ on the edges of $S$ are covered, we then connect the endpoints of the $d$ brackets to form bracketing planes $v_{j_1j_2}=0$ and $u_{j_1j_2}=0$ such that $\{x:v_{j_1j_2}>0\}\subset\{x:f_{j_1}-f_{j_2}>0\}\subset\{x:u_{j_1j_2}>0\}$. Since the longest segment in $S$ has length $\sqrt d$, corresponding to the diagonal between $(0,\dots,0)$ and $(1,\dots,1)$, we have $P\big(x:\ v_{j_1j_2}<0<u_{j_1j_2}\big)\le(\sqrt d)^{d-1}\epsilon^*$, since $x$ is uniformly distributed on $S$. Consequently, $G^v_{j_1j_2}\subset G_{j_1j_2}\subset G^u_{j_1j_2}$ and $P\big(G^v_{j_1j_2}\,\Delta\,G^u_{j_1j_2}\big)\le(\sqrt d)^{d-1}\epsilon^*$, where $G^v_{j_1j_2}=\{x:v_{j_1j_2}>0\}$ and $G^u_{j_1j_2}=\{x:u_{j_1j_2}>0\}$. Since $G_{j_1}=\cap_{j_2}G_{j_1j_2}$, we have $G^v_{j_1}\subset G_{j_1}\subset G^u_{j_1}$ and $P\big(G^v_{j_1}\,\Delta\,G^u_{j_1}\big)\le P\big(\cup_{j_2}G^v_{j_1j_2}\,\Delta\,G^u_{j_1j_2}\big)\le(k-1)(\sqrt d)^{d-1}\epsilon^*$, where $G^v_{j_1}=\cap_{j_2}G^v_{j_1j_2}$ and $G^u_{j_1}=\cap_{j_2}G^u_{j_1j_2}$, $j_1\ne j_2\in\{1,\dots,k\}$.

With $\epsilon=(k-1)(\sqrt d)^{d-1}\epsilon^*$, the collection $\{(G^v_1,G^u_1),\dots,(G^v_k,G^u_k)\}$ satisfies $\max_{j_1}P\big(G^v_{j_1}\,\Delta\,G^u_{j_1}\big)\le\epsilon$ and thus forms an $\epsilon$-bracketing set for $(G_1,\dots,G_k)$. Therefore the $\epsilon$-covering number for all partitions induced by $f$ is at most $\binom{d2^{d-1}}{d}\big((k-1)(\sqrt d)^{d-1}/\epsilon\big)^d\,k(k-1)$. Since $d$ is a constant, the bracketing metric entropy $H_B(\epsilon,\mathcal G(\ell))$ is bounded by $O(k^2\log(k/\epsilon))$ for any $\ell$, yielding the desired result.

Lemma 2: (Metric entropy in Example 3.3.2) Under the assumptions of Example 3.3.2, we have
\[
H_B\big(\epsilon,\mathcal F_\psi(\ell)\big)\le O\big(k(\log(\ell/\epsilon))^{d+1}\big).
\]

Proof: In order to obtain an upper bound for $H_B(\epsilon,\mathcal F_\psi(\ell))$, we use the sup-norm entropy bound for a single function set in Zhou (2002), that is, $H_\infty(\epsilon,\mathcal F(\ell))\le O\big((\log(\ell/\epsilon))^{d+1}\big)$ under the $L_\infty$ metric $\|g\|_\infty=\sup_{x\in S}|g(x)|$. Consider an arbitrary function vector $f=(f_1,\dots,f_k)\in\mathcal F(\ell)$. The metric entropy for all $k$-dimensional function vectors in $\mathcal F(\ell)$ is then bounded by $O\big(k(\log(\ell/\epsilon))^{d+1}\big)$, in order to cover $k$ functions simultaneously. Let $[f^v_j,f^u_j]$ be an $\epsilon$-bracket for $f_j$. Then $[f^v_j-f^u_l,\ f^u_j-f^v_l]$ forms a $2\epsilon$-bracket for $f_j-f_l$. Denote $g^v_j=\min_{l\in\{1,\dots,k\}\setminus j}(f^v_j-f^u_l)$ and $g^u_j=\min_{l\in\{1,\dots,k\}\setminus j}(f^u_j-f^v_l)$. Then $[g^v_j,g^u_j]$ becomes a $2\epsilon$-bracket for $g_{\min}(f,j)=\min_{l\ne j}(f_j-f_l)$. Consequently, $\psi(g^v_j)\ge\psi(g_{\min}(f,j))\ge\psi(g^u_j)$ by the non-increasing property of the $\psi$ function. By (10), we have $|\psi(g^v_j)-\psi(g^u_j)|\le2\epsilon$. Since $g_{\min}(f,y)=\sum_{j=1}^kI(y=j)g_{\min}(f,j)$, we have $g_{\min}(f,y)\in\big[\sum_{j=1}^kI(y=j)g^v_j,\ \sum_{j=1}^kI(y=j)g^u_j\big]$ and $\big|\psi\big(\sum_{j=1}^kI(y=j)g^v_j\big)-\psi\big(\sum_{j=1}^kI(y=j)g^u_j\big)\big|\le2\epsilon$. Consequently, $\big[\psi\big(\sum_{j=1}^kI(y=j)g^u_j(x)\big)-\psi(g_0(f_0(x),y)),\ \psi\big(\sum_{j=1}^kI(y=j)g^v_j(x)\big)-\psi(g_0(f_0(x),y))\big]$ forms a bracket of length $2\epsilon$ for $\psi(g(f(x),y))-\psi(g_0(f_0(x),y))$. The desired result then follows.
lows.

References

An, H. L. T., and Tao, P. D. (1997). Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. J. Global Optim., 11, 253-285.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2003). Convexity, classification, and risk bounds. Technical Report 638, Department of Statistics, U.C. Berkeley.

Boser, B., Guyon, I., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. The Fifth Annual Conference on Computational Learning Theory, Pittsburgh ACM, 142-152.

Cortes, C., and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-279.

Crammer, K., and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265-292.

Guermeur, Y. (2002). Combining discriminant models with new multiclass SVMs. Pattern Analysis and Applications (PAA), 5, 168-179.

Lee, Y., Lin, Y., and Wahba, G. (2004). Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. J. Amer. Statist. Assoc., 99, 67-81.

Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., and Klein, B. (2000). Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Annals of Statistics, 28, 1570-1600.

Lin, Y. (2000). Some asymptotic properties of the support vector machine. Technical Report 1029, Department of Statistics, University of Wisconsin-Madison.

Lin, Y. (2002). Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6, 259-275.

Liu, Y., Shen, X., and Doss, H. (2005). Multicategory psi-learning and support vector machine: computational tools. J. Comput. Graph. Statist., 14(1), 219-236.

Mammen, E., and Tsybakov, A. (1999). Smooth discrimination analysis. Ann. Statist., 27, 1808-1829.

Marron, J. S., and Todd, M. J. (2002). Distance weighted discrimination. Technical Report 1339, School of Operations Research and Industrial Engineering, Cornell University.

Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London A, 209, 415-446.

Shen, X. (1998). On the method of penalization. Statistica Sinica, 8, 337-357.

Shen, X., Tseng, G. C., Zhang, X., and Wong, W. H. (2003). On psi-learning. J. Amer. Statist. Assoc., 98, 724-734.

Shen, X., and Wong, W. H. (1994). Convergence rate of sieve estimates. Ann. Statist., 22, 580-615.

Steinwart, I. (2001). On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2, 67-93.

Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32, 135-166.

Wahba, G. (1998). Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In: B. Schoelkopf, C. J. C. Burges, and A. J. Smola (eds), Advances in Kernel Methods: Support Vector Learning, MIT Press, 125-143.

Weston, J., and Watkins, C. (1999). Support vector machines for multi-class pattern recognition. Proceedings of the Seventh European Symposium on Artificial Neural Networks.

Zhang, T. (2004a). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist., 32, 56-85.

Zhang, T. (2004b). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 1225-1251.

Zhou, D. X. (2002). The covering number in learning theory. Journal of Complexity, 18, 739-767.

Zhu, J., and Hastie, T. (2005). Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14(1), 185-205.

Zhu, J., Hastie, T., Rosset, S., and Tibshirani, R. (2003). 1-norm support vector machines. Neural Information Processing Systems, 16.

Table 1: Testing errors, training errors, and their $\hat e(\cdot,\bar f)$ for SVM and $\psi$-learning using the best $C$ in Examples 1 and 2 with $n=150$, averaged over 100 simulation replications, with standard errors in parentheses. In Example 1 (d.f.=1), the Bayes error is 0.2470, with an improvement of $\psi$-learning over SVM of 43.22%. In Example 2 (d.f.=3), the Bayes error is 0.1456, with an improvement of $\psi$-learning over SVM of 20.41%. Here, the improvement of $\psi$-learning over SVM is defined by $(T(\mathrm{SVM})-T(\psi))/\hat e(\mathrm{SVM},\bar f)$, where $\hat e(\cdot,\bar f)=T(\cdot)-\text{Bayes error}$, and $T(\cdot)$ denotes the testing error of a given method.

Example   Method   Training (s.e.)    Testing (s.e.)     $\hat e(\cdot,\bar f)$ (s.e.)   No. SV (s.e.)
d.f.=1    SVM      0.4002 (0.1469)    0.4305 (0.1405)    0.1835 (0.1405)                 141.76 (10.97)
          psi-L    0.3199 (0.1237)    0.3494 (0.1209)    0.1024 (0.1209)                 64.64 (15.43)
d.f.=3    SVM      0.1447 (0.0267)    0.1505 (0.0045)    0.0049 (0.0045)                 71.81 (11.02)
          psi-L    0.1429 (0.0285)    0.1495 (0.0033)    0.0039 (0.0033)                 41.29 (13.51)
Table 2: Testing errors for the problem letter. Each training dataset is of size 200, selected from a total of 2341 samples.

Case   SVM     psi-L   Improvement (%)
1      .083    .079    3.39
2      .073    .063    12.24
3      .086    .076    11.41
4      .072    .072    0
5      .088    .085    3.74
6      .077    .073    5.45
7      .075    .072    4.39
8      .079    .075    5.92
9      .093    .091    1.51
10     .090    .086    4.11
Average #SVs   51.1    40.8

Figure 1: Perspective plot of the 3-class $\psi$ function defined in (3). [Surface plot of $\psi(u_1,u_2)$ over $u_1,u_2\in[-2,2]$, with values in $[0,2]$.]

Figure 2: Illustration of the concept of margins and support vectors in a 3-class separable example. The instances for classes 1-3 fall respectively into the polyhedrons $D_j$, $j=1,2,3$, where $D_1$ is $\{x:\ f_1(x)-f_2(x)\ge1,\ f_1(x)-f_3(x)\ge1\}$, $D_2$ is $\{x:\ f_2(x)-f_1(x)\ge1,\ f_2(x)-f_3(x)\ge1\}$, and $D_3$ is $\{x:\ f_3(x)-f_1(x)\ge1,\ f_3(x)-f_2(x)\ge1\}$. The generalized geometric margin $\gamma$, defined as $\min\{\gamma_{12},\gamma_{13},\gamma_{23}\}$, is maximized to obtain the decision boundary. There are five support vectors on the boundaries of the three polyhedrons: one from class 1, one from class 2, and three from class 3.