Multiclass Proximal Support Vector Machines
Yongqiang TANG and Hao Helen ZHANG
This article proposes the multiclass proximal support vector machine (MPSVM) classifier, which extends the binary PSVM to the multiclass case. Unlike the one-versus-rest approach that constructs the decision rule based on multiple binary classification tasks, the proposed method considers all classes simultaneously and has better theoretical properties and empirical performance. We formulate the MPSVM as a regularization problem in the reproducing kernel Hilbert space and show that it implements the Bayes rule for classification. In addition, the MPSVM can handle equal and unequal misclassification costs in a unified framework. We suggest an efficient algorithm to implement the MPSVM by solving a system of linear equations. This algorithm requires much less computational effort than solving the standard SVM, which often requires quadratic programming and can be slow for large problems. We also provide an alternative and more robust algorithm for ill-posed problems. The effectiveness of the MPSVM is demonstrated by both simulation studies and applications to cancer classifications using microarray data.

Key Words: Bayes rule; Nonstandard classifications; Reproducing kernel Hilbert space.
1. INTRODUCTION
In multiclass classification problems, the training set consists of $n$ samples $\{(\mathbf{x}_i, y_i);\ i = 1,\ldots,n\}$, where $\mathbf{x}_i \in \mathbb{R}^d$ is the input vector and $y_i \in \{1,\ldots,k\}$ is the class label for the $i$th
sample. Our task is to learn a classiﬁcation rule from the training set to predict the label for
a future sample based on its input. The support vector machine (SVM) (Boser, Guyon, and
Vapnik 1992; Vapnik 1998; Burges 1998; Cristianini and ShaweTaylor 2000) has performed
successfully in many realworld problems. The SVM is attractive in its ability to condense
the information contained in the training set and to ﬁnd a decision surface determined by
certain points in the training set. Lee, Lin, and Wahba (2004) further generalized the SVM
to multicategory SVM (MSVM). Other multiclass versions of SVM have been studied by
Vapnik (1998) and Weston and Watkins (1999).
The implementation of the SVM and MSVM demands solving quadratic programming problems under linear constraints, which can be computationally expensive for large datasets. For multiclass problems, the computation can be very challenging even for moderately sized datasets if the number of classes $k$ is large. Proximal support vector machines (PSVM) were recently introduced as a variant of the SVM for binary classification by Suykens and Vandewalle (1999) and Fung and Mangasarian (2001). In theory, both the PSVM and SVM target the optimal Bayes rule asymptotically, which explains their comparable performance in most empirical studies. However, it is much faster to train a PSVM classifier by simply solving a system of linear equations. Suykens and Vandewalle (1999), Suykens et al. (2002), and Van Gestel et al. (2002a) proposed the least squares SVM (LS-SVM). The formulation of the LS-SVM is similar to that of the PSVM except that the latter penalizes the constant term as well. Extensions of the PSVM and LS-SVM to multiclass problems have been considered by Fung and Mangasarian (2005) and Van Gestel et al. (2002b) based on the one-versus-rest scheme. Their main idea is to solve $k$ binary classification problems by separating one class from the rest, then construct the decision rule according to the maximal output. One disadvantage of this scheme is that the resulting binary problems are often very unbalanced, leading to poor performance in some cases (Fung and Mangasarian 2005). In addition, it is not easy to take into account unequal misclassification costs in the one-versus-rest approach.

Yongqiang Tang is Statistician and Assistant Professor, Department of Psychiatry, SUNY Health Science Center, Brooklyn, NY 11203 (Email: yongqiang tang@yahoo.com). Hao Helen Zhang is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (Email: hzhang@stat.ncsu.edu).

© 2006 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America. Journal of Computational and Graphical Statistics, Volume 15, Number 2, Pages 339–355. DOI: 10.1198/106186006X113647
This article proposes the multiclass PSVM (MPSVM) using an alternative formulation. The proposed method directly targets the boundaries among the $k$ classes simultaneously by estimating some functions of the conditional probabilities. The MPSVM is constructed in a regularization framework of reproducing kernel Hilbert spaces (RKHS) and is optimal in the sense of implementing the Bayes rule asymptotically. Compared to approaches based on the one-versus-rest scheme, the MPSVM is more flexible in handling nonstandard situations such as unequal misclassification costs and nonrepresentative training samples. With regard to computation, the MPSVM can be solved more efficiently than the MSVM.

The article is organized as follows. Section 2 briefly states the Bayes classification rule. Section 3 reviews the binary SVM and PSVM. Section 4 introduces the MPSVM, studies its statistical properties, and proposes two computation algorithms. Section 5 illustrates the performance of the MPSVM with simulation examples. Section 6 applies the MPSVM to cancer classification using gene expression data.
2. BAYES CLASSIFICATION RULES
Bayes rule is the optimal classiﬁcation rule if the underlying distribution of the data is
known. It serves as the gold standard to which any classiﬁer can be compared. In practice,
the underlying distribution is rarely known. One common way to approximate the Bayes
rule is to estimate the distribution or related classiﬁcation functions from the training set.
For a $k$-class classification problem, we need to learn a classification rule $\phi(\mathbf{x}): \mathbb{R}^d \to \{1,\ldots,k\}$ from the training set. Assume that the samples in the training set are independently and identically drawn from some distribution $P(\mathbf{x}, y)$. Let $p_j(\mathbf{x}) = \Pr(Y = j \mid \mathbf{X} = \mathbf{x})$ be the conditional probability of class $j$ given $\mathbf{X} = \mathbf{x}$ for $j = 1,\ldots,k$. Let $C$ be a $k \times k$ cost matrix, with $c_{jl}$ the cost of incorrectly classifying a sample from class $j$ to class $l$. All $c_{jj}$'s are equal to 0 because we do not penalize correct decisions. The Bayes rule, minimizing the expected cost of misclassifying an observation

$$E\left[c_{Y\phi(\mathbf{X})}\right] = E_{\mathbf{X}}\left[\sum_{l=1}^{k} c_{l\phi(\mathbf{x})} \Pr(Y = l \mid \mathbf{X} = \mathbf{x})\right] = E_{\mathbf{X}}\left[\sum_{l=1}^{k} c_{l\phi(\mathbf{x})}\, p_l(\mathbf{x})\right],$$

is given by

$$\phi_B(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\ \sum_{l=1}^{k} c_{lj}\, p_l(\mathbf{x}). \quad (2.1)$$

With equal misclassification costs, that is, $c_{jl} = 1$ for $j \neq l$, the Bayes rule simplifies to

$$\phi_B(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\ [1 - p_j(\mathbf{x})] = \arg\max_{j=1,\ldots,k}\ p_j(\mathbf{x}), \quad (2.2)$$

which can be interpreted as the minimizer of the expected misclassification rate $E[\mathbf{1}\{Y \neq \phi(\mathbf{X})\}]$.
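The rule (2.1) amounts to picking the column of the cost matrix with the smallest expected cost; a minimal numpy sketch with hypothetical probabilities and costs (not values from the article):

```python
import numpy as np

def bayes_rule(p, C):
    """Bayes rule (2.1): assign x to arg min_j sum_l c_{lj} p_l(x).

    p -- length-k vector of conditional probabilities p_l(x)
    C -- k x k cost matrix, C[l, j] = cost of classifying class l as class j
    Returns a 1-based class label.
    """
    expected_cost = C.T @ p            # entry j is sum_l c_{lj} p_l(x)
    return int(np.argmin(expected_cost)) + 1

p = np.array([0.2, 0.5, 0.3])

# Equal costs: (2.1) reduces to arg max_j p_j(x), as in (2.2)
C_equal = 1.0 - np.eye(3)
label_equal = bayes_rule(p, C_equal)   # picks the most probable class

# Unequal costs: make mistakes on true class 3 ten times as expensive;
# the decision moves toward class 3 even though it is not the most probable
C_skew = C_equal.copy()
C_skew[2, :2] = 10.0
label_skew = bayes_rule(p, C_skew)
```

With equal costs the most probable class (class 2 here) wins; inflating the cost of misclassifying class-3 samples shifts the decision to class 3.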
The majority of classification methods consider only the standard setting, which assumes that the samples in the training set truly represent the target population and imposes equal costs on different types of misclassifications. However, nonstandard classification problems may arise in many real situations. When one type of misclassification is more serious than others, we should apply unequal costs. For example, in the diagnosis of a disease, classifying a diseased person as healthy or a healthy person as diseased may have different consequences. See Lin, Lee, and Wahba (2002) for more discussion of this issue. Sampling bias is another concern that merits special attention. When the class proportions are quite unbalanced in the population, one often tends to oversample minor classes and downsample major classes (Lin, Lee, and Wahba 2002; Lee, Lin, and Wahba 2004). If the sampling scheme depends only on the class label $Y$ and not on the input $\mathbf{X}$, the Bayes rule
is

$$\phi_B^s(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\ \sum_{l=1}^{k} c_{lj}\,\frac{\pi_l}{\pi_l^s}\, p_l^s(\mathbf{x}),$$

where $\pi_l$ is the prior proportion of class $l$ in the target population, $\pi_l^s$ is the proportion of class $l$ in the training set, and $p_l^s(\mathbf{x}) = \Pr(Y^s = l \mid \mathbf{X} = \mathbf{x})$ is the conditional probability of a sample in the training set belonging to class $l$ given $\mathbf{X} = \mathbf{x}$. In this article, we do not distinguish the two nonstandard cases, since the issue of sampling bias can be taken into account by modifying the costs as

$$c_{jl}^{\text{new}} = c_{jl}\,\frac{\pi_j}{\pi_j^s}. \quad (2.3)$$
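The adjustment (2.3) is a row-wise rescaling of the cost matrix; a minimal sketch, assuming (as in the notation of this section) that the first index of $c_{jl}$ is the true class, with illustrative proportions:

```python
import numpy as np

def adjust_costs(C, pi_pop, pi_train):
    """Fold sampling bias into the cost matrix, per (2.3): the cost of
    misclassifying a class-j sample is inflated by pi_j / pi_j^s, the
    ratio of its population proportion to its training proportion.

    C        -- (k, k), C[j, l] = cost of classifying a class-j sample as class l
    pi_pop   -- (k,) class proportions in the target population
    pi_train -- (k,) class proportions in the training set
    """
    ratio = np.asarray(pi_pop) / np.asarray(pi_train)
    return C * ratio[:, None]          # scale row j by pi_j / pi_j^s

# Hypothetical numbers: class 3 dominates the population but was
# downsampled to a balanced training set
C = 1.0 - np.eye(3)
pi_pop = np.array([0.1, 0.1, 0.8])
pi_train = np.array([1/3, 1/3, 1/3])
C_new = adjust_costs(C, pi_pop, pi_train)
```

Mistakes on the downsampled class become more expensive (its row is scaled by 2.4 here), while mistakes on the oversampled classes become cheaper.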
3. BINARY SVM AND PROXIMAL SVM
In binary classification problems, the class label $y_i$ is often coded as $+1$ or $-1$. Originally proposed as a maximal margin classifier, the SVM methodology can also be cast as a regularization problem in a reproducing kernel Hilbert space (RKHS). See Wahba (1990), Girosi (1998), and Poggio and Girosi (1998) for more details. Let $\mathcal{H}_K$ be an RKHS associated with the reproducing kernel $K$. The SVM solves

$$\min_{f}\ \frac{1}{n}\sum_{i=1}^{n} L(y_i)\,[1 - y_i f(\mathbf{x}_i)]_+ + \lambda\|h\|^2_{\mathcal{H}_K}, \quad (3.1)$$

over all functions of the form $f(\mathbf{x}) = \beta_0 + h(\mathbf{x})$, where $h \in \mathcal{H}_K$, $(v)_+ = \max\{v, 0\}$, and $L(y_i)$ is the cost of misclassifying the $i$th observation. The classification rule is $\mathrm{sign}[f(\mathbf{x})]$.
The proximal SVM was first studied by Suykens and Vandewalle (1999) and Fung and Mangasarian (2001). Unlike the SVM, the PSVM method classifies points to the closest of two parallel hyperplanes that are pushed apart as far as possible. In the regularization framework, we can formulate the PSVM as

$$\min_{f}\ \frac{1}{n}\sum_{i=1}^{n} L(y_i)\,[1 - y_i f(\mathbf{x}_i)]^2 + \lambda\left(\|h\|^2_{\mathcal{H}_K} + \beta_0^2\right). \quad (3.2)$$

Since $y_i = 1$ or $-1$, (3.2) is equivalent to

$$\min_{f}\ \frac{1}{n}\sum_{i=1}^{n} L(y_i)\,[y_i - f(\mathbf{x}_i)]^2 + \lambda\left(\|h\|^2_{\mathcal{H}_K} + \beta_0^2\right). \quad (3.3)$$

Thus, the PSVM can be interpreted as a ridge regression model (Agarwal 2002). By the representer theorem (Kimeldorf and Wahba 1971), the solution to (3.1) or (3.2) has the finite representation $f(\mathbf{x}) = \beta_0 + \sum_{i=1}^{n}\beta_i K(\mathbf{x}, \mathbf{x}_i)$. When equal misclassification costs are used, (3.2) becomes

$$\min_{f}\ \frac{1}{n}\sum_{i=1}^{n} [1 - y_i f(\mathbf{x}_i)]^2 + \lambda\left(\sum_{j=1}^{n}\sum_{l=1}^{n}\beta_j\beta_l K(\mathbf{x}_j, \mathbf{x}_l) + \beta_0^2\right). \quad (3.4)$$

The formulation (3.4) is slightly different from the kernel PSVM studied by Fung and Mangasarian (2001), where they solved

$$\min_{f}\ \frac{1}{n}\sum_{i=1}^{n} [1 - y_i f(\mathbf{x}_i)]^2 + \lambda\left(\sum_{j=1}^{n}\beta_j^2 + \beta_0^2\right).$$
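Because the squared loss in (3.3) turns the problem into a generalized ridge regression, the binary PSVM with equal costs can be fit by one linear solve: with $Z = [\mathbf{1}\ \hat K]$ and $G = \mathrm{diag}(1, \hat K)$, the normal equations are $(Z^{\top}Z + n\lambda G)\mathbf{b} = Z^{\top}\mathbf{y}$. A self-contained sketch on toy data (the helper names, tuning values, and the toy clusters are ours, not the article's):

```python
import numpy as np

def rbf_kernel(X1, X2, sigma2=1.0):
    """Gaussian kernel K(x, x') = exp(-||x - x'||^2 / sigma2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def psvm_fit(X, y, lam=0.1, sigma2=1.0):
    """Binary PSVM (3.3), equal costs: ridge regression of y on [1, K],
    penalizing beta' K beta + beta_0^2 via the block matrix G."""
    n = len(y)
    K = rbf_kernel(X, X, sigma2)
    Z = np.hstack([np.ones((n, 1)), K])
    G = np.zeros((n + 1, n + 1))
    G[0, 0] = 1.0
    G[1:, 1:] = K
    b = np.linalg.solve(Z.T @ Z + n * lam * G, Z.T @ y)
    return b

def psvm_predict(b, Xtrain, Xnew, sigma2=1.0):
    Kx = rbf_kernel(Xnew, Xtrain, sigma2)
    return np.sign(b[0] + Kx @ b[1:])

# Toy check on two well-separated clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
y = np.r_[-np.ones(20), np.ones(20)]
b = psvm_fit(X, y)
acc = (psvm_predict(b, X, X) == y).mean()   # training accuracy
```

The only expensive step is one $(n{+}1)\times(n{+}1)$ linear solve, which is the computational advantage over the quadratic program of the hinge-loss SVM in (3.1).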
4. MULTICLASS PROXIMAL SVM
This section introduces the multiclass proximal SVM (MPSVM). The proposed MPSVM embraces the binary PSVM as a special case. Section 4.1 presents the formulation of the MPSVM and investigates its solution and asymptotic properties. Section 4.2 introduces two efficient computational strategies for implementing the MPSVM.
4.1 FORMULATION AND THEORETICAL PROPERTIES
In multiclass classification problems, the class label $y_i$ can be coded as a $k$-dimensional vector

$$\mathbf{y}_i = \begin{cases} \left(1, \tfrac{-1}{k-1}, \ldots, \tfrac{-1}{k-1}\right) & \text{if sample } i \text{ belongs to class } 1, \\[2pt] \left(\tfrac{-1}{k-1}, 1, \ldots, \tfrac{-1}{k-1}\right) & \text{if sample } i \text{ belongs to class } 2, \\ \quad\vdots \\ \left(\tfrac{-1}{k-1}, \tfrac{-1}{k-1}, \ldots, 1\right) & \text{if sample } i \text{ belongs to class } k. \end{cases}$$

This coding was first used by Lee, Lin, and Wahba (2004). We consider a $k$-tuple separating function $\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_k(\mathbf{x}))$. Since the sum of the components in $\mathbf{y}_i$ is 0, we let $\mathbf{f}$ satisfy the sum-to-zero constraint, $\sum_{j=1}^{k} f_j(\mathbf{x}) = 0$ for any $\mathbf{x} \in \mathbb{R}^d$. Analogous to the binary case, we assume that $\mathbf{f}(\mathbf{x}) \in \prod_{j=1}^{k}(\{1\} + \mathcal{H}_K)$, the tensor product space of $k$ reproducing kernel Hilbert spaces $\mathcal{H}_K$. Each component $f_j(\mathbf{x})$ can be expressed as $\beta_{j0} + h_j(\mathbf{x})$, with $\beta_{j0} \in \mathbb{R}$ and $h_j \in \mathcal{H}_K$. For each $\mathbf{y}_i$, define the diagonal matrix $W(\mathbf{y}_i) = \mathrm{diag}\{c_{l_i 1}, c_{l_i 2}, \ldots, c_{l_i k}\}$, where $l_i$ is the membership of $\mathbf{y}_i$ and $c_{l_i j}$ denotes the cost of classifying the $i$th sample to class $j$. The MPSVM is proposed as
$$\min_{\mathbf{f}}\ \frac{1}{n}\sum_{i=1}^{n} [\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)]^{\top} W(\mathbf{y}_i)\,[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)] + \frac{1}{2}\lambda\sum_{j=1}^{k}\left(\|h_j\|^2_{\mathcal{H}_K} + \beta_{j0}^2\right), \quad (4.1)$$

subject to $\sum_{j=1}^{k} f_j(\mathbf{x}) = 0$ for any $\mathbf{x}$. The classification rule induced by $\mathbf{f}(\mathbf{x})$ is

$$\phi(\mathbf{x}) = \arg\max_{j=1,\ldots,k}\ f_j(\mathbf{x}).$$

This formulation handles the standard and nonstandard situations in a unified manner through the choice of cost functions. The solution to (4.1) is characterized in Theorem 1. The proof of this theorem and all other proofs are relegated to the Appendix.

Theorem 1 (Representer theorem). If the reproducing kernel $K$ is positive definite, minimizing (4.1) under the sum-to-zero constraint is equivalent to finding $f_1(\mathbf{x}), \ldots, f_k(\mathbf{x})$ of the form

$$f_j(\mathbf{x}) = \beta_{j0} + \sum_{i=1}^{n}\beta_{ji} K(\mathbf{x}, \mathbf{x}_i), \quad (4.2)$$

subject to $\sum_{j=1}^{k}\beta_{jl} = 0$ for $l = 0,\ldots,n$.
We show that formulation (4.1) embraces the binary PSVM as a special case. Let $k = 2$. For any sample from the positive class, $\mathbf{y}_i = (1, -1)$ ($1$ in the binary notation), we have $W(\mathbf{y}_i) = \mathrm{diag}\{0, L(1)\}$ and $[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)]^{\top} W(\mathbf{y}_i)[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)] = L(1)[f_2(\mathbf{x}_i) + 1]^2 = L(1)[1 - f_1(\mathbf{x}_i)]^2$, where $L(1)$ is the cost of misclassifying a positive sample. Similarly, when $\mathbf{y}_i = (-1, 1)$ ($-1$ in the binary notation), we have $W(\mathbf{y}_i) = \mathrm{diag}\{L(-1), 0\}$ and $[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)]^{\top} W(\mathbf{y}_i)[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)] = L(-1)[-1 - f_1(\mathbf{x}_i)]^2$, where $L(-1)$ is the cost of misclassifying a negative sample. Thus, the data-fit functional in (3.3) is identical to that in (4.1), with $f(\mathbf{x})$ in (3.3) playing the same role as $f_1(\mathbf{x})$ in (4.1). It is also true that

$$\frac{\lambda}{2}\sum_{j=1}^{2}\left(\|h_j\|^2_{\mathcal{H}_K} + \beta_{j0}^2\right) = \frac{\lambda}{2}\left(\|h_1\|^2_{\mathcal{H}_K} + \beta_{10}^2 + \|h_2\|^2_{\mathcal{H}_K} + \beta_{20}^2\right) = \lambda\left(\|h_1\|^2_{\mathcal{H}_K} + \beta_{10}^2\right),$$

because we have $\beta_{10} + \beta_{20} = 0$ and $h_1(\mathbf{x}) + h_2(\mathbf{x}) = 0$ for any $\mathbf{x}$ according to Theorem 1.
The next theorem shows that the MPSVM implements the Bayes rule asymptotically under certain regularity conditions. We use the theoretical framework of Cox and O'Sullivan (1990) to analyze the asymptotics of penalized methods. In (4.1), the data-fit functional component indicates that the solution should follow the pattern in the data, whereas the penalty component imposes smoothing conditions. As the sample size $n$ goes to infinity, the limit of the data-fit functional, $E\{[\mathbf{Y} - \mathbf{f}(\mathbf{X})]^{\top} W(\mathbf{Y})[\mathbf{Y} - \mathbf{f}(\mathbf{X})]\}$, can be used to identify the target function, which is the minimizer of the limiting functional. Under the assumption that the target function can be approximated by elements in the RKHS, and under certain other regularity conditions, the solution to (4.1) will approach the target function as $n \to \infty$.

Theorem 2. The minimizer of $E\{[\mathbf{Y} - \mathbf{f}(\mathbf{X})]^{\top} W(\mathbf{Y})[\mathbf{Y} - \mathbf{f}(\mathbf{X})]\}$ under the sum-to-zero constraint implements the Bayes rule, that is,

$$\arg\max_{j=1,\ldots,k}\ f_j(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\ \sum_{l=1}^{k} c_{lj}\, p_l(\mathbf{x}).$$

When equal misclassification costs are used,

$$\arg\max_{j=1,\ldots,k}\ f_j(\mathbf{x}) = \arg\max_{j=1,\ldots,k}\ p_j(\mathbf{x}).$$
Remark 1. Inverting the expression (A.1) in the Appendix provides estimates of the $k$ class conditional probabilities. With equal misclassification costs, $\tau_j(\mathbf{x})$ reduces to $1 - p_j(\mathbf{x})$; thus

$$\hat{p}_j(\mathbf{x}) = 1 - (k-1)\,\frac{1/[1 + (k-1)\hat{f}_j(\mathbf{x})]}{\sum_{l=1}^{k} 1/[1 + (k-1)\hat{f}_l(\mathbf{x})]}, \qquad \text{for } j = 1,\ldots,k. \quad (4.3)$$
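With equal costs, the population minimizer (A.1) can be inverted to read off probability estimates from the fitted $\hat f_j$, as described in Remark 1. A minimal numpy sketch of that inversion (our own algebraic rearrangement: $\tau_j = 1 - p_j$ is proportional to $1/[f_j + 1/(k-1)]$ and the $\tau_j$ sum to $k-1$), with a round-trip check:

```python
import numpy as np

def probs_from_f(f):
    """Recover class probabilities from MPSVM scores under equal costs
    by inverting (A.1): tau_j = 1 - p_j is proportional to
    1 / (f_j + 1/(k-1)), and sum_j tau_j = k - 1."""
    f = np.asarray(f, dtype=float)
    k = len(f)
    inv = 1.0 / (f + 1.0 / (k - 1))    # proportional to tau_j
    tau = (k - 1) * inv / inv.sum()    # normalize so sum(tau) = k - 1
    return 1.0 - tau

# Round trip: build the population minimizer (A.1) from known
# probabilities, then invert it back.
p = np.array([0.6, 0.3, 0.1])
k = len(p)
tau = 1.0 - p
f = k * (1 / tau) / ((k - 1) * (1 / tau).sum()) - 1 / (k - 1)
p_back = probs_from_f(f)
```

The round trip recovers the original probabilities exactly at the population minimizer; with finite-sample $\hat f_j$ the output is only an estimate.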
4.2 COMPUTATION ALGORITHMS
This section shows that the implementation of the MPSVM only requires solving systems of linear equations. This computational convenience makes the MPSVM a promising classification tool for large datasets. Many classification methods involve parameters that need to be adaptively chosen with strategies like cross-validation. Finer tuning is possible for the MPSVM due to its computational advantage. We present two computation strategies and summarize them as Algorithms 1 and 2. Based on our experience, Algorithm 1 runs faster, though it may fail when $K$ is close to singular, causing difficulty in matrix inversion. Algorithm 2 is specifically designed for ill-posed problems.
Assume that the kernel $K$ is strictly positive definite. Let $\hat{K}$ be the Gram matrix with $(i,l)$th entry $K(\mathbf{x}_i, \mathbf{x}_l)$. Let $Z = [\mathbf{1}_n\ \hat{K}]$, $\mathbf{1}_n = (1,\ldots,1)^{\top}$, $\mathbf{0}_n = (0,\ldots,0)^{\top}$, and

$$G = \begin{pmatrix} 1 & \mathbf{0}_n^{\top} \\ \mathbf{0}_n & \hat{K} \end{pmatrix}.$$

For $j = 1,\ldots,k$, define $\mathbf{y}_j^* = (y_{1j},\ldots,y_{nj})^{\top}$, $\boldsymbol{\beta}_j = (\beta_{j0},\beta_{j1},\ldots,\beta_{jn})^{\top}$, and the $n \times n$ diagonal matrix $W_j^* = \mathrm{diag}\{c_{\mathrm{cat}\{1\}j},\ldots,c_{\mathrm{cat}\{n\}j}\}$, where $\mathrm{cat}\{l\}$ is the membership of the $l$th sample and $c_{\mathrm{cat}\{l\}j}$ is the cost of classifying the $l$th sample to class $j$. Further, we define $\lambda^* = n\lambda/2$, $\boldsymbol{\beta}_j^* = G^{1/2}\boldsymbol{\beta}_j$, $Z^* = ZG^{-1/2}$, $Z_j^{**} = W_j^{*1/2}Z^*$, and $\mathbf{y}_j^{**} = W_j^{*1/2}\mathbf{y}_j^*$ for $j = 1,\ldots,k$. By Theorem 1, minimizing (4.1) is equivalent to minimizing

$$\sum_{j=1}^{k}\left[(\mathbf{y}_j^* - Z\boldsymbol{\beta}_j)^{\top}W_j^*(\mathbf{y}_j^* - Z\boldsymbol{\beta}_j) + \lambda^*\boldsymbol{\beta}_j^{\top}G\boldsymbol{\beta}_j\right] = \sum_{j=1}^{k}\left[(\mathbf{y}_j^{**} - Z_j^{**}\boldsymbol{\beta}_j^*)^{\top}(\mathbf{y}_j^{**} - Z_j^{**}\boldsymbol{\beta}_j^*) + \lambda^*\boldsymbol{\beta}_j^{*\top}\boldsymbol{\beta}_j^*\right], \quad (4.4)$$

subject to $\sum_{j=1}^{k}\boldsymbol{\beta}_j = \sum_{j=1}^{k}\boldsymbol{\beta}_j^* = \mathbf{0}_{n+1}$.
To solve (4.4), we consider its Wolfe dual problem

$$L_D = \sum_{j=1}^{k}\left[(\mathbf{y}_j^{**} - Z_j^{**}\boldsymbol{\beta}_j^*)^{\top}(\mathbf{y}_j^{**} - Z_j^{**}\boldsymbol{\beta}_j^*) + \lambda^*\boldsymbol{\beta}_j^{*\top}\boldsymbol{\beta}_j^*\right] - 2\mathbf{u}^{*\top}\sum_{j=1}^{k}\boldsymbol{\beta}_j^*,$$

where $\mathbf{u}^* = 2\mathbf{u} \in \mathbb{R}^{n+1}$ is the Lagrange multiplier vector. Setting the gradient of $L_D$ with respect to $\boldsymbol{\beta}_j^*$ equal to zero yields

$$\frac{\partial L_D}{\partial \boldsymbol{\beta}_j^*} = 2Z_j^{**\top}\left(Z_j^{**}\boldsymbol{\beta}_j^* - \mathbf{y}_j^{**}\right) + 2\lambda^*\boldsymbol{\beta}_j^* - 2\mathbf{u}^* = \mathbf{0}.$$

Thus,

$$\boldsymbol{\beta}_j^* = \left(Z_j^{**\top}Z_j^{**} + \lambda^* I\right)^{-1}\left(Z_j^{**\top}\mathbf{y}_j^{**} + \mathbf{u}^*\right) = B_j\left(Z_j^{**\top}\mathbf{y}_j^{**} + \mathbf{u}^*\right), \quad (4.5)$$

where $B_j = \left(Z_j^{**\top}Z_j^{**} + \lambda^* I\right)^{-1}$ for $j = 1,\ldots,k$. Since $\mathbf{0} = \sum_{j=1}^{k}\boldsymbol{\beta}_j^* = \sum_{j=1}^{k}B_j\left(Z_j^{**\top}\mathbf{y}_j^{**} + \mathbf{u}^*\right)$, we get the dual solution

$$\mathbf{u}^* = -\left(\sum_{j=1}^{k}B_j\right)^{-1}\sum_{j=1}^{k}B_j Z_j^{**\top}\mathbf{y}_j^{**}. \quad (4.6)$$

The primal solution to (4.4) is

$$\boldsymbol{\beta}_j = G^{-1/2}\boldsymbol{\beta}_j^* = A_j\left(Z^{\top}W_j^*\mathbf{y}_j^* + \mathbf{v}^*\right), \quad (4.7)$$

where $A_j = G^{-1/2}B_j G^{-1/2} = \left(Z^{\top}W_j^* Z + \lambda^* G\right)^{-1}$ and $\mathbf{v}^* = -\left(\sum_{j=1}^{k}A_j\right)^{-1}\sum_{j=1}^{k}A_j Z^{\top}W_j^*\mathbf{y}_j^*$. For any (new) sample $\mathbf{x}$, we define the vector $K_{\mathbf{x}} = \left(K(\mathbf{x}, \mathbf{x}_1),\ldots,K(\mathbf{x}, \mathbf{x}_n)\right)$ and compute $\hat{f}_j(\mathbf{x}) = [1, K_{\mathbf{x}}]\boldsymbol{\beta}_j$ for $j = 1,\ldots,k$ for prediction. We summarize Algorithm 1 as follows.
Algorithm 1.
(a) Solve (4.4) using (4.7).
(b) To predict the label for $\mathbf{x}$, compute $\hat{f}_j(\mathbf{x})$ for $j = 1,\ldots,k$ and classify the sample to the class corresponding to the largest $\hat{f}_j(\mathbf{x})$.
Though the matrix $\hat{K}$ is strictly positive definite in theory, it can be numerically ill-posed in practice. Algorithm 1 may fail when $\hat{K}$, and hence $G$ and $A_j$ ($j = 1,\ldots,k$), are close to singular. Therefore, in Algorithm 2 we avoid the inversion of $G$ using the relationships

$$ZG^{-1} = [\mathbf{1}_n, I], \qquad ZG^{-1}Z^{\top} = \mathbf{1}_n\mathbf{1}_n^{\top} + \hat{K},$$

which can be exploited by some strategic matrix transformations of the solutions (4.5) and (4.6) via the Sherman-Morrison-Woodbury formula. Our experience shows that when $\hat{K}$ is well-conditioned, we should use Algorithm 1 because it is much faster than Algorithm 2. Otherwise, Algorithm 2 is preferable because it provides robust solutions.
By using the Sherman-Morrison-Woodbury formula, we have

$$B_j = \left(Z_j^{**\top}Z_j^{**} + \lambda^* I\right)^{-1} = \left[I - Z_j^{**\top}\left(\lambda^* I + Z_j^{**}Z_j^{**\top}\right)^{-1}Z_j^{**}\right]\big/\lambda^* = \left(I - Z^{*\top}C_j Z^*\right)/\lambda^*, \quad (4.8)$$

where

$$C_j = W_j^{*1/2}\left(\lambda^* I + W_j^{*1/2}Z^*Z^{*\top}W_j^{*1/2}\right)^{-1}W_j^{*1/2} = \mathbf{l}_j^* \circ \left(\lambda^* I + \mathbf{l}_j^* \circ Z^*Z^{*\top}\right)^{-1}. \quad (4.9)$$

Here $\mathbf{l}_j$ is the column vector containing the diagonal elements of $W_j^{*1/2}$, $\mathbf{l}_j^* = \mathbf{l}_j\mathbf{l}_j^{\top}$, and $\circ$ is the elementwise product of two matrices. It is easy to verify that

$$B_j Z_j^{**\top}\mathbf{y}_j^{**} = \left[I - Z_j^{**\top}\left(\lambda^* I + Z_j^{**}Z_j^{**\top}\right)^{-1}Z_j^{**}\right]Z_j^{**\top}\mathbf{y}_j^{**}\big/\lambda^* = Z_j^{**\top}\left(\lambda^* I + Z_j^{**}Z_j^{**\top}\right)^{-1}\mathbf{y}_j^{**} = Z^{*\top}C_j\mathbf{y}_j^*. \quad (4.10)$$

Let $C = k^{-1}\sum_{j=1}^{k}C_j$. Then

$$B = \sum_{j=1}^{k}B_j = k\left(I - Z^{*\top}CZ^*\right)/\lambda^*. \quad (4.11)$$

By the Sherman-Morrison-Woodbury formula,

$$B^{-1} = \lambda^* k^{-1}\left[I - Z^{*\top}\left(Z^*Z^{*\top} - C^{-1}\right)^{-1}Z^*\right]. \quad (4.12)$$
The dual solution is

$$\mathbf{u}^* = -B^{-1}\sum_{j=1}^{k}B_j Z_j^{**\top}\mathbf{y}_j^{**} = -\lambda^* k^{-1}\left[I - Z^{*\top}DZ^*\right]Z^{*\top}\sum_{j=1}^{k}C_j\mathbf{y}_j^* = -\lambda^* Z^{*\top}Q, \quad (4.13)$$

where $D = \left(Z^*Z^{*\top} - C^{-1}\right)^{-1}$, $H = k^{-1}\sum_{j=1}^{k}C_j\mathbf{y}_j^*$, and $Q = H - D\left(Z^*Z^{*\top}\right)H = -DC^{-1}H$.
Combining (4.8), (4.10), and (4.13), we get the primal solution

$$\boldsymbol{\beta}_j^* = B_j Z_j^{**\top}\mathbf{y}_j^{**} + B_j\mathbf{u}^* = Z^{*\top}C_j\mathbf{y}_j^* - \left(I - Z^{*\top}C_j Z^*\right)Z^{*\top}Q = Z^{*\top}\left[C_j\left(\mathbf{y}_j^* + Z^*Z^{*\top}Q\right) - Q\right],$$

and

$$\boldsymbol{\beta}_j = G^{-1/2}\boldsymbol{\beta}_j^* = G^{-1}Z^{\top}\left[C_j\left(\mathbf{y}_j^* + Z^*Z^{*\top}Q\right) - Q\right] = [\mathbf{1}_n, I]^{\top}\left[C_j\left(\mathbf{y}_j^* + Z^*Z^{*\top}Q\right) - Q\right].$$

The prediction $\hat{f}_j$ is given by

$$\hat{f}_j(\mathbf{x}) = [1, K_{\mathbf{x}}]\boldsymbol{\beta}_j = \left(\mathbf{1}_n^{\top} + K_{\mathbf{x}}\right)\left[C_j\left(\mathbf{y}_j^* + Z^*Z^{*\top}Q\right) - Q\right]. \quad (4.14)$$
Algorithm 2.
(a) Compute $Z^*Z^{*\top} = ZG^{-1}Z^{\top} = \mathbf{1}_n\mathbf{1}_n^{\top} + \hat{K}$.
(b) Compute $C_j$ ($j = 1,\ldots,k$), $C$, $D$, $H$, and $Q$ using Equations (4.9)–(4.13).
(c) To predict the label for $\mathbf{x}$, compute $\hat{f}_j(\mathbf{x})$ for $j = 1,\ldots,k$ using (4.14) and classify the sample to the class giving the largest $\hat{f}_j(\mathbf{x})$.
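The identity behind (4.8), which lets Algorithm 2 trade an $(n+1)\times(n+1)$ inverse for an $n\times n$ one, is easy to verify numerically; a small sketch with a random stand-in for $Z_j^{**}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 8, 0.5
Z = rng.normal(size=(n, n + 1))   # stand-in for Z_j** (an n x (n+1) matrix)

# Direct (n+1) x (n+1) inversion, as in (4.5) ...
B_direct = np.linalg.inv(Z.T @ Z + lam * np.eye(n + 1))

# ... versus the Sherman-Morrison-Woodbury form (4.8),
# which only requires inverting an n x n matrix:
B_smw = (np.eye(n + 1)
         - Z.T @ np.linalg.inv(lam * np.eye(n) + Z @ Z.T) @ Z) / lam
```

The two agree to machine precision; the payoff of the second form is numerical robustness when the larger system is ill-conditioned.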
5. NUMERICAL EXAMPLES
We illustrate the performance of the MPSVM with simulated examples. The Gaussian kernel $K(\mathbf{x}_i, \mathbf{x}_l) = \exp\left(-\|\mathbf{x}_i - \mathbf{x}_l\|^2/\sigma^2\right)$ is used, and the tuning parameters $(\lambda^*, \sigma^2)$ are determined by tenfold cross-validation. Section 5.2 shows an example of nonstandard classification. Equal misclassification costs are assumed unless otherwise specified.

5.1 COMPARISON WITH MSVM

A training set of sample size 200 is generated according to the three-class model on the unit interval [0, 1] with the conditional probabilities
Figure 1. Plots of true and estimated conditional probability functions.
$$p_1(x) = 0.97\exp(-3x), \qquad p_3(x) = \exp\left[-2.5(x - 1.2)^2\right], \qquad p_2(x) = 1 - p_1(x) - p_3(x).$$
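The model above can be simulated directly; a short sketch (the article does not restate the marginal distribution of $x$, so uniform on [0, 1] is an assumption of this snippet):

```python
import numpy as np

def sample_three_class(n, rng):
    """Draw (x, y) from the three-class model of Section 5.1:
    x ~ Uniform[0, 1] (assumed), then y sampled from (p1, p2, p3)(x)."""
    x = rng.uniform(0, 1, n)
    p1 = 0.97 * np.exp(-3 * x)
    p3 = np.exp(-2.5 * (x - 1.2) ** 2)
    p2 = 1 - p1 - p3                      # nonnegative on [0, 1]
    P = np.column_stack([p1, p2, p3])
    y = np.array([rng.choice(3, p=row) + 1 for row in P])
    return x, y

x, y = sample_three_class(200, np.random.default_rng(0))
```

Each label is drawn from the conditional distribution at its own $x$, so repeated draws reproduce the class-overlap structure that the Bayes rate 0.3841 quantifies.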
This example was also used by Lee, Lin, and Wahba (2004). We apply the MPSVM and use the formula (4.3) to estimate the true conditional probabilities. As shown in Figures 1(a) and (b), the MPSVM has reasonably recovered the conditional probabilities. According to (2.2), the Bayes class boundary for this example consists of two points: $x = 0.260$, where $p_1(x) = p_2(x) > p_3(x)$, and $x = 0.613$, where $p_1(x) < p_2(x) = p_3(x)$. The MPSVM estimates these two critical points as 0.262 and 0.615, and hence approximates the Bayes rule very well.
We now compare the MPSVM with the MSVM (Lee, Lin, and Wahba 2004) and their one-versus-rest variants. We generate 100 training sets of sample size 200 to fit the classification rules, and a testing set of sample size 10,000 to evaluate their prediction accuracy. The Bayes rate for this example is 0.3841. The average testing error rate ($\pm$ standard deviation) is $0.4068 \pm 0.0101$ for the MPSVM and $0.4081 \pm 0.0124$ for the one-versus-rest PSVM. As reported by Lee, Lin, and Wahba (2004), the average error rate is $0.3951 \pm 0.0099$ for the MSVM and $0.4307 \pm 0.0132$ for the one-versus-rest SVM. Overall, the performance of the MPSVM and one-versus-rest PSVM is close to that of the MSVM. Note that the error rate of the one-versus-rest SVM is slightly higher. One possible reason is that the one-versus-rest SVM may not approximate the Bayes rule well even when the sample size is large (Lee, Lin, and Wahba 2004), while the other three classifiers are all asymptotically optimal.
5.2 SAMPLING BIAS
Figure 2. Sampling bias.

This example illustrates the effectiveness of the MPSVM in handling the sampling-bias situation. There are three classes, and the true population is distributed on the square $A = [0, \sqrt{10}] \times [0, \sqrt{10}]$ with the following conditional probabilities:

$$\Pr(Y = j \mid \mathbf{x} \in A_l) = \begin{cases} 0.9 & \text{if } j = l, \\ 0.05 & \text{if } j \neq l, \end{cases} \qquad 1 \le j, l \le 3.$$

The three regions are

$$A_1 = \left\{\mathbf{X}: X_1^2 + X_2^2 \le 2,\ X_1 > X_2\right\}, \qquad A_2 = \left\{\mathbf{X}: X_1^2 + X_2^2 \le 2,\ X_1 \le X_2\right\}, \qquad A_3 = \left\{\mathbf{X}: X_1^2 + X_2^2 > 2\right\},$$

and they partition the square $A$ with area ratio 1:1:8. We first generate 500 points
from the population, then intentionally change the class proportions in the training set by
keeping all points generated from classes 1 and 2, and only 1/8 of those from class 3. The
sampling scheme is therefore biased and the class proportions in the training set do not
reﬂect those in the underlying population. Figure 2(a) plots the training data and the Bayes
decision boundary. We ﬁt the MPSVM with, respectively, equal misclassiﬁcation costs
and the costs adjusted according to (2.3). The decision boundary based on adjusted costs
in Figure 2(b) is very similar to the Bayes boundary in Figure 2(a), while the unadjusted
boundary in Figure 2(c) fails to approximate the Bayes rule. This suggests that the adjustment
of misclassiﬁcation costs in the MPSVM is necessary to improve classiﬁcation accuracy in
the sampling bias case.
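The biased sampling scheme described above can be sketched directly (the uniform population over the square is our reading of the setup, implied by the area-ratio argument):

```python
import numpy as np

def make_biased_training_set(rng):
    """Section 5.2 setup sketch: sample 500 points on the square,
    assign labels via the region-conditional probabilities, then keep
    only 1/8 of the class-3 points to mimic the biased sampling."""
    X = rng.uniform(0, np.sqrt(10), size=(500, 2))
    r2 = (X ** 2).sum(axis=1)
    region = np.where(r2 > 2, 3, np.where(X[:, 0] > X[:, 1], 1, 2))
    labels = np.empty(500, dtype=int)
    for l in (1, 2, 3):
        idx = region == l
        probs = np.full(3, 0.05)
        probs[l - 1] = 0.9                 # Pr(Y = l | x in A_l) = 0.9
        labels[idx] = rng.choice([1, 2, 3], size=idx.sum(), p=probs)
    keep = (labels != 3) | (rng.uniform(size=500) < 1 / 8)
    return X[keep], labels[keep]

X_tr, y_tr = make_biased_training_set(np.random.default_rng(0))
```

After the 1/8 subsampling, class 3 no longer dominates the training set the way it dominates the population, which is exactly the mismatch that the cost adjustment (2.3) corrects.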
6. APPLICATIONS
In this section, we apply the MPSVM to cancer classification using two benchmark microarray datasets. One main challenge in microarray data analysis is that the number of genes is often much larger than the sample size. One typical way to handle this issue is to conduct dimension reduction, say, by principal component analysis or factor analysis (Khan et al. 2001; West 2003), before applying classification methods. As regularized regression models, the SVM-type methods offer an alternative way to handle this "large $p$, small $n$" problem (West 2003) and are well-suited for microarray data analysis.
6.1 LEUKEMIA DATA
The leukemia dataset was published by Golub et al. (1999) for the classification of two types of leukemia: ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia). These two cancer types are identified based on their origins: AML originates in the bone marrow, and ALL originates in lymphatic tissue. The ALL leukemia can be further divided into B-cell and T-cell ALLs. There are 38 samples (19 ALL B-cell; 8 ALL T-cell; 11 AML) in the training set and 34 samples (19 ALL B-cell; 1 ALL T-cell; 14 AML) in the testing set. For each sample, the expression values of 7,129 genes are collected. In the literature, these data have been treated as a two-class (ALL/AML) problem (Golub et al. 1999; Dudoit, Fridlyand, and Speed 2002; Tibshirani, Hastie, Narasimhan, and Chu 2002; Nguyen and Rocke 2002b) or a three-class (AML/ALL B-cell/ALL T-cell) problem (Lee and Lee 2003; Nguyen and Rocke 2002a; Dudoit, Fridlyand, and Speed 2002). In this article, we treat it as a three-class classification problem.

Figure 3. Heat maps of leukemia data.
To compare the classification performance of the MPSVM with the MSVM, we first apply the procedures used by Lee and Lee (2003) to preprocess the data: (1) thresholding (floor of 100 and ceiling of 16,000); (2) filtering (exclusion of genes with max/min $\le 5$ and max $-$ min $\le 500$ across the samples); and (3) base-10 logarithmic transformation. The filtering results in 3,571 genes. After these initial steps, each array is standardized, and $F$ tests are used to select the top 40 genes with the largest $F$ statistics.
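The gene-screening step is a one-way ANOVA per gene; a sketch computing the $F$ statistics directly in numpy (illustrative only; the article's full preprocessing pipeline is described above):

```python
import numpy as np

def f_statistics(X, y):
    """One-way ANOVA F statistic for each gene (column of X) across the
    classes in y: between-group over within-group mean square."""
    classes = np.unique(y)
    n = X.shape[0]
    grand = X.mean(axis=0)
    ss_between = sum((y == c).sum() * (X[y == c].mean(axis=0) - grand) ** 2
                     for c in classes)
    ss_within = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0)
                    for c in classes)
    df_b, df_w = len(classes) - 1, n - len(classes)
    return (ss_between / df_b) / (ss_within / df_w)

def top_genes(X, y, m=40):
    """Indices of the m genes with the largest F statistics."""
    return np.argsort(f_statistics(X, y))[::-1][:m]

# Synthetic check: only gene 0 carries class signal
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 10)
X = rng.normal(size=(30, 10))
X[:, 0] += 3.0 * y
selected = top_genes(X, y, m=5)
```

The gene whose class means differ most relative to its within-class noise receives the largest $F$ statistic and is selected first.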
Figure 3 plots the heat maps of the training set and the testing set, where rows represent genes and columns represent samples. The gene expression values are indicated by color, with black representing the lowest and white the highest level of expression. We put the training samples within each class together and use hierarchical clustering to order the genes, so that genes close to each other have similar profiles across the training samples. As Figure 3(a) shows, some subset of genes is highly expressed for each class; therefore, the selected 40 genes are very informative in discriminating the three classes. In the heat map of the testing set (Figure 3(b)), the genes are ordered in the same way as in the training set, and hierarchical clustering is also used to order the samples. The expression pattern of the testing samples matches that of the training samples quite well.
Because the sample size is small, the performance of the MPSVM and one-versus-rest PSVM is quite robust to the choice of tuning parameters. With various equally good tuning parameters, none or at most one testing sample is misclassified by the MPSVM and the one-versus-rest PSVM. For the MSVM, as shown by Lee and Lee (2003), one testing sample is misclassified using GACV tuning, and on average 0.8 testing samples are misclassified using leave-one-out cross-validation. This demonstrates that the performance of the MPSVM is comparable to that of the MSVM, but the MPSVM is faster to implement.
6.2 SMALL ROUND BLUE CELL TUMOR DATA
The small round blue cell tumors of childhood dataset was published by Khan et al. (2001). There are four tumor types: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (BL/NHL), and the Ewing family of tumors (EWS). This dataset contains 2,308 out of 6,567 genes after filtering. There are 63 samples (8 BL; 23 EWS; 12 NB; 20 RMS) in the training set and 20 samples (3 BL; 6 EWS; 6 NB; 5 RMS) in the testing set. Similar to the analysis of the leukemia data, the data were logarithm-transformed, and $F$ tests were used to preselect 40 genes with the smallest $p$ values. The testing errors of the MPSVM and one-versus-rest PSVM are both 0. This dataset has been analyzed by many methods (Khan et al. 2001; Lee and Lee 2003; Dudoit, Fridlyand, and Speed 2002; Tibshirani et al. 2002), and nearly all of them yield 0 testing errors.
APPENDIX
Proof of Theorem 1: Let $f_j(\mathbf{x}) = \beta_{j0} + h_j(\mathbf{x})$ with $h_j \in \mathcal{H}_K$ for $j = 1,\ldots,k$. We decompose each $h_j$ as follows: $h_j(\cdot) = \sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot) + \rho_j(\cdot)$, where $\rho_j(\cdot)$ is the element in the RKHS orthogonal to the spanned space $\mathrm{span}\{K(\mathbf{x}_i,\cdot),\ i = 1,\ldots,n\}$. By the sum-to-zero constraint, we have

$$f_k(\cdot) = -\sum_{j=1}^{k-1}\beta_{j0} - \sum_{j=1}^{k-1}\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot) - \sum_{j=1}^{k-1}\rho_j(\cdot).$$

By the reproducing property, we have $\langle h_j(\cdot), K(\mathbf{x}_i,\cdot)\rangle_{\mathcal{H}_K} = h_j(\mathbf{x}_i)$ for $i = 1,\ldots,n$. Thus,

$$f_j(\mathbf{x}_i) = \beta_{j0} + h_j(\mathbf{x}_i) = \beta_{j0} + \langle h_j, K(\mathbf{x}_i,\cdot)\rangle_{\mathcal{H}_K} = \beta_{j0} + \left\langle\sum_{l=1}^{n}\beta_{jl}K(\mathbf{x}_l,\cdot) + \rho_j(\cdot),\ K(\mathbf{x}_i,\cdot)\right\rangle_{\mathcal{H}_K} = \beta_{j0} + \sum_{l=1}^{n}\beta_{jl}K(\mathbf{x}_i,\mathbf{x}_l).$$

Hence, the data-fit functional in (4.1) does not depend on $\rho_j(\cdot)$. Furthermore,

$$\|h_j\|^2 = \left\|\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)\right\|^2 + \|\rho_j\|^2, \qquad \text{for } 1 \le j \le k-1,$$

and

$$\|h_k\|^2 = \left\|\sum_{j=1}^{k-1}\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)\right\|^2 + \left\|\sum_{j=1}^{k-1}\rho_j\right\|^2.$$

Obviously, the minimizer of (4.1) does not depend on $\rho_j$. This implies Equation (4.2).

It remains to show that minimizing (4.1) subject to $\sum_{j=1}^{k}\beta_{ji} = 0$ for $i = 0,\ldots,n$ is equivalent to minimizing (4.1) subject to $\sum_{j=1}^{k}f_j(\mathbf{x}) = 0$ for any $\mathbf{x}$. Let $\hat{K}$ be the $n \times n$ matrix with $(i,l)$th entry $K(\mathbf{x}_i,\mathbf{x}_l)$. Let $f_j(\cdot) = \beta_{j0} + \sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)$ be the minimizer of (4.1) under the sum-to-zero constraint. Let $\boldsymbol{\beta}_j = (\beta_{j1},\ldots,\beta_{jn})^{\top}$ for $j = 1,\ldots,k$, and $\bar{\beta}_l = k^{-1}\sum_{j=1}^{k}\beta_{jl}$ for $l = 0,\ldots,n$. Define

$$f_j^*(\cdot) = \beta_{j0}^* + \sum_{i=1}^{n}\beta_{ji}^*K(\mathbf{x}_i,\cdot) = \left(\beta_{j0} - \bar{\beta}_0\right) + \sum_{i=1}^{n}\left(\beta_{ji} - \bar{\beta}_i\right)K(\mathbf{x}_i,\cdot).$$

Then $f_j^*(\cdot) = f_j(\cdot)$ because $\sum_{j=1}^{k}f_j(\cdot) \equiv 0$. However,

$$\sum_{j=1}^{k}\left(\|h_j^*\|^2 + \beta_{j0}^{*2}\right) = \sum_{j=1}^{k}\boldsymbol{\beta}_j^{\top}\hat{K}\boldsymbol{\beta}_j + \sum_{j=1}^{k}\beta_{j0}^2 - k\bar{\beta}_0^2 - k^{-1}\left(\sum_{j=1}^{k}\boldsymbol{\beta}_j\right)^{\top}\hat{K}\left(\sum_{j=1}^{k}\boldsymbol{\beta}_j\right) \le \sum_{j=1}^{k}\left(\boldsymbol{\beta}_j^{\top}\hat{K}\boldsymbol{\beta}_j + \beta_{j0}^2\right) = \sum_{j=1}^{k}\left(\|h_j\|^2 + \beta_{j0}^2\right).$$

The equality has to hold since $f_j$ is the global minimizer. This implies that

$$\bar{\beta}_0 = 0, \qquad \left(\sum_{j=1}^{k}\boldsymbol{\beta}_j\right)^{\top}\hat{K}\left(\sum_{j=1}^{k}\boldsymbol{\beta}_j\right) = 0.$$

When $\hat{K}$ is positive definite, $\sum_{j=1}^{k}\beta_{ji} = 0$ for $i = 0,\ldots,n$. On the other hand, let $f_j(\cdot) = \beta_{j0} + \sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)$ be the minimizer of (4.1) satisfying $\sum_{j=1}^{k}\beta_{ji} = 0$ for $i = 0,\ldots,n$. Let $\boldsymbol{\beta}_j = (\beta_{j1},\ldots,\beta_{jn})^{\top}$ for $j = 1,\ldots,k$. Thus $\hat{K}\sum_{j=1}^{k}\boldsymbol{\beta}_j = \mathbf{0}$ and

$$0 = \left(\sum_{j=1}^{k}\boldsymbol{\beta}_j\right)^{\top}\hat{K}\left(\sum_{j=1}^{k}\boldsymbol{\beta}_j\right) = \left\|\sum_{j=1}^{k}\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)\right\|^2.$$

This suggests that $\sum_{j=1}^{k}\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\mathbf{x}) = 0$ and hence $\sum_{j=1}^{k}f_j(\mathbf{x}) = 0$ for any $\mathbf{x}$.
Proof of Theorem 2: Since $E\{[\mathbf{f}(\mathbf{X}) - \mathbf{Y}]^{\top}W(\mathbf{Y})[\mathbf{f}(\mathbf{X}) - \mathbf{Y}]\} = E\{E\{[\mathbf{f}(\mathbf{X}) - \mathbf{Y}]^{\top}W(\mathbf{Y})[\mathbf{f}(\mathbf{X}) - \mathbf{Y}] \mid \mathbf{X}\}\}$, we can minimize $E\{[\mathbf{f}(\mathbf{X}) - \mathbf{Y}]^{\top}W(\mathbf{Y})[\mathbf{f}(\mathbf{X}) - \mathbf{Y}]\}$ by minimizing $E\{[\mathbf{f}(\mathbf{X}) - \mathbf{Y}]^{\top}W(\mathbf{Y})[\mathbf{f}(\mathbf{X}) - \mathbf{Y}] \mid \mathbf{X} = \mathbf{x}\}$ for every $\mathbf{x}$. Note that $c_{jj} = 0$ for all $j$; then

$$Q_{\mathbf{x}}(\mathbf{f}) = E\left\{[\mathbf{f}(\mathbf{X}) - \mathbf{Y}]^{\top}W(\mathbf{Y})[\mathbf{f}(\mathbf{X}) - \mathbf{Y}]\ \middle|\ \mathbf{X} = \mathbf{x}\right\} = \sum_{j=1}^{k}\sum_{l=1}^{k}c_{jl}\left[f_l(\mathbf{x}) + \frac{1}{k-1}\right]^2 p_j(\mathbf{x}) = \sum_{j=1}^{k}\left[\sum_{l=1}^{k}c_{lj}\,p_l(\mathbf{x})\right]\left[f_j(\mathbf{x}) + \frac{1}{k-1}\right]^2 = \sum_{j=1}^{k}\tau_j(\mathbf{x})\left[f_j(\mathbf{x}) + \frac{1}{k-1}\right]^2,$$

where $\tau_j(\mathbf{x}) = \sum_{l=1}^{k}c_{lj}\,p_l(\mathbf{x})$. Consider minimizing the function $Q_{\mathbf{x}}(\mathbf{f})$ subject to $\sum_{j=1}^{k}f_j(\mathbf{x}) = 0$. Using the Lagrange multiplier $\gamma > 0$, we get

$$A_{\mathbf{x}}(\mathbf{f}) = \sum_{j=1}^{k}\tau_j(\mathbf{x})\left[f_j(\mathbf{x}) + \frac{1}{k-1}\right]^2 - \gamma\sum_{j=1}^{k}f_j(\mathbf{x}).$$

For each $j = 1,\ldots,k$, we have

$$\frac{\partial A_{\mathbf{x}}}{\partial f_j} = 2\tau_j\left[f_j + \frac{1}{k-1}\right] - \gamma = 0 \ \Longrightarrow\ f_j = \frac{\gamma}{2\tau_j} - \frac{1}{k-1}.$$

Since $\sum_{j=1}^{k}f_j = 0$, we have

$$\gamma = \frac{2k}{k-1}\cdot\frac{1}{\sum_{j=1}^{k}\tau_j^{-1}}.$$

The solution is then given by

$$f_j = \frac{k\,\tau_j^{-1}}{(k-1)\sum_{l=1}^{k}\tau_l^{-1}} - \frac{1}{k-1}. \quad \text{(A.1)}$$

Thus, $\arg\max_{j=1,\ldots,k}f_j(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\tau_j(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\sum_{l=1}^{k}c_{lj}\,p_l(\mathbf{x})$.
ACKNOWLEDGMENTS
The authors are very grateful to the editors and referees for constructive suggestions that greatly improved
the presentation of this work. We also thank Subhashis Ghosal, Kevin Gross, and David Chorlian for helpful
comments on the article. Hao Helen Zhang’s research was partially supported by the National Science Foundation Grant DMS-0405913.
[Received December 2004. Revised September 2005.]
REFERENCES
Agarwal, D. K. (2002), “Shrinkage Estimator Generalizations of Proximal Support Vector Machines,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, pp. 173–182.
Boser, B. E., Guyon, I. M., and Vapnik, V. (1992), “A Training Algorithm for Optimal Margin Classiﬁers,” in 5th
Annual ACM Workshop on Computational Learning Theory, ed. Haussler, D., Pittsburgh, PA: ACM Press,
pp. 144–152.
Burges, C. J. C. (1998), “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and
Knowledge Discovery, 2, 121–167.
Cox, D., and O’Sullivan, F. (1990), “Asymptotic Analysis of Penalized Likelihood Type Estimators,” The Annals
of Statistics, 18, 1676–1695.
Cristianini, N., and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, New York: Cambridge University Press.
Dudoit, S., Fridlyand, J., and Speed, T. (2002), “Comparison of Discrimination Methods for the Classiﬁcation of
Tumors Using Gene Expression Data,” Journal of the American Statistical Association, 97, 77–87.
Fung, G., and Mangasarian, O. L. (2001), “Proximal Support Vector Machine Classiﬁers,” in Proceedings of the
7th ACM Conference on Knowledge Discovery and Data Mining, Pittsburgh, PA: ACM Press, pp. 77–86.
(2005), “Multicategory Proximal Support Vector Machine Classiﬁers,” Machine Learning, 59, 77–97.
Girosi, F. (1998), “An Equivalence Between Sparse Approximation and Support Vector Machines,” Neural Computation, 10, 1455–1480.
Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J.,
Caligiuri, M., Bloomﬁeld, C., and Lander, E. (1999), “Molecular Classiﬁcation of Cancer: Class Discovery
and Class Prediction by Gene Expression Monitoring,” Science, 286, 531–537.
Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu,
C., Peterson, C., and Meltzer, P. (2001), “Classiﬁcation and Diagnostic Prediction of Cancers Using Gene
Expression Proﬁling and Artiﬁcial Neural Networks,” Nature Medicine, 7, 673–679.
Kimeldorf, G., and Wahba, G. (1971), “Some Results on Tchebycheffian Spline Functions,” Journal of Mathematical Analysis and Applications, 33, 82–95.
Lee, Y., and Lee, C. (2003), “Classiﬁcation of Multiple Cancer Types By Multicategory Support Vector Machines
Using Gene Expression Data,” Bioinformatics, 19, 1132–1139.
Lee, Y., Lin, Y., and Wahba, G. (2004), “Multicategory Support Vector Machines, Theory, and Application to
the Classiﬁcation of Microarray Data and Satellite Radiance Data,” Journal of the American Statistical
Association, 99, 67–81.
Lin, Y., Lee, Y., and Wahba, G. (2002), “Support Vector Machines for Classiﬁcation in Nonstandard Situations,”
Machine Learning, 46, 191–202.
Nguyen, D. V., and Rocke, D. M. (2002a), “Multiclass Cancer Classiﬁcation via Partial Least Squares with Gene
Expression Proﬁles,” Bioinformatics, 18, 1216–1226.
(2002b), “Tumor Classification by Partial Least Squares Using Microarray Gene Expression Data,” Bioinformatics, 18, 39–50.
Poggio, T., and Girosi, F. (1998), “A Sparse Representation for Functional Approximation,” Neural Computation,
10, 1445–1454.
Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., and Vandewalle, J. (2002), Least Squares Support
Vector Machines, World Scientiﬁc, Singapore.
Suykens, J. A. K., and Vandewalle, J. (1999), “Least Squares Support Vector Machine Classifiers,” Neural Processing Letters, 9, 293–300.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), “Diagnosis of Multiple Cancer Types By Shrunken
Centroids of Gene Expression,” in Proceedings of the National Academy of Sciences USA, 99, pp. 6567–6572.
Van Gestel, T., Suykens, J. A. K., Lanckriet, G., Lambrechts, A., De Moor, B., and Vandewalle, J. (2002a), “Bayesian Framework for Least-Squares Support Vector Machine Classifiers, Gaussian Processes and Kernel Fisher Discriminant Analysis,” Neural Computation, 15, 1115–1148.
(2002b), “Multiclass LS-SVMs: Moderated Outputs and Coding-Decoding Schemes,” Neural Processing Letters, 15, 45–58.
Vapnik, V. N. (1998), Statistical Learning Theory, New York: Wiley.
Wahba, G. (1990), Spline Models for Observational Data, CBMS-NSF Regional Conference Series in Applied Mathematics, 59, Philadelphia, PA: SIAM.
West, M. (2003), “Bayesian Factor Regression Models in the “Large p, Small n” Paradigm,” Bayesian Statistics,
7, 723–732.
Weston, J., and Watkins, C. (1999), “Support Vector Machines for MultiClass Pattern Recognition,” in Proceedings
of the 7th European Symposium On Artiﬁcial Neural Networks, Bruges, Belgium, pp. 219–224.