Multiclass Proximal Support Vector Machines

Yongqiang TANG and Hao Helen ZHANG
This article proposes the multiclass proximal support vector machine (MPSVM) classifier, which extends the binary PSVM to the multiclass case. Unlike the one-versus-rest approach that constructs the decision rule based on multiple binary classification tasks, the proposed method considers all classes simultaneously and has better theoretical properties and empirical performance. We formulate the MPSVM as a regularization problem in the reproducing kernel Hilbert space and show that it implements the Bayes rule for classification. In addition, the MPSVM can handle equal and unequal misclassification costs in a unified framework. We suggest an efficient algorithm to implement the MPSVM by solving a system of linear equations. This algorithm requires much less computational effort than solving the standard SVM, which often requires quadratic programming and can be slow for large problems. We also provide an alternative and more robust algorithm for ill-posed problems. The effectiveness of the MPSVM is demonstrated by both simulation studies and applications to cancer classifications using microarray data.
Key Words: Bayes rule; Nonstandard classifications; Reproducing kernel Hilbert space.
1. INTRODUCTION
In multiclass classification problems, the training set consists of $n$ samples $\{(\mathbf{x}_i, y_i);\ i = 1,\ldots,n\}$, where $\mathbf{x}_i \in \mathbb{R}^d$ is the input vector and $y_i \in \{1,\ldots,k\}$ is the class label for the $i$th sample. Our task is to learn a classification rule from the training set to predict the label for
a future sample based on its input. The support vector machine (SVM) (Boser, Guyon, and
Vapnik 1992; Vapnik 1998; Burges 1998; Cristianini and Shawe-Taylor 2000) has performed
successfully in many real-world problems. The SVM is attractive in its ability to condense
the information contained in the training set and to find a decision surface determined by
certain points in the training set. Lee, Lin, and Wahba (2004) further generalized the SVM
to multicategory SVM (MSVM). Other multiclass versions of SVM have been studied by
Vapnik (1998) and Weston and Watkins (1999).
Yongqiang Tang is Statistician and Assistant Professor, Department of Psychiatry, SUNY Health Science Center, Brooklyn, NY 11203 (E-mail: yongqiang tang@yahoo.com). Hao Helen Zhang is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: hzhang@stat.ncsu.edu).

© 2006 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America. Journal of Computational and Graphical Statistics, Volume 15, Number 2, Pages 339–355. DOI: 10.1198/106186006X113647

The implementation of the SVM and MSVM demands solving quadratic programming problems under linear constraints, which can be computationally expensive for large
datasets. For multiclass problems, the computation can be very challenging even for moderately sized datasets if the number of classes $k$ is large. Proximal support vector machines
(PSVM) were recently introduced as a variant of SVM for binary classifications by Suykens
and Vandewalle (1999) and Fung and Mangasarian (2001). In theory, both the PSVM and
SVM target the optimal Bayes rule asymptotically, which explains their comparable per-
formance in most empirical studies. However, it is much faster to train a PSVM classifier
by simply solving a system of linear equations. Suykens and Vandewalle (1999), Suykens
et al. (2002), and Van Gestel et al. (2002a) proposed the least squares SVM (LSSVM).
The formulation of LSSVM is similar to that of PSVM except that the latter penalizes the
constant term as well. Extensions of the PSVM and LSSVM to multiclass problems have
been considered by Fung and Mangasarian (2005) and Van Gestel et al. (2002b) based on
the one-versus-rest scheme. Their main idea is to solve $k$ binary classification problems
by separating one class from the rest, then construct the decision rule according to the
maximal output. One disadvantage of this scheme is that the resulting binary problems are
often very unbalanced, leading to poor performance in some cases (Fung and Mangasarian
2005). In addition, it is not easy to take into account unequal misclassification costs in the
one-versus-rest approach.
This article proposes the multiclass PSVM (MPSVM) using an alternative formulation.
The proposed method directly targets the boundaries among the $k$ classes simultaneously by
estimating some functions of the conditional probabilities. The MPSVM is constructed in
a regularization framework of reproducing kernel Hilbert space (RKHS), and is optimal in
terms of implementing the Bayes rule asymptotically. Compared to the approaches based on
the one-versus-rest scheme, the MPSVM is more flexible in handling nonstandard situations
such as unequal misclassification costs and nonrepresentative training samples. With regard
to computation, the MPSVM can be solved more efficiently than the MSVM.
The article is organized as follows. Section 2 briefly states the Bayes classification rule.
Section 3 reviews the binary SVM and PSVM. Section 4 introduces the MPSVM, studies its statistical properties, and proposes two computation algorithms. Section 5 illustrates the
performance of the MPSVM with simulation examples. Section 6 applies the MPSVM to
cancer classifications using gene expression data.
2. BAYES CLASSIFICATION RULES
Bayes rule is the optimal classification rule if the underlying distribution of the data is
known. It serves as the gold standard to which any classifier can be compared. In practice,
the underlying distribution is rarely known. One common way to approximate the Bayes
rule is to estimate the distribution or related classification functions from the training set.
For a $k$-class classification problem, we need to learn a classification rule $\phi(\mathbf{x}): \mathbb{R}^d \to \{1,\ldots,k\}$ from the training set. Assume that the samples in the training set are independently and identically drawn from some distribution $P(\mathbf{x}, y)$. Let $p_j(\mathbf{x}) = \Pr(Y = j \mid \mathbf{X} = \mathbf{x})$ be the conditional probability of class $j$ given $\mathbf{X} = \mathbf{x}$ for $j = 1,\ldots,k$. Let $C$ be a $k\times k$ cost matrix, with $c_{jl}$ the cost of incorrectly classifying a sample from class $j$ to class $l$. All $c_{jj}$'s are equal to 0 because we do not penalize correct decisions. The Bayes rule, minimizing the expected cost of misclassifying an observation,
$$E\left[c_{Y\phi(\mathbf{X})}\right] = E_{\mathbf{X}}\left[\sum_{l=1}^{k} c_{l\phi(\mathbf{x})}\Pr(Y = l \mid \mathbf{X} = \mathbf{x})\right] = E_{\mathbf{X}}\left[\sum_{l=1}^{k} c_{l\phi(\mathbf{x})}\, p_l(\mathbf{x})\right],$$
is given by
$$\phi_B(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\ \sum_{l=1}^{k} c_{lj}\, p_l(\mathbf{x}). \tag{2.1}$$
With equal misclassification costs, that is, $c_{jl} = 1$ for $j \neq l$, the Bayes rule simplifies to
$$\phi_B(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\ \left[1 - p_j(\mathbf{x})\right] = \arg\max_{j=1,\ldots,k}\ p_j(\mathbf{x}), \tag{2.2}$$
which can be interpreted as the minimizer of the expected misclassification rate $E\left[Y \neq \phi(\mathbf{X})\right]$.
The majority of classification methods consider only the standard setting, which as-
sumes that the samples in the training set truly represent the target population and imposes
equal costs on different types of misclassifications. However, nonstandard classification
problems may arise in many real situations. When one type of misclassification is more
serious than others, we should apply unequal costs. For example, in the diagnosis of a
disease, classifying a diseased person as healthy or a healthy person as diseased may have
different consequences. See Lin, Lee, and Wahba (2002) for more discussions on this issue.
Sampling bias is another concern that merits special attention. When the class proportions
are quite unbalanced in the population, one often tends to oversample the minority classes and downsample the majority classes (Lin, Lee, and Wahba 2002; Lee, Lin, and Wahba 2004). If the sampling scheme depends only on the class label $Y$ and not on the input $\mathbf{X}$, the Bayes rule
is

$$\phi_B^s(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\ \sum_{l=1}^{k} c_{lj}\,\frac{\pi_l}{\pi_l^s}\, p_l^s(\mathbf{x}),$$
where $\pi_l$ is the prior proportion of class $l$ in the target population, $\pi_l^s$ is the proportion of class $l$ in the training set, and $p_l^s(\mathbf{x}) = \Pr\left(Y^s = l \mid \mathbf{X} = \mathbf{x}\right)$ is the conditional probability of a sample in the training set belonging to class $l$ given $\mathbf{X} = \mathbf{x}$. In this article, we do
not distinguish two nonstandard cases, since the issue of sampling bias can be taken into
account by modifying the costs as
$$c_{lj}^{\mathrm{new}} = c_{lj}\,\frac{\pi_l}{\pi_l^s}. \tag{2.3}$$
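To make these rules concrete, here is a minimal numerical sketch (ours, not from the paper) of the cost-sensitive Bayes rule (2.1), the equal-cost rule (2.2), and the cost adjustment (2.3) for sampling bias; the cost matrix, class proportions, and conditional probabilities are made-up values.

```python
import numpy as np

# Hypothetical 3-class cost matrix: C[j, l] is the cost of classifying a
# class-j sample to class l (zero diagonal, as in Section 2).
C = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])

def bayes_rule(p, cost):
    """Cost-sensitive Bayes rule (2.1): argmin_j of sum_l cost[l, j] * p_l(x)."""
    expected_cost = cost.T @ p           # entry j equals sum_l cost[l, j] * p[l]
    return int(np.argmin(expected_cost))

p_x = np.array([0.2, 0.5, 0.3])          # conditional class probabilities at some x
print(bayes_rule(p_x, C))                # cost-sensitive label (0-based)
print(int(np.argmax(p_x)))               # equal-cost rule (2.2)

# Sampling-bias adjustment (2.3): the cost row of each true class l is rescaled
# by pi_l / pi_l^s before applying the standard rule to the training-set
# conditional probabilities p^s(x).
pi_pop = np.array([0.1, 0.1, 0.8])       # assumed population proportions
pi_trn = np.array([1/3, 1/3, 1/3])       # proportions in the (biased) training set
C_new = (pi_pop / pi_trn)[:, None] * C
p_s_x = np.array([0.3, 0.5, 0.2])        # training-set conditional probabilities at x
print(bayes_rule(p_s_x, C_new))
```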
3. BINARY SVM AND PROXIMAL SVM
In binary classification problems, the class label $y_i$ is often coded as $+1$ or $-1$. Originally proposed as a maximal margin classifier, the SVM methodology can also be cast as a regularization problem in a reproducing kernel Hilbert space (RKHS). See Wahba (1990), Girosi (1998), and Poggio and Girosi (1998) for more details. Let $H_K$ be an RKHS associated with the reproducing kernel $K$. The SVM solves
$$\min_{f}\ \frac{1}{n}\sum_{i=1}^{n} L(y_i)\left[1 - y_i f(\mathbf{x}_i)\right]_+ + \lambda\|h\|^2_{H_K}, \tag{3.1}$$
over all the functions of the form $f(\mathbf{x}) = \beta_0 + h(\mathbf{x})$, where $h \in H_K$, $(v)_+ = \max\{v, 0\}$, and $L(y_i)$ is the cost of misclassifying the $i$th observation. The classification rule is $\mathrm{sign}[f(\mathbf{x})]$.
The proximal SVM was first studied by Suykens and Vandewalle (1999) and Fung and
Mangasarian (2001). Unlike the SVM, the PSVM method classifies points to the closest
of two parallel hyperplanes that are pushed apart as far as possible. In the regularization
framework, we can formulate the PSVM as
$$\min_{f}\ \frac{1}{n}\sum_{i=1}^{n} L(y_i)\left[1 - y_i f(\mathbf{x}_i)\right]^2 + \lambda\left(\|h\|^2_{H_K} + \beta_0^2\right). \tag{3.2}$$
Since $y_i = 1$ or $-1$, (3.2) is equivalent to
$$\min_{f}\ \frac{1}{n}\sum_{i=1}^{n} L(y_i)\left[y_i - f(\mathbf{x}_i)\right]^2 + \lambda\left(\|h\|^2_{H_K} + \beta_0^2\right). \tag{3.3}$$
Thus, the PSVM can be interpreted as a ridge regression model (Agarwal 2002). By the representer theorem (Kimeldorf and Wahba 1971), the solution to (3.1) or (3.2) has a finite representation of the form $f(\mathbf{x}) = \beta_0 + \sum_{i=1}^{n}\beta_i K(\mathbf{x}, \mathbf{x}_i)$. When equal misclassification costs are used, (3.2) becomes
$$\min_{f}\ \frac{1}{n}\sum_{i=1}^{n}\left[1 - y_i f(\mathbf{x}_i)\right]^2 + \lambda\left(\sum_{j=1}^{n}\sum_{l=1}^{n}\beta_j\beta_l K(\mathbf{x}_j, \mathbf{x}_l) + \beta_0^2\right). \tag{3.4}$$
The formulation (3.4) is slightly different from the kernel PSVM studied by Fung and Mangasarian (2001), where they solved
$$\min_{f}\ \frac{1}{n}\sum_{i=1}^{n}\left[1 - y_i f(\mathbf{x}_i)\right]^2 + \lambda\left(\sum_{j=1}^{n}\beta_j^2 + \beta_0^2\right).$$
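Because the squared loss in (3.3) turns the problem into penalized least squares, the binary PSVM with equal costs can be fitted by a single linear solve of the normal equations for (3.4), namely $(Z'Z + n\lambda G)\boldsymbol{\beta} = Z'\mathbf{y}$ with $Z = [\mathbf{1}\ K]$ and $G = \mathrm{diag}(1, K)$. The sketch below illustrates this reading; the Gaussian kernel, the simulated data, and the tuning values are assumptions, and this is not the authors' code.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma2):
    # K(x, x') = exp(-||x - x'||^2 / sigma^2), the kernel later used in Section 5.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def fit_binary_psvm(X, y, lam=0.1, sigma2=1.0):
    """Solve (3.4): min (1/n) sum (y_i - f(x_i))^2 + lam * (beta' K beta + beta_0^2),
    with f(x) = beta_0 + sum_i beta_i K(x, x_i) and y coded as +1/-1."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma2)
    Z = np.hstack([np.ones((n, 1)), K])          # Z = [1  K]
    G = np.zeros((n + 1, n + 1))
    G[0, 0] = 1.0
    G[1:, 1:] = K                                # G = diag(1, K)
    b = np.linalg.solve(Z.T @ Z + n * lam * G, Z.T @ y)
    return b

def predict(b, Xtrain, sigma2, Xnew):
    Kx = gaussian_kernel(Xnew, Xtrain, sigma2)
    return np.sign(b[0] + Kx @ b[1:])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # made-up linearly separable labels
b = fit_binary_psvm(X, y)
print((predict(b, X, 1.0, X) == y).mean())       # training accuracy
```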
4. MULTICLASS PROXIMAL SVM
This section introduces the multiclass proximal SVM (MPSVM). The proposed
MPSVM embraces the binary PSVM as a special case. Section 4.1 presents the formu-
lation of the MPSVM and investigates its solution and asymptotic properties. Section 4.2
introduces two efficient computational strategies for implementing the MPSVM.
4.1 FORMULATION AND THEORETICAL PROPERTIES
In multiclass classification problems, the class label $y_i$ can be coded as a $k$-dimensional vector
$$\mathbf{y}_i = \begin{cases} \left(1, \dfrac{-1}{k-1}, \ldots, \dfrac{-1}{k-1}\right) & \text{if sample } i \text{ belongs to class } 1,\\[1.5ex] \left(\dfrac{-1}{k-1}, 1, \ldots, \dfrac{-1}{k-1}\right) & \text{if sample } i \text{ belongs to class } 2,\\[1.5ex] \qquad\qquad\vdots & \\[1ex] \left(\dfrac{-1}{k-1}, \dfrac{-1}{k-1}, \ldots, 1\right) & \text{if sample } i \text{ belongs to class } k.\end{cases}$$
This coding was first used by Lee, Lin, and Wahba (2004). We consider a $k$-tuple separating function $\mathbf{f}(\mathbf{x}) = \left(f_1(\mathbf{x}), \ldots, f_k(\mathbf{x})\right)'$. Since the sum of the components in $\mathbf{y}_i$ is 0, we let $\mathbf{f}$ satisfy the sum-to-zero constraint, $\sum_{j=1}^{k} f_j(\mathbf{x}) = 0$ for any $\mathbf{x} \in \mathbb{R}^d$. Analogous to the binary case, we assume that $\mathbf{f}(\mathbf{x}) \in \prod_{j=1}^{k}\left(\{1\} + H_K\right)$, the tensor product space of $k$ reproducing kernel Hilbert spaces $H_K$. Each component $f_j(\mathbf{x})$ can be expressed as $\beta_{j0} + h_j(\mathbf{x})$, with $\beta_{j0} \in \mathbb{R}$ and $h_j \in H_K$. For each $\mathbf{y}_i$, define the diagonal matrix $W(\mathbf{y}_i) = \mathrm{diag}\{c_{l_i 1}, c_{l_i 2}, \ldots, c_{l_i k}\}$, where $l_i$ is the membership of $\mathbf{y}_i$ and $c_{l_i j}$ denotes the cost of classifying the $i$th sample to class $j$. The MPSVM is proposed as
$$\min_{\mathbf{f}}\ \frac{1}{n}\sum_{i=1}^{n}\left[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)\right]'W(\mathbf{y}_i)\left[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)\right] + \frac{1}{2}\lambda\sum_{j=1}^{k}\left(\|h_j\|^2_{H_K} + \beta_{j0}^2\right), \tag{4.1}$$
subject to $\sum_{j=1}^{k}f_j(\mathbf{x}) = 0$ for any $\mathbf{x}$. The classification rule induced by $\mathbf{f}(\mathbf{x})$ is
$$\phi(\mathbf{x}) = \arg\max_{j=1,\ldots,k}\ f_j(\mathbf{x}).$$
This formulation handles the standard and nonstandard situations in a unified manner by
choosing different cost functions. The solution to (4.1) is characterized in Theorem 1. Proof
of the theorem and all other proofs are relegated to the Appendix.
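As a small, self-contained illustration (not taken from the paper) of the objects entering (4.1), the coded response vectors and the cost-diagonal matrices $W(\mathbf{y}_i)$ can be built as follows; the cost matrix and labels are made-up values.

```python
import numpy as np

def code_labels(labels, k):
    """Vector coding of Section 4.1: class j maps to a k-vector with 1 in position j
    and -1/(k-1) elsewhere, so each coded vector sums to zero."""
    Y = np.full((len(labels), k), -1.0 / (k - 1))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def cost_diagonals(labels, C):
    """W(y_i) = diag{c_{l_i 1}, ..., c_{l_i k}}, where l_i is the class of sample i."""
    return [np.diag(C[l]) for l in labels]

C = np.array([[0.0, 1.0, 1.0],           # equal misclassification costs, zero diagonal
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
labels = np.array([0, 2, 1, 1])           # class memberships (0-based)
Y = code_labels(labels, k=3)
W = cost_diagonals(labels, C)
print(Y.sum(axis=1))                      # all zeros by construction
print(W[0])
```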
Theorem 1 (Representer theorem). If the reproducing kernel $K$ is positive definite, minimizing (4.1) under the sum-to-zero constraint is equivalent to finding $f_1(\mathbf{x}), \ldots, f_k(\mathbf{x})$ of the form
$$f_j(\mathbf{x}) = \beta_{j0} + \sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}, \mathbf{x}_i), \tag{4.2}$$
subject to $\sum_{j=1}^{k}\beta_{jl} = 0$ for $l = 0,\ldots,n$.
We show that formulation (4.1) embraces the binary PSVM as a special case. Let $k = 2$. For any sample from the positive class, $\mathbf{y}_i = (1, -1)$ (1 in the binary notation), we have $W(\mathbf{y}_i) = \mathrm{diag}\{0, L(1)\}$ and $[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)]'W(\mathbf{y}_i)[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)] = L(1)[f_2(\mathbf{x}_i) + 1]^2 = L(1)[1 - f_1(\mathbf{x}_i)]^2$, where $L(1)$ is the cost of misclassifying a positive sample. Similarly, when $\mathbf{y}_i = (-1, 1)$ ($-1$ in the binary notation), we have $W(\mathbf{y}_i) = \mathrm{diag}\{L(-1), 0\}$ and $[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)]'W(\mathbf{y}_i)[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)] = L(-1)[-1 - f_1(\mathbf{x}_i)]^2$, where $L(-1)$ is the cost of misclassifying a negative sample. Thus, the data-fit functional in (3.3) is identical to that in (4.1), with $f(\mathbf{x})$ in (3.3) playing the same role as $f_1(\mathbf{x})$ in (4.1). It is also true that
$$\frac{\lambda}{2}\sum_{j=1}^{2}\left(\|h_j\|^2_{H_K} + \beta_{j0}^2\right) = \frac{\lambda}{2}\left(\|h_1\|^2_{H_K} + \beta_{10}^2 + \|h_2\|^2_{H_K} + \beta_{20}^2\right) = \lambda\left(\|h_1\|^2_{H_K} + \beta_{10}^2\right),$$
because we have $\beta_{10} + \beta_{20} = 0$ and $h_1(\mathbf{x}) + h_2(\mathbf{x}) = 0$ for any $\mathbf{x}$ according to Theorem 1.
The next theorem shows that the MPSVM implements the Bayes rule asymptotically
under certain regularity conditions. We use the theoretical framework of Cox and O’Sullivan
(1990) to analyze the asymptotics of penalized methods. In (4.1), the data-fit functional
component indicates that the solution should follow the pattern in the data, whereas the
penalty component imposes smoothing conditions. As the sample size $n$ goes to infinity, the limit of the data-fit functional, $E\left[(\mathbf{Y} - \mathbf{f}(\mathbf{X}))'W(\mathbf{Y})(\mathbf{Y} - \mathbf{f}(\mathbf{X}))\right]$, could be used to identify the target function, which is the minimizer of the limiting functional. Under the assumption that the target function can be approximated by the elements in the RKHS and certain other regularity conditions, the solution to (4.1) will approach the target function as $n\to\infty$.


Theorem 2. The minimizer of $E\left[(\mathbf{Y} - \mathbf{f}(\mathbf{X}))'W(\mathbf{Y})(\mathbf{Y} - \mathbf{f}(\mathbf{X}))\right]$ under the sum-to-zero constraint implements the Bayes rule, that is,
$$\arg\max_{j=1,\ldots,k}\ f_j(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\ \sum_{l=1}^{k}c_{lj}\,p_l(\mathbf{x}).$$
When equal misclassification costs are used,
$$\arg\max_{j=1,\ldots,k}\ f_j(\mathbf{x}) = \arg\max_{j=1,\ldots,k}\ p_j(\mathbf{x}).$$
Remark 1. Inverting the expression (A.1) in the Appendix provides the estimate of the $k$-class conditional probabilities. With equal misclassification costs, $\tau_j(\mathbf{x})$ reduces to $1 - p_j(\mathbf{x})$; thus
$$\hat{p}_j(\mathbf{x}) = 1 - (k-1)\,\frac{1/[1 - \hat{f}_j(\mathbf{x})]}{\sum_{l=1}^{k} 1/[1 - \hat{f}_l(\mathbf{x})]}, \qquad \text{for } j = 1,\ldots,k. \tag{4.3}$$
4.2 COMPUTATION ALGORITHMS
This section shows that the implementation of the MPSVM only requires solving linear
equation systems. This computational convenience makes the MPSVM a promising clas-
sification tool for large datasets. Many classification methods involve parameters that need
to be adaptively chosen with strategies like cross-validation. Finer tuning is possible for the
MPSVM due to its computational advantage. We present two computation strategies and
summarize them as Algorithms 1 and 2. Based on our experience, Algorithm 1 runs faster, though it may fail when $\hat{K}$ is close to singular, causing difficulty in matrix inversion. Algorithm 2 is specifically designed for ill-posed problems.
Assume that the kernel $K$ is strictly positive definite. Let $\hat{K}$ be the Gram matrix with the $(i,l)$th entry $K(\mathbf{x}_i, \mathbf{x}_l)$. Let $Z = [\mathbf{1}_n\ \hat{K}]$, $\mathbf{1}_n = (1,\ldots,1)'$, $\mathbf{0}_n = (0,\ldots,0)'$, and
$$G = \begin{pmatrix} 1 & \mathbf{0}_n' \\ \mathbf{0}_n & \hat{K} \end{pmatrix}.$$
For $j = 1,\ldots,k$, define $\mathbf{y}_j = (y_{1j},\ldots,y_{nj})'$, $\boldsymbol{\beta}_j = (\beta_{j0},\beta_{j1},\ldots,\beta_{jn})'$, and the $n\times n$ diagonal matrix $W_j = \mathrm{diag}\{c_{\mathrm{cat}\{1\}j},\ldots,c_{\mathrm{cat}\{n\}j}\}$, where $\mathrm{cat}\{l\}$ is the membership of the $l$th sample and $c_{\mathrm{cat}\{l\}j}$ is the cost of classifying the $l$th sample to class $j$. Further, we define $\lambda^* = n\lambda/2$, $\boldsymbol{\beta}_j^* = G^{1/2}\boldsymbol{\beta}_j$, $Z^* = ZG^{-1/2}$, $Z_j^{**} = W_j^{1/2}Z^*$, and $\mathbf{y}_j^* = W_j^{1/2}\mathbf{y}_j$ for $j = 1,\ldots,k$. By Theorem 1, minimizing (4.1) is equivalent to minimizing
$$\sum_{j=1}^{k}\left[(\mathbf{y}_j - Z\boldsymbol{\beta}_j)'W_j(\mathbf{y}_j - Z\boldsymbol{\beta}_j) + \lambda^*\boldsymbol{\beta}_j'G\boldsymbol{\beta}_j\right] = \sum_{j=1}^{k}\left[(\mathbf{y}_j^* - Z_j^{**}\boldsymbol{\beta}_j^*)'(\mathbf{y}_j^* - Z_j^{**}\boldsymbol{\beta}_j^*) + \lambda^*\boldsymbol{\beta}_j^{*\prime}\boldsymbol{\beta}_j^*\right], \tag{4.4}$$
subject to $\sum_{j=1}^{k}\boldsymbol{\beta}_j = \sum_{j=1}^{k}\boldsymbol{\beta}_j^* = \mathbf{0}_{n+1}$.
To solve (4.4), we consider its Wolfe dual problem
$$L_D = \sum_{j=1}^{k}\left[(\mathbf{y}_j^* - Z_j^{**}\boldsymbol{\beta}_j^*)'(\mathbf{y}_j^* - Z_j^{**}\boldsymbol{\beta}_j^*) + \lambda^*\boldsymbol{\beta}_j^{*\prime}\boldsymbol{\beta}_j^*\right] - 2\mathbf{u}^{*\prime}\sum_{j=1}^{k}\boldsymbol{\beta}_j^*,$$
where $\mathbf{u}^* = 2\mathbf{u} \in \mathbb{R}^{n+1}$ is the Lagrange multiplier vector. Setting equal to zero the gradient of $L_D$ with respect to $\boldsymbol{\beta}_j^*$ yields
$$\frac{\partial L_D}{\partial \boldsymbol{\beta}_j^*} = 2Z_j^{**\prime}\left(Z_j^{**}\boldsymbol{\beta}_j^* - \mathbf{y}_j^*\right) + 2\lambda^*\boldsymbol{\beta}_j^* - 2\mathbf{u}^* = \mathbf{0}.$$
Thus,
$$\boldsymbol{\beta}_j^* = \left(Z_j^{**\prime}Z_j^{**} + \lambda^* I\right)^{-1}\left(Z_j^{**\prime}\mathbf{y}_j^* + \mathbf{u}^*\right) = B_j\left(Z_j^{**\prime}\mathbf{y}_j^* + \mathbf{u}^*\right), \tag{4.5}$$
where $B_j = \left(Z_j^{**\prime}Z_j^{**} + \lambda^* I\right)^{-1}$ for $j = 1,\ldots,k$. Since $\mathbf{0}_{n+1} = \sum_{j=1}^{k}\boldsymbol{\beta}_j^* = \sum_{j=1}^{k}B_j\left(Z_j^{**\prime}\mathbf{y}_j^* + \mathbf{u}^*\right)$, we get the dual solution
$$\mathbf{u}^* = -\left(\sum_{j=1}^{k}B_j\right)^{-1}\left(\sum_{j=1}^{k}B_jZ_j^{**\prime}\mathbf{y}_j^*\right). \tag{4.6}$$
The primal solution to (4.4) is
$$\boldsymbol{\beta}_j = G^{-1/2}\boldsymbol{\beta}_j^* = A_j\left(Z'W_j\mathbf{y}_j + \mathbf{v}^*\right), \tag{4.7}$$
where $A_j = G^{-1/2}B_jG^{-1/2} = \left(Z'W_jZ + \lambda^*G\right)^{-1}$ and $\mathbf{v}^* = -\left(\sum_{j=1}^{k}A_j\right)^{-1}\sum_{j=1}^{k}A_jZ'W_j\mathbf{y}_j$. For any (new) sample $\mathbf{x}$, we define the vector $K_{\mathbf{x}} = \left(K(\mathbf{x},\mathbf{x}_1),\ldots,K(\mathbf{x},\mathbf{x}_n)\right)'$ and compute $\hat{f}_j(\mathbf{x}) = [1, K_{\mathbf{x}}']\boldsymbol{\beta}_j$ for $j = 1,\ldots,k$ for prediction. We summarize Algorithm 1 as follows.
Algorithm 1.
(a) Solve (4.4) using (4.7).
(b) To predict the label for $\mathbf{x}$, compute $\hat{f}_j(\mathbf{x})$ for $j = 1,\ldots,k$ and classify the sample to the class corresponding to the largest $\hat{f}_j(\mathbf{x})$.
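Below is a minimal numpy sketch of Algorithm 1 written directly from the formulas above, that is, $A_j = (Z'W_jZ + \lambda^*G)^{-1}$, the offset $\mathbf{v}^*$, and Equation (4.7). It is an illustration, not the authors' implementation; the Gaussian kernel of Section 5, the simulated data, and the tuning values are assumptions.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma2):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def mpsvm_algorithm1(X, labels, C, lam_star=1.0, sigma2=1.0):
    """Train the MPSVM by the linear solves of Algorithm 1 / Equation (4.7)."""
    n, k = X.shape[0], C.shape[0]
    K = gaussian_kernel(X, X, sigma2)
    Z = np.hstack([np.ones((n, 1)), K])                  # Z = [1  K_hat]
    G = np.zeros((n + 1, n + 1))
    G[0, 0] = 1.0
    G[1:, 1:] = K                                        # G = diag(1, K_hat)
    Y = np.full((n, k), -1.0 / (k - 1))
    Y[np.arange(n), labels] = 1.0                        # coded responses y_j
    W = [np.diag(C[labels, j]) for j in range(k)]        # W_j = diag{c_{cat(l) j}}
    A = [np.linalg.inv(Z.T @ W[j] @ Z + lam_star * G) for j in range(k)]
    ZWy = [Z.T @ W[j] @ Y[:, j] for j in range(k)]
    # v* = -(sum_j A_j)^{-1} sum_j A_j Z' W_j y_j
    v = -np.linalg.solve(sum(A), sum(A[j] @ ZWy[j] for j in range(k)))
    beta = [A[j] @ (ZWy[j] + v) for j in range(k)]       # Equation (4.7)
    return np.column_stack(beta)

def mpsvm_predict(beta, Xtrain, sigma2, Xnew):
    Kx = gaussian_kernel(Xnew, Xtrain, sigma2)
    F = np.hstack([np.ones((Xnew.shape[0], 1)), Kx]) @ beta   # f_hat_j(x) = [1, K_x'] beta_j
    return F.argmax(axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
labels = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0.5).astype(int)  # three made-up classes
C = 1.0 - np.eye(3)                                               # equal costs
beta = mpsvm_algorithm1(X, labels, C)
print((mpsvm_predict(beta, X, 1.0, X) == labels).mean())
```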
Though the matrix $\hat{K}$ is strictly positive definite in theory, it can be numerically ill-conditioned in practice. Algorithm 1 may fail when $\hat{K}$, and hence $G$ and $A_j$ ($j = 1,\ldots,k$), are close to singular. Therefore, in Algorithm 2 we avoid the inversion of $G$ using the relationships
$$ZG^{-1} = [\mathbf{1}_n, I], \qquad ZG^{-1}Z' = \mathbf{1}_n\mathbf{1}_n' + \hat{K},$$
which can be achieved by some strategic matrix transformations of the solutions (4.5) and (4.6) via the Sherman-Morrison-Woodbury formula. Our experience shows that when $\hat{K}$ is well-conditioned, we should use Algorithm 1 because it is much faster than Algorithm 2. Otherwise, Algorithm 2 is preferable because it provides robust solutions.
By using the Sherman-Morrison-Woodbury formula, we have
$$B_j = \left(Z_j^{**\prime}Z_j^{**} + \lambda^* I\right)^{-1} = \left[I - Z_j^{**\prime}\left(\lambda^* I + Z_j^{**}Z_j^{**\prime}\right)^{-1}Z_j^{**}\right]/\lambda^* = \left(I - Z^{*\prime}C_jZ^*\right)/\lambda^*, \tag{4.8}$$
where
$$C_j = W_j^{1/2}\left(\lambda^* I + W_j^{1/2}Z^*Z^{*\prime}W_j^{1/2}\right)^{-1}W_j^{1/2} = \mathbf{l}_j^*\odot\left(\lambda^* I + \mathbf{l}_j^*\odot Z^*Z^{*\prime}\right)^{-1}. \tag{4.9}$$
Here $\mathbf{l}_j$ is the column vector containing the diagonal elements of $W_j^{1/2}$, $\mathbf{l}_j^* = \mathbf{l}_j\mathbf{l}_j'$, and $\odot$ is the elementwise product of two matrices. It is easy to verify that
$$B_jZ_j^{**\prime}\mathbf{y}_j^* = \left[I - Z_j^{**\prime}\left(\lambda^* I + Z_j^{**}Z_j^{**\prime}\right)^{-1}Z_j^{**}\right]Z_j^{**\prime}\mathbf{y}_j^*/\lambda^* = Z_j^{**\prime}\left(\lambda^* I + Z_j^{**}Z_j^{**\prime}\right)^{-1}\mathbf{y}_j^* = Z^{*\prime}C_j\mathbf{y}_j. \tag{4.10}$$
Let $C = k^{-1}\sum_{j=1}^{k}C_j$. Then
$$B = \sum_{j=1}^{k}B_j = k\left(I - Z^{*\prime}CZ^*\right)/\lambda^*. \tag{4.11}$$
By the Sherman-Morrison-Woodbury formula,
$$B^{-1} = \lambda^* k^{-1}\left[I - Z^{*\prime}\left(Z^*Z^{*\prime} - C^{-1}\right)^{-1}Z^*\right]. \tag{4.12}$$
The dual solution is
$$\mathbf{u}^* = -B^{-1}\sum_{j=1}^{k}B_jZ_j^{**\prime}\mathbf{y}_j^* = -\lambda^* k^{-1}\left(I - Z^{*\prime}DZ^*\right)Z^{*\prime}\sum_{j=1}^{k}C_j\mathbf{y}_j = -\lambda^* Z^{*\prime}Q, \tag{4.13}$$
where $D = \left(Z^*Z^{*\prime} - C^{-1}\right)^{-1}$, $H = k^{-1}\sum_{j=1}^{k}C_j\mathbf{y}_j$, and $Q = H - D\left(Z^*Z^{*\prime}\right)H = -DC^{-1}H$.

Combining (4.8), (4.10), and (4.13), we get the primal solution
$$\boldsymbol{\beta}_j^* = B_jZ_j^{**\prime}\mathbf{y}_j^* + B_j\mathbf{u}^* = Z^{*\prime}C_j\mathbf{y}_j - \left(I - Z^{*\prime}C_jZ^*\right)Z^{*\prime}Q = Z^{*\prime}\left[C_j\left(\mathbf{y}_j + Z^*Z^{*\prime}Q\right) - Q\right],$$
and
$$\boldsymbol{\beta}_j = G^{-1/2}\boldsymbol{\beta}_j^* = G^{-1/2}Z^{*\prime}\left[C_j\left(\mathbf{y}_j + Z^*Z^{*\prime}Q\right) - Q\right] = [\mathbf{1}_n, I]'\left[C_j\left(\mathbf{y}_j + Z^*Z^{*\prime}Q\right) - Q\right].$$
The prediction $\hat{f}_j$ is given by
$$\hat{f}_j(\mathbf{x}) = [1, K_{\mathbf{x}}']\boldsymbol{\beta}_j = \left(\mathbf{1}_n + K_{\mathbf{x}}\right)'\left[C_j\left(\mathbf{y}_j + Z^*Z^{*\prime}Q\right) - Q\right]. \tag{4.14}$$
Algorithm 2.
(a) Compute $Z^*Z^{*\prime} = ZG^{-1}Z' = \mathbf{1}_n\mathbf{1}_n' + \hat{K}$.
(b) Compute $C_j$ ($j = 1,\ldots,k$), $C$, $D$, $H$, and $Q$ using Equations (4.9)–(4.13).
(c) To predict the label for $\mathbf{x}$, compute $\hat{f}_j(\mathbf{x})$ for $j = 1,\ldots,k$ using (4.14) and classify the sample to the class giving the largest $\hat{f}_j(\mathbf{x})$.
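As a quick numerical sanity check (ours, not part of the paper) of the Sherman-Morrison-Woodbury step used in (4.8), the two forms of $B_j$ can be compared directly; the matrix standing in for $Z_j^{**}$ is random.

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 6, 0.7
Zs = rng.normal(size=(n, n + 1))                        # stands in for Z_j**
lhs = np.linalg.inv(Zs.T @ Zs + lam * np.eye(n + 1))    # B_j = (Z**' Z** + lam I)^{-1}
rhs = (np.eye(n + 1)
       - Zs.T @ np.linalg.inv(lam * np.eye(n) + Zs @ Zs.T) @ Zs) / lam
print(np.allclose(lhs, rhs))                            # True: the (n+1)x(n+1) inverse is
                                                        # traded for an n x n inverse
```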
5. NUMERICAL EXAMPLES
We illustrate the performance of the MPSVM with simulated examples. The Gaussian kernel $K(\mathbf{x}_i, \mathbf{x}_l) = \exp\left(-\|\mathbf{x}_i - \mathbf{x}_l\|^2/\sigma^2\right)$ is used, and the tuning parameters $(\lambda^*, \sigma^2)$ are determined by ten-fold cross-validation. Section 5.2 shows an example for nonstandard classifications. Equal misclassification costs are assumed unless otherwise specified.
5.1 COMPARISON WITH MSVM
A training set of sample size 200 is generated according to the three-class model on the unit interval $[0, 1]$ with the conditional probabilities
$$p_1(x) = 0.97\exp(-3x), \qquad p_3(x) = \exp\left[-2.5(x - 1.2)^2\right], \qquad p_2(x) = 1 - p_1(x) - p_3(x).$$

[Figure 1. Plots of true and estimated conditional probability functions.]
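A data-generating sketch for this model (our illustration of the stated probabilities, not the authors' simulation code) is:

```python
import numpy as np

def conditional_probs(x):
    # p1 and p3 as specified above; p2 takes the remaining probability mass.
    p1 = 0.97 * np.exp(-3.0 * x)
    p3 = np.exp(-2.5 * (x - 1.2) ** 2)
    p2 = 1.0 - p1 - p3
    return np.column_stack([p1, p2, p3])

def simulate(n, rng):
    x = rng.uniform(0.0, 1.0, size=n)                  # inputs on the unit interval [0, 1]
    P = conditional_probs(x)
    u = rng.uniform(size=n)
    y = (u[:, None] > P.cumsum(axis=1)).sum(axis=1)    # draw class 0/1/2 from P(x)
    return x, y

x, y = simulate(200, np.random.default_rng(3))
print(np.bincount(y, minlength=3))                     # class counts in the training set
```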
This example was also used by Lee, Lin, and Wahba (2004). We apply the MPSVM and use
the formula (4.3) to estimate the true conditional probabilities. As shown in Figures 1(a)
and (b), the MPSVM has reasonably recovered the conditional probabilities. According to
(2.2), the Bayes class boundary for this example consists of two points: $x = 0.260$, where $p_1(x) = p_2(x) > p_3(x)$, and $x = 0.613$, where $p_1(x) < p_2(x) = p_3(x)$. The MPSVM
estimates these two critical points as 0.262 and 0.615, and hence approximates the Bayes
rule very well.
We now compare the MPSVM with the MSVM (Lee, Lin, and Wahba 2004), and
their one-versus-rest variants. We generate 100 training sets of sample size 200 to fit the
classification rules, and a testing set of sample size 10,000 to evaluate their prediction
accuracy. The Bayes rate for this example is 0.3841. The average testing error rate (± standard deviation) is 0.4068 ± 0.0101 for the MPSVM and 0.4081 ± 0.0124 for the one-versus-rest PSVM. As reported by Lee, Lin, and Wahba (2004), the average error rate is 0.3951 ± 0.0099 for the MSVM and 0.4307 ± 0.0132 for the one-versus-rest SVM. Overall,
the performance of the MPSVM and the one-versus-rest PSVM is close to that of the MSVM. Note that the error rate of the one-versus-rest SVM is slightly higher. One possible reason is that the one-versus-rest SVM may not approximate the Bayes rule well even when the sample size
is large (Lee, Lin, and Wahba 2004), while all the other three classifiers are asymptotically
optimal.
5.2 SAMPLING BIAS
This example illustrates the effectiveness of the MPSVM in handling the sampling
bias situation. There are three classes, and the true population is distributed on the square $A = [0, \sqrt{10}]\times[0, \sqrt{10}]$ with the following conditional probabilities:
$$\Pr(Y = j \mid \mathbf{x}\in A_l) = 0.9 \quad \text{if } j = l, \qquad \Pr(Y = j \mid \mathbf{x}\in A_l) = 0.05 \quad \text{if } j \neq l,$$
with $1 \leq j, l \leq 3$. The three regions are
$$A_1 = \left\{\mathbf{X}: X_1^2 + X_2^2 \leq 2,\ X_1 > X_2\right\}, \quad A_2 = \left\{\mathbf{X}: X_1^2 + X_2^2 \leq 2,\ X_1 \leq X_2\right\}, \quad A_3 = \left\{\mathbf{X}: X_1^2 + X_2^2 > 2\right\},$$
and they partition the square $A$ with the area ratio $1:1:8$.

[Figure 2. Sampling bias.]

We first generate 500 points
from the population, then intentionally change the class proportions in the training set by
keeping all points generated from classes 1 and 2, and only 1/8 of those from class 3. The
sampling scheme is therefore biased and the class proportions in the training set do not
reflect those in the underlying population. Figure 2(a) plots the training data and the Bayes
decision boundary. We fit the MPSVM with, respectively, equal misclassification costs
and the costs adjusted according to (2.3). The decision boundary based on adjusted costs
in Figure 2(b) is very similar to the Bayes boundary in Figure 2(a), while the unadjusted
boundary in Figure 2(c) fails to approximate the Bayes rule. This suggests that the adjustment
of misclassification costs in the MPSVM is necessary to improve classification accuracy in
the sampling bias case.
6. APPLICATIONS
In this section, we apply the MPSVM to cancer classification using two benchmark
microarray datasets. One main challenge in microarray data analysis is that the number of
genes is often much larger than the sample size. One typical way to handle this issue is to
conduct dimension reduction, say by principal component analysis or factor analysis (Khan et al. 2001; West 2003), before applying classification methods. As regularized regression models, the SVM-type methods offer an alternative way to handle this “large $p$, small $n$”
problem (West 2003) and are well-suited for microarray data analysis.
6.1 LEUKEMIA DATA
The leukemia dataset was published by Golub et al. (1999) for the classification of two types of leukemia: ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia). These two cancer types are identified based on their origins: AML originates in the bone marrow and ALL originates in lymphatic tissue. The ALL leukemia can be
further divided into B-cell and T-cell ALLs. There are 38 samples (19 ALL B-cell; 8 ALL
T-cell; 11 AML) in the training set and 34 samples (19 ALL B-cell; 1 ALL T-cell and 14
AML) in the testing set. For each sample, the expression values of 7,129 genes are collected.
In the literature, these data have been treated as a two-class (ALL/AML) problem (Golub
et al. 1999; Dudoit, Fridlyand, and Speed 2002; Tibshirani, Hastie, Narasimhan, and Chu
2002; Nguyen and Rocke 2002b) or a three-class (AML/ALL B-cell/ALL T-cell) problem (Lee and Lee 2003; Nguyen and Rocke 2002a; Dudoit, Fridlyand, and Speed 2002). In this article, we treat it as a three-class classification problem.

[Figure 3. Heat maps of leukemia data.]
To compare the classification performance of the MPSVM with MSVM, we first apply
the procedures used by Lee and Lee (2003) to preprocess the data: (1) thresholding (floor of 100 and ceiling of 16,000); (2) filtering (exclusion of genes with max/min ≤ 5 and max − min ≤ 500 across the samples); and (3) base-10 logarithmic transformation. The filtering results in 3,571 genes. After these initial steps, each array is standardized, and $F$-tests are used to select the top 40 genes with the largest $F$-statistics.
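Gene screening of this kind can be sketched with a one-way ANOVA $F$ statistic per gene, for example via scikit-learn's f_classif; the expression matrix and labels below are placeholders, and the thresholding, filtering, and log-transformation steps listed above are assumed to have been applied already.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(4)
X = rng.normal(size=(38, 3571))          # placeholder: samples x filtered genes
y = rng.integers(0, 3, size=38)          # placeholder three-class labels

# Standardize each array (sample) across its genes, then rank genes by F statistic.
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
F, pval = f_classif(X, y)                # one-way ANOVA F statistic for each gene
top40 = np.argsort(F)[::-1][:40]         # keep the 40 genes with the largest F
print(top40[:5])
```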
Figure 3 plots the heat maps of the training set and testing set, where rows represent
genes and columns represent samples. The gene expression values are indicated by color,
with black representing the lowest level and white the highest level of expression. We put
the training samples within each class together, and use hierarchical clustering to order the
genes so that genes close to each other have similar profiles across the training samples.
As Figure 3(a) shows, a subset of genes is highly expressed for each class; therefore, the selected 40 genes are very informative in discriminating the three classes. In the heat map
(Figure 3(b)) of the testing set, the genes are ordered in the same way as in the training set,
and hierarchical clustering is also used to order samples. The expression pattern of testing
samples seems to match the training samples pretty well.
Because the sample size is small, the performance of MPSVM and one-versus-rest
PSVM is quite robust to the choice of tuning parameters. With various equally good tuning
parameters, none or at most one testing sample is misclassified by the MPSVM and the one-
versus-rest PSVM. For the MSVM, as shown by Lee and Lee (2003), one testing sample is misclassified using GACV tuning and, on average, 0.8 testing samples are misclassified using leave-one-out cross-validation. This demonstrates that the performance of the MPSVM is comparable to that of the MSVM, but the MPSVM is faster to implement.
6.2 SMALL ROUND BLUE CELL TUMOR DATA
The small round blue cell tumors of children dataset was published by Khan et al.
(2001). There are four tumor types: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-
Hodgkin lymphoma (BL/NHL), and Ewing family of tumors (EWS). This dataset contains
2,308 out of 6,567 genes after filtering. There are 63 samples (8 BL; 23 EWS; 12 NB; 20 RMS) in the training set and 20 samples (3 BL; 6 EWS; 6 NB; 5 RMS) in the testing set. Similar to the analysis of the leukemia data, the data were logarithm-transformed and $F$-tests were used to preselect 40 genes with the smallest $p$ values.
and one-versus-rest PSVM are both 0. This dataset has been analyzed by many methods
(Khan et al. 2001; Lee and Lee 2003; Dudoit, Fridlyand, and Speed 2002; Tibshirani et al.
2002), and nearly all of them yield 0 testing errors.
APPENDIX
Proof of Theorem 1: Let $f_j(\mathbf{x}) = \beta_{j0} + h_j(\mathbf{x})$ with $h_j \in H_K$ for $j = 1,\ldots,k$. We decompose each $h_j$ as follows: $h_j(\cdot) = \sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot) + \rho_j(\cdot)$, where $\rho_j(\cdot)$ is the element in the RKHS orthogonal to the spanned space $\{K(\mathbf{x}_i,\cdot),\ i = 1,\ldots,n\}$. By the sum-to-zero constraint, we have
$$f_k(\cdot) = -\sum_{j=1}^{k-1}\beta_{j0} - \sum_{j=1}^{k-1}\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot) - \sum_{j=1}^{k-1}\rho_j(\cdot).$$
By the reproducing property, we have $\langle h_j(\cdot), K(\mathbf{x}_i,\cdot)\rangle_{H_K} = h_j(\mathbf{x}_i)$ for $i = 1,\ldots,n$. Thus,
$$f_j(\mathbf{x}_i) = \beta_{j0} + h_j(\mathbf{x}_i) = \beta_{j0} + \langle h_j, K(\mathbf{x}_i,\cdot)\rangle_{H_K} = \beta_{j0} + \left\langle\sum_{l=1}^{n}\beta_{jl}K(\mathbf{x}_l,\cdot) + \rho_j(\cdot),\ K(\mathbf{x}_i,\cdot)\right\rangle_{H_K} = \beta_{j0} + \sum_{l=1}^{n}\beta_{jl}K(\mathbf{x}_i, \mathbf{x}_l).$$
Hence, the data-fit functional in (4.1) does not depend on $\rho_j(\cdot)$. Furthermore,
$$\|h_j\|^2 = \left\|\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)\right\|^2 + \|\rho_j\|^2, \qquad \text{for } 1\leq j\leq k-1,$$
and
$$\|h_k\|^2 = \left\|\sum_{j=1}^{k-1}\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)\right\|^2 + \left\|\sum_{j=1}^{k-1}\rho_j\right\|^2.$$
Obviously, the minimizer of (4.1) does not depend on $\rho_j$. This implies Equation (4.2).

It remains to show that minimizing (4.1) subject to $\sum_{j=1}^{k}\beta_{ji} = 0$ for $i = 0,\ldots,n$ is equivalent to minimizing (4.1) subject to $\sum_{j=1}^{k}f_j(\mathbf{x}) = 0$ for any $\mathbf{x}$. Let $\hat{K}$ be the $n$ by $n$ matrix with the $(i,l)$th entry $K(\mathbf{x}_i, \mathbf{x}_l)$. Let $f_j(\cdot) = \beta_{j0} + \sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)$ be the minimizer of (4.1) under the sum-to-zero constraint. Let $\boldsymbol{\beta}_j = (\beta_{j1},\ldots,\beta_{jn})'$ for $j = 1,\ldots,k$, and $\bar{\beta}_l = k^{-1}\sum_{j=1}^{k}\beta_{jl}$ for $l = 0,\ldots,n$. Define
$$f_j^*(\cdot) = \beta_{j0}^* + \sum_{i=1}^{n}\beta_{ji}^*K(\mathbf{x}_i,\cdot) = \left(\beta_{j0} - \bar{\beta}_0\right) + \sum_{i=1}^{n}\left(\beta_{ji} - \bar{\beta}_i\right)K(\mathbf{x}_i,\cdot).$$
Then $f_j^*(\cdot) = f_j(\cdot)$ because $\sum_{j=1}^{k}f_j(\cdot) \equiv 0$. However,
$$\sum_{j=1}^{k}\left(\|h_j^*\|^2 + \beta_{j0}^{*2}\right) = \sum_{j=1}^{k}\boldsymbol{\beta}_j'\hat{K}\boldsymbol{\beta}_j + \sum_{j=1}^{k}\beta_{j0}^2 - k\bar{\beta}_0^2 - k^{-1}\left(\sum_{j=1}^{k}\boldsymbol{\beta}_j\right)'\hat{K}\left(\sum_{j=1}^{k}\boldsymbol{\beta}_j\right) \leq \sum_{j=1}^{k}\left(\boldsymbol{\beta}_j'\hat{K}\boldsymbol{\beta}_j + \beta_{j0}^2\right) = \sum_{j=1}^{k}\left(\|h_j\|^2 + \beta_{j0}^2\right).$$
The equality has to hold since $f_j$ is the global minimizer. This implies that
$$\bar{\beta}_0 = 0, \qquad \left(\sum_{j=1}^{k}\boldsymbol{\beta}_j\right)'\hat{K}\left(\sum_{j=1}^{k}\boldsymbol{\beta}_j\right) = 0.$$
When $\hat{K}$ is positive definite, $\sum_{j=1}^{k}\beta_{ji} = 0$ for $i = 0,\ldots,n$. On the other hand, let $f_j(\cdot) = \beta_{j0} + \sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)$ be the minimizer of (4.1) satisfying $\sum_{j=1}^{k}\beta_{ji} = 0$ for $i = 0,\ldots,n$. Let $\boldsymbol{\beta}_j^* = (\beta_{j1},\ldots,\beta_{jn})'$ for $j = 1,\ldots,k$. Thus $\hat{K}\sum_{j=1}^{k}\boldsymbol{\beta}_j^* = \mathbf{0}$ and
$$0 = \left(\sum_{j=1}^{k}\boldsymbol{\beta}_j^*\right)'\hat{K}\left(\sum_{j=1}^{k}\boldsymbol{\beta}_j^*\right) = \left\|\sum_{j=1}^{k}\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)\right\|^2.$$
This suggests that $\sum_{j=1}^{k}\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}, \mathbf{x}_i) = 0$ and hence $\sum_{j=1}^{k}f_j(\mathbf{x}) = 0$ for any $\mathbf{x}$.


Proof of Theorem 2: Since $E\left[(\mathbf{f}(\mathbf{X}) - \mathbf{Y})'W(\mathbf{Y})(\mathbf{f}(\mathbf{X}) - \mathbf{Y})\right] = E\left\{E\left[(\mathbf{f}(\mathbf{X}) - \mathbf{Y})'W(\mathbf{Y})(\mathbf{f}(\mathbf{X}) - \mathbf{Y}) \mid \mathbf{X}\right]\right\}$, we can minimize $E\left[(\mathbf{f}(\mathbf{X}) - \mathbf{Y})'W(\mathbf{Y})(\mathbf{f}(\mathbf{X}) - \mathbf{Y})\right]$ by minimizing $E\left[(\mathbf{f}(\mathbf{X}) - \mathbf{Y})'W(\mathbf{Y})(\mathbf{f}(\mathbf{X}) - \mathbf{Y}) \mid \mathbf{X} = \mathbf{x}\right]$ for every $\mathbf{x}$. Note that $c_{jl} = 0$ for $j = l$; then
$$Q_{\mathbf{x}}(\mathbf{f}) = E\left[(\mathbf{f}(\mathbf{X}) - \mathbf{Y})'W(\mathbf{Y})(\mathbf{f}(\mathbf{X}) - \mathbf{Y}) \,\Big|\, \mathbf{X} = \mathbf{x}\right] = \sum_{j=1}^{k}\sum_{l=1}^{k} c_{jl}\left[f_l(\mathbf{x}) + \frac{1}{k-1}\right]^2 p_j(\mathbf{x}) = \sum_{j=1}^{k}\left(\sum_{l=1}^{k}c_{lj}\,p_l(\mathbf{x})\right)\left[f_j(\mathbf{x}) + \frac{1}{k-1}\right]^2 = \sum_{j=1}^{k}\tau_j(\mathbf{x})\left[f_j(\mathbf{x}) + \frac{1}{k-1}\right]^2,$$
where $\tau_j(\mathbf{x}) = \sum_{l=1}^{k}c_{lj}\,p_l(\mathbf{x})$. Consider minimizing the function $Q_{\mathbf{x}}(\mathbf{f})$ subject to
$\sum_{j=1}^{k}f_j(\mathbf{x}) = 0$. Using the Lagrange multiplier $\gamma > 0$, we get
$$A_{\mathbf{x}}(\mathbf{f}) = \sum_{j=1}^{k}\tau_j(\mathbf{x})\left[f_j(\mathbf{x}) + \frac{1}{k-1}\right]^2 - \gamma\sum_{j=1}^{k}f_j(\mathbf{x}).$$
For each $j = 1,\ldots,k$, we have
$$\frac{\partial A_{\mathbf{x}}}{\partial f_j} = 2\tau_j\left[f_j + \frac{1}{k-1}\right] - \gamma = 0 \;\Longrightarrow\; f_j = \frac{\gamma}{2\tau_j} - \frac{1}{k-1}.$$
Since $\sum_{j=1}^{k}f_j = 0$, we have that
$$\gamma = \frac{2k}{k-1}\cdot\frac{1}{\sum_{j=1}^{k}\tau_j^{-1}}.$$
The solution is then given by
$$f_j = \frac{k}{k-1}\cdot\frac{\tau_j^{-1}}{\sum_{l=1}^{k}\tau_l^{-1}} - \frac{1}{k-1}. \tag{A.1}$$
Thus, $\arg\max_{j=1,\ldots,k} f_j(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\tau_j(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\sum_{l=1}^{k}c_{lj}\,p_l(\mathbf{x})$.
ACKNOWLEDGMENTS
The authors are very grateful to the editors and referees for constructive suggestions that greatly improved
the presentation of this work. We also thank Subhashis Ghosal, Kevin Gross, and David Chorlian for helpful
comments on the article. Hao Helen Zhang’s research was partially supported by the National Science Foundation
Grant DMS-0405913.
[Received December 2004. Revised September 2005.]
REFERENCES
Agarwal, D. K. (2002), “Shrinkage Estimator Generalizations of Proximal Support Vector Machines,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, pp. 173–182.
Boser, B. E., Guyon, I. M., and Vapnik, V. (1992), “A Training Algorithm for Optimal Margin Classifiers,” in 5th
Annual ACM Workshop on Computational Learning Theory, ed. Haussler, D., Pittsburgh, PA: ACM Press,
pp. 144–152.
Burges, C. J. C. (1998), “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and
Knowledge Discovery, 2, 121–167.
Cox, D., and O’Sullivan, F. (1990), “Asymptotic Analysis of Penalized Likelihood Type Estimators,” The Annals
of Statistics, 18, 1676–1695.
Cristianini, N., and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, New York: Cambridge
University Press.
Dudoit, S., Fridlyand, J., and Speed, T. (2002), “Comparison of Discrimination Methods for the Classification of
Tumors Using Gene Expression Data,” Journal of the American Statistical Association, 97, 77–87.
Fung, G., and Mangasarian, O. L. (2001), “Proximal Support Vector Machine Classifiers,” in Proceedings of the
7th ACM Conference on Knowledge Discovery and Data Mining, Pittsburgh, PA: ACM Press, pp. 77–86.
(2005), “Multicategory Proximal Support Vector Machine Classifiers,” Machine Learning, 59, 77–97.
Girosi, F. (1998), “An Equivalence Between Sparse Approximation and Support Vector Machines,” Neural Com-
putation, 10, 1455–1480.
Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J.,
Caligiuri, M., Bloomfield, C., and Lander, E. (1999), “Molecular Classification of Cancer: Class Discovery
and Class Prediction by Gene Expression Monitoring,” Science, 286, 531–537.
Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu,
C., Peterson, C., and Meltzer, P. (2001), “Classification and Diagnostic Prediction of Cancers Using Gene
Expression Profiling and Artificial Neural Networks,” Nature Medicine, 7, 673–679.
Kimeldorf, G., and Wahba, G. (1971), “Some Results on Tchebycheffian Spline Functions,” Journal of Mathe-
matical Analysis and Applications, 33, 82–95.
Lee, Y., and Lee, C. (2003), “Classification of Multiple Cancer Types By Multicategory Support Vector Machines
Using Gene Expression Data,” Bioinformatics, 19, 1132–1139.
Lee, Y., Lin, Y., and Wahba, G. (2004), “Multicategory Support Vector Machines, Theory, and Application to
the Classification of Microarray Data and Satellite Radiance Data,” Journal of the American Statistical
Association, 99, 67–81.
Lin, Y., Lee, Y., and Wahba, G. (2002), “Support Vector Machines for Classification in Nonstandard Situations,”
Machine Learning, 46, 191–202.
Nguyen, D. V., and Rocke, D. M. (2002a), “Multi-class Cancer Classification via Partial Least Squares with Gene
Expression Profiles,” Bioinformatics, 18, 1216–1226.
(2002b), “Tumor Classification by Partial Least Squares Using Microarray Gene Expression Data,” Bioin-
formatics, 18, 39–50.
Poggio, T., and Girosi, F. (1998), “A Sparse Representation for Functional Approximation,” Neural Computation,
10, 1445–1454.
Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., and Vandewalle, J. (2002), Least Squares Support
Vector Machines, World Scientific, Singapore.
Suykens, J. A. K., and Vandewalle, J. (1999), “Least Squares Support Vector Machine Classifiers,” Neural Pro-
cessing Letters, 9, 293–300.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), “Diagnosis of Multiple Cancer Types By Shrunken
Centroids of Gene Expression,” in Proceedings of the National Academy of Sciences USA, 99, pp. 6567–6572.
Van Gestel, T., Suykens, J. A. K., Lanckriet, G., Lambrechts, A., De Moor, B., and Vandewalle, J. (2002a),
“Bayesian Framework for Least-Squares Support Vector Machine Classifiers, Gaussian Processes and Kernel
Fisher Discriminant Analysis,” Neural Computation, 15, 1115–1148.
(2002b), “Multiclass LS-SVMs: Moderated Outputs and Coding-decoding Schemes,” Neural Processing
Letters, 15, 45–58.
Vapnik, V. N. (1998), Statistical Learning Theory, New York: Wiley.
Wahba, G. (1990), Spline Models for Observational Data, CBMS-NSF Regional Conference Series in Applied
Mathematics, 59, Philadelphia, PA: SIAM.
West, M. (2003), “Bayesian Factor Regression Models in the “Large p, Small n” Paradigm,” Bayesian Statistics,
7, 723–732.
Weston, J., and Watkins, C. (1999), “Support Vector Machines for MultiClass Pattern Recognition,” in Proceedings
of the 7th European Symposium On Artificial Neural Networks, Bruges, Belgium, pp. 219–224.