Multiclass Proximal Support Vector Machines

Yongqiang TANG and Hao Helen ZHANG

This article proposes the multiclass proximal support vector machine (MPSVM) classifier, which extends the binary PSVM to the multiclass case. Unlike the one-versus-rest approach that constructs the decision rule based on multiple binary classification tasks, the proposed method considers all classes simultaneously and has better theoretical properties and empirical performance. We formulate the MPSVM as a regularization problem in the reproducing kernel Hilbert space and show that it implements the Bayes rule for classification. In addition, the MPSVM can handle equal and unequal misclassification costs in a unified framework. We suggest an efficient algorithm to implement the MPSVM by solving a system of linear equations. This algorithm requires much less computational effort than solving the standard SVM, which often requires quadratic programming and can be slow for large problems. We also provide an alternative and more robust algorithm for ill-posed problems. The effectiveness of the MPSVM is demonstrated by both simulation studies and applications to cancer classifications using microarray data.

Key Words: Bayes rule; Nonstandard classiﬁcations; Reproducing kernel Hilbert space.

1. INTRODUCTION

In multiclass classification problems, the training set consists of $n$ samples $\{(\mathbf{x}_i, y_i); i = 1,\ldots,n\}$, where $\mathbf{x}_i \in \mathbb{R}^d$ is the input vector and $y_i \in \{1,\ldots,k\}$ is the class label for the $i$th

sample. Our task is to learn a classiﬁcation rule from the training set to predict the label for

a future sample based on its input. The support vector machine (SVM) (Boser, Guyon, and

Vapnik 1992; Vapnik 1998; Burges 1998; Cristianini and Shawe-Taylor 2000) has performed

successfully in many real-world problems. The SVM is attractive in its ability to condense

the information contained in the training set and to ﬁnd a decision surface determined by

certain points in the training set. Lee, Lin, and Wahba (2004) further generalized the SVM

to multicategory SVM (MSVM). Other multiclass versions of SVM have been studied by

Vapnik (1998) and Weston and Watkins (1999).

Yongqiang Tang is Statistician and Assistant Professor, Department of Psychiatry, SUNY Health Science Center, Brooklyn, NY 11203 (E-mail: yongqiang tang@yahoo.com). Hao Helen Zhang is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: hzhang@stat.ncsu.edu).

© 2006 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America
Journal of Computational and Graphical Statistics, Volume 15, Number 2, Pages 339–355
DOI: 10.1198/106186006X113647

The implementation of the SVM and MSVM demands solving quadratic programming problems under linear constraints, which can be computationally expensive for large datasets. For multiclass problems, the computation can be very challenging even for moderately sized datasets if the number of classes $k$ is large.

(PSVM) were recently introduced as a variant of SVM for binary classiﬁcations by Suykens

and Vandewalle (1999) and Fung and Mangasarian (2001). In theory, both the PSVM and

SVM target the optimal Bayes rule asymptotically, which explains their comparable per-

formance in most empirical studies. However, it is much faster to train a PSVM classiﬁer

by simply solving a system of linear equations. Suykens and Vandewalle (1999), Suykens

et al. (2002), and Van Gestel et al. (2002a) proposed the least squares SVM (LSSVM).

The formulation of LSSVM is similar to that of PSVM except that the latter penalizes the

constant term as well. Extensions of the PSVM and LSSVM to multiclass problems have

been considered by Fung and Mangasarian (2005) and Van Gestel et al. (2002b) based on

the one-versus-rest scheme. Their main idea is to solvek binary classiﬁcation problems

by separating one class from the rest, then construct the decision rule according to the

maximal output. One disadvantage of this scheme is that the resulting binary problems are

often very unbalanced, leading to poor performance in some cases (Fung and Mangasarian

2005). In addition, it is not easy to take into account unequal misclassiﬁcation costs in the

one-versus-rest approach.

This article proposes the multiclass PSVM (MPSVM) using an alternative formulation.

The proposed method directly targets the boundaries among $k$ classes simultaneously by

estimating some functions of the conditional probabilities. The MPSVM is constructed in

a regularization framework of reproducing kernel Hilbert space (RKHS), and is optimal in

terms of implementing the Bayes rule asymptotically. Compared to the approaches based on

the one-versus-rest scheme, the MPSVM is more ﬂexible in handling nonstandard situations

such as unequal misclassiﬁcation costs and nonrepresentative training samples. With regard

to computation, the MPSVM can be solved more efﬁciently than the MSVM.

The article is organized as follows. Section 2 brieﬂy states the Bayes classiﬁcation rule.

Section 3 reviews the binary SVM and PSVM. Section 4 introduces the MPSVM, studies its statistical properties, and proposes two computation algorithms. Section 5 illustrates the

performance of the MPSVM with simulation examples. Section 6 applies the MPSVM to

cancer classiﬁcations using gene expression data.

2. BAYES CLASSIFICATION RULES

Bayes rule is the optimal classiﬁcation rule if the underlying distribution of the data is

known. It serves as the gold standard to which any classiﬁer can be compared. In practice,

the underlying distribution is rarely known. One common way to approximate the Bayes

rule is to estimate the distribution or related classiﬁcation functions from the training set.

For a $k$-class classification problem, we need to learn a classification rule $\phi(\mathbf{x}): \mathbb{R}^d \to \{1,\ldots,k\}$ from the training set. Assume that the samples in the training set are independently and identically drawn from some distribution $P(\mathbf{x}, y)$. Let $p_j(\mathbf{x}) = \Pr(Y = j \mid \mathbf{X} = \mathbf{x})$ be the conditional probability of class $j$ given $\mathbf{X} = \mathbf{x}$ for $j = 1,\ldots,k$. Let $C$ be a $k \times k$ cost matrix, with $c_{jl}$ the cost of incorrectly classifying a sample from class $j$ to class $l$. All $c_{jj}$'s are equal to 0 because we do not penalize correct decisions. The Bayes rule, minimizing the expected cost of misclassifying an observation
$$E\left[c_{Y\phi(\mathbf{X})}\right] = E_{\mathbf{X}}\left[\sum_{l=1}^{k} c_{l\phi(\mathbf{x})} \Pr(Y = l \mid \mathbf{X} = \mathbf{x})\right] = E_{\mathbf{X}}\left[\sum_{l=1}^{k} c_{l\phi(\mathbf{x})}\, p_l(\mathbf{x})\right],$$
is given by
$$\phi_B(\mathbf{x}) = \mathop{\arg\min}_{j=1,\ldots,k}\ \sum_{l=1}^{k} c_{lj}\, p_l(\mathbf{x}). \qquad (2.1)$$
With equal misclassification costs, that is, $c_{jl} = 1$ for $j \neq l$, the Bayes rule simplifies to
$$\phi_B(\mathbf{x}) = \mathop{\arg\min}_{j=1,\ldots,k}\ \left[1 - p_j(\mathbf{x})\right] = \mathop{\arg\max}_{j=1,\ldots,k}\ p_j(\mathbf{x}), \qquad (2.2)$$
which can be interpreted as the minimizer of the expected misclassification rate $E[\,Y \neq \phi(\mathbf{X})\,]$.

The majority of classiﬁcation methods consider only the standard setting, which as-

sumes that the samples in the training set truly represent the target population and imposes

equal costs on different types of misclassiﬁcations. However, nonstandard classiﬁcation

problems may arise in many real situations. When one type of misclassiﬁcation is more

serious than others, we should apply unequal costs. For example, in the diagnosis of a

disease, classifying a diseased person as healthy or a healthy person as diseased may have

different consequences. See Lin, Lee, and Wahba (2002) for more discussions on this issue.

Sampling bias is another concern that merits special attention. When the class proportions

are quite unbalanced in the population, one often tends to oversample minor classes and

downsample major classes (Lin, Lee, and Wahba 2002; Lee, Lin, and Wahba 2004). If the

sampling scheme depends only on the class label $Y$ and not on the input $\mathbf{X}$, the Bayes rule is
$$\phi_B(\mathbf{x}) = \mathop{\arg\min}_{j=1,\ldots,k}\ \sum_{l=1}^{k} c_{lj}\,\frac{\pi_l}{\pi_l^s}\, p_l^s(\mathbf{x}),$$
where $\pi_l$ is the prior proportion of class $l$ in the target population, $\pi_l^s$ is the proportion of class $l$ in the training set, and $p_l^s(\mathbf{x}) = \Pr(Y^s = l \mid \mathbf{X} = \mathbf{x})$ is the conditional probability of a sample in the training set belonging to class $l$ given $\mathbf{X} = \mathbf{x}$. In this article, we do not distinguish the two nonstandard cases, since the issue of sampling bias can be taken into account by modifying the costs as
$$c_{lj}^{\text{new}} = c_{lj}\,\frac{\pi_l}{\pi_l^s}. \qquad (2.3)$$

3. BINARY SVM AND PROXIMAL SVM

In binary classification problems, the class label $y_i$ is often coded as $+1$ or $-1$. Originally proposed as a maximal margin classifier, the SVM methodology can also be cast as a regularization problem in a reproducing kernel Hilbert space (RKHS). See Wahba (1990), Girosi (1998), and Poggio and Girosi (1998) for more details. Let $\mathcal{H}_K$ be an RKHS associated with the reproducing kernel $K$. The SVM solves
$$\min_f\ \frac{1}{n}\sum_{i=1}^{n}L(y_i)\left[1 - y_if(\mathbf{x}_i)\right]_+ + \lambda\|h\|_{\mathcal{H}_K}^2, \qquad (3.1)$$
over all the functions of the form $f(\mathbf{x}) = \beta_0 + h(\mathbf{x})$, where $h \in \mathcal{H}_K$, $(v)_+ = \max\{v, 0\}$, and $L(y_i)$ is the cost of misclassifying the $i$th observation. The classification rule is $\mathrm{sign}[f(\mathbf{x})]$.

The proximal SVM was ﬁrst studied by Suykens and Vandewalle (1999) and Fung and

Mangasarian (2001). Unlike the SVM, the PSVM method classiﬁes points to the closest

of two parallel hyperplanes that are pushed apart as far as possible. In the regularization

framework, we can formulate the PSVM as

$$\min_f\ \frac{1}{n}\sum_{i=1}^{n}L(y_i)\left[1 - y_if(\mathbf{x}_i)\right]^2 + \lambda\left(\|h\|_{\mathcal{H}_K}^2 + \beta_0^2\right). \qquad (3.2)$$
Since $y_i = 1$ or $-1$, (3.2) is equivalent to
$$\min_f\ \frac{1}{n}\sum_{i=1}^{n}L(y_i)\left[y_i - f(\mathbf{x}_i)\right]^2 + \lambda\left(\|h\|_{\mathcal{H}_K}^2 + \beta_0^2\right). \qquad (3.3)$$
Thus, the PSVM can be interpreted as the ridge regression model (Agarwal 2002). By the representer theorem (Kimeldorf and Wahba 1971), the solution to (3.1) or (3.2) has a finite representation form $f(\mathbf{x}) = \beta_0 + \sum_{i=1}^{n}\beta_iK(\mathbf{x}, \mathbf{x}_i)$. When equal misclassification costs are used, (3.2) becomes
$$\min_f\ \frac{1}{n}\sum_{i=1}^{n}\left[1 - y_if(\mathbf{x}_i)\right]^2 + \lambda\left(\sum_{j=1}^{n}\sum_{l=1}^{n}\beta_j\beta_lK(\mathbf{x}_j, \mathbf{x}_l) + \beta_0^2\right). \qquad (3.4)$$
The formulation (3.4) is slightly different from the kernel PSVM studied by Fung and Mangasarian (2001), where they solved
$$\min_f\ \frac{1}{n}\sum_{i=1}^{n}\left[1 - y_if(\mathbf{x}_i)\right]^2 + \lambda\left(\sum_{j=1}^{n}\beta_j^2 + \beta_0^2\right).$$

4. MULTICLASS PROXIMAL SVM

This section introduces the multiclass proximal SVM (MPSVM). The proposed

MPSVM embraces the binary PSVM as a special case. Section 4.1 presents the formulation of the MPSVM and investigates its solution and asymptotic properties. Section 4.2

introduces two efficient computational strategies for implementing the MPSVM.

4.1 FORMULATION AND THEORETICAL PROPERTIES

In multiclass classification problems, the class label $y_i$ can be coded as a $k$-dimensional vector
$$\mathbf{y}_i = \begin{cases}\left(1, \tfrac{-1}{k-1}, \ldots, \tfrac{-1}{k-1}\right) & \text{if sample } i \text{ belongs to class } 1,\\ \left(\tfrac{-1}{k-1}, 1, \ldots, \tfrac{-1}{k-1}\right) & \text{if sample } i \text{ belongs to class } 2,\\ \qquad\cdots & \\ \left(\tfrac{-1}{k-1}, \tfrac{-1}{k-1}, \ldots, 1\right) & \text{if sample } i \text{ belongs to class } k.\end{cases}$$
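In code, this coding is a one-line construction (labels are taken as 0-indexed here, an implementation convenience):

```python
import numpy as np

def code_labels(y, k):
    # Row i is the k-vector coding of sample i: 1 in its own class,
    # -1/(k-1) elsewhere, so every row sums to zero.
    Y = np.full((len(y), k), -1.0 / (k - 1))
    Y[np.arange(len(y)), y] = 1.0
    return Y

Y = code_labels(np.array([0, 2, 1]), k=3)
print(Y)         # rows: (1, -1/2, -1/2), (-1/2, -1/2, 1), (-1/2, 1, -1/2)
print(Y.sum(1))  # each row sums to 0
```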

This coding was first used by Lee, Lin, and Wahba (2004). We consider a $k$-tuple separating function $\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_k(\mathbf{x}))^\top$. Since the sum of the components in $\mathbf{y}_i$ is 0, we let $\mathbf{f}$ satisfy the sum-to-zero constraint, $\sum_{j=1}^{k} f_j(\mathbf{x}) = 0$ for any $\mathbf{x} \in \mathbb{R}^d$. Analogous to the binary case, we assume that $\mathbf{f}(\mathbf{x}) \in \prod_{j=1}^{k}(\{1\} + \mathcal{H}_K)$, the tensor product space of $k$ reproducing kernel Hilbert spaces $\mathcal{H}_K$. Each component $f_j(\mathbf{x})$ can be expressed as $\beta_{j0} + h_j(\mathbf{x})$, with $\beta_{j0} \in \mathbb{R}$ and $h_j \in \mathcal{H}_K$. For each $\mathbf{y}_i$, define the diagonal matrix $W(\mathbf{y}_i) = \mathrm{diag}\{c_{l_i1}, c_{l_i2}, \ldots, c_{l_ik}\}$, where $l_i$ is the membership of $\mathbf{y}_i$ and $c_{l_ij}$ denotes the cost of classifying the $i$th sample to class $j$. The MPSVM is proposed as

$$\min_{\mathbf{f}}\ \frac{1}{n}\sum_{i=1}^{n}\left[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)\right]^\top W(\mathbf{y}_i)\left[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)\right] + \frac{1}{2}\lambda\sum_{j=1}^{k}\left(\|h_j\|_{\mathcal{H}_K}^2 + \beta_{j0}^2\right), \qquad (4.1)$$
subject to $\sum_{j=1}^{k} f_j(\mathbf{x}) = 0$ for any $\mathbf{x}$. The classification rule induced by $\mathbf{f}(\mathbf{x})$ is
$$\phi(\mathbf{x}) = \mathop{\arg\max}_{j=1,\ldots,k}\ f_j(\mathbf{x}).$$

This formulation handles the standard and nonstandard situations in a uniﬁed manner by

choosing different cost functions. The solution to (4.1) is characterized in Theorem 1. Proof

of the theorem and all other proofs are relegated to the Appendix.

Theorem 1. (Representer theorem) If the reproducing kernel $K$ is positive definite, minimizing (4.1) under the sum-to-zero constraint is equivalent to finding $f_1(\mathbf{x}), \ldots, f_k(\mathbf{x})$ of the form
$$f_j(\mathbf{x}) = \beta_{j0} + \sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}, \mathbf{x}_i), \qquad (4.2)$$
subject to $\sum_{j=1}^{k}\beta_{jl} = 0$ for $l = 0, \ldots, n$.

We show that formulation (4.1) embraces the binary PSVM as a special case. Let $k = 2$. For any sample from the positive class, $\mathbf{y}_i = (1, -1)^\top$ (1 in the binary notation), we have $W(\mathbf{y}_i) = \mathrm{diag}\{0, L(1)\}$ and $[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)]^\top W(\mathbf{y}_i)[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)] = L(1)[f_2(\mathbf{x}_i) + 1]^2 = L(1)[1 - f_1(\mathbf{x}_i)]^2$, where $L(1)$ is the cost of misclassifying a positive sample. Similarly, when $\mathbf{y}_i = (-1, 1)^\top$ ($-1$ in the binary notation), we have $W(\mathbf{y}_i) = \mathrm{diag}\{L(-1), 0\}$ and $[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)]^\top W(\mathbf{y}_i)[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)] = L(-1)[-1 - f_1(\mathbf{x}_i)]^2$, where $L(-1)$ is the cost of misclassifying a negative sample. Thus, the data-fit functional in (3.3) is identical to that in (4.1), with $f(\mathbf{x})$ in (3.3) playing the same role as $f_1(\mathbf{x})$ in (4.1). It is also true that
$$\frac{\lambda}{2}\sum_{j=1}^{2}\left(\|h_j\|_{\mathcal{H}_K}^2 + \beta_{j0}^2\right) = \frac{\lambda}{2}\left(\|h_1\|_{\mathcal{H}_K}^2 + \beta_{10}^2 + \|h_2\|_{\mathcal{H}_K}^2 + \beta_{20}^2\right) = \lambda\left(\|h_1\|_{\mathcal{H}_K}^2 + \beta_{10}^2\right),$$
because we have $\beta_{10} + \beta_{20} = 0$ and $h_1(\mathbf{x}) + h_2(\mathbf{x}) = 0$ for any $\mathbf{x}$ according to Theorem 1.

The next theorem shows that the MPSVM implements the Bayes rule asymptotically under certain regularity conditions. We use the theoretical framework of Cox and O'Sullivan (1990) to analyze the asymptotics of penalized methods. In (4.1), the data-fit functional component indicates that the solution should follow the pattern in the data, whereas the penalty component imposes smoothing conditions. As the sample size $n$ goes to infinity, the limit of the data-fit functional, $E\left[(\mathbf{Y} - \mathbf{f}(\mathbf{X}))^\top W(\mathbf{Y})(\mathbf{Y} - \mathbf{f}(\mathbf{X}))\right]$, could be used to identify the target function, which is the minimizer of the limiting functional. Under the assumption that the target function can be approximated by the elements in the RKHS and certain other regularity conditions, the solution to (4.1) will approach the target function as $n \to \infty$.

Theorem 2. The minimizer of $E\left[(\mathbf{Y} - \mathbf{f}(\mathbf{X}))^\top W(\mathbf{Y})(\mathbf{Y} - \mathbf{f}(\mathbf{X}))\right]$ under the sum-to-zero constraint implements the Bayes rule, that is,
$$\mathop{\arg\max}_{j=1,\ldots,k}\ f_j(\mathbf{x}) = \mathop{\arg\min}_{j=1,\ldots,k}\ \sum_{l=1}^{k} c_{lj}\, p_l(\mathbf{x}).$$
When equal misclassification costs are used,
$$\mathop{\arg\max}_{j=1,\ldots,k}\ f_j(\mathbf{x}) = \mathop{\arg\max}_{j=1,\ldots,k}\ p_j(\mathbf{x}).$$

Remark 1. Inverting the expression (A.1) in the Appendix provides the estimate of the $k$-class conditional probabilities. With equal misclassification costs, $\tau_j(\mathbf{x})$ reduces to $1 - p_j(\mathbf{x})$, thus
$$\hat p_j(\mathbf{x}) = 1 - (k-1)\,\frac{1/\left[1 + (k-1)\hat f_j(\mathbf{x})\right]}{\sum_{l=1}^{k} 1/\left[1 + (k-1)\hat f_l(\mathbf{x})\right]}, \qquad \text{for } j = 1,\ldots,k. \qquad (4.3)$$

4.2 COMPUTATION ALGORITHMS

This section shows that the implementation of the MPSVM only requires solving linear

equation systems. This computational convenience makes the MPSVM a promising clas-

siﬁcation tool for large datasets. Many classiﬁcation methods involve parameters that need

to be adaptively chosen with strategies like cross-validation. Finer tuning is possible for the

MPSVM due to its computational advantage. We present two computation strategies and

summarize them as Algorithms 1 and 2. Based on our experience, Algorithm 1 runs faster, though it may fail when $K$ is close to singular, causing difficulty in matrix inversion. Algorithm 2 is specifically designed for ill-posed problems.

Assume that the kernel $K$ is strictly positive definite. Let $\hat K$ be the Gram matrix with the $(i,l)$th entry $K(\mathbf{x}_i, \mathbf{x}_l)$. Let $Z = [\mathbf{1}_n\ \hat K]$, $\mathbf{1}_n = (1,\ldots,1)^\top$, $\mathbf{0}_n = (0,\ldots,0)^\top$, and
$$G = \begin{pmatrix} 1 & \mathbf{0}_n^\top \\ \mathbf{0}_n & \hat K \end{pmatrix}.$$
For $j = 1,\ldots,k$, define $\mathbf{y}_j^* = (y_{1j},\ldots,y_{nj})^\top$, $\boldsymbol\beta_j = (\beta_{j0},\beta_{j1},\ldots,\beta_{jn})^\top$ and the $n\times n$ diagonal matrix $W_j^* = \mathrm{diag}\{c_{\mathrm{cat}\{1\}j},\ldots,c_{\mathrm{cat}\{n\}j}\}$, where $\mathrm{cat}\{l\}$ is the membership of the $l$th sample and $c_{\mathrm{cat}\{l\}j}$ is the cost of classifying the $l$th sample to class $j$. Further, we define $\lambda^* = n\lambda/2$, $\boldsymbol\beta_j^* = G^{1/2}\boldsymbol\beta_j$, $Z^* = ZG^{-1/2}$, $Z_j^{**} = W_j^{*1/2}Z^*$ and $\mathbf{y}_j^{**} = W_j^{*1/2}\mathbf{y}_j^*$ for $j = 1,\ldots,k$. By Theorem 1, minimizing (4.1) is equivalent to minimizing
$$\sum_{j=1}^{k}\left[(\mathbf{y}_j^* - Z\boldsymbol\beta_j)^\top W_j^*(\mathbf{y}_j^* - Z\boldsymbol\beta_j) + \lambda^*\boldsymbol\beta_j^\top G\boldsymbol\beta_j\right] = \sum_{j=1}^{k}\left[(\mathbf{y}_j^{**} - Z_j^{**}\boldsymbol\beta_j^*)^\top(\mathbf{y}_j^{**} - Z_j^{**}\boldsymbol\beta_j^*) + \lambda^*\boldsymbol\beta_j^{*\top}\boldsymbol\beta_j^*\right], \qquad (4.4)$$
subject to $\sum_{j=1}^{k}\boldsymbol\beta_j = \sum_{j=1}^{k}\boldsymbol\beta_j^* = \mathbf{0}_{n+1}$.

To solve (4.4), we consider its Wolfe dual problem
$$L_D = \sum_{j=1}^{k}\left[(\mathbf{y}_j^{**} - Z_j^{**}\boldsymbol\beta_j^*)^\top(\mathbf{y}_j^{**} - Z_j^{**}\boldsymbol\beta_j^*) + \lambda^*\boldsymbol\beta_j^{*\top}\boldsymbol\beta_j^*\right] - 2\sum_{j=1}^{k}\mathbf{u}^{*\top}\boldsymbol\beta_j^*,$$
where $\mathbf{u}^* = 2\mathbf{u} \in \mathbb{R}^{n+1}$ is the Lagrange multiplier vector. Setting equal to zero the gradient of $L_D$ with respect to $\boldsymbol\beta_j^*$ yields
$$\frac{\partial L_D}{\partial\boldsymbol\beta_j^*} = 2Z_j^{**\top}\left(Z_j^{**}\boldsymbol\beta_j^* - \mathbf{y}_j^{**}\right) + 2\lambda^*\boldsymbol\beta_j^* - 2\mathbf{u}^* = \mathbf{0}.$$
Thus,
$$\boldsymbol\beta_j^* = \left(Z_j^{**\top}Z_j^{**} + \lambda^* I\right)^{-1}\left(Z_j^{**\top}\mathbf{y}_j^{**} + \mathbf{u}^*\right) = B_j\left(Z_j^{**\top}\mathbf{y}_j^{**} + \mathbf{u}^*\right), \qquad (4.5)$$
where $B_j = \left(Z_j^{**\top}Z_j^{**} + \lambda^* I\right)^{-1}$ for $j = 1,\ldots,k$. Since $\mathbf{0}_{n+1} = \sum_{j=1}^{k}\boldsymbol\beta_j^* = \sum_{j=1}^{k}B_j\left(Z_j^{**\top}\mathbf{y}_j^{**} + \mathbf{u}^*\right)$, we get the dual solution
$$\mathbf{u}^* = -\left(\sum_{j=1}^{k}B_j\right)^{-1}\sum_{j=1}^{k}B_jZ_j^{**\top}\mathbf{y}_j^{**}. \qquad (4.6)$$
The primal solution to (4.4) is
$$\boldsymbol\beta_j = G^{-1/2}\boldsymbol\beta_j^* = A_j\left(Z^\top W_j^*\mathbf{y}_j^* + \mathbf{v}^*\right), \qquad (4.7)$$
where $A_j = G^{-1/2}B_jG^{-1/2} = \left(Z^\top W_j^*Z + \lambda^*G\right)^{-1}$ and $\mathbf{v}^* = -\left(\sum_{j=1}^{k}A_j\right)^{-1}\sum_{j=1}^{k}A_jZ^\top W_j^*\mathbf{y}_j^*$. For any (new) sample $\mathbf{x}$, we define the vector $K_{\mathbf{x}} = \left(K(\mathbf{x},\mathbf{x}_1),\ldots,K(\mathbf{x},\mathbf{x}_n)\right)^\top$ and compute $\hat f_j(\mathbf{x}) = [1, K_{\mathbf{x}}^\top]\boldsymbol\beta_j$ for $j = 1,\ldots,k$ for prediction. We summarize Algorithm 1 as follows.

Algorithm 1.
(a) Solve (4.4) using (4.7).
(b) To predict the label for $\mathbf{x}$, compute $\hat f_j(\mathbf{x})$ for $j = 1,\ldots,k$ and classify the sample to the class corresponding to the largest $\hat f_j(\mathbf{x})$.
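A minimal numpy sketch of Algorithm 1 follows. It forms each $A_j$ explicitly for clarity rather than efficiency; the Gaussian kernel, equal costs, toy data, and 0-indexed labels are illustrative assumptions, not part of the article:

```python
import numpy as np

def mpsvm_fit(K, y, k, lam=0.1, cost=None):
    """Algorithm 1: solve (4.4) through the primal formulas (4.5)-(4.7)."""
    n = len(y)
    if cost is None:
        cost = np.ones((k, k)) - np.eye(k)      # equal costs, c_jj = 0
    Z = np.hstack([np.ones((n, 1)), K])         # Z = [1 K]
    G = np.zeros((n + 1, n + 1))                # G = blockdiag(1, K)
    G[0, 0] = 1.0
    G[1:, 1:] = K
    lam_star = n * lam / 2.0
    Ycode = np.full((n, k), -1.0 / (k - 1))     # class coding of Section 4.1
    Ycode[np.arange(n), y] = 1.0
    A, r = [], []
    for j in range(k):
        w = cost[y, j]                          # diagonal of W_j*
        A.append(np.linalg.inv(Z.T * w @ Z + lam_star * G))   # A_j
        r.append(Z.T @ (w * Ycode[:, j]))       # Z' W_j* y_j*
    # v* enforces the sum-to-zero constraint sum_j beta_j = 0.
    v = -np.linalg.solve(sum(A), sum(Aj @ rj for Aj, rj in zip(A, r)))
    return np.column_stack([A[j] @ (r[j] + v) for j in range(k)])

def mpsvm_decision(betas, K_new):
    # f_j(x) = beta_j0 + sum_i beta_ji K(x, x_i); rows of K_new hold K(x, x_i).
    return np.hstack([np.ones((K_new.shape[0], 1)), K_new]) @ betas

# Toy usage on three well-separated Gaussian clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, (8, 2)) for c in ([-2, 0], [2, 0], [0, 3])])
y = np.repeat([0, 1, 2], 8)
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))  # Gaussian kernel, sigma^2 = 1
betas = mpsvm_fit(K, y, k=3)
f = mpsvm_decision(betas, K)
print((f.argmax(1) == y).mean())            # training accuracy
print(np.abs(betas.sum(axis=1)).max())      # sum-to-zero constraint residual
```

The second printed quantity checks the constraint of Theorem 1, which the construction of $\mathbf{v}^*$ enforces up to floating-point error.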

Though the matrix $\hat K$ is strictly positive definite in theory, it can be numerically ill-posed in practice. Algorithm 1 may fail when $\hat K$, hence $G$ and $A_j$ ($j = 1,\ldots,k$), are close to singular. Therefore in Algorithm 2, we avoid the inversion of $G$ using the relationships
$$ZG^{-1} = [\mathbf{1}_n, I], \qquad ZG^{-1}Z^\top = \mathbf{1}_n\mathbf{1}_n^\top + \hat K,$$
which can be achieved by some strategic matrix transformations of the solutions (4.5) and (4.6) via the Sherman-Morrison-Woodbury formula. Our experience shows that when $\hat K$ is well-conditioned, we should use Algorithm 1 because it is much faster than Algorithm 2. Otherwise, Algorithm 2 is preferable because it provides robust solutions.

By using the Sherman-Morrison-Woodbury formula, we have
$$B_j = \left(Z_j^{**\top}Z_j^{**} + \lambda^*I\right)^{-1} = \left[I - Z_j^{**\top}\left(\lambda^*I + Z_j^{**}Z_j^{**\top}\right)^{-1}Z_j^{**}\right]\big/\lambda^* = \left(I - Z^{*\top}C_jZ^*\right)\big/\lambda^*, \qquad (4.8)$$
where
$$C_j = W_j^{*1/2}\left(\lambda^*I + W_j^{*1/2}Z^*Z^{*\top}W_j^{*1/2}\right)^{-1}W_j^{*1/2} = \mathbf{l}_j^*\odot\left(\lambda^*I + \mathbf{l}_j^*\odot Z^*Z^{*\top}\right)^{-1}. \qquad (4.9)$$
Here $\mathbf{l}_j$ is the column vector containing the diagonal elements of $W_j^{*1/2}$, $\mathbf{l}_j^* = \mathbf{l}_j\mathbf{l}_j^\top$, and $\odot$ is the elementwise product of two matrices. It is easy to verify that
$$B_jZ_j^{**\top}\mathbf{y}_j^{**} = \left[I - Z_j^{**\top}\left(\lambda^*I + Z_j^{**}Z_j^{**\top}\right)^{-1}Z_j^{**}\right]Z_j^{**\top}\mathbf{y}_j^{**}\big/\lambda^* = Z_j^{**\top}\left(\lambda^*I + Z_j^{**}Z_j^{**\top}\right)^{-1}\mathbf{y}_j^{**} = Z^{*\top}C_j\mathbf{y}_j^*. \qquad (4.10)$$
Let $C = k^{-1}\sum_{j=1}^{k}C_j$. Then
$$B = \sum_{j=1}^{k}B_j = k\left(I - Z^{*\top}CZ^*\right)\big/\lambda^*. \qquad (4.11)$$
By the Sherman-Morrison-Woodbury formula,
$$B^{-1} = \lambda^*k^{-1}\left[I - Z^{*\top}\left(Z^*Z^{*\top} - C^{-1}\right)^{-1}Z^*\right]. \qquad (4.12)$$

The dual solution is
$$\mathbf{u}^* = -B^{-1}\sum_{j=1}^{k}B_jZ_j^{**\top}\mathbf{y}_j^{**} = -\lambda^*k^{-1}\left(I - Z^{*\top}DZ^*\right)\sum_{j=1}^{k}Z^{*\top}C_j\mathbf{y}_j^* = -\lambda^*Z^{*\top}Q, \qquad (4.13)$$
where $D = \left(Z^*Z^{*\top} - C^{-1}\right)^{-1}$, $H = k^{-1}\sum_{j=1}^{k}C_j\mathbf{y}_j^*$, and $Q = H - D\left(Z^*Z^{*\top}\right)H = -DC^{-1}H$.

Combining (4.8), (4.10), and (4.13), we get the primal solution
$$\boldsymbol\beta_j^* = B_jZ_j^{**\top}\mathbf{y}_j^{**} + B_j\mathbf{u}^* = Z^{*\top}C_j\mathbf{y}_j^* - \left(I - Z^{*\top}C_jZ^*\right)Z^{*\top}Q = Z^{*\top}\left[C_j\left(\mathbf{y}_j^* + Z^*Z^{*\top}Q\right) - Q\right],$$
and
$$\boldsymbol\beta_j = G^{-1/2}\boldsymbol\beta_j^* = G^{-1/2}Z^{*\top}\left[C_j\left(\mathbf{y}_j^* + Z^*Z^{*\top}Q\right) - Q\right] = [\mathbf{1}_n, I]^\top\left[C_j\left(\mathbf{y}_j^* + Z^*Z^{*\top}Q\right) - Q\right].$$
The prediction $\hat f_j$ is given by
$$\hat f_j(\mathbf{x}) = [1, K_{\mathbf{x}}^\top]\boldsymbol\beta_j = \left(\mathbf{1}_n + K_{\mathbf{x}}\right)^\top\left[C_j\left(\mathbf{y}_j^* + Z^*Z^{*\top}Q\right) - Q\right]. \qquad (4.14)$$

Algorithm 2.
(a) Compute $Z^*Z^{*\top} = ZG^{-1}Z^\top = \mathbf{1}_n\mathbf{1}_n^\top + \hat K$.
(b) Compute $C_j$ ($j = 1,\ldots,k$), $C$, $D$, $H$ and $Q$ using Equations (4.9)–(4.13).
(c) To predict the label for $\mathbf{x}$, compute $\hat f_j(\mathbf{x})$ for $j = 1,\ldots,k$ using (4.14) and classify the sample to the class giving the largest $\hat f_j(\mathbf{x})$.
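A sketch of Algorithm 2 under the same illustrative assumptions as before (Gaussian kernel, equal costs, 0-indexed labels); it works only with $Z^*Z^{*\top} = \mathbf{1}_n\mathbf{1}_n^\top + \hat K$ and never inverts $G$:

```python
import numpy as np

def mpsvm_fit_smw(K, y, k, lam=0.1, cost=None):
    """Algorithm 2: the Sherman-Morrison-Woodbury route (4.9)-(4.14)."""
    n = len(y)
    if cost is None:
        cost = np.ones((k, k)) - np.eye(k)
    lam_star = n * lam / 2.0
    M = 1.0 + K                                  # Z* Z*' = 1 1' + K, step (a)
    Ycode = np.full((n, k), -1.0 / (k - 1))
    Ycode[np.arange(n), y] = 1.0
    Cs, H = [], np.zeros(n)
    for j in range(k):
        l = np.sqrt(cost[y, j])                  # diagonal of W_j*^{1/2}
        Lstar = np.outer(l, l)                   # l_j* = l_j l_j'
        Cj = Lstar * np.linalg.inv(lam_star * np.eye(n) + Lstar * M)  # (4.9)
        Cs.append(Cj)
        H += Cj @ Ycode[:, j] / k                # H = k^{-1} sum_j C_j y_j*
    C = sum(Cs) / k
    D = np.linalg.inv(M - np.linalg.inv(C))
    Q = H - D @ (M @ H)
    # Coefficients of (4.14): f_j(x) = (1 + K_x)' [C_j (y_j* + M Q) - Q].
    return [Cs[j] @ (Ycode[:, j] + M @ Q) - Q for j in range(k)]

def mpsvm_decision_smw(coefs, K_new):
    # rows of K_new hold K(x, x_i) for each new x
    return np.column_stack([(1.0 + K_new) @ c for c in coefs])

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.4, (8, 2)) for c in ([-2, 0], [2, 0], [0, 3])])
y = np.repeat([0, 1, 2], 8)
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))
coefs = mpsvm_fit_smw(K, y, k=3)
f = mpsvm_decision_smw(coefs, K)
print((f.argmax(1) == y).mean())   # training accuracy
print(np.abs(f.sum(1)).max())      # class-wise sum of f should vanish
```

Because the fitted $\hat f_j$ satisfy the sum-to-zero constraint exactly in algebra, the second printed quantity is a cheap sanity check on the implementation.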

5. NUMERICAL EXAMPLES

We illustrate the performance of the MPSVM with simulated examples. The Gaussian kernel $K(\mathbf{x}_i, \mathbf{x}_l) = \exp\left(-\|\mathbf{x}_i - \mathbf{x}_l\|^2/\sigma^2\right)$ is used and the tuning parameters $(\lambda^*, \sigma^2)$ are determined by ten-fold cross-validation. Section 5.2 shows an example for nonstandard classifications. Equal misclassification costs are assumed unless otherwise specified.

5.1 COMPARISON WITH MSVM

A training set of sample size 200 is generated according to the three-class model on

the unit interval [0, 1] with the conditional probabilities
$$p_1(x) = 0.97\exp(-3x), \qquad p_3(x) = \exp\left(-2.5(x - 1.2)^2\right), \qquad p_2(x) = 1 - p_1(x) - p_3(x).$$

Figure 1. Plots of true and estimated conditional probability functions.

This example was also used by Lee, Lin, and Wahba (2004). We apply the MPSVM and use

the formula (4.3) to estimate the true conditional probabilities. As shown in Figures 1(a)

and (b), the MPSVM has reasonably recovered the conditional probabilities. According to

(2.2), the Bayes class boundary for this example consists of two points: $x = 0.260$, where $p_1(x) = p_2(x) > p_3(x)$, and $x = 0.613$, where $p_1(x) < p_2(x) = p_3(x)$. The MPSVM estimates these two critical points as 0.262 and 0.615, and hence approximates the Bayes rule very well.
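The simulation model above is easy to reproduce. In the sketch below, the uniform design on [0, 1] is an assumption, since the article specifies only the conditional probabilities:

```python
import numpy as np

def cond_probs(x):
    # Conditional class probabilities of the three-class model on [0, 1].
    p1 = 0.97 * np.exp(-3.0 * x)
    p3 = np.exp(-2.5 * (x - 1.2) ** 2)
    return np.column_stack([p1, 1.0 - p1 - p3, p3])

def sample_training_set(n, rng):
    x = rng.uniform(0.0, 1.0, n)    # uniform design: an assumption
    p = cond_probs(x)
    y = np.array([rng.choice(3, p=pi) for pi in p]) + 1   # labels 1, 2, 3
    return x, y

rng = np.random.default_rng(0)
x, y = sample_training_set(200, rng)
```

Evaluating `cond_probs` on a grid also recovers the two Bayes boundary points, where $p_1 = p_2$ near 0.260 and $p_2 = p_3$ near 0.613.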

We now compare the MPSVM with the MSVM (Lee, Lin, and Wahba 2004), and

their one-versus-rest variants. We generate 100 training sets of sample size 200 to ﬁt the

classiﬁcation rules, and a testing set of sample size 10,000 to evaluate their prediction

accuracy. The Bayes rate for this example is 0.3841. The average testing error rate (±

standard deviation) is 0.4068± 0.0101 for the MPSVM and 0.4081± 0.0124 for the one-

versus-rest PSVM. As reported by Lee, Lin, and Wahba (2004), the average error rate is

0.3951±0.0099 for the MSVM and 0.4307±0.0132 for the one-versus-rest SVM. Overall,

the performance of the MPSVM and one-versus-rest PSVM is close to the MSVM. Note

that the error rate of the one-versus-rest SVM is slightly higher. One possible reason is that

the one-versus-rest SVM may not approximate Bayes rule well even when the sample size

is large (Lee, Lin, and Wahba 2004), while all the other three classiﬁers are asymptotically

optimal.

5.2 SAMPLING BIAS

This example illustrates the effectiveness of the MPSVM in handling the sampling

bias situation. There are three classes, and the true population is distributed on the square $A = [0, \sqrt{10}] \times [0, \sqrt{10}]$ with the following conditional probabilities:
$$\Pr(Y = j \mid \mathbf{x} \in A_l) = \begin{cases} 0.9 & \text{if } j = l,\\ 0.05 & \text{if } j \neq l, \end{cases} \qquad 1 \le j, l \le 3.$$
The three regions
$$A_1 = \left\{\mathbf{X}: X_1^2 + X_2^2 \le 2,\ X_1 > X_2\right\}, \qquad A_2 = \left\{\mathbf{X}: X_1^2 + X_2^2 \le 2,\ X_1 \le X_2\right\}, \qquad A_3 = \left\{\mathbf{X}: X_1^2 + X_2^2 > 2\right\}$$
partition the square $A$ with the area ratio 1:1:8.

Figure 2. Sampling bias.

We first generate 500 points

from the population, then intentionally change the class proportions in the training set by

keeping all points generated from classes 1 and 2, and only 1/8 of those from class 3. The

sampling scheme is therefore biased and the class proportions in the training set do not

reﬂect those in the underlying population. Figure 2(a) plots the training data and the Bayes

decision boundary. We ﬁt the MPSVM with, respectively, equal misclassiﬁcation costs

and the costs adjusted according to (2.3). The decision boundary based on adjusted costs

in Figure 2(b) is very similar to the Bayes boundary in Figure 2(a), while the unadjusted

boundary in Figure 2(c) fails to approximate the Bayes rule. This suggests that the adjustment

of misclassiﬁcation costs in the MPSVM is necessary to improve classiﬁcation accuracy in

the sampling bias case.

6. APPLICATIONS

In this section, we apply the MPSVM to cancer classification using two benchmark

microarray datasets. One main challenge in microarray data analysis is that the number of

genes is often much larger than the sample size. One typical way to handle this issue is to

conduct dimension reduction, say, by principal component analysis or factor analysis (Khan et al. 2001; West 2003), before applying classification methods. As regularized regression models, the SVM-type methods offer an alternative way to handle this "large $p$, small $n$"

problem (West 2003) and are well-suited for microarray data analysis.

6.1 LEUKEMIA DATA

The leukemia dataset was published by Golub et al. (1999) for the classiﬁcation of

two types of leukemia: ALL (acute lymphoblastic leukemia) and AML (acute myeloid

leukemia). These two cancer types are identiﬁed based on their origins—AML originates

in the bone marrow and ALL originates in lymphatic tissue. The ALL leukemia can be

further divided into B-cell and T-cell ALLs. There are 38 samples (19 ALL B-cell; 8 ALL

T-cell; 11 AML) in the training set and 34 samples (19 ALL B-cell; 1 ALL T-cell and 14

AML) in the testing set. For each sample, the expression values of 7,129 genes are collected.

In the literature, these data have been treated as a two-class (ALL/AML) problem (Golub

et al. 1999; Dudoit, Fridlyand, and Speed 2002; Tibshirani, Hastie, Narasimhan, and Chu

2002; Nguyen and Rocke 2002b) or a three-class (AML/ALL B-cell/ALL T-cell) problem (Lee and Lee 2003; Nguyen and Rocke 2002a; Dudoit, Fridlyand, and Speed 2002). In this article, we treat it as a three-class classification problem.

Figure 3. Heat maps of leukemia data.

To compare the classiﬁcation performance of the MPSVM with MSVM, we ﬁrst apply

the procedures used by Lee and Lee (2003) to preprocess the data: (1) thresholding (floor of 100 and ceiling of 16,000); (2) filtering (exclusion of genes with max/min ≤ 5 and max − min ≤ 500 across the samples); and (3) base-10 logarithmic transformation. The filtering results in 3,571 genes. After these initial steps, each array is standardized and $F$-tests are used to select the top 40 genes with the largest $F$-statistics.
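The screening step is a per-gene one-way ANOVA; a self-contained sketch on toy data (not the leukemia set):

```python
import numpy as np

def f_statistics(X, y, k):
    # One-way ANOVA F statistic for each column (gene) of X.
    n, p = X.shape
    grand = X.mean(axis=0)
    between = np.zeros(p)
    within = np.zeros(p)
    for j in range(k):
        Xj = X[y == j]
        between += len(Xj) * (Xj.mean(axis=0) - grand) ** 2
        within += ((Xj - Xj.mean(axis=0)) ** 2).sum(axis=0)
    return (between / (k - 1)) / (within / (n - k))

# Toy check: gene 0 shifts across classes, gene 1 is pure noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 10)
X = rng.normal(0.0, 1.0, (30, 2))
X[:, 0] += 2.0 * y
F = f_statistics(X, y, k=3)
print(F)                       # F[0] should dwarf F[1]
top = np.argsort(F)[::-1]      # gene ranking by F statistic
```

Keeping the `top[:40]` columns would reproduce the 40-gene screening described above.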

Figure 3 plots the heat maps of the training set and testing set, where rows represent

genes and columns represent samples. The gene expression values are indicated by color,

with black representing the lowest level and white the highest level of expression. We put

the training samples within each class together, and use hierarchical clustering to order the

genes so that genes close to each other have similar proﬁles across the training samples.

As Figure 3(a) shows, a subset of genes is highly expressed for each class; therefore, the selected 40 genes are very informative in discriminating the three classes. In the heat map

(Figure 3(b)) of the testing set, the genes are ordered in the same way as in the training set,

and hierarchical clustering is also used to order samples. The expression pattern of testing

samples seems to match the training samples pretty well.

Because the sample size is small, the performance of MPSVM and one-versus-rest

PSVM is quite robust to the choice of tuning parameters. With various equally good tuning

parameters, none or at most one testing sample is misclassiﬁed by the MPSVM and the one-

versus-rest PSVM. For the MSVM, as shown by Lee and Lee (2003), one testing sample

is misclassified using GACV tuning and on average 0.8 testing samples are misclassified

using leave-one-out cross-validation. This demonstrates that the performance of MPSVM

is comparable to MSVM, but the MPSVM is faster to implement.

6.2 SMALL ROUND BLUE CELL TUMOR DATA

The small round blue cell tumors of children dataset was published by Khan et al.

(2001). There are four tumor types: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-

Hodgkin lymphoma (BL/NHL), and Ewing family of tumors (EWS). This dataset contains

2,308 out of 6,567 genes after filtering. There are 63 samples (8 BL; 23 EWS; 12 NB; 20 RMS) in the training set and 20 samples (3 BL; 6 EWS; 6 NB; 5 RMS) in the testing set.

Similar to the analysis of the leukemia data, the data were logarithm-transformed and $F$-tests were used to preselect 40 genes with the smallest $p$ values. The testing errors of MPSVM

and one-versus-rest PSVM are both 0. This dataset has been analyzed by many methods

(Khan et al. 2001; Lee and Lee 2003; Dudoit, Fridlyand, and Speed 2002; Tibshirani et al.

2002), and nearly all of them yield 0 testing errors.

APPENDIX

Proof of Theorem 1: Let $f_j(\mathbf{x}) = \beta_{j0} + h_j(\mathbf{x})$ with $h_j \in \mathcal{H}_K$ for $j = 1,\ldots,k$. We decompose each $h_j$ as follows: $h_j(\cdot) = \sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot) + \rho_j(\cdot)$, where $\rho_j(\cdot)$ is the element in the RKHS orthogonal to the spanned space $\{K(\mathbf{x}_i,\cdot), i = 1,\ldots,n\}$. By the sum-to-zero constraint, we have
$$f_k(\cdot) = -\sum_{j=1}^{k-1}\beta_{j0} - \sum_{j=1}^{k-1}\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot) - \sum_{j=1}^{k-1}\rho_j(\cdot).$$
By the reproducing property, we have $\langle h_j(\cdot), K(\mathbf{x}_i,\cdot)\rangle_{\mathcal{H}_K} = h_j(\mathbf{x}_i)$ for $i = 1,\ldots,n$. Thus,
$$f_j(\mathbf{x}_i) = \beta_{j0} + h_j(\mathbf{x}_i) = \beta_{j0} + \langle h_j, K(\mathbf{x}_i,\cdot)\rangle_{\mathcal{H}_K} = \beta_{j0} + \left\langle\sum_{l=1}^{n}\beta_{jl}K(\mathbf{x}_l,\cdot) + \rho_j(\cdot),\ K(\mathbf{x}_i,\cdot)\right\rangle_{\mathcal{H}_K} = \beta_{j0} + \sum_{l=1}^{n}\beta_{jl}K(\mathbf{x}_i,\mathbf{x}_l).$$
Hence, the data-fit functional in (4.1) does not depend on $\rho_j(\cdot)$. Furthermore,
$$\|h_j\|^2 = \left\|\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)\right\|^2 + \|\rho_j\|^2, \qquad \text{for } 1 \le j \le k-1,$$
and
$$\|h_k\|^2 = \left\|\sum_{j=1}^{k-1}\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)\right\|^2 + \left\|\sum_{j=1}^{k-1}\rho_j\right\|^2.$$
Obviously, the minimizer of (4.1) does not depend on $\rho_j$. This implies Equation (4.2).

It remains to show that minimizing (4.1) subject to $\sum_{j=1}^{k}\beta_{ji} = 0$ for $i = 0,\ldots,n$ is equivalent to minimizing (4.1) subject to $\sum_{j=1}^{k}f_j(\mathbf{x}) = 0$ for any $\mathbf{x}$. Let $\hat K$ be the $n$ by $n$ matrix with the $(i,l)$th entry $K(\mathbf{x}_i,\mathbf{x}_l)$. Let $f_j(\cdot) = \beta_{j0} + \sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)$ be the minimizer of (4.1) under the sum-to-zero constraint. Let $\boldsymbol\beta_j^* = (\beta_{j1},\ldots,\beta_{jn})^\top$ for $j = 1,\ldots,k$, and $\bar\beta_l = k^{-1}\sum_{j=1}^{k}\beta_{jl}$ for $l = 0,\ldots,n$. Define
$$\tilde f_j(\cdot) = \left(\beta_{j0} - \bar\beta_0\right) + \sum_{i=1}^{n}\left(\beta_{ji} - \bar\beta_i\right)K(\mathbf{x}_i,\cdot).$$
Then $\tilde f_j(\cdot) = f_j(\cdot)$ because $\sum_{j=1}^{k}f_j(\cdot) \equiv 0$. However, writing $\tilde h_j$ and $\tilde\beta_{j0}$ for the kernel part and intercept of $\tilde f_j$,
$$\sum_{j=1}^{k}\left(\|\tilde h_j\|^2 + \tilde\beta_{j0}^2\right) = \sum_{j=1}^{k}\boldsymbol\beta_j^{*\top}\hat K\boldsymbol\beta_j^* + \sum_{j=1}^{k}\beta_{j0}^2 - k\bar\beta_0^2 - k^{-1}\left(\sum_{j=1}^{k}\boldsymbol\beta_j^*\right)^\top\hat K\left(\sum_{j=1}^{k}\boldsymbol\beta_j^*\right) \le \sum_{j=1}^{k}\boldsymbol\beta_j^{*\top}\hat K\boldsymbol\beta_j^* + \sum_{j=1}^{k}\beta_{j0}^2 = \sum_{j=1}^{k}\left(\|h_j\|^2 + \beta_{j0}^2\right).$$
The equality has to hold since $f_j$ is the global minimizer. This implies that
$$\bar\beta_0 = 0, \qquad \left(\sum_{j=1}^{k}\boldsymbol\beta_j^*\right)^\top\hat K\left(\sum_{j=1}^{k}\boldsymbol\beta_j^*\right) = 0.$$
When $\hat K$ is positive definite, $\sum_{j=1}^{k}\beta_{ji} = 0$ for $i = 0,\ldots,n$. On the other hand, let $f_j(\cdot) = \beta_{j0} + \sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)$ be the minimizer of (4.1) satisfying $\sum_{j=1}^{k}\beta_{ji} = 0$ for $i = 0,\ldots,n$. Let $\boldsymbol\beta_j^* = (\beta_{j1},\ldots,\beta_{jn})^\top$ for $j = 1,\ldots,k$. Thus $\hat K\sum_{j=1}^{k}\boldsymbol\beta_j^* = \mathbf{0}$ and
$$0 = \left(\sum_{j=1}^{k}\boldsymbol\beta_j^*\right)^\top\hat K\left(\sum_{j=1}^{k}\boldsymbol\beta_j^*\right) = \left\|\sum_{j=1}^{k}\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\cdot)\right\|^2.$$
This suggests that $\sum_{j=1}^{k}\sum_{i=1}^{n}\beta_{ji}K(\mathbf{x}_i,\mathbf{x}) = 0$ and hence $\sum_{j=1}^{k}f_j(\mathbf{x}) = 0$ for any $\mathbf{x}$.

Proof of Theorem 2: Since $E\left[(\mathbf{f}(\mathbf{X}) - \mathbf{Y})^\top W(\mathbf{Y})(\mathbf{f}(\mathbf{X}) - \mathbf{Y})\right] = E\left\{E\left[(\mathbf{f}(\mathbf{X}) - \mathbf{Y})^\top W(\mathbf{Y})(\mathbf{f}(\mathbf{X}) - \mathbf{Y}) \mid \mathbf{X}\right]\right\}$, we can minimize $E\left[(\mathbf{f}(\mathbf{X}) - \mathbf{Y})^\top W(\mathbf{Y})(\mathbf{f}(\mathbf{X}) - \mathbf{Y})\right]$ by minimizing $E\left[(\mathbf{f}(\mathbf{X}) - \mathbf{Y})^\top W(\mathbf{Y})(\mathbf{f}(\mathbf{X}) - \mathbf{Y}) \mid \mathbf{X} = \mathbf{x}\right]$ for every $\mathbf{x}$. Note that $c_{jl} = 0$ for $j = l$, then
$$Q_{\mathbf{x}}(\mathbf{f}) = E\left[(\mathbf{f}(\mathbf{X}) - \mathbf{Y})^\top W(\mathbf{Y})(\mathbf{f}(\mathbf{X}) - \mathbf{Y}) \,\Big|\, \mathbf{X} = \mathbf{x}\right] = \sum_{l=1}^{k}\sum_{j=1}^{k} c_{lj}\, p_l(\mathbf{x})\left[f_j(\mathbf{x}) + \frac{1}{k-1}\right]^2 = \sum_{j=1}^{k}\tau_j(\mathbf{x})\left[f_j(\mathbf{x}) + \frac{1}{k-1}\right]^2,$$
where $\tau_j(\mathbf{x}) = \sum_{l=1}^{k} c_{lj}\, p_l(\mathbf{x})$. Consider minimizing the function $Q_{\mathbf{x}}(\mathbf{f})$ subject to $\sum_{j=1}^{k} f_j(\mathbf{x}) = 0$. Using the Lagrange multiplier $\gamma > 0$, we get
$$A_{\mathbf{x}}(\mathbf{f}) = \sum_{j=1}^{k}\tau_j(\mathbf{x})\left[f_j(\mathbf{x}) + \frac{1}{k-1}\right]^2 - \gamma\sum_{j=1}^{k} f_j(\mathbf{x}).$$
For each $j = 1,\ldots,k$, we have
$$\frac{\partial A_{\mathbf{x}}}{\partial f_j} = 2\tau_j\left[f_j + \frac{1}{k-1}\right] - \gamma = 0 \;\Longrightarrow\; f_j = \frac{\gamma}{2\tau_j} - \frac{1}{k-1}.$$
Since $\sum_{j=1}^{k} f_j = 0$, we have that
$$\gamma = \frac{2k}{k-1}\cdot\frac{1}{\sum_{j=1}^{k}\tau_j^{-1}}.$$
The solution is then given by
$$f_j = \frac{k\,\tau_j^{-1}}{(k-1)\sum_{l=1}^{k}\tau_l^{-1}} - \frac{1}{k-1}. \qquad \text{(A.1)}$$
Thus, $\arg\max_{j=1,\ldots,k} f_j(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\tau_j(\mathbf{x}) = \arg\min_{j=1,\ldots,k}\sum_{l=1}^{k} c_{lj}\, p_l(\mathbf{x})$.

ACKNOWLEDGMENTS

The authors are very grateful to the editors and referees for constructive suggestions that greatly improved

the presentation of this work. We also thank Subhashis Ghosal, Kevin Gross, and David Chorlian for helpful

comments on the article. Hao Helen Zhang’s research was partially supported by the National Science Foundation

Grant DMS-0405913.

[Received December 2004. Revised September 2005.]

REFERENCES

Agarwal, D. K. (2002), “Shrinkage Estimator Generalizations of Proximal Support Vector Machines,” in Proceed-

ings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,ACM

Press, pp. 173–182.

Boser, B. E., Guyon, I. M., and Vapnik, V. (1992), “A Training Algorithm for Optimal Margin Classiﬁers,” in 5th

Annual ACM Workshop on Computational Learning Theory, ed. Haussler, D., Pittsburgh, PA: ACM Press,

pp. 144–152.

Burges, C. J. C. (1998), “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and

Knowledge Discovery, 2, 121–167.

Cox, D., and O’Sullivan, F. (1990), “Asymptotic Analysis of Penalized Likelihood Type Estimators,” The Annals

of Statistics, 18, 1676–1695.

Cristianini, N., and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, New York: Cambridge

University Press.

Dudoit, S., Fridlyand, J., and Speed, T. (2002), “Comparison of Discrimination Methods for the Classiﬁcation of

Tumors Using Gene Expression Data,” Journal of the American Statistical Association, 97, 77–87.

Fung, G., and Mangasarian, O. L. (2001), “Proximal Support Vector Machine Classiﬁers,” in Proceedings of the

7th ACM Conference on Knowledge Discovery and Data Mining, Pittsburgh, PA: ACM Press, pp. 77–86.

(2005), “Multicategory Proximal Support Vector Machine Classiﬁers,” Machine Learning, 59, 77–97.

Girosi, F. (1998), “An Equivalence Between Sparse Approximation and Support Vector Machines,” Neural Com-

putation, 10, 1455–1480.

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J.,

Caligiuri, M., Bloomﬁeld, C., and Lander, E. (1999), “Molecular Classiﬁcation of Cancer: Class Discovery

and Class Prediction by Gene Expression Monitoring,” Science, 286, 531–537.

Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu,

C., Peterson, C., and Meltzer, P. (2001), “Classiﬁcation and Diagnostic Prediction of Cancers Using Gene

Expression Proﬁling and Artiﬁcial Neural Networks,” Nature Medicine, 7, 673–679.

Kimeldorf, G., and Wahba, G. (1971), “Some Results on Tchebychefﬁan Spline Functions,” Journal of Mathe-

matical Analysis and Applications, 33, 82–95.

Lee, Y., and Lee, C. (2003), “Classiﬁcation of Multiple Cancer Types By Multicategory Support Vector Machines

Using Gene Expression Data,” Bioinformatics, 19, 1132–1139.

Lee, Y., Lin, Y., and Wahba, G. (2004), “Multicategory Support Vector Machines, Theory, and Application to

the Classiﬁcation of Microarray Data and Satellite Radiance Data,” Journal of the American Statistical

Association, 99, 67–81.

Lin, Y., Lee, Y., and Wahba, G. (2002), “Support Vector Machines for Classiﬁcation in Nonstandard Situations,”

Machine Learning, 46, 191–202.

Nguyen, D. V., and Rocke, D. M. (2002a), “Multi-class Cancer Classiﬁcation via Partial Least Squares with Gene

Expression Proﬁles,” Bioinformatics, 18, 1216–1226.

(2002b), “Tumor Classiﬁcation by Partial Least Squares Using Microarray Gene Expression Data,” Bioin-

formatics, 18, 39–50.

Poggio, T., and Girosi, F. (1998), “A Sparse Representation for Functional Approximation,” Neural Computation,

10, 1445–1454.

Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., and Vandewalle, J. (2002), Least Squares Support

Vector Machines, Singapore: World Scientiﬁc.

Suykens, J. A. K., and Vandewalle, J. (1999), “Least Squares Support Vector Machine Classiﬁers,” Neural Pro-

cessing Letters, 9, 293–300.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), “Diagnosis of Multiple Cancer Types By Shrunken

Centroids of Gene Expression,” in Proceedings of the National Academy of Sciences USA, 99, pp. 6567–6572.

Van Gestel, T., Suykens, J. A. K., Lanckriet, G., Lambrechts, A., De Moor, B., and Vandewalle, J. (2002a),

“Bayesian Framework for Least-Squares Support Vector Machine Classiﬁers, Gaussian Processes and Kernel

Fisher Discriminant Analysis,” Neural Computation, 15, 1115–1148.

(2002b), “Multiclass LS-SVMs: Moderated Outputs and Coding-decoding Schemes,” Neural Processing

Letters, 15, 45–58.

Vapnik, V. N. (1998), Statistical Learning Theory, New York: Wiley.

Wahba, G. (1990), Spline Models for Observational Data, CBMS-NSF Regional Conference Series in Applied

Mathematics, 59, Philadelphia, PA: SIAM.

West, M. (2003), “Bayesian Factor Regression Models in the “Large p, Small n” Paradigm,” Bayesian Statistics,

7, 723–732.

Weston, J., and Watkins, C. (1999), “Support Vector Machines for Multi-Class Pattern Recognition,” in Proceedings

of the 7th European Symposium On Artiﬁcial Neural Networks, Bruges, Belgium, pp. 219–224.
