PSVM: Parallelizing Support Vector Machines on Distributed Computers

Edward Y. Chang*, Kaihua Zhu, Hao Wang, Hongjie Bai,
Jian Li, Zhihuan Qiu, & Hang Cui
Google Research, Beijing, China
Abstract
Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n^2) to O(np/m), and improves computation time to O(np^2/m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.
1 Introduction
Let us examine the resource bottlenecks of SVMs in a binary classification setting to explain our proposed solution. Given a set of training data X = {(x_i, y_i) | x_i ∈ R^d}_{i=1}^n, where x_i is an observation vector, y_i ∈ {−1, 1} is the class label of x_i, and n is the size of X, we apply SVMs on X to train a binary classifier. SVMs aim to search a hyperplane in the Reproducing Kernel Hilbert Space (RKHS) that maximizes the margin between the two classes of data in X with the smallest training error (Vapnik, 1995). This problem can be formulated as the following quadratic optimization problem:
min P(w, b, ξ) = (1/2) ||w||_2^2 + C Σ_{i=1}^n ξ_i                                        (1)
s.t.  1 − y_i(w^T φ(x_i) + b) ≤ ξ_i,  ξ_i ≥ 0,
where w is a weighting vector, b is a threshold, C a regularization hyperparameter, and φ(·) a basis function which maps x_i to an RKHS space. The decision function of SVMs is f(x) = w^T φ(x) + b, where w and b are attained by solving P in (1). The optimization problem in (1) is the primal formulation of SVMs. It is hard to solve P directly, partly because the explicit mapping via φ(·) can make the problem intractable and partly because the mapping function φ(·) is often unknown. The method of Lagrangian multipliers is thus introduced to transform the primal formulation into the dual one
min D(α) = (1/2) α^T Q α − α^T 1                                                          (2)
s.t.  0 ≤ α ≤ C,  y^T α = 0,
where [Q]_{ij} = y_i y_j φ^T(x_i) φ(x_j), and α ∈ R^n is the Lagrangian multiplier variable (or dual variable). The weighting vector w is related with α through w = Σ_{i=1}^n α_i φ(x_i).

T h i s w o r k w a s i n i t i a t e d i n 2 0 0 5 w h e n t h e a u t h o r w a s a p r o f e s s o r a t U C S B.
The dual formulation D(α) requires an inner product of φ(x_i) and φ(x_j). SVMs utilize the kernel trick by specifying a kernel function to define the inner product K(x_i, x_j) = φ^T(x_i) φ(x_j). We thus can rewrite [Q]_{ij} as y_i y_j K(x_i, x_j). When the given kernel function K is psd (positive semi-definite), the dual problem D(α) is a convex Quadratic Programming (QP) problem with linear constraints, which can be solved via the Interior-Point method (IPM) (Mehrotra, 1992). Both the computational and memory bottlenecks of SVM training lie in the IPM solver applied to the dual formulation of SVMs in (2).
Currently, the most effective IPM algorithm is the primal-dual IPM (Mehrotra, 1992). The principal idea of the primal-dual IPM is to remove inequality constraints using a barrier function and then resort to the iterative Newton's method to solve the KKT linear system related to the Hessian matrix Q in D(α). The computational cost is O(n^3) and the memory usage O(n^2).
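To make the bottleneck concrete, here is a minimal sketch (our own Python/NumPy illustration; the function names are hypothetical) that materializes Q for a Gaussian kernel and evaluates the dual objective D(α) of (2). Holding Q explicitly is exactly the O(n^2) memory cost, and each IPM Newton step works against this matrix.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Full n x n Gaussian kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-dists / (2.0 * sigma ** 2))

def dual_objective(alpha, y, K):
    """D(alpha) = 1/2 alpha^T Q alpha - alpha^T 1, with Q_ij = y_i y_j K(x_i, x_j)."""
    Q = (y[:, None] * y[None, :]) * K   # the O(n^2) object PSVM avoids storing
    return 0.5 * alpha @ (Q @ alpha) - alpha.sum()
```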
In this work, we propose a parallel SVM algorithm (PSVM) to reduce memory use and to parallelize both data loading and computation. Given n training instances each with d dimensions, PSVM first loads the training data in a round-robin fashion onto m machines. The memory requirement per machine is O(nd/m). Next, PSVM performs a parallel row-based Incomplete Cholesky Factorization (ICF) on the loaded data. At the end of parallel ICF, each machine stores only a fraction of the factorized matrix, which takes up space of O(np/m), where p is the column dimension of the factorized matrix. (Typically, p can be set to about √n without noticeably degrading training accuracy.) PSVM reduces memory use of IPM from O(n^2) to O(np/m), where p/m is much smaller than n. PSVM then performs parallel IPM to solve the quadratic optimization problem in (2). The computation time is improved from about O(n^2) of a decomposition-based algorithm (e.g., SVMLight (Joachims, 1998), LIBSVM (Chang & Lin, 2001), SMO (Platt, 1998), and SimpleSVM (Vishwanathan et al., 2003)) to O(np^2/m). This work's main contributions are: (1) PSVM achieves memory reduction and computation speedup via a parallel ICF algorithm and parallel IPM. (2) PSVM handles kernels (in contrast to other algorithmic approaches (Joachims, 2006; Chu et al., 2006)). (3) We have implemented PSVM on our parallel computing infrastructures. PSVM effectively speeds up training time for large-scale tasks while maintaining high training accuracy.
PSVM is a practical, parallel approximate implementation to speed up SVM training on today's distributed computing infrastructures for dealing with Web-scale problems. What we do not claim is as follows: (1) We make no claim that PSVM is the sole solution to speed up SVMs. Algorithmic approaches such as (Lee & Mangasarian, 2001; Tsang et al., 2005; Joachims, 2006; Chu et al., 2006) can be more effective when memory is not a constraint or kernels are not used. (2) We do not claim that the algorithmic approach is the only avenue to speed up SVM training. Data-processing approaches such as (Graf et al., 2005) can divide a serial algorithm (e.g., LIBSVM) into subtasks on subsets of training data to achieve good speedup. (Data-processing and algorithmic approaches complement each other, and can be used together to handle large-scale training.)
2 PSVM Algorithm
The key step of PSVM is parallel ICF (PICF). Traditional column-based ICF (Fine & Scheinberg, 2001; Bach & Jordan, 2005) can reduce computational cost, but the initial memory requirement is O(np), and hence is not practical for very large datasets. PSVM devises parallel row-based ICF (PICF) as its initial step, which loads training instances onto parallel machines and performs factorization simultaneously on these machines. Once PICF has loaded n training data distributedly on m machines, and reduced the size of the kernel matrix through factorization, IPM can be solved on parallel machines simultaneously. We present PICF first, and then describe how IPM takes advantage of PICF.
2.1 Parallel ICF
ICF can approximate Q (Q ∈ R^{n×n}) by a smaller matrix H (H ∈ R^{n×p}, p ≪ n), i.e., Q ≈ HH^T. ICF, together with SMW (the Sherman-Morrison-Woodbury formula), can greatly reduce the computational complexity in solving an n × n linear system. The work of (Fine & Scheinberg, 2001) provides a theoretical analysis of how ICF influences the optimization problem in Eq. (2). The authors proved that the error of the optimal objective value introduced by ICF is bounded by C^2 l ε / 2, where C is the hyperparameter of SVM, l is the number of support vectors, and ε is the bound of the ICF approximation (i.e., tr(Q − HH^T) < ε).
Algorithm 1 Row-based PICF
Input: n training instances; p: rank of ICF matrix H; m: number of machines
Output: H distributed on m machines
Variables:
  v: fraction of the diagonal vector of Q that resides in the local machine
  k: iteration number; x_i: the i-th training instance
  M: machine index set, M = {0, 1, ..., m − 1}
  I_c: row-index set on machine c (c ∈ M), I_c = {c, c + m, c + 2m, ...}
1:  for i = 0 to n − 1 do
2:    Load x_i into machine i mod m.
3:  end for
4:  k ← 0; H ← 0; v ← the fraction of the diagonal vector of Q that resides in the local machine.
      (v(i) (i ∈ I_c) can be obtained from x_i.)
5:  Initialize master to be machine 0.
6:  while k < p do
7:    Each machine c ∈ M selects its local pivot value, which is the largest element in v:
        lpv_{k,c} = max_{i ∈ I_c} v(i),
      and records the local pivot index, the row index corresponding to lpv_{k,c}:
        lpi_{k,c} = arg max_{i ∈ I_c} v(i).
8:    Gather the lpv_{k,c}'s and lpi_{k,c}'s (c ∈ M) to the master.
9:    The master selects the largest local pivot value as the global pivot value gpv_k and records in
      i_k the row index corresponding to the global pivot value: gpv_k = max_{c ∈ M} lpv_{k,c}.
10:   The master broadcasts gpv_k and i_k.
11:   Change master to machine i_k mod m.
12:   Calculate H(i_k, k) according to (3) on the master.
13:   The master broadcasts the pivot instance x_{i_k} and the pivot row H(i_k, :). (Only the first
      k + 1 values of the pivot row need to be broadcast, since the remainder are zeros.)
14:   Each machine c ∈ M calculates its part of the k-th column of H according to (4).
15:   Each machine c ∈ M updates v according to (5).
16:   k ← k + 1
17: end while
Experimental results in Section 3 show that when p is set to √n, the error can be negligible.
Our row-based parallel ICF (PICF) works as follows. Let vector v be the diagonal of Q and suppose the pivots (the largest diagonal values) are {i_1, i_2, ..., i_k}; the k-th iteration of ICF computes three equations:
H(i_k, k) = √(v(i_k))                                                                     (3)
H(J_k, k) = (Q(J_k, i_k) − Σ_{j=1}^{k−1} H(J_k, j) H(i_k, j)) / H(i_k, k)                  (4)
v(J_k) = v(J_k) − H(J_k, k)^2,                                                            (5)
where J_k denotes the complement of {i_1, i_2, ..., i_k}. The algorithm iterates until the approximation of Q by H_k H_k^T (measured by trace(Q − H_k H_k^T)) is satisfactory, or the predefined maximum number of iterations (or, say, the desired rank of the ICF matrix) p is reached.
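For reference, here is a single-machine sketch of this pivoted ICF loop (our own NumPy illustration, not the released PSVM code). It follows Eqs. (3)-(5) directly, and the trace of the residual is the quantity bounded by ε above.

```python
import numpy as np

def serial_icf(Q, p):
    """Incomplete Cholesky factorization with pivoting: returns H (n x p) with Q ~= H H^T."""
    n = Q.shape[0]
    H = np.zeros((n, p))
    v = np.diag(Q).astype(float)          # residual diagonal of Q
    pivots = []
    for k in range(p):
        i_k = int(np.argmax(v))           # pivot: largest remaining diagonal value
        if v[i_k] <= 0.0:                 # Q is (numerically) rank-deficient; stop early
            break
        pivots.append(i_k)
        H[i_k, k] = np.sqrt(v[i_k])                                        # Eq. (3)
        J_k = np.setdiff1d(np.arange(n), pivots)                           # complement of the pivots
        H[J_k, k] = (Q[J_k, i_k] - H[J_k, :k] @ H[i_k, :k]) / H[i_k, k]    # Eq. (4)
        v[J_k] -= H[J_k, k] ** 2                                           # Eq. (5)
        v[i_k] = 0.0                      # pivot row is now fully factorized
    return H

# Approximation error analyzed by Fine & Scheinberg (2001): np.trace(Q - H @ H.T)
```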
As suggested by G. Golub, a parallelized ICF algorithm can be obtained by constraining the parallelized Cholesky Factorization algorithm, iterating at most p times. However, in the proposed algorithm (Golub & Loan, 1996), matrix H is distributed by columns in a round-robin way on m machines (hence we call it column-based parallelized ICF). Such a column-based approach is optimal for the single-machine setting, but cannot gain full benefit from parallelization for two major reasons:
1. Large memory requirement. All training data are needed for each machine to calculate Q(J_k, i_k). Therefore, each machine must be able to store a local copy of the training data.
2. Limited parallelizable computation. Only the inner-product calculation (Σ_{j=1}^{k−1} H(J_k, j) H(i_k, j)) in (4) can be parallelized. The calculation of pivot selection, the summation of local inner-product results, the column calculation in (4), and the vector update in (5) must be performed on one single machine.
To remedy these shortcomings of the column-based approach, we propose a row-based approach to parallelize ICF, which we summarize in Algorithm 1. Our row-based approach starts by initializing variables and loading training data onto m machines in a round-robin fashion (Steps 1 to 5). The algorithm then performs the ICF main loop until the termination criteria are satisfied (e.g., the rank of matrix H reaches p). In the main loop, PICF performs five tasks in each iteration k:
• Distributedly find a pivot, which is the largest value in the diagonal v of matrix Q (steps 7 to 10). Notice that PICF computes only needed elements in Q from training data, and it does not store Q.
• Set the machine where the pivot resides as the master (step 11).
• On the master, PICF calculates H(i_k, k) according to (3) (step 12).
• The master then broadcasts the pivot instance x_{i_k} and the pivot row H(i_k, :) (step 13).
• Distributedly compute (4) and (5) (steps 14 and 15).
At the end of the algorithm, H is stored distributedly on m machines, ready for parallel IPM (presented in the next section). PICF enjoys three advantages: parallel memory use (O(np/m)), parallel computation (O(p^2 n/m)), and low communication overhead (O(p^2 log(m))). Particularly on the communication overhead, its fraction of the entire computation time shrinks as the problem size grows. We will verify this in the experimental section. This pattern permits a larger problem to be solved on more machines to take advantage of parallel memory use and computation.
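The sketch below shows the communication pattern of one PICF iteration (steps 7-15) using mpi4py; it is our own illustration under stated assumptions (the released PSVM uses its own MPI-based C++ implementation), and `kernel` is a hypothetical kernel function such as a Gaussian kernel. For brevity it uses an allgather where Algorithm 1 gathers to a master and then broadcasts.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, m = comm.Get_rank(), comm.Get_size()

def picf_iteration(k, X_local, y_local, H_local, v_local, local_rows, kernel):
    """One iteration of row-based PICF on this machine.
    local_rows[r] is the global index of local row r (round-robin: rank, rank + m, ...)."""
    # Steps 7-10: propose the local pivot, then agree on the global pivot (gpv_k, i_k).
    j = int(np.argmax(v_local))
    gpv_k, i_k = max(comm.allgather((float(v_local[j]), int(local_rows[j]))))

    # Steps 11-13: the pivot owner computes H(i_k, k) by Eq. (3) and broadcasts the pivot
    # instance and the first k+1 entries of the pivot row (the rest are zeros).
    owner = i_k % m
    if rank == owner:
        r = i_k // m                            # local position under round-robin placement
        H_local[r, k] = np.sqrt(v_local[r])
        v_local[r] = 0.0                        # pivot row is now fully factorized
        payload = (X_local[r], y_local[r], H_local[r, :k + 1].copy())
    else:
        payload = None
    x_piv, y_piv, h_piv = comm.bcast(payload, root=owner)

    # Steps 14-15: every machine updates its own rows by Eqs. (4) and (5), computing only
    # the needed entries Q(i, i_k) = y_i y_{i_k} K(x_i, x_{i_k}) on the fly.
    for r, i in enumerate(local_rows):
        if i == i_k:
            continue
        q = y_local[r] * y_piv * kernel(X_local[r], x_piv)
        H_local[r, k] = (q - H_local[r, :k] @ h_piv[:k]) / h_piv[k]
        v_local[r] -= H_local[r, k] ** 2
```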
2.2 Parallel IPM
As mentioned in Section 1, the most effective algorithm to solve a constrained QP problem is the primal-dual IPM. For a detailed description and notation of IPM, please consult (Boyd, 2004; Mehrotra, 1992). For the purpose of SVM training, IPM boils down to solving the following equations in the Newton step iteratively.
Δλ = −λ + vec(1 / (t(C − α_i))) + diag(λ_i / (C − α_i)) Δx                                (6)
Δξ = −ξ + vec(1 / (t α_i)) − diag(ξ_i / α_i) Δx                                           (7)
Δν = (y^T Σ^{−1} z + y^T α) / (y^T Σ^{−1} y)                                              (8)
D = diag(ξ_i / α_i + λ_i / (C − α_i))                                                     (9)
Δx = Σ^{−1} (z − y Δν),                                                                   (10)
where Σ and z depend only on [α, λ, ξ, ν] from the last iteration as follows:
Σ = Q + diag(ξ_i / α_i + λ_i / (C − α_i))                                                 (11)
z = −Q α + 1_n − ν y + (1/t) vec(1 / α_i − 1 / (C − α_i)).                                (12)
The computation bottleneck is the matrix inverse, which takes place on Σ for solving Δν in (8) and Δx in (10). Equation (11) shows that Σ depends on Q, and we have shown that Q can be approximated through PICF by HH^T. Therefore, the bottleneck of the Newton step can be sped up from O(n^3) to O(p^2 n), and be parallelized to O(p^2 n/m).
Distributed Data Loading
To minimize both storage and communication cost, PIPM stores data distributedly as follows:

Distribute matrix data
.
H
is distributedly stored at the end of PICF.

Distribute
n
×
1
vector data
.All
n
×
1
vectors are distributed in a round-robin fashion on
m
machines.These vectors are
z
,
α
,
ξ
,
λ
,

z
,

α
,

ξ
,and

λ
.

Replicate global scalar data
.Every machine caches a copy of global data including
ν
,
t
,
n
,and

ν
.Whenever a scalar is changed,a broadcast is required to maintain global consistency.
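As a small illustration of this layout (our own sketch, mpi4py and NumPy assumed), each machine keeps the round-robin slice of every n × 1 vector, and a global quantity such as y^T Σ^{−1} z in (8) then reduces to a local dot product followed by one scalar allreduce.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, m = comm.Get_rank(), comm.Get_size()

def local_indices(n):
    """Global indices of the n x 1 vector entries stored on this machine (round-robin)."""
    return np.arange(rank, n, m)

def distributed_dot(u_local, w_local):
    """Inner product of two distributed n x 1 vectors: local dot product + one allreduce."""
    return comm.allreduce(float(u_local @ w_local), op=MPI.SUM)
```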
Parallel Computation of Δν
Rather than walking through all equations, we describe how PIPM solves (8), where Σ^{−1} appears twice. An interesting observation is that parallelizing Σ^{−1} z (or Σ^{−1} y) is simpler than parallelizing Σ^{−1}. Let us explain how parallelizing Σ^{−1} z works; parallelizing Σ^{−1} y can follow suit.
According to SMW (the Sherman-Morrison-Woodbury formula), we can write Σ^{−1} z as
Σ^{−1} z = (D + Q)^{−1} z ≈ (D + HH^T)^{−1} z
         = D^{−1} z − D^{−1} H (I + H^T D^{−1} H)^{−1} H^T D^{−1} z
         = D^{−1} z − D^{−1} H (GG^T)^{−1} H^T D^{−1} z.
Σ^{−1} z can be computed in four steps:
1. Compute D^{−1} z. D can be derived from locally stored vectors, following (9). D^{−1} z is an n × 1 vector, and can be computed locally on each of the m machines.
2. Compute t_1 = H^T D^{−1} z. Every machine stores some rows of H and their corresponding part of D^{−1} z. This step can be computed locally on each machine. The results are sent to the master (which can be a randomly picked machine for all PIPM iterations) to aggregate into t_1 for the next step.
3. Compute (GG^T)^{−1} t_1. This step is completed on the master, since it has all the required data. G can be obtained from H in a straightforward manner as shown in SMW. Computing t_2 = (GG^T)^{−1} t_1 is equivalent to solving the linear equation system t_1 = (GG^T) t_2. PIPM first solves t_1 = G y_0, then y_0 = G^T t_2. Once it has obtained y_0, PIPM can solve G^T t_2 = y_0 to obtain t_2. The master then broadcasts t_2 to all machines.
4. Compute D^{−1} H t_2. All machines have a copy of t_2, and can compute D^{−1} H t_2 locally to solve for Σ^{−1} z.
Similarly, Σ^{−1} y can be computed at the same time. Once we have obtained both, we can solve Δν according to (8).
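Written out serially for clarity (our own NumPy/SciPy sketch), the four steps above amount to the following; in PSVM, steps 1, 2, and 4 are the row-parallel parts and step 3 runs on the master. Here G is taken to be the Cholesky factor of I + H^T D^{−1} H, as in the SMW expansion.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def smw_solve(d, H, z):
    """Approximate Sigma^{-1} z with Sigma ~= D + H H^T, D = diag(d), via SMW.
    Cost is O(n p^2) rather than the O(n^3) of inverting Sigma directly."""
    Dinv_z = z / d                                      # step 1: D^{-1} z (elementwise, local)
    t1 = H.T @ Dinv_z                                   # step 2: t1 = H^T D^{-1} z  (p-vector)
    G = cholesky(np.eye(H.shape[1]) + H.T @ (H / d[:, None]), lower=True)
    y0 = solve_triangular(G, t1, lower=True)            # step 3: solve G y0 = t1,
    t2 = solve_triangular(G.T, y0, lower=False)         #         then G^T t2 = y0
    return Dinv_z - (H @ t2) / d                        # step 4: D^{-1} z - D^{-1} H t2
```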
2.3 Computing b and Writing Back
When the IPM iteration stops, we have the value of α and hence the classification function
f(x) = Σ_{i=1}^{N_s} α_i y_i k(s_i, x) + b.
Here N_s is the number of support vectors and the s_i are support vectors. In order to complete this classification function, b must be computed. According to the SVM model, given a support vector s, we obtain one of the two results for f(s): f(s) = +1 if y_s = +1, or f(s) = −1 if y_s = −1.
In practice, we can select M, say 1,000, support vectors and compute the average of the b's in parallel using MapReduce (Dean & Ghemawat, 2004).
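A serial sketch of that averaging step follows (our own illustration; the paper performs it with MapReduce, and `kernel` here is a hypothetical kernel function). Each sampled support vector s contributes b_s = y_s − Σ_i α_i y_i K(x_i, s), and the reported b is their mean.

```python
import numpy as np

def estimate_b(alpha, y, X, kernel, num_samples=1000, seed=0):
    """Average b over (up to) num_samples support vectors,
    using b_s = y_s - sum_i alpha_i y_i K(x_i, s) for each sampled support vector s."""
    rng = np.random.default_rng(seed)
    sv = np.flatnonzero(alpha > 1e-8)                   # indices of support vectors
    picked = rng.choice(sv, size=min(num_samples, sv.size), replace=False)
    bs = []
    for j in picked:
        k_col = np.array([kernel(X[i], X[j]) for i in sv])
        bs.append(y[j] - np.sum(alpha[sv] * y[sv] * k_col))
    return float(np.mean(bs))
```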
3 Experiments
We conducted experiments on PSVM to evaluate its 1) class-prediction accuracy, 2) scalability on large datasets, and 3) overheads. The experiments were conducted on up to 500 machines in our data center. Not all machines are identically configured; however, each machine is configured with a CPU faster than 2GHz and memory larger than 4GBytes.
Table 1: Class-prediction Accuracy with Different p Settings.

dataset      samples (train/test)   LIBSVM   p = n^0.1   p = n^0.2   p = n^0.3   p = n^0.4   p = n^0.5
svmguide1    3,089 / 4,000          0.9608   0.6563      0.9         0.917       0.9495      0.9593
mushrooms    7,500 / 624            1        0.9904      0.9920      1           1           1
news20       18,000 / 1,996         0.7835   0.6949      0.6949      0.6969      0.7806      0.7811
Image        199,957 / 84,507       0.849    0.7293      0.7210      0.8041      0.8121      0.8258
CoverType    522,910 / 58,102       0.9769   0.9764      0.9762      0.9766      0.9761      0.9766
RCV          781,265 / 23,149       0.9575   0.8527      0.8586      0.8616      0.9065      0.9264
3.1 Class-prediction Accuracy
PSVM employs PICF to approximate an n × n kernel matrix Q with an n × p matrix H. This experiment evaluated how the choice of p affects class-prediction accuracy. We set p of PSVM to n^t, where t ranges from 0.1 to 0.5, incremented by 0.1, and compared its class-prediction accuracy with that achieved by LIBSVM. The first two columns of Table 1 enumerate the datasets and their sizes with which we experimented. We used a Gaussian kernel, and selected the best C and σ for LIBSVM and PSVM, respectively. For CoverType and RCV, we loosened the termination condition (set -e 1, default 0.001) and used the shrinking heuristic (set -h 0) to make LIBSVM terminate within several days. The table shows that when t is set to 0.5 (or p = √n), the class-prediction accuracy of PSVM approaches that of LIBSVM.
We compared only with LIBSVM because it is arguably the best open-source SVM implementation in both accuracy and speed. Another possible candidate is CVM (Tsang et al., 2005). Our experimental result on the CoverType dataset outperforms the result reported by CVM on the same dataset in both accuracy and speed. Moreover, CVM's training time has been shown to be unpredictable by (Loosli & Canu, 2006), since the training time is sensitive to the selection of stop criteria and hyper-parameters. For how we position PSVM with respect to other related work, please refer to our disclaimer at the end of Section 1.
3.2 Scalability
For scalability experiments, we used three large datasets. Table 2 reports the speedup of PSVM on up to m = 500 machines. Since a single machine cannot store the factorized matrix H in its local memory when the dataset is large, we cannot obtain the running time of PSVM on one machine. We thus used 10 machines as the baseline to measure the speedup of using more than 10 machines. To quantify speedup, we made an assumption that the speedup of using 10 machines is 10, compared to using one machine. This assumption is reasonable for our experiments, since PSVM does enjoy linear speedup when the number of machines is up to 30.
Table 2: Speedup (p is set to √n); LIBSVM training time is reported on the last row for reference.

              Image (200k)           CoverType (500k)        RCV (800k)
Machines      Time (s)     Speedup   Time (s)      Speedup   Time (s)       Speedup
10            1,958 (9)    10        16,818 (442)  10        45,135 (1373)  10
30            572 (8)      34.2      5,591 (10)    30.1      12,289 (98)    36.7
50            473 (14)     41.4      3,598 (60)    46.8      7,695 (92)     58.7
100           330 (47)     59.4      2,082 (29)    80.8      4,992 (34)     90.4
150           274 (40)     71.4      1,865 (93)    90.2      3,313 (59)     136.3
200           294 (41)     66.7      1,416 (24)    118.7     3,163 (69)     142.7
250           397 (78)     49.4      1,405 (115)   119.7     2,719 (203)    166.0
500           814 (123)    24.1      1,655 (34)    101.6     2,671 (193)    169.0
LIBSVM        4,334        NA        28,149        NA        184,199        NA
We trained PSVM three times for each dataset-m combination. The speedup reported in the table is the average of three runs, with standard deviation provided in brackets. The observed variance in speedup was caused by the variance of machine loads, as all machines were shared with other tasks running in our data centers. We can observe in Table 2 that the larger the dataset, the better the speedup. Figures 1(a), (b) and (c) plot the speedup of Image, CoverType, and RCV, respectively. All datasets enjoy a linear speedup when the number of machines is moderate. For instance, PSVM achieves linear speedup on RCV when running on up to around 100 machines. PSVM scales well till around 250 machines. After that, adding more machines receives diminishing returns. This result led to our examination of the overheads of PSVM, presented next.
Figure 1: Speedup and Overheads of Three Datasets. Panels (a)-(c): speedup; (d)-(f): overhead; (g)-(i): overhead fraction, for Image (200k), CoverType (500k), and RCV (800k), respectively.
3.3 Overheads
PSVM cannot achieve linear speedup when the number of machines continues to increase beyond a data-size-dependent threshold. This is expected due to communication and synchronization overheads. Communication time is incurred when message passing takes place between machines. Synchronization overhead is incurred when the master machine waits for task completion on the slowest machine. (The master could wait forever if a child machine fails. We have implemented a checkpoint scheme to deal with this issue.)
The running time consists of three parts: computation (Comp), communication (Comm), and synchronization (Sync). Figures 1(d), (e) and (f) show how Comm and Sync overheads influence the speedup curves. In the figures, we draw on top the computation-only line (Comp), which approaches the linear speedup line. Computation speedup can become sublinear when adding machines beyond a threshold. This is because of the computation bottleneck of the unparallelizable step 12 in Algorithm 1 (whose computation time is O(p^2)). When m is small, this bottleneck is insignificant in the total computation time. According to Amdahl's law, however, even a small fraction of unparallelizable computation can cap speedup. Fortunately, the larger the dataset is, the smaller is this unparallelizable fraction, which is O(m/n). Therefore, more machines (larger m) can be employed for larger datasets (larger n) to gain speedup.
When communication overhead or synchronization overhead is accounted for (the Comp + Comm line and the Comp + Comm + Sync line), the speedup deteriorates. Between the two overheads, the synchronization overhead does not impact speedup as much as the communication overhead does. Figures 1(g), (h), and (i) present the percentage of Comp, Comm, and Sync in total running time. The synchronization overhead maintains about the same percentage when m increases, whereas the percentage of communication overhead grows with m. As mentioned in Section 2.1, the communication overhead is O(p^2 log(m)), growing sub-linearly with m. But since the computation time per node decreases as m increases, the fraction of the communication overhead grows with m. Therefore, PSVM must select a proper m for a training task to maximize the benefit of parallelization.
4 Conclusion
In this paper, we have shown how SVMs can be parallelized to achieve scalable performance. PSVM distributedly loads training data on parallel machines, reducing the memory requirement through approximate factorization of the kernel matrix. PSVM solves IPM in parallel by cleverly arranging the computation order. We have made PSVM open source at http://code.google.com/p/psvm/.
Acknowledgement
The first author is partially supported by NSF under Grant Number IIS-0535085.
References
Bach, F. R., & Jordan, M. I. (2005). Predictive low-rank decomposition for kernel methods. Proceedings of the 22nd International Conference on Machine Learning.
Boyd, S. (2004). Convex optimization. Cambridge University Press.
Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.
Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y., & Olukotun, K. (2006). Map-reduce for machine learning on multicore. NIPS.
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. OSDI'04: Symposium on Operating System Design and Implementation.
Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243-264.
Ghemawat, S., Gobioff, H., & Leung, S.-T. (2003). The Google file system. 19th ACM Symposium on Operating Systems Principles.
Golub, G. H., & Loan, C. F. V. (1996). Matrix computations. Johns Hopkins University Press.
Graf, H. P., Cosatto, E., Bottou, L., Dourdanovic, I., & Vapnik, V. (2005). Parallel support vector machines: The cascade SVM. In Advances in Neural Information Processing Systems 17, 521-528.
Joachims, T. (1998). Making large-scale SVM learning practical. Advances in Kernel Methods - Support Vector Learning.
Joachims, T. (2006). Training linear SVMs in linear time. ACM KDD, 217-226.
Lee, Y.-J., & Mangasarian, O. L. (2001). RSVM: Reduced support vector machines. First SIAM International Conference on Data Mining. Chicago.
Loosli, G., & Canu, S. (2006). Comments on the core vector machines: Fast SVM training on very large data sets (Technical Report).
Mehrotra, S. (1992). On the implementation of a primal-dual interior point method. SIAM J. Optimization, 2.
Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines (Technical Report MSR-TR-98-14). Microsoft Research.
Tsang, I. W., Kwok, J. T., & Cheung, P.-M. (2005). Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6, 363-392.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Vishwanathan, S., Smola, A. J., & Murty, M. N. (2003). SimpleSVM. ICML.