Parallel support vector machine training
Kristian Woodsend
k.j.woodsend@sms.ed.ac.uk
Jacek Gondzio
j.gondzio@ed.ac.uk
School of Mathematics, University of Edinburgh,
The King's Buildings, Edinburgh, EH9 3JZ, UK
October 8, 2007
Support Vector Machines (SVMs) are powerful machine learning techniques for classification and regression, and they offer state-of-the-art performance. The training of an SVM is computationally expensive and relies on optimization. The core of the approach is a dense convex quadratic optimization problem (QP),

\[
\min_z \quad \tfrac{1}{2} z^T Y X_\phi X_\phi^T Y z - e^T z
\qquad \text{s.t.} \quad y^T z = 0, \qquad 0 \le z \le \tau e,
\tag{1}
\]
where $X_\phi$ is the feature matrix of the data set (possibly after some nonlinear transformation $\phi$), $y \in \{-1,1\}^n$ the vector of classification labels, $Y = \mathrm{diag}(y)$, $z \in \mathbb{R}^n$ the support vector variables, $\tau > 0$ a scalar parameter of the problem, and $e \in \mathbb{R}^n$ the vector of all ones.
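As a concrete reading of the objective in (1), the following minimal NumPy sketch evaluates the dual objective at a given $z$ for a linear kernel, where $X_\phi$ is simply the feature matrix $X$. The function name and the tiny data set are illustrative only; solving (1) itself requires a QP solver.

```python
import numpy as np

def svm_dual_objective(z, X, y):
    """Evaluate the SVM dual objective of (1) for a linear kernel.

    X : (n, m) feature matrix (rows are data points); y : labels in {-1, +1}.
    Illustrative helper only -- minimizing (1) needs a QP solver.
    """
    w = X.T @ (y * z)             # X^T Y z, an m-vector
    return 0.5 * w @ w - z.sum()  # (1/2) z^T Y X X^T Y z - e^T z

# Tiny example: two separable points with opposite labels
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
z = np.array([0.5, 0.5])
print(svm_dual_objective(z, X, y))  # -> -0.5
```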
The Hessian matrix is dense, so for a general-purpose QP solver the cost of solving (1) scales cubically with the number of data points, O(n^3). This complexity makes applying SVMs to large-scale data sets challenging, and in practice the optimization problem is intractable for general-purpose optimization solvers. The standard approach to handle this problem, used by state-of-the-art software, is to build a solution by solving a sequence of small-scale problems. These active-set techniques work well when the separation into active and non-active variables is clear (when the data is separable by a hyperplane), but with noisy data the set of support vectors is less clear, and the performance of these algorithms deteriorates.
The standard active-set technique is essentially sequential, choosing a small subset of variables to form the active set at each iteration, based upon the results of the previous selection. Improving the computation time through parallelization of the algorithm is difficult due to dependencies between one iteration and the next, and it is not clear how to implement this efficiently.
Some parallelization schemes proposed so far split the training data to give smaller, separable optimization subproblems which can be distributed amongst the processors, and then combine the results in some way into a single output (Collobert et al., 2002; Dong et al., 2003; Graf et al., 2005).
There have been only a few parallel methods in the literature which train a standard SVM on the whole of the data set. Zanghirati and Zanni (2003) decompose (1) into a sequence of smaller, though still dense, QP subproblems, and develop a parallel solver based on the variable projection method.
Another family of approaches to QP optimization is based on Interior Point Method (IPM) technology, which works by delaying the split between active and inactive variables for as long as possible. IPMs generally work well on large-scale problems. A straightforward implementation of (1) would have complexity O(n^3), and be unusable for anything but the smallest problems.
Chang et al. (2007) use IPM technology for the optimizer. To avoid the problem of inverting the dense Hessian matrix, they generate a low-rank approximation of the kernel matrix using partial Cholesky decomposition with pivoting. The dense Hessian matrix can then be efficiently inverted implicitly using the low-rank approximation and the Sherman-Morrison-Woodbury (SMW) formula. Through suitable arrangement of the algorithm, a large part of the calculations at each iteration can be distributed amongst the processors effectively. The SMW formula has been widely used in interior point methods; however, it sometimes runs into numerical difficulties. Fine and Scheinberg (2002) constructed data sets where an SMW-based algorithm for SVM training required many more iterations to terminate, and in some cases stalled before achieving an accurate solution. They also showed that this situation arises in real-world data sets.
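The SMW identity that these methods rely on can be sketched in a few lines of NumPy: with the kernel approximated by a rank-r factor $V V^T$, a system with $H = D + V V^T$ ($D$ diagonal) is solved by inverting only an $r \times r$ matrix. The matrices below are random placeholders, not the actual quantities arising inside an IPM iteration.

```python
import numpy as np

def smw_solve(d, V, u):
    """Solve (D + V V^T) x = u via Sherman-Morrison-Woodbury,
    where D = diag(d) and V has low rank r:
      x = D^{-1}u - D^{-1}V (I + V^T D^{-1} V)^{-1} V^T D^{-1}u.
    Only an r x r linear system is factorized."""
    Dinv_u = u / d                             # D^{-1} u
    Dinv_V = V / d[:, None]                    # D^{-1} V
    small = np.eye(V.shape[1]) + V.T @ Dinv_V  # r x r matrix
    return Dinv_u - Dinv_V @ np.linalg.solve(small, V.T @ Dinv_u)

# Check against a direct dense solve on random data
rng = np.random.default_rng(0)
n, r = 200, 5
d = rng.uniform(1.0, 2.0, n)
V = rng.standard_normal((n, r))
u = rng.standard_normal(n)
x = smw_solve(d, V, u)
assert np.allclose((np.diag(d) + V @ V.T) @ x, u)
```

The numerical difficulties mentioned above arise when $D$ contains very small entries, as happens near an IPM solution: dividing by $d$ then amplifies rounding error.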
Our contribution: In this paper we propose a parallel algorithm for large-scale SVM training based on (1), using Interior Point Methods. For linear SVMs, we use the feature matrix directly. For nonlinear SVMs, like Chang et al. (2007), our approach requires the data to be preprocessed using partial Cholesky decomposition with pivoting; we extend the method by including residual diagonal information, and we show that, with such information, choosing the largest pivot does not necessarily give the best approximation.
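A minimal sketch of pivoted partial Cholesky is given below, using the standard largest-residual-pivot heuristic; the residual diagonal `d` it tracks is the extra information discussed above. For simplicity the sketch takes the dense kernel matrix as input, whereas practical implementations form kernel columns on the fly; the function name is illustrative.

```python
import numpy as np

def partial_cholesky(K, rank):
    """Pivoted partial Cholesky: K ~= L L^T with L of shape (n, rank).
    Also returns the residual diagonal d = diag(K - L L^T).
    Pivot rule: largest residual diagonal (the common heuristic)."""
    n = K.shape[0]
    L = np.zeros((n, rank))
    d = K.diagonal().copy()          # residual diagonal, updated each step
    for j in range(rank):
        p = int(np.argmax(d))        # pick largest-residual pivot
        L[:, j] = (K[:, p] - L @ L[p, :]) / np.sqrt(d[p])
        d -= L[:, j] ** 2            # downdate residual diagonal
    return L, np.maximum(d, 0.0)     # clip tiny negative round-off

# Demo: a rank-2 "kernel" is recovered exactly in 2 steps
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
K = A @ A.T
L, d = partial_cholesky(K, 2)
print(np.max(np.abs(K - L @ L.T)))  # residual is (near) zero
```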
For the QP, we use an interior point method to give an efficient optimization, but unlike previous approaches based on the SMW formula, we use Cholesky decomposition to give good numerical stability. The factorization is applied to all features at once, allowing for a more efficient implementation in terms of memory caching. In this way, our approach directly addresses the most computationally expensive part of the optimization, namely the inversion of the dense Hessian matrix. The resulting implementation (Woodsend and Gondzio, 2007a) gives O(n) training times, and is highly competitive: for the noisier benchmark data sets, our implementation is between one and two orders of magnitude faster than state-of-the-art software.
Our approach is amenable to parallel implementation (Woodsend and Gondzio, 2007b). The algorithm trains the SVM using the full data set. The data is evenly distributed amongst the processors, allowing huge data sets to be handled. We describe an algorithm for performing the Cholesky decomposition preprocessing in parallel, without requiring explicit storage of the kernel matrix. By exploiting the structure of the QP optimization problem, we show how the training itself can be achieved with near-linear parallel efficiency. The resulting implementation is therefore a highly efficient sublinear SVM training algorithm, which is scalable to large-scale problems.
References
E. Chang, K. Zhu, H. Wang, J. Bai, J. Li, Z. Qiu, and H. Cui. Parallelizing support vector machines on distributed computers. In NIPS, 2007. To be published.

R. Collobert, S. Bengio, and Y. Bengio. A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5):1105–1114, 2002.

J. Dong, A. Krzyzak, and C. Suen. A fast parallel optimization for training support vector machine. In P. Perner and A. Rosenfeld, editors, Proceedings of 3rd International Conference on Machine Learning and Data Mining, pages 96–105, Leipzig, Germany, 2003. Springer Lecture Notes in Artificial Intelligence.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2002.

H. P. Graf, E. Cosatto, L. Bottou, I. Dourdanovic, and V. Vapnik. Parallel support vector machines: the Cascade SVM. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 17. MIT Press, 2005.

K. Woodsend and J. Gondzio. Exploiting separability in large-scale support vector machine training. Technical Report MS-07-002, School of Mathematics, University of Edinburgh, August 2007a. Submitted for publication. Available at http://www.maths.ed.ac.uk/~gondzio/reports/wgSVM.html.

K. Woodsend and J. Gondzio. Parallel support vector machine training. Technical report, 2007b. In preparation.

G. Zanghirati and L. Zanni. A parallel solver for large quadratic programs in training support vector machines. Parallel Computing, 29(4):535–551, 2003.