Current Research on Improving Scalability for Kernel Based Methods


Alyssa Lees

New York University




For many applications, Support Vector Machines (and other kernel based algorithms), alone or in combination with other methods, yield superior performance to other machine learning options. In general, SVMs work very well in practice, are modular, have a small number of tunable parameters (in comparison with neural networks), and tend toward global solutions. However, a significant disadvantage of kernel methods is the problem of scalability to a large number of data points. The problems associated with applying SVMs to large data sets include a drastic increase in training time, increased memory requirements, and a prediction time that is proportional to the number of kernels (support vectors), which itself grows with the number of data points.
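To make the prediction cost concrete, the decision function of a trained SVM is a kernel expansion over the support vectors, so each prediction requires one kernel evaluation per support vector. The following is a minimal NumPy sketch of this expansion; the RBF kernel, the variable names, and the toy values are illustrative assumptions, not taken from any of the papers discussed here:

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel between two vectors."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """SVM decision function f(x) = sum_i alpha_i * y_i * K(x_i, x) + b.

    The loop runs once per support vector, so prediction cost grows
    linearly with the number of support vectors, which in turn grows
    with the size of the training set.
    """
    total = b
    for sv, alpha, y in zip(support_vectors, alphas, labels):
        total += alpha * y * kernel(sv, x)
    return total

# Toy usage: 3 support vectors in 2 dimensions.
svs = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
alphas = np.array([0.7, 0.3, 0.4])
labels = np.array([+1, -1, +1])
print(svm_decision(np.array([0.5, 0.5]), svs, alphas, labels, b=0.1))
```
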
The first part of the paper will explore a few solutions used to improve generalization performance for larger problems. Burges and Scholkopf (1997) offer the idea of the virtual support vector method, which yields better accuracy for large problems, and Tresp compares his Bayesian Committee Machine (BCM) algorithm with Reduced Rank Approximation (RRA) and the Subset of Representers Method (SRM) as approximate methods for scaling to larger systems. The second part of the paper investigates past and current research in decreasing the training time. The paper studies SMO, a specialized case of Osuna's Decomposition, as well as current advances to optimize decomposition.

In 1997, Burges and Scholkopf suggested exploiting knowledge of a problem's domain to achieve better accuracy for large scale support vector machines. Their contributions of the "virtual support vector" method and the "reduced set method" produced an SVM that was 22 times faster and yielded better generalization performance (on the 10,000-image NIST digit test set, the error rate dropped from 1.4% to 1.0%). However, at the time of publication the authors readily acknowledged that boosted neural nets had achieved better error rates (as low as 0.7%) at considerably faster speeds (350K multiply-adds vs. 650K with the modified SVM).


Despite the modest results, the methods are based on important principles: improve accuracy by incorporating knowledge about the invariances of the problem, and increase classification speed by reducing the complexity of the representation of the decision function. Problem domain knowledge can be utilized either directly in the coding of an algorithm or to generate artificial ('virtual') training examples. Due to high correlations between the artificial data and the larger training set size, generating virtual examples can significantly increase training time, but it has the advantage of being easily used with any learning machine. Support Vector Machines can combine both approaches. Support Vector Machines have the following characteristics: if all non-support-vector training data were removed and the system retrained, the solution would be unchanged; the support vectors are close to the decision boundary; and different SVMs tend to produce the same set of support vectors. Burges and Scholkopf propose training an SVM to generate a set of support vectors, generating artificial examples by applying the desired invariance transformations to the support vectors, and then training another SVM on the new set.
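As a rough illustration of the virtual support vector procedure just described, the sketch below trains an initial SVM, applies a known invariance transformation to its support vectors (here, small horizontal shifts of flattened 8x8 image-like inputs, chosen purely for illustration), and retrains on the enlarged set. It assumes scikit-learn's SVC interface as a modern stand-in and is not the authors' original implementation:

```python
import numpy as np
from sklearn.svm import SVC

def translate(images, shift):
    """Shift flattened 8x8 'images' horizontally by `shift` pixels (illustrative invariance)."""
    imgs = images.reshape(-1, 8, 8)
    return np.roll(imgs, shift, axis=2).reshape(len(images), -1)

def virtual_sv_train(X, y, C=10.0, gamma=0.01, shifts=(-1, 1)):
    # Step 1: train an initial SVM and extract its support vectors.
    base = SVC(C=C, gamma=gamma).fit(X, y)
    sv_X, sv_y = base.support_vectors_, y[base.support_]

    # Step 2: generate 'virtual' examples by applying the invariance
    # transformation only to the support vectors.
    virtual_X = np.vstack([translate(sv_X, s) for s in shifts])
    virtual_y = np.concatenate([sv_y for _ in shifts])

    # Step 3: retrain on the support vectors plus the virtual examples.
    return SVC(C=C, gamma=gamma).fit(np.vstack([sv_X, virtual_X]),
                                     np.concatenate([sv_y, virtual_y]))
```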


The Virtual Support Vector method, as described above, is effective in modestly improving accuracy, but it doubles the training time and decreases classification speed (due to more support vectors). The reduced set method chooses the smallest set of examples such that the resulting loss in generalization remains acceptable. The reduced set method is fundamentally flawed: in certain instances (quadratic kernels) the reduced set can be computed exactly and efficiently, but general cases must be computed using conjugate gradient methods, which are very computationally expensive. Note that methods for improving training time, which extend beyond the reduced set method, are discussed later in the paper.

In 2001, Tresp and Schwaighofer compared three methods designed to improve the scalability of kernel based systems. The methods examined include the Subset of Representers Method (SRM), Reduced Rank Approximation (RRA), and an improved BCM approximation.

SRM is based on a factorization of the kernel function. In the SRM method, a set of base kernels is selected (either a subset of the training data or of the test data) and the covariance of two points is approximated using the covariance matrix of the base kernel points. The Gram matrix is then approximated using these covariances. The key to this method is the reduction in rank of the Gram matrix (which will be of rank equal to or less than the number of base kernels). The Reduced Rank Approximation, RRA, uses the SRM decomposition of the Gram matrix to calculate the kernel weights. The difference between RRA and SRM is that the SRM decomposition changes the covariance structures of the kernels, while RRA leaves the covariance structures unchanged. In addition, with RRA the number of kernels with nonzero weights is identical to the number of training points, while in SRM the number is identical to the number of base points. Finally, Tresp offers his own method, the Bayesian Committee Machine (BCM) approximation, which uses an optimal projection of the data onto a set of base kernels. Using assumptions about conditional independencies, Tresp calculates the optimal prediction at the base kernel points. However, this calculation requires computing the inverse of the covariance of the labels y given a function f of the base kernels. In order to avoid inverting an NxN matrix, the BCM approximation uses a block diagonal approximation of the covariance matrix, so that the computation of the associated weight vector requires only the inversion of blocks of size B. The approximation improves when few blocks are used. In fact, the BCM approximation becomes SRM when the covariance of y given f is set to alpha squared times the identity.
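The rank reduction at the heart of SRM can be illustrated in a few lines of NumPy. The sketch below approximates an N x N Gram matrix from the covariances with B base points and the B x B base covariance matrix; the RBF kernel, the random data, and the variable names are illustrative assumptions, and this is a sketch of the general subset-of-representers idea rather than Tresp and Schwaighofer's exact formulation:

```python
import numpy as np

def rbf_gram(A, B, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                   # N = 500 training points
base = X[rng.choice(500, 50, replace=False)]    # B = 50 base kernel points

K_nb = rbf_gram(X, base)     # covariances between all points and the base points
K_bb = rbf_gram(base, base)  # covariance matrix of the base points
K_full = rbf_gram(X, X)      # exact N x N Gram matrix

# SRM-style approximation: K ~ K_nb K_bb^{-1} K_nb^T, of rank at most B.
K_approx = K_nb @ np.linalg.solve(K_bb, K_nb.T)

print(np.linalg.matrix_rank(K_approx))                              # at most 50
print(np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full))   # relative error
```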


According to the tests conducted by Tresp, the RRA method performed considerably worse than SRM and BCM. For many cases BCM and SRM were comparable, but when training was performed after the test inputs were known, the BCM yielded much better generalization performance.

Various methods have been suggested and used to improve training time for large datasets. The chunking algorithm is based on the observation that the value of the quadratic form is the same if the rows and columns of the matrix that correspond to zero Lagrange multipliers are removed. At each iteration of the algorithm, chunking solves a smaller QP problem that consists of every nonzero Lagrange multiplier from the previous step plus the M worst examples that violate the KKT conditions. In the final iteration the entire set of non-zero Lagrange multipliers is extracted, and hence the algorithm solves the full QP problem. While chunking reduces the size of the matrix from N^2 (where N is the number of training samples) to roughly the number of nonzero Lagrange multipliers squared, it still does not handle large scale training problems, since the matrix may still not fit in memory.
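The chunk-construction step described above (keep the currently nonzero multipliers and add the M worst KKT violators) can be sketched as follows. The particular KKT-violation measure and the variable names are illustrative assumptions, and the inner QP solve on the selected chunk is left to whatever solver one prefers:

```python
import numpy as np

def build_chunk(alpha, y, f_values, C, M, tol=1e-3):
    """Select indices for the next chunking subproblem.

    alpha    : current Lagrange multipliers (length N)
    y        : labels in {-1, +1}
    f_values : current decision values f(x_i) for all training points
    C        : box constraint
    M        : number of worst KKT violators to add
    """
    r = y * (f_values - y)          # y_i * E_i, where E_i = f(x_i) - y_i

    # KKT violation score: positive where the conditions are broken.
    viol = np.zeros_like(r)
    lower = (alpha < C - tol) & (r < -tol)   # margin should not be violated from below
    upper = (alpha > tol) & (r > tol)        # margin should not be exceeded from above
    viol[lower] = -r[lower]
    viol[upper] = r[upper]

    nonzero = np.flatnonzero(alpha > tol)    # keep every nonzero multiplier
    worst = np.argsort(-viol)[:M]            # M worst KKT violators
    chunk = np.union1d(nonzero, worst[viol[worst] > 0])
    return chunk                             # pass this index set to a QP solver
```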


Osuna suggested a new method for solving the SVM QP problem by showing that a large QP problem can be broken into a series of smaller QP problems, provided that one or more examples that violate the KKT conditions are added to the smaller QP problem at each iteration. Osuna's Decomposition algorithm uses a constant size matrix for every sub-problem and adds/subtracts one example at every step. However, a numerical QP solver is still required, which raises numerical precision issues.

Sequential Minimal Optimization (SMO) is now a standard method used to quickly train support vector machines. Under regular circumstances, training an SVM requires the solution of an extremely large quadratic programming problem. The idea behind SMO is that the quadratic programming problem can be broken up into a series of the smallest possible QP problems, each of which can be solved analytically. By avoiding the large matrix computation, SMO can handle very large training sets, with training time scaling between linearly and quadratically in the training set size and memory scaling linearly. The standard approaches such as chunking and Osuna's algorithm can be on the order of cubic time. The SMO algorithm performs especially well with sparse data sets (Platt, 1998).


Sequential Minimal Optimization solves the smallest possible optimization problem at every step, which involves two Lagrange multipliers. At each step the two Lagrange multipliers are jointly optimized. The main advantage is that the solution for the multipliers at each step is analytic (i.e., no QP solver is used). SMO is essentially a special case of Osuna's Decomposition algorithm.
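The analytic two-multiplier update at the core of SMO can be written in a few dozen lines. The sketch below follows the standard form of the update (clip the second multiplier to its box constraints, adjust the first to preserve the equality constraint, then update the threshold); the function and variable names are illustrative, and the heuristics for choosing which pair to optimize are omitted:

```python
import numpy as np

def smo_pair_update(i, j, alpha, b, y, K, C, tol=1e-5):
    """Jointly optimize Lagrange multipliers alpha[i], alpha[j] analytically.

    K is the precomputed kernel (Gram) matrix; y holds labels in {-1, +1}.
    Returns the updated (alpha, b) and whether any change was made.
    """
    if i == j:
        return alpha, b, False

    # Prediction errors E_k = f(x_k) - y_k under the current multipliers.
    f = (alpha * y) @ K + b
    E_i, E_j = f[i] - y[i], f[j] - y[j]

    # Bounds L, H keep the pair on the equality-constraint line inside [0, C].
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    if H - L < tol:
        return alpha, b, False

    # Second derivative of the objective along the constraint line.
    eta = 2.0 * K[i, j] - K[i, i] - K[j, j]
    if eta >= 0:
        return alpha, b, False      # skip degenerate directions in this sketch

    a_i_old, a_j_old = alpha[i], alpha[j]
    alpha = alpha.copy()
    alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
    alpha[i] = a_i_old + y[i] * y[j] * (a_j_old - alpha[j])

    # Update the threshold b so that a free multiplier satisfies the KKT conditions.
    b1 = b - E_i - y[i] * (alpha[i] - a_i_old) * K[i, i] - y[j] * (alpha[j] - a_j_old) * K[i, j]
    b2 = b - E_j - y[i] * (alpha[i] - a_i_old) * K[i, j] - y[j] * (alpha[j] - a_j_old) * K[j, j]
    if 0 < alpha[i] < C:
        b = b1
    elif 0 < alpha[j] < C:
        b = b2
    else:
        b = 0.5 * (b1 + b2)
    return alpha, b, abs(alpha[j] - a_j_old) > tol
```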


In a recent paper (2002) by Flake and Lawrence, the SMO algorithm was applied to regression training. The principal disadvantage of the SMO algorithm is that the rate of convergence slows considerably when the data is not sparse and when many support vectors appear in the solution. Flake introduces several modifications to improve the performance on non-sparse datasets. The improvement was shown to be an order of magnitude in runtime when applied to regression problems.


Flake and Lawrence's fundamental contribution is to cache the kernel output functions in an intelligent manner. Their cache data structure contains an inverse index and a 2-D array that stores the cached values. The authors eliminate thrashing by using a hierarchy of selection methods to find the second Lagrange multiplier, with a heuristic that only searches the entire data set if the working set already includes the entire data set. Finally, they take optimal steps to exploit the cached output. The modified SMO was tested on dense regression datasets and yielded dramatic runtime improvements. In addition, the authors claim their implementation of SMO can outperform the current SVM optimization packages that use a QP solver for decomposition.
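A minimal sketch of the general kernel-caching idea (a bounded cache of kernel rows with least-recently-used eviction) is shown below. It illustrates why caching pays off when the same rows are requested repeatedly across SMO iterations; it is not Flake and Lawrence's exact inverse-index data structure, and the class and parameter names are assumptions:

```python
import numpy as np
from collections import OrderedDict

class KernelRowCache:
    """Cache full kernel rows K[i, :] with LRU eviction (illustrative sketch)."""

    def __init__(self, X, kernel, max_rows=256):
        self.X, self.kernel, self.max_rows = X, kernel, max_rows
        self.rows = OrderedDict()            # index -> cached kernel row
        self.hits = self.misses = 0

    def row(self, i):
        if i in self.rows:
            self.rows.move_to_end(i)         # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.rows[i] = self.kernel(self.X[i], self.X)   # compute the row once
            if len(self.rows) > self.max_rows:
                self.rows.popitem(last=False)               # evict the LRU row
        return self.rows[i]

def rbf_row(x, X, gamma=0.5):
    return np.exp(-gamma * np.sum((X - x) ** 2, axis=1))

# Usage: repeated requests for the same working-set rows are served from memory.
X = np.random.default_rng(0).normal(size=(1000, 10))
cache = KernelRowCache(X, rbf_row, max_rows=64)
for i in [3, 7, 3, 3, 7, 11]:
    _ = cache.row(i)
print(cache.hits, cache.misses)   # 3 hits, 3 misses
```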


Another recent improvement to the SMO algorithm was found for a class of SVM algorithms by Keerthi and Gilbert. In the 2002 paper "Convergence of a Generalized SMO Algorithm for SVM Classifier Design", the authors describe a generalization of the SMO algorithm. As well as a proof of convergence, the authors cite a significantly faster algorithm than the standard SMO.


Also in 2002, Hsu and Lin investigated ways of improving Osuna's Decomposition Method. The authors pose a simple solution to the problem of selecting a working set that leads to faster convergence in difficult cases. Their implementation for finding the working set, named BSVM, is based on the observations that the number of free variables should be minimized and that successive iterations yield working sets with very similar components. The first observation is addressed by adding some free variables from the previous iteration to the current working set. In experiments BSVM outperforms SVMlight. The advantages of the work are that the algorithm minimizes the number of free variables, which leads to faster convergence for difficult problems, and that the algorithm considers free variables of the current iteration again in future iterations, so that the columns of these elements are naturally cached.
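As a rough illustration of the working-set heuristic described above, the sketch below selects the q most violating indices by a precomputed KKT-violation score and carries over free variables from the previous working set so their kernel columns stay cached. The violation score and the names are illustrative assumptions, not the exact BSVM selection rule:

```python
import numpy as np

def select_working_set(violation, alpha, prev_ws, C, q=10, tol=1e-8):
    """Pick a working set of size q for the next decomposition subproblem.

    violation : per-example KKT violation scores (computed elsewhere)
    alpha     : current dual variables
    prev_ws   : indices used in the previous iteration
    """
    # Free (unbounded) variables from the previous working set: keeping them
    # in the new working set means their kernel columns are already cached.
    carry = [i for i in prev_ws if tol < alpha[i] < C - tol]

    # Fill the remaining slots with the currently most violating examples.
    order = np.argsort(-violation)
    fresh = [i for i in order if i not in carry][: max(0, q - len(carry))]

    return np.array(carry[:q] + fresh)
```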


The scalability of kernel based methods is still a serious problem if SVMs are to be used for large-scale problems. Boosted neural networks can in some cases achieve better accuracy at faster speeds. However, the different advances listed in this paper offer significant contributions to different classes of problems. SMO is an ideal solution for training on large sparse data sets, and current research on specialized cases such as regression problems yields results superior to other machine learning methods.






------------------------------------------------------------------------


*C. J. C. Burges and B. Schölkopf. 'Improving the accuracy and speed of support vector learning machines'. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press.

*Flake, Gary W. and Lawrence, Steve. 'Efficient SVM Regression Training with SMO'. Machine Learning, 46, 271-290, 2002.

*Guyon, I. and Boser, B. and Vapnik, V. 'Automatic Capacity Tuning of Very Large VC-dimension Classifiers'. Advances in Neural Information Processing Systems, volume 5, pages 147-155. Morgan Kaufmann, San Mateo, CA, 1993.

*Hsu, Chih-Wei and Lin, Chih-Jen. 'A Simple Decomposition Method for Support Vector Machines'. Machine Learning, 46, 291-314, 2002.

*Li, Yi and Long, Philip M. 'The Relaxed Online Maximum Margin Algorithm'. Machine Learning, 46, 361-387, 2002.

*Osuna, E. and Girosi, F. 'Reducing run-time complexity in SVMs'. In Proceedings of the 14th International Conf. on Pattern Recognition, Brisbane, Australia, 1998.

*Platt, J. 'Fast Training of Support Vector Machines using Sequential Minimal Optimization', 1998.

*Tresp, Volker and Schwaighofer, Anton. 'Scalable Kernel Systems'. Proceedings of ICANN 2001.