Incremental and Decremental Support Vector Machine Learning

Gert Cauwenberghs*
CLSP, ECE Dept.
Johns Hopkins University
Baltimore, MD 21218
gert@jhu.edu

Tomaso Poggio
CBCL, BCS Dept.
Massachusetts Institute of Technology
Cambridge, MA 02142
tp@ai.mit.edu

*On sabbatical leave at CBCL in MIT while this work was performed.
Abstract
An on-line recursive algorithm for training support vector machines, one vector at a time, is presented. Adiabatic increments retain the Kuhn-Tucker conditions on all previously seen training data, in a number of steps each computed analytically. The incremental procedure is reversible, and decremental unlearning offers an efficient method to exactly evaluate leave-one-out generalization performance. Interpretation of decremental unlearning in feature space sheds light on the relationship between generalization and geometry of the data.
1 Introduction

Training a support vector machine (SVM) requires solving a quadratic programming (QP) problem in a number of coefficients equal to the number of training examples. For very large datasets, standard numeric techniques for QP become infeasible. Practical techniques decompose the problem into manageable subproblems over part of the data [7,5] or, in the limit, perform iterative pairwise [8] or component-wise [3] optimization. A disadvantage of these techniques is that they may give an approximate solution, and may require many passes through the dataset to reach a reasonable level of convergence. An on-line alternative, which formulates the (exact) solution for $\ell+1$ training data in terms of that for $\ell$ data and one new data point, is presented here. The incremental procedure is reversible, and decremental unlearning of each training sample produces an exact leave-one-out estimate of generalization performance on the training set.
2 Incremental SVM Learning

Training an SVM incrementally on new data by discarding all previous data except their support vectors gives only approximate results [11]. In what follows we consider incremental learning as an exact on-line method to construct the solution recursively, one point at a time. The key is to retain the Kuhn-Tucker (KT) conditions on all previously seen data, while adiabatically adding a new data point to the solution.
2.1 Kuhn-Tucker conditions

In SVM classification, the optimal separating function reduces to a linear combination of kernels on the training data, $f(x)=\sum_j \alpha_j y_j K(x_j,x)+b$, with training vectors $x_i$ and corresponding labels $y_i=\pm1$.
Figure 1: Soft-margin classification SVM training. The training data fall into margin support vectors ($g_i=0$), error support vectors ($g_i<0$, $\alpha_i=C$), and ignored vectors ($g_i>0$, $\alpha_i=0$).
In the dual formulation of the training problem, the coefficients $\alpha_i$ are obtained by minimizing a convex quadratic objective function under constraints [12]:

$$\min_{0\le\alpha_i\le C}\;W=\tfrac{1}{2}\sum_{i,j}\alpha_i Q_{ij}\alpha_j-\sum_i\alpha_i+b\sum_i y_i\alpha_i \qquad (1)$$

with Lagrange multiplier (and offset) $b$, and with symmetric positive-definite kernel matrix $Q_{ij}=y_i y_j K(x_i,x_j)$. The first-order conditions on $W$ reduce to the Kuhn-Tucker (KT) conditions:

$$g_i=\frac{\partial W}{\partial\alpha_i}=\sum_j Q_{ij}\alpha_j+y_i b-1\;\begin{cases}\ge 0, & \alpha_i=0\\ =0, & 0<\alpha_i<C\\ \le 0, & \alpha_i=C\end{cases} \qquad (2)$$

$$\frac{\partial W}{\partial b}=\sum_j y_j\alpha_j=0 \qquad (3)$$
which partition the training data $D$ and corresponding coefficients $\{\alpha_i,b\}$, $i=1,\dots,\ell$, in three categories as illustrated in Figure 1 [9]: the set $S$ of margin support vectors strictly on the margin ($g_i=0$), the set $E$ of error support vectors exceeding the margin (not necessarily misclassified), and the remaining set $R$ of (ignored) vectors within the margin.
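The following is a minimal sketch, not the authors' MATLAB implementation, of how the KT margins (2) and the resulting partition into $S$, $E$ and $R$ can be evaluated numerically; all names (Q, alpha, b, y, C, tol) are illustrative assumptions.

```python
import numpy as np

def kt_margins(Q, alpha, b, y):
    """g_i = sum_j Q_ij alpha_j + y_i b - 1 for every training point, eq. (2)."""
    return Q @ alpha + y * b - 1.0

def partition(g, alpha, C, tol=1e-8):
    """Split indices into margin (S), error (E) and ignored (R) sets."""
    S = np.where(np.abs(g) <= tol)[0]                 # strictly on the margin
    E = np.where((g < -tol) & (alpha >= C - tol))[0]  # exceeding the margin
    R = np.where((g > tol) & (alpha <= tol))[0]       # inside the margin, ignored
    return S, E, R
```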
2.2 Adiabatic increments

The margin vector coefficients change value during each incremental step to keep all elements in $D$ in equilibrium, i.e., to keep their KT conditions satisfied. In particular, the KT conditions are expressed differentially as:

$$\Delta g_i=Q_{ic}\,\Delta\alpha_c+\sum_{j\in S}Q_{ij}\,\Delta\alpha_j+y_i\,\Delta b,\qquad\forall i\in D\cup\{c\} \qquad (4)$$

$$0=y_c\,\Delta\alpha_c+\sum_{j\in S}y_j\,\Delta\alpha_j \qquad (5)$$

where $\alpha_c$ is the coefficient being incremented, initially zero, of a candidate vector outside $D$. Since $g_j\equiv 0$ for the margin vector working set $S=\{s_1,\dots,s_{\ell_S}\}$, the changes in coefficients must satisfy

$$Q\cdot\begin{bmatrix}\Delta b\\ \Delta\alpha_{s_1}\\ \vdots\\ \Delta\alpha_{s_{\ell_S}}\end{bmatrix}=-\begin{bmatrix}y_c\\ Q_{s_1 c}\\ \vdots\\ Q_{s_{\ell_S} c}\end{bmatrix}\Delta\alpha_c \qquad (6)$$
with symmetric but not positive-definite Jacobian $Q$:

$$Q=\begin{bmatrix}0 & y_{s_1} & \cdots & y_{s_{\ell_S}}\\ y_{s_1} & Q_{s_1 s_1} & \cdots & Q_{s_1 s_{\ell_S}}\\ \vdots & \vdots & \ddots & \vdots\\ y_{s_{\ell_S}} & Q_{s_{\ell_S} s_1} & \cdots & Q_{s_{\ell_S} s_{\ell_S}}\end{bmatrix} \qquad (7)$$
Thus, in equilibrium

$$\Delta b=\beta\,\Delta\alpha_c \qquad (8)$$

$$\Delta\alpha_j=\beta_j\,\Delta\alpha_c,\qquad\forall j\in S \qquad (9)$$

with coefficient sensitivities given by

$$\begin{bmatrix}\beta\\ \beta_{s_1}\\ \vdots\\ \beta_{s_{\ell_S}}\end{bmatrix}=-\mathcal{R}\cdot\begin{bmatrix}y_c\\ Q_{s_1 c}\\ \vdots\\ Q_{s_{\ell_S} c}\end{bmatrix} \qquad (10)$$
where $\mathcal{R}=Q^{-1}$, and $\beta_j\equiv 0$ for all $j$ outside $S$. Substituted in (4), the margins change according to:

$$\Delta g_i=\gamma_i\,\Delta\alpha_c,\qquad\forall i\in D\cup\{c\} \qquad (11)$$

with margin sensitivities

$$\gamma_i=Q_{ic}+\sum_{j\in S}Q_{ij}\beta_j+y_i\beta,\qquad\forall i\notin S \qquad (12)$$

and $\gamma_i\equiv 0$ for all $i$ in $S$.
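As a concrete illustration (a sketch under assumed variable names, not the paper's code), the sensitivities (10) and (12) can be computed from the inverse Jacobian as follows; Rinv stands for $\mathcal{R}=Q^{-1}$, with row/column 0 corresponding to the offset $b$ and the remaining rows/columns to the margin set $S$ in order.

```python
import numpy as np

def sensitivities(Rinv, Q, y, S, c):
    """Coefficient sensitivities beta (10) and margin sensitivities gamma (12)
    for candidate index c. beta[0] is the offset sensitivity, beta[1:] follow S."""
    v = np.concatenate(([y[c]], Q[S, c]))               # right-hand side [y_c; Q_{s,c}]
    beta = -Rinv @ v                                     # eq. (10)
    gamma = Q[:, c] + Q[:, S] @ beta[1:] + y * beta[0]   # eq. (12), for all i
    gamma[S] = 0.0                                       # margin vectors stay in equilibrium
    return beta, gamma
```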
2.3 Bookkeeping: upper limit on increment

It has been tacitly assumed above that $\Delta\alpha_c$ is small enough so that no element of $D$ moves across $S$, $E$ and/or $R$ in the process. Since the $\alpha_j$ and $g_i$ change with $\Delta\alpha_c$ through (9) and (11), some bookkeeping is required to check each of the following conditions, and determine the largest possible increment $\Delta\alpha_c$ accordingly (a minimal sketch of this check follows the list):

1. $g_c\le 0$, with equality when $c$ joins $S$;
2. $\alpha_c\le C$, with equality when $c$ joins $E$;
3. $0\le\alpha_j\le C$, $\forall j\in S$, with equality $0$ when $j$ transfers from $S$ to $R$, and equality $C$ when $j$ transfers from $S$ to $E$;
4. $g_i\le 0$, $\forall i\in E$, with equality when $i$ transfers from $E$ to $S$;
5. $g_i\ge 0$, $\forall i\in R$, with equality when $i$ transfers from $R$ to $S$.
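A sketch of this bookkeeping step, under the same assumed variable names as above (not the authors' code): compute the largest admissible increment of $\alpha_c$ from conditions 1-5.

```python
import numpy as np

def max_increment(alpha, g, beta, gamma, S, E, R, c, C):
    """Largest increment of alpha_c before one of conditions 1-5 becomes active
    (assumes gamma[c] > 0 while c is still outside S)."""
    candidates = [C - alpha[c]]                        # 2. alpha_c reaches C
    if gamma[c] > 0:
        candidates.append(-g[c] / gamma[c])            # 1. g_c reaches 0
    for k, j in enumerate(S):                          # 3. a margin alpha_j hits C or 0
        if beta[k + 1] > 0:
            candidates.append((C - alpha[j]) / beta[k + 1])
        elif beta[k + 1] < 0:
            candidates.append(-alpha[j] / beta[k + 1])
    for i in E:                                        # 4. an error vector reaches g_i = 0
        if gamma[i] > 0:
            candidates.append(-g[i] / gamma[i])
    for i in R:                                        # 5. an ignored vector reaches g_i = 0
        if gamma[i] < 0:
            candidates.append(-g[i] / gamma[i])
    return min(candidates)
```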
2.4 Recursive magic: $\mathcal{R}$ updates

To add candidate $c$ to the working margin vector set $S$, $\mathcal{R}$ is expanded as:

$$\mathcal{R}\leftarrow\begin{bmatrix}\mathcal{R} & \mathbf{0}\\ \mathbf{0}^{T} & 0\end{bmatrix}+\frac{1}{\gamma_c}\begin{bmatrix}\beta\\ \beta_{s_1}\\ \vdots\\ \beta_{s_{\ell_S}}\\ 1\end{bmatrix}\begin{bmatrix}\beta & \beta_{s_1} & \cdots & \beta_{s_{\ell_S}} & 1\end{bmatrix} \qquad (13)$$

The same formula applies to add any vector (not necessarily the candidate) $k$ to $S$, with parameters $\beta$, $\beta_j$ and $\gamma_k$ calculated as in (10) and (12).

The expansion of $\mathcal{R}$, as incremental learning itself, is reversible. To remove a margin vector $k$ from $S$, $\mathcal{R}$ is contracted as:

$$\mathcal{R}_{ij}\leftarrow\mathcal{R}_{ij}-\mathcal{R}_{kk}^{-1}\mathcal{R}_{ik}\mathcal{R}_{kj},\qquad\forall\,i,j\neq k \qquad (14)$$

where index $0$ refers to the $b$-term.

The $\mathcal{R}$ update rules (13) and (14) are similar to on-line recursive estimation of the covariance of (sparsified) Gaussian processes [2].
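A sketch of both updates with numpy (illustrative names; row/column 0 of Rinv holds the offset term, and $\gamma_c\neq 0$ is assumed on expansion):

```python
import numpy as np

def expand_R(Rinv, beta, gamma_c):
    """Add one margin vector: grow Rinv by one row/column as in (13)."""
    n = Rinv.shape[0]
    grown = np.zeros((n + 1, n + 1))
    grown[:n, :n] = Rinv
    u = np.append(beta, 1.0)          # [beta; beta_s1; ...; beta_sl; 1]
    return grown + np.outer(u, u) / gamma_c

def contract_R(Rinv, k):
    """Remove the margin vector at row/column k (k >= 1) as in (14)."""
    idx = [i for i in range(Rinv.shape[0]) if i != k]
    return Rinv[np.ix_(idx, idx)] - np.outer(Rinv[idx, k], Rinv[k, idx]) / Rinv[k, k]
```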
Figure 2: Incremental learning. A new vector, initially classified with negative margin $g_c<0$ at $\alpha_c=0$, becomes a new margin or error vector.
2.5 Incremental procedure

Let $D^{\ell+1}=D^{\ell}\cup\{c\}$, obtained by adding point $c$ (candidate margin or error vector) to $D^{\ell}$. Then the new solution $\{\alpha_i^{\ell+1},b^{\ell+1}\}$, $i=1,\dots,\ell+1$, is expressed in terms of the present solution $\{\alpha_i^{\ell},b^{\ell}\}$, the present Jacobian inverse $\mathcal{R}$, and the candidate $x_c$, $y_c$, as:

Algorithm 1 (Incremental Learning, $\ell\rightarrow\ell+1$)
1. Initialize $\alpha_c$ to zero;
2. If $g_c>0$, terminate ($c$ is not a margin or error vector);
3. If $g_c\le 0$, apply the largest possible increment $\alpha_c$ so that (the first) one of the following conditions occurs:
   (a) $g_c=0$: Add $c$ to margin set $S$, update $\mathcal{R}$ accordingly, and terminate;
   (b) $\alpha_c=C$: Add $c$ to error set $E$, and terminate;
   (c) Elements of $D^{\ell}$ migrate across $S$, $E$ and $R$ (bookkeeping, section 2.3): Update membership of elements and, if $S$ changes, update $\mathcal{R}$ accordingly;
   and repeat as necessary.

The incremental procedure is illustrated in Figure 2. Old vectors, from previously seen training data, may change status along the way, but the process of adding the training data $c$ to the solution converges in a finite number of steps.
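The following sketch shows one adiabatic step of this procedure in terms of the hypothetical helpers introduced above; set migrations and the corresponding $\mathcal{R}$ updates (expand_R / contract_R) are left to the caller, so this illustrates the loop body rather than a complete implementation.

```python
def adiabatic_step(c, alpha, b, g, Q, y, Rinv, S, E, R, C, tol=1e-8):
    """One increment of alpha_c (Algorithm 1, step 3), returning what stopped it."""
    beta, gamma = sensitivities(Rinv, Q, y, S, c)
    d = max_increment(alpha, g, beta, gamma, S, E, R, c, C)
    alpha[c] += d               # candidate coefficient
    alpha[S] += beta[1:] * d    # margin coefficients, eq. (9)
    b += beta[0] * d            # offset, eq. (8)
    g += gamma * d              # margins, eq. (11)
    if g[c] >= -tol:
        status = "add c to S"   # case (a): update Rinv with expand_R and stop
    elif alpha[c] >= C - tol:
        status = "add c to E"   # case (b): stop
    else:
        status = "migration"    # case (c): update S, E, R and Rinv, then repeat
    return alpha, b, g, status
```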
2.6 Practical considerations

The trajectory of an example incremental training session is shown in Figure 3. The algorithm yields results identical to those at convergence using other QP approaches [7], with comparable speeds on various datasets ranging up to several thousands of training points.¹

A practical on-line variant for larger datasets is obtained by keeping track only of a limited set of "reserve" vectors, the points outside $S$ and $E$ with the smallest margins $g_i>0$, and discarding the remaining data. For a small reserve set this implies a small overhead in memory over $S$ and $E$; the larger the reserve set, the smaller the probability of missing a future margin or error vector in previous data. The resulting storage requirements are dominated by that for the inverse Jacobian $\mathcal{R}$, which scales as $(\ell_S+1)^2$, where $\ell_S$ is the number of margin support vectors, $\ell_S\ll\ell$.
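One possible criterion for maintaining such a reserve set is sketched below (illustrative, not the paper's): keep only the n_reserve non-support points with the smallest margins.

```python
import numpy as np

def prune_reserve(candidates, g, n_reserve):
    """Keep the candidate indices (points outside S and E) with smallest g_i."""
    order = np.argsort(g[candidates])
    return candidates[order[:n_reserve]]
```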
3 Decremental Unlearning

Leave-one-out (LOO) is a standard procedure in predicting the generalization power of a trained classifier, both from a theoretical and empirical perspective [12]. It is naturally implemented by decremental unlearning, adiabatic reversal of incremental learning, on each of the training data from the full trained solution. Similar (but different) bookkeeping of elements migrating across $S$, $E$ and $R$ applies as in the incremental case.

¹Matlab code and data are available at http://bach.ece.jhu.edu/pub/gert/svm/incremental.
Figure 3: Trajectory of coefficients $\alpha_i$ as a function of iteration step during training, for 100 non-separable points in two dimensions, using a Gaussian kernel. The data sequence is shown on the left.
Figure 4: Leave-one-out (LOO) decremental unlearning ($\alpha_c\rightarrow 0$) for estimating generalization performance, directly on the training data. A leave-one-out margin $g_c^{\setminus c}<-1$ (i.e., $y_c f^{\setminus c}(x_c)<0$) reveals a LOO classification error.
3.1 Leave-one-out procedure

Let $D^{\ell\setminus c}=D^{\ell}\setminus\{c\}$, obtained by removing point $c$ (margin or error vector) from $D^{\ell}$. The solution $\{\alpha_i^{\setminus c},b^{\setminus c}\}$ is expressed in terms of $\{\alpha_i,b\}$, $\mathcal{R}$ and the removed point $x_c$, $y_c$. The solution yields $g_c^{\setminus c}$, which determines whether leaving $c$ out of the training set generates a classification error ($g_c^{\setminus c}<-1$). Starting from the full $\ell$-point solution:

Algorithm 2 (Decremental Unlearning, $\ell\rightarrow\ell-1$, and LOO Classification)
1. If $c$ is not a margin or error vector: Terminate, "correct" ($c$ is already left out, and correctly classified);
2. If $c$ is a margin or error vector with $g_c<-1$: Terminate, "incorrect" (by default as a training error);
3. If $c$ is a margin or error vector with $g_c\ge -1$, apply the largest possible decrement $\alpha_c$ so that (the first) one of the following conditions occurs:
   (a) $g_c$ drops to $-1$: Terminate, "incorrect";
   (b) $\alpha_c=0$: Terminate, "correct";
   (c) Elements of $D^{\ell}$ migrate across $S$, $E$ and $R$: Update membership of elements and, if $S$ changes, update $\mathcal{R}$ accordingly;
   and repeat as necessary.

The leave-one-out procedure is illustrated in Figure 4.
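A sketch of how the exact LOO error count would be assembled from such a routine (here unlearn_point is a hypothetical function implementing Algorithm 2 for one point and returning its leave-one-out margin $g_c^{\setminus c}$):

```python
def loo_error_rate(n_points, unlearn_point):
    """Exact leave-one-out error rate via decremental unlearning."""
    errors = sum(1 for c in range(n_points) if unlearn_point(c) < -1.0)
    return errors / n_points   # g_c^{\c} < -1 means y_c f^{\c}(x_c) < 0: an error
```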
Figure 5: Trajectory of LOO margin $g_c^{\setminus c}$ as a function of leave-one-out coefficient $\alpha_c$. The data and parameters are as in Figure 3.
3.2 Leave-one-out considerations

If an exact LOO estimate is requested, two passes through the data are required. The LOO pass has similar run-time complexity and memory requirements as the incremental learning procedure. This is significantly better than the conventional approach to empirical LOO evaluation, which requires $\ell$ (partial, but possibly still extensive) training sessions.

There is a clear correspondence between generalization performance and the LOO margin sensitivity $\gamma_c$. As shown in Figure 4, the value of the LOO margin $g_c^{\setminus c}$ is obtained from the sequence of $g_c$ vs. $\alpha_c$ segments for each of the decrement steps, and is thus determined by their slopes $\gamma_c$. Incidentally, the LOO approximation using linear response theory in [6] corresponds to the first segment of the LOO procedure, effectively extrapolating the value of $g_c^{\setminus c}$ from the initial value of $\gamma_c$. This simple LOO approximation gives satisfactory results in most (though not all) cases, as illustrated in the example LOO session of Figure 5.
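A sketch of that single-segment (linear-response) approximation, under the assumption that no set migrations occur while $\alpha_c$ is driven to zero:

```python
def loo_margin_linear_response(g_c, alpha_c, gamma_c):
    """First-segment estimate of the leave-one-out margin g_c^{\\c}."""
    return g_c - alpha_c * gamma_c   # extrapolate along the initial slope gamma_c
```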
Recent work in statistical learning theory has sought improved generalization performance by considering non-uniformity of distributions in feature space [13] or non-uniformity in the kernel matrix eigenspectrum [10]. A geometrical interpretation of decremental unlearning, presented next, sheds further light on the dependence of generalization performance, through $\gamma_c$, on the geometry of the data.
4 Geometric Interpretation in Feature Space

The differential Kuhn-Tucker conditions (4) and (5) translate directly in terms of the sensitivities $\beta$ and $\gamma_c$ as

$$\gamma_c=Q_{cc}+\sum_{j\in S}Q_{cj}\beta_j+y_c\beta \qquad (15)$$

$$0=y_c+\sum_{j\in S}y_j\beta_j \qquad (16)$$

Through the nonlinear map $\Phi(x)$ into feature space, the kernel matrix elements reduce to linear inner products:

$$Q_{ij}=y_i y_j\,\Phi(x_i)\cdot\Phi(x_j) \qquad (17)$$

and the KT sensitivity conditions (15) and (16) in feature space become

$$\gamma_c=\Bigl(y_c\Phi(x_c)+\sum_{j\in S}\beta_j y_j\Phi(x_j)\Bigr)\cdot y_c\Phi(x_c)+y_c\beta \qquad (18)$$

$$0=y_c+\sum_{j\in S}y_j\beta_j \qquad (19)$$
Since $\Phi(x_j)\cdot\bigl(y_c\Phi(x_c)+\sum_{k\in S}\beta_k y_k\Phi(x_k)\bigr)+\beta=0$ for every margin vector $j\in S$, (18) and (19) are equivalent to minimizing a functional:

$$W_c=\tfrac12\Bigl\|y_c\Phi(x_c)+\sum_{j\in S}a_j y_j\Phi(x_j)\Bigr\|^2 \qquad (20)$$

over the coefficients $a_j$, subject to the equality constraint (19) with Lagrange parameter $\beta$; at the minimum, $a_j=\beta_j$. Furthermore, the optimal value of $W_c$ immediately yields the sensitivity $\gamma_c$, from (18):

$$\gamma_c=2\,W_c\big|_{\min}=\Bigl\|y_c\Phi(x_c)+\sum_{j\in S}\beta_j y_j\Phi(x_j)\Bigr\|^2 \qquad (21)$$
In other words, the distance in feature space between sample $x_c$ and its projection on the margin vectors $S$ along (16) determines, through (21), the extent to which leaving out $x_c$ affects the classification of $x_c$. Note that only margin support vectors are relevant in (21), and not the error vectors, which otherwise contribute to the decision boundary.
5 Concluding Remarks

Incremental learning and, in particular, decremental unlearning offer a simple and computationally efficient scheme for on-line SVM training and exact leave-one-out evaluation of the generalization performance on the training data. The procedures can be directly extended to a broader class of kernel learning machines with a convex quadratic cost functional under linear constraints, including SV regression. The algorithm is intrinsically on-line and extends to query-based learning methods [1]. Geometric interpretation of decremental unlearning in feature space elucidates a connection, similar to [13], between generalization performance and the distance of the data from the subspace spanned by the margin vectors.
References

[1] C. Campbell, N. Cristianini and A. Smola, "Query Learning with Large Margin Classifiers," in Proc. 17th Int. Conf. Machine Learning (ICML2000), Morgan Kaufmann, 2000.
[2] L. Csato and M. Opper, "Sparse Representation for Gaussian Process Models," in Adv. Neural Information Processing Systems (NIPS'2000), vol. 13, 2001.
[3] T.-T. Frieß, N. Cristianini and C. Campbell, "The Kernel Adatron Algorithm: A Fast and Simple Learning Procedure for Support Vector Machines," in 15th Int. Conf. Machine Learning, Morgan Kaufmann, 1998.
[4] T.S. Jaakkola and D. Haussler, "Probabilistic Kernel Methods," Proc. 7th Int. Workshop on Artificial Intelligence and Statistics, 1998.
[5] T. Joachims, "Making Large-Scale Support Vector Machine Learning Practical," in Schölkopf, Burges and Smola, Eds., Advances in Kernel Methods - Support Vector Learning, Cambridge MA: MIT Press, 1998, pp. 169-184.
[6] M. Opper and O. Winther, "Gaussian Processes and SVM: Mean Field Results and Leave-One-Out," in Advances in Large Margin Classifiers, A. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans, Eds., Cambridge MA: MIT Press, 2000, pp. 43-56.
[7] E. Osuna, R. Freund and F. Girosi, "An Improved Training Algorithm for Support Vector Machines," Proc. 1997 IEEE Workshop on Neural Networks for Signal Processing, pp. 276-285, 1997.
[8] J.C. Platt, "Fast Training of Support Vector Machines Using Sequential Minimal Optimization," in Schölkopf, Burges and Smola, Eds., Advances in Kernel Methods - Support Vector Learning, Cambridge MA: MIT Press, 1998, pp. 185-208.
[9] M. Pontil and A. Verri, "Properties of Support Vector Machines," Neural Computation, vol. 10, pp. 955-974, 1997.
[10] B. Schölkopf, J. Shawe-Taylor, A.J. Smola and R.C. Williamson, "Generalization Bounds via Eigenvalues of the Gram Matrix," NeuroCOLT Technical Report 99-035, 1999.
[11] N.A. Syed, H. Liu and K.K. Sung, "Incremental Learning with Support Vector Machines," in Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-99), 1999.
[12] V. Vapnik, The Nature of Statistical Learning Theory, New York: Springer-Verlag, 1995.
[13] V. Vapnik and O. Chapelle, "Bounds on Error Expectation for SVM," in Smola, Bartlett, Schölkopf and Schuurmans, Eds., Advances in Large Margin Classifiers, Cambridge MA: MIT Press, 2000.