Support Vector Machine Learning for
Interdependent and Structured Output Spaces
Ioannis Tsochantaridis it@cs.brown.edu
Thomas Hofmann th@cs.brown.edu
Department of Computer Science, Brown University, Providence, RI 02912
Thorsten Joachims tj@cs.cornell.edu
Department of Computer Science, Cornell University, Ithaca, NY 14853
Yasemin Altun altun@cs.brown.edu
Department of Computer Science, Brown University, Providence, RI 02912
Abstract
Learning general functional dependencies is one of the main goals in machine learning. Recent progress in kernel-based methods has focused on designing flexible and powerful input representations. This paper addresses the complementary issue of problems involving complex outputs such as multiple dependent output variables and structured output spaces. We propose to generalize multiclass Support Vector Machine learning in a formulation that involves features extracted jointly from inputs and outputs. The resulting optimization problem is solved efficiently by a cutting plane algorithm that exploits the sparseness and structural decomposition of the problem. We demonstrate the versatility and effectiveness of our method on problems ranging from supervised grammar learning and named-entity recognition, to taxonomic text classification and sequence alignment.
1. Introduction
This paper deals with the general problem of learning a mapping from inputs $x \in \mathcal{X}$ to discrete outputs $y \in \mathcal{Y}$ based on a training sample of input-output pairs $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$ drawn from some fixed but unknown probability distribution. Unlike the case of multiclass classification where $\mathcal{Y} = \{1, \ldots, k\}$ with interchangeable, arbitrarily numbered labels, we consider structured output spaces $\mathcal{Y}$. Elements $y \in \mathcal{Y}$ may be, for instance, sequences, strings, labeled trees, lattices, or graphs. Such problems arise in a variety of applications, ranging from multilabel classification and classification with class taxonomies, to label sequence learning, sequence alignment learning, and supervised grammar learning, to name just a few.

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the first author.

We approach these problems by generalizing large margin methods, more specifically multiclass Support Vector Machines (SVMs) (Weston & Watkins, 1998; Crammer & Singer, 2001), to the broader problem of learning structured responses. The naive approach of treating each structure as a separate class is often intractable, since it leads to a multiclass problem with a very large number of classes. We overcome this problem by specifying discriminant functions that exploit the structure and dependencies within $\mathcal{Y}$. In that respect, our approach follows the work of Collins (2002; 2004) on perceptron learning with a similar class of discriminant functions. However, the maximum margin algorithm we propose has advantages in terms of accuracy and tunability to specific loss functions. A similar philosophy of using kernel methods for learning general dependencies was pursued in Kernel Dependency Estimation (KDE) (Weston et al., 2003). Yet, the use of separate kernels for inputs and outputs and the use of kernel PCA with standard regression techniques significantly differs from our formulation, which is a more straightforward and natural generalization of multiclass SVMs.
2. Discriminants and Loss Functions
We are interested in the general problem of learning functions $f: \mathcal{X} \to \mathcal{Y}$ based on a training sample of input-output pairs. As an illustrating example, consider the case of natural language parsing, where the function $f$ maps a given sentence $x$ to a parse tree
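problem is
$$
\mathrm{SVM}_0:\quad \min_{w}\ \frac{1}{2}\|w\|^2 \tag{6a}
$$
$$
\forall i,\ \forall y \in \mathcal{Y} \setminus y_i:\quad \langle w, \delta\Psi_i(y) \rangle \ge 1. \tag{6b}
$$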
To allow errors in the training set, we introduce slack variables and propose to optimize a soft-margin criterion. While there are several ways of doing this, we follow Crammer and Singer (2001) and introduce one slack variable for every nonlinear constraint (4), which will result in an upper bound on the empirical risk and offers some additional algorithmic advantages. Adding a penalty term that is linear in the slack variables to the objective results in the quadratic program
$$
\mathrm{SVM}_1:\quad \min_{w,\xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i, \quad \text{s.t. } \forall i,\ \xi_i \ge 0 \tag{7a}
$$
$$
\forall i,\ \forall y \in \mathcal{Y} \setminus y_i:\quad \langle w, \delta\Psi_i(y) \rangle \ge 1 - \xi_i. \tag{7b}
$$
Alternatively, we can also penalize margin violations by a quadratic term $\frac{C}{2n}\sum_i \xi_i^2$, leading to an analogous optimization problem which we refer to as $\mathrm{SVM}_2$. In both cases, $C > 0$ is a constant that controls the trade-off between training error minimization and margin maximization.

$\mathrm{SVM}_1$ implicitly considers the zero-one classification loss. As argued above, this is inappropriate for problems like natural language parsing, where $|\mathcal{Y}|$ is large.
We now propose two approaches that generalize the above formulations to the case of arbitrary loss functions $\Delta$. Our first approach is to rescale the slack variables according to the loss incurred in each of the linear constraints. Intuitively, violating a margin constraint involving a $y \ne y_i$ with high loss $\Delta(y_i, y)$ should be penalized more severely than a violation involving an output value with smaller loss. This can be accomplished by multiplying the violation by the loss, or equivalently, by scaling slack variables with the inverse loss, which yields the problem
$$
\mathrm{SVM}_1^{\Delta s}:\quad \min_{w,\xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i, \quad \text{s.t. } \forall i,\ \xi_i \ge 0 \tag{8}
$$
$$
\forall i,\ \forall y \in \mathcal{Y} \setminus y_i:\quad \langle w, \delta\Psi_i(y) \rangle \ge 1 - \frac{\xi_i}{\Delta(y_i, y)}. \tag{9}
$$
A justification for this formulation is given by the subsequent proposition (proof omitted).

Proposition 1. Denote by $(w^*, \xi^*)$ the optimal solution to $\mathrm{SVM}_1^{\Delta s}$. Then $\frac{1}{n}\sum_{i=1}^{n}\xi_i^*$ is an upper bound on the empirical risk $R_S^{\Delta}(w^*)$.
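
The intuition behind Proposition 1 can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: it takes $\delta\Psi_i(y) = \Psi(x_i, y_i) - \Psi(x_i, y)$, uses a hypothetical block-structured multiclass feature map `psi`, a hypothetical loss `abs_loss`, and an arbitrary weight vector. For a fixed $w$ it computes the smallest $\xi_i$ that satisfies constraints (8) and (9) and compares it with the loss of the highest-scoring output.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def psi(x, y, k):
    """Hypothetical multiclass feature map: copy x into the block of class y."""
    out = [0.0] * (len(x) * k)
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def abs_loss(y_true, y):
    """Hypothetical loss; any loss with Delta(y, y) = 0 and Delta > 0 otherwise works."""
    return float(abs(y_true - y))

def slack_rescaling_slack(w, x, y_true, k, loss):
    """Smallest xi_i >= 0 with <w, dPsi_i(y)> >= 1 - xi_i / Delta(y_i, y) for all y != y_i."""
    f_true = dot(w, psi(x, y_true, k))
    xi = 0.0
    for y in range(k):
        if y == y_true:
            continue
        margin = f_true - dot(w, psi(x, y, k))      # <w, dPsi_i(y)>
        xi = max(xi, loss(y_true, y) * (1.0 - margin))
    return xi

def predict(w, x, k):
    return max(range(k), key=lambda y: dot(w, psi(x, y, k)))

# Toy values (assumptions): 3 classes, 2 input features, an arbitrary w.
w = [0.5, 0.0, 0.0, 1.0, 1.0, 0.0]
x, y_true, k = [1.0, 2.0], 0, 3
xi = slack_rescaling_slack(w, x, y_true, k, abs_loss)
incurred = abs_loss(y_true, predict(w, x, k))
print(xi, incurred)   # the slack is at least as large as the incurred loss
```

Whenever the prediction differs from $y_i$, its margin is non-positive, so the corresponding term $\Delta(y_i, y)(1 - \langle w, \delta\Psi_i(y) \rangle)$ is at least $\Delta(y_i, y)$; this is why the slack dominates the loss in this example.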
The optimization problem $\mathrm{SVM}_2^{\Delta s}$ can be derived analogously, where $\Delta(y_i, y)$ is replaced by $\sqrt{\Delta(y_i, y)}$ in order to obtain an upper bound on the empirical risk.

A second way to include loss functions is to rescale the margin as proposed by Taskar et al. (2004) for the special case of the Hamming loss. The margin constraints in this setting take the following form:
$$
\forall i,\ \forall y \in \mathcal{Y} \setminus y_i:\quad \langle w, \delta\Psi_i(y) \rangle \ge \Delta(y_i, y) - \xi_i \tag{10}
$$
This set of constraints yields an optimization problem $\mathrm{SVM}_1^{\Delta m}$ which also results in an upper bound on $R_S^{\Delta}(w^*)$. In our opinion, a potential disadvantage of the margin scaling approach is that it may give significant weight to output values $y \in \mathcal{Y}$ that are not even close to being confusable with the target values $y_i$, because every increase in the loss increases the required margin.

4. Support Vector Machine Learning
The key challenge in solving the QPs for the generalized SVM learning is the large number of margin constraints; more specifically, the total number of constraints is $n|\mathcal{Y}|$. In many cases, $|\mathcal{Y}|$ may be extremely large, in particular if $\mathcal{Y}$ is a product space of some sort (e.g. in grammar learning, label sequence learning, etc.). This makes standard quadratic programming solvers unsuitable for this type of problem.

In the following, we propose an algorithm that exploits the special structure of the maximum-margin problem, so that only a much smaller subset of constraints needs to be explicitly examined. The algorithm is a generalization of the SVM algorithm for label sequence learning (Hofmann et al., 2002; Altun et al., 2003) and the algorithm for inverse sequence alignment (Joachims, 2003). We will show how to compute arbitrarily close approximations to all of the above SVM optimization problems in polynomial time for a large range of structures and loss functions. Since the algorithm operates on the dual program, we will first derive the Wolfe dual for the various soft margin formulations.
4.1. Dual Programs
We will denote by $\alpha_{iy}$ the Lagrange multiplier enforcing the margin constraint for label $y \ne y_i$ and example $(x_i, y_i)$. Using standard Lagrangian duality techniques, one arrives at the following dual QP for the hard margin case $\mathrm{SVM}_0$:
$$
\max_{\alpha}\ \sum_{i,\, y \ne y_i} \alpha_{iy} - \frac{1}{2} \sum_{\substack{i,\, y \ne y_i \\ j,\, \bar{y} \ne y_j}} \alpha_{iy}\, \alpha_{j\bar{y}}\, \langle \delta\Psi_i(y), \delta\Psi_j(\bar{y}) \rangle \tag{11a}
$$
$$
\text{s.t. } \forall i,\ \forall y \in \mathcal{Y} \setminus y_i:\quad \alpha_{iy} \ge 0. \tag{11b}
$$
A kernel $K((x, y), (x', y'))$ can be used to replace the inner products, since inner products in $\delta\Psi$ can be easily expressed as inner products of the original $\Psi$-vectors.
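
To make the last remark concrete, the following sketch expands $\langle \delta\Psi_i(y), \delta\Psi_j(\bar{y}) \rangle$ into four kernel evaluations and compares the result with an explicit feature computation. It assumes $\delta\Psi_i(y) = \Psi(x_i, y_i) - \Psi(x_i, y)$; the joint kernel `joint_kernel` and the feature map `psi` are hypothetical choices for the plain multiclass case, picked only so that both sides can be evaluated.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def psi(x, y, k):
    """Hypothetical explicit feature map: copy x into the block of class y."""
    out = [0.0] * (len(x) * k)
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def joint_kernel(x, y, x2, y2):
    """Joint kernel matching psi above: <x, x2> if the labels agree, else 0."""
    return dot(x, x2) if y == y2 else 0.0

def delta_psi_inner(K, xi, yi, y, xj, yj, ybar):
    """<dPsi_i(y), dPsi_j(ybar)> written with kernel evaluations only."""
    return (K(xi, yi, xj, yj) - K(xi, yi, xj, ybar)
            - K(xi, y, xj, yj) + K(xi, y, xj, ybar))

# Toy values (assumptions) for two training examples and two competing labels.
k = 3
xi, yi, y = [1.0, 2.0], 0, 1
xj, yj, ybar = [3.0, -1.0], 1, 2

via_kernel = delta_psi_inner(joint_kernel, xi, yi, y, xj, yj, ybar)
d_i = [a - b for a, b in zip(psi(xi, yi, k), psi(xi, y, k))]
d_j = [a - b for a, b in zip(psi(xj, yj, k), psi(xj, ybar, k))]
via_features = dot(d_i, d_j)
print(via_kernel, via_features)
```

The four-term expansion follows from bilinearity of the inner product, so the same helper works for any joint kernel, including ones without an explicit feature map.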
For soft-margin optimization with slack rescaling and linear penalties ($\mathrm{SVM}_1^{\Delta s}$), additional box constraints
$$
n \sum_{y \ne y_i} \frac{\alpha_{iy}}{\Delta(y_i, y)} \le C, \quad \forall i \tag{12}
$$
are added to the dual. Quadratic slack penalties ($\mathrm{SVM}_2$) lead to the same dual as $\mathrm{SVM}_0$ after altering the inner product to
$$
\langle \delta\Psi_i(y), \delta\Psi_j(\bar{y}) \rangle + \delta_{ij}\, \frac{n}{C \sqrt{\Delta(y_i, y)}\, \sqrt{\Delta(y_j, \bar{y})}},
$$
where $\delta_{ij} = 1$ if $i = j$, else $0$. Finally, in the case of margin rescaling, the loss function affects the linear part of the objective function, $\max_{\alpha} \sum_{i,y} \alpha_{iy}\, \Delta(y_i, y) - Q(\alpha)$ (where the quadratic part $Q$ is unchanged from (11a)), and introduces standard box constraints $n \sum_{y \ne y_i} \alpha_{iy} \le C$.
4.2. Algorithm
The algorithm we propose aims at finding a small set of active constraints that ensures a sufficiently accurate solution. More precisely, it creates a nested sequence of successively tighter relaxations of the original problem using a cutting plane method. The latter is implemented as a variable selection approach in the dual formulation. We will show that this is a valid strategy, since there always exists a polynomially sized subset of constraints so that the corresponding solution fulfills all constraints with a precision of at least $\epsilon$. This means that the remaining (potentially exponentially many) constraints are guaranteed to be violated by no more than $\epsilon$, without the need for explicitly adding them to the optimization problem.

We will base the optimization on the dual program formulation, which has two important advantages over the primal QP. First, it only depends on inner products in the joint feature space defined by $\Psi$, hence allowing the use of kernel functions. Second, the constraint matrix of the dual program (for the $L_1$-SVMs) supports a natural problem decomposition, since it is block diagonal, where each block corresponds to a specific training instance.
Pseudocode of the algorithm is depicted in Algorithm 1. The algorithm applies to all SVM formulations discussed above. The only difference is in the way the cost function gets set up in step 5. The algorithm maintains a working set $S_i$ for each training example $(x_i, y_i)$ to keep track of the selected constraints which define the current relaxation. Iterating through the training examples $(x_i, y_i)$, the algorithm proceeds by finding the (potentially) "most violated" constraint, involving some output value $\hat{y}$ (line 6). If the (appropriately scaled) margin violation of this constraint exceeds the current value of $\xi_i$ by more than $\epsilon$ (line 8), the dual variable corresponding to $\hat{y}$ is added to the working set (line 9). This variable selection process in the dual program corresponds to a successive strengthening of the primal problem by a cutting plane that cuts off the current primal solution from the feasible set. The chosen cutting plane corresponds to the constraint that determines the lowest feasible value for $\xi_i$.

Algorithm 1 Algorithm for solving $\mathrm{SVM}_0$ and the loss re-scaling formulations $\mathrm{SVM}_1^{\Delta s}$ and $\mathrm{SVM}_2^{\Delta s}$
 1: Input: $(x_1, y_1), \ldots, (x_n, y_n)$, $C$, $\epsilon$
 2: $S_i \leftarrow \emptyset$ for all $i = 1, \ldots, n$
 3: repeat
 4:   for $i = 1, \ldots, n$ do
 5:     set up cost function
          $\mathrm{SVM}_1^{\Delta s}$: $H(y) \equiv (1 - \langle \delta\Psi_i(y), w \rangle)\, \Delta(y_i, y)$
          $\mathrm{SVM}_2^{\Delta s}$: $H(y) \equiv (1 - \langle \delta\Psi_i(y), w \rangle)\, \sqrt{\Delta(y_i, y)}$
          $\mathrm{SVM}_1^{\Delta m}$: $H(y) \equiv \Delta(y_i, y) - \langle \delta\Psi_i(y), w \rangle$
          $\mathrm{SVM}_2^{\Delta m}$: $H(y) \equiv \sqrt{\Delta(y_i, y)} - \langle \delta\Psi_i(y), w \rangle$
        where $w \equiv \sum_j \sum_{y' \in S_j} \alpha_{jy'}\, \delta\Psi_j(y')$.
 6:     compute $\hat{y} = \arg\max_{y \in \mathcal{Y}} H(y)$
 7:     compute $\xi_i = \max\{0, \max_{y \in S_i} H(y)\}$
 8:     if $H(\hat{y}) > \xi_i + \epsilon$ then
 9:       $S_i \leftarrow S_i \cup \{\hat{y}\}$
10:       $\alpha_S \leftarrow$ optimize dual over $S$, $S = \cup_i S_i$.
11:     end if
12:   end for
13: until no $S_i$ has changed during iteration

Once a constraint has been added, the solution is recomputed with respect to $S$ (line 10). Alternatively, we have also devised a scheme where the optimization is restricted to $S_i$ only, and where optimization over the full $S$ is performed much less frequently. This can be beneficial due to the block diagonal structure of the optimization problems, which implies that variables $\alpha_{jy}$ with $j \ne i$, $y \in S_j$ can simply be "frozen" at their current values. Notice that all variables not included in their respective working set are implicitly treated as 0. The algorithm stops if no constraint is violated by more than $\epsilon$. The presented algorithm is implemented and available$^1$ as part of SVM$^{light}$. Note that the SVM optimization problems from iteration to iteration differ only by a single constraint. We therefore restart the SVM optimizer from the current solution, which greatly reduces the runtime.

$^1$ http://svmlight.joachims.org/

A convenient property of both algorithms is that they have a very general and well-defined interface independent of the choice of $\Psi$ and $\Delta$. To apply the algorithm, it is sufficient to implement the feature mapping $\Psi(x, y)$ (either explicit or via a joint kernel function), the loss function $\Delta(y_i, y)$, as well as the maximization in step 6. All of those, in particular the constraint/cut selection method, are treated as black boxes. While the modeling of $\Psi(x, y)$ and $\Delta(y_i, y)$ is more or less straightforward, solving the maximization problem for constraint selection typically requires exploiting the structure of $\Psi$.
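
The working-set loop can be sketched in a few lines for a toy multiclass problem. This is a minimal illustration under several assumptions, not the SVM$^{light}$ implementation: it uses the $\mathrm{SVM}_2^{\Delta s}$ cost function, takes $\delta\Psi_i(y) = \Psi(x_i, y_i) - \Psi(x_i, y)$, a hypothetical block feature map `psi`, the zero-one loss, brute-force enumeration of $\mathcal{Y}$ in step 6 (feasible only for tiny output spaces), and a simple projected coordinate ascent as the inner solver for step 10. The comments refer to the line numbers of Algorithm 1.

```python
import math

def psi(x, y, k):
    """Hypothetical multiclass feature map: copy x into the block of class y."""
    out = [0.0] * (len(x) * k)
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cutting_plane(X, Y, k, C=3.0, eps=1e-3, sweeps=200):
    """Working-set loop of Algorithm 1 with the SVM_2^{Delta s} cost function."""
    n, dim = len(X), len(X[0]) * k
    loss = lambda yi, y: 0.0 if yi == y else 1.0          # zero-one loss (assumed)
    dpsi = lambda i, y: [a - b for a, b in
                         zip(psi(X[i], Y[i], k), psi(X[i], y, k))]
    S = [[] for _ in range(n)]                            # line 2: empty working sets
    alpha = {}                                            # dual variables keyed by (i, y)
    w = [0.0] * dim                                       # w = sum of alpha * dPsi over S

    def H(i, y):                                          # line 5
        return (1.0 - dot(dpsi(i, y), w)) * math.sqrt(loss(Y[i], y))

    def optimize_dual():                                  # line 10 (inner solver is our choice)
        for _ in range(sweeps):
            for (j, y), a_old in list(alpha.items()):
                d, s = dpsi(j, y), math.sqrt(loss(Y[j], y))
                slack = (n / C) * sum(alpha[(j, v)] / math.sqrt(loss(Y[j], v))
                                      for v in S[j])
                grad = 1.0 - dot(w, d) - slack / s        # dual gradient for (j, y)
                a_new = max(0.0, a_old + grad / (dot(d, d) + n / (C * s * s)))
                alpha[(j, y)] = a_new
                for t in range(dim):                      # keep w in sync with alpha
                    w[t] += (a_new - a_old) * d[t]

    changed = True
    while changed:                                        # lines 3 and 13
        changed = False
        for i in range(n):                                # line 4
            y_hat = max(range(k), key=lambda y: H(i, y))  # line 6, brute force
            xi = max([0.0] + [H(i, y) for y in S[i]])     # line 7
            if H(i, y_hat) > xi + eps:                    # line 8
                S[i].append(y_hat)                        # line 9
                alpha[(i, y_hat)] = 0.0
                optimize_dual()
                changed = True
    return w, S, alpha

def predict(w, x, k):
    return max(range(k), key=lambda y: dot(w, psi(x, y, k)))

# Toy data (assumption): one orthogonal prototype per class.
X = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
Y = [0, 1, 2]
w, S, alpha = cutting_plane(X, Y, k=3)
print([predict(w, x, 3) for x in X], S)
```

Only `psi`, `loss`, and the maximization in line 6 depend on the problem at hand, which mirrors the black-box interface described above; for structured outputs the brute-force `max` would be replaced by a problem-specific search.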
4.3. Analysis
It is straightforward to show that the algorithm finds a solution that is close to optimal (e.g. for the $\mathrm{SVM}_1^{\Delta s}$, adding $\epsilon$ to each $\xi_i$ is a feasible point of the primal at most $\epsilon C$ from the maximum). However, it is not immediately obvious how fast the algorithm converges. We will show in the following that the algorithm converges in polynomial time for a large class of problems, despite a possibly exponential or infinite $|\mathcal{Y}|$.

Let us begin with an elementary Lemma that will be helpful for proving subsequent results. It quantifies how the dual objective changes, if one optimizes over a single variable.
Lemma 1. Let $J$ be a positive definite matrix and let us define a concave quadratic program
$$
W(\alpha) = -\frac{1}{2}\alpha' J \alpha + \langle h, \alpha \rangle \quad \text{s.t. } \alpha \ge 0
$$
and assume $\alpha \ge 0$ is given with $\alpha_r = 0$. Then maximizing $W$ with respect to $\alpha_r$ while keeping all other components fixed will increase the objective by
$$
\frac{\left(h_r - \sum_s \alpha_s J_{rs}\right)^2}{2 J_{rr}}
$$
provided that $h_r \ge \sum_s \alpha_s J_{rs}$.
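Proof. Denote by $\alpha[\alpha_r \leftarrow \beta]$ the solution $\alpha$ with the $r$-th coefficient changed to $\beta$, then
$$
W(\alpha[\alpha_r \leftarrow \beta]) - W(\alpha) = \beta \left( h_r - \sum_s \alpha_s J_{rs} \right) - \frac{\beta^2}{2} J_{rr}.
$$
The difference is maximized for
$$
\beta^* = \frac{h_r - \sum_s \alpha_s J_{rs}}{J_{rr}}.
$$
Notice that $\beta^* \ge 0$, since $h_r \ge \sum_s \alpha_s J_{rs}$ and $J_{rr} > 0$.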
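
Lemma 1 is easy to verify numerically. The sketch below evaluates $W$ before and after the single-coordinate update and compares the observed gain with the closed-form expression; the matrix `J`, the vector `h`, and the starting point `alpha` are arbitrary toy values chosen for illustration.

```python
def W(J, h, alpha):
    """Concave quadratic objective W(alpha) = -1/2 alpha' J alpha + <h, alpha>."""
    m = len(h)
    quad = sum(alpha[r] * J[r][s] * alpha[s] for r in range(m) for s in range(m))
    return -0.5 * quad + sum(hr * ar for hr, ar in zip(h, alpha))

def coordinate_gain(J, h, alpha, r):
    """Return (observed gain, gain predicted by Lemma 1) for optimizing alpha_r alone."""
    residual = h[r] - sum(a * J[r][s] for s, a in enumerate(alpha))
    assert alpha[r] == 0.0 and residual >= 0.0   # preconditions of the Lemma
    beta_star = residual / J[r][r]               # maximizer from the proof
    updated = list(alpha)
    updated[r] = beta_star
    observed = W(J, h, updated) - W(J, h, alpha)
    predicted = residual ** 2 / (2.0 * J[r][r])
    return observed, predicted

# Toy values (assumptions): a symmetric, diagonally dominant (hence positive definite) J.
J = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.25],
     [0.0, 0.25, 4.0]]
h = [1.0, 2.0, 1.0]
alpha = [0.5, 0.0, 0.25]

observed, predicted = coordinate_gain(J, h, alpha, r=1)
print(observed, predicted)
```

For these values the residual $h_r - \sum_s \alpha_s J_{rs}$ equals $1.6875$, so both numbers come out as $1.6875^2 / 2$.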
Using this Lemma, we can lower bound the improvement of the dual objective in step 10 of Algorithm 1. For brevity, let us focus on the case of $\mathrm{SVM}_2^{\Delta s}$. Similar results can be derived also for the other variants.
Proposition 2. Define $\Delta_i = \max_y \Delta(y_i, y)$ and $R_i = \max_y \|\delta\Psi_i(y)\|$. Then step 10 in Algorithm 1 improves the dual objective for $\mathrm{SVM}_2^{\Delta s}$ at least by $\frac{1}{2}\epsilon^2 \left(\Delta_i R_i^2 + n/C\right)^{-1}$.
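Proof. Using the notation in Algorithm 1 one can apply Lemma 1 with $r = (i, \hat{y})$ denoting the newly added constraint, $h_r = 1$, $J_{rr} = \|\delta\Psi_i(\hat{y})\|^2 + \frac{n}{C \Delta(y_i, \hat{y})}$ and $\sum_s \alpha_s J_{rs} = \langle w, \delta\Psi_i(\hat{y}) \rangle + \sum_{y \ne y_i} \alpha_{iy} \frac{n}{C \sqrt{\Delta(y_i, \hat{y})}\, \sqrt{\Delta(y_i, y)}}$. Note that $\alpha_r = 0$. Using the fact that $\sum_{y \ne y_i} \frac{n \alpha_{iy}}{C \sqrt{\Delta(y_i, y)}} = \xi_i$, Lemma 1 shows the following increase of the objective function when optimizing over $\alpha_r$ alone:
$$
\frac{\left( 1 - \langle w, \delta\Psi_i(\hat{y}) \rangle - \sum_{y \ne y_i} \alpha_{iy} \frac{n}{C \sqrt{\Delta(y_i, \hat{y})}\, \sqrt{\Delta(y_i, y)}} \right)^2}{2 \left( \|\delta\Psi_i(\hat{y})\|^2 + \frac{n}{C \Delta(y_i, \hat{y})} \right)} \ \ge\ \frac{\epsilon^2}{2 \left( \|\delta\Psi_i(\hat{y})\|^2\, \Delta(y_i, \hat{y}) + \frac{n}{C} \right)}
$$
The step follows from the fact that $\xi_i \ge 0$ and $\sqrt{\Delta(y_i, \hat{y})}\,(1 - \langle w, \delta\Psi_i(\hat{y}) \rangle) > \xi_i + \epsilon$, which is the condition of step 8: multiplying numerator and denominator of the left-hand side by $\Delta(y_i, \hat{y})$ turns the numerator into $\left( \sqrt{\Delta(y_i, \hat{y})}\,(1 - \langle w, \delta\Psi_i(\hat{y}) \rangle) - \xi_i \right)^2 > \epsilon^2$. Replacing the quantities in the denominator by their upper limit proves the claim, since jointly optimizing over more variables than just $\alpha_r$ can only further increase the dual objective.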
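
To get a feeling for the quantities in Proposition 2, the sketch below compares the single-variable gain given by Lemma 1 with the guaranteed minimum improvement. All numbers are hypothetical and chosen only so that the condition of step 8 holds; the function names are illustrative and not part of any library.

```python
import math

def min_dual_gain(eps, delta_i, r_i, n, C):
    """Lower bound from Proposition 2: eps^2 / (2 (Delta_i R_i^2 + n / C))."""
    return 0.5 * eps ** 2 / (delta_i * r_i ** 2 + n / C)

def single_variable_gain(margin, xi, feat_norm_sq, loss, n, C):
    """Gain from Lemma 1 with h_r = 1 for a newly added constraint (i, y_hat).

    margin       = <w, dPsi_i(y_hat)>
    xi           = current slack xi_i
    feat_norm_sq = ||dPsi_i(y_hat)||^2
    loss         = Delta(y_i, y_hat)
    """
    residual = 1.0 - margin - xi / math.sqrt(loss)
    j_rr = feat_norm_sq + n / (C * loss)
    return residual ** 2 / (2.0 * j_rr)

# Hypothetical state of the algorithm when a constraint is about to be added.
eps, n, C = 1e-3, 3, 3.0
margin, xi, feat_norm_sq, loss = 4.0 / 9.0, 1.0 / 9.0, 8.0, 1.0

# Condition of step 8: the scaled violation exceeds xi_i by more than eps.
assert math.sqrt(loss) * (1.0 - margin) > xi + eps

gain = single_variable_gain(margin, xi, feat_norm_sq, loss, n, C)
bound = min_dual_gain(eps, delta_i=1.0, r_i=math.sqrt(8.0), n=n, C=C)
print(gain, bound)
```

In this example the actual gain is several orders of magnitude above the bound, which only accounts for the worst case in which the violation barely exceeds $\epsilon$.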
This leads to the following polynomial bound