Transductive Support Vector Machines for Structured Variables
Alexander Zien alexander.zien@tuebingen.mpg.de
Max Planck Institute for Biological Cybernetics
Ulf Brefeld brefeld@mpiinf.mpg.de
Tobias Scheffer scheffer@mpiinf.mpg.de
Max Planck Institute for Computer Science
Abstract
We study the problem of learning kernel machines transductively for structured output variables. Transductive learning can be reduced to combinatorial optimization problems over all possible labelings of the unlabeled data. In order to scale transductive learning to structured variables, we transform the corresponding non-convex, combinatorial, constrained optimization problems into continuous, unconstrained optimization problems. The discrete optimization parameters are eliminated and the resulting differentiable problems can be optimized efficiently. We study the effectiveness of the generalized TSVM on multiclass classification and label sequence learning problems empirically.
1. Introduction

Learning mappings between arbitrary structured and interdependent input and output spaces is a fundamental problem in machine learning; it covers learning tasks such as producing sequential or tree-structured output, and it challenges the standard model of learning a mapping from independently drawn instances to a small set of labels. Applications include named entity recognition and information extraction (sequential output), natural language parsing (tree-structured output), classification with a class taxonomy, and collective classification (graph-structured output).
When the input x and the desired output y are structures, it is not generally feasible to model each possible value of y as an individual class. It is then helpful to represent input and output pairs in a joint feature representation, and to rephrase the learning task as finding f: X × Y → ℝ such that

  ŷ = argmax_{y∈Y} f(x, y)

is the desired output for any input x. Thus, f can be a linear discriminator in a joint space Φ(x, y) of input and output variables and may depend on arbitrary joint features. Max-margin Markov models (Taskar et al., 2003), kernel conditional random fields (Lafferty et al., 2004), and support vector machines for structured output spaces (Tsochantaridis et al., 2004) use kernels to compute the inner product in input output space. An application-specific learning method is constructed by defining appropriate features, and choosing a decoding procedure that efficiently calculates the argmax, exploiting the dependency structure of the features. The decoder can be a Viterbi algorithm when joint features are constrained to depend only on adjacent outputs, or a chart parser for tree-structured outputs.

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).
Several semi-supervised techniques in joint input output spaces have been studied. One of the most promising approaches is the integration of unlabeled instances by Laplacian priors into structured large margin classifiers (Lafferty et al., 2004; Altun et al., 2005). Brefeld and Scheffer (2006) include unlabeled examples into structural support vector learning by modeling distinct views of the data and applying the consensus maximization principle between peer hypotheses. Lee et al. (2007) study semi-supervised CRFs and include unlabeled data via an entropy criterion such that their objective acts as a probabilistic analogue to the transductive setting we discuss here. Xu et al. (2006) derive unsupervised M³-Networks by employing SDP relaxation techniques. Their optimization problem is similar to the transductive criterion derived in this paper.
Traditional binary TSVM implementations (Joachims, 1999) solve a combinatorial optimization problem over pseudo-labels ŷ_j for the unlabeled x_j. These additional combinatorial optimization parameters can be removed altogether when the constraints ŷ_j ⟨w, x_j⟩ ≥ 1 − ξ_j are expressed using absolute values: ξ_j = max{1 − |⟨w, x_j⟩|, 0} (Chapelle, 2007). The resulting problem remains non-convex, but is now continuous and has fewer parameters. It can therefore be optimized faster, and the retrieved local minima are substantially better on average (Chapelle & Zien, 2005).
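The absolute-value substitution for the binary case is easy to sketch numerically. The following is a minimal illustration of the surrogate slack, not the authors' implementation; the function name is hypothetical.

```python
import numpy as np

def unlabeled_slack(w, X_unlabeled):
    """Continuous surrogate for the unlabeled slacks of a binary TSVM:
    xi_j = max(1 - |<w, x_j>|, 0). Taking the absolute value of the score
    removes the pseudo-labels y_j from the optimization (Chapelle, 2007)."""
    scores = X_unlabeled @ w                 # <w, x_j> for each unlabeled point
    return np.maximum(1.0 - np.abs(scores), 0.0)

w = np.array([1.0, -0.5])
X = np.array([[2.0, 0.0],    # |<w, x>| = 2.0 -> outside the margin, slack 0
              [0.5, 0.0]])   # |<w, x>| = 0.5 -> inside the margin, slack 0.5
print(unlabeled_slack(w, X))  # [0.  0.5]
```

Either sign of the score satisfies the margin, which is exactly the disjunction over the two possible labels.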
The structure of this paper and its main contributions are as follows. Section 2 recalls the basics of learning in structured output spaces. We then leverage the technique of continuous optimization of the primal to structured input output spaces, addressing both the supervised (Section 3) and the transductive case (Section 4). Our treatment covers general loss functions and linear discrimination as well as general kernels. We study the benefit of the generalized transductive SVM empirically in Section 5. Section 6 provides a discussion of the experimental results and Section 7 concludes.

2. Learning in Input Output Spaces
When dealing with discriminative structured prediction models, input variables x_i ∈ X and output variables y_i ∈ Y are represented jointly by a feature map Φ(x_i, y_i) that allows to capture multiple-way dependencies. We apply a generalized linear model f(x, y) = ⟨w, Φ(x, y)⟩ to decode the top-scoring output ŷ = argmax_{y∈Y} f(x, y) for input x.

We measure the quality of f by an appropriate, symmetric, non-negative loss function Δ: Y × Y → ℝ₊ that details the distance between the true y and the prediction; for instance, Δ may be the common 0/1 loss, given by Δ(y, ŷ) = 1_[[y≠ŷ]]. Thus, the expected risk of f is given as

  R(f) = ∫_{X×Y} Δ(y, argmax_{ȳ} f(x, ȳ)) dP_{XY}(x, y),

where P_{XY} is the (unknown) distribution of inputs and outputs. We address this problem by searching for a minimizer of the empirical risk on a fixed sample of pairs (x_i, y_i), 1 ≤ i ≤ n, drawn iid from P_{XY}, regularized with the inverse margin ‖w‖². The feature map Φ(x, y) and the decoder have to be adapted to the application at hand. We briefly skim the feature spaces and decoders used in our experiments.
Multiclass classification is a special case of a joint input output space with the output space equaling the finite output alphabet; i.e., Y = Σ. Let ψ(x) be the feature vector (e.g., a tf.idf vector) of x. Then, the class-based feature representation Φ_σ(x, y) is given by Φ_σ(x, y) = 1_[[y=σ]] ψ(x), with σ ∈ Σ. The joint feature representation is given by "stacking up" the class-based representations of all classes σ ∈ Σ; thus, Φ(x, y) = (…, Φ_σ(x, y), …). With this definition, the inner product in input output space reduces to ⟨Φ(x_i, y_i), Φ(x_j, y_j)⟩ = 1_[[y_i=y_j]] k(x_i, x_j), for arbitrary k(x_i, x_j) = ⟨ψ(x_i), ψ(x_j)⟩. Since the number of classes is limited we do not need a special decoding strategy: the argmax can efficiently be determined by enumerating all y and returning the highest-scoring class.
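The stacked representation and the resulting kernel identity can be sketched in a few lines. This is an illustrative toy, with hypothetical helper names; ψ(x) is taken to be a plain dense vector.

```python
import numpy as np

def joint_feature(psi_x, y, n_classes):
    """Stacked class-based representation Phi(x, y): psi(x) is placed in the
    block corresponding to class y; all other blocks stay zero."""
    phi = np.zeros(n_classes * len(psi_x))
    phi[y * len(psi_x):(y + 1) * len(psi_x)] = psi_x
    return phi

def decode(w, psi_x, n_classes):
    """Multiclass argmax decoding by enumerating all classes."""
    scores = [w @ joint_feature(psi_x, y, n_classes) for y in range(n_classes)]
    return int(np.argmax(scores))

# The inner product reduces to 1[y_i = y_j] * k(x_i, x_j):
psi_a, psi_b = np.array([1.0, 2.0]), np.array([3.0, 1.0])
same = joint_feature(psi_a, 1, 3) @ joint_feature(psi_b, 1, 3)
diff = joint_feature(psi_a, 1, 3) @ joint_feature(psi_b, 2, 3)
print(same, diff)  # 5.0 0.0  (k(x_a, x_b) = 1*3 + 2*1 = 5)
```

Matching classes reproduce the input-space kernel value; mismatching classes yield zero, as in the identity above.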
In label sequence learning, the task is to find a mapping from a sequential input x_i = ⟨x_{i,1}, …, x_{i,|x_i|}⟩ to a sequential output y_i = ⟨y_{i,1}, …, y_{i,|x_i|}⟩ of the same length |y_i| = |x_i|. Each element of x is annotated with an element y_{i,t} ∈ Σ. We follow Altun et al. (2003) and extract label-label interactions φ_{σ,τ}(y|t) = 1_[[y_{t−1}=σ ∧ y_t=τ]] and label-observation features φ_{σ,l}(x, y|t) = 1_[[y_t=σ]] ψ_l(x_t), with labels σ, τ ∈ Σ. Here, ψ_l(x) extracts characteristics of x; e.g., ψ_{123}(x) = 1 if x starts with a capital letter and 0 otherwise. We refer to the vector ψ(x) = (…, ψ_l(x), …)ᵀ and denote the inner product by means of k(x, x̄) = ⟨ψ(x), ψ(x̄)⟩. The joint feature representation Φ(x, y) of a sequence is the sum of all feature vectors φ(x, y|t) = (…, φ_{σ,τ}(y|t), …, φ_{σ,l}(x, y|t), …)ᵀ extracted at position t,

  Φ(x, y) = Σ_{t=1}^{T} φ(x, y|t).
The inner product in input output space decomposes into a label-label and a label-observation part,

  ⟨Φ(x_i, y_i), Φ(x_j, y_j)⟩ = Σ_{s,t} 1_[[y_{i,s−1}=y_{j,t−1} ∧ y_{i,s}=y_{j,t}]] + Σ_{s,t} 1_[[y_{i,s}=y_{j,t}]] k(x_{i,s}, x_{j,t}).

Note that the described feature mapping exhibits a first-order Markov property and, as a result, decoding can be performed by a Viterbi algorithm.
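The first-order decoder can be sketched compactly. This is a generic Viterbi implementation over emission (label-observation) and transition (label-label) scores, a standard construction rather than the authors' code; the score matrices here are illustrative.

```python
import numpy as np

def viterbi(emission, transition):
    """First-order Viterbi decoding.
    emission[t, s]: score of label s at position t (label-observation part);
    transition[s, s2]: score of the label pair (s, s2) (label-label part).
    Returns the highest-scoring label sequence."""
    T, S = emission.shape
    delta = emission[0].copy()          # best score of a path ending in s at t
    back = np.zeros((T, S), dtype=int)  # backpointers
    for t in range(1, T):
        cand = delta[:, None] + transition + emission[t][None, :]
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    path = [int(delta.argmax())]        # backtrack from the best final label
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

emission = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.2]])
transition = np.array([[0.5, 0.0], [0.0, 0.5]])  # favors staying in a state
print(viterbi(emission, transition))  # [0, 1, 1]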
Once an appropriate feature mapping Φ and the corresponding decoding are found for a problem at hand, both can be plugged into the learning algorithms presented in the following sections. In the remainder we will use the shorthand

  δΦ_{i,y} := Φ(x_i, y_i) − Φ(x_i, y)

for difference vectors in joint input output space.
3. Unconstrained Optimization for Structured Output Spaces

Optimization Problem 1 is the known SVM learning problem in input output spaces with cost-based margin rescaling, which includes the fixed size margin with 0/1 loss as special case. All presented results can also be derived for slack rescaling approaches; however, the corresponding constraint generation becomes more difficult. The norm of w plus the sum of the slack terms ξ_i is to be minimized, subject to the constraint that, for all examples (x_i, y_i), the correct label y_i receives the highest score by a margin.
OP 1 (SVM) Given n labeled training pairs and C > 0,

  min_{w,ξ} ‖w‖² + C Σ_{i=1}^{n} ξ_i
  s.t. ∀_{i=1}^{n} ∀_{y≠y_i}: ⟨w, δΦ_{i,y}⟩ ≥ Δ(y_i, y) − ξ_i,
       ∀_{i=1}^{n}: ξ_i ≥ 0.
In general, unconstrained optimization is easier to implement than constrained optimization. For the SVM, it is possible to resolve the slack terms:

  ξ_i = max{ max_{y≠y_i} [ Δ(y_i, y) − ⟨w, δΦ_{i,y}⟩ ], 0 }
      = max_{y≠y_i} ℓ_{Δ(y_i,y)}(⟨w, δΦ_{i,y}⟩),        (1)

where ℓ_Δ(t) = max{Δ − t, 0} is the hinge loss with margin rescaling. We can now pose Optimization Problem 2 for structured outputs, a simple quadratic optimization function without constraints.
OP 2 (∇SVM) Given n labeled pairs and C > 0,

  min_w ‖w‖² + C Σ_{i=1}^{n} ξ_i        (2)

where ξ_i = max_{y≠y_i} ℓ_{Δ(y_i,y)}(⟨w, δΦ_{i,y}⟩).
The ξ_i remain in Optimization Problem 2 for better comprehensibility; when they are expanded, the criterion is a closed expression. When the maximum is approximated by the softmax, and a smooth approximation of the hinge loss is used, Equation 1 is also differentiable. The softmax and its derivative are displayed in the following equations,

  smax_{ỹ≠y_k} (s(ỹ)) = (1/ρ) log ( 1 + Σ_{ỹ≠y_k} (e^{ρ s(ỹ)} − 1) )

  ∂/∂s(y) smax_{ỹ≠y_k} (s(ỹ)) = e^{ρ s(y)} / ( 1 + Σ_{ỹ≠y_k} (e^{ρ s(ỹ)} − 1) ),

Figure 1. The differentiable Huber loss ℓ_{Δ=1, ε=0.5}.
where we will use s(ỹ) = ℓ_{Δ(y_i,ỹ)}(⟨w, δΦ_{i,ỹ}⟩). The Huber loss ℓ_{Δ,ε}, displayed in Figure 1, is given by

  ℓ_{Δ,ε}(t) = Δ − t                  : t ≤ Δ − ε
               (Δ + ε − t)² / (4ε)    : Δ − ε ≤ t ≤ Δ + ε
               0                      : otherwise

  ℓ′_{Δ,ε}(t) = −1                    : t ≤ Δ − ε
                −((Δ − t)/ε + 1) / 2  : Δ − ε ≤ t ≤ Δ + ε
                0                     : otherwise.
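The two smoothing devices are straightforward to check numerically. The following sketch uses the softmax parameter ρ and Huber parameters Δ, ε as above; the parameter values are illustrative assumptions, and this is not the authors' code.

```python
import numpy as np

def smax(scores, rho=0.3):
    """Soft maximum over margin violations; interpolates between max
    (large rho) and sum (small rho)."""
    return np.log1p(np.sum(np.expm1(rho * scores))) / rho

def huber(t, delta=1.0, eps=0.5):
    """Differentiable Huber version of the margin-rescaled hinge loss."""
    if t <= delta - eps:
        return delta - t
    if t <= delta + eps:
        return (delta + eps - t) ** 2 / (4 * eps)
    return 0.0

# smax approaches the hard max as rho grows:
s = np.array([0.2, 1.0, 0.4])
print(round(smax(s, rho=20.0), 3))  # close to max(s) = 1.0
# huber agrees with the hinge outside the smoothed interval [delta-eps, delta+eps]:
print(huber(1.5), huber(0.0))       # 0.0 1.0
```

At the boundary t = Δ − ε both branches give the same value (here 0.5), so the loss is continuously differentiable.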
An application of the representer theorem shows that w can be expanded as

  w = Σ_k Σ_{y≠y_k} α_{k,y} δΦ_{k,y}.        (3)

The gradient is a vector over the coefficients α_{k,y} for each example x_k with true label y_k and each possible incorrect labeling y. Computationally, only non-zero coefficients have to be represented. The gradient of Equation 2 with respect to w is given by

  ∇ OP2 = 2w ∇w + C Σ_{i=1}^{n} ∇ξ_i.
Thus, applying Equation 3 gives us the first derivative in terms of the α_{k,y}:

  ∂OP2/∂α_{k,y} = 2 ⟨w, ∂w/∂α_{k,y}⟩ + C Σ_{i=1}^{n} ∂ξ_i/∂α_{k,y}.

The partial derivative ∂w/∂α_{k,y} resolves to δΦ_{k,y}; that of ξ_i can be decomposed by the chain rule into

  ∂ξ_i/∂α_{k,y} = (∂ξ_i/∂w) (∂w/∂α_{k,y}) = (∂ξ_i/∂w) δΦ_{k,y},

  ∂ξ_i/∂w = Σ_{y≠y_i} [ ∂ smax_{ỹ≠y_i} s(ỹ) / ∂s(y) ] ℓ′_{Δ(y_i,y)}(⟨w, δΦ_{i,y}⟩) δΦ_{i,y}.
This solution generalizes Chapelle (2007) to general input output spaces. The global minimum of Optimization Problem 2 can now easily be found with a standard gradient algorithm, such as conjugate gradient descent. By rephrasing the problem as an unconstrained optimization problem, its intrinsic complexity has not changed. We will observe the benefit of this approach in the following sections.
4. Unconstrained Transductive SVMs

In semi-supervised learning, unlabeled x_j for n+1 ≤ j ≤ n+m are given in addition to the labeled pairs (x_i, y_i) for 1 ≤ i ≤ n, where usually n ≪ m. Optimization Problem 3 requires the unlabeled data to be classified with a large margin, but the actual label is unconstrained; this favors a low-density separation.
OP 3 (TSVM) Given n labeled and m unlabeled training pairs, let C_l, C_u > 0,

  min_{w,ξ} ‖w‖² + C_l Σ_{i=1}^{n} ξ_i + C_u Σ_{j=n+1}^{n+m} ξ_j

subject to the constraints

  ∀_{i=1}^{n} ∀_{y≠y_i}: ⟨w, δΦ_{i,y}⟩ ≥ Δ(y_i, y) − ξ_i
  ∀_{j=n+1}^{n+m} ∃_{y_j} ∀_{y≠y_j}: ⟨w, δΦ_{j,y}⟩ ≥ Δ(y_j, y) − ξ_j
  ∀_{i=1}^{n}: ξ_i ≥ 0;  ∀_{j=n+1}^{n+m}: ξ_j ≥ 0.
Optimization Problem 3 requires that there be a y_j such that all other labels y violate the margin by no more than ξ_j. Hence, the value of slack variable ξ_j is determined by the label y that incurs the strongest margin violation. Alternatively, the sum of margin violations over all y ≠ y_j may be upper bounded by ξ_j. In fact we can interpolate between max and sum by varying the softmax parameter ρ. Note that the optimum expansion α is sparse, as only margin-violating labels y contribute to the aggregation. As we will see later, these y can be efficiently determined.
The constraints on ξ_j involve a disjunction over all possible labelings y_j of the unlabeled x_j, which causes non-convexity and renders QP solvers not directly applicable. The TSVM implementation in SVM^light (Joachims, 1999) treats the pseudo-labels y_j as additional combinatorial parameters. The existential quantifier is thus removed, but the criterion has to be minimized over all possible values of (y_{n+1}, …, y_{n+m}) and, in a nested step of convex optimization, over the w.

Analogously to the ξ_i (Equation 1), we replace the constraints on ξ_j:

  ξ_j = min_{y_j} max{ max_{y≠y_j} [ Δ(y_j, y) − ⟨w, δΦ_{j,y}⟩ ], 0 }
      = min_{y_j} max_{y≠y_j} { u_{Δ(y_j,y)}(⟨w, δΦ_{j,y}⟩) }.        (4)
Figure 2. Loss u_{Δ=1, ε=0.6}(t) and its first derivative.
We quantify the loss induced by unlabeled instances by a function u_{Δ,ε} slightly different from the Huber loss ℓ_{Δ,ε}. Diverging from ℓ, we engineer u to be symmetric, and to have a vanishing derivative at (and around) the point of symmetry. At this point, two labels score equally well (and better than all others), and the corresponding margin violation can be mitigated by moving w in two symmetric ways.

  u_{Δ,ε}(t) = 1                             : |t| ≤ Δ − ε
               1 − (|t| − Δ + ε)² / (2ε²)    : Δ − ε ≤ |t| ≤ Δ
               (|t| − Δ − ε)² / (2ε²)        : Δ ≤ |t| ≤ Δ + ε
               0                             : otherwise

  u′_{Δ,ε}(t) = 0                            : |t| ≤ Δ − ε
                −sgn(t) (|t| − Δ + ε) / ε²   : Δ − ε ≤ |t| ≤ Δ
                sgn(t) (|t| − Δ − ε) / ε²    : Δ ≤ |t| ≤ Δ + ε
                0                            : otherwise.
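The claimed properties of u (symmetry, a flat region around the point of symmetry, decay to zero beyond Δ + ε) can be checked numerically. This sketch encodes one plausible reading of the piecewise definition; the function name and the parameter values are illustrative assumptions.

```python
def u_loss(t, delta=1.0, eps=0.6):
    """Symmetric loss for unlabeled examples: flat (zero derivative) around
    t = 0, decreasing to 0 for |t| >= delta + eps. Assumed piecewise form."""
    a = abs(t)
    if a <= delta - eps:
        return 1.0
    if a <= delta:
        return 1.0 - (a - delta + eps) ** 2 / (2 * eps ** 2)
    if a <= delta + eps:
        return (a - delta - eps) ** 2 / (2 * eps ** 2)
    return 0.0

print(u_loss(0.3), u_loss(-0.3))  # 1.0 1.0  (flat: |t| <= delta - eps = 0.4)
print(round(u_loss(1.0), 3))      # 0.5      (the quadratic pieces meet at |t| = delta)
print(u_loss(2.0))                # 0.0      (outside delta + eps = 1.6)
```

Both quadratic pieces take the value 1/2 at |t| = Δ, so the loss and its derivative are continuous everywhere.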
Having rephrased the constraints on ξ_j as an equation, we can pose the unconstrained transductive SVM optimization problem for structured outputs.

OP 4 (∇TSVM) Given n labeled and m unlabeled training pairs, let C_l, C_u > 0,

  min_w ‖w‖² + C_l Σ_{i=1}^{n} ξ_i + C_u Σ_{j=n+1}^{n+m} ξ_j        (5)

with placeholders ξ_i = max_{y≠y_i} ℓ_{Δ(y_i,y)}(⟨w, δΦ_{i,y}⟩) and ξ_j = min_{y_j} max_{y≠y_j} { u_{Δ(y_j,y)}(⟨w, δΦ_{j,y}⟩) }.
Variables ξ_i and ξ_j remain in Optimization Problem 4 for notational harmony; they can be expanded to yield a closed, unconstrained optimization criterion.

Again we invoke the representer theorem (3) and optimize along the gradient ∂OP4/∂α. In addition to the derivatives calculated in Section 3, we need the partial derivatives of the ξ_j. They are analogous to
Table 1. The ∇TSVM Algorithm

Input: Labeled data {(x_i, y_i)}_{i=1}^{n}, unlabeled data {x_j}_{j=n+1}^{n+m}; parameters C_l, C_u, ε > 0.
 1: repeat
 2:   for each labeled example (x_i, y_i) do
 3:     ȳ ← argmax_{y≠y_i} {Δ(y_i, y) + ⟨w, Φ(x_i, y)⟩}   // compute worst margin violator
 4:     if ℓ_{Δ(y_i,ȳ),ε}(⟨w, Φ(x_i, y_i)⟩ − ⟨w, Φ(x_i, ȳ)⟩) > 0 then
 5:       W ← W ∪ {(i, y_i, ȳ)}   // add difference vector to working set
 6:     end if
 7:   end for
 8:   for each unlabeled example x_j do
 9:     ŷ_j ← argmax_y ⟨w, Φ(x_j, y)⟩   // compute top-scoring output
10:     ȳ ← argmax_{y≠ŷ_j} {Δ(ŷ_j, y) + ⟨w, Φ(x_j, y)⟩}   // compute runner-up
11:     if ∃ y_j ∈ W ∧ y_j ≠ ŷ_j then
12:       ∀y: W ← W \ {(j, y_j, y)}   // delete old constraints
13:     end if
14:     if u_{Δ(ŷ_j,ȳ),ε}(⟨w, Φ(x_j, ŷ_j)⟩ − ⟨w, Φ(x_j, ȳ)⟩) > 0 then
15:       y_j ← ŷ_j
16:       W ← W ∪ {(j, y_j, ȳ)}   // add difference vector to working set
17:     end if
18:   end for
19:   α ← argmin_{α′} TSVM(α′, W)   // minimize Eq. 5 by conjugate gradient descent
20:   ∀ α_{k,ȳ} = 0: W ← W \ {(k, y_k, ȳ)}   // delete unnecessary constraints
21: until convergence
Output: Optimized α, working set W.
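The pseudo-label bookkeeping for the unlabeled examples (lines 8-18 of Table 1) can be sketched for the multiclass case. This is a compressed illustration under several assumptions: 0/1 loss, a stacked joint feature map, a dict/set working-set representation, and a plain margin test in place of the smoothed losses ℓ and u; it is not the paper's implementation.

```python
import numpy as np

def unlabeled_pass(w, Phi, X_unlabeled, pseudo, workset, n_classes, delta=1.0):
    """One decoding pass over the unlabeled examples. `pseudo` maps example
    index -> current pseudo-label; `workset` holds triples
    (example index, pseudo-label, violator)."""
    for j, x in enumerate(X_unlabeled):
        scores = np.array([w @ Phi(x, y) for y in range(n_classes)])
        y_hat = int(scores.argmax())                 # top-scoring output
        viol = scores + delta
        viol[y_hat] = -np.inf
        y_bar = int(viol.argmax())                   # runner-up / worst violator
        if j in pseudo and pseudo[j] != y_hat:       # pseudo-label changed:
            workset = {c for c in workset if c[0] != j}  # delete old constraints
        if delta - (scores[y_hat] - scores[y_bar]) > 0:  # margin violated
            pseudo[j] = y_hat
            workset.add((j, y_hat, y_bar))
    return pseudo, workset

# Toy instantiation with a stacked joint feature map (illustrative data):
n_classes, d = 2, 2
def Phi(x, y):
    phi = np.zeros(n_classes * d)
    phi[y * d:(y + 1) * d] = x
    return phi

w = np.array([1.0, 0.0, 0.0, 1.0])
X_u = [np.array([2.0, 0.0]),   # large margin -> no constraint added
       np.array([0.3, 0.2])]   # near the boundary -> margin violation
pseudo, W = unlabeled_pass(w, Phi, X_u, {}, set(), n_classes)
print(pseudo, W)  # {1: 0} {(1, 0, 1)}
```

Only the near-boundary example enters the working set, which is the sparsity the continuous optimization step then exploits.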
those of ξ_i; letting s(ỹ) = u_{Δ(y_j,ỹ)}(⟨w, δΦ_{j,ỹ}⟩), we have

  ∂ξ_j/∂w = Σ_{y≠y_j} [ ∂ smax_{ỹ≠y_j} s(ỹ) / ∂s(y) ] u′_{Δ(y_j,y)}(⟨w, δΦ_{j,y}⟩) δΦ_{j,y}.
Every expansion coefficient α_{j,y} influences how strongly f favors label y_j over y for the unlabeled example j. This solution generalizes Chapelle and Zien (2005) to general input output spaces.
Algorithmically, continuous optimization over all parameters α_{k,y} is impossible due to exponentially many y's. However, our loss functions cause the solution to be sparse. In order to narrow the search to the non-zero variables, generalized ∇TSVM training interleaves two steps. In the decoding step, the algorithm iterates over all training instances and uses a 2-best decoder to produce the highest-scoring output ŷ and the worst margin violator ȳ ≠ ŷ. For labeled examples (x_i, y_i), output ŷ has to be equal to the desired y_i, and ȳ must not violate the margin. Otherwise, the difference vector δΦ_{i,ȳ} is added to the (initially empty) working set of the i-th example. For unlabeled data, the highest-scoring output ŷ_j of the joint classifier serves as desired labeling and the runner-up ȳ_j as margin violator. Again, in case of a margin violation, δΦ_{j,ȳ} is added to the working set for x_j.
In the optimization step, conjugate gradient descent (CG) is executed over the parameters α_{k,y}, given by all examples x_k, desired outputs y_k, and all associated pseudo-labels y currently in the working set. As proposed in (Chapelle, 2007), we use the kernel matrix as preconditioner, which speeds up the convergence of the CG considerably. The inner loop of the ∇TSVM algorithm is depicted in Table 1.
In an outer loop, ∇TSVM first increases C_l in a barrier fashion to avoid numerical instabilities, and eventually increases the influence C_u of the unlabeled examples. The algorithm terminates when the working set remains unchanged over two consecutive iterations and C_l and C_u have reached the desired maximum value. Notice that ∇TSVM reduces to ∇SVM when no unlabeled examples are included into the training process; i.e., for ∇SVM, lines 8-18 are removed from Table 1.

For binary TSVMs it has proven useful to add a balancing constraint to the optimization problem that ensures that the relative class sizes of the predictions are similar to those of the labeled points (Joachims, 1999).
For structured outputs, the relative frequencies of the output symbols σ ∈ Σ may be constrained:

  ( Σ_{j=n+1}^{n+m} Σ_{t=1}^{|x_j|} 1_[[y_{j,t}=σ]] ) / ( Σ_{j=n+1}^{n+m} |x_j| ) = ( Σ_{i=1}^{n} Σ_{s=1}^{|x_i|} 1_[[y_{i,s}=σ]] ) / ( Σ_{i=1}^{n} |x_i| ).
Analogously to binary TSVMs (Chapelle & Zien, 2005), this can be relaxed to "soft" linear constraints:

  Σ_{j=n+1}^{n+m} Σ_{t=1}^{|x_j|} [ wᵀΦ(x_{j,t}, σ) + b_σ − wᵀΦ̄(x_{j,t}) − b̄ ] = p̂_σ

where Φ(x_{j,t}, σ) are the feature maps corresponding to predicting σ for position t of x_j, Φ̄(x_{j,t}) = Σ_{ω∈Σ} Φ(x_{j,t}, ω)/|Σ| is their average, the b_σ are newly introduced label biases with average b̄ = Σ_σ b_σ/|Σ|, and p̂_σ = (Σ_j |x_j|)( Σ_i Σ_s 1_[[y_{i,s}=σ]] / (Σ_i |x_i|) − 1/|Σ| ) are centered predicted class sizes. By appropriately centering the unlabeled data, these constraints can be equivalently transformed into fixing the b_σ to constants. However, here we do not implement any balancing, as we empirically observe the fractions of predicted symbols to roughly agree with the corresponding fractions on the known labels.
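The quantities on both sides of the balancing constraint are simple symbol frequencies, which makes the empirical check above easy to reproduce. A minimal helper (hypothetical names, toy NER-style label data):

```python
def symbol_fractions(label_seqs, alphabet):
    """Relative frequency of each output symbol over a set of label
    sequences, as compared on the two sides of the balancing constraint."""
    flat = [s for seq in label_seqs for s in seq]
    return {sigma: flat.count(sigma) / len(flat) for sigma in alphabet}

labeled = [['O', 'O', 'PER'], ['O', 'LOC']]
predicted = [['O', 'PER', 'O', 'O'], ['O', 'LOC', 'O', 'O', 'O', 'O']]
print(symbol_fractions(labeled, ['O', 'PER', 'LOC']))    # {'O': 0.6, 'PER': 0.2, 'LOC': 0.2}
print(symbol_fractions(predicted, ['O', 'PER', 'LOC']))  # {'O': 0.8, 'PER': 0.1, 'LOC': 0.1}
```

If the predicted fractions drift far from the labeled ones, the soft constraints (or fixed biases b_σ) would pull them back.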
5. Experiments

We investigate unconstrained optimization of structured output support vector machines by comparing differentiable ∇SVM and ∇TSVM to SVMs solved by quadratic programming (QP) approaches.

In each setting, the influence of unlabeled examples is determined by a smoothing strategy which exponentially approaches C_u after a fixed number of epochs. We optimize C_u using resampling, then fix C_u and present curves that show the average error over 100 randomly drawn training and holdout sets; error bars indicate standard error. In all experiments we set C_l = 1, the softmax parameter ρ = 0.3, and the smoothing parameter ε = 0.4.
5.1. Execution Time

Figure 3 compares the execution times of CG-based ∇SVM and ∇TSVM to a QP-based SVM, where we used the same convergence criteria for all optimizers. ∇TSVM is trained with the respective number of labeled examples and a 5 times larger set of unlabeled instances. Besides being faster than a solution based on solving QPs, the continuous optimization is remarkably efficient at utilizing the unlabeled data. For instance, ∇TSVM with 50 labeled and 250 unlabeled examples converges considerably faster than ∇SVM and qpSVM with only 200 labeled instances.

Figure 3. Execution time of qpSVM, ∇SVM, and ∇TSVM over the number of labeled examples.
5.2. Multi-Class Classification

For the multiclass classification experiments, we use a cleaned variant of the Cora data set that contains 9,555 linked computer science papers with a reference section. The data set is divided into 8 different classes. We extract term frequencies of the document and of the anchor text of the inbound links. The latter are drawn from three sentences, respectively, centered at the occurrence of the reference. We compare the performance of ∇TSVM with 0/1 loss to the performance of TSVM^light, trained with a one-vs-rest strategy. Figure 4 details the error rates for 200 labeled examples and varying numbers of unlabeled instances. For no unlabeled data both transductive methods reduce to their fully-supervised, inductive counterparts. Both SVMs perform equally well for the labeled instances. However, when unlabeled examples are included into the training process, the performance of TSVM^light deteriorates. The error rates of ∇TSVM show a slight improvement with 800 unlabeled instances.

Figure 4. Error rates for the Cora data set.
We also apply our method to the 6-class data set COIL as used in (Chapelle et al., 2006), and compare to the reported one-vs-rest TSVM results. For n = 10 labeled points, we achieve 68.87% error, while the one-vs-rest TSVM achieves 67.50%. For n = 100 points, the results are 25.42% as compared to 25.80%.
5.3. Artificial Sequential Data

The artificial galaxy data set (Lafferty et al., 2004) consists of 100 sequences of length 20, generated by a two-state hidden Markov model. The initial state is chosen uniformly and there is a 10% chance of switching the state. Each state emits instances uniformly from one of the two classes; see Figure 5 (left).

We run ∇SVM and ∇TSVM using Hamming loss with two different kernels, a Gaussian RBF kernel with bandwidth 0.35 and a semi-supervised graph kernel. The graph kernel is constructed from a 10-nearest-neighbor graph and given by K = 10 (L + τI)⁻¹, with
Figure 5. The galaxy data set (left) and error rates for ∇SVM and ∇TSVM using RBF (center) and graph kernels (right).
graph Laplacian L and τ = 10⁻⁶, as proposed by Lafferty et al. (2004).
In each experiment we draw a certain number of labeled sequences at random and use the rest either as unlabeled examples or as holdout set. We report on averages over 20 runs. Figure 5 (center and right) details the results for semi-supervised vs. supervised algorithm and semi-supervised vs. standard kernel. Since the approaches are orthogonal, we apply all 4 combinations. For increasing numbers of labeled examples, the error rates of the tested models decrease. The continuous TSVM performs just slightly better than the supervised SVM; the differences are significant only in few cases. This problem is extremely well tailored to the Laplacian kernel. The error rates achieved with the semi-supervised kernel are between 3% and 20% lower than the corresponding results for the RBF kernel.
5.4. Named Entity Recognition

The CoNLL 2002 data consists of sentences from a Spanish news wire archive and contains 9 label types which distinguish person, organization, location, and other names. We use 3,100 sentences of between 10 and 40 tokens, leading to 24,000 distinct tokens in the dictionary. Moreover, we extract surface clue features, like capitalization features and others. We use a window of size 3, centered around each token.

In each experiment we draw a specified number of labeled and unlabeled training and holdout data without replacement at random in each iteration. We assure that each label occurs at least once in the labeled training data; otherwise, we discard and draw again. We compare ∇TSVM with 0/1 loss and Hamming loss to the HM-SVM (Altun et al., 2003), trained by incrementally solving quadratic programs over subspaces associated with individual input examples. Figure 6 details the results for 10 labeled sequences.

∇SVM converges to better local optima than HM-SVM due to global conjugate gradient based optimization compared to solving local quadratic programs. When unlabeled examples are included in the training process, the error of the ∇TSVM decreases significantly. ∇TSVM_H with Hamming loss performs slightly better than ∇TSVM_{0/1} using 0/1 loss.
6. Discussion

The TSVM criterion is non-convex and the minimization can be difficult even for binary class variables. In order to scale the TSVM to structured outputs, we employ a technique that eliminates the discrete parameters and allows for a conjugate gradient descent in the space of expansion coefficients α. Empirical comparisons of execution time show that the continuous approaches are more efficient than standard approaches based on quadratic programming.

For the Cora text classification problem, transductive learning does not achieve a substantial benefit over supervised learning. Worse yet, the combinatorial TSVM increases the error substantially, whereas ∇TSVM has negligible effect. In order to draw an unbiased picture, we present this finding with as much emphasis as any positive result. For the Spanish news named entity recognition problem, we consistently observe small but significant improvements over purely supervised learning.

One might intuitively expect transductive learning to outperform supervised learning, because more information is available. However, these test instances introduce non-convexity, and the local minimum retrieved by the optimizer may be worse than the global minimum of the convex supervised problem. Our experiments indicate that this might occasionally occur.
For the galaxy problem, the benefit of ∇TSVM over ∇SVM is marginal, and observable only for very few labeled examples. By its design this problem is very well suited for graph kernels, which reduce the error rate by 50%. In the graph Laplacian approach (Sindhwani et al., 2005), an SVM is trained on the labeled data, but in addition to the standard kernel, the graph Laplacian derived from labeled and unlabeled points serves as regularizer. For binary classification, combining TSVM and graph Laplacian yields the greatest benefit (Chapelle & Zien, 2005). For structured variables, we observe a similar effect, though much weaker.

Figure 6. Token error for the Spanish news wire data set with 10 labeled instances.
The presented ∇TSVM rests on a cluster assumption for entire structures, while graph-based methods (Lafferty et al., 2004; Altun et al., 2005) exploit the distribution of parts of structures. Both approaches improve over supervised learning on some data sets and fail to do so on others. This raises the question of how to determine which kind of assumptions is appropriate for a given task at hand.
7. Conclusion

We devised a transductive support vector machine for structured variables (∇TSVM). We transformed the original combinatorial and constrained optimization problem into a differentiable and unconstrained one. The resulting optimization problem is still non-convex but can be optimized efficiently, for instance via a conjugate gradient descent. A differentiable variant of the SVM for structured variables (∇SVM) is obtained for the special case of a fully labeled training set.

We applied both methods with various loss functions to multiclass classification and sequence labeling problems. Based on our empirical findings, we can rule out the hypothesis that ∇TSVM generally improves learning with structured output variables over purely supervised learning, as well as the hypothesis that ∇TSVM never improves accuracy.

We conjecture that transductive structured output learning could benefit from more research on (i) improved non-convex optimization techniques and on (ii) appropriate (cluster) assumptions.
Acknowledgment

We thank John Lafferty, Yan Liu, and Xiaojin Zhu for providing their data. We also thank the anonymous reviewers for detailed comments and suggestions. This work has been partially funded by the German Science Foundation DFG under grant SCHE540/10-2, and it has in part been supported by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.

References

Altun, Y., McAllester, D., & Belkin, M. (2005). Maximum margin semi-supervised learning for structured variables. Advances in Neural Information Processing Systems.
Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003). Hidden Markov support vector machines. Proceedings of the International Conference on Machine Learning.

Brefeld, U., & Scheffer, T. (2006). Semi-supervised learning for structured output variables. Proceedings of the International Conference on Machine Learning.

Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19, 1155-1178.

Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2006). Semi-supervised learning. Cambridge, MA: MIT Press.

Chapelle, O., & Zien, A. (2005). Semi-supervised classification by low density separation. Proceedings of the International Workshop on AI and Statistics.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning.

Lafferty, J., Zhu, X., & Liu, Y. (2004). Kernel conditional random fields: representation and clique selection. Proceedings of the International Conference on Machine Learning.

Lee, C., Wang, S., Jiao, F., Greiner, R., & Schuurmans, D. (2007). Learning to model spatial dependency: Semi-supervised discriminative random fields. Advances in Neural Information Processing Systems.

Sindhwani, V., Niyogi, P., & Belkin, M. (2005). Beyond the point cloud: From transductive to semi-supervised learning. Proceedings of the International Conference on Machine Learning.

Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin Markov networks. Advances in Neural Information Processing Systems.

Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. Proceedings of the International Conference on Machine Learning.

Xu, L., Wilkinson, D., Southey, F., & Schuurmans, D. (2006). Discriminative unsupervised learning of structured predictors. Proceedings of the International Conference on Machine Learning.