Transductive Support Vector Machines for Structured Variables

Alexander Zien  alexander.zien@tuebingen.mpg.de
Max Planck Institute for Biological Cybernetics

Ulf Brefeld  brefeld@mpi-inf.mpg.de
Tobias Scheffer  scheffer@mpi-inf.mpg.de
Max Planck Institute for Computer Science

Abstract

We study the problem of learning kernel machines transductively for structured output variables. Transductive learning can be reduced to combinatorial optimization problems over all possible labelings of the unlabeled data. In order to scale transductive learning to structured variables, we transform the corresponding non-convex, combinatorial, constrained optimization problems into continuous, unconstrained optimization problems. The discrete optimization parameters are eliminated and the resulting differentiable problems can be optimized efficiently. We study the effectiveness of the generalized TSVM on multiclass classification and label-sequence learning problems empirically.

1. Introduction

Learning mappings between arbitrary structured and interdependent input and output spaces is a fundamental problem in machine learning; it covers learning tasks such as producing sequential or tree-structured output, and it challenges the standard model of learning a mapping from independently drawn instances to a small set of labels. Applications include named entity recognition and information extraction (sequential output), natural language parsing (tree-structured output), classification with a class taxonomy, and collective classification (graph-structured output).

When the input x and the desired output y are structures, it is not generally feasible to model each possible value of y as an individual class. It is then helpful to represent input and output pairs in a joint feature representation, and to rephrase the learning task as finding f: X × Y → R such that

    ŷ = argmax_{y ∈ Y} f(x, y)

is the desired output for any input x. Thus, f can be a linear discriminator in a joint space Φ(x, y) of input and output variables and may depend on arbitrary joint features. Max-margin Markov models (Taskar et al., 2003), kernel conditional random fields (Lafferty et al., 2004), and support vector machines for structured output spaces (Tsochantaridis et al., 2004) use kernels to compute the inner product in input output space. An application-specific learning method is constructed by defining appropriate features, and choosing a decoding procedure that efficiently calculates the argmax, exploiting the dependency structure of the features. The decoder can be a Viterbi algorithm when joint features are constrained to depend only on adjacent outputs, or a chart parser for tree-structured outputs.

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

Several semi-supervised techniques in joint input output spaces have been studied. One of the most promising approaches is the integration of unlabeled instances by Laplacian priors into structured large margin classifiers (Lafferty et al., 2004; Altun et al., 2005). Brefeld and Scheffer (2006) include unlabeled examples into structural support vector learning by modeling distinct views of the data and applying the consensus maximization principle between peer hypotheses. Lee et al. (2007) study semi-supervised CRFs and include unlabeled data via an entropy criterion such that their objective acts as a probabilistic analogue to the transductive setting we discuss here. Xu et al. (2006) derive unsupervised M³ Networks by employing SDP relaxation techniques. Their optimization problem is similar to the transductive criterion derived in this paper.

Traditional binary TSVM implementations (Joachims, 1999) solve a combinatorial optimization problem over pseudo-labels ŷ_j for the unlabeled x_j. These additional combinatorial optimization parameters can be removed altogether when the constraints ŷ_j ⟨w, x_j⟩ ≥ 1 − ξ_j are expressed using absolute values: ξ_j = max{1 − |⟨w, x_j⟩|, 0} (Chapelle, 2007). The resulting problem remains non-convex, but is now continuous and has fewer parameters. It can therefore be optimized faster, and the retrieved local minima are substantially better on average (Chapelle & Zien, 2005).
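This substitution is easy to sketch for the binary case. The helper below is a hypothetical illustration (w and x are plain coordinate lists), not the authors' implementation:

```python
def unlabeled_slack(w, x):
    """Slack of an unlabeled point after eliminating its pseudo-label.

    For binary TSVMs, xi_j = max{1 - |<w, x_j>|, 0}: whichever
    pseudo-label (+1 or -1) fits better is chosen implicitly by the
    absolute value, so no discrete parameter remains.
    """
    score = sum(w_d * x_d for w_d, x_d in zip(w, x))
    return max(1.0 - abs(score), 0.0)
```

Points far outside the margin on either side incur zero slack; points inside the margin are penalized symmetrically, regardless of which pseudo-label they would receive.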

The structure of this paper and its main contributions are as follows. Section 2 recalls the basics of learning in structured output spaces. We then leverage the technique of continuous optimization of the primal to structured input output spaces, addressing both the supervised (Section 3) and the transductive case (Section 4). Our treatment covers general loss functions and linear discrimination as well as general kernels. We study the benefit of the generalized transductive SVM empirically in Section 5. Section 6 provides a discussion of the experimental results and Section 7 concludes.

2. Learning in Input Output Spaces

When dealing with discriminative structured prediction models, input variables x_i ∈ X and output variables y_i ∈ Y are represented jointly by a feature map Φ(x_i, y_i) that allows capturing multiple-way dependencies. We apply a generalized linear model f(x, y) = ⟨w, Φ(x, y)⟩ to decode the top-scoring output ŷ = argmax_{y ∈ Y} f(x, y) for input x.

We measure the quality of f by an appropriate, symmetric, non-negative loss function Δ: Y × Y → R₀⁺ that details the distance between the true y and the prediction; for instance, Δ may be the common 0/1 loss, given by Δ(y, ŷ) = 1_[[y ≠ ŷ]]. Thus, the expected risk of f is given as

    R(f) = ∫_{X×Y} Δ(y, argmax_{ȳ} f(x, ȳ)) dP_{XY}(x, y),

where P_{XY} is the (unknown) distribution of inputs and outputs. We address this problem by searching for a minimizer of the empirical risk on a fixed sample of pairs (x_i, y_i), 1 ≤ i ≤ n, drawn iid from P_{XY}, regularized with the inverse margin ||w||². The feature map Φ(x, y) and the decoder have to be adapted to the application at hand. We briefly skim the feature spaces and decoders used in our experiments.

Multi-class classification is a special case of a joint input output space with the output space equaling the finite output alphabet; i.e., Y = Σ. Let ψ(x) be the feature vector (e.g., a tf.idf vector) of x. Then, the class-based feature representation Φ_σ(x, y) is given by Φ_σ(x, y) = 1_[[y = σ]] ψ(x), with σ ∈ Σ. The joint feature representation is given by "stacking up" the class-based representations of all classes σ ∈ Σ; thus, Φ(x, y) = (…, Φ_σ(x, y), …). With this definition, the inner product in input output space reduces to ⟨Φ(x_i, y_i), Φ(x_j, y_j)⟩ = 1_[[y_i = y_j]] k(x_i, x_j), for arbitrary k(x_i, x_j) = ⟨ψ(x_i), ψ(x_j)⟩. Since the number of classes is limited we do not need a special decoding strategy: the argmax can efficiently be determined by enumerating all y and returning the highest-scoring class.
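The multiclass joint kernel and the enumeration decoder can be sketched in a few lines. This is a hypothetical illustration of the formulas above, where f and k_value stand in for the learned scoring function and a precomputed input kernel value:

```python
def joint_inner_product(k_value, y_i, y_j):
    # <Phi(x_i, y_i), Phi(x_j, y_j)> = 1[[y_i = y_j]] * k(x_i, x_j)
    return k_value if y_i == y_j else 0.0

def decode_multiclass(f, x, classes):
    # argmax over the finite alphabet by plain enumeration
    return max(classes, key=lambda y: f(x, y))
```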

In label sequence learning, the task is to find a mapping from a sequential input x_i = ⟨x_{i,1}, …, x_{i,|x_i|}⟩ to a sequential output y_i = ⟨y_{i,1}, …, y_{i,|x_i|}⟩ of the same length |y_i| = |x_i|. Each element of x is annotated with an element y_{i,t} ∈ Σ. We follow Altun et al. (2003) and extract label-label interactions φ_{σ,τ}(y|t) = 1_[[y_{t−1} = σ ∧ y_t = τ]] and label-observation features φ̄_{σ,l}(x, y|t) = 1_[[y_t = σ]] ψ_l(x_t), with labels σ, τ ∈ Σ. Here, ψ_l(x) extracts characteristics of x; e.g., ψ_{123}(x) = 1 if x starts with a capital letter and 0 otherwise. We refer to the vector ψ(x) = (…, ψ_l(x), …)ᵀ and denote the inner product by means of k(x, x̄) = ⟨ψ(x), ψ(x̄)⟩. The joint feature representation Φ(x, y) of a sequence is the sum of all feature vectors Φ(x, y|t) = (…, φ_{σ,τ}(y|t), …, φ̄_{σ,l}(x, y|t), …)ᵀ extracted at position t,

    Φ(x, y) = Σ_{t=1}^{T} Φ(x, y|t).

The inner product in input output space decomposes into a label-label and a label-observation part,

    ⟨Φ(x_i, y_i), Φ(x_j, y_j)⟩ = Σ_{s,t} 1_[[y_{i,s−1} = y_{j,t−1} ∧ y_{i,s} = y_{j,t}]] + Σ_{s,t} 1_[[y_{i,s} = y_{j,t}]] k(x_{i,s}, x_{j,t}).

Note that the described feature mapping exhibits a first-order Markov property and as a result, decoding can be performed by a Viterbi algorithm.
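The decomposed inner product above is straightforward to transcribe. In this hypothetical sketch, sequences are strings of labels and k(s, t) stands in for the token kernel k(x_{i,s}, x_{j,t}):

```python
def seq_inner_product(y_i, y_j, k):
    """Decomposed joint inner product for two labeled sequences.

    The first term counts matching consecutive label pairs
    (label-label part); the second sums the token kernel k(s, t)
    over position pairs with matching labels (label-observation part).
    """
    label_label = sum(
        1
        for s in range(1, len(y_i))
        for t in range(1, len(y_j))
        if y_i[s - 1] == y_j[t - 1] and y_i[s] == y_j[t]
    )
    label_obs = sum(
        k(s, t)
        for s in range(len(y_i))
        for t in range(len(y_j))
        if y_i[s] == y_j[t]
    )
    return label_label + label_obs
```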

Once an appropriate feature mapping Φ and the corresponding decoding are found for a problem at hand, both can be plugged into the learning algorithms presented in the following sections. In the remainder we will use the shorthand

    δΦ_{i y_i y} := Φ(x_i, y_i) − Φ(x_i, y)

for difference vectors in joint input output space.

3. Unconstrained Optimization for Structured Output Spaces

Optimization Problem 1 is the known SVM learning problem in input output spaces with cost-based margin rescaling, which includes the fixed size margin with 0/1-loss as a special case. All presented results can also be derived for slack rescaling approaches; however, the corresponding constraint generation becomes more difficult. The norm of w plus the sum of the slack terms ξ_i is minimized, subject to the constraint that, for all examples (x_i, y_i), the correct label y_i receives the highest score by a margin.

OP 1 (SVM) Given n labeled training pairs and C > 0,

    min_{w,ξ}  ||w||² + C Σ_{i=1}^{n} ξ_i
    s.t.  ∀_{i=1}^{n} ∀_{y ≠ y_i}:  ⟨w, δΦ_{i y_i y}⟩ ≥ Δ(y_i, y) − ξ_i,
          ∀_{i=1}^{n}:  ξ_i ≥ 0.

In general, unconstrained optimization is easier to implement than constrained optimization. For the SVM, it is possible to resolve the slack terms:

    ξ_i = max{ max_{y ≠ y_i} Δ(y_i, y) − ⟨w, δΦ_{i y_i y}⟩ , 0 }
        = max_{y ≠ y_i} ℓ_{Δ(y_i, y)}(⟨w, δΦ_{i y_i y}⟩),          (1)

where ℓ_Δ(t) = max{Δ − t, 0} is the hinge loss with margin rescaling. We can now pose Optimization Problem 2 for structured outputs, a simple quadratic optimization function without constraints.

OP 2 (∇SVM) Given n labeled pairs and C > 0,

    min_w  ||w||² + C Σ_{i=1}^{n} ξ_i          (2)

where ξ_i = max_{y ≠ y_i} ℓ_{Δ(y_i, y)}(⟨w, δΦ_{i y_i y}⟩).

The ξ_i remain in Optimization Problem 2 for better comprehensibility; when they are expanded, the criterion is a closed expression. When the maximum is approximated by the softmax, and a smooth approximation of the hinge loss is used, Equation 1 is also differentiable. The softmax and its derivative are displayed in the following equations,

    smax_{ỹ ≠ y_k}(s(ỹ)) = (1/ρ) log( 1 + Σ_{ỹ ≠ y_k} (e^{ρ s(ỹ)} − 1) )

    ∂/∂s(y) smax_{ỹ ≠ y_k}(s(ỹ)) = e^{ρ s(y)} / ( 1 + Σ_{ỹ ≠ y_k} (e^{ρ s(ỹ)} − 1) ),

Figure 1. The differentiable Huber loss ℓ_{Δ=1, ε=0.5} compared to the hinge loss.

where we will use s(ỹ) = ℓ_{Δ(y_i, ỹ)}(⟨w, δΦ_{i y_i ỹ}⟩). The Huber loss ℓ_{Δ,ε}, displayed in Figure 1, is given by

    ℓ_{Δ,ε}(t) =  Δ − t                  if t ≤ Δ − ε
                  (Δ + ε − t)² / (4ε)    if Δ − ε ≤ t ≤ Δ + ε
                  0                      otherwise

    ℓ'_{Δ,ε}(t) =  −1                    if t ≤ Δ − ε
                   −½((Δ − t)/ε + 1)     if Δ − ε ≤ t ≤ Δ + ε
                   0                     otherwise.
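Both smoothers are easy to transcribe. The following is a hypothetical pure-Python sketch of the softmax (with its parameter written as rho) and the Huber loss as defined above, not the authors' code:

```python
import math

def smax(scores, rho=1.0):
    """Smooth maximum over margin-violation scores s(y~).

    Exact for a single score, tends to max(scores) as rho grows,
    and yields 0 on empty input, matching max{., 0} aggregation.
    """
    return math.log(1.0 + sum(math.exp(rho * s) - 1.0 for s in scores)) / rho

def huber(t, delta=1.0, eps=0.5):
    """Huber loss with margin rescaling (Figure 1): linear for
    t <= delta - eps, quadratic on [delta - eps, delta + eps], else 0."""
    if t <= delta - eps:
        return delta - t
    if t <= delta + eps:
        return (delta + eps - t) ** 2 / (4.0 * eps)
    return 0.0
```

The quadratic piece joins the linear piece with matching value and slope at t = Δ − ε, which is what makes the overall criterion differentiable.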

An application of the representer theorem shows that w can be expanded as

    w = Σ_k Σ_{y ≠ y_k} α_{k y} δΦ_{k y_k y}.          (3)

The gradient is a vector over the coefficients α_{k y} for each example x_k with true label y_k and each possible incorrect labeling y. Computationally, only nonzero coefficients have to be represented. The gradient of Equation 2 with respect to w is given by

    ∇OP2 = 2w ∇w + C Σ_{i=1}^{n} ∇ξ_i.

Thus, applying Equation 3 gives us the first derivative in terms of the α_{k y}:

    ∂OP2/∂α_{k y} = 2w ∂w/∂α_{k y} + C Σ_{i=1}^{n} ∂ξ_i/∂α_{k y}.

The partial derivative ∂w/∂α_{k y} resolves to δΦ_{k y_k y}; that of ξ_i can be decomposed by the chain rule into

    ∂ξ_i/∂α_{k y} = (∂ξ_i/∂w) (∂w/∂α_{k y}) = (∂ξ_i/∂w) δΦ_{k y_k y},

    ∂ξ_i/∂w = Σ_{y ≠ y_i} [ ∂ smax_{ỹ ≠ y_i} s(ỹ) / ∂s(y) ] ℓ'_{Δ(y_i, y)}(⟨w, δΦ_{i y_i y}⟩) δΦ_{i y_i y}.

This solution generalizes Chapelle (2007) to general input output spaces. The global minimum of Optimization Problem 2 can now easily be found with a standard gradient algorithm, such as conjugate gradient descent. By rephrasing the problem as an unconstrained optimization problem, its intrinsic complexity has not changed. We will observe the benefit of this approach in the following sections.

4. Unconstrained Transductive SVMs

In semi-supervised learning, unlabeled x_j for n+1 ≤ j ≤ n+m are given in addition to the labeled pairs (x_i, y_i) for 1 ≤ i ≤ n, where usually n ≪ m. Optimization Problem 3 requires the unlabeled data to be classified with a large margin, but the actual label is unconstrained; this favors a low-density separation.

OP 3 (TSVM) Given n labeled and m unlabeled training pairs, let C_l, C_u > 0,

    min_{w,ξ}  ||w||² + C_l Σ_{i=1}^{n} ξ_i + C_u Σ_{j=n+1}^{n+m} ξ_j

subject to the constraints

    ∀_{i=1}^{n} ∀_{y ≠ y_i}:  ⟨w, δΦ_{i y_i y}⟩ ≥ Δ(y_i, y) − ξ_i
    ∀_{j=n+1}^{n+m} ∃_{y_j} ∀_{y ≠ y_j}:  ⟨w, δΦ_{j y_j y}⟩ ≥ Δ(y_j, y) − ξ_j
    ∀_{i=1}^{n}:  ξ_i ≥ 0;   ∀_{j=n+1}^{n+m}:  ξ_j ≥ 0.

Optimization Problem 3 requires that there be a y_j such that all other labels y violate the margin by no more than ξ_j. Hence, the value of slack variable ξ_j is determined by the label y that incurs the strongest margin violation. Alternatively, the sum of margin violations over all y ≠ y_j may be upper bounded by ξ_j. In fact we can interpolate between max and sum by varying the softmax parameter ρ. Note that the optimum expansion α is sparse, as only margin violating labels y contribute to the aggregation. As we will see later, these y can be efficiently determined.

The constraints on ξ_j involve a disjunction over all possible labelings y_j of the unlabeled x_j, which causes non-convexity and renders QP-solvers not directly applicable. The TSVM implementation in SVM^light (Joachims, 1999) treats the pseudo-labels y_j as additional combinatorial parameters. The existential quantifier is thus removed, but the criterion has to be minimized over all possible values of (y_{n+1}, …, y_{n+m}) and, in a nested step of convex optimization, over the w.

Analogously to the ξ_i (Equation 1), we replace the constraints on ξ_j:

    ξ_j = min_{y_j} max{ max_{y ≠ y_j} Δ(y_j, y) − ⟨w, δΦ_{j y_j y}⟩ , 0 }
        = min_{y_j} max_{y ≠ y_j} { u_{Δ(y_j, y)}(⟨w, δΦ_{j y_j y}⟩) }.          (4)
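For a small, explicitly enumerable output set, Equation 4 can be evaluated by brute force. This hypothetical sketch takes the scoring function, the loss, and the unlabeled-point loss u as parameters; score(y) stands in for ⟨w, Φ(x_j, y)⟩:

```python
def unlabeled_slack_structured(score, outputs, delta, u):
    """xi_j of Equation 4 by brute force over a small output set.

    For each candidate pseudo-label y_j, take the worst violation
    u(score(y_j) - score(y), delta(y_j, y)) over y != y_j; the
    minimizing y_j determines the slack.
    """
    def worst(y_j):
        return max(
            u(score(y_j) - score(y), delta(y_j, y))
            for y in outputs
            if y != y_j
        )
    return min(worst(y_j) for y_j in outputs)
```

With a hinge stand-in for u, an unlabeled point whose best labeling clearly dominates all alternatives receives zero slack.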

Figure 2. Loss u_{γ=1, s=0.6}(t) and its first derivative.

We quantify the loss induced by unlabeled instances u_{γ,s} by a function slightly different from the Huber loss ℓ_{Δ,ε}. Diverging from ℓ, we engineer u to be symmetric, and to have a vanishing derivative at (and around) the point of symmetry. At this point, two labels score equally well (and better than all others), and the corresponding margin violation can be mitigated by moving w in two symmetric ways.

    u_{γ,s}(t) =  1                            if |t| ≤ γ − s
                  1 − (|t| − γ + s)² / (2s²)   if γ − s ≤ |t| ≤ γ
                  (|t| − γ − s)² / (2s²)       if γ ≤ |t| ≤ γ + s
                  0                            otherwise

    u'_{γ,s}(t) =  0                            if |t| ≤ γ − s
                   −(sgn(t)/s²)(|t| − γ + s)    if γ − s ≤ |t| ≤ γ
                   (sgn(t)/s²)(|t| − γ − s)     if γ ≤ |t| ≤ γ + s
                   0                            otherwise.
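The piecewise definition above can be transcribed directly; this is a hypothetical sketch, not the authors' code:

```python
def u_loss(t, gamma=1.0, s=0.6):
    """Symmetric loss for unlabeled points (Figure 2).

    Flat at 1 around t = 0 (vanishing derivative at the point of
    symmetry), quadratic blends on [gamma - s, gamma] and
    [gamma, gamma + s], zero for |t| >= gamma + s.
    """
    a = abs(t)
    if a <= gamma - s:
        return 1.0
    if a <= gamma:
        return 1.0 - (a - gamma + s) ** 2 / (2.0 * s * s)
    if a <= gamma + s:
        return (a - gamma - s) ** 2 / (2.0 * s * s)
    return 0.0
```

The two quadratic pieces meet at |t| = γ with value ½ and matching slope, so the loss is differentiable everywhere.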

Having rephrased the constraints on ξ_j as an equation, we can pose the unconstrained transductive SVM optimization problem for structured outputs.

OP 4 (∇TSVM) Given n labeled and m unlabeled training pairs, let C_l, C_u > 0,

    min_w  ||w||² + C_l Σ_{i=1}^{n} ξ_i + C_u Σ_{j=n+1}^{n+m} ξ_j          (5)

with placeholders ξ_i = max_{y ≠ y_i} ℓ_{Δ(y_i, y)}(⟨w, δΦ_{i y_i y}⟩) and ξ_j = min_{y_j} max_{y ≠ y_j} { u_{Δ(y_j, y)}(⟨w, δΦ_{j y_j y}⟩) }.

Variables ξ_i and ξ_j remain in Optimization Problem 4 for notational harmony; they can be expanded to yield a closed, unconstrained optimization criterion.

Again we invoke the representer theorem (3) and optimize along the gradient ∂OP4/∂α. In addition to the derivatives calculated in Section 3, we need the partial derivatives of the ξ_j. They are analogous to


Table 1. The ∇TSVM Algorithm

Input: Labeled data {(x_i, y_i)}_{i=1}^{n}, unlabeled data {x_j}_{j=n+1}^{n+m}; parameters C_l, C_u, ρ > 0.
 1: repeat
 2:   for each labeled example (x_i, y_i) do
 3:     ȳ ← argmax_{y ≠ y_i} { Δ(y_i, y) + ⟨w, Φ(x_i, y)⟩ }            // compute worst margin violator
 4:     if ℓ_{Δ(y_i, ȳ), ε}(⟨w, Φ(x_i, y_i)⟩ − ⟨w, Φ(x_i, ȳ)⟩) > 0 then
 5:       W ← W ∪ {(i, y_i, ȳ)}                                        // add difference vector to working set
 6:     end if
 7:   end for
 8:   for each unlabeled example x_j do
 9:     ŷ_j ← argmax_y ⟨w, Φ(x_j, y)⟩                                   // compute top scoring output
10:     ȳ ← argmax_{y ≠ ŷ_j} { Δ(ŷ_j, y) + ⟨w, Φ(x_j, y)⟩ }            // compute runner-up
11:     if ∃y: (j, y_j, y) ∈ W ∧ y_j ≠ ŷ_j then
12:       ∀y: W ← W \ {(j, y_j, y)}                                     // delete old constraints
13:     end if
14:     if u_{Δ(ŷ_j, ȳ), s}(⟨w, Φ(x_j, ŷ_j)⟩ − ⟨w, Φ(x_j, ȳ)⟩) > 0 then
15:       y_j ← ŷ_j
16:       W ← W ∪ {(j, y_j, ȳ)}                                         // add difference vector to working set
17:     end if
18:   end for
19:   α ← argmin_{α'} ∇TSVM(α', W)                                      // minimize Eq. 5 by conjugate gradient descent
20:   ∀ α_{k ȳ} < θ: W ← W \ {(k, y_k, ȳ)}                              // delete unnecessary constraints
21: until convergence
Output: Optimized α, working set W.

those of ξ_i; letting s(ỹ) = u_{Δ(y_j, ỹ)}(⟨w, δΦ_{j y_j ỹ}⟩), we have

    ∂ξ_j/∂w = Σ_{y ≠ y_j} [ ∂ smax_{ỹ ≠ y_j} s(ỹ) / ∂s(y) ] u'_{Δ(y_j, y)}(⟨w, δΦ_{j y_j y}⟩) δΦ_{j y_j y}.

Every expansion coefficient α_{j y} influences how strongly f favors label y_j over y for the unlabeled example j. This solution generalizes Chapelle and Zien (2005) to general input output spaces.

Algorithmically, continuous optimization over all parameters α_{k y} is impossible due to exponentially many y's. However, our loss functions cause the solution to be sparse. In order to narrow the search to the non-zero variables, generalized ∇TSVM training interleaves two steps. In the decoding step, the algorithm iterates over all training instances and uses a 2-best decoder to produce the highest-scoring output ŷ and the worst margin violator ȳ ≠ ŷ. For labeled examples (x_i, y_i), output ŷ has to be equal to the desired y_i, and ȳ must not violate the margin. Otherwise, the difference vector δΦ_{i y_i ȳ} is added to the (initially empty) working set of the i-th example. For unlabeled data, the highest-scoring output of the joint classifier ŷ_j serves as desired labeling and the runner-up as margin violator ȳ_j. Again, in case of a margin violation δΦ_{j ŷ_j ȳ} is added to the working set for x_j.
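For the multiclass special case, the core of the decoding step (line 3 of Table 1) reduces to one enumeration. This is a hypothetical illustration, with f(x, y) standing in for the current model score and delta for the loss:

```python
def worst_margin_violator(f, x, y_true, classes, delta):
    """Multiclass instantiation of line 3 of Table 1.

    Returns the y != y_true maximizing delta(y_true, y) + f(x, y),
    i.e. the output with the strongest (rescaled) margin violation.
    """
    candidates = [y for y in classes if y != y_true]
    return max(candidates, key=lambda y: delta(y_true, y) + f(x, y))
```

For sequences, the same quantity is produced by a 2-best Viterbi decoder instead of enumeration.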

In the optimization step, conjugate gradient descent (CG) is executed over the parameters α_{k y}, given by all examples x_k, desired outputs y_k, and all associated pseudo-labels y currently in the working set. As proposed in (Chapelle, 2007) we use the kernel matrix as preconditioner, which speeds up the convergence of the CG considerably. The inner loop of the ∇TSVM algorithm is depicted in Table 1.

In an outer loop, ∇TSVM first increases C_l in a barrier fashion to avoid numerical instabilities, and eventually increases the influence of the unlabeled examples C_u. The algorithm terminates when the working set remains unchanged over two consecutive iterations and C_l and C_u have reached the desired maximum value. Notice that ∇TSVM reduces to ∇SVM when no unlabeled examples are included in the training process; i.e., for ∇SVM, lines 8-18 are removed from Table 1.

For binary TSVMs it has proven useful to add a balancing constraint to the optimization problem that ensures that the relative class sizes of the predictions are similar to those of the labeled points (Joachims, 1999). For structured outputs, the relative frequencies of the output symbols σ ∈ Σ may be constrained:

    [ Σ_{j=n+1}^{n+m} Σ_{t=1}^{|x_j|} 1_[[y_{j,t} = σ]] ] / [ Σ_{j=n+1}^{n+m} |x_j| ] = [ Σ_{i=1}^{n} Σ_{s=1}^{|x_i|} 1_[[y_{i,s} = σ]] ] / [ Σ_{i=1}^{n} |x_i| ].

Analogously to binary TSVMs (Chapelle & Zien, 2005), this can be relaxed to "soft" linear constraints:

    Σ_{j=n+1}^{n+m} Σ_{t=1}^{|x_j|} ( wᵀΦ(x_{j,t}, σ) + b_σ − wᵀΦ̄(x_{j,t}) − b̄ ) = p̂_σ,


where Φ(x_{j,t}, σ) are the feature maps corresponding to predicting σ for position t of x_j, Φ̄(x_{j,t}) = Σ_{ω∈Σ} Φ(x_{j,t}, ω)/|Σ| is their average, the b_σ are newly introduced label biases with average b̄ = Σ_σ b_σ/|Σ|, and

    p̂_σ = ( Σ_j |x_j| ) ( Σ_i Σ_s 1_[[y_{i,s} = σ]] / (Σ_i |x_i|) − 1/|Σ| )

are centered predicted class sizes. By appropriately centering the unlabeled data these constraints can be equivalently transformed into fixing the b_σ to constants. However, here we do not implement any balancing, as we empirically observe the fractions of predicted symbols to roughly agree with the corresponding fractions on the known labels.
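The label fractions compared in the balancing check are simple token-level frequencies; a hypothetical helper:

```python
def label_fractions(labelings, alphabet):
    """Relative frequency of each output symbol over all token positions."""
    total = sum(len(y) for y in labelings)
    return {
        sigma: sum(tok == sigma for y in labelings for tok in y) / total
        for sigma in alphabet
    }
```

Applying it to predicted labelings of the unlabeled data and to the known labelings, and comparing the two dictionaries, reproduces the empirical check described above.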

5. Experiments

We investigate unconstrained optimization of structured output support vector machines by comparing the differentiable ∇SVM and ∇TSVM to SVMs solved by quadratic programming (QP) approaches.

In each setting, the influence of unlabeled examples is determined by a smoothing strategy which exponentially approaches C_u after a fixed number of epochs. We optimize C_u using resampling, then fix C_u and present curves that show the average error over 100 randomly drawn training and holdout sets; error bars indicate standard error. In all experiments we set C_l = 1, ε = 0.3, and s = 0.4.

5.1. Execution Time

Figure 3 compares the execution times of CG-based ∇SVM and ∇TSVM to a QP-based SVM, where we used the same convergence criteria for all optimizers. ∇TSVM is trained with the respective number of labeled examples and a 5 times larger set of unlabeled instances. Besides being faster than a solution based on solving QPs, the continuous optimization is remarkably efficient at utilizing the unlabeled data. For instance, ∇TSVM with 50 labeled and 250 unlabeled examples converges considerably faster than ∇SVM and qpSVM with only 200 labeled instances.

Figure 3. Execution time of qpSVM, ∇SVM, and ∇TSVM over the number of labeled examples.

5.2. Multi-Class Classification

For the multi-class classification experiments, we use a cleaned variant of the Cora data set that contains 9,555 linked computer science papers with a reference section. The data set is divided into 8 different classes. We extract term frequencies of the document and of the anchor text of the inbound links. The latter are drawn from three sentences, respectively, centered at the occurrence of the reference. We compare the performance of ∇TSVM with 0/1 loss to the performance of TSVM^light, trained with a one-vs-rest strategy. Figure 4 details the error rates for 200 labeled examples and varying numbers of unlabeled instances. With no unlabeled data, both transductive methods reduce to their fully-supervised, inductive counterparts. Both SVMs perform equally well for the labeled instances. However, when unlabeled examples are included in the training process, the performance of TSVM^light deteriorates. The error rates of ∇TSVM show a slight improvement with 800 unlabeled instances.

Figure 4. Error rates for the Cora data set.

We also apply our method to the 6-class dataset COIL as used in (Chapelle et al., 2006), and compare to the reported one-vs-rest TSVM results. For n = 10 labeled points, we achieve 68.87% error, while the one-vs-rest TSVM achieves 67.50%. For n = 100 points, the results are 25.42% as compared to 25.80%.

5.3. Artificial Sequential Data

The artificial galaxy data set (Lafferty et al., 2004) consists of 100 sequences of length 20, generated by a two-state hidden Markov model. The initial state is chosen uniformly and there is a 10% chance of switching the state. Each state emits instances uniformly from one of the two classes, see Figure 5 (left).

We run ∇SVM and ∇TSVM using Hamming loss with two different kernels, a Gaussian RBF kernel with bandwidth σ = 0.35 and a semi-supervised graph kernel. The graph kernel is constructed from a 10-nearest neighbor graph and given by K = 10 (L + τ)⁻¹, with


Figure 5. The galaxy data set (left) and error rates for ∇SVM and ∇TSVM using RBF (center) and graph kernels (right).

graph Laplacian L and τ = 10⁻⁶ as proposed by Lafferty et al. (2004).

In each experiment we draw a certain number of labeled sequences at random and use the rest either as unlabeled examples or as holdout set. We report averages over 20 runs. Figure 5 (center and right) details the results for semi-supervised vs. supervised algorithms and semi-supervised vs. standard kernels. Since the approaches are orthogonal, we apply all 4 combinations. For increasing numbers of labeled examples, the error rates of the tested models decrease. The continuous TSVM performs just slightly better than the supervised SVM; the differences are significant only in few cases. This problem is extremely well tailored to the Laplacian kernel. The error rates achieved with the semi-supervised kernel are between 20% and 3% lower than the corresponding results for the RBF kernel.

5.4. Named Entity Recognition

The CoNLL2002 data consists of sentences from a Spanish news wire archive and contains 9 label types which distinguish person, organization, location, and other names. We use 3,100 sentences of between 10 and 40 tokens, leading to 24,000 distinct tokens in the dictionary. Moreover, we extract surface clue features, like capitalization features and others. We use a window of size 3, centered around each token.

In each experiment we draw a specified number of labeled and unlabeled training and holdout data without replacement at random in each iteration. We assure that each label occurs at least once in the labeled training data; otherwise, we discard and draw again. We compare ∇TSVM with 0/1 loss and Hamming loss to the HM-SVM (Altun et al., 2003), trained by incrementally solving quadratic programs over subspaces associated with individual input examples. Figure 6 details the results for 10 labeled sequences.

∇SVM converges to better local optima than HM-SVM due to its global conjugate gradient based optimization, as opposed to solving local quadratic programs. When unlabeled examples are included in the training process, the error of the ∇TSVM decreases significantly. ∇TSVM_H with Hamming loss performs slightly better than ∇TSVM_{0/1} using 0/1 loss.

6. Discussion

The TSVM criterion is non-convex and its optimization can be difficult even for binary class variables. In order to scale the TSVM to structured outputs, we employ a technique that eliminates the discrete parameters and allows for conjugate gradient descent in the space of expansion coefficients α. Empirical comparisons of execution time show that the continuous approaches are more efficient than standard approaches based on quadratic programming.

For the Cora text classification problem, transductive learning does not achieve a substantial benefit over supervised learning. Worse yet, the combinatorial TSVM increases the error substantially, whereas ∇TSVM has a negligible effect. In order to draw an unbiased picture, we present this finding with as much emphasis as any positive result. For the Spanish news named entity recognition problem, we consistently observe small but significant improvements over purely supervised learning.

One might intuitively expect transductive learning to outperform supervised learning, because more information is available. However, these test instances introduce non-convexity, and the local minimum retrieved by the optimizer may be worse than the global minimum of the convex supervised problem. Our experiments indicate that this might occasionally occur.

For the galaxy problem, the benefit of ∇TSVM over ∇SVM is marginal, and observable only for very few labeled examples. By its design this problem is very well suited for graph kernels, which reduce the error rate by 50%. In the graph Laplacian approach (Sindhwani et al., 2005), an SVM is trained on the labeled


Figure 6. Token error for the Spanish news wire data set with 10 labeled instances.

data, but in addition to the standard kernel, the graph Laplacian derived from labeled and unlabeled points serves as regularizer. For binary classification, combining TSVM and graph Laplacian yields the greatest benefit (Chapelle & Zien, 2005). For structured variables, we observe a similar effect, though much weaker.

The presented ∇TSVM rests on a cluster assumption for entire structures, while graph-based methods (Lafferty et al., 2004; Altun et al., 2005) exploit the distribution of parts of structures. Both approaches improve over supervised learning on some datasets and fail to do so on others. This raises the question of how to determine which kind of assumptions are appropriate for a given task at hand.

7. Conclusion

We devised a transductive support vector machine for structured variables (∇TSVM). We transformed the original combinatorial and constrained optimization problem into a differentiable and unconstrained one. The resulting optimization problem is still non-convex but can be optimized efficiently, for instance via conjugate gradient descent. A differentiable variant of the SVM for structured variables (∇SVM) is obtained for the special case of a fully labeled training set.

We applied both methods with various loss functions to multi-class classification and sequence labeling problems. Based on our empirical findings, we can rule out the hypothesis that ∇TSVM generally improves learning with structured output variables over purely supervised learning, as well as the hypothesis that ∇TSVM never improves accuracy.

We conjecture that transductive structured output learning could benefit from more research on (i) improved non-convex optimization techniques and (ii) appropriate (cluster) assumptions.

Acknowledgment

We thank John Lafferty, Yan Liu, and Xiaojin Zhu for providing their data. We also thank the anonymous reviewers for detailed comments and suggestions. This work has been partially funded by the German Science Foundation DFG under grant SCHE540/10-2, and it has in part been supported by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.

References

Altun, Y., McAllester, D., & Belkin, M. (2005). Maximum margin semi-supervised learning for structured variables. Advances in Neural Information Processing Systems.

Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003). Hidden Markov support vector machines. Proceedings of the International Conference on Machine Learning.

Brefeld, U., & Scheffer, T. (2006). Semi-supervised learning for structured output variables. Proceedings of the International Conference on Machine Learning.

Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19, 1155-1178.

Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2006). Semi-supervised learning. Cambridge, MA: MIT Press.

Chapelle, O., & Zien, A. (2005). Semi-supervised classification by low density separation. Proceedings of the International Workshop on AI and Statistics.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning.

Lafferty, J., Zhu, X., & Liu, Y. (2004). Kernel conditional random fields: representation and clique selection. Proceedings of the International Conference on Machine Learning.

Lee, C., Wang, S., Jiao, F., Greiner, R., & Schuurmans, D. (2007). Learning to model spatial dependency: Semi-supervised discriminative random fields. Advances in Neural Information Processing Systems.

Sindhwani, V., Niyogi, P., & Belkin, M. (2005). Beyond the point cloud: From transductive to semi-supervised learning. Proceedings of the International Conference on Machine Learning.

Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin Markov networks. Advances in Neural Information Processing Systems.

Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. Proceedings of the International Conference on Machine Learning.
