# Lecture 16: Multiclass Support Vector Machines

Artificial Intelligence and Robotics

16 Oct 2013

Overview of Multiclass Learning
Simultaneous Classification by MSVMs
Extensions of SVM

Hao Helen Zhang
Spring, 2013
## Outline
- One-vs-rest approaches
- Pairwise approaches
- Recent developments for multiclass problems
  - Simultaneous classification
  - Various loss functions
- Extensions of SVM
## Multiclass Classification Setup
- Label: $\{-1, +1\} \to \{1, 2, \ldots, K\}$.
- Classification decision rule: $f: \mathbb{R}^d \Longrightarrow \{1, 2, \ldots, K\}$.
- Classification accuracy is measured by
  - Equal-cost: the generalization error (GE), $\mathrm{Err}(f) = P(Y \neq f(X))$.
  - Unequal-cost: the risk, $R(f) = E_{Y,X}\, C(Y, f(X))$.
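
The two accuracy measures above can be estimated on a labeled sample by replacing the expectation with an average. The sketch below is only an illustration: the function names, the toy labels, and the cost matrix are assumptions, not part of the lecture.

```python
from typing import Sequence


def empirical_error(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples with y != f(x): the sample analogue of Err(f)."""
    return sum(yt != yp for yt, yp in zip(y_true, y_pred)) / len(y_true)


def empirical_risk(y_true: Sequence[int], y_pred: Sequence[int],
                   cost: Sequence[Sequence[float]]) -> float:
    """Average cost C(y, f(x)); cost[j-1][k-1] is the cost of predicting
    class k when the true class is j (labels are 1, ..., K)."""
    return sum(cost[yt - 1][yp - 1] for yt, yp in zip(y_true, y_pred)) / len(y_true)


# Toy data (assumed for illustration): K = 3 classes, 4 samples.
y_true = [1, 2, 3, 3]
y_pred = [1, 3, 3, 1]
unequal_cost = [[0, 1, 1],
                [1, 0, 2],
                [5, 1, 0]]
zero_one_cost = [[0, 1, 1],
                 [1, 0, 1],
                 [1, 1, 0]]

print(empirical_error(y_true, y_pred))
print(empirical_risk(y_true, y_pred, unequal_cost))
```

With the 0-1 cost matrix the empirical risk reduces to the empirical error, which is the equal-cost special case.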
Main ideas:
- (i) Decompose the multiclass classification problem into multiple binary classification problems.
- (ii) Use the majority voting principle (a combined decision from the committee) to predict the label.

Common approaches (simple but effective):
- One-vs-rest (one-vs-all) approaches
- Pairwise (one-vs-one, all-vs-all) approaches
## One-vs-rest Approach
One of the simplest multiclass classifiers; commonly used with SVMs; also known as the one-vs-all (OVA) approach.
- (i) Solve $K$ different binary problems: classify "class $k$" versus "the rest" for $k = 1, \ldots, K$.
- (ii) Assign a test sample to the class giving the largest (most positive) value $f_k(x)$, where $f_k(x)$ is the solution from the $k$th problem.

Properties:
- Very simple to implement; performs well in practice.
- Not optimal (asymptotically): the decision rule is not Fisher consistent if there is no dominating class (i.e. $\max_k p_k(x) < \frac{1}{2}$).

Read: Rifkin and Klautau (2004), "In Defense of One-vs-all Classification".
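
Step (ii) of the one-vs-rest rule is an argmax over the $K$ binary decision values. A minimal sketch follows, assuming the $K$ binary problems have already been solved. The linear scorers here are hypothetical placeholders for trained SVM decision functions.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[Sequence[float]], float]


def ova_predict(scorers: Sequence[Scorer], x: Sequence[float]) -> int:
    """Return the label k in {1, ..., K} whose decision value f_k(x) is largest."""
    scores = [f(x) for f in scorers]
    return max(range(len(scores)), key=scores.__getitem__) + 1


def linear_scorer(w: Sequence[float], b: float) -> Scorer:
    """Hypothetical stand-in for the solution f_k of the k-th binary problem."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


# Three made-up scorers for a K = 3 problem in two dimensions.
scorers: List[Scorer] = [
    linear_scorer([1.0, 0.0], 0.0),    # "class 1" vs the rest
    linear_scorer([0.0, 1.0], 0.0),    # "class 2" vs the rest
    linear_scorer([-1.0, -1.0], 0.0),  # "class 3" vs the rest
]

print(ova_predict(scorers, [2.0, 0.5]))
```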
## Pairwise Approach
Also known as the all-vs-all (AVA) approach.
- (i) Solve $\binom{K}{2}$ different binary problems: classify "class $i$" versus "class $j$" for all $j \neq i$. Each classifier is called $g_{ij}$.
- (ii) For prediction at a point, each classifier is queried once and issues a vote. The class with the maximum number of votes is chosen.

Properties:
- The training process is efficient, since it deals with small binary problems.
- If $K$ is big, there are too many problems to solve. If $K = 10$, we need to train 45 binary classifiers.
- Simple to implement; performs competitively in practice.
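
The voting step can be sketched as follows. The sign convention (a positive score votes for the first class of the pair), the tie-breaking rule, and the one-dimensional nearest-center classifiers are all assumptions made for illustration. In practice each $g_{ij}$ would be a trained binary SVM.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Tuple

PairRule = Callable[[float], float]


def ava_predict(pairwise: Dict[Tuple[int, int], PairRule], x: float, K: int) -> int:
    """Majority vote over all K(K-1)/2 pairwise classifiers.

    pairwise[(i, j)] with i < j returns a score; a positive score votes for
    class i, anything else votes for class j (convention assumed here).
    """
    votes: Counter = Counter()
    for i, j in combinations(range(1, K + 1), 2):
        votes[i if pairwise[(i, j)](x) > 0 else j] += 1
    # Ties are broken toward the smallest label, an arbitrary choice.
    return min(votes, key=lambda k: (-votes[k], k))


def nearest_center_rule(c_i: float, c_j: float) -> PairRule:
    """Hypothetical stand-in for a trained g_ij: positive when x is closer to c_i."""
    return lambda x: abs(x - c_j) - abs(x - c_i)


# Made-up class centers on the real line for K = 3.
centers = {1: 0.0, 2: 5.0, 3: 10.0}
pairwise = {(i, j): nearest_center_rule(centers[i], centers[j])
            for i, j in combinations(sorted(centers), 2)}

print(ava_predict(pairwise, 4.0, K=3))
```

The number of classifiers grows quadratically: `len(list(combinations(range(10), 2)))` gives the 45 problems quoted above for $K = 10$.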
Various Loss Functions
Generalized Functional Margin
## One Single SVM Approach: Simultaneous Classification
- Label: $\{-1, +1\} \to \{1, 2, \ldots, K\}$.
- Use one single SVM to construct a decision function vector $\mathbf{f} = (f_1, \ldots, f_K)$.
- Classifier (decision rule): $f(x) = \arg\max_{k = 1, \ldots, K} f_k(x)$.
- If $K = 2$, there is one $f_k$ and the decision rule is $\operatorname{sign}(f_k)$.
- In some sense, multiple logistic regression is a simultaneous classification procedure.
## SVM for Multiclass Problem
Multiclass SVM: solve one single regularization problem by imposing a penalty on the values of the differences $f_y(x) - f_l(x)$.
- Weston and Watkins (1999)
- Crammer and Singer (2002)
- Lee et al. (2004)
- Liu and Shen (2006); multiclass $\psi$-learning: Shen et al. (2003)
## Various Multiclass SVMs
- Weston and Watkins (1999): a penalty is imposed only if $f_y(x) < f_k(x) + 2$ for $k \neq y$.
  - Even if $f_y(x) < 1$, a penalty is not imposed as long as $f_k(x)$ is sufficiently small for $k \neq y$.
  - Similarly, if $f_k(x) > -1$ for $k \neq y$, we do not pay a penalty if $f_y(x)$ is sufficiently large.

  $$L(y, \mathbf{f}(x)) = \sum_{k \neq y} [2 - (f_y(x) - f_k(x))]_+ .$$
- Lee et al. (2004): $L(y, \mathbf{f}(x)) = \sum_{k \neq y} [f_k(x) + 1]_+$.
- Crammer and Singer (2002); Liu and Shen (2006): $L(y, \mathbf{f}(x)) = [1 - \min_{k \neq y} \{f_y(x) - f_k(x)\}]_+$.

To avoid redundancy, a sum-to-zero constraint $\sum_{k=1}^{K} f_k = 0$ is sometimes enforced.
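
To see how the three losses differ, the sketch below evaluates each of them on one decision vector. The function names and the example vector (chosen to satisfy the sum-to-zero constraint) are assumptions for illustration only.

```python
from typing import Sequence


def pos(z: float) -> float:
    """Positive part [z]_+ = max(z, 0)."""
    return max(z, 0.0)


def weston_watkins_loss(y: int, f: Sequence[float]) -> float:
    """sum over k != y of [2 - (f_y - f_k)]_+ ; labels are 1, ..., K."""
    return sum(pos(2.0 - (f[y - 1] - fk)) for k, fk in enumerate(f, start=1) if k != y)


def lee_loss(y: int, f: Sequence[float]) -> float:
    """sum over k != y of [f_k + 1]_+ ."""
    return sum(pos(fk + 1.0) for k, fk in enumerate(f, start=1) if k != y)


def liu_shen_loss(y: int, f: Sequence[float]) -> float:
    """[1 - min over k != y of (f_y - f_k)]_+ ."""
    return pos(1.0 - min(f[y - 1] - fk for k, fk in enumerate(f, start=1) if k != y))


# Example decision vector with f_1 + f_2 + f_3 = 0 (assumed values).
f_x = [1.0, -0.5, -0.5]

for name, loss in [("Weston-Watkins", weston_watkins_loss),
                   ("Lee et al.", lee_loss),
                   ("Liu-Shen", liu_shen_loss)]:
    print(name, loss(1, f_x), loss(2, f_x))
```

When the true label is 1, the pairwise gaps are 1.5, so the Liu-Shen loss is already zero while the other two still charge a small penalty. When the true label is 2, the point is misclassified and all three losses are at least 1.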
## Linear Multiclass SVMs
For linear classification problems, we have
$$f_k(x) = \beta_k^T x + \beta_{0k}, \quad k = 1, \ldots, K.$$
The sum-to-zero constraint can be replaced by
$$\sum_{k=1}^{K} \beta_{0k} = 0, \qquad \sum_{k=1}^{K} \beta_k = 0.$$
The optimization problem becomes
$$\min_{\mathbf{f}} \sum_{i=1}^{n} L(y_i, \mathbf{f}(x_i)) + \lambda \sum_{k=1}^{K} \|\beta_k\|^2$$
subject to the sum-to-zero constraint.
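
The objective above can be evaluated directly for a candidate set of coefficients. This is a sketch of the objective only, not a solver. The Liu-Shen form is picked as the loss purely for illustration, and the coefficient values, data, and $\lambda$ are made up.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def linear_scores(betas: Sequence[Vector], intercepts: Vector, x: Vector) -> List[float]:
    """f_k(x) = beta_k . x + beta_0k for k = 1, ..., K."""
    return [sum(b * xi for b, xi in zip(beta_k, x)) + b0
            for beta_k, b0 in zip(betas, intercepts)]


def liu_shen_loss(y: int, f: Vector) -> float:
    """[1 - min over k != y of (f_y - f_k)]_+ with labels 1, ..., K."""
    return max(0.0, 1.0 - min(f[y - 1] - fk for k, fk in enumerate(f, start=1) if k != y))


def satisfies_sum_to_zero(betas: Sequence[Vector], intercepts: Vector,
                          tol: float = 1e-9) -> bool:
    """Check that the intercepts and each coordinate of the betas sum to zero over k."""
    coordinate_sums = [sum(column) for column in zip(*betas)]
    return abs(sum(intercepts)) <= tol and all(abs(s) <= tol for s in coordinate_sums)


def msvm_objective(betas: Sequence[Vector], intercepts: Vector,
                   X: Sequence[Vector], y: Sequence[int], lam: float,
                   loss: Callable[[int, Vector], float] = liu_shen_loss) -> float:
    """Data-fit term plus lam times the sum of squared Euclidean norms of the beta_k."""
    data_fit = sum(loss(yi, linear_scores(betas, intercepts, xi)) for xi, yi in zip(X, y))
    penalty = lam * sum(sum(b * b for b in beta_k) for beta_k in betas)
    return data_fit + penalty


# Made-up K = 3, d = 2 example.
betas = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
intercepts = [0.0, 0.0, 0.0]
X = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
y = [1, 2, 3]

print(satisfies_sum_to_zero(betas, intercepts))
print(msvm_objective(betas, intercepts, X, y, lam=0.5))
```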
## Nonlinear Multiclass SVMs
To achieve nonlinear classification, we assume
$$f_k(x) = \beta_k^T \phi(x) + \beta_{0k}, \quad k = 1, \ldots, K,$$
where $\phi(x)$ represents the basis functions in the feature space $\mathcal{F}$.
Similar to binary classification, the nonlinear MSVM can be conveniently solved using a kernel function.
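
With a kernel, the decision functions are usually evaluated without forming $\phi(x)$ explicitly. The sketch below assumes the standard kernel expansion $f_k(x) = \sum_i c_{ik} K(x_i, x) + \beta_{0k}$, which is not written out in the slides, together with a Gaussian kernel. The coefficients are made-up values rather than the output of a fitted MSVM.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def rbf_kernel(u: Vector, v: Vector, gamma: float = 1.0) -> float:
    """Gaussian kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def kernel_scores(coefs: Sequence[Vector], intercepts: Vector,
                  X_train: Sequence[Vector], x: Vector,
                  kernel: Callable[[Vector, Vector], float] = rbf_kernel) -> List[float]:
    """Evaluate f_k(x) = sum_i coefs[k][i] * K(x_i, x) + intercepts[k] for every k."""
    k_vals = [kernel(xi, x) for xi in X_train]
    return [sum(c * kv for c, kv in zip(coef_k, k_vals)) + b0
            for coef_k, b0 in zip(coefs, intercepts)]


# Two training points and K = 2 classes; coefficients sum to zero over k.
X_train = [[0.0], [3.0]]
coefs = [[1.0, -1.0],   # class 1
         [-1.0, 1.0]]   # class 2
intercepts = [0.0, 0.0]

scores = kernel_scores(coefs, intercepts, X_train, [0.0])
print(scores)
```

A query point next to the first training point receives a positive score for class 1 and the mirrored negative score for class 2, so the argmax rule picks class 1.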
## Regularization Problems for Nonlinear MSVMs
We can represent the MSVM as the solution to a regularization problem in the RKHS.
Assume that
$$\mathbf{f}(x) = (f_1(x), \ldots, f_K(x)) \in \prod_{k=1}^{K} (\{1\} + \mathcal{H}_k)$$
under the sum-to-zero constraint.
Then an MSVM classifier can be derived by solving
$$\min_{\mathbf{f}} \sum_{i=1}^{n} L(y_i, \mathbf{f}(x_i)) + \lambda \sum_{k=1}^{K} \|g_k\|^2_{\mathcal{H}_k},$$
where $f_k(x) = g_k(x) + \beta_{0k}$, $g_k \in \mathcal{H}_k$, and $\beta_{0k} \in \mathbb{R}$.
## Generalized Functional Margin
Given $(x, y)$, a reasonable decision vector $\mathbf{f}(x)$ should
- encourage a large value for $f_y(x)$;
- have small values for $f_k(x)$, $k \neq y$.

Define the $(K-1)$-vector of relative differences as
$$g = (f_y(x) - f_1(x), \ldots, f_y(x) - f_{y-1}(x), f_y(x) - f_{y+1}(x), \ldots, f_y(x) - f_K(x)).$$
- Liu et al. (2004) called the vector $g$ the generalized functional margin of $\mathbf{f}$.
- $g$ characterizes the correctness and strength of the classification of $x$ by $\mathbf{f}$.
- $\mathbf{f}$ indicates a correct classification of $(x, y)$ if $g(\mathbf{f}(x), y) > \mathbf{0}_{K-1}$.
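
The margin vector is straightforward to compute from a decision vector. A minimal sketch, with function names and example values assumed for illustration:

```python
from typing import List, Sequence


def functional_margin(f: Sequence[float], y: int) -> List[float]:
    """Return the (K-1)-vector of differences f_y - f_k for k != y (labels 1, ..., K)."""
    f_y = f[y - 1]
    return [f_y - fk for k, fk in enumerate(f, start=1) if k != y]


def is_correct(f: Sequence[float], y: int) -> bool:
    """A point is classified correctly when every component of g is positive."""
    return all(u > 0 for u in functional_margin(f, y))


f_x = [1.0, -0.5, -0.5]  # made-up decision vector for K = 3

print(functional_margin(f_x, 1), is_correct(f_x, 1))
print(functional_margin(f_x, 2), is_correct(f_x, 2))
```

The size of the smallest component measures how strongly the point is classified: larger positive values mean the true class beats its closest competitor by a wider gap.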
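## 0-1 Loss with Functional Margin
A point $(x, y)$ is misclassified if $y \neq \arg\max_k f_k(x)$.
Define the multivariate sign function as
$$\operatorname{sign}(\mathbf{u}) = \begin{cases} 1 & \text{if } u_{\min} = \min(u_1, \ldots, u_m) > 0, \\ -1 & \text{if } u_{\min} \le 0, \end{cases}$$
where $\mathbf{u} = (u_1, \ldots, u_m)$. Using the functional margin:
- The 0-1 loss becomes
  $$I(\min g(\mathbf{f}(x), y) \le 0) = \frac{1}{2}[1 - \operatorname{sign}(g(\mathbf{f}(x), y))].$$
- The GE becomes
  $$R[\mathbf{f}] = \frac{1}{2} E[1 - \operatorname{sign}(g(\mathbf{f}(x), y))].$$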
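
The identities above translate directly into code. The sketch below takes precomputed margin vectors as input, and the sample margins are made-up values. Averaging the per-point loss gives a sample estimate of the GE.

```python
from typing import Sequence


def multivariate_sign(u: Sequence[float]) -> int:
    """+1 if the smallest component of u is positive, otherwise -1."""
    return 1 if min(u) > 0 else -1


def zero_one_loss(g: Sequence[float]) -> float:
    """0-1 loss written through the margin: (1 - sign(g)) / 2."""
    return 0.5 * (1 - multivariate_sign(g))


def empirical_ge(margins: Sequence[Sequence[float]]) -> float:
    """Sample average of the 0-1 loss over a list of margin vectors."""
    return sum(zero_one_loss(g) for g in margins) / len(margins)


# Margin vectors for four hypothetical points in a K = 3 problem.
margins = [[1.5, 1.5], [-1.5, 0.0], [0.2, 0.7], [0.3, -0.1]]

print([zero_one_loss(g) for g in margins])
print(empirical_ge(margins))
```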
## Generalized Loss Functions Using Functional Margin
A natural way to generalize the binary loss is
$$\sum_{i=1}^{n} \ell(\min g(\mathbf{f}(x_i), y_i)).$$
In particular, the loss function $L(y, \mathbf{f}(x))$ can be expressed as $V(g(\mathbf{f}(x), y))$ with
- Weston and Watkins (1999): $V(\mathbf{u}) = \sum_{j=1}^{K-1} [2 - u_j]_+$.
- Lee et al. (2004): $V(\mathbf{u}) = \sum_{j=1}^{K-1} \left[\frac{\sum_{c=1}^{K-1} u_c}{K} - u_j + 1\right]_+$.
- Liu and Shen (2006): $V(\mathbf{u}) = [1 - \min_j u_j]_+$.

All of these loss functions are upper bounds of the 0-1 loss.
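
The upper-bound claim can be checked numerically on a grid of margin vectors. The sketch below implements the three $V$ functions as written above; the grid values are arbitrary, and the 0-1 loss treats a zero margin as an error, as an assumed tie convention.

```python
from itertools import product
from typing import Sequence


def pos(z: float) -> float:
    """Positive part [z]_+ = max(z, 0)."""
    return max(z, 0.0)


def v_weston_watkins(u: Sequence[float]) -> float:
    return sum(pos(2.0 - uj) for uj in u)


def v_lee(u: Sequence[float]) -> float:
    K = len(u) + 1
    mean_term = sum(u) / K  # equals f_y(x) under the sum-to-zero constraint
    return sum(pos(mean_term - uj + 1.0) for uj in u)


def v_liu_shen(u: Sequence[float]) -> float:
    return pos(1.0 - min(u))


def zero_one(u: Sequence[float]) -> float:
    """1 when some component of the margin is non-positive, else 0."""
    return 0.0 if min(u) > 0 else 1.0


# Check the bound on an arbitrary grid of margin vectors for K = 3.
grid = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
bound_holds = all(
    V(u) >= zero_one(u)
    for u in product(grid, repeat=2)
    for V in (v_weston_watkins, v_lee, v_liu_shen)
)
print(bound_holds)
```

For the margin vector $(1.5, 1.5)$ these functions return 1.0, 1.0 and 0.0, matching the values obtained from the $L(y, \mathbf{f}(x))$ forms with $\mathbf{f}(x) = (1, -0.5, -0.5)$ and $y = 1$.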
## Characteristics of Support Vector Machines
- High accuracy, high flexibility
- Naturally handle high-dimensional data
- Sparse representation of the solutions (via support vectors): fast for making future predictions
- No probability estimates (hard classifiers)
## Other Active Problems in SVM
- Variable/feature selection
  - Linear SVM: Bradley and Mangasarian (1998), Guyon et al. (2000), Rakotomamonjy (2003), Jebara and Jaakkola (2000)
  - Nonlinear SVM: Weston et al. (2002), Grandvalet (2003), basis pursuit (Zhang 2003), COSSO selection (Lin & Zhang 2003)
- Proximal SVM: faster computation
- Robust SVM: get rid of outliers
- Choice of kernels
## The $L_1$ SVM
- Replace the $L_2$ penalty by the $L_1$ penalty.
- The $L_1$ penalty tends to give sparse solutions.
- For $f(x) = h(x)^T \beta + \beta_0$, the $L_1$ SVM solves
  $$\min_{\beta_0, \beta} \sum_{i=1}^{n} [1 - y_i f(x_i)]_+ + \lambda \sum_{j=1}^{d} |\beta_j|. \tag{1}$$
- The solution will have at most $n$ nonzero coefficients $\beta_j$.
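
Objective (1) is easy to evaluate for a given coefficient vector, which helps when comparing candidate solutions. The sketch below only evaluates the objective and does not solve the minimization. The basis expansion is assumed to be precomputed as rows $h(x_i)$, and the data and $\lambda$ are made up.

```python
from typing import Sequence


def l1_svm_objective(beta0: float, beta: Sequence[float],
                     H: Sequence[Sequence[float]], y: Sequence[int],
                     lam: float) -> float:
    """Hinge loss summed over the sample plus lam times the L1 norm of beta.

    H[i] holds the basis values h(x_i); labels y[i] are -1 or +1.
    The intercept beta0 is not penalized.
    """
    hinge_sum = 0.0
    for h_i, y_i in zip(H, y):
        f_i = beta0 + sum(b * h for b, h in zip(beta, h_i))
        hinge_sum += max(0.0, 1.0 - y_i * f_i)
    return hinge_sum + lam * sum(abs(b) for b in beta)


# Two points, two basis functions; the second basis function is irrelevant.
H = [[2.0, 0.0], [-2.0, 0.0]]
y = [1, -1]

print(l1_svm_objective(0.0, [1.0, 0.0], H, y, lam=0.5))
print(l1_svm_objective(0.0, [0.25, 0.0], H, y, lam=0.5))
```

In this toy case, putting any weight on the second coefficient would add to the penalty without reducing the hinge term, which is the mechanism behind the sparse solutions mentioned above.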
## $L_1$ Penalty versus $L_2$ Penalty
## Robust Support Vector Machines
- The hinge loss is unbounded, so it is sensitive to outliers (e.g. wrong labels).
- Support vectors: $y_i f(x_i) \le 1$.
- Truncated hinge loss: $T_s(u) = H_1(u) - H_s(u)$, where $H_s(u) = [s - u]_+$.
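
A short sketch of the truncated hinge loss shows why it is robust: for any truncation point $s < 1$ the loss is capped at $1 - s$, whereas the ordinary hinge loss $H_1$ grows without bound. The default $s = 0$ and the test values are assumptions for illustration.

```python
def hinge_at(u: float, s: float) -> float:
    """H_s(u) = [s - u]_+ ; s = 1 gives the ordinary hinge loss."""
    return max(s - u, 0.0)


def truncated_hinge(u: float, s: float = 0.0) -> float:
    """T_s(u) = H_1(u) - H_s(u), bounded above by 1 - s when s < 1."""
    return hinge_at(u, 1.0) - hinge_at(u, s)


# u plays the role of y * f(x); very negative u mimics a badly mislabeled point.
for u in (2.0, 0.5, -1.0, -10.0):
    print(u, hinge_at(u, 1.0), truncated_hinge(u, s=0.0))
```

At $u = -10$ the hinge loss is 11 while the truncated loss with $s = 0$ stays at 1, so a single outlier cannot dominate the objective.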
## Decomposition: Difference of Convex Functions
Key: D.C. decomposition (difference of convex functions).
$$T_s(u) = H_1(u) - H_s(u).$$
## D.C. Algorithm
D.C. Algorithm: the Difference Convex Algorithm for minimizing
$$J(\Theta) = J_{vex}(\Theta) + J_{cav}(\Theta)$$
1. Initialize $\Theta_0$.
2. Repeat $\Theta_{t+1} = \arg\min_{\Theta} \left( J_{vex}(\Theta) + \langle J'_{cav}(\Theta_t), \Theta - \Theta_t \rangle \right)$ until convergence of $\Theta_t$.

- The algorithm converges in finitely many steps (Liu et al. (2005)).
- Choice of initial values: use the SVM solution.
- RSVM: the set of SVs is only a SUBSET of the original one!
- Nonlinear learning can be achieved by the kernel trick.
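
The iteration can be illustrated on a one-dimensional toy problem rather than the robust SVM itself. The sketch below assumes $J_{vex}(\theta) = \theta^4$ and $J_{cav}(\theta) = -2\theta^2$, chosen only because the convex subproblem then has a closed-form solution. The function names, tolerance, and starting points are likewise assumptions. This toy problem converges in the limit, so the stopping rule uses a tolerance.

```python
import math
from typing import Callable


def dca(grad_cav: Callable[[float], float],
        solve_convex: Callable[[float], float],
        theta0: float, tol: float = 1e-10, max_iter: int = 1000) -> float:
    """Generic D.C. iteration in one dimension.

    At each step the concave part is replaced by its tangent line at the
    current iterate, and the resulting convex problem is solved exactly.
    """
    theta = theta0
    for _ in range(max_iter):
        slope = grad_cav(theta)           # J'_cav at the current iterate
        new_theta = solve_convex(slope)   # argmin of J_vex(t) + slope * t
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


def grad_cav(theta: float) -> float:
    """Derivative of the concave part J_cav(theta) = -2 * theta**2."""
    return -4.0 * theta


def solve_convex(slope: float) -> float:
    """Minimize theta**4 + slope * theta: set 4 * theta**3 = -slope."""
    c = -slope / 4.0
    return math.copysign(abs(c) ** (1.0 / 3.0), c)


# J(theta) = theta**4 - 2 * theta**2 has minimizers at theta = -1 and theta = +1.
print(dca(grad_cav, solve_convex, theta0=0.5))
print(dca(grad_cav, solve_convex, theta0=-0.5))
```

Which minimizer is reached depends on the starting point, which is why the choice of initial values matters for a non-convex objective.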