# Lecture 5b - Support Vector Machines (SVM)

Part II

Gastón Schlotthauer (gschlott@bioingenieria.edu.ar)
Tópicos Selectos en Aprendizaje Maquinal
November 23, 2007
## Organization

1. Overview of kernel methods
2. More kernel-based algorithms
3. Multiclass SVM
4. Probabilistic output SVM
5. Bibliography
*Lecture 5b (T.S. Aprendizaje Maquinal), Support Vector Machines (SVM), November 23, 2007.*
## Non-linear boundary surface

**Nonlinear mapping**

$$\varphi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad (x_1, x_2) \mapsto (z_1, z_2, z_3) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)$$

$$\varphi(x) \cdot \varphi(x') = z \cdot z' = (x \cdot x')^2 = k(x, x')$$
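This identity is easy to verify numerically (a minimal sketch; `phi` below is the explicit map defined above):

```python
import numpy as np

def phi(x):
    # Explicit feature map R^2 -> R^3: (x1, x2) |-> (x1^2, sqrt(2) x1 x2, x2^2)
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

lhs = phi(x) @ phi(xp)       # inner product computed in the feature space
rhs = (x @ xp) ** 2          # kernel evaluated directly in the input space
assert np.isclose(lhs, rhs)  # phi(x) . phi(x') = (x . x')^2
```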
## Kernel Trick

This finding generalizes. For $x, x' \in X$ (in general $X \subseteq \mathbb{R}^n$) and $d \in \mathbb{N}$, the kernel function

$$k(x, x') = (x \cdot x')^d$$

computes a scalar product in the space $E$ of all the monomials of degree $d$. The dimension of this new space is

$$\binom{n + d - 1}{d}.$$
## Kernel Trick

If the kernel is defined as $k(x, x') = (x \cdot x' + c)^d$, $c \in \mathbb{R}$, then the dimension is

$$\binom{n + d}{d}.$$

If the input vectors are 16 × 16 images (i.e. 256-dimensional vectors) and we choose as nonlinearity all 5th-order monomials, then one would map to a space that contains all 5th-order products of 256 pixels, i.e. to a

$$\binom{5 + 256 - 1}{5} \approx 10^{10}$$

dimensional space.
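These dimension counts are cheap to reproduce; `math.comb` evaluates the binomial coefficients in the formulas above:

```python
from math import comb

n, d = 256, 5                          # 16 x 16 images, 5th-order monomials
dim_homogeneous = comb(n + d - 1, d)   # monomials of degree exactly d
dim_inhomogeneous = comb(n + d, d)     # with (x.z + c)^d: all monomials up to degree d
print(dim_homogeneous)                 # roughly 10**10, as stated above
```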
## Kernel Trick

If we have a way of computing the inner product $\varphi(x) \cdot \varphi(x')$ in the feature space $E$ directly as a function of the original input points, it becomes possible to merge the two steps needed to build a non-linear learning machine. We call such a direct computation a kernel function.

In the dual optimization problem for SVM, the data $x$ occur only in inner products. Thus, we can replace every occurrence of $x \cdot x'$ by $k(x, x')$ and solve the SVM in $E$ instead of $X$.
## More rigorously...

**Mercer's Theorem.** Let $X$ be a compact subset of $\mathbb{R}^n$. Suppose $k$ is a continuous symmetric function such that the integral operator $T_k : L_2(X) \to L_2(X)$,

$$(T_k f)(\cdot) = \int_X k(\cdot, x)\, f(x)\, dx,$$

is positive, that is,

$$\int_{X \times X} k(x_1, x_2)\, f(x_1)\, f(x_2)\, dx_1\, dx_2 \geq 0 \qquad \forall f \in L_2(X).$$

Then we can expand $k(x_1, x_2)$ in a uniformly convergent series in terms of $T_k$'s eigenfunctions $\varphi_j \in L_2(X)$, normalized in such a way that $\|\varphi_j\|_{L_2} = 1$, and positive associated eigenvalues $\lambda_j \geq 0$:

$$k(x_1, x_2) = \sum_{j=1}^{\infty} \lambda_j\, \varphi_j(x_1)\, \varphi_j(x_2).$$
## Kernel characterization

**Mercer's Theorem consequence.** Consider a finite input space $X = \{x_1, x_2, \ldots, x_n\}$ and suppose $k(x, x')$ is a symmetric function on $X$ (i.e. $k(x, x') = k(x', x)$). Consider the matrix $K$ with elements

$$K_{ij} = k(x_i, x_j).$$

A necessary and sufficient condition for $k$ to be a valid kernel is that the matrix $K$ be positive semidefinite for all possible choices of $x \in X$.
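This condition is easy to probe numerically (a sketch, not a proof): sample some points, build the Gram matrix of a polynomial kernel, and check that its eigenvalues are nonnegative up to roundoff.

```python
import numpy as np

def gram(k, X):
    # Gram matrix K_ij = k(x_i, x_j) for a finite sample X
    return np.array([[k(xi, xj) for xj in X] for xi in X])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))

poly = lambda x, z: (x @ z) ** 2        # the degree-2 polynomial kernel
K = gram(poly, X)
eigvals = np.linalg.eigvalsh(K)
assert np.allclose(K, K.T)              # symmetric
assert eigvals.min() > -1e-9            # positive semidefinite (up to roundoff)
```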
## An example

$$x = (x_1, x_2), \qquad z = (z_1, z_2)$$

$$\begin{aligned}
k(x, z) = (x^T z)^2 &= (x_1 z_1 + x_2 z_2)^2 \\
&= x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 \\
&= \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right) \cdot \left(z_1^2,\ \sqrt{2}\,z_1 z_2,\ z_2^2\right) \\
&= \varphi(x) \cdot \varphi(z).
\end{aligned}$$
## Common kernel functions

| Kernel | $k(x, z)$ |
|---|---|
| Gaussian RBF | $\exp\left(-\lVert x - z\rVert^2 / \sigma^2\right)$ |
| Polynomial | $(x \cdot z + c)^d$ |
| Sigmoidal | $\tanh(c\,(x \cdot z) + \theta)$ |
| Inverse multiquadric | $1 / \sqrt{\lVert x - z\rVert^2 + c^2}$ |

with $\sigma, c, \theta \in \mathbb{R}$ and $d \in \mathbb{N}$.
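The four kernels in the table can be written directly (one common parameterization; the default values of `sigma`, `c`, `d`, and `theta` below are illustrative):

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    # Gaussian RBF: exp(-||x - z||^2 / sigma^2)
    return np.exp(-np.sum((x - z) ** 2) / sigma**2)

def poly(x, z, c=1.0, d=3):
    # Polynomial: (x . z + c)^d
    return (x @ z + c) ** d

def sigmoid(x, z, c=1.0, theta=0.0):
    # Sigmoidal: tanh(c (x . z) + theta)
    return np.tanh(c * (x @ z) + theta)

def inv_multiquadric(x, z, c=1.0):
    # Inverse multiquadric: 1 / sqrt(||x - z||^2 + c^2)
    return 1.0 / np.sqrt(np.sum((x - z) ** 2) + c**2)
```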
## Techniques for constructing new kernels

**Making kernels from kernels.** Let $k_1, k_2$ be kernels over $X \times X$, $X \subseteq \mathbb{R}^n$, $a \in \mathbb{R}^+$, $f(\cdot)$ a real-valued function on $X$, $\varphi : X \to \mathbb{R}^m$, $x, z \in X$, $q(\cdot)$ a polynomial with nonnegative coefficients, $k_3$ a kernel over $\mathbb{R}^m \times \mathbb{R}^m$, and $B$ a symmetric positive semidefinite $n \times n$ matrix. Then the following functions are kernels:

$$\begin{aligned}
k(x, z) &= k_1(x, z) + k_2(x, z), \\
k(x, z) &= a\,k_1(x, z), \\
k(x, z) &= k_1(x, z)\,k_2(x, z), \\
k(x, z) &= q(k_1(x, z)), \\
k(x, z) &= f(x)\,k_1(x, z)\,f(z), \\
k(x, z) &= k_3(\varphi(x), \varphi(z)), \\
k(x, z) &= \exp(k_1(x, z)), \\
k(x, z) &= x^T B z.
\end{aligned}$$
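Several of these closure properties can be spot-checked numerically on a random sample (a sketch, not a proof; the small negative tolerance absorbs floating-point roundoff):

```python
import numpy as np

rng = np.random.default_rng(1)
X = 0.3 * rng.normal(size=(15, 3))    # small scale keeps exp(k1) well conditioned

def gram(k):
    return np.array([[k(xi, xj) for xj in X] for xi in X])

k1 = lambda x, z: (x @ z + 1.0) ** 2               # polynomial kernel
k2 = lambda x, z: np.exp(-np.sum((x - z) ** 2))    # Gaussian kernel

combos = [lambda x, z: k1(x, z) + k2(x, z),   # sum of kernels
          lambda x, z: 3.0 * k1(x, z),        # positive scaling
          lambda x, z: k1(x, z) * k2(x, z),   # product of kernels
          lambda x, z: np.exp(k1(x, z))]      # exponential of a kernel

for k in combos:
    assert np.linalg.eigvalsh(gram(k)).min() > -1e-8   # PSD up to roundoff
```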
## Back in the SVM...

Remembering the linear soft-margin SVM problem, for a training set $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$:

Maximize

$$W(\alpha) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j\, x_i \cdot x_j + \sum_{i=1}^{N} \alpha_i$$

under the constraints

$$\sum_{i=1}^{N} y_i \alpha_i = 0, \qquad 0 \leq \alpha_i \leq C, \quad \text{for } i = 1, 2, \ldots, N.$$
## ...we add the kernel-induced high dimensionality

Using the feature space implicitly defined by the kernel $k(x, z)$, for a training set $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$:

Maximize

$$W(\alpha) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j\, \varphi(x_i) \cdot \varphi(x_j) + \sum_{i=1}^{N} \alpha_i = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j\, k(x_i, x_j) + \sum_{i=1}^{N} \alpha_i$$

under the constraints

$$\sum_{i=1}^{N} y_i \alpha_i = 0, \qquad 0 \leq \alpha_i \leq C, \quad \text{for } i = 1, 2, \ldots, N.$$
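The kernelized objective and its constraints transcribe directly to code (a sketch that only evaluates $W(\alpha)$ and checks feasibility; an actual solver, e.g. SMO, would maximize it):

```python
import numpy as np

def dual_objective(alpha, y, K):
    # W(alpha) = -1/2 sum_ij y_i y_j alpha_i alpha_j K_ij + sum_i alpha_i
    Q = (y[:, None] * y[None, :]) * K
    return -0.5 * alpha @ Q @ alpha + alpha.sum()

def feasible(alpha, y, C):
    # constraints: sum_i y_i alpha_i = 0  and  0 <= alpha_i <= C
    return bool(np.isclose(alpha @ y, 0.0) and np.all((alpha >= 0) & (alpha <= C)))
```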
## Kernel PCA

Let the data be $\{x_1, \ldots, x_N\} \subset \mathbb{R}^n$ and the covariance matrix in feature space

$$C = \frac{1}{N} \sum_{j=1}^{N} \varphi(x_j)\, \varphi(x_j)^T.$$

The principal components are computed by solving the eigenvalue problem, finding $\lambda > 0$, $V \neq 0$ with

$$\lambda V = C V = \frac{1}{N} \sum_{j=1}^{N} \left(\varphi(x_j) \cdot V\right) \varphi(x_j).$$

All eigenvectors with non-zero eigenvalue must lie in the span of the mapped data, i.e. $V \in \operatorname{span}\{\varphi(x_1), \ldots, \varphi(x_N)\}$. Then

$$V = \sum_{i=1}^{N} \alpha_i\, \varphi(x_i).$$
## Kernel PCA

Multiplying with $\varphi(x_k)^T$ from the left:

$$\lambda\left(\varphi(x_k) \cdot V\right) = \varphi(x_k) \cdot (C V), \qquad k = 1, \ldots, N.$$

Remembering the $N \times N$ matrix

$$K_{ij} = \varphi(x_i) \cdot \varphi(x_j) = k(x_i, x_j),$$

we obtain an eigenvalue problem for the expansion coefficients $\alpha_i$ that now depends only on the kernel function:

$$\lambda K \alpha = \frac{1}{N} K^2 \alpha \quad \Longrightarrow \quad N \lambda\, \alpha = K \alpha.$$
## Kernel PCA

The data need to be centered. This can be done by substituting the kernel matrix $K$ with

$$\tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N,$$

where $(\mathbf{1}_N)_{ij} = 1/N$.

The principal components $t$ of a test vector $x_t$ are then extracted with kernel PCA by projecting the mapped vector $\varphi(x_t)$ onto the eigenvectors $V^k$:

$$t_k = V^k \cdot \varphi(x_t) = \sum_{i=1}^{N} \alpha_i^k\, k(x_i, x_t), \qquad k = 1, 2, \ldots, p.$$

In a similar way the Fisher discriminant can be "kernelized".
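The whole procedure (centering, eigendecomposition, projection) fits in a short numpy sketch. It projects the training points themselves; the scaling of each $\alpha^k$ enforces $\|V^k\| = 1$ via $\|V^k\|^2 = (N\lambda_k)\,\alpha^k \cdot \alpha^k$:

```python
import numpy as np

def kernel_pca(K, p):
    """Kernel PCA projections of the training data.

    K: uncentered N x N kernel matrix; p: number of components to keep."""
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)                 # (1_N)_ij = 1/N
    Kc = K - one @ K - K @ one + one @ K @ one     # centered kernel matrix K~
    lam, A = np.linalg.eigh(Kc)                    # solves (N*lambda) alpha = K~ alpha
    lam, A = lam[::-1], A[:, ::-1]                 # eigh returns ascending order
    # scale alpha^k so that ||V^k|| = 1
    A = A[:, :p] / np.sqrt(np.maximum(lam[:p], 1e-12))
    return Kc @ A                                  # row j holds (t_1, ..., t_p) for x_j
```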
## Extending SVM for the multiclass problem

SVMs were originally designed for binary classification. Their extension to multiclass classification is still an ongoing research issue. There are two types of approaches:

- Constructing and combining several binary classifiers.
- Considering all data in one optimization formulation.
## Binary classification based methods

There are three well-known approaches:

- One-against-one.
- One-against-all.
- DAGSVM.
## One-against-all (OAA)

It constructs $k$ SVM models, where $k$ is the number of classes. The $i$th SVM is trained with all of the examples in the $i$th class with positive labels, and all other examples with negative labels.

**OAA method (primal problem).** For a training set $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, $x_j \in \mathbb{R}^n$, $j = 1, \ldots, N$, and $y_j \in \{1, \ldots, k\}$, the $i$th SVM solves the following problem:

Minimize

$$\frac{1}{2} \|w_i\|^2 + C \sum_{j=1}^{N} \xi_j^i$$

under the constraints

$$\begin{aligned}
w_i \cdot \varphi(x_j) + b_i &\geq 1 - \xi_j^i, && \text{if } y_j = i, \\
w_i \cdot \varphi(x_j) + b_i &\leq -1 + \xi_j^i, && \text{if } y_j \neq i, \\
\xi_j^i &\geq 0, && j = 1, \ldots, N.
\end{aligned}$$
## One-against-all (OAA)

After solving these problems, there are $k$ decision functions:

$$w_i \cdot \varphi(x) + b_i, \qquad i = 1, \ldots, k.$$

We say $x$ is in the class which has the largest value of the decision function.

There are $k$ quadratic programming problems with $N$ variables each to be solved.
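The decision rule is just an argmax over the $k$ decision values (a sketch; `decision_values` is a hypothetical array of precomputed $w_i \cdot \varphi(x_j) + b_i$ scores):

```python
import numpy as np

def oaa_predict(decision_values):
    """decision_values: shape (N, k); entry [j, i] = w_i . phi(x_j) + b_i.
    Returns the predicted class index per sample (argmax over the k machines)."""
    return np.argmax(decision_values, axis=1)

# e.g. three one-against-all classifiers scoring two samples:
scores = np.array([[0.9, -1.2, 0.1],
                   [-0.5, 0.3, 1.4]])
print(oaa_predict(scores))   # [0 2]
```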
## One-against-one (OAO)

It constructs $k(k-1)/2$ classifiers, where each one is trained on data from two classes. For training data from the $i$th and $j$th classes, we solve the following binary classification problem:

**OAO method (primal problem).** Minimize

$$\frac{1}{2} \|w^{ij}\|^2 + C \sum_{t \in I_{ij}} \xi_t^{ij}$$

under the constraints

$$\begin{aligned}
w^{ij} \cdot \varphi(x_t) + b^{ij} &\geq 1 - \xi_t^{ij}, && \text{if } y_t = i, \\
w^{ij} \cdot \varphi(x_t) + b^{ij} &\leq -1 + \xi_t^{ij}, && \text{if } y_t = j, \\
\xi_t^{ij} &\geq 0,
\end{aligned}$$

where $I_{ij}$ is the set of indices of the data in the $i$th and $j$th classes.
## One-against-one (OAO)

For the final classification, the most used alternative is the voting strategy. In case two classes receive identical votes, an improved decision methodology is needed. The solution proposed by Hsu and Lin seems not to be very clever:

> ...though it may not be a good strategy, now we simply select the one with the smaller index.

(C.W. Hsu and C.J. Lin. "A comparison of methods for multiclass support vector machines". *IEEE Transactions on Neural Networks*, 13(2), March 2002.)

Another option is the use of probabilistic output SVMs.
## Directed Acyclic Graph Support Vector Machines (DAGSVM)

Its training phase is the same as in the OAO method, solving $k(k-1)/2$ binary SVMs. However, in the testing phase it uses a rooted binary directed acyclic graph which has $k(k-1)/2$ internal nodes and $k$ leaves. Each node is a binary SVM for the $i$th and $j$th classes. Given a test sample $x$, starting at the root node the binary decision function is evaluated, and a path is followed through the graph until a leaf node is reached, which indicates the predicted class.
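The test-phase traversal can be sketched as a list of not-yet-eliminated classes (an assumed API: `classifiers[(i, j)](x)` returns the signed i-vs-j decision value, positive meaning "class i"):

```python
def dagsvm_predict(classifiers, k, x):
    # candidates = classes not yet eliminated; each internal node removes one.
    candidates = list(range(k))
    while len(candidates) > 1:
        i, j = candidates[0], candidates[-1]
        if classifiers[(i, j)](x) > 0:   # node votes "class i"...
            candidates.pop()             # ...so class j is eliminated
        else:
            candidates.pop(0)            # ...otherwise class i is eliminated
    return candidates[0]
```

Note that although $k(k-1)/2$ classifiers exist, any root-to-leaf path evaluates only $k-1$ of them.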
## All-together methods

There are two main approaches:

- Vapnik and Weston.
- Crammer and Singer.
## Vapnik - Weston method

The idea is similar to the OAA approach. It constructs $k$ two-class rules where the $i$th function $w_i \cdot \varphi(x) + b_i$ separates training vectors of class $i$ from the others. The primal optimization problem can be expressed as follows:

Minimize

$$\frac{1}{2} \sum_{i=1}^{k} \|w_i\|^2 + C \sum_{j=1}^{N} \sum_{i \neq y_j} \xi_j^i$$

under the constraints

$$\begin{aligned}
w_{y_j} \cdot \varphi(x_j) + b_{y_j} &\geq w_i \cdot \varphi(x_j) + b_i + 2 - \xi_j^i, \\
\xi_j^i &\geq 0, \qquad j = 1, \ldots, N, \quad i \in \{1, \ldots, k\} \setminus \{y_j\}.
\end{aligned}$$

As in the OAA method, we say that $x$ is in the class which has the largest value of the decision function.
## Crammer and Singer method

Minimize

$$\frac{1}{2} \sum_{i=1}^{k} \|w_i\|^2 + C \sum_{j=1}^{N} \xi_j$$

under the constraints

$$w_{y_j} \cdot \varphi(x_j) - w_i \cdot \varphi(x_j) \geq e_j^i - \xi_j, \qquad j = 1, \ldots, N,$$

where

$$e_j^i \equiv 1 - \delta_{y_j, i}, \qquad \delta_{y_j, i} = \begin{cases} 1 & \text{if } y_j = i, \\ 0 & \text{if } y_j \neq i. \end{cases}$$

The decision function is $\arg\max_{i=1,\ldots,k} w_i \cdot \varphi(x)$.

Now there are only $N$ slack variables $\xi_j$, $j = 1, \ldots, N$, and the formulation contains no bias coefficients $b$.
## And now... which one?

[Figure: decision regions obtained with a Gaussian kernel]
## And now... which one?

[Figure: decision regions obtained with a linear kernel]
## Example of multiclass SVM classifier

Problem: 5 classes, Gaussian kernel, #data: 200.
One-against-all method: #SV: 62. Training errors: 5.
One-against-one method: #SV: 55. Training errors: 4.
## Platt's probabilistic output for SVM

Remember the output of a binary SVM:

$$f(x) = w^\ast \cdot \varphi(x) + b^\ast = \sum_{x_i \text{ is a SV}} y_i \alpha_i^\ast\, k(x_i, x) + b^\ast$$

Instead of predicting the class with $\operatorname{sign}(f(x))$, many applications require a posterior class probability $p(y = 1 \mid x)$. Platt proposed to approximate $p(y = 1 \mid x)$ by a sigmoid function

$$p(x) = \frac{1}{1 + \exp(A f(x) + B)}$$

with parameters $A$ and $B$.
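The sigmoid itself is one line (a sketch; note that $A$ is typically negative, so large positive margins map to probabilities near 1):

```python
import numpy as np

def platt_probability(f, A, B):
    # p(y=1|x) approximated by 1 / (1 + exp(A f(x) + B));
    # with A < 0, large positive f(x) gives probabilities close to 1.
    return 1.0 / (1.0 + np.exp(A * f + B))
```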
## Platt's probabilistic output for SVM

To estimate the best values of $(A, B)$, any subset of $N$ training data ($N_+$ of them with $y_i = 1$ and $N_-$ of them with $y_i = -1$) can be used to solve the following problem:

$$\min_{(A,B)} \; -\sum_{i=1}^{N} \left( t_i \log(p_i) + (1 - t_i) \log(1 - p_i) \right),$$

$$p_i = \frac{1}{1 + \exp(A f(x_i) + B)}, \qquad t_i = \begin{cases} \dfrac{N_+ + 1}{N_+ + 2} & \text{if } y_i = 1, \\[2ex] \dfrac{1}{N_- + 2} & \text{if } y_i = -1, \end{cases} \qquad i = 1, 2, \ldots, N.$$
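A minimal sketch of fitting $(A, B)$ by gradient descent on the cross-entropy above (Platt's paper uses a more robust model-trust minimizer; the learning rate and iteration count here are illustrative):

```python
import numpy as np

def fit_platt(f, y, iters=2000, lr=0.005):
    """f: SVM outputs f(x_i); y: labels in {-1, +1}. Returns (A, B)."""
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    # Platt's regularized targets t_i
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))
    A, B = 0.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(A * f + B))
        g = t - p                   # d(loss)/d(A f + B) for the objective above
        A -= lr * np.sum(g * f)
        B -= lr * np.sum(g)
    return A, B
```

The objective is convex in $(A, B)$, so plain gradient descent with a small enough step converges to the same optimum as more careful minimizers.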
## Bibliography

- C.M. Bishop. *Pattern Recognition and Machine Learning*. Springer, 2006.
- V.N. Vapnik. *The Nature of Statistical Learning Theory*. Springer, 2000.
- N. Cristianini and J. Shawe-Taylor. *An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods*. Cambridge University Press, 2000.
- L.J. Cao, et al. "A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine". *Neurocomputing*, 55, 2003.
- C.W. Hsu and C.J. Lin. "A comparison of methods for multiclass support vector machines". *IEEE Transactions on Neural Networks*, 13(2), 2002.
- J.C. Platt, et al. "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods".