Classification with SVM - ISMLL

nanopoulos@ismll.de
Perceptron Revisited
Binary classification can be viewed as the task of separating classes in feature space: the hyperplane $w^T x + b = 0$ separates the region $w^T x + b < 0$ from the region $w^T x + b > 0$.

$$y(x) = \mathrm{sign}(w^T x + b)$$
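As a minimal sketch (not from the slides), the decision rule $y(x) = \mathrm{sign}(w^T x + b)$ can be evaluated with a few lines of NumPy; the weight vector, bias, and points below are hypothetical.

```python
import numpy as np

def predict(w, b, X):
    """Linear classifier: y(x) = sign(w^T x + b) for every row x of X."""
    return np.sign(X @ w + b)

# Hypothetical 2-D example: the line x1 + x2 - 1 = 0 is the decision boundary.
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 2.0],    # w^T x + b =  3  ->  +1
              [0.0, 0.0]])   # w^T x + b = -1  ->  -1
print(predict(w, b, X))      # [ 1. -1.]
```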
Linear Classification
• Find a linear hyperplane (decision boundary) that will separate the data
Linear Classification
• One Possible Solution
Linear Classification
• Another possible solution
Linear Classification
• Other possible solutions
(Slide from Perceptron Lecture)
Maximum margin hyperplane
• Which one is better? B1 or B2?
• How do you define better?
Maximum margin hyperplane
• Find the hyperplane that maximizes the margin => B1 is better than B2
Theoretical Justification for Maximum Margins

Vapnik has proved the following: the class of optimal linear separators has VC dimension (complexity measure) $h$ bounded from above as

$$h \le \min\!\left( \left\lceil \frac{D^2}{\rho^2} \right\rceil, m_0 \right) + 1$$

where $\rho$ is the margin, $D$ is the diameter of the smallest sphere that can enclose all of the training examples, and $m_0$ is the dimensionality.
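As a hedged illustration of this bound (the values of $D$, $\rho$, and $m_0$ below are purely hypothetical, not from the lecture):

```python
import math

def vc_bound(D, rho, m0):
    """Upper bound on the VC dimension of maximum-margin linear separators:
    h <= min(ceil(D^2 / rho^2), m0) + 1."""
    return min(math.ceil(D ** 2 / rho ** 2), m0) + 1

# Hypothetical numbers: data inside a sphere of diameter 10, margin 2,
# ambient dimensionality 100. A larger margin rho lowers the bound.
print(vc_bound(D=10.0, rho=2.0, m0=100))   # min(25, 100) + 1 = 26
```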
Distance of hyperplane from origin

For the hyperplane $w \cdot x + b = 0$, the distance from the origin is $|b| / \lVert w \rVert$.
Computing the size of the margin

The margin is bounded by the two parallel hyperplanes $w \cdot x + b = k$ and $w \cdot x + b = -k$, on either side of the decision boundary $w \cdot x + b = 0$. The width of the margin is:

$$\frac{2k}{\lVert w \rVert}$$
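A small numeric check of the margin width, with hypothetical values of $w$, $b$, and $k$:

```python
import numpy as np

# Hypothetical hyperplane parameters.
w = np.array([3.0, 4.0])   # ||w|| = 5
b = 0.5
k = 1.0

# The signed distance of a point x from w.x + b = 0 is (w.x + b) / ||w||, so the
# planes w.x + b = +k and w.x + b = -k lie at distances +k/||w|| and -k/||w||,
# and the width of the margin between them is 2k / ||w||.
margin_width = 2 * k / np.linalg.norm(w)
print(margin_width)        # 0.4
```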
Setting Up the Optimization Problem

There is a choice of scale and unit for the data such that $k = 1$. Then the margin boundaries become $w \cdot x + b = 1$ and $w \cdot x + b = -1$ around the decision boundary $w \cdot x + b = 0$, and the width of the margin becomes

$$\frac{2}{\lVert w \rVert}$$
The Optimization Problem

We want to maximize:

$$\text{Margin} = \frac{2}{\lVert w \rVert}$$

which is equivalent to minimizing:

$$L(w) = \frac{\lVert w \rVert^2}{2}$$

subject to the following constraints:

class 1: $y(x_n) = w^T \phi(x_n) + b \ge +1$
class 2: $y(x_n) = w^T \phi(x_n) + b \le -1$
Restating the Optimization Problem

Let $t_n = +1$ for class 1 and $t_n = -1$ for class 2. For all data points:

$$t_n\, y(x_n) \ge 1$$
The optimization problem becomes:

$$\arg\min_{w,\, b} \; \frac{\lVert w \rVert^2}{2} \quad \text{subject to} \quad t_n \left( w^T \phi(x_n) + b \right) \ge 1, \quad n = 1, \ldots, N$$
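A minimal sketch of this constrained problem using the cvxpy modelling library (not part of the lecture); $\phi$ is taken as the identity map and the toy data set below is hypothetical but linearly separable.

```python
import cvxpy as cp
import numpy as np

# Hypothetical, linearly separable toy data; phi is the identity map here.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# arg min_{w,b} ||w||^2 / 2   subject to   t_n (w^T x_n + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(t, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)    # maximum-margin hyperplane for the toy data
```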
Solution with Lagrange multipliers

$$L(w, b, a) = \frac{1}{2}\lVert w \rVert^2 - \sum_{n=1}^{N} a_n \left\{ t_n \left( w^T \phi(x_n) + b \right) - 1 \right\}$$

subject to $a_n \ge 0$ and $a_n \left\{ t_n y(x_n) - 1 \right\} = 0$ (Karush-Kuhn-Tucker conditions).

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{n=1}^{N} a_n t_n \phi(x_n)$$
Dual representation

$$\tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)$$

subject to $a_n \ge 0$, $n = 1, \ldots, N$, and $a_n \left\{ t_n y(x_n) - 1 \right\} = 0$,

where $k(x_n, x_m) = \phi(x_n)^T \phi(x_m)$.

$\tilde{L}$ is a function of the Lagrange multipliers only, and the dual representation is maximized. Solve with quadratic programming in $O(N^3)$.
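A hedged sketch of the dual maximization with cvxpy on the same hypothetical toy data, using a linear kernel so that the quadratic term can be written as a sum of squares; the extra constraint $\sum_n a_n t_n = 0$ comes from setting $\partial L / \partial b = 0$ and appears explicitly in the soft-margin dual later.

```python
import cvxpy as cp
import numpy as np

# Hypothetical toy data; linear kernel k(x_n, x_m) = x_n . x_m, so phi = identity.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
N = len(t)

a = cp.Variable(N)
# Maximize  sum_n a_n - 1/2 sum_{n,m} a_n a_m t_n t_m k(x_n, x_m).
# For a linear kernel the double sum equals || X^T (a * t) ||^2.
objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(X.T @ cp.multiply(a, t)))
constraints = [a >= 0,                          # a_n >= 0
               cp.sum(cp.multiply(a, t)) == 0]  # sum_n a_n t_n = 0 (from dL/db = 0)
cp.Problem(objective, constraints).solve()

w = (a.value * t) @ X                           # w = sum_n a_n t_n phi(x_n)
print(np.round(a.value, 3), w)
```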
Classifying New Data

$$y(x) = w^T \phi(x) + b \;\Rightarrow\; y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b$$

KKT conditions:

$$a_n \ge 0, \qquad t_n y(x_n) - 1 \ge 0, \qquad a_n \left\{ t_n y(x_n) - 1 \right\} = 0$$

So for every point either $a_n = 0$ or $t_n y(x_n) = 1$; the points with $a_n > 0$ are the support vectors $S$.

$$b = \frac{1}{N_S} \sum_{n \in S} \left( t_n - \sum_{m \in S} a_m t_m k(x_n, x_m) \right)$$

Averaging over all of $S$ is more stable (than using a single support vector).
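A small sketch of this classification step, assuming hypothetical dual variables $a$ (the values below are the ones obtained for the toy problem sketched earlier):

```python
import numpy as np

def svm_bias_and_predict(a, t, X, K, x_new, tol=1e-6):
    """Compute b by averaging over the support vectors S, then classify x_new
    with y(x) = sum_n a_n t_n k(x, x_n) + b (linear kernel assumed here)."""
    S = np.where(a > tol)[0]                  # support vectors: a_n > 0
    # b = (1/N_S) sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_n, x_m) )
    b = np.mean([t[n] - np.sum(a[S] * t[S] * K[n, S]) for n in S])
    y = np.sum(a[S] * t[S] * (X[S] @ x_new)) + b
    return b, np.sign(y)

# Hypothetical toy data and dual solution from the earlier sketch.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T                                   # linear-kernel Gram matrix
a = np.array([1 / 9, 0.0, 1 / 9, 0.0])        # hypothetical dual variables
print(svm_bias_and_predict(a, t, X, K, x_new=np.array([1.0, 1.0])))
# (-0.333..., 1.0): the new point is assigned to class +1
```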
Overlapping class distributions

Tradeoff: allow training errors in order to increase the margin.

Soft margin

Allow some misclassified examples by introducing slack variables

$$\xi_n \ge 0, \quad n = 1, \ldots, N$$

so the constraint $t_n y(x_n) = t_n \left( w^T \phi(x_n) + b \right) \ge 1$ is relaxed to $t_n y(x_n) \ge 1 - \xi_n$.
Need to control the slack variables: minimize

$$\frac{1}{2}\lVert w \rVert^2 + C \sum_{n=1}^{N} \xi_n$$
Soft Margin Solution

Minimize the Lagrangian

$$L(w, b, a) = \frac{1}{2}\lVert w \rVert^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} a_n \left\{ t_n y(x_n) - 1 + \xi_n \right\} - \sum_{n=1}^{N} \mu_n \xi_n$$

where $C > 0$ controls the training errors, with constraints $a_n \ge 0$ and $t_n y(x_n) - 1 + \xi_n \ge 0$.

KKT conditions:

$$a_n \left\{ t_n y(x_n) - 1 + \xi_n \right\} = 0, \qquad \mu_n \xi_n = 0, \qquad \mu_n \ge 0, \qquad \xi_n \ge 0$$

Support vectors: the points with $a_n > 0$, for which $t_n y(x_n) = 1 - \xi_n$.
Dual representation

$$\tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)$$

subject to

$$0 \le a_n \le C, \quad n = 1, \ldots, N, \qquad \sum_{n=1}^{N} a_n t_n = 0$$

The constraint $a_n \le C$ stems from setting the derivative of $L$ with respect to the slack variables equal to 0.

Same objective as in the hard-margin dual, but with different constraints on the Lagrange multipliers.
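Rather than coding this box-constrained QP by hand, a hedged sketch using scikit-learn's SVC, which solves the same soft-margin dual; the data below are hypothetical and overlapping, and C plays the trade-off role described above.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping data: the two classes cannot be separated without slack.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.4, 0.4],
              [-1.0, -1.0], [-2.0, -1.5], [0.6, 0.6]])
t = np.array([1, 1, 1, -1, -1, -1])

for C in (0.1, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    # dual_coef_ holds a_n * t_n for the support vectors; each |a_n| is capped at C.
    print(C, clf.support_, np.round(clf.dual_coef_, 3))
```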
Nonlinear Support Vector Machines
What if the decision boundary is not linear?
Non-linear SVMs: Feature spaces

General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

$$\Phi: x \to \phi(x)$$
The "Kernel Trick"

The SVM only relies on the inner product between vectors, $\phi(x_n) \cdot \phi(x_m)$.

If every data point is mapped into a high-dimensional space via some transformation $\Phi: x \to \phi(x)$, the inner product becomes $k(x_m, x_n) = \phi(x_m) \cdot \phi(x_n)$:

$$\tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)$$

$$y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b$$

$k(x_m, x_n)$ is called the kernel function.

For the SVM, we only need to specify the kernel, without needing to know the corresponding non-linear mapping $\phi(x)$.
Examples of Kernel Trick (1)

• For the example in the previous figure:
  – The non-linear mapping: $x \to \phi(x) = (x, x^2)$
  – The kernel: with $\phi(x_i) = (x_i, x_i^2)$ and $\phi(x_j) = (x_j, x_j^2)$,
    $$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) = x_i x_j (1 + x_i x_j)$$
• Where is the benefit?
Examples of Kernel Trick (2)

• Polynomial kernel of degree 2 in 2 variables
  – The non-linear mapping: for $x = (x_1, x_2)$,
    $$\phi(x) = (1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$$
  – The kernel: with $\phi(y) = (1, \sqrt{2}\,y_1, \sqrt{2}\,y_2, y_1^2, y_2^2, \sqrt{2}\,y_1 y_2)$,
    $$K(x, y) = \phi(x) \cdot \phi(y) = (1 + x \cdot y)^2$$
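A quick numeric check (the points x and y below are hypothetical) that the explicit degree-2 feature map and the kernel $(1 + x \cdot y)^2$ give the same value:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map in 2 variables:
    phi(x) = (1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = phi(x) @ phi(y)        # inner product in the mapped 6-D feature space
rhs = (1 + x @ y) ** 2       # kernel evaluated directly in the 2-D input space
print(lhs, rhs)              # both equal 4.0
```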
Examples of Kernel Functions

• Linear kernel: $K(x_i, x_j) = x_i \cdot x_j$
• Polynomial kernel of power $p$: $K(x_i, x_j) = (1 + x_i \cdot x_j)^p$
• Gaussian kernel: $K(x_i, x_j) = e^{-\lVert x_i - x_j \rVert^2 / 2\sigma^2}$
  – In this form it is equivalent to an RBF neural network, but has the advantage that the centers of the basis functions, i.e. the support vectors, are optimized in a supervised way.
• Two-layer perceptron: $K(x_i, x_j) = \tanh(\alpha\, x_i \cdot x_j + \beta)$
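A minimal sketch of these four kernels as plain Python functions (the parameter defaults are hypothetical):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def poly_kernel(x, y, p=3):
    return (1 + x @ y) ** p

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def tanh_kernel(x, y, alpha=1.0, beta=0.0):
    # "Two-layer perceptron" kernel; not positive semi-definite for all alpha, beta.
    return np.tanh(alpha * (x @ y) + beta)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(linear_kernel(x, y), poly_kernel(x, y), gaussian_kernel(x, y), tanh_kernel(x, y))
```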
What Functions are Kernels?

For some functions $K(x_i, x_j)$, checking that $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ can be cumbersome.

Mercer's theorem: every positive semi-definite symmetric function is a kernel.

Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

$$K = \begin{pmatrix}
K(x_1, x_1) & K(x_1, x_2) & K(x_1, x_3) & \cdots & K(x_1, x_n) \\
K(x_2, x_1) & K(x_2, x_2) & K(x_2, x_3) & \cdots & K(x_2, x_n) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
K(x_n, x_1) & K(x_n, x_2) & K(x_n, x_3) & \cdots & K(x_n, x_n)
\end{pmatrix}$$

http://en.wikipedia.org/wiki/Positive-definite_matrix
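A hedged sketch of checking Mercer's condition empirically on hypothetical data: build the Gram matrix for a kernel and verify it is symmetric with no negative eigenvalues.

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = kernel(x_i, x_j) for all pairs of training points."""
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

# Hypothetical data and a Gaussian kernel; a valid (Mercer) kernel must yield a
# symmetric positive semi-definite Gram matrix, i.e. no negative eigenvalues.
X = np.random.default_rng(0).normal(size=(5, 2))
K = gram_matrix(X, lambda u, v: np.exp(-np.sum((u - v) ** 2) / 2))
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)   # True True
```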