Lecture 9 :: Support Vector Machines

Artificial Intelligence and Robotics

Logistic regression::Reminder::h_θ(x)

h_θ(x) = g(θ^T x) = 1 / (1 + e^{-θ^T x})

g(z) ≥ 0.5 whenever z ≥ 0 and g(z) < 0.5 whenever z < 0.
Predict y = 1 if h_θ(x) ≥ 0.5, i.e. θ^T x ≥ 0.
Predict y = 0 if h_θ(x) < 0.5, i.e. θ^T x < 0.
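As a minimal sketch (not from the slides), the prediction rule in NumPy, with made-up values for θ and x:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """Predict 1 when h_theta(x) = g(theta^T x) >= 0.5, i.e. when theta^T x >= 0."""
    return 1 if sigmoid(theta @ x) >= 0.5 else 0

theta = np.array([-2.0, 1.0, 1.0])   # theta_0, theta_1, theta_2 (values from the next slide)
x = np.array([1.0, 3.0, 1.0])        # x_0 = 1 bias feature, then x_1, x_2 (made up)
print(predict(theta, x))             # theta^T x = 2 >= 0, so this prints 1
```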
LoR::Reminder::Linear Decision Boundary

Let h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2) and θ_0 = -2, θ_1 = 1, θ_2 = 1.
Predict y = 1 if -2 + x_1 + x_2 ≥ 0, i.e. x_1 + x_2 ≥ 2.

Figure 1: Linear decision boundary

The decision boundary x_1 + x_2 = 2 is a property of h_θ(x), not of the training data.
LoR::Reminder::Non-linear Decision Boundary

Let h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1^2 + θ_4 x_2^2) and θ_0 = -1, θ_1 = θ_2 = 0, θ_3 = θ_4 = 1.
Predict y = 1 if -1 + x_1^2 + x_2^2 ≥ 0, i.e. x_1^2 + x_2^2 ≥ 1.

Figure 2: Non-linear decision boundary
LoR::Reminder::Cost function

J(θ) = -(1/n) [ Σ_{i=1}^n y_i log(h_θ(x_i)) + (1 - y_i) log(1 - h_θ(x_i)) ] + (λ/(2n)) Σ_{j=1}^m θ_j^2

θ̂ = argmin_θ J(θ)
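A sketch of this cost in NumPy (my illustration, not from the slides), assuming X carries a leading column of ones and that θ_0 is left out of the regularization sum, since the sum over j starts at 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    """Regularized logistic-regression cost J(theta) as written on this slide.

    X has shape (n, m+1) with a leading column of ones; theta_0 is not regularized."""
    n = len(y)
    h = sigmoid(X @ theta)
    data_term = -(1.0 / n) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    reg_term = (lam / (2.0 * n)) * np.sum(theta[1:] ** 2)
    return data_term + reg_term

# Tiny made-up usage
theta = np.array([-2.0, 1.0, 1.0])
X = np.array([[1.0, 3.0, 1.0], [1.0, 0.0, 0.0]])
y = np.array([1, 0])
print(cost(theta, X, y, lam=1.0))
```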
Heading for SVM

min_θ J(θ) = min_θ -(1/n) [ Σ_{i=1}^n y_i log(h_θ(x_i)) + (1 - y_i) log(1 - h_θ(x_i)) ] + (λ/(2n)) Σ_{j=1}^m θ_j^2

SVM training:

min_θ J(θ) = min_θ (1/n) [ Σ_{i=1}^n y_i (-log(h_θ(x_i))) + (1 - y_i)(-log(1 - h_θ(x_i))) ] + (λ/(2n)) Σ_{j=1}^m θ_j^2

min_θ J(θ) = min_θ (1/n) [ Σ_{i=1}^n y_i J_1(θ^T x_i) + (1 - y_i) J_0(θ^T x_i) ] + (λ/(2n)) Σ_{j=1}^m θ_j^2

Dropping the constant factor 1/n (it does not change the minimizer) and using A + λB ≡ CA + B with C = 1/λ:

min_θ J(θ) = min_θ C [ Σ_{i=1}^n y_i J_1(θ^T x_i) + (1 - y_i) J_0(θ^T x_i) ] + (1/2) Σ_{j=1}^m θ_j^2
Figure 3: Cost function in general
Heading for SVM (2)

SVM prediction:

h_θ̂(x) = 1 if θ̂^T x ≥ 0, and 0 otherwise.
Heading for SVM (3)

min_θ J(θ) = min_θ C [ Σ_{i=1}^n y_i J_1(θ^T x_i) + (1 - y_i) J_0(θ^T x_i) ] + (1/2) Σ_{j=1}^m θ_j^2

If y = 1, we want θ^T x ≥ 1. If y = 0, we want θ^T x ≤ -1.

Hinge loss: J_1(θ^T x) = max(0, 1 - θ^T x), J_0(θ^T x) = max(0, 1 + θ^T x)

Figure 4: Cost function::SVM
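The two hinge-style costs are easy to sketch directly in NumPy; the sample z values below are made up:

```python
import numpy as np

def J1(z):
    """Cost used when y = 1: zero once theta^T x >= 1, linear penalty otherwise."""
    return np.maximum(0.0, 1.0 - z)

def J0(z):
    """Cost used when y = 0: zero once theta^T x <= -1, linear penalty otherwise."""
    return np.maximum(0.0, 1.0 + z)

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # candidate values of theta^T x
print(J1(z))   # [3. 2. 1. 0. 0.]
print(J0(z))   # [0. 0. 1. 2. 3.]
```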
SVM::Decision Boundary

min_θ J(θ) = min_θ C [ Σ_{i=1}^n y_i J_1(θ^T x_i) + (1 - y_i) J_0(θ^T x_i) ] + (1/2) Σ_{j=1}^m θ_j^2

where (1/2) Σ_{j=1}^m θ_j^2 is the complexity term and Σ_{i=1}^n y_i J_1(θ^T x_i) + (1 - y_i) J_0(θ^T x_i) is the training error.
SVM::Decision Boundary::Perfect separation

Let's suppose perfect separation, i.e. the training error is 0.
Then min_θ J(θ) = min_θ C·0 + (1/2) Σ_{j=1}^m θ_j^2, i.e.

min_θ (1/2) Σ_{j=1}^m θ_j^2
s.t. θ^T x_i ≥ 1 if y_i = 1, θ^T x_i ≤ -1 if y_i = 0.
SVM::Large Margin Classifier
Figure 5: SVM::Large Margin Classifier
From linear algebra::Dot products

The dot product of two vectors x, y ∈ R^n: x·y = Σ_{i=1}^n x_i y_i.
The dot product x·x is the square of the length of x (i.e. x·x = ||x||^2).

SVM: min_θ (1/2) Σ_{j=1}^m θ_j^2 = min_θ (1/2) ||θ||^2
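A quick NumPy check of both facts, with made-up vectors:

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, 2.0])

print(x @ y)                          # dot product: 3*1 + 4*2 = 11
print(x @ x, np.linalg.norm(x) ** 2)  # both 25: x.x equals ||x||^2
```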
SVM::Decision Boundary::Perfect separation (2)
1
 is normal to the hyperplane 
T
x.Let p be a length of projection
of x onto .
T
x = p||||.
2
Distance from an arbitrary point x
i
to the hyperplane:
p =

T
x
i
||||
14/32
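A small sketch of the projection formula, with a made-up θ and point (no bias term, so the hyperplane passes through the origin as above):

```python
import numpy as np

theta = np.array([1.0, 1.0])   # normal vector of the hyperplane theta^T x = 0
x_i = np.array([2.0, 1.0])     # an arbitrary point

# Signed length of the projection of x_i onto theta = distance to the hyperplane
p = (theta @ x_i) / np.linalg.norm(theta)
print(p)   # 3 / sqrt(2) ≈ 2.121
```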
SVM::Decision Boundary::Perfect separation (3)

Figure 6: Large Margin Classifier
SVM::Decision Boundary::Perfect separation (4)

For the two closest points x_1 (with y_1 = 1, so θ^T x_1 = 1) and x_2 (with y_2 = 0, so θ^T x_2 = -1):

d_1 = θ^T x_1 / ||θ||
d_2 = -θ^T x_2 / ||θ||
d_1 + d_2 = 2 / ||θ||

max 2/||θ|| ≡ min (1/2) ||θ||^2
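As an illustration (not part of the lecture), scikit-learn's linear SVC with a very large C approximates this hard-margin problem, and the learned weight vector gives the margin width 2/||θ||; the toy data below is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters
X = np.array([[0., 0.], [1., 0.], [0., 1.], [3., 3.], [4., 3.], [3., 4.]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the "perfect separation" (hard-margin) problem
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w = clf.coef_[0]
print(2.0 / np.linalg.norm(w))   # geometric margin width 2 / ||theta||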
SVM::Decision Boundary::Non-perfect separation
In a real problem it is unlikely that a line will exactly separate the data –
even if a curved decision boundary is possible.So exactly separating the
data is probably not desirable – if the data has noise and outliers,a
smooth decision boundary that ignores a few data points is better than
one that loops around the outliers.
SVM::Decision Boundary::Non-perfect separation (2)

min_θ J(θ) = min_θ C [ Σ_{i=1}^n y_i J_1(θ^T x_i) + (1 - y_i) J_0(θ^T x_i) ] + (1/2) Σ_{j=1}^m θ_j^2

min_θ J(θ) = min_θ C Σ_{i=1}^n ξ_i + (1/2) Σ_{j=1}^m θ_j^2
s.t. θ^T x_i ≥ 1 - ξ_i if y_i = 1, θ^T x_i ≤ -1 + ξ_i if y_i = 0,
where the ξ_i ≥ 0 are so-called slack variables.
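As a hedged illustration of the role of C (the data, the C values, and the use of scikit-learn are my own, not the lecture's): a small C tolerates slack and tends to ignore the outlier, while a large C penalizes every violation heavily.

```python
import numpy as np
from sklearn.svm import SVC

# Separable clusters plus one outlier sitting inside the other class's region
X = np.array([[0., 0.], [1., 0.], [0., 1.],
              [3., 3.], [4., 3.], [3., 4.], [3.5, 3.5]])
y = np.array([0, 0, 0, 1, 1, 1, 0])   # the last point is the outlier

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ counts support vectors per class; the boundary typically
    # changes much more for large C, which bends toward the outlier.
    print(C, clf.n_support_)
```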
SVM::Decision Boundary::Non-perfect separation (3)

Figure 7: Slack variables
Non-linear classifier::Kernels

What if the points are separated by a non-linear region?

Let h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1^2 + θ_4 x_2^2) and θ_0 = -1, θ_1 = θ_2 = 0, θ_3 = θ_4 = 1.
Predict y = 1 if -1 + x_1^2 + x_2^2 ≥ 0, i.e. x_1^2 + x_2^2 ≥ 1.

Figure 8: Non-linear decision boundary
Non-linear classifier::Kernels (2)

Given x, compute three new features depending on proximity to landmarks l_1, l_2, l_3:
f_1 = similarity(x, l_1), f_2 = similarity(x, l_2), f_3 = similarity(x, l_3).

Kernel = similarity function; for example the Gaussian kernel:
similarity(x, l_i) = exp(-||x - l_i||^2 / (2σ^2))

Figure 9: Kernel trick

If x ≈ l_i then f_i ≈ 1.
If x is far from l_i then f_i ≈ 0.

Predict y = 1 if θ_0 + θ_1 f_1 + θ_2 f_2 + θ_3 f_3 ≥ 0.
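A minimal sketch of the Gaussian similarity features, with made-up landmarks and σ = 1:

```python
import numpy as np

def gaussian_similarity(x, l, sigma=1.0):
    """f = exp(-||x - l||^2 / (2 sigma^2)): about 1 near the landmark, near 0 far away."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

landmarks = [np.array([3., 5.]), np.array([3., 2.]), np.array([1., 1.])]   # l_1, l_2, l_3 (made up)
x = np.array([3., 5.1])                                                    # close to l_1

f = np.array([gaussian_similarity(x, l) for l in landmarks])
print(f)   # f_1 ≈ 1, while f_2 and f_3 are close to 0
```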
Non-linear classifier::Kernels (3)

Figure 10: Kernel trick (2)

Predict y = 1 if θ_0 + θ_1 f_1 + θ_2 f_2 + θ_3 f_3 ≥ 0.
Let θ_0 = -0.5, θ_1 = 1, θ_2 = 1, θ_3 = 0 and f_1 ≈ 1, f_2 ≈ 0, f_3 ≈ 0.
Then θ_0 + θ_1·1 + θ_2·0 + θ_3·0 = 0.5 ≥ 0, so predict y = 1.
Non-linear classifier::Kernels (4)
Figure 11: Kernel trick (3)
Non-linear classifier::Kernels (5)

SVM with kernels: training
Given x_1, ..., x_n, we choose l_1 = x_1, ..., l_n = x_n.
For each ⟨x_i, y_i⟩: f_{i1} = similarity(x_i, l_1), ..., f_{in} = similarity(x_i, l_n), with x_i ∈ R^{m+1}, f_i ∈ R^{n+1}.

min_θ J(θ) = min_θ C [ Σ_{i=1}^n y_i J_1(θ^T f_i) + (1 - y_i) J_0(θ^T f_i) ] + (1/2) Σ_{j=1}^n θ_j^2

SVM with kernels: prediction
Given x, compute f and predict y = 1 if θ^T f ≥ 0.
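For illustration only, scikit-learn's SVC with an RBF kernel does this kind of kernelized training and prediction; its gamma corresponds to 1/(2σ²) in the similarity above. The circular toy problem below is made up:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 >= 1).astype(int)   # circular boundary, as in Figure 8

# RBF kernel exp(-gamma ||x - z||^2); gamma = 1 / (2 sigma^2)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))   # expect [0 1]: inside vs. outside the circle
```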
Let's go back...

Minimize (1/2) ||θ||^2 subject to the constraint y_i θ^T x_i ≥ 1, i.e. y_i θ^T x_i - 1 ≥ 0.
(From here on the labels are y_i ∈ {-1, +1}, so the two constraints θ^T x_i ≥ 1 and θ^T x_i ≤ -1 collapse into the single form y_i θ^T x_i ≥ 1.)
Lagrangian

Do quadratic programming.
For each training instance ⟨x_i, y_i⟩, introduce α_i ≥ 0. Let α = ⟨α_1, ..., α_n⟩.

Let L(θ, α) = (1/2) ||θ||^2 - Σ_i α_i (y_i (θ^T x_i) - 1)

Minimize L w.r.t. θ. Setting the derivatives to zero gives

θ = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.
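For reference, a sketch of the step elided above, written with the bias kept separate as b (the slides fold it into θ); this is where both conditions come from:

```latex
L(\mathbf{w}, b, \boldsymbol{\alpha})
  = \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
    - \sum_{i} \alpha_i \left( y_i (\mathbf{w}^{T}\mathbf{x}_i + b) - 1 \right)

\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0
  \;\Longrightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i

\frac{\partial L}{\partial b} = - \sum_i \alpha_i y_i = 0
  \;\Longrightarrow\; \sum_i \alpha_i y_i = 0
```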
Lagrangian::Dual problem

Solve the dual problem. Maximize

L(α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j

subject to α_i ≥ 0 and Σ_i α_i y_i = 0.

For support vectors α_i > 0; for all other training examples α_i = 0.
I.e. finding θ is equivalent to finding the support vectors and their weights.
(Pros: we don't have to optimize a vector θ. Instead, we optimize real numbers α_i subject to simple constraints.)

Prediction:
f(x) = θ^T x = (Σ_i α_i y_i x_i)^T x = Σ_i α_i y_i x_i^T x.
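As a sketch of this prediction rule with scikit-learn (my illustration, not the lecture's): dual_coef_ stores α_i y_i for the support vectors, so θ = Σ_i α_i y_i x_i can be rebuilt and compared with decision_function, which also adds an intercept term the slide omits.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels in {-1, +1}
X = np.array([[0., 0.], [1., 0.], [0., 1.], [3., 3.], [4., 3.], [3., 4.]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_[0] holds alpha_i * y_i for the support vectors only
w = clf.dual_coef_[0] @ clf.support_vectors_   # theta = sum_i alpha_i y_i x_i
x_new = np.array([2., 2.])

f_manual = w @ x_new + clf.intercept_[0]                       # slide's f(x) plus sklearn's bias
f_sklearn = clf.decision_function(x_new.reshape(1, -1))[0]
print(f_manual, f_sklearn)   # the two values should agree
```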
Summary of linear SVM

Choose a hyperplane θ^T x = 0 to separate the instances.
Among all the allowed hyperplanes, choose the one with the maximum margin.
Maximizing the margin is the same as minimizing ||θ||.
Choosing θ is the same as choosing the α_i.
Soft margin

A soft margin permits some misclassification.

Reminder:
Minimize (1/2) ||θ||^2 + C Σ_i ξ_i, where C is the penalty,
subject to y_i (θ^T x_i) ≥ 1 - ξ_i, where ξ_i ≥ 0.

Now, through the dual problem, maximize

L(α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j

subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.
Non-linear SVM::Instance Space → Feature Space

Define φ.
Calculate φ(x_i) for each training example.
Find a linear SVM in the feature space.

Kernel ≡ similarity function
K: X × X → R, K(x, z) = φ(x)^T φ(z).
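A small sketch of this identity for the degree-2 polynomial kernel K(x, z) = (x^T z)^2 on R^2, where the explicit map is φ(x) = (x_1^2, √2·x_1·x_2, x_2^2) (my example, not from the slides):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel K(x, z) = (x^T z)^2 in R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])
print(phi(x) @ phi(z), K(x, z))   # both 25: the kernel is a dot product in feature space
```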
Kernel tricks

A kernel is a dot product in a feature space.
No need to know what φ is or what the feature space is.
No need to explicitly map the data to the feature space.
Define a kernel function K and replace the dot product x^T z with the kernel function K(x, z) in both training and testing.

I.e. maximize

L(α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)

subject to α_i ≥ 0 and Σ_i α_i y_i = 0.
Common kernel functions

Linear: K(x, z) = x^T z
Polynomial: K(x, z) = (γ x^T z + c)^d
Radial basis function: K(x, z) = exp(-γ ||x - z||^2)
Sigmoid: K(x, z) = tanh(γ x^T z + c), where tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
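A direct NumPy sketch of these four kernels, with made-up inputs and parameter values:

```python
import numpy as np

def linear(x, z):
    return x @ z

def polynomial(x, z, gamma=1.0, c=1.0, d=3):
    return (gamma * (x @ z) + c) ** d

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, gamma=1.0, c=0.0):
    return np.tanh(gamma * (x @ z) + c)

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
for k in (linear, polynomial, rbf, sigmoid_kernel):
    print(k.__name__, k(x, z))
```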