Support Vector Machine

Natural Language Processing Lab

lizhonghua

Support Vector Machine

• Introduction

• Theory

• SVM primal and dual problem

• Parameter selection and practical issues

• Comparison to other classifiers

• Conclusion and discussion

Introduction

Training data: generated by sampling from an unknown underlying distribution P(x, y).

Some Notation:
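The notation on the original slide did not survive extraction; the standard setup, consistent with the symbols used later in this deck (L training examples, labels in {-1, +1}), is roughly:

\{(x_1, y_1), \dots, (x_L, y_L)\}, \qquad x_i \in \mathbb{R}^n, \quad y_i \in \{-1, +1\}

The goal is a decision function f with small true error under P(x, y), even though P itself is unknown.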

Introduction

Which one is better?

Theory

The right one is better. Why?

• A large margin requires a small ||w||

• A small ||w|| means a small VC dimension of the margin hyperplanes ([1], Theorem 5.5)

• A small VC dimension lowers the bound on the true error

Theory

A large margin requires a small ||w||

The distance between the two parallel hyperplanes is 2/||w||
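As a one-line derivation (using the canonical form of the hyperplane, as in [1, 2]): the two margin hyperplanes are

w \cdot x + b = +1 \qquad \text{and} \qquad w \cdot x + b = -1

and the perpendicular distance between them is (1 - (-1)) / \|w\| = 2 / \|w\|, so maximizing the margin is equivalent to minimizing \|w\|.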

Theory

Small VC dimension leads to a lower true-error bound

With probability at least 1 − η, the inequality below holds, where L is the size of the training set and h is the VC dimension.
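The inequality itself was an image on the original slide; its standard form, as given in [2], is:

R(f) \le R_{\text{emp}}(f) + \sqrt{ \frac{h \left( \ln(2L/h) + 1 \right) - \ln(\eta/4)}{L} }

where R(f) is the true error, R_emp(f) the empirical (training) error, and 1 − η the confidence.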

Derive a VC Bound

• Given a fixed function f, for each example the loss is either 0 or 1.

• All examples are drawn independently, so the losses are independent samples of a random variable.

Chernoff bound:
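For a single fixed f, the bound takes the following Hoeffding/Chernoff form (writing \xi_i for the 0/1 loss on the i-th of m examples):

P\left( \left| \frac{1}{m} \sum_{i=1}^{m} \xi_i - E[\xi] \right| \ge \epsilon \right) \le 2 \exp(-2 m \epsilon^2)

The empirical mean of the losses is the training error of f; its expectation is the true error of f.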

Derive a VC Bound

For all f in F: apply the union bound over the (finite) class so that the Chernoff bound holds uniformly. Then set the right-hand side equal to δ and solve for ε.
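A sketch of the two steps, assuming for now that F is finite:

P\left( \sup_{f \in F} \left| R_{\text{emp}}(f) - R(f) \right| \ge \epsilon \right) \le 2\,|F|\,\exp(-2 m \epsilon^2) = \delta
\quad \Longrightarrow \quad
\epsilon = \sqrt{ \frac{\ln|F| + \ln(2/\delta)}{2m} }

So with probability at least 1 − δ, every f in F satisfies R(f) \le R_emp(f) + ε.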

Derive a VC Bound

• The cardinality of F: the number of functions from F that can be distinguished by their values on {x1, x2, ..., x2m}.

• It is the number of different outputs (y1, y2, ..., y2m) that the functions in F can achieve on samples of a given size.

• When F is infinite, this quantity replaces |F| in the bound.

Derive a VC Bound

VC dimension

• The VC dimension is a property of a set of functions {f(α)}.

• If a given set of l points can be labeled in all 2^l ways, and for each labeling a member of the set {f(α)} can be found which correctly assigns those labels, we say that set of points is shattered by that set of functions.

• The VC dimension of the set of functions {f(α)} is defined as the maximum number of training points that can be shattered by {f(α)}. For example, oriented lines in the plane can shatter three points in general position but not four, so the VC dimension of linear decision functions in R^2 is 3.

VC dimension

Derive a VC Bound

The capacity term is a property of the function class F, so the bound cannot be minimized over the choice of f. We therefore introduce a structure on F and minimize the bound over the choice of the structure.

Structural risk minimization
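A sketch of the idea (the slide's diagram did not survive): take a nested sequence of function classes of increasing capacity, and minimize the bound over both the class and the function within it:

F_1 \subset F_2 \subset \dots \subset F_k, \qquad h_1 \le h_2 \le \dots \le h_k

\min_{i} \left[ \min_{f \in F_i} R_{\text{emp}}(f) + \text{capacity term}(h_i, L) \right]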

SVM primal and dual problem

Linear Support Vector Machines

The Separable Case

The Non-Separable Case

Nonlinear Support Vector Machines

The Separable Case
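The equations on this slide were an image; the standard hard-margin primal problem ([2]) is:

\min_{w, b} \; \frac{1}{2} \|w\|^2 \qquad \text{subject to} \qquad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, L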

The Separable Case

Decision function:
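(The formula was lost in extraction; in the standard notation of [2]:)

f(x) = \operatorname{sgn}(w \cdot x + b) = \operatorname{sgn}\left( \sum_{i=1}^{L} \alpha_i y_i (x_i \cdot x) + b \right)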

Dual problem:
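(Likewise, the standard form:)

\max_{\alpha} \; \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
\qquad \text{subject to} \qquad \alpha_i \ge 0, \quad \sum_{i=1}^{L} \alpha_i y_i = 0

Only examples with \alpha_i > 0 (the support vectors) appear in the decision function.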

The Non-Separable Case

The standard approach is to allow the fat decision margin to make a few mistakes: some points (outliers or noisy examples) are inside or on the wrong side of the margin. We then pay a cost for each misclassified example, which depends on how far it is from meeting the margin requirement. To implement this, we introduce slack variables.

The Non-Separable Case

• Have an algorithm which can tolerate a certain fraction of outliers.

• Introduce slack variables.

• Use relaxed constraints.

• Objective function (a standard form of all three is sketched below).
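A standard form of the three lost formulas ([2]): slack variables \xi_i \ge 0, relaxed constraints, and the penalized objective:

\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{L} \xi_i
\qquad \text{subject to} \qquad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0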

Nonlinear Support Vector Machines
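The equations on this slide were lost; the usual development maps inputs into a feature space H and replaces every inner product in the dual with a kernel:

\Phi : \mathbb{R}^n \to H, \qquad K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)

\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j),
\qquad
f(x) = \operatorname{sgn}\left( \sum_i \alpha_i y_i K(x_i, x) + b \right)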

Nonlinear Support Vector Machines

Some Common Kernels:
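The list itself was an image; the kernels usually given at this point ([2, 3]) are:

\text{polynomial:} \quad K(x, y) = (x \cdot y + 1)^d

\text{Gaussian (RBF):} \quad K(x, y) = \exp(-\gamma \|x - y\|^2)

\text{sigmoid:} \quad K(x, y) = \tanh(\kappa \, (x \cdot y) - \delta)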

What conditions should K satisfy so that it can be a kernel function?

Nonlinear Support Vector Machines

Kernel Matrix

Each element of the matrix is the inner product of a pair of training examples in feature space, K_ij = K(x_i, x_j), and the kernel function gives us that inner product without computing the feature vectors explicitly.

Nonlinear Support Vector Machines

If the kernel matrix is positive semi-definite for every finite set of examples, we say that K satisfies the finitely positive semi-definite property; such a K can serve as a kernel function.
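As a concrete illustration (not from the slides), a minimal numpy sketch that builds the Gram matrix of a Gaussian kernel on a few random points and checks that it is positive semi-definite:

import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    # Gaussian (RBF) kernel: k(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

X = np.random.RandomState(0).randn(10, 3)   # 10 toy examples, 3 features
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])

# K is symmetric, so eigvalsh applies; a PSD matrix has no negative eigenvalues
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)  # True, up to numerical tolerance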

Parameter selection

• Training the SVM yields w and b.

• The SVM also has hyperparameters that must be chosen beforehand: the soft-margin constant C, the width of the Gaussian kernel, and the degree of the polynomial kernel.

soft margin constant C

• When C is small, it is cheap to account for some data points with slack variables, and the margin stays fat so that it models the bulk of the data.

• As C becomes large, the optimizer respects individual data points at the cost of reducing the geometric margin, and the complexity of the function class increases.

degree of polynomial kernel

The lowest-degree polynomial is the linear kernel, which is not sufficient when the relationship between the two classes is non-linear. In the example shown, degree 2 is already enough, while degree 5 produces a decision boundary with greater curvature.

width of Gaussian kernel

A small gamma leads to a smooth decision boundary; a large gamma leads to greater curvature of the decision boundary. In the example, gamma = 100 overfits the data. (Gamma is the coefficient in the Gaussian kernel exp(−gamma·||x − x'||^2), so a larger gamma means a narrower kernel.)

A simple procedure

(From Chih-Jen Lin, Support Vector Machines, Machine Learning Summer School 2006 [4].) Roughly: scale the data, start with the RBF kernel, pick C and gamma by cross-validation, and retrain on the full training set with the best pair.
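A minimal scikit-learn sketch of that procedure (the dataset and the parameter grid are placeholders, not from the slides):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data standing in for a real training set
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features, then fit an RBF-kernel SVM
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Cross-validate over a coarse grid of C and gamma
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_, grid.score(X_test, y_test))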

Comparison to other classifiers

• Decision Tree

tends to overfit

• Naïve Bayes Classifier

assumes feature independence; suffers when data are sparse

• SVM

main difficulty: kernel selection/design

Conclusion and Discussion

• Intuitive.

• Supports linear or non-linear decision boundaries.

• Does not make unreasonable assumptions about the data.

• Resists overfitting, since maximizing the margin controls the capacity of the function class.

• Has few hyperparameters, so it is easy to model and train.

References

• [1] Bernhard Schölkopf, Alexander J. Smola. Learning with Kernels. MIT Press, 2002.

• [2] Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121-167 (1998).

• [3] Asa Ben-Hur, Jason Weston. A User's Guide to Support Vector Machines.

• [4] Chih-Jen Lin. Support Vector Machines. Machine Learning Summer School, 2006.

Thanks for your attention!
