Support Vector Machine
Natural Language Processing Lab
lizhonghua
Support Vector Machine
• Introduction
• Theory
• SVM primal and dual problem
• Parameter selection and practical issues
• Comparison with other classifiers
• Conclusion and discussion
Introduction
Training data are
generated by sampling from an unknown
underlying distribution P(x, y)
Some Notation:
Introduction
Which one is better?
Theory
The right one is better. Why?
• A large margin requires a small ||w||
• A small ||w|| gives a small VC dimension of margin
hyperplanes ([1], Theorem 5.5)
• A small VC dimension lowers the bound on the true error
Theory
A large margin requires a small ||w||
The distance between the two parallel hyperplanes is 2/||w||
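The relation between ||w|| and the margin can be checked numerically. The sketch below assumes the usual canonical form, in which the two parallel hyperplanes are w·x + b = +1 and w·x + b = -1; the function name `margin_width` is illustrative.

```python
import math

def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm

# A larger ||w|| means a narrower margin, and a smaller ||w|| a wider one.
print(margin_width([3.0, 4.0]))   # ||w|| = 5
print(margin_width([0.0, 0.5]))   # ||w|| = 0.5
```

Maximizing the margin is therefore the same as minimizing ||w||, which is why the training problem is written as a minimization over w.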
Theory
A small VC dimension lowers the bound on the true error
The inequality holds with at least the stated probability, where L is the
number of training examples and h is the VC dimension
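For reference, a commonly quoted form of this kind of bound, following the tutorial [2], is sketched below. The symbols R (true error), R_emp (training error) and η (failure probability) are assumptions taken from that tutorial rather than from this slide; L and h are as defined above.

```latex
R(f) \;\le\; R_{\mathrm{emp}}(f)
  + \sqrt{\frac{h\left(\ln\frac{2L}{h} + 1\right) - \ln\frac{\eta}{4}}{L}}
\qquad \text{with probability at least } 1 - \eta
```

The square-root term grows with h and shrinks with L, which is the sense in which a smaller VC dimension lowers the bound.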
Derive a VC Bound
• Given a fixed function f, the loss on each example
is either 0 or 1
• All examples are drawn independently, so
the losses are independent samples of a single
random variable
• The Chernoff bound therefore applies
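As a rough illustration of what the Chernoff bound says for 0/1 losses, the sketch below uses the two-sided form 2·exp(-2mε²) (the form given in [1], also known as Hoeffding's inequality, and assumed here to be the one intended). The true error rate p = 0.3, the sample size m and the tolerance ε are hypothetical values chosen for the simulation.

```python
import math
import random

def chernoff_bound(m, eps):
    """Upper bound 2*exp(-2*m*eps^2) on P(|empirical mean - true mean| >= eps)."""
    return 2.0 * math.exp(-2.0 * m * eps * eps)

def deviation_frequency(p, m, eps, trials=2000, seed=0):
    """Fraction of simulated samples whose mean 0/1 loss is off by at least eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each example independently has loss 1 with probability p.
        mean = sum(rng.random() < p for _ in range(m)) / m
        if abs(mean - p) >= eps:
            hits += 1
    return hits / trials

bound = chernoff_bound(m=100, eps=0.1)
freq = deviation_frequency(p=0.3, m=100, eps=0.1)
print(f"bound = {bound:.3f}, simulated frequency = {freq:.3f}")
```

The bound holds for one fixed f only; the following slides extend it to all f in F.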
Derive a VC Bound
For all f in F
Set =
Derive a VC Bound
• The cardinality of F
• the number of functions from F that can be
distinguished by their values on {x1, x2, ..., x2m}
• that is, the number of different
outputs (y1, y2, ..., y2m) that the functions in F
can achieve on samples of a given size.
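The counting idea above can be sketched in a few lines. The function class here (1-D threshold classifiers) and the sample are hypothetical choices for illustration, not taken from the slides.

```python
def labelings(functions, sample):
    """Set of distinct output vectors the function class produces on the sample."""
    return {tuple(f(x) for x in sample) for f in functions}

def threshold(t):
    """Classifier that outputs +1 for x > t and -1 otherwise."""
    return lambda x: 1 if x > t else -1

sample = [0.5, 1.5, 2.5, 3.5]
# 1000 thresholds spread over [-1, 5]: many functions, few distinct behaviours.
functions = [threshold(-1 + 6 * k / 999) for k in range(1000)]

distinct = labelings(functions, sample)
print(len(functions), "functions,", len(distinct), "distinct labelings,",
      2 ** len(sample), "possible labelings")
```

Although the class contains 1000 functions, only 5 of the 16 possible labelings of the four points are achieved, and it is this smaller number that enters the bound.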
VC dimension
• The VC dimension is a property of a set of
functions {f(a)}.
• If a given set of l points can be labeled in all 2^l
ways, and for each labeling a member of the set
{f(a)} can be found which correctly assigns those
labels, we say that the set of points is shattered
by that set of functions.
• The VC dimension for the set of functions {f(a)}
is defined as the maximum number of training
points that can be shattered by {f(a)}.
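Shattering can be tested directly on small point sets. The sketch below assumes the function class is lines in the plane and uses a perceptron capped at a fixed number of epochs as the separability check, so a failure to converge is evidence rather than proof that a labeling is unreachable. The point sets are hypothetical.

```python
from itertools import product

def linearly_separable(points, labels, epochs=1000):
    """Perceptron with a bias term; True if it finds a separating line."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False  # no separator found within the epoch budget

def shattered(points):
    """True if every one of the 2^l labelings is realised by some line."""
    return all(linearly_separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]
four = [(0, 0), (1, 1), (1, 0), (0, 1)]   # includes the XOR labeling
print(shattered(three))   # True
print(shattered(four))    # False
```

Three points in general position are shattered while these four are not, in line with the known VC dimension of 3 for lines in the plane.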
Derive a VC Bound
The capacity term is a property of the function class F, so
the bound cannot be minimized over the choice of f. Instead, we introduce
a structure on F and minimize the bound over the choice of
structure.
This is structural risk minimization.
SVM primal and dual problem
Linear Support Vector Machines
The Separable Case
The Non-Separable Case
Nonlinear Support Vector Machines
The Separable Case
Decision function:
Dual problem:
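For reference, the standard hard-margin formulation found in [2] is sketched below; it is assumed to match what these slides showed. The multipliers α_i and the index range i = 1, ..., l are notation from that tutorial, not from the slide.

```latex
% Primal problem (separable case)
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad \text{s.t.}\quad y_i\,(w\cdot x_i + b) \ge 1,\quad i = 1,\dots,l

% Dual problem
\max_{\alpha}\ \sum_{i=1}^{l}\alpha_i
  - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j\,y_i y_j\,(x_i\cdot x_j)
\quad \text{s.t.}\quad \alpha_i \ge 0,\quad \sum_{i=1}^{l}\alpha_i y_i = 0

% Decision function
f(x) = \operatorname{sgn}\!\Big(\sum_{i=1}^{l}\alpha_i\,y_i\,(x_i\cdot x) + b\Big),
\qquad w = \sum_{i=1}^{l}\alpha_i\,y_i\,x_i
```

The training examples enter the dual and the decision function only through inner products, which is what later allows a kernel to be substituted.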
The Non-Separable Case
The standard approach is to allow the fat
decision margin to make a few mistakes (some
points, outliers or noisy examples, are inside
or on the wrong side of the margin). We then
pay a cost for each misclassified example,
which depends on how far it is from meeting
the margin requirement. To implement this,
we introduce slack variables.
The Non-Separable Case
• We want an algorithm that can tolerate a certain
fraction of outliers.
• Introduce slack variables
• Use relaxed constraints
• Objective function
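The three steps above are usually written as the soft-margin (C-SVM) primal problem. The sketch below follows the form in [2]; the slack symbols ξ_i are that tutorial's notation, and it is an assumption that the slide used the same one.

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{l}\xi_i
\quad \text{s.t.}\quad y_i\,(w\cdot x_i + b) \ge 1 - \xi_i,
\quad \xi_i \ge 0,\quad i = 1,\dots,l
```

A point with 0 < ξ_i ≤ 1 lies inside the margin but on the correct side; a point with ξ_i > 1 is misclassified.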
Nonlinear Support Vector Machines
Some common kernels:
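A minimal sketch of three kernels that commonly appear under this heading: linear, polynomial and Gaussian. The exact list on the slide is not recoverable, so this selection is an assumption based on the polynomial and Gaussian kernels discussed under parameter selection; the function names and default parameter values are illustrative.

```python
import math

def dot(x, z):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, degree=2, c=1.0):
    """(x.z + c)^degree; degree 1 with c = 0 reduces to the linear kernel."""
    return (dot(x, z) + c) ** degree

def gaussian_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2); gamma controls the kernel width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, z), polynomial_kernel(x, z), gaussian_kernel(x, z))
```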
What conditions must K satisfy to be a valid
kernel function?
Nonlinear Support Vector Machines
Kernel Matrix
Each element of the matrix is the inner product of two
training examples, and we use the kernel function to
compute that inner product.
Nonlinear Support Vector Machines
If the kernel matrix is positive semidefinite for
every finite set of examples, we say that K satisfies the
finitely positive semidefinite property, and K
is then a valid kernel function.
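The property can be probed numerically. The sketch below builds a kernel matrix and evaluates the quadratic form zᵀKz for random vectors z: a single negative value proves the matrix is not positive semidefinite, while passing every trial is only a spot check, not a proof. The sample points and the two candidate functions are hypothetical.

```python
import math
import random

def gram_matrix(kernel, xs):
    """Kernel matrix K[i][j] = kernel(x_i, x_j) for a finite list of examples."""
    return [[kernel(a, b) for b in xs] for a in xs]

def quadratic_form(K, z):
    """z^T K z; non-negative for every z when K is positive semidefinite."""
    n = len(z)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))

def passes_psd_spot_check(K, trials=1000, seed=0, tol=1e-9):
    """Try random vectors z; a negative z^T K z proves K is not PSD."""
    rng = random.Random(seed)
    n = len(K)
    return all(quadratic_form(K, [rng.uniform(-1, 1) for _ in range(n)]) >= -tol
               for _ in range(trials))

xs = [0.0, 1.0, 2.5]
gaussian = lambda a, b: math.exp(-(a - b) ** 2)   # Gaussian kernel, gamma = 1
neg_sq_dist = lambda a, b: -(a - b) ** 2          # not a valid kernel

print(passes_psd_spot_check(gram_matrix(gaussian, xs)))
print(quadratic_form(gram_matrix(neg_sq_dist, xs), [1.0, 1.0, 1.0]))
```

The Gaussian kernel passes the check, whereas the negative squared distance gives a negative quadratic form for z = (1, 1, 1) and so cannot be a kernel.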
Parameter selection
• Training the SVM gives us w and b
• The SVM also has hyperparameters: the soft-margin
constant C, the width of the Gaussian kernel, and the
degree of the polynomial kernel.
soft margin constant C
• When C is small, it is cheap to account for some
data points with slack variables, so a fat margin
can be placed that models the
bulk of the data.
• As C becomes large, it is better to respect every
data point at the cost of reducing the geometric
margin, and the complexity of the function
class increases.
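The trade-off can be made concrete by scoring two hand-picked candidate solutions with the usual soft-margin objective, 0.5·||w||² + C·Σ slack. The 1-D data set and both candidates are hypothetical; neither is the output of an actual SVM solver.

```python
def soft_margin_objective(w, b, points, labels, C):
    """0.5*||w||^2 + C * sum of slacks, with slack_i = max(0, 1 - y_i*(w.x_i + b))."""
    slacks = [max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
              for x, y in zip(points, labels)]
    return 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)

# Hypothetical 1-D data: the positive point at -0.5 sits close to the negatives.
points = [(-2.0,), (-1.0,), (-0.5,), (2.0,), (3.0,)]
labels = [-1, -1, 1, 1, 1]

fat = ([1.0], 0.0)      # margin width 2, but x = -0.5 violates it (slack 1.5)
narrow = ([4.0], 3.0)   # margin width 0.5, every constraint satisfied

for C in (1.0, 100.0):
    fat_cost = soft_margin_objective(*fat, points, labels, C)
    narrow_cost = soft_margin_objective(*narrow, points, labels, C)
    print(f"C={C}: fat margin {fat_cost}, narrow margin {narrow_cost}")
```

With C = 1 the fat margin is cheaper (2.0 against 8.0); with C = 100 the slack penalty dominates and the narrow margin wins (8.0 against 150.5).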
degree of polynomial kernel
The lowest-degree polynomial kernel is the linear kernel, which is not sufficient when the
relationship between the two classes is nonlinear. Here degree 2 is enough; degree 5 gives a boundary with greater
curvature.
width of Gaussian kernel
A small gamma leads to a smooth decision boundary; a large
gamma leads to greater curvature of the decision
boundary. Gamma = 100 leads to overfitting the data.
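One way to see why a large gamma overfits is to look at the kernel value between two nearby points as gamma grows. The sketch assumes the common parameterization exp(-gamma·||x - z||²); the two points are hypothetical.

```python
import math

def gaussian_kernel(x, z, gamma):
    """exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

x, z = (0.0, 0.0), (0.5, 0.5)   # two hypothetical points, squared distance 0.5
for gamma in (0.1, 1.0, 100.0):
    print(gamma, gaussian_kernel(x, z, gamma))
```

At gamma = 100 the similarity between the two points is essentially zero, so each training point influences only its immediate neighbourhood and the decision boundary can wrap around individual examples.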
A simple procedure
Chih-Jen Lin, Support Vector Machines, Machine Learning
Summer School 2006
Comparison with other classifiers
• Decision Tree
tends to overfit
• Naïve Bayes Classifier
assumes feature independence; sensitive to sparse data
• SVM
requires kernel selection/design
Conclusion and Discussion
• Intuitive
• Has linear or nonlinear decision boundaries
• Does not make unreasonable assumptions
about the data
• Resists overfitting when C and the kernel parameters are chosen carefully
• Has few hyperparameters, so it is easy to
model and train
References
• [1] Bernhard Schölkopf, Alexander J. Smola. Learning with Kernels.
• [2] Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern
Recognition. Data Mining and Knowledge Discovery, 2, 121-167 (1998).
• [3] Asa Ben-Hur, Jason Weston. A User’s Guide to Support Vector Machines.
• [4] Chih-Jen Lin. Support Vector Machines. Machine Learning Summer School
2006.
Thanks for your attention!