Machine Learning Using Support Vector Machines


Oct 14, 2013




Abdul Rahim Ahmad
Marzuki Khalid
Rubiyah Yusof

Universiti Tenaga Nasional
Km 7, Jalan Kajang-Puchong,
43009 Kajang, Selangor.

Centre for Artificial Intelligence and Robotics
Universiti Teknologi Malaysia
Jalan Semarak, 54100 Kuala Lumpur


Artificial Neural Networks (ANN) have been the most widely used machine learning
methodology. They draw much of their inspiration from the neurosciences. By structurally
attempting to mimic the architecture of the human brain during learning, ANN aim to
incorporate ‘human-like intelligence’ within computer systems. Recently, a new learning
methodology called the support vector machine (SVM) has been introduced. SVM is said
to perform better than ANN in many cases. Furthermore, SVM can be derived
mathematically and is simpler to analyze theoretically than ANN. It also provides a clear
intuition of what learning is about. SVM works by mapping the training data for a
learning task into a higher dimensional feature space using kernel functions and then
finding a maximal margin hyper plane that separates the data. Learning the solution hyper
plane involves quadratic programming (QP), which is computationally intensive.
However, many decomposition methods have been proposed that avoid solving the full
QP at once and make SVM learning practical for many current problems. This paper
compares SVM and ANN theoretically and practically in the case of handwriting
recognition.

Keywords: Neural networks, Support Vector Learning, Machine Learning, large margin
classifiers, kernel methods, neuromodelling, handwriting recognition.

1. Introduction

The field of machine learning is concerned with constructing computer programs
that automatically improve their performance with experience [1]. A machine learning
system is trained using a sample set of training data. Once the system has learned, it is
used to perform the required function based on what it has learned. Performance can
normally be improved by further training. In recent years many successful machine
learning applications have been developed; among them are data mining programs,
information filtering systems, autonomous vehicles and pattern recognition systems. The
area of machine learning draws on concepts from diverse fields such as statistics,
artificial intelligence, philosophy, information theory, biology, cognitive science,
computational complexity and control theory. Machine learning theory presents various
theoretical ideas on improving learning, while the practical aspect involves the
construction and improvement of algorithms for implementing the learning. Because the
applications of machine learning are so diverse, much of the literature on machine
learning is published under the individual application areas.

The artificial neural network (ANN) has been the most popular machine learning
algorithm. It is inspired by biological learning systems even though it does not mimic
them fully. The back-propagation (BP) algorithm is the most popular ANN training
algorithm and is well known to be robust, especially to errors in the training set. The
Support Vector Machine (SVM), on the other hand, is a relatively new learning algorithm.
It can similarly be used to learn target functions. However, unlike ANN, it is well founded
in statistical learning theory [2]. The major difference between SVM and ANN is in the
error optimization. In ANN, the aim of learning is to obtain a set of weight values that
minimizes the training error, while in SVM the training error is held at a minimum while
training adjusts the capacity of the machine. During training, SVM learns the parameters
$\alpha_i$ and the number of support vectors, which is equivalent to the number of hidden
units in ANN.

This paper discusses SVM from the perspective of its abilities compared with
ANN. The layout of this paper is as follows. Section two highlights some aspects of
ANN. Section three discusses SVM theory, concentrating on classification using SVM
for the linearly separable case, the non-linear case and the linearly non-separable case.
SVM implementations for two-class and multi-class classification are also discussed.
Areas of common application of ANN and SVM are discussed in section four, with
particular emphasis on handwriting recognition. Section five concludes.

2. Artificial Neural Network

Descriptions of artificial neural networks (ANN) can be found in a wide array of
publications on the subject. Text books, journal articles, conference proceedings
and research reports on ANN are numerous. This section briefly summarizes some of the
main points. ANN can be viewed as massively parallel computing systems consisting
of an extremely large number of simple processors with many interconnections. ANN
models use organizational principles such as learning, generalization, adaptivity and
computation in a network of weighted directed graphs, in which the nodes are artificial
neurons and the directed edges are connections between neuron outputs and neuron
inputs. The main characteristics of neural networks are that they have the ability to learn
complex nonlinear input-output relationships, they use sequential training procedures and
they adapt themselves to the data.

In the area of pattern classification, feed-forward networks are the most popularly
used. They include the BP-based multilayer perceptron (MLP) and the Radial-Basis
Function (RBF) networks. These networks are organized into layers and have
unidirectional connections between the layers. Another popular network is the
Self-Organizing Map (SOM), or Kohonen network, which is mainly used for data
clustering and feature mapping. The learning process involves updating the network
architecture and connection weights so that the network can efficiently perform a specific
classification/clustering task. The increasing popularity of neural network models in
machine learning, in particular for solving pattern recognition problems, has been
primarily due to their seemingly low dependence on domain-specific knowledge
compared to model-based and rule-based approaches, and to the availability of efficient
learning algorithms for practitioners to use. Another class of ANN, the convolutional
neural networks, provides a new suite of nonlinear algorithms for feature extraction using
hidden layers built into the ANN.
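
The layered, unidirectional structure of a feed-forward network can be illustrated with a small sketch. The network below is hypothetical: the sigmoid activation, the layer sizes and the weight values are assumptions chosen for illustration and do not come from any experiment in this paper.

```python
import math

def sigmoid(z):
    """Logistic activation, squashing any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def mlp_forward(inputs, layers):
    """Propagate `inputs` through a list of (weights, biases) layers in order."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical network: 2 inputs, 2 hidden units, 1 output, made-up weights.
toy_network = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                    # output layer
]
output = mlp_forward([1.0, 1.0], toy_network)
```

Training (for example by back-propagation) would adjust the entries of `toy_network`; the sketch only shows how a fixed set of weights maps an input to an output.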

3. Support Vector Machine

The Support Vector Machine (SVM) was introduced in 1992 by Vapnik and his
co-workers [3]. In its original form, SVM is a training algorithm for linear classification.
Only later was it used for regression, principal component analysis, novelty detection and
the non-linear case. SVM tunes the capacity of the classification function by
maximizing the margin between the training patterns and the decision boundary. The
solution is expressed as a linear combination of supporting patterns, called the support
vectors, which are the subset of training patterns closest to the decision boundary. For the
non-linear case, SVM maps the data from the input space into a higher dimensional
feature space in which the problem becomes linear, and the large-margin learning
algorithm is then applied. The mapping does not have to be computed explicitly; it can
be done implicitly by kernel functions. In the high dimensional feature space, simple
linear hyper plane classifiers that have maximal margin between the classes can be
obtained.

3.1 Theory of SVM

The original idea of SVM was developed for linearly separable data. In pattern
classification, suppose we have N training data: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$
where $x_i \in R^d$ and $y_i \in \{\pm 1\}$. In linear SVM, we would like to learn a linear
separating hyper plane classifier: $f(x) = \mathrm{sgn}(w \cdot x + b)$. We also want this hyper
plane to have the maximum separating margin with respect to the two classes.
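
As a minimal sketch of the classifier just defined, the following evaluates f(x) = sgn(w.x + b) and measures the margin a given hyper plane achieves on a training set. The four data points and the values of w and b are assumptions made up for illustration, and mapping sgn(0) to +1 is an arbitrary convention.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def classify(w, b, x):
    """Linear decision function f(x) = sgn(w.x + b); ties are mapped to +1."""
    return 1 if dot(w, x) + b >= 0 else -1

def geometric_margin(w, b, data):
    """Smallest signed distance y_i (w.x_i + b) / ||w|| over the training set."""
    norm = math.sqrt(dot(w, w))
    return min(y * (dot(w, x) + b) / norm for x, y in data)

# Hypothetical, linearly separable toy data: (point, label) pairs.
data = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((0.0, 0.0), -1), ((-1.0, 1.0), -1)]
w, b = (1.0, 1.0), -2.0

predictions = [classify(w, b, x) for x, _ in data]
margin = geometric_margin(w, b, data)  # sqrt(2) for this choice of w and b
```

A positive `margin` means every training point is on the correct side of the hyper plane; SVM training looks for the w and b that make this quantity as large as possible.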

Figure 1 Maximal margin


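Specifically, we want to find the hyper plane $H: y = w \cdot x + b = 0$ and two hyper
planes parallel to it and at equal distances from it, $H_1: y = w \cdot x + b = +1$ and
$H_2: y = w \cdot x + b = -1$, with the condition that there are no data points between
$H_1$ and $H_2$, and such that the distance or margin M between $H_1$ and $H_2$ is
maximized. See Figure 1.

The distance from $H_1$ to $H$ is $|w \cdot x + b| / \|w\| = 1 / \|w\|$, and thus the distance
between $H_1$ and $H_2$ is $2 / \|w\|$. Therefore, to maximize the margin we need to
minimize $\|w\|$, or equivalently $\frac{1}{2} w^T w$, with the condition that there are no
data points between $H_1$ and $H_2$:

$$w \cdot x_i + b \ge +1 \quad \text{for positive examples } y_i = +1,$$
$$w \cdot x_i + b \le -1 \quad \text{for negative examples } y_i = -1.$$

The two conditions can be combined into $y_i (w \cdot x_i + b) \ge 1$. So, our problem can be
formulated as

$$\min_{w, b} \; \frac{1}{2} w^T w \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1.$$

This is a convex quadratic programming problem (in w and b) in a convex set, which can
be solved by introducing Lagrange multipliers $\alpha_1, \alpha_2, \ldots, \alpha_N \ge 0$, one for
every training datum. Thus we have the following Lagrangian:

$$L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{N} \alpha_i \qquad (1)$$

We can now maximize $L(w, b, \alpha)$ with respect to $\alpha$, subject to the constraint that
the gradient of $L(w, b, \alpha)$ with respect to the primal variables w and b vanishes, i.e.

$$\frac{\partial L}{\partial w} = 0, \qquad \frac{\partial L}{\partial b} = 0,$$

and that $\alpha \ge 0$. We then have

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i \qquad \text{and} \qquad \sum_{i=1}^{N} \alpha_i y_i = 0.$$

Substituting them into $L(w, b, \alpha)$, we have

$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \qquad (2)$$

in which the primal variables w and b are eliminated. Solving for $\alpha_i$, we get
$w = \sum_{i=1}^{N} \alpha_i y_i x_i$, and our decision function is then:

$$f(x) = \mathrm{sgn}(w \cdot x + b) = \mathrm{sgn}\Big( \sum_{i=1}^{N} \alpha_i y_i (x_i \cdot x) + b \Big) \qquad (3)$$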
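
The relationship between the multipliers and the primal solution can be checked numerically. The sketch below assumes a hypothetical two-point training set for which the optimal multipliers can be worked out by hand (both equal 0.25); it recovers w as the sum of alpha_i y_i x_i, recovers b from a support vector, and confirms that the dual objective equals the primal objective (1/2) w.w at the optimum.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def weight_vector(alphas, data):
    """w = sum_i alpha_i * y_i * x_i."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for a, (x, y) in zip(alphas, data):
        for k in range(dim):
            w[k] += a * y * x[k]
    return w

def dual_objective(alphas, data):
    """sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(data)
    quad = sum(
        alphas[i] * alphas[j] * data[i][1] * data[j][1] * dot(data[i][0], data[j][0])
        for i in range(n)
        for j in range(n)
    )
    return sum(alphas) - 0.5 * quad

def bias_from_support_vector(w, alphas, data):
    """A support vector (alpha_i > 0) lies on the margin: y_i (w.x_i + b) = 1."""
    for a, (x, y) in zip(alphas, data):
        if a > 0:
            return y - dot(w, x)
    raise ValueError("no support vector found")

# Hypothetical training set: one positive and one negative example.
data = [((1.0, 1.0), 1), ((-1.0, -1.0), -1)]
alphas = [0.25, 0.25]                          # optimal multipliers, derived by hand

w = weight_vector(alphas, data)                # [0.5, 0.5]
b = bias_from_support_vector(w, alphas, data)  # 0.0
dual = dual_objective(alphas, data)            # 0.25
primal = 0.5 * dot(w, w)                       # 0.25
```

Both examples are support vectors here, and the resulting margin 2/||w|| equals the distance between the two points, as expected for the maximal margin hyper plane.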

 


 
In the case of a non-linearly separable input space, the data inputs can be mapped to
another, high dimensional feature space in which the data points are linearly separable. If
the mapping function is $\Phi(\cdot)$, we just solve:

$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \Phi(x_i) \cdot \Phi(x_j) \qquad (4)$$

Generally, if the dot product $\Phi(x_i) \cdot \Phi(x_j)$ is equivalent to a kernel
$k(x_i, x_j)$, the mapping need not be done explicitly. Thus the equation above can be
replaced by:

$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) \qquad (5)$$

Using the kernel in input space is equivalent to performing the map into feature space and
applying the dot product in that space. There are many kernels that can be used in this
way. Any kernel that satisfies Mercer’s condition can be used. One of them is the radial
basis function (Gaussian kernel):

$$k(x, y) = \exp\big( -\|x - y\|^2 / (2 \sigma^2) \big)$$
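
The claim that a kernel evaluated in input space equals a dot product in feature space can be verified on a small case. The sketch below uses the homogeneous degree-2 polynomial kernel k(x, y) = (x.y)^2 on 2-D inputs, for which an explicit feature map is known; this kernel is chosen here only because its feature map is short to write down. The Gaussian kernel is included with an assumed width parameter `sigma`.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel k(x, y) = (x.y)^2."""
    return dot(x, y) ** 2

def poly2_feature_map(x):
    """Explicit feature map for 2-D inputs: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def gaussian_kernel(x, y, sigma=1.0):
    """Radial basis function kernel exp(-||x - y||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly2_kernel(x, y)                                  # computed in input space
explicit = dot(poly2_feature_map(x), poly2_feature_map(y))     # computed in feature space
```

`implicit` and `explicit` agree up to rounding, although the first never constructs the three-dimensional feature vectors. For the Gaussian kernel the corresponding feature space is infinite dimensional, so only the implicit route is available.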

In the case of an imperfectly separable input space, where noise in the input data is
considered, there is no requirement that there be no data points between the planes $H_1$
and $H_2$ mentioned in the previous section; rather, a penalty C is imposed if data points
cross the boundaries. Using a formulation similar to the linear case, we obtain the same
dual Lagrangian but with a different constraint on $\alpha_i$, namely $0 \le \alpha_i \le C$,
where C is the penalty. Other commonly used kernels include:
(a) Polynomial kernels: $K(x, y) = (x \cdot y + 1)^p$
(b) Hyperbolic tangent (neural network): $k(x, y) = \tanh(a \, x \cdot y + b)$
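
The effect of the penalty C can be made concrete with the standard soft-margin primal objective, (1/2) w.w + C times the sum of the slacks, where the slack of a point is max(0, 1 - y (w.x + b)). This primal form is not written out in the text above and is stated here as an assumption (it is the usual 1-norm soft-margin formulation); the data, w, b and C below are made up for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, data):
    """Slack per point: how far it falls short of the margin y (w.x + b) >= 1."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in data]

def soft_margin_objective(w, b, data, C):
    """Margin term plus C times the total margin violation."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, data))

# Hypothetical noisy data: the second point sits inside the margin and the
# fourth is on the wrong side of the hyper plane.
data = [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((-2.0, 0.0), -1), ((0.5, 1.0), -1)]
w, b, C = (1.0, 0.0), 0.0, 10.0

xi = slacks(w, b, data)                       # [0.0, 0.5, 0.0, 1.5]
cost = soft_margin_objective(w, b, data, C)   # 0.5 + 10 * 2.0 = 20.5
```

A larger C makes violations more expensive and pushes the solution toward the hard-margin case; a smaller C tolerates more points inside the margin in exchange for a wider margin.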

3.2 SVM Implementations

Implementing the SVM algorithm is not easy because of the quadratic programming
involved. The functional form of the SVM is defined before training, and there are N free
parameters in an SVM trained with N training examples. The parameters are denoted
$\alpha_i$. To find these parameters, the quadratic programming (QP) problem in (5) is
solved subject to the linear constraints on $\alpha$. Equation (5) can also be written as:

$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - \sum_{i=1}^{N} \alpha_i \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N} y_i \alpha_i = 0$$

where Q is an N x N matrix, with entries $Q_{ij} = y_i y_j k(x_i, x_j)$, that depends on the
training inputs $x_i$, the labels $y_i$ and the functional form of the SVM.
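
A short sketch of how the matrix Q and the QP objective could be assembled is given below. It assumes entries Q_ij = y_i y_j k(x_i, x_j) and the minimization form (1/2) a^T Q a - sum(a), which is the dual Lagrangian with its sign flipped; the function names and the two-point data set are hypothetical.

```python
def linear_kernel(u, v):
    """Plain dot product, i.e. the kernel of the linear SVM."""
    return sum(a * b for a, b in zip(u, v))

def build_q(xs, ys, kernel):
    """Q[i][j] = y_i * y_j * k(x_i, x_j); symmetric because the kernel is."""
    n = len(xs)
    return [[ys[i] * ys[j] * kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)]

def qp_objective(alphas, q):
    """Value of 1/2 a^T Q a - sum(a), the quantity the QP solver minimizes."""
    n = len(alphas)
    quad = sum(alphas[i] * q[i][j] * alphas[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(alphas)

def is_feasible(alphas, ys, C, tol=1e-9):
    """Check the box constraint 0 <= a_i <= C and the equality sum(y_i a_i) = 0."""
    in_box = all(-tol <= a <= C + tol for a in alphas)
    balanced = abs(sum(a * y for a, y in zip(alphas, ys))) <= tol
    return in_box and balanced

# Hypothetical two-point problem.
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
q = build_q(xs, ys, linear_kernel)        # [[2.0, 2.0], [2.0, 2.0]]
value = qp_objective([0.25, 0.25], q)     # -0.25
```

Q has N^2 entries, which is why storing and factorizing it becomes the bottleneck for large training sets and motivates the decomposition methods discussed next.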

Conceptually, the SVM QP problem is to find the minimum of a bowl-shaped objective
function. The QP has definite termination conditions, called the Karush-Kuhn-Tucker
(KKT) conditions, that describe the set of $\alpha_i$ that are the minima. A QP optimizer
module is normally used, but it is slow and might not work well on large problems. To
work with large problems, decomposition methods are used. The QP problem is
decomposed into a series of smaller QP problems in which only a small sub-matrix has to
be solved at a time. Vapnik [4] originally proposed a simple decomposition method known
as chunking. It relies on the fact that if the rows and columns of the matrix Q that
correspond to zero $\alpha_i$ are removed, the value of the objective function remains the
same. The ultimate goal of Vapnik’s decomposition is to identify all the non-zero
$\alpha_i$ and discard all the zero $\alpha_i$. At every step, chunking solves a QP problem
that consists of every non-zero $\alpha_i$ from the last step and the $\alpha_i$ that
correspond to the M worst violations of the KKT conditions, for some value of M. Finally,
at the last step, Vapnik’s chunking has identified the entire set of non-zero $\alpha_i$, thus
solving the entire QP problem. Osuna et al. [5] use a constant size matrix to hold the QP
sub-problem. At each step, the same number of examples is added to and deleted from it.
Researchers have tried adding one or more examples at a time using various heuristics;
these variants are not as fast but still achieve convergence. Both decomposition methods,
by Vapnik and by Osuna, require a commercial numerical QP package, since writing one
is difficult without a numerical-analysis background.
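
The working-set selection used by chunking can be sketched as follows. The KKT conditions assumed here are the usual ones for the soft-margin dual: y f(x) >= 1 when alpha = 0, y f(x) = 1 when 0 < alpha < C, and y f(x) <= 1 when alpha = C. The function names, the tolerance and the example values are hypothetical, and the inner QP solve that chunking performs on each working set is not shown.

```python
def kkt_violation(alpha, margin, C, tol=1e-8):
    """Size of the KKT violation for one example, where margin = y_i * f(x_i)."""
    if alpha <= tol:                 # alpha = 0 requires margin >= 1
        return max(0.0, 1.0 - margin)
    if alpha >= C - tol:             # alpha = C requires margin <= 1
        return max(0.0, margin - 1.0)
    return abs(1.0 - margin)         # 0 < alpha < C requires margin = 1

def worst_violators(alphas, margins, C, m):
    """Indices of the (at most) m examples with the largest KKT violations."""
    scored = [(kkt_violation(a, g, C), i) for i, (a, g) in enumerate(zip(alphas, margins))]
    scored = [(v, i) for v, i in scored if v > 0.0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [i for _, i in scored[:m]]

def next_chunk(alphas, margins, C, m):
    """Working set for the next step: non-zero alphas plus the m worst violators."""
    non_zero = {i for i, a in enumerate(alphas) if a > 0.0}
    return sorted(non_zero | set(worst_violators(alphas, margins, C, m)))

# Hypothetical state after some chunking step.
alphas = [0.0, 0.0, 0.5, 1.0, 0.0]
margins = [1.5, 0.2, 1.0, 0.7, -0.3]
chunk = next_chunk(alphas, margins, C=1.0, m=2)   # [1, 2, 3, 4]
```

Example 0 has a zero multiplier and satisfies its KKT condition, so it is left out of the next sub-problem; when no violators remain, the current multipliers solve the full QP.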

John Platt [6] introduced Sequential Minimal Optimization (SMO), a better alternative
that can decompose the SVM QP problem without any extra matrix storage and without
using numerical QP optimization steps. SMO decomposes the overall QP problem into QP
sub-problems in the same way as Osuna’s method. But unlike the two earlier methods,
SMO chooses to solve the smallest possible optimization problem at every step, which
involves only two $\alpha_i$ that are jointly optimized to find their optimal values.
Keerthi et al. [7] further improved the speed of the SMO algorithm by using two
threshold parameters to derive modifications of SMO. The modified algorithms perform
significantly faster than the original SMO. Keerthi [8] also implemented the nearest point
algorithm (NPA), in which the problem is converted to computing the nearest point
between two convex polytopes. The latest improvement comes from Dong [9], in which
kernel caching, digest, shrinking policies and stopping conditions are taken into account
together to speed up SVM training further. Dong claims that his method is nine times
faster than SMO.
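
The two-multiplier update at the heart of SMO can be sketched in a few lines. What follows is a simplified teaching variant, not Platt's full algorithm: the second multiplier is picked at random instead of by Platt's heuristics, and the whole kernel matrix is cached for brevity, which gives up the storage advantage described above. All names, defaults and the toy data are assumptions.

```python
import random

def linear_kernel(u, v):
    """Plain dot product."""
    return sum(a * b for a, b in zip(u, v))

def simplified_smo(xs, ys, C=1.0, tol=1e-3, max_passes=5, kernel=linear_kernel, seed=0):
    """Return (alphas, b) for the soft-margin dual; needs at least two examples."""
    rng = random.Random(seed)
    n = len(xs)
    alphas, b = [0.0] * n, 0.0
    K = [[kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)]  # cached for brevity

    def error(i):
        """f(x_i) - y_i under the current multipliers and threshold."""
        return sum(alphas[k] * ys[k] * K[k][i] for k in range(n)) + b - ys[i]

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            e_i = error(i)
            violates_kkt = (ys[i] * e_i < -tol and alphas[i] < C) or \
                           (ys[i] * e_i > tol and alphas[i] > 0)
            if not violates_kkt:
                continue
            j = rng.choice([k for k in range(n) if k != i])  # random partner
            e_j = error(j)
            a_i, a_j = alphas[i], alphas[j]
            # Bounds that keep both multipliers in [0, C] and sum(y * alpha) fixed.
            if ys[i] != ys[j]:
                low, high = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
            else:
                low, high = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
            eta = 2.0 * K[i][j] - K[i][i] - K[j][j]
            if low == high or eta >= 0:
                continue
            # Analytic optimum along the constraint line, clipped to [low, high].
            new_j = min(high, max(low, a_j - ys[j] * (e_i - e_j) / eta))
            if abs(new_j - a_j) < 1e-5:
                continue
            new_i = a_i + ys[i] * ys[j] * (a_j - new_j)
            b1 = b - e_i - ys[i] * (new_i - a_i) * K[i][i] - ys[j] * (new_j - a_j) * K[i][j]
            b2 = b - e_j - ys[i] * (new_i - a_i) * K[i][j] - ys[j] * (new_j - a_j) * K[j][j]
            if 0 < new_i < C:
                b = b1
            elif 0 < new_j < C:
                b = b2
            else:
                b = (b1 + b2) / 2.0
            alphas[i], alphas[j] = new_i, new_j
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alphas, b

# Hypothetical two-point problem whose exact solution is alpha = (0.25, 0.25), b = 0.
alphas, b = simplified_smo([(1.0, 1.0), (-1.0, -1.0)], [1, -1], C=10.0)
```

Because each step changes exactly two multipliers, the sub-problem has a closed-form solution and no numerical QP routine is called anywhere in the loop.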

3.3 SVM for Multiclass Classification

Basic SVM can only handle two-class classification. A number of multiclass
SVM methods have been discussed by Hsu [10], Platt [11] and Weston [12]. The methods
can generally be categorized into: (a) combining binary classifiers and (b) modifying the
binary formulation to incorporate multiclass learning. In (a), multiple 2-class classifiers
such as 1 vs. 1 and 1 vs. all are constructed, and during classification the outputs of the
classifiers are combined in some way into a multiclass decision. For the 1 vs. 1 method, in
a k-class problem, k(k-1)/2 classifiers need to be constructed, and for recognition some
voting method or a directed acyclic graph (DAG) can be used to combine the classifiers.
In DAGSVM, each internal node is a 1 vs. 1 classifier and the leaf nodes are the classes.
For recognition, the graph is traversed from the root until a leaf is reached, which gives
the classification. In the 1 vs. all method, k classifiers need to be constructed. For
recognition, the classifier with the highest output determines the class. In (b), a multiclass
classifier is constructed by solving one complex optimization problem involving a large
number of free parameters. This all-together method has been proposed by Weston [12]
and Vapnik [2].
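
The three ways of combining binary classifiers described above can be sketched as follows. The binary classifiers here are hypothetical stand-ins (simple distance rules on a one-dimensional input), not trained SVMs, and the DAG is evaluated as a list from which one class is eliminated per node, which is one common way to realize the traversal.

```python
from itertools import combinations

def num_pairwise_classifiers(k):
    """Number of 1 vs. 1 classifiers needed for k classes."""
    return k * (k - 1) // 2

def one_vs_one_predict(x, classes, pairwise):
    """Majority vote over all pairwise classifiers; ties go to the earliest class."""
    votes = {c: 0 for c in classes}
    for pair in combinations(classes, 2):
        votes[pairwise[pair](x)] += 1
    return max(classes, key=lambda c: votes[c])

def dag_predict(x, classes, pairwise):
    """DAG traversal: each node compares the first and last remaining class
    and eliminates the loser, so only k - 1 classifiers are evaluated."""
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        if pairwise[(first, last)](x) == first:
            remaining.pop()       # `last` is eliminated
        else:
            remaining.pop(0)      # `first` is eliminated
    return remaining[0]

def one_vs_all_predict(x, scorers):
    """Pick the class whose 1 vs. all classifier gives the highest output."""
    return max(scorers, key=lambda c: scorers[c](x))

# Hypothetical stand-in classifiers on a 1-D input: class c "lives" near x = c.
classes = [0, 1, 2]
pairwise = {
    (a, b): (lambda x, a=a, b=b: a if abs(x - a) <= abs(x - b) else b)
    for a, b in combinations(classes, 2)
}
scorers = {c: (lambda x, c=c: -abs(x - c)) for c in classes}

x = 1.8
```

For ten digit classes, `num_pairwise_classifiers(10)` gives 45 pairwise classifiers for 1 vs. 1 or DAGSVM, against 10 for 1 vs. all; the DAG needs only 9 evaluations per test pattern, while voting evaluates all 45.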

Results of a comparison between all the methods mentioned have been produced by
Hsu [10]. Hsu concluded that 1 vs. 1 and DAGSVM are the best choices for practical use,
since they are less complex, easy to construct and faster to train.


4. ANN vs. SVM in handwriting recognition

One of the most popular application areas of machine learning is pattern
recognition, in particular speech recognition and handwriting recognition. In handwriting
recognition, ANN and SVM are often compared on their recognition performance on
publicly available handwritten digit databases such as the MNIST and USPS datasets.
The USPS dataset is the more difficult of the two, since the human recognition error rate
on it is as high as 2.5%. A comparison of SVM, a number of different ANN methodologies
and other learning methodologies on the two datasets is presented in Table 1. It should be
noted that ANN naturally handles multiclass classification, while for SVM a multiclass
implementation needs to be constructed as described in section 3.3.

Database | Sample images                               | Error rate comparison
---------+---------------------------------------------+-----------------------------------------------
USPS     | 9298 digits (7291 training, 2007 testing)   | 5-layer NN: 5%
         | Collected from mail envelopes in Buffalo    | 1 hidden layer NN: 5.9%
         | 16 x 16 images, each pixel value between    | Human error: 2.5%
         |   -1 and 1                                  | SVM with C = 10 (average for all kernels): 4%
         | Human error rate of 2.5%                    |
---------+---------------------------------------------+-----------------------------------------------
MNIST    | Originally 120,000 (60,000 training,        | Linear perceptron: 12.0%
         |   60,000 testing)                           | 40 PCA + quad: 3.3%
         | Normally used 70,000 (60,000 training,      | 1000 RBF + linear: 3.6%
         |   10,000 testing)                           | K-NN: 5%
         | Fit a character box of 20 x 20 in a         | K-NN (deskewed): 2.4%
         |   28 x 28 image                             | K-NN (tangent dist.): 1.1%
         |                                             | SVM: 1.1%
         |                                             | LeNet 5: 0.95%

Table 1 – Error rate comparison of ANN, SVM and other
algorithms for MNIST and USPS database.

Generally, as can be seen in the table, the SVM error rate is significantly lower than
that of most other algorithms, except for LeNet 5, which is a convolutional NN. Though
training for SVM is significantly slower, the higher recognition rate (lower error rate)
justifies its usage. Furthermore, as faster methods of implementing SVM have recently
been introduced, such as that of Dong [9], SVM usage is expected to increase and to
replace ANN in the area of handwriting recognition.

Usage of SVM has also picked up in other areas, such as image classification, time
series prediction, face recognition, biological data processing for medical diagnosis, text
categorization and speech recognition, as reported recently by various authors.


5. Conclusion

The aim of this paper is to present SVM in comparison with ANN. The detailed
derivation of SVM is given in order to provide an intuition into the beautiful concept of
SVM. Generally, ANN is known to overfit data unless cross-validation is applied, while
SVM does not overfit data and the ‘curse of dimensionality’ is avoided. In ANN learning
the topology is fixed, but in SVM learning is actually to learn the topology. Another
advantage of SVM over ANN is its better generalization ability due to the Structural Risk
Minimization (SRM) principle. The non-existence of local minima in SVM learning is
another reason why SVM is superior.

[1] Tom M. Mitchell, Machine Learning, McGraw Hill, 1997.
[2] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A Training Algorithm for Optimal
Margin Classifiers. In D. Haussler, editor, 5th Annual ACM Workshop on COLT,
pages 144-152, Pittsburgh, PA, 1992. ACM Press.
[4] V. Vapnik, Estimation of Dependencies Based on Empirical Data, Springer-Verlag,
New York, 1982.
[5] E. Osuna, R. Freund, and F. Girosi, “An Improved Training Algorithm for Support
Vector Machines,” Proc. IEEE Neural Networks for Signal Processing VII
Workshop, IEEE Press, Piscataway, N.J., 1997, pp. 276–285.
[6] J.C. Platt, “Fast Training of SVMs Using Sequential Minimal Optimization,” in
Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges,
and A. Smola, eds., MIT Press, Cambridge, Mass., 1998.
[7] S.K. Shevade, S.S. Keerthi, C. Bhattacharyya, and K. R. K. Murthy, Improvements
to the SMO Algorithm for SVM Regression, IEEE Transactions On Neural
Networks, Vol. 11, No. 5, September 2000.
[8] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya and K.R.K. Murthy, “A Fast Iterative
Nearest Point Algorithm for Support Vector Machine Classifier Design” Technical
Report TR-ISL-99-03, Intelligent Systems Lab, IIS, Bangalore, 1999.
[9] Jian-xiong Dong, Krzyzak A., Suen C. Y., A fast SVM training algorithm.
International workshop on Pattern Recognition with Support Vector Machines. S.-
W. Lee and A. Verri (Eds.): Springer Lecture Notes in Computer Science LNCS
2388, pp. 53-67, Niagara Falls, Canada, August 10, 2002.
[10] Hsu, C.W., Lin, C.J., “A Comparison of Methods for Multiclass Support Vector
Machines”, IEEE Transactions on Neural Networks, vol. 13, no. 2, March 2002.
[11] Platt, J.C., Cristianini, N., Shawe-Taylor, J., Large Margin DAGs for Multiclass
Classification, in NIPS2000, vol. 12, 2000.
[12] Weston, J., Watkins, C, Multi-class Support Vector Machines, Technical report
CSD-TR-98-04, Royal Holloway, University of London, 1998.