
Machine Learning Using Support Vector Machines

Abdul Rahim Ahmad¹, Marzuki Khalid², Rubiyah Yusof²

¹ Universiti Tenaga Nasional,
Km 7, Jalan Kajang-Puchong, 43009 Kajang, Selangor.
abdrahim@uniten.edu.my

² Centre for Artificial Intelligence and Robotics,
Universiti Teknologi Malaysia,
Jalan Semarak, 54100 Kuala Lumpur
{marzuki,rubiyah}@utmkl.utm.my

Abstract

Artificial Neural Networks (ANN) have been the most widely used machine learning methodology. They draw much of their inspiration from the neurosciences: by structurally attempting to mimic the architecture of the human brain during learning, ANN aim to incorporate 'human-like intelligence' into computer systems. Recently, a new learning methodology called the Support Vector Machine (SVM) has been introduced. SVM is reported to perform better than ANN in many cases. Furthermore, SVM can be derived mathematically and is simpler to analyze theoretically than ANN. It also provides a clear intuition of what learning is about. SVMs work by mapping the training data into a higher-dimensional feature space using kernel functions and then finding a maximal-margin hyperplane that separates the data. Learning the solution hyperplane involves quadratic programming (QP), which is computationally intensive. However, many decomposition methods have been proposed that avoid solving the full QP and make SVM learning practical for many current problems. This paper compares SVM and ANN theoretically and practically in the case of handwriting recognition.

Keywords: Neural networks, Support Vector Learning, Machine Learning, large margin

classifiers, kernel methods, neuromodelling, handwriting recognition.

1. Introduction

The field of machine learning is concerned with constructing computer programs that automatically improve their performance with experience [1]. A machine learning system is trained using a sample set of training data. Once the system has learned, it is used to perform the required function based on that learning experience, and performance can normally be improved by further training. In recent years many successful machine learning applications have been developed; among them are data mining programs, information filtering systems, autonomous vehicles and pattern recognition systems. The area of machine learning draws on concepts from diverse fields such as statistics, artificial intelligence, philosophy, information theory, biology, cognitive science, computational complexity and control theory. Machine learning theory presents various theoretical ideas on improving learning, while the practical aspect involves the construction and improvement of algorithms that implement the learning. Because the applications of machine learning are so diverse, much of the literature on machine learning is organized under its individual areas of application.

The artificial neural network (ANN) has been the most widely used machine learning algorithm. It is inspired by biological learning systems, even though it does not mimic them fully. The back-propagation (BP) algorithm is the most popular training algorithm and is well known to be robust, especially on problems with errors in the training set. The Support Vector Machine (SVM), on the other hand, is a relatively new learning algorithm. It can similarly be used to learn target functions; however, unlike ANN, it is firmly founded on statistical learning theory [2]. The major difference between SVM and ANN lies in the error optimization. In ANN, the aim of learning is to obtain a set of weight values that minimizes the training error, whereas in SVM the training error is fixed (kept at a minimum) while training adjusts the capacity of the machine. During training, the SVM learns the parameters $\alpha_i$ and the number of support vectors, which is analogous to the number of hidden units in an ANN.

This paper discusses SVM from the perspective of its abilities relative to ANN. The layout of the paper is as follows: section two highlights some aspects of ANN. Section three discusses SVM theory, concentrating on classification using SVM for the linearly separable, non-linear and linearly non-separable cases; SVM implementations for two-class and multiclass classification are also discussed. Areas of common application of ANN and SVM are discussed in section four, with particular emphasis on handwriting recognition. Section five concludes.

2. Artificial Neural Network

Descriptions of artificial neural networks (ANN) can be found in a wide array of publications on the subject: textbooks, journal articles, conference proceedings and research reports on ANN are numerous. This section briefly summarizes some main points. ANN can be viewed as massively parallel computing systems consisting of an extremely large number of simple processors with many interconnections. ANN models use organizational principles such as learning, generalization and adaptivity, computing over a network represented as a weighted directed graph in which the nodes are artificial neurons and the directed edges are connections from neuron outputs to neuron inputs. The main characteristics of neural networks are that they can learn complex nonlinear input-output relationships, they use sequential training procedures and they adapt themselves to the data.

In the area of pattern classification, feed-forward networks are the most widely used. They include the BP-trained multilayer perceptron (MLP) and the Radial Basis Function (RBF) network. These networks are organized into layers and have unidirectional connections between the layers. Another popular network is the Self-Organizing Map (SOM), or Kohonen network, which is mainly used for data clustering and feature mapping. The learning process involves updating the network architecture and connection weights so that the network can efficiently perform a specific classification or clustering task. The increasing popularity of neural network models in machine learning, in particular for solving pattern recognition problems, has been primarily due to their seemingly low dependence on domain-specific knowledge compared with model-based and rule-based approaches, and to the availability of efficient learning algorithms for practitioners to use. Another class of ANN, the convolutional neural network, provides a suite of nonlinear feature-extraction algorithms through hidden layers built into the network.
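
As a purely illustrative sketch of the feed-forward networks described above, the following trains a small BP-based MLP with scikit-learn; the synthetic dataset, layer size and other parameters are assumptions made for the example, not part of any system surveyed here.

```python
# Hypothetical example: a one-hidden-layer MLP trained by back-propagation (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Weights are adjusted iteratively to minimise the training error, as described in section 2.
mlp = MLPClassifier(hidden_layer_sizes=(50,), activation='relu', max_iter=500, random_state=0)
mlp.fit(X_tr, y_tr)
print("test accuracy:", mlp.score(X_te, y_te))
```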

3. Support Vector Machine

The Support Vector Machine (SVM) was introduced in 1992 by Vapnik and his co-workers [3]. In its original form, SVM is a training algorithm for linear classification; only later was it used for regression, principal component analysis, novelty detection and the non-linear case. SVM tunes the capacity of the classification function by maximizing the margin between the training patterns and the decision boundary. The solution is expressed as a linear combination of supporting patterns, the subset of training patterns closest to the decision boundary, called the support vectors. For the non-linear case, SVM maps the input data into a higher-dimensional feature space in which the problem becomes linear, and the large-margin learning algorithm is then applied there. The mapping can be done implicitly through kernel functions. In the high-dimensional feature space, simple linear hyperplane classifiers with maximal margin between the classes can be obtained.

3.1 Theory of SVM

The original idea of SVM was developed for linearly separable data. In pattern classification, suppose we have $N$ training examples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{\pm 1\}$. In linear SVM, we would like to learn a linear separating hyperplane classifier $f(x) = \operatorname{sgn}(w \cdot x + b)$. We also want this hyperplane to have the maximum separating margin with respect to the two classes.

[Figure 1: the maximal margin M between the two parallel hyperplanes that bound the classes.]

Specifically, we want to find the hyperplane $H: w \cdot x + b = 0$ and two hyperplanes parallel to it and at equal distances from it, $H_1: w \cdot x + b = +1$ and $H_2: w \cdot x + b = -1$, with the condition that no data points lie between $H_1$ and $H_2$ and that the distance, or margin, $M$ between $H_1$ and $H_2$ is maximized (see Figure 1).

The distance from $H_1$ to $H$ is $\frac{|w \cdot x + b|}{\lVert w \rVert} = \frac{1}{\lVert w \rVert}$, and thus the distance between $H_1$ and $H_2$ is $\frac{2}{\lVert w \rVert}$. Therefore, to maximize the margin we need to minimize $\tfrac{1}{2} w^T w$, with the condition that no data points lie between $H_1$ and $H_2$:

$w \cdot x_i + b \geq +1$ for positive examples ($y_i = +1$),
$w \cdot x_i + b \leq -1$ for negative examples ($y_i = -1$).

The two conditions can be combined into $y_i (w \cdot x_i + b) \geq 1$. So our problem can be formulated as

$$\min_{w,b} \; \tfrac{1}{2} w^T w \quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1.$$

This is a convex quadratic programming problem (in $w$ and $b$) over a convex set, which can be solved by introducing Lagrange multipliers $\alpha_1, \alpha_2, \ldots, \alpha_N \geq 0$, one for every training example. We thus have the following Lagrangian:

$$L(w, b, \alpha) = \tfrac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{N} \alpha_i \qquad (1)$$

We can now maximize $L(w, b, \alpha)$ with respect to $\alpha$, subject to the constraint that the gradient of $L(w, b, \alpha)$ with respect to the primal variables $w$ and $b$ vanishes, i.e.

$$\frac{\partial L}{\partial w} = 0 \quad \text{and} \quad \frac{\partial L}{\partial b} = 0,$$

and that $\alpha \geq 0$. We then have

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0.$$

Substituting these into $L(w, b, \alpha)$ gives the dual

$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j) \qquad (2)$$

in which the primal variables $w$ and $b$ are eliminated. Solving for the $\alpha_i$, we get $w = \sum_{i=1}^{N} \alpha_i y_i x_i$, and our decision function is then

$$f(x) = \operatorname{sgn}(w \cdot x + b) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i (x_i \cdot x) + b \right) \qquad (3)$$
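
To make equations (2) and (3) concrete, the following minimal sketch solves the hard-margin dual with a generic QP solver; the cvxopt package and the toy two-dimensional data are assumptions made purely for illustration.

```python
import numpy as np
from cvxopt import matrix, solvers  # any quadratic programming solver would do

# Invented, linearly separable toy data: three points per class.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.5], [0.5, 0.0], [1.0, 0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(y)

# Q_ij = y_i y_j (x_i . x_j), the matrix appearing in the dual (2);
# a tiny ridge keeps the solver numerically stable.
Q = np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(N)

# Solve: min 1/2 a^T Q a - 1^T a  subject to  a >= 0  and  y^T a = 0.
solvers.options['show_progress'] = False
sol = solvers.qp(matrix(Q), matrix(-np.ones(N)),
                 matrix(-np.eye(N)), matrix(np.zeros(N)),
                 matrix(y.reshape(1, -1)), matrix(0.0))
alpha = np.ravel(sol['x'])

# Recover w from the alphas and b from the support vectors (alpha_i > 0),
# then apply the decision function of equation (3).
w = (alpha * y) @ X
sv = alpha > 1e-6
b = np.mean(y[sv] - X[sv] @ w)
f = lambda x: np.sign(x @ w + b)
print(alpha.round(3), w.round(3), round(b, 3), f(np.array([2.2, 2.4])))
```

Only the training points nearest the boundary receive non-zero $\alpha_i$; these are the support vectors of Figure 1.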

In the case of a non-linearly separable input space, the inputs can be mapped to a higher-dimensional feature space in which the data points become linearly separable. If the mapping function is $\Phi(\cdot)$, we simply solve

$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \Phi(x_i) \cdot \Phi(x_j) \qquad (4)$$

Generally, if the dot product $\Phi(x_i) \cdot \Phi(x_j)$ is equivalent to a kernel $k(x_i, x_j)$, the mapping need not be done explicitly, and the equation above can be replaced by

$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) \qquad (5)$$

Using a kernel in input space is equivalent to performing the mapping into feature space and taking the dot product in that space. Many kernels can be used in this way: any kernel that satisfies Mercer's condition is admissible. One of them is the radial basis function (Gaussian) kernel

$$k(x_i, x_j) = e^{-\lVert x_i - x_j \rVert^2 / 2\sigma^2}.$$

In the case of an imperfectly separable input space, where noise in the input data is taken into account, we no longer insist that no data points lie between the planes $H_1$ and $H_2$ of the previous section; instead, a penalty $C$ is imposed when data points cross the boundaries. Using a formulation similar to the linear case, we obtain the same dual Lagrangian but with a different constraint on the $\alpha_i$, namely $0 \leq \alpha_i \leq C$, where $C$ is the penalty. Other commonly used kernels include:

(a) the polynomial kernel $k(x, y) = (x \cdot y + 1)^d$;
(b) the hyperbolic tangent (neural network) kernel $k(x, y) = \tanh(a \, x \cdot y + b)$.
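
A minimal sketch, assuming scikit-learn is available, of a soft-margin SVM with the Gaussian kernel above; the toy dataset and the chosen values of C and the kernel width are illustrative assumptions only.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Non-linearly separable toy data with some noise, so the penalty C matters.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# kernel='rbf' is the Gaussian kernel, with gamma playing the role of 1/(2*sigma^2).
clf = SVC(kernel='rbf', gamma=0.5, C=10.0)
clf.fit(X, y)

print("number of support vectors:", len(clf.support_vectors_))
print("dual coefficients (alpha_i * y_i, bounded by C):", clf.dual_coef_[0][:5])
print("training accuracy:", clf.score(X, y))
```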

3.2 SVM Implementations

Implementing the SVM algorithm is not easy because of the quadratic programming involved. The SVM functional form is defined before training, and an SVM trained on $N$ examples has $N$ free parameters, the $\alpha_i$'s. To find these parameters, the quadratic programming (QP) problem in (5) is solved subject to the linear constraints on $\alpha$. Equation (5) can also be written as

$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i Q_{ij} \alpha_j \qquad (6)$$

where $Q$ is an $N \times N$ matrix that depends on the training inputs $x_i$, the labels $y_i$ and the functional form of the SVM.

Conceptually, the SVM QP problem is to find the minimum of a bowl-shaped objective function. The QP has definite termination conditions, the Karush-Kuhn-Tucker (KKT) conditions, which characterize the set of $\alpha_i$ at the minimum. A general-purpose QP optimizer is normally used, but it is slow and may not work well on large problems. To handle large problems, decomposition methods are used: the QP problem is decomposed into a series of smaller QP problems in which only a small sub-matrix of $Q$ has to be dealt with at a time.

Vapnik [4] originally proposed a simple decomposition method known as chunking. It relies on the fact that if the rows and columns of the matrix $Q$ that correspond to zero $\alpha_i$ are removed, the value of the objective function remains the same. The ultimate goal of Vapnik's decomposition is therefore to identify all the non-zero $\alpha_i$ and discard all the zero $\alpha_i$'s. At every step, chunking solves a QP problem that consists of every non-zero $\alpha_i$ from the previous step plus the $\alpha_i$ corresponding to the $M$ worst violations of the KKT conditions, for some value of $M$. At the last step, chunking has identified the entire set of non-zero $\alpha_i$'s, thereby solving the entire QP problem. Osuna et al. [5] instead use a constant-size matrix for the QP sub-problem: at each step the same number of examples is added to and removed from it. Researchers have tried adding one or several examples at a time using various heuristics; though not as fast, these variants still achieve convergence. Both decomposition methods, Vapnik's and Osuna's, effectively require a commercial numerical QP package, since writing one is difficult without a numerical-analysis background.
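
As an illustration of the selection step mentioned above, the sketch below picks the M examples that most violate the KKT conditions in the soft-margin setting (0 <= alpha_i <= C); it is a NumPy sketch of the general idea, not the exact procedure of [4] or [5].

```python
import numpy as np

def worst_kkt_violators(alpha, y, f_vals, C, M, tol=1e-3):
    """Return indices of the M worst KKT violators; f_vals[i] is the current f(x_i)."""
    yf = y * f_vals
    viol = np.zeros_like(alpha)
    lower = alpha <= tol                       # alpha_i = 0  requires y_i f(x_i) >= 1
    upper = alpha >= C - tol                   # alpha_i = C  requires y_i f(x_i) <= 1
    mid = ~lower & ~upper                      # 0 < alpha_i < C requires y_i f(x_i) = 1
    viol[lower] = np.maximum(0.0, 1.0 - yf[lower])
    viol[upper] = np.maximum(0.0, yf[upper] - 1.0)
    viol[mid] = np.abs(1.0 - yf[mid])
    return np.argsort(viol)[::-1][:M]
```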

John Platt [6] introduced Sequential Minimal Optimization (SMO), a better alternative that decomposes the SVM QP problem without any extra matrix storage and without numerical QP optimization steps. SMO decomposes the overall QP problem into QP sub-problems, as Osuna's method does, but unlike the two earlier methods it chooses to solve the smallest possible optimization problem at every step, jointly optimizing only two $\alpha_i$ at a time to find their optimal values. Keerthi et al. [7] further improved the speed of the SMO algorithm by using two threshold parameters to derive modified versions of SMO; the modified algorithms perform significantly faster than the original. Keerthi et al. [8] also implemented the nearest point algorithm (NPA), in which the problem is converted to computing the nearest point between two convex polytopes. The latest improvement comes from Dong [9], in which kernel caching, digesting, shrinking policies and stopping conditions are combined to speed up SVM training further; Dong claims that his method is nine times faster than SMO.
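
The essential idea of SMO, analytically optimizing just two multipliers per step, can be sketched as below. This follows the commonly taught simplified variant that chooses the second multiplier at random, so it should be read as an illustration of the principle rather than a faithful reimplementation of Platt's algorithm [6].

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5, kernel=lambda a, b: a @ b):
    """Simplified SMO: optimizes pairs (alpha_i, alpha_j) analytically until nothing changes."""
    n = X.shape[0]
    alpha, b = np.zeros(n), 0.0
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    f = lambda i: np.sum(alpha * y * K[:, i]) + b            # current decision value on x_i

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E_i = f(i) - y[i]
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = np.random.choice([k for k in range(n) if k != i])   # crude second choice
                E_j = f(j) - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                if y[i] != y[j]:                                        # box constraints on alpha_j
                    L, H = max(0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])   # keep sum of y_i alpha_i at 0
                # Update the threshold b so the KKT conditions hold for the changed pair.
                b1 = b - E_i - y[i] * (alpha[i] - ai_old) * K[i, i] - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - ai_old) * K[i, j] - y[j] * (alpha[j] - aj_old) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```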

3.3 SVM for Multiclass Classification

Basic SVM can only handle two-class classification. A number of multiclass SVM methods have been discussed by Hsu [10], Platt [11] and Weston [12]. They can broadly be categorized into (a) combining binary classifiers and (b) modifying the binary formulation to incorporate multiclass learning directly. In (a), multiple two-class classifiers such as 1-vs-1 and 1-vs-all are constructed, and during classification the individual outputs are combined in some way into a multiclass decision. For the 1-vs-1 method, a k-class problem requires k(k-1)/2 classifiers, and for recognition some voting method or a directed acyclic graph (DAG) is used to combine them. In DAGSVM, each internal node is a 1-vs-1 classifier and the leaf nodes are the classes; for recognition, the graph is traversed from the root until a leaf is reached, giving the classification. In the 1-vs-all method, k classifiers are constructed, and for recognition the classifier with the highest output determines the class. In (b), the multiclass classifier is constructed by solving one complex optimization problem involving a large number of free parameters; this all-together method has been proposed by Weston [12] and Vapnik [2].

A comparison of all the methods mentioned has been produced by Hsu [10], who recommends 1-vs-1 and DAGSVM as the best choices for practical use, since they are less complex, easy to construct and faster to train.
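
Assuming scikit-learn, the two combination schemes in (a) can be sketched on a small three-class problem, where 1-vs-1 builds k(k-1)/2 = 3 binary SVMs and 1-vs-all builds k = 3; the dataset and parameters are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                 # k = 3 classes

ovo = OneVsOneClassifier(SVC(kernel='rbf', gamma='scale', C=1.0))   # 3 pairwise classifiers, combined by voting
ova = OneVsRestClassifier(SVC(kernel='rbf', gamma='scale', C=1.0))  # 3 one-vs-all classifiers, highest output wins

print("1-vs-1 accuracy:", cross_val_score(ovo, X, y, cv=5).mean())
print("1-vs-all accuracy:", cross_val_score(ova, X, y, cv=5).mean())
```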


4. ANN vs. SVM in handwriting recognition

One of the most popular areas of machine learning is pattern recognition, in particular speech recognition and handwriting recognition. In handwriting recognition, ANN and SVM are often compared on the basis of their recognition performance on publicly available handwritten digit databases such as the MNIST and USPS datasets. The USPS dataset is the more difficult of the two; the human error rate on it is as high as 2.5%. A comparison of SVM with a number of different ANN methodologies, as well as other learning methods, on the MNIST and USPS datasets is presented in Table 1. It should be noted that ANN handles multiclass classification naturally, whereas for SVM a multiclass implementation such as those described in section 3.3 is needed.

USPS: 9,298 digits (7,291 training, 2,007 testing), collected from mail envelopes in Buffalo; 16 x 16 images with pixel values between -1 and 1; human error rate 2.5%.

    1-hidden-layer NN                     5.9%
    5-layer NN                            5.0%
    SVM, C = 10 (average over kernels)    4.0%
    Human                                 2.5%

MNIST: originally 120,000 digits (60,000 training, 60,000 testing); normally 70,000 are used (60,000 training, 10,000 testing); each character fits a 20 x 20 box inside a 28 x 28 image.

    Linear perceptron                    12.0%
    40 PCA + quadratic classifier         3.3%
    1000 RBF + linear classifier          3.6%
    k-NN                                  5.0%
    k-NN (deskewed)                       2.4%
    k-NN (tangent distance)               1.1%
    SVM                                   1.1%
    LeNet 5                               0.95%

Table 1 – Error rate comparison of ANN, SVM and other algorithms on the USPS and MNIST databases.

Generally, as the table shows, the SVM error rate is significantly lower than that of most other algorithms, except for LeNet 5, which is a convolutional NN. Although SVM training time was significantly longer, the higher recognition rate (lower error rate) justifies its use. Furthermore, as faster methods of implementing SVM have recently been introduced, such as that of Dong [9], SVM usage should increase and may come to replace ANN in the area of handwriting recognition.

Usage of SVM has also picked up in other areas, such as image classification, time series prediction, face recognition, biological data processing for medical diagnosis, text categorization and speech recognition, as reported by various authors recently.
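
The experiments in Table 1 were run on MNIST and USPS; as a small, purely illustrative stand-in (not comparable to the figures above), the following compares an MLP and an RBF-kernel SVM on scikit-learn's built-in 8 x 8 digits data, with hyperparameters chosen arbitrarily.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X / 16.0, y, test_size=0.3, random_state=0)

ann = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0).fit(X_tr, y_tr)
svm = SVC(kernel='rbf', gamma=0.02, C=10.0).fit(X_tr, y_tr)  # multiclass handled internally (1-vs-1)

print("ANN test error rate:", 1 - ann.score(X_te, y_te))
print("SVM test error rate:", 1 - svm.score(X_te, y_te))
```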


5. Conclusion

The aim of this paper has been to present SVM in comparison with ANN. The detailed derivation of SVM is given in order to provide an intuition into the elegant concept behind SVM. Generally, ANN is known to overfit data unless cross-validation is applied, whereas SVM does not overfit and the 'curse of dimensionality' is avoided. In ANN learning the topology is fixed, while in SVM, learning effectively determines the topology. Another advantage of SVM over ANN is its better generalization ability, which stems from the Structural Risk Minimization (SRM) principle. The absence of local minima in SVM learning is a further reason for its superiority.

References

[1] Tom M. Mitchell, Machine Learning, McGraw Hill, 1996.

[2] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A Training Algorithm for Optimal

Margin Classifiers. In D. Haussler, editor, 5th Annual ACM Workshop on COLT,

pages 144-152, Pittsburgh, PA, 1992. ACM Press.

[4] V. Vapnik, Estimation of Dependencies Based on Empirical Data, Springer-Verlag,

New York, 1982.

[5] E. Osuna, R. Freund, and F. Girosi, “An Improved Training Algorithm for Support

Vector Machines,” Proc. IEEE Neural Networks for Signal Processing VII

Workshop, IEEE Press, Piscataway, N.J., 1997, pp. 276–285.

[6] J.C. Platt, “Fast Training of SVMs Using Sequential Minimal Optimization,” in

Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges,

and A. Smola, eds., MIT Press, Cambridge, Mass., 1998.

[7] S.K. Shevade, S.S. Keerthi, C. Bhattacharyya, and K. R. K. Murthy, Improvements

to the SMO Algorithm for SVM Regression, IEEE Transactions On Neural

Networks, Vol. 11, No. 5, September 2000.

[8] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya and K.R.K. Murthy, “A Fast Iterative

Nearest Point Algorithm for Support Vector Machine Classifier Design” Technical

Report TR-ISL-99-03, Intelligent Systems Lab, IIS, Bangalore, 1999.

[9] Jian-xiong Dong, Krzyzak A., Suen C. Y., A fast SVM training algorithm.

International workshop on Pattern Recognition with Support Vector Machines. S.-

W. Lee and A. Verri (Eds.): Springer Lecture Notes in Computer Science LNCS

2388, pp. 53-67, Niagara Falls, Canada, August 10, 2002.

[10] Hsu, C.W., Lin, C.J., "A Comparison of Methods for Multiclass Support Vector
Machines", IEEE Transactions on Neural Networks, vol. 13, no. 2, March 2002.

[11] Platt, J.C., Cristianini, N., Shawe-Taylor, J., Large Margin DAGs for Multiclass

Classification, in NIPS2000, vol. 12, 2000.

[12] Weston, J., Watkins, C, Multi-class Support Vector Machines, Technical report

CSD-TR-98-04, Royal Holloway, University of London, 1998.
