Lecture Notes

Statistical and Machine Learning

Classical Methods )Kernelizing (Bayesian

& +.

Statistical Learning Theory

% * -

Information Theory SVM Neural Networks

Su-Yun Huang

¤1

,Kuang-Yao Lee

1

and Horng-Shing Lu

2

1

Institute of Statistical Science,Academia Sinica

2

Institute of Statistics,National Chiao-Tung University

contact:

¤

syhuang@stat.sinica.edu.tw

¤

http://www.stat.sinica.edu.tw/syhuang

course web pages:

¤

http://140.109.74.188

TAs:Pei-Chun Chen and Kuang-Yao Lee

COURSE INFORMATION & PREREQUISITES

²

Prerequisites:calculus,linear algebra,elementary probability and statistics.

²

Course plan:See Table of Contents (tentative).We will emphasize on know-

ing\why"and on statistical aspects instead of algorithms and programming.

But still you have to know\how"by either writing your own implementation

or modifying from others'code.

²

Grading policy:

homework 30%,score for late homework= (full points) £0:8

d

;d:delay days.

midterm 30% (April 27,2007).

oral presentation 20% on assigned tasks (and your ¯nal project as well).

¯nal project 20%:choose your own data analysis problem and write up a short

report and prepare for an oral presentation.The report should include

{

problem and data description,

{

methodology (related to this course),

{

data analysis,and

{

conclusion.

The report can be in Chinese or in English.

²

Web resources:

{

UCI Repository for machine learning is probably the most popular repos-

itory.http://www.ics.uci.edu/»mlearn/MLRepository.html

{

UCI KDD Archive.It is an online repository of large data sets which

encompasses a wide variety of data types,analysis tasks,and application

areas.http://kdd.ics.uci.edu

{

CiteSeer is a public search engine and digital library for scienti¯c and

academic papers in the ¯elds of computer and information sciences.

http://citeseer.ist.psu.edu

{

Statlib for statistical software and data sets.http://lib.stat.cmu.edu

{

SSVM toolbox.http://dmlab1.csie.ntust.edu.tw/downloads

{

Kernel Statistics toolbox.http://www.stat.sinica.edu.tw/syhuang

or http://140.109.74.188/kern

stat

toolbox/kernstat.html

ii Lecture Notes

{

LIBSVM.http://www.csie.ntu.edu.tw/»cjlin/libsvm/

{

Wikipedia,a free online encyclopedia.

http://en.wikipedia.org/wiki/Main

Page

Contents

1 Introduction 1

1.1 Aim and scope...............................

1

1.2 Notation and abbreviation.........................

2

2 Support Vector Machines 5

2.1 Linear support vector machines......................

5

2.2 Kernel trick for nonlinear SVM extension................

9

2.3 Variants of SVM algorithms........................

11

2.3.1 One-norm soft margin SVM....................

12

2.3.2 Two-norm soft margin SVM....................

12

2.3.3 Smooth SVM (SSVM).......................

13

2.4 Multiclass classi¯cation..........................

15

3 Two Auxiliary Techniques:Reduced Kernel and Model Selection for

SVMs 17

3.1 Reduced kernel:a low-rank approximation................

17

3.1.1 RSVM with random subset....................

18

3.1.2 RSVM with optimal basis.....................

20

3.1.3 Some remarks............................

21

3.2 Uniform design for SVM model selection.................

24

3.2.1 Methodology:2 staged UD for model selection.........

24

4 More on Kernel-Based Learning Algorithms 29

4.1 Kernel principal components analysis...................

29

4.1.1 Classical PCA...........................

30

4.1.2 Kernel PCA.............................

31

4.2 Kernel sliced inverse regression......................

32

4.2.1 Classical SIR............................

32

iii

iv CONTENTS

4.2.2 Kernel SIR.............................

34

4.3 Kernel canonical correlation analysis...................

36

4.3.1 Classical CCA...........................

37

4.3.2 Kernel CCA.............................

38

4.4 Kernel Fisher discriminant analysis....................

40

4.4.1 Classical FLDA...........................

40

4.4.2 Kernel FDA.............................

41

4.5 Support vector regression.........................

42

4.5.1 RLS-SVR:a ridge approach....................

43

4.5.2 Smooth ²-support vector regression (²-SSVR)..........

43

4.6 SIR and KSIR revisited..........................

45

5 Learning Theory with RKHS (optional) 49

5.1 Introduction.................................

49

5.2 SVMs with RKHS.............................

51

5.3 Multivariate statistical analysis with RKHS...............

52

5.4 More to come................................

54

6 Neural Networks 55

A Some Matrix Algebra 69

A.1 Matrix diagonalization...........................

69

A.1.1 Eigenvalue decomposition and generalized eigenvalue decompo-

sition................................

69

A.1.2 Singular value decomposition (SVD) and generalized SVD...

70

B Kernel Statistics Toolbox 71

B.1 KGaussian..................................

71

B.2 Smooth SVM................................

71

B.3 Lagrangian SVM..............................

73

B.4 Uniform designs for model selection....................

75

B.5 Support vector regression.........................

75

B.6 Kernel PCA.................................

75

B.7 Kernel SIR.................................

77

B.8 Kernel Fisher linear discriminant analysis................

79

B.9 Kernel CCA.................................

79

List of Figures

3.1 Spectrum comparison on full kernel vs.reduced kernel..........

23

3.2 Three examples of search space of model selection............

25

3.3 The nested UD model selection with a 13-points UD at the ¯rst stage

and a 9-points UD at the second stage...................

26

3.4 The nested UD model selection with a 9-points UD at the ¯rst stage

and a 5-points UD at the second stage...................

27

4.1 SIR view over the 1st-and-2nd variates..................

34

4.2 KSIR view over the 1st-and-2nd variates.................

35

4.3 Scatter plot of pen-digits over CCA-found variates............

39

4.4 Scatter plots of pen-digits over KCCA-found variates..........

40

4.5 Data scatter along some coordinates....................

47

4.6 Data scatter,¯tted values and residuals using 1st SIR variate......

47

4.7 Data scatter,¯tted values and residuals using 1st KSIR variate.....

48

v

vi LIST OF FIGURES

Chapter 1

Introduction

1.1 Aim and scope

The development of statistical methods is often highly in°uenced and challenged by

problems fromother ¯elds.Due to the rapid growth of computer technology,we easily

encounter with enormous amount of data collected from diverse sources of scienti¯c

¯elds,which has lead to a great demand for innovative analytic tools for complex data.

The process of extracting information from data and making statistical inferences for

future observations is called learning from data.Statistical and machine learning is

an interdisciplinary ¯eld consisting of theory fromstatistics,probability,mathematics

and computer science,with plenty of applications for engineering science,biology,

bioinformatics,medical study,etc.

The aimof this monograph is to provide an overviewof the development of machine

learning with emphasis on its statistical aspects.It aims for an introductory graduate

level course.The choice of material is rather subjective.A major part of it is based on

some recent works by colleagues in the IE&SLR group.

1

This monograph starts with

an introduction to support vector machines (SVMs),followed by other kernel-based

learning algorithms,learning theory with reproducing kernel Hilbert space (RKHS)

and ended by a chapter on neural networks.Bayesian theory will be blended in

when it is necessary.Two auxiliary techniques for computation reduction and model

selection (see reduced kernel and model selection for SVMs in Chapter 3) are also

provided in this monograph.Program codes can be found in Appendix B.They are

1

IE&SLR (information extraction & statistical learning research) group consists of faculty and

students from various institutes and departments of Academia Sinica and other universities.

1

2 CHAPTER 1.INTRODUCTION

part of the Kernel Statistics toolbox

2

still in development.

1.2 Notation and abbreviation

All vectors will be column vectors unless otherwise speci¯ed or transposed to a row

vector by a prime superscript

0

.For a matrix A 2 <

n£p

,A

i

is the ith row of A.We

put the data fx

i

2 <

p

g

n

i=1

by row in a matrix denoted by A,and we call it the data

design matrix.We will use the row vector A

i

and the column vector x

i

for the same

ith observation depending on the convenience.A column vector of ones of arbitrary

dimension will be denoted by 1.For A 2 <

n£p

and B 2 <

~n£p

,the kernel K(A;B)

maps R

n£p

£R

~n£p

into R

n£~n

.We adopt a Matlab vector and matrix convention

[M

1

;M

2

]:=

"

M

1

M

2

#

;

where M

i

can be a vector or a matrix and the semicolon on the left hand side of the

equality means to stack M

1

on top of M

2

as shown on the right hand side.Other

notation convention and terminology abbreviation used in this monograph are listed

below.

²

A:a matrix of training inputs,where input data are arranged by rows,i.e.,

A = [x

0

1

;x

0

2

;:::;x

0

n

] 2 <

n£p

.

²

A

+

:a matrix of training inputs whose corresponding class label is 1.

²

A

¡

:a matrix of training inputs whose corresponding class label is ¡1.

²

A

i

:the ith row of a matrix A.

²

C(M):the column space of a matrix M,i.e.,the linear space spanned by columns

of M.

²

CCA:canonical correlation analysis.

²

CV:cross-validation.

²

D:a diagonal matrix with class labels in the diagonals,i.e.D = diag(y

1

;:::;y

n

).

²

FLDA:Fisher linear discriminant analysis.

2

Kernel Statistics toolbox is maintained on http://140.109.74.188/kern

stat

toolbox/kernstat.html

by the ¯rst author.It is also available at http://www.stat.sinica.edu.tw/syhuang.

1.2.NOTATION AND ABBREVIATION 3

²

I:an index set.

²

jIj:the size of an index set I.

²

K,K(A;B):kernel function or kernel matrix.

²

~

K = K(A;

~

A):a reduced kernel matrix using some subset

~

A.

²

KKT:Karush-Kuhn-Tucker.

²

n:full data size;~n:reduced set size.

²

PCA:principal components analysis.

²

plus function t

+

:For t = (t

1

;:::;t

p

) 2 <

p

,the plus function t

+

is de¯ned

componentwise as (t

+

)

i

= max f0;t

i

g for i = 1;:::;p.

²

1:a column vector of ones of arbitrary dimension;1

p

:a column vector of ones

of dimensionality p;

²

©:X 7!Z:feature map.

²

<:real number.

²

RK:reproducing kernel;RKHS:reproducing kernel Hilbert space.

²

RSVM:reduced (kernel) support vector machine.

²

SIR:sliced inverse regression.

²

SVM:support vector machine;LSVM:Lagrangian SVM;SSVM:smooth SVM.

²

SVR:support vector regression;SSVR:smooth support vector regression.

²

UD:uniform design.

²

X:input data space.

²

Z:feature data design matrix,i.e.Z = [z

0

1

;:::;z

0

n

].

²

Z:feature space.

²

rg(x):gradient consisting of ¯rst derivatives.

²

r

2

g(x):Hessian matrix consisting of second derivatives.

²

h¢;¢i

H

:inner product in a Hilbert space H.

²

k ¢ k

q

:the q-norm.

4 CHAPTER 1.INTRODUCTION

Chapter 2

Support Vector Machines

Classi¯cation,which is one kind of supervised learning,is a commonly encountered

task in statistics.For a binary classi¯cation problem,we are given a training data

set f(x

1

;y

1

);:::;(x

n

;y

n

)g,where x

i

2 <

p

is the input data and y

i

2 Y = f¡1;1g

is the corresponding class label.The main goal of classi¯cation is to ¯nd a decision

function f(x) based on the training data that can predict future class labels for new

input data points.

Support vector machines in a wide sense refer to a body of statistical and com-

putational techniques that build on\kernel trick".Observations are actually taken

in the pattern space,while the classi¯cation (or any other learning task) is carried

out in the so-called feature space,which is a very high dimensional Hilbert space and

where populations can be easily separated and/or predicted by linear methods.In

recent years SVMs [5,15,53] have become one of the most promising learning algo-

rithms for classi¯cation as well as for regression [16,38,39,47,31,43],which are two

fundamental tasks in data mining [55].

2.1 Linear support vector machines

The linear SVMfor binary classi¯cation attempts to ¯nd a hyperplane with maximal

margin to separate the training data into two classes,A

¡

and A

+

.We consider the

following linear decision function

f(x) = w

0

x +b:

(2.1.1)

The decision boundary,given by fx:f(x) = 0g,for separating the positive and

negative input instances is a hyperplane in the input space <

p

.The distance from

5

6 CHAPTER 2.SUPPORT VECTOR MACHINES

the origin to the hyperplane,w

0

x +b = 0,in <

p

is given by jbj=kwk

2

.For the linear

separable case,we can arrange the scale of the normal vector w and the o®set b so

that the following constraints are satis¯ed:

w

0

x

i

+b ¸ +1 for y

i

= +1;

(2.1.2)

w

0

x

i

+b · ¡1 for y

i

= ¡1:

(2.1.3)

These constraints can be reformulated into one set of inequalities:

1 ¡y

i

(w

0

x

i

+b) · 0 8i:

(2.1.4)

The margin between the two bounding planes (2.1.2) and (2.1.3) is

jb +1 ¡(b ¡1)j=kwk

2

= 2=kwk

2

:

Maximizing the margin is equivalent to minimizing kwk

2

2

.Thus,the SVM for linear

separable case can be formulated as a constrained optimization problem:

min

w;b

kwk

2

2

=2 subject to 1 ¡y

i

(w

0

x

i

+b) · 0 8i:

(2.1.5)

It is almost always the case that data are not completely linear separable and there

is no feasible solution to (2.1.5).Hence,it is necessary to consider the non-separable

case.For the non-separable case,we further introduce some positive slack variables

»

i

into the constraints:

w

0

x

i

+b ¸ +1 ¡»

i

for y

i

= +1;

(2.1.6)

w

0

x

i

+b · ¡1 +»

i

for y

i

= ¡1;

(2.1.7)

»

i

¸ 0 8i:

(2.1.8)

Inequalities (2.1.6)-(2.1.8) can be combined into

1 ¡y

i

(w

0

x

i

+b) ¡»

i

· 0 and ¡»

i

· 0;8i:

(2.1.9)

The nonnegative constraints on slack variables »

i

¸ 0 are expressed as ¡»

i

· 0 in

(2.1.9).It is merely out of convenience for making the associated Lagrange multi-

pliers nonnegative.With these slack variables,a natural way to assign error cost is

to add a penalty on the size of »

i

to the objective function to be minimized,e.g.,

kwk

2

2

=2 +C

P

n

i=1

P(»

i

),where P(») is a nonnegative increasing function of »,and C

is a tuning parameter controlling the trade-o® between data goodness of ¯t (or called

data ¯delity) and the regularity (or °atness,smoothness) of the decision function.

2.1.LINEAR SUPPORT VECTOR MACHINES 7

Here we consider a penalty on the one-norm of » leading to the following constrained

optimization problem:

min

w;b;»

kwk

2

2

=2 +C

n

X

i=1

»

i

subject to (2:1:9):

(2.1.10)

To solve the problem,Lagrange multipliers ®

i

¸ 0 and ¯

i

¸ 0 are introduced,one for

each inequalities in (2.1.9).Then we have the following Lagrangian:

L(w;b;»;®;¯) =

1

2

kwk

2

2

+C1

0

» +®

0

[1 ¡D(Aw +b1) ¡»] ¡¯

0

»;

(2.1.11)

where D is a diagonal matrix with y

i

in the diagonals.Variables w;b;» are called

primal variables,and ®;¯ are called dual variables.The task is to minimize (2.1.11)

with respect to w,b and »,and to maximize it with respect to ® and ¯.At the

optimal point,we have the following saddle point equations:

@L

@w

= 0;

@L

@b

= 0 and

@L

@»

= 0;

which translate to

w

0

¡®

0

DA = 0;®

0

D1 = 0 and C1 ¡® ¡¯ = 0;

or equivalently to

w =

n

X

i=1

®

i

y

i

x

i

;

n

X

i=1

®

i

y

i

= 0 and ®

i

+¯

i

= C 8i:

(2.1.12)

By substituting (2.1.12) into the Lagrangian,we get the dual problem:

max

®

n

X

i=1

®

i

¡

1

2

n

X

i;j=1

®

i

®

j

y

i

y

j

x

0

i

x

j

subject to

n

X

i=1

®

i

y

i

= 0 and 0 · ®

i

· C:

(2.1.13)

By introducing the slack variables »

i

,we get the box constraints that limit the size

of the Lagrange multipliers:®

i

· C.The KKT necessary and su±cient optimality

8 CHAPTER 2.SUPPORT VECTOR MACHINES

conditions

1

can be summarized by

0 · ®?1 ¡D(Aw +b1) ¡» · 0;and

0 · ¯?» ¸ 0:

These KKT conditions imply that

®

i

= 0 ) ¯

i

= C ) y

i

f(x

i

) ¸ 1 and »

i

= 0;

0 < ®

i

< C ) 0 < ¯

i

< C ) y

i

f(x

i

) = 1 and »

i

= 0;

®

i

= C ) ¯

i

= 0 ) y

i

f(x

i

) · 1 and »

i

¸ 0:

Points x

i

associated with nonzero ®

i

are called support vectors (SVs).The o®set

term b is computed by

b =

1

jIj

X

i2I

0

@

y

i

¡

n

X

j=1

®

j

y

j

x

0

j

x

i

1

A

;where I = fi:0 < ®

i

< Cg:

With w =

P

n

i=1

®

i

y

i

x

i

the underlying decision function is then given by

f(x) =

n

X

i=1

®

i

y

i

x

0

i

x +b =

X

x

i

:SV

®

i

y

i

x

0

i

x +b:

Only the SVs contribute to the decision function for classi¯cation.In addition,the

contribution of a data point as a SV is at most at a scale C in the ¯nal decision

function.The one-norm soft margin SVM is probably the most common formulation

for SMVs.

Remark 1

The SVM introduced above provides a way of doing linear discriminant

based on the one-norm soft margin criterion.There are many other alternatives in

statistical literature,e.g.,Fisher linear discriminant analysis,logistic discriminant

analysis,classi¯cation by least squares regression,Gaussian mixture approach,etc.

Think about why SVM and why the kernel extension introduced in later chapters,

while we are proceeding forward in this course.

1

Theorem.

Given a convex optimization problem:min

v

`(v) subject to inequality constraints

g

i

(v) · 0,i = 1;:::;k,with`2 C

1

and g

i

being a±ne,necessary and su±cient conditions for a

point v

¤

to be an optimal solution are the existence of Lagrange multipliers ´

¤

such that

@L(v

¤

;´

¤

)

@v

= 0;

´

¤

i

g

i

(v

¤

) = 0;i = 1;:::;k;

g

i

(v

¤

) · 0;i = 1;:::;k;

´

¤

i

¸ 0;i = 1;:::;k:

The second conditions are known as the KKT conditions.

2.2.KERNEL TRICK FOR NONLINEAR SVM EXTENSION 9

2.2 Kernel trick for nonlinear SVM extension

How can we generalize the SVM to the nonlinear case,but still keep the model

simplicity of linearity?That is,we want a °exible decision boundary for separating

the two classes,while still manage to maintain the simplicity of estimation and/or

prediction task.

The main idea is ¯rst to map the input data to a high dimensional space by a

feature map ©:X 7!Z

x 7!©(x) = (Á

1

(x);Á

2

(x);Á

3

(x);:::) 2 Z:

2

(2.2.14)

Next,the linear SVMproblem (2.1.10) is considered and nominally carried out in the

feature space Z.Here are the nominal input data in the feature space fz

i

= ©(x

i

)g

n

i=1

.

Similar to the constraints (2.1.9),we can make constraints in the feature space as

follows:

1 ¡y

i

(w

0

z

i

+b) ¡»

i

· 0 8i;

(2.2.15)

¡»

i

· 0 8i;

(2.2.16)

where w is the normal vector to the separating hyperplane in Z.The corresponding

optimization problem becomes

min

w;b;»

kwk

2

2

=2 +C

n

X

i=1

»

i

subject to (2:2:15) ¡(2:2:16):

(2.2.17)

Similar to the derivation for (2.1.13),we start with partial derivatives of the La-

grangian with respect to the primal variables:

@L

@w

= 0;

@L

@b

= 0 and

@L

@»

= 0;

which translate to

w

0

¡®

0

DZ = 0;®

0

D1 = 0 and C1 ¡® ¡¯ = 0;where Z = [z

0

1

;z

0

2

;¢ ¢ ¢;z

0

n

];

or equivalently to

w =

n

X

i=1

®

i

y

i

z

i

;

n

X

i=1

®

i

y

i

= 0 and ®

i

+¯

i

= C 8i:

(2.2.18)

By substituting (2.2.18) into the Lagrangian,we get the dual problem:

2

Regard the collection of functions fÁ

1

(x);Á

2

(x);Á

3

(x);:::g as a basis set in a certain functional

space H of X.The nonlinear SVM in general is a nonparametric method.

10 CHAPTER 2.SUPPORT VECTOR MACHINES

max

®

n

X

i=1

®

i

¡

1

2

n

X

i;j=1

®

i

®

j

y

i

y

j

©(x

i

)

0

©(x

j

) subject to

n

X

i=1

®

i

y

i

= 0 and 0 · ®

i

· C:

(2.2.19)

Similarly,b is computed as

b =

1

jIj

X

i2I

0

@

y

i

¡

n

X

j=1

®

j

y

j

©(x

j

)

0

©(x

i

)

1

A

;where I = fi:0 < ®

i

< Cg:

Note that the feature data z

i

are not observed.However,the problem (2.2.17) and

the underlying decision function

f(x) = w

0

z +b;where z = ©(x);

depend on w and z only through their values of inner products in Z.De¯ne the inner

product as a bivariate function on X £X:

K(x;u) = ©(x)

0

©(u):

There is no need to know explicitly the functional formof ©,nor the individual feature

observations z

i

.It is su±cient to know the function K(x;u) representing the inner

product values in Z.The function K(x;u) is called a kernel function.Or one can

simply start with a given positive de¯nite kernel K(x;u).(We will give a formal

de¯nition of positive de¯nite kernel later.) Then,solve the maximal dual problem

max

®

n

X

i=1

®

i

¡

1

2

n

X

i;j=1

®

i

®

j

y

i

y

j

K(x

i

;x

j

) subject to

n

X

i=1

®

i

y

i

= 0 and 0 · ®

i

· C;

(2.2.20)

and b is computed from

b =

1

jIj

X

i2I

0

@

y

i

¡

n

X

j=1

®

j

y

j

K(x

j

;x

i

)

1

A

;where I = fi:0 < ®

i

< Cg:

Finally,the underlying decision function is given by

f(x) =

n

X

i=1

®

i

y

i

K(x

i

;x) +b:

The kernel trick introduced above provides a powerful extension for nonlinear

SVMs.In fact,the SVM does not solve the nonlinear problem directly.It maps the

2.3.VARIANTS OF SVM ALGORITHMS 11

input data into a high dimensional space via an implicit transformation so that the

data in this high dimensional space can be better separated by a hyperplane therein.

One of the most popular kernel functions is the Gaussian kernel (also known as the

radial basis function):

K(x;u) = e

¡°kx¡uk

2

2

;

(2.2.21)

where x and u are two vectors in <

p

and ° is the width parameter of the Gaussian

kernel.(We do not care to normalize it to make the kernel integrate to one,as the

normalization has no e®ect on SVM problems.) The value of K(x;u) represents the

inner product of ©(x) and ©(u) in some high dimensional feature space Z and re°ects

the similarity between x and u.Values of K(x;u) drop o® at a rate determined by

kx ¡uk

2

2

and °.

Before we end this section,we give a formal de¯nition of a positive de¯nite kernel

and its spectrum.

De¯nition 1 (Positive de¯nite kernel and its spectrum)

A real-valued symmet-

ric kernel function K de¯ned on X £X is said to be positive de¯nite,if and only if

that,for any positive integer m,any sequence of real numbers fa

1

;:::;a

m

g and any

sequence of points fu

1

;:::;u

m

g in X,we have

P

m

i;j=1

a

i

a

j

K(u

i

;u

j

) ¸ 0,and strictly

positive de¯nite when the equality holds if and only if a

1

= ¢ ¢ ¢ = a

m

= 0.For a given

positive de¯nite kernel K,if it de¯nes a continuous compact operator on L

2

(X) then

K can be expanded in terms of its eigenvalues-eigenfunctions:

K(x;u) =

X

q

¸

q

Ã

q

(x)Ã

q

(u):

(2.2.22)

With the above expansion,the associated feature map is given by

©(

x

) = (

p

¸

1

Ã

1

(

x

)

;

p

¸

2

Ã

2

(

x

)

;

p

¸

3

Ã

3

(

x

)

;:::

)

:

2.3 Variants of SVM algorithms

There are two key ingredients in an SVMproblem.One is the kernel trick for making

extension to nonlinear decision function,which is a kernel mixture (i.e.,a linear

combination of kernels).The decision function is nonlinear in the pattern space but

linear in an implicit feature space.The nonlinearity via using kernel mixture allows

excellent °exibility and versatility for the decision function in the pattern space,while

the simple linear structure,as well as the sparsity of SVs for making up the decision

function in the feature space,makes the implementation of nonlinear model possible.

The other key ingredient is the error criterion (or risk,cost criterion) for seeking (or

estimating) the decision function.

12 CHAPTER 2.SUPPORT VECTOR MACHINES

2.3.1 One-norm soft margin SVM

The formulations in (2.1.10) and (2.1.13) for linear case as well as (2.2.19) and (2.2.17)

for nonlinear case fall into the category of one-norm soft margin SVM.The term one-

norm refers to the one-norm penalty on slack variables:k»k

1

=

P

n

i=1

»

i

.

3 4

2.3.2 Two-norm soft margin SVM

Besides the one-norm k»k

1

,it is natural to consider the two-norm penalty k»k

2

2

.The

corresponding formulation becomes

min

w;b;»

1

2

kwk

2

2

+

C

2

n

X

i=1

»

2

i

subject to (2:2:15) ¡(2:2:16):

(2.3.23)

Homework 1

Formulate the 2-norm soft margin SVM in dual form.

Solution

:The nonnegative constraint (2.2.16) can be removed since the term k»k

2

2

is

included in the objective function.Therefore,we have the following Lagrangian:

L(w;b;»;®) =

1

2

kwk

2

2

+

C

2

n

X

i=1

»

2

i

+®

0

[1 ¡D(Zw +b1) ¡»];

and its partial derivatives with respect to the primal variables:

@L

@w

= 0;

@L

@b

= 0 and

@L

@»

= 0;

which translate to

w

0

¡®

0

DZ = 0;®

0

D1 = 0 and C» ¡® = 0;

or equivalently to

w =

n

X

i=1

®

i

y

i

z

i

;

n

X

i=1

®

i

y

i

= 0 and »

i

=

®

i

C

8i:

3

There is another SVM variant,also called one-norm SVM [57].It penalizes on the one-norm of

the normal vector,i.e.,kwk

1

.

4

The usage of penalty is to regularize (to make\smooth",often by shrinking the parameters values

toward zero) the underlying model.The e®ect of a one-norm penalty will lead to a parsimonious

model structure and it enforces the action of keep-or-kill.The e®ect of a two-norm penalty will

shrink the parameters values,but will never kill them.In Bayesian formulation,the one-norm

penalty corresponds to a Laplacian (double exponential) prior and the two-norm corresponds to a

Gaussian prior on parameters.

2.3.VARIANTS OF SVM ALGORITHMS 13

Then by substituting the above equations into the Lagrangian,we can obtain

L(w;b;»;®)

=

1

2

n

X

i;j=1

®

i

®

j

y

i

y

j

z

0

i

z

j

+

1

2C

n

X

i=1

®

2

i

¡

n

X

i;j=1

®

i

®

j

y

i

y

j

z

0

i

z

j

+

n

X

i=1

®

i

¡

1

C

n

X

i=1

®

2

i

=

n

X

i=1

®

i

¡

1

2

n

X

i;j=1

®

i

®

j

y

i

y

j

z

0

i

z

j

¡

1

2C

n

X

i=1

®

2

i

:

The 2-norm soft margin SVM problem can be formulated in the following dual form:

max

®

n

X

i=1

®

i

¡

1

2

n

X

i;j=1

®

i

®

j

y

i

y

j

K(x

i

;x

j

) ¡

1

2C

n

X

i=1

®

2

i

subject to

n

X

i=1

®

i

y

i

= 0;®

i

¸ 0 8i:

2

There is another variant of 2-norm soft margin SVM algorithm,called the La-

grangian SVM [40] (LSVM for short).It has also appended the term b

2

=2 to the

margin criterion and solves the following reformulated problem:

min

w;b;»

1

2

(kwk

2

2

+b

2

) +

C

2

n

X

i=1

»

2

i

subject to (2:2:15) ¡(2:2:16):

(2.3.24)

This reformulation leads to simple constraints ®

i

¸ 0 in the dual problem (see Ap-

pendix B) without sacri¯cing the testing accuracy.The LSVM implementation is

based directly on the KKT necessary and su±cient optimality conditions for the dual

problem.For the linear case,the implementation of LSVM has used the Sherman-

Morrison-Woodbury (SMW) identity to transforman n£n matrix inversion to a p£p

matrix inversion.However,for the nonlinear case,the n£n kernel matrix [K(x

i

;x

j

)]

is dense and of full rank (or close to full rank),and there is no advantage of using

SMW identity.In this case,a reduced kernel technique (introduced later in Chap-

ter 3) should be used to cut down the column rank of the kernel matrix.Refer to

Appendix B for implementation details and Matlab code.

2.3.3 Smooth SVM (SSVM)

Traditionally,SVMs are solved via their dual problems.Here we introduce an alter-

native approach,which focuses on the primal problems and solves them directly by

smoothing techniques [33].Smoothing methods can be found and have been exten-

sively used in optimization literature [8,9,7,11,12,18,51].The basic idea is to

14 CHAPTER 2.SUPPORT VECTOR MACHINES

¯nd a smooth (di®erentiable) approximation to the non-smooth objective function,

here speci¯cally the plus function.Then,use a Newton

5

method to solve the smooth

optimization problem.

The decision function of an SVM has the following form:

f(x) =

n

X

i=1

®

i

y

i

K(x

i

;x) +b

Ã

:=

n

X

i=1

v

i

K(x

i

;x) +b with v

i

= ®

i

y

i

!

:

The SSVM works directly on the primal problem.It starts from a type of 2-norm

soft margin SVM,and next it appends the term b

2

=2 to the objective function to be

minimized and leads to the following minimization problem:

min

v;b;»

C

2

k»k

2

2

+

1

2

¡

kvk

2

2

+b

2

¢

;

subject to 1 ¡y

i

f(x

i

) ¡»

i

· 0 8i;

»

i

¸ 0 8i:

Note that the nonnegative constraint » ¸ 0 can be removed,as the term k»k

2

2

is

included in the objective function of the minimization problem.The problem above

becomes

min

v;b;»

C

2

k»k

2

2

+

1

2

¡

kvk

2

2

+b

2

¢

;

(2.3.25)

subject to 1 ¡D(Kv +b1) ¡» · 0;

(2.3.26)

where K = K(A;A) = [K(x

i

;x

j

)]

n

i;j=1

.At a solution,» takes the form » = (1 ¡

D(Kv +b1))

+

.The problem given by (2.3.25) and (2.3.26) converts to an equivalent

unconstrained minimization problem:

min

v;b

C

2

°

°

(1 ¡D(Kv +b1))

+

°

°

2

2

+

1

2

¡

kvk

2

2

+b

2

¢

:

(2.3.27)

It is a strongly convex minimization problemwithout any constraint and has a unique

solution.However,the objective function is not di®erentiable,which hinders the use

of a Newton method.Below a smooth approximation [33] is introduced.

5

It is also known as Newton-Raphson method and is an e±cient algorithm for ¯nding approxi-

mations to the zeros (or roots) of a smooth function.It can also be used to ¯nd local maxima and

local minima of functions by noticing that if a point is a local optimal point,then it is a root of the

derivative of the underlying function.For a twice-di®erentiable function f(x),an iterative scheme

for a local optimal point can be prescribed as follows:

x

n+1

= x

n

¡

r

2

f(x

n

)

¡1

rf(x

n

);n ¸ 0;

where x

0

is an initial point.Usually the Newton method is modi¯ed to include a small step size

0 < ° < 1 instead of ° = 1.This is known as the Newton-Armijo algorithmto ensure the convergence

of the iterative scheme.

2.4.MULTICLASS CLASSIFICATION 15

For t = (t

1

;:::;t

p

) 2 <

p

,the plus function t

+

is de¯ned componentwise as (t

+

)

i

=

max f0;t

i

g for i = 1;:::;p.An accurate smooth approximation to the plus function

is given by p(t;½),which is de¯ned componentwise by:

p(t

i

;½) = t

i

+

1

½

log

¡

1 +e

¡½t

i

¢

;t

i

2 <;½ > 0:

(2.3.28)

The smooth p function is the integral of the Sigmoid (also known as logistic) function

given by:g(t) = 1=(1+e

¡½t

) for t 2 <,which is used to approximate the step function:

h(t) = 1 for t ¸ 0 and h(t) = 0 for t < 0.As ½!1,p(t;½)!t

+

for all t 2 <

p

.We

can replace (1 ¡D(Kv +b1))

+

by the smooth approximation (2.3.28) and the SVM

problem converts to

min

v;b

C

2

kp(1 ¡D(Kv +b1);½)k

2

2

+

1

2

¡

kvk

2

2

+b

2

¢

:

(2.3.29)

The problem (2.3.29) can be solved by the fast Newton-Armijo algorithm [33].A

Matlab implementation code is given in Appendix B.This code takes the advantage

of sparsity of the Hessian matrix and has used the limit values of the sigmoid function

and p-function in computing the gradient vector and the Hessian matrix.That is,

½ = 1 in the implementation,which simpli¯es the computation for the gradient

vector and the Hessian matrix.

2.4 Multiclass classi¯cation

So far we have only talked about the binary classi¯cation,as SVMs are originally

designed for binary classi¯cation.There are two commonly seen multiclass exten-

sions for SVMs.One is the composition type methods built upon a series of binary

classi¯ers,e.g.,the one-against-one and one-against-rest,and the other is the sin-

gle machine type methods,often huge and solved in one optimization formulation.

See Rifkin and Klautau [45] for literature review,comparison study and references.

There is no universally dominant classi¯cation rule for multiclass problems.Di®erent

methods have their own merits and advantages,allowing the room for exploration of

alternative approaches.

6

In this section we will only introduce the one-against-one

(also known as pairwise) classi¯cation,which is probably the most commonly used

classi¯cation scheme in multiclass SVMs.In one-against-one classi¯cation scheme,we

train a series of binary classi¯ers for each pair of classes.For k classes,this results

in k(k ¡ 1)=2 binary classi¯ers.Though the number of classi¯ers to be trained is

6

For instance,kernel Fisher discriminant analysis [41,26] and classi¯cation via support vector

regression [10].

16 CHAPTER 2.SUPPORT VECTOR MACHINES

large,the individual problems that we need to train on are signi¯cantly smaller.It is

often time saving for training a series of small classi¯ers than training a huge single

machine.

When it comes for classifying a test pattern,we evaluate all k(k ¡ 1)=2 binary

classi¯ers,and assign the test point to the class with majority votes.A vote for a

given class is de¯ned as a classi¯er assigning the test pattern to that class.For tied

votes,we use a random assignment among classes of ties.

Chapter 3

Two Auxiliary Techniques:

Reduced Kernel and Model

Selection for SVMs

In this chapter we introduce two auxiliary techniques for dealing with large data

computation and model selection associated with SVMs.

3.1 Reduced kernel:a low-rank approximation

In nonlinear SVM as well as other kernel-based learning algorithms,to generate a

nonlinear separating surface has to deal with a fully dense kernel matrix of size n£n.

When n is median to large,there are some computational di±culties encountered:

(P1)

the size of the mathematical programming problem,

(P2)

the size of memory usage,

(P3)

the complexity of the resulting separating surface.

To cut down the model complexity,a reduced kernel technique was introduced [32,34].

The original reduced kernel support vector machine (RSVM) uses a very small random

subset of size ~n (¿n) to build up the separating surface for SSVM algorithm.The

reduced set plays a similar role as support vectors.We denote this random subset by

~

A ½ A,which is used to generate a much smaller rectangular matrix of size n £ ~n:

K(A;

~

A) = [K(x

i

;x

j

)];x

i

2 A;x

j

2

~

A:

17

18 Two Auxiliary Techniques

The reduced kernel matrix is used to replace the full kernel matrix K(A;A).The

usage of a reduced kernel can cut the problem size,associated computing time and

the memory space,and it can simplify the characterization of the nonlinear separating

surface as well.The basic idea of RSVM is to solve an approximate SVM problem

instead of the full problem.It works by replacing the full kernel matrix with a low-

rank approximation.The use of reduced kernel is not limited to randomsubset nor the

SSVM algorithm.There are alternative choices.We divide the discussion of reduced

kernel selection into 2 categories:the most computationally economic random subset

and the most expensive optimal basis selection.There are also other alternative

selections in between by searching for optimal basis within a certain random subset.

As for the SVMformulation,there are 2 basic types of RSVMformulations introduced

here:RSVM in the primal space and RSVM in the dual space,respectively.

3.1.1 RSVM with random subset

In this monograph,the term\RSVM"is referring to reduced kernel SVMof any form,

which need not be limited to random subset nor to any speci¯c SVM algorithm.

Primal RSVM using random subset is straightforwardly easy to formulate.

It works simply by replacing the full kernel matrix K with a reduced kernel,denoted

by

~

K.Using SSVM,which is solved in the primal space,as an illustrative example,

its decision function is of the form f(x) = K(x;

~

A)~v + b,where ~v is solved directly

from the following optimization problem:

min

~v;b

C

2

°

°

°

°

³

1 ¡Df

~

K~v +b1g

´

+

°

°

°

°

2

2

+

1

2

¡

k~vk

2

2

+b

2

¢

;where

~

K = K(A;

~

A):

(3.1.1)

In solving the RSVM (3.1.1),a smooth approximation p(t;½) to the plus function is

used.The solution to the minimization problem(3.1.1) leads to a nonlinear separating

surface of the form

K(x;

~

A)~v +b = 0;i:e:;

~n

X

i=1

K(x;

~

A

i

)~v

i

+b = 0:

(3.1.2)

In fact,the reduced set

~

A is not necessarily to be a subset of training set.The

RSVMminimization problem(3.1.1) retains the strong convexity and di®erentiability

properties in the space,R

~n+1

,of (~v;b) for any arbitrary rectangular kernel.Again,

the fast Newton-Armijo algorithm [33] can be applied directly to solve (3.1.1).The

existence and uniqueness of the optimal solution are also guaranteed.Moreover,the

computational complexity of solving problem (3.1.1) by the Newton-Armijo method

is O(~n

3

) while solving the full kernel SSVM is O(n

3

) [33].

3.1.REDUCED KERNEL:A LOW-RANK APPROXIMATION 19

The reduced kernel serves as a low-rank approximation to the full kernel matrix.

It is also known as the NystrÄom approximation,which has been proposed in many

sophisticated ways [46,54].The NystrÄom approximation,denoted by R,has the

following form:

K(A;A) ¼ K(A;

~

A)K(

~

A;

~

A)

¡1

K(

~

A;A):= R:

1

(3.1.3)

Note that,for a given v,the following least squares problem

min

~v2<

~n

°

°

°

Rv ¡K(A;

~

A)~v

°

°

°

2

2

(3.1.4)

has the solution given by

~v = K(

~

A;

~

A)

¡1

K(

~

A;A)v:

(3.1.5)

That is,the NystrÄom approximation approximates Kv by

~

K~v,as given below:

K(A;A)v ¼ K(A;

~

A)K(

~

A;

~

A)

¡1

K(

~

A;A)v = K(A;

~

A)~v:

(3.1.6)

In the primal RSVM,~v is directly determined by ¯tting the entire data set to a reduced

problem (3.1.1),which can be regarded as an approximation to the full problem

(2.3.27).For more advanced theoretical study,consult Lee and Huang [34].For a

general primal SVM algorithm,its corresponding RSVM approximation works by

replacing K with

~

K.That is to say,the RSVM in the primal space is easy to work

with and basically share the same type of formulation as its full problem.

Dual RSVM using random subset works by replacing the full kernel matrix

K with the reduced kernel matrix R given by (3.1.3) and solving the dual problem

(2.2.20).Note that the reduced rank kernel matrix R can be expressed as R = BB

0

,

where B = K(A;

~

A)K(

~

A;

~

A)

¡1=2

(or B = K(A;

~

A),if the scaling matrix discussed in

the footnot is taken to be the identity for simplicity).Thus,the dual problem(2.2.20)

with kernel R has the same solution as the dual linear SVM problem (2.1.13) with

input data matrix B.That is,the dual RSVMcan be treated as a linear SVMproblem.

We recommend to use the Lagrangian SVM,an e±cient linear SVM algorithm,for

the RSVMsolved in the dual space.Lagrangian SVMcode consisting of only a couple

lines of commands is given in Appendix B.The dual solution has to be transformed

1

Note that K(

~

A;

~

A) serves as a scaling matrix.For some authors,they have used the approxima-

tion

R:= K(A;

~

A)K(

~

A;A)

in the dual problem.That is they take the identity matrix as the scaling matrix for simplicity.To

get the corresponding primal solution,the least squares problem (3.1.4) is applied and the primal

solution is given by:~v = K(

~

A;A)v.The di®erence between the two approximations (3.1.3) and the

one in this footnote is on the scaling matrix being K(

~

A;

~

A) or the identity.

20 Two Auxiliary Techniques

back to the primal one,as the decision function is given in the primal space.By

letting v

i

= y

i

®

i

,or equivalently v = D®,the corresponding primal solution can by

obtained via (3.1.4) and (3.1.5) and the decision function is then given by

f(x) = K(x;

~

A)~v +b;where ~v = (scaling matrix))

¡1

K(

~

A;A)v;

where the scaling matrix is either K(

~

A;

~

A) or the identity.

3.1.2 RSVM with optimal basis

RSVM using optimal basis can be done through the singular value decomposition

(SVD).One may start with the full kernel matrix K,or a certain intermediate reduced

kernel matrix

~

K,and looks for optimal basis in the column space of

~

K for further

column-rank reduction.Assume that we start with

~

K = K(A;

~

A),where

~

A µ A.Let

the SVD of

~

K be given by

~

K = P¤G

0

;

2

where P and G are orthogonal matrices of sizes n and ~n respectively,and ¤ is a

diagonal matrix of size n £ ~n.

Primal RSVMusing optimal basis works by further reducing the column rank

by extracting the leading right singular vectors of

~

K.Let

~

G = [g

1

;:::;g

~m

] consisting

of ~m-many leading right singular vectors.Projecting rows of

~

K along these ~m-many

right singular vectors leads to

~

K

~

G,which is of size n £ ~m.We use

~

K

~

G as our new

reduced kernel.The decision function is then given by

f(x) = K(x;

~

A)

~

G~v +b;

(3.1.7)

where ~v and b are solved from(3.1.1) with

~

K replaced by the new reduced kernel

~

K

~

G.

Dual RSVM using optimal basis works by replacing the full kernel matrix K

by the reduced kernel matrix R given by

R:=

~

K

~

G

³

~

G

0

(scaling matrix)

~

G

´

¡1

~

G

0

~

K

0

;

(3.1.8)

and then solves for the dual problem (2.2.20),

3

where

~

K can be the full kernel or an

intermediate reduced kernel K(A;

~

A),and the scaling matrix is K(

~

A;

~

A) (or can be

2

As only the leading right singular vectors are needed,we can extract Gfromthe leading eigenvec-

tors of a smaller square matrix

~

K

0

~

K to save computing time.See the SVD Section in Appendix A.

3

Again,R can be expressed as R = BB

0

,where B =

~

K

~

Gf

~

G

0

K(

~

A;

~

A)

~

Gg

¡1=2

,and the dual RSVM

can be treated as a linear SVM problem with input data matrix B.

3.1.REDUCED KERNEL:A LOW-RANK APPROXIMATION 21

replaced by a simple identity matrix).By letting v = D®,the corresponding primal

solution can by obtained by solving

min

~v

kRv ¡

~

K

~

G~vk

2

2

;

(3.1.9)

which leads to ~v =

³

~

G

0

(scaling matrix)

~

G

´

¡1

~

G

0

~

K

0

v.Then,the decision function is

then given by f(x) = K(x;

~

A)

~

G~v +b.

3.1.3 Some remarks

For many other kernel-based algorithms,one has to face the same problems (P1)-(P3)

stated above.Several authors have suggested the use of low-rank approximations to

the full kernel matrix for very large problems [32,46,54].These low-rank approxima-

tions all have used a thin rectangular matrix K(A;

~

A) consisting of a subset of ~n(¿n)

columns drawn from the full kernel matrix K.Lee and Mangasarian [32],Williams

and Seeger [54] suggest to pick the subset randomly and uniformly over all columns.

Lee and Huang [34] give a further theoretical study on random subset approach in

terms of some statistical optimalities.Smola and SchÄolkopf [46] consider ¯nding an

optimal subset.Since ¯nding an optimal subset is a combinatorial problem involving

C

n

~n

possibilities and an exhaustive search is too expensive to carry out,they use a

probabilistic speedup algorithm.The trick they use is:First draw a random subset

of a ¯xed size and then pick the best basis (column) from this set.The search goes

on till the stopping criterion is reached.The optimal basis that we introduced above

consists of leading right singular vectors of

~

K instead of mere columns of

~

K.For

small sized problem we may start with

~

K the full kernel,while for median to large

sized problemwe may start with an intermediate reduced kernel using randomsubset,

and then extract the leading right singular vectors therein for further column-rank

reduction.

For problems working in the primal space,e.g.,the proximal SVM [19],the

least-squares SVM [48,49],the kernel Fisher discriminant [42,41],kernel princi-

pal component analysis [27],kernel sliced inverse regression [56],the random sub-

set method works straightforwardly easy by replacing the full kernel matrix with

the reduced kernel and also by cutting down the corresponding number of para-

meters in the decision function.For problems solved in the dual space,we form

the approximation kernel matrix R = K(A;

~

A)(scaling matrix)

¡1

K(

~

A;A),or R =

~

K

~

G

³

~

G

0

(scaling matrix)

~

G

´

¡1

~

G

0

~

K

0

,to replace the full matrix K,where the scaling

matrix is given by K(

~

A;

~

A) or the identity.Next the dual solution has to be trans-

formed to the primal form for the expression of decision function.To obtain the

22 Two Auxiliary Techniques

primal solution ~v,we should solve the least squares problem (3.1.4) or (3.1.9),where

v is the dual solution.

Homework 2

(1) Do a numerical study of the quality of reduced kernel approxima-

tion based on its spectral behavior.Use the Wine Recognition data set to build the

full kernel matrix and the reduced kernel matrices by random subset and by optimal

basis respectively.(2) Analyze the Wine Recognition data set using both the full

and reduced kernel SVM approaches with Gaussian kernel.(Choose your own SVM

algorithm,LIBSVM (or BSVM,a companion to LIBSVM for large data problems),

SSVM or else.)

Solution

:For your reference,here is the solution using Iris and Tree data sets.The

spectral behavior of the reduced kernel approximation

R = K(A;

~

A)K(

~

A;

~

A)

¡1

K(

~

A;A)

can be measured via a few quantities,such as the maximum di®erence of the eigen-

values and the relative di®erence of the traces of the full kernel matrix K(A;A) and

R,which are de¯nded as Maxi-di® and rel-di®=Trace(K¡R)=Trace(K) respectively.

The result is summarized in Table 3.1.Moreover,let ¸

k

and

~

¸

k

be the kth eigenvalues

of K and R.The term(¸

k

¡

~

¸

k

) against k is plotted in Figure 3.1.3(a).Figure 3.1.3(b)

and Figure 3.1.3(c) together can present the eigenvalues of K and R correspondingly.

Fromthese ¯gures,it seems that a larger decay rate of eigenvalues will obtain a better

quality of the approximation.Finally,instead of the total 150 eigenvalues,only the

¯rst 50 leading ones are considered here as their sum is large enough (> 99% of the

total).

In Table 3.2,numerical comparisons between full kernel and reduced kernel from

random subset are conducted,separately through SSVM and LIBSVM.The training

error and validation error (with a 5-fold cross-validation) are reported in SSVM.The

LIBSVMprovides the validation accuracy and the number of the support vectors due

to its dual formulation.It can be found that,though using a simpler model,RSVM

has only slightly bigger errors than the full kernel in training and in validation.

a.use`hibiscus'to tune for the best parameters separately for SSVM

and LIBSVM.

b.include LSVM for comparison?

2

3.1.REDUCED KERNEL:A LOW-RANK APPROXIMATION 23

Table 3.1:Spectrum comparison on full kernel K(A;A) versus its NystrÄom approxi-

mation R.

Data set & Size

Largest Eigen-v

Largest Eigen-v

Maxi-di®

Rel-di®

(Full,Ruduced)

of K(A;A)

of R

of Eigenvalues

of Traces

Iris

40.8754

40.5268

1.3223

0.1083

(150,30)

Figure 3.1:Spectrum comparison on full kernel vs.reduced kernel.

Table 3.2:The Gaussian kernel is used on all tests.The width parameter is ° = 0:1

and the tuning parameter is C = 10.

Iris Dataset

SSVM LIBSVM

Model(size) Training error Validation error Validation error#of SVs

Full(150)

0.033

0.046

0.03333 32

Reduced(30)

0.058

0.073

0.03333 32

Tree Dataset

SSVM LIBSVM

Model(size) Training error Validation error Validation error#of SVs

Full(700) 0.13405 0.14676 0.13143 236

Reduced(69) 0.14724 0.14676 0.14000 222

24 Two Auxiliary Techniques

3.2 Uniform design for SVM model selection

The problem of choosing a good parameter setting for a better generalization ability

is the so called model selection here.It is desirable to have an e®ective and automatic

scheme for it.In particular,for people who are not familiar with parameters tuning in

SVMs,an automatic and data driven scheme is helpful.To develop a model selection

scheme,we need to set up

²

a performance measure,and

²

a search scheme to look for optimal performance point.

One of the most commonly used performance measure is the cross-validation (CV)

(Stone,1974),k-fold or leave-one-out.Both require that the learning engine be trained

multiple times in order to obtain a performance measure for each parameter combi-

nation.As for search scheme,one standard method is a simple grid search on the

parameter domain of interest.It is obvious that the exhaustive grid search can not

e®ectively perform the task of automatic model selection due to its high computa-

tional cost.There are also gradient-based approaches in the literature (Chapelle et

al.[6];Larsen et al.[30];Bengio [3]).Although the gradient-based methods present

impressive gain in time complexity,they have a great chance of falling into bad local

minima,especially for SVM model selection,since the objective function,which is

the CV-based performance measure,has low degree of regularity.Figure 3.2 plots the

5-fold average test set accuracy for three public available data sets,Banana,Wave-

form and Splice in a three dimensional surface,where the x-axis and the y-axis are

log

2

C and log

2

°,respectively.The z-axis is the 5-fold average test accuracy.Each

mesh point in the (x;y)-plane stands for a parameter combination and the z-axis

indicates the model performance measure.

As an alternative,we will introduce a nested uniform design (UD) methodology

for model selection,which can be regarded as an economic modi¯cation to replace grid

search.The lattice grid points are intended to represent uniformpoints in a parameter

domain.However,they are not uniform enough.A much better uniformity scheme,

called UD,is introduced below.

3.2.1 Methodology:2 staged UD for model selection

The uniform experimental design is one kind of space ¯lling designs that can be

used for computer and industrial experiments.The UD seeks its design points to

be uniformly scattered on the experimental domain.Suppose there are s parameters

of interest over a domain C

s

.The goal here is to choose a set of m points P

m

=

3.2.UNIFORM DESIGN FOR SVM MODEL SELECTION 25

Figure 3.2:Three examples of search space of model selection.

fµ

1

;:::;µ

m

g ½ C

s

such that these points are uniformly scattered on C

s

.The search

for UD points is a hard problem involves the number theory and quasi-Monte Carlo

methods (Niederreiter [44];Fang and Wang [17]).We will simply refer to UD tables

in the UD-web given below.To implement the UD-based model selection,follow the

steps:

1.

Choose a parameter search domain,determine a suitable number of levels for

each parameter (or factor in design terminology).

2.

Choose a suitable UD table to accommodate the number of parameters and

levels.This can be easily done by visiting the UD-web.

http://www.math.hkbu.edu.hk/UniformDesign

3.

From the UD table,randomly determine the run order of experiments and

conduct the performance evaluation of each parameter combination in the UD.

4.

Fit the SVM model.This step is a knowledge discovery step from the built

model.That is,we ¯nd the best combination of the parameter values that

maximizes the performance measure.

5.

Repeat the whole procedure one more time in a smaller search domain centered

about the optimal point found at the last stage.

26 Two Auxiliary Techniques

The automatic UD-based model selection for SVMs [28] is implemented in Matlab,

named\hibiscus"in the SSVMtoolbox available at:http://dmlab1.csie.ntust.

edu.tw/downloads/.The hibiscus focuses on selecting the regularization parameter

C and the Gaussian kernel width parameter °.It ¯rst determines a two-dimensional

search box in the parameter space,which is able to automatically scale the distance

factor in the Gaussian kernel.Next,a 2-staged UD procedure is designed for parame-

ter selection within the search box.At the ¯rst stage,it sets out a crude search for a

highly likely candidate region of global optimum.At the second stage,it con¯nes to

a ¯ner search within a smaller region centered around the ¯rst-stage-found optimal

point.At each stage,a UD is used to replace the lattice grid as trial points.The

method of nested UDs is not limited to 2 stages and can be applied in a sequential

manner.Two examples of UD scheme are presented below.

Figure 3.3:The nested UD model selection with a 13-points UD at the ¯rst stage and

a 9-points UD at the second stage.

3.2.UNIFORM DESIGN FOR SVM MODEL SELECTION 27

Figure 3.4:The nested UD model selection with a 9-points UD at the ¯rst stage and

a 5-points UD at the second stage.

28 Two Auxiliary Techniques

Chapter 4

More on Kernel-Based

Learning Algorithms

Besides the various variants of SVMs,there are other kernel-based learning algorithms.

Many popular classical statistical data analysis tools are linear methods.For instance,

the principal components analysis (PCA) looks for an orthogonal decomposition of a

sample space,say <

p

,into several one-dimensional linear subspaces ranked by their

importance of contribution to the data variation.The search for in°uential linear

subspaces,or factors,is an important and frequently encountered task.PCA has

long been a successful tool for ¯nding such linear subspaces or factors.However,

it is not able to extract important nonlinear features.This chapter will focus on

extending some classical statistical linear methods to nonlinear setting via the kernel

trick introduced earlier.We will include the following procedures:PCA,sliced inverse

regression (SIR),canonical correlation analysis (CCA),Fisher linear discriminant

analysis (FLDA),least-squares regression and ²-insensitive regression.

4.1 Kernel principal components analysis

As we have brie°y mentioned that the classical PCAis not capable of ¯nding nonlinear

features and there is a need for nonlinear extension.Nonlinear PCA,e.g.,functional

PCA,curve PCA,has been known in statistical literature.However,they are not our

aim here.We will restrict ourselves to the nonlinear extension via kernel trick only.

29

30 CHAPTER 4.MORE ON KERNEL-BASED LEARNING ALGORITHMS

4.1.1 Classical PCA

In the original data space,X ½ <

p

,we are interested in ¯nding successive one-

dimensional directions which contain\the maximum information"about the data.

Let u be the ¯rst of such directions scaled to unit length.Data projections along

the direction u are given by Au.Here we adopt the variance as the criterion for

information content.Then,we shall be looking for u to maximize the following

objective function:

max

u2<

p

var(Au) subject to u

0

u = 1:

(4.1.1)

This problem is equivalent to

max

u2<

p

u

0

§u subject to u

0

u = 1;where § is the covariance of data A:

Applying the Lagrange theory,the Lagrangian function is given by

L(u;®) = u

0

§u ¡®(u

0

u ¡1);u 2 <

p

;® 2 <:

Taking the derivative with respect to u and setting it to zero,we have

2§u ¡2®u = 0 =) §u = ®u:

(4.1.2)

That is,u is an eigenvector (with unit length) and ® is the associated eigenvalue.

Plugging (4.1.2) back to the Lagrangian function,we have u

0

§u = ®.Therefore,u is

obtained by ¯nding the eigenvector associated with the leading eigenvalue ®.Denote

them by u

1

and ®

1

,respectively.For the second principal component,we look for u

again for the same problem (4.1.1) but with an extra constraint that u is orthogonal

to u

1

.Then,the Lagrangian becomes

L(u;®;¯) = u

0

§u ¡®(u

0

u ¡1) ¡¯u

0

1

u:

(4.1.3)

Using similar procedure,we are able to ¯nd the second principal component and

successive components sequentially.

Note that PCA ¯rst centers the data and then rotates the axes to line up with the

directions of highest data variation.The action of rotation and lining up with direc-

tions of highest variance is like ¯nding a new coordinate system to record the data.

Data recorded by the new system have a simple diagonal structure in their covariance

matrix with descending diagonal elements ranked by the importance (contribution to

data variance) of the new coordinate axes.

Homework 3

You have seen the kernel trick in SVM.Apply it to PCA and think

about how the KPCA should be formulated.

4.1.KERNEL PRINCIPAL COMPONENTS ANALYSIS 31

Solution

:Similar to the classical PCA,¯rst we have to center the input data Z in

the feature space;that is,

³

I

n

¡

1

n

1

0

n

n

´

Z.Then the covariance matrix is of the form

1

n

Z

0

³

I

n

¡

1

n

1

0

n

n

´

Z and the optimization problem could be formulated as

max

u2Z

u

0

Z

0

µ

I

n

¡

1

n

1

0

n

n

¶

Zu subject to u

0

u = 1:

(4.1.4)

The Lagrangian function is given by

L(u;®) = u

0

Z

0

µ

I

n

¡

1

n

1

0

n

n

¶

Zu ¡®(u

0

u ¡1):

@L=@u = 0 =) Z

0

³

I

n

¡

1

n

1

0

n

n

´

Zu = ®u =) u is in the column space of Z

0

,i.e.

u = Z

0

v for some v 2 <

n

.Plugging it to the above funtion,the optimization problem

becomes

max

v2Z

v

0

ZZ

0

µ

I

n

¡

1

n

1

0

n

n

¶

ZZ

0

v subject to v

0

ZZ

0

v = 1:

(4.1.5)

By the kernel trick ZZ

0

= K and also assuming K nonsingular,we can solve this

problem by ¯nding the leading eigenvectors v from the following eigenvalue problem:

µ

I

n

¡

1

n

1

0

n

n

¶

Kv = ®v

(4.1.6)

with normalization v

0

Kv = 1,or equivalently ®v

0

v = 1.Normalization changes the

scales of principal components to have unit length in the sense of v

0

Kv = 1,but it

will not change the orientation of the principal components.2

4.1.2 Kernel PCA

Full kernel PCA.From (4.1.6) in Homework 3 we see that the KPCA works by

extracting the leading eigen-components of the centered kernel matrix.Note that the

eigenvector,according to (4.1.5),should be normalized by

v

0

Kv = 1;or equivalently v

0

µ

I

n

¡

1

n

1

0

n

n

¶

Kv = 1:

(4.1.7)

As (I

n

¡

1

n

1

0

n

n

)Kv = ®v,the normalization can be expressed as ®v

0

v = 1.In summary,

the KPCA solves the eigenvalue-decomposition problem for (I

n

¡

1

n

1

0

n

n

)K.Denote

its largest eigenvalue by ®

1

and its associated eigenvector by v

1

.We can sequentially

¯nd the second and the third principal components,etc.From (4.1.7) we have that

v

j

,j = 1;2;:::,should be normalized by v

0

j

Kv

j

= 1,or equivalently by ®

j

v

0

j

v

j

= 1.

32 CHAPTER 4.MORE ON KERNEL-BASED LEARNING ALGORITHMS

From Homework 3 we see that Z

0

v

j

is the jth principal component in the feature

space Z.For an x 2 <

p

and its feature image z = ©(x) 2 Z,the projection of ©(x)

along the jth component Z

0

v

j

is given by

©(x)

0

Z

0

v

j

= z

0

[z

1

;z

2

;:::;z

n

]v

j

= K(x;A)v

j

:

(4.1.8)

Therefore,the projection of feature image z = ©(x) onto the leading r eigen-components

in Z is given by

[K(x;A)v

1

;:::;K(x;A)v

r

] 2 <

r

:

Reduced kernel PCA starts with a random sampling of a portion of columns

from the full kernel matrix.Next,by a SVD it extracts the leading right singular

vectors ~v of the reduced kernel matrix:

µ

I

n

¡

1

n

1

0

n

n

¶

K(A;

~

A)~v = ®~v;normalized to ®~v

0

~v = 1:

(4.1.9)

The feature image ©(x) projected along the leading r components is then given by

h

K(x;

~

A)~v

1

;:::;K(x;

~

A)~v

r

i

2 <

r

:

As we are interested in extracting only right singular vectors,it works by solving

the eigenvalue problem of

~

K

0

(I

n

¡

1

n

1

0

n

n

)

~

K,whose corresponding eigenvalues and

eigenvectors will be ®

2

j

and ~v

j

.That is,if we work directly on the eigenvalue problem

of

~

K

0

(I

n

¡

1

n

1

0

n

n

)

~

K,denoting its eigenvalues by ¯

j

,then the normalization should be

p

¯

j

~v

0

j

~v

j

= 1.

4.2 Kernel sliced inverse regression

The PCA approach can be used as an unsupervised dimension reduction tool.Of-

ten the input data come with their class labels and supervised dimension reduction

with category information becomes possible.The sliced inverse regression (SIR) [35]

was originally proposed for dimension reduction through inverse regression.We will

depart from the original inverse regression spirit for the time being and focus on a

classi¯cation framework,which is easier for the ¯rst-time reader who is non-statistics

major to get a grasp of the method.

4.2.1 Classical SIR

Apart from its statistical meaning,SIR,technically speaking,solves a generalized

eigenvalue-eigenvector decomposition problem.We have seen the decomposition prob-

lem of a covariance matrix in the PCA section.In a classi¯cation problem with given

4.2.KERNEL SLICED INVERSE REGRESSION 33

data f(x

i

;y

i

)g

n

i=1

,x

i

2 <

p

and y

i

2 f1;:::;kg,here are the steps of SIR for ¯nding

leading directions that contain\maximal information"of class membership:

1.

Slice (classify) the input data A = [x

0

1

;:::;x

0

n

] into k many slices by their

corresponding labels y.

2.

For each slice,s = 1;:::;k,compute the mean,denoted by ¹x

s

.Also compute

the grand mean,denoted by ¹x.

3.

Use the slice mean ¹x

s

as the representative for each slice to form a new input

data matrix.That is,replace each individual input point x

i

by its slice mean

for a new input data matrix,denoted by A

S

(still of size n £p).

4.

Compute §

b

= Cov(A

S

),named the between-slice (or between-class) covariance

matrix,as follows:

§

b

=

1

n

k

X

s=1

n

s

(¹x

s

¡ ¹x)(¹x

s

¡ ¹x)

0

;

where n

s

is the number of data points in the sth slice.Also compute the sample

covariance matrix of x:§ =

P

n

i=1

(x

i

¡ ¹x)(x

i

¡ ¹x)

0

=n.

5.

Solve the generalized eigenvalue problem of §

b

with respect to § (Assume § is

rank p).That is,to ¯nd the leading,say r many (r · k ¡1),eigenvalues ® and

eigenvectors u for the problem:

§

b

u

i

= ®

i

§u

i

subject to u

0

i

§u

i

= 1 and u

0

j

§u

i

= 0 for 1 · j < i · r:

6.

Collect the leading r SIR directions into a matrix U = [u

1

;:::;u

r

].U is or-

thogonal with respect §,i.e.U

0

§U = I

r

.Project the data x along the SIR

directions,i.e.U

0

x = [u

0

1

x;:::;u

0

r

x].We should call u

0

1

x the ¯rst SIR variate,

u

0

2

x the second SIR variate,and so on.

Remark 2

For an ordinary eigenvalue problem,the decomposition is done with re-

spect to an identity matrix.For a generalized eigenvalue problem,the decomposition

is done with respect to a strictly positive de¯nite matrix.PCA solves an ordinary

eigenvalue problem,while SIR solves a generalized one.

From the steps above,we see that SIR is designed to ¯nd the leading directions

for maximal discriminant power for class information.Here the discriminant ability is

judged by the between class variance scaled by the total variance.Before we formally

34 CHAPTER 4.MORE ON KERNEL-BASED LEARNING ALGORITHMS

introduce the kernel generalization of SIR,we present you a quick pictorial view of

the e®ect of kernel SIR (KSIR) compared with the classical SIR.Figure 4.1 is the

scatter plot of ¯rst 2 SIR variates using Pendigits data and Figure 4.2 is the scatter

plot of ¯rst 2 KSIR variates.

Figure 4.1:SIR view over the 1st-and-2nd variates.

Homework 4

Prescribe a kernelized SIR algorithm and try it out on the Wine

Recognition data set available at UCI Repository.(Hint:Replace A by K and

proceed the same as in the classical SIR.However,since the covariance matrix of K

is singular,add a small diagonal matrix ±I to Cov(K) for numerical stability.)

4.2.2 Kernel SIR

SIR is designed to ¯nd linear dimension reduction subspace by extracting the leading

variance components of sliced means scaled by the overall variance across all classes.

De¯ne N(§

b

) = fu:§

b

u = 0g,which is the null column space of §

b

.SIR is unable

to ¯nd directions which reside in this null space N nor directions which have small

angle with the null space.SIR is also limited to the linear directions and is unable to

extract nonlinear features,as you have seen in the Pendigits data demonstration.

Kernelized SIR [56] can provide an easy remedy for the di±culties encountered by the

classical SIR.A nonlinear parsimonious characterization of the data is the key goal

of the kernel extension.

4.2.KERNEL SLICED INVERSE REGRESSION 35

Figure 4.2:KSIR view over the 1st-and-2nd variates.

Full kernel SIR.As before,in looking for a nonlinear extension we ¯rst map

the original input data to the feature space Z,and then nominally carry out the SIR

procedure on the feature data in Z.That is,we look for the leading directions in the

feature space that have the highest between-slice variance with respect to the variance

of the whole data set.In other words,we intend to solve the following optimization

problem:

max

u2Z

u

0

Z

0

H

0

µ

I ¡

1

n

1

0

n

n

¶

HZu subject to u

0

Z

0

µ

I ¡

1

n

1

0

n

n

¶

Zu = n;

(4.2.10)

where H is a diagonal block matrix with the sth block given by B

s

=

1

n

s

1

n

s

1

0

n

s

with

n

s

the number of observations in the sth slice.(Here we have implicitly assumed

that the data are already sorted by slice.) Note that H is symmetric and idempotent,

i.e.H

2

= H.It is a projection matrix,which maps each feature data input to the

slice mean of its corresponding class.As in the KPCA case,we con¯ne our search of

eigenvectors to the column space of Z

0

,i.e.u = Z

0

v for some v 2 <

n

.Then,by the

kernel trick K = ZZ

0

,the optimization problem (4.2.10) becomes

max

v

v

0

ZZ

0

H

µ

I

n

¡

1

n

1

0

n

n

¶

HZZ

0

v subject to v

0

ZZ

0

µ

I

n

¡

1

n

1

0

n

n

¶

ZZ

0

v = n;

(4.2.11)

which is equivalent to

max

v

v

0

Cov(HK)v subject to v

0

Cov(K)v = 1:

(4.2.12)

36 CHAPTER 4.MORE ON KERNEL-BASED LEARNING ALGORITHMS

For successive KSIR directions,some extra orthogonality constraints are applied.The

KSIR problem (4.2.12) is exactly the same as the classical SIR stated in Steps 1-6

above,except for replacing the original input data matrix A by the kernel data matrix

K.Note that Cov(K) is singular and the singularity will cause numerical instability

to the problem.An easy remedy is to add a small diagonal matrix ±I to the covariance

matrix.That is,we solve the following problem:

Cov(HK)v = ®fCov(K) +±Igv:

(4.2.13)

Let V = [v

1

;v

2

;:::;v

r

] 2 <

n£r

consisting of r-many leading KSIR directions and let

C(V ) be its column space spanned by V.Again,we say fv

1

;:::;v

r

g are KSIR direc-

tions,fKv

1

;:::;Kv

r

g are KSIR variates,and C(V ) is the KSIR dimension reduction

subspace.

Reduced kernel SIR works directly on

~

K = K(A;

~

A) or

~

K = K(A;

~

A)

~

G as

de¯ned in (??).The RKSIR solves the following generalized eigenvalue problem:

Cov(H

~

K)~v = ®~v

0

n

Cov(

~

K) +±I

o

~v:

(4.2.14)

We then say f~v

1

;:::;~v

r

g are KSIR directions,f

~

K~v

1

;:::;

~

K~v

r

g are KSIR variates,and

C(

~

V ) is the KSIR dimension reduction subspace,where

~

V = [~v

1

;~v

2

;:::;~v

r

].Some

common suggestions for implementing the RKSIR are:For small sized problem,use

the leading optimal bases based on full kernel to form the reduced kernel;for median

to large sized problem,use either a random subset to form the reduced kernel or to

start with an intermediate random subset and then search for leading optimal bases

therein.

Combined with linear methods.As the input space has been mapped to Z,

which has very high dimensionality (often in¯nite) to provide enough °exibility and

versatility for linear models therein.It is,thus,su±cient to consider only simple

linear models therein for statistical inferences.For instance,KSIR can be combined

with various linear SVM algorithms or the classical linear discriminant analysis for

classi¯cation problems,and with various linear SVR algorithms or least-squares linear

regression method for regression problems.In other words,various linear algorithms

can act on top of the KSIR variates for dimension reduction and data analysis.

4.3 Kernel canonical correlation analysis

Canonical correlation analysis (CCA) studies the relation between two sets of vari-

ables.The classical CCA is used to describe only the linear relation between the

two sets,while its kernel extension,KCCA,can be used to describe general nonlinear

relation.

4.3.KERNEL CANONICAL CORRELATION ANALYSIS 37

4.3.1 Classical CCA

Suppose the randomvariable x of p components can be partitioned into x = [x

(1)

;x

(2)

]

with p

1

and p

2

components,respectively.The CCAaims to study the relation between

x

(1)

and x

(2)

and eventually de¯nes a pair of new coordinate systems in <

p

1

and

<

p

2

,which make the description of\relation"between x

(1)

and x

(2)

as\simple"as

possible.One way to study and to simplify the relationship structure is through the

factorization of the covariance matrix.Let

X

(1)

=

2

6

6

4

x

(1)

0

1

.

.

.

x

(1)

0

n

3

7

7

5

;

which is the design matrix of the ¯rst set of variables;and let X

(2)

= [x

(2)

0

1

;:::;x

(2)

0

n

],

which is the design matrix of the second set of variables.The sample covariance

matrix of X

(i)

and X

(j)

,i;j = 1;2,is given by

§

ij

=

1

n

X

(i)

0

µ

I

n

¡

1

n

1

0

n

n

¶

X

(j)

:

The CCA ¯nds the pair of vectors (u

1

;v

1

) 2 <

p

1

+p

2

,called the ¯rst pair of canonical

variates,such that

max

u;v

u

0

§

12

v subject to u

0

§

11

u = 1 and v

0

§

22

v = 1:

(4.3.15)

Successive pairs can be found by solving the same problem with extra orthogonal-

ity (uncorrelatedness) constraints.The classical CCA describes linear relation by

reducing the correlation structure between these two sets of variables to the sim-

plest possible form by means of linear transformations on X

(1)

and X

(2)

.It ¯nds

pairs (u

i

;v

i

) 2 <

p

1

+p

2

in the following way.The ¯rst pair maximizes the correlation

between u

0

1

X

(1)

and v

0

1

X

(2)

subject to the unit variance constraints Var(u

0

1

X

(1)

) =

Var(v

0

1

X

(2)

) = 1,and the kth pair (u

k

;v

k

),which are uncorrelated with the ¯rst k¡1

pairs,

1

maximizes the correlation between u

0

k

X

(1)

and v

0

k

X

(2)

,and again subject to

the unit variance constraints.The sequence of correlations between u

0

i

X

(1)

and v

0

i

X

(2)

are called canonical correlation coe±cients of X

(1)

and X

(2)

,(u

i

;v

i

) are called the ith

pair of canonical variates,and u

0

i

X

(1)

and v

0

i

X

(2)

are called the ith pair of canonical

variables.The canonical variates

2

fu

i

g

p

1

i=1

and fv

i

g

p

2

i=1

can serve as new systems of

1

Being uncorrelated is in terms of u

0

i

§

11

u

k

= 0 and v

0

i

§

22

v

k

= 0 for all i = 1;:::;k ¡1.

2

Assume that p

1

· p

2

.After p

1

¡1 pairs of canonical variates,one cas always ¯ll up the remaining

orthogonal variates for both sets of variables.

38 CHAPTER 4.MORE ON KERNEL-BASED LEARNING ALGORITHMS

coordinate axes for the two components x

(1)

and x

(2)

.These new systems are simply

linear systems of the original ones.Thus,the classical CCA can only be used to de-

scribe linear relation.Via such linear relation it ¯nds only linear dimension reduction

subspace and linear discriminant subspace,too.

4.3.2 Kernel CCA

Kernel methods can provide a convenient way for nonlinear generalization of CCA.

By forming the kernel data:

K

(1)

= K

1

(A

(1)

;A

(1)

) and K

(2)

= K

2

(A

(2)

;A

(2)

);

(4.3.16)

where A

(i)

= [x

(i)

0

1

;:::;x

(i)

0

n

] for i = 1;2,and K

1

and K

2

are two positive de¯nite

kernel functions.The KCCA works by carrying out the classical CCA procedure on

the kernel data [K

(1)

K

(2)

].In summary,the KCCA procedure consists of two major

steps:

(a)

Transform the data points to a kernel representation as in (4.3.16).

(b)

The classical CCA procedure is acting on the kernel data matrix [K

(1)

K

(2)

].

Note that some sort of regularization is necessary here to solve the associated

spectrum problem (4.3.15) of extracting leading canonical variates and correla-

tion coe±cients.Here we use the reduced kernel technique again to cut down the

problem size and to stabilize the numerical computation.Only partial columns

are computed to form reduced kernel matrices,denoted by

~

K

(1)

and

~

K

(2)

.The

classical CCA procedure is then acting on [

~

K

(1)

~

K

(2)

].

3

As the KCCA is simply the classical CCA acting on kernel data,existing code from

standard statistical packages are ready for use.In the example below we use Matlab

m-¯le\canoncorr",which implements the classical CCA,on kernel data.

Example 1

We use the Pendigits data set from UCI Repository for visual demon-

stration of nonlinear discriminant using KCCA.We use the 7494 training instances

for explanatory purpose.For each instance there are 16 input measurements (i.e.,x

j

is 16-dimensional) and a corresponding class label y

j

fromf0;1;2;:::;9g.A Gaussian

kernel with the window width (

p

10S

1

;:::;

p

10S

16

) is used to prepare the kernel data,

where S

i

's are the coordinate-wise sample variances.A reduced kernel of size 300

equally strati¯ed over 10 digit groups is used and serves as the

~

K

(1)

in step (b) of the

3

The reduced kernel

~

K

(i)

can be either formed by random subset

~

K

(i)

= K

i

(A

(i)

;

~

A

(i)

) or by

further optimal basis therein

~

K

(i)

= K

i

(A

(i)

;

~

A

(i)

)

~

G

(i)

.

4.3.KERNEL CANONICAL CORRELATION ANALYSIS 39

KCCA procedure.We use y

j

,the class labels,as our K

(2)

(no kernel transformation

involved).Precisely,

~

K

(2)

=

2

6

6

4

Y

0

1

.

.

.

Y

0

n

3

7

7

5

n£10

;Y

0

j

= (0;:::;1;0;:::);

where Y

j

is a categorical variable for class membership.If y

j

= i,i = 0;1;:::;9,then

Y

j

has the entry 1 in the (i +1)th place and 0 elsewhere.Now we want to explore for

the relation between the input measurements and their associated class labels using

CCA and KCCA.The training data are used to ¯nd the leading CCA- and KCCA-

found variates.Next 20 test samples from each digit-group are drawn randomly

from the test set.

4

Scatter plots of test data projected along the leading CCA-found

variates (Figure 4.3) and the leading KCCA-found variates (Figure 4.4) are given

below.Di®erent classes are labeled with distinct digits.The superior discriminant

power for KCCA over CCA is clear.

Figure 4.3:Scatter plot of pen-digits over CCA-found variates.

4

The test set has 3498 instances in total with average around 350 instances for each digit.For

clarity of plots and to avoid excess ink,we use only 20 test points per digit.

40 CHAPTER 4.MORE ON KERNEL-BASED LEARNING ALGORITHMS

Figure 4.4:Scatter plots of pen-digits over KCCA-found variates.

4.4 Kernel Fisher discriminant analysis

Fisher linear discriminant analysis (FLDA,or simply LDA) is a commonly used and

time-honored tool for multiclass classi¯cation because of its simplicity and probabilis-

tic outputs.Motivated from the active development of statistical learning theory [53]

and the popular and successful usage of various kernel machines [14,15],there has

emerged a hybrid approach which combines the idea of feature mapping with the

classical FLDA to prescribe a kernelized LDA algorithm.

4.4.1 Classical FLDA

For a binary classi¯cation,FLDA ¯nds an optimal normal direction of the separating

hyperplane,w

0

x +b = 0,by maximizing the so-called Rayleigh coe±cient:

max

w

J(w) ´

w

0

§

b

w

w

0

§

w

w

;

(4.4.17)

where §

b

and §

w

are respectively the between- and within-class covariance matrices.

The o®set is determined by letting the hyperplane pass through the mid point of the

two class centroids,i.e.,b = ¡w

0

(¹x

1

+ ¹x

2

)=2,where ¹x

j

=

P

i2I

j

x

i

=n

j

with I

j

index

set of jth class.

4.4.KERNEL FISHER DISCRIMINANT ANALYSIS 41

Consider a k-class problem based on training data f(x

i

;y

i

)g

n

i=1

,where x

i

2 X ½

<

p

is a p-variate input measurement and y

i

2 f1;:::;kg indicates the corresponding

class membership.With k ¡ 1 · p,the FLDA ¯nds k ¡ 1 canonical variates that

are optimal in maximizing the between-class variation (for separating the classes)

with respect to the within-class variation,and its decision boundaries are linear in

these canonical variates.The FLDA canonical variates w

j

,j = 1;:::;k ¡1,can be

determined again by maximizing the Rayleigh coe±cient given below:

max

w

J(w) ´

w

0

§

b

w

w

0

§

w

w

:

(4.4.18)

The optimality problem (4.4.18) of ¯nding FLDA canonical variates is a generalized

eigenvalue decomposition problem.Data are then projected to these (k¡1) canonical

variates and class centroids are calculated in this (k¡1)-dimensional subspace.A test

point is assigned to the class with the closest class centroid based on Mahalanobis

distance.The Mahalanobis distance of two points x and u projected along the FLDA

canonical variates is given by

d

M

(x;u) = (x ¡u)

0

W

0

§

¡1

## Σχόλια 0

Συνδεθείτε για να κοινοποιήσετε σχόλιο