
Lecture Notes

Statistical and Machine Learning
(Classical Methods, Kernelizing, Bayesian, Statistical Learning Theory, Information Theory, SVM, Neural Networks)

Su-Yun Huang (1), Kuang-Yao Lee (1) and Horng-Shing Lu (2)
(1) Institute of Statistical Science, Academia Sinica
(2) Institute of Statistics, National Chiao-Tung University

contact: syhuang@stat.sinica.edu.tw, http://www.stat.sinica.edu.tw/syhuang
course web page: http://140.109.74.188
TAs: Pei-Chun Chen and Kuang-Yao Lee
COURSE INFORMATION & PREREQUISITES

- Prerequisites: calculus, linear algebra, elementary probability and statistics.
- Course plan: see the Table of Contents (tentative). We will emphasize knowing "why" and the statistical aspects rather than algorithms and programming. You still have to know "how", either by writing your own implementation or by modifying others' code.
- Grading policy:
  - homework 30%; score for a late homework = (full points) $\times\, 0.8^d$, where $d$ is the number of days late.
  - midterm 30% (April 27, 2007).
  - oral presentation 20% on assigned tasks (and on your final project as well).
  - final project 20%: choose your own data analysis problem, write up a short report and prepare an oral presentation. The report should include
    - problem and data description,
    - methodology (related to this course),
    - data analysis, and
    - conclusion.
    The report can be in Chinese or in English.
- Web resources:
  - The UCI Machine Learning Repository is probably the most popular repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
  - The UCI KDD Archive, an online repository of large data sets covering a wide variety of data types, analysis tasks, and application areas. http://kdd.ics.uci.edu
  - CiteSeer, a public search engine and digital library for scientific and academic papers in computer and information sciences. http://citeseer.ist.psu.edu
  - Statlib, for statistical software and data sets. http://lib.stat.cmu.edu
  - SSVM toolbox. http://dmlab1.csie.ntust.edu.tw/downloads
  - Kernel Statistics toolbox. http://www.stat.sinica.edu.tw/syhuang or http://140.109.74.188/kern_stat_toolbox/kernstat.html
  - LIBSVM. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  - Wikipedia, a free online encyclopedia. http://en.wikipedia.org/wiki/Main_Page
Contents

1 Introduction
  1.1 Aim and scope
  1.2 Notation and abbreviation

2 Support Vector Machines
  2.1 Linear support vector machines
  2.2 Kernel trick for nonlinear SVM extension
  2.3 Variants of SVM algorithms
    2.3.1 One-norm soft margin SVM
    2.3.2 Two-norm soft margin SVM
    2.3.3 Smooth SVM (SSVM)
  2.4 Multiclass classification

3 Two Auxiliary Techniques: Reduced Kernel and Model Selection for SVMs
  3.1 Reduced kernel: a low-rank approximation
    3.1.1 RSVM with random subset
    3.1.2 RSVM with optimal basis
    3.1.3 Some remarks
  3.2 Uniform design for SVM model selection
    3.2.1 Methodology: 2-staged UD for model selection

4 More on Kernel-Based Learning Algorithms
  4.1 Kernel principal components analysis
    4.1.1 Classical PCA
    4.1.2 Kernel PCA
  4.2 Kernel sliced inverse regression
    4.2.1 Classical SIR
    4.2.2 Kernel SIR
  4.3 Kernel canonical correlation analysis
    4.3.1 Classical CCA
    4.3.2 Kernel CCA
  4.4 Kernel Fisher discriminant analysis
    4.4.1 Classical FLDA
    4.4.2 Kernel FDA
  4.5 Support vector regression
    4.5.1 RLS-SVR: a ridge approach
    4.5.2 Smooth $\epsilon$-support vector regression ($\epsilon$-SSVR)
  4.6 SIR and KSIR revisited

5 Learning Theory with RKHS (optional)
  5.1 Introduction
  5.2 SVMs with RKHS
  5.3 Multivariate statistical analysis with RKHS
  5.4 More to come

6 Neural Networks

A Some Matrix Algebra
  A.1 Matrix diagonalization
    A.1.1 Eigenvalue decomposition and generalized eigenvalue decomposition
    A.1.2 Singular value decomposition (SVD) and generalized SVD

B Kernel Statistics Toolbox
  B.1 KGaussian
  B.2 Smooth SVM
  B.3 Lagrangian SVM
  B.4 Uniform designs for model selection
  B.5 Support vector regression
  B.6 Kernel PCA
  B.7 Kernel SIR
  B.8 Kernel Fisher linear discriminant analysis
  B.9 Kernel CCA

List of Figures

3.1 Spectrum comparison on full kernel vs. reduced kernel.
3.2 Three examples of the search space of model selection.
3.3 The nested UD model selection with a 13-point UD at the first stage and a 9-point UD at the second stage.
3.4 The nested UD model selection with a 9-point UD at the first stage and a 5-point UD at the second stage.
4.1 SIR view over the 1st-and-2nd variates.
4.2 KSIR view over the 1st-and-2nd variates.
4.3 Scatter plot of pen-digits over CCA-found variates.
4.4 Scatter plots of pen-digits over KCCA-found variates.
4.5 Data scatter along some coordinates.
4.6 Data scatter, fitted values and residuals using 1st SIR variate.
4.7 Data scatter, fitted values and residuals using 1st KSIR variate.
Chapter 1
Introduction

1.1 Aim and scope

The development of statistical methods is often highly influenced and challenged by problems from other fields. Due to the rapid growth of computer technology, we easily encounter enormous amounts of data collected from diverse scientific fields, which has led to a great demand for innovative analytic tools for complex data. The process of extracting information from data and making statistical inferences for future observations is called learning from data. Statistical and machine learning is an interdisciplinary field drawing on theory from statistics, probability, mathematics and computer science, with plenty of applications in engineering science, biology, bioinformatics, medical studies, etc.

The aim of this monograph is to provide an overview of the development of machine learning with an emphasis on its statistical aspects. It is intended for an introductory graduate-level course. The choice of material is rather subjective. A major part of it is based on some recent works by colleagues in the IE&SLR group. (The IE&SLR, information extraction & statistical learning research, group consists of faculty and students from various institutes and departments of Academia Sinica and other universities.) This monograph starts with an introduction to support vector machines (SVMs), followed by other kernel-based learning algorithms and learning theory with reproducing kernel Hilbert spaces (RKHS), and ends with a chapter on neural networks. Bayesian theory will be blended in when necessary. Two auxiliary techniques for computation reduction and model selection (see reduced kernel and model selection for SVMs in Chapter 3) are also provided. Program code can be found in Appendix B; it is part of the Kernel Statistics toolbox, which is still in development. (The Kernel Statistics toolbox is maintained at http://140.109.74.188/kern_stat_toolbox/kernstat.html by the first author and is also available at http://www.stat.sinica.edu.tw/syhuang.)
1.2 Notation and abbreviation

All vectors are column vectors unless otherwise specified or transposed to a row vector by a prime superscript $'$. For a matrix $A \in \mathbb{R}^{n\times p}$, $A_i$ is the $i$th row of $A$. We put the data $\{x_i \in \mathbb{R}^p\}_{i=1}^n$ by row in a matrix denoted by $A$, and we call it the data design matrix. We will use the row vector $A_i$ and the column vector $x_i$ for the same $i$th observation, depending on convenience. A column vector of ones of arbitrary dimension is denoted by $\mathbf{1}$. For $A \in \mathbb{R}^{n\times p}$ and $B \in \mathbb{R}^{\tilde n\times p}$, the kernel $K(A,B)$ maps $\mathbb{R}^{n\times p}\times\mathbb{R}^{\tilde n\times p}$ into $\mathbb{R}^{n\times\tilde n}$. We adopt the Matlab vector and matrix convention
$$[M_1; M_2] := \begin{bmatrix} M_1 \\ M_2 \end{bmatrix},$$
where the $M_i$ can be vectors or matrices and the semicolon on the left-hand side of the equality means stacking $M_1$ on top of $M_2$, as shown on the right-hand side. Other notation conventions and terminology abbreviations used in this monograph are listed below.

- $A$: a matrix of training inputs, where input data are arranged by rows, i.e., $A = [x_1'; x_2'; \ldots; x_n'] \in \mathbb{R}^{n\times p}$.
- $A_+$: a matrix of training inputs whose corresponding class label is $+1$.
- $A_-$: a matrix of training inputs whose corresponding class label is $-1$.
- $A_i$: the $i$th row of a matrix $A$.
- $\mathcal{C}(M)$: the column space of a matrix $M$, i.e., the linear space spanned by the columns of $M$.
- CCA: canonical correlation analysis.
- CV: cross-validation.
- $D$: a diagonal matrix with the class labels on the diagonal, i.e., $D = \mathrm{diag}(y_1,\ldots,y_n)$.
- FLDA: Fisher linear discriminant analysis.
- $I$: an index set.
- $|I|$: the size of an index set $I$.
- $K$, $K(A,B)$: kernel function or kernel matrix.
- $\tilde K = K(A,\tilde A)$: a reduced kernel matrix using some subset $\tilde A$.
- KKT: Karush-Kuhn-Tucker.
- $n$: full data size; $\tilde n$: reduced set size.
- PCA: principal components analysis.
- plus function $t_+$: for $t = (t_1,\ldots,t_p) \in \mathbb{R}^p$, the plus function $t_+$ is defined componentwise as $(t_+)_i = \max\{0, t_i\}$ for $i = 1,\ldots,p$.
- $\mathbf{1}$: a column vector of ones of arbitrary dimension; $\mathbf{1}_p$: a column vector of ones of dimensionality $p$.
- $\Phi: \mathcal{X} \mapsto \mathcal{Z}$: feature map.
- $\mathbb{R}$: the real numbers.
- RK: reproducing kernel; RKHS: reproducing kernel Hilbert space.
- RSVM: reduced (kernel) support vector machine.
- SIR: sliced inverse regression.
- SVM: support vector machine; LSVM: Lagrangian SVM; SSVM: smooth SVM.
- SVR: support vector regression; SSVR: smooth support vector regression.
- UD: uniform design.
- $\mathcal{X}$: input data space.
- $Z$: feature data design matrix, i.e., $Z = [z_1'; \ldots; z_n']$.
- $\mathcal{Z}$: feature space.
- $\nabla g(x)$: gradient, consisting of first derivatives.
- $\nabla^2 g(x)$: Hessian matrix, consisting of second derivatives.
- $\langle\cdot,\cdot\rangle_{\mathcal{H}}$: inner product in a Hilbert space $\mathcal{H}$.
- $\|\cdot\|_q$: the $q$-norm.

A small Matlab snippet illustrating these conventions is given below.
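The following is a minimal illustrative sketch of the conventions above on a toy data set. It is not part of the Kernel Statistics toolbox; all variable names and the toy data are our own.

```matlab
% Illustrative sketch of the notation conventions (hypothetical toy data).
n = 5; p = 3;
A = randn(n, p);              % data design matrix: n observations stored by rows
y = [1; 1; -1; -1; 1];        % class labels in {-1, +1}
D = diag(y);                  % diagonal matrix with the labels on the diagonal
one_n = ones(n, 1);           % the vector "1" of length n

x1 = A(1, :)';                % column vector x_1 is the transposed first row A_1
M  = [A; A(1:2, :)];          % the stacking convention [M1; M2]

t      = [-1.5; 0.2; 3.0];
t_plus = max(t, 0);           % plus function t_+, applied componentwise
```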
Chapter 2
Support Vector Machines

Classification, which is one kind of supervised learning, is a commonly encountered task in statistics. For a binary classification problem, we are given a training data set $\{(x_1,y_1),\ldots,(x_n,y_n)\}$, where $x_i \in \mathbb{R}^p$ is the input data and $y_i \in \mathcal{Y} = \{-1,1\}$ is the corresponding class label. The main goal of classification is to find a decision function $f(x)$, based on the training data, that can predict the class labels of new input data points.

Support vector machines in a wide sense refer to a body of statistical and computational techniques that build on the "kernel trick". Observations are taken in the pattern space, while the classification (or any other learning task) is carried out in the so-called feature space, a very high dimensional Hilbert space in which populations can be easily separated and/or predicted by linear methods. In recent years SVMs [5, 15, 53] have become one of the most promising learning algorithms for classification as well as for regression [16, 38, 39, 47, 31, 43], which are two fundamental tasks in data mining [55].
2.1 Linear support vector machines

The linear SVM for binary classification attempts to find a hyperplane with maximal margin separating the training data into two classes, $A_-$ and $A_+$. We consider the following linear decision function:
$$f(x) = w'x + b. \qquad (2.1.1)$$
The decision boundary, given by $\{x: f(x) = 0\}$, separating the positive and negative input instances is a hyperplane in the input space $\mathbb{R}^p$. The distance from the origin to the hyperplane $w'x + b = 0$ in $\mathbb{R}^p$ is $|b|/\|w\|_2$. For the linearly separable case, we can arrange the scale of the normal vector $w$ and the offset $b$ so that the following constraints are satisfied:
$$w'x_i + b \ge +1 \quad \text{for } y_i = +1, \qquad (2.1.2)$$
$$w'x_i + b \le -1 \quad \text{for } y_i = -1. \qquad (2.1.3)$$
These constraints can be reformulated as one set of inequalities:
$$1 - y_i(w'x_i + b) \le 0 \quad \forall i. \qquad (2.1.4)$$
The margin between the two bounding planes (2.1.2) and (2.1.3) is
$$|b + 1 - (b - 1)|/\|w\|_2 = 2/\|w\|_2.$$
Maximizing the margin is equivalent to minimizing $\|w\|_2^2$. Thus, the SVM for the linearly separable case can be formulated as a constrained optimization problem:
$$\min_{w,b}\ \|w\|_2^2/2 \quad \text{subject to} \quad 1 - y_i(w'x_i + b) \le 0 \ \forall i. \qquad (2.1.5)$$
It is almost always the case that the data are not completely linearly separable, so that there is no feasible solution to (2.1.5). Hence, it is necessary to consider the non-separable case. For the non-separable case, we introduce nonnegative slack variables $\xi_i$ into the constraints:
$$w'x_i + b \ge +1 - \xi_i \quad \text{for } y_i = +1, \qquad (2.1.6)$$
$$w'x_i + b \le -1 + \xi_i \quad \text{for } y_i = -1, \qquad (2.1.7)$$
$$\xi_i \ge 0 \ \forall i. \qquad (2.1.8)$$
Inequalities (2.1.6)-(2.1.8) can be combined into
$$1 - y_i(w'x_i + b) - \xi_i \le 0 \quad \text{and} \quad -\xi_i \le 0, \quad \forall i. \qquad (2.1.9)$$
The nonnegativity constraints on the slack variables, $\xi_i \ge 0$, are expressed as $-\xi_i \le 0$ in (2.1.9). This is merely for the convenience of making the associated Lagrange multipliers nonnegative. With these slack variables, a natural way to assign an error cost is to add a penalty on the size of $\xi_i$ to the objective function to be minimized, e.g., $\|w\|_2^2/2 + C\sum_{i=1}^n P(\xi_i)$, where $P(\xi)$ is a nonnegative increasing function of $\xi$, and $C$ is a tuning parameter controlling the trade-off between goodness of fit to the data (also called data fidelity) and the regularity (or flatness, smoothness) of the decision function. Here we consider a penalty on the one-norm of $\xi$, leading to the following constrained optimization problem:
$$\min_{w,b,\xi}\ \|w\|_2^2/2 + C\sum_{i=1}^n \xi_i \quad \text{subject to (2.1.9)}. \qquad (2.1.10)$$
To solve the problem, Lagrange multipliers $\alpha_i \ge 0$ and $\beta_i \ge 0$ are introduced, one for each inequality in (2.1.9). Then we have the following Lagrangian:
$$L(w,b,\xi,\alpha,\beta) = \tfrac{1}{2}\|w\|_2^2 + C\mathbf{1}'\xi + \alpha'[\mathbf{1} - D(Aw + b\mathbf{1}) - \xi] - \beta'\xi, \qquad (2.1.11)$$
where $D$ is a diagonal matrix with the $y_i$ on the diagonal. The variables $w, b, \xi$ are called primal variables, and $\alpha, \beta$ are called dual variables. The task is to minimize (2.1.11) with respect to $w$, $b$ and $\xi$, and to maximize it with respect to $\alpha$ and $\beta$. At the optimal point, we have the following saddle point equations:
$$\frac{\partial L}{\partial w} = 0, \quad \frac{\partial L}{\partial b} = 0 \quad \text{and} \quad \frac{\partial L}{\partial \xi} = 0,$$
which translate to
$$w' - \alpha' D A = 0, \quad \alpha' D \mathbf{1} = 0 \quad \text{and} \quad C\mathbf{1} - \alpha - \beta = 0,$$
or equivalently to
$$w = \sum_{i=1}^n \alpha_i y_i x_i, \quad \sum_{i=1}^n \alpha_i y_i = 0 \quad \text{and} \quad \alpha_i + \beta_i = C \ \forall i. \qquad (2.1.12)$$
By substituting (2.1.12) into the Lagrangian, we get the dual problem:
$$\max_{\alpha}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j x_i'x_j \quad \text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0 \ \text{and}\ 0 \le \alpha_i \le C. \qquad (2.1.13)$$
By introducing the slack variables $\xi_i$, we get the box constraints that limit the size of the Lagrange multipliers: $\alpha_i \le C$. The KKT necessary and sufficient optimality conditions (see the footnote at the end of this section) can be summarized by
$$0 \le \alpha \perp \mathbf{1} - D(Aw + b\mathbf{1}) - \xi \le 0, \quad \text{and} \quad 0 \le \beta \perp \xi \ge 0.$$
These KKT conditions imply that
$$\alpha_i = 0 \ \Rightarrow\ \beta_i = C \ \Rightarrow\ y_i f(x_i) \ge 1 \ \text{and}\ \xi_i = 0;$$
$$0 < \alpha_i < C \ \Rightarrow\ 0 < \beta_i < C \ \Rightarrow\ y_i f(x_i) = 1 \ \text{and}\ \xi_i = 0;$$
$$\alpha_i = C \ \Rightarrow\ \beta_i = 0 \ \Rightarrow\ y_i f(x_i) \le 1 \ \text{and}\ \xi_i \ge 0.$$
Points $x_i$ associated with nonzero $\alpha_i$ are called support vectors (SVs). The offset term $b$ is computed by
$$b = \frac{1}{|I|}\sum_{i\in I}\Big(y_i - \sum_{j=1}^n \alpha_j y_j x_j'x_i\Big), \quad \text{where } I = \{i: 0 < \alpha_i < C\}.$$
With $w = \sum_{i=1}^n \alpha_i y_i x_i$, the underlying decision function is then given by
$$f(x) = \sum_{i=1}^n \alpha_i y_i x_i'x + b = \sum_{x_i:\,\mathrm{SV}} \alpha_i y_i x_i'x + b.$$
Only the SVs contribute to the decision function for classification. In addition, the contribution of a data point as an SV to the final decision function is at most of scale $C$. The one-norm soft margin SVM is probably the most common formulation for SVMs. A small sketch of solving the dual problem (2.1.13) numerically is given below.
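As a concrete illustration, the following Matlab sketch solves the dual (2.1.13) of the linear one-norm soft margin SVM with the Optimization Toolbox routine quadprog (assumed available; any QP solver would do). The toy data, the value of C, and all variable names are placeholders of ours, not from the text.

```matlab
% Sketch: linear one-norm soft margin SVM via its dual (2.1.13).
A = [randn(20,2)+1; randn(20,2)-1];      % toy inputs, n = 40, p = 2
y = [ones(20,1); -ones(20,1)];           % labels in {-1,+1}
C = 10;                                  % tuning parameter
n = size(A,1);

H     = (y*y') .* (A*A');                % H_ij = y_i y_j x_i' x_j
f     = -ones(n,1);                      % maximize sum(alpha) <=> minimize -1'*alpha
alpha = quadprog(H, f, [], [], y', 0, zeros(n,1), C*ones(n,1));

w = A' * (alpha .* y);                   % w = sum_i alpha_i y_i x_i, from (2.1.12)
I = find(alpha > 1e-6 & alpha < C - 1e-6);   % indices of margin support vectors
b = mean(y(I) - A(I,:)*w);               % offset averaged over I
predict = @(x) sign(x*w + b);            % decision rule for new input rows x
```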
Remark 1. The SVM introduced above provides a way of doing linear discrimination based on the one-norm soft margin criterion. There are many other alternatives in the statistical literature, e.g., Fisher linear discriminant analysis, logistic discriminant analysis, classification by least squares regression, the Gaussian mixture approach, etc. As we proceed through this course, think about why the SVM, and why the kernel extension introduced in later chapters.

Footnote (KKT conditions). Theorem. Given a convex optimization problem $\min_v \ell(v)$ subject to inequality constraints $g_i(v) \le 0$, $i = 1,\ldots,k$, with $\ell \in C^1$ and the $g_i$ affine, necessary and sufficient conditions for a point $v^*$ to be an optimal solution are the existence of Lagrange multipliers $\eta^*$ such that
$$\frac{\partial L(v^*,\eta^*)}{\partial v} = 0; \quad \eta_i^* g_i(v^*) = 0,\ i = 1,\ldots,k; \quad g_i(v^*) \le 0,\ i = 1,\ldots,k; \quad \eta_i^* \ge 0,\ i = 1,\ldots,k.$$
The second set of conditions is known as the KKT conditions.
2.2 Kernel trick for nonlinear SVM extension

How can we generalize the SVM to the nonlinear case, but still keep the model simplicity of linearity? That is, we want a flexible decision boundary for separating the two classes, while still maintaining the simplicity of the estimation and/or prediction task.

The main idea is first to map the input data to a high dimensional space by a feature map $\Phi: \mathcal{X} \mapsto \mathcal{Z}$,
$$x \mapsto \Phi(x) = (\phi_1(x), \phi_2(x), \phi_3(x), \ldots) \in \mathcal{Z}. \qquad (2.2.14)$$
(Regard the collection of functions $\{\phi_1(x), \phi_2(x), \phi_3(x), \ldots\}$ as a basis set in a certain functional space $\mathcal{H}$ on $\mathcal{X}$. The nonlinear SVM is in general a nonparametric method.) Next, the linear SVM problem (2.1.10) is considered and nominally carried out in the feature space $\mathcal{Z}$. The nominal input data in the feature space are $\{z_i = \Phi(x_i)\}_{i=1}^n$. Similar to the constraints (2.1.9), we can form constraints in the feature space as follows:
$$1 - y_i(w'z_i + b) - \xi_i \le 0 \ \forall i, \qquad (2.2.15)$$
$$-\xi_i \le 0 \ \forall i, \qquad (2.2.16)$$
where $w$ is the normal vector to the separating hyperplane in $\mathcal{Z}$. The corresponding optimization problem becomes
$$\min_{w,b,\xi}\ \|w\|_2^2/2 + C\sum_{i=1}^n \xi_i \quad \text{subject to (2.2.15)-(2.2.16)}. \qquad (2.2.17)$$
Similar to the derivation of (2.1.13), we start with the partial derivatives of the Lagrangian with respect to the primal variables:
$$\frac{\partial L}{\partial w} = 0, \quad \frac{\partial L}{\partial b} = 0 \quad \text{and} \quad \frac{\partial L}{\partial \xi} = 0,$$
which translate to
$$w' - \alpha' D Z = 0, \quad \alpha' D \mathbf{1} = 0 \quad \text{and} \quad C\mathbf{1} - \alpha - \beta = 0, \quad \text{where } Z = [z_1'; z_2'; \cdots; z_n'],$$
or equivalently to
$$w = \sum_{i=1}^n \alpha_i y_i z_i, \quad \sum_{i=1}^n \alpha_i y_i = 0 \quad \text{and} \quad \alpha_i + \beta_i = C \ \forall i. \qquad (2.2.18)$$
By substituting (2.2.18) into the Lagrangian, we get the dual problem:
$$\max_{\alpha}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j \Phi(x_i)'\Phi(x_j) \quad \text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0 \ \text{and}\ 0 \le \alpha_i \le C. \qquad (2.2.19)$$
Similarly, $b$ is computed as
$$b = \frac{1}{|I|}\sum_{i\in I}\Big(y_i - \sum_{j=1}^n \alpha_j y_j \Phi(x_j)'\Phi(x_i)\Big), \quad \text{where } I = \{i: 0 < \alpha_i < C\}.$$
Note that the feature data $z_i$ are not observed. However, the problem (2.2.17) and the underlying decision function
$$f(x) = w'z + b, \quad \text{where } z = \Phi(x),$$
depend on $w$ and $z$ only through inner product values in $\mathcal{Z}$. Define the inner product as a bivariate function on $\mathcal{X}\times\mathcal{X}$:
$$K(x,u) = \Phi(x)'\Phi(u).$$
There is no need to know explicitly the functional form of $\Phi$, nor the individual feature observations $z_i$. It is sufficient to know the function $K(x,u)$ representing the inner product values in $\mathcal{Z}$. The function $K(x,u)$ is called a kernel function. Alternatively, one can simply start with a given positive definite kernel $K(x,u)$. (We will give a formal definition of a positive definite kernel later.) Then, solve the dual problem
$$\max_{\alpha}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j K(x_i,x_j) \quad \text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0 \ \text{and}\ 0 \le \alpha_i \le C, \qquad (2.2.20)$$
and compute $b$ from
$$b = \frac{1}{|I|}\sum_{i\in I}\Big(y_i - \sum_{j=1}^n \alpha_j y_j K(x_j,x_i)\Big), \quad \text{where } I = \{i: 0 < \alpha_i < C\}.$$
Finally, the underlying decision function is given by
$$f(x) = \sum_{i=1}^n \alpha_i y_i K(x_i,x) + b.$$
The kernel trick introduced above provides a powerful extension to nonlinear SVMs. In fact, the SVM does not solve the nonlinear problem directly. It maps the input data into a high dimensional space via an implicit transformation, so that the data in this high dimensional space can be better separated by a hyperplane therein. One of the most popular kernel functions is the Gaussian kernel (also known as the radial basis function):
$$K(x,u) = e^{-\gamma\|x-u\|_2^2}, \qquad (2.2.21)$$
where $x$ and $u$ are two vectors in $\mathbb{R}^p$ and $\gamma$ is the width parameter of the Gaussian kernel. (We do not bother to normalize it so that the kernel integrates to one, as the normalization has no effect on SVM problems.) The value of $K(x,u)$ represents the inner product of $\Phi(x)$ and $\Phi(u)$ in some high dimensional feature space $\mathcal{Z}$ and reflects the similarity between $x$ and $u$. Values of $K(x,u)$ drop off at a rate determined by $\|x-u\|_2^2$ and $\gamma$. A small sketch for computing the Gaussian kernel matrix is given below.
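The following Matlab sketch (our own helper, saved as gaussian_kernel.m; it is not the toolbox's KGaussian routine) builds the Gaussian kernel matrix $K(A,B)$ of (2.2.21) for two data design matrices stored by rows.

```matlab
% Sketch: Gaussian kernel matrix K(A,B) with K_ij = exp(-gamma*||A_i - B_j||^2).
% A is n-by-p, B is ntilde-by-p; the result is n-by-ntilde.
function K = gaussian_kernel(A, B, gamma)
  sqA = sum(A.^2, 2);                         % n-by-1 squared row norms
  sqB = sum(B.^2, 2)';                        % 1-by-ntilde squared row norms
  D2  = bsxfun(@plus, sqA, sqB) - 2*(A*B');   % squared pairwise distances
  K   = exp(-gamma * max(D2, 0));             % clip tiny negatives from round-off
end
```

For the full kernel matrix one would call gaussian_kernel(A, A, gamma); for a reduced kernel (Chapter 3), the second argument is the reduced set.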
Before we end this section, we give a formal definition of a positive definite kernel and its spectrum.

Definition 1 (Positive definite kernel and its spectrum). A real-valued symmetric kernel function $K$ defined on $\mathcal{X}\times\mathcal{X}$ is said to be positive definite if and only if, for any positive integer $m$, any sequence of real numbers $\{a_1,\ldots,a_m\}$ and any sequence of points $\{u_1,\ldots,u_m\}$ in $\mathcal{X}$, we have $\sum_{i,j=1}^m a_i a_j K(u_i,u_j) \ge 0$; it is strictly positive definite when equality holds if and only if $a_1 = \cdots = a_m = 0$. For a given positive definite kernel $K$, if it defines a continuous compact operator on $L_2(\mathcal{X})$, then $K$ can be expanded in terms of its eigenvalues and eigenfunctions:
$$K(x,u) = \sum_q \lambda_q \psi_q(x)\psi_q(u). \qquad (2.2.22)$$
With the above expansion, the associated feature map is given by
$$\Phi(x) = \big(\sqrt{\lambda_1}\,\psi_1(x), \sqrt{\lambda_2}\,\psi_2(x), \sqrt{\lambda_3}\,\psi_3(x), \ldots\big).$$
2.3 Variants of SVM algorithms

There are two key ingredients in an SVM problem. One is the kernel trick for extending to a nonlinear decision function, which is a kernel mixture (i.e., a linear combination of kernels). The decision function is nonlinear in the pattern space but linear in an implicit feature space. The nonlinearity obtained via the kernel mixture allows excellent flexibility and versatility for the decision function in the pattern space, while the simple linear structure, together with the sparsity of the SVs making up the decision function in the feature space, makes the implementation of the nonlinear model possible. The other key ingredient is the error criterion (or risk, cost criterion) for seeking (or estimating) the decision function.

2.3.1 One-norm soft margin SVM

The formulations (2.1.10) and (2.1.13) for the linear case, as well as (2.2.17) and (2.2.19) for the nonlinear case, fall into the category of one-norm soft margin SVMs. The term one-norm refers to the one-norm penalty on the slack variables: $\|\xi\|_1 = \sum_{i=1}^n \xi_i$.

Footnote 3. There is another SVM variant, also called the one-norm SVM [57]. It penalizes the one-norm of the normal vector, i.e., $\|w\|_1$.

Footnote 4. The purpose of a penalty is to regularize the underlying model (to make it "smooth", often by shrinking the parameter values toward zero). The effect of a one-norm penalty is a parsimonious model structure; it enforces a keep-or-kill action. The effect of a two-norm penalty is to shrink the parameter values, but never to kill them. In a Bayesian formulation, the one-norm penalty corresponds to a Laplacian (double exponential) prior and the two-norm penalty to a Gaussian prior on the parameters.
2.3.2 Two-norm soft margin SVM

Besides the one-norm $\|\xi\|_1$, it is natural to consider the two-norm penalty $\|\xi\|_2^2$. The corresponding formulation becomes
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\sum_{i=1}^n \xi_i^2 \quad \text{subject to (2.2.15)-(2.2.16)}. \qquad (2.3.23)$$

Homework 1. Formulate the 2-norm soft margin SVM in dual form.

Solution: The nonnegativity constraint (2.2.16) can be removed, since the term $\|\xi\|_2^2$ is included in the objective function. Therefore, we have the following Lagrangian:
$$L(w,b,\xi,\alpha) = \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\sum_{i=1}^n \xi_i^2 + \alpha'[\mathbf{1} - D(Zw + b\mathbf{1}) - \xi].$$
Its partial derivatives with respect to the primal variables,
$$\frac{\partial L}{\partial w} = 0, \quad \frac{\partial L}{\partial b} = 0 \quad \text{and} \quad \frac{\partial L}{\partial \xi} = 0,$$
translate to
$$w' - \alpha' D Z = 0, \quad \alpha' D \mathbf{1} = 0 \quad \text{and} \quad C\xi - \alpha = 0,$$
or equivalently to
$$w = \sum_{i=1}^n \alpha_i y_i z_i, \quad \sum_{i=1}^n \alpha_i y_i = 0 \quad \text{and} \quad \xi_i = \frac{\alpha_i}{C} \ \forall i.$$
Then, by substituting the above equations into the Lagrangian, we obtain
$$L(w,b,\xi,\alpha) = \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j z_i'z_j + \frac{1}{2C}\sum_{i=1}^n \alpha_i^2 - \sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j z_i'z_j + \sum_{i=1}^n \alpha_i - \frac{1}{C}\sum_{i=1}^n \alpha_i^2 = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j z_i'z_j - \frac{1}{2C}\sum_{i=1}^n \alpha_i^2.$$
The 2-norm soft margin SVM problem can therefore be formulated in the following dual form:
$$\max_{\alpha}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j K(x_i,x_j) - \frac{1}{2C}\sum_{i=1}^n \alpha_i^2 \quad \text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0,\ \alpha_i \ge 0 \ \forall i. \qquad \Box$$
There is another variant of the 2-norm soft margin SVM algorithm, called the Lagrangian SVM [40] (LSVM for short). It also appends the term $b^2/2$ to the margin criterion and solves the following reformulated problem:
$$\min_{w,b,\xi}\ \frac{1}{2}\big(\|w\|_2^2 + b^2\big) + \frac{C}{2}\sum_{i=1}^n \xi_i^2 \quad \text{subject to (2.2.15)-(2.2.16)}. \qquad (2.3.24)$$
This reformulation leads to the simple constraints $\alpha_i \ge 0$ in the dual problem (see Appendix B) without sacrificing testing accuracy. The LSVM implementation is based directly on the KKT necessary and sufficient optimality conditions for the dual problem. For the linear case, the LSVM implementation uses the Sherman-Morrison-Woodbury (SMW) identity to transform an $n\times n$ matrix inversion into a $p\times p$ matrix inversion. However, for the nonlinear case, the $n\times n$ kernel matrix $[K(x_i,x_j)]$ is dense and of full rank (or close to full rank), and there is no advantage in using the SMW identity. In this case, a reduced kernel technique (introduced later in Chapter 3) should be used to cut down the column rank of the kernel matrix. Refer to Appendix B for implementation details and Matlab code.
2.3.3 Smooth SVM (SSVM)

Traditionally, SVMs are solved via their dual problems. Here we introduce an alternative approach, which focuses on the primal problems and solves them directly by smoothing techniques [33]. Smoothing methods have been used extensively in the optimization literature [8, 9, 7, 11, 12, 18, 51]. The basic idea is to find a smooth (differentiable) approximation to the non-smooth objective function, here specifically the plus function, and then to use a Newton method (see the footnote below) to solve the smooth optimization problem.

The decision function of an SVM has the following form:
$$f(x) = \sum_{i=1}^n \alpha_i y_i K(x_i,x) + b \quad \Big(:= \sum_{i=1}^n v_i K(x_i,x) + b \ \text{with } v_i = \alpha_i y_i\Big).$$
The SSVM works directly on the primal problem. It starts from a type of 2-norm soft margin SVM, appends the term $b^2/2$ to the objective function to be minimized, and leads to the following minimization problem:
$$\min_{v,b,\xi}\ \frac{C}{2}\|\xi\|_2^2 + \frac{1}{2}\big(\|v\|_2^2 + b^2\big) \quad \text{subject to} \quad 1 - y_i f(x_i) - \xi_i \le 0 \ \forall i, \quad \xi_i \ge 0 \ \forall i.$$
Note that the nonnegativity constraint $\xi \ge 0$ can be removed, as the term $\|\xi\|_2^2$ is included in the objective function of the minimization problem. The problem above becomes
$$\min_{v,b,\xi}\ \frac{C}{2}\|\xi\|_2^2 + \frac{1}{2}\big(\|v\|_2^2 + b^2\big) \qquad (2.3.25)$$
$$\text{subject to} \quad \mathbf{1} - D(Kv + b\mathbf{1}) - \xi \le 0, \qquad (2.3.26)$$
where $K = K(A,A) = [K(x_i,x_j)]_{i,j=1}^n$. At a solution, $\xi$ takes the form $\xi = (\mathbf{1} - D(Kv + b\mathbf{1}))_+$. The problem given by (2.3.25) and (2.3.26) converts to an equivalent unconstrained minimization problem:
$$\min_{v,b}\ \frac{C}{2}\big\|(\mathbf{1} - D(Kv + b\mathbf{1}))_+\big\|_2^2 + \frac{1}{2}\big(\|v\|_2^2 + b^2\big). \qquad (2.3.27)$$
It is a strongly convex minimization problem without any constraint and has a unique solution. However, the objective function is not differentiable, which hinders the use of a Newton method. Below, a smooth approximation [33] is introduced.

Footnote 5 (Newton's method). It is also known as the Newton-Raphson method and is an efficient algorithm for finding approximations to the zeros (or roots) of a smooth function. It can also be used to find local maxima and local minima of functions, by noting that a local optimum of a function is a root of its derivative. For a twice-differentiable function $f(x)$, an iterative scheme for a local optimum can be prescribed as follows:
$$x_{n+1} = x_n - \big[\nabla^2 f(x_n)\big]^{-1}\nabla f(x_n), \quad n \ge 0,$$
where $x_0$ is an initial point. Usually the Newton method is modified to include a small step size $0 < \gamma < 1$ instead of $\gamma = 1$; this is known as the Newton-Armijo algorithm, which ensures the convergence of the iterative scheme.
For $t = (t_1,\ldots,t_p) \in \mathbb{R}^p$, the plus function $t_+$ is defined componentwise as $(t_+)_i = \max\{0, t_i\}$ for $i = 1,\ldots,p$. An accurate smooth approximation to the plus function is given by $p(t,\rho)$, defined componentwise by
$$p(t_i,\rho) = t_i + \frac{1}{\rho}\log\big(1 + e^{-\rho t_i}\big), \quad t_i \in \mathbb{R},\ \rho > 0. \qquad (2.3.28)$$
The smooth $p$-function is the integral of the sigmoid (also known as logistic) function $g(t) = 1/(1 + e^{-\rho t})$ for $t \in \mathbb{R}$, which is used to approximate the step function $h(t) = 1$ for $t \ge 0$ and $h(t) = 0$ for $t < 0$. As $\rho \to \infty$, $p(t,\rho) \to t_+$ for all $t \in \mathbb{R}^p$. We can replace $(\mathbf{1} - D(Kv + b\mathbf{1}))_+$ by the smooth approximation (2.3.28), and the SVM problem converts to
$$\min_{v,b}\ \frac{C}{2}\big\|p(\mathbf{1} - D(Kv + b\mathbf{1}),\rho)\big\|_2^2 + \frac{1}{2}\big(\|v\|_2^2 + b^2\big). \qquad (2.3.29)$$
The problem (2.3.29) can be solved by the fast Newton-Armijo algorithm [33]. A Matlab implementation is given in Appendix B. That code takes advantage of the sparsity of the Hessian matrix and uses the limiting values of the sigmoid function and the $p$-function (i.e., $\rho = \infty$) in computing the gradient vector and the Hessian matrix, which simplifies their computation. A brief sketch of the smooth objective and a plain Newton iteration is given below.
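As a rough illustration of the smoothing idea (and not the toolbox's Newton-Armijo implementation), the following Matlab sketch evaluates the gradient and Hessian of the smooth objective (2.3.29) and runs a few plain Newton steps with a fixed finite $\rho$. The kernel matrix K, the labels y and the parameter C are assumed to be already defined; the Armijo line search and the numerical safeguards of [33] are omitted.

```matlab
% Sketch: smooth SVM objective (2.3.29) minimized by plain Newton steps.
n   = size(K, 1);
T   = bsxfun(@times, y, [K, ones(n,1)]);   % T*u equals D*(K*v + b*1) for u = [v; b]
u   = zeros(n+1, 1);                       % start from v = 0, b = 0
rho = 5;                                   % finite smoothing parameter, for illustration

for iter = 1:20
  r = 1 - T*u;                             % residuals 1 - y_i * f(x_i)
  s = 1 ./ (1 + exp(-rho*r));              % sigmoid = derivative of p(., rho)
  p = r + log(1 + exp(-rho*r)) / rho;      % smooth plus function p(r, rho), (2.3.28)
  g = u - C * (T' * (p .* s));             % gradient of (2.3.29) w.r.t. u
  H = eye(n+1) + C * (T' * bsxfun(@times, s.^2 + rho*p.*s.*(1-s), T));  % Hessian
  u = u - H \ g;                           % Newton step (no Armijo step-size search)
end
v = u(1:n);  b = u(n+1);                   % decision function f(x) = K(x,A)*v + b
```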
2.4 Multiclass classification

So far we have only discussed binary classification, as SVMs were originally designed for it. There are two commonly seen multiclass extensions for SVMs. One is the composition type of method built upon a series of binary classifiers, e.g., one-against-one and one-against-rest; the other is the single-machine type of method, often huge and solved in one optimization formulation. See Rifkin and Klautau [45] for a literature review, a comparison study and references. There is no universally dominant classification rule for multiclass problems. Different methods have their own merits and advantages, leaving room for the exploration of alternative approaches, for instance kernel Fisher discriminant analysis [41, 26] and classification via support vector regression [10]. In this section we only introduce one-against-one (also known as pairwise) classification, which is probably the most commonly used classification scheme in multiclass SVMs. In the one-against-one scheme, we train a binary classifier for each pair of classes. For $k$ classes, this results in $k(k-1)/2$ binary classifiers. Though the number of classifiers to be trained is large, the individual problems to be trained on are significantly smaller. Training a series of small classifiers is often less time consuming than training one huge single machine.

When classifying a test pattern, we evaluate all $k(k-1)/2$ binary classifiers and assign the test point to the class with the majority of votes. A vote for a given class means a classifier assigning the test pattern to that class. For tied votes, we use a random assignment among the tied classes. A small sketch of the voting step is given below.
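A minimal Matlab sketch of the one-against-one voting rule. It assumes the $k(k-1)/2$ pairwise classifiers have already been trained and are accessible through a hypothetical handle pairwise_predict(r, s, x) that returns the winning class (r or s) for a test point x; both the handle and the variable names are assumptions of ours.

```matlab
% Sketch: one-against-one voting for a single test point x over k classes.
% pairwise_predict(r, s, x) is a placeholder for a trained binary SVM on classes r vs s.
votes = zeros(k, 1);
for r = 1:k-1
  for s = r+1:k
    winner = pairwise_predict(r, s, x);   % returns either r or s
    votes(winner) = votes(winner) + 1;
  end
end
top   = find(votes == max(votes));        % classes with the most votes
label = top(randi(numel(top)));           % break ties by random assignment
```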
Chapter 3
Two Auxiliary Techniques: Reduced Kernel and Model Selection for SVMs

In this chapter we introduce two auxiliary techniques for dealing with large-data computation and with model selection for SVMs.

3.1 Reduced kernel: a low-rank approximation

In nonlinear SVMs, as well as in other kernel-based learning algorithms, generating a nonlinear separating surface requires dealing with a fully dense kernel matrix of size $n\times n$. When $n$ is medium to large, several computational difficulties are encountered:

(P1) the size of the mathematical programming problem,
(P2) the size of memory usage,
(P3) the complexity of the resulting separating surface.

To cut down the model complexity, a reduced kernel technique was introduced [32, 34]. The original reduced kernel support vector machine (RSVM) uses a very small random subset of size $\tilde n$ ($\ll n$) to build up the separating surface for the SSVM algorithm. The reduced set plays a role similar to that of the support vectors. We denote this random subset by $\tilde A \subset A$; it is used to generate a much smaller rectangular matrix of size $n\times\tilde n$:
$$K(A,\tilde A) = [K(x_i,x_j)], \quad x_i \in A,\ x_j \in \tilde A.$$
The reduced kernel matrix is used to replace the full kernel matrix $K(A,A)$. The use of a reduced kernel cuts the problem size, the associated computing time and the memory requirement, and it simplifies the characterization of the nonlinear separating surface as well. The basic idea of RSVM is to solve an approximate SVM problem instead of the full problem. It works by replacing the full kernel matrix with a low-rank approximation. The use of a reduced kernel is limited neither to random subsets nor to the SSVM algorithm; there are alternative choices. We divide the discussion of reduced kernel selection into two categories: the most computationally economical random subset and the most expensive optimal basis selection. There are also other alternative selections in between, obtained by searching for an optimal basis within a certain random subset. As for the SVM formulation, there are two basic types of RSVM formulations introduced here: RSVM in the primal space and RSVM in the dual space, respectively.
3.1.1 RSVM with random subset

In this monograph, the term "RSVM" refers to a reduced kernel SVM of any form, not limited to random subsets nor to any specific SVM algorithm.

Primal RSVM using a random subset is straightforward to formulate. It works simply by replacing the full kernel matrix $K$ with a reduced kernel, denoted by $\tilde K$. Using the SSVM, which is solved in the primal space, as an illustrative example, its decision function is of the form $f(x) = K(x,\tilde A)\tilde v + b$, where $\tilde v$ is solved directly from the following optimization problem:
$$\min_{\tilde v,b}\ \frac{C}{2}\Big\|\big(\mathbf{1} - D\{\tilde K\tilde v + b\mathbf{1}\}\big)_+\Big\|_2^2 + \frac{1}{2}\big(\|\tilde v\|_2^2 + b^2\big), \quad \text{where } \tilde K = K(A,\tilde A). \qquad (3.1.1)$$
In solving the RSVM (3.1.1), a smooth approximation $p(t,\rho)$ to the plus function is used. The solution to the minimization problem (3.1.1) leads to a nonlinear separating surface of the form
$$K(x,\tilde A)\tilde v + b = 0, \quad \text{i.e.,} \quad \sum_{i=1}^{\tilde n} K(x,\tilde A_i)\tilde v_i + b = 0. \qquad (3.1.2)$$
In fact, the reduced set $\tilde A$ does not necessarily have to be a subset of the training set. The RSVM minimization problem (3.1.1) retains the strong convexity and differentiability properties in the space $\mathbb{R}^{\tilde n+1}$ of $(\tilde v, b)$ for any arbitrary rectangular kernel. Again, the fast Newton-Armijo algorithm [33] can be applied directly to solve (3.1.1). The existence and uniqueness of the optimal solution are also guaranteed. Moreover, the computational complexity of solving problem (3.1.1) by the Newton-Armijo method is $O(\tilde n^3)$, while solving the full kernel SSVM is $O(n^3)$ [33].

The reduced kernel serves as a low-rank approximation to the full kernel matrix. It is also known as the Nyström approximation, which has been proposed in many sophisticated ways [46, 54]. The Nyström approximation, denoted by $R$, has the following form:
$$K(A,A) \approx K(A,\tilde A)K(\tilde A,\tilde A)^{-1}K(\tilde A,A) := R. \qquad (3.1.3)$$
(Note that $K(\tilde A,\tilde A)$ serves as a scaling matrix. Some authors use the approximation $R := K(A,\tilde A)K(\tilde A,A)$ in the dual problem; that is, they take the identity matrix as the scaling matrix for simplicity. To get the corresponding primal solution, the least squares problem (3.1.4) is applied and the primal solution is given by $\tilde v = K(\tilde A,A)v$. The difference between the approximation (3.1.3) and the one in this note lies in the scaling matrix being $K(\tilde A,\tilde A)$ or the identity.)

Note that, for a given $v$, the following least squares problem
$$\min_{\tilde v\in\mathbb{R}^{\tilde n}}\ \big\|Rv - K(A,\tilde A)\tilde v\big\|_2^2 \qquad (3.1.4)$$
has the solution
$$\tilde v = K(\tilde A,\tilde A)^{-1}K(\tilde A,A)v. \qquad (3.1.5)$$
That is, the Nyström approximation approximates $Kv$ by $\tilde K\tilde v$, as given below:
$$K(A,A)v \approx K(A,\tilde A)K(\tilde A,\tilde A)^{-1}K(\tilde A,A)v = K(A,\tilde A)\tilde v. \qquad (3.1.6)$$
In the primal RSVM, $\tilde v$ is directly determined by fitting the entire data set to the reduced problem (3.1.1), which can be regarded as an approximation to the full problem (2.3.27). For a more advanced theoretical study, consult Lee and Huang [34]. For a general primal SVM algorithm, its corresponding RSVM approximation works by replacing $K$ with $\tilde K$. That is to say, the RSVM in the primal space is easy to work with and basically shares the same type of formulation as its full problem.

Dual RSVM using a random subset works by replacing the full kernel matrix $K$ with the reduced kernel matrix $R$ given by (3.1.3) and solving the dual problem (2.2.20). Note that the reduced-rank kernel matrix $R$ can be expressed as $R = BB'$, where $B = K(A,\tilde A)K(\tilde A,\tilde A)^{-1/2}$ (or $B = K(A,\tilde A)$, if the scaling matrix discussed in the note above is taken to be the identity for simplicity). Thus, the dual problem (2.2.20) with kernel $R$ has the same solution as the dual linear SVM problem (2.1.13) with input data matrix $B$. That is, the dual RSVM can be treated as a linear SVM problem. We recommend using the Lagrangian SVM, an efficient linear SVM algorithm, for the RSVM solved in the dual space. Lagrangian SVM code consisting of only a couple of lines of commands is given in Appendix B. The dual solution has to be transformed back to the primal one, as the decision function is given in the primal space. By letting $v_i = y_i\alpha_i$, or equivalently $v = D\alpha$, the corresponding primal solution can be obtained via (3.1.4) and (3.1.5), and the decision function is then given by
$$f(x) = K(x,\tilde A)\tilde v + b, \quad \text{where } \tilde v = (\text{scaling matrix})^{-1}K(\tilde A,A)v,$$
and the scaling matrix is either $K(\tilde A,\tilde A)$ or the identity. A small sketch of the reduced kernel and its Nyström approximation is given below.
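The following Matlab sketch builds the reduced kernel $K(A,\tilde A)$ from a random subset and its Nyström approximation (3.1.3). It reuses the gaussian_kernel helper sketched in Section 2.2 and assumes A and gamma are already defined; the small jitter added before inversion is our own numerical safeguard, not part of the text.

```matlab
% Sketch: reduced kernel K(A, Atilde) and the Nystrom approximation (3.1.3).
ntilde = 30;                                  % reduced set size, ntilde << n
idx    = randperm(size(A,1), ntilde);         % random subset of the training rows
Atilde = A(idx, :);

Ktilde = gaussian_kernel(A, Atilde, gamma);   % n-by-ntilde reduced kernel
Kss    = gaussian_kernel(Atilde, Atilde, gamma);
Kss    = Kss + 1e-10*eye(ntilde);             % jitter so that the inverse is stable

R = Ktilde * (Kss \ Ktilde');                 % Nystrom approximation of K(A,A)

% Mapping a dual solution v back to the primal coefficients vtilde, as in (3.1.5):
% vtilde = Kss \ (Ktilde' * v);
```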
3.1.2 RSVM with optimal basis

RSVM using an optimal basis can be carried out through the singular value decomposition (SVD). One may start with the full kernel matrix $K$, or a certain intermediate reduced kernel matrix $\tilde K$, and look for an optimal basis in the column space of $\tilde K$ for further column-rank reduction. Assume that we start with $\tilde K = K(A,\tilde A)$, where $\tilde A \subseteq A$. Let the SVD of $\tilde K$ be given by
$$\tilde K = P\Lambda G',$$
where $P$ and $G$ are orthogonal matrices of sizes $n$ and $\tilde n$ respectively, and $\Lambda$ is a diagonal matrix of size $n\times\tilde n$. (As only the leading right singular vectors are needed, we can extract $G$ from the leading eigenvectors of the smaller square matrix $\tilde K'\tilde K$ to save computing time. See the SVD section in Appendix A.)

Primal RSVM using an optimal basis works by further reducing the column rank through the leading right singular vectors of $\tilde K$. Let $\tilde G = [g_1,\ldots,g_{\tilde m}]$ consist of the $\tilde m$ leading right singular vectors. Projecting the rows of $\tilde K$ along these $\tilde m$ right singular vectors leads to $\tilde K\tilde G$, which is of size $n\times\tilde m$. We use $\tilde K\tilde G$ as our new reduced kernel. The decision function is then given by
$$f(x) = K(x,\tilde A)\tilde G\tilde v + b, \qquad (3.1.7)$$
where $\tilde v$ and $b$ are solved from (3.1.1) with $\tilde K$ replaced by the new reduced kernel $\tilde K\tilde G$.

Dual RSVM using an optimal basis works by replacing the full kernel matrix $K$ with the reduced kernel matrix $R$ given by
$$R := \tilde K\tilde G\big(\tilde G'(\text{scaling matrix})\tilde G\big)^{-1}\tilde G'\tilde K', \qquad (3.1.8)$$
and then solving the dual problem (2.2.20), where $\tilde K$ can be the full kernel or an intermediate reduced kernel $K(A,\tilde A)$, and the scaling matrix is $K(\tilde A,\tilde A)$ (or can be replaced by a simple identity matrix). (Again, $R$ can be expressed as $R = BB'$, where $B = \tilde K\tilde G\{\tilde G'K(\tilde A,\tilde A)\tilde G\}^{-1/2}$, and the dual RSVM can be treated as a linear SVM problem with input data matrix $B$.) By letting $v = D\alpha$, the corresponding primal solution can be obtained by solving
$$\min_{\tilde v}\ \|Rv - \tilde K\tilde G\tilde v\|_2^2, \qquad (3.1.9)$$
which leads to $\tilde v = \big(\tilde G'(\text{scaling matrix})\tilde G\big)^{-1}\tilde G'\tilde K'v$. The decision function is then given by $f(x) = K(x,\tilde A)\tilde G\tilde v + b$. A small sketch of extracting the optimal basis is given below.
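A Matlab sketch of the optimal-basis reduction: extract the leading right singular vectors of the reduced kernel $\tilde K$ (here via svd, though eig on $\tilde K'\tilde K$ works as noted in the text) and form the new reduced kernel $\tilde K\tilde G$. Variable names follow the earlier sketches and are our own assumptions.

```matlab
% Sketch: optimal basis for RSVM via the leading right singular vectors of Ktilde.
mtilde = 10;                             % number of leading right singular vectors kept
[~, ~, G] = svd(Ktilde, 'econ');         % right singular vectors of the n-by-ntilde Ktilde
Gtilde = G(:, 1:mtilde);                 % ntilde-by-mtilde optimal basis

Knew = Ktilde * Gtilde;                  % new n-by-mtilde reduced kernel, used in (3.1.1)

% For a new point x (a 1-by-p row), the decision function (3.1.7) is evaluated as
% f = gaussian_kernel(x, Atilde, gamma) * Gtilde * vtilde + b;
```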
3.1.3 Some remarks

Many other kernel-based algorithms face the same problems (P1)-(P3) stated above. Several authors have suggested the use of low-rank approximations to the full kernel matrix for very large problems [32, 46, 54]. These low-rank approximations all use a thin rectangular matrix $K(A,\tilde A)$ consisting of a subset of $\tilde n$ ($\ll n$) columns drawn from the full kernel matrix $K$. Lee and Mangasarian [32] and Williams and Seeger [54] suggest picking the subset randomly and uniformly over all columns. Lee and Huang [34] give a further theoretical study of the random subset approach in terms of some statistical optimality properties. Smola and Schölkopf [46] consider finding an optimal subset. Since finding an optimal subset is a combinatorial problem involving $\binom{n}{\tilde n}$ possibilities, and an exhaustive search is too expensive to carry out, they use a probabilistic speedup algorithm. The trick they use is: first draw a random subset of a fixed size and then pick the best basis (column) from this set; the search goes on until the stopping criterion is reached. The optimal basis that we introduced above consists of the leading right singular vectors of $\tilde K$ instead of mere columns of $\tilde K$. For a small-sized problem we may start with $\tilde K$ the full kernel, while for a medium- to large-sized problem we may start with an intermediate reduced kernel using a random subset, and then extract the leading right singular vectors for further column-rank reduction.

For problems working in the primal space, e.g., the proximal SVM [19], the least-squares SVM [48, 49], the kernel Fisher discriminant [42, 41], kernel principal component analysis [27] and kernel sliced inverse regression [56], the random subset method works straightforwardly by replacing the full kernel matrix with the reduced kernel and also by cutting down the corresponding number of parameters in the decision function. For problems solved in the dual space, we form the approximate kernel matrix $R = K(A,\tilde A)(\text{scaling matrix})^{-1}K(\tilde A,A)$, or $R = \tilde K\tilde G\big(\tilde G'(\text{scaling matrix})\tilde G\big)^{-1}\tilde G'\tilde K'$, to replace the full matrix $K$, where the scaling matrix is given by $K(\tilde A,\tilde A)$ or the identity. Next the dual solution has to be transformed to the primal form for the expression of the decision function. To obtain the primal solution $\tilde v$, we solve the least squares problem (3.1.4) or (3.1.9), where $v$ is the dual solution.
Homework 2. (1) Do a numerical study of the quality of the reduced kernel approximation based on its spectral behavior. Use the Wine Recognition data set to build the full kernel matrix and the reduced kernel matrices, by random subset and by optimal basis respectively. (2) Analyze the Wine Recognition data set using both the full and the reduced kernel SVM approaches with the Gaussian kernel. (Choose your own SVM algorithm: LIBSVM (or BSVM, a companion to LIBSVM for large data problems), SSVM or another.)

Solution: For your reference, here is the solution using the Iris and Tree data sets. The spectral behavior of the reduced kernel approximation
$$R = K(A,\tilde A)K(\tilde A,\tilde A)^{-1}K(\tilde A,A)$$
can be measured via a few quantities, such as the maximum difference of the eigenvalues and the relative difference of the traces of the full kernel matrix $K(A,A)$ and $R$, defined as Maxi-diff and Rel-diff $= \mathrm{Trace}(K - R)/\mathrm{Trace}(K)$, respectively. The result is summarized in Table 3.1. Moreover, let $\lambda_k$ and $\tilde\lambda_k$ be the $k$th eigenvalues of $K$ and $R$. The term $(\lambda_k - \tilde\lambda_k)$ against $k$ is plotted in Figure 3.1(a); Figures 3.1(b) and 3.1(c) together present the eigenvalues of $K$ and $R$ correspondingly. From these figures, it seems that a larger decay rate of the eigenvalues gives a better quality of approximation. Finally, instead of the total of 150 eigenvalues, only the first 50 leading ones are considered here, as their sum is large enough (> 99% of the total).

In Table 3.2, numerical comparisons between the full kernel and the reduced kernel from a random subset are conducted, separately through SSVM and LIBSVM. The training error and validation error (with 5-fold cross-validation) are reported for SSVM. LIBSVM provides the validation accuracy and the number of support vectors, due to its dual formulation. It can be seen that, though using a simpler model, RSVM has only slightly larger errors than the full kernel in training and in validation.

(Further tasks: a. use "hibiscus" to tune the best parameters separately for SSVM and LIBSVM; b. include LSVM in the comparison?) $\Box$
Table 3.1: Spectrum comparison on the full kernel K(A,A) versus its Nyström approximation R.

Data set & size (full, reduced) | Largest eigenvalue of K(A,A) | Largest eigenvalue of R | Maxi-diff of eigenvalues | Rel-diff of traces
Iris (150, 30)                  | 40.8754                      | 40.5268                 | 1.3223                   | 0.1083

Figure 3.1: Spectrum comparison on full kernel vs. reduced kernel.

Table 3.2: The Gaussian kernel is used in all tests. The width parameter is gamma = 0.1 and the tuning parameter is C = 10.

Iris data set
Model (size)  | SSVM training error | SSVM validation error | LIBSVM validation error | # of SVs
Full (150)    | 0.033               | 0.046                 | 0.03333                 | 32
Reduced (30)  | 0.058               | 0.073                 | 0.03333                 | 32

Tree data set
Model (size)  | SSVM training error | SSVM validation error | LIBSVM validation error | # of SVs
Full (700)    | 0.13405             | 0.14676               | 0.13143                 | 236
Reduced (69)  | 0.14724             | 0.14676               | 0.14000                 | 222
3.2 Uniform design for SVM model selection

The problem of choosing a good parameter setting for better generalization ability is what we call model selection here. It is desirable to have an effective and automatic scheme for it. In particular, for people who are not familiar with parameter tuning in SVMs, an automatic and data-driven scheme is helpful. To develop a model selection scheme, we need to set up

- a performance measure, and
- a search scheme to look for the optimal performance point.

One of the most commonly used performance measures is cross-validation (CV) (Stone, 1974), either k-fold or leave-one-out. Both require that the learning engine be trained multiple times in order to obtain a performance measure for each parameter combination. As for the search scheme, one standard method is a simple grid search over the parameter domain of interest. It is obvious that an exhaustive grid search cannot effectively perform the task of automatic model selection, due to its high computational cost. There are also gradient-based approaches in the literature (Chapelle et al. [6]; Larsen et al. [30]; Bengio [3]). Although the gradient-based methods show an impressive gain in time complexity, they have a great chance of falling into bad local minima, especially for SVM model selection, since the objective function, the CV-based performance measure, has a low degree of regularity. Figure 3.2 plots the 5-fold average test set accuracy for three publicly available data sets, Banana, Waveform and Splice, as three-dimensional surfaces, where the x-axis and the y-axis are $\log_2 C$ and $\log_2\gamma$, respectively, and the z-axis is the 5-fold average test accuracy. Each mesh point in the $(x,y)$-plane stands for a parameter combination, and the z-axis indicates the model performance measure.

As an alternative, we will introduce a nested uniform design (UD) methodology for model selection, which can be regarded as an economical modification replacing grid search. The lattice grid points are intended to represent uniform points in a parameter domain; however, they are not uniform enough. A much better uniformity scheme, called the UD, is introduced below.
Figure 3.2: Three examples of the search space of model selection.

3.2.1 Methodology: 2-staged UD for model selection

The uniform experimental design is a kind of space-filling design that can be used for computer and industrial experiments. The UD seeks design points that are uniformly scattered over the experimental domain. Suppose there are $s$ parameters of interest over a domain $C^s$. The goal here is to choose a set of $m$ points $P_m = \{\theta_1,\ldots,\theta_m\} \subset C^s$ such that these points are uniformly scattered over $C^s$. The search for UD points is a hard problem involving number theory and quasi-Monte Carlo methods (Niederreiter [44]; Fang and Wang [17]). We will simply refer to the UD tables on the UD-web given below. To implement UD-based model selection, follow these steps:

1. Choose a parameter search domain and determine a suitable number of levels for each parameter (or factor, in design terminology).
2. Choose a suitable UD table to accommodate the number of parameters and levels. This can easily be done by visiting the UD-web: http://www.math.hkbu.edu.hk/UniformDesign
3. From the UD table, randomly determine the run order of the experiments and conduct the performance evaluation of each parameter combination in the UD.
4. Fit the SVM model. This step is a knowledge discovery step from the built model; that is, we find the combination of parameter values that maximizes the performance measure.
5. Repeat the whole procedure one more time in a smaller search domain centered at the optimal point found at the previous stage.
The automatic UD-based model selection for SVMs [28] is implemented in Matlab, named "hibiscus", in the SSVM toolbox available at http://dmlab1.csie.ntust.edu.tw/downloads/. Hibiscus focuses on selecting the regularization parameter $C$ and the Gaussian kernel width parameter $\gamma$. It first determines a two-dimensional search box in the parameter space, which automatically scales the distance factor in the Gaussian kernel. Next, a 2-staged UD procedure is used for parameter selection within the search box. At the first stage, it carries out a crude search for a highly likely candidate region of the global optimum. At the second stage, it confines itself to a finer search within a smaller region centered at the optimal point found at the first stage. At each stage, a UD is used to replace the lattice grid as trial points. The method of nested UDs is not limited to 2 stages and can be applied in a sequential manner. Two examples of the UD scheme are shown in Figures 3.3 and 3.4, and a small sketch of the nested search loop is given below.

Figure 3.3: The nested UD model selection with a 13-point UD at the first stage and a 9-point UD at the second stage.

Figure 3.4: The nested UD model selection with a 9-point UD at the first stage and a 5-point UD at the second stage.
Chapter 4
More on Kernel-Based Learning Algorithms

Besides the various variants of SVMs, there are other kernel-based learning algorithms. Many popular classical statistical data analysis tools are linear methods. For instance, principal components analysis (PCA) looks for an orthogonal decomposition of a sample space, say $\mathbb{R}^p$, into several one-dimensional linear subspaces ranked by the importance of their contribution to the data variation. The search for influential linear subspaces, or factors, is an important and frequently encountered task. PCA has long been a successful tool for finding such linear subspaces or factors. However, it is not able to extract important nonlinear features. This chapter focuses on extending some classical statistical linear methods to the nonlinear setting via the kernel trick introduced earlier. We will include the following procedures: PCA, sliced inverse regression (SIR), canonical correlation analysis (CCA), Fisher linear discriminant analysis (FLDA), least-squares regression and $\epsilon$-insensitive regression.

4.1 Kernel principal components analysis

As briefly mentioned above, the classical PCA is not capable of finding nonlinear features, and there is a need for a nonlinear extension. Nonlinear PCA, e.g., functional PCA and curve PCA, is known in the statistical literature; however, these are not our aim here. We restrict ourselves to the nonlinear extension via the kernel trick only.
4.1.1 Classical PCA

In the original data space $\mathcal{X} \subset \mathbb{R}^p$, we are interested in finding successive one-dimensional directions which contain "the maximum information" about the data. Let $u$ be the first such direction, scaled to unit length. The data projections along the direction $u$ are given by $Au$. Here we adopt the variance as the criterion for information content. Then, we look for $u$ maximizing the following objective function:
$$\max_{u\in\mathbb{R}^p}\ \mathrm{var}(Au) \quad \text{subject to} \quad u'u = 1. \qquad (4.1.1)$$
This problem is equivalent to
$$\max_{u\in\mathbb{R}^p}\ u'\Sigma u \quad \text{subject to} \quad u'u = 1, \quad \text{where $\Sigma$ is the covariance of the data $A$.}$$
Applying Lagrange theory, the Lagrangian function is given by
$$L(u,\alpha) = u'\Sigma u - \alpha(u'u - 1), \quad u\in\mathbb{R}^p,\ \alpha\in\mathbb{R}.$$
Taking the derivative with respect to $u$ and setting it to zero, we have
$$2\Sigma u - 2\alpha u = 0 \ \Longrightarrow\ \Sigma u = \alpha u. \qquad (4.1.2)$$
That is, $u$ is an eigenvector (of unit length) and $\alpha$ is the associated eigenvalue. Plugging (4.1.2) back into the Lagrangian function, we have $u'\Sigma u = \alpha$. Therefore, $u$ is obtained by finding the eigenvector associated with the leading eigenvalue $\alpha$; denote them by $u_1$ and $\alpha_1$, respectively. For the second principal component, we again look for $u$ solving the same problem (4.1.1), but with the extra constraint that $u$ be orthogonal to $u_1$. The Lagrangian then becomes
$$L(u,\alpha,\beta) = u'\Sigma u - \alpha(u'u - 1) - \beta u_1'u. \qquad (4.1.3)$$
By a similar procedure, we can find the second principal component, and successive components sequentially.

Note that PCA first centers the data and then rotates the axes to line up with the directions of highest data variation. The action of rotating and lining up with the directions of highest variance is like finding a new coordinate system in which to record the data. Data recorded in the new system have a simple diagonal structure in their covariance matrix, with descending diagonal elements ranked by the importance (contribution to data variance) of the new coordinate axes. A small numerical sketch of classical PCA is given below.
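The following generic Matlab sketch (not the toolbox code) carries out classical PCA on the data design matrix A: center the data, form the sample covariance matrix, and keep the leading eigenvectors.

```matlab
% Sketch: classical PCA of the n-by-p data design matrix A.
[n, p] = size(A);
Ac    = A - repmat(mean(A, 1), n, 1);      % center the data
Sigma = (Ac' * Ac) / n;                    % sample covariance matrix

[U, Lam] = eig(Sigma);                     % eigenvectors / eigenvalues of Sigma
[lam, order] = sort(diag(Lam), 'descend'); % rank directions by explained variance
U = U(:, order);

r = 2;                                     % keep the r leading principal directions
scores = Ac * U(:, 1:r);                   % principal component scores (projections)
```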
Homework 3. You have seen the kernel trick in the SVM. Apply it to PCA and think about how KPCA should be formulated.

Solution: As in classical PCA, first we have to center the input data $Z$ in the feature space; that is, $\big(I_n - \frac{\mathbf{1}_n\mathbf{1}_n'}{n}\big)Z$. The covariance matrix is then of the form $\frac{1}{n}Z'\big(I_n - \frac{\mathbf{1}_n\mathbf{1}_n'}{n}\big)Z$, and the optimization problem can be formulated as
$$\max_{u\in\mathcal{Z}}\ u'Z'\Big(I_n - \frac{\mathbf{1}_n\mathbf{1}_n'}{n}\Big)Zu \quad \text{subject to} \quad u'u = 1. \qquad (4.1.4)$$
The Lagrangian function is given by
$$L(u,\alpha) = u'Z'\Big(I_n - \frac{\mathbf{1}_n\mathbf{1}_n'}{n}\Big)Zu - \alpha(u'u - 1).$$
Setting $\partial L/\partial u = 0$ gives $Z'\big(I_n - \frac{\mathbf{1}_n\mathbf{1}_n'}{n}\big)Zu = \alpha u$, which implies that $u$ is in the column space of $Z'$, i.e., $u = Z'v$ for some $v\in\mathbb{R}^n$. Plugging this into the function above, the optimization problem becomes
$$\max_{v\in\mathbb{R}^n}\ v'ZZ'\Big(I_n - \frac{\mathbf{1}_n\mathbf{1}_n'}{n}\Big)ZZ'v \quad \text{subject to} \quad v'ZZ'v = 1. \qquad (4.1.5)$$
By the kernel trick $ZZ' = K$, and also assuming $K$ nonsingular, we can solve this problem by finding the leading eigenvectors $v$ of the following eigenvalue problem:
$$\Big(I_n - \frac{\mathbf{1}_n\mathbf{1}_n'}{n}\Big)Kv = \alpha v, \qquad (4.1.6)$$
with the normalization $v'Kv = 1$, or equivalently $\alpha v'v = 1$. The normalization changes the scales of the principal components to have unit length in the sense of $v'Kv = 1$, but it does not change the orientation of the principal components. $\Box$
4.1.2 Kernel PCA
Full kernel PCA.From (4.1.6) in Homework 3 we see that the KPCA works by
extracting the leading eigen-components of the centered kernel matrix.Note that the
eigenvector,according to (4.1.5),should be normalized by
v
0
Kv = 1;or equivalently v
0
µ
I
n
¡
1
n
1
0
n
n

Kv = 1:
(4.1.7)
As (I
n
¡
1
n
1
0
n
n
)Kv = ®v,the normalization can be expressed as ®v
0
v = 1.In summary,
the KPCA solves the eigenvalue-decomposition problem for (I
n
¡
1
n
1
0
n
n
)K.Denote
its largest eigenvalue by ®
1
and its associated eigenvector by v
1
.We can sequentially
¯nd the second and the third principal components,etc.From (4.1.7) we have that
v
j
,j = 1;2;:::,should be normalized by v
0
j
Kv
j
= 1,or equivalently by ®
j
v
0
j
v
j
= 1.
32 CHAPTER 4.MORE ON KERNEL-BASED LEARNING ALGORITHMS
From Homework 3 we see that $Z'v_j$ is the $j$th principal component in the feature space $\mathcal{Z}$. For an $x \in \mathbb{R}^p$ and its feature image $z = \Phi(x) \in \mathcal{Z}$, the projection of $\Phi(x)$ along the $j$th component $Z'v_j$ is given by
$$\Phi(x)'Z'v_j = z'[z_1, z_2, \ldots, z_n]v_j = K(x, A)v_j. \tag{4.1.8}$$
Therefore, the projection of the feature image $z = \Phi(x)$ onto the leading $r$ eigen-components in $\mathcal{Z}$ is given by
$$[K(x, A)v_1, \ldots, K(x, A)v_r] \in \mathbb{R}^r.$$
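The following is a minimal sketch of full KPCA as just described: center the kernel matrix, extract and rescale the leading eigenvectors per (4.1.6)-(4.1.7), and project a new point through (4.1.8). The Gaussian kernel and its width gamma are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def gaussian_kernel(X, A, gamma=0.5):
    """Gaussian kernel matrix K(X, A); X is m x p and A is n x p."""
    d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kpca_fit(A, r, gamma=0.5):
    """Leading r KPCA directions v_1, ..., v_r of the centered kernel matrix."""
    n = len(A)
    K = gaussian_kernel(A, A, gamma)
    C = np.eye(n) - np.ones((n, n)) / n          # centering matrix I_n - 1_n 1_n'/n
    alphas, V = np.linalg.eig(C @ K)             # eigenvalue problem (4.1.6)
    idx = np.argsort(alphas.real)[::-1][:r]      # keep the r largest eigenvalues
    alphas, V = alphas.real[idx], V.real[:, idx]
    V = V / np.sqrt(alphas)                      # normalization alpha_j v_j'v_j = 1
    return V

def kpca_project(x_new, A, V, gamma=0.5):
    """Project new points (rows of x_new) via (4.1.8): K(x, A) v_j."""
    return gaussian_kernel(x_new, A, gamma) @ V
```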
Reduced kernel PCA starts by randomly sampling a portion of the columns of the full kernel matrix. Next, by an SVD it extracts the leading right singular vectors $\tilde v$ of the reduced kernel matrix:
$$\left(I_n - \frac{1_n 1_n'}{n}\right)K(A, \tilde A)\tilde v = \alpha\tilde v, \quad \text{normalized so that } \alpha\tilde v'\tilde v = 1. \tag{4.1.9}$$
The feature image $\Phi(x)$ projected along the leading $r$ components is then given by
$$\left[K(x, \tilde A)\tilde v_1, \ldots, K(x, \tilde A)\tilde v_r\right] \in \mathbb{R}^r.$$
As we are interested in extracting only the right singular vectors, this can be done by solving the eigenvalue problem of $\tilde K'\left(I_n - \frac{1_n 1_n'}{n}\right)\tilde K$, whose eigenvalues and eigenvectors are $\alpha_j^2$ and $\tilde v_j$, respectively. That is, if we work directly on the eigenvalue problem of $\tilde K'\left(I_n - \frac{1_n 1_n'}{n}\right)\tilde K$ and denote its eigenvalues by $\beta_j$, then the normalization should be $\sqrt{\beta_j}\,\tilde v_j'\tilde v_j = 1$.
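A possible sketch of reduced kernel PCA along these lines, reusing gaussian_kernel from the KPCA sketch above; the subset size m and the kernel width are illustrative choices.

```python
import numpy as np

def reduced_kpca(A, r, m=100, gamma=0.5, seed=0):
    """Leading r reduced-KPCA directions from a random column subset of the kernel."""
    rng = np.random.default_rng(seed)
    n = len(A)
    A_tilde = A[rng.choice(n, size=min(m, n), replace=False)]   # random subset of the data
    K_tilde = gaussian_kernel(A, A_tilde, gamma)                # n x m reduced kernel
    C = np.eye(n) - np.ones((n, n)) / n                         # centering matrix
    _, s, Vt = np.linalg.svd(C @ K_tilde, full_matrices=False)  # leading right singular vectors
    V = Vt[:r].T / np.sqrt(s[:r])                               # normalization alpha v'v = 1
    return A_tilde, V
```

A new point $x$ is then projected as gaussian_kernel(x, A_tilde, gamma) @ V, matching the reduced-kernel projection formula above.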
4.2 Kernel sliced inverse regression
The PCA approach can be used as an unsupervised dimension reduction tool. Often the input data come with class labels, and supervised dimension reduction that uses the category information becomes possible. Sliced inverse regression (SIR) [35] was originally proposed for dimension reduction through inverse regression. We will depart from the original inverse regression spirit for the time being and focus on a classification framework, which is easier to grasp for a first-time reader who is not a statistics major.
4.2.1 Classical SIR
Apart from its statistical meaning, SIR, technically speaking, solves a generalized eigenvalue-eigenvector decomposition problem. We have seen the decomposition problem for a covariance matrix in the PCA section. In a classification problem with given data $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^p$ and $y_i \in \{1, \ldots, k\}$, here are the steps of SIR for finding the leading directions that carry "maximal information" about class membership (a code sketch follows the list):
1. Slice (classify) the input data $A = [x_1'; \ldots; x_n']$ into $k$ slices according to the corresponding labels $y$.

2. For each slice $s = 1, \ldots, k$, compute the slice mean, denoted by $\bar x_s$. Also compute the grand mean, denoted by $\bar x$.

3. Use the slice mean $\bar x_s$ as the representative for each slice to form a new input data matrix. That is, replace each individual input point $x_i$ by its slice mean, giving a new input data matrix denoted by $A_S$ (still of size $n \times p$).

4. Compute $\Sigma_b = \mathrm{Cov}(A_S)$, called the between-slice (or between-class) covariance matrix:
$$\Sigma_b = \frac{1}{n}\sum_{s=1}^{k} n_s (\bar x_s - \bar x)(\bar x_s - \bar x)',$$
where $n_s$ is the number of data points in the $s$th slice. Also compute the sample covariance matrix of $x$: $\Sigma = \sum_{i=1}^{n}(x_i - \bar x)(x_i - \bar x)'/n$.

5. Solve the generalized eigenvalue problem of $\Sigma_b$ with respect to $\Sigma$ (assume $\Sigma$ has rank $p$). That is, find the leading, say $r$ ($r \le k-1$), eigenvalues $\alpha$ and eigenvectors $u$ for the problem
$$\Sigma_b u_i = \alpha_i \Sigma u_i \quad \text{subject to } u_i'\Sigma u_i = 1 \text{ and } u_j'\Sigma u_i = 0 \text{ for } 1 \le j < i \le r.$$

6. Collect the leading $r$ SIR directions into a matrix $U = [u_1, \ldots, u_r]$. $U$ is orthogonal with respect to $\Sigma$, i.e., $U'\Sigma U = I_r$. Project the data $x$ along the SIR directions, i.e., $U'x = [u_1'x, \ldots, u_r'x]$. We call $u_1'x$ the first SIR variate, $u_2'x$ the second SIR variate, and so on.
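Here is the promised sketch of Steps 1-6, assuming X is the n x p input matrix and y holds the k class labels; scipy.linalg.eigh handles the generalized eigenvalue problem in Step 5.

```python
import numpy as np
from scipy.linalg import eigh

def sir(X, y, r):
    """Classical SIR: leading r directions and the corresponding SIR variates."""
    n, p = X.shape
    xbar = X.mean(axis=0)                                   # grand mean
    Sigma = (X - xbar).T @ (X - xbar) / n                   # total covariance
    Sigma_b = np.zeros((p, p))                              # between-slice covariance
    for label in np.unique(y):
        Xs = X[y == label]                                  # one slice
        d = Xs.mean(axis=0) - xbar
        Sigma_b += len(Xs) * np.outer(d, d) / n
    # generalized eigenproblem Sigma_b u = alpha Sigma u, with u' Sigma u = 1
    alphas, U = eigh(Sigma_b, Sigma)                        # eigenvalues in ascending order
    U = U[:, ::-1][:, :r]                                   # leading r SIR directions
    return U, X @ U                                         # directions and SIR variates
```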
Remark 2
For an ordinary eigenvalue problem, the decomposition is done with respect to an identity matrix. For a generalized eigenvalue problem, the decomposition is done with respect to a strictly positive definite matrix. PCA solves an ordinary eigenvalue problem, while SIR solves a generalized one.
From the steps above, we see that SIR is designed to find the leading directions with maximal discriminant power for class information. Here the discriminant ability is judged by the between-class variance scaled by the total variance. Before we formally introduce the kernel generalization of SIR, we give a quick pictorial view of the effect of kernel SIR (KSIR) compared with the classical SIR. Figure 4.1 is the scatter plot of the first two SIR variates for the Pendigits data, and Figure 4.2 is the scatter plot of the first two KSIR variates.
Figure 4.1: SIR view over the 1st and 2nd variates.
Homework 4
Prescribe a kernelized SIR algorithm and try it out on the Wine Recognition data set available at the UCI Repository. (Hint: Replace $A$ by $K$ and proceed the same way as in the classical SIR. However, since the covariance matrix of $K$ is singular, add a small diagonal matrix $\delta I$ to $\mathrm{Cov}(K)$ for numerical stability.)
4.2.2 Kernel SIR
SIR is designed to find a linear dimension reduction subspace by extracting the leading variance components of the sliced means, scaled by the overall variance across all classes. Define $N(\Sigma_b) = \{u : \Sigma_b u = 0\}$, the null column space of $\Sigma_b$. SIR is unable to find directions that reside in this null space $N$, nor directions that make a small angle with the null space. SIR is also limited to linear directions and cannot extract nonlinear features, as you have seen in the Pendigits data demonstration. Kernelized SIR [56] provides an easy remedy for the difficulties encountered by the classical SIR. A nonlinear, parsimonious characterization of the data is the key goal of the kernel extension.
Figure 4.2: KSIR view over the 1st and 2nd variates.
Full kernel SIR. As before, in looking for a nonlinear extension we first map the original input data to the feature space $\mathcal{Z}$ and then nominally carry out the SIR procedure on the feature data in $\mathcal{Z}$. That is, we look for the leading directions in the feature space that have the highest between-slice variance with respect to the variance of the whole data set. In other words, we intend to solve the following optimization problem:
$$\max_{u \in \mathcal{Z}}\; u'Z'H'\left(I - \frac{1_n 1_n'}{n}\right)HZu \quad \text{subject to } u'Z'\left(I - \frac{1_n 1_n'}{n}\right)Zu = n, \tag{4.2.10}$$
where $H$ is a block-diagonal matrix with the $s$th block given by $B_s = \frac{1}{n_s}1_{n_s}1_{n_s}'$, with $n_s$ the number of observations in the $s$th slice. (Here we have implicitly assumed that the data are already sorted by slice.) Note that $H$ is symmetric and idempotent, i.e., $H^2 = H$. It is a projection matrix, which maps each feature data point to the slice mean of its corresponding class. As in the KPCA case, we confine our search for eigenvectors to the column space of $Z'$, i.e., $u = Z'v$ for some $v \in \mathbb{R}^n$. Then, by the kernel trick $K = ZZ'$, the optimization problem (4.2.10) becomes
$$\max_{v}\; v'ZZ'H\left(I_n - \frac{1_n 1_n'}{n}\right)HZZ'v \quad \text{subject to } v'ZZ'\left(I_n - \frac{1_n 1_n'}{n}\right)ZZ'v = n, \tag{4.2.11}$$
which is equivalent to
$$\max_{v}\; v'\,\mathrm{Cov}(HK)\,v \quad \text{subject to } v'\,\mathrm{Cov}(K)\,v = 1. \tag{4.2.12}$$
For successive KSIR directions, extra orthogonality constraints are applied. The KSIR problem (4.2.12) is exactly the same as the classical SIR stated in Steps 1-6 above, except that the original input data matrix $A$ is replaced by the kernel data matrix $K$. Note that $\mathrm{Cov}(K)$ is singular, and the singularity causes numerical instability. An easy remedy is to add a small diagonal matrix $\delta I$ to the covariance matrix. That is, we solve the following problem:
$$\mathrm{Cov}(HK)\,v = \alpha\,\{\mathrm{Cov}(K) + \delta I\}\,v. \tag{4.2.13}$$
Let $V = [v_1, v_2, \ldots, v_r] \in \mathbb{R}^{n \times r}$ consist of the $r$ leading KSIR directions and let $C(V)$ be the column space spanned by $V$. Again, we say that $\{v_1, \ldots, v_r\}$ are KSIR directions, $\{Kv_1, \ldots, Kv_r\}$ are KSIR variates, and $C(V)$ is the KSIR dimension reduction subspace.
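A minimal sketch of full KSIR per (4.2.13): run the SIR steps on the kernel matrix K with a small ridge delta added to Cov(K). It reuses gaussian_kernel from the KPCA sketch; gamma and delta are illustrative choices.

```python
import numpy as np
from scipy.linalg import eigh

def ksir(A, y, r, gamma=0.5, delta=1e-3):
    """Full KSIR: leading r directions v_j and the KSIR variates K v_j."""
    n = len(A)
    K = gaussian_kernel(A, A, gamma)
    kbar = K.mean(axis=0)
    Cov_K = (K - kbar).T @ (K - kbar) / n                  # Cov(K)
    Cov_HK = np.zeros((n, n))                              # between-slice covariance of K
    for label in np.unique(y):
        d = K[y == label].mean(axis=0) - kbar
        Cov_HK += (y == label).sum() * np.outer(d, d) / n
    # generalized eigenproblem (4.2.13), with ridge delta*I for numerical stability
    alphas, V = eigh(Cov_HK, Cov_K + delta * np.eye(n))
    V = V[:, ::-1][:, :r]                                  # leading r KSIR directions
    return V, K @ V                                        # directions and KSIR variates
```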
Reduced kernel SIR works directly on $\tilde K = K(A, \tilde A)$ or $\tilde K = K(A, \tilde A)\tilde G$ as defined in (??). The RKSIR solves the following generalized eigenvalue problem:
$$\mathrm{Cov}(H\tilde K)\,\tilde v = \alpha\,\{\mathrm{Cov}(\tilde K) + \delta I\}\,\tilde v. \tag{4.2.14}$$
We then say that $\{\tilde v_1, \ldots, \tilde v_r\}$ are KSIR directions, $\{\tilde K\tilde v_1, \ldots, \tilde K\tilde v_r\}$ are KSIR variates, and $C(\tilde V)$ is the KSIR dimension reduction subspace, where $\tilde V = [\tilde v_1, \tilde v_2, \ldots, \tilde v_r]$. Some common suggestions for implementing the RKSIR are the following: for a small-sized problem, use the leading optimal bases based on the full kernel to form the reduced kernel; for a medium- to large-sized problem, either use a random subset to form the reduced kernel, or start with an intermediate random subset and then search for the leading optimal bases therein.
Combined with linear methods. The input space has been mapped to $\mathcal{Z}$, whose very high (often infinite) dimensionality provides enough flexibility and versatility for linear models therein. It is thus sufficient to consider only simple linear models in $\mathcal{Z}$ for statistical inference. For instance, KSIR can be combined with various linear SVM algorithms or the classical linear discriminant analysis for classification problems, and with various linear SVR algorithms or the least-squares linear regression method for regression problems. In other words, various linear algorithms can act on top of the KSIR variates for dimension reduction and data analysis.
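As one possible illustration of acting a linear method on top of the KSIR variates, the snippet below feeds the variates from the ksir sketch above into scikit-learn's LinearSVC; the arrays A_train, y_train, and A_test are hypothetical placeholders, and any linear SVM or LDA routine would serve equally well.

```python
from sklearn.svm import LinearSVC

# KSIR variates of the (hypothetical) training data, then a linear SVM on top
V, Z_train = ksir(A_train, y_train, r=9)          # r <= k - 1 KSIR directions
clf = LinearSVC().fit(Z_train, y_train)           # linear classifier in the reduced subspace

# project test points with the same kernel and KSIR directions, then classify
K_test = gaussian_kernel(A_test, A_train)         # use the same gamma as in ksir()
y_pred = clf.predict(K_test @ V)
```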
4.3 Kernel canonical correlation analysis
Canonical correlation analysis (CCA) studies the relation between two sets of variables. The classical CCA can describe only the linear relation between the two sets, while its kernel extension, KCCA, can be used to describe general nonlinear relations.
4.3.1 Classical CCA
Suppose the random variable $x$ of $p$ components can be partitioned into $x = [x^{(1)}; x^{(2)}]$ with $p_1$ and $p_2$ components, respectively. CCA aims to study the relation between $x^{(1)}$ and $x^{(2)}$ and eventually defines a pair of new coordinate systems in $\mathbb{R}^{p_1}$ and $\mathbb{R}^{p_2}$ which make the description of the "relation" between $x^{(1)}$ and $x^{(2)}$ as "simple" as possible. One way to study and simplify the relationship structure is through factorization of the covariance matrix. Let
$$X^{(1)} = \begin{bmatrix} x_1^{(1)\prime} \\ \vdots \\ x_n^{(1)\prime} \end{bmatrix},$$
which is the design matrix of the first set of variables, and let $X^{(2)} = [x_1^{(2)\prime}; \ldots; x_n^{(2)\prime}]$, which is the design matrix of the second set of variables. The sample covariance matrix of $X^{(i)}$ and $X^{(j)}$, $i, j = 1, 2$, is given by
$$\Sigma_{ij} = \frac{1}{n}X^{(i)\prime}\left(I_n - \frac{1_n 1_n'}{n}\right)X^{(j)}.$$
The CCA finds the pair of vectors $(u_1, v_1) \in \mathbb{R}^{p_1 + p_2}$, called the first pair of canonical variates, such that
$$\max_{u, v}\; u'\Sigma_{12}v \quad \text{subject to } u'\Sigma_{11}u = 1 \text{ and } v'\Sigma_{22}v = 1. \tag{4.3.15}$$
Successive pairs can be found by solving the same problem with extra orthogonality (uncorrelatedness) constraints. The classical CCA describes linear relations by reducing the correlation structure between these two sets of variables to the simplest possible form by means of linear transformations of $X^{(1)}$ and $X^{(2)}$. It finds pairs $(u_i, v_i) \in \mathbb{R}^{p_1 + p_2}$ in the following way. The first pair maximizes the correlation between $u_1'X^{(1)}$ and $v_1'X^{(2)}$ subject to the unit-variance constraints $\mathrm{Var}(u_1'X^{(1)}) = \mathrm{Var}(v_1'X^{(2)}) = 1$; the $k$th pair $(u_k, v_k)$, which is uncorrelated with the first $k-1$ pairs,¹ maximizes the correlation between $u_k'X^{(1)}$ and $v_k'X^{(2)}$, again subject to the unit-variance constraints. The sequence of correlations between $u_i'X^{(1)}$ and $v_i'X^{(2)}$ are called the canonical correlation coefficients of $X^{(1)}$ and $X^{(2)}$, $(u_i, v_i)$ is called the $i$th pair of canonical variates, and $u_i'X^{(1)}$ and $v_i'X^{(2)}$ are called the $i$th pair of canonical variables.
variables.The canonical variates
2
fu
i
g
p
1
i=1
and fv
i
g
p
2
i=1
can serve as new systems of
1
Being uncorrelated is in terms of u
0
i
§
11
u
k
= 0 and v
0
i
§
22
v
k
= 0 for all i = 1;:::;k ¡1.
2
Assume that p
1
· p
2
.After p
1
¡1 pairs of canonical variates,one cas always ¯ll up the remaining
orthogonal variates for both sets of variables.
38 CHAPTER 4.MORE ON KERNEL-BASED LEARNING ALGORITHMS
coordinate axes for the two components x
(1)
and x
(2)
.These new systems are simply
linear systems of the original ones.Thus,the classical CCA can only be used to de-
scribe linear relation.Via such linear relation it ¯nds only linear dimension reduction
subspace and linear discriminant subspace,too.
4.3.2 Kernel CCA
Kernel methods provide a convenient way for a nonlinear generalization of CCA. Form the kernel data
$$K^{(1)} = K_1(A^{(1)}, A^{(1)}) \quad \text{and} \quad K^{(2)} = K_2(A^{(2)}, A^{(2)}), \tag{4.3.16}$$
where $A^{(i)} = [x_1^{(i)\prime}; \ldots; x_n^{(i)\prime}]$ for $i = 1, 2$, and $K_1$ and $K_2$ are two positive definite kernel functions. The KCCA works by carrying out the classical CCA procedure on the kernel data $[K^{(1)}\ K^{(2)}]$. In summary, the KCCA procedure consists of two major steps:
(a) Transform the data points to a kernel representation as in (4.3.16).

(b) Run the classical CCA procedure on the kernel data matrix $[K^{(1)}\ K^{(2)}]$. Note that some sort of regularization is necessary here to solve the associated spectrum problem (4.3.15) of extracting the leading canonical variates and correlation coefficients. Here we use the reduced kernel technique again to cut down the problem size and to stabilize the numerical computation. Only partial columns are computed, forming reduced kernel matrices denoted by $\tilde K^{(1)}$ and $\tilde K^{(2)}$. The classical CCA procedure then acts on $[\tilde K^{(1)}\ \tilde K^{(2)}]$.³

As the KCCA is simply the classical CCA acting on kernel data, existing code from standard statistical packages is ready for use. In the example below we apply the Matlab m-file "canoncorr", which implements the classical CCA, to the kernel data.
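In the same spirit, here is a possible reduced-kernel KCCA sketch that simply feeds kernel data into the cca routine sketched in Section 4.3.1, reusing gaussian_kernel from the KPCA sketch; the subset size m and the kernel width are illustrative, and a small ridge on the covariance blocks may still be needed if they are ill-conditioned.

```python
import numpy as np

def kcca(A1, A2, r, m=300, gamma=0.5, seed=0):
    """Reduced-kernel KCCA: classical CCA applied to reduced kernel data."""
    rng = np.random.default_rng(seed)
    n = len(A1)
    idx = rng.choice(n, size=min(m, n), replace=False)    # shared random subset
    K1 = gaussian_kernel(A1, A1[idx], gamma)              # reduced kernel, first set
    K2 = gaussian_kernel(A2, A2[idx], gamma)              # reduced kernel, second set
    return cca(K1, K2, r)                                 # classical CCA on kernel data
```

Note that in Example 1 below the second block is not kernelized at all; the class-indicator matrix plays the role of $\tilde K^{(2)}$ directly.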
Example 1
We use the Pendigits data set from the UCI Repository for a visual demonstration of nonlinear discriminant analysis using KCCA. We use the 7494 training instances for explanatory purposes. Each instance has 16 input measurements (i.e., $x_j$ is 16-dimensional) and a corresponding class label $y_j$ from $\{0, 1, 2, \ldots, 9\}$. A Gaussian kernel with window width $(\sqrt{10S_1}, \ldots, \sqrt{10S_{16}})$ is used to prepare the kernel data, where the $S_i$'s are the coordinate-wise sample variances. A reduced kernel of size 300, equally stratified over the 10 digit groups, is used and serves as $\tilde K^{(1)}$ in step (b) of the KCCA procedure.
³ The reduced kernel $\tilde K^{(i)}$ can be formed either from a random subset, $\tilde K^{(i)} = K_i(A^{(i)}, \tilde A^{(i)})$, or from further optimal bases therein, $\tilde K^{(i)} = K_i(A^{(i)}, \tilde A^{(i)})\tilde G^{(i)}$.
We use $y_j$, the class labels, as our $K^{(2)}$ (no kernel transformation is involved). Precisely,
$$\tilde K^{(2)} = \begin{bmatrix} Y_1' \\ \vdots \\ Y_n' \end{bmatrix}_{n \times 10}, \qquad Y_j' = (0, \ldots, 1, 0, \ldots),$$
where $Y_j$ is a categorical variable for class membership: if $y_j = i$, $i = 0, 1, \ldots, 9$, then $Y_j$ has the entry 1 in the $(i+1)$th place and 0 elsewhere. Now we want to explore the relation between the input measurements and their associated class labels using CCA and KCCA. The training data are used to find the leading CCA- and KCCA-found variates. Next, 20 test samples from each digit group are drawn randomly from the test set.⁴ Scatter plots of the test data projected along the leading CCA-found variates (Figure 4.3) and the leading KCCA-found variates (Figure 4.4) are given below. Different classes are labeled with distinct digits. The superior discriminant power of KCCA over CCA is clear.
Figure 4.3: Scatter plot of pen-digits over CCA-found variates.
⁴ The test set has 3498 instances in total, with around 350 instances per digit on average. For clarity of the plots and to avoid excess ink, we use only 20 test points per digit.
Figure 4.4: Scatter plots of pen-digits over KCCA-found variates.
4.4 Kernel Fisher discriminant analysis
Fisher linear discriminant analysis (FLDA, or simply LDA) is a commonly used and time-honored tool for multiclass classification because of its simplicity and probabilistic outputs. Motivated by the active development of statistical learning theory [53] and the popular and successful use of various kernel machines [14, 15], a hybrid approach has emerged which combines the idea of feature mapping with the classical FLDA to prescribe a kernelized LDA algorithm.
4.4.1 Classical FLDA
For binary classification, FLDA finds an optimal normal direction of the separating hyperplane, $w'x + b = 0$, by maximizing the so-called Rayleigh coefficient:
$$\max_{w}\; J(w) \equiv \frac{w'\Sigma_b w}{w'\Sigma_w w}, \tag{4.4.17}$$
where $\Sigma_b$ and $\Sigma_w$ are the between- and within-class covariance matrices, respectively. The offset is determined by letting the hyperplane pass through the midpoint of the two class centroids, i.e., $b = -w'(\bar x_1 + \bar x_2)/2$, where $\bar x_j = \sum_{i \in I_j} x_i / n_j$ with $I_j$ the index set of the $j$th class.
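A minimal sketch of the binary case, using the standard closed-form maximizer of the Rayleigh coefficient, $w \propto \Sigma_w^{-1}(\bar x_1 - \bar x_2)$, together with the midpoint offset described above; the class labels 1 and 2, the function name, and the pooled-scatter estimate of $\Sigma_w$ are illustrative choices.

```python
import numpy as np
from scipy.linalg import solve

def flda_binary(X, y):
    """Binary FLDA: normal direction w and offset b of the separating hyperplane."""
    X1, X2 = X[y == 1], X[y == 2]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)             # class centroids
    Sw = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / len(X)  # within-class covariance
    w = solve(Sw, m1 - m2)                                # maximizer of the Rayleigh coefficient
    b = -w @ (m1 + m2) / 2                                # hyperplane through the midpoint
    return w, b                                           # classify a point x by sign(w'x + b)
```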
Consider a $k$-class problem based on training data $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathcal{X} \subset \mathbb{R}^p$ is a $p$-variate input measurement and $y_i \in \{1, \ldots, k\}$ indicates the corresponding class membership. With $k - 1 \le p$, the FLDA finds $k - 1$ canonical variates that are optimal in maximizing the between-class variation (for separating the classes) with respect to the within-class variation, and its decision boundaries are linear in these canonical variates. The FLDA canonical variates $w_j$, $j = 1, \ldots, k-1$, can again be determined by maximizing the Rayleigh coefficient given below:
$$\max_{w}\; J(w) \equiv \frac{w'\Sigma_b w}{w'\Sigma_w w}. \tag{4.4.18}$$
The optimality problem (4.4.18) of finding the FLDA canonical variates is a generalized eigenvalue decomposition problem. Data are then projected onto these $(k-1)$ canonical variates and class centroids are calculated in this $(k-1)$-dimensional subspace. A test point is assigned to the class with the closest class centroid based on the Mahalanobis distance. The Mahalanobis distance between two points $x$ and $u$ projected along the FLDA canonical variates is given by
$$d_M(x, u) = (x - u)'W'\Sigma^{-1}$$