blah - Springer

Oct 15, 2013 (3 years and 9 months ago)

Appendix: Mathematical description of the Machine Learning methods

Discriminant classifier

- The Linear Discriminant divides the feature space by a hyperplane decision surface that maximizes the ratio of between-class variance to within-class variance. Given a certain data vector $\mathbf{x}$, a discriminant function that is a linear combination of the components of $\mathbf{x}$ can be written as

$$g(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + w_0 \qquad (1)$$

where $\mathbf{w}$ is the weight vector and $w_0$ the bias or threshold weight. A two-category linear classifier implements the following decision rule: decide $\omega_1$ if $g(\mathbf{x}) > 0$ and $\omega_2$ if $g(\mathbf{x}) < 0$. Thus, $\mathbf{x}$ is assigned to $\omega_1$ if the inner product $\mathbf{w}^{T}\mathbf{x}$ exceeds the threshold $-w_0$, and to $\omega_2$ otherwise. If $g(\mathbf{x}) = 0$, the assignment is undefined.
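The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function names `linear_discriminant` and `decide`, and the weight values, are hypothetical and not part of the original text.

```python
def linear_discriminant(x, w, w0):
    """Evaluate g(x) = w . x + w0 for plain Python sequences."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def decide(x, w, w0):
    """Return 1 for class omega_1, 2 for omega_2, None when g(x) = 0."""
    g = linear_discriminant(x, w, w0)
    if g > 0:
        return 1
    if g < 0:
        return 2
    return None  # on the decision surface: assignment undefined


# Hypothetical weights chosen only for illustration.
w, w0 = [1.0, -1.0], 0.5
print(decide([2.0, 1.0], w, w0))  # g = 2 - 1 + 0.5 = 1.5
print(decide([0.0, 2.0], w, w0))  # g = -2 + 0.5 = -1.5
print(decide([0.5, 1.0], w, w0))  # g = 0.5 - 1 + 0.5 = 0
```

The third call lands exactly on the hyperplane, which is the undefined case mentioned in the text.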

- The Quadratic Discriminant divides the feature space by a hypersphere, hyperellipsoid or hyperhyperboloid decision surface. The linear discriminant function $g(\mathbf{x})$ can be written as

$$g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i \qquad (2)$$

where $d$ is the number of components of $\mathbf{x}$ and the coefficients $w_i$ are the components of the weight vector $\mathbf{w}$. By adding terms involving the products of pairs of components of $\mathbf{x}$, we obtain the quadratic discriminant function

$$g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j \qquad (3)$$

Since $x_i x_j = x_j x_i$, we can assume that $w_{ij} = w_{ji}$ with no loss in generality. Thus, the quadratic discriminant function has an additional $d(d+1)/2$ coefficients at its disposal with which to produce more complicated separating surfaces. The separating surface defined by $g(\mathbf{x}) = 0$ is a second-degree or hyperquadric surface.
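As a small illustration of the quadratic form, the sketch below evaluates the discriminant for a hand-picked set of coefficients whose zero set is a circle of radius 1. The function name and the coefficient values are assumptions made for the example.

```python
def quadratic_discriminant(x, w0, w, W):
    """g(x) = w0 + sum_i w_i x_i + sum_i sum_j w_ij x_i x_j."""
    d = len(x)
    linear = sum(w[i] * x[i] for i in range(d))
    quadratic = sum(W[i][j] * x[i] * x[j] for i in range(d) for j in range(d))
    return w0 + linear + quadratic


# Hypothetical coefficients: g(x) = x1^2 + x2^2 - 1, a circle of radius 1.
w0 = -1.0
w = [0.0, 0.0]
W = [[1.0, 0.0],
     [0.0, 1.0]]  # symmetric, w_ij = w_ji

print(quadratic_discriminant([0.0, 0.0], w0, w, W))  # inside the circle:  g < 0
print(quadratic_discriminant([2.0, 0.0], w0, w, W))  # outside the circle: g > 0
print(quadratic_discriminant([1.0, 0.0], w0, w, W))  # on the surface:     g = 0
```

With $d = 2$ the matrix of $w_{ij}$ has $d(d+1)/2 = 3$ free entries once symmetry is imposed.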

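- The Mahalanobis Discriminant employs the Mahalanobis distance

$$D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})} \qquad (4)$$

where $\mathbf{x}$ is the vector of the data, $\boldsymbol{\mu}$ is the centroid of a certain class, and $\Sigma$ is the covariance matrix of the data distribution, and assigns each datum $\mathbf{x}$ to the class whose centroid $\boldsymbol{\mu}$ minimizes $D_M(\mathbf{x})$.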
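A minimal sketch of this minimum-distance rule, restricted to two-dimensional data so that the covariance matrix can be inverted by hand. The helper names, the two centroids and the shared covariance matrix are all hypothetical.

```python
import math


def inverse_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis(x, mu, cov_inv):
    """sqrt((x - mu)^T Sigma^{-1} (x - mu)) for 2-D vectors."""
    diff = [x[0] - mu[0], x[1] - mu[1]]
    tmp = [cov_inv[0][0] * diff[0] + cov_inv[0][1] * diff[1],
           cov_inv[1][0] * diff[0] + cov_inv[1][1] * diff[1]]
    return math.sqrt(diff[0] * tmp[0] + diff[1] * tmp[1])


def classify(x, centroids, cov_inv):
    """Return the label of the centroid with the smallest distance to x."""
    return min(centroids,
               key=lambda label: mahalanobis(x, centroids[label], cov_inv))


# Hypothetical data: two class centroids and a shared covariance matrix.
centroids = {"A": [0.0, 0.0], "B": [4.0, 0.0]}
cov = [[4.0, 0.0], [0.0, 1.0]]
cov_inv = inverse_2x2(cov)

print(mahalanobis([2.0, 0.0], centroids["A"], cov_inv))  # 2 / sqrt(4) = 1.0
print(classify([1.0, 0.5], centroids, cov_inv))          # closer to "A"
```

Because the first feature has variance 4, a displacement of 2 along it counts as a distance of only 1, which is the rescaling that distinguishes this measure from the Euclidean distance.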

Support Vector Machine

The support vector machine (SVM) [31] is a very effective method for general-purpose pattern recognition. We are given training data $\{\mathbf{x}_1 \ldots \mathbf{x}_n\}$ that are vectors in some space $X$. We are also given their labels $\{y_1 \ldots y_n\}$, where $y_i \in \{-1, 1\}$. In their simplest form, SVMs are hyperplanes that separate the training data by a maximal margin. The equation of the hyperplane is of the form

$$\mathbf{w} \cdot \mathbf{x} + b = 0 \qquad (5)$$

where the vector $\mathbf{w}$ is perpendicular to the hyperplane, "$\cdot$" denotes an inner product and $b$ is an additional parameter. The training instances that lie closest to the hyperplane are called support vectors, and the distance between those instances and the separating hyperplane is called the geometrical margin of the classifier.

More generally, SVMs rely on preprocessing the data to represent patterns in a high dimension, typically much higher than that of the original feature space.

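The original training data can be projected from space $X$ to a higher dimensional feature space $F$ via a Mercer kernel operator $K$. In other words, we consider the set of classifiers of the form:

$$f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i K(\mathbf{x}_i, \mathbf{x}) \qquad (6)$$

When $K$ satisfies Mercer's condition [2] we can write: $K(\mathbf{u}, \mathbf{v}) = \Phi(\mathbf{u}) \cdot \Phi(\mathbf{v})$, where $\Phi : X \rightarrow F$. We can then rewrite $f$ as:

$$f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}), \quad \text{where} \quad \mathbf{w} = \sum_{i=1}^{n} \alpha_i \Phi(\mathbf{x}_i) \qquad (7)$$

After projecting the data, the SVM computes the $\alpha_i$ that correspond to the maximal margin hyperplane in $F$.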

By choosing different kernel functions we can implicitly project the training data from $X$ into spaces $F$ for which hyperplanes in $F$ correspond to more complex decision boundaries in the original space $X$.

Two commonly used kernels are the polynomial kernel given by $K(\mathbf{u}, \mathbf{v}) = (\mathbf{u} \cdot \mathbf{v} + 1)^{o}$, which induces polynomial boundaries of degree $o$ in the original space $X$, and the radial basis function or Gaussian kernel $K(\mathbf{u}, \mathbf{v}) = e^{-\sigma (\mathbf{u} - \mathbf{v}) \cdot (\mathbf{u} - \mathbf{v})}$, which induces boundaries by placing weighted Gaussians upon key training instances [30].
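The two kernels, and the identity $K(\mathbf{u}, \mathbf{v}) = \Phi(\mathbf{u}) \cdot \Phi(\mathbf{v})$, can be checked numerically with a short sketch. The explicit map `phi_degree2` is the standard one for the degree-2 polynomial kernel in two dimensions and is included only as an illustration; all function names and numeric values are assumptions, and the $\alpha_i$ used here are arbitrary rather than the result of SVM training.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree):
    """K(u, v) = (u . v + 1)^degree."""
    return (dot(u, v) + 1.0) ** degree


def gaussian_kernel(u, v, sigma):
    """K(u, v) = exp(-sigma * (u - v) . (u - v))."""
    diff = [a - b for a, b in zip(u, v)]
    return math.exp(-sigma * dot(diff, diff))


def phi_degree2(u):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * u[0], r2 * u[1], u[0] ** 2, u[1] ** 2, r2 * u[0] * u[1]]


def kernel_classifier(x, support, alphas, kernel):
    """f(x) = sum_i alpha_i K(x_i, x), the classifier form of equation (6)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, support))


u, v = [1.0, 2.0], [3.0, -1.0]
print(polynomial_kernel(u, v, 2))            # (1*3 + 2*(-1) + 1)^2 = 4
print(dot(phi_degree2(u), phi_degree2(v)))   # same value via the feature map
print(gaussian_kernel(u, u, 0.5))            # identical points give 1.0

# Arbitrary, untrained coefficients, for illustration only.
print(kernel_classifier(u, [u, v], [1.0, -1.0],
                        lambda a, b: polynomial_kernel(a, b, 2)))
```

The kernel evaluates an inner product in the six-dimensional space $F$ without ever constructing $\Phi$, which is what makes high-dimensional feature spaces affordable.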

If the training set is not linearly separable, the standard approach is to allow the decision margin to make a few mistakes (soft-margin SVM). We then pay a cost for each misclassified example, which depends on how far it is from meeting the margin requirement. To implement this, we introduce slack variables $\xi_i$. A non-zero value for $\xi_i$ allows $\mathbf{x}_i$ not to meet the margin requirement at a cost proportional to the value of $\xi_i$.

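The formulation of the SVM optimization problem with slack variables is:

Find $\mathbf{w}$, $b$, $\xi_i \geq 0$ such that:

$$\tfrac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \sum_{i} \xi_i \ \text{ is minimized, and for all } \{(\mathbf{x}_i, y_i)\}: \ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i \qquad (8)$$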



The optimization problem is then trading off how wide it can make the margin versus how many points have to be moved around to allow this margin. The margin can be less than 1 for a point $\mathbf{x}_i$ by setting $\xi_i > 0$, but then one pays a penalty of $C\xi_i$ in the minimization for having done that. The sum of the $\xi_i$ gives an upper bound on the number of training errors. Soft-margin SVMs minimize training error traded off against margin. If the error penalty factor $C$ is close to 0, then we don't pay that much for points violating the margin constraint. The cost function can be minimized by setting $\mathbf{w}$ to be a small vector; this is equivalent to creating a very wide safety margin around the decision boundary (but having many points violate this safety margin). If $C$ is close to infinity, then a lot is paid for points that violate the margin constraint, and this case is close to the previously described hard-margin formulation; the drawback here is the high sensitivity to outlier points in the training data [8].
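To make the role of $C$ concrete, the sketch below evaluates the soft-margin cost for a fixed, hand-picked hyperplane rather than solving the optimization problem. For each point it uses the smallest slack that satisfies the margin constraint. The data set, the hyperplane and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slack(w, b, x, y):
    """Smallest xi >= 0 satisfying y * (w . x + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def soft_margin_objective(w, b, data, C):
    """0.5 * w . w + C * sum_i xi_i, using the smallest feasible slacks."""
    total_slack = sum(slack(w, b, x, y) for x, y in data)
    return 0.5 * dot(w, w) + C * total_slack


# Hypothetical 1-D training set; the last point sits on the wrong side.
data = [([2.0], 1), ([3.0], 1), ([-2.0], -1), ([0.5], -1)]
w, b = [1.0], 0.0

print([slack(w, b, x, y) for x, y in data])       # [0.0, 0.0, 0.0, 1.5]
print(soft_margin_objective(w, b, data, C=0.1))   # 0.5 + 0.1 * 1.5
print(soft_margin_objective(w, b, data, C=10.0))  # 0.5 + 10 * 1.5
```

With a small $C$ the single violating point barely changes the cost, while with a large $C$ it dominates, which is the outlier sensitivity noted above.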

AdaBoost

The goal of boosting is to improve the accuracy of any given learning algorithm. In boosting we first create a classifier, and then add new component classifiers to form an ensemble whose joint decision rule has arbitrarily high accuracy on the training set [4]. Each classifier needs only to be a weak learner, that is, have accuracy only slightly better than chance as a minimum requirement.

There are a number of variations on basic boosting. The most popular, AdaBoost (from "Adaptive Boosting"), allows the designer to continue adding weak learners until some desired low training error has been achieved. It initially chooses the learner that classifies the most data correctly. In the next step, the data set is re-weighted to increase the "importance" of misclassified samples. This process continues, and at each step the weight of each weak learner among the other learners is determined.

Thus, in AdaBoost each training pattern receives a weight that determines its probability of being selected for a training set for an individual component classifier. If a training pattern is accurately classified, then its chance of being used again in a subsequent component classifier is reduced; on the contrary, if the pattern is not accurately classified, then its chance of being used again is raised. In this way, AdaBoost "focuses in" on the informative or "difficult" patterns. Specifically, we initialize the weights across the training set to be uniform. On each iteration $k$, we draw a training set at random according to these weights, and then we train component classifier $C_k$ on the selected patterns. Next we increase the weights of training patterns misclassified by $C_k$ and decrease the weights of the patterns correctly classified by $C_k$. Patterns chosen according to this new distribution are used to train the next classifier, $C_{k+1}$, and the process is iterated.

We let the patterns and their labels in the full training set $D$ be denoted $\mathbf{x}_i$ and $y_i$, respectively, and let $W_k(i)$ be the $k$-th (discrete) distribution over all these training samples. Thus the AdaBoost procedure is:

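I) begin initialize $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, $k_{max}$, $W_1(i) = 1/n$, $i = 1, \ldots, n$ (9)

II) $k = 0$

III) do $k = k + 1$

IV) train weak learner $C_k$ using $D$ sampled according to $W_k(i)$

V) $E_k$ = training error of $C_k$ measured on $D$ using $W_k(i)$

VI) $\alpha_k = \tfrac{1}{2} \ln\left[(1 - E_k)/E_k\right]$

VII) $W_{k+1}(i) = \dfrac{W_k(i)}{Z_k} \times \begin{cases} e^{-\alpha_k} & \text{if } h_k(\mathbf{x}_i) = y_i \text{ (correctly classified)} \\ e^{\alpha_k} & \text{if } h_k(\mathbf{x}_i) \neq y_i \text{ (incorrectly classified)} \end{cases}$

VIII) until $k = k_{max}$

IX) return $C_k$ and $\alpha_k$ for $k = 1$ to $k_{max}$ (ensemble of classifiers with weights)

X) end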

Note that in line V the error for classifier $C_k$ is determined with respect to the distribution $W_k(i)$ over $D$ on which it was trained. In line VII, $Z_k$ is simply a normalizing constant computed to ensure that $W_{k+1}(i)$ represents a true distribution, and $h_k(\mathbf{x}_i)$ is the category label (+1 or -1) given to pattern $\mathbf{x}_i$ by component classifier $C_k$. Naturally, the loop termination of line VIII could instead use the criterion of sufficiently low training error of the ensemble classifier.

The final classification decision of a test point $\mathbf{x}$ is based on a discriminant function that is merely the weighted sum of the outputs given by the component classifiers:

$$g(\mathbf{x}) = \sum_{k=1}^{k_{max}} \alpha_k h_k(\mathbf{x}) \qquad (10)$$

The classification decision for this two-category case is then simply $\mathrm{sgn}[g(\mathbf{x})]$.

Except in pathological cases, as long as each component classifier is a weak learner, the total training error of the ensemble can be made arbitrarily low by setting the number of component classifiers, $k_{max}$, sufficiently high.
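The procedure can be sketched compactly in Python. Two choices here are assumptions of the sketch and not taken from the text: the weak learner is a one-feature threshold rule (a "decision stump"), and instead of drawing a random training set according to the weights, the stump is chosen by minimizing the weighted error directly, a common deterministic variant. The learner weight and the re-weighting follow the usual AdaBoost form, $\alpha_k = \tfrac{1}{2}\ln[(1 - E_k)/E_k]$, with weights multiplied by $e^{\pm\alpha_k}$ and renormalized.

```python
import math


def stump_predict(stump, x):
    """A stump (feature, threshold, polarity) votes polarity if x[f] > t."""
    f, t, polarity = stump
    return polarity if x[f] > t else -polarity


def train_stump(X, y, weights):
    """Pick the threshold stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for polarity in (1, -1):
                err = sum(w for x, label, w in zip(X, y, weights)
                          if stump_predict((f, t, polarity), x) != label)
                if err < best_err:
                    best, best_err = (f, t, polarity), err
    return best, best_err


def adaboost(X, y, k_max):
    n = len(X)
    weights = [1.0 / n] * n                      # uniform initial weights
    ensemble = []
    for _ in range(k_max):
        stump, err = train_stump(X, y, weights)  # weak learner and its error
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this learner
        ensemble.append((alpha, stump))
        # Raise weights of misclassified patterns, lower the others.
        weights = [w * math.exp(-alpha * label * stump_predict(stump, x))
                   for x, label, w in zip(X, y, weights)]
        z = sum(weights)                         # normalizing constant
        weights = [w / z for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted sum of the component classifiers' outputs."""
    g = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if g > 0 else -1


# Hypothetical 1-D data that no single stump can separate.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, k_max=3)
print([predict(ensemble, x) for x in X])  # [1, 1, -1, -1, 1, 1]
```

Each individual stump misclassifies at least two of the six points, yet the weighted vote of three stumps reproduces every label, illustrating how weak learners combine into a more accurate ensemble.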

Supervised Neural Network

An Artificial Neural Network is an adaptive, most often nonlinear system that learns to perform a function (an input/output map) from a data set (inductive learning). Adaptive means that the system parameters are changed during operation (training phase). After the training phase, the Artificial Neural Network parameters are fixed and the system is deployed to solve the problem at hand (testing phase). The Artificial Neural Network is built with a systematic step-by-step procedure to optimize a performance criterion or to follow some implicit internal constraint, which is commonly referred to as the learning rule. The nonlinear nature of the neural network processing elements (PEs) provides the system with great flexibility to achieve practically any desired input/output map.

An input is presented to the neural network and a corresponding desired or target response is set at the output (when this is the case the training is called supervised). An error is calculated as the difference between the desired response and the system output. This error information is fed back to the system and adjusts the system parameters in a systematic fashion (the learning rule). The process is repeated until the performance becomes acceptable.
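The error-feedback loop described above can be illustrated with a single linear neuron. The update used here is the delta (least-mean-squares) rule, chosen for this sketch as one simple example of a learning rule; the text does not prescribe it. The training set, the learning rate and the function name are hypothetical.

```python
def train_linear_neuron(samples, targets, learning_rate, epochs):
    """Error-correction training of a single linear neuron (delta rule)."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, target in zip(samples, targets):
            output = sum(w * xi for w, xi in zip(weights, x))
            error = target - output  # desired response minus system output
            # Feed the error back: adjust each weight along its own input.
            weights = [w + learning_rate * error * xi
                       for w, xi in zip(weights, x)]
    return weights


# Hypothetical training set generated by the map y = 2*x1 - x2.
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [2.0, -1.0, 1.0]

weights = train_linear_neuron(samples, targets, learning_rate=0.1, epochs=200)
print(weights)  # approaches [2.0, -1.0]
```

Here "acceptable performance" is replaced by a fixed number of passes over the data; a practical loop would instead stop once the error falls below a chosen tolerance.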

The structural unit of Neural Networks is a functional model of the biological neuron, called the Perceptron. The synapses of the neuron are modeled as weights: the strength of the connection between an input and a neuron is characterized by the value of the weight. Negative weight values reflect inhibitory connections, while positive values designate excitatory connections. An adder sums up all the inputs modified by their respective weights, and an activation function controls the amplitude of the output of the neuron. An acceptable range of output is usually between 0 and 1, or -1 and 1.

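From this model the internal activity of the neuron can be shown to be:

$$v_k = \sum_{j=1}^{p} w_{kj} x_j \qquad (11)$$

where $x_j$ are the $p$ inputs of neuron $k$ and $w_{kj}$ are the corresponding weights. The output of the neuron, $y_k$, would therefore be the outcome of some activation function on the value of $v_k$.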

As mentioned previously, the activation function acts as a squashing function, such that the output of a neuron in a neural network is between certain values (usually 0 and 1, or -1 and 1). The most common activation functions, denoted by $\varphi(\cdot)$, are the Threshold Function, the Piecewise-Linear Function, and the Log-sigmoid Function.
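A short sketch of the neuron model with the three activation functions named above. The exact breakpoints of the piecewise-linear function vary between textbooks; the form used here (linear on [-0.5, 0.5], clipped to [0, 1]) is one common choice and an assumption of this example, as are the input and weight values.

```python
import math


def threshold(v):
    """Threshold function: 1 if v >= 0, else 0."""
    return 1.0 if v >= 0 else 0.0


def piecewise_linear(v):
    """Piecewise-linear function: v + 0.5 clipped to the range [0, 1]."""
    return min(1.0, max(0.0, v + 0.5))


def log_sigmoid(v):
    """Log-sigmoid function: 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


def neuron_output(inputs, weights, activation):
    """y_k = phi(v_k), with v_k = sum_j w_kj * x_j."""
    v = sum(w * x for w, x in zip(weights, inputs))
    return activation(v)


# Hypothetical inputs and weights (two excitatory, one inhibitory).
x = [1.0, 0.5, -1.0]
w = [0.5, -1.0, 0.25]  # v = 0.5 - 0.5 - 0.25 = -0.25

for phi in (threshold, piecewise_linear, log_sigmoid):
    print(phi.__name__, neuron_output(x, w, phi))
```

All three functions keep the output between 0 and 1; they differ in how sharply the output switches as $v_k$ crosses zero.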

After analyzing the properties of the basic processing unit in an artificial neural network, we will now focus on the pattern of connections between the units and the propagation of data. As for this pattern of connections, the main distinction we can make is between feed-forward neural networks, where the data flow from input to output units is strictly feed-forward, and recurrent neural networks, which do contain feedback connections [22].

A neural network has to be configured such that the application of a set of inputs produces the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to 'train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule. We can categorize the learning situations into three distinct sorts. We talk about supervised learning when the network is trained by providing it with input and matching output patterns. These input-output pairs can be provided by an external teacher, or by the system which contains the neural network (self-supervised). We have unsupervised learning when an (output) unit is trained to respond to clusters of patterns within the input. In this paradigm the system is supposed to discover statistically salient features of the input population. Finally, reinforcement learning can be performed, which may be considered an intermediate form of the above two types of learning [23].