Appendix: Mathematical description of the Machine Learning methods

Discriminant classifier

The Linear Discriminant divides the feature space by a hyperplane decision surface that maximizes the ratio of between-class variance to within-class variance.
Given a data vector x, a discriminant function that is a linear combination of the components of x can be written as

$$g(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + w_0 \qquad (1)$$

where w is the weight vector and w0 is the bias or threshold weight. A two-category linear classifier implements the following decision rule: decide ω1 if g(x) > 0 and ω2 if g(x) < 0.
Thus, x is assigned to ω1 if the inner product w · x exceeds the threshold −w0, and to ω2 otherwise. If g(x) = 0, the assignment is undefined.
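As a concrete illustration, the two-category decision rule above can be sketched in a few lines; the weight values here are hypothetical, chosen only for the example.

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """Evaluate g(x) = w . x + w0 from eq. (1)."""
    return np.dot(w, x) + w0

# Hypothetical weight vector and bias (not from the paper).
w = np.array([1.0, -2.0])
w0 = 0.5

x = np.array([3.0, 1.0])
g = linear_discriminant(x, w, w0)          # 3.0 - 2.0 + 0.5 = 1.5
# Decide omega_1 if g > 0, omega_2 if g < 0, undefined if g == 0.
label = "omega_1" if g > 0 else ("omega_2" if g < 0 else "undefined")
```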

The Quadratic Discriminant divides the feature space by a hypersphere, hyperellipsoid, or hyperhyperboloid decision surface. The linear discriminant function g(x) can be written as

$$g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i \qquad (2)$$
where the coefficients wi are the components of the weight vector w. By adding terms involving the products of pairs of components of x, we obtain the quadratic discriminant function

$$g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij}\, x_i x_j \qquad (3)$$
Since xi xj = xj xi, we can assume that wij = wji with no loss of generality. Thus, the quadratic discriminant function has an additional d(d + 1)/2 coefficients at its disposal with which to produce more complicated separating surfaces. The separating surface defined by g(x) = 0 is a second-degree or hyperquadric surface.

The Mahalanobis Discriminant employs the Mahalanobis distance

$$r^2 = (\mathbf{x} - \boldsymbol{\mu})^{t}\,\Sigma^{-1}\,(\mathbf{x} - \boldsymbol{\mu}) \qquad (4)$$

where x is the data vector, µ is the centroid of a given class, and Σ is the covariance matrix of the data distribution; each datum x is assigned to the class whose centroid µ minimizes r².
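A minimal sketch of the Mahalanobis classifier follows; the centroids and covariance matrix below are illustrative values, not taken from the paper.

```python
import numpy as np

def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance r^2 = (x - mu)^T Sigma^{-1} (x - mu), eq. (4)."""
    d = x - mu
    return float(d @ np.linalg.inv(cov) @ d)

def classify(x, centroids, cov):
    """Assign x to the class whose centroid minimizes the Mahalanobis distance."""
    return min(centroids, key=lambda c: mahalanobis_sq(x, centroids[c], cov))

# Hypothetical class centroids and a shared covariance matrix.
centroids = {"A": np.array([0.0, 0.0]), "B": np.array([4.0, 4.0])}
cov = np.array([[1.0, 0.3],
                [0.3, 1.0]])

label = classify(np.array([1.0, 0.5]), centroids, cov)   # closer to centroid A
```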
Support Vector Machine

The support vector machine (SVM) [31] is a very effective method for general-purpose pattern recognition. We are given training data {x1, …, xn} that are vectors in some space X. We are also given their labels {y1, …, yn}, where yi ∈ {−1, 1}. In their simplest form, SVMs are hyperplanes that separate the training data by a maximal margin. The equation of the hyperplane is of the form

$$\mathbf{w}\cdot\mathbf{x} + b = 0 \qquad (5)$$
where the vector w is perpendicular to the hyperplane, "·" denotes an inner product, and b is an additional parameter. The training instances that lie closest to the hyperplane are called support vectors, and the distance between those instances and the separating hyperplane is called the geometrical margin of the classifier.
More generally, SVMs rely on preprocessing the data to represent patterns in a high dimension, typically much higher than the original feature space. The original training data can be projected from space X to a higher dimensional feature space F via a Mercer kernel operator K. In other words, we consider the set of classifiers of the form:

$$f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i K(\mathbf{x}_i, \mathbf{x}) \qquad (6)$$

When K satisfies Mercer's condition [2] we can write K(u, v) = Φ(u) · Φ(v), where Φ : X → F. We can then rewrite f as:

$$f(\mathbf{x}) = \mathbf{w}\cdot\Phi(\mathbf{x}), \quad \text{where} \quad \mathbf{w} = \sum_{i=1}^{n} \alpha_i \Phi(\mathbf{x}_i) \qquad (7)$$
After projecting the data, the SVM computes the αi that correspond to the maximal-margin hyperplane in F. By choosing different kernel functions we can implicitly project the training data from X into spaces F for which hyperplanes in F correspond to more complex decision boundaries in the original space X. Two commonly used kernels are the polynomial kernel, given by K(u, v) = (u · v + 1)^o, which induces polynomial boundaries of degree o in the original space X, and the radial basis function or Gaussian kernel, K(u, v) = e^(−σ(u−v)·(u−v)), which induces boundaries by placing weighted Gaussians upon key training instances [30].
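The two kernels can be written directly as functions; this is a sketch, and the default values for the degree o and the width σ are arbitrary choices for illustration.

```python
import numpy as np

def polynomial_kernel(u, v, o=2):
    """K(u, v) = (u . v + 1)^o, inducing polynomial boundaries of degree o."""
    return (np.dot(u, v) + 1.0) ** o

def gaussian_kernel(u, v, sigma=0.5):
    """K(u, v) = exp(-sigma * (u - v) . (u - v)), the RBF/Gaussian kernel."""
    d = u - v
    return np.exp(-sigma * np.dot(d, d))

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
p = polynomial_kernel(u, v)   # (0 + 1)^2 = 1.0
g = gaussian_kernel(u, u)     # exp(0) = 1.0 for identical inputs
```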
If the training set is not linearly separable, the standard approach is to allow the decision margin to make a few mistakes (soft-margin SVM). We then pay a cost for each misclassified example, which depends on how far it is from meeting the margin requirement. To implement this, we introduce slack variables ξi. A non-zero value for ξi allows xi not to meet the margin requirement at a cost proportional to the value of ξi. The formulation of the SVM optimization problem with slack variables is: find w, b, and ξi ≥ 0 such that

$$\tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} + C\sum_{i=1}^{n}\xi_i \;\;\text{is minimized, subject to}\;\; y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i \;\text{ for all } i \qquad (8)$$
The optimization problem is then trading off how wide it can make the margin versus how many points have to be moved around to allow this margin. The margin can be less than 1 for a point by setting ξi > 0, but then one pays a penalty of Cξi in the minimization for having done so. The sum of the ξi gives an upper bound on the number of training errors. Soft-margin SVMs minimize training error traded off against margin.
If the error penalty factor C is close to 0, then we do not pay much for points violating the margin constraint. The cost function can then be minimized by setting w to be a small vector; this is equivalent to creating a very wide safety margin around the decision boundary (but letting many points violate this safety margin). If C is close to infinity, then a high price is paid for points that violate the margin constraint, and this case approaches the hard-margin formulation described previously; the drawback here is the high sensitivity to outlier points in the training data [8].
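The trade-off controlled by C can be made concrete by evaluating the soft-margin objective of eq. (8) for a fixed hyperplane; the data points and weights below are hypothetical, chosen so that exactly one point violates the margin.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """0.5 ||w||^2 + C * sum(xi), with slack xi_i = max(0, 1 - y_i (w . x_i + b))."""
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * np.sum(slacks)

# Two well-separated points plus one point on the wrong side of its margin.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

# Small C barely penalizes the violating point; large C makes it dominate.
low = soft_margin_objective(w, b, X, y, C=0.01)    # 0.5 + 0.01 * 1.5 = 0.515
high = soft_margin_objective(w, b, X, y, C=100.0)  # 0.5 + 100  * 1.5 = 150.5
```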
AdaBoost

The goal of boosting is to improve the accuracy of any given learning algorithm. In boosting we first create a classifier, and then add new component classifiers to form an ensemble whose joint decision rule has arbitrarily high accuracy on the training set [4]. Each classifier needs only to be a weak learner; that is, its accuracy need only be slightly better than chance. There are a number of variations on basic boosting. The most popular, AdaBoost (from "Adaptive Boosting"), allows the designer to continue adding weak learners until some desired low training error has been achieved.
It initially chooses the learner that classifies the most data correctly. In the next step, the data set is re-weighted to increase the "importance" of misclassified samples. This process continues, and at each step the weight of each weak learner relative to the other learners is determined.
Thus, in AdaBoost each training pattern receives a weight that determines its probability of being selected for a training set for an individual component classifier. If a training pattern is accurately classified, then its chance of being used again in a subsequent component classifier is reduced; on the contrary, if the pattern is not accurately classified, then its chance of being used again is raised. In this way, AdaBoost "focuses in" on the informative or "difficult" patterns. Specifically, we initialize the weights across the training set to be uniform. On each iteration k, we draw a training set at random according to these weights, and then we train component classifier Ck on the selected patterns. Next we increase the weights of training patterns misclassified by Ck and decrease the weights of the patterns correctly classified by Ck. Patterns chosen according to this new distribution are used to train the next classifier, Ck+1, and the process is iterated.
We let the patterns and their labels in the full training set D be denoted xi and yi, respectively, and let Wk(i) be the k-th (discrete) distribution over these training samples. Thus the AdaBoost procedure is:

I) begin initialize D, kmax, W1(i) = 1/n, i = 1, …, n   (9)
II) k = 0
III) do k = k + 1
IV) train weak learner Ck using D sampled according to Wk(i)
V) Ek = training error of Ck measured on D using Wk(i)
VI) αk = (1/2) ln[(1 − Ek)/Ek]
VII) Wk+1(i) = (Wk(i)/Zk) · e^(−αk) if hk(xi) = yi, and (Wk(i)/Zk) · e^(+αk) if hk(xi) ≠ yi
VIII) until k = kmax
IX) return Ck and αk for k = 1 to kmax (ensemble of classifiers with weights)
X) end
Note that in line V the error for classifier Ck is determined with respect to the distribution Wk(i) over D on which it was trained. In line VII, Zk is simply a normalizing constant computed to ensure that Wk+1(i) represents a true distribution, and hk(xi) is the category label (+1 or −1) given to pattern xi by component classifier Ck. Naturally, the loop termination of line VIII could instead use the criterion of sufficiently low training error of the ensemble classifier.
The final classification decision for a test point x is based on a discriminant function that is simply the weighted sum of the outputs given by the component classifiers:

$$g(\mathbf{x}) = \sum_{k=1}^{k_{max}} \alpha_k h_k(\mathbf{x}) \qquad (10)$$

The classification decision for this two-category case is then simply sgn[g(x)]. Except in pathological cases, as long as each component classifier is a weak learner, the total training error of the ensemble can be made arbitrarily low by setting the number of component classifiers, kmax, sufficiently high.
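The procedure in steps I–X can be sketched with one-dimensional threshold stumps as the weak learners; this is a minimal illustration under assumed toy data, not the paper's implementation (the error is clipped away from 0 and 1 to keep αk finite).

```python
import numpy as np

def adaboost(x, y, k_max):
    """Minimal AdaBoost with 1-D threshold stumps h(x) = s * sign(x - t)."""
    n = len(x)
    W = np.full(n, 1.0 / n)              # step I: uniform initial weights
    ensemble = []
    for _ in range(k_max):               # steps II-VIII
        best = None
        for t in x:                      # candidate thresholds at the data points
            for s in (1.0, -1.0):
                h = s * np.sign(x - t + 1e-12)
                err = np.sum(W[h != y])  # step V: weighted training error
                if best is None or err < best[0]:
                    best = (err, t, s, h)
        err, t, s, h = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)   # step VI
        W = W * np.exp(-alpha * y * h)            # step VII: re-weight patterns
        W = W / W.sum()                           # divide by normalizer Z_k
        ensemble.append((alpha, t, s))
    return ensemble

def predict(ensemble, x):
    """Final decision sgn[g(x)] with g(x) from eq. (10)."""
    g = sum(a * s * np.sign(x - t + 1e-12) for a, t, s in ensemble)
    return np.sign(g)

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
model = adaboost(x, y, k_max=3)
preds = predict(model, x)    # reproduces the labels on this separable toy set
```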
Supervised Neural Network

An Artificial Neural Network is an adaptive, most often nonlinear, system that learns to perform a function (an input/output map) from a data set (inductive learning). Adaptive means that the system parameters are changed through operation (training phase). After the training phase, the Artificial Neural Network parameters are fixed and the system is deployed to solve the problem at hand (testing phase). The Artificial Neural Network is built with a systematic step-by-step procedure to optimize a performance criterion or to follow some implicit internal constraint, which is commonly referred to as the learning rule. The nonlinear nature of the neural network processing elements (PEs) provides the system with great flexibility to achieve practically any desired input/output map.
An input is presented to the neural network and a corresponding desired or target response is set at the output (when this is the case, the training is called supervised). An error is calculated as the difference between the desired response and the system output. This error information is fed back to the system, which adjusts its parameters in a systematic fashion (the learning rule). The process is repeated until the performance becomes acceptable.
The structural unit of Neural Networks is a functional model of the biological neuron, called the Perceptron. The synapses of the neuron are modeled as weights: the strength of the connection between an input and a neuron is characterized by the value of the weight. Negative weight values reflect inhibitory connections, while positive values designate excitatory connections. An adder sums up all the inputs modified by their respective weights, and an activation function controls the amplitude of the output of the neuron. An acceptable range of output is usually between 0 and 1, or −1 and 1. From this model the internal activity of the neuron can be shown to be:

$$v_k = \sum_{j=1}^{p} w_{kj}\, x_j \qquad (11)$$

The output of the neuron, yk, would therefore be the outcome of some activation function applied to the value of vk.
As mentioned previously, the activation function acts as a squashing function, such that the output of a neuron in a neural network is between certain values (usually 0 and 1, or −1 and 1). The most common activation functions, denoted by φ(·), are the Threshold Function, the Piecewise-Linear Function, and the Log-sigmoid Function.
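A single neuron and two of the activation functions named above can be sketched as follows; the weight values are hypothetical, and the bias term b is an optional addition not present in eq. (11).

```python
import numpy as np

def threshold(v):
    """Threshold activation: output 1 if v >= 0, else 0."""
    return np.where(v >= 0, 1.0, 0.0)

def log_sigmoid(v):
    """Log-sigmoid activation: squashes v into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def neuron(x, w, b, phi):
    """Internal activity v_k = sum_j w_j x_j (eq. 11, plus optional bias b);
    output y_k = phi(v_k)."""
    return phi(np.dot(w, x) + b)

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])   # hypothetical synaptic weights
b = 0.0

y_thr = neuron(x, w, b, threshold)     # v = 0.0 -> threshold gives 1.0
y_sig = neuron(x, w, b, log_sigmoid)   # sigmoid(0.0) = 0.5
```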
After analyzing the properties of the basic processing unit in an artificial neural network, we now focus on the pattern of connections between the units and the propagation of data. As for this pattern of connections, the main distinction we can make is between feed-forward neural networks, where the data flow from input to output units is strictly feed-forward, and recurrent neural networks, which do contain feedback connections [22].
A neural network has to be configured such that the application of a set of inputs produces the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to 'train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule. We can categorize the learning situations into three distinct sorts. We speak of supervised learning when the network is trained by providing it with input and matching output patterns. These input-output pairs can be provided by an external teacher, or by the system which contains the neural network (self-supervised). We have unsupervised learning when an (output) unit is trained to respond to clusters of patterns within the input; in this paradigm the system is supposed to discover statistically salient features of the input population. Finally, reinforcement learning may be performed, which can be considered an intermediate form of the above two types of learning [23].