Learning problem
Given a sample of N pairs (xi, ui), i = 1, …, N, in which the xi = (xi1, …, xip) are predictor vectors (dimension p) and the ui = (ui1, …, uiq) are target vectors (dimension q), build a decision rule

u = F(x) = F(x1, …, xp)

that predicts the target vector u given the predictor vector x = (x1, …, xp). The decision rule is an algorithm which is not necessarily expressed as an analytical function.

Depending on the assumptions, there can be many special cases, of which the following are the most popular.
Data flow: incremental (on-line) and batch mode learning; entities i = 1, …, N come one-by-one or are all known in advance.

Type of target:

Regression: u quantitative, q = 1 or more
- Linear regression (F linear)
- Decision tree (F tree-like)
- Neural Nets (F general)
- Evolutionary algorithms (F general)

Pattern recognition (classification): u binary
- Discrimination (F linear)
- Support vector machine (SVM) (F linear)
- Logistic regression (F a probability, exponential-linear)
- Neural Nets (F general)
- Evolutionary algorithms (F general)

Criterion: least squares, maximum likelihood, error counting
Linear regression using least squares, q=1

The assumption is that

u = w1*x1 + w2*x2 + … + wp*xp + w0

so that for any sampled entity i = 1, …, N the computed value

ûi = w1*xi1 + w2*xi2 + … + wp*xip + w0

differs from the observed one by the deviation di = ûi − ui.

The coefficients w1, w2, …, wp and intercept w0 are to be found by minimising the sum of the squared deviations

D² = Σi di² = Σi (ui − w1*xi1 − w2*xi2 − … − wp*xip − w0)²    (1)

over all possible vectors (w0, w1, …, wp).
To make the problem uniform, a fictitious feature x0 is introduced such that all its values are 1: xi0 = 1 for all i = 1, …, N. Then the criterion D² involves no intercept, just the inner products <w, xi>, where w = (w0, w1, …, wp) and xi = (xi0, xi1, xi2, …, xip) are (p+1)-dimensional vectors, w unknown, xi known. From now on, the intercept in (1) is dropped because of this convention.

Furthermore, D² is just the squared Euclidean distance between the N-dimensional target column u = (ui) and the vector û = Xw whose components are ûi = <w, xi>. Here X is the N x (p+1) matrix whose rows are the xi (augmented with the component xi0 = 1, thus being (p+1)-dimensional), so that Xw is the matrix-algebra product of X and w. It is well known that the solution to this problem can be expressed as

û = PX u    (2)

where PX is the so-called N x N orthogonal projection matrix defined as

PX = X(XᵀX)⁻¹Xᵀ, so that û = X(XᵀX)⁻¹Xᵀu.

This matrix projects every N-dimensional vector u to its nearest match in the (p+1)-dimensional space of all N-dimensional vectors that are weighted combinations of the columns of matrix X.

Equation (2) implies that the optimal w = (XᵀX)⁻¹Xᵀu.

This solution involves the inverse (XᵀX)⁻¹, which does not exist if the rank of X, as it happens,
is less than the number of columns in X, p+1.
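As a sketch, the normal-equations solution can be computed directly in NumPy; the data below are synthetic, invented only to illustrate the formula (in practice np.linalg.lstsq is preferable, since it also handles the rank-deficient case via the pseudo-inverse):

```python
import numpy as np

# Synthetic data: N entities, p predictors, plus the fictitious feature x0 = 1.
rng = np.random.default_rng(0)
N, p = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])  # N x (p+1) matrix
w_true = np.array([1.0, 2.0, -0.5, 0.3])   # invented "true" weights
u = X @ w_true                              # noise-free targets for clarity

# Optimal weights w = (X^T X)^(-1) X^T u via the normal equations.
w = np.linalg.solve(X.T @ X, X.T @ u)
u_hat = X @ w   # equals P_X u, the projection of u onto the column space of X
```

With noise-free targets the recovered weights coincide with w_true; with noisy targets û = Xw is the nearest point to u in the column space of X.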
The linear discrimination problem differs only in that the values ui are binary. Most conveniently, ui = 1 if i belongs to the “yes” class and ui = −1 if i belongs to the “no” class. The intercept is referred to, in the context of the discrimination/classification problem, as the bias.

On the Figure below, entities (x1, x2, u) are presented as * at u = 1 and 0 at u = −1. The vector w represents the solution found as described, and the dashed line represents the set of all x that are orthogonal to w, that is, such that <w, x> = 0: the separating hyperplane. The Figure shows a relatively rare situation in which the two patterns can be separated by a hyperplane: linear separability.

The decision rule here is simple: if ûi = <w, xi> is positive, ůi = 1; if ûi = <w, xi> is negative, ůi = −1; in short, ůi = sign(<w, xi>). (The function sign(a) equals 1 when a > 0, −1 when a < 0, and 0 when a = 0.)
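The same least-squares machinery can thus be reused for classification: fit w as in (2), then threshold with sign. A minimal sketch on invented, linearly separable 2-D data:

```python
import numpy as np

# Invented data; column 0 is the fictitious feature x0 = 1.
X = np.array([[1.0,  2.0,  2.0],   # "yes" class, u = 1
              [1.0,  3.0,  3.0],
              [1.0, -2.0, -1.0],   # "no" class, u = -1
              [1.0, -3.0, -2.0]])
u = np.array([1.0, 1.0, -1.0, -1.0])

w = np.linalg.solve(X.T @ X, X.T @ u)   # least-squares weights (bias is w[0])
labels = np.sign(X @ w)                 # decision rule: sign(<w, xi>)
```

On this toy sample the rule classifies all four entities correctly, matching the linearly separable situation in the Figure.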
Perceptron
In 1957, F. Rosenblatt came up with a machine learning paradigm for this case and a solution to it, the perceptron. The machine learning paradigm assumes that entities come one-by-one in sequence, so that the machine must start learning the classification rule immediately; that is, start with a learning rule and then update it each time the rule makes a mistake. Such activity is referred to as generalisation: developing a general decision rule, sign(<w, xi>) in this case, from instances.
The perceptron algorithm:

0. Initialise weights w randomly or to zero.
1. For each training instance (xi, ui):
   a. compute ůi = sign(<w, xi>)
   b. if ůi ≠ ui, update the weights w according to the equation
      w(new) = w(old) + λ(ui − ůi)xi
      where λ, a real between 0 and 1, is the so-called learning rate.
2. Stopping rule.
This algorithm was proved to converge to the w found above when the patterns are linearly
separable.
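The steps above can be sketched directly in NumPy (the toy data and the learning rate λ = 0.5 are invented for illustration):

```python
import numpy as np

def perceptron(X, u, rate=0.5, epochs=100):
    w = np.zeros(X.shape[1])              # step 0: initialise weights to zero
    for _ in range(epochs):               # step 2: stop after a fixed number of passes
        mistakes = 0
        for xi, ui in zip(X, u):          # step 1: entities come one by one
            ui_pred = np.sign(w @ xi)     # a. compute the predicted sign
            if ui_pred != ui:             # b. update only on a mistake
                w = w + rate * (ui - ui_pred) * xi
                mistakes += 1
        if mistakes == 0:                 # no errors in a full pass: converged
            break
    return w

# Linearly separable toy data; column 0 is the fictitious feature x0 = 1.
X = np.array([[1.0, 2.0, 2.0], [1.0, 3.0, 3.0],
              [1.0, -2.0, -1.0], [1.0, -3.0, -2.0]])
u = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron(X, u)
```

On separable data such as this, the loop terminates as soon as an entire pass produces no mistakes, in line with the convergence result quoted above.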
In fact, Rosenblatt’s idea was to use the conventional anti-gradient minimisation algorithm in a modified form.

The gradient optimisation (the steepest ascent/descent, or hill-climbing) of a function f(x) of a multidimensional variable works like this: given an initial state x0, perform a sequence of iterations, each finding a new x location. Each iteration updates the old x value as follows:

x(new) = x(old) ± λ*grad(f(x(old)))    (3)

where grad(f(x)) is the vector of partial derivatives of f with respect to the components of x. It is known from calculus that the vector grad(f(x)) shows the steepest rise of f at point x. Thus + in (3) is used for maximisation of f(x), and − for minimisation.

The λ value controls the length of the change and should be small (to guarantee not over-jumping the slope), but not too small (to guarantee changes when grad(f(x(old))) becomes small; indeed grad(f(x(old))) = 0 at the optimum point).
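A minimal steepest-descent sketch of update (3), assuming for illustration the function f(x) = ||x − c||², whose gradient is 2(x − c); the target c, the rate λ = 0.1, and the tolerance are invented:

```python
import numpy as np

def gradient_descent(grad, x0, rate=0.1, steps=1000, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        x = x - rate * g                 # minus sign: minimisation, as in (3)
        if np.linalg.norm(g) < tol:      # the gradient vanishes at the optimum
            break
    return x

c = np.array([3.0, -1.0])
x_min = gradient_descent(lambda x: 2 * (x - c), x0=np.zeros(2))
```

Each step shrinks the distance to the minimiser by a constant factor here, illustrating why λ must be small enough not to overshoot yet large enough to make progress.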
It is not difficult to see that the partial derivative of the criterion (1), in the case when only one incoming entity (xi, ui) is considered, with respect to wt, is equal to −2(ui − ûi)xit, which is similar to the perceptron learning rule. Thus, the innovation was to change the continuous ûi to the discrete ůi = sign(ûi).
Neuron and artificial neuron
The decision rule ůi = sign(ûi) can be interpreted in terms of an artificial neuron: the features xi are input signals (from other neurons), the weights wt are the wiring (axon) features, the bias w0 is the firing threshold, and sign() is the neuron activation function. This way, the perceptron is an example of nature-inspired computation.

The Figure below gives a scheme of a real neuron.
An artificial neuron may have a different activation function. Two popular activation functions, besides the sign function ůi = sign(ûi), are the linear activation function, ůi = ûi (we considered it when discussing the steepest descent), and the sigmoid activation function ůi = s(ûi), where s(x) = 1/(1 + exp(−x)), which is a smooth analogue of the sign function.
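The three activation functions mentioned above can be written down side by side (a plain sketch, no assumptions beyond the formulas in the text):

```python
import numpy as np

def sign_act(x):
    return np.sign(x)                  # the perceptron's discrete activation

def linear_act(x):
    return x                           # identity: the plain least-squares output

def sigmoid_act(x):
    return 1.0 / (1.0 + np.exp(-x))   # smooth analogue of sign, range (0, 1)
```

Note that the sigmoid takes the value 0.5 at x = 0 and approaches 0 and 1 at the extremes, whereas sign jumps between −1 and 1.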
Multilayered network

Neurons can be

Back-propagation