Learning problem



Given a sample of N pairs (xi, ui) (i = 1,…,N), in which the xi are predictor vectors xi = (xi1,…,xip) (dimension p) and the ui = (ui1,…,uiq) are target vectors (dimension q), build a decision rule

u = F(x) = F(x1,…,xp)

such that it predicts the target vector u, given a predictor vector x = (x1,…,xp). The decision rule is an algorithm which is not necessarily expressed as an analytical function.
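As a minimal illustration (not part of the original text; all names and shapes here are placeholders), the sample can be stored as an N x p matrix X of predictor vectors and an N x q matrix U of target vectors, and a decision rule F is then just a function mapping a predictor vector to a predicted target vector:

    import numpy as np

    # Hypothetical sizes: N entities, p predictors, q targets.
    N, p, q = 100, 3, 1
    rng = np.random.default_rng(0)
    X = rng.normal(size=(N, p))                       # rows are the predictor vectors xi
    W_true = rng.normal(size=(p, q))
    U = X @ W_true + 0.1 * rng.normal(size=(N, q))    # rows are the target vectors ui

    def F(x, W=W_true):
        """A decision rule u = F(x1,…,xp); here simply a linear map, as a placeholder."""
        return x @ W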


Depending on the assumptions, there can be many special cases, of which the following are the most popular.


Data flow:

Incremental (on-line) and batch-mode learning: the entities i = 1,…,N come one by one or are all known in advance.

Type of target:

Regression: u quantitative, q = 1 or more
- Linear regression (F linear)
- Decision tree (F tree-like)
- Neural Nets (F general)
- Evolutionary algorithms (F general)

Pattern recognition (classification): u binary
- Discrimination (F linear)
- Support vector machine (SVM) (F linear)
- Logistic regression (F a probability, exponential-linear)
- Neural Nets (F general)
- Evolutionary algorithms (F general)


Criterion:

least-squares, maximum likelihood, error counting













Linear regression using least squares, q = 1


The assumption is that

u = w1*x1 + w2*x2 + … + wp*xp + w0

so that for any sampled entity i = 1,…,N the computed value

ûi = w1*xi1 + w2*xi2 + … + wp*xip + w0

differs from the observed one by the deviation di = |ûi - ui|.


The coefficients w1, w2, …, wp and the intercept w0 are to be found by minimising the sum of the squared deviations

D^2 = Σi di^2 = Σi (ui - w1*xi1 - w2*xi2 - … - wp*xip - w0)^2          (1)

over all possible vectors (w0, w1, …, wp).


To make the problem uniform, a fictitious feature x0 is introduced such that all its values are 1: xi0 = 1 for all i = 1,…,N. Then the criterion D^2 involves no intercept, just the inner products <w,xi>, where w = (w0, w1, …, wp) and xi = (xi0, xi1, xi2, …, xip) are (p+1)-dimensional vectors, w unknown, xi known. From now on, the intercept in (1) is dropped because of this convention.


Furthermore, D^2 is but the Euclidean distance squared between the N-dimensional feature column u = (ui) and the vector û = Xw whose components are ûi = <w,xi>. Here X is the N x (p+1) matrix whose rows are the xi (augmented with the component xi0 = 1, thus being (p+1)-dimensional), so that Xw is the matrix algebra product of X and w. It is well known that the solution to this problem can be expressed as

û = P_X u          (2)

where P_X is the so-called orthogonal projection matrix, of size N x N, defined as

P_X = X (X^T X)^(-1) X^T

so that û = X (X^T X)^(-1) X^T u.


This matrix projects every N-dimensional vector u to its nearest match in the (p+1)-dimensional space of all N-dimensional vectors that are weighted combinations of the columns of matrix X.

Equation (2) implies that the optimal w = (X^T X)^(-1) X^T u.

This solution involves the inverse (X^T X)^(-1), which does not exist if the rank of X happens to be less than the number of columns in X, p+1.
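As a quick numerical sketch (assuming NumPy; the function names are illustrative), the weights can be computed directly from w = (X^T X)^(-1) X^T u, with the pseudo-inverse standing in for the ordinary inverse so that the rank-deficient case is still handled:

    import numpy as np

    def fit_linear_regression(X, u):
        """Least-squares weights for u ~ <w, x>, with the fictitious all-ones feature xi0 = 1 prepended."""
        X1 = np.column_stack([np.ones(len(X)), X])        # rows become (1, xi1, …, xip)
        # w = (X^T X)^(-1) X^T u; pinv handles the case where X^T X has no ordinary inverse
        return np.linalg.pinv(X1.T @ X1) @ X1.T @ u

    def predict_linear_regression(X, w):
        X1 = np.column_stack([np.ones(len(X)), X])
        return X1 @ w                                     # û = Xw; on the training data this equals P_X u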


The linear discrimination problem differs only in that the values ui are binary. Most conveniently, ui = 1 if i belongs to the “yes” class and ui = -1 if i belongs to the “no” class. The intercept is referred to, in the context of the discrimination/classification problem, as the bias.


On the Figure below, entities (x1, x2, u) are presented as * at u = 1 and o at u = -1.





The vector w represents the solution found as described, and the dashed line represents the set of all x that are orthogonal to w, such that <w,x> = 0: the separating hyperplane. The Figure shows a relatively rare situation in which the two patterns can be separated by a hyperplane, the case of linear separability.

The decision rule here is simple: if ûi = <w,xi> is positive, ůi = 1; if ûi = <w,xi> is negative, ůi = -1; in short, ůi = sign(<w,xi>). (The function sign(a) equals 1 when a > 0, -1 when a < 0, and 0 when a = 0.)
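A minimal sketch of this decision rule (assuming NumPy and reusing the hypothetical fit_linear_regression sketched above; names are illustrative): the weights are fitted by least squares against the ±1 labels, and new entities are classified by the sign of <w,xi>:

    import numpy as np

    def classify(X, w):
        """Decision rule ůi = sign(<w, xi>), with the bias folded in as w0 via the all-ones feature."""
        X1 = np.column_stack([np.ones(len(X)), X])
        return np.sign(X1 @ w)     # +1 for the "yes" class, -1 for the "no" class, 0 on the hyperplane

    # Usage sketch, with u holding labels in {-1, +1}:
    # w = fit_linear_regression(X, u)
    # labels = classify(X_new, w)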


Perceptron


In 1957, F. Rosenblatt came up with a machine learning paradigm for this case and a solution to it, the perceptron. The machine learning paradigm assumes that entities come one by one in sequence, so that the machine must start learning the classification rule immediately, that is, start with a learning rule and then update it each time the rule makes a mistake. Such activity is referred to as generalisation: developing a general decision rule, sign(<w,xi>) in this case, from instances.


The perceptron algorithm:

0. Initialise the weights w randomly or to zero.
1. For each training instance (xi, ui):
   a. compute ůi = sign(<w,xi>);
   b. if ůi ≠ ui, update the weights w according to the equation

      w(new) = w(old) + μ(ui - ůi)xi

      where μ, a real number between 0 and 1, is the so-called learning rate.
2. Stopping rule.
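A runnable sketch of this algorithm (assuming NumPy and labels ui in {-1, +1}; the learning rate μ = 0.5, the fixed maximum number of passes used as the stopping rule, and all names are illustrative choices):

    import numpy as np

    def perceptron(X, u, mu=0.5, max_passes=100):
        """Perceptron learning: update w whenever sign(<w, xi>) disagrees with ui."""
        X1 = np.column_stack([np.ones(len(X)), X])   # bias handled as the fictitious feature xi0 = 1
        w = np.zeros(X1.shape[1])                    # step 0: initialise the weights to zero
        for _ in range(max_passes):                  # stopping rule: at most max_passes passes over the data
            mistakes = 0
            for xi, ui in zip(X1, u):                # step 1: instances are processed one by one
                u_hat = np.sign(xi @ w)              # a. compute ůi = sign(<w, xi>)
                if u_hat != ui:                      # b. on a mistake: w(new) = w(old) + mu*(ui - ůi)*xi
                    w = w + mu * (ui - u_hat) * xi
                    mistakes += 1
            if mistakes == 0:                        # also stop once a full pass makes no mistakes
                break
        return w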


This algorithm was proved to converge to a weight vector w that separates the two classes, provided the patterns are linearly separable (the perceptron convergence theorem).

In fact, Rosenblatt’s idea was to use the conventional anti-gradient minimisation algorithm in a modified form.


The gradient optimisation (steepest ascent/descent, or hill-climbing) of a function f(x) of a multidimensional variable works as follows: given an initial state x0, perform a sequence of iterations, each finding a new location x. Each iteration updates the old x-value as follows:

x(new) = x(old) ± μ*grad(f(x(old)))          (3)


where grad(f(x)) is the vector of partial derivatives of f with respect to the components of x. It is known from calculus that the vector grad(f(x)) points in the direction of the steepest rise of f at point x. Thus + in (3) is used for maximisation of f(x), and - for minimisation.

The μ value controls the length of the step: it should be small (to guarantee not overjumping the slope), but not too small (to guarantee that changes still occur when grad(f(x(old))) becomes small; indeed, grad(f(x(old))) = 0 at the optimum point).
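A small sketch of iteration (3) for minimisation (assuming NumPy; the step size, stopping tolerance, and example function are illustrative):

    import numpy as np

    def steepest_descent(grad, x0, mu=0.1, max_steps=1000, tol=1e-8):
        """Iterate x(new) = x(old) - mu*grad(f(x(old))) until the gradient is (almost) zero."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_steps):
            g = grad(x)
            if np.linalg.norm(g) < tol:    # grad(f) = 0 at the optimum point
                break
            x = x - mu * g                 # the minus sign gives minimisation; plus would maximise
        return x

    # Example: f(x) = <x, x> has gradient 2x and its minimum at the origin.
    x_min = steepest_descent(lambda x: 2 * x, x0=[3.0, -4.0])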


It is not difficult to see that the partial derivative of criterion (1), when only one incoming entity (xi, ui) is considered, with respect to wt is equal to -2(ui - ûi)xit, which is similar to the perceptron learning rule. Thus, the innovation was to replace the continuous ûi by the discrete ůi = sign(ûi).
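The corresponding on-line least-squares update, as a one-line sketch (the factor 2 is absorbed into the learning rate μ; NumPy arrays are assumed for w and xi): moving against this derivative for a single entity gives w(new) = w(old) + μ(ui - ûi)xi, which differs from the perceptron rule only in using the continuous ûi = <w,xi> instead of ůi = sign(ûi).

    def online_least_squares_step(w, xi, ui, mu=0.05):
        """One anti-gradient step of criterion (1) for a single entity (xi, ui); w and xi are NumPy arrays."""
        u_hat = xi @ w                        # continuous prediction ûi = <w, xi>
        return w + mu * (ui - u_hat) * xi     # the perceptron uses sign(ûi) in place of ûi here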


Neuron and artificial neuron


The decision rule ůi = sign(ûi) can be interpreted in terms of an artificial neuron: the features xi are input signals (from other neurons), the weights wt are the wiring (axon) features, the bias w0 is the firing threshold, and sign() is the neuron activation function. In this way, the perceptron is an example of nature-inspired computation.
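A minimal sketch of such an artificial neuron (assuming NumPy; names are illustrative):

    import numpy as np

    def artificial_neuron(x, w, w0, activation=np.sign):
        """Weighted input signals plus the bias (firing threshold), passed through the activation function."""
        return activation(x @ w + w0)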







This Figure gives a scheme of a real neuron.





An artificial neuron may have a different activation function. Two popular activation functions, besides the sign function ůi = sign(ûi), are the linear activation function ůi = ûi (we considered it when discussing the steepest descent) and the sigmoid activation function ůi = s(ûi), where s(x) = 1/(1 + exp(-x)), which is a smooth analogue of the sign function.
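These three activation functions, as a brief NumPy sketch (function names are illustrative):

    import numpy as np

    def sign_activation(z):
        return np.sign(z)                   # ů = sign(û): -1, 0 or +1

    def linear_activation(z):
        return z                            # ů = û: used with least squares / steepest descent

    def sigmoid_activation(z):
        return 1.0 / (1.0 + np.exp(-z))     # s(x) = 1/(1 + exp(-x)): a smooth analogue of sign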


Multilayered network


Neurons can be


Back-propagation