# Neural Network Training


Sargur Srihari

## Topics

- Neural network parameters
- Probabilistic problem formulation
- Determining the error function
  - Regression
  - Binary classification
  - Multi-class classification
- Parameter optimization

## Neural Network Parameters

Linear models for regression and classification can be represented as

$$y(\mathbf{x}, \mathbf{w}) = f\left( \sum_{j=1}^{M} w_j \phi_j(\mathbf{x}) \right)$$

which are linear combinations of basis functions $\phi_j(\mathbf{x})$.

In a neural network the basis functions $\phi_j(\mathbf{x})$ themselves depend on parameters. During training, these parameters are adjusted along with the coefficients $w_j$.
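To make the contrast concrete, here is a minimal sketch (the tanh hidden activation, identity output, and all shapes are illustrative assumptions, not from the slides) of a linear model with fixed basis functions next to a two-layer network whose basis functions are adaptive:

```python
import numpy as np

def linear_model(x, w, phis):
    """y(x, w) = sum_j w_j * phi_j(x): a linear combination of fixed basis functions."""
    return sum(w_j * phi(x) for w_j, phi in zip(w, phis))

def two_layer_network(x, W1, W2):
    """The 'basis functions' tanh(W1 @ x) now contain adjustable parameters W1."""
    z = np.tanh(W1 @ x)    # adaptive basis functions, h = tanh (assumed choice)
    return W2 @ z          # linear combination with coefficients W2, f = identity

# Fixed polynomial basis phi_j(x) = x**j for the linear model
phis = [lambda x, j=j: x**j for j in range(3)]
print(linear_model(2.0, np.array([1.0, 0.5, 0.1]), phis))   # 1 + 0.5*2 + 0.1*4

rng = np.random.default_rng(0)
D, M, K = 4, 3, 2                                    # inputs, hidden units, outputs
W1, W2 = rng.normal(size=(M, D)), rng.normal(size=(K, M))
print(two_layer_network(rng.normal(size=D), W1, W2))
```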

## Network Training: Sum-of-Squared Errors

A neural network performs a transformation from a vector $\mathbf{x}$ of input variables to a vector $\mathbf{y}$ of output variables.

To determine $\mathbf{w}$, a simple analogy with polynomial curve fitting is to minimize a sum-of-squares error function. Given a set of input vectors $\{\mathbf{x}_n\}$, $n = 1, \ldots, N$, and target vectors $\{\mathbf{t}_n\}$, minimize the error function

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\| \mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n \right\|^2$$

For a two-layer network the outputs are

$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i \right) \right)$$

where $D$ is the number of input variables, $M$ the number of hidden units, and $N$ the number of training vectors.

We now consider a more general probabilistic interpretation.
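A minimal sketch (assumed shapes and activations, not the lecture's code) of evaluating this sum-of-squares error for a two-layer network with tanh hidden units and identity outputs:

```python
import numpy as np

def forward(X, W1, W2):
    """Two-layer network outputs y(x_n, w): h = tanh, output sigma = identity."""
    return np.tanh(X @ W1.T) @ W2.T              # shape (N, K)

def sum_of_squares_error(X, T, W1, W2):
    """E(w) = 1/2 * sum_n ||y(x_n, w) - t_n||^2"""
    Y = forward(X, W1, W2)
    return 0.5 * np.sum((Y - T) ** 2)

rng = np.random.default_rng(0)
N, D, M, K = 100, 4, 3, 2                # training vectors, inputs, hidden, outputs
X, T = rng.normal(size=(N, D)), rng.normal(size=(N, K))
W1, W2 = rng.normal(size=(M, D)), rng.normal(size=(K, M))
print(sum_of_squares_error(X, T, W1, W2))
```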

## Probabilistic View: From Activation Function to Error Function

From the output activation function $f$ we determine the error function $E$ (as defined by the likelihood function):

1. **Regression**
   - $f$: identity, $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) = \sum_{j=1}^{M} w_j \phi_j(\mathbf{x})$
   - $E$: sum-of-squares error / maximum likelihood
     $$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(\mathbf{x}_n, \mathbf{w}) - t_n \right)^2$$
2. **(Multiple independent) binary classifications**
   - $f$: logistic sigmoid, $y(\mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})) = \dfrac{1}{1 + \exp(-\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}))}$
   - $E$: cross-entropy error function
     $$E(\mathbf{w}) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}$$
3. **Multiclass classification**
   - $f$: softmax, $y_k(\mathbf{x}, \mathbf{w}) = \dfrac{\exp(\mathbf{w}_k^T \boldsymbol{\phi}(\mathbf{x}))}{\sum_j \exp(\mathbf{w}_j^T \boldsymbol{\phi}(\mathbf{x}))}$
   - $E$: cross-entropy error function
     $$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(\mathbf{x}_n, \mathbf{w})$$

## 1. Probabilistic View: Regression

The output is a single target variable $t$ that can take any real value. Assume $t$ is Gaussian distributed with an $\mathbf{x}$-dependent mean:

$$p(t \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}\left( t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1} \right)$$

The likelihood function is

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left( t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1} \right)$$

Taking the negative logarithm, we get the negative log-likelihood

$$\frac{\beta}{2} \sum_{n=1}^{N} \left( y(\mathbf{x}_n, \mathbf{w}) - t_n \right)^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi)$$

which can be used to learn the parameters $\mathbf{w}$ and $\beta$.
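As a minimal numerical sketch (toy targets and predictions, and an assumed $\beta = 1$), the negative log-likelihood above can be evaluated directly:

```python
import numpy as np

def gaussian_nll(y, t, beta):
    """beta/2 * sum_n (y_n - t_n)^2 - N/2 * ln(beta) + N/2 * ln(2*pi)"""
    n = len(t)
    return (beta / 2) * np.sum((y - t) ** 2) \
           - (n / 2) * np.log(beta) + (n / 2) * np.log(2 * np.pi)

t = np.array([0.9, 2.1, 2.9])    # targets t_n
y = np.array([1.0, 2.0, 3.0])    # network predictions y(x_n, w)
print(gaussian_nll(y, t, beta=1.0))
```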

## Regression Error Function

The likelihood function could be used to learn the parameters $\mathbf{w}$ and $\beta$; this is usually done in a Bayesian treatment. In the neural network literature, minimizing an error function is used instead. The two are equivalent here: discarding the $\ln \beta$ and $\ln(2\pi)$ terms, which do not depend on $\mathbf{w}$, leaves the sum-of-squares error

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(\mathbf{x}_n, \mathbf{w}) - t_n \right)^2$$

- Its smallest value occurs when $\nabla E(\mathbf{w}) = 0$.
- Since $E(\mathbf{w})$ is non-convex, the solution $\mathbf{w}_{ML}$ is found using iterative optimization $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta \mathbf{w}^{(\tau)}$, e.g., gradient descent (discussed later in this lecture); another approach is back-propagation.
- Since the regression output is the same as the activation, $y_k = a_k$, we have
  $$\frac{\partial E}{\partial a_k} = y_k - t_k, \qquad a_k = \sum_{i=1}^{M} w_{ki}^{(2)} x_i + w_{k0}^{(2)}, \quad k = 1, \ldots, K$$
- Having found $\mathbf{w}_{ML}$, the value of $\beta_{ML}$ can also be found using
  $$\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \left( y(\mathbf{x}_n, \mathbf{w}_{ML}) - t_n \right)^2$$
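A two-line sketch (toy residuals, not from the slides) of the $\beta_{ML}$ estimate, which is just the inverse of the mean squared residual at $\mathbf{w}_{ML}$:

```python
import numpy as np

t = np.array([0.9, 2.1, 2.9, 4.2])      # targets
y_ml = np.array([1.0, 2.0, 3.0, 4.0])   # predictions y(x_n, w_ML)
beta_ml = 1.0 / np.mean((y_ml - t) ** 2)  # 1/beta_ML = mean squared residual
print(beta_ml)
```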

## 2. Binary Classification

There is a single target variable $t$, where $t = 1$ denotes $C_1$ and $t = 0$ denotes $C_2$. Consider a network with a single output whose activation function is the logistic sigmoid

$$y = \sigma(a) = \frac{1}{1 + \exp(-a)}$$

so that $0 < y(\mathbf{x}, \mathbf{w}) < 1$. Interpret $y(\mathbf{x}, \mathbf{w})$ as the conditional probability $p(C_1 \mid \mathbf{x})$. The conditional distribution of targets given inputs is then

$$p(t \mid \mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})^t \left\{ 1 - y(\mathbf{x}, \mathbf{w}) \right\}^{1-t}$$

## Binary Classification Error Function

The error function is the negative log-likelihood, which in this case is a cross-entropy error function

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}$$

where $y_n$ denotes $y(\mathbf{x}_n, \mathbf{w})$. Using the cross-entropy error function instead of sum-of-squares leads to faster training and improved generalization.
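A minimal sketch (illustrative values; the small `eps` clamp is an added numerical-stability assumption, since $\ln 0$ diverges when an output saturates) of evaluating this cross-entropy:

```python
import numpy as np

def binary_cross_entropy(y, t, eps=1e-12):
    """E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }"""
    y = np.clip(y, eps, 1 - eps)          # avoid log(0) at saturated outputs
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

t = np.array([1, 0, 1, 1])                # binary labels t_n
y = np.array([0.9, 0.2, 0.7, 0.6])        # sigmoid outputs y(x_n, w)
print(binary_cross_entropy(y, t))
```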


## 2. K Separate Binary Classifications

The network has $K$ outputs, each with a logistic sigmoid activation function, and associated with each output is a binary class label $t_k$:

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}) = \prod_{k=1}^{K} y_k(\mathbf{x}, \mathbf{w})^{t_k} \left[ 1 - y_k(\mathbf{x}, \mathbf{w}) \right]^{1 - t_k}$$

Taking the negative logarithm of the likelihood function:

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \left\{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \right\}$$

where $y_{nk}$ denotes $y_k(\mathbf{x}_n, \mathbf{w})$.
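A minimal sketch (illustrative values) showing that for $K$ independent binary outputs the same cross-entropy applies elementwise over an $(N, K)$ matrix of sigmoid outputs:

```python
import numpy as np

def multi_label_cross_entropy(Y, T, eps=1e-12):
    """E(w) = -sum_n sum_k { t_nk ln y_nk + (1 - t_nk) ln(1 - y_nk) }"""
    Y = np.clip(Y, eps, 1 - eps)
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

T = np.array([[1, 0], [0, 1], [1, 1]])              # binary labels t_nk
Y = np.array([[0.8, 0.3], [0.2, 0.9], [0.6, 0.7]])  # sigmoid outputs y_nk
print(multi_label_cross_entropy(Y, T))
```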

## 3. Multiclass Classification

Each input is assigned to one of $K$ classes. The binary target variables $t_k \in \{0, 1\}$ have a 1-of-K coding scheme, and the network outputs are interpreted as

$$y_k(\mathbf{x}, \mathbf{w}) = p(t_k = 1 \mid \mathbf{x})$$

The error function is

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(\mathbf{x}_n, \mathbf{w})$$

The output unit activation function is given by the softmax

$$y_k(\mathbf{x}, \mathbf{w}) = \frac{\exp(a_k(\mathbf{x}, \mathbf{w}))}{\sum_j \exp(a_j(\mathbf{x}, \mathbf{w}))}$$
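A minimal sketch (illustrative activations; subtracting the row maximum before exponentiating is an added numerical-stability trick, not something stated on the slide) of softmax outputs and the multiclass cross-entropy:

```python
import numpy as np

def softmax(A):
    """y_k = exp(a_k) / sum_j exp(a_j), row-wise over activations A of shape (N, K)."""
    A = A - A.max(axis=1, keepdims=True)   # stabilizes exp without changing y
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(Y, T):
    """E(w) = -sum_n sum_k t_kn ln y_k(x_n, w), with T in 1-of-K coding."""
    return -np.sum(T * np.log(Y))

A = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])   # activations a_k(x_n, w)
T = np.array([[1, 0, 0], [0, 1, 0]])               # 1-of-K targets
print(multiclass_cross_entropy(softmax(A), T))
```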


## Parameter Optimization

The task is to find the weight vector $\mathbf{w}$ which minimizes the chosen error function $E(\mathbf{w})$. A geometrical picture of the error function helps: $E(\mathbf{w})$ has a highly nonlinear dependence on the weights.

## Parameter Optimization: Geometrical View

$E(\mathbf{w})$ can be viewed as a surface sitting over weight space, where $\mathbf{w}_A$ is a local minimum and $\mathbf{w}_B$ the global minimum; we need to find a minimum. At a point $\mathbf{w}_C$, the local gradient is given by the vector $\nabla E(\mathbf{w})$, which points in the direction of greatest rate of increase of $E(\mathbf{w})$; the negative gradient points in the direction of greatest decrease.

## Finding w Where E(w) Is Smallest

A small step from $\mathbf{w}$ to $\mathbf{w} + \delta\mathbf{w}$ changes the error function by

$$\delta E \simeq \delta\mathbf{w}^T \nabla E(\mathbf{w})$$

The minimum of $E(\mathbf{w})$ will occur when

$$\nabla E(\mathbf{w}) = 0$$

Points at which the gradient vanishes are stationary points: minima, maxima, and saddle points. Because the error surface is complex, there is no hope of finding an analytical solution to the equation $\nabla E(\mathbf{w}) = 0$.

## Iterative Numerical Procedure for Minima

Since there is no analytical solution, choose an initial $\mathbf{w}^{(0)}$ and update it using

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta\mathbf{w}^{(\tau)}$$

where $\tau$ is the iteration step. Different algorithms involve different choices for the weight vector update $\Delta\mathbf{w}^{(\tau)}$; many make use of gradient information $\nabla E(\mathbf{w})$ evaluated at the new weight vector $\mathbf{w}^{(\tau+1)}$. To understand the importance of gradient information, consider the Taylor series expansion of the error function.

## Discussion Overview: Preliminary Concepts for Backpropagation

1. **Local quadratic approximation**: provides insight into the optimization problem. Based on a Taylor series in which the error function is approximated; locating the minimum this way costs $O(W^3)$, where $W$ is the dimensionality of $\mathbf{w}$.
2. **Use of gradient information**: leads to significant improvements in the speed of locating minima of the error function; with backpropagation, the minimum can be located in $O(W^2)$ steps.
3. **Gradient descent**: the simplest approach to using gradient information.

## Taylor Series Expansion of E(w)

Expand $E(\mathbf{w})$ around some point $\hat{\mathbf{w}}$ in weight space (with cubic and higher-order terms omitted):

$$E(\mathbf{w}) \simeq E(\hat{\mathbf{w}}) + (\mathbf{w} - \hat{\mathbf{w}})^T \mathbf{b} + \frac{1}{2} (\mathbf{w} - \hat{\mathbf{w}})^T \mathbf{H} (\mathbf{w} - \hat{\mathbf{w}})$$

where $\mathbf{b} \equiv \nabla E \big|_{\mathbf{w} = \hat{\mathbf{w}}}$ is the gradient of $E$ evaluated at $\hat{\mathbf{w}}$, a vector of $W$ elements, and $\mathbf{H} = \nabla\nabla E$ is the $W \times W$ Hessian matrix with elements

$$(\mathbf{H})_{ij} = \frac{\partial^2 E}{\partial w_i \, \partial w_j} \bigg|_{\mathbf{w} = \hat{\mathbf{w}}}$$

Around $\mathbf{w}^*$, a minimum of the error function, the linear term vanishes and

$$E(\mathbf{w}) \simeq E(\mathbf{w}^*) + \frac{1}{2} (\mathbf{w} - \mathbf{w}^*)^T \mathbf{H} (\mathbf{w} - \mathbf{w}^*)$$

where $\mathbf{H}$ is evaluated at $\mathbf{w}^*$.

Let us interpret this geometrically. Consider the eigenvalue equation for the Hessian matrix,

$$\mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i, \qquad \mathbf{u}_i^T \mathbf{u}_j = \delta_{ij}$$

where the eigenvectors $\mathbf{u}_i$ are orthonormal, and expand $(\mathbf{w} - \mathbf{w}^*)$ as a linear combination of the eigenvectors:

$$\mathbf{w} - \mathbf{w}^* = \sum_i \alpha_i \mathbf{u}_i$$
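A minimal numerical sketch (toy Hessian, not from the slides) of this geometric picture: diagonalize $\mathbf{H}$, expand $\mathbf{w} - \mathbf{w}^*$ in its eigenvectors, and verify that the quadratic term equals $\frac{1}{2}\sum_i \lambda_i \alpha_i^2$:

```python
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # a symmetric, positive-definite Hessian
lam, U = np.linalg.eigh(H)          # H u_i = lambda_i u_i, columns of U orthonormal

dw = np.array([0.5, -0.2])          # w - w*
alpha = U.T @ dw                    # coefficients alpha_i = u_i^T (w - w*)

quad_direct = 0.5 * dw @ H @ dw
quad_eigen = 0.5 * np.sum(lam * alpha ** 2)
print(quad_direct, quad_eigen)      # equal: same quadratic form
print(np.all(lam > 0))              # True => H positive definite => a minimum
```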

## Neighborhood of a Minimum w*

The expansion in $\mathbf{w} - \mathbf{w}^*$ is a coordinate transformation: the origin is translated to $\mathbf{w}^*$ and the axes are rotated to align with the eigenvectors of the Hessian. The error function can then be written as

$$E(\mathbf{w}) = E(\mathbf{w}^*) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2$$

The matrix $\mathbf{H}$ is positive definite iff

$$\mathbf{v}^T \mathbf{H} \mathbf{v} > 0 \quad \text{for all } \mathbf{v}$$

Since the eigenvectors form a complete set, an arbitrary vector $\mathbf{v}$ can be written as

$$\mathbf{v} = \sum_i c_i \mathbf{u}_i, \qquad \text{so that} \qquad \mathbf{v}^T \mathbf{H} \mathbf{v} = \sum_i c_i^2 \lambda_i$$

The stationary point $\mathbf{w}^*$ will be a minimum if the Hessian matrix is positive definite, i.e., all its eigenvalues are positive. Contours of constant error are ellipses whose axis lengths are inversely proportional to the square roots of the eigenvalues.

## Condition for a Point w* to Be a Minimum

For a one-dimensional weight space, a stationary point $w^*$ will be a minimum if

$$\frac{\partial^2 E}{\partial w^2} \bigg|_{w^*} > 0$$

The corresponding result in $D$ dimensions is that the Hessian matrix evaluated at $\mathbf{w}^*$ must be positive definite. A matrix $\mathbf{H}$ is positive definite iff $\mathbf{v}^T \mathbf{H} \mathbf{v} > 0$ for all $\mathbf{v}$.

## Quadratic Approximation: Computational Effort

Recall the quadratic approximation

$$E(\mathbf{w}) \simeq E(\hat{\mathbf{w}}) + (\mathbf{w} - \hat{\mathbf{w}})^T \mathbf{b} + \frac{1}{2} (\mathbf{w} - \hat{\mathbf{w}})^T \mathbf{H} (\mathbf{w} - \hat{\mathbf{w}})$$

where $\mathbf{b} \equiv \nabla E \big|_{\mathbf{w} = \hat{\mathbf{w}}}$ is a vector of $W$ elements and $\mathbf{H} = \nabla\nabla E$ is the $W \times W$ Hessian matrix.

The error surface is specified by $\mathbf{b}$ and $\mathbf{H}$, which together contain $W(W+3)/2$ independent elements, where $W$ is the total number of adaptive parameters in the network. The minimum therefore depends on $O(W^2)$ parameters, and we need to perform $O(W^2)$ function evaluations, each requiring $O(W)$ steps. The computational effort needed is thus $O(W^3)$. For example, a 10 x 10 x 10 network has $W$ = 100 + 100 = 200 weights, which means about $200^3$ = 8 million steps.
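A tiny sketch of this arithmetic (illustrative; biases are ignored, as on the slide) for the 10 x 10 x 10 example:

```python
# Parameter counts for a 10 x 10 x 10 network, ignoring biases.
D, M, K = 10, 10, 10
W = D * M + M * K            # 100 + 100 = 200 adaptive weights
print(W * (W + 3) // 2)      # 20300 independent elements in b and H
print(W ** 3)                # 8,000,000 steps without gradient information
```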

## Use of Gradient Information

The gradient of the error function can be evaluated efficiently using back-propagation, leading to significant improvements in the speed with which the minimum of the error function can be located. In the quadratic approximation to the error function, the computational effort needed without gradients is $O(W^3)$; by using gradient information, the minimum can be found in $O(W^2)$ steps.

## Gradient Descent

The simplest approach to using gradient information is to take a small step in the direction of the negative gradient:

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \, \nabla E\left( \mathbf{w}^{(\tau)} \right)$$

where $\eta > 0$ is the learning rate. There are batch and on-line versions.
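A minimal sketch (toy quadratic objective and an assumed learning rate, not the lecture's code) of the batch gradient-descent update:

```python
import numpy as np

def grad_E(w, H, b):
    """Gradient of the quadratic E(w) = 1/2 w^T H w + b^T w."""
    return H @ w + b

H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 0.5])
w = np.zeros(2)                        # initial w^(0)
eta = 0.1                              # learning rate (illustrative choice)

for tau in range(100):                 # w^(tau+1) = w^(tau) - eta * grad E(w^(tau))
    w = w - eta * grad_E(w, H, b)

print(w, np.linalg.solve(H, -b))       # converges to the analytic minimum -H^{-1} b
```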

## Summary

- Neural networks have many parameters, which can be determined analogously to linear regression parameters.
- The probabilistic formulation leads to appropriate error functions for regression, binary classification, and multi-class classification.
- Parameter optimization can be viewed as minimizing the error function in weight space; at a minimum, the Hessian is positive definite.
- Without gradient information, optimization takes $O(W^3)$ steps; using gradients, an $O(W^2)$ algorithm can be designed.