Neural Network Training



Neural Network Training
Sargur Srihari

Topics

- Neural network parameters
- Probabilistic problem formulation
- Determining the error function
  - Regression
  - Binary classification
  - Multi-class classification
- Parameter optimization
- Local quadratic approximation
- Use of gradient information
- Gradient descent optimization

Neural Network Parameters

- Linear models for regression and classification can be represented as

    y(x, w) = f( \sum_{j=1}^{M} w_j \phi_j(x) )

  which are linear combinations of basis functions \phi_j(x)
- In a neural network the basis functions \phi_j(x) themselves depend on parameters
- During training these parameters are adjusted along with the coefficients w_j

Network Training: Sum-of-Squared Errors

- Neural networks perform a transformation from a vector x of input variables to a vector y of output variables
- To determine w, a simple analogy with polynomial curve fitting is to minimize a sum-of-squares error function
- Given a set of input vectors {x_n}, n = 1,..,N, and target vectors {t_n}, minimize the error function

    E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2

  where, for a two-layer network,

    y_k(x, w) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i \right) \right)

  with D input variables, M hidden units, and N training vectors
- Next, consider a more general probabilistic interpretation
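As a concrete illustration of these formulas, here is a minimal NumPy sketch of the two-layer forward pass and the sum-of-squares error; taking h = tanh, the output activation as the identity (a regression setting), and the array shapes below are assumptions for the example, not anything fixed by the slides.

```python
import numpy as np

def forward(x, W1, W2, h=np.tanh, sigma=lambda a: a):
    """Two-layer net: y_k(x,w) = sigma( sum_j W2[k,j] * h( sum_i W1[j,i] * x_i ) )."""
    z = h(W1 @ x)          # hidden-unit activations, shape (M,)
    return sigma(W2 @ z)   # outputs, shape (K,)

def sum_of_squares_error(X, T, W1, W2):
    """E(w) = 1/2 * sum_n || y(x_n, w) - t_n ||^2 over the N training vectors."""
    return 0.5 * sum(np.sum((forward(x, W1, W2) - t) ** 2) for x, t in zip(X, T))

# Tiny example with D=3 inputs, M=4 hidden units, K=2 outputs, N=5 training vectors
rng = np.random.default_rng(0)
D, M, K, N = 3, 4, 2, 5
W1, W2 = rng.normal(size=(M, D)), rng.normal(size=(K, M))
X, T = rng.normal(size=(N, D)), rng.normal(size=(N, K))
print(sum_of_squares_error(X, T, W1, W2))
```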

Probabilistic View: From the activation function f, determine the error function E (as defined by the likelihood function)

1. Regression
   - f: activation function is the identity

       y(x, w) = w^T \phi(x) = \sum_{j=1}^{M} w_j \phi_j(x)

   - E: sum-of-squares error / maximum likelihood

       E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2

2. (Multiple independent) binary classifications
   - f: activation function is the logistic sigmoid

       y(x, w) = \sigma(w^T \phi(x)) = \frac{1}{1 + \exp(-w^T \phi(x))}

   - E: cross-entropy error function

       E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}

3. Multiclass classification
   - f: softmax outputs

       y_k(x, w) = \frac{\exp(w_k^T \phi(x))}{\sum_j \exp(w_j^T \phi(x))}

   - E: cross-entropy error function

       E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)

1. Probabilistic View: Regression

- Output is a single target variable t that can take any real value
- Assume t is Gaussian distributed with an x-dependent mean

    p(t | x, w) = \mathcal{N}(t \mid y(x, w), \beta^{-1})

- Likelihood function

    p(\mathbf{t} \mid \mathbf{x}, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1})

- Taking the negative logarithm, we get the negative log-likelihood

    \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi)

  which can be used to learn the parameters w and \beta



Regression Error Function

- The likelihood function could be used to learn the parameters w and \beta
  - This is usually done in a Bayesian treatment
- In the neural network literature, minimizing an error function is used instead
  - The two are equivalent here; the sum-of-squares error is

      E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2

- Its smallest value occurs where \nabla E(w) = 0
- Since E(w) is non-convex:
  - The solution w_{ML} is found using iterative optimization, w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}
    - e.g., gradient descent (discussed later in this lecture)
  - Back-propagation provides an efficient way to evaluate the required gradients
- Since the regression output equals the activation, y_k = a_k, we have

    \frac{\partial E}{\partial a_k} = y_k - t_k,  where  a_k = \sum_{i=1}^{M} w_{ki}^{(2)} x_i + w_{k0}^{(2)},  k = 1,..,K

- Having found w_{ML}, the value of \beta_{ML} can also be found using

    \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, w_{ML}) - t_n \}^2
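As a small illustration of the two closing formulas, here is a NumPy sketch that computes the output-unit gradient y_k - t_k and the estimate of \beta_{ML} from residuals; the `predict` function below is a hypothetical stand-in for an already-trained network y(x, w_ML), introduced only for this example.

```python
import numpy as np

def output_gradient(y, t):
    """For identity output activations, dE/da_k = y_k - t_k."""
    return y - t

def beta_ml(predict, X, T):
    """1/beta_ML = (1/N) sum_n { y(x_n, w_ML) - t_n }^2 for a real-valued target."""
    residuals = np.array([predict(x) - t for x, t in zip(X, T)])
    return 1.0 / np.mean(residuals ** 2)

# Example with a trivial stand-in predictor (hypothetical trained model y(x, w_ML))
predict = lambda x: np.sum(x)
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
T = np.array([3.1, -0.4, 2.2])
print(beta_ml(predict, X, T))
```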

2. Binary Classification

- Single target variable t, where t = 1 denotes class C_1 and t = 0 denotes class C_2
- Consider a network with a single output whose activation function is the logistic sigmoid

    y = \sigma(a) = \frac{1}{1 + \exp(-a)}

  so that 0 < y(x, w) < 1
- Interpret y(x, w) as the conditional probability p(C_1 | x)
- The conditional distribution of targets given inputs is

    p(t | x, w) = y(x, w)^{t} \{ 1 - y(x, w) \}^{1 - t}

Binary Classification Error Function

- The error function is the negative log-likelihood, which in this case is a cross-entropy error function

    E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}

  where y_n denotes y(x_n, w)
- Using the cross-entropy error function instead of sum-of-squares leads to faster training and improved generalization
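A minimal NumPy sketch of the cross-entropy error above; the small epsilon clamp is an implementation detail added here to avoid log(0), not something from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def binary_cross_entropy(y, t, eps=1e-12):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ]."""
    y = np.clip(y, eps, 1.0 - eps)   # guard against log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

# Example: network activations a_n mapped through the logistic sigmoid
a = np.array([2.0, -1.0, 0.3])
t = np.array([1.0, 0.0, 1.0])
print(binary_cross_entropy(sigmoid(a), t))
```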


2. K Separate Binary Classifications

- The network has K outputs, each with a logistic sigmoid activation function
- Associated with each output is a binary class label t_k, so the conditional distribution of the targets is

    p(\mathbf{t} | x, w) = \prod_{k=1}^{K} y_k(x, w)^{t_k} [ 1 - y_k(x, w) ]^{1 - t_k}

- Taking the negative logarithm of the likelihood function gives

    E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \}

  where y_{nk} denotes y_k(x_n, w)

3. Multiclass Classification

- Each input is assigned to one of K classes
- The binary target variables t_k \in \{0, 1\} follow a 1-of-K coding scheme
- The network outputs are interpreted as

    y_k(x, w) = p(t_k = 1 | x)

- This leads to the following error function

    E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)

- The output-unit activation function is given by the softmax

    y_k(x, w) = \frac{\exp(a_k(x, w))}{\sum_j \exp(a_j(x, w))}

  (compare the linear-model form  y_k(x, w) = \frac{\exp(w_k^T \phi(x))}{\sum_j \exp(w_j^T \phi(x))})
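A brief NumPy sketch of the softmax output and the multiclass cross-entropy error; subtracting the maximum activation before exponentiating is a standard numerical-stability trick added here, not part of the slides.

```python
import numpy as np

def softmax(a):
    """y_k = exp(a_k) / sum_j exp(a_j), computed stably by shifting the activations."""
    a = a - np.max(a)
    e = np.exp(a)
    return e / np.sum(e)

def multiclass_cross_entropy(Y, T, eps=1e-12):
    """E(w) = -sum_n sum_k t_kn * ln y_k(x_n, w), with 1-of-K coded targets T."""
    return -np.sum(T * np.log(np.clip(Y, eps, 1.0)))

# Example: two inputs, K = 3 classes
A = np.array([[1.0, 2.0, 0.5], [0.1, -0.3, 1.2]])   # output-unit activations a_k
T = np.array([[0, 1, 0], [0, 0, 1]])                # 1-of-K targets
Y = np.apply_along_axis(softmax, 1, A)
print(multiclass_cross_entropy(Y, T))
```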


Parameter Optimization

- Task: find the weight vector w which minimizes the chosen error function E(w)
- A geometrical picture of the error function is helpful
  - The error function has a highly nonlinear dependence on the weights

Parameter Optimization: Geometrical View

- E(w) is a surface sitting over weight space
  - w_A: a local minimum
  - w_B: the global minimum
  - We need to find such a minimum
- At a point w_C the local gradient is given by the vector \nabla E(w)
  - It points in the direction of greatest rate of increase of E(w)
  - The negative gradient points in the direction of greatest rate of decrease

Finding w where E(w) is Smallest

- A small step from w to w + \delta w leads to a change in the error function

    \delta E \approx \delta w^T \nabla E(w)

- A minimum of E(w) will occur when

    \nabla E(w) = 0

- Points at which the gradient vanishes are stationary points: minima, maxima, and saddle points
- The error surface is complex, so there is no hope of finding an analytical solution to the equation \nabla E(w) = 0
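To make the first-order relation concrete, here is a small NumPy check, on an arbitrary smooth function chosen purely for illustration, that a small step \delta w changes E by approximately \delta w^T \nabla E(w).

```python
import numpy as np

# Illustrative error function and its gradient (a simple quadratic-plus-quartic bowl)
E = lambda w: 0.5 * np.sum(w ** 2) + 0.1 * np.sum(w ** 4)
grad_E = lambda w: w + 0.4 * w ** 3

rng = np.random.default_rng(1)
w = rng.normal(size=5)
dw = 1e-4 * rng.normal(size=5)       # a small step delta-w

actual = E(w + dw) - E(w)            # true change in the error
first_order = dw @ grad_E(w)         # delta-w^T grad E(w)
print(actual, first_order)           # the two agree to first order
```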

Iterative Numerical Procedure for Minima

- Since there is no analytical solution, choose an initial w^{(0)} and update it using

    w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}

  where \tau is the iteration step
- Different algorithms involve different choices for the weight-vector update \Delta w^{(\tau)}
- The update is usually based on the gradient \nabla E(w) evaluated at the weight vector w^{(\tau)}
- To understand the importance of gradient information, consider a Taylor series expansion of the error function
  - This leads to a local quadratic approximation

Discussion Overview: Preliminary Concepts for Backpropagation

1. Local quadratic approximation
   - Provides insight into the optimization problem
   - O(W^3), where W is the dimensionality of w
   - Based on a Taylor series in which E(w) is approximated to second order
2. Use of gradient information
   - Leads to significant improvements in the speed of locating minima of the error function
   - Backpropagation is O(W^2)
3. Gradient descent optimization
   - The simplest approach to using gradient information

1. Local Quadratic Approximation

- Taylor series expansion of E(w) around some point \hat{w} in weight space (with cubic and higher-order terms omitted):

    E(w) \approx E(\hat{w}) + (w - \hat{w})^T b + \frac{1}{2} (w - \hat{w})^T H (w - \hat{w})

  where b is the gradient of E evaluated at \hat{w}

    b \equiv \nabla E |_{w = \hat{w}}

  (a vector of W elements), and H = \nabla\nabla E is the Hessian matrix with elements

    (H)_{ij} = \frac{\partial^2 E}{\partial w_i \, \partial w_j} \Big|_{w = \hat{w}}

  (a W x W matrix)
- Consider the local quadratic approximation around w*, a minimum of the error function:

    E(w) \approx E(w^*) + \frac{1}{2} (w - w^*)^T H (w - w^*)

  where H is evaluated at w* and the linear term vanishes
- To interpret this geometrically, consider the eigenvalue equation for the Hessian matrix

    H u_i = \lambda_i u_i,   with orthonormal eigenvectors  u_i^T u_j = \delta_{ij}

- Expand (w - w*) as a linear combination of the eigenvectors:

    w - w^* = \sum_i \alpha_i u_i
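As a sketch of these formulas, the code below builds the local quadratic form around the minimum of a small quadratic error surface chosen purely for illustration, eigendecomposes its Hessian, and expresses w - w* in the eigenvector basis.

```python
import numpy as np

# Illustrative error surface E(w) = 1/2 (w - w*)^T H (w - w*) + const, with known minimum w*
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # Hessian (symmetric, positive definite)
w_star = np.array([1.0, -2.0])
E = lambda w: 0.5 * (w - w_star) @ H @ (w - w_star) + 5.0

# Eigenvalue equation H u_i = lambda_i u_i, with orthonormal eigenvectors u_i
lam, U = np.linalg.eigh(H)

# Expand w - w* in the eigenvector basis: alpha_i = u_i^T (w - w*)
w = np.array([1.5, -1.0])
alpha = U.T @ (w - w_star)

# The quadratic form reproduces E(w): E(w*) + 1/2 sum_i lambda_i alpha_i^2
print(E(w), E(w_star) + 0.5 * np.sum(lam * alpha ** 2))
```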

Neighborhood of a Minimum w*

- w - w* is a coordinate transformation
  - The origin is translated to w*
  - The axes are rotated to align with the eigenvectors of the Hessian
- The error function can then be written as

    E(w) = E(w^*) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2

- The matrix H is positive definite iff

    v^T H v > 0  for all  v

- Since the eigenvectors form a complete set, an arbitrary vector v can be written as

    v = \sum_i c_i u_i   so that   v^T H v = \sum_i c_i^2 \lambda_i

- The stationary point w* will be a minimum if the Hessian matrix is positive definite (equivalently, all its eigenvalues are positive)
- With the error function approximated by this quadratic:
  - Contours of constant error are ellipses
  - Their axis lengths are inversely proportional to the square roots of the eigenvalues

Condition for a Point w* to be a Minimum

- For a one-dimensional weight space, a stationary point w* will be a minimum if

    \frac{\partial^2 E}{\partial w^2} \Big|_{w^*} > 0

- The corresponding result in D dimensions is that the Hessian matrix evaluated at w* is positive definite
- A matrix H is positive definite iff v^T H v > 0 for all v

Complexity of the Quadratic Approximation

- Recall the quadratic approximation

    E(w) \approx E(\hat{w}) + (w - \hat{w})^T b + \frac{1}{2} (w - \hat{w})^T H (w - \hat{w})

  where b \equiv \nabla E |_{w = \hat{w}} is the gradient of E evaluated at \hat{w} (a vector of W elements) and H = \nabla\nabla E is the Hessian matrix (a W x W matrix)
- The error surface is specified by b and H
  - Together they contain a total of W(W+3)/2 independent elements
  - W is the total number of adaptive parameters in the network
- The minimum depends on O(W^2) parameters
  - We would need to perform O(W^2) function evaluations, each requiring O(W) steps
  - So the computational effort needed is O(W^3)
- Example: a 10 x 10 x 10 network has W = 100 + 100 = 200 weights, which means roughly 200^3 = 8 million steps
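A worked check of these counts for the 10 x 10 x 10 example (ignoring bias parameters, as the slide does):

    W = D M + M K = 10 \cdot 10 + 10 \cdot 10 = 200, \qquad
    \frac{W(W+3)}{2} = \frac{200 \cdot 203}{2} = 20300, \qquad
    W^3 = 200^3 = 8 \times 10^6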

2. Use of Gradient Information

- The gradient of the error function can be evaluated efficiently using back-propagation
- Using gradient information can lead to significant improvements in the speed with which a minimum of the error function can be located
- In the quadratic approximation to the error function, the computational effort needed without gradient information is O(W^3)
- By using gradient information, the minimum can be found in O(W^2) steps

3. Gradient Descent Optimization

- The simplest approach to using gradient information
- Take a small step in the direction of the negative gradient

    w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)})

  where \eta is the learning rate
- There are batch and on-line versions
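A minimal sketch of batch gradient descent on an illustrative error function; the error function, learning rate, and fixed step count below are assumptions for the example, not anything fixed by the slides.

```python
import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, n_steps=100):
    """Batch gradient descent: w^(tau+1) = w^(tau) - eta * grad E(w^(tau))."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w -= eta * grad_E(w)
    return w

# Illustrative quadratic error E(w) = 1/2 (w - w*)^T H (w - w*), whose gradient is H (w - w*)
H = np.array([[3.0, 1.0], [1.0, 2.0]])
w_star = np.array([1.0, -2.0])
grad_E = lambda w: H @ (w - w_star)

print(gradient_descent(grad_E, w0=[0.0, 0.0]))   # converges toward w* = [1, -2]
```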

Summary

- Neural networks have many parameters
  - These can be determined in a manner analogous to linear regression parameters
- The probabilistic formulation leads to appropriate error functions for regression, binary, and multi-class classification
- Parameter optimization can be viewed as minimizing an error function in weight space
  - At a minimum the Hessian is positive definite
- Local quadratic optimization needs O(W^3) steps
- Using gradient information, a more efficient O(W^2) algorithm can be designed