Learning Neural Networks
Neural Networks can represent complex decision boundaries
– Variable size. Any boolean function can be represented. Hidden units can be interpreted as new features
– Deterministic
– Continuous Parameters
Learning Algorithms for neural networks
– Local Search. The same algorithm as for sigmoid threshold units
– Eager
– Batch or Online
Neural Network Hypothesis Space
Each unit $a_6$, $a_7$, $a_8$, and $\hat{y}$ computes a sigmoid function of its inputs:
$a_6 = \sigma(W_6 \cdot X)$, $a_7 = \sigma(W_7 \cdot X)$, $a_8 = \sigma(W_8 \cdot X)$
$\hat{y} = \sigma(W_9 \cdot A)$
where $A = [1, a_6, a_7, a_8]$ is called the vector of hidden unit activations.
Original motivation: Differentiable approximation to multi-layer LTUs
[Figure: network with inputs x1, x2, x3, x4; hidden units a6, a7, a8 with weight vectors W6, W7, W8; output ŷ with weight vector W9]
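The forward computation of this network can be sketched in a few lines of plain Python. The weight values are made up for illustration, and the leading 1 prepended to X (so that each hidden unit has a bias weight) is an assumption mirroring the leading 1 in A.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    """Dot product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def forward(W6, W7, W8, W9, x):
    """Return (y_hat, A) for the 4-input, 3-hidden-unit network."""
    X = [1.0] + list(x)                       # assumed: leading 1 for the bias weight
    A = [1.0] + [sigmoid(dot(W, X)) for W in (W6, W7, W8)]
    return sigmoid(dot(W9, A)), A

# Illustrative weights (not from the slides).
W6 = [0.1, 0.2, -0.3, 0.4, 0.0]
W7 = [-0.2, 0.1, 0.1, -0.1, 0.3]
W8 = [0.0, -0.4, 0.2, 0.2, -0.1]
W9 = [0.05, 0.5, -0.5, 0.25]
y_hat, A = forward(W6, W7, W8, W9, [1.0, 0.0, 1.0, 0.0])
print(y_hat, A)
```

Every activation, including the output, lies strictly between 0 and 1 because each unit ends in a sigmoid.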
Representational Power
Any Boolean Formula
– Consider a formula in disjunctive normal form:
  $(x_1 \wedge \neg x_2) \vee (x_2 \wedge x_4) \vee (\neg x_3 \wedge x_5)$
  Each AND can be represented by a hidden unit and the OR can be represented by the output unit. Arbitrary boolean functions can require exponentially many hidden units, however.
Bounded functions
– Suppose we make the output linear: $\hat{y} = W_9 \cdot A$, a linear combination of the hidden units.
  It can be proved that any bounded continuous function can be approximated to arbitrary accuracy with enough hidden units.
Arbitrary Functions
– Any function can be approximated to arbitrary accuracy with two hidden layers of sigmoid units and a linear output unit.
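As a rough check of the DNF construction, the sketch below hand-sets the weights so that each hidden unit acts as one AND term and the output unit acts as the OR. It then compares the thresholded output with the formula on all 32 inputs. The steepness constant S and the helper and_unit are assumptions for this example, not part of the slides.

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

S = 20.0  # assumed steepness; large so each sigmoid acts almost like a threshold

def and_unit(pos, neg, n_inputs=5):
    """Weights [bias, w1..wn] for an AND of positive and negated literals (1-based)."""
    w = [0.0] * (n_inputs + 1)
    for j in pos:
        w[j] = S
    for j in neg:
        w[j] = -S
    w[0] = -S * (len(pos) - 0.5)   # fires only when every literal is satisfied
    return w

# One hidden unit per conjunction of the formula.
hidden = [and_unit([1], [2]), and_unit([2, 4], []), and_unit([5], [3])]
output = [-0.5 * S] + [S] * len(hidden)   # OR: fires when any hidden unit fires

def network(x):
    X = [1.0] + list(x)
    A = [1.0] + [sigmoid(sum(w * v for w, v in zip(W, X))) for W in hidden]
    return sigmoid(sum(w * a for w, a in zip(output, A)))

def formula(x1, x2, x3, x4, x5):
    return (x1 and not x2) or (x2 and x4) or (not x3 and x5)

agree = all((network(x) > 0.5) == bool(formula(*x))
            for x in product([0, 1], repeat=5))
print(agree)
```

The same recipe needs one hidden unit per conjunction, which is why a formula with exponentially many terms needs exponentially many hidden units.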
Fixed versus Variable Size
In principle, a network has a fixed number of parameters and therefore can only represent a fixed hypothesis space (if the number of hidden units is fixed).
However, we will initialize the weights to values near zero and use gradient descent. The more steps of gradient descent we take, the more functions can be “reached” from the starting weights.
So it turns out to be more accurate to treat networks as having a variable hypothesis space that depends on the number of steps of gradient descent.
Backpropagation: Gradient Descent for Multi-Layer Networks
It is traditional to train neural networks to minimize the squared error. This is really a mistake: they should be trained to maximize the log likelihood instead. But we will study the MSE first.
$\hat{y} = \sigma(W_9 \cdot [1, \sigma(W_6 \cdot X), \sigma(W_7 \cdot X), \sigma(W_8 \cdot X)])$
$J_i(W) = \frac{1}{2} (\hat{y}_i - y_i)^2$
We must apply the chain rule many times to compute the gradient.
We will number the units from 0 to U and index them by u and v. $w_{v,u}$ will be the weight connecting unit u to unit v. (Note: This seems backwards. It is the u-th input to node v.)
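Derivation: Output Unit
Suppose $w_{9,6}$ is the component of $W_9$, the output weight vector, that connects $a_6$ to the output unit.
$$\begin{aligned}
\frac{\partial J_i(W)}{\partial w_{9,6}} &= \frac{\partial}{\partial w_{9,6}} \frac{1}{2} (\hat{y}_i - y_i)^2 \\
&= \frac{1}{2} \cdot 2 \cdot (\hat{y}_i - y_i) \cdot \frac{\partial}{\partial w_{9,6}} \left( \sigma(W_9 \cdot A_i) - y_i \right) \\
&= (\hat{y}_i - y_i) \cdot \sigma(W_9 \cdot A_i)(1 - \sigma(W_9 \cdot A_i)) \cdot \frac{\partial}{\partial w_{9,6}} W_9 \cdot A_i \\
&= (\hat{y}_i - y_i)\, \hat{y}_i (1 - \hat{y}_i) \cdot a_6
\end{aligned}$$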
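One way to sanity-check the final expression is to compare it with a central finite difference of $J_i$. The sketch below does this for $w_{9,6}$; the activations, weights, target and step size eps are illustrative values, not from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(W9, A, y):
    """J_i = 1/2 (y_hat - y)^2 with y_hat = sigma(W9 . A)."""
    y_hat = sigmoid(sum(w * a for w, a in zip(W9, A)))
    return 0.5 * (y_hat - y) ** 2

A = [1.0, 0.7, 0.2, 0.9]      # [1, a6, a7, a8], illustrative values
W9 = [0.1, -0.4, 0.3, 0.8]    # [w_{9,0}, w_{9,6}, w_{9,7}, w_{9,8}], illustrative
y = 1.0

# Analytic gradient from the derivation: (y_hat - y) y_hat (1 - y_hat) a6.
y_hat = sigmoid(sum(w * a for w, a in zip(W9, A)))
analytic = (y_hat - y) * y_hat * (1.0 - y_hat) * A[1]

# Central finite difference on w_{9,6} (index 1 of W9).
eps = 1e-6
W_plus = list(W9)
W_plus[1] += eps
W_minus = list(W9)
W_minus[1] -= eps
numeric = (loss(W_plus, A, y) - loss(W_minus, A, y)) / (2 * eps)
print(analytic, numeric)
```

The two numbers should agree to many decimal places; a large gap usually points to a sign or indexing slip in the derivation.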
The Delta Rule
Define
$\delta_9 = (\hat{y}_i - y_i)\, \hat{y}_i (1 - \hat{y}_i)$
then
$\frac{\partial J_i(W)}{\partial w_{9,6}} = (\hat{y}_i - y_i)\, \hat{y}_i (1 - \hat{y}_i) \cdot a_6 = \delta_9 \cdot a_6$
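Derivation: Hidden Units
$$\begin{aligned}
\frac{\partial J_i(W)}{\partial w_{6,2}} &= (\hat{y}_i - y_i) \cdot \sigma(W_9 \cdot A_i)(1 - \sigma(W_9 \cdot A_i)) \cdot \frac{\partial}{\partial w_{6,2}} W_9 \cdot A_i \\
&= \delta_9 \cdot w_{9,6} \cdot \frac{\partial}{\partial w_{6,2}} \sigma(W_6 \cdot X) \\
&= \delta_9 \cdot w_{9,6} \cdot \sigma(W_6 \cdot X)(1 - \sigma(W_6 \cdot X)) \cdot \frac{\partial}{\partial w_{6,2}} (W_6 \cdot X) \\
&= \delta_9 \cdot w_{9,6}\, a_6 (1 - a_6) \cdot x_2
\end{aligned}$$
Define
$\delta_6 = \delta_9 \cdot w_{9,6}\, a_6 (1 - a_6)$
and rewrite as
$\frac{\partial J_i(W)}{\partial w_{6,2}} = \delta_6\, x_2$.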
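The hidden-unit result can be checked the same way, by perturbing $w_{6,2}$ and comparing the finite difference with $\delta_6 x_2$. This is a minimal sketch; the inputs and weights are made up, and X is assumed to carry a leading 1 for the bias.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

def forward(hidden_W, W9, X):
    A = [1.0] + [sigmoid(dot(W, X)) for W in hidden_W]
    return sigmoid(dot(W9, A)), A

def loss(hidden_W, W9, X, y):
    return 0.5 * (forward(hidden_W, W9, X)[0] - y) ** 2

X = [1.0, 0.5, -1.0, 2.0, 0.0]             # [1, x1, x2, x3, x4], illustrative
hidden_W = [[0.1, 0.2, -0.3, 0.4, 0.0],    # W6
            [-0.2, 0.1, 0.1, -0.1, 0.3],   # W7
            [0.0, -0.4, 0.2, 0.2, -0.1]]   # W8
W9 = [0.05, 0.5, -0.5, 0.25]
y = 0.0

y_hat, A = forward(hidden_W, W9, X)
delta9 = (y_hat - y) * y_hat * (1.0 - y_hat)
delta6 = delta9 * W9[1] * A[1] * (1.0 - A[1])   # W9[1] is w_{9,6}, A[1] is a6
analytic = delta6 * X[2]                        # X[2] is x2

# Central finite difference on w_{6,2} (row 0, index 2 of hidden_W).
eps = 1e-6
plus = [list(W) for W in hidden_W]
plus[0][2] += eps
minus = [list(W) for W in hidden_W]
minus[0][2] -= eps
numeric = (loss(plus, W9, X, y) - loss(minus, W9, X, y)) / (2 * eps)
print(analytic, numeric)
```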
Networks with Multiple Output Units
We get a separate contribution to the gradient from each output unit.
Hence, for input-to-hidden weights, we must sum up the contributions:
$\delta_6 = a_6 (1 - a_6) \sum_{u=9}^{10} w_{u,6}\, \delta_u$
[Figure: network with inputs x1, x2, x3, x4 and a constant input 1; hidden units a6, a7, a8 with weight vectors W6, W7, W8; two output units ŷ1 and ŷ2]
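The summed hidden delta is small enough to write as a one-line helper. The function name hidden_delta and the numbers below are illustrative assumptions.

```python
def hidden_delta(a_u, out_weights, out_deltas):
    """delta_u = a_u (1 - a_u) * sum over outputs v of w_{v,u} * delta_v."""
    return a_u * (1.0 - a_u) * sum(w * d for w, d in zip(out_weights, out_deltas))

# a6 = 0.5, w_{9,6} = 2.0, w_{10,6} = -1.0, delta_9 = 0.1, delta_10 = 0.3
delta6 = hidden_delta(0.5, [2.0, -1.0], [0.1, 0.3])
print(delta6)
```

With a single output unit the sum has one term and the expression reduces to the single-output $\delta_6$.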
The Backpropagation Algorithm
Forward Pass. Compute $a_u$ and $\hat{y}_v$ for hidden units u and output units v.
Compute Errors. Compute $\varepsilon_v = (\hat{y}_v - y_v)$ for each output unit v.
Compute Output Deltas. Compute $\delta_v = \varepsilon_v\, \hat{y}_v (1 - \hat{y}_v)$ for each output unit v.
Compute Hidden Deltas. Compute $\delta_u = a_u (1 - a_u) \sum_v w_{v,u}\, \delta_v$ for each hidden unit u.
Compute Gradient.
– Compute $\frac{\partial J_i}{\partial w_{u,j}} = \delta_u\, x_{ij}$ for input-to-hidden weights.
– Compute $\frac{\partial J_i}{\partial w_{v,u}} = \delta_v\, a_{iu}$ for hidden-to-output weights.
Take Gradient Step. $W := W - \eta \nabla_W J(x_i)$
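The steps can be put together in a short sketch of one online update for a network with one hidden layer. The function name backprop_step, the list-of-lists weight layout, the learning rate, and the toy example are all assumptions. The output deltas use $\delta_v = \varepsilon_v \hat{y}_v (1 - \hat{y}_v)$, which matches the $\delta_9$ of the Delta Rule slide.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

def backprop_step(W_hid, W_out, x, y, eta):
    """One online gradient step on squared error; returns the loss before the step.

    W_hid[u][j]: weight from input j to hidden unit u (j = 0 is the bias input).
    W_out[v][u]: weight from hidden unit u to output v (u = 0 is the bias input).
    """
    # Forward pass.
    X = [1.0] + list(x)
    A = [1.0] + [sigmoid(dot(W, X)) for W in W_hid]
    Y = [sigmoid(dot(W, A)) for W in W_out]
    # Errors and output deltas.
    eps = [y_hat - target for y_hat, target in zip(Y, y)]
    d_out = [e * y_hat * (1.0 - y_hat) for e, y_hat in zip(eps, Y)]
    # Hidden deltas, computed before any weight changes; u + 1 skips the bias slot.
    d_hid = []
    for u in range(len(W_hid)):
        back = sum(W[u + 1] * d for W, d in zip(W_out, d_out))
        d_hid.append(A[u + 1] * (1.0 - A[u + 1]) * back)
    # Gradient step, applied in place.
    for W, d in zip(W_out, d_out):
        for u in range(len(W)):
            W[u] -= eta * d * A[u]
    for W, d in zip(W_hid, d_hid):
        for j in range(len(W)):
            W[j] -= eta * d * X[j]
    return 0.5 * sum(e * e for e in eps)

# Illustrative 2-input, 2-hidden-unit, 1-output network (weights are made up).
W_hid = [[0.1, 0.2, -0.1], [-0.2, 0.1, 0.3]]
W_out = [[0.0, 0.3, -0.3]]
losses = [backprop_step(W_hid, W_out, [1.0, 0.0], [1.0], eta=0.5) for _ in range(200)]
print(losses[0], losses[-1])
```

Repeating the step on a single example should drive its squared error down, which makes for a quick smoke test before training on real data.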
Proper Initialization
Start in the “linear” regions
– keep all weights near zero, so that all sigmoid units are in their linear regions. This makes the whole net the equivalent of one linear threshold unit, a very simple function.
Break symmetry.
– Ensure that each hidden unit has different input weights so that the hidden units move in different directions.
Set each weight to a random number in the range
$[-1, +1] \times \frac{1}{\sqrt{\text{fan-in}}}$
where the “fan-in” of weight $w_{v,u}$ is the number of inputs to unit v.
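A minimal sketch of this initialization rule follows. The helper name init_weights is hypothetical, and counting the bias input as part of the fan-in is an assumption the slide does not settle.

```python
import math
import random

def init_weights(n_units, fan_in, rng):
    """Draw each weight uniformly from [-1, +1] * 1/sqrt(fan_in)."""
    bound = 1.0 / math.sqrt(fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(n_units)]

rng = random.Random(0)            # fixed seed only to make the sketch repeatable
W_hid = init_weights(3, 5, rng)   # 3 hidden units; 4 inputs plus a bias input
W_out = init_weights(1, 4, rng)   # 1 output unit; 3 hidden units plus a bias input
print(W_hid[0])
```

Because each hidden unit draws its own random weights, no two units start identical, which is what breaks the symmetry.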
Batch, Online, and Online with Momentum
Batch. Sum the $\nabla_W J(x_i)$ for each example i. Then take a gradient descent step.
Online. Take a gradient descent step with each $\nabla_W J(x_i)$ as it is computed.
Momentum. Maintain an exponentially-weighted moving sum of recent gradients:
$\Delta W^{(t+1)} := \mu\, \Delta W^{(t)} + \nabla_W J(x_i)$
$W^{(t+1)} := W^{(t)} - \eta\, \Delta W^{(t+1)}$
Typical values of $\mu$ are in the range [0.7, 0.95]
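The momentum update can be sketched on a flat list of weights. The function name momentum_step and the toy objective $J(w) = \frac{1}{2} w^2$, whose gradient is simply w, are assumptions chosen to keep the example self-contained.

```python
def momentum_step(W, dW, grad, eta, mu):
    """Return (W_new, dW_new) for one momentum update on flat weight lists."""
    dW_new = [mu * d + g for d, g in zip(dW, grad)]     # moving sum of gradients
    W_new = [w - eta * d for w, d in zip(W, dW_new)]    # step along the moving sum
    return W_new, dW_new

# Toy objective J(w) = 1/2 * w^2, so the gradient at w is w itself.
W, dW = [1.0], [0.0]
for _ in range(100):
    W, dW = momentum_step(W, dW, grad=list(W), eta=0.1, mu=0.9)
print(W)
```

Setting mu to 0 recovers plain online gradient descent, since the moving sum then holds only the current gradient.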
Softmax Output Layer
Let $a_9$ and $a_{10}$ be the output activations: $a_9 = W_9 \cdot A$, $a_{10} = W_{10} \cdot A$. Then define
$\hat{y}_1 = \frac{\exp a_9}{\exp a_9 + \exp a_{10}}$, $\hat{y}_2 = \frac{\exp a_{10}}{\exp a_9 + \exp a_{10}}$
The objective function is the negative log likelihood:
$J(W) = \sum_i \sum_k -I[y_i = k] \log \hat{y}_k$
where I[expr] is 1 if expr is true and 0 otherwise
[Figure: network with inputs x1, x2, x3, x4 and a constant input 1; hidden units a6, a7, a8 with weight vectors W6, W7, W8; output weight vectors W9 and W10 feeding a softmax that produces ŷ1 and ŷ2]
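A small sketch of the softmax outputs and the negative log likelihood follows. Subtracting the maximum activation before exponentiating is a common numerical-stability trick that is not on the slide; it leaves the probabilities unchanged. Class labels are taken to be 0-based indices, which is an assumption.

```python
import math

def softmax(activations):
    """Softmax over a list of activations (max-shifted for numerical stability)."""
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def neg_log_likelihood(prob_rows, labels):
    """J = sum_i -log y_hat_{i, y_i}; only the true class contributes per example."""
    return -sum(math.log(p[k]) for p, k in zip(prob_rows, labels))

probs = softmax([2.0, 0.0])            # a9 = 2.0, a10 = 0.0 (illustrative)
J = neg_log_likelihood([probs], [0])   # one example whose true class is the first
print(probs, J)
```

With two classes, $\hat{y}_1$ equals the sigmoid of $a_9 - a_{10}$, so the two-output softmax generalizes the single sigmoid output used earlier.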
Neural Network Evaluation
Criterion                 Perc  Logistic  LDA  Trees     Nets
Mixed data                no    no        no   yes       no
Missing values            no    no        yes  yes       no
Outliers                  no    yes       no   yes       yes
Monotone transformations  no    no        no   yes       somewhat
Scalability               yes   yes       yes  yes       yes
Irrelevant inputs         no    no        no   somewhat  no
Linear combinations       yes   yes       yes  no        yes
Interpretable             yes   yes       yes  yes       no
Accurate                  yes   yes       yes  no        yes