Machine Learning for Data Mining
Week 5: Neural Networks
Christof Monz

Overview
- Perceptrons
- Gradient descent search
- Multi-layer neural networks
- The backpropagation algorithm

Neural Networks
- Analogy to biological neural systems, the most robust learning systems we know
- Attempt to understand natural biological systems through computational modeling
- Massive parallelism allows for computational efficiency
- Helps us understand the 'distributed' nature of neural representations
- Intelligent behavior as an 'emergent' property of a large number of simple units rather than the product of explicitly encoded symbolic rules and algorithms

Neural Network Learning
- Learning approach based on modeling adaptation in biological neural systems
- Perceptron: initial algorithm for learning simple (single-layer) neural networks, developed in the 1950s
- Backpropagation: more complex algorithm for learning multi-layer neural networks, developed in the 1980s

Real Neurons (figure)

Human Neural Network (figure)

Modeling Neural Networks (figure)

Perceptrons (figure)

Perceptrons
- A perceptron is a single-layer neural network with one output unit
- The output of a perceptron is computed as follows:
  o(x_1, …, x_n) = 1 if w_0 + w_1 x_1 + … + w_n x_n > 0, and −1 otherwise
- Assuming a 'dummy' input x_0 = 1, we can write:
  o(x_1, …, x_n) = 1 if Σ_{i=0}^{n} w_i x_i > 0, and −1 otherwise

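A minimal sketch of this computation (NumPy assumed; the function name perceptron_output and the example weights are illustrative, not from the slides):

import numpy as np

def perceptron_output(w, x):
    """Thresholded perceptron output o(x_1 ... x_n).
    w holds the weights w_0 ... w_n; a dummy input x_0 = 1 is prepended to x."""
    x = np.concatenate(([1.0], x))        # dummy input x_0 = 1
    return 1 if np.dot(w, x) > 0 else -1  # 1 if the weighted sum is positive, -1 otherwise

# Illustrative weights: w_0 = -0.5, w_1 = 1.0, w_2 = 1.0
print(perceptron_output(np.array([-0.5, 1.0, 1.0]), np.array([1.0, 0.0])))  # prints 1
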
Perceptrons
- Learning a perceptron involves choosing the 'right' values for the weights w_0, …, w_n
- The set of candidate hypotheses is H = { w | w ∈ ℝ^(n+1) }

Representational Power of Perceptrons
- A single perceptron can represent many boolean functions, e.g. AND, OR, NAND (¬AND), …, but not all of them (e.g., XOR)

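To make this concrete, here is a hedged sketch (NumPy assumed, illustrative weights) showing that (w_0, w_1, w_2) = (−1.5, 1, 1) implements AND, while no single weight vector can reproduce XOR:

import numpy as np

# AND with illustrative weights w_0 = -1.5, w_1 = w_2 = 1 (x_0 = 1 is the dummy input)
w = np.array([-1.5, 1.0, 1.0])
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    s = w @ np.array([1.0, x1, x2])          # w_0 + w_1*x1 + w_2*x2
    print((x1, x2), 1 if s > 0 else -1)      # outputs 1 only for (1, 1)
# XOR would have to output 1 for (0, 1) and (1, 0) but -1 for (0, 0) and (1, 1);
# those classes are not linearly separable, so no single weight vector works.
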
Perceptron Training Rule
- The perceptron training rule can be defined for each weight as:
  w_i ← w_i + Δw_i   where Δw_i = η(t − o)x_i
  where t is the target output, o is the output of the perceptron, and η is the learning rate
- This scenario assumes that we know what the target outputs are supposed to be

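A minimal sketch of one application of this rule (NumPy assumed; the function name perceptron_update and the default learning rate are mine):

import numpy as np

def perceptron_update(w, x, t, eta=0.1):
    """Apply the perceptron training rule once: w_i <- w_i + eta * (t - o) * x_i."""
    x = np.concatenate(([1.0], x))            # dummy input x_0 = 1
    o = 1 if np.dot(w, x) > 0 else -1         # current perceptron output
    return w + eta * (t - o) * x              # unchanged if o already equals t

w = np.array([0.0, 0.0, 0.0])
print(perceptron_update(w, np.array([1.0, 0.0]), t=1))  # weights move toward the target
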
Perceptron Training Rule Example
- If t = o, then (t − o)x_i = 0 and Δw_i = 0, i.e. the weight w_i remains unchanged, regardless of the learning rate and the input value x_i
- Let's assume a learning rate of η = 0.1 and an input value of x_i = 0.8
  - If t = +1 and o = −1, then Δw_i = 0.1(1 − (−1))0.8 = 0.16
  - If t = −1 and o = +1, then Δw_i = 0.1(−1 − 1)0.8 = −0.16

Perceptron Training Rule
- The perceptron training rule converges after a finite number of iterations
- The stopping criterion holds if the amount of change falls below a pre-defined threshold θ, e.g., if ||Δw||_L1 < θ
- But only if the training examples are linearly separable

The Delta Rule
- The delta rule overcomes the shortcoming of the perceptron training rule that it is not guaranteed to converge if the examples are not linearly separable
- The delta rule is based on gradient descent search
- Let's assume we have an unthresholded perceptron: o(x) = w · x
- We can define the training error as:
  E(w) = 1/2 Σ_{d∈D} (t_d − o_d)²
  where D is the set of training examples

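As a small sketch (NumPy assumed; the function name training_error and the data are made up), the training error of an unthresholded perceptron can be computed as:

import numpy as np

def training_error(w, X, t):
    """E(w) = 1/2 * sum over d in D of (t_d - o_d)^2, with o_d = w . x_d."""
    o = X @ w                      # unthresholded outputs for all training examples
    return 0.5 * np.sum((t - o) ** 2)

# Made-up training set: each row of X is an example (first column is the dummy input x_0 = 1)
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
t = np.array([1.0, -1.0, 1.0])
print(training_error(np.array([0.1, 0.2]), X, t))
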
Error Surface (figure)

Gradient Descent
- The gradient of E is the vector pointing in the direction of the steepest increase for any point on the error surface:
  ∇E(w) = [ ∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n ]
- Since we are interested in minimizing the error, we consider negative gradients: −∇E(w)
- The training rule for gradient descent is:
  w ← w + Δw   where Δw = −η∇E(w)

Gradient Descent
- The training rule for individual weights is defined as w_i ← w_i + Δw_i, where Δw_i = −η ∂E/∂w_i
- Instantiating E with the error function we use gives:
  ∂E/∂w_i = ∂/∂w_i [ 1/2 Σ_{d∈D} (t_d − o_d)² ]
- How do we use partial derivatives to actually compute updates to the weights at each step?

Gradient Descent
∂E/∂w_i = ∂/∂w_i [ 1/2 Σ_{d∈D} (t_d − o_d)² ]
        = 1/2 Σ_{d∈D} ∂/∂w_i (t_d − o_d)²
        = 1/2 Σ_{d∈D} 2(t_d − o_d) ∂/∂w_i (t_d − o_d)
        = Σ_{d∈D} (t_d − o_d) ∂/∂w_i (t_d − w · x_d)    [since o_d = w · x_d]
∂E/∂w_i = Σ_{d∈D} (t_d − o_d)(−x_{id})

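A quick numerical sanity check of this result, as a sketch with made-up data (NumPy assumed; the names E, analytic, and numeric are mine): the analytic gradient should match a finite-difference estimate of ∂E/∂w_i.

import numpy as np

# Made-up training data: rows of X are examples (first column is the dummy input x_0 = 1)
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
t = np.array([1.0, -1.0, 1.0])
w = np.array([0.1, 0.2])

E = lambda w: 0.5 * np.sum((t - X @ w) ** 2)   # E(w) = 1/2 * sum_d (t_d - o_d)^2
analytic = -(t - X @ w) @ X                    # dE/dw_i = sum_d (t_d - o_d) * (-x_id)

eps = 1e-6
numeric = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps) for e in np.eye(len(w))])
print(np.allclose(analytic, numeric))          # True: the derivation checks out
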
Gradient Descent
- The delta rule for individual weights can now be written as w_i ← w_i + Δw_i, where
  Δw_i = η Σ_{d∈D} (t_d − o_d) x_{id}
- The gradient descent algorithm
  - picks initial random weights
  - computes the outputs
  - updates each weight by adding Δw_i
  - repeats until convergence

The Gradient Descent Algorithm
Each training example is a pair ⟨x, t⟩
1. Initialize each w_i to some small random value
2. Until the termination condition is met, do:
   2.1 Initialize each Δw_i to 0
   2.2 For each ⟨x, t⟩ ∈ D do:
       2.2.1 Compute o(x)
       2.2.2 For each weight w_i do:
             Δw_i ← Δw_i + η(t − o)x_i
   2.3 For each weight w_i do:
       w_i ← w_i + Δw_i

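A minimal sketch of this batch algorithm for an unthresholded perceptron (NumPy assumed; the function name gradient_descent, the made-up data, the learning rate, and the fixed epoch count used as the termination condition are illustrative choices, not from the slides):

import numpy as np

def gradient_descent(X, t, eta=0.05, epochs=100):
    """Batch gradient descent: accumulate delta_w over all of D, then update the weights."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.05, 0.05, X.shape[1])     # 1. small random initial weights
    for _ in range(epochs):                      # 2. termination: a fixed number of passes here
        delta_w = np.zeros_like(w)               # 2.1 reset the accumulated updates
        for x, target in zip(X, t):              # 2.2 loop over the training examples
            o = np.dot(w, x)                     # 2.2.1 unthresholded output o(x)
            delta_w += eta * (target - o) * x    # 2.2.2 accumulate eta * (t - o) * x_i
        w += delta_w                             # 2.3 apply the summed update
    return w

# Made-up data (first column of X is the dummy input x_0 = 1)
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
t = np.array([1.0, -1.0, 1.0])
print(gradient_descent(X, t))
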
The Gradient Descent Algorithm
- The gradient descent algorithm will find the global minimum, provided that the learning rate is small enough
- If the learning rate is too large, the algorithm runs the risk of overstepping the global minimum
- It is a common strategy to gradually decrease the learning rate
- This algorithm also works when the training examples are not linearly separable

Shortcomings of Gradient Descent
- Converging to a minimum can be quite slow (i.e. it can take thousands of steps); increasing the learning rate, on the other hand, can lead to overstepping minima
- If there are multiple local minima in the error surface, gradient descent can get stuck in one of them and fail to find the global minimum
- Stochastic gradient descent alleviates these difficulties

Stochastic Gradient Descent
- Gradient descent updates the weights after summing over all training examples
- Stochastic (or incremental) gradient descent updates the weights incrementally, after calculating the error for each individual training example
- To this end, step 2.3 is deleted and step 2.2.2 is modified

Stochastic Gradient Descent Algorithm
Each training example is a pair ⟨x, t⟩
1. Initialize each w_i to some small random value
2. Until the termination condition is met, do:
   2.1 Initialize each Δw_i to 0
   2.2 For each ⟨x, t⟩ ∈ D do:
       2.2.1 Compute o(x)
       2.2.2 For each weight w_i do:
             w_i ← w_i + η(t − o)x_i

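The same sketch with the modification described above, so that the weights are updated immediately after each training example (same illustrative assumptions as before):

import numpy as np

def stochastic_gradient_descent(X, t, eta=0.05, epochs=100):
    """Stochastic gradient descent: update w right after each training example."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.05, 0.05, X.shape[1])     # 1. small random initial weights
    for _ in range(epochs):                      # 2. termination: a fixed number of passes here
        for x, target in zip(X, t):              # 2.2 loop over the training examples
            o = np.dot(w, x)                     # 2.2.1 unthresholded output o(x)
            w += eta * (target - o) * x          # 2.2.2 update the weights immediately
    return w

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])  # made-up data, dummy input in column 0
t = np.array([1.0, -1.0, 1.0])
print(stochastic_gradient_descent(X, t))
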
Comparison
- In standard gradient descent, summing over multiple examples requires more computation per weight update step
- As a consequence, standard gradient descent is often used with a larger learning rate than stochastic gradient descent
- Stochastic gradient descent can avoid falling into local minima because it uses the individual error E_d(w) rather than the overall error E(w) to guide its search

Multi-Layer Neural Networks
- Perceptrons only have two layers: the input layer and the output layer
- Perceptrons only have one output unit
- Perceptrons are limited in their expressiveness
- Multi-layer neural networks consist of an input layer, a hidden layer, and an output layer
- Multi-layer neural networks can have several output units

Multi-Layer Neural Networks (figure)

Multi-Layer Neural Networks
- The units of the hidden layer function as input units to the next layer
- However, multiple layers of linear units still produce only linear functions
- The step function used in perceptrons would be another choice, but it is not differentiable and therefore not suitable for gradient descent search
- Solution: the sigmoid function, a non-linear, differentiable threshold function

Sigmoid Unit (figure)

The Sigmoid Function
- The output is computed as o = σ(w · x), where σ(y) = 1 / (1 + e^{−y}), i.e.
  o = σ(w · x) = 1 / (1 + e^{−(w · x)})
- Another nice property of the sigmoid function is that its derivative is easily expressed:
  dσ(y)/dy = σ(y)(1 − σ(y))

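A small sketch of the sigmoid and a finite-difference check of the derivative identity above (NumPy assumed; the test point is arbitrary):

import numpy as np

def sigmoid(y):
    """sigma(y) = 1 / (1 + e^(-y))"""
    return 1.0 / (1.0 + np.exp(-y))

y = 0.7                                          # arbitrary test point
analytic = sigmoid(y) * (1.0 - sigmoid(y))       # sigma(y) * (1 - sigma(y))
eps = 1e-6
numeric = (sigmoid(y + eps) - sigmoid(y - eps)) / (2 * eps)
print(np.isclose(analytic, numeric))             # True: d sigma/dy = sigma * (1 - sigma)
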
Learning Weights with Multiple Layers
- Gradient descent search can be used to train multi-layer neural networks, but the algorithm has to be adapted
- Firstly, there can be multiple output units, and therefore the error function has to be generalized:
  E(w) = 1/2 Σ_{d∈D} Σ_{k∈outputs} (t_{kd} − o_{kd})²
- Secondly, the error 'feedback' has to be fed back through multiple layers

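A small illustrative computation of the generalized error (NumPy assumed; the target and output arrays are made up):

import numpy as np

# t[d, k] and o[d, k]: target and actual value of output unit k on training example d
t = np.array([[1.0, 0.0],
              [0.0, 1.0]])
o = np.array([[0.8, 0.1],
              [0.3, 0.7]])
E = 0.5 * np.sum((t - o) ** 2)   # sums over both examples d and output units k
print(E)                         # 0.115
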
Backpropagation Algorithm
For each training example ⟨x, t⟩ do:
1. Input x to the network and compute the output o_u of every unit u in the network
2. For each output unit k calculate its error term δ_k:
   δ_k ← o_k(1 − o_k)(t_k − o_k)
3. For each hidden unit h calculate its error term δ_h:
   δ_h ← o_h(1 − o_h) Σ_{k∈outputs} w_{kh} δ_k
4. Update each network weight w_{ji}:
   w_{ji} ← w_{ji} + Δw_{ji}   where Δw_{ji} = η δ_j x_{ji}
- Note: x_{ji} is the input value from unit i to unit j, and w_{ji} is the weight connecting unit i to unit j

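A compact sketch of these four steps for a network with one hidden layer of sigmoid units (NumPy assumed; the layer sizes, data, learning rate, and variable names are illustrative, and bias weights are omitted for brevity):

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop_step(x, t, W_hidden, W_output, eta=0.1):
    """One backpropagation update for a single training example <x, t>."""
    # Step 1: propagate the input forward and compute every unit's output
    o_hidden = sigmoid(W_hidden @ x)             # outputs of the hidden units
    o_output = sigmoid(W_output @ o_hidden)      # outputs of the output units
    # Step 2: error terms of the output units: delta_k = o_k(1 - o_k)(t_k - o_k)
    delta_output = o_output * (1 - o_output) * (t - o_output)
    # Step 3: error terms of the hidden units: delta_h = o_h(1 - o_h) * sum_k w_kh * delta_k
    delta_hidden = o_hidden * (1 - o_hidden) * (W_output.T @ delta_output)
    # Step 4: update every weight: w_ji <- w_ji + eta * delta_j * x_ji
    W_output += eta * np.outer(delta_output, o_hidden)
    W_hidden += eta * np.outer(delta_hidden, x)
    return W_hidden, W_output

# Made-up sizes: 2 inputs, 3 hidden units, 2 output units
rng = np.random.default_rng(0)
W_hidden = rng.uniform(-0.05, 0.05, (3, 2))
W_output = rng.uniform(-0.05, 0.05, (2, 3))
W_hidden, W_output = backprop_step(np.array([1.0, 0.5]), np.array([1.0, 0.0]), W_hidden, W_output)
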
Backpropagation Algorithm
- Step 1 propagates the input forward through the network
- Steps 2 to 4 propagate the errors backward through the network
- Step 2 is similar to the delta rule in gradient descent (step 2.3)
- Step 3 sums over the errors of all output units influenced by a given hidden unit (this is because the training data only provides direct feedback for the output units)

Applications of Neural Networks
- Text-to-speech
- Fraud detection
- Automated vehicles
- Game playing
- Handwriting recognition

Summary
- Perceptrons, simple one-layer neural networks
- Perceptron training rule
- Gradient descent search
- Multi-layer neural networks
- Backpropagation algorithm