Machine Learning for Data Mining

Week 5: Neural Networks

Christof Monz

Overview

- Perceptrons
- Gradient descent search
- Multi-layer neural networks
- The backpropagation algorithm

Neural Networks

- Analogy to biological neural systems, the most robust learning systems we know
- Attempt to understand natural biological systems through computational modeling
- Massive parallelism allows for computational efficiency
- Helps understand the 'distributed' nature of neural representations
- Intelligent behavior is an 'emergent' property of a large number of simple units, rather than arising from explicitly encoded symbolic rules and algorithms

Neural Network Learning

- Learning approach based on modeling adaptation in biological neural systems
- Perceptron: the initial algorithm for learning simple (single-layer) neural networks, developed in the 1950s
- Backpropagation: a more complex algorithm for learning multi-layer neural networks, developed in the 1980s

Real Neurons

[Figure]

Human Neural Network

[Figure]

Modeling Neural Networks

[Figure]

Perceptrons

[Figure]

- A perceptron is a single-layer neural network with one output unit
- The output of a perceptron is computed as follows:

$$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \ldots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$

- Assuming a 'dummy' input $x_0 = 1$, we can write:

$$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$$
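
To make this concrete, here is a minimal Python sketch of the thresholded output (my own illustration, not code from the slides); the AND weights below are one workable choice, not the only one:

```python
def perceptron_output(weights, x):
    # weights[0] is the bias w_0, paired with the 'dummy' input x_0 = 1
    s = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if s > 0 else -1

# Weights implementing boolean AND over inputs in {-1, +1}:
# the weighted sum exceeds 0 only when both inputs are +1.
w_and = [-0.8, 0.5, 0.5]
for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x, perceptron_output(w_and, x))
```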

Perceptrons

- Learning a perceptron involves choosing the 'right' values for the weights $w_0, \ldots, w_n$
- The set of candidate hypotheses is $H = \{\vec{w} \mid \vec{w} \in \mathbb{R}^{n+1}\}$

Representational Power of Perceptrons

- A single perceptron can represent many boolean functions, e.g. AND, OR, NAND ($\neg$AND), ..., but not all (e.g., XOR)

Perceptron Training Rule

- The perceptron training rule can be defined for each weight as:

$$w_i \leftarrow w_i + \Delta w_i \quad \text{where} \quad \Delta w_i = \eta \, (t - o) \, x_i$$

where $t$ is the target output, $o$ is the output of the perceptron, and $\eta$ is the learning rate
- This scenario assumes that we know what the target outputs are supposed to be

Perceptron Training Rule Example

- If $t = o$ then $(t - o)x_i = 0$ and $\Delta w_i = 0$, i.e. the weight $w_i$ remains unchanged, regardless of the learning rate and the input values (i.e. $x_i$)
- Let's assume a learning rate of $\eta = 0.1$ and an input value of $x_i = 0.8$:
  If $t = +1$ and $o = -1$, then $\Delta w_i = 0.1 \cdot (1 - (-1)) \cdot 0.8 = 0.16$
  If $t = -1$ and $o = +1$, then $\Delta w_i = 0.1 \cdot (-1 - 1) \cdot 0.8 = -0.16$
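
A two-line check of the arithmetic above (a sketch; the values are exactly those of the example):

```python
eta, x_i = 0.1, 0.8
print(eta * (+1 - (-1)) * x_i)   # t = +1, o = -1: approx  0.16
print(eta * (-1 - (+1)) * x_i)   # t = -1, o = +1: approx -0.16
```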

Perceptron Training Rule

- The perceptron training rule converges after a finite number of iterations
- The stopping criterion holds if the amount of change falls below a pre-defined threshold $\theta$, e.g., if $|\Delta \vec{w}|_{L1} < \theta$
- But only if the training examples are linearly separable
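
The following sketch (my own, under the assumption of a linearly separable dataset such as boolean OR) wires the training rule and the L1 stopping criterion into a complete loop:

```python
def train_perceptron(data, eta=0.1, theta=1e-9, max_epochs=1000):
    # data: list of (inputs, target) pairs with targets in {-1, +1}
    n = len(data[0][0])
    w = [0.05] * (n + 1)                 # w[0] is the bias weight w_0
    for _ in range(max_epochs):
        change = 0.0                     # accumulates |delta w|_L1 per epoch
        for x, t in data:
            s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if s > 0 else -1
            deltas = [eta * (t - o) * xi for xi in (1.0,) + tuple(x)]
            w = [wi + d for wi, d in zip(w, deltas)]
            change += sum(abs(d) for d in deltas)
        if change < theta:               # stopping criterion |delta w|_L1 < theta
            return w
    return w

# Boolean OR is linearly separable, so the loop terminates
or_data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), 1)]
print(train_perceptron(or_data))
```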

The Delta Rule

- The delta rule overcomes the shortcoming of the perceptron training rule of not being guaranteed to converge when the examples are not linearly separable
- The delta rule is based on gradient descent search
- Let's assume we have an unthresholded perceptron: $o(\vec{x}) = \vec{w} \cdot \vec{x}$
- We can define the training error as:

$$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$

where $D$ is the set of training examples

Error Surface

[Figure]

Gradient Descent

- The gradient of $E$ is the vector pointing in the direction of the steepest increase for any point on the error surface:

$$\nabla E(\vec{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$$

- Since we are interested in minimizing the error, we consider negative gradients: $-\nabla E(\vec{w})$
- The training rule for gradient descent is:

$$\vec{w} \leftarrow \vec{w} + \Delta \vec{w} \quad \text{where} \quad \Delta \vec{w} = -\eta \nabla E(\vec{w})$$

Gradient Descent

- The training rule for individual weights is defined as $w_i \leftarrow w_i + \Delta w_i$ where $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$
- Instantiating $E$ for the error function we use gives:

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$

- How do we use partial derivatives to actually compute updates to the weights at each step?

Gradient Descent

$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_{d \in D} \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_{d \in D} 2 \, (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) \\
&= \sum_{d \in D} (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d) \\
\frac{\partial E}{\partial w_i} &= \sum_{d \in D} (t_d - o_d)(-x_{id})
\end{aligned}$$

Gradient Descent

- The delta rule for individual weights can now be written as $w_i \leftarrow w_i + \Delta w_i$ where

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) \, x_{id}$$

- The gradient descent algorithm:
  - picks initial random weights
  - computes the outputs
  - updates each weight by adding $\Delta w_i$
  - repeats until convergence

The Gradient Descent Algorithm

Each training example is a pair $\langle \vec{x}, t \rangle$

1. Initialize each $w_i$ to some small random value
2. Until the termination condition is met, do:
   2.1 Initialize each $\Delta w_i$ to 0
   2.2 For each $\langle \vec{x}, t \rangle \in D$ do:
       2.2.1 Compute $o(\vec{x})$
       2.2.2 For each weight $w_i$ do: $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
   2.3 For each weight $w_i$ do: $w_i \leftarrow w_i + \Delta w_i$
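
A direct transcription into Python might look as follows (a sketch under the slides' assumptions: an unthresholded linear unit $o = \vec{w} \cdot \vec{x}$, with a fixed number of epochs standing in for the termination condition):

```python
import random

def gradient_descent(data, eta=0.05, epochs=100):
    # data: list of (inputs, target) pairs; trains a linear unit o = w . x
    n = len(data[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]    # step 1
    for _ in range(epochs):                                    # step 2
        dw = [0.0] * (n + 1)                                   # step 2.1
        for x, t in data:                                      # step 2.2
            xs = (1.0,) + tuple(x)          # prepend dummy input x_0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))          # step 2.2.1
            for i in range(n + 1):                             # step 2.2.2
                dw[i] += eta * (t - o) * xs[i]
        for i in range(n + 1):                                 # step 2.3
            w[i] += dw[i]
    return w

# Toy data from a noiseless linear target t = 1 + 2*x1 - 3*x2
data = [((x1, x2), 1 + 2 * x1 - 3 * x2)
        for x1 in (-1, 0, 1) for x2 in (-1, 0, 1)]
print(gradient_descent(data))   # approaches [1.0, 2.0, -3.0]
```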

The Gradient Descent Algorithm

- The gradient descent algorithm will find the global minimum, provided that the learning rate is small enough
- If the learning rate is too large, the algorithm runs the risk of overstepping the global minimum
- It is a common strategy to gradually decrease the learning rate
- The algorithm also works in case the training examples are not linearly separable

Shortcomings of Gradient Descent

- Converging to a minimum can be quite slow (i.e. it can take thousands of steps); increasing the learning rate, on the other hand, can lead to overstepping minima
- If there are multiple local minima in the error surface, gradient descent can get stuck in one of them and not find the global minimum
- Stochastic gradient descent alleviates these difficulties

Stochastic Gradient Descent

- Gradient descent updates the weights after summing over all training examples
- Stochastic (or incremental) gradient descent updates the weights incrementally, after calculating the error for each individual training example
- To this end, step 2.3 is deleted and step 2.2.2 is modified

Stochastic Gradient Descent Algorithm

Each training example is a pair $\langle \vec{x}, t \rangle$

1. Initialize each $w_i$ to some small random value
2. Until the termination condition is met, do:
   2.1 Initialize each $\Delta w_i$ to 0
   2.2 For each $\langle \vec{x}, t \rangle \in D$ do:
       2.2.1 Compute $o(\vec{x})$
       2.2.2 For each weight $w_i$ do: $w_i \leftarrow w_i + \eta (t - o) x_i$
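
Only the inner update changes in the stochastic variant; a sketch mirroring the batch version above:

```python
def stochastic_gradient_descent(data, eta=0.05, epochs=100):
    # As in the batch sketch, but each example updates the weights
    # immediately, so the delta-w accumulator (step 2.3) disappears.
    n = len(data[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in data:
            xs = (1.0,) + tuple(x)
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i in range(n + 1):
                w[i] += eta * (t - o) * xs[i]   # modified step 2.2.2
    return w

# With the toy data above, this also approaches [1.0, 2.0, -3.0]
```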

Comparison

- In standard gradient descent, summing over multiple examples requires more computation per weight update step
- As a consequence, standard gradient descent often uses larger learning rates than stochastic gradient descent
- Stochastic gradient descent can avoid falling into local minima because it uses the individual $\nabla E_d(\vec{w})$ rather than the overall $\nabla E(\vec{w})$ to guide its search

Multi-Layer Neural Networks

- Perceptrons only have two layers: the input layer and the output layer
- Perceptrons only have one output unit
- Perceptrons are limited in their expressiveness
- Multi-layer neural networks consist of an input layer, a hidden layer, and an output layer
- Multi-layer neural networks can have several output units

Multi-Layer Neural Networks

[Figure]

- The units of the hidden layer function as input units to the next layer
- However, multiple layers of linear units still produce only linear functions
- The step function in perceptrons is another choice, but it is not differentiable, and therefore not suitable for gradient descent search
- Solution: the sigmoid function, a non-linear, differentiable threshold function

Sigmoid Unit

[Figure]

The Sigmoid Function

- The output is computed as $o = \sigma(\vec{w} \cdot \vec{x})$, where $\sigma(y) = \frac{1}{1 + e^{-y}}$, i.e. $o = \sigma(\vec{w} \cdot \vec{x}) = \frac{1}{1 + e^{-\vec{w} \cdot \vec{x}}}$
- Another nice property of the sigmoid function is that its derivative is easily expressed:

$$\frac{d\sigma(y)}{dy} = \sigma(y)\,(1 - \sigma(y))$$
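
The derivative identity is easy to check numerically; a small sketch using only the standard library:

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

# Compare sigma(y)(1 - sigma(y)) against a central finite difference
h = 1e-6
for y in (-2.0, 0.0, 3.0):
    analytic = sigmoid(y) * (1.0 - sigmoid(y))
    numeric = (sigmoid(y + h) - sigmoid(y - h)) / (2 * h)
    print(y, analytic, numeric)   # the two columns agree closely
```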

Learning Weights with Multiple Layers

- Gradient descent search can be used to train multi-layer neural networks, but the algorithm has to be adapted
- Firstly, there can be multiple output units, and therefore the error function has to be generalized:

$$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$$

- Secondly, the error 'feedback' has to be fed through multiple layers

Backpropagation Algorithm

For each training example $\langle \vec{x}, \vec{t} \rangle$ do:

1. Input $\vec{x}$ to the network and compute $o_u$ for every unit $u$ in the network
2. For each output unit $k$, calculate its error $\delta_k$: $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
3. For each hidden unit $h$, calculate its error $\delta_h$: $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \, \delta_k$
4. Update each network weight $w_{ji}$: $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$ where $\Delta w_{ji} = \eta \, \delta_j \, x_{ji}$

- Note: $x_{ji}$ is the input value from unit $i$ to unit $j$, and $w_{ji}$ is the weight connecting unit $i$ to unit $j$
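
Putting steps 1-4 together for one hidden layer of sigmoid units gives roughly the following per-example update (my own sketch with hypothetical names; biases are folded in as weights from a constant +1 input):

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def backprop_step(x, t, w_hidden, w_output, eta=0.1):
    # One example <x, t>.  w_hidden[j][i] is the weight from input i to
    # hidden unit j; w_output[k][j] from hidden unit j to output unit k.
    # Index 0 of every weight vector is the bias (input fixed at +1).
    # Step 1: forward pass, computing o_u for every unit
    xs = [1.0] + list(x)
    o_hidden = [sigmoid(sum(w * v for w, v in zip(row, xs))) for row in w_hidden]
    hs = [1.0] + o_hidden
    o_out = [sigmoid(sum(w * v for w, v in zip(row, hs))) for row in w_output]
    # Step 2: output errors  delta_k = o_k (1 - o_k)(t_k - o_k)
    d_out = [o * (1 - o) * (tk - o) for o, tk in zip(o_out, t)]
    # Step 3: hidden errors  delta_h = o_h (1 - o_h) sum_k w_kh delta_k
    d_hidden = [oh * (1 - oh) * sum(w_output[k][j + 1] * d_out[k]
                                    for k in range(len(o_out)))
                for j, oh in enumerate(o_hidden)]
    # Step 4: weight updates  w_ji <- w_ji + eta * delta_j * x_ji
    for k, row in enumerate(w_output):
        for j in range(len(row)):
            row[j] += eta * d_out[k] * hs[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * d_hidden[j] * xs[i]
    return o_out
```

The slides' $w_{kh}$ corresponds to `w_output[k][j + 1]` here; repeating `backprop_step` over the training set, starting from small random weights, is the whole algorithm.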

Backpropagation Algorithm

- Step 1 propagates the input forward through the network
- Steps 2-4 propagate the errors backward through the network
- Step 2 is similar to the delta rule in gradient descent (step 2.3)
- Step 3 sums over the errors of all output units influenced by a given hidden unit (this is because the training data only provides direct feedback for the output units)

Applications of Neural Networks

- Text-to-speech
- Fraud detection
- Automated vehicles
- Game playing
- Handwriting recognition

Summary

- Perceptrons, simple single-layer neural networks
- Perceptron training rule
- Gradient descent search
- Multi-layer neural networks
- Backpropagation algorithm
