Perceptrons - Institute of Computer Science and Technology, Peking University


Machine Learning

Chen Yu

Institute of Computer Science and Technology, Peking University

Research Center for Information Security Engineering



Course Information

Instructor: Chen Yu
chen_yu@pku.edu.cn
Tel: 82529680

TA: Cheng Zaixing
Tel: 62763742
wataloo@hotmail.com

Course webpage:
http://www.icst.pku.edu.cn/course/jiqixuexi/jqxx2011.mht

Ch4 Artificial Neural Networks

Introduction

Perceptrons

Multilayer Networks and Backpropagation Algorithms

Remarks on Backpropagation Algorithms

Face Recognition as an Example

Advanced Topics

Biological Motivation


The figure at right illustrates the parts of a nerve cell (neuron).

Each neuron has many short branching fibers called dendrites (树突) and a single long fiber called the axon (轴突).

A neuron makes connections with 10 to 100,000 other neurons at junctions called synapses (突触).

Signals are propagated from neuron to neuron by a complicated electrochemical reaction.

Computer vs. Human Brain

The following table is a crude comparison of the raw computational resources available to computers and brains (Russell et al., 2003).

Artificial Neural Networks


An artificial neural network (ANN) is an interconnected group of artificial neurons that uses a mathematical model for information processing, based on a connectionist model of computation.

(Figure: an ANN with one hidden layer)

History of Neural Network


The concept of neural networks started in the late 1800s as an effort to describe how the human mind works.

McCulloch and Pitts (1943) proposed a model of the neuron that corresponds to the perceptron.

During the mid-1980s, work on ANNs experienced a resurgence, caused in large part by the invention of the BACKPROPAGATION algorithm and related methods for training multilayer ANNs.

More on Connectionism


Connectionism is a set of approaches in the fields of artificial intelligence, cognitive psychology, cognitive science, neuroscience, and philosophy of mind that models mental or behavioral phenomena as the emergent processes of interconnected networks of simple units. There are many forms of connectionism, but the most common forms use neural network models.

An Example: ALVINN System



ALVINN: Autonomous Land Vehicle In a Neural Network

(Figure: weight values of one hidden unit and its 30 outputs in ALVINN)

When to Consider Neural Networks


The target function to be learned is defined over high-dimensional feature vectors, taking discrete or real values.

Features can be correlated or independent of one another.

The output may be a (vector of) discrete or real-valued attributes.

Training examples might contain errors.

Long training times are acceptable.

Fast evaluation of the learned function is desirable.

When to Consider Neural Networks (2)


The ability of humans to understand the learned function is not important.

"An unreadable table that a useful machine could read would still be well worth having." --- technology writer Roger Bridgman, on neural networks

Ch4 Artificial Neural Networks

Introduction

Perceptrons

Multilayer Networks and Backpropagation Algorithms

Remarks on Backpropagation Algorithms

Face Recognition as an Example

Advanced Topics

Perceptrons


Basic idea (McCulloch & Pitts, 1943):

Each neuron is characterized as being “on” or
“off”, with “on” occurring in response to
stimulation by a sufficient number of
neighboring neurons.


Perceptrons (2)


The hypothesis space H in perceptron learning is the set of all real-valued weight vectors, H = { w : w ∈ ℝ^(n+1) }; a perceptron outputs o(x_1, …, x_n) = 1 if w_0 + w_1 x_1 + … + w_n x_n > 0, and −1 otherwise.

Perceptron learning is therefore equivalent to finding a linear classifier.

Representation Power of Perceptrons


The AND, OR, and NOT (negation) functions can all be represented by a single perceptron (see the sketch following this list).

Indeed, any m-of-n function can be represented by a single perceptron.

However, XOR cannot be represented by a single perceptron (figure on the previous slide).

Every boolean function can be represented by some network of perceptrons with at most one hidden layer.
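To make this concrete, here is a minimal Python sketch (not from the original slides; the weight values are just one convenient choice) of a thresholded perceptron o(x) = 1 if w·x + b > 0 and −1 otherwise, with weights realizing AND, OR, and NOT:

```python
# Minimal perceptron sketch: a thresholded linear unit o(x) = sign(w.x + b).
# The weights below are one hand-picked choice; many others work equally well.

def perceptron(weights, bias):
    """Return a function computing a single perceptron's output (+1 / -1)."""
    def unit(x):
        s = bias + sum(w * xi for w, xi in zip(weights, x))
        return 1 if s > 0 else -1
    return unit

AND = perceptron([1, 1], -1.5)   # fires only when both inputs are 1
OR  = perceptron([1, 1], -0.5)   # fires when at least one input is 1
NOT = perceptron([-1], 0.5)      # fires when its single input is 0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "AND:", AND(x), "OR:", OR(x), "NOT x1:", NOT(x[:1]))

# No single choice of weights and bias reproduces XOR on all four points:
# XOR is not linearly separable, which is why a hidden layer is needed.
```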

Perceptron Training Rule


Begin with a single perceptron.

Two ways of finding acceptable weight vectors:

Perceptron training rule

Delta rule

Perceptron training rule: w_i ← w_i + Δw_i, where Δw_i = η(t − o)x_i; here t is the target output, o is the current perceptron output, and η is a small positive constant called the learning rate.
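A minimal sketch of the rule in code (the OR training data and the value of η are illustrative choices, not taken from the slides):

```python
# Perceptron training rule sketch: w_i <- w_i + eta * (t - o) * x_i.
# The data set and eta are illustrative; any linearly separable data will do.

def train_perceptron(examples, eta=0.1, epochs=50):
    n = len(examples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, t in examples:
            o = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            # The update is zero unless the example is misclassified (t != o).
            b += eta * (t - o)
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w, b

# The OR function as training data (linearly separable).
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train_perceptron(data))
```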

Perceptron Training Rule (2)


Intuitively the training rule makes sense: observe that the update Δw_i = η(t − o)x_i is zero when a training example is classified correctly, and otherwise it moves the weights in the direction that brings the output o closer to the target t.

It can be proved that after a finite number of iterations the perceptron training rule leads to a weight vector that correctly classifies all training examples, provided

the training examples are linearly separable, and

η is sufficiently small.

(Minsky and Papert, 1969)

Gradient Descent and Delta Rule

The delta rule is designed to overcome the difficulty that arises when the training examples are not linearly separable.

Key idea: use gradient descent to search the hypothesis space for the weight vector that best fits the training examples.

Intuitively, to search for the best fit, we should consider training an un-thresholded perceptron (a linear unit) instead: o(x) = w · x.

And it is natural to consider the following error function over the training examples D: E(w) = ½ Σ_{d ∈ D} (t_d − o_d)².

Gradient Descent and Delta Rule (2)

How should the weight vector be updated?

Recall that the purpose of updating is to reduce the error, and that the gradient ∇E(w) gives the direction of steepest increase of E in weight space.

Therefore we choose the negated gradient of the error function as the direction in which to update the weight vector: Δw = −η ∇E(w).

Illustration of Error Surface


The arrows show the negated gradient

Derivation of Gradient Descent Rule

Gradient of E w.r.t. the weight vector:

∇E(w) = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n], and for a linear unit ∂E/∂w_i = Σ_{d ∈ D} (t_d − o_d)(−x_id)

Training rule for gradient descent:

w ← w + Δw, where Δw = −η ∇E(w)

Derivation (2)

Therefore the weight update rule for gradient descent becomes:

Δw_i = η Σ_{d ∈ D} (t_d − o_d) x_id

Remark: this equation has the same form as the perceptron training rule, except that here o_d is the un-thresholded linear output and the update sums over all training examples.

The Gradient Descent Algorithm

Note: the error surface contains only a single global minimum, but if η is too large, the iterative algorithm might overstep the minimum. One solution is to gradually reduce η during the iterations.
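The algorithm itself appeared as pseudocode on the original slide; the following is one possible Python sketch of the batch version for a linear unit (the toy data set and learning rate are illustrative assumptions):

```python
# Batch gradient descent for a linear unit o = w . x, with the bias folded in
# as w[0] and a constant input x0 = 1. Each epoch accumulates
# Delta w_i = eta * sum_d (t_d - o_d) * x_id over all examples, then updates once.

def gradient_descent(examples, eta=0.05, epochs=200):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                      # w[0] is the bias weight
    for _ in range(epochs):
        delta = [0.0] * (n + 1)
        for x, t in examples:
            xs = (1.0,) + tuple(x)           # prepend the constant input x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i, xi in enumerate(xs):
                delta[i] += eta * (t - o) * xi
        w = [wi + di for wi, di in zip(w, delta)]
    return w

# Toy data with real-valued inputs and +1 / -1 targets (not perfectly separable).
data = [((0.0, 0.2), -1), ((0.1, 0.9), 1), ((0.8, 0.1), 1), ((0.9, 1.0), 1)]
print(gradient_descent(data))
```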

Incremental Gradient Descent


Key practical difficulties in applying gradient descent are:

1. Converging to a local minimum can be quite slow.

2. If there are multiple local minima, the procedure might not find the global minimum.

One trick to alleviate these difficulties is to update the weights incrementally instead, once per training example:

Δw_i = η (t − o) x_i    (Eq. 1)

In other words, the algorithm iterates through the training examples one by one, and after each example updates the weights according to Eq. 1. One full pass through the training set is called an epoch. If the algorithm instead selects examples randomly from the training set, the method is called stochastic gradient descent.
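A minimal variant of the previous sketch that updates the weights after every example; shuffling the example order each epoch gives the stochastic version (η and the data format are the same illustrative assumptions as before):

```python
# Incremental (per-example) gradient descent, i.e. the delta / LMS rule.
import random

def incremental_gradient_descent(examples, eta=0.05, epochs=200, shuffle=False):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                      # w[0] is the bias weight
    for _ in range(epochs):
        order = list(examples)
        if shuffle:
            random.shuffle(order)            # stochastic gradient descent
        for x, t in order:
            xs = (1.0,) + tuple(x)
            o = sum(wi * xi for wi, xi in zip(w, xs))
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
    return w
```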

Incremental Gradient Descent (2)


Compared with standard gradient descent, incremental gradient descent requires less computation per weight update.

In case E(w) has multiple local minima, incremental gradient descent can sometimes avoid falling into them.

Eq. (1) is known as the delta rule, or the LMS rule.

The delta rule has the same form as the perceptron training rule, but the two rules use different formulas for "o": the delta rule uses the un-thresholded linear output, whereas the perceptron rule uses the thresholded ±1 output.

The delta rule can also be used for training perceptrons.

On These Two Training Rules


The perceptron training rule converges after a finite number of iterations to a hypothesis that is consistent with the training data, provided

the training examples are linearly separable, and

η is small enough.

The delta rule converges only asymptotically toward the minimum-error hypothesis, possibly requiring unbounded time, but it does so regardless of whether the training examples are linearly separable.

Ch4 Artificial Neural Networks

Introduction

Perceptrons

Multilayer Networks and Backpropagation Algorithms

Remarks on Backpropagation Algorithms

Face Recognition as an Example

Advanced Topics

Why Multilayer?


A single-layer network has quite limited representational power, equivalent to a linear classifier.

Example: consider the problem of recognizing 1 of 10 vowel sounds occurring in the context "h_d" (e.g. had, hid), with the input being two parameters obtained from a spectral analysis of the sound; the decision surface can become quite complicated:

Why Multilayer (2)


A Differentiable Threshold Unit


How should we choose the type of unit as the basis for constructing multilayer networks?

A linear unit is not good enough (a multilayer network of linear units still computes only a linear function).

The perceptron's threshold output is not differentiable.

Modify the perceptron to make it differentiable! One natural choice is the sigmoid function σ(y) = 1 / (1 + e^(−y)), applied to the net input y = w · x.

A Differentiable Threshold Unit (2)


We can derive a gradient descent rule for training a single sigmoid unit.

A multilayer network of sigmoid units leads to the BACKPROPAGATION algorithm.

A convenient property: the sigmoid's derivative is easily expressed in terms of its output, σ'(y) = σ(y)(1 − σ(y)).

Error Gradient for one Activation Unit

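The derivation itself appeared as a figure on the original slide; for a single sigmoid unit with the squared-error E defined earlier, the standard chain-rule computation gives (a sketch, writing o_d = σ(w·x_d)):

\[
\frac{\partial E}{\partial w_i}
= \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d \in D}(t_d - o_d)^2
= -\sum_{d \in D}(t_d - o_d)\,\frac{\partial o_d}{\partial w_i}
= -\sum_{d \in D}(t_d - o_d)\, o_d (1 - o_d)\, x_{id},
\]

using the fact that the sigmoid's derivative is σ'(y) = σ(y)(1 − σ(y)).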

Backpropagation Algorithm


Since a multilayer network has multiple outputs, we redefine E as the sum of squared errors over all of the network outputs:

E(w) = ½ Σ_{d ∈ D} Σ_{k ∈ outputs} (t_kd − o_kd)²

Backpropagation Algorithm (2)


Consider a network with one hidden layer and sigmoid units; the incremental gradient descent version of the algorithm is as follows. Let x_ij and w_ij denote the input and the weight from unit i into unit j, respectively.

Initialize all weights to small random numbers.

Until the termination condition is met, Do

For each training example (x, t), Do

Propagate the input forward through the network:

1. Compute the output o_u of every unit u in the network.

Propagate the errors backward through the network:

2. For each output unit k, calculate its error term
δ_k ← o_k (1 − o_k)(t_k − o_k)

3. For each hidden unit h, calculate its error term
δ_h ← o_h (1 − o_h) Σ_{k ∈ outputs} w_hk δ_k

4. Update each network weight
w_ij ← w_ij + Δw_ij, where Δw_ij = η δ_j x_ij
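A compact Python sketch of the procedure above (the layer sizes, learning rate, and the XOR demonstration data are illustrative assumptions; with random initialization, training may occasionally stall in a local minimum and need a rerun):

```python
# Backpropagation sketch: fully connected network with one hidden layer of
# sigmoid units, trained incrementally (one weight update per example).
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def init_layer(n_in, n_out):
    # weights[j][i] is the weight from input i into unit j; index 0 is the bias.
    return [[random.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
            for _ in range(n_out)]

def forward(layer, inputs):
    xs = [1.0] + list(inputs)                            # bias input x0 = 1
    return [sigmoid(sum(w * x for w, x in zip(unit, xs))) for unit in layer]

def backprop(examples, n_in, n_hidden, n_out, eta=0.3, epochs=5000):
    hidden = init_layer(n_in, n_hidden)
    output = init_layer(n_hidden, n_out)
    for _ in range(epochs):
        for x, t in examples:
            # 1. Forward pass.
            h = forward(hidden, x)
            o = forward(output, h)
            # 2. Output error terms: delta_k = o_k (1 - o_k)(t_k - o_k).
            delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # 3. Hidden error terms: delta_h = o_h (1 - o_h) sum_k w_hk delta_k.
            delta_h = [h[j] * (1 - h[j]) *
                       sum(output[k][j + 1] * delta_o[k] for k in range(n_out))
                       for j in range(n_hidden)]
            # 4. Weight updates: w_ij <- w_ij + eta * delta_j * x_ij.
            for k in range(n_out):
                for i, xi in enumerate([1.0] + h):
                    output[k][i] += eta * delta_o[k] * xi
            for j in range(n_hidden):
                for i, xi in enumerate([1.0] + list(x)):
                    hidden[j][i] += eta * delta_h[j] * xi
    return hidden, output

# Demonstration on XOR, which no single perceptron can represent.
xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
hidden, output = backprop(xor, n_in=2, n_hidden=2, n_out=1)
for x, t in xor:
    print(x, t, forward(output, forward(hidden, x)))
```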

Backpropagation Algorithm (3)


Eq. 2 is identical in form to the delta training rule. Eq. 3 is somewhat different, because we cannot compute target outputs for the hidden units; instead, the "sum" term in Eq. 3 can be regarded as a weighted sum of the error terms of the output units that the hidden unit feeds into.

The (incremental) gradient descent step is iterated, often thousands of times, using the same training examples multiple times, until the termination condition is met (the network performs acceptably well).

For the batch gradient descent version, sum the Δw_ij over all training examples before updating w.

More on the Algorithm


The most common variation of the algorithm is to make the weight update during the n-th iteration depend partially on the update during the (n−1)-th iteration (momentum, with constant 0 ≤ α < 1):

Δw_ij(n) = η δ_j x_ij + α Δw_ij(n−1)

Generalizing the algorithm to feedforward networks of arbitrary depth: the error term for a hidden unit h in layer m is calculated from the error terms of the units it feeds into in layer m+1:

δ_h = o_h (1 − o_h) Σ_{s ∈ layer m+1} w_hs δ_s

The same equation generalizes to any directed acyclic graph, which need not be arranged in uniform layers: the sum is then taken over all units immediately downstream of h.
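In code, momentum changes only the weight-update step; a sketch of that modification (the value of α is an illustrative choice):

```python
# Momentum sketch: remember the previous update of each weight and add a
# fraction alpha of it to the current gradient step.
def update_with_momentum(w, prev_dw, delta_j, x_ij, eta=0.3, alpha=0.9):
    dw = eta * delta_j * x_ij + alpha * prev_dw
    return w + dw, dw        # new weight value, and the update to remember
```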

Ch4 Artificial Neural Networks

Introduction

Perceptrons

Multilayer Networks and Backpropagation Algorithms

Remarks on Backpropagation Algorithms

Face Recognition as an Example

Advanced Topics

Convergence and Local Minima


Unfortunately, the error surface for a multilayer network may contain multiple local minima, so BACKPROPAGATION over a multilayer network is only guaranteed to converge to some local minimum. In practice, however, the problem is not as severe as one might fear. Some intuition for this observation:

For a network with a large number of weights, when gradient descent falls into a local minimum with respect to one of these weights, it will not necessarily be in a local minimum with respect to the other weights (they may provide an "escape route").

Notice that if the weights are initialized to values near 0, then for sigmoid units the network behaves like a linear function in the early iterations; the more complex, non-linear parts of the error surface come into play only after the weights have grown.

Convergence and Local Minima (2)


Common practices for alleviating the problem of local minima:

Add a momentum term to the weight update formula.

Use stochastic (incremental) gradient descent instead.

Train multiple networks using the same data but with different initial weights; the trained networks can then be regarded as a "committee" of networks whose output is the (weighted) average of the individual outputs.

Representational Power


Every boolean function can be represented by some two-layer network, although in the worst case the number of hidden units required grows exponentially with the number of inputs.

Every bounded continuous function can be approximated with arbitrarily small error by some two-layer network (sigmoid units in the hidden layer and linear units at the output layer).

Any function can be approximated to arbitrary accuracy by some three-layer network (sigmoid units in the two hidden layers and linear units at the output layer).

However, keep in mind that the set of weight vectors reachable by gradient descent might not include all possible weight vectors.

Hypothesis Space and Inductive Bias

The hypothesis space is a Euclidean space whose dimension is the number of weights in the (structurally pre-determined) network.

The inductive bias by which the backpropagation algorithm generalizes beyond the training data can be roughly characterized as "smooth interpolation between data points".

Hidden Layer Representation


One intriguing property of BACKPROPAGATION is its ability to discover useful intermediate representations of the inputs at the hidden layers, representations that might not be obvious from just looking at the training examples (a sort of automatic "feature selection").

Consider the example of learning the identity function below.

An Example


Given the following training set and network structure (an 8 x 3 x 8 network trained to map each of eight one-hot input vectors to an identical output vector), can the identity function be learned?

Example (contd)

The hidden values on the left were obtained after 5000 iterations through each of these 8 examples: the three hidden units have learned an intermediate representation that is essentially a 3-bit binary encoding of the 8 inputs.
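Reusing the backprop and forward helpers from the earlier sketch (those helpers are an assumption of this text, not code from the slides), the 8 x 3 x 8 identity experiment can be reproduced roughly as follows:

```python
# 8-3-8 identity task: eight one-hot inputs must be reproduced at the outputs,
# which forces the three hidden units to invent a compact intermediate code.
examples = []
for i in range(8):
    v = tuple(1 if j == i else 0 for j in range(8))
    examples.append((v, v))

hidden, output = backprop(examples, n_in=8, n_hidden=3, n_out=8, eta=0.3, epochs=5000)
for x, _ in examples:
    h = forward(hidden, x)
    print(x, [round(hj, 2) for hj in h])     # the learned hidden-layer encoding
```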

Overfitting


Q: What is an appropriate condition for terminating the weight-update loop?

One obvious answer is to stop when the error on the training set falls below some predetermined threshold. However, this is a poor strategy, because BACKPROPAGATION is susceptible to overfitting the training examples.

Overfitting Examples


Example 1

Example 2

Overfitting (contd)


Why does overfitting tend to occur during the later iterations, but not during the early ones?

Conceptually, as the iterations proceed, the complexity of the learned decision surface increases, often in order to fit noise in the data or unrepresentative characteristics of particular training examples (similar to what happens with decision trees).

Techniques for avoiding overfitting:

Weight decay (to bias learning against complex decision surfaces)

Simply provide an additional validation set and stop training when the validation error starts to rise (see the sketch after this list)

What if there is not enough data for an extra validation set (overfitting is most severe for small training sets)?

Try k-fold cross-validation, in particular the "leave-one-out" method.
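A sketch of the validation-set (early stopping) strategy; the 70/30 split, the patience parameter, and the helper names init_weights, train_one_epoch, and error are illustrative assumptions:

```python
# Early stopping sketch: train on one part of the data, monitor error on a
# held-out validation set, and keep the weights that did best on validation.
import copy

def train_with_early_stopping(data, init_weights, train_one_epoch, error,
                              max_epochs=10000, patience=20):
    split = int(0.7 * len(data))                      # illustrative 70/30 split
    train, valid = data[:split], data[split:]
    weights = init_weights()
    best_weights, best_err, bad_epochs = copy.deepcopy(weights), float("inf"), 0
    for _ in range(max_epochs):
        weights = train_one_epoch(weights, train)     # one pass of weight updates
        err = error(weights, valid)
        if err < best_err:
            best_weights, best_err, bad_epochs = copy.deepcopy(weights), err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                # validation error keeps rising
                break
    return best_weights
```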

Ch4 Artificial Neural Networks

Introduction

Perceptrons

Multilayer Networks and Backpropagation Algorithms

Remarks on Backpropagation Algorithms

Face Recognition as an Example

Advanced Topics

NN for Learning Head Pose

NN for Learning Head Pose (2)

More on Face Recognition


Two problems: authentication (one-to-one matching) and identification (one-to-many matching)

Representative approaches:

Principal component analysis with eigenfaces

Developed by Sirovich and Kirby (1987) and used by Matthew Turk and Alex Pentland for face classification

Hidden Markov models

Elastic matching

Neural nets

Ch4 Artificial Neural Networks

Introduction

Perceptrons

Multilayer Networks and Backpropagation Algorithms

Remarks on Backpropagation Algorithms

Face Recognition as an Example

Advanced Topics

Alternative Error Functions


Adding a penalty term for the weight magnitudes (weight decay); see the sketch of standard forms below.

Adding a term for errors in the slope (derivative) of the target function.

Minimizing the cross entropy of the network with respect to the target function:

When learning a probabilistic (0/1-valued) target function, the maximum-likelihood estimates are given by the network that minimizes the cross entropy, whose standard form is also sketched below.
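For reference, the forms these error functions commonly take (a sketch in standard notation, not reproduced from the original slides; γ is the weight-decay constant and o_d the network output for training example d):

\[
E(\mathbf{w}) = \frac{1}{2}\sum_{d \in D}\sum_{k \in outputs}(t_{kd} - o_{kd})^2 \;+\; \gamma \sum_{i,j} w_{ij}^2
\qquad \text{(squared error with a weight-decay penalty)}
\]

\[
-\sum_{d \in D}\big(t_d \log o_d + (1 - t_d)\log(1 - o_d)\big)
\qquad \text{(cross entropy for a single probabilistic 0/1 output)}
\]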

Alternative Error Functions (2)


Weight sharing, or "tying together" weights associated with different units:

The idea is that different network weights are forced to take on identical values, so as to enforce constraints that reflect prior knowledge.

Example: neural networks for phoneme recognition, Waibel (1989) and Lang (1990).

Recurrent Network


HW


4.4 (20 pt, due Monday, Oct 10)

Email both your answers and source code to the TA and the instructor.

4.9 (10 pt, due Monday, Oct 10)

4.11 (practice on your own)

4.12 (10 bonus pt)

Cognitive Computing


Modeling at various granularities


Neuron


Cluster of neurons


Complete organism
