Machine Learning
Chen Yu
Institute of Computer Science and Technology, Peking University
Information Security Engineering Research Center
Course Information
Instructor: Chen Yu
chen_yu@pku.edu.cn
Tel: 82529680
TA: Cheng Zaixing, Tel: 62763742
wataloo@hotmail.com
Course webpage:
http://www.icst.pku.edu.cn/course/jiqixuexi/jqxx2011.mht
Ch4 Artificial Neural Networks
Introduction
Perceptrons
Multilayer Networks and Backpropagation
Algorithms
Remarks on Backpropagation Algorithms
Face Recognition as an Example
Advanced Topics
Biological Motivation
The figure at right illustrates the parts of a nerve cell (neuron).
A single long fiber is called the axon (轴突); the many shorter, branching fibers are the dendrites (树突).
A neuron makes connections with 10 to 100,000 other neurons at junctions called synapses (突触).
Signals are propagated from neuron to neuron by a complicated electrochemical reaction.
Computer vs. Human Brain
The following table is a crude comparison of the raw computational resources available to computers and brains (S. Russell et al., 2003)
Artificial Neural Networks
An artificial neural network (ANN) is an interconnected group of artificial neurons that uses a mathematical model for information processing, based on a connectionist model of computation.
an ANN with one hidden layer
History of Neural Network
The concept of neural networks started in the late 1800s as an effort to describe how the human mind works.
McCulloch and Pitts (1943) proposed a model of the neuron that corresponds to the perceptron.
During the mid-1980s, work on ANNs experienced a resurgence, caused in large part by the invention of BACKPROPAGATION and related algorithms for training multilayer ANNs.
More on Connectionism
Connectionism is a set of approaches in the fields of artificial intelligence, cognitive psychology, cognitive science, neuroscience, and philosophy of mind that models mental or behavioral phenomena as the emergent processes of interconnected networks of simple units. There are many forms of connectionism, but the most common forms use neural network models.
An Example: ALVINN System
ALVINN:
Autonomous Land
Vehicle in a Neural
Network
weight values of one hidden unit and its 30
outputs in ALVINN
When to Consider Neural Networks
The target function to be learned is defined over high-dimensional feature vectors, taking discrete or real values
Features can be correlated with or independent of one another
Output may be a (vector of) discrete- or real-valued attributes
Training examples might contain errors
Long training times are acceptable
Fast evaluation of the learned function is desirable
When to Consider Neural Networks (2)
The ability of humans to understand the learned function is not important:
"An unreadable table that a useful machine could read would still be well worth having."
(technology writer Roger Bridgman on neural networks)
Perceptrons
Basic idea (McCulloch & Pitts, 1943):
Each neuron is characterized as being “on” or
“off”, with “on” occurring in response to
stimulation by a sufficient number of
neighboring neurons.
Perceptrons (2)
The hypothesis space H in perceptron learning is the set of all possible real-valued weight vectors.
Perceptron learning is equivalent to finding a linear classifier.
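In symbols (following Mitchell's standard presentation, which this chapter is based on), a perceptron thresholds a weighted sum of its inputs, and the hypothesis space is the set of weight vectors:

```latex
o(x_1,\dots,x_n) =
\begin{cases}
 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0\\
-1 & \text{otherwise}
\end{cases}
\qquad
H = \{\vec{w} \mid \vec{w} \in \mathbb{R}^{n+1}\}
```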
Representation Power of Perceptrons
AND, OR, and negation can all be represented by a single perceptron.
Indeed, any m-of-n function can be represented by a single perceptron.
However, XOR can't be represented by a single perceptron (see the figure on the previous page).
Every Boolean function can be represented by some network of perceptrons with at most one hidden layer.
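As a minimal sketch (my own code, not the lecture's), hand-chosen weights realizing AND and OR over inputs in {0, 1} illustrate the representational claim above:

```python
# Hand-picked perceptron weights implementing AND and OR.
# The bias and weights here are illustrative choices; many settings work.

def perceptron(weights, bias, x):
    """Threshold unit: fires 1 if bias + w.x > 0, else 0."""
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else 0

def AND(x):  # fires only when both inputs are 1
    return perceptron([1, 1], -1.5, x)

def OR(x):   # fires when at least one input is 1
    return perceptron([1, 1], -0.5, x)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([AND(x) for x in inputs])  # [0, 0, 0, 1]
print([OR(x) for x in inputs])   # [0, 1, 1, 1]
```

No single choice of weights and bias reproduces XOR's outputs [0, 1, 1, 0], since XOR is not linearly separable.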
Perceptron Training Rule
Begin with a single perceptron.
Two ways of finding acceptable weight vectors:
Perceptron training rule
Delta rule
The perceptron training rule:
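In Mitchell's standard notation, the rule updates each weight by

```latex
w_i \leftarrow w_i + \Delta w_i,
\qquad
\Delta w_i = \eta\,(t - o)\,x_i
```

where t is the target output, o is the perceptron's output, and η is a small positive constant called the learning rate.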
Perceptron Training Rule (2)
Intuitively the training rule makes sense: observe that each update moves the weights so as to reduce the discrepancy between the target output t and the perceptron output o.
It can be proved that after a finite number of iterations the perceptron training rule leads to a weight vector that correctly classifies all training examples, provided that
the training examples are linearly separable, and
η is small enough. (Minsky and Papert, 1969)
Gradient Descent and Delta Rule
The delta rule is designed to overcome the difficulty that arises when the training examples aren't linearly separable.
Key idea: use gradient descent to search the hypothesis space for the weight vector that best fits the training examples.
Intuitively, to search for the best fit, we should consider training an un-thresholded perceptron (a linear unit) instead, and it is natural to consider the following error function:
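In standard notation, the linear unit and the sum-of-squared-errors function over the training set D are:

```latex
o(\vec{x}) = \vec{w}\cdot\vec{x}
\qquad
E(\vec{w}) \equiv \frac{1}{2}\sum_{d\in D}\left(t_d - o_d\right)^2
```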
Gradient Descent and Delta Rule (2)
How to update the weight vector?
Recall that the purpose of updating is to reduce the error, and the error decreases fastest in the direction opposite to the gradient.
Therefore we choose the negated gradient of the error function as the direction in which to update the weight vector.
Illustration of Error Surface
The arrows show the negated gradient
Derivation of Gradient Descent Rule
Gradient of
E
w.r.t. weight vector:
Training rule for gradient descent:
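In standard notation, the gradient and the resulting update are:

```latex
\nabla E(\vec{w}) \equiv \left[\frac{\partial E}{\partial w_0},\frac{\partial E}{\partial w_1},\dots,\frac{\partial E}{\partial w_n}\right],
\qquad
\frac{\partial E}{\partial w_i} = \sum_{d\in D}(t_d - o_d)(-x_{id})

\Delta \vec{w} = -\eta\,\nabla E(\vec{w}),
\qquad
\Delta w_i = \eta\sum_{d\in D}(t_d - o_d)\,x_{id}
```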
Derivation (2)
Therefore the weight update rule for gradient descent becomes:
Remark: the above rule has the same form as the perceptron training rule, although the two compute the output o differently.
The Gradient Descent Algorithm
Note: the error surface for a linear unit contains only a single global minimum, but if η is too large, the iterative algorithm might overstep the minimum. One solution is to gradually reduce η during the iterations.
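The batch algorithm can be sketched as follows (my own code, not the lecture's): each epoch accumulates the weight changes over ALL training examples before updating the weights. The data set, learning rate, and epoch count below are illustrative choices.

```python
# Batch gradient descent for a single linear unit o(x) = w . x,
# following the update  delta_w_i = eta * sum_d (t_d - o_d) * x_id.

def train_linear_unit(examples, n_inputs, eta=0.05, epochs=200):
    # examples: list of (x, t) with x a tuple of inputs, t the target.
    # w[0] is the bias weight (a constant input x0 = 1 is implicit).
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        delta = [0.0] * len(w)
        for x, t in examples:
            xs = (1.0,) + tuple(x)              # prepend the constant input
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i, xi in enumerate(xs):
                delta[i] += eta * (t - o) * xi  # accumulate over examples
        w = [wi + di for wi, di in zip(w, delta)]
    return w

# Learn t = 2*x1 - 1 from four noiseless examples.
data = [((0.0,), -1.0), ((1.0,), 1.0), ((2.0,), 3.0), ((0.5,), 0.0)]
w = train_linear_unit(data, n_inputs=1)
print(w)  # close to [-1.0, 2.0]
```

Because the error surface for a linear unit is quadratic with a single global minimum, a small enough η makes this converge.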
Incremental Gradient Descent
Key practical difficulties in applying gradient descent are:
1. Converging to a local minimum can be quite slow
2. If there are multiple local minima, the procedure might not find the global minimum
One trick to alleviate these difficulties is to update the weights incrementally instead:
In other words, the algorithm iterates through the training examples one by one, and at each step updates the weights according to Eq. 1. One full pass through the training set is called an epoch. If the algorithm instead selects examples randomly from the training set, the method is called stochastic gradient descent.
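Eq. 1, the per-example update, is:

```latex
\Delta w_i = \eta\,(t - o)\,x_i \tag{1}
```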
Incremental Gradient Descent (2)
Compared with standard gradient descent, incremental gradient descent needs less computation per weight update.
In case E(w) has multiple local minima, incremental gradient descent can sometimes avoid falling into them.
Eq. 1 is known as the delta rule, or LMS rule.
The delta rule has the same form as the perceptron training rule, but the two rules use different formulas for "o".
The delta rule can also be used for training perceptrons.
On These Two Training Rules
The perceptron training rule converges after a finite number of iterations to a consistent hypothesis, provided that
the training examples are linearly separable, and
η is small enough.
The delta rule converges only asymptotically towards the minimum-error hypothesis, possibly requiring unbounded time, but regardless of whether the training examples are linearly separable.
Why Multilayer?
A single-layer network has quite limited representational power, equivalent to a linear classifier.
Example: consider the problem of recognizing 1 of 10 vowel sounds occurring in the context "h_d" (e.g. had, hid), with input of two parameters obtained from a spectral analysis of the sounds; the decision surface can become quite complicated:
Why Multilayer (2)
A Differentiable Threshold Unit
How should we choose the type of unit to use as the basis for constructing multilayer networks?
A linear unit is not good enough (a network of linear units still computes only linear functions)
A perceptron is not differentiable
So modify the perceptron to make it differentiable! One natural choice is the sigmoid function:
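The sigmoid unit and the identity satisfied by its derivative (which makes the gradient computations below convenient) are:

```latex
o = \sigma(\vec{w}\cdot\vec{x}),
\qquad
\sigma(y) = \frac{1}{1 + e^{-y}},
\qquad
\frac{d\,\sigma(y)}{dy} = \sigma(y)\,(1 - \sigma(y))
```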
A Differentiable Threshold Unit (2)
We can derive a gradient descent rule to train a single sigmoid unit.
A multilayer network of sigmoid units leads to the backpropagation algorithm.
Error Gradient for one Activation Unit
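Applying the chain rule to E = ½ Σ (t_d − o_d)² with the sigmoid derivative identity above gives:

```latex
\frac{\partial E}{\partial w_i}
= \sum_{d\in D}(t_d - o_d)\left(-\frac{\partial o_d}{\partial w_i}\right)
= -\sum_{d\in D}(t_d - o_d)\,o_d\,(1 - o_d)\,x_{id}
```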
Backpropagation Algorithm
Since a multilayer network has multiple outputs, we redefine E as the sum of the errors over all of the network outputs:
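In standard notation:

```latex
E(\vec{w}) \equiv \frac{1}{2}\sum_{d\in D}\ \sum_{k\in outputs}\left(t_{kd} - o_{kd}\right)^2
```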
Backpropagation Algorithm (2)
Consider a network with one hidden layer and sigmoid units; the incremental gradient descent version of the algorithm becomes:
Let x_ij and w_ij denote the input and the weight from unit i into unit j, respectively.
Initialize all weights to small random numbers
Until the termination condition is met, Do
  For each training example (x, t), Do
    Propagate the input forward through the network:
    1. Compute the output o_u of every unit u in the network
    Propagate the errors backward through the network:
    2. For each output unit k, calculate its error δ_k
    3. For each hidden unit h, calculate its error δ_h
    4. Update each network weight w_ij
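In Mitchell's standard form (using the slide's convention that w_ij is the weight from unit i into unit j, so w_hk runs from hidden unit h into output unit k), steps 2-4 are:

```latex
\delta_k \leftarrow o_k\,(1 - o_k)\,(t_k - o_k) \tag{2}

\delta_h \leftarrow o_h\,(1 - o_h)\sum_{k\in outputs} w_{hk}\,\delta_k \tag{3}

w_{ij} \leftarrow w_{ij} + \eta\,\delta_j\,x_{ij} \tag{4}
```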
Backpropagation Algorithm (3)
Eq. 2 is identical in form to the delta training rule. Eq. 3 is somewhat different, because we cannot compute target outputs at the hidden units; instead, the "sum" term in the equation can be regarded as a weighted sum of the errors at the network's output units.
The (incremental) gradient descent step is iterated (often thousands of times, using the same training examples multiple times) until the termination condition is met (i.e. the network performs acceptably well).
For the batch gradient descent version, sum Δw over all training examples before updating w.
More on the Algorithm
The most common variation of the algorithm is to make the weight update during the n-th iteration partially depend on the update during the (n-1)-th iteration, by adding a momentum term:
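In standard form, with momentum constant α:

```latex
\Delta w_{ij}(n) = \eta\,\delta_j\,x_{ij} + \alpha\,\Delta w_{ij}(n-1),
\qquad 0 \le \alpha < 1
```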
The algorithm generalizes to feedforward networks of arbitrary depth: the error for a hidden unit h in layer m is calculated from the errors of the units in layer m+1.
It further generalizes to any directed acyclic graph, which need not be arranged in uniform layers.
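As a concrete illustration, here is a compact sketch (my own code, not the lecture's) of incremental backpropagation for one hidden layer of sigmoid units; the δ computations and weight updates follow Eqs. 2-4. Whether training drives the XOR error to zero depends on the random initialization (local minima are possible), so the example only checks that the error decreases.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(wh, wo, x):
    """Forward pass; returns hidden activations and network outputs."""
    xs = [1.0] + list(x)                                   # x0 = 1 is the bias input
    h = [sigmoid(sum(w * xi for w, xi in zip(row, xs))) for row in wh]
    hs = [1.0] + h
    o = [sigmoid(sum(w * hi for w, hi in zip(row, hs))) for row in wo]
    return h, o

def total_error(wh, wo, examples):
    """E = 1/2 * sum over examples and outputs of (t - o)^2."""
    return 0.5 * sum((tk - ok) ** 2
                     for x, t in examples
                     for ok, tk in zip(forward(wh, wo, x)[1], t))

def train(examples, n_in, n_hidden, n_out, eta=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    wh = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
          for _ in range(n_hidden)]
    wo = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
          for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            h, o = forward(wh, wo, x)
            # Eq. 2: errors at the output units
            d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # Eq. 3: errors at the hidden units; wo[k][j+1] is the weight
            # from hidden unit j into output unit k
            d_hid = [hj * (1 - hj) * sum(wo[k][j + 1] * d_out[k]
                                         for k in range(n_out))
                     for j, hj in enumerate(h)]
            # Eq. 4: update each weight by eta * delta_j * x_ij
            hs = [1.0] + h
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    wo[k][j] += eta * d_out[k] * hs[j]
            xs = [1.0] + list(x)
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    wh[j][i] += eta * d_hid[j] * xs[i]
    return wh, wo

xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
wh0, wo0 = train(xor, 2, 3, 1, epochs=0)   # untrained (initial) weights
wh, wo = train(xor, 2, 3, 1)
print(total_error(wh0, wo0, xor), "->", total_error(wh, wo, xor))
```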
Convergence and Local Minima
Unfortunately, the error surface for a multilayer network might contain multiple local minima, so BACKPROPAGATION over a multilayer network is only guaranteed to converge to some local minimum. In practice, however, the problem is not as severe as one might fear. Some intuitions behind this observation:
For a network with a large number of weights, when gradient descent falls into a local minimum with respect to one of the weights, it will not necessarily be at a local minimum with respect to the other weights (they might provide an "escape route")
Notice that if the weights are initialized near 0, then for sigmoid units the function behaves like a linear one during the early steps of iteration.
Convergence and Local Minima (2)
Common practices for alleviating the problem of local minima:
Add a momentum term to the weight update formula
Use stochastic gradient descent instead
Train multiple networks on the same data but with different initial weights; the trained networks can then be regarded as a "committee" of networks whose output is the (weighted) average of the individual outputs.
Representational Power
Every Boolean function can be represented by some two-layer network, although in the worst case the number of hidden units required grows exponentially with the number of input units.
Every bounded continuous function can be approximated with arbitrarily small error by a two-layer network (sigmoid units at the hidden layer, linear units at the output layer).
Three layers suffice for an arbitrary function (sigmoid units at the two hidden layers, linear units at the output layer).
However, keep in mind that the weights reachable by gradient descent might not include all possible weight vectors.
Hypo Space and Inductive Bias
The hypothesis space is a Euclidean space whose dimension is the number of weights in the (structurally pre-determined) network.
The inductive bias by which the backpropagation algorithm generalizes beyond the input data can be roughly characterized as "smooth interpolation between data points".
Hidden Layer Representation
One intriguing property of BACKPROPAGATION is its ability to find useful intermediate representations of the inputs at the hidden layers, representations which might not be obvious from just looking at the training examples (a sort of "feature selection").
Consider the example of learning the identity function:
An Example
Given the following
training set and
network structure,
can it be learned?
Example (contd)
The hidden values on the
left were obtained after
5000 iterations through
each of these 8 examples
Overfitting
Q: What is an appropriate condition for terminating the weight-update loop?
One obvious answer is when the error falls below some predetermined threshold; however, this is a poor strategy because BACKPROPAGATION is susceptible to overfitting the training examples.
Overfitting Examples
Example 1
Example 2
Overfitting (contd)
Why does overfitting tend to occur during the later stages of iteration, but not during the early iterations?
Conceptually, as the iterations proceed, the complexity of the learned surface increases, often in order to fit noise in the data or unrepresentative characteristics of particular training examples (similar to decision trees).
Techniques for avoiding overfitting:
Weight decay (to bias learning against complex decision surfaces)
Simply provide an additional validation set
What if there is not enough data for an extra validation set (overfitting is most severe for small training sets)? Try k-fold cross-validation, in particular the "leave-one-out" method.
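A small illustrative sketch of leave-one-out cross-validation (my own code, not the lecture's); to keep it self-contained it uses a trivial 1-nearest-neighbor classifier rather than a neural network, but the resampling pattern is the same for any learner.

```python
def nn_classify(train, x):
    """Predict the label of the single nearest training point."""
    return min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]

def leave_one_out_error(data, classify):
    """Hold out each example in turn, fit on the rest, and average the errors."""
    errors = 0
    for i in range(len(data)):
        held_x, held_y = data[i]
        rest = data[:i] + data[i + 1:]
        if classify(rest, held_x) != held_y:
            errors += 1
    return errors / len(data)

# A tiny, well-separated toy data set (illustrative values).
data = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.9, 1.0), 1), ((1.0, 0.8), 1)]
print(leave_one_out_error(data, nn_classify))  # 0.0 on this separable toy set
```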
NN for Learning Head Pose
NN for Learning Head Pose (2)
More on Face Recognition
Two problems: authentication (one-to-one) and identification (one-to-many)
Representative approaches:
Principal component analysis with eigenfaces
Developed by Sirovich and Kirby (1987) and used by Matthew Turk and Alex Pentland for face classification
Hidden Markov models
Elastic matching
Neural nets
Alternative Error Functions
Adding a penalty term for the weight magnitudes (weight decay):
Adding a term for errors in the slope (derivative) of the target function
Minimizing the cross entropy of the network with respect to the target function:
When learning a probabilistic function, the maximum-likelihood probability estimates are given by the network that minimizes the cross entropy, defined as:
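In standard form, the weight-decay error function and the cross entropy (for a single probabilistic output o_d with target t_d) are:

```latex
E(\vec{w}) = \frac{1}{2}\sum_{d\in D}\sum_{k\in outputs}\left(t_{kd}-o_{kd}\right)^2
\;+\; \gamma\sum_{i,j} w_{ji}^{2}

-\sum_{d\in D}\Bigl[t_d \ln o_d + (1-t_d)\ln(1-o_d)\Bigr]
```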
Alternative Error Functions (2)
Weight sharing, or "tying together" weights associated with different units:
The idea is that different network weights are forced to take on identical values, so as to enforce some constraint (prior knowledge).
Example: a neural network for phoneme recognition; Waibel (1989) and Lang (1990).
Recurrent Network
HW
4.4 (20 pt, due Monday, Oct 10)
Email both answers and source code to the TA and the instructor
4.9 (10 pt, due Monday, Oct 10)
4.11 (play with it by yourself)
4.12 (10 bonus pt)
Cognitive Computing
Modeling at various granularities
Neuron
Cluster of neurons
Complete organism