Neural Networks


Computer Science Department
CS 9633 Machine Learning

Adapted from notes by Tom Mitchell
http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html

Neural networks

Practical method for learning
  Real-valued functions
  Discrete-valued functions
  Vector-valued functions
Robust in the presence of noise
Loosely based on a biological model of learning


Backpropagation Neural Networks

Assume a fixed structure for the network
  Directed graph (usually acyclic)
Learning consists of choosing weights for the edges


Characteristics of Backpropagation Problems

Instances represented by many attribute-value pairs
Target functions
  Discrete-valued
  Real-valued
  Vector-valued
Instances may contain errors
Long training times are acceptable
Fast evaluation of the function may be required
Not important that people understand the learned function



Perceptrons

Basic unit of many neural networks
Basic operation
  Input: vector of real values
  Calculates a linear combination of the inputs
  Output
    1 if the result is greater than some threshold
    -1 otherwise



A perceptron

[Figure: a perceptron. Inputs x_1, x_2, ..., x_n with weights w_1, w_2, ..., w_n, plus a constant input x_0 = 1 with weight w_0, feed a summation processor whose result is passed to a threshold processor.]


Notation

Perceptron function:
  o(x_1, ..., x_n) = 1 if w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n > 0, and -1 otherwise

Vector form of the perceptron function:
  o(x) = sgn(w · x), where x_0 = 1 and sgn(y) = 1 if y > 0, -1 otherwise
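
A minimal Python sketch of the perceptron function above (the function name and example values are illustrative, not from the text):

```python
# Perceptron output: sign of the weighted sum, with w[0] acting as the bias
# weight w_0 (the corresponding input x_0 = 1 is implicit).

def perceptron_output(w, x):
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

print(perceptron_output([0.1, 0.4, -0.7], [1.0, 0.5]))  # 0.1 + 0.4 - 0.35 > 0 -> 1
```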


Learning a perceptron

Learning consists of choosing values for the n+1 weights w_0, ..., w_n
The space H of candidate hypotheses is the set of all possible real-valued weight vectors


Representational Power of Perceptrons

A perceptron represents a hyperplane decision surface in the n-dimensional space of instances
  Output is 1 for instances on one side of the hyperplane and -1 for instances on the other side
Equation for the decision hyperplane: w · x = 0
Sets of instances that can be separated by a hyperplane are said to be linearly separable


Linearly Separable
Pattern Classification


Non-Linearly Separable Pattern Classification


The Kiss of Death


1969: Marvin Minsky and Seymour Papert proved
that the perceptron had computational limits.
Statement:


“The perceptron has many features which attract
attention: its linearity, its intriguing learning
theorem...there is no reason to believe that any of
these virtues carry over to the many-layered
version. Nevertheless, we consider it to be an
important research problem to elucidate (or reject)
our intuitive judgment that the extension is sterile”


Boolean functions

A perceptron can represent the following Boolean functions (see the sketch after this list)
  AND
  OR
  Any m-of-n function
  NOT
  NAND (NOT AND)
  NOR (NOT OR)
Every Boolean function can be represented by a network of interconnected units based on these primitives
  Two levels are enough
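
For the AND and OR cases, here is a small illustrative sketch; the particular weight values are just one common choice that works, not prescribed by the text:

```python
# Two-input perceptron over inputs in {0, 1}.

def perceptron(w0, w1, w2, x1, x2):
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND: w0 = -0.8, w1 = w2 = 0.5 -> fires only when both inputs are 1
print([perceptron(-0.8, 0.5, 0.5, x1, x2) for x1, x2 in inputs])  # [-1, -1, -1, 1]

# OR: w0 = -0.3, w1 = w2 = 0.5 -> fires when at least one input is 1
print([perceptron(-0.3, 0.5, 0.5, x1, x2) for x1, x2 in inputs])  # [-1, 1, 1, 1]
```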



Revival



1982: John Hopfield responsible for
revival



1987: First IEEE conference on neural
networks. Over 2000 attended.


And the rest is history!



Perceptron Training

Initialize the weight vector with random weights
Apply the perceptron to each training example
Modify the perceptron weights whenever an example is misclassified, using the perceptron training rule:
  w_i ← w_i + Δw_i, where Δw_i = η (t - o) x_i
  (t = target output, o = perceptron output, η = learning rate)
Repeat
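
A runnable sketch of this loop (names, the learning rate, and the epoch cap are illustrative choices, not from the text):

```python
import random

# Perceptron training rule. Each example is (x, t): x includes the bias
# input x_0 = 1, and t is the target output (+1 or -1).

def train_perceptron(examples, eta=0.1, max_epochs=100):
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]   # random initial weights
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if o != t:                                     # misclassified: apply the rule
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                                    # all examples classified correctly
            return w
    return w

# OR function (linearly separable), with x_0 = 1 prepended to each input
data = [([1, 0, 0], -1), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
print(train_perceptron(data))
```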


Characteristics of
Perceptron Training Rule


Guaranteed to converge within a finite
number of applications of the rule to a
weight vector that correctly classifies all
training examples if:


Training examples are linearly separable


The learning rate is acceptably small



Gradient Descent and the Delta Rule

Designed to converge toward the best-fit approximation of the target concept if the instances are not linearly separable
Searches the hypothesis space of possible weight vectors to find the weights that best fit the training data
Serves as a basis for backpropagation neural networks


Training task

Task: train an unthresholded linear unit, whose output is
  o = w · x = w_0 + w_1 x_1 + ... + w_n x_n
Training error to be minimized, over the training examples D:
  E(w) = ½ Σ_{d∈D} (t_d - o_d)²
E is a function of the weight vector w
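
A small sketch of the error measure E(w) (names are illustrative; each example is (x, t) with the bias input x_0 = 1 included in x):

```python
# Squared training error of an unthresholded linear unit:
# E(w) = 1/2 * sum over examples d of (t_d - o_d)^2

def linear_output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def training_error(w, examples):
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)

examples = [([1, 2.0], 1.0), ([1, -1.0], -1.0)]
print(training_error([0.0, 0.5], examples))  # 0.5 * (0**2 + 0.25) = 0.125
```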


Hypothesis Space


Derivation of the Gradient Descent Learning Rule

The derivation is on pages 91-92 of the text
The gradient of the error gives the direction of steepest ascent; its negative gives the direction of steepest descent
The derivative gives a very nice, intuitive learning rule:
  Δw_i = η Σ_{d∈D} (t_d - o_d) x_{id}

GRADIENT-DESCENT(training_examples, η)

Initialize each w_i to some small random value
Until the termination condition is met, Do
  Initialize each Δw_i to zero
  For each <x, t> in training_examples, Do
    Input the instance x to the unit and compute the output o
    For each linear unit weight w_i, Do
      Δw_i ← Δw_i + η (t - o) x_i
  For each linear unit weight w_i, Do
    w_i ← w_i + Δw_i
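
The same procedure as a runnable Python sketch (variable names, the learning rate, and the fixed iteration count are illustrative choices):

```python
import random

# Batch (standard) gradient descent for an unthresholded linear unit.
# Each example is (x, t); x already includes the bias input x_0 = 1.

def gradient_descent(training_examples, eta=0.05, iterations=1000):
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]     # small random weights
    for _ in range(iterations):                             # termination: fixed number of passes
        delta_w = [0.0] * n                                  # initialize each delta_w_i to zero
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))         # linear unit output
            for i in range(n):
                delta_w[i] += eta * (t - o) * x[i]           # accumulate delta_w_i
        w = [wi + dwi for wi, dwi in zip(w, delta_w)]        # one weight update per pass
    return w

examples = [([1, 0.0], 0.2), ([1, 1.0], 1.1), ([1, 2.0], 2.3)]
print(gradient_descent(examples))  # approaches the least-squares fit (roughly [0.15, 1.05])
```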


Gradient Descent

Useful for very large or infinite hypothesis spaces
Can be applied if
  The hypothesis space contains continuously parameterized hypotheses (e.g., weights)
  The error can be differentiated with respect to the hypothesis parameters


Practical Difficulties with
Gradient Descent


Converging to a local minimum can
sometimes be quite slow


If there are multiple local minima, there
is no guarantee the procedure will find
the global minimum


Stochastic Gradient Descent


Also called incremental gradient descent


Tries to address practical problems with
gradient descent


In gradient descent, the error is computed for
all of the training examples and the weights
are updated after all training examples have
been presented


Stochastic gradient descent updates the
weights incrementally based on the error with
each example

STOCHASTIC-GRADIENT-DESCENT(training_examples, η)

Initialize each w_i to some small random value
Until the termination condition is met, Do
  For each <x, t> in training_examples, Do
    Input the instance x to the unit and compute the output o
    For each linear unit weight w_i, Do
      w_i ← w_i + η (t - o) x_i
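
A runnable sketch of the stochastic version; the only change from the batch sketch above is that each weight is updated immediately after every example (names and constants are again illustrative):

```python
import random

# Stochastic (incremental) gradient descent for an unthresholded linear unit.

def stochastic_gradient_descent(training_examples, eta=0.05, iterations=1000):
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(iterations):
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))                 # output for this example
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]    # immediate weight update
    return w

examples = [([1, 0.0], 0.2), ([1, 1.0], 1.1), ([1, 2.0], 2.3)]
print(stochastic_gradient_descent(examples))  # close to the batch result above
```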





Standard versus Stochastic Gradient Descent

Weight update
  Standard GD: based on the sum of the errors over all training examples
  Stochastic GD: based on each individual training example

Computation per weight update
  Standard GD: summed over all examples
  Stochastic GD: one example

Multiple local minima
  Standard GD: may find a non-global minimum, because every change is based on the single gradient ∇E(w)
  Stochastic GD: may avoid non-global minima, because it uses the varying per-example gradients ∇E_d(w)


Comparison of Learning Rules

Weight update
  Perceptron rule: based on the error in the thresholded perceptron output
  Gradient descent (delta rule): based on the error in the unthresholded linear combination of inputs

Convergence properties
  Perceptron rule: converges after a finite number of iterations to a hypothesis that perfectly classifies the training examples, provided they are linearly separable
  Gradient descent (delta rule): converges asymptotically toward the minimum-error hypothesis, possibly in unbounded time, even if the examples are not linearly separable


Multilayer Networks and Backpropagation

[Figure: a feed-forward network. Input layer: units I1, I2, I3 plus bias unit I0. Hidden layer: units H1, H2 plus bias unit H0. Output layer: units O1, O2.]


Multilayer design

Need a unit whose
  Output is a non-linear function of its inputs
  Output is a differentiable function of its inputs
Choice
  Use a unit that, like the perceptron, computes a linear combination of its inputs, but applies a smoothed, differentiable threshold to the result


Sigmoid Threshold Unit

[Figure: a sigmoid unit. Inputs x_1, x_2, ..., x_n with weights w_1, w_2, ..., w_n, plus the constant input x_0 = 1 with weight w_0, feed a summation processor; a sigmoid threshold processor is applied to the sum.]
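
The sigmoid unit computes o = σ(w · x) with σ(y) = 1 / (1 + e^(-y)). A minimal sketch of the forward pass (names and example values are illustrative):

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit_output(w, x):
    net = sum(wi * xi for wi, xi in zip(w, x))   # summation processor (x includes x_0 = 1)
    return sigmoid(net)                          # smoothed, differentiable threshold

print(sigmoid_unit_output([-0.8, 0.5, 0.5], [1, 1, 1]))  # sigmoid(0.2), roughly 0.55
```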

BACKPROPAGATION(training_examples, η, n_in, n_out, n_hidden)

Create a feed-forward network with n_in input units, n_hidden hidden units, and n_out output units.
Initialize all network weights to small random values
Until the termination condition is met, Do
  For each <x, t> in training_examples, Do
    Propagate the input forward through the network:
      1. Input the instance x to the network and compute the output o_u of every unit u in the network.
    Propagate the errors backward through the network:
      2. For each network output unit k, calculate its error term δ_k:
           δ_k ← o_k (1 - o_k)(t_k - o_k)
      3. For each hidden unit h, calculate its error term δ_h:
           δ_h ← o_h (1 - o_h) Σ_{k ∈ outputs} w_kh δ_k
      4. Update each network weight w_ji:
           w_ji ← w_ji + η δ_j x_ji
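
A compact NumPy sketch of this procedure for one hidden layer of sigmoid units, under the same assumptions (stochastic updates, sigmoid outputs). Array names, the learning rate, the epoch count, and the XOR example are illustrative, not from the text:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backpropagation(examples, n_in, n_hidden, n_out, eta=0.3, epochs=10000):
    rng = np.random.default_rng(0)
    # Each weight matrix has an extra column for a bias input fixed at 1.
    w_hidden = rng.uniform(-0.05, 0.05, size=(n_hidden, n_in + 1))
    w_out = rng.uniform(-0.05, 0.05, size=(n_out, n_hidden + 1))
    for _ in range(epochs):
        for x, t in examples:
            xb = np.concatenate(([1.0], x))                      # add bias input x_0 = 1
            h = sigmoid(w_hidden @ xb)                           # hidden-unit outputs
            hb = np.concatenate(([1.0], h))                      # add hidden bias
            o = sigmoid(w_out @ hb)                              # network outputs
            delta_k = o * (1 - o) * (t - o)                      # output error terms
            delta_h = h * (1 - h) * (w_out[:, 1:].T @ delta_k)   # hidden error terms
            w_out += eta * np.outer(delta_k, hb)                 # w_ji <- w_ji + eta*delta_j*x_ji
            w_hidden += eta * np.outer(delta_h, xb)
    return w_hidden, w_out

# XOR: not representable by a single perceptron, learnable by this network.
xor = [(np.array([0., 0.]), np.array([0.])), (np.array([0., 1.]), np.array([1.])),
       (np.array([1., 0.]), np.array([1.])), (np.array([1., 1.]), np.array([0.]))]
w_h, w_o = backpropagation(xor, n_in=2, n_hidden=3, n_out=1)
for x, t in xor:   # outputs should typically approach the targets 0, 1, 1, 0
    hb = np.concatenate(([1.0], sigmoid(w_h @ np.concatenate(([1.0], x)))))
    print(t, sigmoid(w_o @ hb))
```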


Termination Conditions


Fixed number of iterations


Error on training examples falls below
threshold


Error on validation set meets some
criteria


Adding Momentum


A variation on backpropagation


Makes the weight update on one iteration
dependent on the update on the previous iteration


Keeps movement going in the “right” direction.


Can sometimes solve problems with local minima
and enable faster convergence
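
The standard momentum modification makes the weight change at iteration n depend partly on the change at iteration n-1 (α is the momentum constant, 0 ≤ α < 1):

  Δw_ji(n) = η δ_j x_ji + α Δw_ji(n-1)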


General Acyclic Network Structure

[Figure: an acyclic network with input units I1, I2, I3, hidden units H1, H2, H3, and output units O1, O2, organized into input, hidden, and output layers.]


Derivation of
Backpropagation Rule


See section 4.5.3 in the text


Convergence and Local
Minima


Error surface may contain many local minima


Algorithm is only guaranteed to converge
toward some local minimum in E


In practice, it is a very effective function
approximation method.


Problem with local minima is often not
encountered


Local minimum with respect to one weight is often counter-balanced by other weights


Initially, with weights near 0, the function
represented is nearly linear in its inputs


Methods for Avoiding Local
Minima


Add a momentum term


Use stochastic gradient descent


Train multiple networks


Select best


Use committee machine


Representational Power of
Feed Forward NNs


Boolean functions
  Any Boolean function can be represented by a two-layer network
  Scheme for an arbitrary Boolean function:
    For each possible input vector, create a distinct hidden unit and set its weights so that it activates iff that specific vector is the input
    OR all of these hidden units together at the output


Representational Power of
Feed Forward NNs


Continuous Functions


Every bounded continuous function can be
approximated with arbitrarily small error by
a network with two layers of units

»
Sigmoid units at hidden layer

»
Unthresholded linear units at output layer

»
Number of hidden units depends on the function to be approximated


Representational Power of
Feed Forward NNs


Arbitrary Functions


Any function can be approximated to
arbitrary accuracy by a network with 3
layers of units.

»
Two hidden layers of sigmoid units; unthresholded linear units at the output layer

»
Number of units needed at each layer is not
known in general


Hypothesis Search Space
and Inductive Bias


Every set of network weights is a different
hypothesis


Hypothesis space is continuous


Because the space is continuous and E is differentiable with respect to the weights, gradient descent gives a useful organization of the search through weight space


Inductive bias is


Defined by interaction of gradient descent search
and weight space


Roughly characterized as smooth interpolation
between data points


Hidden Layer
Representations


Backprop can learn useful intermediate
representations at the hidden layer


Defines new hidden-layer features that are not explicit in the input representation but capture relevant properties of the input instances


Generalization, Overfitting,
and Stopping Criterion


Using error on the test examples as a stopping criterion is a bad idea


Backprop is prone to overfitting


Why does overfitting occur in later
iterations, but not earlier?


Avoiding overfitting

Weight decay (a small sketch follows this list)
  Decrease the weights by a small factor during each iteration
  Stays away from complex decision surfaces
Validation data
  Train with the training set
  Measure error on the validation set
  Keep the best weights so far on the validation data
  Cross-validation to determine the best number of iterations
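
A minimal sketch of the weight-decay idea (the decay constant and names are illustrative): each iteration shrinks every weight slightly before the usual gradient step, biasing the search toward small weights and smoother surfaces.

```python
# One weight-decay update step: w_i <- (1 - decay) * w_i + gradient_step_i

def decayed_update(w, gradient_step, decay=0.001):
    return [(1.0 - decay) * wi + gi for wi, gi in zip(w, gradient_step)]

print(decayed_update([0.5, -1.2], [0.02, 0.01]))  # [0.5195, -1.1888]
```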