Training Neural Networks - Columbia University


Training Neural Networks

Robert Turetsky

Columbia University
rjt72@columbia.edu

Systems, Man and Cybernetics Society

IEEE North Jersey Chapter

December 12, 2000

Objective


Introduce fundamental concepts in
Artificial Neural Networks


Discuss methods of training ANNs


Explore some uses of ANNs


Assess the accuracy of artificial neurons
as models for biological neurons


Discuss current views, ideas and
research

Organization


Why Neural Networks?


Single TLUs


Training Neural Nets: Back propagation


Working with Neural Networks


Modeling the neuron


The multi-agent architecture


Directions and destinations

Why Neural Networks?

The “Von Neumann” architecture


Memory for
programs and
data


CPU for math
and logic


Control unit to
steer program
flow


Von Neumann vs. ANNs

Von Neumann:

Follows rules

Solution can/must be formally specified

Cannot generalize

Not error tolerant

Neural Net:

Learns from data

Rules on data are not visible

Able to generalize

Copes well with noise

Circuits that LEARN


Three types of learning:


Supervised Learning


Unsupervised Learning


Reinforcement Learning


Hebbian networks: reward ‘good’ paths,
punish ‘bad’ paths



Train neural net by adjusting weights


PAC (Probably Approximately Correct)
theory: Kearns & Vazirani 1994, Haussler
1990

Supervised Learning Concepts


Training set: input/output pairs


Supervised learning because we know the
correct action for every input in the training set


We want our Neural Net to act correctly on as
many training vectors as possible


Choose training set to be a typical set of inputs


The Neural net will (hopefully) generalize to all
inputs based on training set


Validation Set: Check to see how well our
training can generalize

Neural Net Applications


Miros Corp.: Face recognition


Handwriting Recognition


BrainMaker: Medical Diagnosis


Bushnell: Neural net for combinational
automatic test pattern generation


ALVINN: Knight Rider in real life!


Getting rich: LBS Capital Management
predicts the S&P 500

History of Neural Networks


1943: McCulloch and Pitts - modeling the
neuron for parallel distributed processing


1958: Rosenblatt - the Perceptron


1969: Minsky and Papert publish limits on the
ability of a perceptron to generalize


1970’s and 1980’s: ANN renaissance


1986: Rumelhart, Hinton + Williams present
backpropagation


1989: Tsividis: Neural Network on a chip

Threshold Logic Units

The building blocks of

Neural Networks

The TLU at a glance


TLU: Threshold Logic Unit


Loosely based on the firing of biological
neurons


Many inputs, one binary output


Threshold: Biasing function


Squashing function compresses infinite
input into the range 0-1
The TLU in Action

Training TLUs: Notation


θ = threshold of TLU


X = input vector


W = weight vector


s = X · W

i.e.: if s ≥ θ, op = 1

if s < θ, op = 0


d = desired output of TLU


f = actual output of TLU with current X and W
Augmented Vectors


Motivation: train the threshold θ at the
same time as the input weights


X · W ≥ θ is the same as X · W − θ ≥ 0


Set threshold of TLU = 0


Augment W: W = [w1, w2, … wn, −θ]


Augment X: X = [x1, x2, … xn, 1]


New TLU equation: X · W ≥ 0
(for augmented X and W)
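The plain and augmented formulations can be sketched in a few lines of Python (the weight and threshold values below are illustrative, not from the slides):

```python
import numpy as np

def tlu(X, W, theta):
    """Threshold logic unit: output 1 iff X . W >= theta."""
    return 1 if X @ W >= theta else 0

def tlu_augmented(Xa, Wa):
    """Same unit with the threshold folded into the weights:
    Wa = [w1..wn, -theta], Xa = [x1..xn, 1], threshold fixed at 0."""
    return 1 if Xa @ Wa >= 0 else 0

X = np.array([1.0, 0.0, 1.0])
W = np.array([0.5, -0.3, 0.8])
theta = 1.0
Xa = np.append(X, 1.0)      # augment X with a constant 1
Wa = np.append(W, -theta)   # augment W with -theta

assert tlu(X, W, theta) == tlu_augmented(Xa, Wa) == 1
```

The two forms always agree, which is why the threshold can be trained like any other weight.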

Gradient Descent Methods


Error Function: How far off are we?


Example Error function: ε = (d − f)²


ε depends on the weight values


Gradient Descent: Minimize error by
moving weights along the decreasing
slope of error


The Idea: iterate through the training set
and adjust the weights to minimize the
gradient of the error

Gradient Descent: The Math

We have ε = (d − f)²

Gradient of ε: ∂ε/∂W = [∂ε/∂w1, …, ∂ε/∂wn+1]

Using the chain rule: ∂ε/∂W = (∂ε/∂s)(∂s/∂W)

Since s = X · W, we have ∂s/∂W = X

Also: ∂ε/∂s = −2(d − f) ∂f/∂s

Which finally gives: ∂ε/∂W = −2(d − f)(∂f/∂s) X
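As a sanity check, the gradient ∂ε/∂W = −2(d − f)(∂f/∂s) X can be verified numerically for a sigmoid unit, where ∂f/∂s = f(1 − f). This is a sketch; the specific input and weight values are illustrative:

```python
import numpy as np

def err(W, X, d):
    """Error e = (d - f)^2 for a sigmoid unit f = 1/(1 + exp(-X.W))."""
    f = 1.0 / (1.0 + np.exp(-(X @ W)))
    return (d - f) ** 2

def grad(W, X, d):
    """Analytic gradient: de/dW = -2(d - f) f(1 - f) X."""
    f = 1.0 / (1.0 + np.exp(-(X @ W)))
    return -2.0 * (d - f) * f * (1.0 - f) * X

X = np.array([1.0, 0.0, 1.0, 1.0])   # augmented input vector
W = np.array([0.2, -0.4, 0.1, 0.3])
d = 1.0

# Central finite differences agree with the analytic form
eps = 1e-6
numeric = np.array([(err(W + eps * e, X, d) - err(W - eps * e, X, d)) / (2 * eps)
                    for e in np.eye(len(W))])
assert np.allclose(numeric, grad(W, X, d), atol=1e-7)
```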

Gradient Descent: Back to reality


So we have ∂ε/∂W = −2(d − f)(∂f/∂s) X


The problem: ∂f/∂s is undefined; the
threshold function is not differentiable


Three solutions:


Ignore it: the Error-Correction Procedure


Fudge it: Widrow-Hoff


Approximate it: the Generalized Delta
Procedure
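The three substitutes can be written as one weight-update routine. This is a hedged sketch: the function name, method labels, and learning-rate values are illustrative, not from the slides.

```python
import numpy as np

def train_step(W, X, d, c, method):
    """One update of an augmented TLU's weights, with three substitutes
    for the problematic df/ds term."""
    s = X @ W
    if method == "error_correction":
        f = 1.0 if s >= 0 else 0.0            # threshold output
        return W + c * (d - f) * X            # ignore df/ds entirely
    if method == "widrow_hoff":
        return W + c * (d - s) * X            # fudge: treat the unit as linear, f = s
    # generalized delta: approximate the threshold with a sigmoid
    f = 1.0 / (1.0 + np.exp(-s))
    return W + c * (d - f) * f * (1.0 - f) * X   # df/ds = f(1 - f)

W = np.array([-1.0, 0.0, 0.0])
X = np.array([1.0, 0.0, 1.0])   # augmented input (last component is 1)
W_new = train_step(W, X, d=1.0, c=0.5, method="error_correction")
assert np.allclose(W_new, [-0.5, 0.0, 0.5])
```

All three move the weights toward reducing (d − f); they differ only in what stands in for ∂f/∂s.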

Training a TLU: Example


Train a neural network to match the
following linearly separable training set:


Behind the scenes: Planes
and Hyperplanes

What can a TLU learn?

Linearly Separable Functions


A single TLU can implement any
linearly separable function


AB' is linearly separable


A XOR B is not
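For instance, a single TLU with hand-picked weights realizes AB' (A AND NOT B); no choice of weights and threshold would do the same for A XOR B:

```python
def tlu_ab_prime(a, b):
    """TLU computing A AND (NOT B): weights [1, -1], threshold 0.5
    (chosen by hand for illustration)."""
    s = 1 * a + (-1) * b
    return 1 if s >= 0.5 else 0

# The unit matches the AB' truth table on all four inputs
for a in (0, 1):
    for b in (0, 1):
        assert tlu_ab_prime(a, b) == (1 if (a, b) == (1, 0) else 0)
```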

NEURAL NETWORKS

An Architecture for Learning

Neural Network Fundamentals


Chain multiple TLUs together


Three layers:


Input Layer


Hidden Layers


Output Layer


Two classifications:


Feed-Forward


Recurrent
Neural Network Terminology

Training ANNs: Backpropagation


Main Idea: distribute the error function
across the hidden layers, corresponding
to their effect on the output


Works on feed-forward networks


Use sigmoid units during training; afterwards
they can be replaced with threshold units.

Back-Propagation: Bird's-eye view


Repeat:


Choose training pair and copy it to input layer


Cycle that pattern through the net


Calculate error derivative between output
activation and target output


Back propagate the summed product of the
weights and errors in the output layer to
calculate the error on the hidden units


Update weights according to the error on that
unit


Until error is low or the net settles

Back-Prop: Sharing the Blame


We want to assign blame to each unit


W_ij = weights of the i-th sigmoid in the j-th layer


X^(j−1) = inputs to our TLU (outputs from the
previous layer)


c_ij = learning rate constant of the i-th sigmoid in
the j-th layer


δ_ij = sensitivity of the network output to
changes in the input of our TLU


Important equation: W_ij ← W_ij + c_ij δ_ij X^(j−1)

Back-Prop: Calculating δ_ij


For the output layer: δ_ij = δ_k = (d − f) ∂f/∂s_k


δ_ij = (d − f) f (1 − f) for a sigmoid


Therefore W_k ← W_k + c_k (d − f) f (1 − f) X^(k−1)


For the hidden layers:


See Nilsson 1998 for the calculation


Recursive formula with base case δ_k = (d − f) f (1 − f)

Back-Prop: Example


Train a 2-layer neural net with the
following input:


x1 = 1, x2 = 0, x3 = 1, d = 0

x1 = 0, x2 = 0, x3 = 1, d = 1

x1 = 0, x2 = 1, x3 = 1, d = 0

x1 = 1, x2 = 1, x3 = 1, d = 1
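A minimal sketch of the whole procedure on these four patterns. The hidden-layer size, learning rate, epoch count, and random seed are assumptions, not from the slides. Note that the target is x1 XNOR x2 (x3 is always 1 and acts as a bias input), so the hidden layer is genuinely needed:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# The four training patterns from the slide
X = np.array([[1., 0., 1.], [0., 0., 1.], [0., 1., 1.], [1., 1., 1.]])
D = np.array([0., 1., 0., 1.])

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # input -> hidden weights (4 hidden units)
W2 = rng.normal(size=4)        # hidden -> output weights
c = 0.5                        # learning rate

def mse():
    return np.mean((D - sigmoid(sigmoid(X @ W1) @ W2)) ** 2)

before = mse()
for _ in range(5000):
    for x, d in zip(X, D):
        h = sigmoid(x @ W1)                       # forward pass
        f = sigmoid(h @ W2)
        delta_out = (d - f) * f * (1 - f)         # output-layer delta
        delta_hid = delta_out * W2 * h * (1 - h)  # back-propagated deltas
        W2 += c * delta_out * h                   # update output weights
        W1 += c * np.outer(x, delta_hid)          # update hidden weights
assert mse() < before   # error decreases as training proceeds
```

Each pass implements the loop described earlier: forward cycle, output error, back-propagated deltas, then weight updates.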


Back-Prop: Problems


Learning rate is non-optimal


One solution: "learn" the learning rate


Network Paralysis: weights grow so large that
f_ij(1 − f_ij) → 0, and the net never learns


Local Extrema: Gradient Descent is a
greedy method


These problems are acceptable in many
cases, even if workarounds can't be found

Back-Prop: Momentum


We want to choose a learning rate that
is as large as possible


Speed up convergence


Avoid oscillations


Add a momentum term dependent on the
past weight change: ΔW(t) = −c ∂ε/∂W + α ΔW(t−1)
Another Method: ALOPEX


Used for visual receptive field mapping
by Tzanakou and Harth, 1973


Originally developed for receptive field
mapping in the visual pathway of frogs


The main ideas:


Use cross-correlation to determine a
direction of movement in the gradient field


Add a random element to avoid local
extrema
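A hedged per-weight sketch of the idea (not the original algorithm's exact form; gamma and the random-move probability are illustrative choices):

```python
import random

def alopex_step(w, dw_prev, dE_prev, gamma=0.01, p_random=0.2):
    """Move a weight by +/- gamma based on the correlation between the
    previous weight change and the previous error change; with probability
    p_random, move randomly instead to help escape local extrema."""
    if random.random() < p_random:
        return w + random.choice([-gamma, gamma])
    # positive correlation means the last move increased the error: reverse
    step = -gamma if dw_prev * dE_prev > 0 else gamma
    return w + step

# Deterministic check (no random moves): the last step raised the error,
# so the weight is pushed back the other way.
w = alopex_step(0.0, dw_prev=0.01, dE_prev=0.5, gamma=0.01, p_random=0.0)
assert abs(w - (-0.01)) < 1e-12
```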

WORKING WITH

NEURAL NETS

AI the easy way!

ANN Project Lifecycle


Task identification and design


Feasibility


Data Coding


Network Design


Data Collection


Data Checking


Training and Testing


Error Analysis


Network Analysis


System Implementation

ANN Design Tradeoffs


A good design will find a balance
between these two extremes!

ANN Design Balance: Depth


Too few hidden layers will cause errors in accuracy


Too many hidden layers will cause errors in generalization!


Modeling the neuron

Wetware: Biological Neurons

The Process: Neuron Firing


Each electrical signal received at a synapse
causes neurotransmitter release


The neurotransmitter travels across the synaptic
cleft and is received by the other neuron at a
receptor site


The Post-Synaptic Potential (PSP) either increases
(hyperpolarizes) or decreases (depolarizes) the
polarization of the post-synaptic membrane (the
receptors)


In hyperpolarization, the spike train is inhibited.
In depolarization, the spike train is excited.

The Process: Part 2


Each PSP travels along the dendrite of the
new neuron, and spreads itself over the cell
body


When the effects of the PSP reach the
axon hillock, they are summed with other PSPs.


If the sum is greater than a certain threshold,
the neuron fires a spike along the axon


Once the spike reaches the synapse of an
efferent neuron, the process starts in that
neuron

The neuron to the TLU


Cell Body (Soma) = accumulator plus its
threshold function


Dendrites = inputs to the TLU


Axon = output of the TLU


Information Encoding:


Neurons use frequency


TLUs use value

Modeling the Neuron: Capabilities


Humans and Neural Nets are both:


Good at pattern recognition


Bad at mathematical calculation


Good at compressing lots of information
into a yes/no decision


Taught via training period


TLUs win because neurons are slow


Wetware wins because we have a
cheap source of billions of neurons

Do ANNs model neuron structures?


No: Hundreds of types of specialized neurons,
only one type of TLU


No: Weights to neural threshold controlled by
many neurotransmitters, not just one


Yes: Most of the complexity in the neuron is
devoted to sustaining life, not information
processing


Maybe: There is no real method for
backpropagation in the brain. Instead, firing
of neurons increases connection strength


High Level: Agent Architecture


Our minds are composed of a series of
non-intelligent agents


The hierarchy, interconnections, and
interactions between the agents creates
our intelligence


There is no one agent in control


We learn by forming new connections
between agents


We improve by dealing with agents at a
higher level, i.e. creating mental 'scripts'

Agent Hierarchy: Playing with Blocks

From the outside, Builder knows how to build towers.

From inside, Builder just turns on other agents.

How We Remember: K-Line Theory

New Knowledge: Connections


Sandcastles in the sky: Everything we know is
connected to everything else we know


Knowledge is acquired by making new connections
between "things" we already know

Learning Meaning


Uniframing: Combining several
descriptions into one


Accumulating: Collecting incompatible
descriptions


Reformulating: modifying a description’s
character


Transforming: bridging between
structures and functions or actions

The Exception Principle


It rarely pays to tamper with a rule that
nearly always works. It is better to
complement it with an accumulation of
exceptions


Birds can Fly


Birds can fly, unless they are penguins
or ostriches

The Exception Principle:
Overfitting


Birds can fly, unless they are penguins
and ostriches, or if they happen to be
dead, or have broken wings, or are
confined to cages, or have their feet
stuck in cement, or have undergone
experiences so dreadful as to render
them psychologically incapable of flight


In real thought, finding exceptions to
everything is usually unnecessary.

Minsky's Principles


Most new knowledge is simply finding a
new way to relate things we already
know


There is nothing wrong with circular
logic or having imperfect rules


Any idea will seem self-evident... once
you've forgotten learning it.


Easy things are hard: We’re least aware
of what our minds do best

TO THE FUTURE AND
BEYOND

Why you should be nice

to your computer

I’m lonely and I’m bored.


Come play with me!

Computers are Dumb


“Deep Blue might be able to win at
chess, but it won’t know to come in from
the rain.”


Computers can only know what they’re
told, or what they’re told to learn


Computers lack a sense of mortality and
a physical self to preserve


All of this will change when computers
can reach ‘consciousness’


I, Silicon Consciousness


Kurzweil: By 2019, a $1000 computer
will be equivalent to the human brain.


By 2029, machines will claim to be
conscious. We will believe them.


By 2049, nanobot swarms will make
virtual reality obsolete in real reality.


By 2099, man and machine will have
completely merged.


You mean to tell me?????


We humans will gradually introduce
machines into our bodies, as implants


Our machines will grow more human as
they learn, and learn to design themselves


The Neo-Luddite scenarios:


AI succeeds in creating conscious beings. All
life is at the mercy of the machines.


Humans retain control: workers are obsolete.
The power to decide the fate of the masses is
now completely in the hands of the elite.

Neural Networks: Conclusions


Neural Networks are a powerful tool for:


Pattern recognition


Generalizing to a problem


Machine learning


Training Neural Networks


Can be done, but exercise great care


Still has room for improvement


Understanding and creating
consciousness?


Still working on it :)