LING / C SC 439/539 Statistical Natural Language Processing


LING / C SC 439/539

Statistical Natural Language Processing


Lecture 9

2/11/2013

Recommended reading

Nilsson, Chapter 5, Neural Networks
http://ai.stanford.edu/~nilsson/mlbook.html

http://en.wikipedia.org/wiki/Connectionism

David E. Rumelhart and James L. McClelland. 1986. On learning the past tenses of English verbs. In McClelland, J. L., Rumelhart, D. E., and the PDP research group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume II. Cambridge, MA: MIT Press.

Steven Pinker and Alan Prince. 1988. On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73-193.

Steven Pinker and Michael T. Ullman. 2002. The past and future of the past tense. Trends in Cognitive Sciences, 6, 456-463.

Outline


Cognitive modeling


Perceptron as a model of the neuron


Neural networks


Acquisition of English past tense verbs


Discuss WA #2 and WA #3

Learning in cognitive science vs. engineering


Cognitive science


Want to produce a model of the human mind


Develop learning algorithms that model how the brain
(possibly) computes


Understand observed behaviors in human learning



Engineering


Solve the problem; do whatever it takes to increase
performance


Issues of relevance to the brain/mind are secondary


No constraints on resources to be used

Cognitive science: model what humans do


Learn aspects of language


How children acquire Part of Speech categories


How children acquire the ability to produce the past
tense forms of English verbs


Process language


Sentence comprehension


Vowel perception



Won’t see sentiment analysis of movie reviews in
a cognitive science journal…

Cognitive science: model how humans learn/process


Model observed behaviors in human learning and
processing


Learning: errors in time course of acquisition


Overregularization of English past tense verbs: “holded” instead of “held”


Processing: reproduce human interpretations, such as
for reduced relative clauses:


The horse raced past the barn fell



Use appropriate input data


Language acquisition: corpus of child-directed speech


Brains and computers have different architectures


Brain:


Consists of over 100 billion neurons, which connect to other
neurons through synapses


Between 100 billion and 100 trillion synapses


Parallel computation


Computer:


One very fast processor


Serial computation


(okay, you can have multi-processor systems, but you won’t have hundreds of billions of processors)



Cognitive modeling: write computer programs that
simulate distributed, parallel computation


Field of Connectionism


Cognitive science: model how brain computes

Biological neuron

1. Receives input from other neurons through its dendrites

2. Performs a computation.

3. Sends result from its axon to the dendrites of other neurons.

All-or-nothing firing of neuron

http://www.cse.unsw.edu.au/~billw/dictionaries/pix/bioneuron.gif

Issue of abstract representations


In modeling data, we use abstract representations


Language: linguistic concepts such as “word”, “part of
speech”, “sentence”, “phonological rule”, “syllable”,
“morpheme”, etc.



The brain performs all computations through its neurons. Does the brain also use higher-level abstract representations like those proposed in linguistics?



Connectionism:


Opinion: no, there is no reality to linguistic theory


Try to show that distributed, parallel computational model
of the brain is sufficient to learn and process language

Outline


Cognitive modeling


Perceptron as a model of the neuron


Neural networks


Acquisition of English past tense verbs


Discuss WA #2 and WA #3

Origin of the perceptron


Originally formulated as a model of biological
computation



Want to model the brain


The brain consists of a huge network of neurons



Artificial neural networks


Perceptron = model of a single neuron

Neural network: a network of perceptrons linked together

Biological neuron

1. Receives input from other neurons through its dendrites

2. Performs a computation.

3. Sends result from its axon to the dendrites of other neurons.

All-or-nothing firing of neuron

http://www.cse.unsw.edu.au/~billw/dictionaries/pix/bioneuron.gif

McCulloch and Pitts (1943): first computational model of a neuron

http://www.asc-cybernetics.org/images/pitts80x80.jpg

http://wordassociation1.net/mcculloch.jpg

McCulloch-Pitts neuron

A picture of McCulloch and Pitts’s mathematical model of a neuron. The inputs x_i are multiplied by the weights w_i, and the neuron sums their values. If this sum is greater than the threshold θ, then the neuron fires; otherwise it does not.
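As a concrete illustration, here is a minimal Python sketch of a McCulloch-Pitts unit (the weights and threshold below are made-up values, not from the original paper):

```python
def mcculloch_pitts_neuron(inputs, weights, theta):
    """Fire (output 1) if the weighted sum of the inputs exceeds the threshold theta."""
    h = sum(x * w for x, w in zip(inputs, weights))
    return 1 if h > theta else 0

# Illustrative values: weighted sum is 0.9, which exceeds the threshold 0.8
print(mcculloch_pitts_neuron([1, 0, 1], [0.5, 0.2, 0.4], theta=0.8))  # prints 1
```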

Perceptron learning algorithm: Rosenblatt (1958)

http://www.enzyklopaedie-der-wirtschaftsinformatik.de/wi-enzyklopaedie/Members/wilex4/Rosen-2.jpg/image_preview

Perceptron can’t learn a hyperplane for linearly inseparable data

Marvin Minsky and Seymour Papert, 1969:

Perceptron fails to learn XOR

XOR is a very simple function

There’s no hope for the perceptron

Led to a decline in research in neural computation

Wasn’t an active research topic again until the 1980s

Outline


Cognitive modeling


Perceptron as a model of the neuron


Neural networks


Acquisition of English past tense verbs


Discuss WA #2 and WA #3

Have multiple output nodes, with different weights from each input

Each input vector X = (x_1, x_2, …, x_m)

Set of j perceptrons, each computing an output

Weight matrix W: size m x j

(Figure: input nodes x_1, x_2, …, x_m, each connected to all j output nodes)
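A minimal sketch of this single layer of perceptrons in Python with NumPy (the sizes and values are illustrative, not from the slides):

```python
import numpy as np

def perceptron_layer(x, W):
    """Compute j perceptron outputs for one input vector x of length m,
    given a weight matrix W of shape (m, j)."""
    h = x @ W                    # one weighted sum per output node
    return (h > 0).astype(int)   # threshold each sum at 0

# Illustrative sizes: m = 3 inputs, j = 4 output nodes
x = np.array([1.0, 0.5, -0.2])
W = np.random.randn(3, 4)
print(perceptron_layer(x, W))    # e.g. [1 0 0 1]
```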

A multi-layer perceptron, or neural network, has one or more hidden layers

Hidden layers consist of perceptrons (neurons)

Feedforward neural network


Output from hidden layer(s) serves as input to
next layer(s)


Computation flows from left to right

Computational capacity of neural network


Can learn any smooth function


Perceptron and SVM learn linear decision boundaries


Can learn XOR

XOR function, using pre-defined weights

Input: A = 0, B = 1

Input to C: -.5*1 + 1*1 = 0.5
Output of C: 1

Input to D: -1*1 + 1*1 = 0
Output of D: 0

Input to E: -.5*1 + 1*1 + -1*0 = 0.5
Output of E: 1


XOR function, using pre-defined weights

Input: A = 1, B = 1

Input to C: -.5*1 + 1*1 + 1*1 = 1.5
Output of C: 1

Input to D: -1*1 + 1*1 + 1*1 = 1
Output of D: 1

Input to E: -.5*1 + 1*1 + -1*1 = -.5
Output of E: 0
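The two worked examples above are determined by the network’s fixed weights; the following Python sketch reconstructs those weights from the arithmetic on the slides (with a step activation that fires when its input exceeds 0) and checks all four input combinations:

```python
def step(h):
    # All-or-nothing activation: fire if the weighted sum exceeds 0
    return 1 if h > 0 else 0

def xor_net(a, b):
    # Weights reconstructed from the slide arithmetic (bias terms -0.5, -1, -0.5)
    c = step(-0.5 + 1 * a + 1 * b)   # hidden node C
    d = step(-1.0 + 1 * a + 1 * b)   # hidden node D
    e = step(-0.5 + 1 * c - 1 * d)   # output node E
    return e

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # prints the XOR truth table: 0, 1, 1, 0
```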


Learning in a neural network


1. Choose topology of network



2. Define activation function



3. Define error function



4. Define how weights are updated

1. Choose topology of network


How many hidden layers and how many
neurons?


There aren’t really any clear guidelines


Try out several configurations and pick the one
that works best



In practice, due to the difficulty of training,
people use networks that have one, or at most
two, hidden layers



2. Activation function

Activation: when a neuron fires

Let h be the weighted sum of inputs to a neuron.

Perceptron activation function:

g(h) = 1 if h > 0
g(h) = 0 if h <= 0

Activation function for neural network

Cannot use the threshold function

Need a smooth function for the gradient descent algorithm, which involves differentiation

Use a sigmoid function:

g(h) = 1 / (1 + e^(-βh))

where parameter β is a positive value
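A small Python sketch of this activation; the derivative is included because the weight updates below rely on differentiating it (β = 1 is just an illustrative choice):

```python
import math

def sigmoid(h, beta=1.0):
    """Smooth activation g(h) = 1 / (1 + exp(-beta * h))."""
    return 1.0 / (1.0 + math.exp(-beta * h))

def sigmoid_derivative(h, beta=1.0):
    """dg/dh = beta * g(h) * (1 - g(h)); this smoothness is what makes gradient descent possible."""
    g = sigmoid(h, beta)
    return beta * g * (1.0 - g)

print(sigmoid(0.0))             # 0.5: halfway between the off (0) and on (1) outputs
print(sigmoid_derivative(0.0))  # 0.25: the slope is steepest at h = 0
```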


3. Error function

Quantifies difference between targets and outputs

Error function for a single perceptron:

E = t - y

t = target
y = output

Value of error function

If t = 0 and y = 1, E = 0 - 1 = -1

If t = 1 and y = 0, E = 1 - 0 = 1

Training: modify weights to achieve zero error

Error function for neural network (I)

First attempt:

E = ∑ (t - y)

Fails because errors may cancel out

Example: suppose we make 2 errors.

First: target = 0, output = 1, error = -1

Second: target = 1, output = 0, error = 1

Sum of errors: -1 + 1 = zero error!


Error function for neural network (II)

Need to make errors positive

Let the error function be the sum-of-squares function:

E = ½ ∑ (t - y)²   (summed over output nodes; the ½ is conventional and simplifies the derivative)
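A one-line Python version of this error, assuming targets and outputs are given as equal-length lists:

```python
def sum_of_squares_error(targets, outputs):
    """E = 1/2 * sum((t - y)^2). Squaring keeps every error positive,
    so errors can no longer cancel each other out."""
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs))

# The two cancelling errors from the previous slide now add up instead of cancelling
print(sum_of_squares_error([0, 1], [1, 0]))  # 0.5 * ((0-1)**2 + (1-0)**2) = 1.0
```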



4. Update weights

The error function is minimized by updating weights

Perceptron weight update: w = w + η * X^T (t - o)

Updating weights in the perceptron was simple

Direct relationship between input and output

How do we do this in a neural network?

There may be multiple layers intervening between inputs and outputs
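Before turning to multiple layers, here is a minimal NumPy sketch of the perceptron update rule quoted above (the learning rate η and the OR-function data are illustrative):

```python
import numpy as np

def perceptron_update(W, X, t, eta=0.1):
    """One round of w = w + eta * X^T (t - o) over a batch of inputs X."""
    o = (X @ W > 0).astype(float)     # current perceptron outputs
    return W + eta * X.T @ (t - o)    # push weights toward the targets

# Illustrative data: learn the OR function (4 examples, 2 inputs, 1 output node)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([[0.], [1.], [1.], [1.]])
W = np.zeros((2, 1))
for _ in range(10):
    W = perceptron_update(W, X, t)
print((X @ W > 0).astype(int).ravel())  # [0 1 1 1] once the weights have converged
```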

Backpropagation

Suppose a neural network has one hidden layer

1st-layer weights: between input layer and hidden layer

2nd-layer weights: between hidden layer and output layer

Backpropagation: adjust weights backwards, one layer at a time

Update 2nd-layer weights using errors at output layer

Then update 1st-layer weights using errors at hidden layer

See readings for details of algorithm (requires calculus)
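For a sense of what those details look like, here is a compact NumPy sketch of backpropagation for a network with one hidden layer, using the sigmoid activation and sum-of-squares error from above (layer sizes, learning rate, and the XOR training data are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

# Illustrative topology: 2 inputs (+ bias) -> 3 hidden neurons (+ bias) -> 1 output
W1 = rng.normal(size=(3, 3))   # 1st-layer weights (input -> hidden)
W2 = rng.normal(size=(4, 1))   # 2nd-layer weights (hidden -> output)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([[0.], [1.], [1.], [0.]])      # XOR targets
Xb = np.hstack([X, np.ones((4, 1))])        # append a constant bias input

eta = 0.5
for _ in range(20000):
    # Forward pass
    hidden = sigmoid(Xb @ W1)
    hiddenb = np.hstack([hidden, np.ones((4, 1))])
    output = sigmoid(hiddenb @ W2)

    # Backward pass: output-layer deltas first, then propagate back to the hidden layer
    delta_out = (output - t) * output * (1 - output)
    delta_hid = (delta_out @ W2[:3].T) * hidden * (1 - hidden)

    # Update 2nd-layer weights, then 1st-layer weights (gradient descent)
    W2 -= eta * hiddenb.T @ delta_out
    W1 -= eta * Xb.T @ delta_hid

print(output.ravel())  # outputs should approach 0, 1, 1, 0 (may get stuck in a local minimum)
```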


Neural network training algorithm:
quick summary

Neural network training algorithm: details


Neural network training algorithm: details


Gradient descent


Series of training iterations


Weights of the network are modified in each iteration



Gradient descent:

Each iteration tries to minimize the value of the error function

In the limit of the number of iterations, tries to find a configuration of weights that leads to zero error on the training examples

Problem: local minima in error function


Algorithm gets “stuck” in local minima


Weights may reach a stable configuration such that
the updating function does not compute a better
alternative for next iteration



Result determined by initial weights


Random initialization of weights


These values determine the final weight configuration,
which may not necessarily lead to zero error

Gradient descent gets stuck in a local minimum; want to find the global minimum

(Figure: error function surface with the global minimum and a local minimum labeled)

Overfitting

Since neural networks can learn any continuous function, they can be trained to the point where they overfit training data

Overfitting: algorithm has learned about the specific data points in the training set, and their noise, instead of the more general input/output relationship

Consequence: although error on training data is minimized, performance on testing set is degraded




Training and testing error rates over time


During initial iterations of training:


Training error rate decreases


Testing error rate decreases



When the network is overfitting:

Training error rate continues to decrease

Testing error rate increases



A characteristic of many learning algorithms

Illustration of overfitting

(Figure: error rate vs. training iterations; black = training data, red = testing data)

Prevent overfitting with a validation set


Set aside a portion of your training data to be a
validation set


Evaluate performance on validation set over time


Stop when error rate increases
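A hedged Python sketch of this early-stopping idea (train_one_iteration and error_rate below are placeholder functions standing in for whatever training and evaluation routines are in use, not a specific API):

```python
def train_with_early_stopping(network, train_set, validation_set,
                              train_one_iteration, error_rate, max_iters=1000):
    """Stop training as soon as the validation error starts to increase."""
    best_error = float("inf")
    for _ in range(max_iters):
        train_one_iteration(network, train_set)          # one round of weight updates
        val_error = error_rate(network, validation_set)
        if val_error > best_error:
            break                                        # validation error went up: stop before overfitting
        best_error = val_error
    return network
```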

Neural network and model selection

Available models: range of possible parameterizations of the model (topology of network, random initial weights)

How well the model fits the data: separate points in training data; generalize to new data

Balance simplicity of model and fit to data: the model can be rather complex if there are lots of layers and nodes

Noisy data: can overtrain and fit to noise

Separability: any smooth function

Maximum margin: no

Computational issues: can require more training data for good performance

Neural networks in practice


Cognitive scientists and psychologists like neural
networks


Biologically-inspired model of computation


(though it’s vastly simplified compared to brains)


They often don’t know about other learning algorithms!



Engineers don’t use (basic) neural networks


Don’t work as well as other algorithms, such as SVM


Takes a long time to train


However, there is always new research in distributed
learning models


Deep Belief Networks (DBNs) have recently become popular, and have done very well in classification tasks…

Outline


Cognitive modeling


Perceptron as a model of the neuron


Neural networks


Acquisition of English past tense verbs


Discuss WA #2 and WA #3

Rumelhart & McClelland 1986

David E. Rumelhart and James L. McClelland. 1986. On learning the past tenses of English verbs. In McClelland, J. L., Rumelhart, D. E., and the PDP research group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume II. Cambridge, MA: MIT Press.



One of the most famous papers in cognitive
science


Main issue: what is the source of
linguistic knowledge?


Empiricism / connectionism:

Ability to use language is a result of general-purpose learning capabilities of the brain

“Linguistic structures” are emergent properties of the learning process (not hard-coded into the brain)

Rationalism / Chomsky / generative linguistics:

The mind is not a “blank slate”

Human brains are equipped with representational and processing mechanisms specific to language

English past tense verbs

Regular inflection: for all regulars, apply rule “add -ed”

cook, cooked
cheat, cheated
climb, climbed

Irregular inflection: less predictable

eat, ate (rule: suppletion)
drink, drank (rule: lower vowel)
swing, swung (rule: lower vowel)
choose, chose (rule: shorten vowel)
fly, flew (rule: vowel → u)

Children’s acquisition of past tense


Observed stages in children’s acquisition


1. Initially memorize forms, including the correct
irregular forms


2. Then acquire regular rule



But also overregularize: bited, choosed, drinked

3. Correct usage

Wug test (Berko 1958)

Children can generate morphological forms of novel words

This is a wug. Here are two ___.


Rumelhart & McClelland’s test


A neural network does not have specific
structures for language


Neural network can be applied to any learning task



Use a neural network to learn mapping from verb
base form to its past tense form



If successful, supports the hypothesis that
the
brain does not have specific representations for
language either

Rumelhart & McClelland 1986: artificial neural network

Wickelphones

Wickelgren 1969: represent words as overlapping, context-sensitive 3-phoneme units (wickelphones)

“cat” = /kat/ = #ka, kat, at#

35 phonemes: 35 x 35 x 35 = 42,875 wickelphones

Need a more compact representation
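A small Python sketch of this decomposition, treating each character as one phoneme (a simplification of the actual phonemic transcription):

```python
def wickelphones(word):
    """Decompose a word into overlapping 3-phoneme units, with '#' marking
    the word boundary on each side, e.g. 'kat' -> ['#ka', 'kat', 'at#']."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(wickelphones("kat"))   # ['#ka', 'kat', 'at#']
```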

Wickelfeatures

Use phonological features instead of characters

Voiced/unvoiced, front/back, nasal, etc.

Wickelfeature = triple of phonological features

460 different possible wickelfeatures

Example: kat → (Interrupted, Vowel, Front)

Wickelphone → wickelfeatures

Each wickelphone activates 16 wickelfeatures


Using the network

Input: word to be converted to past tense

Step 1. Represent word as wickelphones

Step 2. Wickelphones → wickelfeatures

Step 3. Wickelfeatures → wickelfeatures (past tense)

Step 4. Wickelfeatures → wickelphones



Linguistic knowledge

A word is simply a sequence of wickelphones

No syllables

No morphological structure

Dependence of wickelphones upon others is local (linear)

Distributed representation of phonological knowledge

No representation of a symbolic phonological rule that changes phonemes to other phonemes

Phonological changes result from mapping sets of phonological features to other sets of features




Learning with the network


Learn 506 verbs and their past tenses



Input: list of ( base, past ) pairs


( eat, ate )


( throw, threw )


( walk, walked )



Learning procedure:


Initialize network with random values for weights


Multiple iterations of learning:


If network makes wrong prediction, adjust weights
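In outline, this is the same error-driven training loop seen earlier; the sketch below assumes a hypothetical network object that handles the wickelfeature encoding, prediction, and weight adjustment (none of these method names come from R&M):

```python
def train_past_tense_network(network, pairs, iterations=100):
    """Error-driven learning over (base, past) verb pairs: start from random
    weights and adjust them whenever the network's prediction is wrong."""
    network.randomize_weights()
    for _ in range(iterations):
        for base, past in pairs:
            predicted = network.predict_past_tense(base)   # base form -> predicted past tense
            if predicted != past:
                network.adjust_weights(base, past)          # move the outputs toward the target
    return network

# Example input, as listed on the slide:
# pairs = [("eat", "ate"), ("throw", "threw"), ("walk", "walked")]
```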

% correct over time for regular and
irregular verbs


R&M’s claims

No explicit representation of a phonological rule A → B / C __ D, such as “add -ed” or “lower vowel”

Nevertheless, it is able to accomplish the learning task

Network learned the English past tense “rule”

Simulates children’s learning of irregulars

Correct usage first

Then overgeneralize

Correct usage over time

Phonological knowledge emerges from the network

Problems with R&M’s experiment

Steven Pinker & Alan Prince, 1988

Doesn’t work: makes wrong predictions

mail → membled
trilb → treelilt
brilth → prevailed
smeed → imin

Network actually does have phonological knowledge, in the form of wickelphones and wickelfeatures



What explains the discrepancy?

Why did R&M report such high performance, while P&P say it makes lots of mistakes?

It appears that R&M did not evaluate performance on a separate testing set

Their figure for “% correct” indicates performance for learning on the training set

Summary


Rumelhart & McClelland tried to argue against linguistic representations by showing that a general-purpose learning algorithm could acquire language


Problems:


They use linguistic knowledge anyway


Tested on training set


Therefore, R&M’s conclusions cannot be supported
on the basis of this experiment.


Doesn’t support theory that the brain is a “blank slate”;
still possible that the brain uses complex representations
such as phonological rules proposed in linguistic theory

Personal commentary: statement of
learning problem is oversimplified


Is this how children learn past tense verb forms?


Assumptions, if applied to children:


Children are given lists of verbs in their base and past tense forms.


Children then generalize.



Issues that are not addressed:


How do children identify the tenses of verbs?


How do children know that one word form is the past tense of another?


What about the acquisition of other inflections, such as present
tense and progressive tense?


Under the “past tense” learning scenario, they would be treated as entirely
separate learning problems.


But shouldn’t learning be easier somehow if all inflections were learned simultaneously? Not addressed by this model.

Outline


Cognitive modeling


Perceptron as a model of the neuron


Neural networks


Acquisition of English past tense verbs


Discuss WA #2 and WA #3