Sentence Processing using a Simple Recurrent Network

EE 645 Final Project
Spring 2003

Dong-Wan Kang
5/14/2003

Contents

1. Introduction - Motivations
2. Previous & Related Works
   a) McClelland & Kawamoto (1986)
   b) Elman (1990, 1993, & 1999)
   c) Miikkulainen (1996)
3. Algorithms - Real Time Recurrent Learning (Williams and Zipser, 1989)
4. Simulations
5. Data & Encoding schemes
6. Results
7. Discussion & Future work

Motivations


Can the neural network recognize the lexical classes from the sentences and
learn the various types of sentences?

From a cognitive science perspective:
- comparison between human language learning and neural network learning
  patterns
- e.g., learning the English past tense (Rumelhart & McClelland, 1986),
  grammaticality judgment (Allen & Seidenberg, 1999), embedded sentences
  (Elman, 1993; Miikkulainen, 1996; etc.)

Related Works


McClelland & Kawamoto (1986)

- Sentences with case role assignments and semantic features, learned with the
  backpropagation algorithm
- output: 2500 case role units for each sentence
- e.g.
    input:  the boy hit the wall with the ball.
    output: [Agent Verb Patient Instrument] + [other features]
- Limitation: poses a hard limit on the input size.
- Alternative: instead of detecting input patterns displaced in space, detect
  patterns displaced in time (sequential inputs).



Related Works (continued)

Elman (1990, 1993, & 1999)

- Simple Recurrent Network: partially recurrent network using context units
- Network with a dynamic memory
- Context units at time t hold a copy of the activations of the hidden units
  from the previous time step, t-1.
- Network can recognize sequences.

    input:  Many   years   ago   boy   and   girl
              |      |      |     |     |     |
    output: years   ago    boy   and   girl
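In symbols (with generic weight names, not tlearn's own), the context-copy dynamics are:

    c(t) = h(t-1)
    h(t) = f( W_{xh} x(t) + W_{ch} c(t) )
    y(t) = g( W_{hy} h(t) )

where x(t) is the current input vector, h(t) the hidden activations, c(t) the context units, and y(t) the output.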


Related Works (continued)

Miikkulainen (1996)

- SPEC architecture (Subsymbolic Parser for Embedded Clauses), a recurrent
  network
- Parser, Segmenter, and Stack: process center- and tail-embedded sentences;
  98,100 sentences of 49 different sentence types, using case role assignments
- e.g., sequential inputs
    input:  ..., the girl, who, liked, the dog, saw, the boy, ...
    output: ..., [the girl, saw, the boy] [the girl, liked, the dog]
    case roles: (agent, act, patient) (agent, act, patient)



Algorithms

Recurrent Network
- Unlike feedforward networks, recurrent networks allow connections both ways
  between a pair of units, and even from a unit to itself.
- Backpropagation Through Time (BPTT): unfolds the temporal operation of the
  network into a layered feedforward network at every time step
  (Rumelhart et al., 1986).
- Real Time Recurrent Learning (RTRL): two versions (Williams and Zipser, 1989):
  1) update the weights after the processing of a sequence is completed;
  2) on-line: update the weights while the sequence is being presented.
- Simple Recurrent Network (SRN): a partially recurrent network in terms of
  time and space. It has context units which store the outputs of the hidden
  units (Elman, 1990). (It can be obtained as a modification of the RTRL
  algorithm.)

Real Time Recurrent Learning

Williams and Zipser (1989)

- This algorithm computes the derivatives of the states and outputs with
  respect to all the weights as the network processes the sequence.

Summary of the algorithm:
In a recurrent network in which any unit may be connected to any other, with
external input x_i(t) at node i at time t, the dynamic update rule is

    y_k(t+1) = f_k(s_k(t)),   s_k(t) = \sum_l w_{kl} \, z_l(t),

where z(t) concatenates the external inputs and the current unit outputs:
z_l(t) = x_l(t) for input nodes and z_l(t) = y_l(t) for network units.

RTRL (continued)

Error measure, with target outputs d_k(t) defined for some k's and t's:

    e_k(t) = d_k(t) - y_k(t)   if d_k(t) is defined at time t;
    e_k(t) = 0                 otherwise.

Total cost function:

    J_total = \sum_{t=0}^{T} J(t),   t = 0, 1, ..., T,
    where J(t) = (1/2) \sum_k e_k(t)^2.

RTRL (continued)

The gradient of E separates in time, so to do gradient descent we define the
sensitivities

    p^k_{ij}(t) = \partial y_k(t) / \partial w_{ij},

and the weight change at each time step is

    \Delta w_{ij}(t) = -\eta \, \partial J(t) / \partial w_{ij}
                     = \eta \sum_k e_k(t) \, p^k_{ij}(t).

The derivative of the dynamic update rule gives the recursion

    p^k_{ij}(t+1) = f_k'(s_k(t)) \Big[ \sum_l w_{kl} \, p^l_{ij}(t) + \delta_{ik} z_j(t) \Big],

with initial condition p^k_{ij}(0) = 0 at t = 0.

RTRL (continued)

Depending on when the weights are updated, there are two versions of RTRL:

1) update the weights after the sequence is completed (at t = T);
2) on-line: update the weights after each time step.

Elman's "tlearn" simulator for the Simple Recurrent Network (which I am using
for this project) is implemented with the classical backpropagation algorithm
plus a modification of this RTRL algorithm.
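As a rough illustration of the on-line version (2), a minimal NumPy sketch of RTRL could look like the following. This is not the tlearn implementation; the function and variable names are mine, and NaN entries in the target vectors mark time steps where no target output is defined.

    import numpy as np

    # Minimal on-line RTRL sketch (version 2 above); illustrative only, not tlearn.
    # n recurrent units, m external inputs per time step.
    def rtrl_online(inputs, targets, n, m, eta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=0.1, size=(n, n + m + 1))  # weights over [y(t), x(t), bias]
        y = np.zeros(n)                                 # unit outputs y_k(t)
        p = np.zeros((n, n, n + m + 1))                 # sensitivities p^k_ij(t)
        for x, d in zip(inputs, targets):
            e = np.where(np.isnan(d), 0.0, d - y)       # e_k(t): error where a target is defined
            W += eta * np.einsum('k,kij->ij', e, p)     # on-line weight update
            z = np.concatenate([y, x, [1.0]])           # z_l(t): unit outputs, inputs, bias
            y_next = 1.0 / (1.0 + np.exp(-(W @ z)))     # y_k(t+1) = f(sum_l w_kl z_l(t))
            fp = y_next * (1.0 - y_next)                # f'(s_k(t)) for the logistic f
            # p^k_ij(t+1) = f'(s_k) [ sum_l w_kl p^l_ij(t) + delta_ik z_j(t) ]
            p_next = np.einsum('kl,lij->kij', W[:, :n], p)
            p_next[np.arange(n), np.arange(n), :] += z
            p = fp[:, None, None] * p_next
            y = y_next
        return W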


Simulation

Based on Elman's data and Simple Recurrent Network (1990, 1993, & 1999), simple
sentences and embedded sentences are simulated using the "tlearn" neural
network program (BP + a modified version of the RTRL algorithm), available at
http://crl.ucsd.edu/innate/index.shtml.

Questions:

1. Can the network discover the lexical classes from word order?

2. Can the network recognize the relative pronouns and predict them?

Network Architecture

31 input nodes
31 output nodes
150 hidden nodes
150 context nodes

* black arrow: distributed and learnable connections
* dotted blue arrow: linear function and one-to-one connection with the hidden
  nodes
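As a toy illustration of one forward step with these dimensions (not tlearn's actual code; the weight names are mine):

    import numpy as np

    # 31 inputs, 150 hidden, 150 context, 31 outputs (as above); illustrative only.
    n_in, n_hid, n_out = 31, 150, 31
    rng = np.random.default_rng(0)
    W_ih = rng.normal(scale=0.1, size=(n_hid, n_in))   # input  -> hidden (learnable, black arrow)
    W_ch = rng.normal(scale=0.1, size=(n_hid, n_hid))  # context -> hidden (learnable, black arrow)
    W_ho = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output (learnable, black arrow)

    def srn_step(x, context):
        """One time step: context holds a one-to-one copy (dotted arrow)
        of the previous step's hidden activations."""
        hidden = 1.0 / (1.0 + np.exp(-(W_ih @ x + W_ch @ context)))
        output = 1.0 / (1.0 + np.exp(-(W_ho @ hidden)))
        return output, hidden   # the returned hidden becomes the next context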


Training Data

Lexicon (31 words)

NOUN-HUM      man woman boy girl
NOUN-ANIM     cat mouse dog lion
NOUN-INANIM   book rock car
NOUN-AGRESS   dragon monster
NOUN-FRAG     glass plate
NOUN-FOOD     cookie bread sandwich
VERB-INTRAN   think sleep exist
VERB-TRAN     see chase like
VERB-AGPAT    move break
VERB-PERCEPT  smell see
VERB-DESTROY  break smash
VERB-EAT      eat
-----------------------------
RELAT-HUM     who
RELAT-INHUM   which


Grammar (16 templates)

NOUN-HUM     VERB-EAT      NOUN-FOOD
NOUN-HUM     VERB-PERCEPT  NOUN-INANIM
NOUN-HUM     VERB-DESTROY  NOUN-FRAG
NOUN-HUM     VERB-INTRAN
NOUN-HUM     VERB-TRAN     NOUN-HUM
NOUN-HUM     VERB-AGPAT    NOUN-INANIM
NOUN-HUM     VERB-AGPAT
NOUN-ANIM    VERB-EAT      NOUN-FOOD
NOUN-ANIM    VERB-TRAN     NOUN-ANIM
NOUN-ANIM    VERB-AGPAT    NOUN-INANIM
NOUN-ANIM    VERB-AGPAT
NOUN-INANIM  VERB-AGPAT
NOUN-AGRESS  VERB-DESTROY  NOUN-FRAG
NOUN-AGRESS  VERB-EAT      NOUN-HUM
NOUN-AGRESS  VERB-EAT      NOUN-ANIM
NOUN-AGRESS  VERB-EAT      NOUN-FOOD
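The slides do not show the generation script, but a corpus of this kind can be sketched as below, using a subset of the lexicon and templates above:

    import random

    # Illustrative generator over a subset of the lexicon and templates above;
    # the real corpus generation used all 31 words and all 16 templates.
    LEXICON = {
        "NOUN-HUM":    ["man", "woman", "boy", "girl"],
        "NOUN-ANIM":   ["cat", "mouse", "dog", "lion"],
        "NOUN-FOOD":   ["cookie", "bread", "sandwich"],
        "VERB-EAT":    ["eat"],
        "VERB-INTRAN": ["think", "sleep", "exist"],
        "VERB-TRAN":   ["see", "chase", "like"],
    }
    TEMPLATES = [
        ["NOUN-HUM", "VERB-EAT", "NOUN-FOOD"],
        ["NOUN-HUM", "VERB-INTRAN"],
        ["NOUN-ANIM", "VERB-TRAN", "NOUN-ANIM"],
    ]

    def generate_sentences(n):
        """Yield n random sentences, each a list of words drawn from a template."""
        for _ in range(n):
            template = random.choice(TEMPLATES)
            yield [random.choice(LEXICON[category]) for category in template]

    # e.g. list(generate_sentences(2)) -> [['girl', 'eat', 'bread'], ['lion', 'chase', 'cat']]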

Sample Sentences & Mapping

Simple sentences - 2 types
- man think (2 words)
- girl see dog (3 words)
- man break glass (3 words)

Embedded sentences - 3 types (*RP = Relative Pronoun)
1. monster eat man who sleep         (RP-sub, VERB-INTRAN)
2. dog see man who eat sandwich      (RP-sub, VERB-TRAN)
3. woman eat cookie which cat chase  (RP-obj, VERB-TRAN)

Input-Output Mapping: predict the next input in the sequential input stream

    INPUT:  girl  see  dog  man   break  glass  cat ...
             |     |    |    |      |      |     |
    OUTPUT:  see  dog  man  break  glass  cat  ...
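In other words, the training pairs are just the concatenated word stream shifted by one position, e.g.:

    # Build (input word, target word) pairs from the concatenated sentence stream.
    stream = ["girl", "see", "dog", "man", "break", "glass", "cat"]
    pairs = list(zip(stream[:-1], stream[1:]))
    # -> [('girl', 'see'), ('see', 'dog'), ('dog', 'man'),
    #     ('man', 'break'), ('break', 'glass'), ('glass', 'cat')]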


Encoding scheme

Random word representation
- a 31-bit vector for each lexical item; each lexical item is represented by a
  different, randomly assigned bit
- not a semantic feature encoding

    sleep   0000000000000000000000000000001
    dog     0000100000000000000000000000000
    woman   0000000000000000000000000010000
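A minimal sketch of this one-of-31 ("localist") encoding; the vocabulary order here is only illustrative, since the slide assigns bits randomly:

    import numpy as np

    VOCAB = ["sleep", "dog", "woman"]  # ... the full 31-word lexicon in practice

    def encode(word, vocab=VOCAB, size=31):
        """Return a 31-bit vector with a single 1 in the slot assigned to `word`."""
        v = np.zeros(size)
        v[vocab.index(word)] = 1.0
        return v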




Training a network

Incremental input (Elman, 1993): "starting small" strategy

Phase I: simple sentences (Elman, 1990, used 10,000 sentences)
- 1,564 sentences generated (4,636 31-bit vectors)
- train on all patterns: learning rate = 0.1, 23 epochs

Phase II: embedded sentences (Elman, 1993, used 7,500 sentences)
- 5,976 sentences generated (35,688 31-bit vectors)
- loaded with the weights from Phase I
- train on the (1,564 + 5,976) sentences together: learning rate = 0.1, 4 epochs
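Schematically, the schedule looks like the sketch below; train_epoch is a placeholder for one pass of the learner, not a tlearn routine:

    def starting_small(train_epoch, network, simple, embedded, lr=0.1):
        """'Starting small': Phase I on simple sentences only, then Phase II on
        the combined corpus, reusing the Phase I weights."""
        for _ in range(23):                      # Phase I: 23 epochs
            train_epoch(network, simple, lr)
        for _ in range(4):                       # Phase II: 4 epochs
            train_epoch(network, simple + embedded, lr)
        return network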

Performance

Network performance was measured by the Root Mean Squared (RMS) error over the
N input patterns, with target output vector d_p and actual output vector y_p:

    RMS = \sqrt{ (1/N) \sum_{p=1}^{N} \| d_p - y_p \|^2 }

Phase I:  after 23 epochs, RMS ≈ 0.91
Phase II: after 4 epochs,  RMS ≈ 0.84

Why can the RMS not be lowered further? The prediction task is
nondeterministic, so the network cannot produce a unique output for each input.
For this simulation, RMS is NOT the best measure of performance.

Elman's simulations: RMS = 0.88 (1990), Mean Cosine = 0.852 (1993)
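A direct transcription of this measure as a small sketch (assuming the per-pattern error is the squared Euclidean distance between target and output vectors, as in the formula above):

    import numpy as np

    def rms_error(targets, outputs):
        """Root Mean Squared error over N (target, output) vector pairs."""
        targets, outputs = np.asarray(targets), np.asarray(outputs)
        return np.sqrt(np.mean(np.sum((targets - outputs) ** 2, axis=1)))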


Results and Analysis

<Test results after phase II>

    Output    Target
    which     (target: which)
    ?         (target: lion)
    ?         (target: see)
    ?         (target: boy)
    ?         (target: move)
    ?         (target: sandwich)
    which     (target: which)
    ?         (target: cat)
    ?         (target: see)
    ?         (target: book)
    which     (target: which)
    ?         (target: man)
    see       (target: see)
    ?         (target: dog)
    ?         (target: chase)
    ?         (target: man)
    ?         (target: who)
    ?         (target: smash)
    ?         (target: glass)
    ...

Arrow (→) indicates the start of a sentence.



In all positions, the word “which” is predicted correctly! But most of the
words are not predicted, including the word “who”. Why? (This is related to
the training data.)

Since the prediction task is non-deterministic, predicting the exact next word
cannot be the best performance measurement.

We need to look at the hidden unit activations for each input, since they
reflect what the network has learned about classes of inputs with regard to
what they predict.

Cluster Analysis, PCA

Cluster Analysis
- The network successfully recognizes VERB, NOUN, and some of their
  subcategories.
- WHO and WHICH have different distances.
- VERB-INTRAN failed to fit in with the other VERBs.

<Hierarchical cluster diagram of hidden unit activation vectors>
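As a sketch of how such a diagram can be produced from recorded hidden-unit activations (using SciPy's hierarchical clustering; this is not the original analysis tool):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    def cluster_hidden_activations(words, activations):
        """words: label for each presentation; activations: (n_presentations, n_hidden).
        Average each word's hidden-unit vectors, then plot a hierarchical clustering."""
        labels = sorted(set(words))
        acts = np.asarray(activations)
        means = np.array([acts[np.array(words) == lab].mean(axis=0) for lab in labels])
        dendrogram(linkage(means, method="average"), labels=labels)
        plt.show()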


Discussion & Conclusion

1. The network can discover the lexical classes from word order. Noun and Verb
   are distinguished as different classes, except for “VERB-INTRAN”. Also, the
   subclasses of NOUN are classified correctly, but some subclasses of VERB are
   mixed. This is related to the input examples.

2. The network can recognize and predict the relative pronoun “which”, but not
   “who”. Why? Because the sentences containing “who” are not “RP-obj”
   sentences, so “who” is simply treated as an ordinary subject, as in the
   simple sentences.

3. The organization of the input data is important, and the recurrent network
   is sensitive to it, since it processes the input sequentially and on-line.

4. Generally, the recurrent network trained with RTRL recognized sequential
   input, but it requires more training time and computational resources.

Future Studies

Recurrent Least Squares Support Vector Machines (Suykens, J.A.K., &
Vandewalle, J., 2000)

- provides new perspectives for time-series prediction and nonlinear modeling
- seems more efficient than BPTT, RTRL & SRN

References

Allen, J., & Seidenberg, M. (1999). The emergence of grammaticality in connectionist networks. In B. MacWhinney (Ed.), The emergence of language (pp. 115-151). Hillsdale, NJ: Lawrence Erlbaum.

Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.

Elman, J.L. (1993). Learning and development in neural networks: the importance of starting small. Cognition, 48, 71-99.

Elman, J.L. (1999). The emergence of language: A conspiracy theory. In B. MacWhinney (Ed.), The emergence of language. Hillsdale, NJ: Lawrence Erlbaum.

Hertz, J., Krogh, A., & Palmer, R.G. (1991). Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley.

McClelland, J.L., & Kawamoto, A.H. (1986). Mechanisms of sentence processing: Assigning roles to constituents of sentences (pp. 273-325). In J.L. McClelland & D.E. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press.

Miikkulainen, R. (1996). Subsymbolic case-role analysis of sentences with embedded clauses. Cognitive Science, 20, 47-73.

Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. In D.E. Rumelhart & J.L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1 (pp. 318-362). Cambridge, MA: MIT Press.

Rumelhart, D.E., & McClelland, J.L. (1986). On learning the past tense of English verbs. In J.L. McClelland & D.E. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press.

Williams, R.J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270-280.