Connectionist models of language part II




For the course

Cognitive models of language and beyond


University of Amsterdam




February 28, 2011

Connectionist models of language
part II

The Hierarchical Prediction Network

Recap: Problems of the Simple Recurrent Network


The SRN does not learn the kind of generalizations that are typical for language users when they substitute words or entire phrases in different sentence positions.

I catch the train.
I catch the last bus.

Such generalizations presuppose a representation involving phrasal syntactic categories, and a neural mechanism for substitution, both of which are lacking in the SRN.


We need a connectionist network that can deal with the phrasal structure of language, and that can learn (syntactic) categories. It should be based on a neural theory of the acquisition of syntax.

Is there a neural story behind the acquisition of a hierarchical syntax?

A neural or connectionist account of phrase structure must cope with these questions:

What are the neural correlates of syntactic categories and phrase structure rules?

How are syntactic categories acquired, and how do they become abstract?

How do local trees combine to parse a sentence within a neural architecture? (the binding problem)

How does the brain represent hierarchical constituent structure?




[Figure: parse tree of “Jill reads a book” — S → PN VP; VP → V NP; NP → DET N.]

This work develops a theory and a computational model of
the neural instantiation of syntax.

Conceiving of neural grammar in analogy to
visual hierarchy



Progressively larger receptive fields (spatial and temporal compression)



Progressively more complex, and more invariant representations



If visual categories are hierarchically represented by neural
assemblies, then why not syntactic categories?

[Figure: a hierarchy analogous to the visual one, from phonemes (p, d, k, s, ʒ, ɑː, eɪ, ər, …) to words (go, house, fine, in, when, more, you, red, the) to phrases (PP: in the house; NP: red carpet; VP: play game) to the sentence “Jane plays a game in the house”.]


Decomposition of the object results in a hierarchical object
representation based on informative fragments.


Visual object recognition involves construction of parse trees


Rules of `visual grammar’ conjunctively bind simple categories into
more complex categories (they do feature binding)


Visual categories become progressively more invariant.


Lacking: temporal element, ability for recursion

Parse trees in the visual domain

Fragment-based hierarchical object recognition (Ullman, 2007)

The Memory Prediction Framework (MPF)


The MPF is a neurally motivated algorithm for pattern recognition by the neocortex [Hawkins & Blakeslee, 2004].

Cortical categories (columns) represent temporal sequences of patterns.

Categories become progressively more temporally compressed and invariant as one moves up in the cortical hierarchy.

Hierarchical temporal compression allows top-down prediction and anticipation of future events.

Parallel: phrasal categories temporally compress word sequences (NP → Det Adj N).

Syntactic categories are prototypical and graded

For a cognitive theory of syntax one must give up the notion that syntactic categories are discrete, set-theoretical entities.

Syntactic categories are prototypical (Lakoff, 1987): a nose is a more typical noun than a walk or time.

Category membership is graded (“nouniness”), exhibits typicality effects, and defines a similarity metric (e.g., adverb and adjective are more similar than adverb and pronoun).

Children’s categories are different from adult categories: there must be an incremental learning path from concrete, item-based categories to abstract adult categories.

Underlying the conventional syntactic categories is a continuous category space.


A hypothesis for a neural theory of grammar

The language area of the cortex contains local neuronal assemblies that function as neural correlates of graded syntactic categories.

Hierarchical temporal compression of assemblies in higher cortical layers is responsible for phrase structure encoding [Memory Prediction Framework (Hawkins, 2004)].

The topological and hierarchical arrangement of the syntactic assemblies in the cortical hierarchy constitutes a grammar. Grammar acquisition amounts to learning a topology.

Assemblies can be dynamically and serially bound to each other, enabling (phrasal) substitution (and recursion).


The Hierarchical Prediction Network (HPN): from theory to model

Features:

Hierarchical temporal compression

No labels, no fixed associations with non-terminals

Dynamic, serial binding; pointers stored in local memories

Continuous category space with prototypical, graded categories

[Figure: schematic HPN compressor nodes, annotated with the familiar category labels N, NP, Det, Adj, S and VP.]

Syntactic categories are regions in a continuous “substitution space” in HPN

[Figure: simple, lexical units (e.g. the, tomato, happy, eat, under) and complex units (X1, X2, X3) positioned in a continuous space whose regions correspond to NP/noun, VP/verb and PP/preposition.]

A node’s position in substitution space defines its graded membership of one or more syntactic categories in a continuous space. Substitutability is given by the topological distance in substitution space, and it is learned gradually. Labels are whatever the linguist projects onto this space.

Temporal integration in compressor nodes

[Figure: lexical input nodes (w1–w8) feeding a compressor node via its slots (slot1–slot3, root).]

A compressor node fires after all its slots have been activated in a specific order (temporal integration).

The probability of binding a node to a slot is a function of the distance between their representations in “substitution space”.
Binding via slots
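
To make the binding mechanism concrete, here is a minimal Python sketch (not from the original presentation) of a compressor node whose slots must be filled in a fixed order, and of a binding probability that decreases with distance in substitution space. The exponential decay, the normalization over candidate slots and all names are illustrative assumptions; the text above only states that the probability is a function of the distance.

    import numpy as np

    class CompressorNode:
        """Sketch of an HPN compressor node: it fires (temporal integration)
        only after all of its slots have been activated in their fixed order."""
        def __init__(self, n_slots, dim, rng):
            self.slot_vecs = rng.normal(size=(n_slots, dim))  # slot points in substitution space
            self.root_vec = rng.normal(size=dim)              # the node's own representation
            self.pointers = []                                # pointers stored in local memory
            self.next_slot = 0                                # index of the slot to fill next

        def bind(self, node_vec):
            """Store a pointer to the bound node in the current slot; return True
            when the last slot is filled, i.e. when the compressor node fires."""
            self.pointers.append(node_vec)
            self.next_slot += 1
            if self.next_slot == len(self.slot_vecs):
                self.next_slot = 0
                return True
            return False

    def binding_probability(node_vec, slot_vecs, candidate):
        """Probability of binding a node to one candidate slot, assumed here
        to decay exponentially with the distance between their representations."""
        dists = np.linalg.norm(slot_vecs - node_vec, axis=1)
        scores = np.exp(-dists)
        return scores[candidate] / scores.sum()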

Parsing with HPN

Derivation of the sentence “Sue eats a tomato”: (Sue (eats (a tomato))).

A node state is characterized by its start index j, active slot i and slot index n.

[Figure: the derivation as a trajectory through the lexical nodes Sue, eats, a, tomato and the compressor nodes X, Y and Z; the numbers indicate the order in which the bindings are made.]

A derivation in HPN is a connected trajectory through the nodes of the physical network (the HPN grammar) that binds a set of nodes together through pointers stored in the slots.
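
As an illustration of how such a trajectory can be recorded, here is a small Python sketch; the data structures and names are mine, not part of the original. It represents a node state by the triple (start index j, active slot i, slot index n) and a derivation as an ordered list of bindings:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class NodeState:
        node: str         # identity of the lexical or compressor node, e.g. "X"
        start: int        # j: input position where this node's span starts
        active_slot: int  # i: the slot that is currently being filled
        slot_index: int   # n: index of that slot within the node

    @dataclass
    class Binding:
        slot_owner: NodeState  # the node whose slot stores the pointer
        bound_node: NodeState  # the node that is bound into that slot

    # A derivation is the ordered list of bindings; the parse tree of
    # "Sue eats a tomato" can be reconstructed by following these pointers.
    derivation: list[Binding] = []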

Left corner parsing




One of many parsing strategies; psychologically plausible.

Combines bottom-up with top-down processing, and proceeds incrementally, from left to right.

The “left corner” is the first symbol on the right-hand side of a rule.

[Figure: successive stages of a left-corner derivation of “John loves Mary”, using the rules S → NP VP and VP → V NP.]

Words “enable” the application of rules bottom-up.
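
To make the strategy concrete, here is a small, self-contained left-corner recognizer in Python for the toy grammar of the figure (S → NP VP, VP → V NP). It is a textbook sketch of the left-corner strategy, not the HPN or episodic parser itself:

    RULES = [("S", ("NP", "VP")), ("VP", ("V", "NP"))]
    LEXICON = {"John": {"NP"}, "Mary": {"NP"}, "loves": {"V"}}

    def recognize(goal, i, words):
        """Return the positions up to which `goal` can be recognized,
        starting at position i (left-corner strategy, no memoization)."""
        if i >= len(words):
            return []
        ends = []
        for cat in LEXICON.get(words[i], ()):   # bottom-up: scan the next word
            ends += complete(cat, i + 1, goal, words)
        return ends

    def complete(cat, i, goal, words):
        """`cat` has been recognized up to position i; try to grow it into `goal`."""
        ends = [i] if cat == goal else []
        for lhs, rhs in RULES:
            if rhs[0] == cat:                   # cat is the left corner of this rule
                positions = [i]
                for sym in rhs[1:]:             # predict the remaining symbols top-down
                    positions = [q for p in positions
                                   for q in recognize(sym, p, words)]
                for p in positions:
                    ends += complete(lhs, p, goal, words)
        return ends

    words = "John loves Mary".split()
    print(len(words) in recognize("S", 0, words))   # True: the sentence is accepted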


In a symbolic grammar, substitution between categories depends on their labels, which are supposed to be innate.

In a connectionist model there are no labels: substitutability relations must be learned (as distances in a topology)!

Substitution is a mechanistic operation, corresponding to dynamic, serial binding [Roelfsema, 2006].

A node binds to a slot by transmitting its identity (plus a state index) to the slot, which then stores this as a pointer (cf. short-term memory in pre-synaptic connections [Barak & Tsodyks, 2007]).

The stored pointers to bound nodes connect a trajectory through the network, from which a parse tree can be reconstructed (a distributed stack).

Substitution versus dynamic binding

Parse and learn cycle


Initialize random node and slot representations.

For every sentence in the training corpus:

Find the most probable parse using a left-corner chart parser. (The parse probability is the product of the binding probabilities, which are determined by the metric distances between nodes and slots.)

Strengthen the bindings of the best parse by moving each bound node n and slot s closer to each other in substitution space: Δn = λ·s and Δs = λ·n.

As the topology self-organizes, substitutability relations between units are learned, hence a grammar.
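
A minimal Python sketch of the update step (the learning rate λ and the renormalization to unit length are my own choices; the slide only gives Δn = λ·s and Δs = λ·n):

    import numpy as np

    def strengthen_binding(node_vec, slot_vec, lam=0.05):
        """Move a bound node n and slot s closer together in substitution space,
        following the stated rule dn = lam*s and ds = lam*n; renormalizing to unit
        length afterwards (an assumption) makes the update attract the two
        representations toward each other."""
        new_node = node_vec + lam * slot_vec
        new_slot = slot_vec + lam * node_vec
        return (new_node / np.linalg.norm(new_node),
                new_slot / np.linalg.norm(new_slot))

    # After parsing a sentence, this is applied to every (node, slot) pair bound
    # in the best parse; over a corpus the topology self-organizes.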

Generalization: from concrete to abstract


Start with random node representations. Parsing a corpus causes the nodes to become interconnected, reflecting the corpus distribution (the topology self-organizes).

If two words occur in the same contexts they tend to bind to the same slots. The slots mediate the formation of abstract category regions in space through generalization.

HPN shows how constructions with slots (I want X) are learned; this is stage 2 in the developmental stages of Usage-Based Grammar (Tomasello, 2003).

[Figure: (1) “feed the cat” and “feed the dog”; (2) cat and dog bind to the same slot X1, yielding the item-based construction “feed the X1”.]

Demo…



1000 sentences generated by the same artificial grammar with relative clauses as in [Elman, 1991], e.g., “boy who chases dog walks”.

HPN initialized with 10 compressor nodes with 2 slots and 5 with 3 slots.

All lexical and compressor nodes have random initial representations.

Next-word prediction is more difficult to do in HPN; work in progress.

HPN learns a topology from the artificial corpus that reflects the original categories of the CFG grammar.

Experiment with artificial CFG

Experiment with Eve corpus from Childes


2000 sentences from the second half of the Eve corpus.

Brackets are available from Childes, but not of good quality; they were binarized.

HPN initialized with 120 productions with 2 slots.

Reasonable clustering of part-of-speech categories.

Note that the SRN cannot work with realistic corpora.

Conclusions


Dynamic binding between nodes in HPN adds to the expressive power of neural networks, and allows explicit representation of hierarchical constituent structure.

Continuous category representations enable incremental learning of syntactic categories, from concrete to abstract.

HPN offers a novel perspective on grammar acquisition as a self-organizing process of topology construction. This is a biologically plausible way of learning.

HPN creates a synthesis between connectionist and rule-based approaches to syntax (it encodes and learns context-free “rules” in the compressor nodes). This answers the critique of [Fodor & Pylyshyn, 1988] that connectionist networks are not systematic.




For the course

Cognitive models of language and beyond


University of Amsterdam




February 28, 2011

Connectionist models of language
part III

Episodic grammar

Limitations of HPN: context-freeness

HPN cannot do contextual conditioning, because parser decisions depend only on the distance between two nodes (HPN is “context free”).

Yet for realistic language processing, and in particular left-corner parsing, one must be able to condition on the contextual (lexical and structural) history.

Connectionism puts a constraint on the parser: conditioning events must be locally accessible.

We need a neurally plausible solution for conditioning on sentence context: use episodic memory.

Episodic and semantic memory


Semantic memory is a person’s general world knowledge, including language, in the form of concepts (of objects, processes and ideas) that are systematically related to each other.

Episodic memory is a person’s memory of personally experienced events or episodes, embedded in a temporal and spatial context.

In the language domain, semantic memory encodes abstract, rule-based linguistic knowledge (a grammar), and episodic memory encodes memories of concrete sentence fragments (exemplars).

HPN implements a semantic memory of a (context-free) grammar.

[Figure: “me lining up in front of the bakery” (episodic) vs. “bread” (semantic).] Language processing makes use of two memory systems.


There is an interesting parallel between episodic-semantic memory processes and rule-based vs. exemplar-based language processing.

Evidence for abstract, rule-based grammars in the tradition of generative grammar (e.g., [Marcus, 2001]).

Usage-Based Grammar [Tomasello, 2000] emphasizes the item-based nature of language, with a role for concrete constructions larger than rules.

This suggests that a semantic memory of abstract grammatical relations and an episodic memory of concrete sentences interact in language processing.

The semantic-episodic memory distinction maps onto the debate on rule-based versus exemplar-based cognitive modeling.


All episodic experiences that can be consciously remembered leave physical memory traces in the brain.

Episodic memories are content addressable: their retrieval can be primed by cues from semantic memory (for instance the memory of a smell); hence priming effects.

Sequentiality: episodes are construed as temporal sequences that bind together static semantic elements, within a certain context [Eichenbaum, 2004].

The chronological order of episodic memories is preserved.

HPN can be enriched with an episodic memory to model the interaction between semantic and episodic memory in language processing, and at the same time solve the problem of context.

Properties of episodic memory

Episodic memory traces in HPN

[Figure: HPN network processing “girl who dances likes tango” and “boy likes mango”; episodic traces are stored along the visited nodes.]

Episodic traces are stored in the local memories of visited nodes.

The traces prime derivations of processed sentences (content addressability).

After a successful derivation, the local STMs in the slots must be made permanent.

Trace encoding in a symbolic episodic grammar


In symbolic, supervised case nodes
become
treelets
that
correspond one
-
to
-
one to CFG rules.


Traces are encoded as
x
-
y
;
x

= #sentence in corpus;


y

= #position in derivation (top
-
down or left
-
corner)
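
A small Python sketch of this encoding (the data structures are mine; the x-y scheme is from the slide). Storing the top-down derivations of the two example sentences reproduces the trace sets shown in the figure below:

    from collections import defaultdict

    # Each treelet (identified here by its CFG rule) keeps a local memory of
    # episodic traces x-y: x = sentence number, y = position in that
    # sentence's derivation.
    treelet_traces = defaultdict(list)

    def store_derivation(sentence_nr, derivation):
        """derivation: the sequence of treelets visited while deriving the sentence."""
        for position, treelet in enumerate(derivation, start=1):
            treelet_traces[treelet].append((sentence_nr, position))

    # Top-down derivations of "girl who dances likes tango" (1) and "boy likes mango" (2):
    store_derivation(1, ["S -> NP VP", "NP -> NP RC", "NP -> N", "N -> girl",
                         "RC -> WHO VI", "WHO -> who", "VI -> dances",
                         "VP -> VT NP", "VT -> likes", "NP -> N", "N -> tango"])
    store_derivation(2, ["S -> NP VP", "NP -> N", "N -> boy",
                         "VP -> VT NP", "VT -> likes", "NP -> N", "N -> mango"])

    print(treelet_traces["NP -> N"])   # [(1, 3), (1, 10), (2, 2), (2, 6)]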

[Figure: treelets for the productions NP → N, S → NP VP, NP → NP RC, RC → WHO VI, VP → VT NP and the lexical entries girl, boy, mango, tango, who, dances, likes, each annotated with its stored traces (e.g., 1-1 and 2-1 on S → NP VP; 1-3, 1-10, 2-2 and 2-6 on NP → N).]

Episodic traces after processing the sentences “girl who dances likes tango” and “boy likes mango”.


Trace encoding for a left corner derivation

Treelets have a register that keeps track of the execution of operations.

[Figure: left-corner trace encoding for “girl who dances likes tango”: the treelets NP → N, S → NP VP, NP → NP RC, RC → WHO VI, VP → VT NP, the lexical entries and the goal states (START*, NP*, RC*, S*, VP*) store traces 1-1 through 1-20, each paired with the executed left-corner operation (sh, pr or att).]


When parsing a novel sentence, the traces in a visited treelet are primed, and trigger memories of stored sentences (derivations).

The traces e_x receive an activation value A, whose strength depends on the common history CH of the pending derivation d with the stored derivation x.

CH is measured as the number of derivation steps shared between d and x.

Every step in the derivation is determined by competition between the activated traces of different exemplars.

Parsing as priming activation of traces

Probabilistic episodic grammar


A derivation in the episodic grammar is a sequence of visits to treelets.

The probability of continuing the derivation from treelet t_k to treelet t_{k+1} is the normalized activation mass of the traces that prefer t_{k+1}:

P(t_{k+1} | t_k) = Σ_{e_x ∈ E(t_k, t_{k+1})} A(e_x) / Σ_{e_x ∈ E(t_k)} A(e_x)

where E(t_k, t_{k+1}) is the set of traces in t_k with a preference for t_{k+1}, E(t_k) is the set of all traces in t_k, and A(e_x) is the activation of trace e_x in t_k.
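
In code, this amounts to normalizing trace activations over the traces stored in the current treelet (a Python sketch; how the activation A is computed from the common history is left abstract here, since it is not fixed above):

    def continuation_probability(traces, next_treelet):
        """traces: list of (trace_id, preferred_next_treelet, activation) tuples
        stored in the current treelet t_k.  Returns P(t_{k+1} | t_k) as the
        activation mass of the traces preferring t_{k+1}, normalized over all
        traces in t_k."""
        total = sum(a for _, _, a in traces)
        if total == 0.0:
            return 0.0
        preferred = sum(a for _, nxt, a in traces if nxt == next_treelet)
        return preferred / total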


For the parsing task, the episodic grammar is trained on Wall Street Journal (WSJ) sections 2-21, and evaluated on WSJ section 22. Training proceeds as follows:

One treelet is created for every unique CF production in the treebank.

For every sentence in the corpus, determine the sequential order of the treelets in its derivation d.

For every treelet in derivation d, store a trace x-y consisting of the sentence number plus the position in the derivation.

After training, the probability of a given parse of a test sentence can be computed by dynamically updating the trace activations:

For every step in a given derivation, check whether the traces in the current treelet are successors of traces in the previous treelet.

If so, increase their CH by 1. Otherwise, set their CH to 0.


Training the episodic grammar
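
A compact Python sketch of this training and scoring procedure (treelets are identified by their CF productions; turning the CH values into activations and per-step probabilities is done as in the formula above, and is left out because the activation function is not spelled out here):

    from collections import defaultdict

    class EpisodicGrammar:
        """One treelet per unique CF production; each treelet stores traces
        (x, y) = (sentence number, position in that sentence's derivation)."""

        def __init__(self):
            self.traces = defaultdict(list)

        def train(self, derivations):
            # derivations: one sequence of treelets per training sentence
            for x, derivation in enumerate(derivations, start=1):
                for y, treelet in enumerate(derivation, start=1):
                    self.traces[treelet].append((x, y))

        def common_histories(self, derivation):
            """Walk a candidate derivation and dynamically update the common
            history CH of every trace: CH grows by 1 when the trace is the
            successor of a trace matched in the previous treelet, and is set
            to 0 otherwise.  Yields the CH values after every step."""
            ch = {}
            for treelet in derivation:
                new_ch = {}
                for (x, y) in self.traces[treelet]:
                    new_ch[(x, y)] = ch[(x, y - 1)] + 1 if (x, y - 1) in ch else 0
                ch = new_ch
                yield ch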

Reranking with the episodic grammar

[Figure: reranking pipeline. A 3rd-party parser trained on WSJ sections 2-21 produces the N-best parses of a test sentence from WSJ section 22 (e.g., “Silver performed quietly”), each with a probability (e.g., 0.050, 0.038, 0.021, 0.017, 0.009 for ranks 1-5). The episodic probabilistic grammar, also trained on WSJ sections 2-21, re-scores these parses (e.g., 0.066 for the parse originally ranked 4th), which changes their ranking. The most probable parse is then compared with the gold-standard parse, e.g. (S (NP (NN Silver)) (VP (VBD performed) (ADVP (RB quietly))) (. .)), under PARSEVAL.]

PARSEVAL evaluation:

LP = number of matching constituents / number of constituents in the model parse
LR = number of matching constituents / number of constituents in the gold-standard parse

Example scores: LR = 89.88, LP = 90.10, F = 89.99.
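
For reference, a small Python sketch of the PARSEVAL computation; representing constituents as (label, start, end) spans is the usual convention but my own choice here, and the sketch omits guards against empty parses:

    def parseval(model_constituents, gold_constituents):
        """Labeled precision (LP), recall (LR) and F-score over constituent spans."""
        model, gold = set(model_constituents), set(gold_constituents)
        matching = len(model & gold)
        lp = matching / len(model)   # matching constituents / constituents in model parse
        lr = matching / len(gold)    # matching constituents / constituents in gold parse
        f = 2 * lp * lr / (lp + lr)
        return lp, lr, f

    # e.g. the gold-standard parse of "Silver performed quietly":
    gold = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("ADVP", 2, 3)}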

Experiments

Precision and recall scores of the top-down episodic (TDE) reranker and the left-corner episodic (LCE) reranker as a function of the maximum common history considered. Nbest = 5; λ0 = 4; λ1 = λ2 = λ3 = 0.2.

[Plot: F-scores of the top-down and left-corner episodic rerankers against the maximum history (0-14), compared with Charniak’99.]

Best F-score for the TDE reranker: F = 90.36 at history 5; for the LCE reranker: F = 90.61 at history 8. Both are better than Charniak’99.

Robustness

[Plot: F-scores of the left-corner episodic reranker against the maximum history (0-12) for Nbest = 5, 10 and 20, compared with Charniak’99.]

F-scores of the left-corner episodic reranker when the output list of the 3rd-party parser is varied between the 5, 10 and 20 best parses.

Discontiguities and shortest-derivation reranker

[Plot: F-scores of the left-corner and discontiguous episodic rerankers against the maximum history (0-14), compared with Charniak’99.]

The shortest-derivation episodic reranker selects parses from the N-best list according to a preference for derivations that use the fewest episodes. Its best F-score is F = 90.44 (better than Charniak’99).

When discontiguous episodes are taken into account, the F-score improves to F = 90.68 (Nbest = 5; d = 0.95; f = 0.6).

Parsing results compared to state-of-the-art

Various parsing strategies (on WSJ sec 23)           F (<= 40)   F (all)
Charniak (1999) (max. entropy)                       90.1        89.6
Petrov and Klein (2007) (refinement-based)           90.6        90.1
Bansal and Klein (2010) (fragment-based)             88.7        88.1
Cohn et al. (2009) (Bayesian)                        -           84.0
Charniak and Johnson (2005) (reranker, n = 50)       -           90.1

Episodic reranker (on WSJ sec 22)
TDE reranker (n = 5)                                 90.4        -
LCE reranker (n = 5)                                 90.6        90.1
LCE + disctg (n = 5)                                 90.7        -

Relation to Data Oriented Parsing (DOP)


The episodic grammar gives a neural perspective on parsing with larger fragments, in terms of episodic memory retrieval.

Complementary: in DOP the substitution of an arbitrarily large subtree is conditioned on a single nonterminal; in the episodic parser the application of a local treelet is conditioned on an arbitrarily large episode.

But: the shortest-derivation variant does use large fragments.

Advantages over DOP:

Every exemplar is stored only once in the network, so the space complexity is linear in the number of exemplars.

Content-addressability: episodes are reconstructed from traces, obviating a search through an external memory.

Relation to history-based parsing

Like state-of-the-art history-based parsers (e.g., [Collins, 2003; Charniak, 1999]), the episodic grammar makes use of lexical and structural conditioning context.

Yet, no preprocessing of labels is needed.

Conditioning on arbitrarily long histories comes at no cost to the grammar size, because the history is implicit in the representation: there is no need to form equivalence classes.

The association between a conditioning event and the sentence from which it originates is preserved: this makes it possible to exploit discontiguous episodes.

Conclusions


The episodic grammar clarifies the trade-off between rule-based and exemplar-based models of language, and unites them under a single, neurally plausible framework.

It proposes and evaluates an original hypothesis about the representation of episodic memory in the brain: the emerging picture of episodic memory is that of a life-long thread spun through semantic memory.

It fits with the “reinstatement hypothesis of episodic retrieval”: priming of traces reactivates the cortical circuits that were involved in encoding the episodic memory.

It fits with hippocampal models of episodic memory [e.g., Eichenbaum, 2004; Levy, 1996]: special ‘context neurons’ uniquely identify (part of) an episode, and function as a neural correlate of a counter.

Work in progress


Developing an episodic chart parser that computes parse probabilities through activation spreading to (the traces in) its states.

Integrating episodic memory into the probabilistic HPN chart parser; doing unsupervised grammar induction with episodic-HPN.

Want to join for a project? Contact me at gideonbor@gmail.com

Homepage: staff.science.uva.nl/~gideon

Thank you!

References:

Bod, R. (1998). Beyond Grammar: An experience-based theory of language. CSLI Publications, Stanford, CA.

Borensztajn, G., Zuidema, W., & Bod, R. (2009). The hierarchical prediction network: towards a neural theory of grammar acquisition. Proc. of the 31st Annual Meeting of the Cognitive Science Society.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14:179-211.

Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning.

Fodor, J. D., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 3-71.

Hadley, R. F. (1994). Systematicity in connectionist language learning. Mind and Language.

Hawkins, J., & Blakeslee, S. (2004). On Intelligence. New York: Henry Holt and Company.

Tomasello, M. (2003). Constructing a Language: A usage-based theory of language acquisition. Cambridge, MA: Harvard University Press.