
Efficient Operation on Semantic Dependency Structures

MPhil in Computer Speech, Text and Internet Technology

Jesus College

University of Cambridge

March 2013



Efficient Operation on Semantic Dependency Structures

Name: Vaughan Eveleigh
College: Jesus College
Examination: MPhil CSTIT
Project Originator: Ann Copestake
Project Supervisor: Ann Copestake
Source Code Location: http://code.google.com/p/cstitproject/

Abstract

Packed representations have been well exploited in syntactic parsing; however, algorithms for the efficient representation of highly ambiguous semantic structures are less well studied. This project considers operations on semantic dependency representations and develops efficient algorithms for fundamental comparison operations on DMRS (1) representations, highlighting the frequent decorrelation between theoretical efficiency and runtime performance.

The developed system has been designed for inclusion in the DELPH-IN (2) linguistic processing suite, using standard DMRS semantic representations of sentences in XML format generated from standard DELPH-IN resources.

Experimenting with a range of data structures has led to the development of a packed-DMRS representation that exploits the commonality and redundancy between related parses, enabling the efficient computation of previously expensive operations. Aside from DMRS to packed-DMRS conversion, the tool provides two further operations: structure comparison, which tests for equality on sets of parsed DMRS structures, and structural similarity, which identifies the commonalities between various parses.

This thesis also discusses the limitations of the efficient structures and the scope for further research to provide even greater efficiency and functionality.




Declaration of Originality

I, Vaughan Eveleigh of Jesus College, being a candidate for the MPhil in Computer Speech, Text and Internet Technology, hereby declare that this dissertation and the work described in it are my own work, unaided except as may be specified below, and that the dissertation does not contain material that has already been used to any substantial extent for a comparable purpose.

Signed

Date

Approximate Word Count: 1549 (excluding automatically generated appendix entries)









Contents

Introduction
    1. Project Motivation
    2. MRS, RMRS and DMRS
    3. Packed Representation
    4. Related Work

Methodology
    1. Starting Point
    2. Source Data Format
    3. Data Structures
    4. Fundamental Functions
    5. Parsing
    6. System Functions
        6.1. DMRS Validation (if time permits)
        6.2. Packing and Unpacking
        6.3. Comparison
        6.4. Similarity
    7. Maintainability
    8. Testing

Performance Evaluation
    1. Evaluation Setup
        1.1. Tools
        1.2. Test Sets
    2. System Performance
        2.1. Fundamental Functions
        2.2. Parsing
        2.3. Validation (if time permits)
        2.4. Packing and Unpacking
        2.5. Comparison
        2.6. Similarity
    3. Summary

Conclusion
    1. Project Successes
    2. Scope for Further Research

Bibliography

Appendix

Index




Chapter 1

Introduction

1. Project Motivation

FORMATTING: maths example....

Figure 1: Example of an equation and reference

We want to be able to compare at the level of semantics rather than at the level of, e.g., derivation trees.

2. MRS, RMRS and DMRS

Explain each data format and what information it provides.

(R)MRS comparison isn't good enough. I haven't got any timing figures for doing large-scale comparisons of MRSs, and anyway we haven't got a packed MRS representation, so the speed argument can't be made precise, but there's a straightforward argument about locating the differences: MRS is very redundant compared to DMRS, so doing comparison at the DMRS level is better in terms of pinpointing differences. There's also a visualisation argument.

The argument against using the dependencies discussed in the Oepen and Loenning paper for these purposes is that they don't contain all the semantic information that's in an MRS.

3. Packed Representation

Packing is a good thing. Although we could (probably) pack (R)MRS structures, it's better to do it with DMRSs, because of redundancy.

Now, since your project takes DMRS as a starting point, you're not going to be motivating your own work on the basis that you've developed the representation, but more on the basis that the representation needs your code to realise the advantages. However, I think you'd probably better put some of this material in the first part of the thesis.



4. Related Work

Something to cite here? Maybe some of the papers?

A lack of prior work could reinforce the need for this work.



Chapter 2

Methodology

1. Starting Point

The dissertation should explicitly describe the starting point for the project, making clear what existing software or other resources were used. If a student is building on any work that they did before starting the MPhil, this should be indicated. The dissertation should include a concise summary of the work undertaken by the student in the course of the project.

Explain the language choice: multi-system/adaptable.

This section will focus on IMPLEMENTATION and THEORETICAL performance.

2. Source Data Format

Explain the various accepted input formats.

Explain the assumptions (or lack thereof) that can be made (parse order matters).

Explain the structure and finer details of DMRS, e.g. its graph-like nature, with both directed and undirected links, etc.
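To make the later sections concrete, a minimal object-oriented sketch of a DMRS graph is given below. The class and field names (DmrsGraph, Node, Link, sortinfo) are assumptions chosen for exposition, not the project's actual code.

```java
// A minimal, illustrative DMRS graph. Names and fields are assumptions
// for exposition only, not the project's actual classes.
import java.util.ArrayList;
import java.util.List;

class DmrsGraph {

    static class Node {
        final int id;            // node identifier taken from the XML input
        final String predicate;  // e.g. "_dog_n_1"
        final String sortinfo;   // morpho-semantic properties; may be null

        Node(int id, String predicate, String sortinfo) {
            this.id = id;
            this.predicate = predicate;
            this.sortinfo = sortinfo;
        }
    }

    static class Link {
        final Node from;
        final Node to;
        final String label;      // e.g. "ARG1/NEQ"; EQ links behave as undirected

        Link(Node from, Node to, String label) {
            this.from = from;
            this.to = to;
            this.label = label;
        }
    }

    final List<Node> nodes = new ArrayList<>();
    final List<Link> links = new ArrayList<>();
}
```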

3. Data Structures

Explain the object-oriented data structure.

Just-in-time data structures with memory/caching.

Minimal computation.
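A minimal sketch of the just-in-time idea, under the assumption that some derived view of a parse (here, simply a sorted copy of the predicate list) is only computed when first requested and then cached:

```java
// Just-in-time data structure sketch: the sorted view is built lazily on
// first request and cached thereafter. Illustrative only.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class LazyDmrsView {
    private final List<String> predicates;  // stand-in for the node list
    private List<String> sorted;            // cache: null until first computed

    LazyDmrsView(List<String> predicates) {
        this.predicates = predicates;
    }

    // The first call pays the sorting cost; later calls return the cache.
    List<String> sortedPredicates() {
        if (sorted == null) {
            sorted = new ArrayList<>(predicates);
            Collections.sort(sorted);
        }
        return sorted;
    }
}
```

The design point is minimal computation: nothing is derived until an operation actually needs it, and nothing is derived twice.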

4. Fundamental Functions

Explain what it means for a node, link, parse or sentence to be equal, including with and without sortinfo (strict/relaxed equality).

Explain caching, set merging and memory, and why these are THEORETICAL improvements.



Explain the guaranteed ordering required by later algorithms, and why this can be assumed to be deterministic.
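The strict/relaxed distinction could be sketched as follows; DmrsNode and its fields are illustrative assumptions rather than the project's real types:

```java
// Strict vs. relaxed node equality: relaxed ignores sortinfo, strict
// requires it to match too. Illustrative sketch only.
import java.util.Objects;

class DmrsNode {
    final String predicate;
    final String sortinfo;   // may be null

    DmrsNode(String predicate, String sortinfo) {
        this.predicate = predicate;
        this.sortinfo = sortinfo;
    }

    // Relaxed equality: the predicates match; sortinfo is ignored.
    boolean equalsRelaxed(DmrsNode other) {
        return predicate.equals(other.predicate);
    }

    // Strict equality: the predicates and the sortinfo must both match.
    boolean equalsStrict(DmrsNode other) {
        return equalsRelaxed(other) && Objects.equals(sortinfo, other.sortinfo);
    }
}
```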

5. Parsing

Explain the parser and the various formats.
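As an illustration only, DMRS XML could be read with the JDK's standard DOM parser roughly as below; the element and attribute names ("node", "nodeid") are assumptions about the input format, not a specification of it:

```java
// Minimal sketch of reading DMRS XML with the JDK DOM parser. The tag
// and attribute names are assumed for illustration.
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DmrsXmlReader {
    static void printNodeIds(File xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder()
                                             .parse(xml);
        NodeList nodes = doc.getElementsByTagName("node");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element node = (Element) nodes.item(i);
            System.out.println(node.getAttribute("nodeid"));
        }
    }
}
```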

6. System Functions

Explain the usage information and the options that make the tool usable (similar to the wiki).

Also include a paragraph about USEFUL error messages.

6.1. DMRS Validation (if time permits)

Possibly do this.

6.2. Packing and Unpacking

Explain the packing algorithm and its preservation of information.

Explain the unpacking algorithm briefly.

Explain the loss of some information (id numbers).

Add XML descriptions to the appendix.
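One plausible shape for the packed table is sketched below, under the assumption that each node and link is reduced to a canonical string (which is exactly where the original id numbers are lost) and marked with the set of parses in which it occurs. This illustrates the idea of sharing, not the project's actual algorithm:

```java
// Packed-DMRS sketch: each distinct node/link is stored once, with a
// bitset recording which parses contain it. Illustrative only.
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class PackedDmrs {
    // Key: canonical string form of a node (original ids discarded).
    final Map<String, BitSet> nodes = new HashMap<>();
    // Key: canonical "from|label|to" form of a link.
    final Map<String, BitSet> links = new HashMap<>();

    void addNode(String canonicalNode, int parseIndex) {
        nodes.computeIfAbsent(canonicalNode, k -> new BitSet()).set(parseIndex);
    }

    void addLink(String canonicalLink, int parseIndex) {
        links.computeIfAbsent(canonicalLink, k -> new BitSet()).set(parseIndex);
    }
}
```

Unpacking then amounts to iterating over the entries whose bitset contains a given parse index, minting fresh id numbers as it goes.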

6.3. Comparison

Explain the naive algorithm using the UNPACKED structure (see notes).

Explain the advanced algorithm using the PACKED structure (see notes).

Spurious ambiguity and its identification.

State the THEORETICAL improvements.

Output format (with and without sortinfo).
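Over a packed table like the illustrative one in 6.2, the equality test could reduce to a single pass: two parses are equal exactly when every packed entry contains both of them or neither of them. A sketch under those assumptions, not the project's implementation:

```java
// Packed equality sketch: parses p and q are equal iff no packed entry
// distinguishes them.
import java.util.BitSet;
import java.util.Map;

class PackedComparison {
    static boolean sameParse(Map<String, BitSet> packedEntries, int p, int q) {
        for (BitSet usedBy : packedEntries.values()) {
            if (usedBy.get(p) != usedBy.get(q)) {
                return false;  // one parse uses this entry, the other does not
            }
        }
        return true;
    }
}
```

The naive algorithm would instead compare two unpacked graphs node by node and link by link, repeating that work for every pair of parses.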

6.4. Similarity

Explain the similarity algorithm.

Add the output XML description to the appendix.

Explain the possible subsumption extension (why it hasn't been implemented, how it would be implemented, and its effect on the efficiency of the algorithm).

“This again is non-trivial as the syntax trees may not share the same root node yet still have common branches. The system should return the most likely differences within a suitable time scale.”
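With the same illustrative packed table, the shared material between two parses is simply the set of entries marked as present in both; a sketch:

```java
// Packed similarity sketch: collect the entries common to parses p and q.
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

class PackedSimilarity {
    static List<String> sharedEntries(Map<String, BitSet> packedEntries,
                                      int p, int q) {
        List<String> shared = new ArrayList<>();
        for (Map.Entry<String, BitSet> entry : packedEntries.entrySet()) {
            if (entry.getValue().get(p) && entry.getValue().get(q)) {
                shared.add(entry.getKey());
            }
        }
        return shared;
    }
}
```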



7. Maintainability

Javadoc, comments, structure, etc.

8. Testing

Where a project has as its main aim the production of a piece of software, the dissertation should state clearly what test procedures were adopted and should include test output.

Edge cases / test cases, etc. (I may incorporate this into each individual section.)
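As an example of the kind of edge-case test meant here, a hypothetical JUnit 4 test over the illustrative DmrsNode class sketched in the Fundamental Functions section:

```java
// Hypothetical JUnit 4 test: relaxed equality must ignore sortinfo,
// strict equality must not.
import org.junit.Test;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

public class EqualityTest {
    @Test
    public void relaxedEqualityIgnoresSortinfo() {
        DmrsNode withSortinfo = new DmrsNode("_dog_n_1", "x:sg");
        DmrsNode withoutSortinfo = new DmrsNode("_dog_n_1", null);
        assertTrue(withSortinfo.equalsRelaxed(withoutSortinfo));
        assertFalse(withSortinfo.equalsStrict(withoutSortinfo));
    }
}
```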



Chapter 3

Performance Evaluation

1. Evaluation Setup

Trade-off compromises made: speed vs. usability?

All data used should be clearly described. If the results are copious, they should be summarised in the main text and given in full in an appendix. This section will focus on ACTUAL, not theoretical, performance.

“The system evaluation will take the form of actual runtime performance comparisons, not necessarily the asymptotic complexity of the algorithms used.”

1.1. Tools

E.g. a profiler (NetBeans), a personal profiler, other add-ins.
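A "personal profiler" could be as simple as a hand-rolled timer that repeats an operation to smooth out JIT warm-up; this sketch is illustrative only:

```java
// Hand-rolled timing harness: runs an operation repeatedly and reports
// the elapsed wall-clock time in milliseconds.
class Stopwatch {
    static long elapsedMillis(Runnable operation, int repetitions) {
        long start = System.nanoTime();
        for (int i = 0; i < repetitions; i++) {
            operation.run();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long ms = elapsedMillis(() -> Math.sqrt(42.0), 1_000_000);
        System.out.println("1,000,000 iterations took " + ms + " ms");
    }
}
```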

1.2. Test Sets

Explain the sets to use and how they are typical: different versions of the English Resource Grammar (ERG) (3) on standard test sentences from the Redwoods Test Suite (4). Initially the CSLI (5) test suite will be used, with later experiments being carried out on the LOGON data.

2. System Performance

2.1. Fundamental Functions

Demonstrate the difference that memory vs. no memory makes to the performance of the micro functions.

Explain (possibly as a failure) why remembering all computation is not ALWAYS advantageous, e.g. the failed attempt at remembering node and link comparisons.

ALSO state the law about only improving the parts of the system that consume the most time, and why forgetting some things doesn't make much of a difference.
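The law alluded to above is presumably Amdahl's law: if a part of the system accounts for a fraction f of total runtime and is sped up by a factor s, the overall speedup is

    speedup = 1 / ((1 - f) + f / s)

so forgetting cached results in a part with small f can never cost much overall, just as optimising it can never gain much.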

2.2. Parsing

Demonstrate the performance differences with different input methods.



Explain thrashing and its solution; the trade-off between practical performance and theory.

Demonstrate both time and the count of fundamental comparisons.

2.3. Validation (if time permits)

2.4. Packing and Unpacking

Packing performance and the memory-saving advantage.

2.5. Comparison

The difference in runtime performance between the naive and the advanced algorithm.

Include both time and fundamental operations.

2.6. Similarity

Explain the advantage of the algorithm that uses the packed structure, and its performance.

3. Summary

Summarise what has been achieved.



Chapter 4

Conclusion

1. Project Successes

What has been achieved.

Only talk about good points here!

2. Scope for Further Research

Subsumption algorithms.

Functionality vs. efficiency: more functionality = fewer assumptions = slower.

Adaptive/clever memory: rank the information that needs to be remembered, and forget it in order, using Java's memory-sensitive reference types (see the sketch below).
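A hypothetical sketch of such adaptive memory, using SoftReference (which the garbage collector is free to clear under memory pressure); this illustrates the mechanism, not the project's implementation:

```java
// Adaptive cache sketch: the cached value lives behind a SoftReference,
// so the JVM reclaims it automatically when memory runs low and it is
// recomputed on the next request.
import java.lang.ref.SoftReference;
import java.util.function.Supplier;

class SoftCache<V> {
    private SoftReference<V> ref = new SoftReference<>(null);

    V getOrCompute(Supplier<V> compute) {
        V value = ref.get();
        if (value == null) {             // never computed, or reclaimed by GC
            value = compute.get();
            ref = new SoftReference<>(value);
        }
        return value;
    }
}
```

Ranking which entries to forget first would additionally need an ordering, e.g. combining soft references with an LRU structure.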




Bibliography

1. Copestake, Ann. Slacker semantics: why superficiality, dependency and avoidance of commitment can be the right way to go. 12th Conference of the European Chapter of the ACL, Athens, 2009, pp. 1-9.

2. DELPH-IN: Open Source Deep Processing. [Online] 2010. http://www.delph-in.net/.

3. Flickinger, D. On building a more efficient grammar by exploiting types. Natural Language Engineering, Vol. 6, No. 1, Cambridge University Press, 2000, pp. 15-28. ISSN 1351-3249.

4. Oepen, S., et al. LinGO Redwoods: A Rich and Dynamic Treebank for HPSG. Research on Language & Computation, Vol. 2, Springer Netherlands, 2004, pp. 575-596. ISSN 1570-7075.

5. CSLI LinGO Lab. Stanford University. [Online] http://lingo.stanford.edu/.

CSTIT Thesis

-
15
-


Appendix

Appendix 1 - First Appendix

XML Descriptions

Test Cases

Test Output

Further Measurements

Important Algorithms


Index

D
DMRS

M
MRS

P
Packed Representation
Packing

R
RMRS