The relationship between Linguistic Semantics and Controlled English to support improved information extraction


Oct 25, 2013 (4 years and 2 months ago)




This technical report is the Quarter 3 deliverable for research carried out on the International Technology Alliance (ITA) programme, specifically in Technical Area 6, Project 4, Task 2.


The report outlines the various approaches that have been investigated in support of providing mechanisms to link linguistic semantic information to the various conceptual structures within a domain model. This is on-going research and some of the material in the report may be advanced further or superseded, and additional areas of investigation may arise.


Some of the material in this report is likely to be used in the planned ICCRTS (footnote 1) paper titled “Information Extraction using Controlled English to support Knowledge-Sharing and Decision-Making” and therefore the technical report should not be published until the material in that paper has been presented (June 21st 2012).



The remainder of this document is the detailed content of the technical report.


Language facts

When used in support of information extraction processing, Controlled English (CE) [CE1] (footnote 2) is used for two purposes: as the target of the linguistic processing, where the CE acts as the semantic representation language; and as the means by which the language processing tools are configured to perform the processing. For the first purpose it is necessary to have a conceptual model of the domain (as has been described above) and to know the mapping between the words in a typical sentence and concepts in the domain conceptual model; for the second it is necessary to have another conceptual model, that of linguistic concepts and linguistic processing concepts, which is described in this section. However, both of these models must be based upon common components so that the semantics of words can be expressed and mapped onto the semantics of the domain. There is, in effect, a single conceptual model with multiple layers, each layer being based upon the concepts in a higher layer.


The current layers (from top to bottom) are as follows:





Footnote 1: 17th ICCRTS 2012: Operationalizing C2 Agility, June 19th-21st 2012, Fairfax, Virginia.

Footnote 2: For an introduction to, and definition of, the CE language and associated information please refer to the various [CEx] references, and for information on precursor work on Common Logic Controlled English (CLCE) refer to reference [CE3].

1. The Meta Model, allowing the description of the concepts themselves (such as ‘relation concept’) and the relations between them.

2. The General Domain Model, containing fundamental concepts such as ‘agent’, ‘spatial entity’, ‘situation’, ‘container’, together with basic relations between them, such as ‘contained in’ and “the situation s1 has the agent a1 as agent role”.

3. The Semiotic Triangle, based on that of Ogden and Richards [SEMTRI], providing fundamental concepts relating meanings, symbols and things in the domain world. Our particular version of the semiotic triangle is given in Figure 1. The high level concept of ‘meaning’ is the parent of the metamodel concepts (such as ‘relation concept’) and also of other resource representations of meaning, such as ‘wordnet synset’. Two key relations in the triangle used extensively in the syntax-semantic interface are:
   o the symbol S stands for the thing T.
   o the symbol S expresses the meaning M.

4. The General Linguistic Model, containing our theory of linguistics, including such concepts as ‘word’, ‘phrase’, ‘noun phrase’ (all subconcepts of ‘symbol’), ‘wordnet synset’ (subconcept of ‘meaning’), and structures such as ‘linguistic frame’, which holds relationships between CE statements about syntax and semantics (as described below). The general linguistic model also contains syntactic relations between parts of the parse tree, such as “the noun phrase np1 has the noun |dog| as head”.

5. The Domain Model, containing specific concepts (for example this might include ‘place’ or ‘village’ or ‘is located in’). These are based upon the more generic concepts (such as ‘container’ and ‘is contained in’).




Figure 1 Semiotic Triangle


As described below, the parser agent turns a syntactic parse tree into a set of CE sentences that is easier to process via linguistic rules. These sentences use the concepts defined in the general linguistic model. Given the sentence “the patrol in East Dulwich discovers the factory”, this might be turned into sentences including:


[Figure 1 labels: meaning, symbol, thing; conceptualises, stands for, expresses]

the noun phrase np1 has the noun |patrol| as head and has the prepositional phrase pp1 as dependent and stands for the thing [001].

the prepositional phrase pp1 has the word |in| as head and has the noun phrase np2 as object.

the noun phrase np2 has the proper noun |East Dulwich| as head and stands for the thing [002].


Here the syntax tree is represented in attributes such as ‘dependent’ and ‘head’, and the (minimal) semantics as ‘stands for’ (based on the idea that each noun phrase stands for some object in the domain).
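The conversion from a head/dependent structure to CE sentences can be illustrated with a minimal Python sketch. It is not the parser agent itself: the `Phrase` class, the `to_ce` function and the scheme for allocating thing identifiers are assumptions introduced purely for illustration, and the phrase structure is built by hand rather than obtained from a parser.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Phrase:
    """Hypothetical node in a simplified head/dependent representation."""
    kind: str                              # e.g. "noun phrase"
    name: str                              # e.g. "np1"
    head_kind: str                         # e.g. "noun", "word", "proper noun"
    head: str                              # the head word itself
    dependent: Optional["Phrase"] = None   # e.g. an attached prepositional phrase
    obj: Optional["Phrase"] = None         # e.g. the object of a preposition


def to_ce(phrases):
    """Emit one CE sentence per phrase; each noun phrase gets a fresh thing id."""
    ids = count(1)
    sentences = []
    for p in phrases:
        parts = [f"has the {p.head_kind} |{p.head}| as head"]
        if p.dependent:
            parts.append(f"has the {p.dependent.kind} {p.dependent.name} as dependent")
        if p.obj:
            parts.append(f"has the {p.obj.kind} {p.obj.name} as object")
        if p.kind == "noun phrase":
            # Minimal semantics: every noun phrase stands for some thing.
            parts.append(f"stands for the thing [{next(ids):03d}]")
        sentences.append(f"the {p.kind} {p.name} " + " and ".join(parts) + ".")
    return sentences


# Hand-built structure for "the patrol in East Dulwich".
np2 = Phrase("noun phrase", "np2", "proper noun", "East Dulwich")
pp1 = Phrase("prepositional phrase", "pp1", "word", "in", obj=np2)
np1 = Phrase("noun phrase", "np1", "noun", "patrol", dependent=pp1)

ce_sentences = to_ce([np1, pp1, np2])
for sentence in ce_sentences:
    print(sentence)
```

With this input the sketch prints the three CE sentences given above for np1, pp1 and np2.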

Mapping between language facts and domain facts

In order to map between the syntax of the sentence and the semantics of the domain, we are assuming in our current research that there is a parser that will provide a basic syntactic parse tree (specifically the Stanford Parser [SP1, SP2]), allowing us to focus on the mapping of this parse tree into the specific semantics as represented in the analyst’s conceptual model. Our understanding of how syntax maps specifically to the semantics of the conceptual model is captured in the concepts and rules of our General Linguistic Model. At this stage we are still developing our understanding, although it is based upon existing linguistic and semantic principles, so in some areas the linguistic model is over-simplistic, and we plan to enhance it as more of our sentence corpus is analysed. Nevertheless we believe that expression of the linguistic model in Controlled English is of benefit in sharing and understanding linguistic processing information with the analyst.


The construction of the semantics may be considered at two levels: mapping to general semantics (that which is independent of a specific domain) and mapping to specific semantics (that which is defined in the domain model). We undertake this mapping in an incremental fashion, in the spirit of least commitment, with rules that match general patterns inferring the general semantics, followed by rules that match more specific domain-based patterns adding inferences about the more specific semantics. Since the domain model is itself derived from general concepts, this incremental mapping allows the more specific information to be consistent with the general information while adding more detailed constraints. More specifically we undertake the mapping using the following functions (which may not necessarily follow this sequence):




• Words in the parse tree are matched to concepts in the domain conceptual model
• General structures in the parse tree are matched to generic semantic concepts
• Specific structures in the parse tree are matched to specific concepts
• Further inferences are made about the specific entities using domain specific rules


Matching words to concepts is undertaken via CE sentences such as:


the noun |patrol| expresses the entity concept ‘patrol unit’.


based on the semantics that nouns represent concepts which are realised (or instantiated) by entities in the domain. Such linking sentences must be derived from the analyst’s understanding of the meaning of the concepts (s)he defined, and a tool called the Analyst’s Helper is being developed for this purpose (see below). Such mapping sentences are provided to cover nouns (linked to entity concepts), adjectives (also linked to entity concepts) and verbs (linked to relation concepts). The mapping is done by rules such as:


if
  ( the noun phrase NP has the noun N as head and stands for the thing T ) and
  ( the noun N expresses the entity concept EC )
then
  ( the thing T realises the entity concept EC ).


where ‘the thing T realises the entity concept EC’ states that T may be conceptualised by the concept EC. This maps between the meta level (‘the entity concept EC’) and the domain level (‘the thing T’); it seems that such mapping is required at some point in the syntax-semantic interface and therefore a linguistic model must also include a capability of representing meta models.
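As a rough illustration of how such a rule joins parse-tree facts with ‘expresses’ facts, the following Python sketch applies the rule over small dictionaries. The dictionaries are simplified stand-ins for CE sentences, and the function name `infer_realises` is an assumption made for this example; it is not part of the CE tooling.

```python
def infer_realises(heads, stands_for, expresses):
    """Apply the 'expresses' rule.

    heads:      noun phrase name -> head noun
    stands_for: noun phrase name -> thing identifier
    expresses:  noun -> entity concept name
    Returns the inferred 'realises' sentences as CE-like strings.
    """
    inferred = []
    for noun_phrase, noun in heads.items():
        thing = stands_for.get(noun_phrase)
        concept = expresses.get(noun)
        # Both premises must hold before the conclusion is asserted.
        if thing is not None and concept is not None:
            inferred.append(
                f"the thing {thing} realises the entity concept '{concept}'."
            )
    return inferred


heads = {"np1": "patrol", "np2": "East Dulwich"}
stands_for = {"np1": "[001]", "np2": "[002]"}
expresses = {"patrol": "patrol unit"}   # from the 'expresses' sentence above

result = infer_realises(heads, stands_for, expresses)
print(result)
```

Only np1 yields a conclusion here, because no ‘expresses’ fact is available for the proper noun heading np2; such cases would be handled by other reference information.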


This one-to-one mapping is simplistic, and we are augmenting the “expresses” CE sentences (and associated rules) with further information indicating pre-conditions that are required before the specific link can be inferred.




The most generic mapping of parse tree structures to general semantics contained in the linguistic model (and based in the semiotic triangle) is that noun phrases ‘stand for’ entities in the domain. An example has already been given above where “the noun phrase np2 stands for the thing [002]”; here ‘[002]’ is a constructed unique identity of an entity presumed to exist in the real world (and is written to look like a “reference”). We do not at this stage know what entity [002] is, but later processing may add information about it. A similar CE sentence is used to state that verb phrases ‘stand for’ situations in the domain, where a situation is a general concept covering event, activity, possession, family relation etc., where multiple entities are involved in different roles, and where additional information such as time and location may be associated.


More detailed (but still generic) mapping between syntax and semantics may be represented as logical rules in the linguistic model. The concept of ‘container’ captures the idea that if something is “in” something else (for example expressed as a prepositional phrase headed by “in”), then the second in some sense “contains” the first. The current rule to infer this is:


if ( the noun phrase NP1 stands for the thing T1 and
     has the prepositional phrase PP as dependent ) and
   ( the prepositional phrase PP has the word '|in|' as head and
     has the noun phrase NP2 as object ) and
   ( the noun phrase NP2 stands for the thing T2 )
then
   ( the thing T1 is contained in the container T2 ).


Here the rule preconditions will match on earlier parse tree CE sentences to infer:



the thing [001] is contained in the container [002].


Such a rule will be applicable irrespective of the domain model, but will not infer very specific information; it is left open as to whether a container is a place or an organisation or a time period, etc. However if other more specific inferences about the nature of the container [002] are available, for example that [002] is a place, then a more domain-specific rule might infer that the relationship between [001] and [002] may be specialised into ‘is located in’:


if ( the thing T is contained in the container P ) and


( the container P is a place )

then


( the thing T is located in the place P )


inferring the CE sentence:

the thing [001] is located in the place [002].
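The incremental, least-commitment style of inference can be sketched in Python as a small forward-chaining loop. This is an illustrative sketch only: facts are held as (relation, subject, object) tuples rather than CE sentences, and the names `rule_container`, `rule_located` and `forward_chain` are assumptions made for this example.

```python
def objects(facts, relation, subject):
    """All objects o such that (relation, subject, o) is a known fact."""
    return [o for (r, s, o) in facts if r == relation and s == subject]


def rule_container(facts):
    """Generic rule: an NP with an 'in' PP dependent is contained in the PP's object."""
    inferred = set()
    for (rel, np1, pp) in facts:
        if rel != "dependent" or "in" not in objects(facts, "head", pp):
            continue
        for np2 in objects(facts, "object", pp):
            for t1 in objects(facts, "stands for", np1):
                for t2 in objects(facts, "stands for", np2):
                    inferred.add(("is contained in", t1, t2))
    return inferred


def rule_located(facts):
    """Domain rule: containment in a place is specialised to 'is located in'."""
    return {("is located in", t, p)
            for (rel, t, p) in facts
            if rel == "is contained in" and "place" in objects(facts, "is a", p)}


def forward_chain(facts, rules):
    """Apply all rules repeatedly until no new facts are produced."""
    facts = set(facts)
    while True:
        new = set().union(*(rule(facts) for rule in rules)) - facts
        if not new:
            return facts
        facts |= new


parse_facts = {
    ("stands for", "np1", "[001]"),
    ("dependent", "np1", "pp1"),
    ("head", "pp1", "in"),
    ("object", "pp1", "np2"),
    ("stands for", "np2", "[002]"),
}
domain_facts = {("is a", "[002]", "place")}

all_facts = forward_chain(parse_facts | domain_facts, [rule_container, rule_located])
print(("is located in", "[001]", "[002]") in all_facts)
```

Without the domain fact that [002] is a place, only the generic containment fact is inferred, which mirrors the idea that the generic rule applies irrespective of the domain model.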


Further processing is necessary to determine the location of place [002], and the nature of thing [001]. Similar such rules may be used to turn the more generic sentence “the discovery situation s1 has the agent a1 as agent role and has the agent a2 as patient role” into the more specific sentence “the agent a1 finds the agent a2”.


System/Architectural description

The key user for a Controlled English based system is the non-technical “business” user, and the purpose of the CE language is to provide a more human-friendly information representation language to lower the technical barrier between such users and the capabilities of the information processing system. Within the linguistic processing environment described in this paper we believe that there will be a number of natural specialisations, either in terms of different individuals involved in the processing or, for smaller implementations, perhaps the same individuals but with different operational contexts. Such specialisations might include: domain specialists (such as an intelligence analyst), linguists (to provide system knowledge to help processing of natural language documents), knowledge engineers (to help the domain user to better understand their world-view, and techniques for modelling this effectively), IT specialists and systems integrators (concerned with the implementation of applications and databases or other middleware to enable an operational environment to be developed). In addition to these specialisations there are also likely to be different user roles, layers of management and work-flow/approval cycles that will be found in any such operational environment.





Figure 2 Processing Architecture


The aim of CE is to provide a common information representation format that can be used by all parties, with different (but overlapping) domain models supporting each specialisation in support of the whole endeavour. In addition to this there are some research grade tooling capabilities, such as the “CE Store”, that can also be used to directly support some of the requirements of the IT specialist staff.

CE is designed to be most useful in situations that have the following characteristics:



• A high degree of human interaction, usually involving specialist users with complex needs in non-trivial environments.
• A likelihood of rapidly evolving or uncertain tasks, queries or other knowledge-based activities.
• The need for collaboration, either between different people or teams, and/or across different disciplines.

CE is of little value if there is no human involvement, little complexity, or very firm and stable requirements, and in such circumstances traditional application development processes are a much more straightforward and low risk solution. In cases where there is a high degree of customisation, development, uncertain requirements or short lead times, especially in areas where human-led planning, thinking or decision making are required, then CE (or similar human-friendly information processing environments) could be a very useful capability.

[Figure 2 labels: SYNCOIN Reports, Message PreProcessor, Stanford Parser, Entity Extractor, Situation Extractor, Names, CE Aggregator, "Stylistic" CE, For Analysis, CEStore, Conceptual Model (concepts, logical rules, linguistic expression), Proper Nouns (places, units)]

Ontology-based information extraction, normalization and mapping

The ability to define an ontology for a domain and then use this knowledge to enhance information extraction capabilities is a current research topic in the Natural Language Processing and Semantic Information communities. The approach outlined in this paper is very much aligned with this line of research, since the CE conceptual model(s) are synonymous with Semantic Web ontologies [CE4], but the specific augmentation of the underlying semantic “domain models” with explicit lexical information linkages enables the domain models to be much more strongly linked to the typical natural language terms used when discussing the underlying concepts. We have not undertaken any formal comparisons to specific ontology based information extraction techniques at this stage.

Agent / Blackboard architecture

Within the general CE-based information-processing environment there is the concept that agents (machine or human) will consume and produce information in the form of CE sentences. From a human perspective this can take the form of any valid CE sentence being contributed by any user at any point in time. This open-ended and unconstrained approach does allow for the unpredictability of human processing and “flashes of insight” that might arise during human thinking, and the assertion of any such new information can be made immediately available (if appropriate) to other machine or human agents within the system for further processing.

From a machine-agent perspective there are two distinct types of processing that typically occur: the execution of logical inference rules, which are firmly based on the underlying logic, and which automatically generate rationale [CE2] to explain the reasoning steps for any new “facts” that are inferred; and the execution of agent code which may carry out any set of simple or complex processing against the input information in order to assert new information as a result (for example complex entity analytics, or estimation of current location based on historical information etc.). In all cases the agent receives all information from CE sentences and asserts any new information in the form of CE sentences. Such new information may also be extensions to the underlying conceptual model, new logical inference rules, or simply new information to be added to the underlying CE corpus. All such new information is then immediately available for processing by the other agents should that be required, and the rationale is available for interrogation/inspection by the machine or human analyst for decision support or forensic processing [CE5].
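The agent/blackboard pattern described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the CE Store interface: the `Blackboard` class, its `assert_sentence` method, the `run_agents` loop and the `location_agent` function are all hypothetical names, and sentences are matched with a simple regular expression rather than a CE parser.

```python
import re


class Blackboard:
    """Hypothetical shared store of CE sentences with per-sentence rationale."""

    def __init__(self):
        self.sentences = []
        self.rationale = {}   # sentence -> (source, premises)

    def assert_sentence(self, sentence, because=(), source="user"):
        """Add a sentence once; return True only if it was new."""
        if sentence in self.rationale:
            return False
        self.sentences.append(sentence)
        self.rationale[sentence] = (source, tuple(because))
        return True


CONTAINED = re.compile(r"the thing (\S+) is contained in the container (\S+)\.")


def location_agent(sentences):
    """Agent: specialise containment to location when the container is a place."""
    for s in sentences:
        m = CONTAINED.fullmatch(s)
        if m:
            thing, container = m.groups()
            premise = f"the container {container} is a place."
            if premise in sentences:
                yield (f"the thing {thing} is located in the place {container}.",
                       [s, premise])


def run_agents(board, agents):
    """Run every agent over the board until no agent adds anything new."""
    changed = True
    while changed:
        changed = False
        for agent in agents:
            for sentence, premises in agent(list(board.sentences)):
                if board.assert_sentence(sentence, premises, agent.__name__):
                    changed = True


board = Blackboard()
board.assert_sentence("the thing [001] is contained in the container [002].")
board.assert_sentence("the container [002] is a place.")   # e.g. contributed by a human
run_agents(board, [location_agent])

inferred = "the thing [001] is located in the place [002]."
print(board.rationale[inferred])
```

Recording the source and premises alongside each asserted sentence is one simple way to keep the rationale available for later inspection; the actual rationale representation is described in [CE2].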


Modules

The Analyst’s Helper module.


Our approach to linguistic processing relies upon the linking of words to concepts, specifically via the “expresses” sentences. Whereas the meaning of natural language words is generally understood by the community of speakers, the authoritative meaning of the concepts is only known to the analyst who developed the conceptual model. Only the analyst can determine the linking of words to the concepts, although (s)he may be assisted by tooling to perform this task.


To this end we are developing an “Analyst’s Helper” (AH) to assist the analyst in constructing the linguistic mappings between words and each concept in the conceptual model, that is the “expresses” sentences. To reduce the burden on the analyst, the Analyst’s Helper uses WordNet® [WN1, WN2, WN3] to suggest possible words for each concept. Each concept in the domain model is matched to all possible WordNet synsets (via a simple analysis of the words in the word senses) and the analyst is invited to choose the best matching synset from those found. When the choice is made, the Analyst’s Helper constructs suitable CE sentences describing the match between the synset and concept, and constructs ‘expresses’ CE sentences linking the words in the synset and the concept. Rationale for these sentences is also specified, to allow future explanation of the NL processing steps.
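The matching and sentence-generation steps might look roughly like the following Python sketch. It does not use WordNet or the Analyst’s Helper code: `SYNSETS` is a small hypothetical table standing in for a lexical resource, the synset identifiers are invented, and the word-overlap test is only one possible reading of “a simple analysis of the words in the word senses”.

```python
# Hypothetical stand-in for a lexical resource: synset id -> member words.
SYNSETS = {
    "syn-a": ["patrol", "picket"],
    "syn-b": ["unit", "whole"],
    "syn-c": ["village", "hamlet"],
}


def candidate_synsets(concept_name, synsets):
    """Suggest synsets that share at least one word with the concept name."""
    concept_words = set(concept_name.lower().split())
    return sorted(
        synset_id
        for synset_id, words in synsets.items()
        if concept_words & {w.lower() for w in words}
    )


def expresses_sentences(concept_name, synset_id, synsets,
                        pos="noun", concept_kind="entity concept"):
    """Build one 'expresses' sentence per word of the synset the analyst chose."""
    return [
        f"the {pos} |{word}| expresses the {concept_kind} '{concept_name}'."
        for word in synsets[synset_id]
    ]


candidates = candidate_synsets("patrol unit", SYNSETS)
print(candidates)             # offered to the analyst for selection

chosen = candidates[0]        # stands in for the analyst's choice
for sentence in expresses_sentences("patrol unit", chosen, SYNSETS):
    print(sentence)
```

Keeping the choice of synset as a separate step reflects the point that only the analyst can decide which suggested sense matches the intended concept.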


At present this matching process is simplistic, and it is planned to extend the Analyst’s Helper to allow more complex matching of verbs and adjectives, to offer more “remotely” matching synsets and to feed back the sets of unrecognised words from the parser for consideration by the analyst. It may also be possible to build a set of predefined concepts and word/concept mappings which may be used as the basis for the building of a conceptual model by the analyst.



Figure 3 Analyst’s Helper


CE Store

The CE Store is a research-grade Controlled English processing environment that will be available shortly (during 2012) for evaluation and experimentation purposes with the CE language. The CE Store provides a basic CE processing environment that includes the following high-level capabilities:



• Basic CE sentence parsing
• Define/extend any concept model
• Assert any CE sentence conforming to the appropriate conceptual model(s)
• Define and execute any CE query
   o Including an example “visual query composition” element
• Define and execute any CE rule
   o Again, including a visual composition element
• Define and execute any “CE agent”
   o In the form of Java code which conforms to a simple “CE Store” interface
• Operate entirely in memory, or persist information to a relational database format
• Example web-based client to allow rapid development and browsing of CE-based information
• Example agents to carry out basic information processing tasks
• Some capability to convert to/from OWL and RDF formats

The purpose of the CE Store is to demonstrate a “pure” CE-based implementation of an information-processing environment within which human and machine agents can contribute and interact with complex information based on common conceptual models of a domain.

[Figure 3 labels: Analyst Helper, Analyst, NL parser, "expresses", conceptual model, Proper Names, wordnet/etc, gazetteers etc, meta information, ITAnet, MetaModel generator, translate, semantic rules, “the word |xxx| is an unrecognised word”, “the word |www| expresses the concept yyy”]

Information Extraction Module


Figure 2 shows the structure of the module to extract information from the sentences and to convert them into CE facts, using the formats defined in previous sections. This is based upon a sequence of agents running under the CE Store. Each agent reads the relevant CE sentences from the CE store, performs some processing and places the resulting CE sentences back into the CE store. The following agents are executed:


1. The reports are converted into sentences via the Message Preprocessor agent (as described elsewhere).

2. The Stanford parser agent is called on a sentence. This calls the Stanford parser Java API code [SP1, SP2] to produce a raw parse tree, and then turns the raw parse tree into a CE representation (defining phrases with heads and dependents, as described above). The use of this intermediate CE representation allows for minor deviations in the parse tree representation, and permits the insertion of other parsers in the future.

3. The entity extractor agent analyses the CE head/dependent representation and uses entity extraction rules to generate information about the ‘things’ stood for by the noun phrases, adjective phrases and prepositional phrases as outlined above. The result is a set of entities, their characterisations as domain concepts, and relations between them, as a set of CE sentences. As part of this processing, reference information is used, including:
   a. the ‘expresses’ links between words and entity concepts
   b. fact bases of proper nouns and their categorisations, and the domain-level attributes (e.g. the coordinates of places)

4. The situation extractor agent further analyses the CE head/dependent representation of the parse tree together with information about the entities extracted in the previous step. This uses further rules to extract the thematic roles for the verbs and to add further relations between the situation (representing the verb) and the participants in the situation. ‘expresses’ links are also used at this stage. The result is a set of CE sentences about the situation.

5. A “naming” agent is run to provide more readable names for the entities; this agent is in initial development.

6. As a result of the previous steps, there are a number of CE sentences describing the entities and situations. Due to the incremental nature of the architecture, these sentences are small and atomic in form, and are best presented to the user in a more expressive aggregated form. Thus a final “CE aggregation” processor is run to turn the atomic CE into a more “stylistic” CE, using techniques such as: aggregating all information about an entity into a single sentence; not duplicating information; not displaying supertypes that may be inferred; and not displaying relationships that are inferrable from other relationships. This process is also in initial development.
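The aggregation techniques listed in step 6 can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the CE aggregation processor: the function names, the `parent_of` and `generalises` lookup tables, the extra thing [003] and the exact shape of the output sentence are all invented for this example.

```python
def ancestors(concept, parent_of):
    """Return every supertype reachable from `concept` via `parent_of`."""
    seen = set()
    while concept in parent_of and parent_of[concept] not in seen:
        concept = parent_of[concept]
        seen.add(concept)
    return seen


def aggregate(entity, concepts, relations, parent_of, generalises):
    """Fold atomic facts about one entity into a single CE-like sentence.

    concepts:    concept names asserted for the entity (may repeat)
    relations:   (verb, object kind, object id) triples (may repeat)
    parent_of:   concept -> direct supertype
    generalises: specific verb -> the more general verb it implies
    """
    # Drop supertypes that can be inferred from a more specific concept.
    implied = set()
    for c in concepts:
        implied |= ancestors(c, parent_of)
    specific = [c for c in dict.fromkeys(concepts) if c not in implied]

    # Drop duplicates, then relations inferable from a more specific one.
    unique = list(dict.fromkeys(relations))
    inferable = {(generalises[verb], target)
                 for verb, _kind, target in unique if verb in generalises}
    kept = [r for r in unique if (r[0], r[2]) not in inferable]

    head = specific[0] if specific else "thing"
    clauses = [f"is a {c}" for c in specific[1:]]
    clauses += [f"{verb} the {kind} {target}" for verb, kind, target in kept]
    if not clauses:
        # Assumed phrasing for an entity with nothing else to report.
        return f"there is a {head} named {entity}."
    return f"the {head} {entity} " + " and ".join(clauses) + "."


parent_of = {"patrol unit": "agent", "agent": "thing"}
generalises = {"is located in": "is contained in"}
concepts = ["thing", "agent", "patrol unit", "agent"]
relations = [
    ("is contained in", "container", "[002]"),
    ("is located in", "place", "[002]"),
    ("is located in", "place", "[002]"),
    ("finds", "thing", "[003]"),
]

summary = aggregate("[001]", concepts, relations, parent_of, generalises)
print(summary)
```

Here the supertypes ‘agent’ and ‘thing’, the duplicated location fact and the inferable containment fact are all suppressed, leaving a single sentence about [001].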


The final output, the set of CE sentences representing the entities and relations, is now available for further processing and analysis, via machine or human.

In some of these steps (specifically 3 and 4) the rationale for the inferred CE sentences is also generated and stored, and is available for presentation to the user if a better understanding of where the information came from is required.

CE Parser module

Our experience of using CE in real applications indicates that it is of benefit but that it is desirable to extend the expressivity of the CE language [CE6], for example by adding prepositional phrases. The extension of the language may eventually add ambiguities, but we suggest that the careful and incremental addition of new expressiveness will allow control of such ambiguity. Our approach is to extend the CE language and associated parsing system to more closely match the syntactic and semantic structure of real Natural language, by following the same linguistic theories, and using the same linguistic model (including rules) in both CE and NL processing. Thus our CE becomes a more closely constrained version of a real NL. We believe this has several advantages:


1. We can use the CE language as a controlled example of a realistic NL, allowing the exploration of linguistic processing techniques including the representation of linguistic models in CE itself. This may help to define configuration capabilities for NL processing tools.

2. We can use our understanding of a real NL to guide the selection of new syntax and associated semantics in order to extend CE without introducing uncontrolled ambiguity.

3. We can reuse models, rules and processing technologies in both CE and NL processing.


As part of this parallelisation of CE and NL processing we have developed the notion of a ‘linguistic frame’, as part of the CE linguistic model. A linguistic frame defines a phrase structure both as a syntactic component and a semantic component, together with the ways of mapping across the interface between them. A linguistic frame is really a specialised type of logical relationship, and we have integrated an interpreter of these logical relationships as part of a chart parser to provide the basis for a CE parser. There is a close correspondence between the logic in a linguistic frame and the rules used in the NL parsing, and it is our intention to further parallelise the processing of NL and CE.

ACKNOWLEDGMENT

This research was sponsored by the U.S. Army Research Laboratory and the U.K. Ministry of Defence and was
accomplished under Agreement Number W911NF-06-3-0001. The views and conclusions contained in this document
are those of the author(s) and should not be interpreted as representing the official policies, either expressed or
implied, of the U.S. Army Research Laboratory, the U.S. Government,
the U.K. Ministry of Defence or the U.K.
Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government
purposes notwithstanding any copyright notation hereon.

References

[CE1] Mott, D., Summary of Controlled English, ITACS, https://www.usukitacs.com/?q=node/5424, May 2010

[CE2] Mott, D., Status on Work on Rationale and CNLs, https://www.usukitacs.com/?q=node/4420

[CE3] Sowa, J., Common Logic Controlled English, March 2007, http://www.jfsowa.com/clce/clce07.htm

[CE4] Mott, D., The representation of logic within semantic web languages, ITACS, https://www.usukitacs.com/?q=node/4986, August 2009

[CE5] Mott, D. and Dorneich, M. C., “Visualising rationale in the CPM”, 3rd Annual Conference of the International Technology Alliance (ACITA), Maryland, USA, 2009

[CE6] Mott, D. and Hendler, J., Layered Controlled Natural Languages, 3rd Annual Conference of the International Technology Alliance (ACITA), Maryland, USA, 2009

[WN1] WordNet, a lexical database for English, http://wordnet.princeton.edu/

[WN2] Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, Vol. 38, No. 11: 39-41.

[WN3] Fellbaum, C. (ed.) (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

[SEMTRI] Ogden, C. K. and Richards, I. A., The Meaning of Meaning (1923)

[SP1] The Stanford Parser, a statistical parser, http://nlp.stanford.edu/software/lex-parser.shtml

[SP2] Klein, D. and Manning, C. D. (2003). Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.