Table of Contents

1 INTRODUCTION
2 HISTORY
  2.1 Early systems
3 NATURAL LANGUAGE PARSING
  3.1 Rule-Based Syntactic Parsing
  3.2 Terminal Symbols
  3.3 Non-terminal symbols
  3.4 Production Rules
    3.4.1 Grammar
    3.4.2 Parse tree
      3.4.2.1 Top down
      3.4.2.2 Bottom up
  3.5 Probabilistic Parsing
    3.5.1 Disambiguation
    3.5.2 Training
      3.5.2.1 Treebank
      3.5.2.2 Incremental learning
  3.6 Semantic Parsing
    3.6.1 Semantic Data Models
    3.6.2 Case Based Reasoning
    3.6.3 Semantic Representation
    3.6.4 Actions of the Parser
4 NLIDB ARCHITECTURE
  4.1 Pattern-matching systems
  4.2 Parsing based systems
    4.2.1 Semantic grammar based parsing
    4.2.2 Translation
5 MARKET TEST
  5.1 Goals
  5.2 Tests
  5.3 Results
    5.3.1 Impressions
      5.3.1.1 Microsoft English Query
      5.3.1.2 Elfsoft
    5.3.2 Query results
6 FUTURE
  6.1 Language challenges
  6.2 Portability challenges
  6.3 Competing systems
  6.4 Possible avenues
    6.4.1 Adaptation techniques
    6.4.2 Speech-based techniques
    6.4.3 Learning algorithms
      6.4.3.1 User Dialogue
      6.4.3.2 Neural Networks
      6.4.3.3 Genetic Algorithms
7 CONCLUSIONS
8 BIBLIOGRAPHY
9 CONTRIBUTIONS


1 INTRODUCTION

The ability to use language to convey different thoughts and feelings differentiates human beings from animals. Natural Language Processing is defined as the capability of a machine to understand the full context of human language about a particular topic, so that implicit assumptions and general knowledge can be understood. “Thus if the machine is able to achieve this, it has come close to the notion of artificial intelligence itself” (Manas Tungare).



One may find interacting with a foreign person who speaks no English intricate and frustrating; a translator has to come into the picture to allow the two to communicate. Companies have related this problem to extracting data from a database management system (DBMS) such as MS Access, Oracle and others. A person with no knowledge of Structured Query Language (SQL) may find himself or herself handicapped in corresponding with the database. Therefore, companies like Microsoft and Elfsoft (English Language Frontend Software) have analysed the abilities of Natural Language Processing to develop products that let people interact with a database in simple English. The user simply enters queries in English into the natural language database interface. This kind of application is known as a Natural Language Interface to a DataBase (NLIDB).


The system works by utilizing syntactic knowledge together with the knowledge it has been provided about the relevant database (Manas Tungare). It is hence able to map the natural language input onto the structure, scope and contents of the database. The program translates the whole query into the standard query language to extract the relevant information from the database. These products have thus created a revolution in extracting information from databases: they have discarded the fuss of learning SQL, and the time spent learning that query language is also saved.


This report will look at the performance of each database interface connected to a standard database. The Northwind database has been chosen as the default database to work on. Several companies offer such products in the market. Our group has found several of them, including English Query, Elfsoft, EasyAsk and NLBean, created by Mr Mark Watson. We requested permission from these companies to test their products for our research. We received positive responses from Elfsoft and NLBean, but had to settle for tests on Microsoft English Query and Elfsoft only. We also contacted EasyAsk via email, but the company provided minimal assistance in our research.







In order to produce accurate conclusions on the different interpretations of each piece of software, we have listed over thirty questions with which to test the products. Each product will be asked the same questions in the same order. The questions have been carefully planned to test the pros and cons of each product.


These questions include:

- Listing specific columns and rows
- Counting
- Calculations
- Cross-referencing from more than one table
- Ordinal positions
- Follow-ups
- Conclusions
- Semantics
- Grammar mistakes
- Spelling mistakes
- Out-of-context questions


There are three components in a natural language dialog system: analysis, evaluation and generation (Dialog-Oriented Use of Natural Language). The analysis component translates the query as entered by the user into a semantic representation, transcribed in the knowledge representation language. There may be several communication sessions between the natural language access system and the user interface in order to carry out the action and derive the result. The evaluation component allows information to be absorbed by the dialog system when queries have to be satisfied or when the system needs to alert the user about any major state changes. The generation component gathers the information the user wants to see, as specified in the query; it generates text, graphs, queries or any other responses according to the situational context of the query.


The knowledge-based database assistant (KDA), as stated, is a practical development of an intelligent database front-end to assist novice users in retrieving desirable information from an unfamiliar database system (Manas Tungare). This component exists in both Microsoft English Query and Elfsoft. It directs the novice user to the relevant results by guiding them toward an accurate query, or by prompting the user when insufficient information is entered, until the appropriate answer is obtained. This component can be seen in both programs in the later part of this report.


In addition, “the KDA's responding functionality, which could change the user's knowledge state, is called query guidance” (Manas Tungare). It can detect a user’s scope of knowledge about the relevant database by studying the query the user enters. If it senses that the user has limited awareness of the database and could not retrieve his or her desired answer, query guidance jumps into action: it provides similar queries to allow the user to gather the appropriate facts from the database, or presents the most relevant query based on the user’s perceived intention. Such a component allows the novice to get familiar with the database quickly, and lets the user learn about the scope of the database from the prompt messages and the queries generated by the KDA, without the expense of studying the massive databases stored in most organizations.







2 HISTORY

As the use of databases for data storage spread during the 1970s, the user interface to these systems represented a burden for designers worldwide. At this point, both the relational database model and the SQL interface language were yet to be developed, which meant that the task of inserting and querying data was tedious and difficult.

It was therefore a logical step for programmers to attempt to develop more user-friendly and “human” interfaces to the databases. One of these approaches was the use of natural language processing, where the user would interactively be allowed to interrogate the stored information.

2.1 Early systems

The most well-known historical natural language database interface systems are:

- LUNAR, interfacing a database with information on rocks collected during the American moon expeditions. It was originally published in 1972. When evaluated in 1977, it answered 78% of questions correctly. Based on syntactic parsing, it tended to build several parse trees for the same query, and was deemed inefficient (Hafner, C. D. and Gooden, K., pp. 141-164) as well as too domain-specific and inflexible.

- LADDER, the first semantic grammar-based system, interfacing a database with information on US Navy ships.

- CHAT-80, probably the most famous example. It interfaced a database of world geography facts. The entire application (both the database and the user interface) was developed in Prolog. As the source code was freely distributed, it is still used and cited; an online version can be found via “ECL I Vertiefung: natürlichsprachliche Zugangssysteme: chat80”.


3 NATURAL LANGUAGE PARSING

3.1 Rule-Based Syntactic Parsing

Syntax describes the ways that words can fit together to form higher-level units such as phrases, clauses, and sentences. Syntactically driven parsing therefore means that interpretations of larger groups of words are built up out of the interpretations of their syntactic constituent words or phrases. In a way this is the opposite of pattern matching, where the input is interpreted as a whole.

Syntactic analyses are obtained by applying a grammar that determines which sentences are legal in the language being parsed. Syntactic parsing operates by translating the natural language query into a parse tree, which is then converted to an SQL query. There are a number of fundamental concepts in the theory of syntactic parsing.

3.2 Terminal Symbols

A terminal symbol is a basic building block of the language, i.e. a word or delimiter. Together, the set of terminal symbols forms the “dictionary of words” (Luger and Stubblefield) recognised by the system, i.e. the range of vocabulary that it can read and interpret.

3.3 Non-terminal symbols

Non-terminal symbols are higher-level language terms describing concepts and connections in the syntax of the language. Examples of non-terminal symbols include sentence, noun phrase, verb phrase, noun, and verb.

3.4 Production Rules

As the query is analysed, a number of production rules fire to identify and classify the context of each word read. In analogy with a production system (such as the one used in PROLOG), a production rule in a context-free grammar (this paper is restricted to context-free grammars and does not deal with the more complex set of syntaxes known as context-sensitive) converts a left-hand non-terminal symbol into a sequence of symbols, which can be either terminal or non-terminal. Examples of production rules:

- Sentence := Noun phrase Verb phrase
- Verb phrase := verb

These rules are also commonly referred to as rewrite rules.

3.4.1 Grammar

The combination of the set of terminal symbols, the set of non-terminal symbols, the production rules and an assigned start symbol (the highest-level construct in the system, usually the sentence) forms the grammar of the syntax. The role of the grammar is to define:

- what category each word belongs to;
- what expressions are legal and syntactically correct;
- how sentences are generated.
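These four grammar components can be sketched as plain data structures. A minimal illustration in Python follows; the toy rule set and symbol names are invented for this sketch, not drawn from any particular system:

```python
# A toy context-free grammar: terminal symbols, non-terminal symbols,
# production rules, and a start symbol.
TERMINALS = {"the", "girl", "boy", "forgot"}
NONTERMINALS = {"S", "NP", "VP", "Det", "N", "V"}
START_SYMBOL = "S"

# Each production rewrites a left-hand non-terminal into a sequence
# of terminal and/or non-terminal symbols.
PRODUCTIONS = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
    "Det": [["the"]],
    "N": [["girl"], ["boy"]],
    "V": [["forgot"]],
}

def categories_of(word):
    """Which categories a word belongs to (one of the grammar's roles)."""
    return {lhs for lhs, rhss in PRODUCTIONS.items() if [word] in rhss}
```

With these definitions, `categories_of("girl")` returns `{"N"}`: the word category falls straight out of the production rules whose right-hand side is that single terminal.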

3.4.2 Parse tree

The system analyses the sentence by reading the input symbols in order and identifying which production rule to fire. As it does so, it gradually builds a representation of the sentence referred to as a parse tree. The term is coined from the tree-like graph that is produced, where the root is the top-level symbol (e.g. sentence), the children of each node are the right-hand side non-terminal symbols, and the leaves are the terminal symbols (the words). The parse tree can be built in two fundamentally different ways.

3.4.2.1 Top down

A top-down parser starts at the root and gradually builds the tree downwards by matching the read terminal symbols against symbols on the right-hand side of candidate production rules. Terminal or non-terminal symbols on the right-hand side are added at the level below the current symbol. This is similar to the goal-driven approach of a production system. The basic architecture of a top-down parser is illustrated in figure 1.



Figure 1: Top-down parsing of the sentence “the girl forgot the boy” (Dougherty)


In many situations, the first token alone does not provide enough information to
make the decision on what production rule should be fired. In order to overcome
this, there are two basic methods.

3.4.2.1.1 Recursive Descent

The system starts by firing the first of the candidate production rules that the given terminal symbol could fit, and builds the initial subtree from this information. If proceeding further down the tree results in an inconsistency or syntactic error, it reverts to the point where the decision was made, removes all the nodes on the way back up, and selects another of the possible productions. This procedure is very similar to depth-first searching with backtracking in production systems.
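The recursive-descent procedure can be sketched in a few lines of Python. The toy grammar below is invented for the sketch, and the recursion does the backtracking: when a candidate rule fails part-way, its partial subtree is simply discarded and the next candidate is tried.

```python
# Recursive descent with backtracking over a toy grammar.
PRODUCTIONS = {
    "S": [["NP", "VP"]],
    "NP": [["the", "girl"], ["the", "boy"]],
    "VP": [["forgot", "NP"]],
}

def parse(symbol, tokens, pos):
    """Try candidate productions in order; on an inconsistency, back up
    and select another possible production (depth-first search)."""
    if symbol not in PRODUCTIONS:                 # terminal symbol
        if pos < len(tokens) and tokens[pos] == symbol:
            return symbol, pos + 1
        return None
    for rhs in PRODUCTIONS[symbol]:               # fire candidates in order
        children, p = [], pos
        for sym in rhs:
            result = parse(sym, tokens, p)
            if result is None:
                break                             # syntactic error: backtrack
            node, p = result
            children.append(node)
        else:
            return (symbol, children), p          # whole right-hand side matched
    return None

def accepts(sentence):
    result = parse("S", sentence.split(), 0)
    return result is not None and result[1] == len(sentence.split())
```

For instance, `accepts("the girl forgot the boy")` succeeds only after the inner `NP` first tries `["the", "girl"]`, fails on “boy”, and backtracks to the `["the", "boy"]` production.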

3.4.2.1.2 Look Ahead

Look-ahead systems are not content with reading just one token. Rather, they read as many tokens as are necessary to identify the given right-hand side beyond any ambiguity before firing any production rules.

Grammars are characterised by the maximum number of terminal symbols that must be read before all possible conflicts in the choice of production rule can be resolved. If this number is k, the grammar is referred to as an LL(k) grammar (Eriksson). The look-ahead procedure is more in analogy with a breadth-first search technique.

3.4.2.2 Bottom up

A bottom-up parser, on the other hand, works from the leaves upward by “tagging” the tokens, i.e. starting from the right-hand side of the production rules and associating each read word with its category. When a full right-hand side has been identified, the production rule fires and the left-hand side non-terminal symbol is added as a branch in the level above. This methodology corresponds to the data-driven technique of production systems. The bottom-up parsing technique is illustrated in figure 2.
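This shift-reduce behaviour can be sketched as follows. The toy grammar is invented for the sketch, and the reducer is deliberately naive (greedy, no backtracking), so it only handles grammars where eager reduction is always safe:

```python
# A naive bottom-up (shift-reduce) parser over a toy grammar.
# Rules are (left-hand side, right-hand side) pairs.
RULES = [
    ("Det", ["the"]),
    ("N", ["girl"]),
    ("N", ["boy"]),
    ("V", ["forgot"]),
    ("NP", ["Det", "N"]),
    ("VP", ["V", "NP"]),
    ("S", ["NP", "VP"]),
]

def shift_reduce(tokens):
    """Tag tokens and reduce whenever a full right-hand side sits on
    top of the stack; otherwise shift the next word."""
    stack, buffer = [], list(tokens)
    while True:
        for lhs, rhs in RULES:
            if len(stack) >= len(rhs) and stack[-len(rhs):] == rhs:
                stack[-len(rhs):] = [lhs]     # reduce: replace RHS with LHS
                break
        else:
            if not buffer:
                return stack                  # nothing left to shift or reduce
            stack.append(buffer.pop(0))       # shift the next word
```

Running `shift_reduce("the girl forgot the boy".split())` tags each word, reduces “the girl” to an NP, and finishes with the single start symbol on the stack.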




Figure 2: Bottom-up parsing of the sentence “the girl forgot the boy” (Dougherty)


In some cases, the sentence is ambiguous in itself and multiple production rules match it, in which case the parser has to choose between the potential interpretations. One strategy for dealing with these situations is referred to as probabilistic parsing.






3.5 Probabilistic Parsing

Probabilistic parsing takes an empirical approach to the difficult task of disambiguation, i.e. identifying which of several mutually exclusive alternative syntactic parse trees should be generated.

For example, consider the sentence “One morning I shot an elephant in my pyjamas” (Groucho Marx). There are two possible syntactic parses for this sentence (Jurafsky and Martin). One implies that the person was wearing the pyjamas, while the opposing view would claim that the elephant was in the underwear (hence the joke). Although the selection between these two interpretations is obvious to a human, how is this knowledge automated in a computer?

One option, used in so-called attribute grammars, is to encode information for each verb as a parameter to each production rule. However, as the dictionary grows, this approach may be too selective and require every different case to be specifically added to the production rules.

Probabilistic parsing, on the other hand, works by augmenting the rules with assigned probabilities, representing the chance of the particular expansion (production rule) being the correct one.

For example, a probabilistic grammar would introduce the following enhancements to the possible regular syntactic production rules for the expansion of the non-terminal symbol sentence (Jurafsky and Martin):




- Sentence := Nounphrase Verbphrase, P = 0.8
- Sentence := Auxiliary Nounphrase Verbphrase, P = 0.15
- Sentence := Verbphrase, P = 0.05

Note that the probabilities for the expansions of any given non-terminal symbol always add up to 1.
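As data, such augmented rules are just ordinary productions paired with a probability. A quick sanity check of the sum-to-1 property for the three expansions above might look like this (the representation is a sketch, not any particular system's):

```python
# The three expansions of "Sentence" with their assigned probabilities.
SENTENCE_RULES = [
    (["Nounphrase", "Verbphrase"], 0.80),
    (["Auxiliary", "Nounphrase", "Verbphrase"], 0.15),
    (["Verbphrase"], 0.05),
]

def probabilities_sum_to_one(rules, tolerance=1e-9):
    """The expansion probabilities of one non-terminal must add up to 1."""
    return abs(sum(p for _, p in rules) - 1.0) < tolerance
```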


3.5.1 Disambiguation

How does probabilistic parsing choose a parse tree from two possible interpretations? In most systems, it simply compares, for each competing parse, the product of the probabilities of all the productions fired to build it, and selects the parse with the highest product.
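In Python, with each candidate parse represented by the list of probabilities of the rules fired to build it, the selection is a one-liner; the parse labels and probability values below are invented purely for illustration:

```python
from math import prod

def disambiguate(candidates):
    """Pick the parse whose fired-rule probabilities have the highest product."""
    return max(candidates, key=lambda c: prod(c[1]))

# Two hypothetical parses of the elephant sentence; each carries the
# (made-up) probabilities of the productions used to build it.
candidates = [
    ("speaker wearing pyjamas", [0.8, 0.3, 0.6]),
    ("elephant wearing pyjamas", [0.8, 0.05, 0.6]),
]
```

Here the first candidate wins, because 0.8 × 0.3 × 0.6 exceeds 0.8 × 0.05 × 0.6.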






3.5.2 Training

One important task concerns how to set the probabilities. There are two fundamentally different techniques for this (Jurafsky and Martin).

3.5.2.1 Treebank

A large database of sentences with their correct parses (parsed by knowledgeable humans) is entered into the system. The respective probabilities are then calculated as the relative frequencies of each possible parse. For more details, see Jurafsky and Martin.
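The relative-frequency computation is straightforward: count how often each expansion of a non-terminal occurs in the hand-parsed data and divide by the total for that non-terminal. The observed data below is invented for the sketch:

```python
from collections import Counter

def estimate_probabilities(observed_rules):
    """Treebank training: each rule's probability is its relative
    frequency among all observed expansions of the same non-terminal."""
    rule_counts = Counter(observed_rules)               # (lhs, rhs) -> count
    lhs_totals = Counter(lhs for lhs, _ in observed_rules)
    return {rule: count / lhs_totals[rule[0]]
            for rule, count in rule_counts.items()}

# Hypothetical hand-parsed data: 8 plain sentences, 2 with an auxiliary.
observed = [("S", ("NP", "VP"))] * 8 + [("S", ("Aux", "NP", "VP"))] * 2
```

On this data the estimates come out as 0.8 and 0.2, which by construction sum to 1 for the non-terminal `S`.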


The largest known treebank is the Penn Treebank. The latest version, Treebank 3, contains parses of (as quoted by the LDC office of the University of Pennsylvania in an email dated 10/7-2001):

- One million words of 1989 Wall Street Journal material;

- A small sample of ATIS-3 transcripts. The Air Travel Information Service is a joint project of DARPA (Defence Advanced Research Projects Agency) and SRI International, handling voice-based queries and requests about flights; more information can be found in the “Language Reference”;

- A fully parsed, tagged version of the Brown Corpus, consisting of one million words from 500 different sources (novels, academic books, newspapers, non-fiction books etc. (Jurafsky and Martin));

- Parsed and tagged text from a set of 560 transcripts of telephone conversations (a.k.a. the Switchboard-1 corpus).

This is a widely used “training set” (in analogy with an artificial neural network) enabling the parser to learn what classes of speech a given word can belong to and how frequently a particular expression is to be interpreted in different ways.


3.5.2.2 Incremental learning

The other technique is a “trial and error” method, in which the parsing system, much like an artificial neural network, learns as it is used.

The initial probabilities can be assigned randomly or by the user. After that, the system adjusts these probabilities according to the following rules (Jurafsky and Martin):

- If the sentence was unambiguous, its parse count is increased by 1, i.e. p_i := p_i + 1;

- If the sentence was ambiguous, each of the possible parses has its count incremented by its respective probability, i.e. p_i := p_i + P(p_i).







The algorithm for this computation is referred to as the Inside-Outside Algorithm. It was originally proposed by Baker (pp. 547-550) and is described in detail by Manning and Schutze.

3.6 Semantic Parsing

The syntactic structure of a sentence is not enough to express its meaning. For instance, the noun phrase “the catch” can have different meanings depending on whether one is talking about a baseball game or a fishing expedition. To talk about the different possible readings of the phrase, one therefore has to define each specific sense of it. The representation of the context-independent meaning of a sentence is called its logical form (Tang, p. 5). Natural language analysis based on semantic grammar is similar to syntactically driven parsing, except that in semantic grammar the categories used are defined semantically.

Database items can be ambiguous when the same item is listed under more than one attribute. For example, the term “Mississippi” is ambiguous between being a river name and a state name; in other words, it has two different logical forms. The two different meanings have to be represented distinctly for an interpretation of a user query.


3.6.1 Semantic Data Models

Semantic data models (SDMs) are widely researched in the database community. They are closely related to the semantic networks used in artificial intelligence, which were originally developed to support natural language processing. Hence, as database management systems they are capable of supporting large amounts of information, while still offering the potential of advanced inferencing capabilities including NLP, machine learning, and query processing.

“SDMs can be seen as formalising many of the relationships, expressed in an ad hoc manner in conventional hypermedia systems” (Beck, Mobini, and Kadambari). SDMs support a variety of formalised links and relationships. An example of a small network on insects is shown in figure 3. The links in this graph express generalisation or “ISA” relationships (a Beneficial Insect IS-A Insect), part/whole (an Abdomen is part of an Insect), association (Ladybugs eat Aphids), and class/instance (Ladybug is an instance of Beneficial Insect).





Figure 3: Semantic Data Model describing insects (source: http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig1.gif)

In figure 3, solid lines are ISA relationships, diamonds are part/whole, circles are associations, and instances are underlined.


Since concepts in SDMs are described by structured graphs expressing the relationships among symbols, rather than by connections between text files as in conventional hypertext, SDMs can be manipulated to produce a number of desirable functions. Foremost is that of search, or query processing. Beck, Mobini, and Kadambari suggest query processing based on graph matching techniques, by which the query is expressed as a small semantic network. This query graph is then matched against the larger database graph to find connections. This gives a much more precise search capability than is possible with Boolean keyword searches over text files.
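A toy version of this graph-matching idea can be sketched with the insect network stored as labelled edges; the edge set below is reconstructed from the figure's description, and `None` acts as a wildcard in the query graph:

```python
# The insect network as labelled (subject, link, object) edges.
EDGES = {
    ("Ladybug", "ISA", "Beneficial Insect"),
    ("Beneficial Insect", "ISA", "Insect"),
    ("Abdomen", "PART_OF", "Insect"),
    ("Ladybug", "EATS", "Aphid"),
}

def match(query):
    """Graph matching: the query is itself a small network; None in any
    position of the query edge is a wildcard."""
    s, link, o = query
    return {edge for edge in EDGES
            if (s is None or edge[0] == s)
            and (link is None or edge[1] == link)
            and (o is None or edge[2] == o)}
```

A query such as `match((None, "ISA", "Insect"))` retrieves everything directly classified as an insect, which is exactly the kind of structural precision a Boolean keyword search over text cannot offer.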


3.6.2 Case Based Reasoning

In order to construct an NLP system, one must construct a large dictionary. Much of the recent advance in text understanding systems can be attributed to advances in the design and construction of large lexicons. But that presupposes that word meaning is easily represented; here a case-based reasoning approach to meaning is used. Words obtain meaning from how they are used. A particular word is used in many different situations and contexts, and each occurrence of the word is treated as one case. Similarities among cases can be observed, and cases with similar usage can be clustered together into categories. When a word is used in a new situation, similar cases are retrieved from the case-based memory in order to apply what happened before to the new context. The meaning of a particular word is established by a large case base, and thus a single word may be “worth 1,000 cases” (Beck, Mobini, and Kadambari).







3.6.3 Semantic Representation

The most basic constructs of the representation language are the terms used to describe objects in the database and the basic relations between them. Database objects bear relationships to each other, or can be related to other objects of interest to a user who is requesting information. For instance, in a user query like “What is the capital of Texas?”, the data of interest is a city with a certain relationship to a state called Texas, or more precisely its capital. The capital/2 relation, or predicate, is therefore defined to handle questions that require it.

Predicate          Description
city(C)            C is a city
capital(S,C)       C is the capital of S
density(S,D)       D is the population density of state S
loc(X,Y)           X is located in Y
len(R,L)           L is the length of river R
next_to(S1,S2)     State S1 borders S2
traverse(R,S)      River R traverses state S

Table 1: Sample of predicates (Lappoon R. Tang, p. 6)

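The predicates in Table 1 amount to relations over a fact base. Two of them are sketched below over an invented toy fact base (the facts and density value are ours, chosen for illustration, not drawn from any actual system's data):

```python
# Invented toy fact base for two of the Table 1 predicates.
CAPITALS = {"texas": "austin", "ohio": "columbus"}
BORDERS = {("texas", "oklahoma"), ("texas", "louisiana")}

def capital(s, c):
    """capital(S, C): C is the capital of S."""
    return CAPITALS.get(s) == c

def next_to(s1, s2):
    """next_to(S1, S2): state S1 borders S2 (bordering is symmetric)."""
    return (s1, s2) in BORDERS or (s2, s1) in BORDERS
```

A query like “What is the capital of Texas?” then reduces to finding the `C` for which `capital("texas", C)` holds.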

3.6.4 Actions of the Parser

We will discuss the workings of the parser using the parser actions in CHILL, known as shift-reduce parsing. The parser actions are generated from templates given by a logical query; an action template is instantiated to form a specific parsing action. Recall that the parser also requires a lexicon to map phrases to specific logical forms. Consider the following example (Tang):

Sentence: What is the capital of Texas?
Logical Query: answer(C,(capital(C,S),const(S,stateid(Texas)))).


A very simple lexicon will map ‘capital’ to ‘capital(_,_)’ and ‘Texas’ to ‘const(_,stateid(texas))’. The parser begins with an initial stack and a buffer holding the input sentence; this is the initial parse state. Each predicate on the parse stack has an attached buffer to hold the context in which it was introduced. Words from the input sentence are shifted onto the stack buffer during parsing. The initial parse state is as follows:


Parse Stack: [answer(_,_):[]]
Input Buffer: [what,is,the,capital,of,texas,?]


Since the first three words in the input buffer do not map to any logical forms, the next sequence of steps pushes these three words from the input buffer onto the parse stack buffer, with the following result:

Parse Stack: [answer(_,_):[the,is,what]]
Input Buffer: [capital,of,Texas,?]


Now, ‘capital’ is at the head of the input buffer and is mapped to ‘capital(_,_)’ in the lexicon. The next action pushes this logical form onto the parse stack, giving the parse state:

Parse Stack: [capital(_,_):[],answer(_,_):[the,is,what]]
Input Buffer: [capital,of,Texas,?]


The parser then binds two arguments of two different logical forms to the same variable, resulting in the following parse state:

Parse Stack: [capital(C,_):[],answer(C,_):[the,is,what]]
Input Buffer: [capital,of,Texas,?]

The sequence repeats itself, producing the parse state:

Parse Stack: [const(S,stateid(Texas)):[?,Texas],capital(C,S):[of,capital],answer(C,_):[the,is,what]]
Input Buffer: []


The final step takes the logical form on the parse stack and puts it into one of the arguments of the meta-predicate, resulting in:

Parse Stack: [answer(C,(capital(C,S),const(S,stateid(Texas)))):[?,Texas,of,capital,the,is,what]]
Input Buffer: []

As this is the final parse state, the logical query is then constructed from the parse stack.
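The sequence of parse states above can be mimicked with a deliberately simplified simulation. This is a sketch of the idea only, not CHILL itself: variable binding and the final assembly into the meta-predicate are omitted, and the two-entry lexicon is the one given in the text.

```python
# Words with a lexicon entry introduce their logical form on the parse
# stack; all other words are shifted into the top predicate's context buffer.
LEXICON = {
    "capital": "capital(_,_)",
    "texas": "const(_,stateid(texas))",
}

def final_state(sentence):
    stack = [("answer(_,_)", [])]            # the initial meta-predicate
    buffer = sentence.lower().split()
    while buffer:
        word = buffer.pop(0)
        form = LEXICON.get(word.rstrip("?"))
        if form is None:
            stack[-1][1].insert(0, word)     # shift onto the stack buffer
        else:
            stack.append((form, [word]))     # introduce the logical form
    return stack
```

Running it on “what is the capital of texas ?” reproduces the shape of the final state in the trace: three predicates on the stack, each carrying the context words shifted while it was on top.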





4 NLIDB ARCHITECTURE

4.1 Pattern-matching systems

The first NLIDBs were based on pattern-matching techniques. As a simple illustration of the pattern-matching technique, consider the following database:


Countries_Table

Country   Capital   Language
France    Paris     French
Italy     Rome      Italian

Table 2: Sample database table (Androutsopoulos, Ritchie, and Thanisch, p. 14)


A primitive pattern-matching system, according to Androutsopoulos, Ritchie, and Thanisch, may use rules such as:

Pattern: … “capital” … <country>
Action: Report CAPITAL of row where COUNTRY = <country>

Pattern: … “capital” … “country”
Action: Report CAPITAL and COUNTRY of each row


If the us
er asked “
What is the capital of France?
”, using the first pattern rule the
system would report “
Paris
”. The system would also use the same rule to handle
questions such as “
Print the capital of Italy
”, “
Could you please tell me what is the
capital of Fran
ce?
” etc.
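A minimal version of the first rule above can be written with a regular expression over the Countries_Table of Table 2; the rule encoding below is our own simplification of the rule format in [1]:

```python
import re

# Mirrors the Countries_Table of Table 2.
COUNTRIES_TABLE = {
    "France": {"Capital": "Paris", "Language": "French"},
    "Italy":  {"Capital": "Rome",  "Language": "Italian"},
}

# Pattern: ... "capital" ... <country>
# Action:  Report CAPITAL of row where COUNTRY = <country>
CAPITAL_RULE = re.compile(r"capital\b.*?\b(France|Italy)\b", re.IGNORECASE)

def answer(question):
    m = CAPITAL_RULE.search(question)
    if m:
        country = m.group(1).title()
        return COUNTRIES_TABLE[country]["Capital"]
    return None   # no rule matched

print(answer("What is the capital of France?"))                          # Paris
print(answer("Could you please tell me what is the capital of Italy?")) # Rome
```

Because the rule only looks for the word "capital" followed anywhere by a country name, it handles all the phrasings above with a single pattern, which is exactly the shallowness that makes the approach both cheap and fragile.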


Some advantages of this approach are that it requires no complicated parsing or
interpretation modules, and that it is easy to implement. But the main advantage
of this approach is its simplicity. However the shallowness of this approach often
l
ead to bad failures. An example is when a pattern
-
matching NLIDB was asked

TITLES OF EMPLOYEES IN LOS ANGELES
.” the system reported the state
where each employee worked, assuming the “
IN
” to denote the post code of
Indiana, and assumed that the question w
as about employees and states.
29


4.2  Parsing-based systems

In general, as [21] suggests, the system architectures of some NLIDBs can be seen as being made of two major modules. The first module handles the natural language: a submitted question is successively transformed until, at the end of the process, one or more intermediate logical query expressions are obtained. Given the dimension of the domain and the flexibility of natural language, there usually exist several interpretations of the same question. The second component is in charge of the connection with the database, translating the expressions to Structured Query Language (SQL) expressions (using mapping) and sending them to the Database Management System (DBMS) to produce the answers. [30] For a graphical explanation of the structure, examine Figure 4.

28  Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.14
29  Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., pp.14-15


Figure 4  NLIDB Architecture [31]

As described in the previous section, the source-language sentence is first parsed, producing a parse tree. The two methods of parsing most often found are syntax-based and semantic-grammar-based.

4.2.1  Semantic-grammar-based parsing

Using this technique, the grammar's categories do not necessarily correspond to syntactic concepts. Examine the following figure:




30  Reis, P., Matias, J., and Mamede, N., pp.3-4
31  Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.18



Figure 5  Semantic-grammar-based parse tree [32]


Notice that some categories of the grammar (e.g. Substance, Magnesium, Specimen_question) do not correspond to syntactic constituents (e.g. Noun-Phrase, Noun, Sentence). This is because semantic information about the knowledge domain (e.g. that a question may refer either to specimens or to spacecraft) is hard-wired into the semantic grammar. [33]

Because the semantic grammar approach contains hard-wired knowledge about a specific knowledge domain, it is very difficult to transfer it to other knowledge domains. A new semantic grammar has to be written whenever the NLIDB is configured for a new knowledge domain. [34]


4.2.2  Translation

The translation is usually based on several mapping tables. Figure 6 illustrates this process both for the addition of new information based on an input sentence and for the processing of a related query. The query is represented by a small graph, which initiates the mapping to the semantic hierarchy. The small graph is mapped to the semantic network by creating a link from each node in the smaller graph to the corresponding nodes in the network, starting with the most general concept (the root) and ending with the most specific. This creates a unique instance, which is the intersection of all of the nodes involved in the query and may be used to narrow down a neighbourhood based on the requested information. [35]

The mapping process is bounded by rules and completely based on the information of the parse tree. As an example of mapping rules, consider the previous query "which rock contains magnesium", taken from [1]:

- The mapping of "which" is for_every X.




32  Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.17
33  Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.17
34  Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.17
35  Beck, H., Mobini, A., and Kadambari, V. [online]



- The mapping of "rock" is (is_rock X).

- The mapping of an NP is Det' N', where Det' and N' are the mappings of the determiner and the noun respectively, resulting in for_every X (is_rock X).

- The mapping of "contains" is contains.

- The mapping of "magnesium" is magnesium.

- The mapping of a VP is (V' X N'), resulting in (contains X magnesium).
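The compositional rules above can be sketched as a small recursive walk over a parse tree. The nested-tuple tree encoding and the node labels below are our own illustration of the idea, not the actual mapping machinery of [1]:

```python
# Compositional mapping from a parse tree to a logical form,
# following the mapping rules listed above.

MAPPINGS = {
    "which": "for_every X",
    "rock": "(is_rock X)",
    "contains": "contains",
    "magnesium": "magnesium",
}

def map_node(node):
    """Map a leaf word via the lexicon, or combine children per the rules."""
    if isinstance(node, str):                  # leaf word
        return MAPPINGS[node]
    label, *children = node
    parts = [map_node(c) for c in children]
    if label == "NP":                          # NP -> Det' N'
        return " ".join(parts)
    if label == "VP":                          # VP -> (V' X N')
        verb, noun = parts
        return "(%s X %s)" % (verb, noun)
    if label == "S":                           # S -> NP' VP'
        return " ".join(parts)
    raise ValueError("unknown category: %s" % label)

tree = ("S", ("NP", "which", "rock"), ("VP", "contains", "magnesium"))
print(map_node(tree))
# for_every X (is_rock X) (contains X magnesium)
```

The logical form is built bottom-up: each word contributes its lexicon entry, and each grammar category says only how to glue its children's meanings together.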






Figure 6  Mapping and Query Processing Model [36]

Figure 7 demonstrates the user asking a query on how John spent his leisure time, and shows how the answer to the query is produced by exploiting the relationship between "spending leisure time" and "having a chance to go fishing" (both are "doing").

Figure 7  Query processing model [37]


In many systems the syntax rules linking non-leaf nodes and the semantic rules are domain-independent, and can be used in any application domain. The information describing the possible words (leaf nodes) and the logic expressions is domain-dependent and has to be declared in the lexicon. [38]



As an example, consider the lexicon used in MASQUE [1], listing the possible words "capital", "capitals", "border", "borders", "bordering", "bordered".

- The logic expression of "capital" and "capitals" could be capital_of(Capital,Country).

- The logic expression of "border", "borders", "bordering", and "bordered" could be borders(Country1,Country2).




36  http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig2.gif
37  http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig3.gif
38  Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.19






The logic expression of "country" could be is_country(Country).

Then the question "What is the capital of each country bordering Greece?" would be mapped to this query:

answer([Capital, Country]) :-
    is_country(Country),
    borders(Country, Greece),
    capital_of(Capital, Country).

The meaning of the logic query above is to find all pairs [Capital, Country], such that Country is a country, Country borders Greece, and Capital is the capital of Country.
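Evaluated against the database, such a query amounts to a conjunctive search over the relations. A rough sketch against in-memory relations follows; the border and capital facts are real, but the table encoding and evaluation strategy are our own illustration, not MASQUE's:

```python
# Evaluating the MASQUE-style conjunctive query above against toy relations.

is_country = {"Bulgaria", "Albania", "Turkey", "Greece"}
borders = {("Bulgaria", "Greece"), ("Albania", "Greece"), ("Turkey", "Greece")}
capital_of = {("Sofia", "Bulgaria"), ("Tirana", "Albania"),
              ("Ankara", "Turkey"), ("Athens", "Greece")}

def answer():
    """All [Capital, Country] satisfying the three conjuncts of the query."""
    return sorted(
        [capital, country]
        for country in is_country
        if (country, "Greece") in borders          # borders(Country, Greece)
        for (capital, c) in capital_of
        if c == country                            # capital_of(Capital, Country)
    )

print(answer())
# [['Ankara', 'Turkey'], ['Sofia', 'Bulgaria'], ['Tirana', 'Albania']]
```

Each generator clause plays the role of one conjunct of the logic query, with shared variables expressed as equality tests between tuples.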

The interpreter also needs to consult a world model that describes the structure of the surrounding world, as shown by the figure below. Typically, the model contains a hierarchy of classes of world objects, and constraints on the types of arguments each logic predicate may have. [39]

Figure 8  Hierarchy in world model [40]
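A world model of this kind can be sketched as a class hierarchy plus per-predicate argument signatures. The hierarchy and signatures below are invented for illustration, in the spirit of Figure 8:

```python
# A toy world model: a class hierarchy and argument-type constraints
# for logic predicates. All classes and signatures are illustrative.

HIERARCHY = {"city": "entity", "country": "entity", "entity": None}

SIGNATURES = {
    "capital_of": ("city", "country"),
    "borders": ("country", "country"),
}

def is_a(cls, ancestor):
    """Walk up the hierarchy to test class membership."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = HIERARCHY[cls]
    return False

def well_typed(predicate, arg_classes):
    """Check a predicate's arguments against its declared signature."""
    expected = SIGNATURES[predicate]
    return len(arg_classes) == len(expected) and all(
        is_a(got, want) for got, want in zip(arg_classes, expected))

print(well_typed("capital_of", ["city", "country"]))   # True
print(well_typed("borders", ["city", "country"]))      # False
```

Constraints of this kind let the interpreter reject nonsensical readings (e.g. a city borders a country) before any database access happens.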




39  Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., pp.18-19
40  Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.19


5  MARKET TEST

In order to get a good estimate of the current state of the technology, the applications presented in the previous chapter were subjected to a neutral test.

5.1  Goals

The goals of the tests were:

- To get a thorough understanding of contemporary market applications;
- To get an estimate of the relevance and importance of this type of system;
- To get some insight into which features are more and less important.

5.2  Tests

The tests were carried out on the Northwind database, a sample database with information on a shipping company. The database comes as a demo with all distributed copies of Microsoft Access.

A number of queries of different types were posed to the respective natural language front ends. The questions were classified as simple (S), average (A), or complex (C).

For a more comprehensive explanation of the considerations behind the testing procedures, see Appendix A.

5.3  Results

5.3.1  Impressions

5.3.1.1  Microsoft English Query

English Query is a development environment that enables programmers to produce natural language front ends for SQL Server 2000 databases. The product is included with SQL Server 2000. The tests were performed on a demo of English Query, developed by Microsoft to interface with the Northwind database.

The user interface has five fields, with the following functionalities:

- Query (user input)
- Interpretation of query
- Required operations
- Produced SQL statement
- Results

A screen shot from one of the queries is presented in Figure 9.






Figure 9  Microsoft English Query.

5.3.1.2  Elfsoft

Elfsoft works together with either VB or Access. Queries are entered in a query window (see Figure 10) and can be output either as database tables (see Figure 11) or in a graphical format.

Figure 10  Elfsoft query window.

Figure 11  Elfsoft answer output.


Elfsoft also includes several other options for enhanced portability, including:



Automatic analyser of any Access database



Enabling the user to teach program meanings of phrases



Allowing the user to expla
in why a query failed (what was missing
and/or wrong)



Permitting the user to edit the dictionary



Logging of queries for statistics

5.3.2  Query results

The results are summarised in Table 3. A full recollection of the questions asked is presented in Appendix B.

Table 3  Accuracy percentages.

Type of query | English Query | Elfsoft
Simple        | 71            | 23
Average       | 50            | 40
Complex       | 67            | 100


6  FUTURE

During the mid-eighties it was believed that natural language processing systems would become a universal interface to databases worldwide. [41] However, due to the emergence of graphical interfaces to databases, the relative simplicity of SQL, and the inherent problems of natural language processing, they have never really caught on commercially. [42]

The current position of NLIDBs is probably best described by "it's a great idea, but...". Although their usefulness is appreciated, they are still at a research stage. There are several reasons why their usage is not taking off on a broader scale.

6.1  Language challenges

It is still very hard to encode the vast scope, complexity and ambiguity of a human language into a computer. The formalisms for representing language patterns are still not comprehensive enough to capture all the different ways that expressions and terms can be constructed and given meaning depending on the context.

6.2  Portability challenges

Although several systems for communication with individual databases have been successfully implemented and used, a general technique that would allow the user to specify the database and use a system with any database management system (whether Access, SQL Server, Oracle or any other) is still rather elusive. This would require the system to be able to recognize the fields and attributes of a new storage source seamlessly.

An even bigger hurdle to portability is the nature and scope of language understanding. Language use in different domains is very dissimilar, which means that any portable system has to have a huge vocabulary with terms from many different application domains and be able to recognize expressions from users of a wide variety of professions.

6.3  Competing systems

Graphical and form-based interfaces have become the de facto standard for database front ends. Because of the challenges presented above, these other types of systems can generally be developed in less time and at a lower cost.

6.4  Possible avenues

There is still a lot of research going on in this area. Having explored the application of natural language processing as database interfaces, the authors can see a number of different scenarios.




41  Johnson, T.
42  Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., pp.29-81


6.4.1  Adaptation techniques

There is a need for methodologies that would enable the user to specify the data source in a general descriptive language and to supply a given set of terms used within the domain. This would make the application portable from database to database.

This need has been recognised in [22], where a solution based on the general Resource Description Framework (RDF) is proposed. The system outlined in [22] learns the pattern and domain vocabulary of any given database automatically, and also contains an interface that allows the user to change the database model (classes, properties, tables etc.).

6.4.2  Speech-based techniques

Certain authors [1] believe that natural language keyboard interfaces will be superseded by speech recognition systems. However, as such systems are of an even more complex nature, some of the linguistic challenges will have to be solved first. Research on NLIDBs can therefore be a base for the development of voice-based systems [1].

6.4.3  Learning algorithms

Every person has their own vocabulary and way of using language. There is no way that a program can contain all the words in a language or all the different meanings that a term may take on.

Further, the use of language changes over time, which means that the semantics and vocabulary of a system may become obsolete after a certain time of use.

An important challenge for a natural language database front end (or any natural language processing system in general) is to possess an ability to learn as it is used, evolve with the user and adapt to new users. This ability is, after all, one of the definitions of artificial intelligence.

There are several ways in which this could potentially be achieved. Note that these are suggestions and not based on in-depth research.

6.4.3.1  User Dialogue

One way to achieve learning would be to include a lexical editor, where the user could enter language terms and link them to their synonyms. The user should also be able to specify the different forms of a word, e.g. noun plurals, adjective comparative forms, verb tenses etc.

This ability is present in Elfsoft.





6.4.3.2  Neural Networks

By use of probabilistic techniques, a system might be able to adjust the probabilities of different parses based on training and test texts which have been parsed and tagged by the user or obtained from linguists. By continuously retraining the network with parsed texts from the database-specific domain, the neural network would be able to pick up language patterns and learn incrementally.

6.4.3.3  Genetic Algorithms

Another way would be for the system to obtain feedback from the user on accuracy (e.g. ask the user whether queries were answered correctly) and adjust its language processing structure (production rules) by the use of genetic algorithms.




7  CONCLUSIONS

The project has focused on two main topics:

- The techniques of translating a question in natural language into a database query, extracting the results that the user is looking for;
- The leading contemporary applications on the market.

The underlying methods belong to the general natural language processing area, and any system has to select among several different techniques involving different degrees of syntactic analysis, semantic processing, or a combination. A general feature seems to be the translation of the query in two steps: first to an intermediate language and then to a database query language, e.g. SQL.

The topic integrates approaches from several other facets of artificial intelligence, e.g. production systems, neural networks, expert systems, and machine learning.


Two of the leading commercial software packages were tested, with mixed results. Some rather complex queries were handled well, while the systems tended to have problems handling rather easy tasks. The sample sizes involved are too small to base any general conclusions on, however; the configuration of the university computers at our disposal could not be used for testing the programs.

Many companies have overestimated the use of natural language processing in the database interface, assuming that the system is able to understand the significance of a query accurately. However, the system is not able to fully comprehend human language and jargon unless it has been given the definitions of the terms relating to the relevant database. [43] This mainly involves the semantic analysis: a syntactically well-formed sentence may have various meanings, which need not even be similar to one another, and this produces undesirable results in the database queries. This is one main reason why many systems tend to fail, and explains why most companies would still rather rely on SQL programmers for their database processing.

Although these kinds of applications are rather unpopular, the authors enjoyed using them and encourage their future development. From the experiences of the performed tests, such systems have the potential to make the task of searching for information a lot less tedious and time-consuming.





43  Honkela, T.


The eventual success of natural language front ends will depend on how well they can adapt to new environments, regarding both databases and users' ways of using language. Two proposed benchmarks for these types of systems could be:

- It has to be able to learn and understand the database faster than the user;
- It has to learn natural language faster and more easily than the user can learn a programming language.


ACKNOWLEDGEMENTS

The authors wish to extend their appreciation to the following people for their support during the course of the project:

- Jon Greenblatt, President of English Language Frontend Software Co.
- Girish Mohata, Teaching Fellow, IT School, Bond University


8  BIBLIOGRAPHY

1.  Androutsopoulos, I., Ritchie, G.D., and Thanisch, P.: Natural Language Interfaces to Databases - An Introduction. Journal of Natural Language Engineering, vol. 1, no. 1. Cambridge University Press 1995.

2.  Baker, J.K.: Trainable Grammars for Speech Recognition. Speech Communication Papers for the 97th Meeting of the Acoustical Society of America. Acoustical Society of America 1979.

3.  Beck, H., Mobini, A., and Kadambari, V.: A Word is Worth 1000 Pictures: Natural Language Access to Digital Libraries. University of Florida. http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/beckmain.html

4.  Dialog-Oriented Use of Natural Language. http://www.dfki.uni-sb.de/vitra/papers/ro-man94/node5.html. Accessed 31/07/01.

5.  Dougherty, R.C.: Natural Language Computing: An English Generative Grammar in Prolog. Lawrence Erlbaum Associates 1994.

6.  EasyAsk - Applications Overview. http://www.englishwizard.com/applications/index.cfm. Accessed 19/7-2001.

7.  ECL I Vertiefung: natürlichsprachliche Zugangssysteme: chat80. http://www.ifi.unizh.ch/cl/broder/chat/chat80.htm. Accessed 12/7-2001.

8.  Eriksson, G.: Översättarteknik. KFS AB 1984.

9.  Groucho Marx in the movie Animal Crackers.

10. Hafner, C.D. and Godden, K.: Portability of Syntax and Semantics in Datalog. ACM Transactions on Information Systems, vol. 3. Association for Computing Machinery 1985.

11. Honkela, T.: The WWW Version of Self-Organizing Maps in Natural Language Processing. Helsinki University of Technology. http://www.cis.hut.fi/~tho/thesis/. Accessed 22/07/01.

12. Johnson, T.: Natural Language Computing: The Commercial Applications. Ovum 1985.

13. Jurafsky, D. and Martin, J.H.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall 2000.

14. Language Reference. http://www.darpa.mil/ito/psum2000/h165-0.html. Accessed 14/7-2001.

15. Luger, G.F. and Stubblefield, W.A.: Artificial Intelligence: Structures and Strategies for Complex Problem Solving. Third Edition. Addison-Wesley 1999.

16. Tungare, M.: Natural Language Processing. http://www.manastungare.com/articles/nlp/natural-language-processing.asp. Accessed 30/07/01.

17. Manning, C.D. and Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press 1999.

18. Natural-Language Database Interfaces from ELF Software Co. http://www.elfsoft.com/ns/FAQ.htm. Accessed 19/7-2001.

19. Palmer, M. and Finin, T.: Workshop on the Evaluation of Natural Language Processing Systems. Computational Linguistics, vol. 16, pp. 175-181. MIT Press 1990.

20. Penn Treebank Project. http://www.cis.upenn.edu/~treebank/. Accessed 10/7-2001.

21. Reis, P., Matias, J., and Mamede, N.: Edite - A Natural Language Interface to Databases: A New Dimension for an Old Approach. http://digitais.ist.utl.pt/cstc/le/Papers/CSTCLE-12.PDF

22. Sharoff, S. and Zhigalov, V.: Register-Domain Separation as a Methodology for Development of Natural Language Interfaces to Databases. Proceedings of the IFIP TC.13 International Conference on Human-Computer Interaction. International Federation for Information Processing 1999.

23. Tang, R.L.: Integrating Statistical and Relational Learning for Semantic Parsing: Applications to Learning Natural Language Interfaces for Databases. University of Texas, May 2000.


9  CONTRIBUTIONS

The respective chapters were produced by the following group members:

Chapter 1: Jun
Chapter 2: Hakan
Chapter 3: Aris and Hakan
Chapter 4: Aris
Chapter 5: All
Chapter 6: Hakan
Chapter 7: Hakan and Jun
Bibliography and report compilation: Aris
Appendices: Hakan




APPENDIX A

Evaluating Systems

Introduction

How good is a natural language database interface? The answer to this question is hard to define. A survey conducted during the course of this project revealed that no formal evaluation techniques exist. As long as this situation remains, an unambiguous answer to the question will elude all stakeholders in this area.

Why is there a need?

The need for formal evaluation schemes in this field, as in any other, arises from several stakeholders' desires:

- Users want a guide for choosing between systems;
- Companies want benchmarks for product development and improvement;
- Companies need metrics for proving the capabilities of their products.

Current Marketing

The companies behind contemporary techniques market their products with some of the following arguments:

- Ease of set-up and integration with new databases. It is often mentioned [6,18] that end users will be relieved of the task of having to learn and understand the internal workings of the Database Management System (DBMS);
- Money saved on searching;
- Price;
- Ease of integration across different DBMSs (Access, SQL Server, Oracle etc.);
- Accuracy;
- The possibility to perform searches on several data stores simultaneously.

Problems

There have been some attempts to define general formal metrics for natural language processing systems [19]. In [19], it was concluded that this is a difficult task for a number of reasons:

- Systems are built using a variety of techniques;
- They are used in many different domains, where users' needs vary;
- There is a lack of funding for research in this area.

However, it was also concluded that database front ends constitute one of the types of systems for which metrics could potentially be developed and adopted.


Black box metrics

In [19], a strong distinction is made between black box and glass box metrics. A black box approach only looks at the output generated by a certain input, and does not take into account the architecture of the system or the efficiency of individual components.

Advantages:

- It takes the user's view;
- It can be applied across platforms, on systems with different implementation details;
- It is not tied to a specific implementation technique;
- It can be used over time, regardless of trends in database and programming methodologies.

Disadvantages:

- It doesn't give a good indication to programmers of what is actually wrong;
- It is badly suited for testing individual components of a system.

Proposed black box evaluation scheme

The proposed evaluation scheme takes into account several different aspects of the program in question. Evaluation can be based on the following characteristics:

Overall Characteristics

- User Friendliness: Is the application easy to understand and use? Are help files accessible and explanatory? Are error messages clear?
- Portability: Can it be used in conjunction with only a specific database? If not, how easy is it to integrate it with other databases?
- Speed: How fast are answers extracted?
- Fault Tolerance: Can the system recognize off-topic questions (queries on information that is not in the database) and give an informative response within a reasonable time frame?
- Accessibility: Can it be used over the web?

Vocabulary

Can the system accurately understand the following expressions? [44]

- What?
- Which?
- How many?
- How much?
- Show
- List
- Tell
- Count

44  This list is arbitrary and may have to be expanded/contracted.

Ease of Interaction

- Linguistic Flexibility: How many spelling errors in a word can the system tolerate and still understand? Can it suggest alternative spellings? [45]
- Probing questions: Are "follow-up" questions (questions referring to the previous answer) allowed?
- Can the system adjust for bad grammar and still understand the question?

Accuracy based on input complexity

The system is asked a number of different questions, ranked as simple, average or complex. The accuracy (percentage of questions answered correctly) in each of the three categories is noted.
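The accuracy figure is simply the share of correctly answered questions, tallied per complexity class. A small sketch of the tally (the outcome labels follow the conventions of Appendix B; the sample protocol is invented):

```python
# Per-class accuracy, as used in the market tests: correct answers
# divided by questions asked, per complexity class.

def accuracy_by_class(protocol):
    """protocol: list of (complexity_class, outcome) pairs.
    Returns rounded percentage of 'Correct' outcomes per class."""
    totals, correct = {}, {}
    for cls, outcome in protocol:
        totals[cls] = totals.get(cls, 0) + 1
        if outcome == "Correct":
            correct[cls] = correct.get(cls, 0) + 1
    return {cls: round(100 * correct.get(cls, 0) / n)
            for cls, n in totals.items()}

sample = [("S", "Correct"), ("S", "No answer"), ("A", "Correct"),
          ("A", "Wrong"), ("C", "Correct")]
print(accuracy_by_class(sample))   # {'S': 50, 'A': 50, 'C': 100}
```

Note that "No answer", "Wrong", and "Too little information" all count equally as failures under this metric, which is the coarseness a black box approach accepts by design.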


The evaluation scheme formed the basis of the market tests of chapter 5. However, because of the small sample size of tested applications, no attempt was made to formalize the scheme or to develop a metric based on it.




45  For an example of this capability, please try a search on http://www.google.com with a word containing a slight spelling error, e.g. "elpheants".




APPENDIX B

Test Protocol

The questions asked, their respective classifications, and the outcomes for the tested programs are presented in Table 4. In the classification column, S stands for Simple, A for Average, and C for Complex.

Table 4. Test Protocol.


Question | Class | Microsoft English Query outcome | Elfsoft outcome | Comments
Who is the oldest employee? | S | Correct | Correct | English Query gave the oldest person, Elfsoft the one who had worked the longest at Northwind.
Which supplier (currently) supplies the most products (which are not discontinued)? | C | Correct | Correct |
Which employee has handled the most orders? | A | No answer | Correct | Elfsoft gave too much information.
What product is the most frequently ordered? | S | Correct | No answer |
List the country that has a supplier that ships tofu. | A | No answer | Correct |
Name the third most ordered product. | S | No answer | No answer |
What is the least ordered product? | S | Wrong | No answer |
How much is 1kg of Queso Cabrales? | S | Correct | No answer |
How much tofu has been ordered? | A | No answer | Correct | Elfsoft gave too much information.
Show the phone number of United Package. | S | Correct | Correct |
Tell me the names of the sales representatives. | S | Correct | No answer |
Tell me the age of these people. | A | Correct | No answer |
And their phone numbers? | A | Correct | Correct |
Count the customers in Germany. | S | Correct | Correct |
What is the average age of the employees? | A | Correct | Wrong |
Name the employees that are older than average. | A | Correct | No answer |
Give the name of the sales manager. | S | Correct | No answer |
Where is Around the Horn from? | S | Correct | No answer |
What is the median of the age of the employees? | A | No answer | Wrong |
List the names of the people working currently in the company. | S | No answer | Wrong |
Who is older than Janet? | S | Correct | No answer |
What can you tell me about Ernst Handel? | S | Too little information | No answer |
Which supplier supplies tofu but not longlife tofu? | C | Correct | Correct |
What are the contact names and phone numbers of customers that have received products sent with Federal Shipping? | C | No answer | Wrong |
What are the products that Federal Shipping ships? | A | Correct | Correct | Microsoft English Query had the wrong interpretation.
What customers received these shipments? | A | No answer | Wrong |