ON FORMAL LANGUAGE THEORY FOR NATURAL
LANGUAGE PROCESSING
ROBERT P. GOVE AND JAN RYCHTÁŘ

Abstract.
In this review article we demonstrate why computers still cannot understand instructions in English or other human languages. We concentrate on a seemingly simple problem: how can a computer recognize that a given sentence is part of a given language? We introduce formal language theory, discuss the Chomsky hierarchy of languages, and point to several serious obstacles in natural language processing.
1. Introduction
It took more than one hundred years, from 1865 to 1969, for Jules Verne's novel From the Earth to the Moon to come true through the Apollo 11 program [2, 8]. However, when Arthur C. Clarke wrote his 2001: A Space Odyssey 50 years ago, or when Isaac Asimov was writing I, Robot between 1940 and 1950, nobody really doubted that by now we would see much of their fiction come true. Now we see computers and robots (without positronic brains, though) do miracles, yet there are still some seemingly simple things they cannot do. For example, they cannot fully understand a human language, even though there have been some relatively early advances in this direction.
Computer systems such as SHRDLU, developed as early as 1968-70 at the M.I.T. Artificial Intelligence Laboratory [43, 44], work extremely well and can be successfully commanded through a human language interface. However, they work only with restricted vocabularies and in a very small environment.
The conversational agent ELIZA, a computer program capable of speaking and understanding English, was developed in 1966 [41]. ELIZA parodied a Rogerian therapist, rephrasing many of the input sentences as questions and posing them back to the user. It had to address many important tasks such as (i) the identification of key words, (ii) the discovery of minimal context, (iii) the choice of appropriate transformations, and (iv) the generation of responses. START (SynTactic Analysis using Reversible Transformations) [24, 27], a somewhat more sophisticated version of ELIZA, is the world's first Web-based question answering system. It was also developed at the M.I.T. Computer Science and Artificial Intelligence Laboratory, and it has been on-line and continuously operating since December 1993. It consists of two modules: a) the understanding module and b) the generating module. The understanding module analyzes English text and produces a knowledge base which incorporates the information found in the text. Given an appropriate segment of the knowledge base, the generating module produces English sentences. A user can thus retrieve information stored in the knowledge base by querying it in English.
Date: August 10, 2009.
2000 Mathematics Subject Classification. Primary 68T50, Secondary 91F20.
The Georgetown-IBM experiment [20], performed on January 7, 1954, was a successful demonstration of completely automatic translation of more than sixty Russian sentences into English. The process of translation is quite simple on paper: first, decode the meaning of the source text, and, second, re-encode this meaning in the target language. Yet, more than 50 years after the Georgetown-IBM experiment, machine translation is still far from being solved [22].
So, the question remains: How is it possible that robots can fly to distant planets, navigate and move in 2D and 3D space, and still have a very hard time understanding something as simple as instructions (and their meaning) in English? Such a question is an integral part of the interdisciplinary field of natural language processing. One should start by asking for the definitions of understanding and meaning. This is quite a major problem: what does it mean to understand, and also whether and how the level of understanding can be measured [17]. We will not go into such discussions, but we will provide a lot of detail on what a language is and how one can capture and represent a language in a computer. Let us keep the following question in mind throughout the paper:

How do we know a sentence is part of a language?
Clearly, “John plays on the beach” is a sentence in English, but “Plays beach on John the” is not, even though all the words are part of the English dictionary. We as humans know and can see the difference very quickly, but is it possible for a computer to make the same decision?
Compared to the tasks of the conversational agents and machine translation discussed above, deciding whether a sentence is part of a language seems trivial. Yet, we will see how nontrivial it can be. At the same time, answering this question may be crucial for more complex and more meaningful tasks such as translation.
2. Formal Languages and Grammars
We first need to formalize what we mean by a language. This is done in formal language theory, developed by Chomsky [3, 5], and it is closely related to the foundations of computer science [39, 40]. Here are some classical definitions that can be found in many texts, including the above-mentioned references and also [14, 32, 18, 30], among others.
Definition 1. An alphabet is a finite set of symbols. A string, or sentence, over an alphabet Σ is a finite sequence of symbols from the alphabet Σ. A formal language (over an alphabet Σ) is a set of strings over the alphabet Σ.
An alphabet may be the usual set {a, b, ..., z}, it can be a set of abstract symbols, it can be a language over another alphabet, or simply anything else. Every symbol is called a letter (if we talk about strings) or a word (if we talk about sentences). Not all sequences of symbols from the alphabet are necessarily part of the language, just like not all combinations of English words form an English sentence.
Here are some examples of alphabets and languages.
(1) Σ = {σ}, L_{σ*} = {σ^n ; n ≥ 1}, where σ^n is a string of length n consisting of n symbols σ.
(2) Σ = {σ}, L_primes = {σ^p ; p a prime number}.
(3) Σ = {a, b, ..., z}, L_words = the set of English words.
(4) Σ = the set of English words, L_English = the set of English sentences.
(5) Σ = {U, A, G, C}, L_RNA = the set of valid genetic RNA sequences.
(6) Σ = {0, 1}, L_{01*} = {01^n ; n ≥ 1}, the set of strings that start with a single 0 followed by a finite number of 1s.
(7) Σ = {0, 1}, L_{0^n 1^n} = {0^n 1^n ; n ≥ 1}, the set of strings that start with a finite number of 0s followed by exactly the same number of 1s.
(8) Σ = {0, 1, 2}, L_{012} = {0^n 1^n 2^n ; n ≥ 1}, the set of strings that start with a finite number of 0s followed by exactly the same number of 1s and then by exactly the same number of 2s.
(9) Σ = {σ}, L_total = {σ^n ; f_n is total}, where f_n is a recursive function with Turing machine T_n [16, p. 807].
(10) Σ = {σ}, L_halts = {σ^n ; f_n(n) halts} [16, p. 807].
Right now we know what a language is, and we can theoretically decide whether a string S is a part of a language L. The question is simple: is S ∈ L? All one has to do to answer this question is look at every single element of L and compare it to S. If one finds an element in L equal to S, then S ∈ L; otherwise, S ∉ L.
However, keep in mind that our objective is for a computer to decide whether S ∈ L. As soon as the alphabet Σ contains at least one symbol σ ∈ Σ, the language can be (and usually is) infinite. Thus, the computer may never come to a complete decision; it may never stop searching for S in L, especially if S is not in L. Consequently, by searching the list we can never be sure whether or not S ∈ L as long as we have not received an answer in finite time.
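To make this concrete, here is a minimal sketch in Python (ours, not from the paper) of searching an enumerated language; the enumeration of L_{01*} is an assumption chosen purely for illustration. If S is not in L, the loop never terminates on its own, which is exactly the problem described above.

    def enumerate_language():
        # Enumerate L_{01*} = {01^n ; n >= 1}: 01, 011, 0111, ...
        n = 1
        while True:
            yield "0" + "1" * n
            n += 1

    def search(S, give_up_after=10**6):
        # Compare S against each enumerated element of L.
        for count, w in enumerate(enumerate_language()):
            if w == S:
                return True   # found S in L
            if count >= give_up_after:
                return None   # gave up: we still cannot conclude that S is not in L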
For a moment, consider the case of prime numbers, L_primes. For any number n, we can theoretically decide whether or not it is prime simply by checking whether k divides n for all k < n. We would like to point out that theoretically does not always mean practically. There are many tests for determining whether a number n is prime or not. The fastest ones, such as the AKS test [1], can do so in time polynomial in the number of digits of n (i.e. a polynomial function of ln n). Yet, verifying that 2^{43,112,609} − 1, which is currently the largest known Mersenne prime, is indeed prime took about 13 days of computational time on a top computer [29]; language processing would need considerably faster responses.
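As a sketch (in Python; the function names are ours), the naive trial-division test just described, lifted to a membership test for L_primes over the one-letter alphabet {σ} (here written as "a"):

    def is_prime(n):
        # Naive trial division: check whether any k with 1 < k < n divides n.
        if n < 2:
            return False
        for k in range(2, n):
            if n % k == 0:
                return False
        return True

    def in_L_primes(s, sigma="a"):
        # Membership in L_primes = {sigma^p ; p prime}.
        return len(s) > 0 and set(s) == {sigma} and is_prime(len(s))

    print(in_L_primes("aaaaa"))  # True, since 5 is prime
    print(in_L_primes("aaaa"))   # False, since 4 = 2 * 2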
Note that if the language L_primes is given to us as a list, i.e. as a subset of the natural numbers, the operation of division would seem like something completely artificial. It is the discovery of the inner structure of this language that helps us determine whether a given string is in the language or not. We have the same situation with any other language as well: an alphabet is finite, a string is a finite sequence over the alphabet, and thus the set Σ* of all possible strings over a given alphabet is countable, i.e. there is a one-to-one and onto map f : Σ* → N. Identifying Σ* with N via the function f enables us to see any language (which is formally a subset of Σ*) as a subset of N. Thus, we cannot just “search” the list of elements of L; we need some inner structure of the language to help us decide which sentences are part of the language and which are not. For primes, the inner structure was provided by multiplication. For a general language, we need something more universal.
Definition 2. A grammar is a device that enumerates (and thus generates) all sentences of a language.
More formally, a grammar G is defined as the ordered quadruple (N, Σ, P, S) where
• N is a finite set of nonterminal symbols,
• Σ is a finite set of terminal symbols that is disjoint from N,
• P is a finite set of production, so-called rewrite, rules; a general rule is of the form S_1 → S_2, where S_1, S_2 are arbitrary strings of terminals and nonterminals,
• S is a distinguished symbol in N called the start symbol.

Figure 1. Chomsky hierarchy of grammars (and languages).
Terminals are elements of the alphabet. Nonterminals are placeholders that will eventually get replaced by another string. We say that a sentence belongs to a language if it can be generated by that language's grammar.
For example, the language L_{01*} = {01^n ; n ≥ 1} can be produced by the following grammar (cf. [30, p. 254]).

Grammar 1.
S → 0A
A → 1A | 1
where S is the start symbol, A is another nonterminal, and 0 and 1 are terminals. A valid sentence of L begins with 0 and is followed by a string of 1s. The rule S → 0A is read as “S rewrites to 0A” or “0A is derivable from S”. The symbol | means “or”, i.e. the nonterminal A may be rewritten either to 1A or to 1. Note that grammars need not be deterministic; there is no single rule which will transform the nonterminal A: it can be transformed to either 1A or 1. To produce the string 0111, which is part of the above language, one starts with S, applies the rule S → 0A, then applies the rule A → 1A twice, and finally applies the rule A → 1.
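A minimal sketch of this derivation in Python (ours, not from the paper), replacing the leftmost occurrence of a nonterminal at each step:

    # Derive 0111 in Grammar 1 by applying S -> 0A, A -> 1A, A -> 1A, A -> 1.
    sentential_form = "S"
    for lhs, rhs in [("S", "0A"), ("A", "1A"), ("A", "1A"), ("A", "1")]:
        sentential_form = sentential_form.replace(lhs, rhs, 1)
        print(sentential_form)   # prints 0A, 01A, 011A, 0111 in turn
    assert sentential_form == "0111"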
Examples of more grammars and production rules will be given later in the text. The Chomsky hierarchy [3, 5] distinguishes several types of grammars depending on the complexity of the rules (regular, context-free, context-sensitive, unrestricted); see Figure 1. In the next sections, we will discuss these grammars in more detail.
Note that there are continuum many languages (one language for every subset of N) but only countably many grammars. Thus, there are many languages without a grammar. L_total is an example of such a language [16, p. 807].
Several grammars may generate the same language, and some grammars may be better than others. For example, if a context-free language is given by a grammar in Chomsky normal form [37, p. 99], then the Cocke-Younger-Kasami algorithm [6, 45, 23] can determine whether S ∈ L in polynomial time. Consequently, it is useful to know whether a grammar G generates a language L, or similarly, whether two grammars generate the same language. Those are hard and, in general, undecidable problems [37, pp. 156-159].
3. Natural Languages
In contrast to the formal languages defined above, natural languages are not properly defined. We just say that a natural language is a language that is spoken or written by humans for general-purpose communication.
It is difficult to say what is and what is not English. For example, the sentence

Adam thinkz.

is not a correct sentence in English (it should be Adam thinks.). However, the sentence has a clear English meaning, and a human would understand it. Let us place natural languages in the context of the Chomsky hierarchy.
One approach is to see English as a regular language: there are no more than 10^6 English words, the vast majority of written or spoken sentences contain no more than 10^4 words, and thus, for practical purposes, English consists of no more than 10^{6·10^4} sentences. It means that “practical” English is a finite and thus regular language, since merely listing all sentences is equivalent to listing all rewrite rules.
However, there are at least two problems with this approach. First, there is so far no capacity (neither memory- nor time-wise) to store and retrieve all sentences. Second, and more serious, the listing representation does not really address any structure of the language and is overly static. Thus one still has to search for reasonably simple grammatical rules that would describe English without listing every possible sentence.
Contrary to the above, Chomsky [4] presents a theorem that English is not a regular language and conjectures that it is not even context-free. The main idea of the argument uses the construction of center embedding. Start with the sentence

A man paid another man.

add a more specific description of the man

A man—whom a man paid—paid another man.

add a more specific description of the description

A man—whom a man—whom a man paid—paid—paid another man.

and continue by adding descriptions to an arbitrarily deep level. Thus, English contains in itself the language

L = {A man (whom a man)^n (paid)^n paid another man ; n ≥ 1},

which is practically the language L_{0^n 1^n} = {0^n 1^n ; n ≥ 1}. This is not a regular language [30, p. 256], and it follows that English is not regular either. Yet, not many people use this center-embedding construction to the 10th level or higher, so it is a question what exactly the practical consequences of this result are.
The question of whether English is context-free was raised early on by Chomsky [3]. The majority of researchers were looking for, and seemingly found, enough evidence to support the claim that context-free grammars are not powerful enough to generate natural languages. For example, there are the following non-context-free constructions in natural languages:
• cross-serial dependencies in Swiss [36] and Austrian [30, p. 261] German,
• crossing dependencies in Dutch [19],
• reduplication in Bambara [7], etc.
On the other hand, the vast majority of forms in natural language are context-free [34].
Figure 2. a) A finite state automaton for the language L_{01*} = {01^n ; n ≥ 1}; b) a finite state automaton for the RNA code as in Grammar 3.
It may seem that the Chomsky hierarchy is an arbitrary and artificial structure, and it is no wonder that English or other natural languages do not naturally fit in it. However, even the abstract grammars defined above have incredible power and great relevance for natural languages, as we will discuss below.
4. Regular Grammars and Regular Languages
A regular language is a language that can be generated by a regular grammar, i.e. a grammar whose rewrite rules are of the form

A → ε | σ | σB,

where A, B ∈ N and σ ∈ Σ. This means that the left-hand side of a rule can only be a single nonterminal symbol, and the right-hand side may be the empty string ε, a single terminal symbol, or a single terminal symbol followed by a nonterminal symbol, but nothing else. Regular grammars are the least complex grammars.
Grammar 1, describing the language L_{01*}, is regular. The language L_{σ*} is also regular because it can be generated by Grammar 2.

Grammar 2.
S → ε | σS,

where σ is any letter in the alphabet.
The language L_RNA is regular as well. Any valid RNA sequence is a sequence of triples of the nucleotide bases (adenine (A), guanine (G), cytosine (C), and uracil (U)) that starts with AUG and ends with one of the following: UAG, UGA, UAA [15, pp. 240, 245-246, 259]. Hence, the grammar of the RNA language is

Grammar 3.
S → (AUG) α
α → (xyz) α | (UAG) | (UGA) | (UAA)

where xyz ∈ {U, C, A, G}^3 \ {UAG, UGA, UAA}; see also Figure 2b).
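A sketch of the corresponding membership test in Python (ours; written directly from the description above rather than from the grammar itself):

    STOP = {"UAG", "UGA", "UAA"}

    def in_L_RNA(s):
        # A valid sequence is AUG, then zero or more non-stop codons, then one stop codon.
        if len(s) < 6 or len(s) % 3 != 0 or not s.startswith("AUG"):
            return False
        codons = [s[i:i + 3] for i in range(3, len(s), 3)]
        middle, last = codons[:-1], codons[-1]
        return last in STOP and all(set(c) <= set("UCAG") and c not in STOP for c in middle)

    print(in_L_RNA("AUGUUCUAA"))  # True
    print(in_L_RNA("AUGUAA"))     # True: no middle codons
    print(in_L_RNA("AUGUUC"))     # False: does not end with a stop codon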
Regular languages are exactly the languages that can be recognized by a finite state automaton [25, p. 17].

Definition 3. A finite state automaton is a five-tuple A = (Q, Σ, q_0, δ, F), where Q is a finite set of states, Σ is an alphabet, q_0 is an initial state, F is a set of terminal states, and δ : Q × Σ → Q is a function such that when δ(q_1, σ) = q_2, the automaton allows a transition from the state q_1 ∈ Q to the state q_2 ∈ Q, and the name or label of the transition is σ ∈ Σ.
A finite state automaton A = (Q, Σ, q_0, δ, F) can be viewed as a grammar G = (N, Σ, P, S) of a regular language by the following identification (see e.g. [18, pp. 217-219]):
• set N := Q, Σ := Σ, S := q_0,
• add a production rule q_1 → σ q_2 for every δ(q_1, σ) = q_2,
• add a production rule f → ε for every f ∈ F.
Conversely, a regular grammar G = (N, Σ, P, S) yields a finite automaton A = (Q, Σ, q_0, δ, F) as follows:
• set Q := N ∪ {q_f}, where q_f is a new distinct symbol,
• set Σ := Σ, q_0 := S,
• set F := {q_f} ∪ {q ∈ Q ; there is a production rule q → ε},
• add δ(q_1, σ) = q_2 for every rule q_1 → σ q_2,
• add δ(q, σ) = q_f for every rule q → σ.

Figure 3. A finite state automaton for the language L = {we, wee, week, weekend, weed, weeded, weeding, weeder}. Only the branching nodes are depicted; e.g. “we” is not really in the alphabet, but in this language there is no branching after the letter w. Note that this automaton contains additional information about the words; the information may consist of tags such as noun, verb, adjective, or adverb. The information may differ even for the same word, depending on the actual meaning of the word. For example, weed can be a noun as well as a verb, etc.
Finite automata make the task of checking whether a string S is part of a regular language L quite simple. Start in the initial state of the automaton. Move through the automaton along the transitions labeled with the appropriate letters of the string. If a final state is reached exactly at the end of the string, then S is part of the language. In any other case (reaching a final state sooner, not reaching it at all, or not being able to follow the path), the string S is not part of the language. Such a procedure is very efficient, as it takes no more than the length of the string to decide whether the string is in the language or not.
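A sketch of this procedure in Python (ours; the state names q0, q1, q2 are an assumed encoding of the automaton for L_{01*} in Figure 2a)):

    # Transition function delta and final states for L_{01*} = {01^n ; n >= 1}.
    delta = {("q0", "0"): "q1", ("q1", "1"): "q2", ("q2", "1"): "q2"}
    final = {"q2"}

    def accepts(s):
        state = "q0"
        for letter in s:
            if (state, letter) not in delta:
                return False       # no transition to follow: reject
            state = delta[(state, letter)]
        return state in final      # accept iff we end in a final state

    print(accepts("0111"))  # True
    print(accepts("0"))     # False: the string ends before a final state is reached
    print(accepts("10"))    # False: no path to follow from q0 on the letter 1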
One of the important applications of finite state automata and regular languages in natural language processing is their use as lexicons. For an example of a simple lexicon, see Figure 3. Applications of natural language processing such as the above-mentioned dictionary lookup, morphological analysis, or part-of-speech tagging are sufficiently and effectively modeled through finite state automata. However, regular languages are not enough to describe complex human languages, and we need a broader and more complex spectrum of grammars.
5. Context-Free Grammars
A context-free grammar is a grammar in which the left-hand side of each production rule consists of only a single nonterminal symbol (and the right-hand side can be an arbitrary string of terminals and nonterminals). It is called context-free because any nonterminal symbol is rewritten regardless of its context. Here context means just a position relative to other letters or groups of letters in the string; it has no other meaning (yet this name caused a lot of confusion in arguments about whether natural languages are context-free). Similarly to regular languages and finite state automata, context-free languages are exactly those languages that are generated by push-down automata [37].
Context-free grammars are more powerful than regular grammars. For example, the grammar

Grammar 4.
S → 01 | 0S1

generates the language L_{0^n 1^n} = {0^n 1^n ; n ≥ 1}. This is a well-known example of a language that is not regular [30, p. 256].
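A sketch of a recognizer for this language in Python (ours); note that, unlike the finite automaton above, it must count the 0s, which is precisely what a finite state automaton cannot do:

    def in_L_0n1n(s):
        # Accept exactly the strings 0^n 1^n with n >= 1.
        n = len(s) // 2
        return n >= 1 and s == "0" * n + "1" * n

    print(in_L_0n1n("000111"))  # True
    print(in_L_0n1n("00111"))   # False: the counts of 0s and 1s differ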
Context-free grammars are important for natural language processing because Subject-Verb-Object languages, such as the restricted subset of English described by Grammar 5, are context-free.

Grammar 5.
S → NP + VP
NP → Adj + NP | Noun | Noun + PP | Det + NP
VP → Verb + Adv | Verb + NP | VP + PP
PP → PP + NP

where NP is a Noun Phrase, VP is a Verb Phrase, Adj is an adjective, Adv is an adverb, PP is a Prepositional Phrase, and Det is a Determiner.
Grammar 5 generates only a restricted subset of English. There are many English sentences that are generated by Grammar 5, but some sentences are not. For example, the sentence

“On the subject of linguistics, Noam Chomsky is an expert.”

is an English sentence that cannot be generated by Grammar 5 because it begins with a prepositional phrase.
Every language generated by a context-free grammar is generated by a grammar in Chomsky normal form [37, 28, 18], i.e. a grammar whose production rules are only of the following forms:

A → BC | σ
S → ε

where S, A, B, C are nonterminals, S is the start symbol, B, C ≠ S, and σ is a terminal. Once the grammar is written in Chomsky normal form, the Cocke-Younger-Kasami (CYK, sometimes CKY) algorithm [6, 45, 23] can determine whether a string belongs to the language and, if so, how it can be generated. The algorithm was later generalized by Gray and Harrison [12, 14] to work for any context-free grammar.

Figure 4. An example of parsing the sentence “Adam sleeps.” with the CYK algorithm. Note that the algorithm parses bottom-up; it first assigns meaning to individual words and creates the parse tree from the leaves to the root. The sentence is represented in a table (a), a grouping (b), and a parse tree (c).
The basic idea of the algorithm is as follows:
(1) Represent the string of n letters as an n-by-n table. Initialize the table as empty and point to position 0, before the first letter of the string.
(2) Add entries corresponding to all categories of the word which starts at the pointed-to position. Make the pointer point to the next position.
(3) As long as there is a pair of entries that can reduce, do the reduction and add an entry representing the result to the table, unless an equivalent entry is already present.
(4) Return to step 2.
The algorithm can formally be written as

input: string a_1 a_2 ... a_n
for j := 1 to n do
begin
    T(j-1, j) := {A ; A is a lexical category for a_j}
    for i := j-2 downto 0 do
    begin
        T(i, j) := {A ; there is k, i < k < j, such that
            (1) BC reduces to A (i.e. there is a rule A → BC) for some B ∈ T(i, k), C ∈ T(k, j),
            and (2) not(subsumed(A))}
    end
end
See Figure 4 for an example of the output of the algorithm.
Since there are three nested cycles of length at most n, where n is the length of the parsed string, the worst-case asymptotic time complexity of the CYK algorithm is O(n^3). This means that we can recognize whether or not S ∈ L in polynomial time. Interestingly, the CYK algorithm will not only give a series of production rules for the sentence, it will give all possible series of production rules, thus showing whether the sentence is structurally ambiguous; see Section 7 for some specific examples.
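A compact runnable sketch of the CYK recognizer in Python (ours; the two-rule toy grammar in Chomsky normal form is an assumption chosen only so that “Adam sleeps.” from Figure 4 parses):

    from itertools import product

    # Toy grammar in Chomsky normal form.
    binary = {("NP", "VP"): {"S"}}                # rules of the form A -> B C
    lexical = {"Adam": {"NP"}, "sleeps": {"VP"}}  # rules of the form A -> word

    def cyk(words, start="S"):
        n = len(words)
        # T[i][j] holds the nonterminals that derive words[i:j].
        T = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
        for j in range(1, n + 1):
            T[j - 1][j] = set(lexical.get(words[j - 1], set()))
            for i in range(j - 2, -1, -1):
                for k in range(i + 1, j):
                    for B, C in product(T[i][k], T[k][j]):
                        T[i][j] |= binary.get((B, C), set())
        return start in T[0][n]

    print(cyk(["Adam", "sleeps"]))   # True
    print(cyk(["sleeps", "Adam"]))   # False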
6. Context-Sensitive and Other Grammars
As discussed in Section 3, if we want to apply grammars to natural languages, we need grammars more complex than context-free ones.
A context-sensitive grammar is a grammar whose production rules are of the form

S_1 A S_2 → S_1 S S_2

where A is a single nonterminal, S_1 and S_2 are (possibly empty) strings of terminals and nonterminals, and S is a nonempty string of terminals and nonterminals. Note that the nonterminal A can be transformed into different things based on its context, i.e. based on its neighbors.
Clearly, every context-free grammar is context-sensitive, as its production rules are the restricted case S_1 = S_2 = ε. The language L_{0^n 1^n 2^n} = {0^n 1^n 2^n ; n ≥ 1} is not context-free [30, p. 257], but it is generated by a context-sensitive grammar such as Grammar 6 below.

Grammar 6.
S → 012 | 0AS2
A0 → 0A
A1 → 11
Note that Grammar 6 is not context-sensitive in the sense we defined it (it transforms A0 into 0A). This can be fixed by adding more nonterminals and more rules, as follows.

Grammar 7.
S → 0R2
R → 0RT | 1
1T2 → 1122
1TT → 11UT
UT → UU
UU2 → VU2 → V22
UV → VV
1V2 → 1122
1VV → 11WV
WV → WW
WW2 → TW2 → T22
WT → TT
It has been shown that nearly all natural languages are context-sensitive, but the whole class of context-sensitive languages seems to be much bigger than the natural languages. For our purposes, what is worse is that the decision problem of whether or not S ∈ L for a context-sensitive language L is PSPACE-complete [18, p. 346], [37, pp. 283-294]; i.e. it is a problem that
a) can be solved by a Turing machine using a polynomial amount of memory and unlimited time;
b) is such that any problem satisfying a) can be reduced to it in polynomial time.
Thus, context-sensitive grammars are impractical to use, as a polynomial-time algorithm for a PSPACE-complete problem would imply P = NP [9, pp. 170-177].
Ongoing research has focused on formulating other grammars that are “mildly context-sensitive”. The formal conditions imposed on such a class of grammars are:
1) The languages must be parsable in polynomial time.
2) The languages must have constant growth; this means that the distribution of string lengths should be linear. This is often guaranteed by the pumping lemma [37, pp. 77-83, 115-119].
3) The languages should admit limited cross-serial dependencies, allowing grammatical agreement to be enforced between two arbitrarily long subphrases. This is formally enforced by requiring that the language consisting of strings concatenated with themselves belongs to the class of mildly context-sensitive languages.
Roughly speaking, based on our discussion in Section 3, conditions 2) and 3) guarantee that natural languages are described by mildly context-sensitive grammars, and condition 1) guarantees that the problem S ∈ L is decidable in polynomial time.
Some examples of mildly context-sensitive grammars include tree-adjoining grammars [21], combinatory categorial grammars [38], linear context-free rewriting systems [42], indexed grammars [10], and others.
7. Ambiguity and Natural Language Processing
Ambiguity is a very serious problem for natural language processing. A grammar is said to be ambiguous if it cannot produce a unique parse tree for every sentence in the language, and a language is said to be inherently ambiguous if all possible grammars for the language are ambiguous [18, p. 87]. English seems to be such an inherently ambiguous language [13]. For example, the sentence

She saw the boy with a telescope.

can mean either a) “She saw a boy who had a telescope.” or b) “She used a telescope to see the boy.”, with the two meanings having distinct parse trees; see Figures 5a), b).

Figure 5. The parse trees for “She saw the boy with a telescope.”; a) this tree shows the meaning “She saw a boy who had a telescope.”, b) “She used a telescope to see the boy.”

The next sentence, used as an example by Kuno [26], does not seem ambiguous to a human being, but it is very ambiguous to a computer.

“Time flies like an arrow.”

It may be interpreted in many ways:
(1) time moves quickly just like an arrow does;
(2) the magazine, Time, travels through the air in an arrow-like manner;
(3) you should measure the speed of flies like you would measure that of an arrow;
(4) you measure the speed of flies like an arrow would measure it;
(5) you should measure the speed of flies that are like arrows;
(6) an insect, “time-flies,” enjoys a single arrow.
See Figures 6a), b) for example parse trees of these meanings.

Figure 6. Example parse trees for “Time flies like an arrow.”; a) parse tree showing the meaning “time moves quickly just like an arrow does.”, b) parse tree showing the meaning “an insect, ‘time-flies,’ enjoys a single arrow.”
Besides problems with syntax, there are problems with semantics. For example, the sentence

Colorless green ideas sleep furiously.

used by Chomsky [3, p. 92], is an English sentence, but it makes no sense. How can something be green and colorless at the same time, and how can an idea sleep?
Moreover, in spoken language, one often implies additional information by placing stress on words. The sentence “I never saw her here” demonstrates this inherent difficulty and ambiguity despite being unambiguous in its structural meaning. Depending on which word the speaker stresses, the sentence could have several distinct meanings:

“*I* never saw her here” - Someone else saw her, but I didn't.
“I *never* saw her here” - I simply didn't ever see her.
“I never *saw* her here” - I might have heard her, but I did not see her.
“I never saw *her* here” - I saw someone else, but not her.
“I never saw her *here*” - I saw her, but not here.
8. Conclusions/Discussion
Why do computers not understand human language? One of the reasons is that there is ambiguity on many levels. One sentence can mean different things based on its context, how it is said, who said it and on what occasion, etc. Another important reason is known to any pupil: there are many rules and even more exceptions. Moreover, we have very sparse data; many English sentences have never been seen before (otherwise everything, including great novels and poems yet to be written, would have been written already). An almost philosophical reason is that we do not know how the human brain works, and thus we do not use appropriate models for languages.
When a complete theory of a language is developed, we may not only understand the language, but we may also understand our brains. However, it may very well be that computers will “understand” the language quite soon and we will still not know anything about the brain. The best machine translators, such as the one used by Google [11], operate without any practical knowledge of the language or language structure; instead, the translation is based on large databases of already translated texts and statistical learning techniques [31]. This approach solves the problems with syntax and semantics quite effectively: it avoids them. One can even go much further and consider an article in Russian as a very noisy encoding of English [35].
We argue that despite the fact that the statistical approach solves many practical problems, it is important to keep searching for an appropriate model of language because, after all, we want to understand the language and many things behind it, not only to use it.
References
[1] M. Agrawal, N. Kayal, N. Saxena, PRIMES is in P, Annals of Mathematics 160 (2004), no. 2, 781-793.
[2] Apollo project, http://www.nasm.si.edu/collections/imagery/Apollo/Apollo.htm, November 28, 2008.
[3] N. Chomsky, Three Models for the Description of Language, IRE Transactions on Information Theory 2 (2), 1956: 113-124.
[4] N. Chomsky, Syntactic Structures, Mouton & Co., The Hague, The Netherlands, 1957.
[5] N. Chomsky, On certain formal properties of grammars, Information and Control, Volume 2, Issue 2, June 1959, Pages 137-167.
[6] J. Cocke, J. T. Schwartz, Programming languages and their compilers: Preliminary notes, Technical report, Courant Institute of Mathematical Sciences, New York University, 1970.
[7] C. Culy, The Complexity of the Vocabulary of Bambara, Linguistics and Philosophy 8 (1985): 345-351. doi:10.1007/BF00630918.
[8] F. French, C. Burgess, W. Cunningham, In the Shadow of the Moon: A Challenging Journey to Tranquility, 1965-1969, U of Nebraska Press, 2007. ISBN 978-0-8032-1128-5.
[9] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, 1979. ISBN 0-7167-1045-5.
[10] G. Gazdar, Applicability of Indexed Grammars to Natural Languages, in U. Reyle and C. Rohrer, Natural Language Parsing and Linguistic Theories, 1988, pp. 69-94.
[11] Google online translator, http://translate.google.com/?hl=en, January 15, 2009.
[12] J. Gray, M. A. Harrison, On the covering and reduction problems for Context-Free Phrase Structure Grammars, Journal of the Association for Computing Machinery 19(4), 1972: 675-698.
[13] D. Z. Hakkani, G. Tur, K. Oflazer, T. Mitamura, E. Nyberg, An English-to-Turkish Interlingual MT System, AMTA'98: Proceedings of the Third Conference of the Association for Machine Translation in the Americas on Machine Translation and the Information Soup, Springer-Verlag, London, UK, 1998, pp. 83-94.
[14] M. A. Harrison, Introduction to Formal Language Theory, Addison-Wesley, Reading, MA, 1978.
[15] L. Hartwell, L. Hood, M. Goldberg, A. Reynolds, L. Silver, R. Veres, Genetics: From Genes to Genomes, McGraw-Hill Publishing, New York, 2004.
[16] J. L. Hein, Discrete Structures, Logic, and Computability, second edition, Jones and Bartlett Publishers, Inc., USA, 2002.
[17] L. Hirschman, Language Understanding Evaluations: Lessons Learned from MUC and ATIS, Proceedings of The First International Conference on Language Resources and Evaluation (Granada, 28-30 May 1998), Vol. 1, pp. 117-122.
[18] J. E. Hopcroft, J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, Reading, 1979.
[19] M. A. C. Huybregts, Overlapping dependencies in Dutch, Utrecht Working Papers in Linguistics 1 (1976), 24-65.
[20] IBM press release, January 8, 1954, 701 Translator, http://www-03.ibm.com/ibm/history/exhibits/701/701_translator.html, November 28, 2008.
[21] A. Joshi, S. R. Kosaraju, H. Yamada, String Adjunct Grammars, Proceedings Tenth Annual Symposium on Automata Theory, Waterloo, Canada, 1969.
[22] D. Jurafsky, J. H. Martin, Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition, Second Edition, Pearson, Prentice Hall Series in Artificial Intelligence, 2008.
[23] T. Kasami, An efficient recognition and syntax-analysis algorithm for context-free languages, Scientific report AFCRL-65-758, 1965, Air Force Cambridge Research Lab, Bedford, MA.
[24] B. Katz, Annotating the World Wide Web Using Natural Language, Proceedings of the 5th RIAO Conference on Computer Assisted Information Searching on the Internet (RIAO '97), 1997.
[25] D. C. Kozen, Automata and Computability, Springer, New York, 1997.
[26] S. Kuno, The predictive analyzer and a path elimination technique, Communications of the ACM 8 (7): 453-462.
[27] B. Katz, web page of the START system, http://start.csail.mit.edu/, November 28, 2008.
[28] J. Martin, Introduction to Languages and the Theory of Computation, McGraw Hill, 2003. ISBN 0-07-232200-4. pp. 237-240.
[29] Mersenne prime numbers, http://www.mersenne.org/, December 12, 2008.
[30] M. Nowak, Evolutionary Dynamics: Exploring the Equations of Life, Harvard Univ. Press, Cambridge, Mass., 2006.
[31] F. Och, Official Google Research Blog: Statistical machine translation live, http://googleresearch.blogspot.com/2006/04/statistical-machine-translation-live.html, January 15, 2009.
[32] B. H. Partee, A. G. B. ter Meulen, R. E. Wall, Mathematical Methods in Linguistics, Kluwer, 1990.
[33] D. Poole, A. Mackworth, R. Goebel, Computational Intelligence: A Logical Approach, Oxford Univ. Press, New York, 1998.
[34] G. K. Pullum, G. Gazdar, Natural languages and context-free languages, Linguistics and Philosophy 4 (1982): 471-504. doi:10.1007/BF00360802.
[35] C. E. Shannon, W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, 1949. ISBN 0-252-72548-4.
[36] S. Shieber, Evidence against the context-freeness of natural language, Linguistics and Philosophy 8 (1985): 333-343. doi:10.1007/BF00630917.
[37] M. Sipser, Introduction to the Theory of Computation, PWS Publishing, 1997. ISBN 0-534-94728-X.
[38] M. Steedman, Combinatory grammars and parasitic gaps, Natural Language and Linguistic Theory 5 (1987), 403-439.
[39] A. M. Turing, On Computable Numbers, with an Application to the Entscheidungsproblem, Proc. London Math. Soc. 42 (1936): 230-265.
[40] A. M. Turing, Computing Machinery and Intelligence, Mind LIX (236): 433-460, 1950.
[41] J. Weizenbaum, ELIZA - A Computer Program For the Study of Natural Language Communication Between Man And Machine, Communications of the ACM 9 (1): 36-45. See also http://www.fas.harvard.edu/~lib51/files/classics-eliza1966.html, November 28, 2008.
[42] D. J. Weir, Linear Context-Free Rewriting Systems and Deterministic Tree-Walking Transducers, in: Proceedings of the 30th Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Morristown, NJ, USA, pp. 136-143, 1992.
[43] T. Winograd, Procedures as a Representation for Data in a Computer Program for Understanding Natural Language, MIT AI Technical Report 235, February 1971.
[44] T. Winograd, Understanding Natural Language, Cognitive Psychology Vol. 3, No. 1, 1972.
[45] D. H. Younger, Recognition and parsing of context-free languages in time n^3, Information and Control 10(2) (1967): 189-208.
Department of Computer Science, The University of Maryland at College Park, College Park, MD 20742, USA
E-mail address: rpgove@gmail.edu

Department of Mathematics and Statistics, The University of North Carolina at Greensboro, Greensboro, NC 27402, USA
E-mail address: rychtar@uncg.edu