NATURAL LANGUAGE PROCESSING USING HPSG

Milan HOLUB, Master Degree Programme (4)
Dept. of Information Systems, FIT, BUT
E-mail: xholub10@stud.fit.vutbr.cz
Supervised by: Dr. Alexander Meduna
ABSTRACT
Head-driven Phrase Structure Grammar, a theory of natural language processing, is presented. The theory is based on the assumption that every phrase has a head constituent, and it employs a mechanism called unification which governs language analysis. A formal model called the sign, which bears the linguistic information and simplifies the unification process, is explained.
1 INTRODUCTION
The aim of natural language processing is to find methods for the analysis of natural languages. There are several concepts of analysis:
corpus based takes huge representative samples of two languages and tries to match similar syntactic constructions, with the aim of building a translational database
derivation based a traditional approach based on hundreds (for a natural language) of derivation rules
information based tries to combine natural language syntax and semantics, with the aim of building a unification-based theory
Let us clarify what we need if we are about to handle natural language. We should state whether we are concerned only with syntax or whether semantics is also important for us. Our theory should account for grammatical notions such as subject, predicate and their relation; it should also be aware of relative clauses, inflectional verbs, the position of the object in a clause, passivization, interrogatives etc. If we are also interested in semantics, we should add to our theory tools for grasping the meaning of utterances. HPSG seems to have all these features.
2 HPSG
HPSG, which is an abbreviation of Head-driven Phrase Structure Grammar, belongs to the information-based concept of language analysis. The basic idea of the theory is incorporated in the notion of the head constituent of a phrase. The theory assumes that in each phrase of any natural language there is a main element which drives the meaning of the whole phrase. This idea corresponds well to the intuitive feeling that some words or subphrases occupy a more important position in a phrase than others. HPSG also accounts for semantics; Pollard [2] claims:
Syntactic and semantic aspects of grammatical theory are built up in an integrated way from the start, under the assumption that neither can be well understood in isolation from the other.
Natural language is, according to HPSG, uniquely described by the following equation:
Language = \underbrace{P_1 \wedge \cdots \wedge P_n}_{UG} \wedge \underbrace{P_{n+1} \wedge \cdots \wedge P_{n+m}}_{LSC} \wedge \bigl(\underbrace{L_1 \vee \cdots \vee L_p}_{LS} \vee \underbrace{R_1 \vee \cdots \vee R_q}_{GR}\bigr)   (1)
The equation denotes that an object is a token of a natural language sign (signs will be discussed later) just in case it satisfies all the universal (UG) and language-specific constraints (LSC), and either it instantiates one of the language's lexical signs (LS) or one of the language's grammar rules (GR).
2.1 FOUNDATIONS
To successfully describe linguistic information, it is useful to work with a formal model rather than directly with the raw information. The formal objects in information-based linguistics are called feature structures. Mathematically, a feature structure is a connected, acyclic finite state machine, but we will make no effort here to fully describe the mathematical properties of feature structures. Let us focus on how these mechanisms are employed in HPSG.
A feature structure is an information-bearing object that describes another thing by specifying values for certain attributes of the described thing. Depending on the set of chosen attributes, we say that the feature structure provides partial information about the thing. The standard notation for feature structures is the attribute-value matrix (AVM). A feature structure notated by an AVM is called a sign. An important property of feature structures is their potential for hierarchy: instead of being atomic, a value may be specified by another feature structure, so feature structures can be embedded inside one another. An example of a non-trivial feature structure follows in Figure 1.
What is the meaning of the sign in Figure 1? We can note some atomic values, e.g. for the attribute PHON, and also some complex values denoted by embedded signs. From the sign we can glean that we are dealing with a sign for the English word cookie: it is a normal noun, it has no daughters, it can be combined with a determiner (e.g. the, every etc.) and finally its semantic content is the property of being an instance of cookie. The number in the box is of no use in our case, but it is broadly used for referencing within signs, which enriches the theory with structure sharing.
[ PHON     cookie
  SYN|LOC  [ HEAD   [1] [ MAJ   N
                          NFORM NORM ]
             SUBCAT ⟨ DET ⟩ ]
  DTRS     ...
  SEM|CONT COOKIE ]

Figure 1: Example of an AVM
A sign can bear either information about an entry in a dictionary, in which case it is called a lexical sign, or, more interestingly, it can hold a whole phrase, in which case it is called a phrasal sign:

sign[ ] = lexical-sign[ ] ∨ phrasal-sign[ ]   (2)
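For concreteness, the lexical sign from Figure 1 could be encoded as nested mappings, e.g. in Python. This encoding is our own illustration, not part of HPSG itself; attribute paths such as SYN|LOC simply become nested keys.

# The AVM of Figure 1 as nested mappings (illustrative encoding only).
cookie = {
    "PHON": "cookie",
    "SYN": {"LOC": {"HEAD": {"MAJ": "N", "NFORM": "NORM"},
                    "SUBCAT": ["DET"]}},   # subcategorizes for a determiner
    "DTRS": None,                          # a lexical sign has no daughters
    "SEM": {"CONT": "COOKIE"},             # property of being a cookie
}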
One of the main features of signs is that we can "combine" several signs together. This process is called unification. Imagine that we have three signs, each of them providing some different features describing the same thing. Using unification we can obtain one sign which is more specific than the others and provides all the information about that thing previously distributed among the three signs.
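To make this idea concrete, here is a minimal sketch of unification over the nested-dict encoding introduced above. The function name unify and the failure convention (returning None on a clash) are our own illustrative choices; the sketch ignores structure sharing and typed feature structures.

def unify(fs1, fs2):
    """Unify two feature structures represented as nested dicts.

    Atomic values unify only if they are equal; dicts unify
    attribute by attribute. Returns the merged structure, or
    None when the structures are incompatible (a clash)."""
    if not isinstance(fs1, dict) or not isinstance(fs2, dict):
        return fs1 if fs1 == fs2 else None
    result = dict(fs1)                 # start from a copy of fs1
    for attr, value in fs2.items():
        if attr in result:
            merged = unify(result[attr], value)
            if merged is None:         # value clash: unification fails
                return None
            result[attr] = merged
        else:
            result[attr] = value       # new attribute: just add it
    return result

# Three partial descriptions of the same word ...
a = {"PHON": "cookie", "SYN": {"LOC": {"HEAD": {"MAJ": "N"}}}}
b = {"SYN": {"LOC": {"HEAD": {"NFORM": "NORM"}, "SUBCAT": ["DET"]}}}
c = {"SEM": {"CONT": "COOKIE"}}

# ... unified into one more specific sign, as described above.
sign = unify(unify(a, b), c)
assert sign["SYN"]["LOC"]["HEAD"] == {"MAJ": "N", "NFORM": "NORM"}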
2.2 SYNTACTIC FEATURES AND SYNTACTIC CATEGORIES
In this section we discuss the structure of signs in more detail. Following the idea proposed by Pollard [2] that syntax is the glue that holds phonology and semantics together, let us focus on the syntactic properties of signs. Syntactic categories are linguistic objects that serve as values of the SYNTAX attribute, and the attributes appropriate for describing categories are called syntactic features. For example, all the syntactic categories from Figure 1 are embedded in the attribute SYN|LOC; there we find syntactic features like HEAD and SUBCAT.
We distinguish two types of (syntactic) features: local and binding. Local features in general specify inherent syntactic properties of a sign (e.g. inflection, case, lexicality); binding features, on the other hand, provide non-local information about dependent elements within a sign (e.g. relative pronouns, interrogative expressions). The latter are non-local in the sense that the kinds of syntactic dependencies involved may extend over arbitrarily long distances.
Among the local features we distinguish between the HEAD, SUBCAT and LEX features.
HEAD feature specifies syntactic properties that a lexical sign shares with its phrasal signs; it determines the head constituent of a phrase
SUBCAT feature gives information about the valence of a sign, i.e. the number and kind of phrasal signs that the sign subcategorizes for
LEX feature a binary feature which distinguishes between lexical and phrasal signs
Among the binding features we distinguish between the SLASH, REL and QUE features. The SLASH feature provides information about gaps within a sign which have not yet been bound to an appropriate filler; the REL and QUE features give information about unbound relative and interrogative (question) elements within the sign.
At the beginning we proposed that every sign has a head constituent which is the most important part of the phrase. We now make clear how this notion is incorporated into HPSG. The answer is the Head Feature Principle (HFP).
[ DTRS headed-structure [ ] ]  ⇒  [ SYN|LOC|HEAD               [1]
                                    DTRS|HEAD-DTR|SYN|LOC|HEAD [1] ]

Figure 2: Head Feature Principle
The principle shown in Figure 2 belongs to the principles of universal grammar (it is shared among all natural languages). Simply put, the principle states the following: if a phrase has a head daughter, then they share the same head features. In other words, the head features of the lexical head are "propagated" up the derivation tree to all phrasal heads. In fact, there is no actual movement of information; the process is realized by structure sharing between the lexical and phrasal head.
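A minimal sketch of structure sharing, reusing the nested-dict encoding from above: the mother and the head daughter reference the very same head-feature object, so no information ever has to "move". The variable names are illustrative.

# Mother and head daughter hold the *same* object for their head
# features (tag [1] in the AVM notation), not two synchronized copies.
head_features = {"MAJ": "V", "VFORM": "FIN"}

phrase = {
    "SYN": {"LOC": {"HEAD": head_features}},                          # [1]
    "DTRS": {"HEAD-DTR": {"SYN": {"LOC": {"HEAD": head_features}}}},  # [1]
}

# Refining the shared value through one path is visible from the other.
phrase["DTRS"]["HEAD-DTR"]["SYN"]["LOC"]["HEAD"]["AUX"] = "-"
assert phrase["SYN"]["LOC"]["HEAD"]["AUX"] == "-"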
2.3 SOME NOTES ABOUT SEMANTICS
If we want to work with the semantic meaning of utterances, we first need to specify a way of classifying things in the world. One of the ways adopted in HPSG is called a scheme of individuation. This can be thought of as a system for breaking reality up into comprehensible parts. These parts are called:
individuals such as Alexander Meduna, planet Earth etc.
properties such as being a student of BUT
relations such as loving, hating or giving
To work with semantic notions in HPSG we add a new SEM category to signs. We demonstrate the use of semantics in HPSG on the example in Figure 3.
Let us analyze the sign in Figure 3. The sign is phrasal because its PHONOLOGY value, Kim left, is not in a dictionary. We are already familiar with the SYNTAX features; just to recapitulate: the head constituent of the phrase is a non-auxiliary, non-inverted finite verb (MAJ V). The SUBCAT list is empty, which means that the sign is saturated (this will be discussed in the following section). What is new for us is the SEM category. We find here the new feature structures CONT and IND. The former represents the semantic content of the sign. The value of the SEM|CONT attribute is a complex feature structure which denotes that there is a relation with positive polarity called "left" with a participant denoted by [1]. This participant of the "left" relation is stored in the SEM|IND attribute and is described by a positive "naming" relation with the name Kim.
[ PHON    Kim left
  SYN|LOC [ HEAD   [ MAJ   V
                     VFORM FIN
                     AUX   -
                     INV   - ]
            SUBCAT ⟨ ⟩ ]
  SEM     [ CONT [ RELN   LEFT
                   LEAVER [1]
                   POL    ONE ]
            IND  [ VAR  [1]
                   REST [ RELN  NAMING
                          NAME  Kim
                          NAMED [1]
                          POL   ONE ] ] ] ]

Figure 3: Example of semantic features
The SEM|IND (i.e. indices) attribute always stores a list of all possible participants of the described relation. So much for semantics; of course, many unresolved problems remain (e.g. if we take determiners and qualifiers into account), and the interested reader can consult Pollard [2].
2.4 SUBCATEGORIZATION
In my opinion, subcategorization is one of the most important concepts in HPSG. As we already know, it is a local feature. Subcategorization, sometimes also called valence, of a lexical or phrasal sign is a specification of the number and kind of other signs that the sign in question characteristically combines with in order to become complete. It describes dependencies that hold between a lexical head and its complements. The attribute SUBCAT takes a list of (partially specified) signs as its value; the position of an element on the list corresponds to the obliqueness of the complement sign which it describes, with the rightmost element corresponding to the least oblique dependent sign (in the case of a verb, the subject). The most common use of the SUBCAT feature is in connection with verbs, which usually subcategorize, besides the subject, for a direct and sometimes an indirect object. See the example of the SUBCAT feature in Figure 4.
[ PHON    forced
  SYN|LOC [ HEAD   [1] [ MAJ   V
                         VFORM FIN ]
            SUBCAT ⟨ VP[INF], NP, NP ⟩ ] ]

Figure 4: Example of the SUBCAT feature of the verb force in the past tense
We can nd 3 complements in SUBCAT list of verb force.The right most is the least
oblique element - subject,followed by direct object and nally the 3rd element represents
verb phrase.Consult the example below:
[He]
NP−subject
forced [me]
NP−direct−object
[to write the report]
VP[INF]
.
However, verbs are not alone in bearing the subcategorization feature. SUBCAT can successfully treat a range of closely related phenomena known as case assignment, government (e.g. of particular prepositions), role assignment and verb agreement.
In accordance with the SUBCAT feature we distinguish two types of signs (see the sketch after this list):
saturated signs their SUBCAT list is empty; there are no other complements to subcategorize for, and the PHON value of the sign is complete and gives the "full meaning" (e.g. Kim left)
unsaturated signs their SUBCAT list is not empty; there are still some complements to subcategorize for, and the PHON value does not make sense on its own (e.g. forced me to write the report); some information is missing
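In the same encoding as before, the distinction is simply an empty versus non-empty SUBCAT list; the SUBCAT entries are abbreviated here to plain labels for brevity, where a real grammar would use partial signs.

def is_saturated(sign):
    """A sign is saturated iff its SUBCAT list is empty."""
    return sign["SYN"]["LOC"]["SUBCAT"] == []

forced = {"PHON": "forced",
          "SYN": {"LOC": {"HEAD": {"MAJ": "V", "VFORM": "FIN"},
                          "SUBCAT": ["VP[INF]", "NP", "NP"]}}}
kim_left = {"PHON": "Kim left",
            "SYN": {"LOC": {"HEAD": {"MAJ": "V"}, "SUBCAT": []}}}

assert not is_saturated(forced)   # still needs its complements
assert is_saturated(kim_left)     # a complete, saturated phrase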
We also need a way to transmit subcategorization information within a sign. This is governed by a principle of universal grammar (again shared across all natural languages) called the Subcategorization Principle (Figure 5).
[ DTRS headed-structure [ ] ]  ⇒  [ SYN|LOC|SUBCAT [2]
                                    DTRS [ HEAD-DTR|SYN|LOC|SUBCAT append([1], [2])
                                           COMP-DTRS [1] ] ]

Figure 5: Subcategorization Principle
The meaning of the equation in Figure 5 is as follows: in any headed structure, the SUBCAT value is the list obtained by removing from the SUBCAT value of the head those specifications that were satisfied by one of the complement daughters. In addition, the structure sharing entails that the information from each complement daughter is actually unified with the corresponding subcategorization specification on the head.
2.5 GRAMMAR RULES
Each language presents a finite set of lexical signs (a vocabulary) and a finite set of grammar rules. A grammar rule is just a very partially specified phrasal sign which constitutes one of the options offered by the language in question for making more concrete signs (bearing more information) from less concrete ones.
If we want to analyze a particular language, the only work within the HPSG framework is to specify the language's grammar rules. All the theory above, including the universal grammar principles, remains the same, which is an indisputable advantage. However, figuring out these rules is not a trivial task because, according to the principal idea of HPSG, the rules have to be as general as possible to account for the largest class of linguistic phenomena. Let us take an example from Pollard [2] of one grammar rule for English (Figure 6).
[ SYN|LOC|SUBCAT ⟨ ⟩
  DTRS [ HEAD-DTR|SYN|LOC|LEX -
         COMP-DTRS ⟨ [ ] ⟩ ] ]

Figure 6: One of the grammar rules for English
The content of the rule in Figure 6 is that one of the possibilities for a phrasal sign in English is to be a saturated sign (i.e. with an empty SUBCAT list) which has as constituents a single complement daughter (⟨[ ]⟩ denotes a list of length 1) and a head daughter, which in turn is constrained to be a phrasal (LEX −) rather than a lexical sign. This general rule corresponds to the rules standardly expressed in the forms "S -> NP VP", "NP -> DET NOM" and "NP -> NP[GEN] NOM", and it is also responsible for small predicative clauses. The symbols from the previous sentence are well known from the classical derivation-based approach and are explained in Table 1.

Abbreviation   Meaning
S              starting symbol
NP             noun phrase
NP[GEN]        noun phrase in genitive
VP             verb phrase
NOM            noun in nominative
DET            determiner

Table 1: Abbreviations from the derivation-based approach
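As a last illustration (again our own encoding, not taken from the paper), the rule of Figure 6 is itself just a very partial phrasal sign; a candidate phrase would be admitted by the grammar if it is compatible with, i.e. unifies with, such a rule.

# Figure 6 as a partial phrasal sign: saturated mother, exactly one
# (otherwise unconstrained) complement daughter, phrasal head daughter.
rule = {
    "SYN": {"LOC": {"SUBCAT": []}},                  # saturated mother
    "DTRS": {
        "HEAD-DTR": {"SYN": {"LOC": {"LEX": "-"}}},  # head must be phrasal
        "COMP-DTRS": [{}],                           # a list of length 1
    },
}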
3 CONCLUSIONS
In this paper we presented one of the approaches to natural language processing. HPSG is an information-based approach which meets most linguistic requirements. We are able to analyze many aspects of language, including the subject-predicate relation, inflectional verbs (employing the SUBCAT feature), interrogatives (with the help of binding features) and many more. HPSG employs a formal model of linguistic information called the sign. With the help of signs we can describe, in a relatively simple way, many syntactic and semantic properties of natural language. We can strictly distinguish between syntactic and semantic properties that are valid across all languages and those that apply only to a particular language. This property makes the theory more versatile, because a "common base" is used in the analysis of languages.
On the other hand, the versatility and generality of the principles and rules presuppose a complex lexicon. However, ways of simplifying the lexicon structure by factoring common features out of lexicon entries have been explored (not covered in this paper).
In my future work I would like to focus on implementing the HPSG theory in the Linux environment. There are many practical applications of the theory, from grammar checkers to complex systems for translation from one natural language to another.
ACKNOWLEDGEMENTS
I would like to thank my supervisor Alexander Meduna, without whose enthusiasm and professional advice the writing of this paper would have been a far longer and far less pleasant task.
REFERENCES
[1] Meduna, A.: Automata and Languages: Theory and Applications, London, Springer 2000, ISBN 1-85233-074-0
[2] Pollard, C. and Sag, I. A.: Information-Based Syntax and Semantics, Volume 1: Fundamentals, Stanford, CSLI 1987, ISBN 0-937073-23-7
[3] Oetiker, T.: The Not So Short Introduction to LaTeX, part of the electronic documentation of the LaTeX system