PART OF SPEECH TAGGING




PART OF SPEECH TAGGING


A Term Paper

Submitted To CENG463
Introduction To Natural Language Processing Course
Of The Department Of Computer Engineering
Of Middle East Technical University

by

Aslı Gülen   1128875
Esin Saka    1129121

December, 2001


Abstract

This paper presents general information about part-of-speech tagging and the concept of morphological disambiguation. After general definitions of the topics, a more detailed explanation is given of rule-based (constraint-based) part-of-speech tagging and morphological disambiguation. At the end, some rule-based part-of-speech tagging studies on the Turkish language are presented. In addition, a CD is attached to the paper; it contains related documentation on part-of-speech tagging and the concept of morphological disambiguation, such as our reports, useful papers, and example code.



TABLE OF CONTENTS



- Introduction
- Text Tagging Examples
- Approaches To Tagging And Morphological Disambiguation
  o Rule-Based (Constraint-Based) Approaches
    - Historical Review
  o Statistical (Stochastic) Approaches
    - Historical Review
- Comparison Of Rule-Based Part Of Speech Tagging With Statistical Part Of Speech Tagging
- POS Tagging With Brill's Algorithm
- Abstract Tagging And Morphological Disambiguation Of Turkish Text
- Tagging And Solving Morphological Disambiguation Of Turkish Text
  o Historical Overview
  o Methodology
    - The Preprocessor
    - Constraint Rules
    - Evaluation
  o Conclusion
- Acknowledgments
- Bibliography



INTRODUCTION

Natural Language Processing is a research discipline related to artificial intelligence, linguistics, philosophy, and psychology. The aim of this discipline is building systems capable of understanding and interpreting the computational mechanisms of natural languages. Research in natural language processing has been motivated by two main aims:

- To lead to a better understanding of the structure and functions of human language
- To support the construction of natural language interfaces and thus to facilitate communication between humans and computers

There are mainly four kinds of knowledge used in understanding natural language:

- Morphological knowledge: description of the form of the words
- Syntactic knowledge: description of the ways in which words must be ordered to make structurally acceptable sentences
- Semantic knowledge: description of the ways in which words are related to concepts
- Pragmatic knowledge: description of the ways in which we see the world

Assigning a category to a given word is tagging. The purpose of part-of-speech tagging is to assign part-of-speech tags to words, reflecting their syntactic category. A part-of-speech (POS) tagger is a system that uses various sources of information to assign a possibly unique POS to each word. Automatic text tagging is an important step in discovering the linguistic structure of large text corpora. It is a major component in higher-level analysis of text corpora. Its output can also be used in many natural language processing applications, such as speech synthesis, speech recognition, spelling correction, query answering, machine translation, searching large text databases, and information extraction.

In this term paper, part-of-speech tagging and morphological ambiguity, especially for Turkish text, are researched. Turkish does not have a finite set of tags; for this reason, the term "morphological disambiguation" can be used in place of "part-of-speech tagging". Disambiguated texts can be beneficial for applications like:

- Corpus analysis: for example, gathering statistical information about a language by using a corpus.
- Syntactic parsing: disambiguating a text decreases the ambiguity of its sentences.
- Spelling correction: for instance, context information may be used to select the correct spelling.
- Speech synthesis: for example, tagging of the text is useful for finding the true pronunciation of a word.

Let us see the place of morphological disambiguation in an abstract context.

Figure-1: The place of morphological disambiguation in an abstract context. Raw Turkish text passes through morphological analysis and then morphological disambiguation, producing disambiguated Turkish text whose output feeds applications such as tagged corpora, parsing, and text-to-speech.

TEXT TAGGING EXAMPLES

The way you tag the text depends on the tags you choose. For example, if the only tags are simple tags like verb, noun, adjective, etc., then

Ayla   sıcak       çikolatayı   sever.
noun   adjective   noun         verb

Ayla   loves   hot         chocolate.
noun   verb    adjective   noun

is a simple example. More complex sentences and more complex classifications can also be handled, and these are more common and useful.



Consider the following example:

İşten döner dönmez evimizin yakınında bulunan derin gölde yüzerek gevşemek en büyük zevkimdi.

(Relaxing by swimming in the deep lake near our house, as soon as I return from work, was my greatest pleasure.)

First, let us give the basic idea behind tagging the sentence:

The construct "döner dönmez", formed by two tensed verbs, is actually a temporal adverb meaning "... as soon as ... return(s)"; hence these two lexical items can be coalesced into a single lexical item and tagged as a temporal adverb.


The second person singular possessive (2SG-POSS) interpretation of "yakınında" is not possible, since this word forms a simple compound noun phrase with the previous lexical item, and the third person singular possessive functions as the compound marker.

The word "derin" (deep) is the modifier of the simple compound noun "derin göl" (deep lake); hence the second choice can safely be selected. The verbal root in the third interpretation is very unlikely to be used in text, let alone in the second person imperative form. The fourth and the fifth interpretations are not plausible, as adjectives derived from aorist verbal forms almost never take any further inflectional suffixes. The first interpretation (meaning "your skin") may be a possible choice but can be discarded in the middle of a longer compound noun phrase.


The word "en" preceding an adjective indicates a superlative construction, and hence the noun reading can be discarded.

However, there exists a semantic ambiguity for the lexical item "bulunan". It has two adjectival readings, meaning "something found" and "existing" respectively. One cannot resolve the ambiguity between these two readings without any idea about the discourse. Contextual information is not sufficient, and the ambiguity should be left pending to higher-level analysis.


So, after the morphological analysis, Figure-2 is computed. In Figure-2, upper-case letters in the morphological break-downs represent specific classes of vowels: e.g., A stands for the low vowels e and a, H stands for the high vowels ı, i, u and ü, and D stands for d or t. Although the final category is adjective, the use of possessive (and/or case, number) suffixes indicates nominal usage, as any adjective in Turkish can be used as a noun. The correct choices of tags are marked with +.


Word / parse       Gloss                              POS
(correct choices are marked with +)

işten
  1. iş+Dan        N(iş)+ABL                          N+

döner
  1. döner         N(döner)                           N
  2. dön+Ar        V(dön)+AOR+3SG                     V+
  3. dön+Ar        V(dön)+VtoADJ(er)                  ADJ

dönmez
  1. dön+mA+z      V(dön)+NEG+AOR+3SG                 V+
  2. dön+mAz       V(dön)+VtoADJ(mez)                 ADJ

evimizin
  1. ev+HmHz+nHn   N(ev)+1PL-POSS+GEN                 N+

yakınında
  1. yakın+sH+nDA  ADJ(yakın)+3SG-POSS+LOC            N2+
  2. yakın+Hn+DA   ADJ(yakın)+2SG-POSS+LOC            N

bulunan
  1. bul+Hn+yAn    V(bul)+PASS+VtoADJ(yan)            ADJ
  2. bulun+yAn     V(bulun)+VtoADJ(yan)               ADJ+

derin
  1. deri+Hn       N(deri)+2SG-POSS                   N
  2. derin         ADJ(derin)                         ADJ+
  3. der+yHn       V(der)+IMP+2PL                     V
  4. de+Ar+Hn      V(de)+VtoADJ(er)+2SG-POSS          N
  5. de+Ar+nHn     V(de)+VtoADJ(er)+GEN               N

gölde
  1. göl+DA        N(göl)+LOC                         N+

yüzerek
  1. yüz+yArAk     V(yüz)+VtoADV(yerek)               ADV+

gevşemek
  1. gevşe+mAk     V(gevşe)+INF                       V+

en
  1. en            N(en)                              N
  2. en            ADV(en)                            ADV+

büyük
  1. büyük         ADJ(büyük)                         ADJ+

zevkimdi
  1. zevk+Hm+yDH   N(zevk)+1SG-POSS+NtoV()+PAST+3SG   V+

Figure-2: Morphological analyzer output of the example sentence.



If we gather the correct choices of tags (the ones with the + sign), we get Figure-3:




Morphological break-down   Gloss                              POS
iş+Dan          N(iş)+ABL                           N
dön+Ar          V(dön)+AOR+3SG                      V
dön+mA+z        V(dön)+NEG+AOR+3SG                  V
ev+HmHz+nHn     N(ev)+1PL-POSS+GEN                  N
yakın+sH+nDA    ADJ(yakın)+3SG-POSS+LOC             N2
bulun+yAn       V(bulun)+VtoADJ(yan)                ADJ
derin           ADJ(derin)                          ADJ
göl+DA          N(göl)+LOC                          N
yüz+yArAk       V(yüz)+VtoADV(yerek)                ADV
gevşe+mAk       V(gevşe)+INF                        V
en              ADV(en)                             ADV
büyük           ADJ(büyük)                          ADJ
zevk+Hm+yDH     N(zevk)+1SG-POSS+NtoV()+PAST+3SG    V

Figure-3: Tagged form of the second example sentence.

However, there are a number of choices for the tags of the lexical items in the sentence, as can be seen in Figure-2. Probably all except the one above give rise to ungrammatical sentence structures. The number of such undesired solutions shows the level of ambiguity of the tagging process.

APPROACHES TO TAGGING AND
MORPHOLOGICAL DISAMBIGUATION

There are many approaches to automated part-of-speech tagging. Different authors give different methodologies for classifying tagging and morphological disambiguation approaches.

First, let us give a brief introduction to the types of tagging schemes, according to whether or not they use pre-tagged corpora. The following diagram depicts the various approaches to automatic POS tagging. In reality, the picture is much more complicated, since many tagging systems use aspects of some or all of these approaches.

Figure-4: A classification of part-of-speech tagging methodologies.



According to this classification scheme, one of the main distinctions that can be made among POS taggers is the degree of automation of the training and tagging process. The terms commonly applied to this distinction are supervised vs. unsupervised. Supervised taggers typically rely on pre-tagged corpora to serve as the basis for creating any tools to be used throughout the tagging process, for example the tagger dictionary, the word/tag frequencies, the tag sequence probabilities and/or the rule set. Unsupervised models, on the other hand, are those which do not require a pre-tagged corpus but instead use sophisticated computational methods to automatically induce word groupings (i.e. tag sets) and, based on those automatic groupings, to either calculate the probabilistic information needed by stochastic taggers or to induce the context rules needed by rule-based systems. Each of these approaches has pros and cons.

The primary argument for using a fully automated approach to POS tagging is that it is extremely portable. It is known that automatic POS taggers tend to perform best when both trained and tested on the same genre of text. The unfortunate reality is that pre-tagged corpora are not readily available for the many languages and genres which one might wish to tag. Full automation of the tagging process addresses the need to accurately tag previously untagged genres and languages, in light of the fact that hand tagging of training data is a costly and time-consuming process. There are, however, drawbacks to fully automating the POS tagging process. The word clusterings which tend to result from these methods are very coarse, i.e., one loses the fine distinctions found in the carefully designed tag sets used in the supervised methods.

The following table outlines the differences between these two approaches.

SUPERVISED                                         | UNSUPERVISED
Selection of tagset / tagged corpus                | Induction of tagset using untagged training data
Creation of dictionaries using tagged corpus       | Induction of dictionary using training data
Calculation of disambiguation tools, may include:  | Induction of disambiguation tools, may include:
  - word frequencies                               |   - word frequencies
  - affix frequencies                              |   - affix frequencies
  - tag sequence probabilities                     |   - tag sequence probabilities
  - "formulaic" expressions                        |
Tagging of test data using dictionary information  | Tagging of test data using induced dictionaries
Disambiguation using statistical, hybrid or        | Disambiguation using statistical, hybrid or
  rule-based approaches                            |   rule-based approaches
Calculation of tagger accuracy                     | Calculation of tagger accuracy



As can be seen from Figure-4, there are two major approaches used for POS taggers and morphological disambiguators:

- Rule-based (constraint-based) approaches
- Statistical (stochastic) approaches

In constraint-based approaches, a large number of hand-crafted linguistic constraints are used. By means of these constraints, impossible tags and impossible morphological analyses for a given word in a text are eliminated. In stochastic approaches, on the other hand, a large corpus is used to obtain statistical information. A part of the corpus is used in a training phase to build a statistical model, which is then used to tag untagged texts and to perform morphological analysis. The remainder of the corpus is used to test the statistical model.

Early approaches to part-of-speech tagging were rule-based ones. After the 1980s, statistical methods became more popular. In the 1990s, Brill introduced a method to induce the constraints from tagged corpora, which is called transformation-based error-driven learning. Nowadays, these approaches are used together to get better results.

RULE-BASED (CONSTRAINT-BASED) APPROACHES

Typical rule-based approaches use contextual information to assign tags to unknown or ambiguous words. These rules are often known as context frame rules. As an example, a context frame rule might say something like: "if an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective":

det - X - n  =>  X/adj
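The context frame rule above can be read as a purely mechanical check on a word's unambiguous neighbours. Below is a minimal sketch of that idea in Python; the rule triples, tag names, and the function apply_context_rules are illustrative choices, not part of any tagger described in this paper.

# Minimal sketch of context frame rules (illustrative, not a real tagger's implementation).
# A rule fires when an ambiguous word's neighbours carry the required unambiguous tags.

def apply_context_rules(tokens, tags, rules):
    """tokens: list of words; tags: list of tag sets (ambiguous words have >1 tag);
    rules: (left_tag, right_tag, chosen_tag) triples such as ('DET', 'N', 'ADJ')."""
    for i, tagset in enumerate(tags):
        if len(tagset) < 2:
            continue                          # already unambiguous, nothing to do
        left = next(iter(tags[i - 1])) if i > 0 and len(tags[i - 1]) == 1 else None
        right = next(iter(tags[i + 1])) if i + 1 < len(tags) and len(tags[i + 1]) == 1 else None
        for left_tag, right_tag, chosen in rules:
            if left == left_tag and right == right_tag and chosen in tagset:
                tags[i] = {chosen}            # e.g. det - X - n  =>  X/adj
                break
    return tags

# "if an ambiguous word X is preceded by a determiner and followed by a noun, tag it as an adjective"
rules = [("DET", "N", "ADJ")]
tags = apply_context_rules(["the", "light", "bulb"],
                           [{"DET"}, {"ADJ", "N", "V"}, {"N"}], rules)
print(tags)   # [{'DET'}, {'ADJ'}, {'N'}]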

In addition to contextual information, many taggers use morphological information to aid in the disambiguation process. One such rule might be: "if an ambiguous/unknown word ends in -ing and is preceded by a verb, label it a verb" (depending on your theory of grammar, of course).

Some systems go beyond using contextual and morphological information by including rules pertaining to such factors as capitalization and punctuation. Information of this type is of greater or lesser value depending on the language being tagged. In German, for example, information about capitalization proves extremely useful in the tagging of unknown nouns.

Rule-based taggers most commonly require supervised training; but very recently there has been a great deal of interest in automatic induction of rules. One approach to automatic rule induction is to run an untagged text through a tagger and see how it performs. A human then goes through the output of this first phase and corrects any erroneously tagged words. The properly tagged text is then submitted to the tagger, which learns correction rules by comparing the two sets of data. Several iterations of this process are sometimes necessary.



Historical Review

The earliest approach is due to Klein and Simmons. Their primary goal was to avoid the labor of constructing a very large dictionary. Their algorithm uses a set of 30 POS categories. It first seeks each word in dictionaries, then checks for suffixes and special characters as clues. Finally, the context frame tests are applied. These work on spans bounded by unambiguous words. However, Klein and Simmons impose an explicit limit of three ambiguous words in a row. For each such span of ambiguous words, the pair of unambiguous categories bounding it is mapped into a list. The list includes all known sequences of tags occurring between the particular bounding tags; all such sequences of the correct length become candidates. The program then matches the candidate sequences against the ambiguities remaining from earlier steps of the algorithm. When only one sequence is possible, disambiguation is successful. This algorithm correctly and unambiguously tags about 90% of the words in several pages of the Golden Book Encyclopedia.

The next important tagger, TAGGIT, was developed by Greene and Rubin in 1971. The tag set used is very similar, but somewhat larger, at about 86 tags. The dictionary used is derived from the tagged Brown Corpus, rather than from the untagged version. TAGGIT divides the task of category assignment into initial (potentially ambiguous) tagging and disambiguation. Tagging is carried out as follows: first, the program consults an exception dictionary of about 3,000 words. Among other items, this contains all known closed-class words. It then handles various special cases, such as words with initial "$", contractions, special symbols, and capitalized words. A word's ending is then checked against a suffix list of about 450 strings, which was derived from the Brown Corpus. If TAGGIT has not assigned some tag(s) after these several steps, the word is tagged as a noun, a verb and an adjective, in order that the disambiguation routine may have something to work with. This tagger correctly tags approximately 77% of the million words in the Brown Corpus (the rest is completed by human post-editors).

A very successful constraint-based approach for morphological disambiguation, known as Constraint Grammar, was developed in Finland, from 1989 to 1992, by four researchers: Fred Karlsson, Arto Anttila, Juha Heikkilä and Atro Voutilainen. In this framework, the problem of parsing was broken into seven sub-problems or modules; four of them are related to morphological disambiguation, and the rest are used for parsing the running text. One of the most important steps of Constraint Grammar was context-dependent morphological disambiguation, where ambiguity is resolved using some context-dependent constraints. For this purpose they wrote a grammar which contains a set of constraints based on descriptive grammars and studies of various corpora. Each constraint is a quadruple consisting of domain, operator, target and context condition(s).

Among these rule-based part-of-speech taggers, the one built by Brill has the advantage of learning tagging rules automatically. As it is the method that we are mainly interested in, detailed information about this approach will be given in a separate section later in this paper.



STATISTICAL (STOCHASTIC) APPROACHES

The term "stochastic tagger" can refer to any number of different approaches to the problem of POS tagging. Any model which somehow incorporates frequency or probability, i.e. statistics, may be properly labeled stochastic.

The simplest stochastic taggers disambiguate words based solely on the probability that a word occurs with a particular tag. In other words, the tag encountered most frequently in the training set is the one assigned to an ambiguous instance of that word. The problem with this approach is that while it may yield a valid tag for a given word, it can also yield inadmissible sequences of tags.
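As a concrete illustration of this simplest scheme, the following sketch trains a most-frequent-tag lookup from a toy tagged corpus; all function names and the tiny data set are hypothetical.

# Sketch of the simplest stochastic tagger: pick each word's most frequent tag
# in the training data, ignoring context entirely.
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    counts = defaultdict(Counter)               # word -> Counter of tags
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_unigram(words, model, default="N"):
    return [(w, model.get(w, default)) for w in words]

model = train_unigram([[("the", "DET"), ("can", "N"), ("rusted", "V")],
                       [("I", "PRON"), ("can", "MD"), ("swim", "V")],
                       [("a", "DET"), ("can", "N")]])
print(tag_unigram(["I", "can", "swim"], model))
# 'can' is tagged N (its most frequent tag) even though MD is correct here,
# which is exactly the kind of inadmissible sequence mentioned above.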

An alternative to the word frequency approach is to calculate the probability of a given sequence of tags occurring. This is sometimes referred to as the n-gram approach, referring to the fact that the best tag for a given word is determined by the probability that it occurs with the n previous tags. The most common algorithm for implementing an n-gram approach is known as the Viterbi Algorithm, a search algorithm which avoids the polynomial expansion of a breadth-first search by "trimming" the search tree at each level using the best N Maximum Likelihood Estimates (where N represents the number of tags of the following word). If n is one (the 1-gram approach), it is just the word frequency. If n is two (the 2-gram approach), it has a special name: the bi-gram approach. For example, if you consider the frequency of "the smile", where "the" is a determiner and "smile" is a common noun, this is a bi-gram approach. Similarly, if you consider the frequency of n ordered words, it is an n-gram approach.

The next level of complexity that can be introduced into a stochastic tagger is one that combines the previous two approaches, using both tag sequence probabilities and word frequency measurements.
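A minimal sketch of such a combined tagger is given below: emission probabilities play the role of word/tag frequencies, transition probabilities encode the tag-sequence statistics, and the Viterbi algorithm picks the best path. The toy probabilities, tag names, and function names are illustrative only, not estimates from any real corpus.

# Sketch of a bigram (HMM-style) tagger decoded with the Viterbi algorithm.
import math

def viterbi(words, tags, trans, emit, start):
    """trans[(t1, t2)]: P(t2 | t1); emit[(t, w)]: P(w | t); start[t]: P(t at position 0)."""
    V = [{t: (math.log(start.get(t, 1e-12)) +
              math.log(emit.get((t, words[0]), 1e-12)), None) for t in tags}]
    for i in range(1, len(words)):
        V.append({})
        for t in tags:
            best_prev, best_score = None, float("-inf")
            for p in tags:
                score = (V[i - 1][p][0] +
                         math.log(trans.get((p, t), 1e-12)) +
                         math.log(emit.get((t, words[i]), 1e-12)))
                if score > best_score:
                    best_prev, best_score = p, score
            V[i][t] = (best_score, best_prev)
    # backtrack from the best final tag
    last = max(V[-1], key=lambda t: V[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(V[i][path[-1]][1])
    return list(reversed(path))

tags = ["DET", "N", "MD", "V"]
start = {"DET": 0.6, "N": 0.2, "MD": 0.1, "V": 0.1}
trans = {("DET", "N"): 0.8, ("N", "MD"): 0.4, ("N", "V"): 0.3, ("MD", "V"): 0.9}
emit = {("DET", "the"): 0.7, ("N", "can"): 0.3, ("MD", "can"): 0.5,
        ("MD", "will"): 0.6, ("N", "will"): 0.1, ("V", "rust"): 0.4}
print(viterbi(["the", "can", "will", "rust"], tags, trans, emit, start))
# ['DET', 'N', 'MD', 'V'] -- the tag-sequence probabilities rescue 'can' and 'will'.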

Historical Review

In 1983, Marshall described the Lancaster-Oslo-Bergen (LOB) Corpus tagging algorithm, later named CLAWS. It is similar to the TAGGIT program. The tag set used is very similar, but somewhat larger, at about 130 tags. The dictionary used is derived from the tagged Brown Corpus, rather than from the untagged version. The main innovation of CLAWS is the use of a matrix of collocation probabilities, indicating the relative likelihood of co-occurrence of all ordered pairs of tags. This matrix can be mechanically derived from any pre-tagged corpus. CLAWS used a large portion of the Brown Corpus, with 200,000 words. CLAWS has been applied to the entire LOB Corpus with an accuracy of between 96% and 97%.

There are several advantages of this general approach over rule-based ones. First, spans of unlimited length can be handled. Second, a precise mathematical definition is possible for the fundamental idea of CLAWS. However, CLAWS is time- and storage-inefficient in the extreme.



Later, in 1988, DeRose attempted to solve the inefficiency problem of CLAWS and proposed a new algorithm called VOLSUNGA. The algorithm depends on a similar empirically-derived transitional probability matrix to that of CLAWS, and has a similar definition of optimal path. The tag set contains 97 tags. The optimal path is defined to be the one whose component collocations multiply out to the highest probability. The more complex definition applied by CLAWS, using the sum of all the paths at each node of the network, is not used. By this change, VOLSUNGA overcomes the complexity problem. Application of the algorithm to the Brown Corpus resulted in 96% accuracy.

A form of Markov model has also been widely used in statistical approaches. In this model it is assumed that a word depends probabilistically on just its part-of-speech category, which in turn depends solely on the categories of the preceding two words. Two types of training have been used with this model. The first makes use of a tagged training corpus. The second method of training does not require a tagged training corpus; in this situation the Baum-Welch algorithm can be used. Under this regime, the model is called a Hidden Markov Model (HMM), as state transitions (i.e., part-of-speech categories) are assumed to be unobservable. Hidden Markov Model taggers and visible Markov Model taggers may be implemented using the Viterbi algorithm, and are among the most efficient of the tagging methods discussed here. Since this method is not the main interest of this paper, the algorithms mentioned above will not be discussed here. For detailed information, see the progress report and related papers, which are on the CD attached at the end of the term paper.

COMPARISON OF RULE-BASED PART OF SPEECH TAGGING WITH STATISTICAL PART OF SPEECH TAGGING

Until the simple rule-based part-of-speech tagger by Brill (1992), statistical techniques were more successful than rule-based methods in the area of automatic part-of-speech tagging, but their storage, improvement and adaptation costs were higher. In 1992, Brill described a rule-based tagger whose success and efficiency were good enough to be taken into consideration in the tagging area. It had many advantages over statistical taggers. Some may be listed as [1]:

- A vast reduction in the stored information required
- The perspicuity of a small set of meaningful rules, as opposed to the large statistical tables needed for stochastic taggers
- Ease of finding and implementing improvements to the tagger
- Better portability from one tag set or corpus genre to another

After Brill's tagger, rule-based (constraint-based) approaches improved. Nowadays, approaches combining both ideas are commonly used.



POS TAGGING WITH BRILL’S ALGORITHM

This tagger is an important step for natural language processing. Nearly all of the rule-based taggers after 1992 reference Brill's tagger.

The tagger is a supervised model which starts with a small structurally annotated corpus and a larger unannotated corpus, and uses these corpora to learn an ordered list of transformations that can be used to accurately annotate fresh text. It uses the Brown Corpus for testing. It works by automatically recognizing and remedying its weaknesses, thereby incrementally improving its performance.

In 1992, Brill applied transformation-based error-driven learning to part-of-speech tagging, and obtained performance comparable to that of stochastic taggers. In this work, the tagger is trained with the following process. First, text is tagged with an initial annotator, where each word is assigned the most likely tag, estimated by examining a large corpus, without regard to context. The initial tagger has two non-textual procedures to improve the performance:

- Words that were not in the training corpus and are capitalized tend to be proper nouns, and the tagger attempts to fix tagging mistakes accordingly.
- Words that are not in the training corpus may be tagged by using their last three letters.

Once text is passed through the annotator, it is then compared to the correct version, i.e., its manually tagged counterpart, and transformations that can be applied to the output of the initial state annotator to make it better resemble the truth can then be learned.

During this process, one must specify the following: (1) the initial state annotator, (2) the space of transformations the learner is allowed to examine, and (3) the scoring function for comparing the corpus to the truth.

In the first version, there were transformation templates of the following example forms:

Change tag a to tag b when:
1. The preceding (following) word is tagged z.
2. The preceding (following) word is tagged z and the word two before (after) is tagged w.

where a, b, z and w are variables over the set of parts-of-speech. To learn a transformation, the learner applies every possible transformation, counts the number of tagging errors after that transformation is applied, and chooses the transformation resulting in the greatest error reduction. Learning stops when no transformations can be found whose application reduces errors beyond some pre-specified threshold. Once an ordered list of transformations is learned, new text can be tagged by first applying the initial annotator to it and then applying each of the learned transformations, in order [1].
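The following sketch illustrates this learn-and-apply loop for one narrow template ("change tag a to b when the preceding word is tagged z"); it is a simplified illustration of the idea, not Brill's implementation, and the data and helper names are hypothetical.

# Minimal sketch of transformation-based error-driven learning: try every candidate
# "change a to b when the previous tag is z" rule and keep the one that removes
# the most errors, until no rule helps beyond a threshold.
def learn_transformations(words, gold, current, max_rules=10, threshold=1):
    rules = []
    for _ in range(max_rules):
        best_rule, best_gain = None, 0
        tagset = set(gold) | set(current)
        for a in tagset:
            for b in tagset:
                for z in tagset:
                    gain = 0
                    for i in range(1, len(words)):
                        if current[i] == a and current[i - 1] == z:
                            gain += (1 if gold[i] == b else 0) - (1 if gold[i] == a else 0)
                    if gain > best_gain:
                        best_rule, best_gain = (a, b, z), gain
        if best_rule is None or best_gain < threshold:
            break                                   # no rule reduces errors enough
        a, b, z = best_rule
        for i in range(1, len(words)):              # apply the winning rule
            if current[i] == a and current[i - 1] == z:
                current[i] = b
        rules.append(best_rule)
    return rules, current

words   = ["the", "can", "will", "rust"]
gold    = ["DET", "N",   "MD",   "V"]
initial = ["DET", "MD",  "MD",   "N"]               # most-likely-tag baseline output
print(learn_transformations(words, gold, initial))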

Later, in 1994, Brill extended this learning paradigm to capture relationships between words by adding contextual transformations that could make reference to the words as well as to the part-of-speech tags [4].

The next step is applying patch templates to the training corpus and determining rules according to the errors. Patch templates are of the form:

- If a word is tagged a and its context is in C, then change that tag to b, or
- If a word is tagged a and it has lexical property P, then change that tag to b, or
- If a word is tagged a and a word in region R has lexical property P, then change that tag to b.

Some examples of patch (contextual rule) templates can be listed as:

Change tag a to tag b when:
1. The preceding (following) word is w.
2. The current word is w and the preceding (following) word is x.
3. The current word is w and the preceding (following) word is tagged z.

where w and x are variables over all words in the training corpus, and z is a variable over all parts-of-speech.

In 1995, Brill improved this algorithm so that it no longer requires a manually annotated training corpus. Instead, all that is needed is the allowable part-of-speech tags for each token, and the initial state annotator tags each token in the corpus with a list of all allowable tags.

The main idea can best be explained with the following example. Given the sentence:

The can will be crushed.

using an unannotated corpus, it could be discovered that among the unambiguous tokens (i.e. those that have only one possible tag) that appear after "the" in the corpus, nouns are much more common than verbs or modals. From this, the following rule could be learned:

"Change the tag of a word from modal or noun or verb to noun if the previous word is the."

Unlike supervised learning, in this approach the main aim is not to change the tag of a token, but to reduce the ambiguity by choosing a tag for the words in a particular context. Another difference arises in calculating the scoring function: unambiguous words are used in the scoring of this approach. In each learning iteration, the learner searches for the transformation which maximizes this function. Learning stops when no positive scoring transformations can be found.
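The sketch below illustrates how unambiguous tokens can supply the evidence for such a rule: among the words appearing after a given context word, only those with a single allowable tag are counted. The lexicon, corpus, and function name are hypothetical simplifications of the idea.

# Sketch of the unsupervised scoring idea: use only UNAMBIGUOUS tokens that appear
# after a given context word to decide which tag dominates there.
from collections import Counter

def propose_choose_rule(corpus, lexicon, context_word):
    """corpus: list of words; lexicon: word -> set of allowable tags."""
    counts = Counter()
    for prev, word in zip(corpus, corpus[1:]):
        if prev == context_word and len(lexicon.get(word, set())) == 1:
            counts[next(iter(lexicon[word]))] += 1    # only unambiguous evidence
    if not counts:
        return None
    tag, _ = counts.most_common(1)[0]
    return f"choose {tag} for an ambiguous word after '{context_word}'"

lexicon = {"the": {"DET"}, "dog": {"N"}, "house": {"N"}, "run": {"N", "V"},
           "can": {"N", "MD", "V"}, "walked": {"V"}}
corpus = ["the", "dog", "walked", "the", "house", "the", "can"]
print(propose_choose_rule(corpus, lexicon, "the"))
# choose N for an ambiguous word after 'the'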

This tagger has remarkable performance. After training the tagger with a corpus of size 600K, it produces 219 rules and achieves 96.9% accuracy in the first scheme. Moreover, after the extension, the number of rules increases to 267 and accuracy increases to 97.2%.

Brill's rule-based method is an applicable method but is not as fast as statistical ones. It may repeat unnecessary operations. Deterministic POS tagging with finite state transducers decreased the complexity from RCn to n, where R is the number of contextual rules, C is the required tokens of context and n is the number of input words [7]. This method relies on two central notions: the notion of a finite state transducer and the notion of a subsequential transducer.

ABSTRACT TAGGING AND MORPHOLOGICAL DISAMBIGUATION OF TURKISH TEXT

Turkish is an agglutinative language with word structures formed by productive affixations of derivational and inflectional suffixes to root words. Extensive use of suffixes results in ambiguous lexical interpretations in many cases. Almost 80% of lexical items have more than one interpretation [9]. In this section, the sources of morphosyntactic ambiguity in Turkish are explored.



- Many words have ambiguous readings even though they have the same morphological break-down. These ambiguities are due to different POS of the roots. For example, the word "yana" has three different readings:

  yana            Gloss               POS      English
  1. yan+yA       V(yan)+OPT+3SG      V        let it burn
  2. yan+yA       N(yan)+3SG+DAT      N        to this side
  3. yana         POSTP(yana)         POSTP

  The first and the second readings have the same root and are derived with the same suffix, but since the root word "yan" has two different readings, one verbal and one nominal, the morphological analyzer produces ambiguous output for the same break-down. Moreover, "yana" has a third, postpositional reading without any affixation.


- In Turkish there are many root words which are a prefix of another root word. An example: of the two root words "uymak" and "uyumak", "uy" is a prefix of "uyu", and when the morphological analyzer is fed the word "uyuyor", it outputs the following:

  uyuyor          Gloss                   POS      English
  1. uy+Hyor      V(uy)+PR-CONT+3SG       V        it suits
  2. uyu+Hyor     V(uyu)+PR-CONT+3SG      V        s/he is sleeping



- Nominal lexical items with nominative, locative or genitive case have verbal/predicative interpretations. For example, the word "evde" is the locative case of the root word "ev", and the morphological analyzer produces the following output for it:

  evde            Gloss                             POS      English
  1. ev+DA        N(ev)+3SG+LOC                     N        at home
  2. ev+DA        N(ev)+3SG+LOC+NtoV()+PR-CONT      V        (smt.) is at home



- There are morphological structure ambiguities due to the interplay between morphemes and phonetic change rules. The following is the output of the morphological analyzer for the word "evin":

  evin            Gloss                       POS      English
  1. ev+Hn        N(ev)+3SG+2SG-POSS+NOM      N        your house
  2. ev+nHn       N(ev)+3SG+GEN               N        of the house

  Since the suffixes have to harmonize in certain aspects with the word they are affixed to, the consonant "n" is deleted in the surface realization of the second reading of "evin", causing it to have the same lexical form as the first reading.



- Within a word category, e.g., verbs, some of the roots have specific features which are not common to all. For example, certain reflexive verbs may also have passive readings, as in the following sentences:

  Çamaşırlar dün yıkandı. (The laundry was washed yesterday.)
  Ali dün yıkandı. (Ali had a bath yesterday.)

  The following is the morphological break-down of "yıkandı":

  yıkandı           Gloss                       POS      English
  1. yıka+Hn+DH     V(yıka)+PASS+PAST+3SG       V        got washed
  2. yıka+n+DH      V(yıka)+REFLEX+PAST+3SG     V        s/he had a bath

  From the same verbal root "yıka", two different break-downs are produced. The passive reading of "yıkandı" is used in the first sentence and the reflexive reading is used in the second sentence.



- Some lexicalized word formations can also be re-derived from the original root, and this is another source of ambiguity. The word "mutlu" has two parses with the same meaning but different morphological break-downs:

  mutlu           Gloss                           POS      English
  1. mut+lH       N(mut)+NtoADJ(li)+3SG+NOM       ADJ      happy
  2. mutlu        ADJ(mutlu)+3SG+NOM              ADJ      happy

  "mutlu" has a lexicalized adjectival reading where it is considered a root form, as seen in the second reading. However, the same surface form is also derived from the nominal root word "mut", meaning happiness, with the suffix +li, and this form also has the same meaning.



- Plural forms may display an additional ambiguity due to the drop of a second plural marker. Consider the example word "evleri":

  evleri            Gloss                   POS      English
  1. ev+lAr+sH      N(ev)+3PL+3SG-POSS      N        his/her houses
  2. ev+lArH        N(ev)+3SG+3PL-POSS      N        their house
  3. ev+lArH        N(ev)+3PL+3PL-POSS      N        their houses
  4. ev+lAr+yH      N(ev)+3PL+ACC           N        houses (accusative)

  In the first and the second readings there is only one level of plurality, where either the owner or the ownee is plural. However, the third reading contains a hidden suffix, where both of them are plural. Since it is not possible to detect which one is plural from the surface form, three ambiguous readings are generated.

Considering all these cases, it is apparent that higher-level analysis of Turkish prose text will suffer from this considerable amount of ambiguity. On the other hand, the available local context might be sufficient to resolve some of these ambiguities. For example, if we can trace the sentential positions of nominal forms in a given sentence, their predicative readings might be discarded; i.e., within a noun phrase it is obvious that they cannot be predicative.

TAGGING AND SOLVING MORPHOLOGICAL DISAMBIGUATION OF TURKISH TEXT

Historical Overview

Since most of the tagging studies in the world have been performed for English, and the structure of Turkish is different from that of English, it is not possible to apply the available methods directly to Turkish. So, it is necessary to carry out research specific to languages like Turkish and Finnish.

In Turkey, a group of scientists worked on Turkish morphological ambiguity and part-of-speech tagging for Turkish in the 1990s [2, 5, 6, 8, 9, 10]. Both stochastic and constraint-based approaches have been applied.


Since we are especially interested in rule-based approaches, no detailed information will be given for other approaches. However, some basic papers are presented on the CD attached at the end of the paper.

Methodology

The morphological disambiguation of Turkish text examined in this paper is based on constraints. The tokens on which the disambiguation will be performed are determined using a preprocessing module.

The Preprocessor

Early studies on automatic text tagging for Turkish had shown that some preprocessing of the raw text is necessary before analyzing the words with a morphological analyzer. This preprocessing module includes the following stages (a small sketch of chaining them appears after the list):

- Tokenization, in which the raw text is split into its tokens, which are not necessarily separated by blank characters or punctuation marks;
- Morphological Analyzer, which processes the tokens obtained from the tokenization module;
- Lexical and Non-lexical Collocation Recognizer, in which lexical and non-lexical collocations are recognized and packeted;
- Unknown Word Processor, in which the tokens that are still marked as unknown after the lexical and non-lexical collocation recognizer are parsed;
- Format Conversion, in which each parse of a token is converted into a hierarchical feature structure;
- Projection, in which each feature structure is projected onto a subset of its features to be used in the training.
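Assuming each stage above is available as a function, the preprocessor can be viewed as a simple pipeline. The sketch below only shows how the stages might be chained; every function name is a placeholder for the corresponding module, not an actual interface from the described system.

# Hypothetical sketch of chaining the preprocessing stages into one pipeline.
def preprocess(raw_text, tokenize, analyze, recognize_collocations,
               handle_unknown, to_feature_structure, project):
    tokens = tokenize(raw_text)                                   # tokenization
    parses = [analyze(t) for t in tokens]                         # morphological analysis
    parses = recognize_collocations(tokens, parses)               # collocation packeting
    parses = [handle_unknown(t, p) if not p else p                # unknown word processing
              for t, p in zip(tokens, parses)]
    features = [[to_feature_structure(p) for p in token_parses]   # format conversion
                for token_parses in parses]
    return [[project(f) for f in token_feats]                     # projection
            for token_feats in features]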



Figure-5: The structure of the preprocessor.



Constraint Rules

The system uses rules of the form:

  if LC and RC then choose PARSE

or

  if LC and RC then delete PARSE

where LC and RC are feature constraints on the unambiguous left and right contexts of a given token, and PARSE is a feature constraint on the parse(s) that is (are) chosen (or deleted) in that context if they are subsumed by that constraint.
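A minimal sketch of how such a rule could be represented and applied is given below, with parses modelled as flat feature dictionaries and subsumption as a simple subset test; the feature names and the example rule are illustrative, not taken from the actual rule set.

# Sketch of a "choose"/"delete" constraint rule: if LC and RC then choose/delete PARSE.
def subsumes(constraint, parse):
    return all(parse.get(k) == v for k, v in constraint.items())

def apply_rule(parses, i, lc, rc, target, action="choose"):
    """parses: list of lists of feature dicts (one list of candidate parses per token);
    assumes 0 < i < len(parses) - 1."""
    left, right = parses[i - 1], parses[i + 1]
    if len(left) != 1 or len(right) != 1:           # contexts must be unambiguous
        return parses
    if subsumes(lc, left[0]) and subsumes(rc, right[0]):
        if action == "choose":
            kept = [p for p in parses[i] if subsumes(target, p)]
        else:                                       # delete
            kept = [p for p in parses[i] if not subsumes(target, p)]
        if kept:                                    # never wipe out every parse
            parses[i] = kept
    return parses

# illustrative rule: between an unambiguous ADJ and an unambiguous N, choose the nominal parse
parses = [[{"cat": "ADJ"}],
          [{"cat": "N", "case": "LOC"}, {"cat": "V", "tense": "AOR"}],
          [{"cat": "N"}]]
print(apply_rule(parses, 1, {"cat": "ADJ"}, {"cat": "N"}, {"cat": "N"}))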

This system uses two handcrafted sets of rules:

1. It uses an initial set of hand-crafted choose rules to speed up the learning process by creating disambiguated contexts over which statistics can be collected. These rules are independent of the corpus that is to be tagged, and are linguistically motivated. They enforce some very common feature patterns, especially where word order is rather strict, as in NPs or PPs. Another important feature of these rules is that they are applied even if the contexts are also ambiguous, as the constraints are tight. That is, if each token in a sequence of, say, three ambiguous tokens has a parse matching one of the context constraints (in the proper order), then all of them are simultaneously disambiguated.

2. It also uses a set of handcrafted heuristic delete rules to get rid of any very low probability parses. For instance, in Turkish, postpositions have rather strict contextual constraints, and if there are tokens remaining with multiple parses, one of which is a postposition reading, that reading is to be deleted.

Given a training corpus with tokens annotated with possible parses, first the hand-crafted rules are applied. Learning then goes on as a number of iterations over the training corpus. The following schema is an adaptation of Brill's formulation (a small sketch of the loop follows the list):

1. Generate a table, called in_context, of all possible unambiguous contexts which contain a token with an unambiguous (projected) parse, along with a count of how many times this parse occurs unambiguously in exactly the same context in the corpus.
2. Generate a table, called count, of all unambiguous parses in the corpus, along with a count of how many times each parse occurs in the corpus.
3. Start going over the corpus token by token, generating contexts.
4. For each unambiguous context encountered, with parses P1, ..., Pk, and for each parse Pi, generate a candidate rule of the form: if LC and RC then choose Pi.
5. Every such candidate rule is then scored.
6. All candidate rules generated during one pass over the corpus are grouped by context specificity, and in each group the rules are ordered by descending score.
7. The selected rules are then applied in the matching contexts, and ambiguity in those contexts is reduced.
8. If the threshold for the most specific context falls below a given lower limit, the learning process is terminated.
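The following compact sketch mimics the spirit of this loop on a toy representation (each token is a list of candidate parse strings); the thresholding, scoring, and table names are deliberately simplified and should not be read as the authors' implementation.

# Simplified sketch of the iterative "choose rule" learning loop described above.
from collections import Counter

def learn_choose_rules(corpus, threshold=2, max_iters=10):
    """corpus: list of tokens, each token being a list of candidate parse strings."""
    learned = []
    for _ in range(max_iters):
        # steps 1-2: statistics from fully unambiguous occurrences
        in_context = Counter()
        for i in range(1, len(corpus) - 1):
            l, t, r = corpus[i - 1], corpus[i], corpus[i + 1]
            if len(l) == 1 and len(t) == 1 and len(r) == 1:
                in_context[(l[0], t[0], r[0])] += 1
        # steps 3-5: candidate "choose" rules for ambiguous tokens in unambiguous contexts
        candidates = Counter()
        for i in range(1, len(corpus) - 1):
            l, t, r = corpus[i - 1], corpus[i], corpus[i + 1]
            if len(t) > 1 and len(l) == 1 and len(r) == 1:
                for parse in t:
                    candidates[(l[0], parse, r[0])] = in_context[(l[0], parse, r[0])]
        # steps 6-8: keep the well-scoring rules, apply them, or terminate
        selected = [(ctx, n) for ctx, n in candidates.most_common() if n >= threshold]
        if not selected:
            break
        for (lc, parse, rc), n in selected:
            learned.append((lc, parse, rc, n))
            for i in range(1, len(corpus) - 1):
                if (corpus[i - 1] == [lc] and corpus[i + 1] == [rc]
                        and len(corpus[i]) > 1 and parse in corpus[i]):
                    corpus[i] = [parse]            # ambiguity reduced in this context
    return learned, corpus

corpus = [["ADJ"], ["N", "V"], ["N"], ["ADJ"], ["N"], ["N"], ["ADJ"], ["N"], ["V"]]
rules, resolved = learn_choose_rules(corpus, threshold=1)
print(rules)      # [('ADJ', 'N', 'N', 1)]
print(resolved)   # the ambiguous second token is resolved to ['N']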

The combination of these handcrafted, statistical and learned information sources is reported to yield a precision of 93 to 94% and an ambiguity of 1.02 to 1.03 parses per token on test texts, which is rather satisfactory.

Evaluation

The resulting disambiguated text is evaluated by some metrics. These metrics are:

  Ambiguity = Number of Parses / Number of Tokens
  Recall    = Number of Tokens Correctly Disambiguated / Number of Tokens
  Precision = Number of Tokens Correctly Disambiguated / Number of Parses Remaining
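These three definitions translate directly into code; the sketch below uses hypothetical counts merely to show the arithmetic.

# Direct transcription of the three evaluation metrics above (toy counts).
def evaluate(n_tokens, n_parses_remaining, n_correctly_disambiguated):
    ambiguity = n_parses_remaining / n_tokens
    recall = n_correctly_disambiguated / n_tokens
    precision = n_correctly_disambiguated / n_parses_remaining
    return ambiguity, recall, precision

# e.g. 1000 tokens left with 1030 parses, 965 tokens still carrying the correct parse
print(evaluate(1000, 1030, 965))   # (1.03, 0.965, 0.9368...)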

In the ideal case, when every token is correctly and uniquely disambiguated, recall and precision will both be 1.0. If the tokens are not uniquely disambiguated, recall may still be 1.0 but precision will be smaller. So the aim is to decrease ambiguity and to get recall and precision as close as possible to 1.0.

Conclusion

The results are satisfactory. They indicate that by combining these hand-crafted, statistical and learned information sources, a recall of 96 to 97%, with a corresponding precision of 93 to 94% and an ambiguity of 1.02 to 1.03 parses per token, is attained on test texts. However, the impact of the rules that are learned is not significant, as the handcrafted rules do most of the easy work at the initial stages [2].

The results are also reasonable when the same experiments are done on two unseen texts on completely different topics. The recall reached is 93-95%, with a corresponding precision of 90-91% and an ambiguity of 1.03 to 1.04 parses per token [2].

Since recall and precision are both near 1.0, and when compared with studies worldwide, the results are acceptable.



ACKNOWLEDGMENTS

We would like to thank Dilek Hakkani-Tür and Gökhan Tür for providing us with their relevant research papers and their ideas, and Bilge Say, Ayşenur Birtürk and Çağlar İskender for providing us with related information and their feedback.

BIBLIOGRAPHY

[1] Brill, Eric. 1992. A simple rule-based part of speech tagger. In Third Conference on Applied Natural Language Processing.

[2] Tür, Gökhan. 1996. Using Multiple Sources of Information for Constraint-Based Morphological Disambiguation. Master's thesis, Bilkent University, Department of Computer Engineering and Information Science.

[3] Brill, Eric. 1994. Some advances in rule-based part of speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, Washington.

[4] Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543-566.

[5] Oflazer, Kemal; Tür, Gökhan. 1997. Morphological Disambiguation by Voting Constraints. In Proceedings of ACL'97/EACL'97, the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, Spain, July 7-12.

[6] Oflazer, Kemal; Tür, Gökhan. 1996. Unsupervised Learning in Constraint-Based Morphological Disambiguation. In Proceedings of EMNLP'96, Conference on Empirical Methods in NLP, Pennsylvania, May 17-18.

[7] Roche, Emmanuel; Schabes, Yves. 1995. Deterministic Part of Speech Tagging with Finite State Transducers.

[8] Hakkani-Tür, Dilek; Oflazer, Kemal; Tür, Gökhan. 2000. Statistical Morphological Disambiguation for Agglutinative Languages. In Proceedings of COLING-2000, the 18th International Conference on Computational Linguistics, August.

[9] Oflazer, Kemal; Kuruöz, İlker. 1994. Tagging and morphological disambiguation of Turkish text. In Proceedings of the 4th Applied Natural Language Processing Conference, pages 144-149. ACL, October.

[10] Tür, Gökhan; Hakkani-Tür, Dilek; Oflazer, Kemal. Name Tagging Using Lexical, Contextual, and Morphological Information.

[11] Jorge, Alipio; Lopes, Alneu de Andrade. Iterative Part of Speech Tagging.

[12] Cutting, Doug; Kupiec, Julian; Pedersen, Jan; Sibun, Penelope. A Practical Part of Speech Tagger.

[13] Ratnaparkhi, Adwait. A Maximum Entropy Model for Part of Speech Tagging.

[14] Van Guilder, Linda. 1995. Automated Part of Speech Tagging: A Brief Overview.