Free University of Bozen-Bolzano
Faculty of Computer Science
Bachelor of Science in Applied Computer Science

Transformation-based Learning Algorithm for Part-of-Speech Tagging
of a Multilingual Corpus of Questions

Supervisor: Dr. Raffaella Bernardi
Co-Supervisor: Ing. Paolo Dongilli
Student: Anna Mari

Academic Year 2004-2005
1st Graduate Session – July 22-23, 2005


Abstract
Part-of-Speech tagging is one of the preliminary analyses needed by any Language Technology application. Tasks for which tagging has proved helpful include Machine Translation, Information Extraction, Information Retrieval, Question Answering, Speech Recognition and higher-level syntactic processing. For instance, information on the word class of a token is crucial in Speech Recognition to disambiguate linguistic strings such as the noun “ice-cream” from the noun phrase-verb sequence “I scream”. Due to the relevant role that this task plays in all Natural Language Processing applications, it was chosen as the thesis project.
In the first part of this thesis, a description of the state of the art in Part-of-Speech (PoS) tagging methods is provided. PoS tagging describes the words of a text in terms of both morphological and syntactic categories (or PoS tags). This morphosyntactic annotation provides a great deal of information beyond an unanalyzed sequence of words.
Among the existing algorithms, particularly interesting is the Transformation-based Error-driven Learning algorithm developed by Eric Brill in 1995. Transformation-based learning is an attempt to preserve the benefit of having linguistic knowledge in a human-readable form while automatically extracting linguistic information from corpora. It combines linguistically motivated rule-based approaches with the advantages of stochastic approaches, which obtain better results than the former. The algorithm is described in the second part of the thesis.
In the third chapter, an overview of the history and recent rediscovery of corpora, i.e. collections of text samples, is given, underlining the basic features a corpus must satisfy to be representative of a certain language. Moreover, the markup-language standards and the employed corpus are described in detail. In order to carry out an experimental, comparative analysis, a multilingual corpus was chosen: the Multext-Joc corpus, i.e. a corpus consisting of the same text written in more than one language, here Italian and English. After these general overviews, the implemented PoS tagger is described, providing information on the different modules it consists of and how they interact with each other.
Finally, in the last chapter of the thesis the results obtained by the tagger are examined. The results indicate that the algorithm achieves different accuracies according to the language of the corpus and the tag set used. Moreover, some problems regarding the running time of the tagger are highlighted and possible solutions are suggested.








Acknowledgements

For the realization of this work, I want to thank my first supervisor, Prof. Raffaella Bernardi, for introducing me to the discipline of Computational Linguistics, which was completely new to me before I attended her course. Moreover, I appreciated that she always had time to support me in writing this thesis.

I also want to express my gratitude to my co-supervisor, Paolo Dongilli, who helped me in the implementation of the thesis project and always gave me good hints for writing this thesis.

I especially want to thank my colleague Barbara Plank, who supported me in the course of Computational Linguistics and always proposed excellent solutions to my problems.

Finally, I want to thank my family, all my friends and in particular Ivan Zorzi, who in some way sustained my state of mind when I went into hysterics.

Table of Contents
Abstract
Acknowledgements
1. Part-of-Speech Tagging
  1.1 State of the art
    1.1.1 Corpora
    1.1.2 Part-of-Speech tag sets
    1.1.3 Part-of-Speech taggers
  1.2 Case Study: Transformation-based Algorithm for Part-of-Speech tagging of a Multilingual Corpus of Questions
2. Transformation-based PoS Learning algorithm
  2.1 Training Phase
    2.1.1 Training Phase without unknown words
    2.1.2 Training Phase with unknown words
  2.2 Test Phase
3. Multext-Joc Corpus
  3.1 Corpus Description
  3.2 Corpus Encoding
4. Implementation of Transformation-based PoS Algorithm
  4.1 .Net Environment
  4.2 The Algorithm implementation
    4.2.1 Conversion module from SGML to human-readable format
    4.2.2 Lexicon-extraction and corpus-splitting module
    4.2.3 Training module
    4.2.4 Test module
5. Results, Difficulties and Further work
  5.1 Results of the Test Phase
    5.1.1 Results of the English corpus
    5.1.2 Results of the Italian corpus
  5.2 Conclusions drawn from the obtained results and proposed solutions
Appendix
  Multext-Joc tag set for English
  Multext-Joc tag set for Italian
List of Tables
List of Figures
Glossary
Bibliography

1. Part-of-Speech Tagging
The task of Part-of-Speech tagging consists of assigning each word to its appropriate word class. A word class or Part-of-Speech (PoS) is defined as the role that a word plays in a sentence. Parts of Speech have been recognized in linguistic studies for a long time. Dionysius Thrax, a Greek grammarian who lived from about 170 B.C. to 90 B.C., distinguished eight Parts of Speech: verb, noun, article, participle, pronoun, preposition, adverb, conjunction. Nowadays many more than eight word classes are in use, since PoS tags have found new applications for which a higher number of classes has been proved to improve performance.
The increased interest in PoS tagging is connected to another area of Linguistics that has re-emerged and become a cutting-edge topic of current research, namely the study of corpora, i.e. finite collections of naturally occurring utterances. Before 1960, corpora were collected, stored and analyzed by linguists only by hand and on paper, with the aim of studying pragmatic and linguistic phenomena. Now they are machine-readable and are built mainly as support tools for other applications.
There are many applications in which corpora annotated with PoS play a significant role, among them Speech Recognition, Information Retrieval, Machine Translation (MT) and Question Answering (QA).
For instance, in MT the correctness of the translation of a word in the source language into a word in the target language is highly dependent on the Part-of-Speech of the source word; e.g. the Italian word “gioco” can be translated as either “game” (noun) or “(I) play” (verb) in English, depending on the syntactic category it belongs to.
Also interesting is the case of QA, which has emerged as a new research topic in the field of Information Retrieval and now represents a promising framework for finding information that closely matches user needs. QA systems try to understand users' questions and retrieve a list of documents that might contain the answer.
For instance, PoS tags are used to detect the type of the question (factual vs. non-factual questions, Wh-questions, quantification questions) and the type of its target, as well as to perform a sanity check on the retrieved answer.
For these purposes many taggers, i.e. programs able to automatically PoS-tag a corpus, have been implemented. Taggers can be implemented differently according to the kind of algorithm they use.
In the following section, the state of the art regarding corpora, PoS tag sets and taggers is provided.
1.1 State of the art
The constituents of Part-of-Speech tagging are mainly a tagger, a Part-of-Speech tag set and a corpus. The characteristics of these constituents significantly determine the results of this kind of annotation. Therefore, it is important to analyze all the aspects that regard them. In the following sections, the major aspects of the basic constituents of tagging are presented in detail.



1.1.1 Corpora
A corpus can be described simply as a large body of linguistic evidence, typically composed of attested language use. The term “corpus” comes from Latin and means body, but in linguistics it refers to finite collections of naturally occurring utterances. Every corpus should be considered under four main headings:

1. sampling and representativeness
2. finite size
3. a standard reference
4. machine-readable form
Sampling and representativeness: natural language is infinite; it is therefore necessary to build a sample of the language variety we are interested in. A sample is representative if what we find for the sample also holds for the general population. Even if a corpus cannot include all the valid utterances, it can represent the regularities of the language [1]. Because it has to be statistically representative of the language, it is composed of varied material such as everyday conversations, published writing, writing of children, radio news broadcasts, etc. Representativeness is a significant issue in the development of corpora when it comes to corpus balancing. The factors which count for balancing are mainly: regional coverage (corpora should include all the regional varieties of the language), sociolinguistic coverage (authors should be male and female, representing different age groups, social backgrounds and levels of education) and generic coverage (corpora should contain samples of different genres or types of text) [2]. Determining whether a corpus is representative is difficult, but it is extremely significant when doing statistical Natural Language Processing (NLP) work.
Finite size: the term corpus usually implies a body of text of a finite size, for example one million words. Usually, when a corpus reaches its planned total of words, collection stops; the corpus may be increased in size later.
A standard reference: a corpus constitutes a standard reference for the language variety it represents. This presupposes its wide availability to all researchers, which is indeed the case with many corpora such as the Brown corpus of written American English, the Lancaster-Oslo/Bergen (LOB) corpus of written British English and the London-Lund corpus of spoken British English.
Machine-readable form: before 1960, in the so-called Early Corpus Linguistics (ECL) period [1], a corpus corresponded to a compilation of naturally occurring spoken or written language, a simple collection of linguistic utterances. Corpora were collected only by hand and on paper. Present-day corpora are all machine-readable and rationally constructed. Linguists extensively use computers' ability to search, retrieve, sort and calculate over the content of corpora in a way that was not possible in the ECL period. Moreover, machine-readable corpora can easily be enriched with additional information (annotation) [2].
Indeed, corpora are also easily found on the Internet. Some of them are available at these links:
• Linguistic Data Consortium: http://www.ldc.upenn.edu
• European Language Resources Association: http://www.icp.grenet.fr/ELRA
• International Computer Archive of Modern English: http://nora.hd.uib.no./icame.html

All corpora should follow these four headings, but depending on their use they exist in two forms: unannotated (in their raw state of plain text) or annotated (enhanced with various types of information). Unannotated corpora have been, and are, of considerable use in language study, but the utility of a corpus is considerably increased by the provision of annotation. For this reason, most corpora are now annotated.
Annotating corpora can be done in many ways. There is currently no widely agreed standard way of annotating texts. Current moves are aiming towards formalized international standards for the encoding of any type of information that one would conceivably want to encode in machine-readable texts. The flagship of this trend towards standards is the Text Encoding Initiative (TEI) [6]. The aim of TEI is to provide standardized implementations for machine-readable text interchange. To this end, TEI employs an already existing form of document markup known as the Standard Generalized Markup Language (SGML) [7]. SGML was adopted because it is simple, clear, formally rigorous and already recognized as an international standard [1].
In TEI, each text is conceived of as two parts: a header and the text itself. The header contains information about the text such as author, title, date and so on. The actual TEI annotation of the header and the text is based on two basic devices: tags and entity references. Texts are assumed to be made up of elements. An element can be any unit of text and is marked using SGML tags. A detailed description of this encoding language is provided in section 3.2.
Having considered the ways in which additional information may be encoded in machine-readable texts, it is also important to consider the types of information typically found in corpora. Indeed, Part-of-Speech annotation is not the only existing type of linguistic annotation; other kinds of annotation are lemmatization and parsing.
Lemmatization involves the reduction of the words in a corpus to their respective lexemes. A lexeme is the head word form that one would look up in a dictionary, e.g. kicks, kicked and kicking would all be reduced to the lexeme kick [1].
Parsing, on the contrary, is an annotation that can be done after Part-of-Speech tagging. Once basic morphosyntactic categories have been identified in a text, it is then possible to consider bringing these categories into higher-level syntactic relationships with one another.
1.1.2 Part-of-Speech tag sets
Part-of-Speech tagging, sometimes also known as grammatical tagging or morphosyntactic annotation, is the most basic type of linguistic corpus annotation. Its aim is to assign to each lexical unit in the text a code indicating its Part-of-Speech.
Every corpus can be annotated differently according to a standard set of tags. A tag set encodes the target features of classification, telling the user the useful information about the grammatical class of each word.


There is currently no widely agreed standard way of representing tags in texts. There are a small number of tag sets for English, many of which evolved from the 87-tag set used for the Brown corpus, which originally (1961) contained 1,014,312 words sampled from about 15 text categories.

Currently, the most important tag set is the 45-tag Penn Treebank set, which is a simplified version of the Brown tag set. The alphabetical list of tags contained in the Penn Treebank set is given below.


Number  Tag   Description
1.      CC    Coordinating conjunction
2.      CD    Cardinal number
3.      DT    Determiner
4.      EX    Existential there
5.      FW    Foreign word
6.      IN    Preposition or subordinating conjunction
7.      JJ    Adjective
8.      JJR   Adjective, comparative
9.      JJS   Adjective, superlative
10.     LS    List item marker
11.     MD    Modal
12.     NN    Noun, singular or mass
13.     NNS   Noun, plural
14.     NNP   Proper noun, singular
15.     NNPS  Proper noun, plural
16.     PDT   Predeterminer
17.     POS   Possessive ending
18.     PRP   Personal pronoun
19.     PRP$  Possessive pronoun
20.     RB    Adverb
21.     RBR   Adverb, comparative
22.     RBS   Adverb, superlative
23.     RP    Particle
24.     SYM   Symbol
25.     TO    to
26.     UH    Interjection
27.     VB    Verb, base form
28.     VBD   Verb, past tense
29.     VBG   Verb, gerund or present participle
30.     VBN   Verb, past participle
31.     VBP   Verb, non-3rd person singular present
32.     VBZ   Verb, 3rd person singular present
33.     WDT   Wh-determiner
34.     WP    Wh-pronoun
35.     WP$   Possessive wh-pronoun
36.     WRB   Wh-adverb

Table 1: Alphabetical list of tags (excluding punctuation) in the Penn Treebank tag set


Example 1:
Tag number 13 is written with the code NNS and indicates the Part-of-Speech “noun, plural”. An example of a plural noun is “books”, the plural form of the noun “book”.
These tag sets are all for English. In general, tag sets incorporate the morphological distinctions of a particular language, and so are not directly applicable to other languages. Indeed, when creating multilingual corpora, i.e. corpora consisting of texts in more than one language, a different Part-of-Speech tag set is provided for each language.
A concrete example is the tag set developed by the “Istituto di Ricerca Computazionale” for the Multext project, a project aiming to build a completely annotated multilingual corpus containing raw, tagged and aligned data. For each language, every word was tagged with the PoS tag set designed for that language.

Example 2.
In Italian, a word can be characterized by masculine or feminine gender, and only sometimes does it have a common (unspecified) gender. In English, on the contrary, no gender is assigned to words. This example explains why tag sets are not directly applicable to other languages.
Another factor which should be considered is the size of tag sets. Large tag sets may bring advantages in Part-of-Speech tagging, such as giving more information about the word each tag relates to, but may also bring disadvantages, such as slowing the tagger's execution time. An example of the possible benefits and drawbacks will be given in the 5th chapter.
1.1.3 Part-of-Speech taggers
Annotating a corpus with PoS tags can be done through a multi-step procedure, and the result is a kind of repository in which some implicit information has been made explicit. This process is carried out by a Part-of-Speech tagger. The input to a tagger is a string of words of a natural language sentence and a specified tag set. The output should be the correct PoS tag for each word.
Work on building taggers originated in the 1950s and 1960s. Several approaches have been developed to build them, and they can be said to have matured to a point where they offer quite reasonable levels of accuracy. For example, one of the first programs was the one which semi-automatically tagged the Brown Corpus; it worked by determining word-environment combinations. The program's algorithm did not achieve high rates of accuracy because at that time there was no manually tagged corpus against which to develop and test the manually created rules empirically [4].
Common to all algorithms is a process for the automated Part-of-Speech tagging of natural language text. First, the system checks whether each word is present in a machine-readable lexicon it has available. If the word is present in the lexicon, the system assigns to the word the full list of Parts of Speech it may be associated with. Annotated corpora are of use to this stage of processing because a lexicon can be produced from them [1]. The larger the lexicon, the better the chances of identifying a word and associating the appropriate Part of Speech with it. After this stage, the actual process of automated tagging starts.
Nowadays, there are many approaches to automated Part-of-Speech tagging. One possible distinction among these approaches is in terms of degree of automation: there are supervised and unsupervised taggers.
Supervised taggers rely on already tagged corpora, from which they can learn during training, so that later they are able to tag a corpus. Unsupervised taggers are those that do not require a pre-tagged corpus: they use sophisticated methods to induce word groupings and then calculate the probabilistic information needed by stochastic taggers.
The main reason for using a fully automated approach to PoS tagging is that it is extremely portable and, above all, that pre-tagged corpora are still not readily available for all the languages and genres one would like.
Another distinction that can be drawn between taggers concerns the algorithm they work with. Nowadays, there are rule-based and stochastic algorithms.
Stochastic algorithms refer to all the approaches that somehow incorporate frequency. The simplest stochastic algorithm disambiguates words based exclusively on the probability that a word occurs with a particular tag. An alternative is to calculate the probability of a given sequence of tags occurring, the so-called n-gram approach: the best tag for a given word is determined by the probability that it occurs with the n previous tags. Finally, another, more complex approach is the so-called Hidden Markov Model (HMM). A Markov model is a finite state machine. With every state, two probability distributions are associated: the probability of emitting a particular symbol and the probability of moving to a particular state. The behaviour of the model is thus determined by combining the state-transition probabilities with the probabilities of the observed outputs. In Part-of-Speech tagging, it is assumed that the probability of a tag depends only on a small and fixed number of previous tags; for instance, the context of a given tag may be fully represented by only the two previous tags. This is the so-called Markov assumption. An advantage of HMM algorithms is that only a lexicon and a text are needed for training a tagger, i.e. taggers can be trained without tagged corpora. Moreover, HMMs also include methods for estimating probabilities for linguistic phenomena such as long-distance dependencies [4].
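To make the last point concrete, the standard bigram HMM tagging objective can be written in textbook notation (a reference formula, not reproduced from this thesis): the tagger chooses

\[ \hat{t}_1 \ldots \hat{t}_n \;=\; \operatorname*{arg\,max}_{t_1 \ldots t_n} \; \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}) \]

where the factor P(t_i | t_{i-1}) encodes the Markov assumption that each tag depends only on its predecessor (or, in a trigram model, on its two predecessors).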
On the other hand, there are rule-based algorithms. The basic idea behind them is first to assign all possible tags to words and then to remove the wrong tags according to a set of rules, usually called context frame rules. An example is: “if a word X is preceded by a determiner and followed by a noun, tag it as an adjective”. In addition to this kind of rule, many taggers also use morphological information to aid the disambiguation process, for example: “if a word ends in -ing and is preceded by a verb, label it a verb”. Usually, taggers implemented with rule-based algorithms need supervised tagging and have some limitations: they are non-automatic, costly and time-consuming.
Another algorithm, which combines rule-based and stochastic approaches, is the one conceived by Eric Brill [3], called the Transformation-based Error-driven Learning algorithm. This algorithm is transformation-based in the sense that it uses rules, and error-driven in the sense that it resorts to supervised learning. Transformation-based learning is an attempt to preserve the benefit of having linguistic knowledge in a readable form while extracting linguistic information automatically from corpora. With this algorithm, the most probable tag is first assigned to each word as a start value; then tags are changed according to transformation rules. Training is done on a tagged corpus together with a set of rule templates. A limitation of this algorithm is that it is slow, like rule-based algorithms, but its accuracy reaches 96-97%.
1.2 Case Study: Transformation-based Algorithm for
Part-of-Speech tagging of a Multilingual Corpus of
Questions

The development of reliable PoS taggers has had a significant impact upon corpus linguistics. By PoS tagging a corpus, much information can be extracted from the results of the process. Bearing this in mind, it is possible to verify a tagger's accuracy by using the same tagger with different corpora. Particularly interesting is the application of a tagger to a multilingual corpus, that is, a corpus containing an original text together with parallel translations of it.
The corpus taken into consideration for the Part-of-Speech annotation is the Multext-Joc corpus, developed within the Multext project, the largest project funded under the European Commission's LRE (Linguistic Research and Engineering) programme. The corpus contains written Questions and Answers from the Official Journal of the European Community. Because of the extreme importance of Part-of-Speech tagging in the field of Question Answering, relevance was given only to the written Questions.
Part-of-Speech tagging this corpus should give different results according to the language considered. For this reason, during my internship I implemented a Part-of-Speech tagger able to tag the multilingual corpus.
With the aim of implementing a tagger, I analyzed the different algorithms that have already been developed. Among them, the Transformation-based Error-driven Learning algorithm proposed by Eric Brill appeared the most interesting. Transformation-based tagging draws inspiration from both rule-based and stochastic taggers. Like rule-based taggers, it relies on rules that specify which tags should be assigned to which words; but like stochastic taggers, it is a machine learning technique in which the rules are automatically induced from the data. This algorithm has been applied to a number of NL problems, including Part-of-Speech tagging, prepositional phrase attachment disambiguation, and syntactic parsing. Prepositional phrase attachment is a common cause of structural ambiguity in NL: its disambiguation is the problem of attaching a prepositional phrase (PP) to a noun phrase (NP) or to a verb phrase (VP), depending on the sentence. Syntactic parsing, on the contrary, means checking that the tokens in a sentence form an allowable expression; this is usually done with reference to a context-free grammar (CFG), which recursively defines the components that can make up an expression and the order in which they must appear.
By studying the Transformation-based algorithm, I implemented a tagger, and from the original Multext corpus I extracted the linguistic information needed to compare the parallel corpora in the different languages. The input of the Transformation-based tagger is a manually annotated corpus, from which the tagger extracts a lexicon and learns the best transformation rules. The output of the tagger is a completely annotated corpus together with the transformation rules learned during this process for the respective languages (English and Italian). The transformation rules learned by Part-of-Speech tagging the English corpus can then be compared to those learned by tagging the Italian corpus. From these rules I expected to infer similarities and differences between the two languages. The analysis performed is in the 5th chapter.

2. Transformation-based PoS Learning algorithm
The Transformation-based algorithm has been applied to many Natural
Language (NL) problems, such as Part-of-Speech tagging, prepositional
phrase attachment disambiguation, and syntactic parsing. In Part-of-Speech
tagging it works in the following way.

The algorithm has four main tasks: it (i) extracts the lexicon, (ii) initially tags
the corpus using the frequency of the tags each word could be tagged with,
(iii) learns the transformation rules given the templates, (iv) finally tags the
corpus.

By means of (i)-(iv) it achieves its goal: it learns the best rules for
grammatically annotating a corpus and then verifies the accuracy of those
rules.

To accomplish these goals, two phases are needed, called the Training phase and the Test phase. The task of learning the rules is carried out in the Training phase, during which the tagger learns rules for automatic PoS tagging from an annotated corpus. In the Test phase, on the contrary, the rules previously extracted are applied to a raw corpus.
Since two corpora are needed, one annotated and one not, the corpus taken into consideration is split into two parts: one containing the PoS annotation (the training corpus) and one without it (the test corpus).



[Diagram: the Transformation-based tagger consists of a Training Phase, run on the training corpus, followed by a Test Phase, run on the test corpus.]
Figure 1: Transformation-based Tagger phases
The algorithm comprises these two phases, and each one needs its own corpus.

2.1 Training Phase
Brill's algorithm can be applied to implement a PoS tagger which only considers known words, but it can also be used to create a tagger able to handle unknown words. In the latter case, the training corpus is split into two parts: one containing unknown words (the unknown training corpus) and one containing only known words (the known training corpus).


[Diagram: the training corpus is split into the unknown training corpus and the known training corpus.]
Figure 2: Division of the training corpus into two parts
The unknown training corpus is composed of unknown words, while the other is composed only of known words.

2.1.1 Training Phase without unknown words
In the case of a PoS tagger that does not handle unknown words, the only corpus used is the known training corpus. From this corpus a lexicon is extracted. A lexicon is simply a list of all the tags seen for each word in the known training corpus, with one tag labeled as the most likely. The element of the PoS tagger which accomplishes this task is called the initial tagger. The initial tagger extracts the lexicon from the known training corpus, which is already annotated and can be referred to as the Truth.

Example 3.
In the example below, the word “can” has been found labeled as a modal four times, as a verb twice, and as a noun only once. The tags are sorted by decreasing frequency; MD is in this case the most frequent tag and hence it is underlined.

“can”  MD 4  VB 2  NN 1
(MD = modal, VB = verb, NN = noun)
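As an illustration, a minimal sketch of how such a lexicon could be built follows (hypothetical modern C#, not the thesis implementation, which targeted .Net 1.1; it assumes the annotated corpus is already available as word/tag pairs):

using System.Collections.Generic;
using System.Linq;

class LexiconBuilder
{
    // Builds word -> (tag -> frequency), e.g. "can" -> { MD: 4, VB: 2, NN: 1 }.
    public static Dictionary<string, Dictionary<string, int>> Build(
        IEnumerable<(string Word, string Tag)> taggedCorpus)
    {
        var lexicon = new Dictionary<string, Dictionary<string, int>>();
        foreach (var (word, tag) in taggedCorpus)
        {
            if (!lexicon.TryGetValue(word, out var counts))
                lexicon[word] = counts = new Dictionary<string, int>();
            counts[tag] = counts.GetValueOrDefault(tag) + 1;
        }
        return lexicon;
    }

    // The tag labeled as most likely is simply the most frequent one.
    public static string MostFrequentTag(Dictionary<string, int> counts) =>
        counts.OrderByDescending(kv => kv.Value).First().Key;
}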

Once the lexicon is composed, all the tags are removed from the annotated corpus to create an unannotated corpus called the Guess, as summarized by Figure 3.

[Diagram: from the annotated corpus, (1) the lexicon is extracted and (2) all tags are removed, yielding the unannotated Guess corpus.]
Figure 3: Extraction of Lexicon
From the annotated corpus, first (1) the lexicon is extracted, and then (2) all the tags are eliminated, so that the result is the unannotated Guess corpus.


14
After this preliminary phase, the unannotated Guess corpus is passed again through the initial tagger in order to annotate it: each word is tagged with its most frequent tag, i.e. the tag that occurred the highest number of times for that word. For instance, in the previous example the most frequent tag found for the word “can” is modal. The process is illustrated in Figure 4.
[Diagram: the unannotated Guess corpus is passed through the initial tagger, producing the annotated Guess corpus.]
Figure 4: Initial tagging
The unannotated Guess corpus is initially tagged with the most frequent tag for each word. The result is the (annotated) Guess corpus.
The resulting corpus is then compared to the Truth, which is used as a reference. Usually, after the initial tagging phase the corpus is about 90% correctly tagged.

Example 4.
The most frequent tag for a word like “annoying” could be JJ, which stands for adjective. But in many cases the word “annoying” is used as a verb, in its present participle form. If this word is found in the corpus, it will nevertheless be labeled JJ, even when it is a verb; in all these cases the word will be wrongly annotated.

Because the corpus still contains errors after the initial tagging phase, some improvements should be made. The element which accomplishes the task of improving the correctness of the corpus annotation is called the Learner. The Learner compares the Truth with the Guess, and from this comparison it learns transformations. The Learner considers every possible transformation, decides which is the best one, that is, the one that reduces the highest number of errors, and finally applies it. At each learning iteration, the best transformation is learned and applied, the transformation is added to the ordered list of transformation rules, and the known training corpus is updated. Learning continues until there are no more transformations which can improve the correctness of the corpus.
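In outline, the learning loop can be sketched as follows (hypothetical modern C#, not the thesis code; the enumeration of candidate rules from the templates is assumed and passed in as a delegate, and the scoring function is sketched later in this section):

using System;
using System.Collections.Generic;

// A transformation rule: a rewrite ("change tag X to Y") plus an
// instantiated template acting as the triggering context.
class Rule
{
    public string FromTag, ToTag;
    public Func<string[], string[], int, bool> Fires; // (words, tags, i) -> bool
}

class Learner
{
    // Greedy loop: repeatedly pick the rule with the best net error
    // reduction, apply it to the Guess tags, record it, and iterate
    // until no candidate improves the corpus any further.
    public static List<Rule> Learn(
        string[] words, string[] guess,
        Func<IEnumerable<Rule>> candidates, // instantiated from the templates
        Func<Rule, int> score)              // net error reduction of a rule
    {
        var learned = new List<Rule>();
        while (true)
        {
            Rule best = null;
            int bestScore = 0;
            foreach (var rule in candidates())
            {
                int s = score(rule);
                if (s > bestScore) { bestScore = s; best = rule; }
            }
            if (best == null) break;               // no further improvement
            for (int i = 0; i < words.Length; i++) // apply the best rule
                if (guess[i] == best.FromTag && best.Fires(words, guess, i))
                    guess[i] = best.ToTag;
            learned.Add(best);                     // keep rules in learned order
        }
        return learned;
    }
}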

In order to create transformations, some preliminary steps should be performed. First of all, the Learner needs templates from which to learn transformations. The templates used in this algorithm are the ones supplied by Brill, arranged in two classes: non-lexicalized and lexicalized templates. Non-lexicalized templates are those which make no reference to specific words, but only to tags. They are:

1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the two preceding (following) words is tagged z.
4. One of the three preceding (following) words is tagged z.
5. The preceding word is tagged z and the following word is tagged w.
6. The preceding (following) word is tagged z and the word two before (after) is tagged w.

where z and w are variables over the set of parts of speech.
Starting from these templates, it is possible to create the transformations, which are learned by the Learner and evaluated; finally, only the best one, the one which most reduces the errors, is applied. This process continues until no improvement can be made.

[Diagram: the Learner reads the Guess corpus and the templates, learns candidate rules, and selects the best rule.]
Figure 5: Training Phase process
The Learner reads the Guess corpus along with the templates, and from these two components it creates and learns all the possible transformation rules, which are then evaluated. Only the best one is applied to the Guess corpus, and a new iteration starts, as long as reasonable improvements can be made.
In order to create a transformation, some more information is needed. Every transformation is composed of an instantiation of a template and a rewrite rule. The rewrite rule is always of the form:

Change the tag from X to Y
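Continuing the hypothetical sketch above, a concrete transformation of this form, the rule “Change the tag from MD to JJ if the following word is tagged MD” derived in the examples below, could be represented as:

var rule = new Rule
{
    FromTag = "MD",  // X in the rewrite rule
    ToTag   = "JJ",  // Y in the rewrite rule
    // Instantiation of template 1: "the following word is tagged z", z = MD.
    Fires = (words, tags, i) => i + 1 < tags.Length && tags[i + 1] == "MD"
};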

To create the transformations, the context suitable for the application of each template is needed. For instance, template 1, “The following word is tagged z”, generates pairs of tags of juxtaposed words. The most frequent combination in the list relevant for the template is extracted, and this best combination is used to instantiate the template. Finally, the transformation rules are composed by adding a rewrite rule to the instantiations of the templates; the rewrite rule is obtained by comparing the Guess corpus with the Truth corpus. The process of creating the transformation rules is described below in detail through an example.
Example 5.

Consider a corpus composed of the sentence below. There are three columns: the first contains the word, the second the tag present in the Guess corpus (i.e. assigned on the basis of the frequency criterion used by the initial tagger), and the third the correct tag present in the Truth corpus. The codes used for the tags do not refer to any particular tag set.
WORD        GUESS  TRUTH
the         AT     AT
candidates  NN2    NN2
on          II     II
this        MD     JJ
reserve     MD     JJ
list        NN     NN
are         VBR    VB
not         XX     XX
used        VVN    VVN
as          CSA    CSA
temporary   MD     JJ
staff       NN     NN

Table 2: Example of a sentence in the Guess and Truth corpora
By considering the template “The following word is tagged z”, a set of possible instantiations of this template is listed below:

The following word is tagged as NN2
The following word is tagged as II
The following word is tagged as MD
The following word is tagged as MD
The following word is tagged as NN
The following word is tagged as VBR
The following word is tagged as XX
The following word is tagged as VVN
The following word is tagged as CSA
The following word is tagged as MD
The following word is tagged as NN

It emerges that the most frequent instantiation of template 1 is:

The following word is tagged as MD

After this, the rewrite rule should be formulated. The words followed by a word tagged MD are:
on II II
this MD JJ
as CSA CSA

As a consequence, three transformation rules are formulated from this sentence:

Change the tag from II to II if the following word is tagged MD
Change the tag from MD to JJ if the following word is tagged MD
Change the tag from CSA to CSA if the following word is tagged MD


The same procedure is repeated for all the sentences in the corpus and for all the templates proposed by Brill. The different transformation rules are tallied, and only one transformation, the best one, is applied.

After creating all the possible transformations, only the best one is applied. The best transformation rule is the one which most reduces the number of errors contained in the Guess corpus. For this reason, every transformation is evaluated with a score: the higher the score, the larger the number of errors the transformation removes. The score is the net error reduction, given by the difference between the errors eliminated by the transformation and the additional errors caused by it. In fact, transformation rules do not always reduce the number of errors. If we consider as the best transformation the rule Change the tag from X to Y if the following tag is Z (based on template 1), different outcomes are possible. If Y is the correct tag, the transformation results in one error less. On the other hand, it can happen that the tag X replaced by Y was actually already the correct one; in that case the transformation introduces an additional tagging error. The introduction of errors affects the score assigned to the rule negatively. Finally, it can also happen that neither X nor Y is correct, in which case the number of errors remains the same.
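In the terms of the hypothetical sketch started above, the score of a candidate rule could be computed like this (again illustrative C#, not the thesis code):

// Net error reduction of a rule on the current Guess corpus: errors it
// repairs minus errors it introduces. Positions where neither X nor Y
// matches the Truth leave the error count unchanged.
static int Score(Rule rule, string[] words, string[] guess, string[] truth)
{
    int repaired = 0, introduced = 0;
    for (int i = 0; i < words.Length; i++)
    {
        if (guess[i] != rule.FromTag || !rule.Fires(words, guess, i))
            continue;                                    // rule does not fire here
        if (truth[i] == rule.ToTag) repaired++;          // Y correct: one error less
        else if (truth[i] == rule.FromTag) introduced++; // X was already right: one more
    }
    return repaired - introduced;
}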
The example below continues the previous one; it shows how the best transformation rule is selected and what effects that transformation has on the Guess corpus.
Example 6.
Continuation of the previous example: in the first and third transformations, no difference exists between the Guess and Truth corpora, since the words “on” and “as” are annotated with the same tag in both corpora. For this reason, these transformations are called do-nothing rules. On the contrary, the second transformation results in a change in the Guess corpus and in one error less. For this reason it is considered the best one, the one which most reduces the errors.
Change the tag from MD to JJ if the following word is tagged MD
Hence, the entire corpus is examined, and all the words labeled MD and followed by a word also labeled MD receive a new tag: MD is changed to JJ. It could happen that some of the words labeled MD were correctly tagged, so that a new error is introduced, or the contrary could happen; in particular cases, neither MD nor JJ is correct, and the number of errors neither increases nor decreases.
Obviously, the process described in the previous example is applied to corpora composed of many more sentences, so the calculation of the best transformation rule is more complex.


The process described above also works with lexicalized templates. The lexicalized transformation templates I used are:

1. The preceding (following) word is w.
2. The word two before (after) is w.
3. One of the two preceding (following) words is w.
4. The current word is w and the preceding (following) word is x.
5. The current word is w and the preceding (following) word is tagged z.
6. The current word is w.
7. The preceding (following) word is w and the preceding (following) tag is t.
8. The current word is w, the preceding (following) word is w2 and the preceding (following) tag is t.

where w and x are variables over all words in the known training corpus, and z is a variable over all parts of speech.
The process described for the non-lexicalized transformations works in the same way for the lexicalized transformations. A new example is given below.
Example 7.
Consider a corpus composed of the sentence below. There are three columns: the first contains the word, the second the tag present in the Guess corpus (i.e. assigned on the basis of the frequency criterion used by the initial tagger), and the third the correct tag present in the Truth corpus. The codes used for the tags do not refer to any particular tag set.
WORD         GUESS  TRUTH
the          AT     AT
candidates   NN2    NN2
on           II     II
this         PP     PP
reserve      JJ     JJ
list         NN     NN
are          VBR    VB
as           IN     RB
intelligent  VVN    VVN
as           IN     IN
nice         MD     JJ
people       NNC    NNC

Table 3: Guess and Truth tags for the example's second sentence
By considering the lexicalized template “The word two after is w”, a list of possible instantiations of this template is composed, one per position (the starred instantiation occurs twice):

The word two after is on
The word two after is this
The word two after is reserve
The word two after is list
The word two after is are
The word two after is as *
The word two after is intelligent
The word two after is as *
The word two after is nice
The word two after is people

It emerges that the most frequent instantiation of the template considered is:

The word two after is “as”

After this, the rewrite rule should be formulated. The words considered for the rewrite rule (those whose second following word is “as”) are:

list NN NN
as IN RB

As a consequence two transformation rules will be formulated:

Change the tag from NN to NN if the word two after is “as”
Change the tag from IN to RB if the word two after is “as”

Actually, the first transformation is a do-nothing rule because the tag in the
Guess corpus is equal to the tag in the Truth corpus. For this reason, the
best transformation rule is:

Change the tag from IN to RB if the word two after is “as”.

This transformation will be applied: it will try to correct the tag of every word whose second following word is “as”. In the case of this sentence, the Guess corpus will contain one error less.
2.1.2 Training Phase with unknown words
In the case of a PoS tagger that handles unknown words, two corpora are used: the unknown training corpus and the known training corpus. While the known training corpus is the same corpus used in the preceding training phase, the unknown training corpus is composed of words which are not extracted into the lexicon. For the known training corpus, the procedure is the same as described in section 2.1.1 (with only known words). For the unknown training corpus a different procedure is applied. First of all, the initial tagger naively labels each unknown word with the most likely tag: proper noun if the word is capitalized, common noun otherwise.
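This heuristic is a one-liner; as a fragment of the same hypothetical sketch (the tag codes are illustrative):

// Naive initial tag for an unknown word: proper noun if capitalized,
// common noun otherwise.
static string InitialTagForUnknown(string word) =>
    char.IsUpper(word[0]) ? "NP" : "NN";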
After the initial tagging, the unknown corpus contains many errors, so some improvements should be made. As before, the element which accomplishes this task is the Learner: it compares the initially tagged unknown training corpus with its correctly annotated counterpart, and from the comparison it learns transformations. The Learner considers every possible transformation, decides which is the best one, i.e. the one that reduces the highest number of errors, and finally applies it.

As in the procedure for known words, at each learning iteration the best transformation is learned and applied, the transformation is added to the ordered list of transformations, and the unknown training corpus is updated. Learning continues until there are no more transformations which can improve the correctness of the corpus.
To learn a transformation, the Learner again uses templates. For unknown words there are five of them:

1. Deleting the prefix (suffix) x results in a word contained in the known training corpus.
2. The first (last) (1,2,3,4) characters of the word are x.
3. Adding the character string x as a prefix (suffix) results in a word contained in the known training corpus.
4. Word w ever appears immediately to the left (right) of the word.
5. Character z appears in the word.

where x is any string of length 1 to 4 (|x| <= 4).
Starting from these templates, it is possible to create the transformations, which are learned by the Learner and evaluated; finally, only the best one is applied. This process continues until no improvement can be made. The procedure for creating a transformation is the same as the one described for the other templates in the previous section.
An example is given to better explain how the process works for unknown
words.
Example 8.
The known training corpus contains the word “make”, which is tagged in the Truth corpus as VB. Another word, “remake”, is present in the unknown training corpus, and since it is not capitalized, it is initially tagged by the initial tagger as NN (common noun).
After the whole corpus is initially tagged, the transformations are learned. We assume that the first template considered in the learning phase is “Deleting the prefix x results in a word contained in the lexicon”, and that the first word in the corpus is “remake”. The Learner deletes a prefix of one character from the word and searches the lexicon for the word “emake”. Since this word is not found, another character is removed, and the word looked for is “make”. This word is present in the lexicon and is tagged as a verb. Hence a transformation rule can be extracted:

Change the tag from NN to VB if deleting the prefix “re” results in a word contained in the lexicon.

The same procedure happens for every word found in the corpus and for every template. The transformation with the highest score is the one applied to the Guess corpus. When no more transformation rules can be learned, the training phase stops.
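As a fragment of the same hypothetical sketch, the prefix-deletion template of Example 8 could be checked like this (knownWords stands for the set of word forms in the lexicon):

// Unknown-word template 1: does deleting a prefix of length 1..4 yield a
// word already in the lexicon? Returns the prefix (e.g. "re" for "remake"
// when "make" is known), or null if the template does not fire.
static string DeletablePrefix(string word, HashSet<string> knownWords)
{
    for (int len = 1; len <= 4 && len < word.Length; len++)
        if (knownWords.Contains(word.Substring(len)))
            return word.Substring(0, len);
    return null;
}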

2.2 Test Phase
After the learning phase, it is possible to verify, on the test corpus, the accuracy of the best transformation rules learned during the training phase.
In order to verify those transformations, some initialization procedures are first completed. The first procedure is to check, for each word contained in the corpus, whether it is also contained in the lexicon. This task is accomplished by the initial tagger, which is an element of the Part-of-Speech tagger. Depending on the result of the initial tagger's check, the words are labeled differently. On one hand, the initial tagger labels every word which also belongs to the lexicon with its most frequent tag, i.e. the tag which, according to the lexicon, occurred the highest number of times for that word. On the other hand, the initial tagger labels every word which does not belong to the lexicon (unknown word) as a proper noun if the word is capitalized, otherwise as a common noun. The initially tagged corpus is called the Guess corpus.
Once the initial tagger's task has ended, the transformation rules learned in the training phase can be applied to the Guess corpus. The best transformation rules for unknown words learned in the training phase are applied to those words which are considered unknown. Once these rules are applied, the best transformation rules for known words are also applied to the Guess corpus.
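In the hypothetical sketch used throughout this chapter, this ordered application could look as follows (the unknown-word rules run first, then the known-word rules, each list in the order in which it was learned):

using System.Collections.Generic;
using System.Linq;

static void TagTestCorpus(string[] words, string[] guess,
                          List<Rule> unknownWordRules, List<Rule> knownWordRules)
{
    // Apply every learned rule to the initially tagged Guess corpus.
    foreach (var rule in unknownWordRules.Concat(knownWordRules))
        for (int i = 0; i < words.Length; i++)
            if (guess[i] == rule.FromTag && rule.Fires(words, guess, i))
                guess[i] = rule.ToTag;
}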
Finally, the corpus resulting from these transformations is compared to the Truth, which is the correctly tagged test corpus, and the accuracy of the transformation rules is verified.
The entire test-phase process is shown in the figure below.

[Diagram: (1) the test corpus is initially tagged using the lexicon, producing the Guess corpus; (2) the rules for unknown words are applied; (3) the rules for known words are applied.]
Figure 6: Test phase process

The test corpus is initially tagged (1): unknown words are labeled as proper nouns if capitalized, otherwise as common nouns, while known words are labeled with the most frequent tag extracted from the lexicon. Once the test corpus is initially tagged, the transformation rules for unknown words learned in the training phase are applied to the Guess corpus (2). Finally, the other transformation rules are applied (3).

3. Multext-Joc Corpus
Until now, this thesis has focused mainly on the Transformation-based algorithm, but in PoS tagging the corpus to annotate is also extremely important. The corpus I decided to annotate is the Multext-Joc corpus; its description is given in the first section below. Afterwards, the encoding adopted for its creation is described in the second section.
3.1 Corpus Description
The corpus I used to train and test the Transformation-based tagger is the MULTEXT-JOC corpus, developed in the Multext project financed by the European Commission. The Multext project is the largest project funded under the Linguistic Research and Engineering (LRE) programme; it was intended to contribute to the development of generally usable tools to manipulate and analyze multilingual text and speech, and to annotate them with structural and linguistic markup [5].
The corpus contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. Questions were asked by members of the European Parliament on a wide variety of topics, and the corresponding answers were given by the European Commission in 9 parallel versions, published in the Official Journal of the European Community in 1993. The corpus contains ca. 1 million words for each of five languages: English, French, German, Italian and Spanish. A subset of 200,000 words per language (except for German) was part-of-speech tagged [5].

The corpus was encoded in SGML at the paragraph level using the Corpus Encoding Standard (CES) specifications. More information about this kind of encoding is given in the second section.
For the implementation of the Transformation-based tagger, only a part of the corpus was employed. This part is composed only of questions and contains circa 12,000 words for each of the two languages considered in the internship project: English and Italian.


3.2 Corpus Encoding
The Corpus Encoding Standard (CES) is a form of encoding, that is, a means of making explicit some interpretation of a text or collection of texts. CES is optimally suited for use in language engineering and can serve as a widely accepted set of encoding standards for corpus-based work. Its overall goal is the identification of a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and linguistic information) as well as general architecture (so as to be maximally suited for use in a text database). CES distinguishes primary data, which is "unannotated" data in electronic form, most often originally created for non-linguistic purposes such as publishing, broadcasting, etc., from linguistic annotation, which comprises information generated and added to the primary data as a result of some linguistic analysis. CES also covers encoding conventions for the linguistic annotation of text and speech, including morphosyntactic tagging and parallel text alignment.

Indeed, CES provides a TEI-conformant Document Type Definition (DTD) to be used for encoding various levels of primary data, together with its documentation.

For the morphosyntactic annotation, which is the only part used by the Part-of-Speech tagger, encoding has been done in SGML (Standard Generalized Markup Language) using the cesAna specifications.
The cesAna DTD defines the syntax for segmentation and grammatical annotation, including:
• sentence boundary markup
• tokens, each of which consists of the following:
  o the orthographic form of the token as it appears in the corpus
  o grammatical annotation, comprising one or more sets of the following:
    - a morphosyntactic specification, in the EAGLES annotation style
    - a base lemma
    - a corpus tag
EAGLES is a project sponsored by the European Commission to promote the creation of de facto standards in the area of Human Language Technologies.
The structure of the DTD constituents is based on the overall principle that one or more "chunks" of a text may be included in the annotation document. These chunks may correspond to parts of the document extracted at different times for annotation, or simply to some subset of the text that has been extracted for analysis. For example, it is likely that within any text only the paragraph content will undergo morphosyntactic analysis, while titles, footnotes, captions, long quotations, etc. will be omitted or analyzed separately.
Elements in cesAna documents will, for the most part, use the notation outlined below:
<cesAna>: a single annotation document, containing a <cesHeader> element followed by a <chunkList> element. In addition to the global attributes, this element has the attribute version, which gives the version of the cesAna DTD with which the document is compliant.
<chunkList>: is composed of <chunk> elements.
<chunk>: may correspond to parts of the document extracted at different times for annotation, or simply to some subset of the text that has been extracted for analysis. It contains a series of sentences, tokens and paragraph-like elements.
<par>: marks paragraph boundaries and contains a series of token elements, a series of sentence elements, or a series of data elements.
<s>: marks sentence boundaries and contains a series of tokens or data elements; nested sentences may also appear.
<tok>: contains a token, consisting of its orthographic form in the original document, followed optionally by a disambiguated corpus tag and/or one or more alternative sets of morphosyntactic information associated with the token.
<orth>: contains the orthographic form of the token as it appears in the original, and as it may appear in a lexicon, possibly modified by processing (e.g., a compound may appear as "in_spite_of").
<ctag>: contains a corpus tag; when this tag appears within the <lex> element, it gives the corpus tag associated with the accompanying morphosyntactic information.
An excerpt of the corpus annotated with these specifications follows.
Example 9.

<!DOCTYPE CESANA PUBLIC "-//CES//DTD cesAna//EN" []>
<CESANA VERSION="1.12">
<CHUNKLIST>

<CHUNK ID="C1" TYPE="DIV_Q" FROM="1.2.1.1.2.2.3.1">

<PAR ID="C1P1" TYPE="HEAD" FROM="1.2.1.1.2.2.3.1.1">

<S ID="C3P3S3">

<TOK><ORTH>How</ORTH><CTAG>RRQ</CTAG></TOK>
<TOK><ORTH>was</ORTH><CTAG>VBDZ</CTAG></TOK>
<TOK><ORTH>it</ORTH><CTAG>PPH1</CTAG></TOK>
<TOK><ORTH>possible</ORTH><CTAG>JJ</CTAG></TOK>
<TOK><ORTH>for</ORTH><CTAG>IF</CTAG></TOK>
<TOK><ORTH>those</ORTH><CTAG>DD2</CTAG></TOK>
<TOK><ORTH>laws</ORTH><CTAG>NN2</CTAG></TOK>
<TOK><ORTH>which</ORTH><CTAG>DDQ</CTAG></TOK>
<TOK><ORTH>were</ORTH><CTAG>VBDR</CTAG></TOK>
<TOK><ORTH>reasonably</ORTH><CTAG>RR</CTAG></TOK>
<TOK><ORTH>effective</ORTH><CTAG>JJ</CTAG></TOK>
<TOK><ORTH>to</ORTH><CTAG>TO</CTAG></TOK>
<TOK><ORTH>be</ORTH><CTAG>VBI</CTAG></TOK>
<TOK><ORTH>flouted</ORTH><CTAG>VVN</CTAG></TOK>
<TOK><ORTH>and</ORTH><CTAG>CC</CTAG></TOK>
<TOK><ORTH>evaded</ORTH><CTAG>VVN</CTAG></TOK>
<TOK><ORTH>?</ORTH><CTAG>?</CTAG></TOK>
</S>

<S ID="C3P3S4">

<TOK><ORTH>What</ORTH><CTAG>DDQ</CTAG></TOK>
<TOK><ORTH>legal</ORTH><CTAG>JJ</CTAG></TOK>
<TOK><ORTH>proceedings</ORTH><CTAG>NN2</CTAG></TOK>
<TOK><ORTH>will</ORTH><CTAG>VM</CTAG></TOK>

<TOK><ORTH>now</ORTH><CTAG>RT</CTAG></TOK>
<TOK><ORTH>by</ORTH><CTAG>II</CTAG></TOK>
<TOK><ORTH>initiated</ORTH><CTAG>VVN</CTAG></TOK>
<TOK><ORTH>against</ORTH><CTAG>II</CTAG></TOK>
<TOK><ORTH>the</ORTH><CTAG>AT</CTAG></TOK>
<TOK><ORTH>offenders</ORTH><CTAG>NN2</CTAG></TOK>
<TOK><ORTH>?</ORTH><CTAG>?</CTAG></TOK>
</S>

</PAR>
</CHUNK>

</CHUNKLIST>
</CESANA>


4. Implementation of the Transformation-based PoS
Tagging Algorithm
This chapter describes a case study of Part-of-Speech tagging.
The first section explains and motivates the choice of the environment for
the implementation of the Transformation-based tagger, while the second
section is devoted to the description of the project's implementation.
4.1 .Net Environment
The implementation of the transformation-based algorithm was carried out
using the Microsoft Development Environment Visual Studio .Net 2003,
with the .Net Framework 1.1. Microsoft Visual Studio lets you write
CLR-managed code and supports languages that target the common
language runtime, such as Visual C# and Visual Basic .Net.

For this reason, the programming language used for the PoS tagger is Visual
C#. Microsoft C# is a new programming language designed for building a
wide range of enterprise applications that run on the .Net Framework.
Moreover, C#, which is an evolution of Microsoft C and Microsoft C++, is
simple and object oriented. Its code is compiled as managed code, which
means it benefits from the services of the common language runtime. These
services include language interoperability, garbage collection, enhanced
security, and improved versioning support. The library for Visual C#
programming is the .Net Framework.

The value of the .Net Framework lies in its interoperability and the seamless
connectivity of multiple systems and sources of data, which lets developers
quickly and easily create the required applications. Above all, for the
implementation of the PoS tagger, the value of the .Net Framework lies in
the many methods it provides for comparing string values. Since the most
important part of implementing a tagger is handling tokens and their
respective tags, C# and .Net are a good solution.
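
As an illustration, the following minimal C# fragment shows the kind of
string comparisons a tagger relies on; the token and tag values here are
invented for the example.

using System;

// Illustrative string comparisons of the kind used throughout the tagger.
class StringDemo
{
    static void Main()
    {
        string guessedTag = "NN1";
        string trueTag = "NN1";

        // Value equality, as when a guessed tag is compared to the truth.
        Console.WriteLine(guessedTag == trueTag);              // True
        Console.WriteLine(string.Equals(guessedTag, trueTag)); // True

        // Ordinal, case-insensitive comparison of two orthographic forms.
        Console.WriteLine(string.Compare("How", "how",
            StringComparison.OrdinalIgnoreCase) == 0);         // True
    }
}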

The .Net Framework has two main components: the common language
runtime and the .Net Framework class library.

The common language runtime manages code at execution time, providing core
services such as memory management, thread management, and remoting,
while also enforcing strict type safety and other forms of code accuracy that
ensure security and robustness.

The other main component of the .Net Framework, the class library, is
a comprehensive, object-oriented collection of reusable types that you can
use to develop applications ranging from traditional command-line or
graphical user interface (GUI) applications to applications based on the latest
innovations provided by ASP.Net, such as Web Forms and XML Web
services.

Until recently, applications created with Visual Studio and .Net could not be
ported to platforms and operating systems other than Microsoft's. Thanks to
Mono, this is now possible.


Mono, the open source development platform based on the .Net Framework,
allows developers to build Linux and cross-platform applications with
improved developer productivity. Mono's .Net implementation is based on the
ECMA standards for C#
(www.ecma-international.org/publications/standards/ecma-334.htm) and the
Common Language Infrastructure
(www.ecma-international.org/publications/standards/ecma-335.htm).
Mono includes compilers, an ECMA-compatible runtime engine (the
Common Language Runtime, or CLR), and many libraries. The libraries
include Microsoft .Net compatibility libraries (including Ado.Net,
System.Windows.Forms and ASP.Net), Mono's own and third-party class
libraries. Moreover, Gtk#
(http://www.mono-project.com/using/gtk-sharp.html), a set of .Net bindings
for the gtk+ toolkit and assorted GNOME libraries, can be found among the
latter. This library allows you to build fully native Gnome applications using
Mono and includes support for user interfaces built with the Glade interface
builder. Furthermore, Mono's runtime can be embedded into applications for
simplified packaging and shipping. Finally, the Mono project offers an IDE
(http://www.monodevelop.com), debugging, and a documentation browser.

Now that all the information about the environment chosen for the
implementation of the Transformation-based Part-of-Speech tagger has been
given, the project's implementation will be explained in Section 4.2.


4.2. The Algorithm implementation
The Transformation-based algorithm was implemented in four main
modules; hence the result is four different applications, each one
accomplishing a specific task. The outcome of the implementation is a tagger
which uses non-lexicalized and lexicalized transformations, as well as
transformations for handling unknown words.

The implemented modules and functionalities are:
• Conversion module from SGML to human-readable format
• Lexicon-Extraction and corpus-splitting module
• Training module
• Test module
Each application works independently of the others, but they are
interrelated in order to produce an annotated corpus.

These four applications are described with figures and Class diagrams in the
following sections.


4.2.1. Conversion module from SGML to human-readable format
As written in the previous sections, the corpus adopted for the Part-of-
Speech tagger was encoded in SGML and conformant to the Corpus
Encoding Standard (cesAna) specifications. Since the SGML format, very
similar to XML, is syntactically complex, I preferred to convert it into a
simpler and more human-readable format. A procedural language such as
Perl makes it possible to reformat the Joc corpus very quickly.
The entire code of the conversion script, extract.pl, which turns the corpus
into the desired format of tokens followed by their respective tags, is:
#!/usr/bin/perl
# extract.pl: convert the cesAna SGML corpus into "token tag" lines.
# set to 0 if you want to extract all sentences
$only_questions = 1;
$sentence = "";
$is_question = 0;
while (<STDIN>) {
    $orth = "";
    $ctag = "";
    if ($_ =~ /<TOK>.*<\/TOK>/) {
        s/\n$//;
        /<ORTH>(.*)<\/ORTH>/;
        $orth = $1;
        if ($orth =~ /\?/) {
            $is_question = 1;
        }
        /<CTAG>(.*)<\/CTAG>/;
        $ctag = $1;
        $sentence = "$sentence$orth $ctag\n";
    }
    # At the end of a sentence, print the buffered token/tag pairs
    # (questions only, unless $only_questions is 0) and reset.
    # (The body of this branch is truncated in the source; the closing
    # logic below is a reconstruction.)
    if ($_ =~ /<\/S>/) {
        if (!$only_questions || $is_question) {
            print "$sentence\n";
        }
        $sentence = "";
        $is_question = 0;
    }
}
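The script reads the SGML corpus from standard input and writes the
converted format to standard output. A typical invocation (the file names
here are only illustrative) is:

D:\Tagger>perl extract.pl < joc-corpusEn.sgm > joc-corpusEn.txt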



<TOK><ORTH>How</ORTH><CTAG>RRQ</CTAG></TOK>             How RRQ
<TOK><ORTH>was</ORTH><CTAG>VBDZ</CTAG></TOK>            was VBDZ
<TOK><ORTH>it</ORTH><CTAG>PPH1</CTAG></TOK>             it PPH1
<TOK><ORTH>possible</ORTH><CTAG>JJ</CTAG></TOK>         possible JJ
<TOK><ORTH>for</ORTH><CTAG>IF</CTAG></TOK>              for IF
<TOK><ORTH>those</ORTH><CTAG>DD2</CTAG></TOK>           those DD2
<TOK><ORTH>laws</ORTH><CTAG>NN2</CTAG></TOK>            laws NN2
<TOK><ORTH>which</ORTH><CTAG>DDQ</CTAG></TOK>           which DDQ
<TOK><ORTH>were</ORTH><CTAG>VBDR</CTAG></TOK>           were VBDR
<TOK><ORTH>reasonably</ORTH><CTAG>RR</CTAG></TOK>       reasonably RR
<TOK><ORTH>effective</ORTH><CTAG>JJ</CTAG></TOK>        effective JJ
<TOK><ORTH>to</ORTH><CTAG>TO</CTAG></TOK>               to TO
<TOK><ORTH>be</ORTH><CTAG>VBI</CTAG></TOK>              be VBI
<TOK><ORTH>flouted</ORTH><CTAG>VVN</CTAG></TOK>         flouted VVN
<TOK><ORTH>and</ORTH><CTAG>CC</CTAG></TOK>              and CC
<TOK><ORTH>evaded</ORTH><CTAG>VVN</CTAG></TOK>          evaded VVN
<TOK><ORTH>?</ORTH><CTAG>?</CTAG></TOK>                 ? ?

Table 4: Conversion of the format

4.2.2. Lexicon-Extraction and corpus-splitting module
This application takes as input a complete tagged corpus, in the format
shown in the previous section, and produces as output:
• Three different corpora
o A training corpus
o A training corpus for unknown words, called unknown training
corpus
o A test corpus
• A lexicon of words extracted from the training corpus only


The parameters needed for starting the application are:
Corpus filename, size of unknown training corpus, size of
training corpus, size of test corpus
• Corpus filename: the name of the entire corpus from which the
lexicon is extracted
• Size of unknown training corpus: size (number of tokens) of the
training corpus from which the rules for handling unknown words are
extracted
• Size of training corpus: size (number of tokens) of the training
corpus from which the rules are extracted
• Size of test corpus: size (number of tokens) of the test corpus on
which the extracted rules are applied

By typing the mandatory parameters in the console shell, the program is
launched; the screenshot in Figure 7 appears:


Example 10.
The user selects as parameters:
joc-corpusEn.txt 500 9000 2000

D:\Tagger>ExtractionLexicon.exe joc-corpusEn.txt 500 9000 2000

The program reads the entire corpus called joc-corpusEn.txt and
divides it into three parts. After the file is loaded, the corpus is split:
the first 500 words compose the unknown training corpus, called
unknownTrainingCorpus.txt; the following 9000 words compose the
trainingCorpus.txt; and the next 2000 words form the test corpus,
testCorpus.txt.

Figure 7: Screenshot of the launch of the Extraction-Lexicon application
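
The split itself is a straightforward partition of the token sequence
according to the three sizes. A minimal C# sketch is given below; the file
names and sizes follow the example above, the corpus having one "token
tag" pair per line, and the actual implementation may differ.

using System.IO;
using System.Linq;

// Minimal sketch of the corpus split performed by this module.
class Split
{
    static void Main()
    {
        string[] lines = File.ReadAllLines("joc-corpusEn.txt");
        File.WriteAllLines("unknownTrainingCorpus.txt", lines.Take(500));
        File.WriteAllLines("trainingCorpus.txt", lines.Skip(500).Take(9000));
        File.WriteAllLines("testCorpus.txt", lines.Skip(9500).Take(2000));
    }
}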

The results are four files, which are called:
• trainingCorpus.txt : the file containing the training corpus
• unknownTrainingCorpus.txt : the file containing the unknown training
corpus
• testCorpus.txt : the file containing the test corpus
• lexicon.txt : the file containing the lexicon extracted from the training
corpus


The classes devoted to achieving these results are represented in
Figure 8.




[Figure: class diagram showing Extraction, Tagger, TaggerUnknown,
Corpus, Lexicon and LexiconEntry, producing the file Lexicon.txt]

Figure 8: Classes used for the Extraction of the Lexicon

As shown in the figure, the application starts from the Extraction class,
which is in charge of checking that all the parameters, such as the name
of the corpus file, are given and typed in the correct format. Then, the
Extraction class calls the Corpus class, which loads the entire corpus
that the user wants to divide and splits it into three parts: the unknown
training corpus, the training corpus and the test corpus. Afterward, the
Tagger class extracts all the words with their tags from the corpus and, by
means of the Lexicon class, composes the lexicon. Each word is added to
the lexicon through the LexiconEntry class, which stores each word
together with all its possible tags, written in order of frequency of
occurrence in the corpus.
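For illustration, a single lexicon line might look as follows (the entry is
invented; the format follows the description above):

to TO II VBI

where TO is the tag most frequently observed for "to" in the training
corpus, followed by the less frequent alternatives.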

Once the application is launched, its first task is to divide the corpus into
three parts. After the corpus is divided, every token of the three corpora is
saved in an ArrayList and every PoS tag in another one. Finally, the
Tagger class generates the lexicon from the two ArrayLists containing
the tokens and tags of the training corpus.

Since every token can have more than one tag, the lexiconEntry object,
an instance of the LexiconEntry class, takes the responsibility of
assigning to a token all the related Part-of-Speech tags with their
respective frequency of occurrence, so that later the method
getMostFrequentTag can assign the most frequent tag to each word.
Thus, every word with its most frequent tag is saved in a sorted list
called list.
Finally, all the components of the sorted list are saved into a file called
Lexicon.txt.
The interaction between the Tagger and Lexicon classes is shown in
Figure 9.


Tagger
_stopTraining: int
_stopTest: int
_pathToTrueLexicon
getLexiconFromCorpus(corpus: Corpus): void

Lexicon
list: Collection
addLexiconEntry(word: String, tag: String): void
save(fileName: String): void
getMostFrequentTag(word: String): String

Figure 9: Class Diagram of the Lexicon and Tagger classes
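
A minimal sketch of how the Lexicon class of the diagram could be
realized is given below. The member names follow the class diagram, but
the internal data structures are an assumption, and modern C# collections
are used for brevity instead of the Framework 1.1 collections of the
original project.

using System.Collections.Generic;
using System.IO;
using System.Linq;

// Sketch of the Lexicon class of the diagram above (internals assumed).
public class Lexicon
{
    // word -> (tag -> frequency of occurrence in the training corpus)
    private readonly SortedDictionary<string, Dictionary<string, int>> list =
        new SortedDictionary<string, Dictionary<string, int>>();

    // Record one occurrence of the pair (word, tag).
    public void AddLexiconEntry(string word, string tag)
    {
        if (!list.ContainsKey(word))
            list[word] = new Dictionary<string, int>();
        if (!list[word].ContainsKey(tag))
            list[word][tag] = 0;
        list[word][tag]++;
    }

    // Return the most frequent tag for a word, or null if unknown.
    public string GetMostFrequentTag(string word)
    {
        if (!list.ContainsKey(word)) return null;
        return list[word].OrderByDescending(kv => kv.Value).First().Key;
    }

    // Write every word followed by its tags ordered by frequency.
    public void Save(string fileName)
    {
        using (var writer = new StreamWriter(fileName))
            foreach (var entry in list)
                writer.WriteLine(entry.Key + " " + string.Join(" ",
                    entry.Value.OrderByDescending(kv => kv.Value)
                               .Select(kv => kv.Key)));
    }
}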

Once the lexicon is extracted from the corpus, the Training phase can begin.
This phase is described in the following section.
4.2.3. Training module

The core of the algorithm is the training phase, during which the
transformation rules for tagging a corpus are learned and saved, so that later
in the Test Phase they will be applied in the order they were learned.
In order to start the application, four parameters should be provided:
• unknown training corpus
• training corpus
• minimum score of the transformation rules
• maximum number of learned rules

Once the parameters are given, the Training application starts.
Moreover, the file containing the lexicon of the training corpus should be
placed in the same folder as the application, so that during the process the
tagger can use it for the initial tagging of the corpus.

The files of the two corpora and the lexicon file can be the ones produced
by the previous application, or they can come from different procedures.
The format is fundamental, since the application recognizes only the one
shown in Table 4.

D:\Tagger>Brill.exe UnknownTrainingCorpus.txt trainingCorpus.txt 3 10



Figure 10: Screenshot of the Initial Tagging
Once the parameters are provided, the program starts. The classes included
in this module are shown in Figure 11.


[Figure: class diagram showing Training, Corpus, Tagger, TaggerUnknown,
Lexicon, LexiconEntry, Learner, LearnerUnknown, TransformationRule,
TransformationRuleUK, TemplateRule and TemplateRuleUnknown]

Figure 11: Classes used in the Training Module

The figure shows all the classes of this application. The application
starts in the Training class, which checks that the parameters are correct.
Then the Corpus class loads the unknown training corpus and the training
corpus, the two corpora from which the transformation rules are learned.
Once the two corpora are loaded, the template rules, from which the
transformation rules are instantiated, are created; this job is done by the
TemplateRule and TemplateRuleUnknown classes.
Immediately after, by means of the Tagger and TaggerUnknown classes,
the initial tagging phase is performed: each word of the training corpus is
assigned its most frequent tag, i.e. the first tag present in the lexicon,
while each word of the unknown training corpus is tagged as a proper noun
if capitalized and as a common noun otherwise. To this end, the Lexicon
class is called and is in charge of loading the lexicon file created by the
Lexicon-Extraction module. Once the initial tagging phase is over, the
Learner and LearnerUnknown classes learn the transformation rules by
means of the TransformationRule and TransformationRuleUK classes
respectively.

The process is described in detail below.
Step a) The first step in the training phase is the initial tagging. The initial
state annotation is accomplished by the Tagger class for the words that
come from the training corpus and by the TaggerUnknown class for the
words contained in the unknown training corpus.

As far as known words are concerned, the Tagger class assigns each word
its most frequent tag, taken from the lexicon file. To this end, the Lexicon
class is called: given a word, it searches the lexicon file for the tag with
the highest "score", where the score attribute represents the frequency of
occurrence of each tag. Once the tag with the highest score is found, it is
assigned to the corresponding word.

On the contrary, for all unknown words, i.e. those that come from the
unknown training corpus, the initial state annotation is done in a different
way. Since no lexicon is available for them, all the words whose first
character is uppercase are tagged as proper nouns, while all the others are
tagged as common nouns. This task is accomplished by the
TaggerUnknown class.
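
This initial-state logic can be summarized by the following sketch, where a
plain dictionary of most frequent tags stands in for the Lexicon class and
NP1/NN1 are stand-ins for the proper-noun and common-noun tags of the
actual tag set.

using System;
using System.Collections.Generic;

// Sketch of the initial-state annotation (tag names are stand-ins).
class InitialTagger
{
    static string InitialTag(string word, IDictionary<string, string> mostFrequentTag)
    {
        string tag;
        if (mostFrequentTag.TryGetValue(word, out tag))
            return tag;                               // known word
        return char.IsUpper(word[0]) ? "NP1" : "NN1"; // unknown word
    }

    static void Main()
    {
        var lexicon = new Dictionary<string, string> { { "laws", "NN2" } };
        Console.WriteLine(InitialTag("laws", lexicon));    // NN2 (known)
        Console.WriteLine(InitialTag("Bolzano", lexicon)); // NP1 (capitalized unknown)
        Console.WriteLine(InitialTag("flouted", lexicon)); // NN1 (unknown)
    }
}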

b) After the initial tagging, the tags saved in the Truth ArrayList are
compared to those assigned by the Tagger in the initial tagging phase,
which are stored in the guess ArrayList. The Truth ArrayList contains
the correct tags extracted from the original annotated corpus.

Once the initial tagging phase is over, the crucial part starts:
transformation rules are learned in order to increase the percentage of
correctly tagged words in both corpora. The transformation rules for known
words are handled by the Learner class, while the LearnerUnknown
class takes care of the rules for unknown words.

In the next paragraphs, firstly the process of learning the transformation rules
for unknown words will be explained. Secondly, the process for known
words will be analyzed.

Process of learning transformation rules for unknown words
In order to increase the tagging accuracy of the unknown training corpus,
some transformations should be applied. The class which learns these
transformations is called LearnerUnknown. This class first finds the rules
by reading the corpus sentence by sentence and then produces the rules by
using the available templates. The templates for unknown words are
described in Section 2.2.3.

Example 11.
If the best template is "Change the tag from X to Y if the word w appears to
the left", every token w to the left of an unknown word (one position before
in the ArrayList) is checked to verify whether it belongs to the lexicon. If
the word w belongs to the lexicon, the tag of the unknown word is changed
from X to Y.

For a better explanation of these templates, more details are given in
Section 2.2.1.

Once the best instantiation of the templates is found, a transformation rule
is created. The method which creates the rules is called
getRulesByTemplate. At this point each transformation rule is created
with a list of attributes, which are visible in the class diagram below.
Of all these rules, only the one with the highest score is applied. The score
is an attribute equal to the net error reduction, i.e. the difference between
the count of the rules found by the method getRulesByTemplate and the
count of the rules that have the same X, Y and Winterested, where
Winterested is the word w considered in the templates for unknown
words. These latter rules are called do-nothing rules.
The class is shown in the figure below.

TransformationRuleUK
_X: String
_Y: String
_Winterested: String
_count: int
_prefixed: Boolean
_donothingRule: Boolean
_templateUK: TemplateRuleUnknown
ToString(): String
saveRule(): String

Figure 12: TransformationRuleUK Class Diagram



As shown in the diagram, the attributes which characterize a transformation
are X, Y, Winterested, count, prefixed, donothingRule and templateUK.

Finally, after applying the best rule, a new iteration of the process starts,
until no significant improvement is achieved. Once the training phase for
unknown words is over, the training phase for known words can start.

Process of learning transformation rules for known words

It is the job of the class Learner, the learning module of the
transformation-based tagger, to extract a list of transformation rules and
compute the score of each one. The output of the transformation-based
learner (TBL) is an ordered list of transformation rules.
First of all, the Learner class reads the corpus sentence by sentence and
gets an instantiation of all the rules, beginning at the initial position and
ending at the final position. After each iteration, the start position is reset.
For each question found in the corpus, all the possible rules are learned.

Learner
rules: Collection
bestRules: Collection
templateRules: TemplateRule
iteration: int
_trainingCorpus: Corpus
_trainingUnknownCorpus: Corpus
startLearningPhase(templateRules: TemplateRule, maxRules: int,
    minScore: int, bestRulesFileName: String): void
applyRule(bestRule: TransformationRule): void
findRulesAtSentence(templateRules: TemplateRule): void
getRulesByTemplateRule(currentTemplateRuleToApply: TemplateRule,
    startPositionOfSentence: int, endPositionOfSentence: int): void
findBestRule(): TransformationRule
outputBestRules(): void

Figure 13: Learner Class Diagram
In this diagram the most important methods used by the Learner class are
shown in detail. The method startLearningPhase is the core of the
algorithm, since most of the work is done here. This method has five main
tasks to accomplish (a sketch of the resulting loop is given after the list):
• read the corpus sentence by sentence
• find the rules for each question
• get the rules by using the templates
• find the best rule to apply
• apply the best rule
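
The following self-contained sketch illustrates how these tasks combine
into a greedy loop. It is not the thesis code: a single template ("change the
tag from X to Y if the previous tag is Z") stands in for the full template set,
the scoring is condensed into one method, and the tiny corpus is invented.

using System;
using System.Collections.Generic;
using System.Linq;

// Greedy transformation-based learning loop (illustrative sketch).
class LearnerSketch
{
    // One rule of the single illustrative template:
    // change tag X to Y if the previous tag is Z.
    record Rule(string X, string Y, string Z);

    // Net error reduction: +1 for every position the rule fixes,
    // -1 for every correct tag it would corrupt.
    static int Score(Rule r, string[] guess, string[] truth)
    {
        int s = 0;
        for (int p = 1; p < guess.Length; p++)
            if (guess[p] == r.X && guess[p - 1] == r.Z)
                s += truth[p] == r.Y ? 1 : -1;
        return s;
    }

    static void Main()
    {
        string[] truth = { "AT", "NN1", "VVZ", "AT", "NN1" };
        string[] guess = { "AT", "NN1", "NN1", "AT", "NN1" }; // initial tagging
        var bestRules = new List<Rule>();
        const int minScore = 1, maxRules = 10;

        for (int i = 0; i < maxRules; i++)
        {
            // Scan the corpus and read candidate rules off the errors.
            var candidates = new HashSet<Rule>();
            for (int p = 1; p < guess.Length; p++)
                if (guess[p] != truth[p])
                    candidates.Add(new Rule(guess[p], truth[p], guess[p - 1]));
            if (candidates.Count == 0) break;

            // Find the best rule; stop if it does not score enough.
            Rule best = candidates.OrderByDescending(r => Score(r, guess, truth)).First();
            if (Score(best, guess, truth) < minScore) break;
            bestRules.Add(best);

            // Apply the best rule to the whole corpus, then iterate.
            var hits = new List<int>();
            for (int p = 1; p < guess.Length; p++)
                if (guess[p] == best.X && guess[p - 1] == best.Z) hits.Add(p);
            foreach (int p in hits) guess[p] = best.Y;
        }
        foreach (Rule r in bestRules)
            Console.WriteLine($"{r.X} -> {r.Y} if the previous tag is {r.Z}");
    }
}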

In order to get the rules, the Learner class must instantiate an object of
the TransformationRule class for every transformation.

The only difference between this class and the TransformationRuleUK
class is the presence of the attributes Z1 and Z2 instead of Winterested;
Z1 and Z2 are the components of the templates (for more details see
Section 2.2.1).
Once all the rules have been found and saved, the rule with the highest
score is applied. Before this, all the rules with a different rewrite part are
saved into a SortedList; these latter rules are called asterisk rules.
Immediately after, the asterisk rules are compared to the rules that have
the same X and Z but a different Y. Since every rule has a count attribute,
the score is equal to the difference between the count variables of the two
rules.
In this way, every rule has a score, and the one with the highest score is
applied.
Step b is repeated until no significant improvement is achieved.
The best rule found at each iteration is saved, because in the test phase
the tagger will apply them in the same order. In the figure below a
screenshot of the application is shown.


Figure 14: Screenshot of Training Application
4.2.4. Test module
In the test phase, the tagger verifies the accuracy of the transformation rules
extracted in the training module.
The only parameter needed for starting the test application is:
corpusfilename
• corpusfilename: the file name of the annotated corpus

However, it is important that the lexicon file generated from the training
corpus is present in the application folder; otherwise, every word found by
the test application is considered unknown.


By launching the application with the corpusfilename, the test phase
starts. The screenshot of the application immediately after the launch of the
program is shown in Figure 15.


Figure 15: Screenshot of Test phase
Moreover, an overview of the classes belonging to the Test Module is given
in Figure 16.

[Figure: class diagram showing Training, Corpus, Tagger, TaggerUnknown,
Lexicon, LexiconEntry, RuleApplier, RuleUKApplier, TransformationRule,
TransformationRuleUK, TemplateRule and TemplateRuleUnknown]

Figure 16: Classes used in the Test Module
The figure immediately shows the similarity between this phase and the
Training phase. The main difference is the absence of learning: since in
this phase all the transformation rules found in the Training phase are
simply applied, no learning takes place. The classes which apply these
transformations are RuleApplier and RuleUKApplier.



a) The first step in the test phase is the same as step a) described in the
Training phase (see Section 4.2.3).

b) After the initial tagging, all the best rules extracted from the unknown
training corpus and all the best rules extracted from the training corpus are
applied. First, the rules for unknown words are applied to the words not
contained in the lexicon created by the Lexicon-Extraction module; then,
all the other rules are applied.

After all the best rules have been applied, the guess and truth
ArrayLists are compared to evaluate the tagger's accuracy.
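
The comparison amounts to counting the positions at which the two lists
agree. A minimal sketch of this accuracy computation (the class and method
names here are mine) is:

using System;
using System.Collections.Generic;

// Minimal sketch of the accuracy computation of the test phase.
class Evaluation
{
    static double Accuracy(IList<string> guess, IList<string> truth)
    {
        // Count the positions where the guessed tag equals the true tag.
        int correct = 0;
        for (int i = 0; i < guess.Count; i++)
            if (guess[i] == truth[i]) correct++;
        return (double)correct / guess.Count;
    }

    static void Main()
    {
        var truth = new List<string> { "RRQ", "VBDZ", "PPH1", "JJ" };
        var guess = new List<string> { "RRQ", "VBDZ", "NN1", "JJ" };
        Console.WriteLine(Accuracy(guess, truth)); // 0.75
    }
}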


5. Results, Difficulties and Further work
This chapter is devoted to elucidating the results obtained by running the
Part-of-Speech tagger implemented with the Transformation-based
algorithm.

The implemented tagger can be used to annotate many corpora and
achieves different results depending on them. Here it annotates the
multilingual corpus employed in this case study, with the intention of
analyzing the results obtained for its different languages.

Therefore, in the first section of this chapter I provide the results of the initial
tagging phase and the list of the transformation rules generated for the
multilingual corpus by considering only two languages: English and Italian.