Natural Language Processing Laboratory: the CCS-DLSU Experience


Rachel Edita Roxas, Nathalie Rose Lim, Charibeth Cheng
College of Computer Studies
De La Salle University
{rachel.roxas, nats.lim, chari.cheng}@delasalle.ph



ABSTRACT

As the premier human language technology center in the country, we present the diverse research activities of the Natural Language Processing Laboratory of the College of Computer Studies, De La Salle University, focusing mainly on the projects that we have embarked on. These projects include the formal representation of human languages and the processes involving these languages. Language representation entails the development of language resources such as lexicons and corpora for various human languages, including Philippine languages, across various forms such as text, speech, and video files. Applications on languages that we have worked on include Machine Translation, Question-and-Answering Systems, Information Extraction, Natural Language Generation, Automated Text Summarization and Simplification, and Language Education. These applications provide the current human language interface for communication, searching, and learning, to name a few.

Keywords: Natural Language Processing, Human Language Technology

1. INTRODUCTION

There are more than 6,000 living natural languages in the world. In the Philippines alone, there are 168 natively spoken languages [24]. Natural language processing (NLP, or human language technology) provides modern technology-based solutions to bridge the gap across languages. It is concerned with the interactions between humans and computers through natural languages.

Main drivers of NLP research include the need for intelligent natural interfaces and the problem of information overload. Intelligent natural interfaces should allow the user to communicate with the machine through natural language. Since information is readily available through various means, especially with the advent of the internet, there is also information overload [11]. NLP provides an interface that automatically filters relevant information for the user.

Sub-areas in NLP include natural language understanding (NLU), natural language generation (NLG), information retrieval, information extraction, machine translation, and question-answering. Interleaving with these applications are the computational layers of language resources necessary for the automatic representation and processing of human languages.

The Natural Language Processing Laboratory of the College of Computer Studies, De La Salle University, has endeavored to address both the language resources and the corresponding computational processes. We will discuss some of these projects and the future directions for NLP research in the country.

2. LANGUAGE RESOURCES

Though linguistic information on Philippine languages is available, the focus thus far has been on theoretical linguistics, and little has been done on the computational aspects. Since these language resources are the major building blocks of natural language processing, it is imperative that they be addressed. We report here our attempts at the manual construction of language resources such as the lexicon, morphological information, grammar, and corpora, which were literally built from almost non-existent digital forms. Due to the inherent difficulties of manual construction, we also discuss our experiments on various technologies for the automatic extraction of these resources to handle the intricacies of the Filipino language, designed with the intention of using them for various language technology applications.

One of the main resources of any system that involves natural language is the list of words, referred to as the lexicon. These words have associated information depending on the purpose of the application. For instance, for the automatic translation of documents from one natural language to another, a bi-directional lexicon is essential. Currently, the English-Filipino lexicon contains 23,520 English and 20,540 Filipino word senses, with information on part of speech and co-occurring words, based on the dictionary of the Komisyon sa Wikang Filipino. Additional information, such as synset IDs from the Princeton WordNet, was integrated into the lexicon [30]. As manually populating the database with the synset IDs from WordNet is tedious, automating the process through SUMO (Suggested Upper Merged Ontology) as an InterLingual Index (ILI) is now being explored.

Initial work on the manual collection of documents on Philippine languages has been done through funding from the National Commission for Culture and the Arts, covering four major Philippine languages, namely Tagalog, Cebuano, Ilocano, and Hiligaynon, with 250,000 words each, and the Filipino sign language with 7,000 signs [36]. Computational features include word frequency counts and a concordancer that allows viewing co-occurring words in the corpus.

Aside from possibilities of connecting the Philippine islands and regions through language, we are also aiming at crossing the boundaries of time [35]. An unexplored but equally challenging area is the collection of historical documents that will allow research on the development of the Philippine languages through the centuries. An interesting piece of historical information is the Doctrina Christiana, the first work ever published in the country, in 1593, which shows the translation of religious material in the local Philippine script, the Alibata, and in Spanish. A sample page is shown in Figure 1.

Figure 1: Sample Page: Doctrina Christiana (courtesy of the University of Sto. Tomas Library, 2007)

Attempts are being made to expand on these language resources and to complement the manual efforts to build them. Automatic methods and social networking are the two main options currently being considered.

2.1. Language Resource Builder

Automatic methods for bilingual lexicon extraction, named-entity extraction, and corpora building are also being explored to exploit the resources available on the internet. These automatic methods are discussed in detail in this section.

An automated approach to extracting a bilingual lexicon from comparable, non-parallel corpora was developed for English as the source language and Tagalog as the target language, the latter having limited electronic linguistic resources available [39]. We combined approaches from previous research, which concentrated only on context extraction, clustering techniques, or the use of part-of-speech tags for defining the different senses of a word; together with ranking, this improved the overall F-measure from 7.32% to 10.65%, within the range of values from previous studies. This is despite the use of a limited amount of corpora (400k words) and a seed lexicon of 9,026 entries, in contrast to previous studies that used 39M words and 16,380 entries, respectively.
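As a minimal illustration of the general context-based extraction technique (not the actual system of [39]): build a co-occurrence context vector for the source word, project it into the target language through the seed lexicon, and rank candidate target words by vector similarity. The corpora, seed lexicon, and window size in the Python sketch below are toy assumptions.

from collections import Counter
from math import sqrt

def context_vector(word, corpus, window=2):
    """Count words co-occurring with `word` within +/- `window` tokens."""
    vec = Counter()
    for sent in corpus:
        for i, tok in enumerate(sent):
            if tok == word:
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        vec[sent[j]] += 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_translations(src_word, src_corpus, tgt_corpus, seed_lexicon, tgt_words):
    """Project the source context vector via the seed lexicon, then rank targets."""
    src_vec = context_vector(src_word, src_corpus)
    projected = Counter()
    for w, c in src_vec.items():
        for t in seed_lexicon.get(w, []):   # translate each context word
            projected[t] += c
    scored = [(t, cosine(projected, context_vector(t, tgt_corpus))) for t in tgt_words]
    return sorted(scored, key=lambda x: -x[1])

# Toy data (hypothetical): English source, Tagalog target, tiny seed lexicon.
en = [["the", "dog", "barked"], ["the", "dog", "slept"]]
tl = [["ang", "aso", "tumahol"], ["ang", "aso", "natulog"]]
seed = {"the": ["ang"], "barked": ["tumahol"], "slept": ["natulog"]}
print(rank_translations("dog", en, tl, seed, ["aso", "tumahol"]))  # "aso" ranks first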

NER-Fil is a Named Entity Recognizer for Filipino text [29]. The system automatically identifies and stores named entities from documents, and can also be used to annotate corpora with named-entity information. Using machine learning techniques, named entities are also automatically classified into appropriate categories such as person, place, and organization.

AutoCor is an automatic retrieval system for documents written in closely-related languages [15]. Experiments have been conducted on three closely-related Philippine languages, namely Tagalog, Cebuano, and Bicolano. Input documents are matched against the n-gram language models of relevant and irrelevant documents. Using common word pruning to differentiate between the closely-related Philippine languages, and the odds ratio query generation method, results show improvements in the precision of the system.
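The matching step can be pictured as follows, under toy assumptions: build character n-gram profiles per language after pruning words shared by all language samples (such words cannot discriminate between closely related languages), then score a document by n-gram overlap. The sketch below illustrates the general technique, not AutoCor's actual models or its odds ratio query generation; the one-sentence "corpora" are invented.

from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram profile of a text, with padding spaces."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def prune_common_words(texts):
    """Drop words shared by every language sample: forms like 'ang' or 'mga'
    occur across closely-related Philippine languages and carry no signal."""
    word_sets = [set(t.split()) for t in texts.values()]
    shared = set.intersection(*word_sets)
    return {lang: " ".join(w for w in t.split() if w not in shared)
            for lang, t in texts.items()}

def overlap_score(doc, model, n=3):
    """Fraction of the document's n-grams that appear in the language model."""
    grams = char_ngrams(doc, n)
    total = sum(grams.values()) or 1
    return sum(c for g, c in grams.items() if g in model) / total

# Toy one-sentence 'corpora' (hypothetical, for illustration only).
samples = {"tagalog": "ang mga bata ay naglalaro sa kalsada",
           "cebuano": "ang mga bata nagdula sa dalan"}
models = {lang: char_ngrams(t, 3) for lang, t in prune_common_words(samples).items()}
doc = "naglalaro ang bata"
print(max(models, key=lambda lang: overlap_score(doc, models[lang])))  # tagalog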

Although automatic methods can facilitate the building of the language resources needed for processing natural languages, these automatic methods usually employ learning approaches that require existing language resources as seed or training data sets.

2.2. Online Community for Corpora Building

PALITO is an online repository of the Philippine corpus [36]. It is intended to allow linguists and language researchers to upload text documents written in any Philippine language, and would eventually function as corpora for Philippine language documentation and research. Automatic tools for data categorization and corpus annotation are provided by the system. The LASCOPHIL (La Salle Corpus of Philippine Languages) Working Group is assisting the project developers of PALITO in refining the mechanics for the levels of users and their corresponding privileges, for manageable monitoring of the corpora. Videos on the Filipino sign language can also be uploaded into the system. Uploading of speech recordings will be considered in the near future, to address the need to employ the best technology to document and systematically collect speech recordings of nearly-extinct languages in the country. This online system capitalizes on the opportunity for the corpora to expand faster and wider with the involvement of more people from various parts of the world. It also exploits the reality that many Filipinos here and abroad are native speakers of their own local languages or dialects and can largely contribute to the growth of the corpora on Philippine languages.


3. LANGUAGE TOOLS

Language tools are applications that support linguistic research and the processing of various language computational layers, from lexical units to syntax and semantics. Specifically, we have worked on morphological processes, part-of-speech tagging, and parsing. These processes usually employ either the rule-based approach or the example-based approach. In general, rule-based approaches formally capture language processes, which requires consultations and inputs from linguists. On the other hand, example-based approaches employ machine learning methodologies, where rules are learned automatically from manually annotated data that are also prepared by linguists.

3.1. Morphological Processes

In general, morphological processes are categorized as morphological analysis or morphological generation. Morphological analyzers (MA) are automated systems that derive the root word of a transformed word, and identify the affixes used and the changes in semantics due to the word transformation. In this way, root words and their derivatives do not have to be stored in the lexicon. On the other hand, morphological generators (MG) transform a root word into its surface form given the desired word usage.

We have tested both rule-based and example-based approaches in developing our MA and MG. Current rule-based morphological analysis methods, such as finite-state and unification-based ones, are predominantly effective for handling concatenative morphology (e.g., prefixation and suffixation), although some of these techniques can also handle limited non-concatenative phenomena (e.g., infixation and partial and full-stem reduplication), which are largely used in Philippine languages. TagMA [21] uses a constraint-based method to perform morphological analysis that handles both concatenative and non-concatenative morphological phenomena, based on the optimality theory framework and the two-level morphology rule representation. Test results showed 96% accuracy. The 4% error is attributed to d-r alternation, an example of which is in the word lakaran, which comes from the root word lakad and the suffix -an, with d changed to r. Unfortunately, since all candidates are generated and erroneous ones are later eliminated through constraints and rules, time efficiency is affected by the exhaustive search performed.
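To make the generate-and-filter idea concrete, the following minimal sketch proposes every affix split for a word, undoes the d-r alternation, and keeps only candidates whose root is in the lexicon. The affix lists and the root lexicon are toy assumptions for illustration, not TagMA's actual rule base.

# Minimal generate-and-filter sketch of constraint-based morphological
# analysis. Affixes, alternations and the root lexicon are toy assumptions.
ROOTS = {"lakad", "hintay", "bili"}
SUFFIXES = ["an", "in"]
PREFIXES = ["mag", "um", "ma"]

def candidate_roots(word):
    """Generate all affix-stripping candidates, including undoing the
    d-r alternation (root-final d surfaces as r before a suffix:
    lakad + -an -> lakaran)."""
    cands = [(word, None, None)]
    for s in SUFFIXES:
        if word.endswith(s):
            stem = word[: -len(s)]
            cands.append((stem, None, s))
            if stem.endswith("r"):               # undo d -> r alternation
                cands.append((stem[:-1] + "d", None, s))
    for p in PREFIXES:
        if word.startswith(p):
            cands.append((word[len(p):], p, None))
    return cands

def analyze(word):
    """Filter candidates through the lexicon constraint: the root must exist."""
    return [(r, p, s) for r, p, s in candidate_roots(word) if r in ROOTS]

print(analyze("lakaran"))   # [('lakad', None, 'an')]

As the prose above notes, exhaustively generating candidates before filtering is what costs time in the real system; the sketch inherits the same behavior.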

To augment the rule-based approach, an example-based approach was explored by extending Wicentowski's WordFrame model through the learning of morphology rules from examples. In the WordFrame model, the seven-way split re-write rules are composed of the canonical prefix/beginning, point-of-prefixation, common prefix substring, internal vowel change, common suffix substring, point-of-suffixation, and canonical suffix/ending. Infixation and partial and full reduplication, as in Tagalog and other Philippine languages, are improperly modeled in the WordFrame model as point-of-prefixation, as in the word (hin)-intay, which should have been modeled as the word hintay with the infix -in-. Words with an infix within a prefix are also modeled as point-of-prefixation, as in the word (hini-)hintay, which should be represented as the infix -in- within the partially reduplicated syllable hi-.

In the revised WordFrame model [10], non-concatenative Tagalog morphological behaviors such as infixation and reduplication are modeled separately and correctly. Unfortunately, it is still not capable of fully modeling Filipino morphology, since some occurrences of reduplication are still represented as point-of-suffixation for various locations of the longest common substring. There are also some problems in handling the occurrence of several partial or whole-word reduplications within a word. Despite these problems, training the algorithm that learns these re-write rules on 40,276 Filipino word pairs derived 90% accuracy when applied to an MA. Creating a better model would be computationally costly, but it would ensure an increase in performance and a reduced number of rules.
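The following minimal sketch illustrates, under simplified assumptions, how infixation and partial reduplication can be modeled as separate rewrite operations, as in the revised model's treatment of hinintay and hinihintay. The consonant and infix patterns here are our own illustrative choices, not the model's actual rule inventory.

import re

# Sketch: treat infixation and partial reduplication as their own operations
# instead of forcing them into point-of-prefixation. Simplified assumptions.
INFIXES = ["um", "in"]

def strip_infix(word):
    """hinintay -> (hintay, 'in'): the infix sits after the first consonant."""
    for inf in INFIXES:
        m = re.match(rf"^([bcdfghklmnpqrstvwyz])({inf})(.*)$", word)
        if m:
            yield m.group(1) + m.group(3), inf

def strip_partial_redup(word):
    """hihintay -> hintay: the first CV syllable copies the root's first CV."""
    m = re.match(r"^([bcdfghklmnpqrstvwyz][aeiou])(.+)$", word)
    if m and m.group(2).startswith(m.group(1)):
        return m.group(2)
    return None

def analyze(word):
    results = []
    for stem, inf in strip_infix(word):        # hinihintay -> hihintay + -in-
        root = strip_partial_redup(stem)
        results.append((root or stem, inf, bool(root)))
    return results

print(analyze("hinintay"))    # [('hintay', 'in', False)]
print(analyze("hinihintay"))  # [('hintay', 'in', True)]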


Work is still to be done on exploring techniques and methodologies for morphological generation (MG). Although it could be inferred that the approaches for MA can be extended to handle MG, an additional disambiguation process is necessary to choose the appropriate output from the many possible surface forms of words that can be generated from one underlying form.

3.2. Part-of-Speech Tagging

Some of the most useful information in language corpora is the part-of-speech tag associated with each word in the corpora. These tags allow applications to perform other syntactic and semantic processes. Firstly, with the aid of linguists, we have come up with a revised tagset for Tagalog, since a close examination of existing tagsets for languages such as English showed their insufficiency to handle certain phenomena in Philippine languages. Manual tagging of corpora has allowed us to perform automatic experiments on some approaches to tagging for Philippine languages, namely MBPOST, PTPOST4.1, TPOST, and TagAlog, each one exploring a particular tagging approach, such as memory-based POS, template-based, and rule-based approaches. A study on the performance of these taggers showed accuracies of 85%, 73%, 65%, and 61%, respectively [32].

3.3. Language Grammars

Grammar checkers are some of the applications where a syntactic specification of languages is necessary. SpellCheF is a spell checker for Filipino that uses a hybrid approach in detecting and correcting misspelled words in a document [8]. Its approach is composed of dictionary lookup, n-gram analysis, Soundex, and character distance measurements. It is implemented as a plug-in to OpenOffice Writer. Two spelling rulebooks, namely the Komisyon sa Wikang Filipino 2001 Revision of the Alphabet and Guidelines in Spelling the Filipino Language, and the Gabay sa Editing sa Wikang Filipino, were incorporated into the system. SpellCheF consists of the lexicon builder, the detector, and the corrector, all of which utilize both manually formulated and automatically learned rules to carry out their respective tasks.
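A minimal sketch of the hybrid detect-and-correct idea follows: detection by dictionary lookup, and correction ranked by a simplified Soundex code plus Levenshtein edit distance. The toy lexicon is an assumption, and the n-gram analysis and rulebook components of the real system are omitted.

# Sketch of the hybrid idea: detect by dictionary lookup, then rank
# corrections by a phonetic code plus edit distance. Toy lexicon assumed.
LEXICON = {"kumain", "kumuha", "kumanta", "bahay"}

def soundex(word):
    """Simplified Soundex: first letter plus digits from consonant classes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming (rolling array)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def correct(word):
    if word in LEXICON:
        return word                      # detector: known word, no change
    return min(LEXICON, key=lambda w: (soundex(w) != soundex(word),
                                       edit_distance(w, word)))

print(correct("kumaen"))  # -> kumain (same Soundex code, edit distance 1)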


FiSSAn, on the other hand, is a semantics-based grammar checker [4]. This software is also a plug-in to OpenOffice. Lastly, PanPam is an extension of FiSSAn that also incorporates a dictionary-based spell checker [28].

These systems make use of the rule-based approach. To complement them, an example-based approach is considered through a grammar rule induction method [1]. Constituent structures are automatically induced using unsupervised probabilistic approaches. Two models are presented, and results on the Filipino language show an F1 measure of greater than 69%. Experiments revealed that the Filipino language does not follow a strict binary structure as English does, but is more right-biased.

A similar experiment has been conducted on grammar rule induction for the automatic parsing of the Philippine component of the International Corpus of English (ICE-PHI) [19]. Automatic part-of-speech (POS) tagging will first be performed using the tagger that was trained and used on the Great Britain component of the ICE. Differences in expected mark-up syntax and lexical items might surface during POS tagging, thus requiring additional editing of the text documents. After automatic tagging, manual post-editing will be performed to correct some expected mistagging, some of which would be due to the inclusion of indigenous words, such as Tagalog words, in the ICE-PHI texts. The POS tagger will then be retrained using the refined data for re-processing. Constituent rule induction is performed on manually syntactically bracketed files from the ICE-PHI, and will be used to parse the rest of the corpus. Manual post-editing of the parse will be performed. The development of such tools will directly benefit the descriptive and applied linguistics of Philippine English, as well as of other Englishes, in particular those language components in the ICE.

4. LANGUAGE APPLICATIONS

Various applications have been created to cater to different needs. Needs range from summarizing to question answering, and from the domains of education to the arts. Below are some of the language technology applications that have been developed at the NLP Laboratory, College of Computer Studies, De La Salle University.

4.1. Machine Translation

The Hybrid English-Filipino Machine Translation (MT) System is a three-year project (with funding from PCASTRD, DOST) that involves a multi-engine approach to the automatic translation of English and Filipino [33]. The MT engines explore approaches in translation using a rule-based method and two example-based methods. The rule-based approach requires the formal specification of the human languages covered in the study and utilizes these rules to translate the input. The two other MT engines make use of examples to determine the translation. The example-based MT engines have different approaches in their use of the examples (which are existing English and Filipino documents), as well as in the data that they are learning. Refer to Figure 2 for the architectural diagram.

The system accepts as input a sentence or a document in the source language and translates this into the target language. If the source language is English, the target language is Filipino, and vice versa. The input text undergoes preprocessing that includes POS tagging and morphological analysis. After translation, the output translation undergoes natural language generation, including morphological generation. Since each of the MT engines would not necessarily produce the same output translation, an additional component called the Output Modeler was created to determine the most appropriate among the translation outputs [23].

There are ongoing experiments on the hybridization of the rule-based and the template-based approaches, where transfer rules and unification constraints are derived [20].

One of the main problems in language processing, especially compounded in machine translation, is finding the most appropriate translation of a word when source words have several meanings and various target word equivalents depending on the context of the source word. One particular study focuses on the use of syntactic relationships to perform word sense disambiguation [17]. It uses an automated approach for resolving target-word selection, based on "word-to-sense" and "sense-to-word" relationships between source words and their translations, using syntactic relationships (subject-verb, verb-object, adjective-noun). Using information from a bilingual dictionary and word similarity measures from WordNet, a target word is selected using statistics from a target language corpus. Test results using English-to-Tagalog translations showed an overall 64% accuracy in selecting word translations.
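A minimal sketch of the "sense-to-word" selection step: given a source word and a syntactically related source word (here, a verb-object style pair), each candidate translation is scored by how often it co-occurs with the translation of the related word in a target-language corpus. The dictionary entries and co-occurrence counts below are invented for illustration, not data from [17].

from collections import Counter

# Toy disambiguation by syntactic-relation statistics (hypothetical data).
BILINGUAL = {"bank": ["bangko", "pampang"],   # financial bank / river bank
             "money": ["pera"], "river": ["ilog"]}

# Invented (partner, candidate) counts standing in for real corpus statistics.
TARGET_COOC = Counter({("pera", "bangko"): 40, ("pera", "pampang"): 0,
                       ("ilog", "pampang"): 25, ("ilog", "bangko"): 1})

def select_translation(src_word, related_src_word):
    """Pick the target word that best co-occurs with the translation of the
    syntactically related source word."""
    partners = BILINGUAL.get(related_src_word, [])
    best, best_score = None, -1
    for cand in BILINGUAL[src_word]:
        score = sum(TARGET_COOC[(p, cand)] for p in partners)
        if score > best_score:
            best, best_score = cand, score
    return best

print(select_translation("bank", "money"))  # -> bangko
print(select_translation("bank", "river"))  # -> pampang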

Figure 2: The Architecture of the Hybrid English-Filipino Machine Translation System (components: User Interface, POS Tagger and Morphological Analyzer/Generator, Rule-Based MT, Example-Based MT1, Example-Based MT2, Output Modeler, and Lexicon and Corpora, from source language input to target language output)

4.1.1. Rule-based Machine Translation Engine

The rule-based MT builds a database of language representation rules and translation rules from linguists and other experts on translation from English to Filipino and from Filipino to English. We have considered lexical functional grammar (LFG) as the formalism to capture these rules. Given a sentence in the source language, the sentence is processed and a computerized representation of this sentence in LFG is constructed. An evaluation of how comprehensive and exhaustive the identified grammar is must be considered. Is the system able to capture all possible Filipino sentences? How are all possible sentences to be represented, since Filipino exhibits some form of free word order in sentences? The next step is the translation step, that is, the conversion of the computerized representation of the input sentence into the intended target language. After the translation process, the computerized representation of the sentence in the target language is output in sentence form, in what is called the generation process. Although it has been shown in various studies elsewhere and on various languages that LFG can be used for the analysis of sentences, there is still a question of whether it can be used for the generation process. Generation involves outputting a sentence from a computer-based representation of the sentence. This is part of the work that the group intends to address.

The major advantage of rule-based MT over other approaches is that it can produce high-quality translations for sentence patterns that were accurately captured by the rules of the MT engine; but unfortunately, it cannot provide good translations for any sentence that goes beyond what the rules have considered.

4.1.2. Corpus-based Machine Translation Engines

In contrast to the rule-based MT, which requires building the rules by hand, the corpus-based MT system automatically learns how translation is done through examples found in a corpus of translated documents. The system can incrementally learn when new translated documents are added into the knowledge base; thus, any changes to the language can also be accommodated through updates on the example translations. This means it can handle the translation of documents from various domains [2].

The principle of garbage-in, garbage-out applies here: if the example translations are faulty, the learned rules will also be faulty. That is why, although human linguists do not have to specify and come up with the translation rules, the linguist will have to first verify the translated documents, and consequently the learned rules, for accuracy.

It is not only the quality of the collection of translations that affects the overall performance of the system, but also the quantity. The collection of translations has to be comprehensive so that the translation system produced will be able to translate as many types of sentences as possible. The challenge here is coming up with a quantity of examples that is sufficient for accurate translation of documents.

With more data, a new problem arises when the knowledge base grows so large that accessing it and searching for applicable rules during translation requires a tremendous amount of access time and, in the extreme, becomes difficult. Exponential growth of the knowledge base may also happen due to the free word order nature of Filipino sentence construction, such that one English sentence can be translated into several Filipino sentences. When all these combinations are part of the translation examples, a translation rule will be learned and extracted by the system for each combination, thus causing growth of the knowledge base. Thus, algorithms that perform generalization of rules are considered, to remove the specificity of the extracted translation rules and thus reduce the size of the rule knowledge base.
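One simple form of such generalization can be sketched as follows: two word-aligned example translations that differ in exactly one position collapse into a single template with a variable slot, so the knowledge base no longer needs one rule per lexical combination. This is an illustrative reduction of the idea, not the system's actual algorithm; the example pairs are invented.

# Sketch of rule generalization: two examples differing in one aligned word
# collapse into a single template with a variable slot.
def generalize(pair_a, pair_b):
    """Merge two (source, target) token-list examples into a template if they
    differ in exactly one position on each side."""
    (sa, ta), (sb, tb) = pair_a, pair_b
    if len(sa) != len(sb) or len(ta) != len(tb):
        return None
    sdiff = [i for i, (x, y) in enumerate(zip(sa, sb)) if x != y]
    tdiff = [j for j, (x, y) in enumerate(zip(ta, tb)) if x != y]
    if len(sdiff) == len(tdiff) == 1:
        src = sa[:sdiff[0]] + ["<X>"] + sa[sdiff[0] + 1:]
        tgt = ta[:tdiff[0]] + ["<X>"] + ta[tdiff[0] + 1:]
        # record the lexical fillers that the slot may take
        fillers = {sa[sdiff[0]]: ta[tdiff[0]], sb[sdiff[0]]: tb[tdiff[0]]}
        return src, tgt, fillers
    return None

ex1 = ("the child ate".split(), "kumain ang bata".split())
ex2 = ("the man ate".split(), "kumain ang lalaki".split())
print(generalize(ex1, ex2))
# (['the', '<X>', 'ate'], ['kumain', 'ang', '<X>'],
#  {'child': 'bata', 'man': 'lalaki'})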

4.2. Text Summarization

SUMMER TXT automatically summarizes a document given the desired percentage of reduction [16]. It formally captures the information in the training data set and the relationships among these data using rhetorical structure theory. Thus, the summarized text maintains coherence without having to resort to copying whole sentences from the original text. In addition, deletion of an arbitrary amount of source material has the potential of losing essential information. Evaluation against existing commercially available software has shown that the output of SUMMER TXT is comparable to these systems. Unfortunately, the domain of the training and test data has been limited to one particular author and one particular domain. Experiments on this approach for a wider range of authors, styles, and their corresponding domains are yet to be performed.

4.3. Text Simplification

SimText is a text simplification system that accepts as input a medical document and transforms complex sentences into a set of equivalent simpler sentences, with the goal of making the resulting text easier to read for some target group [14]. The simplification includes the use of easier-to-understand terminologies and shorter sentence constructs, considering the specified reading level of the intended target users. The text simplification process identifies components of a sentence that may be separated out, and transforms each of these into free-standing simpler sentences. Some nuances of meaning from the original text may be lost in the simplification process, since sentence-level syntactic restructuring can possibly alter the meaning of the sentence.

4.4. Educational Applications

In the field of education, some applications that we have developed include Kids Quest III, Picture Books, MesCH, Popsicle, and the Automated Essay Evaluator. Kids Quest is a tool that automatically generates the animation of a story based on the story inputted by the child [6]. It incorporates spelling and grammar checking features and renders the animation of the inputted story.

In a reverse process, Picture Books generates stories for children from an input of character and object stickers [26]. The child chooses the stickers (representing the characters and objects), and the system associates these objects with a (manually created) ontology and story pattern, from which a story is then generated. In a sample screenshot, the left side of the screen is where the child has already chosen characters and is in the process of choosing objects; the generated story appears on the right side of the screen.

On the other hand, MesCH is a software system that accepts children's stories and automatically generates multiple-choice questions to test the child's reading comprehension [18]. The program rephrases parts of the story into 4W questions (who, what, when, where), sequence questions (which came first), and vocabulary questions. To illustrate, from a sentence in a story, "Slimy tadpoles came out from the eggs", the system will generate the following possible stems:

1. What came out from the eggs?

2. Where did the slimy tadpoles come out?

3. In the sentence, "Slimy tadpoles came out from the eggs," what does the verb "came out" mean?

4. In the sentence, "Slimy tadpoles came out from the eggs," what does the adjective "slimy" mean?

The system considers principles in instructional assessment, such as the formulation of 4W questions and the construction of distractors through the use of entries in WordNet that relate to the correct answer.
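As a rough illustration of stem generation (not MesCH's actual implementation), the sketch below turns a simple declarative sentence into a "What" stem by replacing its subject, and builds distractors from a small related-word table that stands in for WordNet lookups. The naive word-position parse and the table entries are assumptions.

import random

# Toy 4W stem generation with distractors. The related-word table is a
# hypothetical stand-in for WordNet entries related to the correct answer.
RELATED = {"tadpoles": ["frogs", "fish", "snails"]}

def make_what_question(sentence, subject):
    """'Slimy tadpoles came out from the eggs' -> stem plus choices."""
    words = sentence.split()
    idx = words.index(subject)
    stem = "What " + " ".join(words[idx + 1:]) + "?"
    choices = [subject] + RELATED.get(subject, [])
    random.shuffle(choices)
    return stem, choices, subject

stem, choices, answer = make_what_question(
    "Slimy tadpoles came out from the eggs", "tadpoles")
print(stem)      # What came out from the eggs?
print(choices)   # shuffled: the answer plus its distractors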

Popsicle is a software system that identifies and corrects language errors committed by students while they are learning the English language [25]. The software initially assesses the English grammar proficiency of the learner based on an input document composed by the user, identifies the grammatical errors committed in the document, provides feedback and suggestions in natural language, and generates grammar lessons that are tailor-fit to the individual needs of the learner. The learner is given opportunities to correct and learn from his mistakes. The software maintains a user model that tracks an individual learner's English grammar proficiency, his position and path toward acquiring English, the dialogue history containing the text generated by the system during the current tutorial session, the evaluation scores for each of the teaching strategies employed, and a concise log of explanations attempted by the system over the learning period of the user.

The Automated Essay Evaluator [12] automates the evaluation of large collections of essay-type documents using the latent semantic analysis (LSA) technique. Rule-based natural language parsing is used for the grammar checking of the input sentences, while LSA is used to evaluate the content. The system was trained on corpora containing pre-graded essays gathered from a particular high school class, which were graded by at least two human teachers according to three criteria: mechanics, organization, and content. Based on the tests performed, the system deviated from the human score by only 2.48%.

4.5. Information Extraction

LegalTRUTHS performs automatic extraction of structured data from unstructured data, that is, from long textual documents into databases [11]. It aims to minimize the user's need to go through countless and lengthy legal documents and court decisions to extract key information about the case at hand. Based on the sample documents, a template for the database was developed through consultations with lawyers. The process follows the traditional approach, wherein preprocessing of the input text is performed, which includes text segmentation into different regions, detection of sentence boundaries, part-of-speech tagging, and named entity recognition. Then text recognition is performed by applying the corresponding rules as needed to fill up the database. These include detection of noun and verb groups as whole entities, normalization of the output, filtering of irrelevant information, co-reference resolution, and extraction of the basic fields in the proposed template. The system also has an automatic evaluation module that uses the longest common subsequence and the metrics precision, recall, and F-measure to check the system's correctness. As a front-end application, the system also provides keyword search over the extracted fields; the matching entries provide links to the actual documents. Figure 3 shows a sample screenshot of the relevant information extracted into table form. Overall results show precision at 91%, recall at 99%, and F-measure at 95%.

Figure 3: Excerpt of Information Extracted by LegalTRUTHs.
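The automatic evaluation idea can be sketched as follows: compare each extracted field against its gold value by the longest common subsequence over tokens, then report precision, recall, and F-measure, which credits partial matches. Only the metric itself comes from the description above; the field values in the sketch are invented examples.

# LCS-based field evaluation: precision/recall/F-measure over the longest
# common subsequence of tokens between an extracted field and a gold answer.
def lcs_len(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j],
                                                               dp[i][j - 1])
    return dp[-1][-1]

def prf(extracted, gold):
    e, g = extracted.split(), gold.split()
    lcs = lcs_len(e, g)
    p = lcs / len(e) if e else 0.0
    r = lcs / len(g) if g else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Invented example: the extracted case title misses one token of the gold.
print(prf("People of the Philippines vs Cruz",
          "People of the Philippines vs Juan Cruz"))
# (1.0, 0.857..., 0.923...)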

4.6. Human Language Interfaces for Software Development

An application that aids in software development is CAUse, which automatically generates use case diagrams from English requirement specifications [31]. The developers of this system defined and represented the semantic implications of certain linguistic markers. These later formed the basis of the rules used in the generation of the use case diagram.

A natural language database interface, Alladin, automatically generates an SQL statement from an English question and retrieves the answer from the database [22]. Alladin is also capable of anaphora and ellipsis resolution to support a simple dialogue system. A sample dialogue with Alladin is shown in Listing 1.

User: How many people live in Barangay 1?
Alladin: 489
User: How about Barangay 2?
Alladin: 367
User: How many of them are male?
Alladin: 203

Listing 1. Sample dialogue with Alladin

Listing 2 illustrates a sample natural language query fed into Alladin and the corresponding system-generated SQL statement.

User Input: How many people live in Barangay 1?

SQL statement generated by Alladin:
Select count(*) from MEMBERS where MEMBERS.brgy = 1

Listing 2. Sample SQL statement generated by Alladin
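A toy pattern-to-SQL mapping in the spirit of Listing 2 is sketched below. The MEMBERS table and brgy column come from the listing; the question patterns and the sex column are illustrative assumptions, not Alladin's actual grammar. Handling a follow-up such as "How about Barangay 2?" would additionally require substituting the new value into the previous query, which is the role of Alladin's anaphora and ellipsis resolution.

import re

# Toy pattern-to-SQL mapping (illustrative patterns, hypothetical sex column).
PATTERNS = [
    (re.compile(r"how many people live in barangay (\d+)", re.I),
     "SELECT COUNT(*) FROM MEMBERS WHERE MEMBERS.brgy = {0}"),
    (re.compile(r"how many .* in barangay (\d+) are (male|female)", re.I),
     "SELECT COUNT(*) FROM MEMBERS WHERE MEMBERS.brgy = {0} "
     "AND MEMBERS.sex = '{1}'"),
]

def to_sql(question):
    """Return the SQL for the first matching question pattern, else None."""
    for pattern, template in PATTERNS:
        m = pattern.search(question)
        if m:
            return template.format(*m.groups())
    return None

print(to_sql("How many people live in Barangay 1?"))
# SELECT COUNT(*) FROM MEMBERS WHERE MEMBERS.brgy = 1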

4.7. Virtual Museum

In the domain of culture and the arts, an online virtual museum, VIGAN, was developed [7]. The system generates artifact descriptions in natural language from a structured internal representation of the information. This allows flexible generation of English descriptions, since the words used may vary and the descriptions generated may differ depending on the user's interests and preferences. For example, from the following structured information in the database:

Original-painting-artist: Juan Luna, Filipino
Education: Bachilerato (1853, San Juan de Letran)
Award: Gold medal for painting "Spolarium"
Height: 4 meters
Width: 7 meters

the system will generate the following textual information, together with a sample picture of the work: "The Spolarium is a painting by a Filipino artist, Juan Luna. The Spolarium measures four meters in height and seven meters in width." Since the user has indicated that he/she is not interested in the artist's educational background, this information has been withheld from the user.
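The generation step can be pictured as filling sentence templates from the attribute-value record and emitting only the sentences that match the user's interests. The sketch below reuses the fields of the Spolarium record above; the template wording and the interest model are simplified assumptions, not VIGAN's actual generator.

# Sketch of user-model-driven generation from an attribute-value record.
RECORD = {"title": "Spolarium", "artist": "Juan Luna",
          "nationality": "Filipino",
          "education": "Bachilerato (1853, San Juan de Letran)",
          "height": "four meters", "width": "seven meters"}

TEMPLATES = {
    "artist": "The {title} is a painting by a {nationality} artist, {artist}.",
    "size": "The {title} measures {height} in height and {width} in width.",
    "education": "The artist's education: {education}.",
}

def describe(record, interests):
    """Emit only the sentences matching the user's declared interests."""
    return " ".join(TEMPLATES[topic].format(**record)
                    for topic in ["artist", "size", "education"]
                    if topic in interests)

# A user not interested in the artist's education gets the shorter text.
print(describe(RECORD, interests={"artist", "size"}))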

4.8. Dialogue Systems

HelloPol is a question-answering system that converses with the user in English within the political domain [3]. The system has been fed with political news articles, and information extraction has been integrated into the system to automatically extract relevant information from the articles into a more structured type of representation (or simply, a database) for use in the question-answering system. The user may ask factoid questions (who, what, when, where), and the program answers these by referring to the database of information. It is also an adaptive question-answering system, in that its responses consider the user's topic preference during the course of the dialogue.

4.9. Sign Language Processing

Most of the work that we have done has focused on textual information. Recently, we have explored video formats for the inclusion of the Filipino sign language in our research. Next in line will be the speech corpus.

As mentioned in a previous section, the Filipino sign language has been included in our attempt to come up with a corpus on Philippine languages. The signs and discourse are recorded in videos, which are edited, glossed, and transcribed. Video editing merely cuts the video for final rendering, glossing allows association of signs with particular words, and transcription allows viewing of the textual equivalents of the signed videos.

Work on the automatic recognition of Filipino sign language involves digital signal processing concepts. Initial work has been done on sign language number recognition [37] using color-coded gloves for feature extraction. The feature vectors were calculated based on the position of the dominant hand's thumb. The system learned from a database of numbers from 1 to 1000, and was tested by the automatic recognition of Filipino sign language numbers and their conversion into text. The overall accuracy of number recognition is 85%.

Another proposed work is the recognition of non-manual signals focusing on the various parts of the face; in particular, the mouth is to be considered initially. The automatic interpretation of the signs can be disambiguated using the interpretation of the non-manual signals.

4.10. Text-to-Speech Systems

PinoyTalk is an initial study on a Filipino-based text-to-speech system that automatically generates speech from input text [5]. The input text is processed and parsed from words into syllables, and from syllables into letters, with prosodic properties assigned to each. Six rules for Filipino syllabication were identified and used in the system. A rule-based model for Filipino was developed and used as the basis for the implementation of the system. The following were determined in the study, considering the Filipino speaker: the duration of each phoneme and of silences, intonation, the pitches of consonants and vowels, and the pitches of words with the corresponding stress. The system generates an audio output and is able to save the generated file in the mp3 or wav file format.
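Since the six syllabication rules are not enumerated here, the sketch below uses a generic consonant-vowel syllabifier for Filipino-like spellings as a stand-in: onset consonants attach to the following vowel, and a coda consonant closes a syllable only before another consonant or the word end. This pattern is our assumption, not PinoyTalk's actual rule set; digraphs such as "ng" are not handled.

import re

# Generic CV-pattern syllabifier (an assumption, not PinoyTalk's six rules).
SYLLABLE = re.compile(r"[^aeiou]*[aeiou](?:[^aeiou](?=[^aeiou]|$))?")

def syllabify(word):
    """Split a word into CV(C) syllables; prosody would be assigned per syllable."""
    return SYLLABLE.findall(word.lower())

for w in ["kamusta", "bundok", "kalsada"]:
    print(w, "->", "-".join(syllabify(w)))
# kamusta -> ka-mus-ta
# bundok  -> bun-dok
# kalsada -> kal-sa-da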

4.11. Pun Generator

A template-based pun extractor and generator has been developed that learns templates from training examples [27]. It utilizes both phonetic and semantic knowledge to capture knowledge about puns and their generation. Word relationships, variables, and tags are captured. Test results have shown that the system is capable of generating unique punning riddles from the learned templates. Evaluation also shows that the generated punning riddles are almost at par with human-made riddles.

5. FUTURE DIRECTIONS

Through the NLP laboratory of the College of Computer Studies, De La Salle University, varied NLP resources have been built, and applications and research explored. Our faculty members and our students have provided the expertise in these challenging endeavors, with multi-disciplinary efforts and collaborations. Through our graduate programs, we have trained many of the faculty members of universities from various parts of the country, thus providing a network of NLP researchers throughout the archipelago. We have organized the National NLP Research Symposium for the past five years, through the efforts of the NLP laboratory of CCS-DLSU, and through the support of government agencies such as PCASTRD, DOST, and CHED, and of our industry partners. Last year, we hosted an international conference (the 22nd Pacific Asia Conference on Language, Information and Computation), which was held in Cebu City in partnership with UPVCC and the Cebu Institute of Technology. We have made a commitment to nurture and strengthen NLP research and collaboration in the country, and to expand our international linkages with key movers in both the Asian region and elsewhere. For the past five years, we have brought in and invited internationally acclaimed NLP researchers into the country to support these endeavors. Recently, we have also received invitations as visiting scholars and as participants to events and meetings within the ASEAN region which provided scholarships, which, in turn, we also share with our colleagues and researchers in other Philippine universities.

It is an understatement to say that much has to be explored in this area of research, which interleaves diverse disciplines among technology-based areas (such as NLP, digital signal processing, multimedia applications, and machine learning) and other fields of study (such as language, history, psychology, and education), and which cuts across different regions and countries, and even time frames. It is multi-modal and considers various forms of data: textual, audio, video, and other forms of information. Thus, much is yet to be accomplished, and experts with diverse backgrounds in these various related fields will bring this area of research to a new and better dimension.

6. ACKNOWLEDGEMENTS

The authors would like to thank the other CCS-DLSU faculty members of the NLP laboratory, namely Danniel Alcantara, Allan Borra, Ethel Ong, and Solomon See, who have all contributed to where we are now as a research laboratory, together with our undergraduate and graduate students who have worked with us in the laboratory's endeavors. We also acknowledge the support of our consistent government partners: the Commission on Higher Education (CHED), the CHED-Zonal Research Center (CHED-ZRC), the Philippine Council for Advanced Science and Technology Research and Development, Department of Science and Technology (PCASTRD/DOST), the Komisyon sa Wikang Filipino (KWF), and the National Commission for Culture and the Arts (NCCA).

7. REFERENCES

[1] Alcantara, D. and A. Borra. Constituent Structure for Filipino: Induction through Probabilistic Approaches. Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, 113-122 (November 2008).

[2] Alcantara, D., B. Hong, A. Perez, and L. Tan. Rule Extraction Applied in Language Translation: R.E.A.L. Translation. Undergraduate Thesis, De La Salle University (2006).

[3] Alimario, P. M., A. Cabrera, E. Ching, E. J. Sia, and M. W. Tan. HelloPol: An Adaptive Political Conversationalist. Proceedings of the 1st National Natural Language Processing Research Symposium (2003).

[4] Borra, A., M. Ang, P. J. Chan, S. Cagalingan, and R. Tan. FiSSAn: Filipino Sentence Syntax and Semantic Analyzer. Proceedings of the 7th Philippine Computing Science Congress, 74-78 (February 2007).

[5] Casas, D., S. Rivera, G. Tan, and G. Villamil. PinoyTalk: A Filipino Based Text-to-Speech Synthesizer. Undergraduate Thesis, De La Salle University (April 2004).

[6] Catabian, F., P. Gueco, R. Pleno, and R. Ripalda. Kids Quest III. Proceedings of the 1st Natural Language Processing Research Symposium (2004).

[7] Chen, H. W., M. G. Lim, P. B. Perez, J. P. Reyes, and N. R. Lim. Natural Language Generation of Museum Object Descriptions based on User Model. Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, 141-150 (November 2008).

[8] Cheng, C., C. P. Alberto, I. A. Chan, and V. J. Querol. SpellChef: Spelling Checker and Corrector for Filipino. Journal of Research in Science, Computing and Engineering, 4(3), 75-82 (December 2007).

[9] Cheng, C., R. Roxas, A. B. Borra, N. R. L. Lim, E. C. Ong, and S. L. See. e-Wika: Digitalization of Philippine Language. DLSU-Osaka Workshop (2008).

[10] Cheng, C. and S. See. The Revised Wordframe Model for Filipino Language. Journal of Research in Science, Computing and Engineering, 3(2), 17-23 (August 2006).

[11] Cheng, T. T., J. L. Cua, M. D. Tan, and K. G. Yao. LegalTRUTHS: Turning Unstructured Text Helpful Structure for Legal Documents. Undergraduate Thesis, De La Salle University (September 2008).

[12] Cruz, M., M. Escutin, A. Estioko, and M. Plaza. Automated Essay Evaluator. Proceedings of the 1st Natural Language Processing Research Symposium (2003).

[13] Dale, R. Natural Language Processing: From Theory to Application. Proceedings of the 2nd National Natural Language Processing Research Symposium (2004).

[14] Damay, J. J., G. J. Lojico, K. A. Lu, D. Tarantan, and E. Ong. SIMTEXT: Text Simplification of Medical Literature. Proceedings of the 3rd National Natural Language Processing Research Symposium (2006).

[15] Dimalen, D. M. and R. Roxas. AutoCor: A Query-Based Automatic Acquisition of Corpora of Closely-Related Languages. Proceedings of the 21st Pacific Asia Conference on Language, Information and Computation, 146-154 (November 2007).

[16] Diola, A. M., J. T. Lopez, P. Torralba, S. So, and A. Borra. Automatic Text Summarization. Proceedings of the 2nd National Natural Language Processing Research Symposium (2004).

[17] Domingo, E. and R. Roxas. Utilizing Clues in Syntactic Relationships for Automatic Target Word Sense Disambiguation. Journal of Research for Science, Computing and Engineering, 3(3), 18-24 (December 2006).

[18] Fajardo, K., S. Di, K. Novenario, and C. Yu. MesCH: Measurement System for Children's Reading Comprehension. Undergraduate Thesis, De La Salle University (September 2008).

[19] Flores, D. and R. Roxas. Automatic Tools for the Analysis of the Philippine Component of the International Corpus of English. Linguistic Society of the Philippines Annual Meeting and Convention (2008).

[20] Fontanilla, G. and R. Roxas. A Hybrid Filipino-English Machine Translation System. DLSU Science and Technology Congress (July 2008).

[21] Fortes-Galvan, F. C. and R. Roxas. Morphological Analysis for Concatenative and Non-concatenative Phenomena. Proceedings of the Asian Applied NLP Conference (March 2007).

[22] Garcia, K. K., M. A. Lumain, J. A. Wong, J. G. Yap, and C. Cheng. Natural Language Database Interface for the Community-Based Monitoring System. Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, 384-390 (November 2008).

[23] Go, K. and S. See. Incorporation of WordNet Features to N-Gram Features in a Language Modeller. Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, 179-188 (November 2008).

[24] Gordon, R. G., Jr. (Ed.). Ethnologue: Languages of the World, Fifteenth edition. Dallas, Texas: SIL International. Online version: www.ethnologue.com (2005).

[25] Gurrea, A. M., A. Liu, D. Ngo Vincent, J. Que, and E. Ong. Recognizing Syntactic Errors in Written Philippine English. Proceedings of the 3rd National Natural Language Processing Research Symposium (2006).

[26] Hong, A. J., C. J. Solis, J. T. Siy, E. Tabirao, and E. Ong. Picture Books: An Automated Story Generator. Proceedings of the 5th National Natural Language Processing Research Symposium (November 2008).

[27] Hong, B. Template-based Pun Extractor and Generator. Graduate Thesis, De La Salle University (March 2008).

[28] Jasa, M. A., M. J. Palisoc, and J. M. Villa. Panuring Panitikan (PanPam): A Sentence Syntax and Semantics-based Grammar Checker for Filipino. Undergraduate Thesis, De La Salle University (September 2007).

[29] Lim, L. E., J. C. New, M. A. Ngo, M. Sy, and N. R. Lim. A Named-Entity Recognizer for Filipino Texts. Proceedings of the 4th National Natural Language Processing Research Symposium (2007).

[30] Lim, N. R., J. O. Lat, S. T. Ng, K. Sze, and G. D. Yu. Lexicon for an English-Filipino Machine Translation System. Proceedings of the 4th National Natural Language Processing Research Symposium (2007).

[31] Lim, N. R., J. Rodil, and C. Cayaba. Automatic Generation of Use Case Diagrams from English Specifications Document. Proceedings of the 19th International Conference on Software Engineering and Knowledge Engineering (2007).

[32] Miguel, D. and R. Roxas. Comparative Evaluation of Tagalog Part of Speech Taggers. Proceedings of the 4th National Natural Language Processing Research Symposium (2007).

[33] Roxas, R. E., A. Borra, C. Ko, N. R. Lim, E. Ong, and M. W. Tan. Building Language Resources for a Multi-Engine Machine Translation System. Language Resources and Evaluation, Springer, Netherlands, 42:183-195 (2008).

[34] Roxas, R. e-Wika: Philippine Connectivity through Languages. Proceedings of the 4th National Natural Language Processing Research Symposium (2007).

[35] Roxas, R. Towards Building the Philippine Corpus. Consultative Workshop on Building the Philippine Corpus (November 2007).

[36] Roxas, R., P. Inventado, G. Asenjo, M. Corpus, S. Dita, R. Sison-Buban, and D. Taylan. Online Corpora of Philippine Languages. 2nd DLSU Arts Congress: Arts and Environment (February 2009).

[37] Sandjaja, I. Sign Language Number Recognition. Graduate Thesis, De La Salle University (August 2008).

[38] Tan, P. P. and N. R. Lim. FILWORDNET: Towards a Filipino WordNet. Proceedings of the 4th National Natural Language Processing Research Symposium (2007).

[39] Tiu, E. P. and R. Roxas. Automatic Bilingual Lexicon Extraction for a Minority Target Language. Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, 368-376; Best Paper Awardee by the PACLIC Steering Committee (November 2008).