Jessica Hullman

Natural Language Processing, Fall 2006

Professor Rich Thomason

December 15, 2006.




Abstract


I modeled my project after the implementation of supervised word sense disambiguation
with Support Vector Machines by Lee, Ng, and Chia. The authors participated in the
word sense disambiguation competition Senseval-3 in the English lexical sample task,
using information about the part of speech (POS) of neighboring words, single words in
surrounding contexts, local collocations, and syntactic relations to implement the
machine learning technique of Support Vector Machines (SVM).


This paper details the first section of my project, in which I modify the POS portion
of their implementation using the identically formatted Senseval-2 data. I scored
performance on the accuracy of the sense assignments made by the SVM, obtaining a
mean accuracy of 87%, with a standard deviation of 15% and a median of 92%.


This paper has five parts (Introduction, Support Vector Machines, Method, Evaluation,
and Possible Improvements), followed by a bibliography and a section called “Programs”
which outlines more specifically how the experiment proceeded.

I would like to acknowledge the following individuals who helped me
(particularly in designing a couple of the more complicated programs): Robert Finn,
Joshua Gerrish, and Rich Thomason.



Introduction

Word sense disambiguation (WSD), an area of considerable research in Computational
Linguistics, refers to the problem of differentiating the various meanings of a word.
A word is described as polysemous if it has multiple meanings. For example, given the
word “bar” and a set of word senses such as “a long piece of wood, metal, etc. used as
a support”, “a barrier of any kind”, and “a plea arresting an action or claim”, the
goal is to identify the correct sense of “bar” in a given sentence.


The problem of disambiguation can be described as AI-complete in that some
representation of common sense and real-world knowledge is required before it can be
resolved (Lecture). Two steps arise in disambiguating a given word. First, all of the
different senses of the word, as well as the words to be considered along with the
given word, must be determined; second, a means must be determined by which to assign
each occurrence of a word to the correct sense. Several major sources of information
are typically used: the word’s context, as well as external knowledge sources
including lexicons (Ide and Veronis, 1998, p. 3).


WordNet, a large lexical database of English developed under George A. Miller, is the
best known of the external knowledge sources. Nouns, verbs, adjectives, and adverbs
are grouped into sets of cognitive synonyms (synsets), each expressing a distinct
concept; different senses of a word are therefore in different synsets (WordNet,
http://wordnet.princeton.edu/). The meaning of each synset is further clarified with a
short definition.


Context-based methods (also called data-driven or corpus-based methods) use knowledge
about previously disambiguated instances of the word within corpora (Ide and Veronis,
1998, p. 3). The distinction between lexicon-driven, knowledge-based methods and
corpus-based methods is often the same as the distinction between supervised and
unsupervised learning (supervised referring to a task in which the sense label of each
training instance is known, unsupervised to one in which it is not). Unsupervised
methods outline a clustering task, in which the external knowledge source of a
dictionary or lexicon is used to seed the system, which then augments the labeled
instances by learning from unlabeled instances. Supervised learning, on the other
hand, can be seen as a classification task in which a function is deduced from labeled
data points.


Numerous issues arise with regard to word sense disambiguation. WordNet’s numerous
synsets per word bring up one of the most prevalent of these: determining the
appropriate degree of sense granularity for a given task. Several authors (e.g.,
Slator and Wilks, 1987) have remarked that the sense divisions one finds in
dictionaries are often too fine for the purposes of NLP work; WordNet’s sense
distinctions have been criticized, for example, for being more fine-grained than what
may be needed in most natural language processing applications (Ide and Veronis, 1998,
p. 13). Overly fine sense distinctions create practical difficulties for automated WSD
by requiring sense choices that are extremely difficult even for expert lexicographers.


The problem of data sparseness becomes severe: very large amounts of text are needed
for supervised methods to ensure that all of the possible senses of a word are
represented. Producing corpora hand-labeled for senses, however, is an expensive,
time-consuming task, and the results are often less than satisfactory. There is often
a fair amount of disparity among human taggers regarding the finer sense distinctions
of a word.



Natural Language Processing tasks in which word sense disambiguation is a relevant
concern include information retrieval, machine translation, and speech processing.
Beyond the issues of granularity, evaluating WSD systems outside of these tasks
remains a well-documented problem, arising not only from the substantial differences
in test conditions across studies, but also from differences in test words and
variance in the criteria for evaluating the correctness of a sense assignment.



The SENSEVAL competition arose out of this need for accepted evaluation standards.
SENSEVAL uses in vitro evaluation, which involves comparing a system’s output for a
given input using precision and recall (versus in vivo evaluation, in which results
are evaluated in terms of their contribution to the overall performance of a system
for a given application) (Ide and Veronis, 1998, p. 25). While somewhat artificial,
the reasoning behind the Senseval competition, and thus that behind my project, is
that close examination of the problems that arise in word sense disambiguation will
best improve the methods used.



Within the Senseval competition, participants can compete in tasks including
translation as well as language-specific disambiguation. English tasks in Senseval
include an English all-words task and the English lexical sample task; it is the
latter with which my project is concerned. In the lexical sample task, evaluation is
based on how well a system disambiguates word-class-specific (for example, all noun)
instances, in the test data, of a sampling of words pulled from the WordNet lexicon.
Tagging algorithms are expected to assign probabilities to the possible tags they
output.


To date, three Senseval competitions have been held; this project uses Senseval-2
data. The corpus for the Senseval-2 English tasks comprises sentences from the British
National Corpus 2, the Penn Treebank, and the web, and is provided in XML format. I
used only this corpus in training my system.
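
For reference, a training instance in the lexical sample corpus looks roughly like the
following. This is a schematic sketch of the format, not a verbatim excerpt; the
instance id and sense id here are invented for illustration.

    <lexelt item="bar.n">
      <instance id="bar.n.example">
        <answer instance="bar.n.example" senseid="bar_sense_1"/>
        <context>
          As the leaves grow, train them through the <head>bars</head>
          for a lovely effect.
        </context>
      </instance>
    </lexelt>

The <head> element marks the word to be disambiguated within its context.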




Machine Learning Using Support Vector Machines (SVM)


As stated, in recent years linear regression-based methods have increased in
popularity with regard to supervised learning tasks. Any linear classifier is simply a
classifier that uses a linear function of its inputs to base a classification decision
on. In other words, given that the input to the classifier is a real feature vector x,
the estimated output score (or probability) is

    y = f(w · x)

where w is a real vector of weights and f is a function that converts the dot product
of the two vectors into the desired output (Wikipedia, Linear classifier,
http://en.wikipedia.org/wiki/Linear_classifier).

In general, linear classifiers are fast and work well when the number of dimensions of
the input vector is very large; for example, in document classification, each element
in the input vector is typically the count of a word in a document. They can be
divided into generative models, which model conditional density functions, and
discriminative training models, which attempt to maximize the quality of the output on
a training set. While common generative methods like Bayesian classification handle
missing data well, discriminative training methods, including the perceptron and
Support Vector Machines, generally yield a higher accuracy (Wikipedia, Linear
classifier, http://en.wikipedia.org/wiki/Linear_classifier).

For a binary classification problem, f is a simple function mapping all values above a
certain threshold to one class and all other values to a second class (i.e., “yes” and
“no”). One can visualize the operation of a linear classifier as splitting a
high-dimensional input space with a hyperplane: all points on one side of the plane
belong to the first class, all points on the other side belong to the second class.
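
To make this concrete, here is a minimal sketch of such a thresholded linear decision
rule in Perl; the weights, bias, and input are invented for illustration and are not
taken from the trained SVM models used later.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @w = (0.8, -0.3, 0.5);    # hypothetical weight vector
    my $b = -0.2;                # hypothetical bias term
    my @x = (1, 0, 1);           # hypothetical binary input features

    # The score is the dot product of w and x, plus the bias.
    my $score = $b;
    $score += $w[$_] * $x[$_] for 0 .. $#w;

    # f maps scores above the threshold (here zero) to one class
    # ("yes") and all other scores to the second class ("no").
    print $score > 0 ? "yes\n" : "no\n";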


The SVM is a binary classification learning method that categorizes data by
constructing a hyperplane, using optimization, between training instances mapped in a
feature space (Schölkopf and Smola, 2002). Because Lee, Ng, and Chia built one binary
classifier for each sense class, I opted to do the same. Like the authors, I converted
nominal features with numerous possible values into the corresponding number of binary
(0 or 1) features. In this scheme, if a nominal feature takes the nth value, then the
corresponding (nth) binary feature is 1 and all of the other features are set to 0
(Witten and Frank, 2000).


The software I used is SVMlight, an implementation of SVMs in C. SVMlight solves
classification, regression, and ranking problems, the last by learning a ranking
function. It handles many thousands of support vectors and several hundred thousand
training examples. SVMlight is an implementation of Vapnik’s Support Vector Machine
(Vapnik, 1995), and the algorithms used in SVMlight are described in Joachims (1999).
In the ranking setting, the goal is to learn a function from preference examples so
that it orders a new set of objects as accurately as possible; such ranking problems
naturally occur in applications like search engines and recommender systems. The code
has been used on a large range of problems, including text classification, image
recognition tasks, bioinformatics, and medical applications (SVMlight,
http://svmlight.joachims.org/).
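
For reference, SVMlight reads one example per line in a sparse format: a target value
(+1 or -1 for binary classification) followed by feature:value pairs for the nonzero
features only, in increasing order of feature index. A sketch of two such lines (the
feature indices here are invented for illustration):

    +1 3:1 48:1 137:1 182:1 227:1 251:1 296:1
    -1 6:1 52:1 91:1 180:1 229:1 280:1 315:1

This sparse encoding suits the feature vectors described under Method, which are
almost entirely zeros.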


Method

Features: POS of Neighboring Words


The decision of which features to use determines the project. Like Lee, Ng, and Chia,
the features I used were the parts of speech (POS) of neighboring words. The first
step involved deciding how many words ahead of and behind the given word I wanted to
consider for POS information. Again, both because Lee, Ng, and Chia used a three-word
window, and because research has shown that a window of more than k = 3 or 4 is
unnecessary (Yarowsky, 1994a and b), I opted to use a three-word window.


As an example, given the training corpus sentence “As the leaves grow, train them
through the bars for a lovely effect,” the input vector, corresponding to
<P-3, P-2, P-1, P0, P1, P2, P3>, is set to <DT, NNS, VB, VB, PRP, IN, DT>. I converted
all nominal features with numerous possible values into the corresponding number of
binary features. This results in an input vector resembling

    < 01000…, 00001…, 00000…, 00000…, 10000…, 00000…, 00000… >

wherein each place in the vector corresponds to a 45-digit string of 0’s, with a 1 in
the place corresponding to that particular tag.
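
A minimal sketch of this conversion in Perl follows. The tag table is truncated to
twelve tags to keep the sketch short (the real table has all 45 Penn Treebank tags),
so the strings printed here are 12 digits rather than 45.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Table mapping each POS tag to its position (truncated here).
    my @tagset = qw(CC CD DT EX IN JJ NN NNS PRP VB VBD VBG);
    my %index  = map { $tagset[$_] => $_ } 0 .. $#tagset;

    # POS window <P-3 .. P3> around "bars" in the example sentence.
    my @window = qw(DT NNS VB VB PRP IN DT);

    # Each position becomes a string of 0's with a single 1
    # at the index of that position's tag.
    my @vector;
    for my $tag (@window) {
        my $bits = '0' x @tagset;
        substr($bits, $index{$tag}, 1) = '1';
        push @vector, $bits;
    }
    print "< ", join(", ", @vector), " >\n";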


Senseval provides the corpus already divided into a training and a test set. My first
step involved parsing the XML format of the corpus. For this I used XML::Twig, a
non-event-based XML parser that provides an easy-to-access tree interface (XML::Twig,
http://xmltwig.com/xmltwig/). While Twig made the initial parsing task much more
efficient in terms of programming, I was still required to develop a program to insert
spacing where the parser took out certain tags.
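
A sketch of the kind of XML::Twig call involved; the element names follow the Senseval
lexical sample format sketched in the Introduction, the handler body is simplified,
and the filename is a placeholder.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    # Call a handler each time a complete <instance> element is parsed.
    my $twig = XML::Twig->new(
        twig_handlers => {
            instance => sub {
                my ($t, $instance) = @_;
                my $id      = $instance->att('id');
                my $context = $instance->first_child('context')->text;
                print "$id\t$context\n";
                $t->purge;    # free the already-processed part of the tree
            },
        },
    );
    $twig->parsefile('senseval_corpus.xml');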


The accuracy of the POS tagger used in a word sense disambiguation task is a limiting
factor. My next step being to POS-tag the corpus, I opted to use the Brill tagger, an
error-driven, transformation-based tagger that works by first tagging a corpus using
the broadest of a set of tagging rules, then applying slightly more specific rules,
repeating this process until some stopping criterion is reached (Jurafsky and Martin,
2006). I chose the tagger for its accuracy of 95-97% (Brill Tagger,
http://www.ling.gu.se/~lager/mogul/brill-tagger/index.html).
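
To illustrate, a contextual transformation rule in the tagger’s rule files takes a
form roughly like

    NN VB PREVTAG TO

meaning: retag a word from NN to VB when the preceding tag is TO (as in “to race”).
This particular rule is the standard textbook example rather than one taken from the
rule file used here.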


Next, I needed to substitute certain characters output by the Brill tagger, because
these characters were recognized by Perl as special characters. I then needed to
extract the POS information of the given words from the output of this program and
convert the information to the format needed by the SVM, namely vectors of zeros and
ones. To do this I used a program that created a table corresponding to the 45 parts
of speech and then read through the parsed, POS-tagged corpora, keeping track of when
it reached a new instance of a word. It then read through the context associated with
each instance, keeping track of when it got to the given word. When the word to be
disambiguated is reached, the POS’s of the three words before and after it are
converted into a vector of seven 45-digit strings, with each place in the string
corresponding to a POS. For each POS in the vector, a one is inserted in the place
corresponding to that POS, while all the other places remain 0’s. A separate file is
created for each word.


After this, the corpora (now in the form of separate files for each word) needed to be
separated into files corresponding to each separate word sense, so that the SVM could
be run once for each particular sense of a word.


Evaluation


I evaluated my project using the evaluation module built into the SVM software.
Provided that the correct answers are supplied with the test data, the SVM outputs
statistics on the accuracy, precision, and recall of its sense assignments. I
evaluated my project on the accuracy of the sense assignments, obtaining an average of
87%, with a median of 92% and a standard deviation of 15%.
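
These summary statistics were aggregated over the per-word accuracies reported by the
SVM. A minimal sketch of that aggregation in Perl, with invented accuracy values
standing in for the real per-word results:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @acc = (0.95, 0.92, 0.87, 0.55, 0.99, 0.94);    # hypothetical accuracies

    # Mean.
    my $mean = 0;
    $mean += $_ for @acc;
    $mean /= @acc;

    # Median: middle of the sorted list (average of the two middle
    # values when the count is even).
    my @s = sort { $a <=> $b } @acc;
    my $m = int(@s / 2);
    my $median = @s % 2 ? $s[$m] : ($s[$m - 1] + $s[$m]) / 2;

    # Standard deviation around the mean.
    my $var = 0;
    $var += ($_ - $mean) ** 2 for @acc;
    my $sd = sqrt($var / @acc);

    printf "mean %.2f, median %.2f, sd %.2f\n", $mean, $median, $sd;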


Possible Improvements


There are multiple minor improvements which might considerably influence my results.
Most importantly, to accurately compare this project to the one it was modeled on
(Lee, Ng, and Chia), I would need to use the Senseval-3 data (now in the public
domain) as well as the Senseval scoring software.


Running a sentence segmentation program on the corpora before POS tagging would have
allowed me to track where in the sentence the word to be disambiguated occurred.
Currently, my project tracks POS information across sentence boundaries.



Like Lee, Ng, and Chia, I built one binary classifier for each sense (meaning) of a
word. However, I might instead have run the SVM using a step-wise reduction method, in
which a binary classifier is first built for all instances of a word; then, as one
sense at a time is eliminated by the SVM, it is removed from the input data file and a
new classifier is built for the remaining instances. This method would be more
computationally efficient, but whether it would improve the accuracy remains to be
seen.



Bibliography

Ide, Nancy and Jean Veronis. (1998). “Word sense disambiguation: The state of the
art.” In Computational Linguistics, 24(1).

Joachims, Thorsten. (1999). “Transductive inference for text classification using
support vector machines.” Universität Dortmund, Dortmund, Germany.

Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction
to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd
edition. (Online version). http://www.cs.colorado.edu/~martin/slp.html

Lee, Yoong Keok and Hwee Tou Ng. (2002). “An empirical evaluation of knowledge sources
and learning algorithms for word sense disambiguation.” In Proceedings of the
Conference on Empirical Methods in Natural Language Processing (EMNLP).

Schölkopf, Bernhard and Alex Smola. (2002). Learning with Kernels. MIT Press,
Cambridge, MA.

Vapnik, Vladimir N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag,
New York.

Witten, Ian H. and Eibe Frank. (2000). Data Mining: Practical Machine Learning Tools
and Techniques with Java Implementations. Morgan Kaufmann, San Francisco.

Yarowsky, D. (1995). “A comparison of corpus-based techniques for restoring accents in
Spanish and French text.” Proceedings of the 2nd Annual Workshop on Very Large Text
Corpora. Las Cruces.

Programs



All programs can be found in (and are to be run from) /data0/users/rthomaso/tmp/hullman
on tangra. All files created within the referenced programs output to this directory.
This sequence of commands/programs was run first on the training data, then on the
test data. To re-run it, some of the pathnames specified in the programs may need to
be changed back (they are currently set to run on the test data); the actual running
of this sequence gets rather complicated.


Steps:

1. Run “xml_spacer_2.pl”

This program inserts a space in the original XML corpora so that the next program,
which parses it, does not abut two words together without a space between.

2. Run “senseval_parse2.pl”

This program calls up XML::Twig and parses the corpora, outputting a file
“senseval_data_spaced.txt”.

3. Run the following three commands:

cd /data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data

export PATH=$PATH:/data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data

/data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data/tagger LEXICON
/data0/users/rthomaso/tmp/hullman/senseval_data_spaced.txt BIGRAMS
LEXICALRULEFILE CONTEXTUALRULEFILE >
/data0/users/rthomaso/tmp/hullman/senseval_tagged.txt

These three commands run the Brill tagger on the parsed corpora, outputting the
results to a file called “senseval_tagged.txt”.


4. Run “POS_substitution.pl”

This program takes the POS-tagged corpora and substitutes strings for problematic
characters output by the POS tagger, including $, (, ), #, ., ;, and commas. It
outputs a file “senseval_tagged_POS_substituted.txt”.

5. Run “tag_parser.pl”

This program and the next are the bulk of the project. This one reads in the
POS-tagged, character-substituted output file from the previous program, tracking when
it gets to a new instance of a word. It then reads in the POS’s, and when it gets to
the given word, creates the vector of the POS of that word as well as of the three
words before and after it. For each POS in that vector, a 45-digit string is created,
with a one inserted in the place in that string corresponding to the part of speech of
the word. A separate file is created for every word, along with an indication of which
particular vector corresponds to the POS of the given word (out of all of the other
words in the context). Each of these files ends in “_SVM_input.txt”.

6. Run “SVM_input.pl”

This splits the data into separate files corresponding to each instance and sense id
of the word. The files it outputs for each instance/sense id are formatted for input
into SVMlight; each ends in “_SVM_prepared_input.txt”.


7. Run SVMlight from /data0/tools/svmlight using two commands (either
“./svm_learn input_file model_file” or
“./svm_classify input_file model_file output_file”).

The first command runs the learning module on the training data file (corresponding to
one sense of a word) and outputs a model file (parameters) which the SVM classifying
module takes in in order to make a prediction.

The classifying module takes in an input file (corresponding to one sense of a word in
the test data) and a model file created by running the learning module on the training
data for that sense, and outputs a file with a prediction (in the form of a one or a
negative one, depending on which side of the hyperplane the instance falls) as well as
statistics including the accuracy, precision, and recall of the classification.
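
As an example, for one sense file the pair of calls would look like the following (the
filenames here are invented for illustration):

./svm_learn bar_1_SVM_prepared_input.txt bar_1_model

./svm_classify bar_1_test_SVM_prepared_input.txt bar_1_model bar_1_predictions.txt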



* Because the Senseval test data did not supply the correct answers within the XML
corpora, and because this information was needed if I was to evaluate the SVM, I
needed to call up a separate Senseval file of answers for the test data and insert
this information into the test corpora (see “test_sense_id.plx”).