Advances in Word Sense Disambiguation


Advances in

Word Sense Disambiguation

Tutorial at AAAI-2005

July 9, 2005


Rada Mihalcea

University of North Texas

http://www.cs.unt.edu/~rada



Ted Pedersen

University of Minnesota, Duluth

http://www.d.umn.edu/~tpederse




2

Goal of the Tutorial


Introduce the problem of word sense disambiguation
(WSD), focusing on the range of formulations and
approaches currently practiced.


Accessible to anyone with an interest in NLP or AI.


Persuade you to work on word sense disambiguation


It’s an interesting problem


Lots of good work already done, still more to do


There is infrastructure to help you get started


Persuade you to use word sense disambiguation in your
text applications.


3

Outline of Tutorial


Introduction (Ted)


Methodology (Rada)


Knowledge Intensive Methods (Rada)


Supervised Approaches (Ted)


Minimally Supervised Approaches (Rada) / BREAK


Unsupervised Learning (Ted)


How to Get Started (Rada)


Conclusion (Ted)

Part 1:

Introduction

5

Outline


Definitions


Ambiguity for Humans and Computers


Very Brief Historical Overview


Theoretical Connections


Practical Applications


6

Definitions


Word sense disambiguation

is the problem of selecting a
sense for a word from a set of predefined possibilities.


Sense Inventory usually comes from a dictionary or thesaurus.


Knowledge intensive methods, supervised learning, and
(sometimes) bootstrapping approaches



Word sense discrimination

is the problem of dividing the
usages of a word into different meanings, without regard to
any particular existing sense inventory.


Unsupervised techniques


7

Outline


Definitions


Ambiguity for Humans and Computers


Very Brief Historical Overview


Theoretical Connections


Practical Applications


8

Computers versus Humans


Polysemy



most words have many possible meanings.


A computer program has no basis for knowing which one
is appropriate, even if it is obvious to a human…


Ambiguity is rarely a problem for humans in their day to
day communication, except in extreme cases…



9

Ambiguity for Humans - Newspaper Headlines!


DRUNK GETS NINE YEARS IN VIOLIN CASE


FARMER BILL DIES IN HOUSE


PROSTITUTES APPEAL TO POPE


STOLEN PAINTING FOUND BY TREE


RED TAPE HOLDS UP NEW BRIDGE


DEER KILL 300,000


RESIDENTS CAN DROP OFF TREES


INCLUDE CHILDREN WHEN BAKING COOKIES


MINERS REFUSE TO WORK AFTER DEATH


10

Ambiguity for a Computer


The fisherman jumped off the bank and into the water.

The bank down the street was robbed!

Back in the day, we had an entire bank of computers devoted to this problem.

The bank in that road is entirely too steep and is really dangerous.

The plane took a bank to the left, and then headed off towards the mountains.


11

Outline


Definitions


Ambiguity for Humans and Computers


Very Brief Historical Overview


Theoretical Connections


Practical Applications


12

Early Days of WSD


Noted as problem for Machine Translation (Weaver, 1949)


A word can often only be translated if you know the specific sense
intended (A bill in English could be a pico or a cuenta in Spanish)


Bar-Hillel (1960) posed the following:


Little John was looking for his toy box. Finally, he found it. The
box was in the pen. John was very happy.


Is “pen” a writing instrument or an enclosure where children play?


…declared it unsolvable, left the field of MT!










13

Since then…


1970s - 1980s


Rule based systems


Rely on hand crafted knowledge sources


1990s


Corpus based approaches


Dependence on sense tagged text


(Ide and Veronis, 1998) overview history from early days to 1998.


2000s


Hybrid Systems


Minimizing or eliminating use of sense tagged text


Taking advantage of the Web

14

Outline


Definitions


Ambiguity for Humans and Computers


Very Brief Historical Overview


Interdisciplinary Connections


Practical Applications


15

Interdisciplinary Connections




Cognitive Science & Psychology


Quillian (1968), Collins and Loftus (1975) : spreading activation


Hirst (1987) developed marker passing model


Linguistics


Fodor & Katz (1963) : selectional preferences


Resnik (1993) pursued statistically


Philosophy of Language


Wittgenstein (1958): meaning as use


“For a large class of cases - though not for all - in which we employ the word "meaning" it can be defined thus: the meaning of a word is its use in the language.”



16

Outline


Definitions


Ambiguity for Humans and Computers


Very Brief Historical Overview


Theoretical Connections


Practical Applications


17

Practical Applications


Machine Translation


Translate “bill” from English to Spanish


Is it a “pico” or a “cuenta”?


Is it a bird jaw or an invoice?


Information Retrieval


Find all Web Pages about “cricket”


The sport or the insect?


Question Answering


What is George Miller’s position on gun control?


The psychologist or US congressman?


Knowledge Acquisition


Add to KB: Herb Bergson is the mayor of Duluth.


Minnesota or Georgia?

18

References


(Bar-Hillel, 1960) The Present Status of Automatic Translation of Languages. In Advances in Computers. Volume 1. Alt, F. (editor). Academic Press, New York, NY. pp. 91-163.

(Collins and Loftus, 1975) A Spreading Activation Theory of Semantic Memory. Psychological Review, (82) pp. 407-428.

(Fodor and Katz, 1963) The structure of semantic theory. Language (39). pp. 170-210.

(Hirst, 1987) Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press.

(Ide and Véronis, 1998) Word Sense Disambiguation: The State of the Art. Computational Linguistics (24). pp. 1-40.

(Quillian, 1968) Semantic Memory. In Semantic Information Processing. Minsky, M. (editor). The MIT Press, Cambridge, MA. pp. 227-270.

(Resnik, 1993) Selection and Information: A Class-Based Approach to Lexical Relationships. Ph.D. Dissertation. University of Pennsylvania.

(Weaver, 1949) Translation. In Machine Translation of Languages: fourteen essays. Locke, W.N. and Booth, A.D. (editors). The MIT Press, Cambridge, Mass. pp. 15-23.

(Wittgenstein, 1958) Philosophical Investigations, 3rd edition. Translated by G.E.M. Anscombe. Macmillan Publishing Co., New York.

Part 2:


Methodology

20

Outline


General considerations


All-words disambiguation

Targeted-words disambiguation


Word sense discrimination, sense discovery


Evaluation (granularity, scoring)


21






Overview of the Problem

Many words have several meanings (homonymy / polysemy)

Ex: “chair” - furniture or person

Ex: “child” - young person or human offspring

Determine which sense of a word is used in a specific sentence

Note:

often, the different senses of a word are closely related

Ex: title - right of legal ownership
          - document that is evidence of the legal ownership

sometimes, several senses can be “activated” in a single context (co-activation)

Ex: “This could bring competition to the trade”

competition: - the act of competing
             - the people who are competing

22

Word Senses


The
meaning
of a word in a given context



Word sense representations


With respect to a dictionary


chair

= a seat for one person, with a support for the back; "he put his coat
over the back of the chair and sat down"


chair

= the position of professor; "he was awarded an endowed chair in
economics"


With respect to the translation in a second language


chair
= chaise


chair

= directeur


With respect to the context where it occurs (discrimination)


“Sit on a chair”  “Take a seat on this chair”

“The chair of the Math Department”  “The chair of the meeting”

23

Approaches to Word Sense Disambiguation


Knowledge-Based Disambiguation


use of external lexical resources such as dictionaries and thesauri


discourse properties


Supervised Disambiguation


based on a labeled training set


the learning system has:


a training set of feature-encoded inputs AND


their appropriate sense label (category)


Unsupervised Disambiguation


based on unlabeled corpora


The learning system has:


a training set of feature-encoded inputs BUT


NOT their appropriate sense label (category)


24

All Words Word Sense Disambiguation


Attempt to disambiguate all open-class words in a text

“He put his suit over the back of the chair”


Knowledge
-
based approaches


Use information from dictionaries


Definitions / Examples for each meaning


Find similarity between definitions and current context


Position in a semantic network


Find that “table” is closer to “chair/furniture” than to “chair/person”


Use discourse properties


A word exhibits the same sense in a discourse / in a collocation

25

All Words Word Sense Disambiguation


Minimally supervised approaches


Learn to disambiguate words using small annotated corpora


E.g. SemCor


corpus where all open class words are
disambiguated


200,000 running words


Most frequent sense


26

Targeted Word Sense Disambiguation


Disambiguate one target word

“Take a seat on this chair”

“The chair of the Math Department”



WSD is viewed as a typical classification problem


use machine learning techniques to train a system


Training:


Corpus of occurrences of the target word, each occurrence
annotated with appropriate sense


Build feature vectors:


a vector of relevant linguistic features that represents the context (ex:
a window of words around the target word)


Disambiguation:


Disambiguate the target word in new unseen text

27

Targeted Word Sense Disambiguation


Take a window of n words around the target word


Encode information about the words around the target word


typical features include: words, root forms, POS tags, frequency, …


An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.



Surrounding context (local features)


[ (guitar, NN1), (and, CJC), (player, NN1), (stand, VVB) ]



Frequent co-occurring words (topical features)


[
fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band
]


[0,0,0,1,0,0,0,0,0,0,1,0]



Other features:


[followed by "player", contains "show" in the sentence,…]


[yes, no
, … ]




28

Unsupervised Disambiguation


Disambiguate word senses:


without supporting tools such as dictionaries and thesauri


without a labeled training text


Without such resources, word senses are not
labeled


We cannot say “chair/furniture” or “chair/person”


We can:


Cluster/group the contexts of an ambiguous word into a number
of groups


Discriminate
between these groups without actually labeling
them


29

Unsupervised Disambiguation


Hypothesis: same senses of words will have similar neighboring
words


Disambiguation algorithm


Identify context vectors corresponding to all occurrences of a particular
word


Partition them into regions of high density


Assign a sense to each such region



“Sit on a
chair


“Take a seat on this
chair


“The
chair

of the Math Department”

“The
chair

of the meeting”

30

Evaluating Word Sense Disambiguation



Metrics:


Precision = percentage of words that are tagged correctly, out of the
words addressed by the system


Recall = percentage of words that are tagged correctly, out of all words
in the test set


Example

Test set of 100 words

System attempts 75 words

Words correctly disambiguated: 50

Precision = 50 / 75 = 0.66

Recall = 50 / 100 = 0.50
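
A minimal Python sketch of the two metrics, using the counts from the example above:

    attempted = 75          # words the system assigned a sense to
    correct = 50            # of those, tagged with the gold-standard sense
    total = 100             # all words in the test set

    precision = correct / attempted   # 0.666...
    recall = correct / total          # 0.50
    print(precision, recall)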



Special tags are possible:


Unknown


Proper noun


Multiple senses


Compare to a gold standard


SEMCOR corpus, SENSEVAL corpus, …

31

Evaluating Word Sense Disambiguation


Difficulty in evaluation:


Nature of the senses to distinguish has a huge impact on results


Coarse versus fine-grained sense distinction

chair

= a
seat

for one person, with a support for the back; "he put his coat
over the back of the chair and sat down“

chair

= the position of
professor
; "he was awarded an endowed chair in
economics“


bank

= a
financial institution

that accepts deposits and channels the money
into lending activities; "he cashed a check at the bank"; "that bank holds
the mortgage on my home"

bank

= a
building

in which commercial banking is transacted; "the bank is
on the corner of Nassau and Witherspoon“


Sense maps


Cluster similar senses


Allow for both fine-grained and coarse-grained evaluation

32

Bounds on Performance



Upper and Lower Bounds on Performance:


Measure of how well an algorithm performs relative to the difficulty of
the task.



Upper Bound:


Human performance


Around 97%-99% with few and clearly distinct senses


Inter-judge agreement:


With words with clear & distinct senses


95% and up


With polysemous words with related senses: 65% - 70%



Lower Bound (or baseline):


The assignment of a random sense / the most frequent sense


90% is excellent for a word with 2 equiprobable senses


90% is trivial for a word with 2 senses with probability ratios of 9 to 1


33

References


(Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL 1992.

(Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994.

(Miller, 1995) Miller, G. WordNet: A lexical database. ACM, 38(11), 1995.

(Senseval) Senseval evaluation exercises. http://www.senseval.org


Part 3:


Knowledge-based Methods for Word Sense Disambiguation

35

Outline


Task definition


Machine Readable Dictionaries


Algorithms based on Machine Readable Dictionaries


Selectional Restrictions


Measures of Semantic Similarity


Heuristic-based Methods


36

Task Definition


Knowledge-based WSD = class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text


Resources


Yes


Machine Readable Dictionaries


Raw corpora


No


Manually annotated corpora


Scope


All open-class words

37

Machine Readable Dictionaries


In recent years, most dictionaries made available in
Machine Readable format (MRD)


Oxford English Dictionary


Collins


Longman Dictionary of Contemporary English (LDOCE)


Thesauruses


add synonymy information


Roget Thesaurus


Semantic networks


add more semantic relations


WordNet


EuroWordNet



38

MRD - A Resource for Knowledge-based WSD


For each word in the language vocabulary, an MRD
provides:


A list of meanings


Definitions (for all word meanings)


Typical usage examples (for most word meanings)


WordNet definitions/examples for the noun
plant

1.
buildings for carrying on industrial labor; "they built a large plant to
manufacture automobiles“

2.
a living organism lacking the power of locomotion

3.
something planted secretly for discovery by another; "the police used a plant to
trick the thieves"; "he claimed that the evidence against him was a plant"

4.
an actor situated in the audience whose acting is rehearsed but seems
spontaneous to the audience

39

MRD - A Resource for Knowledge-based WSD


A thesaurus adds:


An explicit synonymy relation between word meanings





A semantic network adds:


Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF), antonymy, entailment, etc.

WordNet synsets for the noun “plant”


1. plant, works, industrial plant


2. plant, flora, plant life

WordNet related concepts for the meaning “plant life”


{plant, flora, plant life}


hypernym: {organism, being}


hyponym: {house plant}, {fungus}, …


meronym: {plant tissue}, {plant part}


holonym: {Plantae, kingdom Plantae, plant kingdom}

40

Outline


Task definition


Machine Readable Dictionaries


Algorithms based on Machine Readable Dictionaries


Selectional Restrictions


Measures of Semantic Similarity


Heuristic-based Methods


41

Lesk Algorithm


(Michael Lesk 1986): Identify senses of words in context
using definition overlap

Algorithm:

1.
Retrieve from MRD all sense definitions of the words to be
disambiguated

2.
Determine the definition overlap for all possible sense combinations

3.
Choose senses that lead to highest overlap


Example: disambiguate PINE CONE



PINE

1. kinds of evergreen tree with needle-shaped leaves

2. waste away through sorrow or illness



CONE

1. solid body which narrows to a point

2. something of this shape whether solid or hollow

3. fruit of certain evergreen trees

Overlap (Pine#1, Cone#1) = 0    Overlap (Pine#2, Cone#1) = 0
Overlap (Pine#1, Cone#2) = 1    Overlap (Pine#2, Cone#2) = 0
Overlap (Pine#1, Cone#3) = 2    Overlap (Pine#2, Cone#3) = 0


42

Lesk Algorithm for More than Two Words?

I saw a man who is 98 years old and can still walk and tell jokes

nine open class words: see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3)

43,929,600 sense combinations! How to find the optimal sense combination?

Simulated annealing (Cowie, Guthrie, Guthrie 1992)

Define a function E = combination of word senses in a given text.

Find the combination of senses that leads to highest definition overlap (redundancy)

1. Start with E = the most frequent sense for each word

2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E

3. Stop iterating when there is no change in the configuration of senses
43

Lesk Algorithm: A Simplified Version


Original Lesk definition: measure overlap between sense definitions for
all words in context


Identify simultaneously the correct senses for all words in context


Simplified Lesk (Kilgarriff & Rosenzweig 2000): measure overlap
between sense definitions of a word and current context


Identify the correct sense for one word at a time


Search space significantly reduced


44

Lesk Algorithm: A Simplified Version

Example: disambiguate PINE in

“Pine cones hanging in a tree”



PINE

1. kinds of evergreen tree with needle-shaped leaves

2. waste away through sorrow or illness

Overlap (Pine#1, Sentence) = 1
Overlap (Pine#2, Sentence) = 0



Algorithm

for simplified Lesk:

1.
Retrieve from MRD all sense definitions of the word to be
disambiguated

2.
Determine the overlap between each sense definition and the
current context

3.
Choose the sense that leads to highest overlap
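
A minimal Python sketch of simplified Lesk on the PINE example; a real implementation would also remove stopwords and use full WordNet glosses.

    def simplified_lesk(context, sense_definitions):
        # Pick the sense whose definition shares the most words with the context.
        context_words = set(context.lower().split())
        best_sense, best_overlap = None, -1
        for sense, definition in sense_definitions.items():
            overlap = len(context_words & set(definition.lower().split()))
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense

    pine = {"pine#1": "kinds of evergreen tree with needle-shaped leaves",
            "pine#2": "waste away through sorrow or illness"}
    print(simplified_lesk("pine cones hanging in a tree", pine))   # pine#1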

45

Evaluations of Lesk Algorithm


Initial evaluation by M. Lesk


50-70% on short samples of manually annotated text, with respect to the Oxford Advanced Learner's Dictionary


Simulated annealing


47% on 50 manually annotated sentences


Evaluation on Senseval-2 all-words data, with back-off to random sense (Mihalcea & Tarau 2004)

Original Lesk: 35%

Simplified Lesk: 47%

Evaluation on Senseval-2 all-words data, with back-off to most frequent sense (Vasilescu, Langlais, Lapalme 2004)

Original Lesk: 42%

Simplified Lesk: 58%


46

Outline


Task definition


Machine Readable Dictionaries


Algorithms based on Machine Readable Dictionaries


Selectional Preferences


Measures of Semantic Similarity


Heuristic-based Methods


47

Selectional Preferences


A way to constrain the possible meanings of words in a
given context



E.g. “Wash a dish” vs. “Cook a dish”

WASH-OBJECT vs. COOK-FOOD



Capture information about possible relations between
semantic classes


Common sense knowledge


Alternative terminology


Selectional Restrictions


Selectional Preferences


Selectional Constraints

48

Acquiring Selectional Preferences


From annotated corpora


Circular relationship with the WSD problem


Need WSD to build the annotated corpus


Need selectional preferences to derive WSD



From raw corpora


Frequency counts


Information theory measures


Class
-
to
-
class relations

49

Preliminaries: Learning Word-to-Word Relations


An indication of the
semantic fit between two words


1. Frequency counts


Pairs of words connected by a syntactic relation




2. Conditional probabilities


Condition on one of the words


Count(W1, W2, R)

P(W1 | W2, R) = Count(W1, W2, R) / Count(W2, R)
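
A small Python sketch of these counts over hypothetical (W1, W2, relation) triples extracted from a parsed corpus:

    from collections import Counter

    triples = [("drink", "tea", "verb-obj"), ("drink", "tea", "verb-obj"),
               ("drink", "beer", "verb-obj"), ("wash", "dish", "verb-obj")]

    pair_counts = Counter(triples)                         # Count(W1, W2, R)
    w2_counts = Counter((w2, r) for _, w2, r in triples)   # Count(W2, R)

    def p_w1_given_w2(w1, w2, r):
        # P(W1 | W2, R) = Count(W1, W2, R) / Count(W2, R)
        return pair_counts[(w1, w2, r)] / w2_counts[(w2, r)]

    print(p_w1_given_w2("drink", "tea", "verb-obj"))       # 1.0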

50

Learning Selectional Preferences (1)


Word-to-class relations (Resnik 1993)


Quantify the contribution of a semantic class using all the concepts
subsumed by that class






A(W1, C2, R) = [ P(C2 | W1, R) * log( P(C2 | W1, R) / P(C2) ) ] / Σ_C [ P(C | W1, R) * log( P(C | W1, R) / P(C) ) ]

where

P(C2 | W1, R) = Count(W1, C2, R) / Count(W1, R), with Count(W1, C2, R) = Σ_{W2 in C2} Count(W1, W2, R)
51

Learning Selectional Preferences (2)


Determine the contribution of a word sense based on the assumption of
equal sense distributions:


e.g. “plant” has two senses


50% occurrences are sense 1, 50% are sense 2



Example: learning restrictions for the verb “
to drink



Find high-scoring verb-object pairs







Find “prototypical” object classes (high association score)


Co-occ score   Verb    Object
11.75          drink   tea
11.75          drink   Pepsi
11.75          drink   champagne
10.53          drink   liquid
10.2           drink   beer
9.34           drink   wine

A(v,c)   Object class
3.58     (beverage, [drink, …])
2.05     (alcoholic_beverage, [intoxicant, …])
52

Learning Selectional Preferences (3)



Other algorithms


Learn class-to-class relations (Agirre and Martinez, 2001)

E.g.: “ingest food” is a class-to-class relation for “eat chicken”


Bayesian networks (Ciaramita and Johnson, 2000)


Tree cut model (Li and Abe, 1998)

53

Using Selectional Preferences for WSD

Algorithm:

1. Learn a large set of selectional preferences for a given syntactic relation R

2. Given a pair of words W1 - W2 connected by a relation R

3. Find all selectional preferences W1 → C (word-to-class) or C1 → C2 (class-to-class) that apply

4. Select the meanings of W1 and W2 based on the selected semantic class


Example: disambiguate coffee in “drink coffee”

1. (beverage) a beverage consisting of an infusion of ground coffee beans

2. (tree) any of several small trees native to the tropical Old World

3. (color) a medium to dark brown color



Given the selectional preference “DRINK BEVERAGE”: coffee#1

54

Evaluation of Selectional Preferences for WSD


Data set


mainly verb-object and subject-verb relations extracted from SemCor


Compare against random baseline


Results (Agirre and Martinez, 2000)


Average results on 8 nouns


Similar figures reported in (Resnik 1997)


                 Object                  Subject
                 Precision   Recall      Precision   Recall
Random           19.2        19.2        19.2        19.2
Word-to-word     95.9        24.9        74.2        18.0
Word-to-class    66.9        58.0        56.2        46.8
Class-to-class   66.6        64.8        54.0        53.7
55

Outline


Task definition


Machine Readable Dictionaries


Algorithms based on Machine Readable Dictionaries


Selectional Restrictions


Measures of Semantic Similarity


Heuristic-based Methods


56

Semantic Similarity


Words in a discourse must be related in meaning, for the discourse to be coherent (Halliday and Hasan, 1976)


Use this property for WSD


Identify related meanings for
words that share a common context



Context span:


1. Local context: semantic similarity between pairs of words


2. Global context: lexical chains


57

Semantic Similarity in a Local Context


Similarity determined between pairs of concepts, or
between a word and its surrounding context


Relies on similarity metrics on semantic networks


(Rada et al. 1989)

[Taxonomy fragment from (Rada et al. 1989): carnivore, with subtrees for feline/felid, canine/canid, and fissiped mammal/fissiped, covering wolf, wild dog, hyena dog, dingo, hyena, dog, hunting dog, dachshund, terrier, and bear.]
58

Semantic Similarity Metrics (1)


Input: two concepts (same part of speech)


Output: similarity measure


(Leacock and Chodorow 1998)




E.g. Similarity(
wolf
,
dog
) = 0.60 Similarity(
wolf
,
bear
) = 0.42



(Resnik 1995)


Define information content, P(C) = probability of seeing a concept of type
C in a large corpus



Probability of seeing a concept = probability of seeing instances of that
concept


Determine the contribution of a word sense based on the assumption of
equal sense distributions:


e.g. “plant” has two senses


50% occurrences are sense 1, 50% are sense 2










Similarity(C1, C2) = - log ( Path(C1, C2) / (2 * D) ), where D is the taxonomy depth

IC(C) = - log ( P(C) )
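
A direct Python rendering of the two formulas; the path length, taxonomy depth, and frequencies below are placeholders:

    import math

    def lch_similarity(path_length, taxonomy_depth):
        # Leacock & Chodorow: -log( Path(C1, C2) / (2 * D) )
        return -math.log(path_length / (2.0 * taxonomy_depth))

    def information_content(concept_count, total_count):
        # IC(C) = -log P(C), with P(C) estimated from corpus counts
        return -math.log(concept_count / total_count)

    print(round(lch_similarity(3, 16), 2), round(information_content(50, 100000), 2))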


59

Semantic Similarity Metrics (2)


Similarity using information content


(Resnik 1995)

Define similarity between two concepts (LCS = Least
Common Subsumer)




Alternatives
(Jiang and Conrath 1997)




Other metrics:


Similarity using information content
(Lin 1998)


Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999)


Conceptual density measure between noun semantic hierarchies and
current context

(Agirre and Rigau 1995)


Adapted Lesk algorithm

(Banerjee and Pedersen 2002)


Resnik: Similarity(C1, C2) = IC( LCS(C1, C2) )

Jiang & Conrath: Similarity(C1, C2) = 1 / ( IC(C1) + IC(C2) - 2 * IC( LCS(C1, C2) ) )




60

Semantic Similarity Metrics for WSD


Disambiguate target words based on similarity with one
word to the left and one word to the right


(Patwardhan, Banerjee, Pedersen 2002)







Evaluation:


1,723 ambiguous nouns from Senseval-2


Among 5 similarity metrics, (Jiang and Conrath 1997) provide the
best precision (39%)

Example: disambiguate PLANT in “plant with flowers”

PLANT

1.
plant, works, industrial plant

2.
plant, flora, plant life


Similarity (plant#1, flower) = 0.2

Similarity (plant#2, flower) = 1.5
: plant#2

61

Semantic Similarity in a Global Context


Lexical chains (Hirst and St-Onge 1998), (Halliday and Hasan 1976)



A lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse



Algorithm

for finding lexical chains:

1.
Select the candidate words from the text. These are words for which we can
compute similarity measures, and therefore most of the time they have the
same part of speech.

2.
For each such candidate word, and for each meaning for this word, find a
chain to receive the candidate word sense, based on a semantic relatedness
measure between the concepts that are already in the chain, and the candidate
word meaning.

3.
If such a chain is found, insert the word in this chain; otherwise, create a new
chain.
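
A greedy Python sketch of the chaining loop; senses_of and relatedness stand in for a sense inventory and a semantic relatedness measure, and the acceptance threshold is an assumption.

    def build_lexical_chains(candidate_words, senses_of, relatedness, threshold=0.5):
        chains = []   # each chain is a list of already-chosen word senses
        for word in candidate_words:
            best = None   # (score, chain, sense) of the best receiving chain
            for sense in senses_of(word):
                for chain in chains:
                    score = min(relatedness(sense, s) for s in chain)
                    if score >= threshold and (best is None or score > best[0]):
                        best = (score, chain, sense)
            if best is not None:
                best[1].append(best[2])                  # insert into that chain
            else:
                chains.append([senses_of(word)[0]])      # otherwise start a new chain
        return chains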



62

Semantic Similarity of a Global Context

“A very long train traveling along the rails with a constant velocity v in a certain direction …”

train    #1: public transport
         #2: ordered set of things
         #3: piece of cloth

travel   #1: change location
         #2: undergo transportation

rail     #1: a barrier
         #2: a bar of steel for trains
         #3: a small bird
63

Lexical Chains for WSD


Identify lexical chains in a text


Usually target one part of speech at a time


Identify the meaning of words based on their membership
to a lexical chain



Evaluation:


(Galley and McKeown 2003) lexical chains on 74 SemCor texts
give 62.09%


(Mihalcea and Moldovan 2000) on five SemCor texts give 90%
with 60% recall


lexical chains “anchored” on monosemous words


(Okumura and Honda 1994) lexical chains on five Japanese texts
give 63.4%

64

Outline


Task definition


Machine Readable Dictionaries


Algorithms based on Machine Readable Dictionaries


Selectional Restrictions


Measures of Semantic Similarity


Heuristic-based Methods


65

Most Frequent Sense (1)

Identify the most often used meaning and use this meaning by default

Example: “plant/flora” is used more often than “plant/factory”
- annotate any instance of PLANT as “plant/flora”

Word meanings exhibit a Zipfian distribution

E.g. distribution of word senses in SemCor

[Chart: relative frequency of sense numbers 1-10 in SemCor, plotted separately for nouns, verbs, adjectives, and adverbs.]
66

Most Frequent Sense (2)


Method 1: Find the most frequent sense in an annotated
corpus


Method 2: Find the most frequent sense using a method
based on distributional similarity
(McCarthy et al. 2004)

1. Given a word w, find the top k distributionally similar words N_w = {n_1, n_2, …, n_k}, with associated similarity scores {dss(w,n_1), dss(w,n_2), …, dss(w,n_k)}

2. For each sense ws_i of w, identify the similarity with the words n_j, using the sense of n_j that maximizes this score

3. Rank senses ws_i of w based on the total similarity score


Score(ws_i) = Σ_{n_j in N_w} dss(w, n_j) * wnss(ws_i, n_j) / Σ_{ws'_i in senses(w)} wnss(ws'_i, n_j)

where wnss(ws_i, n_j) = max_{ns_x in senses(n_j)} wnss'(ws_i, ns_x), i.e. the WordNet similarity between ws_i and the sense of n_j that maximizes it
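
A Python sketch of the ranking step; neighbors carries the distributional neighbours with their dss scores, and wnss is the WordNet-based similarity maximized over the neighbour's senses, as defined above.

    def rank_senses(senses, neighbors, wnss):
        # senses: candidate senses ws_i of the target word
        # neighbors: list of (n_j, dss(w, n_j)) pairs
        # wnss(ws, n): max WordNet similarity between ws and any sense of n
        scores = {}
        for ws in senses:
            total = 0.0
            for n, dss_score in neighbors:
                denom = sum(wnss(other, n) for other in senses)
                if denom > 0:
                    total += dss_score * wnss(ws, n) / denom
            scores[ws] = total
        return sorted(scores, key=scores.get, reverse=True)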







67

Most Frequent Sense(3)


Word senses


pipe #1

= tobacco pipe


pipe #2

= tube of metal or plastic


Distributional similar words


N = {
tube, cable, wire, tank, hole, cylinder, fitting, tap
, …}


For each word in N, find similarity with pipe#i (using the sense that
maximizes the similarity)


pipe#1 - tube (#3) = 0.3

pipe#2 - tube (#1) = 0.6


Compute score for each sense pipe#i


score (
pipe#1
) = 0.25


score (
pipe#2
) = 0.73


Note: results depend on the corpus used to find distributionally similar words => can find domain-specific predominant senses

68

One Sense Per Discourse

A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992)

What does this mean?

E.g. the ambiguous word PLANT occurs 10 times in a discourse:
all instances of “plant” carry the same meaning

Evaluation:

8 words with two-way ambiguity, e.g. plant, crane, etc.

98% of the two-word occurrences in the same discourse carry the same meaning

The grain of salt: performance depends on granularity

(Krovetz 1998) experiments with words with more than two senses

Performance of “one sense per discourse” measured on SemCor is approx. 70%


69

One Sense per Collocation

A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993)

Strong for adjacent collocations

Weaker as the distance between words increases

An example: the ambiguous word PLANT preserves its meaning in all its occurrences within the collocation “industrial plant”, regardless of the context where this collocation occurs

Evaluation:

97% precision on words with two-way ambiguity

Finer granularity:

(Martinez and Agirre 2000) tested the “one sense per collocation” hypothesis on text annotated with WordNet senses

70% precision on SemCor words

70

References


(Agirre and Rigau, 1995) Agirre, E. and Rigau, G. A proposal for word sense disambiguation using conceptual distance. RANLP 1995.

(Agirre and Martinez 2001) Agirre, E. and Martinez, D. Learning class-to-class selectional preferences. CONLL 2001.

(Banerjee and Pedersen 2002) Banerjee, S. and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. CICLING 2002.

(Cowie, Guthrie and Guthrie 1992) Cowie, J., Guthrie, J.A. and Guthrie, L. Lexical disambiguation using simulated annealing. COLING 1992.

(Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. One sense per discourse. DARPA workshop 1992.

(Halliday and Hasan 1976) Halliday, M. and Hasan, R. (1976). Cohesion in English. Longman.

(Galley and McKeown 2003) Galley, M. and McKeown, K. (2003) Improving word sense disambiguation in lexical chaining. IJCAI 2003.

(Hirst and St-Onge 1998) Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. WordNet: An electronic lexical database, MIT Press.

(Jiang and Conrath 1997) Jiang, J. and Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy. COLING 1997.

(Krovetz, 1998) Krovetz, R. More than one sense per discourse. ACL-SIGLEX 1998.

(Lesk, 1986) Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986.

(Lin 1998) Lin, D. An information theoretic definition of similarity. ICML 1998.

71

References


(Martinez and Agirre 2000) Martinez, D. and Agirre, E. One sense per collocation and genre/topic variations. EMNLP 2000.

(Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994.

(Miller, 1995) Miller, G. WordNet: A lexical database. ACM, 38(11), 1995.

(Mihalcea and Moldovan, 1999) Mihalcea, R. and Moldovan, D. A method for word sense disambiguation of unrestricted text. ACL 1999.

(Mihalcea and Moldovan 2000) Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS 2000.

(Mihalcea, Tarau, Figa 2004) Mihalcea, R., Tarau, P., Figa, E. PageRank on Semantic Networks with Application to Word Sense Disambiguation. COLING 2004.

(Patwardhan, Banerjee, and Pedersen 2003) Patwardhan, S., Banerjee, S. and Pedersen, T. Using Measures of Semantic Relatedness for Word Sense Disambiguation. CICLING 2003.

(Rada et al 1989) Rada, R., Mili, H., Bicknell, E. and Blettner, M. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1) 1989.

(Resnik 1993) Resnik, P. Selection and Information: A Class-Based Approach to Lexical Relationships. University of Pennsylvania 1993.

(Resnik 1995) Resnik, P. Using information content to evaluate semantic similarity. IJCAI 1995.

(Vasilescu, Langlais, Lapalme 2004) Vasilescu, F., Langlais, P., Lapalme, G. Evaluating variants of the Lesk approach for disambiguating words. LREC 2004.

(Yarowsky, 1993) Yarowsky, D. One sense per collocation. ARPA Workshop 1993.

Part 4:


Supervised Methods of Word
Sense Disambiguation

73

Outline


What is Supervised Learning?


Task Definition


Single Classifiers


Naïve Bayesian Classifiers


Decision Lists and Trees


Ensembles of Classifiers

74

What is Supervised Learning?


Collect a set of examples that illustrate the various possible
classifications or outcomes of an event.


Identify patterns in the examples associated with each
particular class of the event.


Generalize those patterns into rules.


Apply the rules to classify a new event.


75

Learn from these examples: “when do I go to the store?”

Day   CLASS: Go to Store?   F1: Hot Outside?   F2: Slept Well?   F3: Ate Well?
1     YES                   YES                NO                 NO
2     NO                    YES                NO                 YES
3     YES                   NO                 NO                 NO
4     NO                    NO                 NO                 YES

77

Outline


What is Supervised Learning?


Task Definition


Single Classifiers


Naïve Bayesian Classifiers


Decision Lists and Trees


Ensembles of Classifiers

78

Task Definition


Supervised WSD:

Class of methods that induces a classifier from
manually sense-tagged text using machine learning techniques.


Resources


Sense Tagged Text


Dictionary (implicit source of sense inventory)


Syntactic Analysis (POS tagger, Chunker, Parser, …)


Scope


Typically one target word per context


Part of speech of target word resolved


Lends itself to “targeted word” formulation


Reduces WSD to a classification problem where a target word is
assigned the most appropriate sense from a given set of possibilities
based on the context in which it occurs

79

Sense Tagged Text

Bonnie and Clyde are two really famous criminals, I think
they were
bank/1
robbers

My
bank/1
charges too much for an overdraft.

I went to the
bank/1

to deposit my check and get a new ATM
card.

The University of Minnesota has an East and a West
Bank/2

campus right on the Mississippi River.

My grandfather planted his pole in the
bank/2
and got a great
big catfish!

The
bank/2

is pretty muddy, I can’t walk there.

80

Two Bags of Words

(Co-occurrences in the “window of context”)

FINANCIAL_BANK_BAG:


a an and are ATM Bonnie card charges check Clyde
criminals deposit famous for get I much My new overdraft
really robbers the they think to too two went were


RIVER_BANK_BAG:


a an and big campus cant catfish East got grandfather great
has his I in is Minnesota Mississippi muddy My of on planted
pole pretty right River The the there University walk West

81

Simple Supervised Approach


Given a sentence S containing “bank”:




For each word W_i in S:

    If W_i is in FINANCIAL_BANK_BAG then
        Sense_1 = Sense_1 + 1;

    If W_i is in RIVER_BANK_BAG then
        Sense_2 = Sense_2 + 1;

If Sense_1 > Sense_2 then print “Financial”
else if Sense_2 > Sense_1 then print “River”
else print “Can’t Decide”;
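
The same procedure as runnable Python; the two bags are abbreviated from the slide above.

    FINANCIAL_BANK_BAG = {"atm", "card", "charges", "check", "deposit", "overdraft", "robbers"}
    RIVER_BANK_BAG = {"campus", "catfish", "muddy", "planted", "pole", "river", "walk"}

    def classify_bank(sentence):
        words = {w.strip(".,!?").lower() for w in sentence.split()}
        sense_1 = len(words & FINANCIAL_BANK_BAG)
        sense_2 = len(words & RIVER_BANK_BAG)
        if sense_1 > sense_2:
            return "Financial"
        if sense_2 > sense_1:
            return "River"
        return "Can't Decide"

    print(classify_bank("I went to the bank to deposit my check."))   # Financial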



82

Supervised Methodology


Create a sample of
training data

where a given
target word

is
manually annotated

with a sense from a
predetermined
set of possibilities.


One tagged word per instance/lexical sample disambiguation


Select a set of features with which to represent context.


co-occurrences, collocations, POS tags, verb-obj relations, etc...


Convert sense-tagged training instances to feature vectors.


Apply a machine learning algorithm to induce a classifier.


Form


structure or relation among features


Parameters


strength of feature interactions


Convert a held out sample of test data into feature vectors.


“correct” sense tags are known but not used


Apply classifier to test instances to assign a sense tag.
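
A compressed sketch of this methodology with scikit-learn (assumed available); the four training contexts and the test sentence are illustrative, not from the tutorial data.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_contexts = ["my bank charges too much for an overdraft",
                      "i deposited the check at the bank",
                      "the river bank is muddy",
                      "he fished from the bank of the mississippi"]
    train_senses = ["finance", "finance", "shore", "shore"]

    classifier = make_pipeline(CountVectorizer(), MultinomialNB())
    classifier.fit(train_contexts, train_senses)               # induce the classifier
    print(classifier.predict(["the bank cashed my check"]))    # held-out test instance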

83

From Text to Feature Vectors


My/pronoun grandfather/noun used/verb to/prep fish/verb
along/adv the/det
banks/SHORE

of/prep the/det
Mississippi/noun River/noun. (S1)


The/det
bank/FINANCE
issued/verb a/det check/noun
for/prep the/det amount/noun of/prep interest/noun. (S2)




      P-2   P-1   P+1    P+2   fish   check   river   interest   SENSE TAG
S1    adv   det   prep   det   Y      N       Y       N          SHORE
S2    -     det   verb   det   N      Y       N       Y          FINANCE

84

Supervised Learning Algorithms


Once data is converted to feature vector form, any
supervised learning algorithm can be used. Many have
been applied to WSD with good results:


Support Vector Machines


Nearest Neighbor Classifiers


Decision Trees


Decision Lists


Naïve Bayesian Classifiers


Perceptrons


Neural Networks


Graphical Models


Log Linear Models


85

Outline


What is Supervised Learning?


Task Definition


Naïve Bayesian Classifier


Decision Lists and Trees


Ensembles of Classifiers

86

Naïve Bayesian Classifier


Naïve Bayesian Classifier well known in Machine
Learning community for good performance across a range
of tasks (e.g., Domingos and Pazzani, 1997)


…Word Sense Disambiguation is no exception


Assumes
conditional independence

among features, given
the sense of a word.


The

form
of the model is assumed, but parameters are estimated
from training instances


When applied to WSD, features are often “a bag of words”
that come from the training data


Usually thousands of binary features that indicate if a word is
present in the context of the target word (or not)

87

Bayesian Inference


Given observed features, what is most likely sense?


Estimate probability of observed features given sense


Estimate unconditional probability of sense


Unconditional probability of features is a normalizing
term, doesn’t affect sense classification


p(S | F1, F2, F3, …, Fn) = p(F1, F2, F3, …, Fn | S) * p(S) / p(F1, F2, F3, …, Fn)


88

Naïve Bayesian Model




(Graphical model: the sense S is the parent node of the features F1, F2, F3, F4, …, Fn.)

p(F1, F2, …, Fn | S) = p(F1 | S) * p(F2 | S) * … * p(Fn | S)

89

The Naïve Bayesian Classifier



Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense)
and 500 for bank/2 (river sense)


P(S=1) = 1,500/2000 = .75


P(S=2) = 500/2,000 = .25


Given “credit” occurs 200 times with bank/1 and 4 times with bank/2.


P(F1=“credit”) = 204/2000 = .102


P(F1=“credit”|S=1) = 200/1,500 = .133


P(F1=“credit”|S=2) = 4/500 = .008


Given a test instance that has one feature “credit”


P(S=1|F1=“credit”) = .133*.75/.102 = .978


P(S=2|F1=“credit”) = .008*.25/.102 = .020



sense = argmax_S p(F1 | S) * … * p(Fn | S) * p(S)
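
The same arithmetic in Python (the slide rounds intermediate values, so the results differ slightly in the third decimal):

    p_s1, p_s2 = 1500 / 2000, 500 / 2000                # P(S=1), P(S=2)
    p_credit = 204 / 2000                               # P(F1 = "credit")
    p_credit_s1, p_credit_s2 = 200 / 1500, 4 / 500      # P(F1 = "credit" | S)

    p_s1_given_credit = p_credit_s1 * p_s1 / p_credit   # ~0.98
    p_s2_given_credit = p_credit_s2 * p_s2 / p_credit   # ~0.02
    prediction = 1 if p_s1_given_credit > p_s2_given_credit else 2
    print(round(p_s1_given_credit, 3), round(p_s2_given_credit, 3), prediction)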


90

Comparative Results


(Leacock, et. al. 1993) compared Naïve Bayes with a
Neural Network and a Context Vector approach when
disambiguating six senses of
line…


(Mooney, 1996) compared Naïve Bayes with a Neural
Network, Decision Tree/List Learners, Disjunctive and
Conjunctive Normal Form learners, and a perceptron when
disambiguating six senses of
line



(Pedersen, 1998) compared Naïve Bayes with Decision
Tree, Rule Based Learner, Probabilistic Model, etc. when
disambiguating
line
and 12 other words…


…All found that Naïve Bayesian Classifier performed as
well as any of the other methods!

91

Outline


What is Supervised Learning?


Task Definition


Naïve Bayesian Classifiers


Decision Lists and Trees


Ensembles of Classifiers

92

Decision Lists and Trees


Very widely used in Machine Learning.


Decision trees used very early for WSD research (e.g.,
Kelly and Stone, 1975; Black, 1988).


Represent disambiguation problem as a series of questions
(presence of feature) that reveal the sense of a word.


List decides between two senses after one positive answer


Tree allows for decision among multiple senses after a series of
answers


Uses a smaller, more refined set of features than “bag of
words” and Naïve Bayes.


More descriptive and easier to interpret.

93

Decision List for WSD (Yarowsky, 1994)


Identify
collocational
features from sense tagged data.


Word immediately to the left or right of target :


I have my bank/1
statement
.


The
river
bank/2 is muddy.


Pair of words to immediate left or right of target :


The
world’s richest

bank/1 is here in New York.


The river bank/2
is muddy.


Words found within k positions to left or right of target, where k is often 10-50:


My
credit
is just horrible because my bank/1 has made several
mistakes with my
account

and the
balance
is very low.

94

Building the Decision List


Sort order of collocation tests using log of conditional
probabilities.


Words most indicative of one sense (and not the other)
will be ranked highly.


Abs ( log ( p(S=1 | F_i = Collocation_i) / p(S=2 | F_i = Collocation_i) ) )




95

Computing DL score


Given 2,000 instances of “bank”, 1,500 for bank/1
(financial sense) and 500 for bank/2 (river sense)


P(S=1) = 1,500/2,000 = .75


P(S=2) = 500/2,000 = .25


Given “credit” occurs 200 times with bank/1 and 4
times with bank/2.


P(F1=“credit”) = 204/2,000 = .102


P(F1=“credit”|S=1) = 200/1,500 = .133


P(F1=“credit”|S=2) = 4/500 = .008


From Bayes Rule…


P(S=1|F1=“credit”) = .133*.75/.102 = .978


P(S=2|F1=“credit”) = .008*.25/.102 = .020



DL Score = abs (log (.978/.020)) = 3.89
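
The same computation as a small Python helper; the smoothing constant is an assumption to keep unseen collocations finite.

    import math

    def dl_score(count_s1, total_s1, count_s2, total_s2, smoothing=0.1):
        # abs( log( P(S=1 | collocation) / P(S=2 | collocation) ) );
        # with Bayes' rule the ratio reduces to P(F|S=1)P(S=1) / (P(F|S=2)P(S=2)).
        p1 = (count_s1 + smoothing) / total_s1 * (total_s1 / (total_s1 + total_s2))
        p2 = (count_s2 + smoothing) / total_s2 * (total_s2 / (total_s1 + total_s2))
        return abs(math.log(p1 / p2))

    print(round(dl_score(200, 1500, 4, 500), 2))   # close to the 3.89 above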


96

Using the Decision List


Sort by DL-score and go through the test instance looking for a matching feature. The first match reveals the sense…

DL-score   Feature              Sense
3.89       credit within bank   Bank/1 financial
2.20       bank is muddy        Bank/2 river
1.09       pole within bank     Bank/2 river
0.00       of the bank          N/A

97

Using the Decision List


[Decision flow: CREDIT? → BANK/1 FINANCIAL; else IS MUDDY? → BANK/2 RIVER; else POLE? → BANK/2 RIVER]

98

Learning a Decision Tree


Identify the feature that most “cleanly” divides the training
data into the known senses.


“Cleanly” measured by information gain or gain ratio.


Create subsets of training data according to feature values.


Find another feature that most cleanly divides a subset of
the training data.


Continue until each subset of training data is “pure” or as
clean as possible.


Well known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986, 1993)

In Senseval-1, a modified decision list (which supported some conditional branching) was most accurate for the English Lexical Sample task (Yarowsky, 2000)


99

Supervised WSD with Individual Classifiers


Many supervised Machine Learning algorithms have been
applied to Word Sense Disambiguation, most work
reasonably well.


(Witten and Frank, 2000) is a great intro. to supervised learning.


Features tend to differentiate among methods more than
the learning algorithms.


Good sets of features tend to include:


Co-occurrences or keywords (global)


Collocations (local)


Bigrams (local and global)


Part of speech (local)


Predicate-argument relations

Verb-object, subject-verb

Heads of Noun and Verb Phrases



100

Convergence of Results


Accuracy of different systems applied to the same data
tends to converge on a particular value, no one system
shockingly better than another.


Senseval-1, a number of systems in range of 74-78% accuracy for the English Lexical Sample task.

Senseval-2, a number of systems in range of 61-64% accuracy for the English Lexical Sample task.

Senseval-3, a number of systems in range of 70-73% accuracy for the English Lexical Sample task…


What to do next?

101

Outline


What is Supervised Learning?


Task Definition


Naïve Bayesian Classifiers


Decision Lists and Trees


Ensembles of Classifiers

102

Ensembles of Classifiers


Classifier error has two components (Bias and Variance)


Some algorithms (e.g., decision trees) try and build a
representation of the training data


Low Bias/High Variance


Others (e.g., Naïve Bayes) assume a parametric form and don’t
represent the training data


High Bias/Low Variance


Combining classifiers with different bias variance
characteristics can lead to improved overall accuracy


“Bagging” a decision tree can smooth out the effect of
small variations in the training data (Breiman, 1996)


Sample with replacement from the training data to learn multiple
decision trees.


Outliers in training data will tend to be obscured/eliminated.


103

Ensemble Considerations


Must choose different learning algorithms with
significantly different bias/variance characteristics.


Naïve Bayesian Classifier versus Decision Tree


Must choose feature representations that yield significantly
different (independent?) views of the training data.


Lexical versus syntactic features


Must choose how to combine classifiers.


Simple Majority Voting


Averaging of probabilities across multiple classifier output


Maximum Entropy combination (e.g., Klein, et. al., 2002)

104

Ensemble Results


(Pedersen, 2000) achieved state of art for
interest

and
line

data using ensemble of Naïve Bayesian Classifiers.


Many Naïve Bayesian Classifiers trained on varying sized
windows of context / bags of words.


Classifiers combined by a weighted vote


(Florian and Yarowsky, 2002) achieved state of the art for Senseval-1 and Senseval-2 data using combination of six classifiers.


Rich set of collocational and syntactic features.


Combined via linear combination of top three classifiers.


Many Senseval-2 and Senseval-3 systems employed ensemble methods.

105

References


(Black, 1988) An experiment in computational discrimination of English word senses. IBM Journal of Research and Development (32) pg. 185-194.

(Breiman, 1996) The heuristics of instability in model selection. Annals of Statistics (24) pg. 2350-2383.

(Domingos and Pazzani, 1997) On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning (29) pg. 103-130.

(Domingos, 2000) A Unified Bias Variance Decomposition for Zero-One and Squared Loss. In Proceedings of AAAI. pg. 564-569.

(Florian and Yarowsky, 2002) Modeling Consensus: Classifier Combination for Word Sense Disambiguation. In Proceedings of EMNLP, pp. 25-32.

(Kelly and Stone, 1975) Computer Recognition of English Word Senses. North Holland Publishing Co., Amsterdam.

(Klein, et. al., 2002) Combining Heterogeneous Classifiers for Word-Sense Disambiguation. Proceedings of Senseval-2. pg. 87-89.

(Leacock, et. al. 1993) Corpus based statistical sense resolution. In Proceedings of the ARPA Workshop on Human Language Technology. pg. 260-265.

(Mooney, 1996) Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. Proceedings of EMNLP. pg. 82-91.

106

References


(Pedersen, 1998) Learning Probabilistic Models of Word Sense Disambiguation. Ph.D. Dissertation. Southern Methodist University.

(Pedersen, 2000) A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation. In Proceedings of NAACL.

(Quinlan, 1986) Induction of Decision Trees. Machine Learning (1). pg. 81-106.

(Quinlan, 1993) C4.5: Programs for Machine Learning. San Francisco, Morgan Kaufmann.

(Witten and Frank, 2000) Data Mining - Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann. San Francisco.

(Yarowsky, 1994) Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of ACL. pp. 88-95.

(Yarowsky, 2000) Hierarchical decision lists for word sense disambiguation. Computers and the Humanities, 34.


Part 5:


Minimally Supervised Methods
for Word Sense Disambiguation

108

Outline


Task definition


What does “minimally” supervised mean?


Bootstrapping algorithms


Co-training

Self-training

Yarowsky algorithm


Using the Web for Word Sense Disambiguation


Web as a corpus


Web as collective mind


109

Task Definition


Supervised

WSD = learning sense classifiers starting with
annotated data


Minimally supervised
WSD = learning sense classifiers
from annotated data, with
minimal

human supervision


Examples


Automatically bootstrap a corpus starting with
a few human
annotated examples


Use
monosemous relatives / dictionary definitions
to automatically
construct sense tagged data


Rely on Web users + active learning for corpus annotation

110

Outline


Task definition


What does “minimally” supervised mean?


Bootstrapping algorithms


Co-training

Self-training

Yarowsky algorithm


Using the Web for Word Sense Disambiguation


Web as a corpus


Web as collective mind


111

Bootstrapping WSD Classifiers


Build sense classifiers with little training data


Expand applicability of supervised WSD



Bootstrapping approaches


Co-training

Self-training

Yarowsky algorithm

112

Bootstrapping Recipe


Ingredients


(Some)
labeled data


(Large amounts of)
unlabeled data


(One or more)
basic classifiers


Output


Classifier that improves over the basic classifiers

113


Seed examples:
… plants#1 and animals …
… industry plant#2 …

Unlabeled examples:
… building the only atomic plant …
… plant growth is retarded …
… a herb or flowering plant …
… a nuclear power plant …
… building a new vehicle plant …
… the animal and plant life …
… the passion-fruit plant …

Classifier 1 / Classifier 2 label new examples, e.g.:
… plant#1 growth is retarded …
… a nuclear power plant#2 …



114

Co-training / Self-training

Given:
- A set L of labeled training examples
- A set U of unlabeled examples
- Classifiers C_i

1. Create a pool of examples U'
   - choose P random examples from U

2. Loop for I iterations
   - Train C_i on L and label U'
   - Select G most confident examples and add to L
     - maintain distribution in L
   - Refill U' with examples from U
     - keep U' at constant size P
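
A condensed Python sketch of this loop, assuming a scikit-learn-style classifier with fit/predict_proba; the class-distribution constraint on L is omitted for brevity.

    import random

    def self_train(classifier, labeled, unlabeled, pool_size, growth_size, iterations):
        labeled = list(labeled)                 # list of (example, sense) pairs
        unlabeled = list(unlabeled)             # raw, unannotated examples
        random.shuffle(unlabeled)
        pool = [unlabeled.pop() for _ in range(min(pool_size, len(unlabeled)))]
        for _ in range(iterations):
            examples, senses = zip(*labeled)
            classifier.fit(list(examples), list(senses))
            probabilities = classifier.predict_proba(pool)
            # Take the G examples the classifier labels most confidently.
            ranked = sorted(zip(pool, probabilities), key=lambda ep: max(ep[1]), reverse=True)
            for example, probs in ranked[:growth_size]:
                sense = classifier.classes_[list(probs).index(max(probs))]
                labeled.append((example, sense))
                pool.remove(example)
            # Refill the pool back to constant size P.
            while unlabeled and len(pool) < pool_size:
                pool.append(unlabeled.pop())
        return classifier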

115


Co-training

(Blum and Mitchell 1998)

Two classifiers with independent views [independence condition can be relaxed]

Co-training in Natural Language Learning:

Statistical parsing (Sarkar 2001)

Co-reference resolution (Ng and Cardie 2003)

Part of speech tagging (Clark, Curran and Osborne 2003)

...

116

Self-training

(Nigam and Ghani 2000)

One single classifier

Retrain on its own output

Self-training for Natural Language Learning:

Part of speech tagging (Clark, Curran and Osborne 2003)

Co-reference resolution (Ng and Cardie 2003)

several classifiers through bagging

117

Parameter Setting for Co-training / Self-training

1. Create a pool of examples U'
   - choose P random examples from U   [Pool size P]

2. Loop for I iterations   [Iterations I]
   - Train C_i on L and label U'
   - Select G most confident examples and add to L   [Growth size G]
     - maintain distribution in L
   - Refill U' with examples from U
     - keep U' at constant size P

A major drawback of bootstrapping:

“No principled method for selecting optimal values for these parameters” (Ng and Cardie 2003)

118

Experiments with Co-training / Self-training for WSD

Training / Test data

Senseval-2 nouns (29 ambiguous nouns)

Average corpus size: 95 training examples, 48 test examples

Raw data

British National Corpus

Average corpus size: 7,085 examples

Co-training

Two classifiers: local and topical classifiers

Self-training

One classifier: global classifier

(Mihalcea 2004)


119

Parameter Settings


Parameter ranges

P = {1, 100, 500, 1000, 1500, 2000, 5000}

G = {1, 10, 20, 30, 40, 50, 100, 150, 200}

I = {1, ..., 40}

29 nouns; 120,000 runs

Upper bound in co-training/self-training performance (optimised on test set)

Basic classifier: 53.84%

Optimal self-training: 65.61%

Optimal co-training: 65.75%

~25% error reduction

Per-word parameter setting:

Co-training = 51.73%

Self-training = 52.88%

Global parameter setting

Co-training = 55.67%

Self-training = 54.16%

Example: lady

basic = 61.53%

self-training = 84.61% [20/100/39]

co-training = 82.05% [1/1000/3]

120

Yarowsky Algorithm


(Yarowsky 1995)

Similar to co-training

Differs in the basic assumption (Abney 2002)

“view independence” (co-training) vs. “precision independence” (Yarowsky algorithm)

Relies on two heuristics and a decision list

One sense per collocation:

Nearby words provide strong and consistent clues as to the sense of a target word

One sense per discourse:

The sense of a target word is highly consistent within a single document

121

Learning Algorithm


A decision list is used to classify instances of target word:

“the loss of animal and plant species through extinction …”

Classification is based on the highest ranking rule that matches the target context

LogL   Collocation                     Sense
9.31   flower (within +/- k words)     A (living)
9.24   job (within +/- k words)        B (factory)
9.03   fruit (within +/- k words)      A (living)
9.02   plant species                   A (living)
...    ...                             ...



122

Bootstrapping Algorithm


All occurrences of the target word are identified


A small training set of seed data is tagged with word sense

Sense-B: factory      Sense-A: life

123

Bootstrapping Algorithm

Seed set grows and residual set shrinks ….

124

Bootstrapping Algorithm

Convergence: Stop when residual set stabilizes

125

Bootstrapping Algorithm


Iterative procedure:


Train decision list algorithm on seed set


Classify residual data with decision list


Create new seed set by identifying samples that are tagged with a
probability above a certain threshold


Retrain classifier on new seed set


Selecting training seeds


Initial training set should accurately distinguish among possible
senses


Strategies:


Select a single, defining seed collocation for each possible sense.


Ex: “life” and “manufacturing” for target plant

Use words from dictionary definitions

Hand-label most frequent collocates

126

Evaluation


Test corpus: extracted from 460 million word corpus of multiple sources (news articles, transcripts, novels, etc.)

Performance of multiple models compared with:

supervised decision lists

unsupervised learning algorithm of Schütze (1992), based on alignment of clusters with word senses

Word     Senses              Supervised   Unsupervised (Schütze)   Unsupervised (Bootstrapping)
plant    living/factory      97.7         92                       98.6
space    volume/outer        93.9         90                       93.6
tank     vehicle/container   97.1         95                       96.5
motion   legal/physical      98.0         92                       97.9
Avg.     -                   96.1         92.2                     96.5

127

Outline


Task definition


What does “minimally” supervised mean?


Bootstrapping algorithms


Co-training

Self-training

Yarowsky algorithm


Using the Web for Word Sense Disambiguation


Web as a corpus


Web as collective mind


128

The Web as a Corpus


Use the Web as a large textual corpus


Build annotated corpora using monosemous relatives


Bootstrap annotated corpora starting with few seeds


Similar to (Yarowsky 1995)


Use the (semi)automatically tagged data to train WSD
classifiers

129

Monosemous Relatives


Idea
: determine a phrase (SP) which uniquely identifies the
sense of a word (W#i)

1.
Determine one or more Search Phrases from a machine readable
dictionary using several heuristics

2.
Search the Web using the Search Phrases from step 1.

3.
Replace the Search Phrases in the examples gathered at 2 with W#i.