Word sense disambiguation






Sectoral Operational Programme "Increase of Economic Competitiveness"
"Investments for your future"
Project co-financed by the European Regional Development Fund

General Word Sense Disambiguation System applied to Romanian and English Languages - SenDiS -

Alin Ştefănescu, Oana Șoica, Andrei Mincă & SenDiS team

June 27, 2013

Word sense disambiguation using lexicon nets

Alin Ştefănescu

Introduction


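The ambiguous hen

„Găina cea nouă ne ouă nouă nouă ouă.“
(Roughly: "The new hen lays nine eggs for us." In Romanian, "nouă" can mean "new", "nine" or "to us", and "ouă" can be the noun "eggs" or the verb "lays".)

Image from aliexpress.com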



Natural Language Processing (NLP)

- NLP develops systems that allow computers to communicate with people using everyday language.
- An important area: natural language understanding
- Subproblem: word sense disambiguation





NLP @ SOFTWIN Research

- NLP is an active research area at Softwin Research
- biometrics is the other active area
- previously, antivirus research in the same R&D department led to the creation of an award-winning, internationally certified internet security and antivirus software


NLP @ Softwin Research - SenDiS project

SenDiS project at Softwin Research

- A general Word Sense Disambiguation System applied to Romanian and English languages
- 2010-2013
- co-financed through the Sectoral Operational Programme "Increase of Economic Competitiveness" (POS-CCE)
- team of 7-10 computer scientists and linguists
- method: use of structured linguistic knowledge encoded with Softwin's GRAALAN formalism
- previous projects: PALIROM & LINCOR (with collaborators from UB, ILIR, UPB etc.)






NLP system - GRAALAN

1. Linguistic theoretical background
2. GRAALAN Grammar Abstract Language
3. Linguistic tools
4. Linguistic knowledge bases
5. Linguistic applications

SenDiS builds upon and further develops the NLP system GRAALAN at Softwin Research.




Word Sense Disambiguation (WSD)

- identify the meaning of words in context, in a computational manner
- very difficult problem
- three main approaches:
  - supervised disambiguation
  - unsupervised disambiguation
  - knowledge-based disambiguation (SenDiS)

Image: "Tower of Babel" by Brueghel


Dealing with ambiguity

GRAALAN knowledge bases can encode several types of ambiguities:

- multiword expression (MWE) ambiguity
- morphologic ambiguity (synthetic & analytic)
- lexical ambiguity (synthetic & analytic)
- morphemic ambiguity
- syntactic ambiguity




Lesk Algorithm - basic idea

- a simple and intuitive knowledge-based WSD approach
- computes the word overlap between sense definitions of context target words

  For a two-word context $(w_1, w_2)$, with $S_1 \in \mathrm{Senses}(w_1)$ and $S_2 \in \mathrm{Senses}(w_2)$:

  $\mathrm{scoreLesk}(S_1, S_2) = |\,\mathrm{gloss}(S_1) \cap \mathrm{gloss}(S_2)\,|$

- another variant, less computationally intensive, computes the word overlap between a word sense definition and other context words

  $\mathrm{scoreLeskVar}(S) = |\,\mathrm{context}(w) \cap \mathrm{gloss}(S)\,|$
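The two scores above can be sketched in a few lines of Python. This is only an illustration of the overlap idea, not the SenDiS implementation: the function names, the whitespace tokenizer and the English toy glosses are all assumptions made for this example.

```python
from itertools import product

def tokens(gloss):
    """Lowercase a gloss and return its set of word tokens, without edge punctuation."""
    return {w.strip('.,;:()"').lower() for w in gloss.split()} - {""}

def score_lesk(gloss1, gloss2):
    """scoreLesk: number of distinct words shared by two sense definitions."""
    return len(tokens(gloss1) & tokens(gloss2))

def score_lesk_var(context_words, gloss):
    """scoreLeskVar: number of distinct context words found in one sense definition."""
    return len({w.lower() for w in context_words} & tokens(gloss))

def best_pair(senses_w1, senses_w2):
    """Try every (S1, S2) combination and return the sense-id pair with the largest overlap."""
    return max(
        product(senses_w1, senses_w2),
        key=lambda pair: score_lesk(senses_w1[pair[0]], senses_w2[pair[1]]),
    )

# Toy glosses, invented for this sketch (not from SenDiS or any real dictionary).
PINE = {
    "1": "kind of evergreen tree with needle-shaped leaves",
    "2": "waste away through sorrow or illness",
}
CONE = {
    "1": "solid body which narrows to a point",
    "2": "something of this shape whether solid or hollow",
    "3": "fruit of certain evergreen trees",
}

print(best_pair(PINE, CONE))                                       # ('1', '3')
print(score_lesk_var(["tall", "evergreen", "forest"], CONE["3"]))  # 1
```

The pairwise score has to visit every combination of senses, while the variant scores each sense of one word once against a fixed context, which is why it is cheaper. Note that the winning pair here partly relies on the function word "of"; a practical system would likely filter such words and lemmatize ("tree" and "trees" do not match in this sketch).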


Our approach: Lesk Algorithm extended

[Figure: the words W1, W2, ..., Wn of a text, the sense definitions S1, S2, ..., Sk, and the words W1, W2, ..., Wm of each definition. Legend: sense definition; annotated / WSD selected definition; link to a lexicon entry/sense; link to an annotated lexicon entry/sense; link to a non annotated lexicon entry/sense.]

Our approach: Lesk algorithm reasoning extended. Every annotated sense is extended with its definition, which in turn contains words with disambiguated senses, and so on.
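The "and so on" step can be pictured as a level-by-level expansion of a sense definition through its annotated words. The sketch below is a minimal illustration under stated assumptions: the `extended_gloss` function, the `(word, linked_sense)` encoding, the depth limit and the English toy lexicon are all hypothetical and not taken from SenDiS.

```python
def extended_gloss(sense, lexicon, depth):
    """Collect the words of a sense definition and, level by level, the words of
    the definitions of the senses annotated in it (up to `depth` extra levels)."""
    words, seen, frontier = set(), {sense}, [sense]
    for _ in range(depth + 1):
        next_frontier = []
        for current in frontier:
            for word, linked in lexicon.get(current, []):
                words.add(word)
                # Follow only annotated words, and never revisit a sense (cycle guard).
                if linked is not None and linked not in seen:
                    seen.add(linked)
                    next_frontier.append(linked)
        frontier = next_frontier
    return words

# Hypothetical toy lexicon: each definition is a list of (word, annotated sense or None).
LEXICON = {
    ("radio", 0): [("device", ("device", 0)), ("for", None),
                   ("receiving", None), ("broadcasts", None)],
    ("device", 0): [("technical", None), ("system", ("system", 0)),
                    ("of", None), ("parts", None)],
    ("system", 0): [("set", None), ("of", None), ("connected", None),
                    ("devices", ("device", 0))],  # links back to "device": a cycle
}

print(sorted(extended_gloss(("radio", 0), LEXICON, depth=0)))
print(sorted(extended_gloss(("radio", 0), LEXICON, depth=2)))
```

With `depth=0` the result is the plain definition used by the basic Lesk score; larger depths give the overlap computation more words to match against, at the cost of pulling in less specific vocabulary.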



Lesk Algorithm extended - example

Generic example (Principle):

<lemma>… =
    Sense 1 : <word> <word> <word> <word>
    Sense 2 : <word> <word> <word> <word>
    Sense 3 : <word> <word> <word> <word>

<lemma>… =
    Sense 1 : <word> <word> <word> <word>
    Sense 2 : <word> <word> <word> <word>
    Sense 3 : <word> <word> <word> <word>

<lemma>… =
    Sense 1 : <word> <word> <word> <word>
    Sense 2 : <word> <word> <word> <word>
    Sense 3 : <word> <word> <word> <word>


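Lesk Algorithm extended - example

Romanian example (English renderings added in square brackets):

"radio" =
    "0" : "Aparat de receptie radiofonica; radioreceptor."
          [Radio receiving device; radio receiver.]
    "1" : "Instalatie de transmitere a sunetelor prin unde electromagnetice, cuprinzând aparatele de emisiune şi pe cele de receptie."
          [Installation for transmitting sounds by electromagnetic waves, comprising the transmitting devices and the receiving ones.]

"aparat" =
    "0" : "Sistem de piese care serveste pentru o operatie mecanica, tehnica, stiintifica etc."
          [System of parts that serves for a mechanical, technical, scientific etc. operation.]
    "1" : "Sistem tehnic care transforma o forma de energie în alta."
          [Technical system that transforms one form of energy into another.]
    "2" : "Ansamblu de organe anatomice care servesc la îndeplinirea unei functiuni fundamentale."
          [Set of anatomical organs that serve to fulfil a fundamental function.]
    "3" : "Totalitatea serviciilor sau a personalului care asigura bunul mers al unei institutii sau al unui domeniu de activitate."
          [All the services or staff that ensure the smooth running of an institution or of a field of activity.]
    "4" : "Ansamblul mijloacelor care servesc pentru un anumit scop."
          [The set of means that serve a certain purpose.]

"receptie" =
    "0" : "Operatie de luare în primire a unui material sau a unei lucrari, pe baza verificarii lor cantitative şi calitative."
          [Operation of taking delivery of a material or of a piece of work, based on their quantitative and qualitative verification.]
    "1" : "Serviciu într-o întreprindere hoteliera care are evidenta persoanelor aflate în hotel, face repartizarea în camere a solicitatorilor etc."
          [Service in a hotel business that keeps a record of the persons staying in the hotel, assigns rooms to applicants, etc.]
    "2" : "(Tehn) Primire a unei anumite forme de energie pentru a o transforma în alta forma de energie."
          [(Tech.) Receiving a certain form of energy in order to transform it into another form of energy.]
    "3" : "Reuniune, banchet cu caracter festiv (în cercurile oficiale)."
          [Gathering, banquet of a festive nature (in official circles).]
    "4" : "Primire, întâmpinare (cu caracter ceremonios) a unui oaspete."
          [Reception, welcoming (of a ceremonious nature) of a guest.]

"radiofonic" =
    "0" : "Care aparţine radiofoniei, privitor la radiofonie, care utilizeaza radiofonia."
          [Belonging to radio broadcasting, relating to radio broadcasting, which uses radio broadcasting.]

"radioreceptor" =
    "0" : "Aparat folosit pentru receptionarea undelor radiofonice (prin antene), pentru transformarea lor în semnale sonore şi transmiterea lor prin intermediul difuzoarelor; radio."
          [Device used for receiving radio waves (through antennas), for transforming them into sound signals and transmitting them through loudspeakers; radio.]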
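To get a feel for the size of the problem in the Romanian example above, the following sketch counts how many sense combinations the definition of "radio", sense "0", admits. The sense counts are read off the example; the `LEMMA_OF` table is a hypothetical one-entry stand-in for a real lemmatizer, and the function name is an assumption of this sketch.

```python
from math import prod

# Number of senses listed for each lemma in the Romanian example.
SENSE_COUNTS = {"radio": 2, "aparat": 5, "receptie": 5,
                "radiofonic": 1, "radioreceptor": 1}

# Hypothetical stand-in for a lemmatizer: maps an inflected form to its lemma.
LEMMA_OF = {"radiofonica": "radiofonic"}

def lexicon_lemmas(gloss):
    """Return, in order, the lemmas of the gloss words that have a lexicon entry."""
    found = []
    for token in gloss.lower().split():
        token = token.strip(".;,")
        lemma = LEMMA_OF.get(token, token)
        if lemma in SENSE_COUNTS:
            found.append(lemma)
    return found

gloss_radio_0 = "Aparat de receptie radiofonica; radioreceptor."
lemmas = lexicon_lemmas(gloss_radio_0)
combinations = prod(SENSE_COUNTS[lemma] for lemma in lemmas)

print(lemmas)        # ['aparat', 'receptie', 'radiofonic', 'radioreceptor']
print(combinations)  # 25
```

Even this five-word definition allows 25 readings, and "radiofonica" only links to its entry "radiofonic" after lemmatization, which is why inflected forms have to be reduced to lemmas before definitions can be linked.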


WSD using a specific lexicon network

[Figure: a LARGE lexicon net made of many nodes labelled "C-S". Two "Word Sense" nodes are connected by the "gloss tagged" relation, labelled "defines" and "defined by".]
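One way to picture such a net is as a directed graph over word senses, stored in both directions. The sketch below is an assumption-laden illustration, not the SenDiS data model: the function name, the `lemma#sense` identifiers, the reading of "defines" / "defined by" as the two directions of the "gloss tagged" relation, and the sense tags chosen for the sample data are all hypothetical.

```python
from collections import defaultdict

def build_lexicon_net(tagged_glosses):
    """Build both directions of the 'gloss tagged' relation.

    tagged_glosses maps a sense id to the sense ids annotated in its definition.
    Returns (defined_by, defines):
      defined_by[s] = senses that occur in the definition of s
      defines[s]    = senses whose definitions contain s
    """
    defined_by = defaultdict(set)
    defines = defaultdict(set)
    for sense, gloss_senses in tagged_glosses.items():
        for gloss_sense in gloss_senses:
            defined_by[sense].add(gloss_sense)
            defines[gloss_sense].add(sense)
    return dict(defined_by), dict(defines)

# Hypothetical annotations: which sense of each gloss word was selected is assumed here.
TAGGED = {
    "radio#0": ["aparat#0", "receptie#2", "radiofonic#0", "radioreceptor#0"],
    "radioreceptor#0": ["aparat#0", "radio#0"],
}

defined_by, defines = build_lexicon_net(TAGGED)
print(sorted(defined_by["radio#0"]))
print(sorted(defines["aparat#0"]))
```

Keeping both adjacency maps makes it cheap to walk the net in either direction, from a sense to the senses used to define it or from a sense to everything it helps define.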


SenDiS - workflow

[Workflow diagram. Labelled elements:]

- Disambiguation tool
- Tool for creating the lexicon net
- algorithm for lexicon net construction
- lemmatization engine
- lexicon
- linguistic knowledge
- word sense definitions
- inflection forms
- lemmas
- unordered lexicon net (word sense relations)
- ordered lexicon net
- ordered lexicon
- Tool for ordering the lexicon net
- Text tagged with disambiguation information
- Source texts