Lecture 8: Natural Language Processing and IR. Morphology and ...

Oct 24, 2013

Special Topics in Computer Science



Advanced Topics in Information Retrieval



Lecture 8:


Natural Language Processing and IR.
Synonymy, Morphology, and Stemming




Alexander Gelbukh


www.Gelbukh.com


2

Previous Chapter:
Conclusions


Parallel computing can improve


response time for each query and/or


throughput: the number of queries processed at the same speed


Document partitioning is simple


good for distributed computing


Term partitioning is good for some data structures


Distributed computing is MIMD computing with slow communication


SIMD machines are good for Signature files


Both are out of favor now

3

Previous Chapter: Research topics


How to evaluate the speedup


New algorithms


Adaptation of existing algorithms


Merging the results is a bottleneck


Meta search engines


Creating large collections with judgements


Is recall important?

4

Problem


Recall image retrieval:


Find images similar in color, size, ...


Find photos of the Korean President?


Find nice girls? (Don’t show ugly ones!)


Looks very stupid


Lacks understanding


Too difficult


Text retrieval is no exception


Find stories with a sad beginning and a happy end?


Lacks understanding


Difficult but possible

5

Possible?


Text is intended to facilitate understanding


Supposedly, even partial understanding should help


Degrees of understanding:


Character strings (what is used now): well, geese, him


Words (often used now): goose, he


Concepts: hole in the ground (well), Roh Moo-Hyun


Complex concepts: oil well, hot dog


Situations (sentences, paragraphs)


The story (direct meaning)


The message (pragmatics, intended impact)

6

Easy?


Main problems:


Multiple ways to say the same


Query does not match the doc


Difficult to specify all variants


Ambiguity of the text


False alarms in matching


Lack of implicit knowledge of the computer


The computer “does not understand” the message


Difficult to make inferences


Natural Language Processing tries to solve them



7

Solutions


Multiple ways to say the same?


Normalizing: transforming to a “standard” variant


Ambiguity of the text?


Ambiguity resolution


Normalizing to one of the variants


Perhaps the main problem in natural language processing


Lack of implicit knowledge of the computer?


Dictionaries, grammars


Knowledge on language structure is needed in all tasks


Knowledge of the world is useful for advanced tasks


Knowledge on language use is a substitute


8

Synonymy


Multiple ways to say the same


Or at least when the difference does not matter


Can be substituted in any (many?) contexts


Lexical synonymy


Woman / female, professor / teacher


Dictionaries


Phrase-level or sentence-level synonymy


They gave me a book / I was given a book by them


Syntactic analyzers


Semantic-level synonymy


Reasoning

9

Not only synonymy


Multiple ways to say


the same (synonymy)


less: more general (hypernymy)


more: more specific (hyponymy)


Complete synonyms are rare


professor vs. teacher


Abbreviations are usually (almost) complete synonyms


When the differences do not matter, can be treated as synonymy


But: different data structures and methods

10

Lexical-level synonymy


Lexical synonymy


Woman / female


Mixed-type synonymy: USA / United States


Morphology is a kind of synonymy (actually hyponymy)


‘geese’ = ‘goose’ + ‘many’


Russian ‘knigu’ = ‘kniga’ + ‘accusative role’


the “second” part of the meaning is either not important or is another term

Morphology is a very common problem in IR

11

Lexical synonymy


Woman / female


Dictionaries


Synonym dictionaries


WordNet


Automatic learning of synonymy


Clustering of contexts


If the contexts are very similar, then possible synonyms


Problem: is the meaning preserved? Monday / Tuesday


An interesting solution: compare dictionary definitions
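
The context-clustering idea can be sketched in a few lines. This is only an illustration under stated assumptions: the toy corpus, the window size, and the function names `context_vectors` and `cosine` are all hypothetical, and a real system would use a large corpus and a proper clustering step. It also shows the problem noted above: Monday and Tuesday get nearly identical contexts without being synonyms.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """For every word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy corpus, chosen only to illustrate the Monday / Tuesday problem.
corpus = [
    "we meet on monday morning",
    "we meet on tuesday morning",
    "the cat sleeps on the sofa",
]
vectors = context_vectors(corpus)
same_context = cosine(vectors["monday"], vectors["tuesday"])
other_context = cosine(vectors["monday"], vectors["cat"])
```

Here `same_context` is much higher than `other_context`, so a purely context-based method would propose Monday / Tuesday as synonyms.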


12

Uses in IR


Query expansion


Add synonyms of the word to the query and process normally


Flexible, slow


Best for lexical synonymy: few synonyms, doubtful


Reducing at index time


When reading the documents, reduce each word to a “standard” synonym


Fast, rigid


Best for morphology: many synonyms, less doubtful


Hierarchical indexing
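
The first two options, query expansion and reduction at index time, can be contrasted with a minimal sketch. The tables `SYNONYMS` and `CANONICAL`, the sample documents, and all function names are assumptions made for illustration only; a real system might fill the tables from a synonym dictionary and a morphological analyzer.

```python
from collections import defaultdict

# Hypothetical synonym table (used at query time).
SYNONYMS = {"woman": {"female"}, "female": {"woman"}}
# Hypothetical table mapping each variant to its "standard" form (used at index time).
CANONICAL = {"geese": "goose", "students": "student"}

def build_index(docs, normalize=False):
    """Map each word (optionally reduced to its standard form) to document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            key = CANONICAL.get(word, word) if normalize else word
            index[key].add(doc_id)
    return index

def search_expanded(index, word):
    """Query expansion: look up the word together with all its synonyms."""
    hits = set()
    for term in {word} | SYNONYMS.get(word, set()):
        hits |= index.get(term, set())
    return hits

def search_reduced(index, word):
    """Index-time reduction: reduce the query word the same way as the documents."""
    return index.get(CANONICAL.get(word, word), set())

docs = {1: "a woman feeds geese", 2: "female students", 3: "one goose"}
plain_index = build_index(docs)
reduced_index = build_index(docs, normalize=True)
```

With the plain index, `search_expanded(plain_index, "woman")` does several lookups per query (flexible, slow); with the reduced index, `search_reduced(reduced_index, "geese")` does a single lookup, but the choice of standard form is fixed when the index is built (fast, rigid).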

13

Hierarchical indexing (Gelbukh, Sidorov, Guzman-Arenas 2002)


Tree of concepts


Living things
    Animals
        1. a. Cat, b. cats
        2. a. Dog, b. dogs
    Persons
        3. a. Professor, b. professors
        4. a. Student, b. students


Order vocabulary by the order of the leaves of tree


Query expansion is done by ranges:


cat: 1, living things: 1-4
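
A minimal sketch of the range idea, assuming the small tree from this slide: leaves are numbered left to right, every inner concept stores the range of leaf numbers below it, and a query on a concept becomes a scan over that range. The data layout and the names `TREE`, `number_leaves`, and `search` are hypothetical and are not taken from the cited paper.

```python
# The concept tree from the slide as (name, children) pairs; leaves have no children.
TREE = ("living things", [
    ("animals", [("cat", []), ("dog", [])]),
    ("persons", [("professor", []), ("student", [])]),
])

def number_leaves(node, ranges, counter=None):
    """Number leaves left to right; record the (first, last) leaf range of each node."""
    if counter is None:
        counter = [0]
    name, children = node
    if not children:
        counter[0] += 1
        ranges[name] = (counter[0], counter[0])
    else:
        first = counter[0] + 1
        for child in children:
            number_leaves(child, ranges, counter)
        ranges[name] = (first, counter[0])
    return ranges

RANGES = number_leaves(TREE, {})

def search(postings, concept):
    """Expand a concept to its leaf range and collect the documents of every leaf."""
    first, last = RANGES[concept]
    hits = set()
    for leaf in range(first, last + 1):
        hits |= postings.get(leaf, set())
    return hits

# Hypothetical postings: leaf number -> documents containing that word group.
postings = {1: {"d1"}, 2: {"d2"}, 4: {"d3"}}
```

Because the vocabulary is ordered by leaf number, a general query such as `search(postings, "living things")` covers leaves 1 to 4 as one contiguous range, while `search(postings, "cat")` covers only leaf 1.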

14

Morphology


One of the major concerns in IR


Can be done


precisely


approximately (quick-and-dirty)


Level of generalization


inflection: student -> students


derivation: study -> student


Ambiguity


all variants


one variant

15

... morphology


Result is


The unique ID


The dictionary form


A “stem”: part of the same string


16

Morphological analyzers


Precise analysis


Ambiguous


Give all variants


Tables: to table or the table?


Spanish charlas: charla ‘talk’ or charlar ‘to talk’


Russian dush: dush ‘shower’ or dusha ‘soul’


Common in languages with developed morphology


For short words, some 3, 5, or even 10 variants

Dictionaries are used

17

Morphological system


Dictionary specifies:


Stem: bak-, ask-


POS (part of speech): verb


Inflection class (what endings it accepts): 1, 2


Tables of endings specify


Paradigms:

1. -e, -es, -ed, -ed, -ing

2. -(empty), -s, -ed, -ed, -ing


Meanings: participle, ...


18

... morphological system


Algorithm


Decompose the word into an existing stem and ending


Check compatibility of stem and ending


Give the stem ID and ending meaning


Ambiguous


Many variants of decompositions


Many stems with different IDs


Many endings with different meaning


-ed: past or participle


Problem: words absent in dictionary
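
The algorithm above can be sketched with the two stems and two paradigms from the previous slide. The dictionary layout, the stem IDs, and the meaning labels (`base`, `3sg`, `gerund`) are assumptions added for illustration; the slide names only "participle" explicitly.

```python
# Hypothetical dictionary: stem -> (stem ID, part of speech, inflection class).
STEMS = {
    "bak": ("bake", "verb", 1),
    "ask": ("ask", "verb", 2),
}

# Tables of endings: inflection class -> ending -> possible meanings.
ENDINGS = {
    1: {"e": ["base"], "es": ["3sg"], "ed": ["past", "participle"], "ing": ["gerund"]},
    2: {"": ["base"], "s": ["3sg"], "ed": ["past", "participle"], "ing": ["gerund"]},
}

def analyze(word):
    """Return every (stem ID, POS, meaning) reading the dictionary allows for the word."""
    readings = []
    # Try every decomposition of the word into stem + ending.
    for cut in range(1, len(word) + 1):
        stem, ending = word[:cut], word[cut:]
        if stem not in STEMS:
            continue
        stem_id, pos, inflection_class = STEMS[stem]
        # Compatibility check: the ending must belong to the stem's paradigm.
        for meaning in ENDINGS[inflection_class].get(ending, []):
            readings.append((stem_id, pos, meaning))
    return readings
```

`analyze("baked")` returns two readings (past and participle), which is the ambiguity of -ed mentioned above, and `analyze("walked")` returns an empty list, which is the problem of words absent from the dictionary.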

19

Stemming


Substitute for real analysis


Both inflection and derivation


Quick-and-dirty


Only one variant


Result: a part of the string


gene, genial -> gen-


Cheap development


bad results


simple description. Standard


Often used in academic research


Formerly used in real systems, but less so now

20

Porter stemmer


Martin Porter, 1980


Standard stemmer


Provides equal basis for evaluation of different IR programs


Uses “measure” m: every word has the form


[C](VC){m}[V], where C is a sequence of consonants and V is a sequence of vowels.


m=0 TR, EE, TREE, Y, BY.


m=1 TROUBLE, OATS, TREES, IVY.


m=2 TROUBLES, PRIVATE, OATEN, ORRERY.
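
A small sketch of how the measure can be computed for a lowercase word. It follows Porter's definition of a consonant (any letter other than a, e, i, o, u, and other than y preceded by a consonant); the function names are chosen here for illustration.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of the word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC sequences, i.e. m in the form [C](VC){m}[V]."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        # Each vowel-to-consonant transition closes one VC group.
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m
```

For example, `measure("tree")` is 0, `measure("trouble")` is 1, and `measure("orrery")` is 2, matching the examples on this slide.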


21

... Porter stemmer


Step 1a


SSES -> SS      caresses -> caress
IES -> I        ponies -> poni, ties -> ti
SS -> SS        caress -> caress
S -> (null)     cats -> cat
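
Step 1a translates almost directly into code. The sketch below assumes lowercase input and tries the rules in the order shown, so the longest matching suffix wins; the function name `step_1a` is chosen here for illustration.

```python
def step_1a(word):
    """Porter Step 1a: plural endings."""
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]      # IES -> I
    if word.endswith("ss"):
        return word           # SS -> SS (protects words such as "caress")
    if word.endswith("s"):
        return word[:-1]      # S -> (null)
    return word
```

The SS rule changes nothing by itself; it exists only to stop the final S rule from turning "caress" into "cares".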

22

... Porter stemmer


Step 1b


(m>0) EED -> EE       feed -> feed, agreed -> agree
(*v*) ED -> (null)    plastered -> plaster, bled -> bled
(*v*) ING -> (null)   motoring -> motor, sing -> sing

23

... Porter stemmer


If the 2nd or 3rd rule of Step 1b was successful:


AT -> ATE       conflat(ed) -> conflate
BL -> BLE       troubl(ed) -> trouble
IZ -> IZE       siz(ed) -> size


(*d and not (*L or *S or *Z)) -> single letter

    hopp(ing) -> hop
    tann(ed) -> tan
    fall(ing) -> fall
    hiss(ing) -> hiss
    fizz(ed) -> fizz


(m=1 and *o) -> E

    fail(ing) -> fail
    fil(ing) -> file
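
Step 1b together with its clean-up rules can be sketched as follows. The conditions follow Porter's definitions: `*v*` means the stem contains a vowel, `*d` means it ends in a double consonant, and `*o` means it ends consonant-vowel-consonant where the last consonant is not w, x, or y. Input is assumed to be lowercase, and the helper names are chosen here for illustration.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC sequences in [C](VC){m}[V]."""
    m, previous_is_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m

def has_vowel(stem):                      # condition *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):          # condition *d
    return len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem, len(stem) - 1)

def ends_cvc(stem):                       # condition *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def cleanup(stem):
    """Rules applied only after ED or ING was removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem

def step_1b(word):
    if word.endswith("eed"):
        # EED is the longest match, so ED is not tried even if m = 0.
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return cleanup(stem) if has_vowel(stem) else word
    return word
```

For example, `step_1b("hopping")` gives "hop" through the double-consonant rule, and `step_1b("filing")` gives "file" through the (m=1 and *o) rule.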

24

... Porter stemmer


Step 1c


(*v*) Y -> I    happy -> happi, sky -> sky

25

... Porter stemmer


Step 2


(m>0) ATIONAL -> ATE    relational -> relate
(m>0) TIONAL -> TION    conditional -> condition, rational -> rational
(m>0) ENCI -> ENCE      valenci -> valence
(m>0) ANCI -> ANCE      hesitanci -> hesitance
(m>0) IZER -> IZE       digitizer -> digitize
(m>0) ABLI -> ABLE      conformabli -> conformable
(m>0) ALLI -> AL        radicalli -> radical
(m>0) ENTLI -> ENT      differentli -> different
(m>0) ELI -> E          vileli -> vile
(m>0) OUSLI -> OUS      analogousli -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION -> ATE      predication -> predicate
(m>0) ATOR -> ATE       operator -> operate
(m>0) ALISM -> AL       feudalism -> feudal
(m>0) IVENESS -> IVE    decisiveness -> decisive
(m>0) FULNESS -> FUL    hopefulness -> hopeful
(m>0) OUSNESS -> OUS    callousness -> callous
(m>0) ALITI -> AL       formaliti -> formal
(m>0) IVITI -> IVE      sensitiviti -> sensitive
(m>0) BILITI -> BLE     sensibiliti -> sensible
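
Step 2 is a table lookup, so it can be sketched as data plus one generic function. The sketch assumes lowercase input; only the rule with the longest matching suffix is considered, and it fires only when the remaining stem has m > 0. The names `STEP2_RULES` and `step_2` are chosen here for illustration.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC sequences in [C](VC){m}[V]."""
    m, previous_is_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m

# (suffix, replacement) pairs from the Step 2 table.
STEP2_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]

def step_2(word):
    # Try longer suffixes first so that the longest match decides.
    for suffix, replacement in sorted(STEP2_RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word
```

The longest-match policy explains the "rational" example: ATIONAL matches, the stem "r" has m = 0, and the shorter TIONAL rule is never tried, so the word stays unchanged.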

26

... Porter stemmer


Step 3


(m>0) ICATE -> IC       triplicate -> triplic
(m>0) ATIVE -> (null)   formative -> form
(m>0) ALIZE -> AL       formalize -> formal
(m>0) ICITI -> IC       electriciti -> electric
(m>0) ICAL -> IC        electrical -> electric
(m>0) FUL -> (null)     hopeful -> hope
(m>0) NESS -> (null)    goodness -> good

27

... Porter stemmer


Step 4


(m>1) AL -> (null)      revival -> reviv
(m>1) ANCE -> (null)    allowance -> allow
(m>1) ENCE -> (null)    inference -> infer
(m>1) ER -> (null)      airliner -> airlin
(m>1) IC -> (null)      gyroscopic -> gyroscop
(m>1) ABLE -> (null)    adjustable -> adjust
(m>1) IBLE -> (null)    defensible -> defens
(m>1) ANT -> (null)     irritant -> irrit
(m>1) EMENT -> (null)   replacement -> replac
(m>1) MENT -> (null)    adjustment -> adjust
(m>1) ENT -> (null)     dependent -> depend
(m>1 and (*S or *T)) ION -> (null)    adoption -> adopt
(m>1) OU -> (null)      homologou -> homolog
(m>1) ISM -> (null)     communism -> commun
(m>1) ATE -> (null)     activate -> activ
(m>1) ITI -> (null)     angulariti -> angular
(m>1) OUS -> (null)     homologous -> homolog
(m>1) IVE -> (null)     effective -> effect
(m>1) IZE -> (null)     bowdlerize -> bowdler
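
Step 4 only deletes suffixes, so a list of suffixes is enough. The sketch below assumes lowercase input, requires m > 1 for the remaining stem, and adds the extra condition for ION (the stem must end in s or t). The names `STEP4_SUFFIXES` and `step_4` are chosen here for illustration.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC sequences in [C](VC){m}[V]."""
    m, previous_is_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m

STEP4_SUFFIXES = [
    "al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement", "ment",
    "ent", "ion", "ou", "ism", "ate", "iti", "ous", "ive", "ize",
]

def step_4(word):
    # Longest suffix first: EMENT before MENT before ENT.
    for suffix in sorted(STEP4_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if measure(stem) <= 1:
                return word
            if suffix == "ion" and not stem.endswith(("s", "t")):
                return word
            return stem
    return word
```

The ordering matters for words such as "replacement": EMENT matches first and yields "replac", whereas trying MENT first would give "replace".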

28

... Porter stemmer


Step 5a


(m>1) E -> (null)                probate -> probat, rate -> rate
(m=1 and not *o) E -> (null)     cease -> ceas


Step 5b


(m > 1 and *d and *L) -> single letter

    controll -> control
    roll -> roll
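
The two parts of Step 5 can be sketched as follows, again assuming lowercase input. Condition `*o` is taken from Porter's definition (the stem ends consonant-vowel-consonant and the last consonant is not w, x, or y); the function names are chosen here for illustration.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC sequences in [C](VC){m}[V]."""
    m, previous_is_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m

def ends_cvc(stem):                       # condition *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def step_5a(word):
    """Remove a final E if the stem is long enough."""
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word

def step_5b(word):
    """Reduce a final double L to a single L when m > 1."""
    if measure(word) > 1 and word.endswith("ll"):
        return word[:-1]
    return word
```

The `*o` test is what keeps "rate" intact: its stem "rat" has m = 1 but ends consonant-vowel-consonant, so the final E stays.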

29

Statistical stemmers


Take a list of words


Construct a model of language that “generates” it


The “best” one


The simplest one? How to find?


List of stems, list of endings


Determine their probabilities


Usage statistics


Decompose any input string into a stem and an ending


Take the most probable variant
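
The last two steps (decompose, take the most probable variant) can be sketched under the assumption that the lists of stems and endings with their usage counts already exist. The counts below are invented purely for illustration, and the independence assumption P(stem) * P(ending) is a simplification chosen for this sketch, not necessarily the model the lecture has in mind.

```python
# Hypothetical usage counts; a real model would estimate them from a word list.
STEM_COUNTS = {"ask": 40, "walk": 30, "as": 5, "asked": 1}
ENDING_COUNTS = {"": 50, "s": 20, "ed": 20, "ing": 10, "ked": 1}

def probability(counts, item):
    """Relative frequency of an item; 0.0 for unseen items."""
    return counts.get(item, 0) / sum(counts.values())

def stem(word):
    """Return the (stem, ending) split with the highest P(stem) * P(ending)."""
    best_split, best_score = (word, ""), 0.0
    for cut in range(1, len(word) + 1):
        candidate, ending = word[:cut], word[cut:]
        score = probability(STEM_COUNTS, candidate) * probability(ENDING_COUNTS, ending)
        if score > best_score:
            best_split, best_score = (candidate, ending), score
    return best_split
```

With these counts, "asked" has three possible decompositions (as + ked, ask + ed, asked + empty ending) and the split ask + ed wins; a word with no known decomposition is returned whole.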

30

Research topics


Construction and application of ontologies


Building of morphological dictionaries


Treatment of unknown words with morphological analyzers


Development of better stemmers


Statistical stemmers?

31

Conclusions




Reducing synonyms can help IR


Better matching


Ontologies are used. WordNet


Morphology is a variant of synonymy


widely used in IR systems


Precise analysis: dictionary-based analyzers


Quick-and-dirty analysis: stemmers


Rule-based stemmers. Porter stemmer


Statistical stemmers

32

Thank you!

Till May 24? 25?, 6 pm