NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL


Dr. Eleni Galiotou
Assistant Professor, Department of Informatics, TEI of Athens

Feb 24, 2004

Eleni Galiotou : Tempus Seminar


(Computer-based) Information Retrieval

Locate (electronically available) documents satisfying the user's information needs

Information need: a statement in a query language, matched against document surrogates (title, abstract, keywords, etc.)

Outcome of the IR process: articles, memos, reports, books, annotated image and sound files


The IR strategy

Purpose:
    Retrieve all relevant documents
    Retrieve as few non-relevant documents as possible

Techniques in classical IR:
    Empirical and ad hoc
    Quantitative methods

IR: also a Natural Language Processing problem
    Heterogeneous collections of full-text documents
    Need for content understanding
    => NLP techniques




Main areas of research in IR

Content analysis

Relationships between documents, to improve the efficiency and effectiveness of IR strategies

Measurement of the effectiveness of retrieval
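
Retrieval effectiveness is commonly measured with precision and recall, the two measures this seminar refers to later on. The following is a minimal sketch, assuming that the set of relevant documents for a query is known in advance; the function name and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of the retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Retrieving all relevant documents corresponds to high recall; retrieving few non-relevant ones corresponds to high precision.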




Example: The vector-space model (1)

The SMART Text Retrieval System

Documents and queries are represented as vectors in a T-dimensional space (T: number of distinct terms in the document collection)

Automated indexing: assigning terms to a piece of text

Terms are weighted to reflect their relative importance in the text


The vector-space model (2)

Result of a query: ranked list of documents ordered by similarity to the query

Similarity measure: cosine of the angle formed by the query vector and the document vector (cosine correlation)

\[
\mathrm{sim} = \cos(Q, D) = \frac{\sum_{i=1}^{T} q_i d_i}{\left( \sum_{i=1}^{T} q_i^2 \, \sum_{i=1}^{T} d_i^2 \right)^{1/2}}
\]
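
The cosine measure on this slide can be sketched in a few lines. This is an illustration only, not SMART's implementation; the vectors and document names are made up, and a zero vector is given similarity 0 by assumption.

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between two term-weight vectors of dimension T."""
    if len(q) != len(d):
        raise ValueError("query and document must have the same dimension T")
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q) * sum(di * di for di in d))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0


# Hypothetical 3-term collection (T = 3).
query = [1.0, 0.0, 1.0]
docs = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [0.0, 2.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}
# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]),
                 reverse=True)
print(ranking)
```

Because the cosine depends only on the angle, a long document is not favoured merely for containing more term occurrences.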


Extended vector-space model

Vector: a collection of subvectors used to represent different aspects of the documents in the collection

Overall similarity between two extended vectors:

\[
\mathrm{sim}(Q, D) = \sum_{i} \alpha_i \, \mathrm{sim}_i(Q_i, D_i)
\]

where the sum runs over the subvectors i, and \alpha_i is the importance of subvector i in the overall similarity between texts
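
A minimal sketch of the weighted combination follows. The slide does not say which measure each sim_i is, so the cosine is used for every subvector as an assumption; the subvector names ("words", "names") and the weights are hypothetical.

```python
import math


def cosine(q, d):
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q) * sum(b * b for b in d))
    return dot / norm if norm else 0.0


def extended_similarity(query, doc, alphas):
    """Weighted sum over subvectors: sum_i alpha_i * sim_i(Q_i, D_i)."""
    return sum(alphas[name] * cosine(query[name], doc[name]) for name in alphas)


# Each extended vector maps a (hypothetical) aspect name to a subvector.
query = {"words": [1.0, 1.0], "names": [1.0, 0.0]}
doc = {"words": [1.0, 1.0], "names": [0.0, 1.0]}
alphas = {"words": 0.75, "names": 0.25}  # importance of each subvector
print(extended_similarity(query, doc, alphas))
```

Here the two texts agree fully on the "words" aspect and not at all on the "names" aspect, so the overall score equals the weight of the "words" subvector.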




Indexing

Index language: used to describe documents and requests

Pre-coordinate index terms: a logical combination of any index terms, used as a label to identify a class of documents

Post-coordinate terms: a combination of the classes of documents labeled with the individual index terms
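
One way to read post-coordination is that each index term labels a class of documents, and the classes are combined only when a request is processed. The sketch below illustrates that reading with plain set operations; the documents and terms are made up.

```python
from collections import defaultdict


def build_index(docs):
    """Map each index term to the class (set) of documents labeled with it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


# Hypothetical documents with their individual index terms.
docs = {
    "d1": {"information", "retrieval"},
    "d2": {"information", "theory"},
    "d3": {"retrieval", "evaluation"},
}
index = build_index(docs)

# Post-coordination: the classes are combined at search time.
both = index["information"] & index["retrieval"]    # logical AND
either = index["information"] | index["retrieval"]  # logical OR
print(sorted(both), sorted(either))
```

A pre-coordinate scheme would instead store a combined label such as "information retrieval" at indexing time.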



LMI vs. NLI

Non-Linguistic Indexing (NLI): removing stopwords, applying statistical criteria

Linguistically Motivated Indexing (LMI):
    Applying syntactic and/or semantic techniques for term identification and description formation
    Identifying multi-word units and characterizing their internal structure

NLP needed for automated indexing?!



Index Term Weighting

tf(t): within-document frequency of term t

idf(t): inverse document frequency = log(N/n)
    N: total number of documents in the collection
    n: number of documents containing term t

General weighting scheme:
    w(t) = tf(t) × idf(t)

Assumptions of term independence are often false

The situation is worse when single-word terms are intermixed with phrasal terms
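
The weighting scheme above can be sketched as follows. The slide does not fix the base of the logarithm, so the natural logarithm is an assumption here, and the four tiny documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf(t) * log(N / n)} dictionary per document."""
    n_docs = len(docs)  # N: total number of documents
    # n: number of documents containing each term (count each doc once).
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # within-document frequency
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


# Hypothetical tokenized documents.
docs = [
    ["joint", "venture", "joint"],
    ["joint", "market"],
    ["joint", "venture", "bank"],
    ["bank", "loan"],
]
w = tf_idf(docs)
print(w[0])
```

A term that occurs in every document gets idf = log(1) = 0, which is how very common words lose their weight under this scheme.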



NLP Based Indexing

Example: TREC Experiments
    “joint venture” is important in the Wall Street Journal database
    “joint” and “venture” were dropped from the list of terms by the system because their idf was too low

Identify groups creating meaningful phrases
    Simple collocations, statistically validated N-grams, Part-Of-Speech tagged sequences, syntactic structures, semantic concepts
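
As one possible illustration of statistically validated word pairs, the sketch below scores adjacent bigrams with pointwise mutual information (PMI). PMI and the `min_count` threshold are choices made for this example, not something the slide specifies, and the token sequence is made up.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """PMI for adjacent word pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for rare pairs, so skip them
        p_pair = count / n_bi
        p_w1, p_w2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


tokens = "the joint venture grew and the joint venture paid the bank".split()
scores = bigram_pmi(tokens)
best = max(scores, key=scores.get)
print(best)
```

Pairs that pass such a test can be indexed as a single phrasal term, so that "joint venture" survives even if "joint" and "venture" alone are dropped.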


Obstacles in the application of NLP techniques in IR

Lack of robustness and efficiency

Representations produced: complex structures that are hard to compare effectively when determining relevance

Solution: use NLP to assist the IR system (boolean, statistical, probabilistic) in representing documents for search purposes
    Off-line database indexing


Stream-based IR Model (1)

Combination of statistical and NLP techniques

Term extraction steps:

1. Elimination of stopwords (no-content or low-content words: determiners, prepositions, pronouns, very frequent words)

2. Morphological stemming (affix-stripping process or morphological analysis)
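
The first two steps can be sketched as below. The stopword list and the suffix list are tiny and purely illustrative; the suffix stripper is a crude stand-in for a real stemmer or morphological analyser, and all names are hypothetical.

```python
# Illustrative only: a real system would use a much longer stopword list.
STOPWORDS = {"the", "of", "a", "an", "in", "to", "and", "it", "is", "are"}
# Crude affix stripping; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Lowercase, drop punctuation and stopwords, then strip suffixes."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [strip_suffix(t) for t in tokens if t and t not in STOPWORDS]


print(extract_terms("The indexing of documents is automated."))
```

The stems need not be real words ("automat"); they only have to be the same for all variants of a term.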





Stream-based IR Model (2)

4. Phrase extraction: shallow text processing techniques (POS tagging, phrase boundary detection, word co-occurrence metrics) used to identify relatively stable groups of words

5. Phrase normalization: “Head+Modifier” pairs to normalize across syntactic variants and reduce them to a common “concept”, e.g.
    weapon proliferation, proliferation of weapons => weapon+proliferate

6. Proper name extraction: people names and titles, location names, organization names used for indexing
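
The phrase normalization step can be illustrated with a toy sketch that handles only the two surface patterns of the example. A real system would rely on a parser; the lookup table, the plural rule and the function names below are all hypothetical.

```python
# Hypothetical table mapping a nominalization to its verb stem.
NOMINALIZATIONS = {"proliferation": "proliferate"}


def singular(word):
    """Very crude plural removal, sufficient for the example only."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def normalize(phrase):
    """Map 'modifier head' and 'head of modifier' to one 'modifier+head' key."""
    words = phrase.lower().split()
    if len(words) == 3 and words[1] == "of":
        head, modifier = words[0], words[2]
    elif len(words) == 2:
        modifier, head = words
    else:
        raise ValueError("only two simple patterns are handled in this sketch")
    head = NOMINALIZATIONS.get(head, head)
    return f"{singular(modifier)}+{head}"


print(normalize("weapon proliferation"), normalize("proliferation of weapons"))
```

Both variants reduce to the same key, so a query using one form can match a document using the other.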


Stream-based IR Model (3)

Final results: merge the ranked lists of documents obtained from searching all streams with appropriately preprocessed queries

Contributions from each stream are weighted using an effective combination of alternative retrieval and routing methods
    Meta-search strategy which maximizes the contribution of each stream (base search engines: SMART v. 11, PRISE v. 2, etc.)
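
A minimal sketch of merging per-stream results is given below. The slide does not state how the stream weights are obtained or combined, so a simple weighted sum of scores is assumed here; stream names, scores and weights are made up.

```python
from collections import defaultdict


def merge_streams(stream_results, stream_weights):
    """Combine per-stream {doc_id: score} results into one ranked list."""
    combined = defaultdict(float)
    for stream, results in stream_results.items():
        for doc_id, score in results.items():
            combined[doc_id] += stream_weights[stream] * score
    # Highest combined score first; ties broken by document id.
    return sorted(combined, key=lambda d: (-combined[d], d))


# Hypothetical scores returned by three streams for the same query.
stream_results = {
    "stems": {"d1": 0.9, "d2": 0.4},
    "phrases": {"d2": 0.8, "d3": 0.5},
    "names": {"d3": 0.2},
}
stream_weights = {"stems": 1.0, "phrases": 2.0, "names": 0.5}
merged = merge_streams(stream_results, stream_weights)
print(merged)
```

A document found by several streams accumulates evidence from each of them, which is the intuition behind the merging step.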


Advantages of Stream Architecture

Easier to compare the contributions of different indexing features or representations

Convenient testbed for experimenting with algorithms designed to merge results obtained using different IR engines and/or techniques

Easier to fine-tune the system in order to obtain optimum performance

Allows the use of IR engines without having to adapt them


Part Of Speech Tagging (1)

Allows resolution of lexical ambiguities in running text, assuming a known general type of text and a context in which a word is used
    => more accurate lexical normalization, phrase boundary detection

Assigns POS label(s) to each word in a text depending on the labels assigned to preceding words




POS Tagging (2)

Best-tag-only option: only the top-ranked tag for each word is output
    => gain in speed and robustness of subsequent processes (e.g. parsing)

Brill's rule-based tagger, trained on Wall Street Journal texts, used to preprocess the linguistic streams used by SMART



Syntactic Tagging (1)

Capturing semantic dependencies is critical for accurate text indexing

Need to exploit syntactic structures produced by a fairly comprehensive parser

TREC experiment: TTP (Tagged Text Parser), based on Linguistic String Grammar
    Full-grammar parser with a built-in timer regulating the amount of time allowed for parsing a sentence




Syntactic Tagging (2)

If no parse is returned before the allotted time elapses
    => parser enters “skip-and-fit” mode

Result: approximate parse

Fragments skipped in the first pass:
    => analyzed by a simple phrasal parser looking for noun phrases and relative clauses
    => attached to the main parse structure




Corpus-based disambiguation of long Noun Phrases (1)

Relationships between words in complex phrases are required to decompose longer phrases into meaningful head+modifier pairs

Pair extractor looks at distribution statistics of compound terms
    => decides whether the association between any two words in a noun phrase is syntactically valid and semantically significant


Corpus-based disambiguation (2)

Phrasal terms are extracted in two phases:

1. Only unambiguous head-modifier pairs are generated

2. Distributional statistics gathered in the first phase are used to predict the strength of alternative modifier-modified links within ambiguous phrases

Example: multiple unambiguous occurrences of “insider trading”, a few of “trading case”; numerous phrases such as “insider trading case”, “insider trading legislation”
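
The two-phase idea can be illustrated on the "insider trading case" example. The counts below are invented for illustration, and the decision rule (compare the raw counts of the two competing pairs) is a deliberate simplification, not the statistic used in the actual experiments.

```python
from collections import Counter

# Phase 1 (assumed output): counts of unambiguous (modifier, head) pairs.
# The numbers are made up to mirror "multiple" versus "a few" occurrences.
pair_counts = Counter({("insider", "trading"): 12, ("trading", "case"): 3})


def bracket(w1, w2, w3, counts):
    """Phase 2: choose (modifier, head) pairs for the noun phrase 'w1 w2 w3'."""
    left = counts[(w1, w2)]   # evidence for [[w1 w2] w3]
    right = counts[(w2, w3)]  # evidence for [w1 [w2 w3]]
    if left >= right:
        # w1 modifies w2, and the compound (headed by w2) modifies w3.
        return [(w1, w2), (w2, w3)]
    # w2 modifies w3, and w1 modifies the head w3.
    return [(w2, w3), (w1, w3)]


print(bracket("insider", "trading", "case", pair_counts))
```

With these counts the phrase is read as [[insider trading] case], so "insider+trading" is extracted rather than "insider+case".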


Language Resources

Machine Readable Dictionaries (MRD)
    => mixed results in experiments

Knowledge bases
    CYC: huge knowledge base of common-sense knowledge; untested contribution to IR
    WordNet: models the lexical knowledge of a native user of English


Usage of WordNet in Information Retrieval Tasks (1)

WordNet: organized around logical groupings of related terms (synsets)

Synset: a list of synonymous word forms and semantic pointers describing relationships between the current synset and other synsets

Knowledge base: the nouns in WordNet

Nouns: the most content-bearing of all word classes; they occur in every sentence



Usage of WordNet (2)

WordNet is partitioned into Hierarchical Concept Graphs (HCG) based on the IS-A hierarchical links between synsets

The information content of each synset is approximated by estimating the probability of occurrence of all nouns in all subordinate synsets

Semantic similarity between two nouns (i.e. between the synsets from which the nouns are drawn): the information content of the first synset which subsumes the two synsets
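
The information-content measure can be sketched on a toy IS-A hierarchy. The hierarchy and the noun counts are made up and this is not the WordNet API; the sketch also assumes that a synset's own nouns count towards its probability together with those of its subordinate synsets, and it uses the natural logarithm.

```python
import math

# Toy IS-A hierarchy (child -> parent) and noun counts; both are made up.
PARENT = {"dog": "animal", "cat": "animal", "animal": "entity", "car": "entity"}
COUNTS = {"dog": 4, "cat": 4, "animal": 2, "car": 6, "entity": 0}


def ancestors(node):
    """The node itself followed by its IS-A ancestors, most specific first."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def information_content(node):
    """-log p(node), where p counts the node and everything below it."""
    total = sum(COUNTS.values())
    subtree = sum(c for n, c in COUNTS.items() if node in ancestors(n))
    return math.log(total / subtree)


def similarity(a, b):
    """Information content of the first synset that subsumes both a and b."""
    b_chain = set(ancestors(b))
    subsumer = next(n for n in ancestors(a) if n in b_chain)
    return information_content(subsumer)


print(similarity("dog", "cat"), similarity("dog", "car"))
```

Two nouns that only meet at the root get similarity 0, because the root covers every noun and so carries no information.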


Usage of WordNet (3)

Simple word sense disambiguation process for documents, which chooses the single most likely sense of a noun occurrence

Experiments: the top 1000 documents are pre-fetched from the collection using term weighting (a conventional IR technique), and an exhaustive word-distance-based measure is applied to these documents


Usage of WordNet (4)

Retrieval effectiveness results using word-word distances (in terms of precision and recall): poor compared to the tf × idf term weighting strategy

Possibility of errors in the syntactic tagging of documents, in word sense disambiguation, and in semantic matching between words




Other Roles for NLP (1)

Routing (filtering): the amount of training data is the dominant factor in performance

Text categorization (automatic assignment to prior headings): using complex terms had no extra beneficial effect

Real contribution to selective content-based information management?


Other Roles for NLP (2)

Displaying information about whole documents: giving selected phrases is more informative than highlighting matching terms or listing key individual words

Information extraction and summarizing

Real role of NLP: supporting more exigent information-management functions within a larger, multi-functional whole