Intelligent Information Retrieval

29 Oct 2013

1

Intelligent Information Retrieval (and Web Search)

Professor Celso A A Kaestner, PhD.

Brazil


2

Site: www.dainf.ct.utfpr.edu.br/~kaestner/Konstanz/iir.htm

3

Introduction

4

Introduction: Information Retrieval


IR: representation, storage, organization of,
and access to information items;


Focus is on the user information need;


User information need:


Find all docs containing information on college football
teams which: (1) are maintained by a university and
(2) participate in the national tournament.


Emphasis is on the retrieval of information
(not data).

5

Data Retrieval vs. Information Retrieval


Data Retrieval:


which docs. contain a set of keywords?


well defined semantics;


a single erroneous object implies failure!


Information Retrieval (IR):


information about a subject or topic;


semantics is frequently loose;


small errors are tolerated.


IR system:


interpret contents of information items;


generate a ranking which reflects relevance;


notion of relevance is most important.

6

Information Retrieval (IR)


The indexing and retrieval of textual
documents.


Searching for pages on the World Wide
Web is the most recent “killer app.”


Concerned firstly with retrieving relevant
documents to a query.


Concerned secondly with retrieving from large
sets of documents efficiently.




7

Typical IR Task


Given:


A corpus of textual natural-language
documents.


A user query in the form of a textual
string.


Find:


A ranked set of documents that are
relevant to the query.





8

IR System

[Diagram: a query string and a document corpus are the inputs to the IR system, which outputs ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]


9

Relevance


Relevance is a subjective judgment
and may include:


Being on the proper subject.


Being timely (recent information).


Being authoritative (from a trusted
source).


Satisfying the goals of the user and
his/her intended use of the information
(information need).

10

Keyword Search


Simplest notion of relevance is that
the query string appears verbatim in
the document.


Slightly less strict notion is that the
words in the query appear frequently
in the document, in any order (bag of words).
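The two notions of relevance on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `verbatim_match` and `bag_of_words_score`, the tokenizer, and the sample documents are assumptions made for this example, not part of the course material.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def verbatim_match(query, document):
    """Strictest notion: the query string appears verbatim (ignoring case)."""
    return query.lower() in document.lower()


def bag_of_words_score(query, document):
    """Looser notion: total occurrences of the query words, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in set(tokenize(query)))


# Hypothetical sample documents.
docs = {
    "d1": "College football teams play in the national tournament.",
    "d2": "The football season: every college has a football team.",
    "d3": "Recipes for a quick dinner.",
}
query = "college football"

for name, text in docs.items():
    print(name, verbatim_match(query, text), bag_of_words_score(query, text))
```

Only d1 contains the query verbatim, yet d2 gets the higher bag-of-words score because "football" occurs twice in it. Which notion is preferable depends on the information need.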

11

Problems with Keywords


May not retrieve relevant documents
that include synonymous terms.


“restaurant” vs. “café”


“PRC” vs. “China”


May retrieve irrelevant documents that
include ambiguous terms.


“bat” (baseball vs. mammal)


“Apple” (company vs. fruit)


“bit” (unit of data vs. act of eating)
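Both failure modes can be reproduced with a naive keyword matcher. The sketch below is a toy under stated assumptions: `keyword_hits` and the four sample documents are invented for illustration.

```python
def keyword_hits(query_word, documents):
    """Return the ids of documents whose whitespace-split words include the query word."""
    word = query_word.lower()
    return [doc_id for doc_id, text in documents.items()
            if word in text.lower().split()]


# Hypothetical sample documents.
documents = {
    "d1": "a cozy café near the station",
    "d2": "the best restaurant in town",
    "d3": "the bat flew out of the cave",
    "d4": "he swung the bat and hit a home run",
}

# Synonym problem: d1 is about a place to eat but is missed.
restaurant_hits = keyword_hits("restaurant", documents)

# Ambiguity problem: both the mammal and the baseball document are returned.
bat_hits = keyword_hits("bat", documents)

print(restaurant_hits)
print(bat_hits)
```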


12

Beyond Keywords


We will cover the basics of keyword-based
IR, but…


We will focus on extensions and recent
developments that go beyond keywords.


We will cover the basics of building an
efficient IR system, but…


We will focus on basic capabilities and
algorithms rather than systems issues that
allow scaling to industrial-size databases.

13

Intelligent IR


Taking into account the meaning of
the words used.


Taking into account the order of words
in the query.


Adapting to the user based on direct
or indirect feedback.


Taking into account the authority of
the source.


14

IR System Architecture

[Diagram: IR system architecture. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Text Database, and Index (inverted file). Labeled flows: User Need, Text, Query, Logical View, User Feedback, Retrieved Docs, Ranked Docs.]

15

IR System Components


Text Operations forms index words (tokens).


Standardization (caps …)


Stopword removal


Stemming


Indexing constructs an inverted index of
word to document pointers.


Searching retrieves documents that contain
a given query token from the inverted index.


Ranking scores all retrieved documents
according to a relevance metric.
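The four components above can be wired together in a short sketch. Everything here is an assumption made for illustration: the toy stopword list, the crude suffix-stripping `stem` function (a stand-in for a real stemmer), and the use of a summed tf-idf weight as the relevance metric, which the slide does not specify.

```python
import math
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}  # toy list


def stem(token):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def text_operations(text):
    """Standardize case, remove stopwords, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def build_index(corpus):
    """Inverted index: token -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for token in text_operations(text):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Documents that contain at least one query token."""
    hits = set()
    for token in text_operations(query):
        hits.update(index.get(token, {}))
    return hits


def rank(index, query, n_docs):
    """Score retrieved documents by summed tf * idf, highest first."""
    scores = defaultdict(float)
    for token in text_operations(query):
        postings = index.get(token, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical three-document corpus.
corpus = {
    "d1": "Indexing constructs an inverted index of the documents.",
    "d2": "Ranking scores the retrieved documents.",
    "d3": "Stemming and stopword removal are text operations.",
}
index = build_index(corpus)
print(search(index, "ranking documents"))
print(rank(index, "ranking documents", len(corpus)))
```

The same `text_operations` function is applied to documents and to the query, so that both are mapped to the same index vocabulary.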


16

IR System Components (continued)


User Interface manages interaction
with the user:


Query input and document output.


Relevance feedback.


Visualization of results.


Query Operations transform the query
to improve retrieval:


Query expansion using a thesaurus.


Query transformation using relevance
feedback.
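Both query operations can be sketched briefly. The thesaurus below is a hypothetical hand-made dictionary. For relevance feedback the sketch uses a Rocchio-style update, chosen here as one well-known option; the slide does not name a specific method, and the default weights `alpha`, `beta`, and `gamma` are assumptions.

```python
# Hypothetical hand-made thesaurus; a real system would load a curated resource.
THESAURUS = {
    "restaurant": ["café", "diner"],
    "china": ["prc"],
}


def expand_query(query, thesaurus):
    """Append thesaurus synonyms after each query term, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update on sparse term-weight dicts.

    Moves the query toward the mean of the relevant documents and away
    from the mean of the non-relevant ones; non-positive weights are dropped.
    """
    terms = set(query_vec)
    for vec in relevant + non_relevant:
        terms.update(vec)
    new_query = {}
    for term in terms:
        weight = alpha * query_vec.get(term, 0.0)
        if relevant:
            weight += beta * sum(v.get(term, 0.0) for v in relevant) / len(relevant)
        if non_relevant:
            weight -= gamma * sum(v.get(term, 0.0) for v in non_relevant) / len(non_relevant)
        if weight > 0:
            new_query[term] = weight
    return new_query


print(expand_query("restaurant China", THESAURUS))

updated = rocchio(
    {"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "cat": 1.0}],
    non_relevant=[{"jaguar": 1.0, "car": 1.0}],
)
print(sorted(updated))
```

In the feedback example the user marked a document about the animal as relevant, so "cat" enters the query while "car" is pushed out.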


17


IR and the Web


IR at the center of the stage:


Advent of the Web changed this
perception once and for all:


universal repository of knowledge;


free (low cost) universal access;


no central editorial board;


many problems though: IR seen as key to
finding the solutions!

18

IR and the Web


And more:


Most human tasks involve processing
information in textual and/or graphic
form (Lyman, 2003);


How Much Information project (Berkeley):
www.sims.berkeley.edu/how-much-info-2003.


Each person generates 800 Mbytes/year.


19

IR and the Web


In 2002: 5 Exabytes of new information;


Magnetic media (HDs): 92%;


Films: 7%;


Print material: 0.01%;


Optical media: 0.002%.


5 Exabytes = 5 million Terabytes =
5,000,000,000,000,000,000 bytes;


2 times the amount of 1999, given a growth
rate of 30%/year.
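The unit conversion and the growth claim on this slide can be checked with a few lines of arithmetic. The sketch assumes decimal (SI) prefixes and three years of compounding from 1999 to 2002; both are assumptions of this example.

```python
EXABYTE = 10**18   # decimal (SI) prefixes are assumed here
TERABYTE = 10**12

new_info_2002 = 5 * EXABYTE
terabytes = new_info_2002 // TERABYTE

# 30% growth per year, compounded over the three years 1999 -> 2002.
growth_1999_to_2002 = 1.30 ** 3

print(f"{new_info_2002:,} bytes")
print(f"{terabytes:,} terabytes")
print(f"growth factor: {growth_1999_to_2002:.2f}")
```

Compounding 30% over three years gives a factor of about 2.2, which is consistent with the rounded "2 times" figure.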

20

IR and the Web


Information flow (radio, TV, Internet):


18 Exabytes of new information in 2002;


3.5 times the amount stored;


Telephone lines (and cell phones): 98%;


320 million hours of radio and TV
transmissions, with 70 million new hours,
with 81 Gigabytes of texts.

21

IR and the Web


Email:


31 billion e-mails/year = 400,000 Tbytes
of new information;


The Internet (Web):


170 Tbytes of information = 17 times the
printed content of the US Library of Congress.

22

IR and the Web


Search sites:


“Yahoo”, “Google”, etc. = the 1st option of
access for users;


A typical Internet user: 11 h 20 min/month;


Access to the desired information = 1/3 of
the period;


The user is obliged to verify whether the received
information is the desired one, and it is often
impossible to retrieve the information needed.

23




IR and the Web


Information glut, or information overload, is
the main challenge to be overcome by
automatic text processing systems.

24

Web Search


Application of IR to HTML documents
on the World Wide Web.


Differences:


Must assemble document corpus by
spidering the web.


Can exploit the structural layout
information in HTML (XML).


Documents change uncontrollably.


Can exploit the link structure of the web.
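Three of these differences (spidering, HTML structure, link structure) can be illustrated without touching the network. The sketch below crawls a hypothetical in-memory "web" (`TOY_WEB`), extracts links with the standard-library `html.parser`, and counts in-links as the simplest possible link-structure signal. A real spider would fetch pages over HTTP and respect crawling etiquette, and real link analysis is considerably more elaborate than in-link counting.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real spider would fetch over HTTP.
TOY_WEB = {
    "a.html": '<title>A</title><a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html": '<title>B</title><a href="c.html">C</a>',
    "c.html": '<title>C</title><a href="a.html">A</a>',
    "d.html": '<title>D</title><a href="c.html">C</a>',  # unreachable from a.html
}


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags (structural information in HTML)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def spider(seed, web):
    """Breadth-first crawl from the seed; returns {url: outgoing links}."""
    graph = {}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in graph or url not in web:
            continue
        parser = LinkExtractor()
        parser.feed(web[url])
        graph[url] = parser.links
        queue.extend(parser.links)
    return graph


def in_link_counts(graph):
    """Simplest link-structure signal: how many crawled pages link to each page."""
    counts = {url: 0 for url in graph}
    for links in graph.values():
        for target in set(links):
            if target in counts:
                counts[target] += 1
    return counts


graph = spider("a.html", TOY_WEB)
print(sorted(graph))
print(in_link_counts(graph))
```

Note that `d.html` never enters the corpus: a spider only finds pages that are reachable by links from its seeds.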

25

Web Search System

[Diagram: a web spider crawls the Web to build the document corpus; the IR system takes the query string and the corpus and outputs ranked documents: 1. Page1, 2. Page2, 3. Page3, ...]

26

Other IR-Related Tasks


Automated document categorization


Information filtering (spam filtering)


Information routing


Automated document clustering


Recommending information or products


Information extraction


Information integration


Question answering

27

History of IR


1960-70’s:



Initial exploration of text retrieval systems
for “small” corpora of scientific abstracts,
and law and business documents.


Development of the basic Boolean and
vector-space models of retrieval.


Prof. Salton and his students at Cornell
University are the leading researchers in
the area.

28

IR History Continued


1980’s:


Large document database systems, many
run by companies:


Lexis-Nexis


Dialog


MEDLINE


29

IR History Continued


1990’s:


Searching FTPable documents on the
Internet


Archie


WAIS


Searching the World Wide Web


Lycos


Yahoo


Altavista

30

IR History Continued


1990’s continued:


Organized Competitions


NIST TREC


Recommender Systems


Ringo


Amazon


NetPerceptions


Automated Text Categorization &
Clustering


31

Recent IR History


2000’s


Link analysis for Web Search


Google


Automated Information Extraction


Whizbang


Fetch


Burning Glass


Question Answering


TREC Q/A track


32

Recent IR History


2000’s continued:


Multimedia IR


Image


Video


Audio and music


Cross-Language IR


DARPA Tides


Document Summarization

33

Related Areas


Database Management


Library and Information Science


Artificial Intelligence


Natural Language Processing


Machine Learning

34

Database Management


Focused on structured data stored in
relational tables rather than free-form text.


Focused on efficient processing of well-defined
queries in a formal language (SQL).


Clearer semantics for both data and queries.


Recent move towards semi-structured data
(XML) brings it closer to IR.

35

Library and Information Science


Focused on the human user aspects of
information retrieval (human-computer
interaction, user interface, visualization).


Concerned with effective categorization of
human knowledge.


Concerned with citation analysis and
bibliometrics (structure of information).


Recent work on digital libraries brings it
closer to CS & IR.



36

Artificial Intelligence


Focused on the representation of knowledge,
reasoning, and intelligent action.


Formalisms for representing knowledge and
queries:


First-order Predicate Logic


Bayesian Networks


Others …


Recent work on web ontologies and intelligent
information agents brings it closer to IR.


37

Natural Language Processing


Focused on the syntactic, semantic,
and pragmatic analysis of natural
language text and discourse.


Ability to analyze syntax (phrase
structure) and semantics could allow
retrieval based on meaning rather
than keywords.

38

Natural Language Processing: IR Directions


Methods for determining the sense of
an ambiguous word based on context
(word sense disambiguation).


Methods for identifying specific pieces
of information in a document
(information extraction).


Methods for answering specific NL
questions from document corpora.
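As a rough illustration of word sense disambiguation from context, the sketch below picks the sense whose hand-picked context words overlap most with the sentence. This is a simplified overlap heuristic in the spirit of the Lesk algorithm, chosen for this example rather than taken from the slides; the `SENSES` inventory and the `disambiguate` function are hypothetical.

```python
# Hypothetical sense inventory: each sense has a few hand-picked context words.
SENSES = {
    "bat": {
        "baseball": {"ball", "hit", "swing", "pitcher", "game"},
        "mammal": {"cave", "wings", "fly", "nocturnal", "insects"},
    },
}


def disambiguate(word, sentence, senses=SENSES):
    """Pick the sense whose context words overlap most with the sentence.

    Returns None when the word is unknown or no sense overlaps at all.
    """
    context = set(sentence.lower().split())
    best_sense, best_overlap = None, 0
    for sense, signature in senses.get(word, {}).items():
        overlap = len(context & signature)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bat", "the bat left the cave to hunt insects"))
print(disambiguate("bat", "he hit the ball with the bat"))
```

A retrieval system could use the chosen sense to filter or re-rank documents that match the ambiguous keyword.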

39

Machine Learning


Focused on the development of
computational systems that improve their
performance with experience.


Automated classification of examples based
on learning concepts from labeled training
examples (supervised learning).


Automated methods for clustering unlabeled
examples into meaningful groups
(unsupervised learning).
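Supervised learning from labeled examples can be sketched with a tiny text classifier. The example uses multinomial naive Bayes with add-one smoothing, which is one common choice and an assumption of this sketch; the slides do not prescribe a particular learner. The four training messages and the labels "spam" and "ham" are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled training examples.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
model = train(training)
print(predict("cheap money", model))
print(predict("monday meeting", model))
```

The system "improves with experience" in the sense that adding more labeled examples to `training` refines the word counts behind each prediction.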

40

Machine Learning: IR Directions


Text Categorization


Automatic hierarchical classification (Yahoo).


Adaptive filtering/routing/recommending.


Automated spam filtering.


Text Clustering


Clustering of IR query results.


Automatic formation of hierarchies (Yahoo).


Learning for Information Extraction


Text Mining


Text Summarization