Applications, Research, and Challenges


24 Oct 2013


IR in a Nutshell: Applications, Research, and Challenges
Session 1
Feb 21st 2013
Tamer Elsayed

Roadmap


What is Information Retrieval (IR)?
  Overview and applications
Overview of my research interests
  Large-scale problems
  MapReduce Extensions
  Twitter Analysis
The future of IR research
  SWIRL 2012


WHAT IS IR?

OVERVIEW & APPLICATIONS/RESEARCH TOPICS


Information Retrieval (IR)


[Figure labels: information need, Query, Unstructured information, Hits]

Who and Where?


*Source: Matt Lease (IR Course at UTexas)

IR is not just “Web Page” (or Document) Ranking (or Retrieval)

Web Search: Google

Search suggestions
Vertical search
Query-biased summarization
Sponsored search
Search shortcuts
Vertical search (news, blog, image)

Web Search: Google II

Spelling correction

Personalized search / social ranking

Vertical search (local)

Cross-Lingual IR
1/3 of the Web is not in English
About 50% of Web users do not use English as their primary language
Many (maybe most) search applications have to deal with multiple languages
  monolingual search: search in one language, but with many possible languages
  cross-language search: search in multiple languages at the same time

Routing / Filtering


Given a standing query, analyze new information as it arrives
  Input: all email, RSS feed or listserv, …
  Typically classification rather than ranking
  Simple example: Ham vs. spam
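The ham vs. spam example can be sketched as a tiny classifier applied to each arriving message. This is only an illustration: the choice of multinomial Naive Bayes, the four training messages, and the names `train`, `classify`, and `route` are all assumptions, not something the slides specify.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    word_counts = {}        # label -> Counter of words seen under that label
    doc_counts = Counter()  # label -> number of training messages
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data, standing in for a real labeled mailbox.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]
MODEL = train(TRAINING)

def route(stream):
    """Filter an arriving stream, keeping only messages classified as ham."""
    return [msg for msg in stream if classify(MODEL, msg) == "ham"]

if __name__ == "__main__":
    print(route(["free money now", "project meeting monday"]))
```

The standing query here is the trained model itself: it stays fixed while new items flow past it, which is the reverse of ad hoc search, where the collection is fixed and the queries change.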

*Source: Matt Lease (IR Course at UTexas)

Content-based Music Search


*Source: Matt Lease (IR Course at UTexas)

Speech Retrieval

*Source: Matt Lease (IR Course at UTexas)

Entity Search


*Source: Matt Lease (IR Course at UTexas)

Question Answering & Focused Retrieval


*Source: Matt Lease (IR Course at UTexas)

Expert Search

*Source: Matt Lease (IR Course at UTexas)

Blog Search


*Source: Matt Lease (IR Course at UTexas)

μ-Blog Search (e.g. Twitter)

*Source: Matt Lease (IR Course at UTexas)

e-Discovery

*Source: Matt Lease (IR Course at UTexas)

Book Search


Find books or more focused results


Detect / generate / link table of contents


Classification: detect genre (e.g. for browsing)


Detect related books, revised editions


Challenges: Variable scan quality, OCR accuracy, Copyright, etc.

Other Visual Interfaces


*Source: Matt Lease (IR Course at UTexas)

MY RESEARCH


My Research …
[Figure: Large-Scale Text Processing. Text: emails (Enron, ~500,000) + web pages (CLuE Web, ~1,000,000,000). User Application: Identity Resolution (emails), Web Search (web pages).]

Back in 2009 …


Before 2009, only small text collections were available
  Largest: ~ 1M documents
ClueWeb09
  Crawled by CMU in 2009
  ~ 1B documents!
  need to move to cluster environments
MapReduce/Hadoop seemed like a promising framework

MapReduce Framework

[Figure: MapReduce data flow, from four inputs through four map tasks and three reduce tasks to three outputs]
(a) Map: (k1, v1) → [(k2, v2)]
(b) Shuffle: group values by keys, giving (k2, [v2])
(c) Reduce: (k2, [v2]) → [(k3, v3)]
Framework handles “everything else”!
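The three phases can be imitated in a few lines of single-process Python. This is a teaching sketch, not Hadoop: `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names, and word count is used only because it is the customary first example.

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    """Simulate map, shuffle, and reduce in memory on a list of (k1, v1) pairs."""
    # (a) Map: each (k1, v1) record produces a list of (k2, v2) pairs.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(mapper(k1, v1))

    # (b) Shuffle: group all v2 values that share the same k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # (c) Reduce: each (k2, [v2]) group produces a list of (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def wc_map(doc_id, text):
    """Emit (word, 1) for every word occurrence in one document."""
    return [(word, 1) for word in text.lower().split()]

def wc_reduce(word, counts):
    """Sum the occurrence counts collected for one word."""
    return [(word, sum(counts))]

if __name__ == "__main__":
    docs = [("d1", "a b a"), ("d2", "b c")]
    print(run_mapreduce(docs, wc_map, wc_reduce))
```

In a real cluster the programmer still writes only the two small functions; partitioning, scheduling, data movement, and fault tolerance are the “everything else” the framework takes over.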


E2E Search Toolkit using MapReduce


Completely designed for the Hadoop environment
Experimental Platform for research
Supports common text collections
  + ClueWeb09
Open source release
Implements state-of-the-art retrieval models
Ivory: http://ivory.cc

(1) Pairwise Similarity in Large Collections

[Figure: a collection of documents and the scores of their pairwise similarities, e.g. 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74]
Applications:
  Clustering
  “more-like-that” queries

Decomposition

Each term contributes only if it appears in both documents
[Figure labels: map, reduce]
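The slide's formula did not survive extraction, so the sketch below assumes the usual inner-product similarity, sim(di, dj) = Σt w(t, di) · w(t, dj), where a term adds to the sum only when both documents contain it. Under that assumption the work splits by term: a map step emits one partial product per pair of documents sharing a term, and a reduce step sums the partial products per pair. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_postings(doc_vectors):
    """Invert {doc: {term: weight}} into {term: [(doc, weight), ...]}."""
    postings = defaultdict(list)
    for doc_id, vector in sorted(doc_vectors.items()):
        for term, weight in vector.items():
            postings[term].append((doc_id, weight))
    return postings

def map_term(postings_list):
    """Map: one term emits a partial product for each pair of documents sharing it."""
    for (d1, w1), (d2, w2) in combinations(postings_list, 2):
        yield (d1, d2), w1 * w2

def pairwise_similarity(doc_vectors):
    """Reduce: sum the partial products per document pair."""
    sims = defaultdict(float)
    for plist in build_postings(doc_vectors).values():
        for pair, contribution in map_term(plist):
            sims[pair] += contribution
    return dict(sims)

if __name__ == "__main__":
    docs = {
        "a": {"x": 1.0, "y": 2.0},
        "b": {"y": 3.0, "z": 1.0},
        "c": {"x": 2.0, "y": 1.0},
    }
    print(pairwise_similarity(docs))
```

Pairs that share no term never appear in the output, so the cost follows the lengths of the postings lists rather than the square of the collection size.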

(2) Cross-Lingual Pairwise Similarity
Find similar document pairs in different languages
  More difficult than monolingual!
Multilingual text mining, Machine Translation
Application: automatic generation of potential “interwiki” language links
Locality-sensitive Hashing
  Vectors close to each other are likely to have similar signatures

Solution Overview

[Figure: pipeline for finding similar article pairs across languages]
Nf German articles → CLIR projection; Ne English articles → Preprocess
→ Ne+Nf English document vectors, e.g. <nobel=0.324, prize=0.227, book=0.01, …>
→ Signature generation (Random Projection / Minhash / Simhash)
→ Ne+Nf Signatures, e.g. 01110000101, 11100001010
→ Sliding window algorithm
→ Similar article pairs
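The signature and sliding-window stages can be sketched as follows. The sketch assumes the random-projection variant (one bit per random hyperplane) and a single sorted pass over the signatures; the slides give no parameters, so `num_bits`, `window`, `max_distance`, and all function names are illustrative assumptions, and the simplified window pass is not claimed to match the algorithm used in the talk.

```python
import random

def make_hyperplanes(vocab, num_bits, seed=0):
    """Draw one random Gaussian hyperplane per signature bit."""
    rng = random.Random(seed)
    terms = sorted(vocab)  # fixed order so a given seed always gives the same planes
    return [{t: rng.gauss(0.0, 1.0) for t in terms} for _ in range(num_bits)]

def signature(vector, hyperplanes):
    """Bit i is 1 when the vector lies on the non-negative side of hyperplane i."""
    bits = []
    for plane in hyperplanes:
        dot = sum(w * plane.get(t, 0.0) for t, w in vector.items())
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)

def hamming(sig_a, sig_b):
    """Number of bit positions where two equal-length signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

def sliding_window_pairs(signatures, window, max_distance):
    """Sort signatures, then compare each one only with its next `window` neighbors."""
    ordered = sorted(signatures.items(), key=lambda item: item[1])
    pairs = set()
    for i, (doc_a, sig_a) in enumerate(ordered):
        for doc_b, sig_b in ordered[i + 1 : i + 1 + window]:
            if hamming(sig_a, sig_b) <= max_distance:
                pairs.add(tuple(sorted((doc_a, doc_b))))
    return pairs

if __name__ == "__main__":
    planes = make_hyperplanes(["nobel", "prize", "book"], num_bits=16)
    en = {"nobel": 0.324, "prize": 0.227, "book": 0.01}
    de = {"nobel": 0.300, "prize": 0.250}  # hypothetical projected German article
    sigs = {"en": signature(en, planes), "de": signature(de, planes)}
    print(sigs, sliding_window_pairs(sigs, window=1, max_distance=4))
```

Vectors separated by a small angle fall on the same side of most hyperplanes, so their bit strings differ in few positions, which is why comparing short signatures can stand in for comparing full vectors.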

(3) Approximate Positional Indexes

[Figure: “Learning to Rank” models learn effective ranking functions from proximity features, which depend on term positions. Exact positions: large index, slow query evaluation. Approximate positions: smaller index, faster query evaluation.]
Close Enough is Good Enough?

Fixed-Width Buckets
Buckets of length W
[Figure: two documents split into buckets of length W; d1 spans buckets 1 to 5 and d2 spans buckets 1 to 3]
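A minimal sketch of the bucketing idea, assuming 0-based term positions and 1-based bucket numbers as in the figure. The slides do not say how the buckets feed into proximity features, so `approx_distance` is one illustrative guess, and all names here are hypothetical.

```python
def bucket_of(position, width):
    """Map a 0-based term position to its 1-based bucket number."""
    return position // width + 1

def approximate_postings(positions, width):
    """Replace exact positions with the sorted distinct buckets they fall in."""
    return sorted({bucket_of(p, width) for p in positions})

def approx_distance(buckets_a, buckets_b, width):
    """Estimate the closest distance between two terms as the smallest bucket gap times the width."""
    return min(abs(a - b) for a in buckets_a for b in buckets_b) * width

if __name__ == "__main__":
    W = 10
    term_a = approximate_postings([0, 3, 4, 57, 59], W)  # five positions, two buckets
    term_b = approximate_postings([12], W)
    print(term_a, term_b, approx_distance(term_a, term_b, W))
```

Several exact positions collapse into one bucket number, which is where the smaller index comes from; the price is that distances are known only to within roughly one bucket width.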

(4) Pseudo Training Data for Web Rankers


Documents, queries, and relevance judgments


Important driving force behind IR innovation



In industry, easy to get


In academia, hard and really expensive

Web Graph

[Figure: web graph of pages P1 to P7, with links labeled by anchor text such as “web search”, “SIGIR 2012”, and “Google”]

Queries and Judgments?

[Figure: the web graph of pages P1 to P7 again, with anchor text “SIGIR 2012”, “web search”, “Bing”, and “Google”]
anchor text lines ≈ pseudo queries
target pages ≈ relevant candidates
noise reduction?
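The mapping from links to pseudo training data can be sketched as below. The link triples are made up for illustration, and the noise-reduction rule (keep a query and target pair only when at least `min_links` links support it) is an assumed heuristic; the slide itself leaves noise reduction as an open question.

```python
from collections import defaultdict

def pseudo_judgments(links):
    """Group (source, anchor text, target) links into {pseudo query: {target: link count}}."""
    by_query = defaultdict(lambda: defaultdict(int))
    for _source, anchor_text, target in links:
        query = " ".join(anchor_text.lower().split())  # normalize case and spacing
        by_query[query][target] += 1
    return {query: dict(targets) for query, targets in by_query.items()}

def reduce_noise(judgments, min_links=2):
    """Keep only (query, target) pairs supported by at least `min_links` links."""
    filtered = {}
    for query, targets in judgments.items():
        kept = {t: n for t, n in targets.items() if n >= min_links}
        if kept:
            filtered[query] = kept
    return filtered

if __name__ == "__main__":
    # Hypothetical links: (source page, anchor text, target page).
    links = [
        ("P1", "web search", "P5"),
        ("P2", "Web  Search", "P5"),
        ("P3", "web search", "P6"),
        ("P4", "SIGIR 2012", "P7"),
    ]
    judgments = pseudo_judgments(links)
    print(judgments)
    print(reduce_noise(judgments))
```

Each surviving pair can then be treated as a (query, relevant document) training example for a ranker, with no human judging involved.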

(5) Extending MapReduce Framework


Iterative Computations (iHadoop)



Concurrent Jobs with shared data


  m maps - r reduces instead of 1 map - 1 reduce

(6) Twitter Analysis


Real-time search in Twitter
  TREC 2011 (6th out of 59 teams)
  TREC 2013?
Answering Real-time Questions from Arabic Social Media
  NPRP-submitted

FUTURE RESEARCH DIRECTIONS


SWIRL 2012


Goal of Report


  Inspire researchers and graduate students to address the questions raised
  Provide funding agencies data to focus and coordinate support for information retrieval research.
Participants were asked to focus on efforts that could be handled in an academic setting, without the requirement of large-scale commercial data.

Key Themes (across Topics)
Not just a ranked list
  move beyond the classic “single ad hoc query and ranked list” approach
Help for users
  support users more broadly, including ways to bring IR to inexperienced, illiterate, and disabled users.
Capturing context
  treat people using search systems, their context, and their information needs as critical aspects needing exploration.
Information, not documents
  move beyond document retrieval and into more complex types of data and more complicated results
New Domains
  data with restricted access, collections of “apps,” and richly connected workplace data
Evaluation
  suggest new techniques for evaluation

“Most Interesting” Topics



[1] Conversational Answer Retrieval
IR: provides ranked lists of documents in response to a wide range of keyword queries
QA: provides more specific answers to a very limited range of natural language questions.
Goal: combine the advantages of both to provide effective retrieval of appropriate answers to a wide range of questions expressed in natural language, with rich user-system dialogue



Proposed Research


Questions: open-domain, natural language text questions
Answers: Develop more general approaches to identifying as many constraints as possible on the answers for questions
Dialogue would be initiated by the searcher and proactively by the system, for:
  refining the understanding of questions
  improving the quality of answers
Answers: short answers, text passages, clustered groups of passages, documents, or even groups of documents may be appropriate answers. Even tables, figures, images, or videos

Challenges


Definitions of question and answer for open domain searching
Techniques for representing questions and answers
Techniques for reasoning about and ranking answers
Techniques for representing a mixed-initiative CAR dialogue
Effective dialogue actions for improving question understanding
Effective dialogue actions for refining answers

[2] Finding What You Need with Zero Query Terms (or Less)
Function without an explicit query, depending on context and personalization in order to understand user needs
Anticipate user needs and respond with information appropriate to the current context without the user having to enter a query (zero query terms) or even initiate an interaction with the system (or less).
In a mobile context: take the form of an app that recommends interesting places and activities based on the user’s location, personal preferences, past history, and environmental factors such as weather and time.
In a traditional desktop environment: might monitor ongoing activities and suggest related information, or track news, blogs, and social media for interesting updates.
Imagine a system that automatically gathers information related to an upcoming task.

Proposed Research


New representations of information and user needs, along with methods for matching the two
Modeling person, task, and context
Methods for finding “objects of interest”, including content, people, objects and actions
Methods for determining what, how and when to show material of interest.

Challenges


Time- and geo-sensitivity; trust, transparency, privacy; determining interruptibility; summarization
Power management in mobile contexts
Evaluation

[3] Mobile Information Retrieval Analytics (MIRA)
No company or researcher has an understanding of mobile information access across a variety of tasks, modes of interaction, or software applications.
For example, a search service provider might know that a query was issued, but not know whether the results it provided resulted in consequent action.
The identification of common types of web search queries led to query classification and algorithms tuned for different purposes, which improved web search accuracy. A similar understanding for mobile information seeking would focus research on the problems of highest value to mobile users.
Study what information, what kind of information, and what granularity of information to deliver for different tasks and contexts

Proposed Research


Methodology and tools for doing large-scale collection of data about mobile information access.
Research on incentive mechanisms is required to understand situations in which people are willing to allow their behavior to be monitored.
Research on privacy is required to understand what can be protected by dataset licenses alone, what must be anonymized, and tradeoffs between anonymization and data utility.
Development of well-defined information seeking tasks
Support quantitative evaluation in well-defined evaluation frameworks that lead to repeatable scientific research

Challenges


Developing incentive mechanisms
Developing data collections that are sufficiently detailed to be useful while still protecting people’s privacy.
Collection of data in a manner that university internal review boards will consider acceptable ethically.
Collection of data in a manner that does not violate the Terms of Use restrictions of commercial service providers.

[4] Empowering Users to Search and Learn
Search engines are currently optimized for look-up tasks and not tasks that require more sustained interactions with information
People have been conditioned by current search engines to interact in particular ways that prevent them from achieving higher levels of learning.
We seek to empower users to be more proactive and critical thinkers during the information search process.

[5] The Structure Dimension
Better integration of structured and unstructured information to seamlessly meet a user’s information needs is a promising, but underdeveloped area of exploration.
Named entities, user profiles, contextual annotations, as well as (typed) links between information objects ranging from web pages to social media messages.


[6] Understanding People in Order to Improve Information (Retrieval) Systems
Development of a research resource for the IR community:
  1. from which hypotheses about how to support people in information interactions can be developed
  2. in which IR system designs can be appropriately evaluated.
Conducting studies of people
  before, during, and after engagement with information systems,
  at a variety of levels,
  using a variety of methods.
    ethnography
    in situ observation
    controlled observation
    large-scale logging

Thank You!
