crosslingual filtering systems




The INFILE project: a crosslingual filtering systems evaluation campaign

Romaric Besançon, Stéphane Chaudiron, Djamel Mostefa, Ismaïl Timimi, Khalid Choukri

LIST / DTSI / Interfaces, Cognitics and Virtual Reality Unit
13/05/07






Overview

- Goals and features of the INFILE campaign
- Test collections:
  - Documents
  - Topics
  - Assessments
- Evaluation protocol
- Evaluation procedure
- Evaluation metrics
- Conclusions


Goals and features of the INFILE Campaign

- Information Filtering Evaluation
  - filter documents according to long-term information needs (user profiles / topics)
- Adaptive: use simulated user feedback
  - following the TREC adaptive filtering task
- Crosslingual
  - three languages: English, French, Arabic
- Close to the real activity of competitive intelligence (CI) professionals
  - in particular, profiles developed by CI professionals (STI)
- Pilot track in CLEF 2008


Test Collection

- Built from a corpus of news from the AFP (Agence France Presse)
  - almost 1.5 million news items in French, English and Arabic
- For the information filtering task:
  - 100,000 documents to filter, in each language
- NewsML format
  - standard XML format for news (IPTC)



Document example

[Figure: example document, with callouts marking the document identifier, keywords and headline]




Document example

[Figure: example document, with callouts marking the location, IPTC category, AFP category and content]
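As a rough illustration of how some of the fields labelled on the example slides might be read from an XML news item, here is a minimal sketch. The sample document is hypothetical and heavily simplified: its element names are only loosely modelled on NewsML 1.x and are not taken from the INFILE collection, and the `parse_news_item` helper is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified news item. Element names are loosely modelled on
# NewsML 1.x; the real AFP documents are richer and may be structured differently.
SAMPLE = """
<NewsItem>
  <Identification>
    <NewsIdentifier><NewsItemId>doc-0001</NewsItemId></NewsIdentifier>
  </Identification>
  <NewsComponent>
    <NewsLines>
      <HeadLine>Example headline</HeadLine>
      <KeywordLine>science</KeywordLine>
      <KeywordLine>technology</KeywordLine>
    </NewsLines>
    <DataContent><p>First paragraph.</p><p>Second paragraph.</p></DataContent>
  </NewsComponent>
</NewsItem>
"""


def parse_news_item(xml_text):
    """Extract identifier, headline, keywords and text content from one item."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.findtext(".//NewsItemId"),
        "headline": root.findtext(".//HeadLine"),
        "keywords": [k.text for k in root.iter("KeywordLine")],
        "content": " ".join(p.text for p in root.iter("p")),
    }


doc = parse_news_item(SAMPLE)
print(doc["id"], "|", doc["headline"], "|", doc["keywords"])
```

A filtering system would typically run such an extraction step once per incoming document before comparing it with the profiles.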




Profiles

- 50 interest profiles
  - 20 profiles in the domain of science and technology
    - developed by CI professionals from INIST, ARIST, Oto Research, Digiport
  - 30 profiles of general interest


Profiles

- Each profile contains 5 fields:
  - title: a description in a few words
  - description: a one-sentence description
  - narrative: a longer description of what is considered a relevant document
  - keywords: a set of key words, key phrases or named entities
  - sample: a sample of a relevant document (one paragraph)
- Participants may use any subset of the fields for their filtering
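The five-field profile structure, and the freedom to use any subset of fields, can be sketched as follows. The field names come from the slide; the `Profile` class, the `build_query` helper and the example profile text are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class Profile:
    """One interest profile with the five fields listed on the slide."""
    title: str
    description: str
    narrative: str
    keywords: list
    sample: str


def build_query(profile, use=("title", "keywords")):
    """Concatenate the chosen subset of fields into one query string."""
    valid = {f.name for f in fields(Profile)}
    unknown = set(use) - valid
    if unknown:
        raise ValueError(f"unknown profile fields: {sorted(unknown)}")
    parts = []
    for name in use:
        value = getattr(profile, name)
        # keywords is a list; the other fields are plain strings
        parts.append(" ".join(value) if isinstance(value, list) else value)
    return " ".join(parts)


# Invented example profile, not one of the 50 INFILE profiles.
example = Profile(
    title="solar energy",
    description="News about solar power generation.",
    narrative="Relevant documents discuss photovoltaic technology or markets.",
    keywords=["photovoltaic", "solar panel"],
    sample="A new photovoltaic plant opened this week.",
)
print(build_query(example))
```

Keeping the field selection as a parameter makes it easy to compare runs that use, say, only the title against runs that use every field.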


Constitution of the corpus

- To build the corpus of documents to filter:
  - find relevant documents for the profiles in the original corpus
  - use a pooling technique with results of IR tools
    - the whole corpus is indexed with 4 IR engines (Lucene, Indri, Zettair and CEA search engine)
    - each search engine is queried independently using the 5 different fields of the profiles + all fields + all fields but the sample
    - 28 runs
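The figure of 28 runs follows from 4 engines times 7 query variants (5 single fields, all fields, and all fields but the sample). A short sketch of that enumeration, with the engine labels and the tuple-based representation chosen only for this example:

```python
from itertools import product

ENGINES = ["Lucene", "Indri", "Zettair", "CEA"]
FIELDS = ["title", "description", "narrative", "keywords", "sample"]

# 5 single-field variants
variants = [(f,) for f in FIELDS]
# + all fields
variants.append(tuple(FIELDS))
# + all fields but the sample
variants.append(tuple(f for f in FIELDS if f != "sample"))

# One run per (engine, query variant) pair
runs = list(product(ENGINES, variants))
print(len(variants), "variants x", len(ENGINES), "engines =", len(runs), "runs")
```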


Constitution of the corpus (2)

- pooling using a "Mixture of Experts" model
  - the first 10 documents of each run are taken
  - first pool assessed
  - a score is computed for each run and each topic according to the assessments of the first pool
  - create next pool by merging runs using a weighted sum
    - weights are proportional to the score
  - ongoing assessments
- keep all documents assessed
  - documents returned by IR systems but judged not relevant form a set of difficult documents
- choose random documents (noise)
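The pooling loop described above can be sketched for a single topic as follows. The slides do not say how the per-run score or the per-document contribution is computed, so two assumptions are made here and should be read as placeholders: a run's score is the fraction of its top documents judged relevant, and each run contributes its weight divided by the document's rank to the weighted sum. All function names are hypothetical.

```python
def first_pool(runs, depth=10):
    """Union of the top `depth` documents of every run."""
    pool = set()
    for ranked in runs.values():
        pool.update(ranked[:depth])
    return pool


def run_scores(runs, judged, depth=10):
    """Assumed score: fraction of a run's top documents judged relevant."""
    scores = {}
    for run_id, ranked in runs.items():
        top = ranked[:depth]
        hits = sum(1 for d in top if judged.get(d, False))
        scores[run_id] = hits / len(top) if top else 0.0
    return scores


def next_pool(runs, scores, judged, size=10):
    """Merge runs with a weighted sum; weights are proportional to the scores."""
    total = sum(scores.values())
    if total == 0:
        weights = {r: 1.0 / len(runs) for r in runs}  # fall back to equal weights
    else:
        weights = {r: s / total for r, s in scores.items()}
    merged = {}
    for run_id, ranked in runs.items():
        for rank, doc in enumerate(ranked, start=1):
            if doc in judged:
                continue  # already assessed, do not pool again
            # assumed contribution: run weight scaled by reciprocal rank
            merged[doc] = merged.get(doc, 0.0) + weights[run_id] / rank
    return sorted(merged, key=lambda d: (-merged[d], d))[:size]


# Toy data: two runs, pool depth 2, invented assessments.
toy_runs = {"A": ["d1", "d2", "d3", "d4"], "B": ["d5", "d2", "d6", "d7"]}
toy_judged = {"d1": True, "d2": True, "d5": False}
toy_scores = run_scores(toy_runs, toy_judged, depth=2)
print(next_pool(toy_runs, toy_scores, toy_judged, size=2))
```

In the toy data, run A gets twice the weight of run B, so its unassessed documents are pooled first.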



Evaluation procedure

- One-pass test
- Interactive protocol using a client-server architecture (web service communication)
  - participant registers
  - retrieves one document
  - filters the document
  - asks for feedback (on kept documents)
  - retrieves a new document
- limited number of feedbacks (50)
- new document available only if the previous one has been filtered
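The constraints of this protocol can be illustrated with a small in-memory simulation. This is not the campaign's web service: the `SimulatedServer` class, its method names and the `run_client` loop are hypothetical, and documents are reduced to plain identifiers. The sketch only enforces the three rules from the slide: one document at a time, feedback only on kept documents, and a limited feedback budget.

```python
class SimulatedServer:
    """Hypothetical stand-in for the evaluation server."""

    def __init__(self, documents, relevant_ids, max_feedback=50):
        self._docs = list(documents)
        self._relevant = set(relevant_ids)
        self._next = 0
        self._pending = None  # document delivered but not yet filtered
        self._kept = set()
        self.feedback_left = max_feedback

    def next_document(self):
        if self._pending is not None:
            raise RuntimeError("previous document has not been filtered yet")
        if self._next >= len(self._docs):
            return None  # stream exhausted
        self._pending = self._docs[self._next]
        self._next += 1
        return self._pending

    def submit(self, doc_id, keep):
        if doc_id != self._pending:
            raise RuntimeError("decision does not match the pending document")
        if keep:
            self._kept.add(doc_id)
        self._pending = None

    def feedback(self, doc_id):
        if doc_id not in self._kept:
            raise RuntimeError("feedback is only given on kept documents")
        if self.feedback_left == 0:
            raise RuntimeError("feedback budget exhausted")
        self.feedback_left -= 1
        return doc_id in self._relevant


def run_client(server, keep_rule):
    """One pass over the stream: filter each document, ask feedback while allowed."""
    kept, confirmed = [], []
    while (doc := server.next_document()) is not None:
        keep = keep_rule(doc)
        server.submit(doc, keep)
        if keep:
            kept.append(doc)
            if server.feedback_left > 0 and server.feedback(doc):
                confirmed.append(doc)
    return kept, confirmed


server = SimulatedServer(["d1", "d2", "d3", "d4"], {"d2", "d4"}, max_feedback=1)
print(run_client(server, keep_rule=lambda d: d != "d1"))
```

An adaptive system would use the confirmed documents to update its profile model before filtering the next document.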


Evaluation metrics

Contingency table:

                  relevant   not relevant
  retrieved          a            b
  not retrieved      c            d

- Precision / Recall / F-measure

  $$P = \frac{a}{a+b}, \qquad R = \frac{a}{a+c}, \qquad F = \frac{2PR}{P+R}$$

- Utility (from TREC)

  $$u = w_1 a - w_2 b$$

  $$u_n = \frac{\max(u/u_{\max},\, u_{\min}) - u_{\min}}{1 - u_{\min}}$$
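A minimal sketch of these measures computed from the contingency counts a, b, c, d. The weights `w1`, `w2`, the floor `u_min` and the choice of `u_max` as the utility of a perfect filter (`w1 * (a + c)`) are assumptions made for the example; the slides do not give the values used in the campaign.

```python
def precision(a, b):
    return a / (a + b) if a + b else 0.0


def recall(a, c):
    return a / (a + c) if a + c else 0.0


def f_measure(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0


def utility(a, b, w1=1.0, w2=0.5):
    """Linear utility u = w1*a - w2*b (weights are placeholder values)."""
    return w1 * a - w2 * b


def scaled_utility(u, u_max, u_min=-0.5):
    """Normalised utility: clip u/u_max at u_min, then rescale to [0, 1]."""
    return (max(u / u_max, u_min) - u_min) / (1 - u_min)


# Invented counts: 8 relevant retrieved, 2 non-relevant retrieved, 4 relevant missed.
a, b, c = 8, 2, 4
p, r = precision(a, b), recall(a, c)
u = utility(a, b)
u_max = 1.0 * (a + c)  # assumed: utility of a filter retrieving exactly the relevant documents
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3), u, round(scaled_utility(u, u_max), 3))
```

The clipping at `u_min` keeps a single very poor profile from dominating the average over all profiles.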





Evaluation metrics (2)

- Detection cost (from TDT)
  - uses probability of missed documents and false alarms

                  relevant   not relevant
  retrieved          a            b
  not retrieved      c            d

  $$P_{miss} = \frac{c}{a+c}, \qquad P_{false} = \frac{b}{b+d}$$

  $$c_{det} = c_{miss}\, P_{miss}\, P_{topic} + c_{false}\, P_{false}\, (1 - P_{topic})$$
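The detection cost can be computed from the same contingency counts. In the sketch below the cost parameters are named `cost_miss` and `cost_false` to avoid confusion with the count c; their default values and the prior `p_topic` are placeholders chosen for the example, not values stated in the slides.

```python
def detection_cost(a, b, c, d, cost_miss=1.0, cost_false=0.1, p_topic=0.02):
    """Detection cost from miss and false-alarm probabilities.

    a, b, c, d are the contingency counts; the cost weights and the topic
    prior are placeholder values for illustration.
    """
    p_miss = c / (a + c) if a + c else 0.0    # relevant documents not retrieved
    p_false = b / (b + d) if b + d else 0.0   # non-relevant documents retrieved
    return cost_miss * p_miss * p_topic + cost_false * p_false * (1 - p_topic)


# Invented counts for a 100-document stream.
print(round(detection_cost(a=8, b=2, c=4, d=86), 6))
```

Unlike precision and recall, a lower value is better here: a perfect filter has a detection cost of zero.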



Evaluation metrics

- per profile and averaged on all profiles
- adaptivity: score evolution curve (values computed every 10,000 documents)
- two experimental measures
  - originality: number of relevant documents a system uniquely retrieves
  - anticipation: inverse rank of first relevant document detected
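The two experimental measures can be sketched as follows. The slides give only one-line definitions, so the reading used here is an assumption: originality counts relevant documents retrieved by one system and by no other, and anticipation takes the relevant documents of a profile in stream order and returns one over the rank of the first one the system retrieved. Function and variable names are hypothetical.

```python
def originality(retrieved_by_system, relevant, system):
    """Number of relevant documents that only `system` retrieved."""
    others = set()
    for name, docs in retrieved_by_system.items():
        if name != system:
            others |= set(docs)
    own_relevant = set(retrieved_by_system[system]) & set(relevant)
    return len(own_relevant - others)


def anticipation(relevant_in_stream_order, retrieved):
    """Inverse rank of the first relevant document the system detected (assumed reading)."""
    for rank, doc in enumerate(relevant_in_stream_order, start=1):
        if doc in retrieved:
            return 1.0 / rank
    return 0.0  # no relevant document detected


# Invented example with two systems.
retrieved = {"s1": {"d1", "d2", "d9"}, "s2": {"d2", "d3"}}
relevant = {"d1", "d2", "d3"}
print(originality(retrieved, relevant, "s1"))
print(anticipation(["d3", "d1", "d2"], retrieved["s1"]))
```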


Conclusions

- INFILE campaign
  - Information Filtering Evaluation: adaptive, crosslingual, close to real usage
- Ongoing pilot track in CLEF 2008
  - corpus constitution currently under way
  - dry run mid-June
  - evaluation campaign in July
  - workshop in September
- Work in progress
  - the modelling of the filtering task assumed by the CI practitioners
