Challenges with Semi-Structured


Challenges with XML
Challenges with Semi-Structured collections

Ludovic Denoyer
University of Paris 6

Bridging the gap between research communities

Outline

- Motivations
- XML Mining Challenge
- Graph Labelling/WebSpam Challenge
- Conclusion and future work


General Idea

- The two challenges were proposed to attract researchers from different domains:
  - Mainly Machine Learning and Information Retrieval
- Show IR researchers that ML methods are able to solve some of their problems
- Show ML researchers that IR tasks provide an interesting context for developing new general Machine Learning algorithms




General Idea

- Find generic tasks that correspond to:
  - IR: new real applications
  - ML: new generic problems
- To work together
- To pool efforts
- To solve these tasks faster
- To compare the approaches



Open questions in ML

- Structure+content classification
- Classification of inter-dependent variables
- Structured output classification

Open questions in IR

- Semi-structured documents (XML)
- Interconnected documents
- Heterogeneous collections

Motivations

Open questions in ML:
- Structured input classification
- Classification of inter-dependent variables
- Structured output classification

Open questions in IR:
- Semi-structured documents (XML)
- Hyperlinked documents
- Heterogeneous collections

Proposed challenges:
- XML Mining Challenge
- WebSpam Challenge

Motivations

- Information Retrieval
- Machine Learning
- Data Mining
- Web

Proposed Challenges

Challenges

- XML Mining Challenge
  - « Bridging the gap between Machine Learning and Information Retrieval »
- Graph Labelling Challenge
  - Application to WebSpam detection

Outline

- Motivations
- XML Mining Challenge
- WebSpam Challenge
- Conclusion and future work


XML Mining Challenge

- Launched in 2005
  - PASCAL (Network of Excellence in ML)
  - DELOS (Network of Excellence in Digital Libraries)
- Organized as an INEX track
  - INEX: Initiative for the Evaluation of XML IR
  - More than 50 different institutes involved
  - One event each year at INEX (December)
  - Biggest INEX track (after ad-hoc retrieval)
- We are currently launching the 4th XML Mining track


XML Mining Challenge

- ML goal
  - Classification of large collections of structures
- IR goal
  - Classification of semi-structured collections
  - Using both structure and content
- Underlying idea
  - Using structure and content information
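To make the "structure and content" idea concrete, here is a minimal sketch of how an XML document could be turned into features that mix both kinds of information. It is not a method from the track: the function name, the feature naming scheme, and the sample document are assumptions chosen for illustration, and text that follows a closing tag (tail text) is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structure_content_features(xml_text):
    """Return a bag of features mixing tag paths (structure) and words (content)."""
    root = ET.fromstring(xml_text)
    features = Counter()

    def visit(node, path):
        current = f"{path}/{node.tag}"
        features[f"path:{current}"] += 1            # structural feature
        for word in (node.text or "").lower().split():
            features[f"word:{word}"] += 1           # content feature
            features[f"{current}:{word}"] += 1      # word tied to where it occurs
        for child in node:
            visit(child, current)

    visit(root, "")
    return features


# Hypothetical document, loosely in the spirit of a movie collection.
doc = "<movie><title>Alien</title><plot>A crew meets an alien</plot></movie>"
feats = structure_content_features(doc)
```

The resulting counter can be fed to any standard classifier; a content-only baseline would keep just the `word:` features.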

Collections

Different collections have been used:
- 2005
  - Artificial collection
  - Movie collection
- 2006
  - Scientific articles
  - Wikipedia XML-based collection
- 2007
  - Wikipedia XML-based collection
    - 96,000 documents in XML
    - 21 categories


Submitted papers

[Chart: number of submitted papers, broken down into IR papers, ML papers, and DM papers]

Large variety of models

- Different existing ML methods have been applied:
  - Self-Organizing Map
  - SVM
  - (Graph) Neural Network
  - CRF
  - Incremental models
- Some new models have been developed

Short Typology

- See the report on the XML Mining track in SIGIR Forum

Results - 2007

Classification

Authors                  Method                                  Micro recall   Macro recall
Zhang et al.             Kernel + SVM                            0.87           0.83
L. M. de Campos et al.   Graphical models (Bayesian networks)    0.78           0.76
Meenakshi et al.         Negative Category Document Frequency    0.78           0.75
...
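For readers less familiar with the two columns, the sketch below computes micro and macro recall using their usual definitions for single-label classification. The track's exact evaluation script is not shown in these slides, so this is an assumption about the standard formulas, and the labels are made-up toy data.

```python
from collections import Counter


def micro_macro_recall(y_true, y_pred):
    """Micro recall pools all documents; macro recall averages per-category recall."""
    support = Counter(y_true)                                   # documents per true category
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct ones per category
    micro = sum(hits.values()) / len(y_true)
    macro = sum(hits[c] / support[c] for c in support) / len(support)
    return micro, macro


# Toy example: a classifier that always answers the majority category "a".
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]
micro, macro = micro_macro_recall(y_true, y_pred)   # micro = 0.8, macro = 0.5
```

Macro recall penalizes models that ignore small categories, which is why it is typically lower than micro recall on unbalanced collections.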

XML Structure Mapping task

- Proposed in 2006
- ML task: structured output classification
  - Learning to transform trees
- IR application: dealing with heterogeneous collections
  - Learning to transform heterogeneous documents to a mediated schema
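The input and output of such a transformation can be illustrated with a small sketch. Note the strong simplification: the tag mapping below is hand-written and purely hypothetical, whereas the task is precisely to learn the transformation from examples. The tag names and the sample document are also assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source tags to mediated-schema tags.
# In the actual task this mapping (and the tree restructuring) would be learned.
TAG_MAP = {"film": "movie", "name": "title", "headline": "title", "summary": "plot"}


def to_mediated_schema(xml_text, tag_map=TAG_MAP):
    """Rewrite a document tree by renaming each tag to its mediated-schema tag."""
    def convert(node):
        new = ET.Element(tag_map.get(node.tag, node.tag))
        new.text = node.text
        new.tail = node.tail
        for child in node:
            new.append(convert(child))
        return new

    return ET.tostring(convert(ET.fromstring(xml_text)), encoding="unicode")


out = to_mediated_schema("<film><name>Alien</name><summary>A crew</summary></film>")
# out == "<movie><title>Alien</title><plot>A crew</plot></movie>"
```

A learned model would also have to handle cases a renaming table cannot, such as splitting, merging, or reordering elements.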

XML Structure Mapping

- A generic ML model able to solve this task has many potential applications:
  - Conversion between file formats
  - Automatic translation
  - Natural language processing




Conclusion

- Existing structured input models (kernels, ...) have been tested on this task
- New specific models have been developed
- It is difficult to know which model is the best
  - Need to wait one more year
- The challenge has attracted researchers from different communities
- Each year, ML researchers come to INEX and:
  - Discover a new domain
  - Present advanced ML models to other researchers
- The collections are freely available and have been downloaded a hundred times
  - Some articles are starting to appear in different conferences




WebSpam Challenge

- PASCAL « Graph Labelling Challenge »
- Organized by:
  - Ricardo BAEZA-YATES (Yahoo! Research Barcelona)
  - Carlos CASTILLO (Yahoo! Research Barcelona)
  - Brian DAVISON (Lehigh University, USA)
  - Ludovic DENOYER (University Paris 6, France)
  - Patrick GALLINARI (University Paris 6, France)
- The Web Spam Challenge 2007 was supported by PASCAL
- The Web Spam Challenge 2007 was also supported by the DELIS EU-FET research project


WebSpam Challenge

- Three events:
  - AirWeb workshop 2007 (WWW’07)
    - May 2007
    - Web-oriented part
  - GraphLab workshop 2007
    - PKDD/ECML
    - September 2007
    - ML-oriented part
  - AirWeb workshop 2008 (WWW’08 ?)



WebSpam Challenge

- IR (Web) task:
  - Detection of web spam
  - Spam = any attempt to get “an unjustifiably favorable relevance or importance score for some Web pages, considering the page’s true value”
- Example of spam

WebSpam Challenge

- ML task:
  - Graph labelling
  - Classification of inter-dependent variables
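As an illustration of what "graph labelling" with inter-dependent variables means, the sketch below spreads a few known labels over a small graph by repeatedly averaging neighbour scores. This is a generic label-propagation baseline, not one of the participants' methods; the function name, the toy graph, and the 0.5 starting score are assumptions.

```python
def propagate_labels(edges, seeds, n_nodes, n_iter=20):
    """Iteratively average neighbour scores; seed nodes keep their known label.

    edges: list of undirected (u, v) pairs.
    seeds: {node: 1.0 for spam, 0.0 for normal} for the labelled nodes.
    """
    neighbours = [[] for _ in range(n_nodes)]
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)

    scores = [seeds.get(i, 0.5) for i in range(n_nodes)]   # 0.5 = unknown
    for _ in range(n_iter):
        updated = []
        for i in range(n_nodes):
            if i in seeds or not neighbours[i]:
                updated.append(scores[i])                   # clamp seeds and isolated nodes
            else:
                total = sum(scores[j] for j in neighbours[i])
                updated.append(total / len(neighbours[i]))
        scores = updated
    return scores


# Toy graph: nodes 0-1-2 form a chain seeded as spam, nodes 3-4 a pair seeded as normal.
edges = [(0, 1), (1, 2), (3, 4)]
scores = propagate_labels(edges, seeds={0: 1.0, 4: 0.0}, n_nodes=5)
```

The label of each node depends on the labels of its neighbours, which is what separates this setting from classifying each host independently.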

Collection

- A collection of interconnected Web pages
  - 77 million pages
  - About 11,000 hosts manually labeled as spam or normal (host level)
- Blinded evaluation of models

Participants

[Chart: number of participants, broken down into ML participants, Web/IR participants, and industrial participants]

Participants

- Why such an increase in ML participants during GraphLab?


GraphLab workshop at ECML/PKDD 2007

- The collection has been fully preprocessed by the organizers
  - Each node corresponds to a vector (in SVMLight format) based on the word distribution in each host/page
  - The contingency matrix has been built
- One small collection with 9,000 nodes
- One large collection with 400,000 nodes
- 10% for training / 20% for validation / 70% for test
- You can easily apply your « relational » models to this corpus without knowing anything about text processing
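A minimal sketch of how such preprocessed data could be consumed: one helper parses a line in the usual SVMLight layout (`<label> <index>:<value> ... # comment`), another cuts node ids into 10% / 20% / 70% parts. The sample line and the random shuffling are assumptions; the organizers distributed their own fixed split, and optional SVMLight tokens such as `qid:` are not handled here.

```python
import random


def parse_svmlight_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    body = line.split("#", 1)[0].split()       # drop the trailing comment, if any
    label = float(body[0])
    features = {}
    for item in body[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def split_nodes(node_ids, seed=0):
    """Shuffle node ids and cut them 10% / 20% / 70% (train / validation / test)."""
    ids = list(node_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) // 10
    n_valid = 2 * len(ids) // 10
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]


# Hypothetical line and a node count matching the small collection.
label, feats = parse_svmlight_line("1 3:0.5 17:2 # host 42")
train, valid, test = split_nodes(range(9000))
```

With only 10% of the nodes labelled for training, methods that exploit the links to unlabelled nodes have a natural advantage.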




Results

Small collection (9,000 nodes)

Participants       Methods                    AUC
Abernethy et al.   Semi-supervised learning   95.2
Tang et al.        SVM                        95.1
Filoche et al.     Stacked learning           92.7
Csalogany et al.   C4.5                       87.7
Tian et al.        Semi-supervised            86.3







Results

Large collection (400,000 nodes)

Participants     Methods                    AUC
Weiss et al.     Semi-supervised learning   99.8
Filoche et al.   Stacked learning           99.1
Tang et al.      SVM                        98.9
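The AUC figures in these tables can be read as the share of (spam, normal) pairs that a model ranks in the right order, expressed as a percentage. The sketch below computes that quantity by direct pair counting; it assumes the standard definition of AUC (the challenge's own evaluation script is not shown here) and uses made-up scores.

```python
def auc(labels, scores):
    """Fraction of (spam, normal) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: label 1 = spam, 0 = normal; one spam host is scored below a normal one.
value = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])   # 3 of 4 pairs correct -> 0.75
```

Pair counting is quadratic in the number of nodes, so a sort-based formulation would be preferable at the scale of the large collection.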







Conclusion on WebSpam

- Different pure ML methods used « as is »
  - Semi-supervised methods
  - Stacked learning
- Very good performance of ML models (equivalent to Web « hand-made » models)




Conclusion on WebSpam

- Development of an ML benchmark for graph labelling
- WebSpam also raises interesting ML challenges that could be integrated into the challenge
  - Learning with few examples
  - Large-scale problems
  - Adversarial Machine Learning




Conclusion

- The two challenges have proposed benchmarks for IR/Web applications and also for generic ML problems
- It is possible to mix researchers from different communities
- ML researchers dislike cleaning real collections
  - You have to preprocess the collections
- ML researchers dislike large collections
  - But this is changing



Future work

- XML Mining will continue this year
  - See http://xmlmining.lip6.fr
  - Will the corpus be preprocessed?
- The WebSpam challenge will also continue
  - See http://webspam.lip6.fr
  - We will see after WWW’08 whether we propose another GraphLab workshop (see http://graphlab.lip6.fr)
  - Note that a new, larger corpus has been developed in 2008



Thank you for your attention

(Thank you to the participants of the different challenges who are in the room)