ARTICLE1: Generating Personalized Summaries Using Publicly ...


4 Dec 2013


By Asef Poormasoomi

Motivation

Summaries that are generic in nature do not cater to the user’s background and interests.

Results show that each person has a different perspective on the same text.

So a good summary should change in accordance with the preferences of its reader.

Motivation

Marcu (1997): found that the percent agreement of 13 judges over 5 texts from Scientific American was 71 percent.

Rath (1961): found that extracts selected by four different human judges had only 25 percent overlap.

Salton (1997): found that the most important 20 paragraphs extracted by 2 subjects had only 46 percent overlap.

User Feedback

Query history: the most widely used implicit user feedback at present (http://www.google.com/psearch).

Click data: when a user clicks on a document, the document is considered to be of more interest to the user than the unclicked ones.

Attention time: often referred to as display time or reading time.

Other types of implicit user feedback: display time, scrolling, annotation, bookmarking and printing behaviors.

ARTICLE1: Generating Personalized Summaries Using Publicly Available Web Documents

2008 IEEE

Chandan Kumar, Prasad Pingali, Vasudeva Varma

Extract the personal information of the user using information available on the web.


Generic Sentence Scoring

In general, compute the probability distribution p(w|D) over the words w appearing in the input D.

For each sentence S in the input, assign a weight equal to the average probability of the words in the sentence.
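The generic scoring step can be sketched in a few lines of Python. The estimator for p(w|D) is not given on the slide, so this sketch assumes a plain relative-frequency estimate; the function names and the toy sentences are illustrative only.

```python
from collections import Counter

def word_distribution(tokens):
    """Relative-frequency estimate of p(w|D) from a list of tokens (assumed estimator)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sentence_score(sentence_tokens, p_w_d):
    """Average probability of the words in one sentence; unseen words count as 0."""
    if not sentence_tokens:
        return 0.0
    return sum(p_w_d.get(w, 0.0) for w in sentence_tokens) / len(sentence_tokens)

# Toy input: three already tokenized sentences forming the document set D.
sentences = [["lab", "opens", "india"], ["lab", "funds", "research"], ["weather", "report"]]
p_w_d = word_distribution([w for s in sentences for w in s])
scores = [sentence_score(s, p_w_d) for s in sentences]
```

Sentences that contain frequent words of the input (here "lab") receive a higher average and are therefore preferred by the generic summarizer.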




Estimating the User Background Model

A search engine is used to extract the personal information of the user from information available on the web.

The person’s full name is submitted to a search engine (the name is enclosed in double quotation marks, such as ”Albert Einstein”).

The top n documents are retrieved.

After stop-word removal and stemming, a unigram language model is learned on the extracted text content.

This model can be interpreted as the probability p(w|U) of a word w being related to the person’s profile U.
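A minimal sketch of the profile-estimation step, assuming the text of the top n result pages has already been downloaded. The stop-word list, the crude suffix stripper (a stand-in for a real stemmer such as Porter's) and the relative-frequency estimate are all assumptions made for illustration, not details taken from the paper.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list would be much longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to", "for"}

def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def user_unigram_model(documents):
    """Estimate p(w|U) by relative frequency over the retrieved page texts."""
    counts = Counter()
    for text in documents:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS:
                counts[crude_stem(token)] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

# Hypothetical page texts returned for a quoted full-name query.
pages = ["Parsing and machine translation of languages",
         "Statistical parsing for translation models"]
p_w_u = user_unigram_model(pages)
```

Words that recur across the retrieved pages (here the stems of "parsing" and "translation") dominate the profile, which is what later biases sentence scores toward the user's field.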





User-Specific Sentence Scoring

The term probability of the document set D, p(w|D), and that of the user profile U, p(w|U), are merged using a linear weighted combination to give the score of a sentence S for user u.
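One way to read the linear weighted combination is sketched below. The exact formula is not reproduced on the slide, so the per-word interpolation, the averaging over the sentence and the mixing weight `lam` are assumptions; the two toy distributions are invented for the example.

```python
def user_sentence_score(sentence_tokens, p_w_d, p_w_u, lam=0.5):
    """Average over the sentence of lam * p(w|D) + (1 - lam) * p(w|U).

    lam is a hypothetical mixing weight in [0, 1]; lam = 1 gives the generic score.
    """
    if not sentence_tokens:
        return 0.0
    mixed = (lam * p_w_d.get(w, 0.0) + (1.0 - lam) * p_w_u.get(w, 0.0)
             for w in sentence_tokens)
    return sum(mixed) / len(sentence_tokens)

p_w_d = {"lab": 0.4, "india": 0.3, "security": 0.15, "translation": 0.15}
p_w_u = {"translation": 0.7, "parsing": 0.3}   # toy profile of an NLP researcher

s1 = ["lab", "security"]
s2 = ["lab", "translation"]
generic = (user_sentence_score(s1, p_w_d, p_w_u, lam=1.0),
           user_sentence_score(s2, p_w_d, p_w_u, lam=1.0))
personal = (user_sentence_score(s1, p_w_d, p_w_u, lam=0.5),
            user_sentence_score(s2, p_w_d, p_w_u, lam=0.5))
```

With the generic setting both sentences tie, while the profile-aware setting ranks the translation sentence higher for this user.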



After Sentence Scoring, Eliminate Redundancy

For redundancy identification, use the number of overlapping terms between the already generated summary and the new sentence being considered.

Sentences are arranged in chronological order between documents (i.e., based on the time stamp) and in order of occurrence within a document.
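The selection and ordering steps could look roughly like the following greedy sketch. The overlap threshold `max_overlap`, the word budget and the dictionary layout of a candidate sentence are hypothetical choices; the slide only states that term overlap with the summary so far is the redundancy measure.

```python
def build_summary(candidates, max_words=100, max_overlap=0.5):
    """Greedy selection with a term-overlap redundancy check.

    candidates: dicts with 'tokens', 'score', 'timestamp' (document time) and
    'position' (index of the sentence inside its document).
    max_overlap is a hypothetical threshold on the fraction of a sentence's
    distinct terms that already occur in the summary.
    """
    chosen, summary_terms, used = [], set(), 0
    for sent in sorted(candidates, key=lambda s: s["score"], reverse=True):
        terms = set(sent["tokens"])
        if not terms or used + len(sent["tokens"]) > max_words:
            continue
        if len(terms & summary_terms) / len(terms) > max_overlap:
            continue  # too many terms already covered: treat as redundant
        chosen.append(sent)
        summary_terms |= terms
        used += len(sent["tokens"])
    # Order: chronological across documents, then order of occurrence within one.
    return sorted(chosen, key=lambda s: (s["timestamp"], s["position"]))

candidates = [
    {"tokens": ["lab", "opens", "in", "bangalore"], "score": 0.9, "timestamp": 2, "position": 0},
    {"tokens": ["lab", "opens", "bangalore"], "score": 0.8, "timestamp": 1, "position": 3},
    {"tokens": ["summer", "school", "announced"], "score": 0.5, "timestamp": 1, "position": 1},
]
summary = build_summary(candidates)
```

In this toy run the second candidate is dropped because all of its terms already appear in the summary, and the two survivors are emitted oldest document first.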



Example

The topic of summary generation is "Microsoft to open research lab in India".

Eight articles published in different news sources form the news cluster.

The example shows the condensed summary (100 words) for two users. User A is from the NLP domain and User B is from the network security domain.

The italic text in the user-specific summary shows the difference compared to the generic summary.




Generic summary:

The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. Microsoft’s Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft’s Third Basic Research Facility Established Outside The United States. In Line With Microsoft’s Research Strategy Worldwide, The Bangalore Lab Will Collaborate With And Fund Research At Key Educational Institutions In India, Such As The Indian Institutes Of Technology, Anandan Said. Although Microsoft Research Doesn’t Engage In Product Development Itself, Technologies Researchers Create Can Make Their Way Into The Products The Company




User A specific summary:

The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. Microsoft’s Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft’s Third Basic Research Facility Established Outside The United States. Microsoft Will Collaborate With The Government Of India And The Indian Scientific Community To Conduct Research In Indic Language Computing Technologies, This Will Include Areas Such As Machine Translation Between Indian Languages And English, Search And Browsing And Character Recognition. In Line With Microsoft’s Research Strategy Worldwide, The Bangalore Lab




User B specific summary:

The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. The Newly Announced India Research Group Focuses On Cryptography, Security, Algorithms And Multimedia Security, Ramarathnam Venkatesan, A Leading Cryptographer At Microsoft Research In Redmond, Washington, In The US, Will Head The New Group. Microsoft Research India will conduct a four-week summer school featuring lectures by leading experts in the fields of cryptography, algorithms and security. The program is aimed at senior undergraduate students, graduate students and faculty




Evaluation

The technique was evaluated with five research scholars working in different fields of computer science.

News articles from the science and technology domain were considered for summarization. 25 different topics were chosen, with each topic having 5-10 articles.

Each researcher was asked to judge the relevance of both versions of the summaries for all 25 topics (on a 1-5 scale).

The results show that users prefer the profile-based personalized summaries to the generic summary produced by a general automatic summarization system.




Evaluation

This figure shows the scores given by a particular user across different topics.

For most of the topics, the user found the personalized summaries relevant.

Personalized summaries for topics strongly related to the user’s domain are more relevant to the user.

For topics that are not closely related to the user’s field, the personalized and generic summaries are quite similar.

For a few rare topics, the user did not find the personalized summary better.




ARTICLE2: User-oriented Document Summarization Through Vision-based Eye-tracking

2009 ACM

Songhua Xu, Hao Jiang, Francis C.M. Lau



MAIN IDEA

The key idea is to rely on the attention (reading) time that individual users spend on single words in a document.

The prediction of user attention over every word in a document is based on the user’s attention during previous reads.

The algorithm tracks a user’s attention times over individual words using a vision-based commodity eye-tracking mechanism.

User attention time over any arbitrary word is predicted by a data mining process.

It uses a simple web camera and an existing eye-tracking algorithm from the Opengazer project.

The error of the detected gaze location on the screen is between 1 and 2 cm, depending on which area of the screen the user is looking at (on a 19” monitor).



Anchoring Gaze Samples onto Individual Words

The detected gaze central point is positioned at (x, y) in screen space.

For each word, compute the central displaying point of the word, denoted (xi, yi).

The assignment formula uses the average width and height of a word’s displayed bounding box in the document.

For each gaze detected by the eye-tracking module, assign the gaze samples to the words in the document in this manner.

The overall attention that a word in the document receives is the sum of all the fractional gaze samples it is assigned in the above process.

During processing, remove the stop words.
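The fractional assignment of gaze samples can be illustrated with a short sketch. The weighting formula did not survive on the slide, so the Gaussian falloff used here is purely an assumption; only the inputs (gaze point, word centres, average word width and height) and the summing of fractional samples per word follow the text. The coordinates are invented pixel values.

```python
import math

def assign_gaze(gaze_points, word_centers, avg_w, avg_h):
    """Accumulate fractional gaze samples per word.

    Assumed weighting: a Gaussian falloff of the offset between the gaze point
    (x, y) and the word centre (xi, yi), scaled by the average word width and
    height. Each sample's weights are normalised to sum to 1.
    """
    attention = {word: 0.0 for word in word_centers}
    for x, y in gaze_points:
        weights = {
            word: math.exp(-(((x - xi) / avg_w) ** 2 + ((y - yi) / avg_h) ** 2))
            for word, (xi, yi) in word_centers.items()
        }
        total = sum(weights.values())
        if total == 0.0:
            continue  # gaze too far from every word to contribute
        for word, wgt in weights.items():
            attention[word] += wgt / total
    return attention

word_centers = {"eye": (100, 50), "tracking": (180, 50), "camera": (100, 90)}
gaze = [(105, 52), (110, 50), (178, 51)]
attention = assign_gaze(gaze, word_centers, avg_w=60, avg_h=20)
```

Normalising each sample keeps the total attention equal to the number of gaze samples, so a noisy gaze estimate (1 to 2 cm of error) is spread over neighbouring words instead of being credited to a single one.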



PREDICTION OF USER ATTENTION OVER A SENTENCE

Attention time prediction for a word is based on the semantic similarity of two words.

Sim(wi, wj) denotes the semantic similarity between word wi and word wj, where Sim(wi, wj) ∈ [0, 1].

The similarity is computed with the algorithm proposed in: Y. Li, Z. A. Bandar, and D. McLean. An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering.

For an arbitrary word w that is not among the sampled words wi (i = 1, …, n), calculate the similarity between w and every wi, and then select the k words that share the highest semantic similarity with w (k is set to min(10, n)).
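The neighbour-selection step can be sketched as follows. How the k neighbours are combined into a prediction is not shown on the slide, so the similarity-weighted average is an assumption, and `toy_sim` is a made-up lookup table standing in for the Li, Bandar and McLean similarity measure.

```python
def predict_attention(word, samples, sim, max_k=10):
    """Predict attention for `word` from the k most similar sampled words.

    samples: dict of sampled word -> attention time.
    sim: callable returning a similarity in [0, 1] (stand-in for Li et al.).
    The similarity-weighted average is an assumed combination rule.
    """
    k = min(max_k, len(samples))          # k = min(10, n)
    nearest = sorted(samples, key=lambda w: sim(word, w), reverse=True)[:k]
    weight = sum(sim(word, w) for w in nearest)
    if weight == 0.0:
        return 0.0
    return sum(sim(word, w) * samples[w] for w in nearest) / weight

# Toy similarity table, purely illustrative.
TOY_SIM = {("protein", "enzyme"): 0.8, ("protein", "cell"): 0.4, ("protein", "goal"): 0.0}

def toy_sim(a, b):
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))

samples = {"enzyme": 2.0, "cell": 1.0, "goal": 5.0}
predicted = predict_attention("protein", samples, toy_sim)
```

Words with zero similarity contribute nothing, so the long attention time on the unrelated word "goal" does not leak into the prediction for "protein".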




Predicting User Attention for Sentences

Estimate the total attention of a certain user on a sentence as the sum of the user’s attention over all the words in the sentence.

AT(w_i; U_j) is user U_j’s attention over the word w_i, which is either sampled from the user’s previous reading activities via (1) or predicted via (2).

Each term carries a weight that is 0 if the word w_i is a stop word, 0.6 if there is no attention sample for the user U_j over the word w_i, and 1 otherwise.
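The three weighting cases translate directly into code. This is a sketch under the assumption that the weight multiplies each word's attention value inside the sum; the stop-word list and the two dictionaries are illustrative inputs.

```python
STOP_WORDS = {"the", "a", "of", "in", "is"}   # illustrative stop-word list

def sentence_attention(tokens, sampled, predicted):
    """Sum of weighted word attentions for one user and one sentence.

    sampled: word -> attention measured in earlier reads.
    predicted: word -> attention predicted for words without a sample.
    """
    total = 0.0
    for w in tokens:
        if w in STOP_WORDS:
            continue                                # weight 0
        if w in sampled:
            total += 1.0 * sampled[w]               # weight 1: a real sample exists
        else:
            total += 0.6 * predicted.get(w, 0.0)    # weight 0.6: predicted value
    return total

score = sentence_attention(["the", "protein", "folds"], {"protein": 2.0}, {"folds": 1.0})
```

The lower weight on predicted values reflects that they are less reliable than attention actually measured for this user.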



A Hybrid Summarization Approach

Early experiments showed that the performance of the user-oriented document summarization algorithm depends heavily on the number of available user attention time samples.

To address the issue, the new method is integrated with a conventional automatic document summarization algorithm (MEAD).

The hybrid score uses an indicator that is 1 if sentence s_i is selected by MEAD in its document summarization result and 0 otherwise.

k is a free parameter and is user tunable.








EXPERIMENT RESULTS

The document summarization results are compared with those generated by two popular text summarization algorithms.

Two sets of articles are used. Articles in the first set are all about science (60 articles from “Science” magazine), and articles in the second set are all about entertainment and leisure (60 articles randomly selected from the travel and sports sections of the “New York Times”).

12 people with different knowledge backgrounds read some selected articles from the two article sets.

They are then asked to provide a summary of the article they just read.



EXPERIMENT RESULTS

To measure the performance, three measurements are introduced: Recall (R), Precision (P) and F-rate (F).

SU e is the human summary result.
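The three measures can be computed as in the sketch below. It assumes that both the system summary and the human summary are extractive and can be compared as sets of sentence identifiers, and that the F-rate is the usual harmonic mean of precision and recall; the slide's own formulas are not reproduced here.

```python
def precision_recall_f(system_sentences, human_sentences):
    """Precision, recall and F over sets of selected sentences.

    Assumes both summaries are extractive and can be compared as sets of
    sentence identifiers; F is taken as the harmonic mean of P and R.
    """
    system, human = set(system_sentences), set(human_sentences)
    hits = len(system & human)
    precision = hits / len(system) if system else 0.0
    recall = hits / len(human) if human else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

# Sentence indices picked by the system and by the human reader (toy values).
p, r, f = precision_recall_f({1, 4, 7, 9}, {1, 2, 4})
```

Here two of the four system sentences match the human summary, and two of the three human sentences are recovered.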





EXPERIMENT RESULTS

An experiment evaluates the performance of the hybrid approach under different settings of the parameter k.






ARTICLE3: Webpage Summarization Using Clickthrough Data

2005 ACM

Jiantao Sun, Dou Shen, Huajun Zeng, Qiang Yang, Yuchang Lu, Zheng Chen



Main Idea

Use the extra knowledge in clickthrough data to improve Web-page summarization.

A collection of clickthrough data can be represented by a set of triples <u, q, p>, where user u issued query q and clicked page p.

Typically, a user's query words reflect the true meaning of the target Web-page content.

The new algorithm adapts two text-summarization methods to summarize Web pages.

The first approach is based on significant-word selection adapted from Luhn's method.

The second method is based on Latent Semantic Analysis (LSA).
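As a small illustration of how clickthrough triples become extra knowledge about a page, the sketch below groups query words by the page they led to. The URLs, user ids and whitespace tokenization are invented for the example and are not taken from the paper's data.

```python
from collections import Counter, defaultdict

def query_terms_by_page(triples):
    """Map each clicked page to the frequency of query words that led to it."""
    terms = defaultdict(Counter)
    for _user, query, page in triples:
        terms[page].update(query.lower().split())
    return terms

# Toy <u, q, p> triples.
clicks = [
    ("u1", "python tutorial", "http://example.org/py"),
    ("u2", "learn python", "http://example.org/py"),
    ("u1", "cheap flights", "http://example.org/travel"),
]
page_terms = query_terms_by_page(clicks)
```

The resulting per-page term counts are the kind of signal that both adapted summarizers can use to boost words users actually searched for.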






Problems

Web pages may have no associated query words.

The clickthrough data are often very noisy.

Solution

Thematic lexicon: built using an annotated hierarchical taxonomy of Web pages, such as the one provided by the ODP website (http://dmoz.org/).





Adapted Significant Word (ASW) Method

Each sentence is assigned a significance factor (based on word frequency), and the sentences with high significance factors are selected to form the summary.

A customized significance factor is used.

Adapted Latent Semantic Analysis (ALSA) Method

The corpus can be represented by a term-document matrix.



Summarize Web Pages Not Covered by Clickthrough Data

Build a thematic lexicon.

TS(c) represents the set of terms associated with category c.

The thematic lexicon is a set of TS, each of which corresponds to a category in ODP.

The lexicon is built as follows:

First, the TS corresponding to each category is set to empty.

For each page covered by the clickthrough data, its query words are added into the TS of its category.

If a page belongs to more than one category, its query terms are added into the TS associated with each of its categories.

Finally, each term weight in each TS is multiplied by its Inverse Category Frequency (ICF).

For each Web page that is not covered by the clickthrough data, first look up the lexicon for the TS according to the page's category; then the summarization methods are applied.

When a TS does not have sufficient terms, the TS corresponding to its parent category is used.



EXPERIMENTS

The data set contains about 44.7 million records covering 29 days, from Dec 6, 2003 to Jan 3, 2004 (MSN search engine).

3,074,678 Web pages of the ODP directory were crawled.

In the end, 1,125,207 Web pages were obtained, 260,763 of which were clicked by Web users using 1,586,472 different queries.

DAT1 consists of 90 pages selected from the browsed pages.

Three human evaluators were employed to summarize these pages.

A relatively large-scale data set, denoted DAT2, is also used to evaluate the summarization methods (10,000 pages).




Summarization Results on DAT1 (ASW)

ROUGE is a software package adopted by DUC for automatic summarization evaluation (http://www.isi.edu/~cyl/ROUGE/).



Summarization Results on DAT1 (ALSA)



Evaluation of the Summarization Method Using the Thematic Lexicon

The clickthrough data contain only 260,763 pages, and the lexicon contains 141,869 categories, which is a subset of the ODP category structure.

If the terms under a page's category have more than P% overlap with the distinct terms in the Web page, they are used for summarization. Otherwise, the lexicon terms of its parent category are used.

This process continues until a category is found that covers enough query terms or until the root of the thematic lexicon is reached.
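The climb through parent categories can be sketched as below. The overlap is assumed to be the percentage of the page's distinct terms that also occur in the category's term set, categories are assumed to be slash-separated paths, and `p_percent` is a hypothetical setting of P; the toy lexicon is invented.

```python
def select_lexicon_terms(page_terms, category, lexicon, p_percent=10.0):
    """Climb the category path until TS(c) overlaps the page's terms enough.

    Overlap is assumed to be the percentage of the page's distinct terms that
    also occur in TS(c); p_percent is a hypothetical setting of P.
    """
    distinct = set(page_terms)
    while category and distinct:
        ts = lexicon.get(category, set())
        overlap = 100.0 * len(distinct & set(ts)) / len(distinct)
        if overlap > p_percent:
            return category, set(ts)
        category = category.rpartition("/")[0]   # move to the parent category
    return None, set()                           # root reached without a match

lexicon = {"Top/Science": {"gene", "protein", "cell"},
           "Top/Science/Physics": {"quark"}}
chosen, terms = select_lexicon_terms(["gene", "cell", "membrane", "cell"],
                                     "Top/Science/Physics", lexicon)
```

In this toy case the page's own category shares no terms with the page, so the terms of the parent category are used instead.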



Evaluation of the Summarization Method Using the Thematic Lexicon (ASW)



Evaluation of the Summarization Method Using the Thematic Lexicon (ALSA)

thanks