By: Asef Poormasoomi

Autumn 2009


Introduction


Summary: a brief but accurate representation of the contents of a document.

Motivation


- Abstracts for scientific and other articles
- News summarization (mostly multi-document summarization)
- Classification of articles and other written data
- Web pages for search engines
- Web access from PDAs and cell phones
- Question answering and data gathering


- Extract vs. abstract: lists fragments of text vs. re-phrases content coherently.
  Example: "He ate a banana, an orange and an apple" => "He ate fruit."
- Generic vs. query-oriented: provides the author's view vs. reflects the user's interest.
  Example: a question answering system.
- Personal vs. general: considers the reader's prior knowledge vs. assumes none.
- Single-document vs. multi-document source: based on one text vs. fuses together many texts.
- Indicative vs. informative: used for quick categorization vs. content processing.
- Genres

Summarization in 3 steps (Lin and Hovy, 1997)

- Content/topic identification
  - Goal: find/extract the most important material.
  - Techniques: methods based on position, cue phrases, concept counting, word frequency.
- Conceptual/topic interpretation
  - Application: only for abstract summaries.
  - Methods: merging or fusing related topics into more general ones, removing redundancies, etc.
  - Example: "He sat down, read the menu, ordered, ate and left" => "He visited the restaurant."
- Summary generation
  - Say it in your own words.
  - Simple if extraction is performed.

Methods



- Statistical scoring methods (pseudo)
- Higher semantic/syntactic structures
- Network (graph) based methods
- Other methods (rhetorical analysis, lexical chains, co-reference chains)
- AI methods

Statistical scoring (Pseudo)



General method:
1. Score each entity (sentence, word);
2. Combine scores;
3. Choose the best sentence(s).

Scoring techniques:
- Word frequencies throughout the text (Luhn '58)
- Position in the text (Edmundson '69, Lin & Hovy '97)
- Title method (Edmundson '69)
- Cue phrases in sentences (Edmundson '69)
- Bayesian classifier (Kupiec et al. '95)

Word frequencies (Luhn '58)

- The very first work in automated summarization.
- Claim: words which are frequent in a document indicate the topic discussed.
  - Frequent words indicate the topic.
  - Clusters of frequent words indicate summarizing sentences.
- Stemming should be used.
- "Stop words" (e.g. "the", "a", "for", "is") are ignored.

Word frequencies (Luhn '58)

- Calculate term frequency in the document: f(term)
- Calculate inverse log-frequency in the corpus: if(term)
- Words with high f(term) * if(term) are indicative.
- The sentence with the highest sum of weights is chosen.
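A minimal sketch of this scoring scheme. The slides do not give the exact formula for if(term), so a smoothed log of corpus document frequency is assumed here; the stop-word list and tokenizer are illustrative:

```python
import math
import re

STOP_WORDS = {"the", "a", "for", "is", "he", "and", "in", "of", "to"}

def tokenize(text):
    """Lowercase, split into words, drop stop words (no stemming here)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def score_sentences(document, corpus):
    """Rank sentences by the sum of f(term) * if(term) over their words."""
    sentences = [s for s in re.split(r"[.!?]\s*", document) if s]
    doc_words = tokenize(document)
    f = {w: doc_words.count(w) for w in doc_words}   # f(term): frequency in document
    n = len(corpus)

    def if_(term):                                    # if(term): assumed smoothed inverse log-frequency
        df = sum(1 for d in corpus if term in tokenize(d))
        return math.log((n + 1) / (df + 1)) + 1

    return sorted(sentences,
                  key=lambda s: sum(f.get(w, 0) * if_(w) for w in tokenize(s)),
                  reverse=True)
```

Sentences packed with frequent, corpus-rare words rise to the top, matching Luhn's claim.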


Position in the text (Edmundson '69, Lin & Hovy '97)

- Claim: important sentences occur in specific positions.
- Position depends on the type (genre) of text:
  - The inverse of position in the document works well for news.
  - Important information occurs in specific sections of the document (introduction/conclusion).
- Assign a score to sentences according to location in the paragraph.
- Assign a score to paragraphs and sentences according to location in the entire text.
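The inverse-of-position heuristic for news can be sketched as follows (the slide gives the idea, not a formula, so this normalization is an assumption):

```python
def position_score(index):
    """Inverse-of-position weighting: earlier sentences score higher.
    index is the sentence's 0-based position in the document."""
    return 1.0 / (index + 1)
```

For a news article this gives the lead sentence the top score, with a rapid fall-off afterwards.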


Title method (Edmundson '69)

- Claim: the title of a document indicates its content (Duh!).
- Words in the title help find relevant content:
  - Create a list of title words and remove "stop words".
  - Use these as keywords to find important sentences.
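The title method reduces to keyword overlap; a sketch, with an illustrative stop-word list:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "for", "is", "to", "and"}

def title_keywords(title):
    """Title words minus stop words serve as the keyword list."""
    return set(re.findall(r"[a-z]+", title.lower())) - STOP_WORDS

def title_score(sentence, keywords):
    """Score a sentence by how many title keywords it contains."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return len(words & keywords)
```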

Cue phrases method (Edmundson '69)

- Claim: important sentences contain cue words / indicative phrases:
  - "The main aim of the present paper is to describe..." (IND)
  - "The purpose of this article is to review..." (IND)
  - "In this report, we outline..." (IND)
  - "Our investigation has shown that..." (INF)
- Some words are considered bonus, others stigma:
  - Bonus: comparatives, superlatives, conclusive expressions, etc.
  - Stigma: negatives, pronouns, etc.
- Implemented for French (Lehman '97).
- Paice implemented a dictionary of <cue, weight> pairs.
- Grammar for indicative expressions:
  - In + skip(0) + this + skip(2) + paper + skip(0) + we + ...
- Cue words can be learned (Teufel '98).
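A tiny sketch of a Paice-style <cue, weight> dictionary: bonus words carry positive weight, stigma words negative. All entries and weights below are illustrative assumptions, not values from the cited work:

```python
# Hypothetical cue dictionary: bonus terms positive, stigma terms negative.
CUE_WEIGHTS = {
    "significant": 3, "best": 3, "conclusion": 2, "purpose": 2,  # bonus
    "hardly": -2, "impossible": -2, "they": -1,                  # stigma
}

def cue_score(sentence):
    """Sum the cue weights of the words appearing in the sentence."""
    return sum(CUE_WEIGHTS.get(w, 0) for w in sentence.lower().split())
```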

Feature combination (Edmundson '69)

- Linear contribution of 4 features: title, cue, keyword, position.
- The weights are adjusted using training data with any minimization technique.
- In the reported results, the best system combined cue + title + position.
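The linear combination itself is one line; a sketch (feature values and weights here are placeholders — in Edmundson's method the weights come from training data):

```python
FEATURES = ("title", "cue", "keyword", "position")

def edmundson_score(features, weights):
    """Linear combination of the four Edmundson features for one sentence.
    Both dicts are keyed by feature name; missing features count as 0."""
    return sum(weights[f] * features.get(f, 0.0) for f in FEATURES)
```

Dropping the keyword feature (weight 0) reproduces the best-performing cue + title + position system.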


Bayesian classifier (Kupiec et al. '95)

- Uses a Bayesian classifier. Assuming statistical independence of the features F1, ..., Fk:

  P(s ∈ S | F1, ..., Fk) = P(s ∈ S) · Π_j P(Fj | s ∈ S) / Π_j P(Fj)

- Higher-probability sentences are chosen to be in the summary.
- Performance: for 25% summaries, 84% precision.
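The naive-Bayes computation above can be sketched directly; the probability tables would be estimated from a training corpus of documents with gold summaries (the numbers in the usage below are made up):

```python
def kupiec_probability(features, p_feature_given_summary, p_feature, p_summary):
    """P(s in summary | F1..Fk) under the naive independence assumption:
    P(s) * prod_j P(Fj | s in S) / prod_j P(Fj)."""
    p = p_summary
    for f in features:
        p *= p_feature_given_summary[f] / p_feature[f]
    return p
```

For example, if a quarter of sentences are summary sentences (p_summary = 0.25) and the cue feature is twice as likely in summary sentences as overall, observing it doubles the posterior.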

Methods

- Statistical scoring methods — problems:
  - Synonymy: one concept can be expressed by different words.
    Example: "cycle" and "bicycle" refer to the same kind of vehicle.
  - Polysemy: one word can have several meanings.
    Example: "cycle" could mean life cycle or bicycle.
  - Phrases: a phrase may have a meaning different from the words in it.
    Example: an alleged murderer is not a murderer (Lin and Hovy, 1997).
- Higher semantic/syntactic structures
- Network (graph) based methods
- Other methods (rhetorical analysis, lexical chains, co-reference chains)
- AI methods

Higher semantic/syntactic structures

- Claim: important sentences/paragraphs are the most highly connected entities in more or less elaborate semantic structures.
- Classes of approaches:
  - lexical similarity (WordNet, lexical chains);
  - word co-occurrences;
  - co-reference;
  - combinations of the above.



Lexical chain

- Lexical cohesion (Halliday and Hasan):
  - reiteration
  - synonymy
  - antonymy
  - hypernymy
  - collocation
  - co-occurrence
- Example (Persian): "He works as a teacher at the school" — "teacher" and "school" cohere.
- Lexical chain: a sequence of words which have lexical cohesion (reiteration/collocation).


Lexical chain

- Method for creating chains:
  1. Select a set of candidate words from the text.
  2. For each candidate word, find an appropriate chain, relying on a relatedness criterion among members of the chains and the candidate word.
  3. If such a chain is found, insert the word in this chain and update it accordingly; else create a new chain.
- Scoring the chains: synonym = 10, antonym = 7, hyponym = 4.
- Strong chains are selected.
- Sentence selection for the summary:
  - H1: select the first sentence that contains a member of a strong chain.
    Example chain: AI=2; Artificial Intelligence=1; Field=7; Technology=1; Science=1.
  - H2: select the first sentence that contains a "representative" (by frequency) member of the chain.
  - H3: identify a text segment where the chain is highly dense (density is the proportion of words in the segment that belong to the chain).
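The chain-building loop above can be sketched as follows. The relatedness test here is a stand-in (a tiny hand-made relation table); a real system would consult WordNet:

```python
# Illustrative relation table and the slide's relation weights.
RELATED = {
    ("machine", "device"): "synonym",
    ("machine", "computer"): "hyponym",
}
WEIGHTS = {"synonym": 10, "antonym": 7, "hyponym": 4}

def relation(a, b):
    return RELATED.get((a, b)) or RELATED.get((b, a))

def build_chains(candidate_words):
    """Insert each candidate into the first related chain, else start a new one."""
    chains = []                          # each chain is a list of words
    for word in candidate_words:
        for chain in chains:
            if any(relation(word, member) for member in chain):
                chain.append(word)       # found a related chain: insert
                break
        else:
            chains.append([word])        # no related chain: create a new one
    return chains

def chain_score(chain):
    """Sum relation weights over member pairs; strong chains score high."""
    return sum(WEIGHTS.get(relation(a, b), 0)
               for i, a in enumerate(chain) for b in chain[i + 1:])
```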


Lexical chain

Example: "Mr. Kenny is the person that invented the anesthetic machine which uses micro-computers to control the rate at which an anesthetic is pumped into the blood. Such machines are nothing new. But his device uses two micro-computers to achieve much closer monitoring of the pump feeding the anesthetic into the patient."

Network based method (Salton et al. '97)

- Vector space model: each text unit is represented as a vector.
- Standard (cosine) similarity metric.
- Construct a graph of paragraphs or other entities; the strength of a link is the similarity metric.
- Use a threshold to decide upon similar paragraphs or entities (pruning of the graph).
- Paragraph selection heuristics:
  - Bushy path: select paragraphs with many connections to other paragraphs and present them in text order.
  - Depth-first path: select one paragraph with many connections; select a connected paragraph (in text order) which is also well connected; continue.
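A sketch of the graph construction and the bushy-path heuristic, with paragraphs as sparse term-weight dicts (the threshold and vectors in the usage are illustrative):

```python
import math

def cosine(u, v):
    """Standard vector-space similarity between two term-weight dicts."""
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def bushy_path(vectors, threshold, k):
    """Link paragraphs whose similarity exceeds the threshold, pick the k
    best-connected ones, and return their indices in text order."""
    n = len(vectors)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) > threshold:
                degree[i] += 1
                degree[j] += 1
    best = sorted(range(n), key=lambda i: degree[i], reverse=True)[:k]
    return sorted(best)   # present selected paragraphs in text order
```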

Text relation map

(Figure: a similarity graph over paragraphs A-F; links are kept where sim > threshold and dropped where sim < threshold. Resulting connection counts: A=3, B=1, C=2, D=1, E=3, F=2.)

Motivation

- Summaries which are generic in nature do not cater to the user's background and interests.
- Results show that each person has a different perspective on the same text:
  - Marcu 1997: found the percent agreement of 13 judges over 5 texts from Scientific American to be 71 percent.
  - Rath 1961: found that extracts selected by four different human judges had only 25 percent overlap.
  - Salton 1997: found that the most important 20 paragraphs extracted by 2 subjects had only 46 percent overlap.


User feedback

- Data click: when a user clicks on a document, the document is considered to be of more interest to the user than other unclicked ones.
- Query history: the most widely used implicit user feedback at present.
  Example: http://www.google.com/psearch
- Attention time: often referred to as display time or reading time.
- Other types of implicit user feedback: scrolling, annotation, bookmarking and printing behaviors.

Summarization using data click

- Use the extra knowledge in the clickthrough data to improve Web-page summarization.
- A collection of clickthrough data can be represented by a set of triples <u, q, p>: user u issued query q and clicked page p.
- Typically, a user's query words reflect the true meaning of the target Web-page content.
- Problems:
  - the incomplete click problem
  - noisy click data

Attention time

- Main idea: rely on the attention (reading) time individual users spend on single words in a document.
- The prediction of user attention over every word in a document is based on the user's attention during his previous reads.
- The algorithm tracks a user's attention times over individual words using a vision-based commodity eye-tracking mechanism.
- It uses a simple web camera and an existing eye-tracking algorithm (the Opengazer project).
- The error of the detected gaze location on the screen is between 1 and 2 cm, depending on which area of the screen the user is looking at (on a 19" monitor).

Attention time

- Anchoring gaze samples onto individual words:
  - The detected gaze central point is positioned at (x, y) in screen space.
  - Compute the central display point of each word, denoted (xi, yi).
  - For each gaze sample detected by the eye-tracking module, assign the gaze sample to the words in the document in this manner.
  - The overall attention that a word in the document receives is the sum of all the fractional gaze samples it is assigned in the above process.
  - During processing, remove the stop words.
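The fractional assignment can be sketched as follows. The slide's exact weighting was not preserved, so a Gaussian fall-off with distance from the gaze point to each word's center (xi, yi) is assumed here, and sigma is an arbitrary illustrative value:

```python
import math

def assign_gaze(gaze_xy, word_centers, sigma=30.0):
    """Split one gaze sample fractionally over words, weighting each word by
    an assumed Gaussian fall-off of its distance to the gaze point."""
    x, y = gaze_xy
    weights = [math.exp(-((x - xi) ** 2 + (y - yi) ** 2) / (2 * sigma ** 2))
               for xi, yi in word_centers]
    total = sum(weights)
    return [w / total for w in weights] if total else weights
```

Summing a word's fractions over all gaze samples gives its overall attention, as described above.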


Attention time

- Attention time prediction for an unseen word is based on the semantic similarity of words.
- For an arbitrary word w which is not among the observed words w1, ..., wn, calculate the similarity between w and every wi (i = 1, ..., n).
- Select the k words which share the highest semantic similarity with w.
- Predicting user attention for sentences: aggregate the predicted attention of the words they contain.
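The k-nearest-neighbour prediction step can be sketched as follows; the slide's exact combination formula was not preserved, so a similarity-weighted average is assumed, and the similarity function is supplied by the caller:

```python
def predict_attention(word, observed, similarity, k=3):
    """Predict attention time for an unseen word as the similarity-weighted
    average over the k observed words most similar to it (assumed formula).
    observed maps word -> measured attention time."""
    neighbours = sorted(observed, key=lambda w: similarity(word, w), reverse=True)[:k]
    weights = [similarity(word, w) for w in neighbours]
    if not neighbours or sum(weights) == 0:
        return 0.0
    return sum(wt * observed[w] for wt, w in zip(weights, neighbours)) / sum(weights)
```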

Other types of implicit user feedback

- Extract personal information about the user from information available on the web:
  - Put the person's full name into a search engine (the name is quoted with double quotation marks, such as "Albert Einstein").
  - The n top documents are taken and retrieved.
  - After removal of stop words and stemming, a unigram language model is learned on the extracted text content.
- User-specific sentence scoring: each sentence receives a score from the learned language model.
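A sketch of the learned unigram model and a plausible sentence score. The slide's exact scoring formula was not preserved, so the average model probability of the sentence's words is assumed here:

```python
from collections import Counter

def unigram_model(text, stop_words=frozenset()):
    """Learn a unigram language model from the text retrieved for the user
    (stop-word removal and stemming are assumed done upstream)."""
    words = [w for w in text.lower().split() if w not in stop_words]
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sentence_score(sentence, model):
    """Assumed user-specific score: average model probability of the words."""
    words = sentence.lower().split()
    return sum(model.get(w, 0.0) for w in words) / len(words) if words else 0.0
```

Sentences rich in words the user's web presence emphasizes score higher, which drives the personalized examples that follow.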


Other types of implicit user feedback

- Example:
  - The topic of summary generation is "Microsoft to open research lab in India".
  - 8 articles published in different news sources form the news cluster.
  - User A is from the NLP domain and User B from the network security domain.
- Generic summary:

  "The new lab, called Microsoft Research India, goes online in January, and will be part of a network of five research labs that Microsoft runs worldwide, said Padmanabhan Anandan, managing director of Microsoft Research India. Microsoft's mission India, formally inaugurated Jan. 12, 2005, is Microsoft's third basic research facility established outside the United States. In line with Microsoft's research strategy worldwide, the Bangalore lab will collaborate with and fund research at key educational institutions in India, such as the Indian Institutes of Technology, Anandan said. Although Microsoft Research doesn't engage in product development itself, technologies researchers create can make their way into the products the company


Other types of implicit user feedback

- User A specific summary:

  "The new lab, called Microsoft Research India, goes online in January, and will be part of a network of five research labs that Microsoft runs worldwide, said Padmanabhan Anandan, managing director of Microsoft Research India. Microsoft's mission India, formally inaugurated Jan. 12, 2005, is Microsoft's third basic research facility established outside the United States. Microsoft will collaborate with the government of India and the Indian scientific community to conduct research in Indic language computing technologies; this will include areas such as machine translation between Indian languages and English, search and browsing, and character recognition. In line with Microsoft's research strategy worldwide, the Bangalore lab



Other types of implicit user feedback

- User B specific summary:

  "The new lab, called Microsoft Research India, goes online in January, and will be part of a network of five research labs that Microsoft runs worldwide, said Padmanabhan Anandan, managing director of Microsoft Research India. The newly announced India research group focuses on cryptography, security, algorithms and multimedia security; Ramarathnam Venkatesan, a leading cryptographer at Microsoft Research in Redmond, Washington, in the US, will head the new group. Microsoft Research India will conduct a four-week summer school featuring lectures by leading experts in the fields of cryptography, algorithms and security. The program is aimed at senior undergraduate students, graduate students and faculty."

FarsiSum: A Persian text summarizer

By: Nima Mazdak, Martin Hassel
Department of Linguistics, Stockholm University, 2004

FarsiSum

- Tokenizer: sentence boundaries are found by searching for periods, exclamation marks, question marks, <BR> (the HTML new line) and the Persian question mark (؟); tokens are delimited by ".", ",", "!", "?", "<", ">", ":", spaces, tabs and new lines.
- Sentence scoring: text lines are put into a key/value data structure called the text table.

FarsiSum

- Sentence scoring:
  - Word score = (word frequency) * (a keyword constant)
  - Sentence score = Σ word score (over all words in the current sentence)
  - Average sentence length (ASL) = word count / line count
  - Sentence score = (ASL * sentence score) / (number of words in the current sentence)
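The four formulas above compose as follows. This sketch applies a single uniform keyword constant; the real system weights keywords differently, and tokenization here is plain whitespace splitting:

```python
def farsisum_scores(sentences, keyword_constant=1.0):
    """Score sentences with the slide's FarsiSum formulas:
    word score = freq * constant; sentence score = sum of word scores;
    final score = (ASL * sentence score) / (words in sentence)."""
    words = [w for s in sentences for w in s.split()]
    freq = {w: words.count(w) for w in words}
    asl = len(words) / len(sentences)            # average sentence length
    scores = []
    for s in sentences:
        sw = s.split()
        raw = sum(freq[w] * keyword_constant for w in sw)
        scores.append(asl * raw / len(sw))       # length-normalized final score
    return scores
```

The ASL normalization keeps long sentences from winning on word count alone.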


FarsiSum

- Notes on the current implementation:
  - Word boundary ambiguity: a full stop (.) marks a sentence boundary, but it may also appear in abbreviations or acronyms.
  - Compound words and light verb constructions may also appear with or without a space.
  - Ambiguity in morphology.
  - Word order: the canonical word order in Persian is SOV, but Persian is a free word order language.
  - Possessive construction.

Thanks