
Cover Coefficient based
Multidocument Summarization

CS 533 Information Retrieval Systems











Özlem İSTEK
Gönenç ERCAN
Nagehan PALA


Outline


Summarization


History and Related Work


Multidocument Summarization (MS)


Our Approach: MS via C³M


Datasets


Evaluation


Conclusion and Future Work


References

Summarization


Information overload problem


Increasing need for IR and automated
text summarization systems


Summarization: Process of distilling
the most salient information from a
source/sources for a particular user
and task

Steps for Summarization


Transform text into an internal representation.


Detect important text units.


Generate the summary.


In extracts: no generation, but information ordering and anaphora resolution (or avoiding anaphoric structures).


In abstracts: text generation. Sentence fusion, paraphrasing, Natural Language Generation.

Summarization Techniques


Surface level: Shallow features


Term frequency statistics, position in text, presence of text
from the title, cue words/phrases: e.g. “in summary”,
“important”


Entity level: Model text entities and their relationship


Vocabulary overlap, distance between text units, co-occurrence, syntactic structure, coreference


Discourse level: Model global structure of text


Document outlines, narrative structure


Hybrid

History and Related Work


In the 1950s: first systems, surface-level approaches


Term frequency (Luhn, Rath)


In the 1960s: first entity-level approaches


Syntactic analysis


Surface Level: Location features (Edmundson 1969)


In the 1970s:


Surface Level: Cue phrases (Pollock and Zamora)


Entity Level


First discourse level: story grammars


In the 1980s:



Entity level (AI): use of scripts, logic and production rules, semantic
networks (Dejong 1982, Fum et al. 1985)


Hybrid (Aretoulaki 1994)


From the 1990s on: explosion of approaches at all levels


Multidocument Summarization (MS)


Multiple source documents about a single topic or an event.


Application-oriented task, such as:


News portals presenting articles from different sources


Corporate emails organized by subject


Medical reports about a patient


Some real-life systems:


Newsblaster, NewsInEssence, NewsFeed Researcher


Our Focus


Multiple document summarization


Building extracts for a topic


Sentence selection (surface level)

Term Frequency and Summarization


Salient: obvious, noticeable.


Salient sentences should have more terms in common with other sentences.


Two sentences are talking about the same fact if they share many common terms (repetition).


Select salient sentences, but keep inter-sentence similarity low.

C³M vs. CC Summarization

CC Summarization:

Aim: select the most representative sentences while avoiding redundancy in the summary.

Uses a sentence-by-term matrix.

Creates a sentence-by-sentence C matrix.

Calculates the number of summary sentences using a compression percentage (e.g., 10%).

Summary power function: s_i = ψ_i = 1 − c_ii (the coupling coefficient of sentence i).

Selects the sentences with the highest s_i values that are dissimilar to already selected sentences.

C³M Clustering:

Aim: cluster documents.

Uses a document-by-term matrix.

Creates a document-by-document C matrix.

Calculates the number of clusters.

Seed power function: p_i = δ_i × ψ_i × x_di.

Selects seed documents with the highest p_i.
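The C-matrix construction above can be sketched in Python. This is a minimal sketch assuming the standard C3M cover-coefficient definition from the cited Can and Özkarahan paper (c_ij = α_i Σ_k t_ik β_k t_jk, with α_i and β_k the reciprocal row and column sums of the term matrix); the toy sentence-by-term matrix T is hypothetical, not the data behind the slides' example.

```python
import numpy as np

def cover_coefficient_matrix(T):
    """C3M cover-coefficient matrix (per the cited Can & Ozkarahan paper):
    c_ij = alpha_i * sum_k t_ik * beta_k * t_jk, where alpha_i and beta_k
    are the reciprocals of row sum i and column sum k of T."""
    T = np.asarray(T, dtype=float)
    alpha = 1.0 / T.sum(axis=1)  # row-sum reciprocals
    beta = 1.0 / T.sum(axis=0)   # column-sum reciprocals
    return (alpha[:, None] * T * beta[None, :]) @ T.T

# Hypothetical 5-sentence x 6-term incidence matrix (not the slides' data).
T = np.array([
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0, 1],
])
C = cover_coefficient_matrix(T)
s = 1 - np.diag(C)  # summary power of each sentence
```

By construction, every row of C sums to 1, so each c_ii lies in (0, 1] and s_i measures how much of sentence i is covered by the other sentences.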

An Example

















C =
| 0.361  0.250  0.194  0.111  0.083 |
| 0.188  0.563  0.063  0.000  0.188 |
| 0.194  0.083  0.361  0.277  0.083 |
| 0.167  0.000  0.417  0.417  0.000 |
| 0.125  0.375  0.125  0.000  0.375 |

Sorted s_i values:

s3 ==> 0.64

s1 ==> 0.64

s5 ==> 0.63

s4 ==> 0.58

s2 ==> 0.44

Let's form a summary of 3 sentences!

An Example (Step 1)

















(C matrix and sorted s_i values repeated from the previous slide.)

Summary sentences: s3

Choose the sentence that is most similar to the others.

An Example (Step 2)

















(C matrix and sorted s_i values repeated from the previous slide.)

Summary sentences: s3, s1

s1 is next according to the s_i values. Check whether s1 is too similar to s3, which is already in the summary; include s1 only if s3 does not cover it.

AC3 = (c31 + c32 + c34 + c35) / (5 - 1)

AC3 = (1 - c33) / (5 - 1)

AC3 = 0.35

(c31 = 0.194) < (AC3 = 0.35), so s1 is included.

An Example (Step 3)

















(C matrix and sorted s_i values repeated from the previous slide.)

Summary sentences: s3, s1, s5

s5 is next.

Check against s3: (c35 = 0.083) < (AC3 = 0.35), OK.

Check against s1: (c15 = 0.083) < (AC1 = 0.16), OK.
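The three example steps can be sketched as a greedy loop over the C matrix. A hedged sketch, with two caveats: the slides' AC3 value (0.35) differs from what the stated formula (1 − c33)/(5 − 1) yields (about 0.16), and tie-breaking between equal s_i values is unspecified, so the sentences this sketch picks need not match the slides' choice of s3, s1, s5.

```python
import numpy as np

# 5x5 cover-coefficient matrix as reconstructed from the example slides
# (transcription assumed; each row sums to roughly 1).
C = np.array([
    [0.361, 0.250, 0.194, 0.111, 0.083],
    [0.188, 0.563, 0.063, 0.000, 0.188],
    [0.194, 0.083, 0.361, 0.277, 0.083],
    [0.167, 0.000, 0.417, 0.417, 0.000],
    [0.125, 0.375, 0.125, 0.000, 0.375],
])

def select_summary(C, k):
    """Greedy CC-based selection: visit sentences in decreasing
    s_i = 1 - c_ii order and keep a candidate j only if no already
    selected sentence i covers it, i.e. C[i, j] < AC_i."""
    n = C.shape[0]
    s = 1 - np.diag(C)
    AC = (1 - np.diag(C)) / (n - 1)  # average off-diagonal cover of row i
    chosen = []
    for j in sorted(range(n), key=lambda j: -s[j]):
        if all(C[i, j] < AC[i] for i in chosen):
            chosen.append(j)
        if len(chosen) == k:
            break
    return [f"s{j + 1}" for j in chosen]

summary = select_summary(C, 3)
```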

Some Possible Improvements


Integrate position of the sentence in its
source document.


Effect of stemming


Effect of stop-word list


Integrating the time of the source document(s) (no promises).


Integrating Position Feature


The probability distribution for α_i is a normal distribution in C³M.


Use a probability distribution where sentences that appear in the first paragraphs are more probable.
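One way the proposed position bias could look, purely as an illustration (the slides give no formula; the geometric decay and its rate are assumptions): replace the flat prior over sentence positions with a decaying one.

```python
import numpy as np

def position_weights(n_sentences, decay=0.5):
    """Hypothetical position prior: geometric decay, so sentences near
    the beginning of a document get more probability mass. The decay
    rate is an illustrative choice, not a value from the slides."""
    w = decay ** np.arange(n_sentences)
    return w / w.sum()  # normalize to a probability distribution

w = position_weights(5)
```

Such weights could multiply the s_i scores so that, between equally salient sentences, the one appearing earlier in its source document is preferred.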

Datasets


We will use two datasets.


DUC (Document Understanding Conferences) dataset for English multidocument summarization.


Turkish New Event Detection and Tracking dataset for Turkish multidocument summarization.

Evaluation

Two methods for evaluation:

1. We will use this method for English multidocument summarization.

The overlap between the model summaries, which are prepared by human judges, and the system-generated summary gives the accuracy of the summary.


ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the official scoring technique for the Document Understanding Conference (DUC) 2004.


ROUGE uses different measures. ROUGE-N uses N-grams to measure the overlap. ROUGE-L uses the Longest Common Subsequence. ROUGE-W uses the Weighted Longest Common Subsequence.
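A simplified sketch of the ROUGE-N idea, n-gram recall against a single model summary; the official ROUGE package additionally handles multiple references, stemming, and stop-word removal.

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    """ROUGE-N as n-gram recall: number of n-grams the candidate shares
    with the model (reference) summary, divided by the model's total
    n-gram count. Whitespace tokenization is a simplification."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(cand[g], ref[g]) for g in ref)
    total = sum(ref.values())
    return overlap / total if total else 0.0

score = rouge_n("the cat sat on the mat", "the cat lay on the mat")
```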


Evaluation

2. We will use this method for Turkish multidocument summarization.

We will add the extracted summaries as new documents.

Then, we will select these summary documents as the centroids of the clusters.

Then, a centroid-based clustering algorithm is used for clustering.

If the documents are attracted by their centroids, which are the summaries of those documents, then we can say our summarization approach is good.




Evaluation

Summary documents are selected as the centroids.

c1 is the summary of d1, d2 and d3.

c2 is the summary of d4, d5, d6 and d7.
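The centroid-attraction check can be sketched as follows. The document and centroid term vectors here are toy values, and cosine similarity is one plausible attraction measure; the slides do not fix one.

```python
import numpy as np

def attraction_accuracy(doc_vecs, centroids, own):
    """Fraction of documents whose nearest centroid (by cosine
    similarity) is the summary built from their own group -- the
    proposed quality signal for the Turkish evaluation."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    hits = sum(
        int(np.argmax([cos(d, c) for c in centroids]) == g)
        for d, g in zip(doc_vecs, own)
    )
    return hits / len(doc_vecs)

# Toy term vectors: two groups of documents and their summary centroids.
docs = np.array([[3, 1, 0], [2, 2, 0], [0, 1, 3], [1, 0, 4]], dtype=float)
cents = np.array([[2, 1, 0], [0, 1, 2]], dtype=float)
acc = attraction_accuracy(docs, cents, own=[0, 0, 1, 1])
```

An accuracy near 1.0 would mean each document is attracted by its own summary centroid, the success criterion described above.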

Conclusion and Future
Work


Multidocument summarization using cover coefficients of sentences is an intuitive and, to our knowledge, new approach.


This situation has its own advantages and disadvantages. We are having fun because it is new; we are anxious because we have not seen any resulting summaries yet.



Conclusion and Future
Work


After implementing CC-based summarization, we can try different methods on the same multidocument set.


First method:


A sentence-by-term matrix can be formed from all sentences of all documents.


Then, CC-based summarization can be applied.


Conclusion and Future
Work


Second method:


Cluster the documents using C3M.


Then, apply the first method to each cluster.


Combine the extracted summaries of each cluster to form
one summary.



Third method:


Summarize each document by applying the first method. The only difference is that sentence-by-term matrices are constructed for the sentences of each document.


Then, take the summaries of the documents as documents and apply the first approach.


References


Can, F., and Özkarahan, E. A. Concepts and Effectiveness of the Cover-Coefficient-Based Clustering Methodology for Text Databases, ACM Transactions on Database Systems, 15, 4 (1990)


Lin, C. Y. ROUGE: A Package for Automatic Evaluation of Summaries


H. Luhn, The Automatic Creation of Literature Abstracts


G. J. Rath, A. Resnick and T. R. Savage, The Formation of Abstracts by the Selection of Sentences


H.P. Edmundson, New Methods in Automatic Extracting


J. J. Pollock and A. Zamora, Automatic Abstracting Research at Chemical Abstracts Service


T. A. van Dijk, Macrostructures: An Interdisciplinary Study of Global Structures in Discourse, Interaction and Cognition


G. F. Dejong, An Overview of the FRUMP System


D. Fum, G. Guida and C.Tasso, Evaluating Importance: A Step
Towards Text Summarization


M. Aretoulaki, Towards a Hybrid Abstract Generation System



Questions















Thank you.