VV-SUM-JHU-Nov17-201..


Oct 2, 2013


A summarization Journey



From Extraction to Abstraction


Vasudeva Varma

www.iiit.ac.in/~vasu

About IIITH -> LTRC -> SIEL


IIIT Hyderabad is a 12 year young Research University

Research that makes a difference to society and industry

It was set up as a not-for-profit public private partnership (NPPP) and is the first IIIT to be set up (under this model) in India.



IIIT-H is organized as research centers and labs, not departments

Providing multi-disciplinary teams to tackle research problems

IIIT-H has a large research group in the country working in NLP, Speech and Computer Vision



Combines pioneering research with top class education



IIIT-H faculty have won many awards including document summarization, Robocup and eGovernance, Stockholm …


Summarization Journey: Extraction to Abstraction Vasudeva Varma




Research Centres/Labs


Technology


Communications (CRC)


Data Engineering (CDE)


Languages Technologies (LTRC)



Natural Language Processing & Machine Translation (NLP-MT)



Search and Information Extraction (SIEL)



Speech



Anusaaraka


Robotics (RRC)


Security, Theory and Algorithms (C-STAR)


Software Engineering (SERL)


Visual Information Technology (CVIT)


VLSI and Embedded System (C-VEST)


Compilers (CL)





Research Centres/Labs (Contd..)


Domains


Agriculture and Rural Development (ARD)


Building Science (CBS)


Cognitive Science (CS)


Computational Linguistics (See under LTRC)


Computational Natural Sciences and Bioinformatics (CCNSB)


Earthquake Engineering (EERC)


Education (cITe)


Education Technology and Learning Sciences (CETLS)


Exact Humanities (CEH)


Power Systems (PSRC)


Spatial Informatics (LSI)


Development Centers


Engineering Technology and Innovation Centre (ENTICE)


Innovation and Entrepreneurship (CIE)


Open Software (COS)


Societal and Human Applications of Artificial Intelligence (SAHAAI)





About IIITH -> LTRC -> SIEL

About Language Technologies Research Center (LTRC)

One of the largest groups in South Asia working on NLP (about 175 researchers)


Four labs


Core NLP/Machine Translation


Speech


Search and Information Extraction



Anusaaraka



Synergy within the centre


Closely working with various other centers and groups






About IIITH -> LTRC -> SIEL


Industry focus


Technology transfers to Amazon.com, Nokia, TCS, ADRIN, Department of Space, Intel, Rediff.com, Zicorp

Government funding: DST, MCIT, Dept of Space

Industry funding: Amazon.com, AOL, TCS, Yahoo, Nokia, Rediff.com, Intel, several start-ups

Major achievements:

#1 in Automatic summarization system (DUC-2006, 2007)

#1 in Squishy QA task (TAC-2008)

#1 in Knowledge Base Population Task (TAC-2009)

#1 in guided summarization (TAC 2010)

India’s first cross language search engine

Only academic group from India to present in WWW developer track


First team from India to participate in CLEF, DUC, TAC




About IIITH -> LTRC -> SIEL


Major Research areas


Summarization


Cross Language Information Access


Indian Language Search


Enterprise Search


Question Answering


Semantic Web


Distributed and Large Scale IR


Cloud Computing


Computational Advertising


Published in: WWW, ACL, SIGIR, CIKM, ECIR, OOPSLA, CICLing, IJCNLP, COLING, NAACL, RANLP, ....





Information Overload

Explosive growth of information on web

Failure of information retrieval systems to satisfy user’s information need

Need for sophisticated information access solutions




Summarization



A summary is a condensed version of a source document having a recognizable genre and a very specific purpose: to give the reader an exact and concise idea of the contents of the source.

Text interpretation

Extraction of Relevant information

Condensing Extracted Information

Summary Generation




Summaries Can Help!




Flavors of Summarization

Progressive

Single document

Query Focused MDS

Opinion/Sentiment

Code

Comparative

Guided

Personalized

Cross Language

Squishy Question Answering

Query Independent MDS


Towards Abstraction

Personalized, Cross Lingual Summarization

Guided Summarization

Code Summarization

Comparison Summarization

Blog Summarization

Progressive Summarization

Abstractive

Single Document, Query Focused Multi Document Summarization




Underlying Technology




EXTRACTIVE SUMMARIZERS




Single document summarization

Document

Text Normalization

Sentence Marker

Logical Analysis

Parsing of sentences

Document graph generation

Text Analysis

Sentence transformation rules

Graph clustering into topics

Graph scoring

Sentence selection

Summary Generation


Graph Based Approach

Logical Analysis and Graph clustering

The system generates the summary by understanding the logical structure of the sentence.

Scoring of nodes and relations is based on how central they are to the whole document.

Able to identify important sentences more accurately than other statistical techniques.
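
The idea of scoring nodes by centrality can be illustrated with a small sketch. It is not the system described on this slide: it assumes a much simpler graph in which every sentence is a node and edge weights are bag-of-words cosine similarities, with no logical analysis or topic clustering. The function names (`centrality_scores`, `extract_summary`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase word tokens; a simple stand-in for real text normalization."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def centrality_scores(sentences):
    """Score each sentence node by the total weight of its edges to all other nodes."""
    bags = [Counter(tokenize(s)) for s in sentences]
    scores = []
    for i, bag in enumerate(bags):
        scores.append(sum(cosine(bag, other) for j, other in enumerate(bags) if j != i))
    return scores


def extract_summary(sentences, k=2):
    """Pick the k most central sentences and return them in document order."""
    scores = centrality_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no vocabulary with the rest of the document receives a score of zero and is the first to be dropped.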





Query Focused Summarization


Documents should be ranked in order of probability of relevance to the request or information need, as calculated from whatever evidence is available to the system

Query Dependent ranking: Relevance Based Language models

Language models (PHAL)

Query Independent ranking: Sentence Prior







RBLM is an IR approach that computes the conditional probabilities of relevance from document and query

Overcomes the problem of sparseness of document LM

PHAL: probabilistic extension to HAL spaces

HAL constructs dependencies of a term w on other terms based on their occurrence in its context in the corpus
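
A HAL space of this kind can be sketched as a sliding window over the token stream. The sketch below makes two assumptions that the slide does not state: the window looks only at preceding words, with closer words weighted higher, and the probabilistic step is plain row normalization. The actual PHAL formulation may normalize differently. `build_hal` and `normalize_rows` are hypothetical names.

```python
from collections import defaultdict


def build_hal(tokens, window=3):
    """Build a HAL-style co-occurrence table.

    hal[w][c] accumulates a weight each time c appears within `window`
    positions before w; closer neighbours get a larger weight.
    """
    hal = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for dist in range(1, window + 1):
            j = i - dist
            if j < 0:
                break
            hal[w][tokens[j]] += window - dist + 1
    return hal


def normalize_rows(hal):
    """Turn each non-empty row into a distribution P(c | w).

    Assumption: plain row normalization, used here only for illustration.
    """
    probs = {}
    for w, row in hal.items():
        total = sum(row.values())
        if total == 0:
            continue
        probs[w] = {c: weight / total for c, weight in row.items()}
    return probs
```

With `tokens = "the cat sat on the mat".split()` and `window=2`, the row for `sat` gives weight 2 to `cat` (distance 1) and weight 1 to `the` (distance 2).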






Sentence prior captures importance of sentence explicitly using pseudo relevant documents (Web, Wikipedia)

Based on Domain knowledge, Background Information, Centrality

Log Linear Relevance

Information Measure in a sentence

Entropy is a measure of information contained in a message
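
As a rough illustration of an entropy-style information measure, the sketch below estimates a unigram model from a background corpus and scores a sentence by the average self-information of its words. The unigram model, the `floor` smoothing constant and the function names are assumptions made for this example, not the sentence prior used in the system.

```python
import math
from collections import Counter


def unigram_model(corpus_tokens):
    """Maximum-likelihood unigram probabilities from a background corpus."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def entropy(model):
    """Shannon entropy H = -sum p(w) log2 p(w) of the word distribution, in bits."""
    return -sum(p * math.log2(p) for p in model.values() if p > 0)


def sentence_information(sentence_tokens, model, floor=1e-6):
    """Average self-information -log2 p(w) of the words in a sentence.

    Words missing from the model get the probability `floor` (an assumed
    smoothing constant), so unseen words count as highly informative.
    """
    if not sentence_tokens:
        return 0.0
    bits = [-math.log2(model.get(w, floor)) for w in sentence_tokens]
    return sum(bits) / len(bits)
```

Frequent words contribute few bits and rare words contribute many, so a sentence made of rare words scores higher than one made of common words.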





DUC 2005 and 2006 Performance

38 systems participated in 2006

Significant difference between first two systems

5th Rank in linguistic quality




Extract vs. Abstract Summarization


We conducted a study (2005)


Generated best possible extracts


Calculated the scores for these extracts


Evaluation with respect to the reference summaries


                 Rouge 2    Rouge SU4
Human Answers    0.1025     0.1624
Best Answers     0.09965    0.15407
HAL Feature      0.07618    0.13805
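
For readers unfamiliar with the metric, ROUGE-2 is based on bigram overlap with a reference summary. The sketch below computes a simplified ROUGE-2 recall against a single reference. It assumes pre-tokenized input and omits stemming, multiple references and jackknifing, so it will not reproduce the scores in the table above.

```python
from collections import Counter


def bigrams(tokens):
    """Multiset of adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge_2_recall(candidate_tokens, reference_tokens):
    """Fraction of reference bigrams that also occur in the candidate (clipped counts)."""
    ref = bigrams(reference_tokens)
    cand = bigrams(candidate_tokens)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[bg]) for bg, count in ref.items())
    return overlap / total
```

For the reference "the cat sat on the mat" and the candidate "the cat lay on the mat", three of the five reference bigrams are matched.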




Cross Lingual Summarization


A bridge between CLIR and MT


Extended our mono-lingual summarization framework to a cross-lingual setting in RBLM framework

Designed a cross-lingual experimental setup using DUC 2005 dataset

Experiments were conducted for Telugu-English language pair

Comparison with mono-lingual baseline shows about 90% performance in ROUGE-SU4 and about 85% in ROUGE-2 f-measures





Cross Lingual Summarization




Progressive Summarization


Emerging area of research in summarization

Summarization with a sense of prior knowledge

Introduced as “Update Summarization” at DUC 2007, TAC 2008, TAC 2009

Generate a short summary of a set of newswire articles, under the assumption that the user has already read a given set of earlier articles.

To keep track of news stories, reviews of products




Key challenge

To detect information that is not only relevant but also new given the prior knowledge of reader

Relevant and new vs. Non-Relevant and new vs. Relevant and redundant




Novelty Detection


Identifying sentences containing new information (Novelty Detection) from a cluster of documents is the key to progressive summarization

Shares similarity with the Novelty track at TREC from 2002 to 2004

Task 1: Extract relevant sentences from a set of documents for a topic

Task 2: Eliminate redundant sentences from relevant sentences

Progressive summarization differs in that it produces a summary from novel sentences (requires scoring and ranking)




Three level approach to Novelty Detection

Sentence Scoring: Developing new features that capture novelty along with relevance of a sentence (NF, NW)

Ranking: Sentences are re-ranked based on the amount of novelty they contain (ITSim, CoSim)

Summary Generation: A selected pool of sentences that contain novel facts. All remaining sentences are filtered out
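
The re-ranking and filtering levels can be sketched as follows. This is not an implementation of the NF, NW, ITSim or CoSim features named above, whose definitions are not given here. It assumes a generic novelty measure (one minus the highest cosine similarity to any previously read sentence) and a hypothetical `threshold` parameter.

```python
import math
import re
from collections import Counter


def bag(text):
    """Bag of lowercase word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def novelty(sentence, seen_sentences):
    """1 minus the highest similarity to anything the reader has already seen."""
    s = bag(sentence)
    return 1.0 - max((cosine(s, bag(old)) for old in seen_sentences), default=0.0)


def rerank_by_novelty(candidates, seen_sentences, threshold=0.5):
    """Drop candidates below the novelty threshold, then sort the rest by
    relevance * novelty. `candidates` is a list of (sentence, relevance)."""
    scored = []
    for sentence, relevance in candidates:
        n = novelty(sentence, seen_sentences)
        if n >= threshold:
            scored.append((relevance * n, sentence))
    return [s for _, s in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

A highly relevant sentence that merely repeats the earlier cluster is filtered out, while a less relevant but new sentence survives.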





Evaluations


TAC 2008 Update Summarization data for training: 48 topics

Each topic divided into A, B with 10 documents

Summary for cluster A is normal summary and cluster B is update summary

TAC 2009 Update Summarization data for testing: 44 topics

Baseline summarizer generates summary by picking first 100 words of last document

Run1: DFS + SL1

Run2: PHAL + KL
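
The baseline described above is simple enough to sketch directly. The version below assumes that the documents of a cluster are passed in chronological order, so that the last list entry is the "last document", and that words are separated by whitespace.

```python
def baseline_summary(documents, max_words=100):
    """Return the first `max_words` words of the last document in the cluster.

    Assumes `documents` is ordered chronologically, so the last entry is
    the most recent article.
    """
    if not documents:
        return ""
    words = documents[-1].split()
    return " ".join(words[:max_words])
```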





Personalized Summarization


Perception of text differs with background of the reader


Need of incorporating user background in the summarization process

Summarization not only a function of input text but also the reader

Example: the word “Serve” means different things to a tennis player, a hotel manager and a politician





Web-based profile creation: Personal information available on web: a conference page, a project page, an online paper, or even in a Weblog.

Estimate Model P(w|Mu) to incorporate user in sentence extraction process

Experiments

5 Users, 25-Doc Clusters

Each user was asked to give a relevance score to the summary on a 5-point scale.
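
One way to picture the user model is as a unigram language model estimated from the text of the user's web profile. The sketch below assumes a linear interpolation with a background corpus (weight `lam`) and scores a sentence by its average log-probability under the model. The interpolation scheme, the parameter values and the function names are illustrative assumptions, not the estimation procedure used in the experiments.

```python
import math
from collections import Counter


def estimate_user_model(profile_tokens, background_tokens, lam=0.7):
    """Estimate P(w | M_u) as a mix of profile and background word frequencies.

    `lam` is an assumed interpolation weight; the background corpus keeps
    words outside the profile from getting zero probability.
    """
    prof = Counter(profile_tokens)
    back = Counter(background_tokens)
    n_prof = sum(prof.values()) or 1
    n_back = sum(back.values()) or 1
    vocab = set(prof) | set(back)
    return {
        w: lam * prof[w] / n_prof + (1 - lam) * back[w] / n_back
        for w in vocab
    }


def user_relevance(sentence_tokens, model, floor=1e-6):
    """Average log-probability of the sentence's words under the user model."""
    if not sentence_tokens:
        return float("-inf")
    return sum(math.log(model.get(w, floor)) for w in sentence_tokens) / len(sentence_tokens)
```

With a tennis-heavy profile, a sentence about tennis scores higher than one about hotels, which matches the intuition that the same text is perceived differently by different readers.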




Evaluation

[Figure: Average scores for different users]

[Figure: Scores for different topics for a user]




Comparative summarization


Summaries for comparing multiple items belonging to a category

Category of “Mobile phones” will have “Nokia”, “BlackBerry” as its items

Comparative summaries provide the properties or facts common to these items and their corresponding values with respect to each item.

“Memory”, “Display”, “Battery Life”




Comparative Summaries Generation


Attribute Extraction


Find the attributes of the product class


Attribute Ranking


Rank the attributes according to importance in comparison


Summary Generation


Find the occurrence of attributes in various products
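
The ranking and summary generation steps can be sketched as below, assuming that attribute extraction has already produced a list of attribute names. The importance measure (the number of items whose text mentions the attribute) and the substring matching are deliberate simplifications, and the item names in the usage note are made up.

```python
def rank_attributes(item_texts, attributes):
    """Order attributes by how many items mention them (most shared first).

    Counting mentions is an assumed, simplistic importance measure.
    """
    def coverage(attr):
        return sum(attr.lower() in text.lower() for text in item_texts.values())
    return sorted(attributes, key=coverage, reverse=True)


def comparison_table(item_texts, attributes):
    """For each ranked attribute, list the sentences of each item that mention it."""
    table = {}
    for attr in rank_attributes(item_texts, attributes):
        table[attr] = {}
        for item, text in item_texts.items():
            sentences = [s.strip() for s in text.split(".") if s.strip()]
            table[attr][item] = [s for s in sentences if attr.lower() in s.lower()]
    return table
```

Calling `comparison_table({"PhoneA": "Battery life is two days. Display is sharp.", "PhoneB": "Battery life is one day."}, ["Display", "Battery life"])` places "Battery life" first because both items mention it, and leaves the "Display" cell for PhoneB empty.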




Guided Summarization


Query Focused Summarization


User’s information need expressed as a query along with a narrative


Set of documents related to the topic


Goal is to produce a short coherent summary focused on answering the query


Guided Summarization


Each topic is classified into a set of predefined categories


Each category has a template of important aspects about the topic


Summary is expected to answer all the aspects of template while containing
other relevant information




[Figure: Summarizer with Docs and Query as input; labels: When, What, Where, Who, How; Summary; Guided Summary]

Guided summarization


Encourage deeper linguistic and semantic analysis of the source documents instead of relying only on document word frequencies to select important concepts

Shares similarity with information extraction

Specific information from unstructured text is identified and consequently classified into a set of semantic labels (templates)

Makes information more suitable for other information processing tasks

A guided summarization system has to produce a readable summary encompassing all the information about the templates

Very few investigations exploring the potential of merging summarization with information extraction techniques




Our approach


Building a domain model


Essential background knowledge for information extraction



Sentence Annotations


To identify sentences having answers to aspects of template



Concept Mining


To use semantic concepts instead of words to calculate sentence importance



Summary Extraction


Modification of summary extraction algorithm to adapt to the requirements
using sentence annotations
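
How sentence annotations might steer extraction can be sketched with a greedy selector that first tries to cover every template aspect and then fills the remaining length budget by importance. This is an illustration under assumed inputs (each sentence already carries an importance score and a set of aspect labels), not the modified extraction algorithm referred to above. The aspect names in the usage note are hypothetical.

```python
def guided_extract(sentences, aspects, max_words=100):
    """Greedy extraction that tries to cover every template aspect first.

    `sentences` is a list of dicts with keys "text", "score" (importance)
    and "aspects" (set of template aspects the sentence answers).
    Returns the chosen texts and the set of aspects left uncovered.
    """
    chosen, used_words = [], 0
    uncovered = set(aspects)
    remaining = list(sentences)

    def fits(s):
        return used_words + len(s["text"].split()) <= max_words

    while remaining:
        candidates = [s for s in remaining if fits(s)]
        if not candidates:
            break
        # Prefer sentences covering new aspects; break ties by importance score.
        best = max(candidates, key=lambda s: (len(s["aspects"] & uncovered), s["score"]))
        chosen.append(best)
        used_words += len(best["text"].split())
        uncovered -= best["aspects"]
        remaining.remove(best)
    return [s["text"] for s in chosen], uncovered
```

With aspects such as `["WHEN", "WHAT", "WHERE"]` and a tight word budget, a sentence that answers an uncovered aspect is preferred over a higher-scoring sentence that answers none. The returned `uncovered` set shows which aspects the summary still misses.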




         ROUGE 2            ROUGE SU4          Pyramid          Responsiveness
Run1     0.09574 (1/43)     0.13014 (1/43)     0.425 (1/43)     3.130 (2/43)
Run 2    0.0695 (23/43)     0.10788 (22/43)    0.347 (21/43)    2.804 (21/43)

Category                Pyramid score    Responsiveness
Accidents               0.445            3.429
Attacks                 0.524            3.286
Health and safety       0.300            2.583
Endangered Resources    0.396            3.100
Investigations          0.520            3.500

Run1 is successful in producing informative summaries for cluster A

Ranked first in all evaluation metrics including pyramids and ROUGE

Difficulty of task depends on the type of category. Summarizing Health and safety, Endangered resources is relatively hard




KNOWLEDGE BASE POPULATION





Inconsistency


Incompleteness


Accuracy of facts


Novel information


Cost of Manual efforts


Solution: Automatically updating information of the entities in knowledge bases


Knowledge Base Population


Knowledge Base Population can be fundamentally broken down into two sub problems

Entity Linking: Linking entity mentions in documents to Knowledge Base nodes

Slot Filling: Extracting attribute information for query entities

Guided Summarization and KBP are complementary tasks

Summaries help in filling the slot values more effectively

Slot values enhance the quality of guided summaries







Vasudeva Varma


vv@iiit.ac.in





www.iiit.ac.in/~vasu
