Automating Keyphrase Extraction with Multi-Objective Genetic Algorithms (MOGA)

Oct 23, 2013

Jia-Long Wu

Alice M. Agogino

Berkeley Expert System Laboratory

U.C. Berkeley

Outline


Role of Keyphrases


Phrase Extraction Algorithms


Phrase Extraction with Multi-Objective Genetic Algorithm


Experiment and Results


Results Evaluation


Conclusion


Future Research

Role of Keyphrases


Concept representations


Document indexing


Enhance document retrieval / Browsing


Query formulation assistance


Document surrogates

Vision of Unified Language System

[Diagram labels: Design Research Repository; Corporate Design Repository; Design Education Materials; Unified Language System for Engineering Design; Unified Subject Headings; Context Mapping Mechanism; Semantic Network]

Keyphrase Extraction Algorithms


Heuristic, Syntactic, Machine Learning


Requires prior training


Heuristic cut-off thresholds in number of phrases


Focuses on single document


Redundancy when aggregated for the whole document collection

Keyphrase Extraction with MOGA


Phrase extraction as an optimization problem


Candidate phrases generation


Optimize phrase selection with MOGA


Model & Genetic Operators

Phenotype & Genotype

[Figure: candidate phrases "3d scanning", "abstraction", "active control system" shown with chromosome bits 1, 0, 1]

Crossover

[Figure: parent and offspring chromosomes; bits shown: 1 0 0 1 1 0 1 1 0 1 1 0 0 0 1]
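
The encoding on this slide can be sketched in a few lines: each bit of a chromosome says whether the candidate phrase at the same position is selected. The crossover operator below is an assumption, since the slide names crossover but not its type; single-point crossover is used for illustration, and the function names are hypothetical.

```python
import random

# Candidate phrases and chromosome taken from the slide.
candidates = ["3d scanning", "abstraction", "active control system"]
chromosome = [1, 0, 1]

def decode(chromosome, candidates):
    """Return the selected phrases (phenotype) for a 0/1 chromosome (genotype)."""
    return [phrase for bit, phrase in zip(chromosome, candidates) if bit == 1]

def single_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length parents at a random cut point.

    Assumed operator: the slides do not say which crossover variant was used.
    """
    assert len(parent_a) == len(parent_b) >= 2
    cut = rng.randint(1, len(parent_a) - 1)  # cut falls strictly inside the string
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b

print(decode(chromosome, candidates))  # ['3d scanning', 'active control system']
```

Each offspring keeps, at every position, a bit from one of the two parents, so crossover only recombines phrase selections that already exist in the population.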

Keyphrase Extraction with MOGA


Optimize phrase selection with MOGA (cont.)


Model & Genetic Operators (cont.)




Evaluation fitness functions


Minimize clustering measure / dispersion (Bookstein ’98)




Minimize number of phrases


Non-Dominated Sorting Genetic Algorithm (NSGA-II)
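
With two objectives to minimize (dispersion and number of phrases), a solution is kept when no other solution is at least as good on both objectives and strictly better on one. The sketch below shows only this dominance test and the extraction of the first non-dominated front; full NSGA-II also ranks the remaining fronts and uses crowding distance, which are omitted here. The objective values are hypothetical, and the dispersion measure itself is not reproduced because the slides do not give its formula.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    Both objectives are minimized, as on the slide.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def non_dominated(points):
    """Return the points that no other point dominates (the first Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (dispersion, number_of_phrases) pairs, for illustration only.
scores = [(0.90, 10), (0.70, 25), (0.75, 30), (0.50, 60), (0.90, 12)]
front = non_dominated(scores)
print(front)  # [(0.9, 10), (0.7, 25), (0.5, 60)]
```

Here (0.75, 30) is dropped because (0.70, 25) is better on both objectives, and (0.90, 12) is dropped because (0.90, 10) ties on dispersion with fewer phrases.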

Mutation

[Figure: mutation of a chromosome; bits shown: 1 0 0 1 0 1 1 0 1 0]
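
Mutation on a binary chromosome is commonly implemented as a bit flip, which adds or removes a phrase from the selection. The sketch below assumes independent per-bit flips with a hypothetical `rate` parameter; the slides do not state the mutation rate or the exact variant used.

```python
import random

def bit_flip_mutation(chromosome, rate, rng=random):
    """Flip each bit independently with probability `rate` (assumed operator)."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]

# rate=0.0 leaves the chromosome unchanged; rate=1.0 flips every bit.
print(bit_flip_mutation([1, 0, 0, 1, 0], rate=0.0))  # [1, 0, 0, 1, 0]
print(bit_flip_mutation([1, 0, 0, 1, 0], rate=1.0))  # [0, 1, 1, 0, 1]
```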

Experiment and Results


Data set


34 papers from Design Theory and Methodology Conference ’01


Candidate phrases


~5000 noun phrases extracted


Genetic Algorithm Parameters


Population size 100


Converges at 5000 generations


5 hours on a 1.8 GHz Xeon CPU

Experiment and Results

Pareto plot of Dispersion versus Number of Phrases

Experiment and Results

Histogram of the number of optimal solutions in which a keyphrase appears
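
The counts behind such a histogram can be computed by tallying, for each candidate phrase, how many Pareto-optimal chromosomes select it. This is a minimal sketch: the three solutions below are hypothetical, and `phrase_frequencies` is an illustrative name, not something from the slides.

```python
from collections import Counter

def phrase_frequencies(solutions, candidates):
    """Count in how many solutions each candidate phrase is selected (bit == 1)."""
    counts = Counter()
    for chromosome in solutions:
        counts.update(p for bit, p in zip(chromosome, candidates) if bit == 1)
    return counts

candidates = ["3d scanning", "abstraction", "active control system"]
# Hypothetical Pareto-optimal chromosomes, for illustration only.
pareto_solutions = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]

freq = phrase_frequencies(pareto_solutions, candidates)
print(freq["3d scanning"], freq["abstraction"], freq["active control system"])  # 3 1 1
```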

Evaluation


6 domain experts participated in the evaluation.


Core phrases vs. Non-core phrases.


Less than 10% of the core and non-core phrases are deemed irrelevant.


Significant deviation between evaluators.

                      Relevant Core Phrases     Relevant Non-Core Phrases   Relevant Noise Phrases
                      (out of 385 candidates)   (out of 994 candidates)     (out of 300 phrases)
Average               363.5                     905.5                       26.0
Percentage Relevant   94.42%                    91.10%                      8.67%
Standard Deviation    13.08                     74.77                       4.61
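
The "Percentage Relevant" figures follow from dividing each average by the size of its phrase group. The short check below reproduces them from the numbers in the table; the group labels in the dictionary are shorthand chosen here.

```python
# (average number judged relevant, number of phrases in the group), from the table.
groups = {
    "core": (363.5, 385),
    "non-core": (905.5, 994),
    "noise": (26.0, 300),
}

percentages = {
    name: f"{100 * average / total:.2f}%" for name, (average, total) in groups.items()
}
print(percentages)  # {'core': '94.42%', 'non-core': '91.10%', 'noise': '8.67%'}
```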

Conclusion


Keyphrase extraction can be successfully implemented as a multi-objective global optimization problem.


Reasonably good keyphrases can be extracted without prior training or domain knowledge.


Trade-off information between objectives such as number of phrases vs. average quality of phrases can be gained from Pareto solutions.


Preferences can be set based on user needs and trade-off information.


Future Research


Test on a larger text collection.


Use the extracted keyphrases in an IR system as a browsing and query expansion tool, and compare it to a full-text search IR system.


Evaluate with more raters and a 1-5 scale.


Build domain thesauri with extracted keyphrases and semantic discovery algorithms (e.g. Latent Semantic Analysis).

Metathesaurus in Digital Library

Thank you!

Comments? Questions?


jialong@me.berkeley.edu

aagogino@me.berkeley.edu

Mode Analysis of Scaled Evaluation