Multi-Abstraction Concern Localization


Outline: Motivation, Overall Framework, Multi-Abstraction Retrieval, Experiments

Multi-Abstraction Concern Localization

Tien-Duy B. Le, Shaowei Wang, and David Lo

{btdle.2012, shaoweiwang.2010, davidlo}@smu.edu.sg

Overall Framework

[Framework diagram: the Method Corpus and Concerns pass through Preprocessing; a Hierarchy Creation step builds an Abstraction Hierarchy (Level 1, Level 2, ..., Level N); a Standard Retrieval Technique combined with Multi-Abstraction Retrieval produces Ranked Methods per Concern.]

Text Preprocessing

We remove Java keywords, punctuation marks, and special symbols, and break identifiers into tokens based on the Camel casing convention. Finally, we apply the Porter stemming algorithm to reduce English words to their root forms.
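
To make the step concrete, here is a minimal Python sketch of this preprocessing. The Java keyword list is abbreviated for brevity, and NLTK's PorterStemmer stands in for the Porter stemming algorithm:

```python
import re
from nltk.stem import PorterStemmer

# Abbreviated Java keyword list; the full language has ~50 keywords.
JAVA_KEYWORDS = {"public", "private", "static", "void", "class",
                 "int", "return", "new", "if", "else", "for", "while"}

stemmer = PorterStemmer()

def preprocess(source_text):
    # Keep only alphabetic runs, dropping punctuation marks and
    # special symbols.
    identifiers = re.findall(r"[A-Za-z]+", source_text)
    tokens = []
    for ident in identifiers:
        if ident.lower() in JAVA_KEYWORDS:
            continue  # drop Java keywords
        # Split identifiers on Camel-case boundaries,
        # e.g. "getFileName" -> ["get", "File", "Name"].
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", ident)
        # Reduce each token to its root form with the Porter stemmer.
        tokens.extend(stemmer.stem(p.lower()) for p in parts)
    return tokens

print(preprocess("public String getFileName() { return fileName; }"))
# -> ['string', 'get', 'file', 'name', 'file', 'name']
```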


Hierarchy Creation

We apply Latent Dirichlet Allocation (LDA) a number of times, each time with a different number of topics, to construct an abstraction hierarchy. Each application of LDA creates a topic model, which corresponds to one abstraction level. We refer to the number of topic models contained in a hierarchy as the height of the hierarchy.
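
One way to build such a hierarchy, sketched here with gensim's LdaModel; the topic counts shown match the H4 configuration reported in the experiments below, and the function name is illustrative:

```python
from gensim import corpora, models

def build_hierarchy(tokenized_methods, topic_counts=(50, 100, 150, 200)):
    """Build an abstraction hierarchy: one LDA topic model per level.

    tokenized_methods: list of token lists, one per method in the corpus.
    topic_counts: number of topics per level; the hierarchy's height
    is len(topic_counts).
    """
    dictionary = corpora.Dictionary(tokenized_methods)
    bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_methods]
    hierarchy = []
    for num_topics in topic_counts:
        # Each LDA run with a different topic count yields one
        # abstraction level.
        lda = models.LdaModel(bow_corpus, num_topics=num_topics,
                              id2word=dictionary)
        hierarchy.append(lda)
    return hierarchy, dictionary
```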




Motivation

Concern localization is the process of locating code units that match a particular textual description (bug reports or feature requests).

Recent concern localization techniques compare documents at only one level of abstraction (i.e., words or topics).


A word can be abstracted at multiple levels of abstraction. For example, Eindhoven can be abstracted to North Brabant, Netherlands, Western Europe, European Continent, Earth, etc.



In multi-abstraction concern localization, we represent documents at multiple abstraction levels by leveraging multiple topic models.




Multi-Abstraction Retrieval

We propose the multi-abstraction Vector Space Model (VSM_MA), which combines VSM with our abstraction hierarchy.

In multi-abstraction VSM, document vectors are extended by adding elements corresponding to the topics in the hierarchy.


Given a query q and a document d in corpus D, the similarity between q and d is calculated in VSM_MA as the cosine similarity of the extended vectors:

sim(q, d) = (q⃗ · d⃗) / (|q⃗| |d⃗|)

where the extended vector of a document d is

d⃗ = ( tf-idf(w_1, d, D), ..., tf-idf(w_V, d, D), θ_1(t_1, d), ..., θ_L(t_|H_L|, d) )

and:

- V is the size of the original document vector
- w_i is the i-th word in d
- L is the height of the abstraction hierarchy H
- H_i is the i-th abstraction level in the hierarchy
- θ_k(t_i, d) is the probability of topic t_i appearing in d, as assigned by the k-th topic model in abstraction hierarchy H
- tf-idf(w, d, D) is the term frequency-inverse document frequency of word w in document d given corpus D
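
A small numpy sketch of this similarity computation, assuming the tf-idf weights and per-level topic distributions have already been computed (function and variable names are illustrative):

```python
import numpy as np

def extended_vector(tfidf_weights, topic_distributions):
    """Concatenate a document's tf-idf vector with its topic
    probabilities from every level of the abstraction hierarchy.

    tfidf_weights: array of length V (one weight per corpus word).
    topic_distributions: list of 1-D arrays, one per abstraction level;
    the k-th array holds the document's topic probabilities under the
    k-th topic model.
    """
    return np.concatenate([tfidf_weights] + list(topic_distributions))

def vsm_ma_similarity(q_tfidf, q_topics, d_tfidf, d_topics):
    """Cosine similarity between extended query and document vectors."""
    q_vec = extended_vector(q_tfidf, q_topics)
    d_vec = extended_vector(d_tfidf, d_topics)
    denom = np.linalg.norm(q_vec) * np.linalg.norm(d_vec)
    return float(q_vec @ d_vec / denom) if denom else 0.0
```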

Effectiveness of Multi-Abstraction VSM

Hierarchy         Number of Topics     MAP      Improvement
Baseline (VSM)    -                    0.0669   N/A
H1                50                   0.0715   6.82%
H2                50, 100              0.0777   16.11%
H3                50, 100, 150         0.0787   17.65%
H4                50, 100, 150, 200    0.0799   19.36%


The MAP improvement of H4 over the baseline is 19.36%.

MAP improves as the height of the abstraction hierarchy increases.
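
For reference, MAP here is Mean Average Precision: the average, over all concerns, of the average precision of each ranked method list. A minimal sketch, assuming binary relevance judgments:

```python
def average_precision(ranked_methods, relevant):
    """Average precision of one ranked method list for one concern."""
    hits, precision_sum = 0, 0.0
    for rank, method in enumerate(ranked_methods, start=1):
        if method in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevance):
    """MAP over all concerns; both dicts are keyed by concern id."""
    aps = [average_precision(rankings[c], relevance[c]) for c in rankings]
    return sum(aps) / len(aps)
```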


Future Work

Extend the experiments with combinations of:
- different numbers of topics in each level of the hierarchy
- different hierarchy heights
- different topic models (Pachinko Allocation Model, Syntactic Topic Model, Hierarchical LDA)

Experiment with Panichella et al.'s method [1] to infer good LDA configurations for our approach.


[1] A. Panichella, B. Dit, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia. How to effectively use topic models for software engineering tasks? An approach based on genetic algorithms. ICSE 2013.
