
Oct 16, 2013



Leiden University. The university to discover.



Hafeez Osman (hosman@liacs.nl)
Michel R.V. Chaudron (chaudron@chalmers.se)
Peter v.d. Putten (putten@liacs.nl)





An Analysis of Machine Learning Algorithms for Condensing Reverse Engineered Class Diagrams



Presenter: Hafeez Osman









Overview


1. Introduction

2. Research Question

3. Approach

4. Results

5. Discussion

6. Future Work

7. Conclusion




Introduction

Who?
Software engineers, software maintainers, software designers

What?
Simplifying UML class diagrams based on software design metrics, using machine learning

Why?
Reverse engineered class diagrams are typically too detailed a representation



Introduction

Aim: analyze the performance of classification algorithms that decide which classes should be included in a class diagram.

This paper focuses on using design metrics as predictors (the input variables used by the classification algorithm).





Introduction

Explore Structural Properties of Classes


Software design metrics from the following categories:

Size: NumAttr, NumOps, NumPubOps, Getters, Setters

Coupling: Dep_Out, Dep_In, EC_Attr, IC_Attr, EC_Par, IC_Par

Machine Learning Classification Algorithms

Supervised classification algorithms:

J48 Decision Tree, Decision Tables, Decision Stumps, Random Forests and Random Trees

k-Nearest Neighbor, Radial Basis Function Networks

Logistic Regression, Naive Bayes



Research Questions

RQ1: Which individual predictors are influential for the classification?
For each case study, the predictive power of the individual predictors is explored.

RQ2: How robust is the classification to the inclusion of particular sets of predictors?
Explore how the performance of the classification algorithm is influenced by partitioning the predictor variables into different sets.

RQ3: What are suitable algorithms for classifying classes?
The candidate classification algorithms are evaluated with respect to how well they perform in classifying the key classes in a class diagram.



Approach

Evaluation Method

RQ1: Predictors
Univariate analysis
Information Gain Attribute Evaluator
To measure the predictive power of predictors

RQ2, 3: Machine Learning Classification Algorithm
Area Under ROC Curve (AUC)
The AUC shows the ability of the classification algorithms to correctly rank classes as included in the class diagram or not
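
As a rough illustration of the ranking interpretation of AUC on this slide: AUC can be computed as the fraction of (included, omitted) class pairs that the classifier orders correctly, with ties counted as half. This is a minimal sketch; the function name and the example scores are hypothetical and are not data from the study.

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive (class in the
    design) is scored above a randomly chosen negative; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for six classes; label 1 = in the design.
scores = [0.9, 0.8, 0.4, 0.35, 0.3, 0.1]
labels = [1, 0, 1, 0, 0, 0]
print(auc(scores, labels))
```

An AUC of 0.5 corresponds to a random ordering and 1.0 to a perfect one, which is why the results are read against a threshold above 0.5.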



Approach

Case Study Characteristics

Project          Total Classes:     Total Classes:   (b)/(a) %
                 Source code (a)    Design (b)
ArgoUML          903                44               4.87
Mars             840                29               3.45
JavaClient       214                57               26.64
JGAP             171                18               10.52
Neuroph 2.3      161                24               14.9
JPMC             127                11               8.66
Wro4J            87                 11               12.64
xUML-Compiler    84                 37               44.05
Maze             59                 28               47.45
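
The percentage column can be reproduced by dividing the number of design classes by the number of source code classes. The sketch below uses the class counts from the table; the variable and function names are illustrative only. The computed values agree with the listed percentages to within 0.01.

```python
# (source code classes, design classes) per project, as listed in the table.
case_studies = {
    "ArgoUML": (903, 44),
    "Mars": (840, 29),
    "JavaClient": (214, 57),
    "JGAP": (171, 18),
    "Neuroph 2.3": (161, 24),
    "JPMC": (127, 11),
    "Wro4J": (87, 11),
    "xUML-Compiler": (84, 37),
    "Maze": (59, 28),
}


def design_share(source_classes, design_classes):
    """Percentage of source code classes that also appear in the design."""
    return 100.0 * design_classes / source_classes


shares = {name: design_share(a, b) for name, (a, b) in case_studies.items()}
for name, pct in shares.items():
    print(f"{name:14s} {pct:6.2f}")
```

The share of positive examples ranges from roughly 3% to 47%, so the classification problem is strongly imbalanced for the larger projects.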



Approach

Grouping Predictors in Sets

No   Predictor    Predictor Set A   Predictor Set B   Predictor Set C
1    NumAttr                                          x
2    NumOps                                           x
3    NumPubOps                      x                 x
4    Setters                        x                 x
5    Getters                        x                 x
6    Dep_Out
7    Dep_In
8    EC_Attr
9    IC_Attr
10   EC_Par
11   IC_Par









Approach

1. Reverse engineer the source code to a UML design.
   i. Eliminate library classes
2. Calculate design metrics
   i. Eliminate unused metrics
3. Merge the "In Design" information with the software design metrics data
4. Prepare the sets of predictors
5. Run all sets of predictors through the machine learning tool
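
Step 3 can be pictured as attaching a binary label to each row of metrics. The following is a minimal sketch under stated assumptions: the row layout, the column name "InDesign", and the example classes are hypothetical, and the study's actual tooling is not shown in the slides.

```python
def merge_in_design(metrics_rows, design_class_names):
    """Attach an 'InDesign' label (1 or 0) to each class's metric row."""
    in_design = set(design_class_names)
    merged = []
    for row in metrics_rows:
        labeled = dict(row)  # copy so the input rows stay untouched
        labeled["InDesign"] = 1 if row["Class"] in in_design else 0
        merged.append(labeled)
    return merged


# Hypothetical metric rows for three reverse engineered classes.
metrics = [
    {"Class": "Order", "NumOps": 12, "Dep_In": 7, "EC_Par": 5},
    {"Class": "OrderDialog", "NumOps": 4, "Dep_In": 1, "EC_Par": 0},
    {"Class": "StringUtil", "NumOps": 9, "Dep_In": 3, "EC_Par": 1},
]
# Hypothetical list of classes that appear in the forward design.
forward_design = ["Order"]

dataset = merge_in_design(metrics, forward_design)
for row in dataset:
    print(row)
```

Each labeled row then serves as one training example: the metrics are the predictors and "InDesign" is the target.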




Result

RQ1: Predictor Evaluation

Influential Predictors

Predictor    No. of Projects **
EC_Par       6
NumOps       5
Dep_In       5
NumPubOps    4
Dep_Out      4
NumAttr      3
Setters      3
Getters      3
EC_Attr      3
IC_Attr      3
IC_Par       2

** Out of 9 projects
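
The predictor evaluation for RQ1 is based on information gain. As a minimal sketch of that measure, the code below computes the information gain of one discrete attribute with respect to the class label. It assumes the metric has already been binned into discrete values; the binning and the example data are hypothetical and not taken from the study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy that remains
    after splitting the examples on a discrete attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical data: a metric binned into "low"/"high" and the In Design label.
ec_par_bin = ["high", "high", "low", "low", "low", "low"]
in_design = [1, 1, 0, 0, 0, 1]
print(round(information_gain(ec_par_bin, in_design), 3))
```

A gain of 0 means the attribute tells nothing about inclusion in the design; larger values indicate a more influential predictor.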



Result

RQ2: Dataset Evaluation

No. of Projects for which a Classification Algorithm scores AUC > 0.60 **

Algorithm           Predictor Set A   Predictor Set B   Predictor Set C
Decision Table      3                 3                 3
J48                 5                 5                 5
Decision Stump      6                 7                 6
RBF Network         7                 7                 7
Naïve Bayes         8                 7                 7
Random Tree         8                 7                 7
Function Logistic   7                 7                 8
k-NN(1)             8                 7                 9
k-NN(5)             8                 8                 9
Random Forest       9                 9                 9

** Out of 9 projects
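
One simple way to read these counts for robustness is to look at how much each algorithm's count changes across the three predictor sets. The sketch below does this with the chart values from this slide, assuming they are listed in the same order as the algorithm labels; the "spread" measure is an illustration, not a measure defined in the study.

```python
# Number of projects (out of 9) with AUC > 0.60, as read from the chart:
# (Predictor Set A, Predictor Set B, Predictor Set C) per algorithm.
projects_above_060 = {
    "Decision Table": (3, 3, 3),
    "J48": (5, 5, 5),
    "Decision Stump": (6, 7, 6),
    "RBF Network": (7, 7, 7),
    "Naive Bayes": (8, 7, 7),
    "Random Tree": (8, 7, 7),
    "Function Logistic": (7, 7, 8),
    "k-NN(1)": (8, 7, 9),
    "k-NN(5)": (8, 8, 9),
    "Random Forest": (9, 9, 9),
}

# Spread = difference between the best and worst count over the three sets.
spread = {alg: max(c) - min(c) for alg, c in projects_above_060.items()}
unchanged = sorted(alg for alg, s in spread.items() if s == 0)

print(spread)
print(unchanged)
```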



Result

RQ3: Evaluation on Classification Algorithms

Average AUC Score **

Algorithm           Predictor Set A   Predictor Set B   Predictor Set C
Decision Table      0.60              0.59              0.58
J48                 0.63              0.61              0.61
Random Tree         0.66              0.66              0.66
RBF Network         0.66              0.67              0.66
Decision Stump      0.66              0.64              0.65
Function Logistic   0.69              0.70              0.68
Naïve Bayes         0.70              0.70              0.69
k-NN(1)             0.70              0.70              0.71
k-NN(5)             0.73              0.72              0.72
Random Forest       0.75              0.75              0.74

** Out of 9 projects
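
To compare the algorithms at a glance, the average AUC scores can be averaged once more over the three predictor sets and sorted. This sketch uses the chart values from this slide, assuming they are listed in the same order as the algorithm labels; averaging over the predictor sets is an illustrative summary, not a step reported in the study.

```python
# Average AUC per algorithm for (Predictor Set A, Set B, Set C),
# as read from the chart.
avg_auc = {
    "Decision Table": (0.60, 0.59, 0.58),
    "J48": (0.63, 0.61, 0.61),
    "Random Tree": (0.66, 0.66, 0.66),
    "RBF Network": (0.66, 0.67, 0.66),
    "Decision Stump": (0.66, 0.64, 0.65),
    "Function Logistic": (0.69, 0.70, 0.68),
    "Naive Bayes": (0.70, 0.70, 0.69),
    "k-NN(1)": (0.70, 0.70, 0.71),
    "k-NN(5)": (0.73, 0.72, 0.72),
    "Random Forest": (0.75, 0.75, 0.74),
}

# Mean over the three predictor sets, then rank from best to worst.
mean_auc = {alg: sum(v) / len(v) for alg, v in avg_auc.items()}
ranking = sorted(mean_auc, key=mean_auc.get, reverse=True)

for alg in ranking:
    print(f"{alg:18s} {mean_auc[alg]:.3f}")
```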



Discussion

A. Predictor

Three class diagram metrics should be considered influential predictors:

Export Coupling Parameter (EC_Par)
Dependency In (Dep_In)
Number of Operations (NumOps)

** That is, a higher value of these metrics for a class indicates that the class can be a candidate for inclusion in the class diagram (CD).

B. Classification Algorithm

k-NN(5) and Random Forest are suitable classification algorithms in this study.

Their AUC score is at least 0.64

The classifiers are robust for all projects and predictor sets



Discussion

C. Threats to Validity

i. Assumption of ground truth: exactly those classes that should be in the forward design are in the forward design. There is a possibility that:
   some of these classes were not key classes of the system
   the forward design used is too 'old'
ii. Input is dependent on the reverse engineering tool (MagicDraw)
iii. Covers only 9 open-source projects






Future Work

1. Alternative predictor variables
   use other types of design metrics, e.g. (the semantics of) the names of classes, methods and attributes
   use source code metrics such as Lines of Code (LOC) and Lines of Comments
   change history of a class
2. Learning models (classification algorithm)
   test an ensemble approach (combining classification algorithms)
3. Semi-supervised or interactive approach
4. Compare the results of this study with other approaches
   other works that apply different algorithms, such as HITS web mining, network analysis on dependency graphs, and PageRank
5. Validate the understandability of abstract class diagrams




Questions?


Conclusion



1. The most influential predictors
   Export Coupling Parameter
   Dependency In
   Number of Operations
2. Most suitable classification algorithms
   Random Forest
   k-Nearest Neighbor
3. Classification algorithms are able to produce a predictor that can be used to rank classes by relative importance.
4. Based on this class-ranking information, a tool can be developed that provides views of reverse engineered class diagrams at different levels of abstraction.
5. Developers may generate multiple levels of class diagram abstractions.
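
The idea of ranking classes and deriving views at different abstraction levels can be sketched as follows. This is only an illustration of the concept: the function name, the "keep fraction" parameter, and the example scores are hypothetical, and no such tool is described in the slides.

```python
def condensed_view(class_scores, keep_fraction):
    """Return the highest-scoring classes, keeping the given fraction
    of all classes (at least one)."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(class_scores, key=class_scores.get, reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:keep]


# Hypothetical classifier scores, e.g. predicted probability of "in design".
scores = {
    "Order": 0.92,
    "Customer": 0.81,
    "Invoice": 0.64,
    "OrderDialog": 0.22,
    "StringUtil": 0.10,
}

print(condensed_view(scores, 0.4))  # a highly condensed view
print(condensed_view(scores, 1.0))  # the full reverse engineered diagram
```

Lowering the fraction gives a more abstract diagram, while a fraction of 1.0 reproduces the complete reverse engineered class list.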