Nov 3, 2013

CMCD: Count Matrix based Code Clone Detection

Yang Yuan and Yao Guo

Key Laboratory of High-Confidence Software Technologies (Ministry of Education)
Peking University

Code Clones

In software development, it is common to reuse code fragments by copying them, with or without minor modifications.

Such code fragments are called code clones. [Jurgens et al., ICSE 2009]

Scenario-based Evaluation

Example of Scenario #1 (panels: Original, Copy)
Example of Scenario #2 (panels: Original, Copy)
Example of Scenario #3 (panels: Original, Copy)
Example of Scenario #4 (panels: Original, Copy)

Importance of Code Clones

Code clones cause trouble:
  Increase the complexity of source code
  Increase the maintenance cost of a software system
  Increase the likelihood of bugs

7%-23% of the code in large software systems is cloned. [Roy et al., SCP 2009]

Detecting code clones may help:
  Analyze the programming habits of programmers
  Find design patterns in the source code

Previous Work in Clone Detection

Lower level:
  Textual approach
    SDD [Lee and Jeong, OOPSLA 2005]
    NICAD [Roy and Cordy, ICPC 2008]
    ...
  Lexical approach
    DUP [Baker, WCRE 1995]
    CCFinder [Kamiya et al., TSE 2002]
    CP-Miner [Li et al., OSDI 2004, TSE 2006]
    ...

Higher level:
  Syntactic approach
    CloneDr [Baxter et al., ICSM 1998]
    Deckard [Jiang et al., ICSE 2007]
    CloneDigger [Bulychev, SyRCoSE 2008]
  Semantic approach
    Duplix [Krinke, WCRE 2001]
    GPLAG [Liu et al., KDD 2006]

Challenges

Low level approaches
  Faster
  Usually focus on local characters
  No idea about global meaning

High level approaches
  Slower
  Better understanding of the programs
  Difficult to scale

GAP (between low level and high level approaches)

Our idea

A novel count matrix based clone detection approach.

Benefits of counting
  By ignoring the order of variables, it can identify clones in which statements have been swapped, which is difficult for both lexical and syntactic approaches
  Easy to calculate and implement
  Reduces space and time complexity

Count Matrix Construction

Token Sequence -> Count Vector -> Count Matrix

Token sequence:
  tot, =, n, +, Find, (, n, ), for, i, =, 1, to, n, -, 1, if, a, [, i, ], >, a, [, j, ], k, =, a, [, i, ], ...

Count matrix (one count vector per variable):
  tot   1   0   0   ...   0
  i     3   0   0   ...   2
  j     1   0   0   ...   1
  a     3   0   0   ...   3
  n     2   1   0   ...   0
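
The construction step above (token sequence, then one count vector per variable, then the matrix of all vectors) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not list the counting conditions, so the three used here (USED, DEFINED, SUBSCRIPT) and the class name CountMatrixSketch are assumptions made for the example, and the resulting numbers are not meant to reproduce the matrix on the slide.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CountMatrixSketch {
    // Hypothetical counting conditions, chosen only for illustration.
    static final int USED = 0;       // every occurrence of the variable
    static final int DEFINED = 1;    // occurrence directly followed by "="
    static final int SUBSCRIPT = 2;  // occurrence between "[" and "]"
    static final int DIMENSIONS = 3;

    /** Builds one count vector per variable; the rows together form the count matrix. */
    static Map<String, int[]> build(List<String> tokens, Set<String> variables) {
        Map<String, int[]> matrix = new LinkedHashMap<>();
        int bracketDepth = 0;
        for (int k = 0; k < tokens.size(); k++) {
            String token = tokens.get(k);
            if (token.equals("[")) { bracketDepth++; continue; }
            if (token.equals("]")) { bracketDepth--; continue; }
            if (!variables.contains(token)) continue;  // keywords, operators, literals
            int[] cv = matrix.computeIfAbsent(token, name -> new int[DIMENSIONS]);
            cv[USED]++;
            if (k + 1 < tokens.size() && tokens.get(k + 1).equals("=")) cv[DEFINED]++;
            if (bracketDepth > 0) cv[SUBSCRIPT]++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "tot", "=", "n", "+", "Find", "(", "n", ")",
            "for", "i", "=", "1", "to", "n", "-", "1",
            "if", "a", "[", "i", "]", ">", "a", "[", "j", "]",
            "k", "=", "a", "[", "i", "]");
        Set<String> variables =
            new HashSet<>(Arrays.asList("tot", "n", "i", "j", "a", "k"));
        // Prints one row per variable, e.g. the row for i is [3, 1, 2].
        build(tokens, variables).forEach(
            (name, cv) -> System.out.println(name + " " + Arrays.toString(cv)));
    }
}
```

Because each row depends only on how often a variable meets each condition, reordering statements leaves the matrix unchanged, which is the property the "Benefits of counting" slide relies on.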

Comparison Algorithms

Goal:
  Find more scenario #4 clones with more transformations, such as statement swapping
  Run fast

General principles:
  Compare individual variables instead of variable sequences
  Ignore variable order in the count matrix


Bipartite graph matching

Use bipartite graph matching to find code clones at different granularities:
  Bottom-up approach
  Can be used to compute the similarity between two projects, two classes, or two methods

Use two kinds of bipartite graph matching
  KM algorithm (low-level, slow, accurate)
  Hungarian algorithm (high-level, fast, inaccurate)
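
As a rough illustration of the faster, unweighted variant, the sketch below computes a maximum-cardinality bipartite matching with augmenting paths. It assumes the lower level has already decided which pairs are "similar enough" and passes that in as a boolean edge matrix; the class name BipartiteMatchSketch and the similarity score (matched pairs divided by the larger side) are assumptions for the example, not taken from the slides. The weighted KM algorithm is not shown.

```java
import java.util.Arrays;

final class BipartiteMatchSketch {
    /**
     * Maximum-cardinality matching via augmenting paths.
     * edge[i][j] is true when left item i may be paired with right item j.
     */
    static int maxMatching(boolean[][] edge) {
        int rightCount = edge.length == 0 ? 0 : edge[0].length;
        int[] matchOfRight = new int[rightCount];
        Arrays.fill(matchOfRight, -1);  // -1 means the right item is still free
        int matched = 0;
        for (int left = 0; left < edge.length; left++) {
            if (augment(left, edge, new boolean[rightCount], matchOfRight)) {
                matched++;
            }
        }
        return matched;
    }

    /** Tries to place left, re-routing earlier matches when necessary. */
    private static boolean augment(int left, boolean[][] edge,
                                   boolean[] visited, int[] matchOfRight) {
        for (int right = 0; right < visited.length; right++) {
            if (!edge[left][right] || visited[right]) continue;
            visited[right] = true;
            if (matchOfRight[right] == -1
                    || augment(matchOfRight[right], edge, visited, matchOfRight)) {
                matchOfRight[right] = left;
                return true;
            }
        }
        return false;
    }

    /** Hypothetical score: fraction of the larger side that found a partner. */
    static double similarity(boolean[][] edge) {
        int leftCount = edge.length;
        int rightCount = leftCount == 0 ? 0 : edge[0].length;
        int larger = Math.max(leftCount, rightCount);
        return larger == 0 ? 1.0 : (double) maxMatching(edge) / larger;
    }

    public static void main(String[] args) {
        // Rows: items of fragment A; columns: items of fragment B.
        boolean[][] edge = {
            {true,  true,  false},
            {true,  false, false},
            {false, false, true},
        };
        System.out.println(maxMatching(edge));  // prints 3
        System.out.println(similarity(edge));   // prints 1.0
    }
}
```

In a bottom-up setting the same routine could be reused at each level: variables are matched to score two methods, methods to score two classes, and so on.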

Optimization

Use the Euclidean metric to compute the similarity of count vectors (CVs)
Use a quick rejection algorithm to improve speed
Eliminate false positives:
  Cut and check
  Slice and match

Implementation

Use Soot to convert Java -> Jimple [Vallee-Rai et al., CASCON 1999]
  3-address intermediate representation
  Smaller language set
  Breaks complex statements into basic ones
  Does not change the meaning of the program

A new version of CMCD without using Soot

Overview


Performance Comparison to Deckard

[Figure: Compare Time (sec), axis ticks from 0.5 to 4096 in powers of two, versus Similarity at 1.0(1.0), 0.95(0.9999), 0.9(0.999), 0.85(0.99), and 0.8(0.95). Series: Stage1, Stage2, Stage3, Stage2+key, Stage3+key, Deckard. Data labels shown: 833, 565, 571, 636, 2274.]

Scenario-based Evaluation

Based on the scenario classification from Roy et al., "Comparison and Evaluation of Code Clone Detection Techniques"

Detecting Plagiarism

Student-submitted compiler lab projects
  29 submissions
  106 - 251 Java classes
  7,825 - 38,086 lines of code

Experimental Results
  Running time: 123 minutes
  2 clusters of code clones, each with 3 copies
  Confirmed

Now used by two courses at Peking University for checking students’ homework

Analyzing JDK 1.6 Source Code

JDK 1.6.0_18
  7,197 files
  2,079,166 LoC

Experimental Results
  Running time: 163 minutes
  Found: 786 methods in 174 clusters (small methods are omitted)



Code Comparison: Two Clones

Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)

    public static SyncFactory getSyncFactory() {
        if (syncFactory == null) {
            synchronized (SyncFactory.class) {
                if (syncFactory == null) {
                    syncFactory = new SyncFactory();
                } //end if
            } //end synchronized block
        } //end if
        return syncFactory;
    }

Method 2: (in javax.swing.JComponent)

    static Set<KeyStroke> getManagingFocusBackwardTraversalKeys() {
        synchronized (JComponent.class) {
            if (managingFocusBackwardTraversalKeys == null) {
                managingFocusBackwardTraversalKeys = new HashSet<KeyStroke>(1);
                managingFocusBackwardTraversalKeys.add(KeyStroke.getKeyStroke(
                    KeyEvent.VK_TAB, InputEvent.SHIFT_MASK | InputEvent.CTRL_MASK));
            }
        }
        return managingFocusBackwardTraversalKeys;
    }


Detected a bug

Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)

    public static SyncFactory getSyncFactory() {
        if (syncFactory == null) {
            synchronized (SyncFactory.class) {
                if (syncFactory == null) {
                    syncFactory = new SyncFactory();
                } //end if
            } //end synchronized block
        } //end if
        return syncFactory;
    }

Method 3: (in com.sun.corba.se.impl.ior.iiop.JavaSerializationComponent)

    public static JavaSerializationComponent singleton() {
        if (singleton == null) {
            synchronized (JavaSerializationComponent.class) {
                singleton = new JavaSerializationComponent(Message.JAVA_ENC_VERSION);
            }
        }
        return singleton;
    }

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6999537
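
The visible difference between the two methods is that Method 3 has no second null check inside the synchronized block, so two threads that both pass the outer check can each construct an instance. A minimal sketch of the lazy-initialization pattern with both checks follows. The class LazySingletonSketch and its construction counter are hypothetical and exist only for the demonstration; this is not the change that was made in the JDK. The field is declared volatile, which double-checked locking needs under the Java memory model.

```java
final class LazySingletonSketch {
    // volatile: other threads must see a fully constructed instance.
    private static volatile LazySingletonSketch instance;

    // Demonstration only: counts how many instances were ever built.
    private static int constructions = 0;

    private LazySingletonSketch() {
        constructions++;  // only reached while holding the class lock
    }

    static LazySingletonSketch singleton() {
        if (instance == null) {                        // first check, no lock
            synchronized (LazySingletonSketch.class) {
                if (instance == null) {                // second check, under the lock
                    instance = new LazySingletonSketch();
                }
            }
        }
        return instance;
    }

    static synchronized int constructionCount() {
        return constructions;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[8];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(LazySingletonSketch::singleton);
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println("constructions = " + constructionCount());
    }
}
```

Removing the inner check turns this into the shape of Method 3: every thread that passed the outer check before the first assignment would then build its own instance once it acquires the lock.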

Conclusion

We propose a code clone detection approach, CMCD:
  Extracts count-based information
  Language independent
  Scales to large programs (> 1M LoC)

Capabilities
  Performs well in scenario-based evaluation
  Detects code plagiarism in students’ homework
  Identifies a potential bug in JDK source code