
Uploaded 10 Dec 2013


Juxtapp: A Scalable System for Detecting Code Reuse Among
Android Applications


Steve Hanna, Ling Huang, Edward Wu, Saung Li, Charles
Chen, and Dawn Song



On the Feasibility of Internet-Scale Author Identification


Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John
Bethencourt, Eui Chul Richard Shin, Dawn Song, Emil Stefanov


Similarity detection used to be applied only
to literary corpora and academia



Source code similarity/plagiarism detection is
very important



“Moss” is the most widely known software
similarity detection tool



Can similarity detection provide valuable
insight into malware detection?



Generally, no



In the Android app domain, it can!



86% of Android malware samples are repackaged
versions of legitimate apps with malicious
payloads (source: “Dissecting Android Malware:
Characterization and Evolution”)




Similarity detection is crucial





Each Android app is packaged as an APK file,
with an .apk extension



Each APK file contains a .dex file, a Dalvik
executable that is run by the Dalvik
virtual machine



Fingerprint the APK by hashing its code
features into a bit vector








Application preprocessing


Each app is segmented into basic blocks.
Only the opcodes are retained, the exception
being opcodes that store constant data, e.g. the
const-string opcode. In this case the opcode is
concatenated with the value it references.


Feature Extraction

k-grams of opcodes are extracted by sliding a
window of size k over each basic block. Each
k-gram is hashed with the djb2 hash function,
and for each hash value the corresponding bit
in the bit vector is set.
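The feature extraction step above can be sketched as follows. This is a minimal illustration, not Juxtapp's actual code: the function names, the representation of opcodes as strings joined by spaces, the 32-bit truncation of djb2, and the use of a Python integer as the bit vector are all assumptions made for the example. Blocks shorter than k simply contribute no k-grams here, which may differ from how the real system treats them.

```python
from typing import List


def djb2(data: bytes) -> int:
    """Classic djb2 hash (h = h * 33 + c), truncated to 32 bits (assumption)."""
    h = 5381
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def kgram_bitvector(blocks: List[List[str]], k: int = 5, m: int = 240007) -> int:
    """Hash every k-gram of opcodes in every basic block into an m-bit vector.

    `blocks` is a hypothetical input format: one list of opcode strings per
    basic block. The bit vector is returned as a Python int whose bit i is
    set when some k-gram hashes to i modulo m.
    """
    bits = 0
    for block in blocks:
        # Slide a window of size k over the block; short blocks yield nothing.
        for start in range(len(block) - k + 1):
            gram = " ".join(block[start:start + k]).encode("utf-8")
            bits |= 1 << (djb2(gram) % m)
    return bits
```

With k = 5, a basic block of six opcodes yields two k-grams and therefore sets at most two bits.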










The value of k was set to 5, chosen by
experiment. Pairs of apps were drawn from a
random sample of 6,000 apps, and the distance
between each pair was computed. It was found
that, starting from 5, the value of k has little
impact on the distance calculation.



Basic block size: the mean is 5.35 opcodes and
the median is 2 opcodes, while the largest basic
block in the dataset contains 35,517 opcodes.


The bit vector size m is chosen by experiment,
with m >> N, where N is the number of k-grams
extracted from an application [...] between two
k-gram feature sets



30,000 apps were used to determine m.



m = N_90 × 9 = 240,007, a prime number



Given the bit vector representations of two
apps A and B, their similarity is computed as:



J(A,B) = |A ∧ B| / |A ∨ B|


This formula is a variation of the original
Jaccard similarity.
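As a small worked illustration of the similarity formula, the sketch below computes it on bit vectors stored as Python integers, counting set bits for |·|. The function name and the choice to return 1.0 when both vectors are empty are assumptions for the example, not something the slides specify.

```python
def bit_jaccard(a: int, b: int) -> float:
    """Jaccard similarity of two bit vectors held as Python ints.

    |A AND B| and |A OR B| are the numbers of set bits in the bitwise
    AND and OR of the two vectors.
    """
    union = bin(a | b).count("1")
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return 1.0
    return bin(a & b).count("1") / union
```

For example, 1100 and 1010 share one set bit out of three set in total, giving a similarity of 1/3.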






If the app is heavily obfuscated, Juxtapp
may not perform well



Use of third-party libraries can add a lot of
noise and adversely affect the similarity score


Who wrote it?



Identify an anonymous author by comparing
his/her writing style against a corpus of texts
of known authorship




The primary application has shifted from the
literary domain to forensics: terrorist threats,
harassment


2.4 million posts from 100,000 blogs (almost
a billion words)



Stylometry: identify an author based on
writing style



Are N-gram techniques suitable?


Not really, because they reveal more about the
context than about the author




Prepare test set and training set



Build a classifier with the training set



Test the classifier with the test set
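The three steps above can be sketched with a toy nearest-neighbor classifier. Everything in this example is hypothetical: the feature set (relative frequencies of five function words), the two invented authors and their texts, and the choice of a 1-nearest-neighbor rule with Euclidean distance. It only illustrates the train/test workflow and is not the classifier or feature set used in the paper.

```python
import math
from collections import Counter

# Hypothetical feature set: relative frequencies of a few function words.
FUNCTION_WORDS = ["the", "of", "and", "i", "you"]


def features(text):
    """Map a text to a vector of function-word frequencies."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def train(labeled_texts):
    """'Training' for 1-nearest-neighbor just stores the labeled vectors."""
    return [(features(text), author) for text, author in labeled_texts]


def predict(model, text):
    """Return the author of the closest training vector."""
    vec = features(text)
    return min(model, key=lambda item: math.dist(item[0], vec))[1]


def accuracy(model, test_set):
    """Fraction of test texts attributed to the correct author."""
    hits = sum(predict(model, text) == author for text, author in test_set)
    return hits / len(test_set)


# Invented toy data, for illustration only.
TRAIN = [
    ("the history of the empire and the fall of the city", "alice"),
    ("i think you know what i mean when i say you", "bob"),
]
TEST = [
    ("the rise of the kingdom and the end of the war", "alice"),
    ("you said i could go if i asked you", "bob"),
]
```

Calling `accuracy(train(TRAIN), TEST)` evaluates the stored model on texts it has not seen, which is the role of the test set.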



Which features should be considered?

Syntax tree by Stanford parser


Yule’s K



K = 10000 * (M - N) / (N * N)


N = total number of words in the text

M = ∑ i * i * V_i, summed over i,

where V_i is the number of distinct words
that occur i times
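A short sketch of the Yule's K computation, following the definitions above. The function name and the whitespace-tokenized word list as input are assumptions for the example.

```python
from collections import Counter


def yules_k(words):
    """Yule's K for a list of word tokens.

    N is the total number of tokens, V_i the number of distinct words
    occurring exactly i times, and M the sum of i * i * V_i over all i.
    """
    n = len(words)
    if n == 0:
        raise ValueError("Yule's K is undefined for an empty text")
    word_counts = Counter(words)                   # word -> occurrences
    freq_of_freq = Counter(word_counts.values())   # i -> V_i
    m = sum(i * i * v for i, v in freq_of_freq.items())
    return 10000 * (m - n) / (n * n)
```

For the tokens "a a b", N = 3, V_1 = 1 and V_2 = 1, so M = 1 + 4 = 5 and K = 10000 * 2 / 9. A text in which every word occurs once has M = N and therefore K = 0.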


In 20% of cases the classifiers can correctly
identify an anonymous author given a corpus
of texts from 100,000 authors



In 35% of cases the correct author is one of
the top 20 guesses


Malware author identification from:



Plain-text source code



Binary executables



Intermediate code