
Polonium: Tera-Scale Graph Mining for Malware Detection
(Patent Pending)

Polo Chau
Machine Learning Dept

Carey Nachenberg
Vice President & Fellow
(my boss at Symantec)

Jeffrey Wilhelm
Principal Software Engr

Adam Wright
Software Engineer

Prof. Christos Faloutsos
Computer Science Dept
(my advisor)


Anti-Virus Software…


Detecting Malware

Traditional malware detection approaches rely on signatures

1. Collect malware samples
2. Security experts generate signatures from samples
3. Signatures distributed to users’ computers as updates

How to handle the increasingly common “zero-day” malware?

Many new or unknown malware
Signature-based approach does not work

No samples → No signatures → No detection



Symantec’s New Reputation-Based Approach

World’s leading personal security software provider

Computes a reputation score for every application; protects users from those with poor reputation

Leverages terabytes of data anonymously contributed by the millions of participants of the worldwide Norton Community Watch program


Uses an ensemble of machine learning and data mining algorithms, plus many other detection modules, to compute application reputations

Polonium is a new malware detection technology
that I helped create in Fall 2009 at Symantec as an intern
Being incorporated into their products
Patent pending



Related Work
(briefly)

Existing research: detects specific types of malware (e.g., worms, rootkits, trojans)
Polonium: detects all types of malware; instances similar to those already flagged by Symantec

Existing research: considers only malware’s intrinsic properties
Polonium: leverages external properties (e.g., considers machines that downloaded the files)

Existing research: small dataset; few malware samples
Polonium: huge dataset; many malware samples

Existing research: relaxed false-positive rate requirement (e.g., >10%)
Polonium: strict false-positive rate requirement (1%)

Existing research has used many familiar techniques, e.g., Naïve Bayes, SVM, decision trees


POLONIUM: Propagation Of Leverage Of Network Influence Unearths Malware


The Data

60+ terabytes of data anonymously contributed by participants of the worldwide Norton Community Watch program
>50 million machines
>900 million executable files

Constructed a machine-file bipartite graph (0.2 TB+)
~1 billion nodes (machines and files)
~37 billion edges


Terminology

File ground truth: file label, good or bad, assigned by security experts at Symantec
Known-good file: file with good ground truth
Known-bad file: file with bad ground truth
Positive: malware instance, infected file
True Positive: malware correctly identified as bad
False Positive: a good file wrongly identified as bad
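
In terms of counts, the detection and false-alarm rates reported later have the standard definitions below (TP, FN, FP, TN denote true/false positives/negatives; this is just the usual formulation, not anything specific to Polonium):

    \text{TPR} = \frac{TP}{TP + FN} \qquad\qquad \text{FPR} = \frac{FP}{FP + TN}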


The Malware Detection Problem

Given
1. a billion-node machine-file bipartite graph
2. prior knowledge about some files and machines’ goodness

Treat each file i as a random variable X_i = {x_g, x_b}
x_g is the good label, P(x_g) is file goodness
x_b is the bad label, P(x_b) is file badness

Goal: find file goodness P(X_i = x_g) for each file i



Since goodness + badness = 1, just consider goodness

First describe the domain knowledge to be incorporated, then the Polonium algorithm that computes file goodness

1. Prior file reputation


Symantec maintains a ground truth database of known-good and known-bad files

Correlates prior file reputation with file prevalence


[Figure: file prevalence distribution, split into “unknown” files and “known” files; e.g., set a known-good file’s prior to 0.9]

Intuition:
good files appear on many machines;
bad files appear on few machines
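
As a rough illustration only, here is a minimal Python sketch of how file priors could be assigned from ground truth and prevalence. The 0.9 / 0.1 / 0.5 values come from the toy example later in the deck; the prevalence-based mapping for unknown files is purely hypothetical, since Symantec’s actual formula is not given in these slides.

```python
import math

def file_prior(ground_truth, prevalence=None):
    """Illustrative prior goodness P(x_g) for a file.

    ground_truth: 'good', 'bad', or None (unknown)
    prevalence:   number of machines reporting the file
    """
    if ground_truth == 'good':
        return 0.9                      # known-good file -> prior [0.9, 0.1]
    if ground_truth == 'bad':
        return 0.1                      # known-bad file  -> prior [0.1, 0.9]
    if prevalence is not None:
        # Purely hypothetical mapping: nudge the prior above 0.5 as the file
        # appears on more machines (good files appear on many machines).
        # The real prevalence-based prior is Symantec's and is not given here.
        return min(0.9, 0.5 + 0.05 * math.log10(1 + prevalence))
    return 0.5                          # unknown file -> uninformative [0.5, 0.5]
```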

2. Prior machine reputation


Computed using Symantec’s proprietary formula; takes into account multiple anonymous aspects of a machine’s usage and behavior

Machine reputation is a value between 0 and 1

Intuitively, files associated with a machine with high reputation are more likely to be good


3. “Homophilic” machine-file relationships

Also known as “guilt-by-association”

Bad files more likely appear on low-reputation machines
Good files more likely appear on high-reputation machines

Example edge potential between a file and a machine:

                Machine good   Machine bad
    File good       0.6            0.4
    File bad        0.4            0.6

Recap: Incorporating Domain Knowledge


How to infer the reputation of an unknown file, using its neighbors’ (and their neighbors’) reputations?

Polonium adapts the Belief Propagation algorithm.


Computing Node Reputation/Belief

Node belief ≈ P(x_i):

    b_i(x_i) = k \, \phi_i(x_i) \prod_{j \in N(i)} m_{ji}(x_i)

\phi_i(x_i): prior node belief
m_{ji}: message from neighboring node j (the neighbor’s opinion about the node’s reputation)
k: normalization constant

(same for file nodes and machine nodes)


Generating the message sent from node i to node j:

    m_{ij}(x_j) = \sum_{x_i} \psi_{ij}(x_i, x_j) \, \phi_i(x_i) \prod_{k \in N(i) \setminus \{j\}} m_{ki}(x_i)

\psi_{ij}: propagation function (edge potential)
\phi_i(x_i) \prod_k m_{ki}(x_i): ~node i’s belief, excluding j’s own previous message

We choose ε = 0.001 to preserve minute probability differences

(same for file → machine and machine → file messages)

Example propagation function:

            Good   Bad
    Good    0.6    0.4
    Bad     0.4    0.6

Assigning Prior Probabilities


[Figure: toy machine-file bipartite graph with machines A, B, C and files 1, 2, 3, 4, annotated with prior probabilities (machines: 0.6, 0.45, 0.35; files: 0.9, 0.1, 0.5, 0.5)]

Machine nodes use (proprietary) machine reputations as priors,
e.g., [0.6, 0.4] means the machine reputation is 0.6

Known-good file: [0.9, 0.1]
Known-bad file: [0.1, 0.9]
Unknown file: [0.5, 0.5]

All messages are initialized to [0.5, 0.5], e.g., m_A1 = [0.5, 0.5], m_1A = [0.5, 0.5]

Propagate Machine → File Messages


[Figure: toy graph after propagating machine → file messages; the file reputations are updated, e.g., from 0.9, 0.1, 0.5, 0.5 to roughly 0.92, 0.06, 0.58, 0.38]

Propagation function used in this example:

            Good   Bad
    Good    0.9    0.1
    Bad     0.1    0.9

Propagate File → Machine Messages


[Figure: toy graph after propagating file → machine messages; the machine reputations are updated in turn (e.g., to roughly 0.87 and 0.81), using the same propagation function]

Algorithm Termination

Ideally, the algorithm stops when reputations converge
Theoretically, there is NO guarantee this will happen
Empirically, we run for a fixed number of iterations (we used 7)

Upon completion, we have reputation scores for all files and machines; we only want file reputations
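
To make the propagation on the preceding slides concrete, below is a minimal Python sketch of the two-phase update (machine → file, then file → machine) on the toy graph, run for a fixed number of iterations as described above. The priors and the [0.9, 0.1] propagation matrix are taken from the worked example; the exact toy edge set and which node carries which prior are my guesses from the figures, and the production system's message-passing details are not specified in the slides, so treat this as an illustration of standard belief propagation rather than the shipped implementation.

```python
import numpy as np

# Toy machine-file bipartite graph: machines A, B, C; files 1-4.
# The exact toy edge set is my assumption; it is not readable from the figures.
edges = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("C", 3), ("C", 4)]

# Priors [P(good), P(bad)]: known-good [0.9, 0.1], known-bad [0.1, 0.9],
# unknown [0.5, 0.5]; machine priors come from machine reputation (e.g., 0.6).
prior = {
    "A": np.array([0.60, 0.40]),
    "B": np.array([0.45, 0.55]),
    "C": np.array([0.35, 0.65]),
    1: np.array([0.9, 0.1]),
    2: np.array([0.1, 0.9]),
    3: np.array([0.5, 0.5]),
    4: np.array([0.5, 0.5]),
}

# Homophilic propagation matrix psi[x_i, x_j] from the worked example.
psi = np.array([[0.9, 0.1],
                [0.1, 0.9]])

# Messages m[(i, j)] = message from node i to node j, initialized to [0.5, 0.5].
msg = {}
for m_node, f_node in edges:
    msg[(m_node, f_node)] = np.array([0.5, 0.5])
    msg[(f_node, m_node)] = np.array([0.5, 0.5])

def neighbors(node):
    return [f if m == node else m for m, f in edges if node in (m, f)]

def send(i, j):
    """Sum-product message from node i to node j."""
    prod = prior[i].copy()
    for k in neighbors(i):
        if k != j:
            prod *= msg[(k, i)]
    out = psi.T @ prod          # sum over x_i of psi(x_i, x_j) * prod(x_i)
    return out / out.sum()      # normalize to avoid underflow

for _ in range(7):              # fixed number of iterations, as in the slides
    # Phase 1: machine -> file messages (depend only on last file -> machine messages)
    msg.update({(m, f): send(m, f) for m, f in edges})
    # Phase 2: file -> machine messages
    msg.update({(f, m): send(f, m) for m, f in edges})

# Final beliefs (reputations); we only care about the files.
for f in (1, 2, 3, 4):
    b = prior[f] * np.prod([msg[(k, f)] for k in neighbors(f)], axis=0)
    b /= b.sum()
    print(f"file {f}: goodness = {b[0]:.3f}")
```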


Experiments

Evaluated with the full machine-file bipartite graph
~1 billion nodes (>900M files, >50M machines)
~37 billion edges
Largest file-submission graph constructed and analyzed

Evaluated with 1/10 of the ground truth files; 9/10 used for setting file priors

Run on 64-bit Red Hat Linux with 4 quad-core processors and 256GB RAM



One-Iteration Results
for files reported by four or more machines


84.9% True Positive Rate (% of malware correctly identified)
1% False Positive Rate (% of non-malware wrongly labeled as malware)

In the computer security industry, a high TPR is important; a low FPR is critical!


Multi-Iteration Results
for files reported by four or more machines


2.2% increase in TPR, at the same 1% FPR
Diminishing return over iterations

[Figure: TPR vs. iteration number, iterations 1 through 7]


Scalability: Running Time Per Iteration

3 hours per iteration for the full data with 37 billion edges


Optimization #1

Doubles speed by computing half of the messages

File → Machine messages depend ONLY on Machine → File messages from the previous iteration


Optimization #2: Externalize “Edge File”

Observation: random access to graph edges or edge messages is NOT necessary; sequential access is sufficient

Use an adjacency list layout to store messages


e.g., [F → M0] [F → M0] [F → M1] [F → M1] [F → M2] [F → M2] …
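
A minimal sketch of what such an externalized, sequentially-read “edge file” could look like: fixed-size records laid out in adjacency-list order, so each iteration streams the file once from start to end. The record layout, field choices, and file name are illustrative assumptions, not Symantec’s actual format.

```python
import struct

# One record per file->machine edge: (machine_id, msg_good, msg_bad).
RECORD = struct.Struct("<qdd")   # int64 + two float64, 24 bytes per record

def write_messages(records, path="edge_messages.bin"):
    """Write messages in adjacency-list order: all edges of file 0, then file 1, ..."""
    with open(path, "wb") as f:
        for machine_id, (msg_good, msg_bad) in records:
            f.write(RECORD.pack(machine_id, msg_good, msg_bad))

def stream_messages(path="edge_messages.bin"):
    """Read messages strictly sequentially, one fixed-size record at a time."""
    with open(path, "rb") as f:
        while chunk := f.read(RECORD.size):
            machine_id, msg_good, msg_bad = RECORD.unpack(chunk)
            yield machine_id, (msg_good, msg_bad)
```

Because every pass over the messages is a single sequential scan, the working set never has to fit in RAM, which is what makes the per-iteration cost predictable on a graph with tens of billions of edges.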

Scaling-up Computation Further

Belief Propagation, and hence Polonium, can be implemented as matrix-vector multiplication, leveraging research on parallel computation, architecture, etc.
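
A hedged sketch of the matrix-vector view: one propagation step over all edges can be written as a sparse matrix-vector product, which is exactly what parallel and GPU libraries optimize. The update below is a simplified, linearized stand-in for full belief propagation (not the exact Polonium update); it only illustrates where the matvec appears.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy bipartite adjacency: rows = machines, cols = files (1 where machine has the file).
A = csr_matrix(np.array([[1, 1, 0, 0],
                         [0, 1, 1, 0],
                         [0, 0, 1, 1]], dtype=float))

h = 0.001                                            # homophily strength (epsilon)
file_prior = np.array([0.9, 0.1, 0.5, 0.5]) - 0.5    # centered file priors
machine_prior = np.array([0.6, 0.45, 0.35]) - 0.5    # centered machine priors

# Linearized propagation: each step is one sparse matrix-vector multiply per direction.
file_score = file_prior.copy()
for _ in range(7):                                   # fixed iteration count, as above
    machine_score = machine_prior + h * (A @ file_score)    # files -> machines
    file_score = file_prior + h * (A.T @ machine_score)     # machines -> files

print("file goodness (centered):", np.round(file_score, 4))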


Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining (Xintian Yang)

Inference of Beliefs on Billion-Scale Graphs (U Kang)

Conclusions

Polonium is a new and effective reputation-based malware detection technology adapting the Belief Propagation algorithm: 87% TPR at 1% FPR

Evaluated on a 37-billion-edge machine-file bipartite graph, the largest file-submission dataset ever published
60 TB raw data
0.2 TB for the derived graph

Scalable & fast
Optimization doubles speed, reduces storage


Polonium: Tera-Scale Graph Mining for Malware Detection
(Patent Pending)




Data Statistics: Machine-Submission Distribution


Data Statistics: File-Prevalence Distribution


The “Right” Algorithm

Easy to incorporate domain knowledge
Must be effective: high TPR at low FPR
Easy to understand (a “whitebox” method)


Domain Knowledge to Incorporate

1. Prior file reputation
2. Prior machine reputation
3. “Homophilic” machine-file relationships


The Polonium Algorithm

An adaptation of the Belief Propagation algorithm

Given
1. a billion-node machine-file bipartite graph
2. prior knowledge about some files and machines’ goodness
3. the intuition of “guilt-by-association”

Treat each node i as a two-state random variable X_i = {x_g, x_b}
x_g is the good label, P(x_g) is node goodness
x_b is the bad label, P(x_b) is node badness

Goal: find file goodness P(X_i = x_g) for each file i
(we don’t care about machines)


Symantec

World’s leading security software provider

Released 1.8 million signatures in 2008, resulting in 200 million detections

Estimated that the release rate of malicious or unwanted software would exceed that of legitimate software
(2008 Symantec Security Threat Report)


[Figure: release rate of malicious or unwanted software > that of legitimate software]