PPT - University of Delaware


Oct 15, 2013


CISC 879 - Machine Learning for Solving Systems Problems


Presented by: Satyajeet

Dept of Computer & Information Sciences

University of Delaware


Automatic Analysis of Malware Behavior
using Machine Learning

Authors: Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz



Abstract & Introduction



Malware

Poses a major threat to the security of computer systems.

Very diverse, e.g. viruses, internet worms, trojan horses.

Amount of malware: millions of hosts infected.

Obfuscation and polymorphism impede detection at the file level.

Dynamic analysis helps to characterize malware and to defend against it.




Abstract & Introduction (contd.)


A framework for the automatic analysis of malware behavior using machine learning.

Clustering: the framework automatically identifies novel classes of malware with similar behavior.

Classification: unknown malware is assigned to these discovered classes.

An incremental approach to behavior-based analysis builds on both clustering and classification.



Automatic Analysis of Malware Behavior


Framework steps and procedure


Malware binaries are executed and monitored in a sandbox environment. A report is generated on the system calls and their arguments.

The sequential reports are embedded in a vector space where each dimension is associated with a behavioral pattern.

ML techniques are then applied to the embedded reports to identify and classify malware.

Incremental analysis progresses by alternating between clustering and classification.



Report representation


Reports can be textual or XML.

These formats are human-readable and suitable for computing general statistics, but not efficient for automatic analysis.

Hence MIST (Malware Instruction Set), inspired by instruction sets used in processor design.







MIST





Category of system calls


Operation: reflects a particular system call

Arguments are given as argblocks.



Sandbox and MIST representation



Representation


These sequential reports identify typical malware behavior, such as changing registry keys or modifying system files.

They are still not suitable for efficient analysis techniques. Hence the need to embed the behavior reports in a vector space using instruction q-grams.

This embedding expresses the similarity of behavior geometrically, by calculating the distance between embedded reports.
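The q-gram embedding and distance calculation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary feature per q-gram (present or absent), unit-length normalization, and Euclidean distance, and the instruction names in the usage lines are made up for the example.

```python
from math import sqrt


def qgram_embed(instructions, q=2):
    """Embed a sequence of instructions as a sparse, unit-length q-gram vector.

    Assumption: each distinct q-gram is one dimension with a binary value,
    and the vector is scaled to Euclidean norm 1.
    """
    grams = {tuple(instructions[i:i + q])
             for i in range(len(instructions) - q + 1)}
    if not grams:
        return {}
    weight = 1.0 / sqrt(len(grams))
    return {gram: weight for gram in grams}


def distance(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = u.keys() | v.keys()
    return sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


# Hypothetical instruction sequences, for illustration only.
report_a = ["load_dll", "open_key", "write_file"]
report_b = ["open_key", "write_file", "create_proc"]
report_c = ["connect", "send", "recv"]

vec_a, vec_b, vec_c = (qgram_embed(r) for r in (report_a, report_b, report_c))
print(distance(vec_a, vec_b))  # reports sharing a 2-gram are closer ...
print(distance(vec_a, vec_c))  # ... than reports sharing none
```

With unit-length vectors the distance is bounded, so one threshold can be applied to reports of very different lengths.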




Clustering and Classification


Once the reports are embedded in the vector space, they are ready for ML techniques.

Clustering of behavior: identifies classes of malware with similar behavior.

Classification of behavior: assigns malware to known classes of behavior.

What allows us to do this? Malware binaries typically belong to families of similar variants with similar behavior patterns.



Algorithms



Prototype extraction


An iterative algorithm that extracts a small set of prototypes from the set of reports. The first prototype is chosen at random.
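One way to picture the iterative extraction is a greedy farthest-first selection: start from a random report, then keep adding the report that is farthest from all current prototypes until every report lies within some distance of a prototype. The sketch below assumes exactly that scheme; the stopping threshold `max_dist`, the `seed` parameter and the dense tuple vectors are assumptions for illustration and are not taken from the slides.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def extract_prototypes(points, max_dist, seed=0):
    """Greedy prototype extraction over a non-empty list of vectors.

    Returns (prototype_indices, assignment), where assignment[i] is the
    index of the prototype closest to points[i].
    """
    rng = random.Random(seed)
    first = rng.randrange(len(points))      # first prototype chosen at random
    prototypes = [first]
    # nearest[i]: distance from points[i] to its closest prototype so far
    nearest = [dist(p, points[first]) for p in points]
    assignment = [first] * len(points)

    while max(nearest) > max_dist:
        # The report farthest from all prototypes becomes the next prototype.
        far = max(range(len(points)), key=nearest.__getitem__)
        prototypes.append(far)
        for i, p in enumerate(points):
            d = dist(p, points[far])
            if d < nearest[i]:
                nearest[i] = d
                assignment[i] = far
    return prototypes, assignment


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
prototypes, assignment = extract_prototypes(points, max_dist=1.0)
print(prototypes, assignment)
```

A smaller `max_dist` yields more prototypes and a more faithful summary of the corpus; a larger one compresses the corpus further.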


Clustering using Prototypes


At the beginning, each prototype forms its own cluster.

The algorithm repeatedly determines and merges the nearest pair of clusters.
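The merging procedure above corresponds to agglomerative clustering over the prototypes. The following sketch assumes complete linkage (the distance between two clusters is the distance between their two farthest members) and a stopping threshold `max_dist`; the slides name neither, so both are assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def cluster_prototypes(prototypes, max_dist):
    """Merge the nearest pair of clusters until no pair is within max_dist."""
    clusters = [[p] for p in prototypes]    # each prototype starts alone

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist(x, y) for x in a for y in b)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) > max_dist:
            break                            # nearest pair is too far apart
        clusters[i].extend(clusters[j])
        del clusters[j]                      # j > i, so index i stays valid
    return clusters


protos = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
clusters = cluster_prototypes(protos, max_dist=2.0)
print(clusters)
```

Because only the prototypes are clustered, the quadratic cost of comparing all pairs applies to a small subset of the reports rather than to the full corpus.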


Classification using Prototypes


Allows learning to discriminate between classes of malware.



Algorithms Contd..



For each report, the algorithm determines the nearest prototype among the clusters in the training data. If the report lies within a given radius of that prototype, it is assigned to the prototype's cluster.

Otherwise the report is rejected and held back for later incremental analysis.
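The classification step with rejection amounts to a nearest-prototype lookup followed by a radius check. A minimal sketch, assuming prototypes are stored as (vector, label) pairs and the radius is one global parameter `max_dist` (both are illustrative choices, as are the family labels):

```python
from math import dist  # Euclidean distance, Python 3.8+


def classify(report, prototypes, max_dist):
    """Return the label of the nearest prototype, or None if rejected.

    prototypes: list of (vector, label) pairs from the training data.
    """
    vector, label = min(prototypes, key=lambda p: dist(report, p[0]))
    if dist(report, vector) <= max_dist:
        return label
    return None  # held back for later incremental analysis


known = [((0.0, 0.0), "family_a"), ((10.0, 10.0), "family_b")]
print(classify((0.5, 0.0), known, max_dist=1.0))   # close to a prototype
print(classify((5.0, 5.0), known, max_dist=1.0))   # too far from both
```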



Incremental analysis


The reports to be analyzed are received from a source.

They are first classified using the prototypes of known clusters.

Variants of known malware are thereby identified and filtered out of further analysis.

Prototypes are then extracted from the remaining reports and clustered.
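Put together, the incremental loop might look like the sketch below. It is a simplified illustration under several assumptions: reports arrive in batches of already embedded vectors, the three thresholds (`proto_dist`, `cluster_dist`, `reject_dist`) are hypothetical parameter names, the first prototype of each batch is simply the first rejected report, and clusters are merged with complete linkage.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest(vec, prototypes):
    """Return (distance, label) of the closest labeled prototype."""
    return min((dist(vec, p), label) for p, label in prototypes)


def extract_prototypes(vectors, max_dist):
    """Greedily add the farthest vector until all vectors are covered."""
    protos = [vectors[0]]
    while True:
        far = max(vectors, key=lambda v: min(dist(v, p) for p in protos))
        if min(dist(far, p) for p in protos) <= max_dist:
            return protos
        protos.append(far)


def cluster(protos, max_dist):
    """Merge the nearest pair of clusters while it is within max_dist."""
    clusters = [[p] for p in protos]
    while len(clusters) > 1:
        pairs = [(max(dist(x, y) for x in clusters[i] for y in clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > max_dist:
            break
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


def incremental_analysis(batches, proto_dist, cluster_dist, reject_dist):
    known = []       # labeled prototypes of all clusters found so far
    assigned = []    # (vector, cluster label) for every processed report
    next_label = 0
    for batch in batches:
        # 1. Classify incoming reports against known prototypes.
        rejected = []
        for vec in batch:
            if known:
                d, label = nearest(vec, known)
                if d <= reject_dist:
                    assigned.append((vec, label))
                    continue
            rejected.append(vec)
        if not rejected:
            continue
        # 2. Extract prototypes from the remaining reports and cluster them.
        new = []
        for group in cluster(extract_prototypes(rejected, proto_dist), cluster_dist):
            new.extend((p, next_label) for p in group)
            next_label += 1
        # 3. Label the remaining reports and remember the new prototypes.
        assigned.extend((vec, nearest(vec, new)[1]) for vec in rejected)
        known.extend(new)
    return known, assigned


batches = [[(0.0, 0.0), (0.2, 0.0), (9.0, 9.0)],
           [(0.1, 0.1), (9.1, 9.0), (20.0, 20.0)]]
known, assigned = incremental_analysis(batches, proto_dist=1.0,
                                       cluster_dist=2.0, reject_dist=1.0)
print(len(known), dict(assigned))
```

In the second batch only the report far from every known prototype triggers new prototype extraction; the other two are handled by classification alone, which is what keeps the incremental scheme cheap.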



Experiments and Results



Evaluating components


Prototype extraction: evaluated using precision, recall and compression.

Precision: 0.99 when the corpus is compressed by 2.9% and 7%.

Clustering: evaluated using F-measure.

F-measure in experiments: MIST 1 = 0.93 and MIST 2 = 0.95, better than the 0.881 of previous related work.

Classification

F-measure in experiments: MIST 1 = 0.96 and MIST 2 = 0.99.
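The slides do not define the measures, so the following sketch assumes definitions that are common for evaluating a clustering against known labels: precision counts, for each cluster, the reports that carry its majority label; recall counts, for each true class, the reports that fall into the single cluster holding most of that class; the F-measure is their harmonic mean. The toy labels are invented for illustration.

```python
from collections import Counter, defaultdict


def cluster_f_measure(cluster_ids, true_labels):
    """Return (precision, recall, f_measure) of a clustering."""
    n = len(true_labels)
    by_cluster = defaultdict(Counter)  # cluster -> counts of true labels
    by_label = defaultdict(Counter)    # true label -> counts of clusters
    for c, y in zip(cluster_ids, true_labels):
        by_cluster[c][y] += 1
        by_label[y][c] += 1

    # Precision: how pure each cluster is with respect to one class.
    precision = sum(max(cnt.values()) for cnt in by_cluster.values()) / n
    # Recall: how well each class is kept together in one cluster.
    recall = sum(max(cnt.values()) for cnt in by_label.values()) / n
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


cluster_ids = [0, 0, 0, 1, 1, 2]
true_labels = ["a", "a", "b", "b", "b", "b"]
p, r, f = cluster_f_measure(cluster_ids, true_labels)
print(round(p, 3), round(r, 3), round(f, 3))
```

Splitting one class across many clusters keeps precision high but lowers recall, while lumping classes together does the opposite, which is why the combined F-measure is the figure reported.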




Conclusion



A new framework is introduced that overcomes several deficiencies of previous approaches.

The framework is learning-based.

The framework can be implemented in practice.

Steps: collect malware, study it in a sandbox environment, embed the observed behavior in a vector space, and apply learning algorithms (clustering and classification).

This process is efficient and learns automatically after the initial setup and run.









Thank you !