Intelligent Detection of Malicious Script Code

CS194, 2007-08

Benson Luk
Eyal Reuveni
Kamron Farrokh

Advisor: Adnan Darwiche

Introduction

3-quarter project

Sponsored by Symantec

Main focuses:

- Web programming
- Database development
- Data mining
- Artificial intelligence


Overview

- Current security software catches known malicious attacks based on a list of signatures
- The problem: new attacks are being created every day
- Developers need to create new signatures for these attacks
- Until these signatures are made, users are vulnerable to these attacks

Overview (cont.)

- Our objective is to build a system that can effectively detect malicious activity without relying on signature lists
- The goal of our research is to see if and how artificial intelligence can discern malicious code from non-malicious code

Data Gathering

- Gather data using a web crawler (probably a modified crawler based on the Heritrix software)
- The crawler scours a list of known "safe" websites
- It will also branch out into websites linked to by these sites for additional data, if necessary
- While this is performed, we will gather key information on the scripts: function calls, parameter values, return values, etc. (see the sketch after this list)
- This will be done in Internet Explorer
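To make the gathering step concrete, here is a minimal Python sketch of the kind of extraction involved: fetch a page, pull out its script bodies, and record anything that looks like a function call. This is an illustration only; it uses a plain HTTP fetch and a crude regex instead of the Heritrix-based crawler and Internet Explorer instrumentation the project actually calls for, and `https://example.com` stands in for the real whitelist.

```python
# Minimal sketch of script-feature extraction. Assumes a plain HTTP fetch
# and regex matching, NOT the project's Heritrix/IE instrumentation.
import re
import urllib.request
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    """Accumulates the bodies of <script> elements on a page."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts[-1] += data

CALL_RE = re.compile(r"\b([A-Za-z_$][\w$]*)\s*\(")  # crude "name(" matcher

def script_calls(url):
    """Return the function-call names observed in a page's inline scripts."""
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    collector = ScriptCollector()
    collector.feed(html)
    return [name for body in collector.scripts for name in CALL_RE.findall(body)]

if __name__ == "__main__":
    for site in ["https://example.com"]:  # stand-in for the "safe" whitelist
        print(site, script_calls(site))
```

A static scrape like this cannot observe parameter values or return values at runtime, which is exactly why the plan is to instrument Internet Explorer instead.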

Data Storage

- When data is gathered, it will need to be stored for the analysis that takes place later
- We need to develop a database that can efficiently store the script activity of tens of thousands (possibly millions) of websites; a schema sketch follows
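As a rough illustration of what such a database might look like, here is a small SQLite schema sketch. The table and column names are assumptions made for the example, not the project's actual design; the key idea is that each call row keeps its position (`seq`) in the trace so call order is not lost.

```python
# Illustrative SQLite layout for per-site script activity; tables and
# columns are assumptions for this sketch, not the project's design.
import sqlite3

conn = sqlite3.connect("script_activity.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS site (
    site_id INTEGER PRIMARY KEY,
    url     TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS call (
    call_id  INTEGER PRIMARY KEY,
    site_id  INTEGER NOT NULL REFERENCES site(site_id),
    seq      INTEGER NOT NULL,   -- position in the site's call sequence
    function TEXT NOT NULL,      -- e.g. "eval", "document.write"
    params   TEXT,               -- serialized parameter values
    retval   TEXT                -- serialized return value
);
CREATE INDEX IF NOT EXISTS idx_call_function ON call(function);
""")

def record_visit(url, calls):
    """Store one site's ordered trace: calls = [(function, params, retval), ...]."""
    conn.execute("INSERT OR IGNORE INTO site(url) VALUES (?)", (url,))
    site_id = conn.execute(
        "SELECT site_id FROM site WHERE url = ?", (url,)
    ).fetchone()[0]
    conn.executemany(
        "INSERT INTO call(site_id, seq, function, params, retval) "
        "VALUES (?, ?, ?, ?, ?)",
        [(site_id, i, fn, p, r) for i, (fn, p, r) in enumerate(calls)],
    )
    conn.commit()
```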

Data Analysis

- Using information from the database, deduce normal behavior
- Find a robust algorithm for generating a heuristic for acceptable behavior (a minimal sketch follows this list)
- The goal is to later weigh this heuristic against scripts to determine abnormal (and thus potentially malicious) behavior
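One simple instance of such a heuristic, sketched here under the assumption that the corpus maps each whitelisted site to its observed function calls: estimate how common each call is on the safe sites, then score a new script by the average rarity (surprisal) of its calls. This frequency model is illustrative, not the algorithm the project has committed to.

```python
# A minimal frequency-based normality heuristic. Assumes the corpus is a
# mapping {url: [call, ...]} gathered from the whitelist crawl.
import math
from collections import Counter

def train(corpus):
    """Count how often each function call appears across the safe sites."""
    counts = Counter(call for calls in corpus.values() for call in calls)
    return counts, sum(counts.values())

def anomaly_score(calls, counts, total):
    """Average surprisal of a script's calls; higher means less 'normal'."""
    score = 0.0
    for call in calls:
        p = (counts[call] + 1) / (total + len(counts))  # add-one smoothing
        score += -math.log(p)
    return score / max(len(calls), 1)

# Hypothetical corpus for demonstration only.
corpus = {"https://example.com": ["getElementById", "addEventListener"]}
counts, total = train(corpus)
print(anomaly_score(["eval", "unescape", "eval"], counts, total))
```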

Challenges

Gathering
- How do we grab relevant information from scripts?
- How deep do we search?
  - Good websites may inadvertently link to malicious ones
  - The traversal graph is effectively unbounded

Storage
- In what form should the data be stored?
- We need an efficient way to store data without oversimplifying it
- Example: a simple laundry list of function calls does not take call sequence into account (see the sketch after this list)

Analysis
- What analysis algorithm can handle all of this data?
- How can we ensure that the normality heuristic it generates minimizes false positives and maximizes true positives?
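The call-sequence point above can be made concrete with a short sketch: two scripts that invoke the same functions in a different order are indistinguishable as a flat multiset (the "laundry list") but distinguishable once calls are stored as bigrams. The traces below are made up for the example.

```python
# Sketch: two scripts with identical call *sets* but different call *order*.
# A flat multiset cannot tell them apart; bigrams can. Traces are made up.
from collections import Counter

def bigrams(calls):
    """Count adjacent call pairs, preserving ordering information."""
    return Counter(zip(calls, calls[1:]))

benign     = ["unescape", "document.write", "setTimeout"]
suspicious = ["setTimeout", "unescape", "document.write"]

print(Counter(benign) == Counter(suspicious))   # True  -- laundry-list view
print(bigrams(benign) == bigrams(suspicious))   # False -- sequence view
```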


Milestones

Phase I: Setup
- Set up equipment for research; ensure the whitelist is clean

Phase II: Crawler
- Modify the crawler to grab and output the necessary data so that it can later be stored, and begin crawler activity to collect sample information

Phase III: Database
- Research and develop an effective structure for storing the data, and link it to the web crawler

Phase IV: Analysis
- Research and develop an effective algorithm for learning from massive amounts of data

Phase V: Verification
- Using the web crawler, visit a large volume of websites to ensure that the heuristic generated in Phase IV is accurate (a sketch of this accuracy check follows)

Certain milestones may need to be revisited depending on the results of each phase.
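As a sketch of what the Phase V check might compute, assuming each test site gets an anomaly score from the Phase IV heuristic and a hypothetical ground-truth label: flag sites scoring above a threshold, then report the true-positive and false-positive rates that the Challenges slide asks us to maximize and minimize.

```python
# Sketch of the Phase V accuracy check. Scores, labels, and the threshold
# are hypothetical; real labels would come from the verification crawl.
def evaluate(scored_sites, threshold):
    """scored_sites: [(anomaly_score, is_malicious), ...]"""
    tp = sum(1 for s, mal in scored_sites if s >= threshold and mal)
    fp = sum(1 for s, mal in scored_sites if s >= threshold and not mal)
    pos = sum(1 for _, mal in scored_sites if mal)
    neg = len(scored_sites) - pos
    tpr = tp / pos if pos else 0.0  # want close to 1
    fpr = fp / neg if neg else 0.0  # want close to 0
    return tpr, fpr

print(evaluate([(9.1, True), (2.3, False), (7.8, True), (8.0, False)], 5.0))
```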