Intelligent Detection of Malicious Script Code

CS194, 2007

Benson Luk

Eyal Reuveni

Kamron Farrokh

Advisor: Adnan Darwiche


quarter project

Sponsored by Symantec

Main focuses:

Web programming

Database development

Data mining

Artificial intelligence


Current security software catches
known malicious attacks based on a list of signatures

The problem: New attacks are being
created every day

Developers need to create new signatures
for these attacks

Until these signatures are made, users are
vulnerable to these attacks

Overview (cont.)

Our objective is to build a system that
can effectively detect malicious activity
without relying on signature lists

The goal of our research is to see if and
how artificial intelligence can discern
malicious code from non-malicious code

Data Gathering

Gather data using a web crawler (probably a
modified web crawler based on Heritrix)

Crawler scours a list of known “safe” websites

Will also branch out into websites linked to by
these websites for additional data, if necessary

While this is performed, we will gather key
information on the scripts (function calls,
parameter values, return values, etc.)

This will be done in Internet Explorer
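As a rough illustration of the "key information" mentioned above, the sketch below shows one possible record for a single observed call. The class name `ScriptCallEvent`, its fields, and the JSON-lines output are assumptions made for illustration; the slides do not specify a format, and the actual capture would happen inside the browser rather than in Python.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Any, List


@dataclass
class ScriptCallEvent:
    """One observed call made by a script on a page (hypothetical layout)."""
    page_url: str        # page on which the script ran
    seq: int             # position of the call within the script's execution
    function: str        # name of the function that was called
    args: List[Any] = field(default_factory=list)  # parameter values as observed
    result: Any = None   # return value, if one was captured


def to_json_line(event: ScriptCallEvent) -> str:
    """Serialize one event as a single JSON line for later loading into storage."""
    return json.dumps(asdict(event), sort_keys=True)


# Two made-up events, only to show the shape of the output.
events = [
    ScriptCallEvent("http://example.com/", 0, "document.write", ["<b>hi</b>"]),
    ScriptCallEvent("http://example.com/", 1, "escape", ["a b"], "a%20b"),
]
lines = [to_json_line(e) for e in events]
```

Keeping an explicit `seq` field means the order of calls survives whatever storage format is chosen later.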

Data Storage

Once data is gathered, it must be
stored for later analysis

Need to develop a database that can
efficiently store the script activity of tens
of thousands (possibly
) of

Data Analysis

Using information from the database,
deduce normal behavior

Find a robust algorithm for generating a
heuristic for acceptable behavior

The goal here is to later weigh this
heuristic against scripts to determine
abnormal (and thus potentially malicious)
behavior


How to grab relevant information from scripts?

How deep do we search?

Good websites may inadvertently link to malicious ones

The traversal graph is probably unbounded in size
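One common answer to the depth question is to cap how far the crawler may stray from the whitelist. The sketch below is a plain breadth-first traversal with a depth limit and a visited set; `crawl`, `get_links`, and the toy link graph are hypothetical, and a real crawler would fetch pages instead of reading a dictionary.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl(seeds: Iterable[str],
          get_links: Callable[[str], Iterable[str]],
          max_depth: int) -> Dict[str, int]:
    """Breadth-first traversal from the whitelist, stopping at max_depth.

    Returns each visited URL mapped to its distance from the nearest seed.
    """
    depth_of: Dict[str, int] = {}
    queue = deque()
    for url in seeds:
        if url not in depth_of:
            depth_of[url] = 0
            queue.append(url)
    while queue:
        url = queue.popleft()
        if depth_of[url] >= max_depth:
            continue  # do not follow links out of the deepest allowed level
        for link in get_links(url):
            if link not in depth_of:  # visited check prevents revisits and cycles
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of


# Toy link graph standing in for real pages (made-up data).
toy_web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example", "d.example"],
    "c.example": [],
    "d.example": ["e.example"],
    "e.example": [],
}
visited = crawl(["a.example"], lambda u: toy_web.get(u, []), max_depth=2)
```

With `max_depth=2`, the page three links away from the seed is never fetched. A small limit also reduces, but does not remove, the chance of wandering from a safe site into a malicious one.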


In what form should the data be stored?

Need an efficient way to store data without simplifying it

Example: A simple laundry list of function calls does not
take call sequence into account
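The laundry-list problem can be shown in a few lines. In the sketch below, two invented call traces contain exactly the same functions, so a count of calls cannot tell them apart, while their adjacent call pairs differ. The trace contents are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def bigrams(calls: List[str]) -> List[Tuple[str, str]]:
    """Adjacent pairs of calls, in execution order."""
    return list(zip(calls, calls[1:]))


script_a = ["unescape", "eval", "document.write"]
script_b = ["document.write", "unescape", "eval"]

# An unordered tally of calls sees the two scripts as identical...
same_bag = Counter(script_a) == Counter(script_b)

# ...while a representation that keeps order does not.
same_sequence = bigrams(script_a) == bigrams(script_b)
```

Here `same_bag` is true and `same_sequence` is false, which is the distinction a storage format needs to preserve.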


What analysis algorithm can handle all of this data?

How can we ensure that the normality heuristic it generates
minimizes false positives and maximizes true positives?
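The trade-off in that last question can be measured once each script has an anomaly score and a known label. The sketch below computes the true positive rate and false positive rate at a given cutoff; the function name `rates`, the scores, and the labels are all invented for illustration.

```python
from typing import List, Tuple


def rates(scores: List[float], labels: List[bool],
          threshold: float) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate) when every sample
    with score >= threshold is flagged.

    labels[i] is True when sample i is actually malicious.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


# Made-up scores and labels, purely to show the trade-off.
scores = [0.1, 0.2, 0.6, 0.7, 0.9]
labels = [False, False, False, True, True]

strict = rates(scores, labels, threshold=0.8)  # flags only the highest score
loose = rates(scores, labels, threshold=0.5)   # flags three samples
```

On this toy data the strict cutoff raises no false alarms but misses one of the two malicious samples, while the loose cutoff catches both at the cost of one false alarm. Sweeping the threshold over real data would show where an acceptable balance lies.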


Phase I:

Set up equipment for research, ensure whitelist is clean

Phase II:

Modify crawler to grab and output necessary data so that it
can later be stored and begin crawler activity for sample

Phase III:

Research and develop an effective structure for storing
data and link it to the web crawler

Phase IV:

Research and develop an effective algorithm for learning
from massive amounts of data

Phase V:

Using the web crawler, visit a large volume of websites to
ensure that the heuristic generated in Phase IV is accurate

Certain milestones may need to be revisited depending
on results in each phase