Transcript ppt

Intelligent Detection of
Malicious Script Code
CS194, 2007-08
Benson Luk
Eyal Reuveni
Kamron Farrokh
Advisor: Adnan Darwiche
3-quarter project
Sponsored by Symantec
Main focuses:
 Web programming
 Database development
 Data mining
 Artificial intelligence
Overview
Current security software catches known malicious attacks based on a list of signatures
 The problem: New attacks are being created every day
• Developers need to create new signatures for these attacks
• Until these signatures are made, users are vulnerable to these attacks
Overview (cont.)
Our objective is to build a system that can effectively detect malicious activity without relying on signature lists
 The goal of our research is to see if and how artificial intelligence can discern malicious code from non-malicious code
Data Gathering
Gather data using a web crawler (likely a modified version of the open-source Heritrix crawler)
 Crawler scours a list of known “safe” websites
 Will also branch out into websites linked to by these websites for additional data, if necessary
 While this is performed, we will gather key information on the scripts (function calls, parameter values, return values, etc.)
 This will be done in Internet Explorer
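As an illustration of the kind of per-script records the crawler would emit, here is a minimal sketch in Python. It only pulls inline script bodies out of fetched HTML and lists the function names they invoke; the names `ScriptExtractor` and `script_records` are assumptions for this sketch, and real instrumentation would hook the browser's script engine (Internet Explorer, per the plan) rather than parse source text.

```python
import re
from html.parser import HTMLParser

class ScriptExtractor(HTMLParser):
    """Collect the bodies of inline <script> elements from a page."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []  # raw inline script bodies

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script and data.strip():
            self.scripts.append(data)

# Crude syntactic approximation of "function calls": identifier followed by "(".
CALL_RE = re.compile(r"\b([A-Za-z_$][\w$]*)\s*\(")

def script_records(html):
    """Return one record per inline script: the function names it calls, in order."""
    parser = ScriptExtractor()
    parser.feed(html)
    return [CALL_RE.findall(body) for body in parser.scripts]
```

For example, a page containing `<script>eval(unescape('%61lert'));</script>` yields one record, `["eval", "unescape"]`, preserving call order for later analysis.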
Data Storage
When data is gathered, it will need to be stored for the analysis that will take place
 Need to develop a database that can efficiently store the script activity of tens of thousands (possibly millions) of scripts
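One possible shape for such a database is sketched below using Python's built-in `sqlite3`. The table and column names are assumptions for illustration, not the project's actual schema; the point is that each observed call is keyed by its position in the script's trace, so call order survives for later sequence analysis.

```python
import sqlite3

# In-memory database for the sketch; a real deployment would use a server-class store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE script (
        script_id INTEGER PRIMARY KEY,
        url       TEXT NOT NULL
    );
    CREATE TABLE call (
        script_id INTEGER REFERENCES script(script_id),
        seq       INTEGER,  -- position of the call within the script's trace
        function  TEXT,
        params    TEXT,
        retval    TEXT,
        PRIMARY KEY (script_id, seq)
    );
""")

def record_trace(url, calls):
    """Store one script's trace; calls is an ordered list of (function, params, retval)."""
    cur = conn.execute("INSERT INTO script (url) VALUES (?)", (url,))
    sid = cur.lastrowid
    conn.executemany(
        "INSERT INTO call VALUES (?, ?, ?, ?, ?)",
        [(sid, i, f, p, r) for i, (f, p, r) in enumerate(calls)],
    )
    return sid
```

Keeping the `seq` column (rather than a bag of function names) is what later lets the analysis distinguish *which order* calls happened in, not just *which* calls happened.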
Data Analysis
Using information from the database, deduce normal behavior
 Find a robust algorithm for generating a heuristic for acceptable behavior
 The goal here is to later weigh this heuristic against scripts to determine abnormal (and thus potentially malicious) behavior
• How do we grab relevant information from scripts?
• How deep do we search?
 Good websites may inadvertently link to malicious ones
 The traversal graph is effectively infinite
• In what form should the data be stored?
 Need an efficient way to store data without simplifying it
 Example: A simple laundry list of function calls does not take call sequence into account
• What analysis algorithm can handle all of this data?
• How can we ensure that the normality heuristic it generates minimizes false positives and maximizes true positives?
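To make the laundry-list point concrete, here is a hedged sketch of one possible normality heuristic, not the project's chosen algorithm: model "normal" behavior as the set of call *bigrams* (adjacent call pairs) seen on safe sites, then score a new trace by the fraction of its bigrams never seen in training. Unlike a flat list of function names, bigrams retain call order.

```python
from collections import Counter

def bigrams(trace):
    """Adjacent call pairs of a trace, preserving order."""
    return list(zip(trace, trace[1:]))

def train(normal_traces):
    """Count bigram frequencies across traces gathered from known-safe sites."""
    model = Counter()
    for trace in normal_traces:
        model.update(bigrams(trace))
    return model

def anomaly_score(model, trace):
    """Fraction of this trace's bigrams never observed during training (0.0-1.0)."""
    grams = bigrams(trace)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in model)
    return unseen / len(grams)
```

A trace drawn from training behavior scores 0.0, a wholly novel sequence scores 1.0, and a threshold in between would flag abnormal (and thus potentially malicious) scripts. Choosing that threshold is exactly the false-positive/true-positive trade-off raised above.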
Phase I: Setup
• Set up equipment for research, ensure whitelist is clean
Phase II: Crawler
• Modify crawler to grab and output necessary data so that it can later be stored, and begin crawler activity for a sample
Phase III: Database
• Research and develop an effective structure for storing data and link it to the web crawler
Phase IV: Analysis
• Research and develop an effective algorithm for learning from massive amounts of data
Phase V: Verification
• Using the web crawler, visit a large volume of websites to ensure that the heuristic generated in Phase IV is accurate
Certain milestones may need to be revisited depending on results in each phase