Intelligent Detection of
Malicious Script Code
CS194, 2007-08
Benson Luk
Eyal Reuveni
Kamron Farrokh
Advisor: Adnan Darwiche
Introduction
3-quarter project
Sponsored by Symantec
Main focuses:
Web programming
Database development
Data mining
Artificial intelligence
Overview
Current security software catches
known malicious attacks based on a list of
signatures
The problem: New attacks are being
created every day
• Developers must create new signatures for each new attack
• Until those signatures are available, users remain vulnerable
Overview (cont.)
Our objective is to build a system that
can effectively detect malicious activity
without relying on signature lists
The goal of our research is to determine whether, and how, artificial intelligence can distinguish malicious code from non-malicious code
Data Gathering
Gather data using a web crawler (likely a modified version of the Heritrix crawler)
Crawler scours a list of known “safe” websites
If necessary, it will also branch out into websites linked from these sites for additional data
While this is performed, we will gather key
information on the scripts (function calls,
parameter values, return values, etc.)
This will be done in Internet Explorer
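For concreteness, a minimal sketch of the per-call records such an instrumented browser might emit; the names here (ScriptEvent, log_event, events.jsonl) are hypothetical, and the real hook would live inside Internet Explorer's scripting engine:

```python
# Hypothetical per-call record emitted by the browser-side hook.
# All names are illustrative assumptions, not the project's actual format.
import json
from dataclasses import dataclass, asdict

@dataclass
class ScriptEvent:
    url: str          # page on which the script ran
    function: str     # name of the intercepted function call
    params: list      # parameter values passed to the call
    returned: object  # value the call returned

def log_event(event: ScriptEvent, path: str = "events.jsonl") -> None:
    """Append one intercepted call as a JSON line for later bulk loading."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")

log_event(ScriptEvent("http://example.com", "document.write",
                      ["<img src=...>"], None))
```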
Data Storage
As data is gathered, it must be stored for the analysis that takes place later
Need to develop a database that can
efficiently store the script activity of tens
of thousands (possibly millions) of
websites
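One possible shape for that store, sketched in SQLite purely for illustration (a production store would need to scale far beyond this); the table and column names are assumptions:

```python
# Illustrative schema: one row per site, one row per intercepted call.
# The seq column preserves call ordering, which matters for later analysis.
import sqlite3

conn = sqlite3.connect("scripts.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS site (
    site_id INTEGER PRIMARY KEY,
    url     TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS call (
    call_id  INTEGER PRIMARY KEY,
    site_id  INTEGER REFERENCES site(site_id),
    seq      INTEGER,   -- position in the call sequence
    function TEXT,
    params   TEXT,      -- serialized parameter values
    returned TEXT       -- serialized return value
);
CREATE INDEX IF NOT EXISTS call_by_site ON call(site_id, seq);
""")
conn.commit()
```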
Data Analysis
Using information from the database, deduce normal behavior
Find a robust algorithm for generating a
heuristic for acceptable behavior
The goal here is to later weigh scripts against this heuristic to flag abnormal (and thus potentially malicious) behavior
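As a deliberately simple stand-in for such a heuristic, one could score a script by how improbable its function calls are under the call frequencies observed on the whitelisted sites; the actual analysis algorithm is the open research question here:

```python
# Toy normality heuristic: rare calls (relative to the "safe" corpus)
# drive the anomaly score up. A stand-in, not the project's algorithm.
import math
from collections import Counter

def train(call_logs):
    """call_logs: iterable of function-name lists from known-safe sites."""
    counts = Counter(c for log in call_logs for c in log)
    total = sum(counts.values())
    return {fn: n / total for fn, n in counts.items()}

def anomaly_score(model, calls, floor=1e-6):
    """Mean negative log-probability; higher means more unusual."""
    return sum(-math.log(model.get(c, floor)) for c in calls) / max(len(calls), 1)

model = train([["getElementById", "setTimeout"], ["getElementById", "alert"]])
print(anomaly_score(model, ["eval", "unescape"]))  # unseen calls score high
```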
Challenges
Gathering
• How to grab relevant information from scripts?
• How deep do we search?
Good websites may inadvertently link to malicious ones
The traversal graph is effectively unbounded
Storage
• In what form should the data be stored?
Need an efficient way to store data without oversimplifying it
Example: a flat laundry list of function calls ignores call sequence (see the bigram sketch after this list)
Analysis
• What analysis algorithm can handle all of this data?
• How can we ensure that the normality heuristic it generates
minimizes false positives and maximizes true positives?
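To illustrate the call-sequence point above: the two scripts below call exactly the same functions, so a flat list cannot tell them apart, while bigrams (ordered call pairs) can. Storing n-grams is one hypothetical middle ground between a bag of calls and full traces:

```python
# Same functions, different order: laundry lists match, bigrams do not.
from collections import Counter

def bigrams(calls):
    return Counter(zip(calls, calls[1:]))

a = ["unescape", "eval", "document.write"]
b = ["document.write", "eval", "unescape"]
print(Counter(a) == Counter(b))   # True  -> identical as laundry lists
print(bigrams(a) == bigrams(b))   # False -> sequence distinguishes them
```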
Milestones
Phase I: Setup
• Set up equipment for research, ensure whitelist is clean
Phase II: Crawler
• Modify the crawler to capture and output the data needed for storage, then begin crawling for sample information
Phase III: Database
• Research and develop an effective structure for storing the data, and link it to the web crawler
Phase IV: Analysis
• Research and develop an effective algorithm for learning
from massive amounts of data
Phase V: Verification
• Using the web crawler, visit a large volume of websites to verify that the heuristic generated in Phase IV is accurate
Certain milestones may need to be revisited depending
on results in each phase