Download Report

Transcript paper_learning

Learning to Detect Malicious
Executables in the Wild
Original Paper by Jeremy Z. Colter and
Marcus A. Maloof, of Georgetown
Presented by Alvin Grissom
What is Malicious Code?
Malicious code is "any code added, changed, or removed from a
software system to intentional cause harm or subvert the system's
intended function." (G. McGraw and G. Morisett, 2000)
Examples include software that is used to...
-compromise computer systems
-destroy information
-gather/distribute information (e.g., spyware)
Types of Malicious Software
Malicious executables are generally categorized based on their
respective transport mechanisms:
Viruses inject malicious code into other executables, infecting them
and propagating the infection.
Worms are self-contained programs which spread over a network,
usually by exploiting vulnerabilities.
Trojan Horses masquerade as benign programs.
Anti-Virus Software
Anti-virus software scans executables for known patterns and has
been quite successful.
However, there are substantial shortcomings inherent to this
- program must be obtained before patterns can be identified
-simple obfuscation techniques can deceive anti-virus software
A Different Approach
- Use machine learning and techniques from text mining and
-Form as a classification problem
Related Work
- Lo, et. al. investigated using "tell-tale signs" to filter programs, but
provided no empirical results
-IBM's Watson Research Center have investigated using neural
networks, and have incorporated it into their anti-virus software for
boot sector virus protection
-Shultz, et. al. used data mining methods to detect malicious code, a
technique similar to the work presented here.
-used extracted features, e.g., function names, hex dumps,
DLL names
-Naive Bayes performed best of the techniques attempted
-Are these features stable? Compiler-dependent?
Data Collection
- The data set consists of 1,972 benign executables and 1,651
malicious executables, in Windows's portable executable format for
Windows 2000 and XP.
- Some executables were obfuscated with compression and/or
- The authors used the hexdump utility to convert executables to
hexadecimal, before producing n-grams.
-E.g. ff00ab3e12be --> {ff00ab3e, 00ab3e12, ab3e12b3}
This yielded ~256 million distinct n-grams
Classification Methodology
-Used n-grams by viewing each n-gram as either present (1) or
absent (0) from the executable. The most relevant n-grams were then
selected, by computing the information game for each.
-Selected the top 500 n-grams and applied several learning methods,
namely, IBk, TFIDF, Naive Bayes, SVM, and a decision tree.
-Boosting was also used
-Techniques were evaluated on both a large and a small collection
Experimental Results
-Best performance achieved by using top 500 n-grams
-Using n-grams of size n = 4 (bytes) produced best results.
-The most relevant attributes were extracted by computing
information gain.
-Boosted J48 (decision tree) exhibited best performance
-Evaluation was based on the area under the ROC curve
Other Interesting Finds
-One n-gram appeared in 75% of malicious executables, but was not
executable code; it was a string sequence. Its purpose is unclear. :-/
-No decision tree contained more than 103 nodes; the average was 90,
and the height never exceeded 13.
-Naive Bayes results were inferior to previous studies, possibly due to
conditional dependence.