Learning to Detect and Classify Malicious Executables in the Wild
Reporter: 林佳宜
Email: [email protected]
2015/4/13
References
J. Zico Kolter and Marcus A. Maloof. Learning to Detect and Classify Malicious Executables in the Wild. Journal of Machine Learning Research (JMLR), 2006.
Outline
Introduction
Classification Methodology
Experimental Design
Experimental Results
Conclusion
Introduction
Malicious code can cause harm or subvert a system's intended function.
Malicious executables fall into three categories: viruses, worms, and Trojan horses.
This work describes the use of machine learning and data mining to detect and classify malicious executables.
Three main contributions
Detect and classify malicious executables using text classification.
Present empirical results from an extensive study of inductive methods for detecting and classifying malicious executables.
Show that the methods achieve high detection rates, even on completely new, previously unseen malicious executables.
Several learning methods
Implemented in the Waikato Environment for Knowledge Analysis (WEKA):
IBk (instance-based k-nearest neighbors)
naive Bayes
support vector machine (SVM)
J48 (decision tree)
Used the AdaBoost.M1 algorithm to boost SVMs, J48, and naive Bayes (a sketch with scikit-learn analogues follows).
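A rough illustration only: a minimal sketch of the same learner lineup using scikit-learn analogues rather than the WEKA implementations the authors used (IBk ≈ k-nearest neighbors, the SVM ≈ SVC, J48 ≈ a CART tree standing in for C4.5). Kernel and parameter choices here are assumptions, not the paper's settings.

```python
# Sketch: scikit-learn stand-ins for the WEKA learners used in the paper.
# These are analogues, not the original implementations; parameters are
# illustrative assumptions.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

base_learners = {
    "IBk (k-NN)": KNeighborsClassifier(n_neighbors=5),
    "naive Bayes": BernoulliNB(),                   # Boolean n-gram features
    "SVM": SVC(kernel="linear", probability=True),  # kernel choice is illustrative
    "J48-like tree": DecisionTreeClassifier(),
}

# AdaBoost.M1-style boosting of the SVM, tree, and naive Bayes learners
# (IBk is not boosted, as on the slide). Older scikit-learn versions use
# base_estimator= instead of estimator=.
boosted = {
    f"boosted {name}": AdaBoostClassifier(estimator=clf, n_estimators=10)
    for name, clf in base_learners.items()
    if name != "IBk (k-NN)"
}
```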
Data Collection
Gathered this collection in early 2003.
◦ Benign executables
1971 in total
from the Windows 2000 and XP operating systems, SourceForge, and download.com
◦ Malicious executables
1651 in total
from the Web site VX Heavens and from the MITRE Corporation, the sponsors of this project
Recently obtained 291 additional malicious executables from VX Heavens.
Experimental Design
To evaluate the approach and methods, used stratified ten-fold cross-validation:
randomly partitioned the executables into ten disjoint sets of equal size
used one set as the testing set
used the remaining nine to form the training set
Extracted n-grams from the executables in the training and testing sets.
Selected the most relevant features from the training data.
Conducted ROC analysis for each method and computed the area under the ROC curve (see the sketch below).
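A minimal sketch of this evaluation loop under stated assumptions: X is a Boolean n-gram feature matrix (NumPy array) and y a 0/1 label vector, both illustrative names, and the per-fold re-selection of n-gram features from the training folds is omitted for brevity.

```python
# Sketch: stratified ten-fold cross-validation with pooled ROC analysis.
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

def cross_validated_auc(clf, X, y, n_splits=10, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores, labels = [], []
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        # Use a classifier that exposes predict_proba (or any continuous
        # rating) so an ROC curve can be traced over the pooled folds.
        scores.append(clf.predict_proba(X[test_idx])[:, 1])
        labels.append(y[test_idx])
    fpr, tpr, _ = roc_curve(np.concatenate(labels), np.concatenate(scores))
    return auc(fpr, tpr)
```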
Detecting Malicious Executables
Evaluated how well the learning methods detected malicious executables in three experimental studies.
The first was a pilot study to determine the size of words and n-grams, and the number of n-grams relevant for prediction.
The second applied all of the classification methods to a small collection of executables.
The third applied the methodology to a larger collection of executables.
Pilot Studies [1/2]
Ran pilot studies to determine three parameters:
the size of the n-grams
the size of the words
the number of selected features
Extracted bytes from 476 malicious and 561 benign executables and produced n-grams for n = 4 (a sketch of the extraction follows this slide).
Selected the best 10, 20, . . . , 100, 200, . . . , 1000, 2000, . . . , 10000 n-grams.
Selecting 500 n-grams produced the best results.
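A minimal sketch of the byte n-gram extraction, assuming a simple sliding window over the raw bytes; the paper worked from hexdump output, so the exact tokenization here is an assumption, and the file path is illustrative.

```python
# Sketch: distinct n-byte sequences (n-grams) from one executable, using a
# sliding window over the raw bytes. Hex strings stand in for the paper's
# hexdump-derived terms; this tokenization is an assumption.
def byte_ngrams(path, n=4):
    with open(path, "rb") as f:
        data = f.read()
    return {data[i:i + n].hex() for i in range(len(data) - n + 1)}

# Each executable is then represented by the set of n-grams it contains,
# e.g. features = byte_ngrams("sample.exe")  # "sample.exe" is illustrative
```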
Pilot Studies [2/2]
Fixed the number of n-grams at 500 and varied n, the size of the n-grams.
Evaluated the same methods for n = 1, 2, . . . , 10; n = 4 produced the best results.
Varied the size of the words (one byte, two bytes, etc.); single bytes produced better results.
Classification Methodology
Formed training examples from the n-grams extracted from the executables, viewing each n-gram as a Boolean attribute (present or absent).
Selected the most relevant attributes by computing the information gain (IG) of each n-gram with respect to the class (see the sketch below).
Selected the top 500 n-grams.
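A minimal sketch of this attribute-selection step: the information gain of a Boolean attribute with respect to the class is its mutual information with the class, computed here directly with NumPy. `X_train` (a Boolean matrix of executables by candidate n-grams) and `y_train` are illustrative names.

```python
import numpy as np

def information_gain(X, y):
    """Mutual information between each Boolean column of X and the class y."""
    X = np.asarray(X, dtype=bool)
    y = np.asarray(y, dtype=bool)
    n = len(y)
    ig = np.zeros(X.shape[1])
    for c in (False, True):          # class value (benign / malicious)
        p_c = (y == c).mean()
        for v in (False, True):      # attribute value (n-gram absent / present)
            p_joint = ((X == v) & (y == c)[:, None]).sum(axis=0) / n
            p_v = (X == v).mean(axis=0)
            nz = p_joint > 0
            ig[nz] += p_joint[nz] * np.log2(p_joint[nz] / (p_v[nz] * p_c))
    return ig

# Keep the 500 attributes with the highest information gain.
top_500 = np.argsort(information_gain(X_train, y_train))[::-1][:500]
```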
Experiment with a Small Collection
The executables produced 68,744,909 distinct n-grams.
Reported areas under the ROC curves (AUC) with 95% confidence intervals.
The boosted methods performed well; naive Bayes did not perform as well.
Experiment with a Larger Collection
This collection consisted of 1971 benign and 1651 malicious executables, yielding over 255 million distinct n-grams of size four.
Reported the areas under the ROC curves with 95% confidence intervals.
Boosted J48 outperformed all other methods.
Classifying Executables by Payload Function
Classified malicious executables based on the function of their payload.
Presented results for three functional categories: opened a backdoor, mass-mailed, and executable virus (a per-category sketch follows this slide).
Such classification targets previously undiscovered malicious executables, whose payload function is not yet known.
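One plausible framing of this experiment, under the assumption that each functional category gets its own binary detector (that category versus the rest of the malicious executables); `X_malicious` and `category` are illustrative names, and the learner settings are not the paper's.

```python
# Sketch: one boosted-tree detector per payload category (category vs. rest).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

detectors = {}
for label in ("backdoor", "mass-mailer", "virus"):
    y_binary = (np.asarray(category) == label).astype(int)  # target vs. rest
    detectors[label] = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(), n_estimators=10
    ).fit(X_malicious, y_binary)
```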
Evaluating Real-world, Online Performance
Compared actual detection rates on the larger collection versus the 291 new malicious executables.
Selected three desired false-positive rates: 0.01, 0.05, and 0.1 (see the sketch below).
Boosted J48 detected about 98% of the new malicious executables at a false-positive rate of 0.05.
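A minimal sketch of the thresholding step under stated assumptions: classifier scores on the original collection are used to pick a cutoff that meets each desired false-positive rate, and that cutoff is then applied to scores on the new malicious executables. `y_old`, `scores_old`, and `scores_new` are illustrative names.

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_at_fpr(y_old, scores_old, target_fpr):
    fpr, tpr, thresholds = roc_curve(y_old, scores_old)
    # Most permissive cutoff whose false-positive rate stays at or below
    # the target (maximizes detections subject to the FPR constraint).
    ok = np.where(fpr <= target_fpr)[0]
    return thresholds[ok[-1]]

for target in (0.01, 0.05, 0.10):
    thr = threshold_at_fpr(y_old, scores_old, target)
    detect_rate = (scores_new >= thr).mean()  # scores_new: new malicious only
    print(f"target FPR {target:.2f}: detection rate {detect_rate:.2%}")
```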
Conclusion
Detected and classified unknown malicious executables using machine learning, data mining, and text classification.
For detecting malicious executables, boosted J48 produced the best detector, with an area under the ROC curve of 0.996.
For classifying malicious executables by payload function, boosted J48 produced the best detectors, with areas under the ROC curve around 0.9.
Questions