John Cavazos Institute for Computing Systems Architecture

Learning to Detect and Classify Malicious Executables in the Wild
J. Zico Kolter
Marcus A. Maloof
Presented by: Ashwani Rao
Dept of Computer & Information Sciences
University of Delaware
CISC 879 - Machine Learning for Solving Systems Problems
Introduction
• Machine learning and data mining to identify malicious code
• What is malicious code?
• Why not antivirus suites?
• Training set: 1971 benign and 1651 malicious executables
• Features extracted: byte-code n-grams from each executable; malicious executables are also labeled by the function of their payload
• Learning algorithms: naïve Bayes, SVMs, decision trees, and boosting
Goals of the Research Paper
• Show how established methods can be used to detect and classify malicious executables
• Present empirical results from an extensive study of inductive methods for detection and classification
• Show that these methods achieve high detection rates on new, previously unseen executables
Related Work
• Prior work: Lo et al., 1995; Kephart et al., 1995; Tesauro et al., 1996; Schultz et al., 2001
• Lo et al., 1995: analysis of several programs
• Schultz et al., 2001: used data mining to detect malicious executables, with three kinds of features:
    • Binary profiling (RIPPER rule learner)
    • String sequences (naïve Bayes)
    • Hex dumps (six naïve Bayesian classifiers)
Data Collection and Classification Methods
• 1971 benign and 1651 malicious executables in Windows PE format
• N-grams: combine each four-byte sequence into a single term. For example, for the byte sequence ff 00 ab 3e 12 b3, the corresponding n-grams are ff00ab3e, 00ab3e12, and ab3e12b3.
• Each distinct n-gram is treated as an attribute
• The most relevant attributes (n-grams) are selected using information gain, also called average mutual information. The 500 most relevant n-grams were kept.
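The extraction and selection steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each n-gram is scored as a present/absent attribute per executable, and the helper names (`byte_ngrams`, `information_gain`, `top_ngrams`) are hypothetical.

```python
import math
from collections import Counter

def byte_ngrams(data, n=4):
    """Return the set of distinct overlapping n-grams in `data` as hex strings."""
    return {data[i:i + n].hex() for i in range(len(data) - n + 1)}

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(present, labels):
    """Mutual information between a Boolean attribute and the class label."""
    h_class = entropy(Counter(labels).values())
    h_cond = 0.0
    for value in (True, False):
        subset = [lab for p, lab in zip(present, labels) if p == value]
        if subset:
            h_cond += len(subset) / len(labels) * entropy(Counter(subset).values())
    return h_class - h_cond

def top_ngrams(samples, labels, k=500, n=4):
    """Rank every n-gram seen in `samples` by information gain and keep the top k."""
    grams = [byte_ngrams(s, n) for s in samples]
    vocab = set().union(*grams)
    scored = {g: information_gain([g in gs for gs in grams], labels) for g in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scored, key=lambda g: (-scored[g], g))[:k]

print(sorted(byte_ngrams(bytes.fromhex("ff00ab3e12b3"))))
```

Holding every distinct n-gram in memory, as this sketch does, would not be practical for hundreds of millions of n-grams; it is meant only to make the definitions concrete.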
Classification methods
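• Instance-based learner: stores a collection of training examples and classifies a new example by comparing it with the stored ones
• Naïve Bayes: probabilistic model based on the prior probability of each class, P(Ci), and the conditional probability of each attribute value given the class, P(Vj | Ci)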
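To make the roles of P(Ci) and P(Vj | Ci) concrete, here is a minimal naïve Bayes sketch over Boolean attributes. It assumes Laplace smoothing and log-probabilities for numerical stability; neither detail comes from the slides, and the function names are hypothetical.

```python
import math

def train_naive_bayes(rows, labels):
    """rows: tuples of Boolean attribute values; labels: one class per row."""
    classes = sorted(set(labels))
    n_attrs = len(rows[0])
    # P(Ci): fraction of training examples in each class.
    prior = {c: labels.count(c) / len(labels) for c in classes}
    # cond[c][j] = P(attribute j is True | class c), with Laplace smoothing.
    cond = {}
    for c in classes:
        members = [r for r, lab in zip(rows, labels) if lab == c]
        cond[c] = [(sum(r[j] for r in members) + 1) / (len(members) + 2)
                   for j in range(n_attrs)]
    return prior, cond

def predict(model, row):
    """Return the class maximizing log P(Ci) + sum_j log P(Vj | Ci)."""
    prior, cond = model

    def log_posterior(c):
        score = math.log(prior[c])
        for j, value in enumerate(row):
            p = cond[c][j]
            score += math.log(p if value else 1 - p)
        return score

    return max(prior, key=log_posterior)

rows = [(True, False), (True, True), (False, False), (False, True)]
labels = ["malicious", "malicious", "benign", "benign"]
model = train_naive_bayes(rows, labels)
print(predict(model, (True, False)))
```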
Classification methods
• Support vector machines: learn a vector of weights w and a threshold b. A kernel function maps the training data into a higher-dimensional space in which the problem can become linearly separable.
• Decision trees: internal nodes correspond to attributes and leaf nodes correspond to class labels.
• Boosted classifiers: boosting is a method for combining multiple classifiers. It produces a set of weighted models by iteratively learning a model from a weighted data set, evaluating it, and reweighting the data set based on the model’s performance.
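The learn, evaluate, and reweight loop described for boosting can be sketched as follows. This is one common instantiation (AdaBoost with single-attribute decision stumps over Boolean attributes and labels in {+1, -1}), chosen here for illustration; the slides do not specify the variant, and all names are hypothetical.

```python
import math

def stump_predict(j, polarity, row):
    """A stump predicts `polarity` when attribute j is True, else the opposite."""
    return polarity if row[j] else -polarity

def train_stump(rows, labels, weights):
    """Return (weighted_error, attribute, polarity) of the best single-attribute stump."""
    best = None
    for j in range(len(rows[0])):
        for polarity in (1, -1):
            err = sum(w for r, y, w in zip(rows, labels, weights)
                      if stump_predict(j, polarity, r) != y)
            if best is None or err < best[0]:
                best = (err, j, polarity)
    return best

def adaboost(rows, labels, rounds=10):
    n = len(rows)
    weights = [1 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, polarity = train_stump(rows, labels, weights)
        if err >= 0.5:  # no better than chance: stop
            break
        err = max(err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)  # model weight
        ensemble.append((alpha, j, polarity))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(j, polarity, r))
                   for r, y, w in zip(rows, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, row):
    """Weighted vote of all stumps."""
    score = sum(a * stump_predict(j, p, row) for a, j, p in ensemble)
    return 1 if score >= 0 else -1
```

On a toy problem where the label is the majority vote of three Boolean attributes, no single stump is correct on all eight rows, but three boosting rounds are enough for the weighted vote to classify every row correctly.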
Detecting Malicious Code Using n-grams
• Used ten-fold cross-validation
• Pilot study: determined the size of the n-grams and the number of relevant n-grams to keep. Used n-grams with n = 4 and selected the best number of n-grams using information gain. The 500 most relevant n-grams produced the best result.
• Experiment with small collection: a small collection of executables with a total of 68,744,909 n-grams
• Experiment with large collection: 255 million distinct n-grams of size 4
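Ten-fold cross-validation splits the data into ten parts and uses each part once for testing while training on the other nine. The sketch below generates such splits; it assumes a plain shuffled split (no stratification by class), and `k_fold_indices` is a hypothetical helper.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 1971 benign + 1651 malicious executables = 3622 examples
for train, test in k_fold_indices(3622):
    print(len(train), len(test))
```

Each example appears in exactly one test fold, so every executable is classified once by a model that never saw it during training.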
Results of Small Collection
[Figure: ROC curve for detecting malicious executables in the small collection]
Results of Larger Collection
[Figure: ROC curve for detecting malicious executables in the larger collection]
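A ROC curve plots the true-positive rate against the false-positive rate as the decision threshold on a classifier's score is lowered. The following sketch shows how such a curve and its area can be computed from scores and labels; the data in the usage line is made up for illustration and is not taken from the study.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) points.

    scores: higher means "more likely malicious"; labels: True for malicious.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process all examples that share the same score together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

print(auc(roc_points([0.9, 0.8, 0.3, 0.1], [True, True, False, False])))
```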
Classifying Executables by Payload Function
• Goal: measure the extent to which classification methods can determine whether a given malicious executable opens a backdoor, mass-mails, or is an executable virus
• Identify and enumerate the functions of the payloads
• Many executables fell into more than one category
• Experimental design similar to the previous one, but for each function a separate data set is built from malicious executables only
• Used ten-fold cross-validation
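Because an executable can belong to several categories, one way to set this up is a separate yes/no data set per payload function. The sketch below builds such labels; the sample names and the function `one_vs_rest_datasets` are hypothetical and only illustrate the idea.

```python
def one_vs_rest_datasets(samples):
    """samples: dict mapping executable name -> set of payload functions.

    Returns dict mapping each function -> {executable name: has that function}.
    """
    functions = sorted(set().union(*samples.values()))
    return {f: {name: f in funcs for name, funcs in samples.items()}
            for f in functions}

# Hypothetical malicious executables with their payload functions.
malicious = {
    "sample_a": {"backdoor"},
    "sample_b": {"mass-mailer", "backdoor"},
    "sample_c": {"virus"},
}
datasets = one_vs_rest_datasets(malicious)
print(datasets["backdoor"])
```

Each resulting data set can then be fed to the same binary classifiers used for detection.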
Experimental Results
[Figure: ROC curve for mass-mailing capabilities]
Experimental Results
[Figure: ROC curve for backdoor capabilities]
Evaluating Real-World Online Performance
• Applied the method to 291 real-world malicious executables discovered after the original data were gathered
• Classifiers were built from the original data, covering both benign and malicious code
• The boosted decision tree detected 98% of the new malicious code
Conclusion and Future Work
• Machine learning and data mining are useful and appropriate tools for detecting malware
• Boosted classifiers and support vector machines performed exceptionally well
• Boosting reduces bias and variance, and boosted classifiers outperformed the other classifiers in the study
• The approach is scalable
• 20-25% of the executables were obfuscated using compression or encryption
• For the payload-function experiments: remove obfuscation and rerun the experiments with a larger collection
Conclusion and Future Work
• Study the similarity of malicious executables and how they change over time. Clustering can provide good insight into this.
• Combining this approach with a search for known signatures, and with executing and analyzing code in a virtual machine, should provide better computer security
Q&A ?