Detecting Malicious Executables using Data Mining


Data Mining for Security Applications: Detecting Malicious Executables
Mr. Mehedy M. Masud (PhD Student)
Prof. Latifur Khan
Prof. Bhavani Thuraisingham
Department of Computer Science
The University of Texas at Dallas
Outline and Acknowledgement
● Vision for Assured Information Sharing
● Handling Different Trust levels
● Defensive Operations between Untrustworthy Partners
  – Detecting Malicious Executables using Data Mining
● Research Funded by Air Force Office of Scientific Research and Texas Enterprise Funds
Vision: Assured Information Sharing
[Figure: the components of Agency A, Agency B, and Agency C each publish their data/policy into the shared data/policy for the coalition]
1. Trustworthy Partners
2. Semi-Trustworthy partners
3. Untrustworthy partners
4. Dynamic Trust
Our Approach
● Integrate the Medicaid claims data and mine the data; next enforce policies and determine how much information has been lost by enforcing policies
  – Prof. Khan, Dr. Awad (Postdoc) and Student Workers (MS students)
● Apply game theory and probing techniques to extract information from semi-trustworthy partners
  – Prof. Murat Kantarcioglu and Ryan Layfield (PhD Student)
● Data Mining for Defensive and offensive operations
  – E.g., Malicious code detection, Honeypots
  – Prof. Latifur Khan and Mehedy Masud
● Dynamic Trust levels, Peer to Peer Communication
  – Prof. Kevin Hamlen and Nathalie Tsybulnik (PhD student)
Introduction: Detecting Malicious Executables using Data Mining
● What are malicious executables?
  – Harm computer systems
  – Virus, Exploit, Denial of Service (DoS), Flooder, Sniffer, Spoofer, Trojan etc.
  – Exploit software vulnerabilities on a victim
  – May remotely infect other victims
  – Incur great loss. Example: Code Red epidemic cost $2.6 Billion
● Malicious code detection: Traditional approach
  – Signature based
  – Requires signatures to be generated by human experts
  – So, not effective against “zero day” attacks
State of the Art: Automated Detection
Automated detection approaches:
● Behavioural: analyse behaviours like source, destination address, attachment type, statistical anomaly etc.
● Content-based: analyse the content of the malicious executable
  – Autograph (H.-A. Kim, CMU): Based on automated signature generation process
  – N-gram analysis (Maloof, M. A. et al.): Based on mining features and using machine learning.
New Ideas
● Content-based approaches consider only machine code (byte code).
● Is it possible to consider higher-level source code for malicious code detection?
● Yes: Disassemble the binary executable and retrieve the assembly program
● Extract important features from the assembly program
● Combine with machine-code features
Feature Extraction
● Binary n-gram features
  – Sequence of n consecutive bytes of binary executable
● Assembly n-gram features
  – Sequence of n consecutive assembly instructions
● System API call features
  – DLL function call information
The Hybrid Feature Retrieval Model
● Collect training samples of normal and malicious executables.
● Extract features
● Train a Classifier and build a model
● Test the model against test samples
Hybrid Feature Retrieval (HFR)
● Training
[Figure: HFR training phase]
Hybrid Feature Retrieval (HFR)
● Testing
[Figure: HFR testing phase]
Feature Extraction
● Binary n-gram features
  – Features are extracted from the byte codes in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on.
● Example: given an 11-byte sequence 0123456789abcdef012345,
  – the 2-grams (2-byte sequences) are: 0123, 2345, 4567, 6789, 89ab, abcd, cdef, ef01, 0123, 2345
  – the 4-grams (4-byte sequences) are: 01234567, 23456789, 456789ab, ..., ef012345, and so on.
● Problem:
  – Large dataset. Too many features (millions!).
● Solution:
  – Use secondary memory, efficient data structures
  – Apply feature selection
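The sliding-window extraction in the example above can be sketched in a few lines of Python. This is only an illustration: the function name byte_ngrams is hypothetical, and the in-memory Counter is a stand-in that would not scale to the millions of features mentioned on the slide (which is why the slide points to secondary memory and feature selection).

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> list[str]:
    """Return every run of n consecutive bytes as a hex string,
    sliding the window forward one byte at a time."""
    return [data[i:i + n].hex() for i in range(len(data) - n + 1)]


# The 11-byte sequence used in the slide's example.
sample = bytes.fromhex("0123456789abcdef012345")

two_grams = byte_ngrams(sample, 2)
four_grams = byte_ngrams(sample, 4)

# Counting occurrences shows that some n-grams repeat (0123 and 2345 here).
counts = Counter(two_grams)
```

An 11-byte input yields 10 overlapping 2-grams and 8 overlapping 4-grams, matching the lists on the slide.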
Feature Extraction
● Assembly n-gram features
  – Features are extracted from the assembly programs in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on.
● Example: three instructions
  “push eax”; “mov eax, dword[0f34]”; “add ecx, eax”;
  – 2-grams
    (1) “push eax”; “mov eax, dword[0f34]”;
    (2) “mov eax, dword[0f34]”; “add ecx, eax”;
● Problem:
  – Same problem as binary n-grams: too many features
● Solution:
  – Same solution: secondary memory, efficient data structures, feature selection
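For assembly n-grams the window slides over whole instructions instead of bytes. The sketch below reproduces the three-instruction example; the function name instruction_ngrams is hypothetical, and the instructions are treated as plain strings, which is an assumption (the slides do not say whether operands are normalized before n-grams are formed).

```python
def instruction_ngrams(instructions, n):
    """Return tuples of n consecutive instructions,
    sliding the window forward one instruction at a time."""
    return [
        tuple(instructions[i:i + n])
        for i in range(len(instructions) - n + 1)
    ]


# The three instructions from the slide's example.
program = ["push eax", "mov eax, dword[0f34]", "add ecx, eax"]

two_grams = instruction_ngrams(program, 2)
```

A program shorter than n simply produces no n-grams.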
Feature Selection
● Select Best K features
● Selection Criteria: Information Gain
● Gain of an attribute A on a collection of examples S is given by
\[ \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v) \]
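The information gain criterion can be computed directly from its definition: the entropy of the whole collection minus the size-weighted entropy of each subset that shares one value of the attribute. The sketch below is a minimal illustration; the function names and the toy labels are assumptions, not part of the original work.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy(S) minus the weighted entropy of each subset S_v,
    where S_v holds the examples whose attribute value is v."""
    total = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain


# Toy data: a feature is 1 if an n-gram is present in the executable.
labels = ["malicious", "malicious", "benign", "benign"]
perfect = [1, 1, 0, 0]  # separates the two classes completely
useless = [1, 0, 1, 0]  # carries no information about the class
```

With two balanced classes, a feature that separates them completely has a gain of 1 bit and a feature unrelated to the class has a gain of 0, so ranking features by gain and keeping the top K retains the most discriminating n-grams.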
Experiments
● Dataset
  – Dataset1: 838 Malicious and 597 Benign executables
  – Dataset2: 1082 Malicious and 1370 Benign executables
  – Collected Malicious code from VX Heavens (http://vx.netlux.org)
● Disassembly
  – Pedisassem (http://www.geocities.com/~sangcho/index.html)
● Training, Testing
  – Support Vector Machine (SVM)
  – C-Support Vector Classifiers with an RBF kernel
Results
[Figures: three results slides comparing the HFS, BFS, and AFS feature sets; the charts themselves are not recoverable]
● HFS = Hybrid Feature Set
● BFS = Binary Feature Set
● AFS = Assembly Feature Set
Future Plans
● System call:
  – seems to be very useful.
  – Need to Consider Frequency of call
  – Call sequence pattern (following program path)
  – Actions immediately preceding or after call
● Detect Malicious code by program slicing
  – requires analysis
Data Mining to Detect Buffer Overflow Attack
Mohammad M. Masud,
Latifur Khan,
Bhavani Thuraisingham
Department of Computer Science
The University of Texas at Dallas
Introduction
● Goal
  – Intrusion detection.
  – e.g.: worm attack, buffer overflow attack.
● Main Contribution
  – 'Worm' code detection by data mining coupled with 'reverse engineering'.
  – Buffer overflow detection by combining data mining with static analysis of assembly code.
Background
● What is 'buffer overflow'?
  – A situation in which a fixed-size buffer is overflowed by a larger input.
● How does it happen?
  – example:
    ........
    char buff[100];
    gets(buff);
    ........
[Figure: the input string is copied into buff on the stack in memory]
Background (cont...)
● Then what?
    ........
    char buff[100];
    gets(buff);
    ........
[Figure: three stack diagrams. The input overflows buff; the return address is overwritten; the new return address points to the memory location that holds the attacker's code]
Background (cont...)
● So what?
  – Program may crash, or
  – The attacker can execute his arbitrary code
● It can now
  – Execute any system function
  – Communicate with some host and download some 'worm' code and install it!
  – Open a backdoor to take full control of the victim
● How to stop it?
Background (cont...)
● Stopping buffer overflow
  – Preventive approaches
  – Detection approaches
● Preventive approaches
  – Finding bugs in source code. Problem: can only work when source code is available.
  – Compiler extension. Same problem.
  – OS/HW modification
● Detection approaches
  – Capture code running symptoms. Problem: may require long running time.
  – Automatically generating signatures of buffer overflow attacks.
CodeBlocker (Our approach)
● A detection approach
● Based on the Observation:
  – Attack messages usually contain code while normal messages contain data.
● Main Idea
  – Check whether message contains code
● Problem to solve:
  – Distinguishing code from data
Severity of the problem
● It is not easy to detect the actual instruction sequence in a given string of bits
Our solution
● Apply data mining.
● Formulate the problem as a classification problem (code, data)
● Collect a set of training examples, containing instances of both classes
● Train a machine learning algorithm on these examples to get the model
● Test this model against a new message
CodeBlocker Model
[Figure: CodeBlocker model diagram; recoverable label: Feature Extraction]
Disassembly
● We apply SigFree tool
  – implemented by Xinran Wang et al. (PennState)
Feature extraction
● Features are extracted using
  – N-gram analysis
  – Control flow analysis
● N-gram analysis
  – What is an n-gram? Sequence of n instructions
  – Traditional approach: Flow of control is ignored
  – 2-grams are: 02, 24, 46, ..., CE
[Figure: assembly program and corresponding IFG]
Feature extraction (cont...)
● Control-flow Based N-gram analysis
  – What is an n-gram? Sequence of n instructions
  – Proposed Control-flow based approach: Flow of control is considered
  – 2-grams are: 02, 24, 46, ..., CE, E6
[Figure: assembly program and corresponding IFG]
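The difference between the two n-gram definitions can be illustrated with a small graph walk. The sketch below assumes the instructions in the slide's figure are labelled by their offsets 0, 2, 4, ..., C, E and that the instruction at E jumps back to 6, which is what the extra 2-gram E6 suggests; the figure itself is not available, so both the labels and the jump are assumptions. The function names are hypothetical.

```python
def sequential_ngrams(order, n):
    """Traditional n-grams: n instructions that are adjacent in the listing."""
    return ["".join(order[i:i + n]) for i in range(len(order) - n + 1)]


def control_flow_ngrams(successors, n):
    """Control-flow n-grams: every path of n instructions in the flow graph."""
    grams = []

    def extend(path):
        if len(path) == n:
            grams.append("".join(path))
            return
        for nxt in successors.get(path[-1], []):
            extend(path + [nxt])

    for start in successors:
        extend([start])
    return grams


# Assumed instruction labels (offsets) in listing order.
order = ["0", "2", "4", "6", "8", "A", "C", "E"]

# Fall-through edges: each instruction flows to the next one in the listing.
successors = {a: [b] for a, b in zip(order, order[1:])}
# Assumed jump: the instruction at E transfers control back to 6.
successors["E"] = ["6"]

plain = sequential_ngrams(order, 2)
with_flow = control_flow_ngrams(successors, 2)
```

Following graph edges instead of listing order is what adds E6 to the 2-grams; for longer n the same walk also yields paths such as CE6 that cross the jump.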
Feature extraction (cont...)
● Control Flow analysis. Generated features
  – Invalid Memory Reference (IMR)
  – Undefined Register (UR)
  – Invalid Jump Target (IJT)
● Checking IMR
  – A memory is referenced using register addressing and the register value is undefined
  – e.g.: mov ax, [dx + 5]
● Checking UR
  – Check if the register value is set properly
● Checking IJT
  – Check whether jump target does not violate instruction boundary
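Two of these checks are easy to sketch on a simplified instruction model. The code below is an illustration under strong assumptions, not the CodeBlocker implementation: instructions are reduced to byte lengths for the jump-target check and to (registers written, registers read) pairs for the register check, only straight-line code is handled, and all names and sample values are hypothetical.

```python
def instruction_starts(lengths):
    """Offsets at which each decoded instruction begins, given its byte length."""
    starts, offset = [], 0
    for length in lengths:
        starts.append(offset)
        offset += length
    return starts


def invalid_jump_targets(lengths, jump_targets):
    """IJT check: a target is invalid if it falls inside an instruction
    instead of on an instruction boundary."""
    valid = set(instruction_starts(lengths))
    return [t for t in jump_targets if t not in valid]


def undefined_register_uses(instructions):
    """UR/IMR check for straight-line code: report (index, register) whenever
    a register is read before any earlier instruction has written it."""
    defined, problems = set(), []
    for index, (writes, reads) in enumerate(instructions):
        problems.extend((index, reg) for reg in reads if reg not in defined)
        defined.update(writes)
    return problems


# Hypothetical instruction lengths; instructions start at 0, 2, 4, 6, 9.
lengths = [2, 2, 2, 3, 1]
bad_targets = invalid_jump_targets(lengths, [6, 7])

# The slide's example "mov ax, [dx + 5]" writes ax and reads dx.
slide_example = [(["ax"], ["dx"])]
uses = undefined_register_uses(slide_example)
```

A real analysis would follow the control flow graph rather than listing order, so this only shows the shape of each check.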
Feature extraction (cont...)
● Why n-gram analysis?
  – Intuition: in general, disassembled executables should have a different pattern of instruction usage than disassembled data.
● Why control flow analysis?
  – Intuition: there should be no invalid memory references or invalid jump targets.
Putting it together
● Compute all possible n-grams
● Select best k of them
● Compute feature vector (binary vector) for each training example
● Supply these vectors to the training algorithm
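The steps above can be sketched end to end on toy data. The slides do not say which criterion ranks the n-grams in CodeBlocker, so the score used here (the absolute difference in how often an n-gram appears in each class) is a simple stand-in chosen for brevity; the function names, labels, and sample n-grams are all hypothetical.

```python
def select_top_k(samples, labels, k):
    """samples: list of sets of n-grams; labels: parallel list of 'code'/'data'.
    Assumes both classes are present. Returns the k best-scoring n-grams."""
    vocabulary = sorted(set().union(*samples))

    def presence_rate(gram, label):
        group = [s for s, y in zip(samples, labels) if y == label]
        return sum(gram in s for s in group) / len(group)

    def score(gram):
        # Stand-in criterion: how differently the two classes use this n-gram.
        return abs(presence_rate(gram, "code") - presence_rate(gram, "data"))

    return sorted(vocabulary, key=score, reverse=True)[:k]


def to_binary_vector(sample, selected):
    """1 if the selected n-gram occurs in the sample, else 0."""
    return [1 if gram in sample else 0 for gram in selected]


# Toy training set: the n-grams found in four messages.
samples = [{"02", "24", "E6"}, {"02", "E6"}, {"24", "9F"}, {"9F"}]
labels = ["code", "code", "data", "data"]

selected = select_top_k(samples, labels, 2)
vectors = [to_binary_vector(s, selected) for s in samples]
```

The resulting binary vectors, paired with their labels, are what would be handed to the training algorithm.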
Experiments
● Dataset
  – Real traces of normal messages
  – Real attack messages
  – Polymorphic shellcodes
● Training, Testing
  – Support Vector Machine (SVM)
Results
[Figure: results chart]
● CFBn: Control-Flow Based n-gram feature
● CFF: Control-flow feature
Novelty / contribution
● We introduce the notion of control flow based n-gram
● We combine control flow analysis with data mining to detect code / data
● Significant improvement over other methods (e.g. SigFree)
Advantages
1) Fast testing
2) Signature free operation
3) Low overhead
4) Robust against many obfuscations
Limitations
● Need samples of attack and normal messages.
● May not be able to detect a completely new type of attack.
Future Works
● Find more features
● Apply dynamic analysis techniques
● Semantic analysis
Reference / suggested readings
– X. Wang, C. Pan, P. Liu, and S. Zhu. SigFree: A signature-free buffer overflow attack blocker. In USENIX Security, July 2006.
– J. Z. Kolter and M. A. Maloof. Learning to detect malicious executables in the wild. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, pages 470–478, 2004.
Email Worm Detection (behavioural approach)
The Model
[Figure: pipeline diagram with the labels Outgoing Emails, Feature extraction, Training data, Test data, Machine Learning, Classifier, and the output "Clean or Infected?"]
Feature Extraction
● Per email features
  – Binary valued Features: Presence of HTML; script tags/attributes; embedded images; hyperlinks; Presence of binary, text attachments; MIME types of file attachments
  – Continuous-valued Features: Number of attachments; Number of words/characters in the subject and body
● Per window features
  – Number of emails sent; Number of unique email recipients; Number of unique sender addresses; Average number of words/characters per subject, body; Average word length; Variance in number of words/characters per subject, body; Variance in word length
  – Ratio of emails with attachments
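A few of the per-window features can be computed as in the sketch below. The email representation (a dict with recipients, subject, body, and attachments keys), the sample messages, and the use of population variance are all assumptions made for illustration; the slides do not specify the window size or the exact variance definition.

```python
from statistics import mean, pvariance


def window_features(emails):
    """Summarize a window of outgoing emails. Each email is a dict with the
    hypothetical keys 'recipients', 'subject', 'body', and 'attachments'."""
    body_word_counts = [len(e["body"].split()) for e in emails]
    recipients = {r for e in emails for r in e["recipients"]}
    with_attachments = sum(1 for e in emails if e["attachments"])
    return {
        "emails_sent": len(emails),
        "unique_recipients": len(recipients),
        "avg_body_words": mean(body_word_counts),
        "var_body_words": pvariance(body_word_counts),
        "attachment_ratio": with_attachments / len(emails),
    }


# A toy window of two outgoing emails.
window = [
    {"recipients": ["a@example.com"], "subject": "hi",
     "body": "see you at noon", "attachments": []},
    {"recipients": ["a@example.com", "b@example.com"], "subject": "file",
     "body": "open this", "attachments": ["doc.exe"]},
]

features = window_features(window)
```

The remaining per-window features (sender addresses, subject statistics, word lengths) follow the same pattern of aggregating one per-email measurement over the window.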
Feature Reduction & Selection
● Principal Component Analysis
  – Reduces higher dimensional data to a lower dimension
  – Helps reduce noise and overfitting
● Decision Tree
  – Used to select the best features
Experiments
● Data Set
  – Contains instances for both normal and viral emails.
  – Six worm types: bagle.f, bubbleboy, mydoom.m, mydoom.u, netsky.d, sobig.f
  – Collected from UC Berkeley
● Training, Testing:
  – Decision Tree: C4.5 algorithm (J48) on the Weka system
  – Support Vector Machine (SVM) and Naïve Bayes (NB).
Results
Conclusion & Future Work
● Three approaches have been tested
  – Apply classifier directly
  – Apply dimension reduction (PCA) and then classify
  – Apply feature selection (decision tree) and then classify
● Decision tree has the best performance
● Future Plans
  – Combine content based with behavioral approaches
● Offensive Operations
  – Honeypots, Information operations