Network-Based Malware Detection PhD Proposal, F. Tegeler


BotFinder: Finding Bots in Network Traffic
Without Deep Packet Inspection
F. Tegeler, X. Fu (U Goe), G. Vigna, C. Kruegel (UCSB)
Motivation

Sophisticated type of malware: bots
- Multiple bots under a single control
- Distinct characteristic: a command and control (C&C) channel between the botmaster and the victim hosts

[Figure: botnet with a C&C server controlling victim hosts]

Threats raised by bots:
- Spam
- Information theft (e.g., credit card data)
- Identity theft
- Click fraud
- Distributed denial of service (DDoS) attacks
- $2M-$600M revenue estimated for a single botnet
CoNEXT 2012
Challenge

How to detect bot infections?
- Classically: end host – anti-virus scanner
  - But: requires installation on every machine
- Complementary approach: network based
  - Vertical correlation (single end host) (Rishi, BotHunter, Wurzinger et al., …): packet analysis of HTTP structure, payloads, typical signatures
  - Horizontal correlation (multiple end hosts) (BotSniffer, BotMiner, TAMD, …):
    - Typical behavior (spam, DDoS traffic)
    - Anomaly detection (Giroire et al.)
    - Two or more hosts perform the same malicious activity
Challenge and Solution Approach

- Existing vertical approaches typically rely on scanning, spam, or DDoS traffic and require packet inspection.
- Existing horizontal approaches require multiple infected hosts in a single domain and are also triggered by noisy activity (e.g., BotMiner).
- Contribution: vertical detection of single bot infections without packet inspection!
- Botmasters establish C&C connections frequently to disseminate orders, and these C&C connections show patterns.
- Use these statistical properties of C&C communication! Core assumption: periodic behavior!
Methodology

Basic machine learning approach:
- Learn about bot behavior: training phase (a)
- Use the learned behavior: detection phase (b)

Training:
- Observe malware in a controlled environment
- Extract flows and build traces
- Perform statistical analysis to obtain "features"
- Create models that describe the malware
Methodology – Detection Phase

Detection:
- Obtain traffic
- Perform the same analysis as in training
- Compare the statistical features of the traffic with the models

During the whole process: no deep packet inspection!
Methodology – Details

Analysis is performed on flows. A flow is a connection from A to B:
- Source IP address
- Destination IP address
- Source port
- Destination port
- Transport protocol ID
- Start time
- Duration of connection
- Number of bytes
- Number of packets

This information is easy to obtain in real-world environments! Example: NetFlow.
Methodology – Details cont'd

Trace: a chronologically ordered sequence of flows.
- Represents long-term communication behavior!
- Example for two dimensions: time and duration

[Figure: example trace plotted over time and flow duration]
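The trace-building step described above can be sketched as follows. This is a minimal illustration, not BotFinder's implementation: the flow field names and the exact connection key (source IP, destination IP, destination port, protocol) are assumptions based on the NetFlow-style attributes listed on the previous slide.

```python
from collections import defaultdict

def build_traces(flows):
    """Group flows by an assumed (src IP, dst IP, dst port, protocol)
    connection key and order each group chronologically, yielding one
    trace per communication pair."""
    traces = defaultdict(list)
    for flow in flows:
        key = (flow["src_ip"], flow["dst_ip"], flow["dst_port"], flow["proto"])
        traces[key].append(flow)
    # Sort each trace by flow start time to get the chronological sequence.
    return {key: sorted(group, key=lambda f: f["start"])
            for key, group in traces.items()}
```

Each resulting trace captures the long-term communication behavior between one host pair, which is what the statistical features are computed over.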
Distinguishing Characteristics

Bot traffic is more regular than normal, benign traffic!

[Figure: periodicity comparison of bot and benign traces; the lower the bar, the more periodic]
Methodology – Features

Use statistical features to describe a trace:
- Average time between two flows
- Average duration of flows
- Average number of source bytes
- Average number of destination bytes
- A Fourier transform to detect underlying communication frequencies; more robust than simple averaging
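A rough sketch of these five features in pure Python. The flow field names (`start`, `duration`, `src_bytes`, `dst_bytes`) are assumptions, and the frequency step here is a naive DFT over a binned flow-start signal rather than BotFinder's actual FFT analysis; it only illustrates the idea of finding the dominant communication period.

```python
import cmath
from statistics import mean

def dominant_period(starts, bin_size=60.0):
    """Bin flow start times into fixed-size slots, then take a naive DFT
    and return the period (seconds) of the strongest non-DC component."""
    t0 = starts[0]
    n_bins = int((starts[-1] - t0) // bin_size) + 1
    signal = [0.0] * n_bins
    for s in starts:
        signal[int((s - t0) // bin_size)] += 1.0
    best_k, best_mag = 1, -1.0
    for k in range(1, n_bins // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * i / n_bins)
                    for i, x in enumerate(signal))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return n_bins * bin_size / best_k

def extract_features(trace):
    """Compute the five per-trace features named on the slide."""
    starts = [f["start"] for f in trace]
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return {
        "avg_interval": mean(gaps),
        "avg_duration": mean(f["duration"] for f in trace),
        "avg_src_bytes": mean(f["src_bytes"] for f in trace),
        "avg_dst_bytes": mean(f["dst_bytes"] for f in trace),
        "dominant_period": dominant_period(starts),
    }
```

For a bot that phones home every 10 minutes, both `avg_interval` and `dominant_period` land near 600 seconds; the frequency-domain estimate degrades more gracefully when individual intervals are noisy.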
Methodology – Models

Example scenario:
- Multiple binary versions of the same bot family generate traces
- Example, time-interval feature: the observed average intervals (e.g., 7.5, 8, 8.2, 9, 17, 18, 20, 22, 24, 190, 210, 230, and 912 minutes) are clustered, and the cluster centroids form the model: "Intervals of 8, 20, or 210 minutes are typical for this bot."
- Clusters with low standard deviation are trustworthy representations of malware behavior
- Drop very small (one-element) clusters
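The clustering step can be sketched with a deliberately simple 1-D scheme. This is not the paper's clustering algorithm; it is a relative-gap heuristic chosen only to reproduce the slide's example, where the interval values form clusters near 8, 20, and 210 minutes and the lone 912-minute outlier is dropped as a one-element cluster.

```python
from statistics import mean, pstdev

def cluster_intervals(values, gap_factor=0.5):
    """Sort the values and start a new cluster whenever the gap to the
    previous value exceeds gap_factor times that value. Return
    (centroid, stddev) pairs, dropping one-element clusters."""
    clusters, current = [], []
    for v in sorted(values):
        if current and v - current[-1] > gap_factor * current[-1]:
            clusters.append(current)
            current = []
        current.append(v)
    clusters.append(current)
    return [(mean(c), pstdev(c)) for c in clusters if len(c) > 1]

# Intervals in minutes from the slide's example traces:
intervals = [7.5, 8, 8.2, 9, 17, 18, 20, 22, 24, 190, 210, 230, 912]
models = cluster_intervals(intervals)
# Three clusters survive, near 8, 20, and 210 minutes; 912 is dropped.
```

The stddev returned alongside each centroid is what makes a cluster "trustworthy": a tight cluster says the bot really does communicate at that rhythm.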
Methodology – Model Matching

Compare a trace to the cluster centers of a malware family model:
1. If a trace feature "hits" a model: increase the scoring value based on cluster quality
2. Take the model with the highest scoring value
3. If the scoring value > threshold: consider the model matched

Some more math is involved (quality of the matching trace, clustering algorithm, minimal trace length, etc.)
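The three matching steps above can be sketched as follows. The relative-tolerance hit test and the inverse-spread quality weight are illustrative assumptions, not BotFinder's exact scoring formula, which involves the additional math the slide alludes to.

```python
def match_score(features, model, tolerance=0.1):
    """Step 1: `model` maps feature names to (centroid, stddev) clusters.
    A feature hits a cluster when it lies within a relative tolerance of
    the centroid; tight (low-stddev) clusters contribute more."""
    score = 0.0
    for name, value in features.items():
        for centroid, stddev in model.get(name, []):
            if abs(value - centroid) <= tolerance * centroid:
                # Weight the hit by an inverse measure of relative spread,
                # so trustworthy clusters dominate the score.
                score += 1.0 / (1.0 + stddev / centroid)
                break
    return score

def best_match(features, models, threshold=1.5):
    """Steps 2 and 3: pick the highest-scoring family model and report a
    detection only if it exceeds the acceptance threshold."""
    family, score = max(((f, match_score(features, m))
                         for f, m in models.items()), key=lambda x: x[1])
    return (family, score) if score > threshold else (None, score)
```

The acceptance threshold is the knob later varied in the cross-validation experiments: raising it trades detection rate for fewer false positives.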
Evaluation

The method is implemented in BotFinder.
- Six representative malware families
- Dataset LabCapture: 2.5 months of lab traffic with 60 machines
  - Full traffic capture – allows verification
  - Should contain benign traffic only
- Dataset ISPNetFlow: one month of NetFlow data from a large network
  - Reflects 540 terabytes of data, or 150 megabytes(!) per second of traffic
  - No ground truth, but the possibility to compare against blacklisted IP addresses and to judge usability
Evaluation – Cross Validation

Execution:
- Split the ground-truth malware dataset randomly into a training set and a detection set
- Mix the detection set with all traces from the LabCapture dataset
- Train BotFinder on the training set
- Run BotFinder against the detection set
- Repeat the experiment 50 times per acceptance threshold

Result summary: 77% detection rate with low false positives (1 out of 5 million traces)

[Figure: training/detection split of the ground-truth data mixed with LabCapture; detection rate and false positives across acceptance thresholds]
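The cross-validation procedure is straightforward to sketch. `train_fn` and `detect_fn` are assumed interfaces standing in for BotFinder's training and matching stages; the split fraction is also an assumption, as the slide does not state it.

```python
import random

def cross_validate(malware_traces, benign_traces, train_fn, detect_fn,
                   runs=50, train_frac=0.5, seed=0):
    """Repeatedly split the ground-truth malware traces into training and
    detection sets, mix the detection set with benign traces, and record
    (detection rate, false-positive rate) per run."""
    rng = random.Random(seed)
    rates = []
    for _ in range(runs):
        shuffled = malware_traces[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        train, detect = shuffled[:cut], shuffled[cut:]
        models = train_fn(train)
        hits = sum(detect_fn(models, t) for t in detect)
        false_pos = sum(detect_fn(models, t) for t in benign_traces)
        rates.append((hits / len(detect),
                      false_pos / max(len(benign_traces), 1)))
    return rates
```

Running this once per acceptance threshold, as the slide describes, yields the detection-rate/false-positive trade-off curve.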
Evaluation – Comparison to BotHunter

BotHunter* is an optimized Snort-based intrusion detection system. It requires packet inspection and leverages anomaly detection.
- Many false positives for BotHunter, typically raised by IRC activity or binary downloads
- Detection results:
  - BotFinder detection rate: 77.5%
  - BotHunter detection rate: 10%
- BotFinder outperformed BotHunter and shows relatively high detection rates with low false positives

*: http://www.bothunter.net
Evaluation – ISPNetFlow

- Challenging to analyze, as minimal information (only internal IP ranges) is available
- 542 traces (out of >1 billion traces) are identified by BotFinder as malicious
- On average, 14.6 alerts per day
Evaluation – ISPNetFlow cont'd

Speed is sufficient for large networks:
- 3 min for 15M NetFlow records (~15 min of ISPNetFlow, 800 MB file size)
- Processing is dominated by feature extraction, which is easy to parallelize

Detailed IP-address investigation of raised alarms:
- Comparison of external IPs with publicly available blacklists*
- Result: 56% of all IPs are known to be malicious!
- The "false positives" show a large cluster of connections to Apple
- With Apple whitelisted: 61% of all raised alerts connect to known malicious pages
- Strong support that BotFinder works!

*: rbls.org
Bot Evolution

Botmasters may try to evade detection by changing communication patterns:
- Introduction of randomized intervals
- Introduction of large gaps between flows
- IP or domain flux (fast-changing C&C servers)

Randomization impact: randomizing individual features does not significantly impact detection.

[Figure: detection rate under increasing randomization; the rate stays above a lower limit]
FFT Peak Detection with Gaps

[Figure: FFT spectrum of a trace with inserted gaps; the communication-frequency peak remains detectable]
Anti-Domain Flux

Problem: fast C&C domain/IP changes
- Subtrace 1: A to C&C IP 1
- Subtrace 2: A to C&C IP 2
- On a change of IP address, the trace "breaks"
- Problem: BotFinder can't create a sufficiently long trace

Idea:
- Look at each source IP and compare all of its connections with each other
- When two connections look very similar, combine them into one!
- Inherently a horizontal correlation per source IP!
Additional Pre-Processing

How can one check that the recombination works?
- Split real C&C traces, and random other long traces from real traffic, into pieces. Does BotFinder recombine them?

[Figure: recombination quality; the large distance between C&C and non-C&C traces is a good sign]

"Low" overhead: an 85% increase in the ISPNetFlow dataset.
Conclusion

- High detection rates (nearly 80%) with low false positives and no need for packet inspection!
- BotFinder shows better results than BotHunter.
- 61% of BotFinder-flagged connections in the ISPNetFlow dataset were destined to known, blacklisted hosts!
- BotFinder is robust against potential evasion strategies.
Questions

Thank you for your attention!

Any questions?