Automatically Generating Models for Botnet Detection

Automatically Generating Models for Botnet Detection
Presenter: 葉倚任
Authors: Peter Wurzinger, Leyla Bilge, Thorsten Holz,
Jan Goebel, Christopher Kruegel, Engin Kirda
European Symposium on Research in Computer Security
(ESORICS'09)
Outline
- Introduction
- System Overview
- Model Generation Data
- Generating Detection Models
- Evaluation
- Conclusion
Introduction
- Two main kinds of network-based detection systems
  - Vertical correlation techniques
    - Detect individual bots
    - Check traffic patterns, the content of C&C traffic, and bot-related activities
    - Require prior knowledge of the bot's C&C channels and propagation vectors
  - Horizontal correlation techniques
    - Detect groups of bots
    - Based on network traffic
    - Require that at least two bots are present in the monitored network
Introduction (cont’d)
- Characteristic behavior of a bot
  - Receives commands from the botmaster
  - Carries out actions in response to these commands
- This paper proposes a two-stage detection model that leverages these two characteristics
- In the experiments, the authors generated models for 18 different bot families
  - 16 controlled via IRC
  - One via HTTP (Kraken)
  - One via a peer-to-peer network (Storm Worm)
System Overview
- Input of the system
  - A collection of bot binaries
- Launch each bot in a controlled environment and record its network activities (traces)
- Identify the commands that the bot receives as well as its corresponding responses
- Translate these observations into detection models
- Output of the system
  - Detection models for different bot families (see the pipeline sketch below)
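
To make the overall flow concrete, here is a minimal pipeline skeleton in Python. The names (DetectionModel, generate_detection_models, and the injected step functions) are illustrative assumptions, not the authors' implementation; the steps themselves follow the slide: run the binary, locate responses, cluster, and build a model per cluster.

```python
# Skeleton of the model-generation pipeline (illustrative names, not the authors' code).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DetectionModel:
    """One model per behavior cluster: a command signature plus a response specification."""
    command_tokens: List[bytes]     # content-based command signature
    response_profile: List[float]   # network-level response specification (averaged features)


def generate_detection_models(
    bot_binaries: List[str],
    record_trace: Callable[[str], object],       # run the bot in a sandbox, return its traffic trace
    locate_responses: Callable[[object], list],  # find behavior changes in the trace
    cluster_by_response: Callable[[list], list], # group traffic snippets with similar responses
    build_model: Callable[[object], DetectionModel],
) -> List[DetectionModel]:
    """Input: bot binaries. Output: detection models for different bot families."""
    models: List[DetectionModel] = []
    for binary in bot_binaries:
        trace = record_trace(binary)
        responses = locate_responses(trace)
        for cluster in cluster_by_response(responses):
            models.append(build_model(cluster))
    return models
```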
Detecting Procedure
- Stateful model (two-stage detection; a sketch follows below)
  1. Check whether a bot command is sent to a host
  2. If so, check whether the host's response behavior exceeds a threshold
     (e.g., the number of new connections opened by the host)
- Use content-based specifications to model commands (comparable to intrusion detection signatures)
- Use network-based specifications to model responses (comparable to anomaly detection)
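
The two-stage check can be sketched as follows. This is a minimal illustration, assuming the command signature is a list of byte tokens that must appear in order in a payload and the response model is a set of per-feature thresholds; the function names and data layout are assumptions.

```python
# Minimal sketch of the stateful two-stage detector (illustrative, not the authors' code).

def matches_command(payload: bytes, token_sequence: list) -> bool:
    """Stage 1: content-based check - do the signature tokens appear in order in the payload?"""
    pos = 0
    for token in token_sequence:
        idx = payload.find(token, pos)
        if idx < 0:
            return False
        pos = idx + len(token)
    return True


def exceeds_response_model(observed: dict, thresholds: dict) -> bool:
    """Stage 2: network-based check - does any observed feature exceed its threshold?"""
    return any(observed.get(feature, 0) >= bound for feature, bound in thresholds.items())


def detect(host_state: dict, host: str, payload: bytes, observed_behavior: dict,
           token_sequence: list, thresholds: dict) -> bool:
    """Raise an alert only if a command was seen for this host AND a suspicious response follows."""
    if matches_command(payload, token_sequence):
        host_state[host] = True               # remember: command seen (the stateful part)
    if host_state.get(host) and exceeds_response_model(observed_behavior, thresholds):
        return True                           # command + response behavior -> bot detected
    return False
```

For example, a response threshold could be something like `{"new_connections": 20}`, matching the slide's example of counting new connections opened by a host.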
Model Generation Data
- Run each bot binary for a period of several days
- Locating bot responses
- Finding bot commands
- Extracting model generation data
Locating bot responses
- Assumption: bot responses lead to a change in network behavior
- Partition the network traffic into consecutive time intervals of equal length
- For each time interval, compute 8 normalized features (called the traffic profile); a sketch follows below
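
As an illustration, a per-interval traffic profile could be computed as below. Only four of the eight features (UDP, HTTP, and SMTP packet counts, and the number of distinct IPs) are named elsewhere in this talk; the remaining features in the sketch and the log-based normalization are assumptions.

```python
# Sketch of a per-interval traffic profile (feature choice and normalization are assumptions).
import math

def traffic_profile(packets):
    """packets: list of dicts with keys 'proto', 'dst_ip', 'dst_port', 'size', 'payload'."""
    features = [
        len(packets),                                          # number of packets (assumed feature)
        sum(p['size'] for p in packets),                       # total bytes (assumed feature)
        len({p['dst_ip'] for p in packets}),                   # distinct destination IPs
        len({p['dst_port'] for p in packets}),                 # distinct ports (assumed feature)
        sum(1 for p in packets if p['proto'] == 'udp'),        # UDP packets
        sum(1 for p in packets if p['dst_port'] == 80),        # HTTP packets
        sum(1 for p in packets if p['dst_port'] == 25),        # SMTP packets
        sum(b >= 128 for p in packets for b in p['payload']),  # non-ASCII payload bytes (assumed)
    ]
    # Normalize so intervals with very different volumes remain comparable
    # (the exact normalization is not given in this transcript).
    return [math.log1p(x) for x in features]
```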
Locating bot responses (cont’d)
- Convert the sequence of traffic profiles (vectors) into time-series data d(t), where d(t) captures how much the traffic profile of interval t deviates from the profiles in a sliding window of size ε
- Locate bot responses by applying the CUSUM (cumulative sum) change-point detection algorithm to d(t); a sketch follows below
- ε = 5 and an interval length of 50 seconds delivered the best results in the tests
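
A basic version of this change-point step is sketched below: d(t) is taken here as the Euclidean distance between the profile of interval t and the mean profile of the previous ε intervals, and a simple one-sided CUSUM flags intervals where d(t) shifts upward. Both the exact form of d(t) and the CUSUM parameters (drift, threshold, baseline) are assumptions of this sketch.

```python
# Sketch: time series d(t) from traffic profiles plus a simple one-sided CUSUM detector.
import math

def distance_series(profiles, eps=5):
    """d(t): distance between profile t and the mean of the previous eps profiles (assumed form)."""
    d = []
    for t in range(eps, len(profiles)):
        window = profiles[t - eps:t]
        mean = [sum(col) / eps for col in zip(*window)]
        d.append(math.dist(profiles[t], mean))
    return d

def cusum_changepoints(d, drift=0.5, threshold=3.0):
    """Flag indices where the cumulative upward deviation of d exceeds the threshold."""
    baseline = sum(d) / len(d)
    s, changes = 0.0, []
    for i, value in enumerate(d):
        s = max(0.0, s + (value - baseline - drift))  # accumulate only upward deviations
        if s > threshold:
            changes.append(i)   # behavior change -> candidate bot response
            s = 0.0             # restart accumulation after reporting a change
    return changes
```

With ε = 5 and 50-second intervals as in the slide, each flagged index corresponds roughly to the interval in which a bot response starts.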
Finding bot commands
- After locating the bot responses, a small section of network traffic (a snippet) is extracted for each response
- Cluster the traffic snippets that lead to similar responses (a simple grouping sketch follows below)
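
One simple way to group snippets whose responses look alike is greedy, single-linkage style grouping on the behavior-profile vectors, as sketched below. This is not necessarily the clustering procedure used by the authors, and the distance threshold is an assumption.

```python
# Sketch: group traffic snippets whose response behavior profiles are close to each other.
import math

def cluster_by_response(snippets, profiles, max_dist=1.0):
    """Put a snippet in an existing cluster if its profile is within max_dist of any member
    (greedy single-linkage; the threshold is an assumption)."""
    clusters = []   # each cluster is a list of (snippet, profile) pairs
    for snippet, profile in zip(snippets, profiles):
        placed = False
        for cluster in clusters:
            if any(math.dist(profile, member_profile) <= max_dist
                   for _, member_profile in cluster):
                cluster.append((snippet, profile))
                placed = True
                break
        if not placed:
            clusters.append([(snippet, profile)])
    return clusters
```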
Extracting model generation data
- Extract two pieces of information for the subsequent model generation step (sketched below)
  - A snippet
    - Contains 90 seconds of traffic: the interval in which the response was located, plus the last 30 seconds of the previous interval and the first 10 seconds of the following one
  - The average of the traffic profile vectors
    - Averaged over the period from the start of the current response to the next change in behavior
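
Concretely, for a response located in interval t the two pieces of data could be derived as below, assuming 50-second intervals as reported earlier (30 s of the previous interval + the 50-second interval + 10 s of the next = 90 s); the helper names are illustrative.

```python
# Sketch: extract the two pieces of model generation data for one detected response.

def snippet_bounds(t, interval_len=50, before=30, after=10):
    """Return (start, end) in seconds for the 90-second traffic snippet around interval t."""
    start = t * interval_len - before          # last `before` seconds of the previous interval
    end = (t + 1) * interval_len + after       # first `after` seconds of the following interval
    return max(0, start), end                  # 30 + 50 + 10 = 90 seconds of traffic


def behavior_profile(profiles, t, next_change):
    """Element-wise average of the traffic profiles from the start of the response
    until the next change in behavior."""
    window = profiles[t:next_change]
    return [sum(col) / len(window) for col in zip(*window)]
```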
Generating Detection Models
- Command model generation
- Response model generation
Command model generation
- The goal is to identify elements that are common to the command traffic within a particular behavior cluster
- First, apply a second clustering refinement step that groups similar network packet payloads within each behavior cluster
- The longest common subsequence algorithm is applied to each set of similar payloads
- One token sequence is generated per set (see the sketch below)
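
The token-sequence generation can be illustrated by folding a pairwise longest common subsequence over all payloads in a set and keeping runs that occur verbatim in every payload as tokens. This is a minimal sketch; the exact tokenization rules (such as the minimum token length) are assumptions.

```python
# Sketch: derive a token sequence for a set of similar payloads via longest common subsequence.
from functools import reduce

def lcs(a: bytes, b: bytes) -> bytes:
    """Longest common subsequence of two byte strings (standard DP, fine for short payloads)."""
    n, m = len(a), len(b)
    dp = [[b""] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + a[i:i + 1]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[n][m]

def token_sequence(payloads, min_len=3):
    """Fold LCS over the payload set, then keep contiguous pieces that appear in every payload."""
    common = reduce(lcs, payloads)
    tokens, i = [], 0
    while i < len(common):
        j = i + 1
        while j <= len(common) and all(common[i:j] in p for p in payloads):
            j += 1                               # grow the token while it still matches everywhere
        token = common[i:j - 1]
        if len(token) >= min_len:
            tokens.append(token)                 # one ordered token of the command signature
        i = j - 1 if j - 1 > i else i + 1
    return tokens
```

Applied to the payloads of one payload set, the result is an ordered list of byte tokens that can serve as the content-based command specification for that cluster.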
Response model generation
- Compute the element-wise average of the individual behavior profiles for a behavior cluster
- Impose minimal bounds on certain network features:
  - 1,000 for UDP packets
  - 100 for HTTP packets
  - 10 for SMTP packets
  - 20 for distinct IPs
- No detection model is generated if the response profile exceeds none of these thresholds (see the sketch below)
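
The response-model construction with these minimal bounds can be sketched as follows. The threshold values are the ones listed on this slide; the feature keys, and the choice to apply the check to the averaged profile, are assumptions of the sketch.

```python
# Sketch: build a response model from a behavior cluster, or discard it if the profile
# never exceeds any of the minimal bounds given on the slide.

MIN_BOUNDS = {              # minimal bounds from the slide
    "udp_packets": 1000,
    "http_packets": 100,
    "smtp_packets": 10,
    "distinct_ips": 20,
}

def response_model(behavior_profiles):
    """behavior_profiles: list of dicts (feature name -> value), one per response in the cluster."""
    # Element-wise average over the cluster.
    avg = {key: sum(p[key] for p in behavior_profiles) / len(behavior_profiles)
           for key in behavior_profiles[0]}
    # Keep the model only if the averaged profile exceeds at least one minimal bound.
    if not any(avg.get(feature, 0) >= bound for feature, bound in MIN_BOUNDS.items()):
        return None          # response too weak -> no detection model is generated
    return avg
```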
Evaluation
- Collected a set of 416 different bot samples (distinct MD5 hashes)
  - From Anubis
  - The collection period was more than 8 months
- Each bot produced a traffic trace with a length of five days
- Divided the samples into families of bots
  - 16 different IRC bot families (with 356 traffic traces)
  - One HTTP bot family (with 60 traffic traces)
  - One P2P bot family (Storm Worm, with 30 traffic traces)
Detection Capability
- Split the set of 446 network traces into training sets and test sets
- Each training set contained 25% of one bot family's traces
- This procedure was performed four times per family (four-fold cross-validation; a split sketch follows below)
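
The split described above (25% of a family's traces used for model generation, the remaining 75% for testing, repeated four times) can be sketched as below; the shuffling step is an assumption, not stated in the talk.

```python
# Sketch: four-fold split where each fold trains on a different 25% of one family's traces.
import random

def four_fold_splits(traces, seed=0):
    """Yield (training_set, test_set) pairs, one per fold; training = 25%, test = the rest."""
    traces = list(traces)
    random.Random(seed).shuffle(traces)        # shuffling is an assumption
    fold_size = len(traces) // 4
    for k in range(4):
        train = traces[k * fold_size:(k + 1) * fold_size]
        test = traces[:k * fold_size] + traces[(k + 1) * fold_size:]
        yield train, test
```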
Real-World Deployment
- Deployed sensors
  - In front of the residential homes of RWTH Aachen University
  - At a Greek university network
- The total traffic was on the order of 94 billion network packets over a period of more than three months at the two sites in Europe
Real-World Deployment (cont'd)
- In the Greek network, most of the reported cases were false positives
- "BotHunter w/o Blacklist" means BotHunter without its blacklists of known DNS names and IP addresses
- The detection rate of BotHunter w/o Blacklist in the detection capability experiment drops to 39%
Conclusion
- This paper proposed a two-stage detection method that combines a command model and a response model
- It automatically derives signatures for the bot commands and network-level specifications for the bot responses
- It can generate models for IRC bots, HTTP bots, and even P2P bots such as Storm Worm