Transcript nips-slides

Tracking Malicious Regions of the
IP Address Space Dynamically
Introduction

Recent focus on network-level behaviour of malicious activity:
- Spam originating from different /24 prefixes [RF06]
- Spam originating from long-lived network-aware clusters [VSSHS07]
- Malicious activity originating from different /16 prefixes [CSFJWS07]

Application:
- A real-time, effective mechanism for blocking traffic at line speed

Challenges:
- Transience of individual IP addresses, due to e.g. bots and DHCP effects
- Effective clustering: many different clusterings of IP prefixes are possible
- Scaling to large volumes of data

[RF06] Ramachandran & Feamster, Understanding the Network-level Behavior of Spammers, SIGCOMM '06
[VSSHS07] Venkataraman et al., Exploiting Network Structure for Proactive Spam Mitigation, Security '07
Study: Transience of Individual IPs

Measurement study: analysis of spamming IP addresses & regions
Data: enterprise mail logs over 6 months, 28 million emails

Spamming IPs present on many days contribute little spam!
Study: Persistence of IP Prefix Clusters

Network-aware clusters [KW00]:
- A pre-defined set of IP prefixes that reflects Internet routing structure
- An IP address belongs to the cluster with the longest matching prefix (see the sketch after this slide)

Most spam comes from "bad" clusters that are present for many days:
90% of total spam comes from "bad" clusters present for 60+ days.

[KW00] Krishnamurthy & Wang, On Network-Aware Clustering of Web Clients, SIGCOMM '00
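As a concrete illustration of the longest-prefix rule above, here is a minimal Python sketch, assuming the clusters are given as a list of CIDR prefixes (the prefixes below are made up for the example):

```python
# Minimal sketch: assign an IP to the network-aware cluster whose prefix is the
# longest match. The cluster list here is purely illustrative.
import ipaddress

clusters = [ipaddress.ip_network(p) for p in
            ["0.0.0.0/0", "192.0.0.0/16", "192.0.2.0/24", "198.51.100.0/24"]]

def longest_matching_cluster(ip_str):
    ip = ipaddress.ip_address(ip_str)
    matches = [net for net in clusters if ip in net]
    return max(matches, key=lambda net: net.prefixlen)   # longest prefix wins

print(longest_matching_cluster("192.0.2.55"))    # -> 192.0.2.0/24
print(longest_matching_cluster("192.0.77.3"))    # -> 192.0.0.0/16
```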
Introduction

Recent focus on network-level behaviour of malicious activity:
- Spam originating from different /24 prefixes [RF06]
- Spam originating from long-lived network-aware clusters [VSSHS07]
- Malicious activity originating from different /16 prefixes [CSFJWS07]

Challenges:
- Transience of individual IP addresses, due to e.g. bots and DHCP effects
- Scaling to large volumes of data
- Effective clustering: many different clusterings of IP prefixes are possible

Question: Can we automatically compute the optimal clustering for tracking and predicting malicious activity?

[RF06] Ramachandran & Feamster, Understanding the Network-level Behavior of Spammers, SIGCOMM '06
[VSSHS07] Venkataraman et al., Exploiting Network Structure for Proactive Spam Mitigation, Security '07
Problem, v.1 (naïve)

Input: IP addresses labeled normal(+)/malicious(-)
- e.g., e-mail logs with the sending mail server's IP and SpamAssassin labels: spam(-) or non-spam(+)

Output: a tree of IP prefixes that
- is optimal for classifying IPs as normal(+)/malicious(-)
- contains no more than k leaves

Limit the output IP tree to k leaves:
- Small output state
- Avoids overfitting

[Figure: the binary prefix tree rooted at 0.0.0.0/0, branching into 0.0.0.0/1 and 128.0.0.0/1 and so on down to /32 leaves (individual IP addresses), with leaves labelled + or -]
Challenges

As stated, version 1 is easy: use dynamic programming (a sketch of this offline DP follows this slide)! However:

IP tree over dynamic and evolving data:
- Data is collected and labelled over time (e.g., e-mail logs, traffic logs)
- Compromised/malicious IP prefixes may change over time
- Want to compute the updated tree without going back to past data

Low space & low overhead: up to 2^32 leaves in the IP tree!
- Want the algorithm to use space near-linear in k

Adversarially-generated IP addresses:
- Want guarantees as a function of the data seen & the adversary's power

Approach: Design online algorithms to address all of these issues
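The offline dynamic program referred to above is standard; the following is a minimal Python sketch under the assumption that we first build the full binary trie over the observed IPs with +/- counts at every node. It is illustrative only (names, constants, and the brute-force budget split are not the authors' code), and it is exactly the kind of computation the online algorithm must avoid redoing over past data.

```python
# Sketch of the offline DP behind Problem v.1: find the prefix tree with at most
# k leaves (every split creates both children) that misclassifies the fewest
# labelled IPs. Memoise (node, k) in practice; kept brute-force here for brevity.
import ipaddress

class Node:
    def __init__(self):
        self.pos = 0                  # normal(+) IPs under this prefix
        self.neg = 0                  # malicious(-) IPs under this prefix
        self.children = [None, None]  # child for next address bit 0 / 1

def insert(root, ip_int, label):
    """Insert a 32-bit IP (as int) with label +1 (normal) or -1 (malicious)."""
    node = root
    for bit in range(31, -1, -1):
        node.pos += label > 0
        node.neg += label < 0
        b = (ip_int >> bit) & 1
        if node.children[b] is None:
            node.children[b] = Node()
        node = node.children[b]
    node.pos += label > 0
    node.neg += label < 0

def best_cost(node, k):
    """Fewest misclassified IPs in node's subtree using a tree with <= k leaves."""
    if node is None:
        return 0                                  # no observed IPs under this prefix
    as_leaf = min(node.pos, node.neg)             # label the leaf with the majority sign
    if k <= 1 or all(c is None for c in node.children):
        return as_leaf
    split = min(best_cost(node.children[0], k0) + best_cost(node.children[1], k - k0)
                for k0 in range(1, k))
    return min(as_leaf, split)

# Tiny example: two spam sources in 192.0.2.0/24, one normal sender elsewhere.
root = Node()
for ip, label in [("192.0.2.7", -1), ("192.0.2.9", -1), ("10.0.0.1", +1)]:
    insert(root, int(ipaddress.ip_address(ip)), label)
print(best_cost(root, k=2))   # -> 0: a 2-leaf tree separates them perfectly
```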
Online Learning

[Figure: a stream of data (e.g., mail logs labelled +/-) fed one item at a time into the algorithm]

Low space & overhead: IPs are seen one at a time
- The algorithm only needs to maintain a small internal state
- Naturally incorporates the dynamic aspect of the data

Guarantees: a function of the mistakes made by the optimal offline tree
- Quantifies the cost of learning the tree online
- Data may be generated adversarially
Problem, v.2: Online Learning of Static Tree

Input: a stream of IPs labelled normal(+)/malicious(-)

Output: a tree of IP prefixes that
- predicts nearly as well as the OPTIMAL offline tree with k leaves
- uses low space in the online model of learning

[Figure: the prefix tree from 0.0.0.0/0 down to /32 leaves (individual IP addresses), with leaves labelled + or -]
Dynamic IP Prefix Tree

Malicious IP regions may change over time.

New goal: predict nearly as well as the optimal changing tree
- The optimal tree may make two kinds of changes: (1) a leaf changes sign, (2) a leaf splits
- Prediction guarantee: our mistakes = O(OPTIMAL tree's cost)
- The optimal tree's cost is a function of the mistakes it makes and the changes it makes
Problem: Online Learning of Dynamic Tree

Problem: Compute an IP tree online that predicts nearly as well as the best changing tree with k leaves, using space Õ(k).

[Figure: the prefix tree (0.0.0.0/0 down to /32 leaves) evolving over time, with leaves changing sign and splitting]
Related Problems

Machine learning:
- Predicting as well as the best pruning of a decision tree [HS97]
  - Our requirements: low space, tracking a dynamic tree online, real-world implementation
- Learning decision trees in streaming models [DH00]
  - Our requirement: adversarial data

Online algorithms:
- Paging algorithms
  - Simpler problem: the tree is fixed; no need to "learn" a tree

[HS97] Helmbold & Schapire, Predicting Nearly As Well As the Best Pruning of a Decision Tree, Machine Learning '97
[DH00] Domingos & Hulten, Mining High-Speed Data Streams, KDD '00
Overview of our Algorithm

Given an IP address a.b.c.d, predict:
- Trace the IP's path on the tree & flag all nodes on the path (i.e., all prefixes of the input IP that are in the tree)
- Label the IP by combining the weights of the flagged nodes

Given the correct label, update:
- If predicted correctly, do nothing
- If predicted incorrectly (predicted label differs from correct label):
  - update the flagged nodes' weights
  - grow the tree by splitting a leaf (if necessary)
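A minimal end-to-end sketch of this predict/update cycle, with deliberately simplified placeholder rules (a weighted vote of flagged nodes and a halving update) standing in for the expert-based rules on the next slides; all names and constants are illustrative, not the authors' implementation:

```python
# Simplified sketch of the per-IP cycle: flag the prefixes on the IP's path,
# predict from their weights, and on a mistake update weights and maybe split the
# leaf. The exact weighting/labelling/splitting rules come on the following slides.
SPLIT_THRESHOLD = 5   # assumed constant, roughly the "1/eps" of the analysis slide

class PrefixNode:
    def __init__(self, depth):
        self.depth = depth              # prefix length represented by this node
        self.weight = 1.0               # relative importance among flagged nodes
        self.sign = +1                  # current +/- label this node votes for
        self.mistakes = 0               # mistakes charged to this node as a leaf
        self.children = {}              # next address bit (0/1) -> PrefixNode

def flagged_path(root, ip):
    """Trace the IP down the tree; return every node on its path (the flagged prefixes)."""
    node, path = root, [root]
    for bit in range(31, -1, -1):
        child = node.children.get((ip >> bit) & 1)
        if child is None:
            break
        node = child
        path.append(node)
    return path

def predict(path):
    """Combine the flagged nodes: a weighted vote of their current +/- labels."""
    score = sum(n.weight * n.sign for n in path)
    return +1 if score >= 0 else -1

def process(root, ip, true_label):
    """One online round: predict, and on a mistake update weights and maybe grow the tree."""
    path = flagged_path(root, ip)
    if predict(path) == true_label:
        return
    for n in path:
        if n.sign != true_label:
            n.weight *= 0.5             # demote flagged nodes that voted the wrong way
    leaf = path[-1]
    leaf.mistakes += 1
    if leaf.mistakes >= SPLIT_THRESHOLD and leaf.depth < 32:
        bit = (ip >> (31 - leaf.depth)) & 1          # first address bit below this prefix
        child = PrefixNode(leaf.depth + 1)
        child.sign = true_label                      # the new leaf starts with the observed label
        leaf.children[bit] = child
```

The following slides refine each placeholder piece: the relative weights of flagged nodes, each node's +/- label, when to split a leaf, and how to keep the tree small.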

Four Algorithmic Questions

Fixed tree structure:
- Relative importance of flagged nodes?
- Label of a flagged node: positive or negative?

Changing the tree structure:
- When to grow the tree?
- How to maintain a small tree?
Relative Importance of Nodes

Use the sleeping experts algorithm:
- Every node is an expert
- Each flagged node is an "awake" expert
- Best expert: a leaf of the optimal tree

Algorithm:
- Each node has a "relative" weight
- Predict with all experts that are awake, e.g., by a weighted coin toss
- Update by redistributing weight among the awake experts

[Figure: prefix tree with a per-node weight w0, w1, ..., w17 at each node]
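A small, self-contained sketch of one such sleeping-experts step, assuming each flagged node supplies a weight and a +/- vote; the penalty constant, the per-round (rather than per-mistake) update, and the data layout are illustrative choices, not the exact rule from the talk:

```python
# Sleeping-experts step: only the flagged ("awake") nodes vote, via a weighted coin
# toss, and the weight lost by wrong awake experts is redistributed among the awake
# experts so their total mass is unchanged; sleeping experts are untouched.
import random

BETA = 0.5   # multiplicative penalty for an awake expert that voted wrongly

def sleeping_experts_round(awake, true_label, rng=random):
    """awake: list of dicts {'weight': float, 'vote': +1 or -1}. Returns the prediction."""
    total = sum(e["weight"] for e in awake)
    pos_mass = sum(e["weight"] for e in awake if e["vote"] > 0)
    prediction = +1 if rng.random() < pos_mass / total else -1   # weighted coin toss

    # Penalise awake experts whose vote disagreed with the true label ...
    for e in awake:
        if e["vote"] != true_label:
            e["weight"] *= BETA
    # ... then rescale so the awake experts keep their previous total weight.
    new_total = sum(e["weight"] for e in awake)
    for e in awake:
        e["weight"] *= total / new_total
    return prediction

# Example: three flagged nodes on an IP's path voting on its label.
experts = [{"weight": 1.0, "vote": -1}, {"weight": 0.5, "vote": +1}, {"weight": 0.25, "vote": -1}]
print(sleeping_experts_round(experts, true_label=+1), experts)
```

The key property is that unflagged (sleeping) experts are never touched, and the awake experts' total weight is conserved across the update.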
Label of Flagged Nodes

Shifting experts problem:
- Each node has 2 internal experts:
  - a "positive" expert: label +
  - a "negative" expert: label -
- Best expert: the node's (current) label

Algorithm:
- Predict by a weighted coin toss between the + and - experts
- Update by shifting weight from the incorrect expert to the correct one

Tracking a dynamic tree:
- Automatically incorporates a leaf changing its sign
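A per-node sketch of this shifting-experts rule; the shift fraction and the initial weights are assumed values for illustration:

```python
# Each node keeps a "+" expert and a "-" expert, predicts by a weighted coin toss
# between them, and on each labelled example shifts a fraction of weight from the
# incorrect expert to the correct one, so the node can track a flipping prefix.
import random

SHIFT = 0.2   # assumed fraction of the wrong expert's weight moved to the right one

class NodeLabel:
    def __init__(self):
        self.w_pos = 0.5   # weight of the "label this prefix +" expert
        self.w_neg = 0.5   # weight of the "label this prefix -" expert

    def predict(self, rng=random):
        p_pos = self.w_pos / (self.w_pos + self.w_neg)
        return +1 if rng.random() < p_pos else -1   # weighted coin toss

    def update(self, true_label):
        """Shift weight from the incorrect internal expert to the correct one."""
        if true_label > 0:
            moved = SHIFT * self.w_neg
            self.w_neg -= moved
            self.w_pos += moved
        else:
            moved = SHIFT * self.w_pos
            self.w_pos -= moved
            self.w_neg += moved

# Example: a prefix that turns malicious; the node's weights follow the change.
node = NodeLabel()
for label in [+1, +1, -1, -1, -1, -1]:
    node.update(label)
print(round(node.w_pos, 3), round(node.w_neg, 3))
```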
Growing the Tree

Algorithm:
- Track the number of mistakes made by each leaf
- Split a leaf after it has made sufficiently many mistakes

Tracking a dynamic tree:
- Also incorporates the changes caused by leaf splitting
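A small sketch of this split rule, where the threshold is an assumed constant corresponding to the 1/ε of the analysis slide and the Leaf class is a stripped-down stand-in for the tree nodes in the earlier sketch:

```python
# Charge each misprediction to the leaf the IP reached, and split that leaf once it
# has accumulated enough mistakes.
SPLIT_AFTER = 5   # assumed constant, roughly the "1/eps" of the analysis slide

class Leaf:
    def __init__(self, depth):
        self.depth = depth        # prefix length of the leaf
        self.mistakes = 0
        self.children = {}        # next address bit -> Leaf, filled in when the leaf splits

def record_mistake(leaf, ip):
    """Charge a misprediction to this leaf; split it once it has erred enough."""
    leaf.mistakes += 1
    if leaf.mistakes >= SPLIT_AFTER and leaf.depth < 32:
        next_bit = (ip >> (31 - leaf.depth)) & 1   # first address bit below this prefix
        leaf.children[next_bit] = Leaf(leaf.depth + 1)
        leaf.mistakes = 0

# Example: five mistakes on IPs under a /24 leaf trigger a split into a /25 child.
leaf = Leaf(depth=24)
for _ in range(5):
    record_mistake(leaf, ip=0xC0000201)   # 192.0.2.1
print(sorted(leaf.children))              # -> [0]: the high bit of the last octet is 0
```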
Maintaining Small-Space Tree

Convert to a paging problem:
- Each node is a page
- Size of the optimal cache: 2k
- Q: which node to discard?

Use paging algorithms & competitive analysis:
- e.g., using Flush-When-Full: start from scratch after the tree has grown to its maximum size
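A minimal sketch of the Flush-When-Full idea in this setting; the cap (a small multiple of k) and the per-node state are placeholders, not the exact constants from the analysis:

```python
# Treat tree nodes as pages, cap the number of nodes, and when the cap is hit flush
# the tree back to a bare root and let it regrow from the incoming stream.
class FWFTree:
    def __init__(self, k, slack=4):
        self.max_nodes = slack * k       # cache size: an assumed small multiple of the leaf budget k
        self.reset()

    def reset(self):
        self.nodes = {"": None}          # prefix string -> per-node state (details elided)

    def add_node(self, prefix, state=None):
        """Add a node; if the tree is 'full', flush everything and start from scratch."""
        if len(self.nodes) >= self.max_nodes:
            self.reset()                 # Flush-When-Full: discard all pages at once
        self.nodes[prefix] = state

# Example: with k = 2 the tree flushes whenever it reaches 8 nodes.
tree = FWFTree(k=2)
for i in range(20):
    tree.add_node(bin(i)[2:])
print(len(tree.nodes) <= tree.max_nodes)   # -> True
```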
Analysis

Define ε so that:
- additive change to internal node weight per mistake: ε
- multiplicative loss factor to relative node weight per mistake: 1/ε
- mistakes needed before splitting a node: 1/ε

Then, E[# mistakes of algorithm] =
  (1+ε) · (mistakes of OPTIMAL)
  + (1/ε) · (sign changes of OPTIMAL's leaves)
  + (log k / ε) · (splits of OPTIMAL's leaves)

Space required using FWF: O(k log k / ε²)
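Typeset in LaTeX, with M_OPT, S_OPT, and P_OPT as shorthand for the optimal changing tree's mistakes, leaf sign changes, and leaf splits, the stated guarantee reads:

```latex
\mathbb{E}[\#\,\text{mistakes of algorithm}]
  = (1+\epsilon)\, M_{\mathrm{OPT}}
  + \frac{1}{\epsilon}\, S_{\mathrm{OPT}}
  + \frac{\log k}{\epsilon}\, P_{\mathrm{OPT}},
\qquad
\text{space with FWF} = O\!\left(\frac{k \log k}{\epsilon^{2}}\right)
```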
Implementation Issues

Sparse data from some IP prefixes:
- Might not see any IP addresses from some prefixes
- Include a loss function to penalize prefixes for "nothingness"

Efficient implementations for large-scale data:
- Clusters might be too big to be meaningful
- Coalesce nodes appropriately in the binary tree, so that it effectively becomes a trie (see the sketch below)
- Randomization calls are computationally expensive

Experimental performance: in progress
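One way to read the node-coalescing point above is as path compression: chains of single-child internal nodes carry no extra information and can be collapsed into one node labelled by the whole prefix. A minimal sketch of that reading, not tied to the node classes used in the earlier sketches:

```python
# Collapse single-child chains in a binary prefix tree so it becomes a
# path-compressed trie; structure and names are illustrative.
class TrieNode:
    def __init__(self, bits=""):
        self.bits = bits        # bits this node adds below its parent
        self.children = {}      # first extra bit ("0"/"1") -> TrieNode

def insert(root, prefix_bits):
    """Insert a prefix (as a bit string) into an uncompressed binary tree."""
    node = root
    for b in prefix_bits:
        node = node.children.setdefault(b, TrieNode(b))
    return node

def coalesce(node):
    """Collapse single-child chains below node, returning the compressed subtree."""
    for b in list(node.children):
        node.children[b] = coalesce(node.children[b])
    if len(node.children) == 1 and node.bits:      # keep the root intact
        (b, only), = node.children.items()
        node.bits += only.bits                     # absorb the lone child's bits
        node.children = only.children
    return node

# Example: prefixes 1100 and 1101 share the chain 1-1-0, which coalesces into "110".
root = TrieNode()
insert(root, "1100")
insert(root, "1101")
coalesce(root)
print([c.bits for c in root.children.values()])   # -> ['110']
```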