INFOCOM11 - Columbia University


INFOCOM’2011
Shanghai, China
Consensus Extraction from Heterogeneous Detectors to Improve Performance over Network Traffic Anomaly Detection
Jing Gao1, Wei Fan2, Deepak Turaga2, Olivier Verscheure2, Xiaoqiao Meng2, Lu Su1, Jiawei Han1
1 Department of Computer Science, University of Illinois
2 IBM TJ Watson Research Center
INPUT: multiple simple atomic detectors
OUTPUT: optimization-based combination mostly consistent with all atomic detectors
Network Traffic Anomaly Detection
[Figure: a computer network generating network traffic]
Tid  SrcIP          Start time  Dest IP         Dest Port  Number of bytes
1    206.135.38.95  11:07:20    160.94.179.223  139        192
2    206.163.37.95  11:13:56    160.94.179.219  139        195
3    206.163.37.95  11:14:29    160.94.179.217  139        180
4    206.163.37.95  11:14:30    160.94.179.255  139        199
5    206.163.37.95  11:14:32    160.94.179.254  139        19
6    206.163.37.95  11:14:35    160.94.179.253  139        177
7    206.163.37.95  11:14:36    160.94.179.252  139        172
8    206.163.37.95  11:14:38    160.94.179.251  139        285
9    206.163.37.95  11:14:41    160.94.179.250  139        195
10   …
Anomalous or Normal?
Challenges
• Normal behavior can be too complicated to describe.
• Some normal data can resemble true anomalies.
• Labeling current anomalies is expensive and slow.
• Network attacks adapt continuously – what we learned in the past may not work today.
The Problem
• Simple rules (atomic rules) are relatively easy to craft.
• Problem:
– there can be far too many simple rules
– each rule can have a high false alarm (false positive, FP) rate
• Challenge: can we find a non-trivial combination of these rules (per event, per detector) that significantly improves accuracy?
Why Do We Need to Combine Detectors?
[Figure: alarm panels for the Count and Entropy detectors, each at 0.1-0.5, 0.3-0.7, and 0.5-0.9, next to a Label panel. Annotation: Too many alarms!]
Combined view is better than individual views!!
Combining Detectors
• is non-trivial
– We aim to find a consolidated solution without any knowledge of the true anomalies (unsupervised)
– But we can improve it incrementally with limited supervision (semi-supervised and incremental)
– We don't know which atomic detectors are better and which are worse
– At any given moment, the best solution could be some non-trivial and dynamic combination of atomic detectors
– There could be more bad base detectors than good ones, in which case majority voting fails
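To illustrate the last point, here is a minimal sketch with made-up votes. The data and the `majority_vote` helper are hypothetical and not from the talk: one detector matches the truth, two do not, and plain majority voting misses every anomaly.

```python
# Hypothetical votes, not data from the talk: 1 = "anomalous", 0 = "normal".
truth = [1, 0, 0, 1, 0]
votes = [
    [1, 0, 0, 1, 0],  # good detector: agrees with the truth on every record
    [0, 0, 1, 0, 0],  # poor detector: misses both anomalies, one false alarm
    [0, 1, 0, 0, 0],  # poor detector: misses both anomalies, one false alarm
]


def majority_vote(votes):
    """Flag a record only if more than half of the detectors flag it."""
    k = len(votes)
    return [1 if 2 * sum(column) > k else 0 for column in zip(*votes)]


combined = majority_vote(votes)
# Indices of true anomalies that the majority vote failed to flag.
missed = [i for i, (t, c) in enumerate(zip(truth, combined)) if t == 1 and c == 0]
print(combined, missed)
```

Because the two poor detectors outvote the good one, the good detector's correct alarms never reach the combined output.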
Problem Formulation
Which one is an anomaly?
[Table: Records 1 to 7, each with a Y or N vote from every atomic detector A1, A2, …, Ak-1, Ak]
Combine atomic detectors into one!
We propose a non-trivial combination
Consensus:
1. mostly consistent with all atomic detectors
2. optimization-based framework
How to Combine Atomic Detectors?
• Linear Models
– As long as one detector is correct, there always exist weights to combine them linearly
– Question: how to figure out these weights
  • Per example & per detector
– Different from majority voting and model averaging
• Principles
– Consensus considers the performance among a set of examples and weights each detector by considering its performance over others, i.e., each example is no longer i.i.d.
– Consensus: mostly consistent among all atomic detectors
– Atomic detectors are better than random guessing and systematic flipping
– Atomic detectors should be weighted according to their detection performance
– We should rank the records according to their probability of being an anomaly
• Algorithm
– Reach consensus among multiple atomic anomaly detectors
  • unsupervised
  • semi-supervised
  • incremental
– Automatically derive weights of atomic detectors and records – per detector & per event – no single weight works for all situations.
Framework
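[Figure: bipartite graph connecting detector nodes A1, …, Ak to record nodes; each detector node carries an initial probability of [1 0] or [0 1]]
• record i: $u_i = [u_{i0}, u_{i1}]$ (probability of anomaly, normal)
• detector j: $q_j = [q_{j0}, q_{j1}]$ (probability of anomaly, normal)
• adjacency:
$$a_{ij} = \begin{cases} 1 & \text{if } u_i \text{ is connected to } q_j \\ 0 & \text{otherwise} \end{cases}$$
• initial probability:
$$y_j = \begin{cases} [1\ 0] & \text{anomalous} \\ [0\ 1] & \text{normal} \end{cases}$$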
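The graph described above could be built from raw detector votes roughly as follows. This is a sketch under one stated assumption that the slide does not spell out: each atomic detector contributes two detector nodes, one for its "anomalous" verdict (initial probability [1 0]) and one for its "normal" verdict ([0 1]). The function name `build_graph` and the vote encoding are hypothetical.

```python
def build_graph(votes):
    """Build the adjacency matrix A and initial probabilities Y.

    votes[d][i] is 1 if atomic detector d flags record i, else 0.
    Assumption: detector d owns node 2*d ("anomalous") and node 2*d + 1 ("normal").
    """
    k, n = len(votes), len(votes[0])
    A = [[0] * (2 * k) for _ in range(n)]
    Y = []
    for d in range(k):
        Y.append([1.0, 0.0])  # node 2*d: initial probability "anomalous"
        Y.append([0.0, 1.0])  # node 2*d + 1: initial probability "normal"
        for i in range(n):
            node = 2 * d if votes[d][i] == 1 else 2 * d + 1
            A[i][node] = 1  # a_ij = 1: record i is connected to node j
    return A, Y


# Two detectors, two records: only detector 0 flags record 0.
A, Y = build_graph([[1, 0], [0, 0]])
print(A)
print(Y)
```

With this encoding every record is connected to exactly one node per atomic detector, so no record is left without neighbors.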
Objective
[Figure: bipartite graph connecting detector nodes A1, …, Ak (initialized to [1 0] or [0 1]) to record nodes]
minimize disagreement
$$\min_{Q,U} \left( \sum_{i=1}^{n} \sum_{j=1}^{v} a_{ij} \|u_i - q_j\|^2 + \alpha \sum_{j=1}^{v} \|q_j - y_j\|^2 \right)$$
• First term: similar probability of being an anomaly if the record is connected to the detector
• Second term: do not deviate much from the initial probability
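As a rough check of how the two terms of the objective interact, the value can be computed directly. This is a sketch only; the function name `objective` and the list-of-lists layout are assumptions made for illustration.

```python
def objective(A, U, Q, Y, alpha):
    """Disagreement between connected nodes plus alpha times deviation from Y.

    A[i][j]: adjacency, U[i]: record probabilities, Q[j]: detector
    probabilities, Y[j]: initial detector probabilities.
    """
    disagreement = sum(
        A[i][j] * sum((u - q) ** 2 for u, q in zip(U[i], Q[j]))
        for i in range(len(U))
        for j in range(len(Q))
    )
    deviation = sum(
        sum((q - y) ** 2 for q, y in zip(Q[j], Y[j])) for j in range(len(Q))
    )
    return disagreement + alpha * deviation


# One record connected to the first of two detector nodes.
A = [[1, 0]]
Y = [[1.0, 0.0], [0.0, 1.0]]
print(objective(A, [[1.0, 0.0]], Y, Y, alpha=2.0))  # record agrees with its node
print(objective(A, [[0.5, 0.5]], Y, Y, alpha=2.0))  # record disagrees with its node
```

The first call costs nothing because the record matches its connected node and no node has moved from its initial probability; the second pays only the disagreement term.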
Methodology
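[Figure: bipartite graph connecting detector nodes A1, …, Ak (initialized to [1 0] or [0 1]) to record nodes]
Iterate until convergence
• Update detector probability
$$q_j = \frac{\sum_{i=1}^{n} a_{ij} u_i + \alpha y_j}{\sum_{i=1}^{n} a_{ij} + \alpha}$$
• Update record probability
$$u_i = \frac{\sum_{j=1}^{v} a_{ij} q_j}{\sum_{j=1}^{v} a_{ij}}$$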
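The two alternating updates above can be sketched in plain Python. This is not the authors' code: the function name `consensus`, the starting value [0.5, 0.5] for every record, the value `alpha=2.0`, and the stopping tolerance are all assumptions chosen for illustration. The example graph assumes two nodes per atomic detector (even index "anomalous", odd index "normal").

```python
def consensus(A, Y, alpha=2.0, max_iter=200, tol=1e-9):
    """Alternate the detector and record updates until records stop changing."""
    n, v = len(A), len(Y)
    U = [[0.5, 0.5] for _ in range(n)]  # assumed uninformative start
    Q = [list(y) for y in Y]
    for _ in range(max_iter):
        # Detector update: q_j = (sum_i a_ij u_i + alpha y_j) / (sum_i a_ij + alpha)
        for j in range(v):
            degree = sum(A[i][j] for i in range(n))
            Q[j] = [
                (sum(A[i][j] * U[i][c] for i in range(n)) + alpha * Y[j][c])
                / (degree + alpha)
                for c in range(2)
            ]
        # Record update: u_i = sum_j a_ij q_j / sum_j a_ij
        change = 0.0
        for i in range(n):
            degree = sum(A[i])
            if degree == 0:
                continue  # an unconnected record keeps its previous value
            new_u = [
                sum(A[i][j] * Q[j][c] for j in range(v)) / degree for c in range(2)
            ]
            change = max(change, abs(new_u[0] - U[i][0]), abs(new_u[1] - U[i][1]))
            U[i] = new_u
        if change < tol:
            break
    return U, Q


# Hypothetical example: 3 atomic detectors (6 nodes), 4 records.
# Record 0 is flagged by all three detectors, record 1 by one, records 2-3 by none.
A = [
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1],
]
Y = [[1.0, 0.0], [0.0, 1.0]] * 3
U, Q = consensus(A, Y)
ranking = sorted(range(len(U)), key=lambda i: -U[i][0])
print(ranking, [round(u[0], 3) for u in U])
```

Sorting by the first component of each `U[i]` gives the ranking of records by probability of being an anomaly.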
Propagation Process
[Figure: animation of the propagation process on the bipartite graph. Detector nodes start from [1 0] or [0 1]; values shown on detector nodes include [0.6828 0.3172], [0.7514 0.2486], [0.7 0.3], [0.304 0.696], and [0.357 0.643]. Values shown on record nodes include [0.5 0.5], [0.5285 0.4715], [0.7 0.3], and [0.357 0.643].]
Consensus Combination Reduces Expected Error
• Detector A
– Has probability P(A)
– Outputs P(y|x,A) for record x regarding y=0 (normal) and y=1 (anomalous)
• Expected error of single detector
$$Err^S = \sum_{A} P(A) \sum_{(x,y)} P(x,y) \left( P(y|x) - P(y|x,A) \right)^2$$
• Expected error of combined detector
$$Err^C = \sum_{(x,y)} P(x,y) \left( P(y|x) - \sum_{A} P(A) P(y|x,A) \right)^2$$
• Combined detector has a lower expected error
$$Err^C \le Err^S$$
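A small numeric check of the error comparison for a single (x, y) pair, using made-up detector outputs. The helper names and all numbers are hypothetical. The inequality holds term by term because the squared error is convex, so the error of the averaged output cannot exceed the average of the individual errors.

```python
def single_error(truth, outputs, weights):
    """Expected squared error when one detector is drawn with probability P(A)."""
    return sum(w * (truth - o) ** 2 for o, w in zip(outputs, weights))


def combined_error(truth, outputs, weights):
    """Squared error of the P(A)-weighted average of the detector outputs."""
    average = sum(w * o for o, w in zip(outputs, weights))
    return (truth - average) ** 2


# Hypothetical values: true P(y=1|x) = 0.9 and three equally likely detectors.
truth = 0.9
outputs = [0.6, 0.95, 0.3]
weights = [1 / 3, 1 / 3, 1 / 3]
err_s = single_error(truth, outputs, weights)
err_c = combined_error(truth, outputs, weights)
print(err_c <= err_s)
```

Summing such terms over all (x, y) with weights P(x, y) preserves the inequality.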
Extensions
• Semi-supervised
– Know the labels of a few records in advance
– Improve the performance of the combined
detector by incorporating this knowledge
• Incremental
– Records arrive continuously
– Incrementally update the combined detector
Incremental
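[Figure: bipartite graph connecting detector nodes A1, …, Ak (initialized to [1 0] or [0 1]) to record nodes, with a new record $u_n$ arriving after $u_{n-1}$]
When a new record arrives
• Update detector probability
$$q_j = \frac{\sum_{i=1}^{n-1} a_{ij} u_i + a_{nj} u_n + \alpha y_j}{\sum_{i=1}^{n-1} a_{ij} + a_{nj} + \alpha}$$
• Update record probability
$$u_i = \frac{\sum_{j=1}^{v} a_{ij} q_j}{\sum_{j=1}^{v} a_{ij}}$$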
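One way to realize the incremental update is to keep, for each detector node, the running sums that appear in its formula. The sketch below makes several assumptions not stated on the slide: the new record is scored from the current detector probabilities first, the detector nodes are refreshed afterwards, and earlier records are not revisited. The name `incremental_step` and the state variables `S` and `degree` are hypothetical.

```python
def incremental_step(a_new, Q, Y, S, degree, alpha=2.0):
    """Score one arriving record and fold it into the detector nodes.

    a_new[j]  : adjacency of the new record to detector node j
    S[j]      : running sum of a_ij * u_i over the records seen so far
    degree[j] : running sum of a_ij over the records seen so far
    Q, S and degree are updated in place; the new record's u_n is returned.
    """
    linked = [j for j, a in enumerate(a_new) if a]
    if not linked:
        return [0.5, 0.5]  # assumed fallback for a record with no neighbors
    # Record update with the current detector probabilities.
    u_new = [sum(Q[j][c] for j in linked) / len(linked) for c in range(2)]
    # Detector update: add a_nj * u_n to the numerator and a_nj to the denominator.
    for j in linked:
        S[j] = [S[j][c] + u_new[c] for c in range(2)]
        degree[j] += 1
        Q[j] = [(S[j][c] + alpha * Y[j][c]) / (degree[j] + alpha) for c in range(2)]
    return u_new


# Two atomic detectors (4 nodes, assumed layout); no records seen yet.
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
Q = [list(y) for y in Y]
S = [[0.0, 0.0] for _ in Y]
degree = [0] * len(Y)
# New record: detector 0 says "anomalous", detector 1 says "normal".
u = incremental_step([1, 0, 0, 1], Q, Y, S, degree)
print(u, Q[0])
```

Each arrival touches only the nodes the new record is connected to, so the cost per record does not grow with the number of records already seen.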
Semi-supervised
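[Figure: bipartite graph connecting detector nodes A1, …, Ak (initialized to [1 0] or [0 1]) to record nodes, some labeled and some unlabeled]
Iterate until convergence
• Update detector probability
$$q_j = \frac{\sum_{i=1}^{n} a_{ij} u_i + \alpha y_j}{\sum_{i=1}^{n} a_{ij} + \alpha}$$
• Update record probability (unlabeled)
$$u_i = \frac{\sum_{j=1}^{v} a_{ij} q_j}{\sum_{j=1}^{v} a_{ij}}$$
• Update record probability (labeled)
$$u_i = \frac{\sum_{j=1}^{v} a_{ij} q_j + \beta f_i}{\sum_{j=1}^{v} a_{ij} + \beta}$$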
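The semi-supervised variant changes only the record update for labeled records. The sketch below is an illustration under assumptions: `consensus_semi`, the dictionary `labels` (record index to known label vector), the weight `beta=8.0` on known labels, and the fixed iteration count are all hypothetical choices, and the example graph assumes two nodes per atomic detector.

```python
def consensus_semi(A, Y, labels, alpha=2.0, beta=8.0, n_iter=100):
    """Consensus iteration in which labeled records are pulled toward their labels.

    labels maps a record index to its known vector: [1, 0] (anomalous) or [0, 1].
    """
    n, v = len(A), len(Y)
    U = [list(labels.get(i, [0.5, 0.5])) for i in range(n)]
    Q = [list(y) for y in Y]
    for _ in range(n_iter):
        # Detector update, unchanged from the unsupervised case.
        for j in range(v):
            degree = sum(A[i][j] for i in range(n))
            Q[j] = [
                (sum(A[i][j] * U[i][c] for i in range(n)) + alpha * Y[j][c])
                / (degree + alpha)
                for c in range(2)
            ]
        # Record update: labeled records add beta * f_i and beta to the fraction.
        for i in range(n):
            degree = sum(A[i])
            f, b = (labels[i], beta) if i in labels else ([0.0, 0.0], 0.0)
            if degree + b == 0:
                continue  # unlabeled and unconnected: keep the previous value
            U[i] = [
                (sum(A[i][j] * Q[j][c] for j in range(v)) + b * f[c]) / (degree + b)
                for c in range(2)
            ]
    return U, Q


# Hypothetical graph: 3 atomic detectors (6 nodes), 4 records.
A = [
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1],
]
Y = [[1.0, 0.0], [0.0, 1.0]] * 3
U_unsup, _ = consensus_semi(A, Y, labels={})
# Record 3 is known to be anomalous even though no detector flags it.
U_semi, _ = consensus_semi(A, Y, labels={3: [1.0, 0.0]})
print(round(U_unsup[3][0], 3), round(U_semi[3][0], 3))
```

Records 2 and 3 receive identical votes, so any difference between them in the second run comes from the single label.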
Benchmark Data Sets
• IDN
– Data: a sequence of events (DoS flood, SYN flood, port scanning, etc.) partitioned into intervals
– Detector: setting a threshold on two high-level measures describing the probability of observing events during each interval
• DARPA
– Data: a series of TCP connection records collected by MIT Lincoln Labs; each record contains 34 continuous derived features, including duration, number of bytes, error rate, etc.
– Detector: randomly select a subset of features, and apply an unsupervised distance-based anomaly detection algorithm
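An atomic detector of the DARPA kind could look roughly like the sketch below. The slide does not say which distance-based algorithm was used, so the k-th nearest neighbor distance is swapped in here as a stand-in, and the fraction of records flagged is an arbitrary choice. The function name, its parameters, and the toy data are all hypothetical.

```python
import math
import random


def random_subspace_detector(records, n_features, k=2, top_fraction=0.1, seed=0):
    """Flag the records that lie farthest from their neighbors in a random subspace.

    Returns (votes, dims): votes[i] is 1 if record i is flagged, and dims
    lists the feature indices that were sampled.
    """
    rng = random.Random(seed)
    dims = rng.sample(range(len(records[0])), n_features)
    projected = [[r[d] for d in dims] for r in records]
    scores = []
    for i, p in enumerate(projected):
        # Anomaly score: distance to the k-th nearest other record.
        dists = sorted(math.dist(p, q) for j, q in enumerate(projected) if j != i)
        scores.append(dists[k - 1])
    n_flag = max(1, round(top_fraction * len(records)))
    cutoff = sorted(scores, reverse=True)[n_flag - 1]
    return [1 if s >= cutoff else 0 for s in scores], dims


# Toy data: nine records close together and one far away in every feature.
records = [[0.1 * i, 0.1 * i, 0.1 * i] for i in range(9)] + [[10.0, 10.0, 10.0]]
votes, dims = random_subspace_detector(records, n_features=2)
print(votes, dims)
```

Running the function with different seeds yields different feature subsets and therefore a pool of heterogeneous atomic detectors.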
Benchmark Data Sets
• LBNL
– Data: an enterprise traffic dataset collected at the edge routers of the Lawrence Berkeley National Lab. The packet traces were aggregated by intervals spanning 1 minute
– Detector: setting a threshold on six metrics: number of TCP SYN packets, number of distinct IPs in the source or destination, maximum number of distinct IPs an IP in the source or destination has contacted, and maximum pairwise distance between distinct IPs an IP has contacted.
Experimental Setup
• Baseline methods
– base detectors
– majority voting
– consensus maximization
– semi-supervised (2% labeled)
– stream (30% batch, 70% incremental)
• Evaluation measure
– area under ROC curve (0-1, 1 is the best)
– ROC curve: tradeoff between detection rate
and false alarm rate
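For reference, the area under the ROC curve can be computed without drawing the curve: it equals the fraction of (anomalous, normal) record pairs in which the anomalous record gets the higher score, with ties counted as half. A minimal sketch with made-up scores follows; the function name `auc` and the numbers are illustrative only.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (1 = anomalous, 0 = normal)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical anomaly scores for four records, two of them truly anomalous.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(auc(scores, labels))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

A detector that ranks every anomaly above every normal record scores 1, and random scoring gives about 0.5.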
AUC on Benchmark Data Sets
Data set  worst   best    average  MV      UC      SC      IC
IDN       0.5269  0.6671  0.5904   0.7089  0.7255  0.7204  0.7270
          0.2832  0.8059  0.5731   0.6854  0.7711  0.8048  0.7552
          0.3745  0.8266  0.6654   0.8871  0.9076  0.9089  0.9090
DARPA     0.5804  0.6068  0.5981   0.7765  0.7812  0.8005  0.7730
LBNL      0.5930  0.6137  0.6021   0.7865  0.7938  0.8173  0.7836
          0.5851  0.6150  0.6022   0.7739  0.7796  0.7985  0.7727
          0.5005  0.8230  0.7101   0.8165  0.8180  0.8324  0.8160
• worst, best, average: worst, best and average performance of atomic detectors
• MV: majority voting among detectors
• UC, SC, IC: unsupervised, semi-supervised and incremental version of consensus combination
Consensus combination improves anomaly detection performance!
Stream Computing
Continuous Ingestion
Continuous Complex Analysis in low latency
Conclusions
• Consensus Combination
– Combines multiple atomic anomaly detectors into a more accurate one in an unsupervised way
• We give
– Theoretical analysis of the error reduction by detector combination
– Extension of the method to incremental and semi-supervised learning scenarios
– Experimental results on three network traffic datasets
Thanks!
• Any questions?
Code available upon request