Sensitivity of PCA for Traffic Anomaly Detection

Evaluating the robustness of current best practices

Haakon Ringberg¹, Augustin Soule², Jennifer Rexford¹, Christophe Diot²
¹Princeton University, ²Thomson Research

Outline

Context
  Background and motivation
  Bigger picture
  PCA (subspace method) in one slide
Challenges with current PCA methodology
Conclusion & future directions

Background

Promising applications of PCA to anomaly detection (AD)
  [Lakhina et al, SIGCOMM 04 & 05]
But we weren’t nearly as successful applying the technique to a new data set
  Same source code
What were we doing wrong?
  Unable to tune the technique

Bigger Picture

Many statistical techniques evaluated for AD
  e.g., Wavelets, PCA, Kalman filters
  Promising early results
But questions about performance remain
  What did the researchers have to do in order to achieve presented results?

Questions about techniques

“Tunability” of technique
  Number of parameters
  Sensitivity to parameters
  Interpretability of parameters
Other aspects of robustness
  Sensitivity to drift in underlying data
  Sensitivity to sampling
  Assumptions about the underlying data

Principal Components Analysis (PCA)

PCA transforms data into new coordinate system
Principal components (new bases) ordered by captured variance
The first k (topk) tend to capture periodic trends
  normal subspace vs. anomalous subspace

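The split into normal and anomalous subspaces can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the authors' code: the function name `split_subspaces`, the matrix layout (time bins by links), and the synthetic traffic are all assumptions made for the example.

```python
import numpy as np

def split_subspaces(X, topk):
    """Split a (time bins x links) matrix into normal and anomalous parts.

    topk is the number of leading principal components treated as normal.
    """
    Xc = X - X.mean(axis=0)  # center each link's time series
    # Rows of Vt are the principal components, ordered by decreasing
    # singular value and therefore by decreasing captured variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:topk].T                      # basis of the normal subspace
    normal = Xc @ P @ P.T                # projection onto the normal subspace
    anomalous = Xc - normal              # remainder: the anomalous subspace
    variance = s**2 / (X.shape[0] - 1)   # variance captured per component
    return normal, anomalous, variance

# Synthetic example (not the Géant or Abilene data): one periodic trend
# shared by 8 links with different amplitudes, plus small noise.
rng = np.random.default_rng(0)
t = np.arange(288)
X = np.outer(np.sin(2 * np.pi * t / 96), rng.uniform(1, 5, size=8))
X += 0.1 * rng.standard_normal(X.shape)

normal, anomalous, variance = split_subspaces(X, topk=1)
print(variance / variance.sum())  # share of variance per component
```

In this toy setup the periodic trend dominates the first component, which is the behaviour the slide describes for the first k components.
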
Data used

Géant and Abilene networks
IP flow traces
21 November through 28 November 2005
Detected anomalies were manually inspected

Outline

Context
Challenges with current PCA methodology
  Sensitivity to its parameters
  Contamination of normalcy
  Identifying the location of detected anomalies
Conclusion & future directions

Sensitivity to topk

Where is the line drawn between normal and anomalous?
What is too anomalous?

[Figure: signal passed through PCA and split at topk into normal and anomalous parts]

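Both questions map onto parameters in code. The sketch below assumes the anomalous part is scored by the squared prediction error (SPE, the squared norm of the residual) and flagged with a mean-plus-3-standard-deviations rule. That threshold rule, the function names, and the synthetic data are illustrative assumptions, not the thresholding used in the study.

```python
import numpy as np

def spe(X, topk):
    """Squared prediction error per time bin: squared norm of what is left
    after projecting onto the first topk principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:topk].T
    residual = Xc - Xc @ P @ P.T
    return np.sum(residual**2, axis=1)

def detect(X, topk, n_sigma=3.0):
    """Indices of time bins whose SPE exceeds mean + n_sigma * std.
    The threshold rule is an assumption made for this example."""
    q = spe(X, topk)
    return np.flatnonzero(q > q.mean() + n_sigma * q.std())

# Synthetic traffic: one periodic trend on 6 links, noise, and one spike.
rng = np.random.default_rng(1)
t = np.arange(288)
X = np.outer(np.sin(2 * np.pi * t / 96), np.full(6, 3.0))
X += 0.1 * rng.standard_normal(X.shape)
X[100, 2] += 5.0  # injected spike on one link at time bin 100

# The set of flagged bins depends on where topk draws the line.
for topk in range(1, 5):
    print(topk, detect(X, topk))
```

Here `topk` answers "where is the line drawn" and `n_sigma` answers "what is too anomalous"; changing either one changes which time bins are reported.
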
Sensitivity to topk

Very sensitive to topk
  Total detections and false positives (FP)
Not an issue if topk were tunable
Tried many methods
  3σ deviation heuristic
  Cattell’s Scree Test
  Humphrey-Ilgen
  Kaiser’s Criterion
None are reliable

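Two of the listed selection rules are easy to sketch. The code below assumes the usual textbook form of Kaiser's criterion (keep components whose eigenvalue exceeds the mean eigenvalue) and one common reading of the 3σ heuristic (walk the components in order and stop at the first whose projection deviates more than 3σ from its mean). The function names and the synthetic data are hypothetical, and the slides do not say exactly how each rule was applied.

```python
import numpy as np

def kaiser_topk(X):
    """Kaiser's criterion: count components whose eigenvalue exceeds the
    mean eigenvalue (the mean is 1 when the data are standardized)."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    eigvals = s**2 / (X.shape[0] - 1)
    return int(np.sum(eigvals > eigvals.mean()))

def three_sigma_topk(X, n_sigma=3.0):
    """Walk components by decreasing variance. The first component whose
    projection contains a point more than n_sigma standard deviations
    from its mean starts the anomalous subspace."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    for k in range(len(s)):
        proj = U[:, k] * s[k]  # time series of projections on component k
        std = proj.std()
        if std > 0 and np.any(np.abs(proj - proj.mean()) > n_sigma * std):
            return k
    return len(s)

# Synthetic traffic: one periodic trend on 6 links, noise, and one spike.
rng = np.random.default_rng(1)
t = np.arange(288)
X = np.outer(np.sin(2 * np.pi * t / 96), np.full(6, 3.0))
X += 0.1 * rng.standard_normal(X.shape)
X[100, 2] += 5.0

print(kaiser_topk(X), three_sigma_topk(X))
```

On a clean toy signal like this the two rules can agree. The slide's point is that on real traces none of the rules gave a reliable topk.
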
Contamination of normalcy

Large anomalies may be included among topk
Invalidates assumption that top PCs are periodic
Pollutes definition of normal
In our study, the outage to the left affected 75/77 links
  Only detected on a handful!

[Figure: signal passed through PCA and split into normal and anomalous parts]

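The contamination effect can be reproduced on synthetic data. In the sketch below a large outage hits 9 of 10 links, carries enough variance to land among the top components, and so ends up inside the "normal" subspace. All sizes, amplitudes, and the choice of topk = 2 are assumptions picked to make the effect visible, not values from the study.

```python
import numpy as np

rng = np.random.default_rng(2)
n_bins, n_links = 288, 10
t = np.arange(n_bins)

# Periodic trend with a different amplitude on each link, plus noise.
w = np.linspace(1.0, 2.0, n_links)
X = np.outer(np.sin(2 * np.pi * t / 96), w)
X += 0.1 * rng.standard_normal(X.shape)

# Large outage: traffic drops by 8 units on 9 of the 10 links for 10 bins.
d = np.zeros(n_links)
d[:9] = -8.0
X[150:160] += d

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T  # normal subspace with topk = 2

# Fraction of the outage direction's energy inside the normal subspace.
u = d / np.linalg.norm(d)
captured = np.sum((P.T @ u) ** 2)

# SPE per time bin, computed from the residual outside the normal subspace.
residual = Xc - Xc @ P @ P.T
q = np.sum(residual**2, axis=1)
print(captured, q[150:160].max(), np.sum(d**2))
```

Because the outage direction lies almost entirely in the normal subspace, the residual at the outage bins is small compared with the outage's raw energy, so a residual-based detector has little to flag.
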
Conclusion & future directions

PCA (subspace method) methodology issues
  Sensitivity to topk parameter
  Contamination of normal subspace
  Identifying the location of detected anomalies
Generally: room for rigorous evaluation of statistical techniques applied to AD
  Tunability, robustness
  Assumptions about underlying data
  Under what conditions does method excel?

Thanks!
Questions?
Haakon Ringberg
Princeton University Computer Science
http://www.cs.princeton.edu/~hlarsen/

Identifying anomaly locations

Spikes when state vector projected on anomaly subspace
But network operators don’t care about this
  They want to know where it happened!
How do we find the original location of the anomaly?

[Figure: state vector projected onto the anomaly subspace]

Identifying anomaly locations

Previous work used a simple heuristic
  Associate detected spike with k flows with the largest contribution to the state vector v
No clear a priori reason for this association

[Figure: state vector and anomaly subspace, with a flow crossing the network from A to B]

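The heuristic itself is short to write down. The sketch below reads "contribution" as the absolute value of each flow's entry in the state vector v, which is an assumption. The function name `top_contributors` and the example vector are hypothetical.

```python
def top_contributors(v, k):
    """Return the indices of the k entries of v with the largest magnitude,
    ordered from largest to smallest."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    return order[:k]

# Hypothetical state vector over four flows at the time of a detected spike.
v = [0.1, -4.0, 0.3, 2.5]
suspects = top_contributors(v, k=2)
print(suspects)
```

Nothing in this ranking uses the anomaly subspace, which is one way to see the slide's objection: the flows with the largest entries need not be the ones that caused the spike.
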