Network Tomography - McGill University


Network Tomography and Anomaly Detection
Mark Coates
Tarem Ahmed
[Figure: network map from www.opte.org]
Brain mapping (opening it up can disturb the system)
Internet mapping (opening it up can disturb the system)
[Figure: the Internet boom; timeline from 1969 through 1993 to 2005]

• too complex to measure everywhere, all the time
• traffic measurements are expensive (hardware, bandwidth)
Brain Tomography
[Diagram: unknown object → projection measurements; the measurement physics (counting) gives a Poisson statistical model; prior knowledge enters through an MRF model; maximizing the likelihood of the data yields the maximum likelihood estimate]
Link-level Network Tomography
[Diagram: unknown object → end-to-end measurements; the measurement physics (queuing behaviour) gives the statistical model; prior knowledge enters similarly; maximizing the likelihood of the data yields the maximum likelihood estimate]
Link-level Network Tomography
Solely from edge-based traffic measurements, infer internal link-level loss probability, delay distribution, and topology/connectivity.
Application: Topology Discovery
Challenges:
• 12% never respond, 15% have multiple interfaces (Barford et al., 2000)
• detect layer-2 topology "invisible" to the IP layer (e.g., switches)
Application: Overlay Voice-over-IP
• Multiple paths to choose from
• Select paths with minimal delay or delay variance
• Send a small number of critical packets (vocal transitions) along multiple paths
• Use these packets to estimate the path delays (and the extent of path diversity)
[Diagram: overlay links spanning access networks, autonomous system(s), and a service gateway]
Network Monitoring
Challenges
• Restricted measurement
• High volumes and high rates of data (sampling of traffic on Gb/s routers)
• High-dimensional data (source/destination IP addresses, port numbers)
Goals
• Supply networking protocols with relevant performance information.
• Identify anomalous behaviour and operational transitions.
• Provide network administrators with appropriate notification or visualization.
Outline
• Inference about network performance based on passive measurements or active probing
• Two components to the talk:
  • Network tomography
  • Network anomaly detection
• Focus on online, sequential approaches
  • Account for non-stationary behaviour
  • Don't repeat work that has already been done
Network Tomography: Likelihood Formulation

$y = A\theta + \epsilon$

where
• $A$ = routing matrix (graph)
• $\theta$ = packet loss probabilities or queuing delays for each link
• $y$ = packet losses or delays measured at the edge
• $\epsilon$ = randomness inherent in traffic measurements

Statistical likelihood function: $l(A, \theta) = f(y \mid A, \theta)$
Classical Problem
Solve the linear system $y = A\theta + \epsilon$.
Interesting if $A$, $\theta$, or $\epsilon$ have special structure.
Maximize the likelihood function $l(\theta) = f(y \mid A, \theta)$, or $l(A, \theta) = f(y \mid A, \theta)$.
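To make the formulation concrete, here is a minimal sketch in Python of the linear model on a hypothetical three-link tree probed by two end-to-end paths, assuming Gaussian noise so that maximizing the likelihood reduces to least squares. It also illustrates why special structure matters: mean delays alone leave the system underdetermined.

```python
# Minimal sketch of the linear model y = A*theta + eps on a hypothetical
# 3-link tree probed by two end-to-end paths. Gaussian noise is assumed,
# so the maximum-likelihood estimate reduces to least squares.
import numpy as np

# Routing matrix: one row per path, one column per link
# (entry 1 if the path traverses the link).
A = np.array([[1, 1, 0],    # path 1: shared link + leaf link 1
              [1, 0, 1]])   # path 2: shared link + leaf link 2

theta_true = np.array([5.0, 2.0, 3.0])   # per-link mean delays (ms)
rng = np.random.default_rng(0)

# Repeated end-to-end measurements with i.i.d. Gaussian noise.
y = A @ theta_true + rng.normal(0.0, 0.5, size=(1000, 2))

# With Gaussian eps, maximizing l(theta) = f(y | A, theta) is ordinary
# least squares. A is 2x3, so the mean delays alone do not identify
# theta; lstsq returns the minimum-norm solution, illustrating why
# special structure (or higher-order statistics) is needed.
theta_hat, *_ = np.linalg.lstsq(A, y.mean(axis=0), rcond=None)
print(theta_hat)
```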
Network Tomography: The Basic Idea
[Diagram: a tree-structured network with a sender at the root and receivers at the leaves; probes are sent from the sender to the receivers]
Packet-pair measurements
[Diagram: a packet pair, packet(1) and packet(2), traverses the network amid cross-traffic; a delay measurement is taken for each packet]
packet(1) and packet(2) experience (nearly) identical losses and/or delays on shared links
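This shared-delay property is what makes packet pairs informative. As an illustration (a standard identity exploited by delay tomography, not the talk's estimator): if both packets share the common link's delay and the leaf-link delays are independent, the covariance of the two end-to-end delays recovers the shared link's delay variance.

```python
# Standard packet-pair identity (an illustration, not the talk's
# estimator): both packets share the common link's delay, so with
# independent leaf-link delays the covariance of the two end-to-end
# delays equals the shared link's delay variance.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
shared = rng.exponential(2.0, n)   # hypothetical shared-link delays
leaf1 = rng.exponential(1.0, n)    # independent leaf-link delays
leaf2 = rng.exponential(1.5, n)

y1 = shared + leaf1   # end-to-end delay of packet(1)
y2 = shared + leaf2   # end-to-end delay of packet(2)

# Cov(y1, y2) = Var(shared); both printed values should be close to 4.
print(np.cov(y1, y2)[0, 1], shared.var())
```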
Modelling time-variations
[Diagram: cross-traffic entering and leaving the probed path]
• Nonstationary cross-traffic induces time-variation
• Directly model the dynamics (but not the traffic!)
• Goal is to perform online tracking and prediction of network link characteristics
Non-stationary behaviour
Introduce time-dependence in the parameters:
$y_t = A_t \theta_t + \epsilon_t$
Filtering exercise (track $\theta_t$):
(1) Describe the dynamic behaviour of $\theta_t$
(2) Form the MMSE estimate: $\hat{\theta}_t = E_{p(\theta_t \mid y_{1:t})}[\theta_t]$
Particle Filtering
[Diagram: the particle set $\{\alpha_{i,m}, \beta_{i,m}\}_{i=1}^{L}$ is propagated to $\{\alpha_{i,m+1}, \beta_{i,m+1}\}_{i=1}^{L}$]
Delay Distribution Tracking
• Time-varying delay distribution of window size R at time m: $T_{m,R}(k)$
• In each window, R probe measurements.
• Form estimates of average delay and jitter over short time intervals.
[Figure: delay distribution (in delay units) evolving over time]
Dynamic Model
• Queue/traffic model:
$\log \beta_{j,m+1} = \log \beta_{j,m} + N(0, \sigma^2)$
$\alpha_{j,m}$: reflected random walk on $[0, \text{max\_del}]$
$p_{j,m}(k) \propto \exp(-\beta_{j,m}\,|k - \alpha_{j,m}|)$
[Figure: example link delay pmf, probability vs. delay units]
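A minimal simulation sketch of this dynamic model follows. The step sizes, max_del value, and the Laplace-like pmf form are assumptions based on our reading of the slide, not confirmed settings.

```python
# Minimal simulation sketch of the dynamic model. The step sizes,
# MAX_DEL value, and the Laplace-like pmf exp(-beta*|k - alpha|) are
# assumptions based on our reading of the slide, not confirmed settings.
import numpy as np

MAX_DEL = 50          # max delay units per link (hypothetical)
SIGMA = 0.05          # log-domain random-walk step size (hypothetical)
rng = np.random.default_rng(2)

def step(alpha, beta):
    """Propagate (alpha_{j,m}, beta_{j,m}) one time step."""
    beta = np.exp(np.log(beta) + rng.normal(0.0, SIGMA))
    alpha = abs(alpha + rng.normal(0.0, 1.0))
    if alpha > MAX_DEL:                 # reflect back into [0, MAX_DEL]
        alpha = 2 * MAX_DEL - alpha
    return alpha, beta

def delay_pmf(alpha, beta):
    """Normalized delay pmf p_{j,m}(k) on {0, ..., MAX_DEL}."""
    k = np.arange(MAX_DEL + 1)
    p = np.exp(-beta * np.abs(k - alpha))
    return p / p.sum()

alpha, beta = 10.0, 0.3
for _ in range(100):
    alpha, beta = step(alpha, beta)
print(delay_pmf(alpha, beta).argmax())   # the mode tracks alpha
```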
Observations
• Measurements: $x_{j,m} \sim p_{j,m}(k)$
• Observe the end-to-end delays of the packet pair, $y^{(i)}(m)$ for $i = 1, 2$, where
$y_{j,m} = \sum_{s \in \mathrm{Path}(j,m)} x_{s,m}$
[Diagram: packet(1)(m) and packet(2)(m) produce observations $y^{(1)}(m)$ and $y^{(2)}(m)$]
Estimation of Delay Distributions
• Sequential Monte Carlo approximation to the posterior mean estimate:
$\hat{p}_{j,m}(k) = \sum_{i=1}^{N} p(x_{j,m} = k \mid y_m, \alpha_m^{(i)}, \beta_m^{(i)})\, w_m^{(i)}$
(the conditional probabilities are computed by a message-passing algorithm; the $w_m^{(i)}$ are the particle weights)
• Estimate of the time-varying delay distribution:
$\hat{T}_{m,R}(k) = \frac{1}{R} \sum_{l=m-R+1}^{m} \hat{p}(x_{j,l} \mid y_{1:l})$
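As a small sketch of the windowed estimate, assuming the per-interval pmf estimates from the SMC step are stored as rows of an array (a hypothetical layout):

```python
# Small sketch of the windowed estimate T_hat_{m,R}(k): average the last
# R per-interval pmf estimates produced by the SMC step.
import numpy as np

def windowed_distribution(p_hat, m, R):
    """T_hat_{m,R}(k) = (1/R) * sum over l = m-R+1..m of p_hat[l]."""
    return np.mean(p_hat[m - R + 1 : m + 1], axis=0)

p_hat = np.full((200, 51), 1.0 / 51)   # placeholder pmfs, shape (t, K+1)
print(windowed_distribution(p_hat, m=150, R=10).sum())   # -> 1.0
```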
Analysis
• Complexity: $O(L K^2 N)$, where $L$ is the average number of unique links per measurement, $K$ the maximum number of delay units per link, and $N$ the number of particles.
• The convergence analysis of [Crisan, Doucet 01; Le Gland, Oudjane 02] applies.
• The approximation to the posterior mean estimate converges to the true estimate as $N \to \infty$.
Simulation Results – ns-2
[Figures: tracked vs. true delay distributions; mean delay over time]
Comments
• Dynamic models allow us to account for nonstationarity
  • but realistic models are hard to derive and incorporate
• Particle filtering is only appropriate when analytical techniques fail
  • non-Gaussian or non-linear dynamics or observations
• Sequential structure allows on-line implementation
  • care must be taken to reduce computation at each step
Network Anomaly Detection
• In tomography, a primary challenge is the restriction on available measurements.
• In anomaly detection, a primary challenge is the abundance of measurements.
• How can we process data at a sufficient rate?
• How should we extract relevant information?
Netflow Data
• Records of flows.
• A flow is defined by: (source IP, dest. IP, source port #, dest. port #).
• Packets are sampled at configurable rates.
• Exported at 1-minute or 5-minute intervals.
Dataset – Abilene Network
[Figure: Abilene Weathermap – Indiana University]
Thanks to Rick Summerhill and Mark Fullmer at Abilene for providing access to the data.
Principal Component Analysis (PCA)
• Goal: identify a low-dimensional subspace that captures the key components of the feature set
• Idea: if (most of) a measurement does not lie in this subspace, then it is anomalous
• PCA:
  • conduct a linear transformation to choose a new coordinate system
  • the projection onto the first principal component has greater variance than any other projection (maximum energy)
  • subsequent principal components capture the greatest remaining energy
PCA (2)
• Reduce dimensionality by eliminating principal components that do not contribute significantly to variance in the dataset (small singular value)
• Not optimized for class separability (cf. linear discriminant analysis)
• Minimizes reconstruction error under the L2 norm.
“Eigenflow” Analysis
• Lakhina et al. (2004, 2004b).
• PCA analysis of Origin-Destination (OD) flows
• Eigenflow: the set of flows mapped onto a single principal component
• Intrinsic dimensionality: empirical studies for the Sprint and Abilene networks indicated that 5-10 principal components sufficed to capture most of the energy.
PCA-based Anomaly Detection
• Perform PCA on a block of OD flow measurements
• Project each measurement onto the primary principal components
• Test whether the residual energy exceeds a threshold.
• The squared prediction error (SPE, or Q-statistic) is used to test for unusual flow types.
• Prone to Type-I errors (false positives) when applied to transient operations.
• In these cases, the assumption that the source data is normally distributed is violated.
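A minimal sketch of this subspace method on synthetic data follows. The choice of 8 retained components and the empirical 99.9% quantile threshold are hypothetical; in practice the Q-statistic threshold is derived analytically.

```python
# Minimal sketch of subspace-based detection: fit PCA on a block of
# OD-flow vectors, project onto the top components, and flag large
# residual energy (SPE). Component count and threshold are hypothetical.
import numpy as np

rng = np.random.default_rng(3)

# Synthetic training block: 500 intervals x 121 OD flows, with a
# low-dimensional "normal" structure plus noise.
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 121)) \
    + 0.1 * rng.normal(size=(500, 121))
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
P = Vt[:8].T                         # basis of the normal subspace

def spe(x):
    """Squared prediction error: energy outside the PCA subspace."""
    r = (x - mu) - P @ (P.T @ (x - mu))
    return float(r @ r)

threshold = np.quantile([spe(x) for x in X], 0.999)
x_new = X[0] + 5.0                   # a strongly perturbed measurement
print(spe(x_new) > threshold)        # True -> raise an alarm
```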
Online Method
• Don't need to relearn from scratch when new data arrive
• Computational cost per time step should be bounded by a constant independent of time
• Block-based PCA is therefore unattractive
• Alternative method: Kernel Recursive Least Squares (KRLS)
KRLS
• Represent the function as
$\hat{f}(\mathbf{x}) = \sum_{i=1}^{t} \alpha_i\, k(\mathbf{x}_i, \mathbf{x})$
where $\{\mathbf{x}_i\}$ are the training points
• Desire a sparse solution (storage and time savings, plus generalization ability)
• The effective dimensionality of the manifold spanned by the training feature vectors may be much smaller than the feature space dimension
• Identify linearly independent feature vectors that approximately span this manifold.
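A small sketch of this representation, assuming a Gaussian (RBF) kernel; the kernel width, dictionary points, and coefficients are illustrative.

```python
# Small sketch of the kernel expansion f_hat(x) = sum_i alpha_i k(x_i, x),
# assuming a Gaussian (RBF) kernel; all values are illustrative.
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def f_hat(x, dictionary, alpha, sigma=1.0):
    """Evaluate the sparse kernel expansion at x."""
    return sum(a * rbf(xi, x, sigma) for xi, a in zip(dictionary, alpha))

D = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]   # training points
alpha = [0.7, -0.3]                                # fitted coefficients
print(f_hat(np.array([0.5, 0.5]), D, alpha))
```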
KRLS
• Sequentially sample a stream of input/output pairs
$(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, \quad \mathbf{x}_i \in \mathcal{X},\ y_i \in \mathbb{R}$
• At time step t, assume we have collected a dictionary of samples $D_{t-1} = \{\tilde{\mathbf{x}}_j\}_{j=1}^{m_{t-1}}$, where by construction the $\{\phi(\tilde{\mathbf{x}}_j)\}_{j=1}^{m_{t-1}}$ are linearly independent feature vectors
KRLS
• We encounter a new sample $\mathbf{x}_t$.
• Test whether $\phi(\mathbf{x}_t)$ is approximately linearly dependent on the dictionary's feature vectors.
• If not, add it to the dictionary.
$\delta_t = \min_{\mathbf{a}} \Big\| \sum_{j=1}^{m_{t-1}} a_j \phi(\tilde{\mathbf{x}}_j) - \phi(\mathbf{x}_t) \Big\|^2 \le \nu$
(dictionary approximation error compared against the threshold $\nu$)
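The minimization has the closed form $\delta_t = k(\mathbf{x}_t, \mathbf{x}_t) - \mathbf{k}_t^T K^{-1} \mathbf{k}_t$, where $K$ is the dictionary kernel matrix and $\mathbf{k}_t$ the vector of kernel evaluations against the dictionary; this is the approximate linear dependence (ALD) test of Engel et al.'s KRLS. A minimal sketch, assuming a Gaussian kernel and a hypothetical threshold:

```python
# Minimal sketch of the ALD test, using the closed form
# delta_t = k(x_t, x_t) - k_t^T K^{-1} k_t; the Gaussian kernel and the
# threshold nu are illustrative assumptions.
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def ald_delta(dictionary, x_t, sigma=1.0):
    """Residual of projecting phi(x_t) onto the dictionary's span."""
    K = np.array([[rbf(xi, xj, sigma) for xj in dictionary]
                  for xi in dictionary])
    k_t = np.array([rbf(xi, x_t, sigma) for xi in dictionary])
    a = np.linalg.solve(K, k_t)            # optimal coefficients a*
    return rbf(x_t, x_t, sigma) - k_t @ a  # delta_t >= 0

D = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
nu = 0.1                                   # hypothetical threshold
x_t = np.array([0.05, 0.0])                # close to a dictionary element
print(ald_delta(D, x_t) > nu)              # False: approx. dependent
```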
KRLS Properties
• Provided the input set $\mathcal{X}$ is compact, the number of dictionary elements is finite.
• Approximate version of kernel PCA:
  • eigenvectors with eigenvalues significantly larger than $\nu$ are projected almost entirely onto the dictionary set.
• $O(m^2)$ memory and $O(t m^2)$ time
• Compare exact kernel PCA: $O(t^2)$ memory and $O(t^2 p)$ time.
Application in Networks
• Data set is the Origin-Destination flows (an 11x11 matrix, i.e., a 121-dimensional vector per measurement interval).
• Normalized, these comprise the features.
• We use the total traffic per measurement interval as the associated value y.
[Figure: total traffic (no. of packets) per measurement interval, 0000 hrs on Aug 10, 2005 to 2359 hrs on Aug 21, 2005, at the Chicago router; this gives 3456 five-minute intervals over the 12-day period.]
Origin-Destination Flows
[Figure: OD flow matrices at t = 1, t = 100, t = 1300, and t = 3000]
Building the Dictionary
[Figure: number of dictionary elements vs. measurement interval for Gaussian and linear kernels, at thresholds ν = 0.1 and ν = 0.2]
Dictionary Components
[Figure: dictionary elements 5, 6, 20, and 22]
KRLS Anomaly Detection Algorithm
1. Based on x_t, evaluate δ_t.
2. If δ_t < ν1, green-light the traffic.
3. If δ_t ≥ ν2, raise a red alarm.
4. If ν1 < δ_t < ν2, raise an orange alarm:
   1. Test the usefulness of x_t (does φ(x_t) provide good support for ensuing vectors?).
   2. If yes, add x_t to the dictionary.
   3. If no, raise a red alarm.
5. Remove any obsolete dictionary elements.
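A minimal sketch of the three-zone decision rule, with hypothetical thresholds ν1 and ν2 and the usefulness test and dictionary update omitted:

```python
# Minimal sketch of the three-zone decision rule on top of the ALD
# residual delta_t (ald_delta as sketched earlier); the thresholds
# nu1 < nu2 are hypothetical.
def classify(delta, nu1=0.05, nu2=0.5):
    """Map the ALD residual to the green/orange/red decision."""
    if delta < nu1:
        return "green"    # well explained by the current dictionary
    if delta >= nu2:
        return "red"      # far outside the learned manifold
    return "orange"       # borderline: test usefulness before adding

# Usage within a stream:
# for x_t in stream:
#     print(classify(ald_delta(D, x_t)))
```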
Evaluating Usefulness
[Figure: kernel values over timesteps for normal, obsolete, and anomalous dictionary elements]
Anomaly Detection
[Figure: KRLS residual magnitude, PCA and OCNM Euclidean distances, and no. of IP flows vs. timestep; detections by KRLS, PCA, and OCNM are marked]
PCA versus KRLS: Anomaly 1
[Figure: magnitude of projection and no. of IP flows vs. timestep]
PCA versus KRLS: Anomaly 2
[Figure: magnitude of projection and no. of IP flows vs. timestep]
Summary and Challenges
• Network monitoring presents challenges on different fronts:
  • constraints on available measurements (reconstruction based on partial views)
  • high-rate, high-dimensional, distributed data
• (Some of the many) open questions:
  • Tomography: network models, spatial and temporal correlations, optimal sampling, multiple sources.
  • Anomaly detection: thresholds, dictionary control, feature space, dataset
[Fig 3: detection rate (%) vs. false alarm rate (%)]
Particle filtering
Objective: estimate expectations
$\int h(\theta_{0:t})\, \pi_t(d\theta_{0:t})$
with respect to a sequence of distributions $\{\pi_t\}_{t \ge 0}$ known up to a normalizing constant, i.e.,
$\pi_t(d\theta_{0:t}) \propto \gamma_t(\theta_{0:t})\, d\theta_{0:t}$
Monte Carlo: obtain N weighted samples $\{\theta_{0:t}^{(i)}, w_t^{(i)}\}_{i=1,\ldots,N}$, where $w_t^{(i)} \ge 0$ and $\sum_{i=1}^{N} w_t^{(i)} = 1$, such that
$\sum_{i=1}^{N} w_t^{(i)}\, h(\theta_{0:t}^{(i)}) \xrightarrow{N \to \infty} \int h(\theta_{0:t})\, \pi_t(d\theta_{0:t})$
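A minimal sketch of a bootstrap particle filter (sequential importance resampling) for a generic scalar state-space model; the random-walk dynamics and Gaussian likelihood are illustrative assumptions, not the talk's delay-tomography model:

```python
# Minimal sketch of a bootstrap particle filter for a generic scalar
# state-space model; the random-walk dynamics and Gaussian likelihood
# are illustrative assumptions, not the talk's delay-tomography model.
import numpy as np

rng = np.random.default_rng(4)
N = 1000                                   # number of particles

def particle_filter(ys, sigma_dyn=0.1, sigma_obs=0.5):
    """Track the MMSE estimate E[theta_t | y_{1:t}] with N particles."""
    theta = rng.normal(0.0, 1.0, N)        # initial particle cloud
    estimates = []
    for y in ys:
        # Propagate particles through the (assumed) dynamics.
        theta = theta + rng.normal(0.0, sigma_dyn, N)
        # Weight by the (assumed Gaussian) observation likelihood.
        w = np.exp(-0.5 * ((y - theta) / sigma_obs) ** 2)
        w /= w.sum()
        estimates.append(float(w @ theta)) # weighted posterior mean
        # Resample to avoid weight degeneracy.
        theta = theta[rng.choice(N, size=N, p=w)]
    return np.array(estimates)

true_path = np.cumsum(rng.normal(0.0, 0.1, 200))
ys = true_path + rng.normal(0.0, 0.5, 200)
print(np.mean((particle_filter(ys) - true_path) ** 2))  # tracking MSE
```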