
Network Traffic Modeling
Mark Crovella
Boston University Computer Science
Outline of Day
• 9:00 – 10:45 Lecture 1, including break
• 10:45 – 12:00 Exercise Set 1
• 12:00 – 13:30 Lunch
• 13:30 – 15:15 Lecture 2, including break
• 15:15 – 17:00 Exercise Set 2
The Big Picture
• There are two main uses for Traffic
Modeling:
– Performance Analysis
• Concerned with questions such as delay, throughput,
packet loss.
– Network Engineering and Management:
• Concerned with questions such as capacity planning,
traffic engineering, anomaly detection.
• The principal differences are those of
timescale and stationarity.
Relevant Timescales
[Figure: timescale axis from 1 usec to 1 week. Performance effects happen on short timescales (nanoseconds up to an hour); Network Engineering effects happen on long timescales (an hour to months).]
Stationarity, informally
• “A stationary process has the property
that the mean, variance and
autocorrelation structure do not change
over time.”
• Informally: “we mean a flat looking series,
without trend, constant variance over time,
a constant autocorrelation structure over
time and no periodic fluctuations
(seasonality).”
NIST/SEMATECH e-Handbook of Statistical Methods
http://www.itl.nist.gov/div898/handbook/
The 1-Hour / Stationarity Connection
• Nonstationarity in traffic is primarily a result of
varying human behavior over time
• The biggest trend is diurnal
• This trend can usually be ignored up to timescales
of about an hour, especially in the “busy hour”
Outline
• Morning: Performance Evaluation
– Part 0: Stationary assumption
– Part 1: Models of fine-timescale behavior
– Part 2: Traffic patterns seen in practice
• Afternoon: Network Engineering
– Models of long-timescale behavior
– Part 1: Single Link
– Part 2: Multiple Links
Morning Part 1:
Traffic Models for Performance Evaluation
• Goal: Develop models useful for
– Queueing analysis
• e.g., G/G/1 queues
– Other analysis
• e.g., traffic shaping
– Simulation
• e.g., router or network simulation
A Reasonable Approach
• Fully characterizing a stochastic process
can be impossible
– Potentially infinite set of properties to capture
– Some properties can be very hard to estimate
• A reasonable approach is to concentrate on
two particular properties:
marginal distribution and autocorrelation
Marginals and Autocorrelation
Characterizing a process in terms of these two
properties gives you
– a good approximate understanding of the process,
– without involving a lot of work,
– or requiring complicated models,
– or requiring estimation of too many parameters.
… Hopefully!
Marginals
Given a stochastic process X = {Xi}, we are interested in
the distribution of any Xi: i.e., f(x) = P(Xi = x)
Since we assume X is stationary, it doesn’t matter which Xi we pick.
Estimated using a histogram.
Histograms and CDFs
• A Histogram is often a poor estimate of the pdf f(x)
because it involves binning the data
• The CDF F(x) = P[Xi ≤ x] will have a point for each
distinct data value, so it can be much more accurate
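As a concrete illustration, here is a minimal Python sketch comparing the two estimates; the exponential sample is only a synthetic stand-in for real measurements:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10_000)   # stand-in for real measurements

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=50, density=True)            # binned estimate of the pdf f(x)
ax1.set_title("histogram (binned pdf)")

xs = np.sort(x)                               # empirical CDF: one point per value
ax2.plot(xs, np.arange(1, len(xs) + 1) / len(xs))
ax2.set_title("empirical CDF F(x)")
plt.show()
```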
Modeling the Marginals
• We can form a compact summary of a pdf f(x) if we find
that it is well described by a standard distribution, e.g.:
– Gaussian (Normal)
– Exponential
– Poisson
– Pareto
– etc.
• Statistical methods exist for
– asking whether a dataset is well
described by a particular distribution
– estimating the relevant parameters
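A minimal sketch of this workflow using scipy.stats, with synthetic data standing in for a real trace; note the caveat in the comment about fitting and testing on the same data:

```python
import numpy as np
from scipy import stats

# Synthetic data; in practice x would be measured values
# (packet sizes, flow lengths, ...).
x = stats.expon(scale=2.0).rvs(size=5000, random_state=0)

# Estimate the parameters of a candidate distribution...
loc, scale = stats.expon.fit(x)

# ...and ask whether the data are well described by it
# (Kolmogorov-Smirnov test). Caveat: fitting parameters to the
# same data used for the test biases the p-value upward.
stat, pvalue = stats.kstest(x, "expon", args=(loc, scale))
print(f"KS statistic = {stat:.4f}, p-value = {pvalue:.3f}")
```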
Distributional Tails
• A particularly important part of a distribution is the
(upper) tail: P[X > x]
• Large values dominate statistics and performance
• The “shape” of the tail is critically important
Light Tails, Heavy Tails
• Light: exponential or faster decline, e.g. f1(x) = 2 exp(-2(x-1))
• Heavy: slower than any exponential decline, e.g. f2(x) = x^-2
Examining Tails
• Best done using log-log complementary CDFs
• Plot log(1 - F(x)) vs. log(x)
[Figure: log-log CCDFs of 1-F1(x) (light tail) and 1-F2(x) (heavy tail)]
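A short Python sketch of such a log-log CCDF plot, using synthetic light- and heavy-tailed samples; the Pareto tail index a = 1.2 is chosen arbitrarily for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
light = rng.exponential(scale=1.0, size=50_000)    # light tail
heavy = (1 - rng.random(50_000)) ** (-1 / 1.2)     # Pareto, a = 1.2 (heavy tail)

for data, label in [(light, "exponential"), (heavy, "Pareto (a=1.2)")]:
    xs = np.sort(data)
    ccdf = 1.0 - np.arange(1, len(xs) + 1) / len(xs)
    plt.loglog(xs[:-1], ccdf[:-1], label=label)    # drop last point, CCDF = 0
plt.xlabel("log x"); plt.ylabel("log P[X > x]")
plt.legend(); plt.show()
```

On the log-log scale the heavy tail appears as a straight line with slope -a, while the exponential tail curves sharply downward.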
Heavy Tails Arrive
pre-1985: Scattered measurements note high
variability in computer systems workloads
1985 – 1992: Detailed measurements note “long”
distributional tails
– File sizes
– Process lifetimes
1993 – 1998: Attention focuses specifically on
(approximately) polynomial tail shape:
“heavy tails”
post-1998: Heavy tails used in standard models
Power Tails, Mathematically
We say that a random variable X is power tailed if:
P[X > x] ~ x^-a, 0 < a < 2
where a(x) ~ b(x) means lim_{x→∞} a(x)/b(x) = 1.
Focusing on polynomial shape allows
– parsimonious description
– capture of variability in a parameter
A Fundamental Shift in Viewpoint
• Traditional modeling methods have focused
on distributions with “light” tails
– Tails that decline exponentially fast (or faster)
– Arbitrarily large observations are vanishingly
rare
• Heavy tailed models behave quite
differently
– Arbitrarily large observations have non-negligible probability
– Large observations, although rare, can dominate
a system’s performance characteristics
Heavy Tails are Surprisingly Common
• Sizes of data objects in computer systems
– Files stored on Web servers
– Data objects/flow lengths traveling through the
Internet
– Files stored in general-purpose Unix filesystems
– I/O traces of filesystem, disk, and tape activity
• Process/Job lifetimes
• Node degree in certain graphs
– Inter-domain and router structure of the Internet
– Connectivity of WWW pages
• Zipf’s Law
Evidence: Web File Sizes
Barford et al., World Wide Web, 1999
Evidence: Process Lifetimes
Harchol-Balter and Downey, ACM TOCS, 1997
The Bad News
• Workload metrics following heavy tailed
distributions are extremely variable
• For example, for power tails:
– When a ≤ 2, the distribution has infinite variance
– When a ≤ 1, the distribution has infinite mean
• In practice, empirical moments are slow to
converge – or nonconvergent
• To characterize system performance,
either:
– Attention must shift to distribution itself, or
– Attention must be paid to timescale of analysis
Heavy Tails in Practice
[Figure: samples from a power tail with a = 0.8; large observations dominate statistics (e.g., the sample mean)]
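A quick numerical illustration of this effect: drawing standard Pareto samples with a = 0.8 (an infinite-mean distribution) and watching the running sample mean fail to settle:

```python
import numpy as np

rng = np.random.default_rng(1)
a = 0.8                                      # tail index from the slide
x = (1 - rng.random(100_000)) ** (-1 / a)    # Pareto samples, P[X > x] = x^-a

# Running sample mean: dominated by occasional huge observations,
# and nonconvergent since a <= 1 implies an infinite mean.
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
for n in (10**2, 10**3, 10**4, 10**5):
    print(f"n = {n:>6}: sample mean = {running_mean[n - 1]:.1f}")
```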
Autocorrelation
• Once we have characterized the marginals, we
know a lot about the process.
• In fact, if the process consisted of i.i.d. samples,
we would be done.
• However, most traffic has the property that its
measurements are not independent.
• Lack of independence usually results in
autocorrelation
• Autocorrelation is the tendency for two
measurements to both be greater than, or less
than, the mean at the same time.
Measuring Autocorrelation
Autocorrelation Function (ACF) (assumes stationarity):
R(k) = Cov(Xn, Xn+k) = E[Xn Xn+k] – (E[X0])²
[Figure: ACF of i.i.d. samples]
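A minimal sketch of the sample ACF (the usual biased estimator), checked against i.i.d. samples, for which R(k) ≈ 0 at all lags k ≥ 1:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation R(k)/R(0) for lags 0..max_lag (assumes stationarity)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:-k or None], x[k:]) / (len(x) * var)
                     for k in range(max_lag + 1)])

# For i.i.d. samples the ACF is ~1 at lag 0 and ~0 at all lags k >= 1.
rng = np.random.default_rng(2)
print(acf(rng.normal(size=10_000), max_lag=5).round(3))
```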
How Does Autocorrelation Arise?
Network traffic is the superposition of flows
[Diagram: a client “click” sends a request over the Internet (TCP/IP) to a server, which returns a response]
Why Flows?: Sources appear to be ON/OFF
[Figure: sources P1, P2, P3, each alternating between ON and OFF periods]
Superposition of ON/OFF sources → Autocorrelation
Morning Part 2:
Traffic Patterns Seen in Practice
Traffic patterns on a link are strongly
affected by two factors:
– amount of multiplexing on the link
• Essentially – how many flows are sharing the link?
– Where flows are bottlenecked
• Is each flow’s bottleneck on, or off the link?
• Do all bottlenecks have similar rate?
Low Multiplexed Traffic
• Marginals: highly variable
• Autocorrelation: low
Highly Multiplexed Traffic
Highly Multiplexed, Bottlenecked Traffic
• Marginals: tending to Gaussian
• Autocorrelation: high
Highly Multiplexed, Mixed-Bottlenecks
[Figure: dec-pkt-1 trace (Internet Traffic Archive)]
Alpha and Beta Traffic
• ON/OFF model revisited: high variability in connection rates (RTTs)
– low rate = beta; high rate = alpha
[Figure: beta sources sum to fractional Gaussian noise; alpha sources sum to stable Levy noise (Rice U., SPIN Group)]
Long Range Dependence
• Short-range dependence: R[k] ~ a^-k, a > 1
• Long-range dependence: R[k] ~ k^-a, 0 < a < 1
• H = 1 - a/2
Correlation and Scaling
• Long range dependence affects how
variability scales with timescale
• Take a traffic timeseries Xn, sum it over
blocks of size m
– This is equivalent to observing the original
process on a longer timescale
• How do the mean and std dev change?
– Mean will always grow in proportion to m
– For i.i.d. data, the std dev will grow in
proportion to sqrt(m)
– So, for i.i.d. data, the process is “smoother” at
longer timescale
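A quick numeric check of the sqrt(m) claim on synthetic i.i.d. data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=10.0, scale=1.0, size=2**18)   # i.i.d. "traffic" series

for m in (1, 4, 16, 64, 256):
    blocks = x[: len(x) // m * m].reshape(-1, m).sum(axis=1)
    # Mean grows like m; std dev grows like sqrt(m) for i.i.d. data,
    # so the aggregated (longer-timescale) process looks smoother.
    print(f"m = {m:>3}: mean = {blocks.mean():8.1f}, std = {blocks.std():7.2f}, "
          f"std/sqrt(m) = {blocks.std() / np.sqrt(m):.2f}")
```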
Self-similarity: unusual scaling of variability
• Exact self-similarity of a zero-mean, stationary process Xn:
X_t =_d m^-H Σ_{i=(t-1)m+1}^{tm} X_i, for all m ∈ N, t ≥ 0
(=_d denotes equality in distribution)
• H: Hurst parameter, 1/2 < H < 1
• H = 1/2 for i.i.d. Xn
• LRD leads to (at least) asymptotic s.s.
Self Similarity in Practice
[Figure: traffic at 10 ms, 1 s, and 100 s timescales for H = 0.95 vs. H = 0.50; cf. The Great Wave (Hokusai)]
How Does Self-Similarity Arise?
Self-similarity ← Autocorrelation ← Flows
Autocorrelation declines like a power law ← Distribution of flow lengths has a power-law tail
Power-tailed ON/OFF sources → Self-similarity
[Figure: sources P1, P2, P3 alternating between power-tailed ON and OFF periods]
Measuring Scaling Properties
• In principle, one can simply aggregate Xn over varying
sizes of m, and plot the resulting variance as a function of m
• Linear behavior on a log-log plot gives an estimate of H (or a)
• Slope > -1 indicates LRD
WARNING: this method is very sensitive to violation of assumptions!
Better: wavelet-based estimation (Veitch and Abry)
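A sketch of the variance-time method just described: the slope of log(variance) vs. log(m) is 2H - 2, so H = 1 + slope/2. The warning above applies; this estimator is fragile:

```python
import numpy as np

def variance_time_H(x, block_sizes):
    """Estimate the Hurst parameter H from the variance-time plot.

    WARNING (per the slide): very sensitive to nonstationarity and other
    violations of assumptions; prefer wavelet-based estimators in practice.
    """
    logs_m, logs_v = [], []
    for m in block_sizes:
        blocks = x[: len(x) // m * m].reshape(-1, m).mean(axis=1)
        logs_m.append(np.log(m))
        logs_v.append(np.log(blocks.var()))
    slope, _ = np.polyfit(logs_m, logs_v, 1)   # slope = 2H - 2; i.i.d. gives -1
    return 1 + slope / 2

rng = np.random.default_rng(4)
x = rng.normal(size=2**16)                     # i.i.d., so H should be ~0.5
print(f"H ≈ {variance_time_H(x, [2**k for k in range(1, 10)]):.2f}")
```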
Optional Material: Performance
Implications of Self-Similarity
Performance implications of S.S.
• Asymptotic queue length distribution (G/G/1):
– For SRD traffic: P[Q > x] ~ exp(-c1 x)
– For LRD traffic: P[Q > x] ~ exp(-c2 x^(2-2H))
– Severe – but is it realistic?
Evaluating Self-Similarity
• Queueing Models like these are open
systems
– delay does not feed back to source
• TCP dynamics are not being considered
– packet losses cause TCP to slow down
• Better approach:
– Closed network, detailed modeling of TCP
dynamics
– self-similar traffic is generated “naturally”
Simulating Self-Similar Traffic
• Simulated network with multiple clients,
servers
• Clients alternate between requests and idle times
• Files drawn from heavy-tailed distribution
– Vary a to vary self-similarity of traffic
• Each request is simulated at packet level,
including detailed TCP flow control
• Compare with UDP (no flow control) as an
example of an open system
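A sketch of how such heavy-tailed file sizes might be drawn, via inverse-transform sampling from a Pareto distribution; the minimum size k = 1000 bytes is an arbitrary illustration, not the setting used in the study:

```python
import numpy as np

def pareto_sizes(a, k, n, rng):
    """Draw n Pareto(a) file sizes with minimum size k: P[X > x] = (x/k)^-a."""
    return k * (1 - rng.random(n)) ** (-1 / a)

rng = np.random.default_rng(5)
for a in (1.1, 1.5, 1.9):   # vary a to vary the self-similarity of the traffic
    sizes = pareto_sizes(a, k=1000, n=100_000, rng=rng)
    print(f"a = {a}: max file = {sizes.max():.3e} bytes, mean = {sizes.mean():.0f}")
```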
Traffic Characteristics
• Self-similarity varies smoothly as function
of a
Performance Effects:
Packet Loss
Open Loop: UDP
Closed Loop: TCP
Performance Effects:
TCP Throughput
Performance Effects:
Buffer Queue Lengths
Open Loop: UDP
Closed Loop: TCP
Severity of Packet Delay
Performance Implications
• Self-similarity is present in Web traffic
– The Internet’s most popular application
• For the Web, the causes of s.s. can be
traced to the heavy-tailed distribution of
Web file sizes
– Caching doesn’t seem to affect things much
– Multimedia tends to increase the tail weight of Web files
– But even text files alone appear to be heavy-tailed
Performance Implications (continued)
• Knowing how s.s. arises allows us to
recreate it “naturally” in simulation
• Effects of s.s. in simulated TCP networks
– Packet loss not as high as open-loop models
might suggest
– Throughput not a problem
– Packet delays are the big problem
Morning Lab Exercises
• For each dataset, explore its marginals
– Histograms
– CDFs, CCDFs
– Log-log CCDFs to look at tails
• For each dataset, explore its correlations
– ACFs
– Logscale diagrams
– Compare to scrambled dataset
• Study performance
– Simple queueing
– Compare to scrambled dataset
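The scrambled-dataset comparison in these exercises works because a random permutation preserves the marginal distribution exactly while destroying the autocorrelation. A minimal sketch using a synthetic AR(1) series as the "traffic":

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(6)
# Synthetic AR(1) series: strongly autocorrelated, stationary.
x = lfilter([1.0], [1.0, -0.9], rng.normal(size=10_000))
scrambled = rng.permutation(x)   # same marginals, correlation destroyed

lag1 = lambda s: np.corrcoef(s[:-1], s[1:])[0, 1]
print(f"lag-1 autocorrelation: original = {lag1(x):.3f}, "
      f"scrambled = {lag1(scrambled):.3f}")
```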
Afternoon: Network Engineering
• Moving from stationary domain to
nonstationary
• Goal: Traffic models that are useful for
– capacity planning
– traffic engineering
– anomaly / attack detection
• Two main variants
– Looking at traffic on a single link at a time
– Looking at traffic on multiple or all links in a
network simultaneously
Part 1: Single Link Analysis
• The general modeling framework that is
most often used is
“signal + noise”
• Sometimes interested in the signal,
sometimes the noise…
• So, signal processing techniques are
common
– Frequency Domain / Spectral Analysis
• Generally based on FFT
– Time-Frequency Analysis
• Generally based on wavelets
A Typical Trace
Notable features: periodicity, noisiness, “spikes”
Capacity Planning
• Here, mainly interested in the “signal”
• Want to predict long-term trends
• What do we need to remove?
– Noise
– Periodicity
K. Papagiannaki et al., Infocom 2003
Periodicity: Spectral Analysis
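A minimal FFT-based periodogram sketch on a synthetic trace with a planted diurnal cycle; the 288 five-minute bins per day are an assumption for illustration, not taken from the trace above:

```python
import numpy as np

rng = np.random.default_rng(7)
n, period = 4 * 7 * 288, 288          # 4 weeks of 5-minute byte counts
t = np.arange(n)
x = 100 + 20 * np.sin(2 * np.pi * t / period) + rng.normal(scale=5, size=n)

# Periodogram via FFT: peaks reveal the dominant periodicities.
power = np.abs(np.fft.rfft(x - x.mean())) ** 2
freqs = np.fft.rfftfreq(n, d=1.0)     # cycles per sample
peak = freqs[np.argmax(power)]
print(f"dominant period ≈ {1 / peak:.0f} samples (expected {period})")
```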
Denoising with Wavelets
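A sketch of wavelet denoising with PyWavelets; the db4 wavelet, 6 decomposition levels, and the MAD-based threshold rule are common choices, not necessarily those used in the slides:

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(8)
t = np.arange(2048)
signal = 100 + 20 * np.sin(2 * np.pi * t / 256)    # slow trend ("signal")
x = signal + rng.normal(scale=8, size=t.size)      # observed noisy trace

# Multilevel wavelet decomposition; shrink the detail (high-frequency)
# coefficients toward zero, then reconstruct to recover the trend.
coeffs = pywt.wavedec(x, "db4", level=6)
threshold = 3 * np.median(np.abs(coeffs[-1])) / 0.6745  # MAD noise estimate
coeffs[1:] = [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
denoised = pywt.waverec(coeffs, "db4")
print(f"residual std after denoising: {np.std(denoised[:t.size] - signal):.2f}")
```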
Capacity Planning via Forecasting
Anomaly detection
• Goal: models that are useful in detecting
– Network equipment failures
– Network misconfigurations
– Flash crowds
– Attacks
– Network abuse
Misconfiguration detection via Wavelets
[Figure: a traffic trace decomposed into high-, medium-, and low-frequency wavelet bands]
P. Barford et al., Internet Measurement Workshop 2002
Flash Crowd Detection
[Figure: long-term change in mean traffic over 8 weeks]
Detecting DoS attacks
Afternoon Part 2: Multiple Links
• How to analyze traffic from multiple links?
– Clearly, could treat as a collection of single
links, and proceed as before
– But, want more: to detect trends and patterns
across multiple links
• Observation: multiple links share common
underlying patterns
– Diurnal variation should be similar across links
– Many anomalies will span multiple links
• Problem is one of pattern extraction in
high dimension
– Dimension is number of links
Example Link Traces from a Single Network
Some have visible structure, some less so…
High Dimensionality: A General Strategy
• Look for a low-dimensional representation
preserving the most important features of
data
• Often, a high-dimensional structure may be explainable
in terms of a small number of independent variables
• Commonly used tool: Principal Component
Analysis (PCA)
Principal Component Analysis
For any given dataset, PCA finds a new coordinate system that maps
maximum variability in the data to a minimum number of coordinates.
The new axes are called Principal Axes or Components.
Correlation in Space, not Time
[Diagram: traditional frequency-domain analysis looks for correlation over time on a single link; PCA on link traffic looks for correlation in space, across links]
PCA on Link Traffic
• X: OD flow matrix (time × # links)
• V: principal matrix (orthogonal)
• U: eigenlink matrix (columns are orthonormal), where X V S^-1 = U
Singular Value Decomposition
X = U S V^T is the singular value decomposition of X
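A minimal numpy sketch of this decomposition on a synthetic link-traffic matrix driven by one shared diurnal pattern (the matrix shape and pattern are illustrative assumptions):

```python
import numpy as np

# Hypothetical link-traffic matrix X: t time bins x p links, driven by a
# shared diurnal cycle (period 288 = one day of 5-minute bins) plus noise.
rng = np.random.default_rng(9)
t, p = 2016, 50
diurnal = np.sin(2 * np.pi * np.arange(t) / 288)
X = np.outer(diurnal, rng.uniform(1, 10, size=p)) + rng.normal(scale=0.3, size=(t, p))

# X = U S V^T; columns of U are the "eigenlinks", and the singular
# values give the energy attributable to each principal component.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
energy = S**2 / np.sum(S**2)
print("fraction of energy in first 3 components:", energy[:3].round(3))
```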
PCA on Link Traffic (2)
• Singular values indicate the energy attributable to a principal component
• Each link is a weighted sum of all eigenlinks
Low Intrinsic Dimensionality of Link Data
• A plot of the singular values reveals how much energy is captured by each PC
• A sharp elbow indicates that most of the energy is captured by 5-10 singular values, for all datasets
Implications of Low Intrinsic Dimensionality
• Apparently, we can reconstruct X with high
accuracy, keeping only a few columns of U:
X' = U' S V^T
• A form of lossy data compression
• Even more, a way of extracting the most
significant part of X, automatically
– “signal + noise”?
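A sketch of this low-rank reconstruction; X here is the same kind of synthetic matrix as above, and k = 5 follows the "top 5 eigenlinks" slides below:

```python
import numpy as np

def reconstruct(X, k):
    """Keep only the top-k eigenlinks: X' = U' S V^T with columns past k dropped."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

rng = np.random.default_rng(10)
t = np.arange(2016)
X = (np.outer(np.sin(2 * np.pi * t / 288), rng.uniform(1, 10, 40))
     + rng.normal(scale=0.3, size=(t.size, 40)))

X5 = reconstruct(X, 5)     # lossy compression: 5 components out of 40
err = np.linalg.norm(X - X5) / np.linalg.norm(X)
print(f"relative reconstruction error with 5 components: {err:.3f}")
```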
Approximating With Top 5 Eigenlinks
A Link, Reconstructed
[Figure: a link’s traffic vs. reconstructions from components 1-5, components 6-10, and the 100+ remaining components]
Anomaly Detection
Single Link Approach:
Use wavelets to detrend
each flow in isolation.
[Barford:IMW02]
Multi Link approach:
Detrend all links
simultaneously by
choosing only certain
principal components.
X = X’ + X’’
PCA based anomaly detection
[Figure: L2 norm of the entire traffic vector X vs. L2 norm of the residual vector X’’]
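A minimal sketch of the multi-link idea: project onto the top-k principal components and flag time bins whose residual L2 norm is large. The injected spike and all parameters here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(11)
t, p, k = 2016, 40, 5
X = (np.outer(np.sin(2 * np.pi * np.arange(t) / 288), rng.uniform(1, 10, p))
     + rng.normal(scale=0.3, size=(t, p)))
X[1000, 7] += 8.0               # inject a small hypothetical anomaly on one link

Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:k].T @ Vt[:k]           # projector onto the "normal" subspace (top-k PCs)
residual = Xc @ (np.eye(p) - P) # X'' = part of the traffic the PCs don't explain
norms = np.linalg.norm(residual, axis=1)    # L2 norm of the residual vector
print("most anomalous time bin:", int(np.argmax(norms)))   # expect 1000
```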
Traffic Forecasting
Single-Link approach:
Treat each flow timeseries independently.
Use wavelets to extract trends.
Build timeseries forecasting models on trends.
[Papagiannaki:INFOCOM03]
Multi-Link approach:
X' = U' S V^T
Build forecasting models on the most significant eigenlinks as trends.
Allows simultaneous examination and forecasting for the entire ensemble of links.
Conclusions
• Whew!
• Bibliography on handout
• Traffic analysis methods vary considerably
depending on:
– Question being asked
– Timescale
– Stationarity
Conclusions
Performance Evaluation (1 usec – 1 hour):
– Marginals; watch out for heavy tails
– Correlation (in time); watch out for LRD / self-similarity
Network Engineering (1 hour – 1 week):
– Signal + noise
– Single-link: frequency-domain analysis
– Multi-link: exploit spatial correlation
Afternoon Lab Exercises
• For each dataset,
– Perform PCA
– Assess the variance in each component
– Reconstruct using small number of components
• Time / Interest permitting,
– Analyze some of the single link timeseries using
wavelets (MATLAB Wavelet Toolbox)