
Network Anomaly Diagnosis
Status Report
15-03-2006
Contents
• Network performance metrics
• Active monitoring tools
• Preparation of canonical dataset
• Anomaly detection
• Anomaly diagnosis
Network performance metrics
• One-Way Delay (OWD)
– Serialization Delay
– Propagation Delay
– Queuing Delay
– Forwarding Delay
• Round-Trip Time (RTT)
• Delay Variation (Jitter)
• Packet Loss
– Congestion
– Errors
• Packet Reordering
• Maximum Transmission Unit (MTU)
• Bandwidth Delay Product (BDP)
One-Way Delay (OWD)
• The time it takes for a packet to reach its end-to-end destination.
• It can be broken down into:
– per-hop one-way delays, and these in turn into:
• per-link and
• per-node delay components
• The per-link component of one-way delay consists
of two sub-components: propagation delay and
serialization delay.
• The per-node component of one-way delay also
consists of two sub-components: forwarding
delay and queuing delay.
Serialization Delay
• It is the time taken to separate a packet into
sequential link transmission units (bits).
• It is obtained by dividing the packet size (in
bits) by the capacity of the link (in bits per
second).
• Nowadays, as links increasingly have a
higher bit rate, serialization delay is less
relevant.
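The division above can be sketched directly; the function name and figures are illustrative:

```python
def serialization_delay(packet_size_bytes, link_capacity_bps):
    """Serialization delay in seconds: packet size in bits / link rate in bit/s."""
    return packet_size_bytes * 8 / link_capacity_bps

# A 1500-byte packet on a 10 Mbit/s link takes 1.2 ms to serialize;
# on a 10 Gbit/s link the same packet takes only 1.2 microseconds,
# which is why serialization delay matters less on fast links.
print(serialization_delay(1500, 10e6))  # 0.0012
print(serialization_delay(1500, 10e9))  # 1.2e-06
```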
Propagation Delay
• Propagation delay is the duration of time for
signals to move from the transmitting to the
receiving end of a link.
• On simple links, this is the product of:
– the link's physical length and
– the characteristic propagation speed of media.
• On high-speed wide-area network (WAN)
paths, delay is usually dominated by
propagation times.
Queuing Delay
• Queuing delay is defined as the time a
packet has to spend inside a node such as a
router while waiting for availability of the
output link.
• It depends on
– the amount of traffic competing to send packets
towards the output link and
– the priority of the packet.
Forwarding Delay
• Processing at the node
– Reading forwarding-relevant information
• Dest address + other headers
– Forwarding decision
• Based on the routing table
Round-Trip Time (RTT)
• It is the sum of the one-way delays from
source A to destination B and back, plus the
time it takes B to formulate the response.
• Large RTT values can cause problems for
TCP and other window-based transport
protocols.
• The round-trip time influences the
achievable throughput, as there can only be
a window's worth of unacknowledged data
in the network.
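The window/RTT relation above can be illustrated with a small sketch (the function name and example numbers are my own):

```python
def max_throughput_bps(window_bytes, rtt_s):
    """At most one window of unacknowledged data can be in flight per
    round-trip time, which bounds the achievable throughput."""
    return window_bytes * 8 / rtt_s

# Example: a 64 KiB TCP window over a 100 ms RTT path caps throughput
# at about 5.2 Mbit/s, no matter how fast the links themselves are.
print(max_throughput_bps(64 * 1024, 0.100) / 1e6)
```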
Delay Variation (Jitter)
• OWD is not constant on a real network
because of competing traffic and contention
for processing resources.
• The difference between a given packet’s
actual and average OWD is termed ‘delay
variation’ or jitter.
• It only compares the delays experienced by
packets of equal size, since OWD depends on
packet size because of serialization delay.
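As a quick sketch of the definition (the sample delays are made up, all for equal-size packets):

```python
# Jitter as each packet's deviation from the average OWD (milliseconds).
owd_ms = [20.1, 19.8, 25.3, 20.0, 19.9]       # measured one-way delays
avg = sum(owd_ms) / len(owd_ms)               # average OWD
jitter_ms = [d - avg for d in owd_ms]         # per-packet delay variation
print(round(avg, 2), [round(j, 2) for j in jitter_ms])
```

The third packet stands out with roughly +4.3 ms of delay variation.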
Packet Loss
• Packet loss is determined as the probability of a
packet being lost in transit from a source A to a
destination B.
• Applications requiring reliable transmission, e.g.
bulk data transfers, use retransmission, which
reduces performance.
• In addition, congestion-sensitive protocols such as
standard TCP assume that packet loss is due to
congestion, and respond by reducing their
transmission rate accordingly.
• Congestion and errors are the two main reasons
for packet loss.
Congestion
• When the offered load exceeds the capacity of a
part of the network, packets are buffered in
queues.
• Since these buffers are also of limited capacity,
congestion can lead to queue overflows, which
leads to packet losses.
• Congestion can be caused by
– a moderate overload condition maintained for an
extended amount of time, or
– the sudden arrival of a very large amount of traffic
(a traffic burst).
Errors
• Another reason for loss of packets is
corruption, where parts of the packet are
modified in-transit.
• When such corruptions happen on a link
(due to noisy lines etc.), this is usually
detected by a link-layer checksum at the
receiving end, which then discards the
packet.
Packet Reordering
• The Internet Protocol (IP) does not guarantee the
packet ordering.
• Packet reordering concerns packets of different
sizes.
– larger packets take longer to transfer and may be
overtaken by smaller packets in transit.
• It can be measured by:
– injecting the same traffic pattern via a traffic generator
and calculating the reordering.
– To measure maximal reordering: a short burst of long
packets immediately followed by a short burst of short
packets.
Maximum Transmission Unit (MTU)
• It describes the maximum size of an IP
packet that can be transferred over the link
without fragmentation.
• Common MTU sizes are:
– 1500 bytes (Ethernet, 802.11 WLAN)
– 4470 bytes (FDDI, common default for POS
and serial links)
– 9000 bytes (Internet2 and GÉANT convention,
limit of some Gigabit Ethernet adapters)
– 9180 bytes (ATM, SMDS)
Bandwidth Delay Product (BDP)
• The Bandwidth Delay Product (BDP) of an end-to-end path is the product of the bottleneck bandwidth
and the delay of the path.
• It is often useful to think of BDP as the "memory
capacity" of a path, i.e. the amount of data that fits
entirely into the path between two end-systems.
• This relates to throughput, which is the rate at
which data is sent and received.
• Network paths with a large BDP are called Long
Fat Networks or LFNs.
• BDP is an important parameter for the performance
of window-based protocols such as TCP.
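The product can be computed directly; the numbers below are illustrative:

```python
def bdp_bytes(bottleneck_bps, delay_s):
    """Bandwidth-delay product: the amount of data that fits in the path."""
    return bottleneck_bps * delay_s / 8

# A 1 Gbit/s bottleneck with 100 ms of delay holds about 12.5 MB in
# flight; a TCP window smaller than this cannot keep the pipe full.
print(bdp_bytes(1e9, 0.100) / 1e6)
```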
Available bandwidth
• Packet-pair measurements can estimate:
– the dynamic bandwidth capacity (Cap), by
analyzing the minimum inter-packet delay;
– the cross-traffic (Xtr), by analyzing the
inter-packet delay variability;
– and the available bandwidth (Abw = Cap – Xtr).
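A much-simplified sketch of how these three quantities can be derived from packet-pair gaps (the gap values and this particular estimator are illustrative, not any specific tool's actual algorithm):

```python
# Packet-pair dispersion sketch: the minimum inter-packet gap reflects the
# capacity, while gap expansion beyond the minimum reflects cross-traffic.
L_bits = 1500 * 8                                        # probe packet size
gaps_s = [0.00012, 0.00013, 0.00020, 0.00015, 0.00018]   # measured pair gaps

cap_bps = L_bits / min(gaps_s)                 # Cap from the minimum gap
avg_gap = sum(gaps_s) / len(gaps_s)
xtr_bps = cap_bps * (avg_gap - min(gaps_s)) / avg_gap    # Xtr from variability
abw_bps = cap_bps - xtr_bps                    # Abw = Cap - Xtr
print(cap_bps / 1e6, abw_bps / 1e6)            # Mbit/s
```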
Active Probing Tools
• Some publicly available active network monitoring
tools are:
• Throughput & Delay Measurement Tools
– Ping
– Traceroute
– Iperf
– Thrulay
• Path Characterization & Bandwidth Estimation
– pathChirp
– Pathload
– ABwE
– Netperf
– Nettest
Ping
• Uses ICMP echo mechanism
• Measures
– RTT
– Packet loss
– Reachability
• Source & Destination use timestamps to calculate
RTT
• Packet loss can be measured by sending a set of
packets from a source to a destination and
comparing the number of received packets against
the number of packets sent.
Ping example

syrup:/home$ ping -c 6 -s 64 thumper.bellcore.com
PING thumper.bellcore.com (128.96.41.1): 64 data bytes
72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
--- thumper.bellcore.com ping statistics ---
6 packets transmitted, 5 packets received, 16% packet loss
round-trip min/avg/max = 482.1/880.5/1447.4 ms

• -c 6 sets the repeat count, -s 64 the packet size;
thumper.bellcore.com is the remote host and "time" is the RTT.
• icmp_seq=1 is missing (a lost packet); the final lines give the
summary statistics.
Traceroute
• Determines the route a packet takes through the
Internet to reach its destination; i.e. the number of
"hops" it takes.
• It sends "probe" packets with TTL values
incrementing from one, and uses ICMP "Time
Exceeded" messages to detect "hops" on the way
to the specified destination.
• It also records "response" times for each hop, and
displays losses and other types of failures
• Used to measure
– hop-by-hop connectivity and
– RTT
Traceroute
• UDP/ICMP tool to show the route packets take from the local to the
remote host. Here -q 1 sets the probes per hop, -m 20 the max hops,
and lhr.comsats.net.pk is the remote host.

cottrell@flora06:~>traceroute -q 1 -m 20 lhr.comsats.net.pk
traceroute to lhr.comsats.net.pk (210.56.16.10), 20 hops max, 40 byte packets
1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
4 snv-slac.es.net (134.55.208.30) 1.377 ms
5 nyc-snv.es.net (134.55.205.22) 75.536 ms
6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
10 207.45.205.18 (207.45.205.18) 128.648 ms
11 210.56.31.94 (210.56.31.94) 762.150 ms
12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
13 *
14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms

• The "*" at hop 13 means no response: the probe was lost or the
router ignores traceroute probes.
Iperf
• It reports bandwidth, delay jitter, and datagram loss.
• It has two modes
– TCP
– UDP
• In TCP mode it is used to measure the bandwidth.
• It reports MSS (Maximum Segment Size) and the
observed read sizes. It supports TCP window size
adjustment via socket buffers.
• In UDP mode packet loss and delay jitter can be
measured.
Thrulay
• Thrulay stands for THRUput and deLAY.
• It measures the capacity of a network by
sending a bulk TCP stream over it.
• One-way delay can be measured by sending
a Poisson stream (arrivals at random) of
very precisely positioned UDP packets.
Pathload
• Pathload is based on the technique of Self-Loading
Periodic Streams (SLoPS), for measuring available
bandwidth.
• A periodic stream in SLoPS consists of K packets of size
L, sent to the path at a constant rate R.
• If the stream rate R is higher than the available bandwidth,
the one-way delays of successive packets at the receiver
show an increasing trend.
• In Pathload, the sender uses UDP for the packet stream and a
TCP connection for controlling measurements.
• The sender timestamps each packet upon transmission, which
provides relative one-way delay at the receiver side.
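The SLoPS decision, whether the one-way delays of a stream show an increasing trend, can be sketched roughly as below (the simple pairwise test and its threshold are my simplification; Pathload uses more robust trend statistics):

```python
def increasing_trend(owds, threshold=0.6):
    """True if most consecutive one-way delay pairs increase,
    suggesting the stream rate R exceeds the available bandwidth."""
    ups = sum(1 for a, b in zip(owds, owds[1:]) if b > a)
    return ups / (len(owds) - 1) > threshold

print(increasing_trend([1.0, 1.1, 1.3, 1.6, 2.0, 2.5]))  # True: R > Abw
print(increasing_trend([1.0, 1.1, 0.9, 1.0, 1.1, 0.9]))  # False
```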
pathChirp
• pathChirp is used to calculate unused
capacity or available bandwidth.
• Based on the concept of self-induced
congestion.
– Probing rate < available bw → no delay increase
– Probing rate > available bw → delay increases
• Unique to pathChirp is an exponentially
spaced chirp probing train.
ABwE
• ABwE is a lightweight available bandwidth
measurement tool based on the packet pair
dispersion technique.
• It uses fixed size packets with specific delay
between each packet pair.
• The observed packet pair delay is converted into
available bandwidth calculations.
• The tool can be used in continuous mode and
detects all substantial bandwidth changes caused
by improper routing or by congestion.
Netperf
• Netperf is a benchmark that can be used to
measure the performance of many different types
of networking.
• It provides tests for both unidirectional
throughput, and end-to-end latency.
• Essentially, these tests will measure how fast one
system can send data to another and/or how fast
that other system can receive it.
• Request/response performance is the second area
that can be investigated with netperf.
Selection of tools
• Shriram et al. conclude their comparison of public end-to-end
bandwidth estimation tools on high-speed links with the
following remarks:
– "Pathload and Pathchirp are the most accurate. Iperf requires
maximum buffer size and is sensitive to small packet loss."
• Ribeiro et al., in comparisons of pathChirp with Pathload and
TOPP, clearly show that pathChirp performs far better than the
other two tools in single-hop and multi-hop scenarios, as well as
on the real Internet.
• Cottrell et al. state: "Ping utility does not require any extra
installation/deployment efforts and has become a de-facto
standard for measuring RTT and packet loss."
• Tools selected are pathChirp, pinger, thrulay and
traceroute.
References
• Cottrell L., Internet Monitoring, Presented at NUST
Institute of Information Technology (NIIT) Rawalpindi,
Pakistan, March 15, 2005
• Cottrell L., Matthews W. and Logg C., Tutorial on Internet
Monitoring & PingER at SLAC
• Jacobson V., “pathchar — a tool to infer characteristics of
Internet paths”
• Ribeiro V., Riedi R., Baraniuk R., Navratil J., Cottrell L.
pathChirp: Efficient Available Bandwidth Estimation for
Network Paths.
• Jain M., Dovrolis C., End-to-End Available Bandwidth:
Measurement Methodology, Dynamics, and Relation with
TCP Throughput. ACM SIGCOMM 2002.
• Navratil J. and Cottrell L., ABwE: A Practical Approach to
Available Bandwidth Estimation.
• http://thrulay-hd.sourceforge.net
References
• http://www.netperf.org/netperf/training/Netperf.html
• Beginner's Guide to Network-Distributed Resource Usage
• Cottrell, L., Logg, C.: Overview of IEPM-BW Bandwidth Testing of Bulk
Data Transfer. In: SC2002: High Performance Networking and Computing.
(2002)
• http://dast.nlanr.net/Projects/Iperf/iperfdocs_1.7.0.html
• http://www-didc.lbl.gov/NCS/netest/
• Shriram A., Murray M., Hyun Y., Brownlee N., Broido A., Fomenkov M.,
claffy k., "Comparison of Public End-to-End Bandwidth Estimation tools on
High-Speed Links"
• Prasad R. S., Dovrolis C., "Capacity estimation using packet dispersion
techniques: recent progress and open issues". Presentation.
• http://www.academ.com/nanog/june1997/performance.html
• Labit Y., Owezarski P., Larrieu N., Evaluation of Active Measurement Tools
for Bandwidth Estimation in Real Environment.
• Dovrolis C., Jain M., "End-to-end available bandwidth estimation using
Pathload". Presentation.
• claffy k., Dovrolis C., "Bandwidth estimation: measurement methodologies
and applications". Presentation.
• GÉANT2 Performance Enhancement and Response Team (PERT) User Guide
and Best Practice Guide
Anomaly Detection Techniques
• Exponential weighted moving average (EWMA)
• Holt-Winters forecasting algorithm
• Plateau Algorithm
• Principal Component Analysis (PCA)
• ARIMA-based Box-Jenkins forecasting models
• Signal analysis techniques, e.g. the wavelet analysis scheme
EWMA
• EWMA is a forecasting-based technique.
• EWMA predicts the next value in a given
time series based on recent history.
• Specifically, if z_t denotes the traffic at time t,
then the EWMA prediction ẑ_{t+1} for time t+1 is
given by:
– ẑ_{t+1} = α·z_t + (1 − α)·ẑ_t, where 0 < α < 1
weights recent history.
• Anomalies can then be quantified by taking
the difference between the forecasted and the
actual value, i.e., e_{t+1} = z_{t+1} − ẑ_{t+1}.
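A minimal sketch of the scheme, assuming the standard single-parameter EWMA update (the function name and smoothing weight are my own choices):

```python
def ewma_residuals(series, a=0.5):
    """Forecast z_hat[t+1] = a*z[t] + (1-a)*z_hat[t]; return
    (forecast, residual) pairs, where a large residual flags an anomaly."""
    z_hat = series[0]                        # seed forecast with first value
    out = []
    for z in series[1:]:
        out.append((z_hat, z - z_hat))       # residual = actual - forecast
        z_hat = a * z + (1 - a) * z_hat      # update forecast for next step
    return out

for fc, resid in ewma_residuals([10, 10, 10, 30, 10]):
    print(fc, resid)                         # the spike to 30 yields residual 20
```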
Holt-Winters (HW) Algorithm
• Service network variable time series exhibit the
following regularities:
– A trend over time (e.g., a gradual increase in application
daemon requests over a two-month period due to
increased subscriber load).
– A seasonal trend or cycle (e.g., every day bytes per
second increase in the morning hours, peak in the
afternoon and decline late at night).
– Seasonal variability (e.g., application requests fluctuate
wildly minute by minute during the peak hours of 4-8
pm, but at 1 am application requests hardly vary at all).
Holt-Winters (HW) Algorithm
• The Holt-Winters (HW) algorithm uses a triple
Exponential Weighted Moving Average (EWMA)
approximation to characterize the time series behavior
as a superposition of three components: a baseline, a
linear trend and a seasonal effect.
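One additive Holt-Winters update step can be sketched as three coupled EWMAs for the level (baseline), trend, and seasonal effect; the smoothing weights and the season length m below are illustrative choices, not values from the source:

```python
def hw_update(z, level, trend, season, t,
              alpha=0.5, beta=0.3, gamma=0.2, m=24):
    """One additive Holt-Winters step: update the baseline, linear trend
    and the seasonal term for slot t % m, then forecast the next value."""
    s = season[t % m]
    new_level = alpha * (z - s) + (1 - alpha) * (level + trend)
    new_trend = beta * (new_level - level) + (1 - beta) * trend
    season[t % m] = gamma * (z - new_level) + (1 - gamma) * s
    forecast = new_level + new_trend + season[(t + 1) % m]
    return new_level, new_trend, forecast
```

As with plain EWMA, the difference between this forecast and the next observed value can serve as the anomaly score.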
Plateau Algorithm
• It is used to detect a significant change in the base RTT
of a path.
• A change in the base RTT means that most of the measured RTT
values shift, with new values settling around a new, higher level.
• The plateau detector is based around two windows,
– the summary window and
– the sample window.
• The summary window is used to characterize the normal
state of the path, based on the previous samples.
• The number of samples stored in the summary window is
a user-settable parameter.
Plateau Algorithm
• Samples are first stored in the summary window; a
trigger selector decides whether a sample forms part of a
trigger.
• If so, the sample is stored in the sample
window.
• The decision is based on the value of the sample,
the mean and variance of the summary window,
and on parameters that the user has chosen:
– RTT > sample mean + (sample variance × sensitivity)
Plateau Algorithm
• A trigger is generated when a number of samples
(a user-settable parameter) meeting the above
condition have been received.
• Each time a sample that fulfils the above
condition is received, a counter is incremented.
• When the counter is non-zero and a sample that does
not meet the condition is received, the
counter is decremented.
• If the counter reaches the user-set trigger
duration, a trigger is generated. If the counter
returns to zero, the trigger is aborted.
• In either case, data in the sample window is passed
to the summary window.
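The counter logic above can be sketched as follows (the window handling is much simplified relative to the full detector, and the parameter names are my own):

```python
def plateau_triggers(samples, window=5, sensitivity=2.0, trigger_duration=3):
    """Return indices where a plateau trigger is generated."""
    summary, counter, triggers = [], 0, []
    for i, rtt in enumerate(samples):
        if len(summary) >= window:
            mean = sum(summary) / len(summary)
            var = sum((x - mean) ** 2 for x in summary) / len(summary)
            if rtt > mean + var * sensitivity:
                counter += 1                 # sample goes to the sample window
                if counter == trigger_duration:
                    triggers.append(i)       # trigger generated
                continue
            elif counter > 0:
                counter -= 1                 # counter decays back towards zero
        summary.append(rtt)                  # normal samples update the summary
        if len(summary) > window:
            summary.pop(0)
    return triggers

# Three consecutive high RTTs after a stable baseline raise a trigger.
print(plateau_triggers([10, 10, 10, 10, 10, 50, 50, 50, 10, 10]))  # [7]
```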
Principal Component Analysis (PCA)
• PCA is a coordinate transformation method that maps a given set of
data points onto new axes. These axes are called the principal axes or
principal components.
• When working with zero-mean data, each principal component has
the property that it points in the direction of maximum variance
remaining in the data, given the variance already accounted for in
the preceding components.
• As such, the first principal component captures the variance of the
data to the greatest degree possible on a single axis. The next
principal components then each capture the maximum variance
among the remaining orthogonal directions. Thus, the principal
axes are ordered by the amount of data variance that they capture.
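A small sketch of PCA via the SVD on zero-mean data (toy data of my own; NumPy assumed available):

```python
import numpy as np

# Two strongly correlated columns: almost all variance lies along one axis.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
data = np.hstack([x, 3 * x + 0.1 * rng.normal(size=(200, 1))])
data -= data.mean(axis=0)                 # zero-mean, as the text requires

# The right singular vectors are the principal axes, ordered by the
# variance they capture (singular values come out sorted descending).
_, s, vt = np.linalg.svd(data, full_matrices=False)
variance = s ** 2 / len(data)
print(variance)                           # first component dominates
```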
References
• McGregor A.J. and Braun H-W, "Automated Event Detection for
Active Measurement Systems", Proceedings of PAM2001,
Amsterdam, Netherlands, April 2001
Preparation of Canonical dataset
• Fetch data from monitoring nodes
(Pathchirp, Thrulay, Ping, Traceroute) - Adnan, Ali
• Study/Analyze Data - Adnan, Ali
• Run Run-period - Adnan
• Analyze Alerts - Adnan, Ali
• Design a format - Adnan, Ali
• Rearrange data according to new format - Adnan, Ali
Topology to be used
[Topology diagram: four monitoring nodes; monitored nodes = 21]
Data being collected using this topology
• Pinger: Round trip Time (Min, Max, Avg)
• Pathchirp: Throughput (Avg, Max, Min, Stdev)
• Thrulay: Throughput (Avg, Max, Min, Stdev) and
RTT (Avg, Max, Min, Stdev)
• Traceroute summaries
• Host Monitoring data (e.g. data from LISA,
Nagios, Ganglia) - candidate
Format of canonical data set
• Not yet decided; needs some more investigation of the
available data from multiple sources.
• Data was fetched from 1 January 2006 to 28 February 2006.