Diagnostic Steps
Les Cottrell – SLAC
Presented at the Optimization Technologies for Low-Bandwidth Networks, ICTP
Workshop, Trieste, Italy, 9-20 October 2006
http://www.slac.stanford.edu/grp/scs/net/talk06/diagnostics.ppt
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end
Performance Monitoring (IEPM), also supported by IUPAP
http://sdu.ictp.it/lowbandwidth/
Get ready
Bring up a terminal window so you can try some commands
Bring up the presentation so you can click on links:
www.slac.stanford.edu/grp/scs/net/talk06/diagnostics.ppt
Les Cottrell, SLAC
Slide: 2
Aim
Goal: provide a practical guide to debugging common
problems
Why is diagnosis difficult yet important?
Local host
Ping, Traceroute, PingRoute
Looking at time series
Locating bottlenecks
Correlation of problems with routes
More tools and problems
Where is a node
Who do you tell, what do you say?
Case studies and More Information
Why is diagnosis difficult?
Internet's evolution as a composition of independently
developed and deployed protocols, technologies, and core
applications
Diversity, highly unpredictable, hard to find “invariants”
Rapid evolution & change, no equilibrium so far
Findings may be out of date
Measurement/diagnosis not high on vendors' list of priorities
Resources/skill focus on more interesting and profitable issues
Tools lacking or inadequate
Implementations are flaky & not fully tested with new releases
Add to that …
Distributed systems are very hard
“A distributed system is one in which I can't get my work done because a
computer I've never heard of has failed.” (Butler Lampson)
Network is deliberately transparent
The bottlenecks can be in any of the following components:
the applications
the OS
the disks, NICs, bus, memory, etc. on sender or receiver
the network switches and routers, and so on
Problems may not be logical
Most problems are operator errors, configurations, bugs
When building distributed systems, we often observe unexpectedly low
performance
the reasons for which are usually not obvious
Just when you think you’ve cracked it, in steps security
Firewall, NAT boxes etc.
Block pings, traceroute looks like port scan, diagnostic tool ports are
blocked …
ISPs worried about providing access to core, making results public, &
privacy issues
Sources of problems
Host “errors”
TCP buffers, heavy utilization …
Duplex mismatch (Ethernet)
Misconfigured router/switches
Including routing errors, especially for backup paths
Bad equipment, wiring/fiber problem
Congestion
Fire: Local Host
Usual Unix tools (uname -a, top, vmstat, iostat …)
Is the host overloaded? Do you have a gateway (route) and a name server
(nslookup/dig)? Which interface are you using? (mii-tool, which needs
root, gives duplex & speed, a common source of errors)
21cottrell@pinger:~>sudo mii-tool eth0
eth0: 100 Mbit, full duplex, link ok
Net: ifconfig -a (look at errors), netstat -a | more
Is server running (if you know port)?
>telnet localhost 2811
Trying 127.0.0.1
220 aftpexp04.bnl.gov GridFTP Server 1.12 GSSAPI type
Globus/GSI wu-2.6.2 (gcc32dbg, 1069715860-42) ready.
^]
telnet> quit
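The telnet check above can also be scripted. The following is a minimal sketch, not part of the original talk: the function name probe_tcp_service is hypothetical, and it assumes the server (like the GridFTP server shown) sends a greeting banner on connect.

```python
import socket


def probe_tcp_service(host, port, timeout=3.0):
    """Return (is_open, banner) for a TCP service.

    banner is '' when the port is open but nothing is received in time.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                # Many servers (FTP, SMTP, GridFTP) greet first; read that line.
                data = sock.recv(256)
                banner = data.decode("ascii", errors="replace").strip()
            except OSError:
                banner = ""
            return True, banner
    except OSError:
        # Refused, unreachable, timed out, or name lookup failed.
        return False, ""
```

For example, probe_tcp_service("localhost", 2811) would correspond to the telnet session shown, returning the "220 …" greeting if the server is running.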
Ping
Ping, in order:
1. localhost,
2. the gateway (use route or traceroute to find the gateway),
3. a well-known host,
4. the relevant remote host
Use IP address to avoid nameserver problems
Look for connectivity, loss, RTT, jitter, dups
May need to run for a long time to see some pathologies
(e.g. bursty loss due to DSL loss of sync)
Try flood pings if you suspect rate limiting
Use synack or sting if ICMP blocked
www-iepm.slac.stanford.edu/tools/synack/
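When ICMP is blocked, timing a TCP handshake gives a rough RTT. The sketch below is an approximation in the spirit of synack, not the synack tool itself: it times a full connect() rather than the SYN/SYN-ACK exchange alone, and the function names are hypothetical. Pass an IP address so that name lookup time is not included.

```python
import socket
import time


def tcp_connect_rtt(host, port, count=5, timeout=3.0):
    """Time `count` TCP connects; return RTTs in ms (None = failed attempt)."""
    samples = []
    for _ in range(count):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                samples.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            samples.append(None)
    return samples


def summarize(samples):
    """Loss percentage and min/avg/max over the successful samples."""
    ok = [s for s in samples if s is not None]
    loss_pct = 100.0 * (len(samples) - len(ok)) / len(samples)
    if not ok:
        return {"loss_pct": loss_pct, "min": None, "avg": None, "max": None}
    return {"loss_pct": loss_pct, "min": min(ok),
            "avg": sum(ok) / len(ok), "max": max(ok)}
```

A call such as summarize(tcp_connect_rtt("192.0.2.1", 80)) (the address is a placeholder) reports figures comparable to a ping summary.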
Ping example
syrup:/home$ ping -c 6 -s 64 thumper.bellcore.com
PING thumper.bellcore.com (128.96.41.1): 64 data bytes
72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
--- thumper.bellcore.com ping statistics ---
6 packets transmitted, 5 packets received, 16% packet loss
round-trip min/avg/max = 482.1/880.5/1447.4 ms
Callouts: -c 6 = repeat count; -s 64 = packet size; thumper.bellcore.com = remote host; time= is the RTT; icmp_seq=1 is the missing seq #; the last three lines are the summary
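Looking for loss, RTT and dups in ping output can be automated. This is a minimal sketch that assumes the output format of the example above (other ping versions differ) and that sequence numbers start at first_seq; parse_ping and its parameters are hypothetical names.

```python
import re

SAMPLE = """\
PING thumper.bellcore.com (128.96.41.1): 64 data bytes
72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
--- thumper.bellcore.com ping statistics ---
6 packets transmitted, 5 packets received, 16% packet loss
round-trip min/avg/max = 482.1/880.5/1447.4 ms
"""


def parse_ping(text, first_seq=0):
    """Extract per-packet RTTs, missing sequence numbers and duplicates."""
    rtts = {}
    dups = 0
    for line in text.splitlines():
        m = re.search(r"icmp_seq=(\d+).*time=([\d.]+)", line)
        if not m:
            continue
        seq = int(m.group(1))
        if seq in rtts:
            dups += 1          # same sequence number seen twice
        else:
            rtts[seq] = float(m.group(2))
    m = re.search(r"(\d+) packets transmitted", text)
    # Fall back to the highest sequence number seen if there is no summary.
    sent = int(m.group(1)) if m else (max(rtts) - first_seq + 1 if rtts else 0)
    missing = [s for s in range(first_seq, first_seq + sent) if s not in rtts]
    values = list(rtts.values())
    return {
        "sent": sent,
        "received": len(rtts),
        "loss_pct": 100.0 * len(missing) / sent if sent else 0.0,
        "missing": missing,
        "dups": dups,
        "min": min(values) if values else None,
        "avg": sum(values) / len(values) if values else None,
        "max": max(values) if values else None,
    }
```

Running parse_ping(SAMPLE) reproduces the summary line from the raw replies and names the missing sequence number.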
Try the following Ping Examples
ping cepheid.physics.utoronto.ca
From mcl-gpb.gw.utoronto.ca … Destination Host Unreachable
ping rolandlap.ph.unimelb.edu.au
From rtr4-000037.unimelb.edu.au … Packet filtered
ping www.ncit.edu.np
ping: unknown host www.ncit.edu.np
ping inpe-gw-sp.cptec.inpe.br
From 150.163.200.100 icmp_seq=0 Time to live exceeded
ping www.ug.edu.gh
34 packets transmitted, 0 received, 100% packet loss, time 33068ms
synack -p 80 -k 5 www.ug.edu.gh
5 packets transmitted, 5 packets received, 0.00 percent packet loss
round-trip (ms) min/avg/max = 182.052/182.701/183.151 (std = 0.578)
(median = 183.095)
(interquartile range = 1.039)
(25 percentile = 182.085)
(75 percentile = 183.124)
3rd party ping
Find servers:
http://www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html
Glasgow University*# Scotland.
ICTP +*, Trieste, Italy.
IHEP + Beijing, China.
Modify URL to request a ping for hosts with +
pinger.ictp.it/cgi-bin/traceroute.pl?function=ping&target=brunsvigia.tenet.ac.za
ping from 134.79.18.163 (www.slac.stanford.edu) to 196.21.99.222 (brunsvigia.tenet.ac.za) for 140.105.16.64
PING 196.21.99.222: 56 data bytes
64 bytes from brunsvigia.tenet.ac.za (196.21.99.222): icmp_seq=0. time=370. ms
64 bytes from brunsvigia.tenet.ac.za (196.21.99.222): icmp_seq=1. time=1911. ms
64 bytes from brunsvigia.tenet.ac.za (196.21.99.222): icmp_seq=2. time=911. ms
64 bytes from brunsvigia.tenet.ac.za (196.21.99.222): icmp_seq=3. time=385. ms
64 bytes from brunsvigia.tenet.ac.za (196.21.99.222): icmp_seq=4. time=366. ms
----196.21.99.222 PING Statistics----
5 packets transmitted, 5 packets received, 0% packet loss
round-trip (ms) min/avg/max = 366/788/1911
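Modifying the URL by hand is error-prone, so it can be built programmatically. A minimal sketch: the function name, the default script path and the http scheme are assumptions taken from the example URL above, and other servers may use different paths or parameters.

```python
from urllib.parse import urlencode


def third_party_ping_url(server, target,
                         script="/cgi-bin/traceroute.pl", scheme="http"):
    """Build a ping request URL for a traceroute server that supports ping (+)."""
    query = urlencode({"function": "ping", "target": target})
    return f"{scheme}://{server}{script}?{query}"


print(third_party_ping_url("pinger.ictp.it", "brunsvigia.tenet.ac.za"))
```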
RTT from California to world
[Figure: histogram of frequency vs. RTT (ms), with labels E. Coast, Europe & S. America, Brazil, and 300ms; plot of RTT (ms.) vs. longitude (degrees), with labels Europe, 0.3*0.6c, and 300ms. Data from CAIDA Skitter project]
Traceroute
Traceroute to remote host
Is the route direct, or does it go over congested commercial nets?
Reverse traceroute from remote host to you or a 3rd party
www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html
www.tracert.com/
CAIDA mouse-sensitive map
Traceroute
UDP/ICMP tool to show route packets take from local to remote host
17cottrell@flora06:~>traceroute -q 1 -m 20 lhr.comsats.net.pk
traceroute to lhr.comsats.net.pk (210.56.16.10), 20 hops max, 40 byte packets
1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
4 snv-slac.es.net (134.55.208.30) 1.377 ms
5 nyc-snv.es.net (134.55.205.22) 75.536 ms
6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
10 207.45.205.18 (207.45.205.18) 128.648 ms
11 210.56.31.94 (210.56.31.94) 762.150 ms
12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
13 *
14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms
Callouts on the slide: -q 1 = probes/hop; -m 20 = max hops; lhr.comsats.net.pk = remote host; location; long delay, satellite; no response (the * at hop 13): lost packet or router ignores
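Spotting the long-delay hop and the silent routers in such output can be scripted. A minimal sketch, assuming one probe per hop (-q 1) and the output format of the example above; parse_hops and largest_rtt_jump are hypothetical names.

```python
import re

SAMPLE = """\
1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
4 snv-slac.es.net (134.55.208.30) 1.377 ms
5 nyc-snv.es.net (134.55.205.22) 75.536 ms
6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
10 207.45.205.18 (207.45.205.18) 128.648 ms
11 210.56.31.94 (210.56.31.94) 762.150 ms
12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
13 *
14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms
"""

HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)(?:\s+\((\S+)\)\s+([\d.]+)\s*ms)?")


def parse_hops(text):
    """Return [(hop_number, name, rtt_ms or None)] for each hop line."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m:
            hop, name, _addr, rtt = m.groups()
            hops.append((int(hop), name, float(rtt) if rtt else None))
    return hops


def largest_rtt_jump(hops):
    """Return (from_hop, to_hop, delta_ms) for the biggest RTT increase
    between consecutive responding hops, or None if fewer than two respond."""
    answered = [(h, rtt) for h, _name, rtt in hops if rtt is not None]
    best = None
    for (h0, r0), (h1, r1) in zip(answered, answered[1:]):
        if best is None or r1 - r0 > best[2]:
            best = (h0, h1, r1 - r0)
    return best


hops = parse_hops(SAMPLE)
print("silent hops:", [h for h, _n, rtt in hops if rtt is None])
print("largest jump:", largest_rtt_jump(hops))
```

A single traceroute only hints at where delay is added; RTTs to intermediate routers can be inflated by the router's own handling of the probes.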
Traceroute server results
Example: www.slac.stanford.edu/cgi-bin/nph-traceroute.pl
[Screenshot of the traceroute server page: related info, security warning, and a traceroute form (enter IP address or name)]
Graphical Traceroute
http://visualroute.visualware.com/
Pingroute
Ping routers along route, e.g. a tool to install that helps:
www.slac.stanford.edu/comp/net/fpingroute.pl
or www.slac.stanford.edu/comp/net/pingroute.pl if fping N/A
15cottrell@noric04:~>fpingroute.pl
fpingroute.pl does a traceroute to the selected host. For each of the hops
along the route it then uses fping to ping each node (in parallel) 'count'
times. Output includes traceroute information, RTTs, losses for 100 and
'size' byte pings.
Version=0.21, 8/24/04
Usage: fpingroute.pl [Opts] host
where host is the remote host's IP address or name
e.g. www.slac.stanford.edu
Opts:
[-c count default=10]
[-s size default=1400]
[-i initial default=1]
Example: fpingroute.pl -i 3 -c 10 -s 1400 www.triumf.ca
Pingroute example
May help tell where losses start
Will need many pings if losses small
[Screenshot of pingroute output, annotated: "Start of losses?", "But?", "Start of sustained losses", "Routers may not respond"]
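How many pings "many" means when losses are small can be estimated. The sketch below assumes losses are independent with a fixed probability p (real loss is often bursty, so treat the result as a rough guide): the chance of seeing at least one loss in n pings is 1 - (1 - p)^n, which gives n >= ln(1 - confidence) / ln(1 - p). The function name is hypothetical.

```python
import math


def pings_needed(loss_rate, confidence=0.95):
    """Pings needed to observe at least one loss with the given confidence,
    assuming independent losses at a constant rate."""
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - loss_rate))


for p in (0.1, 0.01, 0.001):
    print(f"loss {p:.1%}: {pings_needed(p)} pings")
```

The default count of 10 in fpingroute.pl is therefore only enough to reveal fairly heavy loss.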
Look at time series
Look at history plots (PingER, IEPM-BW, ISPs, own
border router etc.), when did problem start, how big an
effect is it?
Assumes you know “proximity” of paths for which there are
archived active measurements to the path that you are
interested in
Also that relevant measurements exist
www-iepm.slac.stanford.edu/pinger/
amp.nlanr.net/ (unfortunately no longer funded)
ISPs' plots (see www.slac.stanford.edu/comp/net/wan-mon/netmon.html for a place to start looking):
Abilene: http://stryper.uits.iu.edu/abilene/
GEANT: http://stats.geant.net/usagemap/usagemap
RIPE: http://www.ripe.net/projects/ttm/Plots/
ESnet: http://measurement.es.net/ (OWAMP)
Collaboration between Internet2/ESnet/Geant to provide
access to router measurements holds promise
Look at traceroute histories (see later)
Example time series
[Time-series plot, annotated: look for a change in the measured value, note the time, correlate; event labeled "Italy disconnected"]
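Looking for a change in a measured value can be partly automated. This is a minimal sketch of one simple heuristic, not the method used by PingER or IEPM-BW: compare each new sample against the median of the preceding window and report a change only when several consecutive samples deviate. The function name, window, threshold and persist values are all assumptions to tune.

```python
from statistics import median


def find_step_changes(values, window=5, threshold=0.3, persist=3):
    """Return indices where the series shifts by more than `threshold`
    (as a fraction of the trailing-window median) for `persist` samples."""
    changes = []
    i = window
    while i + persist <= len(values):
        baseline = median(values[i - window:i])
        run = values[i:i + persist]
        if baseline > 0 and all(
                abs(v - baseline) / baseline > threshold for v in run):
            changes.append(i)
            i += window        # let the baseline settle on the new level
        else:
            i += 1
    return changes


# Synthetic throughput series (Mbits/s): drop at index 10, recovery at 20.
series = [100] * 10 + [50] * 10 + [100] * 10
print(find_step_changes(series))
```

Requiring the deviation to persist keeps a single outlier from being reported as a change; the reported index is then the time to correlate with route changes or other events.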
Find location of a bottleneck
Look at hops along the path
Pingroute (see earlier)
If possible look at utilizations or active probes launched from there
Pathneck http://www.cs.cmu.edu/~hnn/pathneck/
Uses trains of packets to probe hops along route, looking at
dispersion induced by queuing
Pipechar (son of pathchar, pchar)
http://www.dsd.lbl.gov/OldProjects/NCS
Send packets of varying sizes to each router along path
Look at RTT as a function of packet size
From slope deduce “bandwidth”
Differentiate to find capacity at each hop
However pipechar has uncertain support
Packet size variation limited to 1-MTU (~1500) Bytes, so on fast links
timing is difficult, with the result that estimates may not be reliable
(OK for slow links)
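The slope-based estimate above can be illustrated numerically. This is a minimal sketch of the pathchar-style idea, not pipechar's actual implementation: it assumes the minimum RTT to hop n grows linearly with probe size, with slope equal to the sum of 1/bandwidth over links 1..n, so differencing the slopes of successive hops isolates each link. Function names and the synthetic data are hypothetical.

```python
def fit_slope(sizes, rtts):
    """Least-squares slope of RTT (seconds) against packet size (bytes)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(rtts) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, rtts))
    return sxy / sxx


def per_hop_bandwidth(sizes, min_rtts_by_hop):
    """min_rtts_by_hop[k][j] is the minimum RTT (s) to hop k+1 at sizes[j].
    Returns an estimate in bits/s per link (None if the slope difference
    is not positive, e.g. because of timing noise)."""
    estimates = []
    prev_slope = 0.0
    for rtts in min_rtts_by_hop:
        slope = fit_slope(sizes, rtts)
        delta = slope - prev_slope      # seconds per byte added by this link
        estimates.append(8.0 / delta if delta > 0 else None)
        prev_slope = slope
    return estimates


# Synthetic path: a 10 Mbits/s link followed by a 1 Mbits/s link.
sizes = [100, 500, 1000, 1500]
hop1 = [0.001 + s * 8 / 10e6 for s in sizes]
hop2 = [0.005 + s * 8 / 10e6 + s * 8 / 1e6 for s in sizes]
print(per_hop_bandwidth(sizes, [hop1, hop2]))
```

On a fast link the extra serialization time of a 1500-byte packet is only microseconds, which is why the slope differences become unreliable there.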
Divide & Conquer
Abilene has hosts at major PoPs running bwctl
So make measurements from end to middle to identify where performance is lost
http://e2epi.internet2.edu/pipes/ami/bwctl/
Correlate with routes (traceanal)
Visualizing traceroutes
www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html, => traceroutes
One compact page per day
One row per host, one column per hour
One character per traceroute to indicate pathology or change (usually
period(.) = no change)
Identify unique routes with a number
Be able to inspect the route associated with a route number
Provide for analysis of long term route evolutions
[Screenshot of the traceroute summary table, annotated: route # at start of day gives an idea of route stability; multiple route changes (due to GEANT), later restored to original route; period (.) means no change]
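The compact encoding described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the traceanal code: each unique route (a sequence of hop addresses) gets the next free number, the row shows that number when the route differs from the previous traceroute (and at the start of the row), and a period otherwise. Pathology characters are omitted, and the function name and sample routes are hypothetical.

```python
def summarize_routes(routes, route_ids=None):
    """routes: one route per traceroute, in time order, each a sequence of
    hop addresses. Returns (row_string, route_ids)."""
    route_ids = {} if route_ids is None else route_ids
    cells = []
    previous = None
    for route in routes:
        key = tuple(route)
        if key not in route_ids:
            route_ids[key] = len(route_ids)   # next unused route number
        cells.append("." if key == previous else str(route_ids[key]))
        previous = key
    return " ".join(cells), route_ids


route_a = ("134.79.19.2", "192.68.191.66", "134.55.208.30")
route_b = ("134.79.19.2", "192.68.191.66", "134.55.205.22")
row, ids = summarize_routes([route_a, route_a, route_b, route_b, route_a])
print(row)
```

Keeping the route_ids dictionary between days lets the same number identify the same route over long periods, which is what makes long-term route evolution visible.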
Changes in network topology (BGP) can result
in dramatic changes in performance
[Figure: snapshot of traceroute summary table (remote host by hour), with samples of traceroute trees generated from the table]
[Figure: Mbits/s time series from ABwE measurements, one/minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am, showing dynamic BW capacity (DBC), cross-traffic (XT), and available BW = (DBC-XT). Annotations: drop in performance (from original path SLAC-CENIC-Caltech to SLAC-Esnet-LosNettos (100Mbps)-Caltech); Esnet-LosNettos segment in the path (100 Mbits/s); back to original path; changes detected by IEPM-Iperf and ABwE]
Notes:
1. Caltech misrouted via Los-Nettos 100Mbps commercial net 14:00-17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went unnoticed for 2 months
4. Next step is to auto-detect and notify
Moving towards application
Try user application (mem to mem & disk to disk)
GridFTP, bbcp, bbftp …
Iperf or thrulay (also provides RTT) to test TCP or UDP
throughput (injects traffic, needs a server at the remote end)
dast.nlanr.net/Projects/Iperf/
www.internet2.edu/~shalunov/thrulay/
Available bandwidth:
Pathload: www-static.cc.gatech.edu/fac/Constantinos.Dovrolis/pathload.html
Pathchirp: www.spin.rice.edu/Software/pathChirp/
bing …
[Figure: packet spacing through a bottleneck: min spacing at bottleneck; spacing preserved on higher speed links]
NDT
What are the interface speeds?
What is the bottleneck?
Is there a duplex mismatch?
Are buffers set right (both ends)?
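Whether buffers are "set right" can be checked against the bandwidth-delay product: a TCP connection needs about bandwidth × RTT bytes of buffer at both ends to fill the path, and with a smaller window throughput is capped at window / RTT. A minimal sketch; the function names and the example figures (100 Mbits/s path, 300 ms RTT, 64 KB window) are illustrative assumptions.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_s):
    """Bandwidth-delay product: buffer needed to keep the path full."""
    return bandwidth_bits_per_s * rtt_s / 8.0


def window_limited_throughput(window_bytes, rtt_s):
    """Upper bound on TCP throughput (bits/s) for a given window and RTT."""
    return window_bytes * 8.0 / rtt_s


rtt = 0.300                      # seconds, e.g. a long intercontinental path
print(f"buffer needed for 100 Mbits/s: {bdp_bytes(100e6, rtt) / 1e6:.2f} MB")
print(f"throughput with a 64 KB window: "
      f"{window_limited_throughput(65536, rtt) / 1e6:.2f} Mbits/s")
```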
NDT example (Rich Carlson)
http://e2epi.internet2.edu/ndt/
Other tools
Ntop
Summarizes libpcap (sniffer) information
Internet2 Detective:
Tests connectivity to I2, bandwidth, multicast, IPv6
Can run as Java applet
http://detective.internet2.edu/
NLANR Internet Advisor
Ethereal, tcpdump, snoop for masochists
Passive tools:
Netflow for characterizing network, spotting abnormalities, e.g.
www.itec.oar.net/abilene-netflow
www.slac.stanford.edu/comp/net/slac-netflow/html/SLACnetflow.html
SNMP based tools
And then …
Wireless
Avoid peer-to-peer/ad-hoc connections
Disable connecting to ad-hoc (set infrastructure only)
Disable bridging
How to do it varies by OS (XP, OSX, Linux)
Ad hoc can still interfere if on same channel
Tools to locate an access point (e.g. Yellow-Jacket)
Vendors have management tools to enable APs to detect rogue APs
NAT boxes may block or not support application
Private addresses:
10.0.0.0 - 10.255.255.255 (a single class A net)
172.16.0.0 - 172.31.255.255 (16 contiguous class Bs)
192.168.0.0 - 192.168.255.255 (256 contiguous class Cs)
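Checking whether an address falls in one of the private ranges above is easy to script, for example to spot a NAT box in a traceroute. A minimal sketch using Python's standard ipaddress module; the function name is hypothetical, and the three prefixes are the CIDR forms of the ranges listed.

```python
import ipaddress

# CIDR equivalents of the three private ranges listed above.
PRIVATE_NETS = [ipaddress.ip_network(n)
                for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_private_address(addr):
    """True if addr lies in one of the three private IPv4 ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PRIVATE_NETS)


for addr in ("10.1.2.3", "172.31.255.255", "172.32.0.1", "134.79.18.163"):
    print(addr, is_private_address(addr))
```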
“Where is” a host?
Beware: some of the following information is ephemeral; in general use
heuristics with Google
Google “Internet country codes” for TLDs
The host may not be in the TLD's country; developing regions in particular often use proxies
elsewhere
Location may be encoded in router name
ipls=Indianapolis, snv=Sunnyvale …
Name server lookup to find hostname given IP address
47cottrell@netflow:~>nslookup 210.56.16.10
Server: localhost
Address: 127.0.0.1
Name: lhr.comsats.net.pk
Address: 210.56.16.10
Use a whois server, e.g.
www.networksolutions.com/cgi-bin/whois/whois (Americas & Africa)
www.ripe.net/cgi-bin/whois (Europe)
www.apnic.net/ (Asia)
May identify site name, address, contact, etc.; not all domains are in the
databases (e.g. will not find comsats.net.pk)
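The name server lookup above (IP address to hostname) can be done from a script when many addresses, such as all the hops of a traceroute, need naming. A minimal sketch using the standard library; the function names are hypothetical, and reverse_lookup depends on the local resolver and on the address having a PTR record.

```python
import ipaddress
import socket


def ptr_name(addr):
    """DNS name that a reverse lookup queries for this address."""
    return ipaddress.ip_address(addr).reverse_pointer


def reverse_lookup(addr):
    """Hostname for addr, or None if no name is registered or DNS fails."""
    try:
        return socket.gethostbyaddr(addr)[0]
    except OSError:
        return None


print(ptr_name("210.56.16.10"))
```

Calling reverse_lookup("210.56.16.10") performs the same query as the nslookup session shown; the answer depends on the current state of the DNS.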
“Where is” a host – cont.
Find the Autonomous System (AS) administering the host
Form giving AS for domain name
http://www.fixedorbit.com/search.htm
Gives AS number, name, adjacent ASes, web page for the AS
Given an AS find out more about it:
Use http://bgp.potaroo.net/cidr/, go to the bottom and
enter the AS into the form:
Gives ISP name, web page, phone number, email, hours etc.
Review list of AS's ordered by Upstream AS Adjacency
www.telstra.net/ops/bgp/bgp-as-upsstm.txt
Tells what AS is upstream of an ISP
“Where is” a host - cont.
May be able to get latitude & longitude:
http://www.hostip.info/index.html
http://www.ip2location.com/
But it is a subscriber service ($$$, but …), however it is
probably best for developing regions
Google:
www.geoiptool.com/
Triangulate pings from landmarks (in development)
http://www.slac.stanford.edu/comp/net/wan-mon/tulip/
Need more landmarks, send email
[email protected]
http://www.cs.cornell.edu/~bwong/octant/ # for US only
Who you gonna tell?
Local network support people
Internet Service Provider (ISP), usually contacted by the local networker
They will usually know the immediate one, e.g. [email protected]
Use puck.nether.net/netops/nocs.cgi to find ISP
Use www.telstra.net/ops/bgp/bgp-as-upsstm.txt to find upstream ISPs
Well managed sites and ISPs maintain a list of email addresses
such as abuse@ or postmaster@, that one can send email to,
for example to complain about spam etc.
This follows an Internet recommendation (RFC 2142).
Some less helpful sites do not provide such services, for more on these,
see RFC-ignorant.org
What ya gonna tell ‘em?
Describe problem with details
What is affected?
Application, host OS (uname -a), NIC (ifconfig, route)
How is it affected?
Unresponsiveness, unable to contact remote host
Slow performance (see Brian’s talk), packet loss
When did it start?
Send ping output between hosts
Send traceroute forward & reverse – if possible
Maybe use -I (ICMP option)
NDT
Identify when it started
If complex think about creating web page with details
Top, vmstat, pingroute, pipechar, application output (GridFTP, iperf)…
Web page examples: Case studies
http://www.slac.stanford.edu/grp/scs/net/case/html/
http://e2epi.internet2.edu/case-studies/
More Information
Tutorial on monitoring
www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
RFC 2151 on Internet tools
www.freesoft.org/CIE/RFC/Orig/rfc2151.txt
Network monitoring tools
www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
www.caida.org/tools/taxonomy/
Network Performance Tools: an I2 Cookbook
e2epi.internet2.edu/network-perf-wk/tools-cookbook.pdf
Network Monitoring sites
www.slac.stanford.edu/comp/net/wan-mon/netmon.html
How to Accelerate Your Internet, ISBN: 0-9778093-1-5, Ed. Flickenger R.
Local Host - LISA
Localhost Information Service Agent (LISA) is a Java Web
Start application which provides:
Integration with MonALISA
Complete Monitoring of the System (Load, CPU, Memory, Disk,
Disk IO, Paging, Processes, Network Traffic and Connectivity...).
History and instantaneous
Filters to trigger actions when predefined conditions are detected.
A user-friendly GUI to present the monitoring information.
Optimization modules for distributed applications.
It is a lightweight application that can be easily deployed on any
system.
Modules for end-to-end network measurements (e.g. Iperf).
See monalisa.caltech.edu/dev_lisa.html