SPACE Weather School:
Basic theory & hands-on experience
Network Problem Diagnosis for Non-networkers
Les Cottrell – SLAC
University of Helwan / Egypt, Sept 18 – Oct 3, 2010
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end
Performance Monitoring (IEPM), also supported by IUPAP
http://www.slac.stanford.edu/grp/scs/net/talk10/diagnosis.pptx
Overview
Goal: provide a practical guide to debugging common problems
Why is diagnosis difficult yet important?
Local host
Ping, Traceroute, PingRoute
Looking at time series
Locating bottlenecks
Correlation of problems with routes
More tools and problems
Where is a node
Who do you tell, what do you say?
Case studies and More Information
Les Cottrell, SLAC
Slide: 2
Why is diagnosis difficult?
• The Internet evolved as a composition of independently developed and deployed protocols, technologies, and core applications
• Diversity, highly unpredictable, hard to find “invariants”
• Rapid evolution & change, no equilibrium so far
• Findings may be out of date
• Measurement/diagnosis not high on vendors' list of priorities
• Resources/skill focus on more interesting and profitable issues
• Tools lacking or inadequate
• Implementations are flaky & not fully tested with new releases
Add to that …
• Distributed systems are very hard
• “A distributed system is one in which I can't get my work done because a computer I've never heard of has failed.” (attributed to Leslie Lamport)
• Network is deliberately transparent
• The bottlenecks can be in any of the following components:
  – the applications
  – the OS
  – the disks, NICs, bus, memory, etc. on sender or receiver
  – the network switches and routers, and so on
• Problems may not be logical
• Most problems are operator errors, configurations, bugs
• When building distributed systems, we often observe unexpectedly low performance, the reasons for which are usually not obvious
• Just when you think you've cracked it, in steps security
• Firewall, NAT boxes etc.
• Block pings, traceroute looks like port scan, diagnostic tool ports are blocked …
• ISPs worried about providing access to core, making results public, & privacy issues
Sources of problems
Host “errors”
• TCP buffers, heavy utilization …
Ethernet duplex and speed mismatch between your host and the network device
Misconfigured router/switches
• Including routing errors, especially for backup paths
Bad equipment, wiring/fiber problem
Congestion
First steps
Open a Command Prompt and find out about your network connection
• ipconfig /? (shows the help)
• ipconfig
By default it gives the IP address, gateway (1st router), and subnet mask of all your network devices (Ethernet, wireless, Bluetooth…)
Make a note of the gateway
Icon at bottom right of screen
• Allows asking of questions and tries to provide assistance
Go to the Command Prompt and type
• ping /?
Ping on Windows
Callouts: -n specifies the number of pings, -l the size of the packet, followed by the target; the output shows the IP address of the target and the RTT of each reply
C:\Users\cottrell>ping -n 4 -l 32 mail.alex.edu.ca
Pinging mail.alex.edu.ca [67.215.65.132] with 32 bytes of data:
Reply from 67.215.65.132: bytes=32 time=80ms TTL=45
Reply from 67.215.65.132: bytes=32 time=85ms TTL=45
Reply from 67.215.65.132: bytes=32 time=83ms TTL=45
Reply from 67.215.65.132: bytes=32 time=90ms TTL=43
Ping statistics for 67.215.65.132:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 80ms, Maximum = 90ms, Average = 84ms
Try: ping -t. What use is ping -f?
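When many ping runs have to be compared, it can help to pull the numbers out of the text automatically. The following is a minimal sketch, assuming the English-language Windows output format shown in the ping example; the function name parse_windows_ping and the SAMPLE text are illustrative, not part of any tool mentioned in the talk.

```python
import re

def parse_windows_ping(output: str) -> dict:
    """Extract loss and RTT figures from English-language Windows ping output."""
    # Per-reply RTTs appear as "time=80ms" (or "time<1ms" for very short RTTs).
    rtts = [int(m) for m in re.findall(r"time[=<](\d+)ms", output)]
    counts = re.search(r"Sent = (\d+), Received = (\d+), Lost = (\d+)", output)
    if counts:
        sent, received, lost = (int(g) for g in counts.groups())
    else:
        sent = received = lost = 0
    return {
        "sent": sent,
        "received": received,
        "loss_pct": 100.0 * lost / sent if sent else None,
        "min_ms": min(rtts) if rtts else None,
        "max_ms": max(rtts) if rtts else None,
        "avg_ms": sum(rtts) / len(rtts) if rtts else None,
    }

# Sample text copied from the ping example in these slides.
SAMPLE = """Pinging mail.alex.edu.ca [67.215.65.132] with 32 bytes of data:
Reply from 67.215.65.132: bytes=32 time=80ms TTL=45
Reply from 67.215.65.132: bytes=32 time=85ms TTL=45
Reply from 67.215.65.132: bytes=32 time=83ms TTL=45
Reply from 67.215.65.132: bytes=32 time=90ms TTL=43
Ping statistics for 67.215.65.132:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
"""

print(parse_windows_ping(SAMPLE))
```

The average computed here is 84.5 ms; Windows prints it as a whole number of milliseconds (84ms in the example).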
Anomalies: pings blocked
C:\Users\cottrell>ping www.lbl.gov
Pinging www.lbl.gov [128.3.41.105] with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Ping statistics for 128.3.41.105:
Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),
Enable Telnet by following these steps:
Start => Control Panel => Programs and Features => Turn Windows features on or off => check Telnet Client => hit OK
Now try:
16cottrell@pinger:~>telnet www.lbl.gov 80
A blank screen means the web server is waiting to talk to you
Hit Ctrl+] and type quit
Compare with another port (non-existent application)
C:\Users\cottrell>telnet www.lbl.gov 1010
Connecting To www.lbl.gov...Could not open connection to the host, on port 1010: Connect failed
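The telnet test above simply checks whether a TCP connection to a given port can be opened. Where no telnet client is available, the same check can be sketched in a few lines of Python using only the standard library. The function name tcp_port_open and the 3 second timeout are assumptions made for this illustration.

```python
import socket

def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False

# Example use (needs network access), mirroring the two telnet tests:
#   tcp_port_open("www.lbl.gov", 80)    # web server port
#   tcp_port_open("www.lbl.gov", 1010)  # a port with no application behind it
```

A True result for port 80 while ping times out suggests that the pings are being blocked rather than that the host is down.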
Diversion on ports
Applications such as telnet (23), ssh (22), www (80, 443) and DNS (53) are assigned a “port” on the host
Sometimes written as, for example, www.slac.stanford.edu:80
See http://www.iana.org/assignments/port-numbers for what applications use which ports
Try:
1. ping localhost
2. ping mail.alex.edu.eg
3. ping sohag-univ.edu.eg
4. ping www.minia.edu.eg
5. ping www.alex.edu.eg
3rd party ping (via Looking Glass)
• Find servers:
• http://www.cogentco.com/us/network_lookingglass.php
• http://www.ip.tiscali.net/lg/
• http://stat.qwest.net/cgi-bin/jlg-new-asia.pl
• http://www.slac.stanford.edu/comp/net/wanmon/viper/tulip_map.htm
RTT from California to world
[Figure: plots of frequency vs. RTT (ms) and of RTT (ms) vs. longitude (degrees); labels include E. Coast, Europe & S. America, Europe, Brazil, 300ms and 0.3*0.6c. Data from the CAIDA Skitter project.]
Geostationary Satellite links
Each bar represents min RTT for 1 country
Geostationary satellites orbit about 22k miles (36,000 km) up, so the RTT is at least roughly 480 ms
Note cut off between satellite and terrestrial
[Figure: bar chart of min RTT (ms, axis 0 to 500) by country, with the satellite and terrestrial groups labeled.]
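The gap between satellite and terrestrial minimum RTTs follows from propagation delay alone. The sketch below works the arithmetic under stated assumptions: a geostationary altitude of 35,786 km with the ground stations directly beneath the satellite, and for comparison a hypothetical 10,000 km terrestrial path with signals travelling at 0.6c (reading the "0.6c" label on the RTT plot as the signal speed in fibre). Real paths add routing detours and queuing, so measured RTTs are larger.

```python
C_KM_PER_S = 299_792.458   # speed of light in vacuum
GEO_ALTITUDE_KM = 35_786   # geostationary altitude above the equator

def min_rtt_ms(one_way_km: float, velocity_factor: float = 1.0) -> float:
    """Lower bound on the RTT in ms for a one-way path length in km."""
    return 2 * one_way_km / (velocity_factor * C_KM_PER_S) * 1000

# One way via the satellite is up plus down, i.e. twice the altitude.
satellite_rtt = min_rtt_ms(2 * GEO_ALTITUDE_KM)

# Assumed terrestrial comparison: 10,000 km of fibre at 0.6c.
fibre_rtt = min_rtt_ms(10_000, velocity_factor=0.6)

print(f"satellite: {satellite_rtt:.0f} ms, fibre: {fibre_rtt:.0f} ms")
# prints "satellite: 477 ms, fibre: 111 ms"
```

A minimum RTT of several hundred milliseconds to a nearby country is therefore a strong hint that a geostationary satellite link is in the path.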
Traceroute: rough algorithm
ttl=1; #To 1st router
port=33434; #Starting UDP port
max=30; #Default maximum number of hops
while ttl <= max {
  send UDP packet to host:port with ttl
  wait for response
  if ICMP time exceeded: note router address and round trip time
  else if ICMP port unreachable: note round trip time, target reached
  else (no response before timeout): note *
  print output
  if target reached: stop
  ttl++; port++
}
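A real traceroute needs raw-socket privileges to read the ICMP replies, so the sketch below only simulates the algorithm against a modelled path. It is an illustration of the TTL-stepping logic, not a working traceroute; the function name simulate_traceroute and the (address, responds) path model are assumptions made for this example. The path itself is taken from the tracert output to mail.alex.edu.eg in these slides.

```python
def simulate_traceroute(path, max_hops=30, start_port=33434):
    """Step the TTL along a modelled path the way traceroute does.

    path: non-empty list of (address, responds) tuples, target last.
    Returns a list of (ttl, udp_port, address or "*") rows.
    """
    rows = []
    if not path:
        return rows
    for ttl in range(1, max_hops + 1):
        port = start_port + ttl - 1           # the port increases with every probe
        index = min(ttl, len(path)) - 1       # node where the probe ends up
        address, responds = path[index]
        reached_target = index == len(path) - 1
        # Routers answer "time exceeded", the target answers "port unreachable";
        # a silent node shows up as "*".
        rows.append((ttl, port, address if responds else "*"))
        if reached_target and responds:
            break
    return rows

# Path modelled on the tracert example: hop 9 does not respond.
PATH = [
    ("10.13.11.1", True), ("10.100.100.53", True), ("10.0.0.3", True),
    ("81.21.100.177", True), ("10.181.28.33", True), ("172.18.28.117", True),
    ("172.20.1.162", True), ("172.19.8.106", True), ("10.191.8.30", False),
    ("193.227.16.29", True),
]

for row in simulate_traceroute(PATH):
    print(row)
```

Note that a silent router (hop 9 here) does not stop the trace; only a reply from the target or reaching the maximum hop count does.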
Traceroute (tracert on Windows)
C:\Users\cottrell>tracert (with no arguments, gets help)
Callouts: -h sets the max hops; the output shows the target IP address and, for each hop, 3 RTTs and the router IP address; * means no response
C:\Users\cottrell>tracert -h 30 mail.alex.edu.eg
Tracing route to mail.alex.edu.eg [193.227.16.29] over a maximum of 30 hops
1 1 ms 1 ms 1 ms 10.13.11.1
2 1 ms <1 ms 1 ms 10.100.100.53
3 1 ms <1 ms <1 ms 10.0.0.3
4 1 ms 1 ms 1 ms 81.21.100.177
5 53 ms 12 ms 1 ms 10.181.28.33
6 2 ms 24 ms 2 ms 172.18.28.117
7 5 ms 6 ms 6 ms 172.20.1.162
8 6 ms 6 ms 8 ms 172.19.8.106
9 * * *
10 6 ms 6 ms 6 ms mail.alex.edu.eg [193.227.16.29]
Try tracert www.lbl.gov
Why do the first hops take so long to reply?
Try tracert -d www.lbl.gov
Private address space
N.B. the first few addresses are 10.x.y.z
Typically these are private (not known to the global Internet) IP addresses that can be re-used at multiple sites
See http://en.wikipedia.org/wiki/Private_network
• Ranges: 10.0.0.0 – 10.255.255.255 (16M addresses, 24 bits)
• 172.16.0.0 – 172.31.255.255 (1M addresses, 20 bits)
• 192.168.0.0 – 192.168.255.255 (65K addresses, 16 bits)
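Checking by eye whether a traceroute hop falls in one of the private ranges is error-prone, especially for the 172.16 to 172.31 block. As a small sketch, Python's standard ipaddress module can do the check; the helper name private_block is an assumption for this example, and the test addresses are hops from the tracert example.

```python
import ipaddress

# The three private ranges listed above, in prefix notation.
PRIVATE_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def private_block(address: str):
    """Return the private block containing address, or None if it is public."""
    ip = ipaddress.ip_address(address)
    for block in PRIVATE_BLOCKS:
        if ip in block:
            return block
    return None

for block in PRIVATE_BLOCKS:
    # num_addresses confirms the 16M / 1M / 65K sizes quoted above.
    print(block, block.num_addresses)

for hop in ("10.13.11.1", "81.21.100.177", "172.18.28.117", "193.227.16.29"):
    print(hop, private_block(hop))
```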
Traceroute from elsewhere
• Traceroute to remote host
• Is the route direct, or does it go over congested commercial nets?
• Reverse traceroute from remote host to you or a 3rd party
• www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html
• www.tracert.com/
• visualroute.visualware.com/ # requires Java
[Figure: Visualroute servers in Europe]
Traceroute server results
• Example: www.slac.stanford.edu/cgi-bin/nph-traceroute.pl
[Figure: screenshot of the traceroute server page, with callouts: related info, security warning, traceroute, your IP name, your IP address, enter IP address or name]
Warning
Some Linux versions have a bug that incorrectly reports checksum errors on MPLS links. Make the packet length >= 140, else you get checksum errors (not a problem, just annoying), e.g. on Linux:
• traceroute www.slac.stanford.edu 140
Pingroute example
May help tell where losses start
Will need many pings if losses are small
[Figure: pingroute output with callouts: “Start of losses?”, “But?”, “Start of sustained losses”, “Routers may not respond”]
Matt's Traceroute (mtr)
Run traceroute, then ping each router n times
• Helps identify where in the route the problems start to occur
Routers may not respond to pings, or may treat pings directed at them differently from other packets
Get Matt's TraceRoute (MTR) from www.bitwizard.nl/mtr/ or use pathping (built into Windows but inferior)
• Slower
• Less info
Pathping en.wikipedia.org/wiki/PathPing
Tracing route to mail.alex.edu.eg [193.227.16.29] over max 30 hops:
 0 CDIV-PC83982.win.slac.stanford.edu [10.13.250.215]
 1 10.13.11.1
 2 10.100.100.53
 3 10.0.0.3
 4 81.21.100.177
 5 10.181.28.33
 6 172.18.28.117
 7 172.20.1.162
 8 172.19.8.106
 9 10.191.8.30
10 mail.alex.edu.eg [193.227.16.29]
Computing statistics for 250 seconds...
            Source to Here   This Node/Link
Hop  RTT    Lost/Sent = Pct  Lost/Sent = Pct  Address
  0                                           CDIV-PC83982.win.slac.stanford.edu [10.13.250.215]
                                0/ 100 =  0%   |
  1    1ms     0/ 100 =  0%     0/ 100 =  0%  10.13.11.1
                                0/ 100 =  0%   |
  2    1ms     0/ 100 =  0%     0/ 100 =  0%  10.100.100.53
                                0/ 100 =  0%   |
  3    0ms     0/ 100 =  0%     0/ 100 =  0%  10.0.0.3
                                0/ 100 =  0%   |
  4    2ms     0/ 100 =  0%     0/ 100 =  0%  81.21.100.177
                               13/ 100 = 13%   |
  5  ---     100/ 100 =100%    87/ 100 = 87%  10.181.28.33
                                0/ 100 =  0%   |
  6  ---     100/ 100 =100%    87/ 100 = 87%  172.18.28.117
                                0/ 100 =  0%   |
  7  ---     100/ 100 =100%    87/ 100 = 87%  172.20.1.162
                                0/ 100 =  0%   |
  8  ---     100/ 100 =100%    87/ 100 = 87%  172.19.8.106
                                0/ 100 =  0%   |
  9  ---     100/ 100 =100%    87/ 100 = 87%  10.191.8.30
                                0/ 100 =  0%   |
 10   10ms    13/ 100 = 13%     0/ 100 =  0%  mail.alex.edu.eg [193.227.16.29]
Trace complete.
Callouts: default probes/hop = 100; rows marked | are links, the other rows are routers; no RTT variance is provided; for help, try pathping
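Because routers may ignore or rate-limit pings aimed at them, loss seen at one intermediate hop only matters if it persists at every later hop, including the target. The following is a minimal sketch of that rule, assuming a list of per-hop "source to here" loss percentages such as the ones pathping reports; the function name first_sustained_loss is illustrative.

```python
def first_sustained_loss(loss_pct, threshold=0.0):
    """Return the 1-based hop where loss begins and persists to the last hop.

    loss_pct: loss percentage per hop, ordered from the first hop to the target.
    Returns None if the target itself shows no loss above threshold.
    """
    start = None
    for hop, loss in enumerate(loss_pct, start=1):
        if loss > threshold:
            if start is None:
                start = hop      # candidate start of the sustained losses
        else:
            start = None         # a clean later hop clears earlier suspicion
    return start

# "Source to Here" loss per hop from the pathping example (hops 1 to 10).
PATHPING_LOSS = [0, 0, 0, 0, 100, 100, 100, 100, 100, 13]
print(first_sustained_loss(PATHPING_LOSS))   # hop 5

# A router that drops pings aimed at itself, while the target is clean.
print(first_sustained_loss([0, 40, 0, 0]))   # None
```

In the pathping example the 100% figures at hops 5 to 9 come from routers that do not answer, but the 13% loss at the target is consistent with the 13% loss reported on the link into hop 5.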
Look at time series
Look at history plots (PingER, ISPs, own border router etc.): when did the problem start, how big an effect is it?
• Assumes you know the “proximity” of paths for which there are archived active measurements to the path that you are interested in
• Also that relevant measurements exist
www-iepm.slac.stanford.edu/pinger/
• Collaboration between Internet2/ESnet/Geant to provide access to router measurements holds promise
Example time series
• Look for change in measured value
• Note time
• Correlate
[Figure: time series plot, annotated “Italy disconnected”]
Moving towards application
• Is the server application listening?
telnet www.slac.stanford.edu 80
Trying 134.79.18.188...
Connected to www.slac.stanford.edu.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
• Try user application (mem to mem & disk to disk)
• GridFTP, bbcp, bbftp …
• Iperf or thrulay (also provides RTT) to test TCP or UDP throughput
• dast.nlanr.net/Projects/Iperf/, www.internet2.edu/~shalunov/thrulay/
• NDT (http://www.internet2.edu/performance/ndt/)
• What are the interface speeds? What is the bottleneck?
• Is there a duplex mismatch? Are buffers set right (both ends)?
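One common way to judge whether TCP buffers are "set right" is to compare them with the bandwidth-delay product of the path, since a single TCP stream cannot move more than one window of data per round trip. The sketch below works that arithmetic; the function names and the example figures (a 100 Mbit/s bottleneck, the 84 ms average RTT from the Windows ping example, and a 64 KiB buffer) are assumptions chosen for illustration.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_bits_per_s * (rtt_ms / 1000.0) / 8

def window_limited_bits_per_s(window_bytes: float, rtt_ms: float) -> float:
    """Upper bound on single-stream TCP throughput for a given window."""
    return window_bytes * 8 / (rtt_ms / 1000.0)

# Assumed example: 100 Mbit/s bottleneck, 84 ms RTT.
needed = bdp_bytes(100e6, 84)
print(f"buffer needed to fill the path: {needed / 1e6:.2f} MB")

# With only a 64 KiB buffer the same path is limited to a few Mbit/s.
limit = window_limited_bits_per_s(64 * 1024, 84)
print(f"throughput with a 64 KiB window: {limit / 1e6:.1f} Mbit/s")
```

If measured throughput sits close to the window-limited figure rather than the interface speed, the buffers on one or both ends are a likely bottleneck.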
NDT example
Try: http://netspeed.stanford.edu/
And then …
• Wireless
• Avoid peer-to-peer/ad-hoc connections
• Disable connecting to ad-hoc (set infrastructure only)
• Disable bridging
• How to do it varies by OS (XP, OSX, Linux)
• Ad hoc can still interfere if on same channel
• Tools to locate an access point (e.g. Yellow-Jacket)
• See www2.slac.stanford.edu/comp/net/wireless/Wireless-MeetingHandout.mht
• NAT boxes may block or not support application
• Private addresses:
• 10.0.0.0 – 10.255.255.255 a single class A net
• 172.16.0.0 – 172.31.255.255 16 contiguous class Bs
• 192.168.0.0 – 192.168.255.255 256 contiguous class Cs
Strategy: divide & conquer
Ping to localhost, ping to gateway & to remote host
• Use IP address to avoid nameserver problems
• Look for connectivity, loss & RTT
• May need to run for a long time to see some pathologies (e.g. bursty loss due to DSL loss of sync)
• Use telnet host port to see if ping is blocked
Traceroute to remote host
Reverse traceroute from remote host to you
Ping routers along route (mtr helps)
Look at history plots (PingER): when did the problem start, how big an effect is it?
Look at own connectivity with NDT (netspeed.stanford.edu)
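The first steps of this strategy amount to a small decision procedure: test the nearest component first and move outward until something fails. The sketch below encodes that ordering as a pure function over test results that have already been collected by hand. The function name localize_fault and the wording of the conclusions are assumptions for this illustration, and the conclusions are hints for where to look next, not a definitive diagnosis (a gateway may, for instance, simply not answer pings).

```python
def localize_fault(localhost_ok, gateway_ok, remote_ping_ok, remote_port_ok=None):
    """Suggest where to look next, given the results of the basic tests.

    localhost_ok:   ping localhost succeeded
    gateway_ok:     ping to the gateway (1st router) succeeded
    remote_ping_ok: ping to the remote host succeeded
    remote_port_ok: telnet host port connected (None if not tried)
    """
    if not localhost_ok:
        return "local host: check the TCP/IP configuration"
    if not gateway_ok:
        return "local network: check link, wiring or wireless, and the gateway"
    if remote_ping_ok:
        return "path is up: look at loss, RTT and the application"
    if remote_port_ok:
        return "host reachable but ping blocked: use telnet host port instead"
    return "beyond the gateway: traceroute to see where the path stops"

print(localize_fault(True, True, False, remote_port_ok=True))
```

The example call corresponds to the www.lbl.gov case earlier in the slides, where ping timed out but a connection to port 80 succeeded.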
“Where is” a host?
• Beware: some of the following information is ephemeral; in general use heuristics with Google
• Google “Internet country codes” for TLDs
• Host may not be in the TLD country; developing regions especially often use proxies elsewhere
• Location may be encoded in router name
• ipls=Indianapolis, snv=Sunnyvale …
• Name server lookup (nslookup & dig) to find hostname given IP address
47cottrell@netflow:~>nslookup 210.56.16.10
Server: localhost
Address: 127.0.0.1
Name: lhr.comsats.net.pk
Address: 210.56.16.10
• Use a whois server (download www.gena01.com/win32whois/)
• www.networksolutions.com/cgi-bin/whois/whois (Americas & Africa)
• www.ripe.net/cgi-bin/whois (Europe)
• www.apnic.net/ (Asia)
• May identify site name, address, contact, etc.; not all domains are in the databases (e.g. will not find comsats.net.pk)
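Finding a hostname from an IP address is a reverse (PTR) lookup: the address is reversed and looked up under in-addr.arpa. The sketch below shows both the name that gets queried and a lookup through the system resolver, using only Python's standard library. The function names ptr_name and reverse_lookup are assumptions for this example, and the lookup needs a working name server, so its result is not shown.

```python
import ipaddress
import socket

def ptr_name(address: str) -> str:
    """Return the in-addr.arpa name that a reverse lookup queries."""
    return ipaddress.ip_address(address).reverse_pointer

def reverse_lookup(address: str):
    """Return the hostname for address via the system resolver, or None."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(address)
        return hostname
    except OSError:
        # No PTR record, or no name server reachable.
        return None

print(ptr_name("210.56.16.10"))   # 10.16.56.210.in-addr.arpa

# Example use (needs a working name server):
#   reverse_lookup("210.56.16.10")
```

Many addresses have no PTR record at all, so a None result says nothing about whether the host exists.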
“Where is” a host – cont.
• Find the Autonomous System (AS) administering the address
• Form giving AS for domain name: http://www.fixedorbit.com/search.htm
Gives AS number, name, adjacent ASes, and web page for the AS
• Given an AS, find out more about it:
Use http://bgp.potaroo.net/cidr/, go to the bottom and enter the AS into the form:
– Gives ISP name, web page, phone number, email, hours etc.
• Review list of ASes ordered by Upstream AS Adjacency: www.telstra.net/ops/bgp/bgp-as-upsstm.txt
Tells what AS is upstream of an ISP
“Where is” a host – cont.
Visit the site's www server; the location is often on the home page
May be able to get lat & long from a database:
• www.geoiptool.com/ or via: geotool.flagfox.net/
• http://www.hostip.info/index.html
• Networldmap determines geographical information by acquiring location information from willing participants.
• http://www.ip2location.com/
It is a subscriber service ($$$, but …); however, it is probably best for developing regions
• Quova has a large database (2.4 billion addresses) of IP addresses to locations that organizations can access, but they must subscribe ($$$).
Triangulate pings from landmarks:
• www.slac.stanford.edu/grp/scs/net/talk10/geolocation.pptx
Who you gonna tell?
• Local network support people
• Internet Service Provider (ISP), usually done by the local networker
• Usually will know the immediate one, e.g. [email protected]
• Use puck.nether.net/netops/nocs.cgi to find ISP
• Use www.telstra.net/ops/bgp/bgp-as-upsstm.txt to find upstream ISPs
• Well managed sites and ISPs maintain a list of email addresses such as abuse@ or postmaster@ that one can send email to, for example to complain about spam etc.
• This follows an Internet recommendation (RFC 2142).
• Some less helpful sites do not provide such services; for more on these, see RFC-ignorant.org
What ya gonna tell 'em?
• Describe problem with details
• What is affected?
• Application, host OS (uname -a), NIC (ifconfig, route)
• How is it affected?
• Non-responsiveness, unable to contact remote host
• Slow performance (see Brian's talk), packet loss
• When did it start?
• Send ping output between hosts
• Send traceroute forward & reverse, if possible
• Maybe use -I (ICMP option)
• NDT
• Identify when it started
• If complex, think about creating a web page with details
• top, vmstat, pingroute, pipechar, application output (GridFTP, iperf)…
More Information
Tutorial on monitoring
• www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
RFC 2151 on Internet tools
• www.freesoft.org/CIE/RFC/Orig/rfc2151.txt
Network monitoring tools
• www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
• www.caida.org/tools/taxonomy/
Network Performance Tools: an I2 Cookbook
• e2epi.internet2.edu/network-perf-wk/tools-cookbook.pdf
Case Studies:
• confluence.slac.stanford.edu/display/IEPM/Problem+Cases
• e2epi.internet2.edu/case-studies/
More slides
Local Host (also see NDT later)
Usual Unix tools (uname -a, top, vmstat, iostat ..)
• Is the host overloaded? Do you have a gateway (route) and a name server (nslookup)? Which interface are you using? (mii-tool, which needs root, gives duplex & speed, a common error source)
• Net: ifconfig -a (look at errors), netstat -a
Is the server running (if you know the port)?
>telnet localhost 2811
Trying 127.0.0.1
220 aftpexp04.bnl.gov GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2 (gcc32dbg, 1069715860-42) ready.
^]
telnet> quit
Ping example
Callouts: -c is the repeat count, -s the packet size, followed by the remote host; each reply shows the RTT; note the missing seq # (icmp_seq=1); the last lines are the summary
syrup:/home$ ping -c 6 -s 64 thumper.bellcore.com
PING thumper.bellcore.com (128.96.41.1): 64 data bytes
72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
--- thumper.bellcore.com ping statistics ---
6 packets transmitted, 5 packets received, 16% packet loss
round-trip min/avg/max = 482.1/880.5/1447.4 ms
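In a long ping run, lost packets show up only as gaps in the icmp_seq numbers, which are easy to miss by eye. The following is a small sketch that lists the gaps, assuming Unix-style output with icmp_seq fields and numbering that starts at 0 as in the example above (some ping versions start at 1). The function name missing_sequence_numbers is illustrative.

```python
import re

def missing_sequence_numbers(ping_output: str, sent: int, first_seq: int = 0):
    """List the icmp_seq values that never came back."""
    seen = {int(n) for n in re.findall(r"icmp_seq=(\d+)", ping_output)}
    return [seq for seq in range(first_seq, first_seq + sent) if seq not in seen]

# Reply lines copied from the ping example in these slides.
SAMPLE = """72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
"""

lost = missing_sequence_numbers(SAMPLE, sent=6)
print(lost, f"{100 * len(lost) / 6:.1f}% loss")
```

One lost packet out of six is 16.7%, which ping reports as 16% in the summary line. Looking at which sequence numbers are missing also shows whether losses are isolated or come in bursts.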
Traceroute
• UDP/ICMP tool to show the route packets take from local to remote host
Callouts: -q is the probes/hop, -m the max hops (20), followed by the remote host; router names can indicate location; the long delay starting at hop 11 indicates a satellite link; * means no response (lost packet or router ignores)
17cottrell@flora06:~>traceroute -q 1 -m 20 lhr.comsats.net.pk
traceroute to lhr.comsats.net.pk (210.56.16.10), 20 hops max, 40 byte packets
1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
4 snv-slac.es.net (134.55.208.30) 1.377 ms
5 nyc-snv.es.net (134.55.205.22) 75.536 ms
6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
10 207.45.205.18 (207.45.205.18) 128.648 ms
11 210.56.31.94 (210.56.31.94) 762.150 ms
12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
13 *
14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms
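A long-delay link such as a satellite hop shows up in traceroute output as a large step in RTT between consecutive hops. As a minimal sketch, the function below finds the largest such step, skipping hops that did not respond. The function name largest_rtt_jump is an assumption for this example, and the data are the RTTs from the traceroute above.

```python
def largest_rtt_jump(hop_rtts):
    """Return (hop, increase_ms) for the largest RTT increase between hops.

    hop_rtts: list of (hop_number, rtt_ms or None), in hop order.
    Hops with no response (None) are skipped. Returns None if fewer
    than two hops responded.
    """
    best = None
    previous = None
    for hop, rtt in hop_rtts:
        if rtt is None:
            continue
        if previous is not None:
            increase = rtt - previous
            if best is None or increase > best[1]:
                best = (hop, increase)
        previous = rtt
    return best

# RTTs (ms) per hop from the traceroute to lhr.comsats.net.pk; hop 13 was "*".
TRACE = [
    (1, 0.642), (2, 0.616), (3, 0.716), (4, 1.377), (5, 75.536),
    (6, 80.629), (7, 154.742), (8, 137.403), (9, 135.850), (10, 128.648),
    (11, 762.150), (12, 751.851), (13, None), (14, 827.301),
]

hop, increase = largest_rtt_jump(TRACE)
print(f"largest jump: {increase:.0f} ms arriving at hop {hop}")
```

With a single probe per hop, as in this trace, one slow reply can mimic a jump, so it is worth checking that the later hops stay at the higher RTT, as hops 12 and 14 do here.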
Pingroute
• Ping routers along route, e.g. a tool to install that helps:
• www.slac.stanford.edu/comp/net/fpingroute.pl
• or www.slac.stanford.edu/comp/net/fpingroute.pl if fping available
15cottrell@noric04:~>fpingroute.pl
fpingroute.pl does a traceroute to the selected host. For each of the hops
along the route it then uses fping to ping each node (in parallel) 'count'
times. Output includes traceroute information, RTTs, losses for 100 and
'size' byte pings.
Version=0.21, 8/24/04
Usage: fpingroute.pl [Opts] host
where host is the remote host's IP address or name
e.g. www.slac.stanford.edu
Opts:
[-c count default=10]
[-s size default=1400]
[-i initial default=1]
Example: fpingroute.pl -i 3 -c 10 -s 1400 www.triumf.ca
Other tools
• Ntop
• Summarizes libpcap (sniffer) info
• Internet2 Detective:
• Tests connectivity to I2, bandwidth, multicast, IPv6
• Can run as Java applet
• http://detective.internet2.edu/
• NLANR Internet Advisor
• Ethereal, tcpdump, snoop for masochists
• Passive tools:
• Netflow for characterizing network, spotting abnormalities, e.g.
• www.itec.oar.net/abilene-netflow
• www.slac.stanford.edu/comp/net/slac-netflow/html/SLACnetflow.html
• SNMP based tools