Network diagnostics made easy


Performance Diagnostic Research at PSC
Matt Mathis
John Heffner
Ragu Reddy
5/12/05
http://www.psc.edu/~mathis/papers/PathDiag20050512.ppt
The Wizard Gap
The non-experts are falling behind
Year    Experts     Non-experts    Ratio
1988    1 Mb/s      300 kb/s       3:1
1991    10 Mb/s
1995    100 Mb/s
1999    1 Gb/s
2003    10 Gb/s     3 Mb/s         3000:1
2004    40 Gb/s
Why?
TCP tuning requires expert knowledge
• By design TCP/IP hides the ‘net from upper layers
– TCP/IP provides basic reliable data delivery
– The “hourglass” between applications and networks
• This is a good thing, because it allows:
– Old applications to use new networks
– New applications to use old networks
– Invisible recovery from data loss, etc
• But then (nearly) all problems have the same symptom
– Less than expected performance
– The details are hidden from nearly everyone
TCP tuning is really debugging
• Application problems:
– Inefficient or inappropriate application designs
• Operating System or TCP problems:
– Negotiated TCP features (SACK, WSCALE, etc)
– Failed MTU discovery
– Too small retransmission or reassembly buffers
• Network problems:
– Packet losses, congestion, etc
– Packets arriving out of order or even duplicated
– “Scenic” IP routing or excessive round trip times
– Improper packet size limits (MTU)
TCP tuning is painful debugging
• All problems reduce performance
– But the specific symptoms are hidden
• But any one problem can prevent good performance
– Completely masking all other problems
• Trying to fix the weakest link of an invisible chain
– General tendency is to guess and “fix” random parts
– Repairs are sometimes “random walks”
– At best, repair one problem at a time
The Web100 project
• When there is a problem, just ask TCP
– TCP has the ideal vantage point
• In between the application and the network
– TCP already “measures” key network parameters
• Round Trip Time (RTT) and available data capacity
• Can add more
– TCP can identify the bottleneck
• Why did it stop sending data?
– TCP can even adjust itself
• “autotuning” eliminates one major class of bugs
See: www.web100.org
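To make that vantage point concrete, here is a minimal, hedged sketch of pulling per-connection statistics out of TCP on a stock Linux host via the TCP_INFO socket option. This is not the Web100 API (Web100's kernel patch exposes far more variables); the field offsets assume the struct tcp_info layout in <linux/tcp.h>, and the host name is just an example.

    # Not the Web100 API: a stand-in using the standard Linux TCP_INFO socket
    # option to show the kind of data TCP already keeps for every connection.
    # Field offsets follow struct tcp_info in <linux/tcp.h>; Linux-only.
    import socket, struct

    def tcp_snapshot(sock):
        raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
        # 7 one-byte fields, one pad/bitfield byte, then 32-bit counters.
        u32 = struct.unpack_from("<7Bx21I", raw)[7:]
        return {
            "lost_pkts":    u32[6],         # tcpi_lost
            "retrans_pkts": u32[7],         # tcpi_retrans
            "pmtu_bytes":   u32[13],        # tcpi_pmtu
            "rtt_ms":       u32[15] / 1e3,  # tcpi_rtt is in microseconds
            "cwnd_pkts":    u32[18],        # tcpi_snd_cwnd, in MSS-sized segments
        }

    if __name__ == "__main__":
        s = socket.create_connection(("www.psc.edu", 80), timeout=5)  # example host
        s.sendall(b"HEAD / HTTP/1.0\r\nHost: www.psc.edu\r\n\r\n")
        s.recv(4096)              # exchange some data so TCP has samples
        print(tcp_snapshot(s))
        s.close()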
Key Web100 components
• Better instrumentation within TCP
– 120 internal performance monitors
– Poised to become an Internet standard “MIB”
• TCP Autotuning
– Selects the ideal buffer sizes for TCP
– Eliminates the need for user expertise
• Basic network diagnostic tools
– Requires less expertise than prior tools
• Excellent for network admins
• But still not useful for end users
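For contrast, a hedged sketch of the manual step that autotuning removes: an expert sizes the socket buffers to at least the bandwidth-delay product of the path. The function name, rate, and RTT below are illustrative values, not part of Web100.

    import socket

    def manual_buffer_tuning(sock, target_rate_bps, rtt_s):
        # Buffers must hold a full bandwidth-delay product to keep the pipe full.
        bdp_bytes = int(target_rate_bps / 8 * rtt_s)
        # The kernel may clamp these to net.core.rmem_max / wmem_max;
        # autotuned stacks grow the window automatically instead.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
        return bdp_bytes

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    needed = manual_buffer_tuning(s, 1e9, 0.100)   # 1 Gb/s over a 100 ms path
    print(f"buffer needed: {needed} bytes (~12.5 MB)")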
Web100 Status
• Two-year no-cost extension
– Can only push standardization after most of the work
– Ongoing support of research users
• Partial adoption
– Current Linux includes (most of) autotuning
• John Heffner is maintaining patches for the rest of Web100
– Microsoft
• Experimental TCP instrumentation
• Working on autotuning (to support FTTH)
– IBM “z/OS Communications Server”
• Experimental TCP instrumentation
The next step
• Web100 tools still require too much expertise
– They are not really end user tools
– Too easy to overlook problems
– Current diagnostic procedures are still cumbersome
• New insight from Web100 experience
– Nearly all symptoms scale with round trip time
• New NSF funding
– Network Path and Application Diagnosis (NPAD)
– 3 years; we are at the midpoint
Nearly all symptoms scale with RTT
• For example
– TCP Buffer Space, Network loss and reordering, etc
– On a short path TCP can compensate for the flaw
• Local Client to Server: all applications work
– Including all standard diagnostics
• Remote Client to Server: all applications fail
– Leading to faulty implication of other components
Examples of flaws that scale
• Chatty application (e.g., 50 transactions per request)
– On 1ms LAN, this adds 50ms to user response time
– On 100ms WAN, this adds 5s to user response time
• Fixed TCP socket buffer space (e.g., 32kBytes)
– On a 1ms LAN, limits throughput to 200 Mb/s
– On a 100ms WAN, limits throughput to 2 Mb/s
• Packet Loss (e.g., 0.1% loss at 1500 bytes)
– On a 1ms LAN, models predict 300 Mb/s
– On a 100ms WAN, models predict 3 Mb/s (see the sketch below)
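A minimal sketch of the models behind these three examples. The coefficient C in the loss-limited case is an assumption: the widely quoted Mathis et al. formula uses sqrt(3/2), while a value near 0.7 matches the loss budgets quoted in the pathdiag reports later in this talk. The printed numbers therefore land in the slide's ballpark rather than matching it exactly.

    from math import sqrt

    C = 0.7   # assumed model coefficient; see the note above

    def chatty_app_delay(transactions, rtt_s):
        return transactions * rtt_s                      # serialized round trips

    def buffer_limited_rate(window_bytes, rtt_s):
        return window_bytes * 8 / rtt_s                  # bits per second

    def loss_limited_rate(mss_bytes, rtt_s, loss_prob):
        return C * mss_bytes * 8 / (rtt_s * sqrt(loss_prob))

    for rtt in (0.001, 0.100):                           # 1 ms LAN vs 100 ms WAN
        print(f"RTT {rtt*1e3:.0f} ms: "
              f"chatty app adds {chatty_app_delay(50, rtt):.2f} s, "
              f"32 kB buffer caps at {buffer_limited_rate(32*1024, rtt)/1e6:.0f} Mb/s, "
              f"0.1% loss caps at {loss_limited_rate(1500, rtt, 0.001)/1e6:.0f} Mb/s")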
Review
• For nearly all network flaws
– The only symptom is reduced performance
– But this reduction is scaled by RTT
• On short paths many flaws are undetectable
– False pass for even the best conventional diagnostics
– Leads to faulty inductive reasoning about flaw locations
– This is the essence of the “end-to-end” problem
– Current state-of-the-art diagnosis relies on tomography and complicated inference techniques
Our new tool: pathdiag
• Specify End-to-End application performance goal
– Round Trip Time (RTT) of the full path
– Desired application data rate
• Measure the performance of a short path section
– Use Web100 to collect detailed statistics
– Loss, delay, queuing properties, etc
• Use models to extrapolate results to the full path (sketched below)
– Assume that the rest of the path is ideal
• Pass/Fail on the basis of the extrapolated performance
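A hedged sketch of the extrapolation step (not pathdiag's actual code): invert a Mathis-style loss model to get the end-to-end loss budget for the target rate and RTT, then compare the loss measured on the short section. A coefficient of about 0.7 with a 1448-byte MSS reproduces the loss budgets printed in the example reports below; the tool's “Try rate=..., rtt=...” suggestions follow a similar relation.

    from math import sqrt

    C, MSS = 0.7, 1448    # assumed model coefficient and segment size (bytes)

    def loss_budget(target_rate_bps, rtt_s):
        # Largest loss probability at which the model still reaches the target rate.
        return (C * MSS * 8 / (rtt_s * target_rate_bps)) ** 2

    def max_rtt(target_rate_bps, measured_loss):
        # Longest path the measured section could support at the target rate.
        # (The real tool also corrects for multi-loss events, so numbers differ.)
        return C * MSS * 8 / (target_rate_bps * sqrt(measured_loss))

    target_rate, target_rtt = 4e6, 0.200      # 4 Mb/s over a 200 ms path
    measured_loss = 0.025248 / 100            # loss event rate from example report 1

    budget = loss_budget(target_rate, target_rtt)
    print(f"budget {budget*100:.6f}%, measured {measured_loss*100:.6f}% ->",
          "Pass" if measured_loss <= budget else "Fail")
    print(f"section could carry {target_rate/1e6:.0f} Mb/s out to "
          f"{max_rtt(target_rate, measured_loss)*1e3:.0f} ms RTT")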
Deploy as a Diagnostic Server
• Use pathdiag in a Diagnostic Server (DS) in the GigaPop
• Specify End-to-End target performance
– from server (S) to client (C) (RTT and data rate)
• Measure the performance from DS to C
– Use Web100 in the DS to collect detailed statistics
– Extrapolate performance assuming ideal backbone
• Pass/Fail on the basis of extrapolated performance
Example diagnostic output 1
Tester at IP address: xxx.xxx.115.170 Target at IP address: xxx.xxx.247.109
Warning: TCP connection is not using SACK
Fail: Received window scale is 0, it should be 2.
Diagnosis: TCP on the test target is not properly configured for this path.
> See TCP tuning instructions at http://www.psc.edu/networking/perf_tune.html
Pass data rate check: maximum data rate was 4.784178 Mb/s
Fail: loss event rate: 0.025248% (3960 pkts between loss events)
Diagnosis: there is too much background (non-congested) packet loss.
The events averaged 1.750000 losses each, for a total loss rate of 0.0441836%
FYI: To get 4 Mb/s with a 1448 byte MSS on a 200 ms path the total
end-to-end loss budget is 0.010274% (9733 pkts between losses).
Warning: could not measure queue length due to previously reported bottlenecks
Diagnosis: there is a bottleneck in the tester itself or test target
(e.g insufficient buffer space or too much CPU load)
> Correct previously identified TCP configuration problems
> Localize all path problems by testing progressively smaller sections of the full path.
FYI: This path may pass with a less strenuous application:
Try rate=4 Mb/s, rtt=106 ms
Or if you can raise the MTU:
Try rate=4 Mb/s, rtt=662 ms, mtu=9000
Some events in this run were not completely diagnosed.
Example diagnostic output 2
Tester at IP address: 192.88.115.170 Target at IP address: 128.182.61.117
FYI: TCP negotiated appropriate options: WSCALE=8, SACKok, and Timestamps
Pass data rate check: maximum data rate was 94.206807 Mb/s
Pass: measured loss rate 0.004471% (22364 pkts between loss events)
FYI: To get 10 Mb/s with a 1448 byte MSS on a 10 ms path the total
end-to-end loss budget is 0.657526% (152 pkts between losses).
FYI: Measured queue size, Pkts: 33 Bytes: 47784 Drain time: 2.574205 ms
Passed all tests!
FYI: This path may even pass with a more strenuous application:
Try rate=10 Mb/s, rtt=121 ms
Try rate=94 Mb/s, rtt=12 ms
Or if you can raise the MTU:
Try rate=10 Mb/s, rtt=753 ms, mtu=9000
Try rate=94 Mb/s, rtt=80 ms, mtu=9000
Example diagnostic output 3
Tester at IP address: 192.88.115.170 Target at IP address: 128.2.13.174
Fail: Received window scale is 0, it should be 1.
Diagnosis: TCP on the test target is not properly configured for this path.
> See TCP tuning instructions at http://www.psc.edu/networking/perf_tune.html
Test 1a (7 seconds): Coarse Scan
Test 2a (17 seconds): Search for the knee
Test 2b (10 seconds): Duplex test
Test 3a (8 seconds): Accumulate loss statistics
Test 4a (17 seconds): Measure static queue space
The maximum data rate was 8.838274 Mb/s
This is below the target rate (10.000000).
Diagnosis: there seems to be a hard data rate limit
> Double check the path: is it via the route and equipment that you expect?
Pass: measured loss rate 0.012765% (7834 pkts between loss events)
FYI: To get 10 Mb/s with a 1448 byte MSS on a 50 ms path the total
end-to-end loss budget is 0.026301% (3802 pkts between losses).
Key DS features
• Nearly complete coverage for OS and Network flaws
– Does not address flawed routing at all
– May fail to detect flaws that only affect outbound data
• Unless you have Web100 in the client or a (future) portable DS
– May fail to detect a few rare corner cases
– Eliminates all other false pass results
• Tests become more sensitive on shorter paths
– Conventional diagnostics become less sensitive
– Depending on models, perhaps too sensitive
• New problem is false fail (queue space tests)
• Flaws no longer completely mask other flaws
– A single test often detects several flaws
• E.g. both OS and network flaws in the same test
– They can be repaired in parallel
Key features, continued
• Results are specific and less geeky
– Intended for end-users
– Provides a list of action items to be corrected
• Failed tests are showstoppers for high-performance applications
– Details for escalation to network or system admins
• Archived results include raw data
– Can reprocess with updated reporting SW
The future
• Current service is “pre-alpha”
– Please use it so we can validate the tool
• We can often tell when it got something wrong
– Please report confusing results
• So we can improve the reports
– Please get us involved if it is not helpful
• We need interesting pathologies
• Will soon have another server near FRGP
– NCAR in Boulder CO
• Will someday be in a position to deploy more
– Should there be one at PSU?
What about flaws in applications?
• NPAD is also thinking about applications
• Using an entirely different collection of techniques
– Symptom scaling still applies
• Tools to emulate ideal long paths on a LAN (see the sketch below)
– Prove or bench test applications in the lab
• Also checks some OS and TCP features
– If it fails in the lab, it cannot work on a WAN
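One hedged way to run this kind of bench test on a stock Linux box is to add artificial delay with netem; this is illustrative and not necessarily the NPAD tool set. It requires root, and the interface name and delay are example values.

    import subprocess

    IFACE, DELAY = "eth0", "100ms"    # example interface and one-way delay

    def emulate_wan(enable=True):
        if enable:
            # Impose artificial delay on outgoing packets to mimic a long path.
            cmd = ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "delay", DELAY]
        else:
            cmd = ["tc", "qdisc", "del", "dev", IFACE, "root"]
        subprocess.run(cmd, check=True)

    emulate_wan(True)
    # ... run the application's bench test against a local server here ...
    emulate_wan(False)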
For example: classic ssh & scp
• Long known performance problems
• Recently diagnosed
– Internal flow control for port forwarding
– NOT encryption
• Chris Rapier developed a patch
– Update flow control windows from kernel buffer size
– Already running on most PSC systems
See: http://www.psc.edu/networking/projects/hpn-ssh/
NPAD Goal
• Build a minimal tool set that can detect “every” flaw
– Pathdiag: all flaws affecting inbound data
– Web100 in servers or portable diagnostic servers: all flaws affecting outbound data
– Application bench test: All application flaws
– Traceroute: routing flaws
• We believe that this is a complete set
http://kirana.psc.edu/NPAD/