ppt - Internet2

Network Path and
Application Diagnostics
Matt Mathis
John Heffner
Ragu Reddy
4/24/06
http://www.psc.edu/~mathis/papers/PathDiag20060424.ppt
Outline
• What is the real problem?
– Lessons from Web100
– A new perspective
• Path and lower layer diagnosis
– The pathdiag tool
– A diagnostic server
• Application and upper layer diagnosis
– LAN bench testing
• Future plans
TCP tuning requires expert knowledge
• By design TCP/IP hides the ‘net from upper layers
– TCP/IP provides basic reliable data delivery
– The “hour glass” between applications and networks
• This is a good thing, because it allows:
– Invisible recovery from data loss, etc
– Old applications to use new networks
– New applications to use old networks
• But then (nearly) all problems have the same symptom
– Less than expected performance
– The details are hidden from nearly everyone
TCP tuning is painful debugging
• All problems reduce performance
– But the specific symptoms are hidden
• Any one problem can prevent good performance
– Completely masking all other problems
• Trying to fix the weakest link of an invisible chain
– General tendency is to guess and “fix” random parts
– Repairs are sometimes “random walks”
– Repair one problem at a time, at best
The Web100 project
• When there is a problem, just ask TCP
– TCP has the ideal vantage point
• In between the application and the network
– TCP already “measures” key network parameters
• Round Trip Time (RTT), available data capacity, etc
• Can add many more
– TCP can identify the bottleneck
• Why did it stop sending data?
– TCP can even adjust itself
• “autotuning” eliminates one major class of flaws
See: www.web100.org
The next step
• Web100 tools still require too much expertise
– They are not really end user tools
– Too easy to overlook problems
– Current diagnostic procedures are still cumbersome
• New insight from web100 experience
– Nearly all symptoms scale with round trip time
• New NSF funded project:
Network Path and Application Diagnosis (NPAD)
Nearly all symptoms scale with RTT
• For example
– TCP Buffer Space, Network loss and reordering, etc
– On a short path TCP can compensate for the flaw
• Local Client to Server: all applications work
– Including all standard diagnostics
• Remote Client to Server: all applications fail
– Leading to faulty implication of other components
Examples of flaws that scale
• Chatty application (e.g., 50 transactions per request)
– On a 1ms LAN, this adds 50ms to user response time
– On a 100ms WAN, this adds 5s to user response time
• Fixed TCP socket buffer space (e.g., 32kBytes)
– On a 1ms LAN, limits throughput to about 200Mb/s
– On a 100ms WAN, limits throughput to about 2Mb/s
• Packet Loss (e.g., 0.1% loss at 1500 bytes)
– On a 1ms LAN, models predict 300 Mb/s
– On a 100ms WAN, models predict 3 Mb/s
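The three examples above follow from simple arithmetic, sketched here in Python. The function names are illustrative. The loss model is assumed to be the common approximation rate = (MSS / RTT) * C / sqrt(p); the slides do not state the constant C, so C = 1.0 is an assumption, as is reading 32kBytes as 32 * 1024 bytes. The computed values match the slide figures only to within rounding.

```python
import math

def chatty_delay_s(round_trips: int, rtt_s: float) -> float:
    """Extra response time when each exchange costs one round trip."""
    return round_trips * rtt_s

def window_limited_bps(buffer_bytes: int, rtt_s: float) -> float:
    """At most one buffer of data can be in flight per round trip."""
    return buffer_bytes * 8 / rtt_s

def loss_limited_bps(mss_bytes: int, rtt_s: float, loss: float,
                     c: float = 1.0) -> float:
    """Assumed model: rate = (MSS / RTT) * C / sqrt(p). C = 1.0 is a guess."""
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss)

for name, rtt in (("1ms LAN", 0.001), ("100ms WAN", 0.100)):
    delay = chatty_delay_s(50, rtt)
    window = window_limited_bps(32 * 1024, rtt) / 1e6
    lossy = loss_limited_bps(1500, rtt, 0.001) / 1e6
    print(f"{name}: chatty +{delay:.2f} s, "
          f"32 kB buffer {window:.1f} Mb/s, "
          f"0.1% loss {lossy:.1f} Mb/s")
```

In every case the only thing that changes between the LAN and the WAN row is the RTT, which is why the same flaw is invisible on one path and crippling on the other.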
The confounded problems
• For nearly all network flaws
– The only symptom is reduced performance
– But the reduction is scaled by RTT
• On short paths, most flaws are undetectable
– False pass for even the best conventional diagnostics
– Leads to faulty inductive reasoning about flaw locations
– This is the essence of the “end-to-end” problem
– Current state-of-the-art diagnosis relies on tomography
and complicated inference techniques
The solutions
• New diagnostic techniques to compensate for
“symptom scaling”
• For path testing (and lower layers)
– Test path sections using an instrumented application
that can extrapolate test results to a long path
• For applications (and upper layers)
– Bench test over an (emulated) ideal long path
Testing the path
• Need to test short path sections to localize a flaw
– But “symptom scaling” normally hides a failing section
• New tool (“pathdiag”):
– Measure the performance of each short section
• Use Web100 to collect detailed statistics
• Loss, delay, queuing properties, etc
– Use models to extrapolate results to the full path
• Assume that the rest of the path is ideal
• You have to specify the end-to-end performance goal
– Data rate and RTT
– Pass/Fail on the basis of the extrapolated performance
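The extrapolation idea can be sketched with the same kind of loss model used in the packet loss example earlier. This is not pathdiag's actual algorithm: the function names, the 1500-byte MSS default, and C = 1.0 are all assumptions made for illustration. The sketch turns the end-to-end goal (data rate and RTT) into a loss budget, then judges the loss measured on a short section against that budget instead of against the section's own short RTT.

```python
import math

def model_rate_bps(rtt_s, loss, mss_bytes=1500, c=1.0):
    """Assumed model: rate = (MSS / RTT) * C / sqrt(p)."""
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss)

def required_window_bytes(target_bps, target_rtt_s):
    """Data that must be in flight to sustain the target rate."""
    return target_bps * target_rtt_s / 8

def max_loss_rate(target_bps, target_rtt_s, mss_bytes=1500, c=1.0):
    """Largest loss rate at which the assumed model still meets the target."""
    window_packets = target_bps * target_rtt_s / (mss_bytes * 8)
    return (c / window_packets) ** 2

def extrapolated_verdict(measured_loss, target_bps, target_rtt_s):
    """Pass/fail as if the rest of the path were ideal."""
    budget = max_loss_rate(target_bps, target_rtt_s)
    return "pass" if measured_loss <= budget else "fail"

# Hypothetical measurement: 0.01% loss seen on a 1 ms section.
measured_loss = 1e-4
target_bps, target_rtt_s = 100e6, 0.100

print("rate on the 1 ms section:",
      model_rate_bps(0.001, measured_loss) / 1e6, "Mb/s")
print("window needed end to end:",
      required_window_bytes(target_bps, target_rtt_s), "bytes")
print("loss budget end to end:", max_loss_rate(target_bps, target_rtt_s))
print("verdict:", extrapolated_verdict(measured_loss, target_bps, target_rtt_s))
```

Under these assumptions the section comfortably exceeds the target rate when tested on its own, yet fails once its loss rate is held to the budget of the 100ms path.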
Deploy as a Diagnostic Server
• Use pathdiag in a Diagnostic Server (DS)
• Specify End to End target performance
– From server (S) to client (C) (RTT and data rate)
• Measure the performance from DS to C
– Use Web100 in the DS to collect detailed statistics
– Extrapolate performance assuming ideal backbone
• Pass/Fail on the basis of extrapolated performance
Example 1 - good news
Example 1, continued
Example 2 - not so good
Example 2, continued
Key pathdiag/DS features
• Results are intended for end-users
– Provides a list of specific items to be corrected
• Failed tests are showstoppers for HPN apps
– Includes explanations and tutorial information
– Details for escalation to network or system admins
• Coverage for a majority of OS and network flaws
– Most of the remaining flaws can be detected with pathdiag in
the client or traceroute
– Eliminates nearly all(?) false pass results
• Tests become more sensitive on shorter paths
– Conventional diagnostics become less sensitive
– Depending on models, perhaps too sensitive
• New problem is false fail (e.g. queue space tests)
Key features, continued
• Flaws no longer completely mask other flaws
– A single test often detects several flaws
• E.g. find both OS and network flaws in the same test
– They can be repaired concurrently
• Archived DS results include raw web100 data
– Can reprocess with updated reporting SW
• New reports from old data
– Critical feedback for the NPAD project
• We really want to collect “interesting” failures
Status
• Public servers are now online. See:
– http://www.psc.edu/networking/projects/pathdiag/
• Version 1.0 available for download
– Follow the download link
– Requires current web100 kernel patches
– Server should be faster than its clients
• Version 1.1 is coming soon
– Better support for non-local testing
– Better support for TeraGrid scale testing
Blast from the past
• Same base algorithm as “Windowed Ping” [Mathis, INET’94]
– Aka “mping”
– See http://www.psc.edu/~mathis/wping/
– Killer diagnostic in use at PSC in the early 90s
– Stopped working with the advent of “fast path” routers
• Use a simple fixed window protocol
– Scan window size in 1 second steps
– Measure data rate, loss rate, RTT, etc as window changes
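What a window scan reveals can be illustrated with a toy model. This is not mping's implementation: it is an idealized steady-state model of a single bottleneck with a drop-tail queue, and the path parameters (10ms base RTT, 5000 packets/s bottleneck, 20-packet queue) are invented for the example.

```python
def fixed_window_response(window_pkts, base_rtt_s, bottleneck_pps, queue_pkts):
    """Steady-state rate, loss and RTT for a fixed window (idealized model)."""
    bdp_pkts = bottleneck_pps * base_rtt_s          # packets the pipe holds
    in_flight = min(window_pkts, bdp_pkts + queue_pkts)
    loss = (window_pkts - in_flight) / window_pkts  # overflow is dropped
    rate_pps = min(window_pkts / base_rtt_s, bottleneck_pps)
    rtt_s = max(base_rtt_s, in_flight / bottleneck_pps)  # queueing adds delay
    return rate_pps, loss, rtt_s

# Scan the window upward, one step per measurement interval.
print("window  rate(pkt/s)  loss   rtt(ms)")
for window in range(10, 101, 10):
    rate, loss, rtt = fixed_window_response(window, 0.010, 5000, 20)
    print(f"{window:6d}  {rate:11.0f}  {loss:5.2f}  {rtt * 1000:7.1f}")
```

In this model the scan shows three regimes: rate grows with the window until the pipe is full, then RTT grows while the queue fills, then loss appears once the queue overflows. The window sizes at which the regimes change expose the bottleneck rate and the queue space.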
Diagnosing applications
• Goal: Tools to “bench test” applications in the lab
– Client and server on the same LAN
• App developer has easy access to all components
– Emulate a long ideal path between client and server
• Also checks some OS and TCP features
• Several different techniques (next topic)
• Developer gets first hand experience with delay
– If it fails in the lab, it will not work on a WAN
– Cannot blame the network
– Cannot repeal the speed of light
– Has to fix the application
Emulating delay
• Multiple techniques to emulate long paths
– Scenic routing via tunnels
– Kernel delays (e.g. netem, nistnet, dummynet)
– Application (pipe) delay via a proxy
• We have ~5 techniques prototyped/under test
– Kernel hacking vs non-privileged users
– Ease of use/ease of installation
– Maximum data rate
– Authenticity of the delay
• Not ready for prime time
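As an illustration of the application (pipe) delay technique only, and not one of the NPAD prototypes, here is a minimal sketch of a delaying TCP proxy using Python's asyncio. The names delayed_pipe and start_delay_proxy are hypothetical. Each chunk is time-stamped on arrival and released one delay later, so the proxy adds latency without capping throughput at one chunk per delay period.

```python
import asyncio

async def delayed_pipe(reader, writer, delay_s):
    """Copy reader to writer, releasing each chunk delay_s after it arrived."""
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()  # unbounded: must hold rate * delay bytes

    async def receive():
        while True:
            chunk = await reader.read(65536)
            await queue.put((loop.time() + delay_s, chunk))
            if not chunk:        # empty read means EOF; pass it on and stop
                return

    async def release():
        while True:
            due, chunk = await queue.get()
            await asyncio.sleep(max(0.0, due - loop.time()))
            if not chunk:        # delayed EOF: close the outgoing side
                writer.close()
                return
            writer.write(chunk)
            await writer.drain()

    await asyncio.gather(receive(), release())

async def start_delay_proxy(listen_port, target_host, target_port,
                            one_way_delay_s):
    """Listen on localhost and forward to the target with delay each way."""
    async def handle(client_reader, client_writer):
        server_reader, server_writer = await asyncio.open_connection(
            target_host, target_port)
        await asyncio.gather(
            delayed_pipe(client_reader, server_writer, one_way_delay_s),
            delayed_pipe(server_reader, client_writer, one_way_delay_s),
        )
    return await asyncio.start_server(handle, "127.0.0.1", listen_port)
```

A client pointed at the proxy port sees roughly twice the one-way delay added to every round trip. The sketch runs without privileges, but it terminates the TCP connection at the proxy, so it emulates delay as seen by the application rather than by TCP itself. That is the authenticity trade-off listed above.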
Try it!
http://www.psc.edu/networking/projects/pathdiag/