PowerPoint Presentation - Network diagnostics made easy

Network Path and
Application Diagnostics
Matt Mathis
John Heffner
Ragu Reddy
7/17/06
http://www.psc.edu/~mathis/papers/PathDiag20060717.ppt
(Corrected)
Outline
• NPAD/Pathdiag - Why should you care?
• What are the real performance problems?
• Automatic diagnosis
• Deployment
NPAD/Pathdiag - Why should you care?
• One click automatic performance diagnosis
– Designed for (non-expert) end users
– Accurate end-systems and last mile diagnosis
• Eliminate most false pass results
• Accurate distinction between host and path flaws
• Accurate and specific identification of most flaws
– Basic networking tutorial info
• Help the end user understand the problem
• Help train 1st tier support (sysadmin or netadmin)
• Backup documentation for support escalation
• Empower the user to get it fixed
– The same reports for users and admins
Recalibrate user expectations
• Long history of very poor network performance
– Users do not know what to expect
– Users have become completely numb
• Goal for new baseline user expectations:
– 1 Gigabyte in less than 2 minutes (~67 Mb/s)
• Everyone should be able to reach these rates by default
• People who can’t should know why or be very angry
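The baseline figure above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a decimal gigabyte (10^9 bytes), which is what makes the slide's ~67 Mb/s come out; the function name is illustrative only.

```python
def required_rate_mbps(size_bytes: float, seconds: float) -> float:
    """Average rate in megabits per second needed to move size_bytes in seconds."""
    return size_bytes * 8 / seconds / 1e6

# 1 gigabyte (assumed to mean 10**9 bytes) in 2 minutes
baseline = required_rate_mbps(1e9, 120)
print(f"{baseline:.1f} Mb/s")  # 66.7 Mb/s, i.e. the ~67 Mb/s on the slide
```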
The Wizard Gap
The Wizard Gap Updated
• Experts have topped out end systems & links
– 10 Gb/s NIC bottleneck
– 40 Gb/s “link” bandwidth (striped)
• Median I2 bulk rate is 3 Mbit/s
– See http://netflow.internet2.edu/weekly/
• Current Gap is about 3000:1
• Closing the first factor of 30 should now be “easy”
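The size of the gap follows directly from the two rates quoted above. A small sketch of the arithmetic, using only the slide's numbers (variable names are illustrative):

```python
expert_rate_bps = 10e9  # 10 Gb/s NIC bottleneck reached by experts
median_rate_bps = 3e6   # 3 Mbit/s median I2 bulk rate

gap = expert_rate_bps / median_rate_bps
after_first_factor = median_rate_bps * 30  # rate once the "easy" factor of 30 is closed

print(f"gap is roughly {gap:.0f}:1")
print(f"closing a factor of 30 gives {after_first_factor / 1e6:.0f} Mb/s")
```

Closing the first factor of 30 would put the median near 90 Mb/s, above the ~67 Mb/s baseline proposed earlier.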
TCP tuning requires expert knowledge
• By design TCP/IP hides the ‘net from upper layers
– TCP/IP provides basic reliable data delivery
– The “hour glass” between applications and networks
• This is a good thing, because it allows:
– Invisible recovery from data loss, etc
– Old applications to use new networks
– New applications to use old networks
• But then (nearly) all problems have the same symptom
– Less than expected performance
– The details are hidden from nearly everyone
TCP tuning is painful debugging
• All problems reduce performance
– But the specific symptoms are hidden
• Any one problem can prevent good performance
– Completely masking all other problems
• Trying to fix the weakest link of an invisible chain
– General tendency is to guess and “fix” random parts
– Repairs are sometimes “random walks”
– Repair one problem at a time at best
• The solution is to instrument TCP
The Web100 project
• Instrumentation and autotuning for TCP
– TCP has the ideal diagnostic vantage point
– TCP-ESTATS-MIB now past IETF WG last-call
• Will be a standards track RFC soon
• Prototypes for Linux (www.Web100.org) and Windows Vista
– TCP Autotuning
• Automatically adjusts TCP buffers
• Linux 2.6.17 default maximum window size is 4 MBytes
• Announced for Vista - details unknown
• New insight
– Nearly all symptoms scale with round trip time
Nearly all symptoms scale with RTT
• For example
– TCP Buffer Space, Network loss and reordering, etc
– On a short path TCP can compensate for the flaw
• Local Client to Server: all applications work
– Including all standard diagnostics
• Remote Client to Server: all applications fail
– Leading to faulty implication of other components
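To illustrate why the same flaw passes locally and fails remotely, here is a rough sketch. The slides do not say which models are used, so this borrows two widely used approximations as stand-ins: a window-limited rate of window / RTT, and the macroscopic loss model rate ≈ (MSS / RTT) · C / sqrt(p) with C = sqrt(3/2). The buffer size, MSS, loss probability and RTT values are all assumptions chosen for illustration.

```python
import math

def window_limited_bps(window_bytes: float, rtt_s: float) -> float:
    """Rate ceiling when at most window_bytes can be in flight per round trip."""
    return window_bytes * 8 / rtt_s

def loss_limited_bps(mss_bytes: float, rtt_s: float, loss_prob: float,
                     c: float = math.sqrt(1.5)) -> float:
    """Rate ceiling from the macroscopic loss model (an assumed stand-in)."""
    return mss_bytes * 8 / rtt_s * c / math.sqrt(loss_prob)

# Assumed flaws: a 64 KiB socket buffer and a 1-in-10,000 packet loss rate.
WINDOW_BYTES = 64 * 1024
MSS_BYTES = 1460
LOSS_PROB = 1e-4

def achievable_bps(rtt_s: float) -> float:
    """The tighter of the two ceilings at this RTT."""
    return min(window_limited_bps(WINDOW_BYTES, rtt_s),
               loss_limited_bps(MSS_BYTES, rtt_s, LOSS_PROB))

for label, rtt_s in (("local client", 0.001), ("remote client", 0.070)):
    print(f"{label}: RTT {rtt_s * 1e3:.0f} ms -> {achievable_bps(rtt_s) / 1e6:.1f} Mb/s")
```

Both ceilings are proportional to 1/RTT, so the identical host and path flaws cost 70 times more throughput at 70 ms than at 1 ms.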
The confounded problems
• For nearly all network flaws
– The only symptom is reduced performance
– But the reduction is scaled by RTT
• Therefore, flaws are undetectable on short paths
– False pass for even the best conventional diagnostics
– Leads to faulty inductive reasoning about flaw locations
– Diagnosis often relies on tomography and complicated inference techniques
• This is the real end-to-end problem
The NPAD solution:
• For applications (and upper layers)
– Bench test over an (emulated) ideal long path
– Topic of a future talk
• “Pathdiag” tests short path sections to localize a flaw
– Use Web100 to collect detailed statistics
• Loss, delay, queuing properties, etc
– Use models to extrapolate results to the full path
• Assume that the rest of the path is ideal
• You have to specify the end-to-end performance goal
– Data rate and RTT
– Pass/Fail on the basis of the extrapolated performance
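The extrapolation step can be sketched as follows. This is not pathdiag's actual code: the class and function names are hypothetical, and the loss budget reuses the macroscopic model rate ≈ (MSS / RTT) · C / sqrt(p) as an assumed stand-in for whatever models the tool applies. The idea is to turn the end-to-end goal (data rate and RTT) into a required window and a maximum tolerable loss rate, then compare values measured on the short section against those budgets.

```python
import math
from dataclasses import dataclass

@dataclass
class Target:
    rate_bps: float  # end-to-end data rate goal
    rtt_s: float     # end-to-end round trip time

def required_window_bytes(t: Target) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return t.rate_bps / 8 * t.rtt_s

def max_loss_prob(t: Target, mss_bytes: int = 1460,
                  c: float = math.sqrt(1.5)) -> float:
    """Largest loss probability that still permits the target (assumed model)."""
    window_packets = required_window_bytes(t) / mss_bytes
    return (c / window_packets) ** 2

def diagnose(t: Target, measured_loss_prob: float,
             measured_max_window_bytes: float) -> dict:
    """Pass/fail per test, assuming the rest of the path is ideal."""
    return {
        "loss": measured_loss_prob <= max_loss_prob(t),
        "window": measured_max_window_bytes >= required_window_bytes(t),
    }

# Hypothetical goal: 100 Mb/s over a 70 ms path.
goal = Target(rate_bps=100e6, rtt_s=0.070)
report = diagnose(goal, measured_loss_prob=1e-4,
                  measured_max_window_bytes=4 * 1024 * 1024)
print(report)
```

In this made-up case the 4 MiB window is ample, but a loss rate of 1e-4 is far above the budget, so the section fails even though a short-path throughput test would look healthy.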
Deploy as a Diagnostic Server
• Use pathdiag in a Diagnostic Server (DS)
• Specify End to End target performance
– From server (S) to client (C) (RTT and data rate)
• Measure the performance from DS to C
– Use Web100 in the DS to collect detailed statistics
• On both the path and client
– Extrapolate performance assuming ideal backbone
• Pass/Fail on the basis of extrapolated performance
Demo
Laptop
PSC
Key NPAD/pathdiag features
• Results are intended for end-users
– Provides a list of specific items to be corrected
• Failed tests are showstoppers for fast apps
– Includes explanations and tutorial information
– Clear differentiation between client and path problems
– Accurate escalation to network or system admins
– The reports are public and can be viewed by either
• Coverage for a majority of OS and last-mile network flaws
– Most of the remaining flaws can be detected with pathdiag in the client or traceroute
– Eliminates nearly all(?) false pass results
More features
• Tests become more sensitive as the path gets shorter
– Conventional diagnostics become less sensitive
– Depending on models, perhaps too sensitive
• New problem is false fail (e.g. queue space tests)
• Flaws no longer completely mask other flaws
– A single test often detects several flaws
• E.g. find both OS and network flaws in the same test
– They can be repaired concurrently
• Archived DS results include raw web100 data
– Can reprocess with updated reporting SW
• New reports from old data
– Critical feedback for the NPAD project
• We really want to collect “interesting” failures
NPAD/pathdiag deployment
• Why should a campus networking organization care?
– “Zero effort” solution to mistuned end-systems
– Accurate reports of real problems
• You have the same view as the user
• Saves time when there really is a problem
• You can document reality for management
• Suggestion:
– Require pathdiag reports for all performance problems
What about impact of the test traffic?
• NPAD/pathdiag is single threaded
– Only one test at a time
• Same load as any well tuned TCP application
– Protected by TCP “fairness”
• Large flows are generally “softer” than small flows
• Large flows are easily disturbed by small flows
Impact
• Automatically diagnose first level problems
– Easily expose all path bottlenecks that limit performance to less than 100 Mb/s
– Easily expose all end-system/OS problems that limit performance to less than 100 Mb/s
• (Will become moot as autotuning is deployed)
• Empower the users to apply the proper motivation
• Still need to recalibrate user expectations
– Less than 1 gigabyte / 2 minutes is too slow
– Many paths should support 5 gigabytes/minute
• Less than 1 Gb/s
Download and install
• User documentation:
http://www.psc.edu/networking/projects/pathdiag/
• Follow the link to “Installing a Server”
– Easily customized with a site specific skin
– Designed to be easily upgraded with new releases
• Roughly every 2 months
• Improving reports through ongoing field experience
– Drops into existing NDT servers
• Plans for future integration
• Enjoy!
Backup slides
Blast from the past
• Same base algorithm as “Windowed Ping” [Mathis, INET’94]
– Aka “mping”
– See http://www.psc.edu/~mathis/wping/
– Killer diagnostic in use at PSC in the early 90s
– Stopped working with the advent of “fast path” routers
• Use a simple fixed window protocol
– Scan window size in 1 second steps
– Measure data rate, loss rate, RTT, etc as window changes
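The window scan can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the mping implementation: the path is modeled as a single bottleneck with a fixed drop-tail queue, and every name and parameter below is hypothetical. Each loop iteration stands in for one 1-second step at a fixed window size.

```python
def scan_windows(bottleneck_pps: float, base_rtt_s: float,
                 queue_pkts: int, max_window_pkts: int):
    """Return (window, rate_pps, rtt_s, loss_fraction) for each window size."""
    bdp_pkts = bottleneck_pps * base_rtt_s  # packets the bare path can hold
    rows = []
    for window in range(1, max_window_pkts + 1):
        excess = max(0.0, window - bdp_pkts)   # packets that do not fit in the pipe
        lost = max(0.0, excess - queue_pkts)   # overflow beyond the queue is dropped
        queued = min(excess, queue_pkts)       # the rest waits in the queue
        rtt_s = base_rtt_s + queued / bottleneck_pps
        rate_pps = (window - lost) / rtt_s     # delivered packets per second
        rows.append((window, rate_pps, rtt_s, lost / window))
    return rows

# Assumed path: 1000 packets/s bottleneck, 10 ms base RTT, 5-packet queue.
for window, rate, rtt, loss in scan_windows(1000, 0.010, 5, 20):
    print(f"window={window:2d}  rate={rate:7.1f} pkt/s  "
          f"rtt={rtt * 1e3:5.1f} ms  loss={loss:.2f}")
```

In this model the rate rises linearly until the window reaches the bandwidth-delay product, then the RTT grows while the queue fills, and finally loss appears once the queue overflows. Those three regimes are what make a window scan useful for measuring bottleneck rate, queue space and loss in a single pass.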