Visual Flow Analysis: What do real-world
problems look like?
Brent Draney
NERSC Center Division, LBNL
2/07/06
What is NERSC?
• DOE scientific computer center
• Supports ~2000 scientists around the world
(mainly DOE and Universities)
• Supports most major disciplines
• Combined ~20-TFLOPS, 8.8 Petabytes
• 10 Gigabit LAN backbone and 10 Gigabit ESnet
uplink
• O(100) sockets account for ~95% of bytes
transferred
• O(5000) IP addresses in a single building but only
100 desktops
2
Network and Security
Team (NAST)
• Enablers and Inhibitors of the network in
one group
– All responsibility is here
• Networking is responsible for end-to-end
performance
– Wherever the customer is
– “Not our problem” is not sufficient or
acceptable
3
Performance tools
• Optical taps everywhere
• Mobile crashcart with all types of
interfaces
• tcpdump, tcptrace and xplot
• A lot of head scratching
Note: Analyzing a multi-gigabyte flow packet
by packet is impossible!
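One way to make a very large capture reviewable is to reduce it to a few hundred points before plotting. The sketch below is only an illustration of that idea, not the tooling used at NERSC: the function name `bucket_bytes`, the `(timestamp, payload_bytes)` input format, and the sample values are all hypothetical.

```python
from collections import defaultdict


def bucket_bytes(packets, interval=1.0):
    """Sum payload bytes per time interval.

    packets: iterable of (timestamp_seconds, payload_bytes) tuples
             (a hypothetical, already-parsed form of a capture).
    Returns a sorted list of (bucket_start_seconds, total_bytes).
    """
    totals = defaultdict(int)
    for ts, nbytes in packets:
        bucket = int(ts // interval)  # index of the interval this packet falls in
        totals[bucket] += nbytes
    return [(b * interval, totals[b]) for b in sorted(totals)]


# Made-up sample: four packets, with a gap between 1.2 s and 3.7 s.
sample = [(0.10, 1448), (0.40, 1448), (1.20, 1448), (3.70, 2896)]
summary = bucket_bytes(sample, interval=1.0)
for start, total in summary:
    print(f"{start:6.1f}s  {total} bytes")
```

Intervals with no traffic simply do not appear in the result, so a stalled transfer shows up as a gap in the plotted series.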
4
Simple Example
Consistent Slope
No anomalies
Protocol limited
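A consistent slope with no anomalies is what a window-limited transfer tends to look like: the sender can have at most one window of data in flight per round trip. The arithmetic can be sketched as below, assuming the limit is the TCP window (one common protocol limit; the slide does not say which limit applied here). The 64 KiB window, 50 ms RTT, and 10 Gbit/s target are hypothetical values chosen for illustration.

```python
def window_limited_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(target_bps, rtt_seconds):
    """Bandwidth-delay product: window needed to sustain target_bps."""
    return target_bps * rtt_seconds / 8


# Hypothetical values, not taken from the slides.
limit = window_limited_throughput_bps(64 * 1024, 0.050)
needed = window_needed_bytes(10e9, 0.050)
print(f"64 KiB window at 50 ms RTT: about {limit / 1e6:.1f} Mbit/s")
print(f"Window for 10 Gbit/s at 50 ms RTT: about {needed / 1e6:.1f} MB")
```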
5
Simple Example Detail
Sender Advertised Window
Packets
ACK’ed data
6
Brick Wall Example
Transfer Hangs
Few anomalies
7
Brick Wall Detail
One Dropped packet
3 Duplicate ACKs
No Retransmit, Ever
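The pattern on this slide (three duplicate ACKs that are never answered by a retransmit) can also be searched for mechanically. The following is a minimal sketch, assuming the capture has already been reduced to a time-ordered list of `("data", seq)` and `("ack", ack_number)` tuples; that event format, the function name, and the sample trace are hypothetical.

```python
def unanswered_dup_acks(events, threshold=3):
    """Return ACK numbers that were duplicated `threshold` times and never
    followed by a data segment starting at that sequence number.

    events: time-ordered list of ("data", seq) or ("ack", ack_number).
    """
    stalled = []
    last_ack, dups = None, 0
    for i, (kind, value) in enumerate(events):
        if kind != "ack":
            continue
        if value == last_ack:
            dups += 1                  # same ACK number seen again
        else:
            last_ack, dups = value, 0  # new ACK number, reset the count
        if dups == threshold:
            # Look for a later segment that starts at the missing byte.
            retransmitted = any(
                k == "data" and v == value for k, v in events[i + 1:]
            )
            if not retransmitted:
                stalled.append(value)
    return stalled


# Made-up receiver-side view: the segment starting at 2000 never arrives.
trace = [
    ("data", 1000), ("ack", 2000),
    ("data", 3000), ("ack", 2000),
    ("data", 4000), ("ack", 2000),
    ("data", 5000), ("ack", 2000),
]
print(unanswered_dup_acks(trace))
```

Running the same check on a capture taken on each side of a middlebox would show whether the retransmit left the sender but never reached the receiver.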
8
Brick Wall Example
Troubleshooting and Answer
• Troubleshooting
– Sender verifies that retransmits are sent
– “Non-tuned” traffic never fails
• Answer
– A stateful firewall tracking TCP sequence
numbers didn’t believe that the retransmits
were legitimate
9
Perverse Example
Holy Mackerel!
Jumbo Packets
Retransmits
10
Perverse Example
Is PMTU working? Yes
[Scratch Head]
11
Perverse Example
Troubleshooting and Answer
• Troubleshooting
– Review sender configuration
– PMTU installed in routing table correctly? Yes
– tcpdump on host shows 64K packets leaving a 9K
interface
– “Large Send” enabled, offloading packet creation to the NIC
• Answer
– NIC doesn’t have access to routing table
• Route MTU not honored
– Retransmits handled by kernel
• Route MTU honored
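The asymmetry in the answer can be illustrated with a small segmentation sketch: the same 64K send is cut up once using the interface MTU (what a NIC without routing-table access would use) and once using the route MTU (what the kernel uses for retransmits). The 9000-byte interface MTU follows the slide; the 1500-byte route MTU and the 40-byte header size (IPv4 plus TCP without options) are assumptions made for illustration, as is the function name.

```python
HEADER_LEN = 40  # assumption: IPv4 + TCP headers, no options


def segment(payload_len, mtu, header_len=HEADER_LEN):
    """Split a payload into per-packet payload sizes that fit within mtu."""
    mss = mtu - header_len
    full, rest = divmod(payload_len, mss)
    return [mss] * full + ([rest] if rest else [])


INTERFACE_MTU = 9000  # from the slide: a 9K interface
ROUTE_MTU = 1500      # assumption: path MTU recorded for this route
LARGE_SEND = 65536    # the 64K "packet" handed to the NIC

nic_segments = segment(LARGE_SEND, INTERFACE_MTU)  # NIC ignores the route MTU
kernel_segments = segment(LARGE_SEND, ROUTE_MTU)   # kernel honors the route MTU

# Packets that are too big for the path and would be dropped or rejected.
oversized = [s for s in nic_segments if s + HEADER_LEN > ROUTE_MTU]
print(f"NIC: {len(nic_segments)} packets, {len(oversized)} exceed the route MTU")
print(f"Kernel: {len(kernel_segments)} packets, largest payload {max(kernel_segments)}")
```

Under these assumptions every NIC-built packet is too large for the path, while every kernel-built retransmit fits, which matches the retransmit-heavy trace described on the earlier slides.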
12
Conclusions
• Diverse problems have the same general feel of
poor performance.
• Flow visualization can isolate problems quickly.
• Very large flows require visualization.
• Protocol limits (host buffers, sftp …) are still a
major cause but are becoming less so.
• New and “creative” methods to achieve higher
performance can create strangeness and are
becoming more of a problem.
• Seeing is believing. Pictures are convincing (to
users, system admins and network admins).
13