
Lessons Learned from Real Life
Jeremy Elson
National Institutes of Health
November 11, 1998
quick bio
• Hi, I’m Jeremy. Nice to meet you.
• 1996: BS Johns Hopkins, Comp Sci
• Sep 96 - Sep 98: Worked at NIH full-time
– Led software development effort on a small team
developing an ATM-based telemedicine system
called the Radiology Consultation WorkStation
(RCWS)
• Sep 98: Decided to return to school full-time
• Nov 98: Gave a talk to dgroup about interesting lessons learned during development of the RCWS
my talk
• Very quick description of the RCWS
– In future dgroups, I can give a talk about the
RCWS, or about ATM, if there is interest
• Some pitfalls and fallacies in networking I
discovered while developing the RCWS
• Techniques for network problem solving
Radiology Consultation
Workstation Network
RCWS Block Diagram
an unintended test
• Initial Configuration: 2 Sparc 20’s w/50MHz CPUs; Solaris 2.5.1; Efficient Networks ATM NICs @155 Mbps; LattisCell 10114-SM switch
– TTCP memory-to-memory: 60 Mbps
• Upgrade to 75MHz chips, otherwise identical
– TTCP now reports 90 Mbps!
• A 50% increase in CPU clock speed led to exactly a 50% increase in network throughput (see the quick check below)
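A quick sanity check on the arithmetic: 75 MHz / 50 MHz = 1.5, and 90 Mbps / 60 Mbps = 1.5. Throughput scaled linearly with CPU clock speed -- exactly what you would expect if the CPU, not the 155 Mbps link, were the bottleneck in both configurations.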
pitfall: infinite CPU
• In many systems, the network is the bottleneck;
we have “infinite” CPU in comparison. We try
to use CPU to save network bandwidth:
– Compression
– Multicast
– Caching (sort of)
– Micronet design
• Pitfall: Assuming this is always true. In our ATM app, compression might slow it down! (See the back-of-the-envelope check below.)
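One back-of-the-envelope way to see when this pitfall bites: with link bandwidth B, compression throughput C, and compression ratio r (compressed size / original size), sending S bytes compressed takes S/C + rS/B versus S/B uncompressed. Compression wins only when C > B / (1 - r). On a 155 Mbps ATM link, a compressor that halves the data (r = 0.5) must itself run faster than 310 Mbps to be a net win; on a 56 Kbps modem line, the same compressor only has to beat 112 Kbps, which is trivial.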
a surprising outcome
• There are various ways of doing IP over ATM
– “Classical IP” MTU ~9K
– “LANE” MTU 1500 bytes (for Ethernet bridging)
• Which would you expect would have better
bulk TCP performance, and by how much?
• Classical IP did better -- by a factor of ~5! I
didn’t believe it at first.
• Turned out that both were sending roughly the same packets/sec; CLIP: more bytes/packet (see the MTU arithmetic below)
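The factor of ~5 is close to the MTU ratio: assuming the usual Classical IP default MTU of 9180 bytes, 9180 / 1500 ≈ 6. If both configurations max out at roughly the same packets per second, throughput should scale with bytes per packet -- which is exactly the next slide's pitfall.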
pitfall: networks run
out of bandwidth first
• The number of bytes per second is only one
metric; consider packets per second also. This
is sometimes the wall you hit first.
• Fixed packet processing cost appears to far
outweigh the incremental cost to transmit
more bytes as part of the same packet
• This fits nicely with the previous observation:
CPU is only fast enough for n packets/sec
• This is old news to Cisco, backbone ISPs, etc.
pathological networks
• We built an on-campus ATM network and
bought access to a MAN (ATDnet), but the
only WAN available was the ACTS satellite
• Our network was very long and very fat: OC3
(155 Mb/sec) over satellite (500ms RTT).
• We were expecting standard LFN-related problems (LFN: “long fat network”); the solutions are fairly well-known (window scaling, PAWS, SACK, etc.)
• What surprised me was something else:
interactive performance!
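(To make the “long and fat” part concrete: the bandwidth-delay product is 155 Mbps × 0.5 s ≈ 9.7 MB of data in flight. Without RFC 1323 window scaling, TCP's 64 KB maximum window caps throughput at 64 KB / 0.5 s ≈ 1 Mbps -- under 1% of the link rate.)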
ACTS Satellite
[Figure: a request/reply exchange bouncing off the ACTS satellite. To perform actions such as screen updates, requests must go through a server, so user response time will be ~RTT. Light takes ~1/8 of a second from Earth to a geostationary satellite, making the RTT ~1/2 second (plus ground switching delay & queuing delay).]
the best laid plans
• Requests are small messages (<100 bytes)
transmitted using TCP over ATM
• Everything seemed to work fine on-campus
• Over the satellite, we were expecting to see
delays of 1/2 sec in command execution
• Instead we saw >1 second delays: much
more than we were expecting & hard to use.
Uh oh.
• My job (with 2 hours of satellite time ticking away…):
figure out why this was happening
the answer: tcpdump
• ‘tcpdump’ is a packet-sniffer written by
Steve McCanne, Craig Leres, and Van
Jacobson at LBL
• Monitors a LAN in realtime; prints info
about each packet (source/dest, sequence
numbers, flags, acknowledgements, options)
• Runs on most UNIX variants
• The most spectacularly fantastically
wonderful network debugging tool on planet
Earth; my knee-jerk reaction whenever there
is any problem is to fire this up first
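As a flavor of what tcpdump does under the hood, here is a minimal sketch using libpcap, the capture library tcpdump is built on; the interface name "le0" is a placeholder and error handling is abbreviated. It prints a timestamp and length for every packet seen on the wire:

    /* Minimal packet sniffer using libpcap, the library under tcpdump. */
    #include <pcap.h>
    #include <stdio.h>

    /* Called once per captured packet. */
    static void on_packet(u_char *user, const struct pcap_pkthdr *hdr,
                          const u_char *bytes)
    {
        /* Print the kernel's capture timestamp and the packet length. */
        printf("%ld.%06ld  %u bytes\n",
               (long)hdr->ts.tv_sec, (long)hdr->ts.tv_usec, hdr->len);
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        /* Open the interface in promiscuous mode, 96-byte snapshots,
           1-second read timeout. */
        pcap_t *p = pcap_open_live("le0", 96, 1, 1000, errbuf);
        if (p == NULL) {
            fprintf(stderr, "pcap_open_live: %s\n", errbuf);
            return 1;
        }
        pcap_loop(p, -1, on_packet, NULL);  /* capture until interrupted */
        pcap_close(p);
        return 0;
    }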
tick, tock, tick, tock...
At the application layer, messages are 70 bytes long.
[Timeline figure; client on the left, server on the right, time running downward:]
1. Client sends data to server
2. Server’s TCP stack ACKs that data (one RTT elapses here)
3. Client sends more data as soon as the ACK is received
4. Server ACKs the new 42 bytes
5. Server application has now received a complete 70-byte message; sends reply
6. Client TCP stack ACKs
7. Server sends new data after it receives the client’s ACK
8. Client TCP stack ACKs
USER SEES RESPONSE HERE
the nagle finagle
• Each application-layer message is split into
2 segments. Why?
– Because the app was calling write() twice
• For some reason, the second half isn’t sent
until the first half is ACKed! Why?
– The Nagle Algorithm, which says “don’t send a
tinygram if there is an outstanding tinygram.”
• Users had to wait 3 RTTs instead of 1
• Short term fix: turn off the Nagle Algorithm
(setsockopt TCP_NODELAY in Solaris)
• Long term fix: rewrite the message-passing library to use writev() instead of write(). (Both fixes are sketched below.)
pitfall: not caring how TCP and the app get along
• It’s easy to think of TCP as a generic way of
getting things from Here to There; sometimes,
if we look deeper, we find problems
• Good example: the study of HTTP’s interactions with TCP by Touch, Heidemann & Obraczka
• Of course, different TCP implementations react differently. (Maybe some TCP stacks wait before sending and would have hidden this.)
the big mystery
• Remember: 90 Mbps Sparc 20 to Sparc 20
• Scenario: Two machines doing FTP (to /dev/null)
– Machine A: Sun Ultra-1 running Solaris 2.5.1,
155 Mbps fiber ATM NIC
– Machine B: Fast Pentium-II running Windows
NT 4.0, 25 Mbps UTP ATM NIC
– Using LANE, 1500 byte MTU
• Transmitting from A to B: 23 Mbps
• Transmitting from B to A: 8 Mbps!! Why?
tcpdump to the rescue
[Timeline figure; machine A on the left, machine B on the right, time running downward:]
1. Window advertisement from receiver
2. MSS-sized segment from sender
3. Smaller segment from sender
4. Another MSS
   … more segments (not shown)
5. Another segment from sender
6. Long quiet time - no activity
7. Receiver finally ACKs
8. Cycle starts again
observations about
our mystery
• Sending A to B (the 23 Mbps case), the sender generated only MSS-sized segments; B to A did not. (Could account for some slowdown.)
• The ACKs from A all came at very regular
intervals (~50ms)
• Data came quickly (say, all in about 20ms)
followed by long quiet time (say, 30ms)
• What’s going on?
deferred ACKs
• When we receive data, we wait a certain
interval before sending an ACK
• This attempts to reduce traffic generated by
interactive (keystroke) activity by hoping a
new window and/or data will be ready, too
• We don’t want to do this with bulk data (detected as 3 MSS-sized segments in a row)
keystrokes: the worst case
Assume both sides are initially advertising Win = 100
[Timeline figure; user on the left, server on the right, time running downward:]
1. User types a character
2. TCP stack sends ACK
3. telnet daemon wakes up; reads char
4. telnet daemon sends echoed char
5. TCP stack sends ACK
6. telnet client wakes up; reads char
keystrokes: what we want
Assume both sides are initially advertising Win = 100
[Timeline figure; user on the left, server on the right, time running downward:]
1. User types a character
2. Deferred ACK interval: don’t send an ACK right away; wait, and hope that we have a new window and echoed char ready
3. telnet daemon sends ACK of received char, echoed char, and open window
4. telnet client wakes up; reads char
another look at the trace
[Timeline figure; machine A on the left, machine B on the right, time running downward:]
1. Deferred ACK interval expires
2. MSS-sized segment from sender
3. Smaller segment from sender - which fools the receiver into thinking that we are not doing bulk data transfer
   … more segments (not shown)
4. Smaller segment from sender
   WINDOW IS NOW CLOSED
5. Long quiet time - no activity
6. Timer expires; receiver sends ACK
7. Cycle starts again
the mystery unmasked
• Only observable because all of the following were true (remove any one and the problem vanishes)
– Receiver using deferred ACKs
– Sender not sending all MSS-sized segments
– Bandwidth high enough and window small enough so that the window can be filled before the deferred ACK interval expires (rare at 10 Mbps)
• When I turned off the deferred ACKs on the
receiver, bandwidth jumped to 23 Mbps.
(Under Solaris this can be done with ndd)
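The tunable in question is most likely tcp_deferred_ack_interval, set with ndd on /dev/tcp; its 50 ms default would also explain the suspiciously regular ~50 ms ACK spacing observed earlier. (Parameter name and default recalled from Solaris 2.x documentation, so treat this as an educated guess.)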
tcpdump: our best friend
• Virtually impossible to figure out problems
like the previous one by just puzzling it out
• Reading about how protocols work is a
good starting point; implementing them
gives you even more. But…
• Nothing gave me more intimate knowledge
of TCP than seeing it come alive. Not
looking at high level behavior, but actually
watching packets fly across the wire
• Different stacks have different personalities
• TCP/IP Illustrated v1 is a great place to learn how TCP behaves on the wire (it teaches through annotated tcpdump traces)
other uses of tcpdump
• Keeping my ISDN router from dialing
• Widespread teardrop attack on NIH (I patched
tcpdump to make this easier)
• Netscape SYN bug
• Samba hitting DNS
• Inoculan directed broadcasts
• Diagnosing dead and/or segmented networks
• Even rough performance measurement
• The network people thought I was a magician!
summary:
lessons learned
I. Thou shalt not assume that thy CPU is infinite in power, for thy network may indeed be more plentiful.
II. Thou shalt take mind of the number of packets thou sendeth to thy network; for, yea, a multitude thereof may wreak havoc thereupon.
summary:
lessons learned
III. Thou shalt read the Word of Stevens in
TCP/IP Illustrated, and become learned
in the ways of tcpdump, so that thy days
of network debugging shall be pleasant
and brief.
IV. Thou shalt watch carefully the packets that thy applications create, so that TCP may be thy servant and not thy taskmaster.
that’s all, folks!