Transport layer - TCP

Download Report

Transcript Transport layer - TCP

Reliable Byte-Stream (TCP)
Outline
Connection Establishment/Termination
Sliding Window Revisited
Flow Control
Adaptive Timeout
Spring 2010
CS 332
1
End-to-End Protocols
• Underlying best-effort network
–
–
–
–
–
drops messages
re-orders messages
delivers duplicate copies of a given message
limits messages to some finite size
delivers messages after an arbitrarily long delay
Spring 2010
CS 332
2
End-to-End Protocols
• Common end-to-end services
–
–
–
–
–
–
–
guarantee message delivery
deliver messages in the same order they are sent
deliver at most one copy of each message
support arbitrarily large messages
support synchronization (between sender and receiver)
allow the receiver to flow control the sender
support multiple application processes on each host
Spring 2010
CS 332
3
Simple Demultiplexer (UDP)
• Extends host-to-host service into process-toprocess
• Unreliable and unordered datagram service
• Adds multiplexing
• No flow control
• Endpoints identified by ports (why not PID?)
– servers have well-known ports (clients don’t need this)
• Often just starting point
– see /etc/services on Unix/Linux
– Implemented as message queue
Spring 2010
CS 332
4
Simple Demultiplexer (UDP)
• Header format
– Note 16 bit port number (so only 64K ports)
– Process really identified via <port,host> pair
• Checksum (optional in IPv4, mandatory in IPv6)
– pseudo header + UDP header + data
• Pseudo header:
Protocol number
Source IP
Dest IP
UDP length field
0
16
31
SrcPort
DstPort
Checksum
Length
Data
Why?
Spring 2010
CS 332
5
TCP Overview
• Connection-oriented
• Byte-stream
• Full duplex
• Flow control: keep sender
from overrunning receiver
• Congestion control: keep
sender from overrunning
network
– app writes bytes
– TCP sends segments
– app reads bytes
Application process
Application process
…
…
Write
bytes
TCP
Send buffer
Segment
Read
bytes
TCP
Receive buffer
Segment
…
Segment
Transmit segments
Spring 2010
CS 332
6
Flow Control vs Congestion Control
• Flow Control
– Prevent sender from overloading receiver
– End-to-end issue
• Congestion Control
– Prevent too much data from being injected into network
– Concerned with how hosts and network interact
Spring 2010
CS 332
7
Data Link Reliability (text 2.5)
Wherein we look at reliability issues on a point-to-point
link! Error correcting codes can’t handle all possible
errors (without introducing lots of overhead--including this
is not designing for normal situation), so badly garbled
frames are dropped. We need a way to recover from these
lost frames.
Spring 2010
CS 332
8
Acks and Timeouts
• Acknowledgement (ACK)
– Small frame sent to peer indicating receipt of frame
– No data
– Piggybacking
• Timeout
– If ACK not received within reasonable time, original
frame is retransmitted
• Automatic Repeat Request (ARQ)
– General strategy of using ACKS and timeouts to
implement reliable delivery
Spring 2010
CS 332
9
Acknowledgements & Timeouts
Spring 2010
CS 332
10
Acknowledgements & Timeouts
Spring 2010
CS 332
11
A Subtlety…
• Consider scenarios (c) and (d) in previous slide.
– Receiver receives two good frames (duplicate)
– It may deliver both to higher layer protocol (not good!)
– Solution: 1-bit sequence number in frame header
Spring 2010
CS 332
12
Stop-and-Wait
Sender
Receiver
• Problem: keeping the pipe full
• Example
– 1.5Mbps link x 45ms RTT = 67.5Kb (8KB)
– 1KB frames implies 1/8th link utilization (Next slide)
Spring 2010
CS 332
13
Bandwidth x Delay Product
• Sending a 1KB packet in 45ms implies sending at
rate of (1024 x 8)/0.045 = 182 Kbps, or 1/8 of
bandwidth.
• Bandwidth-delay: The number of bits that fits in
the pipe in a single round trip. (I.e. the amount of
data that could be “in transit” at any given time.)
• Goal: Want to be able to send this much data
before getting first ACK. (keeping the pipe full)
Spring 2010
CS 332
14
Sliding Window
• Allow multiple outstanding (un-ACKed) frames
• Upper bound on un-ACKed frames, called window
…
Receiver
…
Time
Sender
Spring 2010
CS 332
15
Sliding Window: Sender
• Assign sequence number to each frame (SeqNum)
• Maintain three state variables:
– send window size (SWS)
– last acknowledgment received (LAR)
– last frame sent (LFS)
• Maintain invariant: LFS - LAR ≤ SWS
 SWS
…
…
LAR
LFS
• Advance LAR when ACK arrives
• Buffer up to SWS frames (must be prepared to retransmit
frames until they are ACKed)
Spring 2010
CS 332
16
Sliding Window: Receiver
• Maintain three state variables
– receive window size (RWS) (upper bound on # out-of-order frames)
– largest frame acceptable (LFA) (sequence # of)
– last frame received (LFR)
• Maintain invariant: LFA - LFR ≤ RWS
 RWS
…
…
LFA
LFR
• Frame SeqNum arrives:
– if LFR < SeqNum ≤ LFA
accept
– if SeqNum ≤ LFR or SeqNum > LFA discard
• Send cumulative ACKs
Spring 2010
CS 332
17
Note:
• When packet loss occurs, pipe is no longer kept
full!
• Longer it takes to notice lost packet, worse the
condition becomes
• Possible solutions:
– Send NACKs
– Selective acknowledgements (just ACK exactly those
frames received, not highest frame received)
– Not used: too much added complexity
Spring 2010
CS 332
18
Sequence Number Space
• SeqNum field is finite; sequence numbers wrap
around
• Sequence number space must be larger than
number of outstanding frames (I.e. stop-and-wait
had sequence # space size == 2)
– I.e. if sequence number space is of size 8 (say 0..7), and
number of outstanding frames is allowed to be 10, then
sender can send sequence numbers 0,1,2,3,4,5,6,7,0,1
all at once. Now if receiver sends back an ACK with
sequence number 1, which packet 1 is it ACKing?
Spring 2010
CS 332
19
Sequence Number Space
• Even SWS < SequenceSpaceSize is not sufficient
–
–
–
–
–
–
suppose 3-bit SeqNum field (0..7) (so SequenceSpaceSize = 8)
Let SWS=RWS=7
sender transmit frames 0..6
Frames arrive successfully, but ACKs are lost
sender retransmits 0..6
receiver expecting 7, 0..5, but believes it is receiving the second
incarnation of 0..5 (because the receiver has at this point updated its
various pointers)
• SWS ≤ (SequenceSpaceSize+1)/2 is rule (if
SWS=RWS)
• Intuitively, SeqNum “slides” between two halves of sequence
number space
Spring 2010
CS 332
20
Easy to overlook…
• Relationship between window size and sequence
number space depends on assumption that frames
are not reordered in transit (easy to assume on
point-to-point link).
Spring 2010
CS 332
21
Back to Chapter 5…
Spring 2010
CS 332
22
Data Link Versus Transport
• Transport potentially connects many different hosts
– need explicit connection establishment and termination
• Transport has potentially different RTT (over different
routes and at different times, even on scale of minutes)
– need adaptive timeout mechanism
• Transport has potentially long delay in network
– need to be prepared for arrival of very old packets
• Transport has potentially different capacity at destination
– need to accommodate different node capacity
• Transport has potentially different network capacity
– need to be prepared for network congestion
Spring 2010
CS 332
23
The “End-to-End” Argument
• Consider TCP vs X.25
• TCP: Consider underlying IP network unreliable
and use sliding window to provide end-to-end inorder reliable delivery
• X.25: Use sliding window within network on hopby-hop basis (which should guarantee end-to-end).
Several problems with this:
– No guarantee that added hop preserves service
– In link from A to B to C, no guarantee that B behaves
perfectly (nodes known to introduce errors and mix
packet order)
Spring 2010
CS 332
24
End-to-End
• “A function should not be provided in the lower
levels of the system unless it can be completely
and correctly implemented at that level”
• Does allow for functions to be incompletely
provided at lower levels for optimization
– E.g. detecting and retransmitting single corrupt packet
across one hop preferable to retransmitting entire file
end-to-end.
• See reading assignment on class homework page
Spring 2010
CS 332
25
Segment Format
0
10
4
16
31
SrcPort
DstPort
SequenceNum
Acknow ledgment
HdrLen
0
Flags
AdvertisedWindow
Checksum
UrgPtr
Options (variable)
Data
Spring 2010
CS 332
26
Segment Format (cont)
• Each connection identified with 4-tuple:
– (SrcPort, SrcIPAddr, DestPort, DestIPAddr)
• Sliding window and flow control
– acknowledgment, SequenceNum, AdvertisedWindow
Data(SequenceNum)
Sender
• Flags
Receiver
Acknowledgment +
AdvertisedWindow
– SYN, FIN, RESET, PUSH, URG, ACK
• Checksum
– pseudo header + TCP header + data
Spring 2010
CS 332
27
Connection Establishment and
Termination
Active participant
(client)
Passive participant
(server)
Note: SequenceNum
contains the sequence
number of the first
data byte contained
in the segment. ACK
field always gives the
sequence number of
the next data byte
expected. (Except for
the SYN segments)
Spring 2010
CS 332
28
State Transition Diagram
CLOSED
Active open/SYN
Passive open
Close
Close
Opening
connection
LISTEN
SYN_RCVD
SYN/SYN + ACK
Send/SYN
SYN/SYN + ACK
ACK
Close/FIN
SYN + ACK/ACK
FIN/ACK
FIN_WAIT_1
CLOSE_WAIT
FIN/ACK
ACK
Close/FIN
FIN_WAIT_2
CLOSING
FIN/ACK
Spring 2010
event/action
ESTABLISHED
Close/FIN
Closing
connection
SYN_SENT
ACK Timeout after two
segment lifetimes
TIME_WAIT
CS 332
LAST_ACK
ACK
CLOSED
29
Sliding Window Revisited
Sending application
Receiving application
TCP
TCP
LastByteWritten
LastByteAcked
LastByteSent
• Sending side
– LastByteAcked ≤
LastByteSent
– LastByteSent ≤
LastByteWritten
– buffer bytes between
LastByteAcked and
LastByteWritten
Spring 2010
LastByteRead
NextByteExpected
LastByteRcvd
• Receiving side
– LastByteRead <
NextByteExpected
– NextByteExpected ≤
LastByteRcvd +1
– buffer bytes between
LastByteRead and
LastByteRcvd
CS 332
30
Flow Control
• Send buffer size: MaxSendBuffer
• Receive buffer size: MaxRcvBuffer
• Receiving side
– LastByteRcvd - LastByteRead ≤ MaxRcvBuffer
– AdvertisedWindow = MaxRcvBuffer - (LastByteRcvd LastByteRead)
• Sending side
– LastByteSent - LastByteAcked ≤ AdvertisedWindow
– EffectiveWindow = AdvertisedWindow - (LastByteSent LastByteAcked)
– LastByteWritten - LastByteAcked ≤ MaxSendBuffer
– block sender if (LastByteWritten - LastByteAcked) + y >
MaxSenderBuffer
Spring 2010
CS 332
31
Flow Control
• Always send ACK in response to arriving data segment
– This response contains latest Acknowledge and
AdvertisedWindow fields even if they haven’t changed
• Slow receiving process may cause AdvertisedWindow = 0.
• Problem: How does the sending side know when the
advertised window is no longer 0?
– It can’t get this info, since receiver only sends window advertisements
in response to received packets, and sender can’t send anything because
it believes the window size is zero.
• Solution: Persist when AdvertisedWindow = 0
– Periodically send a probe segment with one byte of data. Although
most won’t be accepted, they trigger responses, and eventually one will
come back with a nonzero advertised window.
Spring 2010
CS 332
32
Protection Against Wrap Around
• 32-bit SequenceNum
Bandwidth
T1 (1.5 Mbps)
Ethernet (10 Mbps)
T3 (45 Mbps)
FDDI, FastEther (100 Mbps)
OC-3 (155 Mbps)
OC-12 (622 Mbps)
OC-48 (2.5 Gbps)
Spring 2010
CS 332
Time Until Wrap Around
6.4 hours
57 minutes
13 minutes
6 minutes
4 minutes
55 seconds
14 seconds
33
Keeping the Pipe Full
• 16-bit AdvertisedWindow
Bandwidth
T1 (1.5 Mbps)
Ethernet (10 Mbps)
T3 (45 Mbps)
FDDI, FastEther (100 Mbps)
OC-3 (155 Mbps)
OC-12 (622 Mbps)
OC-48 (2.5 Gbps)
Spring 2010
CS 332
Results below assume
RTT of 100 ms, typical
for cross-country link
Delay x Bandwidth Product
18KB
122KB
549KB
1.2MB
1.8MB
7.4MB
29.6MB
34
TCP Extensions
• Implemented as header options
• Store timestamp in outgoing segments
• Extend sequence space with 32-bit timestamp:
PAWS (Protection Against Wrapped Sequence
Numbers)
• Shift (scale) advertised window
Spring 2010
CS 332
35
Nagle Algorithm
• 1 byte data segment generates 41 byte packets (20 for
IP header + 20 for TCP header).
• Small packets are called tinygrams
– On LANs, usually not an issue, but on WANs, this can be
a problem (it adds congestion)
• Solution: Nagle Algorithm (RFC 896, Nagle, 1984):
When a TCP connection has outstanding data that
has not yet been Acked, small segments cannot be
sent until the outstanding data is acknowledged.
Spring 2010
CS 332
36
Nagle Algorithm (continued)
• Nagle is self-clocking: the faster the ACKs come
back, the faster the data is sent. But on slow
WAN, where tinygrams can be a problem, fewer
segments are sent.
– Ex. On LAN, time for single byte to be sent, ACKed
and echoed is around 16ms. To generate data at this
rate, you need to be typing around 60 characters per
second (so on LAN you don’t kick in Nagle)
– On WAN, you’ll often kick in Nagle
Spring 2010
CS 332
37
Disabling the Nagle Algorithm
• Why would you want to?
– X Window system: small messages (mouse movements) need to be
delivered without delay
– Typing one of the terminals special function keys during interactive
login
• Function keys normally generate multiple bytes of data, beginning with
ASCII escape character. If TCP gets data a byte at a time, it can
potentially send first byte and then hold the rest of the characters. The
server wouldn’t generate the ACK until it received the rest of the
command, so Nagle would kick in, meaning rest of bytes not sent for
200ms, which can be a noticeable delay.
• With sockets API, the TCP_NODELAY option disables Nagle
• Host Requirements RFCs (1122, 1123) specify that there
must be a way for an app to disable Nagle on an individual
TCP connection.
Spring 2010
CS 332
38
Adaptive Retransmission
(Original Algorithm)
• Measure SampleRTT for each segment/ACK pair
• Compute weighted average of RTT
EstRTT  a  EstRTT  (1  a )  SampleRTT
 a between 0.8 and 0.9 (recommended value 0.9)
– Note a in this range has a strong smoothing effect
• Set timeout based on EstRTT
– TimeOut = 2 x EstRTT (rather conservative)
Spring 2010
CS 332
39
Karn/Partridge Algorithm
Sender
Receiver
SampleR TT
Receiver
SampleR TT
Sender
• Problem: ACK doesn’t acknowledge a transmission (it acks a
receive)
Why?
• Do not sample RTT when retransmitting
• Double timeout after each retransmission (exponential backoff)
Spring 2010
CS 332
40
A Problem
• Problem with both these approaches: they can’t
keep up with wide RTT fluctuations, thus causing
unnecessary retransmissions
• When the network is already loaded, unnecessary
retransmissions add to the network load (as
Stevens notes, “It is the network equivalent of
pouring gasoline on a fire”)
• What’s needed: keep track of the variance in RTT
measurements AND use smooth RTT estimator.
Spring 2010
CS 332
41
Jacobson/ Karels Algorithm
• New Calculations for average RTT
• Diff = sampleRTT - EstRTT
• EstRTT = EstRTT + ( g x Diff)
Note these
– Recommended value for g is 0.125
values?
– EstRTT is just the smoothed RTT as before
• Dev = Dev + h ( |Diff| - Dev)
– Recommended value for h is 0.25
– Dev is the smoothed mean deviation (easier to compute mean that
standard deviation, which requires a square root)
• TimeOut = EstRTT + 4 x Dev
– Larger gain for the deviation makes the TimeOut value increase
faster when the RTT changes.
• Notes
– algorithm only as good as granularity of clock (500ms on Unix)
– accurate timeout mechanism important to congestion control (later)
Spring 2010
CS 332
42
TCP Interactive Data Flow
• Material here is from TCP/IP Illustrated, Vol. 1
• Study by Caceres, et. al. (1991) :
– On a packet count basis, about half of all TCP segments
contain bulk data (ftp, email, Usenet news)
– Half contain interactive data (telnet, rlogin)
– On byte count basis, ratio is around 90% bulk transfer,
10% interactive.
– Bulk data tends to be full size (normally 512 bytes of
data), interactive is much smaller (90% of telnet and
rlogin packets carry less than 10 bytes of data).
Spring 2010
CS 332
43
More recent data
• 1997 MCI study
–
–
–
–
Web: 75% of bytes, 70% of packets
FTP: 5% of bytes, 3% of packets
SMTP (email): 5% of bytes, 5% of packets
DNS, NNTP, Telnet/rlogin: < 2% each
• 2000 CAIDA study
– Napster, streaming audio, online games show up
• 2004 top application?
– P2P: 62% of Internet traffic. BitTorrent: approx. 35%
(CacheLogic)
– Update – 2007, Cringley says 50% BitTorrent (5% of users)
Spring 2010
CS 332
44
Rlogin and Telnet
•
•
Surprisingly, each interactive keystroke typically
generates a packet (as opposed to a line
generating a packet).
Moreover, a single rlogin keystroke can generate
4 segments (though usually 3)
i. Interactive keystroke from client
ii. ACK of keystroke from server (typically piggybacked
in echo of data byte) see next slide
iii. Echo of data byte from server
iv. ACK of echoed byte from client
Spring 2010
CS 332
45
Delayed ACKs
• Normally, TCP does not send an ACK the instant it
receives data. Instead, it delays the ACK, hoping to
have data going in other direction on which it can
piggyback the ACK.
• Most implementations use a 200ms delay (delays
ACK up to 200ms before sending the ACK by itself)
• This is why in previous slide, ACK would normally
piggyback with the echoed character
Spring 2010
CS 332
46