Transcript 18-TL-TCP2
Reliable Data Transfer in Transmission Control Protocol (TCP)
TCP Data Transfer
TCP creates a reliable, ordered data transfer service on top of IP's unreliable service:
  Pipelined segments
  Cumulative ACKs
  Uses a single retransmission timer
  Retransmissions are triggered by:
    timeout events
    duplicate ACKs
  Window size controlled by the receiver and inferred from the network:
    Flow control
    Congestion control
Sender/Receiver Overview
[Figure: sliding-window view. The sender tracks Last ACK Received and Last Frame Sent; the sender window spans Sent & Acked, Sent Not Acked, OK to Send, and Not Usable regions. The receiver tracks Next Frame Expected and Last Frame Acceptable; the receiver window spans Received & Acked, Acceptable Packet, and Not Usable regions.]
TCP Sender/Receiver Invariants
[Figure: the sending application writes into the TCP send buffer, tracked by LastByteWritten, LastByteSent, and LastByteAcked; the receiving application reads from the TCP receive buffer, tracked by LastByteRead, NextByteExpected, and LastByteRcvd.]
Snd: LastByteAcked ≤ LastByteSent ≤ LastByteWritten
Rcv: LastByteRead < NextByteExpected ≤ LastByteRcvd + 1
     LastByteAcked < NextByteExpected ≤ LastByteSent + 1
TCP sender (simplified)

NextSeqNum = InitialSeqNum + 1
SendBase = InitialSeqNum + 1   /* == LastByteAcked + 1 */

loop (forever) {
  switch(event)

  event: data received from application above
    1. create TCP segment with sequence number NextSeqNum
    2. if (timer currently not running)
         start timer – timeout after TimeOutInterval
    3. pass segment to IP
    4. NextSeqNum = NextSeqNum + length(data)

  event: timer timeout
    1. retransmit not-yet-acknowledged segment with smallest sequence number
    2. start timer – timeout after TimeOutInterval

  event: ACK received, with ACK field value of y
    1. if (y > SendBase) {
         SendBase = y
         if (there are currently not-yet-acknowledged segments)
           restart timer
       }
} /* end of loop forever */
Comment: SendBase − 1 is the last cumulatively ACKed byte, i.e., the receiver has received all bytes up to SendBase − 1 and is expecting the byte starting at SendBase.
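For concreteness, here is a minimal Python sketch of this simplified sender, assuming a send_segment callback and a single retransmission-timer object supplied by the environment; the class and method names are illustrative, not part of any real TCP stack, and stopping the timer when nothing is outstanding is a small addition beyond the slide's pseudocode.

# Minimal sketch of the simplified sender above (illustrative only).
# Assumes: send_segment(seq, data) hands a segment to IP; timer is an object
# with running()/start()/restart()/stop(), supplied by the environment.

class SimpleSender:
    def __init__(self, initial_seq_num, send_segment, timer):
        self.next_seq = initial_seq_num + 1
        self.send_base = initial_seq_num + 1      # == LastByteAcked + 1
        self.unacked = {}                         # seq -> data, kept for retransmission
        self.send_segment = send_segment
        self.timer = timer

    def on_app_data(self, data):
        # Create a segment, start the (single) timer if idle, pass it to IP.
        self.send_segment(self.next_seq, data)
        self.unacked[self.next_seq] = data
        if not self.timer.running():
            self.timer.start()
        self.next_seq += len(data)

    def on_timeout(self):
        # Retransmit the not-yet-acknowledged segment with the smallest seq number.
        oldest = min(self.unacked)
        self.send_segment(oldest, self.unacked[oldest])
        self.timer.start()

    def on_ack(self, y):
        # Cumulative ACK: y acknowledges all bytes below y.
        if y > self.send_base:
            self.send_base = y
            self.unacked = {s: d for s, d in self.unacked.items() if s + len(d) > y}
            if self.unacked:
                self.timer.restart()
            else:
                self.timer.stop()                 # nothing outstanding (not in the slide's pseudocode)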
TCP: retransmission scenarios
[Figure: two Host A / Host B timelines. Left, lost ACK scenario: the ACK for Seq=92 is lost, the timer expires, and the segment is retransmitted. Right, premature timeout: the Seq=92 timer expires before the ACKs arrive, causing an unnecessary retransmission. SendBase values of 100 and 120 mark the cumulative ACK progress in each scenario.]
TCP retransmission scenarios (more)
[Figure: Host A / Host B timeline, cumulative ACK scenario: an earlier ACK is lost, but a later cumulative ACK arrives before the timeout and advances SendBase to 120, so no retransmission is needed.]
TCP Receiver ACK generation [RFC 1122, RFC 2581]

Event at receiver: arrival of an in-order segment with the expected seq #; all data up to the expected seq # already ACKed.
Receiver action: delayed ACK – wait up to 500 ms for the next segment; if no next segment arrives, send an ACK (ACKs may be piggybacked).

Event at receiver: arrival of an in-order segment with the expected seq #; one other segment has an ACK pending.
Receiver action: immediately send a single cumulative ACK, ACKing both in-order segments.

Event at receiver: arrival of an out-of-order segment with a higher-than-expected seq #; gap detected.
Receiver action: immediately send a duplicate ACK, indicating the seq # of the next expected byte.

Event at receiver: arrival of a segment that partially or completely fills the gap.
Receiver action: immediately send an ACK, provided the segment starts at the lower end of the gap.
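As a rough illustration of this policy, here is a minimal Python sketch of the receiver-side ACK decision; the segment representation, the stubbed delayed-ACK timer, and the class name are assumptions made only for this example (overlapping segments are not handled).

# Minimal sketch of the receiver ACK-generation policy from the table above.
# Assumes segments arrive as (seq, data) on a byte stream; send_ack(n) sends a
# cumulative ACK for byte n; the 500 ms delayed-ACK timer is only stubbed.

class Receiver:
    def __init__(self, send_ack):
        self.next_expected = 0          # next in-order byte expected
        self.ack_delayed = False        # one in-order segment's ACK is pending
        self.ooo = {}                   # buffered out-of-order segments: seq -> data
        self.send_ack = send_ack

    def on_segment(self, seq, data):
        if seq == self.next_expected:
            # In-order arrival: consume it plus any buffered segments it releases.
            gap_existed = bool(self.ooo)
            self.next_expected += len(data)
            while self.next_expected in self.ooo:
                self.next_expected += len(self.ooo.pop(self.next_expected))
            if gap_existed or self.ack_delayed:
                # Fills a gap, or a second in-order segment: ACK immediately.
                self.ack_delayed = False
                self.send_ack(self.next_expected)
            else:
                # First in-order segment, nothing pending: delay the ACK
                # (a real receiver waits up to 500 ms before sending it).
                self.ack_delayed = True
        elif seq > self.next_expected:
            # Out-of-order, gap detected: buffer it and send a duplicate ACK.
            self.ooo[seq] = data
            self.send_ack(self.next_expected)
        else:
            # Already-received data: simply re-send the cumulative ACK.
            self.send_ack(self.next_expected)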
Fast Retransmit
The timeout period is often relatively long:
  long delay before resending a lost packet
Detect lost segments via duplicate ACKs:
  the sender often sends many segments back-to-back
  if a segment is lost, there will likely be many duplicate ACKs
If the sender receives 3 duplicate ACKs for the same data, it assumes that the segment after the ACKed data was lost:
  fast retransmit: resend the segment before the timer expires
Fast retransmit algorithm:

event: ACK received, with ACK field value of y
  if (y > SendBase) {
    SendBase = y
    if (there are currently not-yet-acknowledged segments)
      start timer
  }
  else {
    /* y == SendBase (y cannot be smaller than SendBase):
       a duplicate ACK for an already ACKed segment */
    increment count of dup ACKs received for y
    if (count of dup ACKs received for y == 3) {
      /* fast retransmit */
      resend segment with sequence number y == SendBase
    }
  }
TCP Flow Control
flow control: the sender won't overflow the receiver's buffer by transmitting too much, too fast
The receive side of the TCP connection has a receive buffer:
  the app process may be slow at reading from the buffer
speed-matching service: matching the send rate to the receiving app's drain rate
TCP Flow control: how it works
[Figure: the receive buffer of size MaxRcvBuffer, with LastByteRead, NextByteExpected, and LastByteRcvd marking how far the application has read, what is expected next, and what has arrived; the spare room is what gets advertised.]
The receiver advertises the spare room by including the value of RcvWindow in the segments it sends:
  RcvWindow = MaxRcvBuffer − [NextByteExpected − LastByteRead]
The sender limits unACKed data to RcvWindow, which guarantees the receive buffer doesn't overflow:
  LastByteSent − LastByteAcked ≤ RcvWindow
  SndWindow = RcvWindow − [LastByteSent − LastByteAcked]
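A small Python sketch of this bookkeeping, using the byte-pointer variables above; the buffer size and example numbers are illustrative assumptions, not part of any real TCP implementation.

# Sketch of flow-control window arithmetic, using the pointer variables above.
MaxRcvBuffer = 65_536          # illustrative receive-buffer size (bytes)

# Receiver side: advertise the spare room left in the buffer.
def rcv_window(NextByteExpected, LastByteRead):
    return MaxRcvBuffer - (NextByteExpected - LastByteRead)

# Sender side: how much more data may be sent without overflowing the receiver.
def snd_window(RcvWindow, LastByteSent, LastByteAcked):
    unacked = LastByteSent - LastByteAcked         # data already in flight
    assert unacked <= RcvWindow                    # invariant the sender maintains
    return RcvWindow - unacked

# Example: 8 KB already buffered at the receiver, 4 KB in flight.
adv = rcv_window(NextByteExpected=8_192, LastByteRead=0)             # 57,344 bytes spare
usable = snd_window(adv, LastByteSent=12_288, LastByteAcked=8_192)   # 53,248 bytes usable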
TCP Flow control issues
What happens if the advertised window is 0?
  The receiver updates the window when the application reads data
  What if this update is lost?  Deadlock
TCP Persist timer:
  The sender periodically sends window probe packets
  The receiver responds with an ACK and an up-to-date window advertisement
TCP flow control enhancements
Problem (Clark, 1982):
  If the receiver advertises small increases in the receive window, the sender may waste time sending lots of small packets
  This problem is known as "silly window syndrome":
    Receiver advertises a one-byte window
    Sender sends a one-byte packet (1 byte of data, 40 bytes of header = 4000% overhead)

Solving Silly Window Syndrome
Receiver avoidance [Clark (1982)]:
  Prevent the receiver from advertising small windows
  Increase the advertised receiver window only in increments of min(MSS, RecvBuffer/2)
Solving Silly Window Syndrome
Sender avoidance [Nagle's algorithm (1984)]:
  Prevent the sender from unnecessarily sending small packets
How long should the sender delay sending data?
  too long: hurts interactive applications
  too short: poor network utilization
  strategies: timer-based vs. self-clocking
When the application generates additional data (a sketch of this decision follows below):
  if it fills a max segment (and the window is open): send it
  else
    • if there is unACKed data in transit: buffer it until an ACK arrives
    • else: send it
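Here is a hedged Python sketch of that send-or-buffer decision (the self-clocking form of Nagle's algorithm); the MSS value, the window_open flag, and the queue handling are simplified placeholders introduced only for this example.

# Sketch of Nagle's sender-side decision (self-clocking, no timer).
MSS = 1460                      # illustrative maximum segment size (bytes)

def on_app_data(pending, data, window_open, unacked_in_transit, send):
    """pending: bytes buffered but not yet sent; returns the new pending buffer."""
    pending += data
    if len(pending) >= MSS and window_open:
        send(pending[:MSS])               # a full segment and the window is open: send it
        return pending[MSS:]
    if unacked_in_transit:
        return pending                    # small data with data in flight: hold it
    send(pending)                         # nothing in flight: send the small segment now
    return b""

def on_ack(pending, window_open, send):
    """When an ACK drains the pipe, flush whatever has accumulated."""
    if pending and window_open:
        send(pending[:MSS])
        return pending[MSS:]
    return pending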
Keeping the Pipe Full
With a 16-bit AdvertisedWindow and assuming a 100 ms RTT:

  Bandwidth             Delay × Bandwidth Product
  T1 (1.5 Mbps)         18 KB
  Ethernet (10 Mbps)    122 KB
  T3 (45 Mbps)          549 KB
  FDDI (100 Mbps)       1.2 MB
  STS-3 (155 Mbps)      1.8 MB
  STS-12 (622 Mbps)     7.4 MB
  STS-24 (1.2 Gbps)     14.8 MB
TCP extension to allow window scaling: put the scale factor in the TCP options field.
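The arithmetic behind the table and the need for window scaling can be checked with a small Python loop; the 100 ms RTT and the link rates are the assumptions stated above, and the shift calculation follows the RFC 1323 style power-of-two window scale.

# Delay x bandwidth product vs. the 16-bit AdvertisedWindow limit.
RTT = 0.100                                  # seconds, as assumed in the table
MAX_PLAIN_WINDOW = 2**16 - 1                 # 65,535 bytes without window scaling

for name, mbps in [("T1", 1.5), ("Ethernet", 10), ("T3", 45), ("FDDI", 100),
                   ("STS-3", 155), ("STS-12", 622), ("STS-24", 1200)]:
    pipe = mbps * 1e6 / 8 * RTT              # bytes needed in flight to fill the pipe
    # Smallest power-of-two scale factor (shift) that covers the pipe.
    shift = 0
    while (MAX_PLAIN_WINDOW << shift) < pipe:
        shift += 1
    print(f"{name:9s} {pipe/1024:8.0f} KB needed, window-scale shift >= {shift}")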
TCP Round Trip Time Estimation and Setting the Timeout

TCP Round Trip Time and Timeout
Q: how to set the TCP timeout value?
  longer than RTT, but RTT varies
  too short: premature timeout, unnecessary retransmissions
  too long: slow reaction to segment loss
Q: how to estimate RTT?
  SampleRTT: measured time from segment transmission until ACK receipt
    ignore retransmissions
  SampleRTT will vary; we want the estimated RTT to be "smoother"
    average several recent measurements, not just the current SampleRTT
TCP Round Trip Time and Timeout
EstimatedRTT = (1 − α) * EstimatedRTT + α * SampleRTT
  Exponentially weighted moving average
  influence of a past sample decreases exponentially fast
  typical value: α = 0.125
e.g., with Ai the estimated RTT at step i and Mi the sampled RTT at step i:
  A0 = M0
  A1 = (1 − α) M0 + α M1
  A2 = (1 − α)² M0 + (1 − α) α M1 + α M2
  ...
Example RTT estimation:
[Figure: SampleRTT and EstimatedRTT (milliseconds, roughly 100–350 ms) versus time (seconds) for the path gaia.cs.umass.edu to fantasia.eurecom.fr.]
TCP Round Trip Time and Timeout
Setting the timeout:
  EstimatedRTT plus a "safety margin"
    large variation in EstimatedRTT -> larger safety margin
  first estimate how much SampleRTT deviates from EstimatedRTT:
    DevRTT = (1 − β) * DevRTT + β * |SampleRTT − EstimatedRTT|
    (typically, β = 0.25)
  then set the timeout interval:
    TimeoutInterval = EstimatedRTT + 4 * DevRTT
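A minimal Python sketch of this estimator follows; α = 0.125 and β = 0.25 are the typical values quoted above, while the initialization of DevRTT and the example sample values are assumptions made only for illustration.

# EWMA round-trip-time estimator and timeout computation (values in seconds).
ALPHA, BETA = 0.125, 0.25

class RttEstimator:
    def __init__(self, first_sample):
        self.estimated = first_sample            # EstimatedRTT
        self.dev = first_sample / 2              # DevRTT (simple initial guess)

    def update(self, sample):
        """Feed one SampleRTT (samples from retransmitted segments should be ignored)."""
        self.dev = (1 - BETA) * self.dev + BETA * abs(sample - self.estimated)
        self.estimated = (1 - ALPHA) * self.estimated + ALPHA * sample

    def timeout_interval(self):
        return self.estimated + 4 * self.dev

# Example: a slightly jittery sequence of samples.
est = RttEstimator(0.100)
for s in (0.106, 0.160, 0.120, 0.095):
    est.update(s)
print(round(est.timeout_interval(), 3))          # EstimatedRTT + 4 * DevRTT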
TCP Congestion Control

Congestion Control Overview
Congestion, informally: "too many sources sending too much data too fast for the network to handle"
Signal and detect congestion
Policy for the source to adjust its transmission rate to match the network's bandwidth capacity:
  decrease the rate upon a congestion signal
  increase the rate for utilization
  initialization to reach steady state

Network Utilization
Queuing delay could (theoretically) approach infinity with increased load
Network power = ratio of throughput to delay
[Figure: throughput/delay and queuing delay versus load; the "knee" of the curve marks the optimal load.]
Approaches towards congestion control
Two broad approaches towards congestion control:
End-end congestion control:
  no explicit feedback from the network
  congestion inferred from end-system observed loss and delay
  approach taken by TCP
Network-assisted congestion control:
  routers provide feedback to end systems
    a single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
    an explicit rate at which the sender should send
TCP Congestion Control
end-end control (no network assistance)
The sender limits transmission:
  sndWindow = min(cwnd, rcvWin)
  LastByteSent − LastByteAcked ≤ sndWindow
cwnd is a dynamic function of perceived network congestion
How does the sender perceive congestion?
  loss event = timeout or 3 duplicate ACKs
  the TCP sender reduces its rate (cwnd) after a loss event
TCP Congestion Control Outline
Basic idea: probe (test) the currently available bandwidth in the network and adjust your sending rate accordingly.
1. Start with a very small sending rate -- 1 packet
2. Increase your sending rate as long as no congestion is detected
   • How do you detect "no congestion"? – No packet loss
   • How do you increase your sending rate?
3. Decrease your sending rate when you detect congestion
   • How do you detect congestion? – Packet loss (timeout, 3 duplicate ACKs)
   • How do you decrease your sending rate?
TCP Slow Start
When the connection begins, cwnd = 1 MSS
  Example: MSS = 500 bytes and RTT = 200 msec, so the initial rate = 20 kbps
  the available bandwidth may be >> MSS/RTT
  desirable to quickly ramp up to a respectable rate
When the connection begins, increase the rate exponentially fast until the first loss event
When a loss occurs (congestion signal), set cwnd to 1 MSS and restart slow start
TCP Slow Start (more)
When the connection begins, increase the rate exponentially until the first loss event:
  double cwnd every RTT
  done by incrementing cwnd by 1 MSS for every ACK received
Summary:
  the initial rate is slow but ramps up exponentially fast
  when the first packet loss occurs (either a timeout or 3 duplicate ACKs), set cwnd to 1 MSS and start over
[Figure: Host A / Host B timeline showing cwnd doubling every RTT.]
TCP After the First Packet Loss
Question: Should we continue doing slow start throughout the lifetime of the TCP connection?
Why did we increase cwnd exponentially at the beginning?
  • Because we had no idea how much of our traffic the network could carry, so we needed to probe fast to figure it out
What do we know at the first packet loss event?
  • That the network cannot carry our traffic at the rate we had when the packet loss occurred
    – What was our rate at the time the packet loss occurred?  cwnd/RTT
Refinement idea: use the knowledge obtained at the time of the packet loss to refine the slow start algorithm. How?
  Keep a threshold value, ssthresh, set to 1/2 of cwnd just before the loss event.
  cwnd increases exponentially until it reaches ssthresh, and linearly afterwards – this is called congestion avoidance
Congestion Avoidance – TCP Tahoe
Q: When should the exponential increase switch to linear?
A: When cwnd gets to 1/2 of its value before the timeout.
Implementation:
  Threshold variable ssthresh
    set to some large value, e.g. 65 KB, when the connection is established
    at a loss event, ssthresh is set to 1/2 of cwnd just before the loss event
[Figure: congestion window size (segments) versus transmission round for TCP Tahoe; the figure assumes the first packet loss occurred when cwnd was 16, so ssthresh = 8 and cwnd is reset to 1.]
Per-ACK window update (cwnd in units of MSS):
  if (cwnd < ssthresh)
    cwnd += 1;         /* slow start: exponential growth */
  else
    cwnd += 1/cwnd;    /* congestion avoidance: linear growth */
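To see the two growth regimes this update produces, here is a small Python sketch that applies the rule per ACK over several round trips; the loop structure (one ACK per outstanding segment, no losses) is an assumption made purely for illustration.

# Applying the per-ACK rule above over several loss-free round trips.
cwnd, ssthresh = 1.0, 8.0                 # in MSS units, as in the figure

for rtt in range(1, 9):
    acks_this_rtt = int(cwnd)             # assume one ACK per segment in the window
    for _ in range(acks_this_rtt):
        if cwnd < ssthresh:
            cwnd += 1                     # slow start: roughly doubles every RTT
        else:
            cwnd += 1 / cwnd              # congestion avoidance: ~ +1 MSS per RTT
    print(f"after RTT {rtt}: cwnd ≈ {cwnd:.1f} MSS")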
Linear Window Increase Example During Congestion Avoidance
Linear congestion window increase during the congestion avoidance phase: the window is increased by 1 packet after a full window's worth of packets has been ACKed.
[Figure: Host A / Host B timeline illustrating the per-RTT window increase.]
Towards a better Congestion Control Algorithm – TCP Reno
Question: Should we set cwnd to 1 both
  after a timeout, and
  when 3 duplicate ACKs are received?
Answer:
  A timeout before 3 dup ACKs is "alarming", so set cwnd to 1.
  BUT 3 dup ACKs before a timeout indicate that the network is still capable of delivering some segments.
    • Why not set cwnd to a bigger value rather than 1?
    • TCP Reno: set cwnd to half of its value, which is equal to ssthresh, and enter the linear increase phase
TCP Reno Refinement (more)
After 3 dup ACKs:
  cwnd is cut in half
  the window then grows linearly
  this is called "fast recovery"
But after a timeout event:
  cwnd is instead set to 1 MSS
  the window then grows exponentially to a threshold, then grows linearly
[Figure: congestion window size (segments) versus transmission round for TCP Tahoe and TCP Reno.]
At round 8, when cwnd = 12, a packet loss is detected by 3 duplicate ACKs:
  Tahoe sets cwnd to 1 unconditionally
  Reno sets cwnd to half of its current value, which is 6
  Notice that ssthresh is set to 6 in both cases
  Also notice that if this had been a timeout, Reno would also set cwnd to 1
Summary: TCP Congestion Control
When cwnd is below ssthresh, the sender is in the slow-start phase and the window grows exponentially.
When cwnd is above ssthresh, the sender is in the congestion-avoidance phase and the window grows linearly.
When a triple duplicate ACK occurs, ssthresh is set to cwnd/2 and cwnd is set to ssthresh (see the sketch after this list).
When a timeout occurs, ssthresh is set to cwnd/2 and cwnd is set to 1 MSS.
The congestion window always oscillates!
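Putting these reactions into code, here is a minimal Python helper, assuming Reno's reaction to three duplicate ACKs as described above; the function name, event labels, and MSS-unit bookkeeping are illustrative assumptions.

# Reaction to the two congestion signals (cwnd and ssthresh in MSS units).
def on_loss_event(event, cwnd, ssthresh):
    """event is 'triple_dup_ack' or 'timeout'; returns the new (cwnd, ssthresh)."""
    ssthresh = cwnd / 2                   # remember half of the pre-loss window
    if event == "triple_dup_ack":
        cwnd = ssthresh                   # Reno: fast recovery, then linear growth
    else:                                 # timeout
        cwnd = 1                          # back to slow start from 1 MSS
    return cwnd, ssthresh

print(on_loss_event("triple_dup_ack", cwnd=12, ssthresh=8))   # -> (6.0, 6.0)
print(on_loss_event("timeout", cwnd=12, ssthresh=8))          # -> (1, 6.0)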
Steady State TCP Modeling
[Figure: congestion window (KB) versus time for a long-lived TCP connection.]
The figure above shows the behavior of a TCP connection in steady state: Additive Increase / Multiplicative Decrease (AIMD).
Further improving the TCP Congestion Control Algorithm – TCP Vegas
Detect congestion before packet loss occurs by observing RTTs
  • the longer the RTT, the greater the congestion in the routers
Lower the transmission rate linearly when packet loss is imminent