Transcript 18-TL-TCP2

Reliable Data Transfer in
Transmission Control Protocol (TCP)
TCP Data Transfer
 TCP creates a reliable, ordered data transfer service on top of IP's unreliable service
 Pipelined segments
 Cumulative ACKs
 Uses a single retransmission timer
 Retransmissions are triggered by:
   - timeout events
   - duplicate ACKs
 Window size is controlled by the receiver (flow control) and inferred from the network (congestion control)
Sender/Receiver Overview
 Figure: sliding-window buffers at sender and receiver.
   - Sender side (pointers: Last ACK Received, Last Frame Sent): Sent & Acked | Sent, Not Acked | OK to Send | Not Usable
   - Receiver side (pointers: Next Frame Expected, Last Frame Acceptable): Received & Acked | Acceptable Packet | Not Usable
TCP Sender/Receiver Invariants
 Figure: the sending application writes into the TCP send buffer (pointers LastByteAcked, LastByteSent, LastByteWritten); the receiving application reads from the TCP receive buffer (pointers LastByteRead, NextByteExpected, LastByteRcvd).
 Snd: LastByteAcked ≤ LastByteSent ≤ LastByteWritten
 Rcv: LastByteRead < NextByteExpected ≤ LastByteRcvd + 1
 LastByteAcked < NextByteExpected ≤ LastByteSent + 1
TCP sender (simplified)

NextSeqNum = InitialSeqNum + 1
SendBase = InitialSeqNum + 1   /* == LastByteAcked + 1 */
loop (forever) {
  switch(event)

  event: data received from application above
    1. create TCP segment with sequence number NextSeqNum
    2. if (timer currently not running)
       2.1. start timer – timeout TimeOutInterval later
    3. pass segment to IP
    4. NextSeqNum = NextSeqNum + length(data)

  event: timer timeout
    1. retransmit not-yet-acknowledged segment with smallest sequence number
    2. start timer – timeout TimeOutInterval later

  event: ACK received, with ACK field value of y
    1. if (y > SendBase) {
         SendBase = y
         if (there are currently not-yet-acknowledged segments)
           restart timer
       }
} /* end of loop forever */

Comment: SendBase – 1 is the last cumulatively ACKed byte, i.e., the receiver has received all bytes up to SendBase – 1 and is expecting the byte starting at SendBase.
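To make the event loop above concrete, here is a minimal Python sketch of the same three events (data from the application, timer timeout, ACK arrival). The class name TcpSenderSketch, the send_to_ip callback, and the plain-dict segment representation are illustrative assumptions, not part of the slides.

import time

class TcpSenderSketch:
    """Toy model of the simplified TCP sender: one retransmission timer,
    cumulative ACKs, retransmit only the oldest unacknowledged segment."""

    def __init__(self, initial_seq_num, timeout_interval, send_to_ip):
        self.next_seq_num = initial_seq_num + 1
        self.send_base = initial_seq_num + 1      # == LastByteAcked + 1
        self.timeout_interval = timeout_interval
        self.send_to_ip = send_to_ip              # callback taking a segment dict
        self.unacked = []                         # segments sent but not yet ACKed
        self.timer_expiry = None                  # None means timer not running

    def on_app_data(self, data: bytes):
        segment = {"seq": self.next_seq_num, "data": data}
        if self.timer_expiry is None:             # start timer only if not running
            self.timer_expiry = time.monotonic() + self.timeout_interval
        self.send_to_ip(segment)
        self.unacked.append(segment)
        self.next_seq_num += len(data)

    def on_timeout(self):
        if not self.unacked:
            return
        self.send_to_ip(self.unacked[0])          # retransmit smallest-seq unACKed segment
        self.timer_expiry = time.monotonic() + self.timeout_interval

    def on_ack(self, y: int):
        if y > self.send_base:                    # new cumulative ACK
            self.send_base = y
            self.unacked = [s for s in self.unacked
                            if s["seq"] + len(s["data"]) > y]
            if self.unacked:                      # restart timer if data still outstanding
                self.timer_expiry = time.monotonic() + self.timeout_interval
            else:
                self.timer_expiry = None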
TCP: retransmission scenarios
 Figure: Host A / Host B timelines for two cases.
   - Lost ACK scenario: the ACK for the Seq=92 segment is lost (X), the timeout fires, the segment is retransmitted, and SendBase advances to 100.
   - Premature timeout scenario: the ACK arrives only after the timer has already expired, so the segment is retransmitted unnecessarily; SendBase advances to 100 and then to 120.
TCP retransmission scenarios (more)
 Figure: Host A / Host B timeline, cumulative ACK scenario.
   - The first ACK is lost (X), but a later cumulative ACK covers the earlier segment, so SendBase advances to 120 without a retransmission.
TCP Receiver ACK generation [RFC 1122, RFC 2581]
 Event: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   Action: delayed ACK – wait up to 500 ms for the next segment; if no next segment arrives, send the ACK (ACKs may be piggybacked)
 Event: arrival of in-order segment with expected seq #; one other segment has an ACK pending
   Action: immediately send a single cumulative ACK, ACKing both in-order segments
 Event: arrival of out-of-order segment with higher-than-expected seq #; gap detected
   Action: immediately send a duplicate ACK, indicating the seq # of the next expected byte
 Event: arrival of a segment that partially or completely fills a gap
   Action: immediately send an ACK, provided the segment starts at the lower end of the gap
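The table above is essentially a small decision procedure; the Python sketch below mirrors it, assuming a receiver that tracks rcv_base (the next expected byte), a pending delayed-ACK flag, and a simple out-of-order buffer. The type and function names are illustrative, not slide notation.

from dataclasses import dataclass, field

@dataclass
class ReceiverState:
    rcv_base: int = 0                         # next expected byte
    ack_pending: bool = False                 # a delayed ACK is outstanding
    ooo: dict = field(default_factory=dict)   # out-of-order buffer: start -> length

def on_segment_arrival(rcv: ReceiverState, seg_start: int, seg_len: int):
    """Decide the ACK action per the RFC 1122 / RFC 2581 table above."""
    if seg_start == rcv.rcv_base:                     # in-order arrival
        rcv.rcv_base += seg_len
        while rcv.rcv_base in rcv.ooo:                # segment fills the gap:
            rcv.rcv_base += rcv.ooo.pop(rcv.rcv_base) #   absorb buffered data
        if rcv.ooo:                                   # a gap still remains above
            return ("send_ack", rcv.rcv_base)
        if rcv.ack_pending:                           # second in-order segment waiting
            rcv.ack_pending = False
            return ("send_cumulative_ack", rcv.rcv_base)
        rcv.ack_pending = True                        # first in-order segment:
        return ("start_delayed_ack_timer", rcv.rcv_base)  # delay ACK up to ~500 ms
    if seg_start > rcv.rcv_base:                      # out of order: gap detected
        rcv.ooo[seg_start] = seg_len
        return ("send_duplicate_ack", rcv.rcv_base)
    return ("ignore_duplicate", rcv.rcv_base)         # already-received data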
Fast Retransmit
 The timeout period is often relatively long:
   - long delay before resending a lost packet
 Detect lost segments via duplicate ACKs:
   - the sender often sends many segments back-to-back
   - if a segment is lost, there will likely be many duplicate ACKs
 If the sender receives 3 duplicate ACKs for the same data, it assumes that the segment after the ACKed data was lost:
   - fast retransmit: resend the segment before the timer expires
Fast retransmit algorithm:

event: ACK received, with ACK field value of y
  if (y > SendBase) {
    SendBase = y
    if (there are currently not-yet-acknowledged segments)
      start timer
  }
  else {  /* y == SendBase: a duplicate ACK for an already ACKed segment
             (y cannot be smaller than SendBase) */
    increment count of dup ACKs received for y
    if (count of dup ACKs received for y == 3) {
      resend segment with sequence number y == SendBase   /* fast retransmit */
    }
  }
TCP Flow Control
TCP Flow Control
 The receive side of a TCP connection has a receive buffer
   - the app process may be slow at reading from the buffer
 Flow control: the sender won't overflow the receiver's buffer by transmitting too much, too fast
 Speed-matching service: matching the send rate to the receiving app's drain rate
TCP Flow control: how it works
 Figure: receive buffer of size MaxRcvBuffer, with pointers LastByteRead, NextByteExpected, LastByteRcvd.
 Receiver advertises the spare room by including the value of RcvWindow in segments:
   - spare room in buffer:
     RcvWindow = MaxRcvBuffer – [NextByteExpected – LastByteRead]
 Sender limits unACKed data to RcvWindow:
   - LastByteSent – LastByteAcked ≤ RcvWindow
   - SndWindow = RcvWindow – [LastByteSent – LastByteAcked]
   - this guarantees the receive buffer doesn't overflow
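As a sanity check on these formulas, a small Python sketch with simple integer byte counters (the function names are illustrative; the variable names mirror the slide):

def advertised_window(max_rcv_buffer, next_byte_expected, last_byte_read):
    """Spare room the receiver advertises back to the sender (RcvWindow)."""
    return max_rcv_buffer - (next_byte_expected - last_byte_read)

def usable_send_window(rcv_window, last_byte_sent, last_byte_acked):
    """How much new data the sender may still put in flight (SndWindow)."""
    in_flight = last_byte_sent - last_byte_acked   # unACKed bytes, must stay <= rcv_window
    return rcv_window - in_flight

# Tiny example: 4 KB buffer, app has read up to byte 1000, receiver expects byte 1500.
rwnd = advertised_window(4096, 1500, 1000)         # 4096 - 500 = 3596 bytes spare
print(rwnd, usable_send_window(rwnd, 2000, 1500))  # 3596, 3096 bytes still usable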
TCP Flow control Issues
 What happens if the advertised window is 0?
   - Receiver updates the window when the application reads data
 What if this update is lost?
   - Deadlock
 TCP Persist timer:
   - Sender periodically sends window-probe packets
   - Receiver responds with an ACK and an up-to-date window advertisement
TCP flow control enhancements
 Problem (Clark, 1982): if the receiver advertises small increases in the receive window, the sender may waste time sending lots of small packets
   - This problem is known as "silly window syndrome"
   - Receiver advertises a one-byte window
   - Sender sends a one-byte packet (1 byte of data, 40 bytes of header = 4000% overhead)
Solving Silly Window Syndrome
 Receiver avoidance [Clark, 1982]
   - Prevent the receiver from advertising small windows
   - Increase the advertised receiver window only in steps of at least min(MSS, RecvBuffer/2)
Solving Silly Window Syndrome
 Sender avoidance [Nagle's algorithm, 1984]
   - prevent the sender from unnecessarily sending small packets
 How long should the sender delay sending data?
   - too long: hurts interactive applications
   - too short: poor network utilization
   - strategies: timer-based vs. self-clocking
 When the application generates additional data (see the sketch below):
   - if it fills a max-size segment (and the window is open): send it
   - else:
     • if there is unACKed data in transit: buffer it until an ACK arrives
     • else: send it
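A minimal Python sketch of that send-or-buffer decision, assuming MSS in bytes and booleans for the window and in-flight state (the function name nagle_should_send_now is illustrative):

def nagle_should_send_now(pending_bytes, mss, window_open, unacked_in_flight):
    """Nagle's rule: send a full segment immediately; otherwise send small data
    only when nothing is in flight, else keep buffering until an ACK returns."""
    if pending_bytes >= mss and window_open:
        return True                      # a full MSS is ready: send it
    if unacked_in_flight:
        return False                     # small data + unACKed data in transit: buffer
    return True                          # small data but the pipe is idle: send it

# Example: 200 bytes pending, MSS 1460, window open.
print(nagle_should_send_now(200, 1460, True, True))   # False -> buffer until ACK
print(nagle_should_send_now(200, 1460, True, False))  # True  -> send immediately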
Keeping the Pipe Full
 16-bit AdvertisedWindow (at most 64 KB), versus the delay x bandwidth product, assuming a 100 ms RTT:

   Bandwidth            Delay x Bandwidth Product
   T1 (1.5 Mbps)        18 KB
   Ethernet (10 Mbps)   122 KB
   T3 (45 Mbps)         549 KB
   FDDI (100 Mbps)      1.2 MB
   STS-3 (155 Mbps)     1.8 MB
   STS-12 (622 Mbps)    7.4 MB
   STS-24 (1.2 Gbps)    14.8 MB

 TCP extension to allow window scaling
   - Put in the options field
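The table entries are just bandwidth x RTT; a short Python check (the 100 ms RTT and the link rates come from the table, the helper name is an illustrative assumption):

def delay_bandwidth_product_kb(bandwidth_bps, rtt_seconds):
    """Bytes needed in flight to keep the pipe full, reported in KB (1 KB = 1024 B)."""
    return bandwidth_bps * rtt_seconds / 8 / 1024

rtt = 0.100  # 100 ms, as assumed by the table
for name, bps in [("T1", 1.5e6), ("Ethernet", 10e6), ("T3", 45e6), ("STS-12", 622e6)]:
    print(f"{name}: {delay_bandwidth_product_kb(bps, rtt):.0f} KB")
# T1: 18 KB, Ethernet: 122 KB, T3: 549 KB, STS-12: 7593 KB (about 7.4 MB)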
TCP Round Trip Time Estimation and
Setting the Timeout
TCP Round Trip Time and Timeout
Q: how to set the TCP timeout value?
 longer than the RTT
   - but the RTT varies
 too short: premature timeout
   - unnecessary retransmissions
 too long: slow reaction to segment loss

Q: how to estimate the RTT?
 SampleRTT: measured time from segment transmission until ACK receipt
   - ignore retransmissions
 SampleRTT varies, so we want the estimated RTT to be "smoother"
   - average several recent measurements, not just the current SampleRTT
TCP Round Trip Time and Timeout
EstimatedRTT = (1 – α)*EstimatedRTT + α*SampleRTT
 Exponentially weighted moving average
 influence of a past sample decreases exponentially fast
 typical value: α = 0.125
e.g., let Ai be the EstimatedRTT at time i and Mi the SampleRTT at time i:
A0 = M0
A1 = (1 – α) M0 + α M1
A2 = (1 – α)^2 M0 + α(1 – α) M1 + α M2
…
Example RTT estimation:
 Figure: SampleRTT and EstimatedRTT (in milliseconds, roughly 100–350 ms) plotted against time in seconds, for the path gaia.cs.umass.edu to fantasia.eurecom.fr; the EstimatedRTT curve is visibly smoother than the SampleRTT curve.
TCP Round Trip Time and Timeout
Setting the timeout
 EstimatedRTT plus a "safety margin"
   - large variation in EstimatedRTT -> larger safety margin
 first estimate how much SampleRTT deviates from EstimatedRTT:
   DevRTT = (1 – β)*DevRTT + β*|SampleRTT – EstimatedRTT|
   (typically, β = 0.25)
 then set the timeout interval:
   TimeoutInterval = EstimatedRTT + 4*DevRTT
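Putting the two moving averages together, a small Python sketch of the estimator (α = 0.125 and β = 0.25 as on the slides; the class name RttEstimator and the choice of initial values are assumptions):

class RttEstimator:
    """RTT smoothing: EWMA of the RTT plus an EWMA of its deviation."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample      # seed with the first measurement
        self.dev_rtt = first_sample / 2        # assumed initial deviation

    def update(self, sample_rtt):
        """Feed one SampleRTT (seconds); retransmitted segments should be skipped."""
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt

# Example: RTT samples drifting from 100 ms toward 160 ms.
est = RttEstimator(0.100)
for s in (0.110, 0.120, 0.160, 0.150):
    est.update(s)
print(round(est.estimated_rtt, 3), round(est.timeout_interval, 3))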
TCP Congestion Control
Congestion Control Overview
Congestion, informally: "too many sources sending too much data too fast for the network to handle"
 Signal and detect congestion
 Policy for a source to adjust its transmission rate to match the network's bandwidth capacity
   - decrease the rate upon a congestion signal
   - increase the rate for utilization
 Initialization to reach steady state
Network Utilization
 Queuing delay could (theoretically) approach infinity with increased load
 Network power: the ratio of throughput to delay
 Figure: throughput/delay and queuing delay plotted against load; the optimal load is at the "knee" of the curve.
Approaches towards congestion control
Two broad approaches towards congestion control:
 End-end congestion control:
   - no explicit feedback from the network
   - congestion inferred from end-system observed loss and delay
   - approach taken by TCP
 Network-assisted congestion control:
   - routers provide feedback to end systems
   - a single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
   - or an explicit rate the sender should send at
TCP Congestion Control
 end-end control (no network assistance)
 sender limits transmission:
   sndWindow = LastByteSent – LastByteAcked ≤ min(cwnd, rcvWin)
 cwnd is a dynamic function of perceived network congestion

How does the sender perceive congestion?
 loss event = timeout or 3 duplicate ACKs
 TCP sender reduces its rate (cwnd) after a loss event
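A one-function Python sketch of how the two limits combine (the variable names mirror the slide; the helper name is an illustrative assumption):

def allowed_new_bytes(cwnd, rcv_window, last_byte_sent, last_byte_acked):
    """The sender may keep at most min(cwnd, rcv_window) bytes unACKed in flight."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, min(cwnd, rcv_window) - in_flight)

# Example: congestion window 8 KB, receiver window 16 KB, 5 KB already in flight.
print(allowed_new_bytes(8 * 1024, 16 * 1024, 15000, 15000 - 5 * 1024))  # 3072 bytes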
TCP Congestion Control Outline
 Basic idea: probe (test) the currently available bandwidth in the network and adjust your sending rate accordingly
1. Start with a very small sending rate – 1 packet
2. Increase your sending rate as long as no congestion is detected
   • How do you detect "no congestion"? – no packet loss
   • How do you increase your sending rate?
3. Decrease your sending rate when you detect congestion
   • How do you detect congestion? – packet loss (timeout or 3 duplicate ACKs)
   • How do you decrease your sending rate?
TCP Slow Start
 When the connection begins, cwnd = 1 MSS
   - Example: MSS = 500 bytes and RTT = 200 ms -> initial rate = 20 kbps
 The available bandwidth may be >> MSS/RTT
   - desirable to quickly ramp up to a respectable rate
 When the connection begins, increase the rate exponentially fast until the first loss event
 When a loss occurs (congestion signal), set cwnd to 1 MSS and restart slow start
TCP Slow Start (more)
 When the connection begins, increase the rate exponentially until the first loss event:
   - double cwnd every RTT
   - done by incrementing cwnd by 1 MSS for every ACK received
 Summary:
   - the initial rate is slow but ramps up exponentially fast
   - when the first packet loss occurs (either a timeout or 3 duplicate ACKs), set cwnd to 1 MSS and start over
 Figure: Host A / Host B timeline illustrating the doubling of segments sent per RTT.
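A tiny Python sketch of the per-ACK increment that produces the per-RTT doubling (cwnd measured in MSS units; the simulation loop is an illustrative assumption):

def slow_start_round(cwnd_mss):
    """One RTT of slow start: every segment in flight is ACKed, and each ACK
    grows cwnd by 1 MSS, so the window doubles."""
    acks_received = cwnd_mss            # one ACK per segment sent this round
    return cwnd_mss + acks_received     # cwnd + cwnd == 2 * cwnd

cwnd = 1
for rtt in range(5):
    print(f"RTT {rtt}: cwnd = {cwnd} MSS")
    cwnd = slow_start_round(cwnd)
# Prints 1, 2, 4, 8, 16 – exponential growth until the first loss event.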
TCP After the First Packet Loss
Question: should we continue doing slow start throughout the lifetime of the TCP connection?
 Why did we increase cwnd exponentially at the beginning?
   • Because we had no idea how much of our traffic the network could carry, so we needed to probe fast to figure it out
 What do we know at the first packet loss event?
   • That the network cannot carry our traffic at the rate we had at the time the packet loss occurred
     – What was our rate at the time the packet loss occurred? cwnd/RTT
 Refinement idea: use the knowledge obtained at the time of the packet loss to refine the slow start algorithm. How?
   - Keep a threshold value, ssthresh, set to 1/2 of cwnd just before the loss event.
   - cwnd increases exponentially until it reaches ssthresh, and linearly afterwards – this is called congestion avoidance.
Congestion Avoidance – TCP Tahoe
Q: When should the exponential increase switch to linear?
A: When cwnd gets to 1/2 of its value before the timeout.
 Threshold variable ssthresh
   - set to some large value (e.g., 65 KB) when the connection is established
   - at a loss event, ssthresh is set to 1/2 of cwnd just before the loss event
 Figure: congestion window size (segments) vs. transmission round for TCP Tahoe, with the threshold marked; the figure assumes the first packet loss occurred when cwnd was 16, so ssthresh = 8 and cwnd restarts at 1.
Implementation (per ACK):
if (cwnd < ssthresh)
  cwnd += 1;
else
  cwnd += 1/cwnd;
Linear Window Increase Example During Congestion Avoidance
 Linear congestion window increase during the congestion-avoidance phase:
   - the window is increased by 1 packet after a full window's worth of packets is ACKed
 Figure: Host A / Host B timeline illustrating this per-window increase.
Towards a better Congestion Control Algorithm – TCP Reno
 Question: should we set cwnd to 1 both
   - after a timeout, and
   - when 3 duplicate ACKs are received?
 Answer:
   - A timeout before 3 dup ACKs is "alarming", so set cwnd to 1.
   - BUT 3 dup ACKs before a timeout indicate that the network is still capable of delivering some segments.
     • Why not set cwnd to a bigger value rather than 1?
     • TCP Reno: set cwnd to half of its value, which is equal to ssthresh, and enter the linear-increase phase
TCP Reno Refinement (more)
 After 3 dup ACKs:
   - cwnd is cut in half
   - the window then grows linearly
   - this is called "fast recovery"
 But after a timeout event:
   - cwnd is instead set to 1 MSS
   - the window then grows exponentially to a threshold, then grows linearly
 Figure: congestion window size (segments) vs. transmission round for TCP Tahoe and TCP Reno, with the threshold marked.
   - At time 8, when cwnd = 12, a packet loss is detected by 3 duplicate ACKs.
   - Tahoe sets cwnd to 1 unconditionally.
   - Reno sets cwnd to half of its current value, which is 6.
   - Notice that ssthresh is set to 6 in both cases.
   - Also notice that if this had been a timeout, Reno would also set cwnd to 1.
Summary: TCP Congestion Control
 When cwnd is below ssthresh, the sender is in the slow-start phase and the window grows exponentially.
 When cwnd is above ssthresh, the sender is in the congestion-avoidance phase and the window grows linearly.
 When a triple duplicate ACK occurs, ssthresh is set to cwnd/2 and cwnd is set to ssthresh.
 When a timeout occurs, ssthresh is set to cwnd/2 and cwnd is set to 1 MSS.
 The congestion window always oscillates! (See the sketch below.)
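The four rules above fit in a few lines of Python; this sketch tracks cwnd in MSS units and is an illustrative model only (the reno flag and the function names are assumptions, not slide notation):

def on_ack(cwnd, ssthresh):
    """Window growth per ACK: exponential in slow start, linear in congestion avoidance."""
    if cwnd < ssthresh:
        return cwnd + 1.0          # slow start: +1 MSS per ACK (doubles per RTT)
    return cwnd + 1.0 / cwnd       # congestion avoidance: +1 MSS per RTT overall

def on_loss(cwnd, kind, reno=True):
    """Loss reaction: both set ssthresh to cwnd/2; Reno keeps cwnd at ssthresh after
    3 dup ACKs (fast recovery), while a timeout (or Tahoe) drops cwnd back to 1 MSS."""
    ssthresh = max(cwnd / 2.0, 2.0)
    if kind == "3dupack" and reno:
        return ssthresh, ssthresh  # (new cwnd, new ssthresh)
    return 1.0, ssthresh           # timeout, or TCP Tahoe for any loss

# Example: loss detected by 3 dup ACKs at cwnd = 12 (as in the Tahoe/Reno figure).
print(on_loss(12, "3dupack", reno=True))    # (6.0, 6.0)  -> Reno
print(on_loss(12, "3dupack", reno=False))   # (1.0, 6.0)  -> Tahoe
print(on_loss(12, "timeout", reno=True))    # (1.0, 6.0)  -> both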
Steady State TCP Modeling
 Figure: congestion window (y-axis marked at 8, 16, and 24 Kbytes) of a long-lived TCP connection oscillating in a sawtooth pattern over time.
 The figure shows the behavior of a TCP connection in steady state:
   - Additive Increase / Multiplicative Decrease (AIMD)
 Further improving the TCP congestion control algorithm – TCP Vegas:
   - detect congestion before packet loss occurs by observing RTTs
     • the longer the RTT, the greater the congestion in the routers
   - lower the transmission rate linearly when packet loss is imminent (see the sketch below)
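The slides only state the idea; one common formulation of Vegas (not spelled out above, so treat this purely as an illustrative sketch) compares the expected rate cwnd/BaseRTT with the actual rate cwnd/RTT and nudges cwnd up or down before any loss occurs. The thresholds alpha and beta and the function name are assumptions:

def vegas_adjust(cwnd, base_rtt, current_rtt, alpha=1.0, beta=3.0):
    """Sketch of TCP Vegas: estimate how many extra packets sit queued in routers
    (diff, in packets) and adjust cwnd linearly before a loss ever happens."""
    expected = cwnd / base_rtt              # rate if there were no queuing (packets/s)
    actual = cwnd / current_rtt             # measured rate (packets/s)
    diff = (expected - actual) * base_rtt   # estimated packets queued in the network
    if diff < alpha:
        return cwnd + 1                     # little queuing: gently probe for more bandwidth
    if diff > beta:
        return cwnd - 1                     # RTT inflation signals congestion: back off linearly
    return cwnd                             # queue occupancy in the target band: hold steady

# Example: base RTT 100 ms; the RTT has grown to 140 ms with cwnd = 20 packets.
print(vegas_adjust(20, 0.100, 0.140))       # diff ~ 5.7 packets > beta, so cwnd drops to 19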