3rd Edition: Chapter 3
Download
Report
Transcript 3rd Edition: Chapter 3
What’s the Internet: “nuts and bolts” view
millions of connected
computing devices: hosts
= end systems
examples of hosts?
router
workstation
server
mobile
local ISP
running network apps
examples of applications?
regional ISP
company
network
Transport Layer
3-1
What’s the Internet: “nuts and bolts” view
communication links
fiber, copper, radio,
satellite
transmission rate =
bandwidth
• typical bandwidth for
modem? wireless?
router
workstation
server
mobile
local ISP
regional ISP
routers: forward packets
(chunks of data)
what’s in a packet?
company
network
Transport Layer
3-2
What’s the Internet: “nuts and bolts” view
protocols control sending,
receiving of msgs
e.g., TCP, IP, HTTP, FTP, PPP
Internet: “network of networks”
loosely hierarchical
public Internet versus private
intranet
router
server
mobile
local ISP
Internet standards
workstation
regional ISP
RFC: Request for comments
IETF: Internet Engineering
Task Force
company
network
Transport Layer
3-3
What’s the Internet: a service view
communication infrastructure
enables distributed
applications:
Web, email, other examples?
communication services
provided to apps:
connection-oriented reliable
• example apps?
Connectionless unreliable
• example apps?
Transport Layer
3-4
What’s a protocol?
human protocols:
“what’s the time?”
“I have a question”
introductions
network protocols:
machines rather than
humans
all communication activity
in Internet governed by
protocols
protocols define format, order of msgs
sent and received among network
entities, and actions taken on msg
transmission, receipt
Transport Layer
3-5
What’s
a
protocol?
a human protocol and a computer network protocol:
Hi
TCP connection
req
Hi
TCP connection
response
Got the
time?
Get http://www.awl.com/kurose-ross
2:00
<file>
time
Q: Why are protocols so important?
Transport Layer
3-6
A closer look at network structure:
network edge: applications
and hosts
network core:
routers
network of networks
access networks, physical
media: communication links
Transport Layer
3-7
Protocol “Layers”
Networks are complex!
many “pieces”:
hosts
routers
links of various media
applications
protocols
hardware, software
Question:
Is there any hope of organizing
structure of network?
Or at least our discussion of
networks?
Transport Layer
3-8
Organization of air travel
ticket (purchase)
ticket (complain)
baggage (check)
baggage (claim)
gates (load)
gates (unload)
runway takeoff
runway landing
airplane routing
airplane routing
airplane routing
a series of steps
Transport Layer
3-9
Layering of airline functionality
ticket (purchase)
ticket (complain)
ticket
baggage (check)
baggage (claim
baggage
gates (load)
gates (unload)
gate
runway (takeoff)
runway (land)
takeoff/landing
airplane routing
airplane routing
airplane routing
departure
airport
airplane routing
airplane routing
intermediate air-traffic
control centers
arrival
airport
Layers: each layer implements a service
via its own internal-layer actions
relying on services provided by layer below
Transport Layer 3-10
Why layering?
Dealing with complex systems:
explicit structure allows identification, relationship of
complex system’s pieces
layered reference model for discussion
modularization eases maintenance, updating of system
change of implementation of layer’s service transparent
to rest of system
e.g., change in gate procedure doesn’t affect rest of
system
layering considered harmful?
Transport Layer
3-11
Internet protocol stack
application: supporting network applications
FTP, SMTP, STTP
transport: host-host data transfer
TCP, UDP
network: routing of datagrams from source to
destination
transport
IP, routing protocols
link: data transfer between neighboring
network elements
application
PPP, Ethernet
physical: bits “on the wire”
network
link
physical
Transport Layer 3-12
source
message
segment Ht
datagram Hn Ht
frame
Hl Hn Ht
M
M
M
M
Encapsulation
application
transport
network
link
physical
Hl Hn Ht
M
link
physical
Hl Hn Ht
M
switch
destination
M
Ht
M
Hn Ht
Hl Hn Ht
M
M
application
transport
network
link
physical
Hn Ht
Hl Hn Ht
M
M
network
link
physical
Hn Ht
Hl Hn Ht
M
M
router
Transport Layer 3-13
Internet transport protocols services
TCP service:
connection-oriented: setup
required between client and
server processes
reliable transport between
sending and receiving process
flow control: sender won’t
overwhelm receiver
congestion control: throttle
sender when network overloaded
does not provide: timing,
minimum bandwidth guarantees
UDP service:
unreliable data transfer
between sending and
receiving process
does not provide: connection
setup, reliability, flow
control, congestion control,
timing, or bandwidth
guarantee
Q: why bother? Why is there a
UDP?
Transport Layer 3-14
Transport vs. network layer
network layer: logical
communication between
hosts
transport layer: logical
communication between
processes
relies on, enhances,
network layer services
Household analogy:
12 kids sending letters to 12 kids
processes = kids
app messages = letters in
envelopes
hosts = houses
transport protocol = Ann and
Bill
network-layer protocol =
postal service
Transport Layer 3-15
Reliable data transfer: getting started
We’ll:
incrementally develop sender, receiver sides of
reliable data transfer protocol (rdt)
consider only unidirectional data transfer
but control info will flow on both directions!
use finite state machines (FSM) to specify
sender, receiver
state: when in this
“state” next state
uniquely determined
by next event
state
1
event causing state transition
actions taken on state transition
event
actions
state
2
Transport Layer 3-16
Rdt1.0: reliable transfer over a reliable channel
underlying channel perfectly reliable
no bit errors
no loss of packets
separate FSMs for sender, receiver:
sender sends data into underlying channel
receiver read data from underlying channel
Wait for
call from
above
rdt_send(data)
packet = make_pkt(data)
udt_send(packet)
sender
Wait for
call from
below
rdt_rcv(packet)
extract (packet,data)
deliver_data(data)
receiver
Transport Layer 3-17
Rdt2.0: channel with bit errors
underlying channel may flip bits in packet
checksum to detect bit errors
the question: how to recover from errors:
acknowledgements (ACKs): receiver explicitly tells sender
that pkt received OK
negative acknowledgements (NAKs): receiver explicitly
tells sender that pkt had errors
sender retransmits pkt on receipt of NAK
new mechanisms in rdt2.0 (beyond rdt1.0):
error detection
receiver feedback: control msgs (ACK,NAK) rcvr->sender
Transport Layer 3-18
rdt2.0: FSM specification
rdt_send(data)
snkpkt = make_pkt(data, checksum)
udt_send(sndpkt)
rdt_rcv(rcvpkt) &&
isNAK(rcvpkt)
Wait for
Wait for
call from
ACK or
udt_send(sndpkt)
above
NAK
rdt_rcv(rcvpkt) && isACK(rcvpkt)
L
sender
receiver
rdt_rcv(rcvpkt) &&
corrupt(rcvpkt)
udt_send(NAK)
Wait for
call from
below
rdt_rcv(rcvpkt) &&
notcorrupt(rcvpkt)
extract(rcvpkt,data)
deliver_data(data)
udt_send(ACK)
Transport Layer 3-19
rdt2.0: operation with no errors
rdt_send(data)
snkpkt = make_pkt(data, checksum)
udt_send(sndpkt)
rdt_rcv(rcvpkt) &&
isNAK(rcvpkt)
Wait for
Wait for
call from
ACK or
udt_send(sndpkt)
above
NAK
rdt_rcv(rcvpkt) && isACK(rcvpkt)
L
rdt_rcv(rcvpkt) &&
corrupt(rcvpkt)
udt_send(NAK)
Wait for
call from
below
rdt_rcv(rcvpkt) &&
notcorrupt(rcvpkt)
extract(rcvpkt,data)
deliver_data(data)
udt_send(ACK)
Transport Layer 3-20
rdt2.0: error scenario
rdt_send(data)
snkpkt = make_pkt(data, checksum)
udt_send(sndpkt)
rdt_rcv(rcvpkt) &&
isNAK(rcvpkt)
Wait for
Wait for
call from
ACK or
udt_send(sndpkt)
above
NAK
rdt_rcv(rcvpkt) && isACK(rcvpkt)
L
rdt_rcv(rcvpkt) &&
corrupt(rcvpkt)
udt_send(NAK)
Wait for
call from
below
rdt_rcv(rcvpkt) &&
notcorrupt(rcvpkt)
extract(rcvpkt,data)
deliver_data(data)
udt_send(ACK)
Transport Layer 3-21
rdt2.0 has a fatal flaw!
Stop and Wait
Sender sends one packet, then waits for receiver response
What happens if ACK/NAK corrupted?
sender doesn’t know what happened at receiver!
can’t just retransmit: possible duplicate
Handling duplicates:
sender adds sequence number to each pkt
sender retransmits current pkt if ACK/NAK garbled
receiver discards (doesn’t deliver up) duplicate pkt
Transport Layer 3-22
rdt2.1: sender, handles garbled ACK/NAKs
rdt_send(data)
sndpkt = make_pkt(0, data, checksum)
udt_send(sndpkt)
rdt_rcv(rcvpkt) &&
( corrupt(rcvpkt) ||
Wait
for
Wait for
isNAK(rcvpkt) )
ACK or
call 0 from
udt_send(sndpkt)
NAK 0
above
rdt_rcv(rcvpkt)
&& notcorrupt(rcvpkt)
&& isACK(rcvpkt)
rdt_rcv(rcvpkt)
&& notcorrupt(rcvpkt)
&& isACK(rcvpkt)
L
rdt_rcv(rcvpkt) &&
( corrupt(rcvpkt) ||
isNAK(rcvpkt) )
udt_send(sndpkt)
L
Wait for
ACK or
NAK 1
Wait for
call 1 from
above
rdt_send(data)
sndpkt = make_pkt(1, data, checksum)
udt_send(sndpkt)
Transport Layer 3-23
rdt2.1: receiver, handles garbled ACK/NAKs
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
&& has_seq0(rcvpkt)
rdt_rcv(rcvpkt) && (corrupt(rcvpkt)
extract(rcvpkt,data)
deliver_data(data)
sndpkt = make_pkt(ACK, chksum)
udt_send(sndpkt)
rdt_rcv(rcvpkt) && (corrupt(rcvpkt)
sndpkt = make_pkt(NAK, chksum)
udt_send(sndpkt)
rdt_rcv(rcvpkt) &&
not corrupt(rcvpkt) &&
has_seq1(rcvpkt)
sndpkt = make_pkt(ACK, chksum)
udt_send(sndpkt)
sndpkt = make_pkt(NAK, chksum)
udt_send(sndpkt)
Wait for
0 from
below
Wait for
1 from
below
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
&& has_seq1(rcvpkt)
rdt_rcv(rcvpkt) &&
not corrupt(rcvpkt) &&
has_seq0(rcvpkt)
sndpkt = make_pkt(ACK, chksum)
udt_send(sndpkt)
extract(rcvpkt,data)
deliver_data(data)
sndpkt = make_pkt(ACK, chksum)
udt_send(sndpkt)
Transport Layer 3-24
rdt2.1: discussion
Sender:
seq # added to pkt
two seq. #’s (0,1) will
suffice. Why?
must check if received
ACK/NAK corrupted
twice as many states
state must “remember”
whether “current” pkt
has 0 or 1 seq. #
Receiver:
must check if received
packet is duplicate
state indicates whether
0 or 1 is expected pkt
seq #
note: receiver can not
know if its last
ACK/NAK received OK
at sender
Transport Layer 3-25
rdt2.2: a NAK-free protocol
same functionality as rdt2.1, using ACKs only
instead of NAK, receiver sends ACK for last pkt
received OK
receiver must explicitly include seq # of pkt being ACKed
duplicate ACK at sender results in same action as
NAK: retransmit current pkt
Transport Layer 3-26
rdt2.2: sender, receiver fragments
rdt_send(data)
sndpkt = make_pkt(0, data, checksum)
udt_send(sndpkt)
rdt_rcv(rcvpkt) &&
( corrupt(rcvpkt) ||
Wait for
Wait for
isACK(rcvpkt,1) )
ACK
call 0 from
0
udt_send(sndpkt)
above
sender FSM
fragment
rdt_rcv(rcvpkt) &&
(corrupt(rcvpkt) ||
has_seq1(rcvpkt))
udt_send(sndpkt)
Wait for
0 from
below
rdt_rcv(rcvpkt)
&& notcorrupt(rcvpkt)
&& isACK(rcvpkt,0)
receiver FSM
fragment
L
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
&& has_seq1(rcvpkt)
extract(rcvpkt,data)
deliver_data(data)
sndpkt = make_pkt(ACK1, chksum)
udt_send(sndpkt)
Transport Layer 3-27
rdt3.0: channels with errors and loss
New assumption:
underlying channel can
also lose packets (data
or ACKs)
checksum, seq. #, ACKs,
retransmissions will be
of help, but not enough
Approach: sender waits
“reasonable” amount of time
for ACK
retransmits if no ACK received in
this time
if pkt (or ACK) just delayed (not
lost):
retransmission will be
duplicate, but use of seq. #’s
already handles this
receiver must specify seq #
of pkt being ACKed
requires countdown timer
Transport Layer 3-28
rdt3.0 sender
rdt_send(data)
sndpkt = make_pkt(0, data, checksum)
udt_send(sndpkt)
start_timer
rdt_rcv(rcvpkt)
L
rdt_rcv(rcvpkt)
&& notcorrupt(rcvpkt)
&& isACK(rcvpkt,1)
rdt_rcv(rcvpkt) &&
( corrupt(rcvpkt) ||
isACK(rcvpkt,0) )
timeout
udt_send(sndpkt)
start_timer
rdt_rcv(rcvpkt)
&& notcorrupt(rcvpkt)
&& isACK(rcvpkt,0)
stop_timer
stop_timer
timeout
udt_send(sndpkt)
start_timer
L
Wait
for
ACK0
Wait for
call 0from
above
L
rdt_rcv(rcvpkt) &&
( corrupt(rcvpkt) ||
isACK(rcvpkt,1) )
Wait
for
ACK1
Wait for
call 1 from
above
rdt_send(data)
rdt_rcv(rcvpkt)
L
sndpkt = make_pkt(1, data, checksum)
udt_send(sndpkt)
start_timer
Transport Layer 3-29
rdt3.0 in action
Transport Layer 3-30
rdt3.0 in action
Transport Layer 3-31
Performance of rdt3.0
rdt3.0 works, but performance stinks
Why?
Transport Layer 3-32
rdt3.0: stop-and-wait operation
sender
receiver
first packet bit transmitted, t = 0
last packet bit transmitted, t = L / R
RTT
first packet bit arrives
last packet bit arrives, send ACK
ACK arrives, send next
packet, t = RTT + L / R
Transport Layer 3-33
Pipelined protocols
Pipelining: sender allows multiple, “in-flight”, yet-to-beacknowledged pkts
range of sequence numbers must be increased
buffering at sender and/or receiver
Two generic forms of pipelined protocols: go-Back-N, selective
repeat
Transport Layer 3-34
Pipelining: increased utilization
sender
receiver
first packet bit transmitted, t = 0
last bit transmitted, t = L / R
RTT
first packet bit arrives
last packet bit arrives, send ACK
last bit of 2nd packet arrives, send ACK
last bit of 3rd packet arrives, send ACK
ACK arrives, send next
packet, t = RTT + L / R
Transport Layer 3-35
Go-Back-N
Sender:
k-bit seq # in pkt header
“window” of up to N, consecutive unack’ed pkts allowed
ACK(n): ACKs all pkts up to, including seq # n - “cumulative ACK”
may deceive duplicate ACKs (see receiver)
timer for each in-flight pkt
timeout(n): retransmit pkt n and all higher seq # pkts in window
Transport Layer 3-36
GBN: receiver
ACK-only: always send ACK for correctlyreceived pkt with highest in-order seq #
may generate duplicate ACKs
need only remember expectedseqnum
out-of-order pkt:
discard (don’t buffer) -> isn't this bad?
Re-ACK pkt with highest in-order seq #
Transport Layer 3-37
GBN in
action
Transport Layer 3-38
Selective Repeat
receiver individually acknowledges all correctly
received pkts
buffers pkts, as needed, for eventual in-order delivery
to upper layer
sender only resends pkts for which ACK not
received
sender timer for each unACKed pkt
sender window
N consecutive seq #’s
again limits seq #s of sent, unACKed pkts
Transport Layer 3-39
Selective repeat: sender, receiver windows
Transport Layer 3-40
Selective repeat
sender
data from above :
receiver
pkt n in
[rcvbase, rcvbase+N-1]
if next available seq # in
send ACK(n)
timeout(n):
in-order: deliver (also deliver
window, send pkt
resend pkt n, restart timer
ACK(n) in [sendbase,sendbase+N]:
mark pkt n as received
if n smallest unACKed pkt,
advance window base to
next unACKed seq #
out-of-order: buffer
buffered, in-order pkts),
advance window to next notyet-received pkt
pkt n in
[rcvbase-N,rcvbase-1]
ACK(n)
otherwise:
ignore
Transport Layer 3-41
Selective repeat in action
Transport Layer 3-42
Selective repeat:
dilemma
Example:
seq #’s: 0, 1, 2, 3
window size=3
receiver sees no
difference in two
scenarios!
incorrectly passes
duplicate data as new in
(a)
Q: what relationship
between seq # size and
window size?
Transport Layer 3-43
TCP: Overview
full duplex data:
point-to-point:
RFCs: 793, 1122, 1323, 2018, 2581
one sender, one receiver
reliable, in-order byte
steam:
no “message boundaries”
pipelined:
connection-oriented:
TCP congestion and flow
control set window size
send & receive buffers
socket
door
application
reads data
TCP
send buffer
TCP
receive buffer
handshaking (exchange of
control msgs) init’s sender,
receiver state before data
exchange
flow controlled:
application
writes data
bi-directional data flow in
same connection
MSS: maximum segment
size
sender will not overwhelm
receiver
socket
door
segment
Transport Layer 3-44
TCP segment structure
32 bits
URG: urgent data
(generally not used)
ACK: ACK #
valid
PSH: push data now
(generally not used)
RST, SYN, FIN:
connection estab
(setup, teardown
commands)
Internet
checksum
(as in UDP)
source port #
dest port #
sequence number
acknowledgement number
head not
UA P R S F
len used
checksum
Receive window
Urg data pnter
Options (variable length)
counting
by bytes
of data
(not segments!)
# bytes
rcvr willing
to accept
application
data
(variable length)
Transport Layer 3-45
TCP seq. #’s and ACKs
Seq. #’s:
byte stream “number”
of first byte in
segment’s data
ACKs:
seq # of next byte
expected from other
side
cumulative ACK
piggybacking
Q: how receiver handles out-oforder segments
A: TCP spec doesn’t say,
- up to implementor
Host A
User
types
‘C’
Host B
host ACKs
receipt of
‘C’, echoes
back ‘C’
host ACKs
receipt
of echoed
‘C’
simple telnet scenario
time
Transport Layer 3-46
TCP Round Trip Time and Timeout
Q: how to set TCP
timeout value?
longer than RTT
but RTT varies
too short: premature
timeout
unnecessary
retransmissions
too long: slow reaction
to segment loss
Q: how to estimate RTT?
SampleRTT: measured time from
segment transmission until ACK
receipt
ignore retransmissions
SampleRTT will vary, want
estimated RTT “smoother”
average several recent
measurements, not just
current SampleRTT
Transport Layer 3-47
Example RTT estimation:
RTT: gaia.cs.umass.edu to fantasia.eurecom.fr
350
RTT (milliseconds)
300
250
200
150
100
1
8
15
22
29
36
43
50
57
64
71
78
85
92
99
106
time (seconnds)
SampleRTT
Estimated RTT
Transport Layer 3-48
TCP reliable data transfer
TCP creates rdt
service on top of IP’s
unreliable service
Pipelined segments
Cumulative acks
TCP uses single
retransmission timer
Retransmissions are
triggered by:
timeout events
duplicate acks
Initially consider
simplified TCP sender:
ignore duplicate acks
ignore flow control,
congestion control
Transport Layer 3-49
TCP sender events:
data rcvd from app:
Create segment with
seq #
seq # is byte-stream
number of first data
byte in segment
start timer if not
already running (think
of timer as for oldest
unacked segment)
expiration interval:
TimeOutInterval
timeout:
retransmit segment
that caused timeout
restart timer
Ack rcvd:
If acknowledges
previously unacked
segments
update what is known to
be acked
start timer if there are
outstanding segments
Transport Layer 3-50
NextSeqNum = InitialSeqNum
SendBase = InitialSeqNum
loop (forever) {
switch(event)
event: data received from application above
create TCP segment with sequence number NextSeqNum
if (timer currently not running)
start timer
pass segment to IP
NextSeqNum = NextSeqNum + length(data)
event: timer timeout
retransmit not-yet-acknowledged segment with
smallest sequence number
start timer
event: ACK received, with ACK field value of y
if (y > SendBase) {
SendBase = y
if (there are currently not-yet-acknowledged segments)
start timer
}
} /* end of loop forever */
TCP
sender
(simplified)
Comment:
• SendBase-1: last
cumulatively
ack’ed byte
Example:
• SendBase-1 = 71;
y= 73, so the rcvr
wants 73+ ;
y > SendBase, so
that new data is
acked
Transport Layer 3-51
TCP: retransmission scenarios
Host A
X
loss
Sendbase
= 100
SendBase
= 120
SendBase
= 100
time
SendBase
= 120
lost ACK scenario
Host B
Seq=92 timeout
Host B
Seq=92 timeout
timeout
Host A
time
premature timeout
Transport Layer 3-52
TCP retransmission scenarios (more)
timeout
Host A
Host B
X
loss
SendBase
= 120
time
Cumulative ACK scenario
Transport Layer 3-53
TCP ACK generation
[RFC 1122, RFC 2581]
Event at Receiver
TCP Receiver action
Arrival of in-order segment with
expected seq #. All data up to
expected seq # already ACKed
Delayed ACK. Wait up to 500ms
for next segment. If no next segment,
send ACK
Arrival of in-order segment with
expected seq #. One other
segment has ACK pending
Immediately send single cumulative
ACK, ACKing both in-order segments
Arrival of out-of-order segment
higher-than-expect seq. # .
Gap detected
Immediately send duplicate ACK,
indicating seq. # of next expected byte
Arrival of segment that
partially or completely fills gap
Immediate send ACK, provided that
segment startsat lower end of gap
Transport Layer 3-54
Fast Retransmit
Time-out period often
relatively long:
long delay before
resending lost packet
Detect lost segments
via duplicate ACKs.
Sender often sends
many segments back-toback
If segment is lost,
there will likely be many
duplicate ACKs.
If sender receives 3
ACKs for the same
data, it supposes that
segment after ACKed
data was lost:
fast retransmit: resend
segment before timer
expires
Transport Layer 3-55
TCP Flow Control
receive side of TCP
connection has a
receive buffer:
flow control
sender won’t overflow
receiver’s buffer by
transmitting too much,
too fast
speed-matching
app process may be
service: matching the
send rate to the
receiving app’s drain
rate
slow at reading from
buffer
Transport Layer 3-56
TCP Flow control: how it works
Rcvr advertises spare
(Suppose TCP receiver
discards out-of-order
segments)
spare room in buffer
room by including value
of RcvWindow in
segments
Sender limits unACKed
data to RcvWindow
guarantees receive
buffer doesn’t overflow
= RcvWindow
= RcvBuffer-[LastByteRcvd LastByteRead]
Transport Layer 3-57
TCP Connection Management
Recall: TCP sender, receiver establish “connection” before
exchanging data segments
initialize TCP variables:
seq. #s
buffers, flow control info (e.g. RcvWindow)
client: connection initiator
Socket clientSocket = new
Socket("hostname","port
number");
server: contacted by client
Socket connectionSocket = welcomeSocket.accept();
Transport Layer 3-58
TCP Connection Management
Three way handshake:
Step 1: client host sends TCP SYN segment to server
specifies initial seq #
no data
Step 2: server host receives SYN, replies with SYNACK
segment
server allocates buffers
specifies server initial seq. #
Step 3: client receives SYNACK, replies with ACK segment,
which may contain data
Transport Layer 3-59
TCP Connection Management (cont.)
Closing a connection:
client closes socket:
clientSocket.close();
client
server
close
Step 1: client end system sends
TCP FIN control segment to
server
close
replies with ACK. Closes
connection, sends FIN.
timed wait
Step 2: server receives FIN,
closed
Transport Layer 3-60
TCP Connection Management (cont.)
Step 3: client receives FIN,
replies with ACK.
Enters “timed wait” - will
respond with ACK to
received FINs
client
server
closing
closing
Step 4: server, receives ACK.
Note: with small modification,
can handle simultaneous FINs.
timed wait
Connection closed.
closed
closed
Transport Layer 3-61
TCP Congestion Control
end-end control (no network
assistance)
sender limits transmission:
LastByteSent-LastByteAcked
CongWin
Roughly,
CongWin is dynamic,
function of
CongWin
rate = network congestion
Bytes/sec
perceived
RTT
How does sender perceive
congestion?
loss event = timeout or 3
duplicate acks
TCP sender reduces rate
(CongWin) after loss event
three mechanisms:
AIMD
slow start
conservative after timeout
events
Transport Layer 3-62
TCP AIMD
multiplicative decrease:
cut CongWin in half
after loss event
congestion
window
additive increase:
increase CongWin by
1 MSS every RTT in
the absence of loss
events: probing
24 Kbytes
16 Kbytes
8 Kbytes
time
Long-lived TCP connection
Transport Layer 3-63
TCP Slow Start
When connection begins,
CongWin = 1 MSS
Example: MSS = 500
bytes & RTT = 200 msec
initial rate = 20 kbps
When connection begins,
increase rate
exponentially fast until
first loss event
available bandwidth may
be >> MSS/RTT
desirable to quickly ramp
up to respectable rate
Transport Layer 3-64
TCP Slow Start (more)
When connection
Host B
RTT
begins, increase rate
exponentially until
first loss event:
Host A
double CongWin every
RTT
done by incrementing
CongWin for every ACK
received
Summary: initial rate
is slow but ramps up
exponentially fast
time
Transport Layer 3-65
Refinement
After 3 dup ACKs:
CongWin is cut in half
window then grows linearly
But after timeout event:
CongWin instead set to 1
MSS;
window then grows
exponentially
to a threshold, then grows
linearly
Philosophy:
• 3 dup ACKs indicates
network capable of
delivering some segments
• timeout before 3 dup
ACKs is “more alarming”
Transport Layer 3-66
Refinement (more)
Q: When should the
exponential increase
switch to linear?
A: When CongWin gets
to 1/2 of its value
before timeout.
Implementation:
Variable Threshold
At loss event, Threshold is set
to 1/2 of CongWin just before
loss event
Transport Layer 3-67
Summary: TCP Congestion Control
When CongWin is below Threshold, sender in slow-start
phase, window grows exponentially.
When CongWin is above Threshold, sender is in congestion-
avoidance phase, window grows linearly.
When a triple duplicate ACK occurs, Threshold set to
CongWin/2 and CongWin set to Threshold.
When timeout occurs, Threshold set to CongWin/2 and
CongWin is set to 1 MSS.
Transport Layer 3-68
TCP sender congestion control
Event
State
TCP Sender Action
Commentary
ACK receipt
for previously
unacked
data
Slow Start
(SS)
CongWin = CongWin + MSS,
If (CongWin > Threshold)
set state to “Congestion
Avoidance”
Resulting in a doubling of
CongWin every RTT
ACK receipt
for previously
unacked
data
Congestion
Avoidance
(CA)
CongWin = CongWin+MSS *
(MSS/CongWin)
Additive increase, resulting
in increase of CongWin by
1 MSS every RTT
Loss event
detected by
triple
duplicate
ACK
SS or CA
Threshold = CongWin/2,
CongWin = Threshold,
Set state to “Congestion
Avoidance”
Fast recovery,
implementing multiplicative
decrease. CongWin will not
drop below 1 MSS.
Timeout
SS or CA
Threshold = CongWin/2,
CongWin = 1 MSS,
Set state to “Slow Start”
Enter slow start
Duplicate
ACK
SS or CA
Increment duplicate ACK count
for segment being acked
CongWin and Threshold not
changed
Transport Layer 3-69
Interplay between routing and
forwarding
routing algorithm
local forwarding table
header value output link
0100
0101
0111
1001
3
2
2
1
value in arriving
packet’s header
0111
1
3 2
Transport Layer 3-70
Graph abstraction
5
2
u
2
1
Graph: G = (N,E)
v
x
3
w
3
1
5
z
1
y
2
N = set of routers = { u, v, w, x, y, z }
E = set of links ={ (u,v), (u,x), (v,x), (v,w), (x,w), (x,y), (w,y), (w,z), (y,z) }
Remark: Graph abstraction is useful in other network contexts
Example: P2P, where N is set of peers and E is set of TCP connections
Transport Layer 3-71
Graph abstraction: costs
5
2
u
v
2
1
x
• c(x,x’) = cost of link (x,x’)
3
w
3
1
5
z
1
y
- e.g., c(w,z) = 5
2
• cost could always be 1, or
inversely related to bandwidth,
or inversely related to
congestion
Cost of path (x1, x2, x3,…, xp) = c(x1,x2) + c(x2,x3) + … + c(xp-1,xp)
Question: What’s the least-cost path between u and z ?
Routing algorithm: algorithm that finds least-cost path
Transport Layer 3-72
Routing Algorithm classification
Global or decentralized
information?
Global:
all routers have complete
topology, link cost info
“link state” algorithms
Decentralized:
router knows physicallyconnected neighbors, link costs
to neighbors
iterative process of
computation, exchange of info
with neighbors
“distance vector” algorithms
Static or dynamic?
Static:
routes change slowly over
time
Dynamic:
routes change more quickly
periodic update
in response to link cost
changes
Transport Layer 3-73
A Link-State Routing Algorithm
Dijkstra’s algorithm
net topology, link costs known
to all nodes
accomplished via “link
state broadcast”
all nodes have same info
computes least cost paths
from one node (‘source”) to all
other nodes
gives forwarding table for
that node
iterative: after k iterations,
know least cost path to k
dest.’s
Notation:
c(x,y): link cost from node x
to y; = ∞ if not direct
neighbors
D(v): current value of cost of
path from source to dest. v
p(v): predecessor node along
path from source to v
N': set of nodes whose least
cost path definitively known
Transport Layer 3-74
Dijsktra’s Algorithm
1 Initialization:
2 N' = {u}
3 for all nodes v
4
if v adjacent to u
5
then D(v) = c(u,v)
6
else D(v) = ∞
7
8 Loop
9 find w not in N' such that D(w) is a minimum
10 add w to N'
11 update D(v) for all v adjacent to w and not in N' :
12
D(v) = min( D(v), D(w) + c(w,v) )
13 /* new cost to v is either old cost to v or known
14 shortest path cost to w plus cost from w to v */
15 until all nodes in N'
Transport Layer 3-75
Dijkstra’s algorithm: example
Step
0
1
2
3
4
5
N'
u
ux
uxy
uxyv
uxyvw
uxyvwz
D(v),p(v) D(w),p(w)
2,u
5,u
2,u
4,x
2,u
3,y
3,y
D(x),p(x)
1,u
D(y),p(y)
∞
2,x
D(z),p(z)
∞
∞
4,y
4,y
4,y
5
2
u
v
2
1
x
3
w
3
1
5
z
1
y
2
Transport Layer 3-76
Dijkstra’s algorithm, discussion
Algorithm complexity: n nodes
each iteration: need to check all nodes, w, not in N
n(n+1)/2 comparisons: O(n2)
more efficient implementations possible: O(nlogn)
Oscillations possible:
e.g., link cost = amount of carried traffic
D
1
1
0
A
0 0
C
e
1+e
e
initially
B
1
2+e
A
0
D 1+e 1 B
0
0
C
… recompute
routing
0
D
1
A
0 0
C
2+e
B
1+e
… recompute
2+e
A
0
D 1+e 1 B
e
0
C
… recompute
Transport Layer 3-77
Distance Vector Algorithm (1)
Bellman-Ford Equation (dynamic programming)
Define
dx(y) := cost of least-cost path from x to y
Then
dx(y) = min {c(x,v) + dv(y) }
where min is taken over all neighbors of x
Transport Layer 3-78
Bellman-Ford example (2)
5
2
u
v
2
1
x
3
w
3
1
5
z
1
y
Clearly, dv(z) = 5, dx(z) = 3, dw(z) = 3
2
B-F equation says:
du(z) = min { c(u,v) + dv(z),
c(u,x) + dx(z),
c(u,w) + dw(z) }
= min {2 + 5,
1 + 3,
5 + 3} = 4
Node that achieves minimum is next
hop in shortest path ➜ forwarding table
Transport Layer 3-79
Distance Vector Algorithm (3)
Dx(y) = estimate of least cost from x to y
Distance vector: Dx = [Dx(y): y є N ]
Node x knows cost to each neighbor v: c(x,v)
Node x maintains Dx = [Dx(y): y є N ]
Node x also maintains its neighbors’ distance
vectors
For each neighbor v, x maintains
Dv = [Dv(y): y є N ]
Transport Layer 3-80
Distance vector algorithm (4)
Basic idea:
Each node periodically sends its own distance
vector estimate to neighbors
When node a node x receives new DV estimate
from neighbor, it updates its own DV using B-F
equation:
Dx(y) ← minv{c(x,v) + Dv(y)}
for each node y ∊ N
Under minor, natural conditions, the estimate Dx(y)
converge the actual least cost dx(y)
Transport Layer 3-81
Distance Vector Algorithm (5)
Iterative, asynchronous: each
local iteration caused by:
local link cost change
DV update message from
neighbor
Distributed:
Each node:
wait for (change in local link
cost of msg from neighbor)
each node notifies neighbors
only when its DV changes
neighbors then notify their
neighbors if necessary
recompute estimates
if DV to any dest has
changed, notify neighbors
Transport Layer 3-82
Dx(y) = min{c(x,y) + Dy(y), c(x,z) + Dz(y)}
= min{2+0 , 7+1} = 2
node x table
cost to
x y z
x ∞∞ ∞
y ∞∞ ∞
z 71 0
from
from
from
from
x 0 2 7
y 2 0 1
z 7 1 0
cost to
x y z
x 0 2 7
y 2 0 1
z 3 1 0
x 0 2 3
y 2 0 1
z 3 1 0
cost to
x y z
x 0 2 3
y 2 0 1
z 3 1 0
x
2
y
7
1
z
cost to
x y z
from
from
from
x ∞ ∞ ∞
y 2 0 1
z ∞∞ ∞
node z table
cost to
x y z
x 0 2 3
y 2 0 1
z 7 1 0
cost to
x y z
cost to
x y z
from
from
x 0 2 7
y ∞∞ ∞
z ∞∞ ∞
node y table
cost to
x y z
cost to
x y z
Dx(z) = min{c(x,y) +
Dy(z), c(x,z) + Dz(z)}
= min{2+1 , 7+0} = 3
x 0 2 3
y 2 0 1
z 3 1 0
time
Transport Layer 3-83
Distance Vector: link cost changes
Link cost changes:
node detects local link cost change
updates routing info, recalculates
distance vector
if DV changes, notify neighbors
“good
news
travels
fast”
1
x
4
y
50
1
z
At time t0, y detects the link-cost change, updates its DV,
and informs its neighbors.
At time t1, z receives the update from y and updates its table.
It computes a new least cost to x and sends its neighbors its DV.
At time t2, y receives z’s update and updates its distance table.
y’s least costs do not change and hence y does not send any
message to z.
Transport Layer 3-84
Distance Vector: link cost changes
Link cost changes:
good news travels fast
bad news travels slow - “count
to infinity” problem!
44 iterations before algorithm
stabilizes: see text
60
x
4
y
50
1
z
Poissoned reverse:
If Z routes through Y to get
to X :
Z tells Y its (Z’s) distance to
X is infinite (so Y won’t route
to X via Z)
will this completely solve
count to infinity problem?
Transport Layer 3-85
Comparison of LS and DV algorithms
Message complexity
LS: with n nodes, E links, O(nE)
msgs sent
DV: exchange between
neighbors only
convergence time varies
Speed of Convergence
LS: O(n2) algorithm requires
O(nE) msgs
may have oscillations
DV: convergence time varies
may be routing loops
count-to-infinity problem
Robustness: what happens if router
malfunctions?
LS:
DV:
node can advertise incorrect
link cost
each node computes only its own
table
DV node can advertise incorrect
path cost
each node’s table used by
others
• error propagate thru network
Transport Layer 3-86
Multiple Access Links and Protocols
Two types of “links”:
point-to-point
PPP for dial-up access
point-to-point link between Ethernet switch and host
broadcast (shared wire or medium)
traditional Ethernet
upstream HFC
802.11 wireless LAN
Transport Layer 3-87
Multiple Access protocols
single shared broadcast channel
two or more simultaneous transmissions by nodes: interference
collision if node receives two or more signals at the same time
multiple access protocol
distributed algorithm that determines how nodes share channel,
i.e., determine when node can transmit
communication about channel sharing must use channel itself!
no out-of-band channel for coordination
Transport Layer 3-88
Ideal Mulitple Access Protocol
Broadcast channel of rate R bps
1. When one node wants to transmit, it can send at
rate R.
2. When M nodes want to transmit, each can send at
average rate R/M
3. Fully decentralized:
no special node to coordinate transmissions
no synchronization of clocks, slots
4. Simple
Transport Layer 3-89
MAC Protocols: a taxonomy
Three broad classes:
Channel Partitioning
divide channel into smaller “pieces” (time slots,
frequency, code)
allocate piece to node for exclusive use
Random Access
channel not divided, allow collisions
“recover” from collisions
“Taking turns”
Nodes take turns, but nodes with more to send can take
longer turns
Transport Layer 3-90
Channel Partitioning MAC protocols: TDMA
TDMA: time division multiple access
access to channel in "rounds"
each station gets fixed length slot (length = pkt trans time)
in each round
unused slots go idle
example: 6-station LAN, 1,3,4 have pkt, slots 2,5,6 idle
Transport Layer 3-91
Channel Partitioning MAC protocols: FDMA
FDMA: frequency division multiple access
channel spectrum divided into frequency bands
each station assigned fixed frequency band
unused transmission time in frequency bands go idle
example: 6-station LAN, 1,3,4 have pkt, frequency bands 2,5,6
frequency bands
idle
Transport Layer 3-92
Random Access Protocols
When node has packet to send
transmit at full channel data rate R.
no a priori coordination among nodes
two or more transmitting nodes ➜ “collision”,
random access MAC protocol specifies:
how to detect collisions
how to recover from collisions (e.g., via delayed
retransmissions)
Examples of random access MAC protocols:
slotted ALOHA
ALOHA
CSMA, CSMA/CD, CSMA/CA
Transport Layer 3-93
Slotted ALOHA
Assumptions
all frames same size
time is divided into equal
size slots, time to transmit 1
frame
nodes start to transmit
frames only at beginning of
slots
nodes are synchronized
if 2 or more nodes transmit
in slot, all nodes detect
collision
Operation
when node obtains fresh frame,
it transmits in next slot
no collision, node can send new
frame in next slot
if collision, node retransmits
frame in each subsequent slot
with prob. p until success
Transport Layer 3-94
Slotted ALOHA
Pros
single active node can
continuously transmit at full
rate of channel
highly decentralized: only
slots in nodes need to be in
sync
simple
Cons
collisions, wasting slots
idle slots
nodes may be able to
detect collision in less than
time to transmit packet
clock synchronization
Transport Layer 3-95
Slotted Aloha efficiency
Efficiency is the long-run
fraction of successful slots
when there are many nodes, each
with many frames to send
Suppose N nodes with many
frames to send, each transmits in
slot with probability p
prob that node 1 has success in a
slot
= p(1-p)N-1
prob that any node has a success =
Np(1-p)N-1
For max efficiency with N
nodes, find p* that
maximizes
Np(1-p)N-1
For many nodes, take limit
of Np*(1-p*)N-1 as N goes
to infinity, gives 1/e = .37
At best: channel
used for useful
transmissions 37%
of time!
Transport Layer 3-96
Pure (unslotted) ALOHA
unslotted Aloha: simpler, no synchronization
when frame first arrives
transmit immediately
collision probability increases:
frame sent at t0 collides with other frames sent in [t0-1,t0+1]
Transport Layer 3-97
CSMA (Carrier Sense Multiple Access)
CSMA: listen before transmit:
If channel sensed idle: transmit entire frame
If channel sensed busy, defer transmission
Human analogy: don’t interrupt others!
Transport Layer 3-98
CSMA collisions
spatial layout of nodes
collisions can still occur:
propagation delay means
two nodes may not hear
each other’s transmission
collision:
entire packet transmission
time wasted
note:
role of distance & propagation
delay in determining collision
probability
Transport Layer 3-99
CSMA/CD (Collision Detection)
CSMA/CD: carrier sensing, deferral as in CSMA
collisions detected within short time
colliding transmissions aborted, reducing channel
wastage
collision detection:
easy in wired LANs: measure signal strengths,
compare transmitted, received signals
difficult in wireless LANs: receiver shut off while
transmitting
human analogy: the polite conversationalist
Transport Layer 3-100
CSMA/CD collision detection
Transport Layer 3-101
“Taking Turns” MAC protocols
channel partitioning MAC protocols:
share channel efficiently and fairly at high load
inefficient at low load: delay in channel access, 1/N
bandwidth allocated even if only 1 active node!
Random access MAC protocols
efficient at low load: single node can fully utilize channel
high load: collision overhead
“taking turns” protocols
look for best of both worlds!
Transport Layer 3-102
“Taking Turns” MAC protocols
Polling:
master node
“invites” slave nodes
to transmit in turn
concerns:
polling overhead
latency
single point of
failure (master)
Token passing:
control token passed from one
node to next sequentially.
token message
concerns:
token overhead
latency
single point of failure (token)
Transport Layer 3-103
Ethernet uses CSMA/CD
No slots
adapter doesn’t transmit
if it senses that some
other adapter is
transmitting, that is,
carrier sense
transmitting adapter
aborts when it senses
that another adapter is
transmitting, that is,
collision detection
Before attempting a
retransmission,
adapter waits a
random time, that is,
random access
Transport Layer 3-104
Ethernet CSMA/CD algorithm
1. Adaptor receives datagram from net 4. If adapter detects another
layer & creates frame
transmission while transmitting,
2. If adapter senses channel idle, it
aborts and sends jam signal
starts to transmit frame. If it
5. After aborting, adapter enters
senses channel busy, waits until
exponential backoff: after the
channel idle and then transmits
mth collision, adapter chooses a
3. If adapter transmits entire frame
K at random from
without detecting another
{0,1,2,…,2m-1}. Adapter waits
transmission, the adapter is done
K·512 bit times and returns to
with frame !
Step 2
Transport Layer 3-105
Ethernet’s CSMA/CD (more)
Jam Signal: make sure all other
transmitters are aware of
collision; 48 bits
Bit time: .1 microsec for 10 Mbps
Ethernet ;
for K=1023, wait time is about
50 msec
See/interact with Java
applet on AWL Web site:
highly recommended !
Exponential Backoff:
Goal: adapt retransmission
attempts to estimated
current load
heavy load: random wait will
be longer
first collision: choose K from
{0,1}; delay is K· 512 bit
transmission times
after second collision: choose
K from {0,1,2,3}…
after ten collisions, choose K
from {0,1,2,3,4,…,1023}
Transport Layer 3-106