Transcript: TCP

Transport Layer
Our goals:
• understand principles behind transport layer services:
  – multiplexing/demultiplexing
  – reliable data transfer
  – flow control
  – congestion control
• learn about transport layer protocols in the Internet:
  – UDP: connectionless transport
  – TCP: connection-oriented transport
  – TCP congestion control
Ref: slides by J. Kurose and K. Ross
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Principles of reliable data transfer
• Connection-oriented transport: TCP
  – segment structure
  – reliable data transfer
  – flow control
  – connection management
• Principles of congestion control
• TCP congestion control
Transport services and protocols
• provide logical communication between app processes running on different hosts
• transport protocols run in end systems
  – send side: breaks app messages into segments, passes to network layer
  – rcv side: reassembles segments into messages, passes to app layer
• more than one transport protocol available to apps
  – Internet: TCP and UDP
[Figure: full protocol stacks (application, transport, network, data link, physical) at the end hosts; only the lower layers (network, data link, physical) at the routers in between.]
Transport vs. network layer
• network layer: logical communication between hosts
• transport layer: logical communication between processes
  – relies on, enhances, network layer services
Household analogy: 12 kids sending letters to 12 kids
• processes = kids
• app messages = letters in envelopes
• hosts = houses
• transport protocol = Ann and Bill
• network-layer protocol = postal service
Internet transport-layer protocols
• reliable, in-order delivery (TCP)
  – congestion control
  – flow control
  – connection setup
• unreliable, unordered delivery: UDP
  – no-frills extension of "best-effort" IP
• services not available:
  – delay guarantees
  – bandwidth guarantees
[Figure: protocol stacks at the end hosts and routers along the path; the transport layer exists only in the end systems.]
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Principles of reliable data transfer
• Connection-oriented transport: TCP
  – segment structure
  – reliable data transfer
  – flow control
  – connection management
• Principles of congestion control
• TCP congestion control
Multiplexing/demultiplexing
Multiplexing at send host: gathering data from multiple sockets, enveloping data with header (later used for demultiplexing)
Demultiplexing at rcv host: delivering received segments to correct socket
[Figure: three hosts, each with an application/transport/network/link/physical stack; processes P1–P4 attach to sockets, and arriving segments are demultiplexed to the correct socket on each host.]
How demultiplexing works
• host receives IP datagrams
  – each datagram has source IP address, destination IP address
  – each datagram carries 1 transport-layer segment
  – each segment has source, destination port number (recall: well-known port numbers for specific applications)
• host uses IP addresses & port numbers to direct segment to appropriate socket

TCP/UDP segment format (32 bits wide): source port # | dest port #; other header fields; application data (message)
Connectionless demultiplexing
• Create sockets:
  sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
  bind(sock, (struct sockaddr *)&addr, sizeof(addr));
  sendto(sock, buffer, size, 0);
  recvfrom(sock, Buffer, buffersize, 0);
• UDP socket identified by two-tuple: (dest IP address, dest port number)
• When host receives UDP segment:
  – checks destination port number in segment
  – directs UDP segment to socket with that port number
• IP datagrams with different source IP addresses and/or source port numbers directed to same socket
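A minimal sketch of this idea in C (assuming a POSIX sockets environment; port 5000 and the buffer size are arbitrary choices for the example, not from the slides): the socket is bound to a local port, and the OS then demultiplexes every arriving UDP segment with that destination port to this one socket, regardless of its source.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* create a UDP (datagram) socket */
    int sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    /* bind to a local port; the OS uses the (dest IP, dest port)
       of arriving segments to demultiplex them to this socket */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5000);          /* arbitrary example port */
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    /* segments from any source IP/port carrying dest port 5000
       are delivered to this one socket */
    char buf[1500];
    struct sockaddr_in src;
    socklen_t srclen = sizeof(src);
    ssize_t n = recvfrom(sock, buf, sizeof(buf), 0,
                         (struct sockaddr *)&src, &srclen);
    if (n >= 0)
        printf("got %zd bytes from %s:%d\n", n,
               inet_ntoa(src.sin_addr), ntohs(src.sin_port));
    close(sock);
    return 0;
}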
Connection-oriented demux
• TCP socket identified by 4-tuple:
  – source IP address
  – source port number
  – dest IP address
  – dest port number
• recv host uses all four values to direct segment to appropriate socket
• Server host may support many simultaneous TCP sockets:
  – each socket identified by its own 4-tuple
• Web servers have different sockets for each connecting client
  – non-persistent HTTP will have a different socket for each request
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Principles of reliable data transfer
• Connection-oriented transport: TCP
  – segment structure
  – reliable data transfer
  – flow control
  – connection management
• Principles of congestion control
• TCP congestion control
Port
• Known port
• Port mapper
• Port is an abstraction
– Implementation depends on OS.
• Q: can we use pid (process id) instead of
port number?
Xin Liu
12
UDP: User Datagram Protocol [RFC 768]
• "no frills," "bare bones" Internet transport protocol
• "best effort" service, UDP segments may be:
  – lost
  – delivered out of order to app
• connectionless:
  – no handshaking between UDP sender, receiver
  – each UDP segment handled independently of others
Why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small segment header
• no congestion control: UDP can blast away as fast as desired
DHCP client-server scenario (arriving client; DHCP server: 223.1.2.5)
DHCP discover:  src: 0.0.0.0, 68;   dest: 255.255.255.255, 67;  yiaddr: 0.0.0.0;    transaction ID: 654
DHCP offer:     src: 223.1.2.5, 67; dest: 255.255.255.255, 68;  yiaddr: 223.1.2.4;  transaction ID: 654;  Lifetime: 3600 secs
DHCP request:   src: 0.0.0.0, 68;   dest: 255.255.255.255, 67;  yiaddr: 223.1.2.4;  transaction ID: 655;  Lifetime: 3600 secs
DHCP ACK:       src: 223.1.2.5, 67; dest: 255.255.255.255, 68;  yiaddr: 223.1.2.4;  transaction ID: 655;  Lifetime: 3600 secs
Applications and protocols

application              App-layer protocol   Transport protocol
E-mail                   SMTP                 TCP
Remote terminal access   Telnet               TCP
Web                      HTTP                 TCP
File transfer            FTP                  TCP
Streaming                proprietary          Typically UDP
IP-phone                 proprietary          Typically UDP
Routing                  RIP                  Typically UDP
Name translation         DNS                  Typically UDP
Dynamic IP               DHCP                 Typically UDP
Network mng.             SNMP                 Typically UDP
UDP: more
• often used for streaming multimedia apps
  – loss tolerant
  – rate sensitive
• reliable transfer over UDP: add reliability at application layer
  – application-specific error recovery!

UDP segment format (32 bits wide): source port # | dest port #; length (in bytes of UDP segment, including header) | checksum; application data (message)
Checksum
• Goal: detect “errors” (e.g., flipped bits) in
transmitted segment
• UDP header and data, and part of IP header
• Pseudo header
– Source/dest IP address
– Protocol, & UDP length field (again!)
– Q: why include the pseudo header, which is not a part
of UDP segment?
• Same procedure for TCP
Xin Liu
17
UDP checksum
Sender:
• treat segment contents as sequence of 16-bit integers
• checksum: addition (1's complement sum) of segment contents
• sender puts checksum value into UDP checksum field
Receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO - error detected
  – YES - no error detected. But maybe errors nonetheless?
  – may pass the damaged data
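A minimal sketch of the 16-bit one's-complement sum described above (not the full UDP pseudo-header computation; the sample words are arbitrary):

#include <stdint.h>
#include <stdio.h>

/* 1's-complement sum of 16-bit words; the checksum field carries
   the complement of this sum */
uint16_t ones_complement_checksum(const uint16_t *words, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += words[i];
        if (sum > 0xFFFF)              /* wrap the carry back in */
            sum = (sum & 0xFFFF) + 1;
    }
    return (uint16_t)~sum;             /* complement -> checksum field */
}

int main(void) {
    uint16_t data[] = {0x1111, 0x2222, 0x3333};   /* arbitrary example words */
    uint16_t cksum = ones_complement_checksum(data, 3);

    /* receiver check: data plus checksum should sum to all 1s (0xFFFF) */
    uint32_t verify = 0x1111 + 0x2222 + 0x3333 + cksum;
    verify = (verify & 0xFFFF) + (verify >> 16);
    printf("checksum = 0x%04x, verify = 0x%04x\n", cksum, verify);
    return 0;
}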
Checksum Examples
Sender (4-bit words for illustration): 1111 + 1000 = (1)0111; wrap the carry: 0111 + 1 = 1000; checksum A = complement = 0111
Q: one bit flip. Received words: 1111, 1000, 0011 (checksum 0111 arrived as 0011)
Receiver: 1111 + 1000 + 0011 = 1011 (carry wrapped), not all 1s
A: error!
Q: two bits flip?
Q: is it possible the checksum is all 1s?
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Principles of reliable data transfer
• Connection-oriented transport: TCP
  – segment structure
  – reliable data transfer
  – flow control
  – connection management
• Principles of congestion control
• TCP congestion control
Principles of Reliable data transfer
• important in app., transport, link layers
• top-10 list of important networking topics!
• characteristics of unreliable channel will determine
complexity of reliable data transfer protocol (rdt)
Xin Liu
22
Reliable data transfer: getting started
• rdt_send(): called from above (e.g., by app.); passes data to deliver to the receiver's upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to the upper layer
(send side uses rdt_send()/udt_send(); receive side uses rdt_rcv()/deliver_data())
Reliable data transfer: getting started
We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  – but control info will flow in both directions!
• use finite state machines (FSM) to specify sender, receiver

state: when in this "state", next state uniquely determined by next event
FSM notation: a transition from state 1 to state 2 is labeled with the event causing the state transition and the actions taken on the state transition.
Rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  – no bit errors
  – no loss of packets
• separate FSMs for sender, receiver:
  – sender sends data into underlying channel
  – receiver reads data from underlying channel

sender FSM (one state, "Wait for call from above"):
  rdt_send(data): packet = make_pkt(data); udt_send(packet)
receiver FSM (one state, "Wait for call from below"):
  rdt_rcv(packet): extract(packet, data); deliver_data(data)

Q: What about bit errors?
Rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
– recall: UDP checksum to detect bit errors
• the question: how to recover from errors:
– acknowledgements (ACKs): receiver explicitly tells sender that pkt
received OK
– negative acknowledgements (NAKs): receiver explicitly tells sender
that pkt had errors
– sender retransmits pkt on receipt of NAK
– human scenarios using ACKs, NAKs?
• new mechanisms in rdt2.0 (beyond rdt1.0):
– error detection
– receiver feedback: control msgs (ACK,NAK) rcvr->sender
Xin Liu
26
rdt2.0: FSM specification

sender (two states):
  Wait for call from above:
    rdt_send(data): sndpkt = make_pkt(data, checksum); udt_send(sndpkt)   -> Wait for ACK or NAK
  Wait for ACK or NAK:
    rdt_rcv(rcvpkt) && isNAK(rcvpkt): udt_send(sndpkt)                    -> stay
    rdt_rcv(rcvpkt) && isACK(rcvpkt): (no action)                         -> Wait for call from above

receiver (one state, Wait for call from below):
    rdt_rcv(rcvpkt) && corrupt(rcvpkt): udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt): extract(rcvpkt,data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
[Figure: same FSMs as above, with the error-free path highlighted — sender sends, receiver delivers and ACKs, sender returns to waiting for data from above.]
rdt2.0: error scenario
[Figure: same FSMs as above, with the corrupted-packet path highlighted — receiver sends NAK, sender retransmits sndpkt.]
Q: what about ACK/NAK corruption?
rdt2.0 has a fatal flaw!
What happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
What to do?
• sender ACKs/NAKs receiver's ACK/NAK? What if sender's ACK/NAK lost?
• retransmit, but this might cause retransmission of correctly received pkt!
Handling duplicates:
• sender adds sequence number to each pkt
• sender retransmits current pkt if ACK/NAK garbled
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs (four states)
  Wait for call 0 from above:
    rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)        -> Wait for ACK or NAK 0
  Wait for ACK or NAK 0:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)       -> stay
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): (no action)           -> Wait for call 1 from above
  Wait for call 1 from above:
    rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt)        -> Wait for ACK or NAK 1
  Wait for ACK or NAK 1:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)       -> stay
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): (no action)           -> Wait for call 0 from above
rdt2.1: receiver, handles garbled ACK/NAKs
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
&& has_seq0(rcvpkt)
rdt_rcv(rcvpkt) && (corrupt(rcvpkt)
extract(rcvpkt,data)
deliver_data(data)
sndpkt = make_pkt(ACK, chksum)
udt_send(sndpkt)
rdt_rcv(rcvpkt) && (corrupt(rcvpkt)
sndpkt = make_pkt(NAK, chksum)
udt_send(sndpkt)
rdt_rcv(rcvpkt) &&
not corrupt(rcvpkt) &&
has_seq1(rcvpkt)
sndpkt = make_pkt(ACK, chksum)
udt_send(sndpkt)
sndpkt = make_pkt(NAK, chksum)
udt_send(sndpkt)
Wait for
0 from
below
Wait for
1 from
below
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
&& has_seq1(rcvpkt)
extract(rcvpkt,data)
deliver_data(data)
sndpkt = make_pkt(ACK, chksum)
udt_send(sndpkt)
Xin Liu
rdt_rcv(rcvpkt) &&
not corrupt(rcvpkt) &&
has_seq0(rcvpkt)
sndpkt = make_pkt(ACK, chksum)
udt_send(sndpkt)
rdt2.1: discussion
Sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "current" pkt has 0 or 1 seq. #
Receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver can not know if its last ACK/NAK was received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments

sender FSM fragment:
  Wait for call 0 from above:
    rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)       -> Wait for ACK 0
  Wait for ACK 0:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): udt_send(sndpkt)    -> stay
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): (no action)        -> next state

receiver FSM fragment:
  (entering Wait for 0 from below) rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
      extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  (while in Wait for 0 from below) rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)):
      udt_send(sndpkt)    (re-send ACK1: duplicate ACK)
rdt3.0: channels with errors and loss
New assumption: underlying channel can also lose packets (data or ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help, but not enough
Q: how to deal with loss?
  – sender waits until certain data or ACK lost, then retransmits
  – yuck: drawbacks?
Approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but use of seq. #'s already handles this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender (four states)
  Wait for call 0 from above:
    rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer   -> Wait for ACK0
    rdt_rcv(rcvpkt): (ignore)
  Wait for ACK0:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): (ignore)
    timeout: udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer              -> Wait for call 1 from above
  Wait for call 1 from above:
    rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer   -> Wait for ACK1
    rdt_rcv(rcvpkt): (ignore)
  Wait for ACK1:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): (ignore)
    timeout: udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer              -> Wait for call 0 from above

Q: can we improve this one?
rdt3.0 in action
[Figures: sender/receiver timelines for rdt3.0 — normal operation, lost packet, lost ACK, and premature timeout scenarios.]
Performance of rdt3.0
• example: 1 Gbps link, 15 ms e-e prop. delay,
1KB packet
• U sender: utilization – fraction of time sender
busy sending
• Utilization?
Xin Liu
40
rdt3.0: stop-and-wait operation (timeline)
  sender: first packet bit transmitted at t = 0; last packet bit transmitted at t = L/R
  receiver: first packet bit arrives; last packet bit arrives, send ACK
  sender: ACK arrives after one RTT, send next packet at t = RTT + L/R
• rdt3.0 works, but performance stinks
Performance

T_transmit = L (packet length in bits) / R (transmission rate, bps)
           = 8 kb/pkt / 10^9 b/sec = 8 microsec

U_sender = (L/R) / (RTT + L/R) = 0.008 / 30.008 = 0.00027    (times in msec)

  – 1KB pkt every 30 msec -> 33 kB/sec throughput over 1 Gbps link
  – network protocol limits use of physical resources!
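A quick numeric check of the utilization above, in C (same numbers as the slide: 1 Gbps link, 30 ms RTT, 1 KB packet):

#include <stdio.h>

int main(void) {
    double L = 8000.0;        /* packet length: 1 KB = 8000 bits */
    double R = 1e9;           /* 1 Gbps link */
    double RTT = 0.030;       /* 30 ms round trip */

    double t_transmit = L / R;                    /* 8 microseconds */
    double util = t_transmit / (RTT + t_transmit);
    double throughput = util * R;                 /* effective bits/sec */

    printf("U_sender = %.5f, throughput = %.1f kB/sec\n",
           util, throughput / 8 / 1000);          /* ~0.00027, ~33 kB/sec */
    return 0;
}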
Pipelined protocols
Pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• Two generic forms of pipelined protocols: go-Back-N, selective repeat
Pipelining: increased utilization (timeline)
  sender: first packet bit transmitted at t = 0; last bit transmitted at t = L/R
  receiver: last bit of the 1st, 2nd, and 3rd packet arrives; an ACK is sent for each
  sender: first ACK arrives, send next packet at t = RTT + L/R
Increase utilization by a factor of 3!

U_sender = (3 * L/R) / (RTT + L/R) = 0.024 / 30.008 = 0.0008
Go-Back-N
Sender:
• k-bit seq # in pkt header
• "window" of up to N consecutive unack'ed pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• a timer for the oldest in-flight pkt
• timeout(n): retransmit pkt n and all higher seq # pkts in window
GBN: sender extended FSM (single state "Wait")

initialization:
  base = 1
  nextseqnum = 1

rdt_send(data):
  if (nextseqnum < base + N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt) + 1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  (ignore)
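A toy, in-memory sketch of the sender bookkeeping above in C (window size N = 4 is an assumption; udt_send just prints instead of transmitting, and there is no real timer):

#include <stdbool.h>
#include <stdio.h>

#define N 4                                  /* window size (assumed) */
#define MAXSEQ 64

static int base = 1, nextseqnum = 1;
static int sndpkt[MAXSEQ];                   /* stands in for stored packets */
static bool timer_running = false;

static void udt_send(int seq)  { printf("send pkt %d\n", seq); }
static void start_timer(void)  { timer_running = true; }
static void stop_timer(void)   { timer_running = false; }

void rdt_send(int data) {
    if (nextseqnum < base + N) {
        sndpkt[nextseqnum] = data;
        udt_send(nextseqnum);
        if (base == nextseqnum) start_timer();
        nextseqnum++;
    } else {
        printf("refuse data %d (window full)\n", data);
    }
}

void timeout(void) {
    start_timer();
    for (int i = base; i < nextseqnum; i++)  /* go back N: resend all unacked */
        udt_send(i);
}

void ack_received(int acknum) {              /* cumulative ACK */
    base = acknum + 1;
    if (base == nextseqnum) stop_timer(); else start_timer();
}

int main(void) {
    for (int d = 0; d < 6; d++) rdt_send(d); /* last two refused: window full */
    ack_received(2);                         /* window slides to base = 3 */
    rdt_send(100);                           /* now fits: sent as pkt 5 */
    timeout();                               /* resends pkts 3, 4, 5 */
    return 0;
}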
GBN: receiver extended FSM (single state "Wait")

initialization:
  expectedseqnum = 1
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer) -> no receiver buffering!
  – Re-ACK pkt with highest in-order seq #
GBN in action
[Figure: timeline showing the sender window sliding, a lost packet, duplicate ACKs from the receiver, and go-back-N retransmission of the whole window after timeout.]
Selective Repeat
• receiver individually acknowledges all correctly
received pkts
– buffers pkts, as needed, for eventual in-order delivery
to upper layer
• sender only resends pkts for which ACK not
received
– sender timer for each unACKed pkt
• sender window
– N consecutive seq #’s
– again limits seq #s of sent, unACKed pkts
Xin Liu
49
Selective repeat: sender, receiver windows
[Figure: sender window (already ACKed, sent but not yet ACKed, usable, not usable ranges around send_base and nextseqnum) and receiver window (out-of-order packets buffered, acceptable range starting at rcv_base).]
Selective repeat
sender:
  data from above:
    • if next available seq # in window, send pkt
  timeout(n):
    • resend pkt n, restart timer
  ACK(n) in [sendbase, sendbase+N]:
    • mark pkt n as received
    • if n is the smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
  pkt n in [rcvbase, rcvbase+N-1]:
    • send ACK(n)
    • out-of-order: buffer
    • in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
  pkt n in [rcvbase-N, rcvbase-1]:
    • ACK(n)
  otherwise:
    • ignore
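A toy sketch of the receiver rules above in C (window size N = 4 is an assumption; sequence numbers are not wrapped, and sending an ACK or delivering data just prints):

#include <stdbool.h>
#include <stdio.h>

#define N 4
#define MAXSEQ 64

static bool buffered[MAXSEQ];
static int  rcvbase = 0;

static void send_ack(int n) { printf("ACK %d\n", n); }
static void deliver(int n)  { printf("deliver pkt %d\n", n); }

void pkt_received(int n) {
    if (n >= rcvbase && n < rcvbase + N) {      /* in receive window */
        send_ack(n);
        buffered[n] = true;
        while (buffered[rcvbase]) {             /* deliver the in-order run */
            deliver(rcvbase);
            buffered[rcvbase] = false;
            rcvbase++;                          /* advance window */
        }
    } else if (n >= rcvbase - N && n < rcvbase) {
        send_ack(n);                            /* old packet: re-ACK only */
    }                                           /* otherwise: ignore */
}

int main(void) {
    pkt_received(0);        /* delivered immediately */
    pkt_received(2);        /* out of order: buffered, ACKed */
    pkt_received(1);        /* fills the gap: 1 and 2 delivered */
    pkt_received(0);        /* duplicate of old packet: re-ACKed only */
    return 0;
}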
Selective repeat in action
[Figure: timeline with individual ACKs; a lost packet is retransmitted alone after its timer expires while the receiver buffers later out-of-order packets.]
Selective repeat: dilemma
Example:
• seq #'s: 0, 1, 2, 3
• window size = 3
• receiver sees no difference in the two scenarios!
• incorrectly passes duplicate data as new in (a)
Q: what relationship between seq # size and window size?
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Principles of reliable data transfer
• Connection-oriented transport: TCP
  – segment structure
  – reliable data transfer
  – flow control
  – connection management
• Principles of congestion control
• TCP congestion control
TCP: Overview   RFCs: 793, 1122, 1323, 2018, 2581
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• send & receive buffers
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) init's sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
[Figure: application writes data into the socket; TCP send buffer at the sender, segments across the network, TCP receive buffer at the receiver; application reads data from the socket.]
TCP segment structure (32 bits wide)
  source port # | dest port #
  sequence number (counting by bytes of data, not segments!)
  acknowledgement number
  head len | not used | U A P R S F | Receive window (# bytes rcvr willing to accept)
  checksum (Internet checksum, as in UDP) | Urg data pointer
  Options (variable length)
  application data (variable length)
Flag bits: URG: urgent data (generally not used); ACK: ACK # valid; PSH: push data now; RST, SYN, FIN: connection estab (setup, teardown commands)
TCP Options
• MSS
• Timestamp
• Window scale
• Selective-ACK
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Principles of reliable data transfer
• Connection-oriented transport: TCP
  – segment structure
  – reliable data transfer
  – flow control
  – connection management
• Principles of congestion control
• TCP congestion control
TCP seq. #'s and ACKs
Seq. #'s:
  – byte stream "number" of first byte in segment's data
ACKs:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say - up to implementor
[Figure: simple telnet scenario. User at Host A types 'C'; Host B ACKs receipt of 'C' and echoes back 'C'; Host A ACKs receipt of the echoed 'C'.]
TCP Round Trip Time and Timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout
  – unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
TCP Round Trip Time and Timeout
EstimatedRTT = (1 - α)*EstimatedRTT + α*SampleRTT
• Exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
Example RTT estimation:
[Figure: SampleRTT and EstimatedRTT in milliseconds (roughly 100–350 ms) vs. time in seconds, for gaia.cs.umass.edu to fantasia.eurecom.fr; EstimatedRTT is visibly smoother than SampleRTT.]
TCP Round Trip Time and Timeout
Setting the timeout
• EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT -> larger safety margin
• first estimate of how much SampleRTT deviates from EstimatedRTT:
    DevRTT = (1 - β)*DevRTT + β*|SampleRTT - EstimatedRTT|
    (typically, β = 0.25)
Then set timeout interval:
    TimeoutInterval = EstimatedRTT + 4*DevRTT
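A minimal sketch of this estimator in C (α = 0.125 and β = 0.25 as above; the SampleRTT sequence is made up; the first-sample initialization follows RFC 6298):

#include <stdio.h>

static double estimated_rtt = 0.0, dev_rtt = 0.0;
static const double alpha = 0.125, beta = 0.25;

double update_timeout(double sample_rtt) {
    if (estimated_rtt == 0.0) {            /* first sample initializes */
        estimated_rtt = sample_rtt;
        dev_rtt = sample_rtt / 2.0;
    } else {
        double diff = sample_rtt - estimated_rtt;
        estimated_rtt += alpha * diff;     /* (1-a)*Est + a*Sample */
        dev_rtt = (1 - beta) * dev_rtt + beta * (diff < 0 ? -diff : diff);
    }
    return estimated_rtt + 4.0 * dev_rtt;  /* TimeoutInterval */
}

int main(void) {
    double samples[] = {100, 120, 95, 250, 110};   /* made-up SampleRTTs, ms */
    for (int i = 0; i < 5; i++)
        printf("sample %.0f ms -> timeout %.1f ms\n",
               samples[i], update_timeout(samples[i]));
    return 0;
}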
RTT
• The timestamp option can be used to measure RTT for each segment
• Better RTT estimate
• Q: do sender and receiver need to synchronize their clocks/timestamp unit?
  – No synchronization required
TCP reliable data transfer
• TCP creates rdt
service on top of IP’s
unreliable service
• Pipelined segments
• Cumulative acks
• TCP uses single
retransmission timer
• Retransmissions are
triggered by:
– timeout events
– duplicate acks
• Initially consider
simplified TCP sender:
– ignore duplicate acks
– ignore flow control,
congestion control
Xin Liu
64
TCP sender events:
data rcvd from app:
• Create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running (think of timer as for oldest unacked segment)
• expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
Ack rcvd:
• If it acknowledges previously unacked segments
  – update what is known to be acked
  – start timer if there are outstanding segments
TCP sender (simplified)

NextSeqNum = InitialSeqNum
SendBase = InitialSeqNum

loop (forever) {
  switch(event)

  event: data received from application above
    create TCP segment with sequence number NextSeqNum
    if (timer currently not running)
      start timer
    pass segment to IP
    NextSeqNum = NextSeqNum + length(data)

  event: timer timeout
    retransmit not-yet-acknowledged segment with smallest sequence number
    start timer

  event: ACK received, with ACK field value of y
    if (y > SendBase) {
      SendBase = y
      if (there are currently not-yet-acknowledged segments)
        start timer
    }
} /* end of loop forever */

Comment:
• SendBase-1: last cumulatively ack'ed byte
Example:
• SendBase-1 = 71; y = 73, so the rcvr wants 73+; y > SendBase, so new data is acked
TCP: retransmission scenarios
[Figure (lost ACK scenario): Host A sends Seq=92; the ACK from Host B is lost; the Seq=92 timer expires and A retransmits; SendBase advances to 100 once the retransmission is ACKed.]
[Figure (premature timeout): Host A sends Seq=92 and Seq=100; the ACKs are delayed beyond the Seq=92 timeout, so A retransmits unnecessarily; the cumulative ACK then advances SendBase from 100 to 120.]
TCP retransmission scenarios (more)
[Figure (cumulative ACK scenario): Host A sends two segments; the first ACK is lost, but the cumulative ACK for the second segment arrives before the timeout, so SendBase advances to 120 and nothing is retransmitted.]
Q: what to do at time t + t_0? Is there a timeout?
TCP ACK generation [RFC 1122, RFC 2581]

Event at Receiver                                   | TCP Receiver action
----------------------------------------------------+-------------------------------------------------
Arrival of in-order segment with expected seq #.    | Delayed ACK. Wait up to 500ms for next segment.
All data up to expected seq # already ACKed.        | If no next segment, send ACK.
Arrival of in-order segment with expected seq #.    | Immediately send single cumulative ACK,
One other segment has ACK pending.                  | ACKing both in-order segments.
Arrival of out-of-order segment with higher-than-   | Immediately send duplicate ACK, indicating
expected seq #. Gap detected.                       | seq. # of next expected byte.
Arrival of segment that partially or completely     | Immediately send ACK, provided that segment
fills gap.                                          | starts at lower end of gap.
Nagle algorithm
• If a TCP connection has outstanding data that has not yet been acked, small segments cannot be sent until that outstanding data is acked.
• Self-clocking: the faster the ACKs come back, the faster the data is sent
• May need to be disabled sometimes
Fast Retransmit
• Time-out period often relatively long:
  – long delay before resending lost packet
• Detect lost segments via duplicate ACKs.
  – Sender often sends many segments back-to-back
  – If a segment is lost, there will likely be many duplicate ACKs.
• If sender receives 3 ACKs for the same data, it supposes that the segment after the ACKed data was lost:
  – fast retransmit: resend segment before timer expires
Fast retransmit algorithm:

event: ACK received, with ACK field value of y
  if (y > SendBase) {                             /* ACKs new data */
    SendBase = y
    if (there are currently not-yet-acknowledged segments)
      start timer
  }
  else {                                          /* a duplicate ACK for already ACKed segment */
    increment count of dup ACKs received for y
    if (count of dup ACKs received for y == 3) {
      resend segment with sequence number y       /* fast retransmit */
    }
  }
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Principles of reliable data transfer
• Connection-oriented transport: TCP
  – segment structure
  – reliable data transfer
  – flow control
  – connection management
• Principles of congestion control
• TCP congestion control
TCP Flow Control
• receive side of TCP connection has a receive buffer
• app process may be slow at reading from buffer
flow control: sender won't overflow receiver's buffer by transmitting too much, too fast
• speed-matching service: matching the send rate to the receiving app's drain rate
TCP Flow control: how it works
(Suppose TCP receiver discards out-of-order segments)
• spare room in buffer
    = RcvWindow
    = RcvBuffer - [LastByteRcvd - LastByteRead]
• Rcvr advertises spare room by including value of RcvWindow in segments
• Sender limits unACKed data to RcvWindow
  – guarantees receive buffer doesn't overflow
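The spare-room arithmetic above, in C (the buffer size and byte-stream positions are made-up numbers):

#include <stdio.h>

int main(void) {
    long RcvBuffer    = 65536;    /* bytes of receive buffer (assumed) */
    long LastByteRcvd = 120000;   /* byte-stream positions (assumed) */
    long LastByteRead = 90000;

    long RcvWindow = RcvBuffer - (LastByteRcvd - LastByteRead);
    printf("advertised RcvWindow = %ld bytes\n", RcvWindow);
    /* sender must keep LastByteSent - LastByteAcked <= RcvWindow */
    return 0;
}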
More
• Slow receiver
  – Persist timer: a timer set by the sender to keep asking the receiver about its window size.
  – Ack new window
• Long fat pipeline: high speed link and/or long RTT
  – Q: to fully utilize a long fat pipeline, what should the window size be?
    • At least bandwidth * RTT.
  – Window scale option during handshaking
    • Option: scale 0–14, window multiplied by 2^scale
Header (32 bits wide)
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | Receive window
  checksum | Urg data pointer
  Options (variable length)
  application data (variable length)
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Principles of reliable data transfer
• Connection-oriented transport: TCP
  – segment structure
  – reliable data transfer
  – flow control
  – connection management
• Principles of congestion control
• TCP congestion control
TCP Connection Management
Recall: TCP sender, receiver establish "connection" before exchanging data segments
• initialize TCP variables:
  – seq. #s
  – buffers, flow control info (e.g. RcvWindow)
• client: connection initiator
  – connect();
• server: contacted by client
  – accept();
Three way handshake:
Step 1: client host sends TCP SYN segment to server
  – specifies initial seq #
  – no data
Step 2: server host receives SYN, replies with SYNACK segment
  – server allocates buffers
  – specifies server initial seq. #
Step 3: client receives SYNACK, replies with ACK segment, which may contain data
TCP Connection
• SYN, FIN consume
one sequence number
• ISN increase by one
every 4ms.
• 9.5 hours reuse cycle.
• MSS: maximum
segment size
Xin Liu
81
TCP Connection Management (cont.)
Closing a connection: client closes socket: close();
Step 1: client end system sends TCP FIN control segment to server
Step 2: server receives FIN, replies with ACK. Closes connection, sends FIN.
[Figure: client and server timelines — client sends FIN, server ACKs and sends its own FIN; the client enters a timed wait before the connection is fully closed.]
TCP Connection Management (cont.)
Step 3: client receives FIN, replies with ACK.
  – Enters "timed wait" - will respond with ACK to received FINs
Step 4: server receives ACK. Connection closed.
Note: with small modification, can handle simultaneous FINs.
[Figure: client states FIN_WAIT_1 -> FIN_WAIT_2 -> TIME_WAIT (timed wait) -> closed; server closes once the final ACK arrives.]
TCP Connection Management (cont)
[Figure: TCP client lifecycle and TCP server lifecycle state diagrams.]
TCP Connection Management
• Allow half-close, i.e., one end terminates its output but can still receive data
• Allow simultaneous open
• Allow simultaneous close
• Crashes?
[root@shannon liu]# tcpdump -S tcp port 22
tcpdump: listening on eth0
23:01:51.363983 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: S
3036713598:3036713598(0) win 5840 <mss 1460,sackOK,timestamp 13989220 0,nop,wscale 0> (DF)
23:01:51.364829 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: S
2462279815:2462279815(0) ack 3036713599 win 24616 <nop,nop,timestamp 626257407
13989220,nop,wscale 0,nop,nop,sackOK,mss 1460> (DF)
23:01:51.364844 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462279816 win 5840
<nop,nop,timestamp 13989220 626257407> (DF)
23:01:51.375451 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: P
2462279816:2462279865(49) ack 3036713599 win 24616 <nop,nop,timestamp 626257408 13989220>
(DF)
23:01:51.375478 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462279865 win 5840
<nop,nop,timestamp 13989221 626257408> (DF)
23:01:51.379319 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: P
3036713599:3036713621(22) ack 2462279865 win 5840 <nop,nop,timestamp 13989221 626257408>
(DF)
23:01:51.379570 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: . ack 3036713621 win 24616
<nop,nop,timestamp 626257408 13989221>
(DF)
Xin Liu
87
23:01:51.941616 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: P 3036714373:3036714437(64)
ack 2462281065 win 7680 <nop,nop,timestamp 13989277 626257462> (DF)
23:01:51.952442 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: P
2462281065:2462282153(1088) ack 3036714437 win 24616 <nop,nop,timestamp 626257465 13989277>
(DF)
23:01:51.991682 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462282153 win 9792
<nop,nop,timestamp 13989283 626257465> (DF)
23:01:54.699597 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: F 3036714437:3036714437(0)
ack 2462282153 win 9792 <nop,nop,timestamp 13989553 626257465> (DF)
23:01:54.699880 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: . ack 3036714438 win 24616
<nop,nop,timestamp 626257740 13989553>(DF)
23:01:54.701129 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: F 2462282153:2462282153(0)
ack 3036714438 win 24616 <nop,nop,timestamp 626257740 13989553> (DF)
23:01:54.701143 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462282154 win 9792
<nop,nop,timestamp 13989553 626257740> (DF)
26 packets received by filter
0 packets dropped by kernel
Xin Liu
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Principles of reliable data transfer
• Connection-oriented transport: TCP
  – segment structure
  – reliable data transfer
  – flow control
  – connection management
• Principles of congestion control
• TCP congestion control
89
Principles of Congestion Control
Congestion:
• informally: “too many sources sending too much
data too fast for network to handle”
• different from flow control!
• manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)
• a top-10 problem!
Xin Liu
90
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• no retransmission
[Figure: Host A and Host B send original data (λin) through a shared router with unlimited output link buffers; λout is the delivered throughput.]
• large delays when congested
• maximum achievable throughput
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of lost packet
[Figure: Host A and Host B offer λin (original data) plus λ'in (original data plus retransmitted data) into a router with finite shared output link buffers; λout is the delivered throughput.]
• "costs" of congestion:
  – more work (retrans) for given "goodput"
  – unneeded retransmissions: link carries multiple copies of pkt
Causes/costs of congestion: scenario 2
[Figure: throughput curves for three cases — the ideal case where the sender knows whether the buffer is free, the case where the sender knows exactly which packet is lost, and the case with packet loss plus unnecessary retransmissions.]
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λin and λ'in increase?
[Figure: Host A's traffic (λin original data, λ'in original plus retransmitted data) shares finite router buffers with other flows along a multihop path to Host B; λout is the delivered throughput.]
Causes/costs of congestion: scenario 3
[Figure: λout from Host A to Host B collapses as the offered load grows.]
Another "cost" of congestion:
• when a packet is dropped, any upstream transmission capacity used for that packet was wasted!
Approaches towards congestion control
Two broad approaches towards congestion control:
End-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
Network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate sender should send at
Case study: ATM ABR congestion control
ABR: available bit rate:
• "elastic service"
• if sender's path "underloaded":
  – sender should use available bandwidth
• if sender's path congested:
  – sender throttled to minimum guaranteed rate
RM (resource management) cells:
• sent by sender, interspersed with data cells
• bits in RM cell set by switches ("network-assisted")
  – NI bit: no increase in rate (mild congestion)
  – CI bit: congestion indication
• RM cells returned to sender by receiver, with bits intact
Case study: ATM ABR congestion control
• two-byte ER (explicit rate) field in RM cell
  – congested switch may lower ER value in cell
  – sender's send rate is thus the minimum supportable rate on the path
• EFCI bit in data cells: set to 1 in congested switch
  – if data cell preceding RM cell has EFCI set, sender sets CI bit in returned RM cell
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Principles of reliable data transfer
• Connection-oriented transport: TCP
  – segment structure
  – reliable data transfer
  – flow control
  – connection management
• Principles of congestion control
• TCP congestion control
TCP Congestion Control
• end-end control (no network assistance)
• sender limits transmission:
    LastByteSent - LastByteAcked <= cwnd
• Roughly,
    rate = cwnd / RTT  bytes/sec
• cwnd is dynamic, a function of perceived network congestion
How does sender perceive congestion?
• loss event = timeout or 3 duplicate acks
• TCP sender reduces rate (cwnd) after loss event
mechanisms:
  – slow start
  – congestion avoidance
  – AIMD
TCP Slow Start
• When connection begins, cwnd = 1 MSS
  – Example: MSS = 500 bytes & RTT = 200 msec
  – initial rate = 20 kbps
• available bandwidth may be >> MSS/RTT
  – desirable to quickly ramp up to respectable rate
• When connection begins, increase rate exponentially fast until first loss event
TCP Slow Start (more)
• When connection begins, increase rate exponentially until first loss event or ssthresh:
  – incrementing cwnd for every ACK received
  – double cwnd every RTT
• Summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment, then two, then four per RTT as ACKs return from Host B.]
Congestion Avoidance
• ssthresh: when cwnd reaches ssthresh, congestion avoidance begins
• Congestion avoidance: increase cwnd by 1/cwnd (in MSS) each time an ACK is received, i.e., roughly 1 MSS per RTT
• When congestion happens: ssthresh = max(2*MSS, cwnd/2)
TCP AIMD
multiplicative decrease: cut cwnd in half after loss event
additive increase: increase cwnd by 1 MSS every RTT in the absence of loss events: probing
[Figure: congestion window of a long-lived TCP connection vs. time — a sawtooth oscillating between roughly 8, 16, and 24 Kbytes.]
Reno vs. Tahoe
• After 3 dup ACKs:
  – cwnd is cut in half
  – window then grows linearly
• But after timeout event:
  – cwnd instead set to 1 MSS;
  – window then grows exponentially
  – to ssthresh, then grows linearly
Philosophy:
• 3 dup ACKs indicates network capable of delivering some segments
• timeout before 3 dup ACKs is "more alarming"
• Fast retransmission: retransmit the seemingly missing segment
• Fast recovery: congestion avoidance instead of slow start.
Summary: TCP Congestion Control
• When cwnd is below ssthresh, sender is in slow-start phase, window grows exponentially.
• When cwnd is above ssthresh, sender is in congestion-avoidance phase, window grows linearly.
• When a triple duplicate ACK occurs, ssthresh set to cwnd/2 and cwnd set to ssthresh.
• When timeout occurs, ssthresh set to cwnd/2 and cwnd is set to 1 MSS.
• Read Table 3.3 in the textbook.
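A toy sketch of these summary rules in C (cwnd and ssthresh are in MSS units; the initial ssthresh of 64 and the event sequence are assumptions — loss events are scripted here, not detected from a real connection):

#include <stdio.h>

static double cwnd = 1.0, ssthresh = 64.0;

void on_ack(void) {                     /* one ACK for new data */
    if (cwnd < ssthresh)
        cwnd += 1.0;                    /* slow start: +1 MSS per ACK */
    else
        cwnd += 1.0 / cwnd;             /* congestion avoidance: ~+1 MSS per RTT */
}

void on_triple_dup_ack(void) {          /* Reno: multiplicative decrease */
    ssthresh = cwnd / 2.0;
    cwnd = ssthresh;                    /* then grow linearly */
}

void on_timeout(void) {                 /* "more alarming": back to slow start */
    ssthresh = cwnd / 2.0;
    cwnd = 1.0;
}

int main(void) {
    for (int i = 0; i < 200; i++) on_ack();
    printf("after 200 ACKs:   cwnd=%.1f ssthresh=%.1f\n", cwnd, ssthresh);
    on_triple_dup_ack();
    printf("after 3 dup ACKs: cwnd=%.1f ssthresh=%.1f\n", cwnd, ssthresh);
    on_timeout();
    printf("after timeout:    cwnd=%.1f ssthresh=%.1f\n", cwnd, ssthresh);
    return 0;
}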
Trend
• Recent research proposes network-assisted
congestion control: active queue management
• ECN: explicit congestion notification
– 2 bits: 6 &7 in the IP TOS field
• RED: random early detection
– Implicit
– Can be adapted to explicit methods by marking instead
of dropping
Xin Liu
106
TCP Fairness
Fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R.]
Why is TCP fair?
Two competing sessions:
• Additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: connection 1 throughput vs. connection 2 throughput, both bounded by R; repeated cycles of "congestion avoidance: additive increase" followed by "loss: decrease window by factor of 2" converge toward the equal bandwidth share line.]
Fairness (more)
Fairness and UDP
• Multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• Instead use UDP:
  – pump audio/video at constant rate, tolerate packet loss
• Research area: TCP friendly
Fairness and parallel TCP connections
• nothing prevents app from opening parallel connections between 2 hosts.
• Web browsers do this
• Example: link of rate R supporting 9 connections;
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2!
Steady State Throughput
• R: rate at the bottleneck router
• W = R * RTT (unit is bytes)
  – Maximum window size before packets are dropped
• cwnd (unit is bytes)
  – varies between 0.5*W and 1*W
• Average Throughput
  – 0.75*W/RTT = 0.75*R
Steady State Loss Rate
• The sending rate varies from W/(2*RTT) to W/RTT.
• Only one packet is lost per cycle
• w = W/MSS: # of segments in window W.
• Loss rate is L = 1 / ( (3/8)w^2 + (3/4)w )
Delay modeling
Q: How long does it take to receive an object from a Web server after sending a request?
Ignoring congestion, delay is influenced by:
• TCP connection establishment
• data transmission delay
• slow start
Notation, assumptions:
• Assume one link between client and server of rate R
• S: MSS (bits)
• O: object size (bits)
• no retransmissions (no loss, no corruption)
Window size:
• First assume: fixed congestion window, W segments
• Then dynamic window, modeling slow start
Fixed congestion window (1)
First case:
  WS/R > RTT + S/R: ACK for first segment in window returns before window's worth of data sent
  delay = 2RTT + O/R
Fixed congestion window (2)
Second case:
  WS/R < RTT + S/R: wait for ACK after sending window's worth of data
  delay = 2RTT + O/R + (K-1)[S/R + RTT - WS/R]
TCP Delay Modeling: Slow Start (1)
Now suppose the window grows according to slow start.
Will show that the delay for one object is:

  Latency = 2 RTT + O/R + P [RTT + S/R] - (2^P - 1) S/R

where P is the number of times TCP idles at the server:

  P = min{Q, K-1}

- where Q is the number of times the server would idle if the object were of infinite size,
- and K is the number of windows that cover the object.
TCP Delay Modeling: Slow Start (2)
Delay components:
• 2 RTT for connection estab and request
• O/R to transmit object
• time server idles due to slow start
Server idles: P = min{K-1, Q} times
Example:
• O/S = 15 segments
• K = 4 windows
• Q = 2
• P = min{K-1, Q} = 2
Server idles P = 2 times
[Figure: client initiates the TCP connection and requests the object; the server sends the first window (S/R), second window (2S/R), third window (4S/R), and fourth window (8S/R), idling between the early windows while waiting for ACKs; the object is delivered when transmission completes.]
TCP Delay Modeling (3)
• S/R + RTT = time from when the server starts to send a segment until the server receives its acknowledgement
• 2^(k-1) * S/R = time to transmit the kth window
• [S/R + RTT - 2^(k-1) * S/R]^+ = idle time after the kth window

  delay = O/R + 2 RTT + sum_{p=1..P} idleTime_p
        = O/R + 2 RTT + sum_{k=1..P} [S/R + RTT - 2^(k-1) * S/R]
        = O/R + 2 RTT + P [RTT + S/R] - (2^P - 1) S/R

[Figure: same slow-start timeline as before (windows S/R, 2S/R, 4S/R, 8S/R), with the idle periods between the early windows marked.]
TCP Delay Modeling (4)
Recall K = number of windows that cover the object.
How do we calculate K?

  K = min{k : 2^0 S + 2^1 S + ... + 2^(k-1) S >= O}
    = min{k : 2^0 + 2^1 + ... + 2^(k-1) >= O/S}
    = min{k : 2^k - 1 >= O/S}
    = min{k : k >= log2(O/S + 1)}
    = ceil( log2(O/S + 1) )

Calculation of Q, the number of idles for an infinite-size object, is similar.
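A small numeric check of these formulas in C. The parameters are assumptions chosen so that the example matches the slide's worked numbers (K = 4, Q = 2, P = 2): R = 100 kbps, RTT = 100 ms, S = 536*8 bits, O = 15*S.

#include <math.h>
#include <stdio.h>

int main(void) {
    double R   = 100e3;      /* bits/sec (assumed) */
    double RTT = 0.1;        /* seconds  (assumed) */
    double S   = 536 * 8.0;  /* MSS in bits (assumed) */
    double O   = 15 * S;     /* object = 15 segments, as in the slide example */

    int K = (int)ceil(log2(O / S + 1.0));   /* windows that cover the object */

    /* Q: idles for an infinite object = # of windows whose transmission
       time 2^(k-1)*S/R is still shorter than S/R + RTT */
    int Q = 0;
    for (int k = 1; pow(2, k - 1) * S / R < S / R + RTT; k++)
        Q = k;

    int P = (Q < K - 1) ? Q : K - 1;        /* P = min{Q, K-1} */

    double latency = 2 * RTT + O / R + P * (RTT + S / R)
                     - (pow(2, P) - 1) * S / R;

    printf("K=%d Q=%d P=%d latency=%.4f s\n", K, Q, P, latency);
    return 0;
}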
Delay
• We showed:

  Latency = 2 RTT + O/R + P [RTT + S/R] - (2^P - 1) S/R

where P = min{Q, K-1} is the number of times TCP idles at the server, Q is the number of times the server would idle if the object were of infinite size, and K is the number of windows that cover the object.
Summary
• What factors limit TCP throughput?
• What if the sender’s link is the bottleneck?
– What happens to cwnd?
• A bottleneck in the middle of the network
• The receiver is the bottleneck?
• Summary:
– Minimum of all
Xin Liu
120
HTTP
• Assume Web page consists of:
  – 1 base HTML page (of size O bits)
  – M images (each of size O bits)
• Persistent HTTP
  – Multiple objects can be sent over single TCP connection between client and server.
  – HTTP/1.1 uses persistent connections in default mode
• Non-persistent HTTP
  – At most one object is sent over a TCP connection.
  – HTTP/1.0 uses nonpersistent HTTP
HTTP Modeling
• Non-persistent HTTP:
  – M+1 TCP connections in series
  – Response time = (M+1)O/R + (M+1)2RTT + sum of idle times
• Persistent HTTP:
  – 2 RTT to request and receive base HTML file
  – 1 RTT to request and receive M images
  – Response time = (M+1)O/R + 3RTT + sum of idle times
• Non-persistent HTTP with X parallel connections
  – Suppose M/X integer.
  – 1 TCP connection for base file
  – M/X sets of parallel connections for images.
  – Response time = (M+1)O/R + (M/X + 1)2RTT + sum of idle times
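A quick comparison of the three response-time expressions above in C, ignoring the idle times (the page parameters are assumptions, not from the slides):

#include <stdio.h>

int main(void) {
    double O = 100e3;    /* bits per object (assumed) */
    double R = 1e6;      /* link rate, bits/sec (assumed) */
    double RTT = 0.1;    /* seconds (assumed) */
    int    M = 10;       /* embedded images (assumed) */
    int    X = 5;        /* parallel connections, M/X integer (assumed) */

    double nonpersist = (M + 1) * O / R + (M + 1) * 2 * RTT;
    double persist    = (M + 1) * O / R + 3 * RTT;
    double parallel   = (M + 1) * O / R + (M / X + 1) * 2 * RTT;

    printf("non-persistent:   %.2f s\n", nonpersist);
    printf("persistent:       %.2f s\n", persist);
    printf("parallel (X=%d):   %.2f s\n", X, parallel);
    return 0;
}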
HTTP response time
• Small RTT?
– low bandwidth, connection & response time dominated
by transmission time.
– Persistent connections only give minor improvement
over parallel connections.
• Large RTT?
– For larger RTT, response time dominated by TCP
establishment & slow start delays. Persistent
connections now give important improvement:
particularly in high delay*bandwidth networks.
Xin Liu
123
Reading
• D. Chiu and R. Jain, “Analysis of the increase and
decrease algorithms for the congestion avoidance
in computer networks”, Computer Networks and
ISDN systems, vol. 17, no. 1, pp. 1-14. (fairness
and performance)
• T. Lakshman, U. Madhow, “the performance of
TCP/IP for networks with high-bandwidth-delay
products and random losses,” IEEE/ACM Trans.
On Networking, vol. 5, no. 3, pp. 336-350.
Xin Liu
124
SCTP
• SCTP: stream control transmission protocol
• Motivation
– Many applications need reliable message delivery –
they do so by delineating a TCP stream
– TCP provides both strict-ordering and reliability –
many applications may not need both
– Many applications require better fault tolerance
Xin Liu
125
Motivation (contd)
• HTTP is one such application
– While transferring multiple embedded files we
only want
• Reliable file transfer for each file
• Partial ordering for the packets of each file but not
total ordering amongst all the packets
– TCP provides more than this (but overhead?)
– SCTP may help (how? – later)
Xin Liu
126
What is SCTP?
• Originally designed to support PSTN
signaling messages over IP Networks
• It is a reliable transport protocol operating
on top of a connectionless packet network
such as IP (same level as TCP)
Xin Liu
127
Major Differences from TCP
• Connection Setup
• SCTP is message oriented as opposed to being byte stream oriented
  – Boundary conservation
• SCTP has the concept of an association instead of a connection
  – Each association can have multiple streams
• SCTP separates reliable transfer of datagrams from the delivery mechanism
  – No head-of-line blocking
• SCTP supports multihoming
Similarities to TCP
• Similar Flow Control and Congestion Control Strategies employed
  – Slow Start and Congestion Avoidance phases
  – Selective Acknowledgement
  – Fast Retransmit
  – Slight differences for supporting multihoming
• Will allow co-existence of TCP and SCTP [JST 2000]
Connection Setup
[Figure: SCTP handshake between client and server; the server generates a cookie and returns it to the client.]
• No action at server after sending the cookie
• Cookie verification
  – Effective against SYN attacks
Packet Format
[Figure: SCTP packet format.]
HTTP Server Architecture
Single File Transfer (both TCP and SCTP are similar)
[Figure: the client requests a file from the server; the server forks a child process, which sends the file back to the client.]
HTTP Server Architecture
Multiple File Transfer (Embedded files) - TCP
[Figure: the client requests file 0; the server forks a child process, which sends file 0; the client then requests files 1..N and the child sends files 1, 2, ..., N in sequence over the same connection.]
SCTP Packet Format
[Figure: SCTP packet format, with chunks belonging to different streams.]
HTTP Server Architecture
Multiple Files Transfer (Embedded Files) - SCTP
[Figure: the client requests file 0; the server forks a child process; file 0 is sent on stream 0; the client requests files 1..N, and the child sends file 1 on stream 1, ..., file N on stream N of the same association.]
Hypothesis
[Figure (TCP): packets 1–3 of files 1, 2, and 3 arrive interleaved from the server into a single receive buffer in the client's kernel over one TCP connection.]
Hypothesis
[Figure (SCTP): the same interleaved packets of files 1, 2, and 3 arrive, but each file is carried on its own SCTP stream into the receive buffer in the kernel.]
Multihoming
• designed to offer network resilience to failed interfaces on the host and faster recovery during network failures
  – Less effective if a single point of failure exists
  – Today's IP is subject to a routing reconvergence time
Reading
• RFC 2960
• SCTP for Beginners
http://tdrwww.exp-math.uni-essen.de/inhalt/forschung/sctp_fb/sctp_intro.html
Xin Liu
139
Wireless TCP
• Motivation
  – Wireless channels are unreliable and time-varying
  – Cause TCP timeouts/duplicate ACKs
• Approaches
Channel Conditions
• Determine transmission performance
• Determined by
  – Strength of desired signal
  – Noise level
    • Interference from other transmissions
    • Background noise
  – Time-varying and location-dependent.
Interference and Noise
[Figure: the desired signal at the receiver competes with interference from other transmitters and with background noise.]
Propagation Environment
[Figure: the received signal varies between strong and weak due to path loss, shadowing, and multi-path fading.]
Time-varying Channel Conditions
• Due to users' mobility and variability in the propagation environment, both the desired signal and interference are time-varying and location-dependent
• A measure of channel quality: SINR (Signal to Interference plus Noise Ratio)
Illustration of Channel Conditions
[Figure: simulated channel quality over time, based on Lee's path loss model, log-normal shadowing, and Rayleigh fading.]
Random Errors
• If number of errors is small, they may be
corrected by an error correcting code
• Excessive bit errors result in a packet being
discarded, possibly before it reaches the
transport layer
Xin Liu
146
147
Random Errors May Cause Fast Retransmit
• Fast retransmit results in
– retransmission of lost packet
– reduction in congestion window
• Reducing congestion window in response to
errors is unnecessary
• Reduction in congestion window reduces
the throughput
Xin Liu
Sometimes Congestion Response May
be Appropriate in Response to Errors
• On a CDMA channel, errors occur due to interference from
other user, and due to noise [Karn99pilc]
– Interference due to other users is an indication of congestion. If
such interference causes transmission errors, it is appropriate to
reduce congestion window
– If noise causes errors, it is not appropriate to reduce window
• When a channel is in a bad state for a long duration, it
might be better to let TCP backoff, so that it does not
unnecessarily attempt retransmissions while the channel
remains in the bad state [Padmanabhan99pilc]
Xin Liu
148
149
Burst Errors May Cause Timeouts
• If wireless link remains unavailable for extended
duration, a window worth of data may be lost
– driving through a tunnel
– passing a truck
• Timeout results in slow start
• Slow start reduces congestion window to 1 MSS,
reducing throughput
• Reduction in window in response to errors
unnecessary
Xin Liu
150
Random Errors May Also Cause Timeout
• Multiple packet losses in a window can
result in timeout when using TCP-Reno
Xin Liu
Impact of Transmission Errors
• TCP cannot distinguish between packet
losses due to congestion and transmission
errors
• Unnecessarily reduces congestion window
• Throughput suffers
Xin Liu
151
Various Schemes
• Link level mechanisms
• Split connection approach
• TCP-Aware link layer
• TCP-Unaware approximation of TCP-aware link layer
• Explicit notification
• Receiver-based discrimination
• Sender-based discrimination
Ref: tutorials by Nitin Vaidya
153
Split Connection Approach
Xin Liu
Split Connection Approach
• End-to-end TCP connection is broken into
one connection on the wired part of route
and one over wireless part of the route
• A single TCP connection split into two TCP
connections
– if wireless link is not last on route, then more
than two TCP connections may be needed
Xin Liu
154
Split Connection Approach
• Connection between wireless host MH and fixed host FH goes through base station BS
• FH-MH = FH-BS + BS-MH
[Figure: FH (Fixed Host) — BS (Base Station) — MH (Mobile Host).]
Split Connection Approach
[Figure: per-TCP-connection state at the base station; two back-to-back TCP connections (FH–BS and BS–MH), with full application/transport/network/link/physical stacks at FH, BS, and MH; retransmissions on the wireless hop are handled by the BS–MH connection.]
157
Split Connection Approach : Classification
• Hides transmission errors from sender
• Primary responsibility at base station
• If specialized transport protocol used on
wireless, then wireless host also needs
modification
Xin Liu
158
Split Connection Approach : Advantages
• BS-MH connection can be optimized independent of FH-BS connection
  – Different flow / error control on the two connections
• Local recovery of errors
  – Faster recovery due to relatively shorter RTT on wireless link
• Good performance achievable using appropriate BS-MH protocol
Split Connection Approach : Disadvantages
• End-to-end semantics violated
• BS retains hard state
– BS failure can result in loss of data (unreliability)
– Hand-off latency increases due to state transfer
– Buffer space needed at BS for each TCP connection
• Extra copying of data at BS
– copying from FH-BS socket buffer to BS-MH socket buffer
– increases end-to-end latency
• May not be useful if data and acks traverse different paths (both do not
go through the base station)
– Example: data on a satellite wireless hop, acks on a dial-up channel
Xin Liu
Various Schemes
• Link layer mechanisms
• Split connection approach
• TCP-Aware link layer
• TCP-Unaware approximation of TCP-aware link layer
• Explicit notification
• Receiver-based discrimination
• Sender-based discrimination
TCP-Aware Link Layer
Xin Liu
Snoop Protocol
[Balakrishnan95acm]
• Retains local recovery of Split Connection
approach and link level retransmission
schemes
• Improves on split connection
– end-to-end semantics retained
– soft state at base station, instead of hard state
Xin Liu
162
Snoop Protocol
[Figure: a single end-to-end TCP connection from FH to MH; the base station BS keeps per-TCP-connection state at its link layer and performs retransmissions on the wireless hop.]
Snoop Protocol
• Buffers data packets at the base station BS
  – to allow link layer retransmission
• When dupacks are received by BS from MH, retransmit on the wireless link, if the packet is present in the buffer
• Prevents fast retransmit at TCP sender FH by dropping the dupacks at BS
  (FH — BS — MH)
Snoop : Example
[Figure sequence: FH sends packets 35–48 through BS to MH; TCP state is maintained at the link layer of BS. The example assumes delayed ACKs — every other packet is ACKed. Packet 37 is lost on the wireless link, so MH returns dupacks for 36 (duplicate ACKs are not delayed). A dupack triggers retransmission of packet 37 from the base station, which discards the dupacks rather than forwarding them; BS needs to be TCP-aware to be able to interpret TCP headers. As a result, the TCP sender at FH does not fast retransmit.]
Snoop [Balakrishnan95acm]
[Figure: throughput (bits/sec, up to ~2,000,000) of base TCP vs. Snoop over a 2 Mbps wireless link, for error rates from "no error" down to one error per 16K bytes; Snoop sustains much higher throughput as the error rate increases.]
Snoop Protocol: When Beneficial?
• Snoop prevents fast retransmit from the sender despite transmission errors and out-of-order delivery on the wireless link
• Out-of-order delivery causes fast retransmit only if it results in at least 3 dupacks
• If the wireless link-level delay-bandwidth product is less than 4 packets, a simple (TCP-unaware) link level retransmission scheme can suffice
  – Since the delay-bandwidth product is small, the retransmission scheme can deliver the lost packet without resulting in 3 dupacks from the TCP receiver
Snoop Protocol : Classification
• Hides wireless losses from the sender
• Requires modification to only BS (network-centric approach)
177
Snoop Protocol : Advantages
• High throughput can be achieved
– performance further improved using selective acks
• Local recovery from wireless losses
• Fast retransmit not triggered at sender despite out-of-order
link layer delivery
• End-to-end semantics retained
• Soft state at base station
– loss of the soft state affects performance, but not correctness
Xin Liu
178
Snoop Protocol : Disadvantages
• Link layer at base station needs to be TCP-aware
• Not useful if TCP headers are encrypted (IPsec)
• Cannot be used if TCP data and TCP acks traverse
different paths (both do not go through the base
station)
Xin Liu
WTCP Protocol [Ratnam98]
• Snoop hides wireless losses from the sender
• But sender’s RTT estimates may be larger in
presence of errors
• Larger RTO results in slower response for
congestion losses
FH
BS
MH
Xin Liu
179
180
WTCP Protocol
• WTCP performs local recovery, similar to Snoop
• In addition, WTCP uses the timestamp option to estimate
RTT
• The base station adds base station residence time to the
timestamp when processing an ack received from the
wireless host
• Sender’s RTT estimate not affected by retransmissions on
wireless link
FH
BS
MH
Xin Liu
WTCP Example
[Figure: FH — BS — MH. Numbers in the figure are timestamps; the base station residence time is 1 unit. The segment carries timestamp 3; when the ACK returns from MH, BS adds its residence time, so FH sees timestamp 4 and its RTT estimate excludes the time the packet spent at the base station.]
182
WTCP : Disadvantages
• Requires use of the timestamp option
• May be useful only if retransmission times are large
– link stays in bad state for a long time
– link frequently enters a bad state
– link delay large
• WTCP does not account for congestion on wireless hop
– assumes that all delay at base station is due to queuing and
retransmissions
– will not work for shared wireless LAN, where delays also incurred
due to contention with other transmitters
Xin Liu
Optional Reading
• S. Dawkins, et al., "Performance implications of link-layer characteristics: links and errors," tech. rep., IETF, June 1999.
• G. Montenegro, et al., "Long thin networks," tech. rep., IETF, May 1999.
• Nitin Vaidya, http://www.crhc.uiuc.edu/wireless/tutorials.html