Transcript Chapter 3

Transport Layer
Our goals:
• understand principles behind transport layer services:
– multiplexing/demultiplexing
– reliable data transfer
– flow control
– congestion control
• learn about transport layer protocols in the Internet:
– UDP: connectionless transport
– TCP: connection-oriented transport
– TCP congestion control
Ref: slides by J. Kurose and K. Ross
Xin Liu
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Connection-oriented transport: TCP
– segment structure
– reliable data transfer
– flow control
– connection management
• TCP congestion control
Transport services and protocols
• provide logical communication between app processes running on different hosts
• transport protocols run in end systems
– send side: breaks app messages into segments, passes to network layer
– rcv side: reassembles segments into messages, passes to app layer
• more than one transport protocol available to apps
– Internet: TCP and UDP
[Figure: two end systems with full application/transport/network/data link/physical stacks; the nodes between them have only network, data link, and physical layers]
Transport vs. network layer
• network layer: logical communication between hosts
• transport layer: logical communication between processes
– relies on, enhances, network layer services
Household analogy:
12 kids in one house sending letters to 12 kids in another house
• processes = kids
• app messages = letters in envelopes
• hosts = houses
• transport protocol = Ann and Bill
• network-layer protocol = postal service
Internet transport-layer protocols
• reliable, in-order delivery (TCP)
– congestion control
– flow control
– connection setup
• unreliable, unordered delivery: UDP
– no-frills extension of “best-effort” IP
• services not available:
– delay guarantees
– bandwidth guarantees
Multiplexing/demultiplexing
Multiplexing at send host: gathering data from multiple sockets, enveloping data with header (later used for demultiplexing)
Demultiplexing at rcv host: delivering received segments to correct socket
[Figure: hosts 1, 2, and 3, each with an application/transport/network/link/physical stack; processes P1 to P4 attach to the transport layer through sockets]
How demultiplexing works
• host receives IP datagrams
– each datagram has source IP address, destination IP address
– each datagram carries 1 transport-layer segment
– each segment has source, destination port number (recall: well-known port numbers for specific applications)
• host uses IP addresses & port numbers to direct segment to appropriate socket
[Figure: TCP/UDP segment format, 32 bits wide: source port #, dest port #; other header fields; application data (message)]
Connectionless demultiplexing
• Create sockets:
    sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));
    /* peer and peerlen (remote address and its length) are illustrative names */
    sendto(sock, buffer, size, 0, (struct sockaddr *)&peer, sizeof(peer));
    recvfrom(sock, buffer, buffersize, 0, (struct sockaddr *)&peer, &peerlen);
• UDP socket identified by two-tuple:
(dest IP address, dest port number)
• When host receives UDP segment:
– checks destination port number in segment
– directs UDP segment to socket with that port number
• IP datagrams with different source IP addresses and/or source port numbers directed to same socket
Connection-oriented demux
• TCP socket identified by 4-tuple:
– source IP address
– source port number
– dest IP address
– dest port number
• recv host uses all four values to direct segment to appropriate socket
• Server host may support many simultaneous TCP sockets:
– each socket identified by its own 4-tuple
• Web servers have different sockets for each connecting client
– non-persistent HTTP will have different socket for each request
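The difference between the UDP two-tuple and the TCP 4-tuple can be sketched as two lookup tables. This is a toy model only: the Demultiplexer class, the socket labels, and the addresses are hypothetical, and listening sockets are not modeled.

```python
from typing import Dict, Optional, Tuple

# Hypothetical lookup keys; a real kernel keeps richer state than this.
UdpKey = Tuple[str, int]            # (dest IP, dest port)
TcpKey = Tuple[str, int, str, int]  # (source IP, source port, dest IP, dest port)


class Demultiplexer:
    """Toy model of transport-layer demultiplexing at a receiving host."""

    def __init__(self) -> None:
        self.udp_sockets: Dict[UdpKey, str] = {}
        self.tcp_sockets: Dict[TcpKey, str] = {}

    def deliver_udp(self, src_ip: str, src_port: int,
                    dst_ip: str, dst_port: int) -> Optional[str]:
        # UDP ignores the source fields when choosing the socket.
        return self.udp_sockets.get((dst_ip, dst_port))

    def deliver_tcp(self, src_ip: str, src_port: int,
                    dst_ip: str, dst_port: int) -> Optional[str]:
        # TCP uses all four values.
        return self.tcp_sockets.get((src_ip, src_port, dst_ip, dst_port))


demux = Demultiplexer()
demux.udp_sockets[("10.0.0.1", 53)] = "udp-socket"
demux.tcp_sockets[("10.0.0.2", 5000, "10.0.0.1", 80)] = "tcp-socket-A"
demux.tcp_sockets[("10.0.0.3", 5000, "10.0.0.1", 80)] = "tcp-socket-B"

# Two different UDP senders reach the same socket.
print(demux.deliver_udp("10.0.0.2", 1111, "10.0.0.1", 53))
print(demux.deliver_udp("10.0.0.3", 2222, "10.0.0.1", 53))
# Two TCP clients hitting the same server port reach different sockets.
print(demux.deliver_tcp("10.0.0.2", 5000, "10.0.0.1", 80))
print(demux.deliver_tcp("10.0.0.3", 5000, "10.0.0.1", 80))
```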
UDP: User Datagram Protocol [RFC 768]
• “no frills,” “bare bones” Internet transport protocol
• “best effort” service, UDP segments may be:
– lost
– delivered out of order to app
• connectionless:
– no handshaking between UDP sender, receiver
– each UDP segment handled independently of others
Why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small segment header
• no congestion control: UDP can blast away as fast as desired
DHCP client-server scenario
DHCP server: 223.1.2.5; arriving client
(messages listed in time order)
DHCP discover
src: 0.0.0.0, 68
dest: 255.255.255.255, 67
yiaddr: 0.0.0.0
transaction ID: 654
DHCP offer
src: 223.1.2.5, 67
dest: 255.255.255.255, 68
yiaddr: 223.1.2.4
transaction ID: 654
Lifetime: 3600 secs
DHCP request
src: 0.0.0.0, 68
dest: 255.255.255.255, 67
yiaddr: 223.1.2.4
transaction ID: 655
Lifetime: 3600 secs
DHCP ACK
src: 223.1.2.5, 67
dest: 255.255.255.255, 68
yiaddr: 223.1.2.4
transaction ID: 655
Lifetime: 3600 secs
Applications and protocols
Application               App-layer protocol    Transport protocol
E-mail                    SMTP                  TCP
Remote terminal access    Telnet                TCP
Web                       HTTP                  TCP
File transfer             FTP                   TCP
Streaming                 proprietary           Typically UDP
IP-phone                  proprietary           Typically UDP
Routing                   RIP                   Typically UDP
Name translation          DNS                   Typically UDP
Dynamic IP                DHCP                  Typically UDP
Network management        SNMP                  Typically UDP
UDP: more
• often used for streaming multimedia apps
– loss tolerant
– rate sensitive
• reliable transfer over UDP: add reliability at application layer
– application-specific error recovery!
[Figure: UDP segment format, 32 bits wide: source port #, dest port #; length (in bytes of UDP segment, including header), checksum; application data (message)]
Checksum
• Goal: detect “errors” (e.g., flipped bits) in transmitted segment
• Computed over:
– UDP header and data
– Pseudo header: source/dest IP address, protocol, length
• Same procedure for TCP
UDP checksum
Sender:
• treat segment contents as sequence of 16-bit integers
• checksum: 1’s complement of the 1’s complement sum of segment contents
• sender puts checksum value into UDP checksum field
Receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
– NO - error detected
– YES - no error detected. But maybe errors nonetheless?
– may pass the damaged data
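A minimal sketch of the 16-bit 1's complement checksum described above. It covers only the bytes passed in; the pseudo header is left out for brevity, and the function names are illustrative. The sample bytes are the worked example from RFC 1071.

```python
def ones_complement_sum16(data: bytes) -> int:
    """1's complement sum of the data taken as big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total


def internet_checksum(data: bytes) -> int:
    """Checksum field value: complement of the 1's complement sum."""
    return ~ones_complement_sum16(data) & 0xFFFF


def verify(data: bytes, checksum: int) -> bool:
    """Receiver check: sum over data plus checksum must be all ones."""
    padded = data + b"\x00" if len(data) % 2 else data
    return ones_complement_sum16(padded + checksum.to_bytes(2, "big")) == 0xFFFF


segment = bytes.fromhex("0001f203f4f5f6f7")
csum = internet_checksum(segment)
print(hex(csum))                      # 0x220d
print(verify(segment, csum))          # True: no error detected

flipped = bytes([segment[0] ^ 0x01]) + segment[1:]
print(verify(flipped, csum))          # False: single bit flip is detected

# "But maybe errors nonetheless?": swapping two 16-bit words goes unnoticed.
swapped = segment[2:4] + segment[0:2] + segment[4:]
print(verify(swapped, csum))          # True even though the data changed
```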
TCP: Overview
RFCs: 793, 1122, 1323, 2018, 2581
• point-to-point:
– one sender, one receiver
• reliable, in-order byte stream:
– no “message boundaries”
• pipelined:
– TCP congestion and flow control set window size
• send & receive buffers
• full duplex data:
– bi-directional data flow in same connection
– MSS: maximum segment size
• connection-oriented:
– handshaking (exchange of control msgs) init’s sender, receiver state before data exchange
• flow controlled:
– sender will not overwhelm receiver
[Figure: application writes data through the socket door into the TCP send buffer; segments carry it to the TCP receive buffer, from which the application reads data through its socket door]
TCP segment structure
[Figure: TCP segment format, 32 bits wide]
source port # | dest port #
sequence number
acknowledgement number
head len | not used | U A P R S F | Receive window
checksum | Urg data pnter
Options (variable length)
application data (variable length)
Field notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now (generally not used)
• RST, SYN, FIN: connection estab (setup, teardown commands)
• Receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pnter: urgent data pointer
TCP Connection Management
Recall: TCP sender, receiver establish “connection” before exchanging data segments
• initialize TCP variables:
– seq. #s
– buffers, flow control info (e.g. RcvWindow)
• client: connection initiator
– connect();
• server: contacted by client
– accept();
Three way handshake:
Step 1: client host sends TCP SYN segment to server
– specifies initial seq #
– no data
Step 2: server host receives SYN, replies with SYNACK segment
– server allocates buffers
– specifies server initial seq. #
Step 3: client receives SYNACK, replies with ACK segment, which may contain data
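The sequence and acknowledgement numbers exchanged in the three steps can be sketched as follows. The Segment class and function name are illustrative; the initial sequence numbers are the ones that appear in the tcpdump trace later in this transcript. A SYN consumes one sequence number, so each side acknowledges the peer's initial seq # plus one.

```python
from dataclasses import dataclass
from typing import List, Optional

SEQ_SPACE = 2 ** 32  # sequence numbers are 32 bits wide


@dataclass
class Segment:
    flags: str
    seq: int
    ack: Optional[int]  # None when the ACK flag is not set


def three_way_handshake(client_isn: int, server_isn: int) -> List[Segment]:
    """Return the SYN, SYNACK, and ACK segments (no data carried)."""
    syn = Segment("SYN", client_isn, None)
    synack = Segment("SYN+ACK", server_isn, (syn.seq + 1) % SEQ_SPACE)
    ack = Segment("ACK", (client_isn + 1) % SEQ_SPACE,
                  (synack.seq + 1) % SEQ_SPACE)
    return [syn, synack, ack]


segments = three_way_handshake(3036713598, 2462279815)
for s in segments:
    print(s)
```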
TCP Connection Management (cont.)
Closing a connection:
client closes socket: close();
Step 1: client end system sends TCP FIN control segment to server
Step 2: server receives FIN, replies with ACK. Closes connection, sends FIN.
Step 3: client receives FIN, replies with ACK.
– Enters “timed wait”: will respond with ACK to received FINs
Step 4: server receives ACK. Connection closed.
Note: with small modification, can handle simultaneous FINs.
[Figure: client and server timelines for the close; labels: close, closing, FIN_WAIT_1, FIN_WAIT_2, TIME_WAIT (timed wait), closed]
TCP Connection Management (cont)
[Figure: TCP server lifecycle and TCP client lifecycle state diagrams]
TCP Connection Management
• Allow half-close, i.e., one end terminates its output but can still receive data
• Allow simultaneous open
• Allow simultaneous close
• Crashes?
[root@shannon liu]# tcpdump -S tcp port 22
tcpdump: listening on eth0
23:01:51.363983 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: S
3036713598:3036713598(0) win 5840 <mss 1460,sackOK,timestamp 13989220 0,nop,wscale 0> (DF)
23:01:51.364829 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: S
2462279815:2462279815(0) ack 3036713599 win 24616 <nop,nop,timestamp 626257407
13989220,nop,wscale 0,nop,nop,sackOK,mss 1460> (DF)
23:01:51.364844 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462279816 win 5840
<nop,nop,timestamp 13989220 626257407> (DF)
23:01:51.375451 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: P
2462279816:2462279865(49) ack 3036713599 win 24616 <nop,nop,timestamp 626257408 13989220>
(DF)
23:01:51.375478 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462279865 win 5840
<nop,nop,timestamp 13989221 626257408> (DF)
23:01:51.379319 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: P
3036713599:3036713621(22) ack 2462279865 win 5840 <nop,nop,timestamp 13989221 626257408>
(DF)
23:01:51.379570 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: . ack 3036713621 win 24616
<nop,nop,timestamp 626257408 13989221>
(DF)
23:01:51.941616 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: P 3036714373:3036714437(64)
ack 2462281065 win 7680 <nop,nop,timestamp 13989277 626257462> (DF)
23:01:51.952442 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: P
2462281065:2462282153(1088) ack 3036714437 win 24616 <nop,nop,timestamp 626257465 13989277>
(DF)
23:01:51.991682 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462282153 win 9792
<nop,nop,timestamp 13989283 626257465> (DF)
23:01:54.699597 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: F 3036714437:3036714437(0)
ack 2462282153 win 9792 <nop,nop,timestamp 13989553 626257465> (DF)
23:01:54.699880 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: . ack 3036714438 win 24616
<nop,nop,timestamp 626257740 13989553>(DF)
23:01:54.701129 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: F 2462282153:2462282153(0)
ack 3036714438 win 24616 <nop,nop,timestamp 626257740 13989553> (DF)
23:01:54.701143 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462282154 win 9792
<nop,nop,timestamp 13989553 626257740> (DF)
26 packets received by filter
0 packets dropped by kernel
TCP seq. #’s and ACKs
Seq. #’s:
– byte stream “number” of first byte in segment’s data
ACKs:
– seq # of next byte expected from other side
– cumulative ACK
Q: how does the receiver handle out-of-order segments?
– A: TCP spec doesn’t say; up to implementor
[Figure: simple telnet scenario, Host A and Host B over time: user types ‘C’; host B ACKs receipt of ‘C’, echoes back ‘C’; host A ACKs receipt of echoed ‘C’]
TCP Round Trip Time and Timeout
Q: how to set TCP timeout value?
• longer than RTT
– but RTT varies
• too short: premature timeout
– unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary, want estimated RTT “smoother”
– average several recent measurements, not just current SampleRTT
TCP Round Trip Time and Timeout
EstimatedRTT = (1 - α)*EstimatedRTT + α*SampleRTT
• Exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
Example RTT estimation:
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT in milliseconds (axis from 100 to 350) versus time in seconds]
TCP Round Trip Time and Timeout
Setting the timeout
• EstimatedRTT plus “safety margin”
– large variation in EstimatedRTT -> larger safety margin
• first estimate of how much SampleRTT deviates from EstimatedRTT:
DevRTT = (1 - β)*DevRTT + β*|SampleRTT - EstimatedRTT|
(typically, β = 0.25)
Then set timeout interval:
TimeoutInterval = EstimatedRTT + 4*DevRTT
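The EstimatedRTT, DevRTT, and TimeoutInterval formulas can be sketched as a small estimator. Two choices here are assumptions not stated on the slides: DevRTT starts at half the first sample, and DevRTT is updated before EstimatedRTT so that it uses the previous estimate (both follow RFC 6298). The class name is illustrative.

```python
class RttEstimator:
    """EWMA-based RTT estimator; all times share one unit (e.g. ms)."""

    def __init__(self, first_sample: float,
                 alpha: float = 0.125, beta: float = 0.25) -> None:
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2  # assumption: RFC 6298 style start

    def update(self, sample_rtt: float) -> float:
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # Assumption: DevRTT uses EstimatedRTT from before this sample.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self) -> float:
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)
timeout = est.update(120.0)
print(est.estimated_rtt, est.dev_rtt, timeout)
```

With a first sample of 100 and a second of 120, DevRTT becomes 0.75*50 + 0.25*20 = 42.5, EstimatedRTT becomes 0.875*100 + 0.125*120 = 102.5, and TimeoutInterval is 102.5 + 4*42.5 = 272.5.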
RTT
• Timestamp can be used to measure RTT for
each segment
• Better RTT estimate
• NO synchronization required
TCP reliable data transfer
• TCP creates reliable service on top of IP’s unreliable service
• Pipelined segments
• Cumulative acks
• TCP uses single retransmission timer
• Retransmissions are triggered by:
– timeout events
– duplicate acks
• Initially consider simplified TCP sender:
– ignore duplicate acks
– ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• Create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running (think of timer as for oldest unacked segment)
• expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
Ack rcvd:
• If acknowledges previously unacked segments
– update what is known to be acked
– start timer if there are outstanding segments
TCP sender (simplified)

NextSeqNum = InitialSeqNum
SendBase = InitialSeqNum

loop (forever) {
    switch(event)

    event: data received from application above
        create TCP segment with sequence number NextSeqNum
        if (timer currently not running)
            start timer
        pass segment to IP
        NextSeqNum = NextSeqNum + length(data)

    event: timer timeout
        retransmit not-yet-acknowledged segment with
            smallest sequence number
        start timer

    event: ACK received, with ACK field value of y
        if (y > SendBase) {
            SendBase = y
            if (there are currently not-yet-acknowledged segments)
                start timer
        }
} /* end of loop forever */

Comment:
• SendBase-1: last cumulatively ack’ed byte
Example:
• SendBase-1 = 71; y = 73, so the rcvr wants 73+; y > SendBase, so that new data is acked
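The three sender events can be sketched as a small class. This is a toy model under stated assumptions: the timer is a boolean flag rather than a real clock, sequence-number wraparound is ignored, and the class name and the `sent` log are illustrative. The example uses Seq=92 with segment sizes of 8 and 20 bytes, chosen to match the SendBase values 100 and 120 in the retransmission scenarios.

```python
from typing import Dict, List, Tuple


class SimplifiedTcpSender:
    """Toy model: no duplicate ACKs, flow control, or congestion control."""

    def __init__(self, initial_seq_num: int) -> None:
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.unacked: Dict[int, bytes] = {}       # seq # -> segment data
        self.timer_running = False
        self.sent: List[Tuple[int, bytes]] = []   # log of segments passed to IP

    def on_app_data(self, data: bytes) -> None:
        seq = self.next_seq_num
        self.unacked[seq] = data
        if not self.timer_running:
            self.timer_running = True              # start timer
        self.sent.append((seq, data))              # pass segment to IP
        self.next_seq_num += len(data)

    def on_timeout(self) -> None:
        if self.unacked:
            seq = min(self.unacked)                # smallest unacked seq #
            self.sent.append((seq, self.unacked[seq]))
            self.timer_running = True              # restart timer

    def on_ack(self, y: int) -> None:
        if y > self.send_base:
            self.send_base = y
            # Cumulative ACK: drop every segment that ends at or before y.
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


sender = SimplifiedTcpSender(initial_seq_num=92)
sender.on_app_data(b"a" * 8)     # Seq=92, 8 bytes
sender.on_app_data(b"b" * 20)    # Seq=100, 20 bytes
sender.on_timeout()              # retransmits Seq=92 only
sender.on_ack(120)               # cumulative ACK covers both segments
print(sender.send_base, sender.timer_running, [s for s, _ in sender.sent])
```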
TCP: retransmission scenarios
[Figure: two Host A / Host B timelines, “lost ACK scenario” and “premature timeout”; labels: X (loss), Seq=92 timeout, SendBase = 100, SendBase = 120]
TCP retransmission scenarios (more)
[Figure: Host A / Host B timeline, “cumulative ACK scenario”; labels: X (loss), timeout, SendBase = 120]
TCP ACK generation [RFC 1122, RFC 2581]
Event at receiver, followed by TCP receiver action:
• Arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
– Delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK
• Arrival of in-order segment with expected seq #. One other segment has ACK pending
– Immediately send single cumulative ACK, ACKing both in-order segments
• Arrival of out-of-order segment with higher-than-expected seq. #. Gap detected
– Immediately send duplicate ACK, indicating seq. # of next expected byte
• Arrival of segment that partially or completely fills gap
– Immediately send ACK, provided that segment starts at lower end of gap
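How the ACK value itself is chosen in the gap cases can be sketched as below. The sketch assumes the receiver buffers out-of-order segments (the spec leaves this to the implementor), ignores the 500 ms delayed-ACK timing, and does not handle partially overlapping segments. The class name and the byte counts are illustrative.

```python
from typing import Dict


class CumulativeAckReceiver:
    """Tracks the next expected byte and returns the ACK # for each arrival."""

    def __init__(self, initial_expected: int) -> None:
        self.expected = initial_expected
        self.buffered: Dict[int, int] = {}  # seq # -> length, out-of-order only

    def on_segment(self, seq: int, length: int) -> int:
        if seq == self.expected:
            self.expected += length
            # A segment that fills a gap may release buffered segments too.
            while self.expected in self.buffered:
                self.expected += self.buffered.pop(self.expected)
        elif seq > self.expected:
            self.buffered[seq] = length     # gap detected: hold for later
        # Old duplicates (seq < expected) simply get re-ACKed.
        return self.expected


rcvr = CumulativeAckReceiver(initial_expected=92)
print(rcvr.on_segment(92, 8))     # in order: ACK 100
print(rcvr.on_segment(120, 15))   # gap: duplicate ACK 100
print(rcvr.on_segment(100, 20))   # fills gap: ACK jumps to 135
```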
TCP Flow Control
• receive side of TCP connection has a receive buffer
• app process may be slow at reading from buffer
flow control: sender won’t overflow receiver’s buffer by transmitting too much, too fast
• speed-matching service: matching the send rate to the receiving app’s drain rate
TCP Flow control: how it works
(Suppose TCP receiver discards out-of-order segments)
• spare room in buffer
= RcvWindow
= RcvBuffer - [LastByteRcvd - LastByteRead]
• Rcvr advertises spare room by including value of RcvWindow in segments
• Sender limits unACKed data to RcvWindow
– guarantees receive buffer doesn’t overflow
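A short worked example of the RcvWindow formula and the sender-side limit. The function names and the byte counts are illustrative; LastByteSent and LastByteAcked are the sender-side counterparts of the receiver variables.

```python
def rcv_window(rcv_buffer: int, last_byte_rcvd: int, last_byte_read: int) -> int:
    """Spare room in the receive buffer, in bytes."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def sender_may_send(last_byte_sent: int, last_byte_acked: int,
                    advertised_window: int, nbytes: int) -> bool:
    """True if sending nbytes more keeps unACKed data within RcvWindow."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= advertised_window


# 4096-byte buffer, 3000 bytes received so far, app has read 1000 of them.
window = rcv_window(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(window)                                     # 2096
print(sender_may_send(5000, 4000, window, 1000))  # 1000 in flight + 1000 fits
print(sender_may_send(5000, 4000, window, 1200))  # 2200 would overflow
```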
More
• Slow receiver
– Ack new window
• Long fat pipeline: high speed link and/or
long RTT
• Window scale option during handshaking
Header
[Figure: TCP segment format, 32 bits wide: source port #, dest port #; sequence number; acknowledgement number; head len, not used, U A P R S F flags, Receive window; checksum, Urg data pnter; Options (variable length); application data (variable length)]
Principles of Congestion Control
Congestion:
• informally: “too many sources sending too much data too
fast for network to handle”
• different from flow control!
• Who benefits?
• manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)
• a top-10 problem!
TCP Congestion Control
• end-end control (no network assistance)
• sender limits transmission:
LastByteSent - LastByteAcked ≤ cwnd
• Roughly, rate = cwnd/RTT bytes/sec
• cwnd is dynamic, function of perceived network congestion
How does sender perceive congestion?
• loss event = timeout or 3 duplicate acks
• TCP sender reduces rate (cwnd) after loss event
mechanisms:
– slow start
– congestion avoidance
– AIMD
TCP Slow Start
• When connection begins, cwnd = 1 MSS
– Example: MSS = 500 bytes & RTT = 200 msec
– initial rate = 20 kbps
• When connection begins, increase cwnd when an ack received
• available bandwidth may be >> MSS/RTT
– desirable to quickly ramp up to respectable rate
TCP Slow Start (more)
• When connection begins, increase rate exponentially until first loss event:
– incrementing cwnd for every ACK received
– double cwnd every RTT
• Summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A / Host B timeline of slow start with one RTT marked]
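The slow-start numbers can be checked with a few lines of arithmetic, using rate = cwnd/RTT and the example values MSS = 500 bytes, RTT = 200 msec. The function names are illustrative, and the sketch assumes no loss and one ACK per segment, so cwnd doubles every RTT.

```python
def rate_bps(cwnd_bytes: float, rtt_seconds: float) -> float:
    """Approximate sending rate in bits per second: cwnd / RTT."""
    return cwnd_bytes * 8 / rtt_seconds


def slow_start_cwnd(mss_bytes: int, rtts_elapsed: int) -> int:
    """cwnd after a number of loss-free RTTs, starting from 1 MSS."""
    return mss_bytes * 2 ** rtts_elapsed


MSS = 500    # bytes
RTT = 0.2    # seconds

for n in range(5):
    cwnd = slow_start_cwnd(MSS, n)
    print(n, cwnd, rate_bps(cwnd, RTT) / 1000, "kbps")
```

The first row reproduces the 20 kbps initial rate (500 bytes * 8 / 0.2 s = 20,000 bits/sec); after four RTTs cwnd is 16 MSS and the rate is 320 kbps.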
Congestion Avoidance
• ssthresh: when cwnd reaches ssthresh, congestion avoidance begins
• Congestion avoidance: increase cwnd by 1/cwnd (with cwnd measured in MSS) each time an ACK is received, i.e., about 1 MSS per RTT
• When congestion happens: ssthresh = max(2 MSS, cwnd/2)
TCP AIMD
multiplicative decrease: cut cwnd in half after loss event
additive increase: increase cwnd by 1 MSS every RTT in the absence of loss events: probing
[Figure: congestion window (axis marks at 8, 16, 24 Kbytes) versus time for a long-lived TCP connection]
Reno vs. Tahoe
• After 3 dup ACKs:
– cwnd is cut in half
– window then grows linearly
• But after timeout event:
– cwnd instead set to 1 MSS;
– window then grows exponentially
– to ssthresh, then grows linearly
Philosophy:
• 3 dup ACKs indicates network capable of delivering some segments
• timeout before 3 dup ACKs is “more alarming”
Summary: TCP Congestion Control
• When cwnd is below ssthresh, sender is in slow-start phase, window grows exponentially.
• When cwnd is above ssthresh, sender is in congestion-avoidance phase, window grows linearly.
• When a triple duplicate ACK occurs, ssthresh set to cwnd/2 and cwnd set to ssthresh.
• When timeout occurs, ssthresh set to cwnd/2 and cwnd is set to 1 MSS.
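The four summary rules can be sketched as a small state holder. This is an illustration only: the class name is hypothetical, cwnd is kept in bytes, the ssthresh floor of 2 MSS is taken from the congestion avoidance slide, and fast-recovery details are omitted.

```python
class CongestionWindow:
    """Tracks cwnd and ssthresh (both in bytes) across ACK and loss events."""

    def __init__(self, mss: int, ssthresh: int) -> None:
        self.mss = mss
        self.cwnd = float(mss)          # connection starts at 1 MSS
        self.ssthresh = float(ssthresh)

    def on_new_ack(self) -> None:
        if self.cwnd < self.ssthresh:
            # Slow start: +1 MSS per ACK, so cwnd doubles every RTT.
            self.cwnd += self.mss
        else:
            # Congestion avoidance: about +1 MSS per RTT in total.
            self.cwnd += self.mss * self.mss / self.cwnd

    def on_triple_dup_ack(self) -> None:
        self.ssthresh = max(2 * self.mss, self.cwnd / 2)
        self.cwnd = self.ssthresh

    def on_timeout(self) -> None:
        self.ssthresh = max(2 * self.mss, self.cwnd / 2)
        self.cwnd = float(self.mss)


cw = CongestionWindow(mss=1000, ssthresh=8000)
for _ in range(3):
    cw.on_new_ack()          # slow start: 1000 -> 4000
cw.on_triple_dup_ack()       # ssthresh = 2000, cwnd = 2000
cw.on_new_ack()              # congestion avoidance: 2000 -> 2500
print(cw.cwnd, cw.ssthresh)
cw.on_timeout()              # ssthresh = max(2000, 1250), cwnd = 1 MSS
print(cw.cwnd, cw.ssthresh)
```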
Trend
• Recent research proposes network-assisted
congestion control: active queue management
• ECN: explicit congestion notification
– 2 bits: 6 &7 in the IP TOS field
• RED: random early detection
– Implicit
– Can be adapted to explicit methods by marking instead
of dropping
Wireless TCP
• Motivation
– Wireless channels are unreliable and time-varying
– Cause TCP timeout/Duplicate acks
• Approaches