Lecture-9 on 10/22/2009 - Computer Science and Engineering
CSE 124
Networked Services
Fall 2009
B. S. Manoj, Ph.D
http://cseweb.ucsd.edu/classes/fa09/cse124
Some of these slides are adapted from various sources/individuals, including but not limited to
the slides from the textbooks by Kurose and Ross, digital libraries such as the
IEEE/ACM digital libraries, and slides from Prof. Vahdat. Use of these slides other than for
pedagogical purposes in CSE 124 may require explicit permission from the respective sources.
10/22/2009
CSE 124 Networked Services Fall 2009
1
Announcements
• Programming Assignment 1
– Submission window: 23rd–26th October
• Week-3 Homework
– Due on 26th October
• First Paper Discussion
– Discussion on 29th October
– Write-up due on 28th October
• Midterm: November 5
TCP Round Trip Time and Timeout
EstimatedRTT = (1 − α)·EstimatedRTT + α·SampleRTT

Exponentially weighted moving average: the influence of a past sample decreases exponentially fast. Typical value: α = 0.125.

[Figure: SampleRTT and EstimatedRTT (milliseconds) vs. time (seconds), measured from gaia.cs.umass.edu to fantasia.eurecom.fr]
TCP Round Trip Time and Timeout
Setting the timeout
• EstimatedRTT plus “safety margin”
– large variation in EstimatedRTT -> larger safety margin
• first estimate of how much SampleRTT deviates from EstimatedRTT:
DevRTT = (1 − β)·DevRTT + β·|SampleRTT − EstimatedRTT|
(typically, β = 0.25)
Then set timeout interval:
TimeoutInterval = EstimatedRTT + 4·DevRTT
TimeoutInterval is exponentially increased with every retransmission
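The two estimators and the timeout rule above can be sketched in a few lines of Python (an illustration of the slide's formulas, not code from any real TCP stack; the initial values fed in below are assumed):

```python
# EWMA RTT estimation and timeout, following the slide's formulas.
ALPHA = 0.125   # weight of SampleRTT in EstimatedRTT (slide's typical value)
BETA = 0.25     # weight of the new deviation sample in DevRTT

def update_rto(estimated_rtt, dev_rtt, sample_rtt):
    """Return (EstimatedRTT, DevRTT, TimeoutInterval) after one RTT sample."""
    estimated_rtt = (1 - ALPHA) * estimated_rtt + ALPHA * sample_rtt
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - estimated_rtt)
    timeout = estimated_rtt + 4 * dev_rtt
    return estimated_rtt, dev_rtt, timeout

# Feed in a few samples (milliseconds; assumed values) and watch the timeout adapt.
est, dev = 100.0, 0.0
for sample in [100, 120, 110, 300, 105]:
    est, dev, rto = update_rto(est, dev, sample)
    print(f"sample={sample:3d}  EstimatedRTT={est:6.1f}  TimeoutInterval={rto:6.1f}")
```

Note the large sample (300 ms) inflates DevRTT and therefore the safety margin, exactly the behavior the slide motivates.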
Fast Retransmit
• time-out period often relatively long:
– long delay before resending lost packet
• detect lost segments via duplicate ACKs
– sender often sends many segments back-to-back
– if segment is lost, there will likely be many duplicate ACKs for that segment
• if sender receives 3 duplicate ACKs for same data, it assumes that segment after ACKed data was lost:
– fast retransmit: resend segment before timer expires
[Figure: Host A sends segments x1–x5 to Host B; one segment is lost (X). Host B returns ACK x1 four times; on the triple duplicate ACKs the sender fast-retransmits the missing segment before its timeout expires.]
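The triple-duplicate-ACK trigger can be sketched as a small counter (hypothetical helper and state names; real TCP tracks this inside the connection's control block):

```python
# Count duplicate ACKs; fire a fast retransmit on the third duplicate.
def ack_received(state, ack_no):
    """state: dict with 'last_ack' and 'dup_count'. Returns True on fast retransmit."""
    if ack_no == state["last_ack"]:
        state["dup_count"] += 1
        if state["dup_count"] == 3:     # triple duplicate ACK
            return True                 # resend the missing segment now
    else:
        state["last_ack"] = ack_no      # new data ACKed: reset the counter
        state["dup_count"] = 0
    return False

state = {"last_ack": 0, "dup_count": 0}
events = [1, 1, 1, 1]   # ACK x1 followed by three duplicates, as in the figure
fired = [ack_received(state, a) for a in events]
print(fired)            # fast retransmit fires only on the third duplicate
```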
TCP congestion control:
TCP sender should transmit as fast as possible, but
without congesting network
Q: how to find a rate just below the congestion level?
decentralized: each TCP sender sets its own rate, based
on implicit feedback:
ACK: segment received (a good thing!), network not
congested, so increase sending rate
lost segment: assume loss due to congested network,
so decrease sending rate
TCP congestion control: bandwidth probing
“probing for bandwidth”: increase transmission rate on
receipt of ACK, until eventually loss occurs, then
decrease transmission rate
continue to increase on ACK, decrease on loss (since available
bandwidth is changing, depending on other connections in
network)
[Figure: sending rate vs. time. The rate increases while ACKs are received and is cut back at each loss (X), producing TCP's “sawtooth” behavior.]
Q: how fast to increase/decrease? details to follow
TCP Congestion Control: details
• sender limits rate by limiting number of unACKed bytes “in pipeline”:
LastByteSent − LastByteAcked ≤ cwnd
– cwnd: differs from rwnd (how, why?)
– sender limited by min(cwnd, rwnd)
• roughly, rate ≈ cwnd/RTT bytes/sec
• cwnd is dynamic, a function of perceived network congestion
[Figure: cwnd bytes in flight over one RTT; ACK(s) returning]
TCP Congestion Control: more details
segment loss event: reducing cwnd
• timeout: no response from receiver
– cut cwnd to 1
• 3 duplicate ACKs: at least some segments getting through (recall fast retransmit)
– cut cwnd in half, less aggressively than on timeout
ACK received: increasing cwnd
• slowstart phase: increase exponentially fast (despite name) at connection start, or following timeout
• congestion avoidance: increase linearly
TCP Slow Start
• when connection begins, cwnd = 1 MSS
– example: MSS = 500 bytes & RTT = 200 msec
– initial rate = MSS/RTT = 20 kbps
• available bandwidth may be >> MSS/RTT
– desirable to quickly ramp up to respectable rate
• increase rate exponentially until first loss event or when threshold reached
– double cwnd every RTT
– done by incrementing cwnd by 1 MSS for every ACK received
[Figure: Host A and Host B exchanging segments over successive RTTs; the number of segments per RTT doubles each round]
TCP slow (exponential) start
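A quick sketch, using the slide's example MSS of 500 bytes, of why the per-ACK increment of 1 MSS doubles cwnd every RTT:

```python
# Slow start: each segment in the window earns one ACK, and each ACK adds
# one MSS to cwnd, so cwnd doubles per round trip.
MSS = 500  # bytes, from the slide's example

def slow_start_rounds(rounds):
    """Return cwnd (in bytes) at the start of each RTT round."""
    cwnd = 1 * MSS
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        acks = cwnd // MSS        # one ACK per segment in the window
        cwnd += acks * MSS        # +1 MSS per ACK => cwnd doubles
    return history

print(slow_start_rounds(5))      # [500, 1000, 2000, 4000, 8000]
```

With RTT = 200 msec, the first round's rate is 500·8 bits / 0.2 s = 20 kbps, matching the slide's initial-rate figure.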
Transitioning into/out of slowstart
ssthresh: cwnd threshold maintained by TCP
• on loss event: set ssthresh to cwnd/2
– remember (half of) TCP rate when congestion last occurred
• when cwnd >= ssthresh: transition from slowstart to congestion-avoidance phase
[FSM excerpt — slow start and congestion avoidance:
– initialization: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
– slow start, new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed
– slow start, duplicate ACK: dupACKcount++
– when cwnd > ssthresh: enter congestion avoidance
– timeout (in either state): ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment]
TCP: congestion avoidance
• when cwnd > ssthresh, grow cwnd linearly
– increase cwnd by 1 MSS per RTT
– approach possible congestion slower than in slowstart
– implementation: cwnd = cwnd + MSS·(MSS/cwnd) for each ACK received
AIMD: Additive Increase, Multiplicative Decrease
• ACKs: increase cwnd by 1 MSS per RTT: additive increase
• loss: cut cwnd in half (non-timeout-detected loss): multiplicative decrease
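The two AIMD rules can be sketched as a pair of update functions (MSS value assumed; the per-ACK update follows the slide's cwnd = cwnd + MSS·(MSS/cwnd)):

```python
# AIMD in congestion avoidance: additive increase per ACK, multiplicative
# decrease on a triple-duplicate-ACK loss.
MSS = 1500  # bytes; an assumed segment size

def on_ack(cwnd):
    """Additive increase: +MSS*(MSS/cwnd) per ACK, i.e. ~ +1 MSS per RTT."""
    return cwnd + MSS * (MSS / cwnd)

def on_triple_dup_ack(cwnd):
    """Multiplicative decrease: cut cwnd in half."""
    return cwnd / 2

cwnd = 10 * MSS
for _ in range(10):          # one RTT's worth of ACKs (10 segments in flight)
    cwnd = on_ack(cwnd)
print(cwnd / MSS)            # close to 11 MSS: roughly +1 MSS over one RTT

cwnd = on_triple_dup_ack(cwnd)
print(cwnd / MSS)            # halved
```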
TCP congestion control FSM: details
[FSM — three states: slow start, congestion avoidance, fast recovery]
Initialization: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0; start in slow start.
Slow start:
– new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed
– duplicate ACK: dupACKcount++
– cwnd > ssthresh: enter congestion avoidance
Congestion avoidance:
– new ACK: cwnd = cwnd + MSS·(MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
– duplicate ACK: dupACKcount++
Either state, dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3 MSS, retransmit missing segment, enter fast recovery.
Fast recovery:
– duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
– new ACK: cwnd = ssthresh, dupACKcount = 0, enter congestion avoidance
Any state, timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, enter slow start.
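The FSM can be sketched as a small event-driven class (a teaching sketch of the slide's transitions, with the slide's 64 KB initial ssthresh and an assumed MSS of 1500 bytes; not a real TCP implementation):

```python
# Reno-style congestion-control state machine, one method per FSM event.
MSS = 1500

class RenoCC:
    def __init__(self):
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh            # deflate window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                     # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)  # linear growth
        self.dup_acks = 0

    def duplicate_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                     # inflate window per dup ACK
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                   # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"         # (and retransmit missing segment)

    def timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"                # (and retransmit missing segment)

cc = RenoCC()
for _ in range(8):
    cc.new_ack()            # slow start: cwnd grows by one MSS per ACK
print(cc.state, cc.cwnd)
for _ in range(3):
    cc.duplicate_ack()      # triple duplicate ACK -> fast recovery
print(cc.state, cc.cwnd)
```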
Popular “flavors” of TCP
[Figure: cwnd window size (in segments) vs. transmission round. After a loss, TCP Tahoe drops cwnd to 1 and re-enters slow start, while TCP Reno cuts cwnd to ssthresh and continues with congestion avoidance.]
Summary: TCP Congestion Control
• when cwnd < ssthresh, sender is in slow-start phase, window grows exponentially.
• when cwnd >= ssthresh, sender is in congestion-avoidance phase, window grows linearly.
• when triple duplicate ACK occurs, ssthresh set to cwnd/2, cwnd set to ~ssthresh.
• when timeout occurs, ssthresh set to cwnd/2, cwnd set to 1 MSS.
Simplified TCP throughput
• Average throughput of TCP as function of window size, RTT?
– ignoring slow start
• let W be window size when loss occurs
– when window is W, throughput is W/RTT
– just after loss, window drops to W/2, throughput to W/(2·RTT)
– average throughput: 0.75·W/RTT
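The 0.75·W/RTT figure can be checked numerically by averaging the window over one sawtooth cycle (per-RTT granularity and the W and RTT values below are assumed):

```python
# Average throughput over one sawtooth cycle: the window grows linearly
# from W/2 back to W, one increment per RTT.
def average_sawtooth_throughput(W, rtt):
    """Average throughput (segments/sec) over windows W/2, W/2+1, ..., W."""
    windows = list(range(W // 2, W + 1))
    return sum(w / rtt for w in windows) / len(windows)

W, rtt = 1000, 0.1                 # segments, seconds (assumed values)
avg = average_sawtooth_throughput(W, rtt)
print(avg / (W / rtt))             # 0.75, matching the slide
```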
TCP throughput as a function of
Loss rate
• Assuming 1 packet is lost in each cycle
• Therefore, the loss rate L is obtained as L = 1/((3/8)·W²)
• Since W = √(8/(3L)), we get
• Throughput = 0.75·W/RTT = (1.22·MSS)/(RTT·√L)
TCP Futures: TCP over “long, fat pipes”
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires window size W = 83,333 in-flight segments
• throughput in terms of loss rate: (1.22·MSS)/(RTT·√L) = 10×10⁹ bps
• required value of packet loss rate: L = 2×10⁻¹⁰
• existing TCP may not scale well in future networks
• need new versions of TCP for high-speed links
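The slide's two numbers can be re-derived directly from the formulas (illustrative script; constants come from the slide's example):

```python
# Window size and loss rate required for 10 Gbps with 1500 B segments, 100 ms RTT.
MSS_BITS = 1500 * 8       # segment size in bits
RTT = 0.1                 # seconds
TARGET = 10e9             # 10 Gbps, in bits/sec

# throughput = W * MSS / RTT  =>  W = target * RTT / MSS
W = TARGET * RTT / MSS_BITS
print(round(W))           # 83333 in-flight segments

# throughput = 1.22 * MSS / (RTT * sqrt(L))  =>  L = (1.22 * MSS / (RTT * target))^2
L = (1.22 * MSS_BITS / (RTT * TARGET)) ** 2
print(L)                  # ~2e-10
```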
TCP Fairness
fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K
[Figure: TCP connections 1 and 2 sharing a bottleneck router of capacity R]
Why is TCP fair?
Two competing sessions:
• additive increase gives slope of 1 as throughput increases
• multiplicative decrease cuts throughput proportionally
[Figure: Connection 1 throughput vs. Connection 2 throughput, each axis up to R. Alternating rounds of congestion avoidance (additive increase) and loss (window cut by factor of 2) move the operating point toward the equal-bandwidth-share line.]
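The convergence argument can be sketched with a toy simulation of two AIMD flows (synchronized losses and a unit additive increase are simplifying assumptions, not how real flows behave):

```python
# Two AIMD flows on a bottleneck of capacity R: both add 1 per round, both
# halve when the sum exceeds R. Each halving shrinks the gap between them,
# so an unfair split decays toward the equal share.
R = 100.0

def aimd_two_flows(x1, x2, rounds):
    for _ in range(rounds):
        if x1 + x2 > R:          # loss: multiplicative decrease for both
            x1, x2 = x1 / 2, x2 / 2
        else:                    # additive increase for both
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2

x1, x2 = aimd_two_flows(80.0, 10.0, 500)
print(abs(x1 - x2) < 1)          # True: the flows converge to equal shares
```

The rates still oscillate in a sawtooth, but the *difference* between them is halved at every loss, which is exactly why the operating point slides onto the equal-share line in the figure.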
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
– do not want rate throttled by congestion control
• instead use UDP:
– pump audio/video at constant rate, tolerate packet loss
Fairness and parallel TCP connections
• nothing prevents an app from opening parallel connections between 2 hosts
• web browsers do this
• example: link of rate R supporting 9 connections
– new app asks for 1 TCP, gets rate R/10
– new app asks for 11 TCPs, gets R/2!
Bandwidth sharing with TCP
[Figures: two TCP flows sharing a link; TCP and UDP flows sharing a link]
Networks vs. Processors
• Network speeds
– 100 Mbps to 1 Gbps to 10 Gbps
• Network protocol stack throughput
– is good for only 100 Mbps
– with fine-tuning, OK for 1 Gbps
– What about 10 Gbps?
• Example
– Payload size: 1460 B, 2-3 GHz processor
– Receive throughput achieved: 750 Mbps
– Transmit throughput achieved: 1 Gbps
• Need radical solutions to support 10 Gbps and beyond
Where is the overhead?
• TCP was suspected of being too complex
– In 1989, Clark, Jacobson and others proved otherwise
• The complexity (overhead) lies in the computing environment where TCP operates
– Interrupts
– OS scheduling
– Buffering
– Data movement
• Simple solutions that improve performance
– Interrupt moderation
• NIC waits for multiple packets and notifies the processor once
• Amortizes the high cost of interrupts
– Checksum offload
• Checksum calculation in the processor is costly
• Offload checksum calculation to the NIC (in hardware)
– Large segment offload
• Segmenting large chunks of data into smaller segments is expensive
• Offload segmentation and TCP/IP header preparation to the NIC
• Useful for sender-side TCP
– Can support up to ~1 Gbps PHYs
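The amortization argument behind interrupt moderation can be sketched with illustrative (assumed, not measured) per-interrupt and per-packet costs:

```python
# Batching N packets per interrupt divides the fixed interrupt cost across them.
INTERRUPT_COST_US = 10.0   # assumed fixed CPU cost per interrupt, microseconds
PER_PACKET_US = 1.0        # assumed per-packet processing cost, microseconds

def cpu_time_per_packet(batch_size):
    """Average CPU microseconds per packet when the NIC coalesces batch_size packets."""
    return PER_PACKET_US + INTERRUPT_COST_US / batch_size

print(cpu_time_per_packet(1))    # 11.0 us: one interrupt per packet
print(cpu_time_per_packet(32))   # 1.3125 us: interrupt cost amortized
```

The trade-off, as with any coalescing scheme, is added latency while the NIC waits to fill a batch.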
Challenges in detail
• OS issues
– Interrupts
• Interrupt moderation
• Polling
• Hybrid interrupts
• Memory
– Latency
• Memory is slower than the processor
– Poor cache locality
• New data entering from NIC or application
• Cache misses and CPU stalls are common
• Buffering and copying
– Usually two copies required
• Application-to-TCP copy and TCP-to-NIC copy
– Receive side:
• Copies can be reduced to one if posted buffers are provided by the application
• Mostly two copies required
– Transmit side:
• Zero copy on transmit (DMA from application to NIC) can help
• Implemented on selected systems
TCP/IP Acceleration Methods
• Three main strategies
– TCP Offload Engine (TOE)
– TCP onloading
– Stack and NIC enhancements
• TCP Offload Engine
– Offload TCP/IP processing to devices attached to the server's I/O system
– Use separate processing and memory resources
– Pros
• Improves throughput and utilization performance
• Useful for bulk data transfer such as IP-storage
• Good for a few connections with high-bandwidth links
– Cons
• May not scale well to a large number of connections
• Needs special processors (expensive)
• Needs large memory in the NIC (expensive)
• Store-and-forward in the TOE is suitable only for large transfers
• Latency between the I/O subsystem and main memory is high
• Expensive TOEs or NICs are required
[Figure: processor and cache memory, with the TCP Offload Engine on the NIC device]
TCP onloading
• Dedicate TCP/IP processing to one or more general-purpose cores
– High performance
– Cheap
– Main-memory-to-CPU latency is small
• Extensible
– Programming tools and implementations exist
– Good for long-term performance
• Scalable
– Good for a large number of flows
[Figure: Cores 0-2 run applications; Core 3 is dedicated to TCP/IP processing (onloading), sharing cache memory and the NIC device]
Stack and NIC enhancements
• Asynchronous I/O
– Asynchronous callbacks on data arrival
– Pre-posting buffers by the application to avoid copying
• Header splitting
– Splitting headers and data
– Better data pre-fetching
– NIC can place the header separately
• Receive-side scaling
– Using multiple cores to achieve connection-level parallelism
– Have multiple queues in the NIC
– Map each queue to a different processor
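Receive-side scaling's queue mapping can be sketched as a flow hash (the SHA-256 hash below is a stand-in for the NIC's hardware hash; real NICs typically use a Toeplitz hash over the 4-tuple, and the queue count is assumed):

```python
# Hash a connection's 4-tuple to one of several RX queues, so every packet
# of a flow lands on the same queue and is processed by the same core.
import hashlib

NUM_QUEUES = 4   # assumed: one RX queue per core

def rx_queue(src_ip, src_port, dst_ip, dst_port):
    """Map a flow's 4-tuple to a queue index in [0, NUM_QUEUES)."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return digest[0] % NUM_QUEUES

q1 = rx_queue("10.0.0.1", 12345, "10.0.0.2", 80)
q2 = rx_queue("10.0.0.1", 12345, "10.0.0.2", 80)
print(q1 == q2)   # True: a flow always maps to the same queue
```

Keeping a flow on one core preserves cache locality for its connection state, which is the point of connection-level parallelism above.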
Summary
Reading assignment
• TCP from Chapter 3 in Kurose and Ross
• TCP from Chapter 5 in Peterson and Davie
• Homework:
– Problems P37 and P43 (Pages 306-308) from Kurose and Ross
– Deadline: 30th October 2009