Transcript 20031015-FAST-Ravot

Performance Engineering
E2EpiPEs and FastTCP
Internet2 member meeting - Indianapolis
World Telecom 2003 - Geneva
October 15, 2003
[email protected]
Agenda
 High TCP performance over wide area networks:
 TCP at Gbps speed
 MTU bias
 RTT bias
 TCP fairness
 How to use 100% of the link capacity with TCP Reno
 Impact of network buffers
 New Internet2 Land Speed record
Single TCP stream performance under periodic losses
[Figure: Effect of packet loss. Bandwidth utilization (%) vs. packet loss frequency (%) for a WAN path (RTT = 120 ms) and a LAN path (RTT = 0.04 ms), with 1 Gbps of available bandwidth. At a loss rate of 0.01%, LAN bandwidth utilization is 99% while WAN bandwidth utilization is only 1.2%.]
 TCP throughput is much more sensitive to packet loss in WANs than in LANs (see the sketch below)
 TCP’s congestion control algorithm (AIMD) is not suited to gigabit networks
 Poor, limited feedback mechanisms
 The effect of packet loss is disastrous
 TCP is inefficient in networks with a high bandwidth-delay product
 The future performance of computational grids looks bad if we continue to rely on the widely-deployed TCP RENO
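The WAN/LAN gap can be sanity-checked with the well-known Mathis et al. approximation for steady-state Reno throughput, roughly (MSS / RTT) · (1.22 / √p). A minimal Python sketch (my own check, not part of the original slides):

```python
# Sanity check of Reno's loss sensitivity using the Mathis et al. approximation:
#   throughput ~ (MSS / RTT) * (1.22 / sqrt(p))
# Parameters (1 Gbps link, MSS = 1460 bytes, p = 0.01% loss) are taken from the slide.
from math import sqrt

LINK_BPS = 1e9          # available bandwidth: 1 Gbps
MSS_BITS = 1460 * 8     # standard Ethernet MSS
LOSS = 1e-4             # 0.01% packet loss

def mathis_throughput(rtt_s: float, mss_bits: float = MSS_BITS, p: float = LOSS) -> float:
    """Steady-state Reno throughput estimate in bit/s."""
    return (mss_bits / rtt_s) * (1.22 / sqrt(p))

for label, rtt in [("LAN (RTT=0.04 ms)", 0.04e-3), ("WAN (RTT=120 ms)", 120e-3)]:
    bw = min(mathis_throughput(rtt), LINK_BPS)   # cannot exceed the link capacity
    print(f"{label}: {100 * bw / LINK_BPS:.1f}% of the 1 Gbps link")
# Prints 100% for the LAN (capped at the link rate) and ~1.2% for the WAN,
# matching the plot above.
```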
Responsiveness (I)
 The responsiveness r measures how quickly we go back to using the network link at full capacity after experiencing a loss, assuming that the congestion window size is equal to the bandwidth-delay product when the packet is lost.
r = (C · RTT²) / (2 · MSS)

C: capacity of the link
[Figure: TCP responsiveness. Recovery time (s) vs. RTT (ms, from 0 to 200) for link capacities C = 622 Mbit/s, C = 2.5 Gbit/s and C = 10 Gbit/s.]
Responsiveness (II)
Case | C | RTT (ms) | MSS (bytes) | Responsiveness
Typical LAN today | 1 Gb/s | 2 (worst case) | 1460 | 96 ms
WAN Geneva <-> Chicago | 1 Gb/s | 120 | 1460 | 10 min
WAN Geneva <-> Sunnyvale | 1 Gb/s | 180 | 1460 | 23 min
WAN Geneva <-> Tokyo | 1 Gb/s | 300 | 1460 | 1 h 04 min
WAN Geneva <-> Sunnyvale | 2.5 Gb/s | 180 | 1460 | 58 min
Future WAN CERN <-> Starlight | 10 Gb/s | 120 | 1460 | 1 h 32 min
Future WAN link CERN <-> Starlight | 10 Gb/s | 120 | 8960 (Jumbo Frame) | 15 min
The Linux 2.4.x kernel implements delayed acknowledgments. Due to delayed acknowledgments, the responsiveness is multiplied by two, so the values above have to be doubled.
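As a quick check, the formula from the previous slide can be evaluated directly. A minimal sketch (my own calculation; the printed values match the table within rounding):

```python
# Responsiveness r = C * RTT^2 / (2 * MSS): time to get back to full link
# utilization after a single loss, assuming cwnd equalled the bandwidth-delay
# product when the loss occurred.
def responsiveness(capacity_bps: float, rtt_s: float, mss_bytes: int,
                   delayed_ack: bool = False) -> float:
    """Recovery time in seconds; doubled when delayed ACKs halve the cwnd growth."""
    r = capacity_bps * rtt_s ** 2 / (2 * mss_bytes * 8)
    return 2 * r if delayed_ack else r

# Geneva <-> Chicago: 1 Gb/s, RTT = 120 ms, MSS = 1460 bytes
print(responsiveness(1e9, 0.120, 1460) / 60)    # ~10 minutes, as in the table
# Geneva <-> Sunnyvale: 1 Gb/s, RTT = 180 ms, MSS = 1460 bytes
print(responsiveness(1e9, 0.180, 1460) / 60)    # ~23 minutes, as in the table
```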
Single TCP stream
TCP connection between Geneva and Chicago: C=1 Gbit/s; MSS=1,460 Bytes; RTT=120ms
 Time to increase the throughput from 100 Mbps to 900 Mbps = 35 minutes (see the sketch below)
 Loss occurs when the bandwidth reaches the pipe size
 75% of bandwidth utilization (assuming no buffering)
 While Cwnd < BDP: Throughput < Bandwidth, the RTT stays constant, and Throughput = Cwnd / RTT
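A rough back-of-the-envelope sketch of where a ramp-up time of this order comes from, assuming pure additive increase of one MSS per RTT in congestion avoidance (one MSS per two RTTs with delayed ACKs); this is my estimate, not the original measurement:

```python
# Time for one Reno stream to grow from 100 Mbps to 900 Mbps on the
# Geneva-Chicago path (C = 1 Gbit/s, MSS = 1460 bytes, RTT = 120 ms),
# assuming the window grows by one MSS every `rtts_per_mss` round trips.
MSS_BYTES = 1460
RTT_S = 0.120

def ramp_time(start_bps: float, end_bps: float, rtts_per_mss: int = 1) -> float:
    """Seconds for the sending rate to grow linearly from start_bps to end_bps."""
    segs_start = start_bps * RTT_S / (MSS_BYTES * 8)   # cwnd in segments at 100 Mbps
    segs_end = end_bps * RTT_S / (MSS_BYTES * 8)       # cwnd in segments at 900 Mbps
    return (segs_end - segs_start) * rtts_per_mss * RTT_S

print(ramp_time(100e6, 900e6) / 60)                    # ~16 min with one MSS per RTT
print(ramp_time(100e6, 900e6, rtts_per_mss=2) / 60)    # ~33 min with delayed ACKs,
                                                       # close to the 35 minutes above
```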
Measurements with Different MTUs
TCP connection between Geneva and Chicago: C=1 Gbit/s; RTT=120ms
 In both cases: 75% of the link utilization
 A large MTU accelerates the growth of the window
 The time to recover from a packet loss decreases with a large MTU
 A larger MTU reduces the overhead per frame (saves CPU cycles, reduces the number of packets)
MTU and Fairness
[Diagram: Host #1 and Host #2 at CERN (GVA), each connected at 1 GE through a GbE switch and router R, across a POS 2.5 Gbps link to Starlight (Chi), where Host #1 and Host #2 are again attached at 1 GE; the two streams share a 1 GE bottleneck.]
 Two TCP streams share a 1 Gbps bottleneck; RTT = 117 ms
 MTU = 1500 bytes: avg. throughput over a period of 4000 s = 50 Mb/s
 MTU = 9000 bytes: avg. throughput over a period of 4000 s = 698 Mb/s
 A factor of 14!
 Connections with a large MTU increase their rate quickly and grab most of the available bandwidth (see the sketch below)
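The bias follows from the additive-increase step being one MSS per RTT, so the rate grows in proportion to the MSS. A minimal illustration (my numbers, computed from the slide's RTT, not the measured traces):

```python
# In congestion avoidance, cwnd grows by one MSS per RTT, so the sending rate
# grows by MSS * 8 / RTT bits/s each RTT, i.e. MSS * 8 / RTT**2 bits/s per second.
RTT_S = 0.117   # Geneva <-> Starlight path from the slide

def rate_growth_bps_per_s(mss_bytes: int, rtt_s: float = RTT_S) -> float:
    return mss_bytes * 8 / rtt_s ** 2

print(rate_growth_bps_per_s(1460) / 1e6)   # ~0.85 Mbps gained per second (MTU 1500)
print(rate_growth_bps_per_s(8960) / 1e6)   # ~5.2 Mbps gained per second (MTU 9000)
# After every loss the jumbo-frame stream ramps back up about six times faster,
# so it repeatedly grabs the shared bottleneck before the standard-MTU stream.
```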
RTT and Fairness
[Diagram: Host #1 and Host #2 at CERN (GVA), each connected at 1 GE through a GbE switch and router R; one path runs over POS 2.5 Gb/s to Starlight (Chi), the other continues over 10GE and POS 10 Gb/s to Sunnyvale, with the remote hosts attached at 1 GE; the two streams share a 1 GE bottleneck.]
 Two TCP streams share a 1 Gbps bottleneck; MTU = 9000 bytes
 CERN <-> Sunnyvale: RTT = 181 ms; avg. throughput over a period of 7000 s = 202 Mb/s
 CERN <-> Starlight: RTT = 117 ms; avg. throughput over a period of 7000 s = 514 Mb/s
 The connection with the small RTT increases its rate quickly and grabs most of the available bandwidth (see the toy model after the figure below)
[Figure: Throughput of two streams with different RTT sharing a 1 Gbps bottleneck. Throughput (Mbps) vs. time (s) over 7000 s for the RTT = 181 ms and RTT = 117 ms streams, with the average over the life of each connection shown.]
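The RTT bias can be reproduced qualitatively with a toy AIMD model in which both streams add one MSS per RTT and halve their windows together whenever their combined rate fills the bottleneck. Synchronized losses and a fluid rate model are simplifying assumptions of mine; the measured traces above are messier:

```python
# Toy synchronized-loss AIMD model of two Reno streams sharing a 1 Gbps bottleneck,
# MSS = 8960 bytes (MTU 9000) as on the slide.
BOTTLENECK_BPS = 1e9
MSS_BITS = 8960 * 8
RTTS = [0.117, 0.181]        # Starlight and Sunnyvale paths, in seconds
DT = 0.01                    # simulation time step (s)
DURATION = 7000.0

cwnd = [10.0, 10.0]          # congestion windows in segments
totals = [0.0, 0.0]          # bits delivered by each stream

t = 0.0
while t < DURATION:
    rates = [cwnd[i] * MSS_BITS / RTTS[i] for i in range(2)]
    for i in range(2):
        totals[i] += rates[i] * DT
    if sum(rates) >= BOTTLENECK_BPS:        # shared bottleneck buffer overflows
        cwnd = [w / 2 for w in cwnd]        # both streams halve (synchronized loss)
    else:
        for i in range(2):
            cwnd[i] += DT / RTTS[i]         # additive increase: one MSS per RTT
    t += DT

for rtt, bits in zip(RTTS, totals):
    print(f"RTT = {rtt * 1e3:.0f} ms: average {bits / DURATION / 1e6:.0f} Mbps")
# The short-RTT stream ends up with the clearly larger share, as measured above.
```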
How to use 100% of the bandwidth?
 Single TCP stream GVA - CHI
 MSS=8960 Bytes; Throughput = 980Mbps
 Cwnd > BDP => Throughput = Bandwidth
 RTT increase
 Extremely Large buffer at the bottleneck
 Network buffers have an important impact on
performance
 Have buffers to be well dimensioned in order to
scale with the BDP?
 Why not use the end-to-end delay as congestion
indication.
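The last bullet is the delay-based idea behind FAST TCP. Here is a Vegas-style sketch of using queueing delay as the congestion signal (an illustrative update rule under my own assumptions, not the actual FAST TCP algorithm):

```python
# Vegas-style delay-based window adjustment: estimate the number of packets the
# flow keeps queued in the network as backlog = cwnd * (rtt - base_rtt) / rtt and
# steer it toward a small target ALPHA instead of waiting for packet loss.
ALPHA = 4.0   # target backlog in packets (illustrative value)

def update_cwnd(cwnd: float, rtt: float, base_rtt: float) -> float:
    """One congestion-avoidance step using end-to-end delay as the signal."""
    backlog = cwnd * (rtt - base_rtt) / rtt   # estimated packets sitting in queues
    if backlog < ALPHA:
        return cwnd + 1.0    # little queueing observed: grow the window
    elif backlog > ALPHA:
        return cwnd - 1.0    # queue is building up: back off before a loss occurs
    return cwnd

# Example: base RTT 120 ms, measured RTT 121 ms, cwnd 2000 segments
print(update_cwnd(2000.0, 0.121, 0.120))   # backlog ~16.5 > ALPHA, so cwnd shrinks
```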
Single stream TCP performance
Date | From Geneva to | Size of transfer | Duration (s) | RTT (ms) | MTU (bytes) | IP version | Throughput | Record / Award
Feb 27 | Sunnyvale | 1.1 TByte | 3700 | 180 | 9000 | IPv4 | 2.38 Gbps | Internet2 LSR, CENIC award, Guinness World Record
May 27 | Tokyo | 65.1 GByte | 600 | 277 | 1500 | IPv4 | 931 Mbps | -
May 2 | Chicago | 385 GByte | 3600 | 120 | 1500 | IPv6 | 919 Mbps | -
May 2 | Chicago | 412 GByte | 3600 | 120 | 9000 | IPv6 | 983 Mbps | Internet2 LSR
NEW Submission (Oct-11): 5.65 Gbps from Geneva to Los Angeles across
the LHCnet, Starlight, Abilene and CENIC.
Early 10 Gb/s 10,000 km TCP Testing
Monitoring of the Abilene traffic in LA
 Single TCP stream at 5.65 Gbps
 Transferring a full CD in less than 1 s (see the check below)
 Uncongested network
 No packet loss during the transfer
 Probably qualifies as a new Internet2 LSR
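A quick arithmetic check of the "full CD in less than 1 s" claim; the 650 MB CD size is my assumption:

```python
# Time to move one CD worth of data at the record single-stream rate.
CD_BYTES = 650 * 1024 ** 2   # assuming a standard 650 MB CD-ROM
RATE_BPS = 5.65e9            # 5.65 Gbps single TCP stream

print(CD_BYTES * 8 / RATE_BPS)   # ~0.97 s, i.e. just under one second
```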
Conclusion
 The future performance of computational grids looks bad if we
continue to rely on the widely-deployed TCP RENO
 How do we define fairness?
Taking into account the MTU
Taking into account the RTT
 Larger packet size (jumbogram: payload larger than 64 KB)
Is the standard MTU the largest bottleneck?
New Intel 10GE cards: MTU = 16 KB
J. Cain (Cisco): “It’s very difficult to build switches to switch large packets such as jumbograms”
 Our vision of the network:
“The network, once viewed as an obstacle for virtual collaborations
and distributed computing in grids, can now start to be viewed as a
catalyst instead. Grid nodes distributed around the world will simply
become depots for dropping off information for computation or
storage, and the network will become the fundamental fabric for
tomorrow's computational grids and virtual supercomputers”