
Transport Level Protocol Performance Evaluation for Bulk Data Transfers
Matei Ripeanu
The University of Chicago
http://www.cs.uchicago.edu/~matei/
Abstract: Before developing new protocols targeted at bulk data transfers,
the achievable performance and limitations of the broadly used TCP
protocol should be carefully investigated. Our first goal is to explore TCP's
bulk transfer throughput as a function of network path properties, number
of concurrent flows, loss rates, competing traffic, etc. We use analytical
models, simulations, and real-world experiments. The second objective is
to repeat this evaluation for some of TCP's candidate replacements (e.g.,
NETBLT). This should allow an informed decision on whether to put effort
into developing and/or using new protocols specialized for bulk transfers.
Main inefficiencies TCP is blamed for:
Overhead. However, less than 15% of the time is spent in TCP processing proper.
Flow control. Claim: a rate-based protocol would be faster. However, there is no proof that rate-based pacing is better than (self-clocking) ACK clocking.
Congestion control:
• Underlying problem: the underlying layers do not give explicit congestion feedback, so TCP assumes any packet loss is a congestion signal.
• Not scalable.
TCP Refresher:
[Figure: congestion window vs. time, showing Slow Start (exponential growth), Congestion Avoidance (linear growth), packet loss discovered through the fast retransmit / fast recovery mechanism, and packet loss discovered through timeout.]
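As a rough illustration of the sawtooth in the figure, here is a toy per-RTT model of the congestion window (a sketch for intuition only; it ignores ACK clocking, RTT variation, and the details of fast recovery, and the loss epochs are made-up inputs):

# Toy per-RTT model of the TCP congestion window (in segments).
# Illustrative only: ignores ACK clocking, RTT variation and fast-recovery details.
def evolve_cwnd(fast_rtx_rtts, timeout_rtts, ssthresh=64, rtts=200):
    cwnd, history = 1.0, []
    for t in range(rtts):
        history.append(cwnd)
        if t in timeout_rtts:            # loss discovered through timeout
            ssthresh = max(cwnd / 2, 2)
            cwnd = 1.0                   # back to Slow Start
        elif t in fast_rtx_rtts:         # loss discovered through fast retransmit
            ssthresh = max(cwnd / 2, 2)
            cwnd = ssthresh              # multiplicative decrease, continue in Congestion Avoidance
        elif cwnd < ssthresh:
            cwnd *= 2                    # Slow Start: exponential growth
        else:
            cwnd += 1                    # Congestion Avoidance: linear growth
    return history

print(evolve_cwnd(fast_rtx_rtts={60}, timeout_rtts={120}))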
Questions:
• Is TCP appropriate/usable?
• What about rate-based protocols?
Want to optimize:
• Link utilization
• Per-file transfer delay
While maintaining “fair” sharing
(Rough) analytical stable-state throughput estimates (based on [Math96]), where p is the packet loss rate, Wmax is the maximum window size in segments, and C is a constant of order 1:

\[
\text{Throughput} \approx
\begin{cases}
\dfrac{MSS}{RTT} \cdot \dfrac{C}{\sqrt{p}} & \text{for } p \ge \dfrac{8}{3 W_{max}^{2}} \\[2ex]
\dfrac{MSS}{RTT} \cdot \dfrac{1}{\dfrac{1}{W_{max}} + \dfrac{p\, W_{max}}{8}} & \text{for } p \le \dfrac{8}{3 W_{max}^{2}}
\end{cases}
\]
[Figure: stable-state throughput as % of bottleneck link rate vs. link loss rate (RTT = 80 ms, MSS = 1460 bytes), plotted for a T3 link (43.2 Mbps), an OC3 link (155 Mbps), and an OC12 link (622 Mbps).]
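A minimal sketch of how such curves can be computed from the estimate above (the constant C = sqrt(3/2) and the 256-segment maximum window are assumptions for illustration, not values taken from the poster):

# Evaluate the [Math96]-style stable-state throughput estimate.
# Assumption: C = sqrt(3/2); the poster does not state the constant it uses.
from math import sqrt

def tcp_throughput_bps(p, rtt_s, mss_bytes, wmax_segments, c=sqrt(3.0 / 2.0)):
    if p >= 8.0 / (3.0 * wmax_segments ** 2):
        segments_per_rtt = c / sqrt(p)                                            # loss-limited regime
    else:
        segments_per_rtt = 1.0 / (1.0 / wmax_segments + p * wmax_segments / 8.0)  # window-limited regime
    return segments_per_rtt * mss_bytes * 8 / rtt_s

# Example: RTT = 80 ms, MSS = 1460 bytes, assumed 256-segment maximum window.
for p in (1e-7, 1e-5, 1e-3):
    bw = tcp_throughput_bps(p, 0.080, 1460, 256)
    print(f"p={p:g}: {bw / 155e6:.1%} of an OC3 link (155 Mbps)")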
Application requirements (GriPhyN):
• Efficient management of 10s to 100s of PetaBytes (PB) of data, many PBs of new raw data per year.
• Granularity: file sizes of 10 MB to 1 GB.
• Large pipes: OC3 and up, high latencies.
• Efficient bulk data transfers.
• Graceful sharing with other applications.
• Projects: CMS, ATLAS, LIGO, SDSS.
Striping
• Widely used (browsers, FTP, etc.); a minimal sketch follows below.
• Good practical results.
• Not ‘TCP friendly’!
• RFC 2140 / Ensemble TCP: share information and congestion management among parallel flows.
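A minimal sketch of striping a transfer across parallel TCP connections (hypothetical helper names and framing; a real GridFTP-style implementation would also negotiate block sizes, handle retries, and reassemble on the receiver):

# Split a payload across N parallel TCP connections to the same receiver.
# Illustrative sketch only, not GridFTP's actual protocol.
import socket, threading

def send_stripe(host, port, stripe_id, data):
    with socket.create_connection((host, port)) as s:
        # Prefix each stripe with its id and length so the receiver can reassemble.
        s.sendall(stripe_id.to_bytes(4, "big") + len(data).to_bytes(8, "big") + data)

def striped_send(host, port, payload, n_flows):
    chunk = (len(payload) + n_flows - 1) // n_flows
    threads = [threading.Thread(target=send_stripe,
                                args=(host, port, i, payload[i * chunk:(i + 1) * chunk]))
               for i in range(n_flows)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Usage (assumes a receiver listening on the given host/port):
# striped_send("receiver.example.org", 5001, open("file.bin", "rb").read(), n_flows=8)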
[Figure: simulated 1 GB transfer time (sec, log scale) vs. link loss rate (log scale); OC12 link, 100 ms RTT, MSS=1460 initially; curves for various MSS / delayed-ACK / window-scaling / FACK / multi-flow configurations and the ideal transfer time.]
[Figure: GridFTP and iperf performance; bandwidth (Mb/s) vs. number of parallel flows (stripes) used; OC12 link between ANL and LBNL via ES-Net (56 ms RTT), Linux boxes; data courtesy of MCS/ANL.]
[Figure: 0.5 GB striped transfer, OC3 link (155 Mbps), RTT 80 ms, MSS=9000, using up to 1000 flows; transfer time (sec), packets dropped (left scale), and transfer-time standard deviation (right scale) vs. number of parallel flows used, for link loss rates of 0 and 0.1%.]
TCP striping issues
• Widespread usage exposes scaling problems in TCP's congestion control mechanism:
  • Unfair allocation: a small number of flows grabs almost all the available bandwidth.
  • Reduced efficiency: a large number of packets are dropped.
  • Rule of thumb: keep fewer flows in the system than the ‘pipe size’ expressed in packets (see the worked example after this list).
• Not ‘TCP unfriendly’ as long as link loss rates are high.
• Even high link loss rates do not break the unfairness.
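As a worked example of the rule of thumb (a back-of-the-envelope calculation; the 155 Mbps link, 80 ms RTT, and MSS=9000 figures are the poster's striped-transfer settings):

# Back-of-the-envelope 'pipe size' (bandwidth-delay product in packets)
# for the OC3 path used in the striped-transfer experiment.
link_bps = 155e6      # OC3
rtt_s = 0.080         # 80 ms
mss_bytes = 9000      # jumbo frames, as in the 0.5 GB striped transfer
pipe_size_packets = link_bps * rtt_s / (mss_bytes * 8)
print(round(pipe_size_packets))   # ~172 packets, so keep well under ~170 parallel flows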
[Figure: simulated 100 MB transfer time (sec, log scale) vs. link loss rate (log scale); curves for various MSS / delayed-ACK / window-scaling configurations and the ideal transfer time.]
• Significant throughput improvements can be achieved just by tuning the end-systems and the network path: set up proper window sizes, disable delayed ACKs, use SACK and ECN, use jumbo frames, etc. (a tuning sketch follows below).
• For high link loss rates, striping is a legitimate and effective solution.
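A minimal sketch of the end-system side of this tuning (the link rate, RTT, and window sizing are illustrative assumptions; the OS may cap the requested buffer sizes, and SACK, ECN, and jumbo frames are configured at the OS and NIC level rather than per socket):

# Size the socket buffers to the bandwidth-delay product so the TCP window
# can cover the whole pipe. Illustrative values only.
import socket

link_bps = 622e6                          # assumed OC12 path
rtt_s = 0.056                             # assumed 56 ms RTT
bdp_bytes = int(link_bps * rtt_s / 8)     # ~4.4 MB bandwidth-delay product

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
print("requested socket buffers:", bdp_bytes, "bytes")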
Simulations (using NS []):
• Simulation topology: OC3 link, 80 ms RTT, MSS=1460 initially.
Conclusions
TCP can work well with careful end-host and network tuning.
For fair sharing with other users, we need mechanisms that provide congestion feedback and distinguish genuine link losses from congestion indications.
In addition, admission mechanisms based on the number of parallel flows might be beneficial.
Future work
What are optimal buffer sizes for bulk
transfers?
Can we use ECN and large buffers to reliably
detect congestion without using dropped
packets as a congestion indicator?
Assuming the link loss rate pattern is known,
can it be used to reliably detect congestion and
improve throughput and