
Lessons Learned in Grid Networking
or: How do we get end-2-end performance to Real Users?
Richard Hughes-Jones, Manchester
GNEW2004, CERN, March 2004
Network Monitoring is Essential
 Detect or cross-check problem reports
 Isolate / determine a performance issue
 Capacity planning
 Publication of data: network "cost" for middleware
  Resource Brokers (RBs) for optimized matchmaking
  WP2 Replica Manager

End2End Time Series (a minimal probe sketch follows this list)
 Throughput UDP/TCP
 RTT
 Packet loss

Passive Monitoring
 Routers, switches: SNMP, MRTG
 Historical MRTG
 Capacity planning
 SLA verification
 Isolate / determine throughput bottleneck – work with real user problems
 Test conditions for protocol/hardware investigations

Packet/Protocol Dynamics
 Protocol performance / development
 Hardware performance / development
 Application analysis

Output from Application Tools
 Input to middleware – e.g. gridftp throughput
 Isolate / determine a (user) performance issue
 Hardware / protocol investigations
 tcpdump
 web100
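
As a minimal illustration of the end-to-end time-series measurements above, the sketch below sends sequence-numbered UDP datagrams to an echo server and reports RTT and packet loss. It is a toy probe, not the production tooling behind the real plots; the port, probe count, and spacing are illustrative assumptions.

```python
# Minimal end-to-end probe: sequence-numbered UDP datagrams to an echo
# server, reporting RTT and loss. A toy sketch, not the production tools
# behind the time-series plots; port, count, and spacing are assumptions.
import socket
import struct
import sys
import time

PORT = 5001        # assumed free UDP port on both ends
N_PROBES = 100
TIMEOUT = 1.0      # seconds to wait before declaring a probe lost

def server():
    """Echo every datagram straight back to its sender."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PORT))
    while True:
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)

def client(host):
    """Send N_PROBES datagrams, measure RTT, count losses."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(TIMEOUT)
    rtts, lost = [], 0
    for seq in range(N_PROBES):
        sock.sendto(struct.pack("!I", seq), (host, PORT))
        t0 = time.monotonic()
        try:
            while True:  # discard late echoes of earlier probes
                data, _ = sock.recvfrom(2048)
                if struct.unpack("!I", data)[0] == seq:
                    rtts.append((time.monotonic() - t0) * 1000.0)
                    break
        except socket.timeout:
            lost += 1
        time.sleep(0.01)  # 10 ms probe spacing
    if rtts:
        print(f"rtt min/avg/max = {min(rtts):.2f}/"
              f"{sum(rtts)/len(rtts):.2f}/{max(rtts):.2f} ms")
    print(f"loss = {lost}/{N_PROBES} ({100.0 * lost / N_PROBES:.1f}%)")

if __name__ == "__main__":
    # "probe.py serve" on one host, "probe.py <server-host>" on the other
    server() if sys.argv[1] == "serve" else client(sys.argv[1])
```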
Multi-Gigabit transfers are possible and stable
10 GigEthernet at SC2003 BW Challenge
 Three server systems with 10 GigEthernet NICs
 Used the DataTAG altAIMD stack, 9000 byte MTU
 Sent mem-mem iperf TCP streams from the SLAC/FNAL booth in Phoenix to the destinations below (the quoted windows are checked against the bandwidth-delay product in the sketch after this slide):
  Palo Alto PAIX: rtt 17 ms, window 30 MB; shared with the Caltech booth
   4.37 Gbit HS-TCP I=5%, then 2.87 Gbit I=16% (the fall corresponds to 10 Gbit on the link)
   3.3 Gbit Scalable I=8%
   Tested 2 flows: sum 1.9 Gbit I=39%
   4.35 Gbit HS-TCP I=6.9%, very stable
  Chicago Starlight: rtt 65 ms, window 60 MB; Phoenix CPU 2.2 GHz
   3.1 Gbit HS-TCP I=1.6%
  Amsterdam SARA: rtt 175 ms, window 200 MB; Phoenix CPU 2.2 GHz
 Both the Chicago and Amsterdam paths used Abilene to Chicago

[Figure: 10 Gbit/s throughput from SC2003 to Chicago & Amsterdam; Phoenix-Chicago and Phoenix-Amsterdam traces plus router traffic to Abilene; Throughput (Gbit/s) vs Date & Time, 11/19/03 15:59 to 17:25]
[Figure: 10 Gbit/s throughput from SC2003 to PAIX; Phoenix-PAIX HS-TCP, Phoenix-PAIX Scalable-TCP, and Phoenix-PAIX Scalable-TCP #2 traces plus router traffic to LA/PAIX; Throughput (Gbit/s) vs Date & Time, 11/19/03 15:59 to 17:25]
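
The quoted TCP windows track the bandwidth-delay product: to keep a pipe full, the window must cover rate × RTT. A quick check of the slide's three paths against the nominal 10 Gbit/s line rate, as a sketch (only the RTTs and windows come from the slide):

```python
# Bandwidth-delay product check for the three SC2003 paths: to keep a pipe
# full, the TCP window must cover rate * RTT. RTTs and configured windows
# are the slide's figures; the 10 Gbit/s line rate is the nominal maximum.
RATE = 10e9  # bits/s

paths = {                       # RTT (s), configured window (bytes)
    "Phoenix-PAIX":      (0.017,  30e6),
    "Phoenix-Chicago":   (0.065,  60e6),
    "Phoenix-Amsterdam": (0.175, 200e6),
}

for name, (rtt, window) in paths.items():
    bdp = RATE * rtt / 8        # bytes in flight to fill the pipe at 10 Gbit/s
    limit = window * 8 / rtt    # maximum rate the configured window allows
    print(f"{name:18s} BDP = {bdp/1e6:6.1f} MB, window = {window/1e6:5.0f} MB, "
          f"window-limited rate = {limit/1e9:4.1f} Gbit/s")
```

By this arithmetic the 60 MB Chicago window caps a single flow at roughly 7.4 Gbit/s, while the PAIX and Amsterdam windows cover the full 10 Gbit/s pipe.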
The performance of the end host / disks is really important
BaBar Case Study: RAID Throughput & PCI Activity
 3Ware 7500-8 RAID5 controller, parallel EIDE disks
 3Ware forces the PCI bus to 33 MHz
 BaBar Tyan server to MB-NG SuperMicro server
 Network mem-mem: 619 Mbit/s
 Disk-to-disk throughput with bbcp: 40-45 Mbytes/s (320-360 Mbit/s); a local disk check is sketched after this slide
 PCI bus effectively full!
 User throughput ~250 Mbit/s
 User surprised!!

[Figure: PCI activity while reading from the RAID5 disks and while writing to the RAID5 disks]
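
The gap between memory-to-memory (619 Mbit/s) and disk-to-disk (320-360 Mbit/s) is exactly what a simple local disk benchmark exposes before the network gets blamed. A minimal sketch, assuming a test file path on the RAID mount; path, file size, and block size are illustrative:

```python
# Minimal local disk throughput check: time writing and re-reading a large
# file and report Mbit/s, to compare with memory-to-memory network figures
# before blaming the network. Path, file size, and block size are assumptions.
import os
import time

PATH = "/raid5/testfile"   # assumed mount point of the RAID array
SIZE = 2 * 1024**3         # 2 GB; should exceed RAM to defeat caching
BLOCK = 1024 * 1024        # 1 MB blocks, similar to bulk-transfer I/O

def write_test():
    buf = os.urandom(BLOCK)
    t0 = time.monotonic()
    with open(PATH, "wb") as f:
        for _ in range(SIZE // BLOCK):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())   # force the data to the disks, not the page cache
    return SIZE * 8 / (time.monotonic() - t0) / 1e6

def read_test():
    # Note: re-reading immediately can be served from the page cache;
    # use a file larger than RAM (or drop caches) for honest numbers.
    t0 = time.monotonic()
    with open(PATH, "rb") as f:
        while f.read(BLOCK):
            pass
    return SIZE * 8 / (time.monotonic() - t0) / 1e6

print(f"write: {write_test():.0f} Mbit/s")
print(f"read:  {read_test():.0f} Mbit/s")
```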
Application design – Throughput + Web100
 2 Gbyte file transferred from RAID0 disks
 Web100 output every 10 ms (a sampling sketch follows this slide)
 Gridftp: see alternating 600/800 Mbit and zero
 Apache web server + curl-based client: see a steady 720 Mbit
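
A sketch of how a 10 ms throughput trace like these can be produced from the application side: a sender thread streams a file over TCP while the main thread samples a bytes-sent counter every 10 ms. Web100 itself reads TCP variables inside the kernel; this user-space analogue, with assumed host, port, and file names, only illustrates the sampling idea.

```python
# Sketch of a 10 ms throughput trace from the application side: one thread
# streams a file over TCP while the main thread samples a bytes-sent counter
# every 10 ms. Web100 reads TCP variables inside the kernel; this user-space
# analogue only illustrates the sampling. Host, port, and file are assumptions.
import socket
import threading
import time

HOST, PORT = "receiver.example.org", 5002   # assumed sink host
FILENAME = "testfile-2GB"                   # assumed 2 Gbyte source file
INTERVAL = 0.010                            # 10 ms, as in the slide

sent = 0        # bytes handed to the TCP stack so far
done = False

def sender():
    global sent, done
    conn = socket.create_connection((HOST, PORT))
    with open(FILENAME, "rb") as f:
        while block := f.read(256 * 1024):
            conn.sendall(block)      # blocks while the TCP window is full
            sent += len(block)
    conn.close()
    done = True

threading.Thread(target=sender, daemon=True).start()
prev = 0
while not done:
    time.sleep(INTERVAL)             # approximate 10 ms sampling grid
    now = sent
    print(f"{(now - prev) * 8 / INTERVAL / 1e6:8.1f} Mbit/s")
    prev = now
```

In a trace like this, stalls show up as runs of zeros (the gridftp pattern), while a smoothly fed connection such as the Apache/curl pair prints a flat line.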
 Network Monitoring is vital
 Development of new TCP stacks and non-TCP protocols is required
 Multi-Gigabit transfers are possible and stable on current networks
 Complementary provision of packet IP & λ-Networks is needed
 The performance of the end host / disks is really important
 Application design can determine Perceived Network Performance
 Helping Real Users is a must – can be harder than herding cats
 Cooperation between Network providers, Network Researchers, and Network Users has been impressive
  Standards (e.g. GGF / IETF) are the way forward
  Many grid projects just assume the network will work!!!
  It takes lots of co-operation to put all the components together
Tuning PCI-X: Variation of mmrbc, IA32
 16080 byte packets sent every 200 µs
 Intel PRO/10GbE LR adapter
 PCI-X bus occupancy vs mmrbc (a simple timing model follows this slide)
 Plots show measured transfer times, expected times based on PCI-X timings from the logic analyser, and the throughput expected from those times
 Kernel 2.6.1 #17, HP Itanium, Intel 10GE, Feb 04

[Figure: PCI-X sequence for one packet (CSR access, data transfer, interrupt & CSR update) at mmrbc = 512, 1024, 2048, and 4096 bytes]
[Figure: measured and expected PCI-X transfer time (µs), measured rate, rate from expected time, and max PCI-X throughput (Gbit/s), vs Max Memory Read Byte Count, 0-5000 bytes]
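
The "expected time" curve can be understood by modelling the DMA of one packet as a series of PCI-X memory-read transactions, each no larger than mmrbc and each paying a fixed setup cost on top of the raw data time on a 133 MHz × 64-bit bus. A sketch of such a model; the per-transaction overhead constant is an assumed placeholder, not the value from the logic analyser:

```python
# Model behind an "expected time" curve: the DMA of one 16080-byte packet is
# split into ceil(PKT / mmrbc) PCI-X memory-read transactions, each paying a
# fixed setup cost on top of the raw data time on a 133 MHz x 64-bit bus.
# OVERHEAD_US is an assumed placeholder, not the logic-analyser figure.
import math

PKT = 16080                          # bytes per packet, from the slide
BUS_BYTES_PER_US = 133e6 * 8 / 1e6   # ~1064 bytes/us on 133 MHz x 64-bit PCI-X
OVERHEAD_US = 0.25                   # assumed per-transaction setup cost

for mmrbc in (512, 1024, 2048, 4096):
    n_txn = math.ceil(PKT / mmrbc)               # memory-read transactions
    t_us = n_txn * OVERHEAD_US + PKT / BUS_BYTES_PER_US
    gbit = PKT * 8 / t_us / 1000                 # bits/us = Mbit/s; /1000 -> Gbit/s
    print(f"mmrbc={mmrbc:4d}: {n_txn:2d} transactions, "
          f"{t_us:5.1f} us, {gbit:4.2f} Gbit/s")
```

Raising mmrbc cuts the number of transactions and hence the fixed overhead, which is why the measured transfer time falls and the achievable rate climbs toward the PCI-X maximum.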
GGF: Hierarchy Characteristics Document
 "A Hierarchy of Network Performance Characteristics for Grid Applications and Services"
 Document defines terms & relations:
  Network characteristics
  Measurement methodologies
  Observation
 Discusses Nodes & Paths
 For each characteristic the document:
  Defines the meaning
  Lists attributes that SHOULD be included
  Notes issues to consider when making an observation
 Characteristic hierarchy (from the figure; encoded as a small data structure after this slide):
  Bandwidth: Capacity, Utilized, Available, Achievable
  Delay: Round-trip, One-way, Jitter
  Loss: Round-trip, One-way, Loss Pattern
  Forwarding: Forwarding Policy, Forwarding Table, Forwarding Weight
  Queue: Discipline, Capacity, Length
  Availability: MTBF, Avail. Pattern
  Hoplist
  Closeness
  Others
 Status:
  Originally submitted to GFSG as a Community Practice Document: draft-ggf-nmwg-hierarchy-00.pdf, Jul 2003
  Revised to Proposed Recommendation: http://www-didc.lbl.gov/NMWG/docs/draft-ggf-nmwg-hierarchy-02.pdf, 7 Jan 04
  Now in 60 day public comment from 28 Jan 04; 18 days to go
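
For illustration, the hierarchy above can be written down as a small data structure of the kind a monitoring framework might use to tag its measurements. This mirrors the figure only; it is not a schema taken from the NMWG document:

```python
# The characteristic hierarchy from the NMWG figure, encoded as a small
# data structure a monitoring framework might use to tag measurements.
# This mirrors the figure only; it is not a schema from the document.
CHARACTERISTICS = {
    "bandwidth":    ["capacity", "utilized", "available", "achievable"],
    "delay":        ["round-trip", "one-way", "jitter"],
    "loss":         ["round-trip", "one-way", "loss pattern"],
    "forwarding":   ["forwarding policy", "forwarding table", "forwarding weight"],
    "queue":        ["discipline", "capacity", "length"],
    "availability": ["MTBF", "availability pattern"],
    "hoplist":      [],
    "closeness":    [],
}

for name, subs in CHARACTERISTICS.items():
    print(f"{name}: {', '.join(subs) if subs else '(no sub-characteristics)'}")
```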