Lessons Learned in
Grid Networking
or
How do we get
end-2-end performance
to Real Users?
Richard Hughes-Jones
GNEW2004 CERN March 2004
R. Hughes-Jones Manchester
Network Monitoring is Essential
Detect or cross-check problem reports
Isolate / determine a performance issue
Capacity planning
Publication of data: network “cost” for middleware
Resource Brokers (RBs) for optimized matchmaking
WP2 Replica Manager
End2End Time Series
Throughput UDP/TCP
RTT
Packet loss
Passive Monitoring
Routers / switches: SNMP, MRTG
Historical MRTG data
Capacity planning
SLA verification
Isolate / determine throughput bottleneck – work with real user problems
Test conditions for Protocol/HW investigations
Packet/Protocol Dynamics
Protocol performance / development
Hardware performance / development
Application analysis
Output from Application tools
Input to middleware – e.g. gridftp throughput
Isolate / determine a (user) performance issue
Hardware / protocol investigations
tcpdump
web100
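The end-to-end time series listed above (throughput, RTT, packet loss) boil down to a few published quantities. A minimal sketch of such a reduction, with illustrative names not taken from any of the tools mentioned:

```python
# Sketch: summarise an end-to-end RTT probe series of the kind listed
# above. None marks a lost probe. Names are illustrative, not from
# any real monitoring tool.

def summarise_rtt(samples):
    """Return sent count, loss %, and min/avg/max RTT (ms)."""
    received = [s for s in samples if s is not None]
    lost = len(samples) - len(received)
    return {
        "sent": len(samples),
        "loss_pct": 100.0 * lost / len(samples),
        "rtt_min_ms": min(received),
        "rtt_avg_ms": sum(received) / len(received),
        "rtt_max_ms": max(received),
    }

# Five probes, one lost:
stats = summarise_rtt([17.1, 17.3, None, 17.0, 18.2])
```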
Multi-Gigabit transfers are possible and stable
10 GigEthernet at SC2003 BW Challenge
Three Server systems with 10 GigEthernet NICs
Used the DataTAG altAIMD stack, 9000 byte MTU
Send mem-mem iperf TCP streams from SLAC/FNAL booth in Phoenix to:
Palo Alto PAIX
rtt 17 ms, window 30 MB
Shared with Caltech booth
4.37 Gbit HS-TCP I=5%
Then 2.87 Gbit I=16%
The fall corresponds to reaching 10 Gbit on the link
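The window sizes quoted on these slides follow from the bandwidth-delay product: to keep a path full, the TCP window must cover rate × RTT. A quick check against the figures given here (the function name is mine):

```python
def bdp_bytes(rate_gbit_s, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return rate_gbit_s * 1e9 * (rtt_ms / 1000.0) / 8.0

# Phoenix -> Palo Alto: 10 Gbit/s at 17 ms RTT needs ~21 MB in flight,
# consistent with the 30 MB window used.
paix = bdp_bytes(10, 17)     # ~21_250_000 bytes
# Phoenix -> Amsterdam: 175 ms RTT needs ~219 MB (200 MB window used).
sara = bdp_bytes(10, 175)    # ~218_750_000 bytes
```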
3.3 Gbit Scalable TCP I=8%
Tested 2 flows: sum 1.9 Gbit I=39%
[Plot: throughput (Gbit/s) and router traffic to Abilene vs date & time, 19 Nov 2003, 15:59–17:25]
10 Gbits/s throughput from SC2003 to Chicago & Amsterdam
Phoenix–Chicago (Chicago Starlight)
rtt 65 ms, window 60 MB
Phoenix CPU 2.2 GHz
3.1 Gbit HS-TCP I=1.6%
Phoenix–Amsterdam (SARA)
rtt 175 ms, window 200 MB
Phoenix CPU 2.2 GHz
[Plots: Phoenix–Amsterdam throughput (Gbit/s); 10 Gbit/s throughput from SC2003 to PAIX — traces Phoenix-PAIX HS-TCP, Phoenix-PAIX Scalable-TCP, Phoenix-PAIX Scalable-TCP #2; router traffic to LA/PAIX; 19 Nov 2003, 15:59–17:25]
4.35 Gbit HS-TCP I=6.9%
Very stable
Both used Abilene to Chicago
The performance of the end host / disks is really important
BaBar Case Study: RAID Throughput & PCI Activity
3Ware 7500-8 RAID5 parallel EIDE
3Ware forces PCI bus to 33 MHz
BaBar Tyan to MB-NG SuperMicro
Network mem-mem 619 Mbit/s
Disk–disk throughput with bbcp:
40–45 Mbyte/s (320–360 Mbit/s)
PCI bus effectively full!
User throughput ~ 250 Mbit/s
User surprised !!
[Plots: PCI activity — read from RAID5 disks; write to RAID5 disks]
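Back-of-envelope arithmetic shows why a 33 MHz bus is so easily filled. This sketch assumes a 64-bit PCI bus (the width is my assumption), and real PCI efficiency sits well below the raw figure once burst setup, arbitration and the RAID controller's own traffic are counted:

```python
def pci_raw_mbit(clock_mhz, width_bits=64):
    """Raw PCI capacity in Mbit/s: clock rate times bus width."""
    return clock_mhz * width_bits

raw = pci_raw_mbit(33)   # 2112 Mbit/s raw once 3Ware forces 33 MHz
# For disk-to-network transfers the data crosses the same bus twice
# (disk -> memory, then memory -> NIC), halving the usable rate:
one_flow = raw / 2       # 1056 Mbit/s ceiling before any overhead
```

With realistic bus efficiency far under the raw figure, 320–360 Mbit/s of disk traffic plus the network copy is enough to leave the user seeing ~250 Mbit/s, as observed.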
Application design – Throughput + Web100
2 Gbyte file transferred from RAID0 disks
Web100 output every 10 ms
Gridftp:
Throughput alternates between 600–800 Mbit/s and zero
Apache web server + curl-based client:
Steady 720 Mbit/s
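One plausible reading of the alternating gridftp trace (my sketch, not an analysis of gridftp internals): if each block is read from disk and only then sent, the network idles while the disk works. Overlapping the two stages, as the steady curl transfer evidently achieved, hides the disk time. Timings below are illustrative, not measured:

```python
# Illustrative timing model: n blocks, each taking read_s to read
# from disk and send_s to send on the network.

def transfer_time(n_blocks, read_s, send_s, overlapped):
    if overlapped:
        # Pipelined: after the first read, reads hide behind sends.
        return read_s + n_blocks * max(read_s, send_s)
    # Strictly alternating read-then-send: network idle during reads.
    return n_blocks * (read_s + send_s)

seq = transfer_time(100, 0.01, 0.01, overlapped=False)   # 2.0 s
pipe = transfer_time(100, 0.01, 0.01, overlapped=True)   # ~1.01 s
```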
Network Monitoring is vital
Development of new TCP stacks and non-TCP protocols is required
Multi-Gigabit transfers are possible and stable on current networks
Complementary provision of packet IP & λ-Networks is needed
The performance of the end host / disks is really important
Application design can determine Perceived Network Performance
Helping Real Users is a must – can be harder than herding cats
Cooperation between Network providers, Network Researchers, and Network Users has been impressive
Standards (e.g. GGF / IETF) are the way forward
Many grid projects just assume the network will work !!!
It takes lots of co-operation to put all the components together
Tuning PCI-X: Variation of mmrbc IA32
16080 byte packets every 200 µs
Intel PRO/10GbE LR Adapter
PCI-X bus occupancy vs mmrbc
Plots, for each mmrbc setting (512, 1024, 2048 and 4096 bytes):
Measured times
Times based on PCI-X times from the logic analyser
Expected throughput
[Plots: PCI-X transfer time (µs) and transfer rate (Gbit/s) vs Max Memory Read Byte Count — measured PCI-X transfer time, expected time, rate from expected time, measured rate, max PCI-X throughput]
[Logic-analyser trace: PCI-X sequence — CSR access, data transfer, interrupt & CSR update]
Kernel 2.6.1 #17, HP Itanium, Intel 10GE, Feb 04
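The shape of these curves can be reproduced with a simple model: each packet crosses PCI-X as ceil(packet/mmrbc) memory-read bursts, each paying a fixed setup cost on top of the raw data time. The setup and bus-rate figures below are illustrative assumptions, not the logic-analyser values:

```python
import math

# 64-bit @ 133 MHz raw PCI-X rate; note 1 Mbyte/s = 1 byte/us,
# so bytes / PCIX_MBYTE_S gives microseconds directly.
PCIX_MBYTE_S = 1064.0

def pcix_transfer_us(packet_bytes, mmrbc, setup_us=0.3):
    """Modelled PCI-X transfer time: per-burst setup + raw data time."""
    bursts = math.ceil(packet_bytes / mmrbc)
    data_us = packet_bytes / PCIX_MBYTE_S
    return bursts * setup_us + data_us

# Larger mmrbc -> fewer bursts for the 16080-byte packets used here:
t_512 = pcix_transfer_us(16080, 512)     # 32 bursts
t_4096 = pcix_transfer_us(16080, 4096)   # 4 bursts
```

Raising mmrbc cuts only the per-burst overhead term, which is why the measured curves flatten toward the raw bus rate at large byte counts.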
GGF: Hierarchy Characteristics Document
“A Hierarchy of Network Performance Characteristics for
Grid Applications and Services”
Document defines terms & relations:
Network characteristics
Measurement methodologies
Observation
Discusses Nodes & Paths
For each Characteristic the document:
Defines the meaning
Lists attributes that SHOULD be included
Discusses issues to consider when making an observation
[Figure: hierarchy of characteristics]
Characteristic
Bandwidth: Capacity, Utilized, Available, Achievable
Hoplist
Queue: Capacity, Length, Discipline
Delay: Round-trip, One-way, Jitter
Forwarding: Forwarding Policy, Forwarding Table, Forwarding Weight
Loss: Round-trip, One-way, Loss Pattern
Availability: MTBF, Avail. Pattern, Others
Closeness
Status:
Originally submitted to GFSG as Community Practice Document
draft-ggf-nmwg-hierarchy-00.pdf Jul 2003
Revised to Proposed Recommendation
http://www-didc.lbl.gov/NMWG/docs/draft-ggf-nmwg-hierarchy-02.pdf 7 Jan 04
Now in 60 day Public comment from 28 Jan 04 – 18 days to go.
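As a sketch of how a monitoring service might use the hierarchy described on this slide, the characteristics could be held as a nested structure and queried by name (the encoding and function are hypothetical, not from the NMWG document):

```python
# Hypothetical encoding of the characteristic hierarchy.
CHARACTERISTICS = {
    "Bandwidth": ["Capacity", "Utilized", "Available", "Achievable"],
    "Hoplist": [],
    "Queue": ["Capacity", "Length", "Discipline"],
    "Delay": ["Round-trip", "One-way", "Jitter"],
    "Forwarding": ["Forwarding Policy", "Forwarding Table",
                   "Forwarding Weight"],
    "Loss": ["Round-trip", "One-way", "Loss Pattern"],
    "Availability": ["MTBF", "Avail. Pattern", "Others"],
    "Closeness": [],
}

def is_characteristic(name):
    """True if name appears at either level of the hierarchy."""
    return name in CHARACTERISTICS or any(
        name in subs for subs in CHARACTERISTICS.values())
```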