US ATLAS Tier 1


USATLAS Network/Storage and Load Testing
Jay Packard
Dantong Yu
Brookhaven National Lab
Outline
 USATLAS Network/Storage Infrastructures: Platform for Performing Load Tests.
 Load Test Motivation and Goals.
 Load Test Status Overview.
 Critical Components in Load Testing: Control and Monitoring, Network Monitoring, and Weather Maps.
 Detailed Plots for Single vs. Multiple Host Load Tests.
 Problems.
 Proposed Solutions: Network Research and Its Role in Dynamic Layer 2 Circuits between BNL and US ATLAS Tier 2 sites.
BNL 20 Gig-E Architecture Based on Cisco 6513
 20 Gbps LAN for the LHCOPN.
 20 Gbps for production IP traffic.
 Full redundancy: can survive the failure of any network switch.
 No firewall for the LHCOPN, as shown by the green lines.
 Two firewalls for all other IP networks.
 Cisco Firewall Services Module (FWSM), a line card plugged into the Cisco chassis with 5 x 1 Gbps capacity, allows outgoing connections.
dCache and Network Integration
[Diagram: logical connections among the ESnet 2x10 Gb/s WAN, the BNL LHC OPN VLAN, load-testing hosts, FTS-controlled srmcp paths, the gridftp doors (8 nodes, 8 x 1 Gb/s), the dCache SRM and core servers, the new Panda server and Panda DB, and the Tier 1 VLANs (20 Gb/s). Storage shown includes the dCache Thumpers (30 nodes, 720 TB raw), the new farm pool (80 nodes, 360 TB raw), the T0 export/write pool (>= 30 nodes), the farm pool (434 nodes / 360 TB), and the HPSS Mass Storage System, attached at N x 1 Gb/s.]
Tier 2 Network Example: ATLAS Great Lakes Tier 2
Need More Details of Tier 2 Network/Storage Infrastructures
 Hope to see, in the site reports, an architectural map from each Tier 2 describing how the Tier 2 network is integrated with its production and testing storage systems.
Goal
 Develop a toolkit for testing and viewing I/O performance at various middleware layers (network, grid-ftp, FTS) in order to isolate problems.
 Single-host transfer optimization at each layer:
 120 MB/s is ideal for memory-to-memory transfers and high-performance storage.
 40 MB/s is ideal for disk transfers to a regular worker node.
 Multi-host transfer optimization for sites with 10 Gbps connectivity:
 Starting point: sustained 200 MB/s disk-to-disk transfer for 10 minutes between the Tier 1 and each Tier 2 (Rob Gardner).
 Then increase disk-to-disk transfers to 400 MB/s.
 For sites with a 1 Gbps bottleneck, we should max out the network capacity.
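For scale (a back-of-the-envelope conversion, not a number from the slides): 200 MB/s is roughly 1.6 Gb/s and 400 MB/s roughly 3.2 Gb/s, both well within a 10 Gbps path, while a 1 Gbps bottleneck caps a single link at about 125 MB/s before protocol overhead.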
Status Overview
 A MonALISA control application has been developed for specifying single-host transfers: protocol, duration, size, stream range, TCP buffer range, etc. Currently it is run only by Jay Packard at BNL, but it may eventually be run by multiple administrators at other sites within the MonALISA framework.
 A MonALISA monitoring plugin has been developed to display current results in graphs. They are available in the MonALISA client (http://monalisa.cacr.caltech.edu/ml_client/MonaLisa.jnlp) and will soon be available on a web page.
Status Overview...
 Have been performing single-host tests for the past few months. Types (a sketch of one such test follows this list):
 Network memory to memory (using iperf)
 Grid-ftp memory to memory (using globus-url-copy)
 Grid-ftp memory to disk (using globus-url-copy)
 Grid-ftp disk to disk (using globus-url-copy)
 At least one host at each site has been TCP tuned, which has shown dramatic improvements in the graphs at some sites (e.g. 5 MB/s to 100 MB/s for iperf tests).
 If a Tier 2 has 10 Gbps connectivity, there is significant improvement for a single TCP stream, from 50 Mbps to close to 1 Gbps (IU, UC, BU, UMich).
 If a Tier 2 has a 1 Gbps bottleneck, network performance can be improved with multiple TCP streams; simply tuning the TCP buffer size cannot improve single-stream performance because of bandwidth competition.
 Discovered problems: dirty fiber, CRC errors on a network interface, and moderate TCP buffer sizes; details can be found in Shawn's talk.
 Coordinating between Michigan and BNL (Hiro Ito, Shawn McKee, Robert Gardner, Jay Packard) to measure and optimize total throughput using FTS disk-to-disk transfers. We are trying to leverage high-performance storage (Thumpers at BNL and the Dell NAS at Michigan) to achieve our goal.
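The sketch below illustrates the kind of single-host network memory-to-memory test listed above. It is a simplified, hypothetical example rather than the production control application: it assumes iperf is installed locally, that an iperf server is already listening on the destination, and the destination host is just an example taken from the sample configuration shown later.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch: run one iperf client test for a given stream count and TCP buffer size
// and report the highest bandwidth figure iperf prints. Assumes "iperf -s" is
// already running on the destination host.
public class IperfClientSketch {
    public static void main(String[] args) throws Exception {
        String destHost = "atlas.bu.edu";          // example host from the sample config
        int streams = 4;                           // -P: parallel TCP streams
        String tcpBuffer = "8M";                   // -w: TCP window/buffer size
        int durationSec = 120;                     // -t: test duration in seconds

        ProcessBuilder pb = new ProcessBuilder(
                "iperf", "-c", destHost,
                "-P", Integer.toString(streams),
                "-w", tcpBuffer,
                "-t", Integer.toString(durationSec),
                "-f", "M");                        // report bandwidth in MBytes/sec
        pb.redirectErrorStream(true);
        Process p = pb.start();

        double maxMBps = 0.0;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                // Bandwidth lines end with "<value> MBytes/sec" when -f M is used.
                if (line.contains("MBytes/sec")) {
                    String[] tok = line.trim().split("\\s+");
                    maxMBps = Math.max(maxMBps, Double.parseDouble(tok[tok.length - 2]));
                }
            }
        }
        p.waitFor();
        System.out.println(destHost + ": " + streams + " streams, " + tcpBuffer
                + " buffer, max " + maxMBps + " MB/s");
    }
}

In the real tests the same measurement is repeated over the stream counts and TCP buffer sizes listed in the control application's configuration.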
MonALISA Control Application
 Our Java class implements MonALISA's AppInt interface as a plug-in.
 Currently about 900 lines of code.
 Does the following:
 Generates and prepares source files for disk-to-disk transfers
 Starts up a remote iperf server and a local iperf client, using globus-job-run remotely and ssh locally
 Runs iperf or grid-ftp for a period of time and collects the output
 Parses the output for average and maximum throughput
 Generates output understood by the monitoring plugin
 Cleans up destination files
 Stops the iperf servers
 Flexible enough to account for heterogeneous sites (e.g., killing iperf is done differently on a managed-fork gatekeeper; one site runs BWCTL instead of iperf). This flexibility requires frequently watching the application's output and augmenting the code to handle many circumstances (a sketch of the remote start/stop step follows).
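As a rough illustration of the start-up and clean-up steps listed above (not the actual plug-in code; the gatekeeper contact string and binary paths are hypothetical), the remote iperf server can be launched through globus-job-run without waiting for it to return and later stopped by running pkill through the same gatekeeper. As noted above, managed-fork gatekeepers and BWCTL sites need different handling.

import java.util.concurrent.TimeUnit;

// Sketch: start and later stop a remote iperf server through a Globus gatekeeper.
// The contact string and binary paths are illustrative only.
public class RemoteIperfSketch {
    public static void main(String[] args) throws Exception {
        String gatekeeper = "atlas-g01.bu.edu/jobmanager-fork"; // hypothetical contact string

        // Launch the remote server; do not wait, since "iperf -s" runs until killed.
        Process server = new ProcessBuilder(
                "globus-job-run", gatekeeper, "/usr/bin/iperf", "-s")
                .inheritIO().start();

        // ... run the local iperf client here for the configured duration ...
        TimeUnit.SECONDS.sleep(120);

        // Stop the remote server; some sites require a different kill procedure.
        new ProcessBuilder("globus-job-run", gatekeeper, "/usr/bin/pkill", "iperf")
                .inheritIO().start().waitFor();
        server.destroy();
    }
}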
MonALISA Control Application...
• Generates the average and maximum throughput over a 2-minute interval, which is needed for the throughput to “ramp up”.
• Sample configuration for grid-ftp memory-to-disk:
command=gridftp_m2d
startHours=4,16
envScript=/opt/OSG_060/setup.sh
fileSizeKB=5000000
streams=1, 2, 4, 8, 12
repetitions=1
repetitionDelaySec=1
numSrcHosts=1
timeOutSec=120
tcpBufferBytes=4000000, 8000000
hosts=dct00.usatlas.bnl.gov, atlas-g01.bu.edu/data5/dq2-cache/test/, atlas.bu.edu/data5/dq2-cache/test/,
umfs02.grid.umich.edu/atlas/data08/dq2/test/, umfs05.aglt2.org/atlas/data16/dq2/test/, dq2.aglt2.org/atlas/data15/mucal/test/, iut2dc1.iu.edu/pnfs/iu.edu/data/test/, uct2-dc1.uchicago.edu/pnfs/uchicago.edu/data/ddm1/test/,
gk01.swt2.uta.edu/ifs1/dq2_test/storageA/test/, tier2-02.ochep.ou.edu/ibrix/data/dq2-cache/test/, ouhep00.nhn.ou.edu/raid2/dq2cache/test/, osgserv04.slac.stanford.edu/xrootd/atlas/dq2/tmp/
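For readers unfamiliar with the format above, the short sketch below (an illustration under the assumption that the configuration is read as Java properties with comma-separated list values; the real parser may differ) shows how the streams and tcpBufferBytes lists expand into the matrix of test combinations that the control application iterates over.

import java.io.FileReader;
import java.util.Properties;

// Sketch: expand the streams x tcpBufferBytes lists of a load-test configuration
// into the individual test combinations. Field names match the sample above.
public class ConfigSweepSketch {
    public static void main(String[] args) throws Exception {
        Properties cfg = new Properties();
        cfg.load(new FileReader("loadtest.conf")); // hypothetical file name

        String command = cfg.getProperty("command", "gridftp_m2d");
        String[] streams = cfg.getProperty("streams", "1").split("\\s*,\\s*");
        String[] buffers = cfg.getProperty("tcpBufferBytes", "4000000").split("\\s*,\\s*");

        for (String s : streams) {
            for (String b : buffers) {
                System.out.println(command + ": " + s + " stream(s), " + b + " byte TCP buffer");
            }
        }
    }
}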
MonALISA Monitoring Application
 A Java class that implements MonALISA's MonitoringModule interface.
 Much simpler than the control application (only about 180 lines of code).
 Parses the log file produced by the control application, whose records have the format (time, site name, module, host pair, statistic, value):
 1195623346000, BNL_ITB_Test1, Loadtest, bnl->uta(dct00->ndt), network_m2m_avg_01s_08m, 6.42 (01s = 1 stream, 08m = TCP buffer size of 8 MB)
 Data is pulled by the MonALISA server, which displays graphs on demand.
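As a concrete illustration of the record format above (a sketch only; the actual MonitoringModule implementation may differ), the example record from this slide can be split into its fields like this:

// Sketch: parse one load-test log record of the form
//   time, site name, module, host pair, statistic, value
public class LogRecordSketch {
    public static void main(String[] args) {
        String record = "1195623346000, BNL_ITB_Test1, Loadtest, "
                + "bnl->uta(dct00->ndt), network_m2m_avg_01s_08m, 6.42";

        String[] f = record.split("\\s*,\\s*");
        long timeMillis = Long.parseLong(f[0]);      // Unix time in milliseconds
        String site = f[1];                          // MonALISA farm/site name
        String module = f[2];                        // "Loadtest"
        String hostPair = f[3];                      // source -> destination hosts
        String statistic = f[4];                     // test type, avg/max, streams, TCP buffer
        double value = Double.parseDouble(f[5]);     // measured throughput (MB/s in these plots)

        System.out.println(site + " " + hostPair + " " + statistic + " = " + value
                + " at t=" + timeMillis + " (" + module + ")");
    }
}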
Single-host Tests
• Too many graphs to show all, but two key graphs will be shown. For one stream:
Single-host Tests...
• For 12 streams (notice disk-to-disk improvement):
Multi-host Tests
 Using FTS to perform tests from BNL to Michigan initially, and then to other Tier 2 sites.
 Goal is a sustained 200 MB/s disk-to-disk transfer for 10 minutes from the Tier 1 to each Tier 2. This can be in addition to existing traffic.
 Trying to find the optimum number of streams and TCP buffer size by finding the optimum for a single-host transfer between two high-performance machines.
 Low disk-to-disk, one-stream performance of 2 MB/s from BNL's Thumper to Michigan's Dell NAS, whereas iperf memory-to-memory with one stream gives 66 MB/s between the same hosts (Nov 21, 07). Should this be higher for one stream?
 Found that the more streams, the higher the throughput, but one cannot use too many, especially with a high TCP buffer size, or the applications will crash (see the estimate below).
 Disk-to-disk throughput is currently so low that a larger TCP buffer doesn't matter.
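As a rough aid to the point about crashes above (an estimate, not a measured number): kernel socket buffers are allocated per stream, so a transfer using 12 streams with an 8 MB TCP buffer can tie up on the order of 100 MB of buffer memory by itself, which is one reason that combining many streams with large buffers can exhaust memory on the transfer hosts.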
Multi-host Tests and Monitoring
 Monitoring uses netflow graphs, rather than MonALISA, available at http://netmon.usatlas.bnl.gov/netflow/tier2.html.
 Some sites will likely require the addition of more storage pools and doors, each TCP tuned, to achieve the goal.
Problems
 Getting reliable testing results amidst existing traffic
 Each test runs for a couple of minutes and produces several samples, so hopefully a window exists when the traffic is low, during which the maximum is attained.
 The applications could be changed to output the maximum of the last few tests (tricky to implement).
 Use dedicated network circuits: TeraPaths.
 Disk-to-disk bottleneck
 Not sure if the problem is the hardware or the storage software (e.g. dCache, Xrootd). FUSE (Filesystem in Userspace), which can provide a filesystem in memory, could help isolate storage-software degradation; Bonnie could help isolate hardware degradation (a rough disk check is sketched after this list).
 Is there anyone who could offer disk performance expertise?
 Discussed in Shawn McKee's presentation, 'Optimizing USATLAS Data Transfers.'
 Progress is happening slowly due to a lack of in-depth coordination, scheduling difficulties, and a lack of manpower (Jay is at roughly 1/3 FTE). There is too much on the agenda at the Computing Integration and Operations meeting to allow for in-depth coordination.
 Ideas for improvement
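To help separate a hardware limit from storage-software overhead, as suggested in the disk-to-disk bullet above, even a timed sequential write and read outside of dCache or xrootd gives a first-order disk throughput number. The sketch below is a rough, hypothetical check, not a replacement for Bonnie: the file path is only an example, and the file must be much larger than RAM for the read figure to mean anything because of page caching.

import java.io.RandomAccessFile;
import java.util.Arrays;

// Rough sketch: time a large sequential write and read to estimate raw disk
// throughput independently of the storage middleware. Use a file size well
// above the host's RAM to limit page-cache effects.
public class DiskThroughputSketch {
    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "/data/loadtest.tmp"; // example path
        long sizeMB = 4096;                      // 4 GB test file
        byte[] block = new byte[1 << 20];        // 1 MB blocks
        Arrays.fill(block, (byte) 1);

        long t0 = System.nanoTime();
        try (RandomAccessFile f = new RandomAccessFile(path, "rw")) {
            for (long i = 0; i < sizeMB; i++) f.write(block);
            f.getFD().sync();                    // force data to disk before stopping the clock
        }
        double writeMBps = sizeMB / ((System.nanoTime() - t0) / 1e9);

        long t1 = System.nanoTime();
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            for (long i = 0; i < sizeMB; i++) f.readFully(block);
        }
        double readMBps = sizeMB / ((System.nanoTime() - t1) / 1e9);

        System.out.printf("write: %.1f MB/s, read: %.1f MB/s%n", writeMBps, readMBps);
    }
}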
TeraPaths and Its Role in Improving Network Connectivity between BNL and US ATLAS Tier 2 Sites
 The problem: support efficient, reliable, predictable peta-scale data movement over modern high-speed networks
 Multiple data flows with varying priority
 Default “best effort” network behavior can cause performance and service disruption problems
 Solution: enhance network functionality with QoS features to allow prioritization and protection of data flows
 Treat the network as a valuable resource
 Schedule network usage (how much bandwidth and when)
 Techniques: DiffServ (DSCP), PBR, MPLS tunnels, dynamic circuits (VLANs)
 Collaboration with ESnet (OSCARS) and Internet2 (DRAGON) to dynamically create end-to-end paths and dynamically forward traffic into these paths. Software is being deployed to US ATLAS Tier 2 sites.
 Option 1: Layer 3: MPLS tunnels (UMich and SLAC)
 Option 2: Layer 2: VLANs (BU, UMich; demonstrated at SC’07)
Northeast Tier 2 Dynamic Network Links
Questions?