
Wide Area Network Access to CMS Data Using the Lustre Filesystem
J. L. Rodriguez†, P. Avery*, T. Brody†, D. Bourilkov*, Y. Fu*, B. Kim*, C. Prescott*, Y. Wu*
†Florida International University (FIU), *University of Florida (UF)
Introduction
We explore the use of the Lustre cluster filesystem over the WAN to access CMS (Compact Muon Solenoid) data stored on a storage system located hundreds of miles away. The Florida State Wide Lustre Testbed consists of two client sites located at CMS Tier3s, one in Miami, FL and one in Melbourne, FL, and a Lustre storage system located in Gainesville at the University of Florida's HPC Center. In this paper we report on I/O rates between sites, using both the CMS application suite CMSSW and the I/O benchmark tool IOzone. We describe our configuration, outline the procedures implemented, and conclude with suggestions on the feasibility of implementing a distributed Lustre storage system to facilitate CMS data access for users at remote Tier3 sites.
[Figure: the CMS distributed computing model, from the online system and Tier 0 at the CERN computer center (200-1500 MB/s) through national Tier 1 centers (FermiLab, Korea, Russia, UK, ...) and Tier 2 centers (e.g. the UF Tier2 and UF HPC on the OSG, Caltech, UCSD) to Tier 3 sites (FIU, Flatech, FSU) and desktop or laptop PCs. Links shown include the Florida Lambda Rail at 10 Gbps, 10-40 Gb/s wide-area links, and 1 Gb/s connections to the Tier3s. CMS comprises about 3000 physicists in 60 countries and expects tens of petabytes per year by 2010, with CERN/Outside = 10-20%.]
Computing facilities in the distributed computing model of the CMS experiment at CERN. In the US, Tier2 sites are medium-sized facilities with approximately 106 kSI2K of computing power and 200 TB of disk storage; they are centrally managed, each with dedicated computing resources and manpower. Tier3 sites, on the other hand, range in size from a single interactive analysis computer or a small cluster to large facilities that rival the Tier2s in resources. Tier3s are usually found at universities in close proximity to CMS researchers.
Lustre is a POSIX-compliant, network-aware, highly scalable, robust and reliable cluster filesystem developed by Sun Microsystems Inc. It can run over several types of networking infrastructure, including Ethernet, InfiniBand, Myrinet and others, and it can be configured with redundant components to eliminate single points of failure. It has been tested with tens of thousands of nodes, providing petabytes of storage, and can move data at hundreds of GB/s. The system employs state-of-the-art security features, with GSS- and Kerberos-based security planned for future releases, and is available as open source under the GNU General Public License. Lustre is deployed at a broad array of computing facilities, both large and small; commercial and public organizations, including some of the largest supercomputing centers in the world, currently use Lustre as their distributed filesystem.
The Florida State Wide Lustre Testbed
Lustre version 1.6.7
Network: Florida Lambda Rail (FLR)
– FIU: servers connected to the FLR via a dedicated Campus Research Network (CRN) at 1 Gbps; local hardware issues limit FIU's usable bandwidth to ~600 Mbps
– UF: servers connected to the FLR via their own dedicated CRN at 2x10 Gbps
– Flatech: servers connected to the FLR at 1 Gbps
– Server TCP buffers set to a maximum of 16 MB (one way to realize this is sketched below)
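The poster states only that the server TCP buffer maximum was raised to 16 MB. Below is a minimal sketch of one common way to do this on Linux; the sysctl paths are the standard kernel tunables, but the exact values and method used on the testbed are assumptions.

    # Sketch: raise the Linux TCP buffer ceiling to 16 MB for long, fat WAN paths.
    # The sysctl paths are standard kernel tunables; the specific values below are
    # illustrative and not taken from the testbed configuration.
    TCP_SYSCTLS = {
        "net/core/rmem_max": "16777216",             # max receive buffer: 16 MB
        "net/core/wmem_max": "16777216",             # max send buffer: 16 MB
        "net/ipv4/tcp_rmem": "4096 87380 16777216",  # min / default / max receive
        "net/ipv4/tcp_wmem": "4096 65536 16777216",  # min / default / max send
    }

    def apply_sysctls(settings):
        """Write each tunable under /proc/sys (requires root)."""
        for key, value in settings.items():
            with open("/proc/sys/" + key, "w") as f:
                f.write(value)

    if __name__ == "__main__":
        apply_sysctls(TCP_SYSCTLS)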
Lustre Fileserver at UF-HPC/Tier2 Center: Gainesville, FL
– Storage subsystem: six RAID Inc. Falcon III shelves, each with redundant dual-port 4 Gbit FC RAID controllers and 24x750 GB hard drives, giving 104 TB of raw storage
– Attached to: two dual quad-core Barcelona Opteron 2350 servers with 16 GB RAM, three FC cards, and a 10 GigE Chelsio NIC
– The storage system has been clocked at greater than 1 GB/s via TCP/IP large-block I/O
FIU Lustre Clients: Miami, FL
– CMS analysis server: medianoche.hep.fiu.edu, dual quad-core Intel X5355 with 16 GB RAM and dual 1 GigE
– FIU fileserver: fs1.local, dual dual-core Intel Xeon with 16 GB RAM, a 3ware 9000-series RAID controller, NFS v3, and RAID 5 (7+1) with 16 TB of raw disk
– OSG gatekeeper: dgt.hep.fiu.edu, dual dual-core Xeon with 2 GB RAM and a single GigE; used in the Lustre tests, where we also experimented with NAT (it works, but was not tested further)
– System configuration: Lustre-patched kernel-2.6.9-55.EL_lustre1.6.4.2; both systems mount UF-HPC's Lustre filesystem on a local mount point (a minimal mount sketch follows the client listings)
Flatech Lustre Client: Melbourne, FL
– CMS server: flatech-grid3.fit.edu, dual quad-core Intel E5410 with 8 GB RAM and GigE
– System configuration: unpatched SL4 kernel, with Lustre enabled via runtime kernel modules
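For reference, here is a minimal sketch of how a client mounts the remote filesystem over TCP. The target uf-hpc@tcp1:/crn and the mount point /fiulocal/crn are taken from the testbed plots and listings; treat the exact names as illustrative, and note that the Lustre client modules and root privileges are required.

    # Sketch: mount the UF-HPC Lustre filesystem over the WAN on a local mount point
    # and verify the mount. Target and mount point are as quoted on the poster;
    # treat them as illustrative.
    import os
    import subprocess

    LUSTRE_TARGET = "uf-hpc@tcp1:/crn"   # MGS NID and filesystem name
    MOUNT_POINT = "/fiulocal/crn"        # local mount point used at FIU

    def is_mounted(mount_point):
        """Return True if a Lustre filesystem is mounted at mount_point."""
        with open("/proc/mounts") as f:
            return any(parts[1] == mount_point and parts[2] == "lustre"
                       for parts in (line.split() for line in f))

    def mount_lustre(target, mount_point):
        """Mount a Lustre filesystem over TCP (requires the client modules and root)."""
        os.makedirs(mount_point, exist_ok=True)
        subprocess.run(["mount", "-t", "lustre", target, mount_point], check=True)

    if __name__ == "__main__":
        if not is_mounted(MOUNT_POINT):
            mount_lustre(LUSTRE_TARGET, MOUNT_POINT)
        print("lustre mounted at", MOUNT_POINT, ":", is_mounted(MOUNT_POINT))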
IO Performance with CMSSW: FIU to UF
Using the CMSSW application we tested the IO performance of the testbed between the FIU Tier3 and the UF-HPC Lustre storage. An IO-bound CMSSW application was used for the tests; its main function was to skim objects from data collected during the Cosmic Run at Four Tesla (CRAFT) in the fall of 2008, and it is the same application used by the Florida cosmic analysis group. The output data file was redirected to /dev/null.
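The skim itself is CMS-specific, but from CMSSW's point of view the WAN-mounted Lustre filesystem is simply a POSIX path. Below is a hypothetical configuration sketch; the input file path, process name and keep/drop list are illustrative, not the actual CRAFT skim used for these measurements.

    # Hypothetical CMSSW configuration sketch: read an input file directly from the
    # Lustre mount via a plain file: URL and write the skimmed output to /dev/null,
    # as in the tests described above. Paths and keep/drop statements are illustrative.
    import FWCore.ParameterSet.Config as cms

    process = cms.Process("SKIM")

    # Input: a CRAFT data file sitting on the WAN-mounted Lustre filesystem (assumed path)
    process.source = cms.Source("PoolSource",
        fileNames = cms.untracked.vstring("file:/fiulocal/crn/store/data/craft_example.root")
    )
    process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(-1))

    # Output: keep only a subset of collections and discard the file itself
    process.out = cms.OutputModule("PoolOutputModule",
        fileName = cms.untracked.string("/dev/null"),
        outputCommands = cms.untracked.vstring("drop *", "keep *_cosmicMuons_*_*")
    )
    process.end = cms.EndPath(process.out)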
We report the aggregate and average read IO rates (one way to compute both from per-job measurements is sketched after this list):
– Aggregate IO rate: the total IO rate per node vs. the number of jobs running concurrently on a single node
– Average IO rate: the per-process rate per job vs. the number of jobs running concurrently on a single node
We compare IO rates between Lustre, NFS and local disk:
– NFS: the FIU fileserver's 16 TB 3ware 9000 array over NFS v3
– Local: a single 750 GB SATA II hard drive
Observations:
– For the NFS and Lustre filesystems the IO rates scale linearly with the number of jobs, which is not the case for the local disk
– The average IO rates remain relatively constant as a function of jobs per node for the distributed filesystems
– The Lustre IO rates are significantly lower than those seen with IOzone, and lower than those obtained with NFS
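The poster does not spell out the exact bookkeeping behind the two curves; the sketch below shows one straightforward way to form the aggregate and the average per-process rate from per-job byte counts and wall times. All names and the example numbers are illustrative.

    # Sketch: compute the two reported quantities from per-job measurements taken
    # while N jobs run concurrently on one node. This is one reasonable convention
    # (aggregate = total bytes over the makespan), not necessarily the one used here.
    from dataclasses import dataclass

    MB = 1024 * 1024

    @dataclass
    class JobResult:
        bytes_read: int      # bytes read by this job
        wall_seconds: float  # wall-clock duration of this job

    def aggregate_rate_mbps(jobs):
        """Aggregate IO rate per node: total bytes read by all concurrent jobs
        divided by the wall time of the slowest job, in MB/s."""
        total_bytes = sum(j.bytes_read for j in jobs)
        makespan = max(j.wall_seconds for j in jobs)
        return total_bytes / makespan / MB

    def average_rate_mbps(jobs):
        """Average IO rate per process: mean of each job's own read rate, in MB/s."""
        return sum(j.bytes_read / j.wall_seconds for j in jobs) / len(jobs) / MB

    # Example: 4 concurrent jobs on one node, each reading ~2 GB (illustrative numbers)
    jobs = [JobResult(2 * 1024**3, t) for t in (310.0, 325.0, 330.0, 340.0)]
    print("aggregate: %.1f MB/s, average per process: %.1f MB/s"
          % (aggregate_rate_mbps(jobs), average_rate_mbps(jobs)))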
[Plot: CMSSW aggregate IO rates vs. number of jobs per node; total rate [MBps] vs. 1, 2, 4 and 8 jobs/node for Lustre over WAN, NFS over LAN, and local disk]
[Plot: CMSSW average IO rate per process vs. number of jobs per node; <I/O> rate per process [MBps] vs. 1, 2, 4 and 8 jobs/node for Lustre over WAN, NFS over LAN, and local disk]
We are now investigating the cause of the discrepancy between the Lustre CMSSW IO rates and the rates observed with IOzone
Site Configuration and Security
– Lustre clients are easy to deploy; mounts are easy to establish and have proven reliable and robust
– Security is established by restricting client IPs and by sharing UID/GID domains between all sites:
– All sites share a common UID/GID domain
– Mount access is restricted to specific IPs via a firewall
– The ACL and root_squash security features are not currently implemented in the testbed
IO Performance with the IOzone benchmark tool: FIU to UF
The IOzone benchmark tool was used to establish the maximum possible I/O performance of Lustre over the WAN, both between FIU and UF and between Flatech and UF. Here we report only on the results between FIU and UF.
– The Lustre filesystem at UF-HPC (uf-hpc@tcp1:/crn) was mounted over the WAN on a local mount point (/fiulocal/crn) on medianoche.hep.fiu.edu in Miami
– File sizes were set to 2x RAM to avoid caching effects
– Measurements were made as a function of record length
– Multi-process mode was checked with 1 through 8 concurrent processes
– Read/write rates were also cross-checked with dd
– All tests were consistent with the IOzone results shown (a minimal record-length sweep in the same spirit is sketched below)
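IOzone itself produced the measurements reported here; the sketch below is only a minimal Python stand-in that shows the shape of such a record-length sweep (sequential write then sequential read of one large file). The mount point and sizes are assumptions; in a real test the file should be about twice the RAM size so that the page cache does not inflate the read rates.

    # Minimal stand-in for an IOzone-style record-length sweep on the WAN Lustre mount.
    # Not a replacement for IOzone; the file size is kept small here for illustration.
    import os
    import time

    MOUNT = "/fiulocal/crn"                       # assumed Lustre mount point
    TEST_FILE = os.path.join(MOUNT, "iotest.dat")
    FILE_SIZE = 1 * 1024**3                       # 1 GiB for illustration; use ~2x RAM in practice
    RECORD_LENGTHS_KB = [64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384]

    def sequential_write(path, record_kb, total_bytes):
        """Write at least total_bytes in record_kb-sized records; return MB/s."""
        block = b"\0" * (record_kb * 1024)
        start = time.monotonic()
        with open(path, "wb", buffering=0) as f:
            written = 0
            while written < total_bytes:
                f.write(block)
                written += len(block)
            os.fsync(f.fileno())
        return written / (time.monotonic() - start) / 1024**2

    def sequential_read(path, record_kb):
        """Read the whole file back in record_kb-sized records; return MB/s."""
        record = record_kb * 1024
        total = 0
        start = time.monotonic()
        with open(path, "rb", buffering=0) as f:
            while True:
                chunk = f.read(record)
                if not chunk:
                    break
                total += len(chunk)
        return total / (time.monotonic() - start) / 1024**2

    if __name__ == "__main__":
        for rec_kb in RECORD_LENGTHS_KB:
            w = sequential_write(TEST_FILE, rec_kb, FILE_SIZE)
            r = sequential_read(TEST_FILE, rec_kb)
            print("record %5d KB: write %6.1f MB/s, read %6.1f MB/s" % (rec_kb, w, r))
        os.remove(TEST_FILE)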



[Plot: Lustre I/O performance, FIU to UF, with the filesystem mounted over the WAN (uf-hpc@tcp1:/crn on /fiulocal/crn): sequential and random read/write I/O performance [MBps] vs. record length from 64 KB to 16384 KB]
IO performance of the testbed between FIU and UF. The plot shows sequential and random read/write performance, in MB per second, measured with IOzone as a function of record length.
With large-block IO we can saturate the network link between UF and FIU using the standard IO benchmark tool IOzone
Summary and Conclusion
Summary:
– Lustre is very easy to deploy, particularly as a client installation
– Direct I/O operations show that the Lustre filesystem mounted over the WAN works reliably and with a high degree of performance; we have demonstrated that we can easily saturate a 1 Gbps link with I/O-bound applications
– CMSSW remote data access was observed to be slower than expected, both when compared with IO rates from IO benchmarks and when compared with other distributed filesystems
– We have demonstrated that the CMSSW application can access data located hundreds of miles away through the Lustre filesystem. Data can be accessed this way seamlessly, reliably, and with a reasonable degree of performance, even with all components "out of the box"
Conclusion:
The Florida State Wide Lustre Testbed demonstrates an alternative method for accessing data stored at dedicated CMS computing facilities. This method has the potential to greatly simplify access to data sets, whether large, medium or small, for remote experimenters with limited local computing resources.