Scientific Data Storage


BaBarGrid
Roger Barlow
Manchester University
1: Simulation
2: Data Distribution: The SRB
3: Distributed Analysis
GridPP10 Meeting
CERN, June 3rd 2004
1: Grid-based simulation
(Fergus Wilson + Co.)
• Using existing UK farms (80 CPUs)
• Dedicated process at RAL merging the output and sending it to SLAC
• Use VDT Globus rather than LCG (see the job-submission sketch below)
– Why? Installation difficulty and reliability/stability problems with LCG.
– VDT Globus is a subset of LCG: running on an LCG system is perfectly possible (in principle).
– US groups talk of using GRID3. VDT Globus is also a subset of GRID3, but GRID3 and LCG differ. Is it a mistake to rely on LCG-specific features?
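For concreteness, here is a minimal sketch of what submitting one simulation job through the VDT Globus (GT2) command-line tools can look like, wrapped in Python. The gatekeeper contact string, wrapper script and run number are hypothetical placeholders, not the actual BaBar production setup, which drives many such jobs and merges the output at RAL.

    # Minimal sketch (not the BaBar production scripts): submit one simulation
    # job through the VDT Globus / GT2 command-line tools from Python.
    # The gatekeeper contact string and executable path are hypothetical.
    import subprocess

    GATEKEEPER = "gridgate.example.ac.uk/jobmanager-pbs"  # hypothetical UK farm gatekeeper
    EXECUTABLE = "/grid/babar/bin/run_simprod.sh"         # hypothetical simulation wrapper

    def submit_sim_job(run_number: int) -> str:
        """Submit one batch of simulated events; return the Globus job contact URL."""
        result = subprocess.run(
            ["globus-job-submit", GATEKEEPER, EXECUTABLE, str(run_number)],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    def job_status(contact: str) -> str:
        """Poll the job state (e.g. PENDING/ACTIVE/DONE) with globus-job-status."""
        result = subprocess.run(["globus-job-status", contact],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()

    if __name__ == "__main__":
        contact = submit_sim_job(12345)
        print(contact, job_status(contact))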
Current situation
5 million events in official production since 7th March. Best week (so far!): 1.6 million events.
Now producing at RHUL & Bristol. Manchester & Liverpool in ~2 weeks. Then QMUL & Brunel. Four farms will produce 3-4 million events a week.
Sites are cooperative (they need to install the BaBar Conditions Database, which uses Objectivity).
The major problem has been firewalls: they interact in complicated ways with all the communication and ports, and identifying the source of a failure has been hard (a connectivity-check sketch follows below).
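The kind of check that helps localize such failures is a simple reachability test from a worker node to the ports a Globus job needs. A minimal sketch, assuming Python on the node; 2119 is the standard GRAM gatekeeper port, but the host name and callback port range below are hypothetical examples, not the actual site configuration.

    # Minimal sketch of a reachability check from a worker node to the ports a
    # Globus job needs. Port 2119 is the standard GRAM gatekeeper port; the host
    # name and callback port range are hypothetical, not the real site setup.
    import socket

    HOST = "gridgate.example.ac.uk"              # hypothetical gatekeeper host
    PORTS = [2119] + list(range(20000, 20010))   # gatekeeper + sample callback range

    for port in PORTS:
        try:
            with socket.create_connection((HOST, port), timeout=3):
                print(f"{HOST}:{port} reachable")
        except OSError as err:
            print(f"{HOST}:{port} blocked or closed ({err})")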
What the others are doing
• The Italians and Germans are going the full-blown LCG route
• Objectivity database accessed through networked ams servers (need roughly 1 server per ~30 processes)
• Otherwise they assume the BaBar environment is available at the remote hosts
Our approaches will converge one day.
• Meanwhile, they will try sending jobs to RAL, and we will try sending jobs to Ferrara.
Future
Keep production running.
Test an LCG interface (RAL? Ferrara? Manchester Tier 2?) when we have the manpower. This will give more functionality and stability in the long term.
Smooth and streamline the process.
2: Data Distribution and The SRB
SLAC/BaBar
Richard P. Mount
SLAC
May 20, 2004
SLAC-BaBar Computing Fabric
(Architecture diagram, summarized:)
• Clients: 1500 dual-CPU Linux and 900 single-CPU Sun/Solaris machines, connected to the servers over an IP network (Cisco).
• Disk servers: 120 dual/quad-CPU Sun/Solaris machines with 400 TB of Sun FibreChannel RAID arrays, serving the Objectivity/DB object database plus HEP-specific ROOT software (Xrootd).
• Tape servers: 25 dual-CPU Sun/Solaris machines with 40 STK 9940B and 6 STK 9840A drives in 6 STK Powderhorn silos, running HPSS plus SLAC enhancements to the Objectivity and ROOT server code; over 1 PB of data.
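As an illustration of the client side of this fabric, a minimal PyROOT sketch of opening a file served through xrootd; the redirector host, file path and tree name are hypothetical placeholders, not actual SLAC locations.

    # Minimal PyROOT sketch of a client reading a file served through xrootd.
    # The redirector host, file path and tree name are hypothetical placeholders.
    import ROOT

    f = ROOT.TFile.Open("root://xrootd.example.org//store/babar/run12345.root")
    if f and not f.IsZombie():
        tree = f.Get("events")   # hypothetical tree name
        print("entries:", tree.GetEntries() if tree else "no such tree")
        f.Close()
    else:
        print("could not open file")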
BaBar Tier-A Centers
A component of the Fall 2000 BaBar Computing Model
• Place resources at the disposal of BaBar;
• Each provides tens of percent of the total BaBar computing/analysis need;
– 50% of the BaBar computing investment was in Europe in 2002 and 2003
• CCIN2P3, Lyon, France: in operation for 3+ years;
• RAL, UK: in operation for 2+ years;
• INFN-Padova, Italy: in operation for 2 years;
• GridKA, Karlsruhe, Germany: in operation for 1 year.
SLAC-PPDG Grid Team
• Richard Mount (10%): PI
• Bob Cowles (10%): Strategy and Security
• Adil Hasan (50%): BaBar Data Mgmt
• Andy Hanushevsky (20%): Xrootd, Security …
• Matteo Melani (80%): new hire
• Wilko Kroeger (100%): SRB data distribution
• Booker Bense (80%): Grid software installation
• Post Doc (50%): BaBar - OSG
Network/Grid Traffic
SLAC-BaBar-OSG
• BaBar-US has been:
– very successful in deploying Grid data distribution (SRB, US-Europe; see the sketch below)
– far behind BaBar-Europe in deploying Grid job execution (in Europe it is already in production for simulation)
• SLAC-BaBar-OSG plan:
– Focus on achieving massive simulation production in the US within 12 months
– Make 1000 SLAC processors part of OSG
– Run BaBar simulation on SLAC and non-SLAC OSG resources
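As an illustration of SRB-based distribution, a minimal sketch driving the SRB client S-commands from Python. It assumes the S-commands are installed and an SRB session has already been initialized with Sinit; the collection and file names are hypothetical, not real BaBar collections.

    # Minimal sketch using the SRB client S-commands from Python. Assumes the
    # S-commands are installed and Sinit has been run; the collection and file
    # names are hypothetical, not real BaBar collections.
    import subprocess

    def list_collection(collection: str) -> str:
        """List an SRB collection with Sls."""
        out = subprocess.run(["Sls", collection],
                             capture_output=True, text=True, check=True)
        return out.stdout

    def fetch(srb_object: str, local_path: str) -> None:
        """Copy one object out of SRB with Sget."""
        subprocess.run(["Sget", srb_object, local_path], check=True)

    if __name__ == "__main__":
        print(list_collection("/home/babar.sd/skims"))       # hypothetical collection
        fetch("/home/babar.sd/skims/somefile.root", ".")      # hypothetical object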
3: Distributed Analysis
At GridPP9:
Good news: a basic Grid job submission system was deployed and working (Alibaba / Gsub) with the GANGA portal (a submission sketch follows below).
Bad news: low take-up, because
• users were uninterested
• reliability was poor
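For illustration, a minimal sketch in the style of GANGA's Python job interface, meant to be typed at the ganga prompt where the GPI objects (Job, Executable, Local) are predefined. This is not the Alibaba/Gsub system itself, the executable path is a hypothetical placeholder, and the GPI available in the 2004-era portal may differ.

    # Sketch only, in the style of GANGA's Python interface, typed at the ganga
    # prompt where Job, Executable and Local are predefined GPI objects.
    # Not the Alibaba/Gsub tool; the executable path is a hypothetical placeholder.
    j = Job()
    j.name = "babar-analysis-test"
    j.application = Executable(exe="/u/user/run_analysis.sh")  # hypothetical wrapper
    j.backend = Local()   # a Grid backend would replace Local() for real Grid running
    j.submit()
    print(j.id, j.status)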
Since then…
Alessandra
• Move to Tier 2 system manager post
Mike
• Give talk at IoP parallel session
• Write Abstract (accepted) for All Hands meeting
• Write Thesis
Roger
• Submit Proforma 3
• Revise Proforma 3
• Negotiate on revised Proforma 3
• Complete quarterly progress report
• Advertise and recruit replacement post
• Write contribution for J Phys G Grid article
James
• Starts June 14th
• Attended GridPP10 meeting
Janusz
• Write Abstract (pending) for CHEP
• Improve portal
• Develop web-based version
• Submit JeSRP-1
Future two-point plan (1)
• James to review/revise/relaunch the job submission system
• Work with the UK Grid/SP team (short term) and the Italian/German LCG system (long term)
• Improve reliability through a core team of users on a development system
Future two-point plan (2)
Drive Grid usage through incentive.
RAL CPUs are very heavily loaded by BaBar. Slow turnround → stressed users.
Make significant CPU resources available to BaBar users only through the Grid:
• some of the new Tier 1/A resources
• all the Tier 2 (Manchester) resources
And watch Grid certificate take-up grow!
Final Word
Our problems (challenges!) today will be your problems (challenges!) tomorrow.