Database Access Patterns in ATLAS Computing Model
G. Gieraltowski, J. Cranshaw, K. Karr, D. Malon, A. Vaniachine
ANL
P. Nevski, Yu. Smirnov, T. Wenaus
BNL
N. Barros, L. Goossens, R. Hawkings, A. Nairz, G. Poulard, Yu. Shapiro, F. Zema
CERN
XV International Conference on Computing in High Energy and Nuclear Physics
T.I.F.R., Mumbai, India
February 13-17, 2006
Outline
1) Emphasis on the early days of LHC running:
 Calibration/Alignment is a priority
 Must be done before reconstruction starts
 ATLAS 2006 Computing System Commissioning:
 Calibration/Alignment procedures are included in acceptance tests
2) Real experience in prototypes and production systems
 General issues encountered:
 Increased fluctuations in database server load
 Connection count limitations
3) Development of the ATLAS distributed computing model:
 Server-side developments:
 Deployment: LCG3D Project and OSG Edge Services Framework Activity
 Technology: Grid-enabled server technology - Project DASH
 Application-side technology developments:
 Deployment: Integration with Production System database (Conditions data slices)
 Technology: ATLAS Database Client Library (now adopted by COOL/POOL/CORAL)
ATLAS Computing Model
 In the ATLAS Computing Model, widely distributed applications require access to terabytes of data stored in relational databases
 Realistic database services data flow, including Calibration & Alignment, is presented in the Computing Technical Design Report
 Preparations are on track towards Computing System Commissioning to exercise realistic database data flow
ATLAS CSC Goals
 2006 is the year of ATLAS CSC
 The first goal of the CSC is calibration and alignment procedures
 ConditionsDB is included in CSC acceptance tests
WLCG SC4 Workshop - 12 February 2006
Computing System Commissioning Goals
 We have defined the high-level goals of the Computing System Commissioning operation during 2006
 Formerly called “DC3”
 More a running-in of continuous operation than a stand-alone challenge
 Main aim of Computing System Commissioning will be to test the software and computing infrastructure that we will need at the beginning of 2007:
 Calibration and alignment procedures and conditions DB
 Full trigger chain
 Event reconstruction and data distribution
 Distributed access to the data for analysis
 At the end (autumn-winter 2006) we will have a working and operational system, ready to take data with cosmic rays at increasing rates
Dario Barberis: ATLAS SC4 Plans
Towards the Early Days of LHC Running
 Calibration/Alignment is a priority
 Must be done before reconstruction starts
 Calibration/Alignment is part of the overall Computing System Commissioning activity to:
 Demonstrate the calibration ‘closed loop’: iterate and improve reconstruction (see the sketch after this list)
 Exercise the conditions DB access and distribution infrastructure
 Encourage development of subdetector calibration algorithms
 Initially focussed on ‘steady-state’ calibration
 Assuming required samples are available and can be selected
 Also want to look at initial 2007/2008 running at low luminosity
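As a toy illustration of the closed loop (all numbers and names below are invented; the real loop runs Athena reconstruction jobs against the conditions DB), an unknown misalignment constant can be recovered by iterating "reconstruction" and "calibration" until the derived correction becomes negligible:

```python
import random

# Toy illustration (invented numbers, not ATLAS code): recover an unknown
# misalignment by iterating "reconstruction" and "calibration" until the
# derived correction becomes negligible: the calibration closed loop.

TRUE_OFFSET = 0.37  # the misalignment the loop is supposed to recover

def measure(n=1000):
    """Simulate n hit positions smeared around the true misalignment."""
    return [TRUE_OFFSET + random.gauss(0.0, 0.1) for _ in range(n)]

def closed_loop(tolerance=1e-3, max_iterations=20):
    alignment = 0.0  # initial, uncalibrated constant
    for iteration in range(max_iterations):
        residuals = [hit - alignment for hit in measure()]  # "reconstruction"
        correction = sum(residuals) / len(residuals)        # "calibration"
        alignment += correction                             # update the constant
        print(f"iteration {iteration}: alignment = {alignment:.4f}")
        if abs(correction) < tolerance:                     # loop has converged
            break
    return alignment

closed_loop()  # converges to ~0.37 within a few iterations
```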
Calibration Data Flow
Prerequisites for Success
Simulation
 Ability to simulate a realistic, misaligned, miscalibrated detector
Reconstruction
 Use of calibration data in reconstruction; ability to handle time-varying calibration
Calibration Algorithms
 Algorithms in Athena, running from standard ATLAS data
Data Preparation
 Organisation and bookkeeping
 run number ranges, production system,…
Production System Enhancements
To prepare for new challenges, the first ATLAS Database Services Workshop was organized in December:
http://agenda.cern.ch/fullAgenda.php?ida=a057425
Among the Workshop recommendations was:
 A tighter integration of the production system database, task definition, Distributed Data Management and conditions data tags
Implementation opportunities (see the slice-extraction sketch below):
 Distribute (push) snapshots via pacman
 Use of DDM for large payload files
 Try Oracle 10g file management for external files
 Expand existing ServersCatalog with top tags
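A minimal sketch of the conditions-data-slice idea, assuming an invented single-table schema rather than the real COOL one: copy only the intervals of validity (IOVs) that one task's tag and run range actually need into a small standalone database that can be shipped with the jobs (for example pushed as a pacman snapshot or moved as a DDM payload file):

```python
import sqlite3

# Illustrative sketch (invented schema, not COOL): extract a conditions data
# "slice" by copying only the IOVs that overlap one task's run range, for one
# tag, into a small standalone database for distribution to jobs.

def make_slice(src, dst, tag, first_run, last_run):
    dst.execute("CREATE TABLE conditions (channel INTEGER, run_start INTEGER,"
                " run_end INTEGER, tag TEXT, payload BLOB)")
    rows = src.execute(
        "SELECT channel, run_start, run_end, tag, payload FROM conditions"
        " WHERE tag = ? AND run_start <= ? AND run_end >= ?",
        (tag, last_run, first_run))  # keep IOVs overlapping [first_run, last_run]
    dst.executemany("INSERT INTO conditions VALUES (?, ?, ?, ?, ?)", rows)
    dst.commit()

# Demo with in-memory databases standing in for the master and the slice:
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE conditions (channel INTEGER, run_start INTEGER,"
            " run_end INTEGER, tag TEXT, payload BLOB)")
src.executemany("INSERT INTO conditions VALUES (?, ?, ?, ?, ?)",
                [(1, 100, 199, "CSC-01", b"\x01"),
                 (1, 200, 299, "CSC-01", b"\x02"),
                 (1, 100, 299, "OTHER", b"\x03")])
dst = sqlite3.connect(":memory:")
make_slice(src, dst, "CSC-01", 150, 160)
print(dst.execute("SELECT channel, run_start, run_end FROM conditions").fetchall())
# -> [(1, 100, 199)]: the only IOV a job processing runs 150-160 needs
```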
ATLAS DB Applications
 In preparation for data taking, the ATLAS experiment has run a series of large-scale computational exercises to test and validate multi-tier distributed data grid solutions under development
 Real experience in prototypes and production systems was collected with three major ATLAS database applications:
 Geometry DB
 Conditions DB
 TAG databases
 ATLAS computational exercises run on a world-wide federation of computational grids
Data Mining of Operations
 The data mining of the collected operations data reveals a striking feature: a very high degree of correlation between failures:
 if a job submitted to some cluster fails, there is a high probability that the next job submitted to that cluster will fail too
 if the submit host fails, all the jobs it scattered over different clusters will fail too
 Taking these correlations into account is not yet automated by the grid middleware
 That is why the production databases and grid monitoring data that provide immediate feedback on Data Challenge operations to the production operators are so important for efficient utilization of the Grid capacities
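As a toy illustration of the kind of mining involved (the job-log format here is hypothetical), one can compare each cluster's overall failure probability with its failure probability immediately after a failure:

```python
from collections import defaultdict

# Toy illustration (hypothetical log format): compare each cluster's overall
# failure probability with its failure probability right after a failure,
# which is the correlation the production operators act on.

def failure_correlation(jobs):
    """jobs: (cluster, succeeded) tuples in submission order."""
    history = defaultdict(list)
    for cluster, succeeded in jobs:
        history[cluster].append(succeeded)
    for cluster, outcomes in history.items():
        p_fail = outcomes.count(False) / len(outcomes)
        after_fail = [b for a, b in zip(outcomes, outcomes[1:]) if not a]
        if after_fail:
            p_cond = after_fail.count(False) / len(after_fail)
            print(f"{cluster}: P(fail)={p_fail:.2f}, "
                  f"P(fail|previous failed)={p_cond:.2f}")

failure_correlation([
    ("siteA", True), ("siteA", False), ("siteA", False), ("siteA", False),
    ("siteB", True), ("siteB", True), ("siteB", False), ("siteB", True),
])
# siteA shows the correlated pattern: once it starts failing, it keeps failing
```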
Production Rate Growth and Daily Fluctuations
(chart: ATLAS production rate in jobs/day from July 2004 to May 2005, broken down by grid flavour: LCG/CondorG, LCG/Original, NorduGrid, Grid3; the rate grows from the Data Challenge 2 long-jobs and short-jobs periods to the Rome Production mix of jobs, peaking above 12000 jobs/day with large daily fluctuations; a "Database Capacities Bottleneck" is marked during the 2005 ramp-up)
Lessons Learned
 Among the lessons learned is the increase in fluctuations in database server workloads due to the chaotic nature of grid computations
 The observed fluctuations in database access patterns are of a general nature and must be addressed through services enabling dynamic and flexibly managed provisioning of database resources
 In many cases the connection count turns out to be the limiting resource
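A minimal sketch of the client-side discipline this implies, assuming a generic connect() factory (not any real ATLAS component): cap the number of concurrently open connections with a semaphore-guarded pool, so bursts of requests queue for a slot instead of exhausting the server's connection limit:

```python
import threading
from contextlib import contextmanager

# Illustrative sketch, assuming a generic connect() factory: a semaphore-
# guarded pool that keeps at most max_connections server connections open,
# so bursts queue for a slot instead of hitting the server's hard limit.

class BoundedPool:
    def __init__(self, connect, max_connections=10):
        self._connect = connect                      # opens one new connection
        self._slots = threading.Semaphore(max_connections)
        self._idle = []                              # connections ready for reuse
        self._lock = threading.Lock()

    @contextmanager
    def connection(self):
        self._slots.acquire()                        # block while N are in use
        with self._lock:
            conn = self._idle.pop() if self._idle else self._connect()
        try:
            yield conn
        finally:
            with self._lock:
                self._idle.append(conn)              # recycle, do not close
            self._slots.release()

# usage sketch:
#   pool = BoundedPool(lambda: sqlite3.connect("conditions.db"), 5)
#   with pool.connection() as conn:
#       conn.execute("SELECT ...")
```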
Opportunistic Grids
 Campus computing grids like GLOW
http://osg-docdb.opensciencegrid.org/cgi-bin/ShowDocument?docid=361
utilize spare cycles to run jobs
 The resource owner has priority
 ATLAS jobs are often put into hibernation
 Thus optimal jobs are shorter, i.e. only a few events
 Resulting in an order of magnitude more frequent database access
 Jobs put into hibernation during the initialization phase overloaded CERN database resources by keeping database connections open for days
 This problem was resolved by deploying dedicated replica servers in the US and at CERN to support the GLOW grid
 In comparison to production grids, opportunistic grids require extra development and support efforts
 not sustainable in the long run
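A sketch of the mitigation pattern at the job level (illustrative only; the deployed ATLAS fix was the dedicated replica servers mentioned above): open the connection as late as possible and close it before any phase in which the job may hibernate:

```python
from contextlib import contextmanager

# Illustrative "open late, close early" discipline (not the actual ATLAS fix):
# hold a database connection only while actually reading, so that a job
# hibernated between phases does not pin a scarce server connection for days.
# The connect() factory is a stand-in for any real database driver.

@contextmanager
def short_lived(connect):
    conn = connect()      # open just before the read...
    try:
        yield conn
    finally:
        conn.close()      # ...and close before any long or suspendable phase

def run_job(connect, events):
    # fetch everything the job needs up front, then release the connection
    with short_lived(connect) as conn:
        constants = conn.execute("SELECT value FROM conditions").fetchall()
    # long compute (or hibernation) phase: no connection is held open here
    return [(event, len(constants)) for event in events]  # placeholder "work"
```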
Client Library
 To improve the robustness of database access in a data grid environment we developed an application-side solution: a software component abstracting the database and/or middleware connectivity concerns in a generalized Database Client Library
http://indico.cern.ch/contributionDisplay.py?contribId=32&sessionId=4&confId=048
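The library itself is described in the contribution linked above; as a rough sketch of what "abstracting the connectivity concerns" means in practice (all names below are invented, not the actual Client Library API):

```python
import time

# Rough sketch of the failover idea (invented names, not the actual Client
# Library API): try each physical replica in turn, back off and retry, and
# shield the application from knowing which server actually answered.

class ClientLibrary:
    def __init__(self, connect, replicas, retries=3, backoff=2.0):
        self._connect = connect      # driver-specific connect(address) callable
        self._replicas = replicas    # ordered physical server addresses
        self._retries = retries
        self._backoff = backoff      # seconds, grows linearly per retry round

    def query(self, sql):
        last_error = None
        for attempt in range(self._retries):
            for address in self._replicas:
                try:
                    conn = self._connect(address)
                    try:
                        return conn.execute(sql).fetchall()
                    finally:
                        conn.close()
                except Exception as error:   # server down, overloaded, refused
                    last_error = error
            time.sleep(self._backoff * (attempt + 1))   # back off, then retry
        raise last_error or RuntimeError("no database replicas configured")
```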
Server Indirection
 One of the lessons learned in ATLAS Data Challenges is that the database server address should NOT be hardwired in data processing transformations
 The logical-physical indirection for database servers is now introduced in ATLAS
 Similar to the logical-physical file indirection of the Replica Location Service in Grid file catalogs
 Supported by the ATLAS Client Library
 Now adopted by the LHC POOL project:
http://indico.cern.ch/contributionDisplay.py?contribId=329&sessionId=4&confId=048
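A minimal sketch of the indirection, assuming an invented catalog format (the slides mention an existing ServersCatalog; the real one differs): jobs name only a logical database service, and a locally deployed catalog resolves it to an ordered list of physical replicas:

```python
# Illustrative sketch of logical-physical database indirection: the job's
# configuration names only a logical service; a local catalog (deployed with
# the site, so it can change without touching the transformations) maps it
# to an ordered list of physical servers. The catalog format is invented.

SERVERS_CATALOG = {
    # logical name       ordered physical replicas (nearest first)
    "ATLAS_COOLPROD": ["mysql://replica.tier2.example/cool",
                       "mysql://replica.tier1.example/cool",
                       "oracle://atlas-cool.cern.ch/cool"],
}

def resolve(logical_name, catalog=SERVERS_CATALOG):
    """Return the physical replicas for a logical service, nearest first."""
    try:
        return catalog[logical_name]
    except KeyError:
        raise LookupError(f"no physical servers for {logical_name!r}")

# A transformation asks for "ATLAS_COOLPROD"; which server answers is decided
# at the site, not hardwired in the job:
print(resolve("ATLAS_COOLPROD")[0])
```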
Tier-0 Operations
(diagram: data acquisition programs write to the online server atlobk01, which is replicated to the offline server atlobk02; both hold CondDB, CTB, NOVA, Test and OBK databases plus POOL catalogs, and the offline copy serves browsing applications and Athena programs)
 In addition to distributed operations, ATLAS database services are relevant to local CERN data-taking operations, including the conditions data flow of ATLAS Combined Test Beam operations, prototype Tier-0 scalability tests and event tag database operations
TAG Database Access
TAG Replication is a part of the SC4 Tier-0 test:
 Loading TAGs into the relational database at CERN
 Replicating them using Oracle Streams from Tier-0 to Tier-1s and to Tier-2s
 Also, as an independent test, using TAG files that have already been generated
TAG Access
• TAG is a keyed list of variables per event
• Two roles:
• Direct access to an event in a file via a pointer
• Data collection definition function
• Two formats, file and database
• Now believe large queries require the full database
• This restricts them to Tier-1s and large Tier-2s/CAF
• Ordinary Tier-2s hold file-based TAGs corresponding to locally-held datasets
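A toy sketch of the two TAG roles using sqlite3 (the schema and variable names are invented for illustration): a query over the tag variables defines a collection, and each selected row carries the pointer for direct access to the event in its file:

```python
import sqlite3

# Toy sketch of a database-resident TAG (invented schema): one row of
# selection variables per event, plus a pointer (file GUID + offset) back
# to the full event record in its file.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tags (run INTEGER, event INTEGER, n_muons INTEGER,"
           " missing_et REAL, file_guid TEXT, offset INTEGER)")
db.executemany("INSERT INTO tags VALUES (?, ?, ?, ?, ?, ?)", [
    (5100, 1, 2, 87.3, "guid-aaa", 0),
    (5100, 2, 0, 12.1, "guid-aaa", 1),
    (5101, 1, 1, 55.9, "guid-bbb", 0),
])

# Both roles at once: the cut defines a data collection, and each selected
# row carries the pointer needed for direct access to the event in its file.
selection = db.execute("SELECT run, event, file_guid, offset FROM tags"
                       " WHERE n_muons >= 1 AND missing_et > 50.0").fetchall()
for run, event, guid, offset in selection:
    print(f"run {run} event {event} -> {guid}@{offset}")
```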
Participation in LCG 3D
LCG 3D Service Architecture
 ATLAS is fully committed to use the Distributed Database Deployment infrastructure developed in collaboration with the LCG 3D Project
(diagram: LCG 3D service architecture; the T0 site runs an autonomous, reliable Oracle service; T1 sites form a database backbone with all data replicated and a reliable service; T2 sites hold a local database cache and T3/4 sites a subset of the data with only local service; replication uses Oracle Streams between Oracle servers, cross-vendor copies, MySQL/SQLite files and proxy caches)
Participation in OSG ESF
 US ATLAS is participating in the OSG Edge Services Framework Activity to enhance the traditional database services infrastructure deployed in 3D with dynamic database services deployment capabilities
Open Science Grid Edge Services (from the SC05 booth presentation "OSG Edge Services Framework"):
• Services executing on the edge of the public and private network
(diagram: a site's CE and SE gateways front per-VO edge services for CMS, ATLAS, CDF and guest VOs, in front of the compute and storage nodes)
http://indico.cern.ch/contributionDisplay.py?contribId=214&sessionId=7&confId=048
Project DASH
 To grid-enable the MySQL database server, ATLAS is participating in the project DASH:
http://indico.cern.ch/contributionDisplay.py?contribId=36&sessionId=7&confId=048
 A new collaborative project has just started at Argonne to grid-enable the PostgreSQL database
 Both projects target integration with OGSA-DAI
 Please contact us if you are interested in contributing to these projects
Conclusions
 As grid computing technologies mature, development must focus on database and grid integration
 New technologies are required to bridge the gap between data accessibility and the increasing power of grid computing used for distributed event production and processing
 Changes must happen both on the server side and on the client side
Server technology
 Must support dynamic deployment of capacities
 Must support replication on a lower granularity level: Conditions DB slices
 Must be coordinated with production system
 Must support grid authorization (Project DASH)
Client technology
 Must support database server indirection
 Must support a coordinated client-side solution:
 ATLAS Database Client Library (now a part of COOL/POOL/CORAL)