ppt - Institute of Physics

High-Level Data Access
With Grace
Ease
Greg Landsberg
Prague Software Workshop
Run I Lessons
Tower of Babylon!
• Plethora of Ntuple formats:
  • QCD
  • WZQCD
  • Top
  • LQ
  • SUSY
  • …
• Proprietary physics analysis code
• Number of nearly identical particle ID criteria
• Lack of standard documented highways to start a new physics analysis
Result: a living hell for remote collaborators!
How to avoid this in Run II?
• Enforce standard physics analysis packages
• Strongly enforce standard object ID
• Create and support standard heptuples
• Develop a number of WWW-friendly high-end analysis tools
• The ultimate product of this collaboration is a flow of quality papers, and we have to make it as easy as possible for remote physicists to efficiently contribute to physics analyses
Standard Object ID
• The two key components are:
  • Strong ID groups
  • Strong management
• Strong ID groups:
  • Develop out-of-the-box particle objects and ID criteria
  • Provide enough versatility to satisfy different physics needs (e.g. low energy and high energy electrons)
  • Provide well optimized selection tools
  • Provide efficiencies and fake probabilities for standard ID cuts
• Strong management:
  • No new analysis is allowed to start with a proprietary object ID
  • If a new object ID is convincingly proved to be essential as a result of a new analysis, it should be immediately standardized and either be added to a list of accepted standards or replace one of the existing standards
  • New efficiencies and fake probabilities are then calculated for a new ID
  • Standard objects should be used in standard heptuples
Standard HEPtuple Proposal
• The proposed format is based on my experience in analyzing Run I data; it’s just a starting point!
• Necessary components:
  • Event tags
  • Triggers
  • Accelerator conditions
  • Global quantities
  • Basic physics quantities
  • Electrons/photons
  • Muons
  • Jets
  • τ’s, b- and c-jets
  • High-pT tracks
  • …
• To become a standard, these heptuples need to be introduced early on, before people go off and do MC-based physics analyses
• The time is now!
• They should be enforced (management) and supported (ID groups)
• Versatile enough to meet most of the physics goals
• Expandable to accommodate new analyses
• RCP/WWW-controlled user interface
• Sufficiently smaller than mDST, i.e. 2 KB/event
Standard HEPtuples details
Event Tags:
  • Run #
  • Event #
  • BadRun Flag
  • Time Stamp
  • Subdetector flags
  • …
Triggers:
  • L0
  • L1 bit string
  • L2 bit string
  • L3 bit string
  • Compressed L1 info
  • Compressed L2 info
  • Compressed L3 info
  • User-defined list of L1/L2/L3 triggers
  • …
Accelerator:
  • Instantaneous Lum
  • MI Flag
  • Accelerator conditions
  • Calorimeter baselines
  • Average int/crossings
  • …
Standard HEPtuples details (cont’d)
Global Quantities:
  • # Primary vertices
  • Primary Vertices
  • # Tracks per vertex
  • # Secondary vertices
  • Secondary vertices
  • # Tracks per vertex
  • L0/L1 vertex
  • L2 vertices
  • L3 vertices
  • Missing ET’s
Physics Quantities:
  • N leptons
  • N photons
  • N jets per algorithm
  • Total energy
  • ST
  • HT
  • 2-body Masses
  • Sphericity
  • Aplanarity
Photons:
  • E, ET
  • Px, Py, Pz
  • η, φ, ηdet
  • Vertex from pointing
  • ID flag
  • χ², ISOA, ISOF, EMF
  • Compressed cells
  • Preshower info
  • Index of the closest track, jet
  • Likelihood
User-defined list of L1/L2/L3 triggers
Standard HEPtuples details (cont’d)
Electrons:
  • E, ET
  • Px, Py, Pz
  • η, φ, ηdet
  • Vertex ID, pointing
  • E/p, dE/dx
  • ID flag, Sign
  • χ², ISOs, EMF, σ
  • Compressed cells
  • Preshower info
  • Closest track, jet
  • Likelihood
Muons:
  • p, pT
  • Px, Py, Pz
  • η, φ, ηdet
  • Muon system info
  • ID flag, Sign
  • χ², ISO, CAL, dE/dx
  • Timing
  • Compressed hits
  • Closest track, jet
Jets:
  • E, ET
  • Ex, Ey, Ez
  • η, φ, ηdet
  • EMF, CHF, ICDF, HCF
  • ID flag, Rcone/KT
  • Ntr, Width, Q/G/τ/b/c
  • Timing
  • Compressed cells
  • Preshower info
  • Closest track, jet
  • Energy correction
  • Likelihood
Standard HEPtuples details (cont’d)
b/c/τ Jets:
  • E, ET
  • Ex, Ey, Ez
  • η, φ, ηdet
  • EMF, CHF, ICDF, HCF
  • ID flag, Rcone/KT
  • Ntr, Width
  • Vertex, Impact param.
  • Neutrino energy corr.
  • Compressed SMT info
  • Compressed cal. cells
  • Closest track, jet
  • Energy correction
  • Likelihood
High pT tracks:
  • p, pT
  • Px, Py, Pz
  • η, φ, ηdet
  • ISO, CAL, dE/dx, occ.
  • Compressed CFT info
  • Compressed SMT info
  • Compressed Muon info
  • Timing
  • Compressed hits
  • Preshower info
  • Closest track, jet
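
One way to check a candidate heptuple layout against the size target on the proposal slide is to pack each object block into fixed-width fields and add up the bytes. The sketch below is only an illustration: the field list, the field widths, and the reading of "2 KB/event" as 2048 bytes are assumptions, not part of the proposed format.

```python
import struct

# Hypothetical packed layout for one electron entry (little-endian, no padding):
# 8 floats (E, ET, Px, Py, Pz, eta, phi, eta_det)
# + 2 unsigned 16-bit integers (ID flag, index of the closest track).
ELECTRON_FORMAT = "<8f2H"

# Assumption: the "2 KB/event" target is read as 2048 bytes.
EVENT_BUDGET_BYTES = 2048


def block_bytes(fmt, n_objects):
    """Bytes needed to store n_objects entries of the given struct format."""
    return struct.calcsize(fmt) * n_objects


def fits_budget(blocks, budget=EVENT_BUDGET_BYTES):
    """blocks is a list of (struct format, object count) pairs for one event."""
    return sum(block_bytes(fmt, n) for fmt, n in blocks) <= budget
```

With this made-up layout an electron costs 36 bytes, so the budget is dominated by whatever compressed cell, hit, and trigger information is kept per object.
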
Physics Object Database
• Slides from Richard Partridge (GCM talk, Seattle workshop)
What is POD?
• R&D project to investigate the use of database technology for physics analysis of large data samples
• POD uses a commercial relational database program to store:
  • Calibrated physics objects (leptons, jets, missing ET)
  • Results of particle ID algorithms
  • Global quantities (triggers, vertices, etc.)
• Database queries perform event selection
  • Example: select top eμ events by requiring 1 e, 1 μ, and 2 jets with |η| cuts and ET / missing ET thresholds
• Query output is physics analysis input
  • Ntuple with database info for selected events
  • List of run/event numbers allows selected mDSTs to be quickly fetched for advanced analyses
• Current goal is to demonstrate feasibility, develop necessary tools, and establish performance benchmarks using a database loaded with the Run 1 data sample
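
The eμ selection mentioned above can be sketched as a single SQL query. The example uses Python's built-in sqlite3 module as a stand-in for the commercial database; the two-table schema, the column names, the cut values, and the choice of "exactly 1 e, exactly 1 μ, at least 2 jets" are all assumptions made for illustration, not the POD design.

```python
import sqlite3

# Hypothetical schema: one row per event and one row per calibrated object.
SCHEMA = """
CREATE TABLE events  (run INTEGER, event INTEGER, met REAL);
CREATE TABLE objects (run INTEGER, event INTEGER, kind TEXT, et REAL, eta REAL);
"""


def count_objects(kind):
    """Correlated subquery counting objects of one kind that pass the cuts."""
    return (
        "(SELECT COUNT(*) FROM objects o "
        "WHERE o.run = e.run AND o.event = e.event "
        f"AND o.kind = '{kind}' AND o.et > :et_min AND ABS(o.eta) < :eta_max)"
    )


QUERY = (
    "SELECT e.run, e.event FROM events e "
    "WHERE e.met > :met_min "
    f"AND {count_objects('e')} = 1 "
    f"AND {count_objects('mu')} = 1 "
    f"AND {count_objects('jet')} >= 2 "
    "ORDER BY e.run, e.event"
)


def select_emu_events(conn, et_min=15.0, eta_max=2.5, met_min=20.0):
    """Return the (run, event) list for the selected events."""
    params = {"et_min": et_min, "eta_max": eta_max, "met_min": met_min}
    return conn.execute(QUERY, params).fetchall()


def make_demo_db():
    """Two made-up events: event 10 passes, event 11 fails the missing ET cut."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                     [(1, 10, 35.0), (1, 11, 5.0)])
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?, ?)", [
        (1, 10, "e", 40.0, 0.5), (1, 10, "mu", 25.0, -1.1),
        (1, 10, "jet", 50.0, 1.0), (1, 10, "jet", 30.0, -2.0),
        (1, 11, "e", 40.0, 0.5), (1, 11, "mu", 25.0, 0.2),
        (1, 11, "jet", 50.0, 1.0), (1, 11, "jet", 30.0, 0.3),
    ])
    return conn
```

The returned run/event list is the kind of output that could then be used to fetch the matching mDST events.
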
Why Should One Use a Database?
• Designed to store, retrieve, update, and manage complex data samples
• Large number of data types
  • Bits, integers, floats, characters, binary objects, etc.
• Many ways of organizing data
  • Physics object, event, file, stream, run, etc.
• Architecture allows fast access to data
  • Avoid reading/unpacking entire event to look at 1 bit
  • Separating algorithm results from physics object data eliminates need to look at all 600M events
• Flexible access to data
  • Data, columns, tables, etc. can be added, updated, or deleted without recreating the database
  • Example: new calibrations/algorithms can be added to the database and compared to the old ones
• Central location for latest calibrations, corrections, algorithms, etc.
• Local processing, minimal network IO
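
The point about adding a new calibration without recreating the database can be illustrated with a column added in place. This is a minimal sketch using sqlite3 as a stand-in; the `jets` table, the `et_v2` column name, and the scale factor are hypothetical.

```python
import sqlite3


def add_recalibrated_et(conn, scale):
    """Add a new ET column next to the old one and fill it in place."""
    conn.execute("ALTER TABLE jets ADD COLUMN et_v2 REAL")
    conn.execute("UPDATE jets SET et_v2 = et * ?", (scale,))


def compare_calibrations(conn):
    """Return (run, event, old ET, new ET) rows so both versions can be compared."""
    return conn.execute(
        "SELECT run, event, et, et_v2 FROM jets ORDER BY run, event"
    ).fetchall()


# Made-up data; the scale factor 1.25 is arbitrary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jets (run INTEGER, event INTEGER, et REAL)")
conn.executemany("INSERT INTO jets VALUES (?, ?, ?)",
                 [(1, 10, 50.0), (1, 11, 20.0)])
add_recalibrated_et(conn, scale=1.25)
rows = compare_calibrations(conn)
```

The old column stays untouched, so queries written against the previous calibration keep working while the new one is evaluated.
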
POD Status
• Database server running at Brown
  • Dual P-II/450 with ~40 GB available for testing
  • Using SQL Server for present tests
• Preliminary studies using pseudo-data
  • 30M “electron” 4-vectors generated with flat ET, η, φ distributions
  • ~1 minute to select 100K events satisfying restrictive ET, |η| cuts
  • 1.5 - 3.5 minutes to select ~16K events with 2 electrons satisfying loose ET, |η| cuts
  • For comparison, expect 2M produced W → eν per fb⁻¹
  • While rather crude, these results suggest that the POD approach can increase the speed for event selection by several orders of magnitude
• C++ program is being developed to load ntuples into the database
  • Heptuple used to read ntuples
  • ADO API used to write to database
• Database is being loaded with Run 1 data (ALL stream), LQ-based ntuples
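
Pseudo-data of the kind described above (4-vectors flat in ET, η, φ) can be generated in a few lines. The sketch below assumes massless particles and uses made-up ET and η ranges; the actual ranges used in the Brown tests are not given on the slide.

```python
import math
import random


def four_vector(et, eta, phi):
    """Massless 4-vector (E, px, py, pz) from ET, eta, phi."""
    return (
        et * math.cosh(eta),  # E
        et * math.cos(phi),   # px
        et * math.sin(phi),   # py
        et * math.sinh(eta),  # pz
    )


def generate_flat(n, et_range=(10.0, 100.0), eta_max=2.5, seed=0):
    """n pseudo-electrons with flat ET, eta, phi (ranges are assumptions)."""
    rng = random.Random(seed)
    return [
        four_vector(
            rng.uniform(*et_range),
            rng.uniform(-eta_max, eta_max),
            rng.uniform(0.0, 2.0 * math.pi),
        )
        for _ in range(n)
    ]
```
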
POD Tools Planned
• Program to load heptuples into the database
• Web interface to construct database queries that perform event selection
  • Provide web form for selecting desired physics objects, algorithms, kinematic cuts (ET, |η|, etc.), triggers, runs, etc.
  • Translate selection criteria into an SQL command
  • Save resulting event list, output ntuple
• Ntuple generator to create heptuple of database variables for selected events
• Web interface to define correspondence between heptuple and database columns
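
The "translate selection criteria into an SQL command" step could look roughly like the sketch below. The table names, the column whitelist, and the shape of the form data are hypothetical; values are passed as query parameters rather than pasted into the SQL text.

```python
# Hypothetical whitelists of what the web form may select and cut on.
TABLES = {"electrons", "muons", "jets"}
COLUMNS = {"et": "et", "abs_eta": "ABS(eta)", "run": "run"}
BOUNDS = {"min": ">=", "max": "<="}


def build_selection(table, cuts):
    """Turn form cuts [(column, 'min'|'max', value), ...] into (sql, params)."""
    if table not in TABLES:
        raise ValueError(f"unknown table: {table}")
    clauses, params = [], []
    for column, bound, value in cuts:
        # Unknown columns or bounds raise KeyError instead of reaching the SQL.
        clauses.append(f"{COLUMNS[column]} {BOUNDS[bound]} ?")
        params.append(value)
    where = " AND ".join(clauses) if clauses else "1"
    return f"SELECT run, event FROM {table} WHERE {where}", params
```

For example, `build_selection("electrons", [("et", "min", 20.0), ("abs_eta", "max", 1.1)])` yields a command with two placeholders and the parameter list `[20.0, 1.1]`.
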
NT as POD Server OS
• NT has proven ability to handle large databases
  • Supported by all leading database vendors
  • NT has best price/performance in standard database benchmarks
  • Good scalability in multi-processor systems (up to 8 P-III processors with forthcoming Profusion chip set)
• NT supports ADO (ActiveX Data Objects)
  • Provides high level API that greatly simplifies programming the database interface
  • ADO interfaces to all leading database products
  • Brown is using ADO to develop software for loading ntuples into a database for our prototype studies
  • Brown plans to also provide a web-based query capability based on ADO
• NT makes setting up and managing a high performance / high reliability database remarkably easy
  • CD support for NT project servers not required
POD Server Requirements
• Disk subsystem
  • Disk capacity determines how much info is stored
  • ~1 TB would allow ~1K of info/event
  • ~20 objects/event in Run 1 × ~10 words/object
  • More info can be stored by adding disk space
• Multiprocessor Server(s)
  • Large database queries are CPU (and disk) intensive
  • Queries execute in parallel on multiple CPUs
  • Goal would be to have a typical query selecting a small sub-sample in ~1 minute
  • Multiple servers can be clustered if needed
• Optional DVD-RAM jukebox
  • Expect to be able to store ~2.8 TB in a single jukebox at 1/4 cost of disk space
  • Allows retrieval of full mDST event information for events selected by the database query
POD Server is well matched to Project Server specs
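
The disk-sizing figures above can be checked with a few lines of arithmetic. The 4-byte word size is an assumption (the slide does not state it), and 1 TB and 1K are taken as 10^12 and 1000 bytes for simplicity.

```python
BYTES_PER_WORD = 4  # assumption: 32-bit words


def bytes_per_event(objects_per_event=20, words_per_object=10):
    """Approximate stored size of one event from the object counts on the slide."""
    return objects_per_event * words_per_object * BYTES_PER_WORD


def events_per_disk(disk_bytes=10**12, event_bytes=1000):
    """How many events fit on the disk at a given size per event."""
    return disk_bytes // event_bytes


def total_bytes(n_events, event_bytes=1000):
    """Disk space needed for n_events at a given size per event."""
    return n_events * event_bytes
```

Under these assumptions 20 objects of 10 words come to 800 bytes, consistent with the ~1K/event figure, and the 600M events quoted earlier would take about 600 GB of a 1 TB disk.
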
DVD Jukebox Storage
• DVD-RAM: 2.6 GB/side, 5.2 GB total, $15-25/DVD; 4.8 GB/side were just announced!
• DVD libraries: 600 DVD, 3 TB of storage for about $45K or $15/GB!
• 10-40 MB/s throughput
• Very promising technology, potential capacity up to 17 GB/DVD
• Fast price drop, wide availability
• Brown group has purchased a single 1X DVD-RAM recorder for performance tests
• Excellent tool for remote collaborators to have local copies of mDST data set (or selected STA/DST streams)
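
The library capacity and cost figures quoted above follow from simple arithmetic, sketched here with the slide's own numbers as defaults (1 TB is taken as 1000 GB).

```python
def library_capacity_gb(n_disks=600, gb_per_disk=5.2):
    """Total capacity of a DVD library in GB."""
    return n_disks * gb_per_disk


def cost_per_gb(price_dollars=45_000, capacity_gb=3_000):
    """Library price divided by its quoted capacity."""
    return price_dollars / capacity_gb
```

600 double-sided disks at 5.2 GB each give about 3.1 TB, in line with the quoted 3 TB, and $45K over 3 TB gives the quoted $15/GB.
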
Possible POD/DVD Server
[Diagram: POD/DVD Server at YOUR INSTITUTION, connected to the Central Analysis Server at Fermilab and to remote physicists. Labels recovered from the figure:]
• Central Analysis Server: Fermilab, 20-30 TB, ≈$1M; T-3 or faster network
• Remote physicist: web-based GUI; T-1/T-3 network
• POD Server: one or two (shown) fast server(s); eight P-III or Merced CPU, 1 GB memory; $30K / 5000 MIP server; DB disks on SCSI bus; 3 TB fast RAID storage, $60/GB
• Fast 1 Gbit/s Ethernet
• DVD Server: eight P-III or Merced CPU, 1 GB memory, cache disk; SCSI buses to 600-DVD multi-drive changers, 10-40 MB/sec
• DVD-RAM Library: 15 TB storage (mDST, DST, STA), $15/GB
• Tape robot
• A solution to the public data access provision H.R.4328 (or Quarknet)?