The PHysics Analysis SERver Project
(PHASER)
M. Bowen, G. Landsberg, and R. Partridge*
Brown University
CHEP 2000
Padova, Italy
February 7-11, 2000
What is the PHASER project?
Effort to substantially increase productivity of physicists
analyzing multi-TB summary data sets
Our immediate focus is on the DØ experiment
» 600 million data events/year starting in early 2001
» Summary data set expected to grow at rate of 3TB/year
Concentrate on event selection and “ntuple” creation stage
» transition in data handling from monolithic reconstruction processing to
the much more chaotic processing of summary data by many physicists
» IO and CPU intensive due to need to apply latest calibration, particle ID,
and event selection algorithms to several hundred million events
Richard Partridge
2
PHASER Architecture
Physics Object Database
(POD) stores meta-data used
by most physics analyses for
their initial event selection
Physics Object and Particle ID
tables in POD store calibrated
4-vectors, object quality
variables, and results of
particle ID algorithms
DVD storage of full summary
(mDST) data set and useful
subsets of larger DST and STA
data sets
PHASER is PHast
New calibrations and particle ID algorithms can be quickly
incorporated
» Only the changes need to be imported
» Regenerating the large mDST data set will only be done infrequently
Storage of up-to-date calibrations and particle ID
algorithms avoids the need to re-apply these algorithms
for each event selection pass
Particle ID tables are small, making it possible to quickly
eliminate events not having the desired set of physics
objects
Direct access to the full mDST sample on DVD allows an mDST
subset to be quickly generated for advanced analyses
developing new algorithms not yet in the database
The Physics Object Database (POD)
Stores fully calibrated meta-data associated with the
various physics objects
» leptons, photons, jets, missing ET, secondary vertices, triggers, etc.
» for example, an electron object would have the energy, direction, and
various quantities used in the electron ID algorithms stored
Each physics object associated with a table in a relational
database
Primary key uniquely identifies each physics object and
provides information needed to correlate physics objects
from a single event
» Currently use Run, Event, Instance (where appropriate) and row number
from ntuple used to load database
» Alternative: data source index, sequence number, and instance
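The composite-key scheme above can be sketched as follows. This is a minimal illustration using SQLite as a stand-in for the production RDBMS; the table and column names are hypothetical, not the actual DØ schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE electron (
        run      INTEGER NOT NULL,
        event    INTEGER NOT NULL,
        instance INTEGER NOT NULL,  -- distinguishes multiple electrons in one event
        et       REAL,              -- calibrated transverse energy
        eta      REAL,
        phi      REAL,
        PRIMARY KEY (run, event, instance)
    )
""")
# Two electron objects from the same event; the shared (run, event) pair
# is what correlates physics objects across the per-object tables.
con.execute("INSERT INTO electron VALUES (85277, 12814, 0, 45.2,  1.1, 0.3)")
con.execute("INSERT INTO electron VALUES (85277, 12814, 1, 38.7, -0.4, 2.9)")
```

Joining any two object tables on (run, event) then recovers all objects belonging to a single event.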
Why use a Relational Database?
Physics objects typically have a fixed set of attributes used
for event selection and analysis
Independence of tables aids loading, updating database
» Data can be “bulk loaded” as long as primary key is provided in input data
stream
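Because each table is independent, loading reduces to streaming rows that already carry the primary key. A toy version (SQLite's `executemany` standing in for a vendor bulk-load utility such as SQL Server's `BULK INSERT`; the table is hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE jet (
        run INTEGER, event INTEGER, instance INTEGER, et REAL,
        PRIMARY KEY (run, event, instance)
    )
""")
# Input stream already provides (run, event, instance), so rows can be
# loaded in bulk with no cross-table bookkeeping.
rows = [(85277, 12814, i, 20.0 + 5.0 * i) for i in range(3)]
con.executemany("INSERT INTO jet VALUES (?, ?, ?, ?)", rows)
```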
Several vendors with quite capable products, large
commercial market
Prototype POD
Use DØ Run 1 data (1992 - 1996 running period)
62 million events loaded into the database
Entire “All-Stream” data set loaded
» Data set used by almost all DØ physics analyses
» Only files with special processing or trigger conditions excluded
Column-wise ntuple format used for importing/exporting
data
DØ Run 1 POD
Object                                 Columns   Rows            Size (GB)
Electron                               28        52,540,491      6.8
Muon                                   37        79,688,956      13.2
Photon                                 22        69,278,259      7.4
Jets (3 cone sizes)                    3 x 14    472,626,080     35.7
Jets with e/γ removed (3 cone sizes)   3 x 6     67,003,537      3.1
Missing ET                             14        62,353,601      4.8
Vertex                                 6         90,004,529      4.1
Trigger                                19        62,353,601      3.5
Event Parameters                       5         62,353,601      1.8
Totals                                 191       1,018,202,655   80.4
Including indexes, Run 1 POD occupies ~100 GB
» 58% physics object data
» 18% indexes on object ET
» 12% primary keys
» 12% database overhead
POD Benchmarks
Z → e+e- candidate event selection:
» 7 seconds to identify ~6k events
W → eν candidate event selection:
» 18 seconds to identify ~86k events
Both benchmark times make use of the particle ID tables
Event selection times compare very favorably with ~1000
CPU hours required to generate ntuples used in this study
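The Z → e+e- selection amounts to a grouped count over a small particle ID table: keep events with at least two electrons passing ID and an ET cut. A toy version, with hypothetical table, columns, and cut values (not the actual benchmark query), again using SQLite as a stand-in:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE electron_id (
        run INTEGER, event INTEGER, instance INTEGER, et REAL, tight INTEGER);
    INSERT INTO electron_id VALUES
        (85277, 12814, 0, 45.2, 1),
        (85277, 12814, 1, 38.7, 1),  -- two tight electrons: Z candidate
        (85277, 20991, 0, 31.0, 1),  -- only one tight electron: rejected
        (86001,   512, 0, 25.5, 0);  -- fails particle ID: rejected
""")
# Events with >= 2 tight electrons above the ET cut.
candidates = con.execute("""
    SELECT run, event FROM electron_id
    WHERE tight = 1 AND et > 20.0
    GROUP BY run, event
    HAVING COUNT(*) >= 2
""").fetchall()
# candidates -> [(85277, 12814)]
```

Because the ID table holds only a few small columns per object, this scan touches a tiny fraction of the data an ntuple pass would read, which is where the seconds-versus-CPU-hours gap comes from.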
Benchmark Hardware/Software
450 MHz dual-processor Pentium II with 256 MB RAM
Database stored on (6) 36 GB disks in a RAID 0 stripe set
MS SQL Server running on Windows NT 4.0
DVD Storage
Provide access to additional event information not included
in POD
DVD-RAM has a number of unique capabilities
» Less expensive than disk storage, doesn’t require backup
» Access to individual events is much faster than tape storage
Current disk capacity is 2.6 GB, 4.7 GB expected soon
Commercial DVD libraries hold up to 600 DVD disks
» 2.8 TB capacity using 4.7 GB DVD-RAM disks
» Average disk load time of 4.5 s, <1 hour to cycle through 600 disks
» Up to 6 DVD-RAM drives gives ~10 MB/s IO rate
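The library figures above are easy to verify with the quoted per-disk numbers (a quick arithmetic check, not vendor specifications):

```python
disks = 600          # slots in the commercial DVD library
gb_per_disk = 4.7    # DVD-RAM capacity expected soon
load_time_s = 4.5    # average disk load time

capacity_tb = disks * gb_per_disk / 1000.0    # ~2.8 TB total
cycle_time_h = disks * load_time_s / 3600.0   # 0.75 h to visit every disk
```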
Web Interface
Plan to develop web-based user interface
Interface modelled on “3-tier” architecture widely used in
commercial applications
Physicist will enter event selection requirements using a
Java applet
Applet communicates request to “Physics Intelligence”
middleware running on PHASER system (via CORBA)
» Translate request to SQL for event selection
» Verify that request can be accommodated within resource constraints
» Produce the requested output files
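The translation step could be sketched like this: a hypothetical `selection_to_sql` helper (not the actual middleware) that turns a simple object-count request into SQL against the per-object ID tables:

```python
def selection_to_sql(request):
    """Build an event-selection query from a {object: (min_count, min_et)} dict.

    Hypothetical request format; assumes one '<object>_id' table per
    physics object, keyed by (run, event) as in the POD.
    """
    clauses = []
    for obj, (min_count, et_min) in request.items():
        clauses.append(
            f"SELECT run, event FROM {obj}_id WHERE et > {et_min} "
            f"GROUP BY run, event HAVING COUNT(*) >= {min_count}"
        )
    # An event must satisfy every object requirement: intersect the
    # per-object event lists.
    return "\nINTERSECT\n".join(clauses)

# e.g. the Z -> e+e- benchmark: at least two electrons with ET > 20 GeV
sql = selection_to_sql({"electron": (2, 20.0)})
```

A resource check (row-count estimates, output size limits) would run on the generated SQL before execution, per the constraints bullet above.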
PHASER Output
Several output options:
» List of run and event numbers satisfying the request
» Ntuple created from POD information
» mDST stream containing requested events from DVD library
Output files will generally be small enough to transfer over
the network
Larger output files can be written to DVD and physically
sent to physicist for further analysis
Conclusions
PHASER offers a way for experts, novices, and
“dinosaurs” to quickly extract information about a
particular class of events
Feasibility of loading “Run 1” size physics object info into
a relational database has been demonstrated
Significant improvements in event selection time have been
observed for W/Z benchmarks
Expect these results will scale up to Run 2 data load
Database technology is also potentially useful for helping
manage complex analyses and storing intermediate results