
DZERO Data Acquisition
Monitoring and History Gathering
G. Watts
U. of Washington, Seattle
CHEP04 - #449
Monitor Data Archiver
Save selected data at 15-second intervals.
Produce plots on the web, in real time, of any quantity vs. run, store, or date.
Started as a one-person project!
• Not a great deal of data!
• Raw data is approximately 20 GB/year.
• Built on the work of the DØ L3 Trigger/DAQ Group (a mature project at this point).
• The hope was a project written once and then forgotten (except in use!).
The project is more interesting for its failures than its successes…
• Introduction to DØ L3 Trigger/DAQ & Monitoring
• Design of l3xHistoryViewer
• Multiple incarnations (ROOT, Oracle, ROOT)
The DØ L3 Trigger/DAQ System
A typical collider multilevel trigger system. See #477 (Chapin) for details.
[Diagram: Read Out Crates feed the trigger chain (Level 1, Level 2, Level 3); a Switch and Routing Master distribute events to Level 3 Farm Nodes under a DAQ Readout Supervisor; accepted events go to the Tape Archive; a Monitor Server drives multiple Displays.]
This project would not be possible without all the work of the DØ DAQ Group!
The Monitor Server
• Typically > 150 clients and > 25 displays.
• A hole in the online firewall allows external monitor displays.
• All monitor data returned from clients is cached.
• Displays may request data no older than a specified age (see the sketch below).
• Heavily multithreaded C++ program.
• Uses ACE as the communication library.
• The l3mq Web Application caches complex requests.
• Web pages are used to alter the requests.
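A minimal sketch of that age-checked cache lookup, assuming hypothetical names (MonitorCache, CachedItem, lookup); the real Monitor Server code is not shown in these slides:

#include <chrono>
#include <map>
#include <optional>
#include <string>

// Hypothetical sketch: cached monitor data is returned as long as it is
// newer than the maximum age the display will accept; otherwise the
// client must be re-queried.
struct CachedItem {
    std::string xml;                                // item data as XML
    std::chrono::steady_clock::time_point fetched;  // when it was cached
};

class MonitorCache {
    std::map<std::string, CachedItem> items_;
public:
    std::optional<std::string> lookup(const std::string& name,
                                      std::chrono::seconds maxAge) const {
        auto it = items_.find(name);
        if (it == items_.end()) return std::nullopt;
        if (std::chrono::steady_clock::now() - it->second.fetched > maxAge)
            return std::nullopt;                    // too stale: re-query client
        return it->second.xml;
    }
    void store(const std::string& name, std::string xml) {
        items_[name] = {std::move(xml), std::chrono::steady_clock::now()};
    }
};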
The Monitor Server
[Diagram: a Client Handler and a Display Handler connect through the Monitor Data Cache and Reply Builder; the l3mq Web Application sits alongside as a display client.]
Monitor Server Communication
All Monitor Server communication is done in XML.
A Monitor Item carries:
• Data Source Type
• Machine Name
• Item Name
• Item Data
The format is extensible. Items are defined by the clients, so adding a new item is just a matter of adding a new client and requesting the items (an illustrative request is sketched below).
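For illustration only (the actual DØ message schema is not given in these slides), a hypothetical request for one monitor item could be assembled like this:

#include <iostream>
#include <string>

// Hypothetical XML layout illustrating the three-part item name; the
// real Monitor Server schema differs.
std::string makeItemRequest(const std::string& sourceType,
                            const std::string& machine,
                            const std::string& item,
                            int maxAgeSeconds) {
    return "<request maxage=\"" + std::to_string(maxAgeSeconds) + "\">"
           "<item source=\"" + sourceType +
           "\" machine=\"" + machine +
           "\" name=\"" + item + "\"/></request>";
}

int main() {
    // e.g. an event-rate counter from one farm node, at most 30 s old
    std::cout << makeItemRequest("L3FarmNode", "d0l3node42", "EventRate", 30)
              << "\n";
}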
The l3xHistoryViewer Data Flow
[Diagram: Monitor Server → l3mq Cached Monitor Request → Data Collector → Data Store → Web Pages (Plots); together these components form the l3xHistoryViewer.]
• Archive monitor data 4 times/minute.
• Archive ~4000 values.
• Generate plots on the fly for viewing on the web.
– Must be quick (< 5 seconds).
• View by run number, store number, date…
• Little or no load on the DAQ/online system.
• Allow stale monitor data to prevent re-queries.
• Easily change the archived monitor items.
– l3mq web pages are used for this.
Data Collector
Flow: Monitor Server → (l3mq cached monitor request) → Data Collector: parse the Monitor Server XML, generate the Monitor Item names, save to the Data Store (a sketch of this loop follows).
• Robust against Monitor Server failures.
• Robust against various L3/DAQ machines crashing.
• Written in C++ and/or C#, depending upon the version.
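A minimal sketch of such a collection loop, with hypothetical stand-ins (fetchMonitorXml, parseItems, saveToDataStore) for the real components:

#include <chrono>
#include <exception>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

struct Item { std::string name; double value; };

// Stubs standing in for the real components.
std::string fetchMonitorXml() { return "<monitor/>"; }          // query Monitor Server
std::vector<Item> parseItems(const std::string&) { return {}; } // parse returned XML
void saveToDataStore(const std::vector<Item>& items) {          // one 15 s snapshot
    std::cout << "saved " << items.size() << " items\n";
}

int main() {
    for (;;) {
        try {
            // One archive pass: fetch, parse, store ~4000 values.
            saveToDataStore(parseItems(fetchMonitorXml()));
        } catch (const std::exception& e) {
            // Robustness: a Monitor Server or node failure must not kill
            // the collector -- log it and try again on the next interval.
            std::cerr << "archive pass failed: " << e.what() << "\n";
        }
        std::this_thread::sleep_for(std::chrono::seconds(15));  // 4 times/minute
    }
}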
Web Pages
Flow: discover the data location for the query → extract the data → plot and send a JPEG to the web (a ROOT-based sketch follows).
• Fast enough to work in real time: ~5 seconds.
• Robust against web failures.
• Written in C++ and/or C# and ASP.NET, depending upon the version.
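A hedged sketch of the plot step using ROOT; the file, tree, and branch names are invented, and the real code also handles run/store/date lookups:

#include <TCanvas.h>
#include <TFile.h>
#include <TTree.h>

// Hypothetical: draw one archived quantity vs. time and emit a JPEG
// for the web page.
void plotItem() {
    TFile f("history_day1.root");                      // located via the index DB
    TTree* t = static_cast<TTree*>(f.Get("HistoryData"));
    TCanvas c("c", "history plot", 800, 600);
    t->Draw("EventRate:time");                         // quantity vs. time
    c.SaveAs("plot.jpg");                              // JPEG served to the web
}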
History Data Store
Trouble Maker!
• The choice of Relational Database (RDB) vs. ROOT drove the design of the other components.
• All choices have worked, to various degrees.
• Speed & size were always the issue:
– Insertion speed: can all the data be inserted in less than 15 seconds?
– Extraction speed: produce a plot before the user hits the reload button.
The 3 Implementations
1. The Prototype
• Monitor Items appear and disappear.
• Homegrown XML parsing isn't robust.
• ROOT doesn't perform well on TTrees with 4000 branches!
2. Oracle Implementation
• DØ Computing Support offered:
– Backup and Oracle maintenance.
– An RDB expert (consultant).
• Offered the ability to select data on many criteria.
• Speed & size issues were never satisfactorily addressed.
3. ROOT Implementation
• Use ROOT to store the data, an RDB to store the lookup indices.
– Use both for what they are good at.
• Speed is still an issue.
– Mostly understanding ROOT I/O.
• Back to a standalone project.
– Ad-hoc backup mechanism.
Oracle: Database Design Was Difficult!
• The tension between speed and space was never fully resolved.
• An event table contains one entry per 15 seconds; its ID is used as a reference into the EVENT_TO_VALUE table.
• EVENT_TO_VALUE contains one entry per monitor item per 15 seconds (4000 entries per 15 seconds), relating time and monitor data. Compact, and heavily indexed for lookups. Stored procedures and array inserts are used for speed and database consistency.
• A value table contains one entry per value: a Monitor Item that is constant will not generate extra space in this table! Each value is attached to the actual name in the ITEM_NAME table. (The struct sketch below mirrors this layout.)
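A sketch of that three-table layout as C++ structs; apart from EVENT_TO_VALUE and ITEM_NAME, the table and column names are my guesses, since the slides do not give the full schema:

#include <cstdint>
#include <string>

struct Event {            // one row per 15 s snapshot
    int64_t id;           // referenced by EventToValue
    int64_t timestamp;    // snapshot time
};

struct EventToValue {     // one row per monitor item per snapshot
    int64_t event_id;     // -> Event.id   (~4000 rows per 15 s; heavily indexed)
    int64_t value_id;     // -> Value.id
};

struct Value {            // one row per *distinct* value: a constant
    int64_t id;           //   Monitor Item adds no new rows here
    int64_t item_name_id; // -> ItemName.id
    double  value;
};

struct ItemName {         // the monitor item names
    int64_t id;
    std::string name;
};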
Performance for Oracle Version
The database was filled with 3-4 weeks of data for these tests. (Oracle 8.1.7, running on a dual 2.4 GHz Xeon with 2 GB RAM and a SCSI disk, not RAID.)
• Started by using a monolithic SQL statement to return the data.
– Without tuning, a single SQL query for a non-existent run took 31 seconds!
– A request that returned 147 values took 247 seconds!
• Tuned by adding indices and studying the Oracle SQL plan.
– Toadsoft has an excellent free-ware tool for this!
– Empty query: 0.3 seconds, but the 147-item query still took 150 seconds!
• Performed the JOINs on the local computer, extracting significant portions of the data, and reduced the number of EVENT_TO_VALUE entries by putting 20 values in a single entry (sketched below).
– Empty query: 0.3 seconds, and the 147-item query down to 31 seconds!
• The Oracle-on-Linux learning curve for DØ also meant downtime (1 month at one point, due to disk problems).
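A hedged sketch of that packing change, extending the hypothetical structs above (the field names are invented):

#include <cstdint>

// Hypothetical: packing 20 values per EVENT_TO_VALUE row cuts the row
// count per 15 s snapshot from ~4000 to ~200.
struct PackedEventToValue {
    int64_t event_id;      // -> Event.id
    int64_t value_id[20];  // 20 value references per row
};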
Data Store Disk Space
[Diagram: N GB of raw data becomes roughly 8 x N GB in the RDB once data, indices, redo logs, rollback, replication, backup, and mirrored copies are counted. Unexpected?]
A good rule of thumb: you need 10x the disk to hold a given amount of data in an RDB.
(From Jack Cranshaw's talk on CDF experiences with DBs, given at the ATLAS Software Workshop last week.)
The History DB was heading towards 100 GB/year.
ROOT II
Use ROOT to hold the data, an RDB to hold the index (a sketch follows).
• History Names branch: an array of Monitor Item names and an index.
• History Data branches: one per data source, each an array of values.
• One ROOT file:
– Keeps the number of branches small (~100).
– Close/open the file on a new day and whenever the monitor data changes (5 files per day is typical).
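A minimal sketch of that layout in ROOT, assuming invented file, tree, and branch names (the real code also writes the names/index branch):

#include <TFile.h>
#include <TTree.h>

// Hypothetical: one branch per data source, each holding an array of
// values; a new file is opened each day.
void writeSnapshots() {
    TFile f("history_day1.root", "RECREATE");
    TTree data("HistoryData", "one branch per data source");

    const int kN = 40;                         // values from one source
    Float_t node42[kN];                        // e.g. one L3 farm node
    data.Branch("node42", node42, "node42[40]/F");

    for (int snap = 0; snap < 4; ++snap) {     // one entry per 15 s snapshot
        for (int i = 0; i < kN; ++i) node42[i] = 0;  // real values go here
        data.Fill();
    }
    data.Write();
}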
ROOT Performance
• A query involving 20 files and 900 values on a 1.3 GHz PIII(M): 7.5 seconds!
– Tricky to achieve: string read back!
– There is evidence that this can be made even faster.
• The index database is currently Access; it will change when speed becomes an issue…
• On track to write ~15 GB/year.
Web Front End
• Undergoing a major rewrite from the prototype version.
• Allows caching of common plot requests.
• Allows authorized users to upload ROOT code to generate custom plots (a hypothetical example follows).
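Purely for illustration, a user-uploaded custom-plot macro might look like this; the tree and branch names, and the convention of handing the macro a TTree*, are assumptions:

#include <TH1F.h>
#include <TTree.h>

// Hypothetical uploaded macro: histogram one archived quantity.
void customPlot(TTree* history) {
    TH1F* h = new TH1F("rate", "L3 event rate;rate;snapshots", 100, 0., 1000.);
    history->Draw("EventRate>>rate");   // fill the histogram from the tree
    h->Draw();
}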
Conclusions
• Iterations of this took 2 years!
– A seemingly simple task…
• Don't store data in a database you won't index on.
– Store it in a separate file or a BLOB.
– If you need to index on something later, regenerate the database from the binary data!
• Oracle and its ilk aren't one-person shows.
– Very hard to develop on a portable, for example.
– Requires a team to manage and run.
• Tools matter!
– A good debugger and development environment.
• You should understand databases.
• RAD meant that 1 hour after starting with Oracle I was writing to it. That allowed me to concentrate mostly on DB design, rather than on coding.
• Used Windows, but…
– No reason any of this couldn't have been done on Linux with PHP, etc.