An SQL API for Object Oriented Perl


The Bookkeeping SQL API
Tim Adye
Rutherford Appleton Laboratory
Bookkeeping / Data Distribution Parallel
BaBar Collaboration Meeting
8th December 2004
Talk Plan
• The problem
• “Why not just write SQL?”
• The BaBar SQL API
  • User view: Perl classes and command-line tool
  • Behind the scenes
  • Table schema configuration classes
  • Summary of features
• Could this be generalised to other applications?
• Possible improvements
• Comparison with other DBIx packages
• Summary and references
User Access
• Users need to query the database to find out what data to process
• They may also need other information
  • e.g. luminosity, run numbers, file sizes
• Mostly select by dataset, but may need to limit further
  • e.g. only data available locally, taken at peak energy, excluding some problem data-taking period
• Cannot expect users to know which tables to use, how to join them, or even the SQL syntax
  • Even worse if the schema changes
• Cannot expect developers to code for all combinations of queries with all possible selections
  • Previous ad hoc tools (some mine!) tried to do this and it was a nightmare, even for a simpler table structure
BaBar SQL API – user view
• Each column that users might want to query or select on is given a unique logical name, regardless of which table it lives in
• These names are used to specify query values
    $query->addValues('collection', 'gbytes');
  and selections
    $query->addSelector('dataset', 'Dilepton-*');
    $query->addSelector('run', '10000-19999');
  (Each of these happens to be in a different table)
• Different types of data allow for different selection syntax, e.g. wildcards for names, or ranges for run numbers
• Can also use SQL expressions (in terms of logical names)
    $query->addValues('SUM(lumi)/1000');
  and sorting, row limits, etc.
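The type-dependent selection syntax could be translated into SQL conditions roughly as follows. This is only a minimal sketch: the function name selector_to_sql, the 'name' and 'range' type labels, and the run.id column are assumptions made for illustration and are not taken from the BaBar code.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a user-supplied selector value into an SQL
# condition. Names and type labels are illustrative, not the BaBar API.
sub selector_to_sql {
    my ($column, $type, $value) = @_;

    if ($type eq 'range' && $value =~ /^(\d+)-(\d+)$/) {
        # '10000-19999' selects an inclusive numeric range
        return "$column BETWEEN $1 AND $2";
    }
    if ($type eq 'name' && $value =~ /[*?]/) {
        # Shell-style wildcards become SQL LIKE wildcards.
        # (Literal % or _ in the value are not escaped in this sketch.)
        (my $pattern = $value) =~ tr/*?/%_/;
        $pattern =~ s/'/''/g;
        return "$column LIKE '$pattern'";
    }
    # Anything else is an exact match; double any embedded quotes
    (my $quoted = $value) =~ s/'/''/g;
    return "$column = '$quoted'";
}

print selector_to_sql('ds.name', 'name',  'Dilepton-*'),         "\n";
print selector_to_sql('run.id',  'range', '10000-19999'),        "\n";
print selector_to_sql('ds.name', 'name',  'A0-Run4-OnPeak-R14'), "\n";
```

Keeping the selector type as a per-column property means the same addSelector() call can accept whichever syntax suits the data.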
SQL API – returning results
• That’s enough to generate a valid SQL SELECT query. To return the results:
    my $sta = $query->execute();
    while (my $row = $sta->fetch()) {
      print $row->gbytes(), $row->collection(), "\n";
    }
  • The $query object collects the user requests
  • $query->execute() returns a “statement accessor” (like a DBI statement handle)
  • $sta iterates over row objects, each of which has an accessor for each query value (here gbytes and collection)
• That’s all there is to it!
  • After the usual DBI connect and $query object instantiation (see later), these statements form a working program
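Row objects with one accessor per query value can be produced in Perl by installing closures into a freshly named package. The sketch below shows that general technique on made-up rows, without a database; the make_row_class function, the SketchRow package names and the collection names are hypothetical, and the real API may build its row class differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the row-class generation: builds a package
# with one accessor method per query value.
my $class_count = 0;

sub make_row_class {
    my @names = @_;
    my $class = 'SketchRow' . ++$class_count;   # fresh package per query
    no strict 'refs';
    for my $i (0 .. $#names) {
        # Install e.g. SketchRow1::gbytes, returning element $i of the row
        *{"${class}::$names[$i]"} = sub { $_[0][$i] };
    }
    return $class;
}

# Rows shaped like the array references DBI returns (made-up values)
my @raw_rows = ([1.6, 'coll-A'], [0.8, 'coll-B']);

my $row_class = make_row_class('gbytes', 'collection');
for my $raw (@raw_rows) {
    my $row = bless $raw, $row_class;
    print $row->gbytes(), ' ', $row->collection(), "\n";
}
```

Because the accessors are real methods, a misspelt query value name fails loudly at the call site instead of silently returning undef as a hash lookup would.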
Command-Line Tools
• Standard BaBar tools use this API to create job configurations, create datasets, calculate luminosities, etc.
  • Standard tasks, but optionally allowing additional selections
• We also provide an “expert” tool that allows access to the full API functionality from the command line
  • This has proved very popular, with many “non-experts” making their own unique queries
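A command-line tool of this kind could derive its option specification from the list of logical names, so that new columns become options without extra code. The following is a minimal sketch using the standard Getopt::Long module; the @selectable list and the parse_command_line function are assumptions for illustration, not the actual BbkUser implementation.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical list of logical names that accept a selection value
my @selectable = qw(dataset run is_local file_status);

sub parse_command_line {
    my @args = @_;
    my %selectors;
    my @spec;
    # One "--name=value" string option per logical name
    push @spec, "$_=s" => \$selectors{$_} for @selectable;
    GetOptionsFromArray(\@args, @spec)
        or die "could not parse options\n";
    # Drop options that were not given on the command line
    delete $selectors{$_} for grep { !defined $selectors{$_} } keys %selectors;
    # The remaining words are the query values to return
    return (\%selectors, \@args);
}

my ($selectors, $values) = parse_command_line(
    '--dataset=A0-Run4-OnPeak-R14', '--is_local=1', 'dse_lumi', 'events');
print "select $_ = $selectors->{$_}\n" for sort keys %$selectors;
print "values: @$values\n";
```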
Examples
$ BbkUser --dataset=A0-Run4-OnPeak-R14 \
          --is_local=1 --file_status=0 \
          dse_lumi events gbytes file \
          --style=adye --display
DSE_LUMI  EVENTS  GBYTES  FILE
========  ======  ======  ====================================================
  1250.3  526115     1.6  /store/PRskims/R14/14.4.0d/A0/02/A0_0239.01.root
  1250.3  526115     0.8  /store/PRskims/R14/14.4.0d/A0/02/A0_0239.02HBCA.root
  1348.4  576239     1.6  /store/PRskims/R14/14.4.0d/A0/02/A0_0240.01.root
  1348.4  576239     1.0  /store/PRskims/R14/14.4.0d/A0/02/A0_0240.02HBCA.root
...
156 rows returned
$ BbkUser --collection-file=coll.lis \
          tot_gbytes collection
What happens behind your back
• The SQL API
  • translates the logical names to table columns
  • selects the required tables and joins
    • including otherwise unused tables required for the joins
  • generates and executes a valid SQL SELECT statement
  • creates a statement accessor object
  • dynamically generates a class for the row objects, with an accessor for each query value
Our Example
    BbkUser --dataset=A0-Run4-OnPeak-R14
            --is_local=1 --file_status=0
            dse_lumi events gbytes file
• That first BbkUser command involved 5 tables
  • including one that provides the join between the dataset and collection tables
SELECT dse.lumi_sum AS "dse_lumi",
       dse.output_nev AS "events",
       file.bytes,
       dse.name AS "collection",
       file.suffix AS "file_suffix",
       ds.id AS "ds_id",
       dse.id AS "dse_id",
       dtd.id AS "dtd_id",
       dtd.link_status
FROM   bbk_dataset ds, bbk_dsentities dse, data_files dfile,
       bbk_files file, bbk_dstodse dtd
WHERE  ds.id=dtd.ds_id
  AND  dtd.dse_id=dse.id
  AND  dse.id=file.dse_id
  AND  file.id=dfile.file_id
  AND  ds.name='A0-Run4-OnPeak-R14'
  AND  dse.is_local='1'
  AND  dfile.status='0';
• The SQL API can even pretty-print it like this for you
  • What’s shown here is somewhat abbreviated: the actual command includes full database and table names in case of ambiguities
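Pulling in tables that are needed only for the joins can be treated as a path search over the graph of possible joins. The sketch below uses a breadth-first search over the five table aliases and the join conditions of the query above; the search algorithm and the join_path function are assumptions for illustration, since the talk does not say how the API resolves joins internally.

```perl
use strict;
use warnings;

# Join graph for the five tables of the example query, keyed by alias.
# The edges follow the join conditions in its WHERE clause.
my %joins = (
    ds    => ['dtd'],
    dtd   => ['ds', 'dse'],
    dse   => ['dtd', 'file'],
    file  => ['dse', 'dfile'],
    dfile => ['file'],
);

# Breadth-first search for the chain of tables linking $from to $to
sub join_path {
    my ($from, $to) = @_;
    my %prev  = ($from => undef);   # table => table it was reached from
    my @queue = ($from);
    while (defined(my $table = shift @queue)) {
        last if $table eq $to;
        for my $next (@{ $joins{$table} }) {
            next if exists $prev{$next};
            $prev{$next} = $table;
            push @queue, $next;
        }
    }
    return unless exists $prev{$to};
    # Walk back from the target to rebuild the path
    my @path;
    for (my $t = $to; defined $t; $t = $prev{$t}) {
        unshift @path, $t;
    }
    return @path;
}

# Columns were requested from ds and file only; dtd and dse come along
print join(', ', join_path('ds', 'file')), "\n";   # ds, dtd, dse, file
```

In an acyclic join graph the path between any two tables is unique, so a search like this never has to choose between alternative joins.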
Table schema configuration classes
• The mapping between logical names and table columns is defined in the configuration classes
  • One class per table
  • Can also define special properties of each column (e.g. whether to allow ranges (“100-199”) for selection)
• Possible joins between tables are defined here too
  • Join conditions use logical column names, so one table class does not need to know about column names in other classes
• In most cases it’s just a matter of listing logical vs. column names
  • with a little Perl syntactic sugar
• Inheritance of config classes expresses commonalities
  • e.g. common id, created, and modified columns
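Sharing common columns through inheritance might look like the sketch below, where a base class supplies the id, created and modified mappings and each table class adds only its own columns. The class names, the prefix() and columns() methods and the prefixed logical names are all hypothetical; the real configuration classes use the table() and tableConfig() methods shown in the next example.

```perl
use strict;
use warnings;

# Hypothetical base class: columns shared by every table
package SketchTableBase;
sub new     { my ($class) = @_; return bless {}, $class }
sub prefix  { die "subclass must define prefix\n" }
sub columns {
    my ($self) = @_;
    my $p = $self->prefix;
    # logical name => database column; the prefix keeps names unique
    return { "${p}_id"       => 'id',
             "${p}_created"  => 'created',
             "${p}_modified" => 'modified' };
}

# A table class lists only what is specific to that table
package SketchFilesTable;
our @ISA = ('SketchTableBase');
sub prefix  { 'file' }
sub columns {
    my ($self) = @_;
    return { %{ $self->SUPER::columns() },
             bytes       => 'bytes',
             file_suffix => 'suffix' };
}

package main;
my $cols = SketchFilesTable->new->columns;
print "$_ => $cols->{$_}\n" for sort keys %$cols;
```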
Example Table Configuration
sub table { return 'bbk_files' }
sub tableConfig {
  return {
    alias   => 'file',
    columns => [
      bytes       => 'bytes',
      uuid        => 'uuid',
      checksum    => 'checksum',
      file_suffix => 'suffix',
      nfiles      => 'COUNT(DISTINCT file_id)',
      gbytes      => '(bytes/1073741824)',
      tot_gbytes  => 'SUM(bytes)/1073741824',
      file_dse_id => { column => 'dse_id', selectorType => 'range' },
      file        => { valueAction => 'addLfnValue', selectorAction => 'lfnSelector' },
    ],
    joins => [
      dse_id  => 'file_dse_id',
      file_id => 'dfile_file_id',
    ],
  };
}
Putting it all together
• Configuration classes must be registered with the $query object
    my $query = new BbkSqlSelect($bbkconfig);
    $query->addModules(new MyTableClass($bbkconfig));
  but of course it is usually simpler to provide a $query object pre-registered with all the table configs as part of a specific API.
Overriding and Synthetic Columns
• A crucial advantage of this system is that it allows us to override the default behaviour
  • Allows us to hide complexities from the user
  • Makes even complex schema changes transparent to users
• A logical name can refer to
  • an ordinary database column name
  • an SQL expression (in terms of database columns, or other logical names)
  • a Perl method to pre- or post-process a selection or query value
  • a “synthetic” query value or selection
    • can return a calculated value or alter behaviour
• Global post-processing
  • Can be triggered by value, selection, or table inclusion
  • Allows global filtering of returned rows
What happens behind your back 2
• We already used some of these features without noticing!
1. Dataset names can be found in the bbk_dataset or the bbk_aliases table
  • Requires a check and translation using the alias table
2. Datasets can evolve with time, with collections being added or removed
  • Need to query the dataset for any time in the past, or use a tagged dataset alias (like a CVS tag)
  • Implemented by automatically including a date selection in the query, and post-processing the returned results to remove deleted collections
3. File names are made from a collection name + a suffix
  • $query->addSelector('file') splits the file name for the query and the $row->file() accessor rejoins them
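The third case amounts to a pair of inverse functions: one applied to the selector before the query, one applied to each returned row. The sketch below illustrates the idea only. The talk does not give the actual splitting rule, so it ASSUMES that the suffix starts at the first dot of the last path component, and the function names are hypothetical.

```perl
use strict;
use warnings;

# ASSUMPTION: the suffix starts at the first dot of the last path
# component. The real BaBar rule is not given in the talk.
sub split_file_name {
    my ($file) = @_;
    my ($collection, $suffix) = $file =~ m{^(.*/[^/.]+)(\.[^/]*)$}
        or die "cannot split '$file'\n";
    return ($collection, $suffix);
}

# Inverse operation, as a row accessor would apply it to query results
sub join_file_name {
    my ($collection, $suffix) = @_;
    return $collection . $suffix;
}

my $file = '/store/PRskims/R14/14.4.0d/A0/02/A0_0239.01.root';
my ($collection, $suffix) = split_file_name($file);
print "collection: $collection\n";
print "suffix:     $suffix\n";
print "rejoined:   ", join_file_name($collection, $suffix), "\n";
```

In the table configuration the file logical name carries no database column at all, only the valueAction and selectorAction hooks, which is presumably where processing of this kind is attached.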
Features
• Supports Oracle 8 and MySQL 3.23
• Most queries that can be expressed in both these dialects can be expressed by users via the API, without breaking the paradigm of a flat namespace
  • aggregation and grouping
  • sorting and distinct
  • MySQL’s LIMIT emulated in Oracle
  • inner and outer joins (generates Oracle or MySQL syntax)
  • Does not support UNION or subqueries
    • Could be added, but not in MySQL 3.23
• Convenience features
  • automatic Getopt specification
  • query results display formatting and summary table generation
  • configuration class summary table generation
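One common way to emulate MySQL’s LIMIT in Oracle is to wrap the statement in an inline view and filter on ROWNUM. The sketch below generates both forms. It is an assumption that the API uses this approach; the add_row_limit function and the dialect labels are hypothetical, and the Oracle form relies on a release that accepts ORDER BY inside an inline view.

```perl
use strict;
use warnings;

# Hypothetical helper: add a row limit to a finished SELECT statement
sub add_row_limit {
    my ($sql, $limit, $dialect) = @_;
    die "limit must be a positive integer\n" unless $limit =~ /^[1-9]\d*$/;

    if ($dialect eq 'mysql') {
        return "$sql LIMIT $limit";
    }
    if ($dialect eq 'oracle') {
        # ROWNUM is assigned before ORDER BY is applied, so the sorted
        # query goes in an inline view and the limit is applied outside
        return "SELECT * FROM ($sql) WHERE ROWNUM <= $limit";
    }
    die "unknown dialect '$dialect'\n";
}

my $sql = 'SELECT name FROM bbk_dataset ORDER BY name';
print add_row_limit($sql, 10, 'mysql'),  "\n";
print add_row_limit($sql, 10, 'oracle'), "\n";
```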
Limitations
• Assumes tables can be joined in a unique way
  • i.e. the joins form an acyclic graph
  • can still select different joins with explicit switches
• Each column must have its own unique logical name
  • This is usually a good thing
  • but if the same data is held in different columns, it would be more efficient to automatically select from tables that are already included
A public version
• The current version has a few BaBar-specific pieces
  • BaBar Connection/Configuration manager: can use DBI directly
  • BaBar Options manager: can use Getopt directly
  • BaBar base objects: borrow required methods
  • BaBar table formatting class: publish this too
  … otherwise it just uses standard Perl modules
• With different table configs it could be used elsewhere
  • We already do this in BaBar: used for the QA and TM databases
  • Maybe I’m making some other assumptions that are true of our database and requirements, but not more generally so. I can’t think of any.
• Needs a better name!
  • This is really an SQL API creator
  • DBIx::SqlAbstractor ???
Possible improvements
• Tidy up code!
  • User and config APIs are OK, but in between it’s pretty ugly
• Separate functionality that can be used on its own
  • Already true of the DBI statement accessor class
• More SQL dialects: PostgreSQL, MS SQL?
• New SQL syntax: subqueries, UNIONs, …
• INSERT, UPDATE, etc.
  • these don’t need joins, so hand-coding is not such a problem
• Automatic selection of different join possibilities
• Automatic generation of default table classes from the SQL schema
  • Could use “The SQL Fairy”
  • though it is not much work to do it by hand
Why not use another package?
• More than 100 DBIx and other SQL access packages on CPAN
• Could not find any that do all (or even most) of the following:
  • hide the table structure from the user
  • allow multi-table queries, taking care of joins automatically
  • do not impose their own conventions on the table schema
  • allow query values and selections to be overridden
  • allow transparent post-processing of query results
  • provide accessor functions for query results
• I believe that taken together these features provide a clear and easy-to-use abstraction
Feedback and Discussion
• Would this be useful outside BaBar?
• Is it a good idea to make a public release?
  • e.g. on CPAN
• Does it need any improvements?
  • New features
  • Make it compatible with some other standards
    • e.g. sit on top of another abstraction like DBIx::Table
  • A better name!
References
• BaBar Bookkeeping project
  http://slac.stanford.edu/BFROOT/www/Computing/Distributed/Bookkeeping/Documentation/
• BaBar Bookkeeping presentation and paper
  http://indico.cern.ch/contributionDisplay.py?contribId=338&sessionId=7&confId=0
  D.A. Smith et al., BaBar Book Keeping project – a distributed meta-data catalog of the BaBar event store, Proc. Computing in High Energy and Nuclear Physics 2004 (CHEP04).
• CPAN Database Interfaces (see particularly DBIx)
  http://cpan.uwinnipeg.ca/chapter/Database_Interfaces