Lessons from SAM
What data should be in a metadata system
An incomplete guide
06-Jul-2006
Stefan Stonjek: Lessons from SAM
Outline
What is SAM
– The Wallpaper
What is stored in SAM
Useful and useless data
Bits and pieces we learned during the design, implementation and operation of SAM
Don’t reinvent the wheel
What is SAM
Sequential file Access via Metadata
DØ and CDF data handling system
Central database
– with middleware front-end (contains business logic)
“Eierlegende Wollmilchsau” (German idiom, literally an “egg-laying wool-milk-sow”: one system that tries to do everything, compared to the LHC approach)
The File
The central entity in SAM is a data file
File is identified by unique name
– Not necessarily meaningful
File information is divided into
– Physical file data
– Physics data
File provenance
SAM keeps provenance for every file
– Therefore every file has to be in SAM
File relationship is “m x n”
Connection is done via “process”
– Contains: date, executable, version, machine
State machine for files
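The provenance model on this slide can be sketched as a small data structure. This is only an illustration under stated assumptions, not SAM's actual schema: the names `Process` and `ProvenanceGraph`, the field names, and the example file names are all hypothetical. A process record carries the date, executable, version and machine, and links any number of input files to any number of output files, which gives the "m x n" relationship.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Process:
    """Hypothetical process record: the link between input and output files."""
    run_date: date
    executable: str
    version: str
    machine: str
    inputs: tuple   # names of the files that were read
    outputs: tuple  # names of the files that were written


@dataclass
class ProvenanceGraph:
    processes: list = field(default_factory=list)

    def record(self, process):
        self.processes.append(process)

    def parents(self, file_name):
        """Files that were read to produce file_name (direct parents only)."""
        found = set()
        for p in self.processes:
            if file_name in p.outputs:
                found.update(p.inputs)
        return found

    def ancestors(self, file_name):
        """All files file_name was derived from, following every process link."""
        seen = set()
        todo = [file_name]
        while todo:
            for parent in self.parents(todo.pop()):
                if parent not in seen:
                    seen.add(parent)
                    todo.append(parent)
        return seen


graph = ProvenanceGraph()
# Two raw files are merged into one reconstructed file (m inputs, 1 output) ...
graph.record(Process(date(2006, 7, 1), "reco", "v1", "node01",
                     inputs=("raw_a", "raw_b"), outputs=("reco_1",)))
# ... which is then split into two skims (1 input, n outputs).
graph.record(Process(date(2006, 7, 2), "skim", "v3", "node02",
                     inputs=("reco_1",), outputs=("skim_x", "skim_y")))
```

Because every link goes through a process record, asking where `skim_x` came from only works if every intermediate file is registered, which is why every file has to be in SAM.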
Do not delete
No information will be deleted
Files get retired
If a disk size changes, an old alias gets retired and a new alias is used (at least this is the theory)
File name problem
– A re-reprocessing should create a corrected file with the same name
– Solution: file names should not carry information
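The "retire, do not delete" rule and the resulting file name problem can be illustrated with a toy catalog. This is a minimal sketch, not SAM's implementation: the class `FileCatalog`, its method names, and the two states "active" and "retired" are assumptions made for the example.

```python
class FileCatalog:
    """Hypothetical catalog in which entries are retired, never deleted."""

    def __init__(self):
        self._status = {}  # file name -> "active" or "retired"

    def declare(self, name):
        # A name stays reserved forever, even after the file is retired.
        if name in self._status:
            raise ValueError(f"file name already used: {name}")
        self._status[name] = "active"

    def retire(self, name):
        # Retiring keeps the record; only the status changes.
        if self._status.get(name) != "active":
            raise ValueError(f"no active file named: {name}")
        self._status[name] = "retired"

    def active_files(self):
        return sorted(n for n, s in self._status.items() if s == "active")

    def all_files(self):
        return sorted(self._status)


catalog = FileCatalog()
catalog.declare("f0001")  # original file
catalog.retire("f0001")   # found to be faulty, so it is retired
catalog.declare("f0002")  # the corrected file needs a new, opaque name
```

Since the retired record still occupies its name, the corrected file cannot reuse it. With opaque names such as `f0002` that costs nothing; with names that encode run numbers or versions it would force a misleading name.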
SAM station (like LFC+SRM)
Every SAM station has attached disks/tapes
SAM station transfers all the data into its realm (if necessary)
The central system knows which files are where
SAM Grid can send jobs to where the data is
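Sending a job to where the data is can be sketched as a lookup in a central replica table. The table layout, the station names, and the functions `best_station` and `files_to_transfer` are hypothetical; the slide does not say how SAM Grid actually chooses a station.

```python
from collections import Counter

# Hypothetical central catalog: file name -> stations that hold a copy.
locations = {
    "f1": {"station_a", "station_b"},
    "f2": {"station_b"},
    "f3": {"station_b", "station_c"},
    "f4": {"station_c"},
}


def best_station(files, locations):
    """Return the station holding the most of the requested files, or None."""
    counts = Counter()
    for name in files:
        counts.update(locations.get(name, ()))
    if not counts:
        return None
    # Break ties alphabetically so the choice is deterministic.
    return min(counts, key=lambda s: (-counts[s], s))


def files_to_transfer(files, station, locations):
    """Files the chosen station would still have to fetch into its realm."""
    return sorted(n for n in files if station not in locations.get(n, ()))


job_files = ["f1", "f2", "f4"]
station = best_station(job_files, locations)
missing = files_to_transfer(job_files, station, locations)
```

In this example the job goes to `station_b`, which already holds two of the three files and would only have to transfer `f4`.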
SAM projects and processes (like GANGA)
SAM keeps track of all processes which
– Write files into SAM
– Read files from SAM
Processes are grouped in projects
– project reads/writes a whole dataset
– process reads/writes some files from a dataset
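The split between a project (whole dataset) and its processes (some files each) can be sketched as follows. The class `Project`, the method `next_file`, and the bookkeeping dictionary are assumptions for illustration and are not taken from the SAM interface.

```python
class Project:
    """Hypothetical project: hands the files of one dataset to its processes."""

    def __init__(self, dataset):
        self._pending = list(dataset)
        self.delivered = {}  # process id -> files given to that process

    def next_file(self, process_id):
        """Give the next unread file to a process, or None when none are left."""
        if not self._pending:
            return None
        name = self._pending.pop(0)
        self.delivered.setdefault(process_id, []).append(name)
        return name


project = Project(["f1", "f2", "f3"])
project.next_file("proc_1")
project.next_file("proc_2")
project.next_file("proc_1")
```

The project as a whole covers the dataset exactly once, while `project.delivered` records which process read which files, the kind of tracking the slide describes.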
Dimensions
SQL-like language to define a selection of input files which form a dataset
Dimension query is stored in the database
Translated to SQL
Prevents users from accidentally overloading the database server
Dimension vs. plain SQL
Dimensions are easy to use
Dimensions shield the database from typos etc.
A dimension query requires the admins to configure all query types
Run_number > 100 and Run_number < 200
SQL is more flexible!!!
… where Runs.Number > 100 and Runs.number < 200 …
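The trade-off on this slide can be made concrete with a toy translator. It is a sketch under stated assumptions, not SAM's dimension parser: the grammar (integer comparisons joined by "and"), the `DIMENSIONS` table, the function `dimension_to_sql`, and the `?` placeholder style are all hypothetical. Only the mapping of `Run_number` to `Runs.Number` comes from the slide.

```python
import re

# Hypothetical admin-configured mapping: dimension name -> database column.
DIMENSIONS = {"run_number": "Runs.Number"}

# One term: <dimension> <comparison operator> <integer>
TERM = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>|=)\s*(\d+)\s*$")


def dimension_to_sql(query):
    """Translate 'dim op int and ...' into a WHERE fragment plus parameters."""
    clauses, params = [], []
    for term in re.split(r"\s+and\s+", query, flags=re.IGNORECASE):
        match = TERM.match(term)
        if match is None:
            raise ValueError(f"cannot parse term: {term!r}")
        name, op, value = match.groups()
        column = DIMENSIONS.get(name.lower())
        if column is None:
            # Unknown dimensions (including typos) never reach the database.
            raise ValueError(f"unknown dimension: {name}")
        clauses.append(f"{column} {op} ?")
        params.append(int(value))
    return " AND ".join(clauses), params


where, params = dimension_to_sql("Run_number > 100 and Run_number < 200")
```

A misspelled dimension is rejected before any SQL is produced, which is the shielding effect. The cost is equally visible: a query on a column that has no entry in `DIMENSIONS` is impossible until an admin adds one, whereas plain SQL could express it directly.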
Memo
Users should think “dataset”, not “file”
A “file” is the atomic unit for any data handling system
In some cases a single file might be useless
Oracle 9 vs. Oracle 10
Rule-based optimization is gone in Oracle 10
Therefore the optimal solution is different for Oracle 9 and Oracle 10
Summary
A stupid man doesn’t learn from his mistakes
A clever man does learn from his mistakes
A wise man learns from others’ mistakes
LCG Grid can learn from SAMGrid
– (I hope)