stonjek_Blue_Sky

Lessons from SAM
What data should be in a metadata system
An incomplete guide
06-Jul-2006
Stefan Stonjek: Lessons from SAM
Outline
• What is SAM
– The Wallpaper
• What is stored in SAM
• Useful and useless data
• Bits and pieces we learned during the design, implementation and operation of SAM
• Don’t reinvent the wheel
What is SAM
• Sequential file Access via Metadata
• DØ and CDF data handling system
• Central database
– with middleware front-end (contains business logic)
• “Eierlegende Wollmilchsau” (German idiom: an “egg-laying wool-milk-sow”, i.e. one system that does everything), compared to the LHC approach
The File
• The central entity in SAM is a data file
• A file is identified by a unique name
– Not necessarily meaningful
• File information is divided into
– Physical file data
– Physics data
File provenance
• SAM keeps provenance for every file
– Therefore every file has to be in SAM
• The file relationship is “m x n” (many-to-many)
• Files are connected via a “process”
– Contains: date, executable, version, machine
• State machine for files
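The many-to-many relationship can be illustrated with a small sketch. This is not SAM's schema or API: the `Process` class, its field names, the `ancestors` helper and the sample file names are all hypothetical, chosen only to mirror the attributes listed on the slide (date, executable, version, machine).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Process:
    """Hypothetical record: one run of an executable linking m inputs to n outputs."""
    when: date
    executable: str
    version: str
    machine: str
    inputs: tuple   # names of the files that were read
    outputs: tuple  # names of the files that were written


def ancestors(file_name, processes):
    """Return every file that file_name was derived from, directly or indirectly."""
    found = set()
    stack = [file_name]
    while stack:
        current = stack.pop()
        for proc in processes:
            if current in proc.outputs:
                for parent in proc.inputs:
                    if parent not in found:
                        found.add(parent)
                        stack.append(parent)
    return found


# Illustrative data only: two raw files are reconstructed into one file,
# which is then skimmed into two files.
processes = [
    Process(date(2006, 1, 10), "reco", "v1", "node01", ("raw_a", "raw_b"), ("reco_1",)),
    Process(date(2006, 2, 1), "skim", "v3", "node02", ("reco_1",), ("skim_x", "skim_y")),
]

print(sorted(ancestors("skim_x", processes)))
```

Because the lookup only works if every intermediate file is recorded, the sketch also shows why every file has to be registered: a single unregistered file breaks the chain for everything derived from it.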
Do not delete
• No information is ever deleted
• Files get retired
• If a disk size changes, the old alias gets retired and a new alias is used (at least this is the theory)
• File name problem
– A reprocessing should create a corrected file with the same name
– Solution: file names should not carry information
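A minimal sketch of the retire-instead-of-delete idea, and of why it collides with meaningful file names. The `Catalogue` class and `opaque_name` helper are hypothetical and do not reflect SAM's actual interface. The sketch assumes that a retired name stays occupied forever, which is what makes reusing a name for a corrected file impossible.

```python
import uuid


class Catalogue:
    """Hypothetical append-only catalogue: records are retired, never removed."""

    def __init__(self):
        self._records = {}  # file name -> {"retired": bool, "meta": dict}

    def declare(self, name, meta):
        # A name can be used once, even if the earlier file has been retired.
        if name in self._records:
            raise ValueError(f"name already used: {name!r}")
        self._records[name] = {"retired": False, "meta": dict(meta)}

    def retire(self, name):
        # Mark the record as retired; its metadata is kept.
        self._records[name]["retired"] = True

    def active(self):
        return sorted(n for n, r in self._records.items() if not r["retired"])

    def total(self):
        return len(self._records)


def opaque_name():
    """A name that carries no information, so a corrected file simply gets a new one."""
    return uuid.uuid4().hex


cat = Catalogue()
cat.declare("run100_reco_v1", {"run": 100})
cat.retire("run100_reco_v1")

# The corrected file cannot take the old name, so it gets an opaque one;
# what the file contains is described by its metadata, not by its name.
corrected = opaque_name()
cat.declare(corrected, {"run": 100, "replaces": "run100_reco_v1"})
```

After these calls the catalogue still holds both records, but only the corrected file is active.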
SAM station (like LFC+SRM)
• Every SAM station has attached disks/tapes
• A SAM station transfers all the data into its realm (if necessary)
• The central system knows which files are where
• SAM Grid can send jobs to where the data is
SAM projects and processes (like GANGA)
• SAM keeps track of all processes which
– Write files into SAM
– Read files from SAM
• Processes are grouped in projects
– A project reads/writes a whole dataset
– A process reads/writes some files from a dataset
Dimensions
• SQL-like language to define a selection of input files which form a dataset
• The dimension query is stored in the database
• Translated to SQL
• Prevents users from accidentally overloading the database server
Dimension vs. plain SQL
• Dimensions are easy to use
• Dimensions shield the database from typos etc.
• A dimension query requires the admins to configure all query types
• Run_number > 100 and Run_number < 200
• SQL is more flexible!!!
• … where Runs.Number > 100 and Runs.number < 200 …
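The trade-off can be made concrete with a small translator for the example above. This is a sketch under stated assumptions, not SAM's dimension parser. The `DIMENSIONS` whitelist, the restricted grammar (integer comparisons joined by "and") and the `?` placeholder style are all hypothetical.

```python
import re

# Hypothetical whitelist maintained by the admins: dimension name -> SQL column.
DIMENSIONS = {"run_number": "Runs.Number"}

# One term: <dimension> <comparison operator> <integer>
TERM = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>|=)\s*(\d+)\s*$")


def translate(query):
    """Translate 'dim op int [and ...]' into a WHERE fragment plus bound parameters."""
    clauses, params = [], []
    for term in re.split(r"\s+and\s+", query, flags=re.IGNORECASE):
        match = TERM.match(term)
        if match is None:
            raise ValueError(f"cannot parse term: {term!r}")
        name, op, value = match.groups()
        column = DIMENSIONS.get(name.lower())
        if column is None:
            # Typos and unconfigured dimensions are rejected before reaching the database.
            raise ValueError(f"unknown dimension: {name!r}")
        clauses.append(f"{column} {op} ?")
        params.append(int(value))
    return " AND ".join(clauses), params


where, params = translate("Run_number > 100 and Run_number < 200")
print(where, params)
```

Both sides of the comparison are visible here: a misspelled dimension never reaches the database, but any query type missing from the whitelist is unavailable until an admin configures it.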
Memo
• Users should think “dataset”, not “file”
• A “file” is the atomic unit for any data handling system
• In some cases a single file might be useless
Oracle 9 vs. Oracle 10
• Rule-based optimization is gone
• Therefore the optimal solution is different for Oracle 9 and Oracle 10
Summary
• A stupid man doesn’t learn from his mistakes
• A clever man does learn from his mistakes
• A wise man learns from others’ mistakes
• LCG Grid can learn from SAMGrid
– (I hope)