
WFCAM Science Archive
Nigel Hambly
Wide Field Astronomy Unit
Institute for Astronomy, University of Edinburgh
VO as a Data Grid, NeSC ‘03
Background & context
• Wide Field Astronomy:
- large-scale public surveys
- multi-colour, multi-epoch imaging data sets
• Developments over recent decades:
- whole-sky Schmidt telescope surveys (eg. SuperCOSMOS)
- current generation optical/IR, eg. SDSS, WFCAM
- next generation, eg. VISTA
=> Prime examples of key datasets that will be the cornerstone of the VO Data Grid
SuperCOSMOS scans photographic media:
• 10 Gbyte/day
• 3 colours: B, R & I
• 1 colour (R) at 2 epochs
• 0.7″/pixel
• 2 byte/pixel
• whole sky
• total data volume (pix): ~15 Tbyte
• S hemisphere completed 2002
(N hemisphere by end 2005)
WFCAM will image the sky directly using IR sensitive
detectors; deployment on a 4m telescope (UKIRT):
• 100 Gbyte/night
• 5 colours: ZYJHK; some multi-epoch imaging
• 0.4″/pixel
• 4 byte/pixel
• ~10% sky coverage in selected areas (various depths)
• total data volume (pix): ~100 Tbyte
• observations start in 2004; 7 yr programme planned
VISTA (also 4m) will have 4x as many IR detectors as WFCAM:
• 500 Gbyte/night
• 4 colours: zJHK
• targeted surveys (various depths & areas)
• 0.34″/pixel
• total data volume (pix): ~0.5 Pbyte
• observations start at the end of 2006
Characteristics of astronomy DBs (I)
• pixel images processed into lists of parameterised detections known as “catalogues” (parameterised data typically <10% of pixel data volume)
• detection association within survey data yielding multi-colour, multi-epoch source records
Characteristics of astronomy DBs (II)
• detailed (but relatively small) amount of descriptive data with images and catalogues
• required to track descriptive data and images along with catalogue data
• for current/future generation surveys, processing and ingest are dictated by observing patterns
• but users require well-defined, stable catalogue products on which to do their science
=> hence require periodic release of stable, well-defined, read-only catalogues
Typical usages (I)
• increasingly involve jointly querying different survey datasets in different databases
- example shows stellar population discrimination using SDSS colours and SSA proper motions (Digby et al., astro-ph/0304056, MNRAS, in press)
Typical usages (II)
• position & proximity searches very common
- spatial indexing (2d, spherical geometry) required
• statistical studies: ensemble characteristics of different species of source (see the sketch below)
• one-in-a-million searches for peculiar sources with highly detailed, specific properties
- whole table scans
• …?
=> enable flexible interrogation to inspire new, innovative
usage and promote new science
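For instance, an ensemble-statistics query of the kind noted above typically forces a scan of the whole table. The following is illustrative only; classR1 and bestmagR1 follow the SSA example query shown later in this talk, while bestmagB is an assumed column name:

/* Illustrative only: ensemble statistics (mean B-R colour per
   morphological class) computed over the full Source table. */
select classR1, count(*) as n, avg(bestmagB - bestmagR1) as meanColour
from Source
group by classR1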
Science archive development at WFAU:
• SSA: a few Tbytes
• WSA = 10x SSA
• VSA = 5x WSA
=> approach is to set up a prototype archive system now (SSA), expand and implement WSA to coincide with WFCAM ops, then scale to VSA.
Database design: key requirements (I)
Flexibility:
• ingested data are rich in structure
• daily ingest; daily/weekly/monthly curation
• many varied usage modes
• protect proprietary rights
• allow for changes/enhancements in design
Database design: key requirements (II)
Scalability:
• ~2 Tbytes of new data per year
• operating lifetime > 5 years
• maintain performance for increasing data volumes
Portability:
• V1.0/V2.0 phased approach to hardware/OS/DBMS
Database design: fundamentals (I)
• RDBMS, not OODBMS
• WSA V1.0: Windows/SQL Server (“SkyServer”)
- V2.0 may be the same, DB2, or Oracle
• Image data stored as external flat files, not BLOBs
- but image metadata stored in DBMS
• All attributes “not null”, ie. mandatory values
• Archive curation information stored in DBMS
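A minimal sketch of what these choices imply for the schema (hypothetical table and column names, not the actual WSA design): image pixels stay in external flat files while the DBMS records the filename and descriptive metadata, with every attribute mandatory:

/* Hypothetical sketch: the DBMS holds the external filename plus image
   metadata; all attributes are declared not null. */
create table Multiframe (
    multiframeID bigint       not null primary key,  -- meaningful UID (see later)
    fileName     varchar(256) not null,              -- path to the external FITS file
    utDate       datetime     not null,
    filterID     int          not null
)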
Database design: fundamentals (II)
• Calibration coefficients stored for astrometry & photometry
- instrumental quantities stored (XY in pix; flux in ADU)
- calibrated quantities stored based on current calibration
- all previous coefficients and versioning stored
Database design: fundamentals (III)
• Reruns: reprocessed image data
- same observations yield new source attribute values
- re-ingest, but retain old parameterisation
• Repeats: better measurements of the same source
- eg. stacked image detections
- again, retain old parameterisation
• Duplicates: same source & filter but different observation
- eg. overlap regions
- store all data, and flag “best”
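A minimal sketch of how the “best” flag might then be used in a query (Detection and bestDuplicate are illustrative names only):

/* Illustrative only: in overlap regions, keep just the detection flagged
   as "best" for each source & filter combination. */
select sourceID, filterID, calMag
from Detection
where bestDuplicate = 1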
Hardware design (I)
• separate servers for
- pixels
- catalogue curation
- catalogue public access
- web services
• different hardware solutions
- mass storage on IDE with HW RAID5
- high-bandwidth catalogue servers using SCSI and SW RAID
Hardware design (II)
• mass storage of pixels
using low-cost IDE
Hardware design (III)
• dual P4 Xeon server
• independent PCI-X buses for maximum bandwidth
• dual-channel Ultra320 SCSI adapters
High-bandwidth catalogue server
Hardware design (IV)
• individual Seagate 146 Gbyte disks sustain > 50 Mbyte/s sequential read
• Ultra320 saturates at 200 Mbyte/s in one channel
• 4 disks per channel
• SW RAID striping across disks
(following the SkyServer design of Gray, Szalay & colleagues)
The SuperCOSMOS Science Archive (SSA)
• WFCAM Science Archive prototype
• Existing ad hoc flat file archive (inflexible, restricted access)
re-implemented in an RDBMS
• Catalogue data only (no image pixel data)
• 1.3 Tbytes of catalogue data
• Implement a working service for users & developers to
exercise prior to arrival of Tbytes of WFCAM data
SSA has several similarities to WSA:
• spatial indexing is required over celestial sphere
• many source attributes in common, eg. position,
brightness, colour, shape, …
• multi-colour, multi-epoch detection information
results from multiple measurements of the same
source
Development method: “20 queries approach”
• a set of real-world astronomical queries, expressed in SQL
• includes joint queries between the SSA and SDSS
Example:
/* Q14: Provide a list of stars with multiple epoch measurements,
which have light variations >0.5 mag. */
select objid into results
from Source
where (classR1=1 and classR2=1 and qualR1<128 and qualR2<128)
and abs(bestmagR1-bestmagR2) > 0.5
SSA relational model:
• relatively simple
• catalogues have ~256 byte records with mainly 4-byte attributes, ie. 50 to 60 per record
• so 2 tables dominate the DB
- Detection: 0.83 Tbyte
- Source: 0.44 Tbyte
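(As a rough sanity check: at ~256 bytes per record, 0.83 Tbyte corresponds to roughly 3 billion Detection rows and 0.44 Tbyte to of order 1.7 billion Source rows.)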
SSA has been implemented & data are being ingested:
WSA has significant differences, however:
• catalogue and pixel data;
• science-driven, nested survey programmes (as opposed to SSA “atlas” maps of the whole sky) result in a complex data structure;
• curation & update within DBMS (whereas SSA is a finished
data product ingested once into the DBMS).
WFCAM Science Archive: relational design
WFCAM Science Archive
Schematic picture of the WSA:
• Pixels:
- one flat-file image store; access layer restricts public access
- filenames and all metadata are tracked in DBMS tables with unrestricted access
• Catalogues:
- WFAU incremental (no public access)
- Public, released DBs
- external survey datasets also held
Image metadata relational model
• Programme & Field => vital
• library calibration frames stored & related
• primary/extension HDU keys logically stored & related
• this will work for VISTA
Astrometric and photometric calibration data:
• required to store calibration information
• recalibration is required – esp. photometric
• old calibration coefficients must be stored
• time-dependence (versioning) complicates the relational
model
Calibration data are related to images; source detections are
related to images and hence their relevant calibration data
Image calibration data:
• “set-ups” define nightly detector & filter combinations:
- extinctions have nightly values
- zero points (ZPs) have detector & nightly values (see the sketch below)
• coefficients split into current & previous entities
• versioning & timing recorded
• highly non-linear systematics are allowed for via 2D maps
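To illustrate how the current coefficients could be applied on the fly, a sketch only (table and column names are invented, and this is just one common form of the photometric calibration equation):

/* Illustrative only: calibrated magnitude from the stored instrumental
   flux (ADU), the current zero point and the nightly extinction. */
select d.detectionID,
       z.zeroPoint - 2.5*log10(d.instFlux/f.expTime) - e.extinction*f.airmass
           as calMag
from Detection as d
join Multiframe as f on f.multiframeID = d.multiframeID
join CurrentZeroPoint as z on z.detectorID = d.detectorID and z.night = f.night
join CurrentExtinction as e on e.filterID = f.filterID and e.night = f.night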
Catalogue data: general model
• related back through progenitor image to calibration data
• detection list for each programme (or set of sub-surveys)
• merged source entity is maintained
• merge events recorded
• list re-measurements derived
Non-WFCAM data: general model
• each non-WFCAM survey has a stored catalogue (currently held locally).
• cross-neighbour table:
- records nearby sources between any two surveys
- yields associated (“nearest”) source
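A sketch of how such a table might be queried to pick out the nearest SDSS match for each source (all names here are illustrative):

/* Illustrative only: for each source, select the nearest SDSS neighbour
   recorded in a pre-computed cross-neighbour table. */
select n.sourceID, n.sdssObjID, n.distance
from SourceXSDSSNeighbour as n
where n.distance = (select min(n2.distance)
                    from SourceXSDSSNeighbour as n2
                    where n2.sourceID = n.sourceID)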
Example: UKIDSS LAS & relationship to SDSS
• UKIDSS LAS overlaps with SDSS
• list measurements:
- at positions defined by IR source, but in optical image data;
- do not currently envisage implementing this the other way (ie. optical source positions placed in IR image data)
Curation:
– set of entities to track in-DBMS processing:
• archived programmes have:
- required filter set
- required join(s)
- required list-driven measurement product(s)
- release date(s)
- final curation task
- one or more curation timestamps
• a set of curation procedures is defined for the archive
WFCAM Science Archive: V1.0 schema implementation
Implementation: unique identifiers (UIDs)
• meaningful UIDs, not arbitrary DBMS-assigned sequence no.
• following relational model, compound UIDs from appropriate
attributes, eg.
- detection UID is a combination of sequence no. on detector
and detector UID
- detector UID is a combination of extension no. of detector
and multiframe UID
• but: top-level UIDs compounded into new attribute to avoid
copying many columns down the relational hierarchy, eg.
- meaningful multiframe UID is made up from UKIRT run no.,
and observation and ingest dates.
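For illustration, such compound UIDs can be packed into a single integer attribute by reserving a fixed number of bits for each component; the bit allocations below are assumptions for the sketch, not the actual WSA encoding:

/* Hypothetical packing of compound UIDs into single integers. */
declare @multiframeID bigint, @extNum int, @seqNum int,
        @detectorID bigint, @detectionID bigint
set @multiframeID = 12345      -- e.g. derived from UKIRT run no. + obs/ingest dates
set @extNum = 3                -- FITS extension no. of the detector
set @seqNum = 67890            -- detection sequence no. on that detector
set @detectorID  = @multiframeID * 16 + @extNum      -- 4 bits assumed for extension no.
set @detectionID = @detectorID * 16777216 + @seqNum  -- 24 bits assumed for sequence no.
select @detectorID as detectorID, @detectionID as detectionID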
Implementation: SQL Server database picture (I)
• Multiframe & nearest neighbour tables
Implementation: SQL Server database picture (II)
• UKIDSS LAS & nearest neighbour tables
Implementation: spatial index attributes
• Hierarchical Triangular Mesh algorithm
(courtesy of P. Kunszt, A. Szalay & colleagues)
• HTM attribute HTMID for each occurrence of RA & Dec
• SkyServer functions & stored procedures:
- spHTM_Lookup, spHTM_Cover, spHTM_To_String,
fHTM_Cover etc.
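A sketch of a cone search built on these routines; the argument format and returned column names of fHTM_Cover are assumed here from SkyServer conventions:

/* Sketch only: fHTM_Cover is assumed to take a
   'CIRCLE J2000 <ra> <dec> <radius in arcmin>' specification and return a
   table of (HTMIDstart, HTMIDend) ranges covering the region. */
select s.objID, s.ra, s.dec
from Source as s
join dbo.fHTM_Cover('CIRCLE J2000 180.0 -30.0 1.0') as c
     on s.htmID between c.HTMIDstart and c.HTMIDend
-- the HTM ranges are only a coarse spatial filter; an exact angular-distance
-- cut on (ra, dec) would normally follow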
Implementation: table indexing
• standard RDBMS practice: index tables on commonly used
fields
• one “clustered” index per table based on primary key (default)
- results in re-ordering of data on disk
• further non-clustered indices:
- when indexing on more than one field, put in order of
decreasing selectivity
- HTM index attribute is included as most selective in at least
one non-clustered index on appropriate tables
- index files stored on different disk volumes to tables to help
minimise disk “thrashing”
=> experimentation required with real astronomical data and queries: SSA prototype
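A purely illustrative example of the resulting index definitions (hypothetical names; the real key choices would come out of the SSA experimentation noted above):

/* Illustrative only: clustered index on the primary key, plus a
   non-clustered index led by the highly selective HTM attribute;
   placement of the index files on separate disk volumes is omitted here. */
create clustered index cl_Source_sourceID on Source (sourceID)
create nonclustered index nc_Source_htmID on Source (htmID, ra, dec)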
User interface & Grid context (I)
• “traditional” interfaces (ftp/http), eg. existing implementations:
- WWW form interface
- access via CDS Aladin tool
User interface & Grid context (II)
• SQL form interfaces:
User interface & Grid context (III)
• web services under development (XML/SOAP/VOtable)
• other data (eg. SDSS, 2MASS, …) mirrored locally initially
• but aspiration is to enable usages employing distributed
resources (both data and CPU) ultimately
=> recast web services as Grid services to integrate WSA into the VO Data Grid