Astronomical Data Archiving and Curation
Clive Page
AstroGrid Project
University of Leicester
2004 March 22
Importance of Data Archiving in Astronomy
• No observation can be repeated exactly, as the
sky is always changing
– After a violent event (e.g. supernova explosion)
earlier observations are crucial
• Observations over a long period can identify
– Variability
– Proper motions
• In recent years all data come in digital form
• Important earlier datasets on photographic
plates have now mostly been digitised.
Principal Data Types in Archives
• Raw data from telescopes
• Observing logs
• Calibration datasets
• Calibrated/reduced data:
– Images
– Spectra
– Time-series
• Derived data products:
– Source catalogues
– Sky survey image collections
Data Formats
• A variety, but FITS format predominates:
– FITS can store arrays and tables, and encapsulates
data and metadata, but…
• Standards have evolved, older FITS files less compatible
• Individual observatory conventions also exist
• Metadata vital - sometimes to be found only:
– In associated software packages or documentation
– In the heads of those developing the software
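To make "encapsulates data and metadata" concrete, here is a minimal sketch of how a FITS header carries metadata as fixed 80-character cards padded to 2880-byte blocks. It is an illustration only, not a full FITS reader or writer: the helper names (make_card, build_header, parse_header) are hypothetical, only logical, integer and simple string values are handled, and the example keyword values are made up.

```python
CARD_LEN = 80      # each header card is 80 ASCII characters
BLOCK_LEN = 2880   # header and data units are padded to 2880-byte blocks


def make_card(keyword, value, comment=""):
    """Format one fixed-format card: keyword in columns 1-8, '= ' in 9-10."""
    if isinstance(value, bool):
        text = ("T" if value else "F").rjust(20)
    elif isinstance(value, int):
        text = str(value).rjust(20)
    else:
        # Simplification: no escaping of quotes inside string values.
        text = ("'" + str(value).ljust(8) + "'").ljust(20)
    card = f"{keyword:<8}= {text}"
    if comment:
        card += f" / {comment}"
    return card[:CARD_LEN].ljust(CARD_LEN)


def build_header(cards):
    """Join cards, append END, and pad with spaces to whole blocks."""
    text = "".join(cards) + "END".ljust(CARD_LEN)
    padding = -len(text) % BLOCK_LEN
    return (text + " " * padding).encode("ascii")


def parse_header(raw):
    """Read keyword/value pairs back as strings (values are not type-converted)."""
    header = {}
    for start in range(0, len(raw), CARD_LEN):
        card = raw[start:start + CARD_LEN].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # blank or commentary card
        # Simplification: assumes no '/' inside the value itself.
        value = card[10:].split("/", 1)[0].strip()
        if value.startswith("'"):
            value = value.strip("'").strip()
        header[keyword] = value
    return header


cards = [
    make_card("SIMPLE", True, "conforms to FITS standard"),
    make_card("BITPIX", 16, "bits per data value"),
    make_card("NAXIS", 0, "no data array in this sketch"),
    make_card("TELESCOP", "ROSAT", "example metadata keyword"),
]
raw = build_header(cards)
header = parse_header(raw)
print(len(raw), header)
```

Anything not written into such cards (an observatory's private conventions, for instance) is exactly the metadata that ends up only in software or in people's heads.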
Important UK data archive sites
• Cambridge - Astronomical Survey Unit (CASU):
– INT wide-field survey, APM catalogue, VizieR mirror,
UKIRT archive. In future: WFCAM, VISTA.
• Edinburgh – Wide-field Astronomy Unit (WFAU)
– SuperCOSMOS images and catalogue, 6dF Galaxy
Survey, Sloan Digital Sky Survey (SDSS) copy. In future: WFCAM, VISTA.
• Leicester - Data Archive Service (LEDAS):
– EXOSAT, GINGA, ASCA, ROSAT, XMM; Chandra
mirror, many optical datasets. In future: SWIFT,
SuperWASP source archive.
Important UK data archive sites (continued)
• Manchester - Jodrell Bank:
– MERLIN, HI surveys, European VLBI datasets, pulsar
catalogues. Future: e-MERLIN archive.
• Rutherford Laboratory:
– World Data Centre for STP, CLUSTER and ISO UK
data centres, Starlink software collection and data
archive. In future: SuperWASP image archive.
• UCL - Mullard Space Science Laboratory:
– YOHKOH, SOHO, TRACE, ReSIK and other
solar/STP archives.
Database management systems
• DBMS currently used by UK archives include:
– BROWSE – written at ESOC/ESTEC in 1980s.
– DB2 (IBM)
– Ingres
– miniSQL – free simple DBMS
– MySQL – open source, supports many web sites
– PostgreSQL – open source, good spatial indexing
– SQL Server (Microsoft)
– Sybase ASE
– WFCtools – written at Harvard/SAO for accessing
large optical catalogues
User access methods
• Residual telnet/ssh services
– Allow registered users to perform DBMS operations,
store their own subsets, etc.
– Mostly obsolescent
• FTP access for large downloads
• Web interfaces use CGI with Perl, PHP, or
Python
– Results mostly returned as HTML tables/GIFs, with
some FITS and VOTable.
• No use (pre-AstroGrid) of XML-based Web
Services (XForms, SOAP, WSDL etc.)
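As an illustration of why an XML table format is easier for software to consume than an HTML table, the sketch below reads a small VOTable-style document with the Python standard library. The document is hand-written for this example: it omits the XML namespace that real VOTables normally declare, the rows are made-up values, and read_votable is a hypothetical helper that assumes every column is numeric.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the VOTable layout (FIELD definitions, then TABLEDATA).
VOTABLE = """<?xml version="1.0"?>
<VOTABLE version="1.0">
  <RESOURCE>
    <TABLE name="sources">
      <FIELD name="RA" datatype="double" unit="deg"/>
      <FIELD name="DEC" datatype="double" unit="deg"/>
      <FIELD name="MAG" datatype="float" unit="mag"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>10.6847</TD><TD>41.2690</TD><TD>3.4</TD></TR>
          <TR><TD>83.8221</TD><TD>-5.3911</TD><TD>4.0</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""


def read_votable(text):
    """Return (field metadata, rows) from the first table in the document."""
    root = ET.fromstring(text)
    table = root.find("./RESOURCE/TABLE")
    # Column metadata (name, datatype, unit) travels with the data.
    fields = [dict(f.attrib) for f in table.findall("FIELD")]
    rows = []
    for tr in table.findall("./DATA/TABLEDATA/TR"):
        cells = [td.text for td in tr.findall("TD")]
        # Assumption for this sketch: all columns are numeric.
        rows.append({f["name"]: float(c) for f, c in zip(fields, cells)})
    return fields, rows


fields, rows = read_votable(VOTABLE)
for row in rows:
    print(row)
```

The column names, datatypes and units arrive in the same document as the values, which is the information an HTML table or GIF usually loses.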
Problems – (1) technical
• Data storage: thanks to Moore’s Law, new datasets are
much bigger than old ones. May get adequate storage
for existing data from:
– new big projects like WFCAM, SWIFT, e-MERLIN, VISTA?
– SRIF funding?
• International Virtual Observatory Alliance (IVOA) is
developing new standards e.g. for tabular data, registry,
query language.
– These have to be implemented before they are fully stable.
• DBMS: freeware like MySQL, PostgreSQL improving
rapidly, probably adequate.
– If not, licence costs may be substantial.
• Database middleware (OGSA-DAI, ELDAS)
– still developing, not quite ready for large-scale use
Problems – (2) structural
• Data preservation requires migration to new
platforms, new DBMS every few years
• Many DBMS in use are incapable of supporting
functionality required e.g. no spatial indexing
– Also implies migration to new DBMS
• AstroGrid (and other VO projects) will supply the
middleware, but have no remit (and no funding)
to update the archives themselves.
• Serious data mining research will require serious
processing power near the data stores (e.g. an
Astronomical Data Warehouse).
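To show what "spatial indexing" buys for a typical sky query, here is a minimal sketch of a cone search (all sources within a given radius of a position) using a simple declination-zone index. This is one basic scheme chosen for illustration, not the method of any particular DBMS named above; the function names, the 1-degree zone height and the four sources are all assumptions. It prunes by declination only, so a production index would also restrict right ascension.

```python
import math
from collections import defaultdict

ZONE_HEIGHT_DEG = 1.0  # assumed zone height; a tuning parameter


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def zone_of(dec):
    """Map a declination in degrees to an integer zone number."""
    return int(math.floor((dec + 90.0) / ZONE_HEIGHT_DEG))


def build_index(sources):
    """Group (name, ra, dec) tuples by declination zone."""
    index = defaultdict(list)
    for name, ra, dec in sources:
        index[zone_of(dec)].append((name, ra, dec))
    return index


def cone_search(index, ra, dec, radius):
    """Names of sources within `radius` degrees of (ra, dec)."""
    lo = zone_of(max(-90.0, dec - radius))
    hi = zone_of(min(90.0, dec + radius))
    hits = []
    # Only zones overlapping the cone are scanned, not the whole catalogue.
    for zone in range(lo, hi + 1):
        for name, s_ra, s_dec in index.get(zone, ()):
            if angular_separation(ra, dec, s_ra, s_dec) <= radius:
                hits.append(name)
    return sorted(hits)


# Made-up sources: (name, RA in degrees, Dec in degrees)
sources = [("A", 10.0, 20.0), ("B", 10.2, 20.1),
           ("C", 10.0, 25.0), ("D", 190.0, -20.0)]
index = build_index(sources)
matches = cone_search(index, 10.0, 20.0, 0.5)
print(matches)
```

Without an index of this kind, every cone search has to compute the separation for every row in the catalogue, which is what makes large catalogues impractical on a DBMS that lacks spatial indexing.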
Problems – (3) managerial
• VO software from AstroGrid includes MySpace:
a temporary user space on remote systems.
– Optional, but highly desirable because of need to
“shift the results not the data”
– will sites give space to users unknown to them?
– how to administer many ad-hoc groups of users?
• Creation of the VO Registry will require
considerable input from managers of existing
data archives – exact mechanism TBD.
Manpower
Additional manpower needed for:
• Migration of existing data collections to new
platforms, and often to new DBMS
• Installation of AstroGrid and other VO software
• Provision of metadata to the Registry
• Implementation and operation of MySpace
• Setting up astronomical data warehouse
facilities at a few sites
Funding problems
• SRIF funding is for hardware only, not
manpower
• AstroGrid2 bid failed to win funding for its data
centre support elements
• PPARC grant applications to support data
archiving and curation have an unhappy history:
they tend to fall between research and projects
funding lines.
Summary
• Archives have a vital role in astronomy
– They are basically in good shape in that no
important bits have been lost (as far as I know)
– But we have been muddling through
• Technical problems look soluble
• Data storage – we may be able to find enough
• Much work needed on current archives for
them to survive into the VO era.
• Additional skilled manpower will be essential
– sources of support for this are lacking
• Continuity is vital for archives – this is a long-term
problem with no obvious solution.