GREAT Workshop on Astrostatistics and Data Mining in Astronomical Databases
La Palma, Spain May 30 - June 3, 2011
Data Management at Gaia Data Processing Centers
Pilar de Teodoro Idiago
Gaia Database Administrator
European Space Astronomy Center (ESAC)
Madrid
Spain
http://www.rssd.esa.int/Gaia
Data Processing Centres
DPCE (ESAC)
DPCB (Barcelona)
DPCC (CNES)
DPCG (Obs. Geneva / ISDC)
DPCI (IoA, Cambridge)
DPCT (Torino)
All contributed to this talk
Processing Overview (simplified)
•Initial Data Treatment (CU3): turn CCD transits into source observations on sky; should be a linear transform
•Astrometric Treatment (CU3/SOC): fix geometrical calibration, adjust attitude, fix source positions
•Photometry Treatment (CU5): calibrate flux scale, give magnitudes
•Spectral Treatment (CU6): calibrate and disentangle, provide spectra
•Solar System (CU4)
•Non Single Systems (CU4)
•Variability (CU7)
•Astrophysical Parameters (CU8)
•Many iterations
•Catalogue
DPCE
DPCB
DPCC (CNES)
•CU4 (Objects Processing)
•CU6 (Spectroscopic processing)
•CU8 (Astrophysical Parameters)
Solutions based on:
•performance
•scalability of the solution
•data safety
•impacts on the existing software
•impacts on the hardware architecture
•cost of the solution during the whole mission
•durability of the solution
•administration and monitoring tools
DPCG
•Detection and characterization of variable sources observed by Gaia (CU7)
•Analytical queries must be done over sources or processing results (attributes) to support unknown research requirements.
•Timeseries reconstruction while importing MDB data
•Parameter analysis for simulations and configuration changes on historical database.
•ETL-like support must be done for external data.
•At present Apache OpenJPA. PostgreSQL used as well.
•Other alternatives: Hadoop, SciDB, VoltDB and extensions to PG.
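The timeseries reconstruction step mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that imported records arrive as (source_id, obs_time, magnitude) tuples in arbitrary order; the record layout and the function name `reconstruct_timeseries` are hypothetical and not taken from the DPCG software.

```python
from collections import defaultdict


def reconstruct_timeseries(observations):
    """Group per-transit records by source and order each group by time.

    `observations` is an iterable of (source_id, obs_time, magnitude) tuples.
    This record layout is an assumption made for illustration only.
    """
    series = defaultdict(list)
    for source_id, obs_time, magnitude in observations:
        series[source_id].append((obs_time, magnitude))
    # Sort each source's points chronologically (tuples sort by obs_time first).
    return {sid: sorted(points) for sid, points in series.items()}


# Hypothetical import batch: records arrive unordered and interleaved by source.
observations = [
    (2, 10.5, 14.2),
    (1, 3.0, 12.1),
    (1, 1.0, 12.3),
    (2, 4.5, 14.0),
]
timeseries = reconstruct_timeseries(observations)
```

Doing the grouping during import means later variability analysis can read one contiguous, time-ordered series per source instead of querying scattered transit records.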
DPCI
Given the use case:
•bulk-processing of a large data set
•data volume increases with time (DPAC-wide iterations)
We can state that:
•Random data access is expensive and less efficient than sequential access.
•Hub-and-Spoke architecture is prone to bottlenecks and therefore does not scale very well with the number of clients.
•Hadoop adopted in 2009
•HDFS: distributed filesystem
•Map/Reduce jobs to minimize synchronization
•DAL much simpler
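The Map/Reduce pattern referred to above can be sketched in plain Python. This is not the Hadoop API and not DPCI code: the function names, the (source_id, flux) record layout, and the mean-flux computation are all assumptions chosen to show how a job reads its input in one sequential pass and how each key is reduced independently, without synchronization between workers.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (key, value) pair per input record; no shared state is touched."""
    source_id, flux = record
    return source_id, flux


def shuffle(pairs):
    """Group mapped values by key, as a framework would do between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values of one key; every key can be reduced independently."""
    return key, sum(values) / len(values)


def run_job(records):
    mapped = (map_phase(r) for r in records)  # single sequential pass over input
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: (source_id, flux) pairs in arbitrary order.
records = [(1, 100.0), (2, 50.0), (1, 110.0), (2, 70.0), (3, 5.0)]
mean_flux = run_job(records)
```

Because the map and reduce functions share no state, a framework is free to run them on many nodes next to the data, which is what avoids the hub-and-spoke bottleneck described in the list.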
DPCT
•CU3 AVU
•IGSL support
•Persistent data management