SkyQuery: a distributed probabilistic join infrastructure of astronomy
Download
Report
Transcript SkyQuery: a distributed probabilistic join infrastructure of astronomy
László Dobos1,2, Tamás Budavári2, Nolan Li2,
Alex Szalay2, István Csabai1
[email protected] , [email protected]
1 Eötvös Loránd University, Budapest,
Hungary
Department of Physics of Complex Systems
2 The Johns Hopkins University,
Baltimore, USA
Department of Physics & Astronomy
24th SSDBM conference, 25-27 June, 2012 ‒ Chania, Crete, Greece
Astronomical catalogs
in RDBMS
o(100 million) objects
o(1TB – 10TB) DB size
Done by coordinates
RA, Dec
Astrometric error
Different sky coverage
Different wavelength range
Moving objects etc.
infrared (2MASS)
visible (DSS)
ultraviolet (Galex)
All data in RDBMS
run computation inside the database
use multiple servers and parallelize
must be transparent for users
Astronomers „script” what they do
multiple re-runs, tweak parameters etc.
huge web forms: no-no
Use SQL to formulate the problem
functions and language extensions to support astronomy
extra syntax to describe the coordinate-based probabilistic join
spatial constraints: celestial regions
SELECT s.objId, g.objID, t.objID,
s.ra, s.dec, g.ra, g.dec, t.ra, t.dec,
x.ra, x.dec
FROM SDSSDR7:Galaxies AS s
Standard
SQL
CROSS JOIN Galex:Galaxies AS g
CROSS JOIN TwoMASS:ExtendedSources AS t
XMATCH BAYESIAN AS x
MUST s ON POINT(s.cx, s.cy, s.cz), 0.1
MUST g ON POINT(g.ra, g.dec), 0.2
MAY
Probabilistic
crossmatch
t ON POINT(t.ra, t.dec), 0.5
HAVING LIMIT 1e3
REGION CIRCLE J2000 165.7, 0.3, 60
Spatial
constraint
Custom SQL
query
Workflow of
many traditional
SQL queries
Parsing
Job queue
Spatial
Partitioning
Parallel
Execution
SQL is declarative
everything can be executed that can be expressed
extensions must be executable in any case
Query optimization is hard
Design language with easy optimization in mind
constrain on the level of the grammar
custom clauses instead of complex where clause logic
Remote
Virtual Observatory
Data Source
Internet
Graywulf Database Server Cluster
SQL queries
MyDB
SkyQuery
Web Interface
Job Scheduler
XMATCH query
SDSS × 2MASS = ?
Cluster Registry
Registry
complete description of the server cluster
from machine group to disk volumes
contains all info for optimal database allocation
Management tools
allocate, resize, copy, mirror etc. databases
monitor cluster status
Scheduler
co-location aware query execution
jobs implemented as workflows
.Net Workflow Foundation (WF4)
parallel execution out-of-the-box
extensive logging, persistence, retry logic etc.
SQL parser generator
supports grammar inheritence
easy to add custom extensions to plain SQL
Metadata tools
Tag SQL scripts with metadata
Make it accessible from web interface
Extract provenance information from user queries
User web interface
write and submit queries
access to own database (MyDB)
Jobs
Crossmatch workflow, etc.
Current system: focus on astronomy/crossmatch
Implement spatial constraints
Extend to a generic framework + API
mirroring, sharding of datasets
query partitioning
limited distributed joins
transparent access to remote datasets
smart caching of remote data / query results