SkyQuery: a distributed probabilistic join infrastructure of astronomy

Transcript SkyQuery: a distributed probabilistic join infrastructure of astronomy

László Dobos1,2, Tamás Budavári2, Nolan Li2,
Alex Szalay2, István Csabai1
[email protected] , [email protected]
1 Eötvös Loránd University, Budapest,
Hungary
Department of Physics of Complex Systems
2 The Johns Hopkins University,
Baltimore, USA
Department of Physics & Astronomy
24th SSDBM conference, 25-27 June, 2012 ‒ Chania, Crete, Greece

Astronomical catalogs
 in RDBMS
 o(100 million) objects
 o(1TB – 10TB) DB size

Done by coordinates
 RA, Dec
 Astrometric error
 Different sky coverage
 Different wavelength range
 Moving objects etc.
infrared (2MASS)
visible (DSS)
ultraviolet (Galex)

All data in RDBMS
 run computation inside the database
 use multiple servers and parallelize
 must be transparent for users

Astronomers „script” what they do
 multiple re-runs, tweak parameters etc.
 huge web forms: no-no

Use SQL to formulate the problem
 functions and language extensions to support astronomy
 extra syntax to describe the coordinate-based probabilistic join
 spatial constraints: celestial regions
SELECT s.objId, g.objID, t.objID,
s.ra, s.dec, g.ra, g.dec, t.ra, t.dec,
x.ra, x.dec
FROM SDSSDR7:Galaxies AS s
Standard
SQL
CROSS JOIN Galex:Galaxies AS g
CROSS JOIN TwoMASS:ExtendedSources AS t
XMATCH BAYESIAN AS x
MUST s ON POINT(s.cx, s.cy, s.cz), 0.1
MUST g ON POINT(g.ra, g.dec), 0.2
MAY
Probabilistic
crossmatch
t ON POINT(t.ra, t.dec), 0.5
HAVING LIMIT 1e3
REGION CIRCLE J2000 165.7, 0.3, 60
Spatial
constraint
Custom SQL
query
Workflow of
many traditional
SQL queries
Parsing
Job queue
Spatial
Partitioning
Parallel
Execution

SQL is declarative
 everything can be executed that can be expressed
 extensions must be executable in any case

Query optimization is hard

Design language with easy optimization in mind
 constrain on the level of the grammar
 custom clauses instead of complex where clause logic
Remote
Virtual Observatory
Data Source
Internet
Graywulf Database Server Cluster
SQL queries
MyDB
SkyQuery
Web Interface
Job Scheduler
XMATCH query
SDSS × 2MASS = ?
Cluster Registry

Registry
 complete description of the server cluster
 from machine group to disk volumes
 contains all info for optimal database allocation

Management tools
 allocate, resize, copy, mirror etc. databases
 monitor cluster status

Scheduler





co-location aware query execution
jobs implemented as workflows
.Net Workflow Foundation (WF4)
parallel execution out-of-the-box
extensive logging, persistence, retry logic etc.

SQL parser generator
 supports grammar inheritence
 easy to add custom extensions to plain SQL

Metadata tools
 Tag SQL scripts with metadata
 Make it accessible from web interface
 Extract provenance information from user queries

User web interface
 write and submit queries
 access to own database (MyDB)

Jobs
 Crossmatch workflow, etc.

Current system: focus on astronomy/crossmatch
 Implement spatial constraints

Extend to a generic framework + API
 mirroring, sharding of datasets
 query partitioning
 limited distributed joins
 transparent access to remote datasets
 smart caching of remote data / query results

SkyQuery: a distributed probabilistic join infrastructure of astronomy

Transcript SkyQuery: a distributed probabilistic join infrastructure of astronomy

Directory