The Pan -STARRS Data Challenge

Download Report

Transcript The Pan -STARRS Data Challenge

The Pan-STARRS Data
Challenge
Jim Heasley
Institute for Astronomy
University of Hawaii
ICS 624 – 28 March 2011
What is Pan-STARRS?
• Pan-STARRS - a new telescope facility
• 4 smallish (1.8m) telescopes, but with
extremely wide field of view
• Can scan the sky rapidly and repeatedly,
and can detect very faint objects
– Unique time-resolution capability
• Project led by IfA with help from Air Force,
Maui High Performance Computer Center,
MIT’s Lincoln Lab.
• The prototype, PS1, will be operated by an
international consortium
ICS624
Pan-STARRS Overview
•Pan-STARRS observatory specifications
–Four 1.8m R-C + corrector
–7 square degree FOV - 1.4Gpixel cameras
–Sited in Hawaii
–A  = 50
–R ~ 24 in 30 s integration
–-> 7000 square deg/night
–All sky + deep field surveys in g,r,i,z,y
• Time domain astronomy
– Transient objects
– Moving objects
– Variable objects
• Static sky science
– Enabled by stacking repeated scans to form a collection of
ultra-deep static sky images
ICS624
The Published Science Products Subsystem
Published Science
Products Subsystem
(PSPS)
Web-Based
Interface
(WBI)
End Users
l
Te
es
co
pe
Photons
Data Retrieval
Layer
(DRL)
Records
Images
Solar System
Data Manager
(SSDM)
Records
Gigapixel
Camera
Object Data
Manager
(ODM)
Image
Processing
Pipeline
(IPP)
Moving Object
Processing
System
(MOPS)
Detection Records
ICS624
ICS624
ICS624
Front of the Wave
• Pan-STARRS is only the first of a new
generation of astronomical data programs
that will generate such large volumes of
data:
– SkyMapper, southern hemisphere optical
– VISTA, southern hemisphere IR survey
– LSST, an all sky survey like Pan-STARRS
• Eventually, these data sets will be useful
for data mining.
ICS624
ICS624
ICS624
PS1 Data Products
• Detections—measurements obtained directly
from processed image frames
– Detection catalogs
– “Stacks” of the sky images source catalogs
– Difference catalogs
• High significance (> 5 transient events)
• Low significance (transients between 3 and 5 )
– Other Image Stacks (Medium Deep Survey)
• Objects—aggregates derived from detections
ICS624
What’s the Challenge?
• At first blush, this looks pretty much like
the Sloan Digital Sky Survey…
• BUT
– Size – Over its 3 year mission, PS1 will record
over 150 billion detections for approximately
5.5 billion sources
– Dynamic Nature – new data will be always
coming into the database system, for things
we’ve seen before or new discoveries
ICS624
Book Learning
• The books on database design tell you to
– Interview your users to determine what they
want to use the database for
– Determine the most common queries your
users are going to ask
– Organize your data into a normalized logical
schema
– Select a physical schema appropriate to your
problem.
ICS624
Real World
• The infamous “20 Queries” of Alex Szalay
(JHU) in designing the SDSS
• Normalized schema are good but can
result in very big performance penalties
• Money talks – in the real world you are
constrained by a budget and not all
physical implementations of your database
may be affordable (for one reason or
another)!
ICS624
PSPS Top Level Requirements
• 3.3.01 The PSPS shall be able to ingest a
total of 1.5x1011 P2 detections, 8.3x1010
cumulative sky detections, and 5.5 x109
celestial objects together with their
linkages.
ICS624
PSPS Top Level Requirements
• 3.3.02 The PSPS shall be able to ingest
the observational metadata for up to a
total of 1.1x1010 observations.
• 3.3.0.3 The PS1 PSPS shall be capable
of archiving up to ~ 100 Terabytes of data.
ICS624
PSPS Top Level Requirements
• 3.3.0.4 The PSPS shall archive the PS1
data products.
• 3.3.0.5 The PSPS shall possess a
computer security system to protect
potentially vulnerable subsystems from
malicious external actions.
ICS624
PSPS Top Level Requirements
• 3.3.0.6 The PSPS shall provide end-users
access to detections of objects in the PanSTARRS databases.
• 3.3.0.7 The PSPS shall provide end-users
access to the cumulative stationary sky
images generated by the Pan-STARRS.
ICS624
PSPS Top Level Requirements
• 3.3.0.8 The PSPS shall provide end-users
with metadata required to interpret the
observational legacy and processing
history of the Pan-STARRS data products.
• 3.3.0.9 The PSPS shall provide end-users
with Pan-STARRS detections of objects in
the Solar System for which attributes can
be assigned.
ICS624
PSPS Top Level Requirements
• 3.3.0.10 The PSPS shall provide end-users with
derived Solar System objects deduced from
Pan-STARRS attributed observations and
observations from other sources.
• 3.3.0.11 The PSPS shall provide the capability
for end-users to construct queries to search the
Pan-STARRS data products over space and
time to examine magnitudes, colors, and proper
motions.
ICS624
PSPS Top Level Requirements
• 3.3.0.12 The PSPS shall provide a mass
storage system with a reliability
requirement of 99.9% (TBR).
• 3.3.0.13 The PSPS baseline
configuration should accommodate future
additions of databases (i.e., be
expandable).
ICS624
How to Approach This
Challenge
• There are many possible approaches to deal
with this data challenge.
• Shared what?
– Memory
– Disk
– Nothing
• Not all of these approaches are created equal,
either in cost and/or performance (DeWitt &
Gray, 1992, “Parallel Database Systems: The
Future of High Performance Database
Processing”).
ICS624
Conversation with the PanSTARRS Project Manager
• Jim: Tom, what are we going to do if the
solution proposed by SAIC is more than you can
afford?
• Tom: Jim, I’m sure you’ll think of something!
• Not long after that, SAIC did give us a
hardware/software plan we couldn’t afford. Not
long after, Tom resigned from the project to
pursue other activities…
ICS624
The SAIC ODM Architecture Proposal
Ingest
LEFT
BRAIN
Single multi-processor machine
High performance storage
Objects
Staging
Ingest detections
Query
Publish
RIGHT
BRAIN
Clustered small processor machines
High capacity storage
ICS624
Published detections
The SAIC ODM Architecture Proposal
Ingest
LEFT
BRAIN
Query
$
Single multi-processor machine
High performance storage
Objects
Staging
Ingest detections
Publish
RIGHT
BRAIN
Clustered small processor machines
High capacity storage
ICS624
Published detections
Conversation with the PanSTARRS Project Manager
• The Pan-STARRS project teamed up with Alex
Szalay and his database team at JHU as they
were the only game in town with real experience
building large astronomical databases.
ICS624
Building upon the SDSS
Heritage
• In teaming with the group at JHU we hoped to
build upon the experience and software
developed for the SDSS.
• A key question was how could we scale the
system to deal with the volume of data expected
from PS1 (> 10X SDSS in the first year alone).
• The second key question, could the system keep
up with the data flow.
• The heritage is more one of philosophy than
recycled software, as to deal with the challenges
posed by PS1 we’ve had to generate a great
deal of new code.
ICS624
High-Level Organization
Data
DataTransformation
TransformationLayer
Layer(DX)
(DX)
objZoneIndx
orphans
Detections_l1
objZoneIndx
Linked servers
Load
Support1
LoadAdmin
Load
Supportn
PartitionMap
Data Loading Pipeline (DLP)
[LnkToObj_p1]
Linked servers
[Objects_pm]
P1
Pm
[Detections_p1]
Meta
Detections
PS1
PartionsMap
Objects
LnkToObj
Database
Full table
[partitioned table]
Output table
Partitioned View
[LnkToObj_pm]
[Detections_pm]
Data Storage (DS)
Legend
Detections_ln
LnkToObj_ln
LnkToObj_l1
[Objects_p1]
orphans
Meta
Query
QueryManager
Manager(QM)
(QM)
Web
WebBased
BasedInterface
Interface(WBI)
(WBI)
ICS624
Meta
Data Storage Logical Schema
ICS624
The Object Data Manager
• The Object Data Manager (ODM) was
considered to be the “long pole” in the
development of the PS1 PSPS.
• Parallel database systems can provide
both data redundancy and spreading very
large tables that can’t fit on a single
machine over multiple storage volumes.
• For PS1 (and beyond) we need both.
ICS624
Distributed Architecture
• The bigger tables will be spatially partitioned
across servers called Slices
• Using slices improves system scalability
• Tables are sliced into ranges of ObjectID,
which correspond to broad declination ranges
• ObjectID boundaries are selected so that
each slice has a similar number of objects
• Distributed Partitioned Views “glue” the data
together
ICS624
Data Storage Logical Schema
ICS624
Design Decisions: ObjID
• Objects have their positional information
encoded in their objID
– fGetPanObjID (ra, dec, zoneH)
– ZoneID is the most significant part of the ID
– objID is the Primary Key
• Objects are organized (clustered indexed) so
nearby objects in the sky are stored on disk
nearby as well
• It gives good search performance, spatial
functionality, and scalability
ICS624
Pan-STARRS Data Flow
← Behind the Cloud|| User facing services →
Data Valet Workflows
The Pan-STARRS Science Cloud
Data Creators
Image
Procesing
Pipeline (IPP)
Telescope
CSV
Files
CSV
Files
Validation
Exception
Notification
Load
Workflow
Load
DB
Merge
Workflow
Load
Workflow
Cold
Slice
DB 1
Flip
Workflow
Load
DB
Merge
Workflow
Cold
Slice
DB 2
Flip
Workflow
Admin & Load-Merge Machines
Data flows in one
direction→, except for
error recovery
ICS624
Slice Fault
Recover
Workflow
Warm
Slice
DB 1
Data Consumer
Queries & Workflows
Hot
Slice
DB 2
Warm
Slice Hot
DB 2 Slice
DB 1
Astronomers
(Data Consumers)
MyDB
MainDB
Distribute
d View
MainDB
Distribute
d View
Production Machines
CASJobs
Query
Service
MyDB
Pan-STARRS Data Layout
Image
Pipeline
L1 Data
csv
csv
csv
csv
csv
csv
L2 Data
Load
Merge 1
S
1
S
2
Load
Merge 2
S
3
S
4
S
5
Load
Merge 3
S
6
S
7
S
8
Load-Merge Nodes
L
O
A
D
Load
Merge 4
S
9
S
10
S
11
Load
Merge 5
S
12
S
13
S
14
Load
Merge 6
S
15
C
O
L
D
S
16
Slice Nodes
Slice
1
S
1
S
16
S
2
S
3
Slice
2
S
3
S
2
S
4
S
5
Distributed View
Slice
3
S
5
S
4
S
6
S
7
Slice
4
S
7
S
6
Head 1
S
8
S
9
Main
Slice
5
S
9
S
8
S
10
S
11
ICS624
Main
Slice
6
S
11
S
10
S
12
S
13
Head 2
Slice
7
S
13
S
12
S
14
S
15
Head Nodes
Slice
8
S
15
S
14
H
O
T
S
16
S
1
W
A
R
M
The ODM Infrastructure
• Much of our software development has gone
into extending the ingest pipeline developed for
SDSS.
• Unlike SDSS, we don’t have “campaign” loads
but a steady from of data from the telescope
through the Image Processing Pipeline to the
ODM.
• We have constructed data workflows to deal with
both the regular data flow into the ODM as well
as anticipated failure modes (lost disk, RAID,
and various severer nodes).
ICS624
Pan-STARRS
Object Data Manager Subsystem
System Operation
UI
System Health
Monitor UI
Query Performance
UI
Data Flow
Control Flow
System & Administration
Workflows
Configuration, Health &
Performance Monitoring
Orchestrates all cluster changes,
such as, data loading, or fault
tolerance
Cluster deployment and operations
Pan-STARRS Cloud Services
for Astronomers
Deployed
Query
Astronomy
Manager
Science queries
Databases
Loaded
Astronomy
Databases
Internal Data Flow and State
Logging
Tools for supporting workflow authoring
and execution
~70TB
Transfer/Week
~70TB
Storage/Year
and MyDB for
results
Pan-STARRS
Telescope
Image Processing
Pipeline
Extracts objects like stars and
galaxies from telescope
images
~1TB Input/Week
ICS624
36
What Next?
• Will this approach scale to our needs?
– PS1 – yes. But, we already see the need for better
parallel processing query plans.
– PS4 – unclear! Even though I’m not from Missouri,
“show me!” One year of PS4 produces > data
volume than the entire PS1 3 year mission!
• Column based databases?
• Cloud computing?
– How can we test issues like scalability without
actually building the system?
– Does each project really need its own data center?
– Having these databases “in the cloud” may greatly
facilitate data sharing/mining.
ICS624