Giving Access to Science Data

Download Report

Transcript Giving Access to Science Data

Where The Rubber Meets the Sky
Giving Access to Science Data
Talk at
National Institute of Informatics, Tokyo, Japan
October 2005
Jim Gray
Microsoft Research
[email protected]
Http://research.Microsoft.com/~Gray
Alex Szalay
Johns Hopkins University
[email protected]
1
•Abstract:
I have been working with some
astronomers
for the last 6 years
trying to apply DB technology to science
problems.
These are some lessons I learned
Paper at:
Where the Rubber Meets the Sky: Bridging the Gap between
Databases and Science,”
Jim Gray; Alexander S. Szalay;
MSR-TR-2004-110, October 2004
2
New Science Paradigms
• Thousand years ago:
science was empirical
describing natural phenomena
• Last few hundred years:
theoretical branch
using models, generalizations
2
 .
4G
c2
a
 a   3   a 2
 
• Last few decades:
a computational branch
simulating complex phenomena
• Today:
data exploration (eScience)
unify theory, experiment, and simulation
using data management and statistics
– Data captured by instruments
Or generated by simulator
– Processed by software
– Scientist analyzes database / files
3
The Big Picture
Experiments &
Instruments
Other Archives
Literature
questions
facts
facts
?
answers
Simulations
The Big Problems
•
•
•
•
•
•
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it?
How to coexist with others?
•
•
•
Data Query and Visualization tools
Support/training
Performance
– Execute queries in a minute
– Batch (big) query scheduling
4
Experiment Budgets ¼…½ Software
Software for
• Instrument scheduling
• Instrument control
• Data gathering
• Data reduction
• Database
• Analysis
• Visualization
Millions of lines of code
Repeated for experiment
after experiment
Not much sharing or learning
Let’s work to change this
Identify generic tools
• Workflow schedulers
• Databases and libraries
• Analysis packages
• Visualizers
5
• …
Data Lifecycle
• Raw data → primary data → derived data
• Data has bugs:
– Instrument bugs
– Pipeline bugs
• Data comes in versions
– later versions fix known bugs
– Just like software (indeed data is software)
• Can’t “un-publish” bad data.
Level 1
calibrated
Level 0
raw
instrument
or
simulator
pipeline
pipeline
other
data
Level 2
derived
other
data
6
Data Inflation – Data Pyramid
Level 2
Level 1A
Grows X TB/year
~ .4X TB/y
compressed
(level 1A in NASA terms)
Derived data products ~10x smaller
But there are many.
L2≈L1
• Publish new edition each year
– Fixes bugs in data.
– Must preserve old editions
– Creates data pyramid
• Store each edition
– 1, 2, 3, 4… N ~ N2 bytes
• Net: Data Inflation: L2 ≥ L1
Level 1A
4 editions of 4 Level 2 products
E4
E3
time
E2
E1
4 editions of
level 1A data
(source data)
4 editions of level 2 derived data products. Note that each derived product is
small, but they are numerous. This proliferation combined with the data
pyramid implies that level2 data more than doubles the total storage volume.
7
180
The Year 5 Problem
Yearly Demand
160
Depreciated Inflated Demand
• Data arrives at R bytes/year
• New Storage & Processing
– Need to buy R units in year N
• Data inflation means
Yearly Demand ( R )
140
80
60
40
20
0
– Need to buy NR units
0
• Capital expense
peaks at year 5
• See 6x Over-Power slide next
4
6
8
10
8
10
Yearly Capital Cost
4.0
3.5
Marginal Capital Cost
60%/year price decline
2
Year
• Depreciate over 3 years
• Moore’s law:
Naive Demand
100
~N2R
– After year 3
need to buy N2R + (N-3)2R
Inflated Demand
120
3.0
2.5
2.0
1.5
1.0
0.5
0.0
0
2
4
6
Year
8
6x Over-Power Ratio
• If you think you need X raw capacity,
then you probably need 6X
• Reprocessing
• Backup copies
• Versions
• …
• Hardware is cheap,
Your time is precious.
PubDB
3.6TB
DR2C
1.8TB
DR2M
1.8TB
DR2P
1.8TB
DR3C
2.4TB
DR3M
2.4TB
DR3P
2.4TB
9
Data Loading
• Data from outside
– Is full of bugs
– Is not in your format
• Advice
– Get it in a “Universal Format”
(e.g. Unicode CSV)
– Create Blood-Brain barrier
Quarantine in a “load database”
– Scrub the data
•
•
•
•
Cross check everything you can
Check data statistics for sanity
Reject or repair bad data
Generate detailed bug reports
(needed to send rejection upstream)
– Expect to reload many times
Automate everything!
LOAD
Export
EXP
Check CSV
CHK Build Task DBs
BLD Build SQL Schema
SQL Validate
VAL
Backup
BCK
Detach
DTC
PUBLISH
Publish
PUB
Cleanup
CLN
Test
Test Uniqueness
Uniqueness
Of
Of Primary
Primary Keys
Keys
FINISH
FIN
Test the unique
Key in each table
Test
Test
Foreign
Foreign Keys
Keys
Test for consistency
of keys that link tables
Test
Test
Cardinalities
Cardinalities
Test consistency of
numbers of various
quantities
Test
Test
HTM
HTM IDs
IDs
Test
Test parent-child
parent-child
consistency
consistency
Test the Hierarchical
Triamgular Mesh IDs
used for spatial
indexing
10
Ensure that all parents
and children and linked
Performance Prediction & Regression
• Database grows exponentially
• Set up response-time requirements
– For load
– For access
• Define a workload to measure each
• Run it regularly to detect anomalies
• SDSS uses
– one-week to reload
– 20 queries with response of 10 sec to 10 min.
11
Data Subsets
For Science and Development
• Offer 1GB, 10GB, …, Full
subsets
• Wonderful tool for you
Design & Debug
• Good tool for scientists
– Experiment on subset
– Not for needle in haystack,
but good for global stats
• Challenge: How make
statistically valid subsets?
– Seems domain specific
– Seems problem specific
– But, must be some general
concepts.
12
Data Curation Problem Statement
• Once published,
scientific data needs to be available forever,
so that the science can be reproduced/extended.
• What does that mean?
NASA “level 0”
– Data can be characterized as
• Primary Data: could not be reproduced
• Derived data: could be derived from primary data.
– Meta-data: how the data was collected/derived
is primary
• Must be preserved
• Includes design docs, software, email, pubs, personal
notes, teleconferences,
13
Schema (aka metadata)
• Everyone starts with the same schema
<stuff/>
Then the start arguing about semantics.
• Virtual Observatory: http://www.ivoa.net/
• Metadata based on Dublin Core:
http://www.ivoa.net/Documents/latest/RM.html
• Universal Content Descriptors (UCD):
http://vizier.u-strasbg.fr/doc/UCD.htx
Captures quantitative concepts and their units
Reduced from ~100,000 tables in literature to ~1,000 terms
• VOtable – a schema for answers to questions
http://www.us-vo.org/VOTable/
• Common Queries:
Cone Search and Simple Image Access Protocol, SQL
• Registry: http://www.ivoa.net/Documents/latest/RMExp.html
still a work in progress.
14
Archive Challenges
• Cost of administering storage:
– Presently 10x to 100x the hardware cost.
• Resist attack: geographic diversity
• At 1GBps it takes 12 days to move a PB
• Store it in two (or more) places online (on disk).
A geo-plex
• Scrub it continuously (look for errors)
• On failure,
– use other copy until failure repaired,
– refresh lost copy from safe copy.
• Can organize the copies differently
(e.g.: one by time, one by space)
15
References
http://SkyServer.SDSS.org/
http://research.microsoft.com/pubs/
http://research.microsoft.com/Gray/SDSS/ (download personal SkyServer)
Extending the SDSS Batch Query System to the National Virtual Observatory Grid,
M. A. Nieto-Santisteban, W. O'Mullane, J. Gray, N. Li, T. Budavari, A. S. Szalay, A. R. Thakar, MSR-TR-2004-12, Feb. 2004
Scientific Data Federation,
J. Gray, A. S. Szalay, The Grid 2: Blueprint for a New Computing Infrastructure, I. Foster, C. Kesselman, eds, Morgan Kauffman,
2003, pp 95-108.
Data Mining the SDSS SkyServer Database,
J. Gray, A.S. Szalay, A. Thakar, P. Kunszt, C. Stoughton, D. Slutz, J. vandenBerg, Distributed Data & Structures 4:
Records of the 4th International Meeting, pp 189-210, W. Litwin, G. Levy (eds),, Carleton Scientific 2003, ISBN 1-894145-13-5,
also MSR-TR-2002-01, Jan. 2002
Petabyte Scale Data Mining: Dream or Reality?,
Alexander S. Szalay; Jim Gray; Jan vandenBerg, SIPE Astronomy Telescopes and Instruments, 22-28 August 2002, Waikoloa,
Hawaii, MSR-TR-2002-84
Online Scientific Data Curation, Publication, and Archiving,
J. Gray; A. S. Szalay; A.R. Thakar; C. Stoughton; J. vandenBerg, SPIE Astronomy Telescopes and Instruments, 22-28 August
2002, Waikoloa, Hawaii, MSR-TR-2002-74
The World Wide Telescope: An Archetype for Online Science,
J. Gray; A. Szalay,, CACM, Vol. 45, No. 11, pp 50-54, Nov. 2002, MSR TR 2002-75,
The SDSS SkyServer: Public Access To The Sloan Digital Sky Server Data,
A. S. Szalay, J. Gray, A. Thakar, P. Z. Kunszt, T. Malik, J. Raddick, C. Stoughton, J. vandenBerg:,
ACM SIGMOD 2002: 570-581 MSR TR 2001 104.
The World Wide Telescope,
A.S., Szalay, J., Gray, Science, V.293 pp. 2037-2038. 14 Sept 2001. MS-TR-2001-77
Designing & Mining Multi-Terabyte Astronomy Archives: Sloan Digital Sky Survey,
A. Szalay, P. Kunszt, A. Thakar, J. Gray, D. Slutz, P. Kuntz, June 1999, ACM SIGMOD 2000, MS-TR-99-30,
16