The World Wide Telescope as an Archetype for Online Science

The World Wide Telescope
an Archetype for Online-Science
Jim Gray (Microsoft)
Alex Szalay (Johns Hopkins University)
Microsoft Academic Days in Silicon Valley
http://research.microsoft.com/~gray/talks
1
First, an aside: 2 other projects
• TerraServer
– joint with USGS
• Giga Byte File Transfers
– joint with Caltech and CERN
2
TerraServer
• Seamless mosaic of US
• ~20 TB of imagery
• 30 M web hits/day
• A scalability laboratory
TerraServer Bricks – A High Availability Cluster Alternative (2004)
TerraServer Cluster and SAN Experience (2004)
TerraService.NET: An Introduction to Web Services (2002)
Microsoft TerraServer: A Spatial Data Warehouse (1999)
The Microsoft TerraServer™ (1998)
[Figure label: KVM / IP]
3
Giga Byte Per Second File Mover
• CERN to Pasadena
– Windows TCP/IP, NTFS
– Quantifying performance
– Working on better algorithms
– Opteron
– Disk-to-Disk at 550MBps now (~2 TB/Hour).
• GOAL: 1GBps disk-to-disk.
[Figure: CERN-Caltech Transfer Speeds, "GBps Land Speed Record". Sequential disk IO tests for Newisys->Newisys; y-axis in MBps (0 to 1000), x-axis Mar-04 to Sep-04; series: File Transfer MBps and 1 Stream tcp MBps; PCI-X limit marked.]
Gigabyte Bandwidth Enables Global Co-Laboratories
4
The Evolution of Science
• Observational Science
– Scientist gathers data by direct observation
– Scientist analyzes data
• Analytical Science
– Scientist builds analytical model
– Makes predictions.
• Computational Science
– Simulate analytical model
– Validates model and makes predictions
• Data Exploration Science
– Data captured by instruments
– Or data generated by simulator
– Processed by software
– Placed in a database / files
– Scientist analyzes database / files
6
Information Avalanche
• In science, industry, government, …
– better observational instruments and
– better simulations
are producing a data avalanche
• Examples
– BaBar: Grows 1TB/day
(2/3 simulation information, 1/3 observational information)
– CERN: LHC will generate 1GB/s, ~10 PB/y
– VLBA (NRAO) generates 1GB/s today
– Pixar: 100 TB/Movie
• New emphasis on informatics:
– Capturing, Organizing, Summarizing, Analyzing, Visualizing
[Images: courtesy C. Meneveau & A. Szalay @ JHU; BaBar, Stanford; P&E Gene Sequencer from http://www.genome.uci.edu/]
7
The Big Picture
[Diagram labels: Experiments & Instruments, Other Archives, Literature, Simulations; facts, facts; questions; answers. Image label: Space Telescope]
The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it
• How to coexist with others
• Query and Vis tools
• Support/training
• Performance
– Execute queries in a minute
– Batch query scheduling
8
FTP - GREP
• Download (FTP and GREP) is not adequate
– You can GREP 1 MB in a second
– You can GREP 1 GB in a minute
– You can GREP 1 TB in 2 days
– You can GREP 1 PB in 3 years.
• Oh!, and 1PB ~3,000 disks
• At some point we need
– indices to limit search
– parallel data search and analysis
• This is where databases can help
• Next generation technique: Data Exploration
– Bring the analysis to the data!
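The scan times listed on this slide come from dividing data size by a sustained scan rate. A minimal sketch of that arithmetic, assuming a single sequential reader; the 10 MB/s default rate is an illustrative assumption, not a figure from the slide (the four slide entries imply somewhat different rates):

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def scan_seconds(size_bytes: float, rate_bytes_per_s: float = 10e6) -> float:
    """Time for one sequential pass over the data at a fixed rate (assumed)."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s


for label, size in [("1 TB", 1e12), ("1 PB", 1e15)]:
    s = scan_seconds(size)
    print(f"{label}: {s / SECONDS_PER_DAY:.1f} days, {s / SECONDS_PER_YEAR:.2f} years")
```

At the assumed rate, a terabyte takes on the order of a day and a petabyte on the order of years, which is why the slide argues for indices and parallel search rather than downloading and scanning.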
9
The Speed Problem
• Many users want to search the whole DB
with ad hoc queries, often combinatorial
• Want ~ 1 minute response
• Brute force (parallel search):
– 1 disk = 50MBps => ~1M disks/PB ~ 300M$/PB
• Indices (limit search, do column store)
– 1,000x less equipment: 1M$/PB
• Pre-compute answer
– No one knows how to do it for all questions.
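The brute-force estimate can be reproduced as a back-of-envelope calculation. This sketch assumes a 1 PB database, the slide's 50 MBps per disk, and a 60-second deadline; the function name is made up for illustration. It yields a few hundred thousand disks, the same order of magnitude as the slide's rounded "~1M disks/PB".

```python
import math


def disks_for_full_scan(size_bytes: float, disk_rate_bytes_per_s: float,
                        deadline_s: float) -> int:
    """Disks needed so that all of them, reading in parallel, finish in time."""
    bytes_per_disk = disk_rate_bytes_per_s * deadline_s
    return math.ceil(size_bytes / bytes_per_disk)


brute_force = disks_for_full_scan(1e15, 50e6, 60)
# The slide's claim for indices: about 1,000x less equipment.
with_indices = math.ceil(brute_force / 1000)
print(brute_force, with_indices)
```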
10
Next-Generation Data Analysis
• Looking for
– Needles in haystacks – the Higgs particle
– Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
– Correlation functions are N^2, likelihood techniques N^3
• As data and computers grow at same rate,
we can only keep up with N log N
• A way out?
– Relax notion of optimal
(data is fuzzy, answers are approximate)
– Don’t assume infinite computational resources or memory
• Combination of statistics & computer science
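The scaling argument can be checked numerically. A small sketch, assuming data volume and compute speed both grow by the same factor and that the starting size is one million objects (an arbitrary choice): it prints how much longer each class of algorithm takes after the growth, relative to today.

```python
import math


def relative_runtime(cost, n: float, growth: float) -> float:
    """Runtime after data and compute both grow by `growth`, relative to now."""
    return cost(n * growth) / (growth * cost(n))


COSTS = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

n0 = 1e6  # assumed starting catalog size
for name, cost in COSTS.items():
    ratios = [round(relative_runtime(cost, n0, g), 2) for g in (10, 100, 1000)]
    print(f"{name:8s} growth x10, x100, x1000 -> {ratios}")
```

The N log N ratio stays close to 1, while N^2 falls behind by the growth factor and N^3 by its square.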
11
Analysis and Databases
• Much statistical analysis deals with
– Creating uniform samples
– Data filtering
– Assembling relevant subsets
– Estimating completeness
– Censoring bad data
– Counting and building histograms
– Generating Monte-Carlo subsets
– Likelihood calculations
– Hypothesis testing
• Traditionally these are performed on files
• Most of these tasks are much better done inside a database
• Move Mohamed to the mountain, not the mountain to Mohamed.
12
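As an illustration of doing such tasks inside a database, the sketch below censors flagged rows and builds a histogram with one SQL statement instead of exporting files. It uses Python's built-in sqlite3 module with synthetic data; the table name `obj` and the columns `mag` and `flag` are hypothetical, not taken from any real archive schema.

```python
import random
import sqlite3

random.seed(0)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, mag REAL, flag INTEGER)")
# Synthetic catalog: magnitudes between 14 and 22, about 5% flagged as bad.
rows = [(i, random.uniform(14.0, 22.0), int(random.random() < 0.05))
        for i in range(10_000)]
conn.executemany("INSERT INTO obj VALUES (?, ?, ?)", rows)

# Censor bad data (flag = 1) and count objects per 1-magnitude bin, all in SQL.
histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obj
    WHERE flag = 0
    GROUP BY bin
    ORDER BY bin
    """
).fetchall()

for b, n in histogram:
    print(f"{b} <= mag < {b + 1}: {n}")
```

Only the small histogram leaves the database; the 10,000 rows never travel to the client.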
Organization & Algorithms
• Use of clever data structures (trees, cubes):
– Up-front creation cost, but only N log N access cost
– Large speedup during the analysis
– Tree-codes for correlations (A. Moore et al. 2001)
– Data Cubes for OLAP (all vendors)
• Fast, approximate heuristic algorithms
– No need to be more accurate than cosmic variance
– Fast CMB analysis by Szapudi et al. (2001)
• N log N instead of N^3 => 1 day instead of 10 million years
• Take cost of computation into account
– Controlled level of accuracy
– Best result in a given time, given our computing resources
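The trade of an up-front organization cost for cheaper access can be seen in a toy pair-counting problem. This sketch is a one-dimensional stand-in, not the tree codes cited on the slide: it counts pairs of points closer than r, once by brute force (N^2) and once by sorting first and then using binary search (N log N). Integer coordinates are used so both methods agree exactly.

```python
import bisect
import random


def pairs_brute(xs, r):
    """Count pairs within distance r by testing every pair: O(N^2)."""
    n = len(xs)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if abs(xs[i] - xs[j]) <= r)


def pairs_sorted(xs, r):
    """Same count after an up-front sort: O(N log N) overall."""
    s = sorted(xs)  # up-front creation cost
    total = 0
    for i, x in enumerate(s):
        hi = bisect.bisect_right(s, x + r)  # O(log N) lookup per point
        total += hi - i - 1                 # points after i that lie within r
    return total


random.seed(1)
points = [random.randrange(10**6) for _ in range(2000)]
print(pairs_brute(points, 500), pairs_sorted(points, 500))
```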
13
World Wide Telescope
Virtual Observatory
http://www.ivoa.net/
• Premise:
Most data is (or could be) online
• The Internet is the world’s best telescope:
– It has data on every part of the sky
– In every measured spectral band: optical, x-ray, radio..
– As deep as the best instruments (2 years ago).
– It is up when you are up.
The “seeing” is always great
(no working at night, no clouds no moons no..).
– It’s a smart telescope:
links objects and data to literature on them.
14
Why Astronomy?
• Community has lots of data
• Data is real and well documented
– High-dimensional (with confidence intervals)
– Spatial, temporal
• Diverse and distributed
– Many different instruments from
many different places and
many different times
• Community wants to share/cross compare
– Can freely share data and algorithms.
– “DataMining, Not Data MINE!!” Mark Ellisman, UCSD
• They are well organized
• Community is small and homogeneous
• No commercial or privacy concerns
– All the problems are technical or social.
15
The WWT Components
• Data Sources
– Literature
– Archives
• Unified Definitions
– Units,
– Semantics/Concepts/Metrics,
Representations,
– Provenance
• Object model
• Classes and methods
• Portals
16
Data Sources
• Literature online and cross indexed
– Simbad, ADS, NED,
http://simbad.u-strasbg.fr/Simbad, http://adswww.harvard.edu/, http://nedwww.ipac.caltech.edu/
• Many curated archives online
– FIRST, DPOSS, 2MASS, USNO, IRAS, SDSS, VizieR,…
– Typically files with English meta-data and some programs
• Groups, Researchers, Amateurs Publish
– Datasets online in various formats
– Data publications are ephemeral (may disappear)
– Many have unknown provenance
• Documentation varies; some good and some none.
17
Unified Definitions
• Universal Content Definitions
http://vizier.u-strasbg.fr/doc/UCD.htx
– Collated all table heads from all the literature
– 100,000 terms reduced to ~1,500
– Rough consensus that this is the right thing.
– Refinement in progress as people use UCDs
• Defines
– Units:
• gram, radian, second, jansky...
– Semantic Concepts / Metrics
• Std error, Chi2 fit, magnitude, flux @ passband, velocity,
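A small sketch of what shared content definitions buy: two catalogs that name the same quantities differently can be re-keyed to a common vocabulary and then compared directly. The survey and column names here are made up for illustration, and the UCD strings are written in the dotted style used by later UCD revisions; treat them as illustrative rather than as the vocabulary described on this slide.

```python
# Hypothetical per-survey mapping from local column names to shared UCDs.
COLUMN_TO_UCD = {
    "survey_a": {"RAJ2000": "pos.eq.ra", "DEJ2000": "pos.eq.dec", "Vmag": "phot.mag"},
    "survey_b": {"ra_deg": "pos.eq.ra", "dec_deg": "pos.eq.dec", "m_v": "phot.mag"},
}


def to_ucd_row(survey: str, row: dict) -> dict:
    """Re-key one row from survey-specific column names to shared UCDs.

    Columns without a registered UCD are dropped.
    """
    mapping = COLUMN_TO_UCD[survey]
    return {mapping[col]: value for col, value in row.items() if col in mapping}


a = to_ucd_row("survey_a", {"RAJ2000": 10.68, "DEJ2000": 41.27, "Vmag": 3.4})
b = to_ucd_row("survey_b", {"ra_deg": 10.68, "dec_deg": 41.27, "m_v": 3.4, "note": "x"})
print(a == b)
```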
18
Provenance
• Most data will be derived.
• To do science,
need to trace derived data back to source.
• So programs and inputs must be registered.
• Must be able to re-run them.
• Example: Space Telescope Calibrated Data
– Run on demand
– Can specify software version (to get old answers)
• Scientific Data Provenance and Curation are
largely unsolved problems
(some ideas but no science).
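One way to picture "programs and inputs must be registered" is a record that names the program, its version, and its inputs, with a fingerprint that identifies the derivation. This is only a sketch of the idea under assumed field names (`program`, `version`, `inputs`, `params`); it is not a description of how any archive actually tracks provenance.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class Derivation:
    """What was run, at which version, on which inputs (hypothetical fields)."""
    program: str
    version: str
    inputs: tuple
    params: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        # Canonical JSON so that equal records always hash identically.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


registry = {}


def register(d: Derivation) -> str:
    """Store the record so the derived data can be traced and re-run later."""
    key = d.fingerprint()
    registry[key] = d
    return key


old = register(Derivation("calibrate", "1.0", ("raw/frame_001",), {"flat": "f7"}))
new = register(Derivation("calibrate", "2.0", ("raw/frame_001",), {"flat": "f7"}))
print(old != new, len(registry))
```

Keeping the software version in the record is what allows the "old answers" mentioned above to be regenerated on demand.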
19
Object Model
• General acceptance of XML
• Recent acceptance of XML Schema
(XSD over DTD)
• Wait-and-See about SOAP/WSDL/…
– “Web Services are just CORBA with angle brackets.”
– FTP is good enough for me.
• Personal opinion:
– Web Services are much more than “CORBA + <>”
– Huge focus on interop
– Huge focus on integrated tools
• But the community says “Show me!”
– Many technologists convinced, but not yet the astronomers
[Diagram labels: Your program, Web Server; Your program, Web Service, Data in your address space]
20
Classes and Methods
• First Class: VO table
http://www.us-vo.org/VOTable/
– Represents an answer set in XML
• Defined by an XML Schema (XSD)
• Metadata (in terms of UCDs)
• Data representation (numbers and text)
– First method
• Cone Search: Get objects in this cone
http://voservices.org/cone/
[Diagram labels: Your program, Web Service, Data in your address space]
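The geometry behind "get objects in this cone" can be sketched locally: keep every object whose angular distance from the cone center is within the search radius. The function names and the tiny in-memory catalog are assumptions for illustration; this is a linear scan, not the service's implementation, and a real archive would use a spatial index.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable for small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return catalog rows inside the cone centered at (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]


catalog = [
    {"id": "a", "ra": 180.00, "dec": 0.00},
    {"id": "b", "ra": 180.05, "dec": 0.02},
    {"id": "c", "ra": 181.00, "dec": 0.50},
    {"id": "d", "ra": 0.00, "dec": 89.90},
]
hits = cone_search(catalog, ra=180.0, dec=0.0, radius_deg=0.1)
print([row["id"] for row in hits])
```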
21
Other Classes
• Space-Time class
– http://hea-www.harvard.edu/~arots/nvometa/STCdoc.pdf
• Image Class (returns pixels)
– SdssCutout
– Simple Image Access Protocol
http://bill.cacr.caltech.edu/cfdocs/usvo-pubs/files/ACF8DE.pdf
– HyperAtlas
http://bill.cacr.caltech.edu/usvo-pubs/files/hyperatlas.pdf
• Spectral
– Simple Spectral Access Protocol
– 500K spectra available at http://voservices.net/wave
• Query Services
– ADQL and SkyNode http://skyservice.pha.jhu.edu/develop/vo/adql/
– And http://SkyQuery.Net
• Registry:
– see below
[Diagram labels: Your program, Web Service, Data in your address space]
22
The Registry
• UDDI seemed inappropriate
– Complex
– Irrelevant questions
– Relevant questions missing
• Evolved Dublin Core
– Represent Datasets, Services, Portals
– Needs to be machine readable
– Federation (DNS model)
– Push & Pull: register then harvest
• http://www.ivoa.net/twiki/bin/view/IVOA/IvoaResReg
23
Demo
• SkyServer:
– navigator showing cutout web service
– List: showing many calls and variant use.
• SkyQuery:
– Show integration of various archives.
– Explain spatial join xMatch operator.
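To give a feel for a spatial join between archives, the sketch below pairs each object in one catalog with its nearest neighbor in another, provided the two lie within a fixed angular tolerance. This is a simplified stand-in and not SkyQuery's actual xMatch operator; the catalogs, identifiers, and the 2-arcsecond tolerance are invented for the example.

```python
import math


def unit_vector(ra_deg, dec_deg):
    """Convert sky coordinates to a point on the unit sphere."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))


def cross_match(cat_a, cat_b, tol_arcsec):
    """Pair each cat_a object with its nearest cat_b object within tol_arcsec."""
    # An angle theta on the unit sphere corresponds to a chord of 2*sin(theta/2).
    max_chord = 2 * math.sin(math.radians(tol_arcsec / 3600) / 2)
    vecs_b = [(b["id"], unit_vector(b["ra"], b["dec"])) for b in cat_b]
    pairs = []
    for a in cat_a:
        va = unit_vector(a["ra"], a["dec"])
        best_id, best_chord = None, max_chord
        for b_id, vb in vecs_b:  # brute force; an index would prune this loop
            chord = math.dist(va, vb)
            if chord <= best_chord:
                best_id, best_chord = b_id, chord
        if best_id is not None:
            pairs.append((a["id"], best_id))
    return pairs


optical = [{"id": "o1", "ra": 10.0000, "dec": 20.0000},
           {"id": "o2", "ra": 50.0000, "dec": -5.0000}]
radio = [{"id": "r1", "ra": 10.0002, "dec": 20.0001},
         {"id": "r2", "ra": 120.0000, "dec": 30.0000}]
matches = cross_match(optical, radio, tol_arcsec=2.0)
print(matches)
```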
24
SkyServer.SDSS.org
• A modern Astronomy archive
– Raw Pixel data lives in file servers
– Catalog data (derived objects) lives in Database
– Online query to any and all
• Also used for education
– 150 hours of online Astronomy
– Implicitly teaches data analysis
• Interesting things
– Spatial data search
– Client query interface via Java Applet
– Query interface via Emacs
– Popular
– Cloned by other surveys (a template design)
– Web services are core of it.
25
SkyQuery.Net
A Prototype WWT
• Started with SDSS data and schema
• Imported 12 other datasets
into that spine schema
(a day per dataset plus load time)
• Unified them with a portal
• Implicit spatial join among the datasets.
• All built on Web Services
– Pure XML
– Pure SOAP
– Used .NET toolkit
26
Federation: SkyQuery.Net
• Combine 4 archives initially
• Added 9 more
• Send query to portal,
portal joins data from archives.
• Problem: want to do multi-step data analysis
(not just single query).
• Solution: Allow personal databases on portal
• Problem: some queries are monsters
• Solution: “batch schedule” on portal server,
Deposits answer in personal database.
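The two solutions on this slide can be pictured with a toy portal: cheap queries run at once, expensive ones wait in a batch queue, and every answer is deposited in a per-user table that stands in for the personal database. The class, the row-count threshold, and the table naming are all hypothetical; queries are trusted strings here, which a real portal could not assume.

```python
import sqlite3
from collections import deque


class Portal:
    """Toy portal: quick queries run now, monsters go to a batch queue."""

    def __init__(self, conn, quick_limit_rows=1000):
        self.conn = conn
        self.quick_limit_rows = quick_limit_rows  # assumed cutoff
        self.batch = deque()

    def submit(self, user, sql, est_rows, into):
        """Run immediately if the estimate is small; otherwise enqueue."""
        if est_rows <= self.quick_limit_rows:
            self._deposit(user, sql, into)
            return "done"
        self.batch.append((user, sql, into))
        return "queued"

    def run_batch(self):
        """Drain the queue, e.g. when the server is otherwise idle."""
        while self.batch:
            self._deposit(*self.batch.popleft())

    def _deposit(self, user, sql, into):
        # The answer lands in a per-user table instead of going to the client.
        self.conn.execute(f'CREATE TABLE "mydb_{user}_{into}" AS {sql}')


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photo (id INTEGER, mag REAL)")
conn.executemany("INSERT INTO photo VALUES (?, ?)",
                 [(i, 15 + i % 7) for i in range(5000)])

portal = Portal(conn)
s1 = portal.submit("alice", "SELECT * FROM photo WHERE id < 10", 10, "first10")
s2 = portal.submit("alice", "SELECT mag, COUNT(*) AS n FROM photo GROUP BY mag",
                   5000, "hist")
portal.run_batch()
print(s1, s2)
```

Later analysis steps can read `mydb_alice_hist` directly, so intermediate results never make a round trip to the client.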
27
SkyQuery Structure
• Portal
– Plans query (2 phase)
– Integrates answers
– Is a web service
• Each SkyNode publishes
– Schema Web Service
– Database Web Service
[Diagram labels: Image Cutout, SDSS, INT, SkyQuery Portal, FIRST, 2MASS]
28
MyDB
http://skyserver.sdss.org/cas
• Portal allows federation of data but…
• Intermediate results may be large.
• Intermediate results
feed into next analysis step.
• Sending them back-and-forth to client is
costly and sometimes infeasible.
• Solution: create a working DB for client at
Portal: MyDB
29
MyDB
http://skyserver.sdss.org/cas
• Anyone can create a personal DB at
SkyServer portal.
– It is about 100 MB
– It is private
• Simple queries done immediately
• Complex queries done by batch scheduler
• All queries can create/read/write MyDB tables
• Very popular with “serious” users.
• MyDB will be sharable by a group.
30
Open SkyQuery
• SkyQuery being adopted by AstroGrid as
reference implementation for OGSA-DAI
(Open Grid Services Architecture, Data Access and Integration).
• SkyNode is the basic archive object
http://www.ivoa.net/twiki/bin/view/IVOA/SkyNode
• SkyQuery Language (VoQL) is evolving.
http://www.ivoa.net/twiki/bin/view/IVOA/IvoaVOQL
31
The WWT Components
Outline
• Data Sources
– Literature
– Archives
• Unified Definitions
– Units,
– Semantics/Concepts/Metrics,
Representations,
– Provenance
• Object model
• Classes and methods
• Portals
WWT is a poster child for the Data Grid.
What we learned
• Astro is a community of 10,000
• Homogeneous & Cooperative
• If you can’t do it for Astro,
do not bother with 3M bio-info.
• Agreement
– Takes time
– Takes endless meetings
• Big problems are non-technical
– Legacy is a big problem.
• Plumbing and tools are there
But…
– What is the object model?
– What do you want to save?
– How to document provenance?
32
References (all are MSR TRs)
Where the Rubber Meets the Sky: Bridging the Gap between Databases and Science
When Database Systems Meet the Grid
There Goes the Neighborhood: Relational Algebra for Spatial Data Search
Extending the SDSS Batch Query System to the National Virtual Observatory Grid
The World-Wide Telescope, an Archetype for Online Science
Data Mining the SDSS SkyServer Database
The SDSS SkyServer – Public Access to the Sloan Digital Sky Server Data
Web Services for the Virtual Observatory
Online Scientific Data Curation, Publication, and Archiving
Petabyte Scale Data Mining: Dream or Reality?
Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey
33