Szalay Web Services spie-2002

SPIE, Hawaii, 2002
Web Services for the Virtual Observatory
(Living in an exponential world…)
Alex Szalay, Tamas Budavari, Tanu Malik, Jim Gray, and Ani Thakar
Outline
Collecting Data
Exponential Growth
Making Discoveries
Publishing Data
VO: How will it work?
Web Services
Atomic vs Composite services
Distributed queries with SkyQuery
Cross-Matching Algorithm
SkyNode Web Services + Portal
Alex Szalay, SPIE 2002
The World is Exponential
Astrophysical data is growing exponentially
Doubling every year (Moore’s Law+):
both data sizes and number of data sets
Computational resources scale the same way
Constant $$$ will keep up with the data
Main problem is the software component
Currently components are not reused
Software costs are an increasingly large fraction
Aggregate costs are growing exponentially
Making Discoveries
When and where are discoveries made?
Always at the edges and boundaries
Going deeper, using more colors…
Metcalfe’s law
Utility of computer networks grows as the number of possible connections: O(N^2)
VO: Federation of N archives
Possibilities for new discoveries grow as O(N^2)
Current sky surveys have proven this
Very early discoveries from SDSS, 2MASS, DPOSS
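As a quick check of the O(N^2) claim: N archives can be paired in N(N-1)/2 ways, so doubling the number of federated archives roughly quadruples the number of pairwise combinations. A minimal Python sketch (the function name is illustrative, not from the talk):

```python
def pairwise_federations(n_archives: int) -> int:
    """Number of distinct archive pairs among n_archives: N(N-1)/2."""
    return n_archives * (n_archives - 1) // 2


# Print the pair count for a few federation sizes.
for n in (3, 10, 20):
    print(n, pairwise_federations(n))
```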
Publishing Data
Roles       Traditional   Emerging
Authors     Scientists    Collaborations
Publishers  Journals      Project www site
Curators    Libraries     Bigger Archives
Consumers   Scientists    Scientists
Changing Roles
Exponential growth:
Projects last at least 3-5 years
Data sent upwards only at the end of the project
Data will never be centralized
More responsibility on projects
Becoming Publishers and Curators
Larger fraction of budget spent on software
A lot of development is duplicated and wasted
More standards are needed
Easier data interchange, fewer tools
More templates are needed
Develop less software on your own
Emerging New Concepts
Standardizing distributed data
Web Services, supported on all platforms
Custom configure remote data dynamically
XML: Extensible Markup Language
SOAP: Simple Object Access Protocol
WSDL: Web Services Description Language
Standardizing distributed computing
Grid Services
Custom configure remote computing dynamically
Build your own remote computer, and discard
Virtual Data: new data sets on demand
Shielding Users
Users do not want to deal with XML, they want their data
Users do not want to deal with configuring grid computing, they want results
SOAP: data appears in user memory, XML is invisible
SOAP call: just a remote procedure call
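To illustrate how the XML stays invisible, here is a minimal Python sketch of a SOAP-style call. It uses only the standard library. The service namespace, the `ArcsecToDeg` method, and the `fake_transport` function (a local stand-in for the HTTP round trip) are assumptions made for this example and do not describe any real VO service.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "urn:example:units"  # hypothetical service namespace


def build_envelope(element_name: str, **fields) -> bytes:
    """Wrap one payload element with child fields in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    payload = ET.SubElement(body, f"{{{SVC_NS}}}{element_name}")
    for name, value in fields.items():
        ET.SubElement(payload, f"{{{SVC_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def read_field(xml_bytes: bytes, element_name: str, field: str) -> str:
    """Extract one field from the payload element of a SOAP envelope."""
    root = ET.fromstring(xml_bytes)
    node = root.find(
        f"{{{SOAP_NS}}}Body/{{{SVC_NS}}}{element_name}/{{{SVC_NS}}}{field}"
    )
    if node is None or node.text is None:
        raise ValueError(f"missing {field} in {element_name}")
    return node.text


def fake_transport(request: bytes) -> bytes:
    """Stand-in for the HTTP POST to a remote service; answers locally."""
    arcsec = float(read_field(request, "ArcsecToDeg", "arcsec"))
    return build_envelope("ArcsecToDegResponse", result=arcsec / 3600.0)


def arcsec_to_deg(arcsec: float) -> float:
    """What the user calls: a float goes in, a float comes out, no XML in sight."""
    response = fake_transport(build_envelope("ArcsecToDeg", arcsec=arcsec))
    return float(read_field(response, "ArcsecToDegResponse", "result"))


print(arcsec_to_deg(7200.0))
```

The caller of `arcsec_to_deg` handles ordinary values only. The envelope is built and parsed inside the wrapper, which is the role a generated SOAP proxy plays.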
NVO: How Will It Work?
Define commonly used 'atomic' services
Build higher level toolboxes/portals on top
We do not build 'everything for everybody'
Use the 90-10 rule:
Define the standards and interfaces
Build the framework
Build the 10% of services that are used by 90%
Let the users build the rest from the components
[Figure: # of users (0 to 1) versus # of services (0 to 0.5)]
Atomic Services
Metadata information about resources
Waveband
Sky coverage
Translation of names to universal dictionary (UCD)
Simple search patterns on the resources
Cone Search
Image mosaic
Unit conversions
Simple filtering, counting, histogramming
On-the-fly recalibrations
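As an illustration of one atomic service from the list above, a cone search returns every object within a given angular radius of a sky position. The sketch below is a minimal in-memory version in Python. The catalog rows are made-up values, and the function names are assumptions for this example, not the interface of any actual VO service.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return the catalog rows lying within radius_deg of (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]


# Toy catalog with invented positions (degrees).
catalog = [
    {"id": "a", "ra": 181.30, "dec": -0.76},
    {"id": "b", "ra": 181.32, "dec": -0.75},
    {"id": "c", "ra": 185.00, "dec": 2.00},
]

hits = cone_search(catalog, ra=181.3, dec=-0.76, radius_deg=0.1)
print([row["id"] for row in hits])
```

A linear scan like this is only workable for small tables. A production service would put a spatial index behind the same interface.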
Higher Level Services
Built on Atomic Services
Perform more complex tasks
Examples
Automated resource discovery
Cross-identifications
Photometric redshifts
Outlier detections
Visualization facilities
Expectation:
Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)
SkyQuery
Distributed Query tool using a set of services
Feasibility study, built in 6 weeks from scratch
Tanu Malik (JHU CS grad student)
Tamas Budavari (JHU astro postdoc)
Implemented in C# and .NET
Won 2nd prize of Microsoft XML Contest
Allows queries like:
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o,
TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t)<3.5
AND AREA(181.3,-0.76,6.5)
AND o.type=3 AND (o.i - t.m_j)>2
Architecture
[Diagram components: Web Page, Image cutout, SkyQuery, and three SkyNodes (SDSS, 2MASS, FIRST)]
Cross-id Steps
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o,
TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t)<3.5
AND AREA(181.3,-0.76,6.5)
AND (o.i - t.m_j) > 2
AND o.type=3
Parse query
Get counts
Sort by counts
Make plan
Cross-match
Recursively, from small to large
Select necessary attributes only
Return output
Insert cutout image
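The counting, sorting, and recursive cross-match steps can be sketched in Python as follows. This is a toy version under stated assumptions: archives are in-memory lists, object positions are invented, the match criterion is a plain distance cut in arcseconds (the slides do not spell out the XMATCH criterion or its units), and all function names are illustrative.

```python
import math


def sep_arcsec(p, q):
    """Small-angle separation in arcseconds between two (ra, dec) points in degrees."""
    dra = (p[0] - q[0]) * math.cos(math.radians((p[1] + q[1]) / 2))
    return math.hypot(dra, p[1] - q[1]) * 3600.0


def objects_in_area(objects, center, radius_arcsec):
    """Keep only the objects inside the circular search area."""
    return [o for o in objects if sep_arcsec(o["pos"], center) <= radius_arcsec]


def plan_order(archives, center, radius_arcsec):
    """Get counts per archive inside the area, then sort from small to large."""
    counts = {name: len(objects_in_area(objs, center, radius_arcsec))
              for name, objs in archives.items()}
    return sorted(counts, key=counts.get), counts


def cross_match(archives, order, center, radius_arcsec, tol_arcsec):
    """Recursively extend partial matches, visiting one archive per level."""
    def extend(partial, remaining):
        if not remaining:
            return [partial]
        name, rest = remaining[0], remaining[1:]
        matches = []
        for obj in objects_in_area(archives[name], center, radius_arcsec):
            # Assumed criterion: within tol_arcsec of every object matched so far.
            if all(sep_arcsec(obj["pos"], m["pos"]) <= tol_arcsec
                   for m in partial.values()):
                matches.extend(extend({**partial, name: obj}, rest))
        return matches
    return extend({}, list(order))


# Toy archives with invented positions (degrees).
archives = {
    "SDSS": [{"id": "s1", "pos": (181.3000, -0.7600)},
             {"id": "s2", "pos": (181.3100, -0.7600)},
             {"id": "s3", "pos": (181.3500, -0.7000)}],
    "TWOMASS": [{"id": "t1", "pos": (181.3001, -0.7601)},
                {"id": "t2", "pos": (181.3500, -0.7001)}],
    "FIRST": [{"id": "f1", "pos": (181.3000, -0.7599)}],
}
center, radius = (181.3, -0.76), 600.0

order, counts = plan_order(archives, center, radius)
matches = cross_match(archives, order, center, radius, tol_arcsec=3.5)
print(order)
print([{name: obj["id"] for name, obj in m.items()} for m in matches])
```

Starting from the archive with the fewest candidates keeps the set of partial matches small before the larger archives are consulted.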
Monte-Carlo Simulation
Comparing different algorithms for 3-way xid
Transmit all the data
Transmit after filtering
Recursive cross-match
Surveys: SDSS, 2MASS, FIRST
Random variables:
Sky Area (0..10 sqdeg)
Selectivity of each subselect (0..1)
Efficiency of join (0.5..2)
Selectivity of common select (0..1)
[Figure: plot with x axis "log cost" (-4 to 4) and y axis 0 to 2000]
SkyNode
Metadata functions (SOAP)
Info, Tables, Columns, Schema, Functions, Keysearch
Query functions (SOAP)
Dataset Query(String sqlCmd)
Dataset Xmatch(Dataset input, String sqlCmd, float eps)
Database
MS SQL Server
Upload dataset
Very fast spatial search engine (HTM-based)
crossmatch takes <3 ms/object over 15M in SDSS
User defined functions and stored procedures
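A rough idea of the SkyNode interface can be conveyed with a toy in-memory node in Python. Here `sqlite3` stands in for MS SQL Server, the class and method names are hypothetical, the rows are invented, and the spatial `Xmatch` function is left out. The sketch covers only the `Tables`, `Columns`, and `Query` functions.

```python
import sqlite3


class ToySkyNode:
    """In-memory stand-in for a SkyNode; sqlite3 replaces MS SQL Server here."""

    def __init__(self, name):
        self.name = name
        self.db = sqlite3.connect(":memory:")
        self.db.row_factory = sqlite3.Row  # allow access to fields by column name

    def tables(self):
        """Metadata: list the tables this node exposes."""
        rows = self.db.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
        return [r["name"] for r in rows]

    def columns(self, table):
        """Metadata: list the columns of one table."""
        if table not in self.tables():
            raise ValueError(f"unknown table: {table}")
        return [r["name"] for r in self.db.execute(f'PRAGMA table_info("{table}")')]

    def query(self, sql_cmd):
        """Run a SQL command and return the rows as plain dictionaries."""
        return [dict(r) for r in self.db.execute(sql_cmd)]


node = ToySkyNode("SDSS")
node.db.execute(
    "CREATE TABLE PhotoPrimary (objId INTEGER PRIMARY KEY, ra REAL, dec REAL, "
    "r REAL, type INTEGER)")
node.db.executemany(
    "INSERT INTO PhotoPrimary VALUES (?, ?, ?, ?, ?)",
    [(1, 181.30, -0.76, 17.2, 3), (2, 181.31, -0.76, 19.5, 6), (3, 181.35, -0.70, 18.1, 3)])

print(node.tables())
print(node.columns("PhotoPrimary"))
print(node.query("SELECT objId, r FROM PhotoPrimary WHERE type = 3 ORDER BY objId"))
```

In the architecture described on the slide these calls would be exposed over SOAP, so a portal could discover a node's schema before sending it a query.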
Data Flow
[Diagram components: query, SkyQuery, SkyNode 1, SkyNode 2, SkyNode 3]
http://www.skyquery.net
Other web services
Create density maps and masks for angular clustering
Deliver photometric redshifts from photometry data
Intersect pointed observations with surveys
Generate XSLT from script XML => SVG
Wrap legacy (Linux C) data mining applications as a web service
Create a C# class for the CFITSIO library
Archive Footprint
Footprint is a ‘fractal’
Result depends on context
all sky, degree scale, pixel scale
Translate to web services
Footprint()
returns a single region that contains the archive
Intersection(region, tolerance)
takes a region and returns its intersection with the archive footprint
Contains(point)
returns yes/no (maybe fuzzy) depending on whether the point is inside the archive footprint
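The three footprint calls can be sketched in Python with simple RA/Dec rectangles. Everything here is an assumption for illustration: the `Box` and `ToyArchive` names, the rectangular tiles, and in particular the meaning of `tolerance`, which the slide does not define and which this sketch treats as a minimum size below which overlap pieces are discarded.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """RA/Dec rectangle in degrees; ignores wrap-around at RA = 0/360."""
    ra_min: float
    ra_max: float
    dec_min: float
    dec_max: float

    def contains(self, ra: float, dec: float) -> bool:
        return (self.ra_min <= ra <= self.ra_max
                and self.dec_min <= dec <= self.dec_max)

    def overlap(self, other: "Box") -> Optional["Box"]:
        """Return the overlapping rectangle, or None if the boxes are disjoint."""
        box = Box(max(self.ra_min, other.ra_min), min(self.ra_max, other.ra_max),
                  max(self.dec_min, other.dec_min), min(self.dec_max, other.dec_max))
        return box if box.ra_min < box.ra_max and box.dec_min < box.dec_max else None


class ToyArchive:
    """Archive whose sky coverage is a list of rectangular tiles."""

    def __init__(self, tiles: List[Box]):
        self.tiles = tiles

    def footprint(self) -> Box:
        """Single region containing the archive: here, the bounding box."""
        return Box(min(t.ra_min for t in self.tiles), max(t.ra_max for t in self.tiles),
                   min(t.dec_min for t in self.tiles), max(t.dec_max for t in self.tiles))

    def intersection(self, region: Box, tolerance: float = 0.0) -> List[Box]:
        """Overlap pieces with the region, dropping any smaller than tolerance."""
        pieces = [t.overlap(region) for t in self.tiles]
        return [p for p in pieces if p is not None
                and p.ra_max - p.ra_min >= tolerance
                and p.dec_max - p.dec_min >= tolerance]

    def contains(self, point: Tuple[float, float]) -> bool:
        """True if the (ra, dec) point falls inside any tile."""
        return any(t.contains(*point) for t in self.tiles)


archive = ToyArchive([Box(180.0, 182.0, -1.0, 1.0), Box(184.0, 186.0, -1.0, 1.0)])
print(archive.footprint())
print(archive.contains((181.0, 0.0)), archive.contains((183.0, 0.0)))
print(archive.intersection(Box(181.0, 185.0, -0.5, 0.5)))
```

The point (183, 0) lies inside the bounding region returned by `footprint()` but outside every tile, which is one concrete way the answer depends on the scale at which the footprint is described.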
Summary
Exponential data growth
– distributed data
– federation needed
Projects now Publishers and Curators
Web Services – hierarchical architecture
Use the 90-10 rule (maybe 80-20)
There are clever ways to federate datasets!