Transcript Document
ADASS 2000
Towards a Virtual Observatory
Alex Szalay
Department of Physics and Astronomy
The Johns Hopkins University
The Virtual Observatory
• National/Global
• distributed in scope across institutions, agencies and countries
• available to all astronomers and the public
• Virtual
• not tied to a single “brick-and-mortar” location
• supports astronomical “observations” and discoveries
via remote access to digital representations of the sky
• Observatory
• general purpose
• access to large areas of the sky at multiple wavelengths
• supports a wide range of astronomical explorations
• enables discovery via new computational tools
Why Now ?
The past decade has witnessed
• a thousand-fold increase in computer speed
• a dramatic decrease in the cost of computing & storage
• a dramatic increase in access to broadly distributed data
• large archives at multiple sites and high speed networks
• significant increases in detector size and performance
These form the basis for science
of a qualitatively different nature
Nature of Astronomical Data
• Imaging
– 2D map of the sky at multiple wavelengths
• Derived catalogs
– subsequent processing of images
– extracting object parameters (400+ per object)
• Spectroscopic follow-up
– spectra: more detailed object properties
– clues to physical state and formation history
– lead to distances: 3D maps
• Numerical simulations
• All inter-related!
Trends
Future dominated by detector improvements
[Figure: log-scale plot, 1970-2000, of the growth in telescope glass area and in CCD pixel counts; see caption below]
• Moore’s Law growth in
CCD capabilities
• Gigapixel arrays on the
horizon
• Improvements in computing
and storage will track growth
in data volume
• Investment in software is
critical, and growing
Figure: Total area of 3m+ telescopes in the world in m², and total number
of CCD pixels in Megapix, as a function of time. Growth over
25 years is a factor of 30 in glass, but 3000 in pixels.
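A quick consistency check (simple arithmetic, not from the slide): a factor of 3000 in 25 years corresponds to 2^(25/T) = 3000, i.e. a doubling time T = 25/log2(3000) ≈ 2.2 years, matching the Moore's-law growth claimed above.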
The Age of Mega-Surveys
• The next generation mega-surveys and archives will
change astronomy, due to
– top-down design
– large sky coverage
– sound statistical plans
– well controlled systematics
• The technology to store and access the data is here:
we are riding Moore’s law
• Data mining will lead to stunning new discoveries
• Integrating these archives is for the whole community
=> Virtual Observatory
Ongoing surveys
• Large number of new surveys
– multi-TB in size, 100 million objects or more
– individual archives planned, or under way
• Multi-wavelength view of the sky
– coverage at more than 13 wavelengths within 5 years
• Impressive early discoveries
– finding exotic objects by unusual colors
• L,T dwarfs, high-z quasars
– finding objects by time variability
• gravitational microlensing
MACHO, 2MASS, DENIS, SDSS, GALEX, FIRST, DPOSS, GSC-II,
COBE, MAP, NVSS, ROSAT, OGLE, ...
The Discovery Process
Past:
observations of small, carefully selected samples
of objects in a narrow wavelength band
Future:
high quality, homogeneous multi-wavelength
data on millions of objects, allowing us to
– discover significant patterns from the analysis of
statistically rich and unbiased image/catalog databases
– understand complex astrophysical systems via
confrontation between data and large numerical simulations
The discovery process
will rely heavily on advanced visualization
and statistical analysis tools
The Necessity of the VO
• Enormous scientific interest in the survey data
• The environment to exploit these huge sky surveys
does not exist today!
– 1 Terabyte at 10 Mbyte/s takes 1 day (see the check after this list)
– Hundreds of intensive queries and thousands of casual
queries per day
– Data will reside at multiple locations, in many different formats
– Existing analysis tools do not scale to Terabyte data sets
• Acute need in a few years; the solution will not just happen
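A quick check of the first number above (simple arithmetic, not from the slide): 10^12 bytes / 10^7 bytes per second = 10^5 seconds ≈ 28 hours, i.e. about a day just to move a single Terabyte over a 10 Mbyte/s link.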
VO: The Challenges
• Size of the archived data
– 40,000 square degrees is 2 Trillion pixels (checked after this list)
– One band: 4 Terabytes
– Multi-wavelength: 10-100 Terabytes
– Time dimension: 10 Petabytes
• Current techniques inadequate
– new archival methods
– new analysis tools
– new standards
• Hardware/networking requirements
– scalable solutions required
• Transition to the new astronomy
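A consistency check on the pixel count (assuming roughly half-arcsecond pixels, a detail the slide does not state): 40,000 deg² × (3600 arcsec/deg)² ≈ 5×10^11 arcsec²; at 0.5 arcsec per pixel that is ≈ 2×10^12 pixels, and at 2 bytes per pixel ≈ 4 Terabytes for a single band.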
VO: A New Initiative
• Priority in the Astronomy and Astrophysics Survey
• Enable new science not previously possible
• Maximize impact of large current and future efforts
• Create the necessary new standards
• Develop the software tools needed
• Ensure that the community has network and
hardware resources to carry out the science
• Keep up with evolving technology
New Astronomy: Different!
• Data “Avalanche”
– the flood of Terabytes of data is already happening,
whether we like it or not
– our present techniques of handling these data do not scale
well with data volume
• Systematic data exploration
– will have a central role
– statistical analysis of the “typical” objects
– automated search for the “rare” events
• Digital archives of the sky
– will be the main access to data
– hundreds to thousands of queries per day
Examples: Data Pipelines
Examples: Rare Events
Discovery of several new
objects by SDSS & 2MASS
SDSS T-dwarf
(June 1999)
Examples: Reprocessing
Gravitational lensing
28,000 foreground galaxies over 2,045,000 background
galaxies in test data (McKay et al. 1999)
Examples: Galaxy Clustering
• Shape of fluctuation spectrum
– cosmological parameters and initial conditions
• The new surveys (SDSS) are the first where log N ~ 30
• Starts with a query
• Compute correlation function
– All pairwise distances: O(N²); O(N log N) possible (see the sketch after this list)
• Power spectrum
– Optimal: the Karhunen-Loève transform
– Signal-to-noise eigenmodes
– O(N³) in the number of pixels
• Likelihood analysis in 30 dimensions:
needs to be done many times over
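To make the O(N log N) claim concrete, here is a minimal pair-counting sketch using a dual-tree walk in SciPy; this is an illustration, not the SDSS pipeline, and the random points stand in for a real catalog.

```python
import numpy as np
from scipy.spatial import cKDTree

def pair_counts(xyz, edges):
    """Pairs per separation bin via a dual-tree walk (~N log N),
    instead of the naive O(N^2) loop over all pairs."""
    tree = cKDTree(xyz)
    cumulative = tree.count_neighbors(tree, edges)  # ordered pairs, incl. self-pairs
    # self-pairs cancel in the bin differences; halve to get unordered pairs
    return np.diff(cumulative) // 2

# toy usage on random points
xyz = np.random.default_rng(1).random((100_000, 3))
print(pair_counts(xyz, np.array([0.0, 0.01, 0.02, 0.05])))
```

A real correlation-function estimator would also count pairs against a random catalog; the tree walk is what brings the cost down from N².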
Relation to the HEP Problem
• Similarities
– need to handle large amounts of data
– data is located at multiple sites
– data should be highly clustered
– substantial amounts of custom reprocessing
– need for a hierarchical organization of resources
– scalable solutions required
• Differences of Astro from HEP
– data migration is in opposite direction
– the role of small queries is more important
– relations between separate data sets (same sky)
– data size currently smaller, we can keep it all on disk
Data Migration Path
[Diagram: tiered data migration, Tier 0 through Tier 3 with a user portal; in HEP, data migrates down the tiers toward the users, while in Astro the queries migrate toward the data, in the opposite direction]
Queries are I/O limited
• In our applications there are few fixed access patterns
– one cannot build indices for all possible queries
– worst case scenario is linear scan of the whole table
• Increasingly large differences between
– Random access
• controlled by seek time (5-10ms), <1000 random I/O /sec
– Sequential I/O
• dramatic improvements, 100 MB/sec per SCSI channel easy
• reached 215 MB/sec on a single 2-way Dell server
• Often much faster to scan than to seek (see the sketch below)
• Good layout => more sequential I/O
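A back-of-envelope cost model of that scan-vs-seek tradeoff, using the numbers from this slide (the table size and row count are invented for illustration):

```python
SEEK = 0.008            # ~8 ms per random disk seek (5-10 ms on the slide)
SEQ_BW = 100e6          # 100 MB/s sequential per SCSI channel
TABLE_BYTES = 100e9     # hypothetical 100 GB table
ROWS_HIT = 1_000_000    # rows an unclustered index would touch

scan_time = TABLE_BYTES / SEQ_BW   # one full sequential pass: ~1000 s
seek_time = ROWS_HIT * SEEK        # one random I/O per row:   ~8000 s
print(f"scan {scan_time:.0f} s vs seek {seek_time:.0f} s")
```

Even though the scan reads a hundred times more bytes, it finishes first; this is why a good clustered layout matters more than clever indexing for large selections.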
Fast Query Execution
• Query plan
– given the layout of the database, how can one get
the result the fastest possible way
– evaluate a cost function, using a heuristic algorithm
• Indexing
– clustered and unclustered
• database layout is crucial
– one-dimensional indices:
• B-tree, R-tree
– higher dimensional indices:
• Quad-tree, Oct-tree, KD-tree, R+-tree
– typically O(log N) access instead of O(N) (see the sketch below)
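A minimal sketch of what a clustered one-dimensional index buys (the magnitude column here is invented): two O(log N) binary searches bound a range query, after which the matching rows are read sequentially.

```python
import bisect
import random

# a sorted column stands in for a clustered B-tree index on magnitude
mags = sorted(random.uniform(14.0, 22.0) for _ in range(1_000_000))

lo = bisect.bisect_left(mags, 18.0)    # O(log N)
hi = bisect.bisect_right(mags, 19.0)   # O(log N)
selected = mags[lo:hi]                 # sequential read, no full scan
print(len(selected), "objects with 18 <= magnitude <= 19")
```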
Distributed Archives
• Network speeds are much slower than local disk access
– minimize data transfer
– run queries locally
• I/O will scale almost linearly with nodes
– 1 GB/sec aggregate I/O engine can be built for <$100K
• Non-trivial problems in
– load balancing
– query parallelization (see the sketch after this list)
– queries across inhomogeneous data sources
• These problems are not specific to astronomy
– commercial solutions are around the corner
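A minimal scatter-gather sketch of these ideas (the node class and query interface are invented stand-ins, not a real archive API): each node runs the query locally and ships back only its result rows, minimizing data transfer.

```python
from concurrent.futures import ThreadPoolExecutor

class FakeNode:
    """Stand-in for one archive node holding a local slice of the data."""
    def __init__(self, rows):
        self.rows = rows
    def run_query(self, predicate):
        return [r for r in self.rows if predicate(r)]  # runs locally

def query_all(nodes, predicate):
    # scatter the query to every node in parallel, gather and merge
    # only the (small) results -- the data itself never moves
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda n: n.run_query(predicate), nodes))
    return [row for part in partials for row in part]

nodes = [FakeNode(range(i, 100, 3)) for i in range(3)]
print(query_all(nodes, lambda r: r % 10 == 0))
```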
SDSS: Distributed Layout
[Diagram: the User Interface and Analysis Engine talk to a Master node running the SX Engine on an Objectivity Federation; the federation spans several Slave nodes, each holding an Objectivity database on RAID storage]
Astronomy Issues
• What are the user profiles
• How will the data be analyzed
• How to organize searches
• Geometric indexing
• Color searches and localization
• SDSS Architecture
• Virtual data
User Profiles
• Power users
– sophisticated, with lots of resources
– research is centered around the archival data
• moderate number of very large queries, large output sizes
• Astronomers
– frequent, but casual lookup of objects/regions
– archives help their research, but not central
• large number of small queries, cross-identification requests
• Wide public
– browsing a 'Virtual Telescope'
– can have large public appeal
– need special packaging
• very large number of simple requests
Geometric Approach
• Main problem
– fast, indexed searches of Terabytes in N-dim space
– searches are not axis-parallel
• simple B-tree indexing does not work
• Geometric approach
– use the geometric nature of the data
– quantize data into containers of 'friends'
• objects of similar colors
• close on the sky
• clustered together on disk
– containers represent a coarse-grained map of the data
• multidimensional index-tree (e.g. KD-tree)
Geometric Indexing
“Divide and Conquer”
Partitioning attributes (a sketch of the mesh subdivision follows below):
– Sky Position (3 dimensions): Hierarchical Triangular Mesh
– Multiband Fluxes (N = 5+ dimensions): split as a k-d tree,
stored as an R-tree of bounding boxes
– Other attributes (M = 100+): using regular indexing techniques
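A minimal sketch of the recursive subdivision behind a Hierarchical Triangular Mesh (a simplification: real HTM starts from the 8 faces of an octahedron and packs the digits into an integer ID; this version indexes within a single spherical triangle):

```python
import numpy as np

def midpoint(a, b):
    m = (a + b) / 2.0
    return m / np.linalg.norm(m)        # project back onto the unit sphere

def inside(p, a, b, c):
    # p lies in the (counterclockwise) spherical triangle a,b,c if it is
    # on the inner side of all three great-circle edges
    return all(np.dot(np.cross(u, v), p) >= 0 for u, v in ((a, b), (b, c), (c, a)))

def htm_digits(p, v0, v1, v2, depth):
    """Child index of the triangle containing p, one digit per level."""
    digits = []
    for _ in range(depth):
        w0, w1, w2 = midpoint(v1, v2), midpoint(v0, v2), midpoint(v0, v1)
        for i, (a, b, c) in enumerate(
                [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]):
            if inside(p, a, b, c):
                digits.append(i)
                v0, v1, v2 = a, b, c
                break
    return digits

# toy usage: index a point within one octant of the sky
x, y, z = np.eye(3)
p = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)
print(htm_digits(p, x, y, z, depth=5))  # the centroid stays in the middle child
```

Objects with nearby digit strings are neighbors on the sky, so sorting by the mesh ID also clusters them together on disk.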
Computing Virtual Data
• Analyze large output volumes next to the database
– send results only ('Virtual Data'):
the system 'knows' how to compute the result (Analysis Engine)
• Analysis: different CPU to I/O ratio than database
– multilayered approach
• Highly scalable architecture required
– distributed configuration – scalable to data grids
• Multiply redundant network paths between
data-nodes and compute-nodes
– 'Data-wolf' cluster => Virtual Data Grid (see the sketch below)
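A minimal sketch of the 'Virtual Data' idea, in which results are described by recipes the system can (re)compute next to the data instead of being stored up front; the recipe names and the toy derived quantity are invented for illustration.

```python
recipes = {}   # name -> function that can (re)derive the data product
cache = {}     # materialized results, kept next to the database

def recipe(name):
    def register(fn):
        recipes[name] = fn
        return fn
    return register

def materialize(name, *args):
    """Return the product, computing it server-side only if needed."""
    key = (name, args)
    if key not in cache:
        cache[key] = recipes[name](*args)  # compute next to the data
    return cache[key]                      # ship only the result

@recipe("color")
def color(g, r):
    return g - r                           # toy derived quantity

print(materialize("color", 19.2, 18.7))
```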
A Data Grid Node
[Diagram: one Data Grid node, drawn as three layers - a Compute layer of ~200 CPUs, an Interconnect layer at 1 Gbit/s per node with 10 Gbit/s links to other nodes, and a Database layer of Objectivity servers on RAID delivering ~2 GByte/s aggregate I/O]
Hardware requirements
• Large distributed database engines
– with few Gbyte/s aggregate I/O speed
• High speed (>10 Gbit/s) backbones
– cross-connecting the major archives
• Scalable computing environment
– with hundreds of CPUs for analysis
Clustering of Galaxies
Generic features of galaxy clustering:
• Self organized clustering driven by long range forces
• These lead to clustering on all scales
• Clustering hierarchy: distribution of galaxy counts is
approximately lognormal
• Scenarios: ‘top-down’ vs ‘bottom-up’
Clustering of Computers
• Problem sizes have lognormal distribution
– multiplicative process
• Optimal queuing strategy
– run smallest job in queue
– median scale set by local resources: largest jobs never finish (see the toy simulation after this list)
• Always need more computing
– ‘infall’ to larger clusters nearby
– asymptotically long-tailed distribution of compute power
• Short range forces: supercomputers
• Long range forces: onset of high speed networking
• Self-organized clustering of computing resources
– the Computational Grid
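A toy simulation of the queuing argument above (the lognormal parameters are invented): with lognormally distributed job sizes and a shortest-job-first queue, most jobs finish early while the largest few dominate the total demand.

```python
import numpy as np

rng = np.random.default_rng(0)
jobs = np.sort(rng.lognormal(mean=0.0, sigma=2.0, size=10_000))

finish = np.cumsum(jobs)         # shortest-job-first completion times
half = finish[len(jobs) // 2]    # when the median job completes
print(f"median job: {np.median(jobs):.2f}, largest: {jobs[-1]:.0f}")
print(f"half the jobs done after {100 * half / finish[-1]:.1f}% of total work")
```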
VO: Conceptual Architecture
[Diagram: the User works through Discovery tools and Analysis tools, which reach the distributed Data Archives via a common Gateway]
The Flavor/Role of the NVO
• Highly Distributed and Decentralized
• Multiple phases, each built on top of the previous
• Establish standards, meta-data formats
• Integrate main catalogs
• Develop initial querying tools
• Develop collaboration requirements,
establish procedure to import new catalogs
• Develop distributed analysis environment
• Develop advanced visualization tools
• Develop advanced querying tools
NVO Development Functions
• Software development
– query generation/optimization, software agents, user
interfaces, discovery tools, visualization tools
• Standards development
– Meta-data, meta-services, streaming formats, object
relationships, object attributes,...
• Infrastructure development
– archival storage systems, query engines, compute
servers, high speed connections of main centers
• Train the Next Generation
– train scientists equally at home in astronomy and
modern computer science, statistics, visualization
The Mission of the VO
The Virtual Observatory should
– provide seamless integration of the digitally
represented multi-wavelength sky
– enable efficient simultaneous access to
multi-Terabyte to Petabyte databases
– develop and maintain tools to find patterns and
discoveries contained within the large databases
– develop and maintain tools to confront data
with sophisticated numerical simulations
Sociological Impact
• The VO is an entirely new way of doing astronomy
• Will have a serious sociological impact
• One can expect a love-hate response from the
community
• Needs to be driven by science goals
• Technology will constantly evolve
• Educational impact two-fold
– need new skills to use it efficiently
– provides enormous new opportunities
• Astronomy is rather unique in its wide appeal
– the VO can reach out to an even wider audience
Conclusions
• Databases have become an essential part of astronomy:
most data access will soon be via digital archives
• Data at separate locations, distributed worldwide,
evolving in time: move queries not data (if you can)!
• Computations in both processing and analysis will be
substantial: need to create a `Virtual Data Grid’
• Problems similar to HEP, with many commonalities, but
the data flow is more complex
• Interoperability of archives is essential:
the Virtual Observatory is inevitable
www.voforum.org