Codes for astrostatistics:
StatCodes & VOStat
Eric Feigelson
Penn State
Vast range of statistical problems
in modern astronomy
• Poisson processes: point processes, time series analysis
• Image analysis: MLE deconvolution, adaptive smoothing,
wavelet analyses
• Multivariate analysis & classification (w/ meas errors)
• Survival analysis (censoring & truncation w/ meas errors)
• Parametric models: Model selection, non-linear regression
• Non-parametric methods
• Confidence limits: bootstrap resampling
• Prior knowledge: Bayesian inference
(see talk at PhysStat 2003 conference)
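As one concrete instance of the "confidence limits: bootstrap resampling" item above, here is a minimal sketch of a percentile bootstrap in Python. The function name `bootstrap_ci`, its parameters, and the flux values are all hypothetical and made up for illustration; they do not come from the talk or from any of the packages it names.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.median, n_resamples=2000, level=0.90, seed=0):
    """Percentile-bootstrap confidence interval for stat(data) (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    n = len(data)
    # Recompute the statistic on many resamples drawn with replacement.
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    # Read the interval off the tails of the sorted bootstrap estimates.
    alpha = (1.0 - level) / 2.0
    lo = estimates[int(alpha * n_resamples)]
    hi = estimates[int((1.0 - alpha) * n_resamples) - 1]
    return lo, hi


# Hypothetical flux measurements (arbitrary units), invented for this example.
fluxes = [1.2, 0.8, 1.9, 1.1, 0.7, 2.4, 1.3, 1.0, 1.6, 0.9]
low, high = bootstrap_ci(fluxes)
print(f"median = {statistics.median(fluxes):.2f}, 90% interval = [{low:.2f}, {high:.2f}]")
```

The percentile method is the simplest bootstrap variant; with a sample this small the interval is only a rough guide.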
The problem
Astronomers are insufficiently trained in
modern applied statistics …
but even if they knew what to do, they have
inadequate access to computer codes.
• Astronomers never use large commercial statistical
packages like SAS, SPSS, Statistica
• Some astronomers sometimes use UNIX-based command-line systems like MATLAB or S-Plus.
• Astronomers like mini-codes in Numerical Recipes & often
write their own codes. Many like IDL which has simple
statistics.
• NASA/NSF observatories produce huge data analysis codes
(IRAF, AIPS, CIAO, …) which by policy avoid proprietary
codes
• A few specialized stand-alone astrostat codes written under
NASA funding: ROSTAT, ASURV, SLOPES, StatPy
Altogether this is a very bad situation:
vast statistical needs
with very inadequate codes
The rise of the Virtual Observatory
Vast collections of calibrated data (images, spectra,
time series), extracted catalogs (rows=sources,
columns=properties), and source bibliographies
emerged during the 1990s.
NASA Science Archive Centers (MAST, HEASARC, IRSA,
LAMDA), bibliographic databases (ADS, SIMBAD, NED),
& more are being transformed into a federated (though
still distributed & heterogeneous) system. XML
metadata (VOTable), SOAP protocols, … for data mining
& extraction.
But originally there was no plan for visualization &
statistical analysis of extracted datasets.
StatCodes: A partial solution
• In the late 1990s, the Penn State group created a Web
metasite with annotated links to ~200 open source
packages & codes of utility to astronomers.
• Quite successful: 50-100 hits/day for 7 years.
• Multivariate & time series methods most popular.
But the collection of on-line codes was
very inhomogeneous and incomplete
R
Finally a broad open-source
statistical software system emerges
Based on the successful commercial UNIX-based
S/S-Plus, R has an interactive command-line feel
(like IDL), flexible data I/O, acceptable graphics,
integration with C/Fortran/Python/…, and quite a lot of
sophisticated statistical methods.
Core R: 2000-page manual with ~200 functionalities,
some very complex & advanced
CRAN: 300 add-on packages, dozens useful to
astronomers. Some are themselves full systems.
VOStat: A Web service
1. Web form interface providing simple statistical R
functions with VOTable inputs
2. Same R functions provided through a more
sophisticated Java-based grid-computing mode.
[Diagram: VOStat architecture. Labels: User; Requests; Answers; VOStat server; Heavy statistical computation; Heavy data; Dispersed VO databases]
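To make the "simple statistical functions with VOTable inputs" idea concrete, here is a minimal sketch that reads one column from a VOTable and summarizes it. It is not VOStat's implementation: VOStat wraps R, whereas this uses only the Python standard library. The table contents, the helper names `local` and `read_column`, and the choice of summary statistics are all assumptions made for illustration.

```python
import statistics
import xml.etree.ElementTree as ET

# A tiny, made-up VOTable document (not taken from any real archive).
VOTABLE = """<VOTABLE version="1.1">
  <RESOURCE>
    <TABLE name="sources">
      <FIELD name="id" datatype="char" arraysize="*"/>
      <FIELD name="mag" datatype="double"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>A</TD><TD>14.2</TD></TR>
          <TR><TD>B</TD><TD>15.0</TD></TR>
          <TR><TD>C</TD><TD>13.4</TD></TR>
          <TR><TD>D</TD><TD>16.2</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""


def local(tag):
    """Strip an XML namespace prefix, e.g. '{ns}TD' -> 'TD'."""
    return tag.rsplit("}", 1)[-1]


def read_column(votable_xml, column):
    """Return one numeric column of a TABLEDATA-encoded VOTable as floats."""
    root = ET.fromstring(votable_xml)
    # Column order is defined by the order of the FIELD elements.
    fields = [el.get("name") for el in root.iter() if local(el.tag) == "FIELD"]
    idx = fields.index(column)
    values = []
    for tr in root.iter():
        if local(tr.tag) == "TR":
            cells = [td.text for td in tr if local(td.tag) == "TD"]
            values.append(float(cells[idx]))
    return values


mags = read_column(VOTABLE, "mag")
summary = {
    "n": len(mags),
    "mean": statistics.mean(mags),
    "median": statistics.median(mags),
    "stdev": statistics.stdev(mags),
}
print(summary)
```

The sketch handles only a single table in the plain TABLEDATA serialization; real VOTables may carry namespaces, several tables, null values, or binary encodings that a dedicated parser would handle.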
VOStat may be a big improvement but …
• Generic Web-based services are inherently inflexible
& limited. VOStat may serve to entice the astronomer
to download R & perform the real analysis at home.
• Astronomers need training in advanced methods
before using them with R. Penn State has just
created a Center for Astrostatistics to develop
curriculum, conduct tutorials, provide template R
code, etc.
• R/CRAN does not serve huge VO datasets or some
special astrostat needs. New methodological/code
development underway (CMU, Cornell, PSU, UCIrv,…)