Searching in High Dimensions - University of Oxford Department of

Download Report

Transcript Searching in High Dimensions - University of Oxford Department of

New Software
Bob Nichol
ICG, Portsmouth
Thanks to all my colleagues in SDSS, GRIST & PiCA
Special thanks to Chris Miller, Alex Gray, Gordon Richards,
Brent Bryan, Chris Genovese, Ryan Scranton, Larry
Wasserman, Jeff Schneider
Outline
1. Cosmological data is exploding in both size and
complexity (see Szalay’s talk)
2. Can present software handle this data quality and
quantity?
3. Also present models lack detail to understand the data
which drives us towards non-parametric techniques.
4. Important synergies between CS, statistics and domain
sciences
Jim reviewed existing statistical software, I will discuss
new software we may need
Phystat 2005 Oxford
Gravity correlates everything
The 2-point function ((r)) has a long history in
cosmology (Peebles 1980). It is the excess joint
probability (dP12) of a pair of points over that
expected from a Poisson process.
dP12 = n2 dV1 dV2 [1 + (r)]
dV1
r
dV2
dP123=n3dV1dV2dV3[1+23(r)+13(r)+12(r)+123(r)]
Also, marked correlation functions
Motivation for the N-point functions: Measure of the
topology of the large-scale structure in universe
Same 2pt, different 3pt
Multi-resolutional KD-trees
Scale to n-dimensions (although for very high
dimensions use new tree structures)
Use Cached Representation (store at each node
summary sufficient statistics). Compute counts
from these statistics
Prune the tree which is stored in memory! (Moore
et al. 2001 astro-ph/0012333)
Many applications; suite of algorithms!
See Alex Gray’s talk tomorrow
Top Level
1st Level
2nd Level
5th Level
Just a set of range searches
Also Prune cells inside!
Greater saving in time
Prune cells outside range
Dual Tree Algorithm
N1 dmax
Usually binned into annuli
rmin< r < rmax
dmin
Thus, for each r transverse both
trees and prune pairs of nodes
N2
No count
dmin < rmax or dmax < rmin
Therefore, only need to calculate
pairs cutting the boundaries.
Scales to n-point functions also do
all r values at once
N1 x N2
rmin > dmin and rmax< dmax
Phystat 2005 Oxford
Survey
size
bins
Can also parallelize on the Grid!
Code Available
• The code is freely available
http://www.autonlab.org/autonweb/showSoftware/127/
• Added “Monte carlo sampling” to code for faster answers: approximate
answers
• Added to AstroGrid suite of algorithms: “aimed at building a data-grid
for UK astronomy, which will form the UK contribution to a global
Virtual Observatory”
• VOtech is an EU-funded R&D project to explore the use of advanced
algorithms in the IOVA infrastructure
• Broker developed to handle submission of jobs to National Grid
Service (NGS) - the UK's core production computational and data grid
service. Also using US Teragrid (5000 correlations in one weekend).
Phystat 2005 Oxford
Phystat 2005 Oxford
EuroVO
• The Euro-VO Data Centre Alliance (DCA):
– A collaborative and operational network of European data
centres who publish data and metadata to the Euro-VO and
who provide a research infrastructure of GRID-enabled
processing and storage facilities.
• The Euro-VO Facility Centre (VOFC):
– An organization that provides the Euro-VO for centralized
resource registry, standards definition and promotion as well
as community support for VO technology take-up and
scientific program support using VO technologies and
resources.
• The Euro-VO Technology Centre (VOTC):
– A distributed organisation that coordinates a set of research
and development projects on VO technology, systems and
tools.
Phystat 2005 Oxford
EuroVO: VOTech Project
• 6.6M Euro Design Study under EU FP6,
• Aims:
– Complete all technical preparatory work necessary for
the construction of the European Virtual Observatory,
– Responsible for development of infrastructure and tools:
• Intelligent resource discovery (ontology and the
semantic web), data interoperability, data mining,
and visualisation capabilities.
– Provide the ability to offload mass scale computational
process onto the Enabling Grids for E-sciencE (EGEE)
backbone.
Phystat 2005 Oxford
Existing infrastructure
• VOTech is tasked with building upon existing
infrastructure:
• In particular:
–IVOA for standards,
–AstroGrid for middleware:
• Web Services based,
• Presumably IVOA will continue to look towards
other standards bodies:
–World Wide Web Consortium (W3C),
–Global Grid Forum (GGF),
–…
Phystat 2005 Oxford
IVOA Standards (Recommendations)
VOTable Format Definition Version 1.1:
– An XML language,
• Flexible storage and exchange format for tabular data:
Emphasis on astronomical tables,
– Allows meta data and data to be stored separately with
links to remote data.
• Resource Metadata for the VO Version 1.01:
– For describing what data and computational facilities
are available, and once identified how to use them.
• Unified Content Descriptor (UCD) (Proposed):
– A formal (and restricted) vocabulary for astronomical
data.
• IVOA Identifiers Version 1.10 (Proposed):
– Syntax for globally unique resource names.
Phystat 2005 Oxford
AstroGrid Components
• MySpace: Distributed file store for workflows,results,
• Common Execution Architecture (CEA):
– Codes need wrapping before use,
– Take command line apps and present as a Web Service.
• Algorithm Registry:
– Meta data from wrapped codes are published in a yellow pages,
for searching.
• Portal:
– Web interface for interacting with preceding services,
– Workflow: Coordinate data flow/control of components within a
larger system of work,
– Submit jobs and observe status, and access files in MySpace.
• Dashboard/Workbench:
– Interact with MySpace, Registry, CEA from any language that
provides XML-RPC library. Web Start application.
Phystat 2005 Oxford
AG rollout and access via portal
AG Portal
Phystat 2005 Oxford
Non-parametric Techniques
• The complexicity and wealth of the data
demands non-parametric techniques,
ie., can one describe phenomena using
the least amount of assumptions?
• The challenges here are computational
as well as psychological
Phystat 2005 Oxford
CMB Power Spectrum
Before WMAP
WMAP data
Are the 2nd and 3rd
peaks detected?
Phystat 2005 Oxford
In parametric models of the CMB power spectrum the
answer is likely “yes” as all CMB models have multiple
peaks. But that has not really answered our question!
Can we answer the question non-parametrically e.g.,
Yi = f(Xi) + ci
Where Yi is the observed data, f(Xi) is an orthogonal
function (icos(iXi)), ci is the covariance matrix. The
challenge is to “shrink” f(Xi), we use
• Beran (2000) to strink f(Xi) to N terms equal to the number
of data points - optimal for all smooth functions and
provides valid confidence intervals
• Monotonic shrinkage of i - specifically nested subset
selection (NSS)
See Genovese et al. (2004) astro-ph/0410104
Phystat 2005 Oxford
Results
(optimal smoothing through bias-variance trade-off)
Concordance
Our f(Xi)
Note, WMAP only fit is not same as concordance model
Phystat 2005 Oxford
Testing models
• The main advantage of this method is that we can construct
a “confidence ball” (in N dimensions) around f(Xi) and thus
perform non-parametric interferences e.g. is the second
peak detected?
Not at 95%
confidence!
Phystat 2005 Oxford
Information Content
fh(Xi) = f(Xi) + b*h
Beyond here there
is little information
Phystat 2005 Oxford
Gray are models in the
95% confidence ball
ASA “Outstanding Application of the year” (2005)
Using CMBfast we can make parametric models (11
parameters) and test if they are within the “confidence
ball”. Varying b we get a range of 0.0169 to 0.0287
Phystat 2005 Oxford
Testing in high D
• Now we can now jointly search all 11 parameters in
the parametric models and determine which models
fit in the confidence ball (at 95%).
• Traditionally this is done by marginalising over the
other parameters to gain confidence intervals on
each parameter separately. This is a problem in highD where the likelihood function could be degenerate,
ill-defined and under-identified
• This is computational intense as billions of models
need to searched, each of which takes ~minute to run
• Use Kriging methods to predict where the surface of
the confidence ball exists and test models there.
Phystat 2005 Oxford
7D parameter
space
400000 samples
Cyan : 0.5
Purple: 1
Blue : 1.5
Red : 2
Green : >2
Bimodal dist. for
several
parameters
Phystat 2005 Oxford
Other applications
• Physics of CMB is well-understood and
people counter that parametric analyses are
better (including Bayesian methods) [however,
concerns about CI]
• Other areas of astrophysics have similar data
problems, but the physics is less developed
– Galaxy and quasar spectra (models are still
rudimentary)
– Galaxy clustering (non-linear gravitational effects
are not confidently modeled)
– Galaxy properties (e.g. star-formation rate)
Phystat 2005 Oxford
Summary
• Existing statistical software can’t scale-up to
the next generation of datasets. Nor does it
exploit the “Grid”
• Need to explore non-parametric statistics
• Software will be made available via AstroGrid
and IVOA - seemless access to data,
computational resources and algorithms
Phystat 2005 Oxford