Transcript Slide 1

A Statistician's View
of Upcoming Grand Challenges
Alanna Connors
Imputed by Xiao-Li Meng
Joint work with Alex Blocker, Paul Baines
Vinay Kashyap, Pavlos Protopapas, and Andreas Zezas
(all members of CBAS, a.k.a. CHASC)
I. Assessing Uncertainty When We
Have No Idea What We Are Doing!

OK, maybe we know a little bit or a little piece of it
Genuine replications are NOT possible
Create pseudo-replications

Bootstrap (the "Green Book" by Efron and Tibshirani, 1994; see the sketch after this list)

Posterior Predictive Replications (Rubin, 1984, Annals of Statistics;
Gelman, Meng and Stern, 1996, Statistica Sinica)

Data Perturbation: taking derivatives with respect to the data
"On Measuring and Correcting the Effects of Data Mining and Model Selection"
(J. Ye, 1998, Journal of the American Statistical Association, 120-131)
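
As an illustration of the first device, here is a minimal sketch of bootstrap pseudo-replications; the toy data, the choice of statistic, and the helper name bootstrap_se are illustrative, not from the talk.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_se(data, statistic, n_rep=1000):
    """Standard error of `statistic` via bootstrap pseudo-replications."""
    n = len(data)
    reps = np.empty(n_rep)
    for b in range(n_rep):
        # Resample with replacement: a pseudo-replication of an
        # experiment we cannot genuinely repeat.
        reps[b] = statistic(data[rng.integers(0, n, size=n)])
    return reps.std(ddof=1)

# Toy usage: uncertainty of the median of heavy-tailed measurements.
y = rng.standard_t(df=3, size=200)
print(bootstrap_se(y, np.median))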


II. “Black Box” Inference and Computation
The likelihood is given as a “black box” (either as a
computer routine or a look-up table);
The prior is given the same way, or we can simulate from the prior;
And we want samples from the Bayesian posterior.
Easy, right? Use Metropolis-Hastings with the prior as the proposal, so the M-H acceptance ratio reduces to a likelihood ratio (see the sketch below) …
Useless, since the posterior typically will be quite different (we at least hope!) from the prior, so the Markov chain won't converge/mix, especially for high-dimensional problems …
 So what do we do?
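
To see why "easy" fails, here is a minimal sketch of that naive sampler: independence Metropolis-Hastings with the prior as the proposal, where the prior terms cancel and only a likelihood ratio remains. The routines log_likelihood and sample_prior stand in for the given black boxes.

import numpy as np

rng = np.random.default_rng(0)

def naive_black_box_mh(log_likelihood, sample_prior, n_iter=10_000):
    """Independence M-H using the prior itself as the proposal."""
    theta = sample_prior()
    chain = []
    for _ in range(n_iter):
        proposal = sample_prior()  # blind draw from the prior
        # Prior terms cancel in the M-H ratio; only the likelihood
        # ratio is left.
        log_alpha = log_likelihood(proposal) - log_likelihood(theta)
        if np.log(rng.uniform()) < log_alpha:
            theta = proposal
        chain.append(theta)
    return np.asarray(chain)

# With informative data, almost every prior draw lands where the
# likelihood is negligible, so acceptances are rare and the chain
# barely moves.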

We need to adaptively blend many advanced methods:
Parallel Tempering (Geyer, 1991, Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface; a minimal sketch follows this list)
Equi-Energy Sampling (Kou, Zhou and Wong, 2006, with discussion, Annals of Statistics)
Ancillarity-Sufficiency Interweaving Strategy (ASIS) (Yu and Meng, 2010, with discussion, Journal of Computational and Graphical Statistics)
AND, we need to know how to "cut corners" …
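
As a concrete instance of one ingredient, here is a hedged parallel-tempering sketch for a scalar parameter; log_post is the black-box log posterior, and the temperature ladder, step size, and swap schedule are illustrative choices only.

import numpy as np

rng = np.random.default_rng(0)

def parallel_tempering(log_post, theta0, temps=(1.0, 2.0, 4.0, 8.0),
                       n_iter=5000, step=0.5):
    K = len(temps)
    thetas = np.full(K, float(theta0))
    cold_chain = []
    for _ in range(n_iter):
        # Random-walk M-H within each tempered target pi(theta)^(1/T).
        for k, T in enumerate(temps):
            prop = thetas[k] + step * np.sqrt(T) * rng.standard_normal()
            if np.log(rng.uniform()) < (log_post(prop) - log_post(thetas[k])) / T:
                thetas[k] = prop
        # Propose swapping states between a random adjacent pair; hot
        # chains roam freely and feed good states down to colder ones.
        k = rng.integers(0, K - 1)
        log_swap = (1.0 / temps[k] - 1.0 / temps[k + 1]) * (
            log_post(thetas[k + 1]) - log_post(thetas[k]))
        if np.log(rng.uniform()) < log_swap:
            thetas[k], thetas[k + 1] = thetas[k + 1], thetas[k]
        cold_chain.append(thetas[0])  # keep only the T = 1 chain
    return np.asarray(cold_chain)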

Example: Color-Magnitude Diagrams
(Baines, Zezas, Kashyap)
Goal: Estimate the mass, age (and possibly metallicity) of a cluster of stars
Parameters: Mass, Age, Metallicity
Data: Photometric data
Theory/Likelihood: Isochrones (tables)
The isochrones connect the scientifically interesting parameters to the observed data via a complicated mapping
A Colorful But Ugly Likelihood
We want an Equi-Energy (EE) Sampler …
Jump between points of equal density/probability (or "energy")
Approximate EE by "Equi-Expectation"
Implementing the Equi-Energy Sampler in high dimensions is impractical
Idea: Use the structure of the problem to construct a low-dimensional and efficient approximation to EE (see the sketch below)
For Gaussian-like data, "Equi-Expectation" clusters approximate "Equi-Energy" clusters
"Equi-Expectation" clusters are data (e.g., star) independent, and hence require only a one-time pre-MCMC step
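
A hedged sketch of that one-time pre-MCMC step, assuming a hypothetical isochrone_mean(theta) that returns the expected photometric vector for parameters theta (the role the isochrone tables play here): cluster a grid of parameter values by their expectations, and treat each cluster as an approximate equi-energy ring for every star.

import numpy as np
from scipy.cluster.vq import kmeans2

def equi_expectation_clusters(theta_grid, isochrone_mean, n_clusters=20):
    """Group parameter points whose expected observations are similar.

    For Gaussian-like data the log-likelihood is essentially a function
    of E[Y | theta], so each cluster is approximately equi-energy, and
    the clustering depends on theta only: compute once, reuse per star.
    """
    mu = np.array([isochrone_mean(t) for t in theta_grid])  # (n, bands)
    _, labels = kmeans2(mu, n_clusters, minit="++")
    return labels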

[Figure sequence: the original parameter space (e.g., magnitude & color); the "rocking boat" Expectation Space; clustering on the Expectation Space; the resulting approximate "equi-energy" clusters back on the original space]
III. Many Frustrations!!!
Outliers, really extreme ones!
Large, long-tailed measurement errors
 Strong dependence
 Non-linear trends (or whatever you want to call them)
 Confounding signals (e.g., quasi-periodic)
 High dimensions
 Too much data
 Too many variables (large p, small n)
 Too little data (there is always ONE observable universe
and ONE entire history!)
 Too little funding, too little time …

Example: Event Detection in Time Series
(Alex Blocker and Pavlos Protopapas)
Use all your tools, but in the right order!
Do some pre-processing (e.g., scan statistics) to reduce computational burden, but with GREAT CAUTION (a sketch follows this list)
Be aware of the artifacts innocent-looking methods may introduce (e.g., spurious correlations); always try on test data first!
Let more rigorous statistical models take care of complications first whenever the computation is feasible
Take advantage of more ad hoc methods when the signal is relatively strong and the computational gain is great
Don't forget to do model checking and uncertainty assessment via pseudo-replications!
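
A minimal sketch of the cautious pre-processing step: a moving-window scan statistic that shortlists candidate windows before any heavy modeling. The window length and threshold are illustrative, and the median/MAD baseline is one way to keep extreme outliers from setting the scale.

import numpy as np

def scan_candidates(y, window=50, z_thresh=4.0):
    """Flag window start indices whose local mean is unusually large."""
    y = np.asarray(y, dtype=float)
    center = np.median(y)
    scale = 1.4826 * np.median(np.abs(y - center)) + 1e-12  # robust sigma
    win_mean = np.convolve(y - center, np.ones(window) / window, mode="valid")
    z = win_mean * np.sqrt(window) / scale  # approx. N(0,1) if no event
    return np.flatnonzero(np.abs(z) > z_thresh)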
A two-stage approach for event detection
Stage 1: Fit a statistical model to separate low-frequency trends L, mid-frequency "candidates" M (event or quasi-periodic), and white noise N; we use a t-model with small degrees of freedom (e.g., 3) to deal with outliers (a sketch of this fit follows below):
Y(t) = Σᵢ aᵢMᵢ(t) + Σⱼ bⱼLⱼ(t) + N(t)
Stage 2: Once the data are reduced to a cleaner (e.g., outliers and non-linear trends removed), lower-dimensional feature vector {aᵢ, i = 1, …, I}, we can use a classifier to separate isolated events from quasi-periodic signals by training on previously identified light curves from each category.
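
A hedged sketch of the stage-one fit under the t-model, using iteratively reweighted least squares (the standard EM device for t errors); the basis matrices M and L are assumed given, and all names are illustrative rather than the speakers' code.

import numpy as np

def fit_t_decomposition(y, M, L, df=3.0, n_iter=50):
    """Fit Y(t) = sum_i a_i M_i(t) + sum_j b_j L_j(t) + N(t), t-noise.

    M: (T, I) event/quasi-periodic basis; L: (T, J) trend basis.
    Returns the feature coefficients a and the trend coefficients b.
    """
    X = np.hstack([M, L])
    w = np.ones(len(y))
    for _ in range(n_iter):
        Xw = X * w[:, None]                       # weighted design
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)
        resid = y - X @ beta
        sigma2 = np.sum(w * resid**2) / len(y)
        # E-step: t-model weights shrink toward 0 for gross outliers.
        w = (df + 1.0) / (df + resid**2 / sigma2)
    return beta[:M.shape[1]], beta[M.shape[1]:]   # (a, b)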
Cutting Corners: even simple Haar wavelets might do the job (a sketch follows) …
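
A minimal illustration of that corner-cutting remark: one level of the Haar wavelet transform, whose detail coefficients are just scaled local differences and can already flag abrupt, isolated events cheaply. Purely a sketch, not the project's code.

import numpy as np

def haar_level(y):
    """One level of the orthonormal Haar transform: (smooth, detail)."""
    y = np.asarray(y, dtype=float)
    if len(y) % 2:
        y = np.append(y, y[-1])       # pad to even length
    smooth = (y[0::2] + y[1::2]) / np.sqrt(2.0)  # local averages
    detail = (y[0::2] - y[1::2]) / np.sqrt(2.0)  # local differences
    return smooth, detail

# Recursing on `smooth` gives coarser levels; large |detail| values
# at some level are cheap candidate flags for isolated events.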
The Grandest Challenge of All …
We need many more future talents who are passionate about quantitative sciences
And who will stay away from Wall Street regardless of the economy!
So what do we do?
Better teaching and training!
"Desired and Feared: What Do We Do Now and Over the Next 50 Years?" (The American Statistician, August 2009)
"Real-Life Statistics: Your Chance for Happiness (or Misery)" (Amstat News, September 2009)
They are intoxicated by …