Interpretation of the Test Statistic - GLAST at SLAC


Interpretation of the Test Statistic
or: basic Hypothesis Testing,
with applications, in 15 minutes
Patrick Nolan
Stanford University
GLAST LAT DC2 Kickoff
2 March 2006
The Likelihood Ratio
Likelihood is defined to be the probability of observing the data,
assuming that our model is correct.
L(θ)  P(x|θ)
Here x is the observed data and θ is the parameter(s) of the model.
Likelihood is a function of the model parameters (aka the “hypothesis”).
Suppose there are two models with parameter(s) θ0 and θ1. Typically θ0
represents a “null” hypothesis (for instance, no point source is present)
while θ1 represents an “alternate” hypothesis (for instance, there is a point
source). The likelihood ratio is Λ ≡ L(θ0)/L(θ1). If Λ is small, then the
alternate hypothesis explains the data better than the null hypothesis.
This needs to be made quantitative.
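As a concrete illustration (not part of the original talk), consider a single Poisson-distributed photon count, with a background-only rate as the null hypothesis and a background-plus-source rate as the alternate. The Python sketch below just evaluates the two likelihoods and their ratio; all rates and the observed count are invented for illustration.

```python
from scipy.stats import poisson

# Invented counting example: n photons observed in one bin.
# Null hypothesis (theta_0): background only.  Alternate (theta_1): background + source.
n_observed = 12        # hypothetical datum
b_rate = 5.0           # assumed background expectation under the null
s_plus_b_rate = 11.0   # assumed source + background expectation under the alternate

L0 = poisson.pmf(n_observed, b_rate)         # L(theta_0) = P(x | theta_0)
L1 = poisson.pmf(n_observed, s_plus_b_rate)  # L(theta_1) = P(x | theta_1)

lam = L0 / L1                                # likelihood ratio Lambda
print(f"Lambda = {lam:.3g}  (small Lambda favors the alternate hypothesis)")
```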
The Power of a Statistical Test
In hypothesis testing, we decide whether we think θ0 or θ1 is the best explanation
for the data. There are two ways we could go wrong:
Type 1 error: Null is true, but we choose the alternate (spurious detection). Prob = α.
Type 2 error: Alternate is true, but we choose the null (missed detection). Prob = β.
(This is the notation used in every textbook.)
We would like to have both α and β be small, but there are tradeoffs. The usual
procedure is to design a statistical test so that α is fixed at some value, called the size
or significance level of the test. For a single test, a number like 5% might be OK.
When looking for point sources in many places, a smaller α is needed because there are
many opportunities for a Type 1 error. Once α is fixed, 1 − β is called the power of the
test. Large power means that real effects are unlikely to be missed.
The likelihood ratio is useful in this context because of the Neyman-Pearson lemma,
which says that the likelihood ratio is the “best” way to choose between hypotheses.
If we choose the alternative hypothesis over the null when Λ < k, where P(Λ < k | θ0) =
α, then the results will be unbiased and the test is the most powerful available.
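A minimal Monte Carlo sketch of this recipe, continuing the invented Poisson example from above: fix α, estimate the threshold k from the null distribution of Λ, then estimate the power 1 − β under the alternate. The rates, sample sizes, and the simulation approach are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)

# Same invented Poisson setup: null = background only, alternate = background + source.
b_rate, s_plus_b_rate = 5.0, 11.0
alpha = 0.05                     # chosen size (significance level) of the test

def lam(n):
    """Likelihood ratio Lambda = L(theta_0)/L(theta_1) for an observed count n."""
    return poisson.pmf(n, b_rate) / poisson.pmf(n, s_plus_b_rate)

# Null distribution of Lambda by simulation; pick k so that P(Lambda < k | theta_0) ~ alpha.
# (With discrete data the achieved size is only approximately alpha.)
null_lams = lam(rng.poisson(b_rate, size=100_000))
k = np.quantile(null_lams, alpha)

# Power = 1 - beta = P(Lambda < k | theta_1), estimated under the alternate.
alt_lams = lam(rng.poisson(s_plus_b_rate, size=100_000))
power = np.mean(alt_lams < k)
print(f"threshold k = {k:.3g}, estimated power = {power:.2f}")
```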
Making it Quantitative
Usually we deal with composite hypotheses. That is, θ isn’t a single point in parameter
space, but we allow a range of values. Then we compare the best representatives of
the two hypotheses: choose θ0 to maximize L(θ0) and θ1 to maximize L(θ1) in
the allowed regions of parameter space.
In order to use the likelihood ratio test (LRT) we need to be able to solve the
equation P(Λ < k | θ0) = α for k. In general the distribution of Λ is unknown, but Wilks’s
Theorem gives a useful asymptotic expression. The alternate model must “include” the
null. That is, the set of null parameters {θ0} must be a subset of {θ1}. For instance, θ0
describes N point sources, while θ1 has N+1 sources. When there are many events,
-2 ln(Λ) ~ r2
This is what we call TS (“test statistic”).
Here r is the difference in the number of parameters in the null and alternate sets.
This is the basis for the popular χ² and F tests. If r = 1, then
√(-2 ln(Λ)) ~ N(0,1),
the unit normal distribution! Thus a 3-sigma result
requires ln(Λ) = -4.5.
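In practice this means a TS value can be converted to a chance probability with the χ²_r tail function. The snippet below is a small illustration (the TS value is invented), together with the slide's r = 1 rule of thumb that the significance is roughly √TS.

```python
import numpy as np
from scipy.stats import chi2

# Wilks's Theorem (asymptotically, under the regularity conditions):
# TS = -2 ln(Lambda) follows a chi-squared distribution with r degrees of freedom,
# where r is the number of extra parameters in the alternate model.
TS = 9.0    # invented example value
r = 1       # e.g. one extra parameter (the source brightness)

p_chance = chi2.sf(TS, df=r)   # P(chi2_r >= TS): chance probability under the null
print(f"TS = {TS}: chance probability = {p_chance:.2e}")

# For r = 1, the slide's rule of thumb: significance ~ sqrt(TS) sigma,
# so TS = 9 corresponds to about 3 sigma (ln(Lambda) = -4.5).
print(f"sqrt(TS) = {np.sqrt(TS):.1f} sigma")
```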
Why doesn’t this work? See next page.
Conditions, caveats & quibbles
• How many photons do we need to use the asymptotic distribution? I'm not sure. The faintest EGRET detections on a strong background always had at least ~50. That's certainly enough. Can GLAST detect a source with fewer?
• More seriously, Wilks's Theorem doesn't work for our most common situation. It is valid only under "regularity" conditions on the likelihood function and the region of parameter space we study.
• Example: We want to know if there is a point source at a certain position. The brightness of the source will be the only adjustable parameter in the alternate model. Of course the brightness must be ≥ 0. When the brightness = 0, the alternate and null models are indistinguishable: the null value sits on the boundary of the allowed parameter space, which violates one of the regularity conditions.
• What are the consequences?
EGRET pathology: not so bad
Extensive simulations were done using
the EGRET likelihood program and a
realistic background model, with no
point sources. The histogram of test
statistic values doesn’t follow the χ²_1
distribution. It’s low by a factor of 2.
This discrepancy isn’t surprising. Half of the simulations would produce a
better fit with negative source brightness. This isn’t allowed, so Λ = 1
(TS = 0) in all these cases. There should be a δ-function at 0 in the
graph. Statisticians call the resulting distribution ½ χ²_0 + ½ χ²_1.
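If one accepts the ½ χ²_0 + ½ χ²_1 description, the chance probability for an observed TS is simply half the χ²_1 tail probability. A small sketch (not from the talk):

```python
from scipy.stats import chi2

def p_value_mixture(ts):
    """Chance probability P(TS >= ts | null) for the 1/2 chi2_0 + 1/2 chi2_1 mixture."""
    if ts <= 0.0:
        return 1.0
    return 0.5 * chi2.sf(ts, df=1)   # the point mass at TS = 0 contributes nothing for ts > 0

# A given TS is half as probable by chance as the naive chi2_1 tail would suggest.
print(p_value_mixture(9.0), chi2.sf(9.0, df=1))
```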
GLAST pathology: ???
We are in the early stages of similar
simulations for GLAST. The results
are harder to understand. In this
example, about ¾ of the cases result
in TS = 0, rather than the expected
half. About half of the positive TS
values are < 0.1. The distribution
cuts off at large TS more sharply
than a 2 should.
If this type of behavior persists, the interpretation of TS values will be
more difficult. We will need to use simulations to produce probability
tables.
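One plausible form such a table could take (an assumed workflow, not the actual GLAST tools): tabulate empirical tail probabilities from a large set of background-only simulated TS values. The placeholder draw below just stands in for real simulations so the example runs end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_p_value(ts, simulated_ts):
    """Fraction of background-only simulations with TS at least as large as the observed value."""
    simulated_ts = np.asarray(simulated_ts)
    return np.mean(simulated_ts >= ts)

# Placeholder for real background-only simulations: here drawn from the
# 1/2 chi2_0 + 1/2 chi2_1 mixture purely so the example is self-contained.
ts_null = np.where(rng.random(100_000) < 0.5, 0.0, rng.standard_normal(100_000) ** 2)

table = {ts: empirical_p_value(ts, ts_null) for ts in (1.0, 4.0, 9.0, 16.0, 25.0)}
print(table)
```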
Final Words
• This is by no means everything we need to know about statistics. I have said nothing about parameter estimation, upper limits, or comparing models which are not "nested".
• Finding an efficient method to optimize the parameter values is a major effort.
• The problem of multiple point sources is an example of a "mixture model". How do we decide when to stop adding more sources? That's cutting-edge research in statistics.
• I have also skipped over the Bayesian method for dealing with the hypothesis testing problem. That could be a whole other talk.
• Some of us have been talking with Ramani Pilla of the Statistics dept. at Case Western. She has a novel method which avoids the use of Wilks's Theorem. The computation of probabilities is quite involved, but it should be tractable for comparisons with only one additional parameter.
References
• The ultimate reference for all things statistical is Kendall & Stuart, "The Advanced Theory of Statistics". I have consulted the 1979 edition, Volume 2. It is very dense and mathematical.
• More accessible versions can be found in Barlow's "Statistics" and Cowan's "Statistical Data Analysis", both written for physicists. These books are a bit expensive, but I like them. They consider both Bayesian and frequentist methods.
• A cheaper alternative is Wall & Jenkins, "Practical Statistics for Astronomers". It tends to skimp on the theory, but it could be useful.
• The downfall of the LRT was pointed out clearly by Protassov et al. 2002, ApJ 571, 545.
• Pilla's method is described in Pilla et al. 2005, PRL 95, 230202.
• The EGRET likelihood method is explained by Mattox et al. 1996, ApJ 461, 396.