
”What (should) sensitivity estimates mean?”

Some comments

Jan Conrad
Royal Institute of Technology (KTH), Stockholm
Outline

- Definitions of ”sensitivity”
- Confidence intervals/p-values with systematic uncertainties
  - Averaging
  - Profiling
  - An illustration
- Remarks on the ensemble
- Summary/recommendations
Definition of ”sensitivity” - I

1. (well-known HEP statistics expert from Oxford):
   Median upper limit obtained from repeated experiments with no signal; in a two-dimensional problem, keep one parameter fixed.

2. (fairly well-known HEP statistics expert from Germany):
   Mean result of whatever quantity we want to measure, for example 90% confidence intervals, the mean being taken over identical replicas of the experiment.

3. (less well-known HEP statistics expert from Italy):
   Look at that paper I wrote in arXiv:physics. Nobody has used it, but it is the best definition .....
Definition of sensitivity - II

- Definition using p-values (hypothesis test):

  The experiment is said to be sensitive to a given value of the parameter Θ13 = Θ13sens at significance level α if the mean p-value obtained given Θ13sens is smaller than α.

- The p-value is (per definition) calculated given the null hypothesis Θ13 = 0:

  p = P(T ≥ T_obs | Θ13 = 0)

  where T is the test statistic (for example a χ²) and T_obs is the actually observed value of the test statistic. (A toy sketch follows below.)
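As an illustration of this definition, here is a minimal toy Monte Carlo sketch (my own example, not from the talk): the null distribution of T is built from pseudo-experiments with no signal, and the p-value is the fraction of toys with T at least as large as the observed value. The counting model, the test statistic, and all numbers are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def test_statistic(n, b):
    """Toy test statistic: excess over the expected background,
    in units of the Poisson standard deviation."""
    return (n - b) / np.sqrt(b)

b = 10.0     # assumed known background expectation
n_obs = 18   # hypothetical observed count
t_obs = test_statistic(n_obs, b)

# Null distribution of T from pseudo-experiments with Theta13 = 0 (no signal)
t_null = test_statistic(rng.poisson(b, size=100_000), b)

# p-value: fraction of null toys with T >= T_obs
p_value = np.mean(t_null >= t_obs)
print(f"T_obs = {t_obs:.2f}, p-value = {p_value:.4f}")
```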
Definition of sensitivity - III
(what nuFact people most often use?)

- Definition using confidence intervals (CI)¹:

  The experiment is said to be sensitive to a given value of the parameter Θ13 = Θ13sens at significance level α if the mean² (1−α) CI obtained, given Θ13sens, does not contain Θ13 = 0.

1) This means using confidence intervals for hypothesis testing. I think I convinced myself that the approaches are equivalent, but .....
2) Some people prefer the median .... (because the median is invariant under parameter transformations)
So what?

- Once we have decided on the definition of sensitivity, two problems need to be addressed:
  - What method should be used to calculate the CI or the p-value?
  - Since the experiment does not exist, what is the ensemble of experiments we use to calculate the mean (or other quantities)?
P-values and the Neyman-Pearson lemma

- Uniformly most powerful test statistic: the likelihood ratio

  λ = L(x | H1) / L(x | H0)

- To calculate p-values, we need to know the null distribution of T. Remember:

  p = P(T ≥ T_obs | H0)

- Therefore it comes in handy that asymptotically (Wilks):

  −2 ln λ ∼ χ²
Example: practical calculation using p-values

- Simulate an observation where Θ13 > 0. Fit a model with Θ13 = 0 and a model with Θ13 > 0, then:

  δχ² = χ²(Θ13 = 0) − χ²(Θ13 free)

- δχ² is (under certain circumstances) χ² distributed.
- For problems with this approach, see Luc Demortier: "P-Values: What They Are and How to Use Them", draft report presented at the BIRS workshop on statistical inference problems in high energy physics and astronomy, July 15-20, 2006.
Some methods for p-value calculation

- Conditioning
- Prior-predictive
- Posterior-predictive
- Plug-in
- Likelihood ratio
- Confidence interval
- Generalized frequentist

(I will not talk about these any more.)
Some methods for confidence interval calculation (the Banff list)

- Bayesian
- Feldman & Cousins with Bayesian treatment of nuisance parameters
- Profile likelihood (I will talk a little bit about this one)
- Modified likelihood
- Feldman & Cousins with profile likelihood
- Fully frequentist
- Empirical Bayes
Properties I: Coverage

- A method is said to have coverage (1−α) if, in infinitely many repeated experiments, the resulting CIs include (cover) the true value in a fraction (1−α) of all cases (irrespective of what the true value is). (A numerical sketch follows below.)

[Figure: measured coverage 1−α versus true s; values above the nominal line are over-covering, values below (e.g. 0.9) are under-covering.]
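As a numerical sketch of how coverage can be checked (my own illustration, assuming a Gaussian measurement with known σ and the textbook central interval): generate many toys at a fixed true value, build the interval in each, and count how often the true value is covered.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def coverage(mu_true, sigma=1.0, alpha=0.10, n_toys=100_000):
    """Fraction of toys whose central (1 - alpha) interval
    [x - z*sigma, x + z*sigma] covers the true mean."""
    z = norm.ppf(1.0 - alpha / 2.0)  # 1.645 for a 90% central interval
    x = rng.normal(mu_true, sigma, size=n_toys)
    covered = (x - z * sigma <= mu_true) & (mu_true <= x + z * sigma)
    return covered.mean()

for mu in (0.5, 2.0, 10.0):
    print(f"true mu = {mu:5.1f}: coverage = {coverage(mu):.3f}")  # ~0.90
```

For this ideal Gaussian case the coverage is exact by construction; the interesting cases, as the following slides show, are Poisson counts with nuisance parameters, where the same loop reveals over- or under-coverage.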
Properties II: Type I, type II error and power

- Type I error: reject H0 though it is true.
  Prob(Type I error) = α (corresponds to coverage for hypothesis tests)
- Type II error: accept H0 though it is false.
- Power: β = 1 − Prob(Type II error).
  Given H1, what is the probability that we will reject H0 at a given significance α?
Nuisance parameters

- Nuisance parameters are parameters which enter the data model but which are not of prime interest. Example, a background b:

  n ∼ Poisson(s + b)

  with the signal s of prime interest.

- You don’t want to give CIs (or p-values) dependent on nuisance parameters → you need a way to get rid of them.
How to treat nuisance parameters?

- There is a wealth of approaches to dealing with nuisance parameters. Two are particularly common:

- Averaging (Bayesian). No time to discuss this, see:
  J.C. et al., Phys. Rev. D67:012002, 2003
  J.C. & F. Tegenfeldt, Proceedings PhyStat 05, physics/0511055
  F. Tegenfeldt & J.C., Nucl. Instr. Meth. A539:407-413, 2005

- Profiling. Example which I will present here: profile likelihood/MINUIT (which is similar to what many of you have been doing).
Profile Likelihood Intervals

- The profile likelihood ratio, for a measured n and a measured b:

  λ(s) = L(s, b̂(s) | n, b_meas) / L(ŝ, b̂ | n, b_meas)

  where b̂(s) is the MLE of b given s, and ŝ, b̂ are the MLEs of s and b given the observations.

- To extract limits: scan s and take the lower and upper limits where −2 ln λ(s) crosses 2.706 (90% CL). (A sketch follows below.)
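A minimal sketch of this construction (my own illustration, using the Poisson-plus-Gaussian-background model defined later in the talk, n ∼ Poisson(s + b), b_meas ∼ N(b, σ_b); the observed numbers are placeholders):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm, poisson

n_obs, b_meas, sigma_b = 12, 5.0, 1.0  # hypothetical observations

def nll(s, b):
    """Negative log-likelihood: Poisson count times Gaussian b estimate."""
    return -poisson.logpmf(n_obs, s + b) - norm.logpdf(b_meas, b, sigma_b)

def profile_nll(s):
    """Minimize over the nuisance parameter b for fixed s."""
    return minimize_scalar(lambda b: nll(s, b), bounds=(1e-6, 50.0),
                           method="bounded").fun

# Scan s; the global minimum approximates the MLE (s_hat, b_hat)
s_grid = np.linspace(0.0, 25.0, 501)
pnll = np.array([profile_nll(s) for s in s_grid])

# -2 ln lambda(s); keep the region below the 90% CL cut of 2.706
t = 2.0 * (pnll - pnll.min())
inside = s_grid[t < 2.706]
print(f"90% CL interval for s: [{inside.min():.2f}, {inside.max():.2f}]")
```

MINOS automates the same idea, searching directly for the points where the profiled function rises by UP (next slide).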
From the MINUIT manual

- See F. James, MINUIT Reference Manual, CERN Program Library Long Writeup D506, p. 5:
  “The MINOS error for a given parameter is defined as the change in the value of the parameter that causes F’ to increase by the amount UP, where F’ is the minimum w.r.t. all other free parameters.”

[Figure: profile likelihood curve; the confidence interval is read off at ΔΧ² = 2.71 (90%) or ΔΧ² = 1.07 (70%).]
Coverage of profile likelihood

[Figure: measured coverage (1−α)_MC versus true s, comparing Rolke et al. and MINUIT. Background: Poisson (uncertainty ~ 20% to 40%); efficiency: binomial (uncertainty ~ 12%).]

W. Rolke, A. Lopez, J.C., Nucl. Instr. Meth. A 551 (2005) 493-503


Confidence Intervals for new particle searches at the LHC?

- Basic idea: calculate a 5σ confidence interval and claim discovery if s = 0 is not included.
- Straw-man model: sideband of size τ.
- Bayesian under-covers badly (add 16 events to get the correct significance).
- Profile is the only method considered here which gives coverage (except the full construction).

[Figure: observed in the signal region versus observed in the background region, comparing Bayesian and profile.]

K. S. Cranmer, Proceedings PhyStat 2005
The profile likelihood and the χ²

- The most common method in neutrino physics seems to be minimizing a χ².
- Assume a Gaussian likelihood function:

  L ∝ Π_i exp( −(n_i − μ_i)² / 2σ_i² )

- Omitting terms not dependent on the parameters:

  −2 ln L = Σ_i (n_i − μ_i)² / σ_i² = χ²

  This identification is exact for Gaussian processes and holds asymptotically otherwise.

- A χ² fit is asymptotically equivalent to profile likelihood if you minimize w.r.t. the nuisance parameters.
A simple example calculation

- Model generating the data:

  n ∼ Poisson(s + b),   b_meas ∼ N(b, σ_b)

- This means: in each experiment you measure n and b_meas, given s and b. σ_b is assumed to be known.
- In what follows I use the χ² to calculate a p-value (not a confidence interval).
Two approaches using χ²

- Adding the uncertainty in quadrature (...seems to be quite common...), schematically:

  χ²(s) = (n − s − b_meas)² / (σ_n² + σ_b²)

  with σ_n² the statistical variance of n.

- Allowing for a nuisance parameter (the background normalisation) and minimizing with respect to it:

  χ²(s) = min_b [ (n − s − b)² / σ_n² + (b_meas − b)² / σ_b² ]

  Similar to what is used in, for example: Burguet-Castell et al., Nucl. Phys. B725:306-326, 2005 (beta-beams at the CERN SPS). (A sketch of both follows below.)
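A small sketch contrasting the two constructions under the stated model (my own illustration: the observed values and the Gaussian approximation σ_n² ≈ s + b are assumptions):

```python
import numpy as np
from scipy.optimize import minimize_scalar

n_obs, b_meas, sigma_b = 12, 5.0, 1.0  # hypothetical observations

def chi2_quad(s):
    """Quadrature approach: a single Gaussian term for n, with the
    background uncertainty added to the statistical variance."""
    var = (s + b_meas) + sigma_b**2
    return (n_obs - s - b_meas) ** 2 / var

def chi2_profile(s):
    """Nuisance-parameter approach: background normalisation b is a free
    parameter, constrained by b_meas and minimized out for each s."""
    def chi2(b):
        return (n_obs - s - b) ** 2 / (s + b) \
             + (b_meas - b) ** 2 / sigma_b**2
    return minimize_scalar(chi2, bounds=(1e-6, 50.0), method="bounded").fun

for s in (0.0, 5.0, 10.0):
    print(f"s = {s:4.1f}: quad = {chi2_quad(s):6.2f}, "
          f"profile = {chi2_profile(s):6.2f}")
```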
Coverage (type I error)

- Nominal χ²: what you assume is the correct null distribution.
- Ignore / profile / quadrature addition etc.: the ”real” null distributions of what you call a χ².
- Empirical: the ”true” χ² distribution (...... to the extent you trust ROOT .....).

[Figure: nominal versus real null distributions of the test statistic.]
What if we have only Gaussian processes?
Which method is more sensitive to signal? Power
Power and sensitivity?

- In most cases I saw, an average result is presented.
- This tells you very little about the probability that a given signal will yield a significant observation (power).

- Shot at ”What should sensitivity mean?”:
  An experiment is sensitive to a finite value Θ of a parameter if the probability of obtaining an observation n which rejects Θ = 0 with at least significance α is at least β.
What is the ensemble .....

- ..... of repeated experiments which
  - I should use to calculate the ”mean” (or the probability β) in the sensitivity calculation?
  - I should use to calculate the coverage?
My answer ......

- .... both ensembles should be the same ....

- Each pseudo-experiment:
  - has fixed true values of the prime parameter and the nuisance parameters,
  - yields a prime measurement (e.g. the number of observed events),
  - yields one estimate for each nuisance parameter (e.g. the background)¹.

- This estimate might come from auxiliary measurements in the same or other detectors, or from theory. In the former case, care has to be taken that the measurement procedure is replicated as in the real experiment. In the case of theoretical uncertainties, there is no real ”measurement process”. I would argue that even theoretical uncertainties should be treated as if there were a true value and an estimate, which we pretend is a random variable.

1) Shape and size of uncertainties known beforehand? Otherwise generalize .....
Update .....
”What should sensitivity mean?”

  An experiment is sensitive to a finite value Θ of a parameter if the probability of obtaining an observation n which rejects Θ = 0 with at least significance α is at least β.

- The probability is hereby evaluated using replicas of the experiment with fixed true parameter Θ and fixed nuisance parameters. The random variables in this ensemble are thus the observation n and the estimates of the nuisance parameters. (A sketch follows below.)

- The significance of the observation n is hereby evaluated using replicas of the experiment with fixed true parameter Θ = 0 and fixed nuisance parameters (assuming a p-value procedure; otherwise by CI).
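A sketch of this updated definition for the toy model used above (my own choices throughout: the model n ∼ Poisson(s + b), b_meas ∼ N(b, σ_b), a simple significance-style test statistic, and all numbers): the null ensemble fixes Θ = 0 to calibrate the cut at significance α, and the signal ensemble fixes Θ at the tested value to evaluate the power.

```python
import numpy as np

rng = np.random.default_rng(1)
b_true, sigma_b = 5.0, 1.0       # fixed true nuisance parameters
alpha, beta_req = 0.05, 0.90     # required significance and power
n_toys = 200_000

def toys(s):
    """Pseudo-experiments with fixed true s and b: the random variables
    are the observation n and the background estimate b_meas."""
    n = rng.poisson(s + b_true, n_toys)
    b_meas = rng.normal(b_true, sigma_b, n_toys)
    return (n - b_meas) / np.sqrt(np.maximum(b_meas, 1.0) + sigma_b**2)

# Null ensemble (Theta = 0): cut on T with type I error probability alpha
t_cut = np.quantile(toys(0.0), 1.0 - alpha)

# Signal ensembles: power = fraction of toys rejecting Theta = 0
for s_test in (2.0, 5.0, 8.0):
    power = np.mean(toys(s_test) > t_cut)
    verdict = "sensitive" if power >= beta_req else "not sensitive"
    print(f"s = {s_test:.1f}: power = {power:.3f} -> {verdict}")
```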
How do things look in the nuFact community?

- Unscientific study of 12 papers dealing with sensitivities to oscillation parameters:
  - 0 papers seem to worry about the ensemble w.r.t. which the ”mean” is calculated
  - 0 papers check the statistical validity of the χ² used
  - 3 papers treat systematics and write down explicitly what χ² is used (or give enough information to reconstruct it in principle)
  - 6 papers ignore the systematics or don’t say how they are included in the fit
  - 2 of the papers don’t say how significances/CIs are calculated
  - 1 paper doesn’t even tell me the significance level

- No paper is doing what I would like best; ¼ of the papers are in my opinion acceptable with some goodwill; ¾ of the papers I would reject.
- Binomial errors on these figures are neglected :-)
Summary/Recommendations

- More is more!
  - Include systematics in your calculation (or discuss why you neglect them) ...not just neglect them ....
  - Report under which assumptions the data is generated.
  - Report the test statistic you are using explicitly.

- What does ”mean” mean?
  - I did not encounter any discussion of either the power of a sensitivity analysis or the ensemble of experiments which is used for the ”average”.
Summary cont’d

- And the winner is ....
  - Most of the papers have been using a χ² fit. If you include nuisance parameters in those and minimize w.r.t. them, this is equivalent to a profile likelihood approach for strictly Gaussian processes, and asymptotically equivalent otherwise. This approach seems to provide coverage in many even unexpected cases.

- Don’t think, compute ..... (see the sketch below)
  - Given the computer power available (and since the stakes are high), I think that for sensitivity studies comparing different experimental configurations there is no reason to stick slavishly to the nominal χ² distribution instead of doing a toy MC to construct the distribution of the test statistic yourself.
  - The thinking part is to choose the ensemble of experiments to simulate.
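As an illustration of this recommendation (my own sketch, reusing the toy model from above with assumed numbers): build the null distribution of the profiled δχ² by toy MC and compare its quantiles with the nominal χ²(1 dof) value, instead of assuming the nominal distribution holds.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2

rng = np.random.default_rng(2)
b_true, sigma_b, n_toys = 5.0, 1.0, 1000

def delta_chi2(n, b_meas):
    """delta chi2 between s = 0 and the best-fit s >= 0,
    with the nuisance parameter b minimized out at each s."""
    def chi2_s(s):
        def inner(b):
            return (n - s - b) ** 2 / max(s + b, 1e-6) \
                 + (b_meas - b) ** 2 / sigma_b**2
        return minimize_scalar(inner, bounds=(1e-6, 50.0),
                               method="bounded").fun
    best = min(chi2_s(s) for s in np.linspace(0.0, 20.0, 41))
    return chi2_s(0.0) - best

# Toy MC under the null (s = 0): empirical distribution of the statistic
t = np.array([delta_chi2(rng.poisson(b_true), rng.normal(b_true, sigma_b))
              for _ in range(n_toys)])

# Compare the empirical 90% quantile with the nominal chi2(1 dof) value
print(f"empirical 90% quantile: {np.quantile(t, 0.90):.2f}")
print(f"nominal chi2(1) value : {chi2.ppf(0.90, df=1):.2f}")  # 2.71
```

Any mismatch between the two numbers is exactly the coverage problem discussed on the earlier slides; here, for instance, the boundary s ≥ 0 alone already distorts the null distribution.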
And a last one .... for the customer

- Best is not necessarily best.
  - The intuitive (and effective) result of including systematics (or of doing a careful statistical analysis instead of a crude one) is to worsen the calculated sensitivity. If I were to spend XY M$ on an experiment, I would insist on understanding in detail how the sensitivity is calculated.
  - Otherwise, if anything, I would give the XY M$ to the group with the worse sensitivity but the more adequate calculation.
List of relevant references

- G. Feldman & R. Cousins, Phys. Rev. D57:3873-3889, 1998
  (THE method for confidence interval calculation)
- J.C. et al., Phys. Rev. D67:012002, 2003
  (combining FC with Bayesian treatment of systematics)
- S. Baker and R. Cousins
  (likelihood and χ² in fits to histograms)
- G. Punzi, Proceedings of PhyStat 2003, SLAC, Stanford (2003)
  (a definition of sensitivity including power)
- F. James, Computer Phys. Comm. 20 (1980) 29-35
  (profile likelihood without calling it that)
- K. S. Cranmer, Proceedings PhyStat 05
  (significance calculation for the LHC)
- W. Rolke, A. Lopez and J.C., Nucl. Instr. Meth. A 551 (2005) 493-503
  (profile likelihood and its coverage)
- L. Demortier, presented at the BIRS workshop on statistical inference problems in high energy physics and astronomy, July 15-20, 2006
  (all you do want to know about p-values, but don’t dare to ask)
- F. Tegenfeldt & J.C., Nucl. Instr. Meth. A539:407-413, 2005
  (coverage of CI intervals)
- J.C. & F. Tegenfeldt, Proceedings PhyStat 05, Oxford, 2005, physics/0511055
  (combined experiments, power calculations for CIs with Bayesian treatment of systematics)
- J. Burguet-Castell et al., Nucl. Phys. B725:306-326, 2005
  (example of a rather reasonable sensitivity calculation in neutrino physics; random pick, there are certainly others ...maybe even better)
- R. Barlow, ..., J.C. et al., “The Banff comparison of methods to calculate Confidence Intervals”
  (systematic comparison of confidence interval methods, to be published beginning 2007)
Backups
What if we have 20% uncertainty?
Added uncertainty in efficiency.
Requirements for the χ²

- Gaussian distribution: N(s, s)
- Hypothesis linear in the parameters (so, for example, ”χ²” = (n − s²)/s doesn’t work)
- The functional form of the hypothesis is correct.