
Practical Statistics for LHC Physicists
Bayesian Inference
Harrison B. Prosper
Florida State University
CERN Academic Training Lectures
9 April, 2015
Confidence Intervals – Recap
[Figure: confidence-interval construction relating the parameter space (interval [l, u] for s) to the sample space, with tail probabilities α_L = P(x ≥ D | u) and α_R = P(x ≤ D | l), and coverage f satisfying α_L + f + α_R = 1.]

Any questions about this figure?
Outline
– Frequentist Hypothesis Tests, continued…
– Bayesian Inference
Hypothesis Tests
To perform a realistic hypothesis test, we first need to rid ourselves of nuisance parameters. Here are the two primary ways:
1. Use a likelihood averaged over the nuisance parameters.
2. Use the profile likelihood.
Example 1: W±W±jj Production (ATLAS)

Recall that for B = 3.0 events (ignoring the uncertainty)

p(D \mid H_0) = \mathrm{Poisson}(D \mid B)

and an observed count D = 12, we found

\text{p-value} = \sum_{D=12}^{\infty} \mathrm{Poisson}(D \mid 3.0) = 7.1 \times 10^{-5}

which corresponds to Z ≈ 3.8 (sometimes called the Z-value).
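As a quick numerical check, here is a minimal sketch using scipy (the library choice is ours, not the lecture's):

```python
import numpy as np
from scipy.stats import norm, poisson

B, D = 3.0, 12
# one-sided tail probability P(n >= D | B); sf(D - 1) = P(n > D - 1)
p_value = poisson.sf(D - 1, B)
# convert the p-value to a one-sided Gaussian significance
Z = norm.isf(p_value)
print(f"p-value = {p_value:.1e}, Z = {Z:.1f}")  # ~7.1e-05, ~3.8
```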
Example 1: W±W±jj Production (ATLAS)

Method 1: We eliminate b from the problem as follows*:

P(D \mid s) = \int_0^\infty P(D \mid s, b)\, d(kb)
            = \frac{(1-x)^2}{Q} \sum_{r=0}^{D} \mathrm{Beta}(x, D-r+1, Q)\, \mathrm{Poisson}(r \mid s)

Exercise 10: Show this

where

x = \frac{1}{1+k}, \qquad \mathrm{Beta}(x, n, m) = \frac{\Gamma(n+m)}{\Gamma(n)\Gamma(m)}\, x^{n-1} (1-x)^{m-1}

L(s) = P(12 | s) is the average likelihood.

(*R.D. Cousins and V.L. Highland, Nucl. Instrum. Meth. A320:331–335, 1992)
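The closed-form Beta sum above is exact; the same marginalization can also be done numerically as a cross-check. A minimal sketch, assuming scipy and using the values Q = 25 and k = 25/3 derived in Step 1 below:

```python
import numpy as np
from scipy import integrate
from scipy.stats import gamma, poisson

Q, k = 25.0, 25.0/3.0   # from B = 3.0 +/- 0.6 (see Step 1 below)

def marginal_likelihood(D, s):
    """Average Poisson(D | s + b) over e^{-kb}(kb)^Q / Gamma(Q+1) d(kb),
    i.e. over a gamma density for b with shape Q + 1 and scale 1/k."""
    integrand = lambda b: poisson.pmf(D, s + b) * gamma.pdf(b, Q + 1, scale=1.0/k)
    return integrate.quad(integrand, 0.0, np.inf)[0]
```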
Example 1: W±W±jj Production (ATLAS)

Background, B = 3.0 ± 0.6 events. Then

p(D \mid H_0) = p(D \mid s = 0) = \frac{(1-x)^2}{Q}\, \mathrm{Beta}(x, D+1, Q)

where D is the observed count, and

\text{p-value} = \sum_{D=12}^{\infty} p(D \mid H_0) = 21 \times 10^{-5}

Exercise 11: Verify this calculation

This is equivalent to 3.5 σ, which may be compared with the 3.8 σ obtained earlier.
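One way to attempt Exercise 11 numerically, reusing the marginal_likelihood sketch above (our check, not part of the slides):

```python
from scipy.stats import norm

# tail probability: 1 minus the marginal probabilities of n = 0..11 at s = 0
p_value = 1.0 - sum(marginal_likelihood(n, 0.0) for n in range(12))
Z = norm.isf(p_value)
print(f"p-value = {p_value:.1e}, Z = {Z:.2f}")  # compare with the quoted 21e-5 and 3.5 sigma
```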
An Aside on s / √b
The quantity s / √b is often used as a rough measure of significance on the “n-σ” scale, but it should be used with caution.

In our example, s ≈ 12 − 3.0 = 9.0 events. So according to this measure, the ATLAS W±W±jj result is a 9.0/√3.0 ≈ 5.2σ effect, which is to be compared with the 3.8σ obtained above!

Beware of s / √b!
The Profile Likelihood Revisited
Recall that the profile likelihood is just the likelihood with all
nuisance parameters replaced by their conditional
maximum likelihood estimates (CMLE).
In our example,

L_P(s) = L(s, \hat{b}(s))

where \hat{b}(s) is the conditional MLE of b at fixed s. We also saw that the quantity

t(s) = -2 \ln[\, L_P(s) / L_P(\hat{s}) \,]

can be used to compute approximate confidence intervals.
The Profile Likelihood Revisited
t(s) can also be used to test hypotheses, in particular, s = 0.
Wilks’ theorem, applied to our example, states that for large samples the density of the signal estimate \hat{s} will be approximately

\mathrm{Gaussian}(\hat{s};\, s, \sigma)

if s is the true expected signal count. Then

t(s) = -2 \ln[\, L_P(s) / L_P(\hat{s}) \,] \approx (s - \hat{s})^2 / \sigma^2

will be distributed approximately as a χ² density with one degree of freedom, that is, as a density that is independent of all the parameters of the problem!
The Profile Likelihood Revisited
Since we now know the analytical form of the probability density of t, we can calculate the observed

\text{p-value} = P[\, t(0) \geq t_{\mathrm{obs}}(0) \,]

where t_obs(0) is the observed value of t(0) for the s = 0 hypothesis. Then, if we find that the

p-value < α,

the significance level of our test, we reject the s = 0 hypothesis.

Furthermore, Z = √t_obs(0), so we can skip the p-value calculation!
Example 1: W±W±jj Production (ATLAS)

Background, B = 3.0 ± 0.6 events, and D = 12. For this example,

t_obs(0) = 12.65

therefore Z = 3.6.

Exercise 12: Verify this calculation
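A sketch of one way to attempt Exercise 12: profile the likelihood over b numerically. scipy is assumed, and additive constants in −2 ln L are dropped since they cancel in t:

```python
import numpy as np
from scipy.optimize import minimize_scalar

D, Q, k = 12, 25.0, 25.0/3.0

def nll(s, b):
    # -2 ln L(s, b), up to constants independent of s and b
    return 2.0*((s + b) - D*np.log(s + b) + k*b - Q*np.log(k*b))

def profiled_nll(s):
    # replace b by its conditional MLE b-hat(s) at fixed s
    return minimize_scalar(lambda b: nll(s, b), bounds=(1e-6, 50.0),
                           method='bounded').fun

# the global maximum of the likelihood is at s-hat = D - Q/k = 9, b-hat = Q/k = 3
t_obs = profiled_nll(0.0) - profiled_nll(9.0)
print(t_obs, np.sqrt(t_obs))   # expect ~12.65 and Z ~ 3.6
```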
Example 4: Higgs to γγ (CMS)
This example mimics part of the CMS Run 1 H → γγ data. We simulate 20,000 background di-photon masses and 200 signal masses. The signal is chosen to be a Gaussian bump with standard deviation 1.5 GeV and mean 125 GeV.

background model: f_b(x, a) = A \exp[-(a_1 x + a_2 x^2)]

signal model: f_s(x, m, w) = \mathrm{Gaussian}(x, m, w)

p(x \mid s, m, w, b, a) = e^{-(s+b)} \prod_{i=1}^{N} [\, s f_s(x_i, m, w) + b f_b(x_i, a) \,]
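A minimal sketch of this extended unbinned log-likelihood in Python; the 100–180 GeV mass window and the normalization scheme are our assumptions, not from the slides:

```python
import numpy as np

LO, HI = 100.0, 180.0   # assumed di-photon mass window

def log_likelihood(x, s, m, w, b, a1, a2):
    """Extended unbinned log-likelihood for an array of masses x."""
    # normalize the background shape exp[-(a1 x + a2 x^2)] on the window
    grid = np.linspace(LO, HI, 2001)
    shape = np.exp(-(a1*grid + a2*grid**2))
    fb = np.exp(-(a1*x + a2*x**2)) / np.trapz(shape, grid)
    # Gaussian signal density with mean m and width w
    fs = np.exp(-0.5*((x - m)/w)**2) / (w*np.sqrt(2.0*np.pi))
    return -(s + b) + np.log(s*fs + b*fb).sum()
```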
Example 4: Higgs to γγ (CMS)
Fitting with Minuit (via RooFit) yields the result shown.

[Figure: maximum-likelihood fit of the signal-plus-background model to the simulated di-photon mass spectrum.]
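A hedged PyROOT/RooFit sketch of such a fit; the variable names, ranges, and starting values are illustrative, not the actual CMS configuration:

```python
import ROOT

x  = ROOT.RooRealVar("x", "m_{#gamma#gamma} [GeV]", 100.0, 180.0)
m  = ROOT.RooRealVar("m", "signal mean", 125.0, 120.0, 130.0)
w  = ROOT.RooRealVar("w", "signal width", 1.5, 0.5, 5.0)
a1 = ROOT.RooRealVar("a1", "a1", 0.04, 0.0, 1.0)
a2 = ROOT.RooRealVar("a2", "a2", 0.0, -1e-3, 1e-3)

sig = ROOT.RooGaussian("sig", "signal", x, m, w)
bkg = ROOT.RooGenericPdf("bkg", "exp(-(a1*x + a2*x*x))",
                         ROOT.RooArgList(x, a1, a2))

s = ROOT.RooRealVar("s", "signal yield", 200.0, 0.0, 1000.0)
b = ROOT.RooRealVar("b", "background yield", 20000.0, 0.0, 1e5)
model = ROOT.RooAddPdf("model", "s+b", ROOT.RooArgList(sig, bkg),
                       ROOT.RooArgList(s, b))

# data = ROOT.RooDataSet(...)                    # the simulated masses
# model.fitTo(data, ROOT.RooFit.Extended(True))  # Minuit runs underneath
```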
Bayesian Inference
Bayesian Inference – 1
Definition:
A method is Bayesian if
1. it is based on the degree of belief interpretation of
probability and if
2. it uses Bayes’ theorem
p(\theta, \omega \mid D) = \frac{p(D \mid \theta, \omega)\, \pi(\theta, \omega)}{p(D)}

for all inferences. Here

D: observed data
θ: parameter of interest
ω: nuisance parameters
π: prior density
Bayesian Inference – 2
Nuisance parameters are removed by marginalization:

p(\theta \mid D) = \int p(\theta, \omega \mid D)\, d\omega
                 = \int p(D \mid \theta, \omega)\, \pi(\theta, \omega)\, d\omega \;/\; p(D)

in contrast to profiling, which can be viewed as marginalization with the data-dependent prior \pi(\theta, \omega) = \delta[\omega - \varphi(\theta, D)]:

p(\theta \mid D) = \int p(D \mid \theta, \omega)\, \pi(\theta, \omega)\, d\omega \;/\; p(D)
                 = \int p(D \mid \theta, \omega)\, \delta[\omega - \varphi(\theta, D)]\, d\omega \;/\; p(D)
                 = p(D \mid \theta, \varphi(\theta, D)) \;/\; p(D)
Bayesian Inference – 3
Bayes’ theorem can be used to compute the probability of a
model. First compute the posterior density:
p(\theta_H, \omega, H \mid D) = \frac{p(D \mid \theta_H, \omega, H)\, \pi(\theta_H, \omega, H)}{p(D)}

D: observed data
H: model or hypothesis
θ_H: parameters of model H
ω: nuisance parameters
π: prior density
Bayesian Inference – 4
1. Factorize the priors: \pi(\theta_H, \omega, H) = \pi(\theta_H, \omega \mid H)\, \pi(H)

2. Then, for each model H, compute the function

p(D \mid H) = \int p(D \mid \theta_H, \omega, H)\, \pi(\theta_H, \omega \mid H)\, d\theta_H\, d\omega

3. Then, compute the probability of each model H:

p(H \mid D) = \frac{p(D \mid H)\, \pi(H)}{\sum_H p(D \mid H)\, \pi(H)}
Bayesian Inference – 5
In order to compute p(H | D), however, two things are needed:

1. Proper priors over the parameter spaces:

\int \pi(\theta_H, \omega \mid H)\, d\theta_H\, d\omega = 1

2. The priors π(H).

In practice, we compute the Bayes factor:

\frac{p(H_1 \mid D)}{p(H_0 \mid D)} = \left[\frac{p(D \mid H_1)}{p(D \mid H_0)}\right] \left[\frac{\pi(H_1)}{\pi(H_0)}\right]

The ratio in the first bracket is B_{10}.
Example 1: W±W±jj Production (ATLAS)
Step 1: Construct a probability model for the observations,

P(D \mid s, b) = \frac{e^{-(s+b)} (s+b)^D}{D!} \cdot \frac{e^{-kb} (kb)^Q}{\Gamma(Q+1)}

and insert the data

D = 12 events
B = 3.0 ± 0.6 background events

Since B = Q / k and \delta B = \sqrt{Q} / k, we have

Q = (B / \delta B)^2 = 25
k = B / \delta B^2 = 8.33

to arrive at the likelihood.
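In code, the same probability model might be written as follows (a sketch assuming scipy; the log is used for numerical stability):

```python
import numpy as np
from scipy.special import gammaln

D, B, dB = 12, 3.0, 0.6
Q = (B/dB)**2    # 25.0
k = B/dB**2      # 8.33

def log_P(s, b):
    # log of Poisson(D | s + b) times e^{-kb}(kb)^Q / Gamma(Q + 1)
    return (-(s + b) + D*np.log(s + b) - gammaln(D + 1)
            - k*b + Q*np.log(k*b) - gammaln(Q + 1))
```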
Example 1: W±W±jj Production (ATLAS)

Step 2: Write down Bayes’ theorem:

p(s, b \mid D) = \frac{p(D, s, b)}{p(D)} = \frac{p(D \mid s, b)\, \pi(s, b)}{p(D)}

and specify the prior:

\pi(s, b) = \pi(b \mid s)\, \pi(s)

It is often convenient first to compute the marginal likelihood by integrating over the nuisance parameters:

p(D \mid s) = \int p(D \mid s, b)\, \pi(b \mid s)\, db
Example 1: W±W±jj Production (ATLAS)

The Prior: What do π(b | s) and π(s) represent?

They encode what we know, or assume, about the expected background and signal in the absence of new observations. We shall assume that s and b are non-negative and finite. After a century of argument, the consensus today is that there is no unique way to represent such vague information.
Example 1: W±W±jj Production (ATLAS)

For simplicity, we shall take π(b | s) = 1*. We may now eliminate b from the problem:

p(D \mid s, H_1) = \int_0^\infty p(D \mid s, b)\, \pi(b \mid s)\, d(kb)
                 = \frac{(1-x)^2}{Q} \sum_{r=0}^{D} \mathrm{Beta}(x, r+1, Q)\, \mathrm{Poisson}(D - r \mid s)

which, of course, is exactly the same function we found earlier! H_1 represents the background-plus-signal hypothesis.

(*Luc Demortier, Supriya Jain, and Harrison B. Prosper, Reference priors for high energy physics, Phys. Rev. D 82:034002, 2010)
Example 1: W±W±jj Production (ATLAS)

L(s) = P(12 | s, H_1) is the marginal likelihood for the expected signal s. Here we compare the marginal and profile likelihoods. For this problem they are found to be almost identical. But this happy outcome does not always occur!
Example 1: W±W±jj Production (ATLAS)

Given the likelihood

P(D \mid s, H_1)

we can compute the posterior density

p(s \mid D, H_1) = \frac{P(D \mid s, H_1)\, \pi(s \mid H_1)}{P(D \mid H_1)}

where

p(D \mid H_1) = \int_0^\infty P(D \mid s, H_1)\, \pi(s \mid H_1)\, ds
Example 1: W±W±jj Production (ATLAS)

Assuming a flat prior for the signal, π(s | H_1) = 1, the posterior density is given by

p(s \mid D, H_1) = \frac{\sum_{r=0}^{D} \mathrm{Beta}(x, r+1, Q)\, \mathrm{Poisson}(D - r \mid s)}{\sum_{r=0}^{D} \mathrm{Beta}(x, r+1, Q)}

The posterior density of the parameter (or parameters) of interest is the complete answer to the inference problem and should be made available. Better still, publish the likelihood and the prior.

Exercise 13: Derive an expression for p(s | D, H_1) assuming a gamma prior Gamma(qs, U+1) for π(s | H_1)
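A sketch of this posterior density with scipy; note that scipy.stats.beta.pdf(x, n, m) matches the Beta density defined earlier:

```python
import numpy as np
from scipy.stats import beta, poisson

D, Q, k = 12, 25.0, 25.0/3.0
x = 1.0/(1.0 + k)
# the r-dependent weights Beta(x, r+1, Q), r = 0..D
weights = np.array([beta.pdf(x, r + 1, Q) for r in range(D + 1)])

def posterior(s):
    """p(s | D, H1) for a flat signal prior pi(s | H1) = 1."""
    return np.dot(weights, poisson.pmf(D - np.arange(D + 1), s)) / weights.sum()
```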
Example 1: W±W±jj Production (ATLAS)

By solving

\int_l^u p(s \mid D, H_1)\, ds = 0.68

we obtain

s ∈ [6.3, 13.5] @ 68% C.L.

Since this is a Bayesian calculation, this statement means: the probability (that is, the degree of belief) that s lies in [6.3, 13.5] is 0.68.
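One simple numerical sketch, using the posterior function above; this computes a central interval, and the slides do not say which 68% convention they use:

```python
import numpy as np

s_grid = np.linspace(0.0, 40.0, 4001)
dens = np.array([posterior(s) for s in s_grid])
cdf = np.cumsum(dens)
cdf /= cdf[-1]                              # normalize on the grid
lower = s_grid[np.searchsorted(cdf, 0.16)]  # central 68% interval
upper = s_grid[np.searchsorted(cdf, 0.84)]
print(f"68% interval: [{lower:.1f}, {upper:.1f}]")  # compare with [6.3, 13.5]
```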
Example 1: W±W±jj Production (ATLAS)

As noted, the number

p(D \mid H_1) = \int_0^\infty p(D \mid s, H_1)\, \pi(s \mid H_1)\, ds

can be used to perform a hypothesis test. But to do so, we need to specify a proper prior for the signal, that is, a prior π(s | H_1) that integrates to one.

The simplest such prior is a δ-function, e.g. π(s | H_1) = δ(s − 9), which yields

p(D \mid H_1) = p(D \mid 9, H_1) = 1.13 \times 10^{-1}
Example 1: W±W±jj Production (ATLAS)

From

p(D \mid H_1) = 1.13 \times 10^{-1} and p(D \mid H_0) = 2.23 \times 10^{-4}

we conclude that the odds in favor of the hypothesis s = 9 have increased by a factor of 506 relative to whatever prior odds you started with.

It is useful to convert this Bayes factor into a (signed) measure akin to the “n-sigma” (Sezen Sekmen, HBP):

Z = \mathrm{sign}(\ln B_{10}) \sqrt{2\, |\ln B_{10}|} = 3.6, \qquad B_{10} = p(D \mid H_1) / p(D \mid H_0)

Exercise 14: Verify this number
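A numerical sketch of Exercise 14, reusing the marginal_likelihood function from the earlier sketch:

```python
import numpy as np

p_H1 = marginal_likelihood(12, 9.0)   # delta-function prior at s = 9
p_H0 = marginal_likelihood(12, 0.0)
B10 = p_H1 / p_H0                     # expect a factor of ~506
Z = np.sign(np.log(B10)) * np.sqrt(2.0*abs(np.log(B10)))
print(f"B10 = {B10:.0f}, Z = {Z:.1f}")  # expect Z ~ 3.6
```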
Example 4: Higgs to γγ (CMS)

Here is a plot of Z vs. m_H as we scan through different hypotheses about the expected signal s. For simplicity, the signal width and background parameters have been fixed to their maximum-likelihood estimates.
Summary – 1
Probability
Two main interpretations:
1. Degree of belief
2. Relative frequency
Likelihood Function
The main ingredient in any full-scale statistical analysis.
Frequentist Principle
Construct statements such that a fraction f ≥ C.L. of them will be true over a specified ensemble of statements.
Summary – 2
Frequentist Approach
1. Use the likelihood function only.
2. Eliminate nuisance parameters by profiling.
3. Decide on a fixed threshold α for rejection, and reject the null if the p-value < α, but do so only if rejecting the null makes scientific sense, e.g. if the probability of the alternative is judged to be high enough.
Bayesian Approach
1. Model all uncertainty using probabilities and use Bayes’ theorem to make all inferences.
2. Eliminate nuisance parameters through marginalization.
The End
“Have the courage to use your own understanding!”
Immanuel Kant