ProsperLPSC2007

Bayesian Statistics in Analysis
Harrison B. Prosper
Florida State University
Workshop on Top Physics:
from the TeVatron to the LHC
October 19, 2007
Harrison B. Prosper
Workshop on Top Physics, Grenoble
Outline
Introduction
Inference
Model Selection
Summary
Introduction
Blaise Pascal, 1670
Thomas Bayes, 1763
Pierre Simon de Laplace, 1812
Introduction
Let P(A) and P(B) be probabilities assigned to statements, or events, A and B, and let P(AB) be the probability assigned to the joint statement AB. Then the conditional probability of A given B is defined by
$$P(A | B) = \frac{P(AB)}{P(B)}$$
[Figure: Venn diagram of A and B with overlap AB]
P(A) is the probability of A without the restriction specified by B.
P(A|B) is the probability of A when we restrict to the conditions specified by statement B.
Similarly, the conditional probability of B given A is
$$P(B | A) = \frac{P(AB)}{P(A)}$$
Introduction
From
$$P(AB) = P(B | A)\,P(A) = P(A | B)\,P(B)$$
we deduce immediately Bayes' Theorem:
$$P(B | A) = \frac{P(A | B)\,P(B)}{P(A)}$$
Bayesian statistics is the application of Bayes’ theorem to
problems of inference
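As a small numerical illustration of Bayes' theorem, the sketch below takes B to be "the event is signal" and A to be "the event passes a selection cut". All numbers are hypothetical and chosen only to show the arithmetic; P(A) is obtained from the law of total probability.

```python
def bayes(p_a_given_b, p_b, p_a):
    """Return P(B|A) = P(A|B) P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

# Hypothetical inputs (not from the talk).
p_b = 0.01              # P(B): prior fraction of signal events
p_a_given_b = 0.90      # P(A|B): cut efficiency for signal
p_a_given_not_b = 0.05  # P(A|not B): cut efficiency for background

# P(A) from the law of total probability.
p_a = p_a_given_b * p_b + p_a_given_not_b * (1.0 - p_b)

p_b_given_a = bayes(p_a_given_b, p_b, p_a)
print(f"P(signal | passes cut) = {p_b_given_a:.4f}")
```

Even with an efficient cut, the posterior probability stays modest here because the assumed prior P(B) is small.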
Inference
Inference
The Bayesian approach to inference is conceptually simple and
always the same:
Compute
Pr(Data|Model)
Compute
Pr(Model|Data) = Pr(Data|Model) Pr(Model)/Pr(Data)
Pr(Model) is called the prior. It is the probability assigned to the Model irrespective of the Data.
Pr(Data|Model) is called the likelihood.
Pr(Model|Data) is called the posterior probability.
Inference
In practice, inference is done using the continuous form of
Bayes’ theorem:
posterior density = likelihood × prior density / normalization:
$$p(\theta, \omega | D) = \frac{p(D | \theta, \omega)\,\pi(\theta, \omega)}{\int p(D | \theta, \omega)\,\pi(\theta, \omega)\,d\theta\,d\omega}$$
θ are the parameters of interest.
ω denote all other parameters of the problem, which are referred to as nuisance parameters.
marginalization
$$p(\theta | D) = \int p(\theta, \omega | D)\,d\omega$$
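The continuous form of Bayes' theorem and the marginalization step can be sketched numerically on a grid. The example below is only an illustration under stated assumptions: a Poisson likelihood whose mean is the sum of a parameter of interest (`theta`) and one nuisance parameter (`omega`), a flat prior in `theta`, and a Gaussian prior on `omega`. The grids, the observed count, and the prior settings are all hypothetical.

```python
import math

def poisson(n, mean):
    """Poisson probability of observing n counts for a given mean."""
    return math.exp(-mean) * mean**n / math.factorial(n)

def gauss(x, mu, sigma):
    """Gaussian density, used here as a hypothetical nuisance prior."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def marginal_posterior(n_obs, theta_grid, omega_grid, prior):
    """Return p(theta | D) on theta_grid, summing the joint posterior over omega."""
    d_theta = theta_grid[1] - theta_grid[0]
    d_omega = omega_grid[1] - omega_grid[0]
    marginal = []
    for theta in theta_grid:
        # Integrate likelihood * prior over the nuisance parameter (Riemann sum).
        total = sum(poisson(n_obs, theta + omega) * prior(theta, omega)
                    for omega in omega_grid)
        marginal.append(total * d_omega)
    norm = sum(marginal) * d_theta  # the denominator of Bayes' theorem
    return [m / norm for m in marginal]

theta_grid = [0.1 * i for i in range(301)]      # parameter of interest, 0 to 30
omega_grid = [0.05 * i for i in range(1, 201)]  # nuisance parameter, 0.05 to 10

def prior(theta, omega):
    """Assumed prior: flat in theta, Gaussian in omega (mean 3, width 0.5)."""
    return gauss(omega, 3.0, 0.5)

posterior = marginal_posterior(8, theta_grid, omega_grid, prior)
mode = theta_grid[posterior.index(max(posterior))]
print(f"posterior mode of theta: {mode:.1f}")
```

A grid sum is adequate for one or two nuisance parameters; with many nuisance parameters one would normally switch to Monte Carlo integration.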
Example – 1
Model
$$n = s + b$$
s is the mean signal count
b is the mean background count
Task: Infer s, given N
Prior information
$$\hat{b} \pm \delta b, \qquad 0 \le s \le s_{\max}$$
Datum
$$D = \{N\}$$
Likelihood
$$P(D | s, b) = \mathrm{Poisson}(N, s + b)$$
Example – 1
Apply Bayes’ theorem:
posterior = likelihood × prior / normalization:
$$p(s, b | D) = \frac{P(D | s, b)\,\pi(s, b)}{\iint P(D | s, b)\,\pi(s, b)\,ds\,db}$$
π(s, b) is the prior density for s and b, which encodes our prior knowledge of the signal and background means.
The encoding is often difficult and can be controversial.
Example – 1
First factor the prior
$$\pi(s, b) = \pi(b | s)\,\pi(s) = \pi(b)\,\pi(s)$$
where the second equality assumes that the background prior does not depend on s.
Define the marginal likelihood
$$l(D | s) = \int P(D | s, b)\,\pi(b)\,db$$
and write the posterior density for the signal as
$$p(s | D) = \frac{l(D | s)\,\pi(s)}{\int l(D | s)\,\pi(s)\,ds}$$
Example – 1
The Background Prior Density
Suppose that the background has been estimated from a Monte Carlo simulation of the background process, yielding B events that pass the cuts.
Assume that the probability for the count B is given by P(B|μ) = Poisson(B, μ), where μ is the (unknown) mean count of the Monte Carlo sample. We can infer the value of μ by applying Bayes' theorem to the Monte Carlo background experiment
$$p(\mu | B) = \frac{P(B | \mu)\,\pi(\mu)}{\int P(B | \mu)\,\pi(\mu)\,d\mu}$$
Example – 1
The Background Prior Density
Assuming a flat prior π(μ) = constant, we find
p(μ|B) = Gamma(μ, 1, B+1)
(= μ^B exp(–μ)/B!).
Often the mean background count b in the real experiment is related to the mean count μ in the Monte Carlo experiment linearly, b = kμ, where k is an accurately known scale factor, for example, the ratio of the data to Monte Carlo integrated luminosities.
The background can be estimated as follows
$$\hat{b} = k B, \qquad \delta b = k \sqrt{B}$$
Example – 1
The Background Prior Density
The posterior density p(μ|B) now serves as the prior density for the background b in the real experiment
π(b) db = p(μ|B) dμ, where b = kμ.
We can write
$$l(D | s) = k \int P(D | s, k\mu)\,\pi(k\mu)\,d\mu = \int P(D | s, k\mu)\,p(\mu | B)\,d\mu$$
and
$$p(s | D) = \frac{l(D | s)\,\pi(s)}{\int l(D | s)\,\pi(s)\,ds}$$
Example – 1
The calculation of the marginal likelihood yields:
$$\begin{aligned}
l(D | s) &= k \int P(D | s, k\mu)\,\pi(k\mu)\,d\mu, \qquad k\,\pi(k\mu) = p(\mu | B) \\
&= \int_0^\infty \frac{e^{-(s + k\mu)}\,(s + k\mu)^N}{N!}\;\frac{\mu^B e^{-\mu}}{B!}\,d\mu \\
&= e^{-s} \sum_{r=0}^{N} \frac{s^r\,k^{N-r}}{r!\,(N-r)!\,B!}\;\frac{\Gamma(N - r + B + 1)}{(1 + k)^{N - r + B + 1}}
\end{aligned}$$
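The closed-form sum can be checked against a direct numerical integration over the Monte Carlo mean. The sketch below does this and then normalizes the marginal likelihood on a grid to get a posterior density for s. The inputs (N = 7, B = 10, k = 0.5), the flat signal prior, and the upper bound s_max = 30 are hypothetical choices for illustration, and the function names are not from the talk.

```python
import math

def marginal_likelihood(s, n_obs, b_mc, k):
    """Closed-form l(D|s): sum over r = 0..N, with coefficients built in log space."""
    total = 0.0
    for r in range(n_obs + 1):
        m = n_obs - r
        log_coeff = (m * math.log(k)
                     + math.lgamma(m + b_mc + 1)
                     - math.lgamma(r + 1) - math.lgamma(m + 1) - math.lgamma(b_mc + 1)
                     - (m + b_mc + 1) * math.log(1.0 + k))
        # s**r is kept outside the log so that s = 0 is allowed.
        total += s**r * math.exp(log_coeff)
    return math.exp(-s) * total

def marginal_likelihood_numeric(s, n_obs, b_mc, k, mu_max=100.0, steps=20000):
    """Midpoint-rule integral of Poisson(N, s + k*mu) * p(mu|B) over mu."""
    h = mu_max / steps
    total = 0.0
    for i in range(steps):
        mu = (i + 0.5) * h
        mean = s + k * mu
        log_f = (-mean + n_obs * math.log(mean) - math.lgamma(n_obs + 1)
                 + b_mc * math.log(mu) - mu - math.lgamma(b_mc + 1))
        total += math.exp(log_f)
    return total * h

# Hypothetical inputs: observed count, Monte Carlo count, scale factor.
N_OBS, B_MC, K_SCALE = 7, 10, 0.5

closed = marginal_likelihood(3.0, N_OBS, B_MC, K_SCALE)
numeric = marginal_likelihood_numeric(3.0, N_OBS, B_MC, K_SCALE)
print(f"closed form: {closed:.6e}   numerical: {numeric:.6e}")

# Posterior density for s with a flat prior on 0 <= s <= s_max (assumed s_max = 30).
ds = 0.05
s_grid = [ds * i for i in range(601)]
weights = [marginal_likelihood(s, N_OBS, B_MC, K_SCALE) for s in s_grid]
norm = sum(weights) * ds
posterior = [w / norm for w in weights]
```

Working with log-gamma functions avoids overflow in the factorials when N or B is large.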
Example – 2: Top Mass – Run I
Data are partitioned into K bins and modeled by a sum of N sources of strength p. The numbers A are the source distributions for model M. Each M corresponds to a different top signal + background model.
model
$$d_i = \sum_{j=1}^{N} p_j\,a_{ji}$$
likelihood
$$P(D | a, p, M) = \prod_{i=1}^{K} \exp(-d_i)\,d_i^{D_i} / D_i!$$
prior
$$\pi(a, p, M) = \pi(p) \prod_{j=1}^{N} \prod_{i=1}^{K} \exp(-a_{ji})\,a_{ji}^{A_{ji}} / A_{ji}!$$
posterior
$$P(M | D) = \iint P(a, p, M | D)\,da\,dp$$
Example – 2: Top Mass – Run I
[Figure: Probability of Model M, P(M|d), versus top quark mass from 130 to 230 GeV/c²]
m_top = 173.5 ± 4.5 GeV
s = 33 ± 8 events
b = 50.8 ± 8.3 events
To Bin Or Not To Bin
Binned – Pros
Likelihood can be modeled accurately
Bins with low counts can be handled exactly
Statistical uncertainties handled exactly
Binned – Cons
Information loss can be severe
Suffers from the curse of dimensionality
To Bin Or Not To Bin
December 8, 2006 - Binned likelihoods do work!
To Bin Or Not To Bin
Un-Binned – Pros
No loss of information (in principle)
Un-Binned – Cons
Can be difficult to model likelihood accurately.
Requires fitting (either parametric or KDE)
Error in likelihood grows approximately
linearly with the sample size. So at LHC, large
sample sizes could become an issue.
Un-binned Likelihood Functions
Start with the standard binned likelihood over K bins
model
$$d_i = \sigma a_i + b_i$$
likelihood
$$P(D | \sigma, a, b) = \prod_{i=1}^{K} \exp(-d_i)\,d_i^{D_i} / D_i! = \exp\Big(-\sum_{i=1}^{K} d_i\Big) \prod_{i=1}^{K} d_i^{D_i} / D_i!$$
Un-binned Likelihood Functions
Make the bins smaller and smaller
$$d_i = \int_{\Delta x_i} d(x)\,dx \approx [\sigma a(x_i) + b(x_i)]\,\Delta x_i$$
the likelihood becomes
$$P(D | \sigma, A, B) = \exp\Big[-\int (\sigma a(x) + b(x))\,dx\Big] \prod_{i=1}^{K} [\sigma a(x_i) + b(x_i)]\,\Delta x_i$$
The bin widths Δx_i do not depend on the parameters, so they can be dropped:
$$P(D | \sigma, A, B) \propto \exp[-(\sigma A + B)] \prod_{i=1}^{K} [\sigma a(x_i) + b(x_i)]$$
where K is now the number of events and a(x) and b(x) are the effective luminosity and background densities, respectively, and A and B are their integrals.
Un-binned Likelihood Functions
The un-binned likelihood function
$$p(D | \sigma, A, B) = \exp[-(\sigma A + B)] \prod_{i=1}^{K} [\sigma a(x_i) + b(x_i)]$$
is an example of a marked Poisson likelihood. Each event is
marked by the discriminating variable xi, which could be
multi-dimensional.
The various methods for measuring the top cross section and mass
differ in the choice of discriminating variables x.
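A minimal sketch of evaluating the marked Poisson log-likelihood and scanning it in the cross section is given below. The densities a(x) and b(x), their integrals, and the event list are hypothetical toy choices on a one-dimensional variable x in [0, 1]; in a real analysis they would come from modeling.

```python
import math

def log_likelihood(sigma, events, a, b, A, B):
    """ln p(D|sigma, A, B) = -(sigma*A + B) + sum_i ln[sigma*a(x_i) + b(x_i)]."""
    return -(sigma * A + B) + sum(math.log(sigma * a(x) + b(x)) for x in events)

def scan_maximum(events, a, b, A, B, sigma_max=3.0, step=0.01):
    """Return the grid value of sigma that maximizes the log-likelihood."""
    n_steps = int(round(sigma_max / step))
    grid = [step * i for i in range(n_steps + 1)]
    return max(grid, key=lambda sig: log_likelihood(sig, events, a, b, A, B))

# Hypothetical toy model: A and B are the integrals of a(x) and b(x) over [0, 1].
A_INT, B_INT = 10.0, 5.0

def a(x):
    """Assumed effective-luminosity density, rising with x."""
    return A_INT * 2.0 * x

def b(x):
    """Assumed background density, falling with x."""
    return B_INT * 2.0 * (1.0 - x)

# Hypothetical observed values of the discriminating variable.
events = [0.12, 0.35, 0.48, 0.61, 0.66, 0.72, 0.79, 0.83, 0.88, 0.91, 0.95, 0.97]

sigma_hat = scan_maximum(events, a, b, A_INT, B_INT)
print(f"sigma at the likelihood maximum: {sigma_hat:.2f}")
```

If a(x) and b(x) had the same shape, the marks would carry no information and the function would reduce to a simple Poisson likelihood in the total count K.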
Un-binned Likelihood Functions
Note: Since the functions a(x) and b(x) have to be modeled, they will depend on sets of modeling parameters α and β, respectively.
Therefore, in general, the un-binned likelihood function is
$$p(D | \sigma, A, B, \alpha, \beta) = \exp[-(\sigma A + B)] \prod_{i=1}^{K} [\sigma a(x_i, \alpha) + b(x_i, \beta)]$$
which must be combined with a prior density
$$\pi(\sigma, A, B, \alpha, \beta)$$
to compute the posterior density for the cross section
$$p(\sigma | D) \propto \int dA\,dB \int d\alpha\,d\beta\; p(D | \sigma, A, B, \alpha, \beta)\,\pi(\sigma, A, B, \alpha, \beta)$$
Computing the Un-binned Likelihood Function
If we write s(x) = σ a(x), and S = σ A, we can re-write the un-binned likelihood function as
$$p(D | S, B) = \exp[-(S + B)] \prod_{i=1}^{K} [s(x_i) + b(x_i)]$$
Since a likelihood function is defined only to within a scaling by a parameter-independent quantity, we are free to scale it by, for example, the observed distribution d(x)
$$p(D | S, B) = \exp[-(S + B)] \prod_{i=1}^{K} \left[\frac{s(x_i) + b(x_i)}{d(x_i)}\right]$$
Computing the Un-binned Likelihood Function
One way to approximate the ratio [s(x)+ b(x)]/d(x) is with a
neural network function trained with an admixture of data, signal
and background in the ratio 2:1:1.
If the training can be done accurately enough, the network will
approximate
n(x) = [s(x)+ b(x)]/[ s(x)+b(x)+d(x)]
in which case we can then write
$$p(D | S, B) = \exp[-(S + B)] \prod_{i=1}^{K} \frac{n(x_i)}{1 - n(x_i)}$$
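The step from the network output to the likelihood rests on the identity n/(1 − n) = (s + b)/d. The sketch below checks this with an idealized "network" that returns the target function exactly; no actual training is done, and the density values at the four events are hypothetical.

```python
import math

def network_target(s_x, b_x, d_x):
    """Ideal network output n(x) = [s(x) + b(x)] / [s(x) + b(x) + d(x)]."""
    return (s_x + b_x) / (s_x + b_x + d_x)

def log_likelihood_from_network(S, B, n_values):
    """ln p(D|S,B) = -(S + B) + sum_i ln[n(x_i) / (1 - n(x_i))]."""
    return -(S + B) + sum(math.log(n / (1.0 - n)) for n in n_values)

def log_likelihood_direct(S, B, density_values):
    """The same quantity computed from (s, b, d) values at each event."""
    return -(S + B) + sum(math.log((s + b) / d) for s, b, d in density_values)

# Hypothetical values of (s(x_i), b(x_i), d(x_i)) at four events.
densities = [(0.2, 1.5, 1.6), (0.9, 0.8, 1.9), (1.4, 0.3, 1.5), (0.1, 2.0, 2.2)]
n_values = [network_target(s, b, d) for s, b, d in densities]

via_network = log_likelihood_from_network(2.6, 4.6, n_values)
direct = log_likelihood_direct(2.6, 4.6, densities)
print(f"via network: {via_network:.6f}   direct: {direct:.6f}")
```

In practice the agreement is only as good as the training, and outputs very close to 1 make the ratio n/(1 − n) numerically sensitive.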
Model Selection
Model Selection
Model selection can also be addressed using Bayes’
theorem. It requires computing
posterior = evidence × prior / p(D):
$$P(M | D) = \frac{p(D | M)\,P(M)}{p(D)}$$
where the evidence for model M is defined by
$$p(D | M) = \int p(D | \theta_M, \omega_M, M)\,\pi(\theta_M, \omega_M | M)\,d\theta_M\,d\omega_M$$
Model Selection
posterior odds = Bayes factor × prior odds:
$$\frac{P(M | D)}{P(N | D)} = \left[\frac{p(D | M)}{p(D | N)}\right] \frac{P(M)}{P(N)}$$
The Bayes factor, $B_{MN}$, or any one-to-one function thereof, can be used to choose between two competing models M and N, e.g., signal + background versus background only.
However, one must be careful to use proper priors.
Model Selection – Example
Consider the following two prototypical models
Model 1
$$P(D | s, b) = \mathrm{Poisson}(N, s + b), \qquad \pi(s, b)$$
Model 2
$$P(D | b) = \mathrm{Poisson}(N, b), \qquad \pi(b)$$
The Bayes factor for these models is given by
$$B_{12} = \frac{P(D | 1)}{P(D | 2)} = \frac{\iint \mathrm{Poisson}(N, s + b)\,\pi(s, b)\,ds\,db}{\int \mathrm{Poisson}(N, b)\,\pi(b)\,db}$$
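The Bayes factor for the two Poisson models can be sketched numerically under simplifying assumptions that are not from the talk: the background mean b is taken as exactly known (so the integrals over b drop out) and the signal prior is flat and proper on 0 ≤ s ≤ s_max. The counts and the value of s_max are hypothetical.

```python
import math

def poisson(n, mean):
    """Poisson probability, computed in log space for stability."""
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))

def evidence_signal_plus_background(n_obs, b, s_max, steps=20000):
    """Evidence for model 1 with known b and a flat proper prior on 0 <= s <= s_max."""
    h = s_max / steps
    # Midpoint rule for (1/s_max) * integral of Poisson(N, s + b) ds.
    total = sum(poisson(n_obs, (i + 0.5) * h + b) for i in range(steps))
    return total * h / s_max

def bayes_factor(n_obs, b, s_max):
    """B12 = P(D|1) / P(D|2) for known background mean b > 0."""
    return evidence_signal_plus_background(n_obs, b, s_max) / poisson(n_obs, b)

# Hypothetical example: expected background of 5 events, signal prior up to 20.
b12_excess = bayes_factor(15, 5.0, 20.0)     # clear excess over background
b12_no_excess = bayes_factor(5, 5.0, 20.0)   # count equal to the background mean
print(f"B12 with excess: {b12_excess:.1f}   B12 without excess: {b12_no_excess:.3f}")
```

The result depends on s_max, which is one reason the choice of a proper prior matters for Bayes factors.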
Model Selection – Example
Calibration of Bayes Factors
Consider the quantity (called the Kullback-Leibler divergence)
$$k(2\,||\,1) = \int P(D | 1) \ln \frac{P(D | 1)}{P(D | 2)}\,dD$$
For the simple Poisson models with known signal and background, it is easy to show that
$$k(2\,||\,1) = -s + (s + b) \ln\left(1 + \frac{s}{b}\right)$$
For s ≪ b, expanding the logarithm to second order gives k(2||1) ≈ s²/(2b), so √(2k(2||1)) ≈ s/√b. That is, roughly speaking, for s ≪ b, √(2 ln B12) ≈ s/√b
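The closed-form divergence for two Poisson models with known means can be checked by summing over counts directly. The sketch below compares the sum with the closed form and with the leading small-signal term s²/(2b); the values s = 5 and b = 100 are hypothetical.

```python
import math

def poisson_log_pmf(n, mean):
    """ln Poisson(n, mean)."""
    return -mean + n * math.log(mean) - math.lgamma(n + 1)

def kl_numeric(s, b, n_max=1000):
    """Sum over counts n of P(n|1) ln[P(n|1)/P(n|2)], model 1 = s + b, model 2 = b."""
    total = 0.0
    for n in range(n_max + 1):
        lp1 = poisson_log_pmf(n, s + b)
        lp2 = poisson_log_pmf(n, b)
        total += math.exp(lp1) * (lp1 - lp2)
    return total

def kl_closed(s, b):
    """Closed form: -s + (s + b) ln(1 + s/b)."""
    return -s + (s + b) * math.log(1.0 + s / b)

s_true, b_true = 5.0, 100.0  # hypothetical signal and background means
k_sum = kl_numeric(s_true, b_true)
k_exact = kl_closed(s_true, b_true)
k_small_s = s_true**2 / (2.0 * b_true)  # leading term for s much smaller than b

print(f"sum: {k_sum:.6f}  closed form: {k_exact:.6f}  s^2/(2b): {k_small_s:.6f}")
print(f"sqrt(2k): {math.sqrt(2.0 * k_exact):.4f}  s/sqrt(b): {s_true / math.sqrt(b_true):.4f}")
```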
Summary
Bayesian statistics is a well-founded and general
framework for thinking about and solving analysis
problems, including:
Analysis design
Modeling uncertainty
Parameter estimation
Interval estimation (limit setting)
Model selection
Signal/background discrimination etc.
It is well worth learning how to think this way!