Statistics for HEP Roger Barlow Manchester University


Statistics for Experimental HEP
Kajari Mazumdar
TIFR, Mumbai
Slide 0
Why do we do experiments?
To study some phenomenon X, which could mean asking:
• Does a particular species of particle exist or not?
• If it does exist, what are its properties: mass, lifetime, charge, magnetic moment, ...?
• If it has a finite lifetime, which particles does it decay to? What are the branching fractions?
Do the distributions of variables (energies, directions of daughter particles) agree with theoretical predictions?
• What processes can occur, and with what probability?
(in a given experimental situation, depending of course on the initial particles and collision energy)
Slide 1
Template of an experiment
• Arrange for instances of X.
• Record events that might be X.
• Reconstruct the measurable quantities of visible particles.
• Select events that could be X by applying cuts.
• Histogram distributions of interesting variables and compare with the
theoretical prediction(s); there may be several on the market.
Confrontation of theory with experiment.
Ask:
• Is there any evidence for X, or is the null hypothesis refuted?
• Given X, what are the parameters involved in the model?
• Are the results from the experiment compatible with the predictions of X?
Slide 2
Data analysis in particle physics
We observe events of a certain type, which carry various uncertainties that are quantified in terms of probability:
• the theory is not deterministic,
• measurements have various random errors,
• there are other limitations, like cost, time, ...
• We measure characteristics of each event (particle momenta, number of muons, energy of jets, ...).
• Theories (e.g. the SM) predict distributions of these properties up to free parameters, e.g. α, G_F, M_Z, α_s, m_H, ...
• Some tasks of data analysis:
estimate (measure) the parameters;
quantify the uncertainty of the parameter estimates;
test the extent to which the predictions of a theory are in agreement with the data (→ presence of New Physics?).
Slide 3
Everything is a counting experiment in HEP
(within certain approximations)
• To measure a branching ratio or cross-section, we count the number of events produced/observed (accounting for the inefficiency of observation).
• To measure the mass of a particle, we use a histogram where the entry in each bin is a random (Poisson) process.
• This unpredictability is inherent in nature: driven by quantum mechanics, everything becomes probabilistic.
When we want to infer something about the probabilistic processes that produced the data, we need Statistics.
Slide 4
What happens if there’s nothing?
• In the final analysis we may make approximations, take a pragmatic approach, or do things according to convention, but we need a good understanding of the foundational aspects of statistics.
• Even if your analysis finds no events, this is still useful information about the way the universe is built.
• We want to say more than: “We looked for X, we didn’t see it.”
• We need statistics – which can’t prove anything.
• “We show that X probably has a mass greater than...
OR a coupling smaller than...”
Slide 5
Statistical tools mostly used in particle physics
1. Monte Carlo simulation
2. Likelihood methods to estimate parameters.
3. Fitting of data.
4. Goodness of fit.
5. Toy Monte Carlo: to achieve a given level of confidence given
data (Neyman construction).
Slide 6
Monte Carlo simulation
Theoretically, the distributions, perhaps with a few unknown parameters, are beautiful and simple (this is not only dogma, but reality):
• angular distributions may be flat or a function of a few trigonometric terms, like sin θ, cos θ;
• masses may be described by a Cauchy (Breit-Wigner) function;
• time distributions may be exponential, or exponential modulated by a sinusoidal oscillation (neutrino oscillation).
• But they are modified by various complicated effects; higher-order perturbative corrections may be one of the theoretical reasons, plus the detector effects (reconstruction and measurement processes), ...
• A Monte Carlo simulation starts with a simple distribution and puts it through repeated randomization to take the various unavoidable complications into account, finally producing histograms (see the sketch below)!
This is both computer and manpower intensive.
Slide 7
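A minimal sketch of this idea, not any experiment's actual simulation chain: a "true" exponential decay-time distribution is distorted by a hypothetical Gaussian detector resolution and an acceptance cut, and the result is histogrammed. The lifetime, resolution and cut values are made-up illustration numbers.

```python
import numpy as np

rng = np.random.default_rng(42)

# "Simple" theoretical distribution: exponential decay times, tau = 1.5 (made-up value)
tau = 1.5
t_true = rng.exponential(tau, size=100_000)

# Detector complications (purely illustrative):
#  - Gaussian measurement resolution of 0.3
#  - events with measured t < 0.2 are lost (trigger/acceptance cut)
t_meas = t_true + rng.normal(0.0, 0.3, size=t_true.size)
accepted = t_meas > 0.2

# Histogram the surviving, smeared times: this is what the MC "prediction" looks like
counts, edges = np.histogram(t_meas[accepted], bins=50, range=(0.0, 10.0))
print("accepted fraction:", accepted.mean())
print("first few bin contents:", counts[:5])
```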
Concept of Probability
A random variable is a numerical characteristic assigned to an element of the sample space; it can be discrete or continuous.
Probability of a discrete variable x: P(x) = lim_{N→∞} N(x)/N
e.g. coins, dice, cards, ...
For continuous x → probability density function (pdf): P(x to x+dx) = P(x) dx
e.g. parton (quark, gluon) density functions of the proton.
Mathematical definition:
P(A) is a number obeying the Kolmogorov axioms
P(A) ≥ 0
P(A1 ∪ A2) = P(A1) + P(A2) for disjoint A1, A2
Σ_i P(A_i) = 1
Problem with the mathematical definition:
no information is conveyed by P(A)!
All A's are considered equally likely.
P(A) depends on A and on the ensemble.
In particle physics, A1, A2, ... are outcomes of a repeatable experiment, say decays.
In the frequentist interpretation, P(Higgs boson exists) = either 0 OR 1,
but we don’t know which one is correct!
Slide 8
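A small numerical illustration (not from the slides) of the frequency definition P(x) = lim N(x)/N, using a fair six-sided die:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frequency definition: P(x) ~ N(x)/N for large N.
# Here x = "the die shows 6"; the true probability is 1/6.
for N in (100, 10_000, 1_000_000):
    rolls = rng.integers(1, 7, size=N)
    print(N, (rolls == 6).sum() / N)   # should approach 0.1667 as N grows
```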
Treatment of probability in a subjective way
• In particle physics the frequency interpretation is often the most useful,
• but subjective probability can provide a more natural treatment of non-repeatable phenomena: e.g. systematic uncertainties, the probability that the Higgs boson exists, ...
Bayesian interpretation
• P(A) is the degree of belief that A is true.
Conditional probability → probability of A, given B: P(A|B) = P(A∩B)/P(B),
and similarly, probability of B given A: P(B|A) = P(A∩B)/P(A).
But P(A∩B) = P(B|A) P(A),
so P(A|B) = P(B|A) P(A) / P(B)  (Bayes' theorem).
If A, B are independent: P(A∩B) = P(A) P(B).
In Bayesian probability, one assumes in advance a probability that the Higgs boson exists and then interprets the data, taking into account all possibilities which could produce such data.
Slide 9
Frequentist Use of Bayes' Theorem
Example: particle identification. Particle types: e, μ, π, K, p.
Detector signals: DCH, RICH, TOF, TRD (different subdetectors through which particles pass and leave similar signatures).
The probability that a signal in the DCH is due to an e is determined by the probability of an e to leave a detectable signal in the DCH, the probability of an electron to be produced in the reaction, and the total probability of the DCH to register signals due to the different particles:
P'(e) = P(e | DCH) = P(DCH | e) P(e) / P(DCH)
P(DCH) = P(DCH | e) P(e) + P(DCH | μ) P(μ) + P(DCH | π) P(π) + ...
Slide 10
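A toy numerical version of this; the priors and per-particle detection probabilities below are invented for illustration, not taken from any detector:

```python
# Toy Bayes particle ID: probability that a DCH signal came from an electron.
# All numbers are made up for illustration.
priors   = {"e": 0.05, "mu": 0.10, "pi": 0.70, "K": 0.10, "p": 0.05}   # P(particle)
p_signal = {"e": 0.95, "mu": 0.90, "pi": 0.60, "K": 0.55, "p": 0.50}   # P(DCH | particle)

# Total probability of a DCH signal (law of total probability)
p_dch = sum(p_signal[k] * priors[k] for k in priors)

# Bayes' theorem: P(e | DCH) = P(DCH | e) P(e) / P(DCH)
p_e_given_dch = p_signal["e"] * priors["e"] / p_dch
print(f"P(DCH) = {p_dch:.3f},  P(e | DCH) = {p_e_given_dch:.3f}")
```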
Continuous variable
Probability to find x within [x, x+dx]:  f(x) dx
e.g. parton density functions of the proton.
f(x) is NOT a probability!
x must be found somewhere: ∫ f(x) dx = 1.
Cumulative distribution F(x) = ∫_{−∞}^{x} f(x') dx':
the probability to have an outcome less than or equal to x.
Slide 11
Expectation values
Consider a continuous r.v. x with pdf f(x).
Define the expectation (mean) value as E[x] = ∫ x f(x) dx. Notation (often): E[x] = μ. ~ “centre of gravity” of the pdf.
For a function y(x) with pdf g(y): E[y] = ∫ y(x) f(x) dx = ∫ y g(y) dy (equivalent).
Variance: V[x] = E[(x − μ)²] = E[x²] − μ².
Standard deviation: σ = √V[x] ~ width of the pdf, same units as x.
Define covariance cov[x, y] = E[(x − μ_x)(y − μ_y)] = E[xy] − μ_x μ_y.
Correlation coefficient: ρ_xy = cov[x, y] / (σ_x σ_y).
For x, y independent: cov[x, y] = 0.
The reverse is not true!
Slide 12
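A quick numerical check of these definitions on invented toy data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy correlated data: y depends linearly on x plus noise
x = rng.normal(0.0, 1.0, size=50_000)
y = 0.8 * x + rng.normal(0.0, 0.6, size=x.size)

mean_x, mean_y = x.mean(), y.mean()
cov_xy = ((x - mean_x) * (y - mean_y)).mean()   # cov[x,y] = E[(x-mu_x)(y-mu_y)]
rho = cov_xy / (x.std() * y.std())              # correlation coefficient
print(f"cov = {cov_xy:.3f}, rho = {rho:.3f}")   # rho should come out close to 0.8
```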
Correlation
Slide 13
Statistics
• Population: includes all objects of interest → large.
Parameters (mean, standard deviation, etc.) are associated with a population (μ, σ).
• Sample: only a portion of the population → convenient, but comes at a cost. A statistic is a characteristic or measure obtained from a sample.
We compute statistics to estimate parameters.
Variables can be discrete, like the number of events, or continuous, like the mass of the Higgs boson.
• Mean → sum of all values / number of values.
• Median → mid-point of the data after being ranked in ascending order → there are as many numbers above the median as below.
• Mode → the most frequent number in the distribution.
Slide 14
Basic data description
• Weighted mean: x̄ = Σ_i w_i x_i / Σ_i w_i, with w_i = 1/σ_i²;
e.g. measurement of tracks using multiple hits.
• Sample variance (unbiased estimator of the population variance σ²):
s² = (1/(N−1)) Σ_i (x_i − x̄)²
The factor N−1 takes care of the fact that the sample mean is determined from the same set of observations x_i.
Slide 15
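A minimal sketch of both formulas, assuming the usual inverse-variance weights w_i = 1/σ_i² (toy numbers):

```python
import numpy as np

# Toy measurements of the same quantity with different errors
x   = np.array([10.2, 9.8, 10.5, 10.1])
sig = np.array([0.3, 0.2, 0.5, 0.4])

w = 1.0 / sig**2
x_weighted = np.sum(w * x) / np.sum(w)        # weighted mean
sigma_weighted = 1.0 / np.sqrt(np.sum(w))     # its uncertainty

# Unbiased sample variance of the unweighted values (note the N-1)
s2 = np.sum((x - x.mean())**2) / (len(x) - 1)

print(f"weighted mean = {x_weighted:.3f} +- {sigma_weighted:.3f}")
print(f"sample variance = {s2:.3f}")
```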
Distribution / pdf → Example use in HEP
• Binomial → Branching ratio
• Multinomial → Histogram with fixed N
• Poisson → Number of events found
• Uniform → Monte Carlo method
• Exponential → Decay time
• Gaussian → Measurement error
• Chi-square → Goodness-of-fit
• Cauchy → Mass of resonance
• Landau → Ionization energy loss
Slide 16
The Binomial
Random process with exactly 2 possible outcomes, order not important → Bernoulli process.
Individual success probability p, total n trials and r successes:
P(r; n, p) = [n! / (r!(n−r)!)] p^r (1−p)^(n−r),  r = 0, 1, 2, ..., n;  q ≡ 1−p;  0 ≤ p ≤ 1
Mean = np, variance = np(1−p).
E.g. 1. Efficiency/acceptance calculations (see the sketch below).
2. Observe n W decays, out of which r are of type W → μν; p = branching ratio.
Multinomial: distribution with m possible outcomes. For the i-th outcome, p_i is the success rate and all other outcomes count as failures; the number of successes n_i is the random variable here, binomially distributed.
E.g. in a histogram with m bins and N total entries, the content of each bin is a binomially distributed random variable.
Slide 17
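A small sketch of the efficiency use case (toy numbers): out of n generated events, r pass the selection; the efficiency estimate is r/n with binomial variance np(1−p).

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy efficiency study: each of n events passes the selection with true probability 0.75
n, p_true = 2_000, 0.75
r = rng.binomial(n, p_true)            # number of accepted events

eff = r / n
# Binomial error on the efficiency: sqrt(p(1-p)/n), estimated with p -> eff
err = np.sqrt(eff * (1.0 - eff) / n)
print(f"efficiency = {eff:.3f} +- {err:.3f}")
```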
Poisson
‘Events in a continuum’, e.g. the number of events in data.
Mean rate ν in a time (or space) interval.
Probability of finding exactly n events in a given interval:
P(n; ν) = ν^n e^{−ν} / n!,  n = 0, 1, 2, ...;  mean = ν, variance = ν
e.g. 1. cosmic muons reaching the lab, 2. Geiger counter clicks,
3. the number of scattering events n with cross-section σ, for a given luminosity ∫L dt, with ν = σ ∫L dt.
Exponential
f(x; ξ) = (1/ξ) e^{−x/ξ}, x continuous, mean = ξ, variance = ξ²
e.g. the proper decay time t of an unstable particle with lifetime τ: f(t; τ) = (1/τ) e^{−t/τ}
Slide 18
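A toy check of both distributions, with a made-up rate and lifetime:

```python
import numpy as np

rng = np.random.default_rng(3)

# Poisson: counts per interval with mean rate nu = 4.2 (made-up)
nu = 4.2
counts = rng.poisson(nu, size=100_000)
print("Poisson mean, var:", counts.mean(), counts.var())    # both ~ nu

# Exponential: decay times with lifetime tau = 2.0 (made-up)
tau = 2.0
times = rng.exponential(tau, size=100_000)
print("Exponential mean, var:", times.mean(), times.var())  # ~ tau and ~ tau**2
```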
Two Poissons
Two Poisson sources, with means ν1 and ν2.
Combine the samples:
e.g. leptonic and hadronic decays of the W;
forward and backward muon pairs;
tracks that trigger and tracks that don’t, ...
What you get is a convolution:
P(r) = Σ_{r'} P(r'; ν1) P(r−r'; ν2)
It turns out this is also a Poisson, with mean ν1 + ν2!
This avoids a lot of worry:
signal and background are each independent Poisson variables, but the total number of observed events is also Poisson distributed!
In an actual experiment, the expected total number of observed events = expected signal + estimated background.
Slide 19
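A quick simulation of this statement, with arbitrary means:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two independent Poisson sources, e.g. "signal" and "background" (made-up means)
nu1, nu2 = 3.0, 5.0
total = rng.poisson(nu1, size=200_000) + rng.poisson(nu2, size=200_000)

# For a Poisson distribution mean = variance; check that the sum behaves that way
print("mean:", total.mean(), " variance:", total.var())   # both ~ nu1 + nu2 = 8
```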
Gaussian
f(x; μ, σ²) = [1/(σ√(2π))] exp[−(x − μ)²/(2σ²)] = probability density for a continuous r.v. x, with mean = μ, variance = σ²
Maximum height = 1/(σ√(2π));  −∞ < x < ∞;  −∞ < μ < ∞;  σ > 0
• The height is reduced by a factor of 1/√e (~ 61%) at x = μ ± σ.
• The probability for x to be within μ ± 0.6745σ is 50%.
Special case, the standard / normalised Gaussian:
if y is Gaussian with mean μ and width σ, then x = (y − μ)/σ follows f(x) with μ = 0, σ = 1.
68.27% within 1σ      90% within 1.645σ
95.45% within 2σ      95% within 1.960σ
99.73% within 3σ      99% within 2.576σ
                      99.9% within 3.290σ
These numbers apply to Gaussians and only Gaussians!
Slide 20
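These coverage numbers can be checked directly from the Gaussian cumulative distribution; a small sketch using scipy:

```python
from scipy.stats import norm

# Probability content of mu +- k*sigma for a Gaussian: Phi(k) - Phi(-k)
for k in (0.6745, 1.0, 1.645, 1.960, 2.0, 2.576, 3.0):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k:5.3f} sigma: {100 * coverage:6.2f}%")
```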
Central Limit Theorem: Why is the Gaussian “Normal”?
If a continuous random variable x is distributed according to any pdf with finite mean and variance, the sample mean of n observations of x will have a pdf which approaches a Gaussian for large n.
• If x_i is a set of independent variables of mean μ and variance σ², then y = (1/N) Σ x_i, for large N, tends to become Gaussian with mean μ and variance σ²/N.
Connection among the Gaussian, Binomial and Poisson distributions:
Binomial → Poisson for p → 0, N → ∞ with Np = ν fixed;
Poisson → Gaussian for ν → ∞;
Binomial → Gaussian for N → ∞.
Slide 21
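A minimal demonstration of the CLT: means of N uniform random numbers (a very non-Gaussian starting pdf) already look Gaussian for modest N, with the expected width σ/√N.

```python
import numpy as np

rng = np.random.default_rng(5)

# Uniform pdf on [0,1): mean 0.5, variance 1/12
N = 20                                    # observations per sample mean
means = rng.uniform(0.0, 1.0, size=(100_000, N)).mean(axis=1)

print("mean of sample means:", means.mean())          # ~ 0.5
print("std of sample means :", means.std())           # ~ sqrt((1/12)/N) ~ 0.0645
print("expected std        :", np.sqrt(1.0 / 12.0 / N))
```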
Gaussian
For a large variety of situations, if the experiment is repeated many times and the value of the quantity is measured accurately, without any bias, the results are distributed according to a Gaussian.
• Typically we assume that the experimental resolution has a Gaussian form, which quite often may not be the case!
→ Artificial enhancement of the significance of observed deviations.
• It is also important to estimate the magnitude of the error correctly:
with errors under-estimated by 50%, a 4σ effect may actually be only a 2σ effect!
Slide 22
Multidimensional Gaussian
For a set of n Gaussian random variables, not necessarily independent, the joint pdf is a multivariate Gaussian:
P(x; μ, V) = [1 / ((2π)^{n/2} |V|^{1/2})] exp[ −½ (x − μ)ᵀ V⁻¹ (x − μ) ]
V is the covariance matrix of the x’s: symmetric, n×n, with V_ii = Var(x_i) and
V_ij = ⟨(x_i − μ_i)(x_j − μ_j)⟩ = ρ_ij σ_i σ_j
ρ_ij = correlation coefficient for x_i and x_j, |ρ_ij| ≤ 1;
ρ_ij = 0 if x_i and x_j are independent of each other.
Correlation coefficient: ρ = cov(x, y) / (σ_x σ_y)
Slide 23
For two uncorrelated Gaussian variables:
P(x, y; μ_x, μ_y, σ_x, σ_y) = [1/(2π σ_x σ_y)] exp[ −(x − μ_x)²/(2σ_x²) − (y − μ_y)²/(2σ_y²) ]
Each elliptical contour ↔ fixed probability.
No correlation: ρ = 0.
Slide 24
With correlations:
P(x, y; μ_x, μ_y, σ_x, σ_y, ρ) = [1/(2π σ_x σ_y √(1 − ρ²))] ·
exp{ −[ (x − μ_x)²/σ_x² + (y − μ_y)²/σ_y² − 2ρ(x − μ_x)(y − μ_y)/(σ_x σ_y) ] / [2(1 − ρ²)] }
Slide 25
More on correlation between variables
The covariance matrix plays a very important role in the propagation of errors when changing variables from x to y (to first order only!).
• Negative covariance → anti-correlation.
• The semi-axes of the ellipse are given by the square roots of the eigenvalues of the error matrix.
• The ellipse provides the likely range of x, y values, and they lie in a region smaller than the rectangle defined by the maximal x', y' values.
• For the case of 2 variables, the point (x, y) lies outside the 1-standard-deviation ellipse with probability 61%.
Slide 26
Chi-squared
z = χ² is a continuous random variable:
χ² = Σ_{i=1}^{n} ( (x_i − μ_i) / σ_i )²
Mean = n, variance = 2n.
z = sum of squared discrepancies, each scaled by its expected error;
n = 1, 2, ... = number of degrees of freedom; the x_i are independent Gaussians.
Used frequently to test goodness-of-fit.
The confidence level is obtained by integrating the tail of the f(z; n) distribution (from the observed χ² up to ∞):
CL(χ²) = ∫_{χ²}^{∞} f(z; n) dz
The cumulative distribution of χ² is useful in judging the consistency of data with a model.
Since the mean is n, a reasonable experiment should get χ² ≈ n.
Thus the reduced χ² (= χ²/n) is a useful measure!
Slide 27
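A small sketch of the goodness-of-fit use: compute χ² for toy data against a model and convert it to a p-value (the tail integral above) with scipy.

```python
import numpy as np
from scipy.stats import chi2

# Toy measurements with errors, compared to a model prediction of 10.0 everywhere
y     = np.array([ 9.7, 10.4, 11.0,  9.1, 10.6])
sigma = np.array([ 0.5,  0.5,  0.6,  0.6,  0.5])
model = np.full_like(y, 10.0)

chisq = np.sum(((y - model) / sigma) ** 2)
ndf = len(y)                               # no fitted parameters in this toy
p_value = chi2.sf(chisq, ndf)              # tail integral from chisq to infinity
print(f"chi2 = {chisq:.2f} for ndf = {ndf}, p-value = {p_value:.3f}")
```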
Slide 28
Slide 29
About Estimation
Probability theory / calculus:  Theory → Data.
Given these distribution parameters, what can we say about the data?
Statistical inference:  Data → Theory.
Given this data, what can we say about the properties or parameters or correctness of the distribution functions?
Having estimated a parameter of the theory, we need to provide the error on the estimate as well.
Slide 30
What is an estimator?
An estimator is a procedure giving a value for a parameter or property of the distribution as a function of the actual data values, e.g.
μ̂({x}) = (1/N) Σ_i x_i
μ̂({x}) = (x_max + x_min)/2
V̂({x}) = (1/N) Σ_i (x_i − μ̂)²
A perfect estimator is consistent, unbiased and efficient.
Often we have to deal with a less-than-perfect estimator!
Consistent:  lim_{N→∞} â = a
Unbiased:  ⟨â⟩ = ∫...∫ â(x1, x2, ...) P(x1; a) P(x2; a) P(x3; a) ... dx1 dx2 ... = a
Efficient:  V(â) = ⟨(â − ⟨â⟩)²⟩ is as small as possible, i.e. it attains the Minimum Variance Bound
V(â) ≥ 1 / ⟨ −d² ln L / da² ⟩
Slide 31
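A quick toy check of the "unbiased" property (not from the slides): the 1/N variance estimator above is biased, while the 1/(N−1) version is not.

```python
import numpy as np

rng = np.random.default_rng(6)

# True distribution: Gaussian with variance 4.0; samples of only N = 5 points
true_var, N, n_experiments = 4.0, 5, 200_000
x = rng.normal(0.0, np.sqrt(true_var), size=(n_experiments, N))

v_biased   = x.var(axis=1, ddof=0)    # (1/N)     * sum (x_i - mean)^2
v_unbiased = x.var(axis=1, ddof=1)    # (1/(N-1)) * sum (x_i - mean)^2

print("biased estimator, average  :", v_biased.mean())    # ~ 3.2 = (N-1)/N * 4
print("unbiased estimator, average:", v_unbiased.mean())  # ~ 4.0
```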
The Likelihood Function
Set of data {x1, x2, x3, ..., xN}:
• Each x may be multidimensional.
• The probability depends on some parameter a; a may be multidimensional!
The total probability (density) is the likelihood:
L(x1, x2, x3, ..., xN; a) = P(x1; a) P(x2; a) P(x3; a) ... P(xN; a)
Given data {x1, x2, x3, ..., xN}, estimate a by maximising L.
In practice one usually maximises ln L, as it is easier to calculate and handle: just add the ln P(xi).
ML has lots of nice properties (e.g. it is consistent and efficient for large N).
Slide 32
(Figure: ln L as a function of a, maximised at a = â.)
ML does not give goodness of fit!
• ML will not complain if your assumed P(x; a) is rubbish.
• The value of L at the maximum tells you nothing.
• The normalisation of L is important.
• Quote only the upper limit from the analysis.
A fit of P(x) = a1·x + a0 to flat data will give a1 = 0; constant P → L = a0^N, just like you get from fitting the correct model.
E.g. the lifetime distribution:
pdf  p(t; λ) = λ e^{−λt}
so   L(λ; t) = λ e^{−λt}  (single observed t); here both t and λ are continuous.
The pdf maximises at t = 0, while L maximises at λ = 1/t.
→ The functional forms of P(t) and L(λ) are different.
Slide 33
Lifetime distribution (continued):
pdf  p(t; λ) = λ e^{−λt}
so   L(λ; t) = λ e^{−λt}  (single observed t); here both t and λ are continuous.
The pdf maximises at t = 0; L maximises at λ = 1/t.
→ The functional forms of P(t) and L(λ) are different.
(Figure: P as a function of t at fixed λ, and L as a function of λ at fixed t.)
Slide 34
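A minimal sketch of a maximum-likelihood fit to this example with many observed decay times (toy data; using scipy's minimizer on −ln L):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)

# Toy data: decay times from an exponential with true rate lambda = 0.5 (lifetime 2.0)
lam_true = 0.5
t = rng.exponential(1.0 / lam_true, size=5_000)

# -ln L for p(t; lambda) = lambda * exp(-lambda * t):
#   ln L = N ln(lambda) - lambda * sum(t)
def neg_log_like(lam):
    return -(len(t) * np.log(lam) - lam * t.sum())

res = minimize_scalar(neg_log_like, bounds=(1e-3, 10.0), method="bounded")
print("ML estimate of lambda:", res.x)       # analytically this is N / sum(t) ~ 0.5
print("analytic estimate    :", len(t) / t.sum())
```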
Least Squares
• Measurements of y at various x, with errors σ, and a prediction f(x; a).
• Probability of each measurement: P ∝ exp[ −(y − f(x; a))² / (2σ²) ]
• ln L = −½ Σ_i ( (y_i − f(x_i; a)) / σ_i )² + const
• To maximise ln L, minimise
χ² = Σ_i ( (y_i − f(x_i; a)) / σ_i )²
• Should get χ² ≈ 1 per data point.
N_degrees of freedom = N_data points − N_parameters
This provides a ‘goodness of agreement’ figure which allows for credibility.
So ML ‘proves’ Least Squares.
Slide 35
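A short sketch of a weighted least-squares (χ²) straight-line fit on toy data, using curve_fit from scipy (which minimises exactly this χ² when absolute_sigma=True):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(8)

# Toy data generated from a straight line y = 2x + 1 with Gaussian errors
x = np.linspace(0.0, 5.0, 12)
sigma = np.full_like(x, 0.4)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma)

def line(x, a1, a0):
    return a1 * x + a0

popt, pcov = curve_fit(line, x, y, sigma=sigma, absolute_sigma=True)
chisq = np.sum(((y - line(x, *popt)) / sigma) ** 2)
ndf = len(x) - len(popt)                   # data points minus fitted parameters
print("fitted slope, intercept:", popt)
print(f"chi2/ndf = {chisq:.1f}/{ndf}")
```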
Chi Squared Results
A small χ² comes from:
1. overestimated errors,
2. good luck.
A large χ² comes from:
1. bad measurements,
2. bad theory,
3. underestimated errors,
4. bad luck.
Extended Maximum Likelihood:
Allow the normalisation of P(x; a) to float; this predicts the numbers of events as well as their distributions.
Need to modify L; the extra term stops the normalisation shooting up to infinity.
Slide 36
Variance of the estimator
• One way to obtain it is to simulate the entire experiment many times with a Monte Carlo program (using the ML estimate as input to the MC) and look at the spread of the estimates.
• Log-likelihood method: expand ln L around its maximum,
ln L(θ) = ln L(θ̂) + (∂ln L/∂θ)|_{θ̂} (θ − θ̂) + ½ (∂²ln L/∂θ²)|_{θ̂} (θ − θ̂)² + ...
The 2nd term is zero at the maximum. To a good approximation,
ln L(θ) ≈ ln L_max − (θ − θ̂)² / (2 σ̂²_θ),  since  σ̂²_θ ≈ −1 / (∂²ln L/∂θ²)|_{θ̂}
Basically, change θ away from θ̂ until ln L decreases by 1/2: that gives θ̂ ± σ̂_θ.
For the least-squares estimator the equivalent rule is Δχ² = +1.
Slide 37
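A minimal sketch of the Δln L = 1/2 rule on the exponential-lifetime example above (toy data; the interval edges are found by a simple numerical scan):

```python
import numpy as np

rng = np.random.default_rng(9)

# Toy exponential data with true rate 0.5
t = rng.exponential(2.0, size=2_000)

def log_like(lam):
    return len(t) * np.log(lam) - lam * t.sum()

lam_hat = len(t) / t.sum()                    # analytic ML estimate
scan = np.linspace(0.8 * lam_hat, 1.2 * lam_hat, 2_001)
lnL = np.array([log_like(l) for l in scan])

# Points where ln L has dropped by 0.5 from its maximum define the ~68% interval
inside = lnL >= lnL.max() - 0.5
lo, hi = scan[inside][0], scan[inside][-1]
print(f"lambda = {lam_hat:.4f}  (interval [{lo:.4f}, {hi:.4f}])")
print("compare sqrt(N)/sum(t) =", np.sqrt(len(t)) / t.sum())  # expected width ~ lam_hat/sqrt(N)
```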
Hypothesis Testing
• Consider a set of measurements x = (x1, ..., xn) pertaining to a particular subset of events.
• xi may refer to the number of muons in the event, the transverse energy of the leading jet, the missing transverse energy, and so on.
• f(x) refers to the n-dimensional joint pdf, which depends on the type of event actually produced.
For each reaction we consider, we will have a hypothesis for the pdf of x, e.g. f(x | H0), f(x | H1), and so on, where the Hi refer to different possibilities.
Say H0 corresponds to the Higgs boson and H1, H2, ... → backgrounds.
Now, each event is a point in x-space, so we define a set of criteria/cuts, called a test statistic t(x), and work out the pdfs f(t | H0), f(t | H1), ... such that the sample space is divided into 2 regions, where we accept or reject H0.
Slide 38
Level of Significance and Efficiency
Probability to reject H0 if it is true: α → error of the 1st kind → the significance level.
Probability to accept H0 when H1 is true: β → error of the 2nd kind.
Power of the test = 1 − β.
Probability to accept a signal event → signal efficiency.
Probability to accept a background event → background efficiency.
The purity of the selected sample depends on the prior probabilities as well as on the efficiencies.
Slide 39
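A toy illustration of these quantities for a single cut on one Gaussian-distributed test statistic; all means, widths, the cut value and the prior fractions are invented for illustration:

```python
from scipy.stats import norm

# Toy 1-D test statistic t: signal (H0) ~ N(2,1), background (H1) ~ N(0,1).
# We accept H0 (keep the event as signal-like) when t > t_cut.
t_cut = 1.0

eff_signal = norm.sf(t_cut, loc=2.0, scale=1.0)   # P(accept | signal)     = signal efficiency
eff_bkg    = norm.sf(t_cut, loc=0.0, scale=1.0)   # P(accept | background) = background efficiency

alpha = 1.0 - eff_signal   # error of the 1st kind: reject H0 (signal) although it is true
beta  = eff_bkg            # error of the 2nd kind: accept H0 although background (H1) is true
power = 1.0 - beta

# Purity for assumed prior fractions of signal and background (made-up numbers)
f_sig, f_bkg = 0.01, 0.99
purity = f_sig * eff_signal / (f_sig * eff_signal + f_bkg * eff_bkg)

print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, power = {power:.3f}, purity = {purity:.3f}")
```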