Statistics Exercises
Tommaso Dorigo, INFN Padova
SOS School, Autrans, May 30 - June 3, 2016
Contents of today's lesson
• Basic statistical distributions, and the pitfalls of neglecting their
importance
– How free quarks were discovered and then retracted
– Bootstrapping and the false Poisson
• Error propagation: a simple example
– Smart weighting
– Derivation of the weighted average
• An example of the method of least squares
– Two chisquareds and a likelihood
• An example of the method of max likelihood
– which you can solve with paper and pencil
1 - Why it is crucial to know basic
statistical distributions
• I bet 90% of you know the expression, and at least the basic properties, of the following:
  – Gaussian (AKA Normal) distribution
  – Poisson distribution
  – Exponential distribution
  – Uniform distribution
  – Binomial and Multinomial distributions
• A mediocre physicist can live a comfortable life without having other distributions at his or her fingertips. However, I argue you should at the very least recognize and understand:
  – Chisquare distribution
  – Compound Poisson distribution
  – Log-Normal distribution
  – Gamma distribution
  – Beta distribution
  – Cauchy distribution (AKA Breit-Wigner)
  – Laplace distribution
  – Fisher-Snedecor distribution
• There are many other important distributions – the list above is just a sample set.
• You may believe you have better things to do than going through the properties of all these functions. However, most Statistics books discuss them carefully, for a good reason. We can make at least one example of the pitfalls you may avoid by knowing they exist!
•
The Poisson
distribution
• I believe you know what the Poisson
distribution is:
P(n; m ) 
m ne m
n!
– The expectation value of a Poisson variable with mean μ is E(n) = m
– its variance is V(n) = m
The Poisson is a discrete distribution. It describes the probability of getting
exactly n events in a given time, if these occur independently and randomly at
constant rate (in that given time) μ
Other fun facts:
– it is a limiting case of the Binomial for p0, in the limit of large N
– it converges to the Normal for large m
N n
P(n)    p (1  p) N n
n
The Compound Poisson distribution
• Less known is the compound Poisson distribution, which describes the sum of N Poisson variables, all of mean μ, when N is itself a Poisson variable of mean λ:

    P(n; μ, λ) = Σ_{N=0}^{∞} [ (Nμ)^n e^(−Nμ) / n! ] [ λ^N e^(−λ) / N! ]

  – Obviously the expectation value is E(n) = λμ
  – The variance is V(n) = λμ(1+μ)
• One seldom has to deal with this distribution in practice. Yet I will make the point that it is necessary for a physicist to know it exists, and to recognize that it is different from the simple Poisson distribution.
Why? Should you really care?
Let me ask before we continue: how many of you knew about the existence of the compound Poisson distribution?
PRL 23, 658 (1969)
In 1968 McCusker and Cairns observed four tracks in a Wilson chamber whose apparent ionization was compatible with the one expected for particles of charge 2/3e.
Subsequently, they published a paper where they showed a track which could not be anything but a fractionally charged particle!
In fact, it produced 110 counted droplets per unit path length against an expectation of 229 (from the 55,000 observed tracks).
What is the probability to observe such a phenomenon? We compute it in the following slide.
Before we do, note that if you are strong in nuclear physics and thermodynamics, you may know that a scattering interaction produces on average about four droplets. The scattering and the droplet formation are independent Poisson processes.
However, if your knowledge of Statistics is poor, this observation does not allow you to reach the right conclusion. What is the difference, after all, between a Poisson process and the combination of two?
Significance of the observation
Case A: single Poisson process, with μ=229:

    P(n ≤ 110) = Σ_{i=0}^{110} 229^i e^(−229) / i! ≈ 1.6×10^(−18)

Since they observed 55,000 tracks, seeing at least one track with P=1.6×10^(−18) has a chance of occurring of 1−(1−P)^55000, or about 10^(−13).
Case B: compound Poisson process, with λμ=229, μ=4:
One should rather compute

    P'(n ≤ 110) = Σ_{i=0}^{110} Σ_{N=0}^{∞} [ (Nμ)^i e^(−Nμ) / i! ] [ λ^N e^(−λ) / N! ] ≈ 4.7×10^(−5)

from which one gets that the probability of seeing at least one such track is rather 1−(1−P')^55000, or 92.5%. Ooops!
Bottom line:
You may know your detector and the underlying physics as well as you know your ***, but only your knowledge of basic Statistics prevents you from being fooled!
Another example for the
Compound Poisson: Bootstrapping
• Bootstrapping: creating new samples from a dataset by fishing
events at random, with replacement
• The idea of bootstrapping is that inference on the properties of an unknown distribution, from which we have a sample of data, can be substituted by inference on resampled sets
• Example (from Wikipedia): assume we are interested in the
average height of people worldwide. We only measure the heights
of N individuals. From that single sample, only one estimate of the
mean can be obtained. In order to reason about the population, we
need some sense of the variability of the mean that we have
computed.
  – By resampling with replacement we may construct many (say 1000) sets of size N, and study the distribution of the mean (or of the variance, or whatever statistic we are interested in)
So where's the compound Poisson?
• Imagine you have a histogram of events in the original
dataset. Each bin content has Poisson properties
• What are the statistical properties of the bin entries in the bootstrapped histograms?
  – In other words: if the expectation value of a bin's content is μ, what is the associated variance σ²?
• As you might have correctly guessed, the variability in the number of entries in each bin is larger than the σ²=μ that the Poisson distribution would imply.
  – And you know the answer, as it was quoted three slides earlier!
– And you know the answer, as it was quoted three slides earlier!
    P(n; μ, λ) = Σ_{N=0}^{∞} [ (Nμ)^n e^(−Nμ) / n! ] [ λ^N e^(−λ) / N! ]

    V(n) = λμ(1+μ)
EXERCISE: write a program that tests this.
Bootstrap_variance.C
void Bootstrap_variance (double Ndata=10000, int Nrep=100, double fracBoot=1.0) {
  // Ndata    = Expectation value of number of events in original histogram
  // Nrep     = Number of Bootstrap replicas drawn
  // fracBoot = fraction of Ndata drawn in Bootstrapped sets
  double NdataB = Ndata*fracBoot;
  const int Nbins = 100; // We fix the number of bins to 100
  if (Ndata>100000) {
    cout << "Too much data per sample, reduce to <100000. Exiting..." << endl;
    return;
  }
  // Repeat many times to get average of variance over replicas
  double sumvar = 0;
  double Average_content = 0;
  for (int i=0; i<Nrep; i++) {
    // Generate histogram data
    double data[100000];
    int thisdata = gRandom->Poisson(Ndata);
    for (int j=0; j<thisdata; j++) {
      double x = gRandom->Uniform(0.,(double)Nbins);
      data[j] = x;
    }
    // Create Bootstrap sample
    double Bdata[100000];
    thisdata = gRandom->Poisson(NdataB);
    Average_content += thisdata;
    for (int j=0; j<thisdata; j++) {
      int index = (int)gRandom->Uniform(0.,Ndata);
      if (index==Ndata) index = Ndata-1;
      Bdata[j] = data[index];
    }
    // Study statistical properties of Bdata in each bin
    // by computing the bin-by-bin variances
    int Contents[Nbins];
    double sum  = 0;
    double sum2 = 0;
    for (int k=0; k<Nbins; k++) {
      Contents[k] = 0;
      for (int j=0; j<thisdata; j++) {
        if (Bdata[j]>=(double)k && Bdata[j]<(double)k+1.) {
          Contents[k]++;
        }
      }
      sum  += Contents[k];
      sum2 += Contents[k]*Contents[k];
    }
    double var = sum2/Nbins - pow(sum/Nbins,2);
    sumvar += var;
  }
  Average_content = Average_content/Nrep;
  double Average_variance = sumvar/Nrep;
  cout << endl;
  cout << " Average variance in bootstrapped sets is " << Average_variance << endl;
  cout << " Expectation for compound Poisson is " << NdataB/Nbins*(1+Average_content/Ndata) << endl;
  cout << " Variance for a Poisson distribution is " << NdataB/Nbins << endl;
}
2 - Error propagation
Imagine you have n variables xi. You do not know their pdf but at least know their means and covariance matrix. Now say there is a function y of the xi and you wish to determine its pdf: you can expand it in a Taylor series around the means, stopping at first order:

    y(x) ≈ y(μ) + Σ_{i=1}^{n} (∂y/∂xi)|_{x=μ} (xi − μi)

From this one can show that the expectation values of y and y² are, to first order,

    E[y(x)] ≈ y(μ)
    E[y²(x)] ≈ y²(μ) + Σ_{i,j=1}^{n} (∂y/∂xi ∂y/∂xj)|_{x=μ} Vij

(remember: E[(x−E[x])²] = E[x²]−μ²), so the variance of y is the second term in this expression.
In case you have a set of m functions y(x), you can build their own covariance matrix:

    Ukl = Σ_{i,j=1}^{n} (∂yk/∂xi ∂yl/∂xj)|_{x=μ} Vij

This is often expressed in matrix form, U = A V A^T, once one defines the matrix of derivatives Aki = (∂yk/∂xi)|_{x=μ}.
The above formulas allow one to "propagate" the variances from the xi to the yj, but this is only valid if it is meaningful to expand linearly around the mean!
Beware of routine use of these formulas in non-trivial cases.
How it works
• To see how standard error propagation works, let us use the formula for the variance of a single y(x),

    σy² = Σ_{i,j=1}^{n} (∂y/∂xi ∂y/∂xj)|_{x=μ} Vij

  and consider the simplest examples with two variables x1, x2: their sum and product.

    y = x1 + x2  →  σy² = σ1² + σ2² + 2V12                  for the sum;

    y = x1 x2    →  σy² = x2² V11 + x1² V22 + 2 x1 x2 V12 ,
    i.e.  σy²/y² = σ1²/x1² + σ2²/x2² + 2V12/(x1 x2)         for the product.

• One thus sees that for uncorrelated variables x1, x2 (V12=0), the variances of their sum add linearly, while for the product it is the relative variances which add linearly.
Example: why we need to understand error propagation
• We have seen how to propagate uncertainties from some measurements (random variables!) xi to a derived quantity y = f(x):

    σy² = Σ_i (∂f(x)/∂xi)² σxi²

  This is just standard error propagation, for uncorrelated random variables xi.
• What we sometimes neglect to do is to stop and think about the consequences of that simple formula in the specific cases to which we apply it. That's too bad!
• Let us take the problem of weighting two objects A and B with a two-arm scale offering a constant accuracy, say 1 gram. You have time for two weight measurements. What do you do?
  – weight A, then weight B
  – something else? Who has a better idea?
Exercise: is weighting their sum and difference a good idea?
Smart weighting
• If you weight A and B separately, your results will be affected by the stated accuracy of the scale: σA = σ = 1 g, σB = σ = 1 g.
• But if you instead weighted S = A+B, and then weighted D = B−A by putting the two objects on different dishes, you would obtain (as A = (S−D)/2, B = (S+D)/2)

    σA = sqrt( (σS/2)² + (σD/2)² ) = σ/√2 ≈ 0.71 grams!
    σB = sqrt( (σS/2)² + (σD/2)² ) = σ/√2 ≈ 0.71 grams!

Your uncertainties on A and B have become 1.41 times smaller! This is the result of having made the best of your measurements, by making optimal use of the information available. When you placed one object on a dish, the other one was left on the table, begging to participate!
Addendum: fixed % error
• What happens to the previous problem if, instead of a constant error of 1 gram, the balance provides measurements with an accuracy of k%?
• If we do separate weightings, of course we get σA = kA, σB = kB. But if we rather weight S = A+B and D = B−A, what we get is (as A = (S−D)/2, B = (S+D)/2)

    σA = sqrt( σS² + σD² ) / 2 = sqrt( k²(A+B)² + k²(B−A)² ) / 2 = k sqrt( (A²+B²)/2 )
    σB = sqrt( σS² + σD² ) / 2 = k sqrt( (A²+B²)/2 )

• The procedure has democratically shared the uncertainty between the two objects. If A=B we do not gain anything from our "trick" of measuring S and D: both σA = kA and σB = kB are the same as if you had measured A and B separately.
• Of course the limiting case A >> B corresponds instead to a very inefficient measurement of B, while the uncertainty on A converges to what you would get if you weighted it twice.
Weighted average
• Now suppose we need to combine two different, independent measurements of the same physical quantity x0, with variances σ1², σ2²:
  – we denote them with x1(x0,σ1), x2(x0,σ2) → the PDFs are G(x0,σi)
• We wish to combine them linearly to get the result with the smallest possible variance:

    x = c x1 + d x2

  What are c, d such that σx is smallest? Let us try this simple exercise.
Answer: we first of all note that d = 1−c if we want <x> = x0 (reason with expectation values to convince yourself of this). Then we simply express the variance of x in terms of the variances of x1 and x2: from x = c x1 + (1−c) x2, error propagation gives

    σx² = c² σ1² + (1−c)² σ2²

and we find the c which minimizes this expression. This yields:

    x = ( x1/σ1² + x2/σ2² ) / ( 1/σ1² + 1/σ2² ) ,    σx² = 1 / ( 1/σ1² + 1/σ2² )

The generalization of these formulas to N measurements is trivial.
3 - The method of least squares
• Imagine you have a set of n independent measurements yi – Gaussian random variables – with different unknown means λi and known variances σi². The yi can be considered a vector having a joint pdf which is the product of n Gaussians:

    g(y1,...,yn; λ1,...,λn; σ1²,...,σn²) = Π_{i=1}^{n} [ 1/sqrt(2πσi²) ] exp[ −(yi−λi)² / (2σi²) ]

• Let also λ be a function of x and a set of m parameters θ, λ(x; θ1...θm). In other words, λ is the model you want to fit to your data points y(x). We want to find estimates of θ – in other words, we want to fit the model to the data, extracting the model parameters.
If we take the logarithm of the joint pdf we get the log-likelihood function,

    log L(θ) = −(1/2) Σ_{i=1}^{n} (yi − λ(xi;θ))² / σi²

which is maximized by finding θ such that the following quantity is minimized:

    χ²(θ) = Σ_{i=1}^{n} (yi − λ(xi;θ))² / σi²

• Caveat: the expression written above follows, near the minimum, a χ² distribution only if the function λ(x;θ) is linear in the parameters θ and if it is the true form from which the yi were drawn.
• Yet the method of least squares given above "works" also for non-Gaussian errors σi, as long as the yi are independent. But it may have worse properties than a full likelihood.
• If the measurements are not independent, the joint pdf will be an n-dimensional Gaussian. Then the following generalization holds:

    χ²(θ) = Σ_{i,j=1}^{n} (yi − λ(xi;θ)) (V⁻¹)ij (yj − λ(xj;θ))

[Figure: data points y versus x with a fitted curve λ(x; a,b,c)]
Note that unlike the ML, the χ² only requires an unbiased estimate of the variance of a distribution to work! (It makes a Gaussian approximation.) Both a nice and a devaluing property!
Example: know the properties of your estimators
• Issues (and errors hard to trace) may arise in the simplest of calculations, if you do not know the properties of the tools you are working with.
• Take the simple problem of combining three measurements of the same quantity. Let these be counting rates, i.e. with Poisson uncertainties:
  – A1 = 100
  – A2 = 90
  – A3 = 110
  (If they aren't compatible, don't combine!)
These measurements are fully compatible with each other, given that the estimates of their uncertainties are sqrt(Ai) = {10, 9.5, 10.5} respectively. We may thus proceed to average them, obtaining

    <A> = 100.0 ± 5.77

... Or would you use a weighted average? With what weights??
Now imagine, for the sake of argument, that we were in a lazy mood, and rather than do the math we used a χ² fit to evaluate <A>. Surely we would find the same answer as the simple average of the three numbers, right?
... Wrong! The χ² fit does not "preserve the area" of the fitted histogram.
What is going on?? Let us dig a little bit into this matter. This requires us to study the detailed definition of the test statistics we employ in our fits: in general, a χ² statistic results from a weighted sum of squares; the weights should be the inverse variances of the true values. Unfortunately, we do not know the latter!
Two chisquareds and a Likelihood
• The "standard" definition is called "Pearson's χ²", which for Poisson data we write as

    χ²P = Σ_{i=1}^{k} (Ni − n)² / n

  (here n is the best fit value, the Ni are the measurements)
• The other (AKA "modified" χ²) is called "Neyman's χ²":

    χ²N = Σ_{i=1}^{k} (Ni − n)² / Ni

• While χ²P uses the best-fit variance in the denominator, χ²N uses the individual estimated variances. Although both of these least-squares estimators asymptotically have a χ² distribution, and display optimal properties, they use approximated weights.
The result is a pathology: neither definition preserves the area in a fit!
χ²P overestimates the area, χ²N underestimates it. In other words, neither works to make a simple weighted average!
• The maximization of the Poisson likelihood,

    LP = Π_{i=1}^{k} n^Ni e^(−n) / Ni!

instead preserves the area, and obtains exactly the result of the simple average.
PROVE IT BY DIRECT SOLVING FOR n !
Proofs: Pearson's χ²
• Let us compute n from the minimum of χ²P:

    χ²P = Σ_{i=1}^{k} (Ni − n)² / n        (note: a variable weight!)

    0 = ∂χ²P/∂n = Σ_{i=1}^{k} [ 2n(n − Ni) − (Ni − n)² ] / n²

    0 = Σ_{i=1}^{k} (n² − Ni²) = k n² − Σ_{i=1}^{k} Ni²

    n = sqrt( (1/k) Σ_{i=1}^{k} Ni² )

n is found to be the square root of the average of the squares, and is thus by force an overestimate of the area!
Proofs: Neyman's χ²
• If we minimize χ²N,

    χ²N = Σ_{i=1}^{k} (Ni − n)² / Ni        (again a variable weight)

we have:

    0 = ∂χ²N/∂n = − Σ_{i=1}^{k} 2(Ni − n)/Ni

(ALTERNATIVELY, just solve this one for n.) Developing the fraction leads to

    0 = Σ_{i=1}^{k} (Ni − n) Π_{j≠i} Nj = Σ_{i=1}^{k} Π_{j=1}^{k} Nj − n Σ_{i=1}^{k} Π_{j≠i} Nj

from which we finally get

    1/n = (1/k) Σ_{i=1}^{k} 1/Ni

The minimum is found for n equal to the harmonic mean of the inputs – which is an underestimate of the arithmetic mean!
Proofs: The Poisson Likelihood LP
• We maximize LP by first taking its logarithm, and find:

    LP = Π_{i=1}^{k} n^Ni e^(−n) / Ni!

    ln LP = Σ_{i=1}^{k} ( −n + Ni ln n − ln Ni! )

    0 = ∂ln LP/∂n = Σ_{i=1}^{k} ( Ni/n − 1 ) = −k + (1/n) Σ_{i=1}^{k} Ni

    n = (1/k) Σ_{i=1}^{k} Ni

As predicted, the result for n is the arithmetic mean. Likelihood fitting preserves the area!
EXERCISE: Extract biases with ROOT
• Take a k=100-bin histogram, fill it with random entries from a Poisson distribution of mean μ
• Fit it to a constant by minimizing χ²P, χ²N, −2ln(LP) in turn
• Repeat many times, study the fractional bias (ratio of average result to true μ) as a function of μ
• See sample program Chi2vsLik.C
Chi2vsLik.C
void Chi2vsLik (int mumax=100, int Nbins=100, int Maxrep=100) {
  TH1D * data       = new TH1D ("data", "", Nbins, 0., (double)Nbins);
  TH1D * Pearson    = new TH1D ("Pearson", "", mumax, 0.5, 0.5+(double)mumax);
  TH1D * Neyman     = new TH1D ("Neyman", "", mumax, 0.5, 0.5+(double)mumax);
  TH1D * Likelihood = new TH1D ("Likelihood", "", mumax, 0.5, 0.5+(double)mumax);
  for (int i=1; i<=mumax; i++) { // Looping on different values of mu
    double mu = (double)i;
    double bigsumP=0;
    double bigsumN=0;
    double bigsumL=0;
    for (int Nrep=0; Nrep<Maxrep; Nrep++) { // Repeat to get avg
      data->Reset();
      for (int j=0; j<Nbins; j++) { // create data histogram
        double k = gRandom->Poisson(mu);
        data->SetBinContent(j+1,k);
      }
      double sum2=0;
      double suminv=0;
      double sum=0;
      for (int j=0; j<Nbins; j++) { // Extract collective statistics
        double k=data->GetBinContent(j+1);
        sum2+=k*k/Nbins;
        suminv+=1/k; // NB for small mu a bin may be empty, making Neyman's chi2 undefined
        sum+=k/Nbins;
      }
      // Closed-form minima derived in the previous slides:
      bigsumP+=sqrt(sum2)/mu;   // Pearson: RMS of bin contents
      bigsumN+=Nbins/suminv/mu; // Neyman: harmonic mean
      bigsumL+=sum/mu;          // Likelihood: arithmetic mean
    }
    Pearson->SetBinContent(i,bigsumP/Maxrep);
    Neyman->SetBinContent(i,bigsumN/Maxrep);
    Likelihood->SetBinContent(i,bigsumL/Maxrep);
  }
  TCanvas * C = new TCanvas ("C","",500,500);
  C->cd();
  Pearson->SetMinimum(0);
  Pearson->SetLineWidth(3);
  Neyman->SetLineWidth(3);
  Likelihood->SetLineWidth(3);
  Pearson->SetLineColor(kRed);
  Neyman->SetLineColor(kBlue);
  Likelihood->SetLineColor(kBlack);
  Pearson->Draw();
  Neyman->Draw("SAME");
  Likelihood->Draw("SAME");
}
Results for μ=10-100, 100-bin histograms
• One observes that the convergence is slowest for Neyman's χ², but the bias is significant also for χ²P
• This result depends on k, the number of bins, but is present in all cases
• Keep that in mind when you fit a histogram!
• Standard ROOT fitting uses V=Ni → Neyman's definition!
Discussion
• What we are doing when we fit a constant through a set of k bin contents is to extract the common, unknown, true value μ from which the entries were generated, by combining the k measurements. We have k Poisson measurements of this true value. Each equivalent measurement should have the same weight in the combination, because each is drawn from a Poisson of mean μ, whose true variance is μ.
• But having no μ to start with, we must use estimates of the variance as (inverse) weights. So χ²N gives the different observations different weights 1/Ni. Since negative fluctuations (Ni < μ) get larger weights, the result is downward biased!
• What χ²P does is different: it uses a common weight for all measurements, but this is of course also an estimate of the true variance V = μ: the denominator of χ²P is the fit result for the average, μ*. Since we minimize χ²P to find μ*, larger denominators get preferred, and we get a positive bias: μ* > μ!
• All methods have optimal asymptotic properties: consistency, minimum variance. However, one is seldom in that regime. χ²P and χ²N also have problems when Ni is small (non-Gaussian errors) or zero (χ²N undefined). These drawbacks are solved by grouping bins, at the expense of a loss of information.
• LP does not have the approximations of the two sums of squares, and it has in general better properties. Cases where the use of LP yields problems are rare. Whenever possible, use a Likelihood!
4 - Basics of the Maximum Likelihood
• Take a pdf for a random variable x, f(x; θ), which is analytically known, but for which the value of the m parameters θ is not. The method of maximum likelihood allows us to estimate the parameters θ if we have a set of data xi distributed according to f.
• The probability of our observed set {xi} depends on the pdf. If the measurements are independent, the probability to find xi in [xi, xi+dxi[ for all i is

    p = Π_{i=1}^{n} f(xi; θ) dxi

• The likelihood function

    L(θ) = Π_{i=1}^{n} f(xi; θ)

is then a function of the parameters θ only. It is written as the joint pdf of the xi, but we treat those as fixed. L is not a pdf! NOTA BENE: the integral under L is MEANINGLESS.
• Using L(θ) one can define "maximum likelihood estimators" for the parameters θ as the values which maximize the likelihood, i.e. the solutions θ̂ = (θ̂1, θ̂2, ..., θ̂m) of the equations

    ( ∂L(θ)/∂θj )|_{θ=θ̂} = 0     for j = 1...m

Note: the ML requires (and exploits!) the full knowledge of the distributions.
Variance of the MLE
• In the simplest cases, i.e. when one has unbiased estimates and Gaussian distributed data, one can estimate the variance of the maximum likelihood estimate with the simple formula

    σ²_θ̂ = ( − ∂²ln L / ∂θ² )⁻¹ |_{θ=θ̂}

This is also the default used by Migrad to return the uncertainty of a MLE from a fit.
However, note that this is only a lower limit of the variance in conditions when errors are not Gaussian, even when the ML estimator is unbiased. A general formula called the Rao-Cramer-Frechet inequality gives this lower bound as

    V[θ̂] ≥ ( 1 + ∂b/∂θ )² / E[ − ∂²ln L / ∂θ² ]

(b is the bias, E is the expectation value).
Example: the loaded die
Imagine you want to test whether a die is loaded. Your hypothesis might be that the probabilities of the six outcomes are not equal, but rather depend on a "load" parameter t:

    p(1) = 1/6 − t/2 ,    p(2) = ... = p(5) = 1/6 − t/8 ,    p(6) = 1/6 + t

That is, you have a model for the load on the die.
Your data comes from N=20 repeated throws of the die. Call N1 the number of occurrences of "1", N3 the occurrences of "6", and N2 the other results.
The likelihood is the product of the probabilities of the observed throws, so to estimate t you write

    L(t) = (1/6 − t/2)^N1 (1/6 − t/8)^N2 (1/6 + t)^N3

Setting the derivative of −log L with respect to t to zero yields a quadratic equation:

    18 N t² − 3(7N1 + N2 + 10N3) t − (4N1 + N2 − 8N3) = 0 ,    with N = N1+N2+N3

One solution lies in the allowed range for t, [−1/6, 1/3]: t=0.072. The uncertainty on t can be obtained as the square root of the variance, computed as the inverse of the second derivative of −log L. This amounts to ±0.084. The point estimate of the "load", the MLE, is different from zero, but compatible with it. We thus cannot establish the presence of a bias.
Exercise with ROOT
Write a ROOT macro that determines, using the likelihood of the previous slide, the value of the bias t and its uncertainty, given a random set of N (unbiased) die throws.
Directions:
1) Your macro will be called "Die.C" and it will have a function called "void Die(int N) {}"
2) Produce a set of N throws of the die by looping i=0...N-1 and storing the result of (int)(1+gRandom->Uniform(0.,6.));
3) Call N1 = number of occurrences of 1; N3 = occurrences of 6; N2 = other results.
4) With paper and pencil, derive the coefficients of the quadratic equation in t for the likelihood maximum as a function of N1, N2, N3.
5) Also derive the expression of −d²lnL/dt² as a function of t and N1, N2, N3.
6) Insert the obtained formulas in the code to compute t* and its uncertainty σ(t*).
7) Print out the result for t in the allowed range [−1/6, 1/3] and its uncertainty. If there are two solutions in that interval, print the one away from the boundary.
8) How frequently do you get a result for t less than one standard deviation away from 0? (You can embed the above in a cycle so that you compute the frequency at (8))
Die.C
void Die(int N=100) {
  int res[100000];
  int n1=0, n2=0, n3=0;
  if (N>100000) return; // do not allow more throws than the array size
  for (int i=0; i<N; i++) {
    res[i]=1+(int)gRandom->Uniform(0.,6.);
    if (res[i]==1) {
      n1++;
    } else if (res[i]<6) {
      n2++;
    } else {
      n3++;
    }
  }
  cout << endl << " Die throwing results:" << endl;
  cout << " n1 = " << n1;
  cout << " n2 = " << n2;
  cout << " n3 = " << n3 << endl << endl;
  // Quadratic equation for max of L: coefficients
  double a = 18*(n1+n2+n3);
  double b = -3*(7*n1+n2+10*n3);
  double c = -(4*n1+n2-8*n3);
  double rms=0, t1, t2;
  double discr=b*b-4*a*c;
  double tmin=-1./6., tmax=1./3., tstar=0;
  if (discr<0) {
    cout << " No solution for max likelihood" << endl;
  } else {
    t1 = (-b-sqrt(discr))/(2*a);
    t2 = (-b+sqrt(discr))/(2*a);
    if (t1>=tmin && t1<=tmax) {
      if (t2>=tmin && t2<=tmax) { // n1=0 yields one solution at t=0.33333
        if (t1-tmin>tmax-t2) {
          cout << " Bias is estimated to be t = " << t1;
          tstar=t1;
        } else {
          cout << " Bias is estimated to be t = " << t2;
          tstar=t2;
        }
      } else {
        cout << " Bias is estimated to be t = " << t1;
        tstar=t1;
      }
    } else if (t2>=tmin && t2<=tmax) {
      cout << " Bias is estimated to be t = " << t2;
      tstar=t2;
    }
    // Determine error from inverse of second derivative of -logL
    double d2logl;
    if (tstar>-1/6. && tstar<1/3.) {
      d2logl = 9*n1/pow(1-3*tstar,2) + 9*n2/pow(4-3*tstar,2) + 36*n3/pow(1+6*tstar,2);
      rms = sqrt(1/d2logl);
    }
    cout << " +- " << rms << endl;
  }
  // Compute the corresponding probabilities
  cout << endl;
  cout << " This corresponds to the following probabilities:" << endl;
  cout << " p(1) = " << 1./6.-tstar/2. << " +- " << rms/2. << endl;
  cout << " p(x) = " << 1./6.-tstar/8. << " +- " << rms/8. << endl;
  cout << " p(6) = " << 1./6.+tstar << " +- " << rms << endl << endl;
}
Die2.C
void Die2(int N=100, int Nrep=100) {
  // N = number of die throws; Nrep = number of expts
  if (N>100000) return; // do not allow more than 100000 throws per exp
  double bignumber=100000.;
  double ok=0; // number of expts where res is compatible with the true value
  for (int rep=0; rep<Nrep; rep++) { // repeated experiments
    int res[100000];
    int n1=0; // number of times the die throw returns a 1
    int n2=0; // number of times the die throw returns a 2,3,4,5
    int n3=0; // number of times the die throw returns a 6
    for (int i=0; i<N; i++) { // simulate die throws
      res[i]=1+(int)gRandom->Uniform(0.,6.);
      if (res[i]==1) {
        n1++;
      } else if (res[i]<6) {
        n2++;
      } else {
        n3++;
      }
    }
    double a = 18*(n1+n2+n3);      // coefficients of the quadratic equation
    double b = -3*(7*n1+n2+10*n3); // that solves the likelihood maximization
    double c = -(4*n1+n2-8*n3);
    double t1,t2;
    double discr=b*b-4*a*c;
    // bounds on t are determined by the fact that 0<=p(i)<=1
    double tmin=-1./6.;
    double tmax=1./3.;
    double tstar=0; // tstar is the estimate of the bias from this experiment
    double rms=0;
    if (discr<0) {
      cout << "No solution for max likelihood" << endl;
    } else {
      t1 = (-b-sqrt(discr))/(2*a);
      t2 = (-b+sqrt(discr))/(2*a);
      if (t1>=tmin && t1<=tmax) {
        if (t2>=tmin && t2<=tmax) {
          // NB when n1=0 there is always one solution at t=0.33333
          if (t1-tmin>tmax-t2) {
            cout << "Bias is estimated to be t = " << t1;
            tstar=t1;
          } else {
            cout << "Bias is estimated to be t = " << t2;
            tstar=t2;
          }
        } else {
          cout << "Bias is estimated to be t = " << t1;
          tstar=t1;
        }
      } else if (t2>=tmin && t2<=tmax) {
        cout << "Bias is estimated to be t = " << t2;
        tstar=t2;
      }
      // Determine error from inverse of second derivative of -logL
      double d2logl;
      if (tstar>-1/6. && tstar<1/3.) {
        d2logl = 9*n1/pow(1-3*tstar,2) +
                 9*n2/pow(4-3*tstar,2) +
                 36*n3/pow(1+6*tstar,2);
      } else {
        d2logl = bignumber;
      }
      rms = sqrt(1/d2logl);
      cout << " +- " << rms << endl;
    }
    if (fabs(tstar)<rms) ok++;
  }
  // Print result of set of trials and return.
  cout << "In total the error bar on t covers " << ok/(double)Nrep*100
       << "% of the times" << endl;
}
5 - The Jeffreys-Lindley Paradox
• One specific problem (among many!) which finds Bayesians and Frequentists in stark disagreement on the results: hypothesis testing in the limit of large data, when the prior includes a "point null". E.g., the charge bias of a tracker at LEP.
• Imagine you want to investigate whether your tracker has a bias in reconstructing positive versus negative curvature. We work with a zero-charge initial state (e+e-). You take an unbiased set of events, and count how many positive and negative curvature tracks you have reconstructed in a set of n=1,000,000. You get n+=498,800, n-=501,200. You want to test the hypothesis that R=0.5 with a size α=0.05.
• Bayesians will need a prior to make a statistical inference: their typical choice would be to assign equal probability to the chance that R=0.5 and to it being different (R≠0.5): a "point mass" of p=0.5 at R=0.5, and a uniform distribution of the remaining p in [0,1].
• The calculation goes as follows: we are in the high-statistics regime and away from 0 or 1, so the Gaussian approximation holds for the Binomial. The probability to observe a number of positive tracks as small as the one observed can then be written, with x=n+/n, as a Gaussian N(x) with σ²=x(1−x)/n. The posterior probability that R=0.5 is then

    P(R=1/2 | x, n) = [ (1/2) e^(−(x−1/2)²/2σ²) / sqrt(2πσ²) ]
                    / [ (1/2) e^(−(x−1/2)²/2σ²) / sqrt(2πσ²) + (1/2) ∫₀¹ e^(−(x−R)²/2σ²) / sqrt(2πσ²) dR ]
                    = 0.97816

A Bayesian concludes that there is no evidence against R=0.5, as the data strongly supports the null.
Jeffreys-Lindley: frequentist solution
• Frequentists will not need a prior, and just ask themselves how often a result "at least as extreme" as the one observed arises by chance, if the underlying distribution is N(R,σ) with R=1/2 and σ²=x(1−x)/n as before.
• One then has

    P(x ≤ 0.4988 | R=1/2) = ∫₀^0.4988 e^(−(t−1/2)²/2σ²) / sqrt(2πσ²) dt = 0.008197

    → P'(x | R=1/2) = 2 P = 0.01639

(we multiply by two since we would be just as surprised to observe an excess of positives as a deficit).
From this, frequentists conclude that the tracker is biased, since there is a less-than-2% probability, P' < α, that a result such as the one observed could arise by chance! A frequentist thus draws the opposite conclusion to the one a Bayesian draws from the same data.
Making sense of the JL paradox
• The paradox is often used by Bayesians to criticize the way inference is drawn by frequentists:
  – Jeffreys: "What the use of [the p-value] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred"
  – the issue is the violation of the Likelihood principle (discussed in the next few slides)
• The paradox has roots in the fact that while for a Frequentist hypothesis testing and interval estimation are dual to each other (test H0: x=x0 ↔ is x0 in the confidence interval [x−s, x+s]?), for a Bayesian they are extremely different things. Bayesians in fact recommend different priors for hypothesis testing and for interval estimation! In the JL paradox, the issue is the dependence of Bayesian results on the prior chosen for the parameter of interest.
The JL paradox remains largely a research topic. Statisticians have blamed the concept of a "point mass", as well as suggested n-dependent priors (n = size of data).
In particular, professional statisticians tend to believe that "the precise null" is never true, hence assigning it a non-zero prior is the source of the problem. However, we do believe our point nulls in HEP and astro-HEP!!
The JL paradox arises when there are three independent scales in a problem:
Conclusions for Part 1
• Ask yourself "what PDF are these data coming from?" → you'll save yourself a lot of trouble
• Think about the statistical properties of your estimators, and cook up better ones than your peers: you'll improve your results!
• Use paper and pencil! You can derive useful information about your problem more often than you would think possible
• The chisquare is a good thing, but the Likelihood knows more about your data; use it if you can.