
Statistical Methods for Particle Physics
CERN-FNAL
Hadron Collider Physics
Summer School
CERN, 6-15 June, 2007
Glen Cowan
Physics Department
Royal Holloway, University of London
[email protected]
www.pp.rhul.ac.uk/~cowan
Outline
1. Brief overview
   Probability: frequentist vs. subjective (Bayesian)
   Statistics: parameter estimation, hypothesis tests
2. Statistical tests for particle physics
   Multivariate methods for event selection (Wednesday)
   Goodness-of-fit tests for discovery (Friday)
3. Systematic errors
   Treatment of nuisance parameters
   Bayesian methods for systematics, MCMC
A definition of probability
Consider a set S with subsets A, B, ...
Kolmogorov axioms (1933):
  1. P(A) ≥ 0 for all A in S
  2. P(S) = 1
  3. If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
Also define the conditional probability of A given B (with P(B) ≠ 0):
  P(A|B) = P(A ∩ B) / P(B)
E.g. rolling dice: P(n < 3 | n even) = P(n = 2) / P(n even) = (1/6) / (1/2) = 1/3
Interpretation of probability
I. Relative frequency
A, B, ... are outcomes of a repeatable experiment
cf. quantum mechanics, particle scattering, radioactive decay...
II. Subjective probability
A, B, ... are hypotheses (statements that are true or false)
• Both interpretations consistent with Kolmogorov axioms.
• In particle physics frequency interpretation often most useful,
but subjective probability can provide more natural treatment of
non-repeatable phenomena:
systematic uncertainties, probability that Higgs boson exists,...
Bayes’ theorem
From the definition of conditional probability we have
  P(A|B) = P(A ∩ B) / P(B)   and   P(B|A) = P(B ∩ A) / P(A),
but P(A ∩ B) = P(B ∩ A), so we find Bayes’ theorem:
  P(A|B) = P(B|A) P(A) / P(B)
The denominator P(B) serves to ensure probabilities sum to unity;
often more convenient to use the law of total probability: for disjoint A_i with ∪_i A_i = S,
  P(B) = Σ_i P(B|A_i) P(A_i)
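Bayes’ theorem is easy to check numerically. Here is a minimal Python sketch (my illustration, not from the slides; the prior fractions and selection probabilities are hypothetical) applying it to the purity of a selected event sample:

```python
# Minimal illustration of Bayes' theorem with hypothetical numbers:
# posterior P(s | pass) from prior fractions and selection probabilities.
p_s = 0.01               # assumed prior probability that an event is signal
p_b = 1.0 - p_s          # prior probability that it is background

eff_s = 0.80             # assumed P(pass cut | signal)
eff_b = 0.05             # assumed P(pass cut | background)

# Law of total probability gives the denominator P(pass)
p_pass = eff_s * p_s + eff_b * p_b

# Bayes' theorem: P(s | pass) = P(pass | s) P(s) / P(pass)
purity = eff_s * p_s / p_pass
print(f"P(s | pass) = {purity:.3f}")   # ~0.139
```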
Frequentist Statistics − general philosophy
In frequentist statistics, probabilities are associated only with
the data, i.e., outcomes of repeatable observations.
Probability = limiting frequency
Probabilities such as
P(Higgs boson exists),
P(0.117 < α_s < 0.121),
etc. are either 0 or 1, but we don’t know which.
The tools of frequentist statistics tell us what to expect, under
the assumption of certain probabilities, about hypothetical
repeated observations.
The preferred theories (models, hypotheses, ...) are those for
which our observations would be considered ‘usual’.
Bayesian Statistics − general philosophy
In Bayesian statistics, interpretation of probability extended to
degree of belief (subjective probability). Use this for hypotheses:
  P(H|x) = P(x|H) π(H) / Σ_i P(x|H_i) π(H_i)
where
  P(x|H) = probability of the data x assuming hypothesis H (the likelihood),
  π(H) = prior probability, i.e., before seeing the data,
  P(H|x) = posterior probability, i.e., after seeing the data,
and the normalization involves a sum over all possible hypotheses.
Bayesian methods can provide more natural treatment of non-repeatable phenomena:
systematic uncertainties, probability that Higgs boson exists,...
No golden rule for priors (“if-then” character of Bayes’ thm.)
Likelihood
We have data: x (could be a vector, discrete or continuous) and a
probability model: P(x; θ) (θ could be a vector of parameters).
Now evaluate the probability function using the data that
we observed and treat it as a function of the parameters.
This is the likelihood function:
  L(θ) = P(x; θ)   (here x is constant)
For example, if we have n independent observations of
a random variable x, where x ~ f(x; θ), then
  L(θ) = ∏_{i=1}^{n} f(x_i; θ)
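To make the definition concrete, here is a short Python sketch (my illustration, not from the slides) evaluating ln L(τ) for toy data drawn from an assumed exponential pdf f(x; τ) = (1/τ) e^(−x/τ):

```python
import numpy as np

# Sketch: ln L(tau) for n independent draws from an assumed exponential
# pdf f(x; tau) = (1/tau) * exp(-x/tau). Data and true tau are hypothetical.
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100)   # "observed" data, true tau = 2

def log_likelihood(tau, x):
    # ln L(tau) = sum_i ln f(x_i; tau), with the observed data held fixed
    return np.sum(-np.log(tau) - x / tau)

for tau in (1.0, 2.0, 3.0):
    print(f"tau = {tau}: ln L = {log_likelihood(tau, x):.2f}")
```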
Maximum Likelihood
The likelihood function plays an important role in both
frequentist and Bayesian statistics.
E.g., to estimate the parameter θ, the method of maximum
likelihood (ML) says to take the value that maximizes L(θ).
ML and other parameter
estimation methods would
be a large part of a longer
course on statistics —
for now need to move on...
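Continuing the exponential example above, a minimal numerical ML fit might look as follows (a sketch only; for this pdf the ML estimator is analytically the sample mean, which provides a cross-check):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch: numerically maximize ln L(tau) from the previous example by
# minimizing -ln L(tau). For the exponential pdf the ML estimator is
# analytically the sample mean, giving a cross-check.
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100)

def neg_log_likelihood(tau):
    return -np.sum(-np.log(tau) - x / tau)

res = minimize_scalar(neg_log_likelihood, bounds=(0.1, 10.0), method="bounded")
print(f"tau_hat (numeric)  = {res.x:.3f}")
print(f"tau_hat (analytic) = {np.mean(x):.3f}")   # ML estimate = sample mean
```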
Statistical tests (in a particle physics context)
Suppose the result of a measurement for an individual event
is a collection of numbers
  x1 = number of muons,
  x2 = mean pT of the jets,
  x3 = missing energy, ...
which follows some n-dimensional joint pdf that depends on
the type of event produced.
For each reaction we consider we will have a hypothesis for the
pdf of x = (x1, ..., xn), e.g., f(x|H0), f(x|H1), etc.
Often H0 is the Standard Model (the background hypothesis),
and H1 is a signal hypothesis we are searching for.
Selecting events
Suppose we have a data sample with two kinds of events,
corresponding to hypotheses H0 and H1 and we want to select
those of type H0.
Each event is a point in x-space. What decision boundary
should we use to accept/reject events as belonging to event
type H0?
Probably start with cuts, e.g., x_i < c_i and x_j < c_j.
[Figure: events of types H0 and H1 in the (x_i, x_j) plane; the cuts define the ‘accept’ region.]
Other ways to select events
Or maybe use some other sort of decision boundary:
[Figure: two panels, one with a linear and one with a nonlinear decision boundary separating H0 from H1, each defining an ‘accept’ region.]
How can we do this in an ‘optimal’ way?
Test statistics
Construct a ‘test statistic’ of lower dimension (e.g., a scalar),
  t(x1, ..., xn).
Goal is to compactify the data without losing the ability to discriminate
between hypotheses.
We can work out the pdfs g(t|H0) and g(t|H1).
The decision boundary is now a single cut on t, e.g., accept H0 if t < t_cut.
This effectively divides the sample
space into two regions where we either
accept H0 (acceptance region)
or reject it (critical region).
Significance level and power of a test
Probability to reject H0 if it is true (error of the 1st kind):
  α = ∫_{t_cut}^{∞} g(t|H0) dt   (significance level)
Probability to accept H0 if H1 is true (error of the 2nd kind):
  β = ∫_{−∞}^{t_cut} g(t|H1) dt   (1 − β = power)
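For a concrete (hypothetical) case, suppose t ~ N(0,1) under H0 and t ~ N(3,1) under H1, with acceptance region t < t_cut; then α and β follow directly, as in this sketch:

```python
from scipy.stats import norm

# Sketch: alpha and beta for a scalar statistic t, assuming (hypothetically)
# t ~ N(0,1) under H0 and t ~ N(3,1) under H1; accept H0 for t < t_cut.
t_cut = 1.5
alpha = norm.sf(t_cut, loc=0.0, scale=1.0)    # P(t > t_cut | H0): 1st kind
beta = norm.cdf(t_cut, loc=3.0, scale=1.0)    # P(t < t_cut | H1): 2nd kind
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}, power = {1 - beta:.4f}")
```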
Efficiency, purity, etc.
Signal efficiency:
  ε_s = P(accept | s) = ∫_accept g(t|s) dt
Background efficiency:
  ε_b = P(accept | b) = ∫_accept g(t|b) dt
Expected number of signal events:
  s = σ_s ε_s L
Expected number of background events:
  b = σ_b ε_b L
(L = integrated luminosity)
Prior probabilities π_s, π_b proportional to the cross sections, so
for e.g. the signal purity,
  P(s | accept) = ε_s σ_s / (ε_s σ_s + ε_b σ_b)
Constructing a test statistic
How can we select events in an ‘optimal way’?
Neyman-Pearson lemma states:
To get the lowest ε_b for a given ε_s (highest power for a given
significance level), choose the acceptance region such that
  f(x|H0) / f(x|H1) > c,
where c is a constant which determines ε_s.
Equivalently, the optimal scalar test statistic is the likelihood ratio
  t(x) = f(x|H0) / f(x|H1).
N.B. any monotonic function of this is just as good.
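A sketch of this statistic for two hypothetical 2-d Gaussian pdfs (a toy example of mine, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch: likelihood ratio t(x) = f(x|H0)/f(x|H1) for two hypothetical
# 2-d Gaussian pdfs with equal covariance.
f0 = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.5], [0.5, 1.0]])
f1 = multivariate_normal(mean=[2.0, 1.0], cov=[[1.0, 0.5], [0.5, 1.0]])

def t(x):
    return f0.pdf(x) / f1.pdf(x)   # accept H0 where t(x) > c

x = np.array([1.0, 0.5])           # a single event
print(f"t(x) = {t(x):.3f}")
```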
Purity vs. efficiency — optimal trade-off
Consider selecting n events:
expected numbers s from signal, b from background;
→ n ~ Poisson (s + b)
Suppose b is known and goal is to estimate s with minimum
relative statistical error.
Take as estimator:
  ŝ = n − b
The variance of a Poisson variable equals its mean, therefore
  V[ŝ] = V[n] = s + b   →   σ_ŝ / s = √(s + b) / s
So we should maximize
  s / √(s + b)   (equivalently s² / (s + b)),
which is equivalent to maximizing the product of signal efficiency × purity.
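As an illustration, the following sketch (hypothetical Gaussian t distributions and event yields) scans a cut value and picks the one maximizing s/√(s+b):

```python
import numpy as np
from scipy.stats import norm

# Sketch: scan a cut t > t_cut and maximize s/sqrt(s+b), assuming
# (hypothetically) t ~ N(1,1) for signal, t ~ N(-1,1) for background,
# and total expectations S_tot = 100, B_tot = 10000.
S_tot, B_tot = 100.0, 10000.0
t_cut = np.linspace(-3.0, 5.0, 401)
s = S_tot * norm.sf(t_cut, loc=1.0)     # expected signal passing the cut
b = B_tot * norm.sf(t_cut, loc=-1.0)    # expected background passing the cut

fom = s / np.sqrt(s + b)                # figure of merit s/sqrt(s+b)
best = np.argmax(fom)
print(f"best cut: t > {t_cut[best]:.2f} (s = {s[best]:.1f}, b = {b[best]:.1f})")
```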
Why Neyman-Pearson doesn’t always help
The problem is that we usually don’t have explicit formulae for
the pdfs f(x|H0) and f(x|H1).
Instead we may have Monte Carlo models for signal and
background processes, so we can produce simulated data,
and enter each event into an n-dimensional histogram.
Use e.g. M bins for each of the n dimensions, total of M^n cells.
But n is potentially large, → prohibitively large number of cells
to populate with Monte Carlo data.
Compromise: make Ansatz for form of test statistic
with fewer parameters; determine them (e.g. using MC) to
give best discrimination between signal and background.
Multivariate methods
Many new (and some old) methods:
Fisher discriminant
Neural networks
Kernel density methods
Support Vector Machines
Decision trees
Boosting
Bagging
New software for HEP, e.g.,
TMVA, Höcker, Stelzer, Tegenfeldt, Voss, Voss, physics/0703039
StatPatternRecognition, I. Narsky, physics/0507143
Fisher discriminant
Assume a linear test statistic,
  t(x) = Σ_i a_i x_i = aᵀx,
and maximize the ‘separation’ between the two classes, e.g.,
  J(a) = (τ0 − τ1)² / (Σ0² + Σ1²),
where τ_k and Σ_k² are the mean and variance of t under hypothesis H_k.
This corresponds to a linear decision boundary.
[Figure: a linear boundary separating H0 from H1, defining the ‘accept’ region.]
Equivalent to Neyman-Pearson if the signal and background
pdfs are multivariate Gaussian with equal covariances;
otherwise not optimal, but still often a simple, practical solution.
Sometimes first transform data to better approximate Gaussians.
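A minimal numerical sketch (not from the slides): the maximizing coefficients satisfy a ∝ W⁻¹(μ0 − μ1), with W the sum of the within-class covariance matrices; toy Gaussian samples stand in for the training data.

```python
import numpy as np

# Sketch of a Fisher discriminant: coefficients a ∝ W^{-1}(mu0 - mu1),
# where W is the sum of the within-class covariance matrices.
rng = np.random.default_rng(0)
x0 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=1000)  # H0
x1 = rng.multivariate_normal([2, 1], [[1, 0.5], [0.5, 1]], size=1000)  # H1

W = np.cov(x0.T) + np.cov(x1.T)                  # within-class covariance
a = np.linalg.solve(W, x0.mean(axis=0) - x1.mean(axis=0))

t0 = x0 @ a                                      # t(x) = a^T x per event
t1 = x1 @ a
print(f"mean t: H0 = {t0.mean():.2f}, H1 = {t1.mean():.2f}")
```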
Nonlinear test statistics
The optimal decision boundary may not be a hyperplane,
→ nonlinear test statistic.
[Figure: a nonlinear decision boundary separating H0 from H1, defining the ‘accept’ region.]
Multivariate statistical methods are a Big Industry:
HEP can benefit from progress in Machine Learning.
See recent (03 & 05) PHYSTAT proceedings, e.g., papers by
J. Friedman, K. Cranmer, ...
Neural networks: the multi-layer perceptron
Use e.g. the logistic sigmoid activation function,
  s(u) = 1 / (1 + e^(−u)).
Define values for the ‘hidden nodes’,
  φ_i(x) = s( w_{i0} + Σ_j w_{ij} x_j ).
The network output is given by
  t(x) = s( a_0 + Σ_i a_i φ_i(x) ).
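The following sketch shows the forward pass of such a network; the weights here are arbitrary placeholders (in practice they are determined by training on labelled Monte Carlo events):

```python
import numpy as np

# Sketch: forward pass of a single-hidden-layer perceptron with
# logistic sigmoid activations. Weights are arbitrary placeholders.
def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

W = np.array([[0.1, 0.8, -0.5],      # hidden weights w_ij: 3 hidden nodes,
              [0.4, -0.2, 0.9],      # 2 input variables; column 0 = w_i0
              [-0.3, 0.6, 0.2]])
a = np.array([0.2, -1.0, 0.7, 1.1])  # output weights a_0, a_1, ..., a_3

def network_output(x):
    phi = sigmoid(W @ np.concatenate(([1.0], x)))     # hidden node values
    return sigmoid(a @ np.concatenate(([1.0], phi)))  # scalar output t(x)

print(f"t(x) = {network_output(np.array([0.5, -1.2])):.3f}")
```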
Neural network example from LEP II
Signal: e+e- → W+W-
(often 4 well separated hadron jets)
Background: e+e- → qqgg (4 less well separated hadron jets)
Input variables based on jet
structure, event shape, ...;
none by itself gives much separation.
Neural network output does better...
(Garrido, Juste and Martinez, ALEPH 96-144)
Probability Density Estimation (PDE) techniques
Construct non-parametric estimators f̂(x|H0), f̂(x|H1) of the pdfs
and use these to construct the likelihood ratio
  t(x) = f̂(x|H0) / f̂(x|H1).
(n-dimensional histogram is a brute force example of this.)
More clever estimation techniques can get this to work for
(somewhat) higher dimension.
See e.g. K. Cranmer, Kernel Estimation in High Energy Physics, CPC 136 (2001) 198; hep-ex/0011057;
T. Carli and B. Koblitz, A multi-variate discrimination technique based on range-searching,
NIM A 501 (2003) 576; hep-ex/0211019
Kernel-based PDE (KDE, Parzen window)
Consider d dimensions, N training events x1, ..., xN;
estimate f(x) with
  f̂(x) = (1/N) Σ_{i=1}^{N} (1/h^d) K( (x − x_i) / h ),
where K is the kernel and h the bandwidth (smoothing parameter).
Use e.g. a Gaussian kernel:
  K(u) = (2π)^{−d/2} exp(−|u|² / 2).
Need to sum N terms to evaluate function (slow);
faster algorithms only count events in vicinity of x
(k-nearest neighbor, range search).
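A direct (slow, sum-over-N) implementation of this estimator might look like the following sketch; the bandwidth value is a hypothetical choice and would need tuning in practice:

```python
import numpy as np

# Sketch of a kernel density estimate with a Gaussian kernel in d dimensions:
# f_hat(x) = (1/N) * sum_i (1/h^d) * K((x - x_i)/h).
def kde(x, training, h):
    d = training.shape[1]
    u = (x - training) / h                   # (N, d) scaled distances
    K = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(u**2, axis=1))
    return np.mean(K) / h**d                 # sums N terms (slow)

rng = np.random.default_rng(0)
train = rng.normal(size=(5000, 2))           # toy training sample
print(f"f_hat(0,0) = {kde(np.zeros(2), train, h=0.3):.3f}")  # true ~0.159
```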
Product of one-dimensional pdfs
First rotate to uncorrelated variables, i.e., find a matrix A such that
for y = Ax we have cov[y_i, y_j] = 0 for i ≠ j.
Estimate the d-dimensional joint pdf as the product of 1-d pdfs,
  f̂(x) = ∏_{i=1}^{d} f̂_i(x_i)   (here x decorrelated).
This does not exploit non-linear features of the joint pdf, but
simple and may be a good approximation in practical examples.
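One common way to find such a matrix A (a sketch, not necessarily the method intended in the slides) is to diagonalize the sample covariance V = U diag(λ) Uᵀ and take A = Uᵀ:

```python
import numpy as np

# Sketch of the decorrelation step: diagonalize the sample covariance
# V = U diag(lam) U^T and take A = U^T, so y = A x has diagonal covariance.
rng = np.random.default_rng(0)
x = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 2.0]], size=10000)

V = np.cov(x.T)                     # sample covariance of the inputs
lam, U = np.linalg.eigh(V)          # eigen-decomposition of symmetric V
y = x @ U                           # rotated (decorrelated) variables

print(np.round(np.cov(y.T), 3))     # off-diagonal elements ~ 0
```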
Decision trees
A training sample of signal and background data is repeatedly
split by successive cuts on its input variables.
The order in which the variables are used is based on the best
separation between signal and background.
Iterate until stop criterion reached,
based e.g. on purity, minimum
number of events in a node.
Resulting set of cuts is a ‘decision tree’.
Tends to be sensitive to
fluctuations in training sample.
Example from MiniBooNE: B. Roe et
al., NIM A 543 (2005) 577.
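For illustration, here is a sketch using a standard library implementation on toy signal/background samples (all numbers hypothetical; min_samples_leaf plays the role of the stop criterion described above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Sketch: a decision tree on toy Gaussian signal/background samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(1000, 2)),    # background
               rng.normal(1.5, 1.0, size=(1000, 2))])   # signal
y = np.concatenate([np.zeros(1000), np.ones(1000)])     # class labels

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50)
tree.fit(X, y)
print(f"training accuracy: {tree.score(X, y):.3f}")
```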
Boosted decision trees
Boosting combines a number of classifiers into a stronger one;
improves stability with respect to fluctuations in input data.
To use with decision trees, increase the weights of misclassified
events and reconstruct the tree.
Iterate → forest of trees (perhaps > 1000). For the mth tree,
define a score α_m based on the error rate of the mth tree.
Boosted tree = weighted sum of the trees:
  T(x) = Σ_m α_m T_m(x)
Algorithms: AdaBoost (Freund & Schapire), ε-boost (Friedman).
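A hand-rolled AdaBoost sketch on the same kind of toy data (my illustration of the reweighting step and the weighted sum T(x) = Σ_m α_m T_m(x); depth-1 stumps are used for speed):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Sketch of AdaBoost: reweight misclassified events, retrain, and sum
# the trees with scores alpha_m. Toy data; labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(1.5, 1, (500, 2))])
y = np.concatenate([-np.ones(500), np.ones(500)])

w = np.full(len(y), 1.0 / len(y))                   # event weights
F = np.zeros(len(y))                                # boosted sum so far
for m in range(50):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)       # weighted error rate
    alpha = 0.5 * np.log((1 - err) / err)           # score of the m-th tree
    w *= np.exp(-alpha * y * pred)                  # boost misclassified events
    w /= np.sum(w)
    F += alpha * pred                               # T(x) = sum_m alpha_m T_m(x)

print(f"training accuracy: {np.mean(np.sign(F) == y):.3f}")
```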
Multivariate analysis discussion
For all methods, need to check:
  Sensitivity to statistically unimportant variables
  (best to drop those that don’t provide discrimination);
  Level of smoothness in decision boundary (sensitivity
  to over-training).
Given the test variable, the next step is, e.g., to select n events and
estimate the signal cross section:
  σ̂_s = (n − b) / (ε_s L)
Now need to estimate systematic error...
If e.g. training (MC) data ≠ Nature, test variable is not optimal,
but not necessarily biased.
But our estimates of background b and efficiencies would then
be biased if based on MC. (True also for ‘simple cuts’.)
Multivariate analysis discussion (2)
But in a cut-based analysis it may be easier to avoid regions
where untested features of MC are strongly influencing the
decision boundary.
Look at control samples to test joint distributions of inputs.
Try to estimate backgrounds directly from the data (sidebands).
The purpose of the statistical test is often to select objects for
further study and then measure their properties.
Need to avoid input variables that are correlated with the
properties of the selected objects that you want to study.
(Not always easy; correlations may be poorly known.)
Comparing multivariate methods (TMVA)
Choose the best one!
Some multivariate analysis references
Hastie, Tibshirani, Friedman, The Elements of Statistical Learning,
Springer (2001);
Webb, Statistical Pattern Recognition, Wiley (2002);
Kuncheva, Combining Pattern Classifiers, Wiley (2004);
Specifically on neural networks:
L. Lönnblad et al., Comp. Phys. Comm., 70 (1992) 167;
C. Peterson et al., Comp. Phys. Comm., 81 (1994) 185;
C.M. Bishop, Neural Networks for Pattern Recognition, OUP (1995);
John Hertz et al., Introduction to the Theory of Neural Computation,
Addison-Wesley, New York (1991).
Wrapping up lecture 1
Now we’ve defined a way to select e.g. SUSY, Higgs, ... events
Next apply the method to the real data and... depending on what
we see, claim (or not) a new discovery.
Next lecture:
How do we decide when to make this claim?
How do we incorporate systematic uncertainties?
Extra slides
Some statistics books, papers, etc.
R.J. Barlow, Statistics: A Guide to the Use of Statistical Methods in the
Physical Sciences, Wiley, 1989;
see also hepwww.ph.man.ac.uk/~roger/book.html
G. Cowan, Statistical Data Analysis, Clarendon, Oxford, 1998
see also www.pp.rhul.ac.uk/~cowan/sda
L. Lyons, Statistics for Nuclear and Particle Physics, CUP, 1986
W. Eadie et al., Statistical and Computational Methods in
Experimental Physics, North-Holland, 1971 (New 2nd ed. 2007)
S. Brandt, Statistical and Computational Methods in Data
Analysis, Springer, New York, 1998 (with program library on CD)
W.M. Yao et al. (Particle Data Group), Review of Particle Physics,
Journal of Physics G 33 (2006) 1; see also pdg.lbl.gov sections
on probability, statistics, Monte Carlo.