Statistical Data Analysis: Lecture 6
1.  Probability, Bayes’ theorem, random variables, pdfs
2.  Functions of r.v.s, expectation values, error propagation
3.  Catalogue of pdfs
4.  The Monte Carlo method
5.  Statistical tests: general concepts
6.  Test statistics, multivariate methods
7.  Goodness-of-fit tests
8.  Parameter estimation, maximum likelihood
9.  More maximum likelihood
10. Method of least squares
11. Interval estimation, setting limits
12. Nuisance parameters, systematic uncertainties
13. Examples of Bayesian approach
14. tba

G. Cowan
Lectures on Statistical Data Analysis
1
Nonlinear test statistics
The optimal decision boundary may not be a hyperplane,
→ nonlinear test statistic
(Figure: nonlinear decision boundary separating the accept region for H0 from the H1 region.)
Multivariate statistical methods are a Big Industry:
Neural Networks,
Support Vector Machines,
Kernel density methods,
...
Particle Physics can benefit from progress in Machine Learning.
G. Cowan
Lectures on Statistical Data Analysis
4
Introduction to neural networks
Used in neurobiology, pattern recognition, financial forecasting, ...
Here, neural nets are just a type of test statistic.
Suppose we take t(x) to have the form

    t(x) = s( a_0 + Σ_i a_i x_i ),

where s(u) = 1 / (1 + e^(−u)) is the logistic sigmoid function.
This is called the single-layer perceptron.
s(·) is monotonic → equivalent to linear t(x)
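As a rough illustration (not from the original slides), a minimal Python sketch of this test statistic; the weight values and the input point are illustrative assumptions.

    import numpy as np

    def sigmoid(u):
        # logistic sigmoid s(u) = 1 / (1 + exp(-u))
        return 1.0 / (1.0 + np.exp(-u))

    def single_layer_perceptron(x, a0, a):
        # t(x) = s( a0 + sum_i a_i x_i ); since s is monotonic, the contour
        # t(x) = t_cut is the hyperplane a0 + a.x = const, i.e. the decision
        # boundary is the same as for a linear test statistic
        return sigmoid(a0 + np.dot(a, x))

    # illustrative usage with arbitrary (assumed) weights
    x  = np.array([1.2, -0.5, 0.3])   # input variables for one event
    a  = np.array([0.8,  1.1, -0.4])  # connection weights
    a0 = 0.2                          # offset
    print(single_layer_perceptron(x, a0, a))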
G. Cowan
Lectures on Statistical Data Analysis
5
The multi-layer perceptron
Generalize from the single layer to the multilayer perceptron.
The values of the nodes in the intermediate (hidden) layer are

    h_i(x) = s( w_i0 + Σ_j w_ij x_j ),

and the network output is given by

    t(x) = s( a_0 + Σ_i a_i h_i(x) ),

where the w_ij and a_i are the weights (connection strengths).
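A short sketch of the forward pass implied by these formulas, with one hidden layer; the node counts and the randomly chosen weights are illustrative assumptions.

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def mlp_output(x, w0, w, a0, a):
        # hidden-node values: h_i = s( w_i0 + sum_j w_ij x_j )
        h = sigmoid(w0 + w @ x)
        # network output:    t(x) = s( a_0 + sum_i a_i h_i )
        return sigmoid(a0 + np.dot(a, h))

    # illustrative usage: 3 inputs, 4 hidden nodes, arbitrary weights
    rng = np.random.default_rng(1)
    x  = np.array([0.5, -1.0, 2.0])
    w0 = rng.normal(size=4)        # hidden-layer offsets
    w  = rng.normal(size=(4, 3))   # hidden-layer weights (connection strengths)
    a0 = 0.0
    a  = rng.normal(size=4)        # output-layer weights
    print(mlp_output(x, w0, w, a0, a))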
G. Cowan
Lectures on Statistical Data Analysis
6
Neural network discussion
Easy to generalize to an arbitrary number of layers.
Feed-forward net: values of a node depend only on earlier layers,
usually only on the previous layer (“network architecture”).
More nodes → neural net gets closer to optimal t(x), but
more parameters need to be determined.
Parameters usually determined by minimizing an error function, e.g.,

    E = E[ (t(x) − t^(0))^2 | H0 ] + E[ (t(x) − t^(1))^2 | H1 ],

where t^(0), t^(1) are target values, e.g., 0 and 1 for the logistic sigmoid.
Expectation values replaced by averages over training data (e.g. MC).
In general training can be difficult; standard software is available.
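A rough sketch of this training step, in which the two expectation values are replaced by averages over H0 and H1 training samples (targets 0 and 1) and the error function is minimized numerically; the single-layer network form, the toy Gaussian samples, and the use of scipy.optimize are illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def t_of_x(X, par):
        # single-layer perceptron t(x) = s(a0 + a.x), parameters packed as [a0, a1, a2]
        return sigmoid(par[0] + X @ par[1:])

    def error_function(par, X0, X1):
        # E = < (t - 0)^2 >_H0  +  < (t - 1)^2 >_H1, averages over training events
        return np.mean(t_of_x(X0, par) ** 2) + np.mean((t_of_x(X1, par) - 1.0) ** 2)

    # toy training samples (e.g. from Monte Carlo): two Gaussians in 2 dimensions
    rng = np.random.default_rng(0)
    X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(1000, 2))  # H0 (background)
    X1 = rng.normal(loc=[1.5, 1.0], scale=1.0, size=(1000, 2))  # H1 (signal)

    res = minimize(error_function, x0=np.zeros(3), args=(X0, X1))
    print("fitted parameters:", res.x)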
G. Cowan
Lectures on Statistical Data Analysis
7
Neural network example from LEP II
Signal: e+e- → W+W- (often 4 well separated hadron jets)
Background: e+e- → qq̄gg (4 less well separated hadron jets)
Input variables based on jet structure, event shape, ...;
none by itself gives much separation.
Neural network output does better...
(Garrido, Juste and Martinez, ALEPH 96-144)
G. Cowan
Lectures on Statistical Data Analysis
8
Some issues with neural networks
In the example with WW events, the goal was to select these events
so as to study properties of the W boson.
Needed to avoid using input variables correlated to the
properties we eventually wanted to study (not trivial).
In principle a single hidden layer with a sufficiently large number of
nodes can approximate arbitrarily well the optimal test variable (likelihood
ratio).
Usually start with a relatively small number of nodes and increase
until the misclassification rate on a validation data sample ceases
to decrease.
Usually MC training data is cheap, so problems with getting stuck in
local minima, overtraining, etc., are less important than concerns about systematic
differences between the training data and Nature, and about
the ease of interpretation of the output.
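A minimal sketch of that node-count scan, here using scikit-learn's MLPClassifier; the toy data, the list of node counts tried, and the simple stopping rule are illustrative assumptions rather than a prescription from the slides.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    # toy training data: two overlapping Gaussian classes in 2 dimensions
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, size=(2000, 2)),
                   rng.normal(1.5, 1.0, size=(2000, 2))])
    y = np.array([0] * 2000 + [1] * 2000)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

    best_err = np.inf
    for n_nodes in [2, 4, 8, 16, 32]:
        clf = MLPClassifier(hidden_layer_sizes=(n_nodes,), max_iter=2000, random_state=1)
        clf.fit(X_train, y_train)
        err = np.mean(clf.predict(X_val) != y_val)   # misclassification rate on validation sample
        print(n_nodes, "hidden nodes: validation error =", round(err, 3))
        if err >= best_err:    # stop increasing once the validation error no longer decreases
            break
        best_err = err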
G. Cowan
Lectures on Statistical Data Analysis
9
Probability Density Estimation (PDE) techniques
Construct non-parametric estimators f̂(x|H0), f̂(x|H1) of the pdfs
and use these to construct the likelihood ratio.
(An n-dimensional histogram is a brute-force example of this.)
More clever estimation techniques can get this to work for
(somewhat) higher dimension.
See e.g. K. Cranmer, Kernel Estimation in High Energy Physics, CPC 136 (2001) 198; hep-ex/0011057;
T. Carli and B. Koblitz, A multi-variate discrimination technique based on range-searching,
NIM A 501 (2003) 576; hep-ex/0211019.
G. Cowan
Lectures on Statistical Data Analysis
10
Kernel-based PDE (KDE, Parzen window)
Consider d dimensions, N training events, x1, ..., xN;
estimate f(x) with

    f̂(x) = (1/N) Σ_i (1/h^d) K( (x − x_i)/h ),

where K is the kernel and h is the bandwidth (smoothing parameter).
Use e.g. a Gaussian kernel:

    K(u) = (2π)^(−d/2) exp(−|u|^2 / 2).

Need to sum N terms to evaluate the function (slow);
faster algorithms only count events in the vicinity of x
(k-nearest neighbor, range search).
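A small sketch of the Gaussian-kernel estimate above, plus a likelihood ratio built from two such estimates as described on the previous slide; the toy samples, the bandwidth value, and the evaluation point are illustrative assumptions. The explicit sum over all N training events per evaluation is the slow step that motivates the neighbourhood-based algorithms.

    import numpy as np

    def kde(x, data, h):
        # f_hat(x) = (1/N) sum_i (1/h^d) K( (x - x_i)/h ), Gaussian kernel
        N, d = data.shape
        u = (x - data) / h                                   # shape (N, d)
        K = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
        return np.sum(K) / (N * h**d)                        # sums over all N events (slow)

    # toy 2-d training samples for the two hypotheses
    rng = np.random.default_rng(0)
    data_s = rng.normal([1.5, 1.0], 1.0, size=(5000, 2))     # signal
    data_b = rng.normal([0.0, 0.0], 1.0, size=(5000, 2))     # background

    x = np.array([1.0, 0.5])
    h = 0.2                                                  # bandwidth (smoothing parameter)
    t = kde(x, data_s, h) / kde(x, data_b, h)                # estimated likelihood ratio
    print("likelihood-ratio estimate at x:", t)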
G. Cowan
Lectures on Statistical Data Analysis
11
Decision trees
Out of all the input variables, find the one for which a single cut
gives the best improvement in signal purity:

    P = Σ_{i∈s} w_i / ( Σ_{i∈s} w_i + Σ_{i∈b} w_i ),

where w_i is the weight of the ith event.
Resulting nodes are classified as either signal or background.
Iterate until a stop criterion is reached, based on
e.g. purity or minimum number of events in a node.
The set of cuts defines the decision boundary.
Example by the MiniBooNE experiment,
B. Roe et al., NIM 543 (2005) 577
G. Cowan
Lectures on Statistical Data Analysis
16
Finding the best single cut
The level of separation within a node can, e.g., be quantified by
the Gini coefficient, calculated from the (s or b) purity p as:

    G = p (1 − p).

For a cut that splits a set of events a into subsets b and c, one
can quantify the improvement in separation by the change in
weighted Gini coefficients:

    Δ = W_a G_a − W_b G_b − W_c G_c,

where, e.g.,

    W_a = Σ_{i∈a} w_i.

Choose e.g. the cut that maximizes Δ; a variant of this
scheme can use instead of Gini e.g. the misclassification rate:

    ε = 1 − max(p, 1 − p).
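A rough sketch of this cut search for a single input variable: scan candidate cut values and pick the one that maximizes the weighted-Gini improvement Δ. The toy weighted events and the grid of candidate cuts are illustrative assumptions.

    import numpy as np

    def gini(ws, wb):
        # Gini coefficient of a node, G = p(1 - p), with purity p = ws / (ws + wb)
        W = ws + wb
        return 0.0 if W == 0.0 else (ws / W) * (1.0 - ws / W)

    def delta_gini(x, w, is_signal, cut):
        # Delta = W_a G_a - W_b G_b - W_c G_c for the split x < cut (b) vs x >= cut (c)
        def wsum(mask):
            return w[mask & is_signal].sum(), w[mask & ~is_signal].sum()
        ws_a, wb_a = wsum(np.ones_like(x, dtype=bool))
        ws_b, wb_b = wsum(x < cut)
        ws_c, wb_c = wsum(x >= cut)
        return ((ws_a + wb_a) * gini(ws_a, wb_a)
                - (ws_b + wb_b) * gini(ws_b, wb_b)
                - (ws_c + wb_c) * gini(ws_c, wb_c))

    # toy weighted events in one input variable
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500)])
    is_signal = np.array([True] * 500 + [False] * 500)
    w = np.ones_like(x)                                  # event weights w_i

    cuts = np.linspace(-2.0, 2.0, 41)
    best = max(cuts, key=lambda c: delta_gini(x, w, is_signal, c))
    print("best cut:", best, " Delta =", delta_gini(x, w, is_signal, best))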
G. Cowan
Lectures on Statistical Data Analysis
17
Overtraining
If the decision boundary is too flexible it will conform too closely
to the training points → overtraining.
Monitor by applying the classifier to an independent test sample.
(Figure: decision boundary shown for the training sample and for an independent test sample.)
G. Cowan
Lectures on Statistical Data Analysis
22
Monitoring overtraining
From the MiniBooNE example:
performance is stable after a few hundred trees.
G. Cowan
Lectures on Statistical Data Analysis
23
Comparing multivariate methods (TMVA)
Choose the best one!
G. Cowan
Lectures on Statistical Data Analysis
25
Wrapping up lecture 6
We looked at statistical tests and related issues:
discriminate between event types (hypotheses),
determine selection efficiency, sample purity, etc.
Some modern (and less modern) methods were mentioned:
Fisher discriminants, neural networks,
PDE, KDE, decision trees, ...
Next we will talk about significance (goodness-of-fit) tests:
p-value expresses level of agreement between data
and hypothesis
G. Cowan
Lectures on Statistical Data Analysis
28
Extra slides
G. Cowan
Lectures on Statistical Data Analysis
29
Particle i.d. in MiniBooNE
Detector is a 12-m diameter tank of mineral oil exposed to a beam of
neutrinos and viewed by 1520 photomultiplier tubes.
Search for νμ → νe oscillations required particle i.d. using
information from the PMTs.
G. Cowan
H.J. Yang, MiniBooNE PID, DNP06
Lectures on Statistical Data Analysis
30
BDT example from MiniBooNE
~200 input variables for each event (ν interaction producing e, μ or π).
Each individual tree is relatively weak, with a misclassification
error rate ~ 0.4–0.45.
B. Roe et al., NIM 543 (2005) 577
G. Cowan
Lectures on Statistical Data Analysis
31
Comparison of boosting algorithms
A number of boosting algorithms are on the market; they differ in the
update rule for the event weights.
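For concreteness, a sketch of one widely used rule (the AdaBoost update), in which misclassified training events have their weights increased before the next tree is grown; the toy events and the imagined tree predictions are assumptions for illustration.

    import numpy as np

    def adaboost_weight_update(w, y_true, y_pred):
        # weighted error of the current tree
        misclassified = (y_pred != y_true)
        err = np.sum(w[misclassified]) / np.sum(w)
        alpha = np.log((1.0 - err) / err)           # weight of this tree in the final classifier
        w_new = w * np.exp(alpha * misclassified)   # only misclassified events are up-weighted
        return w_new / np.sum(w_new), alpha         # renormalize the event weights

    # toy usage: 10 events with equal starting weights and one tree's predictions
    y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
    y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 0])
    w = np.full(10, 0.1)
    w, alpha = adaboost_weight_update(w, y_true, y_pred)
    print("alpha =", alpha, "\nnew weights:", w)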
G. Cowan
Lectures on Statistical Data Analysis
32
Using classifier output for discovery
(Figure: distributions f(y) of the classifier output for signal and background, normalized to unity, and N(y) normalized to the expected numbers of events, with a search region y > y_cut and a possible excess over background.)
Discovery = number of events found in the search region incompatible
with the background-only hypothesis.
The p-value of the background-only hypothesis can depend crucially on the
distribution f(y|b) in the "search region".
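As a sketch of the counting step: with b expected background events in the search region y > y_cut and n_obs events observed there, the p-value of the background-only hypothesis is the Poisson probability to observe n_obs or more. The numerical values below are made up for illustration.

    from scipy.stats import poisson

    b = 3.2        # expected background events with y > y_cut (assumed value)
    n_obs = 10     # observed events in the search region (assumed value)

    # p-value = P(n >= n_obs | background only) for a Poisson count with mean b
    p_value = poisson.sf(n_obs - 1, b)
    print("p-value of background-only hypothesis:", p_value)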
G. Cowan
Lectures on Statistical Data Analysis
33
Single top quark production (CDF/D0)
Top quark discovered in pairs, but
the SM predicts single top production.
Use many inputs based on
jet properties, particle i.d., ...
Pair-produced tops are now
a background process.
(Figure: classifier output distribution, with the single-top signal shown in blue + green.)
G. Cowan
Lectures on Statistical Data Analysis
34
Different classifiers for single top
Also Naive Bayes and various approximations to the likelihood ratio, ...
The final combined result is statistically significant (> 5σ level), but the
classifier outputs are not easy to understand.
G. Cowan
Lectures on Statistical Data Analysis
35