The Boosting Algorithm, an Alternative to Artificial Neural Networks
Some Current Statistical
Considerations in Particle Physics
Byron P. Roe
Department of Physics
University of Michigan
Ann Arbor, MI 48109
Outline
• Preliminaries
• Nuisance Variables
• Modern Classification Methods (especially
boosting and related methods).
Preliminaries
• Try to determine a parameter λ given a measurement x.
• For each λ, draw a line so that the probability that x falls within the limits is 90%.
• The probability of a result falling in the region is then 90%.
• Given an x, for 90% of experiments λ is in that region. This is the Neyman construction.
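As a concrete illustration, here is a minimal sketch (my own, with hypothetical numbers, not from the talk) of a Neyman construction for a Gaussian measurement with unit variance: for each candidate λ the central 90% acceptance interval in x is [λ − 1.645σ, λ + 1.645σ], and inverting the belt at the observed x gives the confidence interval for λ.

# Minimal sketch of a Neyman construction (central intervals) for a
# Gaussian measurement x ~ N(lambda, sigma=1). Hypothetical example.
import numpy as np

sigma = 1.0
z90 = 1.645  # central 90%: 5% in each tail
lambdas = np.linspace(0.0, 10.0, 1001)

def acceptance_interval(lam):
    """x-interval containing 90% of measurements for this lambda."""
    return lam - z90 * sigma, lam + z90 * sigma

def confidence_interval(x_obs):
    """Invert the belt: all lambda whose acceptance interval contains x_obs."""
    accepted = [lam for lam in lambdas
                if acceptance_interval(lam)[0] <= x_obs <= acceptance_interval(lam)[1]]
    return min(accepted), max(accepted)

print(confidence_interval(3.0))  # roughly (1.36, 4.64)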
Frequentist and Bayesian
• This construction is “frequentist”; no probability is assigned to λ, a physical quantity.
• Bayesian point of view: probability refers to the state of knowledge of a parameter, and λ can have a probability.
• There was formerly a “war” between the two views.
• People are starting to realize each side has some merits and uses; the war is abating.
Ambiguities
• At a given λ, 90% of the time x will fall in the region, but do you want 5% on each side, or 8% on the lower side and 2% on the upper?
• A useful ordering principle was introduced into physics by Feldman and Cousins: choose the region to have the largest values of
R = (likelihood of this λ given x) / (best likelihood of any physical λ given x).
• This always gives a region, and it moves automatically between upper limits and two-sided regions.
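A minimal sketch of this ordering for a Poisson signal μ on a known background b (my own illustration with hypothetical values; see the Feldman and Cousins reference for the full construction). Counts n are added to the acceptance region in decreasing order of R until 90% probability is reached:

# Sketch of Feldman-Cousins ordering for Poisson signal mu, known background b.
from scipy.stats import poisson

def fc_acceptance(mu, b, cl=0.90, n_max=100):
    """Set of observed counts n accepted at this mu."""
    # Ordering ratio R = P(n | mu + b) / P(n | mu_best + b),
    # with mu_best = max(0, n - b) the physically allowed best fit.
    def ratio(n):
        mu_best = max(0.0, n - b)
        return poisson.pmf(n, mu + b) / poisson.pmf(n, mu_best + b)
    ranked = sorted(range(n_max), key=ratio, reverse=True)
    accepted, prob = [], 0.0
    for n in ranked:                      # add counts in decreasing R
        accepted.append(n)
        prob += poisson.pmf(n, mu + b)
        if prob >= cl:
            break
    return set(accepted)

# 90% CL interval for mu: all mu whose acceptance region contains n_obs.
n_obs, b = 0, 2.83
mus = [i * 0.01 for i in range(1001)]
allowed = [mu for mu in mus if n_obs in fc_acceptance(mu, b)]
print(min(allowed), max(allowed))

For n_obs = 0 and b = 2.83 this reproduces an upper limit close to the 1.1 quoted for KARMEN later in the talk.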
But is it new?
• Feldman and Cousins soon realized this
was a standard statistical technique
described in a text (Kendall and Stuart)
• Physicists have, in the past, often ignored
statistical literature to the detriment of both
physicists and statisticians
• In recent years, helped by conferences on
statistics in physics since 2000, there has
been more and more cooperation
Nuisance Parameters
• Experiments may depend on background, efficiency, etc., which are not the targets of the experiment but are needed to get to the physical parameter λ.
• These are called nuisance parameters.
• Their expectation values may be well known or may have an appreciable uncertainty.
Problem with Feldman-Cousins
• The KARMEN experiment reported results in 1999; they were checking the LSND experiment.
• The background was known to be 2.83 +/- 0.13 events.
• They observed 0 events and, using FC, set a 90% CL limit on λ of 1.1.
• Common sense: with 0 events observed there is no signal, and the 90% CL upper limit should be 2.3 (the Poisson limit for 0 observed events).
• The FC ordering uses P given the data, BUT 90% CL refers to the overall P, not P given the data.
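The 2.3 comes from inverting the Poisson probability of observing zero events: P(0 | μ) = exp(−μ) = 0.10 at μ = −ln 0.10 ≈ 2.30. A one-line check:

# Poisson 90% CL upper limit for 0 observed events:
# P(0 | mu) = exp(-mu) = 0.10  =>  mu = -ln(0.10) ~ 2.30
import math
print(-math.log(0.10))  # 2.302...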
Attempts to Improve the Estimate
• With a statistician, Michael Woodroofe, I tried different methods.
• Suppose we try a Bayesian method, taking a uniform prior (initial) probability for λ.
• Obtain a credible limit (the Bayesian equivalent of a CL).
• Go back to the frequentist view and look at the coverage.
• The result is quite close to frequentist, and with a small modification very close in almost all regions, but it gives 2.5 for the KARMEN limit, close to the desired 2.3.
Nuisance Variables with Significant Uncertainty
• One can draw a 90% CL region for the joint probability of λ and b (the nuisance parameter).
• Project onto the λ axis and take the extreme values for the CL.
• This is safe, but often grossly over-covers.
Nuisance Parameters 2
• 1. Integrate over the nuisance parameter b using the measured probability of b. This introduces a Bayesian concept for b. It tends to over-cover, though there are claims of under-coverage.
• 2. Suppose the maximum likelihood solution has values L, B, and that, for a given λ, the maximum likelihood solution for b is b_λ. Consider
R = likelihood(x | λ, b_λ) / likelihood(x | L, B)
(essentially the FC ratio). Let
G_{λ,b} = prob_{λ,b}(R > C) = CL,
and approximate G_{λ,b} ≈ G_{λ,b_λ}.
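A minimal sketch of this profile likelihood ratio for a hypothetical model (my own illustration, not the talk's code): a Poisson count n on signal λ plus background b, with a Gaussian auxiliary measurement of b.

# Profile likelihood ratio R for n ~ Poisson(lambda + b), b0 ~ N(b, sigma_b).
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.stats import norm, poisson

n_obs, b0, sigma_b = 0, 2.83, 0.13  # numbers from the KARMEN example above

def nll(lam, b):
    """Negative log likelihood of (lambda, b) given n_obs and b0."""
    return -(poisson.logpmf(n_obs, lam + b) + norm.logpdf(b0, b, sigma_b))

def profile_b(lam):
    """b_lambda: the best-fit b for fixed lambda."""
    res = minimize_scalar(lambda b: nll(lam, b),
                          bounds=(1e-6, 20.0), method="bounded")
    return res.x

# Global fit (L, B), then R = L(x | lambda, b_lambda) / L(x | L, B) <= 1.
glob = minimize(lambda p: nll(p[0], p[1]), x0=[0.5, b0],
                bounds=[(0.0, 20.0), (1e-6, 20.0)])
L, B = glob.x
lam = 1.0
R = np.exp(nll(L, B) - nll(lam, profile_b(lam)))
print(R)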
Nuisance Parameters 3
• Use this and make a full Neyman construction. This gives good coverage for a number of examples, OR...
• Assume −2 ln R is approximately a χ² distribution. (It is asymptotically.) This is the method of MINOS in MINUIT. Coverage is good in some recent cases. Clipping is required for nuisance parameters far from their expected values.
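For one parameter of interest, the 90% CL region from this χ² approximation is −2 ln R < 2.71. A quick check, assuming scipy:

# 90% CL threshold on -2 ln R for one parameter of interest.
from scipy.stats import chi2
print(chi2.ppf(0.90, df=1))  # ~2.706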
Data Classification
• Given: a set of events to separate into signal and background, and some partial knowledge in the form of a set of particle identification (PID) variables.
• One approach is a series of cuts on the PID variables; this is often inefficient.
• Another is a neural net, invented by John Hopfield as a method the brain might use to learn.
• Newer methods: boosting, ...
Neural Nets and Modern Methods
• Use a training sample of events for which you know which are signal and which are background.
• Practice an algorithm on this set, updating it and trying to find the best discrimination.
• A second, unbiased set is needed to test the result: the test sample.
• If the test set was used to determine parameters or the stopping point of the algorithm, a third set is needed: the verification sample.
• Results here are for testing samples. Verification samples in our tests gave essentially the same results.
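A minimal sketch of the three-way split just described (assuming numpy; the fractions are hypothetical):

# Split labeled events into training, test, and verification samples.
import numpy as np

rng = np.random.default_rng(0)
n_events = 10000
idx = rng.permutation(n_events)

n_train = int(0.5 * n_events)   # hypothetical 50/25/25 split
n_test = int(0.25 * n_events)

train_idx = idx[:n_train]                     # used to build the classifier
test_idx = idx[n_train:n_train + n_test]      # used to tune and to stop
verify_idx = idx[n_train + n_test:]           # untouched until the end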
Neural Network Structure
Combine the features in a non-linear way into a “hidden layer” and then into a “final layer”.
Use a training set to find the weights w_ik that best distinguish signal and background.
Intuition
• Neural nets and most modern methods use PID variables in complicated non-linear ways, so intuition is somewhat difficult.
• However, they are often much more efficient than cuts and are used more and more.
• I will not discuss neural nets further, but will discuss modern methods: boosting, etc.
Boosted Decision Trees
• What is a decision tree?
• What is “boosting the decision trees”?
• Two algorithms for boosting.
Decision Tree
• Go through all PID variables and find the best variable and value at which to split the events.
• For each of the two subsets, repeat the process.
• Proceeding in this way, a tree is built.
• The ending nodes are called leaves.
[Figure: decision tree, with leaves labeled Background/Signal]
Select Signal and Background Leaves
• Assume an equal total weight of signal and background training events.
• If more than ½ of the weight of a leaf corresponds to signal, it is a signal leaf; otherwise it is a background leaf.
• Signal events on a background leaf, or background events on a signal leaf, are misclassified.
Criterion for “Best” Split
• Purity, P, is the fraction of the weight of a leaf due to signal events.
• Gini = W P(1 − P), where W is the total weight of the leaf. Note that the Gini is 0 for all signal or all background.
• The criterion is to minimize Gini_left + Gini_right of the two children from a parent node.
Criterion for Next Branch to Split
• Pick the branch that maximizes the change in Gini:
Criterion = Gini_parent − Gini_right-child − Gini_left-child
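A minimal sketch of this splitting criterion in Python (my own illustration of the formulas above, not the talk's FORTRAN/C++ code):

# Find the best (variable, cut) split of a weighted node by Gini.
import numpy as np

def gini(weights, is_signal):
    """Gini = W * P * (1 - P) for one leaf, with P the signal purity."""
    W = weights.sum()
    if W == 0:
        return 0.0
    P = weights[is_signal].sum() / W
    return W * P * (1.0 - P)

def best_split(x, weights, is_signal):
    """x: (n_events, n_vars) PID variables. Minimize gini_left + gini_right."""
    best = (np.inf, None, None)
    for j in range(x.shape[1]):
        for cut in np.unique(x[:, j]):
            left = x[:, j] < cut
            score = (gini(weights[left], is_signal[left])
                     + gini(weights[~left], is_signal[~left]))
            if score < best[0]:
                best = (score, j, cut)
    return best  # (gini_left + gini_right, variable index, cut value)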
Decision Trees
• This is a decision tree.
• Decision trees have been known for some time, but they are often unstable: a small change in the training sample can produce a large difference.
Boosting the Decision Tree
• Give the training events misclassified under this procedure a higher weight.
• Continuing, build perhaps 1000 trees and take a weighted average of the results (1 if the event lands on a signal leaf, −1 if on a background leaf).
Two Commonly Used Algorithms for Changing Weights
• 1. AdaBoost
• 2. ε-boost (shrinkage)
Definitions
• x_i = set of particle ID variables for event i.
• y_i = 1 if event i is signal, −1 if background.
• T_m(x_i) = 1 if event i lands on a signal leaf of tree m, and −1 if the event lands on a background leaf.
AdaBoost
• Define err_m = (weight of misclassified events) / (total weight), and
α_m = β log((1 − err_m) / err_m),
where β is a constant (about 0.5 in our use; see the parameter tests below).
• Increase the weight of each misidentified event:
w_i → w_i exp(α_m)
Scoring events with AdaBoost
• Renormalize the weights:
w_i → w_i / Σ_{i=1}^{N} w_i
• Score by summing over trees:
T(x) = Σ_{m=1}^{N_tree} α_m T_m(x)
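Putting the last three slides together, a compact sketch of AdaBoost training (my own illustration; fit_tree(x, y, w) is an assumed helper that builds one small tree on weighted events and returns a function mapping events to ±1):

# Sketch of AdaBoost as defined on the preceding slides.
import numpy as np

def adaboost(x, y, fit_tree, n_trees=1000, beta=0.5):
    n = len(y)
    w = np.ones(n) / n
    trees, alphas = [], []
    for m in range(n_trees):
        tree = fit_tree(x, y, w)
        miss = tree(x) != y                   # tree(x) gives +/-1 per event
        err = w[miss].sum() / w.sum()         # err_m (assumes 0 < err_m < 1)
        alpha = beta * np.log((1 - err) / err)
        w[miss] *= np.exp(alpha)              # boost misclassified events
        w /= w.sum()                          # renormalize
        trees.append(tree)
        alphas.append(alpha)
    def score(x_new):                         # T(x) = sum_m alpha_m T_m(x)
        return sum(a * t(x_new) for a, t in zip(alphas, trees))
    return score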
eBoost (shrinkage)
• After tree m, change the weight of the misclassified events; a typical ε is ~0.01 (we use 0.03). For misclassified events:
w_i → w_i exp(2ε)
• Renormalize the weights:
w_i → w_i / Σ_{i=1}^{N} w_i
• Score by summing over trees:
T(x) = Σ_{m=1}^{N_tree} ε T_m(x)
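The same skeleton works for ε-boost; only the weight update and the tree coefficients change (again a sketch with the same hypothetical fit_tree helper):

# Sketch of epsilon-boost (shrinkage): constant update 2*eps for
# misclassified events; every tree enters the score with coefficient eps.
import numpy as np

def eboost(x, y, fit_tree, n_trees=1000, eps=0.03):
    n = len(y)
    w = np.ones(n) / n
    trees = []
    for m in range(n_trees):
        tree = fit_tree(x, y, w)
        miss = tree(x) != y
        w[miss] *= np.exp(2 * eps)           # w_i -> w_i exp(2 eps)
        w /= w.sum()                         # renormalize
        trees.append(tree)
    def score(x_new):                        # T(x) = eps * sum_m T_m(x)
        return eps * sum(t(x_new) for t in trees)
    return score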
Unweighted and Weighted Misclassified Event Rate vs. Number of Trees
Comparison of methods
• ε-boost changes the weights a little at a time.
• Let y = 1 for signal, −1 for background, and let T be the score summed over trees.
• AdaBoost can be shown to optimize each change of weights: the expectation of exp(−yT) is minimized.
• The optimum value is
T(x) = ½ log [P(y = 1 | x) / P(y = −1 | x)],
i.e. one-half the log odds that y is 1 given x.
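This optimum follows from a short calculation (a standard result, as in the Friedman, Hastie, and Tibshirani reference): writing p = P(y = 1 | x),
E[exp(−yT) | x] = p exp(−T) + (1 − p) exp(T).
Setting the derivative with respect to T to zero gives −p exp(−T) + (1 − p) exp(T) = 0, so exp(2T) = p/(1 − p) and
T(x) = ½ log[p/(1 − p)].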
Tests of Boosting Parameters
• 45 leaves seemed to work well for our application.
• 1000 trees was sufficient (or over-sufficient).
• AdaBoost with β about 0.5 and ε-boost with ε about 0.03 worked well, although small changes made little difference.
• For other applications these numbers may need adjustment.
• For MiniBooNE, around 50-100 variables are needed for the best results. Too many variables degrades performance.
• Relative ratio = const. × (fraction of background kept) / (fraction of signal kept).
Smaller is better!
Effects of Number of Leaves and Number of Trees
Smaller is better! R = const. × (fraction of background kept) / (fraction of signal kept).
Number of feature variables in boosting
• In recent trials we have used 182 variables. Boosting worked well.
• However, by looking at the frequency with which each variable was used as a splitting variable, it was possible to reduce the number to 86 without loss of sensitivity. Several methods for choosing variables were tried, but this worked as well as any.
• After selecting by frequency of use as a splitting variable, some further improvement may be obtained by looking at the correlations between variables.
Effect of Number of PID Variables
Comparison of Boosting and ANN
• The relative ratio here is (ANN background kept) / (boosting background kept). Greater than one implies boosting wins!
• A. All types of background events. Red is 21 and black is 52 training variables.
• B. Background is π0 events. Red is 22 and black is 52 training variables.
[Figure: relative ratio vs. percent νe CCQE kept]
Robustness
• For either boosting or ANN, it is important to know how robust the method is, i.e., will small changes in the model produce large changes in the output?
• In MiniBooNE this is handled by generating many sets of events with parameters varied by about 1σ and checking the differences. This is not complete, but, so far, the selections look quite robust for boosting.
How did the sensitivities change with a new optical model?
• In Nov. 2004, a new, much changed optical model of the detector was introduced for making MC events.
• The reconstruction tunings needed to be changed to optimize the fits for this model.
• Using the SAME PID variables as for the old model: for a fixed background contamination of π0 events, the fraction of signal kept dropped by 8.3% for boosting and by 21.4% for ANN.
For ANN
• For ANN one needs to set the temperature, hidden layer size, learning rate, ... There are lots of parameters to tune.
• For ANN, if one
a. multiplies a variable by a constant, var(17) → 2·var(17),
b. switches two variables, var(17) ↔ var(18), or
c. puts a variable in twice,
the result is very likely to change.
For Boosting
• There are only a few parameters, and once set they have been stable for all calculations within our experiment.
• Let y = f(x) be monotonic (if x1 > x2 then y1 > y2); then the results are identical, as they depend only on the ordering of the values.
• Putting variables in twice or changing the order of variables has no effect.
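This invariance is easy to see: under a monotonic f, the cut "x < c" selects exactly the same events as "f(x) < f(c)". A tiny demonstration (my own illustration):

# Decision-tree cuts depend only on the ordering of values, so a
# monotonic transform leaves every split's event selection unchanged.
import numpy as np

x = np.array([0.3, 1.2, 2.7, 5.0, 9.1])
cut = 2.0
f = np.log1p          # any strictly increasing function

same = np.array_equal(x < cut, f(x) < f(cut))
print(same)  # True: identical events pass the transformed cut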
Tests of Boosting Variants
• None were clearly better than AdaBoost or ε-boost.
• I will not go over most of them, except Random Forests.
• For a Random Forest, one uses only a random fraction of the events (WITH replacement) per tree, and only a random fraction of the variables per node. NO boosting is used: just many trees. Each tree should go to completion, with every node very small or purely signal or background.
• Our UM programs weren't designed well for this many leaves, and better results (Narsky) have been obtained, but not better than boosting.
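A minimal sketch of the Random Forest sampling scheme just described (bootstrap sample of events per tree, random variable subset per node; helper names and fractions are my own, hypothetical):

# Sketch of Random Forest sampling.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_events(n_events, fraction=0.5):
    """Random fraction of events, drawn WITH replacement, for one tree."""
    size = int(fraction * n_events)
    return rng.integers(0, n_events, size=size)

def node_variables(n_vars, fraction=0.3):
    """Random fraction of variables considered at one node's split."""
    size = max(1, int(fraction * n_vars))
    return rng.choice(n_vars, size=size, replace=False)

# Per tree: rows = bootstrap_events(n_events); at each node, search for
# the best split only among columns node_variables(n_vars). No boosting:
# the forest's score is a plain (unweighted) vote over many full trees.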
Can Convergence Speed be Improved?
• Removing correlations between variables helps.
• Random Forest sampling helps WHEN combined with boosting.
• Softening the step-function scoring helps: let y = 2·purity − 1; then score = sign(y)·sqrt(|y|).
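A sketch of this soft leaf score (my own illustration): instead of returning ±1 from each leaf, return a value that grows with the leaf's purity.

# Soft leaf scoring: replace the +/-1 step function with a smooth
# function of the leaf purity: y = 2*purity - 1, score = sign(y)*sqrt(|y|).
import numpy as np

def leaf_score_step(purity):
    return 1.0 if purity > 0.5 else -1.0

def leaf_score_soft(purity):
    y = 2.0 * purity - 1.0
    return np.sign(y) * np.sqrt(abs(y))

for p in (0.5, 0.6, 0.9, 1.0):
    print(p, leaf_score_step(p), leaf_score_soft(p))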
Soft Scoring and Step Function
Performance of AdaBoost with Step Function and Soft Scoring Function
Conclusions for Nuisance Variables
• Likelihood ratio methods seem very useful as an organizing principle, with or without nuisance variables.
• There are some problems in extreme cases where the data are much smaller than expected.
• Several tools for handling nuisance variables were described. The method using the approximate likelihood to construct the Neyman region seems to have good performance.
References for Nuisance Variables 1
• J. Neyman, Phil. Trans. Royal Soc. London A333 (1937).
• G.J. Feldman and R.D. Cousins, Phys. Rev. D57, 3873 (1998).
• A. Stuart, K. Ord, and S. Arnold, Kendall's Advanced Theory of Statistics, Vol. 2A, 6th ed. (London: 1999).
• K. Eitel and B. Zeitnitz, hep-ex/9809007.
• The LSND collaboration, C. Athanassopoulos et al., Phys. Rev. Lett. 75, 2650 (1995); Phys. Rev. Lett. 77, 3082 (1996); Phys. Rev. C54, 2685 (1996); Phys. Rev. D64, 112008 (2001).
• B.P. Roe and M.B. Woodroofe, Phys. Rev. D60, 053009 (1999).
• B.P. Roe and M.B. Woodroofe, Phys. Rev. D63, 013009 (2001).
References for Nuisance Variables 2
• R. Cousins and V.L. Highland, Nucl. Instrum. Meth. A320, 331 (1992).
• J. Conrad, O. Botner, A. Hallgren, and C. Perez de los Heros, Phys. Rev. D67, 118101 (2003).
• R.D. Cousins, to appear in Proceedings of PHYSTAT2005: Statistical Problems in Particle Physics, Astrophysics, and Cosmology (2005).
• K.S. Cranmer, Proceedings of PHYSTAT2003: Statistical Problems in Particle Physics, Astrophysics and Cosmology, 261 (2003).
• G. Punzi, to appear in Proceedings of PHYSTAT2005.
• F. James and M. Roos, Nucl. Phys. B172, 475 (1980).
• W.A. Rolke and A.M. Lopez, Nucl. Instrum. Meth. A458, 745 (2001).
• W.A. Rolke, A.M. Lopez, and J. Conrad, Nucl. Instrum. Meth. A551, 493 (2005).
Conclusions for Classification
• Boosting is very robust. Given a sufficient number of leaves and trees, AdaBoost or ε-boost reaches an optimum level which was not bettered by any variant tried.
• Boosting was better than ANN in our tests by a factor of 1.2-1.8.
• There are ways (such as the smooth scoring function) to increase convergence speed in some cases.
• Several techniques can be used for weeding variables. Examining the frequency with which a given variable is used works reasonably well.
• Downloads in FORTRAN or C++ are available at:
http://www.gallatin.physics.lsa.umich.edu/~roe/
References for Boosting
• R.E. Schapire, "The strength of weak learnability", Machine Learning 5 (2), 197-227 (1990). First suggested the boosting approach, for 3 trees taking a majority vote.
• Y. Freund, "Boosting a weak learning algorithm by majority", Information and Computation 121 (2), 256-285 (1995). Introduced using many trees.
• Y. Freund and R.E. Schapire, "Experiments with a new boosting algorithm", Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, San Francisco, pp. 148-156 (1996). Introduced AdaBoost.
• J. Friedman, "Recent Advances in Predictive (Machine) Learning", Proceedings of PHYSTAT2003: Statistical Problems in Particle Physics, Astrophysics and Cosmology, 196 (2003).
• J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting", Annals of Statistics 28 (2), 337-407 (2000). Showed that AdaBoost can be viewed as successive approximations to a maximum likelihood solution.
• T. Hastie, R. Tibshirani, and J. Friedman, "The Elements of Statistical Learning", Springer (2001). A good reference for decision trees and boosting.
• B.P. Roe et al., "Boosted decision trees as an alternative to artificial neural networks for particle identification", NIM A543, 577-584 (2005).
• Hai-Jun Yang, Byron P. Roe, and Ji Zhu, "Studies of Boosted Decision Trees for MiniBooNE Particle Identification", physics/0508045, NIM A555, 370-385 (2005).
AdaBoost Output for Training and Test Samples
The MiniBooNE Collaboration
A 40-foot diameter tank of mineral oil, surrounded by about 1280 photomultipliers, detects both Cherenkov and scintillation light. Geometrical shape and timing distinguish events.
Numerical Results from sfitter (a second reconstruction program)
• An extensive attempt was made to find the best variables for ANN and for boosting, starting from about 3000 candidates.
• Train against π0 and related backgrounds: 22 ANN variables and 50 boosting variables.
• For the region near 50% of signal kept, the ratio of ANN to boosting background was about 1.2.
Post-Fitting
• Post-fitting is an attempt to reweight the trees when summing the tree scores, after all the trees are made.
• Two attempts produced only a very modest (a few %) gain, if any.