
Data Analysis
Wouter Verkerke
(NIKHEF)
Wouter Verkerke, UCSB
Course Overview
• Basic statistics
– 24 pages
• Event classification
– 61 pages
• Estimation and fitting
– 80 pages
• Significance & limits
– 42 pages
• Systematic checks
– 12 pages
Wouter Verkerke, NIKHEF
Basic Statistics
– Mean, variance, standard deviation
– Gaussian standard deviation
– Covariance, correlations
– Basic distributions: binomial, Poisson, Gaussian
– Central Limit Theorem
– Error propagation
Wouter Verkerke, NIKHEF
Data – types of data
• Qualitative (non-numeric) vs. quantitative (numeric) data
– Qualitative data is not suitable for mathematical treatment
• Quantitative data can be discrete (integers) or continuous (reals)
– Binning continuous data gives 'histograms'
– Keeping the raw list of values, e.g. { 5.6354, 7.3625, 8.1635, 9.3634, 1.3846, 0.2847, 1.4763 }, gives 'N-tuples'
Wouter Verkerke, NIKHEF
Describing your data – the Average
• Given a set of unbinned data (measurements) { x1, x2, …, xN }, then the mean value of x is

  $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$

• For binned data

  $\bar{x} = \frac{1}{N}\sum_{i} n_i x_i$

– where ni is the bin count and xi the bin center
– The unbinned average is more accurate (binning rounds each value to its bin center)
Wouter Verkerke, NIKHEF
Describing your data – Spread
• The variance V(x) of x expresses how much x is liable to vary from its mean value x̄:

  $V(x) = \frac{1}{N}\sum_i (x_i - \bar{x})^2$
  $\qquad = \frac{1}{N}\sum_i \left(x_i^2 - 2\,x_i\bar{x} + \bar{x}^2\right)$
  $\qquad = \frac{1}{N}\sum_i x_i^2 \;-\; 2\bar{x}\,\frac{1}{N}\sum_i x_i \;+\; \bar{x}^2$
  $\qquad = \overline{x^2} - 2\bar{x}^2 + \bar{x}^2 \;=\; \overline{x^2} - \bar{x}^2$

• Standard deviation: $\sigma = \sqrt{V(x)} = \sqrt{\overline{x^2} - \bar{x}^2}$
Wouter Verkerke, NIKHEF
Different definitions of the Standard Deviation
• $\sigma = \sqrt{\frac{1}{N}\sum_i (x_i - \bar{x})^2}$ is the S.D. of the data sample

• Presumably our data were drawn from a parent distribution which has mean μ and S.D. σ_parent

  [Figure: data sample (mean x̄, S.D. σ) drawn from a parent distribution (mean μ, S.D. σ_parent)]

  – x̄ = mean of our sample, μ = mean of the parent distribution
  – σ = S.D. of our sample, σ_parent = S.D. of the parent distribution

• Beware notational confusion!
Wouter Verkerke, NIKHEF
Different definitions of the Standard Deviation
• Which definition of σ you use, data or parent, is a matter of preference, but be clear which one you mean!
– σ_data refers to the data sample (mean x̄), σ_parent to the parent distribution from which the sample was drawn (mean μ)
• In addition, you can get an unbiased estimate of σ_parent from a given data sample using

  $\hat{\sigma}_{\mathrm{parent}} = \sqrt{\frac{1}{N-1}\sum_i (x_i - \bar{x})^2} \;=\; \sqrt{\frac{N}{N-1}}\;\hat{\sigma}_{\mathrm{data}}$

  $\left(\sigma_{\mathrm{data}} = \sqrt{\frac{1}{N}\sum_i (x_i - \bar{x})^2}\right)$
Wouter Verkerke, NIKHEF
More than one variable
• Given two variables x, y and a dataset consisting of pairs of numbers { (x1,y1), (x2,y2), …, (xN,yN) }
• Definition of x̄, ȳ, σx, σy as usual
• In addition, any dependence between x and y is described by the covariance

  $\mathrm{cov}(x,y) = \frac{1}{N}\sum_i (x_i - \bar{x})(y_i - \bar{y}) = \overline{(x-\bar{x})(y-\bar{y})} = \overline{xy} - \bar{x}\,\bar{y}$

  (has dimension D(x)·D(y))
• The dimensionless correlation coefficient is defined as

  $\rho = \frac{\mathrm{cov}(x,y)}{\sigma_x\,\sigma_y} \;\in\; [-1,1]$
Wouter Verkerke, NIKHEF
Visualization of correlation
= 0
 = 0.1
 = 0.5
 = -0.7
 = -0.9
 = 0.99
Correlation & covariance in >2 variables
• Concept of covariance, correlation is easily extended to
arbitrary number of variables
  $\mathrm{cov}(x^{(i)}, x^{(j)}) = \overline{x^{(i)} x^{(j)}} - \overline{x^{(i)}}\;\overline{x^{(j)}}$

• so that $V_{ij} \equiv \mathrm{cov}(x^{(i)}, x^{(j)})$ takes the form of an n × n symmetric matrix
• This is called the covariance matrix, or error matrix
• Similarly the correlation matrix becomes

  $\rho_{ij} = \frac{\mathrm{cov}(x^{(i)}, x^{(j)})}{\sigma_{(i)}\,\sigma_{(j)}}, \qquad V_{ij} = \rho_{ij}\,\sigma_i\,\sigma_j$
Wouter Verkerke, NIKHEF
Basic Distributions – The binomial distribution
• Simple experiment – drawing marbles from a bowl
– Bowl with marbles, fraction p are black, the others are white
– Draw N marbles from the bowl, putting each marble back after drawing it
– Distribution of the number R of black marbles in the drawn sample:

  $P(R;\,p,N) = \underbrace{p^R\,(1-p)^{N-R}}_{\text{probability of a specific outcome, e.g. 'BBBWBWW'}}\;\;\underbrace{\frac{N!}{R!\,(N-R)!}}_{\text{number of equivalent permutations for that outcome}}$

  [Figure: P(R; 0.5, 4) vs. R for p = 0.5, N = 4]

  This is the binomial distribution.
Wouter Verkerke, UCSB
Properties of the binomial distribution
• Mean: $\langle r \rangle = Np$
• Variance: $V(r) = Np(1-p) \;\Rightarrow\; \sigma = \sqrt{Np(1-p)}$

  [Figure: binomial distributions for p = 0.1, 0.5, 0.9 with N = 4 and N = 1000]
Wouter Verkerke, UCSB
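A quick numerical cross-check of these two formulas, not part of the original slides; p, N and the sample size below are arbitrary illustration values:

import numpy as np

# Cross-check of the binomial mean and variance by sampling.
rng = np.random.default_rng(seed=1)
p, N = 0.5, 4
r = rng.binomial(N, p, size=100_000)

print("sample mean    :", r.mean(), " expected N*p       =", N * p)
print("sample variance:", r.var(),  " expected N*p*(1-p) =", N * p * (1 - p))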
Basic Distributions – the Poisson distribution
• Sometimes we don't know the equivalent of the number of drawings
– Example: Geiger counter
– Sharp events occurring in a (time) continuum
• What distribution do we expect in a measurement over a fixed amount of time?
– Divide the time interval into n finite chunks
– Take the binomial formula with p = λ/n and let n → ∞:

  $P(r;\,\lambda/n,\,n) = \frac{n!}{r!\,(n-r)!}\,\frac{\lambda^r}{n^r}\left(1-\frac{\lambda}{n}\right)^{n-r}$

  $\lim_{n\to\infty}\frac{n!}{r!\,(n-r)!} = \frac{n^r}{r!}, \qquad \lim_{n\to\infty}\left(1-\frac{\lambda}{n}\right)^{n-r} = e^{-\lambda}$

  $\Rightarrow\quad P(r;\lambda) = \frac{e^{-\lambda}\,\lambda^r}{r!}$ — the Poisson distribution
Wouter Verkerke, UCSB
Properties of the Poisson distribution
[Figure: Poisson distributions for λ = 0.1, 0.5, 1, 2, 5, 10, 20, 50, 200]
Wouter Verkerke, UCSB
More properties of the Poisson distribution
  $P(r;\lambda) = \frac{e^{-\lambda}\,\lambda^r}{r!}$

• Mean and variance: $\langle r\rangle = \lambda$, $V(r) = \lambda$, $\sigma = \sqrt{\lambda}$

• The convolution of two Poisson distributions is also a Poisson distribution, with λAB = λA + λB:

  $P(r) = \sum_{r_A=0}^{r} P(r_A;\lambda_A)\,P(r-r_A;\lambda_B)$
  $\quad = e^{-(\lambda_A+\lambda_B)} \sum_{r_A=0}^{r} \frac{\lambda_A^{r_A}\,\lambda_B^{\,r-r_A}}{r_A!\,(r-r_A)!}$
  $\quad = e^{-(\lambda_A+\lambda_B)}\,\frac{(\lambda_A+\lambda_B)^r}{r!}\,\sum_{r_A=0}^{r} \frac{r!}{r_A!\,(r-r_A)!}\left(\frac{\lambda_A}{\lambda_A+\lambda_B}\right)^{r_A}\left(\frac{\lambda_B}{\lambda_A+\lambda_B}\right)^{r-r_A}$
  $\quad = e^{-(\lambda_A+\lambda_B)}\,\frac{(\lambda_A+\lambda_B)^r}{r!}$

  (the remaining sum is a binomial sum, which equals 1)
Wouter Verkerke, UCSB
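An illustrative numerical check of the convolution property, not from the slides; the two means below are made up:

import numpy as np

# The sum of two independent Poisson variables with means lamA and lamB
# behaves like a single Poisson variable with mean lamA + lamB.
rng = np.random.default_rng(seed=2)
lamA, lamB = 3.0, 5.0
r = rng.poisson(lamA, 100_000) + rng.poisson(lamB, 100_000)

print("mean    :", r.mean(), " expected", lamA + lamB)
print("variance:", r.var(),  " expected", lamA + lamB)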
Basic Distributions – The Gaussian distribution
• Look at the Poisson distribution in the limit of large λ:

  $P(r;\lambda) = e^{-\lambda}\frac{\lambda^r}{r!}$

  Take the log, substitute r = λ + x, and use Stirling's approximation
  $\ln(r!) \approx r\ln r - r + \ln\sqrt{2\pi r}$:

  $\ln P(r;\lambda) = -\lambda + r\ln\lambda - (r\ln r - r) - \ln\sqrt{2\pi r}$
  $\quad = -\lambda + r\ln\lambda - r\ln\!\left(\lambda\left(1+\tfrac{x}{\lambda}\right)\right) + (\lambda + x) - \ln\sqrt{2\pi\lambda}$

  Using $\ln(1+z) \approx z - z^2/2$:

  $\quad \approx x - (\lambda + x)\left(\frac{x}{\lambda} - \frac{x^2}{2\lambda^2}\right) - \ln\sqrt{2\pi\lambda}$
  $\quad \approx -\frac{x^2}{2\lambda} - \ln\sqrt{2\pi\lambda}$

  Taking the exponential gives the familiar Gaussian distribution
  (the approximation is reasonable for λ > 10):

  $P(x) = \frac{e^{-x^2/2\lambda}}{\sqrt{2\pi\lambda}}$

  [Figure: Poisson vs. Gaussian for λ = 1, 10, 200]
Wouter Verkerke, UCSB
Properties of the Gaussian distribution
  $P(x;\,\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-\mu)^2/2\sigma^2}$

• Mean and variance

  $\langle x\rangle = \int_{-\infty}^{+\infty} x\,P(x;\mu,\sigma)\,dx = \mu$
  $V(x) = \int_{-\infty}^{+\infty} (x-\mu)^2\,P(x;\mu,\sigma)\,dx = \sigma^2$

• Integrals of the Gaussian
  – 68.27% within ±1σ      90%  ↔ ±1.645σ
  – 95.45% within ±2σ      95%  ↔ ±1.96σ
  – 99.73% within ±3σ      99%  ↔ ±2.58σ
                            99.9% ↔ ±3.29σ
Wouter Verkerke, UCSB
Errors
• Doing an experiment  making measurements
• Measurements not perfect  imperfection quantified in
resolution or error
• Common language to quote errors
– Gaussian standard deviation = sqrt(V(x))
– 68% probability that the true value is within the quoted errors
[NB: 68% interpretation relies strictly on Gaussian sampling distribution,
which is not always the case, more on this later]
• Errors are usually Gaussian if they quantify a result that
is based on many independent measurements
Wouter Verkerke, UCSB
The Gaussian as ‘Normal distribution’
• Why are errors usually Gaussian?
• The Central Limit Theorem says
– If you take the sum X of N independent measurements xi, each taken from a distribution with mean mi and variance Vi = σi², then the distribution of X
  (a) has expectation value $\langle X\rangle = \sum_i m_i$
  (b) has variance $V(X) = \sum_i V_i = \sum_i \sigma_i^2$
  (c) becomes Gaussian as N → ∞
– Small print: the tails converge very slowly in the CLT; be careful in assuming a Gaussian shape beyond 2σ
Wouter Verkerke, UCSB
Demonstration of Central Limit Theorem
• 5000 numbers taken at random from a uniform distribution between [0,1]  (N=1)
  – Mean = 1/2, variance = 1/12
• 5000 numbers, each the sum of 2 random numbers, i.e. X = x1 + x2  (N=2)
  – Triangular shape
• Same for 3 numbers, X = x1 + x2 + x3  (N=3)
• Same for 12 numbers (N=12); the overlaid curve is the exact Gaussian distribution
  (a numerical sketch of this demonstration follows below)
Wouter Verkerke, UCSB
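A minimal sketch of the demonstration above (sample size and seed are arbitrary); sums of N uniform [0,1] numbers approach a Gaussian with mean N/2 and variance N/12:

import numpy as np

rng = np.random.default_rng(seed=3)

for n in (1, 2, 3, 12):
    x = rng.random((5000, n)).sum(axis=1)
    print(f"N={n:2d}  mean={x.mean():.3f} (exp {n/2:.3f})  "
          f"var={x.var():.3f} (exp {n/12:.3f})")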
Central Limit Theorem – repeated measurements
• Common case 1: repeated identical measurements, i.e. mi = m, σi = σ for all i

  C.L.T.:
  $\langle X\rangle = \sum_i m_i = N m \;\Rightarrow\; \langle\bar{x}\rangle = \frac{\langle X\rangle}{N} = m$
  $V(\bar{x}) = \frac{1}{N^2}\sum_i V_i(X) = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N}$
  $\Rightarrow\; \sigma(\bar{x}) = \frac{\sigma}{\sqrt{N}}$

 Famous sqrt(N) law
Wouter Verkerke, UCSB
Central Limit Theorem – repeated measurements
• Common case 2: repeated measurements with identical means but different errors
  (i.e. weighted measurements, mi = m)

  $\bar{x} = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}$   (weighted average)

  $V(\bar{x}) = \frac{1}{\sum_i 1/\sigma_i^2} \;\Rightarrow\; \sigma(\bar{x}) = \frac{1}{\sqrt{\sum_i 1/\sigma_i^2}}$

  'Sum-of-weights' formula for the error on weighted measurements (see the sketch below)
Wouter Verkerke, UCSB
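A minimal sketch of the weighted-average formulas above; the measurements and errors below are made up:

import numpy as np

x     = np.array([10.1,  9.7, 10.4])   # hypothetical measurements
sigma = np.array([ 0.3,  0.5,  0.2])   # hypothetical per-measurement errors

w    = 1.0 / sigma**2
xbar = np.sum(w * x) / np.sum(w)       # weighted average
err  = 1.0 / np.sqrt(np.sum(w))        # sum-of-weights error
print(f"weighted average = {xbar:.3f} +/- {err:.3f}")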
Error propagation – one variable
• Suppose we have f(x) = ax + b
• How do you calculate V(f) from V(x)?

  $V(f) = \langle f^2\rangle - \langle f\rangle^2$
  $\quad = \langle(ax+b)^2\rangle - \langle ax+b\rangle^2$
  $\quad = a^2\langle x^2\rangle + 2ab\langle x\rangle + b^2 - a^2\langle x\rangle^2 - 2ab\langle x\rangle - b^2$
  $\quad = a^2\left(\langle x^2\rangle - \langle x\rangle^2\right) = a^2\,V(x)$

  i.e. σf = |a| σx

• More generally:

  $V(f) = \left(\frac{df}{dx}\right)^2 V(x)\;; \qquad \sigma_f = \left|\frac{df}{dx}\right|\sigma_x$

– But this is only valid if the linear approximation is good over the range of the error
Wouter Verkerke, UCSB
Error Propagation – Summing 2 variables
• Consider f = ax + by + c

  $V(f) = a^2\left(\langle x^2\rangle - \langle x\rangle^2\right) + b^2\left(\langle y^2\rangle - \langle y\rangle^2\right) + 2ab\left(\langle xy\rangle - \langle x\rangle\langle y\rangle\right)$
  $\quad = a^2 V(x) + b^2 V(y) + 2ab\,\mathrm{cov}(x,y)$

  The familiar 'add errors in quadrature' rule is only valid in the absence of correlations, i.e. cov(x,y) = 0

• More generally

  $V(f) = \left(\frac{df}{dx}\right)^2 V(x) + \left(\frac{df}{dy}\right)^2 V(y) + 2\,\frac{df}{dx}\frac{df}{dy}\,\mathrm{cov}(x,y)$

  $\sigma_f^2 = \left(\frac{df}{dx}\right)^2 \sigma_x^2 + \left(\frac{df}{dy}\right)^2 \sigma_y^2 + 2\,\frac{df}{dx}\frac{df}{dy}\,\rho\,\sigma_x\sigma_y$

  – The correlation coefficient ρ ∈ [−1,+1] is 0 if x,y are uncorrelated
  – Again, this is only valid if the linear approximation is good over the range of the error
Wouter Verkerke, UCSB
Error propagation – multiplying, dividing 2 variables
• Now consider f = x·y

  $V(f) = y^2 V(x) + x^2 V(y) \;\Rightarrow\; \left(\frac{\sigma_f}{f}\right)^2 = \left(\frac{\sigma_x}{x}\right)^2 + \left(\frac{\sigma_y}{y}\right)^2$

  – The result is similar for f = x/y (math omitted)

• Other useful formulas

  $\frac{\sigma_{1/x}}{1/x} = \frac{\sigma_x}{x}$   (the relative error on x and on 1/x is the same)

  $\sigma_{\ln x} = \frac{\sigma_x}{x}$   (the error on a log is just the fractional error)
Wouter Verkerke, UCSB
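An illustrative Monte Carlo check of the propagation formula for f = x·y with uncorrelated Gaussian inputs; it is not from the slides and all numbers are made up:

import numpy as np

rng = np.random.default_rng(seed=4)
x0, sx = 5.0, 0.2     # hypothetical mean and error of x
y0, sy = 3.0, 0.1     # hypothetical mean and error of y

x = rng.normal(x0, sx, 1_000_000)
y = rng.normal(y0, sy, 1_000_000)
f = x * y

rel_prop = np.hypot(sx / x0, sy / y0)   # relative errors added in quadrature
print("sampled relative error   :", f.std() / f.mean())
print("propagated relative error:", rel_prop)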
Event classification
– Comparing discriminating variables
– Choosing the optimal cut
– Working in more than one dimension
– Approximating the optimal discriminant
– Techniques: principal component analysis, Fisher discriminant, neural network, probability density estimate, empirical modeling
Wouter Verkerke, UCSB
Introduction to event classification
• Most HEP analyses involve classification of events into 'signal' and 'background'
– Statistics connection: hypothesis testing
– Determine the consistency of each event with the signal and background hypotheses
– There is no certain answer on signal-vs-background classification for each event, but rather a probability

  [Figure: background hypothesis = flat distribution in x, signal hypothesis = Gaussian distribution in x; events near the peak have P(sig)≈50%, events far away have P(sig)≈0%]

– In simple cases like the above example there is no need to make a cut on the signal probability
– Go straight to parameter estimation: what is the number of signal events?
Multivariate event classification
• Many HEP problems are much more difficult than the preceding example
– Many observables, overwhelming background
– Various approaches possible
• Analyses perform various combinations of two techniques
– Selection of events (throw out events that are not very signal-like)
– Parameter estimation (assign a signal probability to each event → determine the total number of signal events in the sample)
• "Cut and count" (or fit)
– First make a preselection of events that are most 'signal-like'
– Then do parameter estimation or simple counting on those events
• "Compactification"
– Compactify the information in multiple observables into one observable (e.g. the output of a neural network) → 'test statistic t(x)'
– Do parameter estimation on the compactified representation
• "Big fit"
– Make an explicit model of the signal and background hypotheses in all observables and fit all data to estimate the model parameters (#signal events, Higgs mass, etc.)
Wouter Verkerke, NIKHEF
Merits of various approaches
• Statistical sensitivity
– Cutting usually throws away some information
– Compactification can lose some information, but not necessarily
– The big fit keeps all information
• Feasibility
– Cutting is usually easiest
– Compactification: recent developments in machine learning make this a lot easier
– The big fit is clearly the most ambitious
• What is the best you can do? Best (in this context) = smallest statistical error on the measurement
– Equivalent statement for a selection cut: lowest possible background efficiency at a given signal efficiency
What is the best you can do?
• Neyman-Pearson lemma:
– For a problem described by a continuous signal distribution f(x|s) and a continuous background distribution f(x|b), the optimal acceptance region is defined by a cut on the likelihood ratio, f(x|s)/f(x|b) > α
– Note that for continuous distributions you cannot always achieve perfect separation
• Translation for the 3 approaches
– Cut and count: the optimal cut is defined by the surface S(x)/B(x) > α
– Compactification: the optimal 1-D test statistic is t(x) = S(x)/B(x)
– Big fit: the optimal fit is when the model is M(x) = NS·S(x) + NB·B(x)
Wouter Verkerke, NIKHEF
Why Neyman-Pearson doesn’t always help
• The problem is that we usually don’t have explicit
formulae for the pdfs
– Instead we may have Monte Carlo models for signal and
background processes, so we can produce simulated data, and
enter each event into an n-dimensional histogram. Use e.g. M bins
for each of the n dimensions, for a total of M^n cells.
• But n is potentially large → prohibitively large number of cells to populate with Monte Carlo data.
– Compromise: make an Ansatz for the form of the test statistic with fewer parameters; determine them (e.g. using MC) to give the best discrimination between signal and background.
Wouter Verkerke, NIKHEF
Using test statistics to describe the selection
• All decision boundaries ('cuts') can be expressed as t(x) > α, where t(x) is a test statistic and α is a parameter
• The goal is a test statistic as close as possible to t(x) = S(x)/B(x)

  – Rectangular cut: $t(\vec{x}) = \theta(x_j - c_j)\,\theta(x_i - c_i)$
  – Linear cut: $t(\vec{x}) = a_j x_j + a_i x_i$
  – Non-linear cut: $t(\vec{x}) = \vec{a}\cdot\vec{x} + \vec{x}^T A\,\vec{x} + \ldots$

  [Figure: rectangular, linear and non-linear decision boundaries separating H0 from the accepted H1 region]

• Two separate questions
– What is the optimal form of t(x)? – Independent of the ratio Nsig/Nbkg
– What is the best value of α? – Depends on the ratio Nsig/Nbkg
Constructing test statistics – Linear discriminants
• A linear discriminant constructs t(x) from a linear combination of the variables xi:

  $t(\vec{x}) = \sum_{i=1}^{N} a_i x_i = \vec{a}\cdot\vec{x}$

– Optimize the discriminant by choosing the ai to maximize the separation between signal and background
• The most common form of the linear discriminant is the Fisher discriminant

  $F(\vec{x}) = \left(\vec{\mu}_S - \vec{\mu}_B\right)^T V^{-1}\,\vec{x}$

  where μS, μB are the mean values of the xi for signal and background, and V⁻¹ is the inverse of the variance matrix of signal/background (assumed to be the same)

  R.A. Fisher, Ann. Eugen. 7 (1936) 179
Wouter Verkerke, UCSB
Ansatz test statistics – The Fisher discriminant
  $F(\vec{x}) = \left(\vec{\mu}_S - \vec{\mu}_B\right)^T V^{-1}\,\vec{x}$

  where μS, μB are the mean values of the xi for signal and background, and V⁻¹ is the inverse of the variance matrix of signal/background (assumed to be the same)

  R.A. Fisher, Ann. Eugen. 7 (1936) 179
• Advantage of Fisher Discriminant:
– Ingredients ms,mb,V can all be calculated directly from
data or simulation samples. No ‘training’ or ‘tuning’
• Disadvantages of Fisher Discriminant
– Fisher discriminant only exploits difference in means.
– If signal and background have different variance, this information
is not used.
Wouter Verkerke, UCSB
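A minimal sketch of the Fisher construction on toy Gaussian samples, not the lecture's code; all means, covariances and sample sizes are made up:

import numpy as np

# Fisher discriminant F(x) = (mu_S - mu_B)^T V^-1 x on toy data.
rng = np.random.default_rng(seed=5)
sig = rng.multivariate_normal([1.0, 0.5], [[1.0, 0.3], [0.3, 1.0]], 5000)
bkg = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], 5000)

# Ingredients taken directly from the samples: means and (common) covariance.
mu_s, mu_b = sig.mean(axis=0), bkg.mean(axis=0)
V = 0.5 * (np.cov(sig, rowvar=False) + np.cov(bkg, rowvar=False))
a = np.linalg.solve(V, mu_s - mu_b)        # Fisher coefficients

print("mean F (signal)    :", (sig @ a).mean())
print("mean F (background):", (bkg @ a).mean())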
Example of Fisher discriminant
• The “CLEO” Fisher discriminant
– Goal: distinguish between e+e− → Υ(4S) → bb̄ and continuum uū, dd̄, ss̄, cc̄ events
– Method: measure the energy flow in 9 concentric cones around the direction of the B candidate

  [Figure: the 9 cone energy flows used as inputs; distributions of the energy flow in bb̄ and in u,d,s,c events, and the resulting F(x) output]
When is the Fisher discriminant the optimal discriminant?
• A very simple dataset

  $S = \prod_i \mathrm{Gauss}(x_i;\,\mu_i^S,\sigma_i), \qquad B = \prod_i \mathrm{Gauss}(x_i;\,\mu_i^B,\sigma_i)$

  Multivariate Gaussian distributions with different means but the same width for signal and background
• Fisher is the optimal discriminant for this case
– In this case we can also directly relate F(x) to the absolute signal probability

  $P(F) = \frac{1}{1 + e^{-F}}$   ('logistic sigmoid function')
Wouter Verkerke, UCSB
Non-linear test statistics
• The optimal decision boundary may not be a hyperplane → non-linear test statistic
• Large variety of Ansatzes
– Neural networks
– Support vector machines
– Rule ensembles
– Decision trees
– Kernel density estimates
– Most of these test statistics t(x,p) have many free parameters p that need to be tuned to maximize their performance
• Unlike the Fisher discriminant, no analytical calculation is possible → computational solution
– Machine learning
– Bayesian learning
Wouter Verkerke, NIKHEF
Common approximations made in machine learning
• The machine learning approach is the most popular for training multivariate non-linear test statistics
• Approximation: knowledge of the true signal and background distributions is replaced by samples of signal and background events
– Finite statistics limit the precision (in itself usually not a problem)
• Risk of overtraining
– For finite datasets it is always possible to construct a 'perfect' discriminant (i.e. better than the Neyman-Pearson optimum for the true signal and background)
– E.g. tiny rectangular box cuts around each signal event will do the job
– Control overtraining by measuring the performance of the test statistic on an independent 'test sample'

  [Figure: performance vs. training iteration for the training sample and an independent test sample]

• Never train on data! (unless it is a control sample)
– Overtraining on data → better selection efficiency than on the simulated events that are used to calculate the efficiency → bias towards positive fluctuations
Wouter Verkerke, NIKHEF
Multivariate data selection – Neural networks
• Neural networks are used in neurobiology, pattern recognition, financial forecasting (and also HEP)

  $N(\vec{x}) = s\!\left(a_0 + \sum_i a_i x_i\right)$

  s(t) is the activation function, usually a logistic sigmoid:

  $s(t) = \frac{1}{1 + e^{-t}}$

• This formula corresponds to the 'single layer perceptron'
– Visualization of the single-layer network topology: inputs x1…xN feeding a single output node N(x)
– Since the activation function s(t) is monotonic, the single-layer N(x) is equivalent to the Fisher discriminant F(x)
Wouter Verkerke, UCSB
Neural networks – general structure
• The single-layer model can easily be generalized to a multilayer perceptron:

  $N(\vec{x}) = s\!\left(a_0 + \sum_{i=1}^{m} a_i h_i(\vec{x})\right), \qquad h_i(\vec{x}) = s\!\left(w_{i0} + \sum_{j=1}^{n} w_{ij} x_j\right)$

  with ai and wij the weights (connection strengths) of the hidden layer
– Easy to generalize to an arbitrary number of layers
– Feed-forward net: the values of a node depend only on earlier layers (usually only on the preceding layer) – 'the network architecture'
– More nodes bring N(x) closer to the optimal t(x) = S(x)/B(x), but with many more parameters to be determined
Wouter Verkerke, UCSB
Neural networks – Training
• The parameters of a NN are usually determined by minimizing the error function

  $e = \int \left(N(\vec{x}) - 0\right)^2 B(\vec{x})\,d\vec{x} \;+\; \int \left(N(\vec{x}) - 1\right)^2 S(\vec{x})\,d\vec{x}$

  where 0 and 1 are the NN target values for background and signal
• Same principle as the Fisher discriminant, but it cannot be solved analytically in the general case (a toy training sketch follows below)
– In practice, replace e with averages over training data from MC (adjusting the parameters = 'learning')
– Generally difficult, but many programs exist to do this for you ('error back-propagation' technique most common)
Wouter Verkerke, UCSB
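A toy sketch of this idea (not the lecture's code): a single-layer perceptron trained by plain gradient descent on the error function, with the integrals replaced by averages over made-up MC samples:

import numpy as np

rng = np.random.default_rng(seed=6)
sig = rng.normal([1.0, 0.5], 1.0, size=(2000, 2))   # toy signal sample
bkg = rng.normal([0.0, 0.0], 1.0, size=(2000, 2))   # toy background sample

x = np.vstack([sig, bkg])
target = np.concatenate([np.ones(len(sig)), np.zeros(len(bkg))])

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

a0, a = 0.0, np.zeros(2)
lr = 0.5
for _ in range(500):
    out = sigmoid(a0 + x @ a)
    grad = (out - target) * out * (1.0 - out)       # d(error)/d(activation)
    a0 -= lr * grad.mean()
    a  -= lr * (x * grad[:, None]).mean(axis=0)

print("trained weights:", a0, a)
print("mean N(x) signal    :", sigmoid(a0 + sig @ a).mean())
print("mean N(x) background:", sigmoid(a0 + bkg @ a).mean())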
Neural networks – training example
[Figure: 9 input variables (cosQHB, cosQ*B, cosQthr, cosQHD, Fisher, QhemiDiff, ln|DOCAK|, m(Kl), QBSQobK) feeding one output variable N(x); the signal and background MC output distributions are well separated]
Wouter Verkerke, UCSB
Practical aspects of Neural Net training
• Choose input variables sensibly
– Don’t include badly understood observables (such as #tracks/evt)
– Some input variables may be highly correlated  drop or
decorrelate
– Some input variables may contain little or no discriminating power
 drop them
– Transform strongly peaked distributions into smooth ones (rarity
transforms / Gaussianization)
– Fewer inputs  fewer parameters to be adjusted  parameters
better determined for finite training data
• Choose architecture sensibly
– No ‘rules’ for number of hidden layers, nodes
– Usually better to start simple and gradually increase complexity and see how that pays off
• Verify sensible behavior
– NNs are not magic; understand what your trained NN is doing
Wouter Verkerke, UCSB
Improving performance of non-linear test statistics
• Strong correlations may adversely impact the training and performance of non-linear test statistics
– Extreme example with rectangular cut optimization, but other algorithms are also affected to a lesser degree

  [Figure: scenario with uncorrelated X,Y in signal and background vs. scenario with strongly correlated X,Y in signal; in the correlated case additional background could have been rejected at no cost with a differently shaped cut]

• Need a different approach…
Wouter Verkerke, NIKHEF
Decorrelation of input variables – two ways
• Removal of linear correlations by rotating the input variables
– Cholesky decomposition: determine the 'square root' C′ of the covariance matrix C, i.e. C = C′C′ᵀ
– Transform the original variables x into the decorrelated variable space x′ by x′ = C′⁻¹x
• Principal component analysis (see the sketch below)
1) Compute the covariance matrix Cov(X)
2) Compute its eigenvalues λi and eigenvectors vi
3) Construct the rotation matrix T = Col(vi)ᵀ
4) Finally calculate ui = T xi
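A minimal numpy sketch of the PCA steps on toy correlated Gaussian data (the toy covariance and sample size are made up):

import numpy as np

rng = np.random.default_rng(seed=7)
x = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], 10_000)

cov = np.cov(x, rowvar=False)                 # 1) covariance matrix
eigval, eigvec = np.linalg.eigh(cov)          # 2) eigenvalues / eigenvectors
T = eigvec.T                                  # 3) rotation matrix
u = x @ T.T                                   # 4) rotated (decorrelated) variables

print("covariance of u:\n", np.cov(u, rowvar=False).round(3))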
Example of decorrelation
[Figure: original distribution, after Cholesky (√C) decorrelation, and after principal component analysis]
• Note that the decorrelation is only complete if
– the correlations are linear
– the input variables are Gaussian distributed
– Not a very accurate assumption in general (but valid for the above example)
Wouter Verkerke, NIKHEF
Gaussianization
• Decorrelation can be improved by applying a transformation to each observable that results in a Gaussian distribution
– Can Gaussianize either the signal or the background sample (not both…)
• Two-step transformation (see the sketch below)
– First apply the rarity transform, which creates a uniform distribution:

  $x_k^{\mathrm{flat}}(i_{\mathrm{event}}) = \int_{-\infty}^{x_k(i_{\mathrm{event}})} p_k(x_k')\,dx_k', \qquad k \in \{\text{variables}\}$

  where xk(ievent) is the measured value and pk is the PDF of variable k
– Second, make it Gaussian via the inverse error function, $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$:

  $x_k^{\mathrm{Gauss}}(i_{\mathrm{event}}) = \sqrt{2}\,\mathrm{erf}^{-1}\!\left(2\,x_k^{\mathrm{flat}}(i_{\mathrm{event}}) - 1\right), \qquad k \in \{\text{variables}\}$
Wouter Verkerke, NIKHEF
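A sketch of the two-step Gaussianization, using the empirical CDF of a toy exponential sample as an estimate of the rarity transform (the toy distribution is an assumption for illustration):

import numpy as np
from scipy.special import erfinv

rng = np.random.default_rng(seed=8)
x = rng.exponential(scale=2.0, size=10_000)    # strongly peaked toy observable

# Step 1: rarity transform -> approximately uniform in (0,1)
ranks  = np.argsort(np.argsort(x))
x_flat = (ranks + 0.5) / len(x)

# Step 2: inverse error function -> approximately standard Gaussian
x_gauss = np.sqrt(2.0) * erfinv(2.0 * x_flat - 1.0)

print("mean, std of Gaussianized variable:", x_gauss.mean(), x_gauss.std())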
Decision Trees
• A Decision Tree encodes sequential rectangular cuts
– But with a lot of underlying theory on training and optimization
– Machine-learning technique, widely used in social sciences
– L. Breiman et al., “Classification and Regression Trees” (1984)
• Basic principle
– Extend cut-based selection
– Try not to rule out events failing a particular criterion
– Keep events rejected by one criterion and see whether other criteria
could help classify them properly
Wouter Verkerke, NIKHEF
Building a tree – splitting the data
• Essential operation : splitting the data in 2 groups using
a single cut, e.g. HT<242
– Need to find ‘best cut’ as quantified through
best separation of signal and background
i.e. most signal passes, most background fails
(requires some metric to quantify this)
– Procedure:
1) Find cut value with best separation for each observable
2) Apply only cut on observable that results in best separation
Wouter Verkerke, NIKHEF
Building a tree – recursive splitting
• Repeat splitting procedure on sub-samples of previous split
• Requires algorithm to decide when to stop splitting
• Output of the decision tree:
– true/false, or
– a probability based on the expected purity of the leaf, s/(s+b)
Parameters in the construction of a decision tree
• Normalization of signal and background before training
– Usually same total weight for signal and background events
• In the selection of splits
– list of questions (vari < cuti) to consider
– Separation metric (quantifies how good the split is)
• Decision to stop splitting (declare a node terminal)
– Minimum leaf size (e.g. 100 events)
– Insufficient improvement from splitting
– Perfect classification (all events in leaf belong to same class)
• Assignment of terminal node to a class
– Usually: purity>0.5 = signal, purity<0.5 = background
Wouter Verkerke, NIKHEF
Separation metric – The impurity function
• Introduce an impurity function i(t) to quantify the (im)purity of a sample
– maximum impurity for an equal mix of S and B
– simplest option: i(t) = misclassification rate
– strictly concave → rewards purer nodes
– favors end cuts with one smaller node and one larger node
• Definition of the optimal split: minimize i(t) on the output samples of the split
– Metric = decrease of impurity (increase of purity) for a split s of a node t → tL, tR:

  $\Delta i(s,t) = i(t) - p_L\,i(t_L) - p_R\,i(t_R)$

  (impurity of the sample before the split, minus the weighted impurities of the 'left' and 'right' samples)
– Aim: find the split that results in the largest Δi(s,t)
• Stop splitting when
– there is not enough improvement (introduce a cutoff parameter β on Δi)
– there are not enough statistics in the sample
– the node is pure (signal or background)
Wouter Verkerke, NIKHEF
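A toy sketch of the split-selection step described above, using the Gini index as the impurity function (the impurity choice and all numbers are illustrative assumptions, not taken from the slides):

import numpy as np

rng = np.random.default_rng(seed=9)
x = np.concatenate([rng.normal( 1.0, 1.0, 500),    # signal-like values
                    rng.normal(-1.0, 1.0, 500)])   # background-like values
is_sig = np.concatenate([np.ones(500, bool), np.zeros(500, bool)])

def gini(sig_mask):
    if len(sig_mask) == 0:
        return 0.0
    p = sig_mask.mean()
    return 2.0 * p * (1.0 - p)

i_before = gini(is_sig)
best = None
for cut in np.linspace(x.min(), x.max(), 200):
    left, right = is_sig[x < cut], is_sig[x >= cut]
    p_left = len(left) / len(x)
    delta = i_before - p_left * gini(left) - (1 - p_left) * gini(right)
    if best is None or delta > best[1]:
        best = (cut, delta)

print(f"best cut at x = {best[0]:.2f}, impurity decrease = {best[1]:.3f}")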
Pruning
• Why prune a tree?
– Overtraining → it is possible to get a perfect classifier on training events
– E.g. a tree with only one class per leaf (down to 1 event per leaf if necessary)
– Pruning: eliminate sub-trees (branches) that seem too specific to the training sample: a node and all its descendants turn into a leaf
• How?
– Expected error pruning (prune a node if the result from the tree with the node pruned is consistent within error with the original tree)
– Statistical error estimate with binomial error √(p(1−p)/N) for a node with purity p and N training events
– No need for a testing sample
– Cost/complexity pruning
– Idea: penalize "complex" trees (many nodes/leaves) and find a compromise between a good fit to the training data (larger tree) and good generalization properties (smaller tree)
Concrete example of Decision Tree
[Figure: example decision tree, with numbered cut nodes carving the 2-D plane into signal and background regions]
Wouter Verkerke, NIKHEF
Decision Tree score card
Pros:
• Training is fast
• Human readable (not a black box)
• Deals with continuous and discrete variables simultaneously
• No need to transform inputs; resistant to irrelevant variables
– Irrelevant variables are not used; the order of training events is irrelevant
• Works well with many variables
– Ranking of a variable xi: sum the impurity decrease at each node where xi is used → largest decrease of impurity = best variable
• Very few parameters
• For some time still "original" in HEP
Cons:
• Good variables can be masked
– xj may be just a little worse than xi but will never be picked → xj is ranked as irrelevant
– But remove xi and xj becomes very relevant (a solution exists: surrogate splits, not covered here)
• Unstable tree structure
– Small changes in the sample can lead to very different tree structures; easy to overtrain
• Piecewise nature of the output (not a continuous distribution)
– The output distribution is discrete by nature
– Granularity is related to tree complexity; tendency to have spikes at certain purity values (or just two delta functions at ±1 if not using purity)
– Solution available: averaging over multiple decision trees (boosting)
Wouter Verkerke, NIKHEF
A brief history of boosting
• First provable algorithm by
Schapire (1990)
– Train classifier T1 on N events
– Train T2 on new N-sample,
half of which misclassified by T1
– Build T3 on events where T1 and T2 disagree
– Boosted classifier: MajorityVote(T1,T2,T3)
• Then
– Variation by Freund (1995): boost by majority (combining many
learners with fixed error rate)
– Freund & Schapire joined forces: 1st functional model AdaBoost
(1996)
• Recently in HEP
– MiniBooNe compared performance of different boosting algorithms and
neural networks for particle ID (2005)
– D0 claimed first evidence for single top quark production (2006), CDF followed (2008)
Wouter Verkerke, NIKHEF
Principles of boosting
• Principles of boosting
– General method, not limited to decision trees
– Hard to make a very good learner, but easy to make simple,
error-prone ones (but still better than random guessing)
– Goal: combine such weak classifiers into a new more stable one,
with smaller error
• General algorithm (schematic not reproduced here)
Wouter Verkerke, NIKHEF
AdaBoost
• AdaBoost = Adaptive Boosting (Freund & Schapire '96)
– The learning procedure adjusts to the training data to classify it better
– Many variations on the same theme for the actual implementation
– Most common boosting algorithm around
• Schematic view of the iterative algorithm
– Calculate the misclassification rate for tree K ("weighted average of isMisclassified over all training events")
– Derive the weight of tree K
– Increase the weight of misclassified events in Sample(k) to create Sample(k+1)
– Train T(k+1) on Sample(k+1)
– Boosted result: weighted average of the trees by their performance (a weight-update sketch follows after the example below)
Wouter Verkerke, NIKHEF
AdaBoost by example
• So-so classifier (Error rate = 40%)
– Misclassified events get their weight multiplied by exp(0.4)=1.5
– Next tree will have to work a bit harder on these events
• Good classifier (Error rate = 5%)
– Misclassified events get their weight multiplied by exp(2.9)=19 (!!)
– Being failed by a good classifier means a big penalty: must be a
difficult case
– Next tree will have to pay much more attention to this event and try to
get it right
• Overtraining in boosting
– The misclassification rate on the training sample is bounded and falls to zero for sufficiently large N(tree)
– Corollary: the training data is overfitted
– The error rate on the test sample may reach a minimum and then potentially rise. Stop boosting at the minimum.
– In principle AdaBoost must overfit the training sample
Wouter Verkerke, NIKHEF
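A small sketch of the AdaBoost bookkeeping behind the two example numbers above, using the generic tree-weight formula α = ln((1 − err)/err); this is an illustration, not the lecture's code:

import math

def tree_weight(err):
    # alpha_K = ln((1 - err) / err) for a tree with misclassification rate err
    return math.log((1.0 - err) / err)

for err in (0.40, 0.05):                       # the "so-so" and "good" classifiers
    alpha = tree_weight(err)
    print(f"error rate {err:.2f}: alpha = {alpha:.2f}, "
          f"misclassified events get weight x exp(alpha) = {math.exp(alpha):.1f}")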
Example of Boosting
[Figure: successive weak classifiers T0(x,y) … T4(x,y) and the boosted combination]

  $B(x,y) = \sum_{i=0}^{4} \alpha_i\,T_i(x,y)$
Wouter Verkerke, NIKHEF
Example of BDT analysis of single top quark in D0
Wouter Verkerke, NIKHEF
Support Vector Machines
• Find the hyperplane that best separates signal from background
– Best separation: maximum distance (margin) between the closest events (the support vectors) and the hyperplane
– The linear decision boundary is defined by the solution of a Lagrangian

  [Figure: separable data in (x1, x2) with support vectors and margin]

• For non-separable data, add a misclassification cost
– add a misclassification cost parameter C·Σᵢξᵢ to the minimization function
– The solution of the Lagrangian only depends on the inner products of the support vectors

  [Figure: non-separable data with events inside the margin]
Wouter Verkerke, NIKHEF
Support Vector Machines
• Non-linear cases
– Transform the variables into a higher dimensional feature space, (x,y) → (x,y,z=f(x,y)), where again a linear boundary (hyperplane) can separate the data
– Explicit basis functions are not required: use kernel functions to approximate the scalar products between transformed vectors in the higher dimensional feature space
– Choose a kernel and find the hyperplane using the linear techniques developed above

  [Figure: non-separable 2-D data mapped to a 3-D feature space where a hyperplane separates the classes]
Wouter Verkerke, NIKHEF
Probability density estimates
• Instead of constructing a test statistic t(x) using machine learning…
• You can also try to model S(x) and B(x) individually and construct a test statistic as t(x) = S(x)/B(x)
• Training- and parameter-free approach: probability density estimates from MC samples (see the sketch below)
– Idea (1-dim): represent each event of your MC sample as a Gaussian probability distribution
– Add the probability distributions from all events in the sample

  [Figure: sample of events → Gaussian probability distribution for each event → summed probability distribution for all events in the sample]
Wouter Verkerke, NIKHEF
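A minimal sketch of a static-kernel density estimate of this kind; the toy sample and the fixed kernel width below are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(seed=10)
events = rng.normal(0.0, 1.0, size=200)      # toy MC sample
width = 0.3                                  # fixed kernel width

def pde(x, sample, h):
    # Sum of normalized Gaussians of width h centred on each event.
    z = (x[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))

grid = np.linspace(-4, 4, 9)
print(np.round(pde(grid, events, width), 3))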
Probability Density Estimates – Adaptive Kernel
• Width of single event Gaussian can of course vary
– Width of Gaussian tradeoff between smoothness and ability to
describe small features
• Idea: ‘Adaptive kernel’ technique
– Choose wide Gaussian if local density of events is low
– Choose narrow Gaussian if local density of events is high
– Preserves small features in high-statistics areas, minimizes jitter in low-statistics areas

  [Figure: static kernel (width of all Gaussians identical) vs. adaptive kernel (width of each Gaussian depends on the local density of events)]
Wouter Verkerke, UCSB
Probability Density Estimates – Some examples
• Example 2-D signal and background distributions

  [Figure: theoretical distributions S(x,y), B(x,y) and D(x,y)=S(x,y)/B(x,y), compared with kernel estimates for N=1000 and N=10000]
Probability Density Estimates
• Also works in >2 dimensions
• Simplified approach also possible:
– count events in a (hyper)cube around (x)
• Advantages of PDE technique
– No training necessary
– Insightful / intuitive
• Disadvantages
– Prone to effects of low statistics
– Computationally very intense in application phase for multiple
dimensions
Wouter Verkerke, UCSB
Characterizing and comparing performance
• The performance of a test statistic is characterized by an ε(sig) vs. ε(bkg) curve
– A curve for the theoretical maximum performance can be added if the true S(x) and B(x) are known
– NB: this does not tell you where the optimal cut is

  [Figure: good-performance vs. bad-performance efficiency curves]
Wouter Verkerke, NIKHEF
Performance comparison of (boosted) decision trees
Wouter Verkerke, NIKHEF
Using the compactified output
• Using the output of a test statistic t(x)
– Fit data to a sum of templates M(x) = Nsig·TS(x) + Nbkg·TB(x)
• Optimal use of information, but possibly more sensitive to systematic uncertainties that influence the shapes of TS and TB
– Or select only events with t(x)>a and either count, or fit a distribution of another observable y that was not used in the construction of t(x)
• You throw away some information, but if the shapes of signal and background in y are well understood, you have smaller systematic uncertainties
– What is the optimal value of a?
– Need a 'Figure of Merit' to quantify this
Wouter Verkerke, NIKHEF
Figure of merit
• Common choices for the Figure of Merit: choose the position of the cut a where F(a) is maximal

  $F(a) = \frac{S(a)}{\sqrt{S(a)+B(a)}}$   ('measurement')

  $F(a) = \frac{S(a)}{\sqrt{B(a)}}$   ('discovery')

  where S(a) and B(a) are the number of signal and background events remaining after a cut a
– Note that the position of the optimum depends on a priori knowledge of the signal cross section
• Note that the above FOMs are 'traditional', not statistically rigorous
– Better calculations exist, see e.g. Punzi / PhyStat 2003
– Example: counting experiment where signal and background are Poisson processes, discovery at n sigma:

  $F(a) = \frac{\varepsilon(a)}{n/2 + \sqrt{B(a)}}$

  where ε(a) is the signal efficiency and B(a) the number of background events at a cut a
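A toy sketch of such a figure-of-merit scan for a simple one-dimensional cut; all yields and shapes below are made up:

import numpy as np

rng = np.random.default_rng(seed=11)
sig = np.abs(rng.normal(0.0, 1.0, 1000))     # toy signal |x| values
bkg = np.abs(rng.normal(0.0, 3.0, 10000))    # toy background |x| values

cuts = np.linspace(0.1, 5.0, 50)
fom = []
for c in cuts:                               # cut: accept |x| < c
    S = np.sum(sig < c)
    B = np.sum(bkg < c)
    fom.append(S / np.sqrt(S + B))

best = cuts[int(np.argmax(fom))]
print(f"S/sqrt(S+B) is maximal for the cut |x| < {best:.2f}")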
Using a figure of merit
• Application of the S/√(S+B) figure of merit to a simple one-dimensional cut |x| < C

  [Figure: simulated signal and background distributions in x; S/√(S+B) as a function of C for a small-background scenario (strongly peaked optimum) and a large-background scenario (shallow optimum)]
(Software Advertisement #1)
TMVA
Wouter Verkerke, NIKHEF
What is TMVA
ROOT is the analysis framework used by most (HEP) physicists
Idea: rather than just implementing new MVA techniques and making them available in ROOT (i.e., like TMultiLayerPerceptron does):
– Have one common platform / interface for all MVA classifiers
– Have common data pre-processing capabilities
– Train and test all classifiers on the same data sample and evaluate them consistently
– Provide a common analysis (ROOT scripts) and application framework
– Provide access with and without ROOT, through macros, C++ executables or python
Top Workshop, LPSC, Oct 18–20, 2007
A. Hoecker: Multivariate Analysis with TMVA
Limitations of TMVA
Development started beginning of 2006 – a mature but not a
final package
Known limitations / missing features
Performs classification only, and only in binary mode: signal versus background
Supervised learning only (no unsupervised “bump hunting”)
Relatively stiff design – not easy to mix methods, not easy to setup categories
Cross-validation not yet generalised for use by all classifiers
Optimisation of classifier architectures still requires tuning “by hand”
Work ongoing in most of these areas  see later in this talk
TMVA Content
Currently implemented classifiers
Rectangular cut optimisation
Projective and multidimensional likelihood estimator
k-Nearest Neighbor algorithm
Fisher and H-Matrix discriminants
Function discriminant
Artificial neural networks (3 multilayer perceptron impls)
Boosted/bagged decision trees
RuleFit
Support Vector Machine
Currently implemented data preprocessing stages:
Decorrelation
Principal Value Decomposition
Transformation to uniform and Gaussian distributions
(coming soon)
Using TMVA
A typical TMVA analysis consists of two main steps:
1. Training phase: training, testing and evaluation of classifiers using data samples with known signal and background composition
2. Application phase: using the selected trained classifiers to classify unknown data samples
Illustration of these steps with toy data samples
 TMVA tutorial
A Toy Example (idealized)
Use data set with 4 linearly correlated Gaussian distributed
variables:
----------------------------------------
Rank : Variable  : Separation
----------------------------------------
1    : var4      : 0.606
2    : var1+var2 : 0.182
3    : var3      : 0.173
4    : var1-var2 : 0.014
----------------------------------------
Preprocessing the Input Variables
Decorrelation of variables before training is useful for this
example
Note that in cases with non-Gaussian distributions and/or non-linear correlations, decorrelation may do more harm than good
Evaluating the Classifier Training (II)
Check for overtraining: classifier output for test and training
samples …
Evaluating the Classifier Training (V)
Optimal cut for each classifier …
Determine the optimal cut (working
point) on a classifier output
Receiver Operating Characteristics (ROC) Curve
Smooth background rejection versus signal
efficiency curve: (from cut on classifier output)
Example: Circular Correlation
• Illustrate the behavior of linear and nonlinear classifiers
Circular correlations
(same for signal and background)
The “Schachbrett” Toy
• Performance achieved without parameter tuning: PDERS and BDT are the best "out of the box" classifiers
• After specific tuning, SVM and MLP also perform well, approaching the theoretical maximum
Summary of the Classifiers and their Properties
[Table: each classifier (Cuts, Likelihood, PDERS / k-NN, H-Matrix, Fisher, MLP, BDT, RuleFit, SVM) rated against the criteria: performance with no/linear correlations, performance with nonlinear correlations, training speed, response speed, robustness against overtraining, robustness against weak input variables, curse of dimensionality, and transparency; the individual ratings did not survive extraction and are not reproduced here]
The properties of the Function Discriminant (FDA) depend on the chosen function
Do you trust your NN / BDT etc?
• Keep in mind that any trained NN / BDT / SVM is just a
parameterized decision boundary in n dimensions
– You can visualize it. It may be sub-optimal (or overtrained) but it
is really just a boundary.
• Better question: Do you believe that the signal and
background training samples that were used were an
accurate reflection of reality?
– In general, they don’t agree (perfectly) so performance should be
interpreted with some care
– Usually background is harder to get right than signal, because in
addition to any imperfections in the simulation process, you also
need to guess correctly which physics processes constitute your
background
– In particular, you should not assume that the efficiency of a cut
on your test statistic t(x)>a on simulated events is the same as in
data  Important source of systematic uncertainty in your final
result
Wouter Verkerke, NIKHEF
Measuring the performance of your t(x) on data
• Common solution for controlling uncertainties in the
performance of your t(x) is to measure it on data
– Several ways to do this, will highlight one here
• Keep one (uncorrelated) observable out of your test
statistic
– Example with NN here. Instead of using all observables with good
discriminating power, keep one out of your test statistic
– Choose one that is powerful and maximally uncorrelated with the
others (for both signal and background). This may be your ‘best’
observable
[Figure: network N(x) built from all observables x1…xN, vs. a network built without xN so that xN remains available for a fit]
Measuring the performance of your t(x) on data
• Analysis strategy
– Cut on optimal value on N(x) to reduce background as much as
possible
– Make a fit to distribution of xN to measure residual amount of
signal and background
• In the limit of zero correlation between N(x) and xN the
shape of the distribution is invariant under the cut on
N(x)
  [Figure: distribution of xN with no cut on N(x) and after the cut N(x)>a; the background is much reduced, but you can measure how much is left]

– In the limit of complete de-correlation between N(x) and xN there is no systematic uncertainty on the N(x) performance
Finding the right method
• Which technique is right for your problem? Depends on
– Complexity of your problem
– Time scale in which you would like to finish the analysis
• On finding the absolute best set of cuts
– All methods for finding discriminants are approximate when used with finite training/tuning statistics
– Your experiment's event simulation is imperfect – your performance on data can be different (usually it is less)
– You may find a systematic error later that might depend on your choice of cuts
– Don't hunt for upward statistical fluctuations in tuning data
– If it takes you 6 months of work to reduce your error by 10%, keep in mind that your experiment may have accumulated enough additional data by then to reduce your statistical error by a comparable or larger amount
• It is more important to get the right (=unbiased) answer than the smallest possible statistical error
– Don't use discriminating variables that you know are poorly modeled in simulation
– Always try to find a way to cross-check your performance on data, e.g. by using a control sample, or leaving one observable out of your t(x)
Wouter Verkerke, UCSB
Estimation & Fitting
– Introduction to estimation
– Properties of χ², ML estimators
– Measuring and interpreting goodness-of-fit
– Numerical issues in fitting
– Understanding MINUIT
– Mitigating fit stability problems
– Bounding fit parameters
– Fit validation studies
– Fit validity issues at low statistics
– Toy Monte Carlo techniques
– Simultaneous fitting
– Multidimensional fitting
Wouter Verkerke, UCSB
Estimation – Introduction
 
• Probability theory / calculus: given the theoretical distribution T(x;p) with parameters p, what can we say about the data D(x)?
• Statistical inference: given the data D(x), what can we say about the theory T(x;p)?
• Need a procedure to estimate p from D
– Common technique – fit!
Wouter Verkerke, UCSB
A well known estimator – the c2 fit
• Given a set of points {(xi, yi, σi)} and a function f(x,p), define the χ²:

  $\chi^2(\vec{p}) = \sum_i \frac{\left(y_i - f(x_i;\vec{p})\right)^2}{\sigma_i^2}$

• Estimate the parameters by minimizing χ²(p) with respect to all parameters pi
– In practice, look for $\frac{d\chi^2(p_i)}{dp_i} = 0$
– The value of pi at the minimum is the estimate for pi; the error on pi is given by the χ² variation of +1
• Well known: but why does it work? Is it always right? Does it always give the best possible error?
Wouter Verkerke, UCSB
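A minimal numerical sketch of such a χ² (least-squares) fit; it uses scipy's generic least-squares fitter as a stand-in for the minimization described above (the lecture itself uses MINUIT), and the straight-line toy data are made up:

import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(seed=12)
x = np.linspace(0, 10, 20)
sigma = np.full_like(x, 0.5)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma)

def f(x, a, b):
    return a * x + b

# curve_fit minimizes sum((y - f)^2 / sigma^2) when absolute_sigma=True
popt, pcov = curve_fit(f, x, y, sigma=sigma, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))

chi2 = np.sum(((y - f(x, *popt)) / sigma) ** 2)
print("a =", popt[0], "+/-", perr[0])
print("b =", popt[1], "+/-", perr[1])
print("chi2 / ndf =", chi2, "/", len(x) - 2)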
Basics – What is an estimator?
• An estimator is a procedure giving a value for a parameter or a property of a distribution as a function of the actual data values, e.g.

  $\hat{\mu}(x) = \frac{1}{N}\sum_i x_i$   (estimator of the mean)
  $\hat{V}(x) = \frac{1}{N}\sum_i (x_i - \hat{\mu})^2$   (estimator of the variance)

• A perfect estimator is
– Consistent: $\lim_{N\to\infty}\hat{a} = a$
– Unbiased: with finite statistics you get the right answer on average
– Efficient: $V(\hat{a}) = \langle(\hat{a} - \langle\hat{a}\rangle)^2\rangle$ is as small as possible (the smallest attainable value is called the Minimum Variance Bound)
– There are no perfect estimators!
Wouter Verkerke, UCSB
Probability density functions
• We can characterize data distributions with
probability density functions F(x;p)
– x = observables (measured quantities)
– p = parameters (model/theory parameters)
• Properties
– Normalized to unity with respect to the observable(s) x
– Positive definite: F(x;p) ≥ 0

  $\int F(x;\vec{p})\,dx = 1, \qquad \int\!\!\int F(x,y;\vec{p})\,dx\,dy = 1$
Probability density functions
• Properties
– Parameters can be physics quantities of interest (lifetime, mass)
– Decay time distribution: observable x (decay time), parameter τ (lifetime)
– Invariant mass distribution: observable x (inv. mass), parameter m (physics mass), parameter σ (decay width)

  $f(x;m,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2\right)$

– Vehicle to infer physics parameters from data distributions
Wouter Verkerke, NIKHEF
Likelihood
• The likelihood is the value of a probability density function evaluated at the measured value of the observable(s)
– Note that the likelihood is a function of the parameters only, not of the observables

  $L(\vec{p}) = F(x = x_{\mathrm{data}};\vec{p})$

• For a dataset that consists of multiple data points, the product is taken:

  $L(\vec{p}) = \prod_i F(x_i;\vec{p}) = F(x_0;\vec{p})\cdot F(x_1;\vec{p})\cdot F(x_2;\vec{p})\cdots$
Wouter Verkerke, NIKHEF
Parameter estimation using Maximum Likelihood
• Likelihood is high for values of p that result in
distribution similar to data
• Define the maximum likelihood (ML) estimator(s) to be
the parameter value(s) for which the likelihood is
maximum.
Wouter Verkerke, NIKHEF
Parameter estimation – Maximum likelihood
• Computational issues
– For convenience the negative log of the likelihood is often used, as addition is numerically easier than multiplication:

  $-\ln L(\vec{p}) = -\sum_i \ln F(x_i;\vec{p})$

– Maximizing L(p) is equivalent to minimizing −ln L(p)
• In practice, find the point where the derivative is zero:

  $\left.\frac{d\ln L(\vec{p})}{dp_i}\right|_{p_i = \hat{p}_i} = 0$
Wouter Verkerke, UCSB
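A minimal sketch of an unbinned maximum likelihood fit done this way; it uses scipy's generic minimizer on −ln L rather than MINUIT, and the toy Gaussian data below are made up:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=13)
data = rng.normal(0.1, 3.2, size=1000)       # toy data

def nll(params):
    # Negative log-likelihood of a Gaussian with the given mean and sigma.
    mean, sigma = params
    if sigma <= 0:
        return np.inf
    z = (data - mean) / sigma
    return np.sum(0.5 * z**2 + np.log(sigma * np.sqrt(2 * np.pi)))

result = minimize(nll, x0=[0.0, 1.0], method="Nelder-Mead")
print("fitted mean, sigma:", result.x)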
Variance on ML parameter estimates
• The ML estimator for the parameter variance is

  $\hat{\sigma}(p)^2 = \hat{V}(p) = \left(-\frac{d^2\ln L}{dp^2}\right)^{-1}$

– i.e. the variance is estimated from the 2nd derivative of −ln(L) at the minimum
– This follows from the Rao-Cramér-Fréchet inequality

  $V(\hat{p}) \geq \frac{\left(1 + \frac{db}{dp}\right)^2}{-\left\langle\frac{d^2\ln L}{dp^2}\right\rangle}$

  (b = bias as a function of p; the inequality becomes an equality in the limit of an efficient estimator)
– Valid if the estimator is efficient and unbiased!
• Visual interpretation of the variance estimate
– Taylor expand −ln(L) around the minimum:

  $\ln L(p) = \ln L(\hat{p}) + \left.\frac{d\ln L}{dp}\right|_{p=\hat{p}}(p-\hat{p}) + \frac{1}{2}\left.\frac{d^2\ln L}{dp^2}\right|_{p=\hat{p}}(p-\hat{p})^2 = \ln L_{\max} - \frac{(p-\hat{p})^2}{2\hat{\sigma}_p^2}$

  $\Rightarrow\; \ln L(\hat{p}\pm\sigma) = \ln L_{\max} - \frac{1}{2}$

  i.e. the error is read off where −ln L rises by 0.5 above its minimum
Wouter Verkerke, UCSB
Properties of Maximum Likelihood estimators
• In general, maximum likelihood estimators are
– Consistent (gives the right answer for N → ∞)
– Mostly unbiased (bias ∝ 1/N, may need to worry at small N)
– Efficient for large N (you get the smallest possible error)
– Invariant: a transformation of parameters will not change your answer, e.g. $\widehat{p^2} = (\hat{p})^2$
• Use of the 2nd derivative of −ln(L) for the variance estimate is usually OK
• MLE efficiency theorem: the MLE will be unbiased and efficient if an unbiased efficient estimator exists
– Proof not discussed here for brevity
– Of course this does not guarantee that any MLE is unbiased and efficient for any given problem
Wouter Verkerke, UCSB
More about maximum likelihood estimation
• It’s not ‘right’ it is just sensible
• It does not give you the ‘most likely value of p’ –
it gives you the value of p for which this data is most likely
• Numeric methods are often needed to find
the maximum of ln(L)
– Especially difficult if there is >1 parameter
– Standard tool in HEP: MINUIT (more about this later)
• Max. Likelihood does not give you a goodness-of-fit measure
– If assumed F(x;p) is not capable of describing your data for any p,
the procedure will not complain
– The absolute value of L tells you nothing!
Wouter Verkerke, UCSB
Properties of c2 estimators
• The properties of the χ² estimator follow from the properties of the ML estimator
– Probability density function in p for a single data point xi(σi) and function f(xi;p):

  $F(x_i;\vec{p}) \propto \exp\!\left(-\frac{1}{2}\left(\frac{y_i - f(x_i;\vec{p})}{\sigma_i}\right)^2\right)$

– Take the log and sum over all points xi to get the likelihood function in p for the given points xi(σi) and function f(xi;p):

  $\ln L(\vec{p}) = -\frac{1}{2}\sum_i \left(\frac{y_i - f(x_i;\vec{p})}{\sigma_i}\right)^2 = -\frac{1}{2}\chi^2$

• The χ² estimator follows from the ML estimator, i.e. it is
– Efficient, consistent, bias ∝ 1/N, invariant,
– but only in the limit that the error σi is truly Gaussian
– i.e. need ni > 10 if yi follows a Poisson distribution
• Bonus: goodness-of-fit measure – χ² ≈ 1 per d.o.f.
Wouter Verkerke, UCSB
Maximum Likelihood or c2 – What should you use?
• χ² fit is fastest, easiest
– Works fine at high statistics
– Gives absolute goodness-of-fit indication
– Makes an (incorrect) Gaussian error assumption on low-statistics bins
– Has bias proportional to 1/N
– Misses information with feature size < bin size
• Full Maximum Likelihood estimators most robust
– No Gaussian assumption made at low statistics
– No information lost due to binning
– Gives best error of all methods (especially at low statistics)
– No intrinsic goodness-of-fit measure, i.e. no way to tell if ‘best’ is actually
‘pretty bad’
– Has bias proportional to 1/N
– Can be computationally expensive for large N
• Binned maximum likelihood is in between
– Much faster than full maximum likelihood:

  $-\ln L(\vec{p})_{\mathrm{binned}} = -\sum_{\mathrm{bins}} n_{\mathrm{bin}} \ln F(x_{\mathrm{bin\,center}};\vec{p})$
– Correct Poisson treatment of low statistics bins
– Misses information with feature size < bin size
– Has bias proportional to 1/N
Wouter Verkerke, UCSB
Using weighted data in estimators
• χ² fits of histograms with weighted data are straightforward:

  $y_i = \sum_j w_j$ (sum of event weights in bin i, from the C.L.T.), $\qquad \sigma_i = \sqrt{\textstyle\sum_j w_j^2}$ (from the C.L.T.)

  $\chi^2(\vec{p}) = \sum_i \left(\frac{y_i - f(x_i;\vec{p})}{\sigma_i}\right)^2$

– NB: you may no longer be able to interpret $\hat{\sigma}(p) = \sqrt{\hat{V}(p)}$ as a Gaussian error (i.e. 68% contained within 1σ)
• In ML fits the implementation of weights is easy, but the interpretation of the errors is not!

  $-\ln L(\vec{p})_{\mathrm{weighted}} = -\sum_i w_i \ln F(x_i;\vec{p})$   (wi = event weight)

– The variance estimate on the parameters will be proportional to $1/\sum_i w_i$
– If $\sum_i w_i > N$ the errors will be too small; if $\sum_i w_i < N$ the errors will be too large!
– Interpretation of errors from weighted likelihood fits is difficult – avoid it if you can
Wouter Verkerke, UCSB
Estimating and interpreting Goodness-Of-Fit
• Fitting determines the best set of parameters of a given model to describe the data
– Is 'best' good enough? i.e.
– Is it an adequate description, or are there significant and incompatible differences ('not good enough')?
• Most common test: the χ² test

  $\chi^2 = \sum_i \left(\frac{y_i - f(x_i;\vec{p})}{\sigma_i}\right)^2$

– If f(x) describes the data then χ² ≈ N; if χ² >> N something is wrong
– How to quantify the meaning of a 'large χ²'?
Wouter Verkerke, UCSB
How to quantify meaning of ‘large c2’
• The probability distribution for χ², with $\chi^2 = \sum_i \left(\frac{y_i - \mu_i}{\sigma_i}\right)^2$, is given by

  $p(\chi^2, N) = \frac{2^{-N/2}}{\Gamma(N/2)}\,\chi^{N-2}\,e^{-\chi^2/2}$

• To make a judgement on the goodness-of-fit, the relevant quantity is the integral of the above:

  $P(\chi^2;N) = \int_{\chi^2}^{\infty} p(\chi'^2;N)\,d\chi'^2$

• What does the χ² probability P(χ²,N) mean?
– It is the probability that a function which does genuinely describe the data on N points would give a χ² as large or larger than the one you already have.
• Since it is a probability, it is a number in the range [0,1]
Wouter Verkerke, UCSB
Goodness-of-fit – c2
• Example of the χ² probability
– Suppose you have a function f(x;p) which gives a χ² of 20 for 5 points (histogram bins)
– It is not impossible that f(x;p) describes the data correctly, just unlikely
– How unlikely? $\int_{20}^{\infty} p(\chi^2, 5)\,d\chi^2 = 0.0012$
• Note: if the function has been fitted to the data
– Then you need to account for the fact that parameters have been adjusted to describe the data: $N_{\mathrm{d.o.f.}} = N_{\mathrm{data}} - N_{\mathrm{params}}$
• Practical tips
– To calculate the probability in ROOT: 'TMath::Prob(chi2,ndf)'
– For large N, $\sqrt{2\chi^2}$ has a Gaussian distribution with mean $\sqrt{2N-1}$ and σ = 1
Wouter Verkerke, UCSB
Practical estimation – Numeric c2 and -log(L) minimization
• For most data analysis problems minimization of c2 or –
log(L) cannot be performed analytically
– Need to rely on numeric/computational methods
– In >1 dimension generally a difficult problem!
• But no need to worry – Software exists to solve this
problem for you:
– Function minimization workhorse in HEP many years: MINUIT
– MINUIT does function minimization and error analysis
– It is used in the PAW,ROOT fitting interfaces behind the scenes
– It produces a lot of useful information, that is sometimes
overlooked
– Will look in a bit more detail into MINUIT output and functionality
next
Wouter Verkerke, UCSB
Numeric c2/-log(L) minimization – Proper starting values
• For all but the most trivial scenarios it is not possible to
automatically find reasonable starting values of
parameters
– This may come as a disappointment to some…
– So you need to supply good starting values for your parameters
– Reason: there may exist multiple (local) minima in the likelihood or χ²

  [Figure: −log(L) vs. p, with a local minimum next to the true minimum]
– Supplying good initial uncertainties on your parameters helps too
– Reason: Too large error will result in MINUIT coarsely scanning a
wide region of parameter space. It may accidentally find a far away
local minimum
Wouter Verkerke, UCSB
Example of interactive fit in ROOT
• What happens in MINUIT behind the scenes
1) Find minimum in –log(L) or c2 – MINUIT function MIGRAD
2) Calculate errors on parameters – MINUIT function HESSE
3) Optionally do more robust error estimate – MINUIT function MINOS
Wouter Verkerke, UCSB
Minuit function MIGRAD
• Purpose: find minimum
– Progress information; watch for errors here:

 **********
 **   13 **MIGRAD    1000    1
 **********
 (some output omitted)
 MIGRAD MINIMIZATION HAS CONVERGED.
 MIGRAD WILL VERIFY CONVERGENCE AND ERROR MATRIX.
 COVARIANCE MATRIX CALCULATED SUCCESSFULLY
 FCN=257.304 FROM MIGRAD  STATUS=CONVERGED  31 CALLS  32 TOTAL
                          EDM=2.36773e-06  STRATEGY= 1  ERROR MATRIX ACCURATE
  EXT PARAMETER                                 STEP          FIRST
  NO.  NAME     VALUE         ERROR          SIZE          DERIVATIVE
   1  mean    8.84225e-02   3.23862e-01   3.58344e-04   -2.24755e-02
   2  sigma   3.20763e+00   2.39540e-01   2.78628e-04   -5.34724e-02
                            ERR DEF= 0.5
 EXTERNAL ERROR MATRIX.  NDIM= 25  NPAR= 2  ERR DEF=0.5
  1.049e-01  3.338e-04
  3.338e-04  5.739e-02
 PARAMETER CORRELATION COEFFICIENTS
       NO.  GLOBAL     1      2
        1  0.00430  1.000  0.004
        2  0.00430  0.004  1.000

– VALUE and ERROR are the parameter values and approximate errors reported by MINUIT
– ERR DEF is the error definition (in this case 0.5 for a likelihood fit)
Wouter Verkerke, UCSB
Minuit function MIGRAD
• Purpose: find minimum
– FCN is the value of χ² or likelihood at the minimum (NB: χ² values are not divided by Nd.o.f.)
– The EXTERNAL ERROR MATRIX block gives the approximate error (covariance) matrix, followed by the parameter correlation coefficients
  (same MIGRAD output as shown above)
Wouter Verkerke, UCSB
Minuit function MIGRAD
• Purpose: find minimum
– STATUS should be 'CONVERGED' but can be 'FAILED'
– EDM, the estimated distance to minimum, should be small, O(10⁻⁶)
– The error matrix quality should be 'ACCURATE', but can be 'APPROXIMATE' in case of trouble
  (same MIGRAD output as shown above)
Wouter Verkerke, UCSB
Minuit function HESSE
• Purpose: calculate the error matrix from $\frac{d^2 L}{dp^2}$
– Symmetric errors, calculated from the 2nd derivative of −ln(L) or χ²:

 **********
 **   18 **HESSE    1000
 **********
 COVARIANCE MATRIX CALCULATED SUCCESSFULLY
 FCN=257.304 FROM HESSE  STATUS=OK  10 CALLS  42 TOTAL
                         EDM=2.36534e-06  STRATEGY= 1  ERROR MATRIX ACCURATE
  EXT PARAMETER                              INTERNAL      INTERNAL
  NO.  NAME     VALUE         ERROR          STEP SIZE     VALUE
   1  mean    8.84225e-02   3.23861e-01   7.16689e-05   8.84237e-03
   2  sigma   3.20763e+00   2.39539e-01   5.57256e-05   3.26535e-01
                            ERR DEF= 0.5
 EXTERNAL ERROR MATRIX.  NDIM= 25  NPAR= 2  ERR DEF=0.5
  1.049e-01  2.780e-04
  2.780e-04  5.739e-02
 PARAMETER CORRELATION COEFFICIENTS
       NO.  GLOBAL     1      2
        1  0.00358  1.000  0.004
        2  0.00358  0.004  1.000
Wouter Verkerke, UCSB
Minuit function HESSE
• Purpose: calculate the error matrix from $\frac{d^2 L}{dp^2}$
– The error matrix (covariance matrix) is calculated from

  $V_{ij} = \left(\frac{d^2(-\ln L)}{dp_i\,dp_j}\right)^{-1}$

  (same HESSE output as shown above)
Wouter Verkerke, UCSB
Minuit function HESSE
• Purpose: calculate the error matrix from d2L/dp2
 **********
 **   18 **HESSE         1000
 **********
 COVARIANCE MATRIX CALCULATED SUCCESSFULLY
 FCN=257.304 FROM HESSE    STATUS=OK    10 CALLS    42 TOTAL
 EDM=2.36534e-06    STRATEGY= 1    ERROR MATRIX ACCURATE
 (parameter, error matrix and correlation blocks as on the previous slide)
Annotation: the PARAMETER CORRELATION COEFFICIENTS block is the
correlation matrix ρij, calculated from Vij = ρij σi σj
Wouter Verkerke, UCSB
Minuit function HESSE
• Purpose: calculate the error matrix from d2L/dp2
 **********
 **   18 **HESSE         1000
 **********
 COVARIANCE MATRIX CALCULATED SUCCESSFULLY
 FCN=257.304 FROM HESSE    STATUS=OK    10 CALLS    42 TOTAL
 EDM=2.36534e-06    STRATEGY= 1    ERROR MATRIX ACCURATE
 (parameter, error matrix and correlation blocks as on the previous slide)
Annotation: the GLOBAL column is the global correlation vector, i.e. the
correlation of each parameter with all other parameters combined
Wouter Verkerke, UCSB
Minuit function MINOS
• MINOS errors are calculated with a ‘hill climbing’ algorithm.
– In one dimension: find the points where Δ(–lnL) = +0.5.
– In >1 dimension: find the contour with Δ(–lnL) = +0.5. Errors are defined by
the bounding box of the contour.
– In >>1 dimension this is very time consuming, but in general more
robust.
• Optional – activated by option “E” in ROOT or PAW
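A minimal sketch (not from the slides) of how MINOS can be activated; the
histogram, p.d.f. and dataset names are illustrative assumptions:

// Plain ROOT: option "E" runs MINOS after the fit
hist->Fit("gaus","E") ;
// RooFit: request MINOS errors explicitly in fitTo()
gauss.fitTo(*data, RooFit::Minos(kTRUE)) ;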
 **********
 **   23 **MINOS         1000
 **********
 FCN=257.304 FROM MINOS    STATUS=SUCCESSFUL    52 CALLS    94 TOTAL
 EDM=2.36534e-06    STRATEGY= 1    ERROR MATRIX ACCURATE
  EXT PARAMETER                  PARABOLIC       MINOS ERRORS
  NO.  NAME     VALUE            ERROR           NEGATIVE       POSITIVE
   1   mean    8.84225e-02   3.23861e-01    -3.24688e-01    3.25391e-01
   2   sigma   3.20763e+00   2.39539e-01    -2.23321e-01    2.58893e-01
                               ERR DEF= 0.5
Annotations: the PARABOLIC column is the symmetric error (repeated result
from HESSE); the MINOS errors can be asymmetric (in this example the
‘sigma’ error is slightly asymmetric)
Wouter Verkerke, UCSB
Illustration of difference between HESSE and MINOS errors
• ‘Pathological’ example likelihood with multiple minima
and non-parabolic behavior
[Figure: –logL(p) versus the parameter p; the MINOS error follows the
actual likelihood curve, the HESSE error is the extrapolation of the
parabolic approximation at the minimum]
Wouter Verkerke, NIKHEF
Practical estimation – Fit convergence problems
• Sometimes fits don’t converge because, e.g.
– MIGRAD is unable to find the minimum
– HESSE finds negative second derivatives
(which would imply negative errors)
• The reason is usually numerical precision and stability
problems, but
– The underlying cause of fit stability problems is usually
highly correlated parameters in the fit
• The HESSE correlation matrix is the primary investigative tool
PARAMETER CORRELATION COEFFICIENTS
 NO.  GLOBAL      1      2
  1   0.99835   1.000  0.998
  2   0.99835   0.998  1.000
← Signs of trouble…
– In the limit of 100% correlation, the usual point solution becomes a line
solution (or surface solution) in parameter space.
The minimization problem is then no longer well defined.  Wouter Verkerke, UCSB
Practical estimation – Bounding fit parameters
• Sometimes it is desirable to bound the allowed range of
parameters in a fit
– Example: a fraction parameter is only defined in the range [0,1]
– MINUIT option ‘B’ maps the finite-range parameter to an internal infinite
range using an arcsin(x) transformation:
[Figure: bounded (external) parameter space and external error versus
MINUIT internal parameter space (-∞,+∞) and internal error]
Wouter Verkerke, UCSB
Practical estimation – Bounding fit parameters
• If the fitted parameter value is close to a boundary, the errors will
become asymmetric (and possibly incorrect)
[Figure: bounded (external) parameter space and external error versus
MINUIT internal parameter space (-∞,+∞) and internal error]
• So be careful with bounds!
– If boundaries are imposed to avoid a region of instability, look into other
parameterizations that naturally avoid that region
– If boundaries are imposed to avoid ‘unphysical’, but statistically valid
results, consider not imposing the limit and dealing with the ‘unphysical’
interpretation in a later stage
Wouter Verkerke, UCSB
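A minimal RooFit sketch (not from the slides) of a bounded versus an unbounded
parameter; names and values are illustrative assumptions:

// Bounded parameter: a fraction is only defined on [0,1]
RooRealVar frac("frac","signal fraction",0.5,0.,1.) ;
// If the bound is only there to avoid 'unphysical' values, one can
// instead declare the parameter without limits and interpret later
RooRealVar fracU("fracU","signal fraction (unbounded)",0.5) ;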
Practical model writing
• Focus so far on parameter estimation using simple
illustration p.d.f.s (Gaussian etc)
– Now cover a few aspects of writing of p.d.f.s
– Canonical example: fit data to signal + background p.d.f
• Aim to write probability density functions rather than ‘plain’
functions to describe your data
– Easier to interpret your models:
if the blue and green p.d.f.s are each
guaranteed to be normalized to 1,
then the fractions of blue and green can
be cleanly interpreted as numbers of events
– Many statistical techniques only
function properly with p.d.f.s
(e.g. maximum likelihood)
– Can sample ‘toy Monte Carlo’ events
from the p.d.f. because its value is always
guaranteed to be ≥ 0
Wouter Verkerke, NIKHEF
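A minimal RooFit sketch (not from the slides) of the canonical signal +
background p.d.f.; observable range, shapes and starting values are
illustrative assumptions:

RooRealVar x("x","observable",5.20,5.30) ;
RooRealVar mean("mean","signal mean",5.28,5.27,5.29) ;
RooRealVar sigma("sigma","signal width",0.003,0.001,0.01) ;
RooGaussian sig("sig","signal",x,mean,sigma) ;
RooRealVar c("c","background slope",-0.5,-10.,0.) ;
RooExponential bkg("bkg","background",x,c) ;
// Extended sum: the coefficients are interpreted as event yields
RooRealVar nsig("nsig","#signal events",100,0,10000) ;
RooRealVar nbkg("nbkg","#background events",200,0,10000) ;
RooAddPdf model("model","sig+bkg",RooArgList(sig,bkg),RooArgList(nsig,nbkg)) ;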
Modeling the signal distribution
• Many signal shapes are (somewhat) Gaussian
– Gaussian distribution obvious choice for signal shape in many
cases
• But beware that certain physics distributions are quite
different in the tails
– Many resonances are Breit-Wigner shapes → much longer tails
than a Gaussian
[Figure: Breit-Wigner (Γ=1) vs Gaussian (σ=1)]
Wouter Verkerke, NIKHEF
Modeling the signal distribution
• Correct description of most resonances is Breit-Wigner
convoluted with Gaussian
– Breit-Wigner describes physics resonance
– Gaussian describes detector resolution
– Can approximate with a Gaussian if the detector resolution is
dominating
[Figure: Breit-Wigner ⊗ Gaussian resolution = observed line shape]
• NB: Convolutions are hard to calculate
– Numeric calculation using Fourier Transforms usually most viable
(Tools available in ROOT [RooFFTConvPdf])
Wouter Verkerke, NIKHEF
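A minimal RooFit sketch (not from the slides) of a Breit-Wigner convoluted
with a Gaussian resolution using RooFFTConvPdf; the observable range,
widths and cache binning are illustrative assumptions:

RooRealVar m("m","mass",0,-20,20) ;
RooRealVar m0("m0","resonance mass",0) ;
RooRealVar gamma("gamma","resonance width",1) ;
RooBreitWigner bw("bw","physics resonance",m,m0,gamma) ;
RooRealVar bias("bias","resolution bias",0) ;
RooRealVar res("res","resolution width",1,0.1,5) ;
RooGaussian gres("gres","detector resolution",m,bias,res) ;
m.setBins(10000,"cache") ;                 // sampling binning for the FFT
RooFFTConvPdf sigmodel("sigmodel","bw (x) gauss",m,bw,gres) ;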
Multiple Gaussians as signal shape
• Sometimes (detector) resolution effect has two distinct
resolution components
– Describe resolution with sum of two Gaussians
– But beware strong correlations that spoil fit stability
F ( x; f, m, s1, s2 ) = f · G1 ( x; m, s1 ) + (1 − f) · G2 ( x; m, s2 )
HESSE correlation matrix: the widths s1, s2 and the fraction f are strongly correlated
PARAMETER CORRELATION COEFFICIENTS
        GLOBAL     [ f]     [ m]     [s1]     [s2]
 [ f]   0.96973    1.000   -0.135    0.918    0.915
 [ m]   0.14407   -0.135    1.000   -0.144   -0.114
 [s1]   0.92762    0.918   -0.144    1.000    0.786
 [s2]   0.92486    0.915   -0.114    0.786    1.000
Wouter Verkerke, UCSB
Multiple Gaussians as signal shape
– Different parameterization: express the second width as a scale factor times the first
F ( x; f, m, s1, s2 ) = f · G1 ( x; m, s1 ) + (1 − f) · G2 ( x; m, s1·s2 )
PARAMETER CORRELATION COEFFICIENTS
        GLOBAL     [ f]     [ m]     [s1]     [s2]
 [ f]   0.96951    1.000   -0.134    0.917   -0.681
 [ m]   0.14312   -0.134    1.000   -0.143    0.127
 [s1]   0.98879    0.917   -0.143    1.000   -0.895
 [s2]   0.96156   -0.681    0.127   -0.895    1.000
– Correlation of width s2 and fraction f reduced from 0.92 to 0.68
– Choice of parameterization matters!
• Alternative – Fix all but one of the correlated parameters
– If floating parameters are highly correlated, some of them may be
redundant and not contribute to additional degrees of freedom in your
model
Wouter Verkerke, UCSB
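A minimal RooFit sketch (not from the slides) of a double Gaussian with a
common mean, with the second width written as a ratio times the first to
reduce correlations; names and ranges are illustrative assumptions:

RooRealVar x("x","x",-10,10) ;
RooRealVar mean("mean","common mean",0,-1,1) ;
RooRealVar s1("s1","core width",1,0.1,5) ;
RooRealVar ratio("ratio","width ratio s2/s1",2,1,10) ;
RooFormulaVar s2("s2","s1*ratio",RooArgList(s1,ratio)) ;   // derived tail width
RooGaussian g1("g1","core Gaussian",x,mean,s1) ;
RooGaussian g2("g2","tail Gaussian",x,mean,s2) ;
RooRealVar f("f","core fraction",0.8,0.,1.) ;
RooAddPdf sig("sig","double Gaussian",g1,g2,f) ;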
Some other signal shape options
• Some other common shapes used to model signal
distributions
• Landau
– Describes energy loss in material
(no simple analytical form)
• Crystal Ball
– Gaussian with an exponential tail
‘glued on’. Useful for e.g. m(e+e-)
of the J/ψ where Bremsstrahlung losses
are significant
[Figure: exponential tail + Gaussian core]
• Novosibirsk
– Gaussian distribution with an extra
asymmetric skewing parameter a
[Figure: curves for a=0 (Gaussian), a=0.5, a=1]
Wouter Verkerke, NIKHEF
Modeling the background
• Functional description of background shape usually
much more difficult than signal shape
• Common options
– Polynomial (plain or Chebychev)
– Phase-space / kinematics-based
– Empirical (Histogram or Kernel estimation)
• If kinematics of background are well understood a
dedicated functional form describing kinematics is
usually best
– N-body phase space of a B decay near threshold → ‘Argus shape’
– D*0–D0 mass-difference phase space
Wouter Verkerke, NIKHEF
Kinematics inspired background functions
• Exponential decay
• Argus shape
– Describes phase space
near cutoff
– Example shown for the
background of B meson
decays from Υ(4S) → BB
(kinematic cutoff at 5.291 GeV)
• D*±-D0 phase space
– D*± mesons usually found from
distribution m(D*)-m(D0) due
to small phase space for slow
pion.
– Characteristic phase space
distribution for background
Wouter Verkerke, NIKHEF
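A minimal RooFit sketch (not from the slides) of an ARGUS threshold shape
for the B-meson background described above; names and parameter values
are illustrative assumptions:

RooRealVar mes("mes","beam-energy substituted mass",5.20,5.291) ;
RooRealVar m0("m0","kinematic endpoint",5.291) ;
RooRealVar kappa("kappa","curvature",-20,-100,0) ;
RooArgusBG argus("argus","phase-space background",mes,m0,kappa) ;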
Mitigating fit stability problems -- Polynomials
• Warning: the regular parameterization of polynomials,
a0 + a1·x + a2·x^2 + a3·x^3, nearly always results in strong
correlations between the coefficients ai.
– Fit stability problems, inability to find right solution common at
higher orders
• Solution: Use existing parameterizations of
polynomials that have (mostly) uncorrelated variables
– Example: Chebychev polynomials
Wouter Verkerke, UCSB
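A minimal RooFit sketch (not from the slides) of a Chebychev polynomial
background, whose coefficients are far less correlated than those of a
plain polynomial; names and values are illustrative assumptions:

RooRealVar x("x","x",-10,10) ;
RooRealVar c1("c1","1st Chebychev coefficient",0.0,-1,1) ;
RooRealVar c2("c2","2nd Chebychev coefficient",-0.2,-1,1) ;
RooChebychev bkg("bkg","polynomial background",x,RooArgList(c1,c2)) ;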
Non-parametric models using control sample data
• Instead of aiming at analytical description, can also go for
empirical approach with kernel estimation
Static Kernel
(width of all Gaussians identical)
Adaptive Kernel
(width of each Gaussian depends
on the local density of events)
Wouter Verkerke, UCSB
• Most useful when you have a data control sample as input
– Note that there is an uncertainty associated with the shape due to the finite
statistics of the control sample that is not automatically propagated to
your fit.
– Specialized fitters exist (ROOT TFractionFitter) to propagate such
errors if all components (signal and background) are histogram
templates
Wouter Verkerke, NIKHEF
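A minimal RooFit sketch (not from the slides) of a kernel-estimation p.d.f.
built from a control sample; 'x' and 'ctlData' are assumed to be an existing
RooRealVar and RooDataSet:

// Adaptive kernel estimate of the control-sample shape in x
RooKeysPdf kest("kest","kernel estimate",x,*ctlData,RooKeysPdf::MirrorBoth) ;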
Practical fitting – Error propagation between samples
• Common situation: you want to fit
a small signal in a large sample
– Problem: small statistics does not
constrain shape of your signal very well
– Result: errors are large
• Idea: Constrain shape of your signal
from a fit to a control sample
– Larger/cleaner data or MC sample with
similar properties
• Needed: a way to propagate the information from the
control sample fit (parameter values and errors) to your
signal fit
Wouter Verkerke, UCSB
Practical fitting – Simultaneous fit technique
• Given data Dsig(x) with model Fsig(x;a,b), and
data Dctl(x) with model Fctl(x;b,c)
– Construct –log[Lsig(a,b)] and –log[Lctl(b,c)]
[Figure: Dsig(x) with Fsig(x;a,b); Dctl(x) with Fctl(x;b,c); likelihood view
of -log(L) versus the shared parameter b for ‘SIG’, ‘CTL’ and Combined]
• Minimize -logL(a,b,c)= -logL(a,b)+ -logL(b,c)
– Errors, correlations on common param. b automatically propagated
Wouter Verkerke, UCSB
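A minimal RooFit sketch (not from the slides) of a simultaneous fit; Fsig,
Fctl and combData are assumed to exist, and the category labels are
illustrative assumptions:

// Category distinguishing the signal and control samples
RooCategory sample("sample","sample type") ;
sample.defineType("SIG") ;
sample.defineType("CTL") ;
// Fsig(x;a,b) and Fctl(x;b,c) share the parameter b
RooSimultaneous simPdf("simPdf","simultaneous pdf",sample) ;
simPdf.addPdf(Fsig,"SIG") ;
simPdf.addPdf(Fctl,"CTL") ;
// The combined dataset must contain both x and the category label
simPdf.fitTo(*combData) ;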
Another application of simultaneous fits
• You can also use simultaneous fits to samples of the same
type (“signal samples”) with different purity
• Go back to example of NN with one observable left out
– Fit xN after a cut on N(x)
– But instead of just fitting data with
N(x)>a, slice the data in bins of N(x)
and fit each bin.
– Now you exploit all data instead
of just the most pure data. Still no
uncontrolled systematic uncertainty
as the purity is measured from data in
each slice
– Combine the information of all slices in
a simultaneous fit
Wouter Verkerke, NIKHEF
Practical Estimation – Verifying the validity of your fit
• How to validate your fit? – You want to demonstrate that
1) Your fit procedure gives on average the correct answer: ‘no bias’
2) The uncertainty quoted by your fit is an accurate measure of the statistical
spread in your measurement: ‘correct error’
• Validation is important for low statistics fits
– Correct behavior is not obvious a priori due to the intrinsic ML bias
proportional to 1/N
Basic validation strategy – A simulation study
1) Obtain a large sample of simulated events
2) Divide your simulated events in O(100-1000) samples with the same size as
the problem under study
3) Repeat fit procedure for each data-sized simulated sample
4) Compare average value of fitted parameter values with generated value 
Demonstrates (absence of) bias
5) Compare spread in fitted parameters values with quoted parameter error 
Demonstrates (in)correctness of error
Wouter Verkerke, UCSB
Fit Validation Study – Practical example
• Example fit model in 1-D (B mass)
– Signal component is Gaussian
centered at B mass
– Background component is
Argus function (models phase
space near kinematic limit)
 
F ( m; Nsig, Nbkg, pS, pB ) = Nsig · G ( m; pS ) + Nbkg · A ( m; pB )
• Fit parameter under study: Nsig
– Results of the simulation study:
1000 experiments
with NSIG(gen)=100, NBKG(gen)=200
– Distribution of Nsig(fit)
[Figure: distribution of Nsig(fit), with Nsig(generated) indicated]
– This particular fit looks unbiased…
Wouter Verkerke, UCSB
Fit Validation Study – The pull distribution
• What about the validity of the error?
– Distribution of error from simulated
experiments is difficult to interpret…
– We don’t have equivalent of
Nsig(generated) for the error
• Solution: look at the pull distribution
– Definition:
pull(Nsig) = ( Nsig(fit) − Nsig(true) ) / σ(Nsig)(fit)
– Properties of the pull:
• Mean is 0 if there is no bias
• Width is 1 if the error is correct
– In this example: no bias, correct error
within the statistical precision of the study
[Figure: distributions of σ(Nsig) and pull(Nsig)]
Wouter Verkerke, UCSB
Fit Validation Study – Low statistics example
• Special care should be taken when fitting small data
samples
– Also if fitting for small signal component in large sample
• Possible causes of trouble
–
c2 estimators may become approximate as Gaussian
approximation of Poisson statistics becomes inaccurate
– ML estimators may no longer be efficient
 error estimate from 2nd derivative may become inaccurate
– Bias term proportional to 1/N of ML and c2 estimators may
no longer be small compared to 1/sqrt(N)
• In general, absence of bias, correctness of error can not
be assumed. How to proceed?
– Use unbinned ML fits only – most robust at low statistics
– Explicitly verify the validity of your fit
Wouter Verkerke, UCSB
Demonstration of fit bias at low N – pull distributions
• Low statistics example:
– Scenario as before but now with
NBKG(gen)=200 background events and
only NSIG(gen)=20 signal events (instead of 100)
• Results of the simulation study
[Figure: distributions of NSIG(fit), σ(NSIG) and pull(NSIG);
the distributions become asymmetric at low statistics]
→ Fit is positively biased!
• Absence of bias, correct error at low statistics not obvious!
– Small yields are typically overestimated
Wouter Verkerke, UCSB
Fit Validation Study – How to obtain 10,000,000 simulated events?
• Practical issue: usually you need very large amounts of
simulated events for a fit validation study
– Of order 1000x the number of events in your fit, easily >1,000,000 events
– Using data generated through a full GEANT-based detector
simulation can be prohibitively expensive
• Solution: Use events sampled directly from your fit function
– Technique named ‘Toy Monte Carlo’ sampling
– Advantage: Easy to do and very fast
– Good to determine fit bias due to low statistics, choice of
parameterization, boundary issues etc
– Cannot be used to test assumption that went into model
(e.g. absence of certain correlations). Still need full GEANT-based
simulation for that.
Wouter Verkerke, UCSB
Toy MC generation – Accept/reject sampling
• How to sample events directly from your fit function?
• Simplest: accept/reject sampling
1) Determine the maximum of the function, fmax
2) Throw a random number x
3) Throw another random number y
4) If y < f(x)/fmax keep x,
otherwise return to step 2)
[Figure: function f(x) with fmax indicated and random points (x,y)]
– PRO: Easy, always works
– CON: It can be inefficient if the function
is strongly peaked.
Finding the maximum empirically
through random sampling can
be lengthy in >2 dimensions
Wouter Verkerke, UCSB
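A minimal standalone C++ sketch (not from the slides) of accept/reject
sampling for an illustrative function f(x) = exp(-x²/2) on [-5,5], where
fmax = 1 is known analytically:

#include <cmath>
#include <random>
#include <cstdio>

double f(double x) { return std::exp(-0.5*x*x) ; }   // function to sample

int main() {
  std::mt19937 rng(1234) ;
  std::uniform_real_distribution<double> uni(0.,1.) ;
  const double xlo = -5., xhi = 5., fmax = 1.0 ;
  for (int n = 0; n < 10; ) {
    double x = xlo + (xhi-xlo)*uni(rng) ;   // step 2: random x in [xlo,xhi]
    double y = uni(rng) ;                   // step 3: random y in [0,1]
    if (y < f(x)/fmax) {                    // step 4: accept, else retry
      std::printf("%f\n", x) ;
      ++n ;
    }
  }
  return 0 ;
}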
Toy MC generation – Inversion method
• Fastest: function inversion
1) Given f(x), find the inverted function F(x)
so that f( F(x) ) = x
2) Throw a uniform random number x
3) Return F(x)
[Figure: uniform random number x → take –log(x) → exponential distribution]
– PRO: Maximally efficient
– CON: Only works for invertible functions
Wouter Verkerke, UCSB
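A minimal standalone C++ sketch (not from the slides) of the inversion
method for the exponential distribution used as the example above:

#include <cmath>
#include <random>
#include <cstdio>

int main() {
  std::mt19937 rng(4321) ;
  std::uniform_real_distribution<double> uni(0.,1.) ;
  for (int i = 0; i < 10; ++i) {
    double u = uni(rng) ;              // uniform random number in [0,1)
    double x = -std::log(1.0 - u) ;    // exponentially distributed, x > 0
    std::printf("%f\n", x) ;
  }
  return 0 ;
}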
Toy MC Generation in a nutshell
• Hybrid: Importance sampling
1) Find an ‘envelope function’ g(x)
that is invertible into G(x)
and that fulfills g(x) ≥ f(x)
for all x
2) Generate a random number x
from G using the inversion method
3) Throw a random number ‘y’
4) If y < f(x)/g(x) keep x,
otherwise return to step 2
[Figure: f(x) with the envelope g(x) and its integral G(x)]
– PRO: Faster than plain accept/reject sampling
Function does not need to be invertible
– CON: Must be able to find invertible envelope function
Wouter Verkerke, UCSB
(Software Advertisement #2)
RooFit
Wouter Verkerke, UCSB
RooFit
A general purpose tool kit for data modeling
Wouter Verkerke (UC Santa Barbara)
David Kirkby (UC Irvine)
Wouter Verkerke, UCSB
Implementation – Add-on package to ROOT
libRooFitCore.so
libRooFitModels.so
Data Modeling
ToyMC data
Generation
Model
Visualization
Data/Model
Fitting
MINUIT
C++ command line
interface & macros
Data management &
histogramming
I/O support
Graphics interface
Wouter Verkerke, UCSB
Data modeling – OO representation
• Mathematical objects are represented as C++ objects
Mathematical concept                              RooFit class
variable                x                         RooRealVar
function                f(x)                      RooAbsReal
PDF                     F(x; p, q)                RooAbsPdf
space point             x                         RooArgSet
integral                ∫ f(x) dx  (xmin..xmax)   RooRealIntegral
list of space points    xk                        RooAbsData
Wouter Verkerke, UCSB
Data modeling – Constructing composite objects
• Straightforward correlation between mathematical
representation of formula and RooFit code
Math:   G ( x; m, √s )
RooFit diagram:
RooGaussian g ← RooRealVar x, RooRealVar m, RooFormulaVar sqrts ← RooRealVar s
RooFit code:
RooRealVar x("x","x",-10,10) ;
RooRealVar m("m","mean",0) ;
RooRealVar s("s","sigma",2,0,10) ;
RooFormulaVar sqrts("sqrts","sqrt(s)",s) ;
RooGaussian g("g","gauss",x,m,sqrts) ;
Wouter Verkerke, UCSB
Model building – (Re)using standard components
• RooFit provides a collection of compiled standard PDF classes
– Basic: Gaussian, Exponential, Polynomial,… (e.g. RooGaussian, RooPolynomial)
– Physics inspired: ARGUS, Crystal Ball, Breit-Wigner, Voigtian, B/D-Decay,…
(e.g. RooArgusBG, RooBMixDecay)
– Non-parametric: Histogram, KEYS (probability density estimate) (e.g. RooHistPdf)
PDF Normalization
• By default RooFit uses numeric integration to achieve normalization
• Classes can optionally provide (partial) analytical integrals
• The final normalization can be a hybrid numeric/analytic form
Wouter Verkerke, UCSB
Model building – (Re)using standard components
• Most physics models can be composed from ‘basic’ shapes
[Figure: component shapes (RooBMixDecay, RooPolynomial, RooHistPdf,
RooArgusBG, RooGaussian) added together with RooAddPdf]
Wouter Verkerke, UCSB
Model building – (Re)using standard components
• Most physics models can be composed from ‘basic’ shapes
[Figure: component shapes (e.g. RooGaussian, RooLandau) convoluted
with RooFFTConvPdf]
Model building – (Re)using standard components
• Most physics models can be composed from ‘basic’ shapes
[Figure: component shapes (RooBMixDecay, RooPolynomial, RooHistPdf,
RooArgusBG, RooGaussian) multiplied together with RooProdPdf]
Wouter Verkerke, UCSB
Model building – (Re)using standard components
• Building blocks are flexible
– Function variables can be functions themselves
– Just plug in anything you like
– Universally supported by core code
(PDF classes don’t need to implement special handling)
m(y; a0,a1)  and  g(x; m, s)
RooPolyVar m("m","m",y,RooArgList(a0,a1)) ;
RooGaussian g("g","gauss",x,m,s) ;
→ g(x,y; a0,a1,s)
Wouter Verkerke, UCSB
Model building – Expression based components
• RooFormulaVar – Interpreted real-valued function
– Based on the ROOT TFormula class
– Ideal for modifying the parameterization of existing compiled PDFs
RooFormulaVar w("w","1-2*D",D) ;
RooBMixDecay(t,tau,w,…)
• RooGenericPdf – Interpreted PDF
– Based on the ROOT TFormula class
– The user expression doesn’t
need to be normalized
– Maximum flexibility
RooGenericPdf f("f","1+sin(0.5*x)+abs(exp(0.1*x)*cos(-1*x))",x) ;
Wouter Verkerke, UCSB
Using models - Overview
• All RooFit models provide universal and complete
fitting and Toy Monte Carlo generation functionality
– Model complexity only limited by available memory and CPU power
• models with >16000 components, >1000 fixed parameters
and >80 floating parameters have been used (published physics result)
– Very easy to use – most operations are one-liners
Fitting:     gauss.fitTo(data) ;             (RooAbsPdf + RooAbsData)
Generating:  data = gauss.generate(x,1000) ; (RooAbsPdf → RooDataSet)
Wouter Verkerke, UCSB
Using models – Toy MC Generation
• Generate “Toy” Monte Carlo samples from any PDF
– Sampling method used by default, but PDF components can advertise
alternative (more efficient) generator methods
– No limit to number of dimensions,
discrete-valued dimensions also supported
data=pdf.generate(x,y,1000)
x
y
data=pdf.generate(x,ydata)
x
y
x
– Subset of variables can be taken from a prototype dataset
• E.g. to more accurately model the statistical fluctuations in a particular sample.
• Correlations with prototype observables correctly taken into account
Wouter Verkerke, UCSB
y
Using models – Plotting
• RooPlot – View of one or more datasets/PDFs projected on the same dimension
– The curve is always normalized to the last
plotted dataset in the frame
– Poisson errors on the data histogram
– Adaptive spacing of curve points
to achieve 1‰ precision,
regardless of the data binning
– For multi-dimensional PDFs the
appropriate 1-dimensional projection is
automatically created:
Projection[F](x) = N · ∫ F(x,y) dy / ∫∫ F(x,y) dx dy
Wouter Verkerke, UCSB
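A minimal RooFit sketch (not from the slides) of the plotting idiom described
above; 'x', 'data' and 'model' are assumed to exist from an earlier step:

RooPlot* frame = x.frame() ;
data->plotOn(frame) ;      // data points with Poisson errors
model.plotOn(frame) ;      // curve normalized to the plotted data
frame->Draw() ;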
Many default solutions for standard problems
• Unbinned ML fit of efficiency curves
– Example: trigger threshold
• Template interpolation
– Example: morph polynomial into Gaussian
– Realistic use case:
interpolate full MC Higgs
signal between masses of
e.g. 160 GeV and 180 GeV
Wouter Verkerke, NIKHEF
Persisting models in the workspace
• Using both data & p.d.f. from file
TFile f("myresults.root") ;
RooWorkspace* w = (RooWorkspace*) f.Get("w") ;
// Make a plot of data and p.d.f.
RooPlot* xframe = w->var("x")->frame() ;
w->data("d")->plotOn(xframe) ;
w->pdf("g")->plotOn(xframe) ;
// Construct the likelihood & profile likelihood
RooNLLVar nll("nll","nll",*w->pdf("g"),*w->data("d")) ;
RooProfileLL pll("pll","pll",nll,*w->var("m")) ;
// Draw the profile likelihood
RooPlot* mframe = w->var("m")->frame(-1,1) ;
pll.plotOn(mframe) ;
mframe->Draw() ;
Wouter Verkerke, NIKHEF
Advanced features – Task automation
• Support for routine task automation, e.g. goodness-of-fit study
• Support for routine task automation, e.g. a goodness-of-fit study
[Flow: input model → generate toy MC → fit model → repeat N times →
accumulate fit statistics → distributions of parameter values,
parameter errors and parameter pulls]
// Instantiate MC study manager
RooMCStudy mgr(inputModel) ;
// Generate and fit 100 samples of 1000 events
mgr.generateAndFit(100,1000) ;
// Plot distribution of sigma parameter
mgr.plotParam(sigma)->Draw()
Wouter Verkerke, UCSB
Documentation and macros
• Classes and methods
documented in code
 HTML on root web site
• Seventy complete
example macros in
$ROOTSYS/tutorials/roofit
• Users manual
– 80 pages (Oct 1) → 120 pages (Dec 1) → 180 pages
Wouter Verkerke, NIKHEF
Extending models to more than one dimension
• If you have data with many observables,
there are two common approaches
– Compactify information with test statistic (see previous section)
– Describe full N-dimensional distribution with a p.d.f.
• Choice of approach largely correlated with understanding
of correlation between observables and amount of
information contained in correlations
– No correlation between observables 
‘Big fit’ and ‘Compactification’ work equally well.
– Important correlations that are poorly understood 
Compactification preferred. Approach:
1. Compactify all-but-one observable (ideally uncorrelated with the compactified
observables)
2. Cut on compactification test statistic to reduce backgrounds
3. Fit remaining observable  Estimate from data remaining amount of background
(smallest systematic uncertainty due to poor understanding of test statistic and its
inputs)
– Important correlations that are well understood  Big fit preferred
Extending models to more than one dimension
• Bottom line: N-dim models used when either no
correlations or well understood correlations
• Constructing multi-dimensional models without
correlations is easy
– Just multiply N 1-dimensional p.d.f.s: F(x) * G(y) = M(x,y)
(see the sketch below)
– No complex issues with p.d.f. normalization: if the 1-dim p.d.f.s are
normalized then the product is also normalized by construction
Wouter Verkerke, NIKHEF
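A minimal RooFit sketch (not from the slides) of a 2-dimensional model
without correlations built as a product of 1-D p.d.f.s; the component
Gaussians and their parameters are assumed to exist:

RooGaussian gx("gx","gauss in x",x,mx,sx) ;
RooGaussian gy("gy","gauss in y",y,my,sy) ;
RooProdPdf model2d("model2d","gx*gy",RooArgList(gx,gy)) ;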
Writing multi-dimensional models with correlations
• Formulating N-dim models with correlations may seem
daunting, but it really isn’t!
– Simplest approach: start with a one-dimensional model and replace one
parameter p with a function p’(y) of another observable
– Yields the correct distribution of x for every given value of y:
F(x; m, s) → F’(x; [a0+a1·y], s)
– NB: the distribution of y is probably not correct…
Wouter Verkerke, NIKHEF
Writing multi-dimensional models with correlations
• Solution: see F’(x,y,p) as a conditional p.d.f. F’(x|y)
– The difference is in the normalization:
∫∫ F(x,y) dx dy = 1    versus    ∫ F(x|y) dx = 1 for each value of y
– Then multiply with a separate p.d.f. describing the distribution in y:
M ( x, y ) = F’( x | y ) · G ( y )
[Figure: F’(x|y) * G(y) = M(x,y)]
• Almost all modeling issues with correlations can be treated this way
– Iteration 1) Exponential missing-ET distribution of the ‘background’ is independent of the transverse mass
– Iteration 2) The slope depends linearly on MT → write a conditional pdf F(ET|MT)
– Iteration 3) Multiply F(ET|MT) with an empirical shape for MT
Visualization of multi-dimensional models
• Visualization of multi-dimensional models presents
some additional challenges w.r.t. 1-D
• Can show 2D,3D distribution
– Graphically appealing, but not so useful as you cannot overlay
model on data and judge goodness-of-fit
– Prefer to project on one dimension (there will be multiple choices)
– But plain projection discards a lot of information contained in both
model and data
Significance of signal
less apparent
Reason: Discriminating information in
y observable in both data and model is ignored
Visualizing signal projections of N-dim models
• Simplest solution, only show model and data in
“signal range” of observable y
– Significance shown in “range projection” much more in line with
that of 2D distribution
y[-10,10]
y[-2,2]
• Easy to define a “signal range” simple model above.
How about 6-dimensional model with non-trivial shape?
– Need generic algorithm  Likelihood ratio plot
Wouter Verkerke, NIKHEF
Likelihood ratio plots
• Idea: use information on S/(S+B) ratio in projected
observables to define a cut
• Example: generalize previous toy model
to 3 dimensions
• Express the information on the S/(S+B) ratio of the model in terms
of integrals over model components
LR( x, y, z ) = S( x, y, z ) / [ S( x, y, z ) + B( x, y, z ) ]
Integrate over x:
LR( y, z ) = ∫ S( x, y, z ) dx / ∫ [ S( x, y, z ) + B( x, y, z ) ] dx
Plot LR vs (y,z)
Wouter Verkerke, NIKHEF
Likelihood ratio plots
• Decide on s/(s+b) purity
contour of LR(y,z)
– Example s/(s+b) > 50%
• Plot both data and model
with corresponding cut.
– For data: calculate LR(y,z) for each event, plot only event with LR>0.5
– For the model: using a Monte Carlo integration technique:
∫_{LR(y,z)>0.5} M( x, y, z ) dy dz  ≈  (1/N) Σ_{D(y,z)} M( x, yi, zi )
where D(y,z) is a dataset with values of (y,z) sampled from the p.d.f.
and filtered for events that meet LR(y,z)>0.5
[Figure: projections for all events and for LR(y,z)>0.5 only]
Wouter Verkerke, NIKHEF
Multidimensional fits – Goodness-of-fit determination
• Goodness-of-fit determination of >1 D models is difficult
– The standard χ2 test does not work very well in N dimensions because of
the natural occurrence of a large number of empty bins
– Simple equivalent of (unbinned) Kolmogorov test in >1-D does
not exist
• This area is still very much a work in progress
– Several new ideas proposed but sometimes difficult to calculate,
or not universally suitable
– Some examples
• Cramer-von Mises (close to Kolmogorov in concept)
• Anderson-Darling
• ‘Energy’ tests
– No magic bullet here
– Some references to recent progress:
• PHYSTAT2001/2003/2005
Wouter Verkerke, UCSB
Significance & limits
—Null Hypothesis testing – P-values
— Classical or ‘frequentist’ confidence intervals
— Issues that arise in interpretation of fit result
— Bayesian statistics and intervals
Wouter Verkerke, UCSB
Testing significance
• Significance  Hypothesis testing
– Hypothesis H predicts pdf f(x|H) for a set of observations x
– We observe a single point in this space xobs
– What can we say about the validity of H in light of the data?
• Practical translation attempt #1
– H = SuperSymmetry (of some kind)
– Prediction: Nevt(8 jets, pT>20) = 10 (at x fb-1)
– Measurement: N=9. Did we discover SUSY?
• Practical translation attempt #2
– H = Standard Model
– Prediction: Nevt(8 jets, pT>20) = 3 (at x fb-1)
– Measurement: N=9. Did we disprove the Standard Model?
• Be sure to formulate the correct question!
– What we usually want to know is: what is the probability that the
‘background’ has a fluctuation that looks like our signal (or better),
i.e. attempt #2
– Attempt #1 quantifies the probability that nature with SUSY would
result in an experimental result consistent with no SUSY. You would
use this to set a limit
Calculating significance – P values
• Work out attempt #2 (aim to disprove Standard Model)
– ‘Signal’ = SUSY ‘Background’ = Standard model
Prediction: N(bkg) = 3        Measurement: N(sig+bkg) = 9
• Can we calculate the probability that the SM mimics SM+SUSY
(i.e. that the result is a ‘false positive’)?
– Calculation details depend on how the measurement was done (fit,
counting etc.)
– Simplest case: counting experiment, Poisson process
p = Σ_{n=9}^{∞} Poisson( n; μ=3 ) = 0.0038    = ‘p value’
Calculating significance – Z values
• One often defines the significance Z as the number of standard
deviations that a Gaussian variable would fluctuate in
one direction to give the same p-value.
– Conversion p ↔ Z via TMath::Prob and TMath::NormQuantile
E.g. Z = 5 (a ‘5 sigma effect’) means p = 2.87 × 10-7
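A minimal standalone ROOT/C++ sketch (not from the slides) reproducing the
p-value of the counting example above and its one-sided Z value; the header
names and exact printed digits are assumptions to be checked against your
ROOT version:

#include "TMath.h"
#include "Math/ProbFuncMathCore.h"
#include <cstdio>

int main() {
  // p-value of the counting experiment: P(n >= 9 | mu = 3)
  double p  = 1.0 - ROOT::Math::poisson_cdf(8, 3.0) ;   // ~0.0038
  double Z  = TMath::NormQuantile(1.0 - p) ;             // one-sided significance
  // and the reverse direction: p-value of a 5 sigma effect
  double p5 = ROOT::Math::normal_cdf_c(5.0) ;            // ~2.87e-7
  std::printf("p = %g  Z = %g  p(Z=5) = %g\n", p, Z, p5) ;
  return 0 ;
}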
At what p/Z value do we claim discovery?
• HEP folklore: claim discovery when p-value of
background only hypothesis is 2.87  10-7,
corresponding to significance Z = 5.
• This is very subjective and really should depend on the
prior probability of the phenomenon in question, e.g.,
– phenomenon          reasonable p-value for discovery
  D0-D0bar mixing     ~0.05
  Higgs               ~10-7 (?)
  Life on Mars        ~10-10
  Astrology           ~10-20
Wouter Verkerke, NIKHEF
Real-life problems
• Exact calculation of p-values/Z-values possible for
counting experiment with background that is exactly
known.
• Real-life problems are often more difficult
– Result is (ML) fit to complex distribution in >=1 dimensions,
not a counting experiment
– Background estimate has uncertainty
• Correct calculation of significance no longer guaranteed!
– Q: Is that a problem?
– A: Yes. If your (naïve) calculation says Z=5, but it is really Z=3,
there is a substantial chance your discovery is fake
– If ATLAS and CMS use different methods one experiment may
claim discovery of e.g. Higgs with only half the data of the other
because of differences in significance calculation
Wouter Verkerke, NIKHEF
Example of simple p-value calculation with calibration issues
• Pearson’s χ2 test
– Calculate the χ2 of the data with respect to the
null hypothesis function (e.g. a fit
with the signal component fixed to zero):
χ2 = Σ_{i=1}^{N} ( ni − fi(null) )2 / fi(null)
• P-value given by the tail integral of the χ2 distribution function
(TMath::Prob):
P( χ2; N ) = ∫_{χ2}^{∞} p( χ’2; N ) dχ’2
• Example: χ2 = 29.8 for N=20 d.o.f. → P(χ2) = 0.073 = P-value
• Warning: the P(χ2) probability interpretation is only valid for
a normal sampling distribution.
– Conversion of the χ2 value into a probability assumes that the distribution of
measured values follows the theoretical distribution.
Wouter Verkerke, NIKHEF
Pearsons c2 test – p value calculation
• How was the χ2 → p-value conversion done?
– The expected distribution of χ2 values for given N was taken from the
theoretical χ2 distribution (valid for normal sampling)
– The probability is determined from the tail integral of that distribution:
P( χ2; N=20 ) = ∫_{χ2}^{∞} p( χ’2; N=20 ) dχ’2
[Figure: p(χ2; N=20) with the integration region indicated]
• What happens if the actual distribution of χ2 values is not the
theoretical one?
Wouter Verkerke, NIKHEF
Pearsons c2 test – p value calculation
• You will get the wrong p-value
[Figure: actual distribution of χ2 values (log scale) compared to the
theoretical distribution]
• Calibration for a particular analysis is always possible
– Procedure: use actual distribution (obtained from e.g. toy MC
experiments) instead of theoretical distribution
Wouter Verkerke, NIKHEF
Pearsons c2 test – calibration considerations
• Result of calibration applied to example
– In p-values: 0.073 → 0.11
– In Z-values: 1.45 → 1.22
• Why was this particular example so bad?
– The χ2 test assumes Gaussian errors.
– Poisson errors on bins with N<10 deviate
quite strongly from the Gaussian assumption
• How much MC do you need to calibrate?
– You need roughly 100/p-value experiments to get reasonable accuracy in
the calibration at that p-value
– In this example p=0.1 → about 1000 toy experiments (generated data + fit
to data) are sufficient, quite doable
• How about Z=5?
– p = 2×10-7 → you need O(500 million) experiments! At 1 second per
experiment and 100 dedicated CPUs this still takes 2 months!
Calculating significance correctly at high Z
• Any method you use can be calibrated explicitly using
toy MC, but this may be prohibitively expensive at Z=5
• What to do with your Higgs fit with sizable background
uncertainty?
– Calibration will be specific to your problem, must calculate it
yourself
• You can always state your result in simpler
‘standardized’ form ignoring shape information:
counting experiment with sideband
– Signal region: Nobs = s+b = 11
– Sideband region: Nobs’ = tb = 25 (t=18/2)
– Less statistical sensitivity than full fit,
but simpler problem. May be able to
calculate high significance with more
reliability
Nuisance parameters
• Counting experiment returns two correlated variables
(s+b) and (tb)
– Example result (s+b) = 178 (± 13), tb=100 (± 10)
• Number of background events b from preceding
example is a nuisance parameter (if b is only known
with finite precision)
– Dealing with nuisance parameters is often a nuisance…
• Need to incorporate information on determination of b
in significance calculation in parameter of interest (s)
– Various ways to do this: Frequentist, Profile Likelihood, Bayesian
– Different tradeoffs in computational complexity, assumptions
made, easiness in interpretation
– No unique best solution!
• Will highlight one technique (Profile Likelihood)
Wouter Verkerke, NIKHEF
s = parameter of interest
b = all nuisance parameters
P-value profile likelihood ratio
• The profile likelihood ratio is constructed from the likelihood as
λ(s) = L( s, b̂̂(s) ) / L( ŝ, b̂ )
where the numerator uses the minimum of L in b for the given value of s,
and the denominator the global minimum of L
• The concept of the profile likelihood is easiest visualized in a fit to
a model with shape distributions
L( s, b, p, q ) = Π_{i=1}^{N} [ s · FS( xi; p ) + b · FB( xi; q ) ]
[Figure: model and data; -log(λ(s)) versus Nsig: the regular fit has
log(λ)=0 by definition, the fit with Nsig=0 (the deformation of the pdf that
best fits the data under the constraint s=0) defines Δ(-lnL)]
Wouter Verkerke, NIKHEF
P-value profile likelihood ratio – Counting with SB
• Calculate the significance assuming normal sampling
distributions:
Δ log L = ½ Z2
• Profile Likelihood works very well for the example with
shapes, but is not proven to be well calibrated at Z=5
• Can also apply the technique to a counting experiment with sideband:
L = Poisson( xobs | s + b ) · Poisson( x’obs | τ·b )
– Parameter of interest = s
– Nuisance parameter = b (τ is assumed to be known exactly)
– Advantage: standard form that can be tested / calibrated
independent of experimental details up to high Z
• Example result
– Nobs=178 (s+b), Nobs’=100 (b) → Z = 5.0
[Figure: recent comparison results from PhyStat 2007]
Wouter Verkerke, NIKHEF
Intermediate summary on significance
• Can make exact calculation for counting experiment of
Poisson process with known background (valid up to
high Z values)
• Calibration uncertainties arise at high Z with results
based on fits with shapes and/or problems with
uncertainties on backgrounds
– Can sidestep shape issue (at the expense of some loss in
statistical precision) by casting result as counting expt with
sideband
• Problem of uncertain background requires elimination of
nuisance parameters
– Various ways to do this (Frequentist, Bayesian, Likelihood based)
– No guarantee of good calibration at Z=5
– Profile likelihood seems to work well though…
Wouter Verkerke, NIKHEF
Significance calculations for Higgs
• NB: Most activity on significance calculations at LHC is
for Higgs
Wouter Verkerke, NIKHEF
Setting limits
• Return to intro example. If statistics are insufficient to
rule out SM at desired p-value, quantify at which
probability we can exclude SUSY given measurement
– What is probability SUSY is viable, given experimental data
• Example
– Counting experiment: nobs=9, expected background b=3
– Aim is quantify values of s that are allowed or excluded by above
observation at a certain confidence level
– Probability to observe ≤9 events with s=100 is
Σ_{i=0}^{9} Poisson( i; 100+3 ) = 7.3 × 10-33   → hypothesis s=100 clearly ruled out
– Probability to observe ≤9 events with s=10 is
Σ_{i=0}^{9} Poisson( i; 10+3 ) = 0.17           → hypothesis s=10 clearly not…
Wouter Verkerke, NIKHEF
Limits and confidence intervals
• Plot of the Poisson sum  Σ_{i=0}^{9} Poisson( i; sup+3 )  as a function of sup
[Figure: Poisson sum versus sup]
s<7 at (100%-33%=67%) confidence level
s<12 at 95% confidence level
s<30 at Z=5 confidence level
• ‘Classical’ or ‘Frequentist’ confidence interval
Wouter Verkerke, NIKHEF
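A minimal standalone ROOT/C++ sketch (not from the slides) scanning the
upper limit for the counting example above (nobs=9, b=3); the step size is
an arbitrary assumption:

#include "Math/ProbFuncMathCore.h"
#include <cstdio>

int main() {
  const double b = 3.0, cl = 0.95 ;
  double s = 0.0 ;
  // increase s until P(n <= 9 | s+b) falls to 1 - C.L.
  while (ROOT::Math::poisson_cdf(9, s + b) > 1.0 - cl) s += 0.01 ;
  std::printf("s < %.1f at %.0f%% C.L.\n", s, 100.*cl) ;   // close to the s<12 quoted above
  return 0 ;
}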
Classical confidence intervals
• What does (s < 12 @ 95% C.L.) mean?
• If you repeat a measurement many times
and calculate a confidence interval for each of them,
CL% of the time the true value will be contained in the
interval.
– Note that a frequentist confidence interval makes no statement about
the true value of x.
– For a given experiment and corresponding interval, the true value
either is or isn’t in the interval, no statement is made about that.
– It just says that if you repeat the experiment and interval calculation
many times, CL% of the time the true value is inside the interval
• Definition: ‘Coverage’ = Probability that true value is in
interval.
– Ideal limit construction procedure has exact coverage at target C.L. for
all true values of x.
• Back to SUSY example:
– Models with prediction of s >=12 are excluded at 95% C.L.
 If s(true)=12 then only 5% of experiments would result in 9 events
or less
More on coverage
• Over-coverage : probability to be in interval > C.L
– Resulting confidence interval is conservative
• Under-coverage : probability to be in interval < C.L
– Resulting confidence interval is optimistic
– Under-coverage is undesirable  You may claim discovery too early
• Exact coverage is difficult to achieve
– For Poisson process impossible due to discrete nature of event count
– “Calibration graph” for preceding example below
Wouter Verkerke, NIKHEF
One-sided versus two-sided intervals
• Previous limit (s<12 @ 95% C.L) is one-sided (upper limit)
• Can set a lower limit as well → constructed two-sided limit
– Choosing the limits such that the 33% outside the C.I. is evenly distributed on
both sides:
Σ_{i=10}^{∞} Poisson( i; slo+3 )    and    Σ_{i=0}^{9} Poisson( i; sup+3 )
s ∈ [3.2, 9.1] at 66% C.L.
Wouter Verkerke, NIKHEF
Note that definition of two-sided interval is ambiguous
• Resolve the ambiguity in the definition by requiring either
– 1) Symmetry: x-, x+ are equidistant from the mean
– 2) Shortest interval: choose x-, x+ such that |x- − x+| is smallest
– 3) Central:
∫_{-∞}^{x-} P(x) dx = ∫_{x+}^{∞} P(x) dx = ( 1 − C.L. ) / 2
→ Most sensible (shown on the previous page)
– For Gaussian sampling distributions all 3 requirements result in the
same confidence interval
Wouter Verkerke, UCSB
Construction of the confidence belt
• You can visualize the 2-sided interval calculation procedure
through the confidence belt
• Example: s+b counting (Poisson) with b known
– Calculate the limit [s-,s+] at C.L. γ for each value of xobs given the known b
(from the low-side and high-side Poisson sums)
– Results in functions s-(xobs) and s+(xobs)
[Figure: s-(xobs) and s+(xobs) versus the observed number of events]
Wouter Verkerke, NIKHEF
The confidence belt
• The confidence belt for various fixed values of b
• Confidence interval construction from belt
– Find corresponding xobs
– Read off interval
Confidence belt for b=18
Central C.I [S-,S+] 95%
Confidence belt for b=9
Central C.I [S-,S+] 95%
Central C.I [S-,S+] 95%
Confidence belt for b=4.5
xobs
Wouter Verkerke, NIKHEF
• Note 1 – Belt depends (strongly) on b  uncertainty on b
will impact result  nuisance parameters
• Note 2 – Belt returns intervals with negative boundaries
The confidence belt – toy MC approach
• You can also make the confidence belt with toy MC events
(for cases where no analytical calculation is possible)
Each point measurement xobs from
a MC dataset generated with Xtrue
Intervals that contains 68%
of values of xobs for each Xtrue
Xtrue
Xtrue
Xtrue=0
xobs
‘Confidence Belt’
xobs
x
Confidence interval in the presence of nuisance parameters
• If amount of background is not exactly known, e.g.
b = 4.5 ± 1, calculation of C.I. much more complicated
– Must eliminate nuisance parameter(s) in construction of interval
– Same problem as with significance calculation with nuisance
parameters
• Multiple solutions for elimination of b in confidence
intervals
– Frequentist – Neyman construction. In general very difficult
(computationally). Will not offer details here. Otherwise, Feldman-
Cousins.
– Bayesian – Integrate L(s,b) over b.
More straightforward (but no concept of coverage). More in a moment
– Profile Likelihood – Take L(s,b’’)/L(s,b’)
Popular (easy-to-do, seems to have good coverage in many
applications, but not guaranteed!)
– Hybrids (e.g. Cousins-Highland a.k.a ZN) ZN has recently been shown
to have coverage problems
Wouter Verkerke, NIKHEF
Intervals from profile likelihood ratio
• The profile likelihood ratio is constructed from the likelihood as
λ(s) = L( s, b̂̂(s) ) / L( ŝ, b̂ )
where the numerator uses the minimum of L in b for the given value of s,
and the denominator the global minimum of L
• Use again the likelihood expression for the counting experiment
with sideband
L = Poisson( xobs | s + b ) · Poisson( x’obs | τ·b )
• Look for the interval defined by an
increase of the profile likelihood
ratio by a set amount
– E.g. Δ(-ln λ) = ½ Z2
[Figure: -ln λ(s) versus s; the interval is where Δ(-ln λ) ≤ ½ Z2]
Intervals from profile likelihood ratio
• Can also use the likelihood from a model with shapes for
signal and background:
L( s, b, p, q ) = Π_{i=1}^{N} [ s · FS( xi; p ) + b · FB( xi; q ) ]
• Corresponding likelihood ratio in s
– Note the larger number of nuisance parameters:
p, q, in addition to b
λ(s) = L( s, b̂̂(s), p̂̂(s), q̂̂(s) ) / L( ŝ, b̂, p̂, q̂ )
– Note that both the numerator and the
denominator of λ(s) can be calculated
quite efficiently with MINUIT minimization
– But: no guarantee that the Δ(-ln λ) = ½ Z2
interpretation in terms of significance
is correct → requires explicit MC calibration
for Z >~ 1 = expensive
[Figure: -ln λ(s) with the confidence interval indicated]
Link between MINOS errors and profile likelihood
Nuisance parameter
Parameter of interest
• Note that MINOS algorithm in
MINUIT gives same errors as
Profile Likelihood Ratio
– MINOS errors is bounding box
around l(s) contour
– Profile Likelihood = Likelihood
minimized w.r.t. all nuisance
parameters
Wouter Verkerke, NIKHEF
Two-sided intervals – connection to parameter estimation
• MINOS errors have explicit connection to intervals
estimated with profile likelihood method
• In limit of normal distribution, various interval setting
methods give the same answer (Profile Likelihood,
Confidence Belt, Bayesian)
– E.g HESSE errors from 2nd derivate give same errors as MINOS
Wouter Verkerke, NIKHEF
Interpreting your results
• Classical intervals well defined, but neatly avoid
statements on true value and can include values that
are unphysical (e.g. NHiggs<0)
“-10 < Nsig < +5 at 68% C.L.”  ?  “Nsig is my number of Higgs
decays so it must be ≥ 0.”
• How do you deal with this?
– In this particular case the problem arises in the interpretation step.
– The observables Nobs=(s+b) and Nobs’=(τb) are always ≥ 0
– Nsig can be negative if Nobs < Nobs’ (for τ=1)
Wouter Verkerke, NIKHEF
Bayesian statistics – Decision making
• Bayesian statistics interprets probabilities very differently
from Frequentist statistics
– It provides a natural framework to include prior beliefs (such as Nsig>0)
• Essential Bayesian formulas:
Notation of conditional probabilities:
p(A|B) is the probability of A given B
Bayes Theorem:
p( A | B ) = p( B | A ) p( A ) / p( B )
Say ‘A’ = ‘theory’, ‘B’ = ‘exp. result’
Law of Total Probability:
p( res ) = Σ_i p( res | the_i ) p( the_i )
Combined:
p( the | res ) = p( res | the ) p( the ) / Σ_i p( res | the_i ) p( the_i )
Wouter Verkerke, UCSB
Bayesian statistics – Decision making
• How to read this formula
p( the | res ) = p( res | the ) p( the ) / Σ_i p( res | the_i ) p( the_i )
– P(res|the) = your measurement:
the probability of an experimental result
under a given theoretical hypothesis
– P(the) = prior belief
(e.g. Nsig>0)
– The denominator is a normalization term
– P(the|res) = your new belief in
the theory, given the just obtained
experimental result → ‘interpretation’
Wouter Verkerke, UCSB
Bayesian statistics – Medical example
• Medical example: P(disease) = 0.001
– Prior belief (your input theory)
– E.g. based on population average
• Consider test for disease, result is either + or –
– P(+|+) = 0.98
– Prob that test will correctly ID disease
– P(-|+) = 0.02
– Prob that test will give false negative
– P(+|-) = 0.03
– Prob that test will give false positive
– P(-|-) = 0.97
– Prob that test will correctly ID absence of disease
• Suppose you test positive – should you be worried?
P( | ) 
P( | ) P(disease )
0.90  0.001

 0.032
P( | ) P()  P( | ) P() 0.98  0.001  0.03  0.999
– Posterior belief is 0.032, larger than initial belief but still not
very large!
Wouter Verkerke, UCSB
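A minimal standalone C++ sketch (not from the slides) of the Bayes-theorem
arithmetic in the medical example above:

#include <cstdio>

int main() {
  const double prior   = 0.001 ;   // P(disease)
  const double pPlusD  = 0.98 ;    // P(+ | disease)
  const double pPlusND = 0.03 ;    // P(+ | no disease)
  double posterior = pPlusD*prior / (pPlusD*prior + pPlusND*(1.0-prior)) ;
  std::printf("P(disease | test +) = %.3f\n", posterior) ;   // ~0.032
  return 0 ;
}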
Bayesian statistics – Medical example
P( | ) 
P( | ) P(disease )
0.90  0.001

 0.032
P( | ) P()  P( | ) P() 0.98  0.001  0.03  0.999
• Medical example deals with simple hypothesis (true or
false)
• In physics we often deal with composite hypothesis
– I.e. our hypothesis has parameters
– We will use Probability Density Functions as function of vector of
parameters a rather than with total probabilities, i.e.
p( the )        →  p( the; a )
p( res | the )  →  p( res | the; a )
Wouter Verkerke, UCSB
Physics Example – Measurement of parameter Q = Q0 ± σ(Q)
p( the | res; Q ) = p( res | the; Q ) p( the; Q ) / Σ_i p( res | the_i; Q ) p( the_i; Q )
[Figure: initial belief on Q, the ‘prior’ (we know nothing)  ×
measurement of Q = Q0 ± σ(Q), a Gaussian PDF with mean Q0 and width σ(Q)
=  posterior belief on Q, the product of prior belief and measurement]
Bayesian 68% interval = area that integrates 68%
of the posterior Bayesian distribution
(resolve the ambiguity in the definition in the
same way as for a frequentist confidence interval)
NB: In this Gaussian example the Bayesian interval is the same as the
Frequentist interval
Wouter Verkerke, UCSB
Bayesian Physics Example – Incorporating any measurement
p( the | res; Q ) = p( res | the; Q ) p( the; Q ) / Σ_i p( res | the_i; Q ) p( the_i; Q )
[Figure: initial belief on Q, the ‘prior’ (we know nothing)  ×
measurement of Q from a ML fit  =  posterior belief on Q,
the product of prior belief and measurement]
Very practical aspect of Bayesian analysis:
measurement of Q = the likelihood distribution from the fit!
Wouter Verkerke, UCSB
Including prior knowledge – Using a non-trivial prior
p( the | res; Q ) = p( res | the; Q ) p( the; Q ) / Σ_i p( res | the_i; Q ) p( the_i; Q )
[Figure: new initial belief on Q (it must be >0, values below 0 are now a
priori forbidden/excluded)  ×  measurement of Q  =  posterior belief on Q,
the product of prior belief and measurement; the Bayesian interval
changes to take the initial belief into account]
The Bayesian interval with this prior will be different from the
Frequentist interval
Wouter Verkerke, UCSB
Using priors in Bayesian intervals
• You can in fact use any function as prior.
– Preceding example used step function to prohibit certain values
– Can also represent prior measurement with a Gaussian prior 
Posterior likelihood will represent combined result
• But some care in interpretation is required.
– You always have a prior in Bayesian inference.
No prior specified = flat prior.
– Is a flat prior equivalent to complete ignorance? Maybe not: flat
prior on parameter p implies non-flat prior on p2. Perhaps p2 is
your physical observable…
Wouter Verkerke, NIKHEF
Bayesian inference
• Practical example of ‘flat prior’ choice problems…
– Measurement of the CP violation parameter γ from B → DK.
– The physics observable of interest is γ = phase of Vub (CKM matrix)
– But the measurement is made from the interference of two processes with
a relative amplitude r. If r is large (max=1.0) there is maximum
statistical sensitivity to γ. If r is small, little sensitivity to γ.
– The measurement behaves like the
measurement of a vector
with length r and angle γ
[Figure: vector of length r at angle γ in the (x,y) plane]
– Choose a flat prior in (r,γ)
= polar coordinates
– Or choose a flat prior in
(x=r·cosγ, y=r·sinγ)
= Cartesian coordinates
– A case can be made for
both, and the resulting limit
on γ is quite different…
Wouter Verkerke, NIKHEF
Bayesian limits – Nuisance parameters, coverage
• In Bayesian inference there is a well defined procedure
for dealing with nuisance parameters
– Integrate the likelihood over the nuisance parameters:
L( s ) = ∫ L( s, b ) db
(NB: the Profile Likelihood instead uses L( s ) = L( s, b̂̂(s) ) / …)
– Note that integration over >>1 nuisance parameters leads to very
unpleasant numeric integration problems (Markov Chain MC can
help)
• There is no concept of coverage in Bayesian theory
– Bayesian intervals do not (necessarily) have good coverage
• There is no absolute calibration of probabilities
– Coverage = absolute calibration of classical intervals
– Bayesian probability is a ‘degree of belief’
– Decision theory: p(A)>p(B), therefore take action assuming A
Avoiding C.I.’s with negative limits without Bayes
• It is possible to avoid C.I.s like [-10,30] without using
Bayesian inference
– Idea: Use classical confidence intervals but switch from two-sided
interval N[X,Y] to one-sided interval N<X ‘in time’
– Easier said than done: Arbitrary criteria when to switch will spoil
coverage properties of classical intervals  Must be very careful
• Proper way described by Feldman & Cousins
– Example: ‘A Unified Approach to the Classical Statistical Analysis
of Small Signals’, Gary J Feldman and Robert D Cousins [ PRD 57,
3873 (1998) ]
– Note that FC intervals still make no statement about true value!
• Does it help to avoid C.I.s with negative lower bounds?
– Sometimes. An intrinsic problem with 1-sided intervals remains: they
are difficult to average a posteriori. E.g. given two results A,B
• NA = 15 ± 13, NB = 10 ± 7  →  NA+B = 13 ± 6
• NA < 28, NB < 17  →  NA+B = ???
Wouter Verkerke, NIKHEF
Classical, Likelihood, Bayesian intervals
• Frequentist confidence intervals
– Provide ‘summary of information content’ of measurement
– No interpretation of result is made  Intervals may include values
deemed unphysical -
• Bayesian intervals
– Support physical interpretation of result.
– Provides easy framework for incorporating physical constraints etc
(these are all ‘prior’ beliefs)
– Priors, interpretation are mixed blessing (c.f. flat priors, d.o.b)
• Likelihood (PL ratio) intervals
– Good coverage properties for simple cases
– Easy handling of nuisance parameters
– Really considered as a 3rd way in recent days
– Popular in HEP statistics these days
• Note that issues and divergences in outcome are most
dramatic at high Z (e.g. a 5σ ‘discovery’); at Z=1 all 3
methods usually give very similar answers
Wouter Verkerke, NIKHEF
Systematic errors
—
—
—
—
—
Sources of systematic errors
Sanity checks versus systematic error studies
Common issues in systematic evaluations
Correlations between systematic uncertainties
Combining statistical and systematic error
Wouter Verkerke, UCSB
Systematic errors vs statistical errors
• Definitions
Statistical error = any error in the measurement due to
statistical fluctuations in the data
Systematic errors = all other errors
Systematic uncertainty ≡ Systematic error
• But Systematic error ≠ Systematic mistake!
– Suppose we know our measurement needs to be
corrected by a factor of 1.05 ± 0.03
– Not correcting the data by the factor 1.05 introduces
a systematic mistake
– Right thing to do: correct the data by the factor 1.05
and take the uncertainty on the factor (0.03) as a systematic error
Wouter Verkerke, UCSB
Source of systematic errors – ‘Good’ and ‘Bad’ errors
• ‘Good’ errors arise from clear causes and can be evaluated
– Clear cause of error
– Clear procedure to identify and quantify error
– Example: Calibration constants,
efficiency corrections from simulation
• ‘Bad’ errors arise from clear causes, but can not be evaluated
– Still clear cause
– But no unambiguous procedure to quantify uncertainty
– Example: theory error:
• Given 2 or more choices of theory model you get 2 or more different answers.
• What is the error?
Wouter Verkerke, UCSB
Sources of systematic errors – ‘Ugly’ errors
• ‘Ugly’ errors arise from sources that have been overlooked
– Cause unknown  error unquantifiable
• ‘Ugly’ errors are usually found through failed sanity checks
– Example: measurement of CP violation on a sample of events that is
known to have no CP violation: you find ACP = 0.10 ± 0.01
– Clearly something is wrong – What to do?
– 1) Check your analysis
– 2) Check your analysis again
– 3) Phone a friend
– 4) Ask the audience
…
– 99) Incorporate as systematic error
as last and desperate resort!
Wouter Verkerke, UCSB
What about successful sanity checks?
• Do not incorporate successful checks in your systematic
uncertainty
– Infinite number of successful sanity checks would otherwise lead
to infinitely large systematic uncertainty. Clearly not right!
• Define beforehand if a procedure is a sanity check
or an evaluation of an uncertainty
– If the outcome of the procedure can legitimately be different from zero, it
is a systematic uncertainty evaluation
– If the outcome of the procedure can only be significantly different from zero
due to a mistake or unknown cause, it is a sanity check
Wouter Verkerke, UCSB
Common scenarios in evaluating systematic errors
• Two values – corresponding to use of two (theory) models A,B
– What is a good estimate for your systematic uncertainty?
• I) If A and B are extreme scenarios, and the truth must
always be between A and B
– Example: fully transverse and fully longitudinal polarization
– The error is the root of the variance of a uniform distribution of width |A-B|:
σ = |A − B| / √12
since for a uniform distribution on [0,1]:
V(x) = <x2> − <x>2 = ∫01 x2 dx − (½)2 = 1/3 − 1/4 = 1/12
– Popular method because sqrt(12) is quite small, but only justified if A,B are
truly extremes!
• II) If A and B are typical scenarios
– Example: JETSET versus HERWIG (different physics simulation packages)
– The error is the difference divided by sqrt(2), where a factor
√(N/(N−1)) = √2 is included to get an unbiased estimate of the parent spread:
σ = ( |A − B| / 2 ) · √2 = |A − B| / √2
Wouter Verkerke, UCSB
Common scenarios in evaluating systematic errors
• Two variations of the analysis procedure on the same data
– Example: a fit with two different binnings giving A ± σA and B ± σB
– Clearly, the results A,B are correlated, so
|A − B| / sqrt( σA2 + σB2 )
is not a good measure of the smallness of the error
• Generally difficult to calculate, but one can estimate
upper and lower bounds on the systematic uncertainty:
sqrt(σA2 − σ02) − sqrt(σB2 − σ02)  ≤  σA-B  ≤  sqrt(σA2 − σ02) + sqrt(σB2 − σ02)
– where σA > σB and σ0 is the Minimum Variance Bound, σ02(â) = <(â − <â>)2>
– If the better technique (B) saturates the MVB, the range reduces to
σA-B2 = σA2 − σB2
– If the MVB is not saturated (e.g. you have low statistics) you will need to
use a toy Monte Carlo technique to evaluate σA-B     Wouter Verkerke, UCSB
Common scenarios in evaluating systematic errors
• Perhaps most common scenario in HEP analysis:
you need to assign systematic uncertainty to
(in)accuracy of full Monte Carlo simulation
• Popular technique: ‘Cut variation’
– Procedure: vary each of your cuts by a little bit. For each change,
1) Measure new yield on data
2) Correct with new MC efficiency.
3) Difference between efficiency corrected results is systematic
uncertainty.
– Example: for a nominal cut in x at ‘p’ you find N(data)=105, with
a MC efficiency eMC=0.835, so that N(corrected)=125.8
   cut      N(data)   e(MC)   N(corrected)
   p        105       0.835   125.8
   p+Δp     110       0.865   127.2
   p-Δp     100       0.803   124.5
σsys = (127.2 − 124.5) / 2 = 1.4     →     x = 125.8 ± 1.4
Wouter Verkerke, UCSB
Common scenarios in evaluating systematic errors
• Warning I: Cut variation does not give a precise
measure of the systematic uncertainty due to data/MC
disagreement!
– Your systematic error is dominated by a potentially large statistical
error from the small number of events in data between your two
cut alternatives
• This holds independent of your MC statistics
– You could see a large
statistical fluctuation
 error overestimated
– You could see no change due
to a statistical fluctuation
 error underestimated
Wouter Verkerke, UCSB
Common scenarios in evaluating systematic errors
• Warning II: Cut variation doesn’t catch all types of
data/MC discrepancies that may affect your analysis
– Error may be fundamentally underestimated
– Example of a discrepancy missed by cut variation:
[Figure: data versus simulation distribution with the nominal and
alternate cuts indicated. Data and simulation give the same efficiency
for the nominal and the alternate cut, so zero systematic is evaluated
(in the limit N→∞), even though data and MC are clearly different]
Cut variation is a good sanity check,
but not necessarily a good estimator for the systematic uncertainty
Wouter Verkerke, UCSB
Systematic errors and correlations
• Pay attention to correlations between systematic errors
σf2 = (df/dx)2 σx2 + (df/dy)2 σy2 + 2 (df/dx)(df/dy) ρ σx σy
• If the errors are uncorrelated, ρ=0
– Add in quadrature
• If the errors are 100% correlated, then ρ=1.
– E.g. tracking efficiency uncertainty per track, for 3 tracks:
σ3trk = σtrk + σtrk + σtrk = 3 σtrk
(not √3 σtrk)
• If the errors are 100% anti-correlated, then ρ=-1
– This can really happen!
– Example: BF(D*0 → D0π0) = 67% and BF(D*0 → D0γ) = 33%
Wouter Verkerke, UCSB
Combining statistical and systematic uncertainty
• Systematic error and statistical error are independent
– They can be added in quadrature to obtain the combined error
– Nevertheless always quote them (also) separately!
– This is also a valid procedure if the systematic error is not Gaussian:
variances can be added regardless of their shape
– The combined error is usually approximately Gaussian anyway (C.L.T.)
• Combining errors a posteriori is not the only option
– You can include any systematic error directly in your χ2 or ML fit
by adding a penalty term for a nuisance parameter p = p0 ± σp :
In a χ2 fit:    χ2 = χ2nom + ( (p − p0) / σp )2
In a ML fit:    −ln L = −ln Lnom + ½ ( (p − p0) / σp )2
– Or, for multiple uncertainties with correlations:
χ2pen = pT V-1 p    ;    −ln Lpen = ½ ( pT V-1 p )
Wouter Verkerke, UCSB
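A minimal RooFit sketch (not from the slides) of including such a Gaussian
penalty term for a nuisance parameter directly in a ML fit; 'model' and
'data' are assumed to exist, the values mirror the 1.05 ± 0.03 correction
example above, and the Constrain() argument assumes a reasonably recent
RooFit version:

RooRealVar p("p","correction factor (nuisance parameter)",1.05,0.9,1.2) ;
RooRealVar p0("p0","measured central value",1.05) ;
RooRealVar sp("sp","uncertainty on p",0.03) ;
RooGaussian constr("constr","Gaussian penalty term on p",p,p0,sp) ;
// Multiply the constraint into the model and include it in the likelihood
RooProdPdf modelc("modelc","model with constraint",RooArgList(model,constr)) ;
modelc.fitTo(*data, RooFit::Constrain(RooArgSet(p))) ;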
The End
• These slides will be available on the agenda server of
the school
• Some books for further reading
– R. Barlow, Statistics: A Guide to the Use of Statistical Methods in
the Physical Sciences, Wiley, 1989
– L. Lyons, Statistics for Nuclear and Particle Physics, Cambridge
University Press,
– G. Cowan, Statistical Data Analysis, Clarendon, Oxford, 1998
• Some presentations with more details
– Statistics – G. Cowan
http://www.pp.rhul.ac.uk/~cowan/mva_lectures.html
– TMVA – A. Hoecker
http://indico.in2p3.fr/materialDisplay.py?contribId=18&materialId=slides&confId=750
– RooFit – W. Verkerke
http://indico.in2p3.fr/materialDisplay.py?contribId=15&materialId=slides&confId=750
– Boosted Decision Trees – Y. Coadou
http://indico.in2p3.fr/materialDisplay.py?contribId=14&materialId=slides&confId=750
Wouter Verkerke, UCSB