Data Modeling

Download Report

Transcript Data Modeling

RooFit Tutorial – Topical Lectures June 2007
Tutorial by Wouter Verkerke
Presented by Max Baak
RooFit: Your toolkit for data modeling
What is it?
• A powerful toolkit for modeling the expected distribution(s) of
events in a physics analysis
• Primarily targeted to high-energy physicists using ROOT
• Originally developed for the BaBar collaboration by Wouter
Verkerke and David Kirkby.
• Included with ROOT v5.xx
Documentation:
• http://root.cern.ch/root/Reference.html – for latest class
descriptions. RooFit classes start with “Roo”.
• http://roofit.sourceforge.net – for documentation and tutorials
• Bug Wouter 
RooFit purpose - Data Modeling for Physics Analysis

Distribution of observables 
x
Define data model

Probability Density Function F(x; p, q)

• Physical parameters of interest p

• Other parameters q to describe
detector effect (resolution,efficiency,…)
• Normalized over
 allowed range of the


observables x w.r.t the parameters p and q
Fit model to data

Determination of p,q
 
Implementation – Add-on package to ROOT
Shared library: libRooFit.so
Data Modeling
ToyMC data
Generation
Model
Visualization
Data/Model
Fitting
MINUIT
C++ command line
interface & macros
Data management &
histogramming
I/O support
Graphics interface
Data modeling - Desired functionality
Building/Adjusting Models
Analysis
cycle
 Easy to write basic PDFs ( normalization)
 Easy to compose complex models (modular design)
 Reuse of existing functions
 Flexibility – No arbitrary implementation-related restrictions
Using Models
 Fitting : Binned/Unbinned (extended) MLL fits, Chi2 fits
 Toy MC generation: Generate MC datasets from any model
 Visualization: Slice/project model & data in any possible way
 Speed – Should be as fast or faster than hand-coded model
Data modeling – OO representation
• Mathematical objects are represented as C++ objects
Mathematical concept
x, p

f (x )
function
  
F ( x ; p, q )
PDF

space point
x
x
variable
RooFit class
RooRealVar
RooAbsReal
RooAbsPdf
RooArgSet
max
integral
 f ( x)dx
xmin
list of space points
xk
RooRealIntegral
RooAbsData
Model building – (Re)using standard components
• RooFit provides a collection of compiled standard PDF classes
Physics inspired
RooBMixDecay
ARGUS,Crystal Ball,
Breit-Wigner, Voigtian,
B/D-Decay,….
RooPolynomial
RooHistPdf
Non-parametric
RooArgusBG
Histogram, KEYS
RooGaussian
Basic
Gaussian, Exponential, Polynomial,…
PDF Normalization
• By default RooFit uses numeric integration to achieve normalization
• Classes can optionally provide (partial) analytical integrals
• Final normalization can be hybrid numeric/analytic form
Model building – (Re)using standard components
• Most physics models can be composed from ‘basic’ shapes
RooBMixDecay
RooPolynomial
RooHistPdf
RooArgusBG
RooGaussian
+
RooAddPdf
Model building – (Re)using standard components
• Most physics models can be composed from ‘basic’ shapes
RooBMixDecay
RooPolynomial
RooHistPdf
RooArgusBG
RooGaussian
*
RooProdPdf
Model building – (Re)using standard components
• Building blocks are flexible
– Function variables can be functions themselves
– Just plug in anything you like
– Universally supported by core code
(PDF classes don’t need to implement special handling)
m(y;a0,a1)
g(x;m,s)
RooPolyVar m(“m”,y,RooArgList(a0,a1)) ;
RooGaussian g(“g”,”gauss”,x,m,s) ;
g(x,y;a0,a1,s)
Model building – Expression based components
•
RooFormulaVar – Interpreted real-valued function
– Based on ROOT TFormula class
– Ideal for modifying parameterization of existing compiled PDFs
RooBMixDecay(t,tau,w,…)
RooFormulaVar w(“w”,”1-2*D”,D) ;
•
RooGenericPdf – Interpreted PDF
– Based on ROOT TFormula class
– User expression doesn’t
need to be normalized
– Maximum flexibility
RooGenericPdf f("f","1+sin(0.5*x)+abs(exp(0.1*x)*cos(-1*x))",x)
Using models – Fitting options
• Fitting interface is flexible and powerful, many options supported
Data type
Binned
Unbinned
Sample interactive MINUIT session
RooNLLVar nll(“nll”,”nll”,pdf,data) ;
RooMinuit m(nll) ;
Weighted unbinned
Goodness-of-fit
measure
-log(Likelihood)
Extended –log(L)
Chi2
m.hesse() ;
x.setConstant() ;
y.setVal(5) ;
m.migrad() ;
m.minos()
Access any of MINUITs
minimization methods
Change and fix param. values,
using native RooFit interface
during fit session
RooFitResult* r = m.save() ;
User Defined
(add custom/penalty
terms to any of these)
Interface
One-line: RooAbsPdf::fitTo(…)
Interactive: RooMinuit class
Output
Modifies parameter objects of PDF
Save snapshot of initial/final parameters,
correlation matrix, fit status etc…
Using models – Fitting speed & optimizations
• Benefit of function optimization traditionally a trade-off between
– Execution speed (especially in fitting)
– Flexibility/maintainability of analysis user code
•
Optimizations usually hard-code assumptions…
• Evaluation of –log(L) in fits lends it well to optimizations
– Constant fit parameters often lead to higher-level constant PDF components
– PDF normalization integrals have identical value for all data points
– Repetitive nature of calculation ideally suited for parallelization.
• RooFit automates analysis and implementation of optimization
– Modular OO structure of PDF expressions facilitate automated introspection
• Find and pre-calculate highest level constant terms in composite PDFs
• Apply caching and lazy evaluation for PDF normalization integrals
• Optional automatic parallelization of fit on multi-CPU hosts
– Optimization concepts are applied consistently and completely to all PDFs
– Speedup of factor 3-10 typical in realistic complex fits
• RooFit delivers per-fit tailored optimization without user overhead!
Using models – Plotting
•
RooPlot – View of 1 datasets/PDFs projected on the same dimension

Create the view on mes
RooPlot* frame = mes.frame() ;

Project the data on the mes view
data->plotOn(frame) ;

Project the PDF on the mes view
pdf->plotOn(frame) ;

Project the bkg. PDF component
pdf->plotOn(frame,Components(“bkg”))
 Draw the view on a canvas
frame->Draw() ;
Axis labels auto-generated
Using models - Overview
• All RooFit models provide universal and complete
fitting and Toy Monte Carlo generating functionality
– Model complexity only limited by available memory and CPU power
• models with >16000 components, >1000 fixed parameters
and>80 floating parameters have been used (published physics result)
– Very easy to use – Most operations are one-liners
Fitting
Generating
data = gauss.generate(x,1000)
RooAbsPdf
gauss.fitTo(data)
RooDataSet
RooAbsData
Advanced features – Task automation
• Support for routine task automation, e.g. goodness-of-fit study
Input model
Generate toy MC
Fit model
Repeat
N times
// Instantiate MC study manager
RooMCStudy mgr(inputModel) ;
// Generate and fit 100 samples of 1000 events
mgr.generateAndFit(100,1000) ;
// Plot distribution of sigma parameter
mgr.plotParam(sigma)->Draw()
Accumulate
fit statistics
Distribution of
- parameter values
- parameter errors
- parameter pulls
RooFit users tutorial
The basics
Probability density functions & likelihoods
The basics of OO data modeling
The essential ingredients: PDFS, datasets, functions
Outline of the hands-on part
1. Guide you through the fundamentals of RooFit
2. Look at some sample composite data models
1. Still quite simple, all 1-dimensional
3. Try to do at least one ‘advanced topic’
1. Tutorial 8: Calculating the P-value of your analysis
–
P-Value = How often does an equivalent data sample with no signal mimic
the signal you observe
2. Tutorial 9: Fit the top mass
•
Copy tutorial.tar.gz from /project/atlas/users/mbaak/TPL/
–
Untar tutorial.tar in your home directory
–
Contents of the tutorial setup
tutorial/docs/roofit_tutorial.ppt  This presentation
/macros
 Macros to be used in this tutorial
http://root.cern.ch/root/html514/ClassIndex.html
Open in your favorite browser
Loading RooFit into ROOT
• >source setup.csh (in the tutorial/ directory)
• Make sure libRooFit.so is in $ROOTSYS/lib
• Start ROOT
• In the ROOT command line load the RooFit library
gSystem->Load(“libRooFit”) ;
– Or start root in the tutorial/ directory.
Creating a variable – class RooRealVar
• Creating a variable object
RooRealVar mass(“mass”,“m(e+e-)”,0,1000) ;
C++ name
Name
Title
– Every RooFit objects must have a unique name!
Allowed range
Creating a probability density function
• First create the variables you need
Allowed range
RooRealVar mass(“mass”,“mass”,-10,10) ;
RooRealVar mean(“mean”,“mean”,0.0,-10,10) ;
RooRealVar width(“width”,“sigma”,3.0,0.1,10.) ;
Initial value
Allowed range
• Then create a function object
RooGaussian gauss(“gauss”,”Gaussian”,
mass, mean, width) ;
– Give variables as arguments to link variables to function
Making a plot of a function
• First create an empty plot
RooPlot* frame = mass.frame() ;
– A frame is a plot associated with a RooFit variable (mass in this case)
• Draw the empty plot on a ROOT canvas
frame->Draw()
Plot range taken from limits of x
Making a plot of a function (continued)
• Draw the (probability density) function in the frame
gauss->plotOn(frame) ;
• Update the frame in the ROOT canvas
frame->Draw()
Axis label from gauss title
Unit
normalization
Interacting with objects
• Changing and inspecting variables
sigma.getVal() ;
(const Double_t) 3.00
sigma = 1.0 ;
sigma.getVal() ;
(const Double_t) 1.00
• Draw another copy of gauss
gauss->plotOn(frame) ;
frame->Draw()
Inspecting composite objects
• Inspecting the structure of gauss
gauss->printCompactTree() ;
0x10b95fc0 RooGaussian::gauss (gauss) [Auto]
0x10b90c78 RooRealVar::x (x)
0x10b916f8 RooRealVar::mean (mean)
0x10b85f08 RooRealVar::sigma (sigma)
• Inspecting the contents of frame
frame->Print(“v”)
RooPlot::frame(10ba6830): "A RooPlot of "x""
Plotting RooRealVar::x: "x"
Plot contains 2 object(s)
(Options="L") RooCurve::curve_gaussProjected: "Projection of gauss"
(Options="L") RooCurve::curve_gaussProjected: "Projection of gauss"
Data
• Unbinned data is represented by a RooDataSet object
• Class RooDataSet is RooFit interface to ROOT class TTree
RooDataSet
RooRealVar y
RooRealVar x
RooDataSet associates
a RooRealVar with
column of a TTree
TTree
row
x
y
1
0.57 4.86
2
5.72 6.83
3
2.13 0.21
4
10.5 -35.
5
-4.3 -8.8
Association by matching TTree
Branch name with RooRealVar
name
Creating a dataset from a TTree
• First open file with TTree
TFile f(“tut0.root”) ;
macros/tut0.root
f.ls() ;
root [1] .ls
TFile**
tut1.root
TFile*
tut1.root
KEY: TTree
xtree;1 xtree
• Create RooDataSet from tree
RooDataSet data(“data”,”data”,x,xtree) ;
RooFit Variable in dataset
Imported TTree
Drawing a dataset on a frame
• Create new plot frame, draw RooDataSet on frame,
draw frame
RooPlot* frame2 = x.frame() ;
data->plotOn(frame2) ;
frame2->Draw() ;
Note Poisson Error bars
Overlaying a PDF curve on a dataset
• Add PDF curve to frame
pdf->plotOn(frame2) ;
frame2->Draw() ;
Unit normalized
PDF automatically
scaled to dataset
But shape is not right!
Lets fit the curve
to the data
Fitting a PDF to an unbinned dataset
• Fit gauss to data
gauss->fitTo(*data) ;
• Behind the scenes
1. RooFit constructs the Likelihood from the PDF and the dataset
2. RooFit passes the Likelihood function to MINUIT to minimize
3. RooFit extracts the result from MINUIT and stores in the
RooRealVar objects that represent the fit parameters
• Draw the result
gauss->plotOn(frame2) ;
frame2->Draw() ;
Looking at the fit results
• Look again at the PDF variables
sigma.Print() ;
RooRealVar::sigma:
1.9376 +/- 0.043331 (-0.042646, 0.044033) L(-10 – 10)
mean.Print() ;
RooRealVar::mean: -0.0843265 +/- 0.061273 (-0.061210, 0.061361) L(-10 - 10)
Adjusted value
Symmetric
error
(from HESSE)
Asymmetric
error
(from HESSE)
– Results from MINUIT back-propagated to variables
Putting it all together
• A self contained example to construct a model, fit it,
and plot it on top of the data
void fit(TTree* dataTree) {
macro/tut1.C
// Define model
RooRealVar x(“x”,”x”,-10,10) ;
RooRealVar sigma(“sigma”,”sigma”,2,0.1,10) ;
RooRealVar mean(“mean”,”mean”,-10,10) ;
RooGaussian gauss(“gauss”,”gauss”,x,mean,sigma) ;
// Import data
RooDataSet data(“data”,”data”,dataTree,x) ;
// Fit data
gauss.fitTo(data) ;
// Make plot
RooPlot* frame = x.frame() ;
data.plotOn(frame) ;
gauss.plotOn(frame) ;
frame->Draw() ;
}
In macro/tut1.C
uncomment two lines
below // Make plot
and see what happens
Building composite PDFS
• RooFit has a collection of many basic PDFS
RooArgusBG
RooBifurGauss
RooBreitWigner
RooCBShape
RooChebychev
RooDecay
RooExponential
RooGaussian
RooKeysPdf
RooPolynomial
RooVoigtian
-
Argus background shape
Bifurcated Gaussian
Breit-Wigner shape
Crystal Ball function
Chebychev polynomial
Simple decay function
Exponential function
Gaussian function
Non-parametric data description
Generic polynomial PDF
Breit-Wigner (X) Gaussian
HTML class documentation in
(open with internet explorer)
doc/html/ClassIndex.html
Building realistic models
• You can combine any number of the preceding PDFs to
build more realistic models
RooRealVar x(“x”,”x”,-10,10)
macro/tut2.C
// Construct background model
RooRealVar alpha(“alpha”,”alpha”,-0.3,-3,0) ;
RooExponential bkg(“bkg”,”bkg”,x, alpha) ;
// Construct signal model
RooRealVar mean(“mean”,”mean”,3,-10,10) ;
RooRealVar sigma(“sigma”,”sigma”,1,0.1,10) ;
RooGaussian sig(“sig”,”sig”,x,mean,sigma) ;
// Construct signal+background model
RooRealVar sigFrac(“sigFrac”,”signal fraction”,0.1,0,1) ;
RooAddPdf model(“model”,”model”,RooArgList(sig,bkg),sigFrac) ;
// Plot model
RooPlot* frame = x.frame() ;
model.plotOn(frame) ;
model.plotOn(frame,Components(bkg),LineStyle(kDashed)) ;
frame->Draw() ;
Building realistic models
Sampling ‘toy’ Monte Carlo events from model
• Just like you can fit models, you can also sample ‘toy’
Monte Carlo events from models
RooDataSet* mcdata = model->generate(x,1000) ;
RooPlot* frame2 = x.frame() ;
mcdata->plotOn(frame2) ;
model->plotOn(frame2) ;
frame2->Draw() ;
Try it ...
RooAddPdf can add any number of models
RooRealVar x("x","x",0,10) ;
macros/tut3.C
// Construct background model
RooRealVar alpha("alpha","alpha",-0.7,-3,0) ;
RooExponential bkg1("bkg1","bkg1",x,alpha) ;
// Construct additional background model
RooRealVar bkgmean("bkgmean","bkgmean",7,-10,10) ;
RooRealVar bkgsigma("bkgsigma","bkgsigma",2,0.1,10) ;
RooGaussian bkg2("bkg2","bkg2",x,bkgmean,bkgsigma) ;
// Construct signal model
RooRealVar mean("mean","mean",3,-10,10) ;
RooRealVar width("width","width",0.5,0.1,10) ;
RooBreitWigner sig("sig","sig",x,mean,width) ;
// Construct signal+2xbackground model
RooRealVar bkg1Frac("bkg1Frac","signal fraction",0.2,0,1) ;
RooRealVar sigFrac("sigFrac","signal fraction",0.5,0,1) ;
RooAddPdf model("model","model",RooArgList(sig,bkg1,bkg2),
RooArgList(sigFrac,bkg1Frac))
;
RooPlot* frame = x.frame() ;
model.plotOn(frame) ;
model.plotOn(frame,Components(RooArgSet(bkg1,bkg2)),LineStyle(kDashed)) ;
frame->Draw() ;
RooAddPdf can add any number of models
Try adding
another
signal
term
Extended Likelihood fits
• Regular likelihood fits only fit for shape
– Number of coefficients in RooAddPdf is always one less than
number of components
• Can also do extended likelihood fit
– Fit for both shape and observed number of events
– Accomplished by adding ‘extended likelihood term’ to regular LL

 
 log L( p)    log( g ( xi , p))  N exp  N obs log( N exp )
D
• Extended term automatically constructed in RooAddPdf
if given equal number of coefficients & PDFS
Extended Likelihood fits and RooAddPdf
• How to construct an extended PDF with RooAddPdf
// Construct extended signal+2xbackground model
RooRealVar nbkg1(“nbkg1",“number of bkg1 events",300,0,1000) ;
RooRealVar nbkg2(“nbkg2",“number of bkg2 events",200,0,1000) ;
RooRealVar nsig( “nsig",“number of signal events",500,0,1000) ;
RooAddPdf emodel(“emodel",“emodel",RooArgList(sig, bkg1, bkg2),
RooArgList(nsig,nbkg1,nbkg2))
Previous model
sigFrac
bkg1Frac
Add extended term
sigFrac
bkg1Frac
ntotal
• Fitting with extended model
emodel.fitTo(data,”e”) ;
Include extended term in fit
;
New representation
nsig
nbkg1
nbkg2
macros/tut4.C
Look at sum, expected
errors, and
correlations between
fitted event numbers
Switching gears
• Hands-on exercise so far designed to introduce you to
basic model building syntax
• Real power of RooFit is in using those models to explore
your analysis in an efficient way
• No time in this short session to cover this properly, so
next slide just gives you a flavor of what is possible
1. Multidimensional models, selecting by likelihood ratio
2. Demo on ‘task automation’ as mentioned in last slide of
introductory slide
Multi-dimensional PDFs
• RooFit handles multi-dimensional PDFs as easily as 1D
PDFs
– Just use class RooProdPdf to multiply 1D PDFS
• Case example: selecting B+  D0 K+
– Three discriminating variables: mES, DeltaE, m(D0)
Signal Model
*
*
Background Model
*
*
• Look at example model, fit, plots in
macros/tut5.C
Selecting by Likelihood ratio
• Plain projection of multi-dimensional PDF and dataset
often don’t do justice to analyzing power of PDF
– You don’t see selecting power of PDF in dimensions that are
projected out
Plain projection of mES
of previous excercise
Result from 3D fit
Nsig = 91 ± 10
Close to sqrt(N)
– Possible solution: don’t plot all events, but
show only events passing cut of signal,bkg
likelihood ratios constructed from PDF
dimensions that are not shown in the plot
macros/tut6.C
Next topic: How stable is your fit
• When looking at low statistics fit, you’ll want to check
explicitly
– Is your fit stable and unbiased
• Check by running through large set of toy MC samples
– Fit each sample, accumulate fit statistics and make pull
distribution
• Technical procedure
– Generate toy Monte Carlo sample with desired number of events
– Fit for signal in that sample
– Record number of fitted signal events
– Repeat steps 1-3 often
– Plot distributions of Nsig, s(Nsig), pull(Nsig)
• RooFit can do all this for you with 2 lines of code!
– Try out the example in
macros/tut7.C
Experiment with
lowering number
of signal events
How often does background mimic your signal?
• Useful quantity in determining importance of your signal: the
P-value
– P-Value: How often does a data sample of comparable statistics with no
signal mimic the signal yield you observe
– Tells you how probable it is that your peak is the result of a statistical
fluctuation of the background
• Procedure very similar to last exercise
– First generate fake ‘data’, fit data to determine ‘data signal yield’
– Generate toy Monte Carlo sample with 0 signal events
– Fit for signal in that sample
– Record number of fitted signal events
– Repeat steps 1-3 often
– See what fraction of fits result in a signal yield exceeding your ‘observed
data yield’
• Try out the example in
macros/tut8.C
Additional Exercises
• Set up you own fit!
• Fit the top quark mass distribution in
macros/tut9.C
• For the top signal (around 160 GeV/c2), use a Gaussian.
• For the background, try out
– Polynomial (RooPolynomial)
– Chebychev polynomial (RooChebychev)
Minumum number of terms needed?
Which background description works better?
Why? Look at correlation matrix.
Advanced examples
• See the macros/examples/ directory.