PHYSTAT05 Highlights

PHYSTAT05 Highlights: Statistical Problems in Particle Physics, Astrophysics and Cosmology
Müge Karagöz Ünel
MKU
Oxford University
University College London
03/11/2006
Outline
• Conference Information and History
• Introduction to statistics
• Selection of hot topics
• Available tools
• Astrophysics and cosmology
• Conclusions
PHYSTAT History
Chronology of PHYSTAT05
Where      When         Issues                  Physicists                       Statisticians
CERN       Jan 2000     Limits                  Particles                        3
Fermilab   March 2000   Limits                  Particles + 3 astrophysicists    3
Durham     March 2002   Wider range of topics   Particles + 3 astrophysicists    2
SLAC       Sept 2003    Wider range of topics   Particles + Astro + Cosmo        Many
Oxford     Sept 2005    Wider range of topics   Particles + Astro + Cosmo        Many
PHYSTAT05 Programme
7 Invited talks by Statisticians
9 Invited talks by Physicists
38 Contributed talks
8 Posters
Panel Discussion
3 Conference Summaries
90 participants
Invited Talks by Statisticians
David Cox            Keynote Address: Bayesian, Frequentists & Physicists
Steffen Lauritzen    Goodness of Fit
Jerry Friedman       Machine Learning
Susan Holmes         Visualisation
Peter Clifford       Time Series
Mike Titterington    Deconvolution
Nancy Reid           Conference Summary (Statistics)
Invited Talks by (Astro+)Physicists
Bob Cousins        Nuisance Parameters for Limits
Kyle Cranmer       LHC discovery
Alex Szalay        Astrophysics + Terabytes
Jean-Luc Starck    Multiscale geometry
Jim Linnemann      Statistical Software for Particle Physics
Bob Nichol         Statistical Software for Astrophysics
Stephen Johnson    Historical Transits of Venus
Andrew Jaffe       Conference Summary (Astrophysics)
Gary Feldman       Conference Summary (Particles)
Contents of the Proceedings
Bayes/Frequentist                       5 talks
Goodness of Fit                         5
Likelihood/parameter estimation         6
Nuisance parameters/limits/discovery    10
Machine learning                        7
Software                                8
Visualisation                           1
Astrophysics                            5
Time series                             1
Deconvolution                           3
Statistics in (A/P)Physics
Statistics in (Particle) Physics
An experiment goes through the following stages:
• Prepare conditions for taking data for a particle X (if theory driven)
• Record events that might be X and reconstruct the measurables
• Select events that could have X by applying criteria (cuts)
• Generate histograms of variables and ask the questions: Is there any evidence for new things, or is the null hypothesis unrefuted? If there is evidence, what are the estimates for the parameters of X? (Confrontation of theory with experiment or vice versa.)
• The answers can come via your favorite statistical technique (depending on how you ask the question)
(from S. Andreon’s web page)
(yet another) Chronology
• Homo apriorius establishes the probability of a hypothesis, no matter what the data tell.
• Homo pragmaticus establishes that it is interested in the data only.
• Homo frequentistus measures the probability of the data given the hypothesis.
• Homo sapiens measures the probability of the data and of the hypothesis.
• Homo bayesianis measures the probability of the hypothesis, given the data.
Bayesian vs Frequentist
We need to make a statement about Parameters, given Data
Bayes 1763
Frequentism 1937
Both analyse data (x) → statement about parameters (µ)
Both use Prob(x; µ), e.g. Prob($\mu_l \le \mu \le \mu_u$) = 90%,
but with very different interpretations:
Bayesian: Probability (parameter, given data)
Frequentist: Probability (data, given parameter)
“Bayesians address the question everyone is
interested in, by using assumptions no-one believes”
“Frequentists use impeccable logic to deal with an
issue of no interest to anyone”
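The two readings of a 90% statement can be made concrete with a counting experiment. The sketch below is only an illustration, assuming a Poisson process with no background and, on the Bayesian side, a flat prior in µ; the function names are made up for this example. The frequentist limit is a statement about the data (how often N ≤ n_obs would occur), the Bayesian limit is a statement about µ (where 90% of the posterior lies).

```python
import math

def poisson_cdf(n, mu):
    """P(N <= n) for a Poisson distribution with mean mu."""
    term = math.exp(-mu)
    total = term
    for k in range(1, n + 1):
        term *= mu / k
        total += term
    return total

def frequentist_upper_limit(n_obs, cl=0.90):
    """mu_u such that P(N <= n_obs; mu_u) = 1 - cl (probability of data, given mu)."""
    lo, hi = 0.0, 100.0
    while hi - lo > 1e-10:
        mid = 0.5 * (lo + hi)
        # The CDF decreases with mu, so move the lower edge up while it is too large.
        if poisson_cdf(n_obs, mid) > 1.0 - cl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bayesian_upper_limit(n_obs, cl=0.90, steps=200000, mu_max=100.0):
    """mu_u such that the flat-prior posterior mu^n e^-mu / n! integrates to cl."""
    h = mu_max / steps
    norm = math.factorial(n_obs)
    acc = 0.0
    for i in range(steps):
        mu = (i + 0.5) * h          # midpoint rule
        acc += mu ** n_obs * math.exp(-mu) / norm * h
        if acc >= cl:
            return mu + 0.5 * h
    return mu_max

for n in (0, 3):
    print(n, frequentist_upper_limit(n), bayesian_upper_limit(n))
```

For this particular case (Poisson, no background, flat prior) the two numbers coincide, which is a special property of the example and not a general feature; the interpretation still differs.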
Goodness of Fit
Lauritzen       Invited talk - GoF
Yabsley         GoF and sparse multi-D data
Ianni           GoF and sparse multi-D data
Raja            GoF and L
Gagunashvili    χ² and weighting
Pia             Software Toolkit for Data Analysis
Block           Rejecting outliers
Bruckman        Alignment
Blobel          Tracking
Goodness of Fit
• We would like to: know if a given distribution is of a specified type, test the validity of a postulated model, ...
• A few GoF tests are widely used in practice:
– χ² test: the most widely used application is 1D or 2D fits to data
– G² (the likelihood ratio statistic) test: the general version of the χ² test (Lauritzen’s personal choice)
– Kolmogorov-Smirnov test: a robust but potentially misleading test; it can be used to check whether, say, two distributions (histograms) are the same, by calculating the p-value for the observed difference between them.
– Other new methods, like Aslan & Zech’s energy test, exist…
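As a rough illustration of the first two statistics, the sketch below computes Pearson's χ² and the likelihood ratio statistic G² for a binned comparison. It assumes Poisson-distributed bin contents and strictly positive expectations; the bin contents are invented for the example.

```python
import math

def pearson_chi2(observed, expected):
    """Pearson chi-square: sum over bins of (o - e)^2 / e."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def likelihood_ratio_g2(observed, expected):
    """G^2 = 2 * sum[e - o + o*ln(o/e)] for Poisson bins (empty bins contribute 2e)."""
    total = 0.0
    for o, e in zip(observed, expected):
        total += e - o
        if o > 0:
            total += o * math.log(o / e)
    return 2.0 * total

# Hypothetical histogram: four bins, flat expectation.
observed = [12, 9, 11, 8]
expected = [10.0, 10.0, 10.0, 10.0]
print(pearson_chi2(observed, expected), likelihood_ratio_g2(observed, expected))
```

With reasonably populated bins the two statistics are close to each other; they drift apart when bins are sparse, which is where the multi-D GoF talks above become relevant.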
An example from ATLAS (Bruckman)
Direct Least-Squares solution to the Silicon Tracker alignment problem
The method consists of minimizing the giant χ² resulting from a simultaneous fit of all particle trajectories and alignment parameters:
[Equation not recovered: χ² built from the hit residuals, with errors from intrinsic measurement error + MCS]
Let us consequently use the linear expansion (we assume all second-order derivatives are negligible). The track fit is solved by:
[Equation not recovered]
while the alignment parameters are given by:
[Equation not recovered: the key relation]
• Systems are large: inherent computational challenges
• Equivalent to the Millepede approach of V. Blobel
Nuisance Parameters/Limits/Discovery
Cousins       Limits and Nuisance Params
Reid          Respondent
Punzi         Frequentist multi-dimensional ordering rule
Tegenfeldt    Feldman-Cousins + Cousins-Highland
Rolke         Limits
Heinrich      Bayes + limits
Bityukov      Poisson situations
Hill          Limits v Discovery (see Punzi @ PHYSTAT2003)
Cranmer       LHC discovery and nuisance parameters
Systematics
Note: Systematic errors (HEP) <-> nuisance parameters (statisticians)
An example:
$N_{\text{events}} = \sigma \, LA + b$
• $N_{\text{events}}$: observed
• $\sigma$: physics parameter
• $LA$ and $b$: we need to know these, probably from other measurements (and/or theory)
Uncertainties in these → error in $\sigma$:
• $N \pm \sqrt{N}$ for statistical errors
• $LA = LA_0 \pm \sigma_{LA}$ and $b = b_0 \pm \sigma_b$ (some are arguably statistical errors)
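A minimal sketch of how these uncertainties feed into the extracted σ, assuming simple linear error propagation with uncorrelated uncertainties on LA and b. The function names and all numbers are hypothetical and only meant to show the arithmetic.

```python
import math

def cross_section(n_obs, la, b):
    """Invert N = sigma * LA + b for the physics parameter sigma."""
    return (n_obs - b) / la

def cross_section_errors(n_obs, la0, b0, sigma_la, sigma_b):
    """Return (sigma, statistical error, systematic error) by linear propagation."""
    xs = cross_section(n_obs, la0, b0)
    stat = math.sqrt(n_obs) / la0                      # from N +- sqrt(N)
    # d(sigma)/d(b) = -1/LA and d(sigma)/d(LA) = -sigma/LA, added in quadrature
    syst = math.hypot(sigma_b / la0, xs * sigma_la / la0)
    return xs, stat, syst

# Hypothetical inputs: 100 observed events, LA = 10 +- 0.5, b = 20 +- 4
xs, stat, syst = cross_section_errors(100, 10.0, 20.0, 0.5, 4.0)
print(f"sigma = {xs:.2f} +- {stat:.2f} (stat) +- {syst:.2f} (syst)")
```

Whether the uncertainty on b counts as statistical or systematic depends on how b was obtained (e.g. from a sideband count), which is the point of the "arguably statistical" remark.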
Nuisance Parameters
• Nuisance parameters are parameters with unknown
true values. They may be:
– statistical, such as number of background events in
a sideband used for estimating the background
under a peak.
– systematic, such as the shape of the background
under the peak, or the error caused by the
uncertainty of the hadronic fragmentation model in
the Monte Carlo.
– Most experiments have a large number of
systematic uncertainties.
– If the experimenter is blind to these uncertainties,
they become a bigger nuisance!
Issues with LHC
• The LHC will collide bunches 40 million times/sec and collect petabytes of data. pp collisions at 14 TeV will generate events much more complicated than at LEP or the Tevatron.
• Kyle Cranmer has pointed out that systematic issues will be even more important at the LHC.
– If the statistical error is O(1) and the systematic error is O(0.1), it does not matter much how you treat it.
– However, at the LHC we may have processes with 100 background events and 10% systematic errors; this is not negligible.
– Even more critical, we want 5σ for a discovery level.
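The size of the effect can be estimated with a naive formula. The sketch below assumes a Gaussian approximation in which the Poisson fluctuation of the background and its systematic uncertainty are added in quadrature; this is a back-of-the-envelope choice for illustration, not the treatment recommended in the talks.

```python
import math

def simple_significance(s, b, rel_syst=0.0):
    """s / sqrt(b + (rel_syst * b)^2): statistical and systematic parts in quadrature."""
    return s / math.sqrt(b + (rel_syst * b) ** 2)

def signal_needed(z, b, rel_syst=0.0):
    """Signal yield giving significance z under the same approximation."""
    return z * math.sqrt(b + (rel_syst * b) ** 2)

# 100 background events: the statistical error is 10 events,
# and a 10% systematic error is another 10 events.
print(signal_needed(5.0, 100.0))        # no systematic
print(signal_needed(5.0, 100.0, 0.10))  # with a 10% background systematic
```

In this approximation the signal needed for 5σ rises from 50 to about 71 events, whereas a systematic ten times smaller than the statistical error changes the result by well under 1%.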
Why 5σ? (Feldman+Cranmer)
• LHC searches: 500 searches, each of which has 100 resolution elements (mass, angle bins, ...) = 5 x 10^4 chances to find something.
• One experiment: false positive rate at 5σ → (5 x 10^4)(3 x 10^-7) = 0.015. OK.
• Two experiments:
– Assume allowable false positive rate: 10.
– 2 (5 x 10^4)(1 x 10^-4) = 10 → 3.7σ required.
– Require verification by the other experiment, assume rate 0.01: (1 x 10^-3)(10) = 0.01 → 3.1σ required.
• Caveats: Is the significance real? Are there common systematic errors?
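The arithmetic above can be checked with one-sided Gaussian tail probabilities. This is a small sketch using Python's standard library; the variable names are chosen for this example, and the one-sided convention is an assumption consistent with the 3 x 10^-7 quoted for 5σ.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, sigma 1

def one_sided_p(z):
    """Tail probability above z standard deviations."""
    return 1.0 - STD_NORMAL.cdf(z)

def z_for_p(p):
    """Number of standard deviations whose one-sided tail probability is p."""
    return STD_NORMAL.inv_cdf(1.0 - p)

n_chances = 500 * 100                      # searches x resolution elements

# One experiment at 5 sigma: expected number of false positives
false_pos_5sigma = n_chances * one_sided_p(5.0)

# Two experiments, 10 false positives allowed in total
p_first = 10.0 / (2 * n_chances)           # 1e-4 per chance
# The other experiment must then bring 10 candidates down to 0.01
p_verify = 0.01 / 10.0                     # 1e-3

print(false_pos_5sigma, z_for_p(p_first), z_for_p(p_verify))
```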
Confidence Intervals
• Various techniques were discussed during the conference. Most concerns were summarized by Feldman.
– Bayesian: a good method, but Heinrich showed that flat priors in multi-D may lead to undesirable results (undercoverage).
– Frequentist-Bayesian hybrids: Bayesian for priors and frequentist to extract the range. Cranmer considered this for the LHC (it was also used in Higgs searches).
– Profile likelihood: shown by Punzi to have issues when the distribution is Poisson-like.
– Full Neyman construction: Cranmer and Punzi attempted this, but it is not feasible for a large number of nuisance parameters.
• This summer's Banff workshop was found useful for comparing various methods. The real suggestions for the LHC will likely come from the 2007 workshop on LHC issues.
Event Classification
• The problem: Given a measurement of an event X, find F(X) which returns 1 if the event is signal (s) and 0 if the event is background (b), to optimize a figure of merit, say, s/√b for discovery and s/√(s+b) for an established signal.
• Theoretical solution: Use MC to calculate the likelihood ratio Ls(X)/Lb(X) and derive F(X) from it. Unfortunately, this does not work because, in a high-dimensional space, even the largest data set is sparse. (Feldman)
• In recent years, physicists have turned to machine learning: give the computer samples of s and b events and let the computer figure out what F(X) is.
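The simplest possible F(X) is a single cut on one variable. The sketch below scans the cut value and keeps the one that maximizes a chosen figure of merit; the toy samples are invented, and skipping cuts that leave no background is a simplification made here so that s/√b stays defined.

```python
import math

def discovery_fom(s, b):
    return s / math.sqrt(b)

def established_fom(s, b):
    return s / math.sqrt(s + b)

def scan_cut(signal, background, figure_of_merit):
    """Try F(x) = 1 if x >= cut for every observed value; return (best cut, best FOM)."""
    best_cut, best_fom = None, float("-inf")
    for cut in sorted(set(signal + background)):
        s = sum(1 for x in signal if x >= cut)
        b = sum(1 for x in background if x >= cut)
        if b == 0:          # s/sqrt(b) undefined; skipped for both figures of merit
            continue
        fom = figure_of_merit(s, b)
        if fom > best_fom:
            best_cut, best_fom = cut, fom
    return best_cut, best_fom

# Hypothetical values of one discriminating variable for MC signal and background
signal = [0.9, 0.8, 0.8, 0.7, 0.6, 0.4]
background = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 0.7]

print(scan_cut(signal, background, discovery_fom))
print(scan_cut(signal, background, established_fom))
```

On this toy sample the two figures of merit prefer different cuts (a tight one for discovery, a looser one for an established signal), which is why the figure of merit has to be fixed before F(X) is optimized.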
Multivariate Analysis
Friedman    Machine learning
Prosper     Respondent
Narsky      Bagging
Roe         Boosting (Miniboone)
Gray        Bayes optimal classification
Bhat        Bayesian networks
Sarda       Signal enhancement
Multivariates and Machine Learning
Various methods exist to classify, train and test events.
• Artificial neural networks (ANN): currently the most widely used (examples from Prosper, …)
• Decision trees: a differentiating variable is used to separate the sample into branches until a leaf with a preset number of signal and background events is found.
• Trees with rules: combining a series of trees to increase the power of a single decision tree (Friedman)
• Bagging (Bootstrap AGGregatING) trees: build a collection of trees by selecting a sample of the training data (Narsky)
• Boosted trees: a robust method that gives misclassified events in one tree a higher weight in the generation of a new tree
Comparisons of significance were performed, but not all of them were controlled experiments, so conclusions may be deceptive until further tests.
Ex: Boosted Decision Trees (Roe)
• A nice example from MiniBooNE
• Create M trees and take the final score for signal and background as a weighted sum of the individual trees
[Figures: a decision tree; boosting the tree]
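The reweighting idea can be sketched in a few lines. The example below assumes AdaBoost-style weights and uses one-split "stumps" on a single variable instead of full decision trees; it is a toy illustration with invented data, not the MiniBooNE configuration.

```python
import math

def stump_predict(x, threshold, polarity):
    """One-split tree: return polarity if x < threshold, otherwise -polarity."""
    return polarity if x < threshold else -polarity

def train_adaboost(xs, ys, n_rounds):
    """Return a list of (alpha, threshold, polarity); labels are +1 (signal) / -1 (background)."""
    n = len(xs)
    weights = [1.0 / n] * n
    pts = sorted(set(xs))
    thresholds = [pts[0] - 0.5] + [0.5 * (a + b) for a, b in zip(pts, pts[1:])] + [pts[-1] + 0.5]
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for t in thresholds:
            for pol in (+1, -1):
                # weighted misclassification rate of this stump
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump_predict(x, t, pol) != y)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = min(max(err, 1e-12), 1.0 - 1e-12)       # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)     # weight of this stump in the sum
        ensemble.append((alpha, t, pol))
        # misclassified events get a higher weight for the next stump
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def score(ensemble, x):
    """Weighted sum of the individual stump outputs."""
    return sum(alpha * stump_predict(x, t, pol) for alpha, t, pol in ensemble)

def classify(ensemble, x):
    return 1 if score(ensemble, x) > 0 else -1

# Toy sample that no single cut can separate
TOY_X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
TOY_Y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = train_adaboost(TOY_X, TOY_Y, n_rounds=3)
print([classify(ensemble, x) for x in TOY_X])
```

The best single stump misclassifies three of the ten toy events, while the weighted sum of three boosted stumps classifies all of them correctly.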
Punzi effect (getting L wrong)
Giovanni Punzi @ PHYSTAT2003
“Comments on L fits with variable resolution”
Separate two close signals (A and B) when the resolution σ varies event by event and is different for the 2 signals,
e.g. M: different numbers of tracks → different σ_M
Avoiding Punzi bias:
• Include p(σ|A) and p(σ|B) in the fit, OR
• Fit each range of σ_i separately, and add (N_A)_i → (N_A)_total, and similarly for B
Beware of event-by-event variables and construct likelihoods accordingly
(Talk by Catastini)
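Written out, the first remedy amounts to using the joint density of the observable and its resolution for each component. A sketch, assuming a two-component fit in which f is the fraction of A and m_i, σ_i are the measured value and resolution of event i (f and m_i are symbols introduced here for illustration):

```latex
\mathcal{L}(f) = \prod_i \Big[ f \, p(m_i \mid \sigma_i, A) \, p(\sigma_i \mid A)
               + (1 - f) \, p(m_i \mid \sigma_i, B) \, p(\sigma_i \mid B) \Big]
```

Dropping the factors p(σ_i|A) and p(σ_i|B) is only harmless when the two are identical, because only then do they factor out of the bracket; otherwise the fitted f is biased.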
Blind Analyses
Potential problem:
Experimenters’ bias
Original suggestion? Luis Alvarez
Methods of blinding:
• Keep signal region box closed
• Add random numbers to data
• Keep Monte Carlo parameters blind
• Use part of data to define procedure
A number of analyses in experiments are done blind
Don’t modify the result after unblinding, in general.
Question: Will LHC experiments choose to be blind? In which analysis?
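One way to "add random numbers to data" is a hidden offset applied to the measured quantity until the analysis is frozen. The sketch below is a hypothetical illustration: the offset is derived reproducibly from a secret string, so the same shift is applied every time the analysis is rerun; the function names and the secret are invented.

```python
import hashlib
import random

def hidden_offset(secret, scale):
    """Reproducible offset in [-scale, scale], derived from a secret string."""
    digest = hashlib.sha256(secret.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed).uniform(-scale, scale)

def blind(value, secret, scale):
    """Shift the result by the hidden offset; analysts only ever see this number."""
    return value + hidden_offset(secret, scale)

def unblind(blinded_value, secret, scale):
    """Remove the offset once the procedure is fixed."""
    return blinded_value - hidden_offset(secret, scale)

blinded = blind(1.234, "hypothetical-secret", 0.5)
print(blinded, unblind(blinded, "hypothetical-secret", 0.5))
```

The scale should be comparable to or larger than the expected uncertainty, so that the blinded value gives no hint of the true one.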
Astrophysics + Cosmology Highlights
Astro/Cosmo General Issues
‘“There is only one universe” and some experiments can never be rerun’ – A. Jaffe (concluding talk)
• Astro+cosmo tend to be more Bayesian, by nature.
[Figure: cosmologists, “astronomers” and particle physicists placed on a scale from Bayesians to Frequentists]
• Virtual Observatories: all astro data available from the desktop
• Data volume is doubling every year; most data are on the web (Szalay)
– Bad: computing & storage issues
– Good (?): systematic errors more significant than statistical errors
• Nichol discussed using grid techniques.
Astrophysics: Various Hot Points
• Flat priors have been used commonly, but are dangerous (Cox, Le Diberder, Cousins): would Ω be the best quantity to use, or is it Ωh²?
• Issues with the non-Gaussian distribution of noise taken into account in the spectrum: a few methods were discussed by Starck, Digel, ...
• Blind analyses are rare (not so good at a priori modeling!)
• Lots of good software in astrophysics, and repositories more advanced than in PP.
• Jaffe’s talk made nice use of the CMB as a case study for statistical methods in astrophysics, starting from Bayesian first principles.
Software and Available Tools
Talks Given on Software
Linnemann      Software for Particles
Nichol         Software for Astro (and Grid)
Le Diberder    sPlot
Paterno        R
Kreschuk       ROOT
Verkerke       RooFit
Pia            Goodness of Fit
Buckley        CEDAR
Narsky         StatPatternRecognition
Available Tools
• More and more good software has become available (good news for LHC!)
• PP and astro use somewhat different software (e.g. IDL and IRAF in astro)
• 2004 Phystat workshop at MSU on statistical software (mainly on R & ROOT) by Linnemann
• Statisticians have a repository of standard source codes (StatLib): http://lib.stat.cmu.edu/
• One good output of the conference was a recommendation for a Statistical Software Repository at FNAL
• Linnemann has a web page of collections: http://www.pa.msu.edu/people/linnemann/stat_resources.html
CDF Statistics Committee resources
• Documentation about statistics and a repository: http://www-cdf.fnal.gov/physics/statistics/statistics_home.html
Sample Repository Page
CEDAR & CEPA
Summary & Conclusions
• Very useful physicists/statisticians interaction, e.g. confidence intervals with nuisance parameters, multivariate techniques, etc.
• Lots of things learnt from
– ourselves (by having to present our own stuff!)
– each other (various different approaches)
– statisticians (update on techniques)
• A step towards common tools/software repositories: http://www.phystat.org (Linnemann)
• Programme, transparencies, papers, etc.: http://www.physics.ox.ac.uk/phystat05 (with useful links such as recommended readings)
• Proceedings published by Imperial College Press (Spring ’06)
What is Next?
• A few workshops/schools took place since October, 2005
e.g. Manchester (Nov 2005), SAMSI Duke (April 2006), Banff (July
2006), Spanish Summer School (July 2006)
• No PHYSTAT Conference in summer 2007
• ATLAS Workshop on Statistical Methods, 18-19 Jan 2007
• PHYSTAT Workshop at CERN, 27-29 June 2007 on
“Statistical issues for LHC Physics analyses”.
(Both workshops will likely aim at discovery significance. Please attend!)
• Suggestions/enquiries to: [email protected]
• LHC will take data soon. We do not wish to say
“The experiment was inconclusive, so we had to use statistics”
(inside cover of “the Good Book” by L. Lyons)
• but rather say
“We used statistics, and so we are sure that we’ve discovered X”
(well… with some confidence level!)
Some Final Notes
• Tried to give you a collage of PHYSTAT05 topics.
• My deepest thanks to Louis for giving me the chance &
introducing me to the PHYSTAT experience!
• Apologies to those whose talks I have not been able to cover…
• Thank you for the invitation!
Backup
• Bayes
• Frequentist
• Cousins-Highland
• Higgs Saga at CERN
Bayesian Approach
Bayes’ Theorem:
$P(A; B) = \dfrac{P(B; A) \times P(A)}{P(B)}$
$P(\text{param}; \text{data}) \propto P(\text{data}; \text{param}) \times P(\text{param})$
(posterior ∝ likelihood × prior)
Problems:
• P(param): True or False? “Degree of belief”
• Prior: What functional form? Flat? Which variable?
Frequentist Approach
Neyman Construction
[Figure: confidence belt in the (x, µ) plane, with the observed value x_0 marked]
µ = Theoretical parameter
x = Observation
NO PRIOR
Frequentist Approach
$\mu_l \le \mu \le \mu_u$ at 90% confidence
Frequentist:
• $\mu_l$ and $\mu_u$: known, but random
• $\mu$: unknown, but fixed
• Probability statement about $\mu_l$ and $\mu_u$
Bayesian:
• $\mu_l$ and $\mu_u$: known, and fixed
• $\mu$: unknown, and random
• Probability/credible statement about $\mu$
A Method
Method: Mixed Frequentist - Bayesian
Full frequentist method hard to apply in several dimensions
Bayesian for nuisance parameters and
Frequentist to extract range
Philosophical/aesthetic problems?
Highland and Cousins, NIM A320 (1992) 331
Higgs Saga
P (Data; Theory) ≠ P (Theory; Data)
Is data consistent with Standard Model?
or with Standard Model + Higgs?
End of Sept 2000: Data not very consistent with S.M.
Prob (Data; S.M.) < 1%: valid frequentist statement
Turned by the press into: Prob (S.M.; Data) < 1%
and therefore Prob (Higgs; Data) > 99%
i.e. “It is almost certain that the Higgs has been seen”
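A toy calculation shows how far apart the two probabilities can be. The sketch below applies Bayes' theorem with two hypotheses only; treating the 1% as a likelihood (rather than a tail probability) is a simplification, and the prior and the alternative likelihood are purely hypothetical numbers chosen for illustration.

```python
def posterior_sm(prior_sm, p_data_given_sm, p_data_given_alt):
    """P(S.M.; Data) from Bayes' theorem with two exclusive hypotheses."""
    prior_alt = 1.0 - prior_sm
    evidence = prior_sm * p_data_given_sm + prior_alt * p_data_given_alt
    return prior_sm * p_data_given_sm / evidence

# Hypothetical inputs: Prob(Data; S.M.) = 1%, Prob(Data; S.M. + Higgs) = 10%
for prior in (0.9, 0.5):
    print(prior, posterior_sm(prior, 0.01, 0.10))
```

With these assumed inputs Prob (S.M.; Data) comes out at roughly 47% or 9% depending on the prior, nowhere near the "< 1%" of the press version, and the answer cannot be stated at all without choosing a prior.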