Introduction to Statistics and Machine Learning
How do we:
• understand
• interpret
our measurements?
How do we get the data for our measurements?
Outline
Multivariate classification/regression algorithms (MVA)
- motivation
- another introduction / repeat of the ideas of hypothesis tests in this context
Multidimensional Likelihood (kNN: k-Nearest Neighbour)
Projective Likelihood (naïve Bayes)
What to do with correlated input variables?
- decorrelation strategies
MVA Literature / Software Packages… a biased selection
Literature:
T.Hastie, R.Tibshirani, J.Friedman, “The Elements of Statistical Learning”, Springer 2001
C.M.Bishop, “Pattern Recognition and Machine Learning”, Springer 2006
Software packages for Multivariate Data Analysis/Classification
individual classifier software
e.g. “JETNET” C.Peterson, T. Rognvaldsson, L.Loennblad
and many other packages
attempts to provide “all inclusive” packages
StatPatternRecognition: I.Narsky, arXiv: physics/0507143
http://www.hep.caltech.edu/~narsky/spr.html
TMVA: Höcker,Speckmayer,Stelzer,Therhaag,von Toerne,Voss, arXiv: physics/0703039
http://tmva.sf.net or every ROOT distribution (development moved from SourceForge to ROOT repository)
WEKA: http://www.cs.waikato.ac.nz/ml/weka/
“R”: a huge data analysis library: http://www.r-project.org/
Conferences: PHYSTAT, ACAT,…
Event Classification
Suppose a data sample with two types of events, with class labels Signal and Background (we restrict ourselves here to the two-class case; many classifiers can in principle be extended to several classes, otherwise analyses can be staged).
We have discriminating variables x1, x2, …
How do we set the decision boundary to select events of type S?
Rectangular cuts? A linear boundary? A nonlinear one?
[Figure: three panels in the (x1, x2) plane, each showing signal (S) and background (B) regions separated by rectangular cuts, a linear boundary, and a nonlinear boundary, respectively.]
How can we decide what to use?
Simple boundaries (rectangular cuts, linear): low variance (stable), high bias.
Flexible, nonlinear boundaries: high variance, small bias.
Once decided on a class of boundaries, how to find the “optimal” one ?
Regression
How do we estimate a “functional behaviour” from a given set of “known measurements”?
Assume for example D variables that somehow characterize the shower in your calorimeter
→ energy as a function of the calorimeter shower parameters.
constant? linear? non-linear?
[Figure: energy f(x) versus cluster size x, fitted with a constant, a linear, and a non-linear function.]
If we had an analytic model (i.e. knew that the function is an nth-order polynomial), then we would know how to fit it (i.e. a Maximum Likelihood fit).
But what if we just want to “draw any kind of curve” and parameterize it?
Seems trivial? The human brain has very good pattern recognition capabilities!
But what if you have many input variables?
Regression: model functional behaviour
Assume for example D variables that somehow characterize the shower in your calorimeter.
Monte Carlo or testbeam data sample with measured cluster observables + known particle energy
= calibration function (energy == surface in D+1 dimensional space)
[Figure: 1-D example, f(x) versus x, and 2-D example, f(x,y) versus (x, y); events generated according to the underlying distribution.]
Better known: (linear) regression → fit a known analytic function
e.g. for the above 2-D example a reasonable function would be: f(x,y) = ax² + by² + c
What if we don’t have a reasonable “model”? → need something more general:
e.g. piecewise defined splines, kernel estimators, decision trees to approximate f(x)
NOT in order to “fit a parameter”, but to provide a prediction of the function value f(x) for new measurements x (where f(x) is not known).
Event Classification
Each event, whether Signal or Background, has D measured variables.
Find a mapping from the D-dimensional input-observable = “feature” space to a one-dimensional output class label:
y(x): R^D → R (most general form)
y = y(x); x ∈ R^D; x = {x1, …, xD}: input variables
Plotting (histogramming) the resulting y(x) values:
[Figure: histogram of the y(x) values.]
Who can see what this would look like for the regression problem?
Event Classification
Each event, whether Signal or Background, has D measured variables.
Find a mapping from the D-dimensional input/observable/“feature” space to a one-dimensional output of class labels: y(B) → 0, y(S) → 1
y(x): R^D → R
y(x): “test statistic” in the D-dimensional space of input variables
distributions of y(x): PDFS(y) and PDFB(y) → used to set the selection cut! → efficiency and purity
y(x) > cut: signal; y(x) = cut: decision boundary; y(x) < cut: background
y(x) = const: surface defining the decision boundary
overlap of PDFS(y) and PDFB(y) → separation power, purity
Classification ↔ Regression
Classification:
Each event, whether Signal or Background, has D measured variables.
y(x): R^D → R: “test statistic” in the D-dimensional “feature space” of input variables
y(x) = const: surface defining the decision boundary
Regression:
Each event has D measured variables + one function value
(e.g. cluster shape variables in the ECAL + the particle’s energy)
y(x): R^D → R: find an approximation of f(x1, x2)
y(x) = const: hyperplanes where the target function is constant
Now, y(x) needs to be built such that it best approximates the target, not such that it best separates signal from background.
[Figure: target function over the (x1, x2) plane.]
Event Classification
y(x): R^D → R: the mapping from the “feature space” (observables) to one output variable
PDFB(y), PDFS(y): normalised distributions of y = y(x) for background and signal events
(i.e. the “function” that describes the shape of the distribution)
With y = y(x) one can also write PDFB(y(x)), PDFS(y(x)): probability densities for background and signal.
[Figure: probability densities PDFB(y) and PDFS(y) versus y(x).]
Now let’s assume we have an unknown event from the example above for which y(x) = 0.2:
PDFB(y(x)) = 1.5 and PDFS(y(x)) = 0.45
Let fS and fB be the fractions of signal and background events in the sample; then:
P(S | y(x)) = fS·PDFS(y(x)) / ( fS·PDFS(y(x)) + fB·PDFB(y(x)) )
is the probability that an event with measured x = {x1, …, xD}, which gives y(x), is of type signal.
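As a quick numerical cross-check of this formula (a sketch; the equal fractions fS = fB = 0.5 are an assumption made only for illustration, the PDF values are the ones quoted above):

```python
# Sketch: posterior signal probability for the example above,
# assuming (hypothetically) equal signal and background fractions.
pdf_s, pdf_b = 0.45, 1.5   # PDF_S(y(x)=0.2), PDF_B(y(x)=0.2) from the example
f_s, f_b = 0.5, 0.5        # assumed signal/background fractions in the sample

p_signal = f_s * pdf_s / (f_s * pdf_s + f_b * pdf_b)
print(f"P(S | y(x)=0.2) = {p_signal:.2f}")   # ~0.23: the event is more likely background
```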
Event Classification
P(Class = C | x) (or simply P(C | x)): the probability that the event class is C, given the measured observables x = {x1, …, xD} → y(x).
P(Class = C | y) = P(y | C) · P(C) / P(y)
P(Class = C | y): posterior probability
P(y | C): probability density distribution of y according to the measurements x and the given mapping function
P(C): prior probability to observe an event of “class C”, i.e. the relative abundance of “signal” versus “background”
P(y): overall probability density to observe the actual measurement y(x), i.e. P(y) = Σ_Classes P(y | Class)·P(Class)
Any Decision Involves Risk!
Decide to treat an event as “Signal” or “Background”
Trying to select signal events (i.e. trying to disprove the null hypothesis stating it were “only” a background event):
Type-1 error (false positive): classify an event as class C even though it is not
(accept a hypothesis although it is not true, i.e. reject the null hypothesis although it would have been the correct one)
→ loss of purity (e.g. accepting wrong events)
Type-2 error (false negative): fail to identify an event from class C as such
(reject a hypothesis although it would have been correct/true, i.e. fail to reject the null hypothesis although it is false)
→ loss of efficiency (e.g. missing true signal events)
[Table: true class (Signal / Background) versus decision; background accepted as signal → Type-1 error, signal rejected as background → Type-2 error.]
“A”: region of the outcome of the test where you accept the event as Signal.
Significance α: Type-1 error rate: α = background selection “efficiency” → should be small.
Size β: Type-2 error rate (how often you miss the signal) → should be small.
Power: 1 − β = signal selection efficiency.
Most of the rest of the lecture will be about methods that try to make as few mistakes as possible.
Neyman-Pearson Lemma
Likelihood ratio: y(x) = P(x | S) / P(x | B)
Neyman-Pearson: the likelihood ratio used as “selection criterion” y(x) gives, for each selection efficiency, the best possible background rejection,
i.e. it maximises the area under the “Receiver Operating Characteristic” (ROC) curve.
[Figure: ROC curves, 1 − ε_backgr versus ε_signal (both from 0 to 1), for test statistics y(x), y′(x), y″(x).]
Varying the cut in y(x) > “cut” moves the working point (efficiency and purity) along the ROC curve.
How to choose the “cut”? → need to know the prior probabilities (S, B abundances):
measurement of a signal cross section: maximum of S/√(S+B), or equivalently √(ε·p)
discovery of a signal (typically S << B): maximum of S/√B
precision measurement: high purity (p) → large background rejection
trigger selection: high efficiency (ε)
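To make the lemma concrete, here is a minimal toy sketch (assumed 1-D Gaussian densities for signal and background, so P(x|S) and P(x|B) are known analytically); sweeping the cut on the likelihood ratio traces out the working points of the ROC curve:

```python
import numpy as np

# Toy model: one observable x, Gaussian PDFs for signal and background (assumed shapes).
rng = np.random.default_rng(0)
x_sig = rng.normal(1.0, 1.0, 100_000)   # signal events
x_bkg = rng.normal(-1.0, 1.0, 100_000)  # background events

def pdf(x, mu):                          # Gaussian density, unit width
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# Neyman-Pearson test statistic: likelihood ratio y(x) = P(x|S) / P(x|B)
y_sig = pdf(x_sig, 1.0) / pdf(x_sig, -1.0)
y_bkg = pdf(x_bkg, 1.0) / pdf(x_bkg, -1.0)

# Sweep the cut y(x) > cut: each cut is one working point on the ROC curve.
for cut in (0.1, 1.0, 10.0):
    eff_s = np.mean(y_sig > cut)         # signal efficiency
    rej_b = 1.0 - np.mean(y_bkg > cut)   # background rejection = 1 - eps_backgr
    print(f"cut={cut:5.1f}  eps_signal={eff_s:.2f}  1-eps_backgr={rej_b:.2f}")
```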
MVA and Machine Learning
The previous slide was basically the idea of “Multivariate Analysis” (MVA).
Reminder: what about “standard cuts” (event rejection in each variable separately with fixed conditions, i.e. if x1 > 0 or x2 < 3 then background)?
Finding y(x): R^D → R
given a certain type of model class y(x),
in an automatic way, using “known” or “previously solved” events,
i.e. learn from known “patterns”,
such that the resulting y(x) has good generalization properties when applied to “unknown” events
(regression: fits the target function well “in between” the known training events).
That is what the “machine” is supposed to be doing: supervised machine learning.
Of course… there’s no magic, we still need to:
choose the discriminating variables
choose the class of models (linear, non-linear, flexible or less flexible)
tune the “learning parameters” → bias vs. variance trade-off
check the generalization properties
consider the trade-off between statistical and systematic uncertainties
Event Classification
Unfortunately, the true probability density functions are typically unknown:
the Neyman-Pearson lemma doesn’t really help us directly.
→ use Monte Carlo simulation or, in general, a set of known (already classified) “events”.
Two different ways to use these “training” events:
estimate the functional form of p(x|C) (e.g. the differential cross section folded with the detector influences), from which the likelihood ratio can be obtained
e.g. D-dimensional histogram, kernel density estimators, …
find a “discrimination function” y(x) and corresponding decision boundary (i.e. a hyperplane* in the “feature space”: y(x) = const) that optimally separates signal from background
e.g. Linear Discriminator, Neural Networks, …
* a hyperplane in the strict sense goes through the origin. Here I mean “affine set”, to be precise.
Unsupervised Learning
Just a short remark as we talked about “supervised” learning before:
supervised: training with “events” for which we know the outcome (i.e. Signal or Background)
un-supervised: no prior knowledge about what is “Signal” or “Background” or … we don’t even know if there are different “event classes”. Then you can for example do:
- cluster analysis: if different “groups” are found → class labels
- principal component analysis: find the basis in observable space with the biggest hierarchical differences in the variance
→ infer something about the underlying substructure
Examples:
- think about “success” or “no success” rather than “signal” and “background”
(i.e. a robot achieves its goal or does not / falls or does not fall / …)
- market survey: if asked many different questions, maybe you can find “clusters” of people, group them together, and test if there are correlations between these groups and their tendency to buy a certain product → address them specially
- medical survey: group people together and perhaps find common causes for certain diseases
Nearest Neighbour and Kernel Density Estimator
“events” distributed according to P(x) → estimate the probability density P(x) in D-dimensional space.
The only thing at our disposal is our “training data”.
[Figure: training events in the (x1, x2) plane; a rectangular box of edge length h around the point “x” defines the volume V.]
Say we want to know P(x) at “this” point “x”.
One expects to find N·∫_V P(x)dx events in a volume V around point “x”, from a dataset with N events.
For a chosen rectangular volume, the number of events K found inside it is:
K = Σ_{n=1}^{N} k( (x − x_n) / h ),  with the kernel  k(u) = 1 if |u_i| ≤ 1/2 for all i = 1…D, and 0 otherwise.
k(u) is called a kernel function.
(From the “training data”) estimate of the average P(x) in the volume V: ∫_V P(x)dx ≈ K/N
Kernel density estimator of the probability density:
P̂(x) = (1/N) Σ_{n=1}^{N} (1/h^D) k( (x − x_n) / h )
Classification: determine PDFS(x) and PDFB(x) → likelihood ratio as classifier!
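A minimal sketch of this rectangular Parzen-window estimate used as a likelihood-ratio classifier (toy 2-D Gaussian samples; the test point and the bandwidth h are arbitrary choices for illustration):

```python
import numpy as np

def parzen_pdf(x, training, h):
    """Rectangular Parzen-window estimate P_hat(x) = K / (N * h^D),
    where K counts training events inside the box of edge length h around x."""
    n_events, dim = training.shape
    inside = np.all(np.abs(training - x) / h <= 0.5, axis=1)  # kernel k(u)=1 if |u_i| <= 1/2
    return inside.sum() / (n_events * h ** dim)

rng = np.random.default_rng(1)
sig = rng.normal(+1.0, 1.0, size=(5000, 2))   # toy signal training events
bkg = rng.normal(-1.0, 1.0, size=(5000, 2))   # toy background training events

x_test, h = np.array([0.5, 0.5]), 0.5          # hypothetical test point and bandwidth
pdf_s, pdf_b = parzen_pdf(x_test, sig, h), parzen_pdf(x_test, bkg, h)
print("likelihood ratio y(x) =", pdf_s / max(pdf_b, 1e-12))
```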
Nearest Neighbour and Kernel Density Estimator
“events” distributed according to P(x) → estimate the probability density P(x) in D-dimensional space.
The only thing at our disposal is our “training data”.
[Figure: as before, training events in the (x1, x2) plane with a box of edge length h and volume V around the point “x”.]
Say we want to know P(x) at “this” point “x”.
One expects to find N·∫_V P(x)dx events in a volume V around point “x”, from a dataset with N events.
For a chosen rectangular volume, the number of events K found inside it is:
K = Σ_{n=1}^{N} k( (x − x_n) / h ),  with  k(u) = 1 if |u_i| ≤ 1/2 for all i = 1…D, and 0 otherwise.
k(u) is called a kernel function: here a rectangular “Parzen window”.
(From the “training data”) estimate of the average P(x) in the volume V: ∫_V P(x)dx ≈ K/N
Regression: if each event with (x1, x2) carries a “function value” f(x1, x2) (e.g. the energy of the incident particle), then
(1/N) Σ_i k(x_i − x) · f(x_i)  ≈  ∫_V f(x) P(x) dx,
i.e. the average function value (in the volume V).
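A sketch of the corresponding regression estimate on toy data (here the window sum is normalised by the number of events inside the window, so it returns the local average of the function values; the target function and the bandwidth are made up for illustration):

```python
import numpy as np

def window_average(x, training_x, training_f, h):
    """Average f over the training events that fall in the box of edge h around x."""
    inside = np.all(np.abs(training_x - x) / h <= 0.5, axis=1)
    if not inside.any():
        return np.nan                      # no training event in the window
    return training_f[inside].mean()       # local average function value

rng = np.random.default_rng(2)
xy = rng.uniform(-1, 1, size=(10_000, 2))              # toy cluster observables (x1, x2)
energy = 3 * xy[:, 0] ** 2 + xy[:, 1] ** 2 + 0.1 * rng.normal(size=len(xy))  # toy target

print(window_average(np.array([0.2, -0.3]), xy, energy, h=0.2))
```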
Nearest Neighbour and Kernel Density Estimator
“events” distributed according to P(x) → estimate the probability density P(x) in D-dimensional space.
The only thing at our disposal is our “training data”.
[Figure: as before, training events in the (x1, x2) plane with a box of edge length h and volume V around the point “x”.]
Say we want to know P(x) at “this” point “x”.
One expects to find N·∫_V P(x)dx events in a volume V around point “x”, from a dataset with N events.
For a chosen rectangular volume, the number of events K found inside it is:
K = Σ_{n=1}^{N} k( (x − x_n) / h ),  with  k(u) = 1 if |u_i| ≤ 1/2 for all i = 1…D, and 0 otherwise.
kNN: k-Nearest Neighbours
Determine K from the “training data” with signal and background mixed together; the classifier output is the relative number of events of the various classes amongst the k nearest neighbours:
y(x) = nS / K
[Figure: the k nearest neighbours around a test point in the (x1, x2) plane.]
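A brute-force sketch of y(x) = nS/K on toy data (in practice the neighbour search is accelerated with binary search trees, as noted two slides further on; k = 20 and the test point are arbitrary choices):

```python
import numpy as np

def knn_score(x, training, labels, k=20):
    """y(x) = n_signal / K among the K nearest training events (labels: 1=signal, 0=background)."""
    dist = np.linalg.norm(training - x, axis=1)   # Euclidean metric (variables assumed normalised)
    nearest = np.argsort(dist)[:k]
    return labels[nearest].mean()

rng = np.random.default_rng(3)
sig = rng.normal(+1.0, 1.0, size=(5000, 2))
bkg = rng.normal(-1.0, 1.0, size=(5000, 2))
training = np.vstack([sig, bkg])                   # signal and background mixed together
labels = np.concatenate([np.ones(len(sig)), np.zeros(len(bkg))])

print(knn_score(np.array([0.8, 0.4]), training, labels, k=20))  # close to 1 -> signal-like
```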
Kernel Density Estimator
Parzen window: the “rectangular kernel” → discontinuities at the window edges.
A smoother model for P(x) is obtained when using smooth kernel functions, e.g. a Gaussian:
P̂(x) = (1/N) Σ_{n=1}^{N} (2πh²)^(−1/2) exp( −(x − x_n)² / (2h²) )  ↔ probability density estimator
Place a “Gaussian” around each “training data point” and sum up their contributions at arbitrary points “x” → P̂(x).
[Figure: individual kernels and their average.]
h: “size” of the kernel → “smoothing parameter”
There is a large variety of possible kernel functions.
Kernel Density Estimator
P̂(x) = (1/N) Σ_{n=1}^{N} K_h(x − x_n): a general probability density estimator using the kernel K
h: “size” of the kernel → “smoothing parameter”
The chosen size of the “smoothing parameter” h is more important than the kernel function:
h too small → overtraining
h too large → not sensitive to features in P(x)
Which metric for the kernel (window)?
→ normalise all variables to the same range
→ include correlations? Mahalanobis metric: x·x → xᵀV⁻¹x  (Christopher M. Bishop)
A drawback of kernel density estimators: evaluation for any test event involves ALL TRAINING DATA → typically very time consuming.
Binary search trees (i.e. kd-trees) are typically used in kNN methods to speed up the searching.
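A small sketch of the bandwidth effect with a 1-D Gaussian kernel (toy two-bump distribution; the three values of h are chosen to illustrate “too small”, “reasonable” and “too large”):

```python
import numpy as np

def gauss_kde(x, training, h):
    """P_hat(x) = 1/N * sum_n (2*pi*h^2)^(-1/2) * exp(-(x - x_n)^2 / (2 h^2))."""
    u = (x[:, None] - training[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(training) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(4)
sample = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])  # two-bump toy P(x)
grid = np.linspace(-4, 4, 9)

for h in (0.05, 0.5, 3.0):                 # too small / reasonable / too large
    print(f"h={h}:", np.round(gauss_kde(grid, sample, h), 3))
```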
“Curse of Dimensionality”
Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton University Press.
We all know: filling a D-dimensional histogram to get a mapping of the PDF is typically unfeasible due to the lack of Monte Carlo events.
Shortcoming of nearest-neighbour strategies: in higher-dimensional classification/regression cases, the idea of looking at “training events” in a reasonably small “vicinity” of the space point to be classified becomes difficult:
Consider: the total phase space volume V = 1^D = 1.
For a cube containing a particular fraction of the volume: edge length = (fraction of volume)^(1/D).
In 10 dimensions, in order to capture 1% of the phase space, 63% of the range in each variable is necessary → that’s not “local” anymore.
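The quoted number is easy to verify with the edge-length formula above (a one-line sketch):

```python
# Edge length of a hypercube containing a given fraction of a unit phase-space volume.
for d in (1, 2, 10, 100):
    print(f"D={d:3d}: edge length for 1% of the volume = {0.01 ** (1.0 / d):.2f}")
# D=10 gives ~0.63, i.e. 63% of the range in each variable: not "local" anymore.
```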
Therefore we still need to develop all the alternative classification/regression techniques
Naïve Bayesian Classifier (often called: (projective) Likelihood)
Multivariate likelihood (k-Nearest Neighbour): estimates the full D-dimensional joint probability density.
If the correlations between variables are weak:  P(x) ≈ Π_{i=1}^{D} P_i(x_i)
Likelihood ratio for event k_event:
y_PDE(x(k_event)) = Π_{i ∈ variables} P_i^signal( x_i(k_event) ) / Σ_{C ∈ classes} Π_{i ∈ variables} P_i^C( x_i(k_event) )
with the P_i being the marginal PDFs (1-dim “histograms”) of the discriminating variables, and the classes C being the signal and background types.
One of the first and still very popular MVA algorithms in HEP.
No hard cuts on individual variables; allows for some “fuzziness”: one very signal-like variable may counterweigh another, less signal-like variable → the PDE introduces fuzzy logic.
Optimal method if the correlations are zero (Neyman-Pearson lemma).
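A minimal sketch of this projective likelihood ratio, using normalised 1-dim histograms as marginal PDFs (toy Gaussian data; the number of bins, the ranges and the test point are arbitrary illustration choices, and the small constant in the denominator only guards against division by zero):

```python
import numpy as np

def marginal_pdfs(sample, bins, ranges):
    """One normalised 1-dim histogram ("marginal PDF") per input variable."""
    return [np.histogram(sample[:, i], bins=bins, range=ranges[i], density=True)
            for i in range(sample.shape[1])]

def projective_likelihood(x, pdfs_s, pdfs_b):
    """y(x) = prod_i P_i^S(x_i) / (prod_i P_i^S(x_i) + prod_i P_i^B(x_i))."""
    def prod(pdfs):
        p = 1.0
        for xi, (hist, edges) in zip(x, pdfs):
            bin_i = np.clip(np.searchsorted(edges, xi) - 1, 0, len(hist) - 1)
            p *= hist[bin_i]
        return p
    ps, pb = prod(pdfs_s), prod(pdfs_b)
    return ps / (ps + pb + 1e-12)

rng = np.random.default_rng(5)
sig = rng.normal(+1.0, 1.0, size=(20_000, 3))
bkg = rng.normal(-1.0, 1.0, size=(20_000, 3))
ranges = [(-5, 5)] * 3
pdfs_s, pdfs_b = marginal_pdfs(sig, 50, ranges), marginal_pdfs(bkg, 50, ranges)
print(projective_likelihood(np.array([0.5, 1.2, -0.2]), pdfs_s, pdfs_b))
```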
Naïve Bayesian Classifier (often called: (projective) Likelihood)
How to parameterise the 1-dim PDFs?
event counting (histogramming): automatic and unbiased, but suboptimal
parametric (function) fitting: difficult to automate for arbitrary PDFs
nonparametric fitting (i.e. splines, kernels): easy to automate, but can create artefacts / suppress information
Example: the original (underlying) distribution is a Gaussian.
If the correlations between the variables are really negligible, this classifier is “perfect” (simple, robust, performant).
If not, you seriously lose performance.
How can we “fix” this?
What if there are correlations?
Typically, correlations are present: Cij = cov[xi, xj] = E[xi·xj] − E[xi]·E[xj] ≠ 0 (i ≠ j)
Pre-processing: choose a set of linearly transformed input variables for which Cij = 0 (i ≠ j).
Decorrelation
Find a variable transformation that diagonalises the covariance matrix:
Determine the square root C′ of the covariance matrix C, i.e. C = C′·C′.
Compute C′ by diagonalising C:  D = SᵀCS  ⇒  C′ = S·√D·Sᵀ
Transformation from the original (x) to the decorrelated variable space (x′):  x′ = C′⁻¹·x
Attention: this eliminates only linear correlations!!
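A sketch of this square-root decorrelation on a toy sample of linearly correlated Gaussians (the covariance values are made up; the square root is built from the eigen-decomposition exactly as described above):

```python
import numpy as np

rng = np.random.default_rng(6)
# Toy sample with linearly correlated variables.
x = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.8], [0.8, 1.0]], size=50_000)

C = np.cov(x, rowvar=False)                    # covariance matrix
eigvals, S = np.linalg.eigh(C)                 # D = S^T C S (D diagonal)
C_sqrt = S @ np.diag(np.sqrt(eigvals)) @ S.T   # C' with C = C' C'
x_decorr = x @ np.linalg.inv(C_sqrt).T         # x' = C'^-1 x, applied event by event

print(np.round(np.cov(x_decorr, rowvar=False), 3))  # ~ identity: linear correlations removed
```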
Decorrelation: Principal Component Analysis
PCA (an unsupervised learning algorithm):
reduce the dimensionality of a problem
find the most dominant features in a distribution
Eigenvectors of the covariance matrix → “axes” in the transformed variable space;
large eigenvalue → large variance along that axis (principal component).
Sort the eigenvectors according to their eigenvalues and transform the dataset accordingly
→ diagonalised covariance matrix, with the first “variable” being the one with the largest variance.
x_k^PC(i_event) = Σ_{v ∈ variables} ( x_v(i_event) − x̄_v ) · v_v^(k),  k ∈ variables
Principal component (PC) k of event i_event: the projection of the sample-mean-subtracted variables x_v onto the eigenvector v^(k).
The matrix of eigenvectors V obeys the relation  C·V = V·D  (C: covariance matrix, D: diagonal)  ⇒ PCA eliminates correlations!
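A sketch of the PCA transformation written out explicitly on a toy correlated sample (means and covariance are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 1.2], [1.2, 1.0]], size=50_000)

C = np.cov(x, rowvar=False)                   # covariance matrix of the input variables
eigvals, V = np.linalg.eigh(C)                # C V = V D, columns of V are the eigenvectors
order = np.argsort(eigvals)[::-1]             # sort by decreasing eigenvalue (variance)
eigvals, V = eigvals[order], V[:, order]

x_pc = (x - x.mean(axis=0)) @ V               # x_k^PC = sum_v (x_v - mean_v) * v_v^(k)

print("variances along principal components:", np.round(x_pc.var(axis=0), 3))
print("covariance of the PCs (off-diagonal ~ 0):")
print(np.round(np.cov(x_pc, rowvar=False), 3))
```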
How to Apply the Pre-Processing Transformation?
Correlation (decorrelation): different for signal and background variables, but we don’t know beforehand if an event is signal or background. What do we do?
For the likelihood ratio, decorrelate signal and background independently:
y_L(i_event) = Π_{k ∈ variables} p_k^S( x_k(i_event) ) / [ Π_{k ∈ variables} p_k^S( x_k(i_event) ) + Π_{k ∈ variables} p_k^B( x_k(i_event) ) ]
With the signal transformation T̂_S and the background transformation T̂_B applied to the respective terms:
y_L^trans(i_event) = Π_k p_k^S( T̂_S x_k(i_event) ) / [ Π_k p_k^S( T̂_S x_k(i_event) ) + Π_k p_k^B( T̂_B x_k(i_event) ) ]
For other estimators, one needs to decide on one of the two… (or decorrelate a mixture of signal and background events).
Decorrelation at Work
Example: linearly correlated Gaussians → decorrelation works to 100%.
A 1-D likelihood on the decorrelated sample gives the best possible performance.
Compare also the effect on the MVA output variable!
[Figures: input variables before (correlated) and after decorrelation; note the different scale on the y-axis… sorry.]
Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.
How does linear decorrelation affect cases where the correlations between signal and background differ?
[Figure: original correlations, signal (left) and background (right).]
Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.
How does linear decorrelation affect cases where the correlations between signal and background differ?
[Figure: after SQRT decorrelation, signal (left) and background (right).]
Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.
How does linear decorrelation affect strongly nonlinear cases?
[Figure: original correlations, signal (left) and background (right).]
Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.
How does linear decorrelation affect strongly nonlinear cases?
[Figure: after SQRT decorrelation, signal (left) and background (right).]
Watch out before you use decorrelation “blindly”!!
Perhaps “decorrelate” only a subspace!
“Gaussian-isation”
Improve the decorrelation by pre-Gaussianisation of the variables.
First: transformation to achieve a uniform (flat) distribution:
x_k^flat(i_event) = ∫_{−∞}^{x_k(i_event)} p_k(x_k′) dx_k′,  k ∈ variables
(the “Rarity” transform of variable k; x_k(i_event): measured value, p_k: PDF of variable k)
The integral can be solved in an unbinned way by event counting, or by creating non-parametric PDFs (see the likelihood section later).
Second: make it Gaussian via the inverse error function  erf(x) = (2/√π) ∫_0^x e^(−t²) dt:
x_k^Gauss(i_event) = √2 · erf⁻¹( 2·x_k^flat(i_event) − 1 ),  k ∈ variables
Third: decorrelate (and “iterate” this procedure)
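A sketch of the first two steps for a single variable (the flattening is done in an unbinned way via ranks, i.e. event counting, and erf⁻¹ is taken from scipy; the exponential toy distribution is just an example of a clearly non-Gaussian input):

```python
import numpy as np
from scipy.special import erfinv

def gaussianise(values):
    """Rank-transform a variable to a flat distribution, then map it to a Gaussian."""
    ranks = np.argsort(np.argsort(values))                 # unbinned "event counting"
    x_flat = (ranks + 0.5) / len(values)                   # flat in (0, 1)
    return np.sqrt(2.0) * erfinv(2.0 * x_flat - 1.0)       # x_Gauss = sqrt(2) erf^-1(2 x_flat - 1)

rng = np.random.default_rng(8)
x = rng.exponential(scale=2.0, size=100_000)               # toy, clearly non-Gaussian variable
x_gauss = gaussianise(x)
print("mean, std after Gaussianisation:", round(x_gauss.mean(), 3), round(x_gauss.std(), 3))
# The third step would be to decorrelate the Gaussianised variables (and possibly iterate).
```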
“Gaussian-isation”
[Figures: signal and background distributions, original and after Gaussianisation.]
We cannot simultaneously “Gaussianise” both signal and background !
Summary
Hope you are all convinced that multivariate algorithms are nice and powerful classification techniques:
Do not use hard selection criteria (cuts) on each individual observable;
look at all observables “together”, e.g. combining them into one variable:
Multidimensional Likelihood → PDF in D dimensions
Projective Likelihood (Naïve Bayes) → PDF in D times 1 dimension
How to “avoid” correlations