Introduction to Statistics and Machine Learning


Helge Voss, GSI Power Week, Dec 5-9 2011

How do we understand and interpret our measurements?
How do we obtain the data for our measurements?

Outline
• Multivariate classification/regression algorithms (MVA)
  • motivation
  • revisiting the ideas of hypothesis tests in this context
• Multidimensional Likelihood (kNN: k-Nearest Neighbour)
• Projective Likelihood (naïve Bayes)
• What to do with correlated input variables?
  • Decorrelation strategies

MVA Literature / Software Packages ... a biased selection

Literature:
• T. Hastie, R. Tibshirani, J. Friedman, “The Elements of Statistical Learning”, Springer 2001
• C.M. Bishop, “Pattern Recognition and Machine Learning”, Springer 2006

Software packages for Multivariate Data Analysis/Classification:
• individual classifier software, e.g. “JETNET” (C. Peterson, T. Rognvaldsson, L. Loennblad) and many other packages
• attempts to provide “all inclusive” packages:
  • StatPatternRecognition: I. Narsky, arXiv: physics/0507143, http://www.hep.caltech.edu/~narsky/spr.html
  • TMVA: Höcker, Speckmayer, Stelzer, Therhaag, von Toerne, Voss, arXiv: physics/0703039, http://tmva.sf.net or every ROOT distribution (development moved from SourceForge to the ROOT repository)
  • WEKA: http://www.cs.waikato.ac.nz/ml/weka/
  • “R”, a huge data analysis library: http://www.r-project.org/

Conferences: PHYSTAT, ACAT, …

Event Classification
• Suppose a data sample with two types of events, with class labels Signal and Background (we restrict ourselves here to the two-class case; many classifiers can in principle be extended to several classes, otherwise analyses can be staged).
• We have discriminating variables x1, x2, …
• How do we set the decision boundary to select events of type S?
  • Rectangular cuts? A linear boundary? A nonlinear one?
  (figure: three scatter plots of S and B events in the (x1, x2) plane, separated by rectangular cuts, a linear boundary, and a nonlinear boundary)
• How can we decide what to use?
  • rigid boundaries (rectangular cuts, linear): low variance (stable), high bias methods
  • flexible nonlinear boundaries: high variance, small bias methods
• Once we have decided on a class of boundaries, how do we find the “optimal” one?

Regression
• How do we estimate a “functional behaviour” from a given set of “known measurements”?
  • assume for example “D” variables that somehow characterize the shower in your calorimeter
  • energy as a function of the calorimeter shower parameters
  • constant? linear? non-linear?
  (figure: three panels of f(x), e.g. energy versus cluster size, fitted with a constant, a linear, and a non-linear model)
• If we had an analytic model (i.e. knew that the function is an nth-order polynomial), then we would know how to fit it (e.g. Maximum Likelihood Fit).
• But what if we just want to “draw any kind of curve” and parameterize it?
  • Seems trivial? The human brain has very good pattern recognition capabilities!
  • What if you have many input variables?

Regression → model functional behaviour
• Assume for example “D” variables that somehow characterize the shower in your calorimeter.
• Monte Carlo or testbeam → data sample with measured cluster observables + known particle energy = calibration function (the energy is a surface in the D+1 dimensional space).
  (figure: 1-D example, f(x) versus x; 2-D example, surface f(x,y); events generated according to the underlying distribution)
• Better known: (linear) regression → fit a known analytic function
  • e.g. for the 2-D example above a reasonable function would be: f(x,y) = ax² + by² + c
• What if we don’t have a reasonable “model”? → we need something more general:
  • e.g. piecewise defined splines, kernel estimators, decision trees to approximate f(x)
  • NOT in order to “fit a parameter”, but to provide a prediction of the function value f(x) for new measurements x (where f(x) is not known), as in the sketch below.
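
A minimal sketch (not from the lecture) of such a calibration fit, assuming the quadratic model f(x,y) = ax² + by² + c quoted above; the toy cluster observables and energies are invented purely for illustration.

```python
# Least-squares fit of the linear-in-parameters model f(x, y) = a*x^2 + b*y^2 + c.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)            # toy cluster observable 1 (hypothetical)
y = rng.uniform(-1, 1, 500)            # toy cluster observable 2 (hypothetical)
energy = 3.0 * x**2 + 1.5 * y**2 + 0.2 + rng.normal(0, 0.05, 500)  # "known" energies

# Design matrix for the model a*x^2 + b*y^2 + c
A = np.column_stack([x**2, y**2, np.ones_like(x)])
(a, b, c), *_ = np.linalg.lstsq(A, energy, rcond=None)

print(f"fitted a = {a:.2f}, b = {b:.2f}, c = {c:.2f}")
# Prediction for a new measurement where the true energy is not known:
print("predicted energy:", a * 0.3**2 + b * (-0.7)**2 + c)
```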

Event Classification
• Each event, whether Signal or Background, has “D” measured variables spanning a D-dimensional “feature space”.
• Find a mapping from the D-dimensional input-observable (“feature”) space to a one-dimensional output → class label:
  y(x): R^D → R (most general form), y = y(x), with x = {x1, …, xD} the input variables.
• Plotting (histogramming) the resulting y(x) values gives the distribution of the classifier output.
  (figure: histogram of the y(x) values)
• Who sees how this would look for the regression problem?

Event Classification
• Each event, whether Signal or Background, has “D” measured variables spanning a D-dimensional “feature space”.
• Find a mapping y(x): R^D → R from the D-dimensional input/observable/“feature” space to a one-dimensional output → class labels, with y(B) → 0, y(S) → 1.
• y(x): “test statistic” in the D-dimensional space of input variables.
• Distributions of y(x): PDF_S(y) and PDF_B(y) → used to set the selection cut! → efficiency and purity
  • y(x) > cut: signal
  • y(x) = cut: decision boundary
  • y(x) < cut: background
• y(x) = const: surface defining the decision boundary.
• Overlap of PDF_S(y) and PDF_B(y) → separation power, purity.

Classification ↔ Regression
Classification:
• Each event, whether Signal or Background, has “D” measured variables.
• y(x): R^D → R: “test statistic” in the D-dimensional space of input variables.
• y(x) = const: surface defining the decision boundary.
Regression:
• Each event has “D” measured variables + one function value (e.g. cluster shape variables in the ECAL + the particle’s energy).
• y(x): R^D → R → find the underlying f(x1, x2).
• y(x) = const → hyperplanes where the target function is constant.
• Now y(x) needs to be built such that it best approximates the target, not such that it best separates signal from background.
  (figure: target function f(x1, x2) over the (x1, x2) plane)

Event Classification
• y(x): R^D → R: the mapping from the “feature space” (observables) to one output variable.
• PDF_B(y), PDF_S(y): normalised distributions of y = y(x) for background and signal events (i.e. the “function” that describes the shape of the distribution).
• With y = y(x) one can also write PDF_B(y(x)), PDF_S(y(x)): probability densities for background and signal.
  (figure: example distributions of y(x), with the values 1.5 and 0.45 read off at the point considered below)
• Now let’s assume we have an unknown event from the example above for which y(x) = 0.2:
  PDF_B(y(x)) = 1.5 and PDF_S(y(x)) = 0.45
• Let f_S and f_B be the fractions of signal and background events in the sample; then

  P(S | y(x)) = f_S · PDF_S(y(x)) / ( f_S · PDF_S(y(x)) + f_B · PDF_B(y(x)) )

  is the probability that an event with measured x = {x1, …, xD}, which gives y(x), is of type signal.
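
A tiny numerical check of this formula (a sketch, not part of the lecture), plugging in the example numbers PDF_S(y(x)) = 0.45 and PDF_B(y(x)) = 1.5 from above; the sample fractions f_S and f_B are assumptions chosen only for illustration.

```python
# Posterior signal probability from the formula above, using the slide's example numbers.
pdf_s, pdf_b = 0.45, 1.5
f_s, f_b = 0.5, 0.5                       # hypothetical 50/50 sample composition

p_signal = f_s * pdf_s / (f_s * pdf_s + f_b * pdf_b)
print(f"P(S | y(x) = 0.2) = {p_signal:.3f}")   # ~0.231 for a 50/50 mixture
```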

Event Classification
P(Class = C | x) (or simply P(C | x)): probability that the event is of class C, given the measured observables x = {x1, …, xD} → y(x).

Bayes’ theorem:

  P(Class = C | y) = P(y | C) · P(C) / P(y)

• P(y | C): probability density of the measurement y(x) for class C, according to the measurements x and the given mapping function.
• P(C): prior probability to observe an event of class C, i.e. the relative abundance of “signal” versus “background”.
• P(y): overall probability density to observe the actual measurement y(x), i.e. P(y) = Σ_Classes P(y | Class) · P(Class).
• P(Class = C | y): the posterior probability.

Any Decision Involves Risk!
Decide to treat an event as “Signal” or “Background”.

Type-1 error (false positive): classify an event as class C even though it is not
(accept a hypothesis although it is not true; reject the null hypothesis although it would have been the correct one)
→ loss of purity (e.g. accepting wrong events)

Type-2 error (false negative): fail to identify an event from class C as such
(reject a hypothesis although it would have been correct/true; fail to reject the null hypothesis although it is false)
→ loss of efficiency (e.g. missing true signal events)

Trying to select signal events (i.e. trying to disprove the null hypothesis stating it were “only” a background event):

                        classified as Signal | classified as Background
  true Signal:               correct          |      Type-2 error
  true Background:         Type-1 error       |        correct

“A”: region of the outcome of the test where you accept the event as Signal.
• Significance α: Type-1 error rate: α = background selection “efficiency” → should be small.
• Size β: Type-2 error rate (how often you miss the signal) → should be small.
• Power: 1 − β = signal selection efficiency.

Most of the rest of the lecture will be about methods that try to make as few mistakes as possible. ☺

Neyman-Pearson Lemma

  Likelihood ratio:  y(x) = P(x | S) / P(x | B)

Neyman-Pearson: the likelihood ratio y(x) used as “selection criterion” gives for each selection efficiency the best possible background rejection, i.e. it maximises the area under the “Receiver Operating Characteristic” (ROC) curve.
(figure: ROC curve, background rejection 1 − e_backgr versus signal efficiency e_signal; inferior classifiers y′(x), y″(x) lie below the optimal curve)

Varying the cut in y(x) > “cut” moves the working point (efficiency and purity) along the ROC curve.
How to choose the “cut”? → we need to know the prior probabilities (S, B abundances):
• measurement of a signal cross section: maximum of S/√(S+B), or equivalently √(e·p)
• discovery of a signal (typically S << B): maximum of S/√B
• precision measurement: high purity (p) → large background rejection
• trigger selection: high efficiency (e)
A small sketch of the likelihood ratio and its ROC working points follows below.
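
A minimal sketch (assumptions: two invented 2-D Gaussian classes whose densities are known exactly, the idealised Neyman-Pearson situation) of the likelihood ratio y(x) = P(x|S)/P(x|B) and of scanning the cut to move along the ROC curve.

```python
# Likelihood-ratio classifier for two toy Gaussian classes with known densities.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
sig_pdf = multivariate_normal(mean=[1.0, 1.0], cov=[[1.0, 0.3], [0.3, 1.0]])
bkg_pdf = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, -0.2], [-0.2, 1.0]])

sig = sig_pdf.rvs(size=5000, random_state=rng)   # "signal" test events
bkg = bkg_pdf.rvs(size=5000, random_state=rng)   # "background" test events

def likelihood_ratio(x):
    return sig_pdf.pdf(x) / bkg_pdf.pdf(x)       # y(x) = P(x|S) / P(x|B)

y_sig, y_bkg = likelihood_ratio(sig), likelihood_ratio(bkg)

# Each cut value y(x) > cut is one working point on the ROC curve.
for cut in (0.5, 1.0, 2.0, 5.0):
    eff_s = np.mean(y_sig > cut)                 # signal efficiency
    rej_b = np.mean(y_bkg <= cut)                # background rejection = 1 - eff_B
    print(f"cut = {cut:4.1f}   eff_signal = {eff_s:.2f}   bkg_rejection = {rej_b:.2f}")
```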

MVA and Machine Learning
• The previous slide was basically the idea of “Multivariate Analysis” (MVA).
  • rem: what about “standard cuts” (event rejection in each variable separately with fixed conditions, i.e. if x1 > 0 or x2 < 3 then background)?
• Finding y(x): R^n → R
  • given a certain type of model class y(x)
  • in an automatic way using “known” or “previously solved” events → i.e. learn from known “patterns”
  • such that the resulting y(x) has good generalization properties when applied to “unknown” events (regression: fits the target function well “in between” the known training events)
  • that is what the “machine” is supposed to be doing: supervised machine learning
• Of course, there is no magic; we still need to:
  • choose the discriminating variables
  • choose the class of models (linear, non-linear, flexible or less flexible)
  • tune the “learning parameters” → bias vs. variance trade-off
  • check the generalization properties
  • consider the trade-off between statistical and systematic uncertainties

Event Classification
• Unfortunately, the true probability density functions are typically unknown:
  → the Neyman-Pearson lemma doesn’t really help us directly.
• Monte Carlo simulation or, in general, a set of known (already classified) “events”.
• Two different ways to use these “training” events:
  • estimate the functional form of p(x|C) (e.g. the differential cross section folded with the detector effects), from which the likelihood ratio can be obtained
    → e.g. D-dimensional histogram, kernel density estimators, …
  • find a “discrimination function” y(x) and the corresponding decision boundary (i.e. a hyperplane* in the “feature space”: y(x) = const) that optimally separates signal from background
    → e.g. Linear Discriminant, Neural Networks, …
* A hyperplane in the strict sense goes through the origin; here I mean “affine set” to be precise.

Unsupervised Learning
Just a short remark, as we talked about “supervised” learning before:
• supervised: training with “events” for which we know the outcome (i.e. Signal or Background).
• unsupervised: no prior knowledge about what is “Signal” or “Background”; we don’t even know whether there are different “event classes”. Then you can, for example, do:
  • cluster analysis: if different “groups” are found → class labels
  • principal component analysis: find the basis in observable space with the biggest hierarchical differences in the variance → infer something about the underlying substructure
Examples:
• think about “success” or “no success” rather than “signal” and “background” (i.e. a robot achieves its goal or does not / falls or does not fall / …)
• market survey: if many different questions are asked, maybe you can find “clusters” of people, group them together and test whether there are correlations between these groups and their tendency to buy a certain product → address them specifically
• medical survey: group people together and perhaps find common causes for certain diseases

Nearest Neighbour and Kernel Density Estimator
“Events” distributed according to P(x) → estimate the probability density P(x) in the D-dimensional space:
• The only thing at our disposal is our “training data”.
  (figure: training events scattered in the (x1, x2) plane; a box of side h around the test point “x” defines the volume V)
• Say we want to know P(x) at “this” point “x”.
• One expects to find in a volume V around point “x” about N·∫_V P(x)dx events from a dataset with N events.
• For the chosen rectangular volume, count the K events inside it:

  K = Σ_{n=1…N} k( (x − x_n)/h ),   with   k(u) = 1 if |u_i| ≤ 1/2 for all i = 1…D, and 0 otherwise

  k(u) is called a kernel function.
• From the “training data” → estimate of the average P(x) in the volume V: ∫_V P(x)dx ≈ K/N, i.e.

  P̂(x) = (1/N) Σ_{n=1…N} (1/h^D) k( (x − x_n)/h )   → kernel density estimator of the probability density

• Classification: determine PDF_S(x) and PDF_B(x) → likelihood ratio as classifier! (A small sketch of the box-kernel estimator follows below.)
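
A minimal sketch of the box-kernel (Parzen window) density estimate written above; the 2-D toy “training data” are an assumption for illustration only.

```python
# Box-kernel ("Parzen window") density estimate: count training events inside a
# hypercube of side h around x and divide by N * h^D.
import numpy as np

rng = np.random.default_rng(2)
train = rng.normal(0.0, 1.0, size=(2000, 2))      # toy "training data" ~ P(x), D = 2

def box_kernel_density(x, data, h):
    u = (x - data) / h                            # shape (N, D)
    inside = np.all(np.abs(u) <= 0.5, axis=1)     # k(u) = 1 inside the window, else 0
    K = np.count_nonzero(inside)
    N, D = data.shape
    return K / (N * h**D)                         # estimate of P(x)

print("P_hat at the origin:", box_kernel_density(np.array([0.0, 0.0]), train, h=0.5))
# For comparison: the true 2-D standard-normal density at the origin is 1/(2*pi) ~ 0.159
```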

Nearest Neighbour and Kernel Density Estimator
(same setup as on the previous slide: “training data” distributed according to P(x), and K events counted in a rectangular window of side h around the test point “x”)
• Regression: if each event with (x1, x2) carries a “function value” f(x1, x2) (e.g. the energy of the incident particle), then

  (1/N) Σ_i k(x_i − x) · f(x_i)  ≈  ∫_V f(x) P(x) dx

  i.e., after normalising to the K events in the window (∫_V P(x)dx ≈ K/N), one obtains the average function value in the vicinity of x (see the sketch below).
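
A minimal sketch of this window-averaged regression estimate, normalised to the events inside the window as described above; the toy training points and the target f(x1, x2) = x1 + 2·x2 are assumptions for illustration only.

```python
# Window (kernel) regression: average the "function values" f(x_i) of the
# training events that fall inside the box of side h around x.
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(2000, 2))                      # training points (x1, x2)
f = X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.1, 2000)      # measured "function values"

def window_regression(x, X, f, h):
    inside = np.all(np.abs((x - X) / h) <= 0.5, axis=1)     # box kernel k((x_i - x)/h)
    if not inside.any():
        return np.nan                                       # no training events in the window
    return f[inside].mean()                                 # average function value in V

print("f_hat(0.2, 0.4) ~", round(window_regression(np.array([0.2, 0.4]), X, f, h=0.3), 2))
print("true value:      ", 0.2 + 2.0 * 0.4)
```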

Nearest Neighbour and Kernel Density Estimator
(same setup again, but now the “training data” contain signal and background events mixed together)
• kNN: k-Nearest Neighbours → determine K from the “training data” with signal and background mixed together.
• Classify via the relative number of events of the various classes amongst the k nearest neighbours:

  y(x) = n_S / K

  (figure: signal and background training events in the (x1, x2) plane, with the neighbourhood of the test point highlighted; a small sketch follows below)
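
A minimal sketch of this k-nearest-neighbour output y(x) = n_S/K, using invented 2-D toy data and a plain Euclidean metric (metric caveats are discussed on a later slide).

```python
# k-nearest-neighbour classifier: signal fraction among the K closest training events.
import numpy as np

rng = np.random.default_rng(4)
sig = rng.normal([1.0, 1.0], 1.0, size=(1000, 2))        # toy signal events
bkg = rng.normal([-1.0, -1.0], 1.0, size=(1000, 2))      # toy background events
X = np.vstack([sig, bkg])
label = np.concatenate([np.ones(1000), np.zeros(1000)])  # 1 = signal, 0 = background

def knn_output(x, X, label, K=20):
    dist = np.linalg.norm(X - x, axis=1)    # Euclidean distance in the feature space
    nearest = np.argsort(dist)[:K]          # indices of the K nearest training events
    return label[nearest].mean()            # y(x) = n_S / K

print("y(x) near the signal cluster:    ", knn_output(np.array([1.0, 1.0]), X, label))
print("y(x) near the background cluster:", knn_output(np.array([-1.0, -1.0]), X, label))
```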

Kernel Density Estimator
• Parzen window: the “rectangular kernel” → discontinuities at the window edges.
• A smoother model for P(x) is obtained when using smooth kernel functions, e.g. a Gaussian:

  P̂(x) = (1/N) Σ_{n=1…N} 1/(2πh²)^(D/2) · exp( −(x − x_n)² / (2h²) )   ↔ probability density estimator

• Place a “Gaussian” around each “training data point” and sum up their contributions at arbitrary points “x” → P(x).
  (figure: individual kernels and the averaged kernel sum)
• h: “size” of the kernel → “smoothing parameter”.
• There is a large variety of possible kernel functions.

Kernel Density Estimator

  P̂(x) = (1/N) Σ_{n=1…N} K_h(x − x_n)  :  a general probability density estimator using the kernel K_h

• h: “size” of the kernel → “smoothing parameter”.
• The chosen size of the “smoothing parameter” is more important than the kernel function itself:
  • h too small: overtraining
  • h too large: not sensitive to features in P(x)
• Which metric for the kernel (window)?
  • normalise all variables to the same range
  • include correlations?
    • Mahalanobis metric: x·x → xᵀV⁻¹x (Christopher M. Bishop)
• A drawback of kernel density estimators: the evaluation for any test event involves ALL TRAINING DATA → typically very time consuming.
  • binary search trees (i.e. kd-trees) are typically used in kNN methods to speed up the searching

“Curse of Dimensionality”
Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton University Press.

We all know: filling a D-dimensional histogram to get a mapping of the PDF is typically unfeasible due to the lack of Monte Carlo events.

Shortcoming of nearest-neighbour strategies: in higher-dimensional classification/regression cases, the idea of looking at “training events” in a reasonably small “vicinity” of the space point to be classified becomes difficult:
• consider the total phase-space volume V = 1^D; for a cube containing a given fraction of the volume: edge length = (fraction of volume)^(1/D)
• in 10 dimensions, in order to capture 1% of the phase space, 63% of the range in each variable is necessary → that is not “local” anymore (see the one-line check below).
Therefore we still need to develop all the alternative classification/regression techniques.
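
A one-line check of the number quoted above; nothing is assumed beyond the edge-length formula on this slide.

```python
# Edge length of a hypercube containing 1% of the unit volume in D = 10 dimensions.
fraction, D = 0.01, 10
print(f"edge length = {fraction ** (1.0 / D):.2f}")   # ~0.63, i.e. 63% of each variable's range
```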

Naïve Bayesian Classifier (often called “projective Likelihood”)
• Multivariate Likelihood (k-Nearest Neighbour) → estimates the full D-dimensional joint probability density.
• If the correlations between variables are weak, the joint density factorises into the marginal PDFs:

  P(x) ≈ Π_{i=1…D} P_i(x_i)

• Likelihood ratio for an event i_event, built from the products of the marginal PDFs (1-dim “histograms”) of the discriminating variables, where the classes C are the signal and background types:

  y_L(i_event) = Π_i P_i^signal(x_i(i_event)) / Σ_{C ∈ classes} Π_i P_i^C(x_i(i_event))

• One of the first and still very popular MVA algorithms in HEP.
• No hard cuts on individual variables:
  • allows for some “fuzziness”: one very signal-like variable may counterweigh another, less signal-like variable (the PDE introduces fuzzy logic).
• Optimal method if the correlations are zero (Neyman-Pearson Lemma).
(a small sketch follows below)
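
A minimal sketch of this projective likelihood ratio, assuming toy Gaussian data and simple histogram PDFs (event counting, one of the options discussed on the next slide); everything here is invented for illustration.

```python
# Projective (naive Bayes) likelihood: product of per-variable 1-D PDFs per class.
import numpy as np

rng = np.random.default_rng(5)
sig = rng.normal(1.0, 1.0, size=(5000, 3))     # 3 discriminating variables, signal
bkg = rng.normal(0.0, 1.0, size=(5000, 3))     # 3 discriminating variables, background

bins = np.linspace(-4.0, 5.0, 40)
pdf_s = [np.histogram(sig[:, i], bins=bins, density=True)[0] for i in range(3)]
pdf_b = [np.histogram(bkg[:, i], bins=bins, density=True)[0] for i in range(3)]

def projective_likelihood(x, eps=1e-12):
    idx = np.clip(np.digitize(x, bins) - 1, 0, len(bins) - 2)   # bin index per variable
    l_s = np.prod([pdf_s[i][idx[i]] for i in range(3)]) + eps   # product of signal PDFs
    l_b = np.prod([pdf_b[i][idx[i]] for i in range(3)]) + eps   # product of background PDFs
    return l_s / (l_s + l_b)                                    # y_L in [0, 1]

print("y_L, signal-like event:    ", round(projective_likelihood(np.array([1.2, 0.8, 1.5])), 3))
print("y_L, background-like event:", round(projective_likelihood(np.array([-0.5, 0.1, -0.8])), 3))
```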

Naïve Bayesian Classifier (often called “projective Likelihood”)
How to parameterise the 1-dim PDFs?
• event counting (histogramming): automatic and unbiased, but suboptimal
• parametric (function) fitting: difficult to automate for arbitrary PDFs
• nonparametric fitting (i.e. splines, kernels): easy to automate, but can create artefacts or suppress information
  (figure: example where the original underlying distribution is a Gaussian)
• If the correlations between the variables are really negligible, this classifier is “perfect” (simple, robust, performing).
• If not, you seriously lose performance.
• How can we “fix” this?

What if there are correlations?
• Typically correlations are present: C_ij = cov[x_i, x_j] = E[x_i x_j] − E[x_i]E[x_j] ≠ 0 (i ≠ j)
• Pre-processing: choose a set of linearly transformed input variables for which C_ij = 0 (i ≠ j).

Decorrelation
• Find a variable transformation that diagonalises the covariance matrix.
• Determine the square root C′ of the covariance matrix C, i.e. C = C′C′:
  compute C′ by diagonalising C:  D = SᵀCS  ⇒  C′ = S √D Sᵀ
• The transformation from the original variables (x) into the decorrelated variable space (x′) is then: x′ = C′⁻¹x
Attention: this eliminates only linear correlations!! (A small sketch follows below.)
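
A minimal sketch of this square-root decorrelation, applied to invented, linearly correlated Gaussian toy data:

```python
# Square-root decorrelation: C = C' C' via diagonalisation, then x' = C'^{-1} x.
import numpy as np

rng = np.random.default_rng(6)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=5000)

C = np.cov(X, rowvar=False)
eigval, S = np.linalg.eigh(C)                   # D = S^T C S (eigval = diagonal of D)
C_sqrt = S @ np.diag(np.sqrt(eigval)) @ S.T     # C' = S sqrt(D) S^T, so C = C' C'
X_dec = X @ np.linalg.inv(C_sqrt).T             # x' = C'^{-1} x for every event

print(np.round(np.cov(X_dec, rowvar=False), 3))  # ~ unit matrix: linear correlations gone
```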

Decorrelation: Principal Component Analysis
• PCA (an unsupervised learning algorithm):
  • reduce the dimensionality of a problem
  • find the most dominant features in a distribution
• Eigenvectors of the covariance matrix → “axes” in the transformed variable space:
  • large eigenvalue → large variance along that axis (principal component)
  • sort the eigenvectors according to their eigenvalues
  • transform the dataset accordingly
  • → diagonalised covariance matrix, with the first “variable” being the variable with the largest variance
• Principal Component (PC) of variable k (with x̄_v the sample means and v^(k) the eigenvectors):

  x_k^PC(i_event) = Σ_{v ∈ variables} ( x_v(i_event) − x̄_v ) · v_v^(k),   k ∈ variables

• The matrix of eigenvectors V of the covariance matrix C obeys the relation C·V = V·D, with D the diagonal matrix of eigenvalues → PCA eliminates (linear) correlations! (A small sketch follows below.)
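
A minimal sketch of PCA via the eigen-decomposition of the covariance matrix, again on invented correlated toy data:

```python
# PCA: eigenvectors of the covariance matrix, sorted by eigenvalue, then projection.
import numpy as np

rng = np.random.default_rng(7)
X = rng.multivariate_normal([2.0, -1.0], [[3.0, 1.2], [1.2, 1.0]], size=5000)

C = np.cov(X, rowvar=False)
eigval, V = np.linalg.eigh(C)                    # columns of V are the eigenvectors
order = np.argsort(eigval)[::-1]                 # sort: largest variance first
eigval, V = eigval[order], V[:, order]

X_pc = (X - X.mean(axis=0)) @ V                  # x_k^PC = sum_v (x_v - mean_v) * v_v^(k)

print("variances along the principal components:", np.round(eigval, 2))
print("covariance matrix of the PCs:\n", np.round(np.cov(X_pc, rowvar=False), 3))  # diagonal
```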

How to Apply the Pre-Processing Transformation?
• The correlations, and hence the decorrelation transformation, differ for signal and background, but we don’t know beforehand whether an event is signal or background. What do we do?
• For the likelihood ratio, decorrelate signal and background independently:

  y_L(i_event) = Π_k p_k^S(x_k(i_event)) / [ Π_k p_k^S(x_k(i_event)) + Π_k p_k^B(x_k(i_event)) ]

  y_L^trans(i_event) = Π_k p_k^S(T̂_S x_k(i_event)) / [ Π_k p_k^S(T̂_S x_k(i_event)) + Π_k p_k^B(T̂_B x_k(i_event)) ]

  where T̂_S is the signal transformation, T̂_B the background transformation, and the products run over the variables k.
• For other estimators, one needs to decide on one of the two (or decorrelate using a mixture of signal and background events).

Decorrelation at Work
• Example: linearly correlated Gaussians → the decorrelation works to 100%.
• A 1-D likelihood on the decorrelated sample gives the best possible performance.
• Compare also the effect on the MVA output variable!
  (figures: the correlated variables and the same variables after decorrelation; note the different scale on the y-axis)

Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.
• How does linear decorrelation affect cases where the correlations between signal and background differ?
  (figures: original correlations for signal and background)

Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.
• How does linear decorrelation affect cases where the correlations between signal and background differ?
  (figures: signal and background after SQRT decorrelation)

Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.
• How does linear decorrelation affect strongly nonlinear cases?
  (figures: original correlations for signal and background)

Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.
• How does linear decorrelation affect strongly nonlinear cases?
  (figures: signal and background after SQRT decorrelation)
Watch out before you use decorrelation “blindly”!! Perhaps “decorrelate” only a subspace!

“Gaussian-isation”
Improve the decorrelation by a pre-“Gaussianisation” of the variables.

First: transformation to achieve a uniform (flat) distribution, the “Rarity” transform of variable k (with p_k the PDF of variable k and x_k(i_event) the measured value):

  x_k^flat(i_event) = ∫_{−∞}^{x_k(i_event)} p_k(x_k) dx_k,   k ∈ variables

The integral can be solved in an unbinned way by event counting, or by creating non-parametric PDFs (see the likelihood section later).

Second: make the distribution Gaussian via the inverse error function, erf(x) = (2/√π) ∫_0^x e^(−t²) dt:

  x_k^Gauss(i_event) = √2 · erf⁻¹( 2·x_k^flat(i_event) − 1 ),   k ∈ variables

Third: decorrelate (and “iterate” this procedure). A small sketch of the first two steps follows below.
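
A minimal sketch of the two transformation steps above, applied to an invented, clearly non-Gaussian toy variable (exponentially distributed); the empirical-CDF ranking stands in for the unbinned event counting mentioned above.

```python
# Gaussianisation: (1) flatten via the empirical CDF, (2) apply sqrt(2)*erfinv(2u - 1).
import numpy as np
from scipy.special import erfinv
from scipy.stats import rankdata

rng = np.random.default_rng(8)
x = rng.exponential(scale=1.0, size=10000)        # a clearly non-Gaussian toy variable

# Step 1: "rarity" / flat transform via the empirical CDF (ranks), mapped into (0, 1)
x_flat = rankdata(x) / (len(x) + 1.0)

# Step 2: the inverse error function turns the flat variable into a standard Gaussian
x_gauss = np.sqrt(2.0) * erfinv(2.0 * x_flat - 1.0)

print("mean and std after Gaussianisation:", round(x_gauss.mean(), 3), round(x_gauss.std(), 3))
```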

“Gaussian-isation”
(figures: signal and background distributions, original and after Gaussianisation)
We cannot simultaneously “Gaussianise” both signal and background!

Summary
• I hope you are all convinced that multivariate algorithms are nice and powerful classification techniques:
  • they do not use hard selection criteria (cuts) on each individual observable
  • they look at all observables “together”, e.g. by combining them into one variable
• Multidimensional Likelihood → PDF in D dimensions
• Projective Likelihood (Naïve Bayes) → PDF in D times 1 dimension
• How to “avoid” correlations