Snowmass 2001

Support Vector Machines:
Get more Higgs out of your data
Daniel Whiteson
UC Berkeley
July 11, 2001
Multivariate Algorithms
Square cuts may work well for simpler tasks, but when the data are multivariate, the algorithms must be as well.
Multivariate Algorithms
• HEP overlaps with Computer Science,
Mathematics and Statistics in this area:
How can we construct an algorithm that can
be taught by example and generalize
effectively?
• We can use solutions from those fields:
Neural Networks
Probability Density Estimators
Support Vector Machines
Neural Networks
• Constructed from very simple objects, they can learn complex patterns.
• The decision function is learned using the freedom in the hidden layers.
– Used very effectively as signal discriminators, particle identifiers and parameter estimators
– Fast evaluation makes them well suited to triggers
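The slide stops at the idea, so here is a minimal sketch of such a discriminator in modern terms. scikit-learn and the Gaussian toy samples are illustrative assumptions, not part of the original talk.

```python
# Minimal sketch of a neural-network signal/background discriminator.
# scikit-learn is a modern stand-in; the toy 2-D Gaussian "signal" and
# "background" samples are illustrative assumptions, not HEP data.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(loc=[1.0, 1.0], scale=0.8, size=(n, 2))
background = rng.normal(loc=[-1.0, -1.0], scale=0.8, size=(n, 2))
X = np.vstack([signal, background])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = signal, 0 = background

# One hidden layer supplies the freedom to learn a nonlinear decision function.
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X, y)

# The network output can be used directly as a discriminator value.
print(net.predict_proba([[0.5, 0.5]])[0, 1])  # P(signal) for one event
```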
Probability Density Estimation
If we knew the distribution of the signal, f_s(x), and of the background, f_b(x), then we could calculate

    P_s(x) = \frac{f_s(x)}{f_s(x) + f_b(x)}

and use it to discriminate.

[Figure: example discriminant surface]
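As a minimal numeric sketch of this ideal discriminant; the two 1-D Gaussian densities and the use of SciPy are illustrative assumptions:

```python
# Sketch of the ideal discriminant P_s(x) = f_s(x) / (f_s(x) + f_b(x))
# when the densities are known analytically. The 1-D Gaussians here are
# an illustrative assumption.
import numpy as np
from scipy.stats import norm

f_s = norm(loc=1.0, scale=1.0).pdf   # signal density
f_b = norm(loc=-1.0, scale=1.0).pdf  # background density

def p_signal(x):
    return f_s(x) / (f_s(x) + f_b(x))

x = np.linspace(-3, 3, 7)
print(p_signal(x))  # rises from ~0 to ~1 as x moves into the signal region
```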
Probability Density Estimation
Of course we do not know the analytical distributions.
• Given a set of points drawn from a distribution, put down a kernel centered at each point.
• With high statistics, this approximates a smooth probability density.

[Figure: surface built from many kernels]
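A minimal NumPy sketch of the idea, one Gaussian kernel per training point; the sample, kernel shape and bandwidth are illustrative assumptions:

```python
# Sketch of kernel density estimation: place a Gaussian kernel at each
# training point and average. Sample and bandwidth values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(loc=0.0, scale=1.0, size=500)  # draws from the true density
h = 0.3  # fixed kernel width (bandwidth)

def pdf_estimate(x, points, h):
    # Average of Gaussian kernels centered on each training point.
    kernels = np.exp(-0.5 * ((x - points[:, None]) / h) ** 2)
    return kernels.sum(axis=0) / (len(points) * h * np.sqrt(2 * np.pi))

x = np.linspace(-3, 3, 5)
print(pdf_estimate(x, points, h))  # approximates the true N(0,1) density
```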
Probability Density Estimation
• Simple techniques have advanced to more sophisticated approaches:
– Adaptive PDE: varies the width of the kernel for smoothness
– Generalized for regression analysis: measures the value of a continuous parameter
– GEM: measures the local covariance and adjusts the individual kernels to give a more accurate estimate
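The talk does not spell out the adaptive-PDE recipe; one common choice, assumed here purely for illustration, sets each point's kernel width from the distance to its k-th nearest neighbour:

```python
# Sketch of an adaptive PDE: widen the kernel where training points are
# sparse. Using the distance to the k-th nearest neighbour as the local
# bandwidth is one common recipe, assumed here for illustration.
import numpy as np

rng = np.random.default_rng(2)
points = np.sort(rng.normal(size=300))
k = 10
# Per-point bandwidth: distance to the k-th nearest neighbour.
dists = np.abs(points[:, None] - points[None, :])
h = np.sort(dists, axis=1)[:, k]

def adaptive_pdf(x):
    z = (x - points[:, None]) / h[:, None]
    kernels = np.exp(-0.5 * z**2) / (h[:, None] * np.sqrt(2 * np.pi))
    return kernels.mean(axis=0)

print(adaptive_pdf(np.array([0.0, 2.0])))  # smoother tails than a fixed width
```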
Support Vector Machines
• PDEs must evaluate a kernel at every training point for every classification of a data point.
• Can we build a decision surface that uses only the relevant bits of information, the points in the training set that are near the signal-background boundary?

For a linear, separable case, this is not too difficult. We simply need to find the hyperplane that maximizes the separation.
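A minimal sketch of the separable linear case using scikit-learn's SVC (an assumed modern stand-in; a very large C mimics the hard margin, and the toy data are illustrative):

```python
# Sketch of the linear, separable case: find the maximum-margin hyperplane
# and inspect which training points become support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n = 50
X = np.vstack([rng.normal([2, 2], 0.5, (n, 2)),
               rng.normal([-2, -2], 0.5, (n, 2))])
y = np.array([1] * n + [-1] * n)

svm = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
svm.fit(X, y)

# Only the points nearest the boundary matter; the rest are discarded.
print("support vectors:", len(svm.support_vectors_), "of", len(X))
print("hyperplane normal w:", svm.coef_[0])
```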
Support Vector Machines
• To find the hyperplane that gives the highest separation (lowest "energy"), we maximize the Lagrangian with respect to the \alpha_i:

    L = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j

where the (\mathbf{x}_i, y_i) are the training data and the \alpha_i are positive Lagrange multipliers. The solution is

    \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i

where \alpha_i = 0 for non-support vectors.
(images from applet at http://svm.research.bell-labs.com/)
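A sketch checking this solution numerically: scikit-learn's dual_coef_ stores \alpha_i y_i for the support vectors (all other \alpha_i vanish), so w can be rebuilt from the support vectors alone. The toy data are an illustrative assumption:

```python
# Verify w = sum_i alpha_i * y_i * x_i using only the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n = 50
X = np.vstack([rng.normal([2, 2], 0.5, (n, 2)),
               rng.normal([-2, -2], 0.5, (n, 2))])
y = np.array([1] * n + [-1] * n)

svm = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only.
w = svm.dual_coef_[0] @ svm.support_vectors_
print(np.allclose(w, svm.coef_[0]))  # True: both give the same hyperplane
```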
Support Vector Machines
But not many problems of interest are linear. Map the data to a higher-dimensional space where the separation can be made by hyperplanes:

    \Phi : R^d \to H, \qquad \mathbf{x}_i \cdot \mathbf{x}_j \to \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)

We want to work in our original space, so we replace the dot product with a kernel function:

    K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)

For these data, we need

    K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^3
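In scikit-learn's parametrization, (gamma * x_i . x_j + coef0)^degree, this cubic kernel corresponds to degree=3, coef0=1, gamma=1. A sketch on an assumed ring-shaped toy problem:

```python
# The cubic kernel K(x_i, x_j) = (x_i . x_j + 1)^3 via scikit-learn's
# polynomial kernel. The ring-shaped toy data are illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
# Nonlinearly separable toy problem: signal inside a disc, background outside.
X = rng.uniform(-2, 2, (400, 2))
y = (np.sum(X**2, axis=1) < 1.5).astype(int)

svm = SVC(kernel="poly", degree=3, coef0=1, gamma=1).fit(X, y)
print(svm.score(X, y))  # the implicit higher-dimensional map separates the disc
```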
Support Vector Machines
Nor are problems that are not entirely separable very difficult.
• Allow an imperfect decision boundary, but add a penalty.
• Training errors, points on the wrong side of the boundary, are indicated by crosses.
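A sketch of the penalty idea with scikit-learn, where the parameter C sets the cost of training errors; the overlapping toy classes are an illustrative assumption:

```python
# Soft-margin SVM: C penalizes points on the wrong side of the boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
n = 200
X = np.vstack([rng.normal([1, 1], 1.2, (n, 2)),
               rng.normal([-1, -1], 1.2, (n, 2))])
y = np.array([1] * n + [-1] * n)

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates many errors (wide margin); large C tolerates few.
    print(f"C={C}: {len(svm.support_vectors_)} support vectors")
```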
Support Vector Machines
We are not limited to linear or polynomial kernels. A Gaussian kernel

    K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2}

gives a highly flexible SVM.
• Gaussian-kernel SVMs outperformed PDEs in recognizing handwritten digits from the USPS database.
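In scikit-learn this kernel is kernel="rbf" with gamma = 1/(2*sigma^2); a minimal sketch on an assumed curved toy boundary:

```python
# Gaussian (RBF) kernel SVM: exp(-|x_i - x_j|^2 / (2 sigma^2)).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, (400, 2))
y = (np.sin(2 * X[:, 0]) > X[:, 1]).astype(int)  # a wiggly toy boundary

sigma = 0.5
svm = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2)).fit(X, y)
print(svm.score(X, y))  # the flexible kernel follows the curved boundary
```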
Comparative study for HEP
Signal: Wh → bb
Backgrounds: Wbb, tt, WZ
A 2-dimensional discriminant with variables Mjj and Ht.

[Figure: discriminator-value distributions for the Neural Net, PDE and SVM]
Comparative study for HEP
All of these methods provide powerful signal enhancement.

[Figure: signal-to-noise enhancement vs. discriminator threshold, with efficiencies of 43%, 50% and 49% for the three methods]
Algorithm Comparisons
Neural Nets
• Advantages: very fast evaluation
• Disadvantages: structure must be built by hand; black box; local optimization

PDE
• Advantages: transparent operation
• Disadvantages: slow evaluation; requires high statistics

SVM
• Advantages: fast evaluation; kernel positions chosen automatically; global optimization
• Disadvantages: complex; training can be time intensive; kernel selection by hand
Conclusions
• Difficult problems in HEP overlap with those
in other fields. We can take advantage of our
colleagues’ years of thought and effort.
• There are many areas of HEP analysis where
intelligent multivariate algorithms like NNs,
PDEs and SVMs can help us conduct more
powerful searches and make more precise
measurements.