Part 1 - MLNL - University College London


Pattern Recognition Methods
Part 1: Kernel Methods & SVM
London, UK
21 May 2012
Janaina Mourao-Miranda,
Machine Learning and Neuroimaging Lab,
University College London, UK
Outline
• Pattern Recognition: Concepts & Framework
• Kernel Methods for Pattern Analysis
• Support Vector Machine (SVM)
Pattern Recognition Concepts
• Pattern recognition aims to assign a label to a given pattern (test example) based either on a priori knowledge or on statistical information extracted from previously seen patterns (training examples).
• The patterns to be classified are usually groups of measurements or observations (e.g. a brain image), defining points in an appropriate multidimensional vector space.
Currently implemented in PRoNTo
Types of learning procedure:
• Supervised learning: The training data consist of pairs of input (typically
vectors, X), and desired outputs or labels (y). The output of the function can
be a continuous (regression) or a discrete (classification) value.
• Unsupervised learning: The training data are not labeled. The aim is to find inherent patterns in the data that can then be used to determine the correct output value for new data examples (e.g. clustering).
• Semi-supervised learning, reinforcement learning.
Pattern Recognition Framework
[Figure: inputs X1, X2, X3 (brain scans) are mapped by a machine learning methodology (no mathematical model available) to outputs y1, y2, y3 (control/patient).]
Computer-based procedures that learn a function from a series of examples
Learning/Training Phase
Training examples: (X1, y1), ..., (Xs, ys)
Generate a function or classifier f such that f(xi) -> yi

Testing Phase
Test example: Xi
Prediction: f(Xi) = yi
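As a concrete illustration of the training and testing phases, here is a minimal sketch in Python using scikit-learn's SVC as the learning machine; the toy feature vectors and labels are made up for illustration and are not part of the original material.

```python
# A minimal sketch of the learning/training and testing phases above,
# using scikit-learn's SVC as the learning machine. The feature vectors
# and labels are made-up toy values, not real brain scans.
import numpy as np
from sklearn.svm import SVC

# Training examples (X1, y1), ..., (Xs, ys): each row is a feature vector.
X_train = np.array([[4.0, 1.0], [3.0, 2.0], [-2.0, 3.0], [-3.0, 1.0]])
y_train = np.array([+1, +1, -1, -1])      # e.g. +1 = patient, -1 = control

clf = SVC(kernel="linear")                # learning phase: estimate f from the examples
clf.fit(X_train, y_train)

X_test = np.array([[2.0, 2.0]])           # test example
print(clf.predict(X_test))                # testing phase: prediction f(X_test)
```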
Standard Statistical Analysis (mass-univariate)
Input: BOLD signal over time at each voxel.
• Voxel-wise GLM model estimation
• Independent statistical test at each voxel
• Correction for multiple comparisons
Output: univariate Statistical Parametric Map.
Pattern Recognition Analysis (multivariate)
Input: volumes from task 1 and volumes from task 2.
• Training phase: produces a multivariate map (the classifier's or regression's weights).
• Test phase: a new example receives a prediction
  y = {+1, -1} or p(y = 1|X, θ), e.g. +1 = patients and -1 = healthy controls.
How to extract features from the fMRI?
A brain volume (fMRI/sMRI) is a 3D matrix of voxels; it is reshaped into a feature vector whose dimensionality equals the number of voxels.
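A minimal sketch of this feature-extraction step; the volume below is random and its size is a hypothetical example, only the reshaping operation itself comes from the text.

```python
# Sketch of the feature-extraction step: a 3D matrix of voxels is reshaped
# into a feature vector whose dimensionality equals the number of voxels.
# The volume below is random and its size is a hypothetical example.
import numpy as np

volume = np.random.rand(53, 63, 46)       # hypothetical 3D brain volume
feature_vector = volume.ravel()           # flatten into one feature vector
print(feature_vector.shape)               # (153594,) = number of voxels
```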
Binary classification: finding a hyperplane
[Figure: brain volumes acquired during task 1 and task 2 plotted as points in voxel space (voxel 1 vs. voxel 2), with a separating hyperplane and its weight vector w; a volume from a new subject is classified according to the side of the hyperplane on which it falls.]
• Linear classifiers (hyperplanes) are parameterized by a weight vector w and a bias term b.
• Different classifiers/algorithms will compute different hyperplanes.
• In neuroimaging applications the dimensionality of the data is often greater than the number of examples (ill-conditioned problems).
• Possible solutions:
– Feature selection strategies (e.g. ROIs, selecting only activated voxels)
– Searchlight
– Kernel Methods
Kernel Methods for Pattern Analysis
The kernel methodology provides a powerful and unified framework for investigating general types of relationships in the data (e.g. classification, regression, etc.).
Kernel methods consist of two parts:
• Computation of the kernel matrix (mapping into the feature space).
• A learning algorithm based on the kernel matrix (designed to discover linear patterns in the feature space).
Advantages:
• They represent a computational shortcut which makes it possible to represent linear patterns efficiently in high-dimensional spaces.
• Using the dual representation with proper regularization* enables efficient solution of ill-conditioned problems.
* e.g. restricting the choice of functions to favor functions that have a small norm.
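The two-part structure described above can be sketched as follows, assuming a linear kernel and scikit-learn's SVC with a precomputed kernel as the learning algorithm; the random data are stand-ins for vectorized brain scans.

```python
# Sketch of the two parts of a kernel method: (1) compute the kernel
# (Gram) matrix, (2) run a learning algorithm that only sees that matrix.
# A linear kernel and scikit-learn's SVC with a precomputed kernel are
# used here; the random data stand in for vectorized brain scans.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.standard_normal((20, 10000))     # 20 examples, 10000 "voxels"
y_train = np.repeat([+1, -1], 10)
X_test = rng.standard_normal((5, 10000))

K_train = X_train @ X_train.T                  # part 1: 20 x 20 linear kernel matrix
K_test = X_test @ X_train.T                    # test-versus-training similarities

clf = SVC(kernel="precomputed")                # part 2: learner based on the kernel only
clf.fit(K_train, y_train)
print(clf.predict(K_test))
```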
Examples of Kernel Methods
• Support Vector Machines (SVM)
• Gaussian Processes (GPs)
• Kernel Ridge Regression (KRR)
• Relevance Vector Regression (RVR)
• Etc.
Kernel Matrix (similarity measure)
[Figure: brain scan 2 is represented by the vector (4, 1) and brain scan 4 by (-2, 3); their dot product = (4 * -2) + (1 * 3) = -5.]
• A kernel is a function that, for two given patterns x and x*, returns a real number characterizing their similarity.
• A simple type of similarity measure between two vectors is the dot product (linear kernel).
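The dot-product example from the figure can be checked directly (the two scan vectors are the ones shown above):

```python
# The dot product from the example above, computed in code:
# brain scan 2 = (4, 1), brain scan 4 = (-2, 3).
import numpy as np

scan2 = np.array([4.0, 1.0])
scan4 = np.array([-2.0, 3.0])
print(np.dot(scan2, scan4))               # (4 * -2) + (1 * 3) = -5.0
```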
Nonlinear Kernels
[Figure: a nonlinear mapping from the original space to a feature space in which the classes become linearly separable.]
• Nonlinear kernels are used to map the data to a higher-dimensional space as an attempt to make it linearly separable.
• The kernel trick enables the computation of similarities in the feature space without having to compute the mapping explicitly.
Advantages of linear models:
• Neuroimaging data are extremely high-dimensional and the sample sizes are very small, therefore non-linear kernels often don't bring any benefit.
• Linear models reduce the risk of overfitting the data and allow direct extraction of the weight vector as an image (i.e. discrimination map).
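A brief sketch contrasting a linear kernel with a nonlinear (RBF) kernel, assuming scikit-learn's pairwise kernel functions; the data are random stand-ins and the gamma value is arbitrary.

```python
# Sketch contrasting a linear kernel with a nonlinear (RBF) kernel.
# The RBF kernel measures similarity in an implicit high-dimensional
# feature space without ever computing the mapping (the kernel trick).
# The data are random stand-ins; gamma is an arbitrary choice.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 1000))        # 6 toy "scans", 1000 voxels each

K_linear = linear_kernel(X)               # K_ij = <x_i, x_j>
K_rbf = rbf_kernel(X, gamma=1e-3)         # K_ij = exp(-gamma * ||x_i - x_j||^2)
print(K_linear.shape, K_rbf.shape)        # both (6, 6): size depends on N, not on d
```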
Linear classifiers
• Linear classifiers (hyperplanes) are parameterized by a weight
vector w and a bias term b.
• The weight vector can be expressed as a linear combination of
training examples xi (where i = 1,…,N and N is the number of
training examples).
  w = Σi αi xi
How to make predictions?
• The general equation for making predictions for a test example x* with kernel methods is

  f(x*) = w⊤x* + b                                     (primal representation)

  f(x*) = Σi αi ⟨xi, x*⟩ + b = Σi αi K(xi, x*) + b      (dual representation)

• Here f(x*) is the predicted score for regression or the distance to the decision boundary for classification models.
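The equivalence of the primal and dual representations can be checked numerically for a linear kernel; the coefficients αi, the bias b and the data below are arbitrary made-up values.

```python
# Numerical check that the primal and dual prediction formulas agree for a
# linear kernel: f(x*) = w'x* + b with w = sum_i alpha_i x_i equals
# sum_i alpha_i <x_i, x*> + b. All numbers below are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 50))          # training examples x_i
alpha = rng.standard_normal(8)            # hypothetical dual coefficients alpha_i
b = 0.3
x_star = rng.standard_normal(50)          # test example x*

w = alpha @ X                             # primal weight vector: sum_i alpha_i x_i
f_primal = w @ x_star + b
f_dual = alpha @ (X @ x_star) + b         # sum_i alpha_i <x_i, x*> + b
print(np.isclose(f_primal, f_dual))       # True
```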
How to interpret the weight vector?
[Figure: training examples of class 1 and class 2, each described by two voxels, are used to learn a weight vector (discrimination map) with w1 = +5 and w2 = -3.]
• The weight vector is the spatial representation of the decision function f(x) = (w1·v1 + w2·v2) + b.
• It is a multivariate pattern: no local inferences should be made from individual voxel weights.
• Testing on a new example with v1 = 0.5 and v2 = 0.8:
  f(x) = (+5 · 0.5 - 3 · 0.8) + 0 = 0.1
  Positive value -> class 1.
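The same worked example in code (weights, bias and test example as given above):

```python
# The worked example above, checked numerically:
# w = (+5, -3), b = 0, new example v = (0.5, 0.8).
import numpy as np

w = np.array([+5.0, -3.0])
v = np.array([0.5, 0.8])
b = 0.0
f = w @ v + b
print(round(f, 2))                        # 0.1 (positive -> class 1)
```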
Example of Kernel Methods
(1) Support Vector Machine
Support Vector Machines (SVMs)
• A classifier derived from statistical learning theory by Vapnik et al. in 1992.
• SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features in a handwriting recognition task.
• Currently, SVM is widely used in object detection &
recognition, content-based image retrieval, text recognition,
biometrics, speech recognition, neuroimaging, etc.
• Also used for regression.
Largest Margin Classifier
• Among all hyperplanes separating the data there is a unique optimal hyperplane: the one with the largest margin (the distance of the closest points to the hyperplane).
• Let us consider that all test points are generated by adding bounded noise (r) to the training examples (test and training data are assumed to have been generated by the same underlying dependence).
• If the optimal hyperplane has margin > r, it will correctly separate the test points.
Linearly separable case (Hard Margin SVM)
[Figure: a separating hyperplane with weight vector w; the canonical margin hyperplanes satisfy w⊤xi + b = +1 and w⊤xi + b = -1, and points on either side satisfy w⊤xi + b > 0 or w⊤xi + b < 0.]
• We assume that the data are linearly separable, that is, there exist w∈IRd and b∈IR such that
yi(w⊤xi + b) > 0, i = 1,...,m.
• Rescaling w and b such that the point(s) closest to the hyperplane satisfy |w⊤xi + b| = 1, we obtain the canonical form of the hyperplane, satisfying yi(w⊤xi + b) ≥ 1.
• The distance ρx(w, b) of a point x from the hyperplane Hw,b is given by ρx = |w⊤x + b| / ||w||.
• The quantity 1/||w|| is the margin of the Optimal Separating Hyperplane.
• Our aim is to find the largest margin hyperplane.
• Optimization problem:
  minimize  (1/2)||w||²  subject to  yi(w⊤xi + b) ≥ 1,  i = 1, …, m.
  This is a quadratic problem with a unique optimal solution.
• The solution of this problem is equivalent to determining the saddle point of the Lagrangian function
  L(w, b, α) = (1/2)||w||² - Σi αi [yi(w⊤xi + b) - 1],
  where αi ≥ 0 are the Lagrange multipliers.
• We minimize L over (w, b) and maximize over α. Differentiating L w.r.t. w and b we obtain
  w = Σi αi yi xi   and   Σi αi yi = 0.
• Substituting w back into L leads to the dual problem
  maximize  Σi αi - (1/2) α⊤Aα  subject to  αi ≥ 0 and Σi αi yi = 0,
  where A is an m × m matrix with Aij = yi yj ⟨xi, xj⟩.
• Note that the complexity of this problem depends on m (the number of examples), not on the number of input components d (the number of dimensions).
• If α is a solution of the dual problem then the solution (w, b) of the primal problem is given by w = Σi αi yi xi.
• Note that w is a linear combination of only the xi for which αi > 0. These xi are called support vectors (SVs).
• The parameter b can be determined by b = yj - w⊤xj, where xj corresponds to a SV.
• A new point x is classified as f(x) = sign(w⊤x + b) = sign(Σi αi yi ⟨xi, x⟩ + b).
Some remarks
• The fact that the Optimal Separating Hyperplane is determined only by the SVs is most remarkable. Usually, the support vectors are a small subset of the training data.
• All the information contained in the data set is summarized by the support
vectors. The whole data set could be replaced by only these points and the same
hyperplane would be found.
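A short sketch of this remark, assuming scikit-learn's SVC with a linear kernel and a large C to approximate the hard-margin case: refitting on the support vectors alone recovers essentially the same hyperplane. The toy data are simulated.

```python
# Sketch of the remark above: refitting a (near) hard-margin linear SVM on
# its support vectors alone recovers essentially the same hyperplane.
# The well-separated 2D data are simulated for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.standard_normal((30, 2)) + 3,
               rng.standard_normal((30, 2)) - 3])
y = np.repeat([+1, -1], 30)

clf = SVC(kernel="linear", C=1e6).fit(X, y)
sv = clf.support_                          # indices of the support vectors

clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])
print(clf.coef_, clf.intercept_)           # hyperplane from the full data set
print(clf_sv.coef_, clf_sv.intercept_)     # (almost) the same hyperplane from the SVs only
```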
Linearly non-separable case (Soft Margin SVM)
• If the data are not linearly separable, the previous analysis can be generalized by looking at the problem
  minimize  (1/2)||w||² + C Σi ξi  subject to  yi(w⊤xi + b) ≥ 1 - ξi,  ξi ≥ 0.
• The idea is to introduce the slack variables ξi to relax the separation constraints (ξi > 0 ⇒ xi has margin less than 1).
New dual problem
• A saddle point analysis (similar to that above) leads to the dual problem
  maximize  Σi αi - (1/2) α⊤Aα  subject to  0 ≤ αi ≤ C and Σi αi yi = 0.
• This is like the previous dual problem except that now we have "box constraints" on αi. If the data are linearly separable, by choosing C large enough we obtain the Optimal Separating Hyperplane.
• Again we have w = Σi αi yi xi.
The role of the parameter C
• The parameter C controls the relative importance of minimizing the norm of w (which is equivalent to maximizing the margin) and satisfying the margin constraint for each data point.
• If C is close to 0, then we don't pay much for points violating the margin constraint. This is equivalent to creating a very wide tube or safety margin around the decision boundary (but having many points violate this safety margin).
• If C is close to infinity, then we pay a lot for points that violate the margin constraint, and we are close to the hard-margin formulation we previously described; the difficulty here is that we may be sensitive to outlier points in the training data.
• C is often selected by minimizing the leave-one-out cross-validation error.
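A sketch of the effect of C on overlapping toy data, assuming scikit-learn's SVC; the specific C values and the simulated data are arbitrary choices for illustration.

```python
# Sketch of the effect of C: with overlapping classes, a small C tolerates
# many margin violations (wide margin, many support vectors), while a large
# C approaches the hard-margin solution and is more sensitive to outliers.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.standard_normal((50, 2)) + 1,
               rng.standard_normal((50, 2)) - 1])
y = np.repeat([+1, -1], 50)               # overlapping classes

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C: small ||w|| (wide margin) and many support vectors;
    # large C: larger ||w|| (narrower margin), fewer margin violations
    print(C, int(clf.n_support_.sum()), float(np.linalg.norm(clf.coef_)))
```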
Summary
• SVMs are prediction devices known to have good performance in high-dimensional settings.
• "The key features of SVMs are the use of kernels, the absence of local minima, the sparseness of the solution and the capacity control obtained by optimizing the margin." Shawe-Taylor and Cristianini (2004).
References
• Shawe-Taylor, J., Cristianini, N., 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge.
• Schölkopf, B., Smola, A., 2002. Learning with Kernels. MIT Press.
• Burges, C., 1998. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2 (2), 121–167.
• Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York.
• Boser, B.E., Guyon, I.M., Vapnik, V.N., 1992. A training algorithm for optimal margin classifiers. In: Proc. Fifth Annual Workshop on Computational Learning Theory, pp. 144–152.