An Introduction to Kernel-Based
Learning Algorithms
K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda and B. Schölkopf
Presented by: Joanna Giforos
CS8980: Topics in Machine Learning
9 March, 2006
Outline
Problem Description
Nonlinear Algorithms in Kernel Feature Space
Supervised Learning:
Nonlinear SVM
Kernel Fisher Discriminant Analysis
Unsupervised Learning:
Kernel Principal Component Analysis
Applications
Model specific kernels
Problem Description
2-class Classification: estimate a function f: R^N → {−1, +1} using input-output training data
(x_1, y_1), …, (x_n, y_n) ∈ R^N × {−1, +1}
such that f will correctly classify unseen examples (x, y).
i.e., find a mapping x ↦ y = f(x).
Assume: Training and test data are drawn from the same
probability distribution P(x, y)
Problem Description
A learning machine is a family of functions {f(x, α)}, parameterized by α.
For a task of learning two classes:
f(x, α) ∈ {−1, 1}, ∀ x, α
Too complex ⇒ Overfitting
Not complex enough ⇒ Underfitting
Want to find the right balance between accuracy and
complexity
Problem Description
Best f(x, α) is the one that minimizes the expected error (risk):
R(α) = ∫ (1/2) |f(x, α) − y| dP(x, y)
Empirical Risk (training error):
R_emp(α) = (1/n) Σ_{i=1..n} (1/2) |f(x_i, α) − y_i|
R_emp(α) → R(α) as n → ∞
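As an aside (not from the slides), a minimal NumPy sketch of the empirical risk for a ±1-valued classifier; with ±1 labels the loss (1/2)|f(x_i) − y_i| is 0 or 1, so R_emp is simply the training misclassification rate:

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp = (1/n) * sum_i 0.5 * |f(x_i) - y_i| for labels y_i in {-1, +1}.

    With +/-1 labels, 0.5 * |f(x_i) - y_i| is 0 for a correct prediction
    and 1 for a wrong one, so this is the training error rate.
    """
    preds = np.array([f(x) for x in X])
    return np.mean(0.5 * np.abs(preds - y))

# Toy usage: a threshold classifier on 1-D inputs
X = np.array([[-2.0], [-1.0], [0.5], [2.0]])
y = np.array([-1, -1, 1, 1])
f = lambda x: 1 if x[0] > 0 else -1
print(empirical_risk(f, X, y))  # 0.0 on this separable toy set
```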
Structural Risk Minimization
Construct a nested family of function classes
F_1 ⊂ F_2 ⊂ … ⊂ F_k
with non-decreasing VC dimension h_1 ≤ h_2 ≤ … ≤ h_k.
Let f_1, …, f_k be the solutions of the empirical risk minimization in the corresponding F_i.
SRM chooses the function class and the function such
that an upper bound on the generalization error is minimized.
Nonlinear Algorithms in Kernel Feature Space
Via a non-linear mapping
Φ : R^N → F,  x ↦ Φ(x),
the data x_1, …, x_n ∈ R^N is mapped into a potentially much higher dimensional feature space F.
Given this mapping, we can compute scalar products in feature space using kernel functions:
k(x, y) = (Φ(x) · Φ(y))
Φ does not need to be known explicitly ⇒ every linear algorithm that only uses scalar products can implicitly be executed in F by using kernels.
Nonlinear Algorithms in Kernel Feature Space:
Example
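As an illustration of the kernel trick above, consider the degree-2 polynomial kernel: for the feature map Φ(x) = (x_1², √2·x_1·x_2, x_2²) one has (Φ(x) · Φ(y)) = (x · y)². A small NumPy sketch (the function names are my own) verifying this identity:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input: R^2 -> R^3."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# The kernel evaluates the feature-space dot product without forming phi:
print(np.dot(phi(x), phi(y)))  # 1.0
print(poly_kernel(x, y))       # 1.0
```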
Supervised Learning: Nonlinear SVM
Consider linear classifiers in feature space using dot products:
f(x) = sgn((w · Φ(x)) + b)
Conditions for classification without training error:
y_i ((w · Φ(x_i)) + b) ≥ 1,  i = 1, …, n
GOAL: Find w and b such that the empirical risk and regularization term are minimized.
But we cannot explicitly access w in the feature space, so we introduce Lagrange multipliers, α_i, one for each of the above constraints.
Supervised Learning: Nonlinear SVM
Last class we saw that the nonlinear SVM primal problem is:
min_{w, b} (1/2) ‖w‖²  subject to  y_i((w · Φ(x_i)) + b) ≥ 1,  i = 1, …, n
which leads to the dual:
max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j)  subject to  α_i ≥ 0,  Σ_i α_i y_i = 0
Supervised Learning: Nonlinear SVM
Using the Karush-Kuhn-Tucker (KKT) optimality conditions on the SVM problem, we obtain:
α_i [ y_i((w · Φ(x_i)) + b) − 1 ] = 0,  i = 1, …, n
The solution is sparse in α ⇒ many patterns are outside the margin
area and their optimal α_i's are zero.
Without sparsity, SVM would be impractical for large data sets.
Supervised Learning: Nonlinear SVM
The dual problem can be rewritten as:
max_α 1ᵀα − (1/2) αᵀ Q α  subject to  α ≥ 0,  yᵀα = 0
where Q_ij = y_i y_j k(x_i, x_j).
Since the dual is a convex QP (a concave objective maximized over a convex feasible set), every local maximum is a global maximum, but there can be several optimal solutions (in terms of the α_i).
Once the α_i's are found using QP solvers, simply plug them into the prediction rule:
f(x) = sgn( Σ_i α_i y_i k(x, x_i) + b )
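As a hedged illustration (not part of the slides), the following sketch fits a kernel SVM with scikit-learn and then reproduces the prediction rule f(x) = sgn(Σ_i α_i y_i k(x, x_i) + b) by hand from the fitted dual coefficients; the toy data and the values of gamma and C are arbitrary choices:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy 2-class data: two Gaussian blobs in 2-D
X = np.vstack([rng.normal(-1.5, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

def rbf(A, B, gamma):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Prediction rule by hand: sgn(sum_i alpha_i * y_i * k(x, x_i) + b).
# sklearn's dual_coef_ already stores alpha_i * y_i for the support vectors.
X_test = rng.normal(0.0, 2.0, (5, 2))
K_test = rbf(X_test, clf.support_vectors_, gamma)
manual = np.sign(K_test @ clf.dual_coef_.ravel() + clf.intercept_[0])

print(manual)
print(clf.predict(X_test))  # should agree with the manual rule
```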
Supervised Learning: KFD
Discriminant analysis seeks to find a projection of the data in a
direction that is efficient for discrimination.
Image from: R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, John Wiley & Sons, Inc., 2001.
Supervised Learning: KFD
Solve Fisher’s linear discriminant in kernel feature space.
Aims at finding linear projections such that the classes are well
separated.
How far are the projected means apart? (should be large)
How big is the variance of the data in this direction? (should be small)
Recall that this can be achieved by maximizing the Rayleigh quotient:
J(w) = (wᵀ S_B w) / (wᵀ S_W w)
where S_B = (m_1 − m_2)(m_1 − m_2)ᵀ is the between-class scatter and S_W = Σ_{k=1,2} Σ_{i ∈ class k} (x_i − m_k)(x_i − m_k)ᵀ is the within-class scatter (m_k being the mean of class k).
Supervised Learning: KFD
In kernel feature space F, express w in terms of the mapped training patterns:
w = Σ_i α_i Φ(x_i)
to get:
J(α) = (αᵀ M α) / (αᵀ N α)
where M = (M_1 − M_2)(M_1 − M_2)ᵀ with (M_k)_j = (1/n_k) Σ_i k(x_j, x_i^k), and N = Σ_{k=1,2} K_k (I − 1_{n_k}) K_kᵀ (K_k is the kernel matrix between all training points and the points of class k, and 1_{n_k} is the matrix with all entries 1/n_k).
Supervised Learning: KFD
Projection of a test point onto the discriminant is computed by:
(w · Φ(x)) = Σ_i α_i k(x_i, x)
Can solve the generalized eigenvalue problem:
M α = λ N α
But N and M may be large and non-sparse; can instead transform
KFD into a convex QP problem.
Question – can we use numerical approximations to the
eigenvalue problem?
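As a rough illustration of the eigenvalue route (my own NumPy/SciPy sketch, following the M and N definitions above, with a small ridge term added to N for numerical stability since N is typically rank-deficient):

```python
import numpy as np
from scipy.linalg import eigh

def kfd_alpha(K, y, reg=1e-3):
    """Kernel Fisher discriminant coefficients alpha (a sketch).

    K   : (n, n) kernel matrix over all training points
    y   : (n,) labels in {-1, +1}
    reg : small ridge added to N, since N is typically rank-deficient
    """
    n = len(y)
    N = np.zeros((n, n))
    mu = {}
    for c in (-1, 1):
        idx = np.where(y == c)[0]
        n_c = len(idx)
        K_c = K[:, idx]                       # n x n_c block of the kernel matrix
        mu[c] = K_c.mean(axis=1)              # (M_c)_j = (1/n_c) * sum_i k(x_j, x_i^c)
        ones = np.full((n_c, n_c), 1.0 / n_c)
        N += K_c @ (np.eye(n_c) - ones) @ K_c.T
    diff = mu[1] - mu[-1]
    M = np.outer(diff, diff)                  # between-class term in alpha-space
    # Leading generalized eigenvector of M alpha = lambda N alpha
    _, V = eigh(M, N + reg * np.eye(n))
    return V[:, -1]

# Projection of a new point x onto the discriminant: sum_i alpha_i * k(x_i, x)
```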
Supervised Learning: KFD
Can reformulate KFD as a constrained optimization problem.
FD tries to minimize the variance of the data along the projection whilst maximizing the distance between the projected means:
This QP is equivalent to J(α) since:
M is a matrix of rank one (its columns are linearly dependent)
Solutions of J(α) are invariant under scaling
⇒ Can fix the distance of the means to some arbitrary, positive value
and just minimize the variance.
Connection Between Boosting and Kernel
Methods
Can show that Boosting maximizes the smallest margin ρ.
Recall that the SVM maximizes the margin subject to a constraint on the l2 norm of w.
In general, using an arbitrary lp norm constraint on the weight
vector leads to maximizing the lq distance between the
hyperplane and the training points, where 1/p + 1/q = 1.
Boosting uses the l1 norm
SVM uses the l2 norm
Unsupervised Methods: Linear PCA
Principal Components Analysis (PCA) attempts to efficiently
represent the data by finding orthonormal axes which
maximally decorrelate the data
Given centered observations x_i ∈ R^N, i = 1, …, n, with Σ_i x_i = 0,
PCA finds the principal axes by diagonalizing the covariance matrix
C = (1/n) Σ_i x_i x_iᵀ
Note that C is positive semi-definite, and thus can be diagonalized
with nonnegative eigenvalues.
Unsupervised Methods: Linear PCA
Eigenvectors lie in the span of x_1, …, x_n: for an eigenpair, λv = Cv.
Thus it can be shown that
λv = Cv = (1/n) Σ_i (x_i · v) x_i
But (x_i · v) is just a scalar, so all solutions v with λ ≠ 0 lie in the
span of x_1, …, x_n, i.e.
v = Σ_i α_i x_i
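A tiny NumPy check (my own construction, with arbitrary toy dimensions) that the eigenvectors of C with λ ≠ 0 indeed lie in the span of the centered observations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))      # 5 points in 20 dimensions: span is a proper subspace
X = X - X.mean(axis=0)            # centered observations, sum_i x_i = 0

C = X.T @ X / X.shape[0]          # covariance matrix C = (1/n) sum_i x_i x_i^T
eigvals, eigvecs = np.linalg.eigh(C)

# Each eigenvector v with lambda != 0 lies in span{x_1, ..., x_n}:
# solve v = X^T a in the least-squares sense and check the residual is ~0.
for lam, v in zip(eigvals, eigvecs.T):
    if lam > 1e-10:
        a, *_ = np.linalg.lstsq(X.T, v, rcond=None)
        print(lam, np.linalg.norm(X.T @ a - v))  # residual ~ 0
```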
Unsupervised Methods: Kernel PCA
If we first map the data into another space F via Φ : R^N → F,
then, assuming we can center the data (Σ_i Φ(x_i) = 0), we can write the
covariance matrix as:
C̄ = (1/m) Σ_i Φ(x_i) Φ(x_i)ᵀ
which can be diagonalized with nonnegative eigenvalues λ
satisfying:
λ V = C̄ V
Unsupervised Methods: Kernel PCA
As in linear PCA, all solutions V with λ ≠ 0 lie in the span of
Φ(x_1), …, Φ(x_m), i.e.
V = Σ_i α_i Φ(x_i)
Substituting this into λV = C̄V and premultiplying both sides by Φ(x_k)ᵀ (for each k), we get:
m λ K α = K² α,  and it suffices to solve the eigenvalue problem  m λ α = K α
where K is the inner product (kernel) matrix: K_ij = (Φ(x_i) · Φ(x_j)) = k(x_i, x_j)
Unsupervised Methods: Kernel PCA
The resulting set of eigenvectors V^1, …, V^m are then used to extract the
principal components of a test point x by:
(V^k · Φ(x)) = Σ_i α_i^k k(x_i, x)
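A compact NumPy sketch of the whole kernel PCA procedure (my own construction, assuming an RBF kernel; the matching centering of the test kernel is omitted for brevity):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_pca(X, X_test, n_components=2, gamma=1.0):
    m = X.shape[0]
    K = rbf_kernel(X, X, gamma)

    # Center K in feature space (implements the assumption sum_i Phi(x_i) = 0)
    one_m = np.full((m, m), 1.0 / m)
    K = K - one_m @ K - K @ one_m + one_m @ K @ one_m

    # Solve m*lambda*alpha = K*alpha; np.linalg.eigh returns ascending eigenvalues
    eigvals, eigvecs = np.linalg.eigh(K)
    alphas = eigvecs[:, ::-1][:, :n_components]      # top eigenvectors of K
    top = eigvals[::-1][:n_components]               # corresponding eigenvalues m*lambda_k

    # Normalize so each V^k = sum_i alpha_i^k Phi(x_i) has unit length:
    # (V^k . V^k) = alpha^k' K alpha^k = (m*lambda_k) * ||alpha^k||^2 = 1
    alphas = alphas / np.sqrt(np.maximum(top, 1e-12))

    # Principal components of test points: (V^k . Phi(x)) = sum_i alpha_i^k k(x_i, x)
    # (a matching centering of K_test is omitted here for brevity)
    K_test = rbf_kernel(X_test, X, gamma)
    return K_test @ alphas

X = np.random.default_rng(0).normal(size=(100, 2))
print(kernel_pca(X, X[:5]).shape)  # (5, 2)
```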
Unsupervised Methods: Kernel PCA
Nonlinearities only enter the computation at two points:
In the calculation of the matrix K
In the evaluation of new points
Drawback of kernel PCA:
For large data sets, storage and computational complexity issues.
Can use sparse approximations of K.
Question: Can we think of other unsupervised methods which
can make use of kernels?
Kernel k-means, Kernel ICA, Spectral Clustering
Unsupervised Methods: Linear PCA
Unsupervised Methods: Kernel PCA
Applications
Support Vector Machines and Kernel Fisher Discriminant:
Bioinformatics: protein classification
OCR
Face Recognition
Content based image retrieval
Decision Tree Predictive Modeling
…
Kernel PCA
Denoising
Compression
Visualization
Feature extraction for classification
Kernels for Specific Applications
Image Segmentation: Gaussian-weighted χ²-distance between local color histograms (see the sketch after this list).
Can be shown to be robust for color and texture discrimination
Text classification: Vector Space kernels
Structured Data (strings, trees, etc.): Spectrum kernels
Generative models: P-kernels, Fisher kernels
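As referenced under the image-segmentation item above, a small sketch (my own; the bandwidth sigma is an arbitrary choice) of a Gaussian-weighted χ² kernel between normalized color histograms:

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two histograms:
    chi2(h, h') = sum_i (h_i - h'_i)^2 / (h_i + h'_i)."""
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def chi2_kernel(h1, h2, sigma=1.0):
    """Gaussian-weighted chi-squared kernel: k(h, h') = exp(-chi2(h, h') / (2*sigma^2))."""
    return np.exp(-chi2_distance(h1, h2) / (2.0 * sigma ** 2))

# Toy usage with two normalized 8-bin color histograms
rng = np.random.default_rng(0)
h1 = rng.random(8); h1 /= h1.sum()
h2 = rng.random(8); h2 /= h2.sum()
print(chi2_kernel(h1, h2))
```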