An Introduction to Kernel-Based Learning Algorithms
K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda and B. Schölkopf
Presented by: Joanna Giforos
CS8980: Topics in Machine Learning
9 March, 2006

Outline

• Problem Description
• Nonlinear Algorithms in Kernel Feature Space
• Supervised Learning:
  • Nonlinear SVM
  • Kernel Fisher Discriminant Analysis
• Unsupervised Learning:
  • Kernel Principal Component Analysis
• Applications
• Model-specific kernels

Problem Description

• 2-class classification: estimate a function f: R^N → {-1, +1} using input-output training data
  (x_1, y_1), ..., (x_n, y_n) ∈ R^N × {-1, +1}
  such that f will correctly classify unseen examples (x, y).
• i.e., find a mapping x ↦ y = f(x).
• Assume: training and test data are drawn from the same probability distribution P(x, y).

Problem Description

• A learning machine is a family of functions {f(x, α)}, indexed by parameters α ∈ Λ.
• For the task of learning two classes: f(x, α) ∈ {-1, 1}, ∀ x, α.
• Too complex ⇒ overfitting.
• Not complex enough ⇒ underfitting.
• Want to find the right balance between accuracy and complexity.

Problem Description

• The best f is the one that minimizes the expected error (risk):
  R(α) = ∫ (1/2) |f(x, α) - y| dP(x, y)
• Empirical risk (training error):
  R_emp(α) = (1/n) Σ_{i=1}^{n} (1/2) |f(x_i, α) - y_i|
• R_emp(α) → R(α) as n → ∞.

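As a minimal illustration (not from the slides), the sketch below computes the empirical risk with the 0/1 loss written as (1/2)|f(x) - y| for labels in {-1, +1}; the classifier f and the four training points are made up:

    import numpy as np

    def empirical_risk(f, X, y):
        # R_emp = (1/n) * sum of (1/2)|f(x_i) - y_i|, i.e. the fraction of misclassified points
        predictions = np.array([f(x) for x in X])
        return np.mean(0.5 * np.abs(predictions - y))

    # Hypothetical example: a fixed linear rule on 2-D points.
    f = lambda x: 1 if x[0] + x[1] > 0 else -1
    X = np.array([[1.0, 2.0], [-1.0, -0.5], [0.5, -2.0], [2.0, 0.1]])
    y = np.array([1, -1, -1, 1])
    print(empirical_risk(f, X, y))   # training error of f on this sample
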
Structural Risk Minimization

• Construct a nested family of function classes F_1 ⊂ F_2 ⊂ ... ⊂ F_k with non-decreasing VC dimension h_1 ≤ h_2 ≤ ... ≤ h_k.
• Let f_1, ..., f_k be the solutions of empirical risk minimization in F_1, ..., F_k.
• SRM chooses the function class F_i and the function f_i such that an upper bound on the generalization error is minimized.

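For reference (not spelled out on the slide), the upper bound typically used here is Vapnik's VC bound: with probability at least 1 - η over the draw of the training set,

  R(α) ≤ R_emp(α) + √( ( h (ln(2n/h) + 1) + ln(4/η) ) / n )

where h is the VC dimension of the function class. SRM selects the class (and its empirical-risk minimizer) for which this right-hand side is smallest.
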
Nonlinear Algorithms in Kernel Feature Space

• Via a non-linear mapping Φ: R^N → F, x ↦ Φ(x), the data is mapped into a potentially much higher-dimensional feature space F.
• Given this mapping, we can compute scalar products in feature space using kernel functions:
  k(x, y) = (Φ(x) · Φ(y))
• Φ does not need to be known explicitly ⇒ every linear algorithm that only uses scalar products can implicitly be executed in F by using kernels.

Nonlinear Algorithms in Kernel Feature Space: Example

• For x = (x_1, x_2) ∈ R^2, the quadratic feature map Φ(x) = (x_1^2, √2 x_1 x_2, x_2^2) maps into R^3, and
  (Φ(x) · Φ(y)) = (x · y)^2 = k(x, y),
  so the degree-2 polynomial kernel computes the scalar product in feature space without ever constructing Φ(x).

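A quick numerical check of this example (a sketch; the feature map and kernel follow the slide, the two test points are arbitrary):

    import numpy as np

    def phi(x):
        # Explicit degree-2 feature map for 2-D inputs: (x1^2, sqrt(2)*x1*x2, x2^2)
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    def poly_kernel(x, y):
        # Degree-2 polynomial kernel: (x . y)^2
        return np.dot(x, y) ** 2

    x = np.array([1.0, 2.0])
    y = np.array([3.0, -1.0])
    print(np.dot(phi(x), phi(y)))   # scalar product computed in feature space
    print(poly_kernel(x, y))        # same value, computed without Phi
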
Supervised Learning: Nonlinear SVM

• Consider linear classifiers in feature space using dot products:
  f(x) = sign((w · Φ(x)) + b)
• Conditions for classification without training error:
  y_i ((w · Φ(x_i)) + b) ≥ 1,  i = 1, ..., n
• GOAL: find w and b such that the empirical risk and the regularization term are minimized.
• But we cannot explicitly access w in the feature space, so we introduce Lagrange multipliers α_i, one for each of the above constraints.

Supervised Learning: Nonlinear SVM

• Last class we saw that the nonlinear SVM primal problem is:
  min_{w,b} (1/2) ||w||^2   subject to   y_i ((w · Φ(x_i)) + b) ≥ 1,  i = 1, ..., n
• which leads to the dual:
  max_α Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j (Φ(x_i) · Φ(x_j))
  subject to   α_i ≥ 0,  Σ_i α_i y_i = 0

Supervised Learning: Nonlinear SVM

• Using the Karush-Kuhn-Tucker (KKT) optimality conditions on the dual SVM problem, we obtain:
  w = Σ_i α_i y_i Φ(x_i),   with   α_i [ y_i ((w · Φ(x_i)) + b) - 1 ] = 0
• The solution is sparse in α ⇒ many patterns are outside the margin area and their optimal α_i's are zero.
• Without sparsity, SVM would be impractical for large data sets.

Supervised Learning: Nonlinear SVM

• The dual problem can be rewritten as:
  max_α Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j)
  subject to   α_i ≥ 0,  Σ_i α_i y_i = 0,
  where k(x_i, x_j) = (Φ(x_i) · Φ(x_j)).
• This is a convex quadratic program (a concave objective maximized over a convex set), so every local maximum is a global maximum, but there can be several optimal solutions (in terms of the α_i).
• Once the α_i's are found using a QP solver, simply plug them into the prediction rule:
  f(x) = sign( Σ_i α_i y_i k(x, x_i) + b )

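As an illustration (not from the slides), the sketch below fits a kernel SVM with scikit-learn and then re-evaluates the kernel-expansion prediction rule f(x) = sign(Σ α_i y_i k(x, x_i) + b) directly from the fitted dual coefficients; the toy data set, γ and C are arbitrary:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1.0, 1, -1)   # circular, non-linearly separable labels

    gamma = 0.5
    clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

    # Re-compute the decision function from the dual expansion:
    # dual_coef_ holds y_i * alpha_i for the support vectors only.
    def rbf(a, b):
        return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

    x_test = np.array([0.2, 0.3])
    K = rbf(clf.support_vectors_, x_test)                    # k(x_i, x) for each support vector
    decision = clf.dual_coef_ @ K + clf.intercept_
    print(np.sign(decision), clf.predict(x_test[None, :]))   # the two predictions should agree
    print("support vectors:", len(clf.support_vectors_), "of", len(X))   # sparsity in alpha
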
Supervised Learning: KFD

• Discriminant analysis seeks to find a projection of the data in a direction that is efficient for discrimination.
  Image from: R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, John Wiley & Sons, Inc., 2001.

Supervised Learning: KFD

• Solve Fisher's linear discriminant in kernel feature space.
• Aims at finding linear projections such that the classes are well separated:
  • How far apart are the projected means? (should be large)
  • How big is the variance of the data in this direction? (should be small)
• Recall that this can be achieved by maximizing the Rayleigh quotient:
  J(w) = (w^T S_B w) / (w^T S_W w)
  where S_B = (m_2 - m_1)(m_2 - m_1)^T is the between-class scatter and
  S_W = Σ_{i=1,2} Σ_{x ∈ X_i} (x - m_i)(x - m_i)^T is the within-class scatter.

Supervised Learning: KFD

• In kernel feature space F, express w in terms of the mapped training patterns:
  w = Σ_i α_i Φ(x_i)
• to get:
  J(α) = (α^T M α) / (α^T N α)
• where (M_i)_j = (1/ℓ_i) Σ_k k(x_j, x_k^(i)),  M = (M_1 - M_2)(M_1 - M_2)^T,  and
  N = Σ_{j=1,2} K_j (I - 1_{ℓ_j}) K_j^T,
  with (K_j)_{nm} = k(x_n, x_m^(j)) and 1_{ℓ_j} the ℓ_j × ℓ_j matrix with all entries 1/ℓ_j.

Supervised Learning: KFD

• Projection of a test point onto the discriminant is computed by:
  (w · Φ(x)) = Σ_i α_i k(x_i, x)
• Can solve the generalized eigenvalue problem:
  M α = λ N α
• But N and M may be large and non-sparse; one can instead transform KFD into a convex QP problem.
• Question: can we use numerical approximations to the eigenvalue problem?

Supervised Learning: KFD

• Can reformulate KFD as a constrained optimization problem.
• FD tries to minimize the variance of the data along the projection whilst maximizing the distance between the projected means.
• This QP is equivalent to maximizing J(α) since:
  • M is a matrix of rank 1 (its columns are linearly dependent), and
  • solutions w of J(α) are invariant under scaling,
  ⇒ we can fix the distance between the projected means to some arbitrary positive value and just minimize the variance.

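A minimal numerical sketch of KFD along these lines (not from the slides): it builds the kernel-space matrices M_1, M_2 and N from an RBF kernel and takes the leading direction as α ∝ N^(-1)(M_2 - M_1), with a small ridge added to N for stability; the two-class toy data and all parameters are arbitrary:

    import numpy as np

    rng = np.random.default_rng(1)
    X1 = rng.normal(loc=[0, 0], scale=0.5, size=(40, 2))   # class 1
    X2 = rng.normal(loc=[2, 2], scale=0.5, size=(40, 2))   # class 2
    X = np.vstack([X1, X2])

    gamma = 1.0
    K = np.exp(-gamma * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))

    idx1, idx2 = np.arange(40), np.arange(40, 80)
    M1 = K[:, idx1].mean(axis=1)          # (M_1)_j = (1/l_1) sum_k k(x_j, x_k of class 1)
    M2 = K[:, idx2].mean(axis=1)

    # Within-class matrix N = sum_c K_c (I - 1_{l_c}) K_c^T  (centering within each class)
    N = np.zeros((80, 80))
    for idx in (idx1, idx2):
        Kc = K[:, idx]
        H = np.eye(len(idx)) - np.full((len(idx), len(idx)), 1.0 / len(idx))
        N += Kc @ H @ Kc.T
    N += 1e-3 * np.eye(80)                # regularization (ridge) to keep N invertible

    alpha = np.linalg.solve(N, M2 - M1)   # leading KFD direction (M has rank 1)
    proj = K @ alpha                      # projections sum_i alpha_i k(x_i, x_j) of the training points
    print(proj[idx1].mean(), proj[idx2].mean())   # the two class means should be well separated
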
Connection Between Boosting and Kernel Methods

• Can show that Boosting maximizes the smallest margin ρ.
• Recall that the SVM maximizes the margin by minimizing the l_2 norm of the weight vector w.
• In general, using an l_p norm constraint on the weight vector leads to maximizing the l_q distance between the hyperplane and the training points, where 1/p + 1/q = 1.
  • Boosting uses the l_1 norm.
  • SVM uses the l_2 norm.

Unsupervised Methods: Linear PCA

• Principal Component Analysis (PCA) attempts to efficiently represent the data by finding orthonormal axes which maximally decorrelate the data.
• Given centered observations x_1, ..., x_n ∈ R^N with Σ_i x_i = 0,
• PCA finds the principal axes by diagonalizing the covariance matrix
  C = (1/n) Σ_i x_i x_i^T
• Note that C is positive semi-definite, and thus can be diagonalized with nonnegative eigenvalues.

Unsupervised Methods: Linear PCA

• The eigenvectors of C lie in the span of x_1, ..., x_n: for C v = λ v,
• it can be shown that
  λ v = C v = (1/n) Σ_i (x_i · v) x_i
• But (x_i · v) is just a scalar, so all solutions v with λ ≠ 0 lie in the span of x_1, ..., x_n, i.e.
  v = Σ_i α_i x_i

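A quick numerical check of this claim (a sketch with arbitrary data): the leading eigenvector of the covariance matrix C is recovered, up to scale and sign, as a linear combination X^T α of the data points, with α taken from the corresponding eigenvector of the Gram matrix X X^T of dot products:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 5))
    X -= X.mean(axis=0)                       # center the observations

    C = X.T @ X / len(X)                      # covariance matrix (5 x 5)
    G = X @ X.T                               # Gram matrix of dot products (50 x 50)

    w_C = np.linalg.eigh(C)[1][:, -1]         # leading eigenvector of C
    alpha = np.linalg.eigh(G)[1][:, -1]       # leading eigenvector of the Gram matrix
    w_G = X.T @ alpha                         # the same direction, expressed in the span of the data
    w_G /= np.linalg.norm(w_G)

    print(np.abs(w_C @ w_G))                  # ~1.0: the two unit vectors coincide up to sign
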
Unsupervised Methods: Kernel PCA

• If we first map the data into another space F via Φ: R^N → F,
• then, assuming we can center the data (Σ_i Φ(x_i) = 0), we can write the covariance matrix as:
  C = (1/m) Σ_i Φ(x_i) Φ(x_i)^T
• which can be diagonalized with nonnegative eigenvalues λ and eigenvectors v satisfying:
  λ v = C v

Unsupervised Methods: Kernel PCA

• As in linear PCA, all solutions v with λ ≠ 0 lie in the span of Φ(x_1), ..., Φ(x_m), i.e.
  v = Σ_i α_i Φ(x_i)
• Substituting this into λ v = C v, we get:
  λ Σ_i α_i Φ(x_i) = (1/m) Σ_j Φ(x_j) ( Φ(x_j) · Σ_i α_i Φ(x_i) )
• where K is the inner-product (kernel) matrix:
  K_ij = (Φ(x_i) · Φ(x_j)) = k(x_i, x_j)
• Premultiplying both sides by Φ(x_k)^T, we finally get:
  m λ K α = K^2 α,   which is satisfied by the solutions of   m λ α = K α

Unsupervised Methods: Kernel PCA

• The resulting set of eigenvectors α^k is then used to extract the principal components of a test point x by projecting onto the corresponding v^k in F:
  (v^k · Φ(x)) = Σ_i α_i^k k(x_i, x)

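A from-scratch sketch of kernel PCA along these lines (arbitrary data and RBF kernel; the kernel matrix is centered explicitly rather than assuming Σ Φ(x_i) = 0):

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 2))
    gamma = 0.5

    def rbf_kernel(A, B):
        d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)

    m = len(X)
    K = rbf_kernel(X, X)
    one_n = np.full((m, m), 1.0 / m)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # center K in feature space

    eigvals, eigvecs = np.linalg.eigh(Kc)                # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    alphas = eigvecs[:, :2] / np.sqrt(eigvals[:2])       # normalize so each v^k has unit length

    # Project a test point: (v^k . Phi(x)) = sum_i alpha_i^k k(x_i, x), with the same centering applied
    x_test = np.array([[0.1, -0.2]])
    k_test = rbf_kernel(x_test, X)
    one_t = np.full((1, m), 1.0 / m)
    k_c = k_test - one_t @ K - k_test @ one_n + one_t @ K @ one_n
    print(k_c @ alphas)                                  # first two kernel principal components of x_test
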
Unsupervised Methods: Kernel PCA

• Nonlinearities only enter the computation at two points:
  • in the calculation of the matrix K
  • in the evaluation of new points
• Drawback: for large data sets, storage of K and computational complexity become issues.
  • Can use sparse approximations of K.
• Question: can we think of other unsupervised methods which can make use of kernels?
  • Kernel k-means, Kernel ICA, Spectral Clustering

Applications

• Support Vector Machines and Kernel Fisher Discriminant:
  • Bioinformatics: protein classification
  • OCR
  • Face recognition
  • Content-based image retrieval
  • Decision tree predictive modeling
  • ...
• Kernel PCA:
  • Denoising
  • Compression
  • Visualization
  • Feature extraction for classification

Kernels for Specific Applications

• Image segmentation: Gaussian-weighted χ²-distance between local color histograms.
  • Can be shown to be robust for color and texture discrimination.
• Text classification: vector space kernels
• Structured data (strings, trees, etc.): spectrum kernels
• Generative models: P-kernels, Fisher kernels

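For instance, a minimal sketch of a p-spectrum kernel for strings (two sequences are compared through the dot product of their length-p substring counts; the example strings and p = 3 are arbitrary):

    from collections import Counter

    def spectrum(s, p):
        # Count all contiguous substrings (p-mers) of length p.
        return Counter(s[i:i + p] for i in range(len(s) - p + 1))

    def spectrum_kernel(s, t, p=3):
        # k(s, t) = sum over shared p-mers u of count_s(u) * count_t(u)
        cs, ct = spectrum(s, p), spectrum(t, p)
        return sum(cs[u] * ct[u] for u in cs.keys() & ct.keys())

    print(spectrum_kernel("MKVLAAGIVA", "MKVLGAGIVA"))   # similar protein-like strings -> large value
    print(spectrum_kernel("MKVLAAGIVA", "QQQQWWWWEE"))   # unrelated strings -> 0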