Support Vector Machines
and Kernel Methods
Kenan Gençol
Department of Electrical and Electronics Engineering
Anadolu University
Submitted in the course MAT592 Seminar
Advisor: Prof. Dr. Yalçın Küçük
Department of Mathematics
Agenda
Linear Discriminant Functions and Decision
Hyperplanes
Introduction to SVM
Support Vector Machines
Introduction to Kernels
Nonlinear SVM
Kernel Methods
Linear Discriminant Functions and
Decision Hyperplanes
Figure 1. Two classes of patterns and a
linear decision function
Linear Discriminant Functions and
Decision Hyperplanes
Each pattern is represented by a feature vector $\mathbf{x} = [x_1 \; x_2]^T$.
The linear decision function has the equation
$$g(\mathbf{x}) = w_1 x_1 + w_2 x_2 + w_0 = 0,$$
where $w_1, w_2$ are the weights and $w_0$ is the bias term.
Linear Discriminant Functions and
Decision Hyperplanes
The general decision hyperplane equation in $d$-dimensional space has the form
$$g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0 = 0,$$
where $\mathbf{w} = [w_1 \; w_2 \; \dots \; w_d]^T$ is the weight vector and $w_0$ is the bias term.
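As a small illustration (not from the slides; the weight values below are invented), a minimal sketch of evaluating this decision function and classifying a pattern by its sign:

```python
# Minimal sketch: evaluate the linear decision function g(x) = w . x + w0
# and classify by its sign. The weight values are illustrative assumptions.
import numpy as np

w = np.array([2.0, -1.0])   # assumed weight vector w = [w1, w2]
w0 = 0.5                    # assumed bias term

def g(x):
    """Linear decision function g(x) = w . x + w0."""
    return np.dot(w, x) + w0

x = np.array([1.0, 3.0])
label = +1 if g(x) >= 0 else -1    # decide the class from the sign of g(x)
print(g(x), label)                 # g(x) = 2*1 - 1*3 + 0.5 = -0.5 -> class -1
```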
Introduction to SVM
There are many hyperplanes that separate the two classes.
Figure 2. An example of two possible classifiers
Introduction to SVM
THE GOAL:
Our goal is to search for the direction $\mathbf{w}$ and bias $w_0$ that give the maximum possible margin, in other words, to orient the hyperplane so that it is as far as possible from the closest members of both classes.
SVM: Linearly Separable Case
Figure 3. Hyperplane through two linearly separable classes
SVM: Linearly Separable Case
Our training data is of the form
$$\{\mathbf{x}_i, y_i\}, \quad i = 1, \dots, L, \qquad y_i \in \{-1, +1\}, \quad \mathbf{x}_i \in \mathbb{R}^d.$$
The hyperplane can be described by
$$\mathbf{x} \cdot \mathbf{w} + b = 0$$
and is called the separating hyperplane.
SVM: Linearly Separable Case
Select the variables $\mathbf{w}$ and $b$ so that
$$\mathbf{x}_i \cdot \mathbf{w} + b \ge +1 \quad \text{for } y_i = +1, \qquad \mathbf{x}_i \cdot \mathbf{w} + b \le -1 \quad \text{for } y_i = -1.$$
These equations can be combined into
$$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0 \quad \forall i.$$
SVM: Linearly Separable Case
The points that lie closest to the separating hyperplane are called support vectors (the circled points in the diagram), and the hyperplanes $H_1: \mathbf{x} \cdot \mathbf{w} + b = +1$ and $H_2: \mathbf{x} \cdot \mathbf{w} + b = -1$ on which they lie are called supporting hyperplanes.
SVM: Linearly Separable Case
Figure 3 (repeated). Hyperplane through two linearly separable classes
SVM: Linearly Separable Case
The hyperplane's equidistance from $H_1$ and $H_2$ means that $d_1 = d_2$, and this quantity is known as the SVM margin:
$$d_1 = d_2 = \frac{1}{\|\mathbf{w}\|}, \qquad d_1 + d_2 = \frac{2}{\|\mathbf{w}\|}.$$
SVM: Linearly Separable Case
Maximizing the margin $\frac{1}{\|\mathbf{w}\|}$ is equivalent to
$$\min \|\mathbf{w}\| \quad \text{such that} \quad y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0,$$
and minimizing $\|\mathbf{w}\|$ is in turn equivalent to minimizing $\frac{1}{2}\|\mathbf{w}\|^2$, which allows us to perform Quadratic Programming (QP) optimization.
SVM: Linearly Separable Case
Optimization problem:
$$\text{Minimize} \quad \frac{1}{2}\|\mathbf{w}\|^2$$
$$\text{subject to} \quad y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0, \quad \forall i.$$
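A hedged sketch of this QP on a toy 2-D data set (the data points and the choice of scipy's SLSQP solver are illustrative assumptions; the presentation does not specify a solver):

```python
# Minimal sketch: solve the primal SVM problem
#   minimize (1/2)||w||^2  subject to  y_i (x_i . w + b) - 1 >= 0
# on an invented 2-D data set with a general-purpose solver (SLSQP).
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])  # toy inputs
y = np.array([+1.0, +1.0, -1.0, -1.0])                            # toy labels

def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)        # (1/2)||w||^2; b does not appear

def margin_constraints(v):
    w, b = v[:2], v[2]
    return y * (X @ w + b) - 1.0     # each component must be >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)            # maximum-margin hyperplane for the toy data
```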
SVM: Linearly Separable Case
This is an inequality-constrained optimization problem with the Lagrangian function
$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{L} \alpha_i \bigl[ y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \bigr], \qquad (1)$$
where $\alpha_i \ge 0$, $i = 1, 2, \dots, L$, are the Lagrange multipliers.
SVM
The corresponding KKT conditions are:
$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{L} \alpha_i y_i \mathbf{x}_i, \qquad (2)$$
$$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{L} \alpha_i y_i = 0. \qquad (3)$$
SVM
This is a convex optimization problem. The cost function is convex, and the constraints are linear and define a convex set of feasible solutions. Such problems can be solved by considering the so-called Lagrangian duality.
SVM
Substituting (2) and (3) into (1) gives a new formulation which, being dependent only on $\alpha$, we need to maximize.
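Spelling out that substitution (a standard step, in the notation of (1)–(3); the $b$ term vanishes by (3)):
$$\begin{aligned}
L_P &= \tfrac{1}{2}\,\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{L} \alpha_i \bigl[ y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \bigr] \\
&= \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j - \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
&= \sum_{i=1}^{L} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j \;=\; L_D.
\end{aligned}$$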
SVM
This is called the dual form (Lagrangian dual) of the primal form. The dual form only requires the dot products of the input vectors to be calculated.
This is important for the kernel trick, which will be described later.
SVM
So the problem becomes a dual problem:
$$\text{Maximize} \quad L_D = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \boldsymbol{\alpha}^T H \boldsymbol{\alpha}, \qquad H_{ij} = y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j,$$
$$\text{subject to} \quad \sum_{i=1}^{L} \alpha_i y_i = 0, \qquad \alpha_i \ge 0 \;\forall i.$$
SVM
Differentiating with respect to the $\alpha_i$'s and using the constraint equation, a system of equations is obtained. Solving this system gives the Lagrange multipliers, and the optimum hyperplane is then given by the formula
$$\mathbf{w} = \sum_{i=1}^{L} \alpha_i y_i \mathbf{x}_i.$$
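A minimal sketch of this procedure on invented toy data, assuming a general-purpose solver (scipy's SLSQP) in place of a dedicated QP package; the bias $b$ is recovered from the support-vector condition $y_s(\mathbf{x}_s \cdot \mathbf{w} + b) = 1$:

```python
# Minimal sketch: solve the dual problem
#   maximize  sum_i alpha_i - (1/2) alpha^T H alpha,  H_ij = y_i y_j x_i . x_j
#   subject to sum_i alpha_i y_i = 0,  alpha_i >= 0
# then recover w = sum_i alpha_i y_i x_i and b from a support vector.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])  # toy inputs
y = np.array([+1.0, +1.0, -1.0, -1.0])                            # toy labels
L = len(y)
H = (y[:, None] * y[None, :]) * (X @ X.T)                         # H_ij

def neg_dual(alpha):                     # minimize the negative of L_D
    return 0.5 * alpha @ H @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(L), method="SLSQP",
               bounds=[(0.0, None)] * L,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                      # optimum hyperplane: w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                        # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)           # b from y_s (x_s . w + b) = 1
print("alpha =", alpha, "\nw =", w, " b =", b)
```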
SVM
Some Notes:
SUPPORT VECTORS are the feature vectors for which $\alpha_i > 0$, $i = 1, 2, \dots, L$.
The cost function is strictly convex.
The Hessian matrix is positive definite.
Any local minimum is also global and unique.
The optimal hyperplane classifier of an SVM is UNIQUE.
Although the solution is unique, the resulting Lagrange multipliers are not unique.
Kernels: Introduction
When applying our SVM to linearly separable data we started by creating a matrix $H$ from the dot products of our input variables:
$$k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j = \mathbf{x}_i^T \mathbf{x}_j,$$
known as the linear kernel, an example of a family of functions called kernel functions.
Kernels: Introduction
Kernel functions are all based on calculating the inner product of two vectors.
This means that if the data are mapped into a higher-dimensional space by a nonlinear mapping function $\Phi: \mathbf{x} \to \Phi(\mathbf{x})$, only the inner products of the mapped inputs need to be determined, without having to calculate $\Phi$ explicitly.
This is called the "kernel trick".
Kernels: Introduction
The kernel trick is useful because there are many classification/regression problems that are not fully separable/regressable in the input space but are separable/regressable in a higher-dimensional space:
$$\Phi: \mathbb{R}^d \to H,$$
$$\mathbf{x}_i \cdot \mathbf{x}_j \;\to\; \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle.$$
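As a concrete illustration of the trick (not from the slides), the sketch below checks numerically that the degree-2 polynomial kernel $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$ on 2-D inputs equals an ordinary dot product after the explicit mapping $\Phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, so $\Phi$ never has to be formed explicitly:

```python
# Minimal sketch of the kernel trick: for the degree-2 polynomial kernel
# k(x, z) = (x . z)^2 on 2-D inputs, an explicit feature map is
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), and k(x, z) = phi(x) . phi(z).
import numpy as np

def phi(x):
    """Explicit nonlinear mapping into the higher-dimensional space."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, z):
    """The same inner product computed directly in the input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)), k(x, z))   # both print 1.0: (1*3 + 2*(-1))^2 = 1
```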
Kernels: Introduction
Popular Kernel Families:
Radial Basis Function (RBF) Kernel
Polynomial Kernel
Sigmoidal (Hyperbolic Tangent) Kernel
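For concreteness, minimal sketches of these three families follow; the parameter names and default values (sigma, degree, coef0, kappa, delta) are illustrative assumptions, not values from the presentation:

```python
# Minimal sketches of the popular kernel families named above.
# Parameter names and values are illustrative assumptions.
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Radial Basis Function kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel: (x . z + coef0)^degree."""
    return (np.dot(x, z) + coef0) ** degree

def sigmoid_kernel(x, z, kappa=1.0, delta=-1.0):
    """Sigmoidal (hyperbolic tangent) kernel: tanh(kappa * x . z + delta)."""
    return np.tanh(kappa * np.dot(x, z) + delta)

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(rbf_kernel(x, z), polynomial_kernel(x, z), sigmoid_kernel(x, z))
```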
Nonlinear Support Vector Machines
The support vector machine with kernel functions becomes:
$$\text{Maximize} \quad L_D = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j),$$
and the resulting classifier:
$$f(\mathbf{x}) = \operatorname{sgn}\!\left( \sum_{i=1}^{L} \alpha_i y_i\, k(\mathbf{x}_i, \mathbf{x}) + b \right).$$
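A minimal sketch of this kernelized decision rule; the "trained" quantities below (support vectors, labels, multipliers, bias) are invented placeholders rather than results from the slides:

```python
# Minimal sketch of the resulting kernel classifier
#   f(x) = sign( sum_i alpha_i y_i k(x_i, x) + b ).
# The "trained" quantities are invented placeholders for illustration.
import numpy as np

def k(x, z, sigma=1.0):
    """RBF kernel used in place of the plain dot product."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

X = np.array([[2.0, 2.0], [0.0, 0.0]])   # support vectors (placeholder)
y = np.array([+1.0, -1.0])               # their labels (placeholder)
alpha = np.array([0.25, 0.25])           # Lagrange multipliers (placeholder)
b = 0.0                                  # bias (placeholder)

def classify(x):
    s = sum(a * yi * k(xi, x) for a, yi, xi in zip(alpha, y, X)) + b
    return 1 if s >= 0 else -1

print(classify(np.array([2.5, 2.0])), classify(np.array([-0.5, 0.2])))   # 1 -1
```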
Nonlinear Support Vector Machines
Figure 4. The SVM architecture employing
kernel functions.
Kernel Methods
Recall that a kernel function computes the inner product of the images of two data points under an embedding:
$$k(\mathbf{x}, \mathbf{z}) = \langle \Phi(\mathbf{x}), \Phi(\mathbf{z}) \rangle.$$
$k: X \times X \to \mathbb{R}$ is a kernel if
1. $k$ is symmetric: $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{y}, \mathbf{x})$;
2. $k$ is positive semi-definite, i.e., the "Gram matrix" $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is positive semi-definite.
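A quick numerical illustration of these two properties (a sketch on invented points, using the RBF kernel):

```python
# Minimal sketch: check the two kernel properties empirically for the RBF
# kernel on a handful of invented points -- the Gram matrix K_ij = k(x_i, x_j)
# should be symmetric and positive semi-definite.
import numpy as np

def k(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.5], [-2.0, 1.0], [0.3, -1.2]])
K = np.array([[k(xi, xj) for xj in X] for xi in X])      # Gram matrix

print(np.allclose(K, K.T))                       # symmetry: True
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # eigenvalues >= 0 (up to rounding): True
```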
Kernel Methods
The question of for which kernels there exists a pair $\{H, \Phi\}$ with the properties described above, and for which there does not, is answered by Mercer's condition.
Mercer’s condition
Let $X$ be a compact subset of $\mathbb{R}^n$, $\mathbf{x} \in X$, and let $\Phi$ be a mapping
$$\Phi: \mathbf{x} \in X \to \Phi(\mathbf{x}) \in H,$$
where $H$ is a Euclidean space. Then the inner product operation has an equivalent representation
$$k(\mathbf{x}, \mathbf{z}) = \langle \Phi(\mathbf{x}), \Phi(\mathbf{z}) \rangle = \sum_r \Phi_r(\mathbf{x})\, \Phi_r(\mathbf{z}),$$
and $k(\mathbf{x}, \mathbf{z})$ is a symmetric function satisfying the following condition:
$$\int_{X \times X} k(\mathbf{x}, \mathbf{z})\, g(\mathbf{x})\, g(\mathbf{z})\, d\mathbf{x}\, d\mathbf{z} \ge 0$$
for any $g(\mathbf{x})$, $\mathbf{x} \in X$, such that
$$\int_X g(\mathbf{x})^2\, d\mathbf{x} < +\infty.$$
Mercer’s Theorem
Theorem. Suppose $K$ is a continuous symmetric non-negative definite kernel. Then there is an orthonormal basis $\{e_i\}_i$ of $L^2[a, b]$ consisting of eigenfunctions of $T_K$,
$$[T_K \varphi](x) = \int_a^b K(x, s)\, \varphi(s)\, ds,$$
such that the corresponding sequence of eigenvalues $\{\lambda_i\}_i$ is non-negative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on $[a, b]$, and $K$ has the representation
$$K(s, t) = \sum_{j=1}^{\infty} \lambda_j\, e_j(s)\, e_j(t),$$
where the convergence is absolute and uniform.
Kernel Methods
Suppose $k_1$ and $k_2$ are valid (symmetric, positive definite) kernels on $X$. Then the following are also valid kernels:
1.
2.
3.
Kernel Methods
4.
5.
6.
7.
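Two standard closure properties of this kind are that the sum and the (elementwise) product of valid kernels are again valid kernels; the sketch below checks this numerically on invented data, without assuming these are exactly the items listed above:

```python
# Minimal sketch: check empirically that the sum and the elementwise (Schur)
# product of two valid kernels give positive semi-definite Gram matrices,
# two standard closure properties of kernels. The data points are invented.
import numpy as np

def k1(x, z):                       # linear kernel
    return np.dot(x, z)

def k2(x, z, sigma=1.0):            # RBF kernel
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

X = np.array([[0.0, 1.0], [2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]])
K1 = np.array([[k1(xi, xj) for xj in X] for xi in X])
K2 = np.array([[k2(xi, xj) for xj in X] for xi in X])

def is_psd(K, tol=1e-10):
    return np.all(np.linalg.eigvalsh(K) >= -tol)

print(is_psd(K1 + K2))   # sum of kernels: True
print(is_psd(K1 * K2))   # elementwise product of kernels: True
```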
Thank You