
Support Vector Machines
and Kernel Methods
Kenan Gençol
Department of Electrical and Electronics Engineering
Anadolu University
Submitted for the course MAT592 Seminar
Advisor: Prof. Dr. Yalçın Küçük
Department of Mathematics
Agenda
- Linear Discriminant Functions and Decision Hyperplanes
- Introduction to SVM
- Support Vector Machines
- Introduction to Kernels
- Nonlinear SVM
- Kernel Methods
Linear Discriminant Functions and Decision Hyperplanes
Figure 1. Two classes of patterns and a linear decision function
Linear Discriminant Functions and Decision Hyperplanes
- Each pattern is represented by a vector $\mathbf{x} = [x_1 \; x_2]^T$.
- The linear decision function has the equation
$$g(\mathbf{x}) = w_1 x_1 + w_2 x_2 + w_0 = 0,$$
where $w_1, w_2$ are weights and $w_0$ is the bias term.
Linear Discriminant Functions and Decision Hyperplanes
- The general decision hyperplane equation in $d$-dimensional space has the form
$$\mathbf{w}^T \mathbf{x} + w_0 = 0,$$
where $\mathbf{w} = [w_1 \; w_2 \; \dots \; w_d]^T$ is the weight vector and $w_0$ is the bias term.
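As a quick illustration, here is a minimal NumPy sketch of evaluating such a decision function and classifying by the sign of $g(\mathbf{x})$; the library choice, weights, and sample point are assumptions for illustration, not part of the slides.

```python
import numpy as np

# Hypothetical weight vector w = [w1, ..., wd] and bias w0 (made-up values).
w = np.array([2.0, -1.0, 0.5])
w0 = -0.3

def g(x):
    """Linear discriminant g(x) = w . x + w0."""
    return np.dot(w, x) + w0

x = np.array([1.0, 0.5, -2.0])
label = 1 if g(x) >= 0 else -1   # classify by the sign of g(x)
print(g(x), label)
```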
Introduction to SVM
- There are many hyperplanes that separate the two classes.
Figure 2. An example of two possible classifiers
Introduction to SVM
- THE GOAL: search for the direction $\mathbf{w}$ and bias $w_0$ that give the maximum possible margin, or in other words, orient the hyperplane so that it is as far as possible from the closest members of both classes.
SVM: Linearly Separable Case
Figure 3. Hyperplane through two linearly separable classes
SVM: Linearly Separable Case
- Our training data is of the form
$$\{(\mathbf{x}_i, y_i)\}, \quad i = 1, \dots, L, \quad y_i \in \{-1, +1\}, \quad \mathbf{x}_i \in \mathbb{R}^d.$$
- The hyperplane can be described by
$$\mathbf{x} \cdot \mathbf{w} + b = 0$$
and is called the separating hyperplane.
SVM: Linearly Separable Case
- Select the variables $\mathbf{w}$ and $b$ so that
$$\mathbf{x}_i \cdot \mathbf{w} + b \ge +1 \quad \text{for } y_i = +1,$$
$$\mathbf{x}_i \cdot \mathbf{w} + b \le -1 \quad \text{for } y_i = -1.$$
- These equations can be combined into
$$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0 \quad \forall i.$$
SVM: Linearly Separable Case
- The points that lie closest to the separating hyperplane are called support vectors (the circled points in the diagram), and the hyperplanes H1 and H2 on which they lie are called supporting hyperplanes.
SVM: Linearly Separable Case
- The hyperplane's equidistance from H1 and H2 means that $d_1 = d_2$, and this quantity is known as the SVM margin:
$$d_1 = d_2 = \frac{1}{\lVert \mathbf{w} \rVert}, \qquad d_1 + d_2 = \frac{2}{\lVert \mathbf{w} \rVert}.$$
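To make the margin concrete, the following NumPy sketch (an illustration with made-up data and a hand-picked hyperplane, not from the slides) checks the constraint $y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \ge 1$ and reports the margin $2 / \lVert \mathbf{w} \rVert$.

```python
import numpy as np

# Made-up 2-D training set: two points per class, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# A candidate separating hyperplane w . x + b = 0 (chosen by hand).
w = np.array([0.25, 0.25])
b = 0.0

# The constraint y_i (x_i . w + b) - 1 >= 0 must hold for every training point.
ok = y * (X @ w + b) - 1 >= 0
margin = 2.0 / np.linalg.norm(w)   # d1 + d2 = 2 / ||w||

print(ok.all(), margin)
```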
SVM: Linearly Separable Case
- Maximizing $\frac{1}{\lVert \mathbf{w} \rVert}$ is equivalent to minimizing $\lVert \mathbf{w} \rVert$ such that $y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0$.
- Minimizing $\lVert \mathbf{w} \rVert$ is in turn equivalent to minimizing $\frac{1}{2}\lVert \mathbf{w} \rVert^2$, which makes it possible to perform Quadratic Programming (QP) optimization.
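A minimal sketch of this primal problem, assuming NumPy and SciPy's general-purpose SLSQP solver (neither library is mentioned in the slides) and a made-up toy data set: it minimizes $\frac{1}{2}\lVert \mathbf{w} \rVert^2$ subject to $y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0$.

```python
import numpy as np
from scipy.optimize import minimize  # assumed third-party solver

# Made-up linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):
    w = p[:-1]
    return 0.5 * np.dot(w, w)           # (1/2) ||w||^2

def constraints(p):
    w, b = p[:-1], p[-1]
    return y * (X @ w + b) - 1.0        # each entry must be >= 0

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': constraints}])
w, b = res.x[:-1], res.x[-1]
print(w, b)
```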
SVM: Linearly Separable Case
- Optimization problem:
$$\text{Minimize } \frac{1}{2}\lVert \mathbf{w} \rVert^2 \quad \text{subject to } y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0 \;\; \forall i.$$
SVM: Linearly Separable Case
- This is an inequality-constrained optimization problem with the Lagrangian function
$$L_P = \frac{1}{2}\lVert \mathbf{w} \rVert^2 - \sum_{i=1}^{L} \alpha_i \left[ y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \right], \qquad (1)$$
where $\alpha_i \ge 0$, $i = 1, 2, \dots, L$, are Lagrange multipliers.
SVM
- The corresponding KKT conditions are:
$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{L} \alpha_i y_i \mathbf{x}_i, \qquad (2)$$
$$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{L} \alpha_i y_i = 0. \qquad (3)$$
SVM
- This is a convex optimization problem. The cost function is convex, and the constraints are linear and define a convex set of feasible solutions. Such problems can be solved by considering the so-called Lagrangian duality.
SVM
- Substituting (2) and (3) into (1) gives a new formulation which, being dependent only on $\alpha$, we need to maximize.
SVM
- This is called the dual form (Lagrangian dual) of the primal form. The dual form only requires the dot products of the input vectors to be calculated.
- This is important for the kernel trick, which will be described later.
SVM
- So the problem becomes the dual problem: maximize
$$L_D = \sum_{i=1}^{L} \alpha_i - \frac{1}{2}\, \boldsymbol{\alpha}^T H \boldsymbol{\alpha}$$
subject to
$$\sum_{i=1}^{L} \alpha_i y_i = 0, \qquad \alpha_i \ge 0 \;\; \forall i,$$
where $H_{ij} = y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j$.
SVM
- Differentiating with respect to the $\alpha_i$'s and using the constraint equation, a system of equations is obtained. Solving this system gives the Lagrange multipliers, and the optimum hyperplane is then given by
$$\mathbf{w} = \sum_{i=1}^{L} \alpha_i y_i \mathbf{x}_i, \qquad b = y_s - \mathbf{w} \cdot \mathbf{x}_s \;\; \text{for any support vector } \mathbf{x}_s.$$
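The sketch below puts the last few slides together: it solves the dual with SciPy's SLSQP solver and then recovers $(\mathbf{w}, b)$ from the multipliers. NumPy, SciPy, the toy data, and the $10^{-6}$ support-vector threshold are all assumptions for illustration, not part of the slides.

```python
import numpy as np
from scipy.optimize import minimize  # assumed third-party solver

# Made-up linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
L = len(y)

H = (y[:, None] * X) @ (y[:, None] * X).T       # H_ij = y_i y_j x_i . x_j

def neg_dual(a):
    # Maximizing L_D = sum(a) - (1/2) a^T H a is minimizing its negative.
    return 0.5 * a @ H @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(L), method='SLSQP',
               bounds=[(0.0, None)] * L,                              # alpha_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alphas = res.x

# Recover the hyperplane: w = sum_i alpha_i y_i x_i, and b from the
# support vectors (alpha_i > 0), for which y_s (w . x_s + b) = 1.
w = (alphas * y) @ X
sv = alphas > 1e-6
b = np.mean(y[sv] - X[sv] @ w)
print(alphas, w, b)
```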
SVM
Some notes:
- SUPPORT VECTORS are the feature vectors with $\alpha_i > 0$, $i = 1, 2, \dots, L$.
- The cost function is strictly convex.
- The Hessian matrix is positive definite.
- Any local minimum is also global and unique.
- The optimal hyperplane classifier of an SVM is UNIQUE.
- Although the solution is unique, the resulting Lagrange multipliers are not unique.
Kernels: Introduction
- When applying our SVM to linearly separable data, we started by creating a matrix H from the dot products of our input variables:
$$k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j = \mathbf{x}_i^T \mathbf{x}_j,$$
which is known as the linear kernel, an example of a family of functions called kernel functions.
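For example, with the linear kernel the matrix H used in the dual can be assembled in a couple of NumPy lines (a sketch with made-up data; the library and values are assumptions):

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5]])   # made-up inputs
y = np.array([1.0, 1.0, -1.0])                          # labels

K = X @ X.T                          # linear kernel: K_ij = x_i . x_j
H = (y[:, None] * y[None, :]) * K    # H_ij = y_i y_j k(x_i, x_j)
print(H)
```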
Kernels: Introduction
- Kernel functions are all based on calculating the inner product of two vectors.
- This means that if the inputs are mapped into a higher-dimensional space by a nonlinear mapping function $\Phi : \mathbf{x} \mapsto \Phi(\mathbf{x})$, only the inner products of the mapped inputs need to be determined, without explicitly calculating $\Phi$.
- This is called the "kernel trick".
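A tiny numeric check makes the trick concrete. For 2-D inputs, the kernel $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$ equals the ordinary inner product of the explicitly mapped vectors $\Phi(\mathbf{x}) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$; this particular mapping is a standard textbook choice used here for illustration, not one given in the slides.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, z):
    """Kernel trick: same value as phi(x) . phi(z), without forming phi."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)), k(x, z))   # both give 1.0
```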
Kernels: Introduction
- The kernel trick is useful because there are many classification/regression problems that are not fully separable/regressable in the input space but are in a higher-dimensional space:
$$\Phi : \mathbb{R}^d \to H,$$
$$\mathbf{x}_i \cdot \mathbf{x}_j \;\to\; \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle.$$
Kernels: Introduction
Popular kernel families:
- Radial Basis Function (RBF) kernel: $k(\mathbf{x}, \mathbf{z}) = \exp\left(-\dfrac{\lVert \mathbf{x} - \mathbf{z} \rVert^2}{2\sigma^2}\right)$
- Polynomial kernel: $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z} + a)^p$
- Sigmoidal (hyperbolic tangent) kernel: $k(\mathbf{x}, \mathbf{z}) = \tanh(\kappa\, \mathbf{x} \cdot \mathbf{z} - \delta)$
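A minimal NumPy sketch of these three kernels; the parameterizations and parameter names (sigma, a, p, kappa, delta) follow the common textbook forms above and are assumptions rather than the slides' own notation.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Radial Basis Function kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, a=1.0, p=3):
    """Polynomial kernel: (x . z + a)^p."""
    return (np.dot(x, z) + a) ** p

def sigmoid_kernel(x, z, kappa=1.0, delta=1.0):
    """Sigmoidal (hyperbolic tangent) kernel: tanh(kappa * x . z - delta)."""
    return np.tanh(kappa * np.dot(x, z) - delta)
```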

Nonlinear Support Vector Machines
- With kernel functions, the support vector machine becomes: maximize
$$L_D = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)$$
subject to the same constraints, and the resulting classifier is
$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{L} \alpha_i y_i\, k(\mathbf{x}_i, \mathbf{x}) + b \right).$$
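In practice this is what a library SVM computes. A minimal sketch using scikit-learn (the library, its parameters, and the made-up ring-shaped data set are assumptions, not mentioned in the slides) trains an RBF-kernel SVM on data that is not linearly separable in the input space and classifies two new points.

```python
import numpy as np
from sklearn.svm import SVC  # assumed third-party library

# Made-up data: class +1 near the origin, class -1 on a surrounding ring,
# which is not linearly separable in the input space.
rng = np.random.default_rng(0)
inner = rng.normal(scale=0.5, size=(20, 2))
angles = rng.uniform(0, 2 * np.pi, size=20)
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)]
X = np.vstack([inner, outer])
y = np.array([1] * 20 + [-1] * 20)

# Internally the decision function is sgn(sum_i alpha_i y_i k(x_i, x) + b).
clf = SVC(kernel='rbf', gamma=0.5).fit(X, y)
print(clf.predict([[0.2, -0.1], [3.0, 0.0]]))   # expected: [ 1 -1]
```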
Nonlinear Support Vector Machines
Figure 4. The SVM architecture employing
kernel functions.
Kernel Methods
- Recall that a kernel function computes the inner product of the images of two data points under an embedding $\Phi$:
$$k(x, z) = \langle \Phi(x), \Phi(z) \rangle.$$
- $k : X \times X \to \mathbb{R}$ is a kernel if:
1. $k$ is symmetric: $k(x, y) = k(y, x)$;
2. $k$ is positive semi-definite, i.e., the Gram matrix $K_{ij} = k(x_i, x_j)$ is positive semi-definite.
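Both conditions are easy to check numerically on a finite sample. The sketch below (made-up data, RBF kernel with unit width chosen purely for illustration) tests the symmetry of the Gram matrix and the sign of its eigenvalues.

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(10, 3))   # made-up sample of 10 points

# Gram matrix for the RBF kernel k(x, z) = exp(-||x - z||^2 / 2).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

symmetric = np.allclose(K, K.T)
psd = np.all(np.linalg.eigvalsh(K) >= -1e-10)   # eigenvalues of the symmetric Gram matrix
print(symmetric, psd)                           # True True
```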
Kernel Methods
- The question of for which kernels there exists a pair $\{H, \Phi\}$ with the properties described above, and for which there does not, is answered by Mercer's condition.
Mercer's condition
- Let $X$ be a compact subset of $\mathbb{R}^n$, let $x \in X$, and let $\Phi : x \mapsto \Phi(x) \in H$ be a mapping, where $H$ is a Euclidean space. Then the inner product operation has an equivalent representation
$$k(x, z) = \langle \Phi(x), \Phi(z) \rangle = \sum_r \Phi_r(x)\, \Phi_r(z),$$
and $k(x, z)$ is a symmetric function satisfying the condition
$$\int_X \int_X k(x, z)\, g(x)\, g(z)\, dx\, dz \ge 0$$
for any $g(x)$, $x \in X$, such that
$$\int_X g^2(x)\, dx < +\infty.$$
Mercer's Theorem
- Theorem. Suppose $K$ is a continuous symmetric non-negative definite kernel. Then there is an orthonormal basis $\{e_i\}_i$ of $L^2[a, b]$ consisting of eigenfunctions of
$$[T_K \varphi](x) = \int_a^b K(x, s)\, \varphi(s)\, ds$$
such that the corresponding sequence of eigenvalues $\{\lambda_i\}_i$ is non-negative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on $[a, b]$, and $K$ has the representation
$$K(s, t) = \sum_{j=1}^{\infty} \lambda_j\, e_j(s)\, e_j(t),$$
where the convergence is absolute and uniform.
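A discretized check of this representation (an illustration only: replacing the integral operator by a kernel matrix on a grid is an assumption of the sketch, not part of the theorem) evaluates a Gaussian kernel on $[0, 1]$, eigendecomposes it, and confirms that the eigenvalues are numerically non-negative and that the expansion rebuilds the kernel.

```python
import numpy as np

a, b, n = 0.0, 1.0, 200
s = np.linspace(a, b, n)
K = np.exp(-(s[:, None] - s[None, :]) ** 2 / 0.1)   # continuous symmetric PSD kernel on [a, b]

# Discrete surrogate of the integral operator T_K: eigendecompose the kernel matrix.
lam, E = np.linalg.eigh(K)
print(lam.min() >= -1e-8)            # eigenvalues are non-negative up to round-off

# K(s, t) = sum_j lambda_j e_j(s) e_j(t): the expansion reproduces the kernel matrix.
K_rebuilt = (E * lam) @ E.T
print(np.allclose(K, K_rebuilt))     # True
```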
Kernel Methods
- Suppose $k_1$ and $k_2$ are valid (symmetric, positive definite) kernels on $X$. Then the following are also valid kernels (a numerical sanity check of the first three properties is sketched after the list):
1. $k(x, z) = k_1(x, z) + k_2(x, z)$
2. $k(x, z) = a\, k_1(x, z)$, for $a > 0$
3. $k(x, z) = k_1(x, z)\, k_2(x, z)$
4. $k(x, z) = f(x)\, f(z)$, for any real-valued function $f$
5. $k(x, z) = k_1(\Phi(x), \Phi(z))$, for a mapping $\Phi$
6. $k(x, z) = \exp(k_1(x, z))$
7. $k(x, z) = p(k_1(x, z))$, for a polynomial $p$ with non-negative coefficients
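The following sketch checks the first three closure properties on made-up data by verifying that the resulting Gram matrices stay positive semi-definite (the data, kernels, and tolerance are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 4))                      # made-up sample

K1 = X @ X.T                                      # linear kernel Gram matrix
sq = np.sum((X[:, None] - X[None, :]) ** 2, -1)
K2 = np.exp(-sq / 2.0)                            # RBF kernel Gram matrix

def is_psd(K, tol=1e-8):
    """All eigenvalues of the symmetric matrix K are >= 0 up to round-off."""
    return np.all(np.linalg.eigvalsh(K) >= -tol)

print(is_psd(K1 + K2))      # property 1: sum of two kernels
print(is_psd(3.0 * K1))     # property 2: positive scaling
print(is_psd(K1 * K2))      # property 3: elementwise (Schur) product
```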
References
[1] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery 2, 121-167, 1998.
[2] J. P. Marques de Sá, "Pattern Recognition: Concepts, Methods and Applications", Springer, 2001.
[3] S. Theodoridis, "Pattern Recognition", Elsevier Academic Press, 2003.
[4] T. Fletcher, "Support Vector Machines Explained", UCL, March 2005.
[5] N. Cristianini, J. Shawe-Taylor, "Kernel Methods for Pattern Analysis", Cambridge University Press, 2004.
[6] "Mercer's Theorem", Wikipedia: http://en.wikipedia.org/wiki/Mercer’s_theorem
Thank You