Transcript slides
Support Vector Machine (SVM)
Based on Nello Cristianini presentation
http://www.support-vector.net/tutorial.html
Basic Idea
• Use Linear Learning Machine (LLM).
• Overcome the linearity constraints:
Map non-linearly to a higher dimension.
• Select between hyperplanes
Use the margin as the selection criterion
• Generalization depends on the margin.
General idea
(Figure: original problem vs. transformed problem after the non-linear mapping)
Kernel Based Algorithms
• Two separate components:
• Learning algorithm:
works in an embedded space
• Kernel function:
performs the embedding
Basic Example: Kernel Perceptron
• Hyperplane classification
f(x) = <w,x> + b = <w',x'>   (b absorbed: w' = (w,b), x' = (x,1))
h(x)= sign(f(x))
• Perceptron Algorithm:
Sample: (xi, ti), ti ∈ {-1,+1}
If ti <wk,xi> < 0 THEN /* Error*/
wk+1 = wk + ti xi
k=k+1
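A minimal NumPy sketch of the update rule above (not part of the original slides): the bias b is absorbed into w as in f(x) = <w',x'>, the mistake test uses ≤ 0 so the all-zero starting weights also trigger updates, and the toy data and names are only illustrative.

```python
import numpy as np

def perceptron(X, t, epochs=100):
    """Primal perceptron: update w only on mistakes (slide rule w_{k+1} = w_k + t_i x_i)."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])   # absorb b: x' = (x, 1), w' = (w, b)
    w = np.zeros(d + 1)
    for _ in range(epochs):
        mistakes = 0
        for xi, ti in zip(Xb, t):
            if ti * np.dot(w, xi) <= 0:    # <= 0 so the all-zero start also updates
                w += ti * xi               # w_{k+1} = w_k + t_i x_i
                mistakes += 1
        if mistakes == 0:                  # converged (data is linearly separable)
            break
    return w

# Illustrative toy data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [1.5, 3.0], [-2.0, -1.0], [-1.0, -2.5]])
t = np.array([1, 1, -1, -1])
w = perceptron(X, t)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))   # matches t
```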
Recall
• Margin of hyperplane w:
D(w) = min over (xi, ti) ∈ S of  ti <w, xi> / ||w||
• Mistake bound:
M ≤ ( ||x_max|| / D(w*) )^2
(||x_max|| = max_i ||xi||, w* a separating hyperplane of maximal margin)
Observations
• Solution is a linear combination of inputs
w = Σi ai ti xi
where ai ≥ 0
• Mistake driven
Only points on which we make a mistake
influence the solution!
• Support vectors
the points with non-zero ai
Dual representation
• Rewrite basic function:
f(x) = <w,x> + b = Σi ai ti <xi , x> + b
w = Σi ai ti xi
• Change update rule:
IF tj ( Σi ai ti <xi , xj> + b ) < 0
THEN aj = aj + 1
• Observation:
The data appears only inside inner products!
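A sketch of the dual perceptron described above, assuming NumPy. The default kernel (a plain inner product), the bias step b += tj, and the helper names are illustrative conventions; some presentations scale the bias update by R^2.

```python
import numpy as np

def kernel_perceptron(X, t, K=lambda x, z: np.dot(x, z), epochs=100):
    """Dual perceptron: one coefficient a_i per training point, data used only via K."""
    n = len(X)
    a = np.zeros(n)
    b = 0.0
    G = np.array([[K(xi, xj) for xj in X] for xi in X])    # Gram matrix K(x_i, x_j)
    for _ in range(epochs):
        mistakes = 0
        for j in range(n):
            # f(x_j) = sum_i a_i t_i K(x_i, x_j) + b
            if t[j] * (np.sum(a * t * G[:, j]) + b) <= 0:
                a[j] += 1                  # slide rule: a_j = a_j + 1
                b += t[j]                  # simple bias step (often scaled by R^2)
                mistakes += 1
        if mistakes == 0:
            break
    return a, b

def f(x, X, t, a, b, K=lambda x, z: np.dot(x, z)):
    """f(x) = sum_i a_i t_i K(x_i, x) + b"""
    return sum(ai * ti * K(xi, x) for ai, ti, xi in zip(a, t, X)) + b
```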
Limitation of Perceptron
• Only linear separations
• Only converges for linearly
separable data
• Only defined on vectorial data
The idea of a Kernel
• Embed the data in a different space
• Possibly of higher dimension
• Linearly separable in the new space.
(Figure: original problem vs. transformed problem)
Kernel Mapping
• Need only to compute inner products.
• Mapping: M(x)
• Kernel: K(x,y) = <M(x), M(y)>
• Dimensionality of M(x): unimportant!
• Need only to compute K(x,y).
• Using it in the embedded space:
replace <x,y> by K(x,y)
Example
x = (x1, x2); z = (z1, z2); K(x,z) = (<x,z>)^2

K(x,z) = (x1 z1 + x2 z2)^2
= x1^2 z1^2 + x2^2 z2^2 + 2 x1 x2 z1 z2
= < [x1^2, x2^2, √2 x1 x2] , [z1^2, z2^2, √2 z1 z2] >
= < M(x), M(z) >

with M(x) = (x1^2, x2^2, √2 x1 x2).
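A quick numeric check of the derivation above, assuming NumPy; the names K and M follow the slides, and the sample points are arbitrary.

```python
import numpy as np

def K(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return np.dot(x, z) ** 2

def M(x):
    """Explicit embedding from the derivation: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(K(x, z), np.dot(M(x), M(z)))   # both approx. 1.0: (1*3 + 2*(-1))^2 = 1
```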
Polynomial Kernel
(Figure: original problem vs. transformed problem under the polynomial kernel)
Kernel Matrix
Examples of Basic Kernels
• Polynomial
K(x,z) = (<x,z>)^d
• Gaussian
K(x,z) = exp{ -||x-z||^2 / 2σ^2 }
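Sketches of the two kernels above, assuming NumPy; sigma is the Gaussian bandwidth parameter, and the random sample is only there to show that the resulting kernel (Gram) matrix is symmetric.

```python
import numpy as np

def polynomial_kernel(x, z, d=3):
    """K(x,z) = (<x,z>)^d"""
    return np.dot(x, z) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """K(x,z) = exp(-||x-z||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

# Kernel (Gram) matrix of a small random sample
X = np.random.randn(5, 2)
G = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
print(G.shape, np.allclose(G, G.T))   # (5, 5) True: symmetric, ones on the diagonal
```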
Kernel: Closure Properties
• K(x,z) = K1(x,z) + c    (c ≥ 0)
• K(x,z) = c·K1(x,z)    (c ≥ 0)
• K(x,z) = K1(x,z) · K2(x,z)
• K(x,z) = K1(x,z) + K2(x,z)
• Create new kernels using basic ones!
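An empirical illustration (not a proof) that the compositions listed above again give symmetric positive semi-definite kernel matrices, assuming NumPy; the random data, the tolerance, and the chosen base kernels are arbitrary.

```python
import numpy as np

def looks_psd(G, tol=1e-9):
    """Empirical check: symmetric with (numerically) non-negative eigenvalues."""
    return np.allclose(G, G.T) and np.linalg.eigvalsh(G).min() > -tol

X = np.random.randn(6, 3)
K1 = X @ X.T                                                        # linear kernel
K2 = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)   # Gaussian kernel

combinations = {
    "K1 + c (c >= 0)": K1 + 0.5,
    "c * K1 (c >= 0)": 3.0 * K1,
    "K1 * K2 (pointwise)": K1 * K2,
    "K1 + K2": K1 + K2,
}
for name, G in combinations.items():
    print(name, looks_psd(G))   # all True
```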
Support Vector Machines
• Linear Learning Machines (LLM)
• Use dual representation
• Work in the kernel induced feature space
f(x) = Σi ai ti K(xi , x) + b   (see the sketch below)
• Which hyperplane to select?
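A sketch using scikit-learn's SVC (an assumed dependency, not part of the original slides) showing that the fitted classifier really has the dual form f(x) = Σi ai ti K(xi, x) + b: in this library, dual_coef_ stores the products ai ti for the support vectors and intercept_ stores b.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=0)
y = 2 * y - 1                                    # labels in {-1, +1}

clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Rebuild f(x) = sum_i a_i t_i K(x_i, x) + b from the fitted support vectors:
# dual_coef_[0] holds a_i t_i, support_vectors_ holds the x_i with a_i > 0.
x_new = np.array([0.0, 0.0])
f = sum(c * rbf(sv, x_new) for c, sv in zip(clf.dual_coef_[0], clf.support_vectors_))
f += clf.intercept_[0]
print(f, clf.decision_function([x_new])[0])      # the two values agree
```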
Generalization of SVM
• PAC theory:
error = O( VC-dim / m )
Problem: VC-dim >> m
No preference between consistent hyperplanes
Margin based bounds
• H: basic hypothesis class
• conv(H): finite convex combinations of H
• D: distribution over X × {+1,-1}
• S: sample of size m drawn from D
Margin based bounds
• THEOREM: for every f in conv(H), with probability ≥ 1 - δ,
PrD[ y f(x) ≤ 0 ] ≤ PrS[ y f(x) ≤ θ ] + L
where
L = O( (1/√m) · √( (log m · log|H|) / θ^2 + log(1/δ) ) )
Maximal Margin Classifier
• Maximizes the margin
• Minimizes overfitting, via the margin-based bound.
• Increases the margin
rather than reducing the dimensionality
SVM: Support Vectors
Margins
• Geometric margin: min_i ti f(xi) / ||w||
Functional margin: min_i ti f(xi)
Main trick in SVM
• Insist on a functional margin of at least 1.
Support vectors have functional margin exactly 1.
• Geometric margin = 1 / ||w||
• Proof: for a support vector xi, ti f(xi) = 1, so its geometric margin is ti f(xi) / ||w|| = 1 / ||w||.
SVM criteria
• Find a hyperplane (w,b)
• That minimizes: ||w||^2 = <w,w>
• Subject to:
ti (<w,xi> + b) ≥ 1   for all i
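A sketch of the criteria above written as a small quadratic program and solved with SciPy's generic SLSQP solver on an invented separable toy set; a dedicated QP solver would normally be used, and all names and data here are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative separable data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])

def objective(wb):                 # wb = (w1, w2, b)
    w = wb[:2]
    return 0.5 * np.dot(w, w)      # minimize ||w||^2 (the 1/2 is conventional)

constraints = [                    # t_i (<w, x_i> + b) >= 1 for all i
    {"type": "ineq", "fun": lambda wb, i=i: t[i] * (np.dot(wb[:2], X[i]) + wb[2]) - 1.0}
    for i in range(len(X))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w, b = res.x[:2], res.x[2]
print(w, b, t * (X @ w + b))       # every margin value comes out >= 1
```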
Quadratic Programming
• Quadratic goal function.
• Linear constraints.
• Unique optimum.
• Polynomial-time algorithms.
Dual Problem
• Maximize
W(a) = Σi ai - 1/2 Σi,j ai aj ti tj K(xi , xj)
• Subject to
Σi ai ti = 0
ai ≥ 0
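The same toy problem solved through the dual above, again with SciPy's SLSQP as a stand-in for a proper QP solver; w and b are recovered afterwards from the earlier relations w = Σi ai ti xi and ti(<w,xi>+b) = 1 at a support vector. The data and names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T                                     # linear kernel matrix K(x_i, x_j)

def neg_W(a):                                   # maximizing W(a) = minimizing -W(a)
    return -(np.sum(a) - 0.5 * a @ (np.outer(t, t) * K) @ a)

res = minimize(neg_W, x0=np.zeros(len(X)),
               bounds=[(0.0, None)] * len(X),                         # a_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ t}],  # sum_i a_i t_i = 0
               method="SLSQP")
a = res.x
w = (a * t) @ X                                 # w = sum_i a_i t_i x_i
sv = np.argmax(a)                               # a point with a_i > 0 is a support vector
b = t[sv] - np.dot(w, X[sv])                    # from t_sv (<w, x_sv> + b) = 1
print(a, w, b)
```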
Applications: Text
• Classify a text to given categories
Sports, news, business, science, …
• Feature space
Bag of words
Huge sparse vector!
Applications: Text
• Practicalities:
Mw(x) = tfw · log(idfw) / K   (K: a normalization constant)
tfw = term frequency of w
idfw = inverse document frequency
idfw = # documents / # documents containing w
• Inner product <M(x),M(z)>
fast to compute on sparse vectors
• SVM: finds a hyperplane in “document space”
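A tiny sketch of this pipeline with scikit-learn (an assumed dependency): TfidfVectorizer builds the sparse bag-of-words / tf-idf vectors and a linear SVM separates the categories. The corpus, labels, and test sentence are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: tf-idf turns each text into a huge, sparse bag-of-words vector
texts = [
    "the team won the championship game last night",
    "stocks fell as the market reacted to earnings",
    "the striker scored twice in the second half",
    "the central bank raised interest rates again",
]
labels = ["sports", "business", "sports", "business"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["interest rates and the market today"]))   # expected: ['business']
```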