Support Vector Machines - McGill Schulich Faculty of Music


Support Vector Machines
Jordan Smith
MUMT 611
14 February 2008
Topics to cover

What do Support Vector Machines (SVMs) do?

How do SVMs work?
  Linear data
  Non-linear data (Kernel functions)
  Unseparable data (added Cost function)

Search optimization

Why?
What SVMs do

[Figure, built up over several slides: a two-class data set split by the optimum separating hyperplane; the nearest points on either side are the support vectors, and the distance between them is the margin. Sherrod 230]
Topics to cover

What do Support Vector Machines (SVMs) do?

How do SVMs work?
  Linear data
  Non-linear data (Kernel functions)
  Unseparable data (added Cost function)

Search optimization

Why?
The linear, separable case

Training data {xi, yi}
Separating hyperplane defined by normal vector w
  hyperplane equation: w·x + b = 0
  distance from plane to origin: |b|/|w|
Distances from hyperplane to the nearest point in each collection are d+ and d-
Goal: maximize d+ + d- (the margin)
The linear, separable case

1) xi·w + b ≥ +1   (for yi = +1)
2) xi·w + b ≤ -1   (for yi = -1)
Combined: yi(xi·w + b) - 1 ≥ 0
For our support vectors, distance from origin to plane = |1 - b|/|w|
Algebra gives: d+ + d- = 2/|w|
New goal: maximize 2/|w|, i.e., minimize |w|
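As a minimal sketch of this linear, separable case (not part of the original slides), the Python fragment below fits a linear SVM with scikit-learn on a made-up toy data set and reads back w, b, the support vectors, and the margin 2/|w|. The data and the very large C used to approximate a hard margin are assumptions for illustration only.

```python
# Minimal sketch: hard-margin linear SVM on a toy separable data set.
# Assumes scikit-learn and NumPy; the data points are made up for illustration.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (labels y in {-1, +1}).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

# A very large C approximates the hard-margin (separable) formulation.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                    # normal vector of the separating hyperplane
b = clf.intercept_[0]               # offset: hyperplane is w.x + b = 0
margin = 2.0 / np.linalg.norm(w)    # d+ + d- = 2 / |w|

print("w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)
print("margin =", margin)
```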
Nonlinear SVMs

[Figure: a data set that no straight line can separate. Sherrod 235]

Nonlinear SVMs

Kernel trick:
  Map the data into a higher-dimensional space using Φ: R^d → H
  Training problems involve only the dot product, so H can even be of infinite dimension
  The kernel trick makes nonlinear solutions linear again!
  youtube example

Nonlinear SVMs

Radial basis function: K(xi, xj) = exp(-γ |xi - xj|²)
Sherrod 236

Nonlinear SVMs

Sigmoid: K(xi, xj) = tanh(γ xi·xj + c)
Sherrod 237
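A hedged sketch (not in the original slides) of how these two kernels are used in practice with scikit-learn; the moon-shaped toy data set and the specific gamma and coef0 values are assumptions made for illustration.

```python
# Minimal sketch: nonlinear SVMs via the kernel trick (RBF and sigmoid kernels).
# Assumes scikit-learn; the toy data set is generated only for illustration.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# RBF kernel: K(xi, xj) = exp(-gamma * |xi - xj|^2)
rbf_clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

# Sigmoid kernel: K(xi, xj) = tanh(gamma * xi.xj + coef0)
sig_clf = SVC(kernel="sigmoid", gamma=0.5, coef0=0.0, C=1.0).fit(X, y)

print("RBF training accuracy:", rbf_clf.score(X, y))
print("Sigmoid training accuracy:", sig_clf.score(X, y))
```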
Another demonstration

applet
The unseparable case

Classifiers need to have a balanced capacity:
  Bad botanist, too much capacity (overfits): “It has 847 leaves. Not a tree!”
  Bad botanist, too little capacity (underfits): “It’s green. That’s a tree!”
The unseparable case

[Figure, built up over several slides: an unseparable data set classified with a soft, “fuzzy” margin; ξ marks the error of a point that falls on the wrong side of its margin. Sherrod 237]
The unseparable case

Add a cost function (slack variables ξi):
  xi·w + b ≥ +1 - ξi   (for yi = +1)
  xi·w + b ≤ -1 + ξi   (for yi = -1)
  ξi ≥ 0
old goal:  minimize |w|²/2
new goal: minimize |w|²/2 + C(∑i ξi)^k
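To make the role of the cost constant C concrete, here is a small sketch (not from the original slides): on overlapping, unseparable toy data, a small C tolerates more margin violations and keeps the margin wide, while a large C penalizes errors heavily. The data set and the specific C values are assumptions.

```python
# Minimal sketch: soft-margin SVM, varying the cost constant C.
# Assumes scikit-learn and NumPy; toy data is generated only for illustration.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping clusters, so the data is not perfectly separable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])   # margin shrinks as C grows
    print(f"C={C}: margin={margin:.3f}, "
          f"support vectors={len(clf.support_vectors_)}")
```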
Optimizing your search

To find the separating hyperplane, you must manipulate several parameters, depending on which kernel function you select:
  C, the cost constant
  gamma, etc.

There are two basic methods (a small sketch follows below):
  Grid search
  Pattern search
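As a hedged illustration of the grid-search option (not part of the original slides), the sketch below exhaustively tries combinations of C and gamma with cross-validation using scikit-learn's GridSearchCV; the parameter grid and toy data are assumptions.

```python
# Minimal sketch: grid search over the SVM parameters C and gamma.
# Assumes scikit-learn; the parameter grid and data set are illustrative only.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],        # cost constant
    "gamma": [0.01, 0.1, 1, 10],   # RBF kernel width
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validation accuracy:", search.best_score_)
```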

Topics to cover

What do Support Vector Machines (SVMs) do?

How do SVMs work?
  Linear data
  Non-linear data (Kernel functions)
  Unseparable data (added Cost function)

Search optimization

Why?
Why use SVMs?

Uses:
  Optical character recognition
  Spam detection
  MIR
    genre and artist classification (Mandel 2004, 2005)
    mood classification (Laurier 2007)
    popularity classification, based on lyrics (Dhanaraj 2005)
Why use SVMs?

Machine learner of choice for high-dimensional data, such as text, images, and music!
Conceptually simple.
Generalizable and efficient.

Next slides: results of a benchmark study (Meyer 2004) comparing SVMs and other learning techniques.
Questions?
Key References

Burges, C. J. C. 1998. “A tutorial on support vector machines for pattern recognition.” Data Mining and Knowledge Discovery 2: 955-974. <http://citeseer.ist.psu.edu/burges98tutorial.html>

Cortes, C., and V. Vapnik. 1995. “Support-Vector Networks.” Machine Learning 20: 273-297. <http://citeseer.ist.psu.edu/cortes95supportvector.html>

Sherrod, P. H. 2008. DTREG: Predictive Modeling Software (user’s guide), 227-41. <http://www.dtreg.com/DTREG.pdf>

Smola, A. J., and B. Scholkopf. 1998. “A tutorial on support vector regression.” NeuroCOLT Technical Report NC-TR-98-030. Royal Holloway College, London.