Machine Learning Using Support Vector Machines
King Saud University
The College of Computer & Information Science
Computer Science Department (Master)
Neural Networks and Machine Learning Applications (CSC 563)
Machine Learning Using Support Vector Machines
(Paper Review)
Presented to:
Prof. Dr. Mohamed Batouche
Prepared By:
Asma B. Al-Saleh
427220094
Amani A. Al-Ajlan
427220111
Spring 2008
Paper Information
Title: Machine Learning Using Support Vector Machines.
Authors:
Abdul Rahim Ahmad.
Marzuki Khalid.
Rubiyah Yusof.
Publisher: MSTC 2002, Johor Bahru.
Date: September 2002.
Review Outline
Introduction
Artificial Neural Network (ANN)
Support Vector Machine (SVM)
Support Vectors
Theory of SVM
Quadratic Programming
Non-linear SVM
SVM Implementations
SVM for Multi-class Classification
Handwriting Recognition Experimental Results
ANN vs. SVM
Conclusion
Introduction
The aim of this paper is to:
Present SVM in comparison to ANN.
Show the concept of SVM by providing some of its details.
Machine Learning
Machine Learning (ML)
Constructing computer programs that automatically improve their performance with experience.
Machine Learning (ML) Applications
1. Data mining programs.
2. Information filtering systems.
3. Autonomous vehicles.
4. Pattern recognition systems:
Speech recognition.
Handwriting recognition.
Face recognition.
Text categorization.
Artificial Neural Network
Artificial Neural Network (ANN)
Massively parallel computing systems
consisting of an extremely large number of
simple processors with many
interconnections.
Artificial Neural Network (ANN)
The main characteristics of ANN are:
1. The ability to learn complex non-linear input-output relationships.
2. The use of sequential training procedures to update (adapt) the network architecture and connection weights so that the network can work efficiently.
Artificial Neural Network (ANN)
In the area of pattern classification, the feed-forward network is the most popular.
Pattern classification: Multilayer Perceptron (MLP) and Radial Basis Function (RBF) networks.
Data clustering: Kohonen Self-Organizing Map (SOM).
ANN and Pattern Recognition
ANN has a low dependence on domain-specific knowledge compared to rule-based approaches.
Efficient learning algorithms are available for use.
Support Vector Machine
Support Vector Machine (SVM)
SVM was introduced in 1992 by Vapnik and his coworkers.
SVM in its original form:
A binary classifier; it separates between two classes.
Designed for linearly separable data sets.
SVM is used for both classification and regression.
Theory of SVM
Constraints:
1. No data points between H1 and H2.
2. The margin between H1 and H2 is maximized.
[Figure: separating hyperplane H: w·x + b = 0, with H1: w·x + b = +1 and H2: w·x + b = −1; margin = 2 / ||w||]
Support Vectors
The solution is expressed as a linear combination of the support vectors:
• A subset of the training patterns.
• Close to the decision boundary.
[Figure: support vectors lie on H1: w·x + b = +1 and H2: w·x + b = −1, on either side of H: w·x + b = 0; margin = 2 / ||w||]
Theory of SVM
Training data: {(x1, y1), (x2, y2), ..., (xn, yn)}
where xi ∈ R^d is the vector of input features and yi ∈ {−1, 1} is the class or label.
Theory of SVM
Class 1: w·xi + b ≥ +1
Class 2: w·xi + b ≤ −1
Combined into a single constraint: yi(w·xi + b) ≥ 1
[Figure: hyperplanes H: w·x + b = 0, H1: w·x + b = +1, H2: w·x + b = −1; margin = 2 / ||w||]
Theory of SVM
Learn a linear separating hyperplane classifier:
f(x) = sgn(w·x + b)
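As an illustration (not from the paper), a minimal NumPy sketch of this decision function, assuming a weight vector w and bias b have already been learned:

    import numpy as np

    def svm_decision(x, w, b):
        # Linear SVM classifier: f(x) = sgn(w·x + b).
        # Returns the sign of w·x + b: +1 or -1 (0 exactly on the hyperplane).
        return np.sign(np.dot(w, x) + b)

    # Hypothetical values for w and b:
    w = np.array([2.0, -1.0])
    b = -0.5
    print(svm_decision(np.array([1.0, 0.5]), w, b))  # -> 1.0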
Quadratic Programming
To maximize the margin 2 / ||w||, we need to minimize (1/2) ||w||².
This quadratic program is solved by introducing Lagrange multipliers:
L(w, b, α) = (1/2) ||w||² − Σ(i=1..N) αi yi (xi·w + b) + Σ(i=1..N) αi
Lagrange Multipliers
L(w, b, α) = (1/2) ||w||² − Σ(i=1..N) αi yi (xi·w + b) + Σ(i=1..N) αi
Maximize L with w and b eliminated: setting ∂L/∂w = 0 and ∂L/∂b = 0 gives w = Σ(i=1..N) αi yi xi and Σ(i=1..N) αi yi = 0. Substituting these yields the dual problem:
LD = Σ(i=1..N) αi − (1/2) Σ(i,j) αi αj yi yj (xi·xj)
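For concreteness, a small NumPy sketch (not from the paper) that evaluates the dual objective LD for given multipliers, labels, and inputs:

    import numpy as np

    def dual_objective(alpha, y, X):
        # LD = sum_i alpha_i - (1/2) * sum_{i,j} alpha_i alpha_j y_i y_j (x_i·x_j)
        K = X @ X.T                      # Gram matrix of dot products x_i·x_j
        v = alpha * y
        return alpha.sum() - 0.5 * v @ K @ v

    # Hypothetical toy values:
    X = np.array([[1.0, 0.0], [0.0, 1.0]])
    y = np.array([1.0, -1.0])
    alpha = np.array([0.5, 0.5])
    print(dual_objective(alpha, y, X))   # -> 0.75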
Theory of SVM
Discriminant function:
f(x) = sgn(Σ(i=1..N) αi yi (xi·x) + b)
Non-linear SVM
1. SVM mapped the data sets of input
space into a higher dimensional
feature space
2. Linear and the large-margin learning
algorithm is then applied.
Non-linear SVM
x → Φ(x)
Input (data) space (non-linear) → Feature space (linear)
Non-linear SVM
If the mapping function is Φ, we just solve:
LD = Σ(i=1..N) αi − (1/2) Σ(i,j) αi αj yi yj (Φ(xi)·Φ(xj))
However, the mapping can be done implicitly by kernel functions:
LD = Σ(i=1..N) αi − (1/2) Σ(i,j) αi αj yi yj k(xi, xj)
Non-linear SVM
Discriminant function:
f(x) = sgn(Σ(i=1..N) αi yi k(xi, x) + b)
Kernel
There are many kernels that can be used in this way.
Any kernel that satisfies Mercer’s
condition can be used.
Kernel - Examples
Polynomial kernels
Hyperbolic tangent
Radial basis function (Gaussian
kernel)
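The exact kernel forms are standard but not spelled out in the slides; a Python sketch of the three examples (the parameters degree, c, kappa, theta, and gamma are conventional choices, not from the paper):

    import numpy as np

    def polynomial_kernel(x, z, degree=3, c=1.0):
        # k(x, z) = (x·z + c)^degree
        return (np.dot(x, z) + c) ** degree

    def tanh_kernel(x, z, kappa=1.0, theta=0.0):
        # Hyperbolic tangent (sigmoid) kernel: k(x, z) = tanh(kappa * x·z + theta).
        # Satisfies Mercer's condition only for some parameter choices.
        return np.tanh(kappa * np.dot(x, z) + theta)

    def rbf_kernel(x, z, gamma=0.5):
        # Radial basis function (Gaussian) kernel: k(x, z) = exp(-gamma * ||x - z||²)
        return np.exp(-gamma * np.linalg.norm(x - z) ** 2)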
Non-separable Input Space
In real-world problems, there is always noise.
Noise → non-separable data.
Slack variables ξi are introduced for each input:
yi(w·xi + b) ≥ 1 − ξi
The penalty parameter C controls overfitting.
Non-separable Input Space
[Figure: non-separable case; points with slack ξi, ξj fall on the wrong side of their margin hyperplanes H1: w·x + b = +1 and H2: w·x + b = −1, around H: w·x + b = 0]
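As a usage sketch with scikit-learn (an assumed library; the paper does not prescribe one), the penalty parameter C appears directly in the soft-margin SVM API:

    import numpy as np
    from sklearn.svm import SVC

    # Toy non-separable data: two overlapping classes.
    X = np.array([[0, 0], [1, 1], [1, 0], [0, 1], [0.9, 0.9], [0.1, 0.1]])
    y = np.array([-1, 1, -1, 1, -1, 1])

    # Small C -> wide margin, more slack tolerated;
    # large C -> slack penalized heavily, which can overfit noisy data.
    clf = SVC(kernel='rbf', C=1.0, gamma=0.5)
    clf.fit(X, y)
    print(clf.predict([[0.2, 0.1]]))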
SVM for Multi-class Classification
The basic SVM is a binary classifier; it separates between two classes.
In the real world, more than two classes are usually needed.
Example: handwriting recognition.
SVM for Multi-class Classification
Methods:
1. Modifying the binary SVM to incorporate multi-class learning.
2. Combining binary classifiers:
One vs. All: K classifiers.
One vs. One: K(K−1)/2 classifiers.
SVM for Multi-class Classification
One vs. One and DAGSVM (Directed Acyclic Graph SVM) are the best choices for practical use. They are:
Less complex.
Easier to construct.
Faster to train.
(Tapia et al., 2005)
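A sketch of the two combination schemes using scikit-learn (an assumption; the paper does not name a library), on a 10-class handwritten-digit task:

    from sklearn.datasets import load_digits
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)   # K = 10 digit classes

    # One vs. One: trains K(K-1)/2 = 45 binary SVMs.
    ovo = OneVsOneClassifier(SVC(kernel='rbf', gamma=0.001)).fit(X, y)

    # One vs. All: trains K = 10 binary SVMs, one per class.
    ova = OneVsRestClassifier(SVC(kernel='rbf', gamma=0.001)).fit(X, y)

    print(ovo.predict(X[:5]), ova.predict(X[:5]))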
SVM Implementation
SVM training requires solving a quadratic programming (QP) problem, which is computationally intensive.
However, many decomposition methods have been proposed that avoid solving the full QP and make SVM learning practical for many current problems.
Example: Sequential Minimal Optimization (SMO), sketched below.
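For illustration, a minimal sketch of a simplified SMO variant (after Platt's algorithm; real implementations use smarter working-set selection), which analytically optimizes one pair of multipliers per step instead of solving the full QP:

    import numpy as np

    def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5):
        # Simplified SMO with a linear kernel; returns the multipliers and bias.
        n = X.shape[0]
        K = X @ X.T                        # kernel (Gram) matrix
        alpha, b, passes = np.zeros(n), 0.0, 0
        while passes < max_passes:
            changed = 0
            for i in range(n):
                E_i = (alpha * y) @ K[:, i] + b - y[i]
                # Pick examples that violate the KKT conditions.
                if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                    j = np.random.choice([k for k in range(n) if k != i])
                    E_j = (alpha * y) @ K[:, j] + b - y[j]
                    a_i, a_j = alpha[i], alpha[j]
                    # Bounds that keep both multipliers feasible.
                    if y[i] != y[j]:
                        L, H = max(0, a_j - a_i), min(C, C + a_j - a_i)
                    else:
                        L, H = max(0, a_i + a_j - C), min(C, a_i + a_j)
                    eta = 2 * K[i, j] - K[i, i] - K[j, j]
                    if L == H or eta >= 0:
                        continue
                    # Analytic update of the pair, clipped to [L, H].
                    alpha[j] = np.clip(a_j - y[j] * (E_i - E_j) / eta, L, H)
                    if abs(alpha[j] - a_j) < 1e-5:
                        continue
                    alpha[i] = a_i + y[i] * y[j] * (a_j - alpha[j])
                    # Update the bias term.
                    b1 = b - E_i - y[i]*(alpha[i]-a_i)*K[i, i] - y[j]*(alpha[j]-a_j)*K[i, j]
                    b2 = b - E_j - y[i]*(alpha[i]-a_i)*K[i, j] - y[j]*(alpha[j]-a_j)*K[j, j]
                    b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                    changed += 1
            passes = passes + 1 if changed == 0 else 0
        return alpha, b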
Results of Experimental Studies
Data
Handwritten digit databases:
MNIST dataset.
USPS dataset: more difficult; the human recognition error rate is as high as 2.5%.
[Table: error-rate comparison of ANN, SVM, and other algorithms on the MNIST and USPS databases]
Results of Experimental Studies
1. The SVM error rate is significantly lower than that of most other algorithms, except for the LeNet-5 NN.
2. Training time for SVM was significantly slower, but the higher recognition rate (lower error rate) justifies its usage.
3. SVM usage should increase and replace ANN in the area of handwriting recognition, since faster methods of implementing SVM have been introduced recently.
SVM vs. ANN

SVM                                     ANN
Global minimum.                         Local minimum.
Does not overfit data                   Known to overfit data unless
(Structural Risk Minimization).         cross-validation is applied.
Multi-class implementation              Naturally handles multi-class
needs to be performed.                  classification.
Conclusion
SVM is powerful and a useful alternative to neural networks.
SVM finds a global, unique solution.
Two key concepts of SVM: maximizing the margin and the choice of kernel.
Performance depends on the choice of kernel and its parameters; this is still a research topic.
Training is memory-intensive due to QP.
Conclusion
Much active research is taking place in areas related to SVM.
Many SVM implementations are available on the Web:
SVMLight
LIBSVM
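For example, scikit-learn's SVC is built on top of LIBSVM, so a LIBSVM-backed classifier can be trained in a few lines (a minimal sketch):

    import numpy as np
    from sklearn.svm import SVC   # scikit-learn's SVC wraps LIBSVM

    X = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
    y = np.array([-1, -1, 1, 1])

    clf = SVC(kernel='linear', C=1.0).fit(X, y)
    print(clf.support_vectors_)        # the support vectors found
    print(clf.predict([[2.5, 2.5]]))   # -> [1]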
Thank you
…..
Questions?