Machine Learning Using Support Vector Machines


King Saud University
The College of Computer & Information Science
Computer Science Department (Master)
Neural Networks and Machine Learning Applications (CSC 563)
Machine Learning Using Support Vector Machines
(Paper Review)
Presented to:
Prof. Dr. Mohamed Batouche
Prepared by:
Asma B. Al-Saleh (427220094)
Amani A. Al-Ajlan (427220111)
Spring 2008
Paper Information
 Title: Machine Learning Using Support Vector Machines.
 Authors: Abdul Rahim Ahmad, Marzuki Khalid, Rubiyah Yusof.
 Publisher: MSTC 2002, Johor Bahru.
 Date: September 2002.
Review Outline
 Introduction
 Artificial Neural Network (ANN)
 Support Vector Machine (SVM)
 Support Vectors
 Theory of SVM
 Quadratic Programming
 Non-linear SVM
 SVM Implementations
 SVM for Multi-class Classification
 Handwriting Recognition Experimental Results
 ANN vs. SVM
 Conclusion
Introduction
 The aims of this paper are to:
 Present SVM as an alternative to ANN.
 Explain the concept of SVM in some detail.
Machine Learning
Machine Learning (ML)
 Constructing computer programs that automatically improve their performance with experience.
[Figure: data fed into a learning process that produces a learned model.]
Machine Learning (ML) Applications
1. Data mining programs.
2. Information filtering systems.
3. Autonomous vehicles.
4. Pattern recognition systems:
 Speech recognition.
 Handwriting recognition.
 Face recognition.
 Text categorization.
Artificial Neural Network
Artificial Neural Network (ANN)
 Massively parallel computing systems
consisting of an extremely large number of
simple processors with many
interconnections.
Artificial Neural Network (ANN)
 The main characteristics of ANN are:
1. The ability to learn complex non-linear input-output relationships.
2. The use of sequential training procedures to update (adapt) the network architecture and connection weights so that the network works efficiently.
Artificial Neural Network (ANN)
 In the area of pattern classification, the feed-forward network is the most popularly used:
 Pattern classification: Multilayer Perceptron (MLP), Radial Basis Function (RBF) networks.
 Data clustering: Kohonen Self-Organizing Map (SOM).
ANN and Pattern Recognition
 ANN has low dependence on domain-specific knowledge compared to rule-based approaches.
 Efficient learning algorithms are available.
Support Vector Machine
Support Vector Machine – (SVM)
 SVM was introduced in 1992 by Vapnik and his coworkers.
 SVM in its original form:
 A binary classifier; separates between two classes.
 Designed for linearly separable data sets.
 SVM is used for classification and regression.
Support Vector Machine (SVM)
Theory of SVM
Constraints:
1. No data points between H1 and H2.
2. The margin between H1 and H2 is maximized.
[Figure: separating hyperplane H: w·x + b = 0 between margin hyperplanes H1: w·x + b = 1 and H2: w·x + b = −1; the margin width is 2/||w||.]
Support Vectors
 The solution is expressed as a linear combination of the support vectors:
 A subset of the training patterns.
 Close to the decision boundary.
[Figure: the support vectors lie on the margin hyperplanes H1: w·x + b = 1 and H2: w·x + b = −1, on either side of H: w·x + b = 0; margin width 2/||w||.]
Theory of SVM
 Training data:
{(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}
where yᵢ ∈ {−1, 1} and xᵢ ∈ Rᵈ.
 xᵢ (input features) → SVM → yᵢ (class or label).
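As a hedged sketch (made-up values, not from the paper), this is how such a training set looks in Python: each row of X is one input xᵢ ∈ Rᵈ and y holds the matching labels yᵢ ∈ {−1, 1}:

```python
import numpy as np

# n = 4 training examples, each with d = 2 input features (made-up values).
X = np.array([[0.5, 1.2],
              [1.0, 0.8],
              [-0.7, -1.1],
              [-1.3, -0.2]])
y = np.array([1, 1, -1, -1])   # class labels y_i in {-1, 1}
```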
Theory of SVM
Class 1 (yᵢ = 1): w·xᵢ + b ≥ 1
Class 2 (yᵢ = −1): w·xᵢ + b ≤ −1
Combined: yᵢ(w·xᵢ + b) ≥ 1
[Figure: hyperplanes H1: w·x + b = 1, H: w·x + b = 0 and H2: w·x + b = −1; margin width 2/||w||.]
Theory of SVM
 Learn a linear separating hyperplane classifier:
f(x) = sgn(w·x + b)
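A minimal sketch of this decision rule in Python; the weight vector w and bias b below are made-up values for illustration, not from the paper:

```python
import numpy as np

# Hypothetical, hand-picked hyperplane parameters (illustration only).
w = np.array([2.0, -1.0])   # normal vector of the hyperplane
b = -0.5                    # bias term

def f(x):
    """Linear SVM decision function: f(x) = sgn(w.x + b)."""
    return np.sign(np.dot(w, x) + b)

print(f(np.array([1.0, 0.0])))   # 1.0  -> class 1
print(f(np.array([0.0, 1.0])))   # -1.0 -> class 2
```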
Quadratic Programming
 To maximize the margin 2/||w||, we need to minimize ½||w||².
 This quadratic programming problem is solved by introducing Lagrange multipliers αᵢ:
L(w, b, α) = ½||w||² − Σᵢ₌₁ᴺ αᵢ yᵢ (xᵢ·w + b) + Σᵢ₌₁ᴺ αᵢ
Lagrange Multipliers
L(w, b, α) = ½||w||² − Σᵢ₌₁ᴺ αᵢ yᵢ (xᵢ·w + b) + Σᵢ₌₁ᴺ αᵢ
 Maximize L with w and b eliminated (by setting ∂L/∂w = 0 and ∂L/∂b = 0), which gives the dual:
L_D = Σᵢ₌₁ᴺ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ·xⱼ)
Theory of SVM
 Discriminant function:
f(x) = sgn(Σᵢ₌₁ᴺ αᵢ yᵢ (xᵢ·x) + b)
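A hedged sketch of solving this dual with an off-the-shelf QP solver (cvxopt, which the paper does not mention); the toy data and the 1e-6 support-vector threshold are assumptions:

```python
import numpy as np
from cvxopt import matrix, solvers

# Made-up, linearly separable toy data (two 2-D classes).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

# Dual: maximize sum(a) - 1/2 a'(yy' * XX')a  s.t.  a >= 0, sum(a_i y_i) = 0.
# cvxopt minimizes 1/2 x'Px + q'x  s.t.  Gx <= h, Ax = b.
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))        # -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))  # equality constraint: sum(a_i y_i) = 0
b = matrix(0.0)

alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

sv = alpha > 1e-6                   # support vectors: nonzero multipliers
w = (alpha[sv] * y[sv]) @ X[sv]     # w = sum_i a_i y_i x_i
b0 = np.mean(y[sv] - X[sv] @ w)     # b recovered from the support vectors

predict = lambda x: np.sign(x @ w + b0)
print("support vectors:", X[sv])
print(predict(np.array([3.0, 2.0])), predict(np.array([0.5, 0.5])))
```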
Non-linear SVM
1. SVM maps the data set from the input space into a higher-dimensional feature space.
2. The linear large-margin learning algorithm is then applied in that space.
Non-linear SVM
x → φ(x)
Input (data) space (non-linear) → Feature space (linear)
Non-linear SVM
 If the mapping function is φ, we just solve:
L_D = Σᵢ₌₁ᴺ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ (φ(xᵢ)·φ(xⱼ))
 However, the mapping can be done implicitly by kernel functions:
L_D = Σᵢ₌₁ᴺ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ k(xᵢ, xⱼ)
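As a hedged illustration of why the implicit mapping works, the sketch below checks numerically that the degree-2 polynomial kernel k(x, z) = (x·z)² equals the dot product φ(x)·φ(z) for the explicit map φ(x) = (x₁², √2·x₁x₂, x₂²); the toy vectors are made up:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, z):
    """Degree-2 polynomial kernel: computed in the input space only."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))  # ~1.0 (same value, up to float rounding)
print(k(x, z))                 # (1*3 + 2*(-1))^2 = 1.0
```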
Non-linear SVM
 Discriminant function:
f(x) = sgn(Σᵢ₌₁ᴺ αᵢ yᵢ k(xᵢ, x) + b)
Kernel
 There are many kernels that can be used in this way.
 Any kernel that satisfies Mercer’s
condition can be used.
Kernel - Examples
 Polynomial kernels
 Hyperbolic tangent
 Radial basis function (Gaussian kernel)
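A hedged sketch of these three kernels as plain Python functions; the parameter values (degree d, c, κ, γ) are illustrative assumptions, and note that the hyperbolic tangent kernel satisfies Mercer's condition only for some parameter choices:

```python
import numpy as np

def poly_kernel(x, z, d=3, c=1.0):
    """Polynomial kernel: k(x, z) = (x.z + c)^d."""
    return (np.dot(x, z) + c) ** d

def tanh_kernel(x, z, kappa=0.1, c=-1.0):
    """Hyperbolic tangent (sigmoid) kernel: k(x, z) = tanh(kappa * x.z + c)."""
    return np.tanh(kappa * np.dot(x, z) + c)

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function (Gaussian) kernel: k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(poly_kernel(x, z), tanh_kernel(x, z), rbf_kernel(x, z))
```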
Non-separable Input Space
 In real-world problems, there is always noise.
 Noise → non-separable data.
 Slack variables ξᵢ are introduced for each input:
yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ
 A penalty parameter C controls overfitting.
Non-separable Input Space
[Figure: soft-margin case; hyperplanes H1: w·x + b = 1, H: w·x + b = 0 and H2: w·x + b = −1, with slack variables ξᵢ and ξⱼ marking points that fall inside or beyond the margin.]
SVM for Multi-class Classification
 The basic SVM is a binary classifier; it separates between two classes.
 In the real world, more than two classes are usually needed.
 Ex: handwriting recognition.
SVM for Multi-class Classification
Methods:
1. Modifying the binary SVM to incorporate multi-class learning.
2. Combining binary classifiers (see the sketch after this list):
 One vs. All: K classifiers.
 One vs. One: K(K−1)/2 classifiers.
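A hedged sketch of the combining approach with scikit-learn's one-vs-rest and one-vs-one wrappers (not from the paper; the toy data are assumptions); for K = 3 classes both happen to build 3 binary classifiers, since K(K−1)/2 = 3:

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

# Made-up data with K = 3 classes.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)  # K binary classifiers
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)   # K(K-1)/2 binary classifiers

print("one-vs-all classifiers:", len(ovr.estimators_))  # 3
print("one-vs-one classifiers:", len(ovo.estimators_))  # 3 * 2 / 2 = 3
```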
SVM for Multi-class Classification
 One vs. One and DAGSVM (Directed Acyclic Graph SVM) are the best choices for practical use, because they are:
 Less complex.
 Easier to construct.
 Faster to train.
(Tapia et al., 2005)
SVM Implementation
 SVM training requires quadratic programming (QP), which is computationally intensive.
 However, many decomposition methods have been proposed that avoid the full QP and make SVM learning practical for many current problems.
 Ex: Sequential Minimal Optimization (SMO).
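A hedged sketch of the decomposition idea behind SMO: optimize two Lagrange multipliers at a time analytically, keeping Σ αᵢyᵢ = 0 satisfied. This is the simplified variant with a randomly chosen second multiplier, not Platt's full heuristic, and the tolerances are assumptions:

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=10):
    """Simplified SMO for a linear-kernel soft-margin SVM (illustrative only)."""
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    K = X @ X.T                    # precomputed linear kernel matrix
    rng = np.random.default_rng(0)
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]      # error on example i
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = (i + 1 + rng.integers(n - 1)) % n  # random second index j != i
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai, aj = alpha[i], alpha[j]
                # Bounds that keep 0 <= alpha <= C and sum(alpha * y) = 0.
                if y[i] != y[j]:
                    L, H = max(0.0, aj - ai), min(C, C + aj - ai)
                else:
                    L, H = max(0.0, ai + aj - C), min(C, ai + aj)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]  # curvature along the constraint
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(aj - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj) < 1e-5:
                    continue
                alpha[i] = ai + y[i] * y[j] * (aj - alpha[j])
                # Update the bias from the two changed multipliers.
                b1 = b - Ei - y[i] * (alpha[i] - ai) * K[i, i] - y[j] * (alpha[j] - aj) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai) * K[i, j] - y[j] * (alpha[j] - aj) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Usage (with made-up X, y arrays): alpha, b = simplified_smo(X, y, C=1.0)
```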
Results of Experimental Studies
Data
 Handwritten digit databases:
 MNIST dataset.
 USPS dataset: more difficult; the human recognition error rate is as high as 2.5%.
[Table: error rate comparison of ANN, SVM and other algorithms for the MNIST and USPS databases.]
Results of Experimental Studies
1. The SVM error rate is significantly lower than that of most other algorithms, except for the LeNet 5 NN.
2. Training time for SVM was significantly slower, but the higher recognition rate (lower error rate) justifies its usage.
3. SVM usage should increase and replace ANN in the area of handwriting recognition, since faster methods of implementing SVM have been introduced recently.
SVM vs. ANN

SVM:
 Global minimum.
 Does not overfit the data (Structural Risk Minimization).
 Multi-class implementation needs to be performed.

ANN:
 Local minimum.
 Known to overfit the data unless cross-validation is applied.
 Naturally handles multi-class classification.
Conclusion
 SVM is powerful and is a useful alternative to neural networks.
 SVM finds a global, unique solution.
 Two key concepts of SVM: maximizing the margin and the choice of kernel.
 Performance depends on the choice of kernel and parameters → still a research topic.
 Training is memory-intensive due to QP.
Conclusion
 Much active research is taking place in areas related to SVM.
 Many SVM implementations are available on the Web:
 SVMLight
 LIBSVM
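As a hedged usage sketch, LIBSVM also ships Python bindings along these lines (the package import path, options string, and toy data are assumptions; check the LIBSVM site for the authoritative API):

```python
# Minimal sketch using LIBSVM's Python bindings (package name is an assumption).
from libsvm.svmutil import svm_train, svm_predict

y = [1, 1, -1, -1]                        # labels
x = [[1, 1], [2, 2], [-1, -1], [-2, -2]]  # features
model = svm_train(y, x, '-t 2 -c 10')     # -t 2: RBF kernel, -c: penalty C
p_labels, p_acc, p_vals = svm_predict(y, x, model)
```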
Thank you
…..
Questions?