Transcript PPT

Neural networks and
support vector machines
Perceptron
[Diagram: inputs x1, x2, x3, ..., xD with weights w1, w2, w3, ..., wD; output: sgn(wx + b)]
Can incorporate bias as component of the weight vector by always including a feature with value set to 1
Loose inspiration: Human neurons
Linear separability
Perceptron training algorithm
• Initialize weights
• Cycle through training examples in multiple
passes (epochs)
• For each training example:
– If classified correctly, do nothing
– If classified incorrectly, update weights
Perceptron update rule
• For each training instance x with label y:
– Classify with current weights: y’ = sgn(wx)
– Update weights: w ← w + α(y − y’)x
– α is a learning rate that should decay as a function of epoch t, e.g., 1000/(1000+t)
– What happens if y’ is correct? Then y − y’ = 0 and the weights are unchanged
– Otherwise, consider what happens to individual weights: wi ← wi + α(y − y’)xi
• If y = 1 and y’ = −1, wi will be increased if xi is positive or decreased if xi is negative → wx will get bigger
• If y = −1 and y’ = 1, wi will be decreased if xi is positive or increased if xi is negative → wx will get smaller
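A minimal Python/NumPy sketch of the training loop and update rule above (function and variable names are illustrative, not from the slides); the bias is folded into the weight vector via a constant feature, and the learning rate decays as 1000/(1000+t):

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Perceptron training: X is (N, D), y holds labels in {-1, +1}."""
    X = np.hstack([X, np.ones((len(y), 1))])      # bias as an extra feature fixed to 1
    w = np.zeros(X.shape[1])                      # initialize weights to zero
    for t in range(epochs):
        alpha = 1000.0 / (1000.0 + t)             # decaying learning rate
        for i in np.random.permutation(len(y)):   # cycle through examples in random order
            y_pred = 1 if w @ X[i] >= 0 else -1   # classify with current weights: sgn(w·x)
            if y_pred != y[i]:                    # update only on mistakes
                w += alpha * (y[i] - y_pred) * X[i]
    return w
```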
Convergence of perceptron
update rule
• Linearly separable data: converges to a
perfect solution
• Non-separable data: converges to a
minimum-error solution assuming learning
rate decays as O(1/t) and examples are
presented in random sequence
Implementation details
• Bias (add feature dimension with value fixed to 1)
vs. no bias
• Initialization of weights: all zeros vs. random
• Learning rate decay function
• Number of epochs (passes through the training
data)
• Order of cycling through training examples
(random)
Multi-class perceptrons
• One-vs-others framework: Need to keep a
weight vector wc for each class c
• Decision rule: c = argmaxc wc x
• Update rule: suppose example from class
c gets misclassified as c’
– Update for c: wc ← wc + αx
– Update for c’: wc’ ← wc’ − αx
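A sketch of this one-vs-others update in Python/NumPy (the function name and the (C, D) weight-matrix layout are assumptions for illustration):

```python
import numpy as np

def multiclass_perceptron_step(W, x, y_true, alpha=1.0):
    """One update for a multi-class perceptron; W is (C, D), one row w_c per class."""
    y_pred = int(np.argmax(W @ x))   # decision rule: c = argmax_c w_c · x
    if y_pred != y_true:             # example from class y_true misclassified as y_pred
        W[y_true] += alpha * x       # update for c:  w_c  <- w_c  + alpha x
        W[y_pred] -= alpha * x       # update for c': w_c' <- w_c' - alpha x
    return W
```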
Multi-class perceptrons
• One-vs-others framework: Need to keep a
weight vector wc for each class c
• Decision rule: c = argmaxc wc x
Softmax: P(c | x) = exp(wc · x) / Σk=1..C exp(wk · x)
[Diagram: inputs → perceptrons w/ weights wc → max]
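A short sketch of computing these softmax probabilities from the per-class scores wc · x (subtracting the maximum score is a standard numerical-stability trick not mentioned on the slide):

```python
import numpy as np

def softmax_probs(W, x):
    """P(c | x) = exp(w_c·x) / sum_k exp(w_k·x), for W of shape (C, D)."""
    scores = W @ x                   # one score w_c · x per class
    scores = scores - scores.max()   # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()
```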
Visualizing linear classifiers
Source: Andrej Karpathy, http://cs231n.github.io/linear-classify/
Differentiable perceptron
[Diagram: inputs x1, x2, x3, ..., xd with weights w1, w2, w3, ..., wd; output: σ(wx + b)]
Sigmoid function: σ(t) = 1 / (1 + e^(−t))
Update rule for differentiable perceptron
• Define total classification error or loss on the
training set:
E(w) = Σj=1..N (yj − fw(xj))²,  where fw(xj) = σ(w · xj)
• Update weights by gradient descent:
w ← w − η ∂E/∂w
[Figure: error surface over weights w1, w2]
Update rule for differentiable perceptron
• Define total classification error or loss on the
training set:
E(w) = Σj=1..N (yj − fw(xj))²,  where fw(xj) = σ(w · xj)
• Update weights by gradient descent:
w ← w − η ∂E/∂w
∂E/∂w = Σj=1..N −2(yj − f(xj)) σ′(w · xj) ∂(w · xj)/∂w
      = Σj=1..N −2(yj − f(xj)) σ(w · xj)(1 − σ(w · xj)) xj
• For a single training point, the update is:
w ← w + η (y − f(x)) σ(w · x)(1 − σ(w · x)) x
Update rule for differentiable perceptron
• For a single training point, the update is:
w ← w + η (y − f(x)) σ(w · x)(1 − σ(w · x)) x
– This is called stochastic gradient descent
• Compare with the update rule for the non-differentiable perceptron:
w ← w + η (y − f(x)) x
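A sketch of this single-example (stochastic) gradient descent step for the sigmoid perceptron, assuming the squared loss and notation above (eta is the learning rate):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_step(w, x, y, eta=0.1):
    """One stochastic gradient descent update for a sigmoid perceptron."""
    f = sigmoid(w @ x)                       # f_w(x) = sigma(w · x)
    gradient_factor = (y - f) * f * (1 - f)  # (y - f(x)) sigma(w·x)(1 - sigma(w·x))
    return w + eta * gradient_factor * x     # w <- w + eta (y - f(x)) sigma'(w·x) x
```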
Network with a hidden layer
[Diagram: inputs → hidden layer with weights wj → output with weights w’]
• Can learn nonlinear functions (provided each perceptron is nonlinear and differentiable):
f(x) = σ( Σj w’j σ( Σk wjk xk ) )
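A sketch of this forward pass for a one-hidden-layer network with sigmoid units (the names W_hidden and w_out are illustrative; W_hidden holds the weights wjk, w_out the weights w’j):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, W_hidden, w_out):
    """f(x) = sigma( sum_j w'_j * sigma( sum_k w_jk x_k ) )."""
    h = sigmoid(W_hidden @ x)   # hidden activations
    return sigmoid(w_out @ h)   # network output
```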
Network with a hidden layer
• Hidden layer size and network capacity:
Source: http://cs231n.github.io/neural-networks-1/
Training of multi-layer networks
• Find network weights to minimize the error between true and estimated labels of training examples:
E(w) = Σj=1..N (yj − fw(xj))²
• Update weights by gradient descent: w ← w − η ∂E/∂w
• This requires perceptrons with a differentiable nonlinearity
Sigmoid: g(t) = 1 / (1 + e^(−t))
Rectified linear unit (ReLU): g(t) = max(0, t)
Training of multi-layer networks
• Find network weights to minimize the error between true and estimated labels of training examples:
E(w) = Σj=1..N (yj − fw(xj))²
• Update weights by gradient descent: w ← w − η ∂E/∂w
• Back-propagation: gradients are computed in the direction from output to input layers and combined using the chain rule
• Stochastic gradient descent: compute the weight update w.r.t. one training example (or a small batch of examples) at a time, cycle through training examples in random order in multiple epochs
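A minimal sketch of back-propagation with stochastic gradient descent for the one-hidden-layer sigmoid network above, under the squared loss; the layer names and shapes are assumptions for illustration:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_backprop_step(W_hidden, w_out, x, y, eta=0.1):
    """One SGD step on E = (y - f(x))^2 for a one-hidden-layer sigmoid network."""
    # Forward pass
    h = sigmoid(W_hidden @ x)                         # hidden activations
    f = sigmoid(w_out @ h)                            # network output
    # Backward pass: chain rule applied from the output layer toward the input
    delta_out = -2.0 * (y - f) * f * (1.0 - f)        # dE/d(output pre-activation)
    grad_w_out = delta_out * h                        # gradient w.r.t. output weights
    delta_hidden = delta_out * w_out * h * (1.0 - h)  # dE/d(hidden pre-activations)
    grad_W_hidden = np.outer(delta_hidden, x)         # gradient w.r.t. hidden weights
    # Gradient descent updates
    return W_hidden - eta * grad_W_hidden, w_out - eta * grad_w_out
```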
Regularization
• It is common to add a penalty on weight magnitudes to
the objective function:
E(f) = Σi=1..N (yi − f(xi))² + (λ/2) Σj wj²
– This encourages network to use all of its inputs “a little” rather
than a few inputs “a lot”
Source: http://cs231n.github.io/neural-networks-1/
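A sketch of this regularized objective, assuming a generic predict(X, w) function and the squared error above; lam plays the role of λ:

```python
import numpy as np

def regularized_loss(w, X, y, predict, lam=1e-3):
    """Squared error plus an L2 penalty (lambda/2) * sum_j w_j^2 on the weights."""
    errors = y - predict(X, w)            # y_i - f(x_i) over the training set
    data_term = np.sum(errors ** 2)
    penalty = 0.5 * lam * np.sum(w ** 2)  # encourages many small weights
    return data_term + penalty
```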
Multi-Layer Network Demo
http://playground.tensorflow.org/
Neural networks: Pros and cons
• Pros
– Flexible and general function approximation
framework
– Can build extremely powerful models by adding more
layers
• Cons
– Hard to analyze theoretically (e.g., training is prone to
local optima)
– Huge amount of training data, computing power
required to get good performance
– The space of implementation choices is huge
(network architectures, parameters)
Support vector machines
• When the data is linearly separable, there may
be more than one separator (hyperplane)
Which separator
is best?
Support vector machines
• Find hyperplane that maximizes the margin
between the positive and negative examples
Support vectors
Margin
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
SVM parameter learning
• Separable data: min_{w,b} (1/2)||w||² subject to yi(w · xi + b) ≥ 1
(maximize margin; classify training data correctly)
• Non-separable data: min_{w,b} (1/2)||w||² + C Σi=1..n max(0, 1 − yi(w · xi + b))
(maximize margin; minimize classification mistakes)
SVM parameter learning
min_{w,b} (1/2)||w||² + C Σi=1..n max(0, 1 − yi(w · xi + b))
[Figure: max-margin separator with decision values +1, 0, −1 and the margin marked]
Demo: http://cs.stanford.edu/people/karpathy/svmjs/demo
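A sketch of evaluating the non-separable (hinge-loss) objective above for given w and b; variable names are illustrative:

```python
import numpy as np

def svm_objective(w, b, X, y, C=1.0):
    """(1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w·x_i + b)), with y in {-1, +1}."""
    margins = y * (X @ w + b)               # y_i (w · x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)  # per-example hinge loss
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```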
Nonlinear SVMs
• General idea: the original input space can
always be mapped to some higher-dimensional
feature space where the training set is separable
Φ: x → φ(x)
Image source
Nonlinear SVMs
• Linearly separable dataset in 1D: [points along the x axis]
• Non-separable dataset in 1D: [points along the x axis]
• We can map the data to a higher-dimensional space: [points plotted against x and x²]
Slide credit: Andrew Moore
The kernel trick
• General idea: the original input space can
always be mapped to some higher-dimensional
feature space where the training set is separable
• The kernel trick: instead of explicitly computing
the lifting transformation φ(x), define a kernel
function K such that
K(x,y) = φ(x) · φ(y)
(to be valid, the kernel function must satisfy
Mercer’s condition)
The kernel trick
• Linear SVM decision function:
w · x + b = Σi αi yi xi · x + b
(αi: learned weight; xi: support vector)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
The kernel trick
• Linear SVM decision function:
w · x + b = Σi αi yi xi · x + b
• Kernel SVM decision function:
Σi αi yi φ(xi) · φ(x) + b = Σi αi yi K(xi, x) + b
• This gives a nonlinear decision boundary in the
original feature space
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
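A sketch of evaluating the kernel SVM decision function above, given the support vectors, their coefficients αi, labels yi, the bias b, and any kernel function K:

```python
def kernel_svm_decision(x, support_vectors, alphas, labels, b, kernel):
    """sum_i alpha_i y_i K(x_i, x) + b, summed over the support vectors."""
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b
```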
Polynomial kernel: K(x, y) = (c + x · y)^d
Gaussian kernel
• Also known as the radial basis function (RBF)
kernel:
K(x, y) = exp(−(1/σ²)||x − y||²)
[Plot: K(x, y) as a function of ||x − y||]
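Sketches of the polynomial and Gaussian (RBF) kernels as defined above (the default values of c, d, and sigma are arbitrary illustrations); either can be passed to the kernel SVM decision function sketched earlier:

```python
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=3):
    return (c + np.dot(x, y)) ** d                    # K(x, y) = (c + x·y)^d

def gaussian_kernel(x, y, sigma=1.0):
    diff = np.asarray(x, float) - np.asarray(y, float)
    return np.exp(-np.dot(diff, diff) / sigma ** 2)   # K(x, y) = exp(-(1/sigma^2)||x - y||^2)
```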
Gaussian kernel
[Figure: decision boundary of a Gaussian-kernel SVM with support vectors (SVs) marked]
SVMs: Pros and cons
• Pros
• Kernel-based framework is very powerful, flexible
• Training is convex optimization, globally optimal solution can
be found
• Amenable to theoretical analysis
• SVMs work very well in practice, even with very small
training sample sizes
• Cons
• No “direct” multi-class SVM, must combine two-class SVMs
(e.g., with one-vs-others)
• Computation, memory (esp. for nonlinear SVMs)
Neural networks vs. SVMs
(a.k.a. “deep” vs. “shallow” learning)