Transcript Slide 1
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
Overfitting
Regularization: if the data is not separable, the weights dance around
Mediocre generalization: finds a "barely" separating solution
Which is the best linear separator?
Data: examples $(x_1, y_1), \dots, (x_n, y_n)$
Example $i$: $x_i = (x_i^{(1)}, \dots, x_i^{(d)})$, real-valued features, with label $y_i \in \{-1, +1\}$
Inner product: $w \cdot x = \sum_j w^{(j)} x^{(j)}$
Why define the margin as the distance to the closest example, and not the average distance or something else?
Prediction = $\mathrm{sign}(w \cdot x + b)$
"Confidence" = $(w \cdot x + b)\, y$
For the $i$-th datapoint: $\gamma_i = (w \cdot x_i + b)\, y_i$
Want to solve: $\max_{w} \min_i \gamma_i$
Can rewrite as: $\max_{w,\gamma} \gamma$  s.t. $\forall i,\ y_i (w \cdot x_i + b) \ge \gamma$
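To make these definitions concrete, here is a minimal numpy sketch (the toy data, weights, and values are made up for illustration) that computes the prediction and the per-example confidence $\gamma_i$:

```python
import numpy as np

# Toy data: 4 examples with 2 features each; labels in {-1, +1} (made-up values)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])

# Some candidate separator (also made up)
w = np.array([1.0, 1.0])
b = -0.5

pred = np.sign(X @ w + b)        # prediction = sign(w·x + b)
gamma = (X @ w + b) * y          # per-example "confidence" γ_i
print(pred, gamma, gamma.min())  # the margin of this (w, b) is min_i γ_i
```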
Maximize the margin:
Good according to intuition, theory & practice
$\max_{w,\gamma} \gamma$  s.t. $\forall i,\ y_i (w \cdot x_i + b) \ge \gamma$
[Figure: maximizing the margin $\gamma$ around the separating hyperplane $w \cdot x + b = 0$]
The separating hyperplane is defined by the support vectors:
Points on the +/- margin planes of the solution.
If you knew these points, you could ignore the rest.
If there are no degeneracies, there are d+1 support vectors (for d-dimensional data).
Problem:
Let $(w \cdot x + b)\, y = \gamma$; then $(2w \cdot x + 2b)\, y = 2\gamma$.
Scaling w increases the margin!
Solution:
Normalize w: work with $\frac{w}{\|w\|}$, where $\|w\| = \sqrt{\sum_{j=1}^{d} \left(w^{(j)}\right)^2}$
Also require the hyperplanes touching the support vectors to be: $w \cdot x + b = \pm 1$
Note: $w^{(j)}$ … the j-th component of $w$
What is the relation between $x_1$ and $x_2$?
$x_1 = x_2 + 2\gamma \frac{w}{\|w\|}$
We also know:
$w \cdot x_1 + b = +1$
$w \cdot x_2 + b = -1$
Then:
$w \cdot \left(x_2 + 2\gamma \frac{w}{\|w\|}\right) + b = +1$
$\underbrace{w \cdot x_2 + b}_{-1} + 2\gamma\, \frac{w \cdot w}{\|w\|} = +1$
$\gamma = \frac{\|w\|}{w \cdot w} = \frac{1}{\|w\|}$
Note: $w \cdot w = \|w\|^2$
We started with:
$\max_{w,\gamma} \gamma$  s.t. $\forall i,\ y_i (w \cdot x_i + b) \ge \gamma$
But w can be arbitrarily large!
We normalized and obtained:
$\max \gamma = \max \frac{1}{\|w\|} = \min \|w\| = \min \frac{1}{2}\|w\|^2$
Then:
$\min_w \frac{1}{2}\|w\|^2$  s.t. $\forall i,\ y_i (w \cdot x_i + b) \ge 1$
This is called SVM with “hard” constraints
If the data is not separable, introduce a penalty:
$\min_w \frac{1}{2}\|w\|^2 + C \cdot (\text{number of mistakes})$
s.t. $\forall i,\ y_i (w \cdot x_i + b) \ge 1$
Minimize $\|w\|^2$ plus the number of training mistakes.
Set C using cross-validation.
How to penalize mistakes? All mistakes are not equally bad!
Introduce slack variables $\xi_i$:
$\min_{w,\, b,\, \xi_i \ge 0}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
s.t. $\forall i,\ y_i (w \cdot x_i + b) \ge 1 - \xi_i$
If point $x_i$ is on the wrong side of the margin, it gets penalty $\xi_i$.
Slack penalty C:
C = ∞: only want w, b that separate the data
C = 0: can set $\xi_i$ to anything, then w = 0 (basically ignores the data)
For each datapoint:
If margin ≥ 1, don't care
If margin < 1, pay linear penalty $\xi_i$
SVM in the "natural" form:
$\arg\min_{w,b}\ \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \max\{0,\ 1 - y_i (w \cdot x_i + b)\}$
Here $\frac{1}{2}\, w \cdot w$ is the margin term, $C$ is the regularization parameter, and the sum is the empirical loss (how well we fit the training data).
SVM uses the "Hinge Loss":
$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
s.t. $\forall i,\ y_i (w \cdot x_i + b) \ge 1 - \xi_i$
[Plot: the 0/1 loss and the hinge loss $\max\{0,\ 1-z\}$ as a function of $z = y_i\,(x_i \cdot w + b)$]
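As a sanity check on the "natural" (hinge-loss) form, here is a small numpy sketch that evaluates f(w, b); the function name and toy values are mine:

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """f(w, b) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w·x_i + b))."""
    margins = y * (X @ w + b)               # y_i (w·x_i + b) for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # hinge loss per example
    return 0.5 * (w @ w) + C * hinge.sum()

# Illustrative toy data
X = np.array([[2.0, 1.0], [-1.0, -2.0], [0.5, -0.5]])
y = np.array([+1, -1, +1])
print(svm_objective(np.array([1.0, 0.5]), 0.0, X, y, C=1.0))
```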
$\min_{w,b}\ \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \xi_i$
s.t. $\forall i,\ y_i (x_i \cdot w + b) \ge 1 - \xi_i$
Want to estimate w,b:
Standard way: Use a solver!
Solver: software for finding solutions to
“common” optimization problems
Use a quadratic solver:
Minimize quadratic function
Subject to linear constraints
Problem: Solvers are inefficient for big data!
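For modest n, the quadratic program above can indeed be handed to an off-the-shelf solver. A minimal sketch using the cvxpy library (not part of the lecture; the data and variable names are illustrative):

```python
import cvxpy as cp
import numpy as np

# Illustrative toy data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.5, -1.0], [-2.0, -2.0]])
y = np.array([+1, +1, -1, -1])
n, d = X.shape
C = 1.0

w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))   # quadratic objective
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]        # linear constraints
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```

This works fine at toy scale, but, as the slide notes, generic solvers do not scale to datasets with hundreds of thousands of examples.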
Want to estimate w, b!
We had: $\min_{w,b}\ \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \xi_i$  s.t. $\forall i,\ y_i (x_i \cdot w + b) \ge 1 - \xi_i$
Alternative approach: want to minimize f(w, b) directly:
$f(w,b) = \frac{1}{2} \sum_{j=1}^{d} \left(w^{(j)}\right)^2 + C \sum_{i=1}^{n} \max\left\{0,\ 1 - y_i\left(\textstyle\sum_{j=1}^{d} w^{(j)} x_i^{(j)} + b\right)\right\}$
How to minimize convex functions?
Gradient descent: $\min_z f(z)$
Iterate: $z_{t+1} \leftarrow z_t - \eta\, f'(z_t)$
[Plot: a convex function f(z) with gradient-descent steps moving toward the minimum]
Want to minimize f(w, b):
$f(w,b) = \frac{1}{2} \sum_{j=1}^{d} \left(w^{(j)}\right)^2 + C \sum_{i=1}^{n} \max\left\{0,\ 1 - y_i\left(\textstyle\sum_{j=1}^{d} w^{(j)} x_i^{(j)} + b\right)\right\}$
Take the gradient with respect to $w^{(j)}$:
$\nabla_j f = \frac{\partial f(w,b)}{\partial w^{(j)}} = w^{(j)} + C \sum_{i=1}^{n} \frac{\partial L(x_i, y_i)}{\partial w^{(j)}}$
where
$\frac{\partial L(x_i, y_i)}{\partial w^{(j)}} = \begin{cases} 0 & \text{if } y_i (w \cdot x_i + b) \ge 1 \\ -\,y_i\, x_i^{(j)} & \text{otherwise} \end{cases}$
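A direct numpy transcription of this gradient (a sketch; the function name is mine, and the gradient for b is included for completeness even though the slide only spells out the w part):

```python
import numpy as np

def svm_gradient(w, b, X, y, C):
    """Full (batch) gradient of f(w,b) = 1/2||w||^2 + C*sum_i max(0, 1 - y_i(w·x_i + b))."""
    margins = y * (X @ w + b)
    active = margins < 1                       # examples with non-zero hinge loss
    grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)  # dL_i/dw = -y_i x_i when active
    grad_b = -C * y[active].sum()              # same rule for b (b is not regularized)
    return grad_w, grad_b
```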
Gradient descent:
Iterate until convergence:
  For j = 1 … d:
    Evaluate: $\nabla_j f = \frac{\partial f(w,b)}{\partial w^{(j)}} = w^{(j)} + C \sum_{i=1}^{n} \frac{\partial L(x_i, y_i)}{\partial w^{(j)}}$
    Update: $w^{(j)} \leftarrow w^{(j)} - \eta\, \nabla_j f$   ($\eta$ … learning rate)
Problem:
Computing $\nabla_j f$ takes O(n) time!
n … size of the training dataset
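In vectorized form, one sweep of this procedure updates all d coordinates at once. A sketch (reusing the svm_gradient function above; the iteration count and learning rate are arbitrary illustrative values):

```python
def train_svm_gd(X, y, C=1.0, eta=0.01, iters=1000):
    """Batch gradient descent on f(w, b); every iteration touches all n examples."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        grad_w, grad_b = svm_gradient(w, b, X, y, C)   # O(n·d) per iteration
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```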
We just had:
$\nabla_j f = w^{(j)} + C \sum_{i=1}^{n} \frac{\partial L(x_i, y_i)}{\partial w^{(j)}}$
Stochastic gradient descent (SGD):
Instead of evaluating the gradient over all examples, evaluate it for each individual example:
$\nabla_{j,i} f = w^{(j)} + C\, \frac{\partial L(x_i, y_i)}{\partial w^{(j)}}$
Stochastic gradient descent:
Iterate until convergence:
  For i = 1 … n:
    For j = 1 … d:
      Evaluate: $\nabla_{j,i} f$
      Update: $w^{(j)} \leftarrow w^{(j)} - \eta\, \nabla_{j,i} f$
n … # training examples
d … # features
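The same loop in numpy (a sketch; shuffling the example order each epoch is my addition, and the learning rate is kept fixed here for simplicity):

```python
import numpy as np

def train_svm_sgd(X, y, C=1.0, eta=0.001, epochs=10):
    """SGD: update (w, b) after each individual example; every update costs O(d)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):            # one pass over the examples
            margin = y[i] * (X[i] @ w + b)
            dL_dw = -y[i] * X[i] if margin < 1 else np.zeros(d)
            dL_db = -y[i] if margin < 1 else 0.0
            w -= eta * (w + C * dL_dw)                # ∇_{j,i} f = w^(j) + C ∂L(x_i,y_i)/∂w^(j)
            b -= eta * (C * dL_db)
    return w, b
```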
Example by Leon Bottou:
Reuters RCV1 document corpus
Predict a category of a document
One vs. the rest classification
n = 781,000 training examples
23,000 test examples
d = 50,000 features
One feature per word
Remove stop-words
Remove low frequency words
Questions:
Is SGD successful at minimizing f(w,b)?
How quickly does SGD find the minimum of f(w,b)?
What is the error on a test set?
[Table from Bottou's experiments, comparing a standard SVM solver, a "Fast SVM", and SGD SVM on training time, value of f(w,b), and test error]
[Plot: optimization error | f(w,b) – f_optimal(w,b) |]
What if we subsample the
training data and solve exactly?
The point of this slide is that methods that converge in fewer iterations don't necessarily run faster; a simple step that you can compute quickly (but many times) is better. Check the SVM slides from 2009 and see if there is a better slide to make this point. This is not a very good slide.
SGD on full dataset vs. Batch Conjugate Gradient
on a sample of n training examples
Need to choose the learning rate η and t0:
$w_{t+1} \leftarrow w_t - \frac{\eta}{t + t_0}\left(w_t + C\, \frac{\partial L(x_i, y_i)}{\partial w}\right)$
Leon suggests:
Choose t0 so that the expected initial updates are comparable with the expected size of the weights.
Choose η:
Select a small subsample
Try various rates η (e.g., 10, 1, 0.1, 0.01, …)
Pick the one that most reduces the cost
Use that η for the next 100k iterations on the full dataset
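One way to implement this rate-selection recipe (a sketch; it reuses the train_svm_sgd and svm_objective functions defined above, and the subsample size and candidate rates are arbitrary):

```python
def pick_learning_rate(X, y, C=1.0, candidates=(10, 1, 0.1, 0.01, 0.001), m=1000):
    """Try each candidate rate on a small subsample; keep the one that most reduces the cost."""
    idx = np.random.choice(len(X), size=min(m, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]
    best_eta, best_cost = None, np.inf
    for eta in candidates:
        w, b = train_svm_sgd(Xs, ys, C=C, eta=eta, epochs=1)   # one pass over the subsample
        cost = svm_objective(w, b, Xs, ys, C)
        if cost < best_cost:
            best_eta, best_cost = eta, cost
    return best_eta
```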
The two steps do not decompose exactly. Why does it work? Because η is close to 0? You can do this; you just get one extra η² term, which is of lower order.
Sparse Linear SVM:
Feature vector xi is sparse (contains many zeros)
Do not do: xi=[0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]
But represent xi as a sparse vector xi=[(4,1), (9,5), …]
The update $w \leftarrow w - \eta\left(w + C\, \frac{\partial L(x_i, y_i)}{\partial w}\right)$
can be computed in two steps:
(1) $w \leftarrow w - \eta\, C\, \frac{\partial L(x_i, y_i)}{\partial w}$   … cheap: $x_i$ is sparse, so only a few coordinates $j$ of $w$ are updated
(2) $w \leftarrow w\,(1 - \eta)$   … expensive: $w$ is not sparse, so all coordinates need to be updated
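A sketch of this two-step update for one example stored in the sparse (index, value) format from above (the function name is mine, and w is assumed to be a dense numpy array):

```python
def sgd_step_sparse(w, b, x_sparse, y_i, C, eta):
    """One SGD step for a sparse example x_i given as [(index, value), ...]."""
    margin = y_i * (sum(v * w[j] for j, v in x_sparse) + b)
    # Step (1): hinge-loss part, touches only the non-zero coordinates of x_i (cheap)
    if margin < 1:
        for j, v in x_sparse:
            w[j] += eta * C * y_i * v
        b += eta * C * y_i
    # Step (2): regularization part, shrinks every coordinate of w (expensive)
    w *= (1 - eta)
    return w, b
```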
Give the equations for steps (1) and (2) when w is stored as w = s·v.
Two-step update procedure:
(1) $w \leftarrow w - \eta\, C\, \frac{\partial L(x_i, y_i)}{\partial w}$
(2) $w \leftarrow w\,(1 - \eta)$
Solution 1:
Represent the vector w as the product of a scalar s and a vector v
Perform step (1) by updating v, and step (2) by updating s
Solution 2:
Perform only step (1) for each training example
Perform step (2) with lower frequency and a higher η
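A sketch of Solution 1, keeping w implicitly as w = s·v so that step (2) becomes a single scalar multiplication (the function name is mine; s should be initialized to 1 and stays nonzero since η < 1):

```python
def sgd_step_scaled(s, v, b, x_sparse, y_i, C, eta):
    """One SGD step with w stored as s * v; step (2) costs O(1) instead of O(d)."""
    margin = y_i * (s * sum(val * v[j] for j, val in x_sparse) + b)
    # Step (1): change s*v by +eta*C*y_i*x_i, touching only the non-zero coordinates
    if margin < 1:
        for j, val in x_sparse:
            v[j] += eta * C * y_i * val / s
        b += eta * C * y_i
    # Step (2): w <- w*(1 - eta) is now just a scalar update
    s *= (1 - eta)
    return s, v, b
```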
Why do you think
training on A and B
would take the
same time as on
the full dataset?
Stopping criteria:
How many iterations of SGD?
Early stopping with cross validation
Create validation set
Monitor cost function on the validation set
Stop when loss stops decreasing
Early stopping
Extract two disjoint subsamples A and B of training data
Train on A, stop by validating on B
The number of epochs used gives an estimate k
Train for k epochs on the full dataset
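A sketch of the early-stopping variant used to estimate k (it reuses svm_objective from above; the maximum number of epochs and the stop-on-first-increase rule are my simplifications):

```python
def estimate_epochs(X_A, y_A, X_B, y_B, C=1.0, eta=0.001, max_epochs=50):
    """Run SGD on subsample A one epoch at a time; stop when the cost on B stops decreasing."""
    n, d = X_A.shape
    w, b = np.zeros(d), 0.0
    best_cost, k = np.inf, 0
    for epoch in range(1, max_epochs + 1):
        for i in np.random.permutation(n):            # one SGD epoch over A
            margin = y_A[i] * (X_A[i] @ w + b)
            g = -y_A[i] if margin < 1 else 0.0
            w -= eta * (w + C * g * X_A[i])
            b -= eta * (C * g)
        cost = svm_objective(w, b, X_B, y_B, C)       # monitor the cost on the validation set B
        if cost >= best_cost:
            break                                     # loss stopped decreasing
        best_cost, k = cost, epoch
    return k                                          # then train for k epochs on the full dataset
```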
One against all:
Learn 3 classifiers
+ vs. {o, -}
- vs. {o, +}
o vs. {+, -}
Obtain:
w+ b+, w- b-, wo bo
How to classify?
Return the class $c = \arg\max_c\ (w_c \cdot x + b_c)$
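A sketch of the one-against-all prediction rule (the three classifiers are assumed to have been trained already, e.g. with the SGD routine above; the weights below are made up):

```python
import numpy as np

def one_vs_rest_predict(x, classifiers):
    """classifiers: dict mapping class label -> (w_c, b_c). Return argmax_c (w_c·x + b_c)."""
    return max(classifiers, key=lambda c: classifiers[c][0] @ x + classifiers[c][1])

# Illustrative made-up weights for classes '+', '-', 'o'
clf = {'+': (np.array([1.0, 0.2]), -0.1),
       '-': (np.array([-0.8, -0.5]), 0.0),
       'o': (np.array([0.1, 0.9]), 0.05)}
print(one_vs_rest_predict(np.array([0.5, 1.5]), clf))
```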
Learn 3 sets of weights simultaneously!
For each class c estimate $w_c$, $b_c$
Want the correct class of example $(x_i, y_i)$ to have the highest margin:
$w_{y_i} \cdot x_i + b_{y_i} \ge 1 + w_c \cdot x_i + b_c \quad \forall c \ne y_i,\ \forall i$
Optimization problem:
$\min_{w,b}\ \frac{1}{2} \sum_{c} \|w_c\|^2 + C \sum_{i=1}^{n} \xi_i$
s.t. $w_{y_i} \cdot x_i + b_{y_i} \ge w_c \cdot x_i + b_c + 1 - \xi_i \quad \forall c \ne y_i,\ \forall i$
$\xi_i \ge 0 \quad \forall i$
To obtain parameters wc, bc (for each class c)
we can use similar techniques as for 2 class SVM
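For intuition, the slack that satisfies these constraints at equality is a multiclass hinge loss; a small numpy sketch of it (the function name and the stacked-weight representation W are my choices):

```python
import numpy as np

def multiclass_slack(W, b, x_i, y_i):
    """xi_i = max(0, max_{c != y_i} (1 + (W[c]·x_i + b[c]) - (W[y_i]·x_i + b[y_i])))."""
    scores = W @ x_i + b                    # one score per class
    violations = 1 + scores - scores[y_i]   # margin violations relative to the correct class
    violations[y_i] = 0.0                   # exclude c = y_i
    return max(0.0, violations.max())
```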