Transcript Slide 1

CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu

- Overfitting
- Regularization: if the data is not separable, the weights dance around
- Mediocre generalization: finds a "barely" separating solution

Which is best linear separator?

[Figure: positive (+) and negative (−) training points with several candidate separating lines]

Data:
- Examples: (x_1, y_1) ... (x_n, y_n)
- Example i: x_i = ( x_i^(1), ..., x_i^(d) ), with real-valued components
- Labels: y_i ∈ {−1, +1}

Inner product:
- w · x = Σ_j w^(j) x^(j)

Why define the margin as the distance to the closest example, rather than the average distance or something else?

[Figure: positive (+) and negative (−) points and a separating line, with the distance γ_i of each point marked]

Prediction = sign(w · x + b)
"Confidence" = (w · x + b) y

For the i-th datapoint:
  γ_i = (w · x_i + b) y_i

Want to solve:
  max_w min_i γ_i

Can rewrite as:
  max_{w,γ} γ
  s.t. ∀i: y_i (w · x_i + b) ≥ γ

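To make these definitions concrete, here is a minimal sketch in Python/NumPy (the data and the candidate (w, b) are made-up numbers, not from the slides) of computing the prediction and the per-point confidence γ_i:

```python
import numpy as np

# Hypothetical toy data: four 2-dimensional points with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])

# Some candidate separator (w, b); in the SVM these are what we optimize
w = np.array([1.0, 1.0])
b = 0.0

pred = np.sign(X @ w + b)        # prediction = sign(w·x + b)
gamma = (X @ w + b) * y          # "confidence" gamma_i = (w·x_i + b) y_i
print(pred, gamma, gamma.min())  # the margin of this separator is the smallest gamma_i
```
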
Maximize the margin:
- Good according to intuition, theory & practice

  max_{w,γ} γ
  s.t. ∀i: y_i (w · x_i + b) ≥ γ

[Figure: maximizing the margin; the hyperplane w·x + b = 0 with margin γ on each side]

The separating hyperplane is defined by the support vectors:
- Points on the +/− planes from the solution
- If you knew these points, you could ignore the rest
- If there are no degeneracies, there are d+1 support vectors (for d-dimensional data)

Problem:
- Let (w · x + b) y = γ; then (2w · x + 2b) y = 2γ
- Scaling w increases the margin!

Solution:
- Work with the normalized weight vector w / ‖w‖, where ‖w‖ = sqrt( Σ_{j=1}^d (w^(j))² )
- Also require the hyperplanes touching the support vectors to satisfy: w · x + b = ±1

[Figure: support vectors x_1 and x_2 on the two hyperplanes, with the unit normal w / ‖w‖]

Note: w^(j) ... j-th component of w

What is the relation between x_1 and x_2?
- x_1 = x_2 + 2γ · w / ‖w‖

We also know:
- w · x_1 + b = +1
- w · x_2 + b = −1

Then:
  w · (x_2 + 2γ w / ‖w‖) + b = +1
  (w · x_2 + b) + 2γ (w · w) / ‖w‖ = +1
  −1 + 2γ ‖w‖ = +1        (note: w · w = ‖w‖²)
  ⇒ γ = 1 / ‖w‖

We started with:
  max_{w,γ} γ
  s.t. ∀i: y_i (w · x_i + b) ≥ γ
But w can be arbitrarily large!

We normalized and obtained:
  max γ = max 1/‖w‖ = min ‖w‖ = min ½‖w‖²

Then:
  min_w ½‖w‖²
  s.t. ∀i: y_i (w · x_i + b) ≥ 1

This is called SVM with "hard" constraints.

If the data is not separable, introduce a penalty:
  min_w ½‖w‖² + C · (# of training mistakes)
  s.t. ∀i: y_i (w · x_i + b) ≥ 1
- Minimize ‖w‖² plus the number of training mistakes
- Set C using cross-validation

How to penalize mistakes?
- All mistakes are not equally bad!

[Figure: + and − points, some on the wrong side of the boundary and at different distances from it]

Introduce slack variables ξ_i:
  min_{w,b,ξ_i≥0} ½‖w‖² + C Σ_{i=1}^n ξ_i
  s.t. ∀i: y_i (w · x_i + b) ≥ 1 − ξ_i
- If point x_i is on the wrong side of the margin, it gets penalty ξ_i

[Figure: + and − points; points inside or beyond the margin are marked with their slack ξ_i]

For each datapoint:
- If margin ≥ 1, don't care
- If margin < 1, pay linear penalty

Slack penalty C:
- C = ∞: we only want w, b that separate the data
- C = 0: we can set ξ_i to anything, then w = 0 (basically ignores the data)

SVM in the "natural" form:
  argmin_{w,b} ½ w · w + C Σ_{i=1}^n max{0, 1 − y_i (w · x_i + b)}
  where ½ w · w is the margin (regularization) term, C is the regularization parameter,
  and the sum is the empirical loss (how well we fit the training data).

SVM uses the "Hinge Loss":
  min_{w,b} ½‖w‖² + C Σ_{i=1}^n ξ_i
  s.t. ∀i: y_i (w · x_i + b) ≥ 1 − ξ_i

[Figure: 0/1 loss vs. hinge loss max{0, 1 − z}, plotted against z = y_i (x_i · w + b)]

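As a concrete check that the hinge-loss form matches the slack formulation, here is a minimal Python/NumPy sketch (the toy numbers are made up) that evaluates f(w, b) = ½‖w‖² + C Σ_i max{0, 1 − y_i (w·x_i + b)}:

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """Soft-margin SVM objective: 0.5*||w||^2 + C * sum of hinge losses."""
    margins = y * (X @ w + b)               # y_i (w·x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)  # equals the optimal slack xi_i for each point
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Hypothetical toy data
X = np.array([[2.0, 1.0], [0.5, 0.2], [-1.0, -2.0]])
y = np.array([+1, +1, -1])
print(svm_objective(np.array([1.0, 1.0]), 0.0, X, y, C=1.0))
```
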
min_{w,b} ½ w · w + C Σ_{i=1}^n ξ_i
s.t. ∀i: y_i (x_i · w + b) ≥ 1 − ξ_i

Want to estimate w, b:
- Standard way: use a solver!
- Solver: software for finding solutions to "common" optimization problems

Use a quadratic solver:
- Minimize a quadratic function
- Subject to linear constraints

Problem: solvers are inefficient for big data!

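For small problems an off-the-shelf solver can be applied directly to the constrained form. As one illustration (the slides do not name a specific solver; cvxpy is an assumption here), a sketch on made-up toy data:

```python
import cvxpy as cp
import numpy as np

# Hypothetical toy data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
n, d = X.shape
C = 1.0

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(n)

# Quadratic objective, linear constraints: exactly the form a QP solver expects
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```

This works for modest n, but as the slide says, generic solvers become inefficient for big data.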
Want to estimate w, b! Alternative approach:
- Want to minimize f(w,b):
  f(w,b) = ½ Σ_{j=1}^d (w^(j))² + C Σ_{i=1}^n max{0, 1 − y_i ( Σ_{j=1}^d w^(j) x_i^(j) + b )}

  (the same objective as:
   min_{w,b} ½ w · w + C Σ_{i=1}^n ξ_i,  s.t. ∀i: y_i (x_i · w + b) ≥ 1 − ξ_i)

How do we minimize convex functions?
- Gradient descent: min_z f(z)
- Iterate: z_{t+1} ← z_t − η f′(z_t)

[Figure: a convex function f(z); gradient steps move downhill toward the minimum]

Want to minimize f(w,b):
  f(w,b) = ½ Σ_{j=1}^d (w^(j))² + C Σ_{i=1}^n max{0, 1 − y_i ( Σ_{j=1}^d w^(j) x_i^(j) + b )}

Take the gradient with respect to w^(j):
  ∇_j f = ∂f(w,b)/∂w^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j)

  where  ∂L(x_i, y_i)/∂w^(j) = 0             if y_i (w · x_i + b) ≥ 1
                              = −y_i x_i^(j)   otherwise

Gradient descent:
  Iterate until convergence:
    For j = 1 ... d:
      Evaluate ∇_j f = ∂f(w,b)/∂w^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j)
      Update: w^(j) ← w^(j) − η ∇_j f

Problem:
- Computing ∇_j f takes O(n) time!
- n ... size of the training dataset

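A minimal batch gradient descent sketch for this objective (Python/NumPy assumed; the bias update uses the analogous subgradient ∂L/∂b = −y_i for violated points, which the slide leaves implicit):

```python
import numpy as np

def svm_gradient(w, b, X, y, C):
    """Subgradient of f(w,b) = 0.5*||w||^2 + C * sum_i max{0, 1 - y_i (w·x_i + b)}."""
    margins = y * (X @ w + b)
    viol = margins < 1                                    # examples whose hinge loss is active
    grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    return grad_w, grad_b

def train_svm_gd(X, y, C=1.0, eta=0.01, iters=1000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):                                # every iteration touches all n examples: O(n)
        gw, gb = svm_gradient(w, b, X, y, C)
        w -= eta * gw
        b -= eta * gb
    return w, b

# Hypothetical toy data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
print(train_svm_gd(X, y))
```
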
We just had:
  ∇_j f = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j)

Stochastic gradient descent (SGD):
- Instead of evaluating the gradient over all examples, evaluate it for each individual example:
  ∇_{j,i} f = w^(j) + C · ∂L(x_i, y_i)/∂w^(j)

Stochastic gradient descent:
  Iterate until convergence:
    For i = 1 ... n:
      For j = 1 ... d:
        Evaluate ∇_{j,i} f
        Update: w^(j) ← w^(j) − η ∇_{j,i} f

n ... # training examples
d ... # features

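The same thing as a stochastic update, one example at a time (Python/NumPy assumed; shuffling the order each epoch is my addition, not something the slide prescribes):

```python
import numpy as np

def train_svm_sgd(X, y, C=1.0, eta=0.01, epochs=20, seed=0):
    """SGD for the soft-margin SVM: one noisy gradient step per example."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):          # visit each example once per epoch
            if y[i] * (X[i] @ w + b) < 1:     # hinge loss active: dL/dw = -y_i x_i
                w -= eta * (w - C * y[i] * X[i])
                b += eta * C * y[i]
            else:                             # only the regularizer contributes
                w -= eta * w
    return w, b

# Hypothetical toy data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
print(train_svm_sgd(X, y))
```

Each update costs O(d) instead of O(n·d), which is the whole point.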
Example by Leon Bottou:
- Reuters RCV1 document corpus
- Predict the category of a document
- One vs. the rest classification
- n = 781,000 training examples
- 23,000 test examples
- d = 50,000 features
  - One feature per word
  - Remove stop words
  - Remove low-frequency words

Questions:
- Is SGD successful at minimizing f(w,b)?
- How quickly does SGD find the minimum of f(w,b)?
- What is the error on a test set?

[Table: training time, value of f(w,b), and test error for Standard SVM, "Fast SVM", and SGD SVM]

[Figure: optimality gap | f(w,b) − f_optimal(w,b) |]

What if we subsample the training data and solve exactly?
- SGD on the full dataset vs. Batch Conjugate Gradient on a sample of n training examples

The point of this slide: methods that converge faster don't necessarily work faster; a simple thing that you can do fast (but many times) is better.

Need to choose the learning rate η and t_0:
  w_{t+1} ← w_t − (η / (t + t_0)) · ( w_t + C · ∂L(x_i, y_i)/∂w )

Leon suggests:
- Choose t_0 so that the expected initial updates are comparable with the expected size of the weights
- Choose η:
  - Select a small subsample
  - Try various rates η (e.g., 10, 1, 0.1, 0.01, ...)
  - Pick the one that most reduces the cost
  - Use that η for the next 100k iterations on the full dataset

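A sketch of this recipe (Python/NumPy assumed; the subsample size, the candidate rates, and the single-epoch cost probe are illustrative choices, not taken from the slide):

```python
import numpy as np

def svm_cost(w, b, X, y, C):
    return 0.5 * np.dot(w, w) + C * np.maximum(0, 1 - y * (X @ w + b)).sum()

def sgd_epoch(w, b, X, y, C, eta, t0, t):
    """One pass of SGD with the decaying rate eta / (t + t0)."""
    for i in range(len(y)):
        rate = eta / (t + t0)
        if y[i] * (X[i] @ w + b) < 1:
            w = w - rate * (w - C * y[i] * X[i])
            b = b + rate * C * y[i]
        else:
            w = w - rate * w
        t += 1
    return w, b, t

def pick_eta(X, y, C=1.0, t0=1000, candidates=(10.0, 1.0, 0.1, 0.01), sample=1000, seed=0):
    """Try each candidate rate on a small subsample; keep the one that reduces the cost most."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=min(sample, len(y)), replace=False)
    Xs, ys = X[idx], y[idx]
    best_eta, best_cost = None, np.inf
    for eta in candidates:
        w, b, _ = sgd_epoch(np.zeros(X.shape[1]), 0.0, Xs, ys, C, eta, t0, t=0)
        cost = svm_cost(w, b, Xs, ys, C)
        if cost < best_cost:
            best_eta, best_cost = eta, cost
    return best_eta  # then use this eta for the next ~100k iterations on the full data
```
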
Sparse Linear SVM:
- The feature vector x_i is sparse (contains many zeros)
  - Do not store x_i as [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,...]
  - Represent x_i as a sparse vector, e.g. x_i = [(4,1), (9,5), ...]
- The update  w ← w − η ( w + C · ∂L(x_i, y_i)/∂w )  can be computed in two steps:
  (1) w ← w − η C · ∂L(x_i, y_i)/∂w     cheap: x_i is sparse, so only a few coordinates j of w are updated
  (2) w ← w (1 − η)                     expensive: w is not sparse, all coordinates need to be updated

Note: the two steps do not decompose exactly. Why does this still work? Because η is close to 0: doing the steps separately only adds an extra η² term, which is of lower order.

Two-step update procedure:
  (1) w ← w − η C · ∂L(x_i, y_i)/∂w
  (2) w ← w (1 − η)

Solution 1:
- Represent the vector w as the product of a scalar s and a vector v:  w = s · v
- Perform step (1) by updating v, and step (2) by updating s

Solution 2:
- Perform only step (1) for each training example
- Perform step (2) with lower frequency and a higher η

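A minimal sketch of Solution 1 (Python assumed; the sparse (index, value) representation of x_i and the variable names are illustrative):

```python
def sgd_step_sparse(v, s, x_sparse, yi, eta, C, b):
    """One SGD step with w stored as w = s * v.
    x_sparse: list of (index, value) pairs for the nonzeros of x_i."""
    # w·x_i + b, touching only the nonzero coordinates of x_i
    wx = s * sum(v[j] * xj for j, xj in x_sparse) + b
    if yi * wx < 1:
        # Step (1): w <- w + eta*C*yi*x_i, i.e. v[j] <- v[j] + eta*C*yi*x_j / s
        for j, xj in x_sparse:
            v[j] += eta * C * yi * xj / s
        b += eta * C * yi
    # Step (2): w <- w*(1 - eta) becomes a single scalar update, O(1)
    s *= (1.0 - eta)
    return v, s, b
```

(In practice one would occasionally fold s back into v so that s does not shrink toward zero.)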
Stopping criteria: how many iterations of SGD?
- Early stopping with cross-validation:
  - Create a validation set
  - Monitor the cost function on the validation set
  - Stop when the loss stops decreasing
- Early stopping:
  - Extract two disjoint subsamples A and B of the training data
  - Train on A, stop by validating on B
  - The number of epochs is an estimate of k
  - Train for k epochs on the full dataset
  - (Question: why would training on A and B take the same time as training on the full dataset?)

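A sketch of early stopping against a validation set (Python/NumPy assumed; the per-epoch SGD pass and the "stop at the first non-improving epoch" rule are illustrative choices):

```python
import numpy as np

def svm_cost(w, b, X, y, C):
    return 0.5 * np.dot(w, w) + C * np.maximum(0, 1 - y * (X @ w + b)).sum()

def sgd_epoch(w, b, X, y, C, eta, rng):
    for i in rng.permutation(len(y)):
        if y[i] * (X[i] @ w + b) < 1:
            w -= eta * (w - C * y[i] * X[i])
            b += eta * C * y[i]
        else:
            w -= eta * w
    return w, b

def train_with_early_stopping(X_tr, y_tr, X_val, y_val, C=1.0, eta=0.01, max_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X_tr.shape[1]), 0.0
    best = np.inf
    for epoch in range(1, max_epochs + 1):
        w, b = sgd_epoch(w, b, X_tr, y_tr, C, eta, rng)
        val = svm_cost(w, b, X_val, y_val, C)
        if val >= best:             # validation loss stopped decreasing
            return w, b, epoch      # 'epoch' is the estimate k of how long to train
        best = val
    return w, b, max_epochs
```
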
One against all:
- Learn 3 classifiers:
  - + vs. {o, −}
  - − vs. {o, +}
  - o vs. {+, −}
- Obtain: w_+, b_+;  w_−, b_−;  w_o, b_o

How to classify?
- Return the class c with arg max_c  w_c · x + b_c

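A sketch of the classification rule (Python/NumPy assumed; the three (w_c, b_c) pairs below are made-up stand-ins for classifiers trained with any of the binary SVM procedures above):

```python
import numpy as np

def one_vs_all_predict(x, classifiers):
    """classifiers: dict mapping class label c -> (w_c, b_c).
    Returns arg max_c  w_c · x + b_c."""
    scores = {c: wc @ x + bc for c, (wc, bc) in classifiers.items()}
    return max(scores, key=scores.get)

# Hypothetical classifiers for the classes '+', '-', 'o'
classifiers = {
    '+': (np.array([1.0, 1.0]), 0.0),
    '-': (np.array([-1.0, -1.0]), 0.0),
    'o': (np.array([1.0, -1.0]), 0.0),
}
print(one_vs_all_predict(np.array([2.0, 0.5]), classifiers))  # -> '+'
```
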
Learn the 3 sets of weights simultaneously:
- For each class c, estimate w_c, b_c
- Want the correct class to have the highest margin:
  w_{y_i} · x_i + b_{y_i} ≥ 1 + w_c · x_i + b_c    for all c ≠ y_i, for all i

[Figure: a labeled training example (x_i, y_i)]

Optimization problem:
  min_{w,b} ½ Σ_c ‖w_c‖² + C Σ_{i=1}^n ξ_i
  s.t. w_{y_i} · x_i + b_{y_i} ≥ w_c · x_i + b_c + 1 − ξ_i    for all c ≠ y_i, for all i
       ξ_i ≥ 0    for all i

- To obtain the parameters w_c, b_c (for each class c), we can use similar techniques as for the 2-class SVM

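As a check of this formulation, here is a minimal Python/NumPy sketch (made-up data; integer class labels instead of +/−/o) that evaluates the objective, using the fact that at the optimum ξ_i = max_{c ≠ y_i} max{0, 1 + (w_c·x_i + b_c) − (w_{y_i}·x_i + b_{y_i})}:

```python
import numpy as np

def multiclass_svm_objective(W, b, X, y, C):
    """W: (K, d) class weight vectors, b: (K,) biases, y: labels in {0, ..., K-1}."""
    scores = X @ W.T + b                       # (n, K): score of every class for every point
    correct = scores[np.arange(len(y)), y]     # score of the true class of each point
    margins = 1.0 + scores - correct[:, None]  # how much each constraint is violated
    margins[np.arange(len(y)), y] = 0.0        # exclude c = y_i
    xi = np.maximum(0.0, margins).max(axis=1)  # one shared slack per example
    return 0.5 * np.sum(W * W) + C * xi.sum()

# Hypothetical toy data: 3 classes in 2 dimensions
X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]])
y = np.array([0, 1, 2])
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
print(multiclass_svm_objective(W, b, X, y, C=1.0))  # all constraints satisfied -> 2.0
```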