PPT - Mining of Massive Datasets
Note to other teachers and users of these slides: We would be delighted if you found our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
Course overview (topics by data type):
High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
Graph data: PageRank, SimRank; Community Detection; Spam Detection
Infinite data: Filtering data streams; Web advertising; Queries on streams
Machine learning: SVM; Decision Trees; Perceptron, kNN
Apps: Recommender systems; Association Rules; Duplicate document detection
Example: Spam filtering
Instance space x ∈ X (|X| = n data points)
Binary or real-valued feature vector x of word occurrences
d features (words + other things, d ~ 100,000)
Class y ∈ Y
y: Spam (+1), Ham (-1)
Goal: Estimate a function f(x) so that y = f(x)
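For concreteness, a minimal sketch of how such a word-occurrence feature vector might be built; the vocabulary and document below are made up for illustration, not taken from the slides:

```python
# Minimal sketch: binary bag-of-words features for spam filtering.
# The vocabulary and example document are illustrative, not from the slides.

def featurize(doc, vocabulary):
    """Return a binary word-occurrence vector x for one document."""
    words = set(doc.lower().split())
    return [1.0 if w in words else 0.0 for w in vocabulary]

vocabulary = ["viagra", "nigeria", "meeting", "report"]  # d features
x = featurize("Win big: viagra offer from nigeria", vocabulary)
y = +1  # label: Spam (+1), Ham (-1)
print(x)  # [1.0, 1.0, 0.0, 0.0]
```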
Would like to do prediction:
estimate a function f(x) so that y = f(x)
Where y can be:
Real number: Regression
Categorical: Classification
Complex object:
Ranking of items, Parse tree, etc.
Data is labeled:
Have many pairs {(x, y)}
x … vector of binary, categorical, real valued features
y … class ({+1, -1}, or a real number)
Task: Given data (X, Y), build a model f() to predict Y’ based on X’
Strategy: Estimate y = f(x) on the training data (X, Y).
Hope that the same f(x) also works to predict the unknown Y’ of the test data X’.
The “hope” is called generalization
Overfitting: f(x) predicts Y well but is unable to predict Y’
We want to build a model that generalizes well to unseen data
But Jure, how can we do well on data we have never seen before?!?
Idea: Pretend we do not know the data/labels we actually do know
Split the data into a training set, a validation set, and a test set
Build the model f(x) on the training data
See how well f(x) does on the test data
If it does well, then apply it also to X’
Refinement: Cross validation
Splitting into training/validation set is brutal
Let’s split our data (X, Y) into 10 folds (buckets)
Take out 1 fold for validation, train on the remaining 9
Repeat this 10 times, report average performance
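A minimal sketch of the 10-fold procedure just described; `train` and `evaluate` are hypothetical placeholders for whatever learner is being fit:

```python
# Minimal sketch of k-fold cross validation (k = 10, as on the slide).
# `train` and `evaluate` are hypothetical placeholders for the model at hand.

def cross_validate(X, Y, train, evaluate, k=10):
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]   # k disjoint buckets of indices
    scores = []
    for i in range(k):
        val_idx = set(folds[i])
        X_train = [X[j] for j in range(n) if j not in val_idx]
        Y_train = [Y[j] for j in range(n) if j not in val_idx]
        model = train(X_train, Y_train)                # train on the remaining 9 folds
        scores.append(evaluate(model,
                               [X[j] for j in folds[i]],
                               [Y[j] for j in folds[i]]))
    return sum(scores) / k                             # report average performance
```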
Binary classification:
f(x) = { +1  if w(1) x(1) + w(2) x(2) + ... + w(d) x(d) ≥ θ
       { -1  otherwise
Input: Vectors xⱼ and labels yⱼ
Vectors xⱼ are real valued, where ‖x‖₂ = 1
Goal: Find a vector w = (w(1), w(2), ..., w(d))
Each w(i) is a real number
(Figure: “+” and “-” points separated by the linear decision boundary w·x = 0, with normal vector w.)
Note: the threshold can be folded into w by setting x ← (x, 1) and w ← (w, -θ)
(Very) loose motivation: Neuron
Inputs are feature values x(i)
Each feature has a weight w(i)
Activation is the sum:
f(x) = Σᵢ₌₁^d w(i) x(i) = w · x
If f(x) is:
Positive: Predict +1
Negative: Predict -1
(Figure: a single unit with inputs x(1), ..., x(4), weights w(1), ..., w(4), and a threshold at 0; features such as “viagra” and “nigeria”, with Spam = +1 and Ham = -1 on either side of w·x = 0.)
Note that the Perceptron is
a conservative algorithm: it
ignores samples that it
classifies correctly.
Perceptron: y’ = sign(w · x)
How to find parameters w?
Start with w₀ = 0
Pick training examples xₜ one by one
Predict class of xₜ using current wₜ: y’ = sign(wₜ · xₜ)
If y’ is correct (i.e., yₜ = y’):
No change: wₜ₊₁ = wₜ
If y’ is wrong: Adjust wₜ:
wₜ₊₁ = wₜ + η · yₜ xₜ
η is the learning rate parameter
xₜ is the t-th training example
yₜ is the true t-th class label ({+1, -1})
(Figure: for a misclassified xₜ with yₜ = 1, the update rotates wₜ toward wₜ₊₁ = wₜ + η yₜ xₜ.)
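A minimal sketch of the perceptron update described above, with η as the learning rate (assuming plain Python lists for the vectors):

```python
# Minimal sketch of the perceptron update described above.
# Assumes each x is a list of d feature values and each y is +1 or -1.

def sign(z):
    return 1 if z >= 0 else -1

def perceptron_pass(X, Y, w, eta=1.0):
    """One pass over the training data; returns the updated weight vector."""
    for x, y in zip(X, Y):
        y_pred = sign(sum(wj * xj for wj, xj in zip(w, x)))
        if y_pred != y:                                  # only mistakes trigger an update
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w

# Usage: start from w = 0 and repeat passes until some stopping rule fires.
# w = perceptron_pass(X, Y, [0.0] * d)
```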
Good: Perceptron convergence theorem:
If there exists a set of weights consistent with the data
(i.e., the data is linearly separable), the Perceptron
learning algorithm will converge
Bad: Never converges:
If the data is not separable,
the weights dance around
indefinitely
Bad: Mediocre generalization:
Finds a “barely” separating solution
Perceptron will oscillate and won’t converge
So, when to stop learning?
(1) Slowly decrease the learning rate η
A classic way is to set ηₜ = c₁/(t + c₂)
But, we also need to determine the constants c₁ and c₂
(2) Stop when the training error stops changing
(3) Have a small test dataset and stop when the
test set error stops decreasing
(4) Stop when we have reached some maximum
number of passes over the data
Want to separate “+” from “-” using a line
Data:
Training examples: (x₁, y₁) ... (xₙ, yₙ)
Each example i:
xᵢ = (xᵢ(1), ..., xᵢ(d)), where each xᵢ(j) is real valued
yᵢ ∈ {-1, +1}
Inner product: w · x = Σⱼ₌₁^d w(j) · x(j)
Which is the best linear separator (defined by w)?
(Figure: a cloud of “+” points and a cloud of “-” points in the plane.)
Distance from the separating hyperplane corresponds to the “confidence” of the prediction
Example: We are more sure about the class of A and B than of C
(Figure: “+” and “-” points with a separating line; points A and B lie far from the line, point C lies close to it.)
Margin 𝜸: Distance of the closest example from the decision line/hyperplane
The reason we define the margin this way is theoretical convenience and the existence of
generalization error bounds that depend on the value of the margin.
Remember: Dot product
A · B = ‖A‖ · ‖B‖ · cos θ
where ‖A‖ = √( Σⱼ₌₁^d A(j)² )
(Figure: the projection of A onto B has length ‖A‖ cos θ.)
Dot product: A · B = ‖A‖ ‖B‖ cos θ
What is w · x₁, w · x₂?
(Figure: the same “+” points x₁, x₂ shown against weight vectors w of different lengths; the projection w · x grows with ‖w‖, giving γ₁ in one panel and γ₂ ≈ 2γ₁ in another.)
So, 𝜸 roughly corresponds to the margin
The bigger 𝜸, the bigger the separation
Distance from a point to a line
Let:
Line L: w · x + b = w(1) x(1) + w(2) x(2) + b = 0, with w = (w(1), w(2))
Point A = (x_A(1), x_A(2))
Point M on the line = (x_M(1), x_M(2))
Note: we assume ‖w‖₂ = 1
d(A, L) = |AH|
= |(A - M) · w|
= |(x_A(1) - x_M(1)) w(1) + (x_A(2) - x_M(2)) w(2)|
= x_A(1) w(1) + x_A(2) w(2) + b
= w · A + b
Remember: x_M(1) w(1) + x_M(2) w(2) = -b, since M belongs to the line L
(Figure: point A, its projection H onto the line L, and a point M on L; w is the unit normal of L.)
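A small numeric check of the distance formula above, assuming ‖w‖₂ = 1 as on the slide (the numbers are made up):

```python
# Distance of point A from line L: w·x + b = 0, assuming ||w||_2 = 1.
import math

w = (0.6, 0.8)                            # unit-length normal: 0.36 + 0.64 = 1
b = -1.0
A = (3.0, 2.0)

assert abs(math.hypot(*w) - 1.0) < 1e-12  # check ||w||_2 = 1
d = abs(w[0] * A[0] + w[1] * A[1] + b)    # d(A, L) = |w·A + b|
print(d)                                  # 2.4
```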
Prediction = sign(w · x + b)
“Confidence” = (w · x + b) y
For the i-th datapoint: 𝜸ᵢ = (w · xᵢ + b) yᵢ
Want to solve: max_w min_i 𝜸ᵢ
Can rewrite as:
max_{w,γ} γ
s.t. ∀i, yᵢ(w · xᵢ + b) ≥ γ
(Figure: “+” and “-” points with the separating hyperplane and its normal vector w.)
Maximize the margin:
Good according to intuition, theory (VC dimension) & practice
max_{w,γ} γ
s.t. ∀i, yᵢ(w · xᵢ + b) ≥ γ
𝜸 is the margin ... distance from the separating hyperplane
(Figure: “+” and “-” points with the hyperplane w·x + b = 0 and margin 𝜸 on either side; maximizing the margin.)
The separating hyperplane is defined by the support vectors:
the points that lie on the +1/-1 margin planes of the solution
If you knew these points, you could ignore the rest
Generally there are d+1 support vectors (for d-dimensional data)
Problem:
Let (w·x + b) y = 𝜸,
then (2w·x + 2b) y = 2𝜸
Scaling w increases the margin!
Solution:
Work with a normalized w:
𝜸 = ( (w/‖w‖) · x + b ) y
where ‖w‖ = √( Σⱼ₌₁^d w(j)² )
Let’s also require the support vectors xⱼ to be on the planes defined by: w · xⱼ + b = ±1
(Figure: points x₁, x₂ on the two margin planes, with the unit normal w/‖w‖.)
Want to maximize margin 𝜸!
What is the relation between x₁ and x₂?
x₁ = x₂ + 2𝜸 · w/‖w‖
We also know:
w · x₁ + b = +1
w · x₂ + b = -1
So:
w · x₁ + b = +1
w · (x₂ + 2𝜸 · w/‖w‖) + b = +1
(w · x₂ + b) + 2𝜸 · (w·w)/‖w‖ = +1
-1 + 2𝜸 ‖w‖ = +1   ⟹   𝜸 = 1/‖w‖
Note: w · w = ‖w‖²
(Figure: x₁ on the +1 plane and x₂ on the -1 plane, 2𝜸 apart along the direction w/‖w‖.)
We started with:
max_{w,γ} γ
s.t. ∀i, yᵢ(w · xᵢ + b) ≥ γ
But w can be arbitrarily large!
We normalized and...
arg max 𝜸 = arg max 1/‖w‖ = arg min ‖w‖ = arg min ½ ‖w‖²
Then:
min_w ½ ‖w‖²
s.t. ∀i, yᵢ(w · xᵢ + b) ≥ 1
This is called SVM with “hard” constraints
If data is not separable, introduce a penalty:
min_w ½ ‖w‖² + C · (# of training mistakes)
s.t. ∀i, yᵢ(w · xᵢ + b) ≥ 1
Minimize ‖w‖² plus the number of training mistakes
Set C using cross validation
How to penalize mistakes?
All mistakes are not equally bad!
(Figure: a “+” point just across the boundary is a milder mistake than a “+” point deep inside the “-” region.)
Introduce slack variables ξᵢ:
min_{w,b,ξᵢ≥0} ½ ‖w‖² + C Σᵢ₌₁ⁿ ξᵢ
s.t. ∀i, yᵢ(w · xᵢ + b) ≥ 1 - ξᵢ
If point xᵢ is on the wrong side of the margin, then it gets penalty ξᵢ
For each data point:
If margin ≥ 1, don’t care
If margin < 1, pay linear penalty
(Figure: points xᵢ and xⱼ violating the margin, each with its slack ξ.)
min_w ½ ‖w‖² + C · (# of training mistakes)
s.t. ∀i, yᵢ(w · xᵢ + b) ≥ 1
What is the role of the slack penalty C?
C = ∞: Only want w, b that separate the data
C = 0: Can set ξᵢ to anything, then w = 0 (basically ignores the data)
(Figure: with small C the separator tolerates a mistake and keeps a wide margin; with big C it bends to fit every point; a “good” C lies in between.)
SVM in the “natural” form:
arg min_{w,b} ½ w · w + C Σᵢ₌₁ⁿ max{0, 1 - yᵢ(w · xᵢ + b)}
Margin term: ½ w · w; regularization parameter: C; empirical loss L (how well we fit the training data): the sum of the max terms
SVM uses the “Hinge loss”; this is equivalent to:
min_{w,b} ½ ‖w‖² + C Σᵢ₌₁ⁿ ξᵢ
s.t. ∀i, yᵢ(w · xᵢ + b) ≥ 1 - ξᵢ
(Figure: the 0/1 loss and the hinge loss max{0, 1 - z} plotted against z = yᵢ(xᵢ · w + b).)
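A minimal sketch of the objective in its “natural” form above, f(w, b) = ½ w·w + C Σᵢ max{0, 1 - yᵢ(w · xᵢ + b)}:

```python
# SVM objective with hinge loss, as written above:
#   f(w, b) = 1/2 * w·w + C * sum_i max(0, 1 - y_i (w·x_i + b))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_objective(w, b, X, Y, C):
    reg = 0.5 * dot(w, w)                               # regularization (margin) term
    hinge = sum(max(0.0, 1.0 - y * (dot(w, x) + b))     # empirical hinge loss
                for x, y in zip(X, Y))
    return reg + C * hinge
```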
min_{w,b} ½ w · w + C Σᵢ₌₁ⁿ ξᵢ
s.t. ∀i, yᵢ(xᵢ · w + b) ≥ 1 - ξᵢ
Want to estimate 𝒘 and 𝒃!
Standard way: Use a solver!
Solver: software for finding solutions to “common” optimization problems
Use a quadratic solver:
Minimize a quadratic function
Subject to linear constraints
Problem: Solvers are inefficient for big data!
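As an illustration of the “use a solver” route, a sketch of handing the soft-margin problem to an off-the-shelf convex solver; the cvxpy package is an assumption here, since the slides do not name a specific solver:

```python
# Sketch: solving the soft-margin SVM with an off-the-shelf convex solver.
# cvxpy is an assumption; the slides only say "use a quadratic solver".
import numpy as np
import cvxpy as cp

def svm_qp(X, Y, C=1.0):
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n)                                   # slack variables xi_i >= 0
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(Y, X @ w + b) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# This works for modest n and d, but (as the slide says) generic solvers
# do not scale to massive datasets, hence the gradient approach that follows.
```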
Want to estimate w, b!
Alternative approach:
min_{w,b} ½ w · w + C Σᵢ₌₁ⁿ ξᵢ   s.t. ∀i, yᵢ(xᵢ · w + b) ≥ 1 - ξᵢ
is equivalent to minimizing the unconstrained f(w, b):
f(w, b) = ½ Σⱼ₌₁^d w(j)² + C Σᵢ₌₁ⁿ max{0, 1 - yᵢ(Σⱼ₌₁^d w(j) xᵢ(j) + b)}
Side note:
How to minimize convex functions 𝒈(𝒛)?
Use gradient descent: min_z g(z)
Iterate: zₜ₊₁ ← zₜ - η ∇g(zₜ)
(Figure: a convex function g(z).)
Want to minimize f(w, b):
f(w, b) = ½ Σⱼ₌₁^d w(j)² + C Σᵢ₌₁ⁿ max{0, 1 - yᵢ(Σⱼ₌₁^d w(j) xᵢ(j) + b)}
(the second term is the empirical loss Σᵢ L(xᵢ, yᵢ))
Compute the gradient ∇f(j) w.r.t. w(j):
∇f(j) = ∂f(w, b)/∂w(j) = w(j) + C Σᵢ₌₁ⁿ ∂L(xᵢ, yᵢ)/∂w(j)
where ∂L(xᵢ, yᵢ)/∂w(j) = 0            if yᵢ(w · xᵢ + b) ≥ 1
                        = -yᵢ xᵢ(j)   else
Gradient descent:
Iterate until convergence:
• For j = 1 ... d
  • Evaluate: ∇f(j) = ∂f(w, b)/∂w(j) = w(j) + C Σᵢ₌₁ⁿ ∂L(xᵢ, yᵢ)/∂w(j)
  • Update: w(j) ← w(j) - η ∇f(j)
η ... learning rate parameter
C ... regularization parameter
Problem:
Computing ∇f(j) takes O(n) time!
n ... size of the training dataset
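A minimal sketch of one full (batch) gradient step as described above, using the hinge-loss gradient from the previous slide:

```python
# Batch gradient descent step for the SVM objective described above.
# Gradient of the hinge loss w.r.t. w(j):
#   0              if y_i (w·x_i + b) >= 1
#   -y_i * x_i(j)  otherwise

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def batch_gradient_step(w, b, X, Y, C, eta):
    d = len(w)
    grad_w = [0.0] * d
    grad_b = 0.0
    for x, y in zip(X, Y):
        if y * (dot(w, x) + b) < 1:        # margin violated: hinge loss is active
            for j in range(d):
                grad_w[j] -= y * x[j]
            grad_b -= y
    w = [w[j] - eta * (w[j] + C * grad_w[j]) for j in range(d)]  # grad_f(j) = w(j) + C * sum_i dL/dw(j)
    b = b - eta * C * grad_b
    return w, b

# One such step costs O(n*d): it touches every training example,
# which is exactly the problem flagged on the slide.
```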
We just had:
∇f(j) = w(j) + C Σᵢ₌₁ⁿ ∂L(xᵢ, yᵢ)/∂w(j)
Stochastic Gradient Descent:
Instead of evaluating the gradient over all examples, evaluate it for each individual training example:
∇f(j)(xᵢ) = w(j) + C ∂L(xᵢ, yᵢ)/∂w(j)
(Notice: no summation over i anymore)
Stochastic gradient descent:
Iterate until convergence:
• For i = 1 ... n
  • For j = 1 ... d
    • Compute: ∇f(j)(xᵢ)
    • Update: w(j) ← w(j) - η ∇f(j)(xᵢ)
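A minimal sketch of the stochastic version: the same update applied one example at a time, so each step costs O(d) rather than O(n·d):

```python
# Stochastic gradient descent for the SVM objective (one example per update).

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sgd_epoch(w, b, X, Y, C, eta):
    d = len(w)
    for x, y in zip(X, Y):                      # for i = 1 ... n
        active = y * (dot(w, x) + b) < 1        # is the hinge-loss gradient nonzero?
        for j in range(d):                      # for j = 1 ... d
            grad_j = w[j] + (C * (-y * x[j]) if active else 0.0)
            w[j] -= eta * grad_j                # w(j) <- w(j) - eta * grad_f(j)(x_i)
        if active:
            b -= eta * C * (-y)
    return w, b
```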
Example by Leon Bottou:
Reuters RCV1 document corpus
Predict a category of a document
One vs. the rest classification
n = 781,000 training examples (documents)
23,000 test examples
d = 50,000 features
One feature per word
Remove stop-words
Remove low frequency words
Questions:
(1) Is SGD successful at minimizing f(w, b)?
(2) How quickly does SGD find the minimum of f(w, b)?
(3) What is the error on a test set?
(Table: training time, value of f(w, b), and test error for a standard SVM, a “fast SVM”, and SGD-SVM.)
(1) SGD-SVM is successful at minimizing the value of f(w, b)
(2) SGD-SVM is super fast
(3) SGD-SVM test set error is comparable
(Figure: optimization quality |f(w, b) - f(w_opt, b_opt)| versus training time for SGD-SVM and a conventional SVM.)
For optimizing f(w, b) to within reasonable quality, SGD-SVM is super fast
SGD on the full dataset vs. Conjugate Gradient on a sample of n training examples
Theory says: Gradient descent converges in linear time in 𝒌; conjugate gradient converges in √𝒌 (𝒌 ... condition number)
Bottom line: Doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times
Need to choose the learning rate η and t₀:
wₜ₊₁ ← wₜ - (η / (t + t₀)) · ( wₜ + C ∂L(xᵢ, yᵢ)/∂w )
Leon suggests:
Choose t₀ so that the expected initial updates are comparable with the expected size of the weights
Choose η:
Select a small subsample
Try various rates η (e.g., 10, 1, 0.1, 0.01, ...)
Pick the one that most reduces the cost
Use η for the next 100k iterations on the full dataset
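A minimal sketch of the rate-selection recipe above, reusing the `sgd_epoch` and `svm_objective` sketches given earlier; the candidate rates are illustrative:

```python
# Sketch of the recipe above: try a few candidate learning rates on a small
# subsample and keep the one that lowers the cost the most.
# `sgd_epoch` and `svm_objective` are the sketches given earlier.

def pick_learning_rate(X_sub, Y_sub, C, candidates=(10, 1, 0.1, 0.01, 0.001)):
    best_eta, best_cost = None, float("inf")
    d = len(X_sub[0])
    for eta in candidates:
        w, b = sgd_epoch([0.0] * d, 0.0, X_sub, Y_sub, C, eta)
        cost = svm_objective(w, b, X_sub, Y_sub, C)
        if cost < best_cost:
            best_eta, best_cost = eta, cost
    return best_eta   # then use this eta for the next ~100k iterations on the full data
```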
Sparse Linear SVM:
Feature vector xᵢ is sparse (contains many zeros)
Do not do: xᵢ = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,...]
But represent xᵢ as a sparse vector: xᵢ = [(4,1), (9,5), ...]
Can we do the SGD update more efficiently?
w ← w - η ( w + C ∂L(xᵢ, yᵢ)/∂w )
Approximated in 2 steps:
w ← w - η C ∂L(xᵢ, yᵢ)/∂w   (cheap: xᵢ is sparse, so only a few coordinates j of w will be updated)
w ← w (1 - η)   (expensive: w is not sparse, all coordinates need to be updated)
Solution 1: 𝒘 = 𝒔 ⋅ 𝒗
Represent the vector w as the product of a scalar s and a vector v
Two-step update procedure:
(1) w ← w - η C ∂L(xᵢ, yᵢ)/∂w
(2) w ← w (1 - η)
Then the update procedure becomes:
(1) v ← v - η C ∂L(xᵢ, yᵢ)/∂w
(2) s ← s (1 - η)
Solution 2:
Perform only step (1) for each training example
Perform step (2) with lower frequency and higher η
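A minimal sketch of Solution 1 above: keeping w = s·v makes the shrink step w ← w(1 - η) a single scalar update, while the gradient step touches only the nonzero coordinates of xᵢ (here step (1) divides by s so that s·v tracks w exactly, a small refinement of the slide's step (1)):

```python
# Sketch of the two-step sparse SGD update with w represented as w = s * v.
# x is stored sparsely as a list of (index, value) pairs, e.g. [(4, 1.0), (9, 5.0)].

class ScaledVector:
    def __init__(self, d):
        self.s = 1.0                # scalar factor
        self.v = [0.0] * d          # dense direction; w(j) = s * v(j)

    def dot_sparse(self, x):
        return self.s * sum(self.v[j] * xj for j, xj in x)

    def add_sparse(self, coef, x):
        # w <- w + coef * x touches only the nonzero coordinates of x:
        # since w = s*v, update v(j) += coef * x(j) / s.
        for j, xj in x:
            self.v[j] += coef * xj / self.s

    def shrink(self, eta):
        # w <- w * (1 - eta) is a single scalar multiplication.
        # (In practice one renormalizes s*v occasionally so s does not underflow.)
        self.s *= (1.0 - eta)

def sgd_sparse_step(w, b, x, y, C, eta):
    if y * (w.dot_sparse(x) + b) < 1:       # hinge active: step (1)
        w.add_sparse(eta * C * y, x)
        b += eta * C * y
    w.shrink(eta)                           # step (2): cheap with w = s*v
    return b
```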
Stopping criteria:
How many iterations of SGD?
Early stopping with cross validation
Create a validation set
Monitor cost function on the validation set
Stop when loss stops decreasing
Early stopping
Extract two disjoint subsamples A and B of training data
Train on A, stop by validating on B
Number of epochs is an estimate of k
Train for k epochs on the full dataset
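A minimal sketch of the early-stopping recipe above; `train_one_epoch` and `validation_loss` are hypothetical placeholders, and the `patience` threshold is an added detail (the slide just says to stop when the loss stops decreasing):

```python
# Sketch of early stopping: train epoch by epoch, watch the validation loss,
# and stop once it has not improved for `patience` consecutive epochs.
# `train_one_epoch` and `validation_loss` are hypothetical placeholders.

def early_stopping(model, train_one_epoch, validation_loss, max_epochs=100, patience=3):
    best_loss = float("inf")
    bad_epochs = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)
        loss = validation_loss(model)
        if loss < best_loss:
            best_loss, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return epoch   # estimate of k: number of epochs to then train on the full dataset
```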
Idea 1:
One against all
Learn 3 classifiers
+ vs. {o, -}
- vs. {o, +}
o vs. {+, -}
Obtain:
w+ b+, w- b-, wo bo
How to classify?
Return class c = arg max_c (w_c · x + b_c)
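A minimal sketch of Idea 1: train one classifier per class and return the class whose score w_c · x + b_c is largest; `train_binary_svm` is a placeholder for any of the 2-class training methods above:

```python
# Sketch of one-against-all multiclass classification.
# `train_binary_svm` is a placeholder for any 2-class trainer from this lecture.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def train_one_vs_all(X, Y, classes, train_binary_svm):
    models = {}
    for c in classes:
        Yc = [+1 if y == c else -1 for y in Y]    # c vs. the rest
        models[c] = train_binary_svm(X, Yc)       # returns (w_c, b_c)
    return models

def predict_one_vs_all(models, x):
    # Return the class c with the highest score w_c · x + b_c.
    return max(models, key=lambda c: dot(models[c][0], x) + models[c][1])
```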
Idea 2: Learn 3 sets of weights simultaneously!
For each class c, estimate w_c, b_c
Want the correct class to have the highest margin:
w_{yᵢ} · xᵢ + b_{yᵢ} ≥ 1 + w_c · xᵢ + b_c   ∀c ≠ yᵢ, ∀i
(Figure: point (xᵢ, yᵢ) and its margins to the other classes.)
Optimization problem:
min_{w,b} ½ Σ_c ‖w_c‖² + C Σᵢ₌₁ⁿ ξᵢ
s.t. w_{yᵢ} · xᵢ + b_{yᵢ} ≥ w_c · xᵢ + b_c + 1 - ξᵢ,   ∀c ≠ yᵢ, ∀i
     ξᵢ ≥ 0, ∀i
To obtain the parameters w_c, b_c (for each class c), we can use techniques similar to those for the 2-class SVM
SVM is widely perceived as a very powerful learning algorithm