Semi-Stochastic Gradient Descent Methods
Download
Report
Transcript Semi-Stochastic Gradient Descent Methods
Lehigh University
December 3, 2014
Semi-Stochastic
Gradient Descent Methods
Jakub Konečný
University of Edinburgh
Based on
Basic Method: S2GD
Mini-batching (& proximal setting): mS2GD
Konečný and Richtárik. Semi-Stochastic Gradient
Descent Methods, December 2013
Konečný, Liu, Richtárik and Takáč. mS2GD: mS2GD:
Minibatch semi-stochastic gradient descent in the
proximal setting, October 2014
Coordinate descent variant: S2CD
Konečný, Qu and Richtárik. S2CD: Semi-Stochastic
Coordinate Descent, October 2014
Introduction
Large scale problem setting
Problems are often structured
Structure – sum of functions
is BIG
Frequently arising in machine learning
Examples
Linear regression (least squares)
Logistic regression (classification)
Assumptions
Lipschitz continuity of derivative of
Strong convexity of
Gradient Descent (GD)
Update rule
Fast convergence rate
Alternatively, for accuracy we need
iterations
Complexity of single iteration –
(measured in gradient evaluations)
Stochastic Gradient Descent (SGD)
Update rule
a step-size parameter
Why it works
Slow convergence
Complexity of single iteration –
(measured in gradient evaluations)
Goal
GD
SGD
Fast convergence
Slow convergence
gradient evaluations in
each iteration
Complexity of iteration
independent of
Combine in a single algorithm
Semi-Stochastic Gradient Descent
S2GD
Intuition
The gradient does not change drastically
We could reuse the information from “old” gradient
Modifying “old” gradient
Imagine someone gives us a “good” point
Gradient at point , near , can be expressed as
Gradient change
We can try to estimate
Approximation of the gradient
and
Already computed gradient
The S2GD Algorithm
Simplification; size of the
inner loop is random,
following a geometric rule
Theorem
Convergence rate
For any fixed , can be made
arbitrarily small by increasing
Can be made arbitrarily
small, by decreasing
How to set the parameters
?
Setting the parameters
Fix target accuracy
The accuracy is achieved by setting
# of epochs
stepsize
# of iterations
Total complexity (in gradient evaluations)
# of epochs
full gradient evaluation
cheap iterations
Complexity
S2GD complexity
GD complexity
iterations
complexity of a single iteration
Total
Related Methods
SAG – Stochastic Average Gradient
(Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
SAGA (Aaron Defazio, Francis Bach, Simon Lacoste-Julien, 2014)
Refresh single stochastic gradient in each iteration
Need to store gradients.
Similar convergence rate
Cumbersome analysis
Refined analysis
MISO - Minimization by Incremental Surrogate
Optimization (Julien Mairal, 2014)
Similar to SAG, slightly worse performance
Elegant analysis
Related Methods
SVRG – Stochastic Variance Reduced Gradient
(Rie Johnson, Tong Zhang, 2013)
Arises as a special case in S2GD
Prox-SVRG
(Tong Zhang, Lin Xiao, 2014)
Extended to proximal setting
EMGD – Epoch Mixed Gradient Descent
(Lijun Zhang, Mehrdad Mahdavi , Rong Jin, 2013)
Handles simple constraints,
Worse convergence rate
Experiment (logistic regression on: ijcnn, rcv, real-sim, url)
Extensions
Extensions summary
S2GD:
Minibatching: mS2GD
Efficient handling of sparse data
Pre-processing with SGD (S2GD+)
Non-strongly convex losses
High-probability result
Konečný, Liu, Richtárik and Takáč. mS2GD: mS2GD:
Minibatch semi-stochastic gradient descent in the
proximal setting, October 2014
Coordinate descent variant:S2CD
Konečný, Qu, Richtárik. S2CD: Semi-Stochastic
Coordinate Descent, October 2014
Sparse data
For linear/logistic regression, gradient copies
sparsity pattern of example.
But the update direction is fully dense
SPARSE
DENSE
Can we do something about it?
Sparse data
Yes we can!
To compute
, we only need coordinates of
corresponding to nonzero elements of
For each coordinate , remember when was it
updated last time –
Before computing
in inner iteration number ,
update required coordinates
Step being
Compute direction and make a single update
Number of iterations when the coordinate was not updated
The “old gradient”
Sparse data implementation
S2GD+
Observing that SGD can make reasonable progress,
while S2GD computes first full gradient (in case we
are starting from arbitrary point),
we can formulate the following algorithm (S2GD+)
S2GD+ Experiment
High Probability Result
The result holds only in expectation
Can we say anything about the concentration of the
result in practice?
Paying just logarithm of probability
Independent from other parameters
For any
we have:
Code
Efficient implementation for logistic regression available at MLOSS
http://mloss.org/software/view/556/
mS2GD (mini-batch S2GD)
How does mini-batching influence the algorithm?
Replace
by
Provides two-fold speedup
Provably less gradient evaluations are needed
(up to certain number of mini-batches)
Easy possibility of parallelism
S2CD (Semi-Stochastic Coordinate Descent)
SGD type methods
Coordinate Descent type methods
Sampling rows (training examples) of data matrix
Sampling columns (features) of data matrix
Question: Can we do both?
Sample both columns and rows
S2CD (Semi-Stochastic Coordinate Descent)
Comlpexity
S2GD
S2GD as a Learning Algorithm
Problem with “us” optimizers
Optimizers care about optimization
Statisticians care about statistics
In isolation
Practical need to control both statistical predictive
power and effort spent on optimization, is not well
understood.
Optimizers should be aware of…
The following framework is mostly due to
Bottou and Bousquets, 2007
Machine Learning Setting
Space of input-output pairs
Unknown distribution
A relationship between inputs and outputs
Loss function to measure discrepancy between
predicted and real output
Define Expected Risk
Machine Learning Setting
Ideal goal: Find
such that,
But you cannot even evaluate
Define Expected Risk
Machine Learning Setting
We at least have iid samples
Define Empirical Risk
Machine Learning Setting
First learning principle – fix a family
prediction functions
Find Empirical Minimizer
Define Empirical Risk
of candidate
Machine Learning Setting
Since optimal
we also define
is unlikely to belong to ,
Define Empirical Risk
Machine Learning Setting
Finding by minimizing the Empirical Risk
exactly is often computationally expensive
Run optimization algorithm that returns such that
Define Empirical Risk
Recapitulation
Ideal optimum
“Best” from our family
Empirical Minimizer
From approximate optimization
Machine Learning Goal
Big goal is to minimize the Excess Risk
Approximation error
Estimation Error
Optimization Error
Generic Machine Learning Problem
All this leads to a complicated compromise
Three variables
Family of functions
Number or examples
Optimization accuracy
Two constraints
Maximal number of examples
Maximal computational time available
Generic Machine Learning Problem
Small scale learning problem
If first inequality is tight
Can reduce to insignificant levels and recover
approximation-estimation tradeoff (well studied)
Large scale learning problem
If second inequality is tight
More complicated compromise
Solving Large Scale ML Problem
Several simplifications needed
Not carefully balance the three terms; instead we only
ensure that asymptotically
Consider fixed family of functions , linearly
parameterized by a vector
Effectively setting
to be a constant
Simplifies to Estimation–Optimization tradeoff
Estimation–Optimization tradeoff
Using uniform convergence bounds, one can obtain
Often considered weak
Estimation–Optimization tradeoff
Using Localized Bounds (Bousquet, PhD thesis,
2004) or Isomorphic Coordinate Projections
(Bartlett and Mendelson, 2006), we get
… if we can establish the following variance condition
Often
, for example under strong convexity, or
making assumptions on the data distribution
Estimation–Optimization tradeoff
Using the previous bounds yields
where is an absolute constant
We want to push this term below
Choosing
and using
and
we get the following table