
Optimization and Methods to Avoid Overfitting
Automated Trading 2008
London
15 October 2008
Martin Sewell
Department of Computer Science
University College London
[email protected]
Outline of Presentation
• Bayesian inference
• The futility of bias-free learning
• Overfitting and no free lunch for Occam’s razor
• Bayesian model selection
• Bayesian model averaging
Terminology
• A model is a family of functions
function: f(x) = 3x + 4
model: f(x) = ax + b
• A complex model has a large volume in
parameter space. If we know how complex a
function is, with some assumptions, we can
determine how ‘surprising’ it is.
Statistics vs Machine Learning
Which paradigm better describes our aims?
• Statistics: test a given hypothesis
• Machine learning: formulate the process of generalization
as a search through possible hypotheses in an attempt to
find the best hypothesis
Answer: machine learning
Classical Statistics vs Bayesian Inference
Which paradigm tells us what we want to know?
• Classical Statistics
P(data|null hypothesis)
• Bayesian Inference
P(hypothesis|data, background information)
Answer: Bayesian inference
Problems with Classical Statistics
• The nature of the null hypothesis test
• Prior information is ignored
• Assumptions swept under the carpet
• p values are irrelevant (which leads to incoherence) and misleading
Bayesian Inference
• Definition of a Bayesian: a Bayesian is willing to put a
probability on a hypothesis
• Bayes’ theorem is a trivial consequence of the product rule
• Bayesian inference tells us how we should update our
degree of belief in a hypothesis in the light of new
evidence
• Bayesian analysis is more than merely a theory of statistics; it is a theory of inference.
• Science is applied Bayesian analysis
• Everyone should be a Bayesian!
Bayes' Theorem
B = background information
H = hypothesis
D = data
P(H|B) = prior
P(D|B) = probability of the data
P(D|H&B) = likelihood
P(H|D&B) = posterior
P(H|D&B) = P(H|B)P(D|H&B)/P(D|B)
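As a toy numerical illustration (not from the talk, all numbers made up), the update is one line of Python:

```python
# Toy illustration of Bayes' theorem (illustrative numbers only).
prior = 0.5        # P(H|B): belief in H before seeing the data
likelihood = 0.8   # P(D|H&B): probability of the data if H is true
evidence = 0.6     # P(D|B): overall probability of the data

posterior = prior * likelihood / evidence  # P(H|D&B)
print(posterior)   # 0.667: the data raised our belief in H
```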
Bayesian Inference
• There is no such thing as an absolute probability, P(H), but
we often omit B, and write P(H) when we mean P(H|B).
• An implicit rule of probability theory is that any random
variable not conditioned on is marginalized over.
• The denominator in Bayes’ theorem, P(D|B), is independent of H, so when comparing hypotheses, we can omit it and use P(H|D) ∝ P(H)P(D|H).
The Futility of Bias-Free Learning
• ‘Even after the observation of the frequent conjunction of
objects, we have no reason to draw any inference
concerning any object beyond those of which we have had
experience.’ Hume (1739–40)
• Bias-free learning is futile (Mitchell 1980; Schaffer 1994;
Wolpert 1996).
• One can never generalize beyond one’s data without
making at least some assumptions.
No Free Lunch Theorem
The no free lunch (NFL) theorem for supervised machine
learning (Wolpert 1996) tells us that, on average, all
algorithms are equivalent.
Note that the NFL theorems apply to off-training set
generalization error, i.e., generalization error for test sets
that contain no overlap with the training set.
• No free lunch for Occam’s razor
• No free lunch for overfitting avoidance
• No free lunch for cross validation
Occam’s Razor
• Occam's razor (also spelled Ockham's razor) is a law of parsimony: the principle gives precedence to simplicity; of two competing theories, the simpler explanation is to be preferred.
• Attributed to the 14th-century English logician and
Franciscan friar, William of Ockham.
• There is no free lunch for Occam’s razor.
Bayesian Inference and Occam’s Razor
If the data fits the following two hypotheses equally well, which should be
preferred?
H1: f(x) = bx + c
H2: f(x) = ax² + bx + c
Recall that P(H|D) ∝ P(H)P(D|H), with H the hypothesis and D the data; assume equal priors, so just consider the likelihood, P(D|H).
Because the more complex model has more parameters, and its probability over data sets must still sum to 1, the probability mass P(D|H2) will be more ‘spread out’ than P(D|H1); so if the data fit equally well, the simpler model, H1, should be preferred.
In other words, Bayesian inference automatically and quantitatively
embodies Occam's razor.
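A toy sketch of why this happens (my illustration, not the talk’s): if each model spreads unit probability mass uniformly over the data sets it can generate, the model with the smaller range assigns more mass to any data set both can explain.

```python
# Toy Occam's razor (illustrative assumption: each model spreads its
# probability mass uniformly over the data sets it can generate).
n_datasets_h1 = 10    # simple model can generate 10 possible data sets
n_datasets_h2 = 100   # complex model can generate 100 possible data sets

p_d_given_h1 = 1 / n_datasets_h1   # 0.1
p_d_given_h2 = 1 / n_datasets_h2   # 0.01

# With equal priors, the posterior odds equal the likelihood ratio:
print(p_d_given_h1 / p_d_given_h2)  # 10.0: the simpler model wins
```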
Bayesian Inference and Occam’s Razor
[Figure: the evidence P(D|H) over the space of data sets D, for a simple model H1 and a complex model H2. For data sets in region C, H1 is more probable.]
Occam’s Razor from First Principles?
Bayesian model selection appears to formally justify Occam's razor from first principles. Alas, this is too good to be true: it contradicts the no free lunch theorem. Our ‘proof’ of Occam’s razor involved an element of smoke and mirrors.
Ad hoc assumptions:
• The set of models with a non-zero prior is extremely small,
i.e. all but a countable number of models have exactly zero
probability.
• A flat prior over models (corresponds to a non-flat prior
over functions). When choosing priors, should the
‘principle of insufficient reason’ be applied to functions or
models?
Cross Validation
• Cross-validation (Stone 1974, Geisser 1975) is
the practice of partitioning a sample of data into
subsets such that the analysis is initially
performed on a single subset, while the other
subset(s) are retained for subsequent use in
confirming and validating the initial analysis.
• There is no free lunch for cross-validation.
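For concreteness, a minimal k-fold cross-validation sketch in Python (my own, using synthetic data and a least-squares line fit, not the talk’s code):

```python
# Minimal k-fold cross-validation sketch on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3 * x + 4 + rng.normal(0, 0.1, 100)   # noisy line, cf. f(x) = 3x + 4

k = 5
folds = np.array_split(rng.permutation(len(x)), k)
errors = []
for i in range(k):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    a, b = np.polyfit(x[train], y[train], 1)   # fit f(x) = ax + b
    errors.append(np.mean((a * x[test] + b - y[test]) ** 2))
print(np.mean(errors))   # average out-of-fold mean squared error
```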
Overfitting
• Overfitting avoidance cannot be justified from first
principles.
• The distinction between structure and noise cannot be made on the basis of training data, so overfitting avoidance cannot be justified from the training set alone.
• Overfitting avoidance is not an inherent
improvement, but a bias.
Underfitting and Overfitting
Bias-Variance Trade-Off
Consider a training set, the target (true) function, and an
estimator (your guess).
• Bias The extent to which the average (over all samples
from the training set) of the estimator differs from the
desired function.
• Variance The extent to which the estimator fluctuates around its expected value as the sample from the training set varies.
Bias-Variance Trade-Off: Formula
X = input space
Y = output space
f = target
h = hypothesis
d = training set
m = size of d
C = cost
q = test set point
Y_F = target Y-values
Y_H = hypothesis Y-values
σ_f² = intrinsic error due to f

E(C | f, m, q) = σ_f² + (bias)² + variance,
where σ_f² ≡ E(Y_F² | f, q) − [E(Y_F | f, q)]²,
bias ≡ E(Y_F | f, q) − E(Y_H | f, q),
variance ≡ E(Y_H² | f, q) − [E(Y_H | f, q)]².
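The decomposition can be checked by simulation. A Python sketch (my own; the target function, noise level, hypothesis class and test point are all arbitrary choices):

```python
# Monte Carlo estimate of bias² and variance at one test point q.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)   # target function (an assumption)
sigma_f = 0.2                          # noise level: intrinsic error
q, m, runs = 0.25, 20, 2000            # test point, training-set size, repeats

preds = []
for _ in range(runs):
    x = rng.uniform(0, 1, m)
    y = f(x) + rng.normal(0, sigma_f, m)
    a, b = np.polyfit(x, y, 1)          # hypothesis class: straight lines
    preds.append(a * q + b)
preds = np.array(preds)

bias2 = (preds.mean() - f(q)) ** 2      # squared bias at q
variance = preds.var()                  # fluctuation of the estimator at q
expected_cost = sigma_f**2 + bias2 + variance
print(bias2, variance, expected_cost)
```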
Bias-Variance Trade-Off: Issues
C = cost, d = training set, m = size of the training
set, f = target
• There need not always be a bias-variance trade-off, because there exists an algorithm with both zero bias and zero variance.
• The bias-plus-variance formula ‘examines the
wrong quantity’. In the real world, it is almost
never E(C | f, m) that is directly of interest, but
rather E(C | d), which is what a Bayesian is
interested in.
Model Selection
1) Model selection - Difficult!
Choose from f(x) = ax² + bx + c or f(x) = bx + c
2) Parameter estimation - Easy!
Given f(x) = bx + c, find b and c
Model selection is the task of choosing a model with the correct inductive bias.
Bayesian Model Selection
• The overfitting problem was solved in principle by Sir Harold Jeffreys in 1939
• Chooses the model with the largest posterior probability
• Works with nested or non-nested models
• No need for a validation set
• No ad hoc penalty term (except the prior)
• Informs you of how much structure can be justified by the
given data
• Consistent
Pedagogical Example: Data
• GBP to USD interbank rate
• Daily data
• Exclude zero returns (weekends)
• Average ask price for the day
• 1 January 1993 to 3 February 2008
• Training set: 3402 data points
• Test set: 1701 data points
Tobler's First Law of Geography
• Tobler's first law of geography (Tobler 1970) tells
us that ‘everything is related to everything else,
but near things are more related than distant
things’.
• We use this common sense principle to select and
prioritize our inputs.
Example: Inputs and Target
5 potential inputs, xn, and a target y
pn is the exchange rate n days in the future (negative n denotes days in the past)
x1 = log(p0/p-1)
x2 = log(p-1/p-3)
x3 = log(p-3/p-6)
x4 = log(p-6/p-13)
x5 = log(p-13/p-27)
y = log(p1/p0)
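Constructing these inputs from a price series takes a few lines of Python; the sketch below assumes a hypothetical file gbpusd.txt containing the daily average ask prices, oldest first:

```python
# Sketch of building the five inputs and the target from a price series.
import numpy as np

p = np.loadtxt('gbpusd.txt')   # hypothetical file of daily GBP/USD prices

t = np.arange(27, len(p) - 1)  # days with 27 days of history and 1 day ahead
x1 = np.log(p[t]      / p[t - 1])
x2 = np.log(p[t - 1]  / p[t - 3])
x3 = np.log(p[t - 3]  / p[t - 6])
x4 = np.log(p[t - 6]  / p[t - 13])
x5 = np.log(p[t - 13] / p[t - 27])
y  = np.log(p[t + 1]  / p[t])  # target: the next day's log return
```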
Example: 5 models
m1 = a11x1 + a10
m2 = a22x2 + a21x1 + a20
m3 = a33x3 + a32x2 + a31x1 + a30
m4 = a44x4 + a43x3 + a42x2 + a41x1 + a40
m5 = a55x5 + a54x4 + a53x3 + a52x2 + a51x1 + a50
Example: Assigning Priors 1
Assumption: rather than setting a uniform prior across models, select a uniform prior across functions.
P(m) ∝ volume in parameter space
Assume that a, b, c ∈ [−5, 5]

Model         Volume
a             11¹
ax + b        11²
ax + by + c   11³
Example: Assigning Priors 2
How likely is each model? In practice, the efficient market hypothesis implies that the simplest functions are less likely, so we shall penalize our simplest model.
Example: Model Priors
P(m1) = c × 11² × 0.1 = 0.000006
P(m2) = c × 11³ = 0.000683
P(m3) = c × 11⁴ = 0.007514
P(m4) = c × 11⁵ = 0.082650
P(m5) = c × 11⁶ = 0.909147
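These numbers follow directly from normalizing the volumes; a quick check in Python (with the 0.1 penalty on the simplest model, per the previous slide):

```python
# Reproduce the model priors: volume 11^(number of parameters),
# with the simplest model penalized by a factor of 0.1.
weights = [11**2 * 0.1, 11**3, 11**4, 11**5, 11**6]
c = 1 / sum(weights)              # normalizing constant
priors = [c * w for w in weights]
print(priors)  # [6.2e-06, 0.000683, 0.007513, 0.082650, 0.909147]
```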
Marginal Likelihood
The marginal likelihood is the marginal probability of
the data, given the model, and can be obtained by
summing (more generally, integrating) the joint
probabilities over all parameters, θ.
P(data|model) = ∫θP(data|model,θ)P(θ|model)dθ
Bayesian Information Criterion (BIC)
BIC is easy to calculate and enables us to
approximate the marginal likelihood
n = number of data points
k = number of free parameters
RSS is the residual sum of squares
BIC = n ln(RSS/n) + k ln(n)
marginal likelihood ≈ e^(−BIC/2)
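A Python sketch computing the table on the next slide from its rounded n, k and RSS values (the exponential is shifted by the minimum BIC to avoid overflow; the results are close to, but not identical to, the slide’s figures, which used unrounded RSS):

```python
# BIC approximation to the marginal likelihood for the five models.
import numpy as np

def bic(n, k, rss):
    return n * np.log(rss / n) + k * np.log(n)  # BIC = n ln(RSS/n) + k ln(n)

n = 3402
ks = [2, 3, 4, 5, 6]
rss = [0.05643, 0.05640, 0.05634, 0.05633, 0.05629]
bics = np.array([bic(n, k, r) for k, r in zip(ks, rss)])
ml = np.exp(-0.5 * (bics - bics.min()))  # shift by the minimum: no overflow
ml /= ml.sum()                           # normalize across the five models
print(bics.round(), ml)
```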
Example: Model Likelihoods
Model   n      k   RSS       BIC       Marginal likelihood
m1      3402   2   0.05643   −37429    0.9505
m2      3402   3   0.05640   −37423    0.0443
m3      3402   4   0.05634   −37419    0.0051
m4      3402   5   0.05633   −37411    9.2218×10⁻⁵
m5      3402   6   0.05629   −37405    5.2072×10⁻⁶
Example: Model Posteriors
P(model|data) ∝ prior × likelihood
P(m1|data) = c × 6.21×10⁻⁶ × 0.95052 = 0.068
P(m2|data) = c × 0.00068 × 0.04429 = 0.349
P(m3|data) = c × 0.00751 × 0.00509 = 0.441
P(m4|data) = c × 0.08265 × 9.22×10⁻⁵ = 0.088
P(m5|data) = c × 0.90915 × 5.21×10⁻⁶ = 0.055
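The arithmetic, as a quick Python check:

```python
# Posterior over models: normalize prior × marginal likelihood.
priors      = [6.21e-6, 0.000683, 0.007514, 0.082650, 0.909147]
likelihoods = [0.95052, 0.04429, 0.00509, 9.22e-5, 5.21e-6]
joint = [p * l for p, l in zip(priors, likelihoods)]
c = 1 / sum(joint)                    # normalizing constant
posteriors = [c * j for j in joint]
print([round(p, 3) for p in posteriors])  # [0.068, 0.349, 0.441, 0.088, 0.055]
```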
Example: Best Model
We can choose the best model, the model with the
highest posterior probability:
Model 1: 0.068
Model 2: 0.349 high
Model 3: 0.441 highest
Model 4: 0.088
Model 5: 0.055
Example: Out of Sample Results
[Chart: out-of-sample log returns over 5 years for Models 1–5.]
Bayesian Model Averaging
• We chose the most probable model.
• But we can do better than that!
• It is optimal to take an average over all models,
with each model’s prediction weighted by its
posterior probability.
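In code, the averaging step is a single weighted sum; the forecasts below are hypothetical placeholders, not results from the example:

```python
# Bayesian model averaging: weight each model's forecast by its posterior.
import numpy as np

posteriors = np.array([0.068, 0.349, 0.441, 0.088, 0.055])
forecasts = np.array([0.0002, -0.0001, 0.0003, 0.0001, -0.0002])  # hypothetical
prediction = posteriors @ forecasts   # posterior-weighted average
print(prediction)
```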
Example: Out of Sample Results Including Model Averaging
[Chart: out-of-sample log returns over 5 years for Models 1–5 and model averaging.]
Conclusions
• Our community typically worries about overfitting
avoidance and statistical significance, but our practical
successes have been due to the appropriate application of
bias.
• Be a Bayesian and use domain knowledge to make
intelligent assumptions and adhere to the rules of
probability.
• How ‘aligned’ your learning algorithm is with the domain
determines how well you will generalize.
Questions?
This PowerPoint presentation is available here:
http://www.cs.ucl.ac.uk/staff/M.Sewell/Sewell2008.ppt
Martin Sewell
[email protected]