Linear Regression
Oliver Schulte
Machine Learning 726
The Linear Regression Model
Parameter Learning Scenarios
The general problem: predict the value of a continuous
variable from one or more continuous features.
Parent Node / Child Node:
• Discrete parent, discrete child: maximum likelihood, decision trees
• Discrete parent, continuous child: conditional Gaussian (not discussed)
• Continuous parent, discrete child: logit distribution (logistic regression)
• Continuous parent, continuous child: linear Gaussian (linear regression)
Example 1: Predict House prices
Price vs. floor space of houses for sale in Berkeley, CA, July
2009.
[Scatter plot: house price in $1000 (vertical axis) vs. house size in square feet (horizontal axis).]
Figure: Russell and Norvig.
Grading Example
• Predict: final percentage mark for student.
• Features: assignment grades, midterm exam, final exam.
• Questions we could ask.
• I forgot the weights of components. Can you recover
them from a spreadsheet of the final grades?
• I lost the final exam grades. How well can I still predict
the final mark?
• How important is each component, actually? Could I predict someone's final mark well given only their assignments? Given only their exams?
Line Fitting
Input:
• a data table X of dimension N × D
• a target vector t of dimension N × 1
Output: a weight vector w of dimension D × 1.
Prediction model: the predicted value is a weighted linear combination of the input features,
$t_n = x_n \cdot w$ for each n, or in matrix form $t = Xw$.
Least Squares Error
We seek the closest fit of the predicted line to the data
points.
Error = the sum of squared errors.
Sensitive to outliers.
$E_D(w) = \frac{1}{2}\sum_{n=1}^{N} \{t_n - x_n \cdot w\}^2$
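To make the error function concrete, here is a minimal NumPy sketch (not from the slides; the house sizes, prices, and candidate weights below are invented, and the leading column of 1s is the dummy input that gives the offset $w_0$ discussed later):

```python
import numpy as np

# Toy data: N = 4 houses, D = 2 columns (dummy 1 for the offset, size in square feet)
X = np.array([[1.0, 1500.0],
              [1.0, 2000.0],
              [1.0, 2500.0],
              [1.0, 3000.0]])
t = np.array([480.0, 550.0, 700.0, 820.0])   # prices in $1000 (made-up values)

def sum_squares_error(w, X, t):
    """E_D(w) = 1/2 * sum_n (t_n - x_n . w)^2"""
    residuals = t - X @ w
    return 0.5 * np.sum(residuals ** 2)

w_guess = np.array([0.0, 0.3])               # an arbitrary candidate weight vector
print(sum_squares_error(w_guess, X, t))
```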
Squared Error on House Price Example
[Scatter plot: house price in $1000 vs. house size in square feet.]
Figure 18.13, Russell and Norvig.
Intuition
Suppose that there is an exact solution and that the input
matrix is invertible. Then we can find the solution by simple
matrix inversion:
$t = Xw \Leftrightarrow w = X^{-1} t$
• Alas, X is hardly ever square, let alone invertible.
• But $X^T X$ is square, and usually invertible. So multiply both sides of the equation by $X^T$, then invert:
$t = Xw \Leftrightarrow X^T t = X^T X w \Leftrightarrow (X^T X)^{-1} X^T t = w$
Here $X^T X$ is the matrix we invert, and $(X^T X)^{-1} X^T$ is the pseudo-inverse of X.
Partial Derivative
Think about a single weight parameter $w_j$. The partial derivative is
$\frac{\partial E}{\partial w_j} = \sum_{n=1}^{N} (x_n \cdot w - t_n)\, x_{nj}$
where $x_n \cdot w$ is the prediction for input $x_n$ and $x_n \cdot w - t_n$ is the error on input $x_n$.
The gradient changes the weight to bring the prediction closer to the actual value.
Gradient
$\frac{\partial E}{\partial w_j} = \sum_{n=1}^{N} \{x_n \cdot w - t_n\}\, x_{nj}$
1. Find the gradient vector for each input $x_n$, then add them up.
2. The result is a linear combination of the row vectors $x_n$ with coefficients $x_n \cdot w - t_n$:
$\nabla E = \sum_{n=1}^{N} \{x_n \cdot w - t_n\}\, x_n = X^T (Xw - t)$
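The gradient formula translates directly into gradient descent, an alternative to the closed-form solution on the next slide. The following sketch is illustrative only: the synthetic data, step size, and iteration count are arbitrary choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])  # first column is the dummy input
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.1 * rng.normal(size=N)                       # synthetic targets

w = np.zeros(D)
eta = 0.001                        # small fixed step size, hand-picked for this toy problem
for _ in range(2000):
    grad = X.T @ (X @ w - t)       # gradient of E(w) = 1/2 * ||Xw - t||^2
    w = w - eta * grad             # move against the gradient

print(w)                           # ends up close to w_true
```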
Solution: The Pseudo-Inverse
$0 = \nabla E = X^T (Xw - t) \Leftrightarrow X^T X w = X^T t$
Assume that $X^T X$ is invertible. Then the solution is given by
$w = (X^T X)^{-1} X^T t$
where $(X^T X)^{-1} X^T$ is the pseudo-inverse of X.
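A minimal NumPy sketch of this closed-form solution on synthetic data (the data and weights are invented). Solving the normal equations with `np.linalg.solve`, or calling the least-squares routine `np.linalg.lstsq`, is preferable in practice to explicitly inverting $X^T X$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
w_true = np.array([2.0, -1.0, 3.0])
t = X @ w_true + 0.1 * rng.normal(size=N)

# Normal equations: (X^T X) w = X^T t
w_normal = np.linalg.solve(X.T @ X, X.T @ t)

# Equivalent result from the dedicated least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

print(w_normal, w_lstsq)           # both close to w_true
```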
The w0 offset
Recall the formula for the partial derivative, and that $x_{n0} = 1$ for all n.
Write $w^* = (w_1, w_2, \ldots, w_D)$ for the weight vector without $w_0$, and similarly $x_n^* = (x_{n1}, x_{n2}, \ldots, x_{nD})$ for the n-th feature vector without the "dummy" input. Then
$\frac{\partial E}{\partial w_0} = \sum_{n=1}^{N} (w^* \cdot x_n^* + w_0 - t_n)$
Setting the partial derivative to 0, we get
$w_0 = \frac{1}{N}\sum_{n=1}^{N} t_n - w^* \cdot \left(\frac{1}{N}\sum_{n=1}^{N} x_n^*\right)$
that is, the average target value minus the average predicted value (from the non-offset weights).
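A quick numeric check of this identity, again on invented synthetic data: after fitting by least squares, $w_0$ equals the average target value minus the average prediction made by the non-offset weights.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
X_star = rng.normal(size=(N, 2))               # feature vectors without the dummy input
X = np.column_stack([np.ones(N), X_star])      # prepend the dummy column x_n0 = 1
t = X @ np.array([5.0, 1.5, -0.5]) + 0.2 * rng.normal(size=N)

w = np.linalg.solve(X.T @ X, X.T @ t)          # full least-squares solution
w0, w_star = w[0], w[1:]

# w0 = (average target value) - (average prediction from the non-offset weights)
print(w0, t.mean() - X_star.mean(axis=0) @ w_star)   # the two numbers agree
```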
Geometric Interpretation
Any vector of the form y = Xw is
a linear combination of the
columns (variables) of X.
If y is the least squares
approximation, then y is the
orthogonal projection of t onto this subspace.
$\varphi_i$ = column vector i of X.
Figure: Bishop.
Probabilistic Interpretation
Noise Model
A linear function predicts a deterministic value $y_n(x_n, w)$ for each input vector.
We can turn this into a probabilistic prediction
via a model
true value = predicted value + random noise:
Let’s start with a Gaussian noise model.
$t = y(x, w) + \epsilon$, where $p(\epsilon \mid \sigma^2) = \mathcal{N}(\epsilon \mid 0, \sigma^2)$
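Read generatively, the model says: compute the deterministic prediction, then add Gaussian noise. A short sketch with invented parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)

def y(x, w):
    """Deterministic linear prediction."""
    return x @ w

w = np.array([1.0, 2.0])           # "true" weights (invented)
sigma = 0.3                        # noise standard deviation (invented)

x_n = np.array([1.0, 0.7])         # one input vector, with the dummy 1
epsilon = rng.normal(0.0, sigma)   # epsilon ~ N(0, sigma^2)
t_n = y(x_n, w) + epsilon          # observed target = prediction + noise
print(t_n)
```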
Curve Fitting With Noise
The Gaussian Distribution
Meet the exponential family
A common way to define a probability density p(x) is as an
exponential function of x.
Simple mathematical motivation: multiplying numbers between 0 and 1 yields a number between 0 and 1, e.g. $(1/2)^n$, $(1/e)^x$.
Deeper mathematical motivation: exponential pdfs have good statistical properties for learning, e.g. conjugate priors, maximum likelihood, sufficient statistics.
Reading exponential prob formulas
• Suppose there is a relevant feature f(x) and I want to
express that “the greater f(x) is, the less probable x
is”.
• As f(x) increases, p(x) decreases.
• Use p(x) = α exp(-f(x)).
Example: exponential form sample size
• Fair coin: the larger the sample size, the less likely any particular sample is.
• $p(n) = 2^{-n}$.
[Plot: ln p(n) vs. sample size n.]
Location Parameter
• The further x is from the center μ, the less likely it is.
[Plot: ln p(x) vs. $(x - \mu)^2$.]
Spread/Precision parameter
• The greater the spread $\sigma^2$, the more likely an x away from the mean is.
• The greater the precision $\beta = 1/\sigma^2$, the less likely such an x is.
Minimal energy = max probability
• The greater the energy (of the joint state), the less
probable the state is.
[Plot: ln p(x) vs. the energy E(x).]
Normalization
Let p*(x) be an unnormalized density function.
To make a probability density function, need to find normalization
constant α s.t.
$\int_{-\infty}^{\infty} \alpha\, p^*(x)\, dx = 1.$
Therefore
$\alpha = \frac{1}{\int_{-\infty}^{\infty} p^*(x)\, dx}.$
For the Gaussian (Laplace 1782),
$\int_{-\infty}^{\infty} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right) dx = \sqrt{2\pi\sigma^2}.$
Central Limit Theorem
The distribution of the sum of N i.i.d. random variables
becomes increasingly Gaussian as N grows.
Laplace (1810).
Example: N uniform [0,1] random variables.
Gaussian Likelihood Function
Exercise: Assume a Gaussian noise model, so the likelihood
function becomes (copy 3.10 from Bishop).
$p(t \mid X, w, \sigma^2) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2\sigma^2}(t_n - w \cdot x_n)^2\right)$
Show that the maximum likelihood solution minimizes the sum-of-squares error:
$E_X(w) = \frac{1}{2}\sum_{n=1}^{N} \{t_n - x_n \cdot w\}^2$
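A sketch of the argument (the intermediate steps are mine, not copied from the slide): taking the log of the likelihood gives

$\ln p(t \mid X, w, \sigma^2) = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{n=1}^{N}(t_n - w \cdot x_n)^2 = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{\sigma^2}\, E_X(w)$

The first term does not depend on w, and $1/\sigma^2 > 0$, so maximizing the log-likelihood over w is exactly the same as minimizing $E_X(w)$.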
Regression With Basis Functions
Nonlinear Features
We can increase the power of linear regression by applying functions of the input features instead of the raw input features themselves.
These are called basis functions.
Linear regression can then be used to assign weights to the basis functions.
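As an illustrative sketch (not part of the slides): with a one-dimensional input and polynomial basis functions, the basis expansion produces a design matrix, after which least squares proceeds exactly as before. The sinusoidal data and the degree are invented choices.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=30)                       # 1-D inputs
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)    # noisy sinusoidal targets

# Polynomial basis: phi_j(x) = x^j for j = 0..M-1  (phi_0 = 1 provides the bias)
M = 4
Phi = np.vander(x, N=M, increasing=True)                 # design matrix, shape (30, M)

# The same least-squares solution as before, with Phi in place of X
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

x_new = 0.25
phi_new = np.array([x_new ** j for j in range(M)])
print(phi_new @ w)                                        # prediction at x_new
```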
Linear Basis Function Models (1)
Generally,
$y(x, w) = \sum_{j=0}^{M-1} w_j\, \phi_j(x)$
where the $\phi_j(x)$ are known as basis functions.
Typically, $\phi_0(x) = 1$, so that $w_0$ acts as a bias.
In the simplest case, we use linear basis functions: $\phi_d(x) = x_d$.
Linear Basis Function Models (2)
Polynomial basis functions: $\phi_j(x) = x^j$.
These are global, the same for all input vectors.
Linear Basis Function Models (3)
Gaussian basis functions:
$\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$
These are local; a small change in x only affects nearby basis functions. $\mu_j$ and s control location and scale (width).
Related to kernel methods.
Linear Basis Function Models (4)
Sigmoidal basis functions:
$\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$.
These too are local; a small change in x only affects nearby basis functions. $\mu_j$ and s control location and scale (slope).
Basis Function Example
Transformation
Limitations of Fixed Basis Functions
• Using M basis functions along each dimension of a D-dimensional input space requires $M^D$ basis functions: the curse of dimensionality.
• In later chapters, we shall see how we can get away with fewer basis functions, by choosing them using the training data.
Overfitting and Regularization
Polynomial Curve Fitting
0th Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Over-fitting
Root-Mean-Square (RMS) Error: $E_{RMS} = \sqrt{2 E(w^*) / N}$
Polynomial Coefficients
Data Set Size: 9th Order Polynomial
1st Order Polynomial
Data Set Size: 9th Order Polynomial
Quadratic Regularization
Penalize large coefficient values:
$\tilde{E}(w) = \frac{1}{2}\sum_{n=1}^{N} \{t_n - x_n \cdot w\}^2 + \frac{\lambda}{2}\|w\|^2$
Regularization: [regularized fits for different values of λ]
Regularization: $E_{RMS}$ vs. $\ln \lambda$
Regularized Least Squares (1)
Consider the error function:
data term + regularization term.
With the sum-of-squares error function and a quadratic regularizer, we get
$\frac{1}{2}\sum_{n=1}^{N} \{t_n - x_n \cdot w\}^2 + \frac{\lambda}{2} w^T w$
which is minimized by
$w = (\lambda I + X^T X)^{-1} X^T t$
$\lambda$ is called the regularization coefficient.
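A minimal NumPy sketch of this regularized solution, assuming the closed form above; the data and the value of λ are invented. Note the only change from ordinary least squares is the λI term added to $X^T X$:

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 40, 6
X = rng.normal(size=(N, D))
t = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

lam = 0.1    # regularization coefficient lambda (arbitrary choice)

# Regularized least squares: w = (lambda*I + X^T X)^{-1} X^T t
w_ridge = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ t)
w_unreg = np.linalg.solve(X.T @ X, X.T @ t)

print(np.linalg.norm(w_ridge), np.linalg.norm(w_unreg))   # the regularized weights are shrunk
```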
Regularized Least Squares (2)
With a more general regularizer, we have
$\frac{1}{2}\sum_{n=1}^{N} \{t_n - x_n \cdot w\}^2 + \frac{\lambda}{2}\sum_{j=1}^{D} |w_j|^q$
where $q = 1$ gives the lasso and $q = 2$ the quadratic regularizer.
Regularized Least Squares (3)
Lasso tends to generate sparser solutions than a quadratic regularizer.
Evaluating Classifiers and Parameters
Cross-Validation
Evaluating Learners on Validation Set
[Diagram: Training Data → Learner → Model; the Model is evaluated on the Validation Set.]
What if there is no validation set?
What does training error tell me about the
generalization performance on a hypothetical
validation set?
Scenario 1: You run a big pharmaceutical company. Your new drug looks pretty good on the trials you've done so far. The government tells you to test it on another 10,000 patients.
Scenario 2: Your friendly machine learning instructor provides you with another validation set.
What if you can't get more validation data?
Examples from Andrew Moore
Andrew Moore's slides
Cross-Validation for Evaluating Learners
Cross-validation to estimate the performance of a learner on future unseen data:
• Learn on 3 folds, test on 1
• Learn on 3 folds, test on 1
• Learn on 3 folds, test on 1
• Learn on 3 folds, test on 1
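A minimal sketch of this 4-fold procedure (synthetic data; plain least squares stands in for the learner being evaluated):

```python
import numpy as np

rng = np.random.default_rng(6)
N, D = 80, 4
X = rng.normal(size=(N, D))
t = X @ rng.normal(size=D) + 0.2 * rng.normal(size=N)

K = 4
folds = np.array_split(rng.permutation(N), K)        # 4 disjoint sets of row indices

errors = []
for k in range(K):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    w = np.linalg.solve(X[train].T @ X[train], X[train].T @ t[train])   # learn on 3 folds
    errors.append(np.mean((t[test] - X[test] @ w) ** 2))                # test on the held-out fold

print(np.mean(errors))   # cross-validation estimate of the error on unseen data
```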
Cross-Validation for Meta Methods
• If the learner requires setting a parameter, we can evaluate different parameter settings against the data using training error or cross-validation.
• Then cross-validation is part of learning.
[Diagram: Training Data → Learner(λ) → Model(λ).]
Cross-Validation for Evaluating a Parameter Value
Cross-validation for a fixed λ value (e.g. λ = 1):
• Learn with λ on 3 folds, test on 1
• Learn with λ on 3 folds, test on 1
• Learn with λ on 3 folds, test on 1
• Learn with λ on 3 folds, test on 1
The average error over all 4 runs is the cross-validation estimate of the error for that λ value.
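The same idea, applied to choosing λ: compute the cross-validation estimate for each candidate value and keep the λ with the smallest estimate. This is an illustrative sketch; the grid of λ values and the regularized learner are my choices, not the slides'.

```python
import numpy as np

rng = np.random.default_rng(7)
N, D = 80, 6
X = rng.normal(size=(N, D))
t = X @ rng.normal(size=D) + 0.3 * rng.normal(size=N)

K = 4
folds = np.array_split(rng.permutation(N), K)

def cv_error(lam):
    """Average held-out squared error of regularized least squares with this lambda."""
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        Xtr, ttr = X[train], t[train]
        w = np.linalg.solve(lam * np.eye(D) + Xtr.T @ Xtr, Xtr.T @ ttr)
        errs.append(np.mean((t[test] - X[test] @ w) ** 2))
    return np.mean(errs)

lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]    # arbitrary candidate values
best = min(lambdas, key=cv_error)
print(best)
```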
Stopping Criterion
for any type of validation (hold-out set validation, cross-validation)
Bayesian Linear Regression
Bayesian Linear Regression (1)
• Define a conjugate shrinkage prior over the weight vector w: $p(w \mid \sigma^2) = \mathcal{N}(w \mid 0, \sigma^2 I)$
• Combining this with the likelihood function, and using results for marginal and conditional Gaussian distributions, gives a posterior distribution over w.
• Negative log of the posterior = sum of squared errors + quadratic regularization (up to additive constants).
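A sketch of the resulting posterior in NumPy, using the standard Gaussian-conjugacy result. The prior precision α and noise precision β below are invented values (the slide's prior $\mathcal{N}(w \mid 0, \sigma^2 I)$ corresponds to $\alpha = 1/\sigma^2$); the posterior is Gaussian with covariance $S_N = (\alpha I + \beta X^T X)^{-1}$ and mean $m_N = \beta S_N X^T t$, and the posterior mean coincides with regularized least squares with $\lambda = \alpha/\beta$.

```python
import numpy as np

rng = np.random.default_rng(8)
N, D = 30, 3
X = rng.normal(size=(N, D))
t = X @ np.array([1.0, -1.0, 0.5]) + 0.2 * rng.normal(size=N)

alpha = 2.0     # prior precision (1 / prior variance), invented
beta = 25.0     # noise precision (1 / noise variance), invented

# Posterior p(w | t) = N(w | m_N, S_N)
S_N = np.linalg.inv(alpha * np.eye(D) + beta * X.T @ X)
m_N = beta * S_N @ X.T @ t

# The posterior mean equals regularized least squares with lambda = alpha / beta
w_reg = np.linalg.solve((alpha / beta) * np.eye(D) + X.T @ X, X.T @ t)
print(np.allclose(m_N, w_reg))    # True
```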
Bayesian Linear Regression (3)
0 data points observed
Prior
Data Space
Bayesian Linear Regression (4)
1 data point observed
Likelihood
Posterior
Data Space
Bayesian Linear Regression (5)
2 data points observed
Likelihood
Posterior
Data Space
Bayesian Linear Regression (6)
20 data points observed
Likelihood
Posterior
Data Space
Predictive Distribution (1)
• Predict t for new values of x by integrating
over w.
• Can be solved analytically.
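Concretely, for the Gaussian model the integral yields another Gaussian. This is the standard closed-form result, stated here as a sketch rather than copied from the slide; $m_N$ and $S_N$ are the posterior mean and covariance of w, $\beta$ the noise precision, and $\phi(x)$ the basis-function vector of the new input:

$p(t_{\text{new}} \mid x_{\text{new}}, t) = \mathcal{N}\!\left(t_{\text{new}} \,\middle|\, m_N \cdot \phi(x_{\text{new}}),\ \frac{1}{\beta} + \phi(x_{\text{new}})^T S_N\, \phi(x_{\text{new}})\right)$

The $1/\beta$ term is irreducible observation noise; the second term reflects the remaining uncertainty about w, which shrinks as more data points are observed.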
Predictive Distribution (2)
Example: Sinusoidal data, 9 Gaussian basis
functions, 1 data point
Predictive Distribution (3)
Example: Sinusoidal data, 9 Gaussian basis
functions, 2 data points
Predictive Distribution (4)
Example: Sinusoidal data, 9 Gaussian basis
functions, 4 data points
Predictive Distribution (5)
Example: Sinusoidal data, 9 Gaussian basis
functions, 25 data points