Linear Regression
Oliver Schulte
Machine Learning 726
The Linear Regression Model
Parameter Learning Scenarios
• The general problem: predict the value of a continuous variable from one or more continuous features.
• Parameter learning by parent node / child node type:
  • Discrete parent, discrete child: maximum likelihood, decision trees.
  • Discrete parent, continuous child: conditional Gaussian (not discussed).
  • Continuous parent, discrete child: logit distribution (logistic regression).
  • Continuous parent, continuous child: linear Gaussian (linear regression).
Example 1: Predict House prices
• Price vs. floor space of houses for sale in Berkeley, CA, July 2009.
[Figure from Russell and Norvig: scatter plot of house price in $1000 vs. house size in square feet.]
Grading Example
• Predict: final percentage mark for student.
• Features: assignment grades, midterm exam, final exam.
• Questions we could ask.
• I forgot the weights of components. Can you recover
them from a spreadsheet of the final grades?
• I lost the final exam grades. How well can I still predict
the final mark?
• How important is each component, actually? Could I guess someone's final mark well from their assignments alone? From their exams alone?
Line Fitting
• Input:
  • a data table X (N × D).
  • a target vector t (N × 1).
• Output: a weight vector w (D × 1).
• Prediction model: predicted value = weighted linear combination of input features.

$$t_n \approx \mathbf{x}_n \cdot \mathbf{w}, \qquad \mathbf{t} = X\mathbf{w}$$
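As a concrete illustration (a sketch added here, not from the slides), the prediction model is just a matrix-vector product in NumPy; the numbers are made up:

```python
import numpy as np

# Data table X (N x D) and weight vector w (D,): each prediction is a
# weighted linear combination of the features of one row.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # N = 2 rows, D = 3 features
w = np.array([0.5, -1.0, 2.0])

t_hat = X @ w                      # vector of predicted values, t = Xw
print(t_hat)                       # [4.5, 9.0]
```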
Least Squares Error
• We seek the closest fit of the predicted line to the data points.
• Error = the sum of squared errors.
• Sensitive to outliers.

$$E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \{t_n - \mathbf{x}_n \cdot \mathbf{w}\}^2$$
Squared Error on House Price Example
[Figure 18.13 from Russell and Norvig: house price in $1000 vs. house size in square feet, with the fitted line.]
Intuition
• Suppose that there is an exact solution and that the input matrix is invertible. Then we can find the solution by simple matrix inversion:

$$\mathbf{t} = X\mathbf{w} \;\Leftrightarrow\; \mathbf{w} = X^{-1}\mathbf{t}$$

• Alas, X is hardly ever square, let alone invertible. But X^T X is square, and usually invertible. So multiply both sides of the equation by X^T, then use inversion:

$$\mathbf{t} = X\mathbf{w} \;\Leftrightarrow\; X^T\mathbf{t} = X^T X\mathbf{w} \;\Leftrightarrow\; \mathbf{w} = (X^T X)^{-1} X^T\mathbf{t}$$

The matrix $(X^T X)^{-1} X^T$ is the pseudo-inverse of X.
Partial Derivative
• Think about a single weight parameter w_j.
• The partial derivative is

$$\frac{\partial E}{\partial w_j} = \sum_{n=1}^{N} (\mathbf{x}_n \cdot \mathbf{w} - t_n)\, x_{nj}$$

where $\mathbf{x}_n \cdot \mathbf{w}$ is the prediction for input $\mathbf{x}_n$ and $(\mathbf{x}_n \cdot \mathbf{w} - t_n)$ is the error on input $\mathbf{x}_n$.

The gradient changes the weight to bring the prediction closer to the actual value.
Gradient
$$\frac{\partial E}{\partial w_j} = \sum_{n=1}^{N} \{\mathbf{x}_n \cdot \mathbf{w} - t_n\}\, x_{nj}$$

1. Find the gradient vector for each input x_n, then add them up.
2. The result is a linear combination of the row vectors x_n with coefficients x_n · w - t_n:

$$\nabla E = \sum_{n=1}^{N} \{\mathbf{x}_n \cdot \mathbf{w} - t_n\}\, \mathbf{x}_n = X^T (X\mathbf{w} - \mathbf{t})$$
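A minimal NumPy sketch of batch gradient descent using this gradient (an illustration added here, not from the slides; the learning rate eta and iteration count are arbitrary example values):

```python
import numpy as np

def gradient_descent(X, t, eta=1e-3, n_iters=1000):
    """Minimize E(w) = 1/2 * ||Xw - t||^2 by batch gradient descent,
    using the gradient X^T (Xw - t)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - t)   # gradient of E at the current w
        w -= eta * grad            # step against the gradient
    return w
```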
Solution: The Pseudo-Inverse
$$0 = \nabla E = X^T (X\mathbf{w} - \mathbf{t}) \;\Leftrightarrow\; X^T X \mathbf{w} = X^T \mathbf{t}$$

Assume that X^T X is invertible. Then the solution is given by

$$\mathbf{w} = (X^T X)^{-1} X^T \mathbf{t}$$

where $(X^T X)^{-1} X^T$ is the pseudo-inverse of X.
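A NumPy sketch of the closed-form solution (an illustration added here, not from the slides; in practice np.linalg.lstsq or np.linalg.pinv is more numerically stable than forming the inverse explicitly). The synthetic "house price" numbers are made up for the example:

```python
import numpy as np

def least_squares(X, t):
    """Closed-form least squares: w = (X^T X)^{-1} X^T t."""
    return np.linalg.solve(X.T @ X, X.T @ t)

# Example with a dummy column of ones so that w[0] plays the role of the offset w0.
rng = np.random.default_rng(0)
X_raw = rng.uniform(500, 3500, size=(50, 1))          # e.g. house sizes
t = 0.25 * X_raw[:, 0] + 100 + rng.normal(0, 30, 50)  # synthetic prices
X = np.hstack([np.ones((50, 1)), X_raw])              # add dummy input x_n0 = 1
w = least_squares(X, t)
print(w)   # w[0] is the learned offset (near 100), w[1] the slope (near 0.25)
```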
The w0 offset
• Recall the formula for the partial derivative, and that x_{n0} = 1 for all n.
• Write w* = (w_1, w_2, ..., w_D) for the weight vector without w_0, and similarly x*_n = (x_{n1}, x_{n2}, ..., x_{nD}) for the n-th feature vector without the "dummy" input. Then

$$\frac{\partial E}{\partial w_0} = \sum_{n=1}^{N} \left( \mathbf{w}^* \cdot \mathbf{x}^*_n + w_0 - t_n \right)$$

Setting the partial derivative to 0, we get

$$w_0 = \frac{1}{N}\sum_{n=1}^{N} t_n \;-\; \frac{1}{N}\sum_{n=1}^{N} \mathbf{w}^* \cdot \mathbf{x}^*_n$$

that is, the average target value minus the average predicted value (of the model without the offset).
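A quick numerical check of this identity (a sketch added here, not from the slides), using the closed-form solution with a dummy column of ones; the true weights and noise level are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 200, 3
X_star = rng.normal(size=(N, D))                  # features without the dummy input
t = X_star @ np.array([1.0, -2.0, 0.5]) + 4.0 + rng.normal(0, 0.1, N)

X = np.hstack([np.ones((N, 1)), X_star])          # x_n0 = 1 for all n
w = np.linalg.solve(X.T @ X, X.T @ t)             # least-squares weights
w0, w_star = w[0], w[1:]

# w0 equals the average target value minus the average predicted value
# of the model without the offset.
print(np.isclose(w0, t.mean() - X_star.mean(axis=0) @ w_star))  # True
```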
Geometric Interpretation
• Any vector of the form y = Xw is a linear combination of the columns (variables) of X.
• If y is the least squares approximation, then y is the orthogonal projection of t onto this subspace.
[Figure from Bishop; φ_i = column vector i.]
Probabilistic Interpretation
Noise Model
• A linear function predicts a deterministic value y_n(x_n, w) for each input vector.
• We can turn this into a probabilistic prediction via a model: true value = predicted value + random noise.
• Let's start with a Gaussian noise model:

$$t = y(\mathbf{x}, \mathbf{w}) + \epsilon, \qquad \text{where } p(\epsilon \mid \sigma^2) = N(\epsilon \mid 0, \sigma^2)$$
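A small simulation of this noise model (a sketch added here, not from the slides; the "true" weights and noise level are made-up example values):

```python
import numpy as np

rng = np.random.default_rng(42)
w_true = np.array([2.0, -1.0])       # assumed "true" weights for the simulation
sigma = 0.5                           # standard deviation of the Gaussian noise

X = rng.normal(size=(100, 2))         # 100 input vectors
y = X @ w_true                        # deterministic predictions y(x, w)
t = y + rng.normal(0.0, sigma, 100)   # observed targets: prediction + noise
```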
Curve Fitting With Noise
The Gaussian Distribution
Meet the exponential family
• A common way to define a probability density p(x) is as an exponential function of x.
• Simple mathematical motivation: multiplying numbers between 0 and 1 yields a number between 0 and 1.
  • E.g. (1/2)^n, (1/e)^x.
• Deeper mathematical motivation: exponential pdfs have good statistical properties for learning.
  • E.g., conjugate prior, maximum likelihood, sufficient statistics.
Reading exponential prob formulas
• Suppose there is a relevant feature f(x) and I want to
express that “the greater f(x) is, the less probable x
is”.
• f(x), p(x)
• Use p(x) = α exp(-f(x)).
20/13
Example: exponential form sample size
• Fair coin: the larger the sample size, the less likely the sample is.
• p(n) = 2^{-n}.
[Plot: ln p(n) against sample size n.]
Location Parameter
• The further x is from the center μ, the less likely it is.
[Plot: ln p(x) decreases with the squared distance (x - μ)².]
Spread/Precision parameter
• The greater the spread σ², the more likely x is (away from the mean).
• The greater the precision β = 1/σ², the less likely x is.
[Plot: ln p(x) as a function of x for different precisions β.]
Minimal energy = max probability
• The greater the energy (of the joint state), the less
probable the state is.
[Plot: ln p(x) decreases with the energy E(x).]
Normalization
• Let p*(x) be an unnormalized density function.
• To make a probability density function, we need to find a normalization constant α s.t.

$$\int_{-\infty}^{\infty} \alpha\, p^*(x)\, dx = 1.$$

• Therefore

$$\alpha = \frac{1}{\int_{-\infty}^{\infty} p^*(x)\, dx}.$$

• For the Gaussian (Laplace 1782),

$$\int_{-\infty}^{\infty} \exp\!\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) dx = \sqrt{2\pi\sigma^2}.$$
Central Limit Theorem
The distribution of the sum of N i.i.d. random variables
becomes increasingly Gaussian as N grows.
• Laplace (1810).
Example: N uniform [0,1] random variables.
Gaussian Likelihood Function
• Exercise: Assume a Gaussian noise model, so the likelihood function becomes (Eq. 3.10 in Bishop)

$$p(\mathbf{t} \mid X, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2\sigma^2}\,(t_n - \mathbf{w} \cdot \mathbf{x}_n)^2\right)$$

• Show that the maximum likelihood solution minimizes the sum-of-squares error:

$$E_X(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \{t_n - \mathbf{x}_n \cdot \mathbf{w}\}^2$$
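A worked version of this exercise (the standard derivation, not spelled out on the slide): taking the negative log of the likelihood,

```latex
\begin{align*}
-\ln p(\mathbf{t} \mid X, \mathbf{w}, \sigma^2)
  &= \sum_{n=1}^{N}\left[\frac{1}{2\sigma^2}(t_n - \mathbf{w}\cdot\mathbf{x}_n)^2
     + \frac{1}{2}\ln(2\pi\sigma^2)\right] \\
  &= \frac{1}{\sigma^2}\, E_X(\mathbf{w}) + \frac{N}{2}\ln(2\pi\sigma^2).
\end{align*}
```

Since σ² does not depend on w, maximizing the likelihood over w is the same as minimizing E_X(w).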
Regression With Basis Functions
Nonlinear Features
• We can increase the power of linear regression by using functions of the input features instead of the input features themselves.
• These are called basis functions.
• Linear regression can then be used to assign weights to the basis functions.
Linear Basis Function Models (1)
• Generally,

$$y(\mathbf{x}, \mathbf{w}) = \sum_{j} w_j\, \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$

where the φ_j(x) are known as basis functions.
• Typically, φ_0(x) = 1, so that w_0 acts as a bias.
• In the simplest case, we use linear basis functions: φ_d(x) = x_d.
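A small sketch (added for illustration, not from the slides) that builds design matrices for polynomial and Gaussian basis functions and reuses ordinary least squares on the transformed features; the centres and width s are arbitrary example values:

```python
import numpy as np

def polynomial_design(x, degree):
    """Columns phi_j(x) = x**j for j = 0..degree (phi_0 = 1 gives the bias)."""
    return np.vander(x, degree + 1, increasing=True)

def gaussian_design(x, centres, s):
    """Columns phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a bias column."""
    phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((x.size, 1)), phi])

# Fit a noisy sinusoid with 9 Gaussian basis functions (plus bias).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)

Phi = gaussian_design(x, centres=np.linspace(0, 1, 9), s=0.15)
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # same least-squares solution as before
```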
Linear Basis Function Models (2)
Polynomial basis functions: φ_j(x) = x^j.
These are global, the same for all input vectors.
Linear Basis Function Models (3)
Gaussian basis functions:

$$\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$$

These are local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (width).
Related to kernel methods.
Linear Basis Function Models (4)
Sigmoidal basis functions:

$$\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right), \qquad \text{where } \sigma(a) = \frac{1}{1 + \exp(-a)}.$$

These are also local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (slope).
Basis Function Example
Transformation
Limitations of Fixed Basis Functions
• M basis functions along each dimension of a D-dimensional input space require M^D basis functions: the curse of dimensionality.
• In later chapters, we shall see how we can get away with fewer basis functions, by choosing these using the training data.
Overfitting and Regularization
Polynomial Curve Fitting
0th Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Over-fitting
Root-Mean-Square (RMS) Error: $E_{RMS} = \sqrt{2E(\mathbf{w})/N}$
Polynomial Coefficients
Data Set Size: 9th Order Polynomial
1st Order Polynomial
Data Set Size: 9th Order Polynomial
Quadratic Regularization
• Penalize large coefficient values
Regularization:
Regularization:
Regularization: vs.
Regularized Least Squares (1)
• Consider the error function: data term + regularization term,

$$E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$

• With the sum-of-squares error function and a quadratic regularizer, we get

$$\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{x}_n \cdot \mathbf{w}\}^2 + \frac{\lambda}{2}\,\mathbf{w}^T\mathbf{w}$$

• which is minimized by

$$\mathbf{w} = (\lambda I + X^T X)^{-1} X^T \mathbf{t}$$

λ is called the regularization coefficient.
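A NumPy sketch of the regularized (ridge) solution mirroring the closed form above (an illustration added here, not from the slides; lam = 0 recovers ordinary least squares):

```python
import numpy as np

def ridge_regression(X, t, lam):
    """Minimize 1/2 * ||Xw - t||^2 + lam/2 * ||w||^2.
    Closed form: w = (lam * I + X^T X)^{-1} X^T t."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ t)
```

For λ > 0 the matrix λI + X^T X is always invertible, even when X^T X is singular, which is one practical reason to regularize.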
Regularized Least Squares (2)
• With a more general regularizer, we have

$$\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{x}_n \cdot \mathbf{w}\}^2 + \frac{\lambda}{2}\sum_{j} |w_j|^q$$

where q = 1 gives the lasso and q = 2 the quadratic regularizer.
Regularized Least Squares (3)
The lasso tends to generate sparser solutions than a quadratic regularizer.
Evaluating Classifiers and Parameters
Cross-Validation
Evaluating Learners on Validation Set
[Diagram: the Learner is trained on the Training Data to produce a Model, which is evaluated on the Validation Set.]
What if there is no validation set?
• What does training error tell me about the generalization performance on a hypothetical validation set?
• Scenario 1: You run a big pharmaceutical company. Your new drug looks pretty good on the trials you've done so far. The government tells you to test it on another 10,000 patients.
• Scenario 2: Your friendly machine learning instructor provides you with another validation set.
• What if you can't get more validation data?
Examples from Andrew Moore
• Andrew Moore's slides
Cross-Validation for Evaluating Learners
Cross-validation to estimate the performance of a
learner on future unseen data
Learn on 3 folds, test on 1
Learn on 3 folds, test on 1
Learn on 3 folds, test on 1
Learn on 3 folds, test on 1
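A minimal sketch of k-fold cross-validation for estimating a learner's error (an illustration added here, not from the slides; the hypothetical `fit` argument returns a weight vector, and mean squared error is used as the test error):

```python
import numpy as np

def cross_val_error(X, t, fit, k=4):
    """Estimate generalization error by k-fold cross-validation:
    train on k-1 folds, test on the held-out fold, average the test errors."""
    N = X.shape[0]
    folds = np.array_split(np.random.permutation(N), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit(X[train], t[train])                    # learn on k-1 folds
        err = np.mean((X[test] @ w - t[test]) ** 2)    # test on the held-out fold
        errors.append(err)
    return np.mean(errors)
```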
Cross-Validation for Meta Methods
• If the learner requires setting a parameter, we can evaluate different parameter settings against the data using training error or cross-validation.
• Then cross-validation is part of learning.
[Diagram: Training Data → Learner(λ) → Model(λ).]
Cross-Validation for Evaluating a Parameter Value
Cross-validation for a fixed λ (e.g. λ = 1):
Learn with λ on 3 folds, test on 1
Learn with λ on 3 folds, test on 1
Learn with λ on 3 folds, test on 1
Learn with λ on 3 folds, test on 1
The average error over all 4 runs is the cross-validation estimate of the error for this λ value.
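As a usage sketch (reusing the hypothetical `cross_val_error` and `ridge_regression` helpers from the earlier illustrations, and assuming X and t are defined), choosing λ amounts to computing the cross-validation error for each candidate and keeping the best:

```python
# Pick the regularization coefficient by cross-validation.
candidate_lams = [0.0, 0.01, 0.1, 1.0, 10.0]
cv_errors = [cross_val_error(X, t, lambda Xtr, ttr, lam=lam: ridge_regression(Xtr, ttr, lam))
             for lam in candidate_lams]
best_lam = candidate_lams[int(np.argmin(cv_errors))]
```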
Stopping Criterion
for any type of validation (hold-out set validation, cross-validation)
Bayesian Linear Regression
Bayesian Linear Regression (1)
• Define a conjugate shrinkage prior over the weight vector w:

$$p(\mathbf{w} \mid \sigma^2) = N(\mathbf{w} \mid \mathbf{0}, \sigma^2 I)$$

• Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives a posterior distribution.
• Log of the posterior = sum of squared errors + quadratic regularization.
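A sketch of the posterior computation using the standard conjugate-Gaussian formulas (an illustration added here, not from the slides; `prior_var` and `noise_var` are the prior and noise variances):

```python
import numpy as np

def posterior(X, t, prior_var, noise_var):
    """Gaussian posterior over w for the prior N(w | 0, prior_var * I)
    and Gaussian noise with variance noise_var.
    Returns the posterior mean and covariance."""
    D = X.shape[1]
    S_inv = np.eye(D) / prior_var + (X.T @ X) / noise_var  # posterior precision
    S = np.linalg.inv(S_inv)                                # posterior covariance
    m = S @ X.T @ t / noise_var                             # posterior mean
    return m, S
```

The posterior mean equals the regularized least-squares solution with λ = noise_var / prior_var, matching the "sum of squared errors + quadratic regularization" reading of the log posterior.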
Bayesian Linear Regression (3)
[Figure: prior and data space with 0 data points observed.]
Bayesian Linear Regression (4)
[Figure: likelihood, posterior, and data space with 1 data point observed.]
Bayesian Linear Regression (5)
[Figure: likelihood, posterior, and data space with 2 data points observed.]
Bayesian Linear Regression (6)
[Figure: likelihood, posterior, and data space with 20 data points observed.]
Predictive Distribution (1)
• Predict t for new values of x by integrating
over w.
• Can be solved analytically.
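For reference, the analytic result is the standard conjugate-Gaussian predictive distribution (as in Bishop; m_N and S_N denote the posterior mean and covariance of w, β the noise precision, and φ(x) the basis-function vector):

```latex
p(t \mid x, \mathbf{t}) = N\!\big(t \,\big|\, \mathbf{m}_N^{\top}\boldsymbol{\phi}(x),\; \sigma_N^2(x)\big),
\qquad
\sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^{\top} S_N\, \boldsymbol{\phi}(x).
```

The first term of σ_N²(x) is the noise variance; the second reflects the remaining uncertainty about w and shrinks as more data points are observed.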
Predictive Distribution (2)
• Example: Sinusoidal data, 9 Gaussian basis functions, 1 data point.
Predictive Distribution (3)
• Example: Sinusoidal data, 9 Gaussian basis functions, 2 data points.
Predictive Distribution (4)
• Example: Sinusoidal data, 9 Gaussian basis functions, 4 data points.
Predictive Distribution (5)
• Example: Sinusoidal data, 9 Gaussian basis functions, 25 data points.