Transcript Document

CERAM
February-March-April 2008
Class 4
Ordinary Least Squares
Lionel Nesta
Observatoire Français des Conjonctures Economiques
[email protected]
Introduction to Regression
- Ideally, the social scientist wants to know not only the
intensity of a relationship, but also by how much one variable
changes when another variable changes by one unit.
- Regression analysis is a technique that examines the relation of
a dependent variable to one or more independent (explanatory) variables.
   - Simple regression: y = f(X)
   - Multiple regression: y = f(X, Z)
- Let us start with simple regressions.
Scatter Plot of Fertilizer and Production
[Figure: scatter plot of fertilizer and production with the fitted line. For each observation, the prediction is $\hat{Y}_i$ and the error is $Y_i - \hat{Y}_i$.]
Objective of Regression

It is time to ask: “What is a good fit?”

“A good fit is what makes the error small”

“The best fit is what makes the error smallest”

Three candidates
1. To minimize the sum of all errors
2. To minimize the sum of absolute values of errors
3. To minimize the sum of squared errors
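To minimize the sum of all errors
$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)$
Problem of sign: positive and negative errors cancel out.
[Figure: two plots of Y against X, with positive (+) and negative (-) errors around the fitted line]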
To minimize the sum of absolute values of errors
$\min \sum_{i=1}^{n} |y_i - \hat{y}_i|$
Problem of middle point
[Figure: two plots of Y against X; the errors marked are +2, -1, +3 and -1]
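To minimize the sum of squared errors
$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Solves both problems
[Figure: plot of Y against X, with positive (+) and negative (-) errors around the fitted line]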
To minimize the sum of squared errors
$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \;\Rightarrow\; \min \sum_{i=1}^{n} \varepsilon_i^2$
- Overcomes the sign problem
- Goes through the middle point
- Squaring emphasizes large errors
- Easily manageable
- Has a unique minimum
- Has a unique, and best, solution
[Figure: ε² plotted against ε]
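The three candidate criteria can be compared on a small example. This is a minimal sketch on made-up data (the numbers and the two candidate lines are illustrative assumptions, not taken from the slides). It shows the sign problem: the plain sum of errors is zero both for a line that fits perfectly and for a flat line that ignores x, whereas the sum of squared errors separates the two.

```python
# Hypothetical data: y rises by exactly 2 for each unit of x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]


def errors(alpha, beta):
    """Errors y_i - y_hat_i for the candidate line y_hat = alpha + beta * x."""
    return [y - (alpha + beta * x) for x, y in zip(xs, ys)]


def criteria(alpha, beta):
    """Evaluate the three candidate criteria for one candidate line."""
    e = errors(alpha, beta)
    return {
        "sum": sum(e),                         # candidate 1
        "sum_abs": sum(abs(v) for v in e),     # candidate 2
        "sum_sq": sum(v * v for v in e),       # candidate 3
    }


good_line = criteria(0.0, 2.0)  # passes through every point
flat_line = criteria(6.0, 0.0)  # horizontal line at the mean of y

print("good line:", good_line)
print("flat line:", flat_line)
```

Candidate 1 cannot tell the two lines apart, which is the "problem of sign" named on the slide.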
[Figure: scatter plot of fertilizer and production]
[Figure: scatter plot of R&D and patents (log)]
The Simple Regression Model
$y_i = \alpha + \beta x_i + \varepsilon_i$
$E(y_i) = \alpha + \beta x_i$
$y_i$: dependent variable (to be explained)
$x_i$: independent variable (explanatory)
$\alpha$: first parameter of interest
$\beta$: second parameter of interest
$\varepsilon_i$: error term
The Simple Regression Model
$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$
$\hat{\alpha}$ and $\hat{\beta}$ are estimates of
the true, but unknown, $\alpha$ and $\beta$.
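To minimize the sum of squared errors
$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \;\Rightarrow\; \min \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2$
$\frac{\partial \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2}{\partial \hat{\alpha}} = 0$
$\frac{\partial \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2}{\partial \hat{\beta}} = 0$
[Figure: ε² plotted against ε]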
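To minimize the sum of squared errors
$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$\hat{\beta} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$
[Figure: ε² plotted against ε]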
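The step from the minimization problem to the two estimators is not spelled out on the slides. A sketch of the usual textbook derivation is given here, writing $S$ for the sum of squared errors (the symbol $S$ is introduced only for this sketch):

```latex
\begin{align*}
S(\hat{\alpha}, \hat{\beta}) &= \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2 \\
\frac{\partial S}{\partial \hat{\alpha}} &= -2 \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0
  \;\Rightarrow\; \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} \\
\frac{\partial S}{\partial \hat{\beta}} &= -2 \sum_{i=1}^{n} x_i (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0 \\
\text{Substituting } \hat{\alpha}: \quad
  & \sum_{i=1}^{n} x_i \left[ (y_i - \bar{y}) - \hat{\beta} (x_i - \bar{x}) \right] = 0 \\
\hat{\beta} &= \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}
  = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```

The last equality holds because $\sum_i \bar{x}(y_i - \bar{y}) = 0$ and $\sum_i \bar{x}(x_i - \bar{x}) = 0$.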
Application to CERAM_BIO Data using Excel

lnpat_assets  lnrd_assets  Deviation to the mean  Deviation to the mean  Numerator  Denominator
(y)           (x)          (y)                    (x)                    Beta_hat   Beta_hat
-12.77        -2.28        -0.61                   0.01                  -0.01      0.00
-12.51        -2.24        -0.35                   0.05                  -0.02      0.00
-12.74        -2.20        -0.58                   0.09                  -0.05      0.01
-12.52        -2.31        -0.36                  -0.02                   0.01      0.00
-12.12        -2.25         0.04                   0.04                   0.00      0.00
-12.53        -2.26        -0.37                   0.03                  -0.01      0.00
-12.09        -2.25         0.07                   0.04                   0.00      0.00

Mean of y: -12.16          Mean of x: -2.29
Sum (numerator): 448.75    Sum (denominator): 256.55
Alpha_hat: -8.148          Beta_hat: 1.749
(Only the first rows are displayed; the means and sums are computed over the whole sample.)
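The spreadsheet steps (deviations from the means, products, squares, sums, then the ratio) can be sketched in Python. The function name `ols_simple` and the small check data are illustrative assumptions; the only figures taken from the slide are the two sums, 448.75 and 256.55.

```python
def ols_simple(x, y):
    """Return (alpha_hat, beta_hat) for y = alpha + beta * x by least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Deviations from the means, as in the spreadsheet columns.
    dev_x = [xi - mean_x for xi in x]
    dev_y = [yi - mean_y for yi in y]
    # Numerator and denominator of beta_hat.
    numerator = sum(dy * dx for dy, dx in zip(dev_y, dev_x))
    denominator = sum(dx * dx for dx in dev_x)
    beta_hat = numerator / denominator
    alpha_hat = mean_y - beta_hat * mean_x
    return alpha_hat, beta_hat


# Check on made-up data lying exactly on the line y = 1 + 2x.
alpha_hat, beta_hat = ols_simple([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(alpha_hat, beta_hat)

# Ratio of the two sums reported on the slide (whole sample).
beta_from_slide = 448.75 / 256.55
print(round(beta_from_slide, 3))
```

The ratio of the two reported sums reproduces the Beta_hat shown on the slide to three decimals.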
$\ln\left(\frac{\text{Patent}}{\text{Assets}}\right) = -8.148 + 1.748 \ln\left(\frac{\text{R\&D}}{\text{Assets}}\right) + \varepsilon_i$
Interpretation
$\ln\left(\frac{\text{Patent}}{\text{Assets}}\right) = -8.148 + 1.748 \ln\left(\frac{\text{R\&D}}{\text{Assets}}\right) + \varepsilon_i$
- When the log of R&D (per asset) increases by one
unit, the log of patents per asset increases by 1.748.
- Remember! A change in the log of x is a relative change
of x itself.
- A 1% increase in R&D (per asset) entails an increase of
about 1.748% in the number of patents (per asset).
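The reading of the slope as a percentage change is an approximation that works well for small changes. A minimal sketch, using only the slope 1.748 reported above, compares the exact effect of a 1% increase in R&D per asset with the rule of thumb:

```python
import math

beta_hat = 1.748  # slope of the log-log regression reported above

# Exact relative change in patents per asset when R&D per asset rises by 1%:
# ln(y1 / y0) = beta_hat * ln(1.01), so y1 / y0 = exp(beta_hat * ln(1.01)).
exact_pct = (math.exp(beta_hat * math.log(1.01)) - 1.0) * 100.0

# Rule of thumb: a 1% change in x goes with about beta_hat % change in y.
approx_pct = beta_hat * 1.0

print(exact_pct, approx_pct)
```

For a 1% change the two figures differ by less than one hundredth of a percentage point; the gap widens for larger changes in x.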
Application to Data using SPSS
Analyze → Regression → Linear (Analyse → Régression → Linéaire in the French interface)

Coefficients (a)
Model 1        Unstandardized coefficients   Standardized coefficients   t          Sig.
               B           Std. error        Beta
(Constant)     -8.151      .244                                          -33.392    .000
lnrd_assets     1.748      .101              .642                         17.323    .000
a. Dependent variable: lnpat_assets
Assessing the Goodness of Fit
- It is important to ask whether a specification provides
a good prediction of the dependent variable, given
values of the independent variable.
- Ideally, we want an indicator of the proportion of the
variance of the dependent variable that is accounted
for, or explained, by the statistical model.
- To build it, we use the variance of the predictions (ŷ) and the
variance of the residuals (ε): by construction, the two
sum to the overall variance of the dependent variable (y).
[Figure: Overall variance]
[Figure: Decomposing the overall variance (1)]
[Figure: Decomposing the overall variance (2)]
Coefficient of determination R²
- R² is a statistic which provides information on the
goodness of fit of the model.
$SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
$SS_{fit} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
$SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$SS_{tot} = SS_{fit} + SS_{res}$
$R^2 = \frac{SS_{fit}}{SS_{tot}}$
$0 \le R^2 \le 1$
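The decomposition can be checked numerically. The sketch below fits a simple regression on made-up data (the five observations and the function name `variance_decomposition` are illustrative assumptions) and verifies that the fitted and residual sums of squares add up to the total.

```python
def variance_decomposition(x, y):
    """Fit y = alpha + beta * x by least squares and split the sum of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    beta_hat = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
                / sum((xi - mean_x) ** 2 for xi in x))
    alpha_hat = mean_y - beta_hat * mean_x
    y_hat = [alpha_hat + beta_hat * xi for xi in x]
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)                # overall
    ss_fit = sum((yh - mean_y) ** 2 for yh in y_hat)            # explained
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # residual
    return ss_tot, ss_fit, ss_res


# Hypothetical, nearly linear data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ss_tot, ss_fit, ss_res = variance_decomposition(x, y)
r2 = ss_fit / ss_tot
print(ss_tot, ss_fit + ss_res, r2)
```

The identity SS_tot = SS_fit + SS_res relies on the line being the least squares line with a constant; for an arbitrary line it does not hold in general.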
Fisher’s F Statistics
- Fisher’s F statistic is relevant as a form of ANOVA on SS_fit:
it tells us whether the regression model brings
significant (in a statistical sense) information.

Model      SS                               df          MSS
(1)        (2)                              (3)         (2)/(3)
Fitted     $\sum (\hat{y}_i - \bar{y})^2$   p           MSS_fit
Residual   $\sum (y_i - \hat{y}_i)^2$       N - p - 1   MSS_res
Total      $\sum (y_i - \bar{y})^2$         N - 1

$F = \frac{MSS_{fit}}{MSS_{res}}$
p: number of parameters (explanatory variables, not counting the constant)
N: number of observations
Application to Data using SPSS
Analyze → Regression → Linear (Analyse → Régression → Linéaire in the French interface)

Model Summary
Model   R          R Square   Adjusted R Square   Std. error of the estimate
1       .642(a)    .412       .410                1.61647
a. Predictors: (Constant), lnrd_assets

ANOVA (b)
Model 1       Sum of squares   df     Mean square   F          Sig.
Regression    784.132          1      784.132       300.090    .000(a)
Residual      1120.970         429    2.613
Total         1905.102         430
a. Predictors: (Constant), lnrd_assets
b. Dependent variable: lnpat_assets
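The figures in the SPSS output can be tied back to the formulas for R² and F. The sketch below uses only the sums of squares and degrees of freedom reported in the ANOVA table; variable names are illustrative.

```python
# Sums of squares and degrees of freedom as reported in the ANOVA table.
ss_fit = 784.132     # regression
ss_res = 1120.970    # residual
ss_tot = 1905.102    # total
df_fit = 1           # p = 1 explanatory variable
df_res = 429         # N - p - 1, hence N = 431 observations

r2 = ss_fit / ss_tot                    # coefficient of determination
mss_fit = ss_fit / df_fit               # mean square, fitted
mss_res = ss_res / df_res               # mean square, residual
f_stat = mss_fit / mss_res              # Fisher's F
std_error_estimate = mss_res ** 0.5     # standard error of the estimate

print(round(r2, 3), round(mss_res, 3), round(f_stat, 2),
      round(std_error_estimate, 3))
```

Up to rounding, these reproduce the R Square, the residual mean square, the F value and the standard error of the estimate in the two tables.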
What the R² is not
R² does not tell us whether:
- the independent variables are a true cause of the
changes in the dependent variable
- the correct regression was used
- the most appropriate set of independent variables
has been chosen
- there is collinearity present in the data
- the model could be improved by using transformed
versions of the existing set of independent variables
Inference on β
- We have estimated $E(y_i) = \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$
   - If $\hat{\beta} = 0$, $E(y_i) = \hat{\alpha}$
   - If $\hat{\beta} \neq 0$, $E(y_i) = \hat{\alpha} + \hat{\beta} x_i$
- Therefore we must test whether the estimated
parameter is significantly different from 0. To do so,
we must say something about the
distribution (the mean and variance) of the estimator $\hat{\beta}$ of the true but
unobserved β*.
The mean and variance of β
- It is possible to show that $\hat{\beta}$ is a good approximation,
i.e. an unbiased estimator, of the true parameter β*.
$E(\hat{\beta}) = \beta^*$
- The variance of $\hat{\beta}$ is defined as the ratio of the mean
square of errors over the sum of squares of the
explanatory variable (in deviation from its mean):
$VAR(\hat{\beta}) = \frac{\hat{\sigma}_\varepsilon^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
where $\hat{\sigma}_\varepsilon^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 1 - 1}$
The confidence interval of β
- We must now define the confidence interval of β, at
95%. To do so, we use the mean and variance of $\hat{\beta}$
and define the t value as follows: $t = \frac{\hat{\beta} - \beta^*}{s_{\hat{\beta}}}$
- Therefore, the 95% confidence interval of β is:
$\beta^* = \hat{\beta} \pm t_{.025} \frac{\hat{\sigma}_\varepsilon}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$
If the 95% CI does not include 0, then β is
significantly different from 0.
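The standard error and the interval can be reproduced from figures already reported: the residual mean square from the SPSS ANOVA table (2.613) and the sum of squared deviations of x from the Excel application (256.55). This is a sketch under one stated assumption: with 429 degrees of freedom, the Student critical value is replaced by the normal quantile (about 1.96), which Python's standard library provides.

```python
from statistics import NormalDist

beta_hat = 1.748     # estimated slope
mss_res = 2.613      # mean square of errors (residual mean square)
sum_sq_x = 256.55    # sum of (x_i - mean of x)^2

# Variance and standard error of beta_hat.
var_beta = mss_res / sum_sq_x
se_beta = var_beta ** 0.5

# Assumption: normal quantile used in place of Student's t (large sample).
t_crit = NormalDist().inv_cdf(0.975)

ci_low = beta_hat - t_crit * se_beta
ci_high = beta_hat + t_crit * se_beta

print(round(se_beta, 3), round(ci_low, 3), round(ci_high, 3))
```

The standard error agrees with the value SPSS reports for lnrd_assets (.101), and the interval lies entirely above 0.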
Student t Test for β
- We are also in the position to infer on β
   - H0: β* = 0
   - H1: β* ≠ 0
$t = \frac{\hat{\beta} - \beta^*}{s_{\hat{\beta}}} = \frac{\hat{\beta}}{s_{\hat{\beta}}}$ under H0
Rule of decision
Accept H0 if | t | < t_{α/2}
Reject H0 if | t | ≥ t_{α/2}
Application to Data using SPSS
Analyze → Regression → Linear (Analyse → Régression → Linéaire in the French interface)

Coefficients (a)
Model 1        Unstandardized coefficients   Standardized coefficients   t          Sig.
               B           Std. error        Beta
(Constant)     -8.151      .244                                          -33.392    .000
lnrd_assets     1.748      .101              .642                         17.323    .000
a. Dependent variable: lnpat_assets
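The t column of the coefficients table is the ratio of each coefficient to its standard error. The sketch below recomputes it from the rounded B and standard error values in the table, so the results differ slightly from the t values SPSS prints. As an assumption, the critical value is the normal quantile (about 1.96) rather than the Student value for 429 degrees of freedom.

```python
from statistics import NormalDist

# (B, standard error) as reported in the coefficients table.
coefficients = {
    "(constant)": (-8.151, 0.244),
    "lnrd_assets": (1.748, 0.101),
}

# Assumption: large-sample normal approximation to t at alpha/2 = .025.
t_crit = NormalDist().inv_cdf(0.975)

t_values = {}
decisions = {}
for name, (b, se) in coefficients.items():
    t = b / se  # t statistic under H0: true parameter = 0
    t_values[name] = t
    decisions[name] = "reject H0" if abs(t) >= t_crit else "accept H0"
    print(name, round(t, 2), decisions[name])
```

Both coefficients are far beyond the critical value, which is consistent with the .000 significance levels in the table.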
Assignments on CERAM_BIO
- Regress the number of patents on R&D expenses
and consider:
   1. The quality of the fit
   2. The significance and direction of R&D expenses
   3. The interpretation of the result in an economic sense
- Repeat steps 1 to 3 using:
   - R&D expenses divided by one million (you need to
   generate a new variable for that)
   - The log of R&D expenses
- What do you observe? Why?