Warsaw Summer School 2015, OSU Study Abroad
Program
Regression
Linear Relationship
The line = a mathematical function that can be expressed
through the formula Y = a + bX, where Y & X are our
variables.
Y, the dependent variable, is expressed as a linear function of
the independent (explanatory) variable X.
Linear Relationship
The constant a = value of Y at the point where the line
Y = a + bX intersects the Y-axis (also called the intercept).
The slope b equals the change in Y for a one-unit increase in
X (a one-unit increase in X corresponds to a change of b units
in Y). The slope describes the rate of change in Y-values as
X increases.
Verbal interpretation of the slope of the line:
“Rise over run”: the rise divided by the run (the change in
the vertical distance is divided by the change in the horizontal
distance).
Cartesian Coordinate System
Variables X, Y and their linear function:
The formula Y = a + bX expresses the dependent (response)
variable Y as a linear function of the independent
(explanatory) variable X. The formula maps out a straight-line
graph with slope b and Y-intercept a.
Basics
Linear Relationship: Y = a + bX
The constant a is the value of Y when X = 0.
For X = 0 we have: Y = a + b*0 = a
The constant a is the value of Y where the line Y = a + bX
intersects the Y-axis.
The slope b equals the change in Y for a one-unit increase in
X. This means that a one-unit increase in X corresponds to a
change of b units in Y. Thus, the slope describes the rate
of change in the Y-values as X increases. Generally, for X ≠ 0,
b = (Y - a) / X
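The roles of a and b can be checked numerically. The sketch below uses made-up values a = 2.0 and b = 0.5 (they are assumptions for illustration, not values from the lecture) and confirms that the line returns a at X = 0, that a one-unit step in X changes Y by b, and that b can be recovered as (Y - a) / X when X is not 0.

```python
# Hypothetical intercept and slope, chosen only for illustration.
a, b = 2.0, 0.5


def line(x):
    """Return Y = a + bX for the illustrative a and b above."""
    return a + b * x


# At X = 0 the line returns the intercept a.
intercept = line(0)

# A one-unit increase in X changes Y by b ("rise over run" with run = 1).
rise = line(4) - line(3)

# For any X other than 0, b can be recovered as (Y - a) / X.
recovered_b = (line(10) - a) / 10

print(intercept, rise, recovered_b)
```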
Model vs Reality
The function Y = a + bX is a model.
In reality, the observed points (X, Y) do not all fall on one line.
The Scatter-gram and the Least Squares Method
The graphical plot of observed values (X,Y) is called a
- scatter-gram
- scatter-diagram
- scatter-plot.
A regression function is a function that describes how the
expected value of the dependent (response) variable Y
changes according to the values of an independent
(explanatory) variable X.
Regression
This expected value is estimated by a linear function:
• Ŷ = a + bX
Ŷ = predicted value for the dependent variable, Y
a = the intercept (the value of Y when X = 0)
b = the regression coefficient (the slope), indicating the
amount of change in Y given a unit change in X
X = the independent variable
Regression
Ŷ = a + bX
b = [Σ(X - X̄)(Y - Ȳ)] / Σ(X - X̄)²
a = Ȳ - b*X̄
where X̄ and Ȳ are the means of X and Y
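The two estimation formulas can be worked through on a tiny data set. This is a minimal sketch on five made-up (X, Y) pairs, which are an assumption for illustration and not data from the course. The slope is the sum of cross-products of deviations divided by the sum of squared deviations of X, and the intercept is the mean of Y minus b times the mean of X.

```python
# Illustrative data (hypothetical, not from the lecture).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Slope: sum of cross-products of deviations over sum of squared X deviations.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
     / sum((x - mean_x) ** 2 for x in X))

# Intercept: mean of Y minus b times mean of X.
a = mean_y - b * mean_x

# Predicted values on the fitted line.
predicted = [a + b * x for x in X]

print(round(b, 3), round(a, 3))
```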
Method of Least Squares
The prediction errors, called residuals, are defined as the
differences between observed and predicted values of Y:
E = Y - (a + bX) = Y - Ŷ
The regression line minimizes the sum of squared errors:
SSE = Σ(Y - Ŷ)²
Method of Least Squares
The method of least squares provides the prediction equation
Ŷ = a + bX having the minimal value of SSE.
The least squares estimates a and b are the values determining
the prediction equation for which the sum of squared
errors SSE is a minimum.
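One way to see the "minimum SSE" property is to compute SSE at the least squares estimates and then at nearby values of a and b. The sketch below does this on the same kind of made-up data. The step size 0.1 and the names a_hat and b_hat are arbitrary choices for illustration.

```python
# Illustrative data (hypothetical, not from the lecture).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]


def sse(a, b):
    """Sum of squared errors for the candidate line a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))


n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n
b_hat = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
         / sum((x - mean_x) ** 2 for x in X))
a_hat = mean_y - b_hat * mean_x
best = sse(a_hat, b_hat)

# Nudge a and b in every direction and record the resulting SSE.
nudges = (-0.1, 0.0, 0.1)
others = [sse(a_hat + da, b_hat + db)
          for da in nudges for db in nudges if (da, db) != (0.0, 0.0)]

print(round(best, 3), all(s > best for s in others))
```

Every nudged line has a larger SSE than the least squares line, which is what "minimum" means here.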
Covariance
In regression analysis we ask: to what extent can we
predict Y knowing our variable X? Prediction means that
values of X and Y go together, or co-vary.
Covariation is measured by the sum of products, or SP (the
covariance is SP divided by n - 1):
• SP = Σ (X - X̄) (Y - Ȳ)
Sum of squares for X:
• SSx = Σ (X - X̄)²
Note that in the regression equation of Y on X
• Ŷ = bX + a
• b = SP / SSx
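The link between the slope and covariation can be sketched as follows, again on made-up data. The code computes SP and SSx, forms b = SP / SSx, and checks that dividing both by n - 1 (which turns them into the sample covariance and the sample variance of X) leaves the slope unchanged. The cross-check uses `statistics.variance` from the Python standard library.

```python
from statistics import variance

# Illustrative data (hypothetical, not from the lecture).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n

# Sum of products of deviations and sum of squared deviations of X.
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
ss_x = sum((x - mean_x) ** 2 for x in X)
b = sp / ss_x

# Dividing both by n - 1 gives the sample covariance and variance of X.
covariance = sp / (n - 1)
var_x = ss_x / (n - 1)
b_from_cov = covariance / var_x

print(sp, ss_x, round(b, 3), round(b_from_cov, 3))
```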
Interpretation of b
The slope of the line, b, has the verbal interpretation “rise
over run”-- that is, the rise divided by the run. This means
that the change in the vertical distance is divided by the
change in the horizontal distance.
The steeper the hill, the larger the slope: you go “up”
more rapidly than you go over. The line can also have a
negative slope.
When the slope is negative, you are going “downhill” rather
than “uphill.”
• b > 0, positive relationship
• b < 0, negative relationship
• b = 0, no linear relationship
Unstandardized and standardized coefficients
If both variables, IV and DV, are expressed in z-scores, the
constant a is equal to zero.
We obtain Beta coefficients that tell us the following: how
much change, in standard deviation units, in the DV is
attributable to a change in the IV of one standard deviation.
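A small sketch of standardization on made-up data: the same least squares fit is run on the raw scores and on their z-scores. The helper names `fit` and `z_scores` are hypothetical, introduced only for this illustration. The constant of the standardized fit comes out as zero (up to rounding), and the Beta coefficient equals the unstandardized b rescaled by the ratio of the standard deviations.

```python
from statistics import mean, stdev

# Illustrative data (hypothetical, not from the lecture).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]


def fit(xs, ys):
    """Least squares intercept and slope for ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b


def z_scores(values):
    """Convert raw scores to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


a, b = fit(X, Y)                               # unstandardized
a_std, beta = fit(z_scores(X), z_scores(Y))    # standardized

# Beta can also be obtained by rescaling b with the two standard deviations.
rescaled = b * stdev(X) / stdev(Y)

print(round(a_std, 6), round(beta, 4), round(rescaled, 4))
```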
Two and more IVs
Ŷ = a + b1X1 + b2X2 + ... + b(k-1)X(k-1) + bkXk
Ŷ = β1X1 + β2X2 + ... + β(k-1)X(k-1) + βkXk
Coefficients and variables
The estimated parameters b1, b2, ..., bk are partial regression
coefficients. They differ from the regression
coefficients for the bivariate relationships between Y and each
explanatory variable.
Three criteria for choosing the number of independent (explanatory)
variables:
• (1) Theory
• (2) Parsimony
• (3) Sample size
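To illustrate why partial regression coefficients differ from bivariate ones, here is a minimal sketch with two correlated predictors. The data are made up, and Y is constructed as exactly 1 + 2*X1 + 3*X2 so the true partial slopes are known. The helper `sp` and the closed-form solution of the two-predictor normal equations are an illustrative choice, not a method taken from the lecture.

```python
# Hypothetical data: two correlated predictors.
X1 = [1, 2, 3, 4, 5]
X2 = [2, 1, 4, 3, 6]
# Y is constructed as exactly 1 + 2*X1 + 3*X2.
Y = [1 + 2 * x1 + 3 * x2 for x1, x2 in zip(X1, X2)]


def mean(values):
    return sum(values) / len(values)


def sp(u, v):
    """Sum of products of deviations of u and v from their means."""
    mu, mv = mean(u), mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v))


s11, s22, s12 = sp(X1, X1), sp(X2, X2), sp(X1, X2)
s1y, s2y = sp(X1, Y), sp(X2, Y)

# Partial slopes: solution of the normal equations for two predictors.
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
a = mean(Y) - b1 * mean(X1) - b2 * mean(X2)

# Bivariate slope of Y on X1 alone, ignoring X2.
b1_bivariate = s1y / s11

print(round(b1, 3), round(b2, 3), round(a, 3), round(b1_bivariate, 3))
```

The bivariate slope for X1 is larger than the partial slope because X1 also picks up part of the effect of the correlated X2.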
R²
Coefficient of determination (explained variance) for two
variables:
• r² = [SS(total) - SS(error)] / SS(total)
• Stata reports the value of the coefficient of determination:
• R² = [SS(total) - SS(error)] / SS(total)
Sum of squares
R² is the proportion of variance explained by X1, X2, ..., Xk.
Therefore, 1 - R² is the proportion of unexplained variance.
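The coefficient of determination can be computed directly from the two sums of squares. A minimal sketch for the one-predictor case, on made-up data (an assumption for illustration):

```python
# Illustrative data (hypothetical, not from the lecture).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
     / sum((x - mean_x) ** 2 for x in X))
a = mean_y - b * mean_x

# Total variation of Y around its mean.
ss_total = sum((y - mean_y) ** 2 for y in Y)
# Variation left over after prediction (squared residuals).
ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))

r_squared = (ss_total - ss_error) / ss_total
unexplained = 1 - r_squared

print(round(r_squared, 3), round(unexplained, 3))
```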
Adjusted R-square
• Adjusted R-square is a modification of R-square that
adjusts for the number of terms in a model. R-square
never decreases when a new term is added to a model, but
adjusted R-square increases only if the new term improves
the model more than would be expected by chance.
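The penalty can be sketched with the usual formula 1 - (1 - R²)(N - 1) / (N - k - 1). The function name `adjusted_r2` and the R-square values 0.60 and 0.61 below are hypothetical, chosen only to show a case where a second predictor barely raises R-square and adjusted R-square falls.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square for n cases and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical values: a second predictor lifts R-square from 0.60 to 0.61.
one_predictor = adjusted_r2(0.60, n=5, k=1)
two_predictors = adjusted_r2(0.61, n=5, k=2)

print(round(one_predictor, 4), round(two_predictors, 4))
```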
Sum of Squares
The Regression SUM of SQUARES is defined:
SS(regression) = SS(total) – SS(error)
Mean square
The Regression MEAN SQUARE
MSS(regression) = SS(regression) / df-v
df-v = k
where k is the number of independent variables
The MEAN SQUARE ERROR
MSS(error) = SS(error) / df-t
df-t = n - (k + 1), where n is the number of cases and k is
the number of independent variables.
F
The null hypothesis
H0: b1 = b2 = … = bk = 0
• F = MSS(model) / MSS(error)
The sampling distribution of this statistic is the
F-distribution
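The F ratio can be assembled step by step from the sums of squares and their degrees of freedom. The sketch below does so for one predictor (k = 1) on made-up data. It stops at the F value. Obtaining the p-value would require the F-distribution, which is not in the Python standard library, so it is left out here.

```python
# Illustrative data (hypothetical, not from the lecture).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
k = 1  # number of independent variables

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
     / sum((x - mean_x) ** 2 for x in X))
a = mean_y - b * mean_x

ss_total = sum((y - mean_y) ** 2 for y in Y)
ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
ss_regression = ss_total - ss_error

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / k
ms_error = ss_error / (n - (k + 1))

f_stat = ms_regression / ms_error

print(round(ms_regression, 3), round(ms_error, 3), round(f_stat, 3))
```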
t
The test of H0: bk = 0 evaluates whether Y and X are
statistically dependent, ignoring other variables.
We use the t statistic
• t = b / σb
where σb is the standard error of b. With one independent
variable,
• σb = √[SS(error) / (n - 2)] / √[Σ(X - X̄)²]
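A minimal sketch of the t statistic for the one-predictor case, on made-up data. It assumes the usual textbook standard error of the slope, the square root of SS(error) / (n - 2) divided by the square root of the sum of squared deviations of X. That formula is stated here as an assumption rather than taken from the slide. In this bivariate case the square of t equals the F statistic.

```python
from math import sqrt

# Illustrative data (hypothetical, not from the lecture).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n
ss_x = sum((x - mean_x) ** 2 for x in X)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / ss_x
a = mean_y - b * mean_x

ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))

# Assumed textbook standard error of the slope for one predictor.
se_b = sqrt(ss_error / (n - 2)) / sqrt(ss_x)
t_stat = b / se_b

print(round(se_b, 4), round(t_stat, 4), round(t_stat ** 2, 4))
```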
ANOVA
ANALYSIS OF VARIANCE
• How much of the variance is explained by values of the
nominal variable?
• Total sum of squared deviations from the mean:
• SS(total) = Σ [X - X̄(total)]²
ANOVA
The between-group variation represents the squared
deviations of every group mean from the total mean, summed
over all cases:
• SS(between) = Σ [X̄(group) - X̄(total)]²
The within-group sum of squares is the sum of squared deviations
of every raw score from its group mean:
• SS(within) = Σ [X - X̄(group)]²
ANOVA
Mean Squares:
• MSS(between) = SS(between) / df(between)
where df(between) = k - 1 (k is the number of groups)
• MSS(within) = SS(within) / df(within)
where df(within) = N - k (N is the number of cases)
F
F-statistic
• F = MSS(between) / MSS(within)
• The larger the F-value, the greater the impact of a group
on the dependent variable.
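The ANOVA quantities can be traced on a toy example. The three groups and their scores below are made up for illustration. The between-group sum is taken over all cases, so each group mean's squared deviation is counted once per group member.

```python
# Hypothetical scores for three groups of a nominal variable.
groups = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}

scores = [x for members in groups.values() for x in members]
grand_mean = sum(scores) / len(scores)

ss_total = sum((x - grand_mean) ** 2 for x in scores)

ss_between = 0.0
ss_within = 0.0
for members in groups.values():
    group_mean = sum(members) / len(members)
    # Each case contributes its group mean's squared deviation.
    ss_between += len(members) * (group_mean - grand_mean) ** 2
    # Each raw score's squared deviation from its own group mean.
    ss_within += sum((x - group_mean) ** 2 for x in members)

k = len(groups)   # number of groups
N = len(scores)   # number of cases

ms_between = ss_between / (k - 1)
ms_within = ss_within / (N - k)
f_stat = ms_between / ms_within

print(ss_between, ss_within, ss_total, f_stat)
```

Note that SS(between) + SS(within) reproduces SS(total) in this example.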
F
Compare:
• F = MSS(between) / MSS(within)
• F = MSS(regression) / MSS(error)
Regression ANOVA
Stata
ANOVA
• Source - Model, Residual, and Total. The Total variance is
partitioned into the variance which can be explained by the
independent variables (Model) and the variance which is
not explained by the independent variables (Residual,
sometimes called Error).
• SS - Sum of Squares associated with the three sources of
variance, Total, Model and Residual.
• df - Degrees of freedom associated with the sources of
variance.
The total variance has N-1 degrees of freedom.
The Model degrees of freedom = the number of estimated
coefficients, including the intercept, minus 1 (that is, k, the
number of independent variables).
The Residual degrees of freedom is the DF total minus the
DF model.
• MS - Mean Squares, the Sum of Squares divided by their
respective DF.
Regression
• Number of observations used in the regression analysis.
• The F-statistic is the Mean Square Model divided by the
Mean Square Residual. The numbers in parentheses are
the Model and Residual degrees of freedom.
• Prob > F - This is the p-value associated with the above
F-statistic. It is used in testing the null hypothesis that all of
the model coefficients are 0.
• R-squared - R-Squared is the proportion of variance in the
dependent variable which can be explained by the
independent variables.
• Adj R-squared - This is an adjustment of the R-squared
that penalizes the addition of extraneous predictors to the
model. Adjusted R-squared is computed using the formula
1 - (1 - Rsq)(N - 1) / (N - k - 1), where k is the number of
predictors.
• Root MSE - Root MSE is the estimated standard deviation of the
error term, and is the square root of the Mean Square
Residual (or Mean Square Error).
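The quantities listed above can be gathered in one place. The sketch below computes them in plain Python for a one-predictor regression on made-up data. The function name `regression_summary` and the dictionary keys are hypothetical labels for this illustration, and the code is not Stata's implementation.

```python
from math import sqrt


def regression_summary(X, Y):
    """Summary quantities for a one-predictor (k = 1) regression."""
    n, k = len(X), 1
    mean_x, mean_y = sum(X) / n, sum(Y) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
         / sum((x - mean_x) ** 2 for x in X))
    a = mean_y - b * mean_x

    ss_total = sum((y - mean_y) ** 2 for y in Y)
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
    ss_model = ss_total - ss_residual

    df_total = n - 1
    df_model = k
    df_residual = df_total - df_model

    ms_model = ss_model / df_model
    ms_residual = ss_residual / df_residual
    r2 = ss_model / ss_total

    return {
        "df_model": df_model,
        "df_residual": df_residual,
        "df_total": df_total,
        "F": ms_model / ms_residual,
        "R-squared": r2,
        "Adj R-squared": 1 - (1 - r2) * (n - 1) / (n - k - 1),
        "Root MSE": sqrt(ms_residual),
    }


# Illustrative data (hypothetical, not from the lecture).
summary = regression_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in summary.items():
    print(name, round(value, 4))
```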