Linear Regression 2
Sociology 5811 Lecture 21
Copyright © 2005 by Evan Schofer
Do not copy or distribute without
permission
Announcements
• Proposals Due Today
Review: Regression
• Regression coefficient formulas (see the sketch below):

$$b = \frac{s_{YX}}{s_X^2} \qquad\qquad a = \bar{Y} - b\bar{X}$$
• Question: What is the interpretation of a
regression slope?
• Answer: It indicates the typical increase in Y for
any 1-point increase along the X-variable
– Note: this information is less useful if the linear
association between X and Y is low
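
To make the formulas concrete, here is a minimal Python sketch (not from the lecture; the data values are made up for illustration) that computes b and a from the covariance and variance, then cross-checks against numpy's least-squares fit:

```python
import numpy as np

# Illustrative data: years of education (X) and job prestige scores (Y)
x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20], dtype=float)
y = np.array([25, 30, 38, 41, 45, 52, 49, 58, 61], dtype=float)

s_yx = np.cov(x, y)[0, 1]        # sample covariance s_YX (N - 1 denominator)
s_x2 = np.var(x, ddof=1)         # sample variance s_X^2

b = s_yx / s_x2                  # slope: b = s_YX / s_X^2
a = y.mean() - b * x.mean()      # intercept: a = Y-bar - b * X-bar

# Cross-check against numpy's least-squares line: same b and a
b_np, a_np = np.polyfit(x, y, deg=1)
print(b, a, b_np, a_np)
```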
Example: Education & Job Prestige
• The actual SPSS regression results for that data:
Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .521a   .272       .271                12.40

a. Predictors: (Constant), HIGHEST YEAR OF SCHOOL COMPLETED

Estimates of a and b: "Constant" = a = 9.427; slope for "Year of School" = b = 2.487
Coefficients(a)

                            Unstandardized          Standardized
                            Coefficients            Coefficients
Model                       B        Std. Error     Beta           t        Sig.
1  (Constant)               9.427    1.418                         6.648    .000
   HIGHEST YEAR OF
   SCHOOL COMPLETED         2.487    .108           .521           23.102   .000

a. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE
• Equation: Prestige = 9.4 + 2.5(Education)
• A year of education adds 2.5 points of job prestige
Review: Covariance
• Covariance (sYX): sum of deviations about Y-bar multiplied by deviations about X-bar:

$$s_{YX} = \frac{\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X})}{N-1}$$

• Measures whether deviation (from the mean) in X tends to be accompanied by similar deviation in Y
– Or if cases with positive deviation in X have negative deviation in Y
– This is summed up for all cases in the data (a sketch of the computation follows below)
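
A minimal sketch of that sum in Python (same illustrative data as above; np.cov is used only as a cross-check):

```python
import numpy as np

x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20], dtype=float)
y = np.array([25, 30, 38, 41, 45, 52, 49, 58, 61], dtype=float)
n = len(x)

# Deviation products: positive when X and Y deviate in the same direction
dev_products = (y - y.mean()) * (x - x.mean())
s_yx = dev_products.sum() / (n - 1)

print(s_yx)                 # hand-computed covariance
print(np.cov(x, y)[0, 1])   # numpy agrees (default denominator is N - 1)
```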
Review: Covariance
• Covariance: based on multiplying deviations in X and Y
[Figure: scatterplot with X-bar = -1 and Y-bar = .5 marked. A point that deviates a lot from both means contributes a large product: (3)(2.5) = 7.5. A point that deviates very little from X-bar and Y-bar contributes almost nothing: (.4)(-.25) = -.01]
Review: Covariance and Slope
• The slope formula can be written out as follows:
$$b_{YX} = \frac{s_{YX}}{s_X^2}
= \frac{\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X}) \,/\, (N-1)}{\sum_{i=1}^{N}(X_i - \bar{X})^2 \,/\, (N-1)}
= \frac{\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{N}(X_i - \bar{X})^2}$$

– The (N – 1) terms cancel, leaving the sum of deviation products over the sum of squared deviations in X
Review: R-Square
• The R-Square statistic indicates how well the
regression line “explains” variation in Y
• It is based on partitioning variance into:
• 1. Explained (“regression”) variance
– The portion of deviation from Y-bar accounted for by
the regression line
• 2. Unexplained (“error”) variance
– The portion of deviation from Y-bar that is “error”
• Formula (see the sketch below):

$$R^2_{YX} = \frac{s_{YX}^2}{s_X^2\, s_Y^2} = \frac{SS_{REGRESSION}}{SS_{TOTAL}}$$
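
The partition is easy to verify numerically; a sketch in Python (illustrative data, as before):

```python
import numpy as np

x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20], dtype=float)
y = np.array([25, 30, 38, 41, 45, 52, 49, 58, 61], dtype=float)

b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x                          # points on the regression line

ss_total = ((y - y.mean()) ** 2).sum()     # total deviation from Y-bar
ss_reg = ((y_hat - y.mean()) ** 2).sum()   # "explained" (regression) part
ss_error = ((y - y_hat) ** 2).sum()        # "error" part

print(ss_reg / ss_total)                   # R-square
print(1 - ss_error / ss_total)             # same number, via the error part
```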
Review: R-Square
• Visually: Deviation is partitioned into two parts
[Figure: scatterplot with the regression line Y = 2 + .5X and Y-bar marked. Each point's deviation from Y-bar splits into "explained variance" (from Y-bar to the line) and "error variance" (from the line to the point)]
Correlation Coefficient (r)
• The R-square is very similar to another important
statistic: the correlation coefficient (r)
– R-square is literally the square of r
• Formula for the correlation coefficient (see the sketch below):

$$r = \frac{s_{YX}}{s_X\, s_Y}$$
• r is a measure of linear association
• Ranges from –1 to 1
• Zero indicates no linear association
• 1 = perfect positive linear association
• -1 = perfect negative linear association
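
A quick numeric check of the formula and its relation to R-square, as a Python sketch (illustrative data):

```python
import numpy as np

x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20], dtype=float)
y = np.array([25, 30, 38, 41, 45, 52, 49, 58, 61], dtype=float)

# r = s_YX / (s_X * s_Y)
r = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

print(r)                          # hand-computed r
print(np.corrcoef(x, y)[0, 1])    # numpy agrees
print(r ** 2)                     # R-square is literally r squared
```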
Correlation Coefficient (r)
• Example: Education and Job Prestige
• SPSS can calculate the correlation coefficient
– Usually listed in a matrix to allow many comparisons
Correlations

                                           HIGHEST YEAR OF    RS OCCUPATIONAL
                                           SCHOOL COMPLETED   PRESTIGE SCORE
HIGHEST YEAR OF      Pearson Correlation   1.000              .521**
SCHOOL COMPLETED     Sig. (2-tailed)       .                  .000
                     N                     1530               1434
RS OCCUPATIONAL      Pearson Correlation   .521**             1.000
PRESTIGE SCORE       Sig. (2-tailed)       .000               .
                     N                     1434               1440

**. Correlation is significant at the 0.01 level (2-tailed).

Correlation of "Year of School" and Job Prestige: r = .521
Covariance, R-square, r, and b
• Covariance, R-square, r, and b are all similar
– All provide information about the relationship
between X and Y
• Differences:
• Covariance, b, and r can be positive or negative
– r is scaled from –1 to +1, others range widely
• b tells you the actual slope
– It relates change in X to change in Y in real units
• R-square is like r, but is never negative
– And, it tells you the "explained" variance of a regression
Correlation Hypothesis Tests
• Hypothesis tests can be done on r, R-square, b
• Example: Correlation (r): linear association
• Is an observed positive or negative correlation significantly different from zero?
– Might the population have no linear association?
– The population correlation is denoted by the Greek letter rho (ρ)
• H0: There is no linear association (ρ = 0)
• H1: There is linear association (ρ ≠ 0)
• We’ll mainly focus on tests regarding slopes
• But the process is similar for correlation (r)
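
For reference, scipy can run this correlation test directly; a sketch (illustrative data; stats.pearsonr returns r and its two-tailed p-value):

```python
import numpy as np
from scipy import stats

x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20], dtype=float)
y = np.array([25, 30, 38, 41, 45, 52, 49, 58, 61], dtype=float)

# Two-tailed test of H0: rho = 0 against H1: rho != 0
r, p_value = stats.pearsonr(x, y)
print(r, p_value)   # reject H0 if p_value < alpha
```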
Correlation Coefficient (r)
• Education and Job Prestige hypothesis test:
Correlations

                                           HIGHEST YEAR OF    RS OCCUPATIONAL
                                           SCHOOL COMPLETED   PRESTIGE SCORE
HIGHEST YEAR OF      Pearson Correlation   1.000              .521**
SCHOOL COMPLETED     Sig. (2-tailed)       .                  .000
                     N                     1530               1434
RS OCCUPATIONAL      Pearson Correlation   .521**             1.000
PRESTIGE SCORE       Sig. (2-tailed)       .000               .
                     N                     1434               1440

**. Correlation is significant at the 0.01 level (2-tailed).
Here, the asterisks signify that coefficients are significantly different from zero at α = .01.
"Sig." is a p-value: the probability of observing this r if ρ = 0. Compare it to α!
Hypothesis Tests: Slopes
• Given: Observed slope relating Education to Job Prestige = 2.487
• Question: Can we generalize this to the population of all Americans?
– How likely is it that this observed slope was actually drawn from a population with slope = 0?
• Solution: Conduct a hypothesis test
• Notation: sample slope = b, population slope = β
• H0: Population slope β = 0
• H1: Population slope β ≠ 0 (two-tailed test)
Example: Slope Hypothesis Test
• The actual SPSS regression results for that data:
Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .521a   .272       .271                12.40

a. Predictors: (Constant), HIGHEST YEAR OF SCHOOL COMPLETED

Coefficients(a)

                            Unstandardized          Standardized
                            Coefficients            Coefficients
Model                       B        Std. Error     Beta           t        Sig.
1  (Constant)               9.427    1.418                         6.648    .000
   HIGHEST YEAR OF
   SCHOOL COMPLETED         2.487    .108           .521           23.102   .000

a. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE

The t-value and "Sig." (p-value) are for hypothesis tests about the slope.
• Reject H0 if: |t| > critical t (N-2 df)
• Or, if "Sig." (p-value) is less than α
Hypothesis Tests: Slopes
• What information lets us do a hypothesis test?
• Answer: Estimates of a slope (b) have a sampling distribution, like any other statistic
– It is the distribution of every value of the slope, based on all possible samples (of size N)
• If certain assumptions are met, the sampling distribution approximates the t-distribution
– Thus, we can assess the probability that a given value of b would be observed, if β = 0
– If that probability is low – below alpha – we reject H0
Hypothesis Tests: Slopes
• Visually: If the population slope (β) is zero, then the sampling distribution would center at zero
– Since the sampling distribution is a probability distribution, we can identify the likely values of b if the population slope is zero
[Figure: sampling distribution of the slope, centered at 0. If β = 0, observed slopes should commonly fall near zero, too; if the observed slope falls very far from 0, it is improbable that β is really equal to zero, and we can reject H0]
Bivariate Regression Assumptions
• Assumptions for bivariate regression hypothesis
tests:
• 1. Random sample
– Ideally N > 20
– But different rules of thumb exist. (10, 30, etc.)
• 2. Variables are linearly related
– i.e., the mean of Y increases linearly with X
– Check scatter plot for general linear trend
– Watch out for non-linear relationships (e.g., U-shaped)
Bivariate Regression Assumptions
• 3. Y is normally distributed for every outcome of
X in the population
– “Conditional normality”
• Ex: Years of Education (X), Job Prestige (Y)
• Suppose we look only at a sub-sample: X = 12 years of education
– Is a histogram of Job Prestige approximately normal?
– What about for people with X = 4? X = 16?
• If all are roughly normal, the assumption is met (a sketch of this check follows the figure below)
Bivariate Regression Assumptions
• Normality:
Examine sub-samples at different values of X.
Make histograms and check for normality.
[Figure: histograms of HAPPY for sub-samples at different values of INCOME. One sub-sample (Mean = 3.84, Std. Dev = 1.51, N = 60) looks roughly normal: good. Another (Mean = 4.58, Std. Dev = 3.06, N = 60) is badly skewed: not very good]
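
One way to do this check outside SPSS, as a Python sketch (the DataFrame and column names here are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical survey data: education (X) and job prestige (Y)
df = pd.DataFrame({
    "educ":     [12, 12, 12, 12, 16, 16, 16, 16, 4, 4, 4, 4],
    "prestige": [35, 38, 41, 39, 49, 52, 55, 50, 22, 25, 28, 24],
})

# One histogram of Y per value of X: each should look roughly normal
fig, axes = plt.subplots(1, 3, figsize=(10, 3))
for ax, x_value in zip(axes, [4, 12, 16]):
    sub = df.loc[df["educ"] == x_value, "prestige"]
    ax.hist(sub, bins=5)
    ax.set_title(f"Prestige when X = {x_value}")
plt.show()
```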
Bivariate Regression Assumptions
• 4. The variances of prediction errors are identical
at every value of X
– Recall: Error is the deviation from the regression line
– Is dispersion of error consistent across values of X?
– Definition: “homoskedasticity” = error dispersion is
consistent across values of X
– Opposite: “heteroskedasticity”, errors vary with X
• Test: Compare errors for X = 12 years of education with errors for X = 2, X = 8, etc.
– Are the errors around the line similar? Or different? (a numeric sketch follows the figures below)
Bivariate Regression Assumptions
• Homoskedasticity: Equal Error Variance
Examine error at different values of X. Is it roughly equal?
[Figure: scatterplot of the data against INCOME (0 to 100000) with the regression line; error dispersion is similar at all values of X. Here, things look pretty good]
Bivariate Regression Assumptions
• Heteroskedasticity: Unequal Error Variance
[Figure: scatterplot of the data against INCOME (0 to 100000) with the regression line; at higher values of X, error variance increases a lot. This looks pretty bad]
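
A rough numeric version of this eyeball test, as a Python sketch (simulated, homoskedastic data; the bin boundaries are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100_000, size=500)          # e.g., INCOME
y = 2.0 + 0.00005 * x + rng.normal(0, 1, 500)  # errors don't depend on X

b, a = np.polyfit(x, y, deg=1)
errors = y - (a + b * x)                       # deviations from the line

# Error spread within bins of X: roughly equal (homoskedastic)?
bin_ids = np.digitize(x, np.linspace(0, 100_000, 5))
for k in np.unique(bin_ids):
    print(f"bin {k}: error std = {errors[bin_ids == k].std(ddof=1):.2f}")
```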
Bivariate Regression Assumptions
• Notes/Comments:
• 1. Overall, regression is robust to violations of
assumptions
– It often gives fairly reasonable results, even when
assumptions aren’t perfectly met
• 2. Variations of OLS regression can handle
situations where assumptions aren’t met
• 3. But, there are also further diagnostics to help
ensure that results are meaningful…
– We’ll discuss them next week.
Regression Hypothesis Tests
• If assumptions are met, the sampling distribution
of the slope (b) approximates a T-distribution
• Standard deviation of the sampling distribution is
called the standard error of the slope (sb)
• Population formula of standard error:
$$s_b = \sqrt{\frac{s_e^2}{\sum_{i=1}^{N}(X_i - \bar{X})^2}}$$

• Where se² is the variance of the regression error
Regression Hypothesis Tests
• Estimating se2 lets us estimate the standard error:
$$\hat{s}_e^2 = \frac{\sum_{i=1}^{N} e_i^2}{N-2} = \frac{SS_{ERROR}}{N-2} = MS_{ERROR}$$

• Now we can estimate the S.E. of the slope:

$$\hat{s}_b = \sqrt{\frac{MS_{ERROR}}{\sum_{i=1}^{N}(X_i - \bar{X})^2}}$$
Regression Hypothesis Tests
• Finally: A t-value can be calculated:
– It is the slope divided by the standard error
$$t_{N-2} = \frac{b_{YX}}{s_b} = \frac{b_{YX}}{\sqrt{\dfrac{MS_{ERROR}}{s_X^2\,(N-1)}}}$$
• Where sb is the sample point estimate of the
standard error
• The t-value is based on N-2 degrees of freedom
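
Putting the last three slides together, a Python sketch of the whole slope test (illustrative data; scipy supplies the t-distribution p-value):

```python
import numpy as np
from scipy import stats

x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20], dtype=float)
y = np.array([25, 30, 38, 41, 45, 52, 49, 58, 61], dtype=float)
n = len(x)

b, a = np.polyfit(x, y, deg=1)
errors = y - (a + b * x)

ms_error = (errors ** 2).sum() / (n - 2)                # SS_error / (N - 2)
se_b = np.sqrt(ms_error / ((x - x.mean()) ** 2).sum())  # S.E. of the slope

t = b / se_b                                 # t-value, N - 2 df
p = 2 * stats.t.sf(abs(t), df=n - 2)         # two-tailed p-value ("Sig.")
print(t, p)                                  # reject H0 if p < alpha
```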
Example: Education & Job Prestige
• T-values can be compared to critical t...
Coefficients(a)

                            Unstandardized          Standardized
                            Coefficients            Coefficients
Model                       B        Std. Error     Beta           t        Sig.
1  (Constant)               9.427    1.418                         6.648    .000
   HIGHEST YEAR OF
   SCHOOL COMPLETED         2.487    .108           .521           23.102   .000

a. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE
SPSS estimates the standard error of the slope; this is used to calculate the t-value.
The t-value can be compared to the "critical value" to test hypotheses. Or, just compare "Sig." to alpha.
If |t| > critical t, or Sig. < alpha, reject H0.
Regression Confidence Intervals
• You can also use the standard error of the slope to
estimate confidence intervals:
$$C.I. = b \pm s_b\,(t_{N-2})$$

• Where $t_{N-2}$ is the critical t-value for a two-tailed test at the desired α-level
• Example: Observed slope = 2.5, S.E. = .10
• 95% t-value for 102 d.f. is approximately 2
• 95% C.I. = 2.5 +/- 2(.10)
• Confidence Interval: 2.3 to 2.7
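
The same interval as a Python sketch (using the numbers from the example; scipy gives the exact critical t rather than the "approximately 2" shortcut):

```python
from scipy import stats

b, se_b, df = 2.5, 0.10, 102   # slope, standard error, N - 2

t_crit = stats.t.ppf(0.975, df)              # two-tailed, alpha = .05
print(t_crit)                                # about 1.98 -- roughly 2
print(b - t_crit * se_b, b + t_crit * se_b)  # about 2.3 to 2.7
```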
Regression Hypothesis Tests
• You can also use a T-test to determine if the
constant (a) is significantly different from zero
– But, this is typically less useful to do
$$t_{N-2} = \frac{a_{YX}}{\sqrt{\dfrac{MS_{ERROR}}{N-1}}}$$

• Hypotheses (α = population counterpart of the constant a):
• H0: α = 0, H1: α ≠ 0
• But, most research focuses on slopes
Regression: Outliers
• Note: Even if regression assumptions are met,
slope estimates can have problems
• Example: Outliers -- cases with extreme values
that differ greatly from the rest of your sample
• Outliers can result from:
– Errors in coding or data entry
– Highly unusual cases
– Or, sometimes they reflect important “real” variation
• Even a few outliers can dramatically change
estimates of the slope (b)
Regression: Outliers
• Outlier Example:
[Figure: scatterplot showing an extreme case that pulls the regression line up, compared with the regression line when the extreme case is removed from the sample]
Regression: Outliers
• Strategy for dealing with outliers:
• 1. Identify them
– Look at scatterplots for extreme values
– Or, ask SPSS to compute outlier diagnostic statistics
– There are several statistics to identify cases that are
affecting the regression slope a lot
– Examples: “Leverage”, Cook’s D, DFBETA
– SPSS can even identify “problematic” cases for you…
but it is preferable to do it yourself.
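
These diagnostics can also be computed outside SPSS; a sketch using Python's statsmodels (illustrative data; the influence API shown is statsmodels', not SPSS's):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20, 4], dtype=float)
y = np.array([25, 30, 38, 41, 45, 52, 49, 58, 61, 80], dtype=float)  # last case: outlier

results = sm.OLS(y, sm.add_constant(x)).fit()
influence = results.get_influence()

leverage = influence.hat_matrix_diag    # "leverage" for each case
cooks_d, _ = influence.cooks_distance   # Cook's D for each case
dfbetas = influence.dfbetas             # DFBETA per case and coefficient

# Flag cases with large Cook's D (4/N is a common rough cutoff)
print(np.where(cooks_d > 4 / len(x))[0])
```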
Regression: Outliers
• 2. Depending on the circumstances, either:
• A) Drop cases from sample and re-do regression
– Especially for coding errors, very extreme outliers
– Or if there is a theoretical reason to drop cases
– Example: In analysis of economic activity,
communist countries differ a lot…
• B) Or, sometimes it is reasonable to leave outliers
in the analysis
– e.g., if there are several that represent an important
minority group in your data
• When writing papers, indicate whether outliers were excluded (and the effect that had on the analysis)