Induction on Regression (Ch 15)
AP Statistics
Inference – Chapter 14
Hypothesis Tests: Slopes
• Given: Observed slope relating Education to Job
Prestige = 2.47
• Question: Can we generalize this to the
population of all Americans?
– How likely is it that this observed slope was actually
drawn from a population with slope = 0?
• Solution: Conduct a hypothesis test
• Notation: sample slope = b, population slope = β
• H0: Population slope β = 0
• H1: Population slope β ≠ 0 (two-tailed test)
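As a minimal sketch (not from the slides), this is how such a slope test could be run in Python with scipy; the education and prestige values below are made up for illustration, not the actual sample behind the 2.47 estimate.

```python
# Hypothetical sketch: testing H0: beta = 0 for a bivariate slope.
# The data are fabricated for illustration only.
from scipy import stats

education = [8, 10, 12, 12, 14, 16, 16, 18, 20, 21]
prestige = [25, 30, 38, 35, 45, 52, 50, 60, 68, 70]

result = stats.linregress(education, prestige)
print(f"slope b      = {result.slope:.3f}")
print(f"std. error   = {result.stderr:.3f}")
print(f"two-tailed p = {result.pvalue:.4f}")  # reject H0 if p < alpha
```

A small p-value means a slope this far from zero would rarely be drawn from a population with β = 0.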
Review: Slope Hypothesis Tests
• What information lets us do a hypothesis test?
• Answer: Estimates of a slope (b) have a
sampling distribution, like any other statistic
– It is the distribution of every value of the slope, based
on all possible samples (of size N)
• If certain assumptions are met, the sampling
distribution approximates the t-distribution
– Thus, we can assess the probability that a given value
of b would be observed, if β = 0
– If probability is low – below alpha – we reject H0
Review: Slope Hypothesis Tests
• Visually: If the population slope (β) is zero, then
the sampling distribution would center at zero
– Since the sampling distribution is a probability
distribution, we can identify the likely values of b if
the population slope is zero
[Figure: sampling distribution of the slope, centered at 0. If β = 0, observed slopes should commonly fall near zero, too. If an observed slope falls very far from 0, it is improbable that β is really equal to zero; thus, we can reject H0.]
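This idea can be demonstrated with a small simulation (a sketch, not part of the slides): repeatedly draw samples from a population where X and Y are unrelated (β = 0) and watch the estimated slopes cluster around zero.

```python
# Sketch: simulate the sampling distribution of the slope when the
# true population slope is 0 (X and Y are independent).
import numpy as np

rng = np.random.default_rng(0)
slopes = []
for _ in range(2000):                    # 2000 samples of size N = 50
    x = rng.normal(size=50)
    y = rng.normal(size=50)              # independent of x, so beta = 0
    b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # OLS slope = Sxy / Sxx
    slopes.append(b)

slopes = np.array(slopes)
print(round(slopes.mean(), 3))           # centers near 0
```

An observed slope far out in the tails of this simulated distribution would be grounds to reject H0.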
Bivariate Regression Assumptions
• Assumptions for bivariate regression hypothesis
tests:
• 1. Random sample
– Ideally N > 20
– But different rules of thumb exist. (10, 30, etc.)
• 2. Variables are linearly related
– i.e., the mean of Y increases linearly with X
– Check scatter plot for general linear trend
– Watch out for non-linear relationships (e.g., U-shaped)
Bivariate Regression Assumptions
• 3. Y is normally distributed for every outcome of
X in the population
– “Conditional normality”
• Ex: Years of Education (X), Job Prestige (Y)
• Suppose we look only at a sub-sample: X = 12
years of education
– Is a histogram of Job Prestige approximately normal?
– What about for people with X = 4? X = 16?
• If all are roughly normal, the assumption is met
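The sub-sample check described above can be sketched in Python; here the education and prestige values are simulated (an assumption for illustration), and a Shapiro-Wilk test stands in for eyeballing histograms.

```python
# Sketch: check conditional normality of Y within sub-samples at fixed
# values of X. Data are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
education = rng.choice([4, 12, 16], size=300)             # X
prestige = 20 + 2.5 * education + rng.normal(0, 5, 300)   # Y, normal errors

pvals = []
for x_val in [4, 12, 16]:
    sub = prestige[education == x_val]        # sub-sample at one X value
    stat, p = stats.shapiro(sub)              # test of normality
    pvals.append(p)
    print(f"X = {x_val:2d}: Shapiro-Wilk p = {p:.3f}")
```

Large p-values give no evidence against normality in a sub-sample; in practice, histograms of each sub-sample tell the same story visually.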
Bivariate Regression Assumptions
• Normality:
Examine sub-samples at different values of X.
Make histograms and check for normality.
[Figure: histogram of HAPPY for one sub-sample (Mean = 3.84, Std. Dev = 1.51, N = 60): Good.]
[Figure: histogram of HAPPY for another sub-sample (Mean = 4.58, Std. Dev = 3.06, N = 60), shown alongside INCOME (0 to 100,000): Not very good.]
Bivariate Regression Assumptions
• 4. The variances of prediction errors are identical
at different values of X
– Recall: Error is the deviation from the regression line
– Is dispersion of error consistent across values of X?
– Definition: “homoskedasticity” = error dispersion is
consistent across values of X
– Opposite: “heteroskedasticity”, errors vary with X
• Test: Compare errors for X=12 years of
education with errors for X=2, X=8, etc.
– Are the errors around the line similar? Or different?
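The comparison of error spread at different X values can be sketched numerically (not from the slides; the data are simulated with constant error variance, so the check should pass here).

```python
# Sketch: compare the spread of regression errors across bins of X.
# Roughly equal standard deviations suggest homoskedasticity.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 20, 400)
y = 5 + 2 * x + rng.normal(0, 3, 400)       # constant error variance

res = stats.linregress(x, y)
errors = y - (res.intercept + res.slope * x)  # deviations from the line

sds = []
for lo, hi in [(0, 5), (5, 10), (10, 15), (15, 20)]:
    sd = errors[(x >= lo) & (x < hi)].std(ddof=1)
    sds.append(sd)
    print(f"X in [{lo:2d}, {hi:2d}): error SD = {sd:.2f}")  # all near 3
```

If the per-bin standard deviations were to grow steadily with X instead, that would be the heteroskedasticity pattern shown next.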
Bivariate Regression Assumptions
• Homoskedasticity: Equal Error Variance
Examine error at
different values of X.
Is it roughly equal?
Here, things look pretty good.
[Figure: scatterplot of prediction errors vs. INCOME (0 to 100,000); error spread is similar across values of X.]
Bivariate Regression Assumptions
• Heteroskedasticity: Unequal Error Variance
At higher values of X, error variance increases a lot. This looks pretty bad.
[Figure: scatterplot of prediction errors vs. INCOME (0 to 100,000); the spread of errors widens as income increases.]
Bivariate Regression Assumptions
• Notes/Comments:
• 1. Overall, regression is robust to violations of
assumptions
– It often gives fairly reasonable results, even when
assumptions aren’t perfectly met
• 2. Variations of regression can handle situations
where assumptions aren’t met
• 3. But, there are also further diagnostics to help
ensure that results are meaningful…
Regression Hypothesis Tests
• If assumptions are met, the sampling distribution
of the slope (b) approximates a T-distribution
• Standard deviation of the sampling distribution is
called the standard error of the slope (sb)
• Population formula of the standard error:
$$ s_b = \sqrt{\frac{s_e^2}{\sum_{i=1}^{N} (X_i - \bar{X})^2}} $$
• Where $s_e^2$ is the variance of the regression error
Regression Hypothesis Tests
• Estimating $s_e^2$ lets us estimate the standard error:
$$ \hat{s}_e^2 = \frac{\sum_{i=1}^{N} e_i^2}{N - 2} = \frac{SS_{ERROR}}{N - 2} = MS_{ERROR} $$
• Now we can estimate the S.E. of the slope:
$$ \hat{s}_b = \sqrt{\frac{MS_{ERROR}}{\sum_{i=1}^{N} (X_i - \bar{X})^2}} $$
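As a sketch (using simulated data, an assumption for illustration), the standard-error formula above can be computed by hand and compared with what a standard routine reports.

```python
# Sketch: compute the estimated standard error of the slope by hand,
# s_b = sqrt(MS_error / sum((X_i - Xbar)^2)), and compare with scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(10, 2, 40)
y = 1 + 0.8 * x + rng.normal(0, 1, 40)     # fabricated sample

res = stats.linregress(x, y)
errors = y - (res.intercept + res.slope * x)
ms_error = (errors ** 2).sum() / (len(x) - 2)        # SS_error / (N - 2)
se_slope = np.sqrt(ms_error / ((x - x.mean()) ** 2).sum())

print(f"manual SE = {se_slope:.6f}, scipy SE = {res.stderr:.6f}")
```

The two values agree, since scipy's reported slope standard error uses this same formula.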
Regression Hypothesis Tests
• Finally: A t-value can be calculated:
– It is the slope divided by the standard error
$$ t_{N-2} = \frac{b_{YX}}{s_b} = \frac{b_{YX}}{\sqrt{MS_{ERROR} \,/\, \left(s_X^2 (N - 1)\right)}} $$
• Where sb is the sample point estimate of the
standard error
• The t-value is based on N-2 degrees of freedom
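A quick numerical check (simulated data, not from the slides): dividing the slope by its standard error and referring the result to a t-distribution with N − 2 degrees of freedom reproduces the p-value a standard routine reports.

```python
# Sketch: t = slope / standard error, with N - 2 degrees of freedom;
# the two-tailed p-value comes from the t-distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(size=30)           # fabricated sample

res = stats.linregress(x, y)
t = res.slope / res.stderr                  # t statistic
p = 2 * stats.t.sf(abs(t), df=len(x) - 2)   # two-tailed p-value

print(f"manual p = {p:.6g}, scipy p = {res.pvalue:.6g}")
```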
Regression Confidence Intervals
• You can also use the standard error of the slope to
estimate confidence intervals:
$$ C.I. = b \pm s_b \, t_{N-2} $$
• Where $t_{N-2}$ is the t-value for a two-tailed test
given a desired α-level
• Example: Observed slope = 2.5, S.E. = .10
• 95% t-value for 102 d.f. is approximately 2
• 95% C.I. = 2.5 +/- 2(.10)
• Confidence Interval: 2.3 to 2.7
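The slide's arithmetic can be reproduced directly; the exact 95% critical value for 102 degrees of freedom is close to the "approximately 2" used above.

```python
# Sketch: reproduce the slide's confidence-interval example.
# b = 2.5, SE = 0.10, 102 degrees of freedom.
from scipy import stats

b, se, df = 2.5, 0.10, 102
t_crit = stats.t.ppf(0.975, df)            # about 1.98
lo, hi = b - t_crit * se, b + t_crit * se
print(f"t = {t_crit:.3f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

Using the exact critical value barely moves the interval from the rounded (2.3, 2.7) on the slide.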
Regression Hypothesis Tests
• You can also use a T-test to determine if the
constant (a) is significantly different from zero
– But, this is typically less useful to do
$$ t_{N-2} = \frac{a_{YX}}{\sqrt{MS_{ERROR} \,/\, (N - 1)}} $$
• Hypotheses (α = population parameter of a):
• H0: α = 0, H1: α ≠ 0
• But, most research focuses on slopes
Regression: Outliers
• Note: Even if regression assumptions are met,
slope estimates can have problems
• Example: Outliers -- cases with extreme values
that differ greatly from the rest of your sample
• Outliers can result from:
– Errors in coding or data entry
– Highly unusual cases
– Or, sometimes they reflect important “real” variation
• Even a few outliers can dramatically change
estimates of the slope (b)
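A sketch of this effect (with fabricated data, not from the slides): fitting the same points with and without a single extreme case shows how much one observation can move the slope.

```python
# Sketch: a single extreme case can change the slope estimate a lot.
import numpy as np
from scipy import stats

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.1, 8.0])  # slope near 1

x_out = np.append(x, 9.0)
y_out = np.append(y, 40.0)                 # one fabricated extreme case

b_clean = stats.linregress(x, y).slope
b_out = stats.linregress(x_out, y_out).slope
print(f"slope without outlier: {b_clean:.2f}")   # near 1
print(f"slope with outlier:    {b_out:.2f}")     # pulled up substantially
```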
Regression: Outliers
• Outlier Example:
[Figure: scatterplot (axes from −4 to 4) showing an extreme case that pulls the regression line up, compared with the regression line when the extreme case is removed from the sample.]
Regression: Outliers
• Strategy for dealing with outliers:
• 1. Identify them
– Look at scatterplots for extreme values
– Or, have computer software compute outlier
diagnostic statistics
– There are several statistics to identify cases that are
affecting the regression slope a lot
– Examples: “Leverage”, Cook’s D, DFBETA
– Computer software can even identify “problematic”
cases for you… but it is preferable to do it yourself.
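Two of the diagnostics named above, leverage and Cook's D, can be computed by hand for bivariate regression using their standard formulas; the data below, including the extreme last case, are made up for illustration.

```python
# Sketch: leverage and Cook's D for a bivariate regression
# (p = 2 estimated parameters). Data are fabricated; the last
# case is extreme in both X and Y.
import numpy as np
from scipy import stats

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 20])
y = np.array([1.0, 2, 3, 4, 5, 6, 7, 30])

res = stats.linregress(x, y)
e = y - (res.intercept + res.slope * x)        # residuals
n, p = len(x), 2
# leverage: how far each case's X sits from the mean of X
h = 1 / n + (x - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()
s2 = (e ** 2).sum() / (n - p)                  # residual variance
# Cook's D: influence of each case on the fitted line
cooks_d = e ** 2 * h / (p * s2 * (1 - h) ** 2)

print("highest leverage case:", int(np.argmax(h)))   # index 7, X = 20
```

Software such as statsmodels can report these (plus DFBETA) automatically, but computing them once by hand shows what the numbers mean: leverages sum to the number of parameters, and a case with both high leverage and a large residual dominates Cook's D.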
Regression: Outliers
• 2. Depending on the circumstances, either:
• A) Drop cases from sample and re-do regression
– Especially for coding errors, very extreme outliers
– Or if there is a theoretical reason to drop cases
– Example: In analysis of economic activity,
communist countries differ a lot…
• B) Or, sometimes it is reasonable to leave outliers
in the analysis
– e.g., if there are several that represent an important
minority group in your data
• When writing papers, identify if outliers were
excluded (and the effect that had on the analysis).