Transcript Chapter 13

Chapter 13
Simple Linear Regression
13.1: Types of Regression Models
Lots of Terms:
• Dependent or response variable
• Independent or predictor or
explanatory variable
• Simple linear regression (SLR)
• Linear relationship
Formula 13.1, page 515
• $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
Simple Linear Regression Model
Beta0 Intercept
Beta1 Slope
Epsilon: noise, random error,
stochastic error
• Figure 13.2, page 515 (linearity)
Relationship to Previous Work
• Where’s the mean?
• Where are the hypotheses?
13.2: Determining the Simple
Linear Regression Equation
• We want the best fitting line.
• We use the method of “least squares.”
It picks the line with the smallest sum of
squared differences between actual and
predicted Y; that is the “best fitting” line.
• Estimate Beta0 with b0.
• Estimate Beta1 with b1.
• The “b” values are called
“coefficients.”
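As a minimal sketch of how software arrives at the coefficients, the snippet below computes b0 and b1 from the usual least-squares formulas (slope = SSXY / SSX, intercept = Y-bar minus slope times X-bar). The data and the function name `least_squares` are hypothetical, chosen only for illustration; they do not come from the textbook.

```python
# Hypothetical illustration data (not from the textbook).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def least_squares(x, y):
    """Return (b0, b1) for the fitted line yhat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # SSXY: sum of cross-products of deviations from the means.
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # SSX: sum of squared deviations of X from its mean.
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ssxy / ssx            # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1

b0, b1 = least_squares(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

In practice the same numbers are read off the software output; the point here is only to show what the output is estimating.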
Least Squares Results
• The slope is b1.
• Interpretation: for each 1 unit increase
in X, the average or predicted value of
Y changes by b1 units.
• Underlined terms should be defined in
the context of the problem.
Least Squares Results
• The intercept is b0.
• Interpretation: when X is 0 units, the
average or predicted value of Y is b0.
• Very often, X = 0 is outside the range of
the observed X values (i.e., the
“relevant range”).
• Interpretations of b0 should be made
cautiously. Beware of extrapolation.
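One way to make the extrapolation warning concrete is to flag any prediction whose X falls outside the observed range. This is only a sketch: the coefficients, the range limits, and the helper name `predict` are all hypothetical.

```python
def predict(x_new, b0, b1, x_min, x_max):
    """Return (yhat, inside_range) for the fitted line yhat = b0 + b1 * x_new."""
    y_hat = b0 + b1 * x_new
    # The relevant range is taken to be the smallest to largest observed X.
    inside_range = x_min <= x_new <= x_max
    return y_hat, inside_range

# Hypothetical fitted coefficients and observed X range.
b0, b1 = 2.2, 0.6
x_min, x_max = 1, 5

for x_new in (3, 0):
    y_hat, ok = predict(x_new, b0, b1, x_min, x_max)
    note = "" if ok else "  (extrapolation: outside the relevant range)"
    print(f"X = {x_new}: predicted Y = {y_hat:.2f}{note}")
```

Note that X = 0, the value at which b0 is interpreted, is flagged here because it lies below the smallest observed X.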
13.3: Measures of Variation
• “Usefulness” is an important concept.
• Two types of usefulness: statistical and
practical.
• Statistical usefulness almost always
requires managing the “sums of squares.”
• Practical usefulness can be assessed by
managing the sums of squares or by
obtaining an informed opinion.
Calculating Sums of Squares
The three most important sums of squares are
shown on page 526:
• SST = SSR + SSE
• SST results from summing up squared deviations
between actual values of Y and Y-bar.
• SSR = variation in Y accounted for by the
regression.
• SSE = variation in Y NOT accounted for by the
regression.
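The decomposition SST = SSR + SSE can be checked numerically. The sketch below uses small hypothetical data (not from the textbook), fits the least-squares line, and computes each sum of squares directly from its definition.

```python
# Hypothetical illustration data (not from the textbook).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained (error)

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
```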
Using Sums of Squares
• Raw sums of squares are not that
helpful.
• The Coefficient of Determination (r2) can
be calculated with formula 13.7.
• Interpretation: (r2 × 100) percent of the
variation in variable Y is explained by the
variation in variable X in this data set.
• 0 ≤ r2 ≤ +1
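A short sketch of the calculation, assuming the standard definition r2 = SSR / SST (the notes do not reproduce formula 13.7, so treat the match as an assumption). The actual and predicted values below are hypothetical.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: share of SST accounted for by SSR."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    return ssr / sst

# Hypothetical actual values and least-squares predictions.
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, y_hat)
print(f"r2 = {r2:.2f}, i.e. {r2 * 100:.0f} percent of the variation in Y")
```

Reporting the value as a percentage is what turns the raw sums of squares into something interpretable.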
Using Sums of Squares
• Standard Error of the Estimate
• Measures the variability of the
actual values of Y around the
predicted values of Y.
• Formula 13.13, page 530.
• Interpretation: The general variability
of Y around the fitted line is “standard
error of the estimate” units.
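A minimal sketch, assuming the usual definition of the standard error of the estimate, S_YX = sqrt(SSE / (n - 2)); the notes cite formula 13.13 without reproducing it. The values are hypothetical.

```python
from math import sqrt

def standard_error_of_estimate(y, y_hat):
    """S_YX = sqrt(SSE / (n - 2)), in the same units as Y."""
    n = len(y)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return sqrt(sse / (n - 2))

# Hypothetical actual values and least-squares predictions.
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

s_yx = standard_error_of_estimate(y, y_hat)
print(f"S_YX = {s_yx:.3f} (units of Y)")
```

The divisor is n - 2 rather than n because two coefficients, b0 and b1, were estimated from the data.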
13.4: Assumptions
13.5: Residual Analysis
There are 4 assumptions of Regression
• Normality of Error
• Homoscedasticity
• Independence of Errors
• Linearity.
Errors
• What are the “errors?”
• In Equation 13.1, errors are the epsilons.
• We do not know the epsilons—they live in the
population.
• We approximate the epsilons with sample data:
residual = y – yhat. See formula 13.14.
• If the residuals meet the assumptions, then we
feel better about the usefulness of our analysis.
• If the residuals do not meet the assumptions,
then we need a new analysis technique.
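The residual calculation described above can be sketched as follows. The data are hypothetical; each observation yields one residual, y minus yhat.

```python
# Hypothetical illustration data (not from the textbook).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

# One residual per observation: actual Y minus predicted Y.
y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

for xi, yi, yh, e in zip(x, y, y_hat, residuals):
    print(f"X = {xi}: Y = {yi}, yhat = {yh:.2f}, residual = {e:+.2f}")
```

For a least-squares line with an intercept, the residuals always sum to zero, so their sum is not a useful diagnostic; their pattern is.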
Linearity
• Examine the plot of X versus the
residuals.
• Example: see Figure 13.12.
• There should not be a pattern.
• A pattern means that the linear regression
was not effective at explaining the
variation in Y, i.e., the SST.
Notes on Residual Analysis
• There are a LOT of techniques that can be
used to examine residuals.
• You are trying to assess the validity of
assumptions.
• Each observation produces a residual.
• The process of calculating Studentized
Residuals allows you to look at each
observation to see if it produced a
“strange” residual.
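A sketch of one common definition of the studentized residual: each residual divided by S_YX times sqrt(1 - h_i), where h_i = 1/n + (X_i - X-bar)^2 / SSX. The notes do not say which definition the textbook or the software uses, so this choice is an assumption, and the data are hypothetical.

```python
from math import sqrt

# Hypothetical illustration data (not from the textbook).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
ssx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s_yx = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# h_i grows as X_i moves away from X-bar.
h = [1 / n + (xi - x_bar) ** 2 / ssx for xi in x]
studentized = [e / (s_yx * sqrt(1 - hi)) for e, hi in zip(residuals, h)]

for xi, sr in zip(x, studentized):
    print(f"X = {xi}: studentized residual = {sr:+.3f}")
```

Because the result is on a standardized scale, values far from zero stand out as candidates for a “strange” observation.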
Assumption of Normality
• For any value of X, the errors (and
residuals) conform to a normal distribution.
• At this point, we subjectively assess with
graphical means.
• Histogram of residuals.
• Normal probability plot.
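The coordinates behind a normal probability plot can be sketched without any plotting library: sort the residuals and pair each with the standard normal quantile for its rank. The plotting position (i - 0.5) / n used here is one common convention and an assumption; software packages may use a slightly different one. The residuals are hypothetical.

```python
from statistics import NormalDist

# Hypothetical residuals from a fitted line.
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]

n = len(residuals)
ordered = sorted(residuals)

# Theoretical standard normal quantile for each rank i = 1..n.
std_normal = NormalDist()
theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Points to plot: roughly a straight line suggests normal errors.
points = list(zip(theoretical, ordered))
for q, e in points:
    print(f"normal quantile = {q:+.3f}, residual = {e:+.2f}")
```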
Homoscedasticity
• For all X, the errors and residuals should
have constant, or same, variance.
• Assess subjectively by looking at the graph
of residuals versus predicted values (or
studentized residuals versus X).
• Assess numerically by performing a test of
equal variances (divide the set of residuals
in half and test).
• Figure 13.16 shows a problem.
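The “divide the residuals in half” idea can be sketched as a variance ratio. The notes do not name the exact test, so the version below is an assumption: it orders the residuals by X, splits them into a low-X half and a high-X half, and forms the ratio of the larger sample variance to the smaller. The data are hypothetical, and the ratio would still need to be compared with an F critical value.

```python
from statistics import variance

# Hypothetical X values and residuals from a fitted line.
x = [1, 2, 3, 4, 5]
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]

# Order the residuals by X, then split into two groups.
ordered = [e for _, e in sorted(zip(x, residuals))]
half = len(ordered) // 2
low_x, high_x = ordered[:half], ordered[half:]

var_low = variance(low_x)     # sample variance, low-X group
var_high = variance(high_x)   # sample variance, high-X group

# Larger variance over smaller; a ratio near 1 is consistent with equal spread.
ratio = max(var_low, var_high) / min(var_low, var_high)
print(f"variance ratio = {ratio:.3f}")
```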
Independence
• Previous residuals are not correlated with
current and future residuals.
• Assess by plotting the residuals in order of
observation.
• A formal procedure called the “Durbin-Watson
test” exists.
• Usually only a problem with time-series data—
data observed over time.
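A sketch of the Durbin-Watson statistic, using its standard definition: the sum of squared differences between successive residuals divided by the sum of squared residuals. The residuals below are hypothetical and assumed to be listed in order of observation.

```python
def durbin_watson(residuals):
    """D = sum((e_i - e_{i-1})^2) / sum(e_i^2), residuals in time order."""
    numerator = sum(
        (curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:])
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals, in order of observation.
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]

d = durbin_watson(residuals)
print(f"D = {d:.3f}")
```

The statistic ranges from 0 to 4. Values near 2 are consistent with independent errors, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation; a formal decision uses tabled critical values.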
13.7: Inferences about the slope
and correlation coefficient.
• There is a “t test” for the slope.
• There is an “F test” for the overall
explanatory value of the regression line.
• There is a confidence interval estimate
for the slope (skip).
• There is a “t test” for the significance of
the correlation coefficient (skip).
“t-test for Slope”
• Test follows the standard hypothesis
test pattern.
• Like all “t-tests,” the test statistic
follows the usual format, shown in
Equation 13.16, page 542.
• As with all analyses on large data sets, we
look to the computer output for
answers.
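To see where the t value on the computer output comes from, here is a sketch assuming the usual form of the statistic for H0: Beta1 = 0, namely t = (b1 - 0) / S_b1 with S_b1 = S_YX / sqrt(SSX) and n - 2 degrees of freedom. The data are hypothetical.

```python
from math import sqrt

# Hypothetical illustration data (not from the textbook).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
ssx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_yx = sqrt(sse / (n - 2))   # standard error of the estimate
s_b1 = s_yx / sqrt(ssx)      # standard error of the slope

t_stat = (b1 - 0) / s_b1     # hypothesized slope under H0 is 0
df = n - 2
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

The statistic is then compared with a critical value of t, or the p-value is read from the output.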
“F Test”
• The formula for the F Test is shown in
Equation 13.17, page 544.
• This test should look familiar to you: it
was developed in the ANOVA section.
• Even though the text discusses this test in
terms of the slope, the more general form is
the more useful.
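A sketch of the F statistic in its general ANOVA form, F = MSR / MSE, with MSR = SSR / k and MSE = SSE / (n - k - 1), where k is the number of independent variables (k = 1 in simple linear regression). The sums of squares below are hypothetical and the function name is illustrative only.

```python
def f_statistic(ssr, sse, n, k=1):
    """F = MSR / MSE with k and n - k - 1 degrees of freedom."""
    msr = ssr / k              # mean square regression
    mse = sse / (n - k - 1)    # mean square error
    return msr / mse

# Hypothetical sums of squares from a fit on n = 5 observations.
f_stat = f_statistic(ssr=3.6, sse=2.4, n=5)
print(f"F = {f_stat:.2f}")
```

With a single independent variable, F equals the square of the t statistic for the slope, which is why the two tests agree in simple linear regression.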
13.8 Estimation of Mean Values
and Prediction of Individual
Values
• CI for a mean value of Y (13.20)
• PI for a value of Y (13.21)
• Won’t have to calculate by hand
but might need to interpret.
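For interpretation, it helps to see why the prediction interval is always wider than the confidence interval. The sketch below assumes the standard forms: yhat plus or minus t times S_YX times sqrt(h) for the mean of Y, and sqrt(1 + h) for an individual Y, where h = 1/n + (X_new - X-bar)^2 / SSX. The data, coefficients, and function name are hypothetical; the critical value 3.1824 is the two-sided 95 percent t value for 3 degrees of freedom.

```python
from math import sqrt

def interval_half_widths(x_new, x, s_yx, t_crit):
    """Return (CI half-width for mean Y, PI half-width for individual Y)."""
    n = len(x)
    x_bar = sum(x) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    h = 1 / n + (x_new - x_bar) ** 2 / ssx
    ci_half = t_crit * s_yx * sqrt(h)        # mean value of Y
    pi_half = t_crit * s_yx * sqrt(1 + h)    # individual value of Y
    return ci_half, pi_half

# Hypothetical fit: yhat = 2.2 + 0.6 * X on five observations.
x = [1, 2, 3, 4, 5]
b0, b1 = 2.2, 0.6
s_yx = sqrt(0.8)
t_crit = 3.1824   # two-sided 95% t critical value, n - 2 = 3 df

x_new = 3
y_hat = b0 + b1 * x_new
ci_half, pi_half = interval_half_widths(x_new, x, s_yx, t_crit)

print(f"CI for mean Y:       {y_hat - ci_half:.2f} to {y_hat + ci_half:.2f}")
print(f"PI for individual Y: {y_hat - pi_half:.2f} to {y_hat + pi_half:.2f}")
```

Both intervals are narrowest at X-bar and widen as X_new moves away from it.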
13.9: Pitfalls in Regression and
Ethical Issues
Page 554 lists some difficulties.
• Lack of awareness of assumptions.
• Unable to evaluate assumptions.
• Unable to proceed if assumptions are
violated (what are the alternatives?).
• No knowledge of subject area.
Moral of the Story
• Examine assumptions.
• ALWAYS Plot the data.
– The 4 sets of data in Table 13.7 all
have the same regression results,
BUT they look vastly different.
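The point can be checked numerically with Anscombe's quartet, a well-known group of four data sets that share nearly identical regression results. Whether Table 13.7 reproduces exactly these values is an assumption; the notes do not say. The helper name `fit_line` is illustrative only.

```python
def fit_line(x, y):
    """Return (b0, b1) from the least-squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ssxy / ssx
    return y_bar - b1 * x_bar, b1

# Anscombe's quartet (assumed, not confirmed, to match Table 13.7).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {name: fit_line(x, y) for name, (x, y) in quartet.items()}
for name, (b0, b1) in fits.items():
    print(f"Data set {name}: b0 = {b0:.2f}, b1 = {b1:.2f}")
```

The coefficients agree to two decimal places, yet scatter plots of the four sets show a linear cloud, a curve, a line with one outlier, and a vertical stack with one extreme point. Only plotting reveals this.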
Some Objectives
• Find the regression coefficients using software or
software output.
• Interpret the slope, r2, and the standard error of the
estimate.
• Perform a test of hypothesis on the slope.
• Perform a test of hypothesis on “all of the slopes.”
• Write confidence intervals and prediction intervals
for Y given X.
• Evaluate assumptions when given computer output.
• Suggest the type of output necessary to evaluate
assumptions.
• Name some difficulties of using simple regression.