
Chapter 12
Multiple Regression
 Learn: to use multiple regression analysis to predict a response variable using more than one explanatory variable.
Section 12.1
How Can We Use Several
Variables to Predict a
Response?
Regression Models

The model that contains only
two variables, x and y, is called a
bivariate model
Regression Models

The regression equation for the
bivariate model is:
$\mu_y = \alpha + \beta x$
Regression Models
Suppose there are two
predictors, denoted by x1 and x2
 This is called a multiple
regression model

Regression Models

The regression equation for this
multiple regression model with two
predictors is:
$\mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2$
Multiple Regression Model

The multiple regression model
relates the mean µy of a
quantitative response variable y
to a set of explanatory variables
x1, x2,….
Multiple Regression Model

Example: For three explanatory
variables, the multiple regression
equation is:
$\mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
Multiple Regression Model

Example: The sample prediction
equation with three explanatory
variables is:
$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3$
Example: Predicting Selling
Price Using House and Lot Size


The data set “house selling prices”
contains observations on 100 home
sales in Florida in November 2003
A multiple regression analysis was
done with selling price as the
response variable and with house
size and lot size as the explanatory
variables
Example: Predicting Selling
Price Using House and Lot Size

Output from the analysis:
Example: Predicting Selling
Price Using House and Lot Size

Prediction Equation:
$\hat{y} = -10,536 + 53.8 x_1 + 2.84 x_2$
where y = selling price, x1 = house size, and x2 = lot size
Example: Predicting Selling
Price Using House and Lot Size


One house listed in the data set had
house size = 1240 square feet, lot size =
18,000 square feet and selling price =
$145,000
Find its predicted selling price:
$\hat{y} = -10,536 + 53.8(1240) + 2.84(18,000) = 107,276$
Example: Predicting Selling
Price Using House and Lot Size

Find its residual:
$y - \hat{y} = 145,000 - 107,276 = 37,724$

The residual tells us that the actual selling
price was $37,724 higher than predicted
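A minimal Python sketch (assuming the rounded coefficients shown on these slides) reproduces the prediction and residual; the small discrepancy from the slides' 107,276 comes from rounding the coefficients:

```python
# Prediction equation from the slides: y-hat = -10,536 + 53.8*x1 + 2.84*x2
def predict_price(house_size, lot_size):
    return -10536 + 53.8 * house_size + 2.84 * lot_size

predicted = predict_price(1240, 18000)  # 107,296 here; slides report 107,276
residual = 145000 - predicted           # observed minus predicted: 37,704 here
print(round(predicted), round(residual))
```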
The Number of Explanatory
Variables


You should not use many
explanatory variables in a multiple
regression model unless you have
lots of data
A rough guideline is that the sample
size n should be at least 10 times the
number of explanatory variables
Plotting Relationships

Always look at the data before doing
a multiple regression

Most software has the option of
constructing scatterplots on a single
graph for each pair of variables
• This is called a scatterplot matrix
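As an illustration, a minimal pandas sketch draws a scatterplot matrix; the file name and column names here are hypothetical, not from the slides:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names for the house selling prices data
houses = pd.read_csv("house_selling_prices.csv")
pd.plotting.scatter_matrix(houses[["price", "house_size", "lot_size"]])
plt.show()
```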
Plotting Relationships
Interpretation of Multiple
Regression Coefficients

The simplest way to interpret a multiple regression equation is to look at it in two dimensions, as a function of a single explanatory variable

We can look at it this way by fixing
values for the other explanatory
variable(s)
Interpretation of Multiple
Regression Coefficients
Example using the housing data:
 Suppose we fix x1 = house size at 2000
square feet
 The prediction equation becomes:
$\hat{y} = -10,536 + 53.8(2000) + 2.84 x_2 = 97,022 + 2.84 x_2$
Interpretation of Multiple
Regression Coefficients


Since the slope coefficient of x2 is 2.84,
the predicted selling price for 2000 square
foot houses increases by $2.84 for every
square foot increase in lot size
For a 1000 square-foot increase in lot
size, the predicted selling price of 2000
sq. ft. houses increases by 1000(2.84) =
$2840
Interpretation of Multiple
Regression Coefficients
Example using the housing data:
 Suppose we fix x2 = lot size at 30,000
square feet
 The prediction equation becomes:
$\hat{y} = -10,536 + 53.8 x_1 + 2.84(30,000) = 74,676 + 53.8 x_1$
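A short sketch, using the rounded coefficients from the slides, shows how fixing one predictor reduces the equation to a line in the other (the slides' intercepts 97,022 and 74,676 come from the unrounded fit):

```python
# y-hat = -10,536 + 53.8*x1 + 2.84*x2 (rounded coefficients from the slides)
a, b1, b2 = -10536, 53.8, 2.84

# Fix house size x1 = 2000: a line in lot size with intercept a + b1*2000
print(a + b1 * 2000)   # 97,064 here (slides: 97,022)

# Fix lot size x2 = 30,000: a line in house size with intercept a + b2*30000
print(a + b2 * 30000)  # 74,664 here (slides: 74,676)
```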
Interpretation of Multiple
Regression Coefficients

Since the slope coefficient of x1 is
53.8, the predicted selling price for
houses with a lot size of 30,000 sq.
ft. increases by $53.80 for every
square foot increase in house size
Interpretation of Multiple
Regression Coefficients



In summary, an increase of a square foot in
house size has a larger impact on the
selling price ($53.80) than an increase of a
square foot in lot size ($2.84)
We can compare slopes for these
explanatory variables because their units of
measurement are the same (square feet)
Slopes cannot be compared when the units
differ
Summarizing the Effect While
Controlling for a Variable

The multiple regression model
assumes that the slope for a
particular explanatory variable is
identical for all fixed values of the
other explanatory variables
Summarizing the Effect While
Controlling for a Variable

For example, the coefficient of x1 in the
prediction equation:
$\hat{y} = -10,536 + 53.8 x_1 + 2.84 x_2$
is 53.8 regardless of whether we plug in x2 =
10,000 or x2 = 30,000 or x2 = 50,000
Summarizing the Effect While
Controlling for a Variable
Slopes in Multiple Regression
and in Bivariate Regression

In multiple regression, a slope describes the effect of an explanatory variable while controlling for the effects of the other explanatory variables in the model
Slopes in Multiple Regression
and in Bivariate Regression


Bivariate regression has only a
single explanatory variable
A slope in bivariate regression
describes the effect of that variable
while ignoring all other possible
explanatory variables
Importance of Multiple
Regression

One of the main uses of multiple
regression is to identify potential
lurking variables and control for
them by including them as
explanatory variables in the model
For all students at Walden Univ., the prediction
equation for y = college GPA and x1= H.S. GPA and
x2= study time is:
Find the predicted college GPA of a student
who has a H.S. GPA of 3.5 and who studies 3
hrs. per day.
a. 3.67
b. 3.005
c. 3.175
d. 3.4
For all students at Walden Univ., the prediction
equation for y = college GPA and x1= H.S. GPA and
x2= study time is:
For students with fixed study time, what is
the change in predicted college GPA when
H.S. GPA increases from 3.0 to 4.0?
a. 1.13
b. 0.0078
c. 0.643
d. 1.00
Section 12.2
Extending the Correlation and
R-Squared for Multiple
Regression
Multiple Correlation


To summarize how well a multiple
regression model predicts y, we
analyze how well the observed y
values correlate with the predicted y
values
The multiple correlation is the
correlation between the observed y
values and the predicted y values
• It is denoted by R
Multiple Correlation


For each subject, the regression equation
provides a predicted value
Each subject has an observed y-value and
a predicted y-value
Multiple Correlation

The correlation computed between all
pairs of observed y-values and
predicted y-values is the multiple
correlation, R

The larger the multiple correlation,
the better are the predictions of y by
the set of explanatory variables
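In code, R is just the ordinary correlation between observed and predicted values; a sketch with hypothetical numbers:

```python
import numpy as np

# Multiple correlation R: correlation between observed y and predicted y
y     = np.array([145.0, 160.0, 120.0, 210.0,  98.0])  # hypothetical observations
y_hat = np.array([150.2, 148.9, 131.0, 195.5, 104.3])  # hypothetical predictions
R = np.corrcoef(y, y_hat)[0, 1]
print(round(R, 3))
```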
Multiple Correlation


The R-value always falls between 0
and 1
In this way, the multiple correlation
‘R’ differs from the bivariate
correlation ‘r’ between y and a single
variable x, which falls between -1
and +1
R-squared

For predicting y, the square of R describes the relative improvement from using the prediction equation instead of using the sample mean, $\bar{y}$
R-squared

The error in using the prediction equation
to predict y is summarized by the residual
sum of squares:
$\sum (y - \hat{y})^2$
R-squared

The error in using $\bar{y}$ to predict y is summarized by the total sum of squares:
$\sum (y - \bar{y})^2$
R-squared

The proportional reduction in error is:
$R^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$
R-squared

The better the predictions are using
the regression equation, the larger R2
is

For multiple regression, R2 is the
square of the multiple correlation, R
Example: How Well Can We
Predict House Selling Prices?


For the 100 observations on y =
selling price, x1 = house size, and x2 =
lot size, a table, called the ANOVA
(analysis of variance) table was
created
The table displays the sums of
squares in the SS column
Example: How Well Can We
Predict House Selling Prices?

The R2 value can be created from the sums
of squares in the table
$R^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} = \frac{314,433 - 90,756}{314,433} = 0.711$
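A quick check of this computation, using the sums of squares reported in the ANOVA table:

```python
# R^2 from the ANOVA sums of squares on the slide
total_ss    = 314433  # sum of (y - ybar)^2
residual_ss = 90756   # sum of (y - yhat)^2

r_squared = (total_ss - residual_ss) / total_ss
print(round(r_squared, 3))         # 0.711
print(round(r_squared ** 0.5, 2))  # multiple correlation R = 0.84
```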
Example: How Well Can We
Predict House Selling Prices?

Using house size and lot size
together to predict selling price
reduces the prediction error by 71%,
relative to using y alone to predict
selling price
Example: How Well Can We
Predict House Selling Prices?

Find and interpret the multiple correlation
$R = \sqrt{R^2} = \sqrt{0.711} = 0.84$


There is a strong association between the
observed and the predicted selling prices
House size and lot size very much help us
to predict selling prices
Example: How Well Can We
Predict House Selling Prices?

If we used a bivariate regression
model to predict selling price with
house size as the predictor, the r2
value would be 0.58

If we used a bivariate regression
model to predict selling price with lot
size as the predictor, the r2 value
would be 0.51
Example: How Well Can We
Predict House Selling Prices?

The multiple regression model has R2 = 0.71, so it provides better predictions than either bivariate model
Properties of R2


The previous example showed that R2
for the multiple regression model was
larger than r2 for a bivariate model
using only one of the explanatory
variables
A key property of R2 is that it cannot decrease when predictors are added to a model
Properties of R2

R2 falls between 0 and 1

The larger the value, the better the
explanatory variables collectively predict y

R2 = 1 only when all residuals are 0, that is, when all regression predictions are perfect

R2 = 0 when the correlation between y and
each explanatory variable equals 0
Properties of R2

R2 gets larger, or at worst stays the
same, whenever an explanatory
variable is added to the multiple
regression model

The value of R2 does not depend on
the units of measurement
R2 Values for Various Multiple
Regression Models
R2 Values for Various Multiple
Regression Models

 The single predictor in the data set that is most strongly associated with y is the house’s real estate tax assessment (r2 = 0.679)
 When we add house size as a second predictor, R2 goes up from 0.679 to 0.730
 As other predictors are added, R2 continues to go up, but not by much
R2 Values for Various Multiple
Regression Models

R2 does not increase much after a few
predictors are in the model

When there are many explanatory
variables but the correlations among
them are strong, once you have included
a few of them in the model, R2 usually
doesn’t increase much more when you
add additional ones
R2 Values for Various Multiple
Regression Models

This does not mean that the additional
variables are uncorrelated with the
response variable

It merely means that they don’t add much
new power for predicting y, given the
values of the predictors already in the
model
In a data set used to predict body weight (in pounds), three
predictors were used: height, percent body fat and age.
Their correlations with total body weight were:
Height: 0.745
Percent body fat: 0.390
Age: -0.187
Which explanatory variable gives by itself the
best prediction of weight?
a. Height
b. Percent body fat
c. Age
In a data set used to predict body weight (in pounds), three
predictors were used: height, percent body fat and age.
Their correlations with total body weight were:
Height: 0.745
Percent body fat: 0.390
Age: -0.187
With height as the sole predictor, what is r2?
a. .745
b. .555
c. .625
d. .825
In a data set used to predict body weight (in pounds), three
predictors were used: height, percent body fat and age.
Their correlations with total body weight were:
Height: 0.745
Percent body fat: 0.390
Age: -0.187
If Percent Body Fat is added to the model R2 =
0.66. If Age is then added to the model
R2=0.67. Once you know height and % body
fat, does age seem to help in predicting
weight?
a. No
b. Yes
Section 12.3
How Can We Use Multiple
Regression to Make Inferences?
Inferences about the Population

Assumptions required when using a
multiple regression model to make
inferences about the population:
• The regression equation truly holds for the population means
• This implies that there is a straight-line relationship between the mean of y and each explanatory variable, with the same slope at each value of the other predictors
Inferences about the Population

Assumptions required when using a
multiple regression model to make
inferences about the population:
• The data were gathered using randomization
• The response variable y has a normal distribution at each combination of values of the explanatory variables, with the same standard deviation
Inferences about Individual
Regression Parameters




Consider a particular parameter, β1
If β1= 0, the mean of y is identical for all
values of x1, at fixed values of the other
explanatory variables
So, H0: β1= 0 states that y and x1 are
statistically independent, controlling for the
other variables
This means that once the other explanatory
variables are in the model, it doesn’t help to
have x1 in the model
Significance Test about a
Multiple Regression Parameter
1. Assumptions:
• Each explanatory variable has a straight-line relation with µy, with the same slope for all combinations of values of other predictors in the model
• Data gathered with randomization
• Normal distribution for y with same standard deviation at each combination of values of other predictors in model
Significance Test about a
Multiple Regression Parameter
2. Hypotheses:
• H0: β1 = 0
• Ha: β1 ≠ 0
• When H0 is true, y is independent of x1, controlling for the other predictors
Significance Test about a
Multiple Regression Parameter
3. Test Statistic:
b 0
t 
se
1
Agresti/Franklin Statistics, 64 of 141
Significance Test about a
Multiple Regression Parameter
4. P-value: Two-tail probability from the t-distribution of values larger than the observed t test statistic (in absolute value)
The t-distribution has:
df = n – number of parameters in the regression equation
Significance Test about a
Multiple Regression Parameter
5. Conclusion: Interpret P-value;
compare to significance level if
decision needed
Example: What Helps Predict a
Female Athlete’s Weight?


The “College Athletes” data set
comes from a study of 64 University
of Georgia female athletes
The study measured several physical
characteristics, including total body
weight in pounds (TBW), height in
inches (HGT), the percent of body fat
(%BF) and age
Example: What Helps Predict a
Female Athlete’s Weight?

The results of fitting a multiple regression
model for predicting weight using the
other variables:
Example: What Helps Predict a
Female Athlete’s Weight?

Interpret the effect of age on weight in the
multiple regression equation:
Let $\hat{y}$ = predicted weight, x1 = height, x2 = % body fat, and x3 = age
Then $\hat{y} = -97.7 + 3.43 x_1 + 1.36 x_2 - 0.96 x_3$
Example: What Helps Predict a
Female Athlete’s Weight?

The slope coefficient of age is -0.96

For athletes having fixed values for x1
and x2, the predicted weight
decreases by 0.96 pounds for a 1-year
increase in age, and the ages vary
only between 17 and 23
Example: What Helps Predict a
Female Athlete’s Weight?

Run a hypothesis test to determine
whether age helps to predict weight, if
you already know height and percent
body fat
Example: What Helps Predict a
Female Athlete’s Weight?
1. Assumptions:
• The 64 female athletes were a convenience sample, not a random sample
• Caution should be taken when making inferences about all female college athletes
Example: What Helps Predict a
Female Athlete’s Weight?
2. Hypotheses:
• H0: β3= 0
• Ha: β3≠ 0
3. Test statistic:
$t = \frac{b_3 - 0}{se} = \frac{-0.960}{0.648} = -1.48$
Example: What Helps Predict a
Female Athlete’s Weight?
4. P-value: This value is reported in
the output as 0.14
5. Conclusion:
• The P-value of 0.14 does not give much evidence against the null hypothesis that β3 = 0
• Age does not significantly predict weight if we already know height and % body fat
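A sketch of this t-test with scipy, using the estimate and standard error from the slides (df = 64 − 4 = 60):

```python
from scipy import stats

b3, se, df = -0.960, 0.648, 60         # estimate, standard error, df
t = (b3 - 0) / se                      # test statistic for H0: beta3 = 0
p_value = 2 * stats.t.sf(abs(t), df)   # two-tail probability
print(round(t, 2), round(p_value, 2))  # -1.48 0.14
```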
Confidence Interval for a
Multiple Regression Parameter

A 95% confidence interval for a β slope
parameter in multiple regression equals:
$\text{Estimated slope} \pm t_{.025}(se)$

The t-score has:
df = (n - # of parameters in the model)
Example: What’s Plausible for
the Effect of Age on Weight?

Construct and interpret a 95% CI for β3, the
effect of age while controlling for height and
% body fat
$b_3 \pm t_{.025}(se) = -0.96 \pm 2.00(0.648) = -0.96 \pm 1.30 = (-2.3, 0.3)$
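The same interval as a scipy sketch; the t.025 critical value with df = 60 is about 2.00:

```python
from scipy import stats

b3, se, df = -0.960, 0.648, 60
t_crit = stats.t.ppf(0.975, df)          # about 2.00
lower, upper = b3 - t_crit * se, b3 + t_crit * se
print(round(lower, 1), round(upper, 1))  # -2.3 0.3
```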
Example: What’s Plausible for
the Effect of Age on Weight?


At fixed values of x1 and x2, we infer
that the population mean of weight
changes very little (and maybe not at
all) for a 1-year increase in age
The confidence interval contains 0
• Age may have no effect on weight, once
we control for height and % body fat
Estimating Variability Around
the Regression Equation


A standard deviation parameter, σ, describes
variability of the observations around the
regression equation
Its sample estimate is:
$s = \sqrt{\frac{\text{Residual SS}}{df}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - \text{(number of parameters in regression equation)}}}$
Example: Estimating Variability
of Female Athletes’ Weight

ANOVA table for the “college athletes” data set:
Example: Estimating Variability
of Female Athletes’ Weight


For female athletes at particular values of height,
% of body fat, and age, estimate the standard
deviation of their weights
Begin by finding the Mean Square Error:
$s^2 = \frac{\text{residual SS}}{df} = \frac{6131.0}{60} = 102.2$

Notice that this value (102.2) appears in the MS
column in the ANOVA table
Example: Estimating Variability
of Female Athletes’ Weight

The standard deviation is:
$s = \sqrt{102.2} = 10.1$

This value is also displayed in the ANOVA table

For athletes with certain fixed values of height, %
body fat, and age, the weights vary with a
standard deviation of about 10 pounds
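A sketch of this calculation, with the residual SS and df taken from the ANOVA table:

```python
import math

residual_ss, df = 6131.0, 60  # df = n - number of parameters = 64 - 4
mse = residual_ss / df        # mean square error, about 102.2
s = math.sqrt(mse)            # about 10.1 pounds
print(round(mse, 1), round(s, 1))
```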
Example: Estimating Variability
of Female Athletes’ Weight

If the conditional distributions of
weight are approximately bell-shaped,
about 95% of the weight values fall
within about 2s = 20 pounds of the
true regression line
Do the Explanatory Variables
Collectively Have an Effect?

Example: With 3 predictors in a model, we
can check this by testing:
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
$H_a$: At least one β parameter ≠ 0
Do the Explanatory Variables
Collectively Have an Effect?

The test statistic for H0 is denoted by F
$F = \frac{\text{Mean square for regression}}{\text{Mean square error}}$
Do the Explanatory Variables
Collectively Have an Effect?



When H0 is true, the expected value of
the F test statistic is approximately 1
When H0 is false, F tends to be larger
than 1
The larger the F test statistic, the
stronger the evidence against H0
Summary of F Test That All β Parameters = 0
1. Assumptions: Multiple regression equation holds, data gathered randomly, normal distribution for y with same standard deviation at each combination of predictors
Summary of F Test That All β Parameters = 0
2. Hypotheses:
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
$H_a$: At least one β parameter ≠ 0
3. Test statistic:
$F = \frac{\text{Mean square for regression}}{\text{Mean square error}}$
Summary of F Test That All β Parameters = 0
4. P-value: Right-tail probability above the observed F test statistic value from the F-distribution with:
• df1 = number of explanatory variables
• df2 = n – (number of parameters in regression equation)
Summary of F Test That All β Parameters = 0
5. Conclusion: The smaller the P-value, the stronger the evidence that at least one explanatory variable has an effect on y
• If a decision is needed, reject H0 if P-value ≤ significance level, such as 0.05
Example: The F-Test for
Predictors of Athletes’ Weight

For the 64 female college athletes, the
regression model for predicting y =
weight using x1 = height, x2 = % body
fat and x3 = age is summarized in the
ANOVA table on the next page
Example: The F-Test for
Predictors of Athletes’ Weight
Example: The F-Test for
Predictors of Athletes’ Weight

Use the output in the ANOVA table to test
the hypothesis:
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
$H_a$: At least one β parameter ≠ 0
Example: The F-Test for
Predictors of Athletes’ Weight




The observed F statistic is 40.48
The corresponding P-value is 0.000
We can reject H0 at the 0.05
significance level
We conclude that at least one
predictor has an effect on weight
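As a sketch, the P-value for the reported F statistic can be recomputed from the F-distribution with df1 = 3 and df2 = 60:

```python
from scipy import stats

F, df1, df2 = 40.48, 3, 60         # reported F; df1 = predictors, df2 = n - 4
p_value = stats.f.sf(F, df1, df2)  # right-tail probability
print(p_value)                     # essentially 0, matching the reported 0.000
```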
Example: The F-Test for
Predictors of Athletes’ Weight



The F-test tells us that at least one
explanatory variable has an effect
If the explanatory variables are chosen
sensibly, at least one should have some
predictive power
The F-test result tells us whether there is
sufficient evidence to make it worthwhile to
consider the individual effects, using t-tests
Example: The F-Test for
Predictors of Athletes’ Weight

The individual t-tests identify which of the
variables are significant (controlling for
the other variables)
Example: The F-Test for
Predictors of Athletes’ Weight


If a variable turns out not to be
significant, it can be removed from
the model
In this example, ‘age’ can be removed
from the model
Section 12.4
Checking a Regression Model Using
Residual Plots
Assumptions for Inference with
a Multiple Regression Model
• The regression equation approximates well the true relationship between the predictors and the mean of y
• The data were gathered randomly
• y has a normal distribution with the same standard deviation at each combination of predictors
Checking Shape and Detecting
Unusual Observations

To test Assumption 3 (the conditional
distribution of y is normal at any fixed
values of the explanatory variables):
• Construct a histogram of the standardized residuals
• The histogram should be approximately bell-shaped
• Nearly all the standardized residuals should fall between -3 and +3. Any residual outside these limits is a potential outlier
Example: Residuals for House
Selling Price

For the house selling price data, a
MINITAB histogram of the
standardized residuals for the
multiple regression model predicting
selling price by the house size and
the lot size was created and is
displayed on the following page
Example: Residuals for House
Selling Price
Example: Residuals for House
Selling Price



The residuals are roughly bell shaped
about 0
They fall between about -3 and +3
No severe nonnormality is indicated
Plotting Residuals against Each
Explanatory Variable

Plots of residuals against each explanatory
variable help us check for potential
problems with the regression model

Ideally, the residuals should fluctuate
randomly about 0

There should be no obvious change in trend or change in variation as the value of the explanatory variable increases
Plotting Residuals against Each
Explanatory Variable
Section 12.5
How Can Regression Include
Categorical Predictors?
Indicator Variables


Regression models specify
categories of a categorical
explanatory variable using artificial
variables, called indicator variables
The indicator variable for a particular
category is binary
• It equals 1 if the observation falls into that
category and it equals 0 otherwise
Indicator Variables

In the house selling prices data set,
the city region in which a house is
located is a categorical variable

The indicator variable x for region is
• x = 1 if house is in NW (northwest region)
• x = 0 if house is not in NW
Indicator Variables

The coefficient β of the indicator
variable x is the difference between
the mean selling prices for homes in
the NW and for homes not in the NW
Example: Including Region in
Regression for House Selling Price

Output from the regression model for selling
price of home using house size and region
Example: Including Region in
Regression for House Selling Price

Find and plot the lines showing how predicted selling price varies as a function of house size, for homes in the NW and for homes not in the NW
Example: Including Region in
Regression for House Selling Price

The regression equation from the
MINITAB output is:
$\hat{y} = -15,258 + 78.0 x_1 + 30,569 x_2$
Example: Including Region in
Regression for House Selling Price


For homes not in the NW, x2 = 0
The prediction equation then simplifies to:
$\hat{y} = -15,258 + 78.0 x_1 + 30,569(0) = -15,258 + 78.0 x_1$
Example: Including Region in
Regression for House Selling Price


For homes in the NW, x2 = 1
The prediction equation then simplifies to:
$\hat{y} = -15,258 + 78.0 x_1 + 30,569(1) = 15,311 + 78.0 x_1$
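A minimal sketch of the indicator-variable equation from the slides, showing the two parallel lines:

```python
# y-hat = -15,258 + 78.0*x1 + 30,569*x2, where x2 = 1 for NW homes, 0 otherwise
def predict(house_size, nw):
    return -15258 + 78.0 * house_size + 30569 * nw

for nw in (0, 1):
    print(nw, predict(2000, nw))  # same slope; the NW line sits $30,569 higher
```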
Example: Including Region in
Regression for House Selling Price
Example: Including Region in
Regression for House Selling Price



Both lines have the same slope, 78
For homes in the NW and for homes
not in the NW, the predicted selling
price increases by $78 for each
square-foot increase in house size
The figure portrays a separate line for
each category of region (NW, not NW)
Example: Including Region in
Regression for House Selling Price

The coefficient of the indicator variable is 30,569

For any fixed value of house size, we
predict that the selling price is
$30,569 higher for homes in the NW
Example: Including Region in
Regression for House Selling Price

The line for homes in the NW is above the
line for homes not in the NW

The predicted selling price is higher for
homes in the NW

The P-value of 0.000 for the test for the
coefficient of the indicator variable
suggests that this difference is statistically
significant
Is There Interaction?

For two explanatory variables,
interaction exists between them in
their effects on the response variable
when the slope of the relationship
between µy and one of them changes
as the value of the other changes
Is There Interaction?
Section 12.6
How Can We Model a Categorical
Response?
Modeling a Categorical
Response Variable

When y is categorical, a different
regression model applies, called a
logistic regression
Examples of Logistic
Regression

A voter’s choice in an election (Democrat
or Republican), with explanatory
variables: annual income, political
ideology, religious affiliation, and race

Whether a credit card holder pays their
bill on time (yes or no), with explanatory
variables: family income and the number
of months in the past year that the
customer paid the bill on time
The Logistic Regression Model



Denote the possible outcomes for y as 0 and 1
Use the generic terms failure (for outcome = 0)
and success (for outcome =1)
The population mean of the scores equals the
population proportion of ‘1’ outcomes
(successes)
• That is, µy = p
The proportion, p, also represents the probability
that a randomly selected subject has a success
outcome
The Logistic Regression Model

The straight-line model is usually
inadequate

A more realistic model has a curved
S-shape instead of a straight-line
trend
The Logistic Regression Model

A regression equation for an S-shaped curve for the probability of success p is:
$p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$
Example: Annual Income and
Having a Travel Credit Card

An Italian study with 100 randomly
selected Italian adults considered factors
that are associated with whether a person
possesses at least one travel credit card

The table on the next page shows results
for the first 15 people on this response
variable and on the person’s annual
income (in thousands of euros)
Example: Annual Income and
Having a Travel Credit Card
Example: Annual Income and
Having a Travel Credit Card

Let x = annual income and let y = whether
the person possesses a travel credit card
(1 = yes, 0 = no)
Example: Annual Income and
Having a Travel Credit Card

Substituting the α and β estimates into the
logistic regression model formula yields:
( 3.52 0.105 x )
e
pˆ 
1 e
( 3.52 0.105 x )
Agresti/Franklin Statistics, 129 of 141
Example: Annual Income and
Having a Travel Credit Card

Find the estimated probability of
possessing a travel credit card at the
lowest and highest annual income
levels in the sample, which were x =
12 and x = 65
Example: Annual Income and
Having a Travel Credit Card

For x = 12 thousand euros, the estimated
probability of possessing a travel credit card is:
3.52 0.105 ( 12 )
2.26
e
e
pˆ 

1 e
1 e
0.104

 0.09
1.104
3.521.05 ( 12 )
Agresti/Franklin Statistics, 131 of 141
 2.26
Example: Annual Income and
Having a Travel Credit Card

For x = 65 thousand euros, the estimated
probability of possessing a travel credit card is:
3.52 0.105 ( 65 )
3.305
e
e
pˆ 

1 e
1 e
27.2485

 0.97
28.2485
3.521.05 ( 65 )
Agresti/Franklin Statistics, 132 of 141
3.305
Example: Annual Income and
Having a Travel Credit Card


Annual income has a strong positive
effect on having a credit card
The estimated probability of having a
travel credit card changes from 0.09
to 0.97 as annual income changes
over its range
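A sketch of the fitted logistic curve from the slides, evaluated at the lowest and highest incomes in the sample:

```python
import math

# p-hat = e^(-3.52 + 0.105x) / (1 + e^(-3.52 + 0.105x)), rounded coefficients
def p_hat(income):
    z = -3.52 + 0.105 * income
    return math.exp(z) / (1 + math.exp(z))

print(round(p_hat(12), 2))  # about 0.09
print(round(p_hat(65), 2))  # 0.96 with rounded coefficients (slides: 0.97)
```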
Example: Estimating Proportion of
Students Who’ve Used Marijuana

A three-variable contingency table from a survey of senior high-school students is shown on the next page

The students were asked whether
they had ever used: alcohol,
cigarettes or marijuana
Example: Estimating Proportion of
Students Who’ve Used Marijuana
Example: Estimating Proportion of
Students Who’ve Used Marijuana

Let y indicate marijuana use, coded:
(1 = yes, 0 = no)

Let x1 be an indicator variable for
alcohol use (1 = yes, 0 = no)

Let x2 be an indicator variable for
cigarette use (1 = yes, 0 = no)
Example: Estimating Proportion of
Students Who’ve Used Marijuana
Example: Estimating Proportion of
Students Who’ve Used Marijuana

The logistic regression prediction
equation is:
5.31 2.99 x1  2.85 x2
e
pˆ 
1 e
5.31 2.99 x1  2.85 x2
Agresti/Franklin Statistics, 138 of 141
Example: Estimating Proportion of
Students Who’ve Used Marijuana

For those who have not used alcohol or
cigarettes, x1= x2 = 0 and:
5.31 2.99 ( 0 )  2.85 ( 0 )
e
pˆ 
1 e
5.31 2.99 ( 0 )  2.85 ( 0 )
 0.005
Agresti/Franklin Statistics, 139 of 141
Example: Estimating Proportion of
Students Who’ve Used Marijuana

For those who have used alcohol and
cigarettes, x1= x2 = 1 and:
5.31 2.99 ( 1 )  2.85 ( 1 )
e
pˆ 
1 e
5.31 2.99 ( 1 )  2.85 ( 1 )
 0.628
Agresti/Franklin Statistics, 140 of 141
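The same model with the two indicator predictors, as a sketch:

```python
import math

# p-hat = e^(-5.31 + 2.99*x1 + 2.85*x2) / (1 + e^(same)), rounded coefficients
def p_marijuana(alcohol, cigarettes):
    z = -5.31 + 2.99 * alcohol + 2.85 * cigarettes
    return math.exp(z) / (1 + math.exp(z))

print(round(p_marijuana(0, 0), 3))  # about 0.005: used neither substance
print(round(p_marijuana(1, 1), 3))  # about 0.63 (slides: 0.628)
```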
Example: Estimating Proportion of
Students Who’ve Used Marijuana

The probability that students have
tried marijuana seems to depend
greatly on whether they’ve used
alcohol and cigarettes
Agresti/Franklin Statistics, 141 of 141