SW388R7
Data Analysis &
Computers II
Slide 1
Multinomial Logistic Regression:
Complete Problems
Outliers and Influential Cases
Split-sample Validation
Sample Problems
SW388R7
Data Analysis &
Computers II
Outliers and Influential Cases
Slide 2
Multinomial logistic regression in SPSS does not compute any
diagnostic statistics.
In the absence of diagnostic statistics, SPSS recommends using
the Logistic Regression procedure to calculate and examine
diagnostic measures.
A multinomial logistic regression for three groups compares
group 1 to group 3 and group 2 to group 3. To test for outliers
and influential cases, we will run two binary logistic
regressions, using case selection to compare group 1 to group 3
and group 2 to group 3.
From both of these analyses we will identify a list of cases with
standardized residuals greater than 3 and Cook's distance
greater than 1.0, and test the multinomial solution without
these cases. If the accuracy rate for the model omitting these cases is not at
least 2% higher than the accuracy rate for the model with all cases, we will
interpret the model that includes all cases.
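As a rough sketch in SPSS syntax of how this screening might be done for the comparison of group 1 to group 3 (assuming the variable names used in this problem, the default names coo_1 and zre_1 that SPSS gives the saved diagnostics, and an illustrative flag variable outlier_$; the later slides carry out the same steps through the menus):

USE ALL.
* Restrict the analysis to groups 1 and 3 of the dependent variable.
COMPUTE filter_$ = (natfare = 1 OR natfare = 3).
FILTER BY filter_$.
* Binary logistic regression, saving Cook's distance and the standardized residual.
LOGISTIC REGRESSION VARIABLES natfare
  /METHOD=ENTER hrs1 wrkslf educ rincom98
  /SAVE=COOK ZRESID.
* Flag cases with a standardized residual larger than 3 or Cook's distance above 1.0.
USE ALL.
COMPUTE outlier_$ = (ABS(zre_1) > 3 OR coo_1 > 1).
EXECUTE.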
SW388R7
Data Analysis &
Computers II
80-20 Cross-validation Strategy
Slide 3
In this validation strategy, the cases are randomly divided into two
subsets: a training sample containing 80% of the cases and a holdout
sample containing the remaining 20% of the cases.
The training sample is used to derive the multinomial logistic
regression model. The holdout sample is classified using the
coefficients for the training model. The classification accuracy for the
holdout sample is used to estimate how well the model based on the
training sample will perform for the population represented by the
data set.
If the classification accuracy rate for the holdout sample is no more than 10%
lower than the accuracy rate for the training sample (that is, greater than
0.90 * the training accuracy rate), this is deemed sufficient evidence of the
utility of the logistic regression model.
In addition to satisfying the classification accuracy criterion, we will require
that the significance of the overall relationship and the relationships
with individual predictors for the training sample match the
significance results for the model using the full data set.
SW388R7
Data Analysis &
Computers II
80-20 Cross-validation Strategy
Slide 4
SPSS does not classify cases that are not included in the training
sample, so we will have to manually compute the classifications
for the holdout sample if we want to use this strategy.
We will run the analysis for the training sample, use the
coefficients from the training sample analysis to compute
classification scores (log of the odds) for each group, compute
the probabilities that correspond to each group defined by the
dependent variable, and classify the case in the group with the
highest probability.
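As a preview, a minimal sketch of these computations in SPSS syntax, using rounded coefficients taken from the training-sample output developed on later slides (the full-precision values and the dummy-coded variable WRKSLF1 are worked out there):

* Classification scores (log of the odds) for each group; the reference group is zero by definition.
COMPUTE G1 = -1.302 + 0.026*HRS1 + 0.175*EDUC - 0.087*RINCOM98 - 2.519*WRKSLF1.
COMPUTE G2 = -1.798 - 0.025*HRS1 + 0.328*EDUC - 0.074*RINCOM98 - 1.349*WRKSLF1.
COMPUTE G3 = 0.
* Convert the scores to probabilities that sum to 1 across the three groups.
COMPUTE P1 = EXP(G1) / (EXP(G1) + EXP(G2) + EXP(G3)).
COMPUTE P2 = EXP(G2) / (EXP(G1) + EXP(G2) + EXP(G3)).
COMPUTE P3 = EXP(G3) / (EXP(G1) + EXP(G2) + EXP(G3)).
* Classify each case into the group with the highest probability.
IF (P1 > P2 AND P1 > P3) PREDGRP = 1.
IF (P2 > P1 AND P2 > P3) PREDGRP = 2.
IF (P3 > P1 AND P3 > P2) PREDGRP = 3.
EXECUTE.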
SW388R7
Data Analysis &
Computers II
Problem 1
Slide 5
10. In the dataset GSS2000, is the following statement true, false, or an incorrect application
of a statistic? Assume that there is no problem with missing data. Use a level of significance of
0.05 for evaluating the statistical relationship. Test the generalizability of the logistic
regression model with a cross-validation analysis using an 80% random sample of the data set as
a training sample. Use 892776 as the random number seed.
The variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf],
"highest year of school completed" [educ] and "income" [rincom98] were useful predictors for
distinguishing between groups based on responses to "opinion about spending on welfare"
[natfare]. These predictors differentiate survey respondents who thought we spend too little
money on welfare from survey respondents who thought we spend too much money on welfare
and survey respondents who thought we spend about the right amount of money on welfare
from survey respondents who thought we spend too much money on welfare.
Among this set of predictors, self-employment was helpful in distinguishing among the groups
defined by responses to opinion about spending on welfare. Survey respondents who were self-employed were 84.3% less likely to be in the group of survey respondents who thought we spend
too little money on welfare, rather than the group of survey respondents who thought we spend
too much money on welfare.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
SW388R7
Data Analysis &
Computers II
Dissecting problem 1 - 1
Slide 6
(The problem statement and answer choices from slide 5 are repeated on this slide, with annotations.)
For these problems, we will assume that there is no problem with missing data.
In this problem, we are told to use 0.05 as alpha for the logistic regression.
We are also told to do an 80-20 cross-validation, using 892776 as the random number seed.
SW388R7
Data Analysis &
Computers II
Dissecting problem 1 - 2
Slide 7
(The problem statement and answer choices from slide 5 are repeated on this slide, with annotations.)
The variables listed first in the problem statement are the independent variables (IVs): "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ] and "income" [rincom98].
The variable used to define groups is the dependent variable (DV): "opinion about spending on welfare" [natfare].
SPSS only supports direct or simultaneous entry of independent variables in multinomial logistic regression, so we have no choice of method for entering variables.
SW388R7
Data Analysis &
Computers II
Dissecting problem 1 - 3
Slide 8
SPSS multinomial logistic regression models the relationship by
comparing each of the groups defined by the dependent variable to the
group with the highest code value.
(The problem statement and answer choices from slide 5 are repeated on this slide, with annotations.)
The responses to opinion about spending on welfare were: 1 = Too little, 2 = About right, and 3 = Too much.
The analysis will result in two comparisons:
• survey respondents who thought we spend too little money on welfare versus survey respondents who thought we spend too much money on welfare
• survey respondents who thought we spend about the right amount of money versus survey respondents who thought we spend too much money on welfare
SW388R7
Data Analysis &
Computers II
Dissecting problem 1 - 4
Slide 9
(The problem statement and answer choices from slide 5 are repeated on this slide, with annotations.)
Each problem includes a statement about the relationship between one independent variable and the dependent variable. The answer to the problem is based on the stated relationship, ignoring the relationships between the other independent variables and the dependent variable.
This problem identifies a difference between the group who thought we spend too little money on welfare versus the group that thought we spend too much money on welfare.
SW388R7
Data Analysis &
Computers II
Dissecting problem 1 - 5
Slide 10
(The problem statement and answer choices from slide 5 are repeated on this slide, with annotations.)
In order for the multinomial logistic regression question to be true, the overall relationship must be statistically significant, there must be no evidence of numerical problems, the classification accuracy rate must be substantially better than could be obtained by chance alone, and the stated individual relationship must be statistically significant and interpreted correctly.
SW388R7
Data Analysis &
Computers II
Slide 11
Request multinomial logistic regression
for baseline model
Select the Regression |
Multinomial Logistic…
command from the
Analyze menu.
SW388R7
Data Analysis &
Computers II
Selecting the dependent variable
Slide 12
First, highlight the
dependent variable
natfare in the list
of variables.
Second, click on the right
arrow button to move the
dependent variable to the
Dependent text box.
SW388R7
Data Analysis &
Computers II
Selecting metric independent variables
Slide 13
Metric independent variables are specified as covariates
in multinomial logistic regression. Metric variables can
be either interval or, by convention, ordinal.
Move the metric
independent variables,
hrs1, educ and rincom98
to the Covariate(s) list
box.
SW388R7
Data Analysis &
Computers II
Selecting non-metric independent variables
Slide 14
Non-metric independent variables are specified as
factors in multinomial logistic regression. Non-metric
variables will automatically be dummy-coded.
Move the non-metric
independent variable,
wrkslf, to the Factor(s)
list box.
SW388R7
Data Analysis &
Computers II
Specifying statistics to include in the output
Slide 15
While we will accept most of
the SPSS defaults for the
analysis, we need to specifically
request the classification table.
Click on the Statistics… button
to make a request.
SW388R7
Data Analysis &
Computers II
Requesting the classification table
Slide 16
First, keep the SPSS
defaults for Summary
statistics, Likelihood
ratio test, and
Parameter estimates.
Second, mark the
checkbox for the
Classification table.
Third, click
on the
Continue
button to
complete the
request.
SW388R7
Data Analysis &
Computers II
Slide 17
Completing the multinomial
logistic regression request
Click on the OK
button to request
the output for the
multinomial logistic
regression.
The multinomial logistic procedure supports
additional commands to specify the model
computed for the relationships (we will use the
default main effects model), additional
specifications for computing the regression,
and saving classification results. We will not
make use of these options.
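For reference, the same request can be made in SPSS syntax. This is a sketch under the assumption that the dialog choices above translate to the NOMREG command with its default main-effects model and default reference category (the highest code value); the print keywords shown are assumptions and should be checked against the SPSS syntax reference:

NOMREG natfare BY wrkslf WITH hrs1 educ rincom98
  /PRINT=CLASSTABLE SUMMARY LRT PARAMETER.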
SW388R7
Data Analysis &
Computers II
LEVEL OF MEASUREMENT - 1
Slide 18
(The problem statement and answer choices from slide 5 are repeated on this slide, with annotations.)
Multinomial logistic regression requires that the dependent variable be non-metric and the independent variables be metric or dichotomous.
"Opinion about spending on welfare" [natfare] is ordinal, satisfying the non-metric level of measurement requirement for the dependent variable. It contains three categories: survey respondents who thought we spend too little money, about the right amount of money, and too much money on welfare.
SW388R7
Data Analysis &
Computers II
LEVEL OF MEASUREMENT - 2
Slide 19
"Number of hours worked in the past
"Self-employment" [wrkslf]
week" [hrs1] and "highest year of
is dichotomous, satisfying
[educ] are
interval,
10. school
In thecompleted"
dataset GSS2000,
is the
following statement true, false,
or anorincorrect
application
the metric
dichotomous
the metric or dichotomous
of asatisfying
statistic?
Assume
that
there
is
no
problem
with
missing
data.
Use
a
level
of
significance
of
level of measurement
level of measurement requirement for
0.05independent
for evaluating
the statistical relationship. Test the generalizability
offor
theindependent
logistic
requirement
variables.
regression model with a cross-validation analysis using a 80% random
sample of the data set as
variables.
a training sample. Use 892776 as the random number seed.
The variables "number of hours worked in the past week" [hrs1], "self-employment"
[wrkslf], "highest year of school completed" [educ] and "income" [rincom98] were useful
predictors for distinguishing between groups based on responses to "opinion about spending on
welfare" [natfare]. These predictors differentiate survey respondents who thought we spend too
little money on welfare from survey respondents who thought we spend too much money on
welfare and survey respondents who thought we spend about the right amount of money on
"Income"
[rincom98]
is ordinal,
the
welfare from survey
respondents
who thought
wesatisfying
spend too
much money on welfare.
metric or dichotomous level of measurement
requirement for independent variables. If we follow
convention
of treating ordinal
variables
as
Among this set of the
predictors,
self-employment
waslevel
helpful
in distinguishing
among the groups
metric
variables,
the
level
of
measurement
defined by responses to opinion about spending on welfare. Survey respondents who were selfrequirement
forto
the
is satisfied.
Since
employed were 84.3%
less likely
beanalysis
in the group
of survey
respondents who thought we spend
some
data
analysts
do
not
agree
with
this
too little money on welfare, rather than the group of survey respondents who thought we spend
convention,
too much money on
welfare. a note of caution should be included in
our interpretation.
1.
2.
3.
4.
True
True with caution
False
Inappropriate application of a statistic
SW388R7
Data Analysis &
Computers II
Sample size – ratio of cases to variables
Slide 20
Case Processing Summary

                                     N     Marginal Percentage
WELFARE                   1          56    40.6%
                          2          50    36.2%
                          3          32    23.2%
R SELF-EMP OR WORKS       1          17    12.3%
FOR SOMEBODY              2          121   87.7%
Valid                                138   100.0%
Missing                              132
Total                                270
Subpopulation                        123(a)

a. The dependent variable has only one value observed in 115 (93.5%) subpopulations.

Multinomial logistic regression requires that the minimum ratio of valid cases to independent variables be at least 10 to 1. The ratio of valid cases (138) to number of independent variables (4) was 34.5 to 1, which was equal to or greater than the minimum ratio. The requirement for a minimum ratio of cases to independent variables was satisfied.
The preferred ratio of valid cases to independent variables is 20 to 1. The ratio of 34.5 to 1 was equal to or greater than the preferred ratio. The preferred ratio of cases to independent variables was satisfied.
SW388R7
Data Analysis &
Computers II
Classification accuracy for all cases
Slide 21
Classification

Observed             Predicted 1   Predicted 2   Predicted 3   Percent Correct
1                    40            13            3             71.4%
2                    22            27            1             54.0%
3                    16            11            5             15.6%
Overall Percentage   56.5%         37.0%         6.5%          52.2%

With all cases, including those that might be identified as outliers or influential cases, the accuracy rate was 52.2%.
We note this to compare with the classification accuracy after removing outliers and influential cases.
SW388R7
Data Analysis &
Computers II
Slide 22
Outliers and influential cases for the
comparison of groups 1 and 3
Since multinomial logistic regression
does not identify outliers or influential
cases, we will use binary logistic
regressions to identify them.
Choose the Select Cases… command
from the Data menu to include only
groups 1 and 3 in the analysis.
SW388R7
Data Analysis &
Computers II
Selecting groups 1 and 3
Slide 23
First, mark the If
condition is
satisfied option
button.
Second, click on the
IF… button to specify
the condition.
SW388R7
Data Analysis &
Computers II
Formula for selecting groups 1 and 3
Slide 24
To include only groups 1 and 3 in the
analysis, we enter the formula to
include cases that had a value of 1 for
natfare or a value of 3 for natfare.
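As a sketch, the condition typed into the If dialog might read (assuming the formula is written directly in terms of the dependent variable):

natfare = 1 OR natfare = 3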
After completing the formula,
click on the Continue button
to close the dialog box.
SW388R7
Data Analysis &
Computers II
Completing the selection of groups 1 and 3
Slide 25
To activate the
selection, click on
the OK button.
SW388R7
Data Analysis &
Computers II
Slide 26
Binary logistic regression comparing
groups 1 and 3
Select the Regression |
Binary Logistic…
command from the
Analyze menu.
SW388R7
Data Analysis &
Computers II
Slide 27
Dependent and independent variables for the
comparison of groups 1 and 3
First, move the
dependent variable
natfare to the Dependent
variable text box.
Second, move the
independent variables,
hrs1, wrkslf, educ, and
rincom98 to the
Covariates list box.
Third, click on the Save… button
to request the inclusion of
standardized residuals and Cook's
distance scores in the data set.
SW388R7
Data Analysis &
Computers II
Slide 28
Including Cook's distance and standardized
residuals in the comparison of groups 1 and 3
First, mark the checkbox
for Standardized residuals
in the Residuals panel.
Second, mark the
checkbox for Cook's in
the Influence panel.
This will compute Cook's
distances to identify
influential cases.
Third, click on the
Continue button to
complete the
specifications.
SW388R7
Data Analysis &
Computers II
Slide 29
Outliers and influential cases for the
comparison of groups 1 and 3
Click on the OK
button to request
the output for the
logistic regression.
SW388R7
Data Analysis &
Computers II
Slide 30
Locating the case ids for outliers and
influential cases for groups 1 and 3
In order to exclude outliers and
influential cases from the multinomial
logistic regression, we must identify
their case ids.
Choose the Select Cases… command
from the Data menu to identify cases
that are outliers or influential cases.
SW388R7
Data Analysis &
Computers II
Replace the selection criteria
Slide 31
To replace the formula that
selected cases in group 1 and
3 for the dependent variable,
click on the IF… button.
SW388R7
Data Analysis &
Computers II
Slide 32
Formula for identifying outliers and
influential cases
Type in the formula for including
outliers and influential cases.
Note that we are including outliers and
influential cases because we want to
identify them. This is different from
previous procedures, where we included
cases that were not outliers and not
influential cases in the analysis.
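As a sketch, using SPSS's default names for the first set of saved diagnostics (coo_1 for Cook's distance and zre_1 for the standardized residual; the second set, coo_2 and zre_2, is referenced on a later slide) and the cutoffs stated on slide 2, the condition might read:

coo_1 > 1 OR abs(zre_1) > 3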
Click on the Continue
button to close the
dialog box.
SW388R7
Data Analysis &
Computers II
Slide 33
Completing the selection of outliers and
influential cases
To activate the
selection, click on
the OK button.
SW388R7
Data Analysis &
Computers II
Slide 34
Locating the outliers and influential cases
in the data editor
We used Select Cases to specify a criterion for including cases that
were outliers or influential cases. Select Cases will assign a 1
(true) to the filter_$ variable if a case satisfies the criterion. To
locate the cases that have a filter_$ value of 1, we can sort the
data set in descending order of the values for the filter variable.
Click on the column header
for filter_$ and select Sort
Descending from the drop
down menu.
SW388R7
Data Analysis &
Computers II
Slide 35
The outliers and influential cases
in the data editor
At the top of the sorted
column for filter_$, we see
only 0's indicating that no
cases met the criteria for
being considered an outlier or
influential case.
SW388R7
Data Analysis &
Computers II
Slide 36
Outliers and influential cases for the
comparison of groups 2 and 3
The process for identifying outliers
and influential cases is repeated for
the other comparison done by the
multinomial logistic regression,
group 2 versus group 3.
Since multinomial logistic regression
does not identify outliers or influential
cases, we will use binary logistic
regressions to identify them.
Choose the Select Cases… command
from the Data menu to include only
groups 2 and 3 in the analysis.
SW388R7
Data Analysis &
Computers II
Selecting groups 2 and 3
Slide 37
First, mark the If
condition is
satisfied option
button.
Second, click on the
IF… button to specify
the condition.
SW388R7
Data Analysis &
Computers II
Formula for selecting groups 2 and 3
Slide 38
To include only groups 2 and 3 in the
analysis, we enter the formula to
include cases that had a value of 2 for
natfare or a value of 3 for natfare.
After completing the formula,
click on the Continue button
to close the dialog box.
SW388R7
Data Analysis &
Computers II
Completing the selection of groups 2 and 3
Slide 39
To activate the
selection, click on
the OK button.
SW388R7
Data Analysis &
Computers II
Slide 40
Binary logistic regression comparing
groups 2 and 3
Select the Regression |
Binary Logistic…
command from the
Analyze menu.
SW388R7
Data Analysis &
Computers II
Slide 41
Outliers and influential cases for the
comparison of groups 2 and 3
The specifications for the
analysis are the same as the
ones we used for detecting
outliers and influential cases for
groups 1 and 3.
Click on the OK
button to request
the output for the
logistic regression.
SW388R7
Data Analysis &
Computers II
Slide 42
Locating the case ids for outliers and
influential cases for groups 2 and 3
In order to exclude outliers and
influential cases from the multinomial
logistic regression, we must identify
their case ids.
Choose the Select Cases… command
from the Data menu to identify cases
that are outliers or influential cases.
SW388R7
Data Analysis &
Computers II
Replace the selection criteria
Slide 43
To replace the formula that
selected cases in group 2 and
3 for the dependent variable,
click on the IF… button.
SW388R7
Data Analysis &
Computers II
Slide 44
Formula for identifying outliers and
influential cases
Type in the formula for including
outliers and influential cases.
Note that we use the second
version of Cook's distance, coo_2,
and the second version of the
standardized residual, zre_2.
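A corresponding sketch of the condition for this comparison:

coo_2 > 1 OR abs(zre_2) > 3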
Click on the Continue
button to close the
dialog box.
SW388R7
Data Analysis &
Computers II
Slide 45
Completing the selection of outliers and
influential cases
To activate the
selection, click on
the OK button.
SW388R7
Data Analysis &
Computers II
Slide 46
Locating the outliers and influential cases
in the data editor
We used Select Cases to specify a criterion for including cases that
were outliers or influential cases. Select Cases will assign a 1
(true) to the filter_$ variable if a case satisfies the criterion. To
locate the cases that have a filter_$ value of 1, we can sort the
data set in descending order of the values for the filter variable.
Click on the column header
for filter_$ and select Sort
Descending from the drop
down menu.
SW388R7
Data Analysis &
Computers II
Slide 47
The outliers and influential cases
in the data editor
At the top of the sorted
column for filter_$, we see
that we have one outlier or
influential case.
In the column
zre_2, we see
that this case
was an outlier
on the
standardized
residual.
SW388R7
Data Analysis &
Computers II
The case id of the outlier
Slide 48
The case id for the outlier is
"20000620." This is the case
that we will omit from the
multinomial logistic
regression.
SW388R7
Data Analysis &
Computers II
Excluding the outlier from the analysis
Slide 49
To exclude the outlier
from the analysis, we
will use the Select
Cases… command again.
SW388R7
Data Analysis &
Computers II
Changing the condition for the selection
Slide 50
Click on the IF…
button to change
the condition.
SW388R7
Data Analysis &
Computers II
Excluding case 20000620
Slide 51
To include all of the cases except the
outlier, we set caseid not equal to the
subject's id. Note that the subject's id
is put in quotation marks because it is
string data in this data set.
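As a sketch, assuming the id variable is named caseid as described above (the value is quoted because caseid is string data), the condition might read:

caseid ~= "20000620"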
After completing the formula,
click on the Continue button
to close the dialog box.
SW388R7
Data Analysis &
Computers II
Completing the exclusion of the outlier
Slide 52
To activate the
exclusion, click
on the OK
button.
SW388R7
Data Analysis &
Computers II
Slide 53
Multinomial logistic regression
excluding the outlier
Select the Regression |
Multinomial Logistic…
command from the
Analyze menu.
SW388R7
Data Analysis &
Computers II
Slide 54
Running the multinomial logistic regression
without the outlier
The specifications for the
analysis are the same as the
ones we used for the multinomial
logistic regression with all cases.
Click on the OK
button to request
the output for the
logistic regression.
SW388R7
Data Analysis &
Computers II
Classification accuracy after omitting outliers
Slide 55
Classification

Observed             Predicted 1   Predicted 2   Predicted 3   Percent Correct
1                    39            14            3             69.6%
2                    22            27            1             54.0%
3                    15            10            6             19.4%
Overall Percentage   55.5%         37.2%         7.3%          52.6%

With all cases the classification accuracy rate was 52.2%. After omitting the outlier, the accuracy rate improved to 52.6%. However, since the amount of the increase was not greater than 2%, the model with all cases will be interpreted.
SW388R7
Data Analysis &
Computers II
Restoring the outlier to the data set
Slide 56
To include the outlier
back into the analysis,
we will use the Select
Cases… command again.
SW388R7
Data Analysis &
Computers II
Restoring the outlier to the data set
Slide 57
Mark the All cases
option button to
include the outlier
back into the data set.
To activate the
exclusion, click
on the OK
button.
SW388R7
Data Analysis &
Computers II
Slide 58
Re-running the multinomial logistic regression
with all cases
Select the Regression |
Multinomial Logistic…
command from the
Analyze menu.
SW388R7
Data Analysis &
Computers II
Slide 59
Requesting the multinomial logistic
regression again
The specifications for the
analysis are the same as the
ones we have been using all
along.
Click on the OK
button to request
the output for the
multinomial logistic
regression.
SW388R7
Data Analysis &
Computers II
Slide 60
OVERALL RELATIONSHIP BETWEEN
INDEPENDENT AND DEPENDENT VARIABLES
Model Fitting Information

Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only   278.391
Final            252.510             25.882       8    .001
The presence of a relationship between the dependent
variable and the combination of independent variables is
based on the statistical significance of the final model
chi-square in the SPSS table titled "Model Fitting
Information".
In this analysis, the probability of the model chi-square
(25.882) was 0.001, less than or equal to the level of
significance of 0.05. The null hypothesis that there was
no difference between the model without independent
variables and the model with independent variables was
rejected. The existence of a relationship between the
independent variables and the dependent variable was
supported.
SW388R7
Data Analysis &
Computers II
NUMERICAL PROBLEMS
Slide 61
Parameter Estimates (a)

WELFARE 1         B        Std. Error   Wald    df   Sig.   Exp(B)
  Intercept       -.735    1.533        .230    1    .632
  HRS1            .033     .021         2.321   1    .128   1.033
  EDUC            .131     .110         1.417   1    .234   1.140
  RINCOM98        -.114    .057         3.922   1    .048   .893
  [WRKSLF=1]      -1.852   .720         6.612   1    .010   .157
  [WRKSLF=2]      0(b)     .            .       0    .      .
WELFARE 2
  Intercept       -1.800   1.500        1.439   1    .230
  HRS1            -.019    .021         .835    1    .361   .981
  EDUC            .318     .110         8.351   1    .004   1.374
  RINCOM98        -.088    .057         2.338   1    .126   .916
  [WRKSLF=1]      -1.187   .680         3.047   1    .081   .305
  [WRKSLF=2]      0(b)     .            .       0    .      .

a. The reference category is: 3.
b. This parameter is set to zero because it is redundant.

Multicollinearity in the multinomial logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable, and "complete separation" whereby the two groups in the dependent event variable can be perfectly separated by scores on one of the independent variables.
Analyses that indicate numerical problems should not be interpreted.
None of the independent variables in this analysis had a standard error larger than 2.0.
SW388R7
Data Analysis &
Computers II
Slide 62
RELATIONSHIP OF INDIVIDUAL INDEPENDENT
VARIABLES TO DEPENDENT VARIABLE - 1
Likelihood Ratio Tests

Effect        -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept     252.510(a)                           .000         0    .
HRS1          260.968                              8.459        2    .015
EDUC          262.640                              10.130       2    .006
RINCOM98      256.941                              4.432        2    .109
WRKSLF        260.034                              7.525        2    .023

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.
a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.

The statistical significance of the relationship between self-employment and opinion about spending on welfare is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests".
For this relationship, the probability of the chi-square statistic
(7.525) was 0.023, less than or equal to the level of
significance of 0.05. The null hypothesis that all of the b
coefficients associated with self-employment were equal to
zero was rejected. The existence of a relationship between
self-employment and opinion about spending on welfare was
supported.
RELATIONSHIP OF INDIVIDUAL INDEPENDENT
VARIABLES TO DEPENDENT VARIABLE - 2
SW388R7
Data Analysis &
Computers II
Slide 63
(The Parameter Estimates table from slide 61 is repeated on this slide.)
In the comparison of survey respondents who thought we spend too little money on welfare to survey respondents who thought we spend too much money on welfare, the probability of the Wald statistic (6.612) for the variable category survey respondents who were self-employed [wrkslf=1] was 0.010.
Since the probability was less than or equal to the level of significance of 0.05, the null hypothesis that the b coefficient for self-employment was equal to zero for this comparison was rejected.
RELATIONSHIP OF INDIVIDUAL INDEPENDENT
VARIABLES TO DEPENDENT VARIABLE - 3
SW388R7
Data Analysis &
Computers II
Slide 64
(The Parameter Estimates table from slide 61 is repeated on this slide.)
The value of Exp(B) was 0.157, which implies that the odds decreased by 84.3% (0.157 - 1.0 = -0.843).
The relationship stated in the problem is supported. Survey respondents who were self-employed were 84.3% less likely to be in the group of survey respondents who thought we spend too little money on welfare, rather than the group of survey respondents who thought we spend too much money on welfare.
SW388R7
Data Analysis &
Computers II
Slide 65
CLASSIFICATION USING THE MULTINOMIAL LOGISTIC
REGRESSION MODEL: BY CHANCE ACCURACY RATE
The independent variables could be characterized as useful
predictors distinguishing survey respondents who thought we
spend too little money on welfare, survey respondents who
thought we spend about the right amount of money on
welfare and survey respondents who thought we spend too
much money on welfare if the classification accuracy rate was
substantially higher than the accuracy attainable by chance
alone. Operationally, the classification accuracy rate should
be 25% or more higher than the proportional by chance
accuracy rate.
(The Case Processing Summary table from slide 20 is repeated on this slide.)
The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the 'Case Processing Summary', and then squaring and summing the proportion of cases in each group (0.406² + 0.362² + 0.232² = 0.350).
SW388R7
Data Analysis &
Computers II
Slide 66
CLASSIFICATION USING THE MULTINOMIAL LOGISTIC
REGRESSION MODEL: CLASSIFICATION ACCURACY
(The classification table from slide 21 is repeated on this slide.)
The classification accuracy rate was 52.2%, which was greater than or equal to the proportional by chance accuracy criterion of 43.7% (1.25 x 35.0% = 43.7%).
The criterion for classification accuracy is satisfied.
SW388R7
Data Analysis &
Computers II
Slide 67
Validation analysis:
set the random number seed
To set the random number
seed, select the Random
Number Seed… command
from the Transform menu.
SW388R7
Data Analysis &
Computers II
Set the random number seed
Slide 68
First, click on the
Set seed to option
button to activate
the text box.
Second, type in the
random seed stated in
the problem.
Third, click on the OK
button to complete the
dialog box.
Note that SPSS does not
provide you with any
feedback about the change.
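The same setting can be made in syntax; a minimal sketch:

SET SEED=892776.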
SW388R7
Data Analysis &
Computers II
Slide 69
Validation analysis:
compute the split variable
To enter the formula for the
variable that will split the
sample in two parts, click
on the Compute…
command.
SW388R7
Data Analysis &
Computers II
The formula for the split variable
Slide 70
First, type the name for the
new variable, split, into the
Target Variable text box.
Second, the formula for the
value of split is shown in the
text box.
The uniform(1) function
generates a random decimal
number between 0 and 1.
The random number is
compared to the value 0.80.
Third, click on the
OK button to
complete the dialog
box.
If the random number is less
than or equal to 0.80, the
value of the formula will be 1,
the SPSS numeric equivalent
to true. If the random
number is larger than 0.80,
the formula will return a 0,
the SPSS numeric equivalent
to false.
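In syntax, the split variable described above can be computed with the same uniform(1) comparison; a minimal sketch:

COMPUTE split = uniform(1) <= 0.80.
EXECUTE.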
SW388R7
Data Analysis &
Computers II
Selecting the teaching sample
Slide 71
To select the cases in the
teaching sample, we will
use the Select Cases…
command again.
SW388R7
Data Analysis &
Computers II
Selecting the teaching sample
Slide 72
First, mark the If
condition is
satisfied option
button.
Second, click on the
IF… button to specify
the condition.
SW388R7
Data Analysis &
Computers II
Selecting the teaching sample
Slide 73
To include the cases for the
teaching sample, we enter the
selection criteria: "split = 1".
After completing the formula,
click on the Continue button
to close the dialog box.
SW388R7
Data Analysis &
Computers II
Selecting the teaching sample
Slide 74
To activate the
selection, click on
the OK button.
SW388R7
Data Analysis &
Computers II
Slide 75
Re-running the multinomial logistic regression
with the teaching sample
Select the Regression |
Multinomial Logistic…
command from the
Analyze menu.
SW388R7
Data Analysis &
Computers II
Slide 76
Requesting the multinomial logistic
regression again
The specifications for the
analysis are the same as the
ones we have been using all
along.
Click on the OK
button to request
the output for the
multinomial logistic
regression.
SW388R7
Data Analysis &
Computers II
Comparing the teaching model to full model - 1
Slide 77
Model Fitting Information

Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only   231.881
Final            208.369             23.513       8    .003

In the cross-validation analysis, the relationship between the independent variables and the dependent variable was statistically significant. The probability for the model chi-square (23.513) testing the overall relationship was 0.003.
The significance of the overall relationship between the independent variables and the dependent variable supports the interpretation of the model using the full data set.
SW388R7
Data Analysis &
Computers II
Comparing the teaching model to full model - 2
Slide 78
Likelihood Ratio Tests

Effect        -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept     208.369(a)                           .000         0    .
HRS1          215.710                              7.341        2    .025
EDUC          215.959                              7.590        2    .022
RINCOM98      210.670                              2.301        2    .316
WRKSLF        218.214                              9.845        2    .007

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.
a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.

The pattern of significance of individual predictors for the teaching model matches the pattern for the full data set: hrs1, educ, and wrkslf have statistically significant relationships to the dependent variable.
SW388R7
Data Analysis &
Computers II
Comparing the teaching model to full model - 3
Slide 79
Parameter Estimates (a)

WELFARE 1         B        Std. Error   Wald    df   Sig.   Exp(B)
  Intercept       -1.302   1.695        .590    1    .442
  HRS1            .026     .023         1.281   1    .258   1.027
  EDUC            .175     .126         1.921   1    .166   1.191
  RINCOM98        -.087    .061         1.994   1    .158   .917
  [WRKSLF=1]      -2.519   .906         7.732   1    .005   .081
  [WRKSLF=2]      0(b)     .            .       0    .      .
WELFARE 2
  Intercept       -1.798   1.670        1.159   1    .282
  HRS1            -.025    .023         1.243   1    .265   .975
  EDUC            .328     .127         6.610   1    .010   1.388
  RINCOM98        -.074    .061         1.490   1    .222   .928
  [WRKSLF=1]      -1.349   .767         3.091   1    .079   .259
  [WRKSLF=2]      0(b)     .            .       0    .      .

a. The reference category is: 3.
b. This parameter is set to zero because it is redundant.

The statistical significance and direction of the relationship between WRKSLF=1 and group 1 versus group 3 of the dependent variable for the teaching model agrees with the findings for the model using the full data set.
SW388R7
Data Analysis &
Computers II
Classification accuracy of the holdout sample
Slide 80
(The Parameter Estimates table for the teaching sample, shown on slide 79, is repeated on this slide.)
To compute the accuracy rate of the holdout sample, our first task is to explicitly dummy-code any independent variables which SPSS dummy-coded in the multinomial logistic regression.
In this example, we must explicitly dummy code WRKSLF=1.
SW388R7
Data Analysis &
Computers II
Dummy-coding WRKSLF
Slide 81
(The Parameter Estimates table for the teaching sample, shown on slide 79, is repeated on this slide.)
WRKSLF=2 is the excluded category for WRKSLF in the table of parameter estimates. Using this category as our reference category, the syntax for dummy-coding WRKSLF is:
RECODE WRKSLF(1=1)(2=0) INTO WRKSLF1.
SW388R7
Data Analysis &
Computers II
The log of the odds for the first group
Slide 82
(The Parameter Estimates table for the teaching sample, shown on slide 79, is repeated on this slide.)
To calculate the log of the odds for the first group (G1), we multiply the coefficients for the first group from the table of parameter estimates times the variables:
COMPUTE G1 = -1.30238345543984
  + 0.0261986923704887 * HRS1
  + 0.174611208588235 * EDUC
  - 0.0867944152322106 * RINCOM98
  - 2.51888052878127 * WRKSLF1.
To get all of the decimal places for a number, double click on a cell to highlight it and the full number will appear.
SW388R7
Data Analysis &
Computers II
The log of the odds for the second group
Slide 83
(The Parameter Estimates table for the teaching sample, shown on slide 79, is repeated on this slide.)
To calculate the log of the odds for the second group (G2), we multiply the coefficients for the second group from the table of parameter estimates times the variables:
COMPUTE G2 = -1.79765485734901
  - 0.0252840253968005 * HRS1
  + 0.327632806335678 * EDUC
  - 0.0744568011819021 * RINCOM98
  - 1.34937062997864 * WRKSLF1.
SW388R7
Data Analysis &
Computers II
The log of the odds for the third group
Slide 84
(The Parameter Estimates table for the teaching sample, shown on slide 79, is repeated on this slide.)
The third group (G3) is the reference group and does not appear in the table of parameter estimates.
By definition, the log of the odds for the reference group is equal to zero (0). We create the variable for G3 with the command:
COMPUTE G3 = 0.
SW388R7
Data Analysis &
Computers II
The probabilities for each group
Slide 85
Having computed the log of the odds for each group,
we convert the log of the odds back to a probability
number with the following formulas:
COMPUTE P1 = EXP(G1) / (EXP(G1) + EXP(G2) + EXP(G3)).
COMPUTE P2 = EXP(G2) / (EXP(G1) + EXP(G2) + EXP(G3)).
COMPUTE P3 = EXP(G3) / (EXP(G1) + EXP(G2) + EXP(G3)).
EXECUTE.
SW388R7
Data Analysis &
Computers II
Group classification
Slide 86
Each case is predicted to be a member of the group
to which it has the highest probability of belonging.
We can accomplish this using "IF" statements in SPSS:
IF (P1 > P2 AND P1 > P3) PREDGRP = 1.
IF (P2 > P1 AND P2 > P3) PREDGRP = 2.
IF (P3 > P1 AND P3 > P2) PREDGRP = 3.
EXECUTE.
SW388R7
Data Analysis &
Computers II
Selecting the holdout sample
Slide 87
To select the cases in the
holdout sample, we will
use the Select Cases…
command again.
SW388R7
Data Analysis &
Computers II
Selecting the holdout sample
Slide 88
First, mark the If
condition is
satisfied option
button.
Second, click on the
IF… button to specify
the condition.
SW388R7
Data Analysis &
Computers II
Selecting the holdout sample
Slide 89
To include the cases in the
20% holdout sample, we
enter the criterion: "split =
0".
After completing the formula,
click on the Continue button
to close the dialog box.
SW388R7
Data Analysis &
Computers II
Selecting the holdout sample
Slide 90
To activate the
selection, click on
the OK button.
SW388R7
Data Analysis &
Computers II
The classification accuracy table
Slide 91
The classification accuracy table is a
table of predicted group membership
versus actual group membership. SPSS
can create it as a cross-tabulated table.
Select the Descriptive Statistics |
Crosstabs… command from the
Analyze menu.
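A sketch of the equivalent request in syntax, assuming the total cell percentages requested on the next slides:

CROSSTABS /TABLES=natfare BY predgrp /CELLS=COUNT TOTAL.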
SW388R7
Data Analysis &
Computers II
The classification accuracy table
Slide 92
To mimic the appearance of
classification tables in SPSS, we
will put the original variable,
natfare, in the rows of the table
and the predicted group variable,
predgrp, in the columns.
After specifying the row
and column variables, we
click on the Cells… button
to request percentages.
SW388R7
Data Analysis &
Computers II
The classification accuracy table
Slide 93
The classification accuracy
rate will be the sum of the
total percentages on the main
diagonal.
First, to obtain these
percentages, mark the
check box for Total on
the Percentages panel.
Second, click on the
Continue button to
close the dialog box.
SW388R7
Data Analysis &
Computers II
The classification accuracy table
Slide 94
To complete the
request for the
cross-tabulated
table, click on
the OK button.
SW388R7
Data Analysis &
Computers II
The classification accuracy table
Slide 95
The classification accuracy rate will be the sum of the total percentages on the main diagonal:
13.0% + 34.8% + 4.3% = 52.1%.

WELFARE * PREDGRP Crosstabulation

                          PREDGRP 1.000   PREDGRP 2.000   PREDGRP 3.000   Total
WELFARE 1   Count         3               3               1               7
            % of Total    13.0%           13.0%           4.3%            30.4%
WELFARE 2   Count         2               8               0               10
            % of Total    8.7%            34.8%           .0%             43.5%
WELFARE 3   Count         3               2               1               6
            % of Total    13.0%           8.7%            4.3%            26.1%
Total       Count         8               13              2               23
            % of Total    34.8%           56.5%           8.7%            100.0%

The criterion to support the classification accuracy of the model is an accuracy rate for the holdout sample that is no more than 10% lower than the accuracy rate for the training sample. The accuracy rate for the training sample was 51.3%, making the minimum requirement for the holdout sample equal to 46.2% (0.90 x 51.3%). The accuracy rate for the holdout sample was 52.1%, which satisfied the minimum requirement. The classification accuracy for the analysis of the full data set was supported.
SW388R7
Data Analysis &
Computers II
Answering the question in problem 1 - 1
Slide 96
(The problem statement and answer choices from slide 5 are repeated on this slide, with annotations.)
We found a statistically significant overall relationship between the combination of independent variables and the dependent variable.
Removal of outliers did not improve the model substantially, so they were included in the solution.
There was no evidence of numerical problems in the solution.
Moreover, the classification accuracy surpassed the proportional by chance accuracy criterion, supporting the utility of the model.
SW388R7
Data Analysis &
Computers II
Answering the question in problem 1 - 2
Slide 97
(The problem statement and answer choices from slide 5 are repeated on this slide, with annotations.)
We verified that each statement about the relationship between an independent variable and the dependent variable was correct in both the direction of the relationship and the change in likelihood associated with a one-unit change of the independent variable, for the comparison between groups stated in the problem.
SW388R7
Data Analysis &
Computers II
Answering the question in problem 1 - 3
Slide 98
(The problem statement and answer choices from slide 5 are repeated on this slide, with annotations.)
The 80-20 split validation supported the interpretation of the model using the full data set. The overall relationship for the teaching sample was statistically significant, as was the pattern of relationships for individual predictors. Finally, the accuracy rate for the holdout sample was sufficient to support the accuracy of the full model.
The answer to the question is true with caution.
A caution is added because of the inclusion of ordinal level variables.
SW388R7
Data Analysis &
Computers II
Slide 99
Steps in multinomial logistic regression:
level of measurement and initial sample size
The following is a guide to the decision process for answering
problems about the basic relationships in multinomial logistic
regression:
Dependent variable non-metric, and independent variables metric or dichotomous?
  No → Inappropriate application of a statistic
  Yes → continue
Ratio of cases to independent variables at least 10 to 1?
  No → Inappropriate application of a statistic
  Yes → Run multinomial logistic regression. Record classification accuracy for evaluation of the effect of removing outliers and influential cases.
SW388R7
Data Analysis &
Computers II
Slide 100
Steps in multinomial logistic regression:
detecting outliers and influential cases
Run binary logistic regression for pairs of groups
compared in multinomial logistic regression to
identify outliers and influential cases.
Outliers/influential cases by standardized residuals or Cook's distance?
  No → continue
  Yes → Remove outliers and influential cases from data set
Ratio of cases to independent variables at least 10 to 1?
  Yes → continue
  No → Restore outliers and influential cases to data set, add caution to findings
SW388R7
Data Analysis &
Computers II
Slide 101
Steps in multinomial logistic regression:
picking model for interpretation
Were outliers and influential cases omitted from the analysis?
  No → Pick baseline multinomial logistic regression for interpretation
  Yes → Classification accuracy omitting outliers better than baseline by 2% or more?
    Yes → Pick multinomial logistic regression that omits outliers for interpretation
    No → Pick baseline multinomial logistic regression for interpretation
SW388R7
Data Analysis &
Computers II
Slide 102
Steps in multinomial logistic regression:
overall relationship and numerical problems
Overall relationship statistically significant? (model chi-square test)
  No → False
  Yes → continue
Standard errors of coefficients indicate no numerical problems (s.e. <= 2.0)?
  No → False
  Yes → continue
SW388R7
Data Analysis &
Computers II
Slide 103
Steps in multinomial logistic regression:
relationships between IV's and DV
Overall relationship between specific IV and DV is statistically significant? (likelihood ratio test)
  No → False
  Yes → continue
Role of specific IV and DV groups statistically significant and interpreted correctly? (Wald test and Exp(B))
  No → False
  Yes → continue
Overall accuracy rate is 25% or more higher than the proportional by chance accuracy rate?
  No → False
  Yes → continue
SW388R7
Data Analysis &
Computers II
Slide 104
Steps in logistic regression:
split-sample validation
Compute 80-20 split variable.
Re-run logistic regression.
Overall relationship in teaching sample supports full model?
  No → False
  Yes → continue
SW388R7
Data Analysis &
Computers II
Slide 105
Steps in logistic regression:
validation supports generalizability
Significance of predictors in teaching sample matches pattern for model using full data set?
  No → False
  Yes → continue
Classification accuracy for holdout sample close enough to training sample?
  No → False
  Yes → continue
SW388R7
Data Analysis &
Computers II
Slide 106
Steps in multinomial logistic regression:
adding cautions
Satisfies preferred ratio of cases to IV's of 20 to 1?
  No → True with caution
  Yes → continue
One or more IV's are ordinal level treated as metric?
  Yes → True with caution
  No → True