Transcript document

SW388R7
Data Analysis &
Computers II
Assumptions of multiple regression
Slide 1
Assumption of normality
Transformations
Assumption of linearity
Assumption of homoscedasticity
Script for testing assumptions
Practice problems
Assumptions of Normality, Linearity, and
Homoscedasticity
SW388R7
Data Analysis &
Computers II
Slide 2



Multiple regression assumes that the variables in the
analysis satisfy the assumptions of normality,
linearity, and homoscedasticity. (There is also an
assumption of independence of errors but that
cannot be evaluated until the regression is run.)
There are two general strategies for checking
conformity to assumptions: pre-analysis and post-analysis. In pre-analysis, the variables are checked
prior to running the regression. In post-analysis, the
assumptions are evaluated by looking at the pattern
of the residuals, i.e. the errors or variability that the
regression was unable to predict accurately.
The text recommends pre-analysis, the strategy we
will follow.
SW388R7
Data Analysis &
Computers II
Assumption of Normality
Slide 3




The assumption of normality prescribes that the
distribution of cases fit the pattern of a normal
curve.
It is evaluated for all metric variables included in the
analysis, the independent variables as well as the
dependent variable.
With multivariate statistics, the assumption is that
the combination of variables follows a multivariate
normal distribution.
Since there is not a direct test for multivariate
normality, we generally test each variable
individually and assume that they are multivariate
normal if they are individually normal, though this is
not necessarily the case.
SW388R7
Data Analysis &
Computers II
Slide 4
Assumption of Normality:
Evaluating Normality
There are both graphical and statistical methods for
evaluating normality.
- Graphical methods include the histogram and
normality plot.
- Statistical methods include diagnostic hypothesis
tests for normality, and a rule of thumb that says
a variable is reasonably close to normal if its
skewness and kurtosis have values between –1.0
and +1.0.
- None of the methods is absolutely definitive.
- We will use the criteria that the skewness and
kurtosis of the distribution both fall between -1.0
and +1.0.
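For readers who prefer syntax to the menus, a minimal sketch of a command that produces these two statistics; the variable name netime (total Internet time, a variable used later in these slides) is only an example:

    * Obtain skewness and kurtosis for the rule-of-thumb check (-1.0 to +1.0).
    DESCRIPTIVES VARIABLES=netime
      /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM SKEWNESS KURTOSIS.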
Assumption of Normality:
Histograms and Normality Plots
SW388R7
Data Analysis &
Computers II
Slide 5
On the left side of the slide are the histogram and normality plot
for occupational prestige, which could reasonably be
characterized as normal. Time using email, on the right, is not
normally distributed.
[Figure: Histograms and normal Q-Q plots. Left: RS OCCUPATIONAL PRESTIGE SCORE (1980), Mean = 44.2, Std. Dev = 13.94, N = 255. Right: TIME SPENT USING E-MAIL, Mean = 3.6, Std. Dev = 6.14, N = 119.]
Assumption of Normality:
Hypothesis test of normality
SW388R7
Data Analysis &
Computers II
Slide 6
The hypothesis test for normality tests the null hypothesis that the
variable is normal, i.e. the actual distribution of the variable fits the
pattern we would expect if it is normal. If we fail to reject the null
hypothesis, we conclude that the distribution is normal.
The distributions for both of the variables depicted on the previous slide are
associated with low significance values that lead to rejecting the null
hypothesis and concluding that neither occupational prestige nor time
using email is normally distributed.
Tests of Normality (a. Lilliefors Significance Correction)
RS OCCUPATIONAL PRESTIGE SCORE (1980): Kolmogorov-Smirnov Statistic = .121, df = 255, Sig. = .000; Shapiro-Wilk Statistic = .964, df = 255, Sig. = .000
TIME SPENT USING E-MAIL: Kolmogorov-Smirnov Statistic = .296, df = 119, Sig. = .000; Shapiro-Wilk Statistic = .601, df = 119, Sig. = .000
SW388R7
Data Analysis &
Computers II
Slide 7
Assumption of Normality:
Skewness, kurtosis, and normality
Using the rule of thumb that says a variable is
reasonably close to normal if its skewness and kurtosis have values
between –1.0 and +1.0, we would decide that occupational
prestige is normally distributed and time using email is not.
We will use this rule of thumb for normality in our strategy for
solving problems.
Assumption of Normality:
Transformations
SW388R7
Data Analysis &
Computers II
Slide 8



When a variable is not normally distributed, we can
create a transformed variable and test it for
normality. If the transformed variable is normally
distributed, we can substitute it in our analysis.
Three common transformations are: the logarithmic
transformation, the square root transformation, and
the inverse transformation.
All of these change the measuring scale on the
horizontal axis of a histogram to produce a
transformed variable that is mathematically
equivalent to the original variable.
Assumption of Normality:
When transformations do not work
SW388R7
Data Analysis &
Computers II
Slide 9


When none of the transformations induces normality
in a variable, including that variable in the analysis
will reduce our effectiveness at identifying statistical
relationships, i.e. we lose power.
We do have the option of changing the way the
information in the variable is represented, e.g.
substitute several dichotomous variables for a single
metric variable.
SW388R7
Data Analysis &
Computers II
Slide 10
Assumption of Normality:
Computing “Explore” descriptive statistics
To compute the statistics
needed for evaluating the
normality of a variable, select
the Explore… command from
the Descriptive Statistics
menu.
SW388R7
Data Analysis &
Computers II
Slide 11
Assumption of Normality:
Adding the variable to be evaluated
Second, click on right
arrow button to move
the highlighted variable
to the Dependent List.
First, click on the
variable to be included
in the analysis to
highlight it.
SW388R7
Data Analysis &
Computers II
Slide 12
Assumption of Normality:
Selecting statistics to be computed
To select the statistics for the
output, click on the
Statistics… command button.
SW388R7
Data Analysis &
Computers II
Slide 13
Assumption of Normality:
Including descriptive statistics
First, click on the
Descriptives checkbox
to select it. Clear the
other checkboxes.
Second, click on the
Continue button to
complete the request for
statistics.
SW388R7
Data Analysis &
Computers II
Slide 14
Assumption of Normality:
Selecting charts for the output
To select the diagnostic charts
for the output, click on the
Plots… command button.
SW388R7
Data Analysis &
Computers II
Slide 15
Assumption of Normality:
Including diagnostic plots and statistics
First, click on the None option button on the Boxplots
panel, since boxplots are not as helpful as other
charts in assessing normality.
Second, click on the Normality plots with tests
checkbox to include normality plots and the
hypothesis tests for normality.
Third, click on the Histogram checkbox to include a
histogram in the output. You may want to examine the
stem-and-leaf plot as well, though I find it less useful.
Finally, click on the Continue button to complete the
request.
SW388R7
Data Analysis &
Computers II
Slide 16
Assumption of Normality:
Completing the specifications for the analysis
Click on the OK button to
complete the specifications
for the analysis and request
SPSS to produce the
output.
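The menu steps on the preceding slides correspond to a single EXAMINE command; a sketch, again assuming the variable being evaluated is netime:

    * Descriptive statistics, histogram, normality plot, and tests of normality.
    EXAMINE VARIABLES=netime
      /PLOT HISTOGRAM NPPLOT
      /STATISTICS DESCRIPTIVES
      /MISSING LISTWISE.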
Assumption of Normality:
The histogram
SW388R7
Data Analysis &
Computers II
Slide 17
An initial impression of the normality of the distribution
can be gained by examining the histogram.
In this example, the histogram shows a substantial
violation of normality caused by an extremely large value
in the distribution.
[Figure: Histogram of TOTAL TIME SPENT ON THE INTERNET, Mean = 10.7, Std. Dev = 15.35, N = 93.]
Assumption of Normality:
The normality plot
SW388R7
Data Analysis &
Computers II
Slide 18
[Figure: Normal Q-Q Plot of TOTAL TIME SPENT ON THE INTERNET (Expected Normal versus Observed Value).]
The problem with the normality of this
variable’s distribution is reinforced by the
normality plot.
If the variable were normally distributed,
the red dots would fit the green line very
closely. In this case, the red points in the
upper right of the chart indicate the
severe skewing caused by the extremely
large data values.
SW388R7
Data Analysis &
Computers II
Slide 19
Assumption of Normality:
The test of normality
Tests of Normality (a. Lilliefors Significance Correction)
TOTAL TIME SPENT ON THE INTERNET: Kolmogorov-Smirnov Statistic = .246, df = 93, Sig. = .000; Shapiro-Wilk Statistic = .606, df = 93, Sig. = .000
Since the sample size is larger than 50, we use the Kolmogorov-Smirnov
test. If the sample size were 50 or less, we would use the Shapiro-Wilk
statistic instead.
The null hypothesis for the test of normality states that the actual
distribution of the variable is equal to the expected distribution, i.e., the
variable is normally distributed. Since the probability associated with the
test of normality (< 0.001) is less than or equal to the level of significance
(0.01), we reject the null hypothesis and conclude that total hours spent on
the Internet is not normally distributed. (Note: we report the probability as
< 0.001 instead of .000 to be clear that the probability is not really zero.)
SW388R7
Data Analysis &
Computers II
Slide 20
Transformations:
Transforming variables to satisfy assumptions


When a metric variable fails to satisfy the
assumption of normality, homogeneity of variance, or
linearity, we may be able to correct the deficiency
by using a transformation.
We will consider three transformations for normality,
homogeneity of variance, and linearity:




the logarithmic transformation
the square root transformation, and
the inverse transformation
plus a fourth that may be useful for problems of
linearity:

the square transformation
Transformations:
Computing transformations in SPSS
SW388R7
Data Analysis &
Computers II
Slide 21


In SPSS, transformations are obtained by computing a
new variable. SPSS functions are available for the
logarithmic (LG10) and square root (SQRT)
transformations. The inverse transformation uses a
formula which divides one by the original value for
each case.
For each of these calculations, there may be data
values which are not mathematically permissible.
For example, the log of zero is not defined
mathematically, division by zero is not permitted,
and the square root of a negative number results in
an “imaginary” value. We will usually adjust the
values passed to the function to make certain that
these illegal operations do not occur.
Transformations:
Two forms for computing transformations
SW388R7
Data Analysis &
Computers II
Slide 22



There are two forms for each of the transformations
to induce normality, depending on whether the
distribution is skewed negatively to the left or
skewed positively to the right.
Both forms use the same SPSS functions and formula
to calculate the transformations.
The two forms differ in the value or argument passed
to the functions and formula. The argument to the
functions is an adjustment to the original value of
the variable to make certain that all of the
calculations are mathematically correct.
Transformations:
Functions and formulas for transformations
SW388R7
Data Analysis &
Computers II
Slide 23
Symbolically, if we let x stand for the argument
passed to the function or formula, the calculations
for the transformations are:
- Logarithmic transformation: compute log = LG10(x)
- Square root transformation: compute sqrt = SQRT(x)
- Inverse transformation: compute inv = 1 / (x)
- Square transformation: compute s2 = x * x
For all transformations, the argument must be
greater than zero to guarantee that the calculations
are mathematically legitimate.
SW388R7
Data Analysis &
Computers II
Slide 24
Transformations:
Transformation of positively skewed variables



For positively skewed variables, the argument is an
adjustment to the original value based on the
minimum value for the variable.
If the minimum value for a variable is zero, the
adjustment requires that we add one to each value,
e.g. x + 1.
If the minimum value for a variable is a negative
number (e.g., –6), the adjustment requires that we
add the absolute value of the minimum value (e.g. 6)
plus one (e.g. x + 6 + 1, which equals x +7).
Transformations:
Example of positively skewed variable
SW388R7
Data Analysis &
Computers II
Slide 25



Suppose our dataset contains the number of books
read (books) for 5 subjects: 1, 3, 0, 5, and 2, and the
distribution is positively skewed.
The minimum value for the variable books is 0. The
adjustment for each case is books + 1.
The transformations would be calculated as follows:
- Compute logBooks = LG10(books + 1)
- Compute sqrBooks = SQRT(books + 1)
- Compute invBooks = 1 / (books + 1)
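As runnable SPSS syntax, a sketch of the same three computes:

    * Positively skewed variable with a minimum of 0: add 1 to each value.
    COMPUTE logBooks = LG10(books + 1).
    COMPUTE sqrBooks = SQRT(books + 1).
    COMPUTE invBooks = 1 / (books + 1).
    EXECUTE.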
Transformations:
Transformation of negatively skewed variables
SW388R7
Data Analysis &
Computers II
Slide 26



If the distribution of a variable is negatively skewed,
the adjustment of the values reverses, or reflects,
the distribution so that it becomes positively
skewed. The transformations are then computed on
the values in the positively skewed distribution.
Reflection is computed by subtracting all of the
values for a variable from one plus the absolute
value of the maximum value for the variable. This results
in a positively skewed distribution with all values
larger than zero.
When an analysis uses a transformation involving
reflection, we must remember that this will reverse
the direction of all of the relationships in which the
variable is involved. Our interpretation of
relationships must be adjusted accordingly.
Transformations:
Example of negatively skewed variable
SW388R7
Data Analysis &
Computers II
Slide 27



Suppose our dataset contains the number of books
read (books) for 5 subjects: 1, 3, 0, 5, and 2, and the
distribution is negatively skewed.
The maximum value for the variable books is 5. The
adjustment for each case is 6 - books.
The transformations would be calculated as follows:
- Compute logBooks = LG10(6 - books)
- Compute sqrBooks = SQRT(6 - books)
- Compute invBooks = 1 / (6 - books)
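As runnable SPSS syntax, a sketch of the reflected computes (remember that reflection reverses the direction of relationships):

    * Negatively skewed variable with a maximum of 5: subtract each value from 6.
    COMPUTE logBooks = LG10(6 - books).
    COMPUTE sqrBooks = SQRT(6 - books).
    COMPUTE invBooks = 1 / (6 - books).
    EXECUTE.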
Transformations:
The Square Transformation for Linearity
SW388R7
Data Analysis &
Computers II
Slide 28



The square transformation is computed by
multiplying the value for the variable by itself.
It does not matter whether the distribution is
positively or negatively skewed.
It does matter if the variable has negative values,
since we would not be able to distinguish their
squares from the square of a comparable positive
value (e.g. the square of -4 is equal to the square of
+4). If the variable has negative values, we add the
absolute value of the minimum value to each score
before squaring it.
Transformations:
Example of the square transformation
SW388R7
Data Analysis &
Computers II
Slide 29



Suppose our dataset contains change scores (chg) for
5 subjects that indicate the difference between test
scores at the end of a semester and test scores at
mid-term: -10, 0, 10, 20, and 30.
The minimum score is -10. The absolute value of the
minimum score is 10.
The transformation would be calculated as follows:
- Compute squarChg = (chg + 10) * (chg + 10)
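As runnable SPSS syntax, a sketch of the same compute:

    * Shift scores so the minimum is 0 (add the absolute value of -10), then square.
    COMPUTE squarChg = (chg + 10) * (chg + 10).
    EXECUTE.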
Transformations:
Transformations for normality
SW388R7
Data Analysis &
Computers II
Slide 30
Both the histogram and the normality plot for Total
Time Spent on the Internet (netime) indicate that the
variable is not normally distributed.
[Figure: Histogram (Mean = 10.7, Std. Dev = 15.35, N = 93) and Normal Q-Q Plot of TOTAL TIME SPENT ON THE INTERNET.]
SW388R7
Data Analysis &
Computers II
Slide 31
Transformations:
Determine whether reflection is required
Descriptives: TOTAL TIME SPENT ON THE INTERNET
Mean = 10.73 (Std. Error = 1.59); 95% Confidence Interval for Mean = 7.57 to 13.89; 5% Trimmed Mean = 8.29; Median = 5.50; Variance = 235.655; Std. Deviation = 15.35; Minimum = 0; Maximum = 102; Range = 102; Interquartile Range = 10.20; Skewness = 3.532 (Std. Error = .250); Kurtosis = 15.614 (Std. Error = .495)
Skewness, in the table of Descriptive Statistics,
indicates whether or not reflection (reversing the
values) is required in the transformation.
If Skewness is positive, as it is in this problem,
reflection is not required. If Skewness is negative,
reflection is required.
SW388R7
Data Analysis &
Computers II
Slide 32
Transformations:
Compute the adjustment to the argument
Descriptives: TOTAL TIME SPENT ON THE INTERNET (same table as the previous slide; Minimum = 0)
In this problem, the minimum value is 0, so 1 will be
added to each value in the formula, i.e. the argument
to the SPSS functions and formula for the inverse will
be:
netime + 1.
SW388R7
Data Analysis &
Computers II
Slide 33
Transformations:
Computing the logarithmic transformation
To compute the transformation,
select the Compute… command
from the Transform menu.
SW388R7
Data Analysis &
Computers II
Slide 34
Transformations:
Specifying the transform variable name and function
First, in the Target Variable text box, type a
name for the log transformation variable, e.g.
“lgnetime“.
Second, scroll down the list of functions to
find LG10, which calculates logarithmic
values using a base of 10. (The logarithmic
values are the power to which 10 is raised
to produce the original number.)
Third, click
on the up
arrow button
to move the
highlighted
function to
the Numeric
Expression
text box.
SW388R7
Data Analysis &
Computers II
Slide 35
Transformations:
Adding the variable name to the function
First, scroll down the list of
variables to locate the
variable we want to
transform. Click on its name
so that it is highlighted.
Second, click on the right arrow
button. SPSS will replace the
highlighted text in the function
(?) with the name of the variable.
SW388R7
Data Analysis &
Computers II
Slide 36
Transformations:
Adding the constant to the function
Following the rules stated for determining the constant
that needs to be included in the function either to
prevent mathematical errors, or to do reflection, we
include the constant in the function argument. In this
case, we add 1 to the netime variable.
Click on the OK
button to complete
the compute
request.
SW388R7
Data Analysis &
Computers II
Slide 37
Transformations:
The transformed variable
The transformed variable which we
requested SPSS compute is shown in the
data editor in a column to the right of the
other variables in the dataset.
SW388R7
Data Analysis &
Computers II
Slide 38
Transformations:
Computing the square root transformation
To compute the transformation,
select the Compute… command
from the Transform menu.
SW388R7
Data Analysis &
Computers II
Slide 39
Transformations:
Specifying the transform variable name and function
First, in the Target Variable text box, type a
name for the square root transformation
variable, e.g. “sqnetime“.
Second, scroll down the list of functions to
find SQRT, which calculates the square root
of a variable.
Third, click
on the up
arrow button
to move the
highlighted
function to
the Numeric
Expression
text box.
SW388R7
Data Analysis &
Computers II
Slide 40
Transformations:
Adding the variable name to the function
First, scroll down the list of
variables to locate the
variable we want to
transform. Click on its name
so that it is highlighted.
Second, click on the right arrow
button. SPSS will replace the
highlighted text in the function
(?) with the name of the variable.
SW388R7
Data Analysis &
Computers II
Slide 41
Transformations:
Adding the constant to the function
Following the rules stated for determining the constant
that needs to be included in the function either to
prevent mathematical errors, or to do reflection, we
include the constant in the function argument. In this
case, we add 1 to the netime variable.
Click on the OK
button to complete
the compute
request.
SW388R7
Data Analysis &
Computers II
Slide 42
Transformations:
The transformed variable
The transformed variable which we
requested SPSS compute is shown in the
data editor in a column to the right of the
other variables in the dataset.
SW388R7
Data Analysis &
Computers II
Slide 43
Transformations:
Computing the inverse transformation
To compute the transformation,
select the Compute… command
from the Transform menu.
SW388R7
Data Analysis &
Computers II
Slide 44
Transformations:
Specifying the transform variable name and formula
First, in the Target
Variable text box, type a
name for the inverse
transformation variable,
e.g. “innetime“.
Second, there is not a function for
computing the inverse, so we type
the formula directly into the
Numeric Expression text box.
Third, click on the
OK button to
complete the
compute request.
SW388R7
Data Analysis &
Computers II
Slide 45
Transformations:
The transformed variable
The transformed variable which we
requested SPSS compute is shown in the
data editor in a column to the right of the
other variables in the dataset.
SW388R7
Data Analysis &
Computers II
Slide 46
Transformations:
Adjustment to the argument for the square
transformation
It is mathematically correct to square a value of zero, so the
adjustment to the argument for the square transformation is
different. What we need to avoid are negative numbers,
since the square of a negative number produces the same
value as the square of a positive number.
Descriptives: TOTAL TIME SPENT ON THE INTERNET (Minimum = 0, Maximum = 102, Skewness = 3.532, Kurtosis = 15.614)
In this problem, the minimum value is 0, so no adjustment
is needed for computing the square. If the minimum
were a number less than zero, we would add the
absolute value of the minimum (dropping the sign) as
an adjustment to the variable.
SW388R7
Data Analysis &
Computers II
Slide 47
Transformations:
Computing the square transformation
To compute the transformation,
select the Compute… command
from the Transform menu.
SW388R7
Data Analysis &
Computers II
Slide 48
Transformations:
Specifying the transform variable name and formula
First, in the Target
Variable text box, type a
name for the square
transformation variable,
e.g. “s2netime“.
Second, there is not a function for
computing the square, so we type
the formula directly into the
Numeric Expression text box.
Third, click on the
OK button to
complete the
compute request.
SW388R7
Data Analysis &
Computers II
Slide 49
Transformations:
The transformed variable
The transformed variable which we
requested SPSS compute is shown in the
data editor in a column to the right of the
other variables in the dataset.
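The four compute dialogs shown on the preceding slides can also be run as a short block of syntax; a sketch using the variable and target names from the slides:

    * netime has a minimum of 0, so 1 is added for the log, square root, and inverse.
    * Squaring needs no adjustment because there are no negative values.
    COMPUTE lgnetime = LG10(netime + 1).
    COMPUTE sqnetime = SQRT(netime + 1).
    COMPUTE innetime = 1 / (netime + 1).
    COMPUTE s2netime = netime * netime.
    EXECUTE.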
SW388R7
Data Analysis &
Computers II
Slide 50
Assumption of Normality:
The test of normality
Descriptives: TOTAL TIME SPENT ON THE INTERNET
Mean = 10.7312 (Std. Error = 1.59183); 95% Confidence Interval for Mean = 7.5697 to 13.8927; 5% Trimmed Mean = 8.2949; Median = 5.5000; Variance = 235.655; Std. Deviation = 15.35106; Minimum = .20; Maximum = 102.00; Range = 101.80; Interquartile Range = 10.2000; Skewness = 3.532 (Std. Error = .250); Kurtosis = 15.614 (Std. Error = .495)
Using the rule of thumb for evaluating normality with the skewness
and kurtosis statistics, we look at the table of descriptive statistics.
The skewness and kurtosis for the variable both exceed the rule of
thumb criteria of 1.0. The variable is not normally distributed.
SW388R7
Data Analysis &
Computers II
Assumption of Linearity
Slide 51



Linearity means that the amount of change, or rate
of change, between scores on two variables is
constant for the entire range of scores for the
variables.
Linearity characterizes the relationship between two
metric variables. It is tested for the pairs formed by
dependent variable and each metric independent
variable in the analysis.
There are relationships that are not linear.
- The relationship between learning and time may not be
linear. Learning a new subject shows rapid gains at first,
then the pace slows down over time. This is often
referred to as a learning curve.
- Population growth may not be linear. The pattern often
shows growth at increasing rates over time.
Assumption of Linearity:
Evaluating linearity
SW388R7
Data Analysis &
Computers II
Slide 52



There are both graphical and statistical methods for
evaluating linearity.
Graphical methods include the examination of
scatterplots, often overlaid with a trendline. While
commonly recommended, this strategy is difficult to
implement.
Statistical methods include diagnostic hypothesis
tests for linearity, a rule of thumb that says a
relationship is linear if the difference between the
linear correlation coefficient (r) and the nonlinear
correlation coefficient (eta) is small, and examining
patterns of correlation coefficients.
SW388R7
Data Analysis &
Computers II
Slide 53
Assumption of Linearity:
Interpreting scatterplots
The advice for interpreting linearity is often phrased as
looking for a cigar-shaped band, which is very evident in
this plot.
[Figure: Scatterplot with RESPONDENT'S SOCIOECONOMIC INDEX on the horizontal axis, showing a clear cigar-shaped band of points.]
Assumption of Linearity:
Interpreting scatterplots
SW388R7
Data Analysis &
Computers II
Slide 54
Sometimes, a scatterplot
shows a clearly nonlinear
pattern that requires
transformation, like the one
shown in the scatterplot.
[Figure: Scatterplot with Gross domestic product / capita on the horizontal axis, showing a clearly nonlinear, curved pattern.]
Slide 55
Assumption of Linearity:
Scatterplots that are difficult to interpret
The correlations for both of these
relationships are low.
The linearity of the relationship on the right
can be improved with a transformation; the
plot on the left cannot. However, this is not
necessarily obvious from the scatterplots.
[Figure: Two scatterplots of TOTAL TIME SPENT ON THE INTERNET, one against AGE OF RESPONDENT and one against HOURS PER DAY WATCHING TV.]
Assumption of Linearity:
Using correlation matrices
SW388R7
Data Analysis &
Computers II
Slide 56
Correlations (first column of the matrix; Pearson Correlation, Sig. 2-tailed, N = 93 for TOTAL TIME SPENT ON THE INTERNET)
AGE OF RESPONDENT: r = .017, Sig. = .874
Logarithm of AGE [LG10(AGE)]: r = .048, Sig. = .648
Square of AGE [(AGE)**2]: r = -.009, Sig. = .931
Square Root of AGE [SQRT(AGE)]: r = .032, Sig. = .761
Inverse of AGE [-1/(AGE)]: r = .079, Sig. = .453
**. Correlation is significant at the 0.01 level (2-tailed).
Creating a correlation matrix for the dependent variable
and the original and transformed variations of the
independent variable provides us with a pattern that is
easier to interpret.
The information that we need is in the first column of the
matrix, which shows the correlation and significance
for the dependent variable and all forms of the
independent variable.
SW388R7
Data Analysis &
Computers II
Slide 57
Assumption of Linearity:
The pattern of correlations for no relationship
Correlations (first column of the matrix; Pearson Correlation, Sig. 2-tailed, N = 93 for TOTAL TIME SPENT ON THE INTERNET)
AGE OF RESPONDENT: r = .017, Sig. = .874
Logarithm of AGE [LG10(AGE)]: r = .048, Sig. = .648
Square of AGE [(AGE)**2]: r = -.009, Sig. = .931
Square Root of AGE [SQRT(AGE)]: r = .032, Sig. = .761
Inverse of AGE [-1/(AGE)]: r = .079, Sig. = .453
**. Correlation is significant at the 0.01 level (2-tailed).
The correlation between the two variables is very weak
and statistically non-significant. If we viewed this
as a hypothesis test for the significance of r, we would
conclude that there is no relationship between these
variables.
Moreover, none of the significance tests for the correlations
with the transformed independent variable are statistically
significant. There is no relationship between these
variables; it is not a problem with non-linearity.
SW388R7
Data Analysis &
Computers II
Slide 58
Assumption of Linearity:
Correlation pattern suggesting transformation
Correlations (first column of the matrix; Pearson Correlation, Sig. 2-tailed, N = 68 for TOTAL TIME SPENT ON THE INTERNET)
HOURS PER DAY WATCHING TV: r = .215, Sig. = .079
Logarithm of TVHOURS [LG10(1+TVHOURS)]: r = .104, Sig. = .397
Square of TVHOURS [(TVHOURS)**2]: r = .328**, Sig. = .006
Square Root of TVHOURS [SQRT(1+TVHOURS)]: r = .156, Sig. = .203
Inverse of TVHOURS [-1/(1+TVHOURS)]: r = .045, Sig. = .713
**. Correlation is significant at the 0.01 level (2-tailed).
The correlation between the two variables is very weak
and statistically non-significant. If we viewed this
as a hypothesis test for the significance of r, we would
conclude that there is no relationship between these
variables.
However, the probability associated with the larger
correlation for the square transformation is statistically
significant, suggesting that this is a transformation we
might want to use in our analysis.
Assumption of Linearity:
Transformations
SW388R7
Data Analysis &
Computers II
Slide 59



When a relationship is not linear, we can transform
one or both variables to achieve a relationship that is
linear.
Three common transformations to induce linearity
are: the logarithmic transformation, the square root
transformation, and the inverse transformation.
All of these transformations produce a new variable
that is mathematically equivalent to the original
variable, but expressed in different measurement
units, e.g. logarithmic units instead of decimal units.
Assumption of Linearity:
When transformations do not work
SW388R7
Data Analysis &
Computers II
Slide 60


When none of the transformations induces linearity
in a relationship, our statistical analysis will
underestimate the presence and strength of the
relationship, i.e. we lose power.
We do have the option of changing the way the
information in the variables is represented, e.g.
substituting several dichotomous variables for a single
metric variable. This bypasses the assumption of
linearity while still attempting to incorporate the
information about the relationship in the analysis.
SW388R7
Data Analysis &
Computers II
Slide 61
Assumption of Linearity:
Creating the scatterplot
Suppose we are interested in
the linearity of the
relationship between "hours
per day watching TV" and
"total hours spent on the
Internet".
The most commonly
recommended strategy for
evaluating linearity is visual
examination of a scatter plot.
To obtain a scatter plot
in SPSS, select the
Scatter… command from
the Graphs menu.
SW388R7
Data Analysis &
Computers II
Slide 62
Assumption of Linearity:
Selecting the type of scatterplot
First, click on
thumbnail sketch of a
simple scatterplot to
highlight it.
Second, click on
the Define button to
specify the variables
to be included in the
scatterplot.
SW388R7
Data Analysis &
Computers II
Slide 63
Assumption of Linearity:
Selecting the variables
First, move the dependent variable
netime to the Y Axis text box.
Second, move the independent
variable tvhours to the X Axis text
box.
Third, click on the OK button to
complete the specifications for
the scatterplot.
If a problem statement mentions a
relationship between two variables
without clearly indicating which is
the independent variable and which
is the dependent variable, the first
mentioned variable is taken to be
the independent variable.
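The same scatterplot can be requested in syntax; a sketch using the variable names netime and tvhours from the slides:

    * Simple scatterplot with the dependent variable on the Y axis.
    GRAPH
      /SCATTERPLOT(BIVAR)=tvhours WITH netime.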
SW388R7
Data Analysis &
Computers II
Slide 64
Assumption of Linearity:
The scatterplot
The scatterplot is produced in
the SPSS output viewer.
The points in a scatterplot are
considered linear if they form
a cigar-shaped elliptical band.
The pattern in this scatterplot
is not really clear.
SW388R7
Data Analysis &
Computers II
Slide 65
Assumption of Linearity:
Adding a trendline
To try to determine if the relationship is linear,
we can add a trendline to the chart.
To add a trendline
to the chart, we
need to open the
chart for editing.
To open the chart
for editing, double
click on it.
SW388R7
Data Analysis &
Computers II
Slide 66
Assumption of Linearity:
The scatterplot in the SPSS Chart Editor
The chart that we
double clicked on is
opened for editing in the
SPSS Chart Editor.
To add the trend
line, select the
Options… command
from the Chart
menu.
SW388R7
Data Analysis &
Computers II
Slide 67
Assumption of Linearity:
Requesting the fit line
In the Scatterplot Options
dialog box, we click on the
Total checkbox in the Fit Line
panel in order to request the
trend line.
Click on the Fit Options…
button to request the r²
coefficient of determination
as a measure of the
strength of the
relationship.
SW388R7
Data Analysis &
Computers II
Slide 68
Assumption of Linearity:
Requesting r²
First, the Linear regression thumbnail
sketch should be highlighted as the type
of fit line to be added to the chart.
Second, click on the Display R-square
in Legend checkbox to add this item to
our output.
Third, click on the Continue button to
complete the options request.
SW388R7
Data Analysis &
Computers II
Slide 69
Assumption of Linearity:
Completing the request for the fit line
Click on the OK button
to complete the
request for the fit line.
SW388R7
Data Analysis &
Computers II
Slide 70
Assumption of Linearity:
The fit line and r²
The red fit line is
added to the chart.
The value of r²
(0.0460)
suggests that
the relationship
is weak.
SW388R7
Data Analysis &
Computers II
Slide 71
Assumption of Linearity:
Computing the transformations
There are four
transformations that we
can use to achieve or
improve linearity.
The compute dialogs for
these four
transformations for
linearity are shown.
SW388R7
Data Analysis &
Computers II
Slide 72
Assumption of Linearity:
Creating the scatterplot matrix
To create the scatterplot
matrix, select the
Scatter… command in
the Graphs menu.
SW388R7
Data Analysis &
Computers II
Slide 73
Assumption of Linearity:
Selecting type of scatterplot
First, click on the
Matrix thumbnail
sketch to indicate
which type of
scatterplot we want.
Second, click on the
Define button to select
the variables for the
scatterplot.
SW388R7
Data Analysis &
Computers II
Slide 74
Assumption of Linearity:
Specifications for scatterplot matrix
First, move the dependent
variable, the independent variable
and all of the transformations to
the Matrix Variables list box.
Second, click
on the OK
button to
produce the
scatterplot.
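A sketch of the equivalent syntax, using the transformation names created earlier:

    * Scatterplot matrix of the dependent variable, the independent variable, and its transformations.
    GRAPH
      /SCATTERPLOT(MATRIX)=netime tvhours lgtvhour sqtvhour intvhour s2tvhour.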
SW388R7
Data Analysis &
Computers II
Slide 75
Assumption of Linearity:
The scatterplot matrix
The scatterplot matrix shows a
thumbnail sketch of scatterplots
for each independent variable or
transformation with the
dependent variable. The
scatterplot matrix may suggest
which transformations might be
useful.
[Figure: Scatterplot matrix of TOTAL TIME SPENT ON THE INTERNET, HOURS PER DAY WATCHING TV, LGTVHOUR, SQTVHOUR, INTVHOUR, and S2TVHOUR.]
SW388R7
Data Analysis &
Computers II
Slide 76
Assumption of Linearity:
Creating the correlation matrix
To create the correlation
matrix, select the
Correlate | Bivariate…
command in the Analyze
menu.
SW388R7
Data Analysis &
Computers II
Slide 77
Assumption of Linearity:
Specifications for correlation matrix
First, move the dependent
variable, the independent variable
and all of the transformations to
the Variables list box.
Second, click on
the OK button to
produce the
correlation matrix.
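A sketch of the equivalent CORRELATIONS command, with the same variable list:

    * Pearson correlations among the dependent variable, the independent variable, and its transformations.
    CORRELATIONS
      /VARIABLES=netime tvhours lgtvhour sqtvhour intvhour s2tvhour
      /PRINT=TWOTAIL NOSIG
      /MISSING=PAIRWISE.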
Assumption of Linearity:
The correlation matrix
SW388R7
Data Analysis &
Computers II
Slide 78
Correlations (Pearson Correlation; **. correlation is significant at the 0.01 level, 2-tailed)
                                    NETIME   TVHOURS  LGTVHOUR  SQTVHOUR  INTVHOUR  S2TVHOUR
TOTAL TIME SPENT ON THE INTERNET    1        .215     .104      .156      .045      .328**
HOURS PER DAY WATCHING TV           .215     1        .874**    .967**    .626**    .903**
LGTVHOUR                            .104     .874**   1         .967**    .910**    .611**
SQTVHOUR                            .156     .967**   .967**    1         .784**    .774**
INTVHOUR                            .045     .626**   .910**    .784**    1         .335**
S2TVHOUR                            .328**   .903**   .611**    .774**    .335**    1
(N = 93 for the Internet variable by itself, 68 for its correlations with the TV-hours variables, and 160 among the TV-hours variables. Two-tailed significance for the first column: .079, .397, .203, .713, and .006.)
The answers to the problems
are based on the correlation
matrix.
Before we answer the
question in this problem, we
will use a script to produce
the output.
SW388R7
Data Analysis &
Computers II
Assumption of Homoscedasticity
Slide 79


Homoscedasticity refers to the assumption that
the dependent variable exhibits similar amounts of
variance across the range of values for an
independent variable.
While it applies to independent variables at all three
measurement levels, the methods that we will use to
evaluate homoscedasticity require that the
independent variable be non-metric (nominal or
ordinal) and the dependent variable be metric
(ordinal or interval). When both variables are
metric, the assumption is evaluated as part of the
residual analysis in multiple regression.
Assumption of Homoscedasticity :
Evaluating homoscedasticity
SW388R7
Data Analysis &
Computers II
Slide 80





Homoscedasticity is evaluated for pairs of variables.
There are both graphical and statistical methods for
evaluating homoscedasticity .
The graphical method is called a boxplot.
The statistical method is the Levene statistic which
SPSS computes for the test of homogeneity of
variances.
Neither of the methods is absolutely definitive.
Assumption of Homoscedasticity :
The boxplot
SW388R7
Data Analysis &
Computers II
Slide 81
Each red box shows the middle
50% of the cases for the group,
indicating how spread out the
group of scores is.
If the variance across
the groups is equal, the
height of the red boxes
will be similar across the
groups.
[Figure: Boxplot of RS HIGHEST DEGREE by MARITAL STATUS (MARRIED, WIDOWED, DIVORCED, SEPARATED, NEVER MARRIED), with outlying cases flagged by case number.]
If the heights of the red
boxes are different, the
plot suggests that the
variance across groups
is not homogeneous.
The married group is
more spread out than
the other groups,
suggesting unequal
variance.
SW388R7
Data Analysis &
Computers II
Slide 82
Assumption of Homoscedasticity :
Levene test of the homogeneity of variance
Test of Homogeneity of Variances: RS HIGHEST DEGREE
Levene Statistic = 5.239, df1 = 4, df2 = 262, Sig. = .000
The null hypothesis for the test of homogeneity of
variance states that the variance of the dependent
variable is equal across groups defined by the
independent variable, i.e., the variance is homogeneous.
Since the probability associated with the Levene Statistic
(<0.001) is less than or equal to the level of
significance, we reject the null hypothesis and conclude
that the variance is not homogeneous.
Assumption of Homoscedasticity :
Transformations
SW388R7
Data Analysis &
Computers II
Slide 83



When the assumption of homoscedasticity is not
supported, we can transform the dependent variable
and test it for homoscedasticity. If the
transformed variable demonstrates homoscedasticity,
we can substitute it in our analysis.
We use the same three common transformations
that we used for normality: the logarithmic
transformation, the square root transformation, and
the inverse transformation.
All of these change the measuring scale on the
horizontal axis of a histogram to produce a
transformed variable that is mathematically
equivalent to the original variable.
Assumption of Homoscedasticity :
When transformations do not work
SW388R7
Data Analysis &
Computers II
Slide 84

When none of the transformations results in
homoscedasticity for the variables in the
relationship, including that variable in the analysis
will reduce our effectiveness at identifying statistical
relationships, i.e. we lose power.
SW388R7
Data Analysis &
Computers II
Slide 85
Assumption of Homoscedasticity :
Request a boxplot
Suppose we want to
test for homogeneity of
variance: whether the
variance in "highest
academic degree" is
homogeneous for the
categories of "marital
status."
The boxplot provides a visual
image of the distribution of the
dependent variable for the
groups defined by the
independent variable.
To request a boxplot, choose
the BoxPlot… command from
the Graphs menu.
SW388R7
Data Analysis &
Computers II
Slide 86
Assumption of Homoscedasticity :
Specify the type of boxplot
First, click on the Simple
style of boxplot to highlight
it with a rectangle around
the thumbnail drawing.
Second, click on the Define
button to specify the
variables to be plotted.
SW388R7
Data Analysis &
Computers II
Slide 87
Assumption of Homoscedasticity :
Specify the dependent variable
First, click on the
dependent variable
to highlight it.
Second, click on the right
arrow button to move the
dependent variable to the
Variable text box.
SW388R7
Data Analysis &
Computers II
Slide 88
Assumption of Homoscedasticity :
Specify the independent variable
First, click on the
independent
variable to highlight
it.
Second, click on the right
arrow button to move the
independent variable to the
Category Axis text box.
SW388R7
Data Analysis &
Computers II
Slide 89
Assumption of Homoscedasticity :
Complete the request for the boxplot
To complete the
request for the
boxplot, click on
the OK button.
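A sketch of the boxplot request in syntax; the variable names degree (highest academic degree) and marital (marital status) are assumed here, so substitute the names in your own data set:

    * Boxplot of the dependent variable for each category of the independent variable.
    * The names degree and marital are assumed, not taken from the slides.
    EXAMINE VARIABLES=degree BY marital
      /PLOT=BOXPLOT
      /STATISTICS=NONE
      /NOTOTAL.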
Assumption of Homoscedasticity :
The boxplot
SW388R7
Data Analysis &
Computers II
Slide 90
Each red box shows the middle
50% of the cases for the group,
indicating how spread out the
group of scores is.
If the variance across
the groups is equal, the
height of the red boxes
will be similar across the
groups.
[Figure: Boxplot of RS HIGHEST DEGREE by MARITAL STATUS (MARRIED, WIDOWED, DIVORCED, SEPARATED, NEVER MARRIED), with outlying cases flagged by case number.]
If the heights of the red
boxes are different, the
plot suggests that the
variance across groups
is not homogeneous.
The married group is
more spread out than
the other groups,
suggesting unequal
variance.
SW388R7
Data Analysis &
Computers II
Slide 91
Assumption of Homoscedasticity :
Request the test for homogeneity of variance
To compute the Levene test for
homogeneity of variance,
select the Compare Means |
One-Way ANOVA… command
from the Analyze menu.
SW388R7
Data Analysis &
Computers II
Slide 92
Assumption of Homoscedasticity :
Specify the independent variable
First, click on the
independent
variable to highlight
it.
Second, click on the right
arrow button to move the
independent variable to the
Factor text box.
SW388R7
Data Analysis &
Computers II
Slide 93
Assumption of Homoscedasticity :
Specify the dependent variable
First, click on the
dependent variable
to highlight it.
Second, click on the right
arrow button to move the
dependent variable to the
Dependent List text box.
SW388R7
Data Analysis &
Computers II
Slide 94
Assumption of Homoscedasticity :
The homogeneity of variance test is an option
Click on the Options…
button to open the options
dialog box.
SW388R7
Data Analysis &
Computers II
Slide 95
Assumption of Homoscedasticity :
Specify the homogeneity of variance test
First, mark the
checkbox for the
Homogeneity of
variance test. All of
the other checkboxes
can be cleared.
Second, click on
the Continue button
to close the options
dialog box.
SW388R7
Data Analysis &
Computers II
Slide 96
Assumption of Homoscedasticity :
Complete the request for output
Click on the OK button to
complete the request for
the homogeneity of
variance test through the
one-way anova procedure.
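A sketch of the equivalent ONEWAY command, again assuming the variable names degree and marital:

    * Levene test of homogeneity of variance through the one-way ANOVA procedure.
    * The names degree and marital are assumed, not taken from the slides.
    ONEWAY degree BY marital
      /STATISTICS=HOMOGENEITY
      /MISSING=ANALYSIS.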
SW388R7
Data Analysis &
Computers II
Slide 97
Assumption of Homoscedasticity :
Interpreting the homogeneity of variance test
Test of Homogeneity of Variances: RS HIGHEST DEGREE
Levene Statistic = 5.239, df1 = 4, df2 = 262, Sig. = .000
The null hypothesis for the test of homogeneity of
variance states that the variance of the dependent
variable is equal across groups defined by the
independent variable, i.e., the variance is homogeneous.
Since the probability associated with the Levene Statistic
(<0.001) is less than or equal to the level of
significance, we reject the null hypothesis and conclude
that the variance is not homogeneous.
SW388R7
Data Analysis &
Computers II
Using scripts
Slide 98



The process of evaluating assumptions requires
numerous SPSS procedures and outputs that are time
consuming to produce.
These procedures can be automated by creating an
SPSS script. A script is a program that executes a
sequence of SPSS commands.
Though writing scripts is not part of this course, we
can take advantage of scripts that I use to reduce
the burdensome tasks of evaluating assumptions.
SW388R7
Data Analysis &
Computers II
Using a script for evaluating assumptions
Slide 99



The script “EvaluatingAssumptionsAndMissingData.exe”
will produce all of the output we have used for
evaluating assumptions.
Navigate to the link “SPSS Scripts and Syntax” on the
course web page.
Download the script file “EvaluatingAssumptionsAnd
MissingData.exe” to your computer and install it,
following the directions on the web page.
SW388R7
Data Analysis &
Computers II
Open the data set in SPSS
Slide 100
Before using a script, a data
set should be open in the
SPSS data editor.
SW388R7
Data Analysis &
Computers II
Invoke the script in SPSS
Slide 101
To invoke the script, select
the Run Script… command
in the Utilities menu.
SW388R7
Data Analysis &
Computers II
Select the script
Slide 102
First, navigate to the folder where you put the script.
If you followed the directions, you will have a file with
an ".SBS" extension in the C:\SW388R7 folder.
If you only see a file with an “.EXE” extension in the
folder, you should double click on that file to extract
the script file to the C:\SW388R7 folder.
Second, click on the
script name to highlight
it.
Third, click on
Run button to
start the script.
SW388R7
Data Analysis &
Computers II
The script dialog
Slide 103
The script dialog box acts
similarly to SPSS dialog
boxes. You select the
variables to include in the
analysis and choose options
for the output.
SW388R7
Data Analysis &
Computers II
Complete the specifications - 1
Slide 104
Move the dependent and
independent variables from the list of
variables to the list boxes. Metric
and nonmetric variables are moved
to separate lists so the computer
knows how you want them treated.
You must also indicate the level
of measurement for the
dependent variable. By default
the metric option button is
marked.
SW388R7
Data Analysis &
Computers II
Complete the specifications - 2
Slide 105
Mark the option button for the type
of output you want the script to
compute.
Select the transformations to be
tested.
Click on the OK button to produce
the output.
SW388R7
Data Analysis &
Computers II
The script finishes
Slide 106
If your SPSS output viewer is
open, you will see the output
produced in that window.
Since it may take a while to
produce the output, and
since there are times when
it appears that nothing is
happening, there is an alert
to tell you when the script is
finished.
Unless you are absolutely
sure something has gone
wrong, let the script run
until you see this alert.
When you see this alert,
click on the OK button.
SW388R7
Data Analysis &
Computers II
Output from the script - 1
Slide 107
The script will produce lots
of output. Additional
descriptive material in the
titles should help link
specific outputs to specific
tasks.
Scroll through the script to
locate the outputs needed
to answer the question.
SW388R7
Data Analysis &
Computers II
Closing the script dialog box
Slide 108
The script dialog box does
not close automatically
because we often want to
run another test right away.
There are two methods for
closing the dialog box.
Click on the Cancel
button to close the
script.
Click on the X
close box to close
the script.
SW388R7
Data Analysis &
Computers II
Problem 1
Slide 109
9. In the dataset GSS2000R, is the following statement true, false, or
an incorrect application of a statistic? Use a level of significance of
0.01 for evaluating missing data and assumptions.
In pre-screening the data for use in a multiple regression of the
dependent variable "total hours spent on the Internet" [netime] with
the independent variables "age" [age], "sex" [sex], and "income"
[rincom98], the evaluation of the assumptions of normality, linearity,
and homogeneity of variance did not indicate any need for a caution to
be added to the interpretation of the analysis.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
SW388R7
Data Analysis &
Computers II
Level of measurement
Slide 110
Since we are pre-screening for a multiple regression
problem, we should make sure we satisfy the level of
measurement requirements before proceeding.
"Total hours spent on the Internet" [netime] is interval,
satisfying the metric level of measurement requirement
for the dependent variable.
9. In the dataset GSS2000R, is the following statement true, false, or
an incorrect application of a statistic? Use a level of significance of
0.01 for evaluating missing data and assumptions.
In pre-screening the data for use in a multiple regression of the
dependent variable "total hours spent on the Internet" [netime] with
the independent variables "age" [age], "sex" [sex], and "income"
[rincom98], the evaluation of the assumptions of normality, linearity,
and homogeneity of variance did not indicate any need for a caution to
be added to the interpretation of the analysis.
"Age" [age] and "highest year of school completed" [educ] are interval,
satisfying the metric or dichotomous level of measurement requirement for
independent variables.
"Sex" [sex] is dichotomous, satisfying the metric or dichotomous level of
measurement requirement for independent variables.
"Income" [rincom98] is ordinal, satisfying the metric or dichotomous level of
measurement requirement for independent variables. Since some data
analysts do not agree with this convention of treating an ordinal variable as
metric, a note of caution should be included in our interpretation.
SW388R7
Data Analysis &
Computers II
Run the script to test normality - 1
Slide 111
To run the script to test
assumptions, choose the
Run Script… command from
the Utilities menu.
SW388R7
Data Analysis &
Computers II
Run the script to test normality - 2
Slide 112
First, navigate to the
SW388R7 folder on your
computer.
Second, click on the script name to select it:
EvaluatingAssumptionsAndMissingData.SBS
Third, click on
the Run button to
open the script.
SW388R7
Data Analysis &
Computers II
Run the script to test normality - 3
Slide 113
First, move the variables to the
list boxes based on the role that
the variable plays in the analysis
and its level of measurement.
Second, click on the Normality option
button to request that SPSS produce
the output needed to evaluate the
assumption of normality.
Third, mark the checkboxes
for the transformations that
we want to test in evaluating
the assumption.
Fourth, click on
the OK button to
produce the output.
SW388R7
Data Analysis &
Computers II
Normality of the dependent variable
Slide 114
Descriptives: TOTAL TIME SPENT ON THE INTERNET
Mean = 10.7312 (Std. Error = 1.59183); 95% Confidence Interval for Mean = 7.5697 to 13.8927; 5% Trimmed Mean = 8.2949; Median = 5.5000; Variance = 235.655; Std. Deviation = 15.35106; Minimum = .20; Maximum = 102.00; Range = 101.80; Interquartile Range = 10.2000; Skewness = 3.532 (Std. Error = .250); Kurtosis = 15.614 (Std. Error = .495)
The dependent variable "total hours spent on
the Internet" [netime] did not satisfy the
criteria for a normal distribution. Both the
skewness (3.532) and kurtosis (15.614) fell
outside the range from -1.0 to +1.0.
SW388R7
Data Analysis &
Computers II
Normality of transformed dependent variable
Slide 115
Since "total hours spent on the Internet"
[netime] did not satisfy the criteria for
normality, we examine the skewness and
kurtosis of each of the transformations to
see if any of them satisfy the criteria.
The "log of total hours spent on the Internet
[LGNETIME=LG10(NETIME)]" satisfied the criteria for a
normal distribution. The skewness of the distribution
(-0.150) was between -1.0 and +1.0 and the kurtosis
of the distribution (0.127) was between -1.0 and +1.0.
The "log of total hours spent on the Internet
[LGNETIME=LG10(NETIME)]" was substituted for "total
hours spent on the Internet" [netime] in the analysis.
SW388R7
Data Analysis &
Computers II
Normality of the independent variables - 1
Slide 116
Descriptives: AGE OF RESPONDENT
Mean = 45.99 (Std. Error = 1.023); 95% Confidence Interval for Mean = 43.98 to 48.00; 5% Trimmed Mean = 45.31; Median = 43.50; Variance = 282.465; Std. Deviation = 16.807; Minimum = 19; Maximum = 89; Range = 70; Interquartile Range = 24.00; Skewness = .595 (Std. Error = .148); Kurtosis = -.351 (Std. Error = .295)
The independent variable "age" [age]
satisfied the criteria for a normal distribution.
The skewness of the distribution (0.595) was
between -1.0 and +1.0 and the kurtosis of
the distribution (-0.351) was between -1.0
and +1.0.
SW388R7
Data Analysis &
Computers II
Normality of the independent variables - 2
Slide 117
Descriptives: RESPONDENTS INCOME
Mean = 13.35 (Std. Error = .419); 95% Confidence Interval for Mean = 12.52 to 14.18; 5% Trimmed Mean = 13.54; Median = 15.00; Variance = 29.535; Std. Deviation = 5.435; Minimum = 1; Maximum = 23; Range = 22; Interquartile Range = 8.00; Skewness = -.686 (Std. Error = .187); Kurtosis = -.253 (Std. Error = .373)
The independent variable "income"
[rincom98] satisfied the criteria for a normal
distribution. The skewness of the distribution
(-0.686) was between -1.0 and +1.0 and the
kurtosis of the distribution (-0.253) was
between -1.0 and +1.0.
SW388R7
Data Analysis &
Computers II
Run the script to test linearity - 1
Slide 118
If the script was not closed after
it was used for normality, we can
take advantage of the
specifications already entered. If
the script was closed, re-open it
as you would for normality.
First, click on the Linearity option
button to request that SPSS produce
the output needed to evaluate the
assumption of linearity.
When the linearity option
is selected, a default set of
transformations to test is
marked.
SW388R7
Data Analysis &
Computers II
Run the script to test linearity - 2
Slide 119
Since we have already decided to use the
log of the dependent variable to satisfy
normality, that is the form of the
dependent variable we want to evaluate
with the independent variables. Mark this
checkbox for the dependent variable and
clear the others.
Click on the OK
button to produce
the output.
SW388R7
Data Analysis &
Computers II
Linearity test with age of respondent
Slide 120
Correlations with Logarithm of NETIME [LG10(NETIME)]
                                     Pearson Correlation   Sig. (2-tailed)   N
  AGE OF RESPONDENT                  .074                  .483              93
  Logarithm of AGE [LG10(AGE)]       .119                  .257              93
  Square Root of AGE [SQRT(AGE)]     .096                  .362              93
  Inverse of AGE [-1/(AGE)]          .164                  .116              93
  **. Correlation is significant at the 0.01 level (2-tailed).
The assessment of the linear relationship between "log of total hours spent on the Internet [LGNETIME=LG10(NETIME)]" and "age" [age] indicated that the relationship was weak, rather than nonlinear. The statistical probabilities associated with the correlation coefficients measuring the relationship with the untransformed independent variable (r=0.074, p=0.483), the logarithmic transformation (r=0.119, p=0.257), the square root transformation (r=0.096, p=0.362), and the inverse transformation (r=0.164, p=0.116) were all greater than the level of significance for testing assumptions (0.01).
There was no evidence that the assumption of linearity was violated.
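The linearity check above is simply a set of Pearson correlations between the (already log-transformed) dependent variable and each candidate form of the independent variable, with the p-values compared to the 0.01 level. A minimal Python sketch follows; the DataFrame and its columns are hypothetical illustrations, the inputs are assumed complete and aligned, and scipy's pearsonr stands in for the SPSS script.

    import numpy as np
    import pandas as pd
    from scipy import stats

    # Hypothetical data standing in for the GSS variables in this problem.
    gss = pd.DataFrame({"netime": [0.5, 1, 2, 4, 7, 12, 30],
                        "age":    [22, 28, 35, 43, 51, 60, 72]}).astype(float)

    lgnetime = np.log10(gss["netime"])   # transformed DV carried over from the normality step
    age = gss["age"]

    forms = {
        "age":        age,               # untransformed IV
        "LG10(age)":  np.log10(age),
        "SQRT(age)":  np.sqrt(age),
        "-1/(age)":   -1.0 / age,
    }

    for name, iv in forms.items():
        r, p = stats.pearsonr(lgnetime, iv)
        # If every p-value exceeds 0.01, the relationship is treated as weak
        # rather than nonlinear, and no caution is added.
        print("%-10s r=%6.3f p=%.3f" % (name, r, p))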
SW388R7
Data Analysis &
Computers II
Linearity test with respondent’s income
Slide 121
Correlations with Logarithm of NETIME [LG10(NETIME)]
                                                                   Pearson Correlation   Sig. (2-tailed)   N
  RESPONDENTS INCOME                                               -.053                 .658              72
  Logarithm of Reflected Values of RINCOM98 [LG10(24-RINCOM98)]    .063                  .600              72
  Square Root of Reflected Values of RINCOM98 [SQRT(24-RINCOM98)]  .060                  .617              72
  Inverse of Reflected Values of RINCOM98 [-1/(24-RINCOM98)]       .073                  .540              72
  **. Correlation is significant at the 0.01 level (2-tailed).
The assessment of the linear relationship between "log of total hours spent on the Internet [LGNETIME=LG10(NETIME)]" and "income" [rincom98] indicated that the relationship was weak, rather than nonlinear. The statistical probabilities associated with the correlation coefficients measuring the relationship with the untransformed independent variable (r=-0.053, p=0.658), the logarithmic transformation (r=0.063, p=0.600), the square root transformation (r=0.060, p=0.617), and the inverse transformation (r=0.073, p=0.540) were all greater than the level of significance for testing assumptions (0.01).
There was no evidence that the assumption of linearity was violated.
SW388R7
Data Analysis &
Computers II
Slide 122
Run the script to test
homogeneity of variance - 1
If the script was not closed after
it was used for normality, we can
take advantage of the
specifications already entered. If
the script was closed, re-open it
as you would for normality.
First, click on the Homogeneity of
variance option button to request that
SPSS produce the output needed to
evaluate the assumption of
homogeneity.
When the homogeneity of
variance option is selected, a
default set of transformations
to test is marked.
SW388R7
Data Analysis &
Computers II
Slide 123
Run the script to test
homogeneity of variance - 2
In this problem, we have
already decided to use the log
transformation for the
dependent variable, so we
only need to test it. Next, clear
all of the transformation
checkboxes except for
Logarithmic.
Finally, click on
the OK button to
produce the output.
SW388R7
Data Analysis &
Computers II
Levene test of homogeneity of variance
Slide 124
Test of Homogeneity of Variances: Logarithm of NETIME [LG10(NETIME)]
  Levene Statistic   df1   df2   Sig.
  .166               1     91    .685
Based on the Levene Test, the variance in "log of total hours spent on the Internet [LGNETIME=LG10(NETIME)]" was homogeneous for the categories of "sex" [sex]. The probability associated with the Levene statistic (0.166) was p=0.685, greater than the level of significance for testing assumptions (0.01). The null hypothesis that the group variances were equal was not rejected.
The homogeneity of variance assumption was satisfied.
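The same Levene test can be sketched in Python with scipy.stats.levene; scipy defaults to the median-centered (Brown-Forsythe) variant, so center="mean" is passed to approximate SPSS's statistic. The DataFrame and its columns are hypothetical illustrations, not the course data.

    import pandas as pd
    from scipy import stats

    # Hypothetical data: the transformed DV and the nonmetric grouping variable.
    gss = pd.DataFrame({"lgnetime": [0.1, 0.3, 0.7, 1.0, 0.2, 0.5, 0.9, 1.3],
                        "sex":      ["male"] * 4 + ["female"] * 4})

    groups = [g["lgnetime"].dropna() for _, g in gss.groupby("sex")]
    levene_stat, p = stats.levene(*groups, center="mean")

    # Compare p to the 0.01 level used for testing assumptions: p > 0.01 means
    # the null hypothesis of equal group variances is not rejected.
    print("Levene statistic = %.3f, p = %.3f" % (levene_stat, p))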
SW388R7
Data Analysis &
Computers II
Answer 1
Slide 125
In pre-screening the data for use in a multiple regression of the
dependent variable "total hours spent on the Internet" [netime]
with the independent variables "age" [age], "sex" [sex], and
"income" [rincom98], the evaluation of the assumptions of
normality, linearity, and homogeneity of variance did not
indicate any need for a caution to be added to the
interpretation of the analysis.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
The logarithmic transformation of the dependent variable [LGNETIME=LG10(NETIME)] solved the only problem with normality that we encountered. In that form, the relationship with the metric independent variables was weak, but there was no evidence of nonlinearity. The variance of the log transform of the dependent variable was homogeneous for the categories of the nonmetric variable sex.
No cautions were needed because of a violation of assumptions. A caution was needed because respondent’s income was ordinal level.
The answer to the problem is true with caution.
SW388R7
Data Analysis &
Computers II
Problem 2
Slide 126
14. In the dataset 2001WorldFactbook, is the following statement
true, false, or an incorrect application of a statistic? Use a level of
significance of 0.01 for evaluating missing data and assumptions.
In pre-screening the data for use in a multiple regression of the
dependent variable "life expectancy at birth" [lifeexp] with the
independent variables "population growth rate" [pgrowth], "percent of
the total population who was literate" [literacy], and "per capita GDP"
[gdp], the evaluation of the assumptions of normality, linearity, and
homogeneity of variance did not indicate any need for a caution to be
added to the interpretation of the analysis.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
SW388R7
Data Analysis &
Computers II
Level of measurement
Slide 127
Since we are pre-screening
for a multiple regression
problem, we should make
sure we satisfy the level of
measurement before
proceeding.
"Life expectancy at birth" [lifeexp] is
interval, satisfying the metric level of
measurement requirement for the
dependent variable.
14. In the dataset 2001WorldFactbook, is the following statement true,
false, or an incorrect application of a statistic? Use a level of
significance of 0.01 for evaluating missing data and assumptions.
In pre-screening the data for use in a multiple regression of the
dependent variable "life expectancy at birth" [lifeexp] with the
independent variables "population growth rate" [pgrowth], "percent of
the total population who was literate" [literacy], and "per capita GDP"
[gdp], the evaluation of the assumptions of normality, linearity, and
homogeneity of variance did not indicate any need for a caution to be
added to the interpretation of the analysis.
"Population growth rate" [pgrowth] "percent of the total
population who was literate" [literacy] and "per capita GDP"
[gdp] are interval, satisfying the metric or dichotomous level
of measurement requirement for independent variables.
SW388R7
Data Analysis &
Computers II
Run the script to test normality - 1
Slide 128
To run the script to test
assumptions, choose the
Run Script… command from
the Utilities menu.
SW388R7
Data Analysis &
Computers II
Run the script to test normality - 2
Slide 129
First, navigate to the
SW388R7 folder on your
computer.
Second, click on the script name to select it:
EvaluatingAssumptionsAndMissingData.SBS
Third, click on
the Run button to
open the script.
SW388R7
Data Analysis &
Computers II
Run the script to test normality - 3
Slide 130
First, move the variables to the list boxes based on the role that the variable plays in the analysis and its level of measurement.
Second, click on the Normality option button to request that SPSS produce the output needed to evaluate the assumption of normality.
Third, mark the checkboxes for the transformations that we want to test in evaluating the assumption.
Fourth, click on the OK button to produce the output.
SW388R7
Data Analysis &
Computers II
Normality of the dependent variable
Slide 131
Descriptives: Life expectancy at birth - total population
  Mean                              66.9009   (Std. Error .78648)
  95% Confidence Interval for Mean  Lower Bound 65.3508, Upper Bound 68.4510
  5% Trimmed Mean                   67.7063
  Median                            70.6900
  Variance                          135.462
  Std. Deviation                    11.63879
  Minimum                           36.45
  Maximum                           83.47
  Range                             47.02
  Interquartile Range               14.9400
  Skewness                          -.997     (Std. Error .164)
  Kurtosis                          .005      (Std. Error .327)
The dependent variable "life expectancy at birth" [lifeexp] satisfied the criteria for a normal distribution. The skewness of the distribution (-0.997) was between -1.0 and +1.0 and the kurtosis of the distribution (0.005) was between -1.0 and +1.0.
SW388R7
Data Analysis &
Computers II
Normality of the first independent variable
Slide 132
Descriptives: Population growth rate
  Mean                              1.4944    (Std. Error .09456)
  95% Confidence Interval for Mean  Lower Bound 1.3081, Upper Bound 1.6808
  5% Trimmed Mean                   1.4365
  Median                            1.4000
  Variance                          1.958
  Std. Deviation                    1.39929
  Minimum                           -1.14
  Maximum                           13.39
  Range                             14.53
  Interquartile Range               1.8000
  Skewness                          2.885     (Std. Error .164)
  Kurtosis                          22.665    (Std. Error .327)
The independent variable "population growth rate" [pgrowth] did not satisfy the criteria for a normal distribution. Both the skewness (2.885) and kurtosis (22.665) fell outside the range from -1.0 to +1.0.
SW388R7
Data Analysis &
Computers II
Slide 133
Normality of transformed independent
variable
Neither the logarithmic
(skew=-0.218,
kurtosis=1.277), the
square root (skew=0.873,
kurtosis=5.273), nor the
inverse transformation
(skew=-1.836,
kurtosis=5.763) induced
normality in the variable
"population growth rate"
[pgrowth].
A caution was added to
the findings.
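Because "population growth rate" has a minimum of -1.14, the script adds a constant before transforming (the 2.14+PGROWTH seen in the later correlation output). The sketch below loops over the three candidate transformations under the same hypothetical-DataFrame assumptions as the earlier sketches; with the real data none of them passes the -1.0 to +1.0 screen.

    import numpy as np
    import pandas as pd

    def is_reasonably_normal(series):
        s = series.dropna()
        return -1.0 <= s.skew() <= 1.0 and -1.0 <= s.kurtosis() <= 1.0

    # Hypothetical data frame standing in for the 2001WorldFactbook variables.
    wf = pd.DataFrame({"pgrowth": [-1.14, 0.1, 0.5, 1.4, 2.0, 3.5, 13.39]})

    x = wf["pgrowth"]
    shift = 1.0 - x.min()                  # 1 - (-1.14) = 2.14 keeps every argument positive

    candidates = {
        "LG10(2.14+PGROWTH)": np.log10(shift + x),
        "SQRT(2.14+PGROWTH)": np.sqrt(shift + x),
        "-1/(2.14+PGROWTH)":  -1.0 / (shift + x),
    }

    for name, t in candidates.items():
        print("%-20s skew=%6.3f kurtosis=%7.3f normal=%s"
              % (name, t.skew(), t.kurtosis(), is_reasonably_normal(t)))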
SW388R7
Data Analysis &
Computers II
Slide 134
Normality of the second independent variable
Descriptives: Percent literate - total population
  Mean                              80.032    (Std. Error 1.5443)
  95% Confidence Interval for Mean  Lower Bound 76.987, Upper Bound 83.077
  5% Trimmed Mean                   81.856
  Median                            90.000
  Variance                          484.100
  Std. Deviation                    22.0023
  Minimum                           13.6
  Maximum                           100.0
  Range                             86.4
  Interquartile Range               31.300
  Skewness                          -1.112    (Std. Error .171)
  Kurtosis                          .081      (Std. Error .340)
The independent variable "percent of the total population who was literate" [literacy] did not satisfy the criteria for a normal distribution. The kurtosis of the distribution (0.081) was between -1.0 and +1.0, but the skewness of the distribution (-1.112) fell outside the range from -1.0 to +1.0.
SW388R7
Data Analysis &
Computers II
Slide 135
Normality of transformed independent
variable
Since the distribution was skewed to the
left, it was necessary to reflect, or reverse
code, the values for the variable before
computing the transformation.
The "square root of percent of the total population who was literate
(using reflected values) [SQLITERA=SQRT(101-LITERACY)]" satisfied
the criteria for a normal distribution. The skewness of the distribution
(0.567) was between -1.0 and +1.0 and the kurtosis of the distribution
(-0.964) was between -1.0 and +1.0. The "square root of percent of the
total population who was literate (using reflected values)
[SQLITERA=SQRT(101-LITERACY)]" was substituted for "percent of the
total population who was literate" [literacy] in the analysis.
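Reflection simply subtracts each value from the maximum plus one, so the long left tail becomes a right tail before the square root is taken. A minimal sketch, again assuming a hypothetical wf DataFrame with a literacy column:

    import numpy as np
    import pandas as pd

    # Hypothetical data frame standing in for the 2001WorldFactbook variables.
    wf = pd.DataFrame({"literacy": [13.6, 50.0, 78.0, 90.0, 95.0, 99.0, 100.0]})

    literacy = wf["literacy"]
    reflected = literacy.max() + 1 - literacy   # 101 - LITERACY reverses the skew
    sqlitera = np.sqrt(reflected)               # SQLITERA = SQRT(101 - LITERACY)

    print("skew=%.3f kurtosis=%.3f" % (sqlitera.skew(), sqlitera.kurtosis()))
    # Keep in mind that reflection reverses the direction of relationships with
    # other variables, which is why the reflected transforms correlate negatively
    # with life expectancy in the linearity output that follows.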
SW388R7
Data Analysis &
Computers II
Normality of the third independent variable
Slide 136
Descriptives: Per capita GDP
  Mean                              8554.43   (Std. Error 580.523)
  95% Confidence Interval for Mean  Lower Bound 7410.27, Upper Bound 9698.59
  5% Trimmed Mean                   7818.67
  Median                            5000.00
  Variance                          7.4E+07
  Std. Deviation                    8590.954
  Minimum                           510
  Maximum                           36400
  Range                             35890
  Interquartile Range               11200.00
  Skewness                          1.207     (Std. Error .164)
  Kurtosis                          .475      (Std. Error .327)
The independent variable "per capita GDP" [gdp] did not satisfy the criteria for a normal distribution. The kurtosis of the distribution (0.475) was between -1.0 and +1.0, but the skewness of the distribution (1.207) fell outside the range from -1.0 to +1.0.
SW388R7
Data Analysis &
Computers II
Slide 137
Normality of transformed independent
variable
The "square root of per
capita GDP
[SQGDP=SQRT(GDP)]"
satisfied the criteria for a
normal distribution. The
skewness of the
distribution (0.614) was
between -1.0 and +1.0
and the kurtosis of the
distribution (-0.773) was
between -1.0 and +1.0.
The "square root of per
capita GDP
[SQGDP=SQRT(GDP)]"
was substituted for "per
capita GDP" [gdp] in the
analysis.
SW388R7
Data Analysis &
Computers II
Run the script to test linearity - 1
Slide 138
If the script was not closed after
it was used for normality, we can
take advantage of the
specifications already entered. If
the script was closed, re-open it
as you would for normality.
First, click on the Linearity option
button to request that SPSS produce
the output needed to evaluate the
assumption of linearity.
When the linearity option
is selected, a default set of
transformations to test is
marked.
Click on the OK
button to produce
the output.
SW388R7
Data Analysis &
Computers II
Linearity test with population growth rate
Slide 139
Correlations with Life expectancy at birth - total population
                                               Pearson Correlation   Sig. (2-tailed)   N
  Population growth rate                       -.262**               .000              219
  Logarithm of PGROWTH [LG10(2.14+PGROWTH)]    -.314**               .000              219
  Square Root of PGROWTH [SQRT(2.14+PGROWTH)]  -.301**               .000              219
  Inverse of PGROWTH [-1/(2.14+PGROWTH)]       -.282**               .000              219
  **. Correlation is significant at the 0.01 level (2-tailed).
The assessment of the linearity of the relationship between "life expectancy at birth" [lifeexp] and "population growth rate" [pgrowth] indicated that the relationship could be considered linear because the probability associated with the correlation coefficient for the relationship (r=-0.262) was statistically significant (p<0.001), and none of the statistically significant transformations for population growth rate had a relationship that was substantially stronger. The relationship between the untransformed variables was assumed to satisfy the assumption of linearity.
SW388R7
Data Analysis &
Computers II
Linearity test with population literacy
Slide 140
Correlations with Life expectancy at birth - total population
                                                                     Pearson Correlation   Sig. (2-tailed)   N
  Percent literate - total population                                .724**                .000              203
  Logarithm of Reflected Values of LITERACY [LG10(101-LITERACY)]     -.670**               .000              203
  Square Root of Reflected Values of LITERACY [SQRT(101-LITERACY)]   -.720**               .000              203
  Inverse of Reflected Values of LITERACY [-1/(101-LITERACY)]        -.467**               .000              203
  **. Correlation is significant at the 0.01 level (2-tailed).
The transformation "square root of percent of the total population who was literate (using reflected values) [SQLITERA=SQRT(101-LITERACY)]" was incorporated in the analysis in the evaluation of normality. Additional transformations for linearity were not considered.
SW388R7
Data Analysis &
Computers II
Linearity test with per capita GDP
Slide 141
Correlations with Life expectancy at birth - total population
                                   Pearson Correlation   Sig. (2-tailed)   N
  Per capita GDP                   .643**                .000              219
  Logarithm of GDP [LG10(GDP)]     .762**                .000              219
  Square Root of GDP [SQRT(GDP)]   .713**                .000              219
  Inverse of GDP [-1/(GDP)]        .727**                .000              219
  **. Correlation is significant at the 0.01 level (2-tailed).
The transformation "square root of per capita GDP [SQGDP=SQRT(GDP)]" was incorporated in the analysis in the evaluation of normality. Additional transformations for linearity were not considered.
SW388R7
Data Analysis &
Computers II
Slide 142
Run the script to test
homogeneity of variance - 1
There were no nonmetric
variables in this analysis, so the
test of homogeneity of variance
was not conducted.
SW388R7
Data Analysis &
Computers II
Answer 2
Slide 143
In pre-screening the data for use in a multiple regression of the
dependent variable "life expectancy at birth" [lifeexp] with the
independent variables "population growth rate" [pgrowth], "percent of
the total population who was literate" [literacy], and "per capita GDP"
[gdp], the evaluation of the assumptions of normality, linearity, and
homogeneity of variance did not indicate any need for a caution to be
added to the interpretation of the analysis.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Two transformations were substituted to satisfy the assumption of normality: the "square root of percent of the total population who was literate (using reflected values) [SQLITERA=SQRT(101-LITERACY)]" was substituted for "percent of the total population who was literate" [literacy], and the "square root of per capita GDP [SQGDP=SQRT(GDP)]" was substituted for "per capita GDP" [gdp] in the analysis.
However, none of the transformations induced normality in the variable "population growth rate" [pgrowth]. A caution was added to the findings.
The answer to the problem is false. A caution was added because "population growth rate" [pgrowth] did not satisfy the assumption of normality and none of the transformations were successful in inducing normality.
SW388R7
Data Analysis &
Computers II
Slide 144
Steps in evaluating assumptions:
level of measurement
The following is a guide to the decision process for answering
problems about assumptions for multiple regression:
Is the dependent variable metric and the independent variables metric or dichotomous?
  Yes -> continue with the evaluation of the assumptions.
  No  -> incorrect application of a statistic.
SW388R7
Data Analysis &
Computers II
Slide 145
Steps in evaluating assumptions:
assumption of normality for metric variable
Does the dependent variable satisfy the criteria for a normal distribution?
  Yes -> Assumption satisfied; use the untransformed variable in the analysis.
  No  -> Does one or more of the transformations satisfy the criteria for a normal distribution?
    Yes -> Assumption satisfied; use the transformed variable with the smallest skew.
    No  -> Assumption not satisfied; use the untransformed variable in the analysis and add a caution to the interpretation.
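The decision rule on this slide can be expressed as a short function. This is a sketch only, reusing the hypothetical is_reasonably_normal helper from the earlier sketches and assuming pandas Series inputs; it is not part of the SPSS script.

    def is_reasonably_normal(series):
        s = series.dropna()
        return -1.0 <= s.skew() <= 1.0 and -1.0 <= s.kurtosis() <= 1.0

    def choose_form_for_normality(original, transforms):
        # original: pandas Series; transforms: dict of name -> transformed Series.
        # Returns (series to use in the analysis, whether a caution is needed).
        if is_reasonably_normal(original):
            return original, False                       # use untransformed variable
        normal = {name: t for name, t in transforms.items()
                  if is_reasonably_normal(t)}
        if normal:
            best = min(normal, key=lambda name: abs(normal[name].skew()))
            return normal[best], False                   # transformed variable, smallest skew
        return original, True                            # not satisfied: add a caution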
SW388R7
Data Analysis &
Computers II
Slide 146
Steps in evaluating assumptions:
assumption of linearity for metric variables
If the dependent variable was transformed for normality, substitute the transformed dependent variable in the test for the assumption of linearity.
Was the independent variable transformed for normality?
  Yes -> Skip the test.
  No  -> Is the probability of the correlation (r) for the relationship between the IV and the DV <= the level of significance?
    Yes -> Is the probability of the correlation (r) for any transformed IV significant AND its r greater than the r of the untransformed IV by 0.20?
      Yes -> Assumption satisfied; use the transformed variable with the highest r.
      No  -> Assumption satisfied; use the untransformed independent variable.
    No  -> Is the probability of the correlation (r) for the relationship between any transformed IV and the DV <= the level of significance?
      Yes -> Assumption satisfied; use the transformed variable with the highest r.
      No  -> Interpret the relationship as weak, not nonlinear; no caution needed.
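Under the assumption that the independent variable was not itself transformed for normality, the branching above can be sketched as a function. This is a hypothetical helper, not part of the SPSS script: stats.pearsonr stands in for the script's correlation output, the inputs are assumed complete and aligned, and the 0.20 value is the improvement threshold named on this slide.

    from scipy import stats

    def linearity_decision(dv, iv, iv_transforms, alpha=0.01):
        # dv: (possibly transformed) dependent variable; iv: untransformed IV;
        # iv_transforms: dict of name -> transformed IV. Returns the action to take.
        r_iv, p_iv = stats.pearsonr(dv, iv)
        results = {name: stats.pearsonr(dv, t) for name, t in iv_transforms.items()}
        if p_iv <= alpha:
            stronger = [name for name, (r, p) in results.items()
                        if p <= alpha and abs(r) > abs(r_iv) + 0.20]
            if stronger:
                return "assumption satisfied: use transformed IV with highest r"
            return "assumption satisfied: use untransformed IV"
        if any(p <= alpha for _, p in results.values()):
            return "assumption satisfied: use transformed IV with highest r"
        return "relationship is weak, not nonlinear; no caution needed"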
SW388R7
Data Analysis &
Computers II
Slide 147
Steps in evaluating assumptions:
homogeneity of variance for nonmetric variables
If the dependent variable was transformed for normality, substitute the transformed dependent variable in the test for the assumption of homogeneity of variance.
Is the probability of the Levene statistic <= the level of significance?
  Yes -> Assumption not satisfied; add a caution to the interpretation.
  No  -> Assumption satisfied.