Lecture 12 - University of Pennsylvania


Stat 112: Lecture 22 Notes
• Chapter 9.1: One-way Analysis of
Variance.
• Chapter 9.3: Two-way Analysis of Variance
• Homework 6 is due on Friday.
Errors in Hypothesis Testing
                           State of World
Decision Based on Data     Null Hypothesis True    Alternative Hypothesis True
Accept Null Hypothesis     Correct Decision        Type II error
Reject Null Hypothesis     Type I error            Correct Decision
When we do one hypothesis test and reject the null hypothesis if the p-value is <0.05, then
the probability of making a Type I error when the null hypothesis is true is 0.05. We
protect against falsely rejecting a null hypothesis by making the probability of a Type I
error small.
Multiple Comparisons Problem
• Compound uncertainty: When doing more
than one test, there is an increased
chance of a Type I error
• If we do multiple hypothesis tests and use
the rule of rejecting the null hypothesis in
each test if the p-value is <0.05, then if all
the null hypotheses are true, the
probability of falsely rejecting at least one
null hypothesis is >0.05.
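To see how quickly the chance of at least one false rejection grows, here is a minimal sketch. It assumes the k tests are independent, which the slides do not require (the name familywise_error_rate is made up for illustration). Under that assumption the familywise rate is 1 - (1 - 0.05)^k.

```python
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """P(at least one false rejection) for k tests when every null is true.

    Assumption (not from the slides): the k tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** k


# Familywise rate for a few family sizes, each test run at level 0.05.
for k in (1, 3, 6, 20):
    print(f"k = {k:2d}: familywise rate = {familywise_error_rate(k):.3f}")
```

With dependent tests the exact rate differs, but it still exceeds 0.05 whenever more than one test can falsely reject.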
Individual vs. Familywise Error
Rate
• When several tests are considered
simultaneously, they constitute a family of tests.
• Individual Type I error rate: Probability for a
single test that the null hypothesis will be
rejected assuming that the null hypothesis is
true.
• Familywise Type I error rate: Probability for a
family of tests that at least one null hypothesis will
be rejected assuming that all of the null
hypotheses are true.
• When we consider a family of tests, we want to
make the familywise error rate small, say 0.05,
to protect against falsely rejecting a null
hypothesis.
Bonferroni Method
• General method for doing multiple comparisons
for any family of k tests.
• Denote familywise type I error rate we want by
p*, say p*=0.05.
• Compute p-values for each individual test: $p_1, \ldots, p_k$.
• Reject the null hypothesis for the ith test if $p_i < p^*/k$.
• Guarantees that familywise type I error rate is at
most p*.
• Why Bonferroni works: If we do k tests and all
null hypotheses are true, then using Bonferroni
with p*=0.05, we have probability 0.05/k to make
a Type I error for each test and expect to make
k*(0.05/k)=0.05 errors in total.
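The expected-count argument can also be written as a bound on the familywise error rate itself. The following is a sketch using the union bound, which the slides do not spell out; $A_i$ is notation introduced here for the event that test i falsely rejects its null hypothesis.

```latex
P(\text{at least one Type I error})
  = P\Big(\bigcup_{i=1}^{k} A_i\Big)
  \le \sum_{i=1}^{k} P(A_i)
  \le k \cdot \frac{p^*}{k}
  = p^*
```

The bound does not need the tests to be independent, which is why Bonferroni applies to any family of k tests.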
Bonferroni on Milgram’s Data
Oneway Analysis of Voltage Level By Condition
Output obtained from Fit Y by X, Compare Means, Each Pair Student’s t:

Level            - Level           Difference   Lower CL   Upper CL   p-Value
Remote           Touch-Proximity     136.8750    86.2157   187.5343   3.2771e-7
Voice-Feedback   Touch-Proximity      99.7500    49.0907   150.4093   0.0001484
Remote           Proximity            93.0000    42.3407   143.6593   0.0003890
Voice-Feedback   Proximity            55.8750     5.2157   106.5343   0.0308583
Proximity        Touch-Proximity      43.8750    -6.7843    94.5343   0.0891141
Remote           Voice-Feedback       37.1250   -13.5343    87.7843   0.1497462
1. Suppose we are interested in comparing all pairs of groups.
Then there are six tests, and so using Bonferroni, we should
only reject each test if the p-value is less than 0.05/6=0.0083.
We conclude that there is strong evidence that remote has a
higher mean than touch proximity, voice feedback has a higher
mean than touch proximity and remote has a higher mean than
proximity, but that there is not strong evidence for any other
pairs of groups having different means.
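The all-pairs decision rule can be sketched in a few lines. The p-values are copied from the Student's t output above; the function name bonferroni_reject is hypothetical and not part of JMP.

```python
# Student's t p-values for the six pairwise comparisons (from the JMP output).
P_VALUES = {
    ("Remote", "Touch-Proximity"): 3.2771e-7,
    ("Voice-Feedback", "Touch-Proximity"): 0.0001484,
    ("Remote", "Proximity"): 0.0003890,
    ("Voice-Feedback", "Proximity"): 0.0308583,
    ("Proximity", "Touch-Proximity"): 0.0891141,
    ("Remote", "Voice-Feedback"): 0.1497462,
}


def bonferroni_reject(p_values, familywise_rate=0.05):
    """Return the comparisons whose p-value is below familywise_rate / k."""
    threshold = familywise_rate / len(p_values)
    return [pair for pair, p in p_values.items() if p < threshold]


rejected = bonferroni_reject(P_VALUES)
print(f"per-test threshold = {0.05 / len(P_VALUES):.4f}")
for higher, lower in rejected:
    print(f"strong evidence that {higher} has a higher mean than {lower}")
```

Passing a smaller dictionary (for example only the three comparisons involving Remote) changes k and therefore the threshold, which is why the family has to be chosen before looking at the data.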
Bonferroni on Milgram’s Data
Continued
2. Suppose we are only interested in comparing remote to the
three other groups. Then there are three tests, and so using
Bonferroni, we should only reject each test if the p-value is less
than 0.05/3=0.0167. We conclude that there is strong evidence
that remote has a higher mean than touch-proximity
and proximity.
Important Note: We need to decide what family of tests we are
interested in before looking at the data.
Tukey’s HSD
• Tukey’s HSD is a method that is
specifically designed to control the
familywise type I error rate (at 0.05) for
analysis of variance when we are
interested in comparing all pairs of groups.
• JMP Instructions: After Fit Y by X, click the
red triangle next to the X variable and click
LSMeans Tukey HSD.
Tukey’s HSD for Milgram’s Data
Oneway Analysis of Voltage Level By Condition
Means Comparisons
Comparisons for all pairs using Tukey-Kramer HSD
Level             Letters   Mean
Remote            A         405.00000
Voice-Feedback    A B       367.87500
Proximity           B C     312.00000
Touch-Proximity       C     268.12500
Levels not connected by same letter are significantly different

Level            - Level           Difference   Lower CL   Upper CL
Remote           Touch-Proximity     136.8750    70.2722   203.4778
Voice-Feedback   Touch-Proximity      99.7500    33.1472   166.3528
Remote           Proximity            93.0000    26.3972   159.6028
Voice-Feedback   Proximity            55.8750   -10.7278   122.4778
Proximity        Touch-Proximity      43.8750   -22.7278   110.4778
Remote           Voice-Feedback       37.1250   -29.4778   103.7278
Pairs of groups which are significantly different according to Tukey’s HSD
Procedure: Remote and Proximity, Remote and Touch Proximity, Voice
Feedback and Touch Proximity.
The 95% confidence intervals are adjusted so that the familywise coverage
rate is 95%, i.e., 95% of the time all of the confidence intervals will
contain the true parameters.
Assumptions in one-way ANOVA
• Assumptions needed for validity of oneway analysis of variance p-values and CIs:
– Linearity: automatically satisfied.
– Constant variance: Spread within each group
is the same.
– Normality: Distribution within each group is
normally distributed.
– Independence: Sample consists of
independent observations.
Rule of thumb for checking
constant variance
• Constant variance: Look at standard deviation of
different groups by using Fit Y by X and clicking Means
and Std Dev.
Means and Std Deviations
Level             Number   Mean      Std Dev   Std Err Mean
Proximity             40   312.000   129.979         20.552
Remote                40   405.000    63.640         10.062
Touch-Proximity       40   268.125   131.874         20.851
Voice-Feedback        40   367.875   119.518         18.897
• Rule of Thumb: Check whether (highest group standard
deviation/lowest group standard deviation) is greater
than 2. If greater than 2, then constant variance is not
reasonable and a transformation should be considered. If
less than 2, then constant variance is reasonable.
• (Highest group standard deviation/lowest group standard
deviation) =(131.874/63.640)=2.07. Thus, constant
variance is not reasonable for Milgram’s data.
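The rule of thumb is easy to automate. This is a minimal sketch using the group standard deviations from the table above; the helper name sd_ratio is made up for illustration.

```python
# Group standard deviations of voltage level, from the Means and Std Deviations table.
GROUP_SD = {
    "Proximity": 129.979,
    "Remote": 63.640,
    "Touch-Proximity": 131.874,
    "Voice-Feedback": 119.518,
}


def sd_ratio(group_sd):
    """Largest group standard deviation divided by the smallest."""
    return max(group_sd.values()) / min(group_sd.values())


ratio = sd_ratio(GROUP_SD)
constant_variance_ok = ratio < 2  # rule-of-thumb cutoff from the slide
print(f"SD ratio = {ratio:.2f}, constant variance reasonable: {constant_variance_ok}")
```

The same helper can be applied to the standard deviations of a transformed response to see whether the transformation brings the ratio below 2.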
Transformations to correct for
nonconstant variance
• If standard deviation is highest for groups with high
means, try transforming Y to log Y or $\sqrt{Y}$. If standard
deviation is highest for groups with low means, try
transforming Y to $Y^2$.
• SD is particularly low for group with highest mean. Try
transforming to $Y^2$. To make the transformation, right
click in new column, click New Column and then right
click again in the created column and click Formula and
enter the appropriate formula for the transformation.
Transformation of Milgram’s data to
Squared Voltage Level
Means and Std Deviations
Level             Number     Mean   Std Dev   Std Err Mean
Proximity             40   113816   78920.2          12478
Remote                40   167974   48541.4           7675
Touch-Proximity       40    88847   79291.3          12537
Voice-Feedback        40   149259   74053.6          11709
• Check of constant variance for transformed data:
(Highest group standard deviation/lowest group
standard deviation) = (79291.3/48541.4) = 1.63. Constant variance
assumption is reasonable for voltage squared.
• Analysis of variance tests are approximately
valid for voltage squared data; reanalyze the data
using voltage squared.
Analysis using Voltage Squared
Strong evidence that the group mean voltage squared levels are not all the same.
Response Voltage Squared
Effect Tests
Source      Nparm   DF   Sum of Squares   F Ratio   Prob > F
Condition       3    3       1.50737e11    9.8735     <.0001
Effect test gives strong evidence that not all conditions have the same mean voltage squared.
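The F ratio in the effect test can be reproduced from the group summary statistics alone. The sketch below assumes equal group sizes (40 per group, as in the voltage squared Means and Std Deviations table) and uses the rounded means and standard deviations printed there, so it agrees with JMP only up to rounding.

```python
# Summary statistics for voltage squared, in the order
# Proximity, Remote, Touch-Proximity, Voice-Feedback.
N_PER_GROUP = 40
MEANS = [113816, 167974, 88847, 149259]
SDS = [78920.2, 48541.4, 79291.3, 74053.6]

k = len(MEANS)
grand_mean = sum(MEANS) / k  # valid because all groups have the same size

# Between-group sum of squares and mean square (k - 1 degrees of freedom).
ss_between = N_PER_GROUP * sum((m - grand_mean) ** 2 for m in MEANS)
ms_between = ss_between / (k - 1)

# Within-group mean square: with equal n this is the average group variance.
ms_within = sum(s ** 2 for s in SDS) / k

f_ratio = ms_between / ms_within
print(f"SS between = {ss_between:.5e}, F ratio = {f_ratio:.4f}")
```

A large F ratio means the group means are spread out more than the within-group variability would explain; the p-value itself requires the F distribution with 3 and 156 degrees of freedom and is not computed here.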
Oneway Analysis of Voltage Squared By Condition
Comparisons for all pairs using Tukey-Kramer HSD
Level             Letters   Mean
Remote            A         167973.75
Voice-Feedback    A B       149259.38
Proximity           B C     113816.25
Touch-Proximity       C      88846.88
Levels not connected by same letter are significantly different

Level            - Level           Difference   Lower CL   Upper CL
Remote           Touch-Proximity     79126.88    37701.9   120551.8
Voice-Feedback   Touch-Proximity     60412.50    18987.6   101837.4
Remote           Proximity           54157.50    12732.6    95582.4
Voice-Feedback   Proximity           35443.13    -5981.8    76868.1
Proximity        Touch-Proximity     24969.38   -16455.6    66394.3
Remote           Voice-Feedback      18714.38   -22710.6    60139.3
Strong evidence that remote has higher mean voltage squared level than proximity
and touch-proximity and that voice-feedback has higher mean voltage squared level
than touch-proximity, taking into account the multiple comparisons.
Rule of Thumb for Checking
Normality in ANOVA
• The normality assumption for ANOVA is that the
distribution in each group is normal. Can be checked by
looking at the boxplot, histogram and normal quantile
plot for each group.
• If there are more than 30 observations in each group,
then the normality assumption is not important; ANOVA
p-values and CIs will still be approximately valid even for
nonnormal data if there are more than 30 observations in
each group.
• If there are fewer than 30 observations per group, then we
can check normality by clicking Analyze, Distribution and
then putting the Y variable in the Y, Columns box and the
categorical variable denoting the group in the By box.
We can then create normal quantile plots for each group
and check that for each group, the points in the normal
quantile plot are in the confidence bands. If there is
nonnormality, we can try to use a transformation such as
log Y and see if the transformed data is approximately
normally distributed in each group.
One-way Analysis of Variance:
Steps in Analysis
1. Check assumptions (constant variance,
normality, independence). If constant variance
is violated, try transformations.
2. Use the effect test (commonly called the F-test) to test
whether all group means are the same.
3. If the effect test shows that at least two group means
differ, use Tukey’s HSD
procedure to investigate which groups are
different, taking into account the fact that multiple
comparisons are being done.
Analysis of Variance Terminology
• The criterion (criteria) by which we classify the
groups in analysis of variance is called a factor.
In one-way analysis of variance, we have one
factor.
• The possible values of the factor are levels.
• Milgram’s study: Factor is experimental condition
with levels remote, voice-feedback, proximity
and touch-proximity.
• Two-way analysis of variance: Groups are
classified by two factors.
Two-way Analysis of Variance
Examples
• Milgram’s study: In thinking about the Obedience to
Authority study, many people have thought that women
would react differently than men. Two-way analysis of
variance setup in which the two factors are experimental
condition (levels remote, voice-feedback, proximity,
touch-proximity) and sex (levels male, female).
• Package Design Experiment: Several new types of
cereal packages were designed. Two colors and two
styles of lettering were considered. Each combination of
lettering/color was used to produce a package, and each
of these combinations was test marketed in 12
comparable stores and sales in the stores were
recorded. Two-way analysis of variance in which the two
factors are color (levels red, green) and lettering (levels
block, script).
• Goal of two-way analysis of variance: Find out how the
mean response in a group depends on the levels of both
factors and find the best combination.
Two-way Analysis of Variance
• The mean of the group with the ith level of factor
1 and the jth level of factor 2 is denoted $\mu_{ij}$, e.g.,
in the package-design experiment, the four group
means are
$\mu_{\text{red,block}}, \mu_{\text{red,script}}, \mu_{\text{green,block}}, \mu_{\text{green,script}}$
• As with one-way analysis of variance, two-way
analysis of variance can be seen as a special
case of multiple regression. For two-way
analysis of variance, we have two categorical
explanatory variables for the two factors and
also include an interaction between the factors.
Response Sales
Effect Tests
Source            Nparm   DF   Sum of Squares   F Ratio   Prob > F
Color                 1    1        4641.3333    3.1762     0.0816
TypeStyle             1    1        5985.3333    4.0959     0.0491
TypeStyle*Color       1    1         972.0000    0.6652     0.4191
Expanded Estimates
Nominal factors expanded to all levels
Term                              Estimate    Std Error   t Ratio   Prob>|t|
Intercept                         144.91667    5.517577     26.26     <.0001
Color[Green]                      -9.833333    5.517577     -1.78     0.0816
Color[Red]                        9.8333333    5.517577      1.78     0.0816
TypeStyle[Block]                  -11.16667    5.517577     -2.02     0.0491
TypeStyle[Script]                 11.166667    5.517577      2.02     0.0491
TypeStyle[Block]*Color[Green]     -4.5         5.517577     -0.82     0.4191
TypeStyle[Block]*Color[Red]       4.5          5.517577      0.82     0.4191
TypeStyle[Script]*Color[Green]    4.5          5.517577      0.82     0.4191
TypeStyle[Script]*Color[Red]      -4.5         5.517577     -0.82     0.4191
Estimated Mean for Red Block group = 144.92+9.83-11.17+4.5 = 148.08
Estimated Mean for Red Script group = 144.92+9.83+11.17-4.5 = 161.42
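The two estimated means quoted for the red groups follow a general recipe: intercept plus the color effect, the typestyle effect and the interaction effect for that cell. As a minimal sketch (the function name estimated_mean is hypothetical), the same recipe gives all four group means from the expanded estimates:

```python
# Expanded estimates from the JMP output for the model with interaction.
INTERCEPT = 144.91667
COLOR = {"Green": -9.833333, "Red": 9.8333333}
TYPESTYLE = {"Block": -11.16667, "Script": 11.166667}
INTERACTION = {
    ("Block", "Green"): -4.5,
    ("Block", "Red"): 4.5,
    ("Script", "Green"): 4.5,
    ("Script", "Red"): -4.5,
}


def estimated_mean(color, typestyle):
    """Intercept + color effect + typestyle effect + interaction effect."""
    return (INTERCEPT + COLOR[color] + TYPESTYLE[typestyle]
            + INTERACTION[(typestyle, color)])


for color in ("Red", "Green"):
    for typestyle in ("Block", "Script"):
        print(f"{color} {typestyle}: {estimated_mean(color, typestyle):.2f}")
```

Because the model with interaction has one parameter combination per group, these estimates coincide with the four sample group means.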
[Figure: three LS Means Plots of Sales LS Means (y-axis from 50 to 250): the first against Color (Green, Red), the second against TypeStyle (Block, Script), and the third against Color (Green, Red) with separate lines for Script and Block.]
The LS Means Plots show how the means of the
groups vary as the levels of the factors vary.
For the top plot for color, green refers to the mean
of the two green groups (green block and green
script) and red refers to the mean of the two red
groups (red block and red script). Similarly for the
second plot for TypeStyle, block refers to the mean
of the two block groups (red block and green
block). The third plot for TypeStyle*Color shows
the mean of all four groups.
Two-way ANOVA in JMP
• Use Analyze, Fit Model with a categorical
variable for the first factor, a categorical variable
for the second factor and an interaction variable
that crosses the first factor and the second
factor.
• The LS Means Plots are produced by going to
the output in JMP for each variable that is to the
right of the main output, clicking the red triangle
next to each variable (for package design, the
variables are Color, TypeStyle, TypeStyle*Color)
and clicking LS Means Plot.
Interaction in Two-Way ANOVA
• Interaction between two factors: The impact of one factor
on the response depends on the level of the other factor.
• For package design experiment, there would be an
interaction between color and typestyle if the impact of
color on sales depended on the level of typestyle.
• Formally, there is an interaction if
$\mu_{\text{red,block}} - \mu_{\text{red,script}} \neq \mu_{\text{green,block}} - \mu_{\text{green,script}}$
• LS Means Plot suggests there is not much interaction.
Impact of changing color from red to green on mean
sales is about the same when the typestyle is block as
when the typestyle is script.
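The interaction condition is a difference of differences, which can be computed directly from the four sample group means. The means below are taken from the package-design Means and Std Deviations table in the Checking Assumptions section; the variable names are illustrative only.

```python
# Sample mean sales for each color/typestyle combination.
GROUP_MEANS = {
    ("red", "block"): 148.083,
    ("red", "script"): 161.417,
    ("green", "block"): 119.417,
    ("green", "script"): 150.750,
}

# Block-minus-script gap within each color.
red_gap = GROUP_MEANS[("red", "block")] - GROUP_MEANS[("red", "script")]
green_gap = GROUP_MEANS[("green", "block")] - GROUP_MEANS[("green", "script")]

# With no interaction in the population, the two gaps estimate the same quantity.
interaction_contrast = red_gap - green_gap
print(f"red gap = {red_gap:.2f}, green gap = {green_gap:.2f}, "
      f"contrast = {interaction_contrast:.2f}")
```

A nonzero sample contrast is expected from noise alone; whether it is large enough to indicate a real interaction is what the effect test for the interaction term assesses.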
Effect Test for Interaction
• A formal test of the null hypothesis that
there is no interaction, $H_0: \mu_{ij} - \mu_{i'j} = \mu_{ij'} - \mu_{i'j'}$
for all levels i, j, i', j' of factors 1 and 2,
versus the alternative hypothesis that
there is an interaction is given by the
Effect Test for the interaction variable
(here TypeStyle*Color).
Effect Tests
Source            Nparm   DF   Sum of Squares   F Ratio   Prob > F
Color                 1    1        4641.3333    3.1762     0.0816
TypeStyle             1    1        5985.3333    4.0959     0.0491
TypeStyle*Color       1    1         972.0000    0.6652     0.4191
• p-value for Effect Test = 0.4191. No
evidence of an interaction.
Implications of No Interaction
• When there is no interaction, the two factors can be
looked at in isolation, one at a time.
• When there is no interaction, best group is determined
by finding best level of factor 1 and best level of factor 2
separately.
• For package design experiment, suppose there are two
separate groups: one with an expertise in lettering and
the other with expertise in coloring. If there is no
interaction, groups can work independently to decide
best letter and color. If there is an interaction, groups
need to get together to decide on best combination of
letter and color.
Model when There is No Interaction
• When there is no evidence of an
interaction, we can drop the interaction
term from the model for parsimony and
more accurate estimates:
Response Sales
Effect Tests
Source      Nparm   DF   Sum of Squares   F Ratio   Prob > F
Color           1    1        4641.3333    3.2000     0.0804
TypeStyle       1    1        5985.3333    4.1266     0.0481

Expanded Estimates
Nominal factors expanded to all levels
Term                Estimate    Std Error   t Ratio   Prob>|t|
Intercept           144.91667    5.497011     26.36     <.0001
Color[Green]        -9.833333    5.497011     -1.79     0.0804
Color[Red]          9.8333333    5.497011      1.79     0.0804
TypeStyle[Block]    -11.16667    5.497011     -2.03     0.0481
TypeStyle[Script]   11.166667    5.497011      2.03     0.0481
Mean for red block group = 144.92+9.83-11.17 = 143.58
Mean for red script group = 144.92+9.83+11.17 = 165.92
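Dropping the interaction term leaves the sums of squares for Color and TypeStyle unchanged but moves the interaction's sum of squares and its one degree of freedom into the error term, which is why the F ratios shift slightly. The sketch below reproduces both sets of F ratios; it assumes the balanced design described earlier (12 stores per combination) and uses the group standard deviations from the package-design Means and Std Deviations table in the Checking Assumptions section.

```python
N_PER_GROUP = 12
GROUP_SD = [37.4929, 33.5129, 44.8461, 36.1272]  # GreenBlo, GreenScr, RedBlock, RedScrip
SS = {"Color": 4641.3333, "TypeStyle": 5985.3333, "TypeStyle*Color": 972.0}

# Error term for the model with interaction: pooled within-group variation.
df_error_full = len(GROUP_SD) * (N_PER_GROUP - 1)            # 4 * 11 = 44
ss_error_full = (N_PER_GROUP - 1) * sum(s ** 2 for s in GROUP_SD)
mse_full = ss_error_full / df_error_full

# Without the interaction, its sum of squares and 1 df are pooled into error.
ss_error_reduced = ss_error_full + SS["TypeStyle*Color"]
df_error_reduced = df_error_full + 1                          # 45
mse_reduced = ss_error_reduced / df_error_reduced

# Every effect has 1 degree of freedom, so its mean square equals its SS.
f_full = {name: ss / mse_full for name, ss in SS.items()}
f_reduced = {name: SS[name] / mse_reduced for name in ("Color", "TypeStyle")}

print({name: round(f, 4) for name, f in f_full.items()})
print({name: round(f, 4) for name, f in f_reduced.items()})
```

The change is small here because the interaction sum of squares is small relative to the error sum of squares; with a large interaction, pooling would inflate the error term and the main-effect tests would not be meaningful.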
Tests for Main Effects When There
is No Interaction
• Effect test for color: Tests null hypothesis that group mean does not
depend on color versus alternative that group mean is different for at
least two levels of color. p-value =0.0804, moderate but not strong
evidence that group mean depends on color.
• Effect test for TypeStyle: Tests null hypothesis that group mean
does not depend on TypeStyle versus alternative that group mean is
different for at least two levels of TypeStyle. p-value = 0.0481,
evidence that group mean depends on TypeStyle.
• These are called tests for “main effects.” These tests only make
sense when there is no interaction.
Example with an Interaction
• Should the clerical employees of a large
insurance company be switched to a four-day
week, allowed to use flextime schedules or kept
to the usual 9-to-5 workday?
• The data set flextime.JMP contains percentage
efficiency gains over a four week trial period for
employees grouped by two factors: Department
(Claims, Data Processing, Investment) and
Condition (Flextime, Four-day week, Regular
Hours).
Response Improve
Effect Tests
Source                 Nparm   DF   Sum of Squares    F Ratio   Prob > F
Department                 2    2         154.3087     8.0662     0.0006
Condition                  2    2           0.5487     0.0287     0.9717
Condition*Department       4    4        5588.2004   146.0566     <.0001
There is strong evidence of an interaction.
[Figure: Interaction Profiles. Two panels plot Improve (y-axis from -15 to 25): one against Condition (Flex, FourDay, Regular) with a line for each Department (Claims, DP, Invest), the other against Department with a line for each Condition.]
Which schedule is best appears to differ by department.
Four day is best for investment employees, but
worst for data processing employees.
Which Combination Works Best?
• For which pairs of groups is there strong
evidence that the groups have different
means? Is there strong evidence that one
combination works best?
• We combine the two factors into one factor
(Combination) and use Tukey’s HSD to
compare groups pairwise, adjusting for
multiple comparisons.
Oneway Analysis of Improve By Combination
Means Comparisons
Comparisons for all pairs using Tukey-Kramer HSD
Level           Letters   Mean
DPFlex          A         16.89091
InvestFourDay   A         16.87273
InvestRegular     B        9.38182
ClaimsFlex          C      4.32727
ClaimsRegular       C      4.20000
ClaimsFourDay       C      3.12727
DPRegular           C      2.21818
DPFourDay             D   -4.74545
InvestFlex            D   -5.65455
Levels not connected by same letter are significantly different
For Data Processing employees, there is strong evidence
that flextime is best. For Investment employees, there is strong
evidence that Four Day is best. For claims employees, there is
not strong evidence that any of the schedules have different means.
Checking Assumptions
• As with one-way ANOVA, two-way ANOVA is a
special case of multiple regression and relies on
the assumptions:
– Linearity: Automatically satisfied
– Constant variance: Spread within groups is the same
for all groups.
– Normality: Distribution within each group is normal.
• To check assumptions, combine two factors into
one factor (Combination) and check
assumptions as in one-way ANOVA.
Checking Assumptions
Means and Std Deviations
Level      Number   Mean      Std Dev   Std Err Mean   Lower 95%   Upper 95%
GreenBlo       12   119.417   37.4929         10.823       95.59      143.24
GreenScr       12   150.750   33.5129          9.674      129.46      172.04
RedBlock       12   148.083   44.8461         12.946      119.59      176.58
RedScrip       12   161.417   36.1272         10.429      138.46      184.37
• Check for constant variance: (Largest
standard deviation of group/Smallest
standard deviation of group)
=(44.85/33.51) = 1.34 < 2. Constant variance OK.
• Check for normality: Look at normal
quantile plots for each combination (not
shown). For all normal quantile plots, the
points fall within the 95% confidence
bands. Normality assumption OK.
Two-way Analysis of Variance:
Steps in Analysis
1. Check assumptions (constant variance, normality,
independence). If constant variance is violated, try
transformations.
2. Use the effect test (commonly called the F-test) to test
whether there is an interaction.
3. If there is no interaction, use the main effect tests to test
whether each factor has an effect. Compare individual
levels of a factor by using t-tests with Bonferroni
correction for the number of comparisons being made.
4. If there is an interaction, use the interaction plot to
visualize the interaction. Create a combination of the
factors and use Tukey’s HSD procedure to investigate
which groups are different, taking into account the fact
that multiple comparisons are being done.