Impact of a simulation/randomization-based curriculum on student
understanding of p-values and confidence intervals
Beth Chance
Karen McGaughey
Jimmy Wong
Cal Poly – San Luis Obispo
ICOTS9
Outline
• About the curriculum (Karen)
• Evaluating the curriculum (Beth)
• Benefits/Cautions/Suggestions (Karen)
• Next Steps (Beth)
Background
• Randomization-based introductory statistics
courses (Saturday workshop)
• Introducing all inferential techniques through
simulation and randomization-based methods
• e.g., permutation tests, bootstrapping
• Tintle et al. (2015) text (Roy, Session 4A)
• Focus on overall statistical process
via genuine research studies
• Normal-based methods presented
as alternative approximation to
simulation results
Background
• Spiraled just-in-time curriculum:
• Brief introduction to probability through simulation
• e.g., Monty Hall problem, coin tossing
• Develop understanding of probability as a long-run
proportion
• Statistical Inference (Ch. 1)
• Process probability/one proportion
• One mean, two proportions, two means, matched pairs,
multiple proportions, multiple means, regression
• Deeper dive in each iteration
• Interspersed as needed: discussions of random
sampling, random assignment, graphical displays,
scope of conclusions, etc.
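The idea of probability as a long-run proportion can be illustrated with a short simulation of the Monty Hall problem. This is only a minimal sketch: the function name, trial count, and seed are illustrative assumptions, not part of the curriculum materials.

```python
import random

def monty_hall_win_rate(switch, trials=10_000, seed=1):
    """Long-run proportion of wins for a 'stay' or 'switch' strategy."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)  # door hiding the prize
        pick = rng.randrange(3)   # contestant's first choice
        # The host opens a door that is neither the pick nor the prize.
        opened = rng.choice([d for d in range(3) if d != pick and d != prize])
        if switch:
            # Move to the one remaining closed door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == prize)
    return wins / trials

stay_rate = monty_hall_win_rate(switch=False)
switch_rate = monty_hall_win_rate(switch=True)
print(f"stay: {stay_rate:.3f}  switch: {switch_rate:.3f}")
```

Over many repetitions the two proportions should settle near 1/3 and 2/3, which is the long-run interpretation the slide refers to.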
Background
• Ch 1: Test of significance
• One proportion
• Facial Prototyping – “Bob & Tim” (Lea, Thomas, Lamkin, & Bell, 2007)
• Binary response
• Overwhelmingly name left picture “Tim” (e.g., ~80%)
• Two competing explanations for the study outcome:
• “Random chance alone”
• Research conjecture
• Could the observed statistic plausibly have happened by
random chance alone?
• Design the simulation:
• What does “by random chance alone” look like?
• Coin tossing model
• Tactile & via computer
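The coin tossing model can be sketched in a few lines of Python. The slide gives only an approximate percentage (~80%), so the class size of 30 and the count of 24 below are hypothetical values chosen for illustration, as are the function name and seed.

```python
import random

def simulate_one_proportion(n, observed_count, null_pi=0.5, reps=10_000, seed=1):
    """Proportion of simulated samples with at least observed_count successes."""
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(reps):
        # Toss n coins, each landing "Tim" with probability null_pi.
        heads = sum(rng.random() < null_pi for _ in range(n))
        if heads >= observed_count:
            at_least_as_extreme += 1
    return at_least_as_extreme / reps

# Hypothetical class: 24 of 30 students (80%) name the left picture "Tim".
p_value = simulate_one_proportion(n=30, observed_count=24)
print(f"simulated p-value: {p_value:.4f}")
```

A very small simulated p-value says that a result this extreme would rarely happen by random chance alone, which favors the research conjecture.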
Background
• Ch. 3: Confidence Intervals = Interval of plausible values
• Example: Reese’s Pieces (n = 40, p̂ = 16/40 = 0.40)
• Test (via simulation) for plausible values of population
proportion given observed sample proportion

Test            Two-sided p-value   Decision at 0.05 significance level   Plausible?
Ho: π = 0.26    0.0430              Reject Ho                             No
Ho: π = 0.27    0.0800              Fail to reject Ho                     Yes
:               :                   Fail to reject Ho                     Yes
:               :                   Fail to reject Ho                     Yes
Ho: π = 0.55    0.0770              Fail to reject Ho                     Yes
Ho: π = 0.56    0.0450              Reject Ho                             No
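The "interval of plausible values" idea can be sketched as a loop over candidate null values, keeping those that are not rejected at the 0.05 level. This is a rough illustration using the Reese's Pieces numbers (n = 40, 16 successes). The two-sided rule used here (simulated counts at least as far from the expected count as the observed count) is an assumption, so the simulated p-values will not match the slide's values exactly.

```python
import random

def two_sided_p_value(n, observed, null_percent, reps, rng):
    """Simulated two-sided p-value for Ho: pi = null_percent / 100."""
    pi = null_percent / 100
    # Distances are scaled by 100 so the comparison uses exact integers.
    observed_distance = abs(100 * observed - n * null_percent)
    extreme = 0
    for _ in range(reps):
        count = sum(rng.random() < pi for _ in range(n))
        if abs(100 * count - n * null_percent) >= observed_distance:
            extreme += 1
    return extreme / reps

def plausible_values(n, observed, alpha=0.05, reps=1_000, seed=1):
    """Null values (0.10 to 0.80 in steps of 0.01) not rejected at level alpha."""
    rng = random.Random(seed)
    keep = []
    for null_percent in range(10, 81):
        if two_sided_p_value(n, observed, null_percent, reps, rng) > alpha:
            keep.append(null_percent / 100)
    return keep

plausible = plausible_values(n=40, observed=16)
interval = (min(plausible), max(plausible))
print(f"plausible values run from {interval[0]:.2f} to {interval[1]:.2f}")
```

The endpoints should land close to the 0.27 and 0.55 shown in the table, with some variation from one simulation run to the next.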
Background
• Ch 5: Two proportions
• Dolphin Therapy
(Antonioli & Reveley,
2005)
                   Therapy group
                   Dolphin   Control
Improved           10        3
Did not improve    5         12
• Binary response
• Designed experiment
• Two competing
explanations:
• Ho: “random chance alone”
• Ha: research conjecture
• Could the observed statistic
plausibly have happened by
random chance alone?
• Design the simulation:
• Card shuffling
• Tactile & via computer
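The card shuffling simulation for the dolphin therapy data can be sketched as follows. The counts come from the table above; the function name, number of shuffles, and seed are illustrative assumptions.

```python
import random

def shuffle_test(improved_a, n_a, improved_b, n_b, reps=10_000, seed=1):
    """Simulated one-sided p-value for a difference in two proportions."""
    rng = random.Random(seed)
    # One card per subject: 1 = improved, 0 = did not improve.
    total_improved = improved_a + improved_b
    cards = [1] * total_improved + [0] * (n_a + n_b - total_improved)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(cards)             # re-randomize subjects to groups
        shuffled_a = sum(cards[:n_a])  # improvers dealt to group A
        # The total number of improvers is fixed, so a difference in
        # proportions at least as large as the observed one is the same
        # event as group A receiving at least improved_a improvers.
        if shuffled_a >= improved_a:
            extreme += 1
    return extreme / reps

observed_diff = 10 / 15 - 3 / 15
p_value = shuffle_test(improved_a=10, n_a=15, improved_b=3, n_b=15)
print(f"observed difference: {observed_diff:.3f}, simulated p-value: {p_value:.4f}")
```

Shuffling the cards mimics the random assignment under "random chance alone": every subject keeps their outcome, and only the group labels change.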
2013-2014 Evaluation
• New and experienced teachers
• 15 institutions (HS, community college, university)
• 15 instructors (fall) and 23 instructors (spring, 12 new)
• Over 1500 students
• Assessment
• (Modified) CAOS pre and post tests (Tintle, Session 8A)
• SATS attitudes pre and post tests (Swanson, Session 1F)
• Set of common multiple choice exam questions
• 25 instructors, 774-826 students
• Final exam transfer question
One Proportion (Exam 1)
• Research question: Are city residents more likely to
watch a movie at home rather than in the theater?
Q1: Picking the correct null hypothesis (overall percentages)
Adult residents of the city are equally likely to choose to
watch the movie at home as to watch the movie at the
theater.
92.9%
Adult residents of the city are more likely to choose to
watch the movie at home than to watch the movie at the
theater.
5.8%
Adult residents of the city are less likely to choose to
watch the movie at home than to watch at the theater.
0.6%
Other
0.6%
One Proportion (Exam 1)
• Research question: Are city residents more likely to
watch a movie at home rather than in the theater?
Q2: Picking the correct alternative hypothesis
Adult residents of the city are equally likely to choose to
watch the movie at home as to watch the movie at the
theater.
1.7%
Adult residents of the city are more likely to choose to
watch the movie at home than to watch the movie at the
theater.
90.1%
Adult residents of the city are less likely to choose to
watch the movie at home than to watch at the theater.
5.6%
Other
2.7%
One Proportion (Exam 1)
• Research question: Are city residents more likely to
watch a movie at home rather than in the theater?
Q3: Result is statistically significant (p = 0.012), which explanation
is more plausible?
More than half of the adult residents in her city prefer to
watch the movie at home.
65.6%
There is no overall preference for movie-watching-at-home
in her city, but by pure chance her sample just happened
to have an unusually high number of people choose to
watch the movie at home.
6.0%
(a) and (b) are equally plausible explanations.
29.7%
Substantial section-to-section variability!
One Proportion (Exam 1)
• Research question: Are city residents more likely to
watch a movie at home rather than in the theater?
Q4: Most valid interpretation of p-value?
A sample proportion as large as or larger than hers would rarely
occur.
14.0%
A sample proportion as large as or larger than hers would rarely
occur if the study had been conducted properly.
6.9%
A sample proportion as large as or larger than hers would
rarely occur if 50% of adults in the population prefer to watch
the movie at home.
59.9% (higher for experienced instructors)
A sample proportion as large as or larger than hers would rarely
occur if more than 50% of adults in the population prefer to
watch the movie at home
20.3%
One Proportion (Exam 1)
• Research question: Are city residents more likely to
watch a movie at home rather than in the theater?
Q5: Would 95% confidence interval contain 0.5?
Yes
25.3%
No
43.8%
Not enough information
31.0%
Two Proportions (Exam 2)
• Research question: Are women more likely to
dream in color than men?
Q1: Best conclusion from a non-significant result (p-value not small)?
You have found strong evidence that there is no difference
between the proportions of men and women in your community
that dream in color.
14.5%
You have not found enough evidence to conclude that there is
a difference between the proportions of men and women in
your community that dream in color.
72.8% (higher for new instructors)
You have found strong evidence against the claim that there is a
difference between the proportions of men and women that
dream in color.
10.7%
Because the result is not significant, we can’t conclude anything
from this study.
4.1%
Two Proportions (Exam 2)
• Research question: Are women more likely to
dream in color than men?
Q2: Best interpretation from small p-value?
It would not be very surprising to obtain the observed sample results if there is
really no difference between the proportions of men and women in your
community that dream in color.
5.0%
It would be very surprising to obtain the observed sample results if there is
really no difference between the proportions of men and women in your
community that dream in color.
56.5%
It would be very surprising to obtain the observed sample results if there is really
a difference between the proportion of men and women in your community that
dream in color.
7.9%
The probability is very small that there is no difference between the proportions
of men and women in your community that dream in color.
22.6%
The probability is very small that there is a difference between the proportions
of men and women in your community that dream in color.
8.4%
Two Proportions (Exam 2)
• Research question: Are women more likely to
dream in color than men?
Q3: If there really is a difference, why might you get a large p-value?
Something went wrong with the analysis, and the results
of this study cannot be trusted.
6.1%
There must not be a difference after all and the other
research studies were flawed.
3.8%
The sample size might have been too small to detect a
difference even if there is one.
90.1%
Two Proportions (Exam 2)
• Research question: Are women more likely to
dream in color than men?
Q4: Which has stronger evidence of a difference: Study A vs. Study B?
Study A: 40/100 vs. 20/100
80.3%
Study B: 35/100 vs. 25/100
4.4%
The strength of evidence would be similar for these two
studies
15.3%
Two Proportions (Exam 2)
• Research question: Are women more likely to
dream in color than men?
Q5: Which has stronger evidence of a difference: Study C vs. Study
D (30% vs. 20%)?
Study C: sample sizes of 100 and 100
83.0%
Study D: sample sizes of 40 and 40
6.0%
The strength of evidence would be similar for these two
studies
10.8%
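The comparisons in Q4 and Q5 can be checked with a standardized statistic. The sketch below uses the normal-based two-proportion z-statistic with a pooled standard error rather than a simulation, which is a swapped-in shortcut for illustration. The counts for Study D (12 and 8 out of 40) are derived from the stated 30% and 20%.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Standardized difference in sample proportions (pooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = {
    "A": two_proportion_z(40, 100, 20, 100),  # 40% vs. 20%, n = 100 each
    "B": two_proportion_z(35, 100, 25, 100),  # 35% vs. 25%, n = 100 each
    "C": two_proportion_z(30, 100, 20, 100),  # 30% vs. 20%, n = 100 each
    "D": two_proportion_z(12, 40, 8, 40),     # 30% vs. 20%, n = 40 each
}
for study, value in z.items():
    print(f"Study {study}: z = {value:.2f}")
```

A larger observed difference (A vs. B) or a larger sample size (C vs. D) gives a larger standardized statistic, and so stronger evidence of a difference.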
Two Proportions (Exam 2)
• Research question: Are women more likely to
dream in color than men?
Q6: Small p-value, which explanation is more plausible?
Men and women in your community do not differ on this
issue but by chance alone the random sampling led to the
difference we observed between the two groups.
13.6%
Men and women in your community differ on this issue.
58.1%
(a) and (b) are equally plausible explanations.
28.2%
36% correct with draft curriculum four years ago
Two Proportions (Exam 2)
• n = 404 students (8 instructors)
Q7: Main purpose of the randomness in the simulation?
To allow me to draw a cause-and-effect conclusion
from the study.
19.1%
To allow me to generalize my results to a larger
population.
11.4%
To simulate values of the statistic under the null
hypothesis.
58.8%
To replicate the study and increase the accuracy of
the results
8.2%
Two Means (Exam 2/Final)
• 717 students, 14 instructors
• Want to compare mean score on video game
with and without monetary incentive
• Simulation process is described and given null
distribution
Two Means (Exam 2/Final)
Q1: Main motivation for this process?
This process allows her to compare her actual result to what
could have happened by chance if gamers’ performances were
not affected by whether they were asked to do their best or
offered an incentive.
83.0%
This process allows her to determine the percentage of time the
$5 incentive strategy would outperform the “do your best"
strategy for all possible scenarios.
12.0%
This process allows her to determine how many times she needs
to replicate the experiment for valid results.
2.2%
This process allows her to determine whether the normal
distribution fits the data.
2.8%
Two Means (Exam 2/Final)
Q2: What’s assumed in carrying out the simulation?
The $5 incentive is more effective than the “do your best”
incentive for improving performance.
25.8%
The $5 incentive and the “do your best” incentive are
equally effective at improving performance.
60.9%
The “do your best” incentive is more effective than a $5
incentive for improving performance.
6.0%
Both (a) and (b) but not (c).
7.3%
Two Means (Exam 2/Final)
Q3: Approximate p-value from graph
0.501 (using null value)
14.0%
0.047 (two-sided)
16.9%
0.022
52.5%
0.001 (small)
16.2%
Two Means (Exam 2/Final)
Q4: What does histogram tell us about research question?
The $5 incentive is not effective because the distribution
of differences generated is centered at zero.
16.3%
The $5 incentive is effective because distribution of
differences generated is centered at zero.
14.8%
The $5 incentive is not effective because the p-value is
greater than 0.05.
5.1%
The $5 incentive is effective because the p-value is less
than 0.05.
63.4%
Two Means (Exam 2/Final)
Q5: Appropriate interpretation of p-value?
The p-value is the probability that the $5 incentive is not
really helpful.
3.7%
The p-value is the probability that the $5 incentive is really
helpful.
12.9%
The p-value is the probability that she would get a result
at least as extreme as the one she actually found, if the
$5 incentive is really not helpful.
82.3%
The p-value is the probability that a student wins on the
video game.
0.9%
CAOS Significance questions
(n ≈ 2,000 pre, 1,500 post)
• Valid/invalid interpretations

Question                                                       Pre    Post: Exp   Post: New   Post: Non   CAOS
Large or small p-value, no impact                              50%    89%         85%         62%en       68%
Probability of results at least as extreme under null: valid   50%    65%         66%         52%         60%
Probability of alternative: invalid                            40%    53%         58%         48%         57%
Probability of null: invalid                                   53%    72%         67%         58%         54%
CAOS Conf interval questions
• Valid/invalid interpretations

Question                                                       Pre    Post: Exp   Post: New   Post: Non   CAOS
95% of all observations in population in interval: invalid     57%    63%         64%         56%         65%
95% confident an observational unit is in interval: invalid    27%    41%         37%         21%en       49%
95% of sample means from population are in interval: invalid   51%    60%         60%         64%         48%
95% confident population mean is in interval: valid            71%    80%         80%         82%         76%
CAOS Sampling variability questions

Question                                                   Pre    Post: Exp   Post: New   Post: Non   CAOS
Values of 10 sample proportions                            42%    44%         52%e        43%         52%
Simulation design                                          24%    40%         35%         24%e        22%
Small sample (n = 60) may fail to detect difference        71%    58%         57%         49%         67%
Necessary sample size for all 310 million U.S. residents   10%    19%         22%         11%
“Hospital problem”                                         33%    39%         38%         34%         33%
Topic areas – Summary
• Auth = author team member
• Mid = non-author but have used materials more than once

                       Pre                           Post
                       Auth   Mid    New    Non      Auth   Mid    New    Non
Significance           52%    43%    47%    46%      72%    67%    69%    55%*
Confidence             55%    51%    51%    49%      63%    60%    60%    56%
Sampling variability   35%    36%    36%    35%      41%    40%    41%    32%
Transfer Question (Final exam)
• A constant theme of course: Could the statistic have
happened by chance alone?
• Applicable in any situation vs. statistical test
applicable in only one specific situation
• Can students apply the same logic to a novel problem?
• Spring 2014: Two Cal Poly instructors (169 students)
• Final exam: mean/median as a measure of skewness to
make inference about population shape (adapted from
2009 AP Statistics exam)
• Earlier midterm: Ratio of standard deviations or
relative risk
Transfer Question (Final exam)
• Do the sample data provide convincing evidence the
population is right skewed?
• Calculate statistic: mean/median = 1.05
• What values would you expect for the statistic with a
normally distributed population? With a skewed right
population?
• 39% answered both questions correctly
• Common errors:
• Mean/median > 1.05 if right skewed
• Wrong direction: mean/median < 1 if right skewed
Transfer Question (Final exam)
• Do the sample data provide evidence the
population is right skewed?
• Calculate statistic: mean/median = 1.05
• Given a simulated null distribution from a symmetric
population (centered at 1)
• Evidence against the null hypothesis?
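A null distribution like the one given to students can be sketched by repeatedly sampling from a symmetric population and computing mean/median each time. The slides do not give the sample size or the population used, so the normal population (mean 100, SD 20) and n = 50 below are hypothetical choices for illustration only.

```python
import random
import statistics

def null_ratios(n, mu, sigma, reps=5_000, seed=1):
    """Simulated mean/median ratios for samples from a normal population."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ratios.append((sum(sample) / n) / statistics.median(sample))
    return ratios

# Hypothetical symmetric population and sample size (not from the slides).
ratios = null_ratios(n=50, mu=100, sigma=20)
observed = 1.05  # statistic reported on the slide
center = sum(ratios) / len(ratios)
p_value = sum(r >= observed for r in ratios) / len(ratios)
print(f"center of null distribution: {center:.3f}, simulated p-value: {p_value:.4f}")
```

The null distribution is centered near 1, and the evidence against the null comes from how far 1.05 sits in its right tail, not from its shape or center.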
Transfer Question (Final exam)
• Multiple choice version based on common
responses from open-ended version:
• Answer choices focus on 3 characteristics of the
null distribution:
• There is strong evidence (or not) to suggest the
actual population distribution is right skewed…
• Due to symmetric shape
• Because the center is at 1
• Because most values vary between 0.96 to 1.04
Transfer Question (Final exam)
Two instructors (5 sections/169 students) from Cal Poly
does not provide strong evidence … because this null distribution is
symmetric.
11%
provides strong evidence … because this null distribution is symmetric.
12%
does not provide strong evidence … because this null distribution is
centered around one.
20%
provides strong evidence … because this null distribution is centered
around one.
26%
does not provide strong evidence … because most of the values in this null
distribution vary between 0.96 to 1.04.
10%
provides strong evidence … because most of the values in this null
distribution vary between 0.96 to 1.04.
18%
Other: provided correct reasoning
7%
* 25% answered correctly and an additional 8% showed work
indicating correct reasoning
Benefits
• Little to no confusion that small p-values mean statistical
significance
• Students very comfortable (even initially) with idea of “could
this have happened by chance alone”
• Idea of large z-score or t-score (beyond 2SE) also clicks
• Address difficult inferential reasoning earlier in course
• Repeated exposures allow a synthesis of the ideas
• Understanding “Inference process” as statistical method, rather
than stand-alone methods for testing means, proportions, etc.
• Efficiency gains:
• Still possible to do both simulation and normal-based methods
• Exploration of other statistics (e.g. MAD for multiple means)
• Instructors enjoy approach, research study focus, richer
student questions
Cautions
• Inferential reasoning is difficult and initially, little
carry-over of learning:
• Non 50/50 cases
• Comparing groups
• Need several repeated exposures
• May introduce a misconception of “repeating the
study”
• Possible increase in misconception that we are
“providing evidence for the null hypothesis”
• Continue to struggle with identifying & defining
parameters
• Balance inferential with descriptive statistics (less as
Common Core comes on line?)
Main Suggestions
• Emphasize the ideas of model and simulation
• Repeatedly test their ability to design a simulation
• Ask students to predict simulation results (where
will it be centered, why)
• Focus on variability in null distribution as the key
• Clearly delineate observed data from simulation
• Explicitly discuss roles of randomness in the study
design vs. randomness in simulation
• Use early experiential examples that give students
ownership of the data (“observed” statistic)
Future Steps
• Three year NSF grant (DUE/TUES – 1323210) to
continue data collection across institutions
• More “non-users” and other randomization-based curricula (e.g., Lock5, CATALST)
• More studies of student retention of concepts
• Next theme of common exam questions:
Confidence intervals
• Email Nathan Tintle or Beth Chance if you
would like to participate
Questions?