Transcript File

Tools for Two Common Student
Mistakes on the AP Exam
Dr. Robert Taylor
Dr. Ellen Breazel
Dr. Julia Sharp
Clemson University
Breazel Activities in AP Statistics
AP Statistics Readings
Reader
2007, 2009, 2010, 2011
Table Leader
2012
Hosted Professional Development Events
Activities for the Classroom, Clemson – 2011
Practice Exam, Clemson - 2012
AP Workshop – Greenville, SC - 2011
Five Day College Board Workshops
SC AP Statistics Institutes , Clemson - 2009
Curriculum Module – Editor and Author – 2012
Topic – Random Assignment & Random Sampling
Taylor Activities in AP Statistics
AP Statistics Readings
Reader
2000, 2001, 2002
Table Leader
2004, 2005, 2006
Question Leader 2007, 2010, 2011, 2012
Asst Chief Reader 2008, 2009
Test Development Committee 2007-2013
Chair/Co-Chair of Test Devel. Comm. 2010-2013
AP Workshops (5 days) at UGA 2000, 2001, 2002, 2004, 2006
Activities in AP Statistics -cont
AP Workshop (1 day) at Oglethorpe College, Jan. 2002
AP Workshop (1 day) at Greenville, SC, Oct. 2003
AP Workshop (1 day) at Columbia, SC, Feb. 2005
AP Workshop (1 day) at Spartanburg, Oct. 2006
AP Workshop (1 day) at Columbia, SC, Feb. 2008
AP Workshop (1 day) at Charlotte, NC, Feb. 2009
AP Workshop (1 day) at Jacksonville, FL, Oct. 2010
AP Workshop (1 day) at Nashville, TN, Oct. 2011
AP Workshop (1 day) at Atlanta, Jan. 2012
Five Day College Board Workshops
SC AP Statistics Institutes , Clemson – (7 cons. Yrs) 2005 -2011
AP Workshop at Carleton College, MN June 22-25, 2009
AP Workshop at Maryville College, TN July 5-10, 2009, July 2010
HELPING STUDENTS UNDERSTAND
POPULATION VS. SAMPLE
Inferential Statistics
Definition:
Inferential Statistics refers to methods of making
decisions or predictions about a population,
based on data obtained from a sample of that
population.
(Statistics The Art and Science of Learning from Data – Agresti/Franklin)
Common Steps for Inferential Statistics
Confidence Intervals (3 parts)
- Assumptions
- Calculations
- Summary
Hypothesis Tests (4 parts)
- Hypotheses
- Assumptions
- Testing
- Summary
Where do Students Go Wrong?
Hypotheses: 2 different ways
1. Use symbols for sample statistics rather than
population parameters
2. Incorrectly define population parameter
symbols
AP Statistics Exam 2004 – Question 5
A rural county hospital offers several health services. The
hospital administrators conducted a poll to determine
whether the residents’ satisfaction with the available services
depends on their gender…
                Male   Female   Total
Satisfied        384      416     800
Not Satisfied     80      120     200
Total            464      536    1000
a. Using a sig. level of 0.05, conduct an appropriate test to
determine if, for adult residents of the county, there is an
association between gender and whether or not they were
satisfied with services offered by the hospital
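The calculation step for this question can be sketched in a few lines. This is a minimal illustration, not the scoring rubric: it assumes a chi-square test of association on the 2x2 table above, and the function name `chi_square_2x2` is made up for this example. The hypotheses still have to be stated about the population of adult residents of the county, not about the 1000 people polled.

```python
import math

# Observed counts from the 2004 Question 5 table
# (rows: satisfied / not satisfied; columns: male / female).
observed = [[384, 416],
            [80, 120]]

def chi_square_2x2(table):
    """Return (chi-square statistic, p-value) for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if gender and satisfaction were not associated.
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

chi2, p_value = chi_square_2x2(observed)
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.4f}")
```

The statistic comes out a little above 4 and the p-value falls just below 0.05, so the written conclusion has to refer to an association in the population at the stated significance level.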
Where do Students Go Wrong?
Hypotheses:
1. Use symbols for sample statistics rather than
population parameters
2. Incorrectly define population parameter
symbols
a. Using a sig. level of 0.05, conduct an appropriate test to
determine if, for adult residents of the county, there is an
association between gender and whether or not they were
satisfied with services offered by the hospital
Even without the sample size – this
statement is not clear and will often
not receive credit
Where Do Students Go Wrong?
Conclusions:
1. Mixing Sample and Population
2. Using p-hat for p-value
AP Statistics Exam 2011 – Question 4 - Cholesterol
High cholesterol levels in people can be reduced by exercise, diet,
and medication. Twenty middle-aged males with cholesterol
readings between 220 and 240 (mg/dL) of blood were randomly
selected from the population of such male patients at a large local
hospital. Ten of the 20 males were randomly assigned to group A,
advised on appropriate exercise, diet, and also received a placebo.
The other 10 males were assigned to group B, received the same
advice on appropriate exercise and diet, but received the drug
intended to reduce cholesterol instead of a placebo…….
Do the data provide convincing evidence, at the α = 0.01 level,
that the cholesterol drug is effective in producing a reduction in
mean cholesterol level beyond that produced by exercise and
diet?
Where is the parameter of interest?
Where Do Students Go Wrong?
Conclusions:
1. Mixing Sample and Population
2. Using p-hat for p-value
AP Statistics Exam 2005 – Question 4
Some boxes of a certain brand of breakfast cereal include a
voucher for a free video rental inside the box. The company
that makes the cereal claims that a voucher can be found in
20 percent of the boxes….. This group of students purchased
65 boxes of cereal to investigate the company’s claim. The
students found a total of 11 vouchers for free video rentals in
the 65 boxes.
….Based on this sample, is there support for the students’
belief that the proportion of boxes with vouchers is less
than 0.2? Provide statistical evidence to support your
answer.
Actually the
p-hat value
population
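A short sketch may help keep p-hat and the p-value apart for the cereal question. It assumes a one-sample z test for a proportion with Ha: p < 0.2. The variable names are chosen for this illustration only.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n = 65          # boxes purchased by the students
vouchers = 11   # vouchers found in those boxes
p0 = 0.20       # company's claimed proportion (null hypothesis value)

p_hat = vouchers / n                       # sample proportion: a statistic
std_error = math.sqrt(p0 * (1 - p0) / n)   # uses p0 because H0 is assumed true
z = (p_hat - p0) / std_error               # test statistic
p_value = normal_cdf(z)                    # lower tail, since Ha: p < 0.2

print(f"p-hat = {p_hat:.4f}, z = {z:.2f}, p-value = {p_value:.4f}")
```

Here p-hat is about 0.17 while the p-value is about 0.27. They are different quantities, and only the p-value is compared with the significance level.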
Activity
Sampling from a Population
Situation: A Sociology class is interested in the
average number of Facebook friends for students at MSSU
(a fictitious college). The university has actually
obtained this information from all its students;
however, it is not willing to give out the
data in electronic form. All the Sociology class
was able to obtain was a paper copy of the data.
The class believes that the average number of
friends is around 855. Conduct an appropriate
test of the class's guess.
Questions for the class
State the null and alternative
hypotheses of your test. Be sure to
define any parameters used.
H0: µ = 855
HA: µ ≠ 855
Where µ represents the population mean
number of Facebook friends at MSSU
Take a Random Sample of Size 35….
Calculate the mean and standard deviation of the
number of FB friends from your sample and label
properly.
x  884.5714
s  947.3255
What is the test statistic for your hypothesis test?
x   884.5714  855
t

 0.18
s
947.3255
n
35
p  value  0.8546
At the 5% significance level, what conclusions can
you make about the Sociology class's guess about the
mean number of FB friends at this college?
At the 5% significance level the p-value (0.8546)
is very large (greater than alpha = 0.05)
therefore we do not reject the null hypothesis.
There is insufficient evidence to suggest the
average number of Facebook friends of all MSSU
students is different from 855.
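The test statistic and p-value above can be checked from the summary statistics alone. The sketch below assumes a two-sided one-sample t test with n - 1 = 34 degrees of freedom. To stay within the Python standard library, it gets the p-value by numerically integrating the t density with Simpson's rule. The helper names are illustrative, and a statistics package would normally supply this directly.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t, df, steps=2000):
    """Two-sided p-value: 1 minus the area between -|t| and |t|."""
    b = abs(t)
    h = b / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_b = total * h / 3
    return 1 - 2 * area_zero_to_b

x_bar, s, n, mu0 = 884.5714, 947.3255, 35, 855   # sample results and null value
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
p_value = two_sided_p_value(t_stat, df=n - 1)

print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
```

Note that x_bar and s describe the sample of 35, while the hypotheses and the conclusion are about µ, the mean for all MSSU students.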
Optional Class Discussions/Activities
• On your post-it note write your sample mean. Have a member
of your group place the sample mean on the number line at the front
of the class.
• On your post-it note write whether you rejected your null
hypothesis or not. Place your post-it in the proper place in the table
at the front of the class.
• What shape does the post-it note histogram form for the
class?
• What shape will the histogram form if we were to have 3000 groups
in the class? Why?
• If we had 3000 groups what percentage of those groups would reject
the null hypothesis for their hypothesis test (if the null hypothesis is
true)?
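The last discussion question can be explored with a small simulation. The sketch below makes several assumptions that are not part of the activity: the population is taken to be normal with mean 855 and a hypothetical standard deviation of 950 (real friend counts are skewed and cannot be negative), and the two-sided 5% critical value 2.032 for 34 degrees of freedom is read from a t table.

```python
import math
import random

random.seed(2012)      # fixed seed so the run is reproducible

MU_TRUE = 855          # the null hypothesis is true in this simulation
SIGMA = 950            # hypothetical population standard deviation
N = 35                 # sample size used by each group
GROUPS = 3000          # number of groups in the imagined class
T_CRITICAL = 2.032     # two-sided 5% critical value, 34 df (from a t table)

def sample_t_statistic():
    """Draw one random sample and return its one-sample t statistic."""
    sample = [random.gauss(MU_TRUE, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (N - 1))
    return (mean - MU_TRUE) / (sd / math.sqrt(N))

rejections = sum(abs(sample_t_statistic()) > T_CRITICAL for _ in range(GROUPS))
rejection_rate = rejections / GROUPS
print(f"{rejections} of {GROUPS} groups rejected H0 ({rejection_rate:.1%})")
```

Under these assumptions, roughly 5% of the groups should reject a true null hypothesis, which is what the significance level means.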
Why is this activity helpful?
– Students have a difficult time understanding an
intangible Population
– Large enough population – with a realistic
application
– Can also look at Confidence Intervals
– Good exercise to review sampling distribution
– Opportunity to talk about 10% of population rule
– Gives overview of entire inference process.
HELPING STUDENTS TO READ
COMPUTER OUTPUT
Since AP Statistics emphasizes conceptual understanding
and communication, long computational mechanics are
often provided in AP Statistics Exam questions.
This frequently takes the form of computer output which
the student must be able to read and interpret in the
context of the problem.
Recent examples of questions with computer output
have related to regression problems. These include:
a) Question 3, 2005, The Great Plains Railroad
b) Question 6, 2001, Science Performance in a Magnet School
c) Question 5, 2011, Windmill.
It is also common for the calculated test statistic or
the p-value to be provided for the student:
a) Question 5, 2012, “p-value of 0.97”
b) Question 3, 2010, “… the resulting confidence
interval was 0.417 ± 0.119.”
c) Question 6, 2010, “Frequency Plot of Simulated
Values of the Test Statistic Q.”
d) Question 5, 2009, “… resulted in a p-value of 0.97”
e) Question 6, 2009, “A dotplot of 100 simulated
values of the test statistic sample mean / sample
median.”
The common goal in all of these questions is to:
• focus the student on relating the statistical concepts
and results to a problem in context
• provide some of the time-consuming mechanics of the
statistical calculations.
The computer output is intended to be as generic as
possible but invariably will contain some traces of the
particular software package which was used.
Thus, it is important that students can accurately and
comfortably interpret statistics from various computer
output.
The focus of the second half of this presentation
will be to review different kinds of computer output
with the goal of identifying the desired
statistical calculations.
The software packages which will be used are:
- Minitab
- SAS
- SPSS
- R
Generic Problem: Twelve homes sold in a nearby county were
randomly selected to examine the relationship between
x = size (in square feet) and y = price (in dollars) of
the houses. Below is a scatterplot of the data:
Qn 1. Do the data indicate that there may be a linear
relation between size of house (in square feet) and
the price of a house?
Answer to Qn 1. From the scatterplot there appears to be a
fairly strong, positive linear relation between size of house (in
square feet) and the price of a house.
[Figure: Residuals Versus Size of House (Residual = Price of House - Predicted Price of House); vertical axis: Residual, horizontal axis: Size of House]
Moreover, the residual plot shows no discernible pattern. Hence,
it is reasonable to proceed with a linear regression analysis to
predict price of houses by the size of houses.
Qn 2. Using the computer output, determine the equation of
the least squares regression line. Identify all variables used in
the equation.
Regression Analysis: price versus size
The regression equation is
price = 31547 + 111 size
Predictor     Coef  SE Coef      T      P
Constant     31547    17850   1.77  0.108
size       111.337    6.265  17.77  0.000
S = 14683.5   R-Sq = 96.9%   R-Sq(adj) = 96.6%
Analysis of Variance
Source          DF           SS           MS       F      P
Regression       1  68081311679  68081311679  315.77  0.000
Residual Error  10   2156038716    215603872
Total           11  70237350395
Unusual Observations
Obs  size   price     Fit  SE Fit  Residual  St Resid
  5  3118  413023  378647    4773     34376     2.48R
R denotes an observation with a large standardized residual.
Answer to Qn 2. The regression equation is
Predicted Price of House (in dollars) = $31,547 + 111*Size of
House (in sq ft).
Qn 3. What is the estimated price of a home that is 2500
square feet?
Answer to Qn 3. The Predicted Price of a House that is
2500 square feet is $31,547 + 111*(2500) = $309,047.
Qn 4. What proportion of the variation in house price is
explained by its linear relationship with house size?
Answer to Qn 4. 96.9% of the variation in house prices
can be explained by a linear relationship with the size of
the house in square feet.
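The answers to Qn 3 and Qn 4 both come straight from numbers printed in the output. As a minimal sketch (the constant and function names are illustrative, and the slope is rounded to 111 as in the printed equation):

```python
# Values read from the Minitab output for price versus size.
INTERCEPT = 31547             # "Constant" row, "Coef" column
SLOPE = 111                   # "size" row, rounded as in the printed equation
SS_REGRESSION = 68081311679   # Analysis of Variance, "Regression" row
SS_TOTAL = 70237350395        # Analysis of Variance, "Total" row

def predicted_price(size_sq_ft):
    """Predicted price in dollars for a house of the given size."""
    return INTERCEPT + SLOPE * size_sq_ft

# R-squared is the share of total variation accounted for by the regression.
r_squared = SS_REGRESSION / SS_TOTAL

print(f"Predicted price at 2500 sq ft: ${predicted_price(2500):,}")
print(f"R-squared: {r_squared:.1%}")
```

This shows students that R-Sq is not a separate mystery number: it can be recovered from the SS column of the Analysis of Variance table.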
Qn 5. What is the difference between R² and
adjusted R²?
Adjusted R² is the percentage of response variable variation
that is explained by its relationship with one or more
predictor variables, adjusted for the number of predictors in
the model.
R² never decreases when a new term is added, and hence
a model with more terms may appear to have a better fit
simply because it has more terms.
The adjusted R² will increase only if the new term improves
the model more than expected by chance. It decreases when
a predictor improves the model less than expected by chance.
Coefficients
Term          Coef  SE Coef         T      P
Constant   27841.9  19160.1   1.45312  0.180
size         128.5     26.2   4.90253  0.001
bedrooms  -12501.0  18518.8  -0.67505  0.517
Summary of Model
S = 15100.2          R-Sq = 97.08%        R-Sq(adj) = 96.43%
PRESS = 2938181585   R-Sq(pred) = 95.82%
Fits and Diagnostics for Unusual Observations
Obs   price     Fit   SE Fit  Residual  St Resid
  5  413023  378399  4921.67   34623.8   2.42539  R
Predicted Price of House =
$27,842 + 128*Size of House - 12,501*Bedrooms
Note that adjusted R² decreased from 96.6% to 96.43% when the
variable ‘number of bedrooms’ was added.
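The drop in adjusted R² can be reproduced from R², the sample size and the number of predictors. The sketch below uses the standard formula for a model with an intercept. The function name is illustrative, and the inputs are the rounded R-Sq values printed in the two outputs.

```python
def adjusted_r_squared(r_squared, n, predictors):
    """Adjusted R-squared for a model with an intercept.

    n is the number of observations; predictors is the number of
    explanatory variables in the model.
    """
    return 1 - (1 - r_squared) * (n - 1) / (n - predictors - 1)

n_homes = 12
size_only = adjusted_r_squared(0.9693, n_homes, predictors=1)
size_and_bedrooms = adjusted_r_squared(0.9708, n_homes, predictors=2)

print(f"size only:         {size_only:.2%}")
print(f"size and bedrooms: {size_and_bedrooms:.2%}")
```

R² rises from 96.93% to 97.08% when bedrooms is added, but the penalty for the extra predictor outweighs that small gain, so the adjusted value falls.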
Scatterplot of Price of House vs Number of Bedrooms
[Figure: Price of House (vertical axis) against Number of Bedrooms (horizontal axis)]
A strong positive linear relationship between
price and number of bedrooms.
Scatterplot of Number of Bedrooms vs Size of House
[Figure: Number of Bedrooms (vertical axis) against Size of House (horizontal axis)]
A strong positive linear relationship between
number of bedrooms and size of house.
Thus, adding the number of bedrooms adds very little to
predicting house price when size of house is in the model.
Qn 6. Interpret the value of the estimated slope in the
context of this problem and construct a 95% confidence
interval for the slope of the population model.
Answer to Qn 6. The estimated slope is $111 per square
foot. Thus, we would expect a $111 increase in the price
of the house (on the average) for each square foot
increase in house size.
The regression equation is
price = 31547 + 111 size
Predictor     Coef  SE Coef      T      P
Constant     31547    17850   1.77  0.108
size       111.337    6.265  17.77  0.000
S = 14683.5   R-Sq = 96.9%   R-Sq(adj) = 96.6%
The population linear regression model is Y = α + βX + ε
Y denotes the price of the houses
X denotes the size of the houses
α is the y-intercept
β is the slope of the regression model
ε denotes random error
Inferences for linear regression depend on the condition
that ε is normally distributed with mean 0 and variance
σ².
The residuals are the best estimates for the error terms.
[Figure: Dotplot of Residual. Only slightly skewed right.]
[Figure: Boxplot of Residual. No outliers.]
Reasonable to assume error terms are approximately normal.
The regression equation is
price = 31547 + 111 size
Predictor     Coef  SE Coef      T      P
Constant     31547    17850   1.77  0.108
size       111.337    6.265  17.77  0.000
S = 14683.5   R-Sq = 96.9%   R-Sq(adj) = 96.6%
C.I. for slope
point estimate for slope ± (t-value*standard error of slope)
(t-value = 2.228 for 95% confidence with 12 - 2 = 10 degrees of freedom)
111 ± (2.228*6.265) = 111 ± 13.96
Thus, we are 95% confident that the population
slope for the increase in house price per unit
increase in square footage is between $97.04 and
$124.96.
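The interval arithmetic can be checked with a few lines. This sketch takes the standard error exactly as printed in the output (SE Coef = 6.265 for size), rounds the slope to 111 as the slide does, and uses the t critical value 2.228 for 10 degrees of freedom. The function name is illustrative.

```python
SLOPE = 111        # point estimate, rounded as on the slide
SE_SLOPE = 6.265   # "SE Coef" for size in the computer output
T_STAR = 2.228     # t critical value for 95% confidence, 10 df

def slope_confidence_interval(estimate, std_error, t_star):
    """Return (lower, upper) limits: estimate +/- t_star * std_error."""
    margin = t_star * std_error
    return estimate - margin, estimate + margin

lower, upper = slope_confidence_interval(SLOPE, SE_SLOPE, T_STAR)
print(f"95% CI for the slope: ${lower:.2f} to ${upper:.2f} per square foot")
```

Using the unrounded estimate 111.337 would shift both limits up by about 34 cents. Either way the interval stays far from 0.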
Qn 7 Is there statistically convincing evidence that house
price is related to house size? Explain.
The regression equation is
price = 31547 + 111 size
Predictor     Coef  SE Coef      T      P
Constant     31547    17850   1.77  0.108
size       111.337    6.265  17.77  0.000
S = 14683.5   R-Sq = 96.9%   R-Sq(adj) = 96.6%
Answer to Qn 7. The question is whether the population
slope β is 0. Since the 95% confidence interval for the slope
lies well above 0, there is very strong evidence that the
population slope β ≠ 0.
The t-statistic = Estimated slope / standard error of the slope
= (111.337 - 0)/6.265 = 17.77, with p-value ≈ 0.000000007,
provides very strong evidence to reject Ho: β = 0 in favor of Ha:
β ≠ 0. This test of hypothesis is equivalent to the test that the
population correlation is 0.
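The equivalence between the slope test and the correlation test can be seen numerically. The sketch below uses only values printed in the output. The formula t = r√(n - 2)/√(1 - r²) is the usual test statistic for a zero population correlation, and the variable names are illustrative.

```python
import math

n = 12                              # number of homes
slope, se_slope = 111.337, 6.265    # "size" row of the coefficient table
r_squared = 0.9693                  # R-Sq from the output
f_statistic = 315.77                # F from the Analysis of Variance table

# Test statistic for Ho: beta = 0, read from the coefficient table.
t_from_slope = (slope - 0) / se_slope

# Test statistic for Ho: population correlation = 0.
r = math.sqrt(r_squared)            # positive root because the slope is positive
t_from_correlation = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)

print(f"t from slope:       {t_from_slope:.2f}")
print(f"t from correlation: {t_from_correlation:.2f}")
print(f"t squared: {t_from_slope ** 2:.1f}  (F in the output: {f_statistic})")
```

Both routes give t of about 17.77 up to rounding, and with one predictor the F statistic in the Analysis of Variance table is the square of that t.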
Qn 8. For each of the following outputs, identify
a) The estimated slope
b) The standard error of the estimated slope
c) The estimated intercept
d) The standard error of the estimated intercept
e) R²
f) adjusted R²
g) the estimated standard deviation of linear
regression
h) the evidence that house price is related to house
size
The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: price
Number of Observations Read   12
Number of Observations Used   12
Root MSE          14683     R-Square   0.9693
Dependent Mean   339675     Adj R-Sq   0.9662
Coeff Var       4.32280
Parameter Estimates
                  Parameter    Standard
Variable    DF     Estimate       Error   t Value   Pr > |t|
Intercept    1        31547       17850      1.77     0.1076
size         1    111.33654     6.26545     17.77     <.0001
a) The estimated slope
b) The standard error of the estimated slope
c) The estimated intercept
d) The standard error of the estimated intercept
e) R²
f) adjusted R²
g) the estimated standard deviation of linear regression
h) the evidence that house price is related to house size
SPSS
Coefficients(a)
                 Unstandardized Coefficients   Standardized Coefficients
Model            B            Std. Error       Beta                        t        Sig.
1  (Constant)    31547.458    17850.374                                    1.767    .108
   size          111.337      6.265            .985                        17.770   .000
a. Dependent Variable: price
a) The estimated slope
b) The standard error of the estimated slope
c) The estimated intercept
d) The standard error of the estimated intercept
e) R²
f) adjusted R²
g) the estimated standard deviation of linear regression
h) the evidence that house price is related to house size
From R:
Call:
lm(formula = price ~ size)
Residuals:
   Min     1Q Median     3Q    Max
-15386 -10151  -2312   6806  34376
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 31547.458  17850.374   1.767    0.108
size          111.337      6.265  17.770 6.79e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14680 on 10 degrees of freedom
Multiple R-squared: 0.9693, Adjusted R-squared: 0.9662
F-statistic: 315.8 on 1 and 10 DF, p-value: 6.795e-09
a) The estimated slope
b) The standard error of the estimated slope
c) The estimated intercept
d) The standard error of the estimated intercept
e) R²
f) adjusted R²
g) the estimated standard deviation of linear regression
h) the evidence that house price is related to house size
Contacts
• Bob Taylor
[email protected]
• Ellen Breazel
[email protected]
• Slides and other resources
www.clemson.edu/~ehepfer
(Click on AP Stats)