9.3 Tests about a Population Mean

Objectives
SWBAT:
• STATE and CHECK the Random, 10%, and Normal/Large Sample conditions
for performing a significance test about a population mean.
• PERFORM a significance test about a population mean.
• USE a confidence interval to draw a conclusion for a two-sided test about a
population parameter.
• PERFORM a significance test about a mean difference using paired data.
What are the three conditions for conducting a significance test for a population
mean? How are these different than the conditions for calculating a confidence
interval for a population mean?
Conditions For Performing A Significance Test About A Mean
• Random: The data come from a well-designed random sample or
randomized experiment.
o 10%: When sampling without replacement, check that
n ≤ (1/10)N.
• Normal/Large Sample: The population has a Normal distribution
or the sample size is large (n ≥ 30). If the population distribution
has unknown shape and n < 30, use a graph of the sample data to
assess the Normality of the population. Do not use t procedures if
the graph shows strong skewness or outliers.
The conditions are the same as the conditions for constructing a confidence interval.
What test statistic do we use when testing a population mean? Is the formula
on the formula sheet?
As with proportions, all we get on the formula sheet is the generic form:
$$\text{test statistic} = \frac{\text{statistic} - \text{parameter}}{\text{standard deviation of statistic}}$$
Reminder: we always operate under the assumption that the null is true.
Also, with means, we use the t-distribution unless there is a rare instance when we know
the population standard deviation.
Because the population standard deviation σ is usually unknown, we use the sample
standard deviation sx in its place. The resulting test statistic has the standard error of
the sample mean in the denominator:
$$t = \frac{\bar{x} - \mu_0}{s_x / \sqrt{n}}$$
When the Normal condition is met, this statistic has a t distribution with n − 1 degrees
of freedom.
How do you calculate p-values using the t distributions?
You can use either the table or the calculator to get a p-value. Let’s look at an example.
A battery company wants to test H0: µ = 30 versus Ha: µ > 30 based on an SRS of 15 new AAA
batteries with mean lifetime x̄ = 33.9 hours and standard deviation sx = 9.8 hours.
$$t = \frac{\text{statistic} - \text{parameter}}{\text{standard deviation of statistic}} = \frac{\bar{x} - \mu_0}{s_x / \sqrt{n}} = \frac{33.9 - 30}{9.8 / \sqrt{15}} = 1.54$$
The P-value is the probability of getting a result this large or larger in the direction
indicated by Ha, that is, P(t ≥ 1.54).
• Go to the df = 14 row.

        Upper-tail probability p
df      .10      .05      .025
13      1.350    1.771    2.160
14      1.345    1.761    2.145
15      1.341    1.753    2.131
        80%      90%      95%
        Confidence level C

• Since the t statistic falls between the values 1.345 and 1.761, the “Upper-tail probability p”
is between 0.10 and 0.05.
• The P-value for this test is between 0.05 and 0.10.
Reminder: for two-sided tests, double the P-value.
In the calculator:
2nd > DISTR > 6: tcdf
To find P(t ≥ 1.54):
tcdf(lower: 1.54, upper: 10000, df: 14) = 0.0729
Reminder: the t distribution is symmetric! Therefore, the probability that t ≥ 1.54 is the
same as the probability that t ≤ −1.54.
Try it in the calculator:
tcdf(lower: -10000, upper: -1.54, df: 14)
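The table lookup and the tcdf call above can be cross-checked in code. The following is a minimal Python sketch, not part of the original lesson: the function names (bracket_tail_probability, t_pdf, upper_tail) are hypothetical, and the tail area is approximated with Simpson's rule rather than a calculator's built-in tcdf. It first brackets P(t ≥ 1.54) using the df = 14 row of the table, then computes the tail area directly.

```python
import math

def bracket_tail_probability(t, row):
    """Bound P(T >= t) for t >= 0 using one row of a t table.

    `row` maps upper-tail probability -> critical value for a fixed df.
    """
    lower, upper = 0.0, 0.5
    for p, critical in row.items():
        if t >= critical:
            upper = min(upper, p)   # t is at least this extreme: tail area <= p
        else:
            lower = max(lower, p)   # t is less extreme: tail area > p
    return lower, upper

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) with Simpson's rule (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3            # P(0 <= T <= |t|)
    tail = 0.5 - area               # P(T >= |t|), since half the area lies above 0
    return tail if t >= 0 else 1 - tail

# Battery example: t = 1.54 with df = 14 (row copied from the table)
row_14 = {0.10: 1.345, 0.05: 1.761, 0.025: 2.145}
bounds = bracket_tail_probability(1.54, row_14)
p_upper = upper_tail(1.54, 14)          # P(T >= 1.54)
p_lower = 1 - upper_tail(-1.54, 14)     # P(T <= -1.54), equal by symmetry

print(bounds)
print(round(p_upper, 4), round(p_lower, 4))
```

The table only gives an interval for the P-value, while the numerical tail area gives a single value that should fall inside that interval.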
Alternate Example: Short Subs
Abby and Raquel like to eat sub sandwiches. However, they noticed that the lengths of the
“6-inch sub” sandwiches they get at their favorite restaurant seemed shorter than the
advertised length. To investigate, they randomly selected 24 different times during the next
month and ordered a “6-inch” sub. Here are the actual lengths of each of the 24 sandwiches
(in inches):
4.50 4.75 4.75 5.00 5.00 5.00 5.50 5.50
5.50 5.50 5.50 5.50 5.75 5.75 5.75 6.00
6.00 6.00 6.00 6.00 6.50 6.75 6.75 7.00
a) Do these data provide convincing evidence at the level that sandwiches at this restaurant
are shorter than advertised, on average?
P-value: P(t ≤ −2.38) with df = 24 − 1 = 23.
Calculator: tcdf(lower: -10000, upper: -2.38, df: 23) = 0.0130
b) Given your conclusion in part (a), which kind of mistake – a Type I or a Type II error – could
you have made? Explain what this mistake would mean in context.
Because we rejected the null hypothesis, it is possible that we made a Type I error. In other
words, it is possible that we found convincing evidence that the mean length was less than 6
inches when in reality the mean length is 6 inches.
• To do this on the calculator:
Start by entering the data into a list.
STAT > TESTS > 2: T-Test
List: wherever you put the data
Calculate
t = -2.407
p = 0.0123
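The calculator's t = -2.407 differs a little from the t = -2.38 used in the tcdf call earlier. A likely explanation is rounding: computing t from the sample mean and standard deviation rounded to two decimals gives about -2.38, while using the unrounded values gives about -2.407. The Python sketch below, which uses only the standard library and variable names chosen for illustration, recomputes both versions from the 24 sandwich lengths, assuming H0: µ = 6.

```python
import math
import statistics

# The 24 sandwich lengths (inches) from the example
lengths = [
    4.50, 4.75, 4.75, 5.00, 5.00, 5.00, 5.50, 5.50,
    5.50, 5.50, 5.50, 5.50, 5.75, 5.75, 5.75, 6.00,
    6.00, 6.00, 6.00, 6.00, 6.50, 6.75, 6.75, 7.00,
]
mu0 = 6.0                      # advertised length under the null hypothesis

n = len(lengths)
df = n - 1
x_bar = statistics.mean(lengths)
s_x = statistics.stdev(lengths)          # sample standard deviation (n - 1 divisor)

# t from unrounded summary statistics (what a calculator's T-Test does)
t_exact = (x_bar - mu0) / (s_x / math.sqrt(n))

# t from summary statistics rounded to two decimals (typical hand calculation)
t_rounded = (round(x_bar, 2) - mu0) / (round(s_x, 2) / math.sqrt(n))

print(df, round(t_exact, 3), round(t_rounded, 2))
```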
Can you use your calculator for the Do step? Are there any drawbacks?
As we’ve seen, yes, you can use your calculator. However, if you just show
calculator results with no work, and one or more values are wrong, you won’t
get any credit for the “do” step. If you opt for the calculator-only method,
name the procedure (t test) and report the test statistic, degrees of freedom,
and p-value.
Alternate Example: Don’t break the ice
In the children’s game Don’t Break the Ice, small plastic ice cubes are
squeezed into a square frame. Each child takes turns tapping out a cube
of “ice” with a plastic hammer, hoping that the remaining cubes don’t
collapse. For the game to work correctly, the cubes must be big enough
so that they hold each other in place in the plastic frame but not so big
that they are too difficult to tap out. The machine that produces the
plastic cubes is designed to make cubes that are 29.5 millimeters (mm)
wide, but the actual width varies a little. To ensure that the machine is
working well, a supervisor inspects a random sample of 50 cubes every
hour and measures their width. The Fathom output summarizes the data
from a sample taken during one hour.
a) Interpret the standard deviation and the standard error provided by the computer
output.
Standard deviation: The widths of the cubes are typically about 0.0877 mm from the mean
width (29.4943).
Standard error: In random samples of size 50, the sample mean will typically be about 0.0124
mm from the true mean.
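b) Two possible explanations for why the sample mean (29.4943 mm) differs from 29.5 mm:
1) The true mean really is not 29.5 mm.
2) The true mean really is 29.5 mm, but the sample mean came out to 29.4943 mm because of chance variation in sampling.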
c) Do these data give convincing evidence that the mean width of cubes produced this hour is
not 29.5 mm? Use a significance test with 𝛼 = 0.05 to find out.
P-value: The table has no row for df = 49, so use the next smaller value, df = 40. For this
two-sided test, the P-value is greater than 2(0.25) = 0.50.
Using technology: With df = 49, the calculator’s t-test gives a P-value of 2(0.32395) = 0.6479.
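The test statistic behind these P-values does not appear in the transcript. As a sketch, it can be reconstructed from the summary values quoted in part (a) (sample mean 29.4943 mm, standard deviation 0.0877 mm, n = 50), assuming the null hypothesis H0: µ = 29.5:

```latex
SE_{\bar{x}} = \frac{s_x}{\sqrt{n}} = \frac{0.0877}{\sqrt{50}} \approx 0.0124,
\qquad
t = \frac{\bar{x} - \mu_0}{SE_{\bar{x}}} = \frac{29.4943 - 29.5}{0.0124} \approx -0.46
```

Because the alternative is two-sided, the P-value is twice the tail area beyond |t| ≈ 0.46.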
Conclude: Because the P-value of 0.6479 is greater than α = 0.05, we fail to reject the null
hypothesis. There is not convincing evidence that the true mean width of the plastic ice cubes
produced this hour is different from 29.5 mm.
On the calculator:
T-Test
Stats
df = 50-1 = 49
t = -0.4595
p = 0.6479
d) Calculate a 95% confidence interval for 𝜇. Does your interval support your decision from
(c)?
Using T-Interval on the calculator, we get an interval of (29.469, 29.519). This supports our
decision in (c): 29.5 lies inside the interval, so it is a plausible value for the true mean
width, and we would fail to reject the null hypothesis.
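The interval and the two-sided test decision can be linked with a short calculation. The Python sketch below rebuilds the interval from the summary values quoted in part (a). The critical value t* = 2.010 is an assumption (the approximate 95% critical value for df = 49); it is not stated in the source, and the variable names are for illustration only.

```python
import math

# Summary values quoted in the example
x_bar = 29.4943     # sample mean width (mm)
s_x = 0.0877        # sample standard deviation (mm)
n = 50
mu0 = 29.5          # hypothesized mean width (mm)

t_star = 2.010      # ASSUMED critical value: 95% confidence, df = 49

margin = t_star * s_x / math.sqrt(n)
interval = (x_bar - margin, x_bar + margin)

# A two-sided test at alpha = 0.05 fails to reject H0 exactly when
# mu0 falls inside the 95% confidence interval.
mu0_is_plausible = interval[0] <= mu0 <= interval[1]

print(tuple(round(end, 3) for end in interval), mu0_is_plausible)
```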
Paired Data
• Comparative studies are more convincing than single-sample investigations. For that reason,
one-sample inference is less common than comparative inference. Study designs that involve
making two observations on the same individual, or one observation on each of two similar
individuals, result in paired data.
• When paired data result from measuring the same quantitative variable twice, we can make
comparisons by analyzing the differences in each pair.
• If the conditions for inference are met, we can use one-sample t procedures to perform
inference about the mean difference µd.
• These methods are sometimes called paired t procedures.
• Our null hypothesis will be that there is no mean difference between the two items being
compared (H0: µd = 0).
• Our alternative will be that there is a mean difference (for example, Ha: µd > 0; the direction
depends on the question and on the order of subtraction). Be sure to define the order
in which you are subtracting.
Alternate Example: Is the express lane faster?
For their second semester project in AP Statistics, Libby and Kathryn
decided to investigate which line was faster in the supermarket: the
express lane or the regular lane. To collect their data, they randomly
selected 15 times during a week, went to the same store, and bought the
same item. However, one of them used the express lane and the other
used a regular lane. To decide which lane each of them would use, they
flipped a coin. If it was heads, Libby used the express lane and Kathryn
used the regular lane. If it was tails, Libby used the regular lane and
Kathryn used the express lane. They entered their randomly assigned
lanes at the same time, and each recorded the time in seconds it took
them to complete the transaction. Carry out a test to see if there is
convincing evidence that the express lane is faster.
Since these data are paired, we will consider the differences in time (regular − express). Here are the
15 differences, in seconds. In this case, a positive difference means that the express lane was faster.
5  246  −46  121  30  55  79  −94  −17  95  20  14  129  −39  42
Now find the mean difference by adding up these differences and dividing by 15:
640/15 = 42.7 seconds
On the calculator:
T-Test
Data
µ0: 0
List
µ: > µ0
df: 14
t = 1.9668
p = 0.0347
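A paired t test is a one-sample t test on the differences, so the calculator's t value can be reproduced from the 15 differences (regular − express). The Python sketch below uses only the standard library; the variable names are illustrative, and the null value 0 reflects H0: µd = 0.

```python
import math
import statistics

# Differences in seconds (regular - express) from the example
diffs = [5, 246, -46, 121, 30, 55, 79, -94, -17, 95, 20, 14, 129, -39, 42]

n = len(diffs)
df = n - 1
mean_diff = statistics.mean(diffs)      # 640 / 15
s_diff = statistics.stdev(diffs)        # sample standard deviation of the differences

# One-sample t statistic applied to the differences, with null value 0
t = (mean_diff - 0) / (s_diff / math.sqrt(n))

print(df, round(mean_diff, 1), round(t, 4))
```

A positive t here points in the direction of Ha: µd > 0, that is, toward the express lane being faster.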
What is the difference between statistical and practical significance?
Statistical Significance and Practical Importance
When a null hypothesis (“no effect” or “no difference”) can be rejected at
the usual levels (α = 0.05 or α = 0.01), there is good evidence of a difference.
But that difference may be very small. When large samples are available,
even tiny deviations from the null hypothesis will be significant.
• Remember the wise saying: Statistical significance is not the same thing as practical
importance. The remedy for attaching too much importance to statistical
significance is to pay attention to the actual data as well as to the p-value.
• Plot the data and examine it carefully. Are there outliers or other departures from a
consistent pattern? A few outlying observations can produce highly significant tests.
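To see why large samples can make tiny deviations significant, note that the t statistic grows with the square root of n when the observed difference and standard deviation stay the same. The Python sketch below uses made-up numbers (a difference of 0.1 units with a standard deviation of 1) purely for illustration; none of these values come from the lesson.

```python
import math

def t_statistic(x_bar, mu0, s, n):
    """One-sample t statistic: (x_bar - mu0) / (s / sqrt(n))."""
    return (x_bar - mu0) / (s / math.sqrt(n))

# HYPOTHETICAL values: the same small observed difference at two sample sizes
x_bar, mu0, s = 0.1, 0.0, 1.0

t_small = t_statistic(x_bar, mu0, s, n=25)       # modest sample
t_large = t_statistic(x_bar, mu0, s, n=10_000)   # very large sample

print(t_small, t_large)
```

The observed difference is identical in both cases; only the sample size changes, yet the second t statistic is twenty times larger. Whether a difference of 0.1 matters in practice is a separate question from whether it is statistically significant.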
What is the problem of multiple tests?
Beware of Multiple Analyses
Statistical significance ought to mean that you have found a difference that you
were looking for. The reasoning behind statistical significance works well if you
decide what difference you are seeking, design a study to search for it, and use a
significance test to weigh the evidence you get. In other settings, significance
may have little meaning.
Suppose that 20 significance tests were conducted and in each case the null
hypothesis was true. What is the probability that we avoid a Type I error in all
20 tests?
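The question does not state a significance level. As a sketch, assume each test uses α = 0.05 and that the 20 tests are independent (both are assumptions, not given in the source). Then the probability of avoiding a Type I error in every test is (1 − α) raised to the 20th power, which the following Python snippet evaluates; the function name is hypothetical.

```python
def prob_no_type_i_error(num_tests, alpha):
    """P(no Type I error in any test), assuming independent tests with true nulls."""
    return (1 - alpha) ** num_tests

# ASSUMED: alpha = 0.05 for each of the 20 independent tests
p_all_correct = prob_no_type_i_error(20, 0.05)
p_at_least_one_error = 1 - p_all_correct

print(round(p_all_correct, 4), round(p_at_least_one_error, 4))
```

Under these assumptions, at least one false rejection is more likely than not, which is why a single significant result among many tests deserves caution.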