
IV. Estimation and Testing
A. Overview
1. Introduction to Estimation
A parameter is an important characteristic of a population.
Examples:
• the true mean outside diameter of the pen barrel;
• the variability of the elastic strengths of polymer yarn; and
• the coefficients which relate the effect of catalyst, temperature, and pressure to the filament's strength.
Problem: How often do we know the true values of parameters?
Almost Never!
WHY?
We never can observe populations in their entirety.
How do we get around this problem?
WE TAKE A SAMPLE AND ESTIMATE THE PARAMETERS
An estimator is a statistic used to estimate an unknown parameter of a population.
The sample mean, ȳ, and the sample variance, s², are examples of estimators.
Two criteria for choosing estimators are:
• accuracy (unbiased)
• precision.
An unbiased estimator of an unknown parameter is one whose expected value is
equal to the parameter of interest.
Thus, we call θ̂ an unbiased estimator of θ if
E[θ̂] = θ.
Thus the estimator yields, on the average, an estimate close to the true value.
In this case, θ̂1 is an unbiased estimator of θ, and θ̂2 is a biased estimator.
The concept of precision looks at the variances of the estimators.
An estimator is more precise if its sampling distribution has a smaller standard
error.
If our data come from a normal distribution, then among the class of unbiased
estimators,
• ȳ is the most precise estimator of μ, and
• s² is the most precise estimator of σ².
We defined s² using n - 1 in the denominator because it produces an unbiased
estimator of σ².
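As a quick illustration, a short simulation sketch in Python (the population mean, variance, and sample size below are arbitrary choices, used only for illustration) shows what "unbiased" means in practice:

import numpy as np

# Illustration only: draw many samples from an arbitrary normal population and
# compare the average of s^2 computed with the (n - 1) and n denominators.
rng = np.random.default_rng(1)
mu, sigma2, n, reps = 10.0, 4.0, 5, 100_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1
s2_biased   = samples.var(axis=1, ddof=0)   # divides by n

print("true sigma^2:          ", sigma2)
print("average s^2 with n - 1:", s2_unbiased.mean())   # close to 4
print("average s^2 with n:    ", s2_biased.mean())     # close to 4*(n - 1)/n = 3.2

On the average, the n - 1 version recovers σ², while the n version systematically underestimates it.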
2. Introduction to Confidence Intervals
We call ȳ and s² point estimators.
If we sample from a continuous distribution, then ȳ and s² are continuous random
variables.
Does anyone sense a problem?
Note:
• P(ȳ = μ) = 0
• P(s² = σ²) = 0
Consequently, statisticians prefer interval estimators.
These intervals give a range of plausible values for the parameter of interest.
For example, consider the population mean, μ.
If we don't know σ², and if the parent distribution is well-behaved, then
(ȳ - μ) / (s/√n)
follows a t distribution with n - 1 degrees of freedom.
As a result,
P( -t_{n-1,α/2} ≤ (ȳ - μ)/(s/√n) ≤ t_{n-1,α/2} ) = 1 - α
P( -t_{n-1,α/2} (s/√n) ≤ ȳ - μ ≤ t_{n-1,α/2} (s/√n) ) = 1 - α
P( -ȳ - t_{n-1,α/2} (s/√n) ≤ -μ ≤ -ȳ + t_{n-1,α/2} (s/√n) ) = 1 - α
P( ȳ - t_{n-1,α/2} (s/√n) ≤ μ ≤ ȳ + t_{n-1,α/2} (s/√n) ) = 1 - α
We call
ȳ ± t_{n-1,α/2} (s/√n)
a (1 - α)•100% confidence interval for μ.
Interpretation: If we take an infinite number of samples from a well-behaved
parent distribution, then (1 - α)•100% of the time, the interval
ȳ ± t_{n-1,α/2} (s/√n)
will contain μ.
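As a quick sketch, this interval is easy to compute with scipy; the sample values below are hypothetical and only illustrate the formula.

import numpy as np
from scipy import stats

# Hypothetical sample (illustration only)
y = np.array([10.1, 10.3, 9.9, 10.2, 10.4, 10.0, 10.2, 10.1])
n = len(y)
alpha = 0.05

ybar = y.mean()
s = y.std(ddof=1)                              # sample standard deviation (n - 1 denominator)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t_{n-1, alpha/2}

half_width = t_crit * s / np.sqrt(n)
print(f"95% CI for mu: ({ybar - half_width:.3f}, {ybar + half_width:.3f})")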
3. Introduction to Testing
The process by which we use data to answer questions about parameters
is very similar to how juries evaluate evidence about a defendant.
We start with a nominal claim, which we call a null hypothesis, H0.
H0: the defendant is innocent
The prosecutor seeks to establish an alternative claim, which we call the
alternative hypothesis, Ha.
Ha: the defendant is guilty
Note: the jury makes a decision under the risk of making a mistake.
                                    Jury's decision
                              convict               acquit
True state:   innocent        Type I error          Correct decision
              guilty          Correct decision      Type II error
What is the typical standard for a jury's decision?
must be convinced beyond a reasonable doubt
What does that imply about the probability of a Type I error?
Should be small
What does that imply about the probability of a Type II error?
Could be large
Traditionally, we let
α = P(Type I error)
  = P(reject H0 | H0 is true)
  = P(reject H0 when H0 is true)
  = P(convict an innocent person)
α is called the significance level of our test.
The power of the test is
Power = P(reject H0 | Ha is true)
      = P(convict a guilty person)
We want small α and large power.
Note:
• Rejecting H0 is a strong claim since we needed to be convinced beyond a
reasonable doubt.
We must have substantial evidence before we reject the nominal claim.
• Failing to reject H0 is a weak claim.
The evidence may seem to support the alternative, but the jury is not convinced
beyond a reasonable doubt.
We do the same thing with engineering decisions.
Consider a packaging process for the 10 oz boxes of a popular breakfast cereal.
The company has received a number of complaints about underfilled boxes.
Suppose the equipment should be set to deliver, on the average, 10.2 oz.
If it really is set to that value, the company should have virtually no complaints
about underfills.
What would be an appropriate procedure to determine if the machine is set
properly or if it will tend to underfill the boxes?
The appropriate hypotheses for testing underfills are:
H0: μ = 10.2
Ha: μ < 10.2
What is a Type I error and its consequence?
What is a Type II error and its consequence?
The most commonly used values for α are:
• .10
• .05
• .01
If we perform this test once, what seems to be a reasonable α?
Let’s shift gears.
If you have a problem with underfills, how can you correct it?
From a stockholder's perspective, is this a wise idea?
• We don't want to underfill.
• Neither do we want to overfill.
What would be an appropriate procedure?
H0: μ = 10.2
Ha: μ ≠ 10.2
This is a two-sided hypothesis, since we care about both μ < 10.2 and μ > 10.2.
This is a real problem in industry and will lead to the concept of control charts
which we introduce in the next chapter.
In general, we follow a 5-step procedure for conducting hypothesis tests.
1. State the appropriate hypotheses.
H0: nominal claim
Ha: alternative claim
(what we seek to prove)
2. State the appropriate test statistic. State how we plan to analyze the data.
3. Determine the critical region. Determine the values for the test statistic which
support rejecting H0.
4. Conduct the experiment, calculate the test statistic.
5. Reach conclusions and state them in English.
We will learn some statistical jargon:
• reject H0
• fail to reject H0.
THIS IS NOT ENGLISH!
A better way to express our conclusions:
• We should adjust the equipment.
• We shouldn't adjust the equipment.
If we reject the null hypothesis, we should always follow up our test with an
appropriate confidence interval.
The idea of the interval: to give a range of plausible values as an alternative to the
nominal claim.
4. Relationship of Testing to Confidence Intervals
A two-sided hypothesis test with a significance level of α is equivalent to
constructing a (1 - α)•100% confidence interval and using the following decision
rule:
• If the interval contains the nominal (hypothesized) value, then we fail to reject H0.
• If the interval does not contain the nominal value, then we reject H0.
The α we use for the hypothesis test is exactly the same α we use for the
confidence interval.
By the way we constructed the confidence interval, each value in the interval is a
plausible candidate for the true value.
Thus, if the nominal value of the parameter of interest falls within the confidence
interval, then we have no evidence to conclude that it is not a plausible value for
the parameter.
Hence, we cannot reject the null hypothesis.
On the other hand, if our interval does not contain the nominal value, then the
nominal value is not plausible, and we do have sufficient evidence to reject the
nominal claim.
Many engineers and statisticians prefer to concentrate solely on confidence intervals
since
• they clearly estimate the parameter of interest, and
• they can address the interesting questions for which hypothesis tests are designed.
Confidence intervals provide a simple, powerful, and direct basis for addressing both
practical and statistical significance.
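As a quick numerical check, the sketch below uses hypothetical fill weights for the cereal example with a two-sided test of H0: μ = 10.2 at α = .05; by construction, the test decision and the interval decision always agree.

import numpy as np
from scipy import stats

# Hypothetical fill weights in ounces (illustration only)
y = np.array([10.18, 10.24, 10.15, 10.22, 10.19, 10.21, 10.17, 10.23])
mu0, alpha, n = 10.2, 0.05, len(y)

# Two-sided one-sample t-test
t_stat, p_val = stats.ttest_1samp(y, popmean=mu0)

# Matching (1 - alpha)*100% confidence interval
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
half = t_crit * y.std(ddof=1) / np.sqrt(n)
lo, hi = y.mean() - half, y.mean() + half

print("reject by test:", p_val < alpha)
print("reject by CI:  ", not (lo <= mu0 <= hi))   # same decision every time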
B. Tests for a Single Mean
1. One Sided Tests
Consider the injection molding process for pen barrels.
Suppose the nominal outside diameter is .380 in.
Lately, the supervisor in packaging keeps complaining that the caps fall off,
jamming his equipment.
We need to determine if the outside diameters of these barrels, on the average,
have become too small.
What should we do?
Collect a sample.
A recent random sample of 15 pen barrels yielded
.379 .380 .378 .379 .381
.379 .380 .378 .379 .379
.381 .379 .380 .380 .380
Is it clear that, on the average, the outside diameter is less than .380 in?
Consider a hypothesis test.
Step 1: State the Hypotheses
H0: μ = .380
Ha: μ < .380
Step 2: State the Test Statistic
t = (ȳ - μ0) / (s/√n)
Step 3: State the Critical or Rejection Region
The critical region depends upon Ha.
For Ha: μ < μ0, we reject H0 if
t < -t_{n-1,α}
where t_{n-1,α} is the appropriate value from the t table in the Appendix.
For Ha: μ > μ0, we reject H0 if
t > t_{n-1,α}
Usually, textbook problems give α.
Typical values for α are:
• .10
• .05 (most popular)
• .01
In our particular case, consider α = .05.
Thus, we shall reject H0 if
t < -t_{n-1,α} = -t_{14,.05} = -1.761
Step 4: Conduct Experiment and Calculate Test Statistic
ȳ = .3795
s = .0009
t = (ȳ - μ0) / (s/√n)
  = (.3795 - .380) / (.0009/√15)
  = -2.152
Step 5: Reach Conclusions and State in English
Since t < -1.761, we have sufficient evidence to reject H0.
We therefore have enough evidence to suggest that the true mean outside
diameter is less than .380.
A reasonable question: What are the “plausible” values for the true mean
outside diameter?
We can construct a 95% confidence interval for μ by
ȳ ± t_{n-1,α/2} (s/√n)
with t_{n-1,α/2} = t_{14,.025} = 2.145.
So
.3795 ± (2.145)(.0009/√15)
.3795 ± .0005
(.3790, .3800)
Does this interval contain .380?
Note: in some sense, we could have addressed the question of interest directly
by the confidence interval.
What did we assume to do this analysis?
That our outside diameters follow a well-behaved distribution.
Are we comfortable with that assumption?
Stem-and-leaf display of the outside diameters:
Stem    Leaves    Number   Depth
.378    00           2       2
.379    000000       6
.380    00000        5       7
.381    00           2       2
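As a quick check, the whole calculation can be reproduced in scipy from the 15 diameters above; the small difference from the hand calculation comes from rounding ȳ and s to .3795 and .0009.

import numpy as np
from scipy import stats

# The 15 pen barrel outside diameters from the example
y = np.array([.379, .380, .378, .379, .381,
              .379, .380, .378, .379, .379,
              .381, .379, .380, .380, .380])

# One-sided test of H0: mu = .380 versus Ha: mu < .380
t_stat, p_val = stats.ttest_1samp(y, popmean=0.380, alternative='less')
print(f"t = {t_stat:.3f}, p-value = {p_val:.4f}")
# t is about -2.26 here; the hand calculation's -2.152 used ybar and s rounded
# to .3795 and .0009.  Either way t < -1.761, so we reject H0 at alpha = .05.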
2. Two-Tailed Tests
An important characteristic of the grapes used to make fine wine is the sugar
content.
Basically, the wine maker can predict the final alcohol content of the wine by
dividing the sugar content of the grapes by 2.
A Napa Valley winery pays a premium to its wine growers if they can deliver
shipments with a true mean sugar content of 26%.
The winery tests grapes from five different, randomly selected locations in the
shipment and determines the sugar content at each location.
What is an appropriate method for determining if the wine grower deserves a
premium?
Step 1: State the Hypotheses
H0: μ = 26
Ha: μ ≠ 26
Step 2: State the Test Statistic
t = (ȳ - μ0) / (s/√n)
Step 3: State the Critical Region
For Ha: μ ≠ μ0, we reject H0 if
|t| > t_{n-1,α/2}
where t_{n-1,α/2} is the appropriate value from the t table in the Appendix.
In our case, use α = .05.
Thus, we shall reject the null hypothesis if
|t| > t_{n-1,α/2} = t_{4,.025} = 2.776
Step 4: Conduct Experiment and Calculate Test Statistic
Suppose the next wine grower has
ȳ = 24.5
s = 1.3
t = (ȳ - μ0) / (s/√n)
  = (24.5 - 26) / (1.3/√5)
  = -2.580
Step 5: Reach Conclusion, State in English
Since |t| = 2.580 < 2.776, we fail to reject H0.
Therefore we have insufficient evidence to show that the true sugar content is
not 26%.
Therefore, we should pay the grower the premium.
Typically, we would want to check our assumptions.
In this case, with n=5, we cannot do very much.
We must trust that the data come from a very well behaved (nearly normal)
distribution.
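As a quick check, since only the summary statistics are given (ȳ = 24.5, s = 1.3, n = 5), the test can be reproduced from them directly:

import numpy as np
from scipy import stats

ybar, s, n, mu0, alpha = 24.5, 1.3, 5, 26.0, 0.05

t_stat = (ybar - mu0) / (s / np.sqrt(n))         # about -2.580
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)    # about 2.776
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 1)    # two-sided p-value

print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}, p-value = {p_val:.3f}")
# |t| = 2.580 < 2.776, so we fail to reject H0 at alpha = .05.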
C. Tests for Proportions
Example: 50 lb. Bags of Graphite
Historically 1% of the 50 lb. bags of graphite bagged on a certain process have
weights outside the specifications of 48-52 lbs.
Suppose we wish to monitor this process.
What would be appropriate hypotheses?
H0: p = p0 (p = .01)
Ha: p ≠ p0 (p ≠ .01)
What would be the appropriate test statistic?
Let Y be the number of bags which fail to meet the specifications in our sample.
We can estimate p by
p̂ = Y/n
From the normal approximation to the binomial, we obtain
Z = (p̂ - p0) / √( p0(1 - p0)/n )
Note: under H0, we actually know the standard error of p̂
What should be the critical or rejection region?
Once again, the rejection region depends on the alternative hypothesis.
Consider Ha: p < p0.
We thus reject H0 if Z < -zα
Now, consider Ha: p > p0.
We thus reject H0 if Z > zα
Consider Ha: p ≠ p0, which is our specific case.
We thus reject H0 if |Z| > zα/2
We need to determine an appropriate significance level, α.
Typical choices are
• .10
• .05
• .01
Which should we use?
What is our rejection rule?
Next, we need to determine a sample size.
From the normal approximation to the binomial, we need n to satisfy:
• np0 ≥ 5 (preferably np0 ≥ 10), and
• n(1 - p0) ≥ 5 (preferably n(1 - p0) ≥ 10).
In our case, what does that mean?
Suppose we use n = 1000 and that our sample has 15 bags which fail to meet
the specifications.
The value for our test statistic is
Z = (.015 - .01) / √( (.01)(.99)/1000 ) = 1.59
What can we conclude?
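As a quick sketch of the calculation (taking α = .05 here as an assumed choice, since the notes leave it open):

import numpy as np
from scipy import stats

n, failures, p0, alpha = 1000, 15, 0.01, 0.05   # alpha = .05 is an assumed choice

p_hat = failures / n
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)   # about 1.59
z_crit = stats.norm.ppf(1 - alpha / 2)          # 1.96 for a two-sided test

print(f"Z = {z:.2f}, critical value = {z_crit:.2f}")
# |Z| = 1.59 < 1.96, so at alpha = .05 we cannot conclude the rate has changed.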
D. p-values
In testing, α represents our standard of evidence.
Once we state our α, we determine the appropriate critical region for our test.
Any value of our test statistic which is more extreme than our “critical value” is
considered sufficient evidence to reject the null hypothesis or nominal claim.
An alternative method looks at the observed significance level, sometimes
called the attained significance level, which is the smallest Type I error rate
that would allow us to reject the null hypothesis.
The observed significance level is the probability of seeing the particular value
of our test statistic, or something more extreme, if H0 is true.
This probability is usually called a p-value.
Most statistical software packages report p-values since these packages
do not know what the researcher wishes to use for α.
One rejects H0 whenever the p-value is less than α.
We make extensive use of p-values when performing regression analysis with
statistical software.
The p-value depends upon the specific alternative used for our test.
Let z0 be the observed value for our test statistic.
• For Ha: μ < μ0, the p-value is P(Z < z0).
• For Ha: μ > μ0, the p-value is P(Z > z0).
• For Ha: μ ≠ μ0, we must consider both tails of the standard normal distribution
and the p-value is 2 • P(Z > |z0|).
Example: Breaking Strengths of Carbon Fibers
Consider the hypotheses
H0: p = .10
Ha: p ≠ .10
Suppose the data produced a test statistic value of z0 = -1.33.
Thus, the p-value for this test is
p-value = 2 • P(Z > |z0|)
= 2 • P(Z > |-1.33|)
= 2 • P(Z > 1.33)
= .1836
Suppose our significance level is  = .01.
Since our p-value is not less than .01, we would fail to reject the null
hypothesis.
We have insufficient evidence to show that the true proportion of defectives has
changed.
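As a quick check, the same p-value comes directly from the standard normal survival function:

from scipy import stats

z0 = -1.33
p_two_sided = 2 * stats.norm.sf(abs(z0))   # 2 * P(Z > 1.33)

print(f"p-value = {p_two_sided:.4f}")      # 0.1835, the notes' .1836 up to rounding
# Since the p-value is not less than alpha = .01, we fail to reject H0.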
E. Hypothesis Tests for Two Means, Independent Groups
There are many occasions when we need to compare two processes or
populations.
For example, consider two machines which produce erasers with the same
nominal outside diameter.
For a long time, the supervisor has complained that Machine 1 produces
erasers with a larger outside diameter.
How can we approach this problem?
Let μ1 and σ1² be the population mean and population variance for Machine 1.
Let μ2 and σ2² be the population mean and population variance for Machine 2.
Assume:
• σ1² = σ2² = σ² (common variance)
• σ² is unknown
• the observations from Machine 1 are independent of those from Machine 2.
Suppose that a random sample of size n1 is taken from Machine 1's production.
Let ȳ1 and s1² be the resulting sample mean and sample variance.
Suppose that a random sample of size n2 is taken from Machine 2's production.
Let ȳ2 and s2² be the resulting sample mean and sample variance.
Step 1: The Possible Hypotheses
H0: μ1 - μ2 = 0 vs. Ha: μ1 - μ2 < 0
H0: μ1 - μ2 = 0 vs. Ha: μ1 - μ2 > 0
H0: μ1 - μ2 = 0 vs. Ha: μ1 - μ2 ≠ 0
This procedure can be generalized to test
H0: μ1 - μ2 = δ0 vs. Ha: μ1 - μ2 < δ0
H0: μ1 - μ2 = δ0 vs. Ha: μ1 - μ2 > δ0
H0: μ1 - μ2 = δ0 vs. Ha: μ1 - μ2 ≠ δ0
where δ0 is a specified difference between the two means.
In our specific case, our hypotheses are:
H0: μ1- μ2 = 0
Ha: μ1- μ2 > 0
Step 2: The Test Statistic
t = (ȳ1 - ȳ2) / ( s_p √(1/n1 + 1/n2) )
where
s_p² = [ (n1 - 1)s1² + (n2 - 1)s2² ] / (n1 + n2 - 2)
In this case, t follows a t distribution with n1 + n2 – 2 degrees of freedom.
Step 3: Critical or Rejection Regions
Once again, the rejection regions depend on the alternative hypothesis.
• For Ha: μ1 - μ2 < 0, we reject H0 when t < -t_{n1+n2-2, α}
• For Ha: μ1 - μ2 > 0, we reject H0 when t > t_{n1+n2-2, α}
• For Ha: μ1 - μ2 ≠ 0, we reject H0 when |t| > t_{n1+n2-2, α/2}
In our specific case, we reject H0 when t > t_{n1+n2-2, α}
Step 4: Collect Data and Calculate the Test Statistic
A single batch of raw materials has been split to provide two production
runs: One for machine 1, and one for machine 2.
MACHINE 1
240 243 250 253
238 242 245 251
239 242 246 248
MACHINE 2
241 243 245 248
239 240 242 243
239 240 250 252
241 243 249 255
For Machine 1:
ȳ1 = 244.75
s1² = 24.205
n1 = 12
For Machine 2:
ȳ2 = 244.375
s2² = 24.516
n2 = 16
The pooled variance is
s_p² = [ (n1 - 1)s1² + (n2 - 1)s2² ] / (n1 + n2 - 2)
     = [ 11(24.205) + 15(24.516) ] / (12 + 16 - 2)
     = 24.384
Thus,
s_p = 4.938
The value of the test statistic is
t = (ȳ1 - ȳ2) / ( s_p √(1/n1 + 1/n2) )
  = (244.75 - 244.375) / ( 4.938 √(1/12 + 1/16) )
  = 0.199
Step 5: Reach Conclusions
Suppose we use a significance level of 0.10.
With n1 = 12 and n2 = 16, the critical value for the t statistic is
t_{n1+n2-2, α} = t_{26,.10} = 1.315
Because our observed value of the test statistic (0.199) is less than 1.315, we
do not have sufficient evidence to reject the null hypothesis.
Thus, we cannot show that Machine 1 produces larger outside diameters than
Machine 2.
A (1 - α)•100% confidence interval for μ1 - μ2 is
(ȳ1 - ȳ2) ± t_{n1+n2-2, α/2} s_p √(1/n1 + 1/n2)
Thus, a 95% confidence interval for the two machines is
(244.75 - 244.375) ± (2.056)(4.938)√(1/12 + 1/16)
0.375 ± 3.877
(-3.502, 4.252)
Note: 0 is a plausible value for the true mean difference.
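As a quick check, a scipy sketch reproduces both the pooled test and the confidence interval from the raw data:

import numpy as np
from scipy import stats

machine1 = np.array([240, 243, 250, 253, 238, 242, 245, 251, 239, 242, 246, 248])
machine2 = np.array([241, 243, 245, 248, 239, 240, 242, 243,
                     239, 240, 250, 252, 241, 243, 249, 255])

# Pooled (equal-variance) two-sample t-test of Ha: mu1 - mu2 > 0
t_stat, p_val = stats.ttest_ind(machine1, machine2, equal_var=True,
                                alternative='greater')
print(f"t = {t_stat:.3f}, p-value = {p_val:.3f}")   # t is about 0.199

# 95% confidence interval for mu1 - mu2 using the pooled variance
n1, n2 = len(machine1), len(machine2)
sp2 = ((n1 - 1) * machine1.var(ddof=1) + (n2 - 1) * machine2.var(ddof=1)) / (n1 + n2 - 2)
half = stats.t.ppf(0.975, df=n1 + n2 - 2) * np.sqrt(sp2 * (1 / n1 + 1 / n2))
diff = machine1.mean() - machine2.mean()
print(f"95% CI: ({diff - half:.3f}, {diff + half:.3f})")   # about (-3.50, 4.25)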
We need to check our assumptions.
Stem-and-leaf displays of the two samples (stems split in halves: * = leaves 0-4, • = leaves 5-9):
Stem    Machine 1    No.   Depth      Machine 2    No.   Depth
23•     89            2      2        99            2      2
24*     0223          4      6        00112333      8
24•     568           3      6        589           3      6
25*     013           3      3        025           3      3
[The normal probability plots for Machine 1 and Machine 2 appear here in the original.]
F. Paired t-test
1. The Hypothesis Test
Note: The two sample t test assumed that the two samples were
independent of each other.
There are many occasions where the two samples are not independent
because they involve the same sampling unit.
Example: Marketing Pre-Test of a New Ball-Point Pen
The Marketing Department of a pen company determined that the basic
ball-point pen needed revision.
Marketing commissioned a production lot of a new prototype pen.
A group of ten people who work at the production facility were asked to write
with the new prototype and with the leading competitor's pen.
Each person ranked the pen's writing performance on a scale from 1 - 10, with 1
being extremely poor and 10 being excellent.
Note: We should expect significant differences in preference from individual to
individual.
The two rankings are not independent of one another!
How can we determine if people prefer the prototype?
Let
• y1i be the observed score for the competitor's pen given by the ith person
• y2i be the observed score for the prototype pen given by the ith person
Define
di = y1i - y2i .
Let δ be the true mean difference in the scores.
• If δ = 0, then there is no difference in the two pens.
• If δ > 0, then the first pen tends to get higher ratings than the second.
• If δ < 0, then the first pen tends to get lower ratings than the second.
We can set up an appropriate hypothesis testing procedure.
Step 1: State the Hypotheses
H0: δ = δ0 vs. Ha: δ < δ0
H0: δ = δ0 vs. Ha: δ > δ0
H0: δ = δ0 vs. Ha: δ ≠ δ0
Note: Often, δ0 will be 0.
In our case, we wish to show that the prototype is better; thus, with δ0 = 0,
H0: δ = 0
Ha: δ < 0
Step 2: State the Test Statistic
An appropriate estimate of δ is
d̄ = (1/n) Σ_{i=1}^{n} d_i
Note: d̄ is the sample mean difference.
Let s_d² be the sample variance for the differences,
s_d² = [ n Σ d_i² - ( Σ d_i )² ] / [ n(n - 1) ]
The appropriate test statistic is
t = (d̄ - δ0) / (s_d/√n)
Step 3: State the Critical or Rejection Region
Our critical regions are:
• For Ha: δ < δ0, we reject H0 when t < -t_{n-1,α}
• For Ha: δ > δ0, we reject H0 when t > t_{n-1,α}
• For Ha: δ ≠ δ0, we reject H0 when |t| > t_{n-1,α/2}
For the marketing pre-test, we should use a .05 significance level.
Thus, we reject the null hypothesis if
t < -t_{n-1,α} = -t_{9,.05} = -1.833
Step 4: Collect Data and Calculate the Test Statistic
The actual data:
Individual   Competitor   Prototype   Difference
    1             7            8          -1
    2             6            7          -1
    3             8            9          -1
    4            10            8           2
    5             2            9          -7
    6             5            5           0
    7             6            6           0
    8             6            8          -2
    9             4           10          -6
   10             6            9          -3
For these data,
d̄ = -1.9
s_d = 2.77
Thus, our test statistic is
t = d̄ / (s_d/√n)
  = -1.9 / (2.77/√10)
  = -2.17
Step 5: Reach Conclusions
Since t < -1.833, we have sufficient evidence to reject the null hypothesis.
As a result, we have evidence to suggest that people who work at this facility really
do prefer the prototype.
2. The Confidence Interval
We can construct a 95% confidence interval for the true difference by
d̄ ± t_{n-1,α/2} (s_d/√n)
= -1.9 ± t_{9,.025} (2.77/√10)
= -1.9 ± (2.262)(0.88)
= -1.9 ± 1.98
(-3.88, 0.08)
The plausible values for this difference range from -3.88 to 0.08, which seems
to contradict the results of our hypothesis test.
We must keep in mind that we conducted a one-sided hypothesis test;
however, our confidence interval is two-sided.
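As a quick check, both the one-sided test and the two-sided interval can be reproduced in scipy:

import numpy as np
from scipy import stats

competitor = np.array([7, 6, 8, 10, 2, 5, 6, 6, 4, 6])
prototype  = np.array([8, 7, 9,  8, 9, 5, 6, 8, 10, 9])

# Paired t-test of Ha: delta < 0, where d_i = competitor_i - prototype_i
t_stat, p_val = stats.ttest_rel(competitor, prototype, alternative='less')
print(f"t = {t_stat:.2f}, p-value = {p_val:.3f}")   # t is about -2.17

# 95% two-sided confidence interval for the true mean difference
d = competitor - prototype
half = stats.t.ppf(0.975, df=len(d) - 1) * d.std(ddof=1) / np.sqrt(len(d))
print(f"95% CI: ({d.mean() - half:.2f}, {d.mean() + half:.2f})")   # about (-3.88, 0.08)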
We do need to check our assumptions.
Stem-and-leaf display of the differences (split stems: * = leaves 0-1, t = 2-3, f = 4-5, s = 6-7):
Stem    Leaves    No.   Depth
-s      67         2      2
-f                 0      2
-t      23         2      4
-*      111        3     (3)
 *      00         2      3
 t      2          1      1
[The normal probability plot of the differences appears here in the original.]
3. When to Pair
A reasonable question: When should an experimenter pursue a paired
structure?
Pairing works well when the sampling units available for the study differ widely
among themselves.
In this case, pairing allows us to remove the sampling unit to sampling unit
variability, which makes our estimate of the standard deviation much smaller.
As a result, we are more likely to reject our null hypothesis (we increase the
power of our test).
On the other hand, pairing the data also reduces the number of degrees of
freedom available for our analysis.
Decreasing the number of degrees of freedom makes the critical value
for our test statistic slightly larger in absolute value.
As a result, it is slightly more difficult to reject the null hypothesis (we slightly
decrease the power of our test).
In general, we should obtain paired data whenever we know that the sampling
units differ significantly from one another.
The reduction in variability typically more than compensates for the slight increase
in the critical value for the test.
G. Transformations
There are times when engineering data do not follow a normal distribution.
This violates our distributional assumption for the t-test. One approach for
dealing with nonnormal data is to transform the data to a different scale where
normality holds. Common transformations in the engineering sciences are the
natural log, the square root, and the inverse.
Consider an example of the sealing strength of plastic bags with a target
strength of 11 Newtons. A Normal probability plot shows the data depart from
normality.
[Normal probability plots of the original data and of the transformed data (using
the inverse, plotted against quantiles of the standard normal) appear here in the
original.]
The output from the t-test on the transformed data shows that there is
no evidence that the mean has changed from 11 using α = 0.05.
Test of mu = 0.0909 vs not = 0.0909
 N      Mean     StDev   SE Mean              95% CI      T      P
20  0.083293  0.018119  0.004051  (0.074813, 0.091773)  -1.88  0.076
It is important to remember to transform the nominal value being tested
(11 becomes 0.0909).
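As a quick check, the reported T and P can be reproduced from the summary output alone, since t = (Mean - 0.0909)/(SE Mean):

from scipy import stats

n, mean, se_mean, mu0 = 20, 0.083293, 0.004051, 1 / 11   # 1/11 = 0.0909...

t_stat = (mean - mu0) / se_mean                 # about -1.88
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 1)   # two-sided p-value

print(f"t = {t_stat:.2f}, p-value = {p_val:.3f}")   # about t = -1.88, p = 0.076
# The p-value exceeds 0.05, so there is no evidence the mean differs from 11 on
# the original scale (0.0909 on the inverse scale).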
An alternative to transforming the data is to apply a methodology that does
not rely on the normality assumption. This methodology is known as
nonparametric statistics. The analogous nonparametric procedure for the
one-sample t-test tests the population median and is called the sign test.
Essentially, the sign test counts the number of observations above and
below the median. If the null hypothesis is true, we would expect half the
observations to be above the median and half below. Using the binomial
distribution, one can calculate a p-value when the numbers above and below
deviate from half the data.
Sign test of median = 11.00 versus not = 11.00
 N  Below  Equal  Above       P  Median
20      7      0     13  0.2632   11.70
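As a quick check, the reported p-value is just a two-sided binomial calculation with p = 1/2, using the counts above:

from scipy import stats

n_nonzero, above = 20, 13   # 7 below, 0 equal, 13 above the hypothesized median of 11

# Two-sided exact binomial test of p = 0.5 (requires scipy >= 1.7 for binomtest)
result = stats.binomtest(above, n_nonzero, p=0.5, alternative='two-sided')
print(f"p-value = {result.pvalue:.4f}")   # 0.2632, matching the output above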
Note that nonparametric procedures are less powerful than t-tests since we are
only concerned with the number of values above and below the median and not
their exact values. (If all 13 values above 11 in our example were multiplied by
100, we would get the same p-value in the sign test.)
There are nonparametric procedures for the two-sample independent t-test (rank
sum test) and the paired t-test (signed rank test). These procedures can be done
very quickly in standard statistical software packages.