Transcript Document

Module Nine:
Techniques for Monitoring Variability and Testing Homogeneity
Most statistical techniques developed for analyzing experimental design data with analysis of variance require the assumption of homogeneity of variances among treatment groups. Techniques for diagnosing this homogeneity-of-variance property will be discussed using both numerical and graphical methods.
In designed experiments, the factors of interest may themselves be random variables. The main interest is then to study the variance components of those factors. This is especially important for designs that are planned to reduce the variation of the response variable.
Techniques for analyzing variance components to study the day-to-day variability, the within-lab variability, and the between-lab variability are important for inter-laboratory testing.
Techniques for reducing the variation of a response variable for quality improvement will also be discussed in this module. These include control charts for monitoring variation and Gage R&R analysis.
Presenting uncertainty for the sample variance: 100(1-α)% Confidence Interval for the Population Variance, σ²
In an inter-laboratory testing study, we often encounter the problem of estimating within-lab variability and between-lab variability. These are used to study the ‘repeatability’ and ‘reproducibility’ of a testing procedure.
• Repeatability refers to how well a test procedure can be repeated for testing the same material under the same conditions in the same laboratory. It is often measured by the within-lab variability when the testing environment and experimental units are as homogeneous as possible. Hence, it measures the uncertainty due to the combined small, uncontrollable random errors.
• Reproducibility refers to how well a measurement of the same material using the same testing procedure can be reproduced in different laboratories.
• When a lab is under good statistical control, sources of special-cause variation are minimal; therefore, the repeatability is expected to be high.
• Reproducibility is more complicated. Low reproducibility is often reflected by large between-lab variability. The causes of low reproducibility may include the material itself, differences between labs in environmental conditions, operator training, and systematic errors.
• How to measure repeatability and reproducibility is important in an inter-laboratory study or in a gauge analysis.
To discuss the presentation of variation, we start from the one-sample case.
Suppose we observe n test results for a material, obtained by repeatedly testing the material n times with the same testing procedure. One task is to measure the repeatability of the testing procedure for this material.
Simple and useful estimates of the repeatability are the sample variance, s², and the sample standard deviation, s.
Unlike the uncertainty of the sample mean, because the sample variance s² is always non-negative, we do not report the uncertainty of s² in the form
s² ± SE(s²)
Instead, based on statistical theory, when the sample is drawn from a normal population, the distribution of s² can be determined through the so-called Chi-square distribution:
(n − 1)s² / σ² ~ χ²  with degrees of freedom, df = n − 1
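This distributional result can be checked numerically. As a minimal sketch (assuming Python with numpy and scipy is available, which is not required by this module), the following simulates many normal samples and compares the empirical quantiles of (n − 1)s²/σ² with the Chi-square quantiles; all parameter values are illustrative.

```python
# Sketch: checking that (n-1)s^2 / sigma^2 behaves like a Chi-square(n-1) variable.
# Assumes numpy and scipy are available; all parameter values are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n, sigma, reps = 10, 2.0, 100_000

samples = rng.normal(loc=50.0, scale=sigma, size=(reps, n))
s2 = samples.var(axis=1, ddof=1)          # sample variances, one per simulated sample
stat = (n - 1) * s2 / sigma**2            # (n-1)s^2 / sigma^2

for q in (0.05, 0.50, 0.95):
    print(q, np.quantile(stat, q), stats.chi2.ppf(q, df=n - 1))
# The empirical and theoretical quantiles should agree closely.
```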
What is a Chi-square random variable? What does the distribution look like? How do we compute Chi-square probabilities or use the Chi-square table?
Statistically, the χ² (Chi-square) distribution with ν degrees of freedom is a continuous probability distribution with probability density:

f(x) = [1 / (2^(ν/2) Γ(ν/2))] x^(ν/2 − 1) e^(−x/2),  for x > 0

[Figure: Chi-square density curves f(x) for ν = 1, 3, 5, and 8, plotted over x from 0 to about 15.]
Hands-on Activity: using the Chi-square table and Minitab to find Chi-square quantiles.
• df = 20: find the 95th percentile Q(.95), that is, the value satisfying P(χ² < Q(.95)) = .95.
• More examples here, if needed.
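If a scripting environment such as Python with scipy is available (Minitab or the table works just as well), the quantile can be found as in the minimal sketch below.

```python
# Sketch: the 95th percentile of a Chi-square distribution with df = 20.
from scipy.stats import chi2

q95 = chi2.ppf(0.95, df=20)      # inverse CDF (quantile function)
print(round(q95, 3))             # about 31.41
print(chi2.cdf(q95, df=20))      # check: P(chi-square_20 < q95) = 0.95
```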
The uncertainty of using s² to estimate σ² can be expressed in terms of a 100(1-α)% confidence interval.
After some simple algebra, we obtain a 100(1-α)% confidence interval for σ²:

Lower Bound = (n − 1)s² / χ²(α/2, df = n−1)
Upper Bound = (n − 1)s² / χ²(1 − α/2, df = n−1)

where
χ²(α/2, df = n−1) is the 100(1 − α/2) percentile of the χ² distribution with d.f. = (n − 1), and
χ²(1 − α/2, df = n−1) is the 100(α/2) percentile of the χ² distribution with d.f. = (n − 1).

Case Example: The TAPPI data, sample GR36. 76 labs tested sample GR36. The sample variance and standard deviation are s² = .4679 and s = .684.
A 95% confidence interval for the lab variance is
Lower bound = (76 − 1)(.4679)/100.84 = .348
Upper bound = (76 − 1)(.4679)/50.94 = .689
We are 95% confident that the lab variance is between .348 and .689.
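The same kind of interval can be computed in a few lines; the sketch below assumes Python with scipy, and the helper name variance_ci is illustrative. Because it uses exact Chi-square quantiles, the bounds may differ slightly from the table values quoted above.

```python
# Sketch: 100(1-alpha)% confidence interval for a population variance,
# given n and s^2, using Chi-square quantiles (TAPPI sample GR36 numbers).
from scipy.stats import chi2

def variance_ci(s2, n, alpha=0.05):
    df = n - 1
    upper_chi = chi2.ppf(1 - alpha / 2, df)   # 100(1 - alpha/2) percentile
    lower_chi = chi2.ppf(alpha / 2, df)       # 100(alpha/2) percentile
    return df * s2 / upper_chi, df * s2 / lower_chi

print(variance_ci(0.4679, 76))   # lower bound about 0.35
```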
Hands-on activity
1. It was found that three labs may have large within-lab variability. An experiment was conducted at these three labs to test the same material. The material was tested 20 times at each lab. The sample standard deviations were found to be:
sA = 4.5
sB = 8.4
sC = 5.6
(a) Obtain a 95% confidence interval for the within-lab variance of each lab.
(b) Is there overlap between the 95% confidence intervals for Lab A and Lab B?
(c) Based on the result in (b), is there strong evidence to conclude that the within-lab variance of Lab B is significantly higher than that of Lab A?
Comparing the homogeneity of variances between two groups
When two materials are each tested for n repetitions in the same lab, we are often interested in investigating two aspects of the testing results:
1. To compare the average testing results between the two materials, x̄1 − x̄2 (this was done in Module Eight). The approach is the t-test.
2. To compare the within-material variability of the testing results.
The second type of comparison is to compare the ratio of variances between the two groups, σ1²/σ2². Our goal is to estimate this ratio based on two independent samples, one from each population.
Notice that, in order to measure the uncertainty of the ratio σ1²/σ2², we compute the ratio of sample variances, s1²/s2². Similar to the measurement of a one-sample variance, we need to understand the distribution of s1²/s2², or of some function of s1²/s2², that will allow us to study the uncertainty of measuring σ1²/σ2².
The idea is to take two independent samples and compute their sample variances to make inference about σ1²/σ2².
Assuming that each sample is observed independently from a normal population, statistical theory tells us that (when the two population variances are equal) the ratio of sample variances, s1²/s2², follows the so-called F-distribution, with numerator degrees of freedom (df) being the df of s1² and denominator df being the df of s2².
That is, F = s1²/s2², and the F-distribution depends on df1 = numerator df and df2 = denominator df.
For this two-material testing example, df1 = (n − 1) = df2.
What is the F-distribution? What does it look like? How do we compute cumulative probabilities or quantiles using the table or Minitab?
Statistically, the F-distribution with numerator and denominator degrees of freedom ν1 and ν2 is a continuous probability distribution with probability density:

f(x) = [Γ((ν1 + ν2)/2) / (Γ(ν1/2) Γ(ν2/2))] (ν1/ν2)^(ν1/2) x^(ν1/2 − 1) (1 + (ν1/ν2)x)^(−(ν1 + ν2)/2),  for x > 0

[Figure: F density curves f(x) for (ν1 = 10, ν2 = 100), (ν1 = 10, ν2 = 10), (ν1 = 10, ν2 = 4), and (ν1 = 4, ν2 = 4).]
Hands-on Activity: Using the F-Table
• An important property of the F-distribution:
F(α, df1, df2) = 1 / F(1 − α, df2, df1),  where α is the upper-tail probability.
For example: F(.05, 4, 20) = 2.87 = 1 / F(.95, 20, 4); therefore, F(.95, 20, 4) = 1/2.87 = .35.
Because of this property, most F-quantile tables present only the upper (or only the lower) end of the F-distribution. This property is used to determine the other end of the F-quantiles.
Exercises of using the F-table may be needed here.
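This reciprocal property is easy to verify numerically. The minimal sketch below assumes Python with scipy; the helper F_upper is illustrative and follows the slide's upper-tail convention.

```python
# Sketch: verifying F(alpha, df1, df2) = 1 / F(1 - alpha, df2, df1),
# where alpha is the upper-tail probability (the convention used on the slide).
from scipy.stats import f

def F_upper(alpha, df1, df2):
    return f.ppf(1 - alpha, df1, df2)    # upper-tail alpha point

print(F_upper(0.05, 4, 20))              # about 2.87
print(F_upper(0.95, 20, 4))              # about 0.35
print(1 / F_upper(0.05, 4, 20))          # equals the previous line
```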
Using the F-distribution, we are able to test the following hypotheses at the level of significance α:
H0: σ1² = σ2²
Ha: σ1² ≠ σ2²
1. Compute Fobs = s1²/s2².
2. Obtain the critical F-values using the F-table or software such as Minitab. The critical values for a two-sided test are F(α/2, df1, df2) and F(1 − α/2, df1, df2).
3. When Fobs falls outside the two critical values, we conclude that the two population variances are significantly different.
(Note: the hypothesis test can also be a right-sided test or a left-sided test.)
4. If we use software such as Minitab to conduct this comparison, it gives us the so-called p-value, which is the observed level of significance. Therefore, to make a decision, we can compare the p-value with α using the rule:
If p-value < α, then reject H0; the two variances are significantly different.
If p-value ≥ α, then do not reject H0; the variances are not significantly different.
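The steps above can also be sketched in code. The following assumes Python with numpy and scipy; the function f_test_two_sided and its variable names are illustrative rather than a standard library routine.

```python
# Sketch: two-sided F test of H0: sigma1^2 = sigma2^2 from two samples.
import numpy as np
from scipy.stats import f

def f_test_two_sided(x1, x2, alpha=0.05):
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    df1, df2 = len(x1) - 1, len(x2) - 1
    F_obs = x1.var(ddof=1) / x2.var(ddof=1)
    lower_crit = f.ppf(alpha / 2, df1, df2)        # F(1 - alpha/2, df1, df2) in the slide's notation
    upper_crit = f.ppf(1 - alpha / 2, df1, df2)    # F(alpha/2, df1, df2) in the slide's notation
    p_value = 2 * min(f.cdf(F_obs, df1, df2), f.sf(F_obs, df1, df2))
    reject = (F_obs < lower_crit) or (F_obs > upper_crit)
    return F_obs, (lower_crit, upper_crit), p_value, reject
```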
A 100(1-α)% confidence interval for the ratio σ1²/σ2² can also be established to estimate the uncertainty of the ratio.
The F-distribution describes the distribution of

F = (s1²/σ1²) / (s2²/σ2²),

where the two independent samples of sizes n1 and n2 are randomly chosen from normal populations with variances σ1² and σ2², respectively. This F-distribution has df1 = n1 − 1 and df2 = n2 − 1.
Rearranging F = (s1²/σ1²) / (s2²/σ2²) in the probability inequality

P( F(1 − α/2, df1, df2) ≤ F ≤ F(α/2, df1, df2) ) = 1 − α,

one obtains a 100(1-α)% confidence interval for σ1²/σ2²:

Lower Bound = (s1²/s2²) · [1 / F(α/2, df1, df2)]
Upper Bound = (s1²/s2²) · F(α/2, df2, df1)
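These bounds can be computed directly from the sample variances. The minimal sketch below assumes Python with scipy; the helper variance_ratio_ci is illustrative. Note that scipy's ppf takes lower-tail probabilities, so the slide's upper-tail point F(α/2, df1, df2) corresponds to f.ppf(1 − α/2, df1, df2).

```python
# Sketch: 100(1-alpha)% confidence interval for sigma1^2 / sigma2^2.
from scipy.stats import f

def variance_ratio_ci(s1_sq, n1, s2_sq, n2, alpha=0.05):
    df1, df2 = n1 - 1, n2 - 1
    ratio = s1_sq / s2_sq
    lower = ratio / f.ppf(1 - alpha / 2, df1, df2)   # divide by F(alpha/2, df1, df2)
    upper = ratio * f.ppf(1 - alpha / 2, df2, df1)   # multiply by F(alpha/2, df2, df1)
    return lower, upper

# Example with the steel summary statistics used later in this module:
print(variance_ratio_ci(6.36**2, 20, 2.87**2, 15))
```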
Hands-On Activity
A study was conducted to compare the water-insoluble nitrogen in two types of fertilizers. Type A was randomly assigned to ten labs, and Type B was randomly assigned to eight labs. Assume the labs have little systematic error. The test results are summarized:

Type A: n1 = 10, s1 = 2.5, average = 24.5
Type B: n2 = 8, s2 = 4.5, average = 26.4

a. Obtain a 90% confidence interval for the ratio of variances of the two types of fertilizers.
b. Test whether Type A has a significantly lower variance than Type B, using α = 1%.
A more robust method for testing the uniformity of variances
Levene’s Test
This method considers the distances of the observations from their sample median rather than their sample mean. Using the sample median rather than the sample mean makes the test more robust for smaller samples, as well as making the procedure asymptotically distribution-free. The test statistic is given by:

FL = [ (N − k) Σ ni (ȳi − ȳ)² ] / [ (k − 1) Σ Σ (yij − ȳi)² ]

where yij = |xij − mi|, i = 1, 2, …, k, j = 1, 2, …, ni,
mi is the median of {xi1, xi2, …, xini},
ȳi is the mean of the yij in the ith sample and ȳ is the overall mean of all the yij,
ni is the number of observations in the ith sample, and N is the total number of observations, N = Σ ni.
Levene’s test statistic FL follows an F-distribution with numerator d.f. ν1 = k − 1 and denominator d.f. ν2 = N − k.
For the two-population case, k = 2. The decision rule is:
When the observed FL > F(α, 1, N − 2), the two variances are concluded to be not uniform.
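The statistic can be computed directly from its definition, and scipy's stats.levene with center='median' implements the same median-centered form. The sketch below assumes Python with numpy and scipy; the helper levene_FL and the small data sets are illustrative.

```python
# Sketch: Levene's test (median-centered) for homogeneity of variances of k groups.
import numpy as np
from scipy import stats

def levene_FL(*groups):
    groups = [np.asarray(g, float) for g in groups]
    k = len(groups)
    N = sum(len(g) for g in groups)
    y = [np.abs(g - np.median(g)) for g in groups]    # y_ij = |x_ij - median_i|
    n_i = np.array([len(yi) for yi in y])
    ybar_i = np.array([yi.mean() for yi in y])
    ybar = np.concatenate(y).mean()                   # overall mean of the y_ij
    num = (N - k) * np.sum(n_i * (ybar_i - ybar) ** 2)
    den = (k - 1) * sum(((yi - ybi) ** 2).sum() for yi, ybi in zip(y, ybar_i))
    FL = num / den
    p_value = stats.f.sf(FL, k - 1, N - k)
    return FL, p_value

# Cross-check against scipy's built-in version on made-up samples:
a = [4.1, 5.0, 3.8, 4.6, 5.3]
b = [6.2, 3.1, 7.4, 2.9, 5.8]
print(levene_FL(a, b))
print(stats.levene(a, b, center='median'))
```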
• The measurement uncertainty of a variance is different from the measurement uncertainty of a mean.
• In addition, because of the distributional properties, the interval of measurement uncertainty is presented in terms of a confidence interval, instead of being based only on BestEst ± SE.
• The confidence intervals are developed rather differently for the mean and the variance. The interval for the sample mean is based on BestEst ± SE, while the confidence interval for a variance is based on ratio uncertainty.
• The general approach of measuring uncertainty for f(x1, x2, …, xk) requires little assumption; therefore, the measurement uncertainty based on the general approach is usually more conservative. As a consequence, the measurement uncertainty is usually larger.
• One should take advantage of the distributional properties when the distribution of the variable can be approximated properly, such as by the Normal, t, Chi-square, and F-distributions; the measured uncertainty will then be more precise.
• However, if the assumption we have been making, that ‘the population from which we draw the sample follows a normal curve’ or ‘the sample size is large’, is not satisfied, the results may be too optimistic or even inappropriate. Therefore, a quick diagnosis of the distribution and of outliers is important, or tests that are more robust to violation of the normality assumption can be used. For the two-population-variances problem, Levene’s FL-test can be applied.
Use Minitab to construct a confidence interval estimate for the variances:
Consider the following example: the uniformity of hardness of specimens of 4% carbon steel is a key quality characteristic. Two types of steel are to be compared: Heat-Treated and Cold-Rolled. There are typically two important questions to be answered:
1. Which type of steel has higher hardness? – A problem of comparing the means using the t-test.
2. Which one gives more uniform hardness? – A problem of comparing the variances using the F-test or Levene’s Test.
In the following, we will use Minitab to conduct the analysis for problem (2) and leave problem (1) as a hands-on activity.
In Minitab:
1. Go to Stat, choose Basic Statistics, then select ‘2 variances’.
2. Depending on how the data are organized: if the data are in two columns, click on ‘Samples in different columns’ and enter the variables.
3. Click on ‘Options’; it allows you to enter the level of confidence. By default, it is 95%.
4. The Storage selection allows you to store some computed results.
Row  Heat-Tr  Cold-Tr
1    31.8     21.1
2    43.7     24.9
3    35.6     19.8
4    38.0     16.5
5    24.5     18.3
6    29.5     20.9
7    38.9     16.4
8    32.4     17.3
9    29.5     15.8
10   19.7     17.6
11   24.6     14.6
12   39.7     23.4
13   42.5     19.4
14   40.6     20.6
15   32.6     17.8
16   36.8
17   37.5
18   33.1
19   31.8
20   28.7

Variable  N   Mean   Median  StDev  SE Mean  Min   Max
Heat-Tr   20  33.58  32.85   6.36   1.42     19.7  43.7
Cold-Tr   15  18.96  18.30   2.87   0.74     14.6  24.9

[Figure: histograms of the hardness data (frequency versus hardness, roughly 15 to 45) for the Heat-Treated and Cold-Rolled steel.]
Test for Equal Variances
Level1: Heat-Tr
Level2: Cold-Tr
ConfLvl: 95.0000
\Workshop-Taiwan-Summer-2001\Data Sets\Steel-treat-conf-var.MPJ

Bonferroni confidence intervals for standard deviations
Lower     Sigma     Upper     N   Factor Levels
4.65829   6.35832   9.85093   20  Heat-Tr
2.01077   2.86501   4.85604   15  Cold-Tr

F-Test (normal distribution)
Test Statistic: 4.925
P-Value: 0.004

Levene's Test (any continuous distribution)
Test Statistic: 7.123
P-Value: 0.012

Both the F-test and Levene’s test show that the two types of steel have significantly different uniformity. In particular, the variability of the Heat-Treated steel is much higher than that of the Cold-Rolled steel. 95% Bonferroni confidence intervals are also provided. They show, with 95% confidence, that the hardness non-uniformity of the Heat-Treated steel, measured by the standard deviation, is between 4.66 and 9.85, while the non-uniformity of the Cold-Rolled steel is between 2.01 and 4.86.
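The output above can be cross-checked outside Minitab. The sketch below assumes Python with numpy and scipy and applies the two-sample F test and the median-centered Levene test to the hardness data listed earlier; the results should be close to the values reported above.

```python
# Sketch: comparing the variances of the two steel types (Heat-Tr vs Cold-Tr).
import numpy as np
from scipy import stats

heat = [31.8, 43.7, 35.6, 38.0, 24.5, 29.5, 38.9, 32.4, 29.5, 19.7,
        24.6, 39.7, 42.5, 40.6, 32.6, 36.8, 37.5, 33.1, 31.8, 28.7]
cold = [21.1, 24.9, 19.8, 16.5, 18.3, 20.9, 16.4, 17.3, 15.8, 17.6,
        14.6, 23.4, 19.4, 20.6, 17.8]

F_obs = np.var(heat, ddof=1) / np.var(cold, ddof=1)      # about 4.9
df1, df2 = len(heat) - 1, len(cold) - 1
p_F = 2 * min(stats.f.cdf(F_obs, df1, df2), stats.f.sf(F_obs, df1, df2))
print(F_obs, p_F)

print(stats.levene(heat, cold, center='median'))         # statistic about 7.1
```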
What is the Bonferroni simultaneous confidence interval? How do we construct it? How do we interpret it?
Bonferroni’s simultaneous confidence intervals for standard deviations, as reported by Minitab, modify the confidence interval for each individual sample variance based on the Chi-square distribution. The method is described in the following:
1. For each group, compute the sample variance si², i = 1, 2, …, k.
2. When constructing each individual interval using the Chi-square distribution, the critical values are modified according to the number of confidence intervals to be constructed:

Lower Bound = (n − 1)s² / χ²(α/(2k), df = n−1)
Upper Bound = (n − 1)s² / χ²(1 − α/(2k), df = n−1)

where
χ²(α/(2k), df = n−1) is the 100(1 − α/(2k)) percentile of the χ² distribution with d.f. = (n − 1), and
χ²(1 − α/(2k), df = n−1) is the 100(α/(2k)) percentile of the χ² distribution with d.f. = (n − 1).
NOTE: For a single confidence interval, k = 1, and the confidence interval is the same as the one we discussed before.
When we have more than one interval, in order to keep the overall type I error at α, Bonferroni proposed reducing the error allocated to each member of the family of confidence intervals so that each tail uses probability α/(2k).
Example: Using the steel data, we obtain:

95% Bonferroni confidence intervals for standard deviations
Lower     Sigma     Upper     N   Factor Levels
4.65829   6.35832   9.85093   20  Heat-Tr
2.01077   2.86501   4.85604   15  Cold-Tr

The lower bound 4.6583 for σ1 is obtained by

√[ (n − 1)s² / χ²(α/(2k), df = n−1) ] = √[ (20 − 1)(6.35832)² / χ²(.05/(2·2), df = 20−1) ] = √(768.1264 / 35.3986) = √21.6996 = 4.6583

The corresponding Chi-square values are 35.3986 and 7.9156, and the upper bound is found to be 9.8509.
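Both intervals can be reproduced from the summary statistics. The minimal sketch below assumes Python with scipy and follows the formulas above for k = 2 simultaneous intervals; the helper bonferroni_sd_ci is illustrative.

```python
# Sketch: Bonferroni simultaneous confidence intervals for k standard deviations.
import math
from scipy.stats import chi2

def bonferroni_sd_ci(s, n, k, alpha=0.05):
    df = n - 1
    tail = alpha / (2 * k)                   # Bonferroni-adjusted tail probability
    upper_chi = chi2.ppf(1 - tail, df)       # larger Chi-square point
    lower_chi = chi2.ppf(tail, df)           # smaller Chi-square point
    return (math.sqrt(df * s**2 / upper_chi),
            math.sqrt(df * s**2 / lower_chi))

print(bonferroni_sd_ci(6.35832, 20, k=2))    # about (4.66, 9.85) for Heat-Tr
print(bonferroni_sd_ci(2.86501, 15, k=2))    # about (2.01, 4.86) for Cold-Tr
```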
Hands-on Activity
Construct the Bonferroni confidence interval for the Cold-Rolled type of steel.
χ²(.05/4, df = 14) = χ²(.0125, df = 14) = 28.4219,  χ²(1 − .05/4, df = 14) = χ²(.9875, df = 14) = 4.8732
The graphical presentation of the comparison is also provided by Minitab, shown below.

[Figure: Test for Equal Variances. The upper panel shows 95% confidence intervals for the sigmas of the two factor levels (Heat-Tr and Cold-Tr) on a scale from about 2 to 10, together with the F-Test (Test Statistic: 4.925, P-Value: 0.004) and Levene's Test (Test Statistic: 7.123, P-Value: 0.012). The lower panel shows boxplots of the raw data for Heat-Tr and Cold-Tr over the range 15 to 45.]
The uniformity of hardness of steel is a classical example in quality improvement.
• To improve the hardness – the harder the better. That is, to improve the signal of the quality characteristic.
• To reduce the non-uniformity – minimize the variability – the smaller the s.d., the better. That is, to reduce the noise of the quality characteristic.
This is one of the major contributions of Taguchi, who proposed what is now known as Signal-to-Noise Ratio analysis in quality improvement.
Hands-on Activity
Use the Hardness of Steel data to conduct the following analysis:
1. Test whether the hardness of the two types of steel is significantly different.
2. Suppose the hardness standard for bridge steel is a minimum of 25. Does either type of steel meet the standard?
3. Suppose the standard for the variability of the hardness of steel is a maximum of 4.5. Does either type of steel meet this requirement?
Testing homogeneity of variability for more than two groups
(Testing for uniformity of measurements for more than two groups)
In real-world applications, including laboratory testing, it is often the case that there are more than two groups to compare. As a result, we need methods for making comparisons among three or more groups.
Testing for homogeneity of variances is important for at least two purposes:
1. The homogeneity of variances reflects the quality of the variable of interest. This is especially important in quality improvement. Consider the example of the uniformity of hardness of steel: if the steel has a huge variation in hardness, the lifetime of buildings or bridges will be very unreliable. Some lots of steel may last for 50 years, while others may only last for 10 years, due to the hardness variation of the same type of steel.
2. In making an appropriate comparison of group means, we often make the assumption that the within-group variation is constant; that is, the within-group variations are approximately similar. Therefore, an appropriate comparison of group means requires a diagnosis of homogeneity of variances.
Levene’s Test and Bartlett’s Test for testing homogeneity of variances of more than two groups
We have introduced Levene’s FL-test for the two-group comparison. This test is equally good for more than two groups, as the formula shows:

FL = [ (N − k) Σ ni (ȳi − ȳ)² ] / [ (k − 1) Σ Σ (yij − ȳi)² ]

where yij = |xij − mi|, i = 1, 2, …, k, j = 1, 2, …, ni,
mi is the median of {xi1, xi2, …, xini},
ni is the number of observations in the ith sample, and N is the total number of observations, N = Σ ni.
Levene’s test statistic FL follows an F-distribution with numerator d.f. ν1 = k − 1 and denominator d.f. ν2 = N − k.
The decision rule is:
When the observed FL > F(α, k−1, N−k), the within-group variances are concluded to be not uniform.
Bartlett’s Test for Homogeneity of Variances
Suppose we have three types of steel, and we are interested in comparing the uniformity of their hardness. To do so, we conduct a lab experiment to test the hardness of the three types of steel. The data are recorded in a table such as:

Steel Type  Observations                  Mean  Variance
A           x11, x12, x13, x14, …, x1n1   x̄1    s1²
B           x21, x22, x23, x24, …, x2n2   x̄2    s2²
C           x31, x32, x33, x34, …, x3n3   x̄3    s3²

The Bartlett test is based on the natural log transformation of the geometric mean of the sample variances. If the within-group variances are all equal, we can estimate the overall variance by combining all k groups of data together. As a consequence, the difference between the combined (pooled) variance and the individual variances gives the basis for the Bartlett test:

B = { [Σ(ni − 1)] ln[ Σ(ni − 1)si² / Σ(ni − 1) ] − Σ(ni − 1) ln si² } / { 1 + [ Σ 1/(ni − 1) − 1/Σ(ni − 1) ] / [3(k − 1)] }

Under the null hypothesis of equal variances, B follows approximately a Chi-square distribution with d.f. = k − 1.
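The formula can be implemented directly and cross-checked against scipy's built-in stats.bartlett. The sketch below assumes Python with numpy and scipy; the helper bartlett_B and the small samples are illustrative.

```python
# Sketch: Bartlett's test statistic for homogeneity of k group variances.
import numpy as np
from scipy import stats

def bartlett_B(*groups):
    groups = [np.asarray(g, float) for g in groups]
    k = len(groups)
    dfs = np.array([len(g) - 1 for g in groups])     # n_i - 1
    s2 = np.array([g.var(ddof=1) for g in groups])   # sample variances
    pooled = np.sum(dfs * s2) / dfs.sum()            # combined (pooled) variance
    num = dfs.sum() * np.log(pooled) - np.sum(dfs * np.log(s2))
    corr = 1 + (np.sum(1 / dfs) - 1 / dfs.sum()) / (3 * (k - 1))
    B = num / corr
    p_value = stats.chi2.sf(B, k - 1)                # B ~ Chi-square(k-1) under H0
    return B, p_value

# Cross-check against scipy on made-up samples:
a, b, c = [4.1, 5.0, 3.8, 4.6], [6.2, 3.1, 7.4, 2.9, 5.8], [5.5, 5.1, 4.9, 5.7, 5.2]
print(bartlett_B(a, b, c))
print(stats.bartlett(a, b, c))
```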
Row  Cold-Tr  Mid-Tr  Heat-Tr
1    21.1     30.4    31.8
2    24.9     27.5    43.7
3    19.8     21.8    35.6
4    16.5     24.9    38.0
5    18.3     31.4    24.5
6    20.9     18.5    29.5
7    16.4     16.7    38.9
8    17.3     19.8    32.4
9    15.8     23.8    29.5
10   17.6     25.7    19.7
11   14.6     29.4    24.6
12   23.4     28.4    39.7
13   19.4     21.7    42.5
14   20.6     26.4    40.6
15   17.8     30.6    32.6
16                    36.8
17                    37.5
18                    33.1
19                    31.8
20                    28.7

Variable  N   Mean   Median  StDev  SE Mean  Min   Max
Heat-Tr   20  33.58  32.85   6.36   1.42     19.7  43.7
Cold-Tr   15  18.96  18.30   2.87   0.74     14.6  24.9
Mid-Tr    15  25.13  25.70   4.64   1.20     16.7  31.4

[Figure: box plot of Hardness (about 15 to 45) for the three types of steel, by Factor: Cold-Tr, Heat-Tr, and Mid-Tr.]
Test for Equal Variances: Response Hardness
ConfLvl: 95.0000

Bonferroni confidence intervals for standard deviations
Lower     Sigma     Upper    N   Factor Levels
1.96631   2.86501   5.0557   15  Cold-Tr
4.56689   6.35832   10.1796  20  Heat-Tr
3.18546   4.64138   8.1903   15  Mid-Tr

Bartlett's Test (normal distribution)
Test Statistic: 8.676
P-Value: 0.013

Levene's Test (any continuous distribution)
Test Statistic: 3.927
P-Value: 0.026

Both the Bartlett test and the Levene test conclude that the within-group variances are not uniform.
Note that the Bartlett test requires the normality assumption for the population; the Levene test does not require this assumption.
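These results can be cross-checked with scipy's built-in tests; the minimal sketch below applies Bartlett's test and the median-centered Levene test to the three hardness columns listed above.

```python
# Sketch: Bartlett and Levene tests for the three types of steel.
from scipy import stats

cold = [21.1, 24.9, 19.8, 16.5, 18.3, 20.9, 16.4, 17.3, 15.8, 17.6,
        14.6, 23.4, 19.4, 20.6, 17.8]
mid  = [30.4, 27.5, 21.8, 24.9, 31.4, 18.5, 16.7, 19.8, 23.8, 25.7,
        29.4, 28.4, 21.7, 26.4, 30.6]
heat = [31.8, 43.7, 35.6, 38.0, 24.5, 29.5, 38.9, 32.4, 29.5, 19.7,
        24.6, 39.7, 42.5, 40.6, 32.6, 36.8, 37.5, 33.1, 31.8, 28.7]

print(stats.bartlett(cold, mid, heat))                  # statistic about 8.7
print(stats.levene(cold, mid, heat, center='median'))   # statistic about 3.9
```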
A graphical presentation of testing homogeneity of variances

[Figure: Test for Equal Variances for Hardness. 95% confidence intervals for the sigmas of the three factor levels (Cold-Tr, Heat-Tr, Mid-Tr) on a scale from about 2 to 10, together with Bartlett's Test (Test Statistic: 8.676, P-Value: 0.013) and Levene's Test (Test Statistic: 3.927, P-Value: 0.026).]
Hands-on Activity
Perform the following analyses by hand using the steel data:
1. Construct Bonferroni's simultaneous confidence interval for the Mid-temperature-treated steel.
2. Perform the Bartlett test.
3. Perform the Levene test.