Module Six: Outlier Detection for Two Sample Case, Youden Plot

Transcript Module Six: Outlier Detection for Two Sample Case, Youden Plot

Module Six: Outlier Detection
for Two Sample Case
Two sample plot, also known as Youden’s Plot, is a
scatter plot with a confidence region. Youden used it
for detecting labs with unusual testing results when
two samples are tested in n different lab. Youden plot
is a special case of the bivariate control chart, and the
idea behind is the Principal Component Analysis.
In this module, we will discuss Principal Component
Analysis and how it is applied to construct bivariate
control charts and discuss the interpretations of the
1
plot.
There are three types of laboratory testing, where twosample plots can be applied:
•
A group of labs participate in testing two similar materials using
the same method. It is important to identify labs, if any, which
perform extremely different from the rest in either one or both
materials. A two-sample plot is used to detect the extreme labs.
This is the case studied by Youden (1954), and later extended by
Mandel and Lashof (1974).
2
2. A participated lab tests a variety of material using the same method as
the standard lab, and the testing results are compared to the
standardized lab to study if any particular test is extremely different
from the standard lab. A paired bivariate control chart is used to
determine how good the participated lab is when compared with the
standard lab. Tracy, Young and Mason (1995) used a similar
approach, bivariate control chart for studying paired measurements in
quality control.
3. A lab test similar two or more materials using a standard procedure on
a regular bases for many days. A bivariate or multivariate control
chart is used for process control. In recent decades, multivariate
control charts have been developed for process control of two or
more quality characteristics simultaneously. Similar situation may
occur in laboratory testing process control.
3
The classical Youden’s Plot for Two-sample cases
The following inter-laboratory study about the percent of
insoluble residue in cement reported by 29 Laboratories
Row
%residueA
%residueB
Row
%residueA
%residueB
1
0.31
0.22
16
0.25
0.11
2
0.08
0.12
17
0.26
0.17
3
0.24
0.14
18
0.26
0.18
4
0.14
0.07
19
0.12
0.05
5*
0.52
0.37
20
0.29
0.14
6
0.38
0.19
21
0.22
0.11
7
0.22
0.14
22
0.13
0.10
8*
0.46
0.23
23*
0.56
0.42
9
0.26
0.05
24
0.30
0.30
10
0.28
0.14
25
0.24
0.06
11
0.10
0.18
26*
0.25
0.35
12
0.20
0.09
27
0.24
0.09
13
0.26
0.10
28
0.28
0.23
14
0.28
0.14
29
0.14
0.10
15
0.25
0.13
4
Using One sample Box-Plot, we have
Side-By-Side Box Plot for Material A abd B Insoluable Residue in Cement
0.6
%Insoluable residue
0.5
0.56
0.52
0.46
0.42
0.37
0.4
0.3
0.25
0.2
0.14
0.1
0.08
0.0
%residueA
%residueB
Quick check suggests Lab #5, 8, and 23 are likely outliers. They are excluded in the twosample plot.
5
(from NIST website)
The vertical line passes through the median of the X-variable.
The Horizontal line passed through the median of the Y-variable.
The center of the graph is the intersection of the two median lines. This
intersection is called the Manhattan Median.
6
What is the 45o line for? What is the practical meaning of this line?
Why and How Youden Plot works?
Under the absolute perfect condition, the pair of data should be all equal when testing the
same material twice using the same testing procedure. However, it is never the case. In fact,
there are two major components of uncertainties: The systematic error and random error.
From common experience, systematic error of a given lab should be the same or very close,
in theory, when performing the test under the same condition using the same procedure. And
that each lab is different from others. This suggests that, if there is no random or unexpected
error, under the perfect condition, but allowing lab differences, then, the pair of data should
be equal (that x=y) for a given lab, but, may be different for different labs. Translating this
situation, the pairs of data are located at the 45o line passing through the Manhattan Median.
Therefore, the distance from (x,y) pairs on the 45o line is the component of the systematic
error for the given lab.
How can we identify the random error component from the plot?
In reality, a pair of data points are scattered on the graph. Only very few pairs will be located
on the 45o line, and even so, x and y are really the same.
Consider a pair of data point which is away from center and are not on the 45o line. The
distance from the point (x,y) to the center is the total error:
Total error = Systematic error + Random error
7
(-7, -2)
Random Error Component(RaE): the
distance from (-2,-7) to (-4.5, -4.5)
The total error =
Systematic Error (SyE) component:
the distance from (-4.5, 4.5) to (0,0)
(RaE) 2  ( SyE ) 2
8
The distance is
allowed to be
negative here
for reflecting
the quadrant.
The magnitude
is positive.
(x3 , y3)
How to determine (x3 , y3)?
For simplicity, assume center is (0,0). Then we see that x3 = (x2 + y2)/2. That is the coordinate for the
systematic component is at ((x2 + y2)/2 , (x2 + y2)/2 ). Hence, systematic error component is the distance
to from this point to (0,0):
2[( x  y ) / 2]2 | x  y | / 2
2
2
2
2
If the center is at (x1 , y1), then the Systematic Error component is
(| ( x2  x1 )  ( y2  y1 ) |) / 2
9
10
11
NOTE: p in the formula is the Random Component, in our notation, we use: RaE.
If there is no systematic error, the uncertainty should involve only random error component. And
therefore, the standard deviation of the random error components is an estimate of the random
error variation. 95% coverage region is given by the radius = 2.45(s). The value 2.45 is based on
12
the assumption the (X,Y) follows Bivariate Normal and are independent.
Youden’s Calculation of the standard deviation and the radius
Youden plot is applied to identify labs with unusual high systematic error as well as
labs have unusually high random errors.
•When a lab has both large total error, the data point will be far away from the center.
They are the first group of labs that should be closely investigated.
•It may happen that a a material is less sensitive to different environment than the
other. When this happens, the data points will tend to be parallel to X- axis or Y-axis,
with a large variation due to a material. This can also be quickly identified.
•When a lab have an unusually high systematic error, but small random component, it
will scattered along the 45o line. And that more data points are in upper right quadrant
and lower left quadrant.
•When a lab has a large random component, it will be far from the 45 o line, that is,
will have higher probability to be in the upper-left and lower-right quadrants.
13
In order to identify these unusual labs, labs with large systematic error, Youden
suggested to draw a circle centered at the Manhattan median with the radius
being a multiple of the variation due to random error.
The variation due to the random errors is the standard deviation of the random errors of labs,
which is obtained by:
( RaE )2
s

i
n 1
According to Youden (1959),
a 95% coverage probability of a circle is given by the circle with radius = 2.448(s)
A 99% coverage probability of a circle is given the cirvle with radius = 3.035(s)
The relationship between the coverage probability and the multiple b is given in Youden’s
original paper in the Journal of Industrial Quality Control, 1959, p. 24-28:
Coverage probability = 1-exp(-b2/2)
14
An alternative approach to compute the systematic error components, random
error components and the corresponding variations.
For a given (x,y) data point, its corresponding
coordinate of the systematic component
is ( (x+y)/2, (x+y)/2), and the difference
between (x,y) and ((x+y)/2, (x+y)/2)
along the X-axis is
X – (x+Y)/2 = (x-y)/2
The difference along the Y-axis = (y-x)/2
This suggests that the random error for the data
point (x,y) is (x-y)/2.
For each lab,
1.
compute the systematic component
(-7, 2)
[ (x+y)) – (x0+y0)]/2, where (x0 , y0) is
the median origin.
2.
Compute the random error component:
(x-y)/2
3.
Compute variance and s.d. for each component.
sB measures the variation of between-lab systematic errors.
se measures the variation due to random errors.
4. To construct the circle with 95% coverage probability, the radius = 2.45(s e)
15
16
17
Activity: To construct a classical Youden’s Plot for Two-sample cases
The following inter-laboratory study about the percent of insoluble
residue in cement reported by 29 Laboratories
Row
%residueA
%residueB
Row
%residueA
%residueB
1
0.31
0.22
16
0.25
0.11
2
0.08
0.12
17
0.26
0.17
3
0.24
0.14
18
0.26
0.18
4
0.14
0.07
19
0.12
0.05
5*
0.52
0.37
20
0.29
0.14
6
0.38
0.19
21
0.22
0.11
7
0.22
0.14
22
0.13
0.10
8*
0.46
0.23
23*
0.56
0.42
9
0.26
0.05
24
0.30
0.30
10
0.28
0.14
25
0.24
0.06
11
0.10
0.18
26*
0.25
0.35
12
0.20
0.09
27
0.24
0.09
13
0.26
0.10
28
0.28
0.23
14
0.28
0.14
29
0.14
0.10
15
0.25
0.13
18
Variable
N
Mean
Median
StDev
Material A
25
0.2292
0.2500
0.0731
Material B
25
0.1340
0.1300
0.0597
Variable
N
N*
Mean
(A-B)/2
25
4
0.04760
(A+B)/2
25
4
0.1816
StDev
0.03473
0.0570
The radius of the circle for the 95% coverage region is .03473 x 2.45 = .085
Hence the circle has the form:
(x-.25)2 + (y-.13)2 = (.85)2
Hands-on activities using Mandel and Lashof’s data
19
Principal Component Analysis -The concept Behind
Two-Sample Plots
The idea behind the two-sample plot is the principal components and
bivariate normal distribution. The following scatter plot illustrate the
principal components for bivariate case.
Scatter plot with Principal Compoment Coordinates
X2, Tensile Strength-E
100
Y2, Minor Principal
Component
Y1, Major Principal
Component
90
(79.81, 92.81)
80
70
80
X1, Tensile Strength-H
90
100
20
The scatter plot is from an inter-laboratory study in Mandel &
Lashof (1974). The data are tensile strength of rubbers using two
different materials, and testing in 16 laboratories.
Laboratory
Strength-E (X2)
Strength-H (X1)
1
94
80
2
103
82
3
94
77
4
99
83
5
97
86
6
91
76
7
91
81
8
102
98
9
98
83
10
91
81
11
93
82
12
82
69
13
93
81
14
82
72
15
83
73
16
92
73
21
Y1 and Y2 are new coordinates.
•Y1 represents the direction where the data values have the largest uncertainty.
•Y2 is perpendicular to Y1.
•They intersect at the sample averages
( x1 , x2 ) = (79.81, 92.81) .
To find Y1 and Y2, we need to make transformation from X1 and X2. To simplify the
discussion, we move the origin to ( x1 , x2 ) and redefine the (X1,X2) coordinate as
x1 = X1 -
x1
, x2 = X2 -
x2
, so that the origin is (0,0).
The relationship is illustrated in the following graph. We would like to present the data
of a given lab, p = (x1,x2) in terms of p = (y1,y2). From basic geometry relations, we
see:
x2
p
y1 = (cosq) x1 + (sinq) x2
y1
y2
y2 = (-sinq) x1 + (cosq) x2
The angle q is determined so
that the observations along
the Y1 axis has the largest
variability. But HOW?
q
x1
22
The transformation from (x1,x2) to (y1,y2) results several
nice properties
1. The variability along y1 is largest.
2. Y1 and y2 are uncorrelated, that is, orthogonal.
3. The confidence region based on (y1,y2) is easy to construct, and
provide useful interpretations of the two sample plots.
Questions remain unanswered are
1. How to determine the angle q so that the variability of
observations along the y1 axis is maximized?
2. How to construct the ellipse for confidence region with different
levels of confidences?
3. How to interpret the two-sample plots?
23
How to determine the y1 and y2 axis so that the variability
of observations along the y1 axis is maximized and y2 is
orthogonal to y1?
Rewrite the linear relation between (y1,y2) and (x1,x2) in matrix notation:
y1 = (cosq) x1 + (sinq) x2
y2 = (-sinq) x1 + (cosq) x2
 y1   (cosq ) x1  (sin q ) x2   cosq
Y  
 

 y2  ( sin q ) x1  (cosq ) x2   sin q
sin q   x1   A1' X 
  '   AX



cosq   x2   A2 X 
NOTE: X is bivariate , so is Y, and
V(X) =
Cov( x1 , x2 )  ,

Cov( x , x )

V
(
x
)

1
2
2

V ( x1 )
V(Y) = A’V(X)A =
l1 0 
0 l 

2
l1 and l2 are called the eigen values. Which are the solutions of V ( X )  l I  0
And, V(Y1) = l1 , V(Y2) = l2, Correlation between Y1 and Y2 = 0.
24
l1 and l2 are called the eigen values. Which are the solutions of
And, V(Y1) = l1 , V(Y2) = l2, Correlation between Y1 and Y2 = 0.
 2 rs 1s 2 

s 1  s 2 , when s1  s2 , q = 45o
The angle q = (.5) arctan  2
2  if
 s1  s 2 
 l1  s 12 

 arctan 
 rs 1s 2 
Note the angle depends on the correlation between X1 and X2 , as well as, on the
variances of X1 and X2, respectively.
• When r is close to zero, the angle is also close to zero. If V(X1) and V(X2) are
close, then, the scatter plots are scattered like a circle. That is, there is no clear
major principal component.
•When r is close to zero and V(X1) is much larger than V(X2), then, the angle
will be close to zero, and the data points are likely to be parallel to the X-axis. On
the other hand, if V(X1) is much smaller than V(X2), the angle will be close to
900, and the data points will be more likely parallel to the Y-axis.
25
Consider, now, we actually observe the following two sample data:
The sample means are given by
 x11 x21 
x x 
 12 22 
 x13 x23 




 x1n x2 n 
 x1 
x 
 2
The sample variance-covariance matrix is given by
Vˆ ( X ) 
 s12

 rs1s2
rs1s2 
2 
s2 
r is the Pearson’s correlation coefficient, and S2 is the
sample variance. S is the sample standard deviation.
V(Y) is the solution of
Vˆ ( X )  l I  0
The solutions for l are given by
( s12  s22 )  ( s12  s22 ) 2  4(1  r 2 ) s12 s22
2
NOTE: V(Y1) + V(Y2) = l1l2  s12 + s22 = V(X1) + V(X2)
26
Using the sample data, the angle is estimated by
 2rs1s2 

q  (.5) arctan  2
2 
 s1  s2 
s1  s2
 l1  s12 

 arctan 
 rs1s2 
Case Example
We know use the Tensile strength data to demonstration the computation of
principal components and related sample information.
For the Tensile Strength Example, X1 is the material H and X2 is the material E.
The number of labs, n= 16.
Using Minitab, we can obtain the following information:
27
Variance-Covariances Matrix:
H
E
H
46.4292
E
35.0292
2

s
35.0292
 Vˆ ( X )   1
40.9625
 rs1s2
rs1s2 
2 
s2 
Correlations = .8031
Principal Component Analysis: Tensile Strength-H, Tensile Strength-E
Eigen values are: 78.831 and 8.560, the solutions of
Vˆ ( X )  l I  0
Linear Coefficients between (Y1, Y2) and (X1,X2)
Variable
Y1
Y2
H
0.734
-0.679
E
0.679
0.734
These are the coefficients for
y1 = (cosq) x1 + (sinq) x2
y2 = (-sinq) x1 + (cosq) x2
The angle q =
 l1  s12 

 arctan 
 rs1s2 
= arctan[(78.831-46.4292)/35.0292] = 42.770
28
The sample means from the sample data are
Variable
N
Mean
H
16
79.81
E
16
92.81
In terms of (Y1, Y2), the means are
Variable
N
Mean
Y1
16
121.61
Y2
16
13.937
Two-sample Scatter PLot for the Tensile Strength Data
Two sample scatter plot is
Tensile Strength-E
100
90
80
70
80
90
Tensile Strength-H
100
29
Confidence Region for two-sample Plots
Each of the X1 and X2 can be treated as a univariate variable. In
most cases, we consider each variable follows a normal
distribution. The rules we introduced for one variable case do
assume that each variable follows a normal distribution. We can
apply outlier detecting methods for each variable. When we
consider two variables simultaneously, [X1, X2] are bivariate,
and the distribution for [X1,X2] is taken to be bivariate normal
distribution. Because of this extension, we are able to construct
ellipses that works similar to empirical rule. We can construct
several ellipses so that the probability of having the pair of data
inside the ellipse is .95 or .99 and so on.
The construction of the ellipse can be simplified when we use the
principal components as described above. And the interpretations
based on the principal components are very useful.
30
Bivariate Normal Distribution and it’s application in
two-sample plots
Because the ellipse region relies on bivariate normal
distribution, we briefly give an introduction of the bivariate
normal in the following.
The bivariate normal distribution of X1 and X2 has the form:
f(x1, x2) = (2ps1s2)1(1r2)1/2 exp(-Q/2)
Where
1 ( X 1  1 ) 2 2 r ( X 1  1 )( X 2   2 ) ( X 2   2 ) 2
Q
[


]
2
2
2
1 r
s1
s 1s 2
s2
We usually use the notation :
 1 
 
 2
 X1 
 1   s 12
 X  ~ BN (  , 
 2
 2   rs 1s
 s 12
is the mean vector. 
 rs 1s
rs 1s 2 
2 
s2 
rs 1s 2 
)
2 
s2 
is the variance-covariance
matrix.
31
 1 
A ellipse Q = c, c>0 centered at   can then be created in the (X1, X2) coordinate.
 2 
.
The shape and the orientation of the ellipse is determined by the values of s12, s22
and r, and its size is determined by the choice of the constant c. The choice of the
constant c can be determined based on the level of confidence using the bivariate
normal distribution.
When we collect two samples, the sample data provide sample means and sample
variance-covariance matrix.
The sample means are given by
 x1 
x 
 2
2

s1
ˆ
V (X )  
 rs1s2
The sample variance-covariance matrix is given by
rs1s2 
2 
s2 
r is the Pearson’s correlation coefficient, and S2 is the sample variance. S is the
sample standard deviation.
32
When replacing the population parameters by the corresponding sample
information, we obtain the Hotelling’s T2:
1 ( X 1  x1 )2 2r ( X 1  x1 )( X 2  x2 ) ( X 2  x2 ) 2
T 
[


]
2
2
2
1 r
s1
s1s2
s2
2
T2 is distributed as
2(n 2  1)
F( 2,n2)
n(n  2)
Which is a multiple of an F-distribution. The ellipse region is now given by
T2 = c*
The constant c* is determined using the F-distribution described above. The
corresponding 100(1-a)% percentile is from the F distribution with degrees of
freedom (2, n-2).
For example, when participated labs, n = 16, then a 95% percentile of
F(2, n-2) = F(2,14) = 3.74. Therefore, c* = 2(255)/224 x (3.74) = 8.515
A 95% ellipse region can then be constructed using T2 = 8.515
33
How to construct a 100(1-a)% region in a two-sample plot –
the Youden’s Plot?
Under the (X1,X2) coordinate, a general form of an ellipse is given by :
( X 1  x1 ) 2 2( X 1  x1 )( X 2  x2 ) ( X 2  x2 )


1
a11
a12
a22
Under the principal component coordinate and move the center to (0,0),
the ellipse has the simple form in terms of (y1, y2):
2
1
2
2
y
y

1
b11
b22
The distance from the center, (0,0) to the ellipse curve along the
major principal component is is
b11 , and the distance from the
center (0,0) to the curve along the minor principal component is b22
34
When matching the mathematical form with the statistical form of confidence
ellipse region: T2 = c*, it is seen that, T2 = c* can be expressed in terms of the
general form above.
Confidence ellipse region in terms of the principal components has a simple form:
y
y12
y22
2

1
l1c*
l2 c *
(0, l2 c* )
( l1c* , 0)
( l1c* , 0)
y1
( (l1  l2 )c* , 0)
(0,  l2 c* )
( (l1  l2 )c* , 0)
The sum of the distance from foci to curve is constant, and equal to 2 l1c*
Hence, for any point on the ellipse (y1,y2), when y1 is given, y2 can be
computed by
l2 2
y2   c l2  y1
l1
*
35
The sum of the distance from foci to curve is constant, and equal to 2 l1c*
Hence, for any point on the ellipse (y1,y2), when y1 is given, y2 can be
computed by
l2 2
y2   c l2  y1
l1
*
And, the original scale in terms of (x1,x2) at the center (0,0) is given by
x1 = (cosq)y1 – (sinq)y2
x2 = (sinq)y1 + (conq)y2
Shifting the center from (0,0) back to
( x1 , x2 ) , we have
X 1  x1  (cos q ) y1  (sin q ) y2
X 2  x2  (sin q ) y1  (cos q ) y2
where q = arctan[(l1  s12 ) /(rs1s2 )
36
We are now ready to construct a 100(1-a)% confidence ellipse region for the
Tensile Strength data.
In terms of the principal component coordinate (y1, y2) at the center = (0,0).
A 95% coverage region has the region covered by the ellipse
y12
y22

1
*
*
l1c l2 c
Which is given by
(Y1 )2
(Y2 ) 2

1
78.831 8.515 8.56  8.515
l1c
*
= 25.908
l2 c
*
= 8.537
(l1  l2 )c*
= 24.461
The vertices are (-25.908, 0) and (25.908, 0)
The foci are (-24.461, 0) and (24.461, 0)
The Vertical axis are intersected at (0, 8.537) and (0, -8.537)
The sum of the distances from two foci to the curve is 51.816
37
Score Plot of Tensile-Tensile
Second Component
20
15
10
110
120
130
140
First Component
Score Plot of Tensile-Tensile
Second Component
20
15
10
110
120
First Component
130
140
38
To construct the ellipse curve, the point on the curve is given by:
l2 2
( y1 , y2 )  ( y1 ,  c l2  y1 )  ( y1 ,  72.8884  .1086 y12 )
l1
*
The original scale of the ellipse curve on the (X1, X2) is given by
X 1  x1  (cosq ) y1  (sin q ) y2  79.81  .734 y1  .679 y2
X 2  x2  (sin q ) y1  (cos q ) y2  92.81  .679 y1  .734 y2
Using the original scale of (X1, X2), the 95% coverage region is
( X 1  79.91)2 ( X 1  79.81)( X 2  92.81) ( X 2  92.81) 2


1
140.36
82.08
123.833
39
The two-sample plot with the 95% coverage probability for the
tensile strength data is given by:
Two-Sample PLot for the Tensile Strength Data
- Materal H Vs Material E
Raw Data
110
95% Ellipse
Material E
100
90
80
60
70
80
90
100
Material H
We notice that one lab is outside of the 95% coverage region.
This is Lab 8 with two-sample results (98,102)
40
Marginal Plots as an supplement for detecting outliers in
two-sample case
The two-sample plot (Youden’s plot) takes the correlation between
two-samples into account and are assumed to follow a bivariate
normal distribution. This is a useful plot for detecting outliers. In
addition to the two-sample plot, one have introduced box-plot for each
sample. A two-dimensional marginal box-plot is a good addition for
detecting outliers when we sue two-sample Youden’s plot. Using the
Tensilt strength as an example, we use Minitab to construct the
marginal box-plot in the following.
41
Scatter Plot between Material H and E
with Marginal Box PLots
Tensile Strength-E
105
100
95
90
85
80
65
70
75
80
85
90
95
100
Tensile Strength-H
The horizontal Box plot is the box-plot for Tensile Strength of Material H. The
vertical Box plot id for Material E. It is noticed that there is a clear outlier when
testing Material H based on the marginal box plot. Lab 8 is the outlier lab when
testing Material H. A similar finding was obtained using Two-sample plot. 42
How to construct Two-sample Plot using Minitab?
Minitab does not have a procedure on the pull-down menu.
However, a Minitab macro program, similar to what we call a
subroutine, has been written for two-sample plots. I have added
this macro into the Minitab Macro collection. The name of the
macro is BCC (stands for Bivariate Control Chart). There are
some restrictions when applying this macro.
Preparation of data:
1. Enter sample One in C1, which will be on the X-axis.
2. Enter sample Two in C2, which will be on the Y-axis.
3. Obtain the critical F-value of the F-distribution using any
Statistics book or using Minitab: F(a, 2, n-2), where n is the
sample size with (1-a) being the coverage probability. For
example, a 95% coverage region gives a = .05.
43
For the Tensile Strength inter-laboratory study, the number of
participated labs = n = 16.
For, a 95% coverage region in the Two-sample plot, the critical
value is F(.05, 2,14) = 3.74 from an F-table. The two degrees of
freedom are ‘2’ for numerator d.f., and n-2 = 14 for the
denominator d.f..
4. Use Minitab to determine F(.05,2,14):
• Go to Calc, choose Probability Distributions, then select ‘F’.
• In the F-distribution Dialog box, click on ‘Inverse cumulative probability.
• Enter d.f. 2 and 14, respectively.
• Click on Input Constant and enter .95.
5. Make the Session window an active window. Then, go to Editor
Menu, and click on ‘Enable Commands’.
44
6. In the Session window, you should see ‘MTB>’ appears. We will
use Minitab commands to input the F(.05, 2,14) and run the
BCC Macro.
7. In the Session window, enter the following two commands next
to ‘MTB>’ :
MTB> let k11 = 3.74
MTB> %BCC
The ‘let’ statement defines a minitab constant: k11 = F(.05,2,14)
The ‘%’ before the macro name ‘BCC’ is to identify ‘BCC’ as a
minitab macro.
The results from the %BCC macro are a Two-sample plot with both
scatter plot and the 95% ellipse region, and some principal
component computations stored in the worksheet.
45
How to use Minitab to construct Marginal Box plot?
Marginal box plot is available in the Graph Menu.
1. Go to Graph, choose Marginal Plot.
2. In the Dialog box, enter Y-variable, and X-variable.
3. You can choose different types of marginal plots, including
Histogram, Dot-lot and Box plot. Choose the one you would
like to construct.
4. Click on the ‘Symbol’ selection, you can define the type of
symbols for the scatter plot.
46
How to use Minitab to conduct a Principal Component
Analysis?
In my discussion of constructing principal components, and angle, the
relationship between (X1, X2) and the principal components, I
have used Minitab to obtain these results for the Tensile Strength
example. Here is how Minitab does the analysis.
1. Go to Stat, choose Multivariate, then select Principal Components.
2. In the Dialog box, enter variables X1 and X2. Choose the Type of
Matrix to be Covariance.
3. Click on the ‘Graph’ , then select ‘Score plot for first two
components’.
4. Click on the ‘Storage’, you can choose to store the coefficients for
computing (y1,y2) using (x1,x2)(that is cosq and sinq), and store
the values of (Y1, Y2). For example, store Coefficients in C5 and
47
Store (Y1,Y2) inC6, C7.
How to Interpret Two-sample plots?
In a typical inter-laboratory testing study, the pairs of samples usually
are of different, but similar materials are sent out periodically to a
number of participating laboratories. All labs run a predetermined
number of replicate measurements on each sample for a number of
properties.
•For each property, the pair of two samples can be plotted as a Youden Plot or
more generally, a bivariate control chart.
•Generally speaking, the points on the plot fall within an ellipse region, which
may be defined as a 95% coverage region or 99%.
•The major axis (the major principal components) of the ellipse is approximately
45o.
•The length of the principal component ( 2 c*l1 based on previous discussion)
is related to the ‘between laboratory variability’, and the length of the second
principal component (
) to the ‘material-laboratory interaction’.
*
2 c l2
48
A General Statistical Model for interpreting the possible
source of variability
NOTE:
•
If the between-lab variability is not homogeneous for two samples, the major
principal component of the ellipse may not be 45o.
•
In addition, the pattern and interpretation of the plot of the plot heavily
depends on the model for the experiment and the source of variability in both
samples and their inter-relationships.
•
A general statistical model that includes important sources of variability is
necessary in order to provide adequate interpretation of Youden’s two-sample
plot under different assumptions of the sources of variability.
•
Youden emphasizes that the two samples should be similar and reasonably
close in the measurement of the property evaluated.However, this may not
hold in inter-laboratory studies. A more general situation is to consider the
following possibilities.
49
1.
Material A and B have different population averages, a and b.
2.
Each lab has a systematic effect, denoted by LiA, and LiB, respectively.
3.
The property is measured with an unexplainable random error from each
laboratory, denoted by eiA and eiB, respectively.
The observed measurement from each lab , denoted by XiA and XiB can be expressed
by the statistical model:
XiA = a + LiA + eiA , i = 1,2, 3, ----, n
XiB = b + LiB + eiB , I = 1,2,3, ---, n
NOTE that the systematic lab effects is assumed dependent only on the population
of the material, not on the material itself. In other words, for different material
with the same population average, the systematic lab effect remain the same,
even though the material may be different.
It is important to realize that the interpretation of Youden plot depends critically on
the specific assumptions on LiA, LiB, eiA and eiB.
When data are observed, a and b are estimated by the corresponding sample
averages: x and x , and the model is center to (0,0):
50
A
B
xiA = XiA -
x A = LiA + eiA
xiB = XiB - x B = LiB + eiB
Expressing the variance-covariance of (x1, x2) in terms of the variance
components of the systematic effects and random error, we obtain,
under the assumption that systematic effects are independent of the
random error, and the random errors are independent from lab to lab:
 s
ˆ
V (X )  
2
1
 rs1s2
rs1s2 
2 
s2 
=
 sL2A  sR2 A

 rsLA sLB
rsLA sLB 
2
2 
sLB  sRB 
We can obtain the variance-covariance estimates from the data, and
then, to estimate the variance component of each source of variation.
However, some assumptions are needed in order for us to do so, since
we have three equations with five unknowns.
51
When we impose any assumptions, it is important to keep in mind that these
assumptions should reflect the properties of populations underline the experiments.
Considering an inter-laboratory experiment using two materials, A and B which can
be described in terms of the model
XiA = a + LiA + eiA , i = 1,2, 3, ----, n
XiB = b + LiB + eiB , I = 1,2,3, ---, n
•
It is often that the averages of two materials are not the same, that is a  b
•
The random error from each sample may have different variability (not
homogeneous), that is, variance of random error for material A may not be the
same as that for material B.
•
There are situations where we can demonstrate the equal variance for the random
errors of two material, but the systematic biases between two material may not be
the same for different labs.
52
Based on the above scenarios, the following three types of
model can be considered, and the variance components
can be estimated for each type of model.
•
One may consider the situation by assuming the systematic
effects, the laboratory effect, are the same because of the
standardized procedure is implemented, that is, LiA = LiB = L
(additive model)
•
A less restrictive assumption would be: The laboratory
biases are in proportional relationships as so that LiA=
K(LiB) (concurrent model).
•
Foe each of above models, if we can demonstrate
homogeneity of variances of random errors of two materials,
we can assume: V(eiAa)  V(eiB) for all labs, (homogeneous
variance).
53
Under the additive model: we obtain the following variance
component estimate for each source of variation:
s12  sL2  sR2 A , s22  sL2  sR2B , rs1s2  sL2
Therefore, the variance components are :
sL2  rs1s2 , sR2 A  s12  rs1s2 , sR2 B  s22  rs1s2
Under the concurrent model, the proportion is k = LA/LB. The variance
component estimates are:
s12  k 2 sL2B  sR2 A , s22  sL2B  sR2B , rs1s2  ksL2B
Therefore, the variance components are :
r
r
2
2
2
2
2
sLB  s1s2 , sRA  s1  krs1s2 ) , sRB  s 2  s1s2
k
k
54
Under the concurrent model, the proportion ‘k’ is estimated by
(x

k
iA
/ xiB )
n
The average ratio of material A and material B.
.
The variance component estimates are:
s12  k 2 sL2B  sR2 A , s22  sL2B  sR2B , rs1s2  ksL2B
Therefore, the variance components are :
r
r
2
2
2
sLB  s1s2 , sRA  s1 ( s1  krs2 ) , sRB  s2 ( s2  s1 )
k
k
55
Under the homogeneous variance for additive model,
the random error variance are assumed equal, that is:
sR2 A  sR2B  sR2
And the variance component for each source is estimated by:
s12  sL2  sR2
, s22  sL2  sR2 , rs1s2  sL2
Therefore, we can combine two sample data to estimate
the variance components for more precised estimates.
The variance components are :
sL2  rs1s2 , sR2  ( s12  s22 ) / 2  rs1s2 ,
which is the combined sample variance minus the covariance of two samples.
56
Under the homogeneous variance for concurrent
model, the variance components are:
s12  k 2 sL2B  sR2 , s22  sL2B  sR2 , rs1s2  ksL2B
Therefore, the variance components are :
sL2B
2
2
2
r
s

s
k
1
2
1
2
 s1s2 , s R 

rs1s2
k
2
2k
57
The principal components of the ellipse region for the general model
is given by the following:
l1  ( n  1)( sL2  sL2  sR2  sR2  M ) / 2
A
B
A
B
l1  ( n  1)( sL2  sL2  sR2  sR2  M ) / 2
A
M 
B
A
B
[( sL2A  sL2B )  ( sR2 A  sR2B )]2  4r 2 sL2A sL2B
tan q  [(sL2A  sL2B )  (sR2A  sR2B )  M ]/(2rsLA sLB )
Case One: Two samples are similar, and the property is
measured under the same standard procedure.
This is the most restricted situation. Under this situation, we are assuming:
xA  xB
1.
The averages are approximately the same, that is
2.
The lab systematic effects are equal for two samples, that is,
LiA = LiB = Li
3.
The random error components follow the same normal distribution. and
independent.
58
Under this case, the variance components are simplified to:
sL2A  sL2B  sL2 and sR2A  sR2B  sR2
the principal components are reduced to
l1  (n  1)(2sL2  sR2 )
l2  (n  1)( sR2 )
and tanq  1, that is, the angle q  45o
Under this situation, the Youden plot can be interpreted by:
1.
The minor principal component, l2 , measures the random error.
2.
The major principal component, l1, measures the systematic effect, if the
random error is relatively small.
3.
The ratio l1/ l2 measures the impact of the systematic effect with respect to the
random error.
59
Case Two: Under this situation, we are assuming:
xA  xB
1.
The averages are approximately, that is
2.
The lab systematic effects are equal for two samples, that is,
LiA = LiB = Li
3.
The random error components are independent, but follow two different
normal distributions with different , but, small variances relative to s 2 .
L
Under these conditions, we have :
l1  2(n  1) sL2  (n  1)( sR2  sR2 ) / 2
A
B
l2  (n  1)( sR2  sR2 ) / 2
A
B
tan q  1, that is the angle of th emajor principal component is approximat ely 45o.
The same interpretation of Case One holds, here.
60
Case Three: Under this situation, we are assuming:
xA  xB
1.
The averages are approximatelyequal, that is
2.
The lab systematic effects are equal for two samples, that is,
LiA = LiB = Li
3.
The random error components are independent, but follow two different
normal distributions with different variances: s 2 and s 2
RA
RB
Under these conditions, we have:
l1  (n  1)[ 2 s L2  ( sR2  sR2 )  M ] / 2
A
B
l2  (n  1)[ 2 s L2  ( s R2  sR2 )  M ] / 2
A
B
tan q  [( s R2 A  s R2 B )  M ] / 2 s L2
where, M  (2 s L2 ) 2  ( s R2 A  s R2 B ) 2
61
Case Four: When the model is a concurrent model
The systematic bias effects between material A and B is a retio
relationship: LiA= k(LiB). The observed two-sample data are (XiA,
XiB).
•
An approach to determine if the model is concurrent or not
1. Compute the ratio of the variable with larger measurements to
the smaller variable, so that the ratio is larger than one.
Suppose XiA > XiB . Compute ki = XiA/ XiB for each lab.
2. Compute the average of ki : k   ki
n
3. If there is a consistent pattern of a similar ratios , then, the
model is considered as concurrent model. (Note: there is no
specific rule available.)
62
Interpretation of Youden’s Plot for case Three
For this case, the two variances of random errors of two materials
are different, the angle of the ellipse depending on the
magnitude of the variances.
2
2
2
s
is
very
small
relative
to
s
and
s
1. If RA
L
RB
Then, tanq will be larger than one, and the angle is larger than
2
2
o
s
is
relatively
much
larger
tha
n
s
45 . When RB
, then, the
L
major principal component will approach 90o, and it will be
almost vertical.
2
2
2
2. On the other hand, if sRB is very small relative to s L and sRA
Then, Then, tanq will be smaller than one, and the angle is
2
2
o
s
is
relatively
much
larger
tha
n
s
smaller than 45 . When RA
L
then, the major principal component will approach 0o, and it
will be almost horizontal.
63
,
Transforming concurrent model to additive model
And analyzing the concurrent model using an additive
model based on the transformed data.
The complexity of concurrent model suggests for a transformation
from concurrent model to additive model is preferred.
•
Suppose XiA > XiB , and the average of ki is computed: k
•
Compute a new variable: X’iA = XiA/ k
•
Analyze the two-sample data using the additive model approach
based on (X’iA, XiB).
The analysis and the interpretation of Youden plots can be extended
to this transformed two-sample data.
64
Examples for the interpretation of Youden’s Plot
Consider the Tensile Strength Data. There are three rubber material,
E,G and H. The data are: Row Laboratory
E
G
H
1
1
94
45
80
2
2
103
43
82
3
3
94
42
77
4
4
99
50
83
5
5
97
47
86
6
6
91
66
76
7
7
91
96
81
8
8
102
128
98
9
9
98
73
83
10
10
91
54
81
11
11
93
81
82
12
12
82
45
69
13
13
93
127
81
14
14
82
150
72
15
15
83
70
73
16
16
92
89
73
65
We conducted the (H,E) Youden plot for (H,E) before. In the following
we will analyze the Youden plot using the variance components of an
additive model.
1. A quick check of the data, the ratio between each pair of material is a
little over one. Therefore, we will consider the additive model for this
data.
2. Using Minitab, we obtain:
X 1  79.8, X 2  92.8
l1  78.832, l2  8.56, the angle of the major principal component  42o
46.4292 35.0292
s12 rs1s2

2
40.9625
s2
3. The variance components are:
sL2  35.0292, s 2R A  46.4292  35.0292  11.40
sR2B  40.9625  35.0292  5.93
Standard deviations are : s L  5.92, s R A  3.38, s RB  2.44
66
Based on these information, the following conclusions are obtained:
1.
The systematic effect dominates random errors. This indicates there are large
laboratory differences.
2.
The random errors from both material are small and about the same.
3.
To see if there is any outlier labs, a 95% ellipse region is constructed, and showed
one serious deficient lab. The two-sample plot with the 95% coverage probability for the
tensile strength data is given by:
Two-Sample PLot for the Tensile Strength Data
- Materal H Vs Material E
4. The linear pattern indicates
Raw Data
110
95% Ellipse
100
Material E
the systematic error is large.
Since for a given, when
comparing with others, a linear
pattern suggests the data from
both material are either both
high or both low, which shows
the existence of large
systematic errors between labs.
90
80
60
70
80
90
100
Material H
We notice that one lab is outside of the 95% coverage region.
This is Lab 8 with two-sample results (98,102)
67
Youden plot for Material H and Material G
Results of descriptive summary and Principal Component analysis are
given ine following
Variable
N
Mean
Median
TrMean
StDev
SE Mean
H
16
79.81
81.00
79.29
6.81
1.70
G
16
75.38
68.00
72.43
34.33
8.58
Eigenvalue
1179.8
45.4
Proportion
0.963
0.037
Covariances: H, G
H
Variable
PC1
PC2
H
0.030
1.000
G
1.000
-0.030
H
46.4292
G
33.8750
G
1178.7833
Cosq = .03, hence the angle of the major principal
component = 88o
The ratios of H to G are varied a lot. Therefore, concurrent model is
not appropriate. We apply additive model for this data.
68
The variance components based on an additive model are:
sL2  33.875, s 2R A  46.4292  33.875  12.55
sR2B  1178.78  33.875  1144.90
Standard deviations are : s L  5.8, s R A  3.5, s RB  33.8
Scatter plot of material H and G with proper scale
G
200
Scatter plot indicates
a huge variation
among G material
100
0
0
100
200
H
69
Two Sample Youden Plot for Material H (X) Vs Material G (Y)
200
Data
95%
Ellipse
Material G
Marginal Plot for Material H and G
100
150
130
110
G
0
90
70
60
70
80
90
100
Material H
50
30
65
70
75
80
85
90
95
100
H
Youden plot indicates an
potential outlier labs, Lab#8 with
data (98,128) and Lab#14 with
data (72,150).
There is a clear outlier lab for material
H. Material G is highly skewed and
large uncertainty among labs.
70
Conclusions:
1.
Scatter plot indicates a huge variation among G testing material.
2.
The major principal component is at 88o, almost vertical. This suggests
labs testing material G are very inconsistent.
3.
Further variance decomposition shows the the random error variance
extremely dominates the uncertainty component among labs for material
G.
4.
The systematic bias effect is small. The random error for material H is
also small.
5.
There is an outlier lab: Lab #8 according to the 95% ellipse region.
6.
Marginal box-plot indicates a clear outlier lab in material H. For material
G, the variation is huge and stretched from 43 to 150. At least two labs are
much larger than others. This result also indicates that the material G is
much more sensitive to different environment than material H. Some
extremely high strength deserve further investigation. In particular, one
should try to locate special causes for such a high strength, and try to
reproduce the results, if possible.
71
The issue about computation
In this module, we have used a variety of procedures in Minitab, I
order to accomplish our analysis. The procedures we applied include:
•Graphs:
Plot, Marginal Plot, Box Plot, Time Series Plot, Histogram,
Probability Plot.
•Statistical Procedures:
Descriptive statistics, Covariance, Correlation, Normality test,
Regression, Individual control chart, Principal Component
Analysis, and Bivariate control chart.
The combination of applying the right tool at the right time is the most
important, yet, most difficult. It will take a good understanding of the
statistical concepts as well as proficient usage of Minitab.
72
Hands-On Activity#1:
Repeat the same analysis of the Tensile Data.
Hands-On Activity #2:
Conduct a thorough investigation of of the TAPPI
Inter-laboratory study.
Hands-On activity #3:
Bring your own two-sample data and conduct a
thorough investigation of your data.
73

Module Six: Outlier Detection for Two Sample Case, Youden Plot

Transcript Module Six: Outlier Detection for Two Sample Case, Youden Plot

Directory