Rejection of Suspect Data

Statistical Chemistry
Yudonob’s lecture notes
The Two Types of Errors are
•Determinate Errors
•Random Errors
Determinate Errors – systematic errors that cause
a measurement to always be too high or too low
and which can be traced to an identifiable source.
Examples include
• Use of an uncalibrated or faulty tool or
instrument
• Use of wrong values, such as molar mass,
conversion factor, etc.
A good way to detect the existence of
determinate errors is to use different methods of
analyzing the same material.
Random Errors – errors that are random in nature.
They occur when a calibrated instrument is correctly
used to its most sensitive degree of measurement.
For example, using the analytical balance (sensitivity
of 0.1 mg), you see variations in the last digit when
re-weighing the same object. In this chapter we will
focus our attention on ways to evaluate random
errors.
o All measurements have some random errors.
No measurement is free of experimental error.
Statistics allows us to accept conclusions that
have a high probability of being correct and to
reject conclusions that have a low probability.
o Statistics applies only to random errors; the
analyst should eliminate all determinate errors
before making sensitive measurements.
Random errors follow a Gaussian distribution of
values about the central measurement.
A Gaussian distribution is characterized by the mean
value and a standard deviation
Mean value or average – a measure of central
tendency
$$x_{\text{mean}} = \frac{\sum_i x_i}{n}$$
where i indexes each individual measurement, Σ
means the summation, and n is the number of
measurements in that set of data.
Standard deviation – a measure of the width
of the distribution about the central value.
$$s = \sqrt{\frac{\sum_i (x_i - x_{\text{mean}})^2}{n - 1}}$$
The standard deviation defined above is for a
limited or small set of data; for a large set of data
the standard deviation is indicated by σ and is
defined as
$$\sigma = \sqrt{\frac{\sum_i (x_i - x_{\text{mean}})^2}{n}}$$
As the size of the data set increases, (n – 1) → n, so
s → σ.
Ordinarily analytical chemists will use the first
value (s) for the standard deviation since we will
typically deal with a small population or small
data set.
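As a rough illustration of why s approaches σ, the two definitions above differ only in the divisor, so for the same deviations the ratio s/σ is √(n/(n – 1)). The short sketch below tabulates that factor; the function name and the sample sizes are chosen here for illustration only.

```python
import math

def s_over_sigma(n: int) -> float:
    """Ratio s / sigma for the same n deviations: sqrt(n / (n - 1))."""
    return math.sqrt(n / (n - 1))

# The factor shrinks toward 1 as the data set grows.
for n in (3, 6, 30, 1000):
    print(n, round(s_over_sigma(n), 4))
```

For the small sets typical of analytical work (n = 3 to 6) the difference is several percent or more, which is why s is the appropriate choice.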
The larger the value of s, the broader the
Gaussian curve.
The relative standard deviation is the standard
deviation divided by the mean value, that is
s / xmean.
The relative standard deviation may be
expressed in % (parts per hundred) or ppt (parts
per thousand)
Relative standard deviation (%) = s × 100 / xmean
Relative standard deviation (ppt) = s × 1000 / xmean
Other important terms
Median – middle value in an ordered set (ascending
or descending); when n is an even number, it is the
average of the 2 middle values.
Range – difference between the highest and lowest
values in the set of data. May be stated as
(High – Low) or that value.
For example in a set of data where 25.11 is the
highest and 24.85 the lowest, we could describe
the range as (25.11 – 24.85) or 0.26
Find the mean, median, standard deviation,
relative standard deviation and the range of the
following set of student data acquired in the
analysis of chloride in a sample:
xi
18.56%
18.65%
18.49%
18.54%
18.70%
18.53%
The sum of the individual values = 111.47
there were 6 measurements
The mean (xmean) = 111.47 / 6 = 18.578 = 18.58
The quantity (xi – xmean) is calculated next
xi         (xi – xmean)    (xi – xmean)²
18.56%     -0.02           0.0004
18.65%      0.07           0.0049
18.49%     -0.09           0.0081
18.54%     -0.04           0.0016
18.70%      0.12           0.0144
18.53%     -0.05           0.0025
The sum of the squared deviations
Σi (xi – xmean)² = 0.0319
0.0319 / (6 – 1) = 0.00638
s = √(0.00638) = 0.0799 = 0.08, or 0.080 to use the
author’s method.
Note that s is reported to the same number of decimal places as the data.
The relative standard deviation = s / xmean
= 0.0799 × 100 / 18.578 = 0.430
This could be reported as 0.43% or 4.3 ppt.
To find the median, first arrange the values in order
(ascending or descending); I choose descending.
Rank   1      2      3      4      5      6
xi     18.70  18.65  18.56  18.54  18.53  18.49
Since n = 6, the median is between ordered #3
and #4, so the median = (18.56 + 18.54)/2 = 18.55
The range = (18.70 – 18.49) = 0.21
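The hand calculation above can be cross-checked with a few lines of Python. This is only a sketch using the standard-library statistics module; the variable names are chosen here for illustration. Because the mean is not rounded before the deviations are taken, intermediate digits can differ slightly from the hand-worked values.

```python
import statistics

# Student chloride results (%) from the worked example.
chloride = [18.56, 18.65, 18.49, 18.54, 18.70, 18.53]

mean = statistics.mean(chloride)
s = statistics.stdev(chloride)          # sample standard deviation, (n - 1) divisor
median = statistics.median(chloride)    # averages the two middle values when n is even
data_range = max(chloride) - min(chloride)
rsd_percent = 100 * s / mean            # relative standard deviation in %
rsd_ppt = 1000 * s / mean               # relative standard deviation in ppt

print(f"mean   = {mean:.2f}")
print(f"s      = {s:.2f}")
print(f"median = {median:.2f}")
print(f"range  = {data_range:.2f}")
print(f"RSD    = {rsd_percent:.2f}% = {rsd_ppt:.1f} ppt")
```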
For the ideal Gaussian distribution 68.3% of the
measurements lie within ± 1σ (standard
deviation) of the mean value, 95.5% within ± 2σ
and 99.7% within ± 3σ.
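These percentages follow from the Gaussian error function: the fraction of an ideal Gaussian population within ± kσ of the mean is erf(k/√2). A minimal sketch (the function name is chosen here for illustration):

```python
import math

def fraction_within(k: float) -> float:
    """Fraction of an ideal Gaussian population within +/- k sigma of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within +/- {k} sigma: {100 * fraction_within(k):.2f}%")
```

The 2σ value comes out as 95.45%, which is the figure quoted as 95.5% above.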
This means that for real data of a small
population we can expect only 4.5% to fall
outside the ± 2s limits and only 0.3% outside the
± 3s limits from the mean value.
Student’s t test
The Student’s t test is a test developed by W. S.
Gosset, who used the pseudonym “Student” to
publish this statistical test in 1908. It is used to
express confidence intervals for a set of data and to
statistically compare the results of different
experiments.
The true mean is denoted as μ. From a small
number of data points it is not possible to determine
either μ or σ. Instead, we have xmean and s. We would
like to be able to state the probability that the true value
μ is within some quantity of xmean. The confidence
interval does this in the form
$$\mu = x_{\text{mean}} \pm \frac{t\,s}{\sqrt{n}}$$
and may be stated at a certain probability such as 90%,
95%, or 99%, etc. The values of t for various degrees of
freedom and confidence levels are shown in Table 4-2,
page 78 of your textbook.
Let’s go back to the % chloride data and calculate the 50%,
90%, 95% and 99% confidence intervals for the results.
xmean = 18.58, s = 0.08, n = 6
xi
18.56%
18.65%
18.49%
18.54%
18.70%
18.53%
At the 50% CI, μ = 18.58 ± (0.727)(0.080/√6) = 18.58 ± 0.024
= 18.58 ± 0.02. Note that the value for t is at the intersection of
the 50% column and the row for 5 degrees of freedom (n – 1 = 5).
Now repeating the calculation with the appropriate
values of t:
At the 90% CI, μ = 18.58 ± (2.015)(0.080/√6) =
18.58 ± 0.066 = 18.58 ± 0.07
At the 95% CI, μ = 18.58 ± (2.571)(0.080/√6) =
18.58 ± 0.084 = 18.58 ± 0.08
At the 99% CI, μ = 18.58 ± (4.032)(0.080/√6) =
18.58 ± 0.132 = 18.58 ± 0.13
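The same four intervals can be sketched in Python. The t values are the ones quoted above for 5 degrees of freedom; the dictionary and function names are illustrative, and s is computed from the data without intermediate rounding, so the third decimal place may differ slightly from a hand calculation.

```python
import math
import statistics

chloride = [18.56, 18.65, 18.49, 18.54, 18.70, 18.53]

# Student's t for 5 degrees of freedom at each confidence level (%), as quoted in the text.
T_VALUES_DF5 = {50: 0.727, 90: 2.015, 95: 2.571, 99: 4.032}

def ci_half_width(data, t):
    """Return t * s / sqrt(n), the +/- part of the confidence interval."""
    s = statistics.stdev(data)
    return t * s / math.sqrt(len(data))

mean = statistics.mean(chloride)
half_widths = {level: ci_half_width(chloride, t) for level, t in T_VALUES_DF5.items()}

for level, hw in half_widths.items():
    print(f"{level}% CI: {mean:.2f} +/- {hw:.2f}")
```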
Note that the tolerance quantity (± t s/√n) becomes larger
as we increase the percent probability that we desire to
include. Or, another way of looking at it is that at the 50%
CI there is a 50% probability that the true value (μ) lies
outside the ± 0.02, whereas at the 99% CI there is a 1%
probability that μ lies outside the ± 0.13.
Also note that the tolerance quantity (± t s/√n) is reported
to the same number of decimal places as the mean value,
though I carried an extra place through the calculation and
rounded after the final step.
From the equation μ = xmean ± t s/√n we see that
the size of (± t s/√n) is inversely proportional
to √n; thus, one way to narrow the interval in which
the true value μ is expected to lie is to
increase the number of results, assuming that xmean
and s are not affected by the additional runs. The value
of t also decreases as n increases, which narrows the
interval further.
Problem – For n = 3 the xmean and s were found to
be 15.78 and 0.30 respectively. Calculate the 95%
confidence interval.
For n = 3, (n – 1) = 2; t95,2 = 4.303
μ = 15.78 ± (4.303)(0.30/√3) = 15.78 ± 0.745
μ = 15.78 ± 0.75
Relative uncertainty = (0.75/15.78) × 100 = 4.75%
Repeat the previous calculation for n = 7 with the
same xmean and s values:
For n = 7, (n – 1) = 6; t95,6 = 2.447
μ = 15.78 ± (2.447)(0.30/√7) = 15.78 ± 0.277
μ = 15.78 ± 0.28
Relative uncertainty = (0.28/15.78) × 100 = 1.77%
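To see the effect of n at a glance, the two problems can be run side by side. This is a sketch only: the function name is illustrative and the t values are the tabulated ones quoted in the two problems.

```python
import math

def half_width(s, n, t):
    """Return t * s / sqrt(n) for a given s, n and tabulated t."""
    return t * s / math.sqrt(n)

X_MEAN, S = 15.78, 0.30

# (n, t at 95% confidence for n - 1 degrees of freedom)
for n, t in ((3, 4.303), (7, 2.447)):
    hw = half_width(S, n, t)
    print(f"n = {n}: {X_MEAN} +/- {hw:.2f}  ({100 * hw / X_MEAN:.1f}% relative)")
```

Going from 3 to 7 results shrinks the interval by more than half, because both 1/√n and t decrease.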
The t test is also valuable to compare two different
sets of data to determine if they are ‘the same’ or
‘different’, or stated statistically, “are there
significant differences between the two sets of
data?”
Example – As the director of a research laboratory
you are paid to decide if there is a significant
difference between the mean values of two sets of
data obtained by two different scientists, a senior
scientist and one recently hired.
Data of Senior Scientist:
xmean = 24.66%
with s = 0.06% for n = 5
Data of the New Kid:
xmean = 24.55%
with s = 0.10% for n = 7
What we need to do here is compare the two mean
values, x1 mean and x2 mean, by testing their difference
(x1 mean – x2 mean) against a tolerance quantity similar to
(t s/√n). Because there are two different standard
deviations, we need to calculate the pooled standard
deviation, spool, which is defined as
$$s_{\text{pool}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
spool = {[(5 – 1)(0.06)² + (7 – 1)(0.10)²] / (5 + 7 – 2)}^(1/2)
spool = {[(4)(0.0036) + (6)(0.010)] / (10)}^(1/2) =
{(0.0144 + 0.060) / (10)}^(1/2) = {0.00744}^(1/2)
spool = 0.086 = 0.09
Note that the value of spool will always fall between
the two individual values of s; it is like a weighted
average value.
The test is whether
$$|x_{1,\text{mean}} - x_{2,\text{mean}}| > t\, s_{\text{pool}} \sqrt{\frac{n_1 + n_2}{n_1 n_2}}$$
We will use the value of t95 for 7 + 5 – 2 = 10 degrees of
freedom; according to Table 4-2, t95,10 = 2.228.
Substituting, is |24.66 – 24.55| > (2.228)(0.086) √(12/35) ?
0.11 > (0.1916)(0.5855) ?
0.11 > 0.112 ?
No, there is no significant difference between the mean values
of the two scientists at the 95% confidence level, although the
difference is close to the limit.
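The comparison can also be sketched in Python, here in the equivalent form of a calculated t that is compared with the tabulated t. The function names are illustrative, and the tabulated value 2.228 is the one quoted from Table 4-2. Without intermediate rounding the sketch gives spool ≈ 0.086 and a calculated t of about 2.18.

```python
import math

def pooled_std(s1, n1, s2, n2):
    """Pooled standard deviation of two data sets."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def t_calculated(mean1, s1, n1, mean2, s2, n2):
    """t for the difference between two means, using the pooled standard deviation."""
    s_pool = pooled_std(s1, n1, s2, n2)
    return abs(mean1 - mean2) / (s_pool * math.sqrt((n1 + n2) / (n1 * n2)))

T_TABLE_95_DF10 = 2.228  # tabulated t, 95% confidence, 10 degrees of freedom

s_pool = pooled_std(0.06, 5, 0.10, 7)
t_calc = t_calculated(24.66, 0.06, 5, 24.55, 0.10, 7)

print(f"s_pool = {s_pool:.3f}, t_calc = {t_calc:.2f}")
if t_calc > T_TABLE_95_DF10:
    print("significant difference between the means")
else:
    print("no significant difference between the means")
```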
The test for a significant difference between the
true value (μ) and the mean value (xmean) of a set of
data is very similar to the previous test, except that
only one set of data, with its own s and n, is involved.
If |μ – xmean| > t s/√n, there is a
significant difference between the true value and the
mean.
F test for Differences in Precisions
In addition to comparing a mean value to the true value and
two mean values, it is often valuable to compare the
precisions of two different sets of data. Your textbook
does not discuss this test, so I will briefly explain it and
apply it to a typical problem.
The variance v is defined as the standard deviation squared,
that is, v = s². Variances are calculated for both sets of
data. The larger variance is placed in the numerator of a
term known as Fc and defined as Fc = vlarger / vsmaller. The
value of Fc is then compared to the tabulated values of Ft at
a specified confidence level, generally 95%.
Values of Ft for the Comparison of Variances
at the 95% Confidence Level
Number of        Number of Observations, Numerator
Observations,    ------------------------------------------------------
Denominator          3       4       5       6       7      10       ∞
     3           19.00   19.16   19.25   19.30   19.33   19.38   19.50
     4            9.55    9.28    9.12    9.01    8.94    8.81    8.53
     5            6.94    6.59    6.39    6.26    6.16    6.00    5.63
     6            5.79    5.41    5.19    5.05    4.95    4.78    4.36
     7            5.14    4.76    4.53    4.39    4.28    4.10    3.67
    10            4.26    3.86    3.63    3.48    3.37    3.18    2.71
     ∞            2.99    2.60    2.37    2.21    2.09    1.88    1.00
Problem – Were there significant differences between the
precisions of the two scientists in the last problem above?
Data of Senior Scientist: xmean = 24.66%, s = 0.06% for n = 5
Data of the New Kid: xmean = 24.55%, s = 0.10% for n = 7
For the new kid, v = (0.10)² = 0.010;
For the senior scientist, v = (0.06)² = 0.0036.
Fc = (0.010 / 0.0036) = 2.78. From the Ft table (7 observations
in the numerator, 5 in the denominator), Ft = 6.16.
Since Fc < Ft there are no significant differences between the
precisions of the two scientists.
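A short sketch of the same F test follows. The function name is illustrative, and the tabulated value 6.16 is the entry read from the Ft table for 7 observations in the numerator and 5 in the denominator.

```python
def f_calculated(s_a, s_b):
    """Ratio of the larger variance to the smaller variance."""
    v_a, v_b = s_a**2, s_b**2
    return max(v_a, v_b) / min(v_a, v_b)

F_TABLE = 6.16  # Ft at 95%: 7 observations (numerator), 5 observations (denominator)

f_calc = f_calculated(0.10, 0.06)
print(f"Fc = {f_calc:.2f}, Ft = {F_TABLE}")
if f_calc < F_TABLE:
    print("no significant difference in precision")
else:
    print("significant difference in precision")
```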
Conclusions of the Differences between Mean
Values and Precisions of the Two Scientists
1) The first test allowed us to test for significant
differences in the mean values obtained by the two
scientists. Since the difference in the 2 mean values
was less than the tolerance quantity, there is no
significant difference between the mean values of the
two scientists at the 95% confidence level.
2) The second test (F-test) allowed us to test for
differences in the precision of the two scientists. Since
the calculated value of F cal < Ftable , there is no
significant difference between the precision of the 2
scientists at the 95% confidence level.
Rejection of Suspect Data
1) The Q-Test
Occasionally in a set of data there is one value that appears
not to belong with the rest of the set. If the experimenter
is aware of some mistake or malfunction, he/she does not
need to employ one of these tests to reject that result. If no
known error has occurred (so that the suspect result
appears to be random), the analyst is then faced with
whether to retain or reject this suspect value. He/she needs
some sound basis for the decision, not just ‘eyeballing’ it.
Your textbook describes one such test, the Q-test. After I
have discussed the Q-test, I will then discuss two
additional less rigorous, but useful tests for rejection of
suspect data.
Problem – Given the following set of data for the
determination of % Acidic Substance in a Cleansing
Agent, may the suspect result be rejected, or must it be
retained by the criteria of the Q-test?
% Acid
10.19%
10.08%
10.52%
10.13%
Calculate the mean values both retaining and rejecting the
suspect value (which is the 10.52 result).
xmean (retaining) = 10.23%
xmean (rejecting) = 10.13%
Clearly the suspect value unduly influences the
mean value. To employ the Q-test we need the range
and the difference between the suspect value and the
value nearest it.
Range = (10.52 – 10.08) = 0.44
Difference between the suspect value and its nearest value
= (10.52 – 10.19) = 0.33
Qcal = (xsuspect – xnearest) / (Range)
= 0.33 / 0.44 = 0.75
Since Qcal < Qtable (0.75 < 0.76) we must retain the suspect
value at the 90% confidence level.
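The Q calculation can be sketched as a small function. The function name is illustrative, and the tabulated Q of 0.76 (n = 4, 90% confidence) is the value used in the comparison above.

```python
def q_calculated(data, suspect):
    """Gap between the suspect value and its nearest neighbour, divided by the range."""
    others = list(data)
    others.remove(suspect)
    nearest = min(others, key=lambda x: abs(x - suspect))
    return abs(suspect - nearest) / (max(data) - min(data))

acid = [10.19, 10.08, 10.52, 10.13]
Q_TABLE_90_N4 = 0.76  # tabulated Q for n = 4 at the 90% confidence level

q_calc = q_calculated(acid, 10.52)
print(f"Qcal = {q_calc:.2f}")
print("reject" if q_calc > Q_TABLE_90_N4 else "retain")
```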
Referring to Table 4-4, textbook page 82, Qt = 0.76 for
n = 4 at the 90% Confidence Level. Thus we must
retain the suspect value by this criterion.
(Not in your textbook, but Qtable at the 96% confidence
level has a value of 0.85 for n = 4 (a); by this criterion,
the suspect value of 10.52% would also be retained.)
(a) Skoog and West, “Fundamentals of Analytical Chemistry”, 4e, c1982,
CBS College Publishing, p. 62.
2) The 4d and 2.5d Rules
Although less rigorous, these rules may also be used to
decide whether to retain or reject a suspect value. In order to
use them, one needs to calculate the average deviation
of the retained values (the suspect value is left out of
both the mean and the average deviation), which is defined as
$$\text{average deviation} = \frac{\sum_i |x_i - x_{\text{mean}}|}{n}$$
% Acid     di = (xi – xmean)    |di|    Note
10.19       0.06                0.06
10.08      -0.05                0.05
10.52       0.39                        suspect value (reject ?)
10.13       0                   0
xmean with the suspect value rejected = 10.13
sum |di| = 0.11
avg d = 0.11 / 3 = 0.037
2.5 × avg d = 0.092
4 × avg d = 0.147
Since the deviation of the suspect value from the
mean (0.39) is greater than 4 × avg d (0.147), we could
reject the suspect value. The 2.5d rule
is applied identically except the multiplier is 2.5
instead of 4; 2.5d equals 0.092 or 0.09 in this
problem. Clearly the 2.5d rule allows easier
rejection than the 4d rule. The suspect value
(deviation 0.39) could be rejected by both of
these criteria.
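A sketch of both rules in Python follows; the function name is illustrative. The mean and average deviation are computed from the retained values only and are not rounded along the way, so the numbers differ slightly from the hand-worked table (average deviation about 0.038, suspect deviation about 0.387), but the conclusion is the same.

```python
def average_deviation_test(data, suspect, multiplier):
    """Return (suspect deviation, multiplier * average deviation of retained values)."""
    retained = list(data)
    retained.remove(suspect)
    mean = sum(retained) / len(retained)
    avg_dev = sum(abs(x - mean) for x in retained) / len(retained)
    return abs(suspect - mean), multiplier * avg_dev

acid = [10.19, 10.08, 10.52, 10.13]

for k in (2.5, 4):
    dev, limit = average_deviation_test(acid, 10.52, k)
    verdict = "reject" if dev > limit else "retain"
    print(f"{k}d rule: deviation {dev:.3f} vs limit {limit:.3f} -> {verdict}")
```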
In the analysis of your laboratory results,
you may use any of the above tests in an
attempt to reject one suspect result; if you
meet the criterion for rejection, reject the
suspect value and state that basis in your
laboratory report.
Corrections to Errors in Earlier Slides
The following slides are corrections to the errors in the
earlier slides.
Note that the correction is the last statement of the
Q-test discussion (the parenthetical remark on the 96%
confidence level). I could not find a less restrictive
Q-Table (Confidence Level less than 90%). If such a
table exists, say at 50% CL, its Qtable would be less than
the 0.76 value at the 90% CL used in this problem.