Comparing Datasets 1-17


Comparing Datasets and Comparing a Dataset with a Standard
How different is enough?
Concepts:
• Independence of each data point
• Test statistics
• Central Limit Theorem
• Standard error of the mean
• Confidence interval for a mean
• Significance levels
• How to apply in Excel
module 7
Independent measurements:
• Each measurement must be independent
(shake up the basket of tickets)
• Example of non-independent
measurements:
– Public responses to questions (one result
affects the next person’s answer)
– Samplers placed too close together so air
flows are affected
Test statistics:
• Some number that is calculated
based on the data
• In the Student’s t test, for example, the test statistic is t
• If t is >= 1.96, and you have a normally distributed population (and a large sample, so that t is close to z), you know you are in the right tail of the curve: 95% of the data is in the inner portion, which lies symmetrically between t = -1.96 on the left and t = 1.96 on the right
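The 95% figure above can be checked numerically. This is a minimal sketch using Python's standard library rather than Excel; the variable names are illustrative and not from the slides.

```python
from statistics import NormalDist

# Standard normal curve (mean 0, standard deviation 1).
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Fraction of the curve in the inner portion between -1.96 and +1.96.
inner = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

# Fraction of the curve in the right tail beyond +1.96.
right_tail = 1.0 - std_normal.cdf(1.96)

print(f"inner fraction: {inner:.3f}")
print(f"right tail:     {right_tail:.3f}")
```

The inner fraction comes out very close to 0.95, leaving about 0.025 in each tail.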
Test statistics correspond to
significance levels
• “P” stands for percentile
• The Pth percentile is the value that a fraction p of the data falls below, and a fraction 1-p falls above
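To see how a percentile corresponds to a test statistic, the sketch below looks up the z value for a few percentiles of the standard normal curve. The helper name `z_for_percentile` is hypothetical, and Python's standard library stands in for Excel here.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def z_for_percentile(p: float) -> float:
    """Return the z value that has a fraction p of the normal curve below it."""
    return std_normal.inv_cdf(p)

# A fraction p falls below each z, and 1 - p falls above it.
for p in (0.90, 0.95, 0.975, 0.995):
    print(f"p = {p:.3f} -> z = {z_for_percentile(p):.2f}")
```

The 97.5th percentile gives z = 1.96: 2.5% of the curve lies above it, and by symmetry another 2.5% lies below -1.96.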
Two major types of questions:
• Comparing the mean against a standard
– Does the air quality here meet the NAAQS?
• Comparing two datasets
– Is the air quality different in 2006 than 2005?
– Or, is the air quality better?
– Or, is the air quality worse?
Comparing mean to a standard:
• Did the air quality meet the CARB annual stnd of 12 µg/m3?
Year   Ft Smith avg   Ft Smith Min   Ft Smith Max   N_Fort Smith
‘05    14.78          0.1            37.9           77
Central Limit Theorem (magic!)
• Even if the underlying population is not
normally distributed
• If we repeatedly take datasets
• These different datasets will have means
that cluster around the true mean
• And the distribution of these means is (approximately) normally distributed!
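The Central Limit Theorem can be illustrated with a small simulation. This is a sketch under stated assumptions: the skewed (exponential) population, its true mean of 10, and the number of repeated datasets are all hypothetical choices, not values from the slides.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

TRUE_MEAN = 10.0     # hypothetical population mean
N_PER_DATASET = 77   # size of each dataset
N_DATASETS = 2000    # how many times we repeat the sampling

def dataset_mean(n: int) -> float:
    """Mean of one dataset of n draws from a skewed (exponential) population."""
    values = [random.expovariate(1.0 / TRUE_MEAN) for _ in range(n)]
    return statistics.fmean(values)

means = [dataset_mean(N_PER_DATASET) for _ in range(N_DATASETS)]

center = statistics.fmean(means)   # where the dataset means cluster
spread = statistics.stdev(means)   # how tightly they cluster

# For a normal distribution, about 95% of values lie within 1.96 SD of the center.
within = sum(abs(m - center) <= 1.96 * spread for m in means) / len(means)

print(f"center of the means:   {center:.2f}")
print(f"spread of the means:   {spread:.2f}")
print(f"fraction within 1.96 SD: {within:.3f}")
```

Even though the individual values are strongly skewed, the dataset means cluster around the true mean, and roughly 95% of them fall within 1.96 standard deviations of the center, as expected for a normal curve.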
magic concept #2: Standard error
of the mean
• Represents uncertainty
around the mean
• as sample size N gets
bigger, your error gets
smaller!
• The bigger the N, the more
tightly you can estimate
mean
• LIKE a standard deviation, but it describes how much the mean of YOUR sample would vary, not the spread of the individual values
• Standard error of the mean = $\sigma / \sqrt{N}$
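To show how the standard error shrinks as N grows, here is a minimal sketch. The standard deviation of 8.66 is the Ft Smith value used later in this module; the sample sizes other than 77 are hypothetical.

```python
import math

def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 8.66  # Ft Smith standard deviation from this module
for n in (10, 77, 308, 1000):  # only N = 77 is from the slides
    print(f"N = {n:4d} -> standard error = {standard_error(sd, n):.2f}")
```

Because of the square root, N has to grow by a factor of 4 to cut the standard error in half.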
For a “large” sample (N > 60), or when very close to a normal distribution:
A confidence interval for a population mean is:
$$\bar{x} \pm Z \left( \frac{s}{\sqrt{n}} \right)$$
Choice of z determines 90%, 95%, etc.
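The large-sample formula can be sketched as a small function. The function name and the example numbers (mean 20, s = 5, n = 100) are hypothetical; the z value is looked up from the standard normal curve for the chosen confidence level.

```python
import math
from statistics import NormalDist

def z_confidence_interval(mean, s, n, confidence=0.95):
    """Return (low, high) for the interval mean +/- Z * s / sqrt(n)."""
    # Two-sided interval: split the leftover probability between both tails.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * s / math.sqrt(n)
    return mean - half_width, mean + half_width

# Hypothetical sample: mean 20, standard deviation 5, 100 measurements.
for level in (0.90, 0.95, 0.99):
    low, high = z_confidence_interval(20.0, 5.0, 100, confidence=level)
    print(f"{level:.0%} CI: {low:.2f} to {high:.2f}")
```

A higher confidence level means a larger z and therefore a wider interval.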
For a “small” sample:
Replace the Z value with a t value to get:
$$\bar{x} \pm t \left( \frac{s}{\sqrt{n}} \right)$$
where “t” comes from Student’s t distribution,
and depends on the sample size.
Student’s t distribution versus Normal Z distribution
[Figure: T-distribution and Standard Normal Z distribution. Density (0.0 to 0.4) versus Value (-5 to 5), showing the Z distribution and T with 5 d.f.]
Compare t and Z values:
Confidence level   t value with 5 d.f.   Z value
90%                2.015                 1.65
95%                2.571                 1.96
99%                4.032                 2.58
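To see what the larger t values do to a confidence interval, the sketch below computes the half-width (the "plus or minus" part) with both sets of critical values from the comparison above. The sample standard deviation of 4.0 is hypothetical; n = 6 is assumed because 5 d.f. corresponds to n - 1.

```python
import math

# Critical values from the t versus Z comparison (t is for 5 d.f.).
T_5DF = {0.90: 2.015, 0.95: 2.571, 0.99: 4.032}
Z = {0.90: 1.65, 0.95: 1.96, 0.99: 2.58}

def half_width(critical: float, s: float, n: int) -> float:
    """Half-width of the confidence interval: critical value times s / sqrt(n)."""
    return critical * s / math.sqrt(n)

s, n = 4.0, 6  # hypothetical small sample; n = 6 gives 5 d.f.
for level in (0.90, 0.95, 0.99):
    with_t = half_width(T_5DF[level], s, n)
    with_z = half_width(Z[level], s, n)
    print(f"{level:.0%}: +/- {with_t:.2f} using t, +/- {with_z:.2f} using Z")
```

With a small sample, using Z would make the interval look narrower (more certain) than the data justify.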
What happens as sample gets larger?
[Figure: T-distribution and Standard Normal Z distribution. Density (0.0 to 0.4) versus Value (-5 to 5), showing the Z distribution and T with 60 d.f.]
What happens to CI as sample gets larger?
$$\bar{x} \pm Z \left( \frac{s}{\sqrt{n}} \right) \qquad \bar{x} \pm t \left( \frac{s}{\sqrt{n}} \right)$$
For large samples:
Z and t values
become almost
identical, so CIs are
almost identical.
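The convergence of t toward Z can be sketched numerically. Assumption: the t values below are standard two-sided 95% t-table values hard-coded for illustration; only the 5 d.f. value appears in the slides. Since both intervals share the same s and n, the ratio t/Z is also the ratio of the CI widths.

```python
from statistics import NormalDist

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
# Assumed from a standard t table; only the 5 d.f. entry is from the slides.
T_95 = {5: 2.571, 10: 2.228, 30: 2.042, 60: 2.000, 120: 1.980}

z_95 = NormalDist().inv_cdf(0.975)  # about 1.96

# Ratio of the t-based CI width to the Z-based CI width for each sample size.
ratios = {df: t / z_95 for df, t in T_95.items()}

for df, ratio in ratios.items():
    print(f"{df:3d} d.f.: t = {T_95[df]:.3f}, CI is {ratio:.3f} x the Z-based CI")
```

At 5 d.f. the t-based interval is about 30% wider; by 60 d.f. the difference is about 2%.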
First, graph and review data:
• Use box plot add-in
• Evaluate spread
• Evaluate how far apart mean and
median are
• (assume the sampling design and
the QC are good)
Excel summary stats:
1. Use the box-plot add-in
[Figure: box plot of Ft Smith, vertical axis from 0 to 40]
2. Calculate summary stats
N        77
Min      0.1
25th     7.5
Median   13.7
75th     18.1
Max      37.9
Mean     14.8
SD       8.7
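For readers working outside Excel, the same summary statistics can be sketched in Python. The 11 values below are hypothetical stand-ins, since the 77 Ft Smith measurements are not listed in the slides, and the `summarize` helper is illustrative only. The quartile method is an assumption and may not match Excel's output in every case.

```python
import statistics

def summarize(values):
    """Return the summary statistics used in this module as a dict."""
    # method="inclusive" treats the data's min and max as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "min": min(values),
        "25th": q1,
        "median": median,
        "75th": q3,
        "max": max(values),
        "mean": statistics.fmean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical measurements, NOT the actual Ft Smith data.
data = [0.1, 5.2, 7.5, 9.8, 12.4, 13.7, 15.0, 18.1, 21.6, 26.3, 37.9]
summary = summarize(data)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

A mean well above the median, as in this made-up data, is a sign of a long upper tail, which is what the box plot is meant to reveal.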
Our question:
• Can we be 95%, 90% or how confident
that this mean of 14.78 is really greater
than the standard of 12?
• Saw that N = 77 (a “large” sample, N > 60), and mean and median not too different
• So use z (normal) rather than t
The mean is 14.8 ± what?
• We know the equation for CI is
$$\bar{x} \pm Z \left( \frac{s}{\sqrt{n}} \right)$$
• The width of the confidence interval
represents how sure we want to be
that this CI includes the true mean
• Now all we need to decide is how
confident we want to be
CI calculation:
• For 95%, z = 1.96 (often rounded to 2)
• Stnd error (sigma/square root of N) = (8.66/square root of 77) = 0.987
• CI half-width around mean = 1.96 x 0.987 = 1.93, or about 2
• We can be 95% sure that the true mean is included in (mean ± 2), or 14.8 - 2 = 12.8 at the low end, to 14.8 + 2 = 16.8 at the high end
• This does NOT include 12 !
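The Ft Smith calculation can be reproduced at several confidence levels. This is a sketch in Python rather than Excel; the mean, standard deviation, and N are the values quoted in this module, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

mean, sd, n = 14.78, 8.66, 77  # Ft Smith values from this module
standard = 12.0                # the standard being tested against

se = sd / math.sqrt(n)  # standard error of the mean

results = {}
for level in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided z value
    half = z * se
    low, high = mean - half, mean + half
    results[level] = (low, high)
    includes = low <= standard <= high
    print(f"{level:.0%} CI: {low:.2f} to {high:.2f} (includes 12: {includes})")
```

The 95% half-width comes out at about 1.93, and none of the three intervals includes the standard of 12.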
Excel can also calculate a
confidence interval around the
mean:
The mean plus and minus 1.93 is a 95%
confidence interval that does NOT
include 12!
We know we are more than 95%
confident, but how confident can
we be that Ft Smith mean > 12?
• Calculate where on the curve our mean of
14.8 is, in terms of the z (normal) score,
• Or if N small, use the t score:
To find where we are on the curve,
calc the test statistic:
• Ft Smith mean = 14.8, sigma = 8.66, N = 77
• Calculate the test statistic, which in this case is the z factor (we decided we can use the z rather than the t distribution):
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}$$
where $\bar{x}$ is the data’s mean and $\mu$ is the stnd of 12
• If N was < 60, the test stat is t, but calculated the same way
Calculate z easily:
• our mean 14.8 minus the standard of 12 (treat the real mean μ (mu) as the stnd) is the numerator (= 2.8)
• The stnd error is sigma/square root of N = 0.987 (same as for CI)
• so z = 2.8/0.987 = 2.84
• So where is this z on the curve?
• Remember at z = 3 we are to the right of ~
99%
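The z calculation can be written as a short function. A minimal sketch: the function name is hypothetical, and the inputs are the Ft Smith values from this module.

```python
import math

def z_statistic(sample_mean: float, standard: float, sd: float, n: int) -> float:
    """Distance of the sample mean from the standard, in standard errors."""
    standard_error = sd / math.sqrt(n)
    return (sample_mean - standard) / standard_error

z = z_statistic(sample_mean=14.8, standard=12.0, sd=8.66, n=77)
print(f"z = {z:.2f}")
```

The result rounds to 2.84, meaning the mean sits almost three standard errors above the standard.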
Where on the curve?
[Figure: normal curve with Z = 2 and Z = 3 marked]
So we can be between 95% and 99% confident that the true mean is greater than 12
Can calculate exactly where on the
curve, using Excel:
• Use the NORMSDIST function, with z
If z (or t) = 2.84, in Excel: =NORMSDIST(2.84)
Yields 99.8% probability that the true mean is greater than 12
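The same lookup can be sketched outside Excel. The `normsdist` helper below is a hypothetical Python stand-in that returns the cumulative standard normal probability, which is what NORMSDIST computes.

```python
from statistics import NormalDist

def normsdist(z: float) -> float:
    """Fraction of the standard normal curve that lies below z."""
    return NormalDist().cdf(z)

p = normsdist(2.84)
print(f"fraction of the curve below z = 2.84: {p:.1%}")
```

This gives about 99.8%, matching the Excel result.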