Module 7: Comparing Datasets and Comparing a Dataset with a Standard
How different is enough?
Concepts
Independence of each data point
Test statistics
Central Limit Theorem
Standard error of the mean
Confidence interval for a mean
Significance levels
How to apply in Excel
module 7
Independent Measurements
Each measurement must be independent
(shake up basket of tickets)
Example of non-independent measurements
– Public responses to questions (one result affects next person’s answer)
– Samplers too close together, so air flows affected
Test Statistics
A test statistic is a number calculated from the data
In Student’s t test, for example, the statistic is t
If the population is normally distributed and the sample is large:
– 95% of the distribution lies in the inner portion of the curve, symmetrically between t = -1.96 on the left and t = 1.96 on the right
– if t >= 1.96, you are in the right tail of the curve, beyond that inner 95%
Test statistics correspond to significance levels
“P” stands for percentile
The Pth percentile is the value that a fraction p of the data falls below and a fraction 1-p falls above
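As a minimal sketch of how a test statistic maps to a percentile, assuming a standard normal curve, Python's standard-library `statistics.NormalDist` can convert a z value into the fraction of the curve below it. The function name `percentile_below` is only illustrative and is not part of this module.

```python
from statistics import NormalDist

def percentile_below(z):
    """Fraction of a standard normal curve that lies below z."""
    return NormalDist().cdf(z)

p_right = percentile_below(1.96)   # about 0.975
p_left = percentile_below(-1.96)   # about 0.025
inner = p_right - p_left           # about 0.95, the inner portion of the curve
print(round(p_right, 3), round(p_left, 3), round(inner, 3))
```

So a statistic of 1.96 sits at roughly the 97.5th percentile, leaving about 2.5% of the curve in each tail.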
Two Major Types of Questions
Comparing mean against a standard
– Does air quality here meet NAAQS?
Comparing two datasets
– Is air quality different in 2006 than 2005?
– Better?
– Worse?
Comparing Mean to a Standard
Did air quality meet CARB annual standard of 12 µg/m3?
year   Ft Smith avg   Ft Smith Min   Ft Smith Max   N_Fort Smith
‘05    14.78          0.1            37.9           77
Central Limit Theorem (magic!)
Even if underlying population is not normally distributed
If we repeatedly take datasets
These different datasets have means that cluster around true mean
Distribution of these means is approximately normally distributed (provided each dataset is large enough)!
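The idea can be checked with a small simulation. This is only a sketch under assumed settings: the skewed exponential population, the dataset size of 50, and the 2000 repetitions are illustrative choices, not values from this module.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    """Mean of n draws from a skewed (exponential) population with true mean 1."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# Repeatedly take datasets of 50 points and record each dataset's mean.
means = [sample_mean(50) for _ in range(2000)]

center = statistics.fmean(means)   # clusters near the true mean of 1
spread = statistics.stdev(means)   # close to 1 / sqrt(50), about 0.14
print(round(center, 2), round(spread, 2))
```

Even though individual exponential draws are strongly skewed, the dataset means pile up symmetrically around the true mean.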
Magic Concept #2: Standard Error of the Mean
Represents uncertainty around mean
As sample size N gets bigger, error gets smaller!
The bigger the N, the more tightly you can estimate mean
LIKE a standard deviation, but it describes the spread of sample means around the true mean, estimated from YOUR sample
Standard error = $s / \sqrt{N}$
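A short sketch of how the standard error shrinks as N grows, assuming the usual definition (standard deviation divided by the square root of N). The spread of 8.66 is the value used later in this module; the other sample sizes are illustrative.

```python
import math

def standard_error(s, n):
    """Standard error of the mean: standard deviation divided by sqrt(N)."""
    return s / math.sqrt(n)

# Same spread, growing N: the standard error keeps shrinking.
for n in (10, 77, 1000):
    print(n, round(standard_error(8.66, n), 2))
```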
For a “large” sample (N > 60), or when very
close to a normal distribution…
Confidence interval for population mean is:
$\bar{x} \pm Z \frac{s}{\sqrt{n}}$
Choice of z determines 90%, 95%, etc.
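To illustrate how the choice of z follows from the confidence level, here is a minimal sketch using Python's standard-library `statistics.NormalDist`. The helper name `z_for_confidence` is hypothetical.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided z value: the central `level` fraction of the curve lies within +-z."""
    return NormalDist().inv_cdf(0.5 + level / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_for_confidence(level):.2f}")
```

A higher confidence level needs a larger z, and therefore a wider interval around the same mean.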
For a “Small” Sample
Replace Z value with a t value to get…
$\bar{x} \pm t \frac{s}{\sqrt{n}}$
…where “t” comes from Student’s t
distribution, and depends on sample size
Student’s t Distribution vs. Normal Z Distribution
[Figure: T-distribution and Standard Normal Z distribution. Density (0.0 to 0.4) plotted against value (-5 to 5) for the Z distribution and for T with 5 d.f.]
Compare t and Z Values
Confidence level   t value with 5 d.f.   Z value
90%                2.015                 1.65
95%                2.571                 1.96
99%                4.032                 2.58
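A rough sketch of what the larger t value means in practice: for the same data, the t-based interval is wider. The critical values are the 95% entries from the table; the standard deviation of 8.0 and sample size of 6 are hypothetical numbers chosen only to give 5 degrees of freedom.

```python
import math

# Critical values at 95% confidence, taken from the table.
T_95_5DF = 2.571
Z_95 = 1.96

def half_width(critical_value, s, n):
    """Half-width of a confidence interval: critical value times s / sqrt(n)."""
    return critical_value * s / math.sqrt(n)

s, n = 8.0, 6  # hypothetical small sample: n = 6 gives 5 degrees of freedom
t_half = half_width(T_95_5DF, s, n)
z_half = half_width(Z_95, s, n)
print(round(t_half, 2), round(z_half, 2))  # the t-based interval is wider
```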
What happens as sample gets larger?
[Figure: T-distribution and Standard Normal Z distribution. Density (0.0 to 0.4) plotted against value (-5 to 5) for the Z distribution and for T with 60 d.f.]
What happens to CI as sample gets larger?
For large samples Z and t values become almost identical, so CIs are almost identical:
$\bar{x} \pm Z \frac{s}{\sqrt{n}} \approx \bar{x} \pm t \frac{s}{\sqrt{n}}$
First, graph and review data
Use box plot add-in
Evaluate spread
Evaluate how far apart mean and median are
(assume sampling design and QC are good)
Excel Summary Stats
1. Use the box-plot add-in
[Figure: box plot of Ft Smith data, vertical axis 0 to 40]
2. Calculate summary stats
Ft Smith, N=77
Min   25th   Median   75th   Max    Mean   SD
0.1   7.5    13.7     18.1   37.9   14.8   8.718
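Outside Excel, the same kind of summary can be sketched with Python's standard-library `statistics` module. The data values below are hypothetical and are NOT the Ft Smith measurements; quartile conventions also differ between tools, so the 25th and 75th percentiles may not match a spreadsheet exactly.

```python
import statistics

# Hypothetical values for illustration only; NOT the Ft Smith measurements.
data = [2.0, 4.5, 7.5, 9.0, 13.0, 15.5, 18.0, 21.0, 30.0]

q1, median, q3 = statistics.quantiles(data, n=4)  # 25th, 50th, 75th percentiles
summary = {
    "N": len(data),
    "Min": min(data),
    "25th": q1,
    "Median": median,
    "75th": q3,
    "Max": max(data),
    "Mean": statistics.fmean(data),
    "SD": statistics.stdev(data),  # sample standard deviation (n - 1)
}
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Comparing the printed mean and median gives the same quick check of skew that the module suggests before choosing between z and t.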
Our Question
How confident can we be (90%? 95%? more?) that this mean of 14.78 is really greater than the standard of 12?
We saw that N = 77, and mean and median not too different
So use z (normal) rather than t
The mean is 14.8 +- what?
We know equation for CI is
$\bar{x} \pm Z \frac{s}{\sqrt{n}}$
Width of confidence interval depends on how sure we want to be that this CI includes true mean
Now, decide how confident we want to be
CI Calculation
For 95%, z = 1.96 (often rounded to 2)
Stnd error (sigma/square root of N) = (8.66/square root of 77) = 0.98
Half-width of CI around mean = 2 x 0.98, or about 2
We can be 95% sure that true mean is included in (mean +- 2), or 14.8 - 2 = 12.8 at low end, to 14.8 + 2 = 16.8 at high end
This interval does NOT include 12!
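The same arithmetic, as a minimal Python sketch using the Ft Smith values from this module and the unrounded z of 1.96. Variable names are illustrative.

```python
import math

mean, sigma, n = 14.78, 8.66, 77   # Ft Smith values from this module
z = 1.96                            # 95% confidence
standard = 12.0

se = sigma / math.sqrt(n)           # about 0.99
half_width = z * se                 # about 1.93
low, high = mean - half_width, mean + half_width
print(f"95% CI: {low:.2f} to {high:.2f}")
print("standard inside CI:", low <= standard <= high)
```

Without rounding z up to 2, the interval is slightly narrower than 12.8 to 16.8, and its lower end still sits above the standard of 12.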
Excel can also calculate a
confidence interval around the mean
Mean, plus and minus 1.93, is a 95%
confidence interval that does NOT
include 12!
We know we are more than 95%
confident, but how confident can we
be that Ft Smith mean > 12?
Calculate where on curve our mean of 14.8 is,
in terms of z (normal) score…
…or if N small, use t score
To find where we are on the curve, calc the test statistic…
Ft Smith mean = 14.8, sigma = 8.66, N = 77
Calculate test statistic, in this case the z factor (we decided we can use the z rather than the t distribution):
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}$
where $\bar{x}$ is the data’s mean and $\mu$ is the standard of 12
If N was < 60, test stat is t, but calculated the same way
Calculate z Easily
Our mean 14.78 minus standard of 12 (treat real mean (mu) as standard) is numerator (= 2.78)
Standard error is sigma/square root of N = 0.98 (same as for CI)
so z = 2.78/0.98 = 2.84
So where is this z on the curve?
Remember, at z = 3 we are to the right of ~ 99%
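As a sketch, the z calculation can be written as a small Python function. The name `z_statistic` is hypothetical. Carrying unrounded intermediate values gives a z of about 2.82, slightly below the hand calculation's 2.84, which uses the rounded standard error of 0.98.

```python
import math

def z_statistic(sample_mean, standard, sigma, n):
    """(sample mean - standard) divided by the standard error sigma / sqrt(n)."""
    return (sample_mean - standard) / (sigma / math.sqrt(n))

z = z_statistic(14.78, 12.0, 8.66, 77)
print(round(z, 2))
```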
Where on the curve?
[Figure: normal curve with Z = 2 and Z = 3 marked]
So it is between 95% and 99% probable that the true mean is greater than 12
You can calculate exactly where on
the curve, using Excel
Use Normsdist function, with z
If z (or t) = 2.84, in Excel: =NORMSDIST(2.84)
Yields 99.8% probability that the true mean is greater than 12
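For readers working outside Excel, a minimal Python sketch of the same lookup: the standard-library `statistics.NormalDist().cdf` returns the area of the standard normal curve below z, which is what the module uses Normsdist for.

```python
from statistics import NormalDist

z = 2.84
p = NormalDist().cdf(z)   # area of the standard normal curve below z
print(f"{p:.1%}")
```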