Comparing Samples

Last Time
• I talked about what could go wrong in an
experiment where you compared a sample
mean against a population with a known
population mean and standard deviation.
• You will build a sampling (null) distribution and
set an alpha level using the population values.
The population has a fixed amount of variability
(SD) but the variability in the sample statistics is
affected by the sample size. The smaller the
sample size, the more variability in sample
statistics.
Example with SE of Means
• SAS EG example of simulating a
population
• Draw a single sample, get the mean
• Get many means
• Calculate the mean and SD from these
means
• Compare vs. theoretical distribution
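The steps above can also be sketched outside SAS EG. This is a minimal Python sketch, not the original project; the population mean, SD, sample size, and number of samples are assumed values chosen for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 100.0, 15.0   # assumed population values, illustration only
N = 25                           # assumed sample size
N_SAMPLES = 5000                 # how many sample means to collect


def sample_mean(n):
    """Draw one sample of size n from the simulated population; return its mean."""
    return statistics.fmean(random.gauss(POP_MEAN, POP_SD) for _ in range(n))


# Get many means.
means = [sample_mean(N) for _ in range(N_SAMPLES)]

# Calculate the mean and SD from these means.
observed_mean = statistics.fmean(means)
observed_sd = statistics.stdev(means)

# Compare vs. the theoretical standard error, sigma / sqrt(n).
theoretical_se = POP_SD / N ** 0.5

print(f"mean of means: {observed_mean:.2f} (population mean {POP_MEAN})")
print(f"SD of means:   {observed_sd:.2f} (theoretical SE {theoretical_se:.2f})")
```

Rerunning with a smaller N should give a larger SD of the means, which is the point from the previous slide.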
Critical Cut Points
• Given the hypothetical mean, standard
deviation and sample size, you then
determine how unusual a sample must be
before you would reject your null hypothesis
(the null distribution).
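As a rough sketch of how the cut points follow from those three inputs, the Python below builds the null distribution of the sample mean and reads off a two-sided pair of cut points. The numbers (mean 100, SD 15, n of 25, alpha of .05) are assumptions for illustration, and `reject_null` is a hypothetical helper name.

```python
from statistics import NormalDist

MU0, SIGMA, N = 100.0, 15.0, 25   # assumed hypothetical mean, SD, sample size
ALPHA = 0.05                      # assumed two-sided alpha level

se = SIGMA / N ** 0.5                     # standard error of the mean
null_dist = NormalDist(mu=MU0, sigma=se)  # sampling (null) distribution

# Alpha is split evenly between the two tails.
lower_cut = null_dist.inv_cdf(ALPHA / 2)
upper_cut = null_dist.inv_cdf(1 - ALPHA / 2)


def reject_null(sample_mean):
    """True when the sample mean falls outside the critical cut points."""
    return sample_mean < lower_cut or sample_mean > upper_cut


print(f"reject H0 if mean < {lower_cut:.2f} or mean > {upper_cut:.2f}")
```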
Alpha and Beta Again
• If your sample data came from a different
population, you will guess that the data for
this (sub) population is centered around
your sample mean and the distribution will
not completely overlap the null distribution.
The part of the alternative distribution
(area under the curve) that falls beyond the
cut points of the null distribution is the power.
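A minimal sketch of that idea in Python, assuming a known population SD and treating the alternative distribution as centered on the sample mean. All numeric values here are made up for illustration.

```python
from statistics import NormalDist

MU0, SIGMA, N = 100.0, 15.0, 25   # assumed null mean, population SD, sample size
ALPHA = 0.05                      # assumed two-sided alpha level
SAMPLE_MEAN = 108.0               # assumed center of the alternative distribution

se = SIGMA / N ** 0.5
null_dist = NormalDist(MU0, se)
alt_dist = NormalDist(SAMPLE_MEAN, se)

lower_cut = null_dist.inv_cdf(ALPHA / 2)
upper_cut = null_dist.inv_cdf(1 - ALPHA / 2)

# Power: area of the alternative distribution that lies in the rejection
# region of the null distribution (below the lower cut or above the upper).
power = alt_dist.cdf(lower_cut) + (1 - alt_dist.cdf(upper_cut))
beta = 1 - power   # area of the alternative that still looks like the null

print(f"power = {power:.3f}, beta = {beta:.3f}")
```

Moving SAMPLE_MEAN closer to MU0, or shrinking N, increases the overlap and lowers the power.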
Graphical Example
• Here is an R example of cut points in the
theoretical distribution and how the alternative
distribution overlaps with the null
distribution:
Comparing Means
• In reality, you will almost never have a known
population mean and standard deviation and
compare your sample against that. You will
likely have a hypothetical population mean and
you will want to see if your sample was likely to
have come from the set of sample means
distributed around that hypothetical population
mean. Conceptually it is the same task but the
shape of the sampling distribution is different
when you don’t know the population SD.
• Gosset described the function that describes the
distribution for when you are comparing means and
estimating the population SD from the sample.
• He figured it out while working at a brewery that would
not let him publish under his own name, so he published
it under the name Student and called the distribution T.
(Was he thinking tea?)
• The T distribution describes the samples when you don’t
know the population standard deviation. There is extra
uncertainty, and that shows up as a wider (and
fatter-tailed) looking distribution.
[Figure: Student’s T. Probability density (0.0 to 0.4) of a T with 5 df, plotted against values from -4 to 4.]
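One way to see the fatter tails is by simulation. The sketch below is an illustration under assumed settings (a standard normal population, samples of 6, so 5 df): it computes the same statistic twice per sample, once with the known SD and once with the SD estimated from the sample, and counts how often each lands beyond the usual Z cut point of 1.96.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

MU, SIGMA = 0.0, 1.0   # assumed population, used only to generate data
N = 6                  # small sample, so 5 degrees of freedom
TRIALS = 20000
CUT = 1.96             # two-sided 5% cut point for a Z distribution

z_hits = 0
t_hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    mean = statistics.fmean(sample)
    # Known population SD: the statistic follows a Z distribution.
    z = (mean - MU) / (SIGMA / N ** 0.5)
    # SD estimated from the sample: the statistic follows a T distribution.
    t = (mean - MU) / (statistics.stdev(sample) / N ** 0.5)
    z_hits += abs(z) > CUT
    t_hits += abs(t) > CUT

z_rate = z_hits / TRIALS
t_rate = t_hits / TRIALS
print(f"beyond +/-{CUT}: Z {z_rate:.3f}, T {t_rate:.3f}")
```

The T version should land in the tails noticeably more often, which is why a T test needs wider cut points than a Z test at the same alpha.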
Asymptotic T
• As your sample size gets bigger, the T distribution
looks more and more like a Z distribution. By an N of
about 30 it is hard to tell apart from a Z.
[Figure: curve plotted against values from -4 to 4, vertical axis from 0.0 to 0.4.]
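The convergence can be checked numerically. This sketch uses the standard density formulas for Student's T and the standard normal; the grid of values and the choice of df to compare (2, 5, 29, 100) are assumptions for illustration, and `max_gap` is a hypothetical helper.

```python
import math


def t_pdf(x, df):
    """Density of Student's T with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def z_pdf(x):
    """Density of the standard normal (Z) distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


grid = [i / 10 for i in range(-40, 41)]   # values from -4 to 4


def max_gap(df):
    """Largest absolute difference between the T and Z densities on the grid."""
    return max(abs(t_pdf(x, df) - z_pdf(x)) for x in grid)


gaps = {df: max_gap(df) for df in (2, 5, 29, 100)}
for df, gap in gaps.items():
    print(f"df = {df:3d}: max density gap = {gap:.4f}")
```

The gap shrinks steadily as df grows. It never reaches exactly zero, so "indistinguishable" is a judgment about what difference matters in practice.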
Calculate It
• Doing the t-test is trivially
easy. First load the data
into an analysis package.
Graph it and then do the
one sample t-test.
• See the example SAS
Enterprise Guide project.
• The formula for the
statistic sure looks
familiar…
$$t = \frac{\bar{x} - \mu_0}{sd / \sqrt{n}}$$
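For readers without the SAS Enterprise Guide project, the statistic can be computed by hand. This is a minimal Python sketch of the formula only (it does not look up a p-value); `one_sample_t` is a hypothetical helper and the scores are made-up data.

```python
import statistics


def one_sample_t(sample, mu0):
    """t = (sample mean - mu0) / (sample SD / sqrt(n))."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)   # sample SD, n - 1 in the denominator
    return (mean - mu0) / (sd / n ** 0.5)


# Made-up data, for illustration only.
scores = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]
t_stat = one_sample_t(scores, mu0=5.0)
print(f"t = {t_stat:.3f} with {len(scores) - 1} df")
```

The only change from the Z statistic is that the SD comes from the sample, so the result is compared against a T distribution with n - 1 df.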
Two Samples
• If you have two samples, the formula gets
a bit more complicated. Instead of using a
single sample to get the guess for the
population variability, you have two and if
the samples are not of the same size, you
want to put more trust (weight) in the
larger sample.
Estimated Variance
• Basically you take the
weighted average of the
two sample variances,
with a tweak to the
denominator to account
for the fact that you are
estimating population
parameters in the
formula.
$$s_{pooled}^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}$$
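A short sketch of the pooled variance in Python, with made-up data chosen so the weighting is easy to see. `pooled_variance` is a hypothetical helper name.

```python
import statistics


def pooled_variance(sample1, sample2):
    """Weighted average of two sample variances, each weighted by n - 1."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)
    var2 = statistics.variance(sample2)
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)


# Made-up data: a small, spread-out group and a larger, tighter group.
small = [10.0, 14.0, 12.0]
large = [11.0, 12.0, 13.0, 12.0, 11.0, 13.0, 12.0]

pooled = pooled_variance(small, large)
simple_average = (statistics.variance(small) + statistics.variance(large)) / 2

print(f"pooled = {pooled:.3f}, simple average = {simple_average:.3f}")
```

The pooled value sits closer to the larger sample's variance than the simple average does, which is the extra trust in the larger sample.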
The T-Statistic
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_{pooled}^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$$
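Putting the pieces together, here is a minimal sketch of the pooled two-sample t statistic. It assumes the two groups are independent and share a common variance; `two_sample_t` is a hypothetical helper and the data are invented.

```python
import statistics


def two_sample_t(sample1, sample2):
    """Pooled-variance t statistic and its df for two independent samples."""
    n1, n2 = len(sample1), len(sample2)
    pooled = (
        (n1 - 1) * statistics.variance(sample1)
        + (n2 - 1) * statistics.variance(sample2)
    ) / (n1 + n2 - 2)
    # Standard error of the difference between the two means.
    se_diff = (pooled * (1 / n1 + 1 / n2)) ** 0.5
    t = (statistics.fmean(sample1) - statistics.fmean(sample2)) / se_diff
    return t, n1 + n2 - 2


# Made-up data, for illustration only.
treated = [14.0, 16.0, 15.0, 17.0, 18.0]
control = [12.0, 13.0, 14.0, 13.0]

t_stat, df = two_sample_t(treated, control)
print(f"t = {t_stat:.3f} with {df} df")
```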
Paired samples?
• What is your variance like if you sample
the same person before and after a
treatment, compared with sampling two
different people?
• Smaller
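A quick simulation shows why. This sketch assumes each person has a stable baseline plus some measurement noise; the SD values and the treatment effect are made-up numbers.

```python
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable

N_PEOPLE = 2000
PERSON_SD = 10.0   # assumed spread between people
NOISE_SD = 2.0     # assumed noise from one measurement to the next
EFFECT = 1.0       # assumed treatment effect


def baseline():
    """One person's stable underlying level."""
    return random.gauss(50.0, PERSON_SD)


# Paired: the same person is measured before and after.
paired_diffs = []
for _ in range(N_PEOPLE):
    base = baseline()
    before = base + random.gauss(0, NOISE_SD)
    after = base + EFFECT + random.gauss(0, NOISE_SD)
    paired_diffs.append(after - before)

# Unpaired: "before" and "after" come from two different people.
unpaired_diffs = []
for _ in range(N_PEOPLE):
    before = baseline() + random.gauss(0, NOISE_SD)
    after = baseline() + EFFECT + random.gauss(0, NOISE_SD)
    unpaired_diffs.append(after - before)

paired_var = statistics.variance(paired_diffs)
unpaired_var = statistics.variance(unpaired_diffs)
print(f"variance of differences: paired {paired_var:.1f}, unpaired {unpaired_var:.1f}")
```

In the paired case the person's baseline cancels out of the difference, so only the measurement noise is left.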
ANOVA
• To compare three or more groups you will
want to use a method called ANOVA.
Analysis of variance is baffling when you
first see the algebra because you are
looking for differences in group means by
comparing variances.
How ANOVA Works
• Begin by looking at the overall variability in
your data around the overall mean. Then look
at the variability in your data when each value
is compared against its own subgroup mean. If
there is no meaningful effect of the
treatments, the overall variability will look
like the variability relative to the
subgroups.
[Figure: two panels, each plotting change (-0.5 to 1.0) against Dude (5 to 20).]
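The two kinds of variability can be written out as sums of squares. This is a small sketch with made-up scores for three hypothetical treatment groups, picked so the arithmetic is easy to follow.

```python
import statistics

# Made-up outcome scores for three treatment groups.
groups = {
    "a": [4.0, 5.0, 6.0],
    "b": [7.0, 8.0, 9.0],
    "c": [10.0, 11.0, 12.0],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = statistics.fmean(all_values)

# Variability of every observation around the overall mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# Variability of every observation around its own group mean.
ss_within = sum(
    (x - statistics.fmean(values)) ** 2
    for values in groups.values()
    for x in values
)

# Whatever is left over is explained by the differences between groups.
ss_between = ss_total - ss_within

print(f"total {ss_total:.1f}, within {ss_within:.1f}, between {ss_between:.1f}")
```

Here the within-group variability is far smaller than the total, so the grouping explains most of the spread. With no treatment effect the two would be about the same.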
Reduced Variance
• With the T or Z distributions you get
excited if your sample mean is far from the
proposed population mean. Here, you get
excited if the ratio of the two variances is
far from 1. You need a distribution that
can describe the ratio of two variances.
That distribution is the F. It has two
degrees-of-freedom parameters, one for
each half of the fraction: the top depends
on the number of groups and the bottom on
the number of subjects.
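A minimal sketch of the ratio for a one-way design, using the usual between-groups and within-groups mean squares. `f_ratio` is a hypothetical helper and the data are made up. The sketch stops at the statistic and its two df; it does not look up a p-value.

```python
import statistics


def f_ratio(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_values)
    k, n_total = len(groups), len(all_values)

    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1          # top of the fraction: number of groups - 1
    df_within = n_total - k     # bottom: number of subjects - number of groups
    ms_between = ss_between / df_between   # variance estimate from group means
    ms_within = ss_within / df_within      # variance estimate within groups
    return ms_between / ms_within, df_between, df_within


# Made-up data, for illustration only.
data = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
f_stat, df1, df2 = f_ratio(data)
print(f"F = {f_stat:.2f} with ({df1}, {df2}) df")
```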