Statistical Inference
Two Statistical Tasks
1. Description
2. Inference
Thus far, we have completed:
1. Descriptive Statistics
a. Central tendency
i. discrete variables
ii. continuous variables
b. Variation
i. discrete variables
ii. continuous variables
c. Association
i. discrete variables
Now we begin:
2. Inferential Statistics
a. Estimation
b. Hypothesis testing
Inferential statistics are based on
random sampling.
A sample is a subset of some
universe (or population [set]).
If (and only if) the sample is selected
according to the laws of probability,
we can make inferences about the
universe from known (statistical)
characteristics of the sample.
“Random” means selected so
that each element in the
universe has exactly the
same chance of being picked
for the sample (sometimes
called an equi-probability
sample).
Put differently, the only
difference between elements
selected into the sample and
those not selected is pure
chance (i.e., “the luck of the
draw”).
All inferential statistics
evaluate the probability that
unlucky selection in creating a
random sample (the “luck of
the draw,” technically called
“sampling error”) explains the
statistical outcomes obtained
from random samples.
[Figure: three random samples of birds, each of size n = 4 (Sample 1: 75% cardinal; Sample 2: 0% cardinal; Sample 3: 25% cardinal), followed by a frequency sketch of the percent cardinal across random samples. The x-axis runs 0%, 25%, 50%, 75%, 100%; frequency is lowest at the extremes, medium at 25% and 75%, and highest at 50%.]
All statistics calculated on
variables from a random sample
have a (known) sampling
distribution. Sampling
distributions are the theoretically
possible distributions of
statistical outcomes from an
infinite number of random
samples of the same size.
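To make this concrete, the bird example can be simulated. Here is a minimal sketch in Python; the 50 percent cardinal universe and the 10,000 repetitions are illustrative assumptions, while the sample size of 4 matches the samples above.

    # Simulate the sampling distribution of "percent cardinal" for samples of n = 4.
    # Assumption (not given above): the universe is 50% cardinal; 10,000 draws stand
    # in for the "infinite number" of random samples.
    import random
    from collections import Counter

    random.seed(1)

    def percent_cardinal(n=4, p_cardinal=0.5):
        """Draw one random sample of n birds; return the percent that are cardinals."""
        cardinals = sum(random.random() < p_cardinal for _ in range(n))
        return 100 * cardinals // n

    outcomes = Counter(percent_cardinal() for _ in range(10_000))
    for pct in sorted(outcomes):
        print(f"{pct:3d}% cardinal: relative frequency {outcomes[pct] / 10_000:.3f}")
    # Frequencies come out highest at 50% and lowest at 0% and 100%, the same shape
    # sketched in the figure above.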
Knowing this, we do not actually need
to draw an infinite number of random
samples. When we draw ONE
(large) random sample, CHANCES
ARE that its characteristics will be
closer to the center of its sampling
distribution than the extremes. That
is, any sample statistic is likely to be
close to (rather than very different
from) the actual (unknown) value
(parameter) in the universe.
For example, when we find that
the value of χ² for the association
between two variables in a large
random sample is 13.748,
chances are that the (unknown)
value of χ² for the universe (the
so-called “true” value) is similar
rather than very different.
The question is: Does this
sample value of χ² permit us to
infer that the two variables are
(probably) related or are
(probably) independent in the
universe? The answer
requires knowing how to use
the Chi-Square sampling
distribution(s).
Sampling distributions allow us
to identify the probability that
a sample statistic has a similar
value in the universe from
which the random sample was
drawn (that is, whether the
value holds in general, not
merely for the sample).
Unfortunately, χ² has not one but
several sampling distributions,
each differently shaped. The one
that is relevant for the specific
inference we wish to make can
be identified by knowing the
number of degrees of freedom
involved in the calculation of this
sample statistic.
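The point that each degrees-of-freedom value gives a differently shaped distribution can be checked numerically. A minimal sketch, assuming the scipy library is available:

    # Evaluate the chi-square density at a few points for several degrees of freedom.
    # As df grows, the curve shifts to the right and flattens out.
    from scipy.stats import chi2

    for df in (1, 2, 5):
        densities = [round(chi2.pdf(x, df), 3) for x in (1, 2, 4, 8)]
        print(f"df = {df}: density at x = 1, 2, 4, 8 -> {densities}")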
In the case of contingency tables
(crosstabulations), degrees of
freedom associated with χ² are a
function of the size of the table
(i.e., the number of rows and
columns). Specifically,
df = (R – 1)(C – 1)
For example, a contingency
table having two rows and two
columns (i.e., a 2 x 2 table)
has only one degree of
freedom:
df = (R – 1)(C – 1)
= (2 – 1)(2 – 1)
= (1)(1)
= 1
With the marginal totals fixed, watch what happens as the cells of a 2 x 2 table are filled in: once a single cell is known, every remaining cell is determined by the row and column totals.

             Column One   Column Two   Total
  Row One          ?            ?        100
  Row Two          ?            ?        200
  Total          200          100        300

             Column One   Column Two   Total
  Row One         96            ?        100
  Row Two          ?            ?        200
  Total          200          100        300

             Column One   Column Two   Total
  Row One         96            4        100
  Row Two          ?            ?        200
  Total          200          100        300

             Column One   Column Two   Total
  Row One         96            4        100
  Row Two        104            ?        200
  Total          200          100        300

             Column One   Column Two   Total
  Row One         96            4        100
  Row Two        104           96        200
  Total          200          100        300

Only the first cell (96) was free to vary; the other three cells followed from the marginal totals. That is why a 2 x 2 table has one degree of freedom.
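The same bookkeeping can be written out in a few lines of Python. This is only a sketch of the arithmetic above, and the helper function name is my own.

    # With the marginal totals fixed, one chosen cell (96) determines the rest.
    row_totals = [100, 200]
    col_totals = [200, 100]                    # grand total = 300

    cell_11 = 96                               # the single "free" cell
    cell_12 = row_totals[0] - cell_11          # 4
    cell_21 = col_totals[0] - cell_11          # 104
    cell_22 = row_totals[1] - cell_21          # 96
    print(cell_11, cell_12, cell_21, cell_22)  # 96 4 104 96

    # df = (R - 1)(C - 1) counts the free cells for a table of any size.
    def degrees_of_freedom(rows, cols):
        return (rows - 1) * (cols - 1)

    print(degrees_of_freedom(2, 2))            # 1
    print(degrees_of_freedom(3, 2))            # 2 (the party-preference table below)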
An Example

  Party                       Year
  Preference           1984      1985      Total
  Democrat              545       595      1,140
  Independent           528       462        990
  Republican            370       455        825
  Total               1,443     1,512      2,955
For the crosstabulation in this table, χ² = 13.748. Is the association
in this table confined to the sample, or does this mean that there
was a “real” shift in party identification from one year to the next?
There are several steps in answering this question.
Since these data are from a large random sample, we
can use the laws of chance to infer whether this value
represents a “real” shift in the universe (i.e., among
people in the U.S. in general ) or is merely an artifact of
sampling (bad luck in randomly selecting 2,955 people
who are NOT like the rest of the population).
We know that 13.748 is ONE of the values on a
sampling distribution of χ², but which sampling
distribution? Since df = 2 [i.e., (3-1)(2-1)], we can
determine that the sampling distribution is the one
whose values are located in row 2 of the table in
Appendix 4, the “Critical Values of Chi-Square.”
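The values in that row of Appendix 4 can also be reproduced computationally. A minimal sketch, assuming the scipy library is available:

    # Critical values of chi-square for df = 2 at the conventional alpha levels.
    # Each critical value cuts off the upper alpha tail of the distribution.
    from scipy.stats import chi2

    df = 2
    for alpha in (0.10, 0.05, 0.01, 0.001):
        critical = chi2.ppf(1 - alpha, df)
        print(f"alpha = {alpha:<5}  critical value = {critical:.2f}")
    # alpha = 0.05 gives about 5.99, the cut point used below.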
We need a DECISION RULE or CUT POINT to decide
whether this represents a true shift or merely the result
of chance in drawing the random sample.
We must decide what chance of being wrong we want
to entertain in deciding between a “true” relationship
between changes over time and political party
preference (i.e., one that actually exists in the universe)
and an artifact of sampling (i.e., a relationship that
exists nowhere else except in our sample due to the
“luck of the draw”). Actually, with Appendix 4 we are
limited to some conventional probabilities of deciding
incorrectly: 10 percent (.10, column 1), 5 percent (.05,
column 2), 1 percent (.01, column 3), or 1/10 of 1
percent (.001, column 4). Until we have introduced
some additional criteria, let's stick with a 5 percent
chance of incorrectly deciding between a real
association and chance.
This is known as an alpha level (or significance level) and
is expressed as:
α = 0.05
It means that we have only a 5 percent chance of
incorrectly deciding between a true association in the
universe and one due to chance (which exists only in the
sample). In other words, this means that we have a 95
percent chance of being correct in making our inference.
Having decided on an alpha level of .05 (i.e., accepting a
5 percent chance that we will decide incorrectly) and
knowing the appropriate Chi-Square sampling distribution
(one defined by 2 degrees of freedom), we can find the
critical value of χ². From row 2 (df = 2) and column 2 (α
= .05) of Appendix 4, we find that the appropriate critical
value is 5.99. Since χ² for the data was calculated to be
13.748 and since 13.748 is GREATER than the critical
value, we conclude that the odds favor there being a true
association between party preference and year of poll. In
other words, there is less than a 5 percent chance that this
association could be due to chance (by randomly
selecting people who are atypical of the rest of the
population).
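The whole inference can also be reproduced in a few lines of Python. This is a sketch assuming the scipy library is available; it simply reproduces the numbers worked out above.

    # Chi-square test of independence for the party-preference table.
    from scipy.stats import chi2, chi2_contingency

    observed = [                 # rows: Democrat, Independent, Republican
        [545, 595],              # columns: 1984, 1985
        [528, 462],
        [370, 455],
    ]

    chi_sq, p_value, dof, expected = chi2_contingency(observed)
    critical = chi2.ppf(1 - 0.05, dof)       # alpha = .05

    print(f"chi-square = {chi_sq:.3f}, df = {dof}, p = {p_value:.4f}")
    print(f"critical value at alpha = .05: {critical:.2f}")
    print("infer a real association" if chi_sq > critical else "attribute the result to chance")
    # chi-square comes out to about 13.75, above the 5.99 cut point, matching the
    # conclusion reached above.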
Recapitulation
1. Statistical inference involves “generalizing”
from a sample to a (statistical) universe.
2. Statistical inference is only possible with
random samples.
3. Statistical inference estimates the probability
that a sample result could be due to chance
(in sample selection).
4. Sampling distributions are the “keys” that
connect (known) sample statistics and
(unknown) universe parameters.
5. Alpha levels are used to identify “critical
values” on sampling distributions.