Producing Data: Sampling

Producing Data: Sampling
Inference: Probabilities and
Distributions
Week 8 March 1-2, 2011
(REVISED March 8 22:30)
To start….
• To start today I just want to briefly go over
the material we did last time on Chapter 8
and then spend the bulk of tonight on
Material from Chapter 10 and 11
Producing Data
• MY BEST ADVICE REGARDING THE
GENERATION OF DATA SUITABLE FOR
STATISTICAL ANALYSIS IS DON’T!!!!!!!
• HOWEVER, having noted that, you do need to
be aware of how people go about collecting the
data you are using and the limits that are
imposed on your work by these methods
The first half of today’s class will
look at sampling
• Some Key Terms
– Population (the entire group of individual cases about
which we want information)
– Sample (a part of the population from which we
actually collect information and from which we will
draw conclusions about the entire population)
– Sampling Design (the method we use to choose a
sample from the population)
– Inference (the process of generalizing our results
from our sample to the wider population)
The Ideal Sample
• A sound sampling design is essential if we are
going to produce results that are generalizable
from our sample to the entire population
• In the ideal sample design:
1. Each individual case has an equal probability of
being drawn as part of the sample
2. Each characteristic of relevance to the study is
present in the same proportion in our sample as in the
population-at-large. NOTE: by the end of today’s
lesson we will see that this second characteristic is
not essential as long as:
1. The first point is realized
2. The sample we draw is large enough
However, for now we will assume it is a good idea to
aim for both
Garbage In… Garbage Out…
• The truth of the matter is
– It is very hard to absolutely achieve both of the ideal
characteristics of a sample
• For some populations it might be impossible such as
relatively small groups (think about policy-makers and
advisors for example)
– It is getting harder all the time to do so
– The best we can probably ever hope to do is
“approximate” the two characteristics.
• The closer we approximate the two
characteristics, the more accurate our
inferences will be
Sample types
• Convenience Samples
• Voluntary Response Samples
• Purposive Samples
– Snowball Samples
• Simple Random Samples
– Individual cases must be chosen at random, and
in such a way that every possible sample of the
chosen size has an equal chance of being
selected
• Stratified Random Sample
– Sometimes you want to go beyond a Simple
Random Sample to ensure that the
representation of certain characteristics is
not left to chance. This is common in public
opinion polling, where separate Simple Random
Samples are drawn from sub-populations
(strata) identified along geographic lines.
• The next slide portrays a multistage
stratified random sample
National Stratified Random Sample
[Figure: multistage stratified random sample.
Regions: BC, Prairie, ON, QU, Atlantic.
BC is further divided into: Greater Vancouver, Greater Victoria,
Other Van. Island, The South Interior, The North.
Procedure: randomly select a set of Postal Sorting Areas (PSAs)
for each region, then randomly select people in those PSAs.]
• Stratification need not be done on
geographic lines (though it often is)
• Any characteristic can be used as the
basis of stratification.
• What matters is that it is something of
importance to your research
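The difference between a simple random sample and a stratified random sample can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 1,000 numbered people, their region tags, and the function names are all made up for the example and do not come from the book or from any real survey.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, rng):
    """Draw n cases without replacement; every subset of size n is equally likely."""
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Group cases by stratum, then draw a simple random sample inside each one."""
    strata = defaultdict(list)
    for case in population:
        strata[stratum_of(case)].append(case)
    return {name: rng.sample(cases, n_per_stratum) for name, cases in strata.items()}

# Hypothetical population: 1,000 people, each tagged with a made-up region.
rng = random.Random(42)  # fixed seed so the example is repeatable
regions = ["BC", "Prairie", "ON", "QU", "Atlantic"]
population = [(person_id, regions[person_id % 5]) for person_id in range(1000)]

srs = simple_random_sample(population, 50, rng)
stratified = stratified_sample(population, lambda case: case[1], 10, rng)

for region, cases in stratified.items():
    print(region, len(cases))
```

In the simple random sample the number of people from each region is left to chance; in the stratified version each region is guaranteed exactly 10 cases.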
Caution
• As the book notes, a sample’s quality is built on
more than just the sampling technique
• We need to have reliable information about the
population before we can even begin to select a
decent sample
• Non-response, response bias, and many other
factors can also degrade a sample’s quality and
the quality of the data we can get out of it.
A funny little thing called probability
• As we noted earlier, when we take a sample and
conduct a study we want to generalize (or infer)
the results of our study to the wider population
our sample is drawn from.
• If 40 percent of our sample say they will vote
Conservative we would like to estimate that this
is the situation among the general population
• However, we know there is a chance that if we
had drawn two separate samples and done two
simultaneous studies, we might have gotten
different results for each sample
• If the variation of results between samples is too
great, then we cannot generalize our results to
the wider population.
• Therefore we need to know about probability to
estimate the chance that our results will vary
from the actual situation in the population at
large.
• The statistician might tell us there is a 95%
chance that our survey result is accurate plus or
minus 3% (meaning there is a 95% chance that
the real support for the Conservatives among
the general population is between 37% and
43%)
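A rough sketch of where a "plus or minus 3%" figure can come from, using the usual large-sample approximation for a proportion. The sample size of 1,024 is an assumption chosen for illustration (the slide does not state one), and the function name is hypothetical.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    z = 1.96 corresponds to a 95% confidence level under the normal approximation.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.40   # 40 percent of the sample say they will vote Conservative
n = 1024       # ASSUMED sample size; not given in the lecture
moe = margin_of_error(p_hat, n)
low, high = p_hat - moe, p_hat + moe

print(f"{p_hat:.0%} plus or minus {moe:.1%}: between {low:.1%} and {high:.1%}")
```

With these assumed numbers the margin works out to roughly 3 percentage points, which matches the 37% to 43% range in the example. A smaller sample would give a wider range.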
In the Long-Run
• The law of probability is based on a
regularly documented observation that
while chance can produce erratic results
over the short-term or when small
numbers are looked at, it generates
regular and predictable outcomes over the
long-term and as numbers increase
• Random
– Outcomes are uncertain but there is
nonetheless a regular distribution of outcomes
in a large number of repetitions (this is not a
pattern).
• Probability
– The probability of any outcome of a random
phenomenon is the proportion of times the
outcome would occur in a very long series of
repetitions.
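The idea of probability as a long-run proportion can be illustrated with a small simulation of a fair coin. This is a sketch only: the seed, the numbers of tosses, and the function name are arbitrary choices made for the example.

```python
import random

rng = random.Random(2011)  # fixed seed so the run is repeatable

def proportion_heads(n_tosses, rng):
    """Toss a simulated fair coin n_tosses times and return the share of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

results = {n: proportion_heads(n, rng) for n in (10, 100, 10_000, 100_000)}

for n, prop in results.items():
    print(f"{n:>7} tosses: proportion of heads = {prop:.4f}")
```

Short runs can land well away from one half, while the longest run should sit very close to it.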
Examples of Commonly Known Probable
Long-term Results
• Coin Tossing
– Fifty/Fifty
• Stockmarket returns
– Reversion to the mean
• The fall of cards in a game (if you are able
to count them properly and quickly
enough)
Randomness
• Perfect randomness is very rare, and the
ability to select numbers totally at random
(so that every number has just as good a
chance of being selected as any other and
no pattern can ever be predicted) is
valuable
Probability Models
• Sample Space “S” of a random phenomenon is
the set of all possible outcomes
• An Event is an outcome or a set of outcomes of
a random phenomenon (the roll of the dice, the
flip of the coin, etc.)
• A Probability Model is a mathematical
description of a random phenomenon consisting
of
– A sample space S
– A way of assigning probability to events
As in figure 10.2 in the book: There are 36 possible outcomes if
you roll two standard dice, and together they make up the sample space S.
If we wanted to define the event A, “rolling a total of 5”, it would be
comprised of the four outcomes that add up to 5:
A = {roll 1 & 4, roll 2 & 3, roll 3 & 2, roll 4 & 1}
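The two-dice sample space is small enough to list by machine. The sketch below enumerates all ordered pairs and picks out the ones that total 5; the variable names S and A follow the slide, everything else is illustrative.

```python
from itertools import product

# Sample space S: every ordered pair (first die, second die).
S = list(product(range(1, 7), repeat=2))

# Event A: the outcomes in S whose faces add up to 5.
A = [outcome for outcome in S if sum(outcome) == 5]

print(len(S))
print(A)
```

Treating (1, 4) and (4, 1) as different outcomes is what makes all 36 outcomes equally likely.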
Some Formal Probability Rules
• A probability is a number between 0 and 1
– An event with a probability of 0 ought never to occur.
An event with a probability of 1 ought always to occur
• The probabilities of all possible outcomes together
must add up to 1
• If two events have no outcomes in common, the
probability that one or the other occurs is the
sum of their individual probabilities. E.g. if one
event occurs in 40% of cases, and the other in
25%, and the two cannot occur together, then the
probability of one or the other occurring is 65%
• The probability that an event does not occur is 1
minus the probability that it does occur
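For those who prefer symbols, the four rules can be written compactly as follows. The notation (A and B for events, S for the sample space, a superscript c for "does not occur") is the standard one and is assumed here rather than taken from the slides.

```latex
\begin{align*}
&\text{Rule 1:} && 0 \le P(A) \le 1 \\
&\text{Rule 2:} && P(S) = 1 \\
&\text{Rule 3:} && P(A \text{ or } B) = P(A) + P(B)
    \quad \text{if } A \text{ and } B \text{ have no outcomes in common} \\
&\text{Rule 4:} && P(A^{c}) = 1 - P(A) \\[4pt]
&\text{Example:} && P(A) = 0.40,\; P(B) = 0.25
    \;\Rightarrow\; P(A \text{ or } B) = 0.40 + 0.25 = 0.65, \\
& && P(\text{neither}) = 1 - 0.65 = 0.35
\end{align*}
```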
Discrete vs. Continuous Models
• Discrete Probability Models
– Assume that the sample space is finite
– To assign probabilities, list the probabilities of all the
individual outcomes (each must be between 0 and 1, and
together they must add up to 1). The probability of an event
is the sum of the probabilities of the outcomes making up
the event.
– Think of our dice example: What is the probability of
rolling a total of five with two dice?
• roll 1 & 4 = 1/36
• roll 2 & 3 = 1/36
• roll 3 & 2 = 1/36
• roll 4 & 1 = 1/36
• Total Probability = 4/36 = 1/9 = 0.111
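The same bookkeeping can be done for every possible total, not just five. The sketch below builds a discrete probability model for the total of two fair dice using exact fractions; the variable names are illustrative, and the equal 1/36 weight per outcome assumes the dice are fair.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
p_outcome = Fraction(1, len(outcomes))  # 1/36 each, assuming fair dice

# Probability of each total = sum of the probabilities of the outcomes producing it.
model = {}
for first, second in outcomes:
    total = first + second
    model[total] = model.get(total, Fraction(0)) + p_outcome

for total in sorted(model):
    print(total, model[total])

print("all totals together:", sum(model.values()))
```

Two checks are worth making on any discrete model: each probability lies between 0 and 1, and the probabilities add up to exactly 1.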
• Continuous Probability Models
– Assign probabilities as areas under a density curve
(such as the normal curve)
– The area under the curve and above any range of
values is the probability of an outcome in that range
– This is what we did in chapter 3!
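For a continuous model, the probability of a range is an area under the density curve. As a small sketch, Python's standard library can compute such areas for the normal curve; the standard normal (mean 0, standard deviation 1) and the helper name `prob_between` are choices made for this example.

```python
from statistics import NormalDist

def prob_between(dist, low, high):
    """Area under the density curve of dist between low and high."""
    return dist.cdf(high) - dist.cdf(low)

z = NormalDist(mu=0, sigma=1)  # standard normal curve

p_within_one_sd = prob_between(z, -1, 1)
p_within_two_sd = prob_between(z, -2, 2)
p_above_two_sd = 1 - z.cdf(2)  # rule: P(not A) = 1 - P(A)

print(f"within 1 sd: {p_within_one_sd:.4f}")
print(f"within 2 sd: {p_within_two_sd:.4f}")
print(f"above  2 sd: {p_above_two_sd:.4f}")
```

The first two areas are the familiar "about 68%" and "about 95%" figures for the normal curve.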
Sampling Distributions
• Some Key Words
– Parameter: a number that describes the
population. We often can only speculate on
this as we only have data for a sample.
– Statistic: a number that can be computed
from the sample data without making use of
any unknown parameters. We often use
statistics to estimate parameters.
• The Law of Large Numbers:
– Draw observations at random from any
population with finite mean µ.
– As the number of observations drawn
increases, the mean x̄ of the observed values
gets closer and closer to the mean µ of the
population.
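The Law of Large Numbers can be watched in action with simulated draws. In this sketch the population is assumed to be exponential with mean 2 (a deliberately lopsided, non-normal choice made only for illustration), and the sample mean is computed from the first 10, 1,000, and 100,000 draws.

```python
import random
from statistics import fmean

rng = random.Random(8)  # fixed seed so the run is repeatable
MU = 2.0                # ASSUMED population mean for this illustration

# expovariate takes a rate, so rate = 1 / mean.
draws = [rng.expovariate(1 / MU) for _ in range(100_000)]

running_means = {n: fmean(draws[:n]) for n in (10, 1_000, 100_000)}

for n, mean in running_means.items():
    print(f"n = {n:>7}: sample mean = {mean:.3f} (population mean = {MU})")
```

The mean of the first few draws can be well off, but the mean of all 100,000 should sit close to 2.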
Two types of distribution of
variables (be careful)
• The Population Distribution of a variable is
the distribution of values of the variable in
the population
• The Sampling Distribution of a statistic is
the distribution of values taken by the
statistic in all possible samples of the
same size from the same population (in
other words, how a statistic varies in many
samples drawn from the same population)
For those who like math
Suppose that x̄ is the mean of an SRS of size n drawn from a large
population with mean µ and standard deviation σ.
Then the sampling distribution of x̄ has mean µ and standard
deviation σ/√n.
Bottom Line and Caution
• If we can compute the average among a sample
we have a decent guess as to what the average
is in the population
• The average of a sample is generally a better
guess of the average of the population than any
one case in the population. In other words,
based on a survey of incomes, I can tell you with
a reasonable level of certainty what the average
income of Canadians is. I cannot tell you just
from that what the income of any specific
Canadian is.
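How good a guess the sample average is depends on the sample size: the sampling distribution of the mean has standard deviation σ/√n. The sketch below checks this by brute force, drawing 20,000 samples of size 25 from an assumed population (uniform between 0 and 1, so µ = 0.5 and σ = 1/√12). The population, the seed, and the counts are all illustrative choices.

```python
import math
import random
from statistics import fmean, pstdev

rng = random.Random(10)  # fixed seed so the run is repeatable
n = 25                   # size of each sample
n_samples = 20_000       # how many samples we draw

# ASSUMED population: Uniform(0, 1), which has mean 0.5 and sd 1/sqrt(12).
mu = 0.5
sigma = 1 / math.sqrt(12)

sample_means = [
    fmean([rng.random() for _ in range(n)])
    for _ in range(n_samples)
]

observed_mean = fmean(sample_means)
observed_sd = pstdev(sample_means)
predicted_sd = sigma / math.sqrt(n)

print(f"mean of sample means: {observed_mean:.4f} (population mean {mu})")
print(f"sd of sample means:   {observed_sd:.4f} (sigma / sqrt(n) = {predicted_sd:.4f})")
```

Individual values in this population are spread with a standard deviation of about 0.29, while the sample means are spread about five times more tightly, which is why the sample average is a much better guess than any one case.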
The Central Limit Theorem
• When the sample size is large enough,
the sampling distribution of the mean will look
approximately like a normal distribution, even if
the population from which the sample is drawn
does not exhibit a normal distribution for the
variable, as long as the population has a finite
standard deviation
• This means we can use the known properties of
the normal curve to explore our data and study
it, even if we do not know for certain what the
parameter values are for variables in the general
population.
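A quick simulation can make the Central Limit Theorem more concrete. Here the population is assumed to be exponential with mean 1 and standard deviation 1, which is strongly skewed and not at all normal. If the means of samples of size 50 behave like a normal distribution, roughly 68% of them should fall within one standard deviation (σ/√n) of µ and roughly 95% within two. All numbers below are illustrative choices.

```python
import math
import random
from statistics import fmean

rng = random.Random(11)  # fixed seed so the run is repeatable
n = 50                   # size of each sample
n_samples = 10_000       # how many samples we draw

# ASSUMED population: Exponential with rate 1, so mean = 1 and sd = 1 (skewed).
mu = 1.0
sigma = 1.0
sd_of_mean = sigma / math.sqrt(n)

sample_means = [
    fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

share_within_1 = sum(abs(m - mu) <= sd_of_mean for m in sample_means) / n_samples
share_within_2 = sum(abs(m - mu) <= 2 * sd_of_mean for m in sample_means) / n_samples

print(f"within 1 sd of mu: {share_within_1:.3f} (normal curve: about 0.68)")
print(f"within 2 sd of mu: {share_within_2:.3f} (normal curve: about 0.95)")
```

The agreement is approximate, not exact, and it improves as the sample size n grows.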