The Normal Approximation for Data
Sample Surveys
Terminologies
• Investigators usually want to generalize about a class of individuals.
This class is called the population.
• For example, in forecasting the results of a presidential election in the
U.S., one relevant population consists of all eligible voters.
• When studying the whole population is impractical (this is the usual
case), we can examine only part of it. This part is called the sample.
• Investigators make inferences from the sample to the population.
That is, they make generalizations from the part to the whole.
Terminologies
• Usually, there are some numerical facts about the population which
the investigators want to know. Such numerical facts are called
parameters.
• For example, the mean of the distribution for the heights of men in
some large population.
• In general, parameters cannot be determined exactly, but can only be
estimated from a sample. Then a major issue is accuracy.
• Parameters are estimated by statistics (some numbers) which can be
computed from a sample.
• In the previous example, the average of the heights in a sample is a
good statistic for estimating the mean of the whole population.
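The distinction between a parameter and a statistic can be illustrated with a small simulation. This is only a sketch: the population of heights (mean 69 inches, SD 3 inches, 100,000 men) and the sample size of 400 are made-up values chosen for illustration, not figures from the lecture.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: heights (in inches) of 100,000 men.
population = [rng.gauss(69.0, 3.0) for _ in range(100_000)]

# The parameter: the population mean. In a real study this is unknown.
parameter = statistics.mean(population)

# The statistic: the average height in a random sample of 400 men.
sample = rng.sample(population, 400)
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample average):  {statistic:.2f}")
```

In practice only the second number is available to the investigator; the first is what the sample average is meant to estimate.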
Comments
• Statistics are what investigators know; parameters are what they want
to know.
• Estimating parameters from the sample is justified when the sample
represents the population.
• But this is impossible to check just by looking at the sample.
• Instead, one has to look at how the sample was chosen.
• Reason: to see whether the sample is like the population in the ways
that matter, investigators would have to know the facts about the
population that they are trying to estimate.
Choosing a sample
We will see that the method of choosing the sample matters a lot by
looking at an example.
Later we will introduce the best methods for choosing a sample, which
involve probability.
Example
• Background: in 1936, Franklin Delano Roosevelt was completing his
first term of office as president of the U.S.
• It was an election year, and the Republican candidate was Governor
Alfred Landon of Kansas.
• The country was struggling to recover from the Great Depression.
• There were still nine million unemployed: real income had dropped
by one-third in the period 1929-1933 and was just beginning to turn
upward.
• Landon was campaigning on a program of economy in government,
and Roosevelt was defensive about his deficit financing.
Example
• Most observers thought Roosevelt would be an easy winner.
• But the Literary Digest magazine predicted an overwhelming victory
for Landon, with Roosevelt getting only 43% of the popular vote.
• This prediction was based on the largest number of people ever
replying to a poll: about 2.4 million individuals.
• It was backed by the enormous prestige of the Digest.
• However, Roosevelt won the 1936 election by a landslide, 62% to
38%.
• The Digest went bankrupt soon after.
Bias 1
• A sampling procedure should be fair, selecting people for inclusion in
the sample in an impartial way, so as to get a representative cross
section of the public.
• A systematic tendency on the part of the sampling procedure to
exclude one kind of person or another from the sample is called
selection bias.
Example
• To find out where the Digest went wrong, we have to look at how
they picked their sample.
• The Digest mailed questionnaires to 10 million people.
• The names and addresses of these 10 million people came from
sources like telephone books and club membership lists.
• That tended to screen out the poor, who were unlikely to belong to
clubs or have telephones. (At the time, for example, only one
household in four had a telephone.)
• So there was a very strong bias against the poor in the Digest’s
sampling procedure.
Comments
• Prior to 1936, this bias may not have affected the predictions very
much, because the rich and poor voted along similar lines.
• But in 1936, the political split followed economic lines more closely.
The poor voted overwhelmingly for Roosevelt; the rich were for
Landon.
• So the first reason for the Digest’s error was selection bias.
• When a selection procedure is biased, taking a large sample does not
help. This just repeats the basic mistake on a larger scale.
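The point that a larger sample does not cure selection bias can be sketched in a simulation. All numbers here are invented for illustration and are not the 1936 data: a population of 200,000 voters, a quarter of whom are in the sampling frame (say, telephone owners), with 40% support for the candidate inside the frame and 69% outside it.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

def make_voter():
    """Return (in_frame, supports_candidate) for one hypothetical voter."""
    in_frame = rng.random() < 0.25           # e.g. owns a telephone
    support_rate = 0.40 if in_frame else 0.69
    return in_frame, rng.random() < support_rate

population = [make_voter() for _ in range(200_000)]
true_pct = 100 * sum(vote for _, vote in population) / len(population)

# The biased procedure can only reach people who are in the frame.
frame = [vote for in_frame, vote in population if in_frame]

def biased_estimate(n):
    """Percentage of supporters in a random sample of n people from the frame."""
    return 100 * sum(rng.sample(frame, n)) / n

small = biased_estimate(1_000)
large = biased_estimate(40_000)

print(f"true percentage in population: {true_pct:.1f}%")
print(f"biased sample, n = 1,000:      {small:.1f}%")
print(f"biased sample, n = 40,000:     {large:.1f}%")
```

Both estimates settle near the percentage inside the frame rather than the percentage in the population; the larger sample only pins down the wrong number more precisely.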
Bias 2
• After deciding which people ought to be in the sample, a survey
organization still has to get their opinions.
• If a large number of those selected for the sample do not in fact
respond to the questionnaire or the interview, non-response bias is
likely to happen.
• The non-respondents differ from the respondents in one obvious
way: they did not respond. Experience shows they tend to differ in
other important ways as well.
Example
• The Digest did very badly at the first step in sampling. But there is also
a second step.
• Only 2.4 million people bothered to reply, out of the 10
million who got the questionnaire.
• These 2.4 million respondents do not even represent the 10 million
people who were polled, let alone the population of all voters.
• For example, the Digest made a special survey, with questionnaires
mailed to every third registered voter in Chicago.
• About 20% responded, and of those who responded over half favored
Landon.
• But in the election Chicago went for Roosevelt, by a two-to-one
margin.
Comments
• The Digest poll was spoiled both by selection bias and non-response
bias.
• Non-respondents can be very different from respondents. When
there is a high non-response rate, look out for non-response bias.
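Non-response bias can be sketched the same way. The figures are assumptions made up for this example, not survey data: questionnaires go to 30,000 randomly chosen voters, 65% of whom favor candidate R, and supporters of R reply with chance 15% while the others reply with chance 35%.

```python
import random

rng = random.Random(2)  # fixed seed so the run is repeatable

N_MAILED = 30_000

# True opinion of each person who was mailed a questionnaire.
# The mailing list is random, so there is no selection bias at this step.
favors_r = [rng.random() < 0.65 for _ in range(N_MAILED)]

def responds(favors):
    """Hypothetical response behavior: opponents of R reply more often."""
    return rng.random() < (0.15 if favors else 0.35)

# Opinions of the people who actually sent the questionnaire back.
respondents = [favors for favors in favors_r if responds(favors)]

response_rate = 100 * len(respondents) / N_MAILED
pct_mailed = 100 * sum(favors_r) / N_MAILED
pct_respondents = 100 * sum(respondents) / len(respondents)

print(f"response rate:                 {response_rate:.1f}%")
print(f"favor R among those mailed:    {pct_mailed:.1f}%")
print(f"favor R among the respondents: {pct_respondents:.1f}%")
```

Under these assumptions the respondents show a majority against R even though about two thirds of the people polled favor R, which is the same pattern as the Chicago survey described above.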
Remarks
• Special surveys have been carried out to measure the difference
between respondents and non-respondents.
• It turns out that lower-income and upper-income people tend not to
respond to questionnaires, so the middle class is over-represented
among respondents.
• For these reasons, modern survey organizations prefer to use
personal interviews rather than mailed questionnaires.
• A typical response rate for personal interviews is 65%, compared to
25% for mailed questionnaires.
Remarks
• But the problem of non-response bias still remains, even with
personal interviews.
• Those who are not at home when the interviewer calls may be quite
different from those who are at home, with respect to working hours,
family ties, social background, and therefore with respect to
attitudes.
• Good survey organizations keep this problem in mind, and have
ingenious methods for dealing with it. (See reading materials for the
Gallup poll.)
Summary for choosing a sample
• Some samples are really bad. To find out whether a sample is any
good, ask how it was chosen.
• Was there selection bias?
• Was there non-response bias?
• You may not be able to answer these questions just by looking at the
data.
Probability methods
Probability methods use objective and impartial chance mechanisms to
select the sample. Quota sampling, by contrast, is not a probability
method. (See reading materials.)
What is a probability method?
• A probability method draws the sample using a definite chance procedure.
• For example, suppose we carry out a survey of 100 voters in a small
town with a population of 1,000 eligible voters.
• We list all the eligible voters, write the name of each one on a
ticket, put all 1,000 tickets in a box, and draw 100 tickets at random.
• Since there is no point in interviewing the same person twice, the draws
are made without replacement.
• The people whose tickets have been drawn form the sample.
Simple random sampling
• The process is called simple random sampling: tickets have simply
been drawn at random without replacement.
• At each draw, every ticket in the box has an equal chance to be
chosen.
• The interviewers have no discretion at all in whom they interview, and
the procedure is impartial: everybody has the same chance to get
into the sample.
• The law of averages guarantees that the percentage of any group
(e.g. Democrats) in the sample is likely to be close to the
percentage of that group in the population.
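The ticket-drawing procedure for the small town can be sketched with the standard library's `random.sample`, which draws without replacement. The voter names and the figure of 550 Democrats among the 1,000 voters are hypothetical, added only so there is a percentage to compare.

```python
import random

rng = random.Random(3)  # fixed seed so the run is repeatable

# The "box": one ticket per eligible voter in the town.
voters = [f"voter_{i}" for i in range(1000)]

# Hypothetical party affiliation: the first 550 voters are Democrats.
is_democrat = {name: i < 550 for i, name in enumerate(voters)}

# Draw 100 tickets at random without replacement.
sample = rng.sample(voters, 100)

population_pct = 100 * sum(is_democrat.values()) / len(voters)
sample_pct = 100 * sum(is_democrat[name] for name in sample) / len(sample)

# With this design every voter has the same chance of being in the sample.
inclusion_chance = len(sample) / len(voters)

print(f"Democrats in population: {population_pct:.0f}%")
print(f"Democrats in sample:     {sample_pct:.0f}%")
print(f"chance a given voter is sampled: {inclusion_chance:.0%}")
```

No voter can appear twice in `sample`, and the interviewer has no say in who is chosen.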
Comments
• For a large population, it is usually not practical to take a simple
random sample.
• For example, to predict a presidential election, we first need a list of
all the eligible voters: over 200 million names. There is no such list.
• Even if there were, drawing a few thousand names at random from
200 million is not an easy job, since we have to make every name in
the box have an equal chance of being selected.
• Even if we could draw a simple random sample, the people would be
scattered all over the map. It would be prohibitively expensive to
send interviewers around to find them all.
• So in practice, we use another probability method instead.
Multistage cluster sampling
• We describe the idea of multistage cluster sampling by using the
following example:
• During the period from 1952 through 1984, the Gallup pre-election
surveys were all done using just about the same procedure.
• The Gallup Poll makes a separate study in each of the four geographic
regions of the U.S.: Northeast, South, Midwest, and West.
• Within each region, they group together all the population centers of
similar sizes. One such grouping might be all towns in the Northeast
with a population between 50 and 250 thousand.
• Then, a random sample of these towns is selected.
Multistage cluster sampling
• Interviewers are stationed in the selected towns, and no interviews
are conducted in the other towns of that group.
• Other groupings are handled the same way. This completes the 1st
stage of sampling.
• For election purposes, each town is divided up into wards, and the
wards are subdivided into precincts.
• At the 2nd stage of sampling, some wards are selected at random from
each town chosen in the stage before.
Multistage cluster sampling
• At the 3rd stage, some precincts are drawn at random from each of
the previously selected wards.
• At the 4th stage, households are drawn at random from each selected
precinct.
• Finally, some members of the selected households are interviewed.
• Remember, no discretion is allowed. (e.g. Gallup Poll interviewers are
instructed to “speak to the youngest man 18 or older at home, or if
no man is at home, the oldest woman 18 or older”.)
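The four stages can be sketched for a single grouping of towns. The frame is entirely hypothetical (20 towns, 6 wards per town, 5 precincts per ward, 50 households per precinct), and so are the numbers drawn at each stage; the actual Gallup counts are not given in these notes.

```python
import random

rng = random.Random(4)  # fixed seed so the run is repeatable

# Hypothetical frame for one grouping of similar-sized towns.
N_TOWNS, N_WARDS, N_PRECINCTS, N_HOUSEHOLDS = 20, 6, 5, 50

def households_in(town, ward, precinct):
    """List the household IDs in one precinct (made-up identifiers)."""
    return [(town, ward, precinct, h) for h in range(N_HOUSEHOLDS)]

selected = []
for town in rng.sample(range(N_TOWNS), 4):                  # stage 1: towns
    for ward in rng.sample(range(N_WARDS), 2):              # stage 2: wards
        for precinct in rng.sample(range(N_PRECINCTS), 2):  # stage 3: precincts
            # stage 4: households within the selected precinct
            selected.extend(rng.sample(households_in(town, ward, precinct), 5))

print(f"households selected: {len(selected)}")
print("first few:", selected[:3])
```

Every choice is made by the chance mechanism, and interviewers only visit the 4 selected towns, which is what keeps travel costs down compared with a simple random sample.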
The figure for Multistage cluster sampling
The advantage
• The method is set up so the distribution of the sample by residence is
the same as the distribution for the nation.
• Each stage in the selection procedure uses an objective and impartial
chance mechanism to select the sample units.
• So there is no selection bias on the part of the interviewer.
• Note: there could be selection bias in the other parts (e.g. the separate
study in each of the four geographical regions, dividing towns into
wards, wards into precincts, and so on).
The Gallup Poll record
• There are three points to notice:
• 1st. The sample size has gone down sharply. They used a sample of
size about 50,000 in 1948. But they now use samples less than a
tenth of that size.
• 2nd. There is no longer any consistent trend favoring either
Republicans or Democrats.
• 3rd. The accuracy has gone up appreciably.
• Using probability methods to select the sample, the Gallup Poll has
been able to predict the elections with startling accuracy, sampling
fewer than 5 persons in 100,000, which proves the value of
probability methods in sampling.
Remarks
• Simple random sampling is the basic probability method.
• Other methods can be quite complicated.
• But all probability methods for sampling have two important features:
• The interviewers have no discretion at all as to whom they interview;
• There is a definite procedure for selecting the sample, and it involves
the planned use of chance.
• As a result, with a probability method it is possible to compute the
chance that any particular individual in the population will get into
the sample.
• Note: to minimize bias, an impartial and objective probability method
should be used to choose the sample.
Telephone Surveys
Many surveys are now conducted by telephone. The savings in costs are
dramatic.
How to pick a sample?
• Here are two examples:
• In 1988, the Gallup Poll used a multistage cluster sample based on area
codes, “exchanges”, and “banks”.
• For example: Area code-Exchange-Bank-Digits: 415-767-26-76.
• In 1992, they switched to a simpler design.
• There are 4 time zones in the U.S. The Gallup Poll divided each zone into 3
types of areas, according to population density (heavy, medium, light). That
gives 12 strata.
• Within each stratum, they drew a simple random sample of telephone
numbers, using the computer to exclude businesses by checking the yellow
pages.
• Choosing telephone numbers at random is called RDD: random digit
dialing.
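The 1992 design can be sketched roughly as follows. The 12 strata come from the description above; everything else is an assumption made for illustration: the prefixes are made-up placeholders, the number drawn per stratum is arbitrary, and the step that screens out business numbers is omitted.

```python
import itertools
import random

rng = random.Random(5)  # fixed seed so the run is repeatable

time_zones = ["Eastern", "Central", "Mountain", "Pacific"]
densities = ["heavy", "medium", "light"]
strata = list(itertools.product(time_zones, densities))  # 4 x 3 = 12 strata

# Placeholder prefixes (area code + exchange) for each stratum; not real data.
prefixes = {
    stratum: [f"{200 + 10 * i + j}-555" for j in range(3)]
    for i, stratum in enumerate(strata)
}

def random_digit_dial(stratum):
    """Pick a prefix in the stratum, then choose the last four digits at random."""
    prefix = rng.choice(prefixes[stratum])
    return f"{prefix}-{rng.randrange(10_000):04d}"

NUMBERS_PER_STRATUM = 5  # arbitrary value for the sketch
rdd_sample = {
    stratum: [random_digit_dial(stratum) for _ in range(NUMBERS_PER_STRATUM)]
    for stratum in strata
}

for stratum, numbers in rdd_sample.items():
    print(stratum, numbers)
```

Because the digits are generated at random rather than looked up in a directory, unlisted numbers are as likely to be dialed as listed ones.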
Comments
• Non-respondents create problems, as usual.
• The Gallup Poll does most of its interviewing on evenings and
weekends, when people are more likely to be at home.
• If there is no answer, the interviewer will call back up to 3 times.
(Some designs have up to 15 call-backs. That is better, but more
expensive.)
• For many purposes, results are comparable to those from face-to-face
interviews, and the cost is about 1/3 as much. This is why survey
organizations are using the telephone.
Remarks
• People who do not have phones must be different from the rest of us,
and that does cause a bias in telephone surveys.
• But the effect is small, because these days nearly everybody has a
phone.
• On the other hand, about 1/3 of residential telephones are unlisted.
Rich people and poor people are more likely to have unlisted
numbers, so the telephone book tilts toward the middle class.
• Sampling from directories would create a real bias, but random digit
dialing gets around this difficulty.
Chance error and bias
We have introduced the practical difficulties faced by real survey
organizations.
However, even if all these difficulties are assumed away, the sample is
still likely to be off, due to chance error.
Chance error
• Suppose we have a box with a very large number of tickets, some
marked 1 and the others marked 0. That is the population.
• We want to estimate the percentage of 1’s in the box. That is the
parameter.
• We draw 1,000 tickets at random without replacement. That is the
sample.
• In this case, there is no problem about the response. Also, drawing
tickets at random eliminates selection bias.
• As a result, the percentage of 1’s in the sample is going to be a good
estimate for the percentage of 1’s in the box.
Chance error
• But the estimate is still likely to be a bit off, because the sample is
only part of the population.
• Since the sample is chosen at random, the amount off is governed by
chance:
• Percentage of 1’s in sample = percentage of 1’s in box + chance error.
• In other situations, if we take bias into account:
• Estimate = parameter + bias + chance error.
• Chance error is often called “sampling error”. This error comes from
the fact that the sample is only part of the whole.
• Bias is called “non-sampling error”. This error comes from other sources,
like non-response. Bias is often a more serious problem than chance
error.
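The box model above can be simulated to see chance error on its own, with no bias. The box contents are an assumption for the sketch: 100,000 tickets, 60% of them marked 1. The draw of 1,000 tickets without replacement is repeated 200 times so the spread of the chance errors is visible.

```python
import random
import statistics

rng = random.Random(6)  # fixed seed so the run is repeatable

# Hypothetical box: 60,000 tickets marked 1 and 40,000 marked 0.
box = [1] * 60_000 + [0] * 40_000
parameter = 100 * sum(box) / len(box)  # percentage of 1's in the box

def sample_pct(n=1000):
    """Percentage of 1's among n tickets drawn at random without replacement."""
    return 100 * sum(rng.sample(box, n)) / n

# Chance error = percentage of 1's in sample - percentage of 1's in box.
errors = [sample_pct() - parameter for _ in range(200)]

mean_error = statistics.mean(errors)
typical_size = statistics.pstdev(errors)

print(f"parameter:                    {parameter:.1f}%")
print(f"average chance error:         {mean_error:+.2f} percentage points")
print(f"typical size of chance error: {typical_size:.2f} percentage points")
```

The chance errors scatter on both sides of zero, so they roughly cancel on average. A bias term, by contrast, would push every estimate in the same direction.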
Some natural questions
• As in previous chapters, when we have chance error in an equation, we
usually ask about chance errors:
• How big are they likely to be?
• How do they depend on the size of the sample?
• Or do they depend on the size of the population?
• How big does the sample have to be in order to keep the chance
errors under control?
• We will study these topics next in class.
Summary
• A sample is part of a population.
• A parameter is a numerical fact about a population. Usually a
parameter cannot be determined exactly, but can only be estimated.
• A statistic can be computed from a sample, and used to estimate a
parameter. The statistic is what the investigator knows. A parameter is
what the investigator wants to know. The major issue is accuracy.
• When evaluating a sample survey, ask yourself: What is the population?
What is the parameter? How is the sample chosen? What is the response
rate? Watch out for selection bias and non-response bias.
• Large samples offer no protection against bias.
Summary
• Probability methods for sampling use an objective and impartial
process to pick the sample, and leave no discretion to the interviewer.
• The investigator can compute the probability that any particular
individual in the population will be selected for the sample.
Probability methods guard against bias, because blind chance is
impartial.
• Even when using probability methods, bias may come in. Then the
estimate differs from the parameter, due to bias and chance error:
• Estimate = parameter + bias + chance error.