Transcript lecture 2

Introduction to Data Analysis.
Sampling
Today’s lecture

Sampling (A&F 2)
Why sample?
Random sampling.
Other sampling methods.


Stata stuff in Lab
Sampling introduction
Last week we were talking about populations
(albeit in some cases small ones, such as my
friends).
Often the numbers we see reported relate not to a
population, but to a sample of that population.


Newspapers report the percentage of the electorate
thinking Tony Blair is trustworthy, but this is really the
percentage of their sample (say 1000 people) that they
asked about Blair’s trustworthiness.
Samples and populations

For that statistic from the newspaper’s sample to be
useful, the sample has to be ‘representative’.


i.e. the % saying Blair is trustworthy in the newspaper’s survey (the
sample of 1000 people) needs to be similar to the % in the electorate
(the population of 40 million people).
An intuitively obvious way of doing this is to pick
1000 people at random.


For the survey, metaphorically (or literally with a big hat) put every
elector’s name into a hat and pull out 1000 names.
For a random sample of people in a large classroom, I could sample
every 10th person along each row.
Why sample?

Cost.
We could ask all 40 million people that are eligible to
vote in Britain. This would prove somewhat expensive.
The last British census cost £220 million…


Speed.


Equally the last British census took 5 years to process
the data…
Impossibility.

Consuming every bottle of wine from a vineyard to
assess its quality leaves no wine to sell…
Why random?

Random sampling allows us to apply
probability theory to our samples.
This means that we can assess how likely it is (given
how big our sample is) that our sample is representative.
We will deal with this in more detail later on.
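As a rough illustration of the point about sample size, here is a minimal Python sketch. It assumes a hypothetical electorate in which 40% say Blair is trustworthy (an invented figure, not one from the lecture) and treats each respondent as an independent draw, which is a reasonable approximation when the population is far larger than the sample. It draws many random samples of 1000 and counts how often the sample percentage lands within 3 points of the population value.

```python
import random

def sample_percentages(true_share, n, repeats, seed=0):
    """Draw `repeats` random samples of size n and return each sample's percentage."""
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        # Each respondent says "trustworthy" with probability true_share.
        yes = sum(rng.random() < true_share for _ in range(n))
        results.append(100 * yes / n)
    return results

# true_share = 0.40 is a made-up value used only for illustration.
percentages = sample_percentages(true_share=0.40, n=1000, repeats=200)
within_3_points = sum(abs(p - 40) <= 3 for p in percentages) / len(percentages)
print(f"Share of samples within 3 points of the truth: {within_3_points:.2f}")
```

Re-running with a smaller n (say 100) should show the sample percentages scattering more widely around the population value.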


Intuitively, non-random sampling doesn’t
seem a very good idea.

Who’s heard of Alf Landon?
Alf vs. FDR

In 1936 the Literary Digest magazine
predicted that the Republican Presidential
candidate (Landon) would beat FDR.
The LD sent out 10 million questionnaires. Of the 2½
million that were sent back, a large majority claimed to
be voting Republican at the election.
The LD wanted to estimate the % of voters for each
candidate (the parameter), and used the proportion from
their sample (the statistic) to estimate this.


But, FDR won…
Why did the LD get it wrong?

LD’s sample was large, but unrepresentative.
They did not send questionnaires to randomly selected
people, but rather to people on lists of club
members and lists of car/telephone owners.
These people were wealthier and therefore more likely
to vote Republican; the sample was not representative of
the US electorate as a whole.


The LD’s sampling frame was not the
population (the electorate), but a wealthy
subset of the population.
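The effect of a biased sampling frame can be sketched with a small simulation. Everything numeric here is an assumption made up for illustration (30% of voters wealthy, 65% of the wealthy and 35% of everyone else voting Republican); none of it is the real 1936 data. The sketch compares a large sample drawn only from the wealthy subset with a much smaller simple random sample from the whole electorate.

```python
import random

rng = random.Random(1)

# Hypothetical electorate: every number here is invented for illustration.
electorate = []
for _ in range(100_000):
    wealthy = rng.random() < 0.30
    # Assume wealthier voters lean Republican and other voters lean Democrat.
    republican = rng.random() < (0.65 if wealthy else 0.35)
    electorate.append((wealthy, republican))

def republican_share(people):
    """Proportion of the given people who vote Republican."""
    return sum(rep for _, rep in people) / len(people)

# The parameter: the Republican share in the whole electorate.
parameter = republican_share(electorate)

# A biased sampling frame: only the wealthy subset can be sampled.
frame = [person for person in electorate if person[0]]
biased_statistic = republican_share(rng.sample(frame, 5000))

# A much smaller simple random sample from the whole electorate.
random_statistic = republican_share(rng.sample(electorate, 500))

print(f"parameter {parameter:.3f}, biased {biased_statistic:.3f}, "
      f"random {random_statistic:.3f}")
```

Under these assumptions the small random sample should land closer to the parameter than the large biased one, which is the LD's problem in miniature: size does not make up for an unrepresentative frame.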
Non-probability sampling

The moral being…


If we don’t sample randomly, and instead use non-probability
sampling, then we are likely to get sample statistics that are not
similar to the population.
e.g. Newspapers and TV regularly invite readers/viewers
to ring up and ‘register their opinion’.


The Scottish Daily Mirror ran a poll in 2001 on who should be the new
First Minister. One of Jack McConnell’s fellow MSPs rang up
169 times to indicate he should take the position…
If the Daily Mail and Independent hold phone-in polls on the same
issue, the results will be different, as the samples differ from one
another in a non-random way (social class, ideology, etc.).
Experimental designs


Randomness is also useful in experimental sciences,
just as with observational data.
If we are giving one set of subjects a treatment and
one group nothing, then ideally we would randomly
select who is in each group.



e.g. psychiatrists studying a drug for manic depressives would give
the drug to one group and a placebo to the other.
Their results are no good if the groups are initially different (say by
age, sex, etc.).
Random assignment to groups makes large initial differences unlikely,
and allows us to test how likely it is that the drug has a real effect.
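Random assignment to a drug group and a placebo group might be sketched as follows. The subject labels and the function name randomly_assign are hypothetical, chosen only for this example; the idea is simply to shuffle the subjects and split the shuffled list in half.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Twenty made-up subject labels, for illustration only.
subjects = [f"subject_{i}" for i in range(1, 21)]
drug_group, placebo_group = randomly_assign(subjects, seed=42)
print("Drug:", drug_group)
print("Placebo:", placebo_group)
```

Because chance alone decides who gets the drug, age, sex and so on have no systematic way of lining up with the treatment.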
Simple random sampling (SRS)

‘Names out of a hat’ sampling. Choose the sample
size n that we want, and then randomly
pick n observations from the
population.
Each member of the population is equally likely to be
sampled.
e.g. if I wanted a sample from the room, then I might
give everyone a number, and then use a table of random
numbers to pick out 10 people. Any method that picks
people randomly is acceptable.
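The ‘give everyone a number’ approach can be sketched in a few lines of Python, with the standard library's random number generator standing in for the table of random numbers. The room of 60 people is an assumption made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Pick n distinct members; each member is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# A hypothetical room of 60 people, numbered 1 to 60.
room = list(range(1, 61))
chosen = simple_random_sample(room, 10, seed=7)
print(sorted(chosen))
```

random.sample draws without replacement, so nobody can be picked twice, just as a name pulled out of the hat does not go back in.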

Problems with SRS

A random sample may not include enough of a
particular interesting group for analysis.


If we are interested in experiences of racism, 100 random people will on
average include 85 whites, and an individual sample will
potentially have even fewer (maybe even zero) non-whites.
Can be costly and difficult.


A random sample of 50 school-children might include 49 in
England and Wales, and one in the Orkneys.
A complete list of every school-child might be possible to obtain,
but what of every person living in Britain? A list of the population
of interest is not always available.
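To put rough numbers on the first problem, here is a small sketch using the binomial distribution. It assumes that 15% of the population is non-white (which is what an average of 85 whites per 100 implies) and that each of the 100 people is an independent draw, a fair approximation when the population is very large. The function name prob_at_most is hypothetical.

```python
from math import comb

def prob_at_most(k, n, p):
    """P(X <= k) when X counts successes in n independent draws with chance p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, p = 100, 0.15  # sample size and assumed non-white share

p_zero = prob_at_most(0, n, p)        # no non-whites at all
p_at_most_10 = prob_at_most(10, n, p)  # ten or fewer non-whites

print(f"P(zero non-whites)      = {p_zero:.2e}")
print(f"P(ten or fewer)         = {p_at_most_10:.3f}")
```

Under these assumptions a sample with no non-whites at all is possible but very unlikely at n = 100; the more practical worry is ending up with too few for a separate analysis.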
Solutions (1)

Stratified random sampling.


Two stages: classify population members into groups,
then select by SRS within those groups.
e.g. ‘over-sample’ non-whites for our racism
study.
Once we had divided the population by race, we would
SRS within those racial groups.
Might take 50 whites and 50 non-whites for our sample
if we were interested in comparing experiences of
racism.
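The two stages can be sketched as below. The population of 1000 people (850 white, 150 non-white) and the function name stratified_sample are assumptions made for illustration; the sketch first classifies members into strata and then takes a simple random sample within each one.

```python
import random

def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Stage 1: group members by stratum. Stage 2: SRS within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, n_per_stratum[name]))
    return sample

# Hypothetical population: (id, group) pairs, 850 white and 150 non-white.
population = [(i, "white" if i < 850 else "non-white") for i in range(1000)]

sample = stratified_sample(
    population,
    stratum_of=lambda member: member[1],
    n_per_stratum={"white": 50, "non-white": 50},
    seed=3,
)
print(len(sample), "people sampled")
```

Note that non-whites are deliberately over-sampled here, so any estimate for the population as a whole would need to weight the two groups back to their true shares.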

Solutions (2)

Cluster random sampling.


If population members are naturally clustered, then we
SRS those clusters and then SRS the population
members within those clusters.
Pupils in schools are naturally grouped by
school.
We may not have a list of every school-child, but we do
have a list of every school.
Again two stages. We randomly pick 5 schools, and
then randomly pick 10 children in each school.
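The schools example might look like this in Python. The 20 schools of 30 pupils each, and the function name cluster_sample, are invented for illustration. Only the list of schools is needed up front; pupil lists are consulted only for the schools that get picked.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: SRS of clusters. Stage 2: SRS of members within each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

# Hypothetical data: 20 schools, each with 30 pupils.
schools = {
    f"school_{s}": [f"pupil_{s}_{p}" for p in range(30)]
    for s in range(20)
}

sample = cluster_sample(schools, n_clusters=5, n_per_cluster=10, seed=11)
for school, pupils in sample.items():
    print(school, pupils[:3], "...")
```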

One further problem
This is not to say that all problems with
random sampling are soluble.
Non-response.

Not all members of our chosen sample may respond,
particularly when sanctions are nil and incentives are
low (or in fact usually negative…).
This can matter if non-response is non-random, i.e. if
certain types of people tend to respond and others do
not.

Non-response

In 1992 opinion polls predicted a Labour
victory, yet the Conservatives were returned
by a large majority of votes (if not seats).
One of the (many) factors that may have caused this
bias in the polls was that Conservative voters were less
likely to respond to surveys than other voters.
If the members of the sample who choose not to respond
are different to those who do, then we have a biased
sample. More on bias later on.
Ultimately, this is tricky to deal with. Some more on this later
this semester.
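Non-random non-response can be sketched with another small simulation. All the figures are invented for illustration and are not the 1992 data: 45% of a randomly chosen sample are assumed to be Conservative voters, who respond half the time, while other voters respond 70% of the time.

```python
import random

rng = random.Random(5)

# Hypothetical randomly chosen sample; every rate below is made up.
sample = []
for _ in range(20_000):
    conservative = rng.random() < 0.45
    # Assume Conservative voters are less likely to answer the survey.
    responds = rng.random() < (0.5 if conservative else 0.7)
    sample.append((conservative, responds))

# Conservative share among everyone we tried to survey.
true_share = sum(con for con, _ in sample) / len(sample)

# Conservative share among those who actually answered.
respondents = [con for con, responded in sample if responded]
observed_share = sum(respondents) / len(respondents)

print(f"chosen sample: {true_share:.3f}, respondents only: {observed_share:.3f}")
```

Even though the people approached were chosen at random, the respondents under-represent Conservatives under these assumptions, so the statistic is biased.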

Sampling – a summary
Sampling is an easy way of collecting
information about a population.
SRS means everyone in the population of
interest has the same chance of being selected.
We often use slightly different methods to SRS
to overcome certain problems.
Random sampling allows us to estimate the
probability of the sample being similar to the
population.
