The Scientific Study of Politics (POL 51)
Professor B. Jones
University of California, Davis
Today
Sampling Plans
Survey Research
More fun with simulations
#Simulate a "population": 10,000 draws from a normal with mean 5 and s.d. 2
samplesize<-10000
population<-rnorm(samplesize, 5, 2)
truth<-mean(population)    #the true population mean
sdtruth<-sd(population)    #the true population s.d.
truth
sdtruth
Here’s what I know in the “population”:
> truth
[1] 5.002265
> sdtruth
[1] 2.003601
What do my samples look like?
#Simple random samples of 10, 50, and 100 from the population
ten<-sample(population, 10, replace=F)
m1<-mean(ten); m1
sd1<-sd(ten)
hist(ten)
fifty<-sample(population, 50, replace=F)
m2<-mean(fifty); m2
sd2<-sd(fifty)
hist(fifty)
hundred<-sample(population, 100, replace=F)
m3<-mean(hundred); m3
sd3<-sd(hundred)
hist(hundred)
…
Sample Sizes
In general, we’ve seen that larger sample sizes yield more accurate conclusions.
The differences between very large and merely “large” samples, though, may be negligible.
This requires us to turn to the concepts of repeated sampling and sample variability.
Polls and Repeated Sampling
As individual researchers, you usually have one “shot” at it.
Classical statistical theory relies on the concept of long-run probability:
Repeated trials
…the law of large numbers
…the central limit theorem
Maybe concepts you have heard of before? …or not.
Side-trip to the 2008 Presidential Election
Pollster.com allows us to think about “repeated” sampling.
The site bases its analysis on all available polls.
Why might this be a good thing?
There is sampling variability in individual samples.
Let’s look at the polls leading up to the 2008 election.
What are the “dots”?
The blue dots are Obama percentages (estimates).
The red dots are McCain’s.
(Each blue dot has a corresponding red one.)
Note the variability across samples: sampling frames and methodologies differ.
Combine them, and you get a better picture.
Look at solid red and blue states.
Polls
Note how the polls seem to be “clustering” as the election gets closer.
Why?
Undecideds deciding?
More certainty?
Let’s look at close states.
Understanding variability
We kind of see “repeated sampling” here.
The basic idea:
the “truth” will be revealed if you just sample enough,
but any one sample may be off in one direction or another.
Back to sampling
Let’s simulate repeated sampling in R
More Simulation
The Population
N = 1,000,000
Mean of the population is 0.4992135.
R Code:
#"The Population"
X<-runif(1000000,.01,.99)
meanX <- mean(X); meanX
Let’s sample n = 500, 1000, 5000.
First sample: Mean = .4692207
Second sample: Mean = .5004778
Third sample: Mean = .5027007
#Some Samples: First, sample 1, n=500, evaluate:
set.seed(52151)
nsamp <- 1
res <- numeric(nsamp)
for (i in 1:nsamp) res[i] <- mean(sample(X, 500, replace = FALSE))
mean(res)
#Some Samples: Second, sample 2, n=1000, evaluate:
set.seed(110789008)
nsamp <- 1
res <- numeric(nsamp)
for (i in 1:nsamp) res[i] <- mean(sample(X, 1000, replace = FALSE))
mean(res)
#Some Samples: Third, sample 3, n=5000, evaluate:
set.seed(16978)
nsamp <- 1
res <- numeric(nsamp)
for (i in 1:nsamp) res[i] <- mean(sample(X, 5000, replace = FALSE))
mean(res)
Repeated Sampling
Suppose we were to take 10 samples of size 500.
[1,] 0.4922826
[2,] 0.5114829
[3,] 0.5006157
[4,] 0.5180107
[5,] 0.5083638
[6,] 0.5054319
[7,] 0.4992882
[8,] 0.4612303
[9,] 0.4897318
[10,] 0.5016498
Mean: 0.4988088
S.D.: 0.01568156
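A minimal sketch of the loop behind a table like this (the seed used for these particular draws isn’t shown, so the exact values won’t reproduce):
#Take 10 samples of size 500 and store each sample mean
nsamp <- 10
res <- numeric(nsamp)
for (i in 1:nsamp) res[i] <- mean(sample(X, 500, replace = FALSE))
cbind(res)    #prints the ten means as a column, as above
mean(res); sd(res)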
Lessons?
Sampling variability is a real issue.
The range in estimates ran from about .46 to .52:
some trials badly under- or over-estimated the mean.
However, on average, “we’re close.”
More simulations.
Repeated Sampling
Experiment 1: 1000 samples, n=500
Mean: 0.4994611
S.D.: 0.01209907
set.seed(7869324)
nsamp <- 1000
res <- numeric(nsamp)
for (i in 1:nsamp) res[i] <- mean(sample(X, 500, replace = FALSE))
mean(res); sd(res)
hist(res, br=10, xlim=c(.45, .55))  #an assumed window around .5; the original xlim=range(.5) collapses the x-axis
abline(v=meanX)
N=500, 1000 Samples
Repeated Sampling
Experiment 2: 1000 samples, n=1000
Mean: 0.4988333
S.D.: 0.008994245
set.seed(7454)
nsamp <- 1000
res <- numeric(nsamp)
for (i in 1:nsamp) res[i] <- mean(sample(X, 1000, replace = FALSE))
mean(res); sd(res)
hist(res, br=10, xlim=c(.45, .55))  #same assumed window around .5
abline(v=meanX)
N=1000, 1000 Samples
Repeated Sampling
Experiment 3: 1000 samples, n=5000
Mean: 0.499128
S.D.: 0.004016436
set.seed(13433)
nsamp <- 1000
res <- numeric(nsamp)
for (i in 1:nsamp) res[i] <- mean(sample(X, 5000, replace = FALSE))
mean(res); sd(res)
hist(res, br=10, xlim=c(.45, .55))  #same assumed window around .5
abline(v=meanX)
N=5000, 1000 Samples
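As a quick cross-check (an addition, not from the slides), the S.D.s from the three experiments track the theoretical standard error of a sample mean, sd(X)/sqrt(n):
#Theoretical s.e. of the mean for each sample size
n <- c(500, 1000, 5000)
sd(X)/sqrt(n)    #approx. .0127, .0090, .0040, vs. simulated .0121, .0090, .0040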
What’s going on?
Sampling Variability
If we “fix” the number of samples, what happened?
As n increases, variability decreases.
On average, our sample estimate is “close” to the true value…
AND the variation across samples is decreasing.
Theory
Population Parameter
θ is the unknown parameter.
E(θ̂) = θ
What does this equality tell us?
How does it relate to samples?
Sample Proportions
In our examples, we wanted to estimate a proportion.
We knew its true value (we usually do not!).
We therefore must sample.
The same concept as before applies:
E(P̂) = P
Probability
“Over repeated samples, the expected value of the proportion will equal the true population proportion.”
This is a good thing.
Sample estimates can do a good job of approximating the population value.
This permits generalizability.
Good sampling technique will produce “unbiased estimates.”
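A minimal sketch of what “unbiased” means in practice, using a hypothetical 0/1 population with true proportion P = .55 (values chosen only for illustration):
#A binary "population" with true proportion .55
pop <- rbinom(1000000, 1, .55)
res <- numeric(1000)
for (i in 1:1000) res[i] <- mean(sample(pop, 500, replace = FALSE))
mean(res)    #close to .55: no systematic over- or under-shoot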
Repeated Sampling Redux
Suppose we were to take 10 samples of size 500.
[1,] 0.4922826
[2,] 0.5114829
[3,] 0.5006157
[4,] 0.5180107
[5,] 0.5083638
[6,] 0.5054319
[7,] 0.4992882
[8,] 0.4612303
[9,] 0.4897318
[10,] 0.5016498
Mean: 0.4988088
S.D.: 0.01568156
Mean of the Population is 0.4992135
E(P)=.4988; Population “P”=.4992
E(P)≈P
Note, any single sample might be “off”; however, the idea is that there is no systematic tendency to be off in one direction or the other.
Sampling Distribution
What we’ve just gone through are simulations of SAMPLING DISTRIBUTIONS.
Defined: the distribution of a statistic that you obtain from repeated samples of size n from some population.
The Concept of Variance
How far might you be off in a particular sample?
Why, by the way, might you like to know this?
You usually only have ONE sample!!
Is there a way we can determine this degree of variability?
Standard Error of a Proportion
Variance: the average “squared” deviation.
Standard Error: the square root of the variance.
σ_P² = P(1 − P) / N
σ_P = √( P(1 − P) / N )
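As a sketch, the formula can be wrapped in a small helper (se.prop is a hypothetical name, not from the slides):
#Standard error of a proportion: sqrt(P*(1 - P)/N)
se.prop <- function(P, N) sqrt(P*(1 - P)/N)
se.prop(.5, 100)    #[1] 0.05
se.prop(.5, 400)    #[1] 0.025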
Standard Error in Action
Suppose the true population parameter is P.
P = .50
In repeated samples, you would expect the average sample statistic to approach .50.
Recall the prior simulation.
What is the “sampling error”?
Using the formula from the previous slide:
√(.5(1 − .5)/100) = .05
Interpretation?
If the true population proportion is .50 and we took repeated (random) samples of size 100, the expected value of P would be .50, but the standard deviation would be .05.
.05 is our standard error of the sampling distribution. This is what ought to happen in repeated sampling.
More to it…that comes later.
Put it to the test.
> #"The Population"
> X<-runif(1000000,.01,.99)
> meanX <- mean(X); meanX
[1] 0.500889
> sdX<-sd(X); sdX
[1] 0.2832314
>
> #Sample 100, 1000 times
>
> set.seed(7324)
> nsamp <- 1000
> res <- numeric(nsamp)
> for (i in 1:nsamp) res[i] <- mean(sample(X, 100, replace = FALSE))
> mean(res); sd(res)
[1] 0.5007463
[1] 0.02781522
Result
What conclusions would I draw from my simulation?
“Best guess” of P is .50.
The average deviation across samples is about .03.
My guess plus my error allows me to compute a CONFIDENCE INTERVAL:
Estimate ± Error = C.I.
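A sketch of that computation from a single sample, reusing the X population from earlier; since the estimate here is a sample mean, the s.e. is taken as sd/sqrt(n) rather than the proportion formula:
#One sample of n = 100; estimate plus/minus one standard error
one <- sample(X, 100, replace = FALSE)
est <- mean(one)
se <- sd(one)/sqrt(100)    #about .028 for this population
c(est - se, est + se)      #Estimate +/- Error: a rough 68 percent C.I.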
Confidence Interval
What I’ve really done in my simulation is compute a “68 percent confidence interval”:
.50 plus or minus .03.
68 percent of all samples give a value for P between (about) .47 and .53.
Classical interpretation: in repeated samples of size 100, the sample estimate of P will lie in the range .47 to .53 about 68 percent of the time.
Why “68 percent”?
The 68-95-99.7 Rule and the Normal Distribution
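A sketch checking the 68 percent figure against the earlier simulation (res holds the 1,000 sample means of size 100 from the “Put it to the test” slide):
#Share of sample means within one s.e. of their average
mean(abs(res - mean(res)) <= sd(res))    #roughly .68, per the normal rule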
One Sample
You have one sample.
What makes the C.I. big versus small?
The Standard Error.
As n goes up, the s.e. goes down.
Therefore, the C.I. must get smaller.
σ_P² = P(1 − P) / N
σ_P = √( P(1 − P) / N )
Illustration
Relationship Between Sample Size and Sampling Error
[Figure: Standard Error (y-axis, 0.00 to 0.12) plotted against Sample Size (x-axis, 25 to 4,000); the standard error falls steeply at first, then flattens as the sample grows.]
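The figure’s curve can be reproduced from the formula; a minimal sketch assuming P = .5:
#Standard error of a proportion across the sample sizes on the x-axis
n <- c(25, seq(100, 1000, by = 100), seq(2000, 4000, by = 1000))
se <- sqrt(.5*(1 - .5)/n)
plot(n, se, type = "b", xlab = "Sample Size", ylab = "Standard Error", ylim = c(0, .12))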
Inference
The goal of statistical inference is to make supportable conclusions about the unknown characteristics, or parameters, of a population based on the known characteristics of a sample, measured through sample statistics.
Any difference between the value of a population parameter and a statistic from a properly drawn sample is attributable to sampling error, not systematic bias.
Inference
On average, a sample statistic will equal the value of the population parameter.
Any single sample statistic, however, may not equal the value of the population parameter.
Consider the sampling distribution: when the means from an infinite number of samples drawn from a population are plotted on a frequency distribution, the mean of the distribution of means will equal the population parameter.
Inference
By calculating the standard error of the estimator (or sample statistic), which indicates the amount of numerical variation in the sample estimate, we can quantify our confidence in that estimate.
More variation means less confidence in the estimate.
Less variation means more confidence.
Implications?
If we want to cut our s.e. in half, we must quadruple the sample size:
the s.e. shrinks with the square root of N, not with N itself.
S.E. for N=100 is .05
S.E. for N=400 is .025
.05/.025 = 2
S.E. for N=1000 is .0158
S.E. for N=4000 is .0079
.0158/.0079 = 2
There are trade-offs between precision and design.
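The arithmetic, verified in one line (P = .5 assumed):
#s.e. halves each time N quadruples
sqrt(.5*(1 - .5)/c(100, 400, 1000, 4000))
#[1] 0.050000000 0.025000000 0.015811388 0.007905694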