Transcript week 4
Introduction to Data
Analysis
Confidence Intervals
Standard errors continued…
Last week we managed to work out some info
about the distribution of sample means. If we
have lots of samples then:
Mean of all the sample means = population mean.
Shape of the distribution of sample means = normal.
Standard error tells us how dispersed the distribution of
sample means is.
Put all these together, and we can finally work
out how likely our sample mean is to be near
the population mean.
2
Churchgoing example
Let’s take a less contrived example than my
car managing to break any speed limits.
This week I’m interested in the average number of times
a year people go to church (or synagogue, or whatever).
I take a sample of 100 people and record the number of
times they went to church last year. Some went a lot but
most went infrequently.
So, from that sample I get a sample mean and a sample
standard deviation.
3
Church-going sample
60
Sample mean = 8.5
50
Frequency
40
Sample s.d. = 20
30
20
10
0
0
10
20
30
40
50
60
Number of times per year
4
Standard error
Since we know the s.d. and mean of the sample, we can
calculate the standard error.
Standard error X
s
20
2
n
100
Because we know the standard error and the fact that the
sampling distribution is normal, we are able to calculate how
‘likely’ it is that a specific range around our estimate contains
the population mean.
We can calculate what’s called a confidence interval.
5
Confidence intervals (1)
A confidence interval for an estimate is a range
of numbers within which the parameter is
‘likely’ to fall.
Remember a parameter is something about the
population, like the population mean.
We can use the standard error (and the fact that
the sampling distribution is normal) to produce
such a range.
6
Confidence intervals (2)
estimate (k standard error )
k is chosen to determine what is meant by
‘likely’ to contain the actual value of the
population mean.
We want to pick a k that is meaningful to us.
… and is not so large that the interval is useless.
… but not so small that the interval is very unlikely to
contain the true population value.
7
Confidence intervals (3)
To return to our example, we have:
An estimate of the population mean (the
mean number of days of church attendance)
which is the sample mean (8 ½ days).
A standard error (2 days) which allows us to
put a range around our estimate.
8.5 (k 2)
8
How do we pick k?
Imagine sampling repeatedly, and therefore
getting lots of sample means.
Given what we know about the sampling
distribution (i.e. it’s normal if n is large
enough), we know what proportion of these
sample means will fall within a certain number
of standard errors of the actual population
mean.
9
Confidence coefficient
Call this proportion the confidence coefficient.
If we picked .95 then we know that if we repeatedly sampled the
population, that interval around each of the sample means would
include the true value 95% of the time.
The confidence coefficient is a number that is chosen by the
researcher which is close to 1, like 0.95 or 0.99.
Since the sampling distribution is normal, we know
the values of k that correspond to the probability of
any proportion.
So we know that approximately 95% of confidence intervals that
are 2 standard errors either side of the sample mean will include the
population mean.
10
Back to church
For our particular sample we calculate a 95%
confidence interval (i.e. k=2).
k
8.5 (2 2)
8.5 4
SE
Imagine the population mean for churchgoing in
Britain is actually 6 days a year.
So our particular sample is off by 2 ½ days a year.
The mean of all the possible sample means is equal to the
population mean so the centre of the sampling distribution is 6.
11
Sampling distribution (1)
Population mean = 6
0
1
2
3
4
5
6
7
Church-going (days a year)
8
9
10
11
12
Sample mean=8.5
12
More samples
But that’s just our one sample, let’s imagine we
took many samples, and then calculated 95%
confidence intervals for all the sample means.
13
Sampling distribution (2)
Population mean = 6
0
1
2
3
4
5
6
7
8
9
10
11
12
Church-going (days a year)
14
Sampling distribution (3)
Of my 7 samples, all the confidence intervals around
the sample mean enclosed the actual true population
mean apart from one.
If we repeated this lots of times, we would expect
95% of the confidence intervals to enclose the actual
population mean.
95% because that’s the confidence coefficient that we picked.
If we had picked a confidence coefficient of 0.99, then it would be
99% and the confidence intervals would be larger.
15
Confidence coefficients
To make things easier I’ve been rounding the
numbers off, the exact figures for k at each
confidence coefficient are slightly different.
Confidence coefficient
k
68%
95%
99%
99.9%
1.00
1.96
2.58
3.29
16
Smaller confidence intervals?
We could just be less certain.
If I was willing to pick a 90% confidence interval then
the range of the interval would be narrower as k would
be lower.
We would be wrong more often though…
We could increase the sample size.
The bigger the sample size, the lower the standard error
and therefore the smaller the confidence interval for a
given probability.
This isn’t always practical of course…
17
Why 95%?
Generally speaking in social science we pick 95% as
an appropriate confidence coefficient.
A 1 in 20 chance of being wrong is generally felt to
be okay.
Path dependency—Fisher integrated the normal PDF
for the 95% level, which is REALLY hard to do
without a computer.
This isn’t always the case…
Sometimes we want to be really sure we’ve enclosed the mean.
Other times we want a narrower range and are willing to accept that
we are wrong more often.
18
CIs for proportions
Since calculating the standard error is similar for
proportions, so are producing confidence intervals.
Remember we need binary data coded a 0 and 1 (yes/no,
men/women etc.)
Remember the standard error for proportions depends on the n
and the sample proportion. So the CI is as below:
P(1 P)
P 1.96
n
19
Proportions example
Let’s take a political
opinion poll in the US,
we’re interested in the
population proportion
voting Democrat (call
this π).
Sample
is 1000 people.
Sample proportion is
0.45 (or 45%).
P 1.96
P (1 P )
n
0.45(1 0.45)
1000
0.45 1.96 0.0157
0.45 0.03
0.45 1.96
i.e. Actual proportion is 45% 3%,
so the interval is 42%, 48%.
20
Opinion polls
A lot of opinion polls use a sample size of
about 1000 people, and aim to estimate
proportions that are between 30% and 50%.
This is why you often see ‘margins of error’ that are 3%
either side of an estimate quoted in newspapers; these
are 95% confidence intervals.
Note that all these confidence intervals ignore
non-sampling error.
What we call sampling bias, this could be due to nonresponse, badly worded questions and so forth.
21
Exercise
If you were taking a political opinion poll,
roughly how big a sample would you need to
achieve a ‘margin of error’ (i.e. a 95%
confidence interval) of:
2% either side of the estimate?
1% either side of the estimate?
(Assume that the sample proportion of interest
is around 40%).
22
Exercise answer
Margin of error of 0.02
0.02 1.96
0.02 2
0.02
2
P( 1 P)
n
P( 1 P)
n
0.40( 1 0.40 )
n
0.24
n
0.24
0.0001
n
n 2400
0.01
Margin of error of 0.01
0.01 1.96
0.01 2
0.01
2
P( 1 P)
n
P( 1 P)
n
0.40( 1 0.40 )
n
0.24
n
0.24
0.000025
n
n 9600
0.005
23
What’s a hypothesis?
Hypotheses = testable statements about the world.
Hypotheses = falsifiable.
We test hypotheses by attempting to see if they could be false,
rather than ‘proving’ them to be true.
All swans are white…? Popper said you cannot prove that all swans
are white by counting white swans, but you can prove that not all
swans are white by counting one black swan.
We normally generate hypotheses from a
combination of theory, past empirical work, common
sense and anecdotal observations about the world;
e.g. young people are more leftwing than their elders. Observations
abut my friends, work showing this relationship in other countries,
theory suggesting that social ageing makes people more rightwing.
24
Some hypotheses (or not)
My girlfriend spends too much on make-up.
My girlfriend spends more than me on make-up.
NO. Falsifiable, but not really social science.
Women spend more than men on make-up.
NO. It’s a normative claim about ‘too much’.
HOORAY, a social scientific hypothesis. It’s falsifiable, and tells
us something about the world.
Make-up is used to marginalize the Other as a form of
contemporary patriarchialist nihilism.
NO. This is too vague to be falsifiable.
25
Hypothesis testing
Two different types of hypothesis.
Descriptive inference (e.g. old people are more religious
than young people).
Causal inference (e.g. old people are more religious than
young people because they think about death more).
We test statistical hypotheses using
significance testing.
This is a way of statistically testing a hypothesis by
comparing the data we have to values predicted by the
hypothesis.
26
Null and alternative hypotheses
When we’re testing hypotheses, we want to choose
between two conflicting statements.
Null hypothesis (H0) is directly tested.
This is a statement that the parameter we are interested in has a
value similar to no effect.
e.g. regarding religiosity, old people are the same as young people.
Alternative hypothesis (Ha) contradicts the null
hypothesis.
This is a statement that the parameter falls into a different set of
values than those predicted by H0.
e.g. regarding religiosity, old people are different to young people.
27
A (simple) hypothesis
We’re interested in whether new measures to
curb MEPs fraudulently claiming for flights
are effective.
Previously, the population of MEPs managed to claim
16 flights each on expenses (per month).
After the new measures are introduced, we sample 100
MEPs and find that they are now managing to only
charge 13½ flights a month each to expenses. The
standard deviation for this sample is 10 flights.
Are EU (well British and German anyway) taxpayers
really getting better value for money though?
28
Aviatophobia or less fraud?
So in this case the null hypothesis is that there
has been no change.
H0 is that MEPs are claiming the same amount per
month as they were before, and the difference is just
because our sample happens to include a lot of
aviatophobic MEPs (or something like that).
The alternative hypothesis is that MEPs
spending has been curbed.
Ha is that MEPs are claiming less than they were
previously.
29
More formally
Slightly more formally:
The population we are interested is MEPs after the
changes.
We want to know whether the population mean ‘number
of flights claimed per month’ is different from 16.
H0 is that this population mean is equal to 16.
Ha is that this population mean is less than 16.
The info we have is from one sample mean.
Now we imagine that H0 is true…
30
If H0 is true…(1)
The population mean will be equal to 16.
Since n is large(ish) the sampling distribution will be
normal and centred around 16.
We can calculate the standard error of the sampling
distribution.
Standard error X
s
10
1
n
100
31
If H0 is true…? (2)
H0 mean = 16
Sample mean = 13.5
10
11
12
13
14
15
16
17
18
19
20
21
22
Flights claimed (per month)
2 ½ % of the distribution
32
P-value (1)
So we know that H0 looks pretty unlikely (certainly
less than a 2½ per cent chance), but we can actually
give a more precise probability.
We work out how many standard errors the sample mean is away
from H0 to produce a z-score, and then calculate a p-value from
this with reference to the normal probability distribution.
Sample mean Null hypothesis mean
Standard error
13.5 16
z
2 . 5
1
Pr (sample mean being more than 2.5 SEs lower
z
than the population mean) 0.006
33
P-value (2)
If H0 were correct it looks unlikely that we would get
a sample that had a mean of 13½.
In fact there is just a 0.006 (or 0.6%) chance that our
sample could have come from a population postulated
by the null hypothesis.
We could set an (arbitrary) ‘significance level’ that our test must
meet. Maybe we need to be 99% confident that we can reject the
null hypothesis (i.e. the p-value is less than 1%).
Best practice when we perform significance tests is simply to report
the p-value, and to make the judgement that p-values of, say, 5%
and below are probably good evidence the null hypothesis can be
rejected.
34
Steps for Hypothesis test
Check assumptions (i.e. normality, sample
size, level of measurement)
State hypotheses—Null and Alternative
Calculate appropriate test statistic (e.g z-score)
Calculate associated p-value
Interpret the result
35
Interpreting hypothesis test
We NEVER accept the null hypothesis.
We either reject or fail to reject based on our
p-value.
May fail to reject null hypothesis due to:
small sample size
inappropriate research design
biased sample, etc..
36
Type I and type II errors
A type I error occurs when we reject H0, even though
it is true.
A type II error occurs when we do not reject H0, even
though it is false.
This is going to happen 5% of the time if we choose to reject H0
when the p-value is less than 0.05.
If our ‘significance level’ is 0.05, then sometimes there will be a
real difference and we won’t detect it.
The more stringent the significance level the more
difficult to detect a real effect is, but the more
confident we can be that when we find an effect it is
real.
37
Making errors…
There is a trade-off between the two types of
error. Depending on what we’re doing we may
be more willing to accept one sort or the other.
Think of this as analogous to a legal trial, we don’t want
the guilty to go free (a type II error), but we’d be even
unhappier if we execute an innocent person (a type I
error).
In this case, we might want the significance level to be
very low to minimize executing innocent people (but at
the same time allowing lots of the guilty to go free)
38
Differences between 2 sample means
More normally we’ve sampled two groups and
wish to see if they differ.
Let’s go back to our religion example.
We might be interested in whether women’s churchgoing differs from men.
We have 2 samples, 45 men and 55 women.
The men have a mean attendance of 7 days a year, with
standard deviation of 15.
The women have a mean attendance of 10 days a year
with standard deviation of 15.
39
Are women more or less religious than
men?
Essentially, to answer this we estimate the
difference between the populations (the
parameter) using the difference between the
sample means (the statistic).
We can run a significance test on this statistic, and work
out whether our samples are likely to represent real
differences between the populations of men and women.
The null hypothesis is therefore that there is no
difference between men’s mean churchgoing and
women’s mean churchgoing.
40
Z-score
So we work out the z-score as before.
Estimate of parameter null hypothesis value
Standard error of estimator
( X women X men ) 0 X women X men
z
2
2
SE ( X women X men )
swomen
smen
nwomen nmen
z
z
10 - 7
152 152
55 45
3
1.00
4.09 5
41
P-value
The standard error of the estimator is ~3 and the Zscore is ~1 then.
We have no prior ideas of whether we think women
are more or less religious than men, so we just want
to test the possibility that they are not the same.
i.e. we want to know how likely it would be to get an
individual estimate of the difference between the
sample means that is either 1 SE greater than the null
hypothesis (i.e. zero) or 1 SE less than the null
hypothesis.
42
Two-sided tests (1)
-12
-9
-6
-3
0
H0 mean = 0.
SE = ~3.
3
6
9
12
Churchgoing for women - churchgoing for men (in days per year)
68% of the distribution
Difference between sample means = 3
i.e. SampleWomen is 3 greater than sampleMen
43
Two-sided tests (2)
The probability of a difference between the two
sample means being 3 or greater is 0.16.
The probability of a difference between the two
sample means being 3 or less is 0.16.
So the p-value for a 2-sided test is 2*(0.16) or 0.32.
This value is high (much higher than our 5% cut off
value), so we fail to reject H0 that men and women do
not differ in their church attendance.
44
Two-sided tests (3)
Regardless of our theoretical expectations, the
convention is to use two-tailed tests. Why?
In essence making it even more difficult to find results
just due to chance.
We normally don’t have very strong prior information
about the difference.
One tailed tests are often the hallmark of someone
trying to make something out of nothing.
What does it mean to use a one-tailed test??
Not necessarily bad—in fact arguably MUCH smarter.
45
Significance tests and CIs (1)
Notice that our significance test has ended up
looking rather similar to the CIs.
We could use a CI around the difference
between the two sample means to test the
hypothesis that they are the same.
A 95% CI would just be 1.96*SE. We’ve just
worked out the SE (it’s approximately 3).
46
Significance tests and CIs (2)
( women men ) ( X women X men ) 1.96 SE
( women men ) 3 1.96 3
( women men ) 3 5.88
The 95% CI encloses zero (which was our null
hypothesis, that women are the same as men).
CIs and significance tests are doing the same job, just
presenting the information in a slightly different way.
47
Proportions and significance tests
This means of course that all I said about
proportions and CIs, applies to proportions and
significance tests.
A hypothesis related to a previous example is
that one of the candidates does have a lead.
Therefore the null hypothesis is that both
candidates have a vote share of 50%.
48
Proportions example – the return
Sample is 1000, proportion voting Democrat is 0.45 (or 45%).
Null hypothesis is that the population proportion is 0.5.
Thus, according to my sample, it is very likely that the
Presidential race is not a dead-heat.
z
0 P
0 (1 0 )
n
0.5 0.45
0.5(1 0.5)
1000
0.05
3.16
0.0158
Pr(sample mean being 3.16 SEs or more different to the
population mean) 0.001* 2 0.002
z
Note: The SE is
calculated
with the null hypothesis
proportion, not with the
sample proportion.
Remember, we are
testing the null
hypothesis, so the mean
is the null hypothesis
mean .
49
Summary
Because we know the shape of the sampling
distribution, we can work out:
Ranges around an individual sample mean that will
enclose the population mean X per cent of the time.
The probability that a hypothesis about the population
mean is true, given a particular sample mean.
The probability that population means for different
groups are different, given two sample means.
All of the above for proportions.
Although there are some small complications...
50
Things I haven’t mentioned
The reality of statistical hypothesis testing is slightly
more complicated than this in some ways particularly:
If your sample sizes are small. If the sample < 30 or < 40 then we
need to use a different distribution called the t distribution. To use
this we need to assume that the population distribution we are
interested in is normal.
As sample size increases, the t-distribution looks more and more
similar to the z-distribution. This is why significance tests (even for
big sample sizes) are often called t-tests.
When looking at proportions we often use slightly different tests as
well to accommodate the fact that a proportion cannot be greater
than 1, or less than 0.
51