A Tale of Three Numbers
Statistical Significance,
Effect Size, and
Sample Size
BRIEF REVIEW
Causation vs. Correlation
When two variables A and B are correlated,
there are four possibilities:
1. A causes B
2. B causes A
3. A common cause C causes both A and B
4. The correlation is accidental
So, discovering that countries with democratic
elections get in fewer wars, we might conclude:
1. Democracy causes peace.
2. Peace causes democracy.
3. Christianity causes both democracy and peace.
4. Democracy and peace are only accidentally
correlated.
Observational Studies
Importantly, if we just observe the facts and
collect data on how things are, we cannot tell
which hypothesis is true.
Observational studies find correlations, not the
causal structure of the world. (This is what HW4
was about.)
The Best Evidence
So far, we’ve learned that a good experiment or
clinical trial is:
• Randomized
• Double-blind
• Controlled
This is often abbreviated ‘RCT’: Randomized
Controlled Trial.
Controls
An experiment with no controls is useless.
It tells us what happens when we do X, but not
what happens when we don’t do X (control).
Maybe the same results would happen from not
doing X. Maybe X does nothing. Or a lot. Or a
little. With no controls, it is impossible to tell.
Randomization
An experiment or trial is randomized when each
person who is participating in the experiment/
trial has a fair and equal chance of ending up
either in the control group or the experimental
group.
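As a small illustration, here is one way such an
assignment could be done in Python (a sketch; the
function and participant names are mine, not from
the lecture):

    import random

    def randomize(participants, seed=None):
        """Split participants into control and experimental groups,
        giving each person an equal chance of landing in either."""
        rng = random.Random(seed)
        shuffled = list(participants)   # copy, so the original is untouched
        rng.shuffle(shuffled)           # every ordering is equally likely
        half = len(shuffled) // 2
        return shuffled[:half], shuffled[half:]   # control, experimental

    control, experimental = randomize(["p1", "p2", "p3", "p4", "p5", "p6"])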
Benefits of Randomization
Proper randomization:
Minimizes experimenter bias– the experimenter
can’t bias who goes into which group.
Minimizes allocation bias– lowers the chance
that the control group and experimental group
differ in important ways.
Selection Bias
Randomization cannot get rid of all selection
bias.
For example, many psychology experiments are
just performed on American undergraduates by
their professors.
This means both groups over-represent young
Westerners. (“Sampling bias”)
Allocation Bias
Randomization also guards against allocation
bias, where the control group and experimental
group are different in important ways.
For example, if you assign the first 20 people to
enroll in the experiment to the control and the
next 20 to the experimental group, there may be
allocation bias: the first to enroll may be more
eager to take part, because they are sicker.
The Importance of Randomization
Previously we saw that improper randomization
procedures on average exaggerated effects by
41%.
This is an average result, so improper
randomization often leads to exaggerations that
are even larger than 41%.
Why RCTs?
The importance of the experimental method (as
opposed to scientific observation) is that it
allows us to discern the causal structure of the
world.
Causal Structure
If we find a correlation between our
experimental treatment T and our desired
outcome O, we can rule out:
• O caused T in the experiment.
• A common cause C caused both O and T in the
experiment.
Causal Structure
But can we determine whether the correlation
between T and O is real in the first place and not
accidental?
Yes!
STATISTICAL SIGNIFICANCE
Statistical Significance
We say that an experimental correlation is
statistically significant if it’s unlikely to be
accidental.
How can we tell when it’s unlikely to be
accidental?
Null Hypothesis
We give a name to the claim that there is no
causal connection between the variables being
studied.
It is called the null hypothesis.
Our goal is to reject the null hypothesis when it
is false, and to accept it when it is true.
Rejecting the Null Hypothesis
All experimental data is consistent with the null
hypothesis. Any correlation can always be due
entirely to chance.
But sometimes the null hypothesis doesn’t fit
the data very well. When the null hypothesis
suggests that our actual observations are very
unlikely, we reject the null hypothesis.
P-Values
One way to characterize the significance of an
observed correlation is with a p-value.
The p-value is the probability that we would
observe our data on the assumption that the
null hypothesis is true.
p = P(observations | null hypothesis is true)
P-Values
Lower p-values are better: the lower the p-value,
the harder it is to explain your observed
correlation as a mere accident.
In science we have an arbitrary cut-off point,
5%. We say that an experimental result with p <
.05 is statistically significant.
Statistical Significance
What does p < .05 mean?
It means that the probability that our
experimental results would happen if the null
hypothesis is true is less than 5%.
According to the null hypothesis, there is less
than a 1 in 20 chance that we would obtain
these results.
Note
Importantly, p-values are not measures of how
likely the null hypothesis is, given the data. They
are measures of how likely the data is, given the
null hypothesis.
p = P(data | null hypothesis is true)
≠
P(null hypothesis is true | data)
Example
Suppose I have a coin, and I hypothesize that the
coin is biased toward heads.
The null hypothesis might be “this is a fair coin,
it is equally likely to land heads or tails”.
Suppose I then flip it 5 times and it lands
HHHHH– heads 5 times in a row.
Example
We know that the probability of this happening
if the coin is fair is 1/2^5 = 1/32 = 0.03125, or
about 3%.
P(HHHHH | the coin is fair) =
P(HHHHH | null hypothesis is true) =
p ≈ 3%
Example
So p = .03 < .05, and we can reject the null
hypothesis. The bias toward heads is statistically
significant.
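To check the arithmetic, here is a tiny Python
sketch computing the chance of this all-heads
outcome under the null hypothesis:

    # Probability of 5 heads in 5 flips if the coin is fair (null hypothesis)
    p_value = 0.5 ** 5
    print(p_value)         # 0.03125, about 3%
    print(p_value < 0.05)  # True, so we reject the null hypothesis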
Importance
Just because the results of an experiment (or
observational study) are “statistically significant”
does not mean the revealed correlations are
important.
The effect size also matters: that is, the
strength of the correlation.
EFFECT SIZES
Effect Size
One NAEP analysis of 100,000 American
students found that science test scores for men
were higher than the test scores for women,
and this effect was statistically significant.
These results would be unlikely if the null
hypothesis, that gender plays no role in science
scores, were true.
Effect Size
However, the average difference between men
and women on the test was just 4 points out of
300, or 1.3% of the total score.
Yes, there was a real (statistically significant)
difference. It was just a very, very small
difference.
Effect Size
One way to put the point might be: “p-values
tell you when to reject the null hypothesis. But
they do not tell you when to care about the
results.”
Measures of Effect Size
There are lots of measures of effect size:
Pearson’s r, Cohen’s f, Cohen’s d, Hedges’ g,
Cramér’s V,…
Here we will just talk about two measures that
are commonly reported: odds ratios and relative
risks.
Odds Ratio
First, let’s introduce the idea of a binary
variable. A binary variable is a variable that can
have only two values.
“height” is not a binary variable, because there
are more than two heights people can have.
“got an A” is a binary variable, because either
you got an A or you didn’t.
Odds
Whenever you have a binary variable, you can
ask about the odds of that variable– what are
the odds of getting an A?
If 10 students out of 50 got A’s, then 10
students got an A and 40 did not. The odds of
getting an A are 10:40, or 1:4, or 25%.
Odds vs. Probabilities
Odds are not probabilities. There are 50
students and 10 of them got A’s.
The probability of getting an A: 10/50 = 20%
The odds of getting an A: 10/40 = 25%
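A minimal Python sketch of the distinction, using
the same 10-out-of-50 numbers (the helper names
are mine):

    def probability(successes, total):
        # Probability: successes out of all cases
        return successes / total

    def odds(successes, total):
        # Odds: successes to failures
        return successes / (total - successes)

    print(probability(10, 50))  # 0.20, i.e. 20%
    print(odds(10, 50))         # 0.25, i.e. 25%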
Odds Ratios
Suppose I have another binary variable
“studied”– students either studied for the exam
or they didn’t.
I can ask about the odds that a student who
studied got an A, and the odds that a student
who didn’t study got an A.
In Table Format

             Got an A = yes   Got an A = no   Totals
Study = yes        6                15           21
Study = no         4                25           29
Totals            10                40           50
Odds Ratio
So the odds of getting an A among studiers are
6:15, or 40%.
And the odds of getting an A among
non-studiers are 4:25, or 16%.
Odds Ratio
The odds ratio is the ratio of these odds, or
40%:16% ≈ 2.5
This means that (in our example) studying raises
the odds that someone will get an A by 150%.
Alternatively: a student who studies has two and
a half times the odds of getting an A.
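In Python, the computation from the table might
look like this (a sketch; variable names are
mine):

    # 2x2 table from the slides: rows = studied?, columns = got an A?
    a, b = 6, 15   # studied:     A = yes, A = no
    c, d = 4, 25   # not studied: A = yes, A = no

    odds_studiers     = a / b   # 6/15 = 0.40
    odds_non_studiers = c / d   # 4/25 = 0.16
    print(odds_studiers / odds_non_studiers)   # 2.5, i.e. 250%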
Relative Risk
While odds ratios are appropriate when we have
two correlated binary variables in an
observational study (as when I observe the
effects of studying on getting an A), the effect
sizes in RCTs are usually reported as relative
risks, which are also called risk ratios.
Relative Risk
Relative risks are just like odds ratios except they
compare probabilities and not odds.
The odds that a studying student gets an A are
6:15 = 40%.
The probability is 6/(6 + 15) = 6/21 ≈ 29%.
Example
The odds that a non-studying student gets an A
are 4:25 = 16%.
The probability is 4/(4 + 25) = 4/29 ≈ 14%.
Example
Whereas the odds ratio was 40:16 = 250%, we
get a relative risk of:
29%:14% = 29:14 ≈ 2.07, or 207%
These numbers are similar, but obviously not the
same. The risk ratio tells you that a student who
studies is twice as likely to get an A.
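The matching Python sketch for the relative risk,
using the same table:

    # Same 2x2 table, but probabilities instead of odds
    p_studiers     = 6 / (6 + 15)   # 6/21, about 0.29
    p_non_studiers = 4 / (4 + 25)   # 4/29, about 0.14
    print(p_studiers / p_non_studiers)   # about 2.07, i.e. 207%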
Relation
As the probabilities of events get smaller the
odds approach the probabilities, and odds ratios
and relative risks are similar.
However, as the probabilities of the events get
higher, odds ratios and relative risks come
apart.
Here’s our table again…

             Got an A = yes   Got an A = no   Totals
Study = yes        6                15           21
Study = no         4                25           29
Totals            10                40           50
Odds Ratio for High Probability Events
The probability of not getting an A is much
higher than the probability of getting an A:
40/50 >> 10/50.
The odds of A = no among study = no: 25/4 = 6.25
The odds of A = no among study = yes: 15/6 = 2.5
Odds ratio: 6.25/2.5 = 2.5, or 250%.
Not studying increases the odds of not getting
an A by one and a half times.
Relative Risk for High Probability
Events
What about probabilities?
P(A = no | study = no) = 25/29 ≈ 86%
P(A = no | study = yes) = 15/21 ≈ 71%
Relative risk = 86/71 ≈ 1.21, or 121%
So not studying increases your risk of not getting
an A by 21%.
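Both computations side by side, as a Python
sketch (the rare-outcome counts at the end are
made up purely to illustrate the convergence):

    def odds_ratio(a, b, c, d):
        # (a/b) / (c/d) for a 2x2 table with rows (a, b) and (c, d)
        return (a / b) / (c / d)

    def relative_risk(a, b, c, d):
        # (a/(a+b)) / (c/(c+d)) for the same table
        return (a / (a + b)) / (c / (c + d))

    # Common outcome (A = no): 25 of 29 non-studiers vs. 15 of 21 studiers
    print(odds_ratio(25, 4, 15, 6))     # 2.5   -> 250%
    print(relative_risk(25, 4, 15, 6))  # ~1.21 -> 121%

    # Rare outcome (hypothetical counts): the two measures nearly agree
    print(odds_ratio(2, 98, 1, 99))     # ~2.02
    print(relative_risk(2, 98, 1, 99))  # 2.0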
What This Means
What this means is that if you see an effect size
reported in the news you must know whether it
is an odds ratio or a risk ratio.
Otherwise a seemingly very big difference might
actually be a very small difference.
Real Life Case
Here’s a real headline from the NY Times:
“Doctors are only 60% as likely to order cardiac
catheterization for women and blacks as for men
and whites.”
This sounds like a risk ratio. Doctors refer white
men n% of the time and blacks and women 60%
of n% of the time. Right?
Large Difference in Risk!
The study found that doctors referred white
men to heart specialists 90.6% of the time.
If the “60%” figure is a risk ratio, then they
referred blacks and women 60% x 90.6% =
54.4% of the time.
That’s a big difference!
Actually… No
But people who write newspaper articles don’t
understand odds ratios and risk ratios.
The probability of a doctor referring a black man
or a woman to a heart specialist was 84.7%, not
54.4%.
The article was confusing an odds ratio with a
risk ratio.
What’s Going On?
If 90.6% of white males were referred, then
9.4% were not referred, and so a white male's
odds of being referred were 90.6/9.4 ≈ 9.6.
Since 84.7% of blacks and women were referred,
15.3% were not referred, and so for them, the
odds of referral were 84.7/15.3 ≈ 5.5.
The odds ratio was therefore 5.5/9.6 ≈ 60%. The
odds of a referral if you were black or a woman
were about 60% of the odds of referral if you
were a white man.
But the risk ratio was much higher. If you were
black or a woman, the probability that you
would be referred was 93% of the probability
that a white man would be referred.
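If you know the comparison group’s probability,
you can convert an odds ratio into a risk ratio.
Here is a Python sketch of the standard
conversion, sometimes attributed to Zhang and Yu
(the function name is mine):

    def rr_from_or(odds_ratio, baseline_prob):
        # RR = OR / (1 - p0 + p0 * OR), where p0 is the comparison
        # group's probability of the outcome
        return odds_ratio / (1 - baseline_prob + baseline_prob * odds_ratio)

    # Referral study: white men referred 90.6% of the time, odds ratio ~0.6
    print(rr_from_or(0.6, 0.906))   # ~0.94, in line with the 93% figure above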
This Happens All the Time
This is from “Childhood Asthma Gene Identified
by scientists” from the Science Editor of The
Independent:
“Inheriting the gene raises the risk of developing
asthma by between 60 and 70 per cent– enough
for researchers to believe that the discovery may
eventually open the way to new treatments for
the condition.”
Complete Misrepresentation
The quote talks about “raising the risk”.
Is that what the scientists found? No. The
number reported wasn’t an increased risk, it
was an odds ratio.
The gene only raised the risk of asthma by 19%.
More “Science Editors”
This quote is from the Science Editor of the
London Times in “Genetic breakthrough offers
MS sufferers new hopes for treatment”:
“Research has identified two genetic variants
that each raises a person's risk of developing MS
by about 30 per cent, shedding new light on the
origins of the autoimmune disease that could
ultimately lead to better therapies”
Complete Misrepresentation
The quote talks about genes “raising the risk” of
Multiple Sclerosis.
Is that what the researchers found? No!
The first gene raised the risk of MS by only 3%,
the second by only 4%!
SAMPLE SIZE
Sample
In statistics, the group of people we are
studying is called the sample.
Our question is then: what sample size is
needed for a result that generalizes to the
population?
Non-Random Samples
The first thing we should realize is that it’s not
going to do us any good to ask a non-random
group of people.
Suppose everyone who goes to ILoveMitt.com is
voting for Mitt. If I ask them, it will seem like
100% of the population will vote for Mitt, even if
only 3% will really vote for him.
Internet Polls
Internet polls are not trustworthy. They are
biased toward people who have the internet,
people who visit the site that the poll is on,
and people who care enough to vote on a
useless internet poll.
Representative Samples
The opposite of a biased sample is a
representative sample.
A perfectly representative sample is one where
if n% of the population is X, then n% of the
sample is X, for every X.
For example, if 10% of the population smokes,
10% of the sample smokes.
Random Sampling
One way to get a representative sample is to
randomly select people from the population, so
that each has a fair and equal chance of ending
up in the sample.
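In Python, such a draw is a one-liner with
random.sample (the population here is just a
stand-in):

    import random

    population = list(range(100000))          # stand-in for the real population
    sample = random.sample(population, 1000)  # each person equally likely to be drawn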
Confidence Interval
Suppose I poll a sample of some population and
find out that 50% of the sample will vote for
candidate C. I might be:
• 90% certain that 48-52% of the population will
vote for C
• 95% certain that 45-55% of the population will
vote for C
• 99% certain that 40-60% of the population will
vote for C
Margin of Error
The margin of error is half the width of some
confidence interval (usually the 95% interval).
So if I’m 95% certain that between 45 and 55%
of people will vote for C, then the margin of
error is ±5%.
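As a sketch, here is the usual
normal-approximation formula for a proportion’s
margin of error in Python (the slides don’t give
this formula; it is the standard one):

    import math

    def margin_of_error(p_hat, n, z=1.96):
        # Half-width of the approximate 95% confidence interval
        # for a sampled proportion p_hat with sample size n
        return z * math.sqrt(p_hat * (1 - p_hat) / n)

    # Poll: 50% of a 1,000-person sample supports candidate C
    print(margin_of_error(0.5, 1000))   # ~0.031, i.e. 50% +/- 3.1%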
Error Bars
Often, confidence intervals/margins of error are
presented graphically as error bars.