Confidence intervals, effect size, power
Looking beyond the p-value
What we will cover
• Confidence Intervals
  • What they are
  • How to calculate them
  • How to interpret them
• Effect size
  • The role of the sample size in finding an effect
  • Cohen's D
• Introducing the concept of Power
  • An R library for basic power calculations.
Point estimates & Intervals
• If you have a sample mean and you wish to make a guess as to what
the population mean is, you can make two kinds of estimates:
• Point estimates are guesses that specify an exact number. When you make a
point estimate using the sample mean, it is likely your guess is near the true
population parameter but it is very unlikely that it will be exactly the same as
the parameter. For example, you might make a point estimate that μ is 4
when really it is 3.55.
• Interval estimates are guesses that specify a range of numbers. With interval
estimates, you guess that the true population parameter falls somewhere
between 2 numbers. For example, you might make an interval estimate that μ
is between 2 and 6.
Confidence interval
• In reality we often want to report a confidence interval as well as a
test statistic and significance test.
• You can see that most statistical packages report confidence intervals
by default, e.g.
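A minimal sketch in base R (the data here are hypothetical, made up purely to show the default printout; t.test() reports a 95% confidence interval unless told otherwise):
> set.seed(1)
> x <- rnorm(25, mean = 10, sd = 2)   # made-up sample
> t.test(x)                           # printout includes "95 percent confidence interval"
> t.test(x, conf.level = 0.99)        # ask for a 99% interval instead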
Confidence interval
• Say we were to take a random sample from a population and calculate the
mean.
• A confidence interval is a range of values around the mean, with the
following meaning:
• if we drew an infinite number of samples of the same size from the same population,
x% of the time the true population mean would be included in the confidence
interval calculated from the samples.
• If we compute a 95% confidence interval (the most common type), x = 95,
so we can say that 95% of the confidence intervals calculated from an
infinite number of samples of the same size, drawn from the same
population, can be expected to contain the true population mean.
Confidence Interval
• More generally, a confidence interval gives us information about the
precision of a point estimate such as the sample mean.
• A wide confidence interval tells us that if we had drawn a different
sample, we might get a quite different sample mean, whereas a
narrow confidence interval suggests that if we drew a different
sample, the sample mean would probably be fairly close to that from
the sample we did draw.
• Basically we can calculate (using an appropriate formula) the upper
and lower boundary of our confidence interval.
• estimate ± margin of error
Confidence Interval
• Let's think about this in terms of a normal distribution first before
looking at how to calculate it for the independent samples t-test
(which is more practical).
• Any Normal distribution has probability of about 0.95 within ±2 (more precisely, ±1.96) standard deviations of its mean.
Confidence Interval
• To construct a confidence interval we need to know more about the area C
under the curve.
• That is, we must find the number z∗ such that any Normal distribution has
probability C within ±z∗ standard deviations of its mean.
• Because all Normal distributions have the same standardized form, we can
obtain everything we need from the standard Normal curve.
• The sample mean x̄ has the Normal distribution with mean μ and standard
deviation σ/√n.
• The unknown population mean μ lies between:
  • x̄ − z*σ/√n and x̄ + z*σ/√n
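In R, z* for any confidence level C can be read straight off the standard Normal curve; a small sketch:
> C <- 0.95
> qnorm(1 - (1 - C)/2)       # z* ≈ 1.96 for 95% confidence
> qnorm(1 - (1 - 0.99)/2)    # z* ≈ 2.576 for 99% confidence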
Example
• The National Student Loan Survey collects data to examine questions
related to the amount of money that borrowers owe. The survey selected a
sample of 1280 borrowers who began repayment of their loans between
four and six months prior to the study. The mean of the debt for
undergraduate study was $18,900 and the standard deviation was about
$49,000.
• This distribution is clearly skewed but because our sample size is quite
large, we can rely on the central limit theorem to assure us that the
confidence interval based on the Normal distribution will be a good
approximation.
• Let’s compute a 95% confidence interval for the true mean debt for all borrowers.
(Although the standard deviation is estimated from the data collected, we will treat
it as a known quantity for our calculations here).
Example
• Calculations (95% confidence, z* = 1.96): margin of error = z*σ/√n = 1.96 × 49,000/√1280 ≈ 2684.
• We'll round 2684 to 2700 for the purposes of this example, giving an interval of roughly $18,900 ± $2,700, i.e. about ($16,200, $21,600).
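The same calculation as an R sketch (treating σ = 49,000 as known, as stated above):
> n <- 1280; xbar <- 18900; sigma <- 49000
> moe <- 1.96 * sigma / sqrt(n)   # ≈ 2684
> c(xbar - moe, xbar + moe)       # roughly 16216 to 21584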
Example
• Suppose the researchers who designed the National Student Loan
Survey had used a different sample size.
• How would this affect the confidence interval?
• We can answer this question by changing the sample size in our
calculations and assuming that the mean and standard deviation are
the same.
• Let’s assume that the sample mean of the debt for undergraduate
study is $18,900 and the standard deviation is about $49,000, as in
the previous example. But suppose that the sample size is only 320.
• The margin of error for 95% confidence is?
Example
• 1.96 × 49,000/√320 ≈ 5369, which rounds to 5400: a margin of error of about $5,400 vs. $2,700.
• Notice that the margin of error for this example is twice as large as the margin of
error that we just computed.
• The only change that we made was to assume that the sample size is 320 rather
than 1280.
• This sample size is exactly one-fourth of the original 1280.
• Thus, we approximately double the margin of error when we reduce the sample
size to one-fourth of the original value.
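A quick R check of the effect of quartering the sample size:
> 1.96 * 49000 / sqrt(1280)   # ≈ 2684
> 1.96 * 49000 / sqrt(320)    # ≈ 5369, roughly double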
Confidence Interval
• One thing to note: by increasing the confidence level from 95% to 99%, we make the interval bigger, not smaller!
• This may seem strange at first, but this diagram shows why:
• Suppose that for the student loan data in our example we wanted 99% confidence. For 99% confidence, z* = 2.576. The margin of error for 99% confidence based on 1280 observations is 2.576 × 49,000/√1280 ≈ 3528, noticeably wider than the 95% margin of about 2700.
Confidence Interval
• Formula for the independent samples t-test (equal variances): the confidence interval for the difference in means is
  (x̄₁ − x̄₂) ± t(1 − α/2, df) × sp × √(1/n₁ + 1/n₂),
  where sp² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2) and df = n₁ + n₂ − 2
Confidence Interval
• If we had calculated the t-test by hand we would have calculated most of this
anyway.
• There are several points worth noting about this formula:
• It is actually a confidence interval for the difference in the means of the two populations.
• We use the upper critical t-value for the df, and half the specified alpha level from a standard
t-table.
• Let's use R for the moment:
> ballet <- c(89.2,78.2,89.3,88.3,87.3,90.1,95.2,94.3,78.3,89.3)
> football <- c(79.3,78.3,85.3,79.3,88.9,91.2,87.2,89.2,93.3,79.9)
> spool_numerator <- (var(ballet) * (length(ballet) - 1)) + (var(football) * (length(football) - 1))
> spool_denominator <- (length(ballet) - 1) + (length(football) - 1)
Confidence Intervals
> spool_fraction <- spool_numerator / spool_denominator
> spool_fraction
[1] 31.78189
> spool_rhs <- sqrt(spool_fraction * (1/length(ballet) + 1/length(football)))
> spool_rhs
[1] 2.521186
> alpha <- .05
> t_alpha_half <- qt(1 - alpha/2, df = (length(ballet) - 1) + (length(football) - 1))
> t_alpha_half
[1] 2.100922
> difference_in_mean <- mean(ballet) - mean(football)
> difference_in_mean
[1] 2.76
• Given what we have just entered and the formula you just saw, can you work out the confidence interval now?
2.76 + ((2.10)*(2.521186)) = 8.054491
2.76 – ((2.10)* (2.521186)) = -2.534491
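As a check, base R's t.test() with var.equal = TRUE reports the same interval directly; for these data it should give roughly (-2.54, 8.06), matching the hand calculation up to the rounding of the t value:
> t.test(ballet, football, var.equal = TRUE)$conf.int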
Confidence Intervals
• Note that this confidence interval includes 0, which is our null value
(the value we posited for the difference in means in our null
hypothesis); this result is expected because for this data set, we did
not find significant results and thus did not reject the null hypothesis.
• Because the sample mean is not resistant, outliers can have a large
effect on the confidence interval. You should search for outliers and
try to correct them or justify their removal before computing the
interval.
Caution!
• The most important caution concerning confidence intervals is that the margin of error in a
confidence interval covers only random sampling errors.
• The margin of error is obtained from the sampling distribution and indicates how much error can
be expected because of chance variation in randomized data production.
• Practical difficulties such as nonresponse in a sample survey cause additional errors.
• These errors can be larger than the random sampling error. This often happens when the sample size is large (so that σ/√n is small).
• Remember this unpleasant fact when reading the results of an opinion poll or other sample survey. The practical conduct of the survey influences the trustworthiness of its results in ways that are not included in the announced margin of error.
Effect size
• As an indication of the importance of a result in quantitative research,
statistical significance has enjoyed a rather privileged position for decades.
Social scientists have long given the “p < .05” rule a sort of magical quality,
with any result carrying a probability greater than .05 being quickly
discarded into the trash heap of “nonsignificant” results.
• Recently, however, researchers and journal editors have begun to view
statistical significance in a slightly less flattering light, recognizing one of its
major shortcomings: It is perhaps too heavily influenced by sample size.
• As a result, more and more researchers are becoming aware of the
importance of effect size and increasingly are including reports of effect
size in their work.
Statistical inferences
• To determine whether a statistic is statistically significant, we follow
the same general sequence regardless of the statistic (z scores, t
values, F values, correlation coefficients, etc.).
• First, we find the difference between a sample statistic and a
population parameter (either the actual parameter or, if this is not
known, a hypothesized value for the parameter).
• Next, we divide that difference by the standard error.
• Finally, we determine the probability of getting a ratio of that size due
to chance, or random sampling error.
• The problem with this process is that when we divide the numerator (i.e.,
the difference between the sample statistic and the population
parameter) by the denominator (i.e., the standard error), the sample size
plays a large role.
• In all of the formulas that we use for standard error, the larger the
sample size, the smaller the standard error. When we plug the standard
error into the formula for determining t values, F values, and z scores, we
see that the smaller the standard error, the larger these values become,
and the more likely that they will be considered statistically significant.
Effect size & sample
• Because of this effect of sample size, we sometimes find that even very small differences between the sample
statistic and the population parameter can be statistically significant if the sample size is large.
• The left side of the graph shows a fairly large difference between the sample mean and population
mean, but this difference is not statistically significant with a small sample…
Effect size & sample
• Suppose we know that the average IQ score for the population of adults in the United States is
100.
• Now suppose that I randomly select two samples of adults. One of my samples contains 25 adults,
the other 1600.
• Each of these two samples produces an average IQ score of 105 and a standard deviation of 15. Is
the difference between 105 and 100 statistically significant?
• To answer this question, we’ll calculate a z score for each sample.
• Standard Error = 15 / 25 = 3
• Standard Error = 15 / 1600 = 0.375
• Get Z-scores
• 100 – 105 / 3 = 1.666667
• 100 – 105 / 0.3753 = 13.33333
• Our z-crit for alpha of 0.05 (two-way) is 1.96.
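The same arithmetic as a quick R sketch, with two-tailed p-values from the standard Normal:
> se_small <- 15 / sqrt(25)              # 3
> se_large <- 15 / sqrt(1600)            # 0.375
> z_small <- (105 - 100) / se_small      # 1.666667, below the 1.96 cut-off
> z_large <- (105 - 100) / se_large      # 13.33333, far beyond it
> 2 * pnorm(-abs(c(z_small, z_large)))   # two-tailed p-values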
Implications
• If we are using an alpha level of .05, then a difference of 5 points on the IQ test would not be
considered statistically significant if we only had a sample size of 25, but would be highly
statistically significant if our sample size were 1,600.
• Because sample size plays such a big role in determining statistical significance, many statistics
textbooks make a distinction between statistical significance and practical significance.
• With a sample size of 1,600, a difference of even 1 point on the IQ test would produce a
statistically significant result (z = 1 ÷ .375 ⇒ z = 2.67, p < .01). However, if we had a very small
sample size of 4, even a 10-point difference in average IQ scores would not be statistically
significant (z = 10 ÷ 7.50 ⇒ z = 1.33, p > .10).
• But is a difference of 1 point on a test with a range of over 150 points really important in the real
world? And is a difference of 10 points not meaningful?
• In other words, is it a significant difference in the practical sense of the word significant?
Practical significance
• Very small effects can be highly significant (small P), especially when a test is based on a large
sample. A statistically significant effect need not be practically important.
• Plot the data to display the effect you are seeking, and use confidence intervals to estimate the
actual values of parameters.
• P-values are more informative than the reject-or-not result of a fixed level α test.
• Beware of placing too much weight on traditional values of α, such as α = 0.05.
• Significance tests are not always valid. Faulty data collection, outliers in the data, and testing a
hypothesis on the same data that suggested the hypothesis can invalidate a test.
• Many tests run at once will probably produce some significant results by chance alone, even if all
the null hypotheses are true.
Calculating effect size
• There are different formulas for calculating the effect sizes of different
statistics, but these formulas share common features. The formulas
for calculating most inferential statistics involve a ratio of a numerator
divided by a standard error. Similarly, most effect size formulas use
the same numerator, but divide this numerator by a standard
deviation rather than a standard error.
• We’ll look at an example for the t-test
• The general name for the effect size formula we will look at is Cohen’s D
• (We can also use this in another formula to give us an idea of what sample size we need!)
COHEN’S D
• Standard measure for independent samples t test
X1  X 2
d
sp
• Cohen initially suggested we could use either sample's standard deviation, since they should both be equal according to our assumptions (homogeneity of variance)
• In practice, however, researchers use the pooled standard deviation
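Continuing the ballet/football example from earlier (spool_fraction was the pooled variance, so its square root is the pooled standard deviation), a quick sketch:
> d <- (mean(ballet) - mean(football)) / sqrt(spool_fraction)
> d   # ≈ 0.49, a "medium" effect by Cohen's guidelines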
GLASS’S Δ
• For studies with control groups, we’ll use the
control group standard deviation in our formula
X1  X 2
d
scontrol
• This does not assume equal variances
Comparison
• Note that d is expressed in standard deviation units; values typically fall between 0 and 1, although d can exceed 1. There are general guidelines on what counts as small, medium and large.
Cohen's Rules Of Thumb For Effect Size
• Correlation coefficient: r = 0.1 (small effect), r = 0.3 (medium effect), r = 0.5 (large effect)
• Difference between means: d = 0.2 (small), d = 0.5 (medium), d = 0.8 (large), in standard deviations
Statistical Power
• Statistical power refers to the probability of finding an effect of a particular size
• Specifically, it is 1 − the Type II error rate (1 − β)
• Probability of rejecting the null hypothesis if it is false
• It is a function of type I error rate, sample size, and effect size
• Its utility lies in helping us determine the sample size needed to find an effect size
of a certain magnitude
Two kinds of power analysis
• A priori
• Used when planning your study
• What sample size is needed to obtain a certain level of power?
• Post hoc
• Used when evaluating study
• What chance did you have of significant results?
• Not really useful
• If you do the power analysis and conduct your analysis accordingly then you did what
you could. To say after, “I would have found a difference but didn’t have enough power”
isn’t going to impress anyone.
A priori Effect Size?
• Figure out an effect size before I run my experiment?
• Several ways to do this:
• Base it on substantive knowledge
• What you know about the situation and scale of measurement
• Base it on previous research
• Use conventions
An acceptable level of power?
• Why not set power at .99?
• Practicalities
• Howell shows that for a one-sample t-test and an effect size d of 0.33:
• Power = .80, then n = 72
• Power = .95, then n = 119
• Power = .99, then n = 162
• Cost of increasing power (usually done through increasing n) can be high
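These figures can be checked with base R's power.t.test() (a sketch; the exact n it reports will differ a little from Howell's rounded table because it uses the noncentral t, but it should be in the same ballpark):
> power.t.test(delta = 0.33, sd = 1, power = 0.80, sig.level = 0.05, type = "one.sample")
> power.t.test(delta = 0.33, sd = 1, power = 0.95, sig.level = 0.05, type = "one.sample")
> power.t.test(delta = 0.33, sd = 1, power = 0.99, sig.level = 0.05, type = "one.sample")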
Howell’s general rule
• Look for big effects
or
• Use big samples
• You may now start to understand how little power many published studies have, considering they are often looking for small effects.
• Many seem to think that if they use the central limit theorem rule of thumb
(n=30) that power is solved too.
• This is clearly not the case.
• Effects are there but if they are small it will be very unlikely that an experiment with
a small sample will ‘discover them’.
Post hoc power: the power of the actual study
• If you fail to reject the null hypothesis might want to know what
chance you had of finding a significant result – defending the failure
• As many point out this is a little dubious
• One thing we can understand regarding the power of a particular
study at hand is that it can be affected by a number of issues such as
• Reliability of measurement
• An increase in reliability can actually result in power increasing or decreasing, though here we stress the loss of power caused by unreliable measures
• Outliers
• Skewness
• Unequal N for group comparisons
• The analysis chosen
Something to consider
• Doing a sample size calculation is nice in that it gives a sense of what to shoot for, but rarely if ever do the data or circumstances bear out such that it provides a perfect estimate of our needs
• Rule of thumb sample size calculation for all studies:
• The sample size needed is the largest N you can obtain based on practical considerations
(e.g. time, money)
• Also, even the useful form of power analysis (for sample size calculation) involves statistical
significance as its focus
• While it gives you something to shoot for, our real interest regards the effect size itself and how
comfortable we are with its estimation
• Emphasizing effect size over statistical significance in a sense de-emphasizes the power problem
Errors in Null Hypothesis Tests
• Your decision vs. the true state of affairs (No Effect, H0 true, vs. Some Effect):
• Reject the null when there is no effect: Type I Error, rejecting the null when it is true (α)
• Reject the null when there is some effect: correct decision, made with probability Power = 1 − β
• Fail to reject the null when there is some effect: Type II Error, failing to reject the null when you should (β)
• Fail to reject the null when there is no effect: correct decision
α and β
• The simplest way to consider the relationship between α and β is to think of α in terms of the null hypothesis and β in terms of the alternative hypothesis (a different distribution). One has an effect on the other, but it isn't a straight linear relationship.
The Definition Of Statistical Power
• Statistical power is the probability of not missing an effect, due to
sampling error, when there really is an effect there to be found.
• Power is the probability (prob = 1 - β) of correctly rejecting Ho
when it really is false.
• Depends on:
• Effect Size
• How large is the effect in the population?
• Sample Size (N)
• You are using a sample to make inferences about the
population. How large is the sample?
• Decision Criteria - α
• How do you define “significant” and why?
Power: α and β
Conventions And Decisions About Statistical
Power
• Acceptable risk of a Type II error is often set at 1 in 5, i.e., a
probability of 0.2.
• The conventionally uncontroversial value for “adequate”
statistical power is therefore set at 1 - 0.2 = 0.8.
• People often regard the minimum acceptable statistical power for
a proposed study as being an 80% chance of an effect that really
exists showing up as a significant finding.
• http://homepage.stat.uiowa.edu/~rlenth/Power/
Power
• What you should know:
• Power is the probability of correctly rejecting the null hypothesis that a sample estimate (e.g. mean, proportion, odds, correlation coefficient, etc.) does not differ between study groups in the underlying population.
• Large values of power, at least 80%, are desirable given the available resources and ethical considerations.
• Power increases as the sample size for the study increases.
• Accordingly, an investigator can control the study power by adjusting the
sample size and vice versa
Sample size
• Let’s assume we want to calculate how many subjects per group we need to conduct a two-tailed
independent samples t-test with acceptable power.
• A standard Normal-approximation formula for the per-group sample size is n = 2(Z1−α/2 + Z1−β)² / d²; the denominator here is the (squared) effect size.
• We need Z-values for both α and β to use this formula. We will stick with the 95% confidence
interval for a two-tailed test, so the Z-value for 1 − α /2 will be 1.96
• We will compute the sample size required for 80% power, so the Z-value for 1 − β will be 0.84.
(Note that if we were doing a one-tailed test, Zα would be 1.645, and if we were calculating 90%
power, Z1 − β would be 1.28.)
• The effect size d is the difference between the two population means divided by the appropriate measure of spread (here, the standard deviation σ)
Calculating Sample Size
• If μ1 = 25, μ2 = 20, and σ = 10, the effect size is d = (25 − 20)/10 = 0.5. Plugging these numbers into the sample size formula gives n = 2(1.96 + 0.84)² / 0.5² = 15.68 / 0.25 ≈ 62.7.
• We round fractional results up to the next whole number, so we need
at least 63 subjects per group to have an 80% probability of finding a
significant difference between two groups when the effect size is 0.5.
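The same calculation as a short R sketch:
> d <- (25 - 20) / 10
> n <- 2 * (1.96 + 0.84)^2 / d^2   # ≈ 62.7
> ceiling(n)                       # 63 subjects per group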
Power analysis in R
• See: the ‘pwr’ package
> library(pwr)
> pwr.t.test(n=64, d=0.5, sig.level=.05, type="two.sample",
alternative="two.sided")
• Our previous calculations are close but it is generally a good idea to
let the stats package calculate these figures.
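For the a priori use, leave n out and supply the desired power, and pwr.t.test() solves for the sample size; for d = 0.5 and 80% power this should report roughly 64 per group, close to the 63 from the hand formula:
> library(pwr)
> pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")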