Transcript effect size

SUMMARY
Homoscedasticity
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/dont-be-a-victim-of-statistical-hippopotomonstrosesquipedaliophobia
Tests for homoscedasticity
• H0: σ1 = σ2
• F-test of equality of variances (Hartley's test): F = s_L² / s_S² (largest sample variance over the smallest), compared with the F-distribution with (n_L − 1, n_S − 1) degrees of freedom
Power of the test
• The probability that the test correctly rejects the null hypothesis (H0) when it is false.
• Equivalently, it is the probability of correctly accepting the alternative hypothesis (Ha) when it is true – that is, the ability of a test to detect an effect, if the effect actually exists.
The four possible outcomes (state of the world × decision):

                  Reject H0                           Retain H0
  H0 true         Type I error (FP, probability α)    correct decision
  H0 false        correct decision (power = 1 − β)    Type II error (FN, probability β)
What factors affect the power?
To increase the power of your test, you may do any of the
following:
1. Increase the effect size (the difference between the null
and alternative values) to be detected
The reasoning is that any test will have trouble rejecting the null
hypothesis if the null hypothesis is only 'slightly' wrong. If the effect size is
large, then it is easier to detect and the null hypothesis will be soundly
rejected.
2. Increase the sample size(s) – power analysis
3. Decrease the variability in the sample(s)
4. Increase the significance level (Ξ±) of the test
The shortcoming of setting a higher Ξ± is that Type I errors will be more
likely. This may not be desirable.
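These effects on power can be checked numerically with base R's power.t.test() (no extra packages needed); the sample sizes and effect sizes below are illustrative choices, not from the slides:

```r
# Power of a two-sample t-test for a medium effect (d = 0.5), alpha = 0.05.
p_small_n <- power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)$power
p_large_n <- power.t.test(n = 80, delta = 0.5, sd = 1, sig.level = 0.05)$power

# A larger effect size also raises the power, all else equal.
p_large_d <- power.t.test(n = 20, delta = 0.8, sd = 1, sig.level = 0.05)$power

# So does a higher significance level (at the cost of more Type I errors).
p_large_a <- power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.10)$power

round(c(p_small_n, p_large_n, p_large_d, p_large_a), 2)
```

Each of the last three powers comes out higher than the first, matching points 1, 2 and 4 above.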
NEW STUFF
Effect size
• When a difference is statistically significant, it does not necessarily mean that it is big, important or helpful in decision-making. It simply means you can be confident that there is a difference.
• For example, you evaluate the effect of sun eruptions on student knowledge (n = 2000).
• The mean score on the pretest was 84 out of 100. The mean score on the posttest was 83.
• Although you find that the difference in scores is statistically significant (because of the large sample size), the difference is very small, suggesting that eruptions do not lead to a meaningful decrease in student knowledge.
Effect size
• To know if an observed difference is not only statistically significant, but also practically important, you have to calculate its effect size.
• The effect size in our case is 84 − 83 = 1.
• The effect size is put on a common scale by standardizing – i.e., the raw difference is divided by a standard deviation (this standardized difference is known as Cohen's d).
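A minimal sketch of that standardization in R, on made-up pretest/posttest scores; the pooled standard deviation used here is one common choice of denominator:

```r
# Hypothetical pretest/posttest scores (made-up data for illustration).
pre  <- c(86, 82, 85, 83, 84)
post <- c(85, 81, 84, 82, 83)

# Cohen's d: raw mean difference divided by the pooled standard deviation.
n1 <- length(pre); n2 <- length(post)
pooled_sd <- sqrt(((n1 - 1) * var(pre) + (n2 - 1) * var(post)) / (n1 + n2 - 2))
d <- (mean(pre) - mean(post)) / pooled_sd
d  # 1 / sqrt(2.5), i.e. about 0.63
```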
Power analysis
• To ensure that your sample size is big enough, you will need to conduct a power analysis.
• For any power calculation, you will need to know:
• What type of test you plan to use (e.g., independent t-test)
• The alpha value (usually 0.05)
• The expected effect size
• The sample size you are planning to use
• Because the effect size can only be calculated after you collect data, you will have to use an estimate for the power analysis.
• Cohen suggests that for the t-test, values of 0.2, 0.5, and 0.8 represent small, medium and large effect sizes, respectively.
Power analysis in R (paired t-test)
install.packages("pwr")
library(pwr)
pwr.t.test(d=0.8,power=0.8,sig.level=0.05,type="paired",alternative="two.sided")
Paired t test power calculation
n = 14.30278
d = 0.8
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number of *pairs* – round up, so at least 15 pairs are needed.
Check for normality – histogram
Check for normality – QQ-plot
qqnorm(rivers)
qqline(rivers)
Check for normality – tests
• The graphical methods for checking data normality still leave much to your own interpretation. If you show any of these plots to ten different statisticians, you can get ten different answers.
• H0: Data follow a normal distribution.
• Shapiro-Wilk test
> shapiro.test(rivers)
Shapiro-Wilk normality test
data: rivers
W = 0.6666, p-value < 2.2e-16
The p-value < 2.2e-16, so normality is clearly rejected.
After a log transform – shapiro.test(log(rivers)) – the p-value rises to 3.945e-05: closer to normal, but still significantly non-normal.
Nonparametric statistics
• Small samples from considerably non-normal distributions call for non-parametric tests.
• No assumption about the shape of the distribution.
• No assumption about the parameters of the distribution (thus they are called non-parametric).
• Simple to do, although their theory is extremely complicated. Of course, we won't cover it at all.
• However, they are generally less powerful than their parametric counterparts.
• So if your data fulfill the assumptions about normality, use parametric tests (t-test, F-test).
Nonparametric tests
• If the normality assumption of the t-test is violated, then its nonparametric alternative should be used.
• The nonparametric alternative of the t-test is the Wilcoxon test.
• wilcox.test()
• http://stat.ethz.ch/R-manual/R-patched/library/stats/html/wilcox.test.html
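A minimal sketch of the call on made-up, skewed data; this is the two-sample form, also known as the Mann-Whitney test:

```r
# Two small samples from clearly non-normal (skewed) distributions -- made-up data.
a <- c(1.2, 1.5, 1.1, 1.8, 1.3, 1.4)
b <- c(3.1, 3.4, 3.0, 12.5, 3.2, 3.3)

# Two-sample Wilcoxon rank-sum (Mann-Whitney) test.
wilcox.test(a, b, alternative = "two.sided")

# For paired data, the nonparametric analogue of the paired t-test:
# wilcox.test(x, y, paired = TRUE)
```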
ANOVA
(ANALYSIS OF VARIANCE; Czech: ANALÝZA ROZPTYLU)
A problem
• You're comparing three brands of beer.
A problem
• You buy four bottles of each brand for the following prices.

  Primátor   Kocour   Matuška
     15         39       65
     12         45       45
     14         48       32
     11         60       38

• What do you think: which of these brands have significantly different prices?
• No significant difference between any of these.
• Primátor and Kocour
• Primátor and Matuška
• Kocour and Matuška
t-test
• We can do three t-tests to show if there is a significant difference between these brands.
• How many t-tests would you need to compare four samples?
• 6 – in general, k samples require k(k − 1)/2 pairwise tests.
• To compare 10 samples, you need 45 t-tests! This is a lot. We don't want to do a million t-tests.
• But in this lesson you'll learn a simpler method.
• It's called analysis of variance (Czech: analýza rozptylu) – ANOVA.
Multiple comparisons problem
• If you make two comparisons, and both null hypotheses are true, what is the chance that neither comparison will be statistically significant (α = 0.05)?
• 0.95 × 0.95 = 0.9025
• And what is the chance that one or both comparisons will result in a statistically significant conclusion just by chance?
• 1.0 − 0.9025 = 0.0975 ≈ 10%
• For N independent comparisons, this probability is in general 1 − 0.95^N (or, for any significance level, 1 − (1 − α)^N).
• So, for example, for 13 independent tests there is about a 50:50 chance of obtaining at least one FP.
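The arithmetic above is easy to reproduce in R:

```r
alpha <- 0.05
# Probability that N independent comparisons (all nulls true) are all non-significant:
none_significant <- function(N) (1 - alpha)^N
# Probability of at least one false positive among N comparisons:
at_least_one_fp <- function(N) 1 - (1 - alpha)^N

round(none_significant(2), 4)   # 0.9025
round(at_least_one_fp(2), 4)    # 0.0975
round(at_least_one_fp(13), 2)   # 0.49 -- about a 50:50 chance
```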
Multiple comparisons problem
Bennet et al., Journal of Serendipitous and Unexpected Results, 1, 1-5, 2010
http://www.graphpad.com/guides/prism/6/statistics/index.htm?beware_of_multiple_comparisons.htm
Correcting for multiple comparisons
• Bonferroni correction – the simplest approach is to divide the α value by the number of comparisons N. Then declare a particular comparison statistically significant when its p-value is less than α/N.
• For example, for 100 comparisons reject the null in each if its p-value is less than 0.05/100 = 0.0005.
• However, this is a bit too conservative; other approaches exist.
> p.adjust()
• "There seems no reason to use the unmodified Bonferroni correction because it is dominated by Holm's method"
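A minimal sketch of p.adjust() on three made-up p-values; note that Holm's adjusted values are never larger than Bonferroni's, which is why it "dominates" the unmodified correction:

```r
# Three hypothetical raw p-values from three comparisons.
p <- c(0.010, 0.020, 0.030)

p.adjust(p, method = "bonferroni")  # 0.03 0.06 0.09 -- each p multiplied by N = 3
p.adjust(p, method = "holm")        # 0.03 0.04 0.04 -- step-down, less conservative
```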
Main idea of ANOVA
• To compare three or more samples, we can use the same ideas that underlie t-tests.
• In the t-test, the general form of the t-statistic is

   t = (x̄₁ − x̄₂) / SE

  where the numerator (x̄₁ − x̄₂) measures the variability between the sample means, and the denominator (SE) measures the error, i.e. the variability within the samples.
• Similarly, for three or more samples we assess the variability between sample means in the numerator and the error (variability within samples) in the denominator.
ANOVA hypothesis
• H0: μ1 = μ2 = μ3
  H1: at least one pair of samples is significantly different
• Follow-up multiple comparison steps – see which means are different from each other.
F ratio

   F = (between-group variability) / (within-group variability)

• As between-group variability (variabilita mezi skupinami) increases, the F-statistic increases, and this leans more in favor of the alternative hypothesis that at least one pair of means is significantly different.
• As within-group variability (variabilita v rámci skupin) increases, the F-statistic decreases, and this leans more in favor of the null hypothesis that the means are not significantly different.
Beer brands – a boxplot
[Boxplot of the four prices per brand, annotated with the sample means x̄_P = 13, x̄_K = 48, x̄_M = 45 and the grand mean x̄_G ≈ 35.]
Between-group variability
SS – sum of squares (součet čtverců)
MS – mean square (průměrný čtverec)
SSB – between-group sum of squares (součet čtverců mezi skupinami)
MSB – between-group mean square (průměrný čtverec mezi skupinami)

Each sample mean's squared distance from the grand mean – (x̄_P − x̄_G)², (x̄_K − x̄_G)², (x̄_M − x̄_G)² – is weighted by the sample size:

   SSB = Σₖ nₖ (x̄ₖ − x̄_G)²          df_B = k − 1

   MSB = SSB / df_B = Σₖ nₖ (x̄ₖ − x̄_G)² / (k − 1)
Within-group variability
SSW – within-group sum of squares (součet čtverců uvnitř skupin)
MSW – within-group mean square (průměrný čtverec uvnitř skupin)

   MSW = SSW / df_W = Σₖ Σᵢ (xᵢ − x̄ₖ)² / (N − k)
The summary of variabilities

   MSW = SSW / df_W = Σₖ Σᵢ (xᵢ − x̄ₖ)² / (N − k)

   MSB = SSB / df_B = Σₖ nₖ (x̄ₖ − x̄_G)² / (k − 1)

  Primátor   Kocour   Matuška
     15         39       65
     12         45       45
     14         48       32
     11         60       38

• x̄ₖ ... sample mean
• xᵢ ... value of each data point
• N ... total number of data points
• k ... number of samples
• nₖ ... number of data points in each sample
• x̄_G ... grand mean
F-ratio

   F(df_B, df_W) = MSB / MSW

   df_B = k − 1
   df_W = N − k
F-distribution
[Plot of the F-distribution density for the relevant degrees of freedom.]
Beer prices

  Primátor   Kocour   Matuška
     15         39       65
     12         45       45
     14         48       32
     11         60       38
  x̄_P = 13   x̄_K = 48   x̄_M = 45

   x̄_G = 35.33
   df_B = k − 1 = 2
   df_W = N − k = 9

   SSB = Σₖ nₖ (x̄ₖ − x̄_G)² = 3011        MSB = SSB / df_B = 1505.3
   SSW = Σₖ Σᵢ (xᵢ − x̄ₖ)² = 862          MSW = SSW / df_W = 95.78

   F(2,9) = MSB / MSW = 15.72

Since 15.72 exceeds the critical value F*(2,9) ≈ 4.26 (α = 0.05), we reject H0.
Beer brands – ANOVA
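In R, the whole computation above reduces to a single aov() call; a sketch using the beer prices from the slides (diacritics dropped from the brand names for portability):

```r
# Prices of four bottles of each brand (data from the slides).
price <- c(15, 12, 14, 11,   # Primator
           39, 45, 48, 60,   # Kocour
           65, 45, 32, 38)   # Matuska
brand <- factor(rep(c("Primator", "Kocour", "Matuska"), each = 4))

fit <- aov(price ~ brand)
summary(fit)   # F(2, 9) = 15.72, p < 0.05 -> reject H0

# Follow-up pairwise comparisons, adjusted for multiple testing:
TukeyHSD(fit)
```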