p-value - tim bates
+
Refresher in inferential statistics
[email protected]
http://www.psy.ed.ac.uk/events/research_seminars/psych
stats
+
Resources
http://www.statmethods.net
+
Our basic question…
Did something occur?
Importantly, did what we predicted would occur actually transpire?
I.e., is the world as we predicted?
Why does this require statistics?
+
Is Breastfeeding good for Baby’s brains?
The association between breastfeeding and IQ is moderated by a genetic polymorphism (rs174575) in the FADS2 gene
Caspi A et al. PNAS 2007;104:18860-18865
©2007 by National Academy of Sciences
+
Overview
Hypothesis testing
p-values
Type I vs. Type II errors
Power
Correlation
Fisher’s exact test
T-test
Linear regression
Non-parametric statistics (mostly for you to go over in your own time)
+
Hypothesis testing
1. Propose a null and an experimental hypothesis.
Mistakes here may make the experiment un-analysable
2. Consider the assumptions of the test: Are they met?
Statistical independence of observations
Distributions of the observations.
Student's t distribution, normal distribution etc.
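3. Compute the relevant test statistic.
1. Student’s t-test -> t; ANOVA -> F; chi-square test -> χ2
4. Compute the probability of a test statistic at least this extreme if the null hypothesis is true:
1. Does the statistic exceed the critical value for your chosen threshold (i.e., is p below α)?
2. Either reject or fail to reject the null hypothesis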
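The four steps above can be sketched in R. This is a minimal illustration on simulated data, not an analysis from the talk: the group names, sample sizes, and means are assumptions made up for the example.

```r
# Hypothetical data: two independent groups of 20 simulated scores
set.seed(1)
control   <- rnorm(20, mean = 100, sd = 15)
treatment <- rnorm(20, mean = 110, sd = 15)

# 1. H0: the group means are equal; H1: they differ
# 2. Check an assumption of the test (normality within each group)
shapiro.test(control)$p.value
shapiro.test(treatment)$p.value

# 3. Compute the test statistic (t.test uses Welch's version by default)
result <- t.test(treatment, control)
result$statistic

# 4. Compare the p-value with the chosen threshold
alpha <- 0.05
decision <- if (result$p.value < alpha) "reject H0" else "fail to reject H0"
decision
```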
+
What mistakes can we make?
                      “The World”
                      Yes                  No
Your       Yes        correct detection    false positive
Decision   No         false negative       correct rejection
+
Starting to make inferences…the Binomial
Toss a coin
+
Dropping lots of coins...
Pachinko
+
Normal compared to Binomial
n=6
p = .5
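As a rough sketch of the comparison on this slide, the binomial probabilities for n = 6 and p = .5 can be set beside a normal density with the same mean (np) and standard deviation (sqrt(np(1-p))). The variable names are illustrative only.

```r
n <- 6
p <- 0.5
k <- 0:n                                   # possible numbers of heads

binom_probs   <- dbinom(k, size = n, prob = p)
normal_approx <- dnorm(k, mean = n * p, sd = sqrt(n * p * (1 - p)))

# Side-by-side table: the two columns are already close at n = 6
round(cbind(heads = k, binomial = binom_probs, normal = normal_approx), 3)
```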
+
Distributions
normal (µ, σ)
binomial (p, n)
+
Distributions
Poisson (lambda)
Accidents in a period of time;
Power
Publication rates
+
Testing what distribution you have
+
Why are things normal?
+
Central limit theorem
The mean of a large number of independent random
variables is distributed approximately normally.
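A small simulation can make the theorem concrete. The sketch below assumes an exponential (strongly skewed) population and a sample size of 50; both are arbitrary choices for illustration.

```r
set.seed(42)

# Means of 50 draws from a skewed (exponential) population, repeated 5000 times
sample_means <- replicate(5000, mean(rexp(50, rate = 1)))

mean(sample_means)   # close to the population mean of 1
sd(sample_means)     # close to 1 / sqrt(50)

# hist(sample_means, breaks = 40)   # uncomment to see the roughly normal shape
```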
+
Hypothesis testing
Making statistical decisions using experimental data.
Need to form a null hypothesis
(we can reject, but not confirm hypotheses)
A result is “significant” if it is unlikely to have occurred by
chance.
Ronald Fisher “We may discover whether a second sample is
or is not significantly different from the first”.
+
Error
Type-I error: False alarm, a bogus effect
Reject the null hypothesis when it is really true
Much of published science is Type-I error (Ioannidis, 2008)
Type-II error: Miss a real effect
Fail to reject our null hypothesis when it is false
Many small projects have this problem
Type-III error: :-)
Lazy, incompetent, or willful ignorance of the truth
+
p-values
Almost any difference (a count, a difference in means, a difference in variances) can be found with some probability, irrespective of the true situation.
All we can do is to set a threshold likelihood for deciding that an event occurred by chance.
p = .05: when the null hypothesis is true, a result at least this large would occur by chance 1 time in 20.
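The "1 time in 20" reading can be checked by simulation. In this sketch both groups are drawn from the same population, so every significant result is a false alarm; the group size of 20 and the 2000 repetitions are arbitrary assumptions.

```r
set.seed(123)

# 2000 simulated experiments in which the null hypothesis is true
p_values <- replicate(2000, t.test(rnorm(20), rnorm(20))$p.value)

# Proportion of experiments that come out "significant" at .05
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate   # expected to be near 0.05
```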
+
Type I vs. Type II errors
Type I: False positive
Likelihood of Type I = α
Using p < .05 as the criterion = setting α to .05
Type II: False negative
Likelihood of Type II = β
Power = 1-β

                World
                Yes                          No
You   Yes       Correct detection (power)    Type I (α)
      No        Type II (β)                  Correct rejection
+
P-values
The p-value is the likelihood of mean differences as large or larger than those observed in the data occurring by chance when the null hypothesis is true
p-value criteria (alpha) allow us a binary answer to our questions
Questions: is a smaller p-value
“More” significant?
An indication of a “bigger” effect? (if so, when?)
And how could we measure “effect”?
+
Compare these two statements
It’s ‘significant’, but how big is the effect?
I can see it’s big: but what is the p-value?
+
Confidence Intervals
Range of values within a given likelihood threshold (for instance 95%)
Closely related to p-values.
α = 1 - confidence level (e.g., .05 = 1 - .95)
i.e., if p < .05, the 95% CI will not include 0 (no difference)
Would you rather have a CI or a p-value?
Why?
What is an effect size?
+
P and CI
You can’t go from p to CI!
You can go from CI to p
At p = .05, the 95% CIs of two independent means overlap by about a quarter of their length (half of one arm)
At p = .01, the 95% CI bars just touch
+
Units of a Confidence Interval
Unlike p, CIs are given in the units of the DV
Cumming and Finch (2005)
BMI in people on a low-carb diet might be 19-23 kg/m²
Cumming, G. and Finch S.(2005). Inference by eye: confidence
intervals and how to read pictures of data. American
Psychologist. 60:170-80. PMID: 15740449
+
Standard Errors and Standard
Deviations
SE is (typically) the standard error of the mean
The precision with which we have estimated the population mean based on our sample
Computationally, it is σ/sqrt(n), with σ estimated by the sample SD
A 95% confidence interval is the mean ± 1.96 SE (for large samples)
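As a brief sketch of these formulas, the standard error and a 95% interval can be computed by hand and compared with the interval from t.test(), which uses the t distribution rather than 1.96. The scores are simulated and purely hypothetical.

```r
set.seed(7)
scores <- rnorm(100, mean = 50, sd = 10)   # hypothetical sample

# Standard error of the mean: sample SD divided by sqrt(n)
se <- sd(scores) / sqrt(length(scores))

# Large-sample 95% CI: mean plus or minus 1.96 SE
ci_normal <- mean(scores) + c(-1.96, 1.96) * se

# t-based 95% CI, slightly wider because sigma is estimated
ci_t <- t.test(scores)$conf.int

ci_normal
ci_t
```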
+
Example: coin toss
Random sample of 100 tosses of a coin believed to be fair
We observed 45 heads and 55 tails: Is the coin fair?
+
Binomial test
binom.test(x=45, n=100, p=.5, alternative="two.sided")
number of successes = 45, number of trials = 100
p-value = 0.3682
alternative hypothesis: true probability of success != 0.5
95 percent confidence interval: 0.3503 0.5527
sample estimates: probability of success: 0.45
+
Categorical Data
Fisher’s Exact Test
Categorical data resulting from classifying objects in one of two ways
Tests significance of the observed "contingency" of the two outcomes.
Fisher, R. A. (1922). On the interpretation of χ2 from contingency
tables, and the calculation of P. Journal of the Royal Statistical Society,
85(1), 87-94.
+
The Lady Drinking Tea
Question: Does Tea taste better if the milk is added to the
tea, or vice versa?
Null Hypothesis: The drinker cannot tell
Subjects: Ms Bristol
Experiment: 8 "trials" (cups): 4 in each way, in random
order
DV: Milk-first versus milk-second discrimination
Enter data into 2 x 2 contingency table
+
Fisher Contingency Table
               Truth
               Milk   Tea
Guess   Milk     3     1
        Tea      1     3
A = c(1, 1, 1, 0, 1, 0, 0, 0) # vector of guesses
B = c(1, 1, 1, 1, 0, 0, 0, 0) # vector of Teas
guessTable <- table(A,B) # contingency table
labels = list(Guess = c("Milk", "Tea"), Truth = c("Milk", "Tea")) # make labels
dimnames(guessTable)= labels # add label
fisher.test(guessTable, alternative = "greater") # test
+
Can she tell?
Fisher's Exact Test for Count Data
p-value = 0.24 # association could not be established
Alternative hypothesis:
true odds ratio is greater than 1
95% confidence interval: 0.313 – Inf
Sample odds ratio: 6.40
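The one-sided p-value can also be reproduced by counting. With 4 cups of each kind and 4 "milk" guesses, the number of correct milk guesses follows a hypergeometric distribution under the null hypothesis, so the p-value is the chance of 3 or 4 correct. The sketch below uses only base R.

```r
# P(exactly 3 correct) and P(exactly 4 correct) when guessing at random
p_three <- choose(4, 3) * choose(4, 1) / choose(8, 4)   # 16/70
p_four  <- choose(4, 4) * choose(4, 0) / choose(8, 4)   # 1/70
p_by_hand <- p_three + p_four                           # 17/70, about 0.243

# Same tail probability from the hypergeometric distribution
p_hyper <- phyper(2, m = 4, n = 4, k = 4, lower.tail = FALSE)

# And from Fisher's exact test on the 2 x 2 table
p_fisher <- fisher.test(matrix(c(3, 1, 1, 3), nrow = 2),
                        alternative = "greater")$p.value

c(p_by_hand, p_hyper, p_fisher)
```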
+
What if we have two continuous
variables?
Are they related?
Q: If you have continuous depression scores and cut-off scores,
which is more powerful?
+
Correlation of two continuous
variables: Pearson’s r
All variables continuous
Pearson
+
Correlation: what are the maximum
and minimum correlations?
+
Power (1-β)
Probability that a test will correctly reject the null
hypothesis.
Complement of the false negative rate, β
False negative = missing a real effect
1-β = p (correctly reject a false null hypothesis)
+
Power and how to get it
Probability of rejecting the null hypothesis when it is false
Whence comes power?
+
Power applied to a correlation
Samples of n=30 from a population in which two normal traits
correlate 0.3
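library(MASS) # provides mvrnorm()
library(ggplot2) # provides qplot() (deprecated in recent ggplot2 releases)
r = 0.3
xy = mvrnorm(n=30, mu=rep(0,2), Sigma=matrix(c(1,r,r,1), nrow=2, ncol=2)) # 30 draws from a bivariate normal
xy = data.frame(xy)
names(xy) <- c("x", "y")
qplot(x, y, data = xy, geom = c("point", "smooth"), method=lm) # scatterplot with fitted line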
+
Power of a correlation test
library(pwr)
pwr.r.test(n = 30, r = .3, sig.level = 0.05)
n = 30
r = 0.3
sig.level = 0.05
power = 0.359
alternative = two.sided
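The analytic power above can be cross-checked by brute force: simulate many samples of n = 30 from a population with r = 0.3 and count how often the correlation test is significant. This is a sketch under those assumptions; the helper one_study() is a hypothetical name introduced for the example.

```r
library(MASS)   # for mvrnorm()
set.seed(2024)

r <- 0.3
Sigma <- matrix(c(1, r, r, 1), nrow = 2)

# One simulated study: TRUE if the correlation is significant at .05
one_study <- function(n) {
  xy <- mvrnorm(n = n, mu = c(0, 0), Sigma = Sigma)
  cor.test(xy[, 1], xy[, 2])$p.value < 0.05
}

# Proportion of 5000 simulated studies that detect the effect
simulated_power <- mean(replicate(5000, one_study(30)))
simulated_power   # should land near the analytic value
```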
+
Power: r = .3
+
t-test
When we wish to compare means in a sample, we must estimate the standard deviation from the sample
Student's t-distribution describes the standardized mean of small samples from normally distributed populations when the SD is estimated this way
+
t-distribution function
t is defined as the ratio:
t = Z / sqrt(V/ν)
Z is normally distributed with expected value 0 and variance 1;
V has a chi-square distribution with ν degrees of freedom;
Z and V are independent.
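The definition can be checked numerically by building t from its ingredients and comparing a quantile with R's built-in t distribution. This is a sketch with ν = 3 chosen arbitrarily.

```r
set.seed(99)
df <- 3                               # degrees of freedom (nu)

Z <- rnorm(100000)                    # standard normal draws
V <- rchisq(100000, df = df)          # independent chi-square draws

t_sim <- Z / sqrt(V / df)             # the ratio that defines t

q_sim    <- unname(quantile(t_sim, 0.975))   # simulated 97.5th percentile
q_theory <- qt(0.975, df = df)               # theoretical value, about 3.18

c(simulated = q_sim, theoretical = q_theory)
```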
+
Normal and t-distributions
Normal is in blue
Green = t with df = 1
Red = t with df = 3 (far right = df increasing to 30)
+
Power of t-test
power.t.test(n=15, delta=.5)
Two-sample t test power calculation
n = 15 ; delta = 0.5 ; sd = 1; sig.level = 0.05
power = 0.26
alternative = two.sided
NOTE: n is number in *each* group
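The same function can be turned around to ask how many participants are needed. As a sketch, leaving n unspecified and supplying a target power of 0.80 (a conventional but arbitrary choice) makes power.t.test() solve for the sample size.

```r
# Solve for n per group, given effect size, alpha and desired power
needed <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)

needed$n            # fractional n per group
ceiling(needed$n)   # round up: 64 participants in *each* group
```

With 15 per group the study would detect this effect only about a quarter of the time, so roughly four times as many participants are needed for 80% power.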
+
Linear regression
+
Linear regression
fit = lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results
anova(fit) # anova table
coefficients(fit) # model coefficients
confint(fit, level=0.95) # CIs for model parameters
fitted(fit) # predicted values
residuals(fit) # residuals
influence(fit) # regression diagnostics
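To try these calls without a dataset to hand, a data frame can be simulated. In the sketch below, mydata and its true coefficients (2 for x1, -0.5 for x2, 0 for x3) are assumptions invented for the example.

```r
set.seed(10)
n <- 200

# Hypothetical predictors and an outcome with known coefficients
mydata <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
mydata$y <- 1 + 2 * mydata$x1 - 0.5 * mydata$x2 + rnorm(n)  # x3 has no true effect

fit <- lm(y ~ x1 + x2 + x3, data = mydata)

round(coefficients(fit), 2)   # estimates should sit near 1, 2, -0.5 and 0
confint(fit, level = 0.95)    # intervals in the units of y per unit of x
```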
+
Nonparametric Statistics
Timothy C. Bates
[email protected]
+
Bootstrapping: Kurtosis differences
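library(psych) # provides kurtosi()
# Bootstrap the difference in kurtosis between two samples x and y
kurtosisDiff <- function(x, y, B = 1000){
  kx <- replicate(B, kurtosi(sample(x, replace = TRUE))) # kurtosis of B resamples of x
  ky <- replicate(B, kurtosi(sample(y, replace = TRUE))) # kurtosis of B resamples of y
  return(kx - ky)
}
kurtDiff <- kurtosisDiff(x, y, B = 10000)
mean(kurtDiff > 0) # proportion of resamples with kx > ky: p = 0.205, NS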
+
Parametric Statistics 1
Assume data are drawn from populations with a certain distribution (usually normal)
Compute the likelihood that groups are related/unrelated or same/different given that underlying model
t-test, Pearson’s correlation, ANOVA…
+
Parametric Statistics 2
Assumptions of Parametric statistics
1. Observations are independent
2. Your data are normally distributed
3. Variances are equal across groups
Can be modified to cope with unequal σ²
+
Non-parametric Statistics?
Non-parametric statistics do not assume any underlying
distribution
They compute the likelihood that your groups are the same
or different by comparing the ranks of subjects across the
range of scores.
+
Non-parametric Statistics
Assumptions of non-parametric statistics
1. Observations are independent
+
Non-parametric Statistics?
Non-parametric statistics do not assume any underlying distribution
Because they do not exploit a known distribution, they have less power to detect effects when the parametric assumptions hold…
So don’t use them unless you have to
+
Why use a Non-parametric Statistic?
Very small samples
Leads to Type-1 (false alarm) errors
Outliers more often lead to spurious Type-1 (false
alarm) errors in parametric statistics.
Nonparametric statistics reduce data to an ordinal
rank, which reduces the impact or leverage of
outliers.
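A tiny made-up example shows the leverage of a single outlier. Nine unrelated pairs plus one extreme pair give a large, "significant" Pearson correlation, while the rank-based Spearman coefficient stays modest. The numbers are hypothetical and chosen only to make the point.

```r
# Nine unrelated pairs plus one extreme observation (the last pair)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
y <- c(5, 3, 8, 1, 9, 2, 7, 4, 6, 60)

pearson  <- cor.test(x, y)                        # driven by the outlier
spearman <- cor.test(x, y, method = "spearman")   # outlier is just rank 10

c(pearson_r = unname(pearson$estimate), p = pearson$p.value)
c(spearman_rho = unname(spearman$estimate), p = spearman$p.value)
```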
+
Non-parametric Choices
Data type?
  discrete: χ2
  continuous: Question?
    association: Spearman’s Rank
    Difference in σ²: Brown-Forsythe
    Different central value: Number of groups?
      two groups: Mann-Whitney U (Wilcoxon’s Rank Sums)
      more than 2: Kruskal-Wallis test
+
Non-parametric Choices
Data type?
  discrete: χ2 (no alternative)
  continuous: Question?
    association: Spearman’s Rank (like a Pearson’s r)
    Difference in σ²: Brown-Forsythe (like F-test)
    Different central value: Number of groups?
      two groups: Mann-Whitney U, Wilcoxon’s Rank Sums (like Student’s t)
      more than 2: Kruskal-Wallis test (like ANOVA)
+
Binomial test
binom.test(45, 100, .5, alternative="two.sided")
number of successes = 45, number of trials = 100,
p-value = 0.3682
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval: 0.3503 0.5527
Sample estimates: probability of success 0.45
binom.test(51, 235, (1/6), alternative="greater") # one-sided test of 51 successes in 235 trials against p = 1/6
+
Spearman Rank test (ρ (rho))
Named after Charles Spearman,
Non-parametric measure of correlation
Assesses how well an arbitrary monotonic function describes the
relationship between two variables,
Does not require the relationship be linear
Does not require interval measurement
+
Spearman Rank (ρ rho)
d = difference in rank of a given pair
n = number of pairs
Alternative test = Kendall's Tau (Kendall's τ)
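The formula these symbols belong to did not survive in the transcript. The standard textbook form for untied ranks, assumed here to be the one intended, is ρ = 1 - 6Σd² / (n(n² - 1)). The sketch below applies it to made-up scores and checks it against R's built-in Spearman correlation.

```r
# Hypothetical paired scores with no tied values
x <- c(86, 97, 99, 100, 101, 103, 106, 110, 112, 113)
y <- c(2, 20, 28, 27, 50, 29, 7, 17, 6, 12)

d <- rank(x) - rank(y)    # difference in rank of each pair
n <- length(x)            # number of pairs

rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
rho_builtin <- cor(x, y, method = "spearman")

c(formula = rho_formula, builtin = rho_builtin)   # the two agree when there are no ties
```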
+
Mann-Whitney U
AKA: “Wilcoxon rank-sum test”
Mann & Whitney, 1947; Wilcoxon, 1945
Non-parametric test for difference in the medians of two
independent samples
Assumptions:
Samples are independent
Observations can be ranked (ordinal or better)
+
Mann-Whitney U
U tests the difference in the medians of two independent
samples
n1 = number of obs in sample 1
n2 = number of obs in sample 2
R = sum of ranks of the lower-ranked sample
+
Mann-Whitney U or t?
Should you use it over the t-test?
Yes if you have a very small sample (<20)
(central limit assumptions not met)
If your data are really ordinal
Otherwise, probably not.
It is less prone to type-I error
(spurious significance) due to outliers.
But does not in fact handle comparisons of samples whose
variances differ very well
(Use unequal variance t-test with rank data)
+
Wilcoxon signed-rank test (related
samples)
Same idea as the Mann-Whitney U, adapted to matched samples
Non-parametric counterpart of the paired-samples t-test
+
Kruskal-Wallis
Non-parametric one-way analysis of variance by ranks (named after William Kruskal and W. Allen Wallis)
Tests equality of medians across groups.
It is an extension of the Mann-Whitney U test to 3 or more groups.
Does not assume a normal population,
Assumes population variances among groups are equal.
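In R the test is available as kruskal.test(). The sketch below runs it on three small groups of illustrative values; the group labels and numbers are assumptions, not data from the talk.

```r
# Illustrative scores for three independent groups
scores <- list(a = c(2.9, 3.0, 2.5, 2.6, 3.2),
               b = c(3.8, 2.7, 4.0, 2.4),
               c = c(2.8, 3.4, 3.7, 2.2, 2.0))

kw <- kruskal.test(scores)   # accepts a list of numeric vectors

kw$statistic   # Kruskal-Wallis chi-squared
kw$parameter   # degrees of freedom = number of groups - 1
kw$p.value
```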
+
Aesop: Mann-Whitney U Example
Suppose that Aesop is dissatisfied with his classic
experiment in which one tortoise was found to beat one hare
in a race.
He decides to carry out a significance test to discover
whether the results could be extended to tortoises and hares
in general…
+
Aesop 2: Mann-Whitney U
He collects a sample of 6 tortoises and 6 hares, and makes them all run his race. The order in which they reach the finishing post (their rank order) is as follows:
tort = c(1, 7, 8, 9, 10, 11)
hare = c(2, 3, 4, 5, 6, 12)
Original tortoise still goes at warp speed, original hare is still lazy, but the others run truer to stereotype.
+
Aesop 3: Mann-Whitney U
wilcox.test(tort, hare)
W = 25, p-value = 0.31
No significant difference between tortoises and hares
tort = c(1, 7, 8, 9, 10, 11) (n2 = 6)
hare = c(2, 3, 4, 5, 6, 12) (n1 = 6, R1 = 32)
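The statistic can be worked by hand from n1, n2 and R1. The formula on the Mann-Whitney slide did not survive in the transcript; the sketch below assumes the common textbook form U = n1·n2 + n1(n1 + 1)/2 - R1 and compares the result with the W that wilcox.test() reports.

```r
tort <- c(1, 7, 8, 9, 10, 11)   # finishing positions are already ranks
hare <- c(2, 3, 4, 5, 6, 12)

n1 <- length(hare)
n2 <- length(tort)
R1 <- sum(hare)                            # rank sum of the lower-ranked sample: 32

U       <- n1 * n2 + n1 * (n1 + 1) / 2 - R1   # 36 + 21 - 32 = 25
U_other <- n1 * n2 - U                        # complementary statistic: 11

res <- wilcox.test(tort, hare)
W <- unname(res$statistic)                 # R reports W = 25 for (tort, hare)
p <- res$p.value                           # about 0.31

c(U = U, U_other = U_other, W = W, p = p)
```

The two U values always sum to n1·n2 (here 36); which one a program reports depends on the order in which the samples are supplied.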