p-value - tim bates


+
Refresher in inferential statistics
[email protected]
http://www.psy.ed.ac.uk/events/research_seminars/psychstats
+
Resources

http://www.statmethods.net
+
Our basic question…

Did something occur?

Importantly, did what we predicted would occur actually transpire?
That is, is the world as we predicted?

Why does this require statistics?
+
Is Breastfeeding good for Baby’s
brains?
The association between breastfeeding and IQ is moderated by a genetic polymorphism
(rs174575) in the FADS2 gene
Caspi A et al. PNAS 2007;104:18860-18865
©2007 by National Academy of Sciences
+
Overview
 Hypothesis testing
 p-values
 Type I vs. Type II errors
 Power
 Correlation
 Fisher's exact test
 t-test
 Linear regression
 Non-parametric statistics (mostly for you to go over in your own time)
+
Hypothesis testing
1. Propose a null and an experimental hypothesis.
   Mistakes here may make the experiment un-analysable.
2. Consider the assumptions of the test: are they met?
   Statistical independence of observations
   Distributions of the observations (Student's t distribution, normal distribution, etc.)
3. Compute the relevant test statistic.
   Student's t-test → t; ANOVA → F; chi-squared → χ²
4. Compute the likelihood of the test statistic:
   Does it exceed your chosen threshold?
   Either reject (or fail to reject) the null hypothesis. (A minimal worked sketch follows below.)
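To make the four steps concrete, here is a minimal R sketch using a two-sample t-test; the data, group sizes, and means are invented purely for illustration.

set.seed(42)                              # illustrative data, not from the slides
groupA <- rnorm(30, mean = 100, sd = 15)  # step 2: independent, normally distributed observations
groupB <- rnorm(30, mean = 108, sd = 15)
result <- t.test(groupA, groupB)          # step 3: compute the test statistic (Welch t)
result$statistic                          # the value of t
result$p.value                            # step 4: likelihood of a t this extreme under the null
result$p.value < 0.05                     # reject (TRUE) or fail to reject (FALSE) the null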
+
What mistakes can we make?

                        "The World"
                        Yes                  No
Your Decision   Yes     correct detection    false positive
                No      false negative       correct rejection
+
Starting to make inferences…the
Binomial

Toss a coin
+
Dropping lots of coins...

Pachinko
+
Normal compared to Binomial (n = 6, p = .5)
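A quick sketch of that comparison in base R, using the slide's n = 6 and p = .5:

k <- 0:6
plot(k, dbinom(k, size = 6, prob = 0.5), type = "h", ylab = "probability")  # binomial(n = 6, p = .5)
curve(dnorm(x, mean = 6 * 0.5, sd = sqrt(6 * 0.5 * 0.5)), add = TRUE)       # normal with matching mean and SD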
+
Distributions
normal (µ, σ)
binomial (p, n)
+
Distributions
Poisson (lambda)
Accidents in a period of time;
Power
Publication rates
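As a sketch, here is a Poisson with an illustrative rate of λ = 2 events per period (the rate is an assumption, not from the slides):

lambda <- 2                 # hypothetical mean number of events per period
dpois(0:6, lambda)          # probability of observing 0..6 events in a period
rpois(10, lambda)           # simulate counts (e.g., accidents) for 10 periods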
+
Testing what distribution you have
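Two quick checks in R, sketched here on simulated data:

x <- rnorm(100)             # illustrative sample
qqnorm(x); qqline(x)        # Q-Q plot: points near the line suggest normality
shapiro.test(x)             # Shapiro-Wilk test of normality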
+
Why are things normal?
+
Central limit theorem

The mean of a large number of independent random
variables is distributed approximately normally.
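A small simulation makes the point: means of samples drawn from a decidedly non-normal (uniform) distribution are themselves approximately normal.

means <- replicate(5000, mean(runif(30)))   # 5000 means, each of n = 30 uniform draws
hist(means, breaks = 40)                    # roughly bell-shaped
qqnorm(means); qqline(means)                # close to the normal reference line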
+
Hypothesis testing

Making statistical decisions using experimental data.

Need to form a null hypothesis (we can reject, but not confirm, hypotheses).

A result is "significant" if it is unlikely to have occurred by chance.

Ronald Fisher: "We may discover whether a second sample is or is not significantly different from the first."
+
What mistakes can we make?

                        "The World"
                        Yes                  No
Your Decision   Yes     correct detection    false positive
                No      false negative       correct rejection
+
Error
 Type-I error: false alarm, a bogus effect
 Reject the null hypothesis when it is really true
 Much of published science is Type-I error (Ioannidis, 2008)
 Type-II error: miss a real effect
 Fail to reject our null hypothesis when it is false
 Many small projects have this problem
 Type-III error: :-)
 Lazy, incompetent, or willful ignorance of the truth
+
p-values
 Almost any difference (a count, a difference in means, a difference in variances) can be found with some probability, irrespective of the true situation.
 All we can do is set a threshold likelihood for deciding that an event occurred by chance.
 p = .05 means that 1 time in 20 a result this large would occur by chance when the null hypothesis is true.
+
Type I vs. Type II errors

Type I: false positive
 Likelihood of Type I = α
 p = .05 = setting α to .05

Type II: false negative
 Likelihood of Type II = β
 Power = 1-β

                    "World"
                    Yes                          No
You    Yes          Correct detection (power)    Type I (α)
       No           Type II (β)                  Correct rejection
+
P-values

The p-value is the likelihood of mean differences as large as or larger than those observed in the data occurring by chance.

p-value criteria (alpha, α) allow us a binary answer to our questions.

Questions – is a smaller p-value:
 "More" significant?
 Indicative of a "bigger" effect? (If so, when?)
 And how could we measure "effect"?
+
Compare these two statements

It’s ‘significant’, but how big is the effect?

I can see it’s big: but what is the p-value?
+
Confidence Intervals

Range of values within a given likelihood threshold (for instance 95%).

Closely related to p-values: the confidence level is 1 - α, so if p < .05, the 95% CI will not include 0 (no difference).

Would you rather have a CI or a p-value? Why?

What is an effect size?
+
P and CI

You can't go from p to a CI!

You can go from a CI to p.

At p = .05, the two 95% CIs will overlap by less than 25%.

At p = .01, the 95% CI bars just touch.
+
Units of a Confidence Interval

Unlike p, CIs are given in the units of the DV (Cumming and Finch, 2005).

For example, BMI in people on a low-carb diet might be 19-23 kg/m².

Cumming, G., & Finch, S. (2005). Inference by eye: confidence intervals and how to read pictures of data. American Psychologist, 60, 170-180. PMID: 15740449
+
Standard Errors and Standard Deviations

SE is (typically) the standard error of the mean: the precision with which we have estimated the population mean based on our sample.

Computationally, it is σ/sqrt(n).

A 95% confidence interval is approximately the sample mean ± 1.96 SE.
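A sketch of those formulas in R, on an illustrative sample:

x <- rnorm(50, mean = 10, sd = 2)     # hypothetical sample
se <- sd(x) / sqrt(length(x))         # standard error of the mean
mean(x) + c(-1.96, 1.96) * se         # approximate 95% confidence interval
t.test(x)$conf.int                    # exact t-based CI, for comparison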
+
Example: coin toss

A random sample of 100 tosses of a coin believed to be fair.

We observed 45 heads and 55 tails: is the coin fair?
+
Binomial test

binom.test(x = 45, n = 100, p = .5, alternative = "two.sided")
number of successes = 45, number of trials = 100
p-value = 0.3682
alternative hypothesis: true probability of success != 0.5
95 percent confidence interval: 0.3503 0.5527
sample estimates: probability of success = 0.45
+
Categorical Data
 Fisher's Exact Test
 Categorical data resulting from classifying objects in one of two ways
 Tests the significance of the observed "contingency" between the two outcomes

Fisher, R. A. (1922). On the interpretation of χ2 from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85(1), 87-94.
+
The Lady Drinking Tea

Question: does tea taste better if the milk is added to the tea, or vice versa?

Null hypothesis: the drinker cannot tell.

Subject: Ms Bristol

Experiment: 8 "trials" (cups), 4 prepared each way, presented in random order.

DV: milk-first versus milk-second discrimination.

Enter the data into a 2 x 2 contingency table.
+
Fisher Contingency Table

                Guess
Truth           Milk    Tea
Milk            3       1
Tea             1       3

A <- c(1, 1, 1, 0, 1, 0, 0, 0) # vector of guesses
B <- c(1, 1, 1, 1, 0, 0, 0, 0) # vector of Teas
guessTable <- table(A, B) # contingency table
labels <- list(Guess = c("Milk", "Tea"), Truth = c("Milk", "Tea")) # make labels
dimnames(guessTable) <- labels # add labels
fisher.test(guessTable, alternative = "greater") # test
+
Can she tell?
Fisher's Exact Test for Count Data
p-value = 0.24 # association could not be established
Alternative hypothesis:
true odds ratio is greater than 1
95% confidence interval: 0.313 – Inf
Sample odds ratio: 6.40
+
What if we have two continuous variables?

Are they related?

Q: if you have continuous depression scores and cut-off scores, which is more powerful?
+
Correlation of two continuous
variables: Pearson’s r

All variables continuous

Pearson
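In R, cor.test() returns r together with its confidence interval and p-value; a sketch on simulated data:

x <- rnorm(30)
y <- 0.3 * x + rnorm(30)              # two continuous variables, simulated for illustration
cor(x, y)                             # Pearson's r
cor.test(x, y, method = "pearson")    # r with a 95% CI and a p-value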
+
Correlation: what are the maximum
and minimum correlations?
+
Power (1-β)

Probability that a test will correctly reject the null
hypothesis.

Complement of the false negative rate, β

False negative = missing a real effect

1-β = P(correctly rejecting a false null hypothesis)
+
Power and how to get it

Probability of rejecting the null hypothesis when it is false

Whence comes power?
+
Power applied to a correlation

Samples of n = 30 from a population in which two normal traits correlate 0.3:

library(MASS)    # for mvrnorm
library(ggplot2) # for qplot
r <- 0.3
xy <- mvrnorm(n = 30, mu = rep(0, 2), Sigma = matrix(c(1, r, r, 1), nrow = 2, ncol = 2))
xy <- data.frame(xy)
names(xy) <- c("x", "y")
qplot(x, y, data = xy, geom = c("point", "smooth"), method = lm)
+
Power of a correlation test

library(pwr)
pwr.r.test(n = 30, r = .3, sig.level = 0.05)
n = 30
r = 0.3
sig.level = 0.05
power = 0.359
alternative = two.sided
+
Power: r = .3
+
t-test

When we wish to compare means in a sample, we must
estimate the standard deviation from the sample

Student's t-distribution is the distribution of small samples
from normally varying populations
+
t-distribution function

t is defined as the ratio:

t = Z / sqrt(V / ν)

where Z is normally distributed with expected value 0 and variance 1, and V has a chi-square distribution with ν degrees of freedom.
+
Normal and t-distributions

Normal is in blue

Green = t with df = 1

Red = t with df = 3 (far right = df increasing to 30)
+
Power of t-test
power.t.test(n=15, delta=.5)
Two-sample t test power calculation
n = 15 ; delta = 0.5 ; sd = 1; sig.level = 0.05
power = 0.26
alternative = two.sided
NOTE: n is number in *each* group
+
Linear regression
+
Linear regression

fit <- lm(y ~ x1 + x2 + x3, data = mydata)
summary(fit)               # show results
anova(fit)                 # anova table
coefficients(fit)          # model coefficients
confint(fit, level = 0.95) # CIs for model parameters
fitted(fit)                # predicted values
residuals(fit)             # residuals
influence(fit)             # regression diagnostics
+
Nonparametric Statistics
Timothy C. Bates
[email protected]
+
Bootstrapping: Kurtosis differences

library(psych) # for kurtosi()
kurtosisDiff <- function(x, y, B = 1000){
  kx <- replicate(B, kurtosi(sample(x, replace = TRUE)))
  ky <- replicate(B, kurtosi(sample(y, replace = TRUE)))
  return(kx - ky)
}
kurtDiff <- kurtosisDiff(x, y, B = 10000)
mean(kurtDiff > 0) # p = 0.205, NS
+
Parametric Statistics 1

Assume data are drawn from samples with a certain
distribution (usually normal)

Compute the likelihood that groups are related/unrelated or
same/different given that underlying model

t-test, Pearson’s correlation, ANOVA…
+
Parametric Statistics 2

Assumptions of parametric statistics:
1. Observations are independent
2. Your data are normally distributed
3. Variances are equal across groups
   (can be modified to cope with unequal σ²)
+
Non-parametric Statistics?

Non-parametric statistics do not assume any underlying
distribution

They compute the likelihood that your groups are the same
or different by comparing the ranks of subjects across the
range of scores.
+
Non-parametric Statistics

Assumptions of non-parametric statistics:
1. Observations are independent
+
Non-parametric Statistics?

Non-parametric statistics do not assume any underlying
distribution

Estimating or modeling this distribution reduces their power
to detect effects…

So don’t use them unless you have to
+
Why use a Non-parametric Statistic?

Very small samples
 These lead to Type-1 (false alarm) errors

Outliers more often lead to spurious Type-1 (false alarm) errors in parametric statistics.
 Non-parametric statistics reduce data to an ordinal rank, which reduces the impact or leverage of outliers.
+
Non-parametric Choices

Data type?
  discrete → χ2
  continuous → Question?
    association → Spearman's Rank
    different central value → Number of groups?
      two groups → Mann-Whitney U / Wilcoxon's Rank Sums
      more than 2 → Kruskal-Wallis test
    difference in σ² → Brown-Forsythe
+
Non-parametric Choices

Data type?
  discrete → χ2 (no parametric alternative)
  continuous → Question?
    association → Spearman's Rank (like Pearson's r)
    different central value → Number of groups?
      two groups → Mann-Whitney U / Wilcoxon's Rank Sums (like Student's t)
      more than 2 → Kruskal-Wallis test (like ANOVA)
    difference in σ² → Brown-Forsythe (like the F-test)
+
Binomial test

binom.test(45, 100, .5, alternative = "two.sided")
number of successes = 45, number of trials = 100
p-value = 0.3682
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval: 0.3503 0.5527
sample estimates: probability of success 0.45

binom.test(51, 235, (1/6), alternative = "greater")
+
Spearman Rank test (ρ, rho)

Named after Charles Spearman; a non-parametric measure of correlation.

Assesses how well an arbitrary monotonic function describes the relationship between two variables.

Does not require the relationship to be linear.

Does not require interval measurement.
+
Spearman Rank (ρ, rho)
 ρ = 1 - 6 Σ d² / (n (n² - 1))   (assuming no tied ranks)
 d = difference in rank of a given pair
 n = number of pairs
 Alternative test: Kendall's Tau (Kendall's τ)
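Both are available through cor.test(); the vectors here are made-up paired scores for illustration:

x <- c(10, 20, 30, 40, 50)
y <- c(12, 25, 28, 55, 70)
cor.test(x, y, method = "spearman")   # Spearman's rho
cor.test(x, y, method = "kendall")    # Kendall's tau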
+
Mann-Whitney U

AKA the "Wilcoxon rank-sum test" (Mann & Whitney, 1947; Wilcoxon, 1945)

A non-parametric test for a difference in the medians of two independent samples.

Assumptions:
 Samples are independent
 Observations can be ranked (ordinal or better)
+
Mann-Whitney U

U tests the difference in the medians of two independent
samples

n1 = number of obs in sample 1

n2 = number of obs in sample 2

R = sum of ranks of the lower-ranked sample
+
Mann-Whitney U or t?

Should you use it over the t-test?
 Yes if you have a very small sample (<20)
 (central limit assumptions not met)
 If your data are really ordinal
 Otherwise, probably not.

It is less prone to Type-I error (spurious significance) due to outliers.

But it does not handle comparisons of samples whose variances differ very well.
 (Use an unequal-variance t-test on rank data instead.)
+
Wilcoxon signed-rank test (related samples)

Same idea as the Mann-Whitney U, generalised to matched samples.

The non-parametric equivalent of the non-independent (paired) samples t-test.
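A sketch in R, with invented pre/post scores for eight matched cases:

pre  <- c(12, 15, 19, 22, 25, 30, 31, 34)   # hypothetical first measurement
post <- c(14, 14, 24, 25, 21, 36, 38, 42)   # hypothetical second measurement
wilcox.test(pre, post, paired = TRUE)       # Wilcoxon signed-rank test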
+
Kruskal-Wallis
 Non-parametric one-way analysis of variance by ranks (named after William Kruskal and W. Allen Wallis)
 Tests equality of medians across groups
 It is an extension of the Mann-Whitney U test to 3 or more groups
 Does not assume a normal population
 Assumes population variances among groups are equal
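A sketch with kruskal.test() on three invented groups:

g1 <- c(2.9, 3.0, 2.5, 2.6, 3.2)      # made-up scores, group 1
g2 <- c(3.8, 2.7, 4.0, 2.4)           # group 2
g3 <- c(2.8, 3.4, 3.7, 2.2, 2.0)      # group 3
kruskal.test(list(g1, g2, g3))        # tests whether the group medians differ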
+
Aesop: Mann-Whitney U Example

Suppose that Aesop is dissatisfied with his classic
experiment in which one tortoise was found to beat one hare
in a race.

He decides to carry out a significance test to discover
whether the results could be extended to tortoises and hares
in general…
+
Aesop 2: Mann-Whitney U
 He collects a sample of 6 tortoises and 6 hares, and makes them all run his race. The order in which they reach the finishing post (their rank order) is as follows:
 tort = c(1, 7, 8, 9, 10, 11)
 hare = c(2, 3, 4, 5, 6, 12)
 Original tortoise still goes at warp speed, original hare is still lazy, but the others run truer to stereotype.
+
Aesop 3: Mann-Whitney U

wilcox.test(tort, hare)

Wilcoxon = W = 25, p-value = 0.31

Tortoises and hares do not differ

tort = c(1, 7, 8, 9, 10,11) (n2 = 6)

hare = c(2, 3, 4, 5, 6, 12) (n1 = 6, R1 =32)