
Dealing With Statistical Uncertainty
Richard Mott
Wellcome Trust Centre for Human Genetics
Synopsis
• Hypothesis testing
• P-values
• Confidence intervals
• T-test
• Non-parametric tests
• Sample size calculations
• Permutations
• Bootstrap
Hypothesis testing
Null Hypothesis H0
• Examples of H0:
– The mean of a population is 3.0
– In a genetic association study, there is no association between disease state and the genotypes of a particular SNP
Alternative Hypothesis H1
• Examples of H1:
– The mean of the population is > 3.0
– There is an association between disease and genotype
P-values
• The P-value of a statistic Z is the probability of sampling a value more extreme than the observed value z if the null hypothesis is true.
• For a one-sided test, the p-value is
α = Prob( Z > z | H0 )
• For a two-sided test it is
α = Prob( |Z| > z | H0 )
• Note this is not the probability density at z, but the tails of the cumulative distribution function
[Figure: upper and lower 2.5% tails of the N(0,1) distribution, corresponding to |z| = 1.96]
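These tail probabilities are easily computed in R (a minimal sketch):
z <- 1.96
pnorm(z, lower.tail=FALSE) # one-sided p-value, Prob(Z > z | H0), about 0.025
2 * pnorm(abs(z), lower.tail=FALSE) # two-sided p-value, Prob(|Z| > z | H0), about 0.05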
Power
• The power of a statistic Z is the probability of sampling a value more extreme than the observed value z if the alternative hypothesis is true.
• For a one-sided test, the power is
β = Prob( Z > z | H1 )
• For a two-sided test it is
β = Prob( |Z| > z | H1 )
• Note this is not the probability density at z, but the tails of the cumulative distribution function
H0: N(0,1)   H1: N(1.5,1)   H2: N(4,1)
0.05 = Prob( Z > 1.644 | H0 ) (5% upper tail)
0.44 = Prob( Z > 1.644 | H1 )
0.99 = Prob( Z > 1.644 | H2 )
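These values can be reproduced in R (a minimal sketch):
pnorm(1.644, mean=0, sd=1, lower.tail=FALSE) # about 0.05 under H0
pnorm(1.644, mean=1.5, sd=1, lower.tail=FALSE) # about 0.44 under H1
pnorm(1.644, mean=4, sd=1, lower.tail=FALSE) # about 0.99 under H2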
The Likelihood
• Likelihood = “probability of the data given the model”
• Basis of parametric statistical inference
L(θ) = ∏i Pr( yi | θ )
• Different hypotheses can often be expressed in terms of different values of the parameters θ
• Hypothesis testing is equivalent to comparing the likelihoods for different θ
The Likelihood Ratio Test
• General framework for constructing hypothesis tests:
S = Likelihood( data | H1 ) / Likelihood( data | H0 )
Reject H0 if S > s(H0, H1)
• The threshold s is chosen such that there is a probability α of making a false rejection under H0.
• α is the size of the test (the false positive rate, or Type I error), e.g. α = 0.05
• β is the power of the test, the probability of correctly rejecting H0 when H1 is true, e.g. β = 0.8
• 1 − β is the Type II error, the false negative rate
• Generally, for fixed sample size n, if we fix α then we cannot also fix β
• If we fix both α and β then we must let n vary.
• The Neyman–Pearson Lemma states that the likelihood ratio test is the most powerful of all tests of a given size
• Type III error: "correctly rejecting the null hypothesis for the wrong reason".
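A minimal sketch of a likelihood ratio in R, for two simple Normal hypotheses with known variance (the data and hypothesised means are illustrative):
y <- rnorm(20, mean=1, sd=1) # simulated data
logS <- sum(dnorm(y, 1, 1, log=TRUE)) - sum(dnorm(y, 0, 1, log=TRUE)) # log LR, H1: mu=1 vs H0: mu=0
# computed on the log scale to avoid numerical underflow; reject H0 if logS exceeds the threshold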
Example: The mean of a Normal Distribution
H0: mean = μ0 vs H1: mean = μ1
data: y1, … yn iid N(μ, σ²). Assume the variance is the same under both hypotheses and is known.
The log likelihood ratio reduces to
log S = n ( μ1 − μ0 ) ( ȳ − (μ0 + μ1)/2 ) / σ²
which is a monotone function of the sample mean ȳ. Therefore we base all inferences on the sample mean, compared to the difference in the hypothesised means.
The Normal and T distributions
• For a sample from a Normal distribution with known mean and variance, the sample mean ȳ can be standardised to follow the standard Normal distribution:
z = ( ȳ − μ ) / ( σ/√n ) ~ N(0,1)
• And the 95% confidence interval for the mean is given by
ȳ ± 1.96 σ/√n
• But we often wish to make inferences about samples from a Normal distribution when the variance is unknown: the variance must be estimated from the data as the sample variance
s² = ∑i ( yi − ȳ )² / (n − 1)
• Because of this uncertainty the distribution of the standardised sample mean t = ( ȳ − μ ) / ( s/√n ) is broader, and follows the Tn−1 distribution
The T distribution
• The Tn and Normal distributions are almost identical for n > 20
• As a rule of thumb, an approximate 95% confidence interval for the mean when n > 20 is
ȳ ± 2 s/√n
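Comparing quantiles in R shows why this works (a minimal sketch):
qnorm(0.975) # about 1.96
qt(0.975, df=20) # about 2.09, already close to 2
qt(0.975, df=100) # about 1.98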
T-tests
• The T-test compares the means of samples
• The T-test is optimal (in the likelihood ratio sense) if the data are sampled from a Normal distribution
• Even if the data are not Normal the test is useful in large samples
• There are several versions, depending on the details
T tests for the comparison of sample means
One Sample Test
• One sample y1 ….. yn
– H0: the population mean = μ0
– H1: the population mean != μ0
– Reject H0 if | ȳ − μ0 | / ( s/√n ) > tn−1(α/2)
– tn−1(α/2) is the quantile of the Tn−1 distribution such that the probability of exceeding tn−1(α/2) is α/2
– Two sided test
– in R, tn−1(0.025) = qt(0.975, n−1)
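For example, in R (a minimal sketch with simulated data):
y <- rnorm(20, 0, 1) # illustrative sample
t.test(y, mu=0) # one-sample T-test of H0: mean = 0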
T tests for the comparison of sample means
Paired T-test
• Two samples of paired data (x1,y1), (x2,y2) …. (xn,yn)
• Example:
– two experimental replicates taken from n individuals; we wish to test if the replicates are statistically similar
• H0: mean(x) = mean(y)
• Take differences d1 = x1 − y1 … dn = xn − yn
• H0 becomes: mean(d) = 0
• Apply the one sample T-test with μ0 = 0
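In R (a minimal sketch with simulated paired replicates):
x <- rnorm(20)
y <- x + rnorm(20, 0, 0.5) # illustrative paired replicates
t.test(x, y, paired=TRUE) # equivalent to t.test(x - y, mu=0)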
T tests for the comparison of sample means
Two-Sample T-test
• Two samples of unpaired data
x1, x2 ….. xn
y1, y2 ….. ym
• H0: mean(x) = mean(y)
• If we assume the variance is the same in each group then we can estimate the variance as the pooled estimator
s² = [ (n−1)sx² + (m−1)sy² ] / ( n + m − 2 )
• The test statistic is
t = ( x̄ − ȳ ) / ( s √(1/n + 1/m) ), which follows the Tn+m−2 distribution under H0
(The case with unequal variances is more complicated)
T tests in R
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
• x: a numeric vector of data values.
• y: an optional numeric vector of data values.
• alternative: a character string specifying the alternative hypothesis; must be one of "two.sided" (default), "greater" or "less".
• mu: a number indicating the true value of the mean (or difference in means for a two sample test).
• paired: a logical indicating whether you want a paired t-test.
• var.equal: a logical variable indicating whether to treat the two variances as being equal.
• conf.level: confidence level of the interval.
• formula: a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups.
• data: an optional data frame containing the variables in the model formula.
• subset: an optional vector specifying a subset of observations to be used.
T tests in R
samp1 <- rnorm( 20, 0, 1) # sample 20 numbers from N(0,1) distribution
samp2 <- rnorm( 20, 1.5, 1) # sample from N(1.5,1)
t.test( samp1, samp2,var.equal=TRUE )
Two Sample t-test
data: samp1 and samp2
t = -3.5892, df = 38, p-value = 0.000935
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.9272037 -0.5372251
sample estimates:
mean of x mean of y
0.4040969 1.6363113
Outliers
A small number of contaminating observations with a different distribution
• Outliers can destroy genuine statistical significance
– Reason: the estimate of the variance is inflated, reducing the T statistic
• And sometimes create false significance:
– Type III error: "correctly rejecting the null hypothesis for the wrong reason".
• There are robust alternatives to the T-test: Wilcoxon tests
> samp3 <- samp2
> samp3[1] = 40 # add one outlier
> t.test( samp1, samp3,var.equal=TRUE )
Two Sample t-test
data: samp1 and samp3
t = -1.5911, df = 38, p-value = 0.1199
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
-7.0519594 0.8450837
sample estimates:
mean of x mean of y
0.4040969 3.5075347
> var(samp2)
[1] 1.160251
> var(samp3)
[1] 74.88977
Non-parametric tests:
Wilcoxon signed rank test
(alternative to Paired T-test)
• Two samples of paired data (x1,y1), (x2,y2) …. (xn,yn)
• Take differences d1 = x1 − y1 … dn = xn − yn
• H0: distribution of d’s is symmetric about 0
• Compute the ranks of the absolute differences, rank(|d1|) etc
• Compute the signs of the differences
• Compute W, the sum of the ranks with positive signs
• If H0 is true then W should be close to its expected value n(n+1)/4
• The distribution of W is known, and for n > 20 a Normal approximation is used.
• in R:
wilcox.test(x, y, alternative = c("two.sided", "less", "greater"), mu = 0, paired = TRUE, exact = NULL, correct = TRUE, conf.int = FALSE, conf.level = 0.95, ...)
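For example (a minimal sketch with simulated paired data):
x <- rnorm(20)
y <- x + rnorm(20, 0.5, 1) # illustrative paired samples
wilcox.test(x, y, paired=TRUE) # signed rank test on the differences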
Wilcoxon rank sum test
(alternative to unpaired T-test)
• Two samples of unpaired data
x1, x2 ….. xn
y1, y2 ….. ym
• Arrange all the observations into a single ranked series of length N = n + m.
• R1 = sum of the ranks for the observations which came from sample 1. The sum of ranks R2 in sample 2 follows by calculation, since the sum of all the ranks equals N(N + 1)/2
• Compute U1 = R1 − n(n + 1)/2 and U2 = R2 − m(m + 1)/2, and take U = min(U1, U2).
• Normal approximation: z = ( U − μU ) / σU, where
μU = nm/2
σU² = nm(n + m + 1)/12
• The Wilcoxon rank sum test is only about 5% less efficient than the unpaired T-test (in terms of the sample size required to achieve the same power) in large samples.
Wilcoxon rank sum test
(alternative to unpaired T-test)
>wilcox.test(samp1,samp2)
Wilcoxon rank sum test
data: samp1 and samp2
W = 82, p-value = 0.001041
alternative hypothesis: true mu is not equal to 0
> wilcox.test(samp1,samp3)
# with outliers
Wilcoxon rank sum test
data: samp1 and samp3
W = 82, p-value = 0.001041
alternative hypothesis: true mu is not equal to 0
Binomial Sign Test
(alternative to the one sample T-test)
• One sample y1 ….. yn
H0: the population median = μ0
• Count M, the number of y’s that are > μ0
• If H0 is true, M should be about n/2, and should be distributed as a Binomial random variable B(n, ½)
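In R the sign test can be carried out with binom.test (a minimal sketch; the data and μ0 are illustrative):
y <- rnorm(20, 0.5, 1)
mu0 <- 0
M <- sum(y > mu0) # number of observations above mu0
binom.test(M, 20, p=0.5) # H0: Prob(y > mu0) = 1/2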
Contingency Tables
• Data are a table of counts, with the row and column margins fixed.

Observed counts (expected counts under H0 in brackets):
n11 (r1c1/N)     n12 (r1c2/N)     r1 = n11 + n12
n21 (r2c1/N)     n22 (r2c2/N)     r2 = n21 + n22
c1 = n11 + n21   c2 = n12 + n22   N

• H0: the counts in each cell are consistent with the rows and columns acting independently
The Chi-Squared statistics
X² = ∑ij ( Oij − Eij )² / Eij
logLikRatio = 2 ∑ij Oij log( Oij / Eij )
Under H0 both statistics are approximately distributed as χ² with (R−1)(C−1) degrees of freedom, where R and C are the numbers of rows and columns.
Fisher’s Exact Test
for 2 x 2 Contingency tables
• Similar idea to a permutation test (see below)
• Given the marginal totals in a 2x2 table, we can compute the probability P of getting the observed cell counts assuming the null hypothesis of no association between the classifying factors.
– hypergeometric distribution.
• Find all tables with the same margins and with probabilities no bigger than P
• The P-value of the table is the sum of these probabilities
Fisher’s Exact Test

             Male    Female
Smoker       a       c        a + c
Non-smoker   b       d        b + d
             a + b   c + d    n

             Male    Female
Smoker       10      5        15
Non-smoker   10      12       22
             20      17       37

Given the margins, the probability of the observed table is hypergeometric:
P = C(a+b, a) C(c+d, c) / C(n, a+c)
where C(x, y) is the binomial coefficient “x choose y”.
R implementation:
m <- matrix(c(10,10,5,12),nrow=2)
fisher.test(m)
Fisher's Exact Test for Count Data
data: m
p-value = 0.3152
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.513082 11.942154
sample estimates:
odds ratio
2.342796
Permutation Tests
Another non-parametric alternative
• Permutation is a general principle with many applications to hypothesis testing.
• But it is not useful for parameter estimation
• The method is best understood by an example (see the sketch after this list):
– Suppose a data set is classified into N groups. We are interested in whether there are differences between the group means (eg if N=2 then the two sample T test or Wilcoxon Rank Sum test is appropriate)
– Compute a statistic S (eg the difference between the maximum and minimum group mean)
– If there are no differences between groups, then any permutation of the group labellings between individuals should produce much the same value of S
– Repeat i = 1 to M times:
• permute the group labels
• compute the statistic on the permuted data, Si
– The permutation p-value is the fraction of the Si that exceed S
• It’s that simple!
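A minimal sketch in R, for the two-group case with simulated data:
x <- c(rnorm(20, 0, 1), rnorm(20, 1, 1)) # pooled data from two illustrative groups
g <- rep(1:2, each=20) # group labels
S <- abs(diff(tapply(x, g, mean))) # observed statistic: difference in group means
S.perm <- replicate(10000, abs(diff(tapply(x, sample(g), mean)))) # statistic on permuted labels
mean(S.perm >= S) # permutation p-value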
Bootstrapping
• Given a set of n things S = {x1 …. xn}, a bootstrap sample is formed by sampling with replacement from S, to create a new set S* of size n
[Figure: an original sample and a bootstrap resample drawn from it]
• Some elements of S will be missing from S* and some will be present multiple times (on average about 1 − 1/e ≈ 63% of the elements will be present in S*)
• S* can be used as a replacement for S in any algorithm that works on S, and we can repeatedly draw further bootstrap samples.
• Any statistic Z that can be calculated on S can also be calculated on S*, and so we can construct the bootstrap sampling distribution of Z from repeated samples.
• From this we can make inferences about the variation of Z, eg construct a 95% confidence interval (if Z is numeric); see the sketch below.
• Can be used for parameter estimation
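A minimal sketch of a bootstrap confidence interval in R (illustrative data and statistic):
y <- rnorm(50, 1, 2) # the original sample S
Z.boot <- replicate(10000, median(sample(y, replace=TRUE))) # bootstrap sampling distribution of the median
quantile(Z.boot, c(0.025, 0.975)) # percentile 95% confidence interval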
Bootstrapping
• Bootstrapping is a general way to evaluate the uncertainty in estimates of
statistics whilst making few assumptions about the data
• It is particularly useful for complex statistics whose distribution cannot be
evaluated analytically
• Example:
– In a gene expression study, we wish to find a minimal set of SNPs that in
combination will predict the variation in a phenotype.
– Suppose we have a deterministic algorithm A that will find us a plausible set of
SNPs (ie a model).
• This is likely to be a complex programme and contain some arbitrary steps.
• It is also unlikely to give any measure of uncertainty to the model.
– There may be many possible sets of SNPs with similar claims, so we would like
a way to average across different models, and evaluate the evidence for a SNP
as the fraction of models in which the SNP occurs
• To do this, we need a way to perturb the data, to generate new data sets on which we
can apply the procedure A
• Bootstrapping is one way to do this