Transcript Slide 1

Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Module #: Title of Module
2
Module 3
Hypothesis testing
Exploratory Data Analysis and Essential Statistics using R
†
Boris Steipe
Toronto, September 8–9 2011
DEPARTMENT OF
BIOCHEMISTRY
DEPARTMENT OF
MOLECULAR GENETICS
† Includes material originally developed by
Oedipus ponders the riddle of the Sphinx. Classical (~400 BCE)
Raphael Gottardo
introduction
• One sample and two sample t-tests are used
to test a hypothesis about the mean(s) of a
distribution.
• Gene expression: Is the mean expression
level under condition 1 different from the
mean expression level under condition 2?
• Assume that the data are from a normal
distribution.
Module 3: Hypothesis testing
bioinformatics.ca
one sample t-test
text
Module 3: Hypothesis testing
bioinformatics.ca
example – HIV data
Is gene 1 differentially expressed?
Is the mean log ratio equal to zero?
# One sample t-test
data <- log(read.table(file="hiv.raw.data.24h.txt",
sep="\t", header=TRUE))
# Compute the log ratios
M <- (data[, 1:4] - data[, 5:8])
gene1 <- t.test(M[1, ], mu=0)
x <- seq(-4, 4, 0.1)
f <- dt(x, df=3)
plot(x, f, xlab="x", ylab="density", type="l", lwd=5)
segments(gene1$stat, 0, gene1$stat, dt(gene1$stat, df=3),
col=3, lwd=5)
segments(-gene1$stat, 0, -gene1$stat, dt(gene1$stat, df=3),
col=2, lwd=5)
segments(x[abs(x)>abs(gene1$stat)], 0,
x[abs(x)>abs(gene1$stat)], f[abs(x)>abs(gene1$stat)],
col=4, lwd=1)
> gene1
One Sample t-test
data: M[1, ]
t = 0.7433, df = 3, p-value = 0.5112
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-1.771051 2.850409
sample estimates:
mean of x
0.539679
Module 3: Hypothesis testing
p–value
bioinformatics.ca
example – HIV data
Is gene 4 differentially expressed?
Is the mean log ratio equal to zero?
# One sample t-test
data <- log(read.table(file="hiv.raw.data.24h.txt",
sep="\t", header=TRUE))
# Compute the log ratios
M <- (data[, 1:4] - data[, 5:8])
gene4 <- t.test(M[4, ], mu=0)
x <- seq(-4, 4, 0.1)
f <- dt(x, df=3)
plot(x, f, xlab="x", ylab="density", type="l", lwd=5)
segments(gene4$stat, 0, gene4$stat, dt(gene4$stat, df=3),
col=3, lwd=5)
segments(-gene4$stat, 0, -gene4$stat, dt(gene4$stat, df=3),
col=2, lwd=5)
segments(x[abs(x)>abs(gene4$stat)], 0,
x[abs(x)>abs(gene4$stat)], f[abs(x)>abs(gene4$stat)],
col=4, lwd=1)
> gene4
One Sample t-test
data: M[4, ]
t = 3.2441, df = 3, p-value = 0.0477
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.01854012 1.93107647
sample estimates:
mean of x
0.9748083
Module 3: Hypothesis testing
p–value
bioinformatics.ca
what is a p–value?
a) A measure of how much evidence we have against the
alternative hypothesis.
b) The probability of making an error.
c) Something that biologists want to be below 0.05 .
d) The probability of observing a value as extreme or more
extreme by chance alone.
e) All of the above.
Module 3: Hypothesis testing
bioinformatics.ca
two–sample t–test
Test if the means of two distributions are the same.
The data yi1, ..., yin are independent and normally distributed
with mean μi and variance σ2, N (μi,σ2), where i=1,2.
In addition, we assume that the data in the two groups are
independent and that the variance is the same.
H0: μ1 = μ2
Module 3: Hypothesis testing
H1: μ1 = μ2
bioinformatics.ca
two–sample t–test
Module 3: Hypothesis testing
bioinformatics.ca
example revisited – HIV data
Is gene 1 differentially expressed?
Are the means equal?
# Two sample t-test
data <- log(read.table(file="hiv.raw.data.24h.txt",
sep="\t", header=TRUE))
gene1 <- t.test(data[1,1:4], data[1,5:8], var.equal=TRUE)
x <- seq(-4, 4, 0.1)
f <- dt(x, df=6)
plot(x, f, xlab="x", ylab="density", type="l", lwd=5)
segments(gene1$stat, 0, gene1$stat, dt(gene1$stat, df=6),
col=3, lwd=5)
segments(-gene1$stat, 0, -gene1$stat, dt(gene1$stat, df=6),
col=2, lwd=5)
segments(x[abs(x)>abs(gene1$stat)], 0,
x[abs(x)>abs(gene1$stat)], f[abs(x)>abs(gene1$stat)],
col=4, lwd=1)
> gene1
Two Sample t-test
data: data[1, 1:4] and data[1, 5:8]
t = 0.6439, df = 6, p-value = 0.5434
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.511041 2.590398
sample estimates:
mean of x mean of y
2.134678 1.594999
Module 3: Hypothesis testing
p–value
bioinformatics.ca
example revisited – HIV data
Is gene 4 differentially expressed?
Are the means equal?
# Two sample t-test
data <- log(read.table(file="hiv.raw.data.24h.txt",
sep="\t", header=TRUE))
gene4 <- t.test(data[4,1:4], data[4,5:8], var.equal=TRUE)
x <- seq(-4, 4, 0.1)
f <- dt(x, df=6)
plot(x, f, xlab="x", ylab="density", type="l", lwd=5)
segments(gene4$stat, 0, gene4$stat, dt(gene4$stat, df=6),
col=3, lwd=5)
segments(-gene4$stat, 0, -gene4$stat, dt(gene4$stat, df=6),
col=2, lwd=5)
segments(x[abs(x)>abs(gene4$stat)], 0,
x[abs(x)>abs(gene4$stat)], f[abs(x)>abs(gene4$stat)],
col=4, lwd=1)
> gene1
Two Sample t-test
data: data[4, 1:4] and data[4, 5:8]
t = 2.4569, df = 6, p-value = 0.04933
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.003964227 1.945652365
sample estimates:
mean of x mean of y
7.56130 6.588321
Module 3: Hypothesis testing
p–value
bioinformatics.ca
t–test assumptions
Normality: The data need to be Normal. If not, one can use a
transformation or a non parametric test. If the sample size is
large enough (n>30), the t-test will work just fine (CLT).
Independence: Usually satisfied. If not independent, more
complex modeling is required.
Independence between groups: In the two sample t- test, the
groups need to be independent. If not, one can use a paired ttest.
Equal variances: If the variances are not equal in the two
groups, use Welch's t-test (default in R).
Module 3: Hypothesis testing
bioinformatics.ca
Exercise
Try Welch's t-test and the paired t-test.
# Two sample t-test (Welch's)
gene1 <- t.test(data[1, 1:4], data[1, 5:8])
x <- seq(-4, 4, 0.1)
f <- dt(x, df=6)
plot(x, f, xlab="x", ylab="density", type="l", lwd=5)
segments(gene1$stat, 0, gene1$stat, dt(gene1$stat, df=6), col=2, lwd=5)
segments(-gene1$stat, 0, -gene1$stat, dt(gene1$stat, df=6), col=2, lwd=5)
segments(x[abs(x)>abs(gene1$stat)], 0,
x[abs(x)>abs(gene1$stat)], f[abs(x)>abs(gene1$stat)], col=2, lwd=1)
gene1
# Paired t-test
gene1 <- t.test(as.double(data[4, 1:4]), as.double(data[4, 5:8]), paired=TRUE)
x <- seq(-4, 4, 0.1)
f <- dt(x, df=6)
plot(x, f, xlab="x", ylab="density", type="l", lwd=5)
segments(gene1$stat, 0, gene1$stat, dt(gene1$stat, df=6), col=2, lwd=5)
segments(-gene1$stat, 0, -gene1$stat, dt(gene1$stat, df=6), col=2, lwd=5)
segments(x[abs(x)>abs(gene1$stat)], 0,
x[abs(x)>abs(gene1$stat)], f[abs(x)>abs(gene1$stat)], col=2, lwd=1)
gene1
Compare with the previous results.
Module 3: Hypothesis testing
bioinformatics.ca
non–parametric tests
Constitute a flexible alternative to the t-tests where
there is no distributional assumption.
There exist several non parametric alternatives
including the Wilcoxon and Mann-Whitney tests.
In cases where a parametric test would be appropriate,
non-parametric tests have less power.
Module 3: Hypothesis testing
bioinformatics.ca
Exercise
Use R to perform a non–parametric test (Wilcoxon)
on the gene 1 data and on the gene 4 data.
Hint:Type help.search("wilcoxon")
Module 3: Hypothesis testing
bioinformatics.ca
permutation test
When computing the p-value, all we need to know is the
distribution of our statistics under the null hypothesis.
How can we estimate the null distribution?
In the two sample case, one could simply randomly permute the
group labels and recompute the statistics to simulate the null
distribution.
Repeat for a number of permutations and compute the number
of times you observed a value as extreme or more extreme.
Module 3: Hypothesis testing
bioinformatics.ca
permutation test
• Select a statistic (e.g. mean difference, t statistic)
• Compute the statistic for the original data t.
• For a number of permutations, Np
• Randomly permute the labels and compute the associated
statistic, t 0i
• p-value = { # |t 0i| > |t | } / Np
Module 3: Hypothesis testing
bioinformatics.ca
Exercise
Try the permutation test. Interpret the result.
# Permutation test gene 1
set.seed(100)
Np <- 100
t0 <- rep(0, Np)
# Here we use the difference in means, but any statistics could be used!
t <- mean(as.double(data[1, 1:4]))-mean(as.double(data[1, 5:8]))
for(i in 1:Np)
{
perm <- sample(1:8)
newdata <- data[1, perm]
t0[i] <- mean(as.double(newdata[1:4]))-mean(as.double(newdata[5:8]))
}
pvalue <- sum(abs(t0)>abs(t))/Np
pvalue
Note: this kind of test is also called "bootstrapping".
Module 3: Hypothesis testing
bioinformatics.ca
the Bootstrap
The basic idea is to resample the
data we have observed and compute
a new value of the statistic/estimator
for each resampled data set.
Then one can assess the estimator
by looking at the empirical
distribution across the resampled
data sets.
Module 3: Hypothesis testing
set.seed(100)
x <- rnorm(15)
muHat <- mean(x)
sigmaHat <- sd(x)
Nrep <- 100
muHatNew <- rep(0, Nrep)
for(i in 1:Nrep)
{
xNew <- sample(x, replace=TRUE)
muHatNew[i] <- median(xNew)
}
se <- sd(muHatNew)
muHat
se
bioinformatics.ca
Error rates
Truth
Decision
Accept H0
H0
H1
1-

"False negative"
Reject H0

1-
"False positive"
Module 3: Hypothesis testing
bioinformatics.ca
Error rates (statistician's version)
Truth
Decision
Accept H0
H0
H1
1-

"Type II error"
Reject H0

1-
"Type I error"
Module 3: Hypothesis testing
bioinformatics.ca
statistical "power"
The power of a statistical test is the probability that the test will reject the null
hypothesis when the null hypothesis is false (i.e. that it will not make a Type II
error, or a false negative decision). As the power increases, the chances of
a Type II error occurring decrease. The probability of a Type II error occurring
is referred to as the false negative rate (β). Therefore power is equal to 1 − β,
which is also known as the sensitivity.
Power analysis can be used to calculate the minimum sample size required
so that one can be reasonably likely to detect an effect of a given size. Power
analysis can also be used to calculate the minimum effect size that is likely to
be detected in a study using a given sample size. In addition, the concept of
power is used to make comparisons between different statistical testing
procedures: for example, between a parametric and a nonparametric test of
the same hypothesis.
From Wikipedia – Statistical_Power
Module 3: Hypothesis testing
bioinformatics.ca
One sample t-test – power calculation
1 sample t-test:
If the mean is μ0, t follows a t-distribution with n-1 degrees of
freedom.
If the mean is not μ0, t follows a non central t-distribution with
n-1 degrees of freedom and noncentrality parameter
(μ1-μ0) x (s/√n).
Module 3: Hypothesis testing
bioinformatics.ca
Power, error rates and decision
Power calculation in R:
> power.t.test(n = 5, delta = 1, sd=2,
alternative="two.sided", type="one.sample")
One-sample t test power calculation
n=5
delta = 1
sd = 2
sig.level = 0.05
power = 0.1384528
alternative = two.sided
Other tests are available – see ??power.
Module 3: Hypothesis testing
bioinformatics.ca
Power, error rates and decision
PR(False Negative)
PR(Type II error)
μ0 μ 1
PR(False Positive)
PR(Type I error)
Module 3: Hypothesis testing
bioinformatics.ca
Power, error rates and decision
Module 3: Hypothesis testing
bioinformatics.ca
multiple testing
Single hypothesis testing
• Fix the False Positive error rate (eg. α = 0.05).
• Minimize the False Negative (maximize sensitivity)
This is what traditional testing does.
What if we perform many tests at once? Does this
afect our False Positive rate?
Module 3: Hypothesis testing
bioinformatics.ca
multiple testing
With high-throughput methods, we usually look at a very large
number of decisions for each experiment. For example, we ask
for every gene on an array whether it is significantly up- or
downregulated.
This creates a multiple testing paradox. The more data we
collect, the harder it is for every observation to appear
significant.
Therefore:
• We need ways to assess error probability in multiple testing
situations correctly;
• We need approaches that address the paradox.
Module 3: Hypothesis testing
bioinformatics.ca
multiple testing example
1000 t-tests, all null hypotheses are true (μ = 0).
For one test, Pr of a False Positive is 0.05.
For 1000 tests, Pr of at least one False Positive is 1–(1–0.05)1000 ≈ 1
> set.seed(100)
> y <- matrix(rnorm(100000), 1000, 5)
> myt.test <- function(y){
+ t.test(y, alternative="two.sided")$p.value
+}
> P <- apply(y, 1, myt.test)
> sum(P<0.05)
[1] 44
Module 3: Hypothesis testing
bioinformatics.ca
FWER
The FamilyWise Error Rate is the probability of having at least
one False Positive (making at least one type I error) in a "family" of
observations.
Example: Bonferroni multiple adjustment.
p̃g = N x pg
If p̃g ≤ α then FWER ≤ α
This is simple and conservative, but there are many other
(more powerful) FWER procedures.
Module 3: Hypothesis testing
bioinformatics.ca
False Discovery Rate (FDR)
The FDR is the proportion of False Positives among the genes
called Differentially Expressed (DE).
Order the p-values for each of N observations:
p(1) ≤ . . . ≤ p(i ) ≤ . . . ≤ p(N)
Let k be the largest i such that p(i ) ≤ i / N x α
... then the FDR for genes 1 ... k is controlled at α.
Hypotheses need to be independent!
FDR: Benjamini and Hochberg (1995)
Module 3: Hypothesis testing
bioinformatics.ca
FDR example
# FDR
set.seed(100)
N <- 10000
alpha <- 0.05
y1 <- matrix(rnorm(9000*4, 0, 1), 9000, 4)
y2 <- matrix(rnorm(1000*4, 5, 1), 1000, 4)
y <- rbind(y1, y2)
myt.test <- function(y){
t.test(y, alternative="two.sided")$p.value
}
P <- apply(y, 1, myt.test)
sum(P<alpha)
Psorted <- sort(P)
plot(Psorted, xlim=c(0, 1000), ylim=c(0, .01))
abline(b=alpha/N, a=0, col=2)
p <- p.adjust(P, method="bonferroni")
sum(p<0.05)
p <- p.adjust(P, method="fdr")
sum(p<0.05)
# Calculate the true FDR
sum(p[1:9000]<0.05)/sum(p<0.05)
Module 3: Hypothesis testing
bioinformatics.ca
FDR – applied to the HIV dataset
# Lab on HIV data
data <- log(read.table(file="hiv.raw.data.24h.txt", sep="\t", header=TRUE))
M <- data[, 1:4]-data[, 5:8]
A <- (data[, 1:4]+data[, 5:8])/2
Bonferroni p-values < 0.05
# Here I compute the mean over the four replicates
M.mean <- apply(M, 1, "mean")
A.mean <- apply(A, 1, "mean")
n <- nrow(M)
# Basic normalization
for(i in 1:4)
M[, i] <- M[, i]-mean(M[, i])
p.val <- rep(0, n)
for(j in 1:n)
p.val[j] <- t.test(M[j, ],
alternative = c("two.sided"))$p.value
p.val.tilde <- p.adjust(p.val, method="bonferroni")
plot(A.mean, M.mean, pch=".")
points(A.mean[p.val<0.05], M.mean[p.val<0.05], col=3)
points(A.mean[p.val.tilde<0.05], M.mean[p.val.tilde<0.05], col=4)
Module 3: Hypothesis testing
bioinformatics.ca
FDR – applied to the HIV dataset
FDR < 0.1
p.val.tilde <- p.adjust(p.val, method="fdr")
plot(A.mean, M.mean, pch=".")
points(A.mean[p.val<0.05], M.mean[p.val<0.05], col=3)
points(A.mean[p.val.tilde<0.1], M.mean[p.val.tilde<0.1], col=2)
Module 3: Hypothesis testing
bioinformatics.ca
FDR – applied to the HIV dataset
M.sd <- apply(M, 1, "sd")
FDR < 0.1
plot(M.mean, M.sd)
points(M.mean[p.val.tilde<0.1], M.sd[p.val.tilde<0.1], col=2)
The samples included
under the FDR criterion are
predominantly
characterized by a small
variance.
Is this biologically meaningful?
Module 3: Hypothesis testing
bioinformatics.ca
t–test revisited
1 sample t-test:
If the mean is zero, tg approximately follows a t-distribution
with R – 1 degrees of freedom.
Problems:
1) For many genes S is small!
2) Is t really t-distributed?
Module 3: Hypothesis testing
bioinformatics.ca
t–test revisited
1) For many genes S is small!
Use a modified t–statistic.
tg =
Mg
(Sg + c)
R
(Positive constant)
2) Is t really t-distributed?
Estimate the null distribution (through permutation).
If there is no differential expression (i.e. Sample and
Control are drawn from the same distribution and H0 is
true) permuting the columns of the data matrix should not
change the statistics.
Module 3: Hypothesis testing
bioinformatics.ca
SAM
Module 3: Hypothesis testing
bioinformatics.ca
SAM
library(samr)
y <- c(1, 1, 1, 1)
data.list <- list(x=M, y=y, geneid=as.character(1:nrow(M)),
genenames=paste("g", as.character(1:nrow(M)), sep=""),
logged2=TRUE)
res <- samr(data.list, resp.type=c("One class"), s0=NULL, s0.perc=NULL,
nperms=100, center.arrays=FALSE, testStatistic=c("standard"))
delta.table <- samr.compute.delta.table(res, nvals=500)
kk <- 1
while(delta.table[kk, 5]>0.1)
kk <- kk+1
delta <- delta.table[kk, 1]
siggenes.table <- samr.compute.siggenes.table(res, delta,
data.list, delta.table)
ind.diff <- sort(as.integer(c(siggenes.table$genes.up[, 3],
siggenes.table$genes.lo[, 3])))
n1 <- dim(data)[1]
ind.log <- rep(FALSE, n1)
ind.log[ind.diff] <- TRUE
plot(A.mean, M.mean, pch=".")
points(A.mean[ind.log], M.mean[ind.log], col=3)
Module 3: Hypothesis testing
bioinformatics.ca
SAM
FDR 10%
Module 3: Hypothesis testing
bioinformatics.ca
LIMMA
Exercise: perform the same analysis with LIMMA.
### Limma ###
source("http://bioconductor.org/biocLite.R")
biocLite("limma")
## Compute necessary parameters (mean and standard deviations)
library(limma)
fit <- lmFit(M)
## Regularize the t-test
fit <- eBayes(fit)
## Default adjustment is BH (FDR)
topTable(fit, p.value=0.05, number=100)
Module 3: Hypothesis testing
bioinformatics.ca
Multiple conditions
Module 3: Hypothesis testing
bioinformatics.ca
other alternatives
• Non parametric versions of the t/F–test
• Modified versions of the t/F–test (SAM)
• Empirical and fully Bayesian approaches
Module 3: Hypothesis testing
bioinformatics.ca
summary
Sample
size
Number
of tests
p=1
p>1
n < 30
non-parametric
t-test/F- test
regularized
t-test/F-test
(e.g. SAM, limma)
+ multiple testing
n ≥ 30
t-test, F- test
t-test, F- test
+ multiple testing
Multiple testing:
If hypotheses are independent or weakly dependent use an FDR correction,
otherwise use Bonferroni's FWER.
For more complex hypotheses, try an anova (p=1) or limma (p>1).
Module 3: Hypothesis testing
bioinformatics.ca
We are on a
Coffee Break &
Networking Session
Module 3: Hypothesis testing
bioinformatics.ca