Introduction to resampling in MATLAB
Feb 5, 2009
Austin Hilliard
So you've done an experiment...
•
Two independent datasets: control and experimental
•
You have n numbers in each dataset, representing independent measurements
•
Question: did the experimental manipulation have an effect, i.e. did it make a difference?
The question restated
•
Is there a significant (i.e. “real”) numerical difference between the control and experimental datasets?
•
What is the probability that the two datasets are subsets of two different/independent distributions?
•
What is the probability that the two datasets are NOT subsets of the same underlying distribution?
Answering the question
•
What do we need?
–
A number(s) to represent/summarize each dataset → data descriptors
–
A test for comparing these numbers → test statistic
–
A way to assess the significance of the test statistic → a distribution of test statistics generated from datasets that ARE from the same underlying distribution
Descriptors and tests
•
Know your data! Visualize it.
•
Does it have a central tendency? (i.e. is the histogram vaguely mound shaped?)
–
If yes, use the mean as descriptor
–
If no, the median may be better (see the quick illustration after this slide)
•
Typical test: compute the difference (subtraction) between descriptors
•
We'll discuss how to assess the significance of your test statistic soon; that's the resampling part...
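As a quick illustration of that mean-vs-median choice (hypothetical numbers, not from the slides): one extreme value drags the mean away from the bulk of the data, while the median stays put.
skewed = [2 3 3 4 4 5 90];   % hypothetical data with one extreme value
mean(skewed)                 % ans = 15.857, pulled up by the outlier
median(skewed)               % ans = 4, matches the bulk of the data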
But first let's visualize some data
Simulate and inspect data
ctrl = 10.*(5.75+randn(25,1));   % simulated control data: mean 57.5, sd 10
exp = 10.*(5+randn(25,1));       % simulated experimental data: mean 50, sd 10
                                 % (careful: 'exp' shadows MATLAB's built-in exp function)
boxplot([ctrl exp])
groups = {'control' 'experimental'};
boxplot([ctrl exp],'labels',groups)
Central tendency?
hist(ctrl)
title('Control group')
figure                           % new window so the control histogram isn't overwritten
hist(exp)
title('Experimental group')
Compute descriptors and test stat
•
Our data looks pretty mound shaped, so let's use the mean as data descriptor:
ctrl_d = mean(ctrl);
exp_d = mean(exp);
•
Take the absolute value of the difference as our test stat:
test_stat = abs(ctrl_d - exp_d);
Assessing the test statistic
•
What we would really like to test: the probability that the two datasets are subsets of two different/independent distributions
•
Problem: we don't know what those hypothetical independent distributions are; if we did, we wouldn't have to go through all of this!
Assessing the test statistic
•
Solution: compute the probability that the two datasets are NOT subsets of the same underlying distribution
•
How to do this?
–
Start by assuming the datasets are subsets of the same distribution → the null hypothesis
–
See what the test statistic looks like when the null hypothesis is true (simulated below)
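To build that intuition, the null hypothesis can be simulated directly (a hypothetical check, mirroring the earlier simulation): draw both groups from the same distribution and see how big a difference chance alone produces.
a = 10.*(5 + randn(25,1));   % both groups drawn from the
b = 10.*(5 + randn(25,1));   % same underlying distribution
abs(mean(a) - mean(b))       % chance difference: usually small, never exactly zero
Running this a few times gives a feel for the spread of "no effect" differences; the bootstrap below just automates it 10,000 times.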
Assessing the test statistic
•
Need to generate a distribution of the test statistic from datasets that really ARE from the same underlying distribution
→ the distribution of the test stat under the null hypothesis
•
Then we can quantify how (a)typical the value of our test stat is under the null hypothesis:
–
if typical, our datasets are likely from the same distribution → no effect of experiment
–
if atypical, there is a good chance the datasets are from different distributions → the experiment had an effect
Generate the null distribution using bootstrap
•
Basic procedure:
1. Combine datasets
2. Randomly sample (w/ replacement) from the combined dataset to create two pseudo datasets w/ same n as the real datasets
3. Compute descriptors and test statistic for the pseudo datasets
4. Repeat 10,000 times, keeping track of the pseudo test stat
Combine datasets and resample
box = [ctrl; exp];                              % combine datasets into 'box'
                                                % could use concat()
pseudo_stat_dist = zeros(1,10000);              % create vector 'pseudo_stat_dist'
                                                % to store results of resampling
% --- start resampling ---
for trials = 1:10000                            % resample 10,000 times
    pseudo_ctrl = sample(length(ctrl), box);    % create pseudo groups to be
    pseudo_exp = sample(length(exp), box);      % same size as actual groups
    pseudo_ctrl_d = mean(pseudo_ctrl);          % compute pseudo group
    pseudo_exp_d = mean(pseudo_exp);            % descriptors
    pseudo_stat = abs(pseudo_ctrl_d - pseudo_exp_d);   % compute pseudo test stat
    pseudo_stat_dist(trials) = pseudo_stat;     % store pseudo test stat to
                                                % build null distribution
end
% --- end resampling ---
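Note that sample() here is the course's helper function (it draws values with replacement, per the procedure above), not built-in MATLAB. A minimal stand-in, assuming sample(n, box) returns n values drawn from box with replacement (save as sample_with_replacement.m):
function s = sample_with_replacement(n, box)
% hypothetical stand-in for the course's sample() helper:
% draw n values from the vector 'box', with replacement
idx = ceil(rand(n,1) .* length(box));   % n random indices into 1..length(box)
s = box(idx);                           % pull the corresponding values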
Now what?
•
Compute how many values of the pseudo test stat are greater than our actual test stat, then divide by the total number of data points in the null distribution
•
This approximates the likelihood of getting our actual test stat (i.e. the likelihood of seeing a difference as big as we see) if our two datasets were truly from the same underlying distribution
→ p-value
Compute p-val and visualize null distribution
bigger = count(pseudo_stat_dist > test_stat);   % count pseudo test stats
                                                % bigger than actual stat
pval = bigger / length(pseudo_stat_dist)        % divide by total number
                                                % of pseudo test stats to
                                                % compute p-value
                                                % could use proportion()
hist(pseudo_stat_dist)                          % show histogram of pseudo
title('Pseudo test stat distribution')          % test stats
xlabel('Pseudo test stat')
ylabel('Frequency')
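count() and proportion() are also course helpers. With built-in MATLAB alone the same p-value falls out of summing a logical comparison (a sketch, assuming count() simply tallies the true elements):
bigger = sum(pseudo_stat_dist > test_stat);   % sum of a logical vector counts the true entries
pval = bigger / length(pseudo_stat_dist)      % same p-value
% or in one line: pval = mean(pseudo_stat_dist > test_stat)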
How to interpret the p-value?
•
Assuming that the null hypothesis is true, the p-value is the likelihood that we would see a value of the test statistic that is greater than the value of our actual test statistic
•
Restated → assuming both groups really are from the same underlying distribution, the p-value is the likelihood that we would see a difference between them as big as we do
“Paired” data
•
What if your two datasets are not actually independent?
•
Maybe you have a single group of subjects/birds/cells/etc. with measurements taken under different conditions
•
Important: the measurements within each condition must still be independent
Bootstrap to test paired data
•
Basic strategy is the same, but...
•
Instead of one descriptor per group, have as many descriptors as data pairs
–
compute the difference between the measurement under condition 1 and the measurement under condition 2
•
Test statistic is the mean/median/etc. of the differences
•
Slightly different resampling procedure
Visualize data
•
Investigate the distributions of the raw data; also plot the distribution of the differences between paired data points (a simulated example follows)
•
Why?
→ The test statistic is computed from these differences, not the raw datasets
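For a concrete demo, paired data can be simulated along the same lines as the earlier example (hypothetical numbers; the shared per-subject baseline is what makes the two conditions paired):
base = 10.*(5.5 + randn(25,1));   % hypothetical per-subject baseline
ctrl = base + randn(25,1);        % measurements under condition 1
exp = base - 2 + randn(25,1);     % condition 2: shifted down by a true effect
hist(ctrl - exp)                  % distribution of the paired differences
title('Differences (ctrl - exp)')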
Resampling procedure
•
Randomly sample 1 or -1 and store in a vector
•
Multiply by the vector of differences
–
randomizes the direction of change
•
Compute the test stat (mean, median, etc.) for the randomized differences
•
Repeat
•
Now we have the distribution of the test statistic when the direction of change is random, i.e. NOT dependent on experimental conditions
Compute descriptors and test stat
diffs = ctrl - exp;                  % compute change from ctrl condition
                                     % to exp condition
test_stat = abs(mean(diffs));        % compute test statistic
pseudo_stat_dist = zeros(1,10000);   % create vector to store pseudo test
                                     % stats generated from resampling
Resampling
for trials = 1:10000                           % resample 10,000 times
    signs = sample(length(diffs), [-1 1])';    % randomly create vector of 1's
                                               % and -1's same size as diffs
    pseudo_diffs = signs .* diffs;             % multiply by actual diffs to
                                               % randomize direction of diff
    pseudo_stat = abs(mean(pseudo_diffs));     % compute pseudo test stat from
                                               % pseudo diffs
    pseudo_stat_dist(trials) = pseudo_stat;    % store pseudo test stat to
                                               % grow null distribution
end
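Here again sample() is the course helper. A built-in way to draw the same random ±1 vector (a sketch, assuming each sign is equally likely):
signs = 2.*round(rand(length(diffs),1)) - 1;   % fair 0/1 coin flips mapped to -1/+1
pseudo_diffs = signs .* diffs;                 % flip each difference at random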
Compute p-val and visualize null distribution
bigger = count(pseudo_stat_dist > test_stat);   % count pseudo test stats
                                                % bigger than actual stat
pval = bigger / length(pseudo_stat_dist)        % divide by total number
                                                % of pseudo test stats to
                                                % compute p-value
                                                % could use proportion()
hist(pseudo_stat_dist)                          % show histogram of pseudo
title('Pseudo test stat distribution')          % test stats
xlabel('Pseudo test stat')
ylabel('Frequency')
function [] = bootstrap2groupsMEAN_PAIRED (ctrl, exp)

% --- prepare for resampling ---
diffs = ctrl - exp;                             % compute change from ctrl condition
                                                % to exp condition
test_stat = abs(mean(diffs))                    % compute test statistic
pseudo_stat_dist = zeros(1,10000);              % create vector to store pseudo test
                                                % stats generated from resampling

% --- resampling ---
for trials = 1:10000                            % resample 10,000 times
    signs = sample(length(diffs), [-1 1])';     % randomly create vector of 1's
                                                % and -1's same size as diffs
    pseudo_diffs = signs .* diffs;              % multiply by actual diffs to
                                                % randomize direction of diff
    pseudo_stat = abs(mean(pseudo_diffs));      % compute pseudo test stat from
                                                % pseudo diffs
    pseudo_stat_dist(trials) = pseudo_stat;     % store pseudo test stat to
                                                % grow null distribution
end
% --- end resampling ---

bigger = count(pseudo_stat_dist > test_stat);   % count pseudo test stats
                                                % bigger than actual stat
pval = bigger / length(pseudo_stat_dist)        % divide by total number
                                                % of pseudo test stats to
                                                % compute p-value
                                                % could use proportion()
hist(pseudo_stat_dist)                          % show histogram of pseudo
title('Pseudo test stat distribution')          % test stats
xlabel('Pseudo test stat')
ylabel('Frequency')
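Saved as 'bootstrap2groupsMEAN_PAIRED.m', it runs on the paired data directly (ctrl and exp as equal-length column vectors of paired measurements):
bootstrap2groupsMEAN_PAIRED(ctrl, exp)   % prints test_stat and pval, plots the null distribution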
function [] = bootstrap2groupsMEANS (ctrl, exp)
%
% function [] = bootstrap2groupsMEANS (ctrl, exp)
%
% 'ctrl' and 'exp' are assumed to be column vectors; this is
% necessary for creation of 'box'
%

% --- prepare for resampling ---
ctrl_d = mean(ctrl);                            % compute descriptors
exp_d = mean(exp);                              % 'ctrl_d' and 'exp_d'
test_stat = abs(ctrl_d - exp_d)                 % compute test statistic 'test_stat'
box = [ctrl; exp];                              % combine datasets into 'box'
                                                % could use concat()
pseudo_stat_dist = zeros(1,10000);              % create vector 'pseudo_stat_dist'
                                                % to store results of resampling

% --- start resampling ---
for trials = 1:10000                            % resample 10,000 times
    pseudo_ctrl = sample(length(ctrl), box);    % create pseudo groups to be
    pseudo_exp = sample(length(exp), box);      % same size as actual groups
    pseudo_ctrl_d = mean(pseudo_ctrl);          % compute pseudo group
    pseudo_exp_d = mean(pseudo_exp);            % descriptors
    pseudo_stat = abs(pseudo_ctrl_d - pseudo_exp_d);   % compute pseudo test stat
    pseudo_stat_dist(trials) = pseudo_stat;     % store pseudo test stat to
                                                % build null distribution
end
% --- end resampling ---

bigger = count(pseudo_stat_dist > test_stat);   % count pseudo test stats
                                                % bigger than actual stat
pval = bigger / length(pseudo_stat_dist)        % divide by total number
                                                % of pseudo test stats to
                                                % compute p-value
                                                % could use proportion()
hist(pseudo_stat_dist)                          % show histogram of pseudo
title('Pseudo test stat distribution')          % test stats
xlabel('Pseudo test stat')
ylabel('Frequency')
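Likewise, saved as 'bootstrap2groupsMEANS.m', calling it on the two independent groups reproduces the step-by-step analysis above:
bootstrap2groupsMEANS(ctrl, exp)   % prints test_stat and pval, plots the null distribution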
Don't forget to look online for a more generalized version of 'bootstrap2groupsMEANS.m' called 'bootstrap2groups.m'. It's more flexible (the descriptor method and number of iterations can be given as input), stores output in a struct, plots the raw data, has one-tailed tests if needed, and can be used as a demo (try running 'out = bootstrap2groups(@mean)').
Possible topics for next week
•
confidence intervals
•
normality testing
•
ANOVA (1- or 2-way)
•
power