A comparison of multiple treatments
Data Analysis
Nick Holmes
Pathology Flow Cytometry Facility
http://www.bio.cam.ac.uk/~nh106/flowsite/flowindex.html
Or via Quick links on http://www.path.cam.ac.uk/
Flow cytometry data are technically discrete rather than continuous, because the
values recorded in your FCS data files are the outputs of analogue-to-digital
converters (ADCs).
The number of bins (or channels, as flow jocks call them) varies greatly between
today’s cytometers.
E.g.
Cytometer    Channels
Facscan      1024
Cyan         65536
Canto        262144
CytekDxP     262144
Accuri       16777216
Different analysis programmes will display these data on a variety of
scales. Often the data are rebinned, i.e. channels are recombined
into a smaller number of values. This helps make distributions look
smooth.
However, more complex data processing algorithms can also be
used e.g. in FlowJo; these will make your data look different
(sometimes nicer).
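As a rough illustration of simple rebinning, the sketch below collapses a 65536-channel value onto a 256-channel display by integer division. The function name `rebin` and the choice to number display channels from 1 are assumptions made for this example; real analysis programmes may rebin differently.

```python
def rebin(raw_channel, raw_bins=65536, display_bins=256):
    """Map a raw ADC channel (0-based) to a display channel (1-based)."""
    width = raw_bins // display_bins   # raw channels merged into one display channel
    return raw_channel // width + 1

# Every raw channel in a block of 256 lands in the same display channel.
for raw in (0, 32512, 32767, 32768, 65535):
    print(raw, "->", rebin(raw))
```

With these assumptions, 256 adjacent raw channels share one display channel, which is why a rebinned histogram looks smoother than the raw one.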
Sources of true background noise
Thermionic emission
Stray light
Electrical circuits
Sources of variation in signal
Mains (transformer) fluctuation
Laser power fluctuation
Cell (particle) position within the beam (fluidic fluctuation)
Sample preparation and staining
Biological Variance
It is becoming increasingly common for operators to QC cytometers by
regularly running beads – some machines have automated software
routines for doing this. If I do it manually what kind of variability am I
prepared to tolerate before I decide the machine needs attention?
Typically I would have a batch of control beads and standard settings
which would have placed my beads in linear channel 128 on a 256-channel
display (note that these are really raw channels 32512-32767 if I am
using our Cyan).
When we repeat this e.g. every week, as well as reasonable precision
(CV<3-8% depending on fluorescence parameter), I am also expecting
fluctuations in mean and median from week to week.
I guess I am not going to worry if the median lies in the range 115-140 or
so.
Which measure of central tendency?
Often we will want to use a single statistic value to compare populations
but which one gives the best description of non-Gaussian populations?
Statistic        Value
median           553
mode             200
geometric mean   621
mean             1755
I think you can see that both mode and mean do a pretty poor job at
describing the complex population. Geometric mean and median are quite
similar.
I prefer median as a general rule.
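To see how these four statistics separate on a skewed population, here is a minimal sketch using Python's standard `statistics` module. The data are synthetic log-normal values generated for this example only; they are not the population shown on the slide, and the bin width of 10 used for the mode is an arbitrary assumption.

```python
import random
import statistics

random.seed(1)
# Synthetic, right-skewed "fluorescence" values (NOT the data behind the slide).
values = [random.lognormvariate(6.0, 1.0) for _ in range(10000)]

mean = statistics.fmean(values)
median = statistics.median(values)
geo_mean = statistics.geometric_mean(values)
# The mode only makes sense on binned data; here values are binned to the nearest 10.
mode = statistics.mode([round(v, -1) for v in values])

print(f"mode={mode:.0f} median={median:.0f} "
      f"geometric mean={geo_mean:.0f} mean={mean:.0f}")
```

On data like these the mean is dragged upwards by the long right-hand tail and the mode sits well below the bulk of the cells, while the median and geometric mean stay close together.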
When are cells positive?
Here our control (red) population has a lower
peak (mode) than our test sample.
But does this mean all or most of the cells are
positive for our test antibody (antigen by
implication)?
I would favour a conservative interpretation,
namely that the main peak in the test sample is
negative – its slightly higher fluorescence is
probably due to differences between the control
and test antibody preparations (coupling profile,
aggregation profile etc.)
Even then, bear in mind that wherever you set
your lower boundary for positivity, you count
some negatives as positives or some positives
as negatives – usually both. So this boundary is
a sensitive and subjective measure. Be careful
when making conclusions that the exact position
of the boundary doesn’t alter your result!
Basic KS comparisons are too sensitive to
distinguish real from instrument variation
[Figure: two pairs of overlaid histograms with their KS results. Left panel: Dmax=1, p<0.001*. Right panel: Dmax=0.1857, p<0.001*]
The right-hand panel above shows 2 samples from the same tube, so
they can’t be ‘truly’ different except by sampling
* Kolmogorov-Smirnov comparison in FlowJo
If we use KS to compare histograms, any d>0.042 would be significant at
p<10^-6 for n=4096*, and you might think that a cut-off at this level would
stop mere statistical noise from making things appear different when
they aren’t.
However, even this level isn’t high enough, or anywhere near it.
Unfortunately FlowJo doesn’t return D values for KS.
Nor is it easy to calculate (with precision) the critical value Dc for p<10^-7 or less.
In practice, you have to use common-sense judgement.
If things don’t look different enough to be believably BIOLOGICALLY
different, then don’t let stats trick you.
Conversely, anything which looks biologically meaningful WILL be
statistically different if you just compare the raw data of histograms.
BUT is this what we need to establish?
* In fact a general approximation can be made that p≈10^-6 for any d√n≈2.69 (here 0.042 × √4096 = 0.042 × 64 ≈ 2.69)
The Mann-Whitney U test: a simple test for reproducibility
• The Mann-Whitney U test will be generally applicable wherever you want to
compare univariate distributions for a test sample and a control sample
• If the ranges do not overlap, you only need 3 samples of each to get
p≤0.05 (one-tailed)
Overlapping ranges require more samples, but for example
Median FI
Controls   Test
5.61       5.99
5.83       6.58
5.87       7.02
6.01       7.31
7.15       7.39
Gives U = 4; p < 0.05 (one-tailed)
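The U value for this table can be reproduced by brute force. The sketch below counts the control/test pairs that are "the wrong way round" and then obtains an exact one-tailed p by enumerating every way of splitting the ten pooled values into two groups of five. The helper name `u_statistic` is made up for this example, and the enumeration only suits small samples without ties.

```python
from itertools import combinations

controls = [5.61, 5.83, 5.87, 6.01, 7.15]
test = [5.99, 6.58, 7.02, 7.31, 7.39]

def u_statistic(low, high):
    """Count the (low, high) pairs in which the 'low' value exceeds the 'high' value."""
    return sum(1 for a in low for b in high if a > b)

u = u_statistic(controls, test)

# Exact one-tailed p: share of all 5-vs-5 splits of the pooled data with U this small.
pooled = controls + test
splits = list(combinations(range(len(pooled)), len(controls)))
count = 0
for idx in splits:
    group_a = [pooled[i] for i in idx]
    group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    if u_statistic(group_a, group_b) <= u:
        count += 1
p_one_tailed = count / len(splits)

print(u, count, len(splits), round(p_one_tailed, 4))
```

The p value here is one-tailed (test expected to be higher than control); a two-tailed p would be twice as large.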
Where inter-experiment variability is too high, Wilcoxon’s signed
rank test will still deliver significance at 5% (one-tailed) for 5 pairs of samples,
provided that the control value is always lower than the test value within each
experimental pair.
Example: a monoclonal antibody against a novel antigen is used to stain the cell line BRAVO.
The fluorescence obtained by indirect staining with the Mab + anti-mouse Ig is compared to that obtained with
the same secondary + an isotype control. The experiment is repeated on 5 separate occasions. The following
median FI values are obtained
Experiment   1   2   3    4    5
Control      3   6   14   11   8
Test         5   7   16   13   10
By Mann-Whitney: U=10, P>0.1
By Wilcoxon: W=0, P<0.05
1% significance requires seven such samples
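The Wilcoxon result can be checked in the same brute-force way. The sketch below takes W as the sum of the ranks of the negative differences and gets an exact one-tailed p by flipping the sign of each paired difference in every possible way. The helper name `signed_rank_w` is made up for this example, and zero differences are not handled.

```python
from itertools import product

control = [3, 6, 14, 11, 8]
test = [5, 7, 16, 13, 10]

diffs = [t - c for c, t in zip(control, test)]

def signed_rank_w(differences):
    """Sum of ranks of the negative differences (average ranks for tied sizes)."""
    sizes = sorted(abs(d) for d in differences)

    def rank(x):
        # Average 1-based rank of |x| among all absolute differences.
        positions = [i + 1 for i, s in enumerate(sizes) if s == abs(x)]
        return sum(positions) / len(positions)

    return sum(rank(d) for d in differences if d < 0)

w = signed_rank_w(diffs)

# Exact one-tailed p: flip the sign of each difference in every possible way.
flips = list(product([1, -1], repeat=len(diffs)))
count = sum(1 for signs in flips
            if signed_rank_w([s * d for s, d in zip(signs, diffs)]) <= w)
p_one_tailed = count / len(flips)

print(w, count, len(flips), p_one_tailed)
```

When every pair goes the same way, only one of the 2^n sign patterns gives W=0, so the one-tailed p is 1/2^n: 1/32 for five pairs, and it first drops below 0.01 at seven pairs (1/128).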
NB: These two sets are very close. I would still exercise caution in interpreting the
data. Clearly they give a small reproducible difference. This could mean that
BRAVO expresses low levels of the novel antigen. Alternatively, there may be
unknown differences in the non-specific binding of the control and novel antibody, e.g.
a higher level of dimers, trimers etc. Almost all antibody preps contain some
higher-order species.
A comparison of multiple treatments
If we want to compare multiple pairs of samples within an
experiment then we need a different method.
Friedman’s test provides a method to assess the possibility that a
dataset is different
by either ‘block’ (by which we would mean experiment/run)
or by ‘treatment’ (by which we could mean different
antibodies, different concentrations of antibodies or drug
etc.)
However, in order to compare individual pairs of treatments we need
to apply Dunn’s post test to the data.
Friedman only tests whether the null hypothesis, that all samples
might plausibly be drawn from the same population, is demonstrably
false at some defined level of certainty.
An example of Friedman’s test
We have a T lymphocyte cell line which expresses GFP under the control
of a minimal promoter with 3 NFAT binding sites upstream of the
transcription start site. Thus we can measure the degree of NFAT
activity after anti-CD3 activation.
We have 4 different drugs which we believe may inhibit the
dephosphorylation of NFAT (required for nuclear entry, hence
transcriptional activity).
We treat cells with anti-CD3 together with, independently, each of the 4 drugs
or with vehicle as a control, and we measure the fluorescence of the cells
excited with 488nm light and detected between 515 and 545nm.
We did this experiment 3 times using, so far as possible, the same
cells, drug doses, cytometer settings etc.
Median GFP fluorescence
Expt                      Vehicle   A      B      C      D
1                         2657      2612   2271   1907   1439
2                         2347      2333   2201   1899   1333
3                         2784      2636   2311   2089   1566
mean                      2596      2527   2261   1965   1466
as % of vehicle control   -         97     87     76     56
Convert these to ranks within each experiment
Expt            Vehicle   A     B    C    D    Total
1               5         4     3    2    1
2               5         4     3    2    1
3               5         4     3    2    1
Ri (rank sum)   15        12    9    6    3    45
Ri²             225       144   81   36   9    495
Friedman’s test statistic S is given by
$$ S = R_1^2 + R_2^2 + R_3^2 + R_4^2 + R_5^2 - \frac{R^2}{n} = \sum_{i=1}^{n} R_i^2 - \frac{R^2}{n} $$
where n = the number of treatments, R_i is the sum of
ranks of treatment i and R is the sum of all R_i.
For our example S = 495 − 45²/5 = 495 − 405, thus S = 90
We can use tables of significance levels to find that, for 3 replicates of 5
treatments, S=86 has a probability p=0.009 that all values are drawn from the
same underlying population; our S=90 is larger still.
This only tells us that the drugs made a difference!
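The calculation of S for this example can be sketched in a few lines: rank within each experiment, sum the ranks per treatment, then apply the formula. The helper name `ranks_within` is made up for this example and assumes no tied values within an experiment, which holds for these data.

```python
# Median GFP fluorescence: one row per experiment, columns Vehicle, A, B, C, D.
data = [
    [2657, 2612, 2271, 1907, 1439],
    [2347, 2333, 2201, 1899, 1333],
    [2784, 2636, 2311, 2089, 1566],
]

def ranks_within(row):
    """Rank values within one experiment, 1 = lowest (assumes no ties)."""
    ordered = sorted(row)
    return [ordered.index(v) + 1 for v in row]

ranked = [ranks_within(row) for row in data]
rank_sums = [sum(col) for col in zip(*ranked)]       # R_i for each treatment
n = len(rank_sums)                                   # number of treatments
total = sum(rank_sums)                               # R, the sum of all R_i
s = sum(r * r for r in rank_sums) - total ** 2 / n   # S = sum(R_i^2) - R^2/n

print(rank_sums, total, s)
```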
Dunn’s Post test for pairwise comparison
If we want to ask whether particular drugs were effective, and whether some were
better than others, we need to do pairwise comparisons of treatments. We could
choose two levels of query.
1. For each drug, does it inhibit the activation of GFP expression?
2. For all drugs, is A>B>C etc.?
Whenever you perform multiple comparisons within a dataset, you need to correct for the fact
that the more comparisons you do, the more probable it is that you will see an effect by chance.
Query 1 makes a total of 4 comparisons and Query 2 makes 10, so we divide the level of
significance we are prepared to accept by these values*
i.e. for Q1 we need p≤0.0125 and for Q2 p≤0.005, if we want to use the conventional
significance threshold (p=0.05).
We need to find the value of z from the normal distribution that corresponds to that two-tailed probability. This can be done using online calculators e.g.
http://graphpad.com/quickcalcs/Statratio1.cfm
* Technically the correction is to 1-(0.95)^(1/N) where N is the number of comparisons, but 0.05/N is a close approximation
for small N
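As an alternative to an online calculator, the corrected thresholds and the matching two-tailed critical z can be obtained from Python's standard library. This is a minimal sketch; the function name `thresholds` is made up for this example.

```python
from statistics import NormalDist

def thresholds(n_comparisons, alpha=0.05):
    """Return (alpha/N, 1-(1-alpha)^(1/N), two-tailed critical z for alpha/N)."""
    simple = alpha / n_comparisons                    # the 0.05/N approximation
    exact = 1 - (1 - alpha) ** (1 / n_comparisons)    # 1-(0.95)^(1/N)
    z_crit = NormalDist().inv_cdf(1 - simple / 2)     # half of p in each tail
    return simple, exact, z_crit

for label, n_comparisons in (("Q1", 4), ("Q2", 10)):
    simple, exact, z_crit = thresholds(n_comparisons)
    print(f"{label}: 0.05/N={simple:.4f}  1-(0.95)^(1/N)={exact:.4f}  z={z_crit:.3f}")
```

The two versions of the corrected threshold differ only in the fourth decimal place for 4 or 10 comparisons, which is why the simpler 0.05/N is normally used.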
To compare groups i and j, we find the absolute value of the difference between the
mean rank of group i and the mean rank of group j, then divide this difference
by its standard deviation:
$$ z = \frac{\left|\bar{R}_i - \bar{R}_j\right|}{\sqrt{\frac{N(N+1)}{12}\left(\frac{1}{N_i} + \frac{1}{N_j}\right)}} $$
Here \bar{R}_i and \bar{R}_j are the mean ranks of the two groups, N is the total number of data points in all groups, and N_i and N_j are the number of data points in the two
groups being compared. Furthermore, the ranks are calculated using all samples rather than the within-‘block’
ranks used for Friedman’s test.
If this ratio is larger than the critical value of z
then we conclude that the difference is statistically significant.
For Q1 we require z ≥ 2.498 for p≤0.0125
For Q2 we need z ≥ 2.807 for p≤0.005
The upshot is that if we asked Q1,
then only drug D gave significantly different activation.
For Q2, only the same vehicle vs drug D comparison can be said to be
significantly different.
This does not mean that drug C doesn’t inhibit NFAT activation or
that there is no difference between drug D and drug A!
It means we need to do more replicate experiments to show
further significance.
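The pairwise calculation described above can be sketched as follows, using the GFP medians from the Friedman example. All 15 values are ranked together, and each pair of treatments gets a z ratio to compare against the critical values 2.498 (Q1) and 2.807 (Q2). The helper name `dunn_z` is made up for this example and no tie correction is applied, which is fine here because the data have no ties.

```python
from itertools import combinations
from math import sqrt

groups = {
    "Vehicle": [2657, 2347, 2784],
    "A": [2612, 2333, 2636],
    "B": [2271, 2201, 2311],
    "C": [1907, 1899, 2089],
    "D": [1439, 1333, 1566],
}

# Rank all 15 values together (1 = lowest; these data have no ties).
pooled = sorted(v for values in groups.values() for v in values)
rank_of = {v: i + 1 for i, v in enumerate(pooled)}
mean_rank = {g: sum(rank_of[v] for v in vals) / len(vals)
             for g, vals in groups.items()}
n_total = len(pooled)

def dunn_z(g1, g2):
    """|difference in mean ranks| divided by its standard deviation."""
    sd = sqrt(n_total * (n_total + 1) / 12
              * (1 / len(groups[g1]) + 1 / len(groups[g2])))
    return abs(mean_rank[g1] - mean_rank[g2]) / sd

for g1, g2 in combinations(groups, 2):
    print(f"{g1} vs {g2}: z = {dunn_z(g1, g2):.3f}")
```

With only three replicates per treatment, the standard deviation of a difference in mean ranks is about 3.65 rank units, so two treatments have to sit more than ten rank positions apart before the Q2 threshold is reached.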
Set X
Expt                              Vehicle   A        B        C        D
1                                 2657      2612     2271     1907     1439
2                                 2347      2333     2201     1899     1333
3                                 2784      2636     2311     2089     1566
means                             2596      2527     2261     1965     1466
% V control                       -         97       87       76       56
Dunn’s test, vehicle control vs   -         NS       NS       NS       p<0.05

Set Y
Expt                              Vehicle   A          B          C          D
1                                 797.92    109.45     28.76      16.73      10.09
2                                 758.02    103.98     27.32      15.89      9.59
3                                 853.77    117.11     30.77      17.9       10.8
means                             803.24    110.18     28.95      16.84      10.16
% V control                       -         14         4          2          1
ANOVAR (set Y), ctrl vs           -         P<<0.001   P<<0.001   P<<0.001   P<<0.001
These two datasets have exactly the same within-experiment ranks.
For Set Y, however, the Dunn results clearly miss
something important.
Parametric ANOVAR may be permissible (after all, we
don’t know for certain that the underlying distribution of
median values ISN’T Gaussian).
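A short sketch makes the point concrete: ranking within each experiment gives identical results for Set X and Set Y, even though the size of the drug effect differs enormously. The helper name `ranks_within` and the vehicle-to-drug-D fold change used as a crude effect size are choices made for this example.

```python
set_x = [
    [2657, 2612, 2271, 1907, 1439],
    [2347, 2333, 2201, 1899, 1333],
    [2784, 2636, 2311, 2089, 1566],
]
set_y = [
    [797.92, 109.45, 28.76, 16.73, 10.09],
    [758.02, 103.98, 27.32, 15.89, 9.59],
    [853.77, 117.11, 30.77, 17.9, 10.8],
]

def ranks_within(row):
    """Rank values within one experiment, 1 = lowest (assumes no ties)."""
    ordered = sorted(row)
    return [ordered.index(v) + 1 for v in row]

ranks_x = [ranks_within(row) for row in set_x]
ranks_y = [ranks_within(row) for row in set_y]
print("same ranks:", ranks_x == ranks_y)

# Crude effect size: vehicle signal divided by drug D signal in each experiment.
fold_x = [row[0] / row[-1] for row in set_x]
fold_y = [row[0] / row[-1] for row in set_y]
print("Set X fold change:", [round(f, 1) for f in fold_x])
print("Set Y fold change:", [round(f, 1) for f in fold_y])
```

A rank-based test throws away the magnitude of the differences, so it cannot tell a less than 2-fold effect from one of nearly 80-fold.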
• Use common sense
• Biological significance not statistical significance
• Compare like with like
• Reduce heterogeneity as far as possible
• Use non-parametric tests
– Mann-Whitney
– Wilcoxon’s
– Friedman + Dunn’s
• Clear, reproducible cytometry data does not need
stats; but if pushed you can risk using parametric
stats with care