Introduction to Statistics - Homepages | The University of Aberdeen

Introduction to Statistics
Biomedical Sciences Degrees
Honours Students
Derek Scott
Why use statistics?
• Statistics are used to analyse populations and
predict changes in terms of probability.
• Normally, a representative sample is taken, large
enough to make likely conclusions about the
population as a whole.
• Descriptive statistics: summarise the data and
describe the population. These values allow you
to see how large and how variable the data are.
• Inferential statistics: propose a null hypothesis and
endeavour to disprove it. These tests indicate whether
an observed difference could plausibly be due to
random error alone.
• When analysing data, you want to make the
strongest possible conclusion from limited
amounts of data. To do this, you need to
overcome 2 problems:
• Important differences can be obscured by
biological variability and experimental error. This
makes it difficult to distinguish real differences
from random variability.
• The human brain excels at finding patterns,
even from random data. Our natural inclination
(especially with our own data) is to conclude that
any differences are real, and to minimise the
contribution of random variability. Statistical rigor
prevents you from making this mistake.
Errors
• Bias or systematic error: Data go in a predictable
direction perhaps due to experimental design or human
errors. Can remove the errors if you identify them.
• Random error: Unpredictable errors. Can’t get rid of
these.
• Usually you will quote a measure of error with your data
(e.g. standard deviation, standard error of the mean)
• EXAMPLE: The mean height of a student in BM4005 is:
1.71 ± 0.20 (43) metres.
Here 1.71 is the mean value, 0.20 is the SD or SEM,
43 is n (the number of samples) and metres are the
units. Always state the units!!!
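As a minimal sketch of how a value could be quoted in the "mean ± SD (n) units" form shown above, the snippet below uses Python's standard `statistics` module. The function name `summarise` and the height values are hypothetical and chosen only for illustration.

```python
import statistics

def summarise(values, unit):
    """Return 'mean ± SD (n) unit' for a list of measurements."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    return f"{mean:.2f} ± {sd:.2f} ({n}) {unit}"

# Hypothetical heights in metres, for illustration only.
heights = [1.60, 1.65, 1.70, 1.75, 1.80]
report = summarise(heights, "metres")
print(report)  # 1.70 ± 0.08 (5) metres
```

Whichever measure of error you quote, say whether it is the SD or the SEM.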
Independent Sampling - 1
• Measure BP in rats, 5 rats per group.
• Measure BP 3 times in each animal.
• You do not have 15 independent
measurements, since triplicate
measurements in each animal will be
closer to one another than to those in
other animals.
• You should average values from each rat.
• Now have 5 independent mean values.
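The averaging step above can be sketched as follows. The rat labels and BP readings are made-up values, and the dictionary layout is just one convenient assumption about how the data might be stored.

```python
import statistics

# Hypothetical BP readings (mmHg): 5 rats, 3 readings per rat.
readings = {
    "rat1": [118, 120, 122],
    "rat2": [130, 131, 129],
    "rat3": [110, 112, 114],
    "rat4": [125, 124, 126],
    "rat5": [119, 121, 120],
}

# Collapse each animal's triplicate to a single mean,
# so that n is the number of rats, not the number of readings.
rat_means = [statistics.mean(values) for values in readings.values()]
n = len(rat_means)
group_mean = statistics.mean(rat_means)
group_sd = statistics.stdev(rat_means)
print(f"n = {n}, mean = {group_mean:.1f}, SD = {group_sd:.1f} mmHg")
```

The same approach applies to an assay run in triplicate: average within each experiment first, then analyse the experiment means.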
Independent Sampling - 2
• Perform a biochemical test 3 times, each
time in triplicate.
• Do not have 9 independent values, as an
error in preparing the reagents for 1
experiment could affect all 3 triplicates.
• Average the triplicates, and you have 3
independent mean values.
Independent Sampling - 3
• Doing a human exercise study.
• Recruit 10 people from the inner-city, and 10
people from the countryside.
• Have not independently sampled 20 subjects
from one population.
• Data from inner-city subjects may be closer to
each other than to the data from rural subjects.
You have sampled from 2 populations, and need
to account for this in your analysis.
Gaussian (Normal) Distribution
• Data usually follow a bell-shaped distribution
called Gaussian distribution. t-tests and ANOVA
tests assume that the population follows an
approximately Gaussian distribution.
• For example, if we measure the height of
everyone in 4th year and plot this, most people
would fall in the middle of the curve, with a few
at the bottom end, and a few at the top end of
the curve.
• For data that follow a Gaussian distribution, we
use parametric tests.
[Figure: Gaussian Distribution. "Bell-shaped" curve showing Number of students plotted against Height (cm), from 162 to 180 cm.]
Outliers
• When analysing data, some values can be
very different from the rest.
• It is tempting to delete such a value from the
analysis. First ask:
• Was the value typed in correctly?
• Was there an experimental problem with
that value?
• Is it due to biological diversity?
• What if the answers to these questions are
no?
Outliers
• If outlier is due to chance, keep it in the
data set.
• If it is due to a mistake (e.g. bad pipetting,
voltage spike, apparatus problem) then
you must remove it from the analysis.
• If you want to be absolutely sure whether
the outlier is due to chance or not, there
are specific statistical tests you can do, but
usually these basic checks are enough to
decide.
Mean
• Sample mean will probably not be exactly the
population mean. It is a more accurate estimate if
you have a bigger sample size with low variability.
• You may calculate Confidence Intervals (CIs)
telling you the range within which the true
population mean is likely to lie.
• EXAMPLE: Mean height of a student in BM4005
is 1.71 metres. The 95% confidence limits for
this value are 1.5 and 1.8 metres. These are the
lower and upper limits of the range that you can
be 95% confident contains the true mean height.
Confidence Intervals
• Nothing magical about 95%. You could do
it for any value you liked – 99%, 90% etc.
• If you set a value of 99%, then the
intervals would be wider, because you need
a wider range to be 99% confident that it
contains the true population mean.
• 95% confidence limits mean you have a
reasonable level of confidence that the
true population mean lies within that
range.
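To illustrate why a 99% interval is wider than a 95% interval, here is a rough sketch using only Python's standard library. It assumes a normal approximation for the critical value; for small samples a t-distribution would normally be used instead, which gives somewhat wider limits. The function name `mean_ci` and the height data are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def mean_ci(values, level=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * sem, mean + z * sem

# Hypothetical heights in metres, for illustration only.
heights = [1.58, 1.62, 1.66, 1.68, 1.70, 1.71, 1.73, 1.75, 1.79, 1.84]
ci95 = mean_ci(heights, 0.95)
ci99 = mean_ci(heights, 0.99)
print(f"95% CI: {ci95[0]:.3f} to {ci95[1]:.3f}")
print(f"99% CI: {ci99[0]:.3f} to {ci99[1]:.3f}")
```

Both intervals are centred on the sample mean; only the width changes with the confidence level.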
Standard Deviation (SD)
• Quantifies variability
• If data follow Gaussian distribution, then
68% of values lie within one SD of mean
(on either side) and 95% of values lie
within 2 SD’s of the mean.
• So, as a very rough rule of thumb, if 2 points
on a graph are more than 2 SD's away from
each other, they are likely to be significantly
different. Confirm this with a proper test.
• Expressed in same units as data
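The 68% and 95% figures quoted above can be checked against the Gaussian distribution itself. This is a small sketch using `NormalDist` from Python's standard library; the helper name `fraction_within` is made up for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a Gaussian population lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1sd = fraction_within(1)
within_2sd = fraction_within(2)
print(f"within 1 SD: {within_1sd:.1%}")  # about 68%
print(f"within 2 SD: {within_2sd:.1%}")  # about 95%
```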
Standard Error of the Mean (SEM)
• Measure of how far sample mean is likely to be
from the true population mean.
SEM = SD/n
• Smaller than SD, so used more to give smaller
error bars!
• SD quantifies scatter – how much values vary
from each other. Doesn’t really change much
even if you have a bigger sample size.
• SEM quantifies how accurately you know the
true mean of the population. SEM gets smaller
as sample gets larger
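A short sketch of how the SEM shrinks as the sample gets larger while the SD stays the same. The SD of 0.20 m and the sample sizes are assumed values for illustration, and SEM is taken as the SD divided by the square root of n.

```python
import math

def sem(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 0.20  # assumed SD in metres, held fixed for illustration
for n in (5, 43, 100, 400):
    print(f"n = {n:3d}: SEM = {sem(sd, n):.3f} m")
```

Quadrupling the sample size only halves the SEM, because of the square root.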
P Values
P Value          Wording                  Symbol
> 0.05           Not significant          ns
0.01 to 0.05     Significant              *
0.001 to 0.01    Very significant         **
< 0.001          Extremely significant    ***
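The table above could be applied in code roughly as follows. The function name `significance_label` is hypothetical, and the handling of values falling exactly on a boundary (0.05, 0.01, 0.001) is an assumption, since the table does not say which band they belong to.

```python
def significance_label(p):
    """Map a P value to the wording and symbol used in the table above."""
    if not 0 <= p <= 1:
        raise ValueError("P value must be between 0 and 1")
    if p < 0.001:
        return "Extremely significant", "***"
    if p < 0.01:
        return "Very significant", "**"
    if p <= 0.05:  # assumption: exactly 0.05 counts as significant
        return "Significant", "*"
    return "Not significant", "ns"

for p in (0.2, 0.03, 0.004, 0.0002):
    wording, symbol = significance_label(p)
    print(f"P = {p}: {wording} ({symbol})")
```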
Student’s t-test
• Used to compare the means of two groups
of data.
• Paired t-test: control expt. and treatment
done on same person, animal or cell etc.
• Unpaired t-test: control done on 1 group of
subjects, with the treatment being done on
another separate group.
• Can be 1- or 2-tailed.
Iron and zinc evoke electrogenic responses that are pH-dependent
[Figure: two panels, IRON (100 μM) and ZINC (100 μM). ΔIsc (μA.cm⁻²) for each Condition (Krebs pH 6.0 and Krebs pH 7.4), with differences marked ***.]
Iron- and zinc-evoked transport is temperature-dependent
[Figure: two panels, IRON and ZINC. ΔIsc (μA.cm⁻²) plotted against [Iron] (μM) and [Zinc] (μM), from 0 to 1000 μM, at 4 °C and 37 °C, with differences marked *, ** or ***.]
Paired or Unpaired?
• Choose paired if the 2 columns of data are matched, e.g.
• You measure weight before and after an intervention in
the same subjects.
• You recruit subjects as pairs, matched for variables such
as age, ethnic group, disease severity. One of the pair
gets one treatment, the other gets an alternative
treatment.
• You perform the control experiment in one cell or piece
of tissue, and then apply a drug. You measure the effect
of the drug in the same cell or tissue.
• Matching shouldn’t be based on the variable you are
comparing. For example, if measuring BP, you can match
subjects based on their age or postcode, but not on
their BP’s.
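To show why the paired/unpaired choice matters, the sketch below computes the t statistic both ways for the same made-up before/after weights. It uses the textbook formulas (paired: mean difference over its standard error; unpaired: pooled-variance Student's t) with the standard library only, and stops at the t statistic and degrees of freedom rather than a P value. The function names and data are hypothetical.

```python
import math
import statistics

def paired_t(before, after):
    """t statistic and degrees of freedom for a paired t-test."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

def unpaired_t(group1, group2):
    """t statistic and degrees of freedom for an unpaired (pooled-variance) t-test."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1)
                  + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(group2) - statistics.mean(group1)) / se
    return t, n1 + n2 - 2

# Hypothetical body weights (kg) in the same 5 subjects, before and after.
before = [82.0, 75.5, 91.0, 68.0, 79.5]
after = [80.5, 74.0, 89.0, 67.5, 77.5]
t_paired, df_paired = paired_t(before, after)
t_unpaired, df_unpaired = unpaired_t(before, after)
print(f"paired:   t = {t_paired:.2f}, df = {df_paired}")
print(f"unpaired: t = {t_unpaired:.2f}, df = {df_unpaired}")
```

With matched data like these, the paired analysis removes the large subject-to-subject variation, so its t statistic is much further from zero than the unpaired one.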
Student’s t-test
• You will probably always use a 2-tailed t-test.
• 2-tailed test just asks whether there is a
difference between the 2 means.
• 1-tailed test asks about a difference in one
chosen direction only, either:
– Mean 1 is bigger than Mean 2 or
– Mean 2 is bigger than Mean 1.
• For a 1-tailed test you must predict which mean
will be bigger before you start – not usually possible
• Stick to a 2-tailed t-test to be safe!!!
Analysis of Variance (ANOVA)
• Used to compare means of 3 or more
groups.
• Again, can have matched (paired) or
unmatched (unpaired) values.
• You will probably only use 1-way ANOVA
• EXAMPLE: Your null hypothesis is that the
average BP for 4 men is equal. ANOVA
can compare each subject’s BP and say if
they are different or not.
Features of ANOVA
• ANOVA produces an F value, the ratio of the
variation between groups to the variation within
groups. A higher F value means the group means
differ more, relative to the scatter within groups.
• Dunnett’s post test allows you to compare
against 1 group e.g. A v B, A v C, A v D. Handy if
A is the control group.
• Tukey’s post test allows you to compare all
columns against one another just to check for
any differences between any groups. Good way
of finding significant differences that you may not
have expected.
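As a rough sketch of where the F value comes from, the snippet below computes a one-way ANOVA F statistic by hand as the between-group mean square divided by the within-group mean square. The function name and the three groups of BP values are hypothetical; it does not compute a P value or any post test.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = len(all_values) - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical BP values (mmHg) in 3 groups.
groups = [
    [120, 122, 118],
    [130, 128, 132],
    [125, 127, 123],
]
f_value, df_b, df_w = one_way_anova_f(groups)
print(f"F({df_b}, {df_w}) = {f_value:.2f}")  # F(2, 6) = 18.75
```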
The effect of non-selective protein kinase inhibition with staurosporine
[Figure: two panels, IRON and ZINC. ΔIsc (μA.cm⁻²) plotted against [Fe2+] (μM) and [Zn2+] (μM), from 0 to 1000 μM. Legend: Control; 8-Br cGMP (100 μM); Staurosporine (0.5 μM); 8-Br cGMP + Staurosporine. Differences marked *.]
Non-Gaussian Distribution
• Use non-parametric tests for these unusual
situations. They rank the data from low to
high and analyse the distribution of the ranks.
• Less powerful than parametric tests, but useful
when some values are too low or too high to
measure and have been assigned arbitrary
values. Also used if the outcome is a rank or
score with only a few categories.
• P values are usually higher.
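The ranking step that non-parametric tests rely on can be sketched as below. Giving tied values the average of the ranks they occupy is a common convention and is assumed here; the function name and scores are hypothetical.

```python
def rank_with_ties(values):
    """Rank values from low (1) to high, giving ties their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

scores = [3.1, 0.5, 7.2, 3.1, 12.0]
ranks = rank_with_ties(scores)
print(ranks)  # [2.5, 1.0, 4.0, 2.5, 5.0]
```

Because only the order matters, replacing 12.0 with any larger value leaves the ranks unchanged, which is why arbitrary "too high to measure" values are acceptable here.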
Skewness
Correlation
[Figure: two panels, labelled +ve correlation and -ve correlation.]
Correlation doesn’t tell you about the cause of the effect, it
just tells you that there is a link between value X and value Y.
The nearer the R value is to +1 or -1, the stronger the
correlation. A value near 0 means little or no linear relationship.
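A minimal sketch of the correlation coefficient, written out from the standard Pearson formula using only the standard library. The function name and both data sets are made up to give one positive and one negative correlation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
rising = [2.1, 3.9, 6.2, 8.1, 9.8]
falling = [9.8, 8.1, 6.2, 3.9, 2.1]
r_pos = pearson_r(x, rising)
r_neg = pearson_r(x, falling)
print(f"+ve correlation: r = {r_pos:.3f}")
print(f"-ve correlation: r = {r_neg:.3f}")
```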
Regression
[Figure: Fluorescence Ratio (F490/440) plotted against pHi, from pHi 6.0 to 8.0.]
Regression calculates a line of best fit. Often used to
calculate a standard curve which you could use to estimate
value x if you know value y. Unknowns must fall within your
standard curve’s range.
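Using a standard curve in this way might look roughly like the sketch below: fit a least-squares line to the standards, then read an unknown off the line only if it falls inside the measured range. The calibration values, function names and the choice to raise an error outside the range are all assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def estimate_x(y, slope, intercept, y_min, y_max):
    """Read x off the standard curve, refusing readings outside its range."""
    if not y_min <= y <= y_max:
        raise ValueError("reading falls outside the standard curve's range")
    return (y - intercept) / slope

# Hypothetical calibration: fluorescence ratio measured at known pHi values.
ph = [6.0, 6.5, 7.0, 7.5, 8.0]
ratio = [4.0, 5.5, 7.0, 8.5, 10.0]
slope, intercept = fit_line(ph, ratio)
unknown_ph = estimate_x(6.4, slope, intercept, min(ratio), max(ratio))
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"estimated pHi for a ratio of 6.4: {unknown_ph:.2f}")
```

A reading of, say, 12.0 would be rejected rather than extrapolated, since the line has not been shown to hold beyond the standards.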
Correlation and regression
• A word of caution about doing regression and finding
correlations.
• Just because you can draw a line of best fit through
some points and make quite a good straight line, it does
not necessarily mean there is a relationship.
• Correlation does not necessarily imply causation!
• For example, the consumption of tropical fruit in the UK
since WW2 has increased, and so has the birth rate in
the UK. If I plot this on a graph, and did a regression, I
would probably get a nice straight line as both increase
together. I would probably also show there is a good
correlation.
• This does not mean that I can say that eating tropical
fruit improves your fertility!!!
• Use some common sense when interpreting your data!
Summary
• This is just a basic introduction.
• For extra information, try the Help files on
GraphPad Prism (on the University PCs)
• If you end up doing an Honours project with
certain types of data (e.g. collecting
psychological data, epidemiological studies
etc.), your supervisor should inform you about
any special tests/calculations they use for that
type of data.
• Finally, if you are still unsure, make it clear to
your supervisor that you do not understand why
or what you are doing.