
Practice & Communication of Science
From Distributions to Confidence
@UWE_KAR
The Normal Distribution

Represents the idealised distribution of a large
number of things we measure in biology


many parameters approximate to the ND
Is defined by just two things…

population mean



µ (mu)
the centre of the distribution (mean=median=mode)
population standard deviation (SD)




σ (sigma)
the distribution ‘width’ (mean → point of inflexion)
one SD either side of the mean encompasses 68% of the area under the curve
95% of area found within 1.96 σ either side of mean
The Normal Distribution

Is symmetrical

mean=median=mode
The Normal Distribution

One SD either side of mean includes 68% of
represented population

SD boundary is inflexion point

curvature changes direction
the ‘s’ bit

2* SD covers 95%

(*actually 1.96)
3 SD covers 99.7%
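The 68 / 95 / 99.7 figures above can be checked numerically. A minimal sketch, assuming only Python's standard library; the helper name area_within is made up for illustration:

```python
import math

def area_within(k):
    """Fraction of a Normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

# Coverage for 1, 1.96, 2 and 3 SDs either side of the mean
for k in (1, 1.96, 2, 3):
    print(f"within ±{k} SD: {area_within(k):.4f}")
```

Note that 2 SDs actually cover slightly more than 95%, which is why the exact multiplier is 1.96.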
The Normal Distribution

All Normal Distributions are similar

differ in terms of…



mean
SD (governs how ‘spikey’ curve is)
[Figure: Normal curves with 4 different SDs and 2 different means]
Standardising Normal Distributions

Regardless of what they measure, all Normal
Distributions can be made identical by…

subtracting the mean from every reading


dividing each reading by the SD


the mean then becomes zero
a reading one SD bigger → +1
Called Standard Scores or z-scores

amazing! Different measurements → same ‘view’
Standard (z) Scores

A ‘pure’ way to represent data distribution







the actual measurements (mg, m, sec) disappear!
replaced by number of SDs from the mean (zero)
For any reading, z = (x - µ) / σ
A survey of daily travel time had these results (in minutes):
26,33,65,28,34,55,25,44,50,36,26,37,43,62,35,38,45,32,28,34
The Mean is 38.8 min, and the SD is 11.4 min
To convert the values to z-scores…




eg to convert 26
first subtract the mean: 26 - 38.8 = -12.8,
then divide by the Standard Deviation: -12.8/11.4 = -1.12
So 26 is -1.12 Standard Deviations from the Mean
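The travel-time conversion can be reproduced for every reading at once. A minimal sketch using Python's statistics module; it assumes the 20 readings are treated as the whole population (pstdev), which is what gives the SD of 11.4 quoted above (the sample formula would give about 11.7):

```python
import statistics

times = [26, 33, 65, 28, 34, 55, 25, 44, 50, 36,
         26, 37, 43, 62, 35, 38, 45, 32, 28, 34]

mean = statistics.mean(times)     # 38.8 min
sd = statistics.pstdev(times)     # population SD, about 11.4 min

# z = (x - mean) / SD for each reading; the units (minutes) cancel
z_scores = [(x - mean) / sd for x in times]

print(round(z_scores[0], 2))      # z-score for the reading 26

# After standardising, the data have mean 0 and SD 1
print(round(statistics.mean(z_scores), 6), round(statistics.pstdev(z_scores), 6))
```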
Familiarity with the Normal Distribution

95% of the class are between 1.1 and 1.7m tall


what is the mean and SD?
Assuming normal distribution…


the distribution is symmetrical, so mean height is
(1.7 + 1.1) / 2 = 1.4m
the range 1.1 → 1.7m covers 95% of the class,
which is approximately ± 2 SDs



one SD = (1.7 – 1.1) / 4
= 0.6 / 4
= 0.15m
Familiarity with the Normal Distribution

One of that class is 1.85m tall


what is the z-score of that measurement?
Assuming normal distribution…




z-score = (x - µ) / σ
z = (1.85m - 1.4m) / 0.15m
= 0.45m / 0.15m
=3


note there are no units
3 SDs cover 99.7% of the population

only 1.5 in 1000 of the class will be as tall/taller

a big class, with fractional students! 
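The class-height reasoning can be sketched in a few lines. This assumes a Normal distribution and the rounded "95% = ± 2 SDs" rule used above; the variable names are illustrative only. The exact upper tail beyond z = 3 comes out a little below the 1.5 in 1000 obtained from the rounded 99.7% figure:

```python
from statistics import NormalDist

low, high = 1.1, 1.7        # range covering 95% of the class (metres)
mean = (low + high) / 2     # symmetry: the mean is the midpoint of the range
sd = (high - low) / 4       # 95% is roughly ±2 SDs, so the range is 4 SDs wide

z = (1.85 - mean) / sd      # unitless: metres cancel
upper_tail = 1 - NormalDist().cdf(z)   # fraction as tall or taller

print(round(mean, 2), round(sd, 2), round(z, 2))
print(f"{upper_tail * 1000:.2f} per 1000 as tall or taller")
```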
Familiarity with the Normal Distribution

36 students took a test; you were 0.5 SD above
the average; how many students did better?




from the curve, 50% sit above zero
from the curve, 19.1% sit between 0 and 0.5 SD
so 30.9% sit above you
30.9% of 36 is about 11
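Rather than reading the areas off the curve, they can be computed. A minimal sketch, assuming test scores are Normally distributed:

```python
from statistics import NormalDist

z = 0.5             # your score, in SDs above the average
n_students = 36

std_normal = NormalDist()             # mean 0, SD 1
between = std_normal.cdf(z) - 0.5     # area between 0 and 0.5 SD
above = 1 - std_normal.cdf(z)         # area above 0.5 SD

print(f"{between:.3f} {above:.3f} {above * n_students:.1f}")
```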
Familiarity with the Normal Distribution

Need to have a ‘feel’ for this…
Populations and Samples – a Diversion?

A couple of seemingly pedantic but important
points about distributions…

population

the potentially infinite group on which measurements
might be made


don’t often measure the whole population
sample

a sub-set of the population on which measurements
are actually made


most studies will sample the population
n is the number studied


n-1 called the ‘degrees of freedom’
often extrapolate sample results to the population
Populations and Samples – so what?

They are described/calculated differently…



μ is the population mean, x̄ is the sample mean
σ or σn is population SD, s or σn-1 is sample SD
Calculating the SD is different for each


most calculators do it for you…
as long as you choose the right type (pop vs samp)
Populations and Samples – choosing

Analysing the results of a class test…

population, since you don’t intend extrapolating the
results to all students everywhere

Analysing the results of a drug trial…

sample, since you expect the conclusions to apply
to the larger population

A national census collects information about
age

population, since by definition the census is about
the population taking part in the survey
If in doubt, use the sample SD

and as n increases, the difference decreases
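The two SD calculations differ only in the divisor (n for a population, n-1 for a sample). A minimal sketch reusing the travel-time readings from earlier purely as illustrative data; pstdev and stdev are the standard-library names for the two versions:

```python
import math
import statistics

times = [26, 33, 65, 28, 34, 55, 25, 44, 50, 36,
         26, 37, 43, 62, 35, 38, 45, 32, 28, 34]

pop_sd = statistics.pstdev(times)    # divides by n      (σn)
samp_sd = statistics.stdev(times)    # divides by n - 1  (σn-1, or s)
print(f"population SD = {pop_sd:.2f}, sample SD = {samp_sd:.2f}")

# The ratio sample SD / population SD is sqrt(n / (n - 1)),
# which approaches 1 as n grows
for n in (5, 20, 100, 1000):
    print(n, round(math.sqrt(n / (n - 1)), 4))
```

The sample SD is always the larger of the two, which is why it is the safer choice when in doubt.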
Populations and Samples – implications

If the sample observed is the population, then
the mean and SD of that sample are the
population mean and the population SD


ie you calculate σ (or σn)
If the sample is part of the bigger population,
then the sample mean and SD are estimates
of the population mean and the population SD


ie you calculate σn-1 (or s)
(the presumption here is you chose your sample to
reflect the population!)
Implications of Estimating Pop Mean

For a sample, the ‘quality’ of its estimate of the
population mean and SD depends on the
number of observations made




the mean from a sample of 1 member of the
population is unlikely to be close to the pop mean
if you sampled the whole population, your sample
estimate is the population mean
in between, adding extra readings to a sample will
improve your estimate of the population mean
also, repeated sampling → a collection of means



eg you sampled 200 FVCs and did this 100 times
the 100 means are normally distributed and will have their own SD (!)
called the Standard Error of the Mean (SEM)
The Standard Error of the Mean

Immediate recap…



repeated samples of a distribution will produce
different estimates of the population mean
the SD of those estimates of pop mean called the
SEM
Surprisingly easy to calculate




we don’t need to keep repeating our sampling
can derive it from SD of data in one sample
SEM = sample standard deviation / square root of the
number of observations in the sample
SEM = s / √ N

eg if N=16, then SEM is 4x smaller than SD
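The claim that one sample is enough can be checked by simulation. A rough sketch: the population values (mean 3.8, SD 1.0) are made-up numbers for illustration, and the 200 readings repeated 100 times follow the FVC example. The two printed figures should be close to each other, not identical:

```python
import math
import random
import statistics

random.seed(1)   # fixed seed so the run is repeatable

# Hypothetical Normal population (illustrative values, not real data)
POP_MEAN, POP_SD = 3.8, 1.0
N = 200          # observations per sample
REPEATS = 100    # number of repeated samples

samples = [[random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
           for _ in range(REPEATS)]
means = [statistics.mean(s) for s in samples]

# The long way: SD of the 100 sample means
sd_of_means = statistics.stdev(means)

# The short way: s / sqrt(N) from the first sample alone
sem_from_one = statistics.stdev(samples[0]) / math.sqrt(N)

print(f"SD of the means:      {sd_of_means:.4f}")
print(f"SEM from one sample:  {sem_from_one:.4f}")
```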
Remember, Samples → Estimates

When we sample a population, we end up
with a sample mean, x̄

our ‘best guess’ estimate of the real pop mean, µ
the ‘real’ mean of the population is ‘hidden’

Our sample also has a measure of the
variability of the data that comprises it

the sample Standard Deviation, s
which is also an estimate of the population SD, σ

s can also be used to indicate the
variability of our estimate of the population mean

SEM = s / √ N
can then use SEM to determine confidence limits
Confidence Limits and the SEM

The SEM reflects the ‘fit’ of a sample mean, x̄,
to the underlying population mean, µ

if we calculate two sample means and they are
similar, but for one the SEM is high,


we are less ‘confident’ about how well that sample
mean estimates the population mean
Just like the ‘raw’ data used to calculate a
sample mean follows a distribution, so do
repeat estimates of the population mean itself


this is the t Distribution
similar to the Normal Distribution
The t Distribution

Yet another distribution! 

but distributions are important because they define
how we expect our data to behave


if we know that, then we gain insight into our expts!
Generally ‘flatter’ than the Normal Distribution

any particular area is more ‘spread out’ (less clear)

the more ‘pointed’ a curve, the clearer the peak
t Distribution Pointedness Varies!

Logic…


the number of observations in the sample (N) influences
the ‘accuracy’ of our estimate of the population mean
from the sample mean
as N increases, the ‘peak’ becomes sharper


a given area of the curve is less ‘spread out’
At large N, t Distribution = Normal Distribution

and for both, 95% of curve contained in ± 1.96 SD
Using the t Distribution

When we calculate a sample mean and call it
our estimate of the population mean…


it’s nice to know how ‘confident’ we are in that
estimate
One measure of confidence is the 95%
Confidence Interval (95%CI)

the range over which we are 95% confident the
‘true’ population mean lies



derived from our sample mean (we calculate)
and our SEM (we calculate)
and the N (though it’s the ‘degrees of freedom’, N-1)

we use this to look up a ‘critical t value’
From t Distribution to 95%CI


The t Distribution is centred around our mean
and its shape is influenced by N-1
95%CI involves chopping off the two 2.5% tails



Need a t table to look up how many SEMs (SDs)
along the x-axis this point will be
Value varies with N-1 and the level of confidence sought
Step 1: The t Table

t value varies with…

Row…


DoF is N-1
Column…


level of ‘confidence’
95%CI involves chopping off the two 2.5% tails
α = 0.05 (5%)
For N = 10, α = 0.05


t(N-1),0.05 = 2.262
when N large, t=1.96
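If no t table is to hand, the critical value can be found numerically. A sketch only, using the standard library: it integrates the t density with Simpson's rule and searches by bisection for the point where the central area reaches 1 - α. The function names are made up for illustration; a statistics package would normally do this for you.

```python
import math

def t_pdf(t, dof):
    """Density of Student's t distribution with `dof` degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2 * math.log(1 + t * t / dof))

def central_area(t, dof, steps=2000):
    """Area under the t curve between -t and +t (Simpson's rule, even steps)."""
    h = t / steps
    total = t_pdf(0, dof) + t_pdf(t, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    return 2 * total * h / 3    # doubled because the curve is symmetrical

def critical_t(dof, alpha=0.05):
    """Two-tailed critical t: the value leaving alpha/2 in each tail."""
    lo, hi = 0.0, 200.0
    for _ in range(60):         # bisection on the central area
        mid = (lo + hi) / 2
        if central_area(mid, dof) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for n in (10, 100, 10001):
    print(f"N = {n}: t = {critical_t(n - 1):.3f}")
```

For N = 10 this should reproduce the 2.262 from the table, and for very large N it approaches 1.96.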
Step 2: Using the t value

The t value is the number of SEMs along the x-axis (in each direction)
that encompasses that % of the t distribution centred on our mean


2.262 in the case of t(N-1),0.05
Eg we measure the FVC (litres) of 10 people…

mean = 3.83, SD = 1.05, N = 10



SEM = 1.05/√10 = 0.332 litres
t(N-1),0.05 = 2.262 standard errors to cover 95% curve
So, litres either side of the mean = 2.262 * 0.332


= 0.751 litres either side of mean covers 95% of dist
So, 95%CI is 3.83 ± 0.751 = 3.079 → 4.581 litres

(3.079, 4.581)
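The FVC calculation above, as a short sketch. The critical t value (2.262) is taken from the t table as in the example; the function name ci95 is made up for illustration:

```python
import math

def ci95(mean, sd, n, t_crit):
    """95% confidence interval for the mean, given the critical t for n - 1 DoF."""
    sem = sd / math.sqrt(n)      # standard error of the mean
    half_width = t_crit * sem    # distance either side of the mean
    return mean - half_width, mean + half_width

low, high = ci95(mean=3.83, sd=1.05, n=10, t_crit=2.262)
print(f"({low:.3f}, {high:.3f})")   # (3.079, 4.581)
```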
Effect of Bigger N

A larger sample size gives us greater
confidence in any population mean we estimate

so 95%CI should be smaller

In previous example…

mean = 3.83, SD = 1.05, N =10, SEM = 0.332
95%CI is (3.079, 4.581)
But say we measured 90 more people…

mean = 3.55, SD = 0.915, N = 100




mean and SD similar to before, but
SEM now a lot smaller, at 0.915/√100 = 0.0915
so too is t(N-1),0.05 = 1.984 (rather than 2.262; 1.96 is the large-N limit)
95%CI = 3.55 ± (1.984 * 0.0915) = 3.368 → 3.732
Effect of Bigger N

A bigger N ‘sharpens’ the t distribution so that
the 95% boundaries are less far apart


ie our confidence interval will become smaller
95%CI also shrinks because SEM = SD/√N
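The √N effect on its own can be seen by holding the SD fixed and varying N. A small sketch, using the SD from the first FVC example as an illustrative value: every fourfold increase in N halves the SEM.

```python
import math

def sem(sd, n):
    """Standard error of the mean for a sample of n observations."""
    return sd / math.sqrt(n)

sd = 1.05   # SD from the FVC example, held fixed here for illustration
for n in (10, 40, 160, 640):
    print(n, round(sem(sd, n), 4))
```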
Summary

Normal Dist fully defined just by mean and SD

SD calculation differs for sample vs population
Transformation to z-scores makes all NDs identical

Estimation of population mean from a sample is
always prone to uncertainty

Standard Error of Mean (s/√N) reflects uncertainty

Estimates of means follow the t Distribution

t Distribution becomes ‘sharper’ with higher N

‘Width’ of t dist covering 95% is called 95%CI

range in which 95/100 mean estimates would fall
95%CI = mean ± (t(N-1),0.05 * SEM)
t is the number of SEMs along dist covering that %