Sample-Based Epidemiology Concepts
Infant Mortality in the USA (1991)

             Deaths       Alive        Total
Unmarried    16,712   1,197,142    1,213,854
Married      18,784   2,878,421    2,897,205
Total        35,496   4,075,563    4,111,059
We rarely have the luxury of having the entire population at our disposal so we
usually take a small (or large, if you have the money and time and even larger
if you also have lots of post-docs to collate data) random sample from our
selected population and estimate the population incidence (probabilities) based
on the sample. This means that we will have errors in estimation; with big
errors if we use small numbers of people in our samples and smaller errors if we
use bigger numbers of people in our samples.
Because of the error in estimating the population parameter, we have to calculate
confidence limits for our estimate; our sample predicts a parameter but the
parameter could be smaller or larger than the predicted value – so we need to
know the range of possible values for the predicted parameter. To see how this
works we have to delve into the incredibly cool
Universe of Statistical Analysis.
The terms confidence limits and estimate of population parameters are
highly relevant to research in the health sciences because they are
statistical concepts.
Statistics and statistical analysis are nothing more than calculating
measures of probability, association, central tendency and variance of
sample data (statistics) and the probabilities that the calculated statistics
relate to the target population (statistical analysis).
Of course statistical probabilities are not exactly the same as the
actual population probabilities of infant mortality (0.0086) and infant
non-mortality (0.9914) for the USA in 1991; two separate population
parameters.
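As a quick check, the two population parameters quoted above can be recomputed from the totals in the infant mortality table. This is only a minimal sketch; the variable names are chosen here for illustration.

```python
# Counts taken from the "Infant Mortality in the USA (1991)" table.
deaths = 35_496
alive = 4_075_563
total_births = 4_111_059

# Population parameters: every birth is counted, so no estimation is involved.
p_mortality = deaths / total_births
p_non_mortality = alive / total_births

print(f"infant mortality:     {p_mortality:.4f}")      # 0.0086
print(f"infant non-mortality: {p_non_mortality:.4f}")  # 0.9914
```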
A parameter is any measure from a population while a statistic is any
measure from a sample.
If we test entire populations then we do not need statistical analysis. For
example:
If another population (let’s say, another country) was measured in its
entirety and the other country’s infant mortality and infant non-mortality
were calculated as 0.0085 and 0.9915, respectively [compared to infant
mortality (0.0086) and infant non-mortality (0.9914) for the USA in 1991]
we could conclude with absolute certainty (100% confidence) that the two
populations were completely different with regard to these two parameters
because we would be absolutely certain that the calculated numbers are
exactly descriptive of the respective populations (even though there is just a
tiny difference between the two populations). Different numbers mean
different!
However, because samples are not necessarily exactly representative of the
population from which they came, differing numbers from two (or more)
different samples do not necessarily guarantee that the samples came from
two (or more) different populations.
As previously mentioned, we simply NEVER (well, not very often anyway)
have the luxury of being able to measure the entire population so we have to
suffer with a (usually) small sample that was selected from the population.
We then measure whatever it is we are interested in, let’s say “Infant
Mortality” or “Height”, and then assume that our sample represents our
population and that whatever the sample statistic is, that same number is an
estimate of the parameter of the population from which the sample was selected.
Because such an assumption may not be absolutely true, i.e., the sample doesn’t
perfectly represent the population, we need to have some idea of where the
actual population parameter might be …
To do this, we simply perform a particular type of statistical analysis to estimate a
range of possible values that would include the population parameter ... we use
the sample data to do so: the sample statistic is used to estimate the exact middle
of the range and the variability of the numbers in the sample is used to estimate
the highest and lowest value of the range …
To understand how these statistical calculations are made we need to start
with a frequency distribution of the data:
Once we have a frequency distribution of the data then the mathematical properties of
the frequency distribution can be used to estimate the range of values within which the
population parameter might lie – within certain confidence limits or confidence
intervals . . .
The predicted range of values within which the population parameter might exist is
calculated on the basis of Confidence Intervals and these are defined by percentages:
95% confidence interval, 90% confidence interval, 99% confidence interval . . .
These percentages relate to statistical probabilities . . .
95% CI: There is a probability of 0.95 that the population parameter exists within the
calculated range of values – or a probability of 0.05 that it does not . . .
90% CI: There is a probability of 0.90 that the population parameter exists within the
calculated range of values – or a probability of 0.10 that it does not . . .
99% CI: There is a probability of 0.99 that the population parameter exists within the
calculated range of values – or a probability of 0.01 that it does not . . .
An extremely accurate, but rather cumbersome way to describe data; especially if
there were hundreds or thousands of people in the population . . .
A little less accurate description but a whole lot easier to describe because only
the shape of the line is being described; not each of the individual data points.
Note that the shape of the line still accurately describes how the data is
distributed on the number line, we just need a more accurate way to describe
the line …
And there even is a way to calculate those two parts of the curve. (If you look at
the right and left halves of the curve separately, you may recognize them as
sigmoid curves.)
The measure of central tendency most often used to describe the peak of the data
curve is called mu (µ – population parameter) or the mean (x̄ – sample statistic) and the
measure of variability most often used to describe the dispersion of the data along
the number line is called the standard deviation (σ – parameter; sd – statistic),
which is equal to the square root of the variance (σ² or V).
The mean (commonly called the average: add up all the scores and divide by
the total number of scores):

$$\mu = \frac{\sum x}{n}$$

The variance (subtract the mean from each score, square each result, add
up all the squares, and then divide by n; then take the
square root to get σ):

$$\sigma^2 = \frac{\sum (x - \mu)^2}{n}$$
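The recipe for the mean, variance and standard deviation can be followed step by step in a few lines of Python. This is a minimal sketch: the list of heights is made up for illustration and does not come from the slides. The standard-library call `statistics.pstdev` is used only as a cross-check.

```python
import math
import statistics

# Hypothetical heights in inches (illustrative values, not from the slides).
heights = [60, 62, 65, 66, 66, 68, 70, 71]

n = len(heights)
mu = sum(heights) / n                               # add up all scores, divide by n
variance = sum((x - mu) ** 2 for x in heights) / n  # mean of the squared deviations
sigma = math.sqrt(variance)                         # square root of the variance

print(mu, variance, sigma)  # 66.0 12.25 3.5

# Cross-check against the standard library's population standard deviation.
assert math.isclose(sigma, statistics.pstdev(heights))
```

Dividing by n treats the list as an entire population; for a sample, many textbooks divide by n - 1 instead.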
The µ corresponds to the exact point on the number line where the central peak of the
frequency distribution curve sits, and the σ corresponds to the distance along the
number line from that peak to the point where the curve stops getting steeper and
starts to flatten out (the inflection point of the curve).
An advantage of describing your population in terms of how the data is distributed
on a number line using µ and σ is that many populations can be represented by this
exact same kind of a curved line; a line often called a normal curve.
An important property of these curves is that they are very easy to describe in
terms of mathematical probabilities. For example, we know that 50% of all
the heights (data points) in the population are greater than the center
point (µ = 5’ 6.25”) which means there is a 0.50 probability that a randomly
selected individual is taller than 5’ 6.25”. We also know that 68.26% of all
the data points are between the two σ limits, one σ below and one σ above the
mean (4’ 1.75” to 6’ 10.75”), which means there is a 0.6826 probability that a
randomly selected individual will be between 4’ 1.75” tall and 6’ 10.75” tall.
This graph simply illustrates more “percentages of the data distributed along the
number line” in different sections of the curve, based on how far along the number line
you go in σ units. Again, using percent as probabilities, there is a 0.3413 probability that
a randomly selected individual would be between the mean and one standard deviation
above the mean, or to put it a different way, we would be 34.13% confident that a
randomly selected individual would be somewhere between the mean and +1 sd, or 2.28%
confident that a randomly selected individual would be more than 2 sd above the mean . . .
Note that the z-score number corresponds to the sd unit.
Now . . . from this curve you notice that standard deviation units and z-score
units are the same thing.
In between the +1 and -1 units are found 68.26% of all the scores in the frequency
distribution.
In between the +2 and -2 units are found 95.44% of all the scores in the frequency
distribution.
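The percentages attached to the z-score units can be reproduced from the standard normal curve. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `pct_between` is chosen here for illustration. Printed tables that round each half of the curve to four decimal places give very slightly different last digits (for example 68.26 rather than 68.27).

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, standard deviation 1

def pct_between(lo, hi):
    """Percentage of scores lying between two z-scores."""
    return 100 * (z.cdf(hi) - z.cdf(lo))

within_1sd = pct_between(-1, 1)        # about 68.27
within_2sd = pct_between(-2, 2)        # about 95.45
mean_to_plus_1sd = pct_between(0, 1)   # about 34.13
above_plus_2sd = 100 * (1 - z.cdf(2))  # about 2.28

print(f"{within_1sd:.2f} {within_2sd:.2f} "
      f"{mean_to_plus_1sd:.2f} {above_plus_2sd:.2f}")
```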
To make things easier, tables of z-scores and the % of scores in between the z-score limits are
available in most statistics textbooks . . . A few of those values are reproduced here:

Z-Score      %
1.00       68.26
1.5        86.64
1.65       90.00
1.96       95.00*
2.00       95.44
2.5        98.76
2.57       99.00
3.0        99.74
3.27       99.90
3.3+       ~100

* Traditional level for “statistical significance”
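Going the other way, from a chosen confidence level to its z-score, is a single inverse lookup on the same curve. A minimal sketch, assuming a two-sided interval with the leftover probability split equally between the two tails; the function name `z_for_confidence` is hypothetical.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided z-score that leaves (1 - level) split equally between both tails."""
    tail = (1 - level) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> z = {z_for_confidence(level):.3f}")
# 90% -> 1.645, 95% -> 1.960, 99% -> 2.576
```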
Now . . . to figure out where the confidence limits actually come from in all those
epidemiology papers . . .
The “baby” data illustrates this fairly well . . .
             Sample1   Sample2   Sample3   Sample4
             Births    Births    Births    Births
Unmarried      35        29        33        41
Married        65        71        67        59
Total         100       100       100       100
If we randomly sampled 100 live births from all of the 4,111,059 live births in
the USA in 1991 we might find that 35 births were associated with unmarried
mothers. This would give a sample probability (statistic) of 35 unwed mothers /
100 live births = 0.35 - an estimate of the population probability (parameter)
that a birth is associated with an unmarried mother.
The sample probability (statistic) is not the correct probability for the entire
population, just the correct probability for the sample.
If we took 3 more (different) random samples from the same population, each of
100 live births, we would probably find a different probability that the birth is
associated with unwed mothers for each sample that was randomly selected; we
might get 29 / 100 = 0.29; 33 / 100 = 0.33; 41 / 100 = 0.41; and so on . . . and we
would never be 100% certain (confident) that any one sample probability would
exactly represent the population parameter.
We need some way to deal with this uncertainty so we construct confidence limits
or a confidence interval.
Marital status of samples of new mothers in the USA (1991)

             Sample1   Sample2   Sample3   Sample4
             Births    Births    Births    Births
Unmarried      35        41        33        29
Married        65        59        67        71
Total         100       100       100       100   …
If we could keep sampling samples (of n = 100) and calculating probabilities
forever we would end up with an infinite number of sample probabilities.
Sample probabilities close to the true population probability would appear
numerous times while those far away would appear less frequently; the most
frequently occurring sample probability (from the infinite number of samples)
would correspond to the population probability while the least frequent
probabilities would correspond to the extreme values (again, from the infinite
number of samples).
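Sampling forever is not possible, but a computer can draw a large number of samples and show the same pattern. The sketch below is a simulation under stated assumptions: 10,000 samples stand in for "infinite", each birth is treated as an independent draw with probability 0.295 (the population probability for unmarried mothers from the 1991 table), and the seed is fixed only so the run is repeatable.

```python
import random
from collections import Counter

random.seed(1991)      # fixed seed so the sketch is repeatable

TRUE_P = 0.295         # population probability of an unmarried mother (1991 table)
N = 100                # births per sample
NUM_SAMPLES = 10_000   # stand-in for "an infinite number" of samples

def sample_probability(n, p):
    """Draw one random sample of n births and return the proportion unmarried."""
    unmarried = sum(1 for _ in range(n) if random.random() < p)
    return unmarried / n

probs = [sample_probability(N, TRUE_P) for _ in range(NUM_SAMPLES)]

counts = Counter(probs)
most_common_p, _ = counts.most_common(1)[0]
mean_p = sum(probs) / len(probs)

print("most frequent sample probability:", most_common_p)
print("average of all sample probabilities:", round(mean_p, 3))
print("smallest and largest seen:", min(probs), max(probs))
```

The most frequent values cluster around the population probability, while the extremes turn up only rarely.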
This infinite number of theoretical sample probabilities would fit into a
frequency distribution curve that is approximately normally distributed. From this
theoretical Normal Distribution we can construct a confidence interval using
standard percentile scores (actually the same sd units called z-scores illustrated in
previous slides) which will then be related to just how confident we want to be:
95% confident? 90% confident? 99% confident? Just plug the sample values
you are interested in, and the appropriate z-score value that corresponds to your
chosen %-confidence level, into the formula and voilà: Confidence Intervals
This is another figure of that same normal curve with z-scores and percentages; the
actual z-scores that correspond to 95% and 90% of the data have been added …
Just imagine that this curve illustrates the distribution of an infinite number of
probabilities calculated from the infinite number of samples (n = 100) that were
randomly selected from the same population.
We already have some idea where the middle of this “population curve” fits on a
number line because we have the (ONE) sample estimate of that point; we are just
not 100% confident that the sample statistic is exactly the same as the population
parameter. What we need to know is the range of possible values that the actual
population center-point might be within – so we calculate that range using the
above theoretical curve …
Marital status of a sample of new mothers in the USA (1991)

             Births   Probability
Unmarried      35        0.35
Married        65
Total         100

Confidence Interval - 95% (use z-score of 1.96)

$$0.35 \pm 1.96\sqrt{\frac{0.35 \times 0.65}{100}} = 0.35 \pm 1.96\sqrt{0.002275} = 0.35 \pm 0.093 = (0.257,\ 0.443)$$

Confidence Interval - 90% (use z-score of 1.644)

$$0.35 \pm 1.644\sqrt{\frac{0.35 \times 0.65}{100}} = 0.35 \pm 1.644\sqrt{0.002275} = 0.35 \pm 0.078 = (0.272,\ 0.428)$$

*True population probability = 1,213,854 / 4,111,059 = 0.295
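The same arithmetic can be written as a small function. This is a minimal sketch of the calculation shown above; the function name `wald_interval` is an assumption made here (this simple normal-approximation interval is often called the Wald interval), and the z-scores are the ones used on the slide.

```python
import math

def wald_interval(p_hat, n, z):
    """Sample probability plus or minus z times sqrt(p_hat * (1 - p_hat) / n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se
    return p_hat - margin, p_hat + margin

low95, high95 = wald_interval(0.35, 100, 1.96)
low90, high90 = wald_interval(0.35, 100, 1.644)

print(f"95% CI: ({low95:.3f}, {high95:.3f})")  # (0.257, 0.443)
print(f"90% CI: ({low90:.3f}, {high90:.3f})")  # (0.272, 0.428)
```

Both intervals contain the true population probability of 0.295.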
The confidence interval is simply the range of values in a frequency distribution
of values from all possible samples of the same size between which you might
expect to find the true population value (parameter), i.e., the sample statistic
predicts that the parameter is 0.35 but it is 90% probable the true parameter is
somewhere between 0.272 and 0.428, and 95% probable the parameter is between
0.257 and 0.443.
These two graphs illustrate the previous calculations as well as the effect of
sample size on the “accuracy” of using the sample statistics to predict the
population value.
From the previous formula, the z-score values (1.96 or 1.644) describe the
confidence limits between which we will look for our predicted population “value”.
The term √ (0.35 x 0.65) / 100 is the standard error of the sample probability
(the square root of its sampling variance) – note that the sample n is part of
the equation.
The larger the n, the narrower the range (n=1000 = .285 - .305 vs.
n=100 = .3 - .4) within which we predict the population value.
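The effect of n can be seen by holding the probability fixed and changing only the sample size. This is a sketch under assumptions: it uses the population probability 0.295 and a z-score of 1.96, which need not be the settings used to draw the two graphs, so the printed ranges are not expected to match the graph values exactly.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of the interval: z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / n)

P = 0.295  # population probability from the 1991 table

for n in (100, 1_000, 10_000):
    m = margin_of_error(P, n)
    print(f"n = {n:>6}: {P - m:.3f} to {P + m:.3f}")
```

Because n sits under the square root, a sample 100 times larger gives an interval only 10 times narrower.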
With smaller sample sizes, or with highly variable data, or with p ~ 0 or 1, it is
problematic to accurately predict the population value using the sample statistic
and its variance, so this next formula is actually used a lot more:
$$\frac{(2 \times 100 \times 0.35) + 1.96^2 \pm 1.96\sqrt{1.96^2 + (4 \times 100 \times 0.35 \times 0.65)}}{2\,(100 + 1.96^2)} = (0.264,\ 0.447)$$

[ previous calculation = 0.35 ± 0.093 = (0.257, 0.443) ]

True population probability = 0.295
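The formula above translates directly into code. This is a minimal sketch; the function name `wilson_interval` is chosen here because the expression has the form usually known as the Wilson score interval, a name the slides themselves do not use.

```python
import math

def wilson_interval(p_hat, n, z):
    """(2np + z^2 +/- z * sqrt(z^2 + 4np(1 - p))) / (2(n + z^2))."""
    z2 = z * z
    centre = 2 * n * p_hat + z2
    spread = z * math.sqrt(z2 + 4 * n * p_hat * (1 - p_hat))
    denom = 2 * (n + z2)
    return (centre - spread) / denom, (centre + spread) / denom

low, high = wilson_interval(0.35, 100, 1.96)
print(f"({low:.3f}, {high:.3f})")  # (0.264, 0.447)
```

Unlike the simpler interval, this one is not centred exactly on 0.35; it is pulled slightly toward 0.5.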
*You will notice that all epidemiology publications will give the confidence
intervals associated with each variable measured.
**and since computers do all the work nowadays and they can calculate exact
intervals based on the sampling distribution of P, based on the binomial
distribution, we don’t have to bother with knowing any of these formulas, just
have an idea about what the formulas are actually calculating …
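To give an idea of what such an "exact" calculation does, here is a sketch of one common method based on the binomial distribution, usually called the Clopper-Pearson interval. The slides do not say which exact method is meant, so this choice is an assumption, and the helper names `binom_cdf`, `bisect_root` and `exact_interval` are made up for illustration. The idea is to search for the population probabilities at which the observed count would sit right at the edge of each tail.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def bisect_root(f, lo, hi, tol=1e-10):
    """Find p in [lo, hi] where f crosses zero, assuming f(lo) > 0 > f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_interval(x, n, level=0.95):
    """Clopper-Pearson interval for x successes out of n (an assumed method)."""
    alpha = 1 - level
    # Lower limit: the p at which P(X >= x) has grown to alpha / 2.
    lower = 0.0 if x == 0 else bisect_root(
        lambda p: alpha / 2 - (1 - binom_cdf(x - 1, n, p)), 0.0, 1.0)
    # Upper limit: the p at which P(X <= x) has shrunk to alpha / 2.
    upper = 1.0 if x == n else bisect_root(
        lambda p: binom_cdf(x, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper

lower, upper = exact_interval(35, 100)
print(f"95% exact interval: ({lower:.3f}, {upper:.3f})")
```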