RR-OR-AR - frozencrocus.com

Sample-Based Epidemiology Concepts:
Conditional Probabilities, Relative Risk, Attributable Risk,
Excess Risk, & Linear Regression
              Deaths     Alive        Total
Unmarried     16,712     1,197,142    1,213,854
Married       18,784     2,878,421    2,897,205
Total         35,496     4,075,563    4,111,059
Getting back to the original population data, we can now extend the
probability concept. The previous examples used total population data without
looking at any of the sub-categories of data.
Remember:
P (D) = P (infant death) = 35,496 total deaths / 4,111,059 total live births
      = 0.0086 (8.6 infant deaths / 1,000 live births)
If we are interested in the probability of infant death for a birth associated with
the unmarried status of the mother, then we have to look at the conditional
probability of the outcome, producing a new formula:

P (A | B) = P (infant death A, conditional on unmarried mother B)

P (A | B) = f (A & B) / [ f (A & B) + f (Ā & B) ]
          = 16,712 / (16,712 + 1,197,142) = 16,712 / 1,213,854
          = 0.014, or 14 deaths / 1,000 births
An equivalent formula would be:

P (A | B) = P (A & B) / P (B) = [ f (A & B) / f (total) ] / [ f (B) / f (total) ]

P (A & B) = 16,712 / 4,111,059 = 0.0041
P (B)     = 1,213,854 / 4,111,059 = 0.295
P (A | B) = 0.0041 / 0.295 = 0.014, or 14 deaths / 1,000 births
Notice that P (A | B) is simply the joint probability of A & B (which = the
incidence of unmarried-mother births that do not survive), divided by the
marginal probability of B (which = the incidence of unmarried-mother births).
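As a quick sketch (mine, not from the slides), both routes to P (A | B) can be checked in a few lines of Python, using the counts from the table above:

```python
# Counts from the infant-mortality table (unmarried mothers).
deaths_unmarried = 16_712
alive_unmarried = 1_197_142
total_births = 4_111_059

# Route 1: restrict the denominator to the conditioning group (unmarried).
births_unmarried = deaths_unmarried + alive_unmarried        # 1,213,854
p_a_given_b = deaths_unmarried / births_unmarried

# Route 2: joint probability divided by marginal probability.
p_a_and_b = deaths_unmarried / total_births                  # P (A & B)
p_b = births_unmarried / total_births                        # P (B)
p_a_given_b_alt = p_a_and_b / p_b

print(round(p_a_given_b, 3))  # 0.014, i.e. 14 deaths / 1,000 births
```

Both routes give the same answer because the total-birth denominators cancel.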
Now, all we have to do is take this probability approach to the concepts
of relative risk, odds ratio, attributable risk, excess risk, and linear
regression of P relative to levels of exposure …
To do this we need to go back to those probability expressions:
P (D)    P (D̄)    P (E)    P (Ē)
and a brief review of the concepts of Joint Probabilities, Marginal
Probabilities, and Conditional Probabilities.
Let's now use a sample (n = 200) to illustrate these associations between low
birthweight infants and marital status of the mother:
              Unmarried    Married    Total
Birthweight
  Low               7           7        14
  Normal           52         134       186
Total              59         141       200

Joint Probabilities (P within population):
P (unmarried mother AND low birthweight) = 7 / 200 = 0.035
P (not unmarried AND not low birthweight) = 134 / 200 = 0.67

Marginal Probabilities (P within population):
P (low birthweight infant) = 14 / 200 = 0.07
P (not low birthweight infant) = 186 / 200 = 0.93

Conditional Probabilities (P within the conditioned variable):
P (low birthweight | unmarried) = 7 / 59 = 0.119
P (low birthweight | not unmarried) = 7 / 141 = 0.050
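The joint, marginal, and conditional probabilities from this n = 200 sample can be reproduced with a short sketch (the variable names are mine):

```python
# Cell counts from the 2 x 2 sample table.
low_unmarried, low_married = 7, 7
normal_unmarried, normal_married = 52, 134
n = low_unmarried + low_married + normal_unmarried + normal_married  # 200

# Joint probability: cell count over the whole sample.
p_joint = low_unmarried / n                                  # 0.035

# Marginal probabilities: row (or column) totals over the whole sample.
p_low = (low_unmarried + low_married) / n                    # 0.07

# Conditional probabilities: cell count over the conditioning column total.
p_low_given_unmarried = low_unmarried / (low_unmarried + normal_unmarried)
p_low_given_married = low_married / (low_married + normal_married)
print(round(p_low_given_unmarried, 3), round(p_low_given_married, 3))
```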
Relative Risk is the ratio of 2 conditional probabilities; you simply take the
probability of the disease in question conditional on the presence of the risk
factor and divide that probability by the probability of disease conditioned on
the absence of the risk factor.
RR = P (D | E) / P (D | Ē)
Using the infant data with low birthweight as the “disease” and unmarried
mother as the “risk”:

RR = P (D | E) / P (D | Ē) = (7 / 59) / (7 / 141) = 0.118644 / 0.049645 = 2.39
Indicating a 2.39-fold risk for low birthweight if the mother is unmarried,
relative to the risk of low birthweight if the mother is married.
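A minimal sketch of the RR calculation, assuming the usual 2 x 2 layout (a = exposed cases, b = exposed non-cases, c = unexposed cases, d = unexposed non-cases; the function name is mine):

```python
def relative_risk(a, b, c, d):
    """RR = P(D | E) / P(D | E-bar) from a 2 x 2 table."""
    return (a / (a + b)) / (c / (c + d))

# Low birthweight vs. marital status of the mother.
rr = relative_risk(7, 52, 7, 134)
print(round(rr, 2))  # 2.39
```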
Odds Ratio is a slightly different concept; it compares the odds of D in the
exposed and unexposed subgroups.
OR = [ P (D | E) / P (D̄ | E) ] ÷ [ P (D | Ē) / P (D̄ | Ē) ]
Using the same data:

OR = [ P (D | E) / P (D̄ | E) ] ÷ [ P (D | Ē) / P (D̄ | Ē) ]
   = (7 / 59) / (52 / 59) ÷ (7 / 141) / (134 / 141)
   = 0.13461 / 0.05223
   = 2.58
Indicating that the odds of a low birthweight infant for an unmarried mother are
2.58-fold greater than the odds of a low birthweight infant if the mother is married.
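The within-column denominators cancel, so the OR reduces to a ratio of simple odds. A sketch (function name mine), using the same a/b/c/d layout as before:

```python
def odds_ratio(a, b, c, d):
    """OR = odds of D among exposed / odds of D among unexposed.
    Equivalent to the cross-product ratio ad / bc."""
    return (a / b) / (c / d)

odds = odds_ratio(7, 52, 7, 134)   # low birthweight vs. marital status
print(round(odds, 2))  # 2.58
```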
Excess Risk is a different concept altogether; it relates to absolute differences in
risk rather than relative (or ratios of) risks.
ER = P (D | E) − P (D | Ē)

Using the current example:

ER = P (D | E) − P (D | Ē)
   = 7 / 59 − 7 / 141
   = 0.069
Indicating that there would be an increase of ~ 7% in low birthweights if
exposure status changed from Ē to E.
One use of ER is to evaluate possible impact of public health programs. For
example, the RR of lung cancer associated with smoking is ~ 9.0 while that of
CHD associated with smoking is ~ 2.3. The ER of CHD with smoking exposure
is, however, larger (0.0024 vs. 0.0016) than that of lung cancer; indicating that
cigarette intervention programs may be more important in terms of CHD.
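The excess-risk arithmetic is simple enough to sketch directly (function name mine; same a/b/c/d table layout as above):

```python
def excess_risk(a, b, c, d):
    """ER = P(D | E) - P(D | E-bar): the absolute difference in risk."""
    return a / (a + b) - c / (c + d)

er = excess_risk(7, 52, 7, 134)    # low birthweight vs. marital status
print(round(er, 3))  # 0.069
```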
Attributable Risk is a slightly different concept. It relates to absolute differences
in risk but also incorporates the association between E and D as well as the
prevalence of E. It attempts to address the question of what is the fraction of
the total disease in the population that can be explained by the presence of E.
AR = [ P (D) − P (D | Ē) ] / P (D)
Using the current example:

AR = [ P (D) − P (D | Ē) ] / P (D)
   = (14 / 200 − 7 / 141) / (14 / 200)
   = 0.29
Indicating that 29% of low birthweights in the population (ok, really in the sample) are
attributable to the marital status of the mother. Of course, being unmarried does not cause
low birthweights; rather, there are many factors associated with being an unmarried
mother that may be causal.
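A sketch of the AR computation (function name mine), again assuming the a/b/c/d 2 x 2 layout:

```python
def attributable_risk(a, b, c, d):
    """AR = (P(D) - P(D | E-bar)) / P(D):
    fraction of all disease in the sample explained by the presence of E."""
    p_d = (a + c) / (a + b + c + d)    # overall risk of D
    p_d_unexposed = c / (c + d)        # risk of D without the exposure
    return (p_d - p_d_unexposed) / p_d

ar = attributable_risk(7, 52, 7, 134)  # low birthweight vs. marital status
print(round(ar, 2))  # 0.29
```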
As before, the use of samples precludes certainty as to whether your calculated
indices of risk truly represent the population from which the sample was taken.
In order to deal with this uncertainty, confidence limits can be calculated for
these risk indices in much the same way as they were calculated for the
marginal probabilities as explained earlier.
- The calculated RR (or OR) from the sample is assumed to be a reasonable
estimate of the true population risk.
- Based on the sample, a frequency distribution of RR’s (or OR’s) from an
infinite number of samples with the same N as used in your sample is predicted.
- The calculated RR is assumed to be in the middle of the frequency
distribution and 90% or 95% CI’s are calculated.
- The lower and upper limits of the CI reveal the possible range of values
that the population RR (or OR) might be within (with a 90% or 95%
probability).
Calculating confidence limits for the “ratios” is a slightly different affair than for
marginal probabilities because you are looking at associations between two (or
more) different probabilities, which somehow have to be combined. Because OR is
very commonly used in epidemiology, we’ll start with that:
To make the arithmetic easier, the OR formula is rearranged:
Using some new data on coffee consumption and pancreatic cancer:
                  E = Coffee     Ē = Not Coffee     Total
D = Cancer        (a) 347        (c) 20             (a + c) 367
D̄ = Not Cancer    (b) 555        (d) 88             (b + d) 643
Total             (a + b) 902    (c + d) 108        (a + b + c + d) 1010
From before:
OR = [ P (D | E) / P (D̄ | E) ] ÷ [ P (D | Ē) / P (D̄ | Ē) ]
   = (347 / 902) / (555 / 902) ÷ (20 / 108) / (88 / 108)
   = 0.625224 / 0.227272
   = 2.75
Rearranging the formula (the 902's and 108's cancel, then cross-multiply) gives:

OR = ad / bc = (347 × 88) / (555 × 20) = 30,536 / 11,100 = 2.75
As before, in estimating confidence limits you use a theoretical distribution of
an infinite number of sample statistics. In this case - OR, however - the
frequency distribution of an infinite number of OR values would be skewed to
the right and therefore not very useful. In order to deal with this, you make some
sort of a transformation of the data.
In the case of OR, a frequency distribution of an infinite number of OR values
that have been transformed to their natural logarithms would yield a
symmetrical curve that is normally distributed.
You then can use the appropriate z-score × √variance to obtain the desired
confidence limits (90%? 95%?).

For a 95% CI:

ln OR ± 1.96 √V

You then exponentiate the calculated values to convert back into our
normal language.
EXAMPLE (coffee consumption and pancreatic cancer):

OR = ad / bc = (347 × 88) / (555 × 20) = 30,536 / 11,100 = 2.75
To calculate confidence limits:

Take the natural log of the OR (2.75):   ln OR = 1.0116

Calculate the variance of the ln OR:

V ln OR = 1/a + 1/b + 1/c + 1/d
        = 1/347 + 1/555 + 1/20 + 1/88
        = 0.066
Now that you have the ln OR and V ln OR you just choose the z-score that corresponds to
your desired CI and calculate away …

ln OR = 1.0116
V ln OR = 0.066

If you decide to pick a 95% CI then you use the z-score of 1.96.
(If you wanted a 90% CI then you would use a z-score of 1.64.)

95% CI of ln OR = 1.0116 ± 1.96 √0.066
                = 1.0116 ± 0.5035
                = 1.0116 (0.508, 1.516)

Remember that these are natural logs, so we exponentiate to convert back to
“normal arithmetic” and we get:

exp 1.0116 = 2.75
exp 0.508 = 1.66
exp 1.516 = 4.55

95% CI OR = 2.75 (1.66, 4.55)
Indicating that the odds of cancer with coffee exposure are anywhere from about
one-and-a-half to four-and-a-half times the odds of cancer without coffee exposure.
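The whole log-transform procedure can be sketched as one small function (this is the logit/Woolf-style interval that matches the 1/a + 1/b + 1/c + 1/d variance used here; the function name is mine):

```python
import math

def or_confidence_interval(a, b, c, d, z=1.96):
    """OR with a CI computed on the natural-log scale, then exponentiated."""
    odds_ratio = (a * d) / (b * c)          # cross-product ratio ad / bc
    ln_or = math.log(odds_ratio)
    var = 1 / a + 1 / b + 1 / c + 1 / d     # variance of ln OR
    half_width = z * math.sqrt(var)
    return (odds_ratio,
            math.exp(ln_or - half_width),   # lower limit, back on OR scale
            math.exp(ln_or + half_width))   # upper limit, back on OR scale

# Coffee consumption and pancreatic cancer.
odds, low, high = or_confidence_interval(347, 555, 20, 88)
print(round(odds, 2), round(low, 2), round(high, 2))  # 2.75 1.66 4.55
```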
For RR, a different example will be used:
EXAMPLE (Personality Type and CHD):
                 E = Type A      Ē = Type B      Total
D = CHD          (a) 178         (c) 79          (a + c) 257
D̄ = Not CHD      (b) 1411        (d) 1486        (b + d) 2897
Total            (a + b) 1589    (c + d) 1565    (a + b + c + d) 3154

Using the rearranged formula:

RR = [ a / (a + b) ] / [ c / (c + d) ]
   = (178 / 1589) / (79 / 1565)
   = 0.1122 / 0.0505
   = 2.22
To calculate confidence limits:

Take the natural log of the RR (2.22):   ln RR = 0.797

Calculate the variance of the ln RR:

V ln RR = b / [ a (a + b) ] + d / [ c (c + d) ]
        = 1411 / (178 × 1589) + 1486 / (79 × 1565)
        = 0.017
Now that you have the ln RR and V ln RR you just choose the z-score that corresponds
to your desired CI and calculate away (sound familiar?) …

ln RR = 0.797
V ln RR = 0.017

If you decide to pick a 95% CI then you use the z-score of 1.96.
(If you wanted a 90% CI then you would use a z-score of 1.64.)

95% CI of ln RR = 0.797 ± 1.96 √0.017
               = 0.797 ± 0.2555
               = 0.797 (0.542, 1.053)

Remember that these are natural logs, so we exponentiate to “convert back to
normal arithmetic” and we get:

exp 0.797 = 2.22
exp 0.542 = 1.72
exp 1.053 = 2.87

95% CI RR = 2.22 (1.72, 2.87)
Indicating anywhere from about 1.7 to almost 3 times the risk for CHD in Type A
individuals.
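The same procedure for RR differs only in the variance formula; a sketch (function name mine):

```python
import math

def rr_confidence_interval(a, b, c, d, z=1.96):
    """RR with a CI computed on the natural-log scale, then exponentiated."""
    rr = (a / (a + b)) / (c / (c + d))
    var = b / (a * (a + b)) + d / (c * (c + d))   # variance of ln RR
    half_width = z * math.sqrt(var)
    return (rr,
            math.exp(math.log(rr) - half_width),
            math.exp(math.log(rr) + half_width))

# Personality type (A vs. B) and CHD.
rr, low, high = rr_confidence_interval(178, 1411, 79, 1486)
print(round(rr, 2), round(low, 2), round(high, 2))  # 2.22 1.72 2.87
```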
But what about that statistical analysis stuff which “figures out” if your data
is “significant” . . . . . ?
And all that probability stuff so you can say your results are Significant at
the 0.05 level? - or - Significant at the 0.1 level? - or - Significant at the 0.01 level?
This basic statistical concept is related to the question:
Do your calculated OR and/or RR, which reveal observed associations
between D and E in your sample, reflect a population in which these same
associations are actual?
(Or, are the calculated associations simply the result of some random
happenings resulting from your sampling process?)
One way to surmise the answer is to simply take into consideration the 95% CI
of your ratios and as long as the range of possible values is not very “large”
you can be reasonably sure that your calculated RR (or OR) is a reasonably
accurate measure of the population RR or OR.
And, if the lower limit is not close to 1, there is a reasonable likelihood that the
presence of E entails more risk for “getting” D than the absence of E will. If the
lower limit includes 1, then you can’t really preclude the possibility of NO effect
with 95% confidence .
There is a trend in modern epidemiology to rely less and less on tests of
significance and to rely more on using the calculated CI’s as a direct
indication of the true significance of your ratios’ value.
On the other hand, there are still a lot of statisticians (and journals) who
insist on using tests of significance to give “statistical relevance” to any
conclusions made on the basis of the calculated risk indices.
Because all of the analyses discussed so far used frequency data arranged in a
2 x 2 table of E, Ē, D, and D̄, analyzing for statistical significance will be based
on these frequencies.
To illustrate we will use some case-control data on smoking and fatal / non-fatal
automobile accidents from which we obtain the following 2 x 2 contingency table:
         E      Ē      Total
D        68     32     100
D̄        104    96     200
Total    172    128    300
From this data we might calculate the OR:

OR = (68 × 96) / (32 × 104) = 6528 / 3328 = 1.96
ln OR = 0.6729
V ln OR = 1/68 + 1/104 + 1/32 + 1/96 = 0.06599
95% CI ln OR = 0.6729 ± 1.96 √0.06599
             = 0.6729 (0.16941, 1.17639) . . . Convert to real numbers and you get:
OR (95% CI) = 1.96 (1.185, 3.243)
The fundamental analysis to be used is the chi-square test for independence. It
is based on the premise that if the exposure variable were completely independent
of the disease variable then we would expect that P (D | E) = P (D | Ē), and
that would mean that the null hypothesis for our test would be:

H0: RR = OR = 1 (and ER = 0)
In tests of statistical significance it is customary to formulate the test statement
in the form of a null hypothesis.
The null hypothesis assumes that there is no treatment effect or no association
between E and D variables and based on the results of the statistical analysis
used (an analysis which is appropriate for the type of data collected and the
research design used) we will either take one of two possible actions:
We will reject the null hypothesis if our data indicates a reasonable probability
of differences or associations,
- or –
We will fail to reject the null hypothesis if our data do not indicate a reasonable
probability of differences or associations.
By using the chi-square test for independence we will get a statistic that represents just
how different the observed frequencies are from those that would have occurred if there
were no association between the variables.
Using the case-control data on smoking and accidents we might OBSERVE the following
contingency table:
         E      Ē      Total
D        68     32     100
D̄        104    96     200
Total    172    128    300
If there were absolutely no association between E and D we would expect identical
incidences of D in both the E and Ē groups, giving the following *EXPECTED results:

         E        Ē       Total
D        57.3     42.7    100
D̄        114.7    85.3    200
Total    172      128     300

*100 fatality cases from 300 total subjects = 33.3 %; which means that 33.3 % of the E
group (57.3) would have D and 33.3 % of the Ē group (42.7) would have D . . . if no
association between E and D existed.
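The expected counts under independence come straight from the margins (row total × column total / grand total); a minimal sketch, with the variable names mine:

```python
# Observed counts; rows are D, D-bar and columns are E, E-bar.
observed = [[68, 32], [104, 96]]
row_totals = [sum(row) for row in observed]         # [100, 200]
col_totals = [sum(col) for col in zip(*observed)]   # [172, 128]
n = sum(row_totals)                                 # 300

# Each expected cell = (row total x column total) / grand total.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
print(expected)  # approximately [[57.3, 42.7], [114.7, 85.3]]
```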
The basic formula for the chi-square test for independence is as follows:
χ² = ∑ (O − E)² / E

or, with the Yates correction for 2 x 2 tables:

χ² = ∑ ( |O − E| − ½ )² / E
. . . and all we have to do is plug the observed frequencies and the expected
frequencies into the formula to obtain our chi-square statistic.

χ² = (68 − 57.3)² / 57.3 + (32 − 42.7)² / 42.7 + (104 − 114.7)² / 114.7 + (96 − 85.3)² / 85.3

χ² = 1.985 + 2.667 + 0.992 + 1.333

χ² = 6.98

So, what does this mean?
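The plug-in arithmetic above (expected counts plus the χ² sum, without the Yates correction) can be sketched as one function (name mine):

```python
def chi_square(observed):
    """Chi-square statistic for independence on an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, rt in enumerate(row_totals):
        for j, ct in enumerate(col_totals):
            e = rt * ct / n                       # expected count for cell (i, j)
            chi2 += (observed[i][j] - e) ** 2 / e
    return chi2

chi2 = chi_square([[68, 32], [104, 96]])          # smoking vs. fatal accidents
print(round(chi2, 2))  # 6.98
```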
Before we can figure that out, one of the first things to do is imagine a frequency
distribution of an infinite number of chi-square values that were calculated from
an infinite number of samples that were randomly drawn from a population in
which there was NO association between D and E. You can imagine that most of
the samples would have values close to 0, some would be close to 1, or 2, while a
few would, through random chance, be much, much larger than 2. The following
table illustrates the probability of randomly selecting a sample with a given χ²
value from a population where P (D | E) = P (D | Ē):
                                      Probability
df     0.995    0.99     0.975    0.95     0.90     0.10     0.05     0.025    0.01     0.005
1      -        -        0.001    0.004    0.016    2.706    3.841    5.024    6.635    7.879
2      0.010    0.020    0.051    0.103    0.211    4.605    5.991    7.378    9.210    10.597
3      0.072    0.115    0.216    0.352    0.584    6.251    7.815    9.348    11.345   12.838
4      0.207    0.297    0.484    0.711    1.064    7.779    9.488    11.143   13.277   14.860
5      0.412    0.554    0.831    1.145    1.610    9.236    11.070   12.833   15.086   16.750
- and a whole lot more numbers at higher values of df. . . . .
df refers to: (number of rows – 1) x (number of columns – 1)
In the case of a 2 x 2 table, df = 1
From the example of smoking and fatal automobile accidents we got a
calculated χ2 value of 6.98.
From the table with df = 1, we can see that there is less than a 1% chance of
randomly selecting such a sample from a population where P (D│E) = P (D│E).
I reproduced the relevant part of the table here:
df     0.995    0.99     0.975    0.95     0.90     0.10     0.05     0.025    0.01     0.005
1      -        -        0.001    0.004    0.016    2.706    3.841    5.024    6.635    7.879
Based on the statistical analysis, the observed frequencies of fatal accidents
(and non-fatal accidents) in the exposed and unexposed groups have less than a
1% chance of being derived from a population where P (D | E) = P (D | Ē), and
therefore we reject the null hypothesis of No Association between D and E.
- KEEP THIS CONCEPT IN MIND -
There appears to be an association between smoking and fatal automobile accidents
such that smoking confers an increased risk for dying in an accident.
Of course, without the OR, you can’t tell the degree of the association so just doing the
chi-square test for “significance” isn’t enough. Without the CI, the “accuracy” of the
prediction also isn’t apparent so you need that too.
From the original analysis, we obtained OR (95% CI) = 1.96 (1.185, 3.243)
From the χ2 analysis, we rejected the null hypothesis with a probability of
< 0.01.
We would illustrate the results of our analysis of smoking and fatal
accidents in a manner something like this:
OR (95% CI) = 1.96 (1.185, 3.243), P < 0.01
- or . . . to put things in proper perspective:
There is less than a 1% chance of being wrong by concluding that, with 95%
confidence, the odds of a smoker being involved in a fatal automobile
accident are anywhere from ~ 1.2 to ~ 3.2 times the odds for non-smokers
to be involved in a fatal automobile accident.
Regression Analysis (several different variations, but only two will be
illustrated) is used when one or more of the variables are stratified or
continuous in nature. These analyses can illustrate how risk (or
probability) for disease may change when the degree of exposure changes.
Linear model:

Px = P (D | X = x) = a + bx
Plotting the probabilities of D conditional on exposure level X (at each
level x determined) on a graph produces a straight line with intercept = a
and slope = b (changes in P for each unit of x).
The intercept (a) illustrates the risk
of D as a probability when exposure
= 0. The slope of the line illustrates
excess risk for each increase in E in
x units (whatever unit E was
measured in) . . .
Logistic Regression Analysis (and multiple logistic regression analysis) are
used extensively in epidemiology research because associations between the
calculated probabilities and exposure variables as measured are rarely
perfectly linear.
log [ Px / (1 − Px) ] = log (odds for D | X = x) = a + bx
Plotting the log of the odds of D conditional on exposure level X (at each
level x determined) on a graph often produces a curved line with intercept =
a and slope = b. Again, the intercept illustrates risk with exposure = 0 and
the slope is the change in the log of the odds (for each change in the level
of E).
Multiple logistic regression is used when there is more than one exposure
variable measured; the log odds is then a function which takes into account all
of the measured variables associated with risk.
(notice there were no formulas presented with which the a and bx are calculated)
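Since the slides omit how a and b are estimated, here is only a sketch of what the fitted model *does*: inverting log(Px / (1 − Px)) = a + bx to recover the probability at each exposure level. The coefficients below are made up purely for illustration:

```python
import math

# Hypothetical intercept and slope; NOT estimated from any data in the slides.
a, b = -3.0, 0.5

def p_disease(x):
    """Invert the logistic model log(Px / (1 - Px)) = a + b*x to get Px."""
    return 1 / (1 + math.exp(-(a + b * x)))

# P(D) rises with exposure level x along an S-shaped curve, not a straight line.
probs = [round(p_disease(x), 3) for x in range(5)]
print(probs)  # starts at 0.047 when x = 0, then increases
```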
RR, OR, AR, ER, and Logistic Regression Analysis are all used in epidemiology
research to locate and characterize possible D:E associations. Confidence
limits are always calculated for each risk relationship to illustrate the potential
error in predicting population parameters from sample statistics.
OR, RR, ER, and AR are commonly used with binary data while logistic
regression analysis is used when one or more of the E variables are either
stratified or continuous in nature . . .
The next couple slides are here in an attempt to go back to the original course
discussion on the ideal vs. the political reality . . .
Many standard assumptions about disease risks were established on the basis
of linear regressions or relative risks in conjunction with an inaccurate
assumption of cause.
- it is entirely possible to calculate a linear regression of any variable
against any other to see what associations exist . . .
For example it is possible to calculate a linear regression between age-adjusted
disease-specific mortality data (D) and fat consumption (E):
[Figure: Linear regression of incidence of death due to breast cancer vs. animal fat intake - Ken Carroll, 1975]

[Figure: Linear regression of incidence of death due to breast cancer vs. vegetable fat intake - Ken Carroll, 1975]
Based on the strong associations between animal fat and breast cancer,
various hypotheses of animal fat being causal for breast cancer were
constructed . . .
Similar associations between serum cholesterol and/or dietary fat and CHD
also were observed in other studies; leading to many hypotheses regarding
cholesterol and dietary fat being causal for CHD . . .