7 Regression & Correlation: Rates
Basic Medical Statistics Course
October 2010
W. Heemsbergen
Event rate
Event rate: rate at which the event occurs per subject per period of time.
Rate = (number of events occurring) / (cumulative units of time*)
* Clinical research: person-years (total number of years of follow-up for all individuals)
Only count time during which information about possible events is available
and during which the subject is at risk.
Either one event per subject (e.g. onset of cancer) or several events per subject (e.g. bloody nose) can be counted.
Event rate
Incidence rate: no. of new cases per time period
Mortality rate: no. of deaths per time period
A small rate is usually re-expressed, for instance as the
rate per 1000 person-years.
[Figure: follow-up of four subjects (1 to 4): 10 years, 6 years, 5 years, and 1 year; two events, marked +]
2 events / 22 person-years:
event rate ≈ 0.09 per person-year (about 91 per 1000 p-y).
If we are only interested in first events (e.g. diagnosis of breast cancer),
follow-up must stop at the time of the (first) event.
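The calculation above can be sketched in a few lines of Python. The function name `event_rate` is an illustrative assumption; the follow-up times and the number of events are the ones from the slide.

```python
def event_rate(n_events, follow_up_years):
    """Events per person-year: total events divided by total follow-up time."""
    person_years = sum(follow_up_years)
    if person_years <= 0:
        raise ValueError("total follow-up time must be positive")
    return n_events / person_years

# Follow-up of the four subjects on the slide (years at risk per subject).
follow_up_years = [10, 6, 5, 1]
n_events = 2

rate = event_rate(n_events, follow_up_years)
print(f"{rate:.3f} events per person-year")
print(f"{rate * 1000:.0f} events per 1000 person-years")
```

For a first-event analysis, each subject's entry in `follow_up_years` would be cut off at the time of that subject's first event.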
Relative rate
Relative rate (rate ratio, incidence rate ratio) = rate in exposed / rate in unexposed
A relative rate equal to 1 indicates that the rate is the same in the two groups.
A relative rate > 1 indicates that the rate is higher in the exposed group.
A relative rate < 1 indicates that the rate is lower in the exposed group.
In most situations in cancer research, a relative rate is interpreted in the same way as
the relative risk and the odds ratio.
Standardization
A crude comparison between 2 rates can be misleading/inadequate.
Comparing the (crude) mortality rate (number of deaths per 1000 person-years) between
2 countries is misleading when country A has a relatively young population and
country B a relatively old population (e.g. a European vs. an African country).
Solution 1: age specific death rates (a calculated rate for each age category)
Solution 2: standardization of the mortality rate, using a standard population.
Solution 3: recalculate (adjust) rate of population A, using the age structure of
population B.
Standardized mortality/death rate: a standard population is introduced with a
fixed age structure. Then the mortality of any population is adjusted for
discrepancies in age structure between standard and the specific population.
Factors often used in standardization: calendar year, age, gender, ethnicity.
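A minimal sketch of Solution 2 (standardization with a standard population) is given below. All numbers are hypothetical and chosen only for illustration: both populations are given identical age-specific death rates, so any difference in the crude rates comes from age structure alone. The age bands, the names, and the standard population are assumptions, not data from the course.

```python
# Hypothetical age-specific death rates (deaths per person-year).
# Both populations have the SAME age-specific rates on purpose.
rates_a = {"0-39": 0.001, "40-64": 0.005, "65+": 0.040}
rates_b = {"0-39": 0.001, "40-64": 0.005, "65+": 0.040}

# Hypothetical age structures (fractions of the population, summing to 1).
structure_a = {"0-39": 0.70, "40-64": 0.25, "65+": 0.05}  # young population
structure_b = {"0-39": 0.40, "40-64": 0.35, "65+": 0.25}  # old population
standard = {"0-39": 0.50, "40-64": 0.30, "65+": 0.20}     # standard population

def weighted_rate(rates, weights):
    """Weighted average of age-specific rates using the given age structure."""
    return sum(rates[group] * weights[group] for group in rates)

# Crude rates: each population weighted by its own age structure.
crude_a = weighted_rate(rates_a, structure_a)
crude_b = weighted_rate(rates_b, structure_b)

# Standardized rates: both populations weighted by the standard structure.
standardized_a = weighted_rate(rates_a, standard)
standardized_b = weighted_rate(rates_b, standard)

print(f"crude:        A = {crude_a * 1000:.2f}, B = {crude_b * 1000:.2f} per 1000 p-y")
print(f"standardized: A = {standardized_a * 1000:.2f}, B = {standardized_b * 1000:.2f} per 1000 p-y")
```

In this constructed example the crude rate of B is about three times that of A, while the standardized rates are equal.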
Rate vs. Risk
Rate: Total no. events / person-years of follow-up.
Risk: Total no. events / no. of individuals exposed
(probability between 0-1 for first events).
Risk is calculated for a certain interval of time and may
differ for longer or shorter intervals.
When follow-up differs from person to person,
rates are preferred.
Example
Immediate risk of suicide and cardiovascular death
after a prostate cancer diagnosis.
BACKGROUND: Receiving a cancer diagnosis is a stressful event that may
increase risks of suicide and cardiovascular death, especially soon after diagnosis.
METHODS: We conducted a cohort study of 342,497 patients diagnosed with
prostate cancer from January 1, 1979, through December 31, 2004, in the
Surveillance, Epidemiology, and End Results Program. Follow-up started from the
date of prostate cancer diagnosis to the end of first 12 calendar months after
diagnosis. The relative risks of suicide and cardiovascular death were calculated
as standardized mortality ratios (SMRs) comparing corresponding incidences
among prostate cancer patients with those of the general US male population, with
adjustment for age, calendar period, and state of residence. We compared risks
in the first year and months after a prostate cancer diagnosis. The analyses were
further stratified by calendar period at diagnosis, tumor characteristics, and other
variables.
J Natl Cancer Inst. 2010;102:307-14
Example
RESULTS: During follow-up, 148 men died of suicide (mortality rate = 0.5 per
1000 person-years) and 6845 died of cardiovascular diseases (mortality rate =
21.8 per 1000 person-years).
Patients with prostate cancer were at increased risk of suicide during the first
year (SMR = 1.4, 95% confidence interval [CI] = 1.2 to 1.6), especially during the
first 3 months (SMR = 1.9, 95% CI = 1.4 to 2.6), after diagnosis. The elevated risk
was apparent in pre-prostate-specific antigen (PSA) (1979-1986) and peri-PSA
(1987-1992) eras but not since PSA testing has been widespread (1993-2004).
The risk of cardiovascular death was slightly elevated during the first year (SMR
= 1.09, 95% CI = 1.06 to 1.12), with the highest risk in the first month (SMR =
2.05, 95% CI = 1.89 to 2.22), after diagnosis. The first-month risk was statistically
significantly elevated during the entire study period.
CONCLUSION: A diagnosis of prostate cancer may increase the immediate risks
of suicide and cardiovascular death.
Question
SMR (standardized mortality ratio) = 1.09 (cardiovascular death)
A group of 15,000 men is diagnosed with prostate cancer.
Based on statistics of the general male population, the baseline risk of cardiovascular death (without a prostate cancer diagnosis) is 0.8% for the coming year.
How many men in this group are expected to die from cardiovascular disease in the
coming year?
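One way to work through this question is sketched below. It assumes that the first-year SMR can simply be multiplied with the baseline one-year risk of the general population; the variable names are illustrative.

```python
n_men = 15_000          # men diagnosed with prostate cancer
baseline_risk = 0.008   # 0.8% one-year risk of cardiovascular death, general population
smr = 1.09              # standardized mortality ratio, first year after diagnosis

# Deaths expected if the group behaved like the general male population.
expected_baseline = n_men * baseline_risk

# Deaths expected after applying the SMR (assumption: SMR scales the baseline risk).
expected_with_diagnosis = expected_baseline * smr

# Additional deaths attributed to the elevated rate.
excess_deaths = expected_with_diagnosis - expected_baseline

print(f"expected without diagnosis: {expected_baseline:.1f}")
print(f"expected with diagnosis:    {expected_with_diagnosis:.1f}")
print(f"excess deaths:              {excess_deaths:.1f}")
```

Under these assumptions, about 120 deaths would be expected at baseline and about 131 in the diagnosed group, i.e. roughly 11 additional deaths.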
7 Regression & Correlation: Logistic regression
Basic Medical Statistics Course
October 2010
W. Heemsbergen
(Binary) Logistic Regression
We have collected data on N individuals.
We are interested in disease A, which is present in part of the subjects:
• Which (risk) factors are predictive of / associated with the disease?
• What is the probability that a subject with a certain risk profile has the disease
or will develop the disease?
Example: The development of mucositis of the lower alimentary tract after
chemotherapy in cancer patients.
- What are the risk factors predictive of mucositis after chemotherapy?
- What is the probability of developing mucositis after chemotherapy, given an
individual risk profile?
- Potential risk factors: age, weight, renal functioning, type and duration of
chemotherapy, ….
(Binary) Logistic Regression
Logistic Regression is similar to Linear Regression. It is used when the outcome of
interest (the dependent variable) is not continuous but binary (e.g. cancer yes/no).
A patient with a certain risk profile (the independent factors) has a probability of
developing the outcome: risk factor 1, risk factor 2, … (the covariates) result in a
probability (between 0 and 1). The outcome itself, however, is always either present (1) or
not present (0).
probability(D=1|z) = e^z / (1 + e^z), where e^z = exp(z)
Set of covariate values: x1 .. xk; regression coefficients: b1 .. bk
z = a + b1*x1 + b2*x2 + … + bk*xk
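The two formulas above translate directly into code. The sketch below is illustrative only: the function names, the intercept, the coefficients, and the covariate values are hypothetical and do not come from a fitted model.

```python
import math

def linear_predictor(intercept, coefficients, covariates):
    """z = a + b1*x1 + b2*x2 + ... + bk*xk."""
    if len(coefficients) != len(covariates):
        raise ValueError("need exactly one coefficient per covariate")
    return intercept + sum(b * x for b, x in zip(coefficients, covariates))

def probability(z):
    """probability(D=1|z) = e^z / (1 + e^z)."""
    return math.exp(z) / (1 + math.exp(z))

# Hypothetical model with two covariates.
a = -2.0
b = [0.5, 1.0]
x = [2.0, 1.0]

z = linear_predictor(a, b, x)   # -2.0 + 0.5*2.0 + 1.0*1.0
p = probability(z)
print(f"z = {z}, probability(D=1) = {p}")
```

Whatever the value of z, the resulting probability stays between 0 and 1, and it increases as z increases.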
Example
patnr   mean lung dose (Gy)   Radiation Pneumonitis
1       20.1                  1
2       23.6                  1
3       7.7                   1
4       10.5                  0
5       6.0                   0
6       26.0                  1
7       14.8                  0
8       18.2                  1
9       22.7                  1
10      17.1                  0
11      24.0                  1

[Figure: Outcome / PROB / PRED (y-axis, 0.0 to 1.1) against Mean Lung Dose (Gy) (x-axis, 5 to 30), showing the data points, the logistic regression curve, and the linear regression line]
Linear Regression / Logistic Regression
Linear Regression: PRED = -0.135 + 0.045 * MLD
Logistic Regression: PROB(D=1) = exp(-3.4 + 0.24 * MLD) / (1 + exp(-3.4 + 0.24 * MLD))
Exp(B) is the odds ratio for a one-unit increase in the covariate (odds: P/(1-P)).
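The two fitted equations can be compared numerically. The sketch below uses the coefficients from the slide; the function names and the chosen MLD values are illustrative assumptions. It also checks that exp(B) equals the ratio of the odds at two doses that are 1 Gy apart.

```python
import math

def linear_pred(mld):
    """Linear regression from the slide: PRED = -0.135 + 0.045 * MLD."""
    return -0.135 + 0.045 * mld

def logistic_prob(mld):
    """Logistic regression from the slide: z = -3.4 + 0.24 * MLD."""
    z = -3.4 + 0.24 * mld
    return math.exp(z) / (1 + math.exp(z))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

for mld in (2, 10, 20, 30):
    print(f"MLD {mld:>2} Gy: linear = {linear_pred(mld):6.3f}, "
          f"logistic = {logistic_prob(mld):5.3f}")

# Exp(B): odds ratio for a 1 Gy increase in mean lung dose.
odds_ratio_per_gy = math.exp(0.24)

# Same quantity, obtained as the ratio of the odds at 21 Gy and 20 Gy.
odds_ratio_check = odds(logistic_prob(21)) / odds(logistic_prob(20))

print(f"exp(B) = {odds_ratio_per_gy:.3f}, odds(21 Gy)/odds(20 Gy) = {odds_ratio_check:.3f}")
```

The linear prediction leaves the 0 to 1 range at low and high doses, whereas the logistic probability cannot; this is one reason to prefer the logistic model for a binary outcome.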
What is an Odds (Ratio) ?
Are obese patients at higher risk of developing diabetes?
What is the Odds Ratio (OR)?
Data (one row per patient):
Obese   Diabetes
yes     yes
yes     yes
yes     yes
yes     no
no      no
no      no
no      no
no      yes
..      ..

Odds = p/(1-p) = proportion with disease / (1 - proportion with disease)

            diabetes y   diabetes n
obese y     30           10
obese n     10           30

Odds(obese) = 0.75/0.25 = 3
Odds(not obese) = 0.25/0.75 = 0.33
OR = 0.33 / 3 = 0.11
or OR = 3 / 0.33 = 9 (preferred: ratio exposed/unexposed)
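The odds and the odds ratio can be computed directly from the counts in the 2x2 table. The sketch below uses the counts from the slide; the function and variable names are illustrative.

```python
def odds(with_disease, without_disease):
    """Odds = p / (1 - p), which equals cases / non-cases within a group."""
    return with_disease / without_disease

# Counts from the 2x2 table: (diabetes yes, diabetes no).
obese = (30, 10)
not_obese = (10, 30)

odds_obese = odds(*obese)           # 0.75 / 0.25
odds_not_obese = odds(*not_obese)   # 0.25 / 0.75

# Preferred direction: exposed (obese) divided by unexposed (not obese).
odds_ratio = odds_obese / odds_not_obese

# The reverse direction is simply the reciprocal.
odds_ratio_reversed = odds_not_obese / odds_obese

print(f"odds obese = {odds_obese:.2f}, odds not obese = {odds_not_obese:.2f}")
print(f"OR = {odds_ratio:.2f} (reversed: {odds_ratio_reversed:.2f})")
```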
Variable types
The potential predictive factors of interest can be continuous, categorical, ordinal,
or binary. How should these different types be handled in Logistic Regression?
In the Logistic Regression procedure, categorical data have to be indicated as
categorical, and a reference category has to be chosen. For each other
category, the regression coefficient is then calculated relative to this reference
category. It is therefore advised to use the largest category as the reference.
If a variable is not indicated as categorical, it is assumed to be “continuous”: the same
regression coefficient is estimated for each increase of one unit. A normal distribution
is not a prerequisite.
An ordinal variable can be entered in the model as a continuous variable. One should,
however, always be aware of the underlying assumptions in the model.
For a binary predictive variable, it is not necessary to choose between the two. However, the
“reference” will be the lowest value if the variable is entered as continuous, and possibly
the highest value if it is indicated as categorical (depending on the chosen
reference category).
Example: categorical / continuous
Obese (1 yes, 2 no or 0 no)   Diabetes (1 present, 0 not)
1                             1
1                             1
1                             1
1                             0
2 (0)                         0
2 (0)                         0
2 (0)                         0
2 (0)                         1

            diabetes y   diabetes n
obese y     3            1
obese n     1            3

Odds(obese) = 0.75/0.25 = 3
Odds(not obese) = 0.25/0.75 = 0.33
OR = 0.33 / 3 = 0.11
or OR = 3 / 0.33 = 9 (preferred)
Example: categorical / continuous
Obese, 1=yes, 2=no, continuous
Obese, 1=yes, 0=no, continuous
Obese, 1=yes, 2=no, category (reference=last)
Risk factors that are present / not present: code as 1 and 0 and enter as a continuous variable.
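The effect of the coding on Exp(B) can be illustrated without fitting software. The sketch below is not the output that accompanied these slides; it relies on the fact that, with a single two-valued covariate and an intercept, the fitted logistic model reproduces the observed log-odds in both groups, so B is the difference in log-odds divided by the difference in codes. Counts are those of the 8-patient example; the function names are illustrative.

```python
import math

# Counts from the 8-patient example: (diabetes present, diabetes not present).
counts = {"obese": (3, 1), "not_obese": (1, 3)}

def log_odds(present, not_present):
    """Natural log of the odds within one group."""
    return math.log(present / not_present)

def exp_b(code_obese, code_not_obese):
    """Exp(B) when obesity is entered as a continuous variable with these codes.

    With only two distinct covariate values the fitted line passes through both
    observed log-odds, so B = (difference in log-odds) / (difference in codes).
    """
    b = (log_odds(*counts["obese"]) - log_odds(*counts["not_obese"])) / (
        code_obese - code_not_obese
    )
    return math.exp(b)

or_coded_1_0 = exp_b(code_obese=1, code_not_obese=0)  # 1=yes, 0=no
or_coded_1_2 = exp_b(code_obese=1, code_not_obese=2)  # 1=yes, 2=no

print(f"1=yes, 0=no: Exp(B) = {or_coded_1_0:.2f}")
print(f"1=yes, 2=no: Exp(B) = {or_coded_1_2:.2f}")
```

With 1/0 coding a unit increase means "obese versus not obese" (OR = 9); with 1/2 coding a unit increase means "not obese versus obese", which gives the reciprocal (OR = 0.11). Indicating the 1/2 variable as categorical with the last category (2 = no) as reference should again compare obese with not obese.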
Example: rectal bleeding
Dosimetric factors predictive for moderate/severe rectal bleeding after RT for prostate cancer.
Int J Radiat Oncol Biol Phys 2004; 59: 1343.
Example: esophagus toxicity
To correlate acute esophageal toxicity with dosimetric and clinical parameters for
patients treated with radiotherapy (RT) alone or with chemo-radiotherapy (CRT).
probability(D=1) = exp(z) / (1 + exp(z)),
can be rewritten as:
probability(D=1) = 1 / (1 + exp(-z))
[Figure label: volume of esophagus]
Radiat Oncol 2005; 75: 157.
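The step between the two forms of the logistic formula is a single multiplication of numerator and denominator by exp(-z). Written out, with P(D=1) as shorthand for probability(D=1):

```latex
\[
P(D=1) \;=\; \frac{e^{z}}{1+e^{z}}
       \;=\; \frac{e^{z}\,e^{-z}}{\left(1+e^{z}\right)e^{-z}}
       \;=\; \frac{1}{e^{-z}+1}
       \;=\; \frac{1}{1+e^{-z}}
\]
```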
Question
What is the Odds Ratio for V35, and for Concurrent Chemo-RT?