Introduction to Biostatistics (ZJU, 2008)
Introduction to Biostatistics
(ZJU 2008)
Wenjiang Fu, Ph.D
Associate Professor
Division of Biostatistics, Department of
Epidemiology
Michigan State University
East Lansing, Michigan 48824, USA
Email: [email protected]
www: http://www.msu.edu/~fuw
Logistic regression model
Why use logistic regression?
Estimation by maximum likelihood
Interpretation of coefficients
Hypothesis testing
Evaluating the performance of the model
Why logistic regression?
Many important research topics have a dependent
variable that is binary,
e.g. disease vs. no disease,
damage vs. no damage, death vs. alive, etc.
Binary logistic regression is a type of regression analysis
where the dependent variable is a dummy variable: coded 0
(absence of disease) or 1 (presence of a disease), etc.
To explain the variability of the binary variable by
other variables, either continuous or categorical (such
as age, sex, BMI, marital status, socio-economic
status, etc.), use a statistical model to relate the
probability of the response event to the explanatory
variables.
Logistic Regression Model
Event (Y = 1), no event (Y = 0); we want to model the mean:
E(Y) = P(Y=1) × 1 + P(Y=0) × 0 = P(Y=1)
but π = P(Y=1) is bounded between 0 and 1, while
the linear predictor β0 + x1β1 + x2β2 + x3β3 + … is a
linear combination and may take any value.
The "logit" model solves this problem.
Single independent variable:
ln [ p/(1-p) ] = α + βX
where the probability p = P(Y=1 | X).
p/(1-p) is the "odds": the ratio of the probability that the event
occurs to the probability that it does not, under condition X.
ln[ p/(1-p) ] is the "log odds" or "logit".
More:
The logistic distribution constrains the estimated
probabilities to lie between 0 and 1.
The estimated probability is:
p = 1 / [ 1 + exp(-α - βX) ]
  = exp(α + βX) / [ 1 + exp(α + βX) ]
If α + βX = 0, then p = 0.50.
As α + βX gets really big, p approaches 1.
As α + βX gets really small (very negative), p approaches 0.
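These limiting behaviors are easy to verify numerically; a minimal Python sketch (not part of the original slides):

```python
import math

def logistic(eta):
    """Estimated probability p = exp(eta) / (1 + exp(eta)) for linear predictor eta = a + b*X."""
    return math.exp(eta) / (1.0 + math.exp(eta))

print(logistic(0.0))    # 0.5 when a + b*X = 0
print(logistic(10.0))   # close to 1 for a large linear predictor
print(logistic(-10.0))  # close to 0 for a very negative linear predictor
```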
Comparing LinReg and Logit Models
[Figure: linear regression line vs. S-shaped logit curve, with Y = 0 and Y = 1 marked on the vertical axis.]
Maximum Likelihood Estimation
(MLE)
MLE is a statistical method for estimating the
coefficients of a model.
The likelihood function (L) measures the
probability of observing the particular set of
dependent variable values (Y1=y1, … , Yn=yn)
L = Prob (y1,y2,…,yn)
The higher the L, the higher the probability of
observing the sample data at hand.
Maximum Likelihood Estimator
MLE involves finding the coefficients (α, β) that
make the log of the likelihood function (logLik < 0)
as large as possible.
The maximum likelihood estimator maximizes the
following likelihood function:
Lik = Π_i p^(yi) (1 - p)^(1 - yi)
where p = exp(α + βX) / [ 1 + exp(α + βX) ]
Maximum Likelihood Estimator
Or equivalently, the MLE maximizes the log-likelihood function
l = log Lik = Σ_i yi log[ p/(1-p) ] + n log(1-p)
where p = exp(α + βX) / [ 1 + exp(α + βX) ] is given by the
logit link function.
The MLE is a biased estimator, but consistent: by large-sample
theory, the estimator converges to the true model parameters
fast enough as the sample size increases.
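As an illustration of the idea (not part of the original slides), a crude Python sketch that evaluates the Bernoulli log-likelihood on hypothetical toy data and locates its maximum by grid search; real software uses Newton-Raphson instead:

```python
import math

# Toy data (hypothetical, for illustration only): binary outcome y, covariate x.
x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 0, 1, 1]

def loglik(a, b):
    """Bernoulli log-likelihood: sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = math.exp(a + b * xi) / (1.0 + math.exp(a + b * xi))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

# Crude grid search for the maximizer; statistical software uses Newton-Raphson.
best = max((loglik(a / 10.0, b / 10.0), a / 10.0, b / 10.0)
           for a in range(-30, 31) for b in range(-30, 31))
print(best)  # loglik is always < 0; the MLE makes it as large as possible
```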
Interpretation of
Coefficients
Since
ln [ p/(1-p) ] = + X
The slope is interpreted as the rate of change
in the "log odds" as X changes … not very useful.
More useful:
p = Pr(Y=1 | X).
p / (1-p) is the odds under condition X.
Example: X = smoker; Y = 1 is lung cancer.
Then p/(1-p) is the odds of lung cancer with respect to smoking.
Interpretation of
Coefficients
If X is continuous, exp(β) measures the change
of odds with one unit increase of X:
ln(OR) = ln [odds(X+1)] - ln [odds(X)]
       = ln [ p(Y=1|X+1) / (1 - p(Y=1|X+1)) ]
         - ln [ p(Y=1|X) / (1 - p(Y=1|X)) ]
       = α + β(X+1) - (α + βX) = β
So ln(OR) = β and OR = exp(β).
Interpretation of
Coefficients
If X is binary, exp(β) measures the odds of
one group compared with the other:
ln(OR) = ln [ p(Y=1|X=1) / (1 - p(Y=1|X=1)) ]
         - ln [ p(Y=1|X=0) / (1 - p(Y=1|X=0)) ]
       = α + β·1 - (α + β·0) = β
So ln(OR) = β and OR = exp(β).
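A quick numeric check (hypothetical intercept and slope, for illustration only) that the intercept cancels in the ratio, so OR = exp(β):

```python
import math

# Hypothetical intercept and slope for a binary exposure X (illustration only).
a, beta = -2.0, 0.8

odds_exposed = math.exp(a + beta * 1)    # odds when X = 1
odds_unexposed = math.exp(a + beta * 0)  # odds when X = 0
or_ratio = odds_exposed / odds_unexposed

print(round(or_ratio, 4))        # 2.2255
print(round(math.exp(beta), 4))  # 2.2255: the intercept cancels, OR = exp(beta)
```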
Interpretation of parameter β
Model: logit(p) = β0 + x1β1
For X = 1:
Pr(Y=1 | x=1) = e^(β0+β1) / (1 + e^(β0+β1)),
1 - Pr(Y=1 | x=1) = 1 / (1 + e^(β0+β1))
For X = 0:
Pr(Y=1 | x=0) = e^(β0) / (1 + e^(β0)),
1 - Pr(Y=1 | x=0) = 1 / (1 + e^(β0))
Hence
OR = P(1|x=1) [1 - P(1|x=0)] / { [1 - P(1|x=1)] P(1|x=0) } = e^(β1)
Model Assessment
There are several statistics that can be
used for comparing alternative models or
evaluating the performance of a single
model:
LRT, Wald tests
Percent correct predictions
Model Chi-Squares Statistic
The model likelihood ratio test (LRT) statistic is
LR = [-2 loglik (reduced model)] - [-2 loglik (full model)]
Example:
test of β: LR = -2 [ loglik(α) - loglik(α, β) ]
loglik(α) is the log-likelihood of the model with only the intercept;
loglik(α, β) is that of the model with the intercept and X.
The LR statistic follows a chi-squared distribution with r
degrees of freedom, where r = the difference in the number of
parameters between the two models.
Use the LRT statistic to determine whether the overall model is
statistically significant.
Percent Correct Predictions
Predicted outcome (majority vote method):
if the predicted probability Pr(x) ≥ 0.5, assign ŷ = 1;
otherwise assign ŷ = 0.
Compare the predicted outcome ŷ with the actual
outcome y and compute the percentage of correct
outcomes.
An Example:

              Predicted 0   Predicted 1   % Correct
Observed 0        328            24         93.18%
Observed 1        139            44         24.04%
Overall                                     69.53%
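The percentages in this example can be reproduced with a short Python sketch (not from the slides):

```python
# Percent-correct computation for the classification table above.
# Rows: observed outcome; columns: predicted outcome.
obs0 = {"pred0": 328, "pred1": 24}
obs1 = {"pred0": 139, "pred1": 44}

correct0 = obs0["pred0"] / (obs0["pred0"] + obs0["pred1"])   # observed-0 accuracy
correct1 = obs1["pred1"] / (obs1["pred0"] + obs1["pred1"])   # observed-1 accuracy
total = sum(obs0.values()) + sum(obs1.values())
overall = (obs0["pred0"] + obs1["pred1"]) / total            # overall accuracy

print(f"{correct0:.2%} {correct1:.2%} {overall:.2%}")  # 93.18% 24.04% 69.53%
```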
Testing significance of
variables
Omitted variable(s) can result in bias in the coefficient
estimates. To test for omitted variables you can
conduct a likelihood ratio test:
LR[q] = [-2 loglik (constrained model, k - q parameters)]
        - [-2 loglik (unconstrained model, k parameters)]
where LR follows a chi-squared distribution with q
degrees of freedom, q being the number of omitted
variables (q = 1 or more).
An Example:

Variable      B        Wald     Sig
PETS        -0.699    10.968   0.001
MOBLHOME     1.570    29.412   0.000
TENURE      -0.020     5.993   0.014
EDUC         0.049     1.079   0.299
CHILD        0.009     0.011   0.917
WHITE        0.186     0.422   0.516
FEMALE       0.018     0.008   0.928
Constant    -1.049     2.073   0.150

Beginning -2 LL: 687.36
Ending -2 LL:    641.41
Constructing the LR Test

Ending -2 loglik, Partial Model   641.84
Ending -2 loglik, Full Model      641.41
Block Chi-Square                    0.43
DF                                     3
Critical Value                    11.345

"Since the chi-squared value is less than the critical value, the set
of coefficients is not statistically significant. The full model is not
an improvement over the partial model."
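The same comparison in a few lines of Python (values taken from the table above):

```python
# Block chi-square from the two ending -2 log-likelihoods above.
partial_m2ll = 641.84
full_m2ll = 641.41
lr = partial_m2ll - full_m2ll
df = 3
critical = 11.345  # chi-squared critical value at alpha = 0.01 with 3 df

print(round(lr, 2), lr < critical)  # 0.43 True -> fail to reject
```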
Multiple Logistic regression
Probability of event labeled as a binary outcome.
Logistic regression model: the logit function
log [π / (1 - π)] = β0 + x1β1 + … + xpβp = η
or equivalently π = exp(η) / [ 1 + exp(η) ]
Why do we need multiple logistic regression?
A simple R×C table cannot solve all problems, and
can be misleading.
Multiple Logistic Regression: Formulation
E(Y | x) = P(Y = 1 | x) = π(x)
         = e^(β0 + β1X1 + … + βpXp) / [ 1 + e^(β0 + β1X1 + … + βpXp) ]
ln [ π(x) / (1 - π(x)) ] = β0 + β1x1 + … + βpXp
The relationship between π and x is S-shaped.
The logit (log-odds) transformation is the link function.
Multiple Logistic Regression
Assess risk factors:
individually, H0: βk = 0;
globally, H0: βm = … = βm+t = 0;
while controlling for confounders and other
important determinants of the event.
Interpretation of the parameters
If π is the probability of an event and O is the odds
for that event, then
O = π(x) / (1 - π(x)) = probability of event / probability of no event
The link function in logistic regression gives the log-odds:
g(x) = ln [ π(x) / (1 - π(x)) ] = β0 + β1x1 + … + βpXp
Example: Snoring & Heart Disease
An epidemiologic study surveyed 2484
subjects to examine whether snoring was a
possible risk factor for heart disease.
Heart Disease    Never   Occasional   Nearly every night   Every night
Yes                 24        35               21                30
No                1355       603              192               224
Prop(yes)         .017      .055             .099              .118
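The Prop(yes) row can be reproduced directly (Python sketch, not from the slides):

```python
# Proportion with heart disease in each snoring category, from the table above.
yes = [24, 35, 21, 30]       # never, occasional, nearly every night, every night
no = [1355, 603, 192, 224]
props = [round(y / (y + n), 3) for y, n in zip(yes, no)]
print(props)  # [0.017, 0.055, 0.099, 0.118]
```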
Constructing Indicator variables
Let Z1 = 1 if occasional, 0 otherwise.
Let Z2 = 1 if nearly every night, 0 otherwise.
Let Z3 = 1 if every night, 0 otherwise.
SAS Codes
data hd;
input hd $ snoring $ count;
Z1=(snoring="occa");
Z2=(snoring="nearly");
Z3=(snoring="every");
cards;
yes never 24
yes occa 35
yes nearly 21
yes every 30
no never 1355
no occa 603
no nearly 192
no every 224
;
run;

proc logistic data=hd descending;
model hd (event='yes') = Z1 Z2 Z3;
freq count;
run;
SAS OUTPUT

Response Profile
Ordered Value   hd    Total Frequency
1               yes         110
2               no         2374
Probability modeled is hd='yes'.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC             902.827            842.923
SC              908.645            866.193
-2 Log L        900.827            834.923

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square   DF   Pr > ChiSq
Likelihood Ratio      65.9045     3     <.0001
Score                 72.7821     3     <.0001
Wald                  58.9513     3     <.0001
The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1   -4.0335        0.2059           383.6641        <.0001
Z1           1    1.1869        0.2695            19.3959        <.0001
Z2           1    1.8205        0.3086            34.8027        <.0001
Z3           1    2.0231        0.2832            51.0313        <.0001

Odds Ratio Estimates
Effect   Point Estimate   95% Wald Confidence Limits
Z1            3.277            1.932    5.558
Z2            6.175            3.373   11.306
Z3            7.561            4.341   13.172
Calculating Probabilities
The fitted logistic regression function is
Logit(π)= -4.0335 + 1.1869 Z1+ 1.8205 Z2+ 2.0231 Z3
So the probability of heart disease for someone who never snores
is exp(-4.0335) / (1 + exp(-4.0335)) = .0174
For an occasional snorer,
exp(-4.0335 + 1.1869) / (1 + exp(-4.0335 + 1.1869)) = .0549
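These two probabilities can be checked numerically (Python sketch, not from the slides):

```python
import math

def fitted_prob(eta):
    """Invert the logit: p = exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1.0 + math.exp(eta))

intercept = -4.0335
b1 = 1.1869  # coefficient for occasional snoring (Z1)

print(round(fitted_prob(intercept), 4))       # never snores: 0.0174
print(round(fitted_prob(intercept + b1), 4))  # occasional:   0.0549
```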
Calculating Odds Ratios
If Z1=Z2=Z3=0, then odds are exp(-4.0335)
If Z2=Z3=0, but Z1=1, then odds are exp(-4.0335+1.1869)
The ratio of odds is then exp(1.1869) = 3.2769
Interpretation: compared with people who never snore,
people who snore occasionally have 3.28 times the odds of
developing heart disease.
What is the odds ratio for comparing those who snore
nearly every night with occasional snorers?
What is the odds ratio for comparing those who snore
every night with those who snore nearly every night?
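A Python sketch of the arithmetic (coefficients from the fitted model above; the last two lines work out the two questions just posed, using differences of coefficients):

```python
import math

# Coefficients from the fitted model above.
b1, b2, b3 = 1.1869, 1.8205, 2.0231

# OR for occasional vs never: exponentiate the coefficient.
print(round(math.exp(b1), 2))       # 3.28
# OR for nearly every night vs occasional: exp(b2 - b1).
print(round(math.exp(b2 - b1), 2))  # 1.88
# OR for every night vs nearly every night: exp(b3 - b2).
print(round(math.exp(b3 - b2), 2))  # 1.22
```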
Example – Genetic Association study
Idiopathic Pulmonary Fibrosis (IPF) is known to be
associated with age and gender (older and male are
more likely)
One study, with 174 cases and 225 controls, found an
association of IPF with one genotype of the gene COX2.8473
(C→T).
Genotype    CC    CT    TT   Total
Case        88    72    14    174
Control     84   113    28    225
Total      172   185    42    399

P-value by Pearson chi-squared test: p = 0.0241.
Q: Is this association true?
Example on genetic study
Logistic regression model:
logit [Pr(IPF)] = intercept + snp + sex + age
Results:

Effect   DF   Wald Chi-Square   P-value
SNP       2        2.7811        0.2489
sex       1        9.1172        0.0025
age       1      100.454         <.0001
Example on genetic study
Investigate why it happens by age and sex.
Disease (N: normal; D: disease) by age:

     0-29   30-49   50-64   65-74   75+
N     104     77      35      7      2
D       0     10      42     68     54
T     104     87      77     75     56
Example on genetic study
Investigate why it happens by age and sex.
Disease (N: normal; D: disease) by sex:

     male   female
N      72     153
D     108      66
T     180     219
Example on genetic study
Investigate why it happens by age and sex.
SNP genotype by sex:

          CC    CT    TT   Total
male      79    75    26    180
female    93   110    16    219

Pearson chi-squared test: X2 = 6.3911, p = 0.0409
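The reported X2 can be reproduced from the table (Python sketch, not from the slides):

```python
# Pearson chi-squared statistic for the genotype-by-sex table above.
observed = [[79, 75, 26],    # male:   CC, CT, TT
            [93, 110, 16]]   # female: CC, CT, TT

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
n = sum(row_tot)

x2 = 0.0
for i in range(2):
    for j in range(3):
        expected = row_tot[i] * col_tot[j] / n
        x2 += (observed[i][j] - expected) ** 2 / expected

print(round(x2, 2))  # 6.39 (the slide reports 6.3911)
```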
Example on genetic study
Investigate why it happens by age and sex.
Age class by genotype:

      0-29   30-49   50-64   65-74   75+   Total
CC      36     34      35     37     30     172
CT      58     42      32     30     23     185
TT      10     11      10      8      3      42

Pearson chi-squared test: X2 = 10.01, p = 0.2643
Loglinear regression model
Why use loglinear regression?
Often, observations are counts (≥ 0), with a known
number of people exposed to risk (total population at risk).
Estimation by maximum likelihood:
Lik = Π_i (μ^yi / yi!) e^(-μ)
or the log-likelihood function
l = log Lik = Σ_i [ yi log μ - μ - log(yi!) ]
  = (log μ) Σ_i yi - nμ + Const.
log is the link function.
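A small Python sketch (hypothetical counts, illustration only) of the Poisson log-likelihood; for a common mean μ its maximizer is the sample mean:

```python
import math

# Hypothetical counts (illustration only).
y = [2, 4, 3, 5, 1, 3]
n = len(y)

def loglik(mu):
    """Poisson log-likelihood: sum_i [y_i log(mu) - mu - log(y_i!)]."""
    return sum(yi * math.log(mu) - mu - math.lgamma(yi + 1) for yi in y)

# For a common mean mu, the MLE is the sample mean.
ybar = sum(y) / n
print(ybar)  # 3.0; loglik(ybar) exceeds loglik at nearby values of mu
```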
R program for Logistic and
Loglinear models
Both logistic and loglinear models are special cases of
generalized linear models (GLMs).
R uses the function 'glm' to fit GLMs with different link
functions:
logistic regression: binomial family
loglinear regression: Poisson family
Logistic:
fit <- glm(as.factor(y)~x, family = binomial (link = logit))
Loglinear:
fit <- glm(y~x, family = poisson (link = log))
R output for Logistic and
Loglinear models
> fit
Call: glm(formula = y ~ x, family = binomial(link = logit))

Coefficients:
(Intercept)         x1         x2         x3         x4
  -0.725709   0.014827   0.015248   0.007396  -0.041097

Degrees of Freedom: 19 Total (i.e. Null); 15 Residual
Null Deviance:     24.43
Residual Deviance: 21.22    AIC: 31.22
R output for Logistic
> summary(fit)
Call:
glm(formula = y ~ x, family = binomial(link = logit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3708  -0.9643  -0.3896   1.2518   1.6197

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.725709   1.988678  -0.365    0.715
x1           0.014827   0.021880   0.678    0.498
x2           0.015248   0.022781   0.669    0.503
x3           0.007396   0.018840   0.393    0.695
x4          -0.041097   0.030172  -1.362    0.173

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 24.435 on 19 degrees of freedom
Residual deviance: 21.223 on 15 degrees of freedom
Mortality from Cervical Cancer in Ontario 1960-94
Rate (per 10^5 person-years) and frequency, shown as rate (frequency); columns are calendar periods:

Age     60-64        65-69        70-74        75-79        80-84        85-89        90-94
20-24   0.15 (2)     0.11 (2)     0.15 (3)     0.14 (3)     0.14 (3)     0.20 (4)     0.13 (1)
25-29   1.22 (14)    0.52 (8)     1.24 (23)    0.80 (16)    0.88 (20)    0.47 (11)    0.93 (8)
30-34   3.15 (35)    2.94 (37)    2.01 (32)    1.45 (27)    1.79 (38)    1.31 (32)    1.08 (11)
35-39   5.38 (62)    4.47 (52)    3.59 (46)    3.86 (61)    3.12 (60)    2.47 (55)    2.16 (21)
40-44   9.80 (116)   7.15 (84)    4.32 (51)    5.12 (66)    3.71 (60)    2.47 (63)    2.16 (33)
45-49  15.66 (160)  10.97 (130)   7.75 (91)    4.69 (55)    5.17 (67)    5.02 (83)    3.41 (27)
50-54  17.01 (151)  13.32 (138)   8.19 (97)    6.82 (80)    6.12 (72)    4.65 (61)    5.79 (35)
55-59  18.56 (141)  15.23 (133)  11.53 (118)   9.12 (107)   5.94 (70)    5.81 (69)    5.77 (29)
60-64  22.44 (144)  16.08 (121)  13.66 (117)  10.71 (108)   7.93 (92)    7.35 (86)    4.02 (19)
65-69  23.53 (128)  18.87 (119)  15.31 (112)  13.79 (115)  10.36 (102)   7.60 (86)    6.83 (31)
70-74  25.89 (116)  19.36 (97)   15.36 (89)   15.18 (103)  13.95 (108)  10.42 (96)   10.44 (44)
75-79  29.12 (94)   20.08 (75)   23.84 (102)  16.29 (82)   14.90 (88)   11.50 (78)   12.73 (38)
80-84  31.76 (62)   24.72 (59)   21.51 (60)   23.82 (79)   12.69 (50)   17.40 (81)   12.77 (27)
85+    33.16 (42)   28.95 (50)   22.90 (50)   24.94 (68)   15.23 (51)   13.88 (56)   10.42 (19)
Disease Rate (per 10^5) and Frequency, shown as rate (frequency); columns are calendar periods:

Age     60-64      65-69      70-74      75-79
20-24   0.15 (2)   0.11 (2)   0.15 (3)   0.14 (3)
25-29   1.22 (14)  0.52 (8)   1.24 (23)  0.80 (16)
30-34   3.15 (45)  2.94 (37)  2.01 (32)  1.45 (27)
Statistical Models
Loglinear model:
log(ERij) = log(nij) + μ + αi + βj + γij
μ intercept; ERij expected frequency in cell (i, j);
nij population-years of exposure (offset term);
α1, …, αa row (age) effects, Σ(i=1..a) αi = 0;
β1, …, βp column (period) effects, Σ(j=1..p) βj = 0;
γij interaction effects, Σ γij = 0.
Statistical Models
APC loglinear model:
log(ERij) = log(nij) + μ + αi + βj + γk
μ intercept; ERij expected frequency in cell (i, j);
nij population-years of exposure (offset term);
α1, …, αa row (age) effects, Σ(i=1..a) αi = 0;
β1, …, βp column (period) effects, Σ(j=1..p) βj = 0;
γ1, …, γ(a+p-1) diagonal (cohort) effects, Σ(k=1..a+p-1) γk = 0.
Problem
Linear dependency among covariates:
Period - Age = Cohort
(rows, columns, and diagonals on a Lexis diagram)
Multiple estimators!
Disease Rate (per 10^5) and Frequency, shown as rate (frequency); columns are calendar periods:

Age     60-64      65-69      70-74      75-79
20-24   0.15 (2)   0.11 (2)   0.15 (3)   0.14 (3)
25-29   1.22 (14)  0.52 (8)   1.24 (23)  0.80 (16)
30-34   3.15 (45)  2.94 (37)  2.01 (32)  1.45 (27)
Matrix form
Models in matrix form:
log(ER) = log(n) + X b
b = (μ, α1, …, α(a-1), β1, …, β(p-1), γ1, …, γ(a+p-2))^T
X singular design matrix: rank is 1 less than full rank.
Matrix form
Regression design matrix (X) for the APC model with a 3×3 table
(effect columns only; the intercept column is omitted):

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    0    1    0    0    0    1    0
[2,]    1    0    0    1    0    0    0    1
[3,]    1    0   -1   -1   -1   -1   -1   -1
[4,]    0    1    1    0    0    1    0    0
[5,]    0    1    0    1    0    0    1    0
[6,]    0    1   -1   -1    0    0    0    1
[7,]   -1   -1    1    0    1    0    0    0
[8,]   -1   -1    0    1    0    1    0    0
[9,]   -1   -1   -1   -1    0    0    1    0

Eigenvalues of (X^T X):
12.16  9.23  4.98  3.35  3.00  2.15  1.12  0.00
R-program to generate APC Matrix
apcmat1 <- function (a = 3, p = 3) {
  ## construct the design matrix for APC analysis
  x <- matrix(0, a*p, 2*(a+p-2))
  gammac <- rbind(diag(1, a+p-2), -1)
  for (i in 1:(a-1)) {
    # for alpha
    x[(i-1)*p+(1:p), i] <- 1
    # for beta
    x[(i-1)*p+(1:(p-1)), a:(a+p-2)] <- diag(1, p-1)
    x[i*p, a:(a+p-2)] <- -1
    # for gamma
    for (j in 1:p)
      x[(i-1)*p+j, -1:-(a+p-2)] <- gammac[a-i+j, ]
  }
  # last block (row in apc table)
  x[(a-1)*p+(1:p), 1:(a-1)] <- -1                  # for alpha
  x[(a-1)*p+(1:(p-1)), a:(a+p-2)] <- diag(1, p-1)  # for beta
  x[a*p, a:(a+p-2)] <- -1
  for (j in 1:p)
    x[(a-1)*p+j, -1:-(a+p-2)] <- gammac[j, ]       # for gamma
  x
}
p
2
1
a
1 2
1
2
a+p-1
Challenge - Identification
Constraint on parameters (Kupper et al., 1985):
one more constraint → trend determination,
but a different constraint → a different trend.
Challenge - Identification
Conclusion:
All estimators are biased except one: the one whose
constraint is satisfied by the true parameters of the model.
Not verifiable!!
Identifiability problem!
Mystery?
Previous Methods
Estimable functions (Fienberg and Mason 1979;
Clayton and Schifflers 1987; Holford 1985, 1991):
independent of the selection of constraint;
invariant characteristics of trends;
nonlinear components estimable (curvature);
linear component (slope) not estimable!
Previous Conclusion
Identifiability problem: difficult!
Linear trends are not estimable (numerically)!
Kupper et al. (1985), Holford (1985, 1991),
Clayton and Schifflers (1987),
Fienberg and Mason (1979, 1985).
Kupper et al. (1985) provided a condition for
estimable functions.
Mystery? Approach?
Multiple estimators!
How to pick the "correct" one?
Hint: think about the math, not the statistics!
Structure of Estimators
Each estimator has the form
b̂ = B + t B0
B0: eigenvector of X^T X with eigenvalue 0;
||B0|| = 1, independent of the disease rate;
t: an arbitrary real number;
B: orthogonal to B0, uniquely determined
by the disease/event rate.
B0 Independent of Rate
Kupper et al. (1985):
B* = (0, A, P, C)
A = [ 1 - (a+1)/2, …, (a-1) - (a+1)/2 ]
P = [ -1 + (p+1)/2, …, -(p-1) + (p+1)/2 ]
C = [ 1 - (a+p)/2, …, (a+p-1) - (a+p)/2 ]
B0 = B* / ||B*||
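As a consistency check (not part of the original slides), the effects part of B* for a = p = 3 can be verified to lie in the null space of the 3×3 APC design matrix shown earlier; a Python sketch:

```python
# 9x8 APC design matrix for a = p = 3 (effect columns only, as printed earlier).
X = [[ 1,  0,  1,  0,  0,  0,  1,  0],
     [ 1,  0,  0,  1,  0,  0,  0,  1],
     [ 1,  0, -1, -1, -1, -1, -1, -1],
     [ 0,  1,  1,  0,  0,  1,  0,  0],
     [ 0,  1,  0,  1,  0,  0,  1,  0],
     [ 0,  1, -1, -1,  0,  0,  0,  1],
     [-1, -1,  1,  0,  1,  0,  0,  0],
     [-1, -1,  0,  1,  0,  1,  0,  0],
     [-1, -1, -1, -1,  0,  0,  1,  0]]

a = p = 3
# B* components: A_i = i - (a+1)/2, P_j = -j + (p+1)/2, C_k = k - (a+p)/2
A = [i - (a + 1) / 2 for i in range(1, a)]          # [-1.0, 0.0]
P = [-j + (p + 1) / 2 for j in range(1, p)]         # [1.0, 0.0]
C = [k - (a + p) / 2 for k in range(1, a + p - 1)]  # [-2.0, -1.0, 0.0, 1.0]
bstar = A + P + C  # effects part of B* (intercept component 0 omitted)

# X @ bstar is the zero vector: bstar spans the null space of X.
resid = [sum(xij * bj for xij, bj in zip(row, bstar)) for row in X]
print(resid)
```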
Estimable Function
Theorem 1
E(B) is estimable and determines both linear and
nonlinear components of trend:
B = (I - B0 B0^T) b̂
E(B) is the only estimable function that determines
both linear and nonlinear components:
L b̂ = L(B + t B0) = LB + t LB0 = LB
(since LB0 = 0 for any estimable function L).
Geometry for Estimable E(B)
[Figure: geometry of the estimable E(B). Two estimators b̂1 = B + t1 B0 and b̂2 = B + t2 B0 lie on a line through B in the direction B0; both project onto the same B, orthogonal to B0 through the origin O.]
Constraint
Quantitative constraints:
specify a relationship between parameters;
require a priori knowledge of the event/disease.
Qualitative constraints:
require no a priori knowledge.
Properties of B
Intrinsic to the disease/event: the arbitrary tB0 term is removed.
The only estimable function determining both linear and
nonlinear components.
Robust estimation by sensitivity analysis.
Consistent estimation of the intercept and age effects
μ, α1, …, α(a-1) as p → ∞.
New Method - Intrinsic Estimator
Structure of estimators: B + t B0
Intrinsic estimator: B
Determined by removing tB0, the arbitrary term.
Effective trend: the trend determined by B.
Cervical Cancer Mortality
[Figure: estimated age, period, and cohort trends with 95% CIs for cervical cancer mortality: age effect over ages 20-80, period effect over 1960-1990, cohort effect over birth cohorts 1880-1960.]
Homicide Arrest Rate
(per 10^5) (R. O'Brien, 2000)

Age    1960   1965   1970   1975   1980   1985   1990   1995
15     8.89   9.07  17.22  17.54  18.02  16.32  36.52  35.24
20    14.00  15.18  23.76  25.62  23.95  21.11  29.10  32.34
25    13.45  14.69  20.09  21.05  18.91  16.79  17.99  16.75
30    10.73  11.70  16.00  15.81  15.22  12.59  12.44  10.05
35     9.37   9.76  13.13  12.83  12.31   9.60   9.38   7.27
40     6.48   7.41  10.10  10.52   8.79   7.50   6.81   5.48
45     5.71   5.56   7.51   7.32   6.76   5.31   5.17   3.67
Homicide Arrest Rate
[Figure: estimated age, period, and cohort trends for the homicide arrest rate: age effect over ages 15-45, period effect over 1960-1990, cohort effect over birth cohorts 1920-1980.]
On-line Software APCsoft
Based on the IE method: www.apcsoft.epi.msu.edu
Web-based software.
Users upload Excel files.
Outputs analysis results and dynamic graphics.