Epi 999 - Stanford University

Higher Order Contingency Tables and Logistic Regression
Copyright © 1999-2007 Leland Stanford Junior University. All rights reserved.
Warning: This presentation is protected by copyright law and international treaties.
Unauthorized reproduction of this presentation, or any portion of it, may result in
severe civil and criminal penalties and will be prosecuted to maximum extent possible
under the law.
Predicting an Outcome
• A major goal for Epidemiology is
quantifying the relationship between sets of
disease predictors and binary outcomes like
diseased/disease free.
• The first step in describing the relationship
between your predictor(s) and your
outcome is to do univariate analyses. That
is, test for an association between each of
your predictors and the outcome.
Predicting an Outcome
• After you assess the univariate relationships
between your predictors and your outcome, you
will want to look for effect modification (what
everyone else calls interactions) and confounding.
• You may want to look for higher order
interactions. You should look for all interactions
which could be there based on subject matter
knowledge.
• Do not test for interactions that you cannot
explain (in your native language)! I can think
about three-way interactions but I cannot get my
brain around four-way.
Predicting an Outcome
• Prior to doing any analyses, write out on a
spreadsheet all of the effects that interest
you. You then will test for those, and only
those, effects.
Outcome = hasCancer
Predictor name       Type         Odds ratio   Lower CI   Upper CI   p-value
Gender               Binary
Country of Birth     Nominal
Number of Children   Ordinal
OC Type              Nominal
OC Years             Continuous
kids * Years OC
OC Type * OC Yrs
Univariate Analysis:
Is There an Association?
• The way you assess the relationship between
your predictors and the outcome depends on
your data. If you have a 2x2 table, you just
look at the confidence interval for the odds ratio
(OR). Otherwise:
Row Variable   Column Variable   Proc Freq Switch   Statistic
Nominal        Nominal           chisq              Pearson χ²
Nominal        Ordinal           cmh2               Mean Score
Ordinal        Ordinal           cmh1               Mantel Haenszel χ²
Univariate Analysis (2)
Strength of an Association
• If you are looking at a 2x2 table, you can assess
the strength of the association with the odds ratio.
Otherwise:
Row Variable   Column Variable      Statistic
Nominal        Nominal or Ordinal   Uncertainty Coefficient C|R
Ordinal        Ordinal              Spearman Correlation
• You can get them from the /measures switch on
the tables statement of proc freq.
Sets of Univariate Statistics
• You can request all the univariate measures like
this:
proc freq data=blah;
tables (sex pih_total)*pre_term_l
/chisq cmh measures;
run;
List all your predictors inside the parentheses and your outcome after the asterisk.
Sets of 2x2 Tables (1)
Cochran Mantel-Haenszel
• You will need to do analyses where the
relationship between the predictor and disease
is (at least partially) influenced by a third factor.
This third factor can be a stratification factor
from your study design or a confounder that
you did not block for (i.e., group on).
Regardless of the source, you can use the
Cochran Mantel-Haenszel method to neutralize
the third variable.
Sets of 2x2 Tables (2)
Confounding & Interaction
• Invoking the CMH technique is simple. You add
the extra variable (potential confounder) to the
left side of the tables statement and add /cmh
to the end of the line:
tables school * exposure * spots / cmh;
• This will cause SAS to print out a contingency
table for each of the levels of the confounder
and the common OR/RR.
• It does not print out the OR and RR for the
subtables. To get them, add the measures
switch with the CMH.
(2x)2x2 Ignoring Strata
Data spots2;
input school exposure $ health $ count @@;
datalines;
1 Exposed diseased 38 1 Exposed healthy 4
1 NotExp diseased 10 1 NotExp healthy 21
2 Exposed diseased 20 2 Exposed healthy 57
2 NotExp diseased 10 2 NotExp healthy 17
;run;
proc freq data = spots2;
weight count;
tables exposure*health/norow nocol chisq cmh;
run;
This is the crude
odds ratio.
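As a hand check on the crude odds ratio, the sketch below collapses the spots2 counts over school and computes the odds ratio with a Wald (log-based) 95% confidence interval. The function name and the use of Python are assumptions made for illustration; the slides themselves rely only on proc freq, whose printed limits may differ slightly in the last digits.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence limits for a 2x2 table.

    a = exposed diseased,   b = exposed healthy,
    c = unexposed diseased, d = unexposed healthy.
    """
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se_log_or)
    upper = math.exp(math.log(or_) + z * se_log_or)
    return or_, lower, upper

# Counts from the spots2 datalines, summed over the two schools.
a = 38 + 20   # Exposed, diseased
b = 4 + 57    # Exposed, healthy
c = 10 + 10   # NotExp, diseased
d = 21 + 17   # NotExp, healthy

crude_or, crude_lo, crude_hi = odds_ratio_wald(a, b, c, d)
print(f"crude OR = {crude_or:.2f} (95% CI {crude_lo:.2f} to {crude_hi:.2f})")
```

Ignoring school, the interval straddles 1, so the crude table alone would not flag an association.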
2x2x2 Using Strata Tables
Results
proc freq data = spots2;
weight count;
tables school*exposure*health
/norow nocol chisq cmh measures;
run;
This is the adjusted
odds ratio.
All is not well!
Do not use the
summary table.
(3)
Simpson’s Paradox
• It is possible to have significant (but
opposed) effects in the levels of the
covariate, and the overall CMH statistic
will indicate NO effects.
• The moral is to always look at your partial
tables.
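To see why the partial tables matter for the spots2 data, the following sketch computes the odds ratio within each school and the Mantel-Haenszel pooled odds ratio by hand. It is an illustration in Python with made-up helper names, not output from proc freq.

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel common odds ratio over a list of 2x2 tables.

    Each table is (a, b, c, d) with
    a = exposed diseased,   b = exposed healthy,
    c = unexposed diseased, d = unexposed healthy.
    """
    tables = list(tables)
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator

# Counts from the spots2 datalines, one 2x2 table per school.
strata = {
    1: (38, 4, 10, 21),
    2: (20, 57, 10, 17),
}

stratum_or = {s: (a * d) / (b * c) for s, (a, b, c, d) in strata.items()}
mh_or = mantel_haenszel_or(strata.values())

for school, value in stratum_or.items():
    print(f"school {school}: OR = {value:.2f}")
print(f"Mantel-Haenszel common OR = {mh_or:.2f}")
```

The odds ratio is far above 1 in school 1 and below 1 in school 2, so a single pooled number hides two effects that point in opposite directions.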
Exact Tests
• By default, SAS gives you approximate tests
and p values for almost all statistics in proc
freq. You can request exact measures.
proc freq data = spots2;
weight count;
exact or;
tables exposure*health/norow nocol chisq cmh;
run;
Exact Tests(2)
• Exact tests take time and computer power
but run them if you can.
Which CMH Summary?
• If you have a 2x2 table, then all of the CMH
values will be the same.
tables treat*response / chisq cmh;
• If you have a 2xN table, then use nonzero
correlation or row mean scores differ.
tables treat*response / cmh cmh2;
• If you have a Nx2 table, then use the nonzero
correlation.
tables treat*response / cmh cmh1;
Test for Trend
• Looking for a dose response in your predictor is
important.
• If you would like to test for an increasing or
decreasing trend in the binomial proportions
across the levels of your ordinal variable, you
can tell SAS to do a Cochran-Armitage test for
trend.
• To do this, just include the keyword trend on
the tables line:
tables expLevel*hasCancer/cmh chisq
measures trend;
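For readers who want to see what a test for trend computes, here is a minimal sketch of the Cochran-Armitage statistic with equally spaced scores. The counts are hypothetical (they are not from any data set in these slides), and the function name is invented for the example.

```python
import math

def cochran_armitage(events, totals, scores):
    """Cochran-Armitage trend statistic and two-sided p-value.

    events[i] = number with the outcome at exposure level i
    totals[i] = number of subjects at exposure level i
    scores[i] = numeric score assigned to exposure level i
    """
    n = sum(totals)
    p_bar = sum(events) / n
    # Score-weighted difference between observed and expected events.
    t = sum(s * (e - m * p_bar) for s, e, m in zip(scores, events, totals))
    sum_ms = sum(m * s for m, s in zip(totals, scores))
    sum_mss = sum(m * s * s for m, s in zip(totals, scores))
    variance = p_bar * (1 - p_bar) * (sum_mss - sum_ms ** 2 / n)
    z = t / math.sqrt(variance)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# Hypothetical data: cancer cases out of 100 subjects at each of
# three ordered exposure levels.
events = [10, 20, 30]
totals = [100, 100, 100]
scores = [0, 1, 2]

z, p = cochran_armitage(events, totals, scores)
print(f"z = {z:.3f}, two-sided p = {p:.5f}")
```

A positive z means the proportion with the outcome rises as the exposure level rises.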
Beyond Contingency Tables
• SAS provides you powerful ways of
analyzing contingency table data.
◦ Proc freq provides you with all the tools you
need to analyze 2x2 tables.
◦ Proc freq becomes more and more awkward
as your table sizes increase.
◦ Instead, you will use multiple/redundant
modeling techniques.
Predicting Outcomes
• In other disciplines where outcomes are
not dichotomous (e.g., alive or dead) or
ordinal (e.g., high, medium or low risk),
predictions are regularly done using linear
regression techniques.
◦ Outcome = base level + some relationship of
the predictors to the outcome.
Problems with Regression
• Ordinary (least squares) linear regression
is not well suited to predict a binary
outcome, frequency counts or percentiles.
◦ values outside of the possible range
◦ non-integer values
◦ issues with variance
• Instead, epidemiologists typically use two
other types of regression techniques.
◦ Logistic or Poisson
When to Use
Logistic Regression
• You use LR when you want to predict a binary
outcome, say diseased vs. not diseased, and you
know that you have numeric covariates
(confounding variables) that you want to account
for.
• It is analogous to ANCOVA for continuous
outcomes.
• You choose one outcome and call it the ‘event.’
◦ Most people have a variable for each ‘bad thing’ in
their data sets and code the event as a 1.
Age and Wisdom (1)
Continuous Outcome
• Let’s say you have a complex measure of
‘wisdom’ and you want to predict it with age.
[Figure: Plot of Age and Wisdom. Observed wisdom (0 to 100) plotted against age (0 to 100).]
Age and Wisdom (2)
Continuous Prediction
• Conceptually, you can see that a line
predicts this data nicely.
Percent wise = 1.63+age*.96
[Figure: Linear Plot of Age and Wisdom. Observed values and the best linear guess; wisdom (0 to 100) plotted against age (0 to 100).]
Age and Wisdom (3)
Categorical Outcome
• If it is scored as a binary measure, no matter
how well you place a line, your predictions
are going to be way off.
[Figure: Plot of Age and Wisdom. Observed wisdom scored as 0 or 1, plotted against age (0 to 100).]
Age and Wisdom (4)
Categorical Prediction
• Ideally, you want some function that is
close to a step function.
[Figure: Plot of Age and Wisdom. Observed wisdom (0 or 1) and the idealized best guess, plotted against age (0 to 100).]
Logistic Fit
• With logistic regression you get the probability of
going into the event group (which is the wise group
in this case) expressed in terms of odds.
◦ Complete separation of groups is actually a
problem. More on that later.
[Figure: Percent wise and idealized percent wise (0% to 100%) plotted against age (0 to 100).]
Odds and Probabilities
• I have a hard time thinking
in terms of odds.
Fortunately, it is easy to
convert back and forth
between probabilities and
odds.
prob = odds/(odds+1);
odds = prob/(1-prob);
Probability of an event   Odds of an event
0.10                      0.11
0.20                      0.25
0.25                      0.33
0.30                      0.43
0.40                      0.67
0.50                      1.00
0.60                      1.50
0.70                      2.33
0.75                      3.00
0.80                      4.00
0.90                      9.00
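The two conversion formulas are easy to check by machine. This short sketch (function names are my own) applies them to the probabilities listed on the slide and prints each probability next to its odds.

```python
def prob_to_odds(prob):
    """odds = prob / (1 - prob); prob must be less than 1."""
    return prob / (1 - prob)

def odds_to_prob(odds):
    """prob = odds / (odds + 1)."""
    return odds / (odds + 1)

probabilities = [0.10, 0.20, 0.25, 0.30, 0.40, 0.50,
                 0.60, 0.70, 0.75, 0.80, 0.90]

print("Probability  Odds")
for prob in probabilities:
    print(f"{prob:11.2f}  {prob_to_odds(prob):4.2f}")
```

Note that the odds are not symmetric around 1: a probability of 0.10 gives odds of about 0.11, while 0.90 gives odds of 9.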
Why Odds Anyway?
• Odds are used to counteract the fact that
linear regression produces probability
values outside the range of 0 and 1. Converting
a probability to odds removes the upper bound
of 1 (odds run from 0 to infinity). Taking the
natural log of the odds removes the lower bound
of 0, so any value the regression produces maps
back to a valid probability.
Why Odds Anyway? (2)
• So whereas from ordinary linear regression
you get:
Probability =
baseline+(predictor*weight value)
wise=1.63+age*.96
• In logistic regression you calculate:
LN(probability/(1-probability))=
baseline+(predictor*weight value)
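A small sketch can show why modeling the log odds keeps predictions inside 0 and 1. The baseline and weight below are hypothetical numbers picked only so the curve crosses 50% at age 50; they are not estimates from any data in these slides.

```python
import math

def logit(prob):
    """Natural log of the odds: LN(prob / (1 - prob))."""
    return math.log(prob / (1 - prob))

def inv_logit(value):
    """Turn a log-odds value back into a probability."""
    return 1 / (1 + math.exp(-value))

# Hypothetical coefficients, for illustration only.
baseline = -5.0
weight = 0.1

for age in (0, 25, 50, 75, 100):
    log_odds = baseline + age * weight
    print(f"age {age:3d}: log odds = {log_odds:5.2f}, "
          f"probability wise = {inv_logit(log_odds):.3f}")
```

However large or small the right-hand side becomes, inv_logit returns a value strictly between 0 and 1 for inputs in this range, which is the property a straight line fitted to probabilities lacks.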
What Values Do You Want?
• With LS regression you get beta weights
(parameter estimates) that tell you how much
the outcome changes with each unit of the
predictor.
◦ wise=1.63+age*.96
Every unit of age increases your wisdom by about 1.
• With LR your parameter estimates are in log
odds terms which no one can understand, but if
you exponentiate them (raise e to the power of
each value), then the values make some sense.
◦ odds wise = e^(baseline + age*value) = e^baseline * (e^value)^age
Every unit of age multiplies your odds of being wise by e^value.
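The multiplicative reading of an exponentiated estimate can be confirmed numerically. In this sketch the baseline and value are hypothetical, chosen only to make the arithmetic visible.

```python
import math

# Hypothetical log-odds coefficients, for illustration only.
baseline = -5.0
value = 0.1

def odds_wise(age):
    """Predicted odds of being wise at a given age."""
    return math.exp(baseline + age * value)

# Ratio of odds for ages one year apart, and ten years apart.
per_year = odds_wise(41) / odds_wise(40)
per_decade = odds_wise(50) / odds_wise(40)

print(f"e^value        = {math.exp(value):.4f}")
print(f"odds(41)/odds(40) = {per_year:.4f}")
print(f"odds(50)/odds(40) = {per_decade:.4f}")
```

The one-year ratio equals e^value regardless of the starting age, and a ten-year change multiplies the odds by e^(10*value), not by 10 times e^value.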
Enough! How Do I Do It?
• SAS provides you with five procedures
that all do logistic regression.
◦ logistic – quick and friendly
◦ genmod – much more powerful
◦ probit – this is the only time I’ll mention it…
◦ catmod – more than binary outcomes
◦ phreg – conditional logistic for matched case-control data
Fitting a Model
• Fitting a logistic model is easy with the logistic
procedure. But there is one trick. For some
(stupid) reason SAS wants to predict group
membership into the lowest category (i.e., it
wants events to be 0 and non-events to be 1).
Typically people use the descending (abbreviated
desc) option to make SAS call the events “1” and
non-events “0.”
proc logistic data = blah descending;
model outcome = predict1 predict2;
run;
A Real Example
• The goal here is to predict who would get severe
eclampsia using two of the mothers’ blood
chemistries.
• The primary hypothesis for the study says that
these two factors are related to eclampsia. Later
I will show you how to choose a good set of
predictors from a large set.
proc logistic data = ana_temp desc;
model severe_pre=dsl_igf dsl_insuli;
run;
Notice the abbreviation of descending.
A Real Example (2)
• Logistic regression uses a mathematical
technique called maximum likelihood estimation,
which is not guaranteed to produce a result.
Rather, it tries to converge on a valid solution
through successive approximations. If it fails to
converge on an answer, you have a problem
that statisticians like to call infinite parameters.
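The idea of successive approximation, and of failing to converge, can be illustrated with a toy one-predictor fit. This is a bare-bones Newton-Raphson sketch on made-up age and wisdom data, written for illustration; it is not a reproduction of what proc logistic does internally, and all names in it are invented.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(x, y, max_iter=25, tol=1e-8):
    """Newton-Raphson fit of logit(p) = b0 + b1*x.

    Returns (b0, b1, converged).
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log likelihood
            g1 += (yi - p) * xi
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:             # information matrix has collapsed
            return b0, b1, False
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            return b0, b1, True
    return b0, b1, False

ages = [20, 30, 40, 50, 60, 70, 80]
wise_overlap = [0, 0, 1, 0, 1, 1, 1]     # groups overlap in age
wise_separated = [0, 0, 0, 0, 1, 1, 1]   # age splits groups perfectly

fit_overlap = fit_logistic(ages, wise_overlap)
fit_separated = fit_logistic(ages, wise_separated)
print("overlapping groups:", fit_overlap)
print("separated groups:  ", fit_separated)
```

With overlapping groups the steps shrink and the fit settles. With perfectly separated groups the slope keeps growing with every step, which is the "infinite parameters" situation described above.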
A Real Example (3)
For now, only pay attention to these two sections.
Verify that your cases are listed first by looking at the frequency.
Check the convergence.
A Real Example (4)
This tests whether the model is any good at all. You want to reject the hypothesis of a worthless model.
This tells you about the value of the predictors.
The “point estimate” is e^estimate. It tells you the impact on the predicted odds based on a one unit increase in the predictor.
Notice that neither is a statistically significant predictor.
Beta = 0 Statistics
• These statistics test the hypothesis that none of
your predictors is any good (all betas are 0).
They are all asymptotically equivalent. If they are
wildly different, like this example, you
probably have power problems.
◦ The Likelihood Ratio statistic (AKA: –2 Log L)
is preferable for smaller samples.
• They usually do not differ.
Proc Logistic Improved
• Students don’t like specifying descending
because it is confusing. In modern versions of
proc logistic you can specify the event explicitly.
• model cancer = pack; strata center;
proc logistic data=ana_temp;
model severe_pre (event = "Sick") = dsl_igf
dsl_insuli/plcl plrl;
units dsl_igf = 10;
run;
Enterprise
Guide
Categorical Predictors
• You interpret the exponentiated parameter
estimates as the change in odds of an event
associated with a one unit increase in the
predictor. What happens when you have a
categorical predictor?
• You want to have a model that tells you the
change in your odds of an event when you are
in a group relative to a referent group.
Categorical Predictors
• You can get SAS to give you the odds of an
event for a given category relative to a
referent group. Say you have packs of
cigarettes smoked per day as a variable
called “packs” with the values: none, half,
full, many.
proc logistic data = lung;
class packs (ref="none")/ param = ref;
model cancer (event = "Sick") = packs;
run;
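Reference coding (what param = ref asks for) turns a categorical predictor into one 0/1 indicator per non-referent level. The sketch below shows the idea for the packs example; the function is a hypothetical illustration, not SAS's internal implementation.

```python
LEVELS = ["none", "half", "full", "many"]

def ref_code(level, levels=LEVELS, ref="none"):
    """Return 0/1 indicators for every level except the referent.

    The referent level is coded as all zeros, so each fitted
    coefficient compares one level against the referent.
    """
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    return [1 if level == other else 0 for other in levels if other != ref]

for level in LEVELS:
    print(f"{level:5s} -> {ref_code(level)}")
```

With this coding, exponentiating the coefficient for "half" gives the odds of the event for half-pack smokers relative to the "none" group, and likewise for "full" and "many".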