Associations Between Categorical Variables

Associations Between Categorical
Variables
• Case where both explanatory (independent)
variable and response (dependent) variable
are qualitative (Chapter 7 includes the case
where both are binary, i.e. have 2 levels)
• Association: The distributions of responses
differ among the levels of the explanatory
variable (e.g. Party affiliation by gender)
Contingency Tables
• Cross-tabulations of frequency counts where the
rows (typically) represent the levels of the
explanatory variable and the columns represent
the levels of the response variable.
• Numbers within the table represent the numbers
of individuals falling in the corresponding
combination of levels of the two variables
• Row and column totals are called the marginal
distributions for the two variables
Example - Cyclones Near Antarctica
• Period of Study: September 1973 - May 1975
• Explanatory Variable: Region (40-49,50-59,60-79)
(Degrees South Latitude)
• Response: Season (Aut(4),Wtr(5),Spr(4),Sum(8))
(Number of months in parentheses)
• Units: Cyclones in the study area
• Treating the observed cyclones as a “random
sample” of all cyclones that could have occurred
Source: Howarth (1983), “An Analysis of the Variability of Cyclones around Antarctica and Their
Relation to Sea-Ice Extent”, Annals of the Association of American Geographers, Vol. 73, pp. 519-537
Example - Cyclones Near Antarctica
Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S             370      452      273      422    1517
50-59S             526      624      513     1059    2722
60-79S             980     1200      995     1751    4926
Total             1876     2276     1781     3232    9165
For each region (row) we can compute the percentage of storms
occurring during each season, the conditional distribution. Of the
1517 cyclones in the 40-49S band, 370 occurred in Autumn, a
proportion of 370/1517 = .244, or 24.4% as a percentage.
Region\Season   Autumn   Winter   Spring   Summer   Total% (n)
40-49S            24.4     29.8     18.0     27.8   100.0 (1517)
50-59S            19.3     22.9     18.9     38.9   100.0 (2722)
60-79S            19.9     24.4     20.2     35.5   100.0 (4926)
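The row percentages above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides; the names `counts` and `row_percentages` are illustrative choices.

```python
# Observed cyclone counts from the slides (rows: region, columns: season).
seasons = ["Autumn", "Winter", "Spring", "Summer"]
counts = {
    "40-49S": [370, 452, 273, 422],
    "50-59S": [526, 624, 513, 1059],
    "60-79S": [980, 1200, 995, 1751],
}


def row_percentages(table):
    """Return each row expressed as percentages of its own row total."""
    result = {}
    for region, row in table.items():
        total = sum(row)
        result[region] = [100.0 * cell / total for cell in row]
    return result


row_pct = row_percentages(counts)
for region, pcts in row_pct.items():
    cells = "  ".join(f"{s}: {p:.1f}%" for s, p in zip(seasons, pcts))
    print(f"{region} (n={sum(counts[region])})  {cells}")
```

Each printed row is one conditional distribution of season given region, so every row sums to 100.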
Example - Cyclones Near Antarctica
[Figure: Graphical Conditional Distributions for Regions. Bar chart of regpct (percentage of a region's cyclones, axis marked from 10 to 40) by season (Autumn, Winter, Spring, Summer), with one bar per region (40-49S, 50-59S, 60-79S).]
Guidelines for Contingency Tables
• Compute percentages for the response (column)
variable within the categories of the explanatory
(row) variable. Note that in journal articles, rows
and columns may be interchanged.
• Divide the cell counts by the row (explanatory
category) total and multiply by 100 to obtain a
percent; the row percents will add to 100
• Give title and clearly define variables and
categories.
• Include row (explanatory) total sample sizes
Independence & Dependence
• Statistically Independent: Population conditional
distributions of one variable are the same across
all levels of the other variable
• Statistically Dependent: Conditional Distributions
are not all equal
• When testing, researchers typically wish to
demonstrate dependence (alternative hypothesis),
and wish to refute independence (null hypothesis)
Pearson’s Chi-Square Test
• Can be used for nominal or ordinal explanatory
and response variables
• Variables can have any number of distinct levels
• Tests whether the distribution of the response
variable is the same for each level of the
explanatory variable (H0: No association between
the variables)
• r = # of levels of explanatory variable
• c = # of levels of response variable
Pearson’s Chi-Square Test
• Intuition behind test statistic
– Obtain marginal distribution of outcomes for
the response variable
– Apply this common distribution to all levels of
the explanatory variable, by multiplying each
proportion by the corresponding sample size
– Measure the difference between actual cell
counts and the expected cell counts in the
previous step
Pearson’s Chi-Square Test
• Notation to obtain test statistic
– Rows represent explanatory variable (r levels)
– Cols represent response variable (c levels)
        1     2     …     c     Total
1       n11   n12   …     n1c   n1.
2       n21   n22   …     n2c   n2.
…       …     …     …     …     …
r       nr1   nr2   …     nrc   nr.
Total   n.1   n.2   …     n.c   n..
Pearson’s Chi-Square Test
• Observed frequency (fo): The number of
individuals falling in a particular cell
• Expected frequency (fe): The number we would
expect in that cell, given the sample sizes
observed in the study and the assumption of
independence.
– Computed by multiplying the row total and the
column total, and dividing by the overall sample
size.
– Applies the overall marginal probability of the
response category to the sample size of explanatory
category
Pearson’s Chi-Square Test
• Large-sample test (all fe > 5)
• H0: Variables are statistically independent
(No association between variables)
• Ha: Variables are statistically dependent
(Association exists between variables)
• Test Statistic: $\chi^2_{obs} = \sum \frac{(f_o - f_e)^2}{f_e}$
(summed over all cells of the table)
• P-value: Area above $\chi^2_{obs}$ in the chi-squared
distribution with (r-1)(c-1) degrees of
freedom. (Critical values in Table 8.5)
Example - Cyclones Near Antarctica
Observed Cell Counts (fo):
Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S             370      452      273      422    1517
50-59S             526      624      513     1059    2722
60-79S             980     1200      995     1751    4926
Total             1876     2276     1781     3232    9165
Note that overall: (1876/9165)100% = 20.47% of all cyclones
occurred in Autumn. If we apply that percentage to the 1517 that
occurred in the 40-49S band, we would expect (0.2047)(1517) = 310.5
to have occurred in the first cell of the table. The full table of fe:
Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S           310.5    376.7    294.8    535.0    1517
50-59S           557.2    676.0    529.0    959.9    2722
60-79S          1008.3   1223.3    957.3   1737.1    4926
Total             1876     2276     1781     3232    9165
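The rule "row total times column total, divided by the overall sample size" can be sketched in Python as follows. The function name `expected_counts` is a hypothetical helper introduced here for illustration.

```python
# Observed counts (rows: 40-49S, 50-59S, 60-79S; columns: Autumn..Summer).
observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print("  ".join(f"{f_e:8.1f}" for f_e in row))
```

Because the expected counts are built from the margins, each row and column of `expected` sums to the same total as the observed table.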
Example - Cyclones Near Antarctica
Computation of $\chi^2_{obs}$:
Region   Season   fo     fe       (fo-fe)^2   ((fo-fe)^2)/fe
40-49S   Autumn   370    310.5    3540.25     11.4017713
40-49S   Winter   452    376.7    5670.09     15.0520042
40-49S   Spring   273    294.8    475.24      1.61207598
40-49S   Summer   422    535.0    12769       23.8672897
50-59S   Autumn   526    557.2    973.44      1.74702082
50-59S   Winter   624    676.0    2704        4
50-59S   Spring   513    529.0    256         0.48393195
50-59S   Summer   1059   959.9    9820.81     10.2310762
60-79S   Autumn   980    1008.3   800.89      0.79429733
60-79S   Winter   1200   1223.3   542.89      0.44379138
60-79S   Spring   995    957.3    1421.29     1.4846861
60-79S   Summer   1751   1737.1   193.21      0.11122561
Total                                         71.2291706
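The same computation can be sketched in Python. Note one assumption: this version uses unrounded expected counts, so the result agrees with the hand calculation to about one decimal place (roughly 71.2) rather than digit for digit. The function name `pearson_chi_square` is illustrative.

```python
# Observed counts (rows: 40-49S, 50-59S, 60-79S; columns: Autumn..Summer).
observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected under H0
            statistic += (f_o - f_e) ** 2 / f_e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


chi2_obs, df = pearson_chi_square(observed)
print(f"chi-square = {chi2_obs:.1f} on {df} degrees of freedom")
```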
Example - Cyclones Near Antarctica
• H0: Seasonal distribution of cyclone occurrences
is independent of latitude band
• Ha: Seasonal distributions of cyclone occurrences
differ among latitude bands
• Test Statistic: $\chi^2_{obs} = 71.2$
• P-value: Area in chi-squared distribution with (3-1)(4-1) = 6 degrees of freedom above 71.2
From Table 8.5, $P(\chi^2 \ge 22.46) = .001$, so P < .001
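The tail area can also be checked without a table. The sketch below relies on a closed form for the chi-squared upper tail that holds only when the degrees of freedom are even (as with df = 6 here); it is a shortcut for this example, not a general-purpose routine, and the function name is hypothetical.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-squared variable with an even number of df.

    Uses exp(-x/2) * sum_{k < df/2} (x/2)^k / k!, valid only for even df.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut requires a positive even df")
    half = x / 2.0
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


df = (3 - 1) * (4 - 1)
p_at_critical = chi2_upper_tail_even_df(22.46, df)  # tabled .001 critical value
p_value = chi2_upper_tail_even_df(71.2, df)         # observed statistic
print(f"P(chi2 > 22.46) = {p_at_critical:.4f}")
print(f"P(chi2 > 71.2)  = {p_value:.2e}")
```

The first value reproduces the tabled .001 level, and the second confirms that the observed statistic lies far beyond it.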
SPSS Output - Cyclone Example
[SPSS output: region by season crosstabulation and chi-square test tables for the cyclone data, with the P-value marked. The table contents did not survive extraction.]
Misuses of chi-squared Test
• Expected frequencies too small (all
expected counts should be above 5; this is not
required of the observed counts)
• Dependent samples (the same individuals
are in each row, see McNemar’s test)
• Can be used for nominal or ordinal
variables, but more powerful methods exist
for when both variables are ordinal and a
directional association is hypothesized
Residual Analysis
• Once dependence has been determined from a chi-squared test, often interested in determining which
cells contributed
• Residual: fo-fe measures the difference between
the observed and expected counts
– Positive implies observed more than expected
– Residual’s practical importance depends on level of fe
• Adjusted Residual (computed for each cell):
$\frac{f_o - f_e}{\sqrt{f_e (1 - \text{row proportion})(1 - \text{column proportion})}}$
Adjusted residuals above 3 in absolute value give strong evidence against independence in
that cell
Example - Cyclones Near Antarctica
Adjusted residuals are computed in the following table.
Row proportion for Region 40-49S: 1517/9165=0.1655
Column Proportion for Season Autumn is: 1876/9165=0.2047
Region   Season   fo     fe       row prop   col prop   adj res
40-49S   Autumn   370    310.5    0.1655     0.2047     4.144837
40-49S   Winter   452    376.7    0.1655     0.2483     4.898484
40-49S   Spring   273    294.8    0.1655     0.1943     -1.54843
40-49S   Summer   422    535      0.1655     0.3526     -6.64664
50-59S   Autumn   526    557.2    0.297      0.2047     -1.76769
50-59S   Winter   624    676      0.297      0.2483     -2.75125
50-59S   Spring   513    529      0.297      0.1943     -0.92433
50-59S   Summer   1059   959.9    0.297      0.3526     4.741291
60-79S   Autumn   980    1008.3   0.5375     0.2047     -1.4695
60-79S   Winter   1200   1223.3   0.5375     0.2483     -1.12983
60-79S   Spring   995    957.3    0.5375     0.1943     1.996065
60-79S   Summer   1751   1737.1   0.5375     0.3526     0.609481
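A short Python sketch of the adjusted-residual calculation follows. It uses unrounded expected counts and proportions, so values may differ from the table in the second or third decimal; `adjusted_residuals` is an illustrative name, not something from the source.

```python
import math

# Observed counts (rows: 40-49S, 50-59S, 60-79S; columns: Autumn..Summer).
observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def adjusted_residuals(table):
    """(fo - fe) / sqrt(fe * (1 - row proportion) * (1 - column proportion))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n
            row_prop = row_totals[i] / n
            col_prop = col_totals[j] / n
            scale = math.sqrt(f_e * (1 - row_prop) * (1 - col_prop))
            out_row.append((f_o - f_e) / scale)
        result.append(out_row)
    return result


adj = adjusted_residuals(observed)
# Count the cells flagged by the "above 3 in absolute value" guideline.
n_large = sum(abs(a) > 3 for row in adj for a in row)
print(f"{n_large} cells have |adjusted residual| > 3")
```

The flagged cells are the Autumn, Winter and Summer cells for 40-49S and the Summer cell for 50-59S, matching the table.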
2x2 Tables
• Each variable has 2 levels
– Explanatory Variable – Groups (Typically
based on demographics, exposure, or treatment)
– Response Variable – Outcome (Typically
presence or absence of a characteristic)
• Measures of association
– Relative Risk (Prospective Studies)
– Odds Ratio (Prospective or Retrospective)
– Absolute Risk (Prospective Studies)
2x2 Tables - Notation
                Outcome Present   Outcome Absent   Group Total
Group 1         n11               n12              n1.
Group 2         n21               n22              n2.
Outcome Total   n.1               n.2              n..
Relative Risk
• Ratio of the probability that the outcome
characteristic is present for one group, relative
to the other
• Sample proportions with characteristic from
groups 1 and 2:
$\hat{\pi}_1 = \frac{n_{11}}{n_{1.}} \qquad \hat{\pi}_2 = \frac{n_{21}}{n_{2.}}$
Relative Risk
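• Estimated Relative Risk: $RR = \frac{\hat{\pi}_1}{\hat{\pi}_2}$
95% Confidence Interval for Population
Relative Risk:
$\left( RR \, e^{-1.96\sqrt{v}} \; , \; RR \, e^{1.96\sqrt{v}} \right)$
where $e \approx 2.71828$ and $v = \frac{1 - \hat{\pi}_1}{n_{11}} + \frac{1 - \hat{\pi}_2}{n_{21}}$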
Relative Risk
• Interpretation
– Conclude that the probability that the outcome
is present is higher (in the population) for group
1 if the entire interval is above 1
– Conclude that the probability that the outcome
is present is lower (in the population) for group
1 if the entire interval is below 1
– Do not conclude that the probability of the
outcome differs for the two groups if the
interval contains 1
Example - Coccidioidomycosis and
TNFa-antagonists
• Research Question: Risk of developing
Coccidioidomycosis associated with arthritis
therapy?
• Groups: Patients receiving tumor necrosis
factor alpha (TNFa) antagonists versus patients not
receiving them (all patients arthritic)
        COC   No COC   Total
TNFa      7      240     247
Other     4      734     738
Total    11      974     985
Source: Bergstrom, et al (2004)
Example - Coccidioidomycosis and
TNFa-antagonists
• Group 1: Patients on TNFa
• Group 2: Patients not on TNFa
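$\hat{\pi}_1 = \frac{7}{247} = .0283 \qquad \hat{\pi}_2 = \frac{4}{738} = .0054$
$RR = \frac{\hat{\pi}_1}{\hat{\pi}_2} = \frac{.0283}{.0054} = 5.24$
$v = \frac{1 - .0283}{7} + \frac{1 - .0054}{4} = .3874$
95% CI: $\left( 5.24 \, e^{-1.96\sqrt{.3874}} \; , \; 5.24 \, e^{1.96\sqrt{.3874}} \right) = (1.55 , 17.76)$
Entire CI above 1, so conclude higher risk if on TNFa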
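The relative risk interval can be sketched in Python as below. The function name and argument names are assumptions made for illustration. Because the code carries unrounded proportions, it gives RR of about 5.23 and limits close to, but not identical with, the hand-rounded (1.55, 17.76).

```python
import math


def relative_risk_ci(n11, n1_total, n21, n2_total, z=1.96):
    """Estimated relative risk with a large-sample CI on the log scale."""
    p1 = n11 / n1_total
    p2 = n21 / n2_total
    rr = p1 / p2
    v = (1 - p1) / n11 + (1 - p2) / n21  # variance of log(RR)
    half_width = z * math.sqrt(v)
    return rr, rr * math.exp(-half_width), rr * math.exp(half_width)


# TNFa group: 7 of 247 with COC; other group: 4 of 738 with COC.
rr, rr_lower, rr_upper = relative_risk_ci(7, 247, 4, 738)
print(f"RR = {rr:.2f}, 95% CI: ({rr_lower:.2f}, {rr_upper:.2f})")
```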
Odds Ratio
• Odds of an event is the probability it occurs
divided by the probability it does not occur
• Odds ratio is the odds of the event for group 1
divided by the odds of the event for group 2
• Sample odds of the outcome for each group:
$odds_1 = \frac{n_{11} / n_{1.}}{n_{12} / n_{1.}} = \frac{n_{11}}{n_{12}} \qquad odds_2 = \frac{n_{21}}{n_{22}}$
Odds Ratio
• Estimated Odds Ratio:
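$OR = \frac{odds_1}{odds_2} = \frac{n_{11} / n_{12}}{n_{21} / n_{22}} = \frac{n_{11} n_{22}}{n_{12} n_{21}}$
95% Confidence Interval for
Population Odds Ratio:
$\left( OR \, e^{-1.96\sqrt{v}} \; , \; OR \, e^{1.96\sqrt{v}} \right)$
where $e \approx 2.71828$ and $v = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}$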
Odds Ratio
• Interpretation
– Conclude that the probability that the outcome
is present is higher (in the population) for group
1 if the entire interval is above 1
– Conclude that the probability that the outcome
is present is lower (in the population) for group
1 if the entire interval is below 1
– Do not conclude that the probability of the
outcome differs for the two groups if the
interval contains 1
Example - NSAIDs and GBM
• Case-Control Study (Retrospective)
– Cases: 137 Self-Reporting Patients with Glioblastoma
Multiforme (GBM)
– Controls: 401 Population-Based Individuals matched to
cases wrt demographic factors
                 GBM Present   GBM Absent   Total
NSAID User                32          138     170
NSAID Non-User           105          263     368
Total                    137          401     538
Source: Sivak-Sears, et al
Example - NSAIDs and GBM
$OR = \frac{32(263)}{138(105)} = \frac{8416}{14490} = 0.58$
$v = \frac{1}{32} + \frac{1}{138} + \frac{1}{105} + \frac{1}{263} = 0.0518$
95% CI: $\left( 0.58 \, e^{-1.96\sqrt{0.0518}} \; , \; 0.58 \, e^{1.96\sqrt{0.0518}} \right) = (0.37 , 0.91)$
Interval is entirely below 1; NSAID
use appears to be lower among
cases than controls
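A minimal Python sketch of the odds ratio and its interval, using the four cell counts of the NSAID table. The function name `odds_ratio_ci` is a hypothetical helper, not something named in the source.

```python
import math


def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Estimated odds ratio with a large-sample CI on the log scale."""
    odds_ratio = (n11 * n22) / (n12 * n21)
    v = 1 / n11 + 1 / n12 + 1 / n21 + 1 / n22  # variance of log(OR)
    half_width = z * math.sqrt(v)
    lower = odds_ratio * math.exp(-half_width)
    upper = odds_ratio * math.exp(half_width)
    return odds_ratio, lower, upper


# Rows: NSAID user / non-user; columns: GBM present / absent.
or_hat, or_lower, or_upper = odds_ratio_ci(32, 138, 105, 263)
print(f"OR = {or_hat:.2f}, 95% CI: ({or_lower:.2f}, {or_upper:.2f})")
```

Only the four cell counts enter the calculation, which is why the odds ratio remains usable in a retrospective (case-control) design where the group totals are fixed by the sampling.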
Absolute Risk
• Difference between the proportions with an
outcome characteristic for 2 groups
• Sample proportions with characteristic from
groups 1 and 2:
$\hat{\pi}_1 = \frac{n_{11}}{n_{1.}} \qquad \hat{\pi}_2 = \frac{n_{21}}{n_{2.}}$
Absolute Risk
Estimated Absolute Risk: $AR = \hat{\pi}_1 - \hat{\pi}_2$
95% Confidence Interval for Population
Absolute Risk:
$AR \pm 1.96 \sqrt{ \frac{\hat{\pi}_1 (1 - \hat{\pi}_1)}{n_{1.}} + \frac{\hat{\pi}_2 (1 - \hat{\pi}_2)}{n_{2.}} }$
Absolute Risk
• Interpretation
– Conclude that the probability that the outcome
is present is higher (in the population) for group
1 if the entire interval is positive
– Conclude that the probability that the outcome
is present is lower (in the population) for group
1 if the entire interval is negative
– Do not conclude that the probability of the
outcome differs for the two groups if the
interval contains 0
Example - Coccidioidomycosis and
TNFa-antagonists
• Group 1: Patients on TNFa
• Group 2: Patients not on TNFa
$\hat{\pi}_1 = \frac{7}{247} = .0283 \qquad \hat{\pi}_2 = \frac{4}{738} = .0054$
$AR = \hat{\pi}_1 - \hat{\pi}_2 = .0283 - .0054 = .0229$
95% CI: $.0229 \pm 1.96 \sqrt{ \frac{.0283(.9717)}{247} + \frac{.0054(.9946)}{738} } = .0229 \pm .0213 = (0.0016 , 0.0442)$
Interval is entirely positive, TNFa is associated
with higher risk
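The absolute risk interval can be checked with the sketch below; the function name and arguments are illustrative assumptions. Small differences from the hand calculation come from rounding the proportions to four decimals on the slide.

```python
import math


def absolute_risk_ci(n11, n1_total, n21, n2_total, z=1.96):
    """Difference in sample proportions with a large-sample 95% CI."""
    p1 = n11 / n1_total
    p2 = n21 / n2_total
    ar = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1_total + p2 * (1 - p2) / n2_total)
    return ar, ar - z * se, ar + z * se


# TNFa group: 7 of 247 with COC; other group: 4 of 738 with COC.
ar, ar_lower, ar_upper = absolute_risk_ci(7, 247, 4, 738)
print(f"AR = {ar:.4f}, 95% CI: ({ar_lower:.4f}, {ar_upper:.4f})")
```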
Ordinal Explanatory and Response
Variables
• Pearson’s Chi-square test can be used to test
associations among ordinal variables, but more
powerful methods exist
• When theories exist that the association is
directional (positive or negative), measures exist
to describe and test for these specific alternatives
from independence:
– Gamma
– Kendall’s tau-b ($\tau_b$)
Concordant and Discordant Pairs
• Concordant Pairs - Pairs of individuals where one
individual scores “higher” on both ordered variables
than the other individual
• Discordant Pairs - Pairs of individuals where one
individual scores “higher” on one ordered variable
and the other individual scores “higher” on the other
• C = # Concordant Pairs D = # Discordant Pairs
– Under Positive association, expect C > D
– Under Negative association, expect C < D
– Under No association, expect C ≈ D
Example - Alcohol Use and Sick Days
• Alcohol Risk (Without Risk, Hardly any Risk,
Some to Considerable Risk)
• Sick Days (0, 1-6, 7)
• Concordant Pairs - Pairs of respondents where one
scores higher on both alcohol risk and sick days
than the other
• Discordant Pairs - Pairs of respondents where one
scores higher on alcohol risk and the other scores
higher on sick days
Source: Hermansson, et al
(2003)
Example - Alcohol Use and Sick Days
Alcohol Risk \ Sick Days       0    1-6     7   Total
Without Risk                 347    113   145     605
Hardly any Risk              154     63    56     273
Some to Considerable Risk     52     25    34     111
Total                        553    201   235     989
• Concordant Pairs: Each individual in a
given cell is concordant with each individual
in cells “Southeast” of theirs
• Discordant Pairs: Each individual in a given
cell is discordant with each individual in
cells “Southwest” of theirs
Example - Alcohol Use and Sick Days
C = 347(63 + 56 + 25 + 34) + 113(56 + 34) + 154(25 + 34) + 63(34) = 83164
D = 145(154 + 63 + 52 + 25) + 113(154 + 52) + 56(52 + 25) + 63(52) = 73496
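The "Southeast" and "Southwest" counting rule can be sketched in Python. The cell counts are the ones appearing in the C and D sums above; the layout (rows as alcohol risk from lowest to highest, columns as sick days from fewest to most) is an assumption inferred from those sums, and the function name is illustrative.

```python
# Cell counts taken from the C and D sums (assumed layout: rows are alcohol
# risk from lowest to highest, columns are sick days from fewest to most).
table = [
    [347, 113, 145],
    [154, 63, 56],
    [52, 25, 34],
]


def concordant_discordant(cells):
    """Count concordant (C) and discordant (D) pairs in an ordered r x c table."""
    n_rows, n_cols = len(cells), len(cells[0])
    concordant = 0
    discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):  # only rows below ("south")
                for m in range(n_cols):
                    if m > j:    # "southeast" cell: concordant pairs
                        concordant += cells[i][j] * cells[k][m]
                    elif m < j:  # "southwest" cell: discordant pairs
                        discordant += cells[i][j] * cells[k][m]
    return concordant, discordant


C, D = concordant_discordant(table)
print(f"C = {C}, D = {D}")
```

Pairs tied on either variable (same row or same column) are counted in neither C nor D.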
Measures of Association
• Goodman and Kruskal’s Gamma:
$\hat{\gamma} = \frac{C - D}{C + D} \qquad -1 \le \hat{\gamma} \le 1$
• Kendall’s tau-b:
$\hat{\tau}_b = \frac{C - D}{0.5 \sqrt{ (n^2 - \sum n_{i.}^2)(n^2 - \sum n_{.j}^2) }}$
When there’s no association between the ordinal variables,
the population based values of these measures are 0.
Statistical software packages provide these tests.
Example - Alcohol Use and Sick Days
$\hat{\gamma} = \frac{C - D}{C + D} = \frac{83164 - 73496}{83164 + 73496} = 0.0617$
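Both measures can be computed from C, D and the table margins. The sketch below takes C and D as given from the earlier slide and assumes the cell layout implied by those sums; `gamma_and_tau_b` is a hypothetical helper name.

```python
import math

# Cell counts implied by the C and D sums (assumed layout: rows are alcohol
# risk levels, columns are sick-day categories, both in increasing order).
table = [
    [347, 113, 145],
    [154, 63, 56],
    [52, 25, 34],
]
C, D = 83164, 73496  # concordant and discordant pairs from the slides


def gamma_and_tau_b(cells, concordant, discordant):
    """Goodman and Kruskal's gamma and Kendall's tau-b for an ordered table."""
    row_totals = [sum(row) for row in cells]
    col_totals = [sum(col) for col in zip(*cells)]
    n = sum(row_totals)
    gamma = (concordant - discordant) / (concordant + discordant)
    row_term = n ** 2 - sum(t ** 2 for t in row_totals)
    col_term = n ** 2 - sum(t ** 2 for t in col_totals)
    tau_b = (concordant - discordant) / (0.5 * math.sqrt(row_term * col_term))
    return gamma, tau_b


gamma_hat, tau_b_hat = gamma_and_tau_b(table, C, D)
print(f"gamma = {gamma_hat:.4f}, tau-b = {tau_b_hat:.4f}")
```

Tau-b is smaller in magnitude than gamma here because its denominator also accounts for pairs tied on one variable, which gamma ignores.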
[SPSS output: symmetric measures table (Kendall’s tau-b and Gamma) for the alcohol use and sick days data. The table contents did not survive extraction.]