Categorical data analysis


Introduction to Categorical Data Analysis
July 22, 2004
Categorical data

- The t-test, ANOVA, and linear regression all assumed outcome variables that were continuous (normally distributed).
- Even their non-parametric equivalents assumed at least many levels of the outcome (discrete quantitative or ordinal).
- We haven’t discussed the case where the outcome variable is categorical.
Types of Variables: a taxonomy

Each type adds a property to the one before it:
- Categorical
  - binary: 2 categories
  - nominal: + more categories
  - ordinal: + order matters
- Quantitative
  - discrete: + numerical
  - continuous: + uninterrupted

(All types except continuous are discrete random variables.)
Overview of statistical tests

Independent variable = predictor; dependent variable = outcome.

e.g., BMD = pounds + age + amenorrheic (1/0)
- BMD: continuous outcome
- pounds, age: continuous predictors
- amenorrheic (1/0): binary predictor
Types of variables to be analyzed

Predictor (independent)   Outcome (dependent)   Statistical procedure
variable(s)               variable              or measure of association
------------------------  --------------------  ------------------------------------------
Categorical               Continuous            ANOVA
Dichotomous               Continuous            T-test
Continuous                Continuous            Simple linear regression
Multivariate              Continuous            Multiple linear regression
Categorical               Categorical           Chi-square test
Dichotomous               Dichotomous           Odds ratio, Mantel-Haenszel OR,
                                                relative risk, difference in proportions
Multivariate              Dichotomous           Logistic regression
Categorical               Time-to-event         Kaplan-Meier curve / log-rank test
Multivariate              Time-to-event         Cox proportional hazards model
(The first four rows, with continuous outcomes, are done; the categorical-outcome rows are the topic of today and next week; the time-to-event rows come in the last part of the course.)
Difference in proportions

Example: You poll 50 people from random districts in Florida as they exit the polls on election day 2004, and another 50 people from random districts in Massachusetts. 49% of those polled in Florida say they voted for Kerry; 53% of those polled in Massachusetts say they voted for Kerry. Is there enough evidence to reject the null hypothesis that the two states voted for Kerry in equal proportions?
Null distribution of a difference in proportions

Standard error of a proportion:

$$SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$$

The standard error can be estimated by (the proportion is still normally distributed):

$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Standard error of the difference of two proportions:

$$\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \quad \text{or} \quad \sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}}, \quad \text{where } p = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$$

The variance of a difference is the sum of the variances (as with a difference in means). The pooled proportion p is analogous to the pooled variance in the t-test.
Null distribution of a difference in proportions

Difference of proportions:

$$\hat{p}_1 - \hat{p}_2 \sim N\!\left(p_1 - p_2,\ \sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}}\right)$$

For our example, the null distribution is:

$$\hat{p}_1 - \hat{p}_2 \sim N\!\left(0,\ \sqrt{2 \times \frac{.51(1-.51)}{50}} = .10\right)$$
Answer to Example
- We saw a difference of 4% between Florida and Massachusetts.
- The null distribution predicts chance variation between the two states of 10%.
- P(our data | null distribution) = P(Z > .04/.10 = .4) > .05
- Not enough evidence to reject the null.
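The same calculation can be scripted in a SAS data step; this is a minimal sketch (the dataset and variable names are illustrative), using PROBNORM, SAS's standard normal CDF:

data poll_test;
  p1 = 0.49; n1 = 50;   * Florida: sample proportion and sample size;
  p2 = 0.53; n2 = 50;   * Massachusetts: sample proportion and sample size;
  p = (n1*p1 + n2*p2) / (n1 + n2);      * pooled proportion under the null;
  se = sqrt(p*(1-p)/n1 + p*(1-p)/n2);   * null standard error of the difference;
  z = (p1 - p2) / se;
  p_two_sided = 2*(1 - probnorm(abs(z)));
run;

proc print data=poll_test; run;

This gives z = -0.4 and a two-sided p of about .69, matching the hand calculation above.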
Chi-square test: for comparing proportions (of a categorical variable) between groups

I. Chi-Square Test of Independence
When both your predictor and outcome variables are categorical, they may be cross-classified in a contingency table and compared using a chi-square test of independence. A contingency table with R rows and C columns is an R x C contingency table.
Example

Asch, S.E. (1955). Opinions and social pressure. Scientific American, 193, 31-35.
The Experiment
- A Subject volunteers to participate in a “visual perception study.”
- Everyone else in the room is actually a conspirator in the study (unbeknownst to the Subject).
- The “experimenter” reveals a pair of cards…
The Task Cards
[Figure: one card shows the standard line; the other shows comparison lines A, B, and C.]
The Experiment
- Everyone goes around the room and says which comparison line (A, B, or C) is correct; the true Subject always answers last, after hearing all the others’ answers.
- The first few times, the 7 “conspirators” give the correct answer.
- Then they start purposely giving the (obviously) wrong answer.
- 75% of Subjects tested went along with the group’s consensus at least once.
Further Results
- In a further experiment, group size (number of conspirators) was varied from 2 to 10.
- Does the group size alter the proportion of subjects who conform?
The Chi-Square test

                Number of group members
Conformed?     2     4     6     8    10
Yes           20    50    75    60    30
No            80    50    25    40    70
Apparently, conformity is less likely with fewer or with more group members, peaking at 6…

20 + 50 + 75 + 60 + 30 = 235 conformed, out of 500 experiments.
Overall likelihood of conforming = 235/500 = .47
Expected frequencies if no association between group size and conformity…

                Number of group members
Conformed?     2     4     6     8    10
Yes           47    47    47    47    47
No            53    53    53    53    53

Do observed and expected differ more than expected due to chance?
Chi-Square test

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

$$\chi^2_4 = \frac{(20-47)^2}{47} + \frac{(50-47)^2}{47} + \frac{(75-47)^2}{47} + \frac{(60-47)^2}{47} + \frac{(30-47)^2}{47} + \frac{(80-53)^2}{53} + \frac{(50-53)^2}{53} + \frac{(25-53)^2}{53} + \frac{(40-53)^2}{53} + \frac{(70-53)^2}{53} \approx 79.5$$

Degrees of freedom = (rows - 1)*(columns - 1) = (2-1)*(5-1) = 4

Rule of thumb: if the chi-square statistic is much greater than its degrees of freedom, this indicates statistical significance. Here 79.5 >> 4.
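As a sketch, the same test can be run in SAS with PROC FREQ (the dataset and variable names are illustrative); CHISQ requests the Pearson chi-square and EXPECTED prints the expected cell counts:

data asch;
  input size conformed $ count;   * group size, conformed (yes/no), cell count;
  datalines;
2 yes 20
2 no 80
4 yes 50
4 no 50
6 yes 75
6 no 25
8 yes 60
8 no 40
10 yes 30
10 no 70
;
run;

proc freq data=asch;
  tables conformed*size / chisq expected;
  weight count;   * each record is a cell count, not an individual subject;
run;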
The Chi-Square distribution: the sum of squared normal deviates

$$\chi^2_{df} = \sum_{i=1}^{df} Z_i^2, \quad \text{where } Z \sim \text{Normal}(0,1)$$

The expected value and variance of a chi-square: E(x) = df; Var(x) = 2(df).
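The exact upper-tail p-value for the statistic above can be checked against the chi-square CDF; a minimal SAS sketch (names illustrative), using PROBCHI, SAS's chi-square CDF function:

data chisq_p;
  x = 79.5; df = 4;          * statistic and df from the conformity example;
  p = 1 - probchi(x, df);    * upper-tail p-value; here p < .0001;
run;

proc print data=chisq_p; run;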
Caveat
**When the expected count in any cell is very small (<5), Fisher’s exact test is used as an alternative to the chi-square test.
Example of Fisher’s Exact Test

Fisher’s “Tea-tasting experiment”

Claim: Fisher’s colleague (call her “Cathy”) claimed that, when drinking tea, she could distinguish whether milk or tea was added to the cup first. To test her claim, Fisher designed an experiment in which she tasted 8 cups of tea (4 cups had milk poured first, 4 had tea poured first).

Null hypothesis: Cathy’s guessing abilities are no better than chance.
Alternative hypotheses:
- Right-tail: She guesses right more than expected by chance.
- Left-tail: She guesses wrong more than expected by chance.
Fisher’s “Tea-tasting experiment”

Experimental Results:

                     Guessed milk first   Guessed tea first   Total
Milk poured first            3                    1             4
Tea poured first             1                    3             4
Fisher’s Exact Test

Step 1: Identify tables that are as extreme or more extreme than what actually happened. Here she identified 3 out of 4 of the milk-poured-first teas correctly. Is that good luck or real talent? The only way she could have done better is if she had identified 4 of 4 correctly.

Observed table:
                     Guessed milk first   Guessed tea first   Total
Milk poured first            3                    1             4
Tea poured first             1                    3             4

More extreme table:
                     Guessed milk first   Guessed tea first   Total
Milk poured first            4                    0             4
Tea poured first             0                    4             4
Fisher’s Exact Test

Step 2: Calculate the probability of the tables (assuming fixed marginals).

For the observed table (3 of the 4 milk-first cups identified correctly):

$$P(3) = \frac{\binom{4}{3}\binom{4}{1}}{\binom{8}{4}} = \frac{16}{70} \approx .229$$

For the more extreme table (4 of 4 identified correctly):

$$P(4) = \frac{\binom{4}{4}\binom{4}{0}}{\binom{8}{4}} = \frac{1}{70} \approx .014$$
Step 3: To get the left-tail and right-tail p-values, consider the probability mass function of X, where X = the number of correct identifications of the cups with milk poured first:

$$P(0) = \frac{\binom{4}{0}\binom{4}{4}}{\binom{8}{4}} = .014 \qquad
P(1) = \frac{\binom{4}{1}\binom{4}{3}}{\binom{8}{4}} = .229 \qquad
P(2) = \frac{\binom{4}{2}\binom{4}{2}}{\binom{8}{4}} = .514$$

$$P(3) = \frac{\binom{4}{3}\binom{4}{1}}{\binom{8}{4}} = .229 \qquad
P(4) = \frac{\binom{4}{4}\binom{4}{0}}{\binom{8}{4}} = .014$$

“Right-hand tail probability”: p = P(3) + P(4) = .243
“Left-hand tail probability” (testing the null hypothesis that she’s systematically wrong): p = P(0) + P(1) + P(2) + P(3) = .986
SAS code and output for generating Fisher’s Exact statistics for a 2x2 table

                     Guessed milk first   Guessed tea first   Total
Milk poured first            3                    1             4
Tea poured first             1                    3             4

data tea;
  input MilkFirst GuessedMilk Freq;   * Freq holds the cell counts;
  datalines;
1 1 3
1 0 1
0 1 1
0 0 3
;
run;

data tea;   *Fix quirky reversal of SAS 2x2 tables;
  set tea;
  MilkFirst = 1 - MilkFirst;
  GuessedMilk = 1 - GuessedMilk;
run;

proc freq data=tea;
  tables MilkFirst*GuessedMilk / exact;
  weight Freq;
run;
SAS output

Statistics for Table of MilkFirst by GuessedMilk

Statistic                       DF    Value     Prob
------------------------------------------------------
Chi-Square                       1    2.0000    0.1573
Likelihood Ratio Chi-Square      1    2.0930    0.1480
Continuity Adj. Chi-Square       1    0.5000    0.4795
Mantel-Haenszel Chi-Square       1    1.7500    0.1859
Phi Coefficient                       0.5000
Contingency Coefficient               0.4472
Cramer's V                            0.5000

WARNING: 100% of the cells have expected counts less than 5. Chi-Square may not be a valid test.

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)         3
Left-sided Pr <= F          0.9857
Right-sided Pr >= F         0.2429
Table Probability (P)       0.2286
Two-sided Pr <= P           0.4857

Sample Size = 8
Introduction to the 2x2 Table

                   Exposure (E)   No Exposure (~E)
Disease (D)             a                b           a+b = P(D)
No Disease (~D)         c                d           c+d = P(~D)
                   a+c = P(E)     b+d = P(~E)

Column totals give the marginal probability of exposure; row totals give the marginal probability of disease.
Cohort Studies
[Diagram: from the target population, a disease-free cohort is sampled and classified as exposed or not exposed; each group is then followed over time to see who develops disease and who remains disease-free.]
The Risk Ratio, or Relative Risk (RR)

                   Exposure (E)   No Exposure (~E)
Disease (D)             a                b
No Disease (~D)         c                d
Total                  a+c              b+d

$$RR = \frac{P(D \mid E)}{P(D \mid \sim\!E)} = \frac{a/(a+c)}{b/(b+d)} = \frac{\text{risk to the exposed}}{\text{risk to the unexposed}}$$
Hypothetical Data

                  High Systolic BP   Normal BP
Congestive
Heart Failure           400             400
No CHF                 1100            2600
Total                  1500            3000

$$RR = \frac{400/1500}{400/3000} = 2.0$$
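A sketch of the same estimate in SAS (dataset and variable names illustrative): the RELRISK option on a 2x2 table prints risk-ratio and odds-ratio estimates with confidence intervals, and ORDER=DATA keeps the exposed row and the diseased column first so the Row1/Row2 ratio lines up with the hand calculation:

data chf;
  input highbp chf count;   * 1 = high systolic BP / CHF, 0 = otherwise;
  datalines;
1 1 400
1 0 1100
0 1 400
0 0 2600
;
run;

proc freq data=chf order=data;
  tables highbp*chf / relrisk;
  weight count;
run;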
Case-Control Studies
- Sample on disease status and ask retrospectively about exposures (for rare diseases).
- Marginal probabilities of exposure for cases and controls are valid.
  • Doesn’t require knowledge of the absolute risks of disease
  • For rare diseases, can approximate the relative risk
Case-Control Studies
[Diagram: from the target population, cases (with disease) and controls (without disease) are sampled; each is then classified by whether they were exposed in the past.]
The Odds Ratio (OR)

                   Exposure (E)      No Exposure (~E)
Disease (D)        a = P(D&E)        b = P(D&~E)
No Disease (~D)    c = P(~D&E)       d = P(~D&~E)

$$OR = \frac{a/c}{b/d} = \frac{ad}{bc}$$
The Odds Ratio

$$OR = \frac{P(E \mid D)\,/\,P(\sim\!E \mid D)}{P(E \mid \sim\!D)\,/\,P(\sim\!E \mid \sim\!D)}
     = \frac{P(D \& E)\,/\,P(D \& \sim\!E)}{P(\sim\!D \& E)\,/\,P(\sim\!D \& \sim\!E)}
     = \frac{P(D \mid E)\,/\,P(\sim\!D \mid E)}{P(D \mid \sim\!E)\,/\,P(\sim\!D \mid \sim\!E)}$$

(the middle step is via Bayes’ rule). When the disease is rare, P(~D|E) ≈ 1 and P(~D|~E) ≈ 1, so:

$$OR \approx \frac{P(D \mid E)}{P(D \mid \sim\!E)} = RR$$

“The Rare Disease Assumption”: when disease is rare, P(~D) ≈ 1.
Properties of the OR (simulation)
[Figure: histogram of simulated odds ratios (percent vs. simulated OR, ranging from 0 to 3.5); the sampling distribution of the OR is right-skewed.]
Properties of the lnOR
[Figure: histogram of simulated lnOR values (percent vs. lnOR, ranging from about -1.05 to 1.95); the sampling distribution of the lnOR is approximately normal.]

Standard deviation of the lnOR:

$$\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$
Hypothetical Data

                  Smoker   Non-smoker   Total
Lung Cancer         20         10         30
No lung cancer       6         24         30

$$OR = \frac{(20)(24)}{(6)(10)} = 8.0$$

$$95\%\ \text{CI} = \left( 8.0\,e^{-1.96\sqrt{\frac{1}{20}+\frac{1}{6}+\frac{1}{10}+\frac{1}{24}}},\ \ 8.0\,e^{+1.96\sqrt{\frac{1}{20}+\frac{1}{6}+\frac{1}{10}+\frac{1}{24}}} \right) = (2.47,\ 25.8)$$

Note that the size of the smallest 2x2 cell determines the magnitude of the variance.
Example: Cell phones and brain tumors (cross-sectional data)

                         Brain tumor   No brain tumor   Total
Own a cell phone              5              347         352
Don't own a cell phone        3               88          91
Total                         8              435         453

$$p_{\text{tumor} \mid \text{cell phone}} = \frac{5}{352} = .014; \qquad p_{\text{tumor} \mid \text{no phone}} = \frac{3}{91} = .033; \qquad p = \frac{8}{453} = .018$$

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}}} = \frac{.014 - .033}{\sqrt{\frac{(.018)(.982)}{352} + \frac{(.018)(.982)}{91}}} = \frac{-.019}{.0156} = -1.22$$
Same data, but use the chi-square test or Fisher’s exact test

                         Brain tumor   No brain tumor   Total
Own                           5              347         352
Don't own                     3               88          91
Total                         8              435         453

$$p_{\text{tumor}} = \frac{8}{453} = .018; \qquad p_{\text{cell phone}} = \frac{352}{453} = .777$$

Under independence, the expected proportion in cell a is p_tumor x p_cellphone = .018 x .777 = .014.

Expected count in cell a = .014 x 453 ≈ 6.3; 1.7 in cell c; 345.7 in cell b; 89.3 in cell d.

Degrees of freedom = (R-1)*(C-1) = 1*1 = 1.

$$\chi^2_1 = \frac{(5 - 6.3)^2}{6.3} + \frac{(3 - 1.7)^2}{1.7} + \frac{(347 - 345.7)^2}{345.7} + \frac{(88 - 89.3)^2}{89.3} \approx 1.3$$

Not statistically significant (NS). Note that for a 2x2 table the chi-square statistic equals the squared Z from the difference-in-proportions test: Z² = 1.22² ≈ 1.5, with the small gap here due to rounding of the expected counts.
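Both tests can be requested at once in PROC FREQ; a sketch (names illustrative). For a 2x2 table, the CHISQ option also reports Fisher's exact test, which is the safer choice here given the small expected counts in the tumor column:

data phones;
  input cellphone tumor count;   * 1 = owns a cell phone / has a tumor, 0 = otherwise;
  datalines;
1 1 5
1 0 347
0 1 3
0 0 88
;
run;

proc freq data=phones order=data;
  tables cellphone*tumor / chisq expected;
  weight count;
run;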
Same data, but use the Odds Ratio

                         Brain tumor   No brain tumor   Total
Own a cell phone              5              347         352
Don't own a cell phone        3               88          91
Total                         8              435         453

$$OR = \frac{5 \times 88}{3 \times 347} = .423$$

$$Z = \frac{\ln OR - 0}{\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}} = \frac{\ln(.423)}{\sqrt{\frac{1}{5}+\frac{1}{347}+\frac{1}{3}+\frac{1}{88}}} = \frac{-.86}{.74} = -1.16; \qquad p > .05 \text{ (NS)}$$