Lecture 9 Categorical Data


Contingency Tables: Tests for independence and homogeneity (§10.5)
How to test hypotheses of independence (association) and homogeneity (similarity) for general two-way cross classifications of count data.
Terms:
Contingency Table
Cross-Classification Table
Measure of association
Independence in two-way tables
Chi-Square Test for Independence or Homogeneity
Test of Independence or Association
A university conducted a study of how students evaluate faculty teaching. A random sample of 467 faculty members is selected, and each person is classified according to rank (Instructor, Assistant Professor, Associate Professor, Professor) and teaching evaluation (Above Average, Average, Below Average).
Person   Rank                  Evaluation
1        Professor             Above
2        Instructor            Average
3        Professor             Below
4        Assistant Professor   Average
5        Associate Professor   Average
...      ...                   ...
Each person has two categorical responses.
Data can be formatted into a crosstabulation or contingency table.
                       Rank
Teaching Evaluation    Instructor   Assistant Professor   Associate Professor   Professor
Above Average                  36                    62                    45          50
Average                        48                    50                    35          43
Below Average                  30                    13                    20          35
What are we interested in from this two-way classification table?
                       Rank
Teaching Evaluation    Instructor   Assistant Professor   Associate Professor   Professor     Sum   Relative Frequency
Above Average                  36                    62                    45          50     193                0.413
Average                        48                    50                    35          43     176                0.377
Below Average                  30                    13                    20          35      98                0.210
Sum                           114                   125                   100         128     467                1.000
Relative Frequency          0.244                 0.268                 0.214       0.274   1.000
Is the level of teaching evaluation related to rank?
Are Professors more likely to be judged above average than other ranks?
Ho: Teaching Evaluation and Rank are independent variables.
Two variables that have been categorized in a two-way table are independent
if the probability that a measurement is classified into a given cell of the table is
equal to the probability of being classified into that row times the probability of
being classified into that column. This must be true for all cells of the table.
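To make the definition concrete, here is a minimal Python sketch that compares the observed proportion in one cell with the product of its row and column proportions, using the counts from the faculty table. The function and variable names are illustrative and not part of the lecture.

```python
# Observed counts: rows = evaluation (Above Average, Average, Below Average),
# columns = rank (Instructor, Assistant, Associate, Professor).
counts = [
    [36, 62, 45, 50],
    [48, 50, 35, 43],
    [30, 13, 20, 35],
]
n = sum(sum(row) for row in counts)

def independence_gap(i, j):
    """Return (observed cell proportion, row proportion * column proportion)."""
    row_prop = sum(counts[i]) / n
    col_prop = sum(row[j] for row in counts) / n
    return counts[i][j] / n, row_prop * col_prop

# Cell (Above Average, Professor)
joint, product = independence_gap(0, 3)
print(f"observed proportion = {joint:.4f}, row x column = {product:.4f}")
```

Under independence the two numbers should agree up to sampling variation in every cell; the chi-square test judges whether the discrepancies across all cells are larger than chance would explain.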
                       Rank
Teaching Evaluation    Instructor   Assistant Professor   Associate Professor   Professor     Sum   Relative Frequency
Above Average                 p11                   p12                   p13         p14     193                  p1.
Average                       p21                   p22                   p23         p24     176                  p2.
Below Average                 p31                   p32                   p33         p34      98                  p3.
Sum                           114                   125                   100         128     467                1.000
Relative Frequency            p.1                   p.2                   p.3         p.4   1.000
The row sums are denoted ni. (Sum column) and the column sums n.j (Sum row).
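The independence assumption:
$$p_{ij} = p_{i\cdot}\, p_{\cdot j} \quad \text{for all } i, j$$
Observed: $n_{ij}$, which estimates $n\, p_{ij}$.
Expected, with $p_{i\cdot}$ and $p_{\cdot j}$ estimated by the relative frequencies $n_{i\cdot}/n$ and $n_{\cdot j}/n$:
$$E_{ij} = n\, p_{i\cdot}\, p_{\cdot j} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}$$
Test Statistic:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}$$
df = (r-1)(c-1)
r = #rows = 3, c = #cols = 4, a 3 × 4 table.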
Observed Counts
                       Rank
Teaching Evaluation    Instructor   Assistant Professor   Associate Professor   Professor     Sum   Relative Frequency
Above Average                  36                    62                    45          50     193                0.413
Average                        48                    50                    35          43     176                0.377
Below Average                  30                    13                    20          35      98                0.210
Sum                           114                   125                   100         128     467                1.000
Relative Frequency          0.244                 0.268                 0.214       0.274   1.000
Expected Counts
$$E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}$$
                       Rank
Teaching Evaluation    Instructor   Assistant Professor   Associate Professor   Professor     Sum
Above Average              47.113                51.660                41.328      52.899     193
Average                    42.964                47.109                37.687      48.240     176
Below Average              23.923                26.231                20.985      26.861      98
Sum                           114                   125                   100         128     467
Assumptions: no Eij < 1, and no more than 20% of Eij < 5.
Individual Cell Chi Square Values
Teaching Evaluation    Instructor   Assistant Professor   Associate Professor   Professor
Above Average              2.6215                2.0698                0.3263      0.1589
Average                    0.5904                0.1774                0.1916      0.5692
Below Average              1.5438                6.6740                0.0462      2.4663
$\chi^2 = 2.62 + \cdots + 2.47 = 17.44 > \chi^2_{6,\,0.95} = 12.59 \;\Rightarrow$ Reject Ho
There is evidence of an association between rank and evaluation. Note that we observed fewer Assistant Professors getting below average evaluations (13) than we would expect under independence (26.2). The chi-square value for this cell is 6.67, the largest contribution of any cell.
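The whole calculation (expected counts, cell contributions, test statistic, degrees of freedom, and p-value) can be sketched in a few lines of Python using only the standard library. The function names are illustrative. The p-value helper uses a closed form for the chi-square upper tail that holds only when the degrees of freedom are even, which is the case here (df = 6); a statistics library would normally be used instead.

```python
import math

# Rows: Above Average, Average, Below Average.
# Columns: Instructor, Assistant, Associate, Professor.
observed = [
    [36, 62, 45, 50],
    [48, 50, 35, 43],
    [30, 13, 20, 35],
]

def chi_square_test(table):
    """Pearson chi-square test of independence for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count for cell (i, j): row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Contribution of each cell: (observed - expected)^2 / expected.
    cells = [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    chi2 = sum(sum(row) for row in cells)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected, cells

def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x); valid only for even df."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

chi2, df, expected, cells = chi_square_test(observed)
p_value = chi2_upper_tail_even_df(chi2, df)
print(f"chi-square = {chi2:.3f}, df = {df}, p-value = {p_value:.4f}")
```

The printed statistic and p-value can be checked against the Minitab, SAS, and R output for the same table.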
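Minitab
Input data in this way:
rank   eval   count
1      1      30
1      2      48
1      3      36
2      1      13
2      2      50
2      3      62
3      1      20
3      2      35
3      3      45
4      1      35
4      2      43
4      3      50
Matching the counts to the contingency table shows the coding: rank 1 = Instructor, 2 = Assistant Professor, 3 = Associate Professor, 4 = Professor; eval 1 = Below Average, 2 = Average, 3 = Above Average.
STAT > TABLES > Cross Tabs
Classification Variables: rank eval
Check Chi-square Analysis, and Above and Std. residual
Frequencies are in: count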
Tabulated Statistics: eval, rank
Rows: eval   Columns: rank

            1        2        3        4      All
1          30       13       20       35       98
        23.92    26.23    20.99    26.86    98.00
         1.24    -2.58    -0.22     1.57       --
2          48       50       35       43      176
        42.96    47.11    37.69    48.24   176.00
         0.77     0.42    -0.44    -0.75       --
3          36       62       45       50      193
        47.11    51.66    41.33    52.90   193.00
        -1.62     1.44     0.57    -0.40       --
All       114      125      100      128      467
       114.00   125.00   100.00   128.00   467.00
           --       --       --       --       --

Cell Contents -- Count
                 Exp Freq
                 Std. Resid

Chi-Square = 17.435, DF = 6, P-Value = 0.008

The standardized residuals are the signed square roots of the individual chi-square values: $(n_{ij} - E_{ij})/\sqrt{E_{ij}}$.
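As a rough check on the standardized residuals, the following Python sketch recomputes them from the observed counts and confirms that their squares add up to the chi-square statistic. Rows are ordered as in the contingency table (Above Average, Average, Below Average), which is the reverse of the Minitab eval coding; the variable names are illustrative.

```python
import math

# Rows: Above Average, Average, Below Average.
# Columns: Instructor, Assistant, Associate, Professor.
observed = [
    [36, 62, 45, 50],
    [48, 50, 35, 43],
    [30, 13, 20, 35],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Standardized (Pearson) residual: (observed - expected) / sqrt(expected).
residuals = [
    [(observed[i][j] - row_totals[i] * col_totals[j] / n)
     / math.sqrt(row_totals[i] * col_totals[j] / n)
     for j in range(len(col_totals))]
    for i in range(len(row_totals))
]

# The squared residuals sum to the Pearson chi-square statistic.
chi2_from_residuals = sum(r * r for row in residuals for r in row)

for row in residuals:
    print("  ".join(f"{r:6.2f}" for r in row))
print(f"sum of squared residuals = {chi2_from_residuals:.3f}")
```

The sign shows the direction of the departure from independence: the most negative residual belongs to Assistant Professors with below average evaluations, the cell singled out earlier.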
SAS
options ls=79 ps=40 nocenter;
/* One record per cell: rank, evaluation, and cell count.           */
/* job and rating are read with the default length of 8 characters, */
/* so the output shows truncated labels such as "Instruct".         */
data eval;
input job $ rating $ number;
datalines;
Instructor Above 36
Instructor Average 48
Instructor Below 30
Assistant Above 62
Assistant Average 50
Assistant Below 13
Associate Above 45
Associate Average 35
Associate Below 20
Professor Above 50
Professor Average 43
Professor Below 35
;
run;
/* WEIGHT makes each record count as "number" observations;        */
/* CHISQ requests the chi-square tests and association measures.   */
proc freq data=eval;
weight number;
table job*rating / chisq ;
run;

Table of job by rating

job       rating

Frequency|
Percent  |
Row Pct  |
Col Pct  |Above   |Average |Below   |  Total
---------+--------+--------+--------+
Assistan |     62 |     50 |     13 |    125
         |  13.28 |  10.71 |   2.78 |  26.77
         |  49.60 |  40.00 |  10.40 |
         |  32.12 |  28.41 |  13.27 |
---------+--------+--------+--------+
Associat |     45 |     35 |     20 |    100
         |   9.64 |   7.49 |   4.28 |  21.41
         |  45.00 |  35.00 |  20.00 |
         |  23.32 |  19.89 |  20.41 |
---------+--------+--------+--------+
Instruct |     36 |     48 |     30 |    114
         |   7.71 |  10.28 |   6.42 |  24.41
         |  31.58 |  42.11 |  26.32 |
         |  18.65 |  27.27 |  30.61 |
---------+--------+--------+--------+
Professo |     50 |     43 |     35 |    128
         |  10.71 |   9.21 |   7.49 |  27.41
         |  39.06 |  33.59 |  27.34 |
         |  25.91 |  24.43 |  35.71 |
---------+--------+--------+--------+
Total         193      176       98      467
            41.33    37.69    20.99   100.00
The FREQ Procedure
Statistics for Table of job by rating

Statistic                      DF       Value      Prob
------------------------------------------------------
Chi-Square                      6     17.4354    0.0078
Likelihood Ratio Chi-Square     6     18.7430    0.0046
Mantel-Haenszel Chi-Square      1     10.8814    0.0010
Phi Coefficient                        0.1932
Contingency Coefficient                0.1897
Cramer's V                             0.1366

Sample Size = 467
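The three measures of association at the bottom of the SAS output are simple functions of the Pearson chi-square statistic. The Python sketch below uses the standard textbook definitions (phi = sqrt(chi2/n), contingency coefficient = sqrt(chi2/(chi2 + n)), Cramer's V = sqrt(chi2/(n * min(r-1, c-1)))); the function name is illustrative, and these formulas are stated here as an assumption about what SAS reports rather than taken from the lecture.

```python
import math

def association_measures(chi2, n, n_rows, n_cols):
    """Phi, contingency coefficient, and Cramer's V from a Pearson chi-square."""
    phi = math.sqrt(chi2 / n)
    contingency = math.sqrt(chi2 / (chi2 + n))
    cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return phi, contingency, cramers_v

# Values from the job-by-rating table: chi-square 17.4354, n = 467, 4 x 3 table.
phi, contingency, cramers_v = association_measures(17.4354, 467, 4, 3)
print(f"phi = {phi:.4f}, contingency = {contingency:.4f}, Cramer's V = {cramers_v:.4f}")
```

All three are small here, which is a reminder that a significant chi-square test indicates evidence of association, not necessarily a strong one.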
SPSS
First you need to tell SPSS that each observation must be weighted by the cell count.
DATA > WEIGHT CASES
Then you choose the analysis.
ANALYZE > DESCRIPTIVE STATISTICS > CROSS TABS
R
> score <- c(36,48,30,62,50,13,45,35,20,50,43,35)
> mscore <- matrix(score,3,4)
> mscore
     [,1] [,2] [,3] [,4]
[1,]   36   62   45   50
[2,]   48   50   35   43
[3,]   30   13   20   35
> chisq.test(mscore)

        Pearson's Chi-squared test

data:  mscore
X-squared = 17.4354, df = 6, p-value = 0.00781

> out <- chisq.test(mscore)
> out[1:length(out)]
$statistic
X-squared
 17.43537

$parameter
df
 6

$p.value
[1] 0.00780959

$method
[1] "Pearson's Chi-squared test"

$data.name
[1] "mscore"

$observed
     [,1] [,2] [,3] [,4]
[1,]   36   62   45   50
[2,]   48   50   35   43
[3,]   30   13   20   35

$expected
         [,1]     [,2]     [,3]     [,4]
[1,] 47.11349 51.65953 41.32762 52.89936
[2,] 42.96360 47.10921 37.68737 48.23983
[3,] 23.92291 26.23126 20.98501 26.86081

$residuals
           [,1]       [,2]       [,3]       [,4]
[1,] -1.6191155  1.4386830  0.5712511 -0.3986361
[2,]  0.7683695  0.4211764 -0.4377528 -0.7544218
[3,]  1.2424774 -2.5834003 -0.2150237  1.5704402

The $residuals are the signed square roots of the individual chi-square values: $(n_{ij} - E_{ij})/\sqrt{E_{ij}}$.
Test of Homogeneity
Suppose we wish to determine if there is an association between a
rare disease and another more common categorical variable (e.g.
smoking). We can’t just take a random sample of subjects and hope
to get enough cases (subjects with the disease).
One solution is to choose a fixed number of cases, and a fixed
number of controls, and classify each according to whether they are
smokers or not. The same chi square test of independence applies
here, but since we are sampling within subpopulations (have fixed
margin totals), this is now called a chi square test of homogeneity
(of distributions).
Homogeneity Null Hypothesis
In general, if the column categories represent c distinct subpopulations, random samples of size n1, n2, …, nc are selected from each and classified into the r values of a categorical variable represented by the rows of the contingency table. The hypothesis of interest here is whether there is a difference in the distribution of subpopulation units among the r levels of the categorical variable, i.e. whether the subpopulations are homogeneous or not.

Subpop 1  =  Subpop 2  =  …  =  Subpop c
p11          p12          ...   p1c
p21          p22          ...   p2c
:            :                  :
pr1          pr2          ...   prc

pij = proportion of subpop j subjects (j=1,…,c) that fall in category i (i=1,…,r).
$$\sum_{i=1}^{r} p_{ij} = 1, \quad \text{for each } j = 1, \ldots, c$$
Null hypothesis of homogeneity
$$\begin{pmatrix} p_{11} \\ p_{21} \\ \vdots \\ p_{r1} \end{pmatrix} = \begin{pmatrix} p_{12} \\ p_{22} \\ \vdots \\ p_{r2} \end{pmatrix} = \cdots = \begin{pmatrix} p_{1c} \\ p_{2c} \\ \vdots \\ p_{rc} \end{pmatrix}$$
Example: Myocardial Infarction (MI)
Data were collected to determine if there is an association between myocardial infarction and smoking in women. 262 women suffering from MI were classified according to whether they had ever smoked or not. Two controls (patients with other acute disorders) were matched to every case.

                               Smoked
                               Yes     No    Totals
Myocardial Infarction   Yes    172     90       262
                        No     173    346       519
Totals                         345    436       781

Is the incidence of smoking the same for MI and non-MI sufferers?
Ho: the incidence of MI is homogeneous with respect to smoking
Ho: p11=p12 and p21=p22
Example: MI results in MTB
Stat -> Tables -> Chi-Square Test

Chi-Square Test: MI Yes, MI No
Expected counts are printed below observed counts

         MI Yes    MI No    Total
    1       172      173      345
         115.74   229.26
    2        90      346      436
         146.26   289.74
Total       262      519      781

Chi-Sq = 27.352 + 13.808 + 21.643 + 10.926 = 73.729
DF = 1, P-Value = 0.000

Rows 1 and 2 are the women who had ever smoked and those who had not, respectively.
Conclude: there is evidence of lack of homogeneity of incidence of MI with respect to smoking.
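The Minitab result for the 2 × 2 MI table can be reproduced with a short Python sketch. With one degree of freedom, the chi-square upper tail equals the two-sided normal tail, so the p-value can be obtained from the standard library as erfc(sqrt(chi2/2)); the variable names are illustrative.

```python
import math

# Rows: ever smoked (yes, no). Columns: MI (yes, no), as in the Minitab output.
observed = [
    [172, 173],
    [90, 346],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
chi2 = 0.0
for i, r in enumerate(row_totals):
    for j, c in enumerate(col_totals):
        expected = r * c / n
        chi2 += (observed[i][j] - expected) ** 2 / expected

df = (len(row_totals) - 1) * (len(col_totals) - 1)

# For df = 1 only: P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"n = {n}, chi-square = {chi2:.3f}, df = {df}, p-value = {p_value:.3g}")
```

The p-value is far below any conventional significance level, which is why Minitab displays it as 0.000.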
Odds and Odds Ratios
Sometimes probabilities are expressed as odds, e.g.
• Gambling circles. (Why?)
• Biomedical studies. (Easy interpretation in logistic regression, etc.)
Odds of Event A = P(A) / (1-P(A))
P(A) = Odds of A / (1 + Odds of A)
Ex: A horse has odds of 3 to 2 of winning. This means that in every
3+2=5 races the horse wins 3 and loses 2. So P(Wins) = 3/5.
To use the above formula express the odds as d to 1, so 1.5 to 1 in
this case. Thus
P(Wins) = 1.5 / (1+1.5) = 1.5 / 2.5 = 3/5.
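The two conversion formulas are inverses of each other, which the following small Python sketch illustrates with the horse-racing example. The function names are illustrative.

```python
def prob_to_odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Probability of an event whose odds are 'odds to 1'."""
    return odds / (1 + odds)

# Odds of 3 to 2 are the same as 1.5 to 1.
horse_odds = 3 / 2
p_win = odds_to_prob(horse_odds)
print(f"P(Wins) = {p_win:.2f}")
print(f"odds recovered from P(Wins) = {prob_to_odds(p_win):.2f}")
```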
Example: MI and Odds Ratios
For women sufferers of MI, the proportion who ever smoked is 172/262 = 0.656 (p̂11 = 0.656). In other words, the odds that a woman MI sufferer is a smoker are 0.656/(1-0.656) = 1.91.
For women non-sufferers of MI, the proportion who ever smoked is 173/519 = 0.333 (p̂12 = 0.333). In other words, the odds that a woman non-MI sufferer is a smoker are 0.333/(1-0.333) = 0.50.
We can now calculate the odds ratio of being a smoker for MI sufferers relative to non-sufferers:
OR = 1.91/0.50 = 3.82
The odds of being a smoker among MI sufferers are about 4 times the odds of being a smoker among non-sufferers. Put another way: a randomly selected MI sufferer is about twice as likely (0.656/0.333) to be a smoker as a randomly selected non-sufferer.
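The odds ratio can also be computed directly from the four cell counts, which avoids rounding the intermediate proportions. The Python sketch below does this for the MI table; the variable names are illustrative.

```python
# Counts from the MI table: cases are MI sufferers, controls are non-sufferers.
cases_smoked, cases_never = 172, 90
controls_smoked, controls_never = 173, 346

# Odds of having ever smoked within each group.
odds_cases = cases_smoked / cases_never
odds_controls = controls_smoked / controls_never

# Odds ratio: odds of smoking among cases relative to controls.
odds_ratio = odds_cases / odds_controls

# Ratio of the smoking proportions, for comparison with the odds ratio.
proportion_ratio = (cases_smoked / (cases_smoked + cases_never)) / (
    controls_smoked / (controls_smoked + controls_never)
)

print(f"odds (MI) = {odds_cases:.2f}, odds (non-MI) = {odds_controls:.2f}")
print(f"odds ratio = {odds_ratio:.2f}, proportion ratio = {proportion_ratio:.2f}")
```

The odds ratio (about 3.8) and the ratio of proportions (about 2) answer different questions, so they should not be used interchangeably when describing the result.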