
AS 737
Categorical Data Analysis
For Multivariate
Week 2
The Data (var00002)
Binomial Test
$H_0: p = .70$
$H_1: p > .70$
The Result
Binomial Test
                       Category   N    Observed Prop.   Test Prop.   Exact Sig. (1-tailed)
var00002   Group 1     .00        15   .75              .70          .4163708
           Group 2     1.00       5    .25
           Total                  20   1.00
Using SPSS. But how is it
calculated?
How To Calculate the P-value
Binomial table made in Excel with n=20 and p=0.70:

x     P(x)
0     0.000000
1     0.000000
2     0.000000
3     0.000001
4     0.000005
5     0.000037
6     0.000218
7     0.001018
8     0.003859
9     0.012007
10    0.030817
11    0.065370
12    0.114397
13    0.164262
14    0.191639
15    0.178863
16    0.130421
17    0.071604
18    0.027846
19    0.006839
20    0.000798
$H_0: p = .70$
$H_1: p > .70$
Why? The p-value is the probability of observing what was observed, or something more extreme, under the null hypothesis. In our example X=15, so the p-value equals P(X>=15) = .4163708.
That is, the p-value equals the sum of the probabilities for x = 15 through 20 = .4163708
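The tail sum can also be reproduced outside Excel or SPSS. A minimal Python sketch using only the standard library; the function names `binomial_pmf` and `binomial_upper_tail` are illustrative and not from the slides:

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_upper_tail(x_obs: int, n: int, p: float) -> float:
    """One-tailed p-value P(X >= x_obs) when the success probability is p."""
    return sum(binomial_pmf(x, n, p) for x in range(x_obs, n + 1))


# 15 successes in 20 trials, tested against p = .70
p_value = binomial_upper_tail(15, 20, 0.70)
print(f"P(X >= 15) = {p_value:.7f}")
```

Each term of the sum corresponds to one row of the binomial table, x = 15 through 20.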
Binomial Test
x=15
The analysis treats the “0” category as a success. There are 15 zeros.
Thus again, the p-value equals the sum of the probabilities for 15 through 20: P(X>=15) = .4163708
Sampling
1. Last week we covered the Binomial distribution and the Poisson distribution.
   1. Count data often come from the Binomial/Multinomial or Poisson distribution.
   2. Luckily, whether the data come from a Binomial/Multinomial or a Poisson distribution, most analyses of categorical data are performed in the same manner.
      1. For this reason we will often not discuss which distribution the data came from.
Two-Way Contingency Tables
               Belief in Afterlife
Gender         Yes     No/Not sure     Total
Females        435     147             582
Males          375     134             509
Total          810     281             1091

Cell notation n_rc, e.g. n12 (n, 1st row, 2nd column):

n11     n12     n1+
n21     n22     n2+
n+1     n+2     n
Joint, Marginal and Conditional Probabilities
Let $\pi_{ij} = P(X = i, Y = j)$ denote the probability that (X, Y) falls in the cell in row i and column j.
The probabilities $\pi_{ij}$ form the joint distribution of X and Y, and $\sum_{i,j} \pi_{ij} = 1$.
The marginal distributions are the row and column totals: $\pi_{i+} = \sum_j \pi_{ij}$ and $\pi_{+j} = \sum_i \pi_{ij}$.
With $n = \sum_{i,j} n_{ij}$, the proportions $p_{ij} = n_{ij}/n$ represent the sample joint distribution.
Often it is informative to construct a separate probability distribution for one variable given each level of the other variable: the conditional distribution.
If Y is a response variable and X an explanatory variable, then the conditional distribution of Y | X can be very useful.
For example: using the last table, the sample conditional distribution of belief in afterlife given that gender is female is (435/582, 147/582). A proportion of .747 of women believe there is an afterlife, and .253 don't believe or are not sure.
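The joint, marginal and conditional sample distributions for the afterlife table can be computed directly from the counts. A small sketch in plain Python; the dictionary layout and variable names are assumptions made for illustration:

```python
# Counts from the belief-in-afterlife table (rows: gender, columns: belief).
counts = {
    "Females": {"Yes": 435, "No/Not sure": 147},
    "Males": {"Yes": 375, "No/Not sure": 134},
}

n = sum(sum(row.values()) for row in counts.values())

# Sample joint distribution: p_ij = n_ij / n.
joint = {g: {b: c / n for b, c in row.items()} for g, row in counts.items()}

# Sample marginal distributions: p_i+ (rows) and p_+j (columns).
row_marginal = {g: sum(row.values()) for g, row in joint.items()}
col_marginal = {b: sum(joint[g][b] for g in joint) for b in counts["Females"]}

# Sample conditional distribution of belief given gender: n_ij / n_i+.
conditional = {
    g: {b: c / sum(row.values()) for b, c in row.items()}
    for g, row in counts.items()
}

for gender, dist in conditional.items():
    print(gender, {belief: round(p, 3) for belief, p in dist.items()})
```

The conditional distribution for females reproduces the (.747, .253) split quoted in the example.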
Independence
Two variables are said to be statistically independent if the conditional distributions of Y are identical at each level of X. For example, belief in an afterlife would be considered independent of gender if the probability of a male believing in an afterlife were equal to the probability of a female believing in an afterlife. When there is statistical independence, the following holds:
$\pi_{ij} = \pi_{i+}\,\pi_{+j}$ for all $i = 1, \dots, I$ and $j = 1, \dots, J$.
Difference of Proportions
When the counts in the two rows are independent binomial samples, the estimated standard error of $p_1 - p_2$ is
$$\hat{\sigma}(p_1 - p_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$
The confidence interval for $\pi_1 - \pi_2$ is
$$(p_1 - p_2) \pm z_{\alpha/2}\,\hat{\sigma}(p_1 - p_2)$$
For a 95% CI, $\alpha = .05$ and $z_{\alpha/2} = 1.96$.
Class take 10 minutes to do the following:
Calculate the 95% confidence interval for the difference in
proportions between women and men (women-men) that believe in
an afterlife.
Difference of Proportions
95% CI for the difference in proportions
(the difference can range from -1 to 1)
.010684 +/- 1.96*.02656
.010684 +/- .052057
(-0.04137, 0.062741)
Do you believe the difference is different
from zero?
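The interval for the afterlife data can be checked with a few lines of Python. This is a sketch of the standard-error and interval formulas for two independent binomial samples; the function name `diff_proportion_ci` is hypothetical:

```python
from math import sqrt


def diff_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Return (p1 - p2, estimated SE, (lower, upper)) for two independent binomial samples."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, se, (diff - z * se, diff + z * se)


# Women: 435 of 582 believe; men: 375 of 509 believe (women minus men).
diff, se, (lower, upper) = diff_proportion_ci(435, 582, 375, 509)
print(f"difference = {diff:.6f}, SE = {se:.5f}")
print(f"95% CI = ({lower:.5f}, {upper:.6f})")
```

The interval runs from a negative to a positive value, so it contains zero.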
Now that we have calculated a 95% CI, explain what a 95% CI is.
Were we to take an infinite number of samples and create a 95% confidence interval from each, 95% of those intervals would contain the true difference $\pi_1 - \pi_2$.
Difference of Proportions
               Myocardial Infarction (MI)
Group          Yes     No         Total
Placebo        189     10,845     11,034
Aspirin        104     10,933     11,037
Total          293     21,778     22,071
Class take 5 minutes to do the following:
Calculate a 95% CI for the difference in proportions.
Difference in Proportions vs. Relative Risk
The 95% CI is (.0171-.0094)+/-1.96(0.0015), approximately (.005, .011); aspirin appears to diminish the risk of MI.
Another way to compare the placebo vs. aspirin is to look at the relative risk, $\pi_1/\pi_2$.
The sample relative risk is p1/p2 = .0171/.0094 = 1.82.
Thus in the sample the proportion of MI cases was 82% higher in the placebo group than in the aspirin group. To calculate the CI for the relative risk, first calculate the CI for the log of the relative risk and then take the antilog of the CI limits. (Note: log represents the natural log; in Excel you must use ln, not log.)
$$\log\left(\frac{p_1}{p_2}\right) \pm z_{\alpha/2}\sqrt{\frac{1 - p_1}{n_1 p_1} + \frac{1 - p_2}{n_2 p_2}}$$
Relative Risk
The confidence interval for the relative risk is (1.43, 2.31). From this we would conclude that the risk of MI is at least 43% higher for patients taking the placebo. It can be misleading to look only at the difference in proportions; looking at this situation in terms of relative risk, clearly you would want to take aspirin.
$$\log\left(\frac{p_1}{p_2}\right) \pm z_{\alpha/2}\sqrt{\frac{1 - p_1}{n_1 p_1} + \frac{1 - p_2}{n_2 p_2}}$$
0.597628 +/- 1.96*0.121347 = (.359787, .835469)
(Exp(0.359787), Exp(0.835469)) = (1.43, 2.31)
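The log-scale calculation can be reproduced as follows. A minimal sketch, assuming the placebo row is group 1 and the aspirin row is group 2; `relative_risk_ci` is an illustrative name, not something from the slides:

```python
from math import exp, log, sqrt


def relative_risk_ci(x1, n1, x2, n2, z=1.96):
    """Return (RR, SE of log RR, (lower, upper)), building the CI on the natural-log scale."""
    p1, p2 = x1 / n1, x2 / n2
    rr = p1 / p2
    se_log = sqrt((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2))
    log_lower = log(rr) - z * se_log
    log_upper = log(rr) + z * se_log
    # Take the antilog of each limit to return to the relative-risk scale.
    return rr, se_log, (exp(log_lower), exp(log_upper))


# Placebo: 189 MI cases out of 11,034; aspirin: 104 out of 11,037.
rr, se_log, (rr_lower, rr_upper) = relative_risk_ci(189, 11034, 104, 11037)
print(f"RR = {rr:.2f}, log RR = {log(rr):.6f}, SE = {se_log:.6f}")
print(f"95% CI = ({rr_lower:.2f}, {rr_upper:.2f})")
```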
The Odds Ratio
$$\text{odds}_1 = \frac{\pi_1}{1 - \pi_1} \quad \text{(odds of row 1)}, \qquad \text{odds}_2 = \frac{\pi_2}{1 - \pi_2} \quad \text{(odds of row 2)}$$
$$\pi = \frac{\text{odds}}{\text{odds} + 1}$$
$$\theta = \frac{\text{odds}_1}{\text{odds}_2} = \frac{\pi_1/(1 - \pi_1)}{\pi_2/(1 - \pi_2)}$$
This is called the odds ratio.
The odds are nonnegative. When the odds are greater than one, a success is more likely than a failure.
The odds ratio can take any nonnegative value. When X and Y are independent, the odds ratio equals 1. An odds ratio of 4 means that the odds of success in row 1 are 4 times the odds of success in row 2. When the odds of success are higher for row 2 than for row 1, the odds ratio is less than 1.
The Odds Ratio
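The maximum likelihood estimator of the odds ratio is:
$$\hat{\theta} = \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)} = \frac{n_{11}/n_{12}}{n_{21}/n_{22}} = \frac{n_{11}\,n_{22}}{n_{12}\,n_{21}}$$
The asymptotic standard error for the log of the MLE is:
$$ASE(\log\hat{\theta}) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}$$
The confidence interval for the log odds ratio is:
$$\log\hat{\theta} \pm z_{\alpha/2}\,ASE(\log\hat{\theta})$$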
Inference for Log Odds Ratios
Class take 10 minutes to do the following:
Calculate the Odds ratio for MI, and then a 95% CI for
the odds ratio.
Inference for Log Odds Ratios
Odds ratio=(189*10933)/(104*10845)=1.832
Log(1.832)=.605
ASE of the log = (1/189+1/10933+1/10845+1/104)^(1/2)=.123
95% CI of the log odds ratio is (.365,.846)
Thus the 95% CI of the Odds ratio is (1.44,2.33)
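These steps can be checked in Python. The sketch below follows the cross-product estimator and the ASE of the log odds ratio; the function name `odds_ratio_ci` is an assumption for illustration:

```python
from math import exp, log, sqrt


def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Return (odds ratio, ASE of log odds ratio, (lower, upper)) for a 2x2 table."""
    theta = (n11 * n22) / (n12 * n21)
    ase = sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    log_lower = log(theta) - z * ase
    log_upper = log(theta) + z * ase
    # Exponentiate the log-scale limits to get limits for the odds ratio itself.
    return theta, ase, (exp(log_lower), exp(log_upper))


# Rows: placebo, aspirin; columns: MI yes, MI no.
theta, ase, (or_lower, or_upper) = odds_ratio_ci(189, 10845, 104, 10933)
print(f"odds ratio = {theta:.3f}, log = {log(theta):.3f}, ASE = {ase:.3f}")
print(f"95% CI = ({or_lower:.2f}, {or_upper:.2f})")
```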
Dealing with small cell counts and the
When zero cell counts occur or some cell counts are very small, the following slightly amended estimator is used:
$$\tilde{\theta} = \frac{(n_{11} + 0.5)(n_{22} + 0.5)}{(n_{12} + 0.5)(n_{21} + 0.5)}$$
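A short sketch of the amended estimator. The table with a zero cell is hypothetical, made up only to show a case where the unadjusted estimator would divide by zero; the MI counts are from the earlier table:

```python
def adjusted_odds_ratio(n11, n12, n21, n22):
    """Odds ratio estimate with 0.5 added to every cell count."""
    return ((n11 + 0.5) * (n22 + 0.5)) / ((n12 + 0.5) * (n21 + 0.5))


# Hypothetical 2x2 table with a zero cell (n12 = 0); not data from the slides.
zero_cell_estimate = adjusted_odds_ratio(12, 0, 5, 9)
print(f"adjusted estimate with a zero cell: {zero_cell_estimate:.2f}")

# With large counts the adjustment changes very little (MI table counts).
mi_estimate = adjusted_odds_ratio(189, 10845, 104, 10933)
print(f"adjusted estimate for the MI table: {mi_estimate:.3f}")
```

For the MI table the adjusted value stays close to the unadjusted 1.832, which is why the amendment matters mainly for sparse tables.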
The Relationship Between Odds Ratio and Relative Risk
$$\text{Odds ratio} = \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)} = \text{Relative Risk} \times \frac{1 - p_2}{1 - p_1}$$
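A quick numerical check of this identity with the MI data (placebo as row 1, aspirin as row 2). The variable names are illustrative:

```python
# Sample proportions of MI in the placebo and aspirin groups.
p1 = 189 / 11034
p2 = 104 / 11037

relative_risk = p1 / p2
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))

# Factor that converts the relative risk into the odds ratio.
factor = (1 - p2) / (1 - p1)

print(f"relative risk          = {relative_risk:.4f}")
print(f"factor (1-p2)/(1-p1)   = {factor:.4f}")
print(f"relative risk * factor = {relative_risk * factor:.4f}")
print(f"odds ratio             = {odds_ratio:.4f}")
```

Because both proportions are close to zero here, the factor is close to 1 and the odds ratio and relative risk nearly coincide.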
Chi-Squared Tests
For calculating chi-squared statistics for testing a null hypothesis with fixed values $\pi_{ij}$, we use the expected frequencies:
$$\mu_{ij} = n\,\pi_{ij}$$
Pearson chi-squared statistic:
$$X^2 = \sum \frac{(n_{ij} - \mu_{ij})^2}{\mu_{ij}}$$
The likelihood-ratio test is based on the ratio
$$\Lambda = \frac{\text{maximum likelihood under the null hypothesis}}{\text{maximum likelihood when parameters are unrestricted}}$$
The test statistic equals $-2\log\Lambda$. For two-way contingency tables, the likelihood-ratio chi-squared statistic simplifies to:
$$G^2 = 2\sum n_{ij}\log\left(\frac{n_{ij}}{\mu_{ij}}\right)$$
$$df = (I - 1)(J - 1) \quad \text{(for the test of independence)}$$
Chi-Squared Tests of Independence
For calculating chi-squared statistics for testing the null hypothesis of independence:
$$H_0: \pi_{ij} = \pi_{i+}\,\pi_{+j}$$
Most likely the true probabilities are unknown and the sample proportions must be used:
$$\hat{\mu}_{ij} = n\,p_{i+}\,p_{+j} = \frac{n_{i+}\,n_{+j}}{n}$$
Pearson chi-squared statistic and likelihood-ratio statistic, respectively:
$$X^2 = \sum \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}$$
$$G^2 = 2\sum n_{ij}\log\left(\frac{n_{ij}}{\hat{\mu}_{ij}}\right)$$
Chi-Squared Test of Independence
Take 15 minutes and calculate the Pearson statistic and the
likelihood ratio chi-squared statistic for the null hypothesis that
the probability of heads is the same for all people, assuming the
true probability is unknown.
               Coin Toss
Person         Heads   Tails   Total
Michael        270     230     500
Mark           260     240     500
Mary           280     220     500
Total          810     690     1500
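One way to check a hand calculation for the coin toss table is the sketch below. It estimates the expected frequencies from the row and column totals and accumulates both statistics; `independence_statistics` is a hypothetical helper name:

```python
from math import log


def independence_statistics(table):
    """Return (Pearson X^2, likelihood-ratio G^2, df) for a two-way table of counts."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    x2 = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            mu = row_totals[i] * col_totals[j] / n  # estimated expected frequency
            x2 += (n_ij - mu) ** 2 / mu
            if n_ij > 0:  # a zero count contributes nothing to G^2
                g2 += 2 * n_ij * log(n_ij / mu)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df


# Rows: Michael, Mark, Mary; columns: heads, tails.
coin_table = [[270, 230], [260, 240], [280, 220]]
x2, g2, df = independence_statistics(coin_table)
print(f"X^2 = {x2:.3f}, G^2 = {g2:.3f}, df = {df}")
```

Both statistics are referred to a chi-squared distribution with the printed degrees of freedom.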
Adjusted Residuals
$$\frac{n_{ij} - \hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}\,(1 - p_{i+})(1 - p_{+j})}}$$
When the null hypothesis is true, each adjusted residual has a large-sample standard normal distribution. An adjusted residual about 2-3 or larger in absolute value indicates lack of fit of the null hypothesis within that cell. Take 10 minutes to calculate the adjusted residuals:

               Political Party Identification
Gender         Democrat   Independent   Republican
Females        279        73            225
Males          165        47            191
Adjusted Residuals
From this example we can see how the adjusted residuals can add further insight beyond the chi-squared test of independence, such as the direction of the departure in each cell.

               Political Party Identification
Gender         Democrat       Independent    Republican
Females        279 (2.29)     73 (0.46)      225 (-2.62)
Males          165 (-2.29)    47 (-0.46)     191 (2.62)
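The values in parentheses can be reproduced with a short Python sketch of the adjusted residual formula; the function name `adjusted_residuals` is illustrative:

```python
from math import sqrt


def adjusted_residuals(table):
    """Return a table of adjusted residuals under the independence hypothesis."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, n_ij in enumerate(row):
            mu = row_totals[i] * col_totals[j] / n  # estimated expected frequency
            p_row = row_totals[i] / n               # p_i+
            p_col = col_totals[j] / n               # p_+j
            out_row.append((n_ij - mu) / sqrt(mu * (1 - p_row) * (1 - p_col)))
        result.append(out_row)
    return result


# Rows: females, males; columns: Democrat, Independent, Republican.
party_table = [[279, 73, 225], [165, 47, 191]]
residuals = adjusted_residuals(party_table)
for label, row in zip(["Females", "Males"], residuals):
    print(label, [round(value, 2) for value in row])
```

With only two rows, the residuals in each column are equal in size and opposite in sign.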
Chi-Squared Tests of Independence with Ordinal Data
Linear trend alternative to independence.
$$r = \frac{\sum_{i,j} u_i v_j n_{ij} - \left(\sum_i u_i n_{i+}\right)\left(\sum_j v_j n_{+j}\right)/n}{\sqrt{\left[\sum_i u_i^2 n_{i+} - \left(\sum_i u_i n_{i+}\right)^2/n\right]\left[\sum_j v_j^2 n_{+j} - \left(\sum_j v_j n_{+j}\right)^2/n\right]}}$$
$u_i$ = scores for the rows, $u_1 \le u_2 \le u_3 \le \dots$
$v_j$ = scores for the columns, $v_1 \le v_2 \le v_3 \le \dots$
$$M^2 = (n - 1)\,r^2, \qquad df = 1$$
Under the null hypothesis of independence, $M^2$ is approximately chi-squared with one degree of freedom. M, its square root, approximately follows a standard normal distribution and gives insight into direction. Note: when categories do not have numeric scores, such as education level, logical scores must be assigned, e.g. high school degree = 1, college degree = 2, master's degree = 3.
Example with Ordinal Data
Alcohol and Infant Malformation
Alcohol          Infant Malformation      Percentage
Consumption      Absent      Present      Present
0                17,066      48           0.28
<1               14,464      38           0.26
1-2              788         5            0.63
3-5              126         1            0.79
>=6              37          1            2.63
Take 2-3 minutes and think of logical value assignments for
scores. Note: nominal binary data can be treated as ordinal.
Example with Ordinal Data
Alcohol and Infant Malformation
Scores for the alcohol consumption and infant malformation table above:
Row scores (alcohol consumption): $u_1 = 0,\ u_2 = 0.5,\ u_3 = 1.5,\ u_4 = 4,\ u_5 = 7$
Column scores (malformation absent, present): $v_1 = 0,\ v_2 = 1$
Take 20 minutes and, using the scores given, calculate r.
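To check a hand calculation, the correlation r and the statistic M² = (n - 1)r² can be computed with a sketch like the following. It assumes the row scores 0, 0.5, 1.5, 4, 7 and column scores 0, 1 given above; `linear_trend` is a hypothetical function name:

```python
from math import sqrt


def linear_trend(table, row_scores, col_scores):
    """Return (r, M^2) for the linear trend test with the given row and column scores."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    sum_uv = sum(
        row_scores[i] * col_scores[j] * n_ij
        for i, row in enumerate(table)
        for j, n_ij in enumerate(row)
    )
    sum_u = sum(u * t for u, t in zip(row_scores, row_totals))
    sum_v = sum(v * t for v, t in zip(col_scores, col_totals))
    sum_u2 = sum(u * u * t for u, t in zip(row_scores, row_totals))
    sum_v2 = sum(v * v * t for v, t in zip(col_scores, col_totals))

    numerator = sum_uv - sum_u * sum_v / n
    denominator = sqrt((sum_u2 - sum_u**2 / n) * (sum_v2 - sum_v**2 / n))
    r = numerator / denominator
    return r, (n - 1) * r**2


# Rows: alcohol consumption 0, <1, 1-2, 3-5, >=6; columns: malformation absent, present.
alcohol_table = [[17066, 48], [14464, 38], [788, 5], [126, 1], [37, 1]]
r, m2 = linear_trend(alcohol_table, [0, 0.5, 1.5, 4, 7], [0, 1])
print(f"r = {r:.4f}, M^2 = {m2:.2f}, M = {sqrt(m2):.2f}")
```

Changing the score lists passed to the function shows how sensitive r and M² are to the choice of scores.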