Contingency Table and Correspondence Analysis

Nishith Kumar
Department of Statistics
BSMRSTU
and
Mohammed Nasser
Department of Statistics
RU
Overview
- Contingency tables
- Some real-world problems for contingency tables
- Pearson chi-squared test
- Probabilistic interpretation of matrices
- Contingency tables: homogeneity and heterogeneity
- Historical background of correspondence analysis
- Correspondence analysis (CA)
- Correspondence analysis and eigenvalues
- Singular value decomposition
- Calculation procedure of CA
- Interpretation of correspondence analysis
- R code and examples
- Conclusion
Contingency Table
In statistics, a contingency table (also referred to as a cross tabulation or crosstab) is a table in matrix format that displays the (multivariate) frequency distribution of two or more variables. The term contingency table was first used by Karl Pearson in 1904. A contingency table is sometimes called an incidence matrix.
Contingency tables are often used in the social sciences (such as sociology, education, and psychology). They can be considered frequency tables whose rows and columns are categorical variables. If a variable is continuous, we can bin it and thereby convert it into a categorical one.
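The binning step described above can be illustrated with R's cut() and table() functions. This is a minimal sketch on made-up data (the variable names and the simulated values are illustrative assumptions, not from the survey used later in these slides):

```r
set.seed(42)
age <- sample(16:80, 200, replace = TRUE)            # a continuous variable
status <- sample(c("Good", "Regular", "Bad"), 200,   # a categorical variable
                 replace = TRUE)
# bin the continuous variable into age groups
age.group <- cut(age, breaks = c(16, 25, 35, 45, 55, 65, 75, Inf),
                 right = FALSE)
# cross-tabulate the two categorical variables into a contingency table
tab <- table(age.group, status)
tab
```

The resulting table has one row per age group and one column per status category, with cell counts summing to the sample size.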
Real Problem
Cross-tabulation of age groups by perceived health status
Age group   Very Good   Good   Regular   Bad   Very Bad
16-24          243       789     167      18       6
25-34          220       809     164      35       6
35-44          147       658     181      41       8
45-54           90       469     236      50      16
55-64           53       414     306     106      30
65-74           44       267     284      98      20
75+             20       136     157      66      17

1. Is there any relation between the age groups and perceived health status?
2. How can we visualize this type of relationship?
3. How can we find the similarity of the row categories?
4. How can we interpret the distances between categories of the row and column variables?
Real Problem
Suppose we have the following contingency table
Smoking behavior

                    none   light   medium   heavy   Total
Senior managers       4      2       3        2      11
Junior managers       4      3       7        4      18
Senior employees     25     10      12        4      51
Junior employees     18     24      33       13      88
Secretaries          10      6       7        2      25
Total                61     45      62       25     193
1. How can we analyze contingency-table data?
2. How can we convert frequency-table data into graphical displays?
3. How can we find the similarity of the column categories?
4. How can we find the similarity of the row categories?
5. How can we find the relationship between the row and column categories simultaneously?
Real Problem
Survey of the effects of four different drug types. Each patient gave a score for each drug type (excellent, very good, good, fair, poor). The total number of responses is 121.

          excellent   very good   good   fair   poor
Drug A        6           8        10      1      5
Drug B       12           8         3      3      5
Drug C        0           3        12      6     10
Drug D        1           1         8     12      7

1. Is there an association between the columns and rows?
2. If there is some association, how can we find structure in this data table?
3. Can we order the columns and rows by their closeness?
4. Can we find associations between columns and rows?
Pearson chi-squared test
Suppose that we have a data matrix X with I rows and J columns, with elements x_ij. Let us use the following notation:

$$n = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}, \qquad P = X/n, \qquad r = P\mathbf{1}, \qquad c = P^{T}\mathbf{1}$$
$$D_r = \operatorname{diag}(r), \qquad D_c = \operatorname{diag}(c)$$
$$R = D_r^{-1}P, \qquad C = D_c^{-1}P^{T}$$
$$Q = P - rc^{T}$$

Here r and c are the row and column sums of P, R and C are the row and column profiles, respectively, and Q is the difference between P and the outer product of the row and column sums.
Pearson chi-squared test (Cont.)
More notation and relations:

$$\operatorname{in}(I) = \operatorname{tr}\big(D_r(R - \mathbf{1}c^{T})D_c^{-1}(R - \mathbf{1}c^{T})^{T}\big) = \text{the total inertia of rows}$$
$$\operatorname{in}(J) = \operatorname{tr}\big(D_c(C - \mathbf{1}r^{T})D_r^{-1}(C - \mathbf{1}r^{T})^{T}\big) = \text{the total inertia of columns}$$

The relation in(I) = in(J) holds:

$$\operatorname{in}(I) = \operatorname{tr}\big(D_r(D_r^{-1}P - \mathbf{1}c^{T})D_c^{-1}(D_r^{-1}P - \mathbf{1}c^{T})^{T}\big) = \operatorname{tr}\big(QD_c^{-1}Q^{T}D_r^{-1}\big) = \chi^{2}/n$$
$$\operatorname{in}(J) = \operatorname{tr}\big(D_c(D_c^{-1}P^{T} - \mathbf{1}r^{T})D_r^{-1}(D_c^{-1}P^{T} - \mathbf{1}r^{T})^{T}\big) = \operatorname{tr}\big(Q^{T}D_r^{-1}QD_c^{-1}\big) = \chi^{2}/n$$

The row and column inertias are multiples of the chi-squared statistic with (I-1)(J-1) degrees of freedom; the multiplier is 1/n. If P is a probability matrix and there is no association between rows and columns, then Q = 0, which is equivalent to saying that the rows and columns are independent.
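The identity in(I) = in(J) = χ²/n can be checked numerically. The following sketch enters the smoke data by hand (matching the table on slide 5) and compares the trace formula with Pearson's chi-squared statistic from base R:

```r
# smoke data: staff group (rows) by smoking category (columns)
smoke <- matrix(c( 4,  2,  3,  2,
                   4,  3,  7,  4,
                  25, 10, 12,  4,
                  18, 24, 33, 13,
                  10,  6,  7,  2),
                nrow = 5, byrow = TRUE,
                dimnames = list(c("SM", "JM", "SE", "JE", "SC"),
                                c("none", "light", "medium", "heavy")))
n <- sum(smoke)
P <- smoke / n
r <- rowSums(P); c <- colSums(P)
Q <- P - r %*% t(c)
# total inertia via the trace formula in(I) = tr(Q Dc^{-1} Q^T Dr^{-1})
in.I <- sum(diag(Q %*% diag(1/c) %*% t(Q) %*% diag(1/r)))
# Pearson's chi-squared statistic divided by n gives the same value
chi2 <- suppressWarnings(chisq.test(smoke)$statistic)
all.equal(unname(in.I), unname(chi2) / n)
```

For the smoke data this common value is about 0.0852 (equivalently, χ² = 16.4416 with n = 193, as reported on the next slide).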
Pearson chi-squared test (Cont.)
Principal inertias for the smoke data:

dim          1          2          3
value     0.074759   0.010017   0.000414
%          87.76%     11.76%      0.49%

R code:
library(ca)
library(MASS)
ca(smoke)
chisq.test(smoke)

Row inertias:
SM 0.002673   JM 0.011881   SE 0.038314   JE 0.026269   SC 0.006053

Column inertias:
none 0.049186   light 0.007059   medium 0.012610   heavy 0.016335

We have seen:
chi-squared value = total inertia * grand total, df = (no. of rows - 1) * (no. of columns - 1)
Chi-squared = 16.4416, df = 12, p-value = 0.1718
Pearson chi-squared test (Cont.)
Drug data:

Principal inertias:

dim          1          2          3
value     0.304667   0.077342   0.007015
%          78.32%     19.88%      1.80%

R code:
library(ca)
library(MASS)
ca(drug)
chisq.test(drug)

Row inertias:
Drug A 0.055280   Drug B 0.143372   Drug C 0.071340   Drug D 0.119030

Column inertias:
excellent 0.152430   very good 0.060843   good 0.044719   fair 0.111385   poor 0.019646

Chi-squared value = total inertia * grand total, df = (no. of rows - 1) * (no. of columns - 1)
Chi-squared = 47.0718, df = 12, p-value = 4.53e-06
That is, there is strong evidence of row-column association.
Pearson chi-squared test (Cont.)
Health data:

Principal inertias:

dim          1          2          3          4
value     0.136603   0.002090   0.001292   0.000474
%          97.25%      1.49%      0.92%      0.34%

R code:
library(ca)
library(MASS)
ca(health)
chisq.test(health)

Row inertias:
16-24 0.027020   25-34 0.021316   35-44 0.006900   45-54 0.001667   55-64 0.022711   65-74 0.033288   75+ 0.027557

Column inertias:
VG 0.024279   GOOD 0.022368   REG 0.045823   BAD 0.037955   VB 0.010034

Chi-squared value = total inertia * grand total, df = (no. of rows - 1) * (no. of columns - 1)
Chi-squared = 894.8607, df = 24, p-value < 2.2e-16
That is, there is strong evidence of row-column association.
Probabilistic Interpretation of Matrices
P  X / n , If the matrix P would be a probability matrix i.e. each element pij
are probability of happening rows and columns simultaneously then we can
have the following interpretation of the involved matrices:
1)
Elements of r are the marginal probabilities of rows. Elements of c are the marginal
probabilities of columns.
2)
Elements of Q are differences between joint probability and product of individual
probabilities. In some sense this matrix represents the degree of dependencies of
rows and columns
3)
Elements of R are the conditional probabilities of columns when row is known
4)
Elements of C are the conditional probabilities of rows when column is known
5)
Total inertia is the total indicator of dependencies of rows and columns.
12
Marginal probability of Drug data

1) P = X/n

X:
          excellent   very good   good   fair   poor   Total
Drug A        6           8        10      1      5      30
Drug B       12           8         3      3      5      31
Drug C        0           3        12      6     10      31
Drug D        1           1         8     12      7      29
Total        19          20        33     22     27     121

Elements of r are the marginal probabilities of the rows (drug types); elements of c are the marginal probabilities of the columns (patient scores).

P:
          excellent   very good   good      fair      poor     Marginal probability
                                                               of drug type
Drug A    0.0495868   0.066116    0.08264   0.00826   0.0413   0.248
Drug B    0.0991736   0.066116    0.02479   0.02479   0.0413   0.256
Drug C    0.0000000   0.024793    0.09917   0.04959   0.0826   0.256
Drug D    0.0082645   0.008264    0.06612   0.09917   0.0579   0.240
Marginal
probability
of patient
score     0.1570248   0.165289    0.27273   0.18182   0.2231   1
Degree of dependencies of rows and columns
2. Elements of Q are the differences between the joint probabilities and the products of the individual probabilities. In this sense the matrix represents the degree of dependence between rows and columns.

Q:
          excellent      very good      good           fair           poor
Drug A     0.01065501     0.02513490     0.01502630    -0.03681443    -0.01400178
Drug B     0.05894406     0.02376887    -0.04507889    -0.02178813    -0.01584591
Drug C    -0.04022949    -0.01755345     0.02930128     0.00300526     0.02547640
Drug D    -0.02936958    -0.03135032     0.00075131     0.05559730     0.00437129

See slide no. 19 for the R code.
Conditional Probabilities and Inertias
3) Elements of R are the conditional probabilities of the columns when the row is known.

R:
          excellent    very good    good         fair         poor
Drug A    0.20000000   0.26666667   0.33333333   0.03333333   0.1666667
Drug B    0.38709677   0.25806452   0.09677419   0.09677419   0.1612903
Drug C    0.00000000   0.09677419   0.38709677   0.19354839   0.3225806
Drug D    0.03448276   0.03448276   0.27586207   0.41379310   0.2413793

4) Elements of C are the conditional probabilities of the rows when the column is known.

C:
            Drug A       Drug B       Drug C      Drug D
excellent   0.31578947   0.63157895   0.0000000   0.05263158
very good   0.40000000   0.40000000   0.1500000   0.05000000
good        0.30303030   0.09090909   0.3636364   0.24242424
fair        0.04545455   0.13636364   0.2727273   0.54545455
poor        0.18518519   0.18518519   0.3703704   0.25925926

5) The total inertia is the overall indicator of the dependence between rows and columns. A small inertia indicates that there is no row-column association.
Similarly, we can find the following measurements for the smoke data and the health-status data:
i) marginal probabilities,
ii) degree of dependence of rows and columns,
iii) conditional probabilities,
iv) inertias.
Contingency Tables: Homogeneity and Heterogeneity
t = in(I) = in(J) = χ²/n is the coefficient of association known as Pearson's mean-square contingency. It is the total inertia, and the total inertia is a measure of the homogeneity/heterogeneity of the table.
If t is large the table is heterogeneous, and if t is small the table is homogeneous.
Homogeneity means that there is no row-column association.
t can also be calculated as:

$$t = \sum_{i=1}^{I}\sum_{j=1}^{J} r_i\big[(p_{ij}/r_i - c_j)^{2}/c_j\big]$$
Contingency Tables: Homogeneity and Heterogeneity( Cont.)
We can interpret the formula

$$t = \sum_{i=1}^{I} r_i \sum_{j=1}^{J} \big[(p_{ij}/r_i - c_j)^{2}/c_j\big]$$

in the following way:
- The inner summation is a weighted squared distance between the vector of relative frequencies of the ith row (the ith row profile, p_ij/r_i) and the average row profile c. The inverses of the elements of c are the weights.
- This is known as the chi-squared distance between the ith row profile and the average row profile.
- The total inertia is a further weighted sum of the I chi-squared distances; the weights are the elements of r.
- If all row profiles are close to the average row profile, the table is homogeneous. Otherwise the table is heterogeneous.
- We can do similar calculations for the column profiles, simply by exchanging the roles of r and c.
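The chi-squared distances and their weighted sum can be computed directly in base R. This is a minimal vectorized sketch for the drug table (entered by hand from slide 6):

```r
drug <- matrix(c( 6,  8, 10,  1,  5,
                 12,  8,  3,  3,  5,
                  0,  3, 12,  6, 10,
                  1,  1,  8, 12,  7),
               nrow = 4, byrow = TRUE)
P <- drug / sum(drug)
r <- rowSums(P); c <- colSums(P)
# squared chi-squared distance of each row profile from the average profile c
chi2.dist <- sapply(1:nrow(P), function(i) sum((P[i, ] / r[i] - c)^2 / c))
# total inertia: sum of the distances weighted by the row masses r
t <- sum(r * chi2.dist)
t
```

The result agrees with the total inertia t = 0.3890234 for the drug data reported on the next slide.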
Calculations of Inertia to Find Out the
Homogeneity or Heterogeneity
We can calculate t in R with the following code; the double loop implements

$$t = \sum_{i=1}^{I} r_i \sum_{j=1}^{J} \big[(p_{ij}/r_i - c_j)^{2}/c_j\big]$$

library(ca)
library(MASS)
###### Read data ########
###### Probability matrix #######
pdrug <- drug/121
c <- colSums(pdrug)
r <- rowSums(pdrug)
Dr <- diag(r)
Dc <- diag(c)
q <- pdrug - r %*% t(c)
R <- ginv(Dr) %*% as.matrix(pdrug)
C <- ginv(Dc) %*% t(as.matrix(pdrug))
sp <- 0; tsp <- 0; t <- 0
for (i in 1:4) {
  for (j in 1:5) {
    sp[j] <- ((pdrug[i, j]/r[i] - c[j])^2)/c[j]
  }
  tsp[i] <- sum(sp)
  t[i] <- r[i]*tsp[i]
}
ti <- sum(t)

The total inertia for the drug data is t = 0.3890234.
Historical Background of Correspondence Analysis
Correspondence analysis (CA) was first proposed by Hirschfeld (1935).
CA was later developed by Jean-Paul Benzécri (1973).
The CA solution was shown by Greenacre (1984).
It was incorporated in R in 2009.
Correspondence Analysis
Correspondence analysis is a statistical technique used to analyze categorical data (Benzécri, 1992) and provides a graphical representation of cross tabulations or contingency tables.
Correspondence analysis (CA) can be viewed as a generalized principal component analysis tailored to the analysis of qualitative data.
Although CA was originally created to analyze cross tabulations, it is so versatile that it is used with many other types of numerical data tables. It is formally applicable to any data matrix with nonnegative entries.
Objectives of CA
The main objective of CA is to transform a data table into two sets of factor scores (one for the rows and one for the columns) that give the best representation of the similarity structure of the rows and columns of the table.
Correspondence analysis is used to reduce the dimension of a data matrix, as in principal component analysis. Using CA we can therefore visualize the data in two or three dimensions.
Correspondence analysis and eigenvalues
For a given contingency table we calculate the row and column profiles. We now want to find a vector g such that the row profiles, multiplied by g, have the highest possible variance. That is, we want to maximize

$$(Rg - \mathbf{1}c^{T}g)^{T}D_r(Rg - \mathbf{1}c^{T}g) \to \max$$

To make this problem solvable we add additional constraints (similar to PCA): the weighted norm of the vector should be one and its weighted mean zero, with the column sums as weights:

$$g^{T}D_cg = 1, \qquad c^{T}g = 0$$

So we have to maximize

$$(Rg)^{T}D_rRg = g^{T}P^{T}D_r^{-1}D_rD_r^{-1}Pg = g^{T}P^{T}D_r^{-1}Pg \to \max$$
Correspondence analysis and eigenvalues (cont.)
$$(Rg)^{T}D_rRg = g^{T}P^{T}D_r^{-1}Pg \to \max \quad \text{subject to} \quad g^{T}D_cg = 1$$

To maximize the function we can use the Lagrange multiplier technique. The Lagrange function is

$$L = g^{T}P^{T}D_r^{-1}Pg + \lambda(1 - g^{T}D_cg)$$

Differentiating L with respect to g and setting the result equal to zero gives

$$\frac{\partial L}{\partial g} = 0 \;\Rightarrow\; P^{T}D_r^{-1}Pg = \lambda D_cg$$

Since C = D_c^{-1}P^{T}, we have P = C^{T}D_c, so

$$P^{T}D_r^{-1}C^{T}D_cg = \lambda D_cg \;\Rightarrow\; P^{T}D_r^{-1}C^{T}(D_cg) = \lambda(D_cg)$$

Thus the problem reduces to an eigenvalue problem, and as a result we obtain the principal coordinates for the columns. Similarly, we can find the principal coordinates for the rows. This problem is solved easily and compactly if we use the singular value decomposition.
Singular Value Decomposition
X = UΛV^T, where:
- X is a real m×n matrix (n ≤ m);
- U is an m×n column-orthonormal matrix containing the eigenvectors of XX^T;
- Λ is an n×n diagonal matrix containing the singular values of X;
- V^T is an n×n row-orthonormal matrix containing the eigenvectors of X^T X.

XV = UΛ; the columns of UΛ are the principal components.
The left singular vectors show the structure of the observations.
The right singular vectors show the structure of the variables.
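These SVD relations can be verified numerically with R's built-in svd() function; a quick sketch on a small random matrix:

```r
set.seed(1)
X <- matrix(rnorm(12), nrow = 4, ncol = 3)    # m = 4, n = 3 (n <= m)
s <- svd(X)                                   # X = U %*% diag(d) %*% t(V)
U <- s$u; Lambda <- diag(s$d); V <- s$v
# reconstruction and the relation XV = U Lambda
max(abs(X - U %*% Lambda %*% t(V)))           # essentially zero
max(abs(X %*% V - U %*% Lambda))              # essentially zero
# columns of V are eigenvectors of t(X) %*% X, with eigenvalues d^2
max(abs(t(X) %*% X %*% V - V %*% Lambda^2))   # essentially zero
```

All three maxima are at machine-precision level, confirming the decomposition.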
Correspondence Analysis Calculation Procedure
To obtain the coordinates using the SVD, the computational algorithm for the row and column profiles with respect to the principal axes is given below.

P = X/n, where n is the grand total of X; r is the vector of row totals of P and c is the vector of column totals of P; D_r = diag(r) and D_c = diag(c) are the corresponding diagonal matrices.

Calculate the matrix of standardized residuals and its SVD:

$$D_r^{-1/2}(P - rc^{T})D_c^{-1/2} = U\Lambda V^{T}$$

U is an (m×n) column-orthonormal matrix (U^T U = I) containing the eigenvectors of the symmetric matrix SS^T, and V^T is an (n×n) row-orthonormal matrix (V^T V = I) containing the eigenvectors of S^T S, where S is the standardized residual matrix above.

The principal coordinates of the rows: F = D_r^{-1/2}UΛ
The principal coordinates of the columns: G = D_c^{-1/2}VΛ
The standard row and column coordinates are D_r^{-1/2}U and D_c^{-1/2}V, respectively.
The first few (one or two) columns of F and G are usually taken and plotted simultaneously.
Interpretation of Correspondence analysis
Elements of Λ are called the principal inertias. They are also related to the canonical correlations given by the R package.
A larger value in Λ means that the corresponding element has higher importance. It is usual to use one or two columns of F and G; these are then used for various plots.
For pictorial representation, either the columns or the rows are plotted in an ordered form, or a biplot is used to find possible associations between rows and columns as well as their ordering.
Correspondence analysis can be considered a dimension reduction technique and can be used together with others (for example PCA).
Comparative application of different dimension reduction techniques may give insight into the problem and the structure of the data.
Algorithm of Correspondence Analysis
1. Take a contingency table (X) and find the sum of all its elements (total sum = n).
2. Divide all elements by the total sum (call the result P).
3. Find the row and column sums (r and c).
4. Calculate the matrix of standardized residuals: S = D_r^{-1/2}(P - rc^T)D_c^{-1/2}
5. Find the generalized SVD of S.
6. Find the principal row and column coordinates. Take the first few columns and plot them.
7. Analyze the results (order and closeness of columns and rows, possible associations between columns and rows).
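The steps above can be sketched from scratch in R for the drug table; the coordinates should agree with the ca() output up to the signs of the axes:

```r
drug <- matrix(c( 6,  8, 10,  1,  5,
                 12,  8,  3,  3,  5,
                  0,  3, 12,  6, 10,
                  1,  1,  8, 12,  7), nrow = 4, byrow = TRUE)
n <- sum(drug)                        # step 1: total sum
P <- drug / n                         # step 2: correspondence matrix
r <- rowSums(P); c <- colSums(P)      # step 3: row and column sums
# step 4: standardized residuals S = Dr^{-1/2} (P - r c^T) Dc^{-1/2}
S <- diag(1/sqrt(r)) %*% (P - r %*% t(c)) %*% diag(1/sqrt(c))
sv <- svd(S)                          # step 5: SVD of S
# step 6: principal coordinates F (rows) and G (columns)
F <- diag(1/sqrt(r)) %*% sv$u %*% diag(sv$d)
G <- diag(1/sqrt(c)) %*% sv$v %*% diag(sv$d)
# the squared singular values are the principal inertias
round(sv$d^2, 6)
```

The first squared singular value is about 0.304667 and they sum to about 0.389023, matching the principal inertias reported for the drug data; step 7 amounts to plotting the first two columns of F and G and reading off closeness and associations.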
Correspondence Analysis in Drug data
R code:
library(ca)
drug <- read.table(text = "
qlt excellent verygood good fair poor
DrugA  6  8 10  1  5
DrugB 12  8  3  3  5
DrugC  0  3 12  6 10
DrugD  1  1  8 12  7
", row.names = 1, header = TRUE)
plot(ca(drug), mass = c(TRUE, TRUE))
plot(ca(drug), mass = c(TRUE, TRUE), arrows = c(FALSE, TRUE))
summary(ca(drug))
Biplot of Drug data using Correspondence Analysis
Principal inertias (eigenvalues):

dim     value       %    cum%
 1    0.304667    78.3    78.3
 2    0.077342    19.9    98.2
 3    0.007015     1.8   100.0
      --------   -----
Total 0.389023   100.0
Correspondence analysis in Smoke Data
Principal inertias (eigenvalues):

dim     value       %   cum%   scree plot
 1    0.074759   87.8   87.8   *************************
 2    0.010017   11.8   99.5   ***
 3    0.000414    0.5  100.0

library(ca)
data("smoke")
plot(ca(smoke), mass = c(TRUE, TRUE))
summary(ca(smoke))
Biplot using Correspondence analysis
library(ca)
data("smoke")
plot(ca(smoke), mass = c(TRUE, TRUE), arrows = c(FALSE, TRUE))
Three Dimensional plot using Correspondence analysis
library(ca)
data("smoke")
plot3d.ca(ca(smoke, nd=3))
Correspondence analysis in Health Data
library(ca)
health<- read.table(text = "
age VG GOOD REG BAD VB
16-24 243 789 167 18 6
25-34 220 809 164 35 6
35-44 147 658 181 41 8
45-54 90 469 236 50 16
55-64 53 414 306 106 30
65-74 44 267 284 98 20
75+ 20 136 157 66 17
", row.names = 1, header = TRUE)
plot(ca(health), mass = c(TRUE, TRUE))
Biplot of Health Data Correspondence analysis
library(ca)
health<- read.table(text = "
age VG GOOD REG BAD VB
16-24 243 789 167 18 6
25-34 220 809 164 35 6
35-44 147 658 181 41 8
45-54 90 469 236 50 16
55-64 53 414 306 106 30
65-74 44 267 284 98 20
75+ 20 136 157 66 17
", row.names = 1, header = TRUE)
plot(ca(health), mass = c(TRUE, TRUE), arrows = c(FALSE, TRUE))
Conclusion
In conclusion, we can say that correspondence analysis can:
1. convert frequency-table data into graphical displays;
2. show the similarity of the row categories;
3. show the similarity of the column categories;
4. show the relationship of the row and column categories simultaneously.
Although CA was originally created to analyze cross tabulations, it is so versatile that it is used with many other types of numerical data tables. It is formally applicable to any data matrix with nonnegative entries.
Future Studies
1. Study multiple correspondence analysis.
2. High-dimensional data analysis using correspondence analysis.
3. Assess the effect of outliers.
4. The 1st CA axis is reliable, but the 2nd and later axes are quadratic distortions of the first, which produces the "arch effect". A future study is how to solve this problem.
5. Application of CA to microarray data to find gene patterns and the similarity of gene structure.
6. Missing values and outliers are a general problem in microarray data. To address both, a target is to propose a robust correspondence analysis method that can handle both the outlier and the missing-value problem.
References
1. Benzécri, J.-P. (1973). L'Analyse des Données. Volume II: L'Analyse des Correspondances. Paris: Dunod.
2. Greenacre, M. (1983). Theory and Applications of Correspondence Analysis. London: Academic Press. ISBN 0-12-299050-1.
3. Greenacre, M. (2007). Correspondence Analysis in Practice, Second Edition. London: Chapman & Hall/CRC.
4. Greenacre, M. and Nenadić, O. (2007). "Correspondence Analysis in R, with Two- and Three-dimensional Graphics: The ca Package". Journal of Statistical Software, Vol. 20, Issue 3.
5. Hirschfeld, H. O. (1935). "A connection between correlation and contingency". Proceedings of the Cambridge Philosophical Society, 31, 520-524.
Thank You so Much
for Your Patience