slides - CARME 2011


Representing interaction in multiway contingency tables: MIDOVA, CA and log-linear model.
Martine Cadot (1,2), Alain Lelu (2,3,4)
1. Université Henri Poincaré, Nancy 1
2. Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA, Nancy)
3. Université de Franche-Comté/LASELDI, Besançon
4. Institut des Sciences de la Communication du CNRS, Paris
The usual definition of statistical interaction
• Two-Way Analysis of Variance is a way of studying the effects of two factors separately (their main effects) and (sometimes) together (their interaction effect).
• An interaction is the variation among the differences between means for different levels of one factor over different levels of the other factor.
Carme 2011, Rennes
[email protected], [email protected]
The interaction A*B*C in statistics
The linear model ANOVA (C /(A*B))
I=(0.7-0.4)-(0.5-0.3)=0.1, p(H0)=0.44
The loglinear model (A*B*C)
I=ln((2.3/0.3)/(0.4/0.1))=0.88, p(H0)=0.14
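The two interaction contrasts above can be evaluated directly. The following is a minimal Python sketch, not the authors' code: the function names `additive_interaction` and `loglinear_interaction` are hypothetical, and the inputs are simply the rounded values printed on the slide. Note that with those one-decimal inputs the log-linear expression evaluates to about 0.65, so the 0.88 shown on the slide was presumably obtained from unrounded values.

```python
import math

def additive_interaction(m11, m12, m21, m22):
    """Difference of differences between cell means (ANOVA-style contrast)."""
    return (m11 - m12) - (m21 - m22)

def loglinear_interaction(r11, r12, r21, r22):
    """Log of a ratio of ratios (log-linear-style contrast)."""
    return math.log((r11 / r12) / (r21 / r22))

# Values as printed on the slide (rounded to one decimal).
print(round(additive_interaction(0.7, 0.4, 0.5, 0.3), 2))
print(round(loglinear_interaction(2.3, 0.3, 0.4, 0.1), 2))
```

The additive contrast is zero when the differences are parallel across levels; the log-linear contrast is zero when the ratios are equal across levels.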
The principles of statistical interaction
• A level-k interaction refers to a complex association between k variables:
   - It may be positive, null or negative
   - It may be statistically significant or not
   - It is difficult to interpret an association between k variables (k>2)
• Pairwise (level-2) associations are easy to interpret only as long as no higher-level interaction exists (i.e. all of them are null or non-significant)
• One has to start from level k, then k-1, ...
Solutions for contingency tables:
1. Starting from a Correspondence Analysis: Escofier, Pagès, Abdessemed, Grossetête, Mourad, … (1979 to 2000).
2. Our solution, MIDOVA: Multidimensional Interaction Differential Of Variation (Cadot 2006)
Solution 1: Escofier 1983
• Interactions of level k>2 are removed and the relations between the k variables are interpreted by comparing the 2 CAs (with/without interaction)
• A real example has been shown (~500 000 individuals, 3 variables): I = education grade, T = gender, S = occupation; CA with IxT rows and S columns; interaction between I, T and S
Definition of interaction for CA by Abdessemed & Escofier (1983, p152)
Computation of factors without interaction: A&E (1983, p154)
Reconstitution of contingency table without interaction: A&E (1983, p154)
Contingency table and CA with interaction (raw data)
Contingency table and CA without interaction (Abdessemed & Escofier)
Problem: negative counts (see Mourad & Grossetête)
Problem of negative counts: Grossetête & Mourad (1982)
The reconstitution formula is replaced by an algorithm
Contingency table and CA without interaction (Grossetête & Mourad)
Positive counts, but no parallelism!
Solution 2: MIDOVA (Cadot 2006)
• A data analysis and mining method: descriptive, exploratory, ≠ experiment design
• Lets a model emerge from the data, ≠ statistical inference
• Well-fitted to the case of multiple binary variables
• Generates a symbolic representation of the data as a configuration of itemsets (conjunctions of logic variables, e.g. A nonB C) or association rules (A nonB → nonC), Han 2001.
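To make the itemset vocabulary concrete, here is a small Python sketch. It assumes the data are stored as one dict of Booleans per individual; the helper `support` and the toy rows are hypothetical and not taken from the study. A negated literal such as nonB is encoded as the pair `("B", False)`.

```python
def support(rows, itemset):
    """Count the individuals that satisfy every literal of the itemset.

    `itemset` is a list of (variable, wanted_value) pairs, so the itemset
    "A nonB C" is written [("A", True), ("B", False), ("C", True)].
    """
    return sum(all(row[var] == wanted for var, wanted in itemset) for row in rows)

# Toy data (made up for illustration): 4 individuals, 3 Boolean variables.
rows = [
    {"A": True,  "B": False, "C": True},
    {"A": True,  "B": False, "C": False},
    {"A": True,  "B": True,  "C": True},
    {"A": False, "B": False, "C": True},
]

s_itemset = support(rows, [("A", True), ("B", False), ("C", True)])
s_premise = support(rows, [("A", True), ("B", False)])
print(s_itemset, s_premise)
```

In this encoding the rule A nonB → nonC is exact only when the itemset A nonB C has support 0, which is not the case in the toy rows.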
MIDOVA algorithm
• Principle:
   - The support (1) s of a k-itemset has a variability potential, given its sub-itemsets.
   - The MIDOVA residue Mr is a measure of the variability potential left to its super-itemsets.
   - k-itemsets are extracted as long as there are values of Mr for (k-1)-itemsets « not too close » to 0.
• For each k-itemset:
   - compute the variation interval (s_inf, s_sup) of the support s;
   - compute the residue Mr = 2^(k-1) · inf(|s_sup - s|, |s_inf - s|);
   - if Mr = 0, stop generating super-itemsets.
(1) Support: number of individuals which verify each variable of the itemset
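The per-itemset step can be sketched in Python as follows. This is an illustration under a stated assumption, not the authors' implementation: the slide does not say how the variation interval is obtained, so the sketch assumes it is the range the support can take when the supports of all sub-itemsets are held fixed. Under that assumption, raising s by t changes each cell of the 2^k table by +t or -t according to the parity of its number of negations, which gives the bounds below. The function name `midova_residue` and the toy data are hypothetical.

```python
from itertools import product

def midova_residue(rows, itemset):
    """Return (s, s_inf, s_sup, Mr) for a k-itemset of Boolean variables.

    Assumption: the variation interval is the range of the support when
    the supports of all sub-itemsets are kept fixed.
    """
    k = len(itemset)
    # Count the individuals in each of the 2^k cells (True = variable present).
    counts = {cell: 0 for cell in product((True, False), repeat=k)}
    for row in rows:
        counts[tuple(bool(row[v]) for v in itemset)] += 1
    s = counts[(True,) * k]
    # Cells with an even number of negations shrink when s decreases,
    # cells with an odd number of negations shrink when s increases.
    even = min(c for cell, c in counts.items() if cell.count(False) % 2 == 0)
    odd = min(c for cell, c in counts.items() if cell.count(False) % 2 == 1)
    s_inf, s_sup = s - even, s + odd
    mr = 2 ** (k - 1) * min(abs(s_sup - s), abs(s_inf - s))
    return s, s_inf, s_sup, mr

# Toy data (made up): 10 individuals, 2 Boolean variables.
rows = ([{"A": True, "B": True}] * 3 + [{"A": True, "B": False}] * 2
        + [{"A": False, "B": True}] * 1 + [{"A": False, "B": False}] * 4)
print(midova_residue(rows, ("A", "B")))
```

For k = 2 these bounds reduce to the familiar max(0, sA + sB - N) and min(sA, sB). An itemset with an empty cell, that is an exact rule, gets Mr = 0.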
Interpreting MIDOVA results:
• While k-itemsets are extracted levelwise (i.e. k=1, 2, …), 2 methods are available for interpreting:
   - Method 1: retaining only the « sterile » k-itemsets, those not producing (k+1)-itemsets (i.e. when exact association rules between the k variables appear)
   - Method 2: retaining only the significant k-itemsets
Method 1 MIDOVA
• 1-itemsets (Boolean variables): 18 are set up
   - Gender: 1 variable, H or F (if the variable is H, F = nonH)
   - Education grade: sd, cap, bep, bte, bac, dut, deg, sup
   - Occupation: AGR, ING, TEC, OQ, ONQ, CS, CM, EQ, ENQ
• 2-itemsets: 153 (= 18x17/2) extracted
   - 64 itemsets for exact uninteresting rules (AGR→non ING, sd→non bep, etc.)
   - 7 itemsets for exact interesting rules: bte→non ING, deg→non AGR, deg→non ONQ, deg→non OQ, dut→non ONQ, sd→non ING, sup→non AGR
   - 82 itemsets remain for building 3-itemsets
• 3-itemsets: 65 extracted
   - 20 for exact rules: « F, bac→non AGR »; « F, bte→non AGR »; « F, dut→non AGR », etc.
   - 45 itemsets remain for building 4-itemsets
• 4-itemsets: 0 extracted
• End of the itemset extraction
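The levelwise extraction behind these counts can be sketched in Python as below. This is a hedged illustration, not the authors' code. Two assumptions are made: the residue is computed with all sub-itemset supports held fixed (the slides do not spell this out), and a (k+1)-itemset is generated only when all of its k-subsets are non-sterile. The names `residue` and `levelwise` and the toy rows are hypothetical; each row is the set of variables that are true for one individual.

```python
from itertools import combinations, product

def residue(rows, itemset):
    """MIDOVA residue Mr of a k-itemset (assumed fixed-margins interval)."""
    k = len(itemset)
    counts = {cell: 0 for cell in product((True, False), repeat=k)}
    for row in rows:
        counts[tuple(v in row for v in itemset)] += 1
    even = min(c for cell, c in counts.items() if cell.count(False) % 2 == 0)
    odd = min(c for cell, c in counts.items() if cell.count(False) % 2 == 1)
    return 2 ** (k - 1) * min(even, odd)

def levelwise(rows, variables):
    """Return the sterile itemsets (Mr = 0), extracted level by level."""
    sterile = []
    candidates = [(v,) for v in sorted(variables)]
    while candidates:
        k = len(candidates[0])
        fertile = set()
        for itemset in candidates:
            if residue(rows, itemset) == 0:
                sterile.append(itemset)   # exact rule: no super-itemsets
            else:
                fertile.add(itemset)      # may still produce super-itemsets
        items = sorted({v for itemset in fertile for v in itemset})
        candidates = [c for c in combinations(items, k + 1)
                      if all(sub in fertile for sub in combinations(c, k))]
    return sterile

# Toy data (made up): nobody with "deg" works in "AGR".
rows = [{"F", "deg"}, {"deg"}, {"F", "AGR"}, {"AGR"}, {"F"}, set()]
print(levelwise(rows, {"F", "deg", "AGR"}))
```

On the toy rows the only sterile itemset is the pair (AGR, deg), which corresponds to the exact rule deg→non AGR; no 3-itemset is generated because one of its pairs is sterile.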
Method 1 MIDOVA
• Produces a coherent set of rules based on a principle of maximal interaction
   - These rules are easy to interpret
   - These rules are not redundant and their set is minimal for binary variables
• For example, the set of rules with AGR contains the 5 rules:
   - deg→non AGR, sup→non AGR
   - "F, bac→non AGR"; "F, bte→non AGR"; "F, dut→non AGR"
  but does not contain
   - "F, deg→non AGR" and "H, deg→non AGR" as a complement of, or instead of, the rule "deg→non AGR"
• But the set of 5 rules with AGR could be reduced to 2 rules if the links between the variables bte, bac, dut, etc. had been kept:
   - (graduate >= deg) → non AGR; "F, (graduate >= bte) → non AGR"
Description of the data (Haj Ali, D. & Cadot, M. 2010).
• National survey on the health of Tunisian women, conducted in 2001 by the « Office National de la Famille et de la Population de la Tunisie » (ONFP)
• 4087 couples, with women aged 15 to 54, and 157 binary variables describing the composition of the household; the lifestyle (housing, number of rooms, toilets, drinking-water supply, etc.); and the socio-economic features of each member of the household: their occupation, the stability of their situation, the age of the husband and wife, their education grade, their marital status, their place of residence, information about their parents and children, etc.
• Data were collected by Dhouha Haj Ali for a study of the marriage of women and the poverty of their households
MIDOVA results
• 5050 relations between 2 variables
   - 100 with poverty, of which 58 ** (p<0.01) and 9 * (p<0.05)
• 78 294 relations between 3 variables
   - 1772 with poverty, of which 74 ** (p<0.01) and 89 * (p<0.05)
• MIDOVA found interactions between the education grades of the husband, of the wife, and poverty.
   - ±2: very significant (positive, red: +2; negative, green: -2)
   - ±1: significant
   - ns: non-significant
Conclusion
• Association between variables in a dataset does not amount to a set of pairwise relations, except when higher-order interactions are absent.
• Brigitte Escofier was one of the few in the data analysis community who were aware of the problem, and she tackled it head on.
• But interaction is a difficult topic: we have listed 5 approaches to it, each with its pros and cons (Anova, Log-linear model, Escofier et al., Grossetête, MIDOVA).
• Our MIDOVA approach, currently limited to binary variables, takes into account large sets of variables and high-order interactions (>3). For details on our decomposition/reconstruction algorithm, see [Cadot Lelu 2010].
Thank you!
Bibliography
• Abdessemed, L. & Escofier, B. (2000). Analyse de l’interaction et de la variabilité inter et intra dans un tableau de fréquence ternaire. In Moreau, J., Doudin, P.-A. & Cazes, P. (eds), L’analyse des correspondances et les techniques connexes, pp. 146-164. Springer-Verlag, Berlin.
• Cadot, M. (2006). Extraire et valider les relations complexes en sciences humaines : statistiques, motifs et règles d’association. Ph.D. thesis, Université de Franche-Comté, France.
• Cadot, M. & Lelu, A. (2010). A Novel Decomposition Algorithm for Binary Datatables: Encouraging Results on Discrimination Tasks. Fourth IEEE International Conference on Research Challenges in Information Science (RCIS 2010), pp. 57-68.
• Escofier, B. (1983). Généralisation de l’analyse des correspondances à la comparaison de tableaux de fréquences. Rapport de Recherche Inria, Rennes, N°207.
• Escofier, B. & Pagès, J. (1988). Analyses factorielles simples et multiples. Dunod, Paris.
• Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco.
• Haj Ali, D. & Cadot, M. (2010). Estimation de l’impact de la décision du mariage sur la pauvreté des ménages tunisiens. MASHS 2010 (Lille, France), 10-11 juin, pp. 45-56.
• Mourad, G. (1983). Flux de pétrole et flux de marchandises entre l’OPEP et l’OCDE de 1970 à 1979. In Benzécri, J.-P. & collaborateurs (eds), Pratique de l’analyse des données, tome 5, économie, pp. 233-280.