Correspondence Analysis for Data Mining with Application in Medicine

Correspondence analysis for data
mining with applications in medicine
Annie Morin
IRISA France
[email protected]
WORKSHOP ZAGREB JUNE 2004
Correspondence analysis
• Statistical visualization method for displaying
the associations between the levels of a two-way
contingency table and the distances between the
categories of each variable => exploratory
method
• Usually, Chi-square test for independence in a
contingency table
CA
• Duality between the rows and the columns
• Use of the row profiles and of the column
profiles
• Use of chi-square distance (distributional
equivalence)
• Factorial analysis method (eigenvalues of an ad hoc matrix) and reduction of dimensionality
Example : Frequency table
           D1    D2    D3    D4   total
heart      11     3     1     3      18
forest     20     9     5    14      48
surgery     4     2     2     3      12
animal      1     2     3    16      21
total      37    16    11    36     100
Row-profiles
           D1    D2    D3    D4   mean prof
heart      31    16     8     9      18
forest     54    58    45    39      48
surgery    12    15    22     8      12
animal      5    11    25    44      22
total     100   100   100   100     100
Column profile
           D1    D2    D3    D4   total
heart      63    14     5    19     100
forest     42    19    10    29     100
surgery    37    20    20    24     100
animal      6     8    13    74     100
col prof   37    16    11    36     100
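The two profile tables come from dividing each cell either by its row total or by its column total. The sketch below, assuming NumPy is available, does this for the cells of the frequency table. The printed cells do not add up exactly to the printed margins (the surgery cells sum to 11 against a printed total of 12), so the margins are recomputed here and the percentages may differ slightly from the printed profiles. The variable names are illustrative only.

```python
import numpy as np

# Cell counts copied from the frequency table
# (rows: heart, forest, surgery, animal; columns: D1..D4).
N = np.array([
    [11, 3, 1, 3],
    [20, 9, 5, 14],
    [4, 2, 2, 3],
    [1, 2, 3, 16],
], dtype=float)

# Margins recomputed from the cells rather than copied from the table.
row_totals = N.sum(axis=1, keepdims=True)
col_totals = N.sum(axis=0, keepdims=True)

# Each cell as a percentage of its row total: every row sums to 100.
pct_of_row = 100 * N / row_totals
# Each cell as a percentage of its column total: every column sums to 100.
pct_of_col = 100 * N / col_totals

print(np.round(pct_of_row))
print(np.round(pct_of_col))
```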
[Figure: plot of the rows (heart, forest, surgery, animal) and of the columns (D1, D2, D3, D4)]
Distances
Between two columns
Between two rows
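CA measures these distances with the chi-square distance. As a minimal sketch, assuming the standard definition (squared differences between two profiles, each term weighted by the inverse of the corresponding average profile value), the distance between two rows or two columns could be computed as follows. The function names are hypothetical and the counts are the cells of the example frequency table.

```python
import numpy as np

def chi2_distance_rows(table, i, k):
    """Chi-square distance between rows i and k of a contingency table."""
    table = np.asarray(table, dtype=float)
    profiles = table / table.sum(axis=1, keepdims=True)  # each row sums to 1
    col_mass = table.sum(axis=0) / table.sum()           # average row profile
    diff = profiles[i] - profiles[k]
    return float(np.sqrt(np.sum(diff ** 2 / col_mass)))

def chi2_distance_cols(table, j, l):
    """Same distance between columns j and l (roles of rows and columns swapped)."""
    return chi2_distance_rows(np.asarray(table, dtype=float).T, j, l)

# Cells of the example table (rows: heart, forest, surgery, animal).
N = [[11, 3, 1, 3],
     [20, 9, 5, 14],
     [4, 2, 2, 3],
     [1, 2, 3, 16]]

d_heart_forest = chi2_distance_rows(N, 0, 1)
d_heart_animal = chi2_distance_rows(N, 0, 3)
d_D1_D4 = chi2_distance_cols(N, 0, 3)
print(d_heart_forest, d_heart_animal, d_D1_D4)
```

With these counts, heart lies closer to forest than to animal, which is what the profile tables suggest.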
• Diagonalization of a « covariance matrix » to
find the eigenvalues and corresponding
eigenvectors
• λ1 ≥ λ2 ≥ … ≥ λp
• Inertia of the cloud is ∑λi = χ² / n
• Distance to the independence model
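A minimal sketch of the diagonalization step, assuming NumPy and assuming that the matrix being diagonalized is S^T S, where S holds the standardized residuals from the independence model (a common formulation, not necessarily the exact matrix used on the slide). It checks numerically that the eigenvalues add up to the chi-square statistic divided by n.

```python
import numpy as np

# Cells of the example frequency table.
N = np.array([
    [11, 3, 1, 3],
    [20, 9, 5, 14],
    [4, 2, 2, 3],
    [1, 2, 3, 16],
], dtype=float)

n = N.sum()
P = N / n                      # relative frequencies
r = P.sum(axis=1)              # row masses
c = P.sum(axis=0)              # column masses
E = np.outer(r, c)             # independence model

# Standardized residuals: deviation from independence, cell by cell.
S = (P - E) / np.sqrt(E)

# Eigenvalues of the symmetric matrix S^T S, sorted in decreasing order.
eigvals = np.linalg.eigvalsh(S.T @ S)[::-1]

# Pearson chi-square statistic of the table.
chi2 = n * np.sum((P - E) ** 2 / E)

total_inertia = eigvals.sum()
print(eigvals)
print(total_inertia, chi2 / n)
```

The last eigenvalue is zero up to rounding, so a table with 4 rows and 4 columns has at most 3 informative axes.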
Simultaneous representation
• Of the row and column profiles on the same
factorial plane
• Validity of representation :
– Inertia : contributions give the proportion of an axis's
inertia (variance) explained by each element (row or
column profile) in building that axis
– Quality of representation of each element by the axes
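The coordinates, contributions and qualities of representation can be sketched with a singular value decomposition. This is one standard way to compute them, assuming NumPy; the names `F`, `G`, `row_contrib` and `row_cos2` are illustrative, and the quality is taken here to be the squared cosine between a profile and an axis.

```python
import numpy as np

# Cells of the example frequency table.
N = np.array([
    [11, 3, 1, 3],
    [20, 9, 5, 14],
    [4, 2, 2, 3],
    [1, 2, 3, 16],
], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)                       # row masses
c = P.sum(axis=0)                       # column masses
E = np.outer(r, c)
S = (P - E) / np.sqrt(E)                # standardized residuals

U, sv, Vt = np.linalg.svd(S, full_matrices=False)
k = min(N.shape) - 1                    # number of non-trivial axes
U, sv, Vt = U[:, :k], sv[:k], Vt[:k]

# Principal coordinates of rows and columns on the same axes.
F = (U * sv) / np.sqrt(r)[:, None]
G = (Vt.T * sv) / np.sqrt(c)[:, None]

# Contribution of each row to the inertia of each axis (each column sums to 1).
row_contrib = r[:, None] * F ** 2 / sv ** 2

# Quality of representation of each row on each axis (each row sums to 1).
row_cos2 = F ** 2 / (F ** 2).sum(axis=1, keepdims=True)

print(np.round(F[:, :2], 3))            # rows on the first factorial plane
print(np.round(G[:, :2], 3))            # columns on the same plane
print(np.round(row_contrib, 3))
print(np.round(row_cos2, 3))
```

Plotting the first two columns of `F` and `G` together gives the simultaneous representation on the first factorial plane.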
Applications in medicine
• Pharmacology
• Therapeutic trials (to avoid double-blind
procedures) : CA allows the physician to follow
the evolution of the illness and/or of the therapy
• Textual analysis : reports, business intelligence,
bibliometrics
Application on mucoviscidosis
• Mucoviscidosis (cystic fibrosis) : rare disease
– No specific keywords
– No specific journals
• Goal : To define a minimum common
vocabulary for the researchers working on
mucoviscidosis (clinicians, geneticists, etc.)
HYPOTHESIS :
THE TYPICAL WORDS FOR A GIVEN TOPIC ARE
INDEPENDENT OF THE TECHNIQUES
SURGEON WORDS
GENETICS WORDS
TOPIC WORDS
Processing
• First step of the study : to create a “kernel” base
which contains the references of scientific
documents used by people working on the
disease => 612 publications
• 30 axes with a positive side and a negative one
• Each side of each axis is characterized by the
words with a high relative contribution to the
inertia (greater than a threshold).
DATA
• Two-way table crossing the 612 documents
(abstracts) and 850 words
• CA on this two-way table
Dimension of a word
• The words of a topic are one-dimensional
• The words of a field are multidimensional
• The dimension of a word is the number of axes
on which this word has a high relative
contribution to inertia
• If we want to find the minimum common
vocabulary, the dimension of a word must be
high
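The definition of the dimension of a word can be sketched as a simple count. The function name, the threshold value and the contribution figures below are all made up for illustration; in the study the contributions would come from the CA of the documents × words table over the 30 axes.

```python
def word_dimensions(words, relative_contrib, threshold):
    """Count, for each word, the axes on which its relative contribution
    to inertia exceeds the threshold."""
    return {
        word: sum(1 for value in row if value > threshold)
        for word, row in zip(words, relative_contrib)
    }

# Made-up relative contributions of three words on four axes (illustration only).
words = ["word_a", "word_b", "word_c"]
relative_contrib = [
    [0.30, 0.02, 0.25, 0.01],   # high on two axes
    [0.01, 0.40, 0.03, 0.02],   # high on one axis
    [0.15, 0.20, 0.12, 0.18],   # high on all four axes
]

dims = word_dimensions(words, relative_contrib, threshold=0.10)
print(dims)
```

Under the slide's reading, `word_b` behaves like a topic word (one-dimensional) and `word_c` like a candidate for the common vocabulary.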
MUCOVISCIDOSIS BASE
[Figure: factorial map of the words of the mucoviscidosis base. Words shown:
EXON, ALLELES, CBAVD, MUTATIONS, NOVEL, DEFERENS, FAMILIES, IDENTIFICATION,
CONGENITAL, ALLELE, CODING, SCREENING, POPULATION, ELECTROPHORESIS, MUTATION,
PCR, DETECTION, DELTAF, DIAGNOSIS, DNA, GENE,
VENTRICULAR, LEFT, HYPERTENSION, TRANSPLANTATIONS, FAILURE, DOUBLE, HEART,
LIVER, FOLLOW, CASES, COMPLICATIONS, CHILDREN, LUNG, PULMONARY, REJECTION,
MEAN, TREATMENT,
ANALYSIS, DELTA, REGULATOR, CF, CFTR, CONDUCTANCE, EXPRESSION, PROTEIN, HUMAN,
CELLS, ACTIVITY, CELL, MEMBRANE, ALPHA, TRANSPORT, APICAL, ELASTASE, INDUCED,
ATP, CHANNEL, MU, SECRETION, CHANNELS, INHIBITOR, CA, BILE]
81 words have a dimension greater
than 10
ACID             ADENOSINE        ADENOVIRUS       ADHESION
AERUGINOSA       ALPHA            ALVEOLAR         AMILORIDE
ANTIGENS         ANTITRYPSIN      ASPERGILLOSIS    ATP
AUREUS           BRONCHIAL        CAMP             CASES
CELL             CELLS            CFTR             CHANNEL
CHANNELS         CHILDREN         CHROMOSOME       CIRRHOSIS
CONCENTRATIONS   CYSTIC           DIAGNOSIS        DOUBLE
DRUG             ELASTASE         ELASTIN          EMPHYSEMA
EPITHELIUM       EXPRESSION       FETAL            FIBROSIS
FLOW             FLUID            HLA              INHIBITOR
LEFT             LIVER            LUNG             MARKERS
MUCIN            MUCINS           MUCUS            MUTATIONS
NASAL            NEONATAL         NEUTROPHILS      PATCHES
PEPTIDE          PERFORMANCE      PLASMA           PNEUMONIA
PRENATAL         PROPERTIES       PROTEASE         PROTEIN
PROTEINASE       PSEUDOMONAS      RAT              RATS
RECEPTOR         RECEPTORS        REJECTION        RIGHT
SCREENING        SECRETION        SECRETIONS       SPUTUM
STRAINS          THERAPY          TRANSFER         TRANSPLANTATION
TRANSPORT        TRYPSIN          VENTRICULAR      VIVO
WATER
Is a high dimension a sufficient
condition to characterize the disease?
To check it, we use other thematic
databases and in each of them, we
count the number of documents with
at least two words among the
previous 81 words.
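The check described above can be sketched as follows. The tokenization (lowercase alphabetic tokens) and the reading of "at least two words" as two distinct words are assumptions, the function name is hypothetical, and the three documents are toy examples, not items from the real databases.

```python
import re

def count_retrieved(documents, vocabulary, min_hits=2):
    """Number of documents containing at least `min_hits` distinct
    words from `vocabulary` (case-insensitive)."""
    vocab = {word.lower() for word in vocabulary}
    retrieved = 0
    for text in documents:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        if len(tokens & vocab) >= min_hits:
            retrieved += 1
    return retrieved

# Toy documents and a small subset of the word list, for illustration only.
documents = [
    "CFTR mutations in cystic fibrosis patients",
    "Breast cancer screening programme",
    "Pseudomonas aeruginosa in sputum",
]
vocabulary = ["CFTR", "CYSTIC", "FIBROSIS", "PSEUDOMONAS",
              "AERUGINOSA", "SPUTUM", "SCREENING"]

n_retrieved = count_retrieved(documents, vocabulary)
print(n_retrieved, "of", len(documents), "documents retrieved")
```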
5 thematic databases
• BREAST CANCER : 9871 documents
• POLYAMINES : 12726 documents
• TUMOR-INFILTRATING LEUCOCYTES : 586 documents
• ACUTE LYMPHOBLASTIC LEUKEMIA : 2063 documents
• MUCOVISCIDOSIS : 612 documents
RETRIEVAL STATISTICS WITH
THE 81 WORDS
SUBJECT                          Retrieval rate   Db size
Mucoviscidosis                     612 (100%)         612
Acute lymphoblastic leukemia      1990 (96%)         2063
Polyamines                       11912 (94%)        12726
Breast cancer                     8728 (88%)         9871
Tumor-infiltrating leucocytes      546 (93%)          586
Total                            23788 (92%)        25858
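The percentages in this table are the retrieved counts divided by the database sizes. A short sketch that recomputes them from the figures above (the dictionary layout and variable names are illustrative):

```python
# (retrieved documents, database size) for the 81-word list, from the table.
retrieval = {
    "Mucoviscidosis": (612, 612),
    "Acute lymphoblastic leukemia": (1990, 2063),
    "Polyamines": (11912, 12726),
    "Breast cancer": (8728, 9871),
    "Tumor-infiltrating leucocytes": (546, 586),
}

# Retrieval rate per database, as a rounded percentage.
rates = {
    subject: round(100 * found / size)
    for subject, (found, size) in retrieval.items()
}

# Overall rate across the five databases.
total_found = sum(found for found, _ in retrieval.values())
total_size = sum(size for _, size in retrieval.values())
overall_rate = round(100 * total_found / total_size)

for subject, rate in rates.items():
    print(f"{subject}: {rate}%")
print(f"Total: {total_found} / {total_size} = {overall_rate}%")
```

Every database is retrieved at a high rate, so the 81 words alone do not single out the mucoviscidosis base.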
CA of the 5 databases and 81 words
[Figure: factorial map of the 5 databases and the 81 words.
Databases: BASE LAL (acute lymphoblastic leukemia), BASE CANCER SEIN (breast
cancer), BASE TIL (tumor-infiltrating leucocytes), BASE POLYAMINE, BASE MUCO
(mucoviscidosis).
Words shown: HLA, antigens, cases, screening, diagnosis, therapy, chromosome,
flow, transplantation, adhesion, receptor, expression, children, right, left,
lung, mutations, aspergillosis, cell, neutrophils, mucins, vivo, epithelium,
drug, protein, rejection, pneumonia, plasma, secretion, alpha, peptide,
alveolar, acid, transfer, protease, inhibitor, ATP, ventricular, adenovirus,
adenosine, CAMP, prenatal, proteinase, strains, transport, channel, cirrhosis,
antitrypsin, neonatal, aureus, bronchial, pseudomonas, secretions, amiloride,
patches, aeruginosa, elastase, sputum, fibrosis, nasal, cystic, elastin, mucus,
emphysema, CFTR]
20 remaining words
adenovirus
aeruginosa
amiloride
antitrypsin
aureus
bronchial
CFTR
cirrhosis
cystic
elastase
elastin
emphysema
fibrosis
mucus
nasal
patches
proteinase
pseudomonas
secretions
sputum
Retrieval statistics with these 20
words
SUBJECT              Retrieval rate   Db size
Mucoviscidosis       550 (89.9%)          612
Leukemia              38 (1.8%)          2063
Polyamines           341 (2.7%)         12726
Breast cancer        202 (2.1%)          9871
Tumor Infilt. Leu.     9 (1.5%)           586
Conclusion
• CA is a very powerful method to display the
association among variables
• It can be used with large datasets (one of the
dimensions must be « tractable »)
• Thanks to Michel Kerbaol for allowing me to
use his data on mucoviscidosis
• [email protected]
• Software : Qnomis