Practice of Data Mining - University of California, Los

Challenges and Techniques for
Mining Clinical data
Wesley W. Chu
Laura Yu Chen
Outline



Introduction of SmartRule association
rule mining
Case I: mining pregnancy data to
discover drug exposure side effects
Case II: mining urology clinical data for
operation decision making
SmartRule Features

Generate maximal frequent itemsets (MFIs) directly from tabular data


Users select MFIs for rule generation


Users can select a subset of MFIs that include
certain attributes as targets in rule generation
Derive rules from targeted MFIs


Reduce the search space and the support counting
time by taking advantage of column structures
Efficient support-counting by building inverted
indices for the collection of itemsets
Hierarchically organize rules into trees and
use spreadsheet to present the rule trees
System overview of SmartRule
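[Figure: SmartRule system overview. Labeled components: domain experts; an Excel book holding the Data, MFI, Rules, and Config sheets; TMaxMiner (computes MFIs from tabular data); InvertCount (expands MFIs to FIs and counts support); FI supports; RuleTree (generates and organizes rules). The steps in the figure are numbered 1 to 6.]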
Computation Complexity

Efficient MFI mining:



Does not require superset checking
Gathers past tail information to determine
the next node to explore during the mining
process
Efficient rule generation:

Reduces the computation for support counting by building inverted indices
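A minimal sketch of support counting with an inverted index over a collection of itemsets, as the bullet above describes. It is an illustration under assumptions, not the InvertCount implementation: the function name `count_supports`, the item names, and the toy rows are all hypothetical. The index maps each item to the itemsets that contain it, so one pass over the data touches only the itemsets that share an item with the current row.

```python
from collections import defaultdict

def count_supports(rows, itemsets):
    """Return the support count of each itemset after one pass over the rows."""
    # Inverted index: item -> ids of the itemsets that contain the item.
    index = defaultdict(list)
    for set_id, itemset in enumerate(itemsets):
        for item in itemset:
            index[item].append(set_id)

    supports = [0] * len(itemsets)
    for row in rows:
        # Count how many items of each candidate itemset occur in this row.
        hits = defaultdict(int)
        for item in row:
            for set_id in index.get(item, ()):
                hits[set_id] += 1
        # A row supports an itemset only if it contains all of its items.
        for set_id, matched in hits.items():
            if matched == len(itemsets[set_id]):
                supports[set_id] += 1
    return supports

# Hypothetical toy data: each row is the set of items present in one record.
rows = [{"cita_T3", "alcohol", "preterm"}, {"cita_T3"}, {"alcohol"}, set()]
itemsets = [frozenset({"cita_T3", "preterm"}), frozenset({"alcohol"}), frozenset({"cita_T3"})]
supports = count_supports(rows, itemsets)
print(supports)
```

Dividing each count by `len(rows)` gives the relative support used in the rules later in the deck.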
Scalability


Limitation: a Microsoft Excel spreadsheet
holds at most 65,536 rows
When the dataset exceeds the
spreadsheet size limit:


Partition the dataset into multiple groups of
the maximum spreadsheet size to derive
MFIs for each spreadsheet
Then join these MFIs for generating
association rules
Case I:
Mining Pregnancy Data



Data set: Danish National Birth Cohort (DNBC)
Dimension: 4455 patients x 20 attributes
Each patient record contains:



Exposure status: drug type, timing, and sequence
of different drugs
Possible confounders: vitamin intake, smoking,
alcohol consumption, socio-economic status and
psycho-social stress
Endpoint: preterm birth, malformations and
prenatal complications
Sample Pregnancy Data
Challenges

Problem: discover side effects of drug
exposure during pregnancy


E.g.: study how the antidepressants and
confounders influence the preterm birth of the
new-born
Difficulties in finding side effects:



Only a small number of patients suffer side effects
Sensitive to the drug exposure time
Exposure to sequence of multiple drugs
Derive Drug Side Effects via SmartRule(1):
low-support low-confidence rules
Low-support or low-confidence rules can
still be significant because of their contrast
with normal pregnant women


For example:


If patients exposed to cita in the 3rd trimester, then
have preterm birth with support=0.0011,
confidence=0.1786
If patients not exposed to cita, then have preterm
birth with support=0.0433, confidence=0.0444
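The contrast between the two rules above can be made explicit by dividing their confidences. The sketch below uses only the two confidence values reported on this slide; the function name `confidence_ratio` is an assumption introduced for illustration, and the ratio is a descriptive comparison, not a significance test.

```python
def confidence_ratio(conf_exposed, conf_unexposed):
    """Ratio of the preterm-birth confidence with exposure to that without."""
    return conf_exposed / conf_unexposed

# Confidence values as reported on the slide.
conf_cita_3rd_trimester = 0.1786
conf_no_cita = 0.0444

ratio = confidence_ratio(conf_cita_3rd_trimester, conf_no_cita)
print(f"{ratio:.2f}")  # roughly 4: preterm birth is about four times as frequent among the exposed
```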
Derive Drug Side Effects via SmartRule(2):
temporal sensitive rules
Divide the pregnancy period into time slots (e.g.
trimester) and combine drug exposure by time:





If patients exposed to cita in the 1st trimester and drink
alcohol, then have preterm birth with support=0.0011
and confidence=0.132
If patients exposed to cita in the 2nd trimester and drink
alcohol, then have preterm birth with support=0.0011 and
confidence=0.417
If patients exposed to cita in the 3rd trimester and drink
alcohol, then have preterm birth with support=0.0009 and
confidence=0.364
Time slot division is flexible: domain users can
control the granularity
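One way to combine drug exposure with time, sketched under assumptions: the slot boundaries (weeks 13 and 27 for trimesters), the item naming scheme `drug_slotN`, and the function names are hypothetical and are not taken from SmartRule. Changing `boundaries` changes the granularity of the time slots.

```python
from bisect import bisect_left

def time_slot(week, boundaries=(13, 27)):
    """Return the 1-based slot of a gestational week; each boundary is the last week of a slot."""
    return bisect_left(boundaries, week) + 1

def temporal_items(drug, exposure_weeks, boundaries=(13, 27)):
    """Turn the weeks of exposure to one drug into time-stamped items."""
    return sorted({f"{drug}_slot{time_slot(w, boundaries)}" for w in exposure_weeks})

# Hypothetical patient exposed to cita in weeks 5, 9 and 30.
items = temporal_items("cita", [5, 9, 30])
print(items)

# A finer division into four slots, chosen by the domain user.
finer = temporal_items("cita", [5, 9, 30], boundaries=(10, 20, 30))
print(finer)
```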
Rule Presentation

Hierarchically organize
rules into trees


View general rules and
then extend to specific
rules
Use spreadsheet to
present the rule trees

Easy to sort, filter or
extend the rule trees to
search for the interesting
rules
1) In general, patients have preterm birth (sup=0.0454, conf=0.0454)
    2) If exposed to cita in the 1st trimester, then preterm birth (sup=0.0016, conf=0.0761)
        6) If exposed to cita in the 1st trimester and drink alcohol, then preterm birth (sup=0.0011, conf=0.132)
    3) If exposed to cita in the 2nd trimester, then preterm birth (sup=0.0013, conf=0.1714)
        7) If exposed to cita in the 2nd trimester and drink alcohol, then preterm birth (sup=0.0011, conf=0.417)
    4) If exposed to cita in the 3rd trimester, then preterm birth (sup=0.0011, conf=0.1786)
        8) If exposed to cita in the 3rd trimester and drink alcohol, then preterm birth (sup=0.0009, conf=0.364)
    5) If no exposure to cita, then preterm birth (sup=0.0433, conf=0.0444)
Part of the rule hierarchy relating exposure to the
antidepressant citalopram and alcohol at different time
periods of pregnancy to preterm birth
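A minimal sketch of how rules can be arranged into such a hierarchy: each rule is attached to the rule whose condition set is its largest proper subset. The rule numbers and confidences are those of the hierarchy above; the item names (`cita_T1`, `alcohol`, and so on) and the function `build_tree` are assumptions made for illustration, and ties between equally specific parents are broken arbitrarily here.

```python
# rule id -> (set of conditions, confidence of "then preterm birth")
rules = {
    1: (frozenset(), 0.0454),
    2: (frozenset({"cita_T1"}), 0.0761),
    3: (frozenset({"cita_T2"}), 0.1714),
    4: (frozenset({"cita_T3"}), 0.1786),
    5: (frozenset({"no_cita"}), 0.0444),
    6: (frozenset({"cita_T1", "alcohol"}), 0.132),
    7: (frozenset({"cita_T2", "alcohol"}), 0.417),
    8: (frozenset({"cita_T3", "alcohol"}), 0.364),
}

def build_tree(rules):
    """Attach every rule to the most specific rule that generalizes it."""
    children = {rule_id: [] for rule_id in rules}
    for rule_id, (conditions, _) in rules.items():
        # Candidate parents have a strictly smaller condition set.
        parents = [other for other, (c, _) in rules.items() if c < conditions]
        if parents:
            parent = max(parents, key=lambda other: len(rules[other][0]))
            children[parent].append(rule_id)
    return children

def print_tree(rules, children, node, depth=0):
    conditions, conf = rules[node]
    label = " and ".join(sorted(conditions)) or "in general"
    print("    " * depth + f"{node}) {label}: conf={conf}")
    for child in children[node]:
        print_tree(rules, children, child, depth + 1)

children = build_tree(rules)
print_tree(rules, children, node=1)
```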
Knowledge Discovery from
Data Mining Results

Challenges:


Examining the vast number of rules
manually is too labor-intensive
Exploring knowledge (rules) without
specific goal
Existing approach:
Top-down in Rule Hierarchy


Association rules are represented in general rules,
summaries and exception rules (GSE patterns). The
GSE pattern presents the discovered rules in a
hierarchical fashion. Users can browse the hierarchy
from top-down to find interesting exception rules.
Due to the low occurrence of drug side effects,
interesting rules are exception rules and reside at
the lower level of the hierarchy. Without user
guidance, it requires exploration of the entire GSE
hierarchy to locate the interesting exception rules.
Reference:

B. Liu, M. Hu, and W. Hsu, "Multi-level organization and summarization of the discovered rules,"
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,
Aug, 2000, Boston, USA.
B. Liu, M. Hu, and W. Hsu, "Intuitive representation of decision trees using general rules and
exceptions," Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000),
July 30 - Aug 3, 2000, Austin, Texas, USA.
New effective bottom up technique
to find exception rules

Derive a set of seed attributes from
high-confidence rules

For example, given high-conf rule:
If exposed to Anxio in the pre, in and post time
and use tobacco and have symptoms of
depression, then have preterm birth with
confidence = 0.6

List of seed attributes: Anxio_pre, Anxio_in,
Anxio_post, tobacco and symptoms of
depression
Using seed attributes to explore
exception rules via rule hierarchy

Explore more rules based on these seed
attributes in the rule hierarchies


First look for rules that represent effect of each
single seed attribute on preterm birth
Then further explore the combination of multiple
seed attributes
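The bottom-up exploration can be sketched as follows, assuming rules are represented by their condition sets. The seed attributes are the ones listed in the example above (with "symptoms of depression" shortened to `depression`); the rule pool, the attribute `vitamin`, and the helper `rules_on_seeds` are hypothetical and only illustrate the filtering order: single seed attributes first, then combinations.

```python
# Conditions of the high-confidence rule from the example (confidence = 0.6).
high_conf_conditions = frozenset(
    {"Anxio_pre", "Anxio_in", "Anxio_post", "tobacco", "depression"}
)
seeds = high_conf_conditions

# Hypothetical pool of rule condition sets found in the rule hierarchy.
rule_pool = [
    frozenset({"tobacco"}),
    frozenset({"Anxio_in"}),
    frozenset({"vitamin"}),
    frozenset({"Anxio_in", "tobacco"}),
    frozenset({"tobacco", "vitamin"}),
]

def rules_on_seeds(rule_pool, seeds, size):
    """Rules with exactly `size` conditions, all of them seed attributes."""
    return [c for c in rule_pool if len(c) == size and c <= seeds]

# Step 1: effect of each single seed attribute on preterm birth.
singles = rules_on_seeds(rule_pool, seeds, size=1)
# Step 2: combinations of multiple seed attributes.
pairs = rules_on_seeds(rule_pool, seeds, size=2)
print(singles)
print(pairs)
```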
[Figure: seed attributes are derived from a high-confidence rule and then used to explore the rule hierarchies.]
New Findings from Data
Mining
Finding: combined exposure to citalopram and alcohol in
pregnancy is associated with an increased risk of preterm
birth (see rules 6 to 8 in the rule hierarchy for
citalopram and alcohol)
Not initially discovered by the epidemiological study
because of the large number of combinations among all
the attributes and their values
Statistical Analysis VS. Data Mining
Statistical analysis
Infeasible to test all potential hypotheses for a large number of attributes
Testing hypotheses with a small sample size has limited statistical power

Data mining
No hypothesis needed; mines associations in a large dataset with multiple temporal attributes
Can generate association rules independent of the sample size
Derives rules with temporal information of drug exposure
Case II:
Mining Urology Clinical Data


Data set: urology surgeries performed
from 1995 to 2002 at the UCLA
Pediatric Urology Clinic
Dimension: 130 patients x 28 attributes
Bladder Body & Bladder Neck
Training Data Attributes

Each patient record contains:

Pre-operative conditions:
    Demography data: age, gender, etc.
    Patient ambulatory status (A)
    Catheterizing skills (CS)
    Amount of creatinine in the blood (SerumCrPre)
    Leak point pressure (LPP)
    Urodynamics, such as the minimum volume of saline infused into a
    bladder when its pressure reached 20 cm of water (20%min)
Type of surgery performed:
    Op-1: Bladder Neck Reconstruction with Augmentation
    Op-2: Bladder Neck Reconstruction without Augmentation
    Op-3: Bladder Neck Closure without Augmentation
    Op-4: Bladder Neck Closure with Augmentation
Post-op complications: infection, complication, etc.
Final outcome of the surgery: urine continence (wet or dry)
Sample of Urology Clinical Data
Goals and Challenges

Goal:



Derive a set of rules from the clinical data set
(training set) that summarize the outcome based
on patients’ pre-op data
Predict operation outcome based on a given
patient’s pre-op data (test set), and recommend
the best operation to perform
Challenge:


Small sample size, large number of attributes
Continuous-value attributes such as uro-dynamics
measurements
Data Mining Steps




1. Separate the patients into four groups based on
their type of surgery performed
2. In each group, partition the continuous value
attributes into discrete intervals or cells. Since the
sample size is very small, we use a hybrid technique
to determine the optimal number of cells and cell
sizes.
3. Generate association rules for each patient group
based on the partitioned continuous-value attributes
4. For a given patient with a specific set of pre-op
conditions, the generated rules from the training set
can be used to predict success or failure rate for a
specific operation
Partitioning Continuous Value
Attributes

Current approaches to partitioning continuous attributes:



Domain expert guidance can be biased and
inconsistent
Statistical clustering techniques fail when the training set
is small and the number of attributes is large
New hybrid approach:


Use a data mining technique to select a small set of key
attributes
Use a statistical classification technique to perform the
optimal partition (determine the cell sizes and the number of
cells) for the small set of key attributes
Hybrid Clustering Technique

Select a small key attribute set (via data mining):



Use domain expert partition to perform mining on the
training set
Select a set of key attributes that contribute to high
confidence and support rules
Optimal partition (via statistical classification)

Use statistical classification techniques (e.g. CART) to
determine the optimal number of cells and their
corresponding cell sizes for the attributes
Mining optimally partitioned attribute data yields better quality
rules
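To illustrate the kind of cut point a classification technique such as CART looks for, the sketch below finds a single threshold on one attribute that minimizes the weighted Gini impurity of the outcome. It is a simplified stand-in, not CART itself (CART applies such splits recursively and prunes the result), and the LPP values and outcomes are hypothetical, not taken from the clinical data.

```python
def gini(labels):
    """Gini impurity of a list of binary outcomes (1 = success, 0 = fail)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Threshold between two adjacent values that gives the purest two cells."""
    pairs = sorted(zip(values, labels))
    best_score, best_threshold = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_score = score
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold

# Hypothetical attribute values and surgery outcomes for six patients.
lpp = [10, 15, 18, 25, 30, 40]
success = [1, 1, 1, 0, 0, 0]
threshold = best_split(lpp, success)
print(threshold)
```

Applying the same search again inside each resulting cell would yield more cells; when to stop is where the small sample size matters.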
Partition of continuous
variables for operations

Partition of continuous variables into optimal number of
discrete intervals (cells) and cell sizes for four types of
operations.
Operation Type 1
Cell#  LPP         SerumCrPre
1      [0, 19]     [0, 0.75]
2      (19, 33.5]  [0.75, 2.2]
3      (33.5, 40]  n/a
4      normal      n/a

Operation Type 2
Cell#  20%min      20%mean     30%min      30%mean     LPP       SerumCrPre
1      [80, 118]   [50, 77]    [100, 170]  [51, 51]    [12, 20]  [0, 0.5]
2      [145, 178]  [88, 104]   [206, 241]  [94, 113]   [24, 36]  [0.7, 1.4]
3      [221, 264]  [135, 135]  n/a         [135, 135]  normal    n/a

Operation Type 3
Cell#  20%min      20%mean     30%min      30%mean     LPP       SerumCrPre
1      [103, 130]  [57, 75]    [129, 157]  [86, 93]    [6, 29]   [0.3, 0.7]
2      [156, 225]  [92, 105]   [188, 223]  [100, 121]  [30, 40]  [1.0, 1.5]

Operation Type 4
Cell#  LPP       20%mean
1      [0, 19]   [0, 33.37]
2      (19, 69]  (33.37, 37.5]
3      normal    (37.5, 52]
4      n/a       (52, 110]
Recommending operation based on
rules derived from training set



Transform the patient’s pre-op data for the continuous-value
attributes using the optimal partitions for each
operation
Find the set of rules (from the training set) that matches
the patient’s pre-op data
Compare the matched rules from each operation and
recommend the type of surgery that provides the best
match
Example: Prediction for Matt
Attribute               Value
Ambulatory Status (A)   4
Cath Skills (CS)        1
SerumCrPre              0.5
20%min                  31
20%mean (M)             20
30%min                  50
30%mean                 33
LPP                     27
UPP                     unknown
Patient Matt’s pre-operative conditions
Surgery  A  CS  SerumCrPre  20%min  20%mean (M)  30%min  30%mean  LPP
Op-1     4  1   1           n/a     n/a          n/a     n/a      2
Op-2     4  1   1           <1      <1           <1      <1       2
Op-3     4  1   1           <1      <1           <1      <1       1
Op-4     4  1   n/a         n/a     1            n/a     n/a      2
Discretized pre-operative conditions of patient Matt.
Attributes not used in rule generation are denoted as n/a.
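The discretization step can be sketched for Operation Type 4, using the numeric LPP and 20%mean intervals from the partition slide and Matt's raw values (LPP = 27, 20%mean = 20). The data layout and the function `cell_of` are assumptions made for illustration; the non-numeric cells ("normal", "n/a") are left out of the sketch.

```python
# (low, high, low_closed): the upper bound is always inclusive, as in the tables.
OP4_CELLS = {
    "LPP": [(0, 19, True), (19, 69, False)],
    "20%mean": [(0, 33.37, True), (33.37, 37.5, False), (37.5, 52, False), (52, 110, False)],
}

def cell_of(value, cells):
    """Return the 1-based cell number containing value, or None if no cell does."""
    for number, (low, high, low_closed) in enumerate(cells, start=1):
        above_low = value >= low if low_closed else value > low
        if above_low and value <= high:
            return number
    return None

# Matt's raw pre-operative values for the two attributes Op-4 uses.
matt = {"LPP": 27, "20%mean": 20}
discretized = {name: cell_of(matt[name], cells) for name, cells in OP4_CELLS.items()}
print(discretized)  # LPP falls in cell 2 and 20%mean in cell 1, as in the Op-4 row
```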
Rule trees selected from the knowledge base
that match patient Matt’s pre-op profile
Surgery  Conditions                    Outcome  Support  Support(%)  Confidence
Op-1     CS=1                          Success  10       41.67       0.77
Op-1     CS=1 and LPP=2                Success  3        12.5        0.75
Op-2     CS=1 and LPP=2                Fail     2        16.67       0.67
Op-2     20%min=1 and LPP=2            Fail     2        16.67       0.67
Op-3     CS=1 and SerumCrPre=1         Success  5        50          0.83
Op-3     CS=1, SerumCrPre=1 and LPP=1  Success  2        20          1
Op-4     A=4                           Success  14       32.55       0.78
Op-4     A=4 and CS=1                  Success  11       25.58       0.79
Op-4     A=4, CS=1 and LPP=2           Success  8        18.6        0.8
Op-4     A=4, CS=1 and M=1             Success  6        13.95       1
Op-4     A=4, CS=1, M=1 and LPP=2      Success  6        13.95       1
Based on the rule tree, we note that Operations 3 and 4 both match patient
Matt’s pre-op conditions. However, Operation 4 matches more attributes in
Matt’s pre-op conditions than Operation 3. Thus, Operation 4 is more
desirable for patient Matt.
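The comparison can be sketched as follows, under stated assumptions: the grouping of rules by operation is read off the rule table (support counts and percentages), the selection criterion (most matched attributes among rules predicting Success) is one plausible reading of "best match", and the names `RULES`, `PROFILE`, and `best_success_match` are hypothetical.

```python
# Rules per operation: (conditions on discretized attributes, predicted outcome).
RULES = {
    "Op-1": [({"CS": 1}, "Success"), ({"CS": 1, "LPP": 2}, "Success")],
    "Op-2": [({"CS": 1, "LPP": 2}, "Fail"), ({"20%min": 1, "LPP": 2}, "Fail")],
    "Op-3": [({"CS": 1, "SerumCrPre": 1}, "Success"),
             ({"CS": 1, "SerumCrPre": 1, "LPP": 1}, "Success")],
    "Op-4": [({"A": 4}, "Success"), ({"A": 4, "CS": 1}, "Success"),
             ({"A": 4, "CS": 1, "LPP": 2}, "Success"),
             ({"A": 4, "CS": 1, "M": 1}, "Success"),
             ({"A": 4, "CS": 1, "M": 1, "LPP": 2}, "Success")],
}

# Matt's discretized profile under each operation's partition (n/a entries omitted).
PROFILE = {
    "Op-1": {"A": 4, "CS": 1, "SerumCrPre": 1, "LPP": 2},
    "Op-2": {"A": 4, "CS": 1, "SerumCrPre": 1, "20%min": "<1", "LPP": 2},
    "Op-3": {"A": 4, "CS": 1, "SerumCrPre": 1, "20%min": "<1", "LPP": 1},
    "Op-4": {"A": 4, "CS": 1, "M": 1, "LPP": 2},
}

def matches(conditions, profile):
    """A rule matches when every condition equals the patient's cell."""
    return all(profile.get(attr) == cell for attr, cell in conditions.items())

def best_success_match(rules, profile):
    """Largest number of attributes matched by a rule that predicts Success."""
    sizes = [len(cond) for cond, outcome in rules
             if outcome == "Success" and matches(cond, profile)]
    return max(sizes, default=0)

best = {op: best_success_match(RULES[op], PROFILE[op]) for op in RULES}
recommended = max(best, key=best.get)
print(best, recommended)
```

Under this criterion Op-3 matches three attributes and Op-4 matches four, which agrees with the preference for Operation 4 stated above.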
Representing rules in a
hierarchical structure

Favorable user feedback in using the spreadsheet
interface because of its ease in rule searching and
sorting
A=4 -> Success (sup=32.55%, conf=0.78)
    A=4, CS=1 -> Success (sup=25.58%, conf=0.79)
        A=4, CS=1, LPP=2 -> Success (sup=18.6%, conf=0.8)
        A=4, CS=1, M=1 -> Success (sup=13.95%, conf=1)
            A=4, CS=1, M=1, LPP=2 -> Success (sup=13.95%, conf=1)
Rule tree for Op-4, represented in a spreadsheet
Lessons learned from mining data
with small sample size


For small sample sizes, hybrid clustering yields
better results than conventional unsupervised
clustering techniques
Hybrid clustering enables us to generate
useful rules for small sample sizes, which
could not be done using data mining or
statistical classification methods alone
Conclusion

Mining pregnancy data:


Discover drug exposure side effects (association)
Advantage over traditional statistical approaches:





Independent of hypotheses
Independent of the sample size
Derive rules with temporal information
Using seed attribute approach to effectively discover exception
rules via rule hierarchy
Mining urology clinical data:


Derive association rules from patients’ pre-op conditions and
their operation outcomes for each type of operation
Hybrid clustering technique derives the optimal partition for
continuous-value attributes. This technique is critical for deriving
high-quality rules from small samples with a large number of
attributes
Reference


Qinghua Zou, Yu Chen, Wesley W. Chu and Xinchun Lu. Mining association rules
from tabular data guided by maximal frequent itemset. Book Chapter in
“Foundations and Advances in Data Mining”, edited by Wesley W. Chu and T.Y.
Lin, Springer, 2005.
Yu Chen, Lars Henning Pedersen, Wesley W. Chu and Jorn Olsen. "Drug
Exposure Side Effects from Mining Pregnancy Data" In SIGKDD Explorations
(Volume 9, Issue 1), June 2007, Special Issue on Data Mining for Health
Informatics, Guest Editors: Raymond Ng and Jian Pei .




Q. Zou, W.W. Chu, and B. Lu. SmartMiner: A depth-first search algorithm guided
by tail information for mining maximal frequent itemsets. In Proc. of the IEEE
Intl. Conf. on Data Mining, 2002.
R. Agrawal and R. Srikant: Fast algorithms for mining association rules. In
Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
D. Burdick, M. Calimlim, and J. Gehrke: MAFIA: a maximal frequent itemset
algorithm for transactional databases. In Intl. Conf. on Data Engineering, Apr.
2001.
K. Gouda and M.J. Zaki: Efficiently Mining Maximal Frequent Itemsets. Proc. of
the IEEE Int. Conference on Data Mining, San Jose, 2001.
B. Liu, M. Hu, and W. Hsu, "Multi-level organization and summarization of the
discovered rules," Proceedings of the ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, Aug, 2000, Boston, USA.
B. Liu, M. Hu, and W. Hsu, "Intuitive representation of decision trees using
general rules and exceptions," Proceedings of the Seventeenth National Conference
on Artificial Intelligence (AAAI-2000), July 30 - Aug 3, 2000, Austin, Texas, USA.
Frequent Itemset Mining Implementations Repository, http://fimi.cs.helsinki.fi/
http://www.ics.uci.edu/~mlearn/MLRepository.html