COMPARING ASSOCIATION
RULES AND DECISION TREES
FOR DISEASE PREDICTION
Carlos Ordonez
MOTIVATION

Three main issues about mining association rules in medical datasets:
1. A significant fraction of association rules is irrelevant
2. Most relevant rules with high quality metrics appear only at low support
3. The number of discovered rules becomes extremely large at low support

Search constraints:
- Find only medically significant association rules
- Make search more efficient
MOTIVATION


- Decision tree: a well-known machine learning algorithm
- Association rules vs. decision trees
  - Accuracy
  - Interpretability
  - Applicability

ASSOCIATION RULES
- Support
- Confidence
- Lift

$$\mathrm{confidence}(x \Rightarrow y) = \frac{\mathrm{support}(x \cup y)}{\mathrm{support}(x)}$$

$$\mathrm{lift}(x \Rightarrow y) = \frac{\mathrm{confidence}(x \Rightarrow y)}{\mathrm{support}(y)}$$

- Lift quantifies the predictive power of x ⇒ y
- Rules such that lift(x ⇒ y) > 1 are interesting!
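The three metrics can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the toy patient records and item names such as `age>=60` are hypothetical, and each record is assumed to be a set of binary items.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str], transactions: List[Transaction]) -> float:
    """support(x union y) / support(x)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x: Iterable[str], y: Iterable[str], transactions: List[Transaction]) -> float:
    """confidence(x => y) / support(y)."""
    return confidence(x, y, transactions) / support(y, transactions)


# Hypothetical toy data: one set of items per patient.
transactions = [
    frozenset({"age>=60", "smoker", "LAD>=70"}),
    frozenset({"age>=60", "LAD>=70"}),
    frozenset({"age<40", "LAD<50"}),
    frozenset({"age>=60", "smoker", "LAD>=70"}),
    frozenset({"age<40", "smoker", "LAD<50"}),
]

x, y = {"age>=60"}, {"LAD>=70"}
rule_support = support(x | y, transactions)
rule_confidence = confidence(x, y, transactions)
rule_lift = lift(x, y, transactions)
print(rule_support, rule_confidence, rule_lift)
```

On this toy data the rule `age>=60 ⇒ LAD>=70` has lift above 1, so it would count as interesting under the criterion above.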

CONSTRAINED ASSOCIATION RULES

Transforming Medical Data Set

- Data must be transformed to binary dimensions
  - Numeric attributes ⇒ intervals; each interval is mapped to an item
  - Categorical attributes ⇒ each categorical value is an item
  - If an attribute has a negation, add that as an item
- Each item corresponds to the presence or absence of one categorical value or one numeric interval
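A possible sketch of this transformation is shown below. The helper names (`numeric_item`, `to_items`), the item naming scheme, and the sample record are all assumptions made for illustration; only the age cutoffs of 40 and 60 come from the slides.

```python
from typing import Dict, List, Set, Union

Value = Union[bool, int, float, str]


def numeric_item(name: str, value: float, cutoffs: List[float]) -> str:
    """Return the item for the interval containing `value` (cutoffs must be non-empty)."""
    bounds = sorted(cutoffs)
    if value < bounds[0]:
        return f"{name}<{bounds[0]}"
    for lower, upper in zip(bounds, bounds[1:]):
        if lower <= value < upper:
            return f"{lower}<={name}<{upper}"
    return f"{name}>={bounds[-1]}"


def to_items(record: Dict[str, Value], cutoffs: Dict[str, List[float]]) -> Set[str]:
    """Map one patient record to a set of binary items."""
    items: Set[str] = set()
    for name, value in record.items():
        if isinstance(value, bool):
            # Attribute with a negation: the absent case becomes its own item.
            # bool is checked first because bool is a subclass of int in Python.
            items.add(name if value else f"not_{name}")
        elif isinstance(value, (int, float)):
            # Numeric attribute: one item per interval.
            items.add(numeric_item(name, value, cutoffs[name]))
        else:
            # Categorical attribute: one item per categorical value.
            items.add(f"{name}={value}")
    return items


# Hypothetical record; the attribute names are illustrative only.
record = {"age": 63, "sex": "F", "smoker": False}
items = to_items(record, {"age": [40, 60]})
print(sorted(items))
```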
CONSTRAINED ASSOCIATION RULES

Search Constraints

1. Max itemset size (k)
   - Reduces the combinatorial explosion of large itemsets and helps find simple rules
2. Group
   - g_i > 0 ⇒ A_i belongs to a group
   - g_i = 0 ⇒ A_i is not group-constrained at all
   - This avoids finding trivial or redundant rules
3. Antecedent/Consequent
   - c_i = 1 ⇒ A_i is an antecedent
   - c_i = 2 ⇒ A_i is a consequent
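The three constraints can be illustrated with a brute-force candidate generator. This is a sketch under stated assumptions, not the search algorithm from the talk: it assumes the group constraint means that two items from the same group (g > 0) never appear in the same itemset, and the function names, items, and constraint values are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterator, List, Optional, Tuple

RuleSides = Tuple[FrozenSet[str], FrozenSet[str]]


def candidate_itemsets(items: List[str], group: Dict[str, int], k: int) -> Iterator[FrozenSet[str]]:
    """Yield itemsets of size <= k with at most one item per constrained group."""
    for size in range(1, k + 1):
        for combo in combinations(items, size):
            constrained = [group[i] for i in combo if group[i] > 0]
            # Assumption: items sharing a group id must not co-occur.
            if len(constrained) == len(set(constrained)):
                yield frozenset(combo)


def split_rule(itemset: FrozenSet[str], role: Dict[str, int]) -> Optional[RuleSides]:
    """Split an itemset into (antecedent, consequent) using c = 1 or c = 2."""
    antecedent = frozenset(i for i in itemset if role[i] == 1)
    consequent = frozenset(i for i in itemset if role[i] == 2)
    if antecedent and consequent:
        return antecedent, consequent
    return None  # a rule needs items on both sides


# Hypothetical items and constraint values.
items = ["age<40", "age>=60", "smoker", "LAD>=70"]
group = {"age<40": 1, "age>=60": 1, "smoker": 0, "LAD>=70": 2}
role = {"age<40": 1, "age>=60": 1, "smoker": 1, "LAD>=70": 2}

candidates = list(candidate_itemsets(items, group, k=2))
rules = [r for r in (split_rule(s, role) for s in candidates) if r is not None]
print(len(candidates), len(rules))
```

Here the pair `{age<40, age>=60}` is pruned by the group constraint, and only itemsets that contain the consequent item `LAD>=70` together with at least one antecedent item become rules.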
Data set: 655 patients, 25 attributes
- Age: binned at 40 (adult) and 60 (old)
- Percentage of vessel narrowing: LAD, LCX and RCA are binned at 50% and 70%; LM is binned at 30% and 50%
- 9 heart regions (2 ranges, with 0.2 as cutoff)
- Numeric attribute binned at 200 and 250
PARAMETERS
- k = 4
- Min support = 1% ≈ 7 patients
- Min confidence = 70%
- Min lift = 1.2
  - To get rules where there is a stronger implication dependence between X and Y
- Rules with confidence ≥ 90% and lift ≥ 2, with 2 or more items in the consequent, were considered medically significant.
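As a rough sketch, the thresholds above can be applied as two filters. The `Rule` class, the function names, and the three sample rules are hypothetical. The code also assumes that a medically significant rule must pass the basic thresholds as well.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet

N_PATIENTS = 655       # size of the data set
K_MAX = 4              # max itemset size
MIN_SUPPORT = 0.01
MIN_CONFIDENCE = 0.70
MIN_LIFT = 1.2

# 1% of 655 patients is 6.55, which rounds up to 7 patients.
min_support_count = math.ceil(MIN_SUPPORT * N_PATIENTS)


@dataclass(frozen=True)
class Rule:
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    support: float
    confidence: float
    lift: float


def passes_thresholds(r: Rule) -> bool:
    """Basic search thresholds: size, support, confidence, lift."""
    size = len(r.antecedent | r.consequent)
    return (size <= K_MAX and r.support >= MIN_SUPPORT
            and r.confidence >= MIN_CONFIDENCE and r.lift >= MIN_LIFT)


def medically_significant(r: Rule) -> bool:
    """Stricter filter: conf >= 90%, lift >= 2, two or more consequent items."""
    return (passes_thresholds(r) and r.confidence >= 0.90
            and r.lift >= 2.0 and len(r.consequent) >= 2)


# Hypothetical rules with made-up metric values.
rules = [
    Rule(frozenset({"age>=60"}), frozenset({"LAD>=70", "RCA>=70"}), 0.03, 0.92, 2.1),
    Rule(frozenset({"age>=60"}), frozenset({"LAD>=70"}), 0.05, 0.80, 1.5),
    Rule(frozenset({"age<40"}), frozenset({"LAD<50"}), 0.005, 0.95, 2.5),
]
kept = [passes_thresholds(r) for r in rules]
significant = [medically_significant(r) for r in rules]
print(min_support_count, kept, significant)
```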
HEALTHY ARTERIES
- 9,595 associations
- 771 rules

DISEASED ARTERIES
- Several unneeded items were filtered out (those with values in the lower (healthy) ranges)
- 10,218 associations
- 552 rules

PREDICTIVE RULES FROM DECISION
TREES
- CN4.5: using gain ratio
- CART: similar results
- Threshold on the height of the tree to produce simple rules
- Percentage of patients (ls): fraction of patients where the antecedent appears
- Confidence factor (cf)
- Focus on predicting LAD disease
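The two rule metrics can be sketched as follows. The slides define ls as the fraction of patients where the antecedent appears but do not spell out cf. This sketch assumes cf is the fraction of covered patients for which the rule's prediction is correct. The patient records, field names, and the rule itself are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Patient = Dict[str, float]


def rule_stats(
    patients: List[Patient],
    antecedent: Callable[[Patient], bool],
    target: str,
    predicted: float,
) -> Tuple[float, float]:
    """Return (ls, cf) for one predictive rule read off a tree path."""
    covered = [p for p in patients if antecedent(p)]
    ls = len(covered) / len(patients)
    # Assumed definition of cf: accuracy of the prediction on covered patients.
    cf = sum(p[target] == predicted for p in covered) / len(covered) if covered else 0.0
    return ls, cf


# Hypothetical patients: age in years, LAD = 1 if the artery is diseased.
patients = [
    {"age": 65, "LAD": 1},
    {"age": 70, "LAD": 1},
    {"age": 62, "LAD": 0},
    {"age": 45, "LAD": 0},
    {"age": 38, "LAD": 0},
]

# Rule: IF age >= 60 THEN LAD is diseased.
ls, cf = rule_stats(patients, lambda p: p["age"] >= 60, target="LAD", predicted=1)
print(ls, cf)
```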

PREDICTIVE RULES FROM DECISION
TREES
1. All measurements, without binning, as independent variables; numerical variables are automatically split

Without any threshold on height:
- 181 nodes
- 90% accuracy
- Height = 14
- Most rules involve more than 5 attributes
- Except for 5 rules, all others involve less than 2% of the patients
- More than 80% of rules refer to less than 1% of patients
- Many rules involve attributes with missing information
- Many rules have the same variable split several times
- Few rules have cf = 1, and their splits include borderline cases and involve few patients

PREDICTIVE RULES FROM DECISION
TREES

With threshold = 10 on height:
- 83 nodes
- 77% accuracy
- Most rules have repeated attributes
- More than 5 attributes
- Perfusion cutoffs higher than 0.5
- Low cf, and involve less than 1% of the population

With threshold = 3 on height:
- 65% accuracy
- Simpler rules

RELATED WORK
PREDICTIVE RULES FROM DECISION
TREES
2. Items (binary variables) as independent variables, as used for association rules

With threshold = 3 on height:
- 10 nodes
- Most of the rules were much closer to the prediction requirements
DISCUSSION

Decision trees:
- Are not as powerful as association rules in this case
- Do not work well with combinations of several target variables
- Fail to identify many medically relevant combinations of independent numeric variable ranges and categorical values
- Tend to find complex and long rules if the height is unlimited
- Find few predictive rules with reasonably sized (>1%) sets of patients in such cases
- Produce rules that sometimes repeat the same attribute
DISCUSSION - ALTERNATIVES

- Build many decision trees with different independent attributes
  - This is error-prone, difficult to interpret, and slow for a higher number of attributes
- Create a family of small trees, where each tree has a weight
  - Each tree becomes similar to a small set of association rules
- Constraints for association rules can be adapted to decision trees (future work)
DISCUSSION – DECISION TREE
ADVANTAGES



- A DT partitions the data set; ARs on the same target attributes may refer to overlapping subsets of the data
- A DT represents a predictive model of the data set; ARs are disconnected among themselves
- A DT is guaranteed to have at least 50% prediction accuracy, and generally above 80%, for binary target variables; ARs require trial and error to find the best threshold
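One way to read the 50% guarantee is sketched below. It assumes accuracy is measured on the training data and that each leaf predicts the majority class of the patients it holds. Both are assumptions, since the slides do not state how the guarantee is derived. The function name and the toy leaves are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_leaf_accuracy(leaves: List[Sequence[int]]) -> float:
    """Training accuracy when every leaf predicts its own majority class.

    Each leaf is the list of binary target values (0 or 1) of the patients
    that fall into it; the leaves together partition the data set.
    """
    total = sum(len(leaf) for leaf in leaves)
    # In a binary leaf the majority class covers at least half of its patients,
    # so the sum of majority counts is at least half of the total.
    correct = sum(max(Counter(leaf).values()) for leaf in leaves if leaf)
    return correct / total


# Hypothetical target values for 5 patients under three different trees.
root_only = majority_leaf_accuracy([[1, 1, 1, 0, 0]])     # tree with a single leaf
poor_split = majority_leaf_accuracy([[1, 0], [1, 1, 0]])  # uninformative split
pure_split = majority_leaf_accuracy([[1, 1, 1], [0, 0]])  # perfectly separating split
print(root_only, poor_split, pure_split)
```

Under these assumptions no partition can score below 0.5 on a binary target, which matches the lower bound on the slide. The bound says nothing about accuracy on unseen patients.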