7. Decision Trees and Decision Rules

National Yunlin University of Science and Technology
Knowledge discovery with classification rules in a cardiovascular dataset
Advisor : Dr. Hsu
Presenter : Zih-Hui Lin
Authors : Vili Podgorelec, Peter Kokol, Milojka Molan Štiglic, Marjan Heričko, Ivan Rozman
Computer Methods and Programs in Biomedicine, Volume 80, Supplement 1, December 2005, Pages S39-S49
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline
• Motivation
• Objective
• Introduction
• The AREX algorithm
• Experiment
• Conclusions
Motivation
• Modern medicine generates huge amounts of data, and there is an acute and widening gap between data collection and data comprehension.
• It is very difficult for a human to make use of such an amount of information (e.g. hundreds of attributes, thousands of images, several channels of 24 hours of ECG or EEG signals).
Objective
• Enable the search for new facts, which should reveal new interesting patterns and possibly improve the existing medical knowledge.
Introduction

• Decision tree
  ─ Advantage: transparency of the classification process, which one can easily interpret, understand and criticize.
  ─ Disadvantages:
     • poor processing of incomplete, noisy data
     • inability to build several trees for the same dataset
     • inability to use the preferred attributes, etc.
The AREX algorithm
1 Multi-population self-adapting genetic algorithm for the induction of decision trees
  1.1 Build N decision trees upon objects from S
  1.2 Classify each object Oi with nt randomly chosen trees from all N trees
  1.3 If the frequency of the most frequent decision class classified by the nt trees > nt - ct (ct = nt/2) [diagram label: S*]
  1.4 From all N decision trees create M initial classification rules
2 Evolution of programs in an arbitrary programming language, which is used to evolve classification rules
  2.1 Create M/2+1 rules (randomly)
  2.2 [no text recovered from the slide]
  2.3 If S is not empty
      • Add |S| randomly chosen objects from S* to S
      • ct = ct + 1
      • Repeat from 1.1
  2.4 An optimal set of classification rules is determined with a simple genetic algorithm
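The voting test in steps 1.2 and 1.3 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: trees are modelled as plain callables, the function name `partition_by_vote` is hypothetical, and the reading that objects passing the test go to S* while the rest stay in S is inferred from the diagram labels and step 2.3.

```python
import random
from collections import Counter

def partition_by_vote(trees, objects, nt, ct, rng):
    """Split objects into (confident, remaining) following steps 1.2-1.3.

    trees: list of callables mapping an object to a decision class
    (an assumed representation of a decision tree).
    """
    confident, remaining = [], []
    for obj in objects:
        chosen = rng.sample(trees, nt)  # 1.2: nt randomly chosen trees
        votes = Counter(tree(obj) for tree in chosen)
        label, freq = votes.most_common(1)[0]
        if freq > nt - ct:  # 1.3: threshold test from the slide
            confident.append((obj, label))  # assumption: object goes to S*
        else:
            remaining.append(obj)  # assumption: object stays in S
    return confident, remaining

if __name__ == "__main__":
    # Toy stand-ins for trees: two different threshold tests on a number.
    toy_trees = [lambda x: x > 0] * 3 + [lambda x: x > 10] * 3
    nt = 6
    ct = nt // 2  # ct = nt/2 as on the slide
    print(partition_by_vote(toy_trees, [20, -5, 5], nt, ct, random.Random(0)))
```

With ct = nt/2 the test amounts to a strict majority; each pass through step 2.3 raises ct by one and therefore lowers the required frequency nt - ct.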
Genetic algorithm for the construction of decision trees
1. Determine the number of attribute nodes M that will be in the tree
2. Select an attribute Xi
3. Select an empty node (the deeper the node is in the tree, the lower its probability of being selected)
4. Randomly select an attribute Xi (attributes that have not yet been selected have a higher probability)
   (1) Continuous attributes → split constant
   (2) Discrete attributes → two randomly defined disjoint sets
• For each empty leaf the following algorithm determines the appropriate decision class
[Diagram: a tree rooted at Xi whose null child nodes are filled in step by step; labels: M attributes, population, root, unused leaf nodes]
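A minimal Python sketch of the random tree construction in steps 1 to 4 is given below. The slide does not give the exact selection probabilities, so the weights used here (1/depth for empty nodes, 1/(1 + times used) for attributes), the attribute encoding, and the names `random_tree` and `count_nodes` are all assumptions made for illustration.

```python
import random

def random_tree(attributes, m, rng):
    """Grow a random binary tree with m attribute nodes.

    attributes maps a name to ("continuous", (low, high)) or
    ("discrete", [values]); this encoding is an assumption.
    Returns (root, empty_slots), where empty_slots are the unused leaves.
    """
    used = {name: 0 for name in attributes}

    def make_node(depth):
        names = list(attributes)
        # Step 4: less-used attributes are more likely (assumed weighting).
        weights = [1.0 / (1 + used[n]) for n in names]
        name = rng.choices(names, weights=weights, k=1)[0]
        used[name] += 1
        kind, domain = attributes[name]
        if kind == "continuous":
            test = rng.uniform(*domain)  # (1) split constant
        else:
            shuffled = rng.sample(domain, len(domain))
            cut = rng.randint(1, len(domain) - 1)
            test = (set(shuffled[:cut]), set(shuffled[cut:]))  # (2) two disjoint sets
        return {"attr": name, "test": test, "depth": depth,
                "left": None, "right": None}

    root = make_node(1)  # step 2: first attribute becomes the root
    empty = [(root, "left"), (root, "right")]
    for _ in range(m - 1):
        # Step 3: deeper empty nodes are less likely (assumed weight 1/depth).
        weights = [1.0 / parent["depth"] for parent, _ in empty]
        i = rng.choices(range(len(empty)), weights=weights, k=1)[0]
        parent, side = empty.pop(i)
        child = make_node(parent["depth"] + 1)
        parent[side] = child
        empty += [(child, "left"), (child, "right")]
    return root, empty

def count_nodes(node):
    """Count attribute nodes in the tree."""
    if node is None:
        return 0
    return 1 + count_nodes(node["left"]) + count_nodes(node["right"])
```

The returned empty slots correspond to the unused leaf nodes, which still need a decision class assigned by the leaf-labelling step mentioned on the slide.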
proGenesys system & Finding the optimal set of rules
[Slide figure not recovered; the surviving annotations describe a good rule:]
• it covers many objects (otherwise it tends to be too specific)
• most of the objects covered by the rule should fall into the same decision class
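The two properties above can be made concrete with a small Python sketch. It is not the fitness function from the paper, which did not survive on the slide; the measures `coverage` and `purity`, the function name `rule_quality`, and the toy data are assumptions for illustration only.

```python
from collections import Counter

def rule_quality(rule, objects, labels):
    """Return (coverage, purity) of a rule on a labelled dataset.

    coverage: fraction of all objects the rule covers.
    purity: fraction of covered objects in the most common covered class.
    """
    covered = [lab for obj, lab in zip(objects, labels) if rule(obj)]
    if not covered:
        return 0.0, 0.0
    coverage = len(covered) / len(objects)
    purity = Counter(covered).most_common(1)[0][1] / len(covered)
    return coverage, purity

if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    objects = [{"x": 5}, {"x": 7}, {"x": 12}, {"x": 15}]
    labels = ["A", "A", "B", "A"]
    print(rule_quality(lambda o: o["x"] < 13, objects, labels))
```

A rule with high purity but very low coverage is the "too specific" case named on the slide.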
Dataset


• The dataset contains data of 100 patients from Maribor Hospital.
• The attributes include:
  ─ general data (age, sex, etc.)
  ─ health status (data from family history and the child's previous illnesses)
  ─ general cardiovascular data (blood pressure, pulse, chest pain, etc.)
  ─ more specialized cardiovascular data: data from the child's cardiac history and clinical examinations (with findings of ultrasound, ECG, etc.)
• In the dataset, five different diagnoses are possible:
  ─ innocent heart murmur
  ─ congenital heart disease with left-to-right shunt
  ─ aortic valve disease with aorta coarctation
  ─ arrhythmias
  ─ chest pain (glossed in Chinese on the slide as "palpitations")
Classification result – training set
Overfitting
Classification result – testing set
Classification result
Conclusions

• One of the most evident advantages of AREX is that it is simultaneously very good at both:
  ─ Generalization → high and similar overall accuracy on both the training set and the test set
  ─ Specialization → high and very similar accuracy for all decision classes, including the least frequent ones
• It equips physicians with a powerful technique to:
  ─ (1) confirm their existing knowledge about a medical problem
  ─ (2) search for new facts, which should reveal new interesting patterns and possibly improve the existing medical knowledge
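The two notions in the first point map onto two simple measurements: overall accuracy (compared between training and test sets) and accuracy per decision class. The Python sketch below shows one way to compute both; the function name `accuracy_report` is hypothetical and the slides do not state how the reported figures were calculated.

```python
from collections import defaultdict

def accuracy_report(y_true, y_pred):
    """Return (overall_accuracy, per_class_accuracy) for two label lists."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    overall = sum(correct.values()) / len(y_true)
    # Per-class accuracy: share of each true class that was predicted correctly.
    per_class = {c: correct[c] / total[c] for c in total}
    return overall, per_class
```

Running this separately on the training set and the test set and comparing the two overall values checks generalization; the spread of the per-class values checks specialization.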
My opinion



• Advantage: weights are assigned according to class
• Disadvantage:
• Application: practical use in clinical settings