Learning Decision Trees

Brief tutorial by M Werner
Medical Diagnosis Example
• Goal – Diagnose a disease from a blood test
• Clinical Use
– A blood sample is obtained from a patient
– The blood is tested to measure the current expression levels of various proteins, say by using a DNA microarray
– The data are analyzed to produce a Yes or No answer
Data Analysis
• Use a decision tree such as:
[Figure: an example decision tree. Each internal node tests whether a protein's expression exceeds a threshold (P1 > K1, P2 > K2, P3 > K3, P4 > K4); the Y/N branches lead either to further tests or to Yes/No leaves giving the diagnosis.]
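As an illustration of how such a tree is applied to one patient's measurements, here is a minimal sketch in Python; the protein names, thresholds, and tree shape below are made up for the example and are not the ones in the figure.

    # Each internal node is (protein, threshold, subtree if level > threshold, subtree otherwise);
    # each leaf is simply the diagnosis string "Yes" or "No".
    tree = ("P1", 2.0,
            ("P2", 1.5, "Yes", "No"),   # followed when P1 > 2.0
            ("P3", 0.7, "No", "Yes"))   # followed when P1 <= 2.0

    def diagnose(node, levels):
        # Walk the tree using a dict of measured expression levels, e.g. {"P1": 2.3, ...}
        if isinstance(node, str):       # reached a Yes/No leaf
            return node
        protein, threshold, if_above, if_below = node
        return diagnose(if_above if levels[protein] > threshold else if_below, levels)

    print(diagnose(tree, {"P1": 2.3, "P2": 1.1, "P3": 0.5}))   # -> "No"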
How to Build the Decision Tree
• Start with blood samples from patients known either to have the disease or not (the training set).
• Suppose there are 20 patients: 10 known to have the disease and 10 known not to.
• From the training set, get expression levels for all proteins of interest.
• i.e. if there are 20 patients and 50 proteins, we get a 50 × 20 array of real numbers (see the sketch after this list)
• Rows are proteins
• Columns are patients
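The layout described above might be set up as follows; this is just a sketch assuming NumPy, with random numbers standing in for real expression levels.

    import numpy as np

    n_proteins, n_patients = 50, 20
    rng = np.random.default_rng(0)

    # 50 x 20 array of expression levels: rows are proteins, columns are patients.
    expression = rng.random((n_proteins, n_patients))

    # Known outcomes for the training set: 10 diseased (1) and 10 healthy (0).
    labels = np.array([1] * 10 + [0] * 10)

    print(expression.shape)   # (50, 20)
    print(labels.sum())       # 10 diseased patients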
Choosing the decision nodes
• We would like the tree to be as short as possible.
• Start with all 20 patients (10 have the disease, 10 don't) in one group.
• Choose a protein and a level that gains the most information.
[Figure: the root 10/10 group under two candidate splits. A possible splitting condition Px > Kx divides it into 9/3 (mostly diseased) and 1/7 (mostly not diseased); an alternative splitting condition Py > Ky divides it into 7/7 and 3/3.]
How to determine information gain
• Purity – a measure of the extent to which the patients in a group share the same outcome.
• A group that splits 1/7 is fairly pure – most patients don't have the disease.
• 0/8 is even purer.
• 4/4 is the opposite of pure. This group is said to have high entropy. Knowing that a patient is in this group does not make her more or less likely to have the disease.
• The decision tree should reduce entropy as test conditions are evaluated.
Measuring Purity (Entropy)
• Let f(i,j) = Prob(Outcome = j in node i)
• e.g. if node 2 has a 9/3 split
– f(2,0) = 9/12 = 0.75
– f(2,1) = 3/12 = 0.25
• Gini impurity: I_G(i) = 1 − Σ_j f(i,j)^2
• Entropy: I_E(i) = − Σ_j f(i,j) log2 f(i,j) (both measures are computed in the sketch below)
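Both measures follow directly from the outcome counts in a node. A minimal sketch (the function names are illustrative):

    import math

    def gini_impurity(counts):
        # I_G = 1 - sum_j f(j)^2 for a node with the given outcome counts
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def entropy(counts):
        # I_E = -sum_j f(j) * log2 f(j), skipping outcomes with zero count
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    # Node 2 with a 9/3 split: f(2,0) = 0.75, f(2,1) = 0.25
    print(gini_impurity([9, 3]))   # 0.375
    print(entropy([9, 3]))         # about 0.811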
Computing Entropy
An 11/1 split (entropy = 0.413817):

outcome   frequency   prob       log2(prob)   prob × log2(prob)
yes       11          0.916667   -0.12553     -0.11507
no        1           0.083333   -3.58496     -0.29875
total     12

A 9.9/0.1 split (entropy = 0.080793):

outcome   frequency   prob   log2(prob)   prob × log2(prob)
yes       9.9         0.99   -0.0145      -0.01435
no        0.1         0.01   -6.64386     -0.06644
total     10
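As a quick check of these figures, the same entropies can be reproduced from the "yes" probability alone (a minimal sketch; the helper H is illustrative):

    import math

    def H(p):
        # binary entropy of a group, given the probability p of a "yes" outcome
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    print(H(11 / 12))    # 0.4138..., as in the first table
    print(H(9.9 / 10))   # 0.0808..., as in the second table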
The goal is to use the test which best reduces the total entropy in the subgroups (each subgroup's entropy weighted by its size), as illustrated in the sketch below.
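Applying that goal to the two candidate splits shown earlier (Px > Kx giving 9/3 and 1/7, Py > Ky giving 7/7 and 3/3), a sketch of the comparison by size-weighted entropy; the function names are illustrative:

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def information_gain(parent, children):
        # parent entropy minus the size-weighted entropy of the child groups
        n = sum(parent)
        weighted = sum(sum(child) / n * entropy(child) for child in children)
        return entropy(parent) - weighted

    parent = [10, 10]                                    # 10 diseased, 10 not
    print(information_gain(parent, [[9, 3], [1, 7]]))    # Px > Kx: about 0.30 bits gained
    print(information_gain(parent, [[7, 7], [3, 3]]))    # Py > Ky: 0.0, no information gained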
Building the Tree
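The whole procedure can then be written as a greedy recursion: at each node try every (protein, level) test, keep the one with the highest information gain, split the patients, and recurse until a group is pure. The sketch below assumes the expression data as a dict mapping each protein to its per-patient levels; the names and the simple stopping rule are illustrative.

    import math

    def entropy(labels):
        n = len(labels)
        return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                    for c in set(labels))

    def build_tree(expression, labels, proteins):
        # expression[p][i] = level of protein p in patient i; labels[i] = 1 if diseased
        if len(set(labels)) == 1:                 # pure group: make a leaf
            return ("leaf", labels[0])
        best = None
        for p in proteins:
            for k in sorted(set(expression[p])):  # candidate threshold levels
                above = [i for i in range(len(labels)) if expression[p][i] > k]
                below = [i for i in range(len(labels)) if expression[p][i] <= k]
                if not above or not below:
                    continue
                gain = entropy(labels) - sum(
                    len(side) / len(labels) * entropy([labels[i] for i in side])
                    for side in (above, below))
                if best is None or gain > best[0]:
                    best = (gain, p, k, above, below)
        if best is None:                          # no useful split left: majority-vote leaf
            return ("leaf", max(set(labels), key=labels.count))
        _, p, k, above, below = best
        subtrees = [build_tree({q: [expression[q][i] for i in side] for q in proteins},
                               [labels[i] for i in side], proteins)
                    for side in (above, below)]
        return ("node", p, k, subtrees[0], subtrees[1])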
Links
• http://www.ece.msstate.edu/research/isip/publications/courses/ece_8463/lectures/current/lecture_27/lecture_27.pdf
• Decision Trees & Data Mining
• Andrew Moore Tutorial