DSCI 4520/5240: Data Mining
Fall 2013 – Dr. Nick Evangelopoulos
Lecture 5:
Decision Tree Algorithms
Material based on:
Witten & Frank 2000, Olson & Shi 2007, de Ville 2006, SAS Education 2005
Decision Tree Algorithms
Decision Tree algorithms were developed by the
Machine Learning research community (part of
Computer Science/Artificial Intelligence)
Agenda
• Rule evaluation criteria in Machine Learning
• Decision Tree algorithms: 1R, ID3
  – Entropy
• Naïve Bayes classification (optional)
Decision Trees: Credit risk example
This example is related to determining
credit risks. We have a total of 10
people. 6 are good risks and 4 are bad.
We apply splits to the tree based on
employment status. When we break
this down, we find that there are 7
employed and 3 not employed. Of the
3 that are not employed, all of them
are bad credit risks and thus we have
learned something about our data.
Note that we cannot split this node any further, since all of its records belong to one class. This is called a pure node. The other node, however, can be split again based on a different criterion, so we can continue to grow the tree on the left-hand side.
CORRESPONDING RULES:
• IF employed = yes AND married = yes THEN risk = good
• IF employed = yes AND married = no THEN risk = good
• IF employed = no THEN risk = bad
Decision Tree performance
Confidence is the degree of accuracy of a rule.
Support is the degree to which the rule conditions occur
in the data.
EXAMPLE: if 10 customers purchased Zane Grey’s The
Young Pitcher and 8 of them also purchased The
Short Stop, the rule: {IF basket has The Young
Pitcher THEN basket has The Short Stop} has
confidence of 0.80. If these were the only 10 purchases involving these books out of
10,000,000 total purchases, the support is only 0.000001.
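As a quick arithmetic check, here is a minimal Python sketch (not part of the original slides; the counts are those of the example above, and support is computed as this slide defines it):

# Confidence and support for the rule {Young Pitcher -> Short Stop}
total_purchases = 10_000_000      # all purchases in the data
with_young_pitcher = 10           # baskets containing The Young Pitcher (the rule condition)
with_both_books = 8               # of those, baskets that also contain The Short Stop

confidence = with_both_books / with_young_pitcher   # 8/10 = 0.80
support = with_young_pitcher / total_purchases      # 10/10,000,000 = 0.000001
print(confidence, support)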
Rule Interestingness
Interestingness is the idea that Data Mining discovers
something unexpected.
Consider the rule: {IF basket has eggs THEN basket has
bacon}. Suppose the confidence level is 0.90 and the
support level is 0.20. This may be a useful rule, but
it may not be interesting if the grocer was already
aware of this association.
Recall the definition of DM as the discovery of
previously unknown knowledge!
Rule Induction algorithms
Rule induction algorithms are recursive: they identify data partitions that progressively improve the separation with respect to the outcome.
The partitions are then organized into a decision tree.
Common Algorithms:
• 1R
• ID3
• C4.5/C5.0
• CART
• CHAID
• CN2
• BruteDL
• SDL
Illustration of two Tree algorithms
• 1R and Discretization in 1R
• ID3: Min Entropy and Max Info Gain
1R
1R: Inferring Rudimentary Rules
1R learns a 1-level decision tree.
• In other words, it generates a set of rules that all test on one particular attribute.
Basic version (assuming nominal attributes):
• One branch for each of the attribute’s values
• Each branch assigns the most frequent class
• Error rate: the proportion of instances that don’t belong to the majority class of their corresponding branch
• Choose the attribute with the lowest error rate
Pseudo-code for 1R
For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate
Let’s apply 1R on the weather data:
• Consider the first (outlook) of the 4 attributes (outlook, temp, humidity, windy). Consider all values (sunny, overcast, rainy) and make 3 corresponding rules. Continue until you get all 4 sets of rules.
A simple example: Weather Data
Outlook    Temp  Humidity  Windy  Play?
Sunny      Hot   High      False  No
Sunny      Hot   High      True   No
Overcast   Hot   High      False  Yes
Rainy      Mild  High      False  Yes
Rainy      Cool  Normal    False  Yes
Rainy      Cool  Normal    True   No
Overcast   Cool  Normal    True   Yes
Sunny      Mild  High      False  No
Sunny      Cool  Normal    False  Yes
Rainy      Mild  Normal    False  Yes
Sunny      Mild  Normal    True   Yes
Overcast   Mild  High      True   Yes
Overcast   Hot   Normal    False  Yes
Rainy      Mild  High      True   No
Evaluating the Weather Attributes in 1R
(*) indicates a random choice between two equally likely outcomes.
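The following minimal Python sketch (an illustration, not the course's official code) applies the 1R pseudo-code to the weather data above and reports the rules and error rate for each attribute:

from collections import Counter

# (outlook, temperature, humidity, windy, play)
weather = [
    ("sunny","hot","high","false","no"),      ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"),  ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"),  ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"),("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"),  ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"),   ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"),("rainy","mild","high","true","no"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

for i, attr in enumerate(attributes):
    rules, errors = {}, 0
    for value in sorted({row[i] for row in weather}):
        classes = Counter(row[-1] for row in weather if row[i] == value)
        majority, count = classes.most_common(1)[0]    # ties are broken arbitrarily, cf. (*)
        rules[value] = majority                        # rule: attr = value -> majority class
        errors += sum(classes.values()) - count        # instances outside the majority class
    print(attr, rules, f"errors: {errors}/{len(weather)}")

# Expected: outlook and humidity tie at 4/14 errors; temperature and windy give 5/14,
# so 1R would pick outlook (or humidity).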
Decision tree for the weather data
Outlook = sunny    → Humidity: high → no; normal → yes
Outlook = overcast → yes
Outlook = rainy    → Windy: false → yes; true → no
Discretization in 1R
Consider continuous Temperature data, after sorting them in ascending order:
65 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
One way to discretize temperature is to place breakpoints wherever the class changes:
Yes | No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No
To avoid overfitting, 1R requires a minimum number of observations of the majority class in each
partition (here at least 3, where available), and a partition keeps being extended as long as the next observation belongs to the same class (a “run”):
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
If adjacent partitions have the same majority class, the partitions are merged:
Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No
The final discretization leads to the rule set:
IF temperature <= 77.5 THEN Yes
IF temperature > 77.5 THEN No
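As a quick check (a sketch, assuming the rule set above), the single breakpoint at 77.5 misclassifies 5 of the 14 instances:

temps   = [65, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
classes = ["yes","no","yes","yes","yes","no","no","yes","yes","yes","no","yes","yes","no"]

predict = lambda t: "yes" if t <= 77.5 else "no"     # the discretized 1R rule from above
errors = sum(predict(t) != c for t, c in zip(temps, classes))
print(f"{errors}/{len(temps)} errors")               # prints: 5/14 errors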
Comments on 1R
• 1R was described in a paper by Holte (1993)
• The paper contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
• The minimum number of instances was set to 6 after some experimentation
• 1R’s simple rules performed not much worse than much more complex decision trees
• Simplicity-first pays off!
Entropy and Information Gain
Constructing Decision Trees in ID3, C4.5, C5.0
Normal procedure: top down in recursive divide-and-conquer
fashion
First: attribute is selected for root node and branch is created
for each possible attribute value
Then: the instances are split into subsets (one for each
branch extending from the node)
Finally: procedure is repeated recursively for each branch,
using only instances that reach the branch
Process stops if all instances have the same class
Which attribute to select?
Class distributions produced by each candidate split of the 14 weather instances:
• Outlook: sunny → 2 yes, 3 no; overcast → 4 yes, 0 no; rainy → 3 yes, 2 no
• Temperature: hot → 2 yes, 2 no; mild → 4 yes, 2 no; cool → 3 yes, 1 no
• Humidity: high → 3 yes, 4 no; normal → 6 yes, 1 no
• Windy: false → 6 yes, 2 no; true → 3 yes, 3 no
A criterion for attribute selection
• Which is the best attribute? The one that will result in the smallest tree.
• Heuristic: choose the attribute that produces the “purest” nodes!
• Popular impurity criterion: Information. This is the extra information needed to classify an instance; it takes a low value for pure nodes and a high value for impure nodes.
• We can then compare a tree before the split and after the split using Information Gain = Info(before) – Info(after).
• Information Gain increases with the average purity of the subsets that an attribute produces.
• Strategy: choose the attribute that results in the greatest information gain. However, it is equivalent (and faster) to select the smallest information (= total weighted entropy).
Computing Information
• Information is measured in bits.
• Given a probability distribution, the info required to predict an event is the distribution’s entropy.
• Entropy gives the additional required information (i.e., the information deficit) in bits.
• This can involve fractions of bits!
• The negative sign in the entropy formula is needed to convert all negative logs back to positive values.
Formula for computing the entropy:
Entropy (p1, p2, …, pn) = –p1 log p1 – p2 log p2 – … – pn log pn
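A small Python helper for this formula (a sketch; logs are base 2 and 0·log 0 is treated as 0):

from math import log2

def entropy(*probs):
    # entropy of a probability distribution, in bits
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy(9/14, 5/14))   # about 0.940 bits: the class distribution of the weather data
print(entropy(2/5, 3/5))     # about 0.971 bits
print(entropy(1.0))          # 0 bits: a pure node needs no extra information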
Weather example: attribute “outlook”
(Outlook splits the 14 instances into sunny [2 yes, 3 no], overcast [4 yes, 0 no], and rainy [3 yes, 2 no].)
• Outlook = “Sunny”:
  Info([2,3]) = entropy(2/5, 3/5) = –2/5 log(2/5) – 3/5 log(3/5) = 0.971 bits
• Outlook = “Overcast”:
  Info([4,0]) = entropy(1, 0) = –1 log(1) – 0 log(0) = 0 bits (by definition, taking 0 log 0 = 0)
• Outlook = “Rainy”:
  Info([3,2]) = entropy(3/5, 2/5) = –3/5 log(3/5) – 2/5 log(2/5) = 0.971 bits
Information (= total weighted entropy) for attribute Outlook:
Info([2,3], [4,0], [3,2]) = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.693 bits.
Comparing Information Gains (OPTIONAL) or total weighted entropies
• Information Gain = Information Before – Information After
  Gain (Outlook) = info([9,5]) – info([2,3], [4,0], [3,2]) = 0.940 – 0.693 = 0.247 bits
• Information Gain for the attributes of the Weather Data:
  Gain (Outlook) = 0.247 bits
  Gain (Temperature) = 0.029 bits
  Gain (Humidity) = 0.152 bits
  Gain (Windy) = 0.048 bits
Outlook also has the minimum weighted entropy, at 0.693 bits.
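A minimal Python sketch that recomputes these figures from the weather data (illustrative variable names; the results should match the gains listed above):

from collections import Counter
from math import log2

# (outlook, temperature, humidity, windy, play)
weather = [
    ("sunny","hot","high","false","no"),      ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"),  ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"),  ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"),("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"),  ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"),   ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"),("rainy","mild","high","true","no"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

info_before = entropy([row[-1] for row in weather])          # 0.940 bits

for i, attr in enumerate(attributes):
    # total weighted entropy after splitting on this attribute
    info_after = sum(
        len(subset)/len(weather) * entropy(subset)
        for value in {row[i] for row in weather}
        for subset in [[row[-1] for row in weather if row[i] == value]]
    )
    print(f"{attr}: info = {info_after:.3f} bits, gain = {info_before - info_after:.3f} bits")

# Expected: outlook 0.693/0.247, temperature 0.911/0.029, humidity 0.788/0.152, windy 0.892/0.048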
Continuing to split
Splitting the Outlook = sunny branch further (2 yes, 3 no):
• Temperature: hot → 0 yes, 2 no; mild → 1 yes, 1 no; cool → 1 yes, 0 no
• Humidity: high → 0 yes, 3 no; normal → 2 yes, 0 no
• Windy: false → 1 yes, 2 no; true → 1 yes, 1 no
Gain (Temperature) = 0.571 bits
Gain (Humidity) = 0.971 bits
Gain (Windy) = 0.020 bits
Final Decision Tree
Outlook = sunny    → Humidity: high → no; normal → yes
Outlook = overcast → yes
Outlook = rainy    → Windy: false → yes; true → no
• Not all leaves need to be pure; sometimes identical instances belong to different classes.
• Splitting stops when the data cannot be split any further.
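The whole procedure can be condensed into a short recursive sketch (a simplified ID3-style illustration, not SAS EM code); run on the weather data it should print the tree above:

from collections import Counter
from math import log2

# (outlook, temperature, humidity, windy, play)
weather = [
    ("sunny","hot","high","false","no"),      ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"),  ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"),  ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"),("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"),  ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"),   ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"),("rainy","mild","high","true","no"),
]
attributes = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3}

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

def weighted_info(rows, idx):
    # total weighted entropy of the split on the attribute at position idx
    return sum(len(sub)/len(rows) * entropy(sub)
               for v in {r[idx] for r in rows}
               for sub in [[r[-1] for r in rows if r[idx] == v]])

def grow(rows, attrs, depth=0):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not attrs:            # pure node, or nothing left to split on
        print("  " * depth + "-> " + Counter(labels).most_common(1)[0][0])
        return
    best = min(attrs, key=lambda a: weighted_info(rows, attrs[a]))   # min info = max gain
    for value in sorted({r[attrs[best]] for r in rows}):
        print("  " * depth + f"{best} = {value}")
        subset = [r for r in rows if r[attrs[best]] == value]
        grow(subset, {a: i for a, i in attrs.items() if a != best}, depth + 1)

grow(weather, attributes)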
Another example: Loan Application Data
Age     Income   Risk     OnTime?
young   low      high     yes
young   low      high     yes
young   low      high     no
young   low      high     no
young   low      low      yes
young   average  high     no
young   average  high     yes
young   average  high     yes
young   average  high     yes
young   average  low      yes
young   average  low      yes
young   high     high     no
middle  average  average  no
middle  average  low      yes
middle  average  low      yes
middle  high     low      yes
middle  high     low      yes
old     high     low      yes
old     high     low      yes
old     high     low      yes
Twenty loan application cases are presented. The target variable OnTime? indicates whether the loan was paid off on time.
Loan Example: probability calculations
All possible values for the three attributes (Age, Income, Risk)
are shown below. For each value, the probability for the
loan to be On Time (OnTime = yes) is calculated:
Variable  Value    Cases  OnTime=yes  Prob(yes)
Age       young    12     8           0.67
Age       middle   5      4           0.80
Age       old      3      3           1.00
Income    low      5      3           0.60
Income    average  9      7           0.78
Income    high     6      5           0.83
Risk      high     9      5           0.56
Risk      average  1      0           0.00
Risk      low      10     10          1.00
Loan Example: Entropy calculations
Information calculations for attribute Age are shown below.
• First we calculate the probability for each value to result in Yes
• Also the probability for this value to result in No.
• Then we compute the entropy for this value as:
  E = –p(yes) log p(yes) – p(no) log p(no)
• Finally we calculate Information (= weighted entropy) for the entire attribute:
  Inform = E1·p1 + E2·p2 + E3·p3, where pi is the proportion of cases having value i
Value   prob(yes)  log2 p(yes)  prob(no)  log2 p(no)  Entropy  Prob(value)  Product
young   8/12       –0.585       4/12      –1.585      0.918    12/20        0.551
middle  4/5        –0.322       1/5       –2.322      0.722    5/20         0.180
old     3/3        0            0/3       0           0        3/20         0
Information (sum of the Product column) = 0.731
Loan Example: The first split
The calculations continue until we have, for each competing
attribute, the Information required to predict the outcome.
The attribute with lowest required information is also the attribute
with largest information gain, when we compare the required
information before and after the split.
Attribute  Information (in bits)
Age        0.731
Income     0.782
Risk       0.446
The first split is therefore on Risk, with branches low, average, and high.
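A short Python sketch (illustrative; the tuples encode the 20 cases listed above) that recomputes the three Information values and confirms Risk as the first split:

from collections import Counter
from math import log2

# (age, income, risk, on_time) for the 20 loan applications
loans = [
    ("young","low","high","yes"),     ("young","low","high","yes"),
    ("young","low","high","no"),      ("young","low","high","no"),
    ("young","low","low","yes"),      ("young","average","high","no"),
    ("young","average","high","yes"), ("young","average","high","yes"),
    ("young","average","high","yes"), ("young","average","low","yes"),
    ("young","average","low","yes"),  ("young","high","high","no"),
    ("middle","average","average","no"), ("middle","average","low","yes"),
    ("middle","average","low","yes"),    ("middle","high","low","yes"),
    ("middle","high","low","yes"),       ("old","high","low","yes"),
    ("old","high","low","yes"),          ("old","high","low","yes"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

for idx, name in enumerate(["Age", "Income", "Risk"]):
    # weighted entropy of OnTime? after splitting on this attribute
    info = sum(len(sub)/len(loans) * entropy(sub)
               for v in {row[idx] for row in loans}
               for sub in [[row[-1] for row in loans if row[idx] == v]])
    print(f"{name}: {info:.3f} bits")   # expected roughly 0.731, 0.782, 0.446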
Naïve Bayes Classification (optional topic)
Statistical Decision Tree Modeling
1R uses one attribute at a time and chooses the one that works best.
Consider the “opposite” of 1R: Use all the attributes.
Let’s first make two assumptions: Attributes are
• Equally important
• Statistically independent
Although based on assumptions that are almost
never correct, this scheme works well in practice!
Probabilities for the Weather Data
Table showing counts and conditional probabilities (contingencies).
A new day: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True.
Suppose the answer is Play = Yes. How likely is it to get the attribute values of this new day?
Bayes’ Rule
Probability of event H given evidence E:
P(H|E) = P(E|H) × P(H) / P(E)
WHERE: H = target value, E = input variable values
“A priori” probability of H: P(H)
(Probability of event before evidence has been seen)
“A posteriori” probability of H: P(H|E)
(Probability of event after evidence has been seen)
Naïve Bayes Classification
Classification learning: what’s the probability of the class
given an instance?
Evidence E = instance
Event H = class value for instance
Naïve Bayes assumption: evidence can be split into
independent parts (i.e. attributes of instance!)
P(H|E) = P(E1|H) × P(E2|H) × … × P(En|H) × P(H) / P(E)
Naïve Bayes on the Weather Data
Evidence E (the new day): Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True
P( Yes | E) = (P( Outlook = Sunny | Yes) ×
P( Temperature = Cool | Yes) ×
P( Humidity = High | Yes) ×
P( Windy = True | Yes) × P(Yes)) / P(E)
P( Yes | E) = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E)
= 0.0053 / P(E)
P( No | E) = (3/5 × 1/5 × 4/5 × 3/5 × 5/14) / P(E)
= 0.0206 / P(E)
Note that P(E) will disappear when we
normalize!
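A minimal Python sketch of this calculation (illustrative; the new day's values are those used above, and the counts come from the 14 weather instances):

from collections import Counter

# (outlook, temperature, humidity, windy, play)
weather = [
    ("sunny","hot","high","false","no"),      ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"),  ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"),  ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"),("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"),  ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"),   ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"),("rainy","mild","high","true","no"),
]
new_day = ("sunny", "cool", "high", "true")

class_counts = Counter(row[-1] for row in weather)        # {'yes': 9, 'no': 5}
scores = {}
for cls, n_cls in class_counts.items():
    score = n_cls / len(weather)                          # prior P(class)
    for i, value in enumerate(new_day):
        matches = sum(1 for row in weather if row[-1] == cls and row[i] == value)
        score *= matches / n_cls                          # P(attribute value | class)
    scores[cls] = score

total = sum(scores.values())                              # normalizing constant replaces P(E)
for cls, s in scores.items():
    print(cls, round(s, 4), round(s / total, 3))
# Expected: yes 0.0053 (0.205 after normalizing), no 0.0206 (0.795 after normalizing)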
Comments on Naïve Bayes Classification
• Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
• Why? Because classification doesn’t require accurate probability estimates as long as the maximum probability is assigned to the correct class
• However: adding too many redundant attributes will cause problems (e.g., identical attributes)
Suggested readings
• Verify the entropy, information, and information gain calculations we did in these slides.
  Hint: All logs are base 2!!!
• Read the SAS GSEM 5.3 text, chapter 4 (pp. 61-102).
• Read the Sarma text, chapter 4 (pp. 113-168). Pay particular attention to:
  – Entropy calculations (p. 126)
  – Profit Matrix (p. 136)
  – Expected profit calculations (p. 137)
  – How to use SAS EM and grow a decision tree (pp. 143-158)