Transcript Slides

CS 590M Fall 2001: Security Issues in Data Mining
Lecture 4: ID3
ID3 (precursor to C4.5)
• “Canonical” decision tree learning algorithm, by Ross Quinlan
• Has evolved into C5.0 from Rulequest, an established product
• Basic idea: recursively build the decision tree “top-down”
  – At each node, choose the “best” attribute for the decision from those remaining
  – What is “best”?
Algorithm ID3 (Training set S, Attributes R, Class C)
  If all records in S have the same value for C, return a leaf node with that value
  If R is empty, return a leaf node with the most common value for C in S
  Otherwise, create a non-leaf node:
    Let D be the attribute in R with the largest Gain(D,S)
    For each value d of D:
      Let Sd be the items in S whose value of D is d
      Create an edge labeled d to the subtree ID3(Sd, R-{D}, C)
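A minimal Python sketch of this recursion, assuming examples are represented as dicts from attribute names to values; the names (id3, gain, entropy, target) and the nested-dict tree representation are illustrative choices rather than part of the original algorithm, and gain anticipates the definition on the next slide.

```python
import math
from collections import Counter

def entropy(examples, target):
    """entropy(S) = -(p1*log2(p1) + ... + pn*log2(pn)) over the classes in S."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(examples, attr, target):
    """Expected reduction in entropy from splitting the examples on attr."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Recursive top-down construction, following the pseudocode above."""
    classes = [ex[target] for ex in examples]
    # All records have the same class: return a leaf with that value.
    if len(set(classes)) == 1:
        return classes[0]
    # No attributes left: return a leaf with the most common class value.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]
    # Otherwise pick the attribute with the largest gain and recurse.
    best = max(attributes, key=lambda a: gain(examples, a, target))
    node = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        node[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return node
```

Applied to the tennis table on the following slides, this sketch should reproduce the tree shown later in the “ID3 result” slide.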
What is Gain(D,S)?
• Information theory:
expected reduction in entropy due to
sorting S on attribute D
• entropy(S) = -(p1*log2(p1) + ... + pn*log2(pn))
– pi is the probability of class i in S.
• Gain(D,S) = entropy(S) - Σ_{i ∈ values(D)} (|Si|/|S|) * entropy(Si)
  – Si is the subset of S for which D takes its i-th value.
Example: Tennis

Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Strong  Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
What should be first?

S = [9+,5-], E = 0.940

Humidity:
  High:   [3+,4-], E = 0.985
  Normal: [6+,1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151

Wind:
  Weak:   [6+,2-], E = 0.811
  Strong: [3+,3-], E = 1.0
Gain(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
(continued)

S = [9+,5-], E = 0.940

Outlook:
  Sunny:    [2+,3-], E = 0.971
  Overcast: [4+,0-], E = 0.0
  Rain:     [3+,2-], E = 0.971
Gain(S, Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
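These root-node gains can be checked directly from the class counts in the table; the short sketch below (the helper H and the variable names are just illustrative) reproduces the slide's numbers.

```python
import math

def H(pos, neg):
    """Entropy of a node containing pos positive and neg negative examples."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

root = H(9, 5)                                               # S = [9+,5-], ≈ 0.940

# Humidity: High = [3+,4-], Normal = [6+,1-]
gain_humidity = root - (7/14) * H(3, 4) - (7/14) * H(6, 1)   # ≈ 0.152 (0.151 with the slide's rounding)
# Wind: Weak = [6+,2-], Strong = [3+,3-]
gain_wind = root - (8/14) * H(6, 2) - (6/14) * H(3, 3)       # ≈ 0.048
# Outlook: Sunny = [2+,3-], Overcast = [4+,0-], Rain = [3+,2-]
gain_outlook = (root - (5/14) * H(2, 3)
                     - (4/14) * H(4, 0) - (5/14) * H(3, 2))  # ≈ 0.247

# Outlook has the largest gain of the three, so it becomes the root test.
```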
And recurse...

[D1,D2,…,D14], [9+,5-]
Outlook:
  Sunny:    Ssunny = [D1,D2,D8,D9,D11], [2+,3-] → ?
  Overcast: [D3,D7,D12,D13], [4+,0-] → Yes
  Rain:     [D4,D5,D6,D10,D14], [3+,2-] → ?

Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(Ssunny, Temp.)    = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind)     = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
ID3 result

Outlook
  Sunny → Humidity
    High   → No   [D1,D2,D8]
    Normal → Yes  [D9,D11]
  Overcast → Yes  [D3,D7,D12,D13]
  Rain → Wind
    Strong → No   [D6,D14]
    Weak   → Yes  [D4,D5,D10]
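In the nested-dict representation used by the id3 sketch earlier, the learned tree and a lookup routine might look like this; the classify helper is only an illustration of how the tree is applied, not part of ID3 itself.

```python
# The learned tree in the nested-dict form returned by the id3 sketch above.
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(tree, example):
    """Follow the example's attribute values down to a leaf (a class label)."""
    while isinstance(tree, dict):
        attr = next(iter(tree))            # attribute tested at this node
        tree = tree[attr][example[attr]]   # branch matching the example's value
    return tree

# D1 = (Sunny, Hot, High, Weak) classifies as "No", matching the table.
print(classify(tree, {"Outlook": "Sunny", "Temp.": "Hot",
                      "Humidity": "High", "Wind": "Weak"}))
```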
Inductive Bias in ID3
• H is the power set of instances X
– Unbiased?
• Preference for short trees, and for those with high
information gain attributes near the root
• Bias is a preference for some hypotheses, rather than a
restriction of the hypothesis space H
• Occam’s razor: prefer the shortest (simplest)
hypothesis that fits the data
Occam’s Razor
• Why prefer short hypotheses?
• Argument in favor:
– Fewer short hypotheses than long hypotheses
– A short hypothesis that fits the data is unlikely to be a
coincidence
– A long hypothesis that fits the data might be a coincidence
• Argument opposed:
– There are many ways to define small sets of hypotheses
– E.g., all trees with a prime number of nodes that use attributes beginning with “Z”
– What is so special about small sets based on the size of the hypothesis?
Overfitting
Consider error of hypothesis h over
• Training data: error_train(h)
• Entire distribution D of data: error_D(h)
Hypothesis h ∈ H overfits the training data if there is
an alternative hypothesis h’ ∈ H such that
  error_train(h) < error_train(h’)
and
  error_D(h) > error_D(h’)
Avoid Overfitting
How can we avoid overfitting?
• Stop growing when data split not
statistically significant
• Grow full tree then post-prune
• Minimum description length (MDL):
Minimize:
size(tree) + size(misclassifications(tree))
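A rough sketch of the MDL criterion over the nested-dict trees used above; the helper names, the node-counting scheme, and the unweighted sum are all assumptions for illustration, since the slide only gives the general form of the quantity to minimize.

```python
def tree_size(tree):
    """Count the nodes of a nested-dict tree (a leaf label counts as one node)."""
    if not isinstance(tree, dict):
        return 1
    attr = next(iter(tree))
    return 1 + sum(tree_size(child) for child in tree[attr].values())

def mdl_cost(tree, examples, target, predict):
    """size(tree) + size(misclassifications(tree)).

    predict(tree, example) is any routine that walks the tree (e.g. the classify
    sketch above).  A real MDL criterion measures both terms in bits under a fixed
    encoding; this plain sum only illustrates the complexity/accuracy trade-off.
    """
    errors = sum(1 for ex in examples if predict(tree, ex) != ex[target])
    return tree_size(tree) + errors
```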
Reduced-Error Pruning
Split data into training and validation set
Do until further pruning is harmful:
1. Evaluate impact on validation set of pruning
each possible node (plus those below it)
2. Greedily remove the one that most improves
the validation set accuracy
Produces smallest version of most accurate
subtree
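A sketch of this procedure over the nested-dict trees from the earlier sketches; every helper name here is illustrative, and predict stands for any classify-style routine that walks such a tree.

```python
import copy
from collections import Counter

def internal_nodes(tree, path=()):
    """Yield the path ((attribute, value) pairs) of every internal node in the tree."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, child in tree[attr].items():
            yield from internal_nodes(child, path + ((attr, value),))

def replaced(tree, path, leaf):
    """Copy of the tree with the subtree at `path` pruned back to the leaf label."""
    if not path:
        return leaf
    new = copy.deepcopy(tree)
    node = new
    for attr, value in path[:-1]:
        node = node[attr][value]
    attr, value = path[-1]
    node[attr][value] = leaf
    return new

def examples_at(path, examples):
    """Training examples that reach the node at `path`."""
    return [ex for ex in examples if all(ex[a] == v for a, v in path)]

def majority_class(examples, target):
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]

def accuracy(tree, examples, target, predict):
    return sum(predict(tree, ex) == ex[target] for ex in examples) / len(examples)

def reduced_error_prune(tree, train, validation, target, predict):
    """Repeatedly evaluate pruning every node, greedily keep the single pruning
    that most improves validation accuracy, and stop when no pruning helps."""
    while True:
        best_acc = accuracy(tree, validation, target, predict)
        best_tree = None
        for path in internal_nodes(tree):
            # Pruning a node removes everything below it; the new leaf predicts
            # the majority class of the training examples that reach that node.
            leaf = majority_class(examples_at(path, train), target)
            candidate = replaced(tree, path, leaf)
            acc = accuracy(candidate, validation, target, predict)
            if acc > best_acc:
                best_acc, best_tree = acc, candidate
        if best_tree is None:   # further pruning would be harmful (or useless)
            return tree
        tree = best_tree
```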
Rule Post-Pruning
1. Convert the tree to an equivalent set of rules
2. Prune each rule independently of the others
3. Sort the final rules into the desired sequence for use
Method used in C4.5
Converting a Tree to Rules

Outlook
  Sunny → Humidity
    High   → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak   → Yes

R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
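Written as data, each rule is a set of attribute tests plus a conclusion, which is what makes step 2 of rule post-pruning possible: preconditions can be dropped from one rule without touching the others. The list and apply_rules below are only an illustrative encoding of R1–R5, not C4.5's actual machinery.

```python
# R1..R5 from the slide, each as (preconditions, conclusion).
rules = [
    ({"Outlook": "Sunny", "Humidity": "High"},   "No"),    # R1
    ({"Outlook": "Sunny", "Humidity": "Normal"}, "Yes"),   # R2
    ({"Outlook": "Overcast"},                    "Yes"),   # R3
    ({"Outlook": "Rain", "Wind": "Strong"},      "No"),    # R4
    ({"Outlook": "Rain", "Wind": "Weak"},        "Yes"),   # R5
]

def apply_rules(rules, example):
    """Return the conclusion of the first rule whose preconditions all hold."""
    for conditions, conclusion in rules:
        if all(example.get(attr) == value for attr, value in conditions.items()):
            return conclusion
    return None  # no rule fires (cannot happen for the full rule set above)

print(apply_rules(rules, {"Outlook": "Rain", "Wind": "Strong"}))  # -> "No"
```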
Issues
• Attributes with many values
  – Maximizing gain always selects these
  – But not very interesting (classify on a timestamp?)
  – Gain ratio compensates for this (see the sketch below)
• What to do with continuous attributes?
• Attributes with costs (e.g., medical tests)
• Unknown values
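The gain ratio mentioned above divides Gain(S,A) by the split information of A, which penalizes attributes that fragment the data into many small subsets. A sketch of the standard definition follows; the function name and the dict-based example encoding are assumptions consistent with the earlier sketches.

```python
import math

def split_information(examples, attr):
    """SplitInformation(S,A) = -Σ over values v of A of (|Sv|/|S|) * log2(|Sv|/|S|):
    the entropy of S with respect to the values of A itself."""
    total = len(examples)
    info = 0.0
    for value in {ex[attr] for ex in examples}:
        p = sum(1 for ex in examples if ex[attr] == value) / total
        info -= p * math.log2(p)
    return info

# GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)
# (Gain as defined earlier; the denominator grows with the number of values of A,
#  so attributes such as a timestamp that split the data into many tiny subsets
#  are penalized.)
```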