Presentation

Download Report

Transcript Presentation

PUBLIC: A Decision Tree Classifier that
Integrates Building and Pruning
Rajeev Rastogi
Kyuseok Shim
Presented by: Alon Keinan
Presentation layout
• Introduction: Classification and Decision
Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions
Introduction: Classification
• Classification in data mining:
– Training sample set
– Classifying future records
• Techniques: Bayesian, NN, Genetic,
decision trees …
Introduction: Decision Trees
training
Presentation layout
• Introduction: Classification and Decision
Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions
Decision Tree Building
Algorithms
• 2 phases:
– The building phase
– The pruning phase
• The building constructs a “perfect” tree
• The pruning prevents “overfitting”
Building Phase Algorithms
• Differ in the selection of the test criterion
for partitioning
– CLS
– ID3 & C4.5
– CART, SLIQ & SPRINT
• Differ in their ability to handle large
training sets
• All consider “guillotine-cut” only
Pruning Phase Algorithms
• MDL – Minimum Description Length
• Cost-Complexity Pruning
Presentation layout
• Introduction: Classification and Decision
Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions
SPRINT
• Initialize root node
• Initialize queue Q to contain root node
• While Q is not empty do
– dequeue the first node N in Q
– if N is not pure
• for each attribute evaluate splits
• use least entropy split to split node N into N1 and
N2
• append N1 and N2 to Q
Entropy
MDL
• The best tree is the one that can be encoded using
the fewest number of bits
• Cost of encoding data records:
• Cost of encoding tree:
– The structure of the tree
– The splits
– The classes in the leaves
Pruning algorithm
• computeCost&Prune(Node N)
–
–
–
–
If N is a leaf return (C(S)+1)
minCostLeft:=computeCost&Prune(Nleft)
minCostRight:=computeCost&Prune(Nright)
minCost:=min{C(S)+1,
Csplit(N)+1+minCostLeft+minCostRight}
– If minCost=C(S)+1
• Prune child nodes Nleft and Nright
– return minCost
Presentation layout
• Introduction: Classification and Decision
Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions
PUBLIC
• PUBLIC = PrUning and BuiLding Integrated in
Classification
• Uses SPRINT for building
• Prune periodically !!!
• Basically uses MDL for pruning
• Distinguished three types of leaves:
– “not expandable”
– “pruned”
– “yet to be expanded”
• Exact same tree
Lower Bound Computation
• PUBLIC(1)
– Bound=1
• PUBLIC(S)
– Incorporating split costs
• PUBLIC(V)
– Incorporating split values
PUBLIC(S)
• Calculates a lower bound for s=0,..,k-1
– For s=0: C(S)+1
– For s>0:
• Takes the minimum of the bounds
• Computes by iterative addition
• O(klogk)
PUBLIC(V)
• PUBLIC(S) estimates each split as log(a)
• PUBLIC(V) estimates each split as log(a),
plus the encoding of the splitting value\s
• Complexity: O(k*(logk+a))
Lower Bound Computation
Summary
PUBLIC(1)
Fixed - 1
O(1)
PUBLIC(S)
Incorporating
split costs
O(klogk)
PUBLIC(V)
Incorporating
split value costs
O(k*(logk+a))
Presentation layout
• Introduction: Classification and Decision
Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions
Performance Comparisons
• Algorithms:
–
–
–
–
SPRINT
PUBLIC(1)
PUBLIC(S)
PUBLIC(V)
• Data sets:
– Real-life
– Synthetic
Real-life Data Sets
SPRINT
PUBLIC( )
PUBLIC(S)
PUBLIC(V)
Synthetic Data Sets
SPRINT
PUBLIC( )
PUBLIC(S)
PUBLIC(V)
Noise
Other Parameters
• No. of Attributes
• No. of Classes
• Size of training set
Presentation layout
• Introduction: Classification and Decision
Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions
Conclusion
• The pruning is integrated into the building phase
• Computing lower bounds of the cost of “yet to be
expanded” leaves
• Improved performance
• Open:
– How often to invoke the pruning procedure?
– Expanding other algorithms …
– Developing a tighter lower bound…