ppt - CIS @ Temple University

Download Report

Transcript ppt - CIS @ Temple University

SLIQ: A Fast Scalable
Classifier for Data Mining
Manish Mehta, Rakesh Agrawal, Jorma Rissanen
1996.
Presentation by: Vladan Radosavljevic
Outline



Introduction
Motivation
SLIQ Algorithm





Building tree
Pruning
Example
Results
Conclusion
Introduction



Most of the classification algorithms are
designed for memory resident data – limited
suitability for mining large datasets
Solution – build a scalable classifier - SLIQ
SLIQ – Supervised Learning in Quest, Quest
was the data mining project at the IBM
Motivation

Recall (ID3, C4.5, CART):
Motivation





NON SCALABLE DECISION TREES:
The complexity lies in determining the best split for
each attribute
The cost of evaluating splits for numerical attributes
is dominated by the cost of sorting values at each
node
The cost of evaluating splits for categorical attributes
is dominated by the cost of searching for the best
subset
Pruning


crossvalidation inapplicable for large datasets
divide data in two parts - training and test set - sizes,
distribution???
Motivation


Improve scalability of tree classifiers
Previous proposals:





Sampling data at each node
Discretization of numerical attributes
Partitioning input data and build tree for each
partition
All methods achieve low accuracy!
SLIQ – improve learning time without loss in
accuracy!
SLIQ

Key features:





Tree classifier, handling both numerical and
categorical attributes
Presort numerical attributes before tree
has been built
Breadth first growing strategy
Goodness test – Gini index
Inexpensive tree pruning algorithm based
on Minimum Description Length (MDL)
SLIQ - Algorithm



Eliminate the need to sort the data at
each node
Create sorted list for each numerical
attribute
Create class list
SLIQ - Algorithm

Example:
SLIQ - Algorithm

Split evaluation:
SLIQ - Algorithm

Example:
SLIQ - Algorithm

Update class list:
SLIQ - Algorithm

Example:
SLIQ - Algorithm



For large-cardinality categorical attributes
(determined based on threshold) the best
split is computed in greedy way, otherwise all
possible splits are evaluated
When node becomes pure stop splitting it,
then condense attribute lists by discarding
examples that correspond to the pure node
SLIQ is able to scale for large datasets with
no loss in accuracy – the splits evaluated with
or without pre-sorting are identical
SLIQ - Pruning


Post pruning algorithm based on
Minimum Description Length principle
Find a model that minimizes:
Cost(M,D) = Cost(D|M) + Cost(M)
Cost(M) - cost of the model
Cost(D|M) - cost of encoding the data D if
model M is given
SLIQ - Pruning


Cost of the data: classification error
Cost of the model:


Encoding the tree: number of bits
Encoding the splits:



numerical attribute - constant (empirically 1)
categorical attribute - depends on cardinality
The MDL pruning evaluate the code length at
each node to determine whether to prune
one or both child or leave the node intact
SLIQ - pruning

Three pruning strategies:



Full – pruning both children and convert
node to the leaf
Partial – prune into the leaf or prune the
left child or prune the right child or leave
node intact
Hybrid – apply Full method and then partial
(prune left, prune right or leave intact)
Results

SLIQ was tested on the datasets:
Results

Pruning strategy comparison:
Results

Accuracy:
Results

Scalability:
Conclusion



SLIQ demonstrates to be a fast, low-cost and
scalable classifier that builds accurate trees
Based on empirical test which compared SLIQ
to other tree based classifiers, SLIQ achieves
a comparable accuracy while producing
smaller decision trees
Scalability??? Memory problem when
increasing number of attributes or number of
classes
References
[1] M. Mehta, R. Agrawal and J. Rissanen, "SLIQ: A Fast Scalable Classifier
for Data Mining", in Proceedings of the 5th International Conference on
Extending Database Technology, Avignon, France, Mar. 1996.
THANK YOU!