Transcript PPT

Incremental Classification Using Tree-Based Sampling for Large Data
H. Yoon, K. Alsabti, and S. Ranka
Instance Selection and Construction for Data Mining,
Chapter 11, pp. 189-206
Cho, Dong-Yeon
(C) 2001 SNU CSE Biointelligence Lab
Introduction

Large-Scale Applications
- Building classifiers on large datasets requires new techniques that minimize the overall execution time while delivering comparable quality.
- Constructing a classifier for a large, growing dataset from scratch would be extremely wasteful and computationally prohibitive.

ICE (Incremental Classification for Ever-growing large datasets)
- Incrementally builds decision tree classifiers for large, growing datasets using tree-based sampling techniques.

Decision Tree Classification
- Easily interpreted and comprehended by human beings, and constructed relatively fast
- Requires no prior knowledge of domains or distributions of the data, yet gives good classification accuracy
- Consists of a construction phase and a pruning phase
Related Work

Test Criterion for Partitioning a Set of Records
- ID3, C4.5
- CART, SPRINT
- CLOUDS

Discretization and Sampling for Large Datasets
- Windowing
- Stratification

RainForest Framework
- BOAT algorithm
Incremental Classification

Problem Definition
- Terms
  - A decision tree T
  - A dataset D of size N
  - A new incremental dataset d of size n
  - A new decision tree T´
- The goal is to build T´ with minimal cost such that T´ is comparable in accuracy to the decision tree built for D + d from scratch.

Impurity-based Split Selection Methods
- The gini index

  gini(S) = 1 - sum_{i=1}^{m} (C_i / n)^2

  - n: the total number of records in a set S
  - C_i: the number of records that belong to class i
  - m: the number of classes
- When S is split into two subsets S1 and S2 with n1 and n2 records:

  gini_split(S) = (n1 / n) * gini(S1) + (n2 / n) * gini(S2)

- The gini index does not lend itself to reuse when the number of records changes.
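For concreteness, here is a minimal Python sketch of the gini computations above (the slides contain no code; the function names and the tiny example set are illustrative only):

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum_i (C_i / n)^2 over the m classes present in S."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """gini_split(S) = (n1/n) * gini(S1) + (n2/n) * gini(S2)."""
    n1, n2 = len(left_labels), len(right_labels)
    n = n1 + n2
    return (n1 / n) * gini(left_labels) + (n2 / n) * gini(right_labels)

# Toy example: a candidate split of 6 records into two subsets of 3.
S1 = ["a", "a", "b"]
S2 = ["b", "b", "b"]
print(gini(S1 + S2))       # impurity of the full set  (~0.444)
print(gini_split(S1, S2))  # impurity after the split  (~0.222)
```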

Weighted Gini Index
- A weight can be assigned to a record when the record represents a number of other records.
- The motivation for the weighted gini index: replace similar records by a weighted representative and use the representatives in incremental batch learning.
- Let S´ be a set of n sorted numerical points in nondecreasing order.
  - Each point (called a w-point) is associated with a weight w_j (w_j >= 1).
  - The total weight sum W of S´ is

    W = sum_{j=1}^{n} w_j >= n
- The weighted gini index

  gini^w(S´) = 1 - sum_{i=1}^{m} (Q_i / W)^2

  gini^w_split(S´) = (W1 / W) * gini^w(S´1) + (W2 / W) * gini^w(S´2)

  - Q_i: the weight sum of the w-points that belong to class i in S´
  - W1 and W2: the weight sums of the subsets S´1 and S´2, respectively
- Lemma 1
  - The gini^w_split index value at a clear-cut splitting point x on a dimension is consistent with the original gini index value gini_split at the same splitting point.
- Lemma 2
  - The gini^w_split index value is consistent with the original gini index value gini_split with respect to the splitting condition.
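A companion sketch for the weighted gini index, assuming each w-point is stored as a (class_label, weight) pair; collapsing identical records into a single weighted representative leaves the index value unchanged, which is the intuition behind Lemmas 1 and 2 (the names are illustrative, not from the chapter):

```python
def gini_w(points):
    """gini^w(S') = 1 - sum_i (Q_i / W)^2, with Q_i the weight sum of class i."""
    W = sum(w for _, w in points)
    if W == 0:
        return 0.0
    Q = {}
    for label, w in points:
        Q[label] = Q.get(label, 0.0) + w
    return 1.0 - sum((q / W) ** 2 for q in Q.values())

def gini_w_split(S1, S2):
    """gini^w_split(S') = (W1/W) * gini^w(S'1) + (W2/W) * gini^w(S'2)."""
    W1 = sum(w for _, w in S1)
    W2 = sum(w for _, w in S2)
    W = W1 + W2
    return (W1 / W) * gini_w(S1) + (W2 / W) * gini_w(S2)

# Three identical class-"a" records collapsed into one w-point of weight 3:
unweighted = [("a", 1), ("a", 1), ("a", 1), ("b", 1)]
weighted = [("a", 3), ("b", 1)]
print(gini_w(unweighted), gini_w(weighted))  # identical values (0.375)
```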

ICE: A Scalable Method for Incremental Classification
- Characteristics of ICE
  - Size scalability, easy parallelization, inherent migration into a distributed environment, temporal scalability, algorithm independence, minimal data access, and flexibility in dealing with the addition and deletion of records as partitions
Sampling for Incremental Classification

Sampling from Decision Tree
- One of the problems with random sampling is that it is sensitive to the distribution of the data.
- The decision tree built on a dataset can be used to generate a good sample of that dataset.
- Such a sampling method is expected to yield a new decision tree whose quality is better than or comparable to that obtained with random sampling.

Sampling Techniques for Incremental Classification
- Random sampling
  - ICE-random: weighted samples are chosen randomly from each node of the decision tree built for the current incremental partition.
  - Stratified sampling: samples are randomly chosen based on the class labels, in proportion to the size of each class in each node.
- Local clustering
  - ICE-local: a clustering algorithm can be applied to the records in each node of an intermediate decision tree to obtain good weighted samples (see the sketch after this list).
  - ICE-global: a clustering algorithm can be applied to each incremental partition as a whole to extract samples for incremental classification.
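A rough sketch of the ICE-local idea under stated assumptions: scikit-learn's KMeans stands in for the clustering algorithm, the records of one tree node are numeric attribute vectors, and all records in the node are assumed to share a class label (otherwise the clustering would be applied per class). The function name and the choice of k are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

def weighted_samples_for_node(node_records, k):
    """Cluster the records of one decision-tree node and return
    (centroid, weight) pairs, where weight = number of records represented."""
    k = max(1, min(k, len(node_records)))
    km = KMeans(n_clusters=k, n_init=10).fit(node_records)
    weights = np.bincount(km.labels_, minlength=k)
    return list(zip(km.cluster_centers_, weights))

# Usage: a 15% sampling rate on a node holding 200 records -> 30 representatives.
node_records = np.random.rand(200, 16)  # e.g. 16 numeric attributes, as in the letter data
samples = weighted_samples_for_node(node_records, k=int(0.15 * len(node_records)))
```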

Cost of Incremental Classification Using ICE
- The cost for ICE

  T_ICE(A, D) = T(A, D_i) + r * |D_i| + T(A, U_i)

  - T(A, D_i): the cost of building a decision tree classifier for a dataset D_i with algorithm A
  - r: the sampling rate (0 < r <= 1)
  - U_i: the set of weighted samples accumulated from the partitions processed so far
- The ratio R of the cost of ICE to the cost of building a decision tree from scratch:

  R = [T(A, D_i) + r * |D_i| + T(A, U_i)] / T(A, D)

- R converges to r for large i under some assumptions.
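A tiny numerical sketch of why R tends to r, assuming (purely for illustration) that the tree-building cost T(A, ·) grows linearly with the number of records, that all incremental partitions have the same size, and that |U_i| equals r times the data seen so far:

```python
def cost_ratio(i, partition_size=100_000, r=0.15):
    """R = [T(A, D_i) + r*|D_i| + T(A, U_i)] / T(A, D), with T(A, X) ~ |X|."""
    D = i * partition_size             # total records after i partitions
    t_new_tree = partition_size        # cost of the tree on the new partition D_i
    t_sampling = r * partition_size    # cost of drawing the weighted sample
    t_sample_tree = r * D              # cost of the tree on the accumulated samples U_i
    return (t_new_tree + t_sampling + t_sample_tree) / D  # divided by T(A, D) ~ |D|

for i in (1, 5, 20, 100):
    print(i, round(cost_ratio(i), 3))  # approaches r = 0.15 as i grows
```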
Empirical Results

Experimental Results with Real Datasets
- Records represent characteristics of hand-written alphabet letters.
  - 15,000 and 5,000 records of 16 attributes for the training and test datasets, respectively
  - 3 classes and 5 epochs
- ICE-random is omitted on purpose.
  - Its results are not appreciably better than random sampling from the entire dataset.
- K-means is used as the clustering algorithm.
Experimental results for the letter dataset with five epochs
The quality measure for the letter dataset at a 15% sampling rate

Experimental Results with Large Synthetic Datasets
- 72 MB with 1 million records
- A smaller dataset of 100,000 records was created and used to approximate the accuracy of the full dataset for comparison purposes.

Important Implications of ICE
- ICE can be applied to large datasets.
- ICE can run on a variety of computing configurations.
- ICE can be utilized for dynamically growing large datasets.
- A larger sample size may improve the overall accuracy of ICE; however, the sampling rate need not be very high.