Transcript KDD04stream

Systematic Data Selection
to Mine Concept Drifting Data Streams
Wei Fan
IBM T.J. Watson
About … …
 Data streams: continuous stream of
new data, generated either in real
time or periodically.
Credit card transactions
Stock trades
Insurance claim data
Phone call records
 Our notations.
Data Streams
[Figure: a data stream arriving over periods t1 through t5; the earlier periods are labeled old data and the most recent ones new data.]
Data Stream Mining
Characteristics: the data may change over time.
 Main goal of stream mining:
 make sure that the constructed model is
the most accurate and up-to-date.
Data Sufficiency
 Definition:
 A dataset is considered “sufficient” if adding more
data items will not increase the final accuracy of a
trained model significantly.
 We normally do not know if a dataset is
sufficient or not.
 Sufficiency detection:
 Expensive “progressive sampling” experiment.
 Keep on adding data and stop when accuracy
doesn’t increase significantly.
 Dependent on both dataset and algorithm
 Difficult to make a general claim
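To make the progressive-sampling check above concrete, here is a minimal sketch; train_model(), accuracy(), and the growth schedule are hypothetical placeholders, not from the paper:

```python
def looks_sufficient(data, labels, train_model, accuracy,
                     start=1000, growth=2.0, min_gain=0.005):
    """Keep adding training data; stop when accuracy no longer increases significantly."""
    n, best = start, None
    while n <= len(data):
        model = train_model(data[:n], labels[:n])
        acc = accuracy(model, data, labels)        # ideally measured on held-out data
        if best is not None and acc - best < min_gain:
            return True                            # accuracy plateaued: looks "sufficient"
        best = acc if best is None else max(best, acc)
        n = int(n * growth)
    return False                                   # never plateaued within the available data
```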
Possible changes of data streams
 Possible “concept drift”.
 For the same feature vector, different class
labels are generated at some later time
 Or stochastically, with different probabilities.
 Possible “data sufficiency”.
 Other possible changes not addressed in
our paper.
 Most important of all:
 These are “possibilities”.
 No “Oracle” out there to tell us the truth!
 Dangerous to make assumptions.
How many combinations?
 Four combinations:
Sufficient and no drift
Insufficient and no drift
Sufficient and drift
Insufficient and drift
 Question: Does the “most accurate
model” remain the same under all
four situations?
Case 1: Sufficient and no drift
 Solution one:
 Throw away old models and data.
 Re-train a new model from new data.
 By definitions of data sufficiency.
 Solution two:
 If old model is trained from “sufficient
data”, just use the old model
Case 2: Sufficient and drift
 Solution one:
 Train a new model from new data
 Same “sufficiency definition”.
Case 3: Insufficient and no drift
 Possibility I: if old model is trained
from sufficient data, keep the old
model.
 Possibility II: otherwise, combine new
data and old data, and train a new
model.
Case 4: Insufficient and drift
 Obviously, new data is not enough by
definition.
What are our options?
 Use old data?
 But how?
A moving hyperplane
[Figure: a hyperplane decision boundary that shifts between consecutive time periods.]
See any problems?
 Which old data items can we use?
We need to be picky
Inconsistent examples vs. consistent examples
[Figure: old examples whose labels conflict with the new hyperplane are inconsistent; those that still agree are consistent.]
See more problems?
 We normally never know which of the
four cases a real data stream belongs
to.
 It may change over time from case to
case.
Normally, no ground truth is known a priori, or even later.
Solution
 Requirements:
 The right solution should not be “one size fits
all.”
 Should not make any assumptions. Any
assumptions can be wrong.
 It should be adaptive.
 Let the data speak for itself.
We prefer model A over model B if A is likely to be more accurate than B on the evolving data stream.
No assumptions!
An “Un-biased” Selection framework
 Train FN from new data.
 Train FN+ from new data and selected
consistent old data.
 Assume FO is the previous most accurate
model. Update FO using the new data. Call
it FO+.
 Use cross-validation to choose among the
four candidate models {FN, FN+, FO, and
FO+}.
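A minimal sketch of this selection step; train(), update(), select_consistent(), and cv_accuracy() are hypothetical helpers standing in for the framework's components:

```python
def choose_best_model(new_data, old_data, F_O, train, update, select_consistent, cv_accuracy):
    F_N  = train(new_data)                                      # model from new data only
    # Consistency is judged here by the model trained on new data (one reasonable choice).
    F_Np = train(new_data + select_consistent(old_data, F_N))   # new data + consistent old data
    F_Op = update(F_O, new_data)                                # previous best model, updated
    candidates = [F_N, F_Np, F_O, F_Op]
    # Cross-validate each candidate on the new data; keep the most accurate one.
    return max(candidates, key=lambda f: cv_accuracy(f, new_data))
```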
Consistent old data
Theoretically, if we knew the true models, we could use them to choose consistent data. But we don't.
Practically, we have to rely on "optimal models."
Go back to the hyperplane example.
A moving hyperplane: their optimal models
[Figure: the moving hyperplane from the earlier example, together with the optimal model learned at each period.]
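A minimal sketch of the consistency filter, under the assumption that an old example counts as consistent when the reference (optimal) model's prediction matches its recorded label; predict() is a hypothetical helper, and the paper's exact criterion may differ:

```python
def select_consistent(old_data, reference_model, predict):
    # Keep only old examples whose labels still agree with the reference model's decision.
    return [(x, y) for (x, y) in old_data if predict(reference_model, x) == y]
```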
True model and optimal models
True model: a perfect model that never makes mistakes.
Not always possible due to:
Stochastic nature of the problem
Noise in training data
Insufficient data
Optimal model: defined over a given loss function.
Optimal Model
 Loss function L(t,y) to evaluate
performance.
 t is true label and y is prediction
The optimal decision y* is the label that minimizes the expected loss when x is sampled many times:
 0-1 loss: y* is the label that appears the most
often, i.e., if P(fraud|x) > 0.5, predict fraud
 cost-sensitive loss: the label that minimizes the
“empirical risk”.
If P(fraud|x) * $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud
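Written out as formulas (restating the slide; the $1000 and $90 figures are the slide's own example numbers):

```latex
% Optimal decision under loss L(t, y):
y^* \;=\; \arg\min_{y}\; \mathbb{E}_{t \sim P(t \mid x)}\!\big[L(t, y)\big]
% 0-1 loss: predict the most probable label, e.g. fraud iff P(fraud | x) > 0.5.
% Cost-sensitive example from the slide:
P(\mathrm{fraud}\mid x)\cdot \$1000 \;>\; \$90
\;\Longleftrightarrow\;
P(\mathrm{fraud}\mid x) \;>\; 0.09
```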
Random decision trees
 Train multiple trees. Details to follow.
 Each tree outputs posterior probability
when classifying an example x.
 The probability outputs of many trees are
averaged as the final probability estimation.
 Loss function and probability are used to
make the best prediction.
Training
At each node, an unused feature is chosen randomly.
A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.
Example
[Figure: one example random tree. The root randomly tests Gender; one branch then tests Age > 30 (leaf counts P: 1, N: 9), another tests Age > 25 (leaf counts P: 100, N: 150).]
Training: Continued
 We stop when one of the following
happens:
 A node becomes empty.
 Or the total height of the tree exceeds a
threshold, currently set as the total number of
features.
 Each node of the tree keeps the number of
examples belonging to each class.
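A small, self-contained sketch of growing one tree as described on the last two slides; the data representation (feature dictionaries) and the random threshold choice are illustrative assumptions, not the paper's code:

```python
import random
from collections import Counter

def grow_tree(examples, features, continuous, depth=0, used=frozenset()):
    """examples: list of (feature_dict, label); features: list of feature names;
    continuous: set of feature names treated as continuous."""
    node = {"counts": Counter(label for _, label in examples)}   # per-class counts at every node
    candidates = [f for f in features if f in continuous or f not in used]
    if not examples or depth >= len(features) or not candidates:
        return node                        # stop: empty node, height limit, or nothing left to pick
    f = random.choice(candidates)          # random choice; no information gain is computed
    if f in continuous:
        threshold = random.choice([x[f] for x, _ in examples])   # a fresh random threshold each time
        node.update(feature=f, threshold=threshold)
        node["children"] = {
            "<=": grow_tree([(x, y) for x, y in examples if x[f] <= threshold],
                            features, continuous, depth + 1, used),
            ">":  grow_tree([(x, y) for x, y in examples if x[f] > threshold],
                            features, continuous, depth + 1, used),
        }
    else:
        node.update(feature=f)
        node["children"] = {
            v: grow_tree([(x, y) for x, y in examples if x[f] == v],
                         features, continuous, depth + 1, used | {f})
            for v in {x[f] for x, _ in examples}
        }
    return node
```

For simplicity this sketch only creates branches for feature values seen at a node, so the "empty leaf" case on the next slide shows up as an unseen branch rather than an explicit empty child.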
Classification
 Each tree outputs membership probability
 p(fraud|x) = n_fraud/(n_fraud + n_normal)
If a leaf node is empty (likely when a discrete feature is tested near the end of a path):
Use the parent node's probability estimate, but do not output 0 or NaN.
The membership probabilities from multiple random trees are averaged as the final output.
A loss function is required to make a decision:
0-1 loss: if p(fraud|x) > 0.5, predict fraud
cost-sensitive loss: if p(fraud|x) * $1000 > $90, predict fraud
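A matching sketch of the classification side, written to work with the grow_tree() structure above; the fraud thresholds are the slide's example numbers:

```python
def tree_probability(node, x, target, parent_counts=None):
    """One tree's P(target | x); falls back to the parent's counts for empty leaves."""
    counts = node["counts"] if sum(node["counts"].values()) > 0 else parent_counts
    if "children" in node:
        f = node["feature"]
        if "threshold" in node:
            branch = "<=" if x[f] <= node["threshold"] else ">"
        else:
            branch = x.get(f)
        child = node["children"].get(branch)
        if child is not None:
            return tree_probability(child, x, target, counts)
    return counts[target] / sum(counts.values())    # e.g. n_fraud / (n_fraud + n_normal)

def predict_fraud(trees, x, cost_sensitive=False):
    p = sum(tree_probability(t, x, "fraud") for t in trees) / len(trees)   # average the trees
    threshold = 0.09 if cost_sensitive else 0.5     # $90 / $1000 vs. plain 0-1 loss
    return p > threshold
```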
N-fold Cross-validation with Random
Decision Trees
Tree structure is independent of the data.
Only the class counts need compensation when computing probabilities for the held-out data.
Key advantage
 n-fold cross validation comes easy.
 Same cost as testing the model once on
the training data.
 Training is efficient since we do not
compute information gain.
 It is actually also very accurate.
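This is my reading of the compensation trick, not code from the paper: because the tree structure ignores the data, a held-out style estimate for a training example can be obtained by walking its path with its own contribution subtracted from the counts. A sketch, reusing the structures from the earlier sketches:

```python
def compensated_probability(node, x, true_label, target):
    """P(target | x) with x's own count removed along its path (leave-one-out style)."""
    counts = dict(node["counts"])
    counts[true_label] = max(counts.get(true_label, 0) - 1, 0)     # subtract x itself
    child = None
    if "children" in node:
        f = node["feature"]
        if "threshold" in node:
            branch = "<=" if x[f] <= node["threshold"] else ">"
        else:
            branch = x.get(f)
        child = node["children"].get(branch)
    if child is not None and sum(child["counts"].values()) > 1:    # child non-empty after removal
        return compensated_probability(child, x, true_label, target)
    total = sum(counts.values())
    return counts.get(target, 0) / total if total else 0.5

def loo_accuracy(trees, train_set, target="fraud", threshold=0.5):
    """Leave-one-out accuracy at roughly the cost of one pass over the training data."""
    correct = 0
    for x, y in train_set:
        p = sum(compensated_probability(t, x, y, target) for t in trees) / len(trees)
        correct += ((p > threshold) == (y == target))
    return correct / len(train_set)
```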
Experiments
 I have a demo available to show.
Please contact me.
In the paper, I have the following experiments:
 Synthetic datasets.
 Credit card fraud datasets.
 Donation datasets.
Compare
The new selective framework proposed in this paper.
Last year's hard-coded ensemble framework:
Uses k weighted ensemble classifiers.
k = 1: train only on new data.
k = 8: use new data plus the models from the previous 7 periods; each classifier is weighted by its performance on the new data.
Both data-sufficient and data-insufficient settings; always with drift.
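For contrast, a rough sketch of the weighted-ensemble baseline; here each of the k most recent models is weighted by its accuracy on the newest chunk, which simplifies the actual SIGKDD 2003 weighting scheme (train, accuracy, and predict_proba are hypothetical helpers):

```python
def build_weighted_ensemble(chunks, k, train, accuracy):
    """chunks: data chunks in arrival order. Returns the k newest models and their weights."""
    newest = chunks[-1]
    models = [train(c) for c in chunks[-k:]]          # one model per recent chunk (k = 1 or 8)
    weights = [accuracy(m, newest) for m in models]   # weigh every model on the new data
    return models, weights

def ensemble_predict_fraud(models, weights, predict_proba, x, threshold=0.5):
    total = sum(weights) or 1.0
    p = sum(w * predict_proba(m, x) for m, w in zip(models, weights)) / total
    return p > threshold
```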
[Result figures: average results for the data-insufficient case (new method vs. last year's method) and for the data-sufficient case (new method vs. last year's method).]
Independent study and implementation
of random decision tree
Kai Ming Ting and Tony Liu from Monash University, Australia, on UCI datasets.
 Edward Greengrass from DOD on their data
sets.
100 to 300 features.
Both categorical and continuous features.
Some features have a lot of values.
2000 to 3000 examples.
Both binary and multi-class problems (16 and 25 classes).
Related publications on random trees
"Is random model better? On its accuracy and efficiency", ICDM 2003.
"On the optimality of probability estimation by random decision trees", AAAI 2004.
"Mining concept-drifting data streams with ensemble classifiers", SIGKDD 2003.