Transcript KDD04stream
Systematic Data Selection
to Mine Concept Drifting Data Streams
Wei Fan
IBM T.J.Watson
About … …
Data streams: a continuous stream of new data, generated either in real time or periodically.
Credit card transactions.
Stock trades.
Insurance claim data.
Phone call records.
Our notations (figure): the data stream arrives in chunks t1, t2, t3, t4, t5; the earlier chunks are the old data and the most recent chunks are the new data.
Data Stream Mining
Characteristics of the data may change over time.
Main goal of stream mining:
make sure the constructed model is the most accurate and up-to-date.
Data Sufficiency
Definition:
A dataset is considered “sufficient” if adding more
data items will not increase the final accuracy of a
trained model significantly.
We normally do not know if a dataset is
sufficient or not.
Sufficiency detection:
Expensive “progressive sampling” experiment (sketched below):
Keep adding data and stop when accuracy no longer increases significantly.
Dependent on both dataset and algorithm
Difficult to make a general claim
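A minimal sketch of such a progressive-sampling check, in Python; the train_model and evaluate callables, the step size, and the tolerance are placeholders rather than anything prescribed by the paper:

```python
def progressive_sampling(train_model, evaluate, examples, labels,
                         step=1000, tolerance=0.005):
    """Grow the training set in increments of `step` and stop once the
    held-out accuracy no longer improves by more than `tolerance`
    (both numbers are arbitrary illustrative choices)."""
    prev_acc, model = None, None
    for n in range(step, len(examples) + 1, step):
        model = train_model(examples[:n], labels[:n])
        acc = evaluate(model)                    # accuracy on a held-out set
        if prev_acc is not None and acc - prev_acc < tolerance:
            return n, model                      # the first n examples look "sufficient"
        prev_acc = acc
    return len(examples), model                  # never plateaued: may still be insufficient
```

As the slide notes, the outcome depends on both the dataset and the learning algorithm plugged into train_model.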
Possible changes of data streams
Possible “concept drift”:
For the same feature vector, different class labels are generated at some later time,
or stochastically, with different probabilities.
Possible changes in “data sufficiency”: the new data may or may not be sufficient.
Other possible changes not addressed in
our paper.
Most important of all:
These are “possibilities”.
No “Oracle” out there to tell us the truth!
Dangerous to make assumptions.
How many combinations?
Four combinations:
Sufficient and no drift.
Insufficient and no drift.
Sufficient and drift.
Insufficient and drift
Question: Does the “most accurate
model” remain the same under all
four situations?
Case 1: Sufficient and no drift
Solution one:
Throw away old models and data.
Re-train a new model from new data.
By the definition of data sufficiency.
Solution two:
If the old model was trained from “sufficient data”, just use the old model.
Case 2: Sufficient and drift
Solution one:
Train a new model from new data
Same “sufficiency definition”.
Case 3: Insufficient and no drift
Possibility I: if the old model was trained from sufficient data, keep the old model.
Possibility II: otherwise, combine new
data and old data, and train a new
model.
Case 4: Insufficient and drift
Obviously, the new data alone is not enough, by definition.
What are our options?
Use old data?
But how?
A moving hyperplane (figures).
See any problems?
Which old data items can we use?
We need to be picky.
Inconsistent examples vs. consistent examples (figures).
See more problems?
We normally never know which of the four cases a real data stream belongs to.
It may change over time from case to case.
Normally, no ground truth is known a priori, or even later.
Solution
Requirements:
The right solution should not be “one size fits
all.”
Should not make any assumptions. Any
assumptions can be wrong.
It should be adaptive.
Let the data speak for itself.
We prefer model A over model B if A is likely to be more accurate than B on the evolving data stream.
No assumptions!
An “Un-biased” Selection framework
Train FN from the new data.
Train FN+ from the new data plus selected consistent old data.
Assume FO is the previously most accurate model. Update FO using the new data; call the result FO+.
Use cross-validation to choose among the four candidate models {FN, FN+, FO, FO+} (a sketch follows below).
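A rough sketch of this selection step; train, update, is_consistent, and cv_accuracy are stand-ins for the procedures on this slide, and treating an old example as consistent when the model trained on the new data classifies it correctly is one plausible reading (the precise test is discussed via optimal models below):

```python
def select_best_model(new_data, old_data, F_O,
                      train, update, is_consistent, cv_accuracy):
    """Build the four candidate models and keep whichever one
    cross-validation on the new data judges most accurate."""
    F_N = train(new_data)                                   # new data only
    consistent_old = [ex for ex in old_data if is_consistent(F_N, ex)]
    F_Np = train(new_data + consistent_old)                 # new data + consistent old data
    F_Op = update(F_O, new_data)                            # previous best model, updated
    candidates = [F_N, F_Np, F_O, F_Op]
    return max(candidates, key=lambda m: cv_accuracy(m, new_data))
```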
Consistent old data
Theoretically, if we knew the true models, we could use them to choose consistent data. But we don’t.
Practically, we have to rely on “optimal models.”
Go back to the hyperplane example.
The moving hyperplane and the corresponding optimal models (figures).
True model and optimal models
True model:
A perfect model that never makes mistakes.
Not always possible, due to:
Stochastic nature of the problem
Noise in training data
Insufficient data
Optimal model: defined over a given loss
function.
Optimal Model
Loss function L(t,y) to evaluate
performance.
t is true label and y is prediction
Optimal decision y* is the label that minimizes the expected loss when x is sampled many times:
0-1 loss: y* is the label that appears the most
often, i.e., if P(fraud|x) > 0.5, predict fraud
cost-sensitive loss: the label that minimizes the
“empirical risk”.
If P(fraud|x) * $1000 > $90 or p(fraud|x) > 0.09,
predict fraud
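Both decision rules can be written directly in terms of the estimated probability; the $1000 benefit and $90 cost are the figures used on the slide:

```python
def predict_01(p_fraud):
    """0-1 loss: predict the most probable label."""
    return "fraud" if p_fraud > 0.5 else "normal"

def predict_cost_sensitive(p_fraud, benefit=1000.0, cost=90.0):
    """Cost-sensitive loss: flag the transaction when the expected gain
    exceeds the cost, i.e. p(fraud|x) * $1000 > $90,
    which is equivalent to p(fraud|x) > 0.09."""
    return "fraud" if p_fraud * benefit > cost else "normal"
```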
Random decision trees
Train multiple trees. Details to follow.
Each tree outputs posterior probability
when classifying an example x.
The probability outputs of many trees are averaged to form the final probability estimate.
Loss function and probability are used to
make the best prediction.
Training
At each node, an unused feature is chosen randomly.
A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
A continuous feature can be chosen
multiple times on the same decision path,
but each time a different threshold value
is chosen
Example (figure): a random tree whose root tests Gender (M / F); the M branch tests Age > 30, whose “y” side is a leaf with counts P: 1, N: 9; other branches (omitted) include a test on Age > 25, whose “n” side is a leaf with counts P: 100, N: 150.
Training: Continued
We stop when one of the following
happens:
A node becomes empty.
Or the total height of the tree exceeds a threshold, currently set to the total number of features.
Each node of the tree keeps the number of
examples belonging to each class.
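A compact Python sketch of this training procedure; the dictionary-based node layout, the [0, 1] range for continuous thresholds, and growing the structure before filling in the counts are simplifying assumptions, not the paper's exact implementation:

```python
import random

def build_random_tree(features, max_depth=None, depth=0, used=frozenset()):
    """Grow the structure of one random tree without looking at any labels.
    `features` maps a feature name to "continuous" or to a list of its
    discrete values."""
    if max_depth is None:
        max_depth = len(features)               # height threshold from the slide
    # a discrete feature appears at most once on a root-to-node path;
    # a continuous feature may reappear with a fresh random threshold
    discrete = [f for f, v in features.items() if v != "continuous" and f not in used]
    continuous = [f for f, v in features.items() if v == "continuous"]
    candidates = discrete + continuous
    if depth >= max_depth or not candidates:
        return {"counts": {}}                   # leaf: per-class example counts
    f = random.choice(candidates)
    node = {"feature": f, "counts": {}, "children": {}}
    if features[f] == "continuous":
        node["threshold"] = random.uniform(0.0, 1.0)   # assumes features scaled to [0, 1]
        branches, child_used = ["<=", ">"], used
    else:
        branches, child_used = features[f], used | {f}
    for b in branches:
        node["children"][b] = build_random_tree(features, max_depth, depth + 1, child_used)
    return node

def add_example(node, x, y):
    """Route one training example down the tree, incrementing the class
    counts of every node it passes through; branches that no example
    reaches simply keep empty counts (the "empty node" case)."""
    node["counts"][y] = node["counts"].get(y, 0) + 1
    if "feature" not in node:
        return
    f = node["feature"]
    branch = ("<=" if x[f] <= node["threshold"] else ">") if "threshold" in node else x[f]
    child = node["children"].get(branch)
    if child is not None:                        # unseen discrete value: stop here
        add_example(child, x, y)
```

Multiple trees are trained the same way: build several structures, then route every training example through each of them.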
Classification
Each tree outputs membership probability
p(fraud|x) = n_fraud/(n_fraud + n_normal)
If a leaf node is empty (very likely when a discrete feature is tested near the bottom of the tree):
Use the parent node’s probability estimate, but never output 0 or NaN.
The membership probabilities from multiple random trees are averaged to form the final output.
A loss function is required to make a decision:
0-1 loss: if p(fraud|x) > 0.5, predict fraud.
cost-sensitive loss: if p(fraud|x) * $1000 > $90, predict fraud.
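Continuing the node layout from the training sketch above, classification walks each tree, falls back to the parent's estimate at empty nodes, and averages the per-tree probabilities; the decision rules sketched earlier are then applied to the averaged probability:

```python
def tree_probability(node, x, target, parent_est=0.5):
    """p(target | x) from one tree; an empty node inherits its parent's
    estimate, so the output is never 0 or NaN."""
    total = sum(node["counts"].values())
    est = node["counts"].get(target, 0) / total if total else parent_est
    if "feature" not in node:                    # reached a leaf
        return est
    f = node["feature"]
    branch = ("<=" if x[f] <= node["threshold"] else ">") if "threshold" in node else x[f]
    child = node["children"].get(branch)
    if child is None or not child["counts"]:     # empty child: use this node's estimate
        return est
    return tree_probability(child, x, target, est)

def ensemble_probability(trees, x, target="fraud"):
    """Average the membership probabilities of all trees into the final output."""
    return sum(tree_probability(t, x, target) for t in trees) / len(trees)
```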
N-fold Cross-validation with Random
Decision Trees
Tree structure is independent of the data.
Compensation is applied when computing the probability estimates (see the sketch below).
Key advantage:
n-fold cross-validation comes easily,
at the same cost as testing the model once on the training data.
Training is efficient since we do not compute information gain.
It is actually also very accurate.
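One way to read the "compensation" point, sketched under the same node layout as above: since the structure is fixed before any data is seen, holding a training example out only requires subtracting its own contribution from the counts at the node it reaches, so cross-validated (here, leave-one-out) estimates cost a single pass over the training data. This is an illustrative interpretation, not necessarily the paper's exact n-fold procedure:

```python
def deepest_node(node, x):
    """Follow x down the tree and return the deepest node it reaches."""
    while "feature" in node:
        f = node["feature"]
        branch = ("<=" if x[f] <= node["threshold"] else ">") if "threshold" in node else x[f]
        child = node["children"].get(branch)
        if child is None:
            break
        node = child
    return node

def loo_probability(tree, x, y, target="fraud"):
    """Cross-validated p(target | x) for a *training* example (x, y):
    remove the example's own count (the compensation) and re-normalize,
    instead of retraining the tree with the example held out."""
    node = deepest_node(tree, x)
    counts = dict(node["counts"])
    counts[y] = max(counts.get(y, 0) - 1, 0)     # subtract the example itself
    total = sum(counts.values())
    return counts.get(target, 0) / total if total else 0.5   # fallback when nothing remains
```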
Experiments
I have a demo available to show.
Please contact me.
In the paper, I report the following experiments:
Synthetic datasets.
Credit card fraud datasets.
Donation datasets.
Compare
The new selective framework proposed in this paper.
Our last year’s hard-coded ensemble framework:
Uses an ensemble of K weighted classifiers.
K=1: train on the new data only.
K=8: use the new data and the models from the previous 7 periods;
each classifier is weighted using the new data.
Both data-sufficient and data-insufficient scenarios; drift is always present.
Results (figures):
Data insufficient: new method vs. last year’s method, with average results.
Data sufficient: new method vs. last year’s method, with average results.
Independent study and implementation of random decision trees:
Kai Ming Ting and Tony Liu from Monash University, Australia, on UCI datasets.
Edward Greengrass from the DOD on their datasets:
100 to 300 features.
Both categorical and continuous features.
Some features have a lot of values.
2000 to 3000 examples.
Both binary and multi-class problems (16 and 25 classes).
Related publications on random trees:
“Is random model better? On its accuracy and efficiency,” ICDM 2003.
“On the optimality of probability estimation by random decision trees,” AAAI 2004.
“Mining concept-drifting data streams with ensemble classifiers,” SIGKDD 2003.