Systematic Data Selection to Mine Concept
Download
Report
Transcript Systematic Data Selection to Mine Concept
Systematic Data Selection to Mine
Concept-Drifting Data Streams
Wei Fan
Proceedings of the 2004 ACM SIGKDD international
conference on Knowledge discovery
2005/01/21
董原賓
1
Outline
Issues with data stream
Will old data really help?
Optimal models
Computing optimal models
Cross Validation Decision Tree Ensemble
Experiment
Conclusion
2
Issues with data stream
Concept drift
FOi(x) : the optimal model at time stamp i
There is concept drift from time stamp i-1 to time stamp i, if
there are inconsistencies between FOi-1(x) and FOi(x)
Data sufficient
A data set is considered sufficient if adding more data will
not increase the generalization accuracy
Determining the sufficient amount can be formidably
expensive
Even if the dataset is insufficient, still need to train a model
that can best fit the changing data.
3
Will old data really help?
Definition:
Si: most recent data chunk
SP: previous data chunks, SP= S1 U…U Si-1
y = f(x): the underlying true model that
we aim to model
4
Will old data really help?
(Cont.)
Three types of SP:
1. FOi (x) ≠ FOi-1(x), obviously it will only
cancel out the changing concept
2. FOi (x) = FOi-1(x) ≠ y, both models agree
that their predictions are wrong, can’t
determine if they will help or not
3. FOi (x) = FOi-1(x) = y, both models make
the correct prediction, it is the only portion
that may help
5
Will old data really help?
(Cont.)
6
Optimal Models
The two main themes of our comparison is on
possible data insufficiency and concept drift
Four situations:
1. New data is sufficient by itself and
there is no concept drift, the optimal
model should be the one trained from
the new data itself.
7
Optimal Models
(cont.)
2. New data is sufficient by itself and there is
concept drift, the optimal model should be
the one trained from the new data itself.
3. New data is insufficient by itself and there
is no concept drift. If the previous data is
sufficient, the optimal model should be the
existing model. Otherwise, train a new model
from new data plus existing data
8
Optimal Models
(cont.)
4. New data is insufficient by itself and there
is concept drift. Choose previous data chunks
that have consistent concept with the new
data chunk and combine them with the new
data
We will usually never know if the data is indeed
sufficient. Ideally, we should compare a few sensible
choices if the training cost is affordable
9
Computing optimal models
Definition:
Di-1:the dataset that trained the most
recent optimal model FOi-1(x) and is
collectively iteratively throughout the
streaming mining process
si-1:examples selected from Di-1
10
Computing optimal models
(cont.)
Steps:
1. Train a model FNi(x) from the new data
chunk Si only.
2. Select examples from Di-1 that both the
trained new model FNi(x) and the recent
optimal model FOi-1(x) make the correct
prediction.
si-1 = { ∀(x,y) ∈ Di-1, such that, (FNi(x) = y) ∧ (FOi-1(x) = y) }
11
Computing optimal models
(cont.)
3. Train a model FNi+(x) from the data
data plus the selected data in the last
step (Si U si-1)
4. Update the most recent model FOi-1(x)
with Si and call this model FOi-1+(x).
When update the model, keep the
structure of the model and update its
internal statistics
12
Computing optimal models
(cont.)
5. Using “cross-validation“ to compare
the accuracy of FNi(x), FOi-1(x), FNi+(x)
and FOi-1+(x). Choose the one that is
the most accurate and name it FOi(x)
6. Di is the training set that computes
FOi(x) and is one of Si, Di-1, Si U si-1, and
Si U Di-1
13
Computing optimal models
(cont.)
Address how the above framework finds
the optimal model under all four previously
discussed situations:
1. New data is sufficient by itself and
there is no concept drift. Conceptually
FNi(x) should be the optimal model.
FOi-1(x), FNi+(x) and FOi-1+(x) could be
its close match
14
Computing optimal models
(cont.)
2. New data is sufficient by itself and there is
concept drift. FNi(x) should be the optimal
model. FNi+(x) could be very similar in
performance to FNi(x)
3. New data is insufficient by itself and there
is no concept drift. The optimal model should
be either FOi-1(x) or FOi-1+(x)
4. New data is insufficient by itself and there
is concept drift. The optimal model should be
either FNi(x) or FNi+(x)
15
Cross Validation Decision Tree Ensemble
Decision tree ensemble
Step1: sequentially scans the complete
dataset once and finds out all features with
information gain
Step2: When building trees, at each step
choose a remaining feature randomly
Step3: The tree stop growing a branch if
there are no more examples passing through
that branch
16
Cross Validation Decision Tree Ensemble (cont.)
Rule1: Features without information
gain will never be used
Rule2: Each discrete feature can be
used at most once in a particular
decision path
Rule3: Each continuous feature can be
chosen multiple times on the same
decision path
17
Cross Validation Decision Tree Ensemble (cont.)
Rule4: The splitting threshold is a random
value within the max and min of that feature
Rule5: In the training data, each example x is
assign an initial weight 1.0 When missing
feature value is encountered, the current
weight of x is distributed across its children
nodes. If the prior distribution of known
values are given, the weight is distributed in
proportion to this distribution. Otherwise, it is
equally divided among the children nodes
18
Cross Validation Decision Tree Ensemble (cont.)
Cross validation:
n: the size of the training set
n-fold cross validation: leaves one
example x out and use the remaining
n – 1 examples to train a model and
classify on the left-out example x
19
Cross Validation Decision Tree Ensemble (cont.)
Compute the probability of x: Assume
that we have two class labels, either
fraud or non fraud, the probability if x
being fraudulent is
20
Experiment (streaming data)
Synthetic data:Data with drifting concepts based on
a moving hyperplane Roughly half of the examples
are positive, and the other half negative
Credit card fraud data:Real life credit card
transaction for cost sensitive mining. One year period
and 5million transactions. Investigating a fraud
transaction cost = $90
Donation dataset:The famous donation dataset that
first appeared in KDDCUP’98 competition. 95412
known records. 96367 unknown records for test
21
Experiment( Synthetic dataset)
22
Experiment (credit card dataset)
23
Experiment (Donation dataset)
24
Experiment (Accuracy of Cross validation)
25
Experiment (Optimal model counts)
26
Experiment (Training time)
27
Experiment (Data sufficiency test)
28
Conclusion
Using old data unselectively is like
gambling
Proposed a cross-validation-based
framework to choose data and compare
sensible choices
Proposed an implementation of this
framework using cross-validation
decision tree ensemble
29