An Evaluation of Progressive Sampling for Imbalanced Data Sets
國立雲林科技大學
National Yunlin University of Science and Technology
Advisor : Dr. Hsu
Presenter : Ai-Chen Liao
Authors : Willie Ng and Manoranjan Dash
2006 IEEE International Conference on Data Mining - Workshops
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline
Motivation
Objective
Method
Progressive Sampling (PS)
Progressive Sampling with Over-sampling (PSOS)
Experimental Result
Conclusion
Comments
Motivation
One of the emerging challenges for the data mining
research community is to allow learning algorithms to
mine huge databases.
Even if a large data set is able to fit into memory,
running a learning algorithm on the entire data set can
be computationally expensive.
One way of abating the cost is to train the learning
algorithm by employing sampling techniques.
Objective
We study the learning-curve sampling method, an
approach for applying machine learning algorithms to
massive data sets.
We present a refinement for progressive sampling
which works well in practice and is able to converge
to the desired sample size very quickly and accurately.
Method ─ Progressive Sampling (PS)
Provost et al. [1] suggested using progressive sampling (PS) on a
large data set.
PS starts with a reasonably small sample and uses progressively
larger ones until the accuracy of the learning algorithm no
longer improves.
[Figure: learning curve, model performance (y-axis) versus sample size (x-axis)]
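The procedure described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the function name `progressive_sampling`, the `min_gain` threshold used as the stopping rule, and the toy scorer `toy_score` are all assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple


def progressive_sampling(
    data: Sequence,
    schedule: Sequence[int],
    train_and_score: Callable[[List], float],
    min_gain: float = 0.001,
    seed: int = 0,
) -> Tuple[int, float]:
    """Return the (sample size, accuracy) after which accuracy stopped improving.

    Assumption: "no longer improves" is modelled as a gain below `min_gain`.
    """
    rng = random.Random(seed)
    pool = list(data)
    best_size, best_acc = 0, float("-inf")
    for n in schedule:
        n = min(n, len(pool))            # never ask for more than we have
        sample = rng.sample(pool, n)     # draw n examples without replacement
        acc = train_and_score(sample)    # train on the sample, return accuracy
        if acc - best_acc < min_gain:    # termination criterion (assumed)
            break
        best_size, best_acc = n, acc
        if n == len(pool):               # the whole data set has been used
            break
    return best_size, best_acc


def toy_score(sample: List) -> float:
    """Hypothetical stand-in for a learner: accuracy saturates at 0.9."""
    return min(0.9, len(sample) / 1000)


result = progressive_sampling(
    list(range(5000)), [100, 200, 400, 800, 1600, 3200], toy_score
)
print(result)
```

In a real setting `train_and_score` would fit a classifier on the sample and evaluate it on held-out data; the toy scorer only mimics the shape of a learning curve.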
Progressive sampling requires the
definition of at least three main
components:
(i) the sampling schedule, for instance arithmetic (100, 200, 300, 400…) or geometric (100, 200, 400, 800…),
(ii) the initial data sample,
(iii) the termination criterion.
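The two example schedules on this slide, one growing by a fixed step and one doubling each time, can be generated as follows. This is a small sketch; the helper names `arithmetic_schedule` and `geometric_schedule` and their parameters are chosen for illustration and do not come from the paper.

```python
from typing import List


def arithmetic_schedule(n0: int, step: int, n_max: int) -> List[int]:
    """Sizes n0, n0 + step, n0 + 2*step, ... up to n_max."""
    if n0 <= 0 or step <= 0:
        raise ValueError("n0 and step must be positive")
    return list(range(n0, n_max + 1, step))


def geometric_schedule(n0: int, ratio: int, n_max: int) -> List[int]:
    """Sizes n0, n0*ratio, n0*ratio**2, ... up to n_max."""
    if n0 <= 0 or ratio <= 1:
        raise ValueError("n0 must be positive and ratio greater than 1")
    sizes = []
    n = n0
    while n <= n_max:
        sizes.append(n)
        n *= ratio
    return sizes


print(arithmetic_schedule(100, 100, 400))  # [100, 200, 300, 400]
print(geometric_schedule(100, 2, 800))     # [100, 200, 400, 800]
```

The practical difference is the number of models that must be trained: up to a maximum size of 100,000, the arithmetic schedule above yields 1,000 sample sizes, while the geometric one yields 10.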
Method ─ Progressive Sampling with Over-sampling (PSOS)
The notion of modifying PS is motivated by the experiments
documented in [13].
To achieve good classification, the training set should generally contain
between 50% and 90% minority-class examples.
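One simple way to bring a sample to such a class distribution is random over-sampling of the minority class. The slides do not say how PSOS performs its over-sampling, so the sketch below is an assumption: it duplicates randomly chosen minority examples until they make up a target fraction of the training set. The function name `oversample_minority` and the `(features, label)` tuple layout are hypothetical.

```python
import math
import random
from typing import Hashable, List, Tuple

Example = Tuple[object, Hashable]  # (features, label), an assumed layout


def oversample_minority(
    samples: List[Example],
    minority_label: Hashable,
    target_fraction: float = 0.5,
    seed: int = 0,
) -> List[Example]:
    """Duplicate random minority examples until they reach target_fraction."""
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    minority = [s for s in samples if s[1] == minority_label]
    majority = [s for s in samples if s[1] != minority_label]
    if not minority:
        raise ValueError("no minority examples to over-sample")
    # Solve m / (m + M) >= f for the minority count m, given majority count M.
    needed = math.ceil(target_fraction * len(majority) / (1 - target_fraction))
    extra = max(0, needed - len(minority))
    rng = random.Random(seed)
    # Sampling with replacement, so the same example may be copied repeatedly.
    return list(samples) + rng.choices(minority, k=extra)


# 90 majority ("neg") and 10 minority ("pos") examples.
raw = [(i, "neg") for i in range(90)] + [(i, "pos") for i in range(10)]
balanced = oversample_minority(raw, "pos", target_fraction=0.5)
pos_count = sum(1 for _, label in balanced if label == "pos")
print(len(balanced), pos_count)
```

Because the added examples are exact copies, this changes the class distribution without adding new information; other over-sampling schemes exist, and nothing here should be read as the authors' choice.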
Experimental Results
Conclusion
In this paper, we study the efficiency of PS when
applied to imbalanced data sets.
In PSOS, we place emphasis on managing a balanced
training set so as to speed up convergence as well as
improve overall accuracy.
[PSOS converges on average about 2 iterations earlier than CPS]
Comments
Advantage
…
Drawback
…
Application
Handling imbalanced data