Mining with Rare Cases
Paper by Gary M. Weiss
Presenter: Indar Bhatia
INFS 795
April 28, 2005
Presentation Overview
1. Motivation and Introduction to Problem
2. Why Rare Cases are Problematic
3. Techniques for Handling Rare Cases
4. Summary and Conclusion
Motivation and Introduction
What are rare cases?
– A case corresponds to a region in the instance space that
is meaningful to the domain under study.
– A rare case is a case that covers a small region of the
instance space
Why are they important?
– Detecting suspicious cargo
– Finding sources of rare diseases
– Detecting fraud
– Finding terrorists
– Identifying rare diseases
Classification Problem
– Covers relatively few training examples
Example: Finding associations among infrequently
purchased supermarket items
Modeling Problem
[Figure: instance-space illustrations. Clustering showing 1 common and 3 rare classes; two-class classification in which the positive class contains 1 common and 2 rare cases (regions labeled P1, P2, P3).]
For a classification problem, the rare cases may manifest
themselves as small disjuncts, i.e., those disjuncts in the
classifier that cover few training examples
In unsupervised learning, the three rare cases will be more
difficult to generalize from because they contain fewer data
points
In association rule mining, the problem is to detect items
that co-occur only infrequently.
Modeling Problem
Current research indicates that rare cases and small disjuncts pose
difficulties for data mining, i.e., rare cases have a much higher
misclassification rate than common cases.
Small disjuncts collectively cover a substantial fraction of all
examples and cannot simply be eliminated – doing so will
substantially degrade the performance of a classifier.
In the most thorough study of small disjuncts to date (Weiss & Hirsh,
2000), it was shown that in classifiers induced from 30 real-world
data sets, most classification errors were contributed by the
smaller disjuncts.
Why Rare Cases are Problematic
Problems arise due to absolute rarity
– The most fundamental problem is the associated lack of data – only a few
examples related to rare cases are in the data set (Absolute rarity)
– Lack of data makes it difficult to detect rare cases, and if detected,
makes generalization difficult
Problems arise due to relative rarity
– Looking for a needle in a haystack – rare cases are obscured by
common cases (Relative rarity)
– Data mining algorithms rely on greedy search heuristics that examine
one variable at a time. Since the detection of rare cases may depend
on the conjunction of many conditions, any single condition in isolation
may not provide much guidance.
– For example, consider association rule mining. To find associations
involving rare items, the minimum support threshold must be set very
low, which causes a combinatorial explosion of candidate itemsets in
large datasets.
Why Rare Cases are Problematic
The Metrics
– The metrics used to evaluate classifier accuracy
are more focused on common cases. As a
consequence, rare cases may be totally ignored.
Example:
– Consider decision trees. Most decision trees are grown
in a top-down manner, where candidate test conditions are
repeatedly evaluated and the best one selected.
– The metrics (e.g., information gain) used to select
the best test generally prefer tests that result in a
balanced tree where purity is increased for most of
the examples.
– Rare cases, which correspond to high-purity branches
covering few examples, will often not be included in
the decision tree (see the sketch below).
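To make the effect concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available; the synthetic data and the minimum-leaf-size setting are illustrative choices, not from the paper) in which a tiny, pure rare-case region is absorbed into a majority-class leaf once a modest leaf-size constraint, a common overfitting-avoidance device, is applied:

```python
# A small pure region of the positive class exists, but a minimum-leaf-size
# constraint keeps the tree from forming a dedicated leaf for it.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# 1000 common-class examples scattered over the unit square.
X_common = rng.uniform(0.0, 1.0, size=(1000, 2))
y_common = np.zeros(1000, dtype=int)

# A rare case: only 5 positive examples, confined to a tiny corner region.
X_rare = rng.uniform(0.95, 1.0, size=(5, 2))
y_rare = np.ones(5, dtype=int)

X = np.vstack([X_common, X_rare])
y = np.concatenate([y_common, y_rare])

# With a modest minimum leaf size, the 5 rare examples cannot form their own
# (pure) leaf, so they inherit the majority label of a larger mixed leaf.
tree = DecisionTreeClassifier(min_samples_leaf=25, random_state=0).fit(X, y)
print("rare-case predictions:", tree.predict(X_rare))       # typically all 0

# Without the constraint, the high-purity rare-case branch survives.
tree_full = DecisionTreeClassifier(random_state=0).fit(X, y)
print("without constraint:   ", tree_full.predict(X_rare))  # typically all 1
```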
Why Rare Cases are Problematic
The Bias
– The bias of a data mining system is critical to its
performance. The extra-evidentiary bias makes it possible to
generalize from specific examples.
– Many data mining systems, especially those used to
induce classifiers, employ a maximum-generality bias.
– This means that when a disjunct that covers some set of
training examples is formed, only the most general set of
conditions that satisfies those examples is selected.
– The maximum-generality bias works well for common cases,
but not for rare cases/small disjuncts.
– Attempts to address the problems of small disjuncts by
selecting an appropriate bias must be considered.
Why Rare Cases are Problematic
Noisy data
– A sufficiently high level of background noise may prevent the
learner from distinguishing between noise and rare cases.
– Unfortunately, there is not much that can be done to
minimize the impact of noise on rare cases.
– For example, pruning and overfitting-avoidance
techniques, as well as inductive biases that foster
generalization, can minimize the overall impact of noise,
but because these methods tend to remove both rare
cases and noise-generated ones, they do so at the
expense of the rare cases.
Techniques for Handling Rare Cases
1. Obtain Additional Training Data
2. Use a More Appropriate Inductive Bias
3. Use More Appropriate Metrics
4. Employ Non-Greedy Search Techniques
5. Employ Knowledge/Human Interaction
6. Employ Boosting
7. Place Rare Cases Into Separate Classes
1. Obtain Additional Training Data
Simply obtaining additional training data will not help much,
because most of the new data will be associated with the
common cases and only some with the rare cases.
This may help with problems of “absolute rarity” but not with
“relative rarity”.
Only by selectively obtaining additional data for the rare cases
can one address the issues with relative rarity. Such a sampling
scheme will also help with absolute rarity.
The selective sampling approach does not seem practical for
real-world data sets.
2. Use a More Appropriate Inductive Bias
Rare cases tend to cause small disjuncts to be formed in a
classifier induced from labeled data. This is partly due to bias
used by most learners.
Simple strategies that eliminate all small disjuncts, or that use
statistical significance testing to prevent small disjuncts from
being formed, have proven to perform poorly.
More sophisticated approaches for adjusting the bias of a
learner in order to minimize the problem with small disjuncts
have been investigated.
Holte et al. (1989) use a maximum-generality bias for large
disjuncts and a maximum-specificity bias for small disjuncts.
This was shown to improve the performance of small
disjuncts but degrade the performance of large disjuncts,
yielding poorer overall performance.
2. Use a More Appropriate Inductive Bias
The approach was refined to ensure that the more specific
bias used to induce the small disjuncts does not affect – and
therefore cannot degrade – the performance of the large
disjuncts.
This was accomplished by using different learners for
examples that fall into large disjuncts and examples that fall
into small disjuncts. (Ting, 1994)
This hybrid approach was shown to improve the accuracy of
small disjuncts, but the results were not conclusive.
Carvalho and Freitas (2002a, 2002b) essentially use the same
approach, except that the set of training examples falling into
each individual small disjunct is used to generate a separate
classifier.
Several attempts have been made to perform better on rare
cases by using a highly specific bias for the induced small
disjuncts. These methods have shown mixed success.
3. Use More Appropriate Metrics
Altering Relative importance of Precision vs.
Recall:
Use Evaluation Metrics that, unlike accuracy metrics, do not
discount the importance of rare cases.
Given a classification rule R that predicts target class C, the
recall of R is the % of examples belonging to C that are correctly
identified while the precision of R is the % of times that the rule
is correct.
Rare cases can be given more prominence by increasing the
importance of precision over recall.
Timeweaver (Weiss, 1999), a genetic-algorithm-based
classification system, searches for rare cases by carefully altering
the relative importance of precision vs. recall (a sketch of such a
weighting follows).
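As an illustration of such a trade-off, the standard F-beta measure weights recall against precision. The sketch below uses the generic F-beta formula, not necessarily the exact weighting Timeweaver applies, and the precision/recall numbers are hypothetical:

```python
# F-beta: beta > 1 favors recall, beta < 1 favors precision.
def f_beta(precision, recall, beta):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical rule covering a rare case: correct 60% of the time (precision)
# while finding 90% of the rare examples (recall).
p, r = 0.60, 0.90
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F = {f_beta(p, r, beta):.3f}")
```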
3. Use More Appropriate Metrics
Two-Phase Rule Induction:
PNrule (Joshi, Agarwal & Kumar, 2001) uses two-phase rule
induction to focus on each measure separately.
The first phase focuses on recall. In the second phase, precision is
optimized. This is accomplished by learning to identify the false
positives covered by the rules from phase 1.
In the needle-in-the-haystack analogy, the first phase identifies
regions likely to contain the needle, and the second phase
learns to discard the hay strands within those regions.
PN-rule Learning
P-phase:
– Positive examples with good support
– Seek good recall
N-phase:
– Remove false positives from the examples covered in the P-phase
– High accuracy and significant support (a simplified sketch follows)
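A highly simplified sketch of the two-phase structure described above, using scikit-learn decision trees as stand-in learners (PNrule itself induces rules; the class weights, depths, and helper names here are illustrative assumptions, and X and y are assumed to be NumPy arrays):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def two_phase_fit(X, y):
    # Phase 1 (P-phase): favor recall on the rare positive class.
    p_model = DecisionTreeClassifier(
        max_depth=3, class_weight={0: 1, 1: 10}, random_state=0).fit(X, y)
    covered = p_model.predict(X) == 1        # region "likely to hold the needle"

    # Phase 2 (N-phase): within the covered region, learn to recognize the
    # false positives and thereby restore precision.
    n_model = DecisionTreeClassifier(max_depth=3, random_state=0)
    n_model.fit(X[covered], y[covered])
    return p_model, n_model

def two_phase_predict(models, X):
    p_model, n_model = models
    pred = p_model.predict(X)
    in_p = pred == 1
    if in_p.any():
        # Keep a positive prediction only if the N-phase model also agrees.
        pred[in_p] = n_model.predict(X[in_p])
    return pred
```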
4. Employ Non-Greedy Search Techniques
Most greedy algorithms make locally optimal choices so that the
search remains tractable; as a result, mining algorithms based on
greedy methods are not globally optimal and can become trapped in
local optima.
Greedy algorithms are not well suited to rare cases,
because rare cases may depend on the conjunction of many
conditions, and any single condition in isolation may not provide
much guidance.
Mining algorithms intended to handle rare cases must therefore use
more powerful global search methods.
Recommended solution:
– Genetic algorithms, which operate on a population of candidate
solutions rather than a single solution.
– For this reason GAs are more appropriate for rare cases
(Goldberg, 1989; Freitas, 2002; Weiss, 1999; Carvalho &
Freitas, 2002). A toy sketch follows.
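A toy Python sketch of the idea: a small genetic algorithm evolves conjunctive rules over boolean features and scores each candidate rule as a whole (here by F1), rather than adding one condition at a time. The operators, rates, and fitness choice are illustrative assumptions, not those of any cited system:

```python
import random

def fitness(rule, X, y):
    """F1 of the rule 'all features in `rule` are 1 => positive'."""
    tp = fp = fn = 0
    for xi, yi in zip(X, y):
        fired = all(xi[j] for j in rule)
        if fired and yi: tp += 1
        elif fired and not yi: fp += 1
        elif (not fired) and yi: fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def evolve(X, y, n_features, pop_size=30, generations=50, seed=0):
    rng = random.Random(seed)
    # Each individual is a set of feature indices forming a conjunction.
    pop = [set(rng.sample(range(n_features), rng.randint(1, min(3, n_features))))
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda r: fitness(r, X, y), reverse=True)
        survivors = scored[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = rng.sample(survivors, 2)                 # crossover of two rules
            child = set(rng.sample(sorted(a | b), max(1, len(a | b) // 2)))
            if rng.random() < 0.3:                          # mutation: toggle a feature
                j = rng.randrange(n_features)
                child.symmetric_difference_update({j})
                child = child or {j}
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda r: fitness(r, X, y))
```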
5. Employ Knowledge/Human Interaction
Interaction and knowledge of domain experts can be used
more effectively for rare case mining.
Example:
– SAR detection
– Rare disease detection
– Etc.
6. Employ Boosting
Boosting algorithms, such as AdaBoost, are iterative
algorithms that place different weights on the training
distribution at each iteration.
Following each iteration, boosting increases the weights
associated with the incorrectly classified examples and
decreases the weight associated with the correctly classified
examples.
This forces the learner to focus more on the incorrectly
classified examples in the next iteration.
RareBoost (Joshi, Kumar & Agarwal, 2001) applies a modified
weight-update mechanism to improve the performance on rare
classes and rare cases.
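For concreteness, here is a minimal sketch of the standard AdaBoost weight update described above (RareBoost modifies this update; its exact rule is not reproduced here, and the example weights below are hypothetical):

```python
import math

def adaboost_reweight(weights, correct):
    """One boosting iteration: correct[i] is True if example i was classified
    correctly. Misclassified examples gain weight, correct ones lose weight."""
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
    new_w = [w * math.exp(-alpha if c else alpha)
             for w, c in zip(weights, correct)]
    z = sum(new_w)                       # renormalize to a distribution
    return [w / z for w in new_w]

# Hypothetical round: 5 examples, the rare-case example (index 4) was missed.
w = [0.2] * 5
print(adaboost_reweight(w, [True, True, True, True, False]))
```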
7. Place Rare Cases Into Separate Classes
Rare cases complicate classification because different rare
cases may have little in common with one another, making it
difficult to assign the same class label to all of them.
Solution: Reformulate the problem so that rare cases are
viewed as separate classes.
Approach:
1. Separate each class into subclasses using clustering.
2. Learn after re-labeling the training examples with the new
class labels.
3. Because multiple clustering experiments were used in step 1,
step 2 involves learning multiple models.
4. These models are combined using voting.
A sketch of steps 1 and 2 follows.
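A minimal sketch of steps 1 and 2 (assuming scikit-learn; the number of clusters per class, the base learner, and the function names are illustrative, and the multiple-clusterings-plus-voting part of steps 3 and 4 is omitted):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def relabel_with_subclasses(X, y, k_per_class=3, seed=0):
    """Split each original class into subclasses via clustering and relabel."""
    new_y = np.empty(len(y), dtype=int)
    next_label = 0
    label_to_class = {}
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        k = min(k_per_class, len(idx))
        sub = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[idx])
        for s in range(k):
            label_to_class[next_label + s] = c
        new_y[idx] = sub + next_label
        next_label += k
    return new_y, label_to_class

def fit_predict_original(X, y, X_test):
    new_y, back = relabel_with_subclasses(X, y)
    clf = DecisionTreeClassifier(random_state=0).fit(X, new_y)
    sub_pred = clf.predict(X_test)
    return np.array([back[s] for s in sub_pred])   # map subclasses back to classes
```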
Boosting-based algorithms
RareBoost
– Updates the weights differently
SMOTEBoost
– Combination of SMOTE (Synthetic
Minority Oversampling Technique) and
boosting
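For reference, a minimal sketch of SMOTE-style synthetic oversampling, the component that SMOTEBoost combines with boosting (assuming scikit-learn; the neighbor count, the amount of oversampling, and the function name are illustrative, and at least two minority examples are assumed):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority examples by interpolating toward a random
    minority-class nearest neighbor."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_minority) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neigh = nn.kneighbors(X_minority)          # column 0 is the point itself
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))
        j = neigh[i][rng.integers(1, k + 1)]      # a random true neighbor
        gap = rng.random()                        # interpolation fraction in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)
```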
CREDOS
First use ripple down rules to overfit
the data
– Ripple down rules are often used
Then prune to improve generalization
– Different mechanism from decision trees
Cost Sensitive Modeling
Detection rate / False Alarm rate may be
misleading
Cost factors: damage cost, response cost,
operational cost
Costs for TP, FP, TN, FN
Define cumulative cost
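A minimal sketch of such a cumulative cost (the cost values and the mapping of damage, response, and operational costs onto the confusion-matrix cells are hypothetical):

```python
def cumulative_cost(tp, fp, tn, fn, costs):
    """costs maps 'TP', 'FP', 'TN', 'FN' to a per-example cost, e.g. response
    cost for a TP, operational cost for an FP, damage cost for an FN."""
    return (tp * costs["TP"] + fp * costs["FP"]
            + tn * costs["TN"] + fn * costs["FN"])

costs = {"TP": 5, "FP": 10, "TN": 0, "FN": 100}     # hypothetical cost factors
print(cumulative_cost(tp=40, fp=25, tn=900, fn=10, costs=costs))
```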
Outlier Detection Schemes
Detect intrusions (data
points) that are very
different from the “normal”
activities (rest of the data
points)
General Steps
– Identify “normal” behavior
– Construct useful set of
features
– Define similarity function
– Use an outlier detection algorithm:
Statistics based
Distance based
Model based
Distance Based Outlier Detection
Represent data as a vector of features
Major approaches
– Nearest neighbor based
– Density based
– Clustering based
Problem
– High dimensionality of data
Distance Based – Nearest Neighbor
Not enough neighbors → Outliers
– Compute distance d to the k-th nearest neighbor
– Outlier points
Are located in sparser neighborhoods
Have d larger than a certain threshold
Mahalanobis-distance based approach
– More appropriate for computing distances with skewed
distributions
[Figure: scatter plot of data points with outlying points p1 and p2 and axes x’ and y’]
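A minimal sketch of the k-th-nearest-neighbor rule above (assuming scikit-learn; k and the percentile threshold are illustrative choices, and plain Euclidean distance is used rather than the Mahalanobis variant):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_distance_outliers(X, k=5, percentile=95):
    """Flag points whose distance to their k-th nearest neighbor is unusually
    large, i.e. points that sit in sparse neighborhoods."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)          # column 0 is the point itself
    d_k = dist[:, k]                    # distance to the k-th nearest neighbor
    threshold = np.percentile(d_k, percentile)
    return d_k > threshold              # boolean mask of flagged outliers
```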
Distance Based – Density
Local Outlier Factor (LOF)
– Average of the ratios of the density of example p and the
density of its nearest neighbors
Compute density of local neighborhood for each
point
Compute LOF
Larger LOF → Outliers
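A minimal sketch using scikit-learn's LocalOutlierFactor (the synthetic data, the n_neighbors value, and the expectation that the isolated point is flagged are illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),      # dense "normal" cluster
               np.array([[6.0, 6.0]])])              # one isolated point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                          # -1 marks detected outliers
print("flagged indices:", np.where(labels == -1)[0]) # expected to include 200
print("LOF scores of first points:", -lof.negative_outlier_factor_[:3])
```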
Distance Based – Clustering
Radius w of proximity is specified
Two points x1 and x2 are “near” if d(x1, x2) < w
Define N(x) as the number of points that are within w
of x
Points in small clusters → Outliers
Fixed-width clustering for speedup
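A minimal sketch of the fixed-width clustering idea (a single pass over the data; w, the small-cluster cutoff, and the function name are illustrative assumptions):

```python
import numpy as np

def fixed_width_outliers(X, w, min_cluster_size=5):
    """Each point joins the first cluster whose center is within w, otherwise it
    starts a new cluster; points ending up in small clusters are flagged."""
    centers, members = [], []
    for i, x in enumerate(X):
        for c, center in enumerate(centers):
            if np.linalg.norm(x - center) < w:        # "near": d(x, center) < w
                members[c].append(i)
                break
        else:
            centers.append(x)                          # start a new cluster at x
            members.append([i])
    outliers = np.zeros(len(X), dtype=bool)
    for idx in members:
        if len(idx) < min_cluster_size:                # small cluster => outliers
            outliers[idx] = True
    return outliers
```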
Distance Based - Clustering (cont.)
K-Nearest Neighbor + Canopy Clustering
– Compute sum of distances to k nearest
neighbors
Small k-NN sum → point in a dense region
– Canopy clustering for speedup
WaveCluster
– Transform data into multidimensional signals
using wavelet transformation
– Remove high/low frequency parts
– Remaining parts → Outliers
Model Based Outlier Detection
Similar to Probabilistic Based schemes
– Build prediction model for normal
behavior
– Deviation from model → potential
intrusion
Major approaches
– Neural networks
– Unsupervised Support Vector Machines
(SVMs)
Model Based - Neural Networks
Use a replicator 4-layer feed-forward neural network
Input variables are the target output during training
RNN forms a compressed model of the training data
Outlyingness → reconstruction error
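A minimal sketch of the reconstruction-error idea (a generic scikit-learn MLP autoencoder stands in for the 4-layer replicator network of the original proposal; the layer sizes and the synthetic data are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 4))             # "normal" activity records

# Train the network to reproduce its input: inputs are also the targets.
rnn = MLPRegressor(hidden_layer_sizes=(8, 2, 8), max_iter=2000, random_state=0)
rnn.fit(X_train, X_train)

def outlyingness(model, X):
    """Per-record reconstruction error as the outlier score."""
    recon = model.predict(X)
    return np.mean((X - recon) ** 2, axis=1)

print(outlyingness(rnn, X_train[:3]))                   # training-like records
print(outlyingness(rnn, np.array([[8.0, -8.0, 8.0, -8.0]])))  # far from training data
```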
Model Based - SVMs
Attempt to separate the entire set of
training data from the origin
Regions where most data lies are labeled as
one class
Parameters
– Expected outlier rate
– Variance of the Radial Basis Function (RBF):
larger → higher detection rate and more false alarms
Good for high-quality, controlled training data
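A minimal sketch using scikit-learn's OneClassSVM (nu plays the role of the expected outlier rate and gamma is the RBF kernel coefficient, inversely related to the RBF variance; the parameter values and data are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 2))              # mostly "normal" records

# nu ~ expected fraction of outliers; gamma is the RBF kernel coefficient.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_train)

X_new = np.array([[0.1, -0.2],                          # looks like the training data
                  [5.0, 5.0]])                          # far from the training data
print(ocsvm.predict(X_new))                             # +1 = normal, -1 = outlier
```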
Summary And Conclusion
Rare classes, which result from a highly skewed class
distribution, share many of the problems associated with rare
cases. Rare classes and rare cases are connected.
Rare cases can occur within both rare classes and
common classes, but rare cases are expected to be more of
an issue for rare classes.
(Japkowicz, 2001) views rare classes as a consequence of
between-class imbalance and rare cases as a consequence of
within-class imbalance.
Thus, both forms of rarity are a type of data imbalance.
Modeling improvements presented in this paper are applicable
to both types of rarity.