Bagging and Boosting. Prof. Ruiz's slides


Bagging and Boosting in Data Mining
Carolina Ruiz
[email protected]
http://www.cs.wpi.edu/~ruiz
Motivation and Background

Problem Definition:
- Given: a dataset of instances and a target concept
- Find: a model (e.g., a set of association rules, a decision tree, a neural network) that helps in predicting the classification of unseen instances.

Difficulties:
- The model should be stable (i.e., it shouldn't depend too much on the input data used to construct it)
- The model should be a good predictor (difficult to achieve when the input dataset is small)
Two Approaches

Bagging (Bootstrap Aggregating)
- Leo Breiman, UC Berkeley

Boosting
- Rob Schapire, AT&T Research
- Jerry Friedman, Stanford U.
Bagging

Model Creation:
- Create bootstrap replicates of the dataset and fit a model to each one

Prediction:
- Average/vote the predictions of the models

Advantages:
- Stabilizes "unstable" methods
- Easy to implement, parallelizable
Bagging Algorithm

1. Create k bootstrap replicates of the dataset
2. Fit a model to each of the replicates
3. Average/vote the predictions of the k models
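The slides leave the base learner and the voting step abstract. Below is a minimal sketch of the three steps in Python, assuming numpy arrays X and y and using a scikit-learn decision tree as the (unstable) base learner; the function names bagging_fit and bagging_predict are illustrative, not from the slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed base learner for illustration

def bagging_fit(X, y, k=10, seed=0):
    """Steps 1-2: fit k models, each on a bootstrap replicate of the dataset."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)          # bootstrap replicate: n draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Step 3: majority vote over the k models' predictions."""
    preds = np.stack([m.predict(X) for m in models])   # shape (k, n_samples)
    votes = []
    for col in preds.T:                                 # one column per instance
        labels, counts = np.unique(col, return_counts=True)
        votes.append(labels[np.argmax(counts)])
    return np.array(votes)

For a regression model, the vote in bagging_predict would simply be replaced by an average of the k predictions.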
Boosting

Creating the model:
- Construct a sequence of datasets and models in such a way that a dataset in the sequence weights an instance heavily when the previous model has misclassified it.

Prediction:
- "Merge" the models in the sequence

Advantages:
- Improves classification accuracy
Generic Boosting Algorithm

1. Equally weight all instances in the dataset
2. For i = 1 to T:
   2.1. Fit a model to the current dataset
   2.2. Upweight poorly predicted instances
   2.3. Downweight well-predicted instances
3. Merge the models in the sequence to obtain the final model
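The slide states the algorithm generically and does not fix the weight updates or the merge step. One common instantiation is AdaBoost for binary labels in {-1, +1}; the sketch below follows that choice and is an assumption, not the slides' own algorithm. Scikit-learn decision stumps are used as the weak learner, and the function names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed weak learner for illustration

def boosting_fit(X, y, T=50):
    """AdaBoost-style instantiation of the generic algorithm; y must be in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                      # step 1: equal instance weights
    models, alphas = [], []
    for _ in range(T):                           # step 2
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)  # 2.1
        pred = stump.predict(X)
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # model weight: larger when error is small
        # 2.2 / 2.3: upweight misclassified instances, downweight well-predicted ones
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def boosting_predict(models, alphas, X):
    """Step 3: merge the sequence by an alpha-weighted vote."""
    scores = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(scores)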
Conclusions and References

- Boosted naïve Bayes tied for first place in the KDD Cup 1997

Reference:
- John F. Elder and Greg Ridgeway, "Combining Estimators to Improve Performance", KDD-99 tutorial notes