Ensemble Methods
An ensemble method constructs a set of base classifiers from the training data (also called ensemble learning or classifier combination). It predicts the class label of a previously unseen record by aggregating the predictions made by the multiple base classifiers.
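A minimal sketch of the aggregation step, assuming scikit-learn-style base classifiers that are already trained (the classifier list and record are hypothetical placeholders):

```python
from collections import Counter

def majority_vote(classifiers, record):
    """Predict the class of one record by majority vote over the base classifiers."""
    votes = [clf.predict([record])[0] for clf in classifiers]  # one vote per base classifier
    return Counter(votes).most_common(1)[0][0]                 # most frequent class label
```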
General Idea
From the original training data D:
Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt.
Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct, one per data set.
Step 3: Combine the classifiers into a single ensemble classifier C*.
Why does it work?
Suppose there are 25 base classifiers.
– Each classifier has an error rate ε = 0.35.
– Assume the classifiers are independent.
– The ensemble (by majority vote) is wrong only when 13 or more of the 25 base classifiers are wrong, so the probability that the ensemble classifier makes a wrong prediction is:
\[
\sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1 - \varepsilon)^{25 - i} \approx 0.06
\]
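A short numerical check of this figure, assuming only that the 25 base classifiers are independent and combined by majority vote:

```python
from math import comb

eps, n = 0.35, 25
# Probability that 13 or more of the 25 independent base classifiers are wrong.
ensemble_error = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(ensemble_error, 3))  # ~0.06, far below the 0.35 error rate of a single classifier
```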
Methods
One way to build an ensemble is by manipulating the training data set: each base classifier is built from a sampled subset of the training data.
– Two such ensemble methods are bagging (bootstrap aggregating) and boosting.
Characteristics
Ensemble methods work better with unstable classifiers.
– Unstable base classifiers are those sensitive to minor perturbations of the training set, for example decision trees and artificial neural networks.
– The variability among training examples is one of the primary sources of error in a classifier.
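A small illustration of this instability (a sketch assuming NumPy and scikit-learn; the data set is synthetic and purely illustrative): two decision trees trained on different bootstrap samples of the same data can disagree on a noticeable fraction of records.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)          # synthetic, noise-free labels

# Train two trees on different bootstrap samples of the same training data.
preds = []
for seed in (1, 2):
    idx = np.random.default_rng(seed).integers(0, len(X), size=len(X))
    preds.append(DecisionTreeClassifier().fit(X[idx], y[idx]).predict(X))

# Fraction of records on which the two trees disagree -- a symptom of high variance.
print((preds[0] != preds[1]).mean())
```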
Bias-Variance Decomposition
Consider the trajectories of a projectile launched with a particular force (f) and angle (θ). Suppose the target is t, but the projectile lands at y, a distance d away from t. The observed distance can be divided into three components:
\[
d_{f,\theta}(y, t) = \mathrm{Bias}_{\theta} + \mathrm{Variance}_{f} + \mathrm{Noise}_{t}
\]
Two Decision Trees (1)
Two Decision Trees (2)
Bias: The stronger the assumptions made by a classifier about the nature of its decision boundary, the larger the classifier's bias will be.
– A smaller tree makes stronger assumptions.
– With a large bias, the algorithm cannot learn the target, no matter how much training data it sees.
Variance: Variability in the training data affects the expected error, because different compositions of the training set may lead to different decision boundaries.
Noise: Intrinsic noise in the target class.
– The target class for some domains can be nondeterministic: the same attribute values may appear with different class labels.
Bagging
Sampling with replacement
Original Data:       1   2   3   4   5   6   7   8   9  10
Bagging (Round 1):   7   8  10   8   2   5  10  10   5   9
Bagging (Round 2):   1   4   9   1   2   3   2   7   3   2
Bagging (Round 3):   1   8   5  10   5   5   9   6   3   7
Build a classifier on each bootstrap sample.
Each training record has probability 1 - (1 - 1/n)^n of being selected for a given bootstrap sample. When n is large, this probability approaches 1 - 1/e ≈ 0.632, so a bootstrap sample Di contains about 63.2% of the original training records.
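A quick check of this probability, assuming a hypothetical training-set size n = 1000 and NumPy for the empirical part:

```python
import numpy as np

n = 1000
analytic = 1 - (1 - 1 / n) ** n            # probability a given record appears in a bootstrap sample
print(analytic)                            # ~0.632, close to 1 - 1/e

# Empirical check: fraction of distinct records in one bootstrap sample of size n.
sample = np.random.default_rng(0).integers(0, n, size=n)
print(len(np.unique(sample)) / n)          # also ~0.632
```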
Bagging Algorithm
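A minimal sketch of the bagging procedure, assuming scikit-learn-style base classifiers with integer class labels; the decision tree is used only as an example of an unstable base learner:

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_rounds=10, base=DecisionTreeClassifier(), seed=0):
    """Train one base classifier per bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(X), size=len(X))      # sample with replacement
        models.append(clone(base).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the base classifiers by majority vote (assumes labels 0, 1, 2, ...)."""
    votes = np.array([m.predict(X) for m in models])    # shape: (n_rounds, n_records)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```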
A Bagging Example (1)
Consider a one-level binary decision tree (a decision stump) of the form x <= k, where the split point k is chosen to minimize the entropy.
Without bagging, the best decision stump splits at
– x <= 0.35 or x >= 0.75
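A sketch of how such a stump can be found by exhaustive search over split points; the one-dimensional data set below is hypothetical, chosen only so that the search is consistent with the split points quoted above:

```python
import numpy as np

def entropy(labels):
    """Entropy of an integer label array (0 for an empty or pure subset)."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def best_stump(x, y):
    """Find the split point k of the stump 'x <= k' with minimum weighted entropy."""
    best_k, best_h = None, np.inf
    for k in (x[:-1] + x[1:]) / 2:                       # candidate midpoints (x sorted)
        left, right = y[x <= k], y[x > k]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if h < best_h:
            best_k, best_h = k, h
    return best_k, best_h

# Hypothetical 1-D data: the class changes around 0.35 and 0.75.
x = np.arange(1, 11) / 10.0
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 1])
print(best_stump(x, y))    # x <= 0.35 and x >= 0.75 tie; the first is returned
```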
A Bagging Example (2)
A Bagging Example (3)
A Bagging Example (4)
Summary on Bagging
Bagging improves generalization error by
reducing the variance of the base classifier.
Bagging depends on the stability of the base
classifier.
If a base classifier is unstable, bagging helps to
reduce the errors associated with random
fluctuations in the training data.
If the base classifier is stable, then the error of the ensemble is primarily caused by bias in the base classifier. Bagging may even make the error larger, because each bootstrap sample effectively contains about 37% fewer distinct training records.
Boosting
An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records.
– Initially, all N records are assigned equal weights.
– Unlike bagging, the weights may change at the end of each boosting round.
– The weights can be used by a base classifier to learn a model that is biased toward higher-weight examples.
Boosting
Records that are wrongly classified will have their
weights increased
Records that are classified correctly will have
their weights decreased
Original Data:        1   2   3   4   5   6   7   8   9  10
Boosting (Round 1):   7   3   2   8   7   9   4  10   6   3
Boosting (Round 2):   5   4   9   4   2   5   1   7   4   2
Boosting (Round 3):   4   4   8  10   4   5   4   6   3   4
• Example 4 is hard to classify
• Its weight is increased, therefore it is more
likely to be chosen again in subsequent rounds
Example: AdaBoost
Base classifiers: C1, C2, …, CT
Error rate:
\[
\varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\!\left(C_i(x_j) \neq y_j\right)
\]
where δ(·) equals 1 if its argument is true and 0 otherwise.
Importance of a classifier:
\[
\alpha_i = \frac{1}{2} \ln\!\left(\frac{1 - \varepsilon_i}{\varepsilon_i}\right)
\]
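A small sketch of how the importance weight behaves for a few hypothetical error rates:

```python
import math

def importance(eps):
    """AdaBoost importance: alpha = 0.5 * ln((1 - eps) / eps)."""
    return 0.5 * math.log((1 - eps) / eps)

for eps in (0.05, 0.25, 0.5, 0.75):
    print(eps, round(importance(eps), 3))
# Accurate classifiers (eps near 0) receive a large positive importance,
# a random classifier (eps = 0.5) receives 0, and eps > 0.5 gives a negative importance.
```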
Example: AdaBoost
Weight update:
\[
w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times
\begin{cases}
e^{-\alpha_j} & \text{if } C_j(x_i) = y_i \\
e^{\alpha_j} & \text{if } C_j(x_i) \neq y_i
\end{cases}
\]
where Zj is the normalization factor that ensures \(\sum_i w_i^{(j+1)} = 1\).
If any intermediate round produces an error rate higher than 50%, the weights are reverted to 1/N and the resampling procedure is repeated.
Classification:
\[
C^{*}(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \, \delta\!\left(C_j(x) = y\right)
\]
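Putting the formulas above together, a minimal AdaBoost sketch; it assumes scikit-learn-style base classifiers that accept sample weights and class labels in {-1, +1}, and it is an illustration rather than the book's reference implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    """y must be in {-1, +1}. Returns (classifiers, alphas)."""
    N = len(X)
    w = np.full(N, 1 / N)
    classifiers, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()                  # weighted error rate
        if eps >= 0.5:                            # error above 50%: revert the weights
            w = np.full(N, 1 / N)
            continue
        eps = max(eps, 1e-10)                     # guard against a perfect round
        alpha = 0.5 * np.log((1 - eps) / eps)     # importance of this classifier
        w *= np.exp(-alpha * y * pred)            # decrease correct, increase wrong weights
        w /= w.sum()                              # normalize (the Z_j factor)
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Weighted vote of the base classifiers, thresholded at zero."""
    scores = sum(a * c.predict(X) for a, c in zip(alphas, classifiers))
    return np.sign(scores)
```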
A Boosting Example (1)
Consider a one-level binary decision tree (a decision stump) of the form x <= k, where the split point k is chosen to minimize the entropy.
Without boosting, the best decision stump splits at
– x <= 0.35 or x >= 0.75
A Boosting Example (2)
A Boosting Example (3)