
Data Mining: Concepts and Techniques
— Chapter 7 —
Classification Ensemble Learning
April 6, 2016
Drawback of Single Classifier
• If we train several classifiers and choose the best, which one is the best?
– More than one classifier may meet the criteria (e.g., the same training accuracy), especially in the following situations:
• Without sufficient training data
• When the learning algorithm easily converges to different local optima
Drawback of Single Classifier (Cont.)
• Potentially valuable information may be lost by discarding the results of less successful classifiers
– E.g., the discarded classifiers may correctly classify some samples
• If the output of the selected classifier is wrong, the final decision must be wrong
• The trained classifier may not be complex enough to handle the problem
Ensemble Methods
• Combining a number of trained classifiers leads to better performance than any single one
– Errors can be compensated by other classifiers' correct classifications
– Different classifiers have different knowledge regarding the problem
• Decomposes a complex problem into subproblems whose solutions are simpler to understand, implement, manage, and update
General Idea
[Figure: the original training data D is turned into multiple data sets D1, D2, …, Dt−1, Dt (Step 1: create multiple data sets); a classifier C1, C2, …, Ct−1, Ct is built on each (Step 2: build multiple classifiers); the individual classifiers are then combined into C* (Step 3: combine classifiers).]
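To make the three steps concrete, here is a minimal Python sketch (assuming NumPy and scikit-learn are available; resampling in Step 1 is just one possible way to create the data sets, and the specific methods follow on the next slides):

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_ensemble(X, y, t=5, seed=0):
    # Steps 1-2: create t data sets D1..Dt from D and build one classifier per set
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(t):
        idx = rng.integers(0, len(y), size=len(y))   # Di: resampled from D
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # Ci
    return models

def ensemble_predict(models, x):
    # Step 3: combine C1..Ct into C* by majority vote
    votes = [int(m.predict(x.reshape(1, -1))[0]) for m in models]
    return Counter(votes).most_common(1)[0][0]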
Affecting Factors in Ensemble Methods
• Three factors affect the accuracy of an ensemble:
– Accuracy of the individual classifiers
• How good are the individual classifiers?
– Fusion methods
• How are the classifiers combined?
– Diversity among the classifiers
• How different are the decisions of the various classifiers?
Accuracy of Individual Classifiers
• The performance of an individual classifier is affected by
– The training data set (samples and features)
– The learning model (type of classifier)
– The model's parameters (e.g., the number of neurons in a neural network)
Examples of Ensemble Methods
• How to generate an ensemble of classifiers?
– Bootstrap aggregating (Bagging)
– Boosting (AdaBoost)
Bagging: Bootstrap Aggregation
• Real-world example: diagnosis based on multiple doctors' majority vote
• Training
– Given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)
– A classifier model Mi is learned for each training set Di
• Classification: to classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns to X the class with the most votes (see the sketch after this slide)
• Prediction: can be applied to the prediction of continuous values by taking the average of the individual predictions for a given test tuple
• Accuracy
– Often significantly better than a single classifier derived from D
– For noisy data: not considerably worse, and more robust
– Proven to improve accuracy in prediction
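One ready-made realization of this scheme is scikit-learn's BaggingClassifier (a sketch, assuming scikit-learn is installed; the data set here is synthetic stand-in data):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)   # stand-in for the data set D
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 10 models Mi, each fit on a bootstrap sample Di of the same size as D;
# predict() returns the class with the most votes among the Mi
bag = BaggingClassifier(n_estimators=10, random_state=0).fit(X_tr, y_tr)
print(bag.score(X_te, y_te))

For continuous-valued prediction, BaggingRegressor applies the same scheme but averages the individual predictions instead of voting.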
Bagging
• Sampling with replacement
Original Data       1   2   3   4   5   6   7   8   9   10
Bagging (Round 1)   7   8   10  8   2   5   10  10  5   9
Bagging (Round 2)   1   4   9   1   2   3   2   7   3   2
Bagging (Round 3)   1   8   5   10  5   5   9   6   3   7
• Build classifier on each bootstrap sample
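Rounds like those above are easy to reproduce (a sketch using NumPy; the seed is arbitrary, so the drawn tuples will differ from the table):

import numpy as np

original = np.arange(1, 11)        # the 10 original tuples
rng = np.random.default_rng(0)
for r in (1, 2, 3):
    # draw 10 tuples with replacement: some repeat, some are left out
    print(f"Bagging (Round {r}):", rng.choice(original, size=10, replace=True))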
Boosting
• Example: consult several doctors, and combine their diagnoses with weights, where each weight is assigned based on that doctor's previous diagnosis accuracy
• How does boosting work?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
– The final M* combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy (see the sketch after this slide)
• The boosting algorithm can be extended for the prediction of continuous values
• Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
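One ready-made implementation of this scheme is scikit-learn's AdaBoostClassifier (a sketch, assuming scikit-learn is installed; the data set is synthetic stand-in data):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, random_state=0)   # stand-in training data
# k = 50 classifiers learned in sequence: each round reweights the training
# tuples so the next learner focuses on the previous one's mistakes, and the
# final prediction M* is a weighted vote of all 50
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(boost.score(X, y))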
Boosting
• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights decreased

Original Data        1   2   3   4   5   6   7   8   9   10
Boosting (Round 1)   7   3   2   8   7   9   4   10  6   3
Boosting (Round 2)   5   4   9   4   2   5   1   7   4   2
Boosting (Round 3)   4   4   8   10  4   5   4   6   3   4

• Example 4 is hard to classify
• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds, as the sketch below shows
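The effect on sampling can be illustrated directly (a sketch using NumPy; the weight values are hypothetical, chosen only to show tuple 4 dominating after its weight is increased):

import numpy as np

tuples = np.arange(1, 11)
# hypothetical weights after a round that misclassified tuple 4:
# tuple 4's weight was increased, the others' decreased (weights sum to 1)
w = np.array([0.05, 0.05, 0.05, 0.55, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05])
sample = np.random.default_rng(42).choice(tuples, size=10, replace=True, p=w)
print(sample)   # tuple 4 now appears far more often than the others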
AdaBoost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all tuple weights are set to the same value (1/d)
• Generate k classifiers in k rounds. At round i,
– Tuples from D are sampled (with replacement) to form a training set Di of the same size
– Each tuple's chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased; otherwise it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj (1 if Mi misclassifies Xj, 0 otherwise). Classifier Mi's error rate is the sum of the weights of the misclassified tuples:

error(Mi) = Σ_{j=1..d} wj · err(Xj)

• The weight of classifier Mi's vote is

log( (1 − error(Mi)) / error(Mi) )
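The two formulas above translate directly into a single weight-update round (a minimal NumPy sketch; multiplying the weights of correctly classified tuples by error(Mi)/(1 − error(Mi)) and then renormalizing is one standard formulation consistent with the slide):

import numpy as np

def adaboost_round(w, misclassified):
    # w: current tuple weights (they sum to 1)
    # misclassified: boolean array, err(Xj) = 1 where Mi misclassified tuple j
    error = w[misclassified].sum()             # error(Mi) = sum of misclassified weights
    beta = error / (1.0 - error)               # < 1 when Mi is better than chance
    w = np.where(misclassified, w, w * beta)   # shrink correctly classified weights
    w = w / w.sum()                            # renormalize: misclassified weights grow
    vote = np.log((1.0 - error) / error)       # weight of Mi's vote
    return w, vote

# Example: 10 tuples start at weight 1/d = 0.1; only tuple 4 is misclassified
w = np.full(10, 0.1)
miss = np.zeros(10, dtype=bool)
miss[3] = True
w, vote = adaboost_round(w, miss)
print(w[3], vote)   # tuple 4's weight rises to 0.5; vote = log(0.9/0.1) ≈ 2.20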