COP5992 – DATA MINING TERM PROJECT RANDOM SUBSPACE
COP5992 – DATA MINING
TERM PROJECT
RANDOM SUBSPACE METHOD
+
CO-TRAINING
by
SELIM KALAYCI
RANDOM SUBSPACE METHOD (RSM)
Proposed by Ho,
“The Random Subspace Method for Constructing Decision Forests”, 1998
Another technique for combining weak classifiers, like Bagging and Boosting.
RSM ALGORITHM
1. Repeat for b = 1, 2, ..., B:
(a) Select an r-dimensional random subspace X̃b from the original p-dimensional feature space X.
(b) Construct a classifier Cb(x) in the subspace X̃b.
2. Combine the classifiers Cb(x), b = 1, 2, ..., B, by simple majority voting into a final decision rule.
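A minimal sketch of this procedure in Python (the names rsm_fit/rsm_predict, the decision-tree base classifier, and the parameter values are illustrative assumptions, not from the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rsm_fit(X, y, B=50, r=5, seed=0):
    """Step 1: train B classifiers, each in an r-dimensional random subspace."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(B):
        feats = rng.choice(X.shape[1], size=r, replace=False)  # (a) pick r of the p features
        clf = DecisionTreeClassifier().fit(X[:, feats], y)     # (b) build Cb(x) in X~b
        ensemble.append((feats, clf))
    return ensemble

def rsm_predict(ensemble, X):
    """Step 2: combine the Cb(x) by simple majority voting (assumes integer class labels)."""
    votes = np.stack([clf.predict(X[:, feats]) for feats, clf in ensemble])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)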
MOTIVATION FOR RSM
Redundancy in Data Feature Space
Completely redundant feature set
Redundancy is spread over many features
Weak classifiers with critical training sample sizes (few training examples relative to the number of features)
RSM PERFORMANCE ISSUES
RSM performance depends on:
Training sample size
The choice of a base classifier
The choice of combining rule (simple majority vs.
weighted)
The degree of redundancy of the dataset
The number of features chosen
DECISION FORESTS (by Ho)
A combination of trees instead of a single tree
Assumption: Dataset has some redundant
features
Works efficiently with any decision tree algorithm
and data splitting method
Ideally, look for the best individual trees with the lowest tree similarity
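As a sketch, this kind of forest can be approximated with scikit-learn's BaggingClassifier, which reduces to the random subspace method when only the features are subsampled (the parameter values are illustrative; note that it combines trees by averaging class probabilities rather than by a strict majority vote):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each of the B = 50 trees sees all training examples but only a random
# half of the features, so the trees differ only in their subspaces.
forest = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,            # B
    max_features=0.5,           # r / p
    bootstrap=False,            # no resampling of examples
    bootstrap_features=False,   # features drawn without replacement
    random_state=0,
)
# usage: forest.fit(X_train, y_train); forest.predict(X_test)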
UNLABELED DATA
Small number of labeled documents
Large pool of unlabeled documents
How to classify unlabeled documents accurately?
EXPECTATION-MAXIMIZATION (E-M)
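The slides give no detail for E-M; as a hedged illustration only (a simplified hard-EM loop in the spirit of Naive-Bayes-based semi-supervised learning, not anything stated here), it could look like:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def hard_em(X_lab, y_lab, X_unlab, n_iters=10):
    # Assumes X_lab / X_unlab are nonnegative term-count matrices (dense here for simplicity).
    # Initialize the model from the small labeled set only.
    clf = MultinomialNB().fit(X_lab, y_lab)
    for _ in range(n_iters):
        # E-step (hard variant): label the unlabeled pool with the current model.
        y_unlab = clf.predict(X_unlab)
        # M-step: re-fit on labeled + provisionally labeled data.
        clf = MultinomialNB().fit(np.vstack([X_lab, X_unlab]),
                                  np.concatenate([y_lab, y_unlab]))
    return clf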
CO-TRAINING
Blum and Mitchell, “Combining Labeled and
Unlabeled Data with Co-Training”, 1998.
Requirements:
Two feature sets, each sufficiently strong to learn from
Conditionally independent given the class
APPLICATION OF CO-TRAINING TO A SINGLE FEATURE SET
Algorithm:
  Obtain a small set L of labeled examples
  Obtain a large set U of unlabeled examples
  Obtain two sets F1 and F2 of features that are sufficiently redundant
  While U is not empty do:
    Learn classifier C1 from L based on F1
    Learn classifier C2 from L based on F2
    For each classifier Ci do:
      Ci labels examples from U based on Fi
      Ci chooses the most confidently predicted examples E from U
      E is removed from U and added (with the labels Ci assigned) to L
  End loop
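A minimal sketch of this loop in Python (the Gaussian Naive Bayes base learner, the per-round selection size k, and all names are illustrative assumptions; F1 and F2 are passed as column-index lists):

import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X_lab, y_lab, X_unlab, view1, view2, k=5):
    """view1, view2: column indices giving the two feature sets F1, F2."""
    L_X, L_y = X_lab.copy(), y_lab.copy()
    U = X_unlab.copy()
    while len(U) > 0:
        for view in (view1, view2):
            if len(U) == 0:
                break
            clf = GaussianNB().fit(L_X[:, view], L_y)  # learn Ci from L based on Fi
            proba = clf.predict_proba(U[:, view])      # Ci labels examples from U
            conf = proba.max(axis=1)
            picked = np.argsort(conf)[-k:]             # most confidently predicted examples E
            y_new = clf.classes_[proba[picked].argmax(axis=1)]
            # move E from U to L with the labels Ci assigned
            L_X = np.vstack([L_X, U[picked]])
            L_y = np.concatenate([L_y, y_new])
            U = np.delete(U, picked, axis=0)
    c1 = GaussianNB().fit(L_X[:, view1], L_y)
    c2 = GaussianNB().fit(L_X[:, view2], L_y)
    return c1, c2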
THINGS TO DO
How can we measure redundancy and use it
efficiently?
Can we improve Co-training?
How can we apply RSM efficiently to:
Supervised learning
Semi-supervised learning
Unsupervised learning
QUESTIONS