COP5992 – DATA MINING TERM PROJECT RANDOM SUBSPACE

COP5992 – DATA MINING
TERM PROJECT
RANDOM SUBSPACE METHOD
+
CO-TRAINING
by
SELIM KALAYCI
RANDOM SUBSPACE METHOD (RSM)

Proposed by Ho in "The Random Subspace Method for Constructing Decision Forests", 1998.

Another technique for combining weak classifiers, like Bagging and Boosting.
RSM ALGORITHM
1. Repeat for b = 1, 2, ..., B:
   (a) Select an r-dimensional random subspace X_b from the original p-dimensional feature space X.
   (b) Construct a classifier C_b(x) in X_b.
2. Combine the classifiers C_b(x), b = 1, 2, ..., B, by simple majority voting into a final decision rule (see the sketch below).
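
A minimal sketch of the RSM loop above, assuming scikit-learn-style base classifiers; the names `rsm_fit`, `n_subspaces` (B) and `subspace_dim` (r) are illustrative, and non-negative integer class labels are assumed for the majority vote.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier


def rsm_fit(X, y, base_estimator=None, n_subspaces=10, subspace_dim=5, seed=None):
    """Train B classifiers, each on a random r-dimensional feature subspace."""
    rng = np.random.default_rng(seed)
    base_estimator = base_estimator or DecisionTreeClassifier()
    ensemble = []
    for _ in range(n_subspaces):
        # (a) pick r of the p original features at random, without replacement
        features = rng.choice(X.shape[1], size=subspace_dim, replace=False)
        # (b) construct classifier C_b on the projected data
        clf = clone(base_estimator).fit(X[:, features], y)
        ensemble.append((features, clf))
    return ensemble


def rsm_predict(ensemble, X):
    """Combine the B classifiers by simple majority voting."""
    votes = np.array([clf.predict(X[:, feats]) for feats, clf in ensemble])
    # one majority vote per sample (assumes non-negative integer labels)
    return np.array([np.bincount(sample_votes.astype(int)).argmax()
                     for sample_votes in votes.T])
```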
MOTIVATION FOR RSM

Redundancy in the data feature space:
- Completely redundant feature set
- Redundancy spread over many features

Weak classifiers that have critical training sample sizes
RSM PERFORMANCE ISSUES

RSM performance depends on:
- Training sample size
- The choice of base classifier
- The choice of combining rule (simple majority vs. weighted)
- The degree of redundancy in the dataset
- The number of features chosen

DECISION FORESTS (by Ho)


- A combination of trees instead of a single tree
- Assumption: the dataset has some redundant features
- Works efficiently with any decision tree algorithm and data splitting method (sketched below)
- Ideally, look for the best individual trees with the lowest tree similarity
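
A minimal sketch of such a subspace-based decision forest, using scikit-learn's BaggingClassifier to train each tree on a random subset of features; the parameter values (50 trees, half the features per tree) are illustrative assumptions, not values from Ho's paper.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Decision forest via random subspaces: no sample bootstrapping,
# each tree sees only a random subset of the features.
forest = BaggingClassifier(
    DecisionTreeClassifier(),   # base tree; any splitting method works
    n_estimators=50,            # B trees in the forest (illustrative)
    max_features=0.5,           # each tree trained on a random half of the features
    bootstrap=False,            # keep all samples, vary only the feature subspace
    bootstrap_features=False,   # sample features without replacement
)
# Usage: forest.fit(X_train, y_train); forest.predict(X_test)
```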

UNLABELED DATA

Small number of labeled documents

Large pool of unlabeled documents

How to classify unlabeled documents accurately?
EXPECTATION-MAXIMIZATION (E-M)
CO-TRAINING

Blum and Mitchell, "Combining Labeled and Unlabeled Data with Co-Training", 1998.

Requirements:
- Two sufficiently strong feature sets (each sufficient to classify on its own)
- Conditionally independent given the class

CO-TRAINING
APPLICATION OF CO-TRAINING TO A SINGLE FEATURE SET
Algorithm (sketched in code below):
  Obtain a small set L of labeled examples.
  Obtain a large set U of unlabeled examples.
  Obtain two sets F1 and F2 of features that are sufficiently redundant.
  While U is not empty do:
    Learn classifier C1 from L based on F1.
    Learn classifier C2 from L based on F2.
    For each classifier Ci do:
      Ci labels examples from U based on Fi.
      Ci chooses the most confidently predicted examples E from U.
      E are removed from U and added (with their predicted labels) to L.
  End loop
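
A minimal sketch of this co-training loop, assuming the two feature sets are given as column-index lists and the base classifier is scikit-learn-style with `predict_proba`; the names (`co_train`, `view1`, `view2`) and the per-round selection size `k` are illustrative assumptions.

```python
import numpy as np
from sklearn.base import clone


def co_train(X, y_labeled, labeled_idx, unlabeled_idx, view1, view2,
             base_estimator, k=5):
    """Grow the labeled set L by repeatedly labeling the most confident
    examples from the unlabeled pool U, alternating between the two views."""
    L_idx, L_y = list(labeled_idx), list(y_labeled)
    U = list(unlabeled_idx)
    while U:
        for view in (view1, view2):                       # classifiers C1, C2
            # Learn classifier Ci from L based on Fi
            clf = clone(base_estimator).fit(X[np.ix_(L_idx, view)], L_y)
            proba = clf.predict_proba(X[np.ix_(U, view)])
            conf = proba.max(axis=1)                      # prediction confidence
            picked = np.argsort(conf)[-k:]                # most confident examples E
            # Remove E from U and add them, with their predicted labels, to L
            for j in sorted(picked, reverse=True):
                L_idx.append(U[j])
                L_y.append(clf.classes_[proba[j].argmax()])
                del U[j]
            if not U:
                break
    return L_idx, L_y
```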
THINGS TO DO



- How can we measure redundancy and use it efficiently?
- Can we improve Co-training?
- How can we apply RSM efficiently to:
  - Supervised learning
  - Semi-supervised learning
  - Unsupervised learning

QUESTIONS