
Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution
Presented by Jingting Zeng
11/26/2007
Outline
- Introduction to Feature Selection
- Feature Selection Models
- Fast Correlation-Based Filter (FCBF) Algorithm
- Experiment
- Discussion
- Reference
Introduction to Feature Selection
- Definition
  - A process that chooses an optimal subset of features according to an objective function
- Objectives
  - To reduce dimensionality and remove noise
  - To improve mining performance
    - Speed of learning
    - Predictive accuracy
    - Simplicity and comprehensibility of mined results
An Example for Optimal Subset
- Data set (whole set)
  - Five Boolean features
  - C = F1 ∨ F2
  - F3 = ¬F2, F5 = ¬F4
- Optimal subset: {F1, F2} or {F1, F3}
  - F3 carries the same information as F2 (so only one of them is needed), while F4 and F5 are irrelevant to C; see the check in the sketch below
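
As a quick check, here is a minimal Python sketch (the variable names are ours, not the slides'); it enumerates the free variables and confirms that either subset determines C:

    # Only F1, F2, F4 vary freely; F3 and F5 are derived from them.
    from itertools import product

    for f1, f2, f4 in product([False, True], repeat=3):
        f3, f5 = not f2, not f4        # F3 = ¬F2, F5 = ¬F4
        c = f1 or f2                   # class: C = F1 ∨ F2
        # C is recoverable from {F1, F2} directly, and from {F1, F3} via F2 = ¬F3
        assert c == (f1 or f2) == (f1 or not f3)
    print("C is fully determined by {F1, F2} and by {F1, F3}")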
Models of Feature Selection
- Filter model
  - Separating feature selection from classifier learning
  - Relying on general characteristics of the data (information, distance, dependence, consistency)
  - No bias toward any learning algorithm; fast
- Wrapper model
  - Relying on a predetermined classification algorithm
  - Using predictive accuracy as the goodness measure
  - High accuracy, but computationally expensive (both models are contrasted in the sketch below)
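
A hedged illustration of the two models, assuming scikit-learn is available; the dataset, k=10, and the decision tree inside the wrapper are our choices, not the slides':

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Filter model: scores features from general data characteristics,
    # with no classifier in the loop.
    filt = SelectKBest(mutual_info_classif, k=10).fit(X, y)

    # Wrapper model: repeatedly refits a predetermined classifier and keeps
    # the features it predicts best with -- more accurate, far more costly.
    wrap = RFE(DecisionTreeClassifier(random_state=0),
               n_features_to_select=10).fit(X, y)

    print("filter picked: ", filt.get_support(indices=True))
    print("wrapper picked:", wrap.get_support(indices=True))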
Filter Model (diagram)
Wrapper Model (diagram)
Two Aspects for Feature Selection
- How to decide whether a feature is relevant to the class or not
- How to decide whether such a relevant feature is redundant or not compared to other features
Linear Correlation Coefficient
- For a pair of variables (X, Y):
  r = Σ_i (x_i - x̄)(y_i - ȳ) / sqrt( Σ_i (x_i - x̄)² · Σ_i (y_i - ȳ)² )
- However, it may not be able to capture non-linear correlations (see the sketch below)
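
A small numeric illustration of that limitation, assuming NumPy: y = x² is completely determined by x, yet the linear coefficient is essentially zero.

    import numpy as np

    x = np.linspace(-1.0, 1.0, 201)
    y = x ** 2                      # deterministic, but non-linear
    r = np.corrcoef(x, y)[0, 1]
    print(f"linear r = {r:.3f}")    # ~0.000: the dependence is invisible to r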
Information Measures
- Entropy of variable X:
  H(X) = -Σ_i P(x_i) log₂ P(x_i)
- Entropy of X after observing Y:
  H(X|Y) = -Σ_j P(y_j) Σ_i P(x_i|y_j) log₂ P(x_i|y_j)
- Information Gain:
  IG(X|Y) = H(X) - H(X|Y)
- Symmetrical Uncertainty:
  SU(X,Y) = 2 · IG(X|Y) / (H(X) + H(Y))
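
A minimal sketch of these measures for discrete 1-D NumPy arrays (the function names are ours); probabilities are estimated by simple counting:

    import numpy as np

    def entropy(x):
        """H(X) = -sum_i P(x_i) log2 P(x_i)."""
        _, counts = np.unique(x, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def conditional_entropy(x, y):
        """H(X|Y): entropy of x averaged over the groups induced by y."""
        ys, ycounts = np.unique(y, return_counts=True)
        py = ycounts / ycounts.sum()
        return sum(pj * entropy(x[y == yj]) for yj, pj in zip(ys, py))

    def symmetrical_uncertainty(x, y):
        """SU(X,Y) = 2 * IG(X|Y) / (H(X) + H(Y)), normalized to [0, 1]."""
        ig = entropy(x) - conditional_entropy(x, y)   # information gain
        return 2.0 * ig / (entropy(x) + entropy(y))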
Fast Correlation-Based Filter (FCBF) Algorithm
- How to decide whether a feature f_i is relevant to the class C or not
  - Find a subset S' such that ∀ f_i ∈ S', 1 ≤ i ≤ N, SU_{i,c} ≥ δ
- How to decide whether such a relevant feature is redundant
  - Use the correlation of features and the class as a reference (the relevance step is sketched below)
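
The relevance step then reduces to thresholding SU against δ; a sketch building on symmetrical_uncertainty above (relevant_features is our name, not the paper's):

    def relevant_features(X, c, delta):
        """Return (index, SU_{i,c}) pairs with SU_{i,c} >= delta, sorted by
        decreasing SU -- the ordered relevant list S'."""
        scored = [(i, symmetrical_uncertainty(X[:, i], c))
                  for i in range(X.shape[1])]
        return sorted((t for t in scored if t[1] >= delta),
                      key=lambda t: -t[1])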
Definitions
- Predominant Correlation
  - The correlation between a feature f_i and the class C is predominant iff SU_{i,c} ≥ δ and ∀ f_j ∈ S' (j ≠ i), SU_{j,i} < SU_{i,c}
- Redundant peer (RP)
  - If there is an f_j with SU_{j,i} ≥ SU_{i,c}, then f_j is a RP of f_i
  - S_{Pi} denotes the set of RPs of f_i; it is divided into
    S_{Pi}⁺ = {f_j : SU_{j,c} > SU_{i,c}} and S_{Pi}⁻ = {f_j : SU_{j,c} ≤ SU_{i,c}}


Three Heuristics
- If S_{Pi}⁺ = ∅, treat f_i as a predominant feature, remove all features in S_{Pi}⁻, and skip identifying redundant peers for them
- If S_{Pi}⁺ ≠ ∅, process all the features in S_{Pi}⁺ first; if none of them becomes predominant, follow the first heuristic
- The feature with the largest SU_{i,c} value is always a predominant feature and can be a starting point for removing other features
FCBF Algorithm
- Time complexity: O(N)
FCBF Algorithm (cont.)
- Time complexity: O(N log N); a compact sketch of the whole procedure follows
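
Putting the pieces together, a compact sketch of the selection loop, built on the helpers above; it implements the three heuristics directly rather than reproducing the paper's exact pseudocode:

    def fcbf(X, c, delta=0.0):
        # Relevance: S' ordered by decreasing SU_{i,c}; by the third
        # heuristic the head of the list is always predominant.
        s_prime = relevant_features(X, c, delta)
        selected = []
        while s_prime:
            i, _ = s_prime.pop(0)          # current predominant feature
            selected.append(i)
            # Redundancy: drop every remaining peer f_j with SU_{j,i} >= SU_{j,c}
            s_prime = [(j, su_jc) for j, su_jc in s_prime
                       if symmetrical_uncertainty(X[:, j], X[:, i]) < su_jc]
        return selected

On a discrete n_samples × n_features array X with class vector c, fcbf(X, c, delta) returns the indices of the predominant features; each selected feature eliminates its redundant peers in a single pass.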
Experiments
- FCBF is compared to ReliefF, CorrSF, and ConsSF
- Summary of the 10 data sets (table in the original slides)
Results (table in the original slides)
Results (cont.) (table in the original slides)
Pros and Cons
- Advantages
  - Very fast
  - Selects fewer features with higher accuracy
- Disadvantage
  - Cannot detect some features: on a test with 4 features generated by 4 Gaussian functions plus 4 additional redundant features, FCBF selected only 3 features
Discussion
- FCBF compares only individual features with each other
- A possible extension: first use PCA to capture groups of features, then apply FCBF to the result
Reference
- L. Yu and H. Liu. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. In Proc. 12th Int. Conf. on Machine Learning (ICML-03), pages 856–863, 2003.
- J. Biesiada and W. Duch. Feature Selection for High-Dimensional Data: A Kolmogorov-Smirnov Correlation-Based Filter Solution. In Proc. CORES'05, Advances in Soft Computing, Springer Verlag, pages 95–104, 2005.
- www.cse.msu.edu/~ptan/SDM07/Yu-Ye-Liu.pdf
- www1.cs.columbia.edu/~jebara/6772/proj/Keith.ppt
Thank you!
Q and A