Transcript PPT
Feature Subset Selection
using Minimum Cost
Spanning Trees
Mike Farah - 18548059
Supervisor: Dr. Sid Ray
Outline
• Introduction
Pattern Recognition
Feature subset selection
• Current methods
• Proposed method
• IFS
• Results
• Conclusion
Introduction: Pattern Recognition
• The classification of objects into groups by
learning from a small sample of objects
Apples and strawberries:
Classes: apples and strawberries
Features: colour, size, weight, texture
• Applications:
Character recognition
Voice recognition
Oil mining
Weather prediction
…
Introduction: Pattern Recognition
• Pattern representation
Measuring and recording features
Size, colour, weight, texture….
• Feature set reduction
Reducing the number of features used
Selecting a subset
Transformations
• Classification
The resulting features are used for classification of
unknown objects
Introduction: Feature subset selection
• Can be split into two processes:
Feature subset searching
It is not usually feasible to exhaustively evaluate every feature subset combination, so a search strategy is needed (a minimal sketch of how a search interacts with a criterion follows below)
Criterion function
The main issue in feature subset selection (Jain et al. 2000)
The focus of our research
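The slides do not prescribe a particular search strategy, so the following is only a minimal sketch, in Python, of how a (non-exhaustive) sequential forward search can be driven by any criterion function. The function name and the toy criterion are illustrative assumptions, not part of the presented work.

```python
def forward_selection(features, criterion, k):
    """Greedily grow a subset of k features, scoring candidates with `criterion`."""
    selected = []
    remaining = list(features)
    while len(selected) < k and remaining:
        # Score every candidate subset obtained by adding one more feature
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: a stand-in criterion that simply prefers low feature indices
print(forward_selection(range(8), lambda subset: -sum(subset), 3))  # [0, 1, 2]
```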
Current methods
• Euclidean distance
$J(x) = \min_{i \neq j} \{ (\mu_i - \mu_j)'(\mu_i - \mu_j) \}$
Statistical properties of the classes are not considered
• Mahalanobis distance
$J(x) = \min_{i \neq j} \{ (\mu_i - \mu_j)' \Sigma_{ij}^{-1} (\mu_i - \mu_j) \}$
The variances and covariances of the classes are taken into account (a code sketch of both criteria follows below)
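A minimal sketch of the two criteria above, assuming each class is summarised by its sample mean and, for the Mahalanobis case, that $\Sigma_{ij}$ is the average of the two class covariance matrices (the slides do not specify how it is formed). Function names are illustrative, not from the thesis code.

```python
import numpy as np
from itertools import combinations

def euclidean_criterion(X, y):
    """min over class pairs of (mu_i - mu_j)'(mu_i - mu_j)."""
    means = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    return min((means[i] - means[j]) @ (means[i] - means[j])
               for i, j in combinations(means, 2))

def mahalanobis_criterion(X, y):
    """min over class pairs of (mu_i - mu_j)' inv(Sigma_ij) (mu_i - mu_j)."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    covs = {c: np.cov(X[y == c], rowvar=False) for c in classes}
    best = np.inf
    for i, j in combinations(classes, 2):
        diff = means[i] - means[j]
        sigma = (covs[i] + covs[j]) / 2.0  # assumption: average of the two class covariances
        best = min(best, float(diff @ np.linalg.inv(sigma) @ diff))
    return best
```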
Limitations of Current Methods
Friedman and Rafsky’s two-sample test
• A minimum spanning tree (MST) approach for determining whether two sets of data originate from the same source
• An MST is built across the pooled data from the two sources; edges that connect samples from different data sets are removed
• If many edges are removed, the two sets of data are likely to originate from the same source
Friedman and Rafsky’s two-sample test
• The method can be used as a criterion function
• An MST is built across the sample points
• Edges that connect samples of different classes are removed
• A good subset is one that provides discriminatory information about the classes; therefore, the fewer edges removed the better (a minimal sketch follows below)
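The implementation is not shown in the presentation; the sketch below assumes Euclidean distances and uses SciPy's minimum_spanning_tree to compute a score in the spirit of the criterion just described (the higher the fraction of within-class edges, the better the subset).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def fr_criterion(X, y):
    """Fraction of MST edges that stay within a single class (higher is better)."""
    dist = squareform(pdist(X))                # pairwise Euclidean distances
    mst = minimum_spanning_tree(dist).tocoo()  # the N-1 edges of the MST
    # Note: SciPy treats zero weights as missing edges, so duplicate points are assumed absent
    within = sum(y[i] == y[j] for i, j in zip(mst.row, mst.col))
    return within / len(mst.row)
```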
Limitations of Friedman and Rafsky’s technique
Our Proposed Method
• Use the number of edges and edge lengths in
determining the suitability of a subset
• A good subset will have a large number of short
edges connecting samples of the same class
• And a small number of long edges connecting
samples of different classes
Our Proposed Method
$J(x) = 0.5\left[\left(1 - \dfrac{\text{betweenNum}}{\text{betweenNum} + \text{withinNum}}\right) + \dfrac{\text{Avg(betweenEdges)}}{\text{Avg(betweenEdges)} + \text{Avg(withinEdges)}}\right]$
• We experimented with both a simple average and a weighted average of the edge lengths; the weighted average was expected to perform better (a sketch using the simple average follows below)
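A sketch of the proposed criterion under the reconstruction of the formula above (the exact weighting on the original slide is uncertain). It rewards subsets whose MSTs have few between-class edges and whose between-class edges are long relative to the within-class edges; a simple (unweighted) average of edge lengths is used here.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_length_criterion(X, y):
    """Combines the between-class edge count and relative edge lengths of the MST."""
    dist = squareform(pdist(X))
    mst = minimum_spanning_tree(dist).tocoo()
    edges = list(zip(mst.row, mst.col, mst.data))
    between = [w for i, j, w in edges if y[i] != y[j]]
    within = [w for i, j, w in edges if y[i] == y[j]]
    if not between or not within:       # degenerate subsets: avoid division by zero
        return 1.0 if not between else 0.0
    num_term = 1.0 - len(between) / (len(between) + len(within))
    len_term = np.mean(between) / (np.mean(between) + np.mean(within))
    return 0.5 * (num_term + len_term)
```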
IFS - Interactive Feature Selector
• Developed to allow users to experiment with
various feature selection methods
• Automates the execution of experiments
• Allows visualisation of data sets, and results
• Extensible: developers can easily add new criterion functions, feature selectors and classifiers to the system (a hypothetical plug-in sketch follows below)
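The actual IFS API is not shown in the presentation; the following is a hypothetical sketch of the kind of plug-in interface that adding criterion functions implies. All class and method names are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
import numpy as np

class CriterionFunction(ABC):
    """A pluggable criterion function: score the columns of X listed in `subset`."""

    @abstractmethod
    def score(self, X, y, subset):
        ...

class MeanSeparationCriterion(CriterionFunction):
    """Trivial example plug-in: Euclidean distance between the first two class means."""

    def score(self, X, y, subset):
        X, y = np.asarray(X), np.asarray(y)
        Xs = X[:, subset]
        a, b = np.unique(y)[:2]
        return float(np.linalg.norm(Xs[y == a].mean(axis=0) - Xs[y == b].mean(axis=0)))
```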
IFS - Screenshot
Experimental Framework
Data set          No. Samples   No. Features   No. Classes
Iris                  150             4              3
Crab                  200             7              2
Forensic Glass        214             9              7
Diabetes              332             8              2
Character             757            20              8
Synthetic             750             7              5
Experimental Framework
• Spearman’s rank correlation
A good criterion function should correlate well with the classifier: subsets that it ranks highly should also achieve high accuracy (a small sketch follows below)
• Subset chosen
The final subsets selected by the criterion functions are compared to the optimal subset chosen by the classifier
• Time
The completion times of the algorithms are compared
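A minimal sketch of the rank-correlation evaluation using scipy.stats.spearmanr; the input numbers are illustrative only, not results from the experiments.

```python
from scipy.stats import spearmanr

def criterion_classifier_correlation(criterion_scores, classifier_accuracies):
    """Spearman correlation between criterion scores and classifier accuracies,
    both measured over the same ordered list of candidate subsets."""
    rho, _ = spearmanr(criterion_scores, classifier_accuracies)
    return rho

# Toy usage: identical orderings give a correlation of 1.0
print(criterion_classifier_correlation([0.2, 0.5, 0.9], [0.61, 0.72, 0.88]))
```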
Forensic glass data set results
Synthetic data set
Algorithm completion times
Algorithm complexities
• K-NN
$O(N^2 (F + \log N + K))$
• MST criterion functions
$O(N^2 F)$
• Mahalanobis distance
$O(C^2 F^2 (N + F))$
• Euclidean distance
$O(C^2 F)$
Conclusion
• MST-based approaches generally achieved higher accuracy and rank-correlation values, in particular with the K-NN classifier
• The criterion function based on Friedman and Rafsky’s two-sample test performed the best
Conclusion
• MST approaches are closely related to the K-NN classifier
• The Mahalanobis criterion function is suited to data sets with Gaussian distributions and strong feature interdependence
• Future work:
Construct a classifier based on K-NN that gives closer neighbours higher priority
Improve IFS