Transcript PPT

Feature Subset Selection using Minimum Cost Spanning Trees
Mike Farah - 18548059
Supervisor: Dr. Sid Ray
Outline
• Introduction
  – Pattern Recognition
  – Feature subset selection
• Current methods
• Proposed method
• IFS
• Results
• Conclusion
Introduction: Pattern Recognition
• The classification of objects into groups by learning from a small sample of objects
  – Apples and strawberries:
    – Classes: apples and strawberries
    – Features: colour, size, weight, texture
• Applications:
  – Character recognition
  – Voice recognition
  – Oil mining
  – Weather prediction
  – …
Introduction: Pattern Recognition
• Pattern representation
  – Measuring and recording features
  – Size, colour, weight, texture…
• Feature set reduction
  – Reducing the number of features used
  – Selecting a subset
  – Transformations
• Classification
  – The resulting features are used for classification of unknown objects
Introduction: Feature subset selection
• Can be split into two processes:
  – Feature subset searching
    – It is not usually feasible to exhaustively try all feature subset combinations (illustrated below)
  – Criterion function
    – The main issue of feature subset selection (Jain et al. 2000)
    – The focus of our research
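A minimal Python sketch (ours, not from the slides) of why exhaustive search is infeasible: every non-empty subset of F features is a candidate, so there are 2^F − 1 of them to score.

```python
from itertools import combinations

def all_subsets(features):
    """Yield every non-empty feature subset: 2^F - 1 candidates in total."""
    for r in range(1, len(features) + 1):
        yield from combinations(features, r)

# The candidate count explodes quickly; the character data set's
# 20 features alone give over a million subsets to evaluate.
for f in (4, 9, 20):
    print(f"{f} features -> {2**f - 1} subsets")
```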
Current methods
• Euclidean distance
  – $J(x) = \min_{i \neq j} \{ (\mu_i - \mu_j)'(\mu_i - \mu_j) \}$
  – Statistical properties of the classes are not considered
• Mahalanobis distance
  – $J(x) = \min_{i \neq j} \{ (\mu_i - \mu_j)' \Sigma_{ij}^{-1} (\mu_i - \mu_j) \}$
  – Variances and co-variances of the classes are taken into account (both criteria are sketched below)
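A minimal sketch of the two criteria (an illustration, not the original implementation); `means` and `covs` hold the per-class mean vectors and covariance matrices, and reading $\Sigma_{ij}$ as the averaged pair covariance is our assumption about the notation.

```python
import numpy as np
from itertools import combinations

def euclidean_criterion(means):
    """J(x): minimum squared Euclidean distance between class means."""
    return min((mi - mj) @ (mi - mj) for mi, mj in combinations(means, 2))

def mahalanobis_criterion(means, covs):
    """J(x): minimum Mahalanobis distance between class means.
    Sigma_ij is taken here as the average of the two class covariances
    (an assumption; one plausible reading of the slide's notation)."""
    return min(
        (mi - mj) @ np.linalg.inv((ci + cj) / 2) @ (mi - mj)
        for (mi, ci), (mj, cj) in combinations(list(zip(means, covs)), 2)
    )
```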
Limitations of Current Methods
Friedman and Rafsky’s two sample test
• Minimum spanning tree approach for
determining whether two sets of data originate
from the same source
• An MST is built across the data from the two sources, and edges which connect samples of different data sets are removed
• If many edges are removed, then the two sets of
data are likely to originate from the same source
Friedman and Rafsky’s two sample test
• The method can be used as a criterion function
• An MST is built across the sample points
• Edges which connect samples of different classes are removed
• A good subset is one that provides discriminatory information about the classes; therefore, the fewer edges removed, the better (see the sketch below)
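A minimal SciPy sketch of this criterion (assuming Euclidean edge weights; the function name is ours):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def cross_class_edge_count(X, y):
    """Build an MST over the samples X and count the edges whose
    endpoints belong to different classes (labels y)."""
    mst = minimum_spanning_tree(squareform(pdist(X)))
    rows, cols = mst.nonzero()
    return int(np.sum(y[rows] != y[cols]))
```

Ranking candidate subsets by this count in ascending order gives the criterion described above: fewer cross-class edges, better subset.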
Limitations of Friedman and Rafsky’s technique
Our Proposed Method
• Use the number of edges and edge lengths in
determining the suitability of a subset
• A good subset will have a large number of short edges connecting samples of the same class, and a small number of long edges connecting samples of different classes
Our Proposed Method
• $J(x) = 0.5\left[\left(1 - \dfrac{betweenNum}{betweenNum + withinNum}\right) + \dfrac{Avg(betweenEdges)}{Avg(betweenEdges) + Avg(withinEdges)}\right]$
• We experimented with using both the average edge length and a weighted average; the weighted average was expected to perform better (the unweighted variant is sketched below)
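A sketch of the unweighted (plain average) variant, following the formula as reconstructed above; the weighted variant would replace the plain means with weighted ones.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_criterion(X, y):
    """Proposed criterion: rewards subsets whose MSTs have few, long
    between-class edges and many, short within-class edges."""
    mst = minimum_spanning_tree(squareform(pdist(X)))
    rows, cols = mst.nonzero()
    lengths = np.asarray(mst[rows, cols]).ravel()
    between = y[rows] != y[cols]                  # edge joins two classes?
    between_num, within_num = between.sum(), (~between).sum()
    avg_between = lengths[between].mean() if between_num else 0.0
    avg_within = lengths[~between].mean() if within_num else 0.0
    return 0.5 * ((1 - between_num / (between_num + within_num))
                  + avg_between / (avg_between + avg_within))
```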
IFS - Interactive Feature Selector
• Developed to allow users to experiment with
various feature selection methods
• Automates the execution of experiments
• Allows visualisation of data sets, and results
• Extensible: developers can easily add criterion functions, feature selectors and classifiers to the system
IFS - Screenshot
Experimental Framework
Data set         No. Samples   No. Feats   No. Classes
Iris                     150           4             3
Crab                     200           7             2
Forensic Glass           214           9             7
Diabetes                 332           8             2
Character                757          20             8
Synthetic                750           7             5
Experimental Framework
• Spearman’s rank correlation
  – A good criterion function will correlate well with the classifier: subsets that are ranked highly should achieve high accuracy levels (see the sketch below)
• Subset chosen
  – The final subsets selected by the criterion functions are compared to the optimal subset chosen by the classifier
• Time
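A minimal illustration of the rank-correlation check with SciPy (the numbers are made-up placeholders, not experimental results):

```python
from scipy.stats import spearmanr

# One criterion score per candidate subset, paired with the accuracy
# the classifier achieved using that same subset (illustrative values).
criterion_scores = [0.81, 0.74, 0.62, 0.55, 0.40]
subset_accuracies = [0.93, 0.90, 0.84, 0.79, 0.71]
rho, p = spearmanr(criterion_scores, subset_accuracies)
print(f"Spearman rho = {rho:.2f}")  # rho near 1: criterion ranks subsets like the classifier
```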
Forensic glass data set results
Synthetic data set
Algorithm completion times
Algorithm complexities
• K-NN
  – $O(N^2 (F + \log N + K))$
• MST criterion functions
  – $O(N^2 F)$
• Mahalanobis distance
  – $O(C^2 F^2 (N + F))$
• Euclidean distance
  – $O(C^2 F)$
  (N = samples, F = features, C = classes, K = neighbours)
Conclusion
• MST-based approaches generally achieved higher accuracy values and rank correlation, in particular with the K-NN classifier
• The criterion function based on Friedman and Rafsky’s two sample test performed the best
Conclusion
• MST approaches are closely related to the K-NN classifier
• The Mahalanobis criterion function is suited to data sets with Gaussian distributions and strong feature interdependence
• Future work:
  – Construct a classifier based on K-NN which gives closer neighbours higher priority
  – Improve IFS