Transcript PPT
Genetic-Algorithm-Based
Instance and Feature Selection
Instance Selection and Construction for
Data Mining Ch. 6
H. Ishibuchi, T. Nakashima, and M. Nii
Abstract
GA based approach for selecting a small number
of instances from a given data set in a pattern
classification problem.
To improve the classification ability of our nearest
neighbor classifier by searching for an appropriate
reference set.
Genetic Algorithm
Coding
Binary string of the length (n+m)
S a1a2 an s1s2 sm
ai : inclusion or exclusion of the i-th feature
sp : the inclusion or exclusion of the p-th instance
Fitness function
Minimize |F|, minimize |P|, and maximize g(S)
|F| : number of selected feature
|P| : number of selected instance
g(S) : classification performance
Genetic Algorithm
Performance measure (first one) : gA(S)
The
number of correctly classified instances
Minimize |P| subject to gA(S) = m
Performance measure (second one) : gB(S)
When
an instance xq was included in the reference set, xq was
not selected as its own nearest neighbor.
min{ d F ( x p , xq ) | x p P}, if xq P
d F ( x p * , xq )
min{ d F ( x p , xq ) | x p P {xq }}, if xq P
fitness
fitness( S ) Wg g ( S ) WF | F | WP | P |
Genetic Algorithm
Initialization
2. Genetic Operation: Iterate the following
procedure Npop/2 times to generate Npop string
1.
1. Randomly select a pair of strings
2. Apply a uniform crossover
3. Apply a mutation operator
Generation Update: Select the Npop best string
from 2Npop
4. Termination test
3.
Numerical Example
Biased Mutation
For effectively decreasing the number of selected
instances is to bias the mutation probability
In the biased mutation, a much larger probability
is assigned to the mutation from sp = 1 to sp = 0.
Data sets
2 artificial + 4 real
Normal distribution with small overlap
Normal distribution with large overlap
Iris data
Appendicitis Data
Cancer Data
Wine Data
Parameter Specifications
Pop Size : 50
Crossover Prob. : 1.0
Mutation Prob.
Pm = 0.01 for feature selection
Pm(1 0) = 0.1 for instance selection
Pm(0 1) = 0.01 for instance selection
Stopping condition : 500 gen.
Weight values : Wg = 5; WF = 1; WP = 1
Performance measure : gA(S) or gB(S)
30 trials for each data
Performance on Training Data
Performance on Test Data
Leaving-one-out procedure (iris & appendicitis)
10-fold cross-validation (cancer & wine)
Effect of Feature Selection
Effect on NN
Some Variants