
GA-Based Feature Selection and Parameter
Optimization for Support Vector Machine
Cheng-Lung Huang, Chieh-Jen Wang
Expert Systems with Applications, Volume 31, Issue 2, pp. 231-240, 2006.
Outline
- INTRODUCTION
- BACKGROUND KNOWLEDGE
- GA-BASED FEATURE SELECTION AND PARAMETER OPTIMIZATION
- NUMERICAL ILLUSTRATIONS
- CONCLUSION
Introduction
- Support vector machines (SVM) were first proposed by Vapnik (1995) for classification.
- SVM separates data with different class labels by determining a set of support vectors that define a hyperplane in the feature space.
- A kernel function maps the training data vectors into the feature space.
- SVM has been applied to a range of problems, including pattern recognition (Pontil and Verri, 1998), bioinformatics (Brown et al., 1999), and text categorization (Joachims, 1998).
Problems
- When using SVM, we face two problems:
  1. How to set the best parameters for SVM?
  2. How to choose the input features for SVM?
Feature Selection
- Feature selection identifies a strongly predictive subset of fields in the database and reduces the number of fields presented to the mining process.
- It affects several aspects of pattern classification:
  1. The accuracy of the learned classification algorithm
  2. The time needed to learn a classification function
  3. The number of examples needed for learning
  4. The cost associated with the features
SVM Parameters Setting
- Proper parameter setting can improve the classification accuracy of SVM.
- The parameters to be optimized include the penalty parameter C and the kernel function parameters (e.g., gamma for the RBF kernel).
- Grid search is one way to find the best C and gamma; however, it is time-consuming and does not always perform well.
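As a rough sketch of why grid search is expensive: every (C, gamma) pair on the grid must be scored, and each score normally means training and cross-validating an SVM. The `evaluate` callback below is a placeholder (an assumption) for that cross-validated accuracy; the exponentially spaced grids are a common convention, not values from the paper.

```python
import itertools

def grid_search(evaluate, c_grid, gamma_grid):
    """Score every (C, gamma) pair and keep the best.

    Cost grows with len(c_grid) * len(gamma_grid) full SVM
    trainings, which is what makes the grid approach slow.
    """
    best = None
    for c, gamma in itertools.product(c_grid, gamma_grid):
        score = evaluate(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best

# Exponentially spaced grids (a common convention for RBF-SVM tuning)
c_grid = [2 ** k for k in range(-5, 16, 2)]
gamma_grid = [2 ** k for k in range(-15, 4, 2)]
```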
Research Purposes
- The objective of this research is to optimize the SVM parameters and the feature subset simultaneously, without degrading the classification accuracy of SVM.
- Genetic algorithms (GA) have the potential to generate both the feature subset and the SVM parameters at the same time.
Chromosomes Design
- The binary chromosome comprises three parts:
  g_C^1 ~ g_C^(nc): represents the value of parameter C
  g_gamma^1 ~ g_gamma^(ngamma): represents the value of parameter gamma
  g_f^1 ~ g_f^(nf): represents the selected features (one bit per feature)
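As a minimal sketch, the three gene segments can be sliced off a flat bit string; the segment lengths nc, ngamma, and nf are whatever the user allots for C, gamma, and the feature mask.

```python
def split_chromosome(bits, n_c, n_gamma, n_f):
    """Slice one binary chromosome into its three gene segments:
    n_c bits for C, n_gamma bits for gamma, n_f bits (one per
    feature) for the feature-selection mask."""
    assert len(bits) == n_c + n_gamma + n_f
    g_c = bits[:n_c]
    g_gamma = bits[n_c:n_c + n_gamma]
    g_f = bits[n_c + n_gamma:]
    return g_c, g_gamma, g_f
```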
Genotype to Phenotype
- The bit strings for parameters C and gamma are genotypes that must be transformed into phenotypes:

  value = Pmin + (Pmax - Pmin) / (2^L - 1) * D

- value: the phenotype value
- Pmin: the minimum value of the parameter (user-defined)
- Pmax: the maximum value of the parameter (user-defined)
- D: the decimal value of the bit string
- L: the length of the bit string
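The conversion above can be sketched in a few lines of Python:

```python
def decode(bits, p_min, p_max):
    """Genotype-to-phenotype mapping:
    value = Pmin + (Pmax - Pmin) / (2**L - 1) * D,
    where D is the decimal value of the bit string and L its length.
    All-zero bits map to Pmin, all-one bits to Pmax."""
    d = int(bits, 2)       # D: decimal value of the bit string
    l = len(bits)          # L: length of the bit string
    return p_min + (p_max - p_min) / (2 ** l - 1) * d
```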
Fitness Function Design

  fitness = WA * SVM_accuracy + WF * (sum_{i=1..nf} Ci * Fi)^(-1)

- WA: weight of the SVM classification accuracy
- SVM_accuracy: SVM classification accuracy
- WF: weight of the features
- Ci: cost of feature i
- Fi: "1" if feature i is selected; "0" if feature i is not selected
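The fitness function can be sketched directly from the formula; the guard for an empty feature subset is an assumption added to keep the sketch well-defined.

```python
def fitness(svm_accuracy, feature_mask, costs, w_a, w_f):
    """fitness = WA * SVM_accuracy + WF * (sum_i Ci * Fi)^(-1)

    feature_mask: the Fi flags (1 = feature i selected, 0 = not);
    costs: the per-feature costs Ci. A smaller or cheaper selected
    subset makes the inverse-cost term larger."""
    selected_cost = sum(c * f for c, f in zip(costs, feature_mask))
    if selected_cost == 0:  # guard (assumption): no feature selected
        return w_a * svm_accuracy
    return w_a * svm_accuracy + w_f * (1.0 / selected_cost)
```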
System Flows for GA-based SVM
- (1) Data preprocessing: scaling
- (2) Converting genotype to phenotype
- (3) Feature subset selection
- (4) Fitness evaluation
- (5) Termination criteria check
- (6) Genetic operations
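The steps above can be condensed into a GA skeleton. Here `evaluate` is a placeholder (an assumption) standing in for steps (2)-(4): decoding genes to (C, gamma), applying the feature mask, and training/scoring the SVM. The fixed generation cap as the termination criterion is also an assumption; the default population size, crossover rate, and mutation rate follow the GA settings stated in the talk.

```python
import random

def ga_svm(evaluate, chrom_len, pop_size=500, generations=100,
           crossover_rate=0.7, mutation_rate=0.02):
    """Skeleton GA loop: evaluate, keep the elite, then breed the
    next generation with roulette-wheel selection, one-point
    crossover, and bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in range(chrom_len)]
           for _ in range(pop_size)]
    for _ in range(generations):              # (5) termination: generation cap
        scores = [evaluate(c) for c in pop]   # (2)-(4) via the placeholder
        elite = pop[scores.index(max(scores))][:]   # elitism replacement
        total = sum(scores)

        def pick():                           # roulette-wheel selection
            r = random.uniform(0, total)
            acc = 0.0
            for c, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return c
            return pop[-1]

        nxt = [elite]
        while len(nxt) < pop_size:            # (6) genetic operations
            a, b = pick()[:], pick()[:]
            if random.random() < crossover_rate:
                cut = random.randrange(1, chrom_len)   # one-point crossover
                a = a[:cut] + b[cut:]
            nxt.append([bit ^ (random.random() < mutation_rate)  # bit-flip
                        for bit in a])
        pop = nxt
    return max(pop, key=evaluate)
```

With a toy fitness such as the number of one-bits, the loop steadily drives the population toward the optimum, which is the same mechanism that tunes (C, gamma) and the feature mask here.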
Figure of System Flows
[Flowchart: the dataset is scaled and split into training and testing sets; each chromosome's genotype is converted to phenotype, yielding the feature genes and the parameter genes; the selected feature subset is applied to both sets; an SVM classifier is trained and its classification accuracy on the testing set feeds the fitness evaluation; genetic operations repeat until the termination criteria are satisfied, yielding the optimized (C, gamma) and feature subset.]
Experimental Dataset

No.  Names                            #Classes  #Instances  Nominal   Numeric   Total
                                                            features  features  features
1    German (credit card)             2         1000        0         24        24
2    Australian (credit card)         2         690         6         8         14
3    Pima-Indian diabetes             2         760         0         8         8
4    Heart disease (Statlog Project)  2         270         7         6         13
5    Breast cancer (Wisconsin)        2         699         0         10        10
6    Contraceptive Method Choice      3         1473        7         2         9
7    Ionosphere                       2         351         0         34        34
8    Iris                             3         150         0         4         4
9    Sonar                            2         208         0         60        60
10   Statlog project: vehicle         4         940         0         18        18
11   Vowel                            11        990         3         10        13
Experiments Description
- To guarantee that the results are valid and can be generalized for making predictions on new data, k-fold cross-validation is used.
- This study used k = 10: all of the data are divided into ten parts, each of which takes a turn as the testing data set.
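A minimal sketch of the fold construction (index bookkeeping only; the actual fold assignment scheme in the paper may differ):

```python
def k_fold_indices(n, k=10):
    """Partition indices 0..n-1 into k folds; each fold takes one
    turn as the testing set while the remaining folds form the
    training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test
```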
Accuracy Calculation
- For the binary-target datasets, accuracy is reported as the positive hit rate (sensitivity), the negative hit rate (specificity), and the overall hit rate.
- For the multi-class datasets, accuracy is reported only as the average hit rate.
Accuracy Calculation
- Sensitivity is the proportion of positive cases that are classified as positive: P(T+|D+) = TP / (TP + FN).
- Specificity is the proportion of negative cases that are classified as negative: P(T-|D-) = TN / (TN + FP).
- The overall hit rate is the overall accuracy: (TP + TN) / (TP + FP + FN + TN).
                     Target (or Disease)
Predicted (or Test)  +                    -
+                    True Positive (TP)   False Positive (FP)
-                    False Negative (FN)  True Negative (TN)
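The three rates follow directly from the confusion-matrix counts:

```python
def hit_rates(tp, fp, fn, tn):
    """Sensitivity, specificity, and overall hit rate from the
    confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                   # P(T+|D+)
    specificity = tn / (tn + fp)                   # P(T-|D-)
    overall = (tp + tn) / (tp + fp + fn + tn)      # overall accuracy
    return sensitivity, specificity, overall
```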
Accuracy Calculation
- The SVM_accuracy term in the fitness function is measured by sensitivity * specificity for the datasets with two classes (positive or negative), and by the overall hit rate for the datasets with multiple classes.
GA Parameter Setting
- Chromosome representation: binary code
- Population size: 500
- Crossover rate: 0.7, one-point crossover
- Mutation rate: 0.02
- Roulette wheel selection
- Elitism replacement
WA and WF
- The weights WA and WF influence the experimental result through the fitness function:
- The higher WA is, the higher the classification accuracy is.
- The higher WF is, the smaller the number of selected features is.
Fold #4 of German Dataset
[Curve diagram: classification accuracy, number of selected features, and fitness plotted against the accuracy weight WA, varied from 1.0 down to 0.1.]
Experimental Results for German
Dataset
Results summary (GA-based approach vs. grid search)

               #Original  GA-based approach                             Grid algorithm                      p-value for
Names          features   #Selected  Avg Pos.  Avg Neg.  Avg Overall   Avg Pos.  Avg Neg.  Avg Overall    Wilcoxon
                          features   Hit Rate  Hit Rate  Hit Rate %    Hit Rate  Hit Rate  Hit Rate %     testing
German         24         13±1.83    0.89608   0.76621   85.6±1.96     0.888271  0.462476  76±4.06        0.005*
Australian     14         3±2.45     0.8472    0.92182   88.1±2.25     0.885714  0.823529  84.7±4.74      0.028*
diabetes       8          3.7±0.95   0.78346   0.87035   81.5±7.13     0.592593  0.88      77.3±3.03      0.139
Heart disease  13         5.4±1.85   0.94467   0.95108   94.8±3.32     0.75      0.909091  83.7±6.34      0.005*
breast cancer  10         1±0        0.9878    0.8996    96.19±1.24    0.98      0.944444  95.3±2.28      0.435
Contraceptive  9          5.4±0.53   N/A       N/A       71.22±4.15    N/A       N/A       53.53±2.43     0.005*
ionosphere     34         6±0        0.9963    0.9876    98.56±2.03    0.94      0.9       89.44±3.58     0.005*
iris           4          1±0        N/A       N/A       100±0         N/A       N/A       97.37±3.46     0.046*
sonar          60         15±1.1     0.9863    0.9842    98±3.5        0.65555   0.9       87±4.22        0.004*
vehicle        18         9.2±1.4    N/A       N/A       84.06±3.54    N/A       N/A       83.33±2.74     0.944
Vowel          13         7.8±1      N/A       N/A       99.3±0.82     N/A       N/A       95.95±2.91     0.02*
ROC curve for fold #4 of German Credit Dataset
[ROC plot: sensitivity vs. 1 - specificity, comparing the GA-based approach with grid search, with a diagonal guide line.]
Average AUC for Datasets

Names          GA-based approach  Grid algorithm
German         0.8424             0.7886
Australian     0.9019             0.8729
diabetes       0.8298             0.7647
Heart disease  0.9458             0.8331
breast cancer  0.9423             0.9078
Contraceptive  0.7701             0.6078
ionosphere     0.9661             0.8709
iris           0.9756             0.9572
sonar          0.9522             0.8898
vehicle        0.8587             0.8311
Vowel          0.9513             0.9205
Conclusion
- We proposed a GA-based strategy to select the feature subset and to set the parameters for SVM classification.
- We conducted two experiments to evaluate the classification accuracy of the proposed GA-based approach with the RBF kernel against the grid search method on 11 real-world datasets from the UCI database.
- Generally, compared with the grid search approach, the proposed GA-based approach achieves good accuracy performance with fewer features.
Thank You
Q&A