Feature Selection in Nonlinear Kernel Classification

Download Report

Transcript Feature Selection in Nonlinear Kernel Classification

Feature Selection in Nonlinear
Kernel Classification
Workshop on Optimization-Based Data Mining Techniques with Applications
IEEE International Conference on Data Mining
Omaha, Nebraska, October 28, 2007
Olvi Mangasarian & Edward Wild
University of Wisconsin
Madison
However, data is
nonlinearly separable
using only the feature x2
Best linear classifier that
uses only 1 feature selects
the feature x1
Example
x2
+ + +
+ + +
_
_
Data is nonlinearly
separable: In general
nonlinear kernels use
both x1 and x2
+
+
_ _ _
_ _ _
+ ++ +
+ + + +
x1
Feature selection in nonlinear classification is important
Outline
 Minimize the number of input space features selected by a
nonlinear kernel classifier
 Start with a standard 1-norm nonlinear support vector
machine (SVM)
 Add 0-1 diagonal matrix to suppress or keep features
 Leads to a nonlinear mixed-integer program
 Introduce algorithm to obtain a good local solution to the
resulting mixed-integer program
 Evaluate algorithm on two public datasets from the UCI
repository and synthetic NDCC data
Linear kernel: (K(A, B))ij = (AB)ij = AiB j = K(Ai, B j)
¢
¢
kernel, parameter
:(K(A, B)) = exp(-||A -B || )
SupportGaussian
Vector
SVMs
Machines
¢
x 2 Rn
SVM defined by
parameters u and threshold
 of the nonlinear surface
A contains all data points
{+…+} ½ A+
{…} ½ A
e is a vector of ones
K(A, A0)u· e e
ij
_
__
0
j
2
K(A+, A0)u ¸ e +e
+
++
_
_
i
+
+
+
+ +
+ ++
Minimize e y (hinge loss or plus
_
+ + +
Minimize e s (||u|| at+
function or max{•, 0}) to fit _
+
solution)
__to reduce +
data
K(x , A )u = 
overfitting
+
_
_
K(x , A )u = 
_
__
_
Slack variable y ¸ 0
_
_
_ K(x , A )u = 1
_
allows points
to be on the
_
wrong side of the
_
_
bounding surface _
0
0
1
0
0
0
0
0
0
 To suppress features, add the number of
features present (e0Ee) to the objective with
weight  ¸ 0
 As  is increased, more features will be
removed from the classifier
Reduced
Start with
Feature
Full SVM
SVM
Replace A with AE, where
E is a diagonal n £ n
matrix
Eii 2 {1,
0},present in
All with
features
are
i =the
1, …,
n
kernel
matrix K(A, A0)
If Eii is 0 the ith feature is removed
Reduced Feature SVM (RFSVM)
1) Initialize diagonal matrix E randomly
2) For fixed 0-1 values E, solve the SVM linear program
to obtain (u, , y, s)
3)Fix (u, , s) and sweep through E repeatedly as follows:
 For each component of E replace 1 by 0 and conversely
provided the change decreases the overall objective function
by more than tol
4)Go to (3) if a change was made in the last sweep,
otherwise continue to (5)
5)Solve the SVM linear program with the new matrix E.
If the objective decrease is less than tol, stop, otherwise
go to (3)
RFSVM Convergence
(for tol = 0)
Objective function value converges
 Each step decreases the objective
 Objective is bounded below by 0
Limit of the objective function value is attained at
any accumulation point of the sequence of iterates
Accumulation point is a “local minimum solution”
 Continuous variables are optimal for the fixed integer
variables
 Changing any single integer variable will not decrease
the objective
Experimental Results
 Classification accuracy versus number of features used
 Compare our RFSVM to Relief and RFE
(Recursive Feature Elimination)
 Results given on two public datasets from the UCI
repository
 Ability of RFSVM to handle problems with up to 1000
features tested on synthetic NDCC datasets
 Set feature selection parameter = 1
Relief and RFE
 Relief
 Kira and Rendell, 1992
 Filter method: feature selection is a preprocessing procedure
 Features are selected as relevant if they tend to have different
feature values for points in different classes
 RFE (Recursive Feature Elimination)
 Guyon, Weston, Barnhill, and Vapnik, 2002
 Wrapper method: feature selection is based on classification
 Features are selected as relevant if removing them causes a large
change in the margin of an SVM
Ionosphere Dataset
34
SVM with
351 Points in RNonlinear
Cross-validation accuracy
If the appropriate value
of  is selected,
RFSVM can obtain
higher accuracy using
fewer features than
SVM1

no feature selection
Even for feature selection
 = 0, some
Note that parameter
accuracy decreases
features
be removed
slightly until
aboutmay
10 features
when
remain, and
thenremoving
decreasesthem
more
the hinge loss
sharply
asdecreases
they are removed
Linear 1-norm
SVM
Number of features used
Points are generated
from normal
distributions
centered at vertices
of 1-norm cubes
Dataset is not
linearly separable
Normally Distributed Clusters on
Cubes Dataset (Thompson, 2006)
Each
point is vs.
the SVM without Feature Selection (NKSVM1)
RFSVM
average
ontest
NDCC
onsetNDCC
DataData
withwith
20 True
100 True
Features
Features
and Varying
and
correctness over Numbers
1000 Irrelevant
of Irrelevant
Features
Features
10 datasets with
200 training, 200
tuning, and 1000
When 480
testing points
irrelevant
Average Accuracy on 1000 Test Points features are
added, the
0.70
RFSVM
accuracy of
RFSVM is
0.53
NKSVM1
45% higher
than that of
NKSVM1
Conclusion
New rigorous formulation with precise objective
for feature selection in nonlinear SVM classifiers
 Obtain a local solution to the resulting mixed-integer
program
 Alternate between a linear program to compute
continuous variables and successive sweeps to update
the integer variables
Efficiently learns accurate nonlinear classifiers
with reduced numbers of features
Handles problems with 1000 features, 900 of
which are irrelevant
Questions?
Websites with links to papers and talks
http://www.cs.wisc.edu/~olvi
http://www.cs.wisc.edu/~wildt
NDCC generator
http://www.cs.wisc.edu/dmi/svm/ndcc/
Running Time on the Ionosphere
Dataset
Averages 5.7 sweeps through the integer variables
Averages 3.4 linear programs
75% of the time consumed in objective function
evaluations
15% of time consumed in solving linear programs
Complete experiment (1960 runs) took 1 hour
 3 GHz Pentium 4
 Written in MATLAB
 CPLEX 9.0 used to solve the linear programs
 Gaussian kernel written in C
Sonar Dataset
208 Points in R60
Cross-validation accuracy

Number of features used
Related Work
Approaches that use specialized kernels
 Weston, Mukherjee, Chapelle, Pontil, Poggio, and
Vapnik, 2000: structural risk minimization
 Gold, Holub, and Sollich, 2005: Bayesian interpretation
 Zhang, 2006: smoothing spline ANOVA kernels
Margin-based approach
 Frölich and Zell, 2004: remove features if there is little
change to the margin if they are removed
Other approaches which combine feature selection
with basis reduction
 Bi, Bennett, Embrechts, Breneman, and Song, 2003
 Avidan, 2004
Future Work
Datasets with more features
Reduce the number of objective function
evaluations
Limit the number of integer cycles
Other ways to update the integer variables
Application to regression problems
Automatic choice of 
Algorithm
 Global solution to nonlinear mixed-integer program cannot
be found efficiently
 Requires solving 2n linear programs
 For fixed values of the integer diagonal matrix, the
problem is reduced to an ordinary SVM linear program
 Solution strategy: alternate optimization of continuous and
integer variables:
 For fixed values of E, solve a linear program for
(u, , y, s)
 For fixed values of (u, , s), sweep through the components of E
and make updates which decrease the objective function
Notation
Data points represented as rows of an m £ n matrix A
Data labels of +1 or -1 are given as elements of an
m £ m diagonal matrix D
Example
 XOR: 4 points in R2
 Points (0, 1) , (1, 0) have label +1
 Points (0, 0) , (1, 1) have label 1
Kernel K(A, B) : Rm£n £ Rn£k ! Rm£k
 Linear kernel: (K(A, B))ij = (AB)ij = AiB¢j = K(Ai, B¢j)
 Gaussian kernel, parameter :(K(A, B))ij = exp(-||Ai0 - B¢j||2)
Methodology
UCI Datasets
 To reduce running time, 1/11 of each dataset was used as a
tuning set to select  and the kernel parameter
 Remaining 10/11 used for 10-fold cross validation
 Procedure repeated 5 times for each dataset with different
random choice of tuning set each time
 NDCC
 Generate multiple datasets with 200 training, 200 tuning, and
1000 testing points