A New Paradigm for Feature Selection
With some surprising results
Amnon Shashua
School of Computer Science & Eng.
The Hebrew University
Joint work with Lior Wolf
Wolf & Shashua, ICCV’03
Problem Definition
Given a sample of feature measurements $M = [M_1, \ldots, M_q] \in \mathbb{R}^{n \times q}$, whose columns $M_1, \ldots, M_q$ are the data points and whose rows $m_1^T, \ldots, m_n^T$ are the feature vectors, find a subset of features $m_{i_1}, \ldots, m_{i_s}$ which are most "relevant" with respect to an inference (learning) task.
Comments:
• Data points can represent images (pixels), feature attributes, wavelet coefficients, ...
• The task is to select a subset of coordinates of the data points such that, ideally, the accuracy, confidence, and required training-sample size of a learning algorithm are optimal.
• Need to define a “relevance” score.
• Need to overcome the exponential nature of subset selection.
• If a “soft” selection approach is used, need to make sure the solution is “sparse”.
Examples:
• Text Classification: typically $10^4$–$10^7$ features representing word-frequency counts – only a small fraction is expected to be relevant. Typical examples include automatic sorting of URLs into a web directory and detection of spam email.
• Visual Recognition: image patches compared via similarity(·, ·) = $e^{-\|\cdot - \cdot\|_1}$, an $L_1$-based similarity between image patches.
• Genomics: a gene-expression matrix (gene expressions × tissue samples). Goal: recognizing the relevant genes which separate between normal and tumor cells, between different subclasses of cancer, and so on.
Why Select Features?
• Most learning algorithms do not scale well with the growth of irrelevant features.
ex1: the number of training examples required by some supervised learning methods grows exponentially with the number of irrelevant features.
ex2: for classifiers which can optimize their “capacity” (e.g., large-margin hyper-planes), the required sample size scales with the effective VC dimension $d$, roughly $m = O\!\left(\tfrac{1}{\epsilon}\left(d \log\tfrac{1}{\epsilon} + \log\tfrac{1}{\delta}\right)\right)$, and $d$ grows fast with irrelevant coordinates – faster than the capacity increase.
• Computational efficiency considerations when the number of coordinates is very high.
• Structure of data gets obscured with large amounts of irrelevant coordinates.
• Run-time of the (already trained) inference engine on new test examples.
Existing Approaches
• Filter methods: pre-process the data independently of the inference engine.
  examples: use of a mutual-information measure, correlation coefficients, clustering, ...
• Embedded, Wrapper: select features that are useful for building a good predictor/inference engine.
  example: run SVM training on every candidate subset of features.
  Computationally expensive in general.
Feature Subset Relevance - Key Idea
Selecting a subset of $l$ rows of $M$ maps each data point $M_i \in \mathbb{R}^n$ to a reduced data point $\hat{M}_i \in \mathbb{R}^l$, $l \le n$; the reduced data points $\hat{M}_1, \ldots, \hat{M}_q$ are the columns of the reduced matrix $\hat{M}$ built from the selected rows among $m_1^T, \ldots, m_n^T$.
Working Assumption: the relevant subset of rows induces columns that are coherently clustered.
Note: we are assuming an unsupervised setting (data points are not labeled). The framework can easily be applied to supervised settings as well.
• How to measure cluster coherency? We wish to avoid explicitly clustering the columns for each candidate subset of rows, and we want a measure which is amenable to continuous functional analysis.
Key idea: use spectral information from the affinity matrix $\hat{M}^T \hat{M}$ of the reduced data points $\hat{M}_1, \ldots, \hat{M}_q$.
• How do we represent $\hat{M}^T \hat{M}$? Let $s = \{i_1, \ldots, i_l\}$ be the subset of selected features, and let $\alpha_i = 1$ if $i \in s$ and $\alpha_i = 0$ otherwise. Then
$A = \hat{M}^T \hat{M} = \sum_{i=1}^n \alpha_i\, m_i m_i^T$
Definition of Relevancy
The Standard Spectrum
General Idea:
Select a subset of rows from the sample matrix M such that the resulting
affinity matrix will have high values associated with the first k eigenvalues.
Let $s = \{i_1, \ldots, i_l\}$ be the subset of selected features, with $\alpha_i = 1$ if $i \in s$ and $\alpha_i = 0$ otherwise. The resulting affinity matrix, built from the selected rows $m_{i_1}^T, \ldots, m_{i_l}^T$ of $M$, is
$A_\alpha = \sum_{i=1}^n \alpha_i\, m_i m_i^T$
and the relevance of the subset is defined as
$\mathrm{rel}(x_{i_1}, \ldots, x_{i_l}) = \mathrm{trace}(Q^T A_\alpha^T A_\alpha Q) = \sum_{j=1}^k \lambda_j^2$
where $Q$ consists of the first $k$ eigenvectors of $A_\alpha$ and $\lambda_1, \ldots, \lambda_k$ are the corresponding (largest) eigenvalues.
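As a concrete illustration, here is a minimal numpy sketch of this relevance score for a given 0/1 selection vector (the function name and the toy data below are illustrative, not from the talk):

    import numpy as np

    def relevance(M, alpha, k):
        # A_alpha = sum_i alpha_i m_i m_i^T, where m_i^T are the rows of M (M is n x q)
        A = (M.T * alpha) @ M
        evals = np.linalg.eigvalsh(A)      # eigenvalues of the symmetric q x q affinity matrix
        return np.sum(evals[-k:] ** 2)     # trace(Q^T A^T A Q) = sum of the k largest squared eigenvalues

    # toy usage: n = 5 features, q = 8 data points, select features {0, 2}
    M = np.random.randn(5, 8)
    alpha = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
    print(relevance(M, alpha, k=2))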
(unsupervised) Optimization Problem
Let $A_\alpha = \sum_{i=1}^n \alpha_i\, m_i m_i^T$ and solve
$\max_{Q,\,\alpha_1,\ldots,\alpha_n} \mathrm{trace}(Q^T A_\alpha^T A_\alpha Q)$
subject to $Q^T Q = I$ and $\alpha_i \in \{0, 1\}$, $i = 1, \ldots, n$.
This optimization is too difficult to be considered in practice (a mixed integer/continuous program).
(unsupervised) Optimization Problem
Soft Version I
Let $A_\alpha = \sum_{i=1}^n \alpha_i\, m_i m_i^T$ and solve
$\max_{Q,\,\alpha_1,\ldots,\alpha_n} \mathrm{trace}(Q^T A_\alpha^T A_\alpha Q) - h(\alpha)$
subject to $Q^T Q = I$, $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$,
where the non-linear function $h(\alpha)$ penalizes a uniform $\alpha$.
The result is a non-linear programming problem, which could be quite difficult to solve.
(unsupervised) Optimization Problem
Let $A_\alpha = \sum_{i=1}^n \alpha_i\, m_i m_i^T$ for some unknown real scalars $\alpha = (\alpha_1, \ldots, \alpha_n)^T$, and solve
$\max_{Q,\,\alpha_1,\ldots,\alpha_n} \mathrm{trace}(Q^T A_\alpha^T A_\alpha Q)$
subject to $Q^T Q = I$ and $\alpha^T \alpha = 1$.
Motivation: from spectral clustering it is known that the eigenvectors tend to be discontinuous, and that may lead to an effortless sparsity property.
Note: the optimization definition ignores the requirements:
1. $\alpha_i \ge 0$
2. the weight vector $\alpha$ should be sparse.
The Q Algorithm
$\max_{Q,\,\alpha} \mathrm{trace}(Q^T A_\alpha^T A_\alpha Q)$ subject to $Q^T Q = I$, $\alpha^T \alpha = 1$.
• If $\alpha$ were known, then $A_\alpha$ is known and $Q$ is simply the matrix of the first $k$ eigenvectors of $A_\alpha$.
• If $Q$ were known, then the problem becomes
$\max_\alpha\ \alpha^T G \alpha$ subject to $\alpha^T \alpha = 1$, where $G_{ij} = (m_i^T m_j)(m_i^T Q Q^T m_j)$,
and the solution $\alpha$ is the largest eigenvector of $G$.
The Q Algorithm
Power-embedded
1. Let $G^{(r)}$ be defined by $G^{(r)}_{ij} = (m_i^T m_j)\,(m_i^T Q^{(r-1)} Q^{(r-1)T} m_j)$.
2. Let $\alpha^{(r)}$ be the largest eigenvector of $G^{(r)}$.
3. Let $A^{(r)} = \sum_{i=1}^n \alpha_i^{(r)} m_i m_i^T$.
4. Let $Z^{(r)} = A^{(r)} Q^{(r-1)}$.
5. $Z^{(r)} \xrightarrow{QR} Q^{(r)} R^{(r)}$ (“QR” factorization step; steps 4–5 form an orthogonal iteration).
6. Increment $r$ and repeat.
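A minimal numpy sketch of this iteration (my own illustrative code, not the authors' implementation; it assumes the rows $m_i^T$ of $M$ are given, fixes $k$, and omits stopping criteria):

    import numpy as np

    def q_alpha(M, k, iters=50):
        # M is n x q; its rows m_i^T are the feature vectors measured over q data points
        n, q = M.shape
        Q, _ = np.linalg.qr(np.random.randn(q, k))   # initial orthonormal Q^(0)
        MMt = M @ M.T                                # (m_i^T m_j)
        for _ in range(iters):
            MQ = M @ Q
            G = MMt * (MQ @ MQ.T)                    # G_ij = (m_i^T m_j)(m_i^T Q Q^T m_j)
            w, V = np.linalg.eigh(G)
            alpha = V[:, -1]                         # largest eigenvector of G
            alpha *= np.sign(alpha.sum())            # fix the arbitrary sign of the eigenvector
            A = (M.T * alpha) @ M                    # A_alpha = sum_i alpha_i m_i m_i^T
            Q, _ = np.linalg.qr(A @ Q)               # steps 4-5: orthogonal iteration / QR
        return alpha, Q

    # toy usage: 50 features measured over 30 data points, k = 2 clusters
    alpha, Q = q_alpha(np.random.randn(50, 30), k=2)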
Convergence proof: take $k = 1$ for example. Steps 4, 5 become $q^{(r)} = \dfrac{A q^{(r-1)}}{\|A q^{(r-1)}\|}$. Need to show that $q^{(r)T} A^2 q^{(r)} \ge q^{(r-1)T} A^2 q^{(r-1)}$, i.e., that
$\dfrac{q^T A^4 q}{q^T A^2 q} \ge q^T A^2 q,$
which holds because $q^T A^4 q \ge (q^T A^2 q)^2$ for all symmetric matrices $A$ and unit vectors $q$ (follows from convexity).
Positivity and Sparsity of α
Hand-Waving Argument
$\arg\max_{Q,\,\alpha} \mathrm{trace}(Q^T A_\alpha^T A_\alpha Q)$ corresponds to $\arg\min_{Q,\,\alpha} \dfrac{\|A_\alpha - Q Q^T A_\alpha\|_F^2}{\|A_\alpha\|_F^2}$, which is minimized when $\mathrm{rank}(A_\alpha) = k$.
$A_\alpha = \sum_{i=1}^n \alpha_i\, m_i m_i^T$ is a sum of rank-1 matrices; every term added can only increase the rank unless it is redundant. If we would like $\mathrm{rank}(A_\alpha)$ to be small, we shouldn't add too many rank-1 terms, therefore $\alpha$ should be sparse.
Note: this argument does not say anything with regard to why α should be positive.
Positivity and Sparsity of α
The key to the emergence of a sparse and positive $\alpha$ has to do with the way the entries of $G$ are defined:
$G_{ij} = (m_i^T m_j)\,(m_i^T Q Q^T m_j) = \sum_{l=1}^k (m_i^T q_l)(m_j^T q_l)(m_i^T m_j)$
Consider $k = 1$ for example; then each entry is of the form $f = (a^T b)(a^T c)(b^T c)$ with $\|a\| = \|b\| = \|c\| = 1$. Clearly $-1 \le f \le 1$, and $f = -1$ would require exactly one of the three inner products to equal $-1$ and the other two to equal $1$ (e.g., $a^T b = 1$, $a^T c = 1$, $b^T c = -1$), or all three to equal $-1$ – which cannot happen.
The expected value of the entries of $G$ is therefore biased towards a positive number.
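A quick Monte-Carlo illustration of this bias (my own check, not from the talk): sample random unit vectors and inspect the empirical range and mean of $f$.

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 3, 100_000
    f = np.empty(trials)
    for t in range(trials):
        a, b, c = rng.standard_normal((3, n))
        a, b, c = a / np.linalg.norm(a), b / np.linalg.norm(b), c / np.linalg.norm(c)
        f[t] = (a @ b) * (a @ c) * (b @ c)

    print(f.min(), f.mean())   # empirically the minimum stays above -1/8 and the mean is positive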
Positivity and Sparsity of α
1. What is the minimal value of $f = (a^T b)(a^T c)(b^T c)$ when $a, b, c$ vary over the $n$-dimensional unit hypersphere?
   $-\frac{1}{8} \le f \le 1$
2. Given a uniform sampling of $a, b, c$ over the $n$-dimensional unit hypersphere, what are the mean and variance of $f$?
   $\mu = \frac{1}{6}$, $\sigma^2 = \frac{1}{18}$
3. Given that $G_{ij} \sim N(\mu, \sigma^2)$, what is the probability that the first eigenvector of $G$ is strictly non-negative (all entries of the same sign)?
Proposition 3: Let
$G_{ij} \sim \begin{cases} N(0, \sigma^2) & i = j \\ N(\mu, \frac{1}{2}\sigma^2) & i \ne j \end{cases}$
with an infinitesimal $\sigma^2$, and let $x$ be the largest eigenvector of $G$, $Gx = \lambda x$. Then $p(x \ge 0)$ is lower-bounded by an expression of the form $\left(1 - \Phi_{[0,\,\sigma^2/n]}(\cdot)\right)^n$, where $\Phi_{[\mu,\,\sigma^2/n]}$ denotes the cumulative distribution function of $N(\mu, \sigma^2/n)$.
[Plot: empirical probability vs. the bound.]
Proposition 4 (with O. Zeitouni and M. Ben-Or):
$p(x \ge 0) \to 1$ as $n \to \infty$ when $\mu > 0$, for any value of $\sigma$.
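An empirical illustration of this limit (my own check, not an experiment from the talk): draw symmetric Gaussian matrices with a positive off-diagonal mean and measure how often the top eigenvector has a single sign.

    import numpy as np

    def prob_same_sign(n, mu=0.2, sigma=1.0, trials=500, seed=0):
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(trials):
            G = rng.normal(mu, sigma, size=(n, n))
            G = (G + G.T) / 2                     # symmetrize
            x = np.linalg.eigh(G)[1][:, -1]       # top eigenvector
            hits += (x >= 0).all() or (x <= 0).all()
        return hits / trials

    for n in (10, 50, 200):
        print(n, prob_same_sign(n))   # the fraction should approach 1 as n grows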
Sparsity Gap
Definition: Let $0 < p < 1$ be the fraction of relevant features and $q = 1 - p$. Let
$G = \begin{pmatrix} A & B \\ B^T & C \end{pmatrix}$
where the entries of the $np \times np$ block $A$ are drawn from $N(\mu_a, \sigma^2)$ and the entries of the blocks $B$ and $C_{nq \times nq}$ are drawn from $N(\mu_b, \sigma^2)$, with $\sigma^2 = \frac{1}{6}$.
Let $x = (x_1, x_2)^T$ be the largest eigenvector of $G$, where $x_1$ holds the first $np$ entries and $x_2$ holds the remaining $nq$ entries.
The sparsity gap corresponding to $G$ is the ratio $\bar{x}_1 / \bar{x}_2$, where $\bar{x}_1$ is the mean of $x_1$ and $\bar{x}_2$ is the mean of $x_2$.
Sparsity Gap
Proposition 4: Let
$\bar{G} = \begin{pmatrix} np\,\mu_a & nq\,\mu_b \\ np\,\mu_b & nq\,\mu_b \end{pmatrix}$
(a $2 \times 2$ matrix) and let $x = (x_1, x_2)^T$ be the largest eigenvector of $\bar{G}$. The entries of the two blocks of the largest eigenvector of $G$ concentrate around $x_1$ and $x_2$, with fluctuations distributed as $N(0, \frac{\sigma^2}{np})$ and $N(0, \frac{\sigma^2}{nq})$ respectively, and the sparsity gap corresponding to $G$ is $\frac{x_1}{x_2}$.
Example: for $\mu_a = 0.85$, $\mu_b = \frac{1}{6}$, $n = 100$:
$\frac{x_1}{x_2} = \frac{61p - 10 + \sqrt{3321 p^2 - 820 p + 100}}{20 p}$
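A small numeric sanity check of this example (my own, illustrative): compute the largest eigenvector of the $2 \times 2$ matrix directly and compare its component ratio with the closed-form expression.

    import numpy as np

    mu_a, mu_b, n = 0.85, 1.0 / 6.0, 100

    def gap_from_eigenvector(p):
        q = 1.0 - p
        Gbar = np.array([[n * p * mu_a, n * q * mu_b],
                         [n * p * mu_b, n * q * mu_b]])
        w, V = np.linalg.eig(Gbar)               # Gbar is not symmetric in general
        x = V[:, np.argmax(w.real)].real
        return x[0] / x[1]

    def gap_from_formula(p):
        return (61 * p - 10 + np.sqrt(3321 * p**2 - 820 * p + 100)) / (20 * p)

    for p in (0.1, 0.3, 0.5):
        print(p, gap_from_eigenvector(p), gap_from_formula(p))   # the two values should agree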
The feature selection for SVM benchmark
• Two synthetic data sets were proposed in a paper by Weston,
Mukherjee, Chapelle, Pontil, Poggio and Vapnik from NIPS 2001.
• The data sets were designed to have a few features which are separable by an SVM, combined with many irrelevant features.
• There were 6 relevant features and 196 irrelevant ones.
The linear dataset
• The linear data set is almost linearly separable once the correct features are recovered.
• The data sets were designed for the labeled case.
• With probability 0.7 the data is almost separable by the first 3 relevant features and non-separable by the remaining 3 relevant features; with probability 0.3 the second group of relevant features is the separable one. The remaining 196 features were drawn from N(0, 20).
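A sketch of a generator for this linear data set, under my reading of the description above (the class-conditional means of the informative features below are assumptions, since the slide does not specify them; N(0, 20) is treated as mean/standard deviation):

    import numpy as np

    def linear_dataset(n_samples=100, seed=0):
        rng = np.random.default_rng(seed)
        y = rng.choice([-1, 1], size=n_samples)
        X = np.empty((n_samples, 202))            # 6 relevant + 196 irrelevant features
        for i, label in enumerate(y):
            first_group = rng.random() < 0.7      # which group of 3 features carries the label
            informative = label * rng.normal([1.0, 2.0, 3.0], 1.0)   # assumed class-dependent means
            noise3 = rng.normal(0.0, 1.0, 3)
            X[i, :3], X[i, 3:6] = (informative, noise3) if first_group else (noise3, informative)
            X[i, 6:] = rng.normal(0.0, 20.0, 196) # remaining 196 features ~ N(0, 20)
        return X, y

    X, y = linear_dataset()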
Results – linear data set
Results – non-linear data set
The unsupervised algorithm started to be effective only from 80 data points and up, and is not shown here.
There are two species of frogs in this figure:
Green frog (Rana clamitans)
American toad
Automatic separation
• We use small patches as basic features.
• In order to compare patches we use the L1 norm on their color histograms:
  similarity(patch$_1$, patch$_2$) = $e^{-\|h_1 - h_2\|_1}$, where $h_1, h_2$ are the color histograms of the two patches.
The matrix A: many features over ~40 images
The similarity between an image A and a patch B is the maximum over all similarities between the patches p in image A and the patch B:
  similarity(A, B) = $\max_{p \in A}$ similarity(p, B)
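A minimal sketch of these two similarity measures (illustrative code of mine; the histogram binning and the way patches are supplied are assumptions, not details from the talk):

    import numpy as np

    def color_hist(patch, bins=8):
        # normalized per-channel color histogram of an RGB patch (H x W x 3)
        h = [np.histogram(patch[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
        h = np.concatenate(h).astype(float)
        return h / h.sum()

    def patch_similarity(p1, p2):
        # similarity(p1, p2) = exp(-||hist(p1) - hist(p2)||_1)
        return np.exp(-np.abs(color_hist(p1) - color_hist(p2)).sum())

    def image_patch_similarity(image_patches, patch):
        # similarity(image, patch) = max over all patches p in the image
        return max(patch_similarity(p, patch) for p in image_patches)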
Selected features
Green frog (Rana clamitans)
American toad
Using these features the clustering was correct on 80% of the samples – compared to 55% correct clustering using conventional spectral clustering.
Another example
elephant
sea-elephant
Using these features the clustering was correct on 90% of the samples – compared to 65% correct clustering using conventional spectral clustering.
Genomics
Goal: recognizing the relevant genes that separate between cells with different biological characteristics (normal vs. tumor, different subclasses of tumor cells).
• Classification of Tissue Samples (type of
Cancer, Normal vs. Tumor)
• Find Novel Subclasses (unsupervised)
• Find Genes responsible for classification
(new insights for drug design).
The microarray technology provides
many measurements of gene
expressions for different sample
tissues.
Few samples (~50) and large dimension
(~10,000)
The synthetic dataset of Ben-Dor, Friedman, and Yakhini
• The model consists of 6 parameters:

  Parameter | Description | Leukemia
  a | # class A samples | 25
  b | # class B samples | 47
  m | # features | 600
  e, (1-e) | % irrelevant / relevant | 72%, 28%
  (3d) | size of the interval of means | d = 555
  s | std coefficient | 0.75

• A relevant feature is sampled from $N(\mu_A, s\mu_A)$ or $N(\mu_B, s\mu_B)$, where the means $\mu_A, \mu_B$ of the classes A, B are sampled uniformly from $[-1.5d, 1.5d]$.
• An irrelevant feature is sampled from $N(0, s)$.
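A sketch of a generator for this synthetic model (my own reading of the slide; in particular, treating the second argument of $N(\cdot,\cdot)$ as a standard deviation is an assumption):

    import numpy as np

    def ben_dor_synthetic(a=25, b=47, m=600, e=0.72, d=555, s=0.75, seed=0):
        rng = np.random.default_rng(seed)
        n_relevant = int(round((1 - e) * m))
        X = rng.normal(0.0, s, size=(m, a + b))          # irrelevant features ~ N(0, s)
        labels = np.array([0] * a + [1] * b)
        for i in range(n_relevant):                      # overwrite the first n_relevant rows
            mu_A, mu_B = rng.uniform(-1.5 * d, 1.5 * d, size=2)
            X[i, labels == 0] = rng.normal(mu_A, abs(s * mu_A), size=a)
            X[i, labels == 1] = rng.normal(mu_B, abs(s * mu_B), size=b)
        return X, labels

    X, y = ben_dor_synthetic()
    print(X.shape)   # (600, 72): m features over a + b tissue samples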
The synthetic dataset of Ben-Dor, Friedman, and Yakhini
• Results of simulations done by varying one parameter out of m, d, e, s:

  Param. | Description | Leuk. | MSA | Q-α | Remarks
  a | # class A samples | 25 | | |
  b | # class B samples | 47 | | |
  m | # features | 432 | <250 | <5 | MSA uses redundancy
  e | % irrelevant features | 168 | >95% | >99.5% | Easy data set
  d | Size of interval | 555 | [1,1000] | At least [1,1000] | Data is normalized
  s | Spread | 0.75 | <2 | <1000 | MSA needs good separation

• MSA – the max-surprise algorithm of Ben-Dor, Friedman, and Yakhini.
Shashua & Wolf, ECCV’04
Follow Up Work
Feature selection with “side” information: given the “main” data $M$ and the “side” data $W$, find weights $\alpha = (\alpha_1, \ldots, \alpha_n)^T$ such that
$\sum_{i=1}^n \alpha_i\, m_i m_i^T$ has coherent $k$ clusters, and
$\sum_{i=1}^n \alpha_i\, w_i w_i^T$ has low cluster coherence (a single cluster).
Shashua & Wolf, ECCV’04
Follow Up Work
“Kernelizing” Q: map $m_i \to \phi(m_i)$, where $\phi$ is a high-dimensional mapping satisfying $\phi(m_i)^T \phi(m_j) = k(m_i, m_j)$, and
$A_\alpha = \sum_{i=1}^n \alpha_i\, \phi(m_i) \phi(m_i)^T.$
Rather than having inner-products we have outer-products.
END