
Sufficient Dimensionality Reduction
with Irrelevance Statistics
Amir Globerson¹
Gal Chechik²
Naftali Tishby¹
¹ Center for Neural Computation and School of Computer Science and Engineering, The Hebrew University
² Robotics Lab, CS Department, Stanford University
• Goal: Find a simple representation of X which captures its relation to Y
[Diagram: X → ? → Y]
• Q1: Clustering is not always the most appropriate description (e.g. PCA vs. clustering)
• Q2: Data may contain many structures, and not all of them are relevant for a given task.
Talk layout
• Continuous feature extraction using SDR
• Using irrelevance data in unsupervised learning
• SDR with irrelevance data
• Applications
Continuous features
[Figure: papers vs. terms data, with "Theory" and "Applications" groups]
• What would be a simple representation of the papers?
• Clustering is not a good solution: content varies on a continuous scale.
• Look at the mean number of occurrences of words from the following groups:
  – figure, performance, improvement, empirical
  – equation, inequality, integral
• Better to look at weighted means (e.g. "figure" is only loosely related to results).
• The means give a continuous index reflecting the content of the document (a toy sketch follows below).
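A minimal NumPy sketch of this idea (the word lists, counts, and weights below are hypothetical illustrations, not the talk's data): each document's counts are normalized and collapsed into one weighted mean per word group, giving a continuous content index.

```python
import numpy as np

# Hypothetical vocabulary and per-document word counts (documents x words).
vocab = ["figure", "performance", "improvement", "empirical",
         "equation", "inequality", "integral"]
counts = np.array([[12, 5, 3, 4, 1, 0, 0],    # an applications-flavored paper
                   [1, 0, 1, 0, 9, 4, 6]])    # a theory-flavored paper

# Weights within each group: "figure" gets a low weight since it is
# only loosely related to experimental results.
w_applications = np.array([0.3, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
w_theory       = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Normalize counts into p(word | doc), then take weighted means:
# one continuous index per document per group.
p_word_given_doc = counts / counts.sum(axis=1, keepdims=True)
print(p_word_given_doc @ w_applications)   # "applications" index per document
print(p_word_given_doc @ w_theory)         # "theory" index per document
```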
Information in Expectations
• Represent p(X|y) via the expected value of some function $\phi(X): X \to \mathbb{R}^d$
• Look at $\langle \phi(X) \rangle_{p(X|y)} = \sum_x \phi(x)\, p(x|y)$
• This gives a set of |Y| vectors, one representing each p(X|y) (a numeric sketch follows below)
[Figure: the conditional distributions p(x1|y), ..., p(xn|y) are mapped to the expectations <φ1(X)>, <φ2(X)> under p(X|y)]
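A minimal numeric sketch of this representation (the conditional table and feature map φ below are hypothetical; NumPy assumed): each conditional p(X|y) is collapsed into the d expected feature values ⟨φ(X)⟩ under p(X|y).

```python
import numpy as np

# Hypothetical conditional distributions: column j holds p(X | y_j).
p_x_given_y = np.array([[0.5, 0.1],
                        [0.3, 0.2],
                        [0.2, 0.7]])      # |X| = 3, |Y| = 2

# A feature map phi: X -> R^d with d = 2, one row per value of X.
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])

# <phi(X)>_{p(X|y)} = sum_x phi(x) p(x|y): one d-dimensional vector per y.
expectations = phi.T @ p_x_given_y        # shape (d, |Y|)
print(expectations)
```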
Examples
• A weighted sum of word counts in a document can be informative about its content
• Weighted grey levels in specific image areas may reveal the image's identity
• Mean expression levels can reveal tissue identity
• But what are the best features to use?
• We need a measure of the information in expected values
Quantifying information in expectations
• Possible measures?
  – $I(\phi(X); Y)$: equals $I(X;Y)$ for any one-to-one $\phi(x)$
  – $I\!\left(\tfrac{1}{n}\sum_{i=1}^{n} \phi(x_i);\, Y\right)$: goes to $H(Y)$ as $n$ grows
• We want to capture only the information carried by the expected values
• Consider all distributions which have the given expected values, and choose the least informative one
Quantifying information in expectations
• Define the set of distributions which agree with p on the expected values of φ and on the marginals:

$$\tilde{P}(\phi, p) = \left\{ \tilde{p}(x,y) \;:\; \langle \phi(x) \rangle_{\tilde{p}(x|y)} = \langle \phi(x) \rangle_{p(x|y)},\;\; \tilde{p}(x) = p(x),\;\; \tilde{p}(y) = p(y) \right\}$$

• We define the information in measuring φ(x) on p(x,y) as (a short derivation sketch follows below)

$$I_M(\phi, p) = \min_{\tilde{p} \in \tilde{P}(\phi, p)} I(\tilde{p})$$
Sufficient Dimensionality Reduction (SDR)
• Find φ(x) which maximizes $I_M(\phi, p)$
• Equivalent to finding the maximum likelihood parameters for

$$p(x, y) = \frac{1}{Z} \exp\left( \sum_{i=1}^{d} \phi_i(x)\, \psi_i(y) + A(x) + B(y) \right)$$

• Can be done using an iterative projection algorithm (GT, JMLR 03); a simplified fitting sketch follows after this slide
• Produces useful features for document analysis
• But what if p(x,y) contains many structures?
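The slide cites the authors' iterative projection algorithm (GT, JMLR 03); the sketch below is not that algorithm but a simpler stand-in that fits the same exponential-form model by plain gradient ascent on the expected log-likelihood (NumPy; the toy joint distribution, feature dimension d = 1, step size, and iteration count are all hypothetical choices).

```python
import numpy as np

def fit_exponential_form(p_xy, d=1, steps=2000, lr=0.5, seed=0):
    """Fit q(x,y) = (1/Z) exp(sum_i phi_i(x) psi_i(y) + A(x) + B(y))
    by gradient ascent on sum_{x,y} p(x,y) log q(x,y).
    A simplified stand-in for the iterative projection algorithm."""
    rng = np.random.default_rng(seed)
    nx, ny = p_xy.shape
    phi = rng.normal(scale=0.1, size=(nx, d))
    psi = rng.normal(scale=0.1, size=(ny, d))
    A, B = np.zeros(nx), np.zeros(ny)
    for _ in range(steps):
        logits = phi @ psi.T + A[:, None] + B[None, :]
        q = np.exp(logits - logits.max())
        q /= q.sum()                        # model distribution q(x,y)
        diff = p_xy - q                     # d(log-likelihood)/d(logits)
        g_phi, g_psi = diff @ psi, diff.T @ phi
        g_A, g_B = diff.sum(axis=1), diff.sum(axis=0)
        phi += lr * g_phi
        psi += lr * g_psi
        A += lr * g_A
        B += lr * g_B
    return phi, psi, A, B

# Toy joint distribution over |X| = 4, |Y| = 3.
p_xy = np.array([[0.10, 0.05, 0.02],
                 [0.08, 0.10, 0.05],
                 [0.05, 0.10, 0.15],
                 [0.02, 0.08, 0.20]])
phi, psi, A, B = fit_exponential_form(p_xy, d=1)
print(phi.ravel())   # the learned one-dimensional feature phi(x)
```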
Talk layout
• Feature extraction using SDR
• Irrelevance data in unsupervised learning
• Enhancement of SDR with irrelevance data
• Applications
Relevant and Irrelevant Structures
• Data may contain structures we don't want to learn. For example:
  – Face recognition: face geometry is important, illumination is not.
  – Speech recognition: the spectral envelope is important, pitch is not (in English).
  – Document classification: content is important, style is not.
  – Gene classification: a given gene may be involved in pathological as well as normal pathways.
• Relevance is not absolute; it is task dependent.
Irrelevance Data
• Data sets which contain only irrelevant structures are often available (Chechik and Tishby, NIPS 2002)
  – Images of one person under different illumination conditions
  – Recordings of one word uttered in different intonations
  – Documents of similar content but different styles
  – Gene expression patterns from healthy tissues
• Find features which avoid the irrelevant structures
Learning with Irrelevance Data
Main data (D+), irrelevance data (D-)
• Given a model of the data f, Q(f, D) is some quantifier of the goodness of feature f on the dataset D (e.g. likelihood, information)
• We want to find $\max_f \; Q(f, D^+) - Q(f, D^-)$
• Has been demonstrated successfully (CT, 2002) for the case where
  – F = p(T|X), soft clustering
  – Q(F, Y) = I(T; Y)
• The principle is general and can be applied to any modeling scheme
Talk layout
• Information in expectations (SDR)
• Irrelevance data in unsupervised learning
• Enhancement of SDR with irrelevance data
• Applications
Adding Irrelevance Statistics to SDR
• Using $I_M(\phi, p)$ as our goodness-of-feature quantifier, we can use two distributions: a relevant $p^+(X, Y^+)$ and an irrelevant $p^-(X, Y^-)$
• The optimal feature is then

$$\phi^*(x) = \arg\max_\phi L(\phi), \qquad L(\phi) = I_M(\phi, p^+) - \lambda\, I_M(\phi, p^-)$$

• For λ = 0 we have SDR
Calculating *(x)
• When =0, an iterative algorithm can be devised
(Globerson and Tishby 02)
• Otherwise, the gradient of L() can be calculated
and ascended

L
 p  ( x)   ( y )

p  ( y| x )
   ( y)
p  ( y| x )


  p  ( x)   ( y)
p  ( y| x )
   ( y)
p  ( y| x )

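A minimal NumPy sketch of this gradient ascent, under my own simplifying assumptions: the feature is one-dimensional, the inner minimum-information model for each distribution is re-fit from scratch at every outer step (much cruder than the authors' procedure), and the toy data, λ, and step sizes are hypothetical.

```python
import numpy as np

def fit_dual(p_xy, phi, steps=500, lr=0.5):
    """Given a fixed 1-D feature phi(x), fit psi(y), A(x), B(y) of
    q(x,y) ~ exp(phi(x) psi(y) + A(x) + B(y)) by likelihood ascent;
    q plays the role of the minimum-information model p-tilde."""
    nx, ny = p_xy.shape
    psi, A, B = np.zeros(ny), np.zeros(nx), np.zeros(ny)
    for _ in range(steps):
        logits = np.outer(phi, psi) + A[:, None] + B[None, :]
        q = np.exp(logits - logits.max())
        q /= q.sum()
        diff = p_xy - q
        psi += lr * diff.T @ phi
        A += lr * diff.sum(axis=1)
        B += lr * diff.sum(axis=0)
    return psi, q

def im_gradient(p_xy, phi):
    """One term of the gradient: p(x) [ <psi(y)>_{p(y|x)} - <psi(y)>_{q(y|x)} ]."""
    psi, q = fit_dual(p_xy, phi)
    p_x, q_x = p_xy.sum(axis=1), q.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    q_y_given_x = q / q_x[:, None]
    return p_x * ((p_y_given_x - q_y_given_x) @ psi)

# Hypothetical relevant (p_plus) and irrelevant (p_minus) joint distributions.
p_plus  = np.array([[0.20, 0.05], [0.15, 0.10], [0.05, 0.20], [0.02, 0.23]])
p_minus = np.array([[0.10, 0.15], [0.15, 0.10], [0.10, 0.15], [0.15, 0.10]])
lam, lr = 1.0, 0.1
phi = np.random.default_rng(0).normal(scale=0.1, size=p_plus.shape[0])

# Ascend L(phi) = I_M(phi, p+) - lambda * I_M(phi, p-).
for _ in range(30):
    phi += lr * (im_gradient(p_plus, phi) - lam * im_gradient(p_minus, phi))
print(phi)   # the learned feature, emphasizing relevant structure
```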
Synthetic Example
[Figures: the relevant data D+ and irrelevant data D-, and the learned feature φ(x) for λ = 0 and λ = 1, showing phase transitions]
Talk layout
• Feature extraction using SDR
• Irrelevance data in unsupervised learning
• SDR with irrelevance data
• Applications
Converting Images into Distributions
[Figure: an image y is represented as a normalized distribution p(X|y) over its pixels X]
Extracting a single feature
• The AR dataset consists of images of 50 men and 50 women under different illumination conditions and postures
• We took the following distributions:
  – Relevant: 50 men under two illumination conditions (right and left)
  – Irrelevant: 50 women under the same illumination conditions
  – Expected feature: discriminates between the men, but not between the illumination conditions
Results for a single feature
Face clustering task
• We took 5 men, each with 26 different postures
• The task is to cluster the images according to identity
• We took 26 images of another man as irrelevance data
• We performed dimensionality reduction using several methods (PCA, OPCA, CPCA and SDR-IS) and measured precision on the reduced data (a toy sketch of this evaluation follows below)
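As a rough illustration of this evaluation pipeline (with synthetic stand-in images, since the AR data are not included here; NumPy and scikit-learn assumed, and the precision score below is one common cluster-purity definition rather than necessarily the one used in the talk):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 5 "identities" x 26 "postures",
# each image flattened to a 200-dimensional vector.
n_ids, n_poses, dim = 5, 26, 200
centers = rng.normal(size=(n_ids, dim))
images = np.vstack([c + 0.5 * rng.normal(size=(n_poses, dim)) for c in centers])
labels = np.repeat(np.arange(n_ids), n_poses)

# Reduce dimensionality (PCA here; OPCA/CPCA/SDR-IS would replace this step),
# then cluster the reduced data into one cluster per identity.
reduced = PCA(n_components=4).fit_transform(images)
clusters = KMeans(n_clusters=n_ids, n_init=10, random_state=0).fit_predict(reduced)

# One common precision score: assign each cluster its majority identity
# and average, over clusters, the fraction of images that agree with it.
precision = np.mean([
    np.mean(labels[clusters == c] == np.bincount(labels[clusters == c]).argmax())
    for c in np.unique(clusters)
])
print(f"precision: {precision:.2f}")
```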
Precision results
Conclusions
• Presented a method for feature extraction based on expected values of functions of X
• Showed how it can be augmented to avoid irrelevant structures
• Future work:
  – Eliminate the dependence on the dimension of Y via compression constraints
  – Extend to the multivariate case (graphical models)