ICCV & CVPR paper reading


ICCV & CVPR paper reading
池晨@jdl.ac.cn
2009.11.27
CVPR09, #2128: Recognizing Indoor Scenes

Recognizing Indoor Scenes
Ariadna Quattoni & Antonio Torralba
• A. Quattoni, X. Carreras, M. Collins, T. Darrell, An Efficient Projection for L1,Infinity Regularization, ICML 2009.
• A. Quattoni, A. Torralba, Recognizing Indoor Scenes, CVPR 2009.
• A. Quattoni, M. Collins, T. Darrell, Transfer Learning for Image Classification with Sparse Prototype Representations, CVPR 2008.
• A. Quattoni, M. Collins, T. Darrell, Learning Visual Representations using Images with Captions, CVPR 2007.
• A. Quattoni, S. Wang, L.P. Morency, M. Collins, and T. Darrell, Hidden-state Conditional Random Fields, IEEE PAMI, 2007.
Ariadna Quattoni, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Recognizing Indoor Scenes
Ariadna Quattoni & Antonio Torralba
• L.P. Morency, A. Quattoni, T. Darrell, Latent-Dynamic Discriminative Models for Continuous Gesture Recognition, CVPR 2007.
• S. Wang, A. Quattoni, L.P. Morency, D. Demirdjian, T. Darrell, Hidden Conditional Random Fields for Gesture Recognition, CVPR 2006.
• A. Quattoni, M. Collins, T. Darrell, Incorporating Semantic Constraints into a Discriminative Categorization and Labeling Model, Workshop on Semantic Knowledge in Vision, ICCV, 2005.
• A. Quattoni, M. Collins, and T. Darrell, Conditional Random Fields for Object Recognition, In Proceedings of NIPS, 2004.
Ariadna Quattoni, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Recognizing Indoor Scenes
Ariadna Quattoni & Antonio Torralba
Research Interests
• Computer vision
• Machine learning
• Human visual perception
• Scene and object recognition
Antonio Torralba, Associate Professor, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Recognizing Indoor Scenes
Ariadna Quattoni & Antonio Torralba
• LabelMe: online image annotation and applications. A. Torralba, B. C. Russell, and J. Yuen, MIT CSAIL Technical Report, 2009.
• How many pixels make an image? A. Torralba, Visual Neuroscience, volume 26, issue 01, pp. 123-131, 2009.
• Small codes and large databases for recognition. A. Torralba, R. Fergus, Y. Weiss, CVPR, 2008.
• 80 million tiny images: a large dataset for non-parametric object and scene recognition. A. Torralba, R. Fergus, W. T. Freeman, IEEE Transactions on PAMI, vol. 30(11), pp. 1958-1970, 2008.
• Sharing visual features for multiclass and multiview object detection. A. Torralba, K. P. Murphy and W. T. Freeman, PAMI, 2007.
Antonio Torralba, Associate Professor, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
?
Most scene recognition models that work well for outdoor
scenes perform poorly in the indoor domain.
Fig. 1. Comparison of spatial SIFT and Gist features for a scene recognition task. Both sets of features show a strong correlation in performance across the 15 scene categories. Average performance for the different features: Gist: 73.0%, pyramid matching: 73.4%, bag of words: 64.1%, and color pixels (SSD): 30.6%. In all cases an SVM is used.
Abstract
• Indoor scene recognition is a challenging open problem.
• Should indoor scenes be characterized by their global spatial properties or by the objects they contain?
• A prototype-based model that can successfully combine both sources of information.
• A dataset of 67 indoor scene categories.
• Good results on this dataset.
What is ‘a prototype based model’?
A prototype image is a segmented image annotated with ROIs (Regions of Interest).

A Prototype Based Model
For each scene category, a set of prototype images is selected: $S = \{T_1, T_2, \ldots, T_k\}$.
Each prototype $T_p$ is annotated with a set of ROIs: $\{t_1, t_2, \ldots, t_{m_k}\}$.
[Figure: a prototype image T with its ROIs (ROI 1, ROI 2, ROI 3, ROI 5, ..., ROI m_k), providing both global spatial properties and contained-object information.]
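To make this structure concrete, here is a minimal sketch (not the authors' code) of how a prototype and its ROIs could be represented; the field names and types are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class ROI:
    """A region of interest inside a prototype image (illustrative fields)."""
    bbox: tuple            # (x, y, width, height) in the prototype image
    histogram: np.ndarray  # visual-word histogram describing the region

@dataclass
class Prototype:
    """A segmented prototype image for one scene category."""
    gist: np.ndarray               # global spatial descriptor of the whole image
    rois: List[ROI] = field(default_factory=list)

# Each scene category keeps its own set of prototypes: S = {T_1, ..., T_k}
prototypes_by_category = {"kitchen": [], "bedroom": []}
```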
How does it work?

Image Descriptor
How to represent global spatial properties? — Using the Gist descriptor.
How to represent each ROI? — Using a spatial pyramid of visual words.
[Figure: prototype image T with its ROIs; the Gist descriptor captures the global spatial properties, the ROI descriptors capture the contained objects.]
Gist (1/2)
The original image is decomposed by a bank of multiscale oriented filters (several scales and orientations). The magnitude of each filter output is taken, and the local average response is computed over 4x4 windows. The sampled filter outputs are then reduced with PCA to obtain the Gist feature.
Gist (2/2)
The Gist feature coarsely encodes the edge and texture information of the original image.
[Figure. Top row: original images. Bottom row: noise images coerced to have the same global features (N=64) as the target image.]
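As a rough illustration (not the authors' implementation), a Gist-like descriptor can be sketched with Gabor filters, 4x4 spatial averaging, and PCA; the filter frequencies, grid size, and number of PCA components below are arbitrary assumptions.

```python
import numpy as np
from skimage.filters import gabor
from sklearn.decomposition import PCA

def gist_like(image, frequencies=(0.1, 0.2, 0.3), n_orient=4, grid=4):
    """Magnitude of multiscale oriented filter outputs, averaged on a grid x grid mesh."""
    h, w = image.shape
    feats = []
    for f in frequencies:                      # scales
        for k in range(n_orient):              # orientations
            real, imag = gabor(image, frequency=f, theta=k * np.pi / n_orient)
            mag = np.hypot(real, imag)         # magnitude of the filter output
            for i in range(grid):              # local average over grid cells
                for j in range(grid):
                    block = mag[i * h // grid:(i + 1) * h // grid,
                                j * w // grid:(j + 1) * w // grid]
                    feats.append(block.mean())
    return np.array(feats)

# Collect descriptors for a set of grayscale images, then compress with PCA.
images = [np.random.rand(64, 64) for _ in range(20)]   # stand-in images
X = np.stack([gist_like(im) for im in images])
gist_features = PCA(n_components=10).fit_transform(X)
```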
Image Descriptor
How to represent global spatial properties? — Using the Gist descriptor.
How to represent each ROI? — Using a spatial pyramid of visual words.
ROI Descriptor
Each ROI is represented with a spatial pyramid of visual words. The visual words are obtained by vector-quantizing SIFT descriptors with K-means applied to a random subset of images.
[Figure: the color of each pixel represents the visual word to which it was assigned.]
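A minimal sketch of this idea (not the paper's code): K-means builds the vocabulary from pooled local descriptors, and a two-level spatial pyramid of visual-word histograms describes a region. The descriptor source, vocabulary size, and pyramid depth are assumptions here.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, n_words=200):
    """Vector-quantize local descriptors (e.g. SIFT) into visual words."""
    return KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(descriptors)

def pyramid_histogram(points, words, vocab_size, region, levels=2):
    """Spatial pyramid of visual-word histograms over a rectangular region."""
    x0, y0, w, h = region
    hist = []
    for level in range(levels):
        cells = 2 ** level
        for i in range(cells):
            for j in range(cells):
                in_cell = ((points[:, 0] >= x0 + j * w / cells) &
                           (points[:, 0] <  x0 + (j + 1) * w / cells) &
                           (points[:, 1] >= y0 + i * h / cells) &
                           (points[:, 1] <  y0 + (i + 1) * h / cells))
                hist.append(np.bincount(words[in_cell], minlength=vocab_size))
    return np.concatenate(hist).astype(float)

# Usage with random stand-in data: 1000 128-d descriptors at random locations.
desc = np.random.rand(1000, 128)
pts = np.random.rand(1000, 2) * 256
vocab = build_vocabulary(desc, n_words=50)
words = vocab.predict(desc)
roi_hist = pyramid_histogram(pts, words, vocab_size=50, region=(0, 0, 256, 256))
```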
Image Descriptor
How to represent global spatial properties? — Using the Gist descriptor.
How to represent each ROI? — Using a spatial pyramid of visual words.
Model Formulation
Given:
A training set of n pairs of labeled images: $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
A set of p segmented images, which we call prototypes: $S = \{T_1, T_2, \ldots, T_p\}$
Goal:
To use D and S to learn a mapping h : X → R.
Model Formulation
Contained object information
The mapping should capture the fact that images containing similar objects should have similar scene labels, and that some objects are more important than others in defining a scene's identity.
$$f_{kj}(x) = \min_{s} d(t_{kj}, x_s)$$
where $t_{kj}$ is the j-th ROI of the k-th prototype image and $x_s$ is the segment of image x most similar to $t_{kj}$. Distances between two regions are computed using histogram intersection.
Searching Strategy
Given a new image, how do we find the regions that are similar to the ROIs of a given prototype image T?
Histogram intersection function:
$$D(H_{x_s}, H_{kj}) = \sum_{i=1} \min\big(H_{x_s}(i), H_{kj}(i)\big)$$
The search is restricted to a small window around the ROI's original location in the prototype image T.
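A small sketch of this matching step under simplifying assumptions (the candidate regions are assumed to be given already as histograms sampled around the ROI's location; this is not the paper's exact search procedure):

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Similarity of two same-length histograms: sum of bin-wise minima."""
    return np.minimum(h1, h2).sum()

def best_match(roi_hist, candidate_hists):
    """Index and score of the candidate region most similar to the ROI.

    candidate_hists would hold histograms of regions sampled from a small
    window around the ROI's original location (an assumption in this sketch).
    """
    scores = [histogram_intersection(roi_hist, c) for c in candidate_hists]
    best = int(np.argmax(scores))
    return best, scores[best]

# Usage with random stand-in histograms.
roi = np.random.rand(50)
candidates = [np.random.rand(50) for _ in range(10)]
idx, score = best_match(roi, candidates)
```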
Searching Strategy
Figure 5. Example of detection of similar image patches. The top three images correspond to the query patterns; for each image, the algorithm tries to detect the selected region. The next three rows show the top three matches for each region, and the last row shows the three worst matching regions.
Model Formulation
Global spatial information
For some scene categories, global image information can be very important.
$$g_k(x) = \left\| \mathrm{Gist}(x) - \mathrm{Gist}(T_k) \right\|_2$$
The global information is computed as the L2 norm between the Gist representation of image x and the Gist representation of prototype k.
Model Formulation
Parameters
$$h(x) = \sum_{k=1}^{p} \beta_k \exp\Big(-\sum_{j=1}^{m_k} \lambda_{kj} f_{kj}(x) - \lambda_{kG}\, g_k(x)\Big)$$
$\beta_k$: how relevant the similarity to prototype k is for predicting the scene label.
$\lambda_{kj}$: captures the importance of a particular ROI inside a given prototype.
$\lambda_{kG}$: the importance of the global features when considering the k-th prototype.
$f_{kj}(x)$ carries the contained object information; $g_k(x)$ carries the global spatial information.
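A direct numpy transcription of this scoring function, assuming the per-prototype quantities f_kj(x) and g_k(x) have already been computed (a sketch, not the authors' code):

```python
import numpy as np

def score(f, g, beta, lam, lam_g):
    """h(x) = sum_k beta_k * exp(-sum_j lam_kj * f_kj(x) - lam_kG * g_k(x)).

    f:     list of length p; f[k] is an array of ROI distances f_kj(x) for prototype k
    g:     array of length p with the Gist distances g_k(x)
    beta:  array of length p (prototype weights)
    lam:   list of length p; lam[k] weights the ROIs of prototype k
    lam_g: array of length p (weights of the global term)
    """
    total = 0.0
    for k in range(len(beta)):
        total += beta[k] * np.exp(-np.dot(lam[k], f[k]) - lam_g[k] * g[k])
    return total

# One-class score for a stand-in image against p = 3 prototypes, 4 ROIs each.
p = 3
f = [np.random.rand(4) for _ in range(p)]
g = np.random.rand(p)
h_x = score(f, g, beta=np.ones(p), lam=[np.ones(4)] * p, lam_g=np.ones(p))
```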
Model Formulation
Learning
How do we estimate the model parameters from a training set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$?
$$L(\beta, \lambda) = \sum_{i=1}^{n} l\big(h(x_i), y_i\big) + C_b \|\beta\|^2 + C_l \|\lambda\|^2$$
The loss function l measures the error that the classifier incurs on the training examples, and the regularization terms with constants $C_b$ and $C_l$ dictate the amount of regularization in the model.
Model Formulation
Learning
The model parameters are estimated from the training set D with a gradient-based method:
$$\frac{\partial L}{\partial \beta_k} = -\sum_{i \in \Delta} y_i \exp\Big(-\sum_{j=1}^{m_k} \lambda_{kj} f_{kj}(x_i)\Big) + \tfrac{1}{2} C_b \beta_k$$
$$\frac{\partial L}{\partial \lambda_{kj}} = \sum_{i \in \Delta} y_i\, \beta_k f_{kj}(x_i) \exp\Big(-\sum_{j=1}^{m_k} \lambda_{kj} f_{kj}(x_i)\Big) + \tfrac{1}{2} C_l \lambda_{kj}$$
Δ is the set of indices of examples in D that attain nonzero loss.
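To make the learning step concrete, here is a small sketch that evaluates the regularized loss of the model on synthetic data and minimizes it with a generic optimizer. The hinge loss, the data sizes, and the derivative-free optimizer are assumptions for illustration, not the paper's training procedure.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny synthetic problem: n images, p prototypes, m ROIs per prototype.
rng = np.random.default_rng(0)
n, p, m = 40, 3, 4
F = rng.random((n, p, m))          # F[i, k, j] = f_kj(x_i)
G = rng.random((n, p))             # G[i, k]    = g_k(x_i)
y = rng.choice([-1.0, 1.0], n)     # binary labels
Cb, Cl = 0.1, 0.1

def unpack(theta):
    beta = theta[:p]
    lam = theta[p:p + p * m].reshape(p, m)
    lam_g = theta[p + p * m:]
    return beta, lam, lam_g

def objective(theta):
    beta, lam, lam_g = unpack(theta)
    # h(x_i) = sum_k beta_k exp(-sum_j lam_kj f_kj(x_i) - lam_kG g_k(x_i))
    h = (beta * np.exp(-(F * lam).sum(axis=2) - lam_g * G)).sum(axis=1)
    hinge = np.maximum(0.0, 1.0 - y * h).sum()        # assumed hinge-type loss
    reg = Cb * np.sum(beta ** 2) + Cl * (np.sum(lam ** 2) + np.sum(lam_g ** 2))
    return hinge + reg

theta0 = np.full(p + p * m + p, 0.1)
result = minimize(objective, theta0, method="Powell")  # derivative-free, for the sketch
beta, lam, lam_g = unpack(result.x)
```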
Model Formulation
[Figure: example classifications; the number in parentheses is the classification confidence.]
$$h(x) = \sum_{k=1}^{p} \beta_k \exp\Big(-\sum_{j=1}^{m_k} \lambda_{kj} f_{kj}(x) - \lambda_{kG}\, g_k(x)\Big)$$
How is the performance?
Indoor Database
Figure 2. Summary of the 67 indoor scene categories used in our study. To facilitate seeing the variety of scene categories considered, they are organized into 5 big scene groups. The database contains 15620 images; all images have a minimum resolution of 200 pixels along the smallest axis.
• The largest indoor scene database available: 67 categories, 15620 images.
• More difficult: large in-class variability.
Compared with the state of the art:

Results (1/3)
Four different variations of the model were compared, crossing two choices: manually annotated ROIs vs. automatically segmented ROIs, and local features only vs. both local and global features.
Results (1/3)
Four different variations of the model.
• Both local and global information are useful for the indoor scene recognition task.
• Using automatic segmentations instead of manual segmentations causes only a small drop in performance.
Results (2/3)
Figure 7. The 67 indoor categories sorted by multiclass average precision (training with 80 images per class and testing on 20 images per class).
Results (3/3)
How is the performance of the proposed model affected by the number of prototypes used?
We observed a logarithmic growth of the average precision as a function of the number of prototypes.
Exploiting more prototypes might further improve the performance.
Conclusion (1/3)
A prototype-based model that combines global spatial properties and contained-object information.
[Figure: prototype image T with its ROIs; global spatial properties and contained objects are combined.]
Conclusion (2/3)
$$h(x) = \sum_{k=1}^{p} \beta_k \exp\Big(-\sum_{j=1}^{m_k} \lambda_{kj} f_{kj}(x) - \lambda_{kG}\, g_k(x)\Big)$$
The $f_{kj}(x)$ term carries the contained object information; the $g_k(x)$ term carries the global spatial information.
Conclusion (3/3)
$$h(x) = \sum_{k=1}^{p} \beta_k \exp\Big(-\sum_{j=1}^{m_k} \lambda_{kj} f_{kj}(x) - \lambda_{kG}\, g_k(x)\Big)$$
ICCV09, Learning to Predict Where Humans Look

Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba
Education Background
Massachusetts Institute of Technology, Cambridge, MA
• Ph.D. candidate in Computer Science (Graphics), expected graduation June 2010
• Master of Science, Computer Science, Jan 2007
• Bachelor of Science in Mathematics, June 2003
École Polytechnique, Palaiseau, France
• International Program, Computer Science Major, Sept 2003 to April 2004
Cambridge University, Cambridge, England
• Junior Year Abroad, Read Part IB Mathematics Tripos, Sept 2001 to June 2002
Tilke Judd, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba
Research Interests
• Computer Graphics
• Computational Photography
• Image Processing
• Perception
• Non-Photorealistic Rendering
Tilke Judd, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba
• Judd, T., Ehinger, K., Durand, F., Torralba, A. Learning to Predict Where People Look, ICCV 2009.
• Judd, T., Durand, F., Adelson, T. Apparent Ridges for Line Drawing. Proceedings of ACM SIGGRAPH 2007.
• Judd, Tilke. Apparent Ridges for Line Drawing. Masters Thesis, Computer Science, MIT, Jan 2007.
• Ju, W., R. Hurwitz, T. Judd, B. Lee. CounterActive: An Interactive Cookbook for the Kitchen Counter. Proceedings of SIGCHI 2001, Short Papers and Abstracts, Seattle WA, April 2001, p. 269.
• Ju, W., L. Bonanni, R. Fletcher, R. Hurwitz, T. Judd, J. Yoon, E.R. Post, M. Reynolds. Origami Desk. Exhibited SIGGRAPH 2001, Los Angeles CA. SIGGRAPH Conference Abstracts and Applications, August 2001, p. 280.
• Judd, Tilke. The JPEG Compression Algorithm. The MIT Undergraduate Mathematics Journal, Vol 5, p. 119.
Tilke Judd, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba
Education Background
• University of Edinburgh, Edinburgh, UK: 2007, B.Sc. Psychology
• California Institute of Technology, Pasadena, CA, USA: 2003, B.S. Engineering & Applied Science
Krista Ehinger, graduate student, Department of Brain & Cognitive Sciences at MIT
Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba
Education Background
• He received his PhD from Grenoble University, France, in 1999.
• From 1999 until 2002, he was a post-doc in the MIT Computer Graphics Group.
Frédo Durand, Associate Professor, Computer Graphics Group, CSAIL, MIT
Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba
Research Interests
• Synthetic image generation
• Computational photography
Frédo Durand, Associate Professor, Computer Graphics Group, CSAIL, MIT
Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba
• Co-organized the first Symposium on Computational Photography and Video in 2005.
• Co-organized the first International Conference on Computational Photography in 2009.
• Was on the advisory board of the Image and Meaning 2 conference.
• Received an inaugural Eurographics Young Researcher Award in 2004.
• Received an NSF CAREER award in 2005.
• Received an inaugural Microsoft Research New Faculty Fellowship in 2005.
• Received a Sloan fellowship in 2006.
• Received a Spira award for distinguished teaching in 2007.
Frédo Durand, Associate Professor, Computer Graphics Group, CSAIL, MIT
?
How can we understand where humans look in a scene without an eye tracker?
Figure 2. Current saliency models do not accurately predict human fixations. In row one, the low-level model selects bright spots of light as salient while viewers look at the human. In row two, the low-level model selects the building's strong edges and windows as salient while viewers fixate on the text.
Abstract
• For many applications in graphics, design, and human-computer interaction, it is essential to understand where humans look in a scene.
• Models of saliency can be used to predict fixation locations.
• A saliency model based on both top-down and bottom-up information.
• A large eye tracking database.
Database of Eye Tracking Data
15 viewers
1003 random images
Free viewing, 3 seconds per image
Recording the gaze path
Database of Eye Tracking Data
Collect each viewer's fixations and convolve a Gaussian filter across them, then average all viewers' data to obtain a continuous saliency map. Selecting the top n percent salient locations generates a binary saliency map.
[Figure: original image, fixation image, continuous saliency map, binary saliency map.]
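A small sketch of this construction (the image size, blur width, and threshold below are assumptions; this is not the authors' code):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def continuous_saliency(fixations_per_viewer, shape, sigma=25):
    """Blur each viewer's fixation points with a Gaussian, then average over viewers."""
    maps = []
    for fixations in fixations_per_viewer:
        m = np.zeros(shape)
        for x, y in fixations:                 # fixation coordinates (col, row)
            m[int(y), int(x)] = 1.0
        maps.append(gaussian_filter(m, sigma))
    return np.mean(maps, axis=0)

def binary_saliency(saliency, top_percent=10):
    """Keep the top n percent most salient locations."""
    thresh = np.percentile(saliency, 100 - top_percent)
    return saliency >= thresh

# Usage: two viewers with a few fixations each on a 200x300 image.
viewers = [[(50, 60), (120, 80)], [(55, 65), (250, 150)]]
cont = continuous_saliency(viewers, shape=(200, 300))
binary = binary_saliency(cont, top_percent=10)
```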
Analysis of Dataset
• For some images, all viewers fixate on the same locations, while in other images the viewers' fixations are dispersed all over the image.
• The fixations in the database have a strong bias towards the center.
• Fixations from the database are often on animals, cars, and human body parts like eyes and hands.
• There is a certain size for a region of interest (ROI) that a person fixates on.
How to use the analysis above?
Features Used for Machine Learning
Low-level features:
• Local energy of the steerable pyramid filters [3].
• Features used in a simple saliency model described by Torralba [1] and Rosenholtz [2].
• Orientation and color contrast.
• Values of the red, green and blue channels, as well as the probabilities of each of these channels as features [4].
• The probability of each color as computed from 3D color histograms of the image filtered with a median filter at 6 different scales.
Mid-level features:
• The location of the horizon.
Features Used for Machine Learning
High-level features:
• Running the Viola-Jones face detector [5] and the Felzenszwalb person detector [6].
Center prior:
• The distance of each pixel to the center (see the sketch below).
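For instance, the center-prior channel can be sketched as a simple distance-to-center map (the image size and normalization here are assumptions):

```python
import numpy as np

def center_prior(height, width):
    """Distance from each pixel to the image center, normalized to [0, 1]."""
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    dist = np.hypot(ys - cy, xs - cx)
    return dist / dist.max()

center_feature = center_prior(480, 640)   # one feature channel per pixel
```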
Features Used for Machine Learning
Fig 8. Features. A sample image (bottom right) and 33 of the features that we use to train the model.
How to use the eye data?

Features Used for Machine Learning
The binary saliency map is used to generate positively and negatively labeled pixels: positive labels come from the salient regions of the binary map, negative labels from outside them.
[Figure: original image, fixation image, continuous saliency map, binary saliency map.]
How is the performance?

Training
From each image's binary saliency map, 10 positively labeled pixels and 10 negatively labeled pixels are sampled.
903 training images (9030 positive and 9030 negative training samples) and 100 testing images, with a liblinear SVM.
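A minimal sketch of this training step with scikit-learn's liblinear-backed LinearSVC, assuming a precomputed per-pixel feature matrix (the random stand-in data and the regularization constant are assumptions, not the paper's setup):

```python
import numpy as np
from sklearn.svm import LinearSVC   # backed by liblinear

# Stand-in data: 9030 positive and 9030 negative pixel samples, 33 features each.
rng = np.random.default_rng(0)
X_pos = rng.random((9030, 33)) + 0.2   # offset so the toy classes are roughly separable
X_neg = rng.random((9030, 33))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])

clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)

# Predict a saliency score per pixel of a new image (features assumed given),
# then threshold the top 10 percent to obtain a binary saliency map.
new_pixels = rng.random((480 * 640, 33))
scores = clf.decision_function(new_pixels).reshape(480, 640)
binary_map = scores >= np.percentile(scores, 90)
```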
Testing
Figure 9. Comparison of saliency maps. Each row of images compares the predictions of our SVM saliency model, the Itti saliency map, the center prior, and the human ground truth, all thresholded to show the top 10 percent salient locations.
Performance On Testing Images
Figure 10. The ROC curves of performance for SVMs trained on each set of features individually and combined together. Human performance and chance are also plotted for comparison.
Application
Rendering more detail at the locations users fixate on and less detail in the rest of the image.
Conclusion (1/4)
Created a database containing true eye tracking data.
[Figure: original image, fixation image, continuous saliency map, binary saliency map.]
Conclusion (2/4)
Combined four kinds of features:
• Low-level features
• Mid-level features
• High-level features
• Center prior
Conclusion (3/4)
Compared the effect of each feature subset on the predicted saliency map.
Conclusion (4/4)
Gave an example of the model's application.