photo_ret - Princeton University

Using Large-Scale Web Data to
Facilitate Textual Query
Based Retrieval of Consumer Photos
Yiming Liu, Dong Xu, Ivor W. Tsang, Jiebo Luo
Nanyang Technological University & Kodak Research Lab
Motivation
• Digital cameras and mobile phone cameras are rapidly
becoming popular:
– More and more personal photos;
– Retrieving images from enormous collections of personal
photos becomes an important topic.
How to retrieve?
Previous Work
• Content-Based Image Retrieval (CBIR)
– Users provide images as queries to retrieve personal photos.
• The paramount challenge: the semantic gap
– The gap between the low-level visual features and the
high-level semantic concepts.
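[Figure: CBIR pipeline. The query image, which carries a high-level concept, is reduced to a low-level feature vector and compared with the feature vectors in the database to produce the result; the semantic gap separates the concept from the feature vector.]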
A More Natural Way
For Consumer Applications
• Let the user retrieve the desired personal photos using textual
queries.
• Image annotation is an intermediate stage for textual query
based image retrieval.
• Image annotation is used to classify images w.r.t. high-level
semantic concepts.
– Semantic concepts are analogous to the textual terms describing
document contents.
[Figure: the database photos are annotated with high-level concepts; a textual query (e.g. "Sunset") is compared with the annotation results to rank the photos and return the result.]
Our Goal
• Web images are accompanied by tags, categories and titles.
• Leverage information from web images to retrieve consumer
photos in personal photo collections.
• A real-time textual query based consumer photo retrieval
system without any intermediate annotation stage.
[Figure: web images with contextual information (e.g. "building", "people, family", "people, wedding", "sunset") are used to retrieve consumer photos, with no intermediate image annotation process.]
System Framework
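[Figure: system framework. A textual query, together with WordNet, drives automatic web image retrieval over a large collection of web images (with descriptive words) and yields relevant/irrelevant images. These train a classifier used for consumer photo retrieval over the raw consumer photos, producing the top-ranked consumer photos; relevance feedback then gives the refined top-ranked photos.]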
• When the user provides a textual query, it is used to find
relevant/irrelevant images in the web image collection.
• Then, a classifier is trained on these web images.
• The consumer photos can then be ranked by the classifier's
decision value.
• The user can also give relevance feedback to refine the
retrieval results.
Automatic Web Image Retrieval
[Figure: the query “boat” is looked up in the inverted file and in the semantic word trees based on WordNet (e.g. boat, ark, barge, dredger, houseboat) to select the relevant and the irrelevant web images.]
• For the user's textual query, first search for it in the
semantic word trees.
• The web images containing the query word are considered
“relevant web images”.
• The web images that contain neither the query word nor its
two-level descendants are considered “irrelevant web images”.
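The selection rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `WORD_TREE` and `WEB_IMAGES` are hypothetical toy data standing in for the WordNet-based word trees and the web image collection, and all function names are made up for the example.

```python
# Toy semantic word tree: word -> direct children (hypothetical data;
# a real system would derive this from WordNet).
WORD_TREE = {
    "boat": ["ark", "barge", "houseboat"],
    "barge": ["dredger"],
}

# Toy web image collection: image id -> words from its text description.
WEB_IMAGES = {
    "img1": {"boat", "lake"},
    "img2": {"dredger", "river"},
    "img3": {"sunset", "beach"},
    "img4": {"ark"},
}


def build_inverted_file(images):
    """Map each word to the set of image ids whose description contains it."""
    inverted = {}
    for image_id, words in images.items():
        for word in words:
            inverted.setdefault(word, set()).add(image_id)
    return inverted


def descendants(word, tree, levels=2):
    """Collect the descendants of `word` down to `levels` levels of the tree."""
    found, frontier = set(), [word]
    for _ in range(levels):
        frontier = [child for w in frontier for child in tree.get(w, [])]
        found.update(frontier)
    return found


def split_web_images(query, images, tree):
    """Return (relevant, irrelevant) image id sets for a query word."""
    inverted = build_inverted_file(images)
    relevant = set(inverted.get(query, set()))
    # Images mentioning the query word or a descendant are never "irrelevant".
    excluded = set(relevant)
    for word in descendants(query, tree):
        excluded |= inverted.get(word, set())
    irrelevant = set(images) - excluded
    return relevant, irrelevant


relevant, irrelevant = split_web_images("boat", WEB_IMAGES, WORD_TREE)
print(sorted(relevant), sorted(irrelevant))
```

In this toy run, images that mention only a descendant word ("ark", "dredger") land in neither set, which is what the rule on the slide implies.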
Decision Stump Ensemble
• Train a decision stump on each dimension.
• Combine them with their training error
rates.
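A minimal sketch of such an ensemble follows. The slide does not say how the training error rates enter the combination, so the weight `0.5 - error` used here is an assumption, as are the function names and the exhaustive threshold search.

```python
def train_stump(values, labels):
    """Fit a stump on one feature dimension.

    Tries every observed value as threshold with both polarities and keeps
    the pair with the lowest training error. Labels are +1 / -1.
    Returns (threshold, sign, error_rate).
    """
    best = (None, 1, 1.0)
    for threshold in sorted(set(values)):
        for sign in (1, -1):
            wrong = sum(
                1 for v, y in zip(values, labels)
                if (sign if v >= threshold else -sign) != y
            )
            error = wrong / len(labels)
            if error < best[2]:
                best = (threshold, sign, error)
    return best


def train_ensemble(samples, labels):
    """Train one stump per feature dimension."""
    n_dims = len(samples[0])
    return [train_stump([s[d] for s in samples], labels) for d in range(n_dims)]


def decision_value(ensemble, x):
    """Weighted vote of all stumps; weight 0.5 - error is an assumed choice."""
    total = 0.0
    for d, (threshold, sign, error) in enumerate(ensemble):
        vote = sign if x[d] >= threshold else -sign
        total += (0.5 - error) * vote
    return total


# Toy data: dimension 0 separates the classes, dimension 1 is noisier.
samples = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
labels = [-1, -1, 1, 1]
ensemble = train_ensemble(samples, labels)
print(decision_value(ensemble, [0.85, 3.0]), decision_value(ensemble, [0.15, 0.5]))
```

Because each stump depends on a single dimension, the per-dimension training loop is independent across dimensions and can be split over threads.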
Why Decision Stump Ensemble?
• Main reason: low time cost
– Our goal: a (quasi) real-time retrieval system.
– For basic classifiers: SVMs are much slower;
– For combination: boosting is also much
slower.
• The advantage of decision stump ensemble:
– Low training cost;
– Low testing cost;
– Very easy to parallelize.
Asymmetric Bagging
• Imbalance: count(irrelevant) >> count(relevant)
– Side effects, e.g. overfitting.
• Solution: asymmetric bagging
– Repeat 100 times by using different randomly sampled
irrelevant web images.
[Figure: 100 training sets built from the relevant images and randomly sampled irrelevant images.]
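The sampling step can be sketched as below. The slides give the number of repetitions (100) but not the sample size, so drawing as many irrelevant images as there are relevant ones is an assumption; `asymmetric_bags` is a hypothetical helper name.

```python
import random


def asymmetric_bags(relevant, irrelevant, n_bags=100, seed=0):
    """Build training sets that keep every relevant image and resample the rest.

    Each bag pairs the full relevant list with a fresh random sample of the
    irrelevant list. The sample size (equal to the number of relevant
    images) is an assumed choice.
    """
    rng = random.Random(seed)
    sample_size = min(len(relevant), len(irrelevant))
    bags = []
    for _ in range(n_bags):
        negatives = rng.sample(irrelevant, sample_size)
        bags.append((list(relevant), negatives))
    return bags


# Toy ids: 5 relevant images against 1000 irrelevant ones.
relevant_ids = [f"rel{i}" for i in range(5)]
irrelevant_ids = [f"irr{i}" for i in range(1000)]
bags = asymmetric_bags(relevant_ids, irrelevant_ids)
print(len(bags), len(bags[0][0]), len(bags[0][1]))
```

One classifier would then be trained per bag, so every training set is balanced even though the full collection is not.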
Relevance Feedback
• The user labels n_l relevant or irrelevant
consumer photos.
– Use this information to further refine the
retrieval results.
• Challenge 1: n_l is usually small.
• Challenge 2: Cross-domain learning
– Source classifier is trained on the web image
domain.
– The user labels some personal photos.
Method 1: Cross-Domain
Combination of Classifiers
• Re-train the classifiers with data from both
domains?
– Neither effective nor efficient.
• A simple but effective method:
– Train an SVM on the consumer photo domain
with the user-labeled photos;
– Convert the responses of the source classifier and
the SVM classifier to probabilities, and add them up;
– Rank the consumer photos by this sum.
• Referred to as DS_S+SVM_T.
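The combination step might look like the sketch below. The slides do not name the conversion to probability, so the logistic function used here is an assumption; the scores are made-up numbers standing in for the outputs of the source classifier and of the SVM trained on user-labeled photos.

```python
import math


def to_probability(response):
    """Squash a raw classifier response into (0, 1); logistic form is assumed."""
    return 1.0 / (1.0 + math.exp(-response))


def rank_by_combined_score(source_scores, target_scores):
    """Rank photos by the sum of the two classifiers' probabilities."""
    combined = {
        photo: to_probability(source_scores[photo])
        + to_probability(target_scores[photo])
        for photo in source_scores
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical responses for three consumer photos.
source_scores = {"a": 2.0, "b": -1.0, "c": 0.5}   # source (web-domain) classifier
target_scores = {"a": -0.5, "b": -2.0, "c": 3.0}  # SVM on user-labeled photos
ranking = rank_by_combined_score(source_scores, target_scores)
print(ranking)
```

Mapping both responses to a common scale before adding them keeps either classifier from dominating simply because its raw outputs have a larger range.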
Method 2: Cross-Domain
Regularized Regression (CDRR)
• Construct a linear regression function $f^T(x)$:
– For labeled photos: $f^T(x_i) \approx y_i$;
– For unlabeled photos: $f^T(x_i) \approx f^s(x_i)$.
[Figure: for the user-labeled images $x_1, \dots, x_l$, $f^T(x)$ should be the user's label $y(x)$; for the other images, $f^T(x)$ should be the source classifier output $f^s(x)$.]
• Design a target linear classifier $f^T(x) = w^\top x$.
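The slides do not show the objective itself. One plausible way to write the two requirements as a single regularized least-squares problem is sketched below; the squared losses, the trade-off weights $\gamma$ and $\lambda$, and the total photo count $n$ are assumptions introduced for illustration, not taken from the slides.

```latex
\min_{w} \;
  \sum_{i=1}^{l} \left( w^\top x_i - y_i \right)^2
  \;+\; \gamma \sum_{i=l+1}^{n} \left( w^\top x_i - f^s(x_i) \right)^2
  \;+\; \lambda \lVert w \rVert^2
```

The first term fits the user's labels on the $l$ labeled photos, the second keeps the target function close to the source classifier on the remaining photos, and the third controls the complexity of $w$.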
Experimental Results
Dataset and Experimental Setup
• Web Image Database:
– 1.3 million photos from photoSIG.
– Relatively professional photos.
• Text descriptions for web images:
– Title, portfolio, and categories accompanying
the web images;
– Remove the common high-frequency words;
– Remove the rarely-used words.
– Finally, 21377 words in our vocabulary.
Dataset and Experimental Setup
• Testing Dataset #1: Kodak dataset
– Collected by Eastman Kodak Company:
• From about 100 real users.
• Over a period of one year.
– 1358 images:
• The first keyframe from each video.
– 21 concepts:
• We merge “group_of_two” and
“group_of_three_or_more” into one concept.
Dataset and Experimental Setup
• Testing Dataset #2: Corel dataset
– 4999 images
• 192x128 or 128x192.
– 43 concepts:
• We remove all concepts in which there are fewer
than 100 images.
Visual Features
• Grid-Based color moment (225D)
– Three moments of three color channels from
each block of 5x5 grid.
• Edge direction histogram (73D)
– 72 edge direction bins plus one non-edge bin.
• Wavelet texture (128D)
• Concatenate all three kinds of features:
– Normalize each dimension to avg = 0, stddev = 1
– Use first 103 principal components.
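As an illustration of where the 225 dimensions come from, here is a minimal sketch of grid-based color moments. The slide does not name the three moments, so mean, standard deviation, and the signed cube root of the third central moment are an assumed choice; the nested-list image layout is also only for the example.

```python
def color_moments(pixels):
    """Three moments of one channel: mean, std, signed cube root of skew term."""
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    third = sum((p - mean) ** 3 for p in pixels) / n
    skew = abs(third) ** (1.0 / 3.0) * (1 if third >= 0 else -1)
    return [mean, variance ** 0.5, skew]


def grid_color_moments(image, grid=5):
    """image: list of rows, each row a list of 3-channel pixel tuples.

    Returns grid * grid * 3 channels * 3 moments values (225 for grid=5).
    """
    height, width = len(image), len(image[0])
    features = []
    for gy in range(grid):
        rows = range(gy * height // grid, (gy + 1) * height // grid)
        for gx in range(grid):
            cols = range(gx * width // grid, (gx + 1) * width // grid)
            for channel in range(3):
                block = [image[y][x][channel] for y in rows for x in cols]
                features.extend(color_moments(block))
    return features


# Toy 10x10 image whose channels encode the x and y coordinates.
image = [[(x, y, 0) for x in range(10)] for y in range(10)]
features = grid_color_moments(image)
print(len(features), len(features) + 73 + 128)
```

The second printed number is the length of the concatenated feature vector before normalization and the reduction to 103 principal components.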
Retrieval without
Relevance Feedback
• For all concepts:
– Average number of relevant images: 3703.5.
Retrieval without
Relevance Feedback
• kNN: rank consumer photos by their average distance
to the 300 nearest neighbors among the relevant web images.
• DS_S: decision stump ensemble.
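The kNN baseline can be sketched as follows. Euclidean distance is an assumption (the slide does not name the metric), and the toy call uses `k=2` instead of 300 so it runs on a handful of points.

```python
import math


def knn_score(photo, relevant_web_features, k=300):
    """Average distance from a photo to its k nearest relevant web images."""
    distances = sorted(math.dist(photo, w) for w in relevant_web_features)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


def rank_photos(photos, relevant_web_features, k=300):
    """Order photo names by ascending average distance (closest first)."""
    return sorted(
        photos,
        key=lambda name: knn_score(photos[name], relevant_web_features, k),
    )


# Hypothetical 2-D features for three relevant web images and two photos.
web_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
photos = {"far": (5.0, 5.0), "near": (0.2, 0.2)}
knn_ranking = rank_photos(photos, web_features, k=2)
print(knn_ranking)
```

Every query photo is compared against every relevant web image, which is why the cost of this baseline grows with the size of the web collection.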
Retrieval without
Relevance Feedback
• Time cost:
– We use OpenMP to parallelize our method;
– With 8 threads, both methods reach
interactive speed.
– But kNN is expected to be slow on large-scale datasets.
Retrieval with Relevance Feedback
• In each round, the user labels at most one
positive and one negative image among the top 40;
• Methods for comparison:
– kNN_RF: add user-labeled photos into relevant
image set, and re-apply kNN;
– SVM_T: train SVM based on the user-labeled
images in the target domain;
– A-SVM: Adaptive SVM;
– MR: Manifold Ranking based relevance
feedback method;
Retrieval with Relevance Feedback
• Setting of y(x) for CDRR:
– Positive: +1.0;
– Negative: -0.1;
• Reason:
– The top-ranked negative images
are not extremely negative;
– Positive: “what is”; Negative: “what
is not”.
[Figure: positive images and negative images.]
Retrieval with Relevance Feedback
• On Corel dataset:
Retrieval with Relevance Feedback
• On Kodak dataset:
Retrieval with Relevance Feedback
• Time cost:
– All methods except A-SVM can achieve real-time
speed.
System Demonstration
Query: Sunset
Query: Plane
The User is Providing The Relevance Feedback …
After feedback with 2 positive and 2 negative photos…
Summary
• Our goal: (quasi) real-time textual query
based consumer photo retrieval.
• Our method:
– Use web images and their surrounding text
descriptions as an auxiliary database;
– Asymmetric bagging with decision stumps;
– Several simple but effective cross-domain
learning methods to help relevance feedback.
Future Work
• How to efficiently use more powerful source
classifiers?
• How to further improve the speed:
– Keep the training time within 1 second;
– Control testing time when the consumer photo
set is very large.
Thank you!
• Any questions?