A Metric Cache for Similarity Search

Fabrizio Falchi
Claudio Lucchese
Salvatore Orlando
Fausto Rabitti
Raffaele Perego
Similarity Search in Databases

o Objects are “unknown”; only distances are “well known”.
o Metric space assumptions (illustrated in the sketch below):
  • Identity
  • Symmetry
  • Triangle inequality
o Distance functions include:
  • Minkowski distances, edit distance, Jaccard distance, ...
o Applications include:
  • Images, 3D shapes, medical data, text, DNA sequences, graphs, etc.
o Metric-space indexing works better than multidimensional indexing.
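To make the metric-space assumptions concrete, here is a minimal Python sketch (mine, not from the talk) that spot-checks the three axioms for the Euclidean distance, one of the Minkowski distances mentioned above:

```python
import math

def euclidean(x, y):
    """L2 (Minkowski, p = 2) distance: a textbook metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Spot-check the three metric axioms on sample points.
p, q, r = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)
assert euclidean(p, p) == 0.0                                # identity
assert euclidean(p, q) == euclidean(q, p)                    # symmetry
assert euclidean(p, r) <= euclidean(p, q) + euclidean(q, r)  # triangle inequality
```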
Distributed Similarity Search System

[Diagram: top-k queries enter the front-end of the CBIR system, which forwards them to a parallel & distributed CBIR system of n units, each holding an index of MM objects. Search cost is close to O(|DB|)!]
Distributed & Cached Similarity Search System

[Diagram: the same architecture, with a Metric Cache placed between the top-k queries and the front-end of the CBIR system.]
What’s different in a Distributed & Cached Similarity Search System?

[Diagram: the cached architecture again, with the Metric Cache highlighted in front of the parallel & distributed CBIR system.]
What’s different in a Metric Cache

o The cache stores result objects, not only result pointers:
  • e.g., documents vs. document IDs.
o The cache is a peculiar sample of the whole dataset:
  • the set of objects most recently seen by the users (= the most interesting ones!?).
o Claim: an interesting object may be used to answer a query approximately if it is sufficiently similar to the query (see the sketch below).
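As a minimal illustration of storing result objects rather than pointers (hypothetical names, not the authors' code), a metric cache entry maps a past query to its full result objects, so the objects themselves can serve future, similar queries:

```python
from dataclasses import dataclass

@dataclass
class ResultObject:
    """A full result object: the cache stores these, not just pointers/IDs."""
    obj_id: str
    features: tuple      # e.g., a descriptor vector used by the distance function
    distance: float      # distance from the cached query that retrieved it

# Hypothetical cache: each past query maps to its k result objects, so cached
# objects can answer a sufficiently similar future query approximately.
metric_cache = {
    "query_img_42": [
        ResultObject("img_7", (0.10, 0.90), 0.05),
        ResultObject("img_3", (0.20, 0.80), 0.11),
    ],
}
```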
What’s different in a Metric Cache ...and...

o Queries may be approximate!
o [Zobel et al., CIVR 07]: at least 8% of the images on the web are near-duplicates, most of them due to cropping, contrast adjustment, etc.
o Requirement: the system must be robust w.r.t. near-duplicate queries.
What’s different in a Metric Cache

[Diagram: three queries against the Metric Cache in front of the CBIR system: q2 and q3 receive exact answers, while q1 receives an approximate answer.]
Two algorithms: RCache vs. QCache

RCache(q, k)
• If q ∈ Cache, return R                      [Cache Hit]
• R = Cache.knn(q, k)
• If quality(R) > threshold, return R         [Approximate Hit]
• else                                        [Cache Miss]
  R = DB.knn(q, k)
  Cache.add(q, R)
  return R

In case of an approximate hit, the cached query q’, being the closest to q, is marked as used. The least recently used query and its results are evicted.
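A runnable Python sketch of the RCache logic above, under assumptions of mine: the cache is a plain dict from past queries to result lists, dist and quality are caller-supplied, and knn is brute force (the real system would use a metric index such as the M-Tree). Names are illustrative, not the authors' code.

```python
def knn(objects, q, k, dist):
    """Brute-force k-NN over any iterable (a metric index in a real system)."""
    return sorted(objects, key=lambda x: dist(q, x))[:k]

def rcache(q, k, cache, db, dist, quality, threshold):
    if q in cache:                                   # Cache Hit: exact stored results
        return cache[q]
    pool = [x for res in cache.values() for x in res]
    r = knn(pool, q, k, dist)                        # search all cached result objects
    if pool and quality(q, r) > threshold:           # Approximate Hit
        return r
    r = knn(db, q, k, dist)                          # Cache Miss: full (expensive) search
    cache[q] = r                                     # add result; LRU eviction omitted
    return r
```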
Costs of RCache vs. QCache

RCache(q, k)
• Cache Hit: hash table access, O(1).
• Approximate Hit: search among all the cached result objects, O(|Cache|).
• Cache Miss: search among all the objects in the database, O(|DB|).

|Cache| is the number of cached objects, and |DB| is the size of the database.
Two algorithms: RCache vs. QCache

RCache(q, k): as above.

QCache(q, k)
• If q ∈ Cache, return R                      [Cache Hit]
• Q* = Cache.knn(q, k_Q)
• R* = {results in Q*}
• R = R*.knn(q, k)
• If quality(R) > threshold, return R         [Approximate Hit]
• else                                        [Cache Miss]
  R = DB.knn(q, k)
  Cache.add(q, R)
  return R

As with RCache, in case of an approximate hit the cached query q’, being the closest to q, is marked as used, and the least recently used query and its results are evicted.
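A matching sketch of QCache, reusing knn from the RCache sketch (k_q is the number of nearest cached queries to consult; again illustrative names, not the authors' code):

```python
def qcache(q, k, k_q, cache, db, dist, quality, threshold):
    if q in cache:                                   # Cache Hit
        return cache[q]
    nearest = knn(list(cache), q, k_q, dist)         # k_q cached queries closest to q
    pool = [x for qq in nearest for x in cache[qq]]  # R*: only their results
    r = knn(pool, q, k, dist)                        # k-NN within that small pool
    if pool and quality(q, r) > threshold:           # Approximate Hit
        return r
    r = knn(db, q, k, dist)                          # Cache Miss: full search
    cache[q] = r                                     # add result; LRU eviction omitted
    return r
```

Consulting only the nearest cached queries is what shrinks the approximate-hit search from all cached objects to a small pool, as the cost comparison below shows.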
Costs of RCache vs. QCache

RCache(q, k)
• Cache Hit: hash table access, O(1).
• Approximate Hit: search among all the cached result objects, O(|Cache|).
• Cache Miss: search among all the objects in the database, O(|DB|).

QCache(q, k)
• Cache Hit: hash table access, O(1).
• Approximate Hit: search among the cached query objects, O(|Cache|/k).
• Cache Miss: search among all the objects in the database, O(|DB|).

Supposing k results are stored for each query, |Cache| is the number of cached objects, and |DB| is the size of the database.
Approximation & Guarantees

[Diagram: a new query q, a cached query q* with result radius r*, and the safe range s.]

o Let the safe range be: s = r* − d(q, q*).
o The cached k* objects within distance s are the true top-k* of the new query.
o Every cached query may provide some additional guarantee.
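The guarantee is a direct consequence of the triangle inequality; a short derivation, reading r* as the distance from q* to its farthest cached result:

```latex
% Any object x outside the cached top-k* of q* satisfies d(q*, x) > r*.
% By the triangle inequality, d(q*, x) <= d(q*, q) + d(q, x), so
\[
  d(q, x) \;\ge\; d(q^{*}, x) - d(q, q^{*}) \;>\; r^{*} - d(q, q^{*}) \;=\; s .
\]
% Every non-cached object therefore lies farther than s from q, so the
% cached objects within distance s of q are guaranteed true nearest neighbors.
```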
Experimental setup

o A collection of 1,000,000 images downloaded from Flickr:
  • we extracted 5 MPEG-7 descriptors, which were used to measure similarity.
o A query log of 100,000 images:
  • a random sample with replacement, based on image views
  • 20% training – 80% testing
o k = 20, k_Q = 10
o Quality function: safe range ≥ 0.
Hit ratio

[Plot: hit ratio results.]

Throughput

[Plot: throughput results.]
Approximation quality I

\[
  \mathit{RES}(q, R) \;=\; \frac{\sum_{x \in R} d(q, x)}{\sum_{y \in \mathit{KNN}(q, k)} d(q, y)} \;-\; 1
\]

\[
  \mathit{REM}(q, R) \;=\; \frac{\max_{x \in R} d(q, x)}{\max_{y \in \mathit{KNN}(q, k)} d(q, y)} \;-\; 1
\]
Approximation quality II

What we also did...
o Take queries from a different collection.
o Inject duplicates into the query log.
o Use an expectation of RES as the quality measure.
Approximation quality III

o Bug: RES = 0.07
o Portrait: RES = 0.09
o Sunset: RES = 0.12
Acknowledgements

o The European Project SAPIR
  • http://www.sapir.eu
o P. Zezula and his colleagues for the M-Tree implementation
  • http://mufin.fi.muni.cz/tiki-index.php
o The dataset used is available at:
  • http://cophir.isti.cnr.it
Thank you.
Backup slides