On Index-free Similarity Search in Metric Spaces

Download Report

Transcript On Index-free Similarity Search in Metric Spaces

On Index-free Similarity Search in
Metric Spaces
Tomáš Skopal1, Benjamin Bustos2
1Charles
University in Prague, Czech Republic
2University of Chile, Santiago, Chile
DEXA 2009, Linz, Austria
Outline
 Metric approach to similarity search
 Motivation for index-free similarity search
 D-file (+ D-cache)
 Experiments
 Conclusion
DEXA 2009, Linz, Austria
Similarity search
 Multimedia databases, time series, bioinformatics, ...
 Content-based similarity search (query by example)
range query
(give me the very similar ones – over 80%)
k nearest neighbors query
(give me the 3 most similar)
0.1
DEXA 2009, Linz, Austria
0.15
0.3
0.6
0.8
Metric approach to similarity search
 the similarity d (actually distance) is computationally expensive –
often O(m2), sometimes even O(2m) w.r.t. the size (m) of a
compared object
 querying by a sequential scan over the database of n objects is thus
expensive
 the goal:
minimizing the number of distance
computations d(*,*) for a query
 the way:
using metric distances (metric postulates)
 allows to partition the data space
 the search is then performed just in several partitions → efficient
search
DEXA 2009, Linz, Austria
Using lower-bound distances for
filtering database objects
 a cheap determination of tight lower-bound distance of
d(*,*) provides a mechanism how to quickly filter irrelevant
objects from search
X
query
ball
r
Q
P
The task: check if X is inside query ball
•we know d(Q,P)
•we know d(P,X)
•we do not know d(Q,X)
•we do not have to compute d(Q,X), because
its lower bound d(Q,P)-d (X,P) is larger than
r, so X surely cannot be in the query ball, so X
is ignored
 this filtering is used in various forms by
metric access methods, where X stands for
a database object and P for a pivot object
DEXA 2009, Linz, Austria
Index-based metric access methods
 All metric access methods (MAM) are index-based, i.e. preprocessing of a
database is always needed.
 Index construction usually takes between O(kn) to O(n2).
M-tree
DEXA 2009, Linz, Austria
PM-tree
GNAT
Motivation for index-free search
 indexing is not desirable (or even possible) if
 we have a highly changeable database
 more inserts/deletes/updates than searches, i.e., streaming databases,
archives, logs, sensory databases, etc.
 we perform isolated searches
 a database is created for a few queries and then discarded, i.e., in data
mining tasks
 we switch between distances (changing similarity)
 the distance function is tuned at query time, e.g., weighing of object
features is applied dynamically
DEXA 2009, Linz, Austria
D-file
 just the original database using sequential scan, BUT
 it uses D-cache
 a memory-resident structure that maintains the distances
computed during previous queries
 provides lower-bounds of requested distances that can be
used to filter some of the database objects when querying
 O(1) complexity for a lower bound retrieval
 no preprocessing (indexing) of database
DEXA 2009, Linz, Austria
D-file – range query
???
Q
simple
sequential
sequential
search
search
enhanced by D-cache filtering
DEXA 2009, Linz, Austria
Oi
D-cache
 every time a D-file computes a distance d(*,*), it is stored
into D-cache
 the D-cache could be viewed as a sparse matrix, where
queries denote rows, database object denote columns, and a
cell contains a value
of d(Q,O)
DEXA 2009, Linz, Austria
D-cache
 D-cache has two functionalities
 it allows to retrieve the exact distance d(Q,O), if it is there
 the main functionality: it provides tight lower bound to d(Q,O)
 How to obtain a lower bound?
 prior to a new query Q, determine some old queries DPiQ (acting as
dynamic pivots) and compute
the distances d(Q, DPiQ)
 when a lower bound to d(Q,O)
is required, search for available
distances d(Q, DPiQ) in
the D-cache and obtain
the max(d(DPiQ, O) – d(Q, DPiQ));
that is our tight lower bound distance
DEXA 2009, Linz, Austria
D-cache
 how to choose the old queries (dynamic pivots)?
 “Recent” policy
 simple – we just choose k previous queries
 motivation: the recently added distances are likely to still sit in the D-cache
 “Internal” policy
 advanced – we select k of the previous queries which are probably close
 we avoid computation of any distance
between new and old queries, we just
estimate the distance using distances
from D-cache
 motivation: a close query (pivot)
produces tighter lower bounds
DEXA 2009, Linz, Austria
D-cache implementation
 Cell cache
 a simple hash table
 used to determine individual cell values,
based on id1, id2
 used for Recent pivot selection
 Row cache
 in inverted list (list of objects belonging to old queries)
 used to determine the mediators when using Internal pivot selection
 Replacement policies
 because the size of D-cache is limited, both Cell cache and Row cache
apply the LRU distance replacement policy
DEXA 2009, Linz, Austria
Experiments
 datasets
 A subset of Corel features – 65,615 32-dimensional vectors
of color moments,
and the d = L1 distance
 A synthetic Polygons set; 500,000 randomly generated 2D
polygons varying in the number of vertices from 10 to 15,
and the d = Hausdorff distance (maximum distance of a point set to the
nearest point in the other set).
 A subset of GenBank file rel147, namely 50,000 protein
sequences of lengths from 50 to 100,
and the d = edit distance
DEXA 2009, Linz, Austria
Experiments
 D-file was compared with 3 metric access methods and the
trivial sequential scan
 M-tree
 PM-tree
 GNAT
 seq. scan
 we have observed the number of distance computations spent
on
 indexing
 querying
DEXA 2009, Linz, Austria
Experiments
 construction costs
DEXA 2009, Linz, Austria
Experiments
 unknown queries (query objects outside the dataset)
DEXA 2009, Linz, Austria
Experiments
 unknown queries (query objects outside the dataset)
DEXA 2009, Linz, Austria
Experiments
 database queries (used when browsing, etc.)
DEXA 2009, Linz, Austria
Experiments
 database queries (used when browsing, etc.)
DEXA 2009, Linz, Austria
Conclusion
 D-file – an index-free metric access method
 requires no indexing
 suitable for highly changeable databases, isolated searches or
when changing the similarity
 D-cache
 a structure used by D-file to cheaply determine lower-bound
distances
 uses distances computed and cached during previous queries
processing
DEXA 2009, Linz, Austria
Future work
 we plan to include the D-cache also into index-based metric
access methods to improve the efficiency of
 index construction
 simple queries (range and kNN)
 advanced operations (similarity joins)
 and so on...
Thank you for your attention!
DEXA 2009, Linz, Austria