On Index-free Similarity Search in Metric Spaces
Download
Report
Transcript On Index-free Similarity Search in Metric Spaces
On Index-free Similarity Search in
Metric Spaces
Tomáš Skopal1, Benjamin Bustos2
1Charles
University in Prague, Czech Republic
2University of Chile, Santiago, Chile
DEXA 2009, Linz, Austria
Outline
Metric approach to similarity search
Motivation for index-free similarity search
D-file (+ D-cache)
Experiments
Conclusion
DEXA 2009, Linz, Austria
Similarity search
Multimedia databases, time series, bioinformatics, ...
Content-based similarity search (query by example)
range query
(give me the very similar ones – over 80%)
k nearest neighbors query
(give me the 3 most similar)
0.1
DEXA 2009, Linz, Austria
0.15
0.3
0.6
0.8
Metric approach to similarity search
the similarity d (actually distance) is computationally expensive –
often O(m2), sometimes even O(2m) w.r.t. the size (m) of a
compared object
querying by a sequential scan over the database of n objects is thus
expensive
the goal:
minimizing the number of distance
computations d(*,*) for a query
the way:
using metric distances (metric postulates)
allows to partition the data space
the search is then performed just in several partitions → efficient
search
DEXA 2009, Linz, Austria
Using lower-bound distances for
filtering database objects
a cheap determination of tight lower-bound distance of
d(*,*) provides a mechanism how to quickly filter irrelevant
objects from search
X
query
ball
r
Q
P
The task: check if X is inside query ball
•we know d(Q,P)
•we know d(P,X)
•we do not know d(Q,X)
•we do not have to compute d(Q,X), because
its lower bound d(Q,P)-d (X,P) is larger than
r, so X surely cannot be in the query ball, so X
is ignored
this filtering is used in various forms by
metric access methods, where X stands for
a database object and P for a pivot object
DEXA 2009, Linz, Austria
Index-based metric access methods
All metric access methods (MAM) are index-based, i.e. preprocessing of a
database is always needed.
Index construction usually takes between O(kn) to O(n2).
M-tree
DEXA 2009, Linz, Austria
PM-tree
GNAT
Motivation for index-free search
indexing is not desirable (or even possible) if
we have a highly changeable database
more inserts/deletes/updates than searches, i.e., streaming databases,
archives, logs, sensory databases, etc.
we perform isolated searches
a database is created for a few queries and then discarded, i.e., in data
mining tasks
we switch between distances (changing similarity)
the distance function is tuned at query time, e.g., weighing of object
features is applied dynamically
DEXA 2009, Linz, Austria
D-file
just the original database using sequential scan, BUT
it uses D-cache
a memory-resident structure that maintains the distances
computed during previous queries
provides lower-bounds of requested distances that can be
used to filter some of the database objects when querying
O(1) complexity for a lower bound retrieval
no preprocessing (indexing) of database
DEXA 2009, Linz, Austria
D-file – range query
???
Q
simple
sequential
sequential
search
search
enhanced by D-cache filtering
DEXA 2009, Linz, Austria
Oi
D-cache
every time a D-file computes a distance d(*,*), it is stored
into D-cache
the D-cache could be viewed as a sparse matrix, where
queries denote rows, database object denote columns, and a
cell contains a value
of d(Q,O)
DEXA 2009, Linz, Austria
D-cache
D-cache has two functionalities
it allows to retrieve the exact distance d(Q,O), if it is there
the main functionality: it provides tight lower bound to d(Q,O)
How to obtain a lower bound?
prior to a new query Q, determine some old queries DPiQ (acting as
dynamic pivots) and compute
the distances d(Q, DPiQ)
when a lower bound to d(Q,O)
is required, search for available
distances d(Q, DPiQ) in
the D-cache and obtain
the max(d(DPiQ, O) – d(Q, DPiQ));
that is our tight lower bound distance
DEXA 2009, Linz, Austria
D-cache
how to choose the old queries (dynamic pivots)?
“Recent” policy
simple – we just choose k previous queries
motivation: the recently added distances are likely to still sit in the D-cache
“Internal” policy
advanced – we select k of the previous queries which are probably close
we avoid computation of any distance
between new and old queries, we just
estimate the distance using distances
from D-cache
motivation: a close query (pivot)
produces tighter lower bounds
DEXA 2009, Linz, Austria
D-cache implementation
Cell cache
a simple hash table
used to determine individual cell values,
based on id1, id2
used for Recent pivot selection
Row cache
in inverted list (list of objects belonging to old queries)
used to determine the mediators when using Internal pivot selection
Replacement policies
because the size of D-cache is limited, both Cell cache and Row cache
apply the LRU distance replacement policy
DEXA 2009, Linz, Austria
Experiments
datasets
A subset of Corel features – 65,615 32-dimensional vectors
of color moments,
and the d = L1 distance
A synthetic Polygons set; 500,000 randomly generated 2D
polygons varying in the number of vertices from 10 to 15,
and the d = Hausdorff distance (maximum distance of a point set to the
nearest point in the other set).
A subset of GenBank file rel147, namely 50,000 protein
sequences of lengths from 50 to 100,
and the d = edit distance
DEXA 2009, Linz, Austria
Experiments
D-file was compared with 3 metric access methods and the
trivial sequential scan
M-tree
PM-tree
GNAT
seq. scan
we have observed the number of distance computations spent
on
indexing
querying
DEXA 2009, Linz, Austria
Experiments
construction costs
DEXA 2009, Linz, Austria
Experiments
unknown queries (query objects outside the dataset)
DEXA 2009, Linz, Austria
Experiments
unknown queries (query objects outside the dataset)
DEXA 2009, Linz, Austria
Experiments
database queries (used when browsing, etc.)
DEXA 2009, Linz, Austria
Experiments
database queries (used when browsing, etc.)
DEXA 2009, Linz, Austria
Conclusion
D-file – an index-free metric access method
requires no indexing
suitable for highly changeable databases, isolated searches or
when changing the similarity
D-cache
a structure used by D-file to cheaply determine lower-bound
distances
uses distances computed and cached during previous queries
processing
DEXA 2009, Linz, Austria
Future work
we plan to include the D-cache also into index-based metric
access methods to improve the efficiency of
index construction
simple queries (range and kNN)
advanced operations (similarity joins)
and so on...
Thank you for your attention!
DEXA 2009, Linz, Austria