PPT presentation

Download Report

Transcript PPT presentation

NM-Tree: Flexible Approximate
Similarity Search in Metric and
Non-metric Spaces
Tomáš Skopal
Jakub Lokoč
Charles University in Prague
Department of Software Engineering
Czech Republic
DEXA 2008, Turin, Italy, Sep 1-5
Presentation outline






Metric similarity search
Semimetric tuning
M-Tree
NM-Tree
Experimental results
Discussion
Metric similarity search

query object
How to search in multimedia databases (MDB)?



textual annotation is expensive and ambiguous
MDB objects are not structured (we cannot use a structured query language, like SQL)
solution: content based similarity searching; similarity between a query and DB
object is interpreted as a relevance

similarity is often modelled by a distance function d satisfying metric properties
-> metric searching -> metric access methods (MAMs)

the distance function d is supposed to be computationally expensive
-> sequential search is unfeasible -> external indexing structures

metric properties are advantageous for indexing, but may be unsuitable for
domain experts


mainly the triangle inequality -> let's d be relaxed to a semimetric (i.e., reflexive,
non-negative, symmetric distance)
we use MAMs also with semimetric distances,
but we have to take more or less incorrect behavior into account

false dismissals & false positives for semimetrics
Semimetric tuning


semimetric brings less limitations for domain experts, but…
semimetric doesn’t guarantee triangle inequality for every triplet of objects in
the database -> a lot of non-triangle triplets causes a lot of false dismissals in
MAMs and vice versa
a
Triangle triplet:
a + b >= c




c
a
b
c
Non-triangle triplet:
a+b<c
b
exact metric searching = all distance triplets must be triangle triplets
non-triangle triplets cause only approximate but faster search
semimetric tuning = changing the proportion of non-triangle triplets
(generated by ds) by applying modifier f (real function) on original semimetric,
e.g. ds* = f (ds)
ds* should
- satisfy triangle inequality to some extent (controlled precision of searching)
- generate lower-dimensional distance distributions (faster searching)
Properties of the modifier f
1. f is increasing – preserves original query orderings
2. triangle-generating f = concave f = turns more distance triplets into triangle ones
= slow but precise searching
3. triangle-violating f = convex f = turns more distance triplets into non-triangle ones
= fast but approximate searching
4. parametric T-bases
ff
Modified distances
may form the
triangle triplet
ds*
guarantees the fish
is always more
similar to sea-maid
than to girl
d
5. TriGEN – an algorithm
for finding an f that satisfies the user-requested retrieval error e
and maximizes the search efficiency (lowest intrinsic dimensionality), see [2]
s
M-tree




dynamic, balanced, and paged tree structure (like e.g. B+-tree, R-tree)
the leaves are clusters of indexed objects Oj (ground objects)
routing entries in the inner nodes represent hyper-spherical
metric regions (Oi , rOi), recursively bounding the object clusters in leaves
the triangle inequality allows to discard irrelevant M-tree branches (metric
regions resp.) during query evaluation
Q
(euclidean 2D space)
range
query
M-tree filtering
a) basic filtering (expensive)
d(R, Q) > rR + rQ
b) parent filtering (cheap)
|d(P, Q) – d(P, R)| > rR + rQ
NM-tree motivation

separated usage of TriGen and M-tree - limitations




M-tree hierarchy depends on the topology of ds (specific to f used)
For another measure ds* the database must be re-indexed
To provide user with choice of precision/efficiency tradeoff we need to
maintain more M-trees - each for particular dissimilarity measure !!!
T-bases (returned by TriGEN) natively support inverse
modification (as proposed in [2])
ds = fe( fe(ds , w) , -w)
notation : fe-1 ~ fe(ds, -w)
ds*
We can mimic multiple M-trees with just one M-tree and an
appropriate set of modifiers fe -> NM-tree
NM-tree setbacks
Native combination of M-tree and TriGEN brings some problems…

determining modifiers for an empty index (no data received)


guarantee of exact searching


solution : gather first k objects into a sequential file then find
modifiers
solution : use metric dm = fm(ds) for inserting -> it is necessary to
find fm modifier in the initial phase
aggregated distances stored in M-tree as covering radii


cannot be correctly remodified by f
solution : approximate search only at the preleaf & leaf level
NM-tree

structure



the same structure as for the M-tree
maintains modifiers fe for distance modification
construction

Initial phase




Second phase


inserting into the sequential file until sufficient number of objects is gathered
then finding modifiers for requested retrieval errors including fm for error = 0
all objects from the sequential file are inserted into the NM-tree
inserting into the NM-tree under dm = fm(ds) -> this makes possible exact searching
querying



additional query parameter – an error threshold e (such that fe is available in NM-tree)
Exact search – NM-tree querying under dm (+ result distances remodified by fm-1)
Approximate search – for stored values (in the index) it is necessary to make
conversion from dm to ds using fm-1 and subsequently
conversion from ds to desired ds* using fe
NM-tree approximate querying

For upper levels the metric search is performed
query modification
q_radius* = fe(q_radius)
e2qd* = fe(e2qd)
dm
entry modification
d2p* = fe(fm-1(d2p))
e_radius* = fe(fm-1(e_radius))
ds*

For the preleaf and leaf level all the distances used for pruning
are modified to the desired semimetric depending on the userdefined error threshold e
NM-tree querying example
Experiments


We have compared multiple M-trees (each for semimetric determined
by user defined error) with NM-tree
We have performed our tests on two databases



We have tested one semimetric and one metric on each database



68,040 32-dimensional Corel features
250,000 synthetic 2D polygons, each consisting of 5 to 15 vertices
COREL – L0.75 and L2
Polygons – DTW and Hausdorff
For TriGEN we have selected 10 modifiers for each dissimilarity
measure and T-errors (values correlated with retrieval error)
within [0 – 0,32]
Experimental results
Experimental results
Experimental results
References



[1] Ciaccia P., Patella M., Zezula P.
M-tree: An Efficient Access Method for Similarity Search
in Metric Spaces. In: VLDB 1997, pp. 426–435 (1997)
[2] Skopal T.
Unified framework for fast exact and approximate
search in dissimilarity spaces. ACM Transactions on
Database Systems 32(4), 1–46 (2007)
[3] Chávez E., Navarro G.
A Probabilistic Spell for the Curse of Dimensionality. In:
Buchsbaum, A.L., Snoeyink, J. (eds.) ALENEX 2001. LNCS, vol.
2153, pp. 147–160. Springer, Heidelberg (2001)
...
Thank you for
attention
Questions ??