PPT presentation
Download
Report
Transcript PPT presentation
NM-Tree: Flexible Approximate
Similarity Search in Metric and
Non-metric Spaces
Tomáš Skopal
Jakub Lokoč
Charles University in Prague
Department of Software Engineering
Czech Republic
DEXA 2008, Turin, Italy, Sep 1-5
Presentation outline
Metric similarity search
Semimetric tuning
M-Tree
NM-Tree
Experimental results
Discussion
Metric similarity search
query object
How to search in multimedia databases (MDB)?
textual annotation is expensive and ambiguous
MDB objects are not structured (we cannot use a structured query language, like SQL)
solution: content based similarity searching; similarity between a query and DB
object is interpreted as a relevance
similarity is often modelled by a distance function d satisfying metric properties
-> metric searching -> metric access methods (MAMs)
the distance function d is supposed to be computationally expensive
-> sequential search is unfeasible -> external indexing structures
metric properties are advantageous for indexing, but may be unsuitable for
domain experts
mainly the triangle inequality -> let's d be relaxed to a semimetric (i.e., reflexive,
non-negative, symmetric distance)
we use MAMs also with semimetric distances,
but we have to take more or less incorrect behavior into account
false dismissals & false positives for semimetrics
Semimetric tuning
semimetric brings less limitations for domain experts, but…
semimetric doesn’t guarantee triangle inequality for every triplet of objects in
the database -> a lot of non-triangle triplets causes a lot of false dismissals in
MAMs and vice versa
a
Triangle triplet:
a + b >= c
c
a
b
c
Non-triangle triplet:
a+b<c
b
exact metric searching = all distance triplets must be triangle triplets
non-triangle triplets cause only approximate but faster search
semimetric tuning = changing the proportion of non-triangle triplets
(generated by ds) by applying modifier f (real function) on original semimetric,
e.g. ds* = f (ds)
ds* should
- satisfy triangle inequality to some extent (controlled precision of searching)
- generate lower-dimensional distance distributions (faster searching)
Properties of the modifier f
1. f is increasing – preserves original query orderings
2. triangle-generating f = concave f = turns more distance triplets into triangle ones
= slow but precise searching
3. triangle-violating f = convex f = turns more distance triplets into non-triangle ones
= fast but approximate searching
4. parametric T-bases
ff
Modified distances
may form the
triangle triplet
ds*
guarantees the fish
is always more
similar to sea-maid
than to girl
d
5. TriGEN – an algorithm
for finding an f that satisfies the user-requested retrieval error e
and maximizes the search efficiency (lowest intrinsic dimensionality), see [2]
s
M-tree
dynamic, balanced, and paged tree structure (like e.g. B+-tree, R-tree)
the leaves are clusters of indexed objects Oj (ground objects)
routing entries in the inner nodes represent hyper-spherical
metric regions (Oi , rOi), recursively bounding the object clusters in leaves
the triangle inequality allows to discard irrelevant M-tree branches (metric
regions resp.) during query evaluation
Q
(euclidean 2D space)
range
query
M-tree filtering
a) basic filtering (expensive)
d(R, Q) > rR + rQ
b) parent filtering (cheap)
|d(P, Q) – d(P, R)| > rR + rQ
NM-tree motivation
separated usage of TriGen and M-tree - limitations
M-tree hierarchy depends on the topology of ds (specific to f used)
For another measure ds* the database must be re-indexed
To provide user with choice of precision/efficiency tradeoff we need to
maintain more M-trees - each for particular dissimilarity measure !!!
T-bases (returned by TriGEN) natively support inverse
modification (as proposed in [2])
ds = fe( fe(ds , w) , -w)
notation : fe-1 ~ fe(ds, -w)
ds*
We can mimic multiple M-trees with just one M-tree and an
appropriate set of modifiers fe -> NM-tree
NM-tree setbacks
Native combination of M-tree and TriGEN brings some problems…
determining modifiers for an empty index (no data received)
guarantee of exact searching
solution : gather first k objects into a sequential file then find
modifiers
solution : use metric dm = fm(ds) for inserting -> it is necessary to
find fm modifier in the initial phase
aggregated distances stored in M-tree as covering radii
cannot be correctly remodified by f
solution : approximate search only at the preleaf & leaf level
NM-tree
structure
the same structure as for the M-tree
maintains modifiers fe for distance modification
construction
Initial phase
Second phase
inserting into the sequential file until sufficient number of objects is gathered
then finding modifiers for requested retrieval errors including fm for error = 0
all objects from the sequential file are inserted into the NM-tree
inserting into the NM-tree under dm = fm(ds) -> this makes possible exact searching
querying
additional query parameter – an error threshold e (such that fe is available in NM-tree)
Exact search – NM-tree querying under dm (+ result distances remodified by fm-1)
Approximate search – for stored values (in the index) it is necessary to make
conversion from dm to ds using fm-1 and subsequently
conversion from ds to desired ds* using fe
NM-tree approximate querying
For upper levels the metric search is performed
query modification
q_radius* = fe(q_radius)
e2qd* = fe(e2qd)
dm
entry modification
d2p* = fe(fm-1(d2p))
e_radius* = fe(fm-1(e_radius))
ds*
For the preleaf and leaf level all the distances used for pruning
are modified to the desired semimetric depending on the userdefined error threshold e
NM-tree querying example
Experiments
We have compared multiple M-trees (each for semimetric determined
by user defined error) with NM-tree
We have performed our tests on two databases
We have tested one semimetric and one metric on each database
68,040 32-dimensional Corel features
250,000 synthetic 2D polygons, each consisting of 5 to 15 vertices
COREL – L0.75 and L2
Polygons – DTW and Hausdorff
For TriGEN we have selected 10 modifiers for each dissimilarity
measure and T-errors (values correlated with retrieval error)
within [0 – 0,32]
Experimental results
Experimental results
Experimental results
References
[1] Ciaccia P., Patella M., Zezula P.
M-tree: An Efficient Access Method for Similarity Search
in Metric Spaces. In: VLDB 1997, pp. 426–435 (1997)
[2] Skopal T.
Unified framework for fast exact and approximate
search in dissimilarity spaces. ACM Transactions on
Database Systems 32(4), 1–46 (2007)
[3] Chávez E., Navarro G.
A Probabilistic Spell for the Curse of Dimensionality. In:
Buchsbaum, A.L., Snoeyink, J. (eds.) ALENEX 2001. LNCS, vol.
2153, pp. 147–160. Springer, Heidelberg (2001)
...
Thank you for
attention
Questions ??