Transcript Document

Answering Metric Skyline
Queries by PM-tree
Tomáš Skopal, Jakub Lokoč
Department of Software Engineering, FMP,
Charles University in Prague
Similarity search
• content-based similarity
search
• single-example queries
– range query
– kNN query
• multi-example queries
– combination of single-example queries is not sufficient
– should support
• partial matching
• compromise
– metric skyline
DATESO 2010, Štědronín - Plazy
Metric skyline query (MSQ)
• traditional skyline operator
– linearly ordered attribute domains
– dominance relation
• + MDDRs (minimum
dominating-dominated
rectangles)
– static schema
• metric skyline
– multi-example query (not just operator)
– attributes specified at query time – ith attribute = distance of database
object to ith query example Qi
– result set interpretation: objects similar to all query examples yet
distinct (dissimilar to each other)
– dynamic schema, cannot be reduced to the classic skyline operator for
efficient skyline processing
• i.e., the coordinate system is established at query time
DATESO 2010, Štědronín - Plazy
Generic algorithm
for a hierarchic metric index
•
•
•
branch-and-bound algorithm (originally developed for R-tree
and classic/spatial skyline operator)
dynamic mapping of the metric space into L1 vector space (examples)
heuristics: data/regions processed in L1 order guarantee
no false dismissals
–
0)
1)
2)
3)
4)
a priority heap is used, storing index entries equipped
by MDDRs to be inspected (higher priority = lower L1 order of MDDR)
L1
L1
The algorithm:
The entry of the entire index is pushed on the heap (e.g., M-tree root node).
An entry with the lowest L1 distance of its MDDR is popped from the heap.
If the entry contains just one data object (e.g., entry in an M-tree leaf), it is added to the
skyline set, while removing all entries from the heap dominated by the entry. Jump to 1.
If the entry is a region (e.g., entry in an M-tree inner node), its child node is fetched. The
MDDRs of the child node’s entries are checked for dominance by the already determined
skyline set, while the dominated ones are filtered from further processing.
The MDDRs of the non-filtered child entries are derived, while those not dominated by the
current skyline set are pushed into the heap. Jump to 1.
DATESO 2010, Štědronín - Plazy
M-tree
• metric index based on B+-tree
• inner node contains routing
entries
– ball regions (object and radius) +
distance to parent region + pointer
to subtree
• leaf node contains ground entries
– object + distance to parent region
• 2 types of filtering by querying
– parent filtering (cheap)
• stored distance to parent is used
– basic filtering (expensive)
• distance computation needed
DATESO 2010, Štědronín - Plazy
MSQ implementation using M-tree
• uses the generic algorithm enhanced by specific M-tree
MDDRs, mapping the M-tree regions from metric space into
L1 vector space (dimensions are distances of data/regions to
the query examples Qi)
• 2 types of M-tree MDDR
– Par-MDDR
• the mapped oversized region ball
(using the distance to parent)
– B-MDDR
• the mapped
region ball
DATESO 2010, Štědronín - Plazy
PM-tree
•
•
•
•
combination of M-tree and pivot tables (LAESA)
M-tree balls reduced by rings centered in global pivots Pi
routing and ground entries store also the ring radii
enhanced filtering
– cheaply in pivot space (mapping of data/balls into L∞ vector space)
• mapping of the query object into the pivot space is the only extra computation costs
– if not filtered out in pivot space, regular M-tree filtering
DATESO 2010, Štědronín - Plazy
Paper contribution:
PM-tree MSQ implementation
• B-MDDR, Par-MDDR (inherited from M-tree)
• Piv-MDDR
– using PM-tree rings the MDDR can be tightened
– for each dimension (example Qi) the maximal lower bound and minimal
upper bound distance to the region is found (to the rings intersection)
• pivot skyline
– skyline initialized by pivots
mapped to the L1 space
– heavy optimization
(reduction of heap size)
• deferred heap processing
– reinsertions into heap to save distance computations
DATESO 2010, Štědronín - Plazy
Experiments
• subset of the CoPhIR database, one million 76-dimensional tuples
representing 2 MPEG7 features on flickr images, Euclidean distance used
• Polygons database, 250k 30-dimensional tuples representing 5-15 vertex
2D polygons, Hausdorff distance used
• average over 200 metric skyline queries
• each metric skyline query defined by 2-5 query examples
DATESO 2010, Štědronín - Plazy
Experiments
DATESO 2010, Štědronín - Plazy
Experiments
DATESO 2010, Štědronín - Plazy
Conclusions
• PM-tree based metric skyline query implementation
– up to 2x faster in terms of distance computations and I/O
cost (wrt original M-tree implementation)
– up to 20x faster in terms of heap operations
– needs up to 20x less space for the heap
Thank you for your attention!
Questions?
DATESO 2010, Štědronín - Plazy