ViST: a dynamic index method for querying XML data by tree

Download Report

Transcript ViST: a dynamic index method for querying XML data by tree

ViST: a dynamic index method
for querying XML data by tree structures
Authors: Haixun Wang, Sanghyun Park,
Wei Fan, Philip Yu
Presenter: Elena Zheleva, November 2003
Overview
Modeling XML Queries
 Structure-encoded sequences
 Indexing
 ViST
 Experimental Results

Modeling XML Queries
DTD of purchase records:
(!ELEMENT purchases (purchase*))
(!ELEMENT purchase (seller, buyer))
(!ATTRIST seller ID ID location CDATA name CDATA)
(!ELEMENT seller (item*))
(!ATTRIST buyer ID ID location CDATA name CDATA)
(!ELEMENT item (item*))
(!ATTRIST item name CDATA manufacturer CDATA)
Modeling XML Queries

Focus in XML query language design:
ability to express complex structural or
graphical queries
Modeling XML Queries
Querying XML data = finding sub
structures of the data graph that match
the sequence
 Structure-encoded sequences: a
sequential representation of both XML
data and XML queries

Structure-Encoded
Sequences
Structure-Encoded Sequences
Maps the data and the queries
 Matches the subsequence
 Purpose: to avoid as many join
operations as possible
 Def. Sequence of (symbol, prefix) pairs

Mapping Data

Represent XML document/tree in
preorder

Represent in structure-encoded seq
Mapping Queries
Benefit of sequence matching: query
gets processed as whole
 Path Expression

Structure-Encoded Sequences

Query

Data
Querying XML

through Structure-Encoded Sequence
Matching
Indexing
Role of Indexing
To provide an algorithm to perform this
sequence matching
 Desired features for algorithm:

– Efficient support for subsequence matching
– Use well-supported DB indexing
techniques such as B+ trees
– Allow dynamic index insertion
What is indexing useful for

Auxiliary access structures
– Used to speed up the retrieval of records
– In response to certain search conditions

Provide efficient support for arbitrary
structured queries
– Using wild-cards // and *
Indexing

State-of the-art approaches
– Indexes on paths
– Indexes on nodes
– Indexes on both (structures) – ViST
ViST
Algorithms
Naïve Algorithm based on Suffix Trees
 RIST: Relationships Indexed Suffix Tree
 ViST: Virtual Suffix Tree

Algorithm Using Suffix Trees
Suffix Tree: a compact index to all
distinct, contiguous substrings of a string
 D-Ancestorship – in XML doc tree
 Through structure-encoded sequence
 S-Ancestorship – in suffix tree

Example Using Suffix Trees
Algorithm Using Suffix Trees

Searches
– first by S-Ancestorship: searching under
suffix tree
– then by D-Ancestorship: matching nodes
and prefixes

Disadvantages:
– Costly – traverse large portion of subtree
– Most commercial DBMSs do not support
RIST: Indexing by AncestorDescendant Relationships
Jumps directly to the nodes Y to which
X is both a D-Ancestor and S-Ancestor
 Index Construction: uses B+ trees

RIST: Indexing by AncestorDescendant Relationships
Subsequence Matching
 Determine D-Ancestorship by prefixes
 Determine S-Ancestorship by label
<nx,sizex>
 x – suffix tree node (root of S-tree)
 nx – prefix traversal order
 sizex – number of descendants

ViST: the Virtual Suffix Tree






Same sequence algorithm as RIST
BUT supports dynamic insertions
Uses dynamic method to assign labels
Once assigned, the labels are fixed and are
not affected by subsequent data insertion or
deletion
Labeling the suffix tree w/o building it
Relies on statistical information about the
XML data
ViST: the Virtual Suffix Tree
Index structure contains the sequence:
Sequence to be inserted:
Dynamic scope of x = <nx, sizex,kx>
ViST: the Virtual Suffix Tree
Experimental Results

Datasets used
– DBLP: CS bibliography DB
• 289,627 records/publications
• Each publication – tree of max depth 6
• Avg length of structure-encoded seq = 31
– XMARK
• 1 record
• Complicated tree structure
– Synthetic
Experimental Results

Comparison Methods
– Index Fabric Algorithm – XML paths
– XISS – uses nodes as basic query unit
– ViST – appx. 1/10 of time to perform
queries due to (multiple) join operations
Experimental Results - remove

Index Structure and Size (1/3 less from
suffix tree)
– DocId B+ Tree – N elements
– Combined D-ancestor and S-ancestor B+
tree - N x L elements

Index Construction
Conclusion
XML Queries = Subsequence Matching
 Advantages of ViST – algorithm for
subsequence matching

– Avoids expensive join operations
– Index on both content and structure of XML
documents
– B+ trees – supported by disk-based data
– Dynamic data insertion and deletion