ViST: a dynamic index method for querying XML data by tree
Download
Report
Transcript ViST: a dynamic index method for querying XML data by tree
ViST: a dynamic index method
for querying XML data by tree structures
Authors: Haixun Wang, Sanghyun Park,
Wei Fan, Philip Yu
Presenter: Elena Zheleva, November 2003
Overview
Modeling XML Queries
Structure-encoded sequences
Indexing
ViST
Experimental Results
Modeling XML Queries
DTD of purchase records:
(!ELEMENT purchases (purchase*))
(!ELEMENT purchase (seller, buyer))
(!ATTRIST seller ID ID location CDATA name CDATA)
(!ELEMENT seller (item*))
(!ATTRIST buyer ID ID location CDATA name CDATA)
(!ELEMENT item (item*))
(!ATTRIST item name CDATA manufacturer CDATA)
Modeling XML Queries
Focus in XML query language design:
ability to express complex structural or
graphical queries
Modeling XML Queries
Querying XML data = finding sub
structures of the data graph that match
the sequence
Structure-encoded sequences: a
sequential representation of both XML
data and XML queries
Structure-Encoded
Sequences
Structure-Encoded Sequences
Maps the data and the queries
Matches the subsequence
Purpose: to avoid as many join
operations as possible
Def. Sequence of (symbol, prefix) pairs
Mapping Data
Represent XML document/tree in
preorder
Represent in structure-encoded seq
Mapping Queries
Benefit of sequence matching: query
gets processed as whole
Path Expression
Structure-Encoded Sequences
Query
Data
Querying XML
through Structure-Encoded Sequence
Matching
Indexing
Role of Indexing
To provide an algorithm to perform this
sequence matching
Desired features for algorithm:
– Efficient support for subsequence matching
– Use well-supported DB indexing
techniques such as B+ trees
– Allow dynamic index insertion
What is indexing useful for
Auxiliary access structures
– Used to speed up the retrieval of records
– In response to certain search conditions
Provide efficient support for arbitrary
structured queries
– Using wild-cards // and *
Indexing
State-of the-art approaches
– Indexes on paths
– Indexes on nodes
– Indexes on both (structures) – ViST
ViST
Algorithms
Naïve Algorithm based on Suffix Trees
RIST: Relationships Indexed Suffix Tree
ViST: Virtual Suffix Tree
Algorithm Using Suffix Trees
Suffix Tree: a compact index to all
distinct, contiguous substrings of a string
D-Ancestorship – in XML doc tree
Through structure-encoded sequence
S-Ancestorship – in suffix tree
Example Using Suffix Trees
Algorithm Using Suffix Trees
Searches
– first by S-Ancestorship: searching under
suffix tree
– then by D-Ancestorship: matching nodes
and prefixes
Disadvantages:
– Costly – traverse large portion of subtree
– Most commercial DBMSs do not support
RIST: Indexing by AncestorDescendant Relationships
Jumps directly to the nodes Y to which
X is both a D-Ancestor and S-Ancestor
Index Construction: uses B+ trees
RIST: Indexing by AncestorDescendant Relationships
Subsequence Matching
Determine D-Ancestorship by prefixes
Determine S-Ancestorship by label
<nx,sizex>
x – suffix tree node (root of S-tree)
nx – prefix traversal order
sizex – number of descendants
ViST: the Virtual Suffix Tree
Same sequence algorithm as RIST
BUT supports dynamic insertions
Uses dynamic method to assign labels
Once assigned, the labels are fixed and are
not affected by subsequent data insertion or
deletion
Labeling the suffix tree w/o building it
Relies on statistical information about the
XML data
ViST: the Virtual Suffix Tree
Index structure contains the sequence:
Sequence to be inserted:
Dynamic scope of x = <nx, sizex,kx>
ViST: the Virtual Suffix Tree
Experimental Results
Datasets used
– DBLP: CS bibliography DB
• 289,627 records/publications
• Each publication – tree of max depth 6
• Avg length of structure-encoded seq = 31
– XMARK
• 1 record
• Complicated tree structure
– Synthetic
Experimental Results
Comparison Methods
– Index Fabric Algorithm – XML paths
– XISS – uses nodes as basic query unit
– ViST – appx. 1/10 of time to perform
queries due to (multiple) join operations
Experimental Results - remove
Index Structure and Size (1/3 less from
suffix tree)
– DocId B+ Tree – N elements
– Combined D-ancestor and S-ancestor B+
tree - N x L elements
Index Construction
Conclusion
XML Queries = Subsequence Matching
Advantages of ViST – algorithm for
subsequence matching
– Avoids expensive join operations
– Index on both content and structure of XML
documents
– B+ trees – supported by disk-based data
– Dynamic data insertion and deletion