Path Queries on Compressed XML


Path Queries on Compressed XML
Peter Buneman, Martin Grohe, Christoph Koch
Laboratory for Foundations of Computer Science
University of Edinburgh, UK
Murat Şensoy
Can Gürel
WHAT IS SHOWN IN THIS STUDY
XML tree structure can be effectively compressed and manipulated using techniques derived from symbolic model checking.
Specifically, concise representations of document tree structures based on sharing subtrees are highly effective.
Compressed structures can be queried directly and efficiently through a process of manipulating selections of nodes and partial decompression.
MOTIVATION
The ability to store and manipulate large portions of the structure of very large XML documents in main memory is crucial to the development of efficient, scalable native XML databases and query engines.
Extracting the text, storing the compressed text separately, and using the remaining tree skeleton for query processing in memory is not sufficient for very large documents whose skeletons do not fit in memory.
Solution: compress the skeleton so that path queries are possible directly on the compressed skeleton.
METHOD
Our compression technique of XML
skeletons by sub-tree sharing can be seen
as a direct generalization of the compression
of Boolean functions into Ordered Binary
Decision Diagrams. This enables us to
transfer the efficient algorithms for OBDDs to
our setting and thus provides the basis for
new algorithms for evaluating path queries
directly on compressed skeletons.
[Figure: (a) tree skeleton; (b), (c) compressed trees. The original skeleton can be recovered by the appropriate depth-first traversal of the compressed skeleton.]
This compressed representation is natural and exploits an interesting property of most practical XML documents:
XML documents possess a regular structure, in that sub-tree structures in a large document are likely to be repeated many times. Consider an example of extreme regularity, an XML-encoded relational table with R rows and C columns: the skeleton has size O(C·R), while the compressed skeleton as in Figure (b) has size O(C + R). This is further reduced to O(C + log R) by merging multiple edges as in Figure (c).
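These bounds are easy to check with a small calculation. Below is a minimal sketch (illustrative Python, not the authors' code) that builds the skeleton of such a table as a DAG with shared subtrees and counts nodes and edges; the further reduction to O(C + log R) comes from writing runs of identical consecutive edges once, annotated with their multiplicity.

```python
# Skeleton sizes for an XML-encoded table <table><row><col/>...</row>...</table>.
def shared_skeleton_size(rows, cols):
    interned = {}                            # (tag, child ids) -> node id
    def intern(tag, children):
        key = (tag, children)
        if key not in interned:
            interned[key] = len(interned)
        return interned[key]
    col = intern("col", ())                  # all cells share one leaf node
    row = intern("row", (col,) * cols)       # all rows share one subtree
    intern("table", (row,) * rows)
    nodes = len(interned)
    edges = sum(len(children) for (_tag, children) in interned)
    return nodes, edges

def tree_skeleton_size(rows, cols):
    nodes = 1 + rows * (1 + cols)            # root + rows + cells, no sharing
    return nodes, nodes - 1

print(tree_skeleton_size(1000, 10))          # (11001, 11000): O(C*R) edges
print(shared_skeleton_size(1000, 10))        # (3, 1010):      O(C+R) edges
```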
Compressed skeletons are easy to compute and allow us to query
the data in a natural way. Each node in the compressed skeleton
represents a set of nodes in the uncompressed tree. The purpose
of a path query is to select a set of nodes in the uncompressed
tree. However this set can be represented by a subset of the nodes
in a partially decompressed skeleton
Coming up...
XML as a universal data source
Handling XML
XML Storage
XML databases
Basic definitions
Decompression/Compression
Queries on XML
Query specification and path languages
XPath
Query evaluation
Implementation
Experiments
Conclusion
XML Storage
Infeasible approaches:
Subtree-based indexing/caching mechanisms
Using a relational DBMS to keep node information in tuples
Better: separate the string data from the document structure
String data stored and indexed using conventional approaches
Document structure stored as a tree ("skeleton") with nodes carrying element and attribute names
Used in XMill as a compression scheme
XML Storage (cont’d)
String data needed for localized processing
Easily compressed by conventional methods
Skeleton needed for the navigational aspect of queries (heavy performance impact)
Usually small, can fit wholly in main memory
Can be large for complex databases and will then not fit in main memory. Can we compress it?
How to avoid compression/decompression overhead (time and space) during query evaluation?
Proposed Compression Scheme
Based on sharing of common subtrees
Fits most practical XML documents well, thanks to their structural regularity
Independent of the actual DTD
Generic structure, capable of expressing information other than just elements and attributes (e.g. string matches, query results)
Original document structure is preserved
Bisimulation: each node in the compressed tree corresponds to a set of nodes in the uncompressed tree
Partial decompression is possible
Naturally extends to query processing on compressed skeletons
Example
Common subtrees are shared
Edges are ordered
In (c), consecutive multiple edges are joined and annotated with their multiplicity
Definitions and Notation
Vertex set V = {v, w, ...}
The set of vertices (nodes of the (possibly compressed) XML document tree)
Function γ: V → V*
Assigns to every vertex its sequence of children
Can be visualized as a directed graph
Normally the graph is rooted and acyclic (a DAG)
(v, w) is an edge iff w occurs in γ(v)
Edges are ordered: if w occurs at the i-th position of γ(v), we write v →i w
Definitions and Notation (cont’d)
Some sets of vertices may be arbitrarily distinguished and given names
In this work, sets are used to represent XML tags or derived properties (string matches, subquery results, etc.) for compression operations and queries
In the example:
V = {v1, v2, v3, v4, v5}
γ(v1) = v2 v4 v4
γ(v2) = v3 v5 v5 v5
γ(v3) = ε
γ(v4) = v3 v5
γ(v5) = ε
Sets are Sbib, Sbook, Stitle, Spaper, Sauthor
Edges: v1 →1 v2, v1 →2 v4, v1 →3 v4; v4 →1 v3, v4 →2 v5; v2 →1 v3, v2 →2 v5, v2 →3 v5, v2 →4 v5
Definitions and Notation (cont’d)
Instance I = (V, γ, root, S1, ..., Sn)
Represents an XML document (in compressed or uncompressed form)
Equivalent instances are representations of the same XML document
Modeled as a tuple comprising:
The DAG components
V: the set of vertices
γ: the function assigning each vertex its sequence of children
root: the root vertex
The "schema", a finite set of unary relation names Σ = {S1, ..., Sn}, where each Si is a subset of V
The schema names the instance type (a Σ-instance)
Definitions and Notation (cont’d)
Edge path
For vertices v0 and vn and intermediate vertices v1, ..., vn−1 such that
v0 →i1 v1 →i2 ... →in vn,
the integer sequence i1 i2 ... in is called an "edge path" from v0 to vn
For each vertex v ∈ V:
Π(v) = {P | P is an edge path from the root to v}
And for a set S ⊆ V:
Π(S) = ⋃v∈S Π(v)
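For illustration, edge paths can be enumerated directly from the child function. The sketch below uses invented names and the example instance from the previous slide.

```python
def edge_paths(children, root, target):
    """All integer sequences i1...in labelling a path from root to target."""
    paths = []
    def walk(v, path):
        if v == target:
            paths.append(tuple(path))
        for i, w in enumerate(children.get(v, ()), start=1):
            walk(w, path + [i])          # terminates: the instance is acyclic
    walk(root, [])
    return paths

# gamma from the example: gamma(v1) = v2 v4 v4, gamma(v4) = v3 v5, ...
children = {"v1": ["v2", "v4", "v4"], "v2": ["v3", "v5", "v5", "v5"],
            "v4": ["v3", "v5"], "v3": [], "v5": []}
print(edge_paths(children, "v1", "v3"))  # [(1, 1), (2, 1), (3, 1)]
```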
Definitions and Notation (cont’d)
Two -instances I and J
are said to be
“equivalent” if
(VI)= (VJ)
(SI)= (SJ) for all S
The uncompressed
instance is a tree, so in
a tree there is exactly
one edge path to each
vertex!
Decompression
For each instance I there is a unique
tree instance T(I) equivalent to I
This tree instance is the uncompressed
representation of the same XML
document
How to obtain T(I) from I?
Each node in T(I) corresponds to an edge
path in I
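A minimal sketch of this unfolding (toy representation with illustrative names): every edge path gives rise to exactly one tree node, which is also why a heavily shared instance may grow substantially when fully decompressed.

```python
def unfold(children, sets, v, path=()):
    """Return the subtree of T(I) rooted at the occurrence of v reached via path."""
    return {
        "path": path,                                   # unique within T(I)
        "tags": {s for s, members in sets.items() if v in members},
        "children": [unfold(children, sets, w, path + (i,))
                     for i, w in enumerate(children.get(v, ()), start=1)],
    }

children = {"bib": ["book", "paper", "paper"],
            "book": ["title", "author"], "paper": ["title", "author"],
            "title": [], "author": []}
sets = {"S_book": {"book"}, "S_paper": {"paper"}}
tree = unfold(children, sets, "bib")
print(len(tree["children"]))             # 3: the shared "paper" vertex unfolds twice
```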
Definitions and Notation (cont’d)
~: bisimilarity relation
An equivalence relation on a Σ-instance I with vertex set V such that, for all v, w ∈ V with v ~ w:
For all S ∈ Σ, v ∈ S ⟺ w ∈ S
For all i, if v →i v' then w →i w' and v' ~ w' for some w' ∈ V
Compression
For an instance I and a bisimilarity relation ~ on I, construct I|~ as follows:
Replace the vertices of V_I by their equivalence classes with respect to ~ (all vertices v, w ∈ V with v ~ w are merged into a single vertex)
For all instances I and bisimilarity relations ~ on I, I is equivalent to I|~
For every instance I, there is a bisimilarity relation ~ on T(I) such that I is isomorphic to T(I)|~
That is, it is always possible to obtain an instance isomorphic to I from T(I) via a bisimilarity relation ~
With an appropriate bisimilarity relation, T(I)|~ = I is a compressed representation of T(I)
Compression (cont’d)
Bisimilarity relations ~ on an instance I form a lattice
Topmost relation is the maximally compressing relation and
yields the equivalent minimal instance
Bottommost relation is equality relation (always fails for
distinct vertices, so provides no compression) and yields the
same instance (identity element)
Instance I is “minimal” if the only bisimilarity relation
on I is equality on V. Then, the bisimilarity relation ~
with I=T(I)|~ is the maximum element of the lattice
of bisimilarity relations on T(I).
Compression (cont’d)
For each instance I, there is exactly one minimal instance M(I) equivalent to I.
If T is an XML document (a tree), M(T) is its compression
Given I, M(I) can be computed in linear time
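A minimal sketch of such a linear-time minimization (illustrative representation; edge multiplicities are ignored): vertices are canonicalized bottom-up by the signature (set memberships, canonical children), and vertices with identical signatures, i.e. bisimilar vertices, are merged.

```python
def minimize(children, sets, root):
    canon = {}                                # old vertex -> canonical id
    by_signature = {}                         # signature  -> canonical id
    new_children = {}
    new_sets = {s: set() for s in sets}

    def visit(v):
        if v in canon:
            return canon[v]
        kids = tuple(visit(w) for w in children.get(v, ()))
        tags = frozenset(s for s, members in sets.items() if v in members)
        sig = (tags, kids)
        if sig not in by_signature:           # first vertex with this signature
            cid = len(by_signature)
            by_signature[sig] = cid
            new_children[cid] = list(kids)
            for s in tags:
                new_sets[s].add(cid)
        canon[v] = by_signature[sig]
        return canon[v]

    new_root = visit(root)
    return new_children, new_sets, new_root
```

Each vertex and each edge of the DAG is handled once, so with constant-time hashing the running time is linear in the size of the instance.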
Definitions and Notation (cont’d)
Schemas contain tag labels. Instead of
keeping all tags, we can keep only
those relevant (to a subquery)
Need a way to join “compatible” instances
Given schemas Σ' ⊆ Σ:
I' = I|Σ' is the "Σ'-reduct" of a Σ-instance I
S_I' = S_I for all S ∈ Σ'
I and I' have the same DAG
Definitions and Notation (cont’d)
Given schemas , , -instance I and instance J
I and J are “compatible” if I| and J| are
equivalent
A -instance K is the “common extension” of I
and J if K| is equivalent to I and K| is equivalent
to J
Common extension is defined only for compatible
instances
Given I and J, common extension of I and J can be
calculated in quadratic time
Used to merge results of subqueries
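A minimal sketch of the construction (illustrative names; "compatible" is taken to mean that both instances unfold to the same underlying tree): vertices of the common extension are pairs of vertices of I and J reached by the same edge paths, of which there are at most |I| · |J|, hence the quadratic bound.

```python
def common_extension(children_i, sets_i, root_i, children_j, sets_j, root_j):
    children_k = {}
    sets_k = {s: set() for s in list(sets_i) + list(sets_j)}
    memo = {}                                  # (vertex of I, vertex of J) -> vertex of K

    def visit(v, w):
        if (v, w) in memo:
            return memo[(v, w)]
        k = len(memo)
        memo[(v, w)] = k
        for s, members in sets_i.items():      # K carries the sets of both schemas
            if v in members:
                sets_k[s].add(k)
        for s, members in sets_j.items():
            if w in members:
                sets_k[s].add(k)
        # compatibility guarantees the child sequences have the same length
        children_k[k] = [visit(a, b) for a, b in
                         zip(children_i.get(v, ()), children_j.get(w, ()))]
        return k

    return children_k, sets_k, visit(root_i, root_j)
```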
QUERY EVALUATION
How to evaluate a Core XPath query on
a compressed instance?
First, what is a Core XPath query?
A location path is composed of location steps.
A location step has three parts:
An axis, which specifies the tree relationship between the nodes selected by the location step and the context node.
A node test, which specifies the node type.
Zero or more predicates, which use arbitrary expressions to further refine the set of nodes selected by the location step.
Example: child::chapter[child::title='Introduction'] (axis: child, node test: chapter, predicate: [child::title='Introduction'])
Selects the chapter children of the context node that have one or more title children with string-value equal to "Introduction".
EXAMPLE CORE XPATH QUERY:
/ descendant::a / child::b[child::c / child::d or not(following::*)]
The intuition in Core XPath, which
reduces the query evaluation
problem to manipulating sets rather
than binary relations, is to reverse
paths in conditions to direct the
computation of node sets in the query
towards the root of the query
tree.
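This set-based view is easy to state on an uncompressed tree: each axis step maps a set of context nodes to a set of result nodes, so an entire Core XPath query becomes a composition of set-to-set functions. A toy sketch of a single child step (invented representation, not the paper's evaluator):

```python
def child_step(children, tags, context, node_test):
    """Nodes reachable from 'context' by one child edge that pass the node test."""
    return {w for v in context for w in children.get(v, [])
            if node_test == "*" or tags[w] == node_test}

children = {0: [1, 2], 1: [3], 2: [4]}
tags = {0: "a", 1: "b", 2: "b", 3: "c", 4: "d"}
# the step child::b applied to the context {0} (the single <a> node)
print(child_step(children, tags, {0}, "b"))    # {1, 2}
```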
OPERATIONS ON COMPRESSED INSTANCES
[Figure: (a) tree skeleton; (b), (c) compressed trees, as in the earlier figure.]
In certain cases, decompression may be necessary to be able to represent the resulting selection. Our goal is to avoid full decompression when it is not necessary.
The idea of the algorithm(s) is to traverse the DAG of the input
instance starting from the root, visiting each node v only once.
We choose a new selection of v on the basis of the selection of
its ancestors, and split v if different predecessors of v require
different selections. We remember which node we have copied
to avoid doing it repeatedly.
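A minimal sketch of this traversal for a single downward child step on a compressed instance (illustrative names, not the authors' code): each vertex is copied at most once per required selection, so the instance at most doubles per step, matching the complexity discussion below.

```python
def child_step_compressed(children, root, context):
    """Select every occurrence whose parent occurrence is in 'context',
    splitting a shared vertex when different parents need different selections."""
    new_children = {}
    selected = set()
    copies = {}                         # (old vertex, selected?) -> new vertex

    def visit(v, flag):
        if (v, flag) in copies:         # this copy of v was already built
            return copies[(v, flag)]
        new_v = len(copies)
        copies[(v, flag)] = new_v
        if flag:
            selected.add(new_v)
        v_in_context = v in context     # children of a context vertex get selected
        new_children[new_v] = [visit(w, v_in_context)
                               for w in children.get(v, ())]
        return new_v

    new_root = visit(root, False)       # the root has no parent
    return new_children, new_root, selected
```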
COMPLEXITY AND DECOMPOSITION
In theory, this method of compression by sharing sub-trees can lead to an exponential reduction in instance size, and even to doubly exponential compression using edge multiplicities. Unfortunately, compressed trees may decompress exponentially in the worst case, even on very simple queries, because at each selection step the compressed instance can double in size.
However, surprisingly, the decompression is only
exponential in the size of the queries (but not the
data), which tend to be small.
Implementation
Key goals
Space efficiency (compression of XML) primary goal
Time efficiency (fast queries on compressed XML) secondary
goal
Implementation
DAGs represented as trees relaxed to allow multiple incoming pointers
A fast SAX parser written in C++ is used
Compression takes linear time (a single scan of the document)
A stack is used to keep partially formed vertices and to do string matches
A hash table is used to index finished vertices (hash function based on set memberships and children)
Instead of common extensions, recompression is used, trading time efficiency for space efficiency
No XML database; the document is simply reparsed for each experiment
No indices are implemented for string constraints
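A minimal sketch in the spirit of the description above (Python rather than the authors' C++, with invented names): a stack holds the partially formed vertices, and a hash table maps finished (tag, children) combinations to already existing vertices.

```python
class SkeletonCompressor:
    def __init__(self):
        self.finished = {}                  # (tag, child ids) -> vertex id
        self.children = {}                  # vertex id -> list of child ids
        self.stack = [("#document", [])]    # partially formed vertices

    def start_element(self, tag):           # SAX startElement event
        self.stack.append((tag, []))

    def end_element(self):                  # SAX endElement event
        tag, kids = self.stack.pop()
        key = (tag, tuple(kids))
        vid = self.finished.get(key)
        if vid is None:                     # first occurrence of this subtree
            vid = len(self.finished)
            self.finished[key] = vid
            self.children[vid] = kids
        self.stack[-1][1].append(vid)       # attach the shared vertex to its parent

c = SkeletonCompressor()
for tag in ("book", "title", None, "author", None, None):  # None closes an element
    c.end_element() if tag is None else c.start_element(tag)
print(len(c.finished))                      # 3 distinct vertices: title, author, book
```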
Experimental Results
Reference system
Dell Inspiron 8100 Laptop/256MB RAM/PIII 1GHz/Linux
Data
SwissProt protein database
DBLP
Penn Treebank (manually structured linguistic database from
the Wall Street Journal)
OMIM human gene and genetic disorder database
XMark generated auction data
Shakespeare’s collected works
1998 Major League Baseball statistics
Experimental Results (cont’d)
Queries
For each dataset
Q1: Tree pattern query returning the root if pattern is
matched. Only parent axis used, no decompression is
required.
Q2: Q1 reversed, selecting the nodes (not the root)
Q3: Descendant axis, conditions as well as string
constraints
Q4: Branching query trees
Q5: Q4 extended with the remaining axes
Experimental Results (cont’d)
(Table legend: "−" rows do not keep tag information; "+" rows keep tag information; the last column is the compression ratio, measured as the ratio of the numbers of edges.)
Conclusions
Successful XML compression and efficient querying
on compressed XML
Bisimulation as the basis for compression requires little overhead; execution time is competitive even with uncompressed processing
Sharing of common subtrees avoids recomputation and compresses large, regular XML data more efficiently
Separation of the skeleton from the string data, and its subsequent compression:
Global data fits in main memory
Less data needs to be processed for query evaluation
A general cached-chunk scheme is needed for true scalability as the data grows