Updating XML Efficient Storage of XML data

Download Report

Transcript Updating XML Efficient Storage of XML data

Efficient XML Storage,
Query, and Update
Shi Xu
Heng Yuan
Spring 2004 CS240B
Prof. Zaniolo
XML Storage Methods



Flat Streams
Metamodeling
Mixed
Redundant
 Hybrid

Method Covered


“Efficient storage of XML data” covers hybrid
method using a custom made storage system
called Natix.
“Efficient relational storage and retrieval of
XML documents” covers Metamodeling using
their Monet database.
Natix Overview


Natix is an efficient, native repository for
storing, retrieving and managing XML
documents.
It supports tree-structured objects like XML
documents at low architecture level.
Natix architectural overview
Logic Model



Tree is often used in logic model of
semistructured data.
Each non-leaf node is labeled with a symbol
taken from an alphabet DTD.
Leaf nodes can be labeled as the data itself.
A sample XML with its associated
logical tree
Example XML:
<SPEECH>
<SPEAKER>OTHELLO</SPEAKER>
<LINE>Let me see your eyes;</LINE>
<LINE>Look in my face.</LINE>
</SPEECH>
Physical Model

Object Content:
Node and objects are used interchangeably.
 A record contains a set of nodes/objects.
 Aggregate nodes are inner nodes of the tree. They
contain their respective child nodes.
 Literal nodes are leaf nodes containing an
uninterpreted stream of bytes, like text strings,
graphics, etc.
 Proxy nodes are nodes which point to different
records.

Node Representation




Whole documents (or subtrees of documents)
can be stored in one record.
Each record contains exactly one subtree.
The root nodes of each record’s subtree are
called standalone objects, other nodes are
called embedded objects.
The record size has an upper limit, the page
size.
Large Trees



For a large tree, physical model must provide a
mechanism for distributing data trees over
several pages.
Method 1: “flat” representation. It wastes the
available structural information about the data.
Method 2: split large objects based on the
underlying tree structure.

Use proxy objects to connect subtrees of the large
object residing in other records.
A Sample Distribution of logical nodes on records




Proxies (p1, p2)
Helper aggregate objects (h1, h2)
Scaffolding objects include proxies and helper aggregates.
Facade objects (f i)
Dynamic maintenance of an efficient storage



The principle problem is that a record
containing a subtree can grow larger than a page
if a node is added or grows.
Subtree contains in the record has to be
partitioned into several subtrees.
Scaffolding nodes link the new records together
in the physical tree.
Multiway tree representation of
records
Tree Growth Procedure


Step 1: Determine the record r into which the node has
to be inserted.
Step 2: If there is not enough on the page, try to move
r. If the record still does not fit, split the record:




(a) Determine the separator by recursively descending into
the r’s subtree
(b) Distribute the resulting partitions onto records
(c) Insert the separator into the parent record, recursively
calling this procedure
Step 3: Insert the new node
Determining the Insertion Location


There are several possibilities to insert a new node f n into the
physical tree.
This choice can be determined by a configuration parameters.
Determining the separator



Separator – a tree structure with proxies
pointing to the new records to indicate where
which part of the old record was moved.
Consists of all the nodes on the path from d to
the subtree’s root.
Partition the tree into left partition L, right
partition R and Separator S.
A record’s subtree before a split
occurs
Splitting a Record

Distributing the nodes on records
After determining the partitioning, the contents of
the record has to be distributed onto new records.
 Each resulting subtree is then stored in its own
record, called partition records.


Inserting the separator

The separator is moved to the parent record.
Split Algorithm



Find a node d, such that the resulting L and R.
The ratio between the sizes of L and R is
determined by a configuration parameter (split
target).
Another configuration parameter Split tolerance
specifies the minimum size for the subtree of d.
It is used to prevent fragmentation.
Record assembly for the subtree
from previous figure
Physical storage of the tree
represented inside one record
Performance Test



XML markup version of Shakspeare’s play with
8MB with 320,000 nodes.
Pentium-II 333Mhz with 128MB under
Windows NT4.0 with IBM DCAS 34330 disk.
The implementation of the record and tree
storage managers was done in C++.
Test Conditions




Record:Node 1:1 indicating smart record
splitting being inhibited.
Record:Node 1:n indicating that the algorithm
has full control over distribution of nodes on
records.
Incremental updates distributed over the whole
document.
Updates in pre-order (append).
Insertion
Full tree traversal
Queries



Retrieve all speakers in the third act and second scene
of every play, which means it accesses all leaf nodes of
a certain type in one selected subtree of the document.
Recreate the textual representation of the complete first
speech in every scene, hence reading a lot of small
contiguous fragments of each document.
A simple path query was evaluated by reading only the
opening speech of each play.
Selection on leaf nodes of document
subtree
Small contiguous fragments
Single path for each document
Space requirements
Monet Model



XML document is decomposed into binary
relations.
Efficient for storage and retrieval of XML
documents in a relational database.
The database used is their Monet database
server which supports the Monet model.
Some Definitions




An XML document is a rooted tree
d = (V, E, r, labelE, labelA, rank) with nodes V and
edges EVV and a distinguished node rV.
The function labelE : Vstring assigns labels to nodes
labelA : Vstringstring assigns pairs of strings,
attributes and their values, to nodes.
rank : Vint establishes a ranking to allow for an order
among nodes with the same parent node.
A sample XML document
<bibliography>
<article key=“BB88”>
<author>Ben Bit</author>
<title>How to Hack</title>
</article>
<article key=“BK99”>
<editor>Ed Itor</editor>
<author>Bob Byte</author>
<author>Ken Key</author>
<title>Hacking & RSI</title>
</article>
</bibliography>
Syntax Tree of the Previous XML
Document
Monet Transform

Given an XML document d, the Monet
transform is a quadruple Mt(d)=(r,R,A,T) where
R is the set of binary relations that contain all
associations between nodes;
 A is the set of binary relations that contain all
associations between nodes and their attribute
values, including character data;
 T is set of binary relations that contain all pairs of
nodes and their rank;
 r is the root of the document;

Monet Transform of the Example
Document
OQL-like query
Query Handling
Assessment



Implemented within the Monet database server
Tested on 550 MHz Silicon Graphics 1400
Server with 1 GB main memory.
Also used Sun UltraSparc-IIi with 360 MHz and
256 MB main memory to contrast with a related
work.
Size of document collections in XML
and Monet XML format
Scaling of Document
•Scaled the ACM
Anthology from 30 to
3x106 which
corresponds to XML
source size between
10KB and 1GB.
•Run 4 queries consisting
of path expressions of
length 1 through 4 for
various sizes of the
anthology.
Response Time vs. Result Size
Comparison of response time for
query set of SYU, another method
for storage/retrieval of XML
document.
Compare/Contrast Natix and Monet






Natix uses custom database while Monet is built
on top of relational database
Neither uses DTD.
Natix focuses on XML query as well as update.
Monet focuses on XML storage and query.
Though lacking equivalent test, Monet is faster
than Natix on query.
Monet seems to be more space efficient than
Natix as well.
References


“Efficient storage of XML data” By Carl-Christian
Kanne, et al. ICDE 2000
http://citeseer.nj.nec.com/kanne99efficient.html
“Efficient Relational Storage and Retrieval of XML
Documents” By Albrecht Schmidt, et al. WebDB 2000
http://www.research.att.com/conf/webdb2000/progra
m.html