On Efficient Part-match Querying of XML Data

Download Report

Transcript On Efficient Part-match Querying of XML Data

www.fei.vsb.cz
On Efficient Part-match
Querying of XML Data
Michal Krátký, [email protected]
Marek Andrt, [email protected]
Department of Computer Science
VŠB–Technical University of Ostrava
Czech Republic
DATESO 2004
Contents





Introduction – XML, query languages,
indexing XML data, part-match querying.
Multi-dimensional approach to indexing XML
data.
Extension of the multi-dimensional approach
for keyword-based querying.
Index data structures.
Preliminary experimental results.
2/21
Introduction




Native XML database. Set of documents is a
database, DTD (XML Schema) is its database
schema.
XML query languages (XPath, XQL,
XQuery,…).
A common feature is a possibility to formulate
paths in the XML graph (regular path
expressions, XPath axes and so on).
Approaches based on: relational decomposition,
trie, multi-dimensional, signatures and so on.
3/21
Part-match querying XML data



Some approaches for keyword or
phrase based searching were
published: XQuery-IR (WebDb’02),
XKeyword (ICDE’03) and so on.
Knowledges from IR are applied.
Query languages contain operators for
matching term occurrence. For
example contains(), ~=.
4/21
Multi-dimensional approach to
indexing XML data
A graph is a set of the
paths. XML document is
decomposed to paths and
labelled paths.
 labelled path: lp ∈ XLP:
s0,s1,...,slPN
 path: p ∈ XP:
idU(u0),idU(u1),...,idU(ulLP),s
idU(ui) – unique number
of a node ui

5/21
Indexes



Term index – a storage of strings si of
an XML document and their idT(si).
Labelled path index – a storage of
points representing labelled paths.
Path index – a storage of points
representing paths.
6/21
Example
labelled path index, path index


books,book,id; books,book,title and
books,book,author. Points (0,1,2); (0,1,4) and
(0,1,6) are created using idT of element and attribute
names, idLP = 0, 1 and 2.
For example, the path to value The Two Towers. The
labelled path books,book,title with idLP 1 belongs.
Vector (1,0,1,3,5) is created using idLP, unique numbers
idU of elements, and idT of the term.
7/21
Query for values of elements
and attributes


XPath query: books/book[author=“Joseph Heller”]
3 phases of a query processing, finding:
● idT of terms from the term index,
● idLP 2 of labelled path books,book,author from the
labelled path index: point query (0,1,6),
● points from the path index: range query
(2,0,0,0,12)×(2,max,max,max,12). 8/21
Enhanced querying


XPath axes are processed by a range query
or sequence of range queries. For example
axis descendent: (0,idU(u0),…,idU(ul-1),
idU(u),0,…, 0):(maxD,idU(u0),…,idU(ul-1), idU(u),
maxD,…,maxD).
Regular path expression. For example
//title[name=‘Chaudhri’] is processed
by a complex range query. The query is
possible to process in one run in the multidimensional data structure.
9/21
Comparison of approaches



Mainline approaches (XISS, XPath
Accelerator) index single element
(attribute). For example query
/e1[e2=‘dog’] is processed by joining
single results.
Result formatting. For example a result
of the query //name is all matched
subtree.
Operation Update and Insert are
simple possible.
10/21
Keyword-based searching

Motivation:
/PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE



Path-Labelled Path-Term (PLT) index is
added.
The index indexes an 3-dimensional space:
(idP, idLP, idT).
idP is added into the point representing path:
(idP,idLP,idU0,idU1,…,idUl,s).
11/21
Path-Labelled Path-Term index
Example
12/21
Query processing plan
Example
13/21
Index data structures
Paged and balanced multi-dimensional data
structures – (B)UB-trees, variants of Rtrees.
 Problems:
● indexing points with different dimensions.
● narrow range query – the signature is
applied for efficient processing – Signature
R-tree.
 Efficient processing of the complex range
query.

14/21
Efficient processing the
complex range query


Complex range query = sequence of
range queries: qb1,qb2,…,qbn.
The query is possible to process in
one run in the multi-dimensional data
structure.
15/21
Experimental results

Protein Sequence Database XML document:
● the document size is 683MB,
● number of elements: 21,305,818,
● number of attributes:1,290,647.
● maximal length of path: 7.

BUB-forest, R*-forest, Signature BUB-tree and
R*-tree. Index structures: trees indexing spaces
of dimension n=7 and n=9.
16/21
Experimental results
Queries: ProteinDatabase/ProteinEntry/[reference/refinfo/
authors/author='Smith, E.L.']
17/21
Experimental results
Regular path expression




Query: //uid='89071748' , 5 labelled
paths were matched.
Naive processing the complex range
query: DAC: 368
Efficient processing the complex range
query: DAC: 139
Time: 0.03s, Improvement: 2.5x
18/21
Preliminary experimental results
Keyword-based searching

othello.xml:
● document size is 250kB,
● maximal length of the path: 6
● number of paths: 4,967
● number of labelled paths: 13
● number of terms: 8,744
● PLT index: 27,127
19/21
Preliminary experimental results
Keyword-based searching

Query:
/PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE




Labelled path index: result size: 1,
DAC: 3
PLT index: result size: 1, DAC: 3
Path index: result size: 1, DAC: 13
Path index: result size: 1, DAC: 4
20/21
Conclusion





Θ(m × log n), Θ(c × m × log n) vs. Θ(m1 × m2),
m1 ,m2 ≥ m.
Efficient processing a query with AND condition.
Signature is applied.
Multi-dimensional approach for term searching
may be applied (e.g. *comp*).
The update operation of XML documents.
Comparison with another approaches for test
collections (INEX, XMark, …).
http://www.cs.vsb.cz/arg
21/21
References



M. Krátký, J. Pokorný, V. Snášel: Implementation of XPath
Axes in the Multi-dimensional Approach to Indexing XML
Data. Accepted at International Workshop on Database
Technologies for Handling XML information on the
Web, DataX, Int'l Conference on EDBT, Heraklion - Crete,
Greece, 2004.
M. Krátký, J. Pokorný, T. Skopal, V. Snášel: The
Geometric Framework for Exact and Similarity Querying
XML data. In Proceedings of EurAsia-ICT 2002. Shiraz,
Iran, Springer Verlag, LNCS 2510.
M. Krátký, T. Skopal, and V. Snášel: Multidimensional
Term Indexing for Efficient Processing of Complex
Queries. Kybernetika, Journal of the Academy of Sciences
of the Czech Republic, 2004, accepted.
Paths, labelled paths

Paths 0,1,2,’003-04212’; 0,5,6,’001-00863’
and 0,9,10,’045-00012’ belong to the labelled
path books,book,id,
...

Paths 0,1,4,’J.R.R. Tolkien’; 0,5,8,’J.R.R.
Tolkien’ and 0,9,12,’Joseph Heller’ belong to
the labelled path books,book,author.
Complex queries

Query for values and XPath axis processing,
e.g. books/book[author='Joseph Heller']/title
● Combination of above described techniques:
query for value, XPath axis processing.

Regular path expression queries
for example: books//author
● A sequence of range queries processes this
query in the path and labelled path index: books,
author - books,*,author - books,*,…,*,author.
(B)UB-tree, R-tree
UB-tree
B-tree
Z-address
Narrow range query –
signature multi-dimensional ds



Regions intersecting a query hyper box are
searched, O(NI × logc n).
Ratio cR of relevant NR and intersect NI regions
≪ 1 with an increasing dimension.
Signatures are applied to better filtration of
irrelevant regions – signature md structures.
Signature R-tree
Experimental results
Queries: ProteinDatabase/ProteinEntry/[reference/refinfo/
authors/author='Smith, E.L.']
Experimental results