On the Memory Requirements of XPath Evaluation over XML

Download Report

Transcript On the Memory Requirements of XPath Evaluation over XML

Buffering in Query
Evaluation over XML
Streams
Ziv Bar-Yossef
Technion
Marcus Fontoura
Vanja Josifovski
IBM Almaden Research Center
XML Document
1: <paper>
2: <section id = 1>
3:
<title>
4:
Intro
5:
</title>
6:
<content>
7:
bla bla bla
8:
</content>
9: </section>
10: <section id = 2>
11:
<title>
12:
Results
13:
</title>
14:
<content>
15:
yada yada yada
16:
</content>
17: </section>
18: <section id = 3>
19:
<title>
20:
Conclusions
21:
</title>
22:
<content>
23:
etc etc etc
24: </section>
25: <title>
26:
On the Complexity of
Database Queries
27: </title>
28: <author>
29:
Papadimitriou
30: </author>
31: <author>
32:
Yannakakis
33: </author>
34: </paper>
2
XML Document Tree
root
paper
section
author
id
title
content
1
section
id
3
Intro
Yannakakis
title
bla bla bla
section
id
2
author
content
title
Results
etc etc etc
Conclusions
Papadimitriou
title
content
yada yada yada
On the Complexity of
Database Queries
3
XPath Queries
/paper[author=“Papadimitriou”]/section[@id = “2” or title = “Intro”]/content
root
paper
section
author
id
1
title content
section
id
3
Intro
Yannakakis
bla bla bla
section
id
2
title
author
content
title
Results
etc etc etc
Conclusions
Papadimitriou
title
content
yada yada yada
On the Complexity of
Database Queries
4
XPath Queries
/paper[title != section/title]/author
root
paper
section
author
id
1
title content
section
id
3
Intro
Yannakakis
bla bla bla
section
id
2
title
author
content
title
Results
etc etc etc
Conclusions
Papadimitriou
title
content
yada yada yada
On the Complexity of
Database Queries
5
XPath

Query = path pattern + predicates


XPath 2.0
Forward axis only

Eval(Q,D): nodes in D that match Q

Two modes of XPath evaluation:


Full fledged evaluation: given Q,D, output Eval(Q,D)
Filtering: given Q,D, determine whether Eval(Q,D) is
nonempty.
6
XML Streams

XML stream: sequence of SAX events


Why XML streams?



startDocument(), endDocument(),
startElement(name), endElement(name), text(str)
For transferring XML between systems
For efficient access to large XML documents
Critical resources


Memory
Processing time
7
Streaming XML Algorithms











XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
X-scan [Ives, Levy, and Weld 00]
XMLTK [Avila-Campillo et al 02]
XTrie [Chan et al 02]
SPEX [Olteanu, Kiesling, and Bry 03]
Lazy DFAs [Green et al 03]
The XPush Machine [Gupta and Suciu 03]
XSQ [Peng and Chawathe 03]
FluX [Koch el al 04]
TurboXPath [Josifovski, Fontoura, and Barta 05]
…
All of them use lots of memory
on certain queries & documents
8
Memory Bottleneck I:
Storage of Large Transition Tables

Framework of most algorithms:




Q  NFA
Simulate NFA by DFA
Caveat: exponential blowup
However: exponential blowup is not necessary
[Bar-Yossef, Fontoura, Josifovski 04]

Algorithm for filtering XML streams whose space is
linear in the query size
9
Memory Bottleneck II:
Buffering of Document Fragments

Scenario 1: buffering nodes, which may or may not be part
of the output.
/paper[author=“Papadimitriou”]/section[@id = “2” or title = “Intro”]/content
root
paper
section
id
1
author
title content
Intro
id
3
title
content
author
Yannakakis
bla bla bla
section
id
2
section
title
Results
etc etc etc
Conclusions
title
Papadimitriou
content
yada yada
yada
On the Complexity of
Database Queries
10
Memory Bottleneck II:
Buffering of Document Fragments

Scenario 2: buffering nodes needed for evaluating pending
predicates.
/paper[title != section/title]/author
root
paper
section
id
1
author
title content
Intro
id
3
title
content
author
Yannakakis
bla bla bla
section
id
2
section
title
Results
etc etc etc
Conclusions
title
Papadimitriou
content
yada yada
yada
On the Complexity of
Database Queries
11
Memory Bottleneck II:
Buffering of Document Fragments

Scenario 3: buffering multiple candidate matches that
are nested within each other.
root
a
//a[b and c]
c
a
c
b
a
b


Relevant only when document is “recursive”
Space required: (doc-recursion-depth) [Bar-Yossef,
Fontoura, Josifovski 04]
12
Our Results

Quantitative space lower bounds for:



Matching upper bound


Full-fledged evaluation of queries with predicates
(Scenario 1)
Filtering/full-fledged evaluation of queries with
“multi-variate” predicates (Scenario 2)
Eager evaluation of predicates
In all other scenarios: no buffering required

Filtering of queries with “univariate” predicates over
non-recursive documents is possible without
buffering [Bar-Yossef, Fontoura, Josifovski 04]
13
Related Work

Space complexity of XPath evaluation over nonstreaming XML documents [Gottlob, Koch, Pichler 03],
[Segoufin 03]

Space complexity of XPath evaluation over streams of
indexed XML data [Choi, Mahoui, Wood 03]

Space complexity of select-project-join queries over
relational data streams [Arasu et al 02]
14
Document Concurrency


Q: query
D = 1,…,n: document





Each i is an SAX event
t = (1,…,t)
Definition: x  D is alive at step t if x  t and  ,
 s.t.
 x  Eval(Q, t)
 x  Eval(Q, t)
t-concurrency(D,Q): number of nodes that are
alive at step t
concurrency(D,Q): maxt t-concurrency(D,Q)
15
Concurrency: Example
/paper[author=“Papadimitriou”]/section[@id = “2” or title = “Intro”]/content
1: <paper>
2: <section id = 1>
3:
<title>
4:
Intro
5:
</title>
6:
<content>
7:
bla bla bla alive
8:
</content>
9: </section>
10: <section id = 2>
11:
<title>
12:
Results
13:
</title>
14:
<content>
15:
yada yada yada
16:
</content>
17: </section>
alive
18: <section id = 3>
19:
<title>
20:
Conclusions
21:
</title>
22:
<content>
23:
etc etc etc
24:
</content>
dead
25: </section>
26: <title>
27:
On the Complexity of
Database Queries
28: </title>
29: <author>
30:
Papadimitriou
31: </author>
32: <author>
33:
Yannakakis
34: </author>
35: </paper>
16
Lower Bound Notions

A “normal” lower bound:
For every algorithm A, there exist Q and D s.t. A uses on
Q and D (concurrency(D,Q)) bits of space.



Q and D may be “pathological”
Doesn’t say much about real-world queries/documents
An “ideal” lower bound:
For every A, every Q, and every D, A uses on Q and D
(concurrency(D,Q)) bits of space.

Too good to be true


A can have D and Q “hard-coded”, and then know the result a priori
Space of A on D and Q = minimum description length of Q and D
17
Our Lower Bound

Theorem: For every A, every Q, and every D,
there exists an almost isomorphic document D’,
s.t. A uses on Q and D’, (concurrency(D,Q)) bits
of space.
 D’ is the same as D, except for a few extra
empty nodes with auxiliary names.
 Theorem holds only if:


Q is “star-free”
D is non-recursive
18
Why isn’t this Obvious?


Reason 1: we want the theorem to work for
every Q and D, not only ones with high MDL.
Reason 2:

Obvious: If x is alive at step t  A has to
remember x


Because: A may or may not need to output x
Not obvious: If x and y are alive at step t  A has
to remember both

If x and y are not “independent”, maybe it’s enough to
remember just x (or just y)
19
Proof of Lower Bound



C = t-concurrency(D,Q)
x1,…,xC = nodes that are alive at step t
Recall: for every xi there exist i and i s.t.



xi  Eval(Q, ti)
xi  Eval(Q, ti)
Lemma: there exist a single  and a single 
s.t. for all i,


xi  Eval(Q, t)
xi  Eval(Q, t)
20
Proof of Lower Bound (cont.)


For every S  { 1,…,C } define document DS:
DS is the same as D, except





For every i  S, we “mark” xi
Marking: an extra empty child with an auxiliary
name
Note: DS is almost-isomorphic to D
A = any algorithm
Note: From output of A on DS, one can
“reconstruct” the set S.
21
Proof of Lower Bound (cont.)

Consider state of A at step t when running on
DS
 If suffix = , none of the xi’s should be output


If suffix = , no information in suffix about S
but S can be reconstructed from output


 A could not have output any xi by step t
 state of A at step t must have all information
about S
Conclusion: space ≥ (C)

Actual proof: by one-way communication complexity
22
Conclusions

Our contributions:

Quantitative space lower bounds




Full-fledged evaluation of queries with predicates
Filtering/full-fledged evaluation of queries with “multivariate” predicates
Matching upper bound
Open problems:


Quantitative lower bounds for XQuery evaluation
over streams
Address larger fragments of XPath
23