On the Memory Requirements of XPath Evaluation over XML

Transcript On the Memory Requirements of XPath Evaluation over XML

Buffering in Query
Evaluation over XML
Streams
Ziv Bar-Yossef
Technion
Marcus Fontoura
Vanja Josifovski
IBM Almaden Research Center
XML Document
1: <paper>
2: <section id = 1>
3:
<title>
4:
Intro
5:
</title>
6:
<content>
7:
bla bla bla
8:
</content>
9: </section>
10: <section id = 2>
11:
<title>
12:
Results
13:
</title>
14:
<content>
15:
yada yada yada
16:
</content>
17: </section>
18: <section id = 3>
19:
<title>
20:
Conclusions
21:
</title>
22:
<content>
23:
etc etc etc
24: </section>
25: <title>
26:
On the Complexity of
Database Queries
27: </title>
28: <author>
29:
Papadimitriou
30: </author>
31: <author>
32:
Yannakakis
33: </author>
34: </paper>
2
XML Document Tree
root
paper
section
author
id
title
content
1
section
id
3
Intro
Yannakakis
title
bla bla bla
section
id
2
author
content
title
Results
etc etc etc
Conclusions
Papadimitriou
title
content
yada yada yada
On the Complexity of
Database Queries
3
XPath Queries
/paper[author=“Papadimitriou”]/section[@id = “2” or title = “Intro”]/content
root
paper
section
author
id
1
title content
section
id
3
Intro
Yannakakis
bla bla bla
section
id
2
title
author
content
title
Results
etc etc etc
Conclusions
Papadimitriou
title
content
yada yada yada
On the Complexity of
Database Queries
4
XPath Queries
/paper[title != section/title]/author
root
paper
section
author
id
1
title content
section
id
3
Intro
Yannakakis
bla bla bla
section
id
2
title
author
content
title
Results
etc etc etc
Conclusions
Papadimitriou
title
content
yada yada yada
On the Complexity of
Database Queries
5
XPath

Query = path pattern + predicates


XPath 2.0
Forward axis only

Eval(Q,D): nodes in D that match Q

Two modes of XPath evaluation:


Full fledged evaluation: given Q,D, output Eval(Q,D)
Filtering: given Q,D, determine whether Eval(Q,D) is
nonempty.
6
XML Streams

XML stream: sequence of SAX events


Why XML streams?



startDocument(), endDocument(),
startElement(name), endElement(name), text(str)
For transferring XML between systems
For efficient access to large XML documents
Critical resources


Memory
Processing time
7
Streaming XML Algorithms











XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
X-scan [Ives, Levy, and Weld 00]
XMLTK [Avila-Campillo et al 02]
XTrie [Chan et al 02]
SPEX [Olteanu, Kiesling, and Bry 03]
Lazy DFAs [Green et al 03]
The XPush Machine [Gupta and Suciu 03]
XSQ [Peng and Chawathe 03]
FluX [Koch el al 04]
TurboXPath [Josifovski, Fontoura, and Barta 05]
…
All of them use lots of memory
on certain queries & documents
8
Memory Bottleneck I:
Storage of Large Transition Tables

Framework of most algorithms:




Q  NFA
Simulate NFA by DFA
Caveat: exponential blowup
However: exponential blowup is not necessary
[Bar-Yossef, Fontoura, Josifovski 04]

Algorithm for filtering XML streams whose space is
linear in the query size
9
Memory Bottleneck II:
Buffering of Document Fragments

Scenario 1: buffering nodes, which may or may not be part
of the output.
/paper[author=“Papadimitriou”]/section[@id = “2” or title = “Intro”]/content
root
paper
section
id
1
author
title content
Intro
id
3
title
content
author
Yannakakis
bla bla bla
section
id
2
section
title
Results
etc etc etc
Conclusions
title
Papadimitriou
content
yada yada
yada
On the Complexity of
Database Queries
10
Memory Bottleneck II:
Buffering of Document Fragments

Scenario 2: buffering nodes needed for evaluating pending
predicates.
/paper[title != section/title]/author
root
paper
section
id
1
author
title content
Intro
id
3
title
content
author
Yannakakis
bla bla bla
section
id
2
section
title
Results
etc etc etc
Conclusions
title
Papadimitriou
content
yada yada
yada
On the Complexity of
Database Queries
11
Memory Bottleneck II:
Buffering of Document Fragments

Scenario 3: buffering multiple candidate matches that
are nested within each other.
root
a
//a[b and c]
c
a
c
b
a
b


Relevant only when document is “recursive”
Space required: (doc-recursion-depth) [Bar-Yossef,
Fontoura, Josifovski 04]
12
Our Results

Quantitative space lower bounds for:



Matching upper bound


Full-fledged evaluation of queries with predicates
(Scenario 1)
Filtering/full-fledged evaluation of queries with
“multi-variate” predicates (Scenario 2)
Eager evaluation of predicates
In all other scenarios: no buffering required

Filtering of queries with “univariate” predicates over
non-recursive documents is possible without
buffering [Bar-Yossef, Fontoura, Josifovski 04]
13
Related Work

Space complexity of XPath evaluation over nonstreaming XML documents [Gottlob, Koch, Pichler 03],
[Segoufin 03]

Space complexity of XPath evaluation over streams of
indexed XML data [Choi, Mahoui, Wood 03]

Space complexity of select-project-join queries over
relational data streams [Arasu et al 02]
14
Document Concurrency


Q: query
D = 1,…,n: document





Each i is an SAX event
t = (1,…,t)
Definition: x  D is alive at step t if x  t and  ,
 s.t.
 x  Eval(Q, t)
 x  Eval(Q, t)
t-concurrency(D,Q): number of nodes that are
alive at step t
concurrency(D,Q): maxt t-concurrency(D,Q)
15
Concurrency: Example
/paper[author=“Papadimitriou”]/section[@id = “2” or title = “Intro”]/content
1: <paper>
2: <section id = 1>
3:
<title>
4:
Intro
5:
</title>
6:
<content>
7:
bla bla bla alive
8:
</content>
9: </section>
10: <section id = 2>
11:
<title>
12:
Results
13:
</title>
14:
<content>
15:
yada yada yada
16:
</content>
17: </section>
alive
18: <section id = 3>
19:
<title>
20:
Conclusions
21:
</title>
22:
<content>
23:
etc etc etc
24:
</content>
dead
25: </section>
26: <title>
27:
On the Complexity of
Database Queries
28: </title>
29: <author>
30:
Papadimitriou
31: </author>
32: <author>
33:
Yannakakis
34: </author>
35: </paper>
16
Lower Bound Notions

A “normal” lower bound:
For every algorithm A, there exist Q and D s.t. A uses on
Q and D (concurrency(D,Q)) bits of space.



Q and D may be “pathological”
Doesn’t say much about real-world queries/documents
An “ideal” lower bound:
For every A, every Q, and every D, A uses on Q and D
(concurrency(D,Q)) bits of space.

Too good to be true


A can have D and Q “hard-coded”, and then know the result a priori
Space of A on D and Q = minimum description length of Q and D
17
Our Lower Bound

Theorem: For every A, every Q, and every D,
there exists an almost isomorphic document D’,
s.t. A uses on Q and D’, (concurrency(D,Q)) bits
of space.
 D’ is the same as D, except for a few extra
empty nodes with auxiliary names.
 Theorem holds only if:


Q is “star-free”
D is non-recursive
18
Why isn’t this Obvious?


Reason 1: we want the theorem to work for
every Q and D, not only ones with high MDL.
Reason 2:

Obvious: If x is alive at step t  A has to
remember x


Because: A may or may not need to output x
Not obvious: If x and y are alive at step t  A has
to remember both

If x and y are not “independent”, maybe it’s enough to
remember just x (or just y)
19
Proof of Lower Bound



C = t-concurrency(D,Q)
x1,…,xC = nodes that are alive at step t
Recall: for every xi there exist i and i s.t.



xi  Eval(Q, ti)
xi  Eval(Q, ti)
Lemma: there exist a single  and a single 
s.t. for all i,


xi  Eval(Q, t)
xi  Eval(Q, t)
20
Proof of Lower Bound (cont.)


For every S  { 1,…,C } define document DS:
DS is the same as D, except





For every i  S, we “mark” xi
Marking: an extra empty child with an auxiliary
name
Note: DS is almost-isomorphic to D
A = any algorithm
Note: From output of A on DS, one can
“reconstruct” the set S.
21
Proof of Lower Bound (cont.)

Consider state of A at step t when running on
DS
 If suffix = , none of the xi’s should be output


If suffix = , no information in suffix about S
but S can be reconstructed from output


 A could not have output any xi by step t
 state of A at step t must have all information
about S
Conclusion: space ≥ (C)

Actual proof: by one-way communication complexity
22
Conclusions

Our contributions:

Quantitative space lower bounds




Full-fledged evaluation of queries with predicates
Filtering/full-fledged evaluation of queries with “multivariate” predicates
Matching upper bound
Open problems:


Quantitative lower bounds for XQuery evaluation
over streams
Address larger fragments of XPath
23

On the Memory Requirements of XPath Evaluation over XML

Transcript On the Memory Requirements of XPath Evaluation over XML

Directory