Transcript SF-Tree
SF-Tree:
An Efficient and Flexible Structure for
Estimating Selectivity of Simple Path
Expressions with Accuracy Guarantee
Ho Wai Shing
Contents
Motivation
Problem Definition
Perfect Binary SF-Tree
Variations of SF-Tree
Experiments
Conclusion
Motivation
Motivation
XML is becoming more and more
popular
e.g., product catalogs, invoices,
documentations, etc
We need to extract useful information
out of them
Queries, usually in form of XQuery or
XPath, are important
Motivation
Motivation
e.g.,
FOR $i IN document(“*”)//buyer/name
$j IN document(“*”)//seller/name
WHERE $i/text()=$j/text()
RETURN $i
Motivation
To evaluate this query, different
evaluation plans are available
FOR $i IN document(“*”)//buyer/name
$j IN document(“*”)//seller/name
WHERE $i/text()=$j/text()
RETURN $i
Motivation
To select the best query plan, we need
the selectivity of simple path
expressions (SPEs)
To retrieve selectivity counts, we need a
structure to store suitable statistics of
the original XML document collection
Motivation
We may use path indices to store this
SPE-to-selectivity mapping
e.g., DataGuide, path tree, 1-index etc.
disadvantage: designed for absolute path
expressions (APEs) and inefficient for
simple path expressions (SPEs)
Motivation
Motivation
To increase efficiency, we may use
suffix trie to store the mapping
advantage: really quick, no need to scan
the whole index tree
disadvantage: very large, especially when
the data tree is complex
Motivation
Motivation
These structures are not flexible
won’t change with frequently asked queries
can’t make trade-offs between space and
speed
We can tolerate some errors in
selectivity retrieval and it is good to
know a bound on the error
Motivation
Our goal: to build a new structure for
estimating the selectivity of SPEs which,
is faster than the path indices
requires less space than suffix tries
provides flexible trade-off between space
and time
provides accuracy guarantee on the
estimated selectivity
Problem Definition
Problem Definition
XML document
is a text document that logically represents
a directed edge-labeled graph called data
graph
consists of Elements, Attributes and other
information items
Problem Definition
Problem Definition
Data Paths
a set of information items (e1, e2, …, en)
each ei is directly nested within ei-1 for
2in
i.e., a path in a data graph (not necessarily
starts from root)
Problem Definition
Simple Path Expression (SPE)
asks about the structure of XML documents
has a form: //t1/t2/…/tn where ti are tags
returns all information items (en) which is
the end of a data path (e1, e2, …, en) that
satisfies the SPE
Problem Definition
Satisfy: a data path (e1, e2, …, en)
satisfies an SPE (//t1/t2/…/tn) if:
every ei has a tag ti
Selectivity (or counts, selectivity counts)
selectivity of an SPE (//t1/t2/…/tn) is the
number of information items en’s that are
returned by the SPE
Problem Definition
e.g.,
//buyer/name, //name,
//invoice/buyer/name are all SPEs
selectivity counts of //name is 2,
//products is 1, //invoice is 1
Problem Definition
Problem Definition
The Problem:
given a simple path expression p and a
collection D of XML documents,
find the estimated selectivity s1 of p in D
while s1 should be as close to the actual
selectivity s of p in D as possible
SF-Tree
SF-Tree
Basic Idea:
divide all possible SPEs in an XML
document collection into distinct groups
each SPE in a group has the same
selectivity (or similar selectivity)
the task of finding the selectivity of an SPE
now becomes the task of finding which
selectivity group this SPE belongs to
SF-Tree
e.g.,
SF-Tree
e.g.,
SF-Tree
In each group, we use a signature file
to store all the paths that belong to this
group
This reduces the storage requirement,
improves the accessing speed but
induces errors to the structure
Signature Files
A traditional technique to improve the
efficiency of keyword-based information
retrieval
Each keyword is hashed into m bit
positions of a signature file F
For each keyword in the document, the
corresponding m bits in F are set
Signature Files
e.g.,
F has 10 bits, m = 3
//name is hashed to bits 2, 3, 8
//buyer is hashed to bits 2, 4, 9
0 1 1 1 0 0 0 1 1 0
1 2 3 4 5 6 7 8 9 10
Signature Files
To check the existence of a query
keyword k in F, k is hashed and we
check whether all the m bits are set or
not
If any of the m bits are unset,
k is not in F
If all m bits are set, k is very likely to
be in F
Signature Files
e.g., (cont)
if a query keyword //seller is hashed to bits
2, 8, 9,
F will say //seller is in it, but it is not true
bits 2, 8, 9 are in fact set by other
keywords
a false drop occurred
Signature Files
Previous analysis showed that in
optimal case,
false drop rate is (1/2)m
|F | = m |D | / ln(2) where D is the set of
all keywords in the document
In our case, we treat an SPE in a group
as a keyword and a group as a
document
SF-Tree
There are many groups
for 16-bit selectivity counts, there are 216 groups if
each group has a distinct selectivity count
To get the selectivity, we need to find which
group this SPE belongs to
Linear scan of all groups is not feasible
slow
error prone
SF-Tree
organize the groups into a tree form
called an SF-Tree (stands for Signature
File Tree)
Rationale: create representative groups
over several groups, and if an SPE is
not in a representative group, it will
never be in any of descendant groups
of this representative group.
SF-Tree
SF-Tree
SF-Tree
Basic selectivity retrieval algorithm:
1. start from the root,
2. check which child contains the SPE p
3. descend to that child, repeat step 2, 3
until a leaf is reached
4. return the selectivity associated with
that leaf
Perfect Binary SF-Tree
A very straightforward way to build
representative nodes is to build the
nodes according to the most significant
bits of the selectivity count
i.e., all the descendants has the same
m.s.b. in their selectivity counts
A perfect binary SF-Tree is thus formed
Perfect Binary SF-Tree
advantage:
easy to build
easy to analyze
disadvantage:
less flexible
cannot tackle FAQ
not that space efficient
Perfect Binary SF-Tree
properties: (proofs are skipped)
low error rate:
let d be depth of the SF-Tree, a be the depth
difference between the false drop leaf, correct leaf
and their least common ancestor
for negative queries, false drop rate is:
Rneg(d) < 1/2d(m-1)
for positive queries, max error size = 2a, Rpos(d, a)
< 1/2ma-a+1
Perfect Binary SF-Tree
high accessing speed:
assuming no false drop,
Tneg(d) = 2 signatures
Tpos(d) = 2d signatures
it doesn’t depend on the complexity of the
XML documents
Perfect Binary SF-Tree
each false drop induces at most 2 more
signature checks
expected number of signatures to be
checked for positive query:
C < 2d + (2d+2)(d+2)/2m
Perfect Binary SF-Tree
storage:
each SPE must appear at one leaf and all
the ancestors of that leaf
each SPE contributes m/ln2 bits to each
signature that contains it
space requirement = (m/ln2) v (|v|dv)
where v is a leaf node
Perfect Binary SF-Tree
properties -- summary:
negligible error rate (if m = 10, d = 16,
error rate for a negative query is 1/2144!)
error rate decreases as error size increases
for positive queries
error rate and accessing time are
independent of the complexity of the XML
documents
Variations of SF-Tree
Variations of SF-Tree
drawbacks of a perfect binary SF-Tree:
every SPE has the same depth in SF-Tree,
cannot optimize for FAQ
too many layers, we can tolerate a higher
error rate for smaller space requirement
Variations of SF-Tree
Huffman SF-Tree
Shannon-Fano SF-Tree
Cropped SF-Tree
Huffman SF-Tree
build a Huffman tree on the groups of
selectivity
the weight of each group can be:
A. the number of SPEs in that group
B. the probability of a random query will
fall into that group
Huffman SF-Tree
Huffman SF-Tree
advantage:
optimal in space required over all binary
SF-Trees if the weight is (A).
optimal in expected number of signatures
to be checked if the weight is (B).
disadvantage:
error rate is independent of error size
Shannon-Fano SF-Tree
build a Shannon-Fano tree over the
selectivity groups
the order of the leaves are not changed
similar depth as in Huffman tree
error rate now decreases as error size
increases
Shannon-Fano SF-Tree
We observed that the distribution of
SPE over all selectivity is very skewed:
most of the SPE has a very small selectivity
the distribution is similar to a GP with ratio
smaller than 1.
The use of Huffman or Shannon-Fano
tree much reduces the depth of SF-Tree
Shannon-Fano SF-Tree
900
800
number of SPEs
700
600
500
400
300
200
100
0
64
128
192
256
320
384
selectivity
448
512
576
Cropped SF-Tree
Cropped SF-Tree is an easy trade-off
between access time, space and
accuracy
two types of cropped SF-Tree:
head-cropped SF-Tree (the top h levels of
an SF-Tree is removed)
tail-cropped SF-Tree (the bottom t levels
of an SF-Tree is removed)
Cropped SF-Tree
original SF-Tree
Head-cropped SF-Tree
height of the SF-Tree decreases
space required is smaller
accessing time increases because we
have to check all the children of the
root in the first step
accuracy decreases a little bit because
the maximum possible value of a
decreases.
Tail-cropped SF-Tree
height of the SF-Tree decreases
space required is smaller
accessing time is shorter (we can
traverse fewer levels)
accuracy is lowered (a leaf now
represents a range of selectivity, i.e.,
the quantization step size of selectivity
is larger)
Experiments
Experiments
Data Sets
Experiments
Query Workloads
positive SPEs: randomly pick 2000 SPEs
from all the SPEs in the XML document
collection. Each SPE has equal probability
of being picked
negative SPEs: randomly concatenate tags
in the XML document collection, and then
remove those with non-zero selectivity until
2000 SPEs are generated
Experiments
Comparison to Competitors
Performance of SF-Tree
Performance of SF-Tree
Performance of SF-Tree
Conclusions
Conclusions
A new structure, SF-Tree, is introduced
SF-Tree is much quicker and path
indices like path trees and uses much
less space than suffix trie
SF-Tree provides means to make a
trade-off between space, time and
accuracy
Conclusions
Although SF-Tree returns only estimated
selectivity, but accuracy is guaranteed
Neg queries: false drop rate = 1/2d(m-1),
Pos queries: false drop rate = 1/2ma-a+1
max error size = 1/2a
Discussions
SF-Tree is a new approach to
approximate some "string-to-number"
structures
It may be applicable to other aspects
such as approximating a multidimensional histogram
Questions?
References
A. Aboulnaga, A. R. Alameldeen, and J. F. Naughton. Estimating the
selectivity of XML path expressions for internet scale applications. In
VLDB, 591--600, 2001.
C. Faloutsos and S. Christodoulakis. Signature files: An access method
for documents and its analytical performance evaluation. ACM TOIS,
2(4):267--288, 1984.
R. Goldman and J. Widom. Data Guides: Enabling query formulation
and optimization in semistructured databases. In VLDB, pp. 436--445,
1997.
L. Lim, M. wang, S. Padmanabhan, J. S. Vitter, and R.Parr.
XPathLearner: an on-line self-tuning markov histogram for XML path
selectivity estimation. In VLDB 2002.
T. Milo, D. Suciu. Index structures for path expressions. In ICDT, 1999.
N. Polyzotis and M. Garofalakis. Statistical synopses for graphstructured XML databases. In SIGMOD, 2002.
References
N. Polyzotis and M. Garofalakis. Structure and value synopses for XML
data graphs. In VLDB, 2002.
W3C. Extensible markup language (XML) 1.0.
http://www.w3.org/TR/1998/REC-xml-19980210