Transcript author

Topic 8: Semi-structured Data
<A>
<B>xxx</B>
<C>
<D>yyy</D>
<D>zzz</D>
</C>
</A>7
In various application domains, the data
are semi-structured; the database
schema is loose-defined.
Semi-structured data need specialized
management methods.
The most characteristic example is XML
data where the elements (tags) define
the semantics of the information.
Dr. N. Mamoulis
Advanced Database Technologies
1
XML and HTML
XML (like HTML) is a subset of SGML
In HTML the tags serve the primary purpose
of describing how to display a data item.
On the other hand, XML tags describe the
data itself. Tags are called elements in XML.
This means that a program receiving an XML
document can interpret it in multiple ways,
can filter the document upon its content, can
restructure it for a different application, etc.
Dr. N. Mamoulis
Advanced Database Technologies
2
Example of HTML document
<html>
<tag></tag>
<head>
attribute=“value”
<title>Course Details</title>
text
</head>
<body>
<b><font size="+1">CSIS7101</font></b>
<p> <b><font size="+1">Detailed (tentative)
schedule</font></b> </p>
<hr width="100%"> <br>
<b>1 Introduction</b>
<ul>
<li> Introduction by the instructor [<a
href="http://www.csis7101.com/intro_slides.pdf">
slides in pdf</a>]*
</li>
</ul>
</body>
</html>
Dr. N. Mamoulis
Advanced Database Technologies
3
Example of XML document
<book>
<booktitle> The Selfish Gene </booktitle>
<author id=“dawkings">
<name>
<firstname> Richard </firstname>
<lastname> Dawkins </lastname>
</name>
<address>
<city> Timbuktu </city>
<zip> 99999 </zip>
<tag></tag>
</address>
attribute=“value”
</author>
text
</book>
Dr. N. Mamoulis
Advanced Database Technologies
4
XML and Databases
XML is becoming a standard for
information exchange over the internet.
Since actual data are stored in XML
documents, it should be possible to
query the data.
Here comes the role of databases: How
should we organize and query XML
data?
Dr. N. Mamoulis
Advanced Database Technologies
5
XML and Databases (cont’d)
Solution 1:

Use specialized storage methods, query
languages and query evaluation techniques
for semi-structured data.
Solution 2:

Represent XML data in relational tables,
transform queries to SQL, and use the
mature relational DB technology.
Dr. N. Mamoulis
Advanced Database Technologies
6
More on XML – Document Type
Descriptors
DTDs (Document Type Descriptors)
define (and control) the schema of XML
documents for a specific application.
Thus, now the structure of these
documents is not free, but should
conform to the DTD.
The DTD can help define a relational
schema for the class of documents that
conform to it.
Dr. N. Mamoulis
Advanced Database Technologies
7
Example of a DTD
0 or more times
<!ELEMENT book (booktitle, author)>
<!ELEMENT article (title, author*, contactauthor)>
<!ELEMENT contactauthor EMPTY>
<!ATTLIST contactauthor authorID IDREF IMPLIED>
<!ELEMENT monograph (title, author, editor)>
<!ELEMENT editor (monograph*)>
<!ATTLIST editor name CDATA #REQUIRED>
reference to
<!ELEMENT author (name, address)>
existing ID
<!ATTLIST author id ID #REQUIRED>
<!ELEMENT name (firstname?, lastname)>
<!ELEMENT firstname (#PCDATA)>
0 or 1 time
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT address ANY>
Dr. N. Mamoulis
anything can be nested
under address
Advanced Database Technologies
8
More on XML – Queries
Queries on XML data are described by
the structural relationships of elements,
attributes and values.
Several XML Query Languages have
been proposed.
XML-QL, Lorel, UnQL, XQL, XPath,
XQuery, ...
Dr. N. Mamoulis
Advanced Database Technologies
9
More on XML – Query Example
Find the last name of the author of
book “the selfish gene”

WHERE <book>
<booktitle> The Selfish Gene </booktitle>
<author> <lastname> $l </lastname> </>
</> IN db.xml

CONSTRUCT <lastname> $l </lastname>
Dr. N. Mamoulis
Advanced Database Technologies
10
XML data represented as graphs
An XML document can be represented
as a node-labeled graph.
The labels of the graph are element
tags, attribute names and values.
Most documents can be represented by
trees. The edges that transform a tree
to a graph come from ID references.
Dr. N. Mamoulis
Advanced Database Technologies
11
Example of a tree representation
book
booktitle
author
The Selfish
Gene
id
dawkins
name
firstname
Richard
Dr. N. Mamoulis
lastname
address
city
zip
Dawkins Timbuktu 99999
Advanced Database Technologies
12
Example of a graph representation
<atricle>
<title> The Importance of Evergreen </title>
<author id=“smith”>
<name>
<firstname> John </firstname>
<lastname> Smith </lastname>
</name>
<address> Smithsville </address>
</author>
<author id=“jones”>
<name>
<lastname> Jones </lastname>
</name>
<address> Jonesville </address>
</author>
<contactauthor idref=“smith”>
</article>
Dr. N. Mamoulis
Advanced Database Technologies
13
Example of a graph representation
(cont’d)
article
author
title
The Importance
of Evergreen
smith
id
name address
firstname lastname Smithsville
John
Dr. N. Mamoulis
author contactauthor
id name address authorid
jones
Smith
Advanced Database Technologies
lastname Jonessville
Jones
14
XML Query types
Queries with absolute path expressions.
These queries retrieve paths where the
first element is the ROOT of the
document.
Example: find all books written by
author with lastname=“Smith”

book/author/name/lastname/Smith
Dr. N. Mamoulis
Advanced Database Technologies
15
XML Query types
Queries with simple path expressions.
These queries retrieve paths where the
first element can be any element of the
document.
Example: find all items written by
author with lastname=“Smith”

//author/name/lastname/Smith
anything can be before author tag
Dr. N. Mamoulis
Advanced Database Technologies
16
XML Query types
Queries with regular path expressions.
These queries retrieve paths where the
not all elements on the path are
specified.
Example: find all documents with an
“author” element with a descendant
“Smith” in the graph

//author//Smith
Dr. N. Mamoulis
any path can be between
author and Smith
Advanced Database Technologies
17
XML Query types
In general, other symbols may be used
to denote the distance between the
path elements.
Example: find all documents with an
“author” followed by one element, then
one or none elements, and then by
one or no element
“Smith”.
could be here

//author/_ /?/Smith
exactly one element
should be here
Dr. N. Mamoulis
Advanced Database Technologies
18
XML Query types
Queries that match multiple regular
path expressions. The paths are joined
in a root element and the whole query
is represented by a twig (small tree).
Example: find the book with title “XML”
written by an author with a descendant
book
“Smith” in the graph

book[/title/XML][//author//Smith]
title author
XML
Dr. N. Mamoulis
Advanced Database Technologies
Smith
19
Indexing and XML Query Processing
Several storage schemes and indexes
have been proposed for the queries
discussed above.
Some of them index the paths or
subgraphs of the XML structures.
Some decompose the information and
flatten it into relational DB tables.
Dr. N. Mamoulis
Advanced Database Technologies
20
Path indexes for XML data
If many documents exist, they are connected
into a large graph by adding a common root.
Then a structural summary of the XML graph
is created.
All the paths in the data graph are preserved
into the summary graph.
If we keep pointers to the original graph into
the summary graph, then this becomes an
index.
Dr. N. Mamoulis
Advanced Database Technologies
21
Path index example
A. a graph of documents
1 alldocuments (root)
book
2
8
book
title author author title author
3
4
name
5
Dr. N. Mamoulis
6
9
10
name name address
7
11
12
article
13
title author author
14
15
18
name address name
16
Advanced Database Technologies
17
19
22
Path index example
B. the 1-index
1 alldocuments (root)
book
article
2,8
title author
3,9
4,6,10
name address
5,7,11
12
13
title author
14
15,18
name address
16,19
17
The 1-index maintains information about all
paths in the original graph
Dr. N. Mamoulis
Advanced Database Technologies
23
Path indexes for XML data
The 1-index maintains information
about all paths in the original graph.
This makes the index very large (with
size comparable to the data size).
Therefore it is quite expensive to
evaluate queries using this index.
To address this problem an A(k)-index
is proposed which indexes exactly only
paths up to length k.
Dr. N. Mamoulis
Advanced Database Technologies
24
Bisimilarity
Two nodes u, v are called bisimilar if:


They have the same label.
If u’ is the parent of u, then there is a
parent v’ of v, such that u’,v’ are also
bisimilar, and vice versa.
Dr. N. Mamoulis
Advanced Database Technologies
25
Bisimilarity Example
Nodes 4 and 10 are bisimilar
Nodes 10 and 15 are not bisimilar
1 alldocuments (root)
book
2
8
book
title author author title author
3
4
name
5
Dr. N. Mamoulis
6
9
10
name name address
7
11
12
article
13
title author author
14
15
18
name address name
16
Advanced Database Technologies
17
19
26
Bisimilarity defines the 1-index
Bisimilar nodes are stored in the same
node in the summary index.
1 alldocuments (root)
book
article
2,8
title author
3,9
4,6,10
name address
5,7,11
Dr. N. Mamoulis
12
13
title author
14
15,18
name address
16,19
Advanced Database Technologies
17
27
Bisimilarity and the A(k) index
In the A(k) index, the notion of kbisimilarity is used:


Two nodes u,v are 0-bisimilar, if they have
the same label.
Two nodes u,v are k-bisimilar, if they have
the same label and for every parent u’ of u,
there is a parent v’ of v, such that u’ and v’
are (k-1)-bisimilar, and vice versa.
Dr. N. Mamoulis
Advanced Database Technologies
28
k-bisimilarity Example
Nodes 5 and 16 are 1-bisimilar
Nodes 6 and 15 are not 1-bisimilar
1 alldocuments (root)
book
2
8
book
title author author title author
3
4
name
5
Dr. N. Mamoulis
6
9
10
name name address
7
11
12
article
13
title author author
14
15
18
name address name
16
Advanced Database Technologies
17
19
29
Bisimilarity and the A(k) index
The A(k)-index stores exactly all paths
of length k,
or else: all k-bisimilar nodes in the data
graph are stored in the same node in
the index graph.
This means that all incoming paths up
to length k are encoded in the index.
Dr. N. Mamoulis
Advanced Database Technologies
30
A(k)-index example
A(3) and A(2)-index
1 alldocuments (root)
book
article
2,8
title author
3,9
4,6,10
name address
5,7,11
Dr. N. Mamoulis
12
13
title author
14
15,18
name address
16,19
Advanced Database Technologies
17
31
A(k)-index example
A(1)-index
A(0)-index
1 alldocuments (root)
book
article
2,8
title author
3,9
4,6,10
name
5,7,11,16,19
Dr. N. Mamoulis
13
title author
14
15,18
address
12,17
Advanced Database Technologies
1 alldocuments (root)
book
article
title
author
2,8
13
3,9,14 4,6,10,15,18
name
5,7,11,16,19
address
12,17
32
Using the A(k) index to search
A(1)-index
A Label Map is constructed
together with the index,
where each label points to
its positions in the index.
book
book
article
title
author
name
1 alldocuments (root)
article
2,8
title author
3,9
4,6,10
name
5,7,11,16,19
13
title author
14
15,18
address
12,17
address
Dr. N. Mamoulis
Advanced Database Technologies
33
Evaluation of path queries
Assume that a path query q of length
≤k is applied.
The A(k) index can answer the query as
follows.
First the last label of q is found and the
label map is used to find its positions in
the index.
Then the index is traversed backwards
to complete the answer.
Dr. N. Mamoulis
Advanced Database Technologies
34
Evaluation of path queries (example)
A(1)-index
Query book/title
1 alldocuments (root)
OK
book
book
article
title
author
name
article
2,8
title author
3,9
4,6,10
name
5,7,11,16,19
13
title author
14
15,18
address
12,17
address
Dr. N. Mamoulis
Advanced Database Technologies
35
Evaluation of path queries
If the path of the query is longer than k, we
may need to access the actual data.
Thus, A(k)-index alone cannot be used to
answer the query in this case.
This is because if we traverse the index
backwards we may find false positive paths
that actually do not exist in the graph.
Paths that share information are grouped to
decrease the potential cost.
Dr. N. Mamoulis
Advanced Database Technologies
36
Evaluation of path queries (2nd example)
A(1)-index
Query book/author/name
path has length 2>1
1 alldocuments (root)
book
book
article
title
author
name
article
2,8
title author
3,9
4,6,10
name
5,7,11,16,19
13
title author
14
15,18
address
12,17
address
Dr. N. Mamoulis
Advanced Database Technologies
37
Problems of path indexes
They are appropriate only for simple
path queries up to a certain length.
Therefore if a query has branches or
regular path expressions the index
cannot provide exact answers, but the
actual data have to be accessed.
Also these indexes have high storage
and update cost.
Dr. N. Mamoulis
Advanced Database Technologies
38
Storing and indexing XML data in
relational databases
We can decompose the structural
information into tables and use them to
answer queries.
This reduces the volume of data that
need to be accessed for a single query
and we can use off-the-shelf query
processing and optimization techniques.
On the other hand, we may need
expensive joins during query processing
Dr. N. Mamoulis
Advanced Database Technologies
39
A decomposition model for XML data
The storage model indexes the elements and
text of the documents by their position in the
graph.
If the structures are trees, this representation
can help to answer queries fast.
On the other hand, for graphs the positions of
the elements many times cannot help fast
query evaluation because of recursion and
other problems the incur.
Dr. N. Mamoulis
Advanced Database Technologies
40
Encoding elements, attributes and
values based on their positions.
The position of each element/attribute occurrence is
represented as a 3-tuple (Document-id,
StartPos:EndPos, LevelNum)
Values (text) is encoded using (Document-id, StartPos,
LevelNum):
 Document-id is the id of the document that contains
the element
 StartPos is the number of words from the beginning
of the document until the start of the element
 EndPos is the number of words from the beginning
of the document until the end of the element
 LevelNum is the nesting depth of the element
Dr. N. Mamoulis
Advanced Database Technologies
41
Encoding example
<book>
<booktitle> The Selfish Gene </booktitle>
<author id=“dawkings">
<name>
<firstname> Richard </firstname>
<lastname> Dawkins </lastname>
</name>
<address>
<city> Timbuktu </city>
<zip> 99999 </zip>
</address>
</author>
</book>
Dr. N. Mamoulis
Advanced Database Technologies
42
Encoding Example
book
(1,2:6,2)
(1,1:27,1)
(1,7:26,2)
booktitle
author
(1,3,3)
The Selfish
Gene
id (1,8:9,3)
(1,9,4)
dawkins
name
(1,10:17,3)
(1,11:13,4) (1,14:16,4)
firstname
lastname
(1,18:25,3)
address
(1,19:21,4) (1,22:24,4)
city
zip
(1,15,5)
(1,20,5) (1,23,5)
(1,12,5)
Richard
Dawkins Timbuktu 99999
Dr. N. Mamoulis
Advanced Database Technologies
43
Using the encoding to determine a
structural relationship
We can use the encoding to find fast the
relationship between two elements (or
between an element and a value).
Element e1 is an ancestor of element e2 in the
same document iff:


e1.DocumentId = e2.DocumentId
e1.StartPos> e2.StartPos && e1.EndPos<
e2.EndPos (interval coverage)
If the above hold and, in addition,
e1.LevelNum+1 = e2.LevelNum, then e1 is the
parent of e2.
Dr. N. Mamoulis
Advanced Database Technologies
44
Answering queries using the
encoding
Assume that all documents have been
flattened to tables and the encoding is used
to index the positions of each element and
value in the documents.
We store all information in a table:
(ElementId, Document-id, StartPos:EndPos, LevelNum)
The table is clustered by ElementId and sorted
by (Document-id, StartPos).
Dr. N. Mamoulis
Advanced Database Technologies
45
Answering queries using the
encoding (cont’d)
Example
ElementId
Document-id
StartPos
EndPos
LevelNum
book
1
1
27
1
booktitle
1
2
6
2
author
1
7
26
2
...
...
...
...
...
Dr. N. Mamoulis
Advanced Database Technologies
46
Answering queries using the
encoding (cont’d)
The query is broken into binary parent-child
or ancestor descendant relationships.
book
Example:

book[/title/XML][//author//Smith]
title author
Broken to:




book/title
title/XML
book//author
author//Smith
Dr. N. Mamoulis
Advanced Database Technologies
XML
Smith
47
Answering queries using the
encoding (cont’d)
Each binary query is executed as a join, and
their results are “stitched” together to
formulate the results of the whole query.
Example: book/author/address
book/author: (2,4),(2,6),(8,10)
 author/address: (10,12),(15,17)
 book/author/title: (8,10,12)
book
book
article
2

8
13
title author author title author
3
4
name
5
Dr. N. Mamoulis
6
9
10
name name address
7
11
12
title author author
14
15
18
name address name
16
Advanced Database Technologies
17
19
48
How to process the binary joins
Thus the “heart” of XML query processing is
the algorithm that joins the elements table to
retrieve the results for each individual query
component.
One method to process the binary join is to
apply a merge-join algorithm, since the table
is already sorted by Element,DocId,StartPos.
Assume that the query is an A//D, where A is
the ancestor element and D is the descendant
element
Dr. N. Mamoulis
Advanced Database Technologies
49
How to process the binary joins.
The tree-merge join algorithm
Example:
AList
DList
1
23
34
2
1
30
32
4
1
25
33
2
1
68
72
5
1
45
56
2
2
17
18
4
2
15
28
3
2
20
22
4
2
45
57
3
2
43
60
2
Dr. N. Mamoulis
Advanced Database Technologies
50
Worst case for the tree-merge join
algorithm
Example:
d1
a1
d1
d2
a2
...
d2n
d2n-1
a1
d2
a2
...
...
dn
an
dn+1
...
an
d2n-1
dn
Dr. N. Mamoulis
dn+1
Advanced Database Technologies
d2n
51
How to process the binary joins.
The tree-merge join algorithm
The tree merge join may perform many
passes to the “inner” DList table, one for
each element in AList that mathes the
elements there.
In order to avoid this a stack-tree join
algorithm is proposed.
OBSERVATION: We can get all the join
results by a depth-first traversal of the XML
tree.
Dr. N. Mamoulis
Advanced Database Technologies
52
The stack-tree join algorithm
The lists are merged together as before, but
a stack is maintained to keep nested AList
elements which are in the same path as the
current element from DList.
When a qualifying element in DList is found,
all elements of AList in the stack are output.
Dr. N. Mamoulis
Advanced Database Technologies
53
Stack-tree join example
a1
a2
d1
a3
d4
d2 d3
a4
AList
DList
a1
d1
a2
d2
a3
d3
a4
d4
Stack
Output
d5
d6
a1
d5 d6
Dr. N. Mamoulis
Advanced Database Technologies
54
Stack-tree join example
a1
a2
d1
a3
d4
d2 d3
a4
AList
DList
a1
d1
a2
d2
a3
d3
a4
d4
d5
d6
Stack
Output
a1,d1
a2,d1
a2
a1
d5 d6
Dr. N. Mamoulis
Advanced Database Technologies
55
Stack-tree join example
a1
a2
d1
a3
d4
d2 d3
a4
AList
DList
a1
d1
a2
d2
a3
d3
a4
d4
d5
d6
Stack
Output
a1,d1
a2,d1
a3
a1
d5 d6
Dr. N. Mamoulis
Advanced Database Technologies
56
Stack-tree join example
a1
a2
d1
a3
d4
d2 d3
a4
AList
DList
a1
d1
a2
d2
a3
d3
a4
d4
d5
d6
Stack
Output
a1,d1
a2,d1
a1,d2
a3,d2
a3
a1
d5 d6
Dr. N. Mamoulis
Advanced Database Technologies
57
Stack-tree join example
a1
a2
d1
a3
d4
d2 d3
a4
AList
DList
a1
d1
a2
d2
a3
d3
a4
d4
d5
d6
Stack
Output
a1,d1
a2,d1
a1,d2
a3,d2
a3
a1
a1,d3
a3,d3
d5 d6
Dr. N. Mamoulis
Advanced Database Technologies
58
Stack-tree join example
a1
a2
d1
a3
d4
d2 d3
a4
AList
DList
a1
d1
a2
d2
a3
d3
a4
d4
Stack
a1,d1
a2,d1
a1,d2
a3,d2
a1,d3
a3,d3
d5
d6
Output
a1
a1,d4
d5 d6
Dr. N. Mamoulis
Advanced Database Technologies
59
Stack-tree join example
a1
a2
d1
a3
d4
d2 d3
a4
AList
DList
a1
d1
a2
d2
a3
d3
a4
d4
d5
d6
d5 d6
Dr. N. Mamoulis
Advanced Database Technologies
Stack
Output
a1,d1
a2,d1
a1,d2
a3,d2
a4
a1
a1,d3
a3,d3
a1,d4
a1,d5
a4,d5
a1,d6
a4,d6
60
Comments on the stack-tree join
algorithm
The algorithm has better worst-case
complexity than the tree-merge join
algorithm.
Both of them have two versions; one that
outputs results sorted on AList elements and
one that outputs results sorted on DList
elements.
Dr. N. Mamoulis
Advanced Database Technologies
61
Limitation of the binary join
algorithms
They are used only for binary joins. If a
query is complex and contains many binary
relationships, many intermediate results have
to be merged.
Example:

book[/title/XML][//author//Smith]
Broken to:




book/title
title/XML
book//author
author//Smith
Dr. N. Mamoulis
Advanced Database Technologies
book
title author
XML
Smith
62
An extension of the stack-tree
join algorithm
book
title author
The path-join and twig-join algorithms XML Smith
extend the basic stack-join algorithm for
complex queries.
The idea is the same, but multiple stacks are
used to avoid merging the intermediate
results.
Path-join is appropriate for path queries only
(e.g., book/author/name)
Twig-join is appropriate for branching (tree)
expressions.
Dr. N. Mamoulis
Advanced Database Technologies
63
Example of Path-Join
a1
Query a//b//c
..at point c3
StackA
a3
a1
Dr. N. Mamoulis
StackB
b4
b3
Output
c1,b2,a1
c2,b2,a1
c3,b4,a3
c3,b4,a1
c3,b3,a1
a2
b2
b1
c1 c2
c3,b3,a3 is not an
answer because
b3 points to a1 in
the next stack!
Advanced Database Technologies
b3
a3
b4 c5
c3 c4
64
Path-join has optimal asymptotic cost
for single-path queries, but ...
... if a query is a twig of multiple paths may
produce many partial results which have
then to be joined.
The twig-join, joins these results at
b
production time.
Example:
a
a




query a[//b/c][//d/e]
two paths a/b/c
two paths a/d/e
only one twig a[/b/c][/d/e]
Dr. N. Mamoulis
a
b
c
d
d
b
d
c
e
c
e
c
e
Advanced Database Technologies
65
Twig Join
b
The twig-join applies
path-join at multiple paths
at the same time.
When at some node there
are potential solutions for
each path of the query,
the algorithm waits for
these results and waits for
them to be computed.
Then the results from
each path are joined.
Dr. N. Mamoulis
a
a
a
b
c
d
d
b
d
c
e
c
e
c
e
Advanced Database Technologies
partial result: a/b/c
partial result: a/d/e
merge result: a[/b/c][/d/e]
66
Limitations of the twig-join and
stack-based methods
It is useful for simple twigs only, but it is not
trivial to extend it for arbitrary trees
book
title author
XML
Smith Jones
The encoding can be used for tree-structured
XML data only. However, in many cases XML
data are graphs. In this case the encoding
(and also the stack-based algorithms) are
not applicable.
Dr. N. Mamoulis
Advanced Database Technologies
67
Summary
XML data are everywhere today, and efficient
management and querying systems are
needed.
This is why today XML data management is
one of the hottest research topics in DB.
There are two streams for XML data
management:


Store XML data into native systems and use
special indexing and querying methods.
Trasform XML data into relational tables and
use/adapt relational query algorithms.
Dr. N. Mamoulis
Advanced Database Technologies
68
References
R. Kaushik et al., “Exploiting Local Similarity
for Indexing Paths in Graph-Structured
Data”, ICDE 2002.
S. Al-Khalifa et al., “Structural Joins: A
Primitive for Efficient XML Query Pattern
Matching”, ICDE 2002.
N. Bruno et al., “Holistic Twig Joins: Optimal
XML Pattern Matching”, ACM SIGMOD 2002.
J. Shanmugasundaram et al. “Relational
Databases for Querying XML Documents:
Limitations and Opportunities”, VLDB 1999.
Dr. N. Mamoulis
Advanced Database Technologies
69