Transcript Lecture24b

15-829A/18-849B/95-811A/19-729A
Internet-Scale Sensor Systems:
Design and Policy
Lecture 24 – Part 2
XML Query Processing
Phil Gibbons
April 15, 2003
Lecture 24
XML Query Processing: Outline
• XML vs. Relational
• XML on Relational DB:
• Shanmugasundaram et al, “Relational Databases for Querying
XML Documents: Limitations and Opportunities”, VLDB’99,
plus follow on papers
• LegoDB, STORED, Edge (2 slides)
• To be continued in my next lecture…
04-15-03
Lecture 24
2
XML vs. SQL for Sensor Databases
• IrisNet represents data in XML (semi-structured
model)
• Hierarchical documents, Queries in XPATH
• TinyDB represents data in the relational model
• Tables, Queries in SQL
What are the pros and cons for each approach?
How does it depend on the sensing context?
04-15-03
Lecture 24
3
Why IrisNet Uses XML
• Rich, heterogeneous data
• Hard to capture in a rigid data model
• Self-describing tags useful
• Schema evolution
• XML supports on-the-fly schema changes
• Wide area sensing => Hierarchical organization
• Good match for XML, bad for relational
• Standard data exchange format
04-15-03
Lecture 24
4
Disadvantages of XML
• Query languages are lacking
• Some minimal features: e.g., aggregates, updates
• Query processors not available for XQuery
• Query processing is SLOW
Key research question: Can we store XML in a relational DB,
and use a relational database system to process queries?
04-15-03
Lecture 24
5
Why use Relational DB Systems?
• Highly reliable, scalable, optimized for performance,
advanced functionality
• Result of 30+ years of Research & Development
• XML database systems are not “industrial strength” … and
not expected to be in the foreseeable future
• Existing data and applications
• XML applications have to inter-operate with existing
relational data and applications
• Not enough incentive to move all existing business
applications to XML database systems
• Lessons from object-oriented database systems?
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram
6
XML Query Processing: Outline
• XML vs. Relational
• XML on Relational DB:
• Shanmugasundaram et al, “Relational Databases for Querying
XML Documents: Limitations and Opportunities”, VLDB’99,
plus follow on papers
• LegoDB, STORED, Edge (2 slides)
• To be continued…
04-15-03
Lecture 24
7
Storing and Querying XML Documents
XML
Schema
XML
Documents
XML
Query
XML
Result
XML Translation Layer
Relational
Schema
SQL
Query
Tuples
Relational
Result
Translation
Information
Relational Database System
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram
8
Relational Data
PurchaseOrder
Id
Customer
200I Cars R Us
Day Month Year
10
300I Bikes R Us null
June 1999
July
Item
Payment
Quantity
Cost
50
2000.00
200I
1
40%
300I Schwinn Tire 100
300I Trek Tire
20
200I Goodyear Tire 200
2500.00
400.00
8000.00
200I
300I
2
1
60%
100%
Pid
Name
1999
200I Firestone Tire
04-15-03
Lecture 24
Pid Installment Percentage
Adapted from slides ©Jayavel Shanmugasundaram
9
SQL Query
PurchaseOrder
Id
Customer
Day
200I
300I
Cars R Us
Bikes R Us
10
null
Payment
Item
Pid
200I
300I
300I
200I
Name
Firestone Tire
Schwinn Tire
Trek Tire
Goodyear Tire
Month Year
1999
June
1999
July
Cost
Quantity
50
2000.00
100
2500.00
400.00
20
200
8000.00
Pid Installment
1
200I
200I
2
300I
1
Percentage
40%
60%
100%
Find all the items bought by “Cars R Us” in the year 1999
Select it.name
From PurchaseOrder po, Item it
Where po.customer = “Cars R Us” and
po.year = 1999 and
po.id = it.pid
04-15-03
Lecture 24
Predicates
Join
Adapted from slides ©Jayavel Shanmugasundaram 10
XML Document
Self-describing tags
Nested structure
Nested sets
Order
04-15-03
<PurchaseOrder id=“200I” customer=“Cars R Us”>
<Date>
<Day> 10 </Day>
<Month> June </Month>
<Year> 1999 </Year>
</Date>
<Item name=“Firestone Tire” cost=“2000.00”>
<Quantity> 50 </Quantity>
</Item>
<Item name=“Goodyear Tire” cost=“8000.00”>
<Quantity> 200 </Quantity>
</Item>
<Payment> 40% </Payment>
<Payment> 60% </Payment>
</PurchaseOrder>
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 11
XML Schema
PurchaseOrder
<PurchaseOrder id={integer} customer={string}>
Date (Item)* (Payment)*
</PurchaseOrder>
Date
<Date>
Day? Month Year
</Date>
Day
<Day> {integer} </Day>
Month
<Month> {string} </Month>
Year
<Year> {integer} </Year>
Item
<Item name={string} cost={float}>
Quantity
</Item>
… and so on
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 12
Schemas to Relations: Issues
• Complex schema specifications
• Two level nature of relational schema (tuples and
attributes) vs. arbitrary nesting of XML Schema
• Recursion
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 13
Naïve Approach
PurchaseOrder
Id
(200I)
Customer
(Cars R Us)
Day
(10)
Date
Month
(June)
Element Node
04-15-03
Lecture 24
Item
Item
Payment
(40%)
…
Year
(1999)
Attribute Node
Adapted from slides ©Jayavel Shanmugasundaram 14
Naïve Approach (Contd.)
Edges
Id
Type
Name
Value
0 Element PurchaseOrder
null
1 Attribute
Id
200I
2 Attribute Customer
Cars R Us
3 Element
Date
null
4 Element
Day
10
5 Element
Month
June
6 Element
Year
1999
…
…
…
…
Ordinal ParentId
null
0
null
0
1
2
0
1
0
0
3
3
2
3
…
…
Problem: Many joins for queries (one per hop)
eg. PurchaseOrder/Date/Year
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 15
Desired Properties of Generated
Relational Schema R
• All XML documents conforming to XML schema
should be “mappable” to tuples in R
• All queries over XML documents should be
“mappable” to SQL queries over R
• Not Required: Ability to re-generate XML schema
from R
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 16
XML Schema: Further Examples
PurchaseOrder
<PurchaseOrder id={integer} customer={string}>
Date? (Item | Payment)*
</PurchaseOrder>
PurchaseOrder
<PurchaseOrder id={integer} customer={string}>
(Date | Payment*) (Item (Item Item)* Payment)*
</PurchaseOrder>
PurchaseOrder
<PurchaseOrder id={integer} customer={string}>
Date Item (PurchaseOrder)* Payment
</PurchaseOrder>
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 17
Simplifying XML Schemas
• XML schemas can be “simplified” for translation
purposes
PurchaseOrder
<PurchaseOrder id={integer} customer={string}>
(Date | (Payment)*) (Item (Item Item)* Payment)*
</PurchaseOrder>
PurchaseOrder
<PurchaseOrder id={integer} customer={string}>
Date? (Item)* (Payment)*
</PurchaseOrder>
Without undermining storage and query functionality
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 18
Simplification Desiderata
• Simplify structure, but preserve differences that
matter in relational model
• Single occurrence (attribute)
• Zero or one occurrences (nullable attribute)
• Zero or more occurrences (relation)
PurchaseOrder
<PurchaseOrder id={integer} customer={string}>
(Date | (Payment)*) (Item (Item Item)* Payment)*
</PurchaseOrder>
PurchaseOrder
<PurchaseOrder id={integer} customer={string}>
Date? (Item)* (Payment)*
</PurchaseOrder>
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 19
Simplification Rules
• Flattening transformations
• (e1 e2)* -> e1* e2*
• (e1 e2)? -> e1? e2?
• (e1 | e2) -> e1? e2?
• Simplification transformations
• e** -> e*
• e*? -> e*
• e?* -> e*
• e?? -> e?
• Grouping transformations
• e1* e2* e1* -> e1* e2*
…etc
What is lost?
• e+ -> e*
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 20
Result: Translation Normal Form
• An XML schema production is either of the form:
P
<P attr1={type1} … attrm={typem}>
a1 … ap ap+1? … aq? aq+1*… ar*
</P>
where ai  aj
• … or of the form:
P
04-15-03
<P attr1={type1} … attrm={typem}>
{type}
</P>
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 21
Simplified XML Schema
PurchaseOrder
<PurchaseOrder id={integer} customer={string}>
Date (Item)* (Payment)*
</PurchaseOrder>
Date
<Date>
Day? Month Year
</Date>
Day
<Day> {integer} </Day>
Month
<Month> {string} </Month>
Year
<Year> {integer} </Year>
Item
<Item name={string} cost={float}>
Quantity
</Item>
… and so on
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 22
Relational Schema Generation
PurchaseOrder (id, customer)
1
Date
?
Day
1
Month
*
*
Item (name, cost)
1
Year
Payment
1
Quantity
Satisfy: Tables are normalized
Minimize: Number of joins for simple path expressions
(of form /a/b/c)
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 23
Generated Relational Schema
and Shredded XML Document
PurchaseOrder
Id
Customer
Day Month Year
200I Cars R Us
10
June 1999
Payment
Item
Pid Order
Name
Cost
Quantity
Pid Order Value
200I
1
Firestone Tire 2000.00
50
200I
2
40%
200I
3
Goodyear Tire 8000.00
200
200I
4
60%
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 24
Example Schema Graph
book
monogra ph
article
?
booktitle
*
title
conta cta uthor
editor
authorID
*
author
name
name
address
authorid
?
firstnam e
lastname
Not just a tree
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 25
Shared Inlining Technique
• Thus far, works well for trees only
• Intuition:
• Inline as many sub-elements as possible
• Do not inline only if it is a shared, recursive or set subelement.
• Technique:
• Necessary and Sufficient Condition for shared/ recursive
element:
In-degree >= 2 in (simplified) schema graph
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 26
Relational Schema Generation and
XML Document Shredding
• Any XML Schema X can be mapped to a relational
schema R, and …
• Any XML document XD conforming to X can be
converted to tuples in R
• Further, XD can be recovered from the tuples in R
What do you think of the approach, for IrisNet?
Exercise: What would the Parking Space Finder relational schema
look like? Would there be many or few joins in queries?
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 27
04-15-03
Hybrid
2
2
1.5
1.5
0.5
0.5
0
0
Shared
Lecture 24
re
sid pif
en
tia
l
sa
ej
sm
il
vr
m
l
am
bi l
ps
14
m
at
h
ni
tf
of - x
x1
51
6
1
Queries
Shared
re
sid pif
en
tia
l
sa
ej
sm
il
vr
m
l
bi l
ps
14
m
at
h
ni
tf
of - x
x1
51
6
am
Total Joins
re
sid pif
en
tia
l
sa
ej
sm
il
vr
m
l
am
bi l
ps
14
m
at
h
ni
tf
of - x
x1
51
6
Joins/Query
Path Expression with Length 3
Shared
Hybrid
1
Hybrid
2.5
2
1.5
1
0.5
0
Adapted from slides ©Jayavel Shanmugasundaram 28
Varying Path Expression Length
Hybrid
Shared
11
10
9
8
7
6
5
4
3
2
1
0
Total Joins
Total Joins
Shared
1
2
3
4
5
6
7
8
9
10 11
Path Length
11
10
9
8
7
6
5
4
3
2
1
0
1
2
3
4
5
6
7
8
9
10 11
Path Length
Group 1 DTD
04-15-03
Hybrid
Group 3 DTD
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 29
Storing and Querying XML Documents
XML
Schema
XML
Documents
XML
Query
XML
Result
XML Translation Layer
Relational
Schema
SQL
Query
Tuples
Relational
Result
Translation
Information
Relational Database System
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 30
XPERANTO
Create XML
Document Repository
Relational Schema
Generator
Store XML
Documents
Query over Stored
XML Documents
XML view over
tables to reconstruct
shredded XML
documents
Relational Schema
Information
Query Processor
for XML views
of Relational Data
XML Document
Shredder
Create tables
Store rows
in tables
Table 1
Query over
tables
Table n
Relational Database System
04-15-03
Lecture 24
31
XML Query Processing: Outline
• XML vs. Relational
• XML on Relational DB:
• Shanmugasundaram et al, “Relational Databases for Querying
XML Documents: Limitations and Opportunities”, VLDB’99,
plus follow on papers
• LegoDB, STORED, Edge (2 slides)
• To be continued…
04-15-03
Lecture 24
32
LegoDB
[Bohannon et al, ICDE’02]
• An optimization approach:
• automatically explores a space of possible mappings
• selects the mapping which has the lowest cost for a given
application
• Important features:
• Application-driven: takes into account schema, data
statistics, and query workload
• Logical/physical independence: interface is XML-based (XML
Schema, XQuery, XML data statistics)
• Leverage existing technology: XML standards; XML-specific
operations for generating space of mappings; relational
optimizer for evaluating configurations
04-15-03
Lecture 24
Adapted from slides ©Juliana Freire
33
But What If There’s No Schema?
• Revert to one row per edge
Id
0
1
2
3
4
5
6
…
Type
Name
Value Ordinal ParentId
Element PurchaseOrder
null
null
null
Attribute
Id
200I
0
0
Attribute Customer
Cars R Us
1
0
Element
Date
null
2
0
Element
Day
10
0
3
Element
Month
June
1
3
Element
Year
1999
2
3
…
…
…
…
…
• STORED [Deutsch, Fernandez, Suciu, Sigmod’99]
• Looks at data, finds highly supported patterns for tables
• [Florescu, Kossman, Data Engineering Bulletin, 1999]
04-15-03
Lecture 24
Adapted from slides ©Jayavel Shanmugasundaram 34
XML Query Processing: Outline
• XML vs. Relational
• XML on Relational DB:
• Shanmugasundaram et al, “Relational Databases for Querying
XML Documents: Limitations and Opportunities”, VLDB’99,
plus follow on papers
• LegoDB, STORED, Edge (2 slides)
• To be continued… (Thurs)
• Updates, Native XML DBMS
• Also in next lecture: Historical queries
04-15-03
Lecture 24
35