From XML Schema to Relations: A Cost

Download Report

Transcript From XML Schema to Relations: A Cost

From XML Schema to Relations: A CostBased Approach to XML Storage
Presented by
Xinwan Bian and Danyu Wu
02-21-02
1
Introductions

Where to save XML document?
. XML database
. Object-Oriented database
. Object-Relational database
. Relational database
2
Difficulties of Saving XML document into
Relational Database



XML has more complex tree structure
than flat relational tables
XML contains richer data types
The integration with legacy tables
3
Different Approaches to schema mappings

Fixed XML-to-relational mappings

Commercial RDBMS utility tools

Bell Laboratories cost-based approach
4
LegoDB, an XML storage mapping system

Three design principles
. Cost-based search
. Logical/physical independence
. Reuse of existing technology
5
The Basic Approach of LegoDB



Create a p-schema for input XML schema
Obtain cost estimates with input of data
statistics and XQuery workload
Exploit alternative storage configurations and
achieve an optimal mapping
6
Architecture of the Mapping Engine
cost(SQi)
XML Schema
XML data statistics
Generate
Physical Schema
Physical Schema
Transformation
PS0
Rsi: Relational Schema/Queries/Stats
Psi: Physical Schema
Query/Schema
Translation
PSi
Optimal Configuration
Query Optimizer
RSi
XQuery workload
7
Questions

Its Advantages?

Its Disadvantages?
8
Example of P-Schema Creation
type Show=
type Show=
show [@type [String],
show [@type[String],
title [String],
title [String]
year [Integer],
year [Integer],
reviews [String]*,
Reviews*,
…]
type Reviews =
reviews[String]
(a) Initial XML schema
(b) P-Schema
TABLE Show
( show_id INT,
type STRING,
year INT )
TABLE Review
( Review_id,
review String,
parent_show INT)
© Relational table
9
What’s P-Schema?

Physical schemas (p-schemas) is an
extension of XML schemas in two
significant ways:
. They contain data statistics
. They can be easily mapped into
relational tables
10
Example of P-Schema with statistics

type Show = show [ @type[ String <#8,#2>],
year[ Integer<#4,#1800,#2100,#300>],
title[ String<#50,#34798>],
Review*<#10> ]
type Review = review[ String<#800> ]
Scalar<#size, #min, #max, #distinct>
String<#size,#distinct>
*<#count>
11
Stratified Physical Types







scalar type s ::= Integer | String | Boolean
Physical type ps ::= ps<#size,#min,#max,#distincts>
Named type nt::= X (type name) | nt | nt (choice) |  (empty)
| nt{n,m,#<}#count> repetition
Optional type ot ::= nt (named type) | s (optional scalar) | L[ot]
(optional element) | ot, ot (optional sequence) | () (empty)
Physical type pt ::= nt (named type) | ot{0,1} (optional type) |
s (scalar) | L[pt] (element) | pt, pt (sequence) | () empty
Schema item si ::= type X = pt
(type declaration)
Schema ::= schema Sn = si, si, … end (schema)
12
Mapping of p-schema to relations





Create one relation RT for each type name T
For each RT, create a key that stores node id
For each RT, create a foreign key to all relations RPT
such that PT is a parent type of T
A column is created in RT for each sub-element of T
that is a physical type
If the data type is contained within an optional type
then the corresponding column can be null
13
More details of P_Schema to relational
mappings
14
Schema Transformations

Advantages of transformations at XML Schema level
. Much of the XML schema semantics not present in
a given relational schema.
. More natural rewriting at the XML level
. The framework is more easily extensible to other
non-relational stores
15
Inlining/Outlining Transformation

One can either associate a type name to a given
nested element (outlining) or next its definition
directly within its parent element (inlining).
type TV=
seasons [Integer]
Description,
Episode*
=>
type TV =
seasons[Integer],
description[String],
Episode*
type Description =
description [String]
16
Union Factorization/Distribution
Transformation

The first law ((a,(b|c)) == (a,b|a,c)
type Show =
show [@type[String], title[String]
title[String], year [Integer],
Aka{1,10}, Review*, {Movie|TV}]
type Movie =
box_office[Integer]
video_sales[Integer]
Type TV = seasons[Integer],
description[String], Episode*
=>
type Show =
show [(@type[String],
title[String], year[Integer],
Aka{1,10}, Review*,
box_office[Integer],
video_sales[Integer])
| (@type[String], title[String],
year[Integer], Aka{1,10}
Review*, seasons[Integer],
description[String],Episode*)]
17
Corresponding relational configuration
TABLE TV
( TV_id INT,
seasons String,
parent_show )
=>
TABLE Description
( Description_id INT
description String,
parent_TV )
TABLE TV (
TV_id INT,
seasons String,
description String,
parent_Show )
18
Union Factorization/Distribution continues

The Second law (a[t1|t2] == a[t1]|a[t2])
Type Show = show[(@type[String],
title[String],year[Integer],
Aka{1,10}, Review*,
box_office[Integer],
video_sales[Integer])
| (@type [String],
=>
title [String], year [Integer],
Aka{1,10}, Review*,
seasons [Integer],
description [String],
Episode*) ]
type Show = (Show Part1 | Show Part2 )
type Show Part 1 = show [@type [String],
title [String], year[Integer], Aka{1,10},
Review*, box_office[Integer],
video_sales[Integer] ]
type Show Part2 =
show [@type [String], title[String],
year [Integer], Aka{1,10},
Review*, seasons [Integer],
description [String], Episode* ]
19
Corresponding relational configurations
TABLE Show (
Show_id INT,
type String, title String,
year INT)
TABLE Show_Part1 (
Show_Part1_id INT,
type String, title String,
year INT, box_office INT,
TABLE Movie (
video_sales INT)
Movie_id INT, Box_Office INT,
=>
video_sales INT, parent_show INT) TABLE Show_Part2 (
Show_Part2_id INT,
TABLE TV (
type String, title String,
TV_id INT, seasons INT,
year INT, seasons INT,
description string, parent_show INT)
description String )
20
Wildcard rewritings


‘~’: any element names can be used
‘~!a’: any name but “a” can be used.
Type Review =
review [~[ String ]*]
=>
type Reviews =
review[ (NYTReview | OtherReview)*]
type NYTReview = nyt[ String]
type OtherReview = (~!nyt) [String]
21
XQuery Queries Examples




Q1: FOR $v in imdb/show WHERE $v/year = 1999 RETURN
($v/title, $v/year, $v/nyt_reviews)
Q2: FOR $v in imdb/show RETURN $v
Q3: FOR $v in imdb/show WHERE $v/title = c3 RETURN
$v/description
Q4: FOR $v in imdb/show RETURN <result> { $v/title,
$v/year, (FOR $e IN $v/episode WHERE $e/guest_director = c4
RETURN $e) }
</result>
22
XQuery Workload Examples
=

Publish
{ Q1 : 0.4, Q2: 0.4, Q3: 0.1, Q4: 0.1}

Lookup = {Q1: 0.1, Q2: 0.1, Q3:0.4, Q4: 0.4}
23
Search Algorithm
Procedure GreedySearch
Input: xSchema: schema, xWkld: query workload, xStats:data statistics
Output: pSchema: an efficient physical schema
1 begin minCost = infinite large ;
pSchema = GetInitialPhysicalSchema(xSchema)
cost = GetPSchemaCost(pSchema, xWkld, xStats)
while (cost < minCost) do
5
minCost = cost
pSchemaList = ApplyTransformations(pSchema)
for each pSchema’ € pSchemaList do
cost’=GetPSchemaCost(pSchema’,xWkld,xStats)
if cost’s < cost then cost = cost’; pSchema = pSchema’ endif
10
endfor
endwhile
return pSchema
end.

24
Experimental Settings

Two variations of the greedy search: greedy-so and greedy-si.

Greedy-so:



Initial physical schema: all element outlined (except base type).
During search: Inlining transformations applied.
Greedy-si:


Initial physical schema: all elements inlined (except elements with
multiple occurences)
During search: Outlining transformations applied.
25
Efficiency of Greedy Search

5 lookup queries and 3 publish queries
26
Results


For lookup: Greedy-so converges to the final
configuration a lot faster.
For publish: opposite.
27
Reasons:


The traversals made by lookup queries are localized.
The final configuration has only a few inlined elements.
Greedy-so can reach this configuration earlier than greedysi.
The publish queries traverse larger number of elements.
The final configuration has several inlined elements.
Greedy-si can reach this configuration earlier than greedyso.
28
Sensitivity of configurations to varied
workloads


Create a spectrum of workloads that combined the
lookup queries and publish queries in the ratio k : (1k), where k€[0,1] is the fraction of lookup queries in
the particular workload.
Three workloads corresponding to k = 0.25, 0.50,
and 0.75, resulting three configurations.
29
Figure 11: Sensitivity to variations in the
workload
30
Inlining as a bad idea to some queries



(a)The query does limited, localized traversals and/or
does not access all the attributes involved.
(b)The query has highly selective selection
predicates.
(c)The query involves join of attributes not
structurally adjacent in the XML Schema (e.g. actor
and director).
31
Effectiveness of XML transformations:Union Distribution
32
Results of the union-transformed configuration



Overlap between the curves for C[0.25] and C[0.75]
with OPT.
C[0.25] and C[0.75] cross at a small angle.
C[All-inlined] performed 2~5 times worse than
optimal.
33
Wildcards
Find the NYTimes reviews for shows produced in
1999:
34
Questions

The optimal mapping in this paper is
cost-based. What else needs to be
considered?
35
References






P.Bohannon, J.Freire, P. Roy, and J. Sim’eon. From XML schema to relations: A
cost –based approach to XML storage. Technical report, Bell Laboratories, 2001.
Full version.
A. Deutsch, M. Fernandez, and D. Suciu. Storing semi-structured data with
STORED. In Proc. Of SIGMOND, pp 431-442, 1999.
D. Florescu and D. Kossman. A performance evaluation of alternative mapping
schemas for storing XML in a relational database. Technical Report 3680, INRIA,
1999
M. Klettke and H. Meyer. XML and object-relational database system –
enhancing structural mappings based on statistics. In Proc. Of WebDB, pp6368, 2000.
A. Schmidt, M. Kersten, M. Windhouwer, and F.Waas. Efficient relational storage
and retrieval of XML documents. In Proc. Of WebDB, pp47-52, 2000.
J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton.
Relational databases for querying XML documents: Limitations and
Opportunities. In Proc. Of VLDB, pp302-314, 1999.
36