Main challenges in XML/Relational mapping - TKK

Main challenges in
XML/Relational mapping
Juha Sallinen
Hannes Tolvanen
Introduction: XML and databases
Objectives of the study
Introduction: XML and databases
Basic definitions
• XML/relational mapping means data
transformation between XML and relational
data models
• A mapping method is the way the mapping is done
Native vs. Relational
• Why store XML documents in a relational
database rather than a native XML database?
– Immaturity of current native XML databases
– Emerging technology, no ”de facto” standard
– Well-working relational databases currently in use
• Efficient and usable
• May have been in use for years
Mapping dilemma
• XML data model supports much more flexible
data structures than relational model
• Two fundamental differences:
– XML tags
– Nested structure of XML elements vs. flat
structure of relational tables
• If an XML document did not originate from a
relational data source, the data may not fit a
relational schema very well
Dichotomy of mapping methods
• There are two fundamentally different
techniques of storing XML documents in a
relational database
– LOB presentation
– Composed presentation
LOB presentation
• LOB stands for Large Object
• One XML document is put into a single column of
a relational table
• At least one column for indexing is also needed
• Does not take full advantage of a classical
relational database (no XML extensions)
– Not possible to use SQL to query the XML content
• Not a very interesting choice!
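A minimal sketch of LOB presentation, assuming SQLite as the relational database. The table name `xml_docs` and its columns are hypothetical. The whole document goes into one column, a second column serves as the index, and SQL can only treat the content as opaque text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One column holds the whole document; doc_id is the indexing column.
conn.execute(
    "CREATE TABLE xml_docs (doc_id INTEGER PRIMARY KEY, content TEXT NOT NULL)"
)

doc = "<person><name>eddy example</name></person>"
conn.execute("INSERT INTO xml_docs (doc_id, content) VALUES (?, ?)", (1, doc))

# Retrieving the original document is trivial.
retrieved = conn.execute(
    "SELECT content FROM xml_docs WHERE doc_id = ?", (1,)
).fetchone()[0]

# Without XML extensions, the best SQL can do is plain text matching,
# which knows nothing about elements or nesting.
matches = conn.execute(
    "SELECT doc_id FROM xml_docs WHERE content LIKE ?", ("%<name>eddy%",)
).fetchall()
```

The text match depends on the exact serialization of the document, which is one reason this choice is of limited interest when the data has to be queried.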
Composed presentation
• Data structure of an XML document is
”shredded” over one or more tables
• Example: Different elements to different tables
• Multiple ways to do this
– Table-based and object-relational mapping will
be introduced later
Objectives of the study
• Find and explain the main issues to be
considered when converting an XML schema to a
relational schema
– In other words: The main challenges that have to
be taken into account by
• Designers of XML/relational mapping methods
• Users who need to map the data explicitly
• Find and describe briefly two general mapping
methods based on composed presentation
Issues to consider in mapping
• Some of the most essential data characteristics
– Existence of schema definition document
– Stability of the schema
– Degree of structure
• Usage model for data
– Queries against the database
– Requirement of preserving ”hidden” information
• DBMS implementation
– not covered by the study, because scope was limited
to the classical relational model
Data characteristics: Existence of
XML schema definition
• A schema definition says how the structure of XML
documents conforming to the schema is restricted
– XSD (XML Schema Definition) and DTD (Document Type
Definition) are currently the dominant standards for defining
XML schemas.
• If we have the definition for the schema, conversion to a
relational schema will be based on it.
• If we don’t have the schema definition, we have to
guess how the structure of the given XML vocabulary
is restricted.
– Guesses are based on the data of instances of the vocabulary
(XML documents). In other words, we extract the schema from
the available data.
– This is not unproblematic, as the next example shows
Data characteristics: Existence of
XML schema definition 2 - Example
• Illustration of the problem of extracting the schema from data
<personname>eddy example</personname>
<address>mannerheimintie 10, 00000 helsinki</address>
• We might deduce from the document, that we wish to
restrict the schema to
<!ELEMENT addressbook (name, address)>
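Extracting a schema from instance data can be sketched as follows. This is a hedged illustration only: the function `infer_content_model` is hypothetical, it looks only at the direct children of the root element, and the instance documents are assumed to be wrapped in an `addressbook` element with `name` and `address` children, as the DTD guess above suggests.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def infer_content_model(documents):
    """Map each child tag of the root to the largest number of
    occurrences observed in any single document."""
    model = {}
    for xml_text in documents:
        root = ET.fromstring(xml_text)
        counts = Counter(child.tag for child in root)
        for tag, n in counts.items():
            model[tag] = max(model.get(tag, 0), n)
    return model


first = (
    "<addressbook><name>eddy example</name>"
    "<address>mannerheimintie 10, 00000 helsinki</address></addressbook>"
)
guess = infer_content_model([first])

# A later document with an element not seen before widens the guess.
second = (
    "<addressbook><name>eddy example</name>"
    "<address>jämeräntaival 10, 02150 espoo</address>"
    "<summerCottageAddress>hiekkatie 7, 99999 oulu</summerCottageAddress>"
    "</addressbook>"
)
revised = infer_content_model([first, second])
```

The guess is only as good as the documents seen so far: every new instance may add elements or raise an occurrence count, and each such change reaches the relational schema derived from it.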
Data characteristics: Existence of
XML schema definition 2 –
Example continued
But if the following document is received from the data source, we either have
to extend our relational schema, dismiss the data that the relational schema
doesn’t support (the summer cottage’s address), or combine the two fields:
<address>jämeräntaival 10, 02150 espoo</address>
<summerCottageAddress>hiekkatie 7, 99999 oulu</summerCottageAddress>
We can alter the database schema by adding an extra column to the table
mapped from the addressbook element to support the new information
– This solution can’t be applied, however, unless we know that the relation between
person and summer cottage is 1:1. We might get documents containing persons
that have many summer cottage addresses, and we would again have to
alter the database schema. We would have to
create a property table for the addresses.
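A property table for the summer cottage addresses might look like the sketch below, assuming SQLite. The table and column names are hypothetical, and the second cottage address is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE addressbook (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    address   TEXT NOT NULL
);
-- Property table: one row per summer cottage, any number per person.
CREATE TABLE summer_cottage_address (
    person_id INTEGER NOT NULL REFERENCES addressbook(person_id),
    address   TEXT NOT NULL
);
""")

conn.execute(
    "INSERT INTO addressbook VALUES (?, ?, ?)",
    (1, "eddy example", "jämeräntaival 10, 02150 espoo"),
)
# A second cottage (hypothetical data) needs no schema change.
conn.executemany(
    "INSERT INTO summer_cottage_address VALUES (?, ?)",
    [(1, "hiekkatie 7, 99999 oulu"), (1, "rantatie 1, 99999 oulu")],
)

cottages = [
    row[0]
    for row in conn.execute(
        "SELECT address FROM summer_cottage_address "
        "WHERE person_id = ? ORDER BY address",
        (1,),
    )
]
```

Compared with an extra column, the property table costs a join on every read but absorbs a 1:N relation without further changes to the schema.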
Evolving schema
• If the schema of XML vocabulary is
defined, but it experiences changes,
respective changes must be made to
relational schema
• Changes are not always as easy to
make to the relational schema as in the previous
example (if the composed approach is used)
• The likelihood of schema changes
should be evaluated.
Degree of structure of the XML
Categorization used in the study:
1. Structured data
Data is totally independent from the presentation used to
describe it.
Document can be navigated without examining it first
2. Semi-structured data
Some blocks of the document may contain optional parts
3. Marked-up text
Documents require the preservation of ”hidden” information
E.g. HTML documents
These terms have different meanings in the
literature. Information on the following slide is
based on the definitions of this slide.
Degree of structure of the XML
• Structured documents can easily be mapped to a
database using composed presentation. Semi-structured
documents can also be decomposed if a
schema definition is provided. If mixed content is
included, it depends on the usage of the data whether
LOB presentation is better for the mixed content
block than further fragmentation.
• The requirement of marked-up text to preserve “hidden”
information is discussed later.
Storing mixed content in relations
• Mixed content: Document elements embedded in
character data, e.g.
<h1>example</h1><p>here you have a <b>short</b> example</p>
• Designing a relational schema to store mixed content
– If there are blocks in the content that make sense only as a
whole, decomposition of those blocks makes no sense.
– If we have strong arguments for decomposing a block containing
mixed content, one possible decomposition method is to create
one table for the root element and one property table for
character data, and a property table for every element that
appears in the content.
Mixed content mapping example
(#PCDATA | B | C)*>
• Example instance:
Here we have a <b>nice
</b> example
• Relational schema
B(a_fk, b, bOrder)
C(a_fk, c, cOrder)
PCDATA(a_fk, pcdata,
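The decomposition of mixed content can be sketched as below. This is an illustration under assumptions, not the exact schema of the slide: the root element is assumed to be named `a`, child elements are assumed to be `b` or `c` and to contain only text, and a single shared position counter stands in for the per-table order columns. The function `shred_mixed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def shred_mixed(a_id, xml_text):
    """Split mixed content into rows of (a_fk, value, position) per table."""
    root = ET.fromstring(xml_text)
    rows = {"PCDATA": [], "B": [], "C": []}
    position = 0

    def add_text(text):
        nonlocal position
        if text:
            rows["PCDATA"].append((a_id, text, position))
            position += 1

    add_text(root.text)  # character data before the first child element
    for child in root:
        # Assumes only <b> and <c> children; any other tag raises KeyError.
        rows[child.tag.upper()].append((a_id, child.text or "", position))
        position += 1
        add_text(child.tail)  # character data following this child
    return rows


rows = shred_mixed(1, "<a>Here we have a <b>nice</b> example</a>")
```

The position values are what make it possible to put the character data and the elements back together in their original order.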
Usage models for data: Type of
queries executed against the database
• The spectrum of queries
– Queries that retrieve XML documents
– Queries that retrieve fragments of XML
– Queries that make transformations on XML
– And even more complex queries...
Query examples 1
• Sample documents
<streetaddress>jämeräntaival 10</streetaddress>
<summerCottageAddress>hiekkatie 7, 99999 oulu</summerCottageAddress>
<streetaddress>smt 10</streetaddress>
<summerCottageAddress>hiekkatie 7, 99999 oulu</summerCottageAddress>
• Query emitting XML fragment: Select the
names of persons who live in Espoo
Query examples 2
• Query making transformation: “select the
number of persons living in Espoo”
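Both kinds of query can be sketched against a composed representation, assuming SQLite. The `person` table, its `city` column and the second sample row are hypothetical and chosen only to make the two queries runnable.

```python
import sqlite3
from xml.sax.saxutils import escape

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, streetaddress TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        ("eddy example", "jämeräntaival 10", "espoo"),
        ("anna example", "smt 10", "helsinki"),  # hypothetical row
    ],
)

# Query emitting an XML fragment: names of persons who live in Espoo.
names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM person WHERE city = ? ORDER BY name", ("espoo",)
    )
]
fragment = "".join(f"<name>{escape(n)}</name>" for n in names)

# Query making a transformation: number of persons living in Espoo.
count = conn.execute(
    "SELECT COUNT(*) FROM person WHERE city = ?", ("espoo",)
).fetchone()[0]
```

The first query rebuilds markup from relational rows, while the second produces a value that appears in none of the stored documents.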
Preservation of “hidden” information
• The XML document contains “hidden” information that
is related to the presentation of the data, not the data itself
– Order of elements
– Comments
– Whitespaces
• It might be required that original XML documents can
be retrieved
– Trivial when LOB presentation is used
– If composed presentation is used, all “hidden”
information needs to be stored in relations
Table-based mapping
Listing 1. Required structure of XML document in table-based mapping (Bourret, 2001).
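A minimal sketch of table-based mapping, assuming a document shaped as a root element that contains one element per table, each holding `row` elements whose children are the column values. The element names, the `person` table and the function `load_table_based` are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = """<database>
  <person>
    <row>
      <name>eddy example</name>
      <address>mannerheimintie 10, 00000 helsinki</address>
    </row>
  </person>
</database>"""


def load_table_based(conn, xml_text):
    """Insert each <row> into the table named after its parent element."""
    root = ET.fromstring(xml_text)
    for table in root:
        for row in table.findall("row"):
            columns = [col.tag for col in row]
            values = [col.text for col in row]
            column_list = ", ".join(columns)
            placeholders = ", ".join("?" for _ in columns)
            # Element names are used as identifiers; the sketch trusts them.
            conn.execute(
                f"INSERT INTO {table.tag} ({column_list}) VALUES ({placeholders})",
                values,
            )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, address TEXT)")
load_table_based(conn, DOC)
loaded = conn.execute("SELECT name, address FROM person").fetchall()
```

The mapping is simple because the document already mirrors the tables; documents with any other shape cannot be loaded this way.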
Object-Relational mapping
• Mapping method for mapping any XML document
that has a schema definition.
• The idea is to convert the schema of document to
an object schema, and then convert the object
schema to relational schema
• Step of object/relational conversion is predefined,
but XML/object conversion leaves some freedom
to define the object view that is mapped from XML
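The two steps can be sketched as below, assuming SQLite. The object view (`Person` with a list of addresses) is one possible choice among several. The object/relational rule used here (class to table, scalar property to column, multi-valued property to a child table with a foreign key) is an assumption of this sketch, not a rule taken from the study.

```python
import sqlite3
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Person:
    """Object view chosen for the <person> element (hypothetical)."""
    name: str
    addresses: list = field(default_factory=list)


def xml_to_objects(xml_text):
    """Step 1: XML to objects."""
    root = ET.fromstring(xml_text)
    return [
        Person(
            name=p.findtext("name", default=""),
            addresses=[a.text or "" for a in p.findall("address")],
        )
        for p in root.findall("person")
    ]


def objects_to_relations(conn, people):
    """Step 2: objects to rows, following the fixed rule described above."""
    for person in people:
        cur = conn.execute("INSERT INTO person (name) VALUES (?)", (person.name,))
        person_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO person_address (person_id, address) VALUES (?, ?)",
            [(person_id, address) for address in person.addresses],
        )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE person_address (
    person_id INTEGER NOT NULL REFERENCES person(person_id),
    address   TEXT NOT NULL
);
""")

DOC = (
    "<addressbook><person><name>eddy example</name>"
    "<address>jämeräntaival 10, 02150 espoo</address>"
    "<address>hiekkatie 7, 99999 oulu</address></person></addressbook>"
)
people = xml_to_objects(DOC)
objects_to_relations(conn, people)
address_count = conn.execute("SELECT COUNT(*) FROM person_address").fetchone()[0]
```

The freedom lies in the first step: the same document could just as well be viewed as a `Person` with two named address properties, which would lead to a different relational schema.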
• Choosing among the possible relational
representations for XML data involves many issues that
must be considered.
• Some of the issues limit the choice to LOB presentation
(no schema, rapidly evolving schema, queries include only
retrieval of original documents)
• LOB presentation can also be used for storing blocks of
the document to which there are no references from elsewhere.
• The usual reason why the decomposition method is generally
preferred, when possible, is the performance gain. The
data also becomes more accessible to applications that use the
database but don’t publish any views of the data in XML.