0360-569 Semantic Web
(Winter 2007)
A Report on
DTD vs XML Schema: A
Practical Study
By: Bex, G. J., Neven, F., Bussche, J. V.
Presented By: Quazi Rahman
Titas Mutsuddi
Outline
1. Introduction
2. Structural View of DTDs and XSDs
3. Dataset
4. Expressiveness of XSDs
5. Additional Features
6. Regular Expression Characterization
7. Schema and Ambiguity
8. Errors
9. Conclusion
10. References
60-569
1. Introduction
 DTDs and XSDs are two widely used schema languages for
describing the contents of XML documents.
 Although DTDs and XSDs differ syntactically, they
are closely related on an abstract level.
 In this paper the authors present a comparative
study of DTDs and XSDs. They try to
answer two questions:
 Which of the extra features and expressiveness of XML
Schema, not available in DTDs, are effectively used in
practice, and
 How sophisticated are the structural properties (the nature
of the regular expressions) of the two formalisms?
1. Introduction (cont’d)
Definition of DTD and XSD
 Both Document Type Definitions (DTDs) and
XML Schema Definitions (XSDs) state which
tags and attributes are used to describe the
elements in an XML document, where each tag
is allowed, which tags can appear within
other tags, etc.
 Applications use a document's DTD or XSD to
properly read and display the document's
contents.
 Changes in the format of the document can be
made easily by modifying its DTD or
XSD.
1. Introduction (cont’d)
Merits and Demerits of DTD and XSD
 Shortcomings of DTDs
 No support for namespaces
 Limited support for data types
 Limited support for cardinality
 Shortcomings of XSDs
 XSDs are more complex than DTDs
 There are complaints about performance
 Merits of XSDs
 XSDs are extensible to future additions
 Reuse Schema in other Schemas
 Create new data types derived from the standard types
 Reference multiple schemas in the same document
 XSDs are richer and more powerful than DTDs
1. Introduction (cont’d)
Merits of XSDs
 XSDs are written in XML
 Don't have to learn a new language
 Can use an XML editor to edit Schema files
 Can use an XML parser to parse Schema files
 Can transform a Schema with XSLT
 XSDs support data types. It is easier to:
 Describe allowable document content
 Validate the correctness of data
 Work with data from a database
 Define data facets (restrictions on data)
 Define data patterns (data formats)
 Convert data between different data types
 XSDs support namespaces
2. Structural View of DTD and XSD
 An XML document may be viewed as a
finite ordered tree structure.

An Example:
<store>
  <dvd>
    <title>Amelie</title>
    <price>17</price>
  </dvd>
  <dvd>
    <title>Good bye, Lenin</title>
    <price>20</price>
    <discount>20%</discount>
  </dvd>
</store>
2. Structural View of DTD and XSD
(cont’d)

Corresponding Tree structure:
store
  dvd
    title  “Amelie”
    price  “17”
  dvd
    title  “Good bye, Lenin”
    price  “20”
    discount  “20%”
2. Structural View of DTD and XSD
(cont’d)

DTD to describe the previous document
<!ELEMENT store (dvd+)>
<!ELEMENT dvd (title, price, discount?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT discount (#PCDATA)>

For the tree above, let us assume that every node label is a
member of some finite alphabet Σ.

Definition 1. A DTD is a pair (d, s) where d is a function that
maps Σ-symbols to regular expressions over Σ, and s ∈ Σ is the
start symbol. A tree satisfies the DTD if its root is labeled by s
and, for every node u with label a, the sequence a1…an of labels
of its children matches the regular expression d(a).
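
Definition 1 can be sketched in code. This is a minimal illustration, not the authors' tooling: the names D, START and satisfies are hypothetical, and each rule d(a) is written as a Python regular expression in which every child label is followed by a single space so that names cannot run together.

```python
import re
import xml.etree.ElementTree as ET

# d from Definition 1 for the store DTD: label -> regular expression over labels.
# Assumption: every child label is written followed by one space.
D = {
    "store": r"(dvd )+",
    "dvd": r"title price (discount )?",
    "title": r"",      # #PCDATA: no element children
    "price": r"",
    "discount": r"",
}
START = "store"  # the start symbol s


def satisfies(root, d=D, start=START):
    """Return True if the tree satisfies the DTD (d, start) per Definition 1."""
    if root.tag != start:
        return False

    def ok(node):
        rule = d.get(node.tag)
        if rule is None:          # label outside the alphabet
            return False
        word = "".join(child.tag + " " for child in node)
        return re.fullmatch(rule, word) is not None and all(ok(c) for c in node)

    return ok(root)


doc = ET.fromstring(
    "<store>"
    "<dvd><title>Amelie</title><price>17</price></dvd>"
    "<dvd><title>Good bye, Lenin</title><price>20</price>"
    "<discount>20%</discount></dvd>"
    "</store>"
)
print(satisfies(doc))
```

The check only looks at element labels, which matches the structural view taken here; text content and attributes are ignored.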
2. Structural View of DTD and XSD
(cont’d)

We can abstract the DTD by a set of rules of the form a → r,
where a is an element and r is a regular expression over the
alphabet of elements. For example:
store → dvd+
dvd → title price discount?
Definition 2. A specialized DTD (SDTD) is a 4-tuple (Σ, Σ’, δ, μ),
where Σ’ is an alphabet of types, δ is a DTD over Σ’ and μ is a
mapping from Σ’ to Σ. Note that μ can be applied to a Σ’-tree as
a relabeling of the nodes, thus yielding a Σ-tree. A Σ-tree t
then satisfies the SDTD if t can be written as μ(t’), where t’
satisfies the DTD δ.
2. Structural View of DTD and XSD
(cont’d)

A simple example of a SDTD:
store → (dvd1 + dvd2)* dvd2 (dvd1 + dvd2)*
dvd1 → title price
dvd2 → title price discount
Here, dvd1 defines ordinary DVDs while dvd2 defines DVDs on
sale. The rule for store specifies that there should be at least
one of the latter.
Definition 3. A single-type SDTD is an SDTD (Σ, Σ’, (d, s), μ)
with the property that no regular expression d(a) has
occurrences of types of the form bi and bj with the same b but
different i and j.
The example above is not a single-type SDTD, as both dvd1
and dvd2 occur in the rule for store.
2. Structural View of DTD and XSD
(cont’d)
 An example of a single-type grammar is given below:
store → regulars discounts
regulars → (dvd1)*
discounts → dvd2 (dvd2)*
dvd1 → title price
dvd2 → title price discount
 Although there are still two element definitions, dvd1 and
dvd2, they can only occur in different contexts, regulars
and discounts respectively.
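
The single-type condition of Definition 3 can be checked mechanically. The sketch below is only an illustration: the function names are hypothetical, and the mapping μ is assumed to follow the naming convention of these slides, where a type such as dvd1 relabels to the element dvd by dropping its trailing digits.

```python
import re


def mu(type_name):
    """Assumed convention: type 'dvd1' relabels to element 'dvd'."""
    return re.sub(r"\d+$", "", type_name)


def is_single_type(rules):
    """True if no rule mentions two different types with the same element name."""
    for expr in rules.values():
        chosen = {}  # element name -> the one type used for it in this rule
        for type_name in re.findall(r"[A-Za-z_]\w*", expr):
            if chosen.setdefault(mu(type_name), type_name) != type_name:
                return False
    return True


# The SDTD in which dvd1 and dvd2 compete inside the rule for store.
competing = {
    "store": "(dvd1 + dvd2)* dvd2 (dvd1 + dvd2)*",
    "dvd1": "title price",
    "dvd2": "title price discount",
}

# The single-type grammar with separate regulars and discounts contexts.
single = {
    "store": "regulars discounts",
    "regulars": "(dvd1)*",
    "discounts": "dvd2 (dvd2)*",
    "dvd1": "title price",
    "dvd2": "title price discount",
}

print(is_single_type(competing), is_single_type(single))
```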
2. Structural View of DTD and XSD
(cont’d)

An XSD fragment corresponding to the store rule of the SDTD that is
not single-type, (dvd1 + dvd2)* dvd2 (dvd1 + dvd2)*, may be written as:
<xs:element name="store">
  <xs:complexType>
    <xs:sequence>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="dvd" type="dvd1"/>
        <xs:element name="dvd" type="dvd2"/>
      </xs:choice>
      <xs:element name="dvd" type="dvd2"/>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="dvd" type="dvd1"/>
        <xs:element name="dvd" type="dvd2"/>
      </xs:choice>
    </xs:sequence>
  </xs:complexType>
</xs:element>
Because dvd is declared with two different types in the same content
model, XML Schema's Element Declarations Consistent constraint rejects
this fragment.
3. Dataset
 The authors gathered a representative
sample of DTDs and XSDs for this
comparative study, mostly from the online
source xml.coverpages.org
 They obtained 109 DTDs and 93 XSDs for
this study.
4. Expressiveness of XSDs
Single-Type
 The authors tried to find out whether the expressive
power of single-type SDTDs is actually used in real-world
XSDs.
 Most XSDs define local tree languages, that is, languages that
can be defined by DTDs
 Only 5 out of the 30 XSDs used in this analysis (reported as
15%) are true single-type SDTDs
 All five XSDs were of the form:
p → …a1…
q → …a2…
a1 → expr1
a2 → expr2
That is, when the parent of an a is p (or q), the
rule for a1 (or a2) is used.
4. Expressiveness of XSDs (cont’d)
Derived Types
 XML Schema provides two kinds of types,
simple and complex types
 A simple type describes the character data an
element can contain (like #PCDATA in DTDs)
 A complex type specifies which elements may
occur as children of a given element.
 In XSDs, new types may be derived from existing
types using two mechanisms:
 Extension
 Restriction
4. Expressiveness of XSDs (cont’d)
Derived Types
 A simple type can be extended to a complex type to add
attributes to elements
 A complex type can be extended to add a sequence of
additional elements to its content model or to add
attributes
 A simple type can be restricted to limit the acceptable
range of values for that type
 A complex type can be restricted to limit the set of
acceptable sub-trees
                    Extension   Restriction
Simple type (%)        27           73
Complex type (%)       37            7
Table 1: Relative use of derivation features in XSDs
4. Expressiveness of XSDs (cont’d)
Derived Types
Out of the 93 XSDs considered:
 Approximately one fifth (20%) do not construct new types
through derivation at all
 Extension is used to define additional attributes in 58%,
and to add new elements to a content model in 42%
 Restriction of complex types is used in only 7%
 Note that only 37% use extension of complex types,
which is the counterpart of inheritance in OOP.
 Extension of simple types occurs in 27% of XSDs
 Restriction of simple types is the most heavily used (73%),
which reflects a shortcoming of DTDs: their only text type is
the unrestricted #PCDATA
4. Expressiveness of XSDs (cont’d)
Derived Types
 6 XSDs finalize a type definition, that is, they use an
attribute specifying that the type can be neither
restricted nor extended
 11 XSDs use abstract type definitions, which cannot be
used directly; new types must be derived from them.
 A derived type can occur anywhere in the content model
where the original type is allowed, but this can be
prevented by applying the block attribute to the original
type. 2 XSDs use this blocking feature.
 The fixed attribute is used to indicate that an
element or attribute is restricted to a specific value. Only a
single XSD uses this feature.
 With the substitutionGroup feature, an element name
can be substituted by another name. This feature is used
by 10 XSDs.
5. Additional Features
 The &-operator, which specifies that all elements must occur but
that their order is not significant, was available in SGML DTDs
but was lost in XML DTDs (a1 & a2 & a3 ≡ a1a2a3 | a1a3a2 | … | a3a2a1).
In XSDs this feature is restored by the xsd:all element. Only 4
XSDs used this operator.
 Elements of an XML document can be identified using an ID
attribute and referred to by IDREF or IDREFS (also supported by
DTDs). IDs are unique throughout the document. Only 6
XSDs used this feature.
 Referring to elements can also be accomplished by key/keyref
pairs. Using a reference to a key implies that the element with
the corresponding key must exist in the document. This is used
by 4 XSDs.
 One important feature of XSDs is the use of namespaces. They
make it possible to use elements and types in the current XSD that
are defined elsewhere. Apart from the obvious inclusion of the XML
Schema namespace, 20 XSDs used this feature.
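
The expansion of the &-operator into a plain disjunction can be written out explicitly. The short sketch below (the function name expand_all is hypothetical) lists every ordering, which also shows why the expansion grows as n! and quickly becomes impractical to write by hand in an XML DTD.

```python
from itertools import permutations


def expand_all(names):
    """Rewrite a1 & ... & an as a disjunction of all orderings."""
    return " | ".join("".join(order) for order in permutations(names))


print(expand_all(["a1", "a2", "a3"]))
```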
6. Regular Expression
Characterization
 The second question the authors tried to answer is how
sophisticated regular expressions tend to be in real-world
DTDs and XSDs.
 For this analysis, the authors had to perform some
preprocessing on the documents:
 DTD element definitions were converted to a canonical form.
For example, <!ELEMENT lib ((book | journal)*)> was
converted to the form (c1 | c2)*, keeping only the structural
DTD information
 XSDs were preprocessed using XSLT to the same canonical form
 For DTDs, a total of 11802 element definitions was reduced to
750 canonical forms, and for XSDs, a total of 1016 element
definitions was reduced to 138 canonical forms, giving
838 for both types of schema together.
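
The canonical-form step can be illustrated with a few lines of code. This is only a sketch of the idea described above, not the authors' preprocessing: the function name is hypothetical, and element names are simply replaced by c1, c2, … in order of first occurrence while operators and #PCDATA are left untouched.

```python
import re


def canonical_form(content_model):
    """Rename element names to c1, c2, ... in order of first occurrence."""
    names = {}

    def rename(match):
        name = match.group(0)
        if name.startswith("#"):       # keep keywords such as #PCDATA
            return name
        if name not in names:
            names[name] = f"c{len(names) + 1}"
        return names[name]

    return re.sub(r"#?[A-Za-z_][\w.\-]*", rename, content_model)


print(canonical_form("(book | journal)*"))
print(canonical_form("(title, price, discount?)"))
```

Two definitions with the same shape but different element names map to the same string, which is how many element definitions collapse into far fewer canonical forms.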
6. Regular Expression
Characterization
(cont’d)
 Definition 4. A base symbol is a regular expression a, a?,
or a*, where a ∈ Σ; a factor is of the form e, e?, or e*,
where e is a disjunction of base symbols. A simple
regular expression is ε, ∅, or a sequence of factors, such
as (a*+b*)(a+b)?b*(a+b)*.
 The authors introduce a uniform syntax to denote
subclasses of simple regular expressions by specifying the
allowed factors. They distinguish base symbols extended
by ? or *. Further, they distinguish between factors with
one disjunct and factors with arbitrarily many disjuncts; the
latter are denoted by (+…). Finally, factors can again be
extended by * or ?. For example, they write
RE((+a)*, a?) for the set of regular expressions e1…en
where every ei is (a1+…+an)* for some a1,…,an ∈ Σ and
n ≥ 1, or a? for some a ∈ Σ.
6. Regular Expression
Characterization
(cont’d)
 The following table lists the possible factors in simple
regular expressions and how they are denoted
(a, a1, …, an ∈ Σ).
Table 2
Factor              Abbr.     Factor                Abbr.
a                   a         (a1 + … + an)*        (+a)*
a*                  a*        (a1 + … + an)?        (+a)?
a?                  a?        (a1* + … + an*)       (+a*)
(a1 + … + an)       (+a)      (a1* + … + an*)*      (+a*)*
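
The abbreviations of Table 2 can be derived mechanically from a factor. The sketch below is an illustration under stated assumptions: the function name factor_abbr is hypothetical, symbols are plain identifiers, + denotes disjunction as on these slides, and all disjuncts of a factor are assumed to carry the same modifier.

```python
import re

BASE = r"[A-Za-z]\w*"  # a symbol such as a, b or a1


def factor_abbr(factor):
    """Return the Table 2 abbreviation of a single factor."""
    text = factor.replace(" ", "")

    # A base symbol on its own: a, a? or a*.
    match = re.fullmatch(rf"{BASE}([*?]?)", text)
    if match:
        return "a" + match.group(1)

    # A parenthesized disjunction, optionally followed by ? or *.
    match = re.fullmatch(r"\((.+)\)([*?]?)", text)
    if not match:
        raise ValueError(f"not a factor: {factor!r}")
    modifiers = set()
    for disjunct in match.group(1).split("+"):
        inner = re.fullmatch(rf"{BASE}([*?]?)", disjunct)
        if not inner:
            raise ValueError(f"not a base symbol: {disjunct!r}")
        modifiers.add(inner.group(1))
    if len(modifiers) != 1:
        raise ValueError("disjuncts with mixed modifiers are not abbreviated here")
    return f"(+a{modifiers.pop()}){match.group(2)}"


for example in ["b*", "(a1 + a2 + a3)*", "(a1* + a2*)"]:
    print(example, "->", factor_abbr(example))
```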
6. Regular Expression
Characterization
(cont’d)
 The authors analyzed the DTDs and XSDs to
characterize their content models according to the
subclasses defined above.
 The result is presented in Table 3, which lists the
non-overlapping categories of expressions having a significant
population (more than 0.5%)
 There are two major differences between DTDs and XSDs:
 XSDs have more simpleType elements (#PCDATA). This
may be because XSD introduces more distinct
simple types, so that it is now possible to fine-tune the
specification of an element’s content.
 XSDs have fewer expressions in the category RE(a, (+a)*).
This is most probably due to the nature of the XSDs in the
sample, since those describing data are overrepresented
with respect to those describing meta documents
6. Regular Expression
Characterization
(cont’d)
Category                    DTDs (%)   XSDs (%)
#PCDATA                        34         48
EMPTY                          16         10
ANY                             1          0
RE(a)                           5          5
RE(a, a?)                       2         10
RE(a, a*)                       8         10
RE(a, a?, a*)                   1          4
RE(a, (+a))                     3          3
RE(a, (+a)?)                    0          1
RE(a, (+a)*)                   20          2
RE(a, (+a)?, (+a)*)             0          1
RE(a, (+a*)*)                   0          2
Total simple expressions       92         97
Non-simple expressions          8          3
Table 3: Relative occurrence of various types of regular
expressions given in % of element definitions
6. Regular Expression
Characterization
(cont’d)
 The authors compared DTDs and XSDs
using different measures but did not observe
any significant differences between them. More
importantly, the comparisons make clear
that the vast majority of expressions
are simple, both in DTDs (92%) and in XSDs
(97%)
 Some of the comparisons they carried out
are:
 Density
 Width and depth of canonical form
 Simple content model
 Star height
6. Regular Expression
Characterization
(cont’d)
 The density of a schema is defined as the number of
elements occurring in the right-hand sides of its rules
divided by the number of elements.
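
As a worked example of this definition, the density of the store DTD from Section 2 can be computed as below. The sketch assumes that every occurrence of an element name on a right-hand side is counted (the slide does not say whether repeated names count once or several times) and that keywords such as #PCDATA, EMPTY and ANY are not elements; the function name is hypothetical.

```python
import re

KEYWORDS = {"EMPTY", "ANY"}


def density(rules):
    """rules maps each element name to its content model in DTD syntax."""
    occurrences = 0
    for model in rules.values():
        for token in re.findall(r"#?[A-Za-z_][\w.\-]*", model):
            # Assumption: #PCDATA, EMPTY and ANY are not element occurrences.
            if not token.startswith("#") and token not in KEYWORDS:
                occurrences += 1
    return occurrences / len(rules)


store_dtd = {
    "store": "(dvd+)",
    "dvd": "(title, price, discount?)",
    "title": "(#PCDATA)",
    "price": "(#PCDATA)",
    "discount": "(#PCDATA)",
}
print(density(store_dtd))
```

Here four element occurrences (dvd, title, price, discount) are spread over five element definitions.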
6. Regular Expression
Characterization
(cont’d)
 The table below shows the fraction of DTDs and XSDs
versus the fraction of their simple content models: the
majority of documents have 90% or more simple content
models
6. Regular Expression
Characterization
(cont’d)
 The star height of a regular expression is the maximum nesting
depth of Kleene stars occurring in the expression. Content
models with star height larger than 1 are very rare.
 The larger share of star height 1 expressions in DTDs is due to
the abundance of RE(a, (+a)*) expressions in DTDs
relative to XSDs.
star height   DTDs (%)   XSDs (%)
0                61         78
1                38         17
2                 1          4
3                 0          0
Table 4: Star height observed in DTDs and XSDs
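
Star height as defined above can be computed with a single pass over a parenthesized expression. The following is a minimal sketch with a hypothetical function name; it assumes that only the * operator counts as a Kleene star, so that ?, |, commas and + do not contribute.

```python
def star_height(expr):
    """Maximum nesting depth of * in a parenthesized regular expression."""
    best = [0]   # highest star height seen so far at each open nesting level
    last = 0     # star height of the most recently completed symbol or group
    for ch in expr:
        if ch == "(":
            best.append(0)
        elif ch == ")":
            last = best.pop()              # height of the group just closed
            best[-1] = max(best[-1], last)
        elif ch == "*":
            last += 1                      # the star wraps the previous item
            best[-1] = max(best[-1], last)
        elif ch.isalnum() or ch == "_":
            last = 0                       # a plain symbol has height 0
    return best[0]


for example in ["(c1, c2, c3?)", "(c1 | c2)*", "((c1* | c2)* c3)*"]:
    print(example, star_height(example))
```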
7. Schema and Ambiguity
 The XML 1.0 specification by the W3C requires
the content models of a schema definition to be
deterministic, or one-unambiguous.
 The authors checked whether the DTDs and
XSDs in the study respect this requirement
using IBM’s XML Schema Quality
Checker (SQC).
 The authors found that almost all of them follow the
rule.
 Only 3 out of 93 XSDs have one or more
ambiguous content models, which take one of two canonical
forms: c1?(c1|c2)* and (c1c2)|(c1c3).
7. Schema and Ambiguity
(cont’d)
 For DTDs, the first exception is a regular expression of
the type (… | ci | … | ci | …)*. The authors consider this
a typo rather than a design feature.
 The second type of ambiguous regular expression is of
the type c1c2?c2?. The designer’s intention was clearly to
state that c2 may occur zero, one or two times.
 This illustrates a shortcoming of DTDs that has been
addressed in XSDs, as in the following example:
<xsd:sequence>
  <xsd:element name="c1" type="t1"/>
  <xsd:element name="c2" type="t2"
               minOccurs="0" maxOccurs="2"/>
</xsd:sequence>
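
Why the forms above count as ambiguous can be illustrated with the standard position (Glushkov) construction: an expression is one-unambiguous in this sense when no set of possible next positions contains two occurrences of the same element name. The authors used SQC for their check; the sketch below is a substitute technique for illustration only, with hypothetical function names, and it takes expressions as nested tuples ("seq", …), ("alt", …), ("opt", e), ("star", e) rather than parsing DTD syntax.

```python
from itertools import count


def glushkov(node, ids, sym):
    """Return (nullable, first, last, follow) for an expression tree."""
    if isinstance(node, str):                  # one occurrence of an element name
        pos = next(ids)
        sym[pos] = node
        return False, {pos}, {pos}, {pos: set()}
    op, *args = node
    parts = [glushkov(arg, ids, sym) for arg in args]
    follow = {}
    for part in parts:
        follow.update(part[3])
    if op == "alt":
        return (any(part[0] for part in parts),
                set().union(*(part[1] for part in parts)),
                set().union(*(part[2] for part in parts)),
                follow)
    if op == "seq":
        nullable, first, last = True, set(), set()
        for n, f, l, _ in parts:
            for pos in last:                   # ends of the prefix can continue here
                follow[pos] |= f
            first = first | f if nullable else first
            last = last | l if n else set(l)
            nullable = nullable and n
        return nullable, first, last, follow
    if op not in ("opt", "star"):
        raise ValueError(f"unknown operator: {op!r}")
    _, f, l, _ = parts[0]
    if op == "star":
        for pos in l:                          # loop back to the start of the body
            follow[pos] |= f
    return True, f, l, follow


def is_deterministic(expr):
    """True if no choice point offers two positions with the same name."""
    sym = {}
    _, first, _, follow = glushkov(expr, count(), sym)

    def clash(positions):
        names = [sym[pos] for pos in positions]
        return len(names) != len(set(names))

    return not clash(first) and not any(clash(f) for f in follow.values())


ambiguous = ("seq", "c1", ("opt", "c2"), ("opt", "c2"))          # c1 c2? c2?
rewritten = ("seq", "c1", ("opt", ("seq", "c2", ("opt", "c2"))))  # c1 (c2 c2?)?
print(is_deterministic(ambiguous), is_deterministic(rewritten))
```

The second expression accepts the same sequences as c1c2?c2? but passes the check, which shows that the DTD form is expressible deterministically, only less conveniently than with minOccurs and maxOccurs.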
8. Errors
 The authors found several errors in the XSDs they
retrieved:
 Only 30 out of 93 XSDs passed the conformance
test of SQC, that is, comply with the W3C specification
 19 XSDs were designed according to a version of the
specification older than the 2001 one.
 Some simple types were omitted or added from one
version of the specification to another, causing SQC to report
errors.
 Some errors concern violations of the Datatypes part of the
specification, such as a regular expression wrongfully restricting
xsd:string
 Some XSDs violate the specification by specifying a type
attribute for a complexType element, or by leaving out the
name attribute of a top-level complexType element.
9. Conclusion
 Many features defined in the XML Schema
specification are not widely used yet, especially
those related to OO data modeling,
such as derivation of complex types by extension.
 The expressive power of the XSDs under
investigation is almost equivalent to that of
DTDs, which means that, apart from some
exceptions, these XSDs could as well have
been written as DTDs. This suggests that
the level of sophistication offered by XSDs is
not necessary for most applications, at
least so far.
9. Conclusion
(cont’d)
 The datatype part of the XML Schema specification is heavily
used, since it alleviates a major shortcoming of DTDs,
namely the inability to specify the format and type of the
text of an element. In XSDs this is accomplished by
restricting a simple type. Example:
<xs:element name="letter">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="[a-z]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>
 The content models specified in both DTDs and XSDs
tend to be very simple. For XSDs, 97% of all content
models can be classified as simple expressions.
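
The effect of the pattern facet in the letter example can be imitated as follows. This is an analogy rather than a schema validator: the function name is hypothetical, and Python's re module stands in for the XML Schema regular expression flavor, which agrees with it for a simple class such as [a-z]. An XSD pattern has to match the whole value, so fullmatch is the closest counterpart.

```python
import re


def is_valid_letter(text):
    """Accept exactly one lowercase ASCII letter, like the pattern [a-z]."""
    return re.fullmatch(r"[a-z]", text) is not None


for value in ["q", "Q", "ab", ""]:
    print(repr(value), is_valid_letter(value))
```

A DTD could only declare the element as (#PCDATA), which would accept all four values.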
10. References
1. Bex, G. J., Neven, F. and Bussche, J. V., DTDs versus XML Schema: A
Practical Study, In Proceedings of the Seventh International Workshop
on the Web and Databases, WebDB 2004, pages 79-84, Maison de la
Chimie, Paris, France, June 17-18 2004.
2. http://www.webopedia.com/TERM/D/DTD.html
3. http://searchwebservices.techtarget.com/sDefinition/0,,sid26_gci831325,00.html
4. http://en.wikipedia.org/wiki/XML_Schema
5. http://www.w3schools.com/schema/default.asp
6. http://www.w3schools.com/dtd/dtd_intro.asp
7. IBM Corp. XML Schema Quality Checker, 2003,
http://www.alphaworks.ibm.com/tech/xmlsqc
8. R. Cover. The cover pages, 2003, http://xml.coverpages.org/
9. P. Biron and A. Malhotra, XML Schema part 2: datatypes. W3C, May
2001, http://www.w3.org/TR/xmlschema-2/
10. http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-0102/03-01-02.pdf