Transcript XML Schema

CS331: Advanced Database
Systems: Semistructured Data
Management
Norman Paton
University of Manchester
[email protected]
Semistructured Data

Two views of data:



Databases: structured,
modelled, queried,
programmed.
Documents: partially
structured, authored,
read, navigated.
Semistructured data
management is at the
confluence of these two
views.


XML is the principal
data representation
notation for
semistructured data.
XML can be seen as:



An extensible markup
language for documents.
A data model for
hierarchical data.
A notation for
communicating data with
its structure.
See also: COMP30352 – IR, Hypermedia and the Web
XML Language Space


XML (Extensible Markup
Language) is just that:
a markup language with
an extensible collection
of tags.
XML is associated with
many related standards
within the W3C (World
Wide Web Consortium):
http://www.w3.org/.

XML Related Standards:







XPath: navigation.
XQuery: queries.
XSLT: transformations.
XML Schema: document
description.
DOM: modelling
documents as objects.
...
... and underpins:


Web Services.
The Semantic Web.
Markup


Markup is the inclusion
of symbols with special
meaning in a text
document.
Languages with
markup:




LaTeX.
HTML.
RTF.
XML.
<!DOCTYPE html PUBLIC
"-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>The Silver Pigs</title>
</head>
<body bgcolor="#ffffff" text="#000000“>
<h1>Rome</h1>
\documentclass{llncs}
\begin{document}
\title{An Experimental Performance Evaluation
of Join Algorithms for Parallel Object Databases}
\author{Sandra Sampaio\inst{1} \and
Jim Smith\inst{2}\and
Norman W. Paton\inst{1}\and
Paul Watson\inst{2}}
...
XML Markup



In XML, content is
structured using tags.
Tags are distinguished
by the characters ‘<‘
and ‘>’.
Tags often come in
pairs, round some
content, as start and
end tags.
Start tag
<html>
<head>
<title>The Silver Pigs</title>
</head>
<body bgcolor="#ffffff" text="#000000“>
<h1>Rome</h1>
...
</body>
</html>
End tag
Elements




An element is a
meaningful unit of
content enclosed by
tags.
An application may be
able to interpret an
element.
Elements may be
ordered or nested.
Context matters,
especially given nesting.
<station>
<name>Oxford Road</name>
<city>Manchester</city>
</station>
Element Hierarchies


<country>
An XML document
<name>United Kingdom</name>
essentially
<people>
<pop>60094648 </pop>
represents
<maleLE>75.74</maleLE>
hierarchical data.
<femaleLE>80.7</femaleLE>
</people>
The elements in a
</country>
well formed XML
country
document will match
name
people
and respect the
hierarchy.
pop
maleLE
femaleLE
Attributes


Attributes provide
auxiliary information
about elements.
Attributes are
embedded within
start tags, and have
the form “name =
value”.
<station
updatedBy = “Fred Bloggs”
validUntil = “22/06/2005”>
<name>Oxford Road</name>
<city>Manchester</city>
</station>
<country
source=“http://www.cia.gov”>
<name>United Kingdom</name>
<people>
<pop>60094648 </pop>
<maleLE>75.74</maleLE>
<femaleLE>80.7</femaleLE>
</people>
</station>
Models


An XML file may be able
to contain any old tags,
in any order or
combination (while
remaining well-formed).
Restrictions on the legal
tags and values a
document can contain
may be specified using
a DTD or an XML
Schema.


A DTD (Document Type
Definition) provides a
concise syntax for
modelling documents
(but is on the way out).
An XML Schema
definition is itself an
XML document, which
provides a wide range
of modelling constructs
for constraining other
XML documents.
Trains in XML



Hierarchical models can
capture most cycle-free
data fairly naturally.
Hierarchical models,
however, promote some
concepts and demote
others.
The relational model, by
contrast, treats all
concepts as (broadly)
equal.
<train>
<tno>3107101</tno>
<source>Edinburgh</source>
<destination>London</destination>
<visit>
<name>Edinburgh</name>
<time>06:00</time>
</visit>
<visit>
<name>York</name>
<time>08:00</time>
</visit>
<visit>
<name>London</name>
<time>10:00</time>
</visit>
</train>
Tools


XML is widely used,
and many software
systems can
read/write XML
formats.
Generic tools have
also been developed
for designing/editing
XML.
XML Spy showing a valid
XML file as text.
XML Spy

XML Spy supports:





Editing.
Data modelling.
Validation.
Transformation.
…
XML Spy showing an XML
schema document as a
tree.
Oxygen

Oxygen supports:






Editing.
Data modelling.
Validation.
Transformation.
Querying.
…
Oxygen showing an XML
schema document as a
Tree and text.
XML Databases

Native XML Databases:




Store XML in the database
directly (“native”).
Make XML Schema the
optional schema definition
language.
Query the database using
XML query languages
(XPath/XQuery).
Program database data as
XML data structures
(XML:DB, DOM, ...).
An XPath query and
result in eXist
XML Databases

Native XML databases:

Tamino - Software AG:


eXist - Open Source:


http://www.softwareag.com/tamino/.
http://exist.sourceforge.net/.
Standard APIs:


XML:DB Initiative: http://www.xmldb.org/; Both
Tamino and eXist provide XML:DB APIs.
XQJ: XQuery API for Java; Java community
standard.
XML and Relational Databases

Storage options:



Decomposed: store an
XML document in
relational tables, and
reconstruct on retrieval.
Composed: store an XML
document as an attribute
of a relational table.
Retrieval options:

Represent relational
tables as XML (e.g. Java
WebRowSet).

Relational vendors:


Tend to support both
composed and
decomposed storage
models.
Provide APIs that
accommodate XML for
data transport or display
(e.g., in Web Services or
for Web interface
generation).
Summary

XML is becoming increasingly ubiquitous
for data representation for:



files, transport, storage, metadata.
Data management systems must
support storage, querying and
communication using XML.
Soon everything will be stored using
XML? Don’t believe it!
Further Reading


S. Abiteboul, P. Buneman, D. Suchi,
Data on the Web, Morgan-Kaufmann,
1999.
N. Bradley, The XML Companion (3rd
Edition), Addison-Wesley, 2002.
Data Modelling in XML
XML Schema



XML Schema is a W3C
standard for modelling
using XML.
An XML Schema
definition is itself an
XML document – there
is an XML Schema for
XML Schema!
XML Schema files have
a .xsd suffix; XML data
files have a .xml suffix.

An XML Schema can
specify:





Which elements are
mandatory/optional.
Which attributes are
mandatory/optional.
Element/attribute
types.
Cardinalities.
Relative ordering.
Role of XML Schema

Unlike in relational/object databases:


An XML database need not have a schema.
An XML schema may not be very
prescriptive in terms of what can or cannot
be stored.
Train Model
sequence
recurring sequence
Train ComplexType
<xs:complexType name="TrainType">
<xs:sequence>
<xs:element name="tno" type="xs:string"/>
<xs:element name="source" type="xs:string"/>
<xs:element name="destination" type="xs:string"/>
<xs:sequence maxOccurs="unbounded">
<xs:element name="visit">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="time" type="xs:time"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:sequence>
</xs:complexType
Elements

Elements are defined thus:


<element name = “the-name”>.
Attributes associated with elements:


type: specifies the kind of content that an element
with no attributes or sub-elements can have.
Default imposes no constraints.
minOccurs, maxOccurs: the number of times an
element can occur. A value of unbounded allows
open-ended cardinality. Default is once and only
once.
Built-in Types

There are many built-in types:






string.
integer, positiveInteger, negativeInteger.
short, long.
date, dateTime, time.
id, idref.
anyURI
Complex Elements

Any element with
sub-elements or
attributes is declared
to have a complex
type.
<xs:element name="visit">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type=“..."/>
<xs:element name="time" type=“..."/>
</xs:sequence>
</xs:complexType>
</xs:element>



Sequence: the sub-
elements must appear
in the given order.
Choice: a selection is
made from the subelements.
Both sequence and
choice can have
minOccurs/maxOccurs.
Attributes


Attributes can be
defined within complex
types.
Attributes are optional
unless use=“required”.
<xs:complexType name="TrainType">
<xs:sequence>
<xs:element name="tno" type="xs:string"/>
...
</xs:sequence>
<xs:attribute name="engine" type="xs:string"/>
</xs:complexType>

The resulting data file
can populate the
attribute.
<train ...
xsi:type="TrainType“
engine="125">
<tno>3107101</tno>
<source>Edinburgh</source>
<destination>London</destination>
<visit>
...
</visit>
</train>
Building on Existing Types

New types can be constructed from existing
types by:


Extension: for complex types, this means that new
types can be defined that add attributes or
elements to the type on which they are based.
Restriction: for complex types, this means that
new types can be defined with fewer attributes or
elements, reduced cardinalities, etc.
Type Extensions Example
<xs:complexType name="StationType">
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="type" type="xs:string"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="DistrictStationType">
<xs:complexContent>
<xs:extension base="StationType">
<xs:sequence>
<xs:element name="main" type="xs:string"/>
</xs:sequence>
</xs:extension>
</xs:complexContent>
</xs:complexType>
Type Extensions in Use
<xs:element name="stations">
<xs:complexType>
<xs:choice maxOccurs="unbounded">
<xs:element name="station" type="StationType"/>
<xs:element name="districtStation" type="DistrictStationType"/>
</xs:choice>
</xs:complexType>
models
<stations>
<station>
<name>London</name>
<type>main</type>
</station>
<districtStation>
<name>York</name>
<type>district</type>
<main>London</main>
</districtStation>
</stations>
Cross References



Hierarchal models do
not naturally support
shared components.
In documents, crossreferences and
hyperlinks are very
common.
XML has several crossreferencing schemes
(e.g., ID/IDREF,
XPointer).

Within an XML
document:


A value of type ID
must be unique.
A value of type
IDREF must match
some ID within the
document.
ID and IDREF for Trains
ID used to identify station
IDREF used to reference station
XML Schema for ID/IDREF
<xs:element name="station">
<xs:complexType>
<xs:sequence>
<xs:element name="city" type="xs:string"/>
<xs:element name="name" type="xs:ID"/>
</xs:sequence>
</xs:complexType>
</xs:element>
...
<xs:element name="visit">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:IDREF"/>
<xs:element name="time" type="xs:time"/>
</xs:sequence>
</xs:complexType>
</xs:element>
Example: What is the schema?
<shiporder orderid="889923“>
<orderperson>John Smith</orderperson>
<shipto>
<name>Ola Nordmann</name>
<address>Langgt 23</address>
<city>4000 Stavanger</city>
<country>Norway</country>
</shipto>
<item>
<title>Empire Burlesque</title>
<quantity>1</quantity>
<price>10.90</price>
</item>
<item> …</item>
</shiporder>
Summary



XML data can be parsed, transmitted and
queried in the absence of any formal
description of its structure.
Many applications need to be able to make
assumptions about the structure of
documents they process.
XML Schema provides a wide range of
modelling facilities for defining XML
documents.
Further Reading

The W3C Consortium Tutorial is short
but informative:


D. Fallside, XML Schema Part 0: Primer,
2001:


http://www.w3schools.com/schema/
http://www.w3.org/TR/xmlschema-0/
N. Bradley, The XML Companion (3rd
Edition), Addison-Wesley, 2002.
Querying XML Documents using
XPath and XQuery
XPath and XQuery



XPath is a W3C
standard for addressing
parts of an XML
document.
XPath is widely used in
XML languages and
tools, including XQuery
and XSLT.
XPath is not especially
expressive.



XQuery is a W3C
standard for accessing
and restructuring XML
documents.
XQuery is supported by
several XML databases,
but is less widely
deployed than XPath.
XPath is used within
XQuery.
Trains Model

The following diagram shows the XML
schema of the data queried in the
following example queries.
XPath

XPath uses path expressions to describe
routes through documents.



These expressions address locations in a hierarchy
in a way that is familiar from file systems
(/books/chapters/chapter1).
XPath also includes a function library (shared
with XQuery) for manipulating numerical,
string, date, node and sequence values.
Standardisation: XPath 1.0 has been a W3C
standard since 1999; XPath 2.0 became a
W3C standard in January 2007.
XPath Terminology


Nodes include elements, documents
(root elements) and attributes; nodes
can be addressed using XPath.
XPath includes constructs for exploring
relationships between nodes, such as
parent, child, ancestor and descendent.
XPath Syntax
Expression
Description
nodename
Selects child nodes of the current
node.
/
Selects the root node.
//
Selects any node in the document.
.
Selects the current node.
..
Selects the parent of the current
node.
Selects attributes.
@
Simple Paths - 1
What are the numbers of the trains?
/trains/train/tno
//tno
<tno>22403101</tno>
<tno>22407101</tno>
<tno>22403102</tno>
<tno>22446301</tno>
…
Simple Paths - 2
Where do trains start?
/trains/train/source/
<source>aberdeen</source>
<source>aberdeen</source>
<source>aberdeen</source>
<source>aberdeen</source>
<source>aberdeen</source>
…
Duplicates
Where do trains start (no duplicates)? (or tags)
distinct-values(/trains/train/source)
aberdeen
banchory
edinburgh
aberdour
…
Predicates
What are the destinations of trains that start in
Aberdeen?
/trains/train[source="aberdeen"]/destination
<destination>edinburgh</destination>
<destination>edinburgh</destination>
<destination>edinburgh</destination>
<destination>inverness</destination>
<destination>inverness</destination>…
Extracting Node Values
What are the destinations of trains that start in
Aberdeen (no tags)?
/trains/train[source="aberdeen"]
/destination/text()
edinburgh
edinburgh
edinburgh
inverness …
Aggregates
How many trains start from aberdeen?
count(/trains/train[source="aberdeen"])
9
Parents
What are the sources of trains that visit York?
/trains/train[visit/name="york"]/source
/trains/train/visit[name="york"]/../source
<source>edinburgh</source>
<source>edinburgh</source>
<source>edinburgh</source>
<source>london</source>…
Positions in Sequences
Where does the last train (in the document)
from aberdeen visit?
/trains/train[source="aberdeen"][last()]/visit
<visit>
<name>banchory</name>
<time>18:20:00</time>
</visit>
…
XQuery




XQuery is intended to play the same sort of
role for XML data as SQL plays for relational
data.
XQuery is a functional language, and thus is
declarative and compositional.
XQuery can be used to restructure XML data,
as well as to ask questions about it.
XQuery became a W3C standard in January
2007.
XQuery Syntax

XQuery is based on FLOWR (pronounced
“flower”) expressions:





for clauses bind variables to values in streams of
tuples;
let clauses bind variables to the complete result of
an expression;
where clauses filter values in tuple streams;
order by clauses sort tuple streams; and
return clauses construct results.
Iteration and Filtering
What are the destinations of trains that start in
Aberdeen?
for $i in /trains/train
where $i/source = "aberdeen"
return $i/destination
<destination>edinburgh</destination>
<destination>edinburgh</destination>
<destination>edinburgh</destination>…
Identifying Documents
What are the destinations of trains that start in
Aberdeen?
for $i in doc("/db/trains/Trains.xml")/trains/train
where $i/source = "aberdeen“
return $i/destination
<visit>
<name>banchory</name>
<time>18:20:00</time>
</visit>
…
Constructing Results
How many visits are made by each train?
for $t in /trains/train
let $v := $t/visit
return <numvisits> {$t/tno} <count> {count($v)}
</count> </numvisits>
<numvisits>
<tno>22403101</tno>
<count>2</count>
</numvisits>
…
Ordering Results
How many visits are made by each train, ordered by the
number of visits?
for $t in /trains/train
let $c := count($t/visit)
order by $c descending
return <numvisits> {$t/tno} <count> {$c} </count>
</numvisits>
<numvisits>
<tno>46303101</tno>
<count>7</count>
</numvisits> …
Quantifiers
Which trains visit York?
for $t in /trains/train
where some $v in $t//visit satisfies $v/name = "york"
return $t/tn
<tno>3107101</tno>
<tno>46307101</tno>
<tno>22407101</tno>
…
Implicit Quantification
Which trains visit York?
for $t in /trains/train
where $t//visit/name = "york"
return $t/tn
<tno>3107101</tno>
<tno>46307101</tno>
<tno>22407101</tno>…
Reorganisation
Reorganise the document to nest the trains inside the stations.
for $s in distinct-values(//visit/name)
return <station>
<name> {$s} </name>
{for $t in /trains/train, $v in $t//visit where $v/name = $s
return <visit> {($t/tno, $v/time)} </visit>}
</station
<station>
<name>edinburgh</name>
<visit>
<tno>22403101</tno>
<time>09:50:00</time>
</visit> …
</station> …
Examples

Write XQueries that:


Return the number of visits paid to the
most frequently visited station.
Return the names of the stations that
received the most visits.
Further Reading

The W3C Consortium Tutorials are short
but informative; you should read these:



http://www.w3schools.com/xpath/
http://www.w3schools.com/xquery/
There are many books on querying XML
documents, e.g.:

Howard Katz (ed), XQuery from the
Experts, Addison Wesley, 2004.
XML Databases
Native XML Databases



Native XML databases support collections of
XML documents.
Native XML databases provide light weight
support for management of large document
collections.
Such databases typically support:



XPath and XQuery for querying.
XUpdate for manipulating documents.
The XML:DB API for programmatic access.
Accessing eXist



The top level of an eXist
database is a collection;
collections in eXist are
not typed.
Like a file system, a
collection can contain
other collections or
documents.
The eXist client
supports management
of collections, uploading
of documents, etc.
XML:DB

XML:DB provides a Java API similar in
concept to JDBC.
Class cl = Class.forName(“org.exist.xmldb.DatabaseImpl”);
Database database = (Database)cl.newInstance();
DatabaseManager.registerDatabase(database);
Collection col = DatabaseManager.getCollection(
"xmldb:exist://localhost:8080/exist/xmlrpc/db" );
XPathQueryService service = (XPathQueryService)
col.getService("XPathQueryService", "1.0");
ResourceSet result = service.query(args[0]);
ResourceIterator i = result.getIterator();
while(i.hasMoreResources()) {
Resource r = i.nextResource();
System.out.println((String)r.getContent());
}
XML In Oracle 10g

Oracle was not designed to store XML
data. Extensions to support XML occur:


In the server, so that Oracle can store XML
data using tables or built-in types.
Around the server, so that XML data can be
transferred to and from Oracle
applications.
XML Type

Server datatype:



Stored in the
database.
Configurable storage
options (CLOBS or by
mapping to object
relational structures).
Accessible through
SQL, PL/SQL, Java.

Operations to:




Create XML Type
values.
Map to/from XML
Type values.
Test properties of
XML Type values.
Query the content of
XML Type values
using XPath.
Storing XMLType Values


XMLType values can be stored as column
attributes.
Stored either through a mapping onto tables
or as Character Large Objects (CLOBs).
create table franchise (
company varchar(40) not null,
franchise varchar(40) not null,
trains
SYS.XMLTYPE not null,
primary key (company, franchise))
Operations on XMLTypes
SYS_XMLGEN
createXML
String
Relational
Construct
XMLType
getStringVal()
extract
Number
getNumberVal()
Populating XMLTypes

Standard table and
column modification
operations apply to
complete values of
XMLType:




Insert.
Delete.
Update.
The updateXML
operation is provided
for making changes
within XMLType values.
insert into franchise values (
'GNER',
'East Coast Main Line',
sys.XMLType.createXML(
'<trains>
<train>
<tno>3107101</tno>
<source>Edinburgh</source>
<destination>
London
</destination>
<visit> …</visit>
</train>
</trains>')
);
XMLType Methods

existsNode() returns 0 if the XPath expression
does not match the document, else 1:


existsNode(xpath IN varchar2) RETURN number.
extract() returns the fragment that matches
an XPath expression as an XMLType:

extract(xpath IN varchar2) RETURN sys.XMLType.
Selecting XML From SQL


The function sys.XMLType.getStringVal
converts its XMLType parameter into a string.
Retrieve details of GNER trains:
select f. franchise, sys.XMLType.getStringVal(f.trains)
from franchise f
where f.company='GNER'
East Coast Main Line <trains> ... </train>
Extracting Elements


The extract() method on XMLType can be
used to access individual elements.
Retrieve the src and dst of GNER trains:
select f.franchise,
f.trains.extract('//source').getStringVal() as src,
f.trains.extract('//destination').getStringVal() as dst
from franchise f
where f.company='GNER'
East Coast Main Line
<source>Edinburgh</source><source>York</source>
<destination>London</destination><destination>London</destination>
Extracting Element Values

The text() function can extract element
contents.
select f.franchise,
f.trains.extract('//source/text()').getStringVal() as src,
f.trains.extract('//destination/text()').getStringVal() as dst
from franchise f
where f.company='GNER'
East Coast Main Line
EdinburghYork
LondonLondon
Extracting for Comparison

Comparisons can be carried out either
in XPath or in SQL.
select f.company
from franchise f
where f.trains.existsnode('//visit[name="York"]') = 1
select f.company
from franchise f
where f.trains.extract('//visit/name/text()').getStringVal() like '%York%'
GNER
Testing Before Retrieval

To avoid including empty values in results,
the existsNode() method can test for
documents that match an Xpath expression.
select f.franchise, f.trains.extract('//train/tno').getStringVal() as tno
from franchise f
where f.trains.existsnode('//train/tno') = 1
East Coast Main Line
<tno>3107101</tno>
<tno>4630710</tno>
XML In Relational Databases

XML is increasingly
ubiquitous:



Data for Web display.
Data in domainspecific standards.
Data for
communication
between web
services.

Relational vendors
want:




Web pages to be
generated from
databases.
Their storage managers
to support all forms of
data management.
Database applications to
be exposed as Web
Services.
This means full support
for XML input, output
and storage.
Further Reading


Oracle 10g, XML DB Developers Guide
[Chapter 3: Using Oracle XML DB].
Full documentation for eXist is available
along with the software from:

http://exist.sourceforge.net/
Extra Slides