Transcript Author
Managing XML and
Semistructured Data
Part 1: Preliminaries, Motivation and
Overview
COMP630L Topics in DB Systems: Managing Web Data
Fall, 2007
Dr Wilfred Ng
Acknowledgement: Part of the materials in this set of XML slides are
extracted from Prof. Dan Suciu’s course materials. Thanks for his permission
of using them
1
HTML XML SGML
a W3C standard to complement HTML
origins: structured text SGML
motivation:
• HTML describes presentation
• XML describes content
HTML4.0 XML SGML
http://www.w3.org/TR/2000/REC-xml-20001006 (version 2,
10/2000)
2
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases
</i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteoul, Buneman, Suciu
<br> Morgan Kaufmann,
1999
HTML describes the presentation
3
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
XML describes the content
</bibliography>
4
SGML and XML
eXtensible Markup Language - XML
XML 1.0 – a recommendation from W3C, 1998
XML – related standards: W3C group – XSL,
XQuery,…
Roots: SGML [http://www.w3.org/MarkUp/SGML/]
• Markup = encoding: possibilities for expressing information
• Information modelling freedom, Reusability, provability,
validity
• But a very nasty language
• SGML is an international standard for device-independent,
system-independent methods of representing texts in
electronic form
After the roots: a format for sharing data (on the Web)
5
Basic XML Terminology
tags: book, title, author, …
start tag: <book>, end tag: </book>
elements: <book>…<book>,<author>…</author>
elements are nested
empty element: <red></red> abbrv. <red/>
an XML document: single root element
6
The Role of XML Data
XML is designed for data exchange, not to replace
relational or E/R data
Sources of XML data:
• Created manually with text editors: not really data
• Generated automatically from relational data
• Text files, replacing older data formats: Web server logs,
scientific data (biological, astronomical)
• Stored/processed in native XML engines: very few
applications need that today
7
XML Advantages for Web Data
Over SGML
• Supported by
mainstream
browsers such as IE
and Netscape
• Standard Stylesheet
• Standard linking
Over HTML
•
•
•
•
Interchangable
Searchable
Reusable
Enables Automation
8
Why XML is of Interest to Us
HTML fails to meet structure specification – tags for
appearance only
XML is a language syntax for data
• Note: we have no langauge syntax for relational data
• But XML is not relational: semistructured
This is exciting because:
•
•
•
•
Can translate any data to XML
Can ship XML over the Web (HTTP)
Can input XML into any application
Thus: make data sharing and exchange on the Web possible!
40% annual growth and accelerating: publishers,
government, education,…
9
Relational Data in XML
person
nam e
XML:
row
phone
name
John
3634
Sue
6343
D ic k
6363
person
“John”
row
row
phone name phone
3634 “Sue”
name
6343 “Dick”
phone
6363
<person>
<row> <name>John</name>
<phone> 3634</phone></row>
<row> <name>Sue</name>
<phone> 6343</phone>
<row> <name>Dick</name>
<phone> 6363</phone></row>
</person>
10
XML Data: more expressive
Missing attributes:
<person> <name> John</name>
<phone>1234</phone>
</person>
<person> <name>Joe</name>
</person>
Could represent in
a table with nulls
no phone !
name
phone
John
1234
Joe
11
XML Data: more expressive
Repeated attributes
<person> <name> Mary</name>
<phone>2345</phone>
<phone>3456</phone>
</person>
Impossible in tables:
name
phone
Mary
2345
two phones !
3456
???
12
XML Data: more expressive
Attributes with different types in different objects
<person> <name> <first> John </first>
<last> Smith </last>
</name>
<phone>1234</phone>
</person>
structured name !
Nested collections (no 1NF)
Heterogeneous collections:
• <db> contains both <book>s and <publisher>s
13
XML Data Sharing and Exchange
application
application
object-relational
Integrate
XML Data
Transform
WEB (HTTP)
Warehouse
application
relational data
Specific data management tasks
legacy data
14
From Relational Data to XML Data
XML – Data tree
and its doc
persons
name
row
row
row
phone
name
John
Sue
Dick
persons
3634
6343
6363
“John”
phone name phone
3634 “Sue”
name
6343 “Dick”
phone
6363
<persons>
<row> <name>John</name>
<phone> 3634</phone></row>
<row> <name>Sue</name>
<phone> 6343</phone>
<row> <name>Dick</name>
<phone> 6363</phone></row>
</persons>
15
XML Data
XML is self-describing
Schema elements become part of the data
• Relational schema: persons(name,phone)
• In XML <persons>, <name>, <phone> are part
of the data, and are repeated many times !
Consequence: XML is much more flexible
XML data is semistructured data
16
XML from/to Relational Data
XML publishing:
• relational data XML
XML storage:
Relational data is regular in XML tree
representation. But XML data can be nonrelational!
• XML relational data
Supplier
?
Supplier
?
Common
XML
Illusion
Broker
17
XML Publishing: ideas sketch
XML
Tuple
streams
XML
publishing
Relational
Database
SQL
Web
Application
XPath/
XQuery
18
XML Publishing
Exporting the data is relatively
easier: we do this already for
HTML
Translating XQuery SQL is hard
XML publishing systems:
Research: Experanto (IBM/DB2),
SilkRoute (AT&T Labs and UW,
software unavailable) now
SilkRoute downloadable
• XQuery SQL
Commercial: SQL Server, Oracle
• Only XPath SQL and with
restrictions
A middle-ware approach
19
XML Publishing
Will follow the idea in SilkRoute [ACM TODS 27(4), 2002]
student
enroll
course
Backend relational engine
Relational schema:
• Student(sid, name, address)
• Course(cid, title, room)
• Enroll(sid, cid, grade)
Schemas are often proprietary but XML schemas
are public
20
XML Publishing
<xmlview>
<course> <title> Operating Systems </title>
<room> MGH084 </room>
<student> <name> John </name>
<address> Seattle </address >
<grade> 3.8 </grade>
</student>
<student> …</student>
…
</course>
<course> <title> Database </title>
<room> EE045 </room>
<student> <name> Mary </name>
<address> Shoreline </address >
<grade> 3.9 </grade>
</student>
<student> …</student>
…
</course>
…
</xmlview>
• Group by courses:
Redundant
representation of
students
• Other representations
possible too: tags may
not be the same as
attributes in general
21
XML Publishing
First thing to do: design the DTD (public for users)
<!ELEMENT xmlview (course*)>
<!ELEMENT course (title,room,student*)>
<!ELEMENT student (name,address,grade)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT grade (#PCDATA)>
Second thing to do: develop an XML canonical view under db root for
each relation {t1,…tk} with schema R(A1,…,An)
<R><row><A1>a11</A1>…<An>a11</An></row>…
<row><A1>ak1</A1>…<An>ak1</An></row> </R>
22
Now we write an XQuery to export relational data XML
Note: result should conform to DTD (slide 20 for the relational
Schema; the generated result is shown in slide 21.)
<xmlview>
{ FOR $x IN /db/Course/row
RETURN
<course>
<title> { $x/title/text() } </title>
<room> { $x/room/text() } </room>
{ FOR $y IN /db/Enroll/row[cid/text() = $x/cid/text()]/row
$z IN /db/Student/row[sid/text() = $y/sid/text()]/row
RETURN <student> <name> { $z/name/text() } </name>
<address> { $z/address/text() }
</address>
<grade> { $y/grade/text() } </grade>
</student>
}
</course>
}
</xmlview>
23
XML Publishing
Query: find Mary’s grade in Operating Systems
XQuery over public view
FOR $x IN /xmlview/course[title/text()=“Operating Systems”],
$y IN $x/student/[name/text()=“Mary”]
RETURN <answer> $y/grade/text() </answer>
SilkRoute
does this
automatically
SQL over database schema
SELECT Enroll.grade
FROM Student, Enroll, Course
WHERE Student.name=“Mary” and Course.title=“OS”
and Student.sid = Enroll.sid and Enroll.cid = Course.cid
24
XML Publishing
How do we choose the output structure ?
Determined by agreement, with our partners, or dictated by
committees in XML dialects (called applications) and
generate DTDs
The DTD is selective wrt the underlying databases
XML Data is often nested, irregular, etc – that’s why we
need the xmlview query in slide 23 to do the
transformation
No agreed normal forms for XML but a few work on it…
[M. Arenas and L. Libkin, A Normal Form for XML
Documents]
25
XML Storage
Often the XML data is small and is parsed directly
into the application (DOM API) – file based
management (pros and cons?)
Sometimes it is big, and we need to store it in a
database
Much harder than XML publishing (why ?)
A fundamental XML storage problem:
• How do we choose the schema of the database ?
Possible solutions:
• Schema derived from DTD – structure mapping
• Storing XML as a graph such as “Edge relation” – model
mapping
• Native Approach – native XML/SSD engine
26
XML Storage in a Relational DB
Use generic schema
• [Florescu, Kossman 1999]
Use DTD to derive schema
• [Shanmugasundaram, et al. 1999]
Use data mining to derive schema
• [Deutsch, Fernandez, Suciu 1999]
Use the Path table
• [T.Amagasa, T.Shimura, S.Uemura 2001]
27
XML Storage: Edge Relation
[Florescu, Kossman 1999]
Monet [Schmidt et al. WebDB 00]
Structured-based approach
Mapping tree to relations
Use generic relational schemas
(independent on the XML schema):
DOM Tree
Ref(source,label,dest)
Val(node,value)
28
XML Storage: Edge Relation
Ref
&o1
S o u rc e
paper
&
&
&
&
&
&o2
title
author
&o3
author
&o4
“The Calculus” “…”
year
&o5
“…”
[Florescu, Kossman 1999]
&o6
“1986”
o1
o2
o2
o2
o2
Val
N ode
&
&
&
&
o3
o4
o5
o6
L abel
D est
paper
title
a u th o r
a u th o r
year
&
&
&
&
&
o2
o3
o4
o5
o6
V a lu e
T h e C a lc u lu s
…
…
1986
29
XML Storage: Edge Relation
In practice may need more tables for
reference links and nodes:
RefTag1
Source Dest
RefTag1(source,dest)
RefTag2(source,dest)
…
IntVal(node,intVal)
RealVal(node,realVal)
…
&o2
&o5
&o6
&o1
&o7
&o8
&o8
paper
title
&o3
&o7
&o2
author author
&o4
year
&o5
&o8
&o6
30
XML Storage: DTD to Schema
[Christophides, Abiteboul, Cluet, Scholl 1994]
[Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton 1999]
Basic idea: use the XML schema to derive the
relational schema
DTD:
<!ELEMENT paper (title, author*, year?)>
<!ELEMENT author (firstName, lastName)>
Relational schemas:
Paper(pid, title, year)
Author(aid, pid, firstName, lastName)
31
XML Storage: DTD to Schema
Each Element corresponds to a relation
Each Attribute of Element corresponds to a column of
relation
Connect elements using foreign keys
But the problems: fragmentation!
Example: How many relations should be used for finding
an address of an author?
Paper(pid,
<!ELEMENT
title, year)
paper (title, author*, year?)>
Author(aid,
<!ELEMENT
pid, firstName,
author (firstName,lastName,address)>
lastName)
Address
<!ELEMENT
(addid, aid,
address
country)
(street, city, country)>
City (cid,
<!ELEMENT
addid, postcode,
city (postcode?,
cityname)
cityname)>
(streetno?,
streetname)>
Street<!ELEMENT
(sid, addid, street
postcode,
streetname)
32
XML Storage: Path Relations
[T.Amagasa, T.Shimura, S.Uemura 2001]
ACM TOIT 2001 1(1)
Store paths as strings (model-based approach)
XPath expressions become the SQL like operator
Additional information for parent/child,
ancestor/descendant relationship
XRel: table schemas for paths, elements and val
XParent:table schemas for elements, labelpath,
parent (alternative but similar)
33
XML Storage: Path Relations
Path
pathID
Pathexpr
1
#/bib
2
#/bib#/paper
3
#/bib#/paper#/author
4
#/bib#/paper#/title
5
#/bib#/paper#/year
6
#/bib#/book#/author
7
#/bib#/book#/title
8
#/bib#/book#/publisher
One entry for every path in the database
Relatively small
34
XML Storage: Path Relations
Positions in doc
Element
Val
Node
ID
Path
ID
Start
1
1
0
1000
-
2
2
5
200
1
3
3
8
20
2
4
3
21
30
2
5
3
31
100
2
6
3
101
150
2
7
4
151
180
2
8
2
300
500
1
End
Parent
ID
...
One entry for every element in the
database relatively large
NodeID
Val
3
Smith
4
Vance
5
Tim
6
Wallace
7
The Best…
6
3
7
4
8
2
...
One entry for every leaf in the database
35
Relatively large
XRel Example: Path Table
root
1
Contains all path string
PLAY
2
• Path(pathID, pathexp)
PathID
PathExp
SCENE
4
0
#/PLAY
1
#/PLAY#/ACT
2
#/PLAY#/ACT#/SCENE
3
#/PLAY#/ACT#/SCENE#/@id
4
#/PLAY#/ACT#/SCENE#/TITLE
5
#/PLAY#/ACT#/SCENE#/SPEECH
…
…
ACT
3
id
5
“000”
TITLE
6
“Intro”
ACT
…
…
SCENE
…
SPEECH
7
SPEAKER
8
“CURIO”
TEXT
9
“This is …”
…
Path Table
36
XRel Example: Other Tables
Element Table
0
6
11
27
49
57
91
Example
docID PathID start end index
21
<PLAY>
<ACT>
<SCENE id=“000”>
<TITLE> Intro </TITLE>
<SPEECH>
<SPEAKER>CURIO</SPEAKER>
<TEXT>This is …</TEXT>
</SPEECH>
</SCENE>
</ACT>
</PLAY>
1
0
0
…
1
1
1
1
6
…
1
1
48
1
2
11
…
1
1
90
1
4
27
48
1
1
…
…
…
…
…
…
Attribute Table
docID PathID start end
1
3
0
…
PathExp
#/PLAY
…
21
21
value
“000”
Text Table
Path Table
PathID
reindex
docID PathID start end
value
1
4
34
40
“Intro”
…
…
…
…
…
37
Storing XML as a Graph
Every XML instance is a tree
Hence we can store it as any graph, using an Edge
table
In addition we need a Value table to store the data
values (#PCDATA)
Edge relation summary:
Same relational schema for every XML document:
• Edge(Source, Tag, Dest)
• Value(Source, Val)
Generic: works for every XML instance
But inefficient:
• Repeat tags multiple times
• Need many joins to reconstruct data
38
Storing XML as a Graph
0
db
2
book
4
3
title
Edge
5
author
6
title
1
9
book
7
publisher
8
author author
11
10
title
state
“Complete
“Morgan
“Transaction
“Chamberlin”
“Bernstein”
“Newcomer”
Guide
Kaufman”
Processing”
to DB2”
Source
Tag
Dest
0
db
1
1
book
2
2
title
3
2
author
4
1
book
5
5
title
5
...
Value
Source
Val
3
Complete guide . . .
6
4
Chamberlin
author
7
6
...
...
...
...
...
39
“CA”
Storing XML as a Graph
What happens to queries:
FOR $x IN /db/book[author/text()=“Chamberlin”]
RETURN $x/title
SELECT vtitle.value
FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle,
Value vauthor, Value vtitle
WHERE xdb.source=0 and xdb.tag = ‘db’ and
xdb.dest = xbook.source and xbook.tag = ‘book’ and
xbook.dest = xauthor.source and xauthor.tag = ‘author’ and
xbook.dest = xtitle.source and xtitle.tag = ‘title’ and
xauthor.dest = vauthor.source and vauthor.value = ‘Chamberin and
xtitle.dest = vtitle.source
40
XML Storage: Data Mining to Schema
[Deutsch, Fernandez, Suciu 1999]
Given:
• One large XML data instance
• No schema/DTD
• Query workload
Problem: find a “good” relational schema for it
Notice: even when a DTD is present, it may be
imprecise:
• E.g. when a person may have 1-3 phones: phone*
41
XML Storage: Data Mining to Schema
A
paper
D
B
C paper
paper paper
Paper1
fn 1
year
author
title
author
authortitle authortitleauthor title A X
B X
D X
fn ln fn ln fn
fn
ln
ln
ln 1
fn 2
ln 2
title
year
X
X
X
X
-
X
-
X
X
X
X
-
Paper2
a u th o r
X
C
[Deutsch, Fernandez, Suciu 1999]
title
X
42
Useful References
Data on the Web: from Relations, to Semistructured Data and XML,
Abiteboul, Buneman, Suciu
•
W3C homepage, www.w3.org
•
For foundations
For current standards
F. Tian et al., The Design and Performance Evaluation of Alternative XML Storage
Strategies, SIGMOD Record, 2002
XParent: An Efficient RDBMS-Based XML Database System, In Proc. of ICDE, 2002
Tatarinov, Stroring and Querying Ordered XML Using a Relational Database System, In
Proc. of SIGMOD, 2002
Yoshikawa et al., XRel: A Path-Based Approach to Storage and Retrieval of XML
Documents Using Relational Databases, ACM Trans. on Internet Technology, Vol. 1, No.
1, pp 110-141, 2001
D. Florescu, D. Kossman, Storing and Querying XML Data using an RDBMS. IEEE
Data Engineering Bulletin 22(3), 1999
J. Shanmugasundaram et al. Relational Databases for Querying XML Documents:
Limitations and Opportunities, In Proc. of VLDB, 1999
C. Zhang et al., On Supporting Containment Queries in Relational Databases
Management Systems, In Proc. of SIGMOD, 2001
43