Transcript Author

Managing XML and
Semistructured Data
Part 1: Preliminaries, Motivation and
Overview
COMP630L Topics in DB Systems: Managing Web Data
Fall, 2007
Dr Wilfred Ng
Acknowledgement: Part of the materials in this set of XML slides are
extracted from Prof. Dan Suciu’s course materials. Thanks for his permission
of using them
1
HTML XML SGML
 a W3C standard to complement HTML
 origins: structured text SGML
 motivation:
• HTML describes presentation
• XML describes content
 HTML4.0  XML  SGML
 http://www.w3.org/TR/2000/REC-xml-20001006 (version 2,
10/2000)
2
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases
</i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteoul, Buneman, Suciu
<br> Morgan Kaufmann,
1999
HTML describes the presentation
3
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
XML describes the content
</bibliography>
4
SGML and XML
 eXtensible Markup Language - XML
 XML 1.0 – a recommendation from W3C, 1998
 XML – related standards: W3C group – XSL,
XQuery,…
 Roots: SGML [http://www.w3.org/MarkUp/SGML/]
• Markup = encoding: possibilities for expressing information
• Information modelling freedom, Reusability, provability,
validity
• But a very nasty language 
• SGML is an international standard for device-independent,
system-independent methods of representing texts in
electronic form
 After the roots: a format for sharing data (on the Web)
5
Basic XML Terminology






tags: book, title, author, …
start tag: <book>, end tag: </book>
elements: <book>…<book>,<author>…</author>
elements are nested
empty element: <red></red> abbrv. <red/>
an XML document: single root element
6
The Role of XML Data
 XML is designed for data exchange, not to replace
relational or E/R data
 Sources of XML data:
• Created manually with text editors: not really data
• Generated automatically from relational data
• Text files, replacing older data formats: Web server logs,
scientific data (biological, astronomical)
• Stored/processed in native XML engines: very few
applications need that today
7
XML Advantages for Web Data
 Over SGML
• Supported by
mainstream
browsers such as IE
and Netscape
• Standard Stylesheet
• Standard linking
 Over HTML
•
•
•
•
Interchangable
Searchable
Reusable
Enables Automation
8
Why XML is of Interest to Us
 HTML fails to meet structure specification – tags for
appearance only
 XML is a language syntax for data
• Note: we have no langauge syntax for relational data
• But XML is not relational: semistructured
 This is exciting because:
•
•
•
•
Can translate any data to XML
Can ship XML over the Web (HTTP)
Can input XML into any application
Thus: make data sharing and exchange on the Web possible!
 40% annual growth and accelerating: publishers,
government, education,…
9
Relational Data in XML
person
nam e
XML:
row
phone
name
John
3634
Sue
6343
D ic k
6363
person
“John”
row
row
phone name phone
3634 “Sue”
name
6343 “Dick”
phone
6363
<person>
<row> <name>John</name>
<phone> 3634</phone></row>
<row> <name>Sue</name>
<phone> 6343</phone>
<row> <name>Dick</name>
<phone> 6363</phone></row>
</person>
10
XML Data: more expressive
 Missing attributes:
<person> <name> John</name>
<phone>1234</phone>
</person>
<person> <name>Joe</name>
</person>
 Could represent in
a table with nulls
 no phone !
name
phone
John
1234
Joe
11
XML Data: more expressive
 Repeated attributes
<person> <name> Mary</name>
<phone>2345</phone>
<phone>3456</phone>
</person>
 Impossible in tables:
name
phone
Mary
2345
 two phones !
3456
???
12
XML Data: more expressive
 Attributes with different types in different objects
<person> <name> <first> John </first>
<last> Smith </last>
</name>
<phone>1234</phone>
</person>
 structured name !
 Nested collections (no 1NF)
 Heterogeneous collections:
• <db> contains both <book>s and <publisher>s
13
XML Data Sharing and Exchange
application
application
object-relational
Integrate
XML Data
Transform
WEB (HTTP)
Warehouse
application
relational data
Specific data management tasks
legacy data
14
From Relational Data to XML Data
XML – Data tree
and its doc
persons
name
row
row
row
phone
name
John
Sue
Dick
persons
3634
6343
6363
“John”
phone name phone
3634 “Sue”
name
6343 “Dick”
phone
6363
<persons>
<row> <name>John</name>
<phone> 3634</phone></row>
<row> <name>Sue</name>
<phone> 6343</phone>
<row> <name>Dick</name>
<phone> 6363</phone></row>
</persons>
15
XML Data
 XML is self-describing
 Schema elements become part of the data
• Relational schema: persons(name,phone)
• In XML <persons>, <name>, <phone> are part
of the data, and are repeated many times !
 Consequence: XML is much more flexible
 XML data is semistructured data
16
XML from/to Relational Data
 XML publishing:
• relational data  XML
 XML storage:
Relational data is regular in XML tree
representation. But XML data can be nonrelational!
• XML  relational data
Supplier
?
Supplier
?
Common
XML
Illusion
Broker
17
XML Publishing: ideas sketch
XML
Tuple
streams
XML
publishing
Relational
Database
SQL
Web
Application
XPath/
XQuery
18
XML Publishing
 Exporting the data is relatively
easier: we do this already for
HTML
 Translating XQuery  SQL is hard
 XML publishing systems:
 Research: Experanto (IBM/DB2),
SilkRoute (AT&T Labs and UW,
software unavailable) now
SilkRoute downloadable
• XQuery  SQL
 Commercial: SQL Server, Oracle
• Only XPath  SQL and with
restrictions
A middle-ware approach
19
XML Publishing
Will follow the idea in SilkRoute [ACM TODS 27(4), 2002]
student
enroll
course
 Backend relational engine
 Relational schema:
• Student(sid, name, address)
• Course(cid, title, room)
• Enroll(sid, cid, grade)
 Schemas are often proprietary but XML schemas
are public
20
XML Publishing
<xmlview>
<course> <title> Operating Systems </title>
<room> MGH084 </room>
<student> <name> John </name>
<address> Seattle </address >
<grade> 3.8 </grade>
</student>
<student> …</student>
…
</course>
<course> <title> Database </title>
<room> EE045 </room>
<student> <name> Mary </name>
<address> Shoreline </address >
<grade> 3.9 </grade>
</student>
<student> …</student>
…
</course>
…
</xmlview>
• Group by courses:
Redundant
representation of
students
• Other representations
possible too: tags may
not be the same as
attributes in general
21
XML Publishing
First thing to do: design the DTD (public for users)
<!ELEMENT xmlview (course*)>
<!ELEMENT course (title,room,student*)>
<!ELEMENT student (name,address,grade)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT grade (#PCDATA)>
Second thing to do: develop an XML canonical view under db root for
each relation {t1,…tk} with schema R(A1,…,An)
<R><row><A1>a11</A1>…<An>a11</An></row>…
<row><A1>ak1</A1>…<An>ak1</An></row> </R>
22
Now we write an XQuery to export relational data  XML
Note: result should conform to DTD (slide 20 for the relational
Schema; the generated result is shown in slide 21.)
<xmlview>
{ FOR $x IN /db/Course/row
RETURN
<course>
<title> { $x/title/text() } </title>
<room> { $x/room/text() } </room>
{ FOR $y IN /db/Enroll/row[cid/text() = $x/cid/text()]/row
$z IN /db/Student/row[sid/text() = $y/sid/text()]/row
RETURN <student> <name> { $z/name/text() } </name>
<address> { $z/address/text() }
</address>
<grade> { $y/grade/text() } </grade>
</student>
}
</course>
}
</xmlview>
23
XML Publishing
Query: find Mary’s grade in Operating Systems
XQuery over public view
FOR $x IN /xmlview/course[title/text()=“Operating Systems”],
$y IN $x/student/[name/text()=“Mary”]
RETURN <answer> $y/grade/text() </answer>
SilkRoute
does this
automatically
SQL over database schema
SELECT Enroll.grade
FROM Student, Enroll, Course
WHERE Student.name=“Mary” and Course.title=“OS”
and Student.sid = Enroll.sid and Enroll.cid = Course.cid
24
XML Publishing
How do we choose the output structure ?
 Determined by agreement, with our partners, or dictated by
committees in XML dialects (called applications) and
generate DTDs
 The DTD is selective wrt the underlying databases
 XML Data is often nested, irregular, etc – that’s why we
need the xmlview query in slide 23 to do the
transformation
 No agreed normal forms for XML but a few work on it…
 [M. Arenas and L. Libkin, A Normal Form for XML
Documents]
25
XML Storage
 Often the XML data is small and is parsed directly
into the application (DOM API) – file based
management (pros and cons?)
 Sometimes it is big, and we need to store it in a
database
 Much harder than XML publishing (why ?)
 A fundamental XML storage problem:
• How do we choose the schema of the database ?
 Possible solutions:
• Schema derived from DTD – structure mapping
• Storing XML as a graph such as “Edge relation” – model
mapping
• Native Approach – native XML/SSD engine
26
XML Storage in a Relational DB
 Use generic schema
• [Florescu, Kossman 1999]
 Use DTD to derive schema
• [Shanmugasundaram, et al. 1999]
 Use data mining to derive schema
• [Deutsch, Fernandez, Suciu 1999]
 Use the Path table
• [T.Amagasa, T.Shimura, S.Uemura 2001]
27
XML Storage: Edge Relation





[Florescu, Kossman 1999]
Monet [Schmidt et al. WebDB 00]
Structured-based approach
Mapping tree to relations
Use generic relational schemas
(independent on the XML schema):
DOM Tree
Ref(source,label,dest)
Val(node,value)
28
XML Storage: Edge Relation
Ref
&o1
S o u rc e
paper
&
&
&
&
&
&o2
title
author
&o3
author
&o4
“The Calculus” “…”
year
&o5
“…”
[Florescu, Kossman 1999]
&o6
“1986”
o1
o2
o2
o2
o2
Val
N ode
&
&
&
&
o3
o4
o5
o6
L abel
D est
paper
title
a u th o r
a u th o r
year
&
&
&
&
&
o2
o3
o4
o5
o6
V a lu e
T h e C a lc u lu s
…
…
1986
29
XML Storage: Edge Relation
 In practice may need more tables for
reference links and nodes:
RefTag1
Source Dest
RefTag1(source,dest)
RefTag2(source,dest)
…
IntVal(node,intVal)
RealVal(node,realVal)
…
&o2
&o5
&o6
&o1
&o7
&o8
&o8
paper
title
&o3
&o7
&o2
author author
&o4
year
&o5
&o8
&o6
30
XML Storage: DTD to Schema
[Christophides, Abiteboul, Cluet, Scholl 1994]
[Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton 1999]
 Basic idea: use the XML schema to derive the
relational schema
 DTD:
<!ELEMENT paper (title, author*, year?)>
<!ELEMENT author (firstName, lastName)>
 Relational schemas:
Paper(pid, title, year)
Author(aid, pid, firstName, lastName)
31
XML Storage: DTD to Schema
 Each Element corresponds to a relation
 Each Attribute of Element corresponds to a column of
relation
 Connect elements using foreign keys
 But the problems: fragmentation!
 Example: How many relations should be used for finding
an address of an author?
Paper(pid,
<!ELEMENT
title, year)
paper (title, author*, year?)>
Author(aid,
<!ELEMENT
pid, firstName,
author (firstName,lastName,address)>
lastName)
Address
<!ELEMENT
(addid, aid,
address
country)
(street, city, country)>
City (cid,
<!ELEMENT
addid, postcode,
city (postcode?,
cityname)
cityname)>
(streetno?,
streetname)>
Street<!ELEMENT
(sid, addid, street
postcode,
streetname)
32
XML Storage: Path Relations
[T.Amagasa, T.Shimura, S.Uemura 2001]
ACM TOIT 2001 1(1)
Store paths as strings (model-based approach)
 XPath expressions become the SQL like operator
 Additional information for parent/child,
ancestor/descendant relationship
 XRel: table schemas for paths, elements and val
 XParent:table schemas for elements, labelpath,
parent (alternative but similar)
33
XML Storage: Path Relations
Path
pathID
Pathexpr
1
#/bib
2
#/bib#/paper
3
#/bib#/paper#/author
4
#/bib#/paper#/title
5
#/bib#/paper#/year
6
#/bib#/book#/author
7
#/bib#/book#/title
8
#/bib#/book#/publisher
One entry for every path in the database
Relatively small
34
XML Storage: Path Relations
Positions in doc
Element
Val
Node
ID
Path
ID
Start
1
1
0
1000
-
2
2
5
200
1
3
3
8
20
2
4
3
21
30
2
5
3
31
100
2
6
3
101
150
2
7
4
151
180
2
8
2
300
500
1
End
Parent
ID
...
One entry for every element in the
database relatively large
NodeID
Val
3
Smith
4
Vance
5
Tim
6
Wallace
7
The Best…
6
3
7
4
8
2
...
One entry for every leaf in the database
35
Relatively large
XRel Example: Path Table
root
1
 Contains all path string
PLAY
2
• Path(pathID, pathexp)
PathID
PathExp
SCENE
4
0
#/PLAY
1
#/PLAY#/ACT
2
#/PLAY#/ACT#/SCENE
3
#/PLAY#/ACT#/SCENE#/@id
4
#/PLAY#/ACT#/SCENE#/TITLE
5
#/PLAY#/ACT#/SCENE#/SPEECH
…
…
ACT
3
id
5
“000”
TITLE
6
“Intro”
ACT
…
…
SCENE
…
SPEECH
7
SPEAKER
8
“CURIO”
TEXT
9
“This is …”
…
Path Table
36
XRel Example: Other Tables
Element Table
0
6
11
27
49
57
91
 Example
docID PathID start end index
21
<PLAY>
<ACT>
<SCENE id=“000”>
<TITLE> Intro </TITLE>
<SPEECH>
<SPEAKER>CURIO</SPEAKER>
<TEXT>This is …</TEXT>
</SPEECH>
</SCENE>
</ACT>
</PLAY>
1
0
0
…
1
1
1
1
6
…
1
1
48
1
2
11
…
1
1
90
1
4
27
48
1
1
…
…
…
…
…
…
Attribute Table
docID PathID start end
1
3
0
…
PathExp
#/PLAY
…
21
21
value
“000”
Text Table
Path Table
PathID
reindex
docID PathID start end
value
1
4
34
40
“Intro”
…
…
…
…
…
37
Storing XML as a Graph
 Every XML instance is a tree
 Hence we can store it as any graph, using an Edge
table
 In addition we need a Value table to store the data
values (#PCDATA)
Edge relation summary:
 Same relational schema for every XML document:
• Edge(Source, Tag, Dest)
• Value(Source, Val)
 Generic: works for every XML instance
 But inefficient:
• Repeat tags multiple times
• Need many joins to reconstruct data
38
Storing XML as a Graph
0
db
2
book
4
3
title
Edge
5
author
6
title
1
9
book
7
publisher
8
author author
11
10
title
state
“Complete
“Morgan
“Transaction
“Chamberlin”
“Bernstein”
“Newcomer”
Guide
Kaufman”
Processing”
to DB2”
Source
Tag
Dest
0
db
1
1
book
2
2
title
3
2
author
4
1
book
5
5
title
5
...
Value
Source
Val
3
Complete guide . . .
6
4
Chamberlin
author
7
6
...
...
...
...
...
39
“CA”
Storing XML as a Graph
What happens to queries:
FOR $x IN /db/book[author/text()=“Chamberlin”]
RETURN $x/title
SELECT vtitle.value
FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle,
Value vauthor, Value vtitle
WHERE xdb.source=0 and xdb.tag = ‘db’ and
xdb.dest = xbook.source and xbook.tag = ‘book’ and
xbook.dest = xauthor.source and xauthor.tag = ‘author’ and
xbook.dest = xtitle.source and xtitle.tag = ‘title’ and
xauthor.dest = vauthor.source and vauthor.value = ‘Chamberin and
xtitle.dest = vtitle.source
40
XML Storage: Data Mining to Schema
[Deutsch, Fernandez, Suciu 1999]
Given:
• One large XML data instance
• No schema/DTD
• Query workload
 Problem: find a “good” relational schema for it
 Notice: even when a DTD is present, it may be
imprecise:
• E.g. when a person may have 1-3 phones: phone*
41
XML Storage: Data Mining to Schema
A
paper
D
B
C paper
paper paper
Paper1
fn 1
year
author
title
author
authortitle authortitleauthor title A X
B X
D X
fn ln fn ln fn
fn
ln
ln
ln 1
fn 2
ln 2
title
year
X
X
X
X
-
X
-
X
X
X
X
-
Paper2
a u th o r
X
C
[Deutsch, Fernandez, Suciu 1999]
title
X
42
Useful References

Data on the Web: from Relations, to Semistructured Data and XML,
Abiteboul, Buneman, Suciu
•

W3C homepage, www.w3.org
•







For foundations
For current standards
F. Tian et al., The Design and Performance Evaluation of Alternative XML Storage
Strategies, SIGMOD Record, 2002
XParent: An Efficient RDBMS-Based XML Database System, In Proc. of ICDE, 2002
Tatarinov, Stroring and Querying Ordered XML Using a Relational Database System, In
Proc. of SIGMOD, 2002
Yoshikawa et al., XRel: A Path-Based Approach to Storage and Retrieval of XML
Documents Using Relational Databases, ACM Trans. on Internet Technology, Vol. 1, No.
1, pp 110-141, 2001
D. Florescu, D. Kossman, Storing and Querying XML Data using an RDBMS. IEEE
Data Engineering Bulletin 22(3), 1999
J. Shanmugasundaram et al. Relational Databases for Querying XML Documents:
Limitations and Opportunities, In Proc. of VLDB, 1999
C. Zhang et al., On Supporting Containment Queries in Relational Databases
Management Systems, In Proc. of SIGMOD, 2001
43