Author

Transcript Author

Managing XML and
Semistructured Data
Part 1: Preliminaries, Motivation and
Overview
COMP630L Topics in DB Systems: Managing Web Data
Fall, 2007
Dr Wilfred Ng
Acknowledgement: Part of the materials in this set of XML slides are
extracted from Prof. Dan Suciu’s course materials. Thanks for his permission
of using them
1
HTML XML SGML
 a W3C standard to complement HTML
 origins: structured text SGML
 motivation:
• HTML describes presentation
• XML describes content
 HTML4.0  XML  SGML
 http://www.w3.org/TR/2000/REC-xml-20001006 (version 2,
10/2000)
2
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases
</i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteoul, Buneman, Suciu
<br> Morgan Kaufmann,
1999
HTML describes the presentation
3
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
XML describes the content
</bibliography>
4
SGML and XML
 eXtensible Markup Language - XML
 XML 1.0 – a recommendation from W3C, 1998
 XML – related standards: W3C group – XSL,
XQuery,…
 Roots: SGML [http://www.w3.org/MarkUp/SGML/]
• Markup = encoding: possibilities for expressing information
• Information modelling freedom, Reusability, provability,
validity
• But a very nasty language 
• SGML is an international standard for device-independent,
system-independent methods of representing texts in
electronic form
 After the roots: a format for sharing data (on the Web)
5
Basic XML Terminology






tags: book, title, author, …
start tag: <book>, end tag: </book>
elements: <book>…<book>,<author>…</author>
elements are nested
empty element: <red></red> abbrv. <red/>
an XML document: single root element
6
The Role of XML Data
 XML is designed for data exchange, not to replace
relational or E/R data
 Sources of XML data:
• Created manually with text editors: not really data
• Generated automatically from relational data
• Text files, replacing older data formats: Web server logs,
scientific data (biological, astronomical)
• Stored/processed in native XML engines: very few
applications need that today
7
XML Advantages for Web Data
 Over SGML
• Supported by
mainstream
browsers such as IE
and Netscape
• Standard Stylesheet
• Standard linking
 Over HTML
•
•
•
•
Interchangable
Searchable
Reusable
Enables Automation
8
Why XML is of Interest to Us
 HTML fails to meet structure specification – tags for
appearance only
 XML is a language syntax for data
• Note: we have no langauge syntax for relational data
• But XML is not relational: semistructured
 This is exciting because:
•
•
•
•
Can translate any data to XML
Can ship XML over the Web (HTTP)
Can input XML into any application
Thus: make data sharing and exchange on the Web possible!
 40% annual growth and accelerating: publishers,
government, education,…
9
Relational Data in XML
person
nam e
XML:
row
phone
name
John
3634
Sue
6343
D ic k
6363
person
“John”
row
row
phone name phone
3634 “Sue”
name
6343 “Dick”
phone
6363
<person>
<row> <name>John</name>
<phone> 3634</phone></row>
<row> <name>Sue</name>
<phone> 6343</phone>
<row> <name>Dick</name>
<phone> 6363</phone></row>
</person>
10
XML Data: more expressive
 Missing attributes:
<person> <name> John</name>
<phone>1234</phone>
</person>
<person> <name>Joe</name>
</person>
 Could represent in
a table with nulls
 no phone !
name
phone
John
1234
Joe
11
XML Data: more expressive
 Repeated attributes
<person> <name> Mary</name>
<phone>2345</phone>
<phone>3456</phone>
</person>
 Impossible in tables:
name
phone
Mary
2345
 two phones !
3456
???
12
XML Data: more expressive
 Attributes with different types in different objects
<person> <name> <first> John </first>
<last> Smith </last>
</name>
<phone>1234</phone>
</person>
 structured name !
 Nested collections (no 1NF)
 Heterogeneous collections:
• <db> contains both <book>s and <publisher>s
13
XML Data Sharing and Exchange
application
application
object-relational
Integrate
XML Data
Transform
WEB (HTTP)
Warehouse
application
relational data
Specific data management tasks
legacy data
14
From Relational Data to XML Data
XML – Data tree
and its doc
persons
name
row
row
row
phone
name
John
Sue
Dick
persons
3634
6343
6363
“John”
phone name phone
3634 “Sue”
name
6343 “Dick”
phone
6363
<persons>
<row> <name>John</name>
<phone> 3634</phone></row>
<row> <name>Sue</name>
<phone> 6343</phone>
<row> <name>Dick</name>
<phone> 6363</phone></row>
</persons>
15
XML Data
 XML is self-describing
 Schema elements become part of the data
• Relational schema: persons(name,phone)
• In XML <persons>, <name>, <phone> are part
of the data, and are repeated many times !
 Consequence: XML is much more flexible
 XML data is semistructured data
16
XML from/to Relational Data
 XML publishing:
• relational data  XML
 XML storage:
Relational data is regular in XML tree
representation. But XML data can be nonrelational!
• XML  relational data
Supplier
?
Supplier
?
Common
XML
Illusion
Broker
17
XML Publishing: ideas sketch
XML
Tuple
streams
XML
publishing
Relational
Database
SQL
Web
Application
XPath/
XQuery
18
XML Publishing
 Exporting the data is relatively
easier: we do this already for
HTML
 Translating XQuery  SQL is hard
 XML publishing systems:
 Research: Experanto (IBM/DB2),
SilkRoute (AT&T Labs and UW,
software unavailable) now
SilkRoute downloadable
• XQuery  SQL
 Commercial: SQL Server, Oracle
• Only XPath  SQL and with
restrictions
A middle-ware approach
19
XML Publishing
Will follow the idea in SilkRoute [ACM TODS 27(4), 2002]
student
enroll
course
 Backend relational engine
 Relational schema:
• Student(sid, name, address)
• Course(cid, title, room)
• Enroll(sid, cid, grade)
 Schemas are often proprietary but XML schemas
are public
20
XML Publishing
<xmlview>
<course> <title> Operating Systems </title>
<room> MGH084 </room>
<student> <name> John </name>
<address> Seattle </address >
<grade> 3.8 </grade>
</student>
<student> …</student>
…
</course>
<course> <title> Database </title>
<room> EE045 </room>
<student> <name> Mary </name>
<address> Shoreline </address >
<grade> 3.9 </grade>
</student>
<student> …</student>
…
</course>
…
</xmlview>
• Group by courses:
Redundant
representation of
students
• Other representations
possible too: tags may
not be the same as
attributes in general
21
XML Publishing
First thing to do: design the DTD (public for users)
<!ELEMENT xmlview (course*)>
<!ELEMENT course (title,room,student*)>
<!ELEMENT student (name,address,grade)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT grade (#PCDATA)>
Second thing to do: develop an XML canonical view under db root for
each relation {t1,…tk} with schema R(A1,…,An)
<R><row><A1>a11</A1>…<An>a11</An></row>…
<row><A1>ak1</A1>…<An>ak1</An></row> </R>
22
Now we write an XQuery to export relational data  XML
Note: result should conform to DTD (slide 20 for the relational
Schema; the generated result is shown in slide 21.)
<xmlview>
{ FOR $x IN /db/Course/row
RETURN
<course>
<title> { $x/title/text() } </title>
<room> { $x/room/text() } </room>
{ FOR $y IN /db/Enroll/row[cid/text() = $x/cid/text()]/row
$z IN /db/Student/row[sid/text() = $y/sid/text()]/row
RETURN <student> <name> { $z/name/text() } </name>
<address> { $z/address/text() }
</address>
<grade> { $y/grade/text() } </grade>
</student>
}
</course>
}
</xmlview>
23
XML Publishing
Query: find Mary’s grade in Operating Systems
XQuery over public view
FOR $x IN /xmlview/course[title/text()=“Operating Systems”],
$y IN $x/student/[name/text()=“Mary”]
RETURN <answer> $y/grade/text() </answer>
SilkRoute
does this
automatically
SQL over database schema
SELECT Enroll.grade
FROM Student, Enroll, Course
WHERE Student.name=“Mary” and Course.title=“OS”
and Student.sid = Enroll.sid and Enroll.cid = Course.cid
24
XML Publishing
How do we choose the output structure ?
 Determined by agreement, with our partners, or dictated by
committees in XML dialects (called applications) and
generate DTDs
 The DTD is selective wrt the underlying databases
 XML Data is often nested, irregular, etc – that’s why we
need the xmlview query in slide 23 to do the
transformation
 No agreed normal forms for XML but a few work on it…
 [M. Arenas and L. Libkin, A Normal Form for XML
Documents]
25
XML Storage
 Often the XML data is small and is parsed directly
into the application (DOM API) – file based
management (pros and cons?)
 Sometimes it is big, and we need to store it in a
database
 Much harder than XML publishing (why ?)
 A fundamental XML storage problem:
• How do we choose the schema of the database ?
 Possible solutions:
• Schema derived from DTD – structure mapping
• Storing XML as a graph such as “Edge relation” – model
mapping
• Native Approach – native XML/SSD engine
26
XML Storage in a Relational DB
 Use generic schema
• [Florescu, Kossman 1999]
 Use DTD to derive schema
• [Shanmugasundaram, et al. 1999]
 Use data mining to derive schema
• [Deutsch, Fernandez, Suciu 1999]
 Use the Path table
• [T.Amagasa, T.Shimura, S.Uemura 2001]
27
XML Storage: Edge Relation





[Florescu, Kossman 1999]
Monet [Schmidt et al. WebDB 00]
Structured-based approach
Mapping tree to relations
Use generic relational schemas
(independent on the XML schema):
DOM Tree
Ref(source,label,dest)
Val(node,value)
28
XML Storage: Edge Relation
Ref
&o1
S o u rc e
paper
&
&
&
&
&
&o2
title
author
&o3
author
&o4
“The Calculus” “…”
year
&o5
“…”
[Florescu, Kossman 1999]
&o6
“1986”
o1
o2
o2
o2
o2
Val
N ode
&
&
&
&
o3
o4
o5
o6
L abel
D est
paper
title
a u th o r
a u th o r
year
&
&
&
&
&
o2
o3
o4
o5
o6
V a lu e
T h e C a lc u lu s
…
…
1986
29
XML Storage: Edge Relation
 In practice may need more tables for
reference links and nodes:
RefTag1
Source Dest
RefTag1(source,dest)
RefTag2(source,dest)
…
IntVal(node,intVal)
RealVal(node,realVal)
…
&o2
&o5
&o6
&o1
&o7
&o8
&o8
paper
title
&o3
&o7
&o2
author author
&o4
year
&o5
&o8
&o6
30
XML Storage: DTD to Schema
[Christophides, Abiteboul, Cluet, Scholl 1994]
[Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton 1999]
 Basic idea: use the XML schema to derive the
relational schema
 DTD:
<!ELEMENT paper (title, author*, year?)>
<!ELEMENT author (firstName, lastName)>
 Relational schemas:
Paper(pid, title, year)
Author(aid, pid, firstName, lastName)
31
XML Storage: DTD to Schema
 Each Element corresponds to a relation
 Each Attribute of Element corresponds to a column of
relation
 Connect elements using foreign keys
 But the problems: fragmentation!
 Example: How many relations should be used for finding
an address of an author?
Paper(pid,
<!ELEMENT
title, year)
paper (title, author*, year?)>
Author(aid,
<!ELEMENT
pid, firstName,
author (firstName,lastName,address)>
lastName)
Address
<!ELEMENT
(addid, aid,
address
country)
(street, city, country)>
City (cid,
<!ELEMENT
addid, postcode,
city (postcode?,
cityname)
cityname)>
(streetno?,
streetname)>
Street<!ELEMENT
(sid, addid, street
postcode,
streetname)
32
XML Storage: Path Relations
[T.Amagasa, T.Shimura, S.Uemura 2001]
ACM TOIT 2001 1(1)
Store paths as strings (model-based approach)
 XPath expressions become the SQL like operator
 Additional information for parent/child,
ancestor/descendant relationship
 XRel: table schemas for paths, elements and val
 XParent:table schemas for elements, labelpath,
parent (alternative but similar)
33
XML Storage: Path Relations
Path
pathID
Pathexpr
1
#/bib
2
#/bib#/paper
3
#/bib#/paper#/author
4
#/bib#/paper#/title
5
#/bib#/paper#/year
6
#/bib#/book#/author
7
#/bib#/book#/title
8
#/bib#/book#/publisher
One entry for every path in the database
Relatively small
34
XML Storage: Path Relations
Positions in doc
Element
Val
Node
ID
Path
ID
Start
1
1
0
1000
-
2
2
5
200
1
3
3
8
20
2
4
3
21
30
2
5
3
31
100
2
6
3
101
150
2
7
4
151
180
2
8
2
300
500
1
End
Parent
ID
...
One entry for every element in the
database relatively large
NodeID
Val
3
Smith
4
Vance
5
Tim
6
Wallace
7
The Best…
6
3
7
4
8
2
...
One entry for every leaf in the database
35
Relatively large
XRel Example: Path Table
root
1
 Contains all path string
PLAY
2
• Path(pathID, pathexp)
PathID
PathExp
SCENE
4
0
#/PLAY
1
#/PLAY#/ACT
2
#/PLAY#/ACT#/SCENE
3
#/PLAY#/ACT#/SCENE#/@id
4
#/PLAY#/ACT#/SCENE#/TITLE
5
#/PLAY#/ACT#/SCENE#/SPEECH
…
…
ACT
3
id
5
“000”
TITLE
6
“Intro”
ACT
…
…
SCENE
…
SPEECH
7
SPEAKER
8
“CURIO”
TEXT
9
“This is …”
…
Path Table
36
XRel Example: Other Tables
Element Table
0
6
11
27
49
57
91
 Example
docID PathID start end index
21
<PLAY>
<ACT>
<SCENE id=“000”>
<TITLE> Intro </TITLE>
<SPEECH>
<SPEAKER>CURIO</SPEAKER>
<TEXT>This is …</TEXT>
</SPEECH>
</SCENE>
</ACT>
</PLAY>
1
0
0
…
1
1
1
1
6
…
1
1
48
1
2
11
…
1
1
90
1
4
27
48
1
1
…
…
…
…
…
…
Attribute Table
docID PathID start end
1
3
0
…
PathExp
#/PLAY
…
21
21
value
“000”
Text Table
Path Table
PathID
reindex
docID PathID start end
value
1
4
34
40
“Intro”
…
…
…
…
…
37
Storing XML as a Graph
 Every XML instance is a tree
 Hence we can store it as any graph, using an Edge
table
 In addition we need a Value table to store the data
values (#PCDATA)
Edge relation summary:
 Same relational schema for every XML document:
• Edge(Source, Tag, Dest)
• Value(Source, Val)
 Generic: works for every XML instance
 But inefficient:
• Repeat tags multiple times
• Need many joins to reconstruct data
38
Storing XML as a Graph
0
db
2
book
4
3
title
Edge
5
author
6
title
1
9
book
7
publisher
8
author author
11
10
title
state
“Complete
“Morgan
“Transaction
“Chamberlin”
“Bernstein”
“Newcomer”
Guide
Kaufman”
Processing”
to DB2”
Source
Tag
Dest
0
db
1
1
book
2
2
title
3
2
author
4
1
book
5
5
title
5
...
Value
Source
Val
3
Complete guide . . .
6
4
Chamberlin
author
7
6
...
...
...
...
...
39
“CA”
Storing XML as a Graph
What happens to queries:
FOR $x IN /db/book[author/text()=“Chamberlin”]
RETURN $x/title
SELECT vtitle.value
FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle,
Value vauthor, Value vtitle
WHERE xdb.source=0 and xdb.tag = ‘db’ and
xdb.dest = xbook.source and xbook.tag = ‘book’ and
xbook.dest = xauthor.source and xauthor.tag = ‘author’ and
xbook.dest = xtitle.source and xtitle.tag = ‘title’ and
xauthor.dest = vauthor.source and vauthor.value = ‘Chamberin and
xtitle.dest = vtitle.source
40
XML Storage: Data Mining to Schema
[Deutsch, Fernandez, Suciu 1999]
Given:
• One large XML data instance
• No schema/DTD
• Query workload
 Problem: find a “good” relational schema for it
 Notice: even when a DTD is present, it may be
imprecise:
• E.g. when a person may have 1-3 phones: phone*
41
XML Storage: Data Mining to Schema
A
paper
D
B
C paper
paper paper
Paper1
fn 1
year
author
title
author
authortitle authortitleauthor title A X
B X
D X
fn ln fn ln fn
fn
ln
ln
ln 1
fn 2
ln 2
title
year
X
X
X
X
-
X
-
X
X
X
X
-
Paper2
a u th o r
X
C
[Deutsch, Fernandez, Suciu 1999]
title
X
42
Useful References

Data on the Web: from Relations, to Semistructured Data and XML,
Abiteboul, Buneman, Suciu
•

W3C homepage, www.w3.org
•







For foundations
For current standards
F. Tian et al., The Design and Performance Evaluation of Alternative XML Storage
Strategies, SIGMOD Record, 2002
XParent: An Efficient RDBMS-Based XML Database System, In Proc. of ICDE, 2002
Tatarinov, Stroring and Querying Ordered XML Using a Relational Database System, In
Proc. of SIGMOD, 2002
Yoshikawa et al., XRel: A Path-Based Approach to Storage and Retrieval of XML
Documents Using Relational Databases, ACM Trans. on Internet Technology, Vol. 1, No.
1, pp 110-141, 2001
D. Florescu, D. Kossman, Storing and Querying XML Data using an RDBMS. IEEE
Data Engineering Bulletin 22(3), 1999
J. Shanmugasundaram et al. Relational Databases for Querying XML Documents:
Limitations and Opportunities, In Proc. of VLDB, 1999
C. Zhang et al., On Supporting Containment Queries in Relational Databases
Management Systems, In Proc. of SIGMOD, 2001
43

Author

Transcript Author

Directory