10/10 - SEAS - University of Pennsylvania

Download Report

Transcript 10/10 - SEAS - University of Pennsylvania

Normal Forms and XML
Zachary G. Ives
University of Pennsylvania
CIS 550 – Database & Information Systems
October 10, 2007
Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan
Announcements
 Homework 3 will be due Monday10/22
 Midterm: Wednesday 10/24
 Please start forming into project groups of 4, with a
list of names due 10/24
2
Lossless Join Decomposition
R1, … Rk is a lossless join decomposition of R w.r.t. an FD set F if
for every instance r of R that satisfies F,
R1(r) ⋈ ... ⋈ Rk(r) = r
Consider:
sid
name
serno
subj
cid
exp-grade
1
23
Sam
Nitin
570103
550103
AI
DB
570
550
B
A
What if we decompose on
(sid, name) and (serno, subj, cid, exp-grade)?
3
Testing for Lossless Join
R1, R2 is a lossless join decomposition of R with respect to F
iff at least one of the following dependencies is in F+
(R1  R2)  R1 – R2
(R1  R2)  R2 – R1
So for the FD set:
sid  name
serno  cid, exp-grade
cid  subj
Is (sid, name) and (serno, subj, cid, exp-grade) a lossless
decomposition?
4
Dependency Preservation
Ensures we can “easily” check whether a FD X  Y
is violated during an update to a database:
 The projection of an FD set F onto a set of attributes Z,
FZ is
{X  Y | X  Y  F +, X  Y  Z}
i.e., it is those FDs local to Z’s attributes
 A decomposition R1, …, Rk is dependency preserving if
F + = (FR1 ... FRk)+
The decomposition hasn’t “lost” any essential FD’s, so
we can check without doing a join
5
Example of Lossless and
Dependency-Preserving Decompositions
Given relation scheme
R(name, street, city, st, zip, item, price)
And FD set name  street, city
street, city  st
street, city  zip
name, item  price
Consider the decomposition
R1(name, street, city, st, zip) and R2(name, item, price)
 Is it lossless?
 Is it dependency preserving?
What if we replaced the first FD by name, street  city?
6
Another Example
Given scheme: R(sid, fid, subj)
and FD set: fid  subj
sid, subj  fid
Consider the decomposition
R1(sid, fid) and R2(fid, subj)
 Is it lossless?
 Is it dependency preserving?
7
FD’s and Keys
 Ideally, we want a design s.t. for each nontrivial
dependency X  Y, X is a superkey for some
relation schema in R
 We just saw that this isn’t always possible
 Hence we have two kinds of normal forms
8
Two Important Normal Forms
Boyce-Codd Normal Form (BCNF). For every relation
scheme R and for every X  A that holds over R,
either A  X (it is trivial) ,or
or X is a superkey for R
Third Normal Form (3NF). For every relation scheme
R and for every X  A that holds over R,
either A  X (it is trivial), or
X is a superkey for R, or
A is a member of some key for R
9
Normal Forms Compared
BCNF is preferable, but sometimes in conflict with the
goal of dependency preservation
 It’s strictly stronger than 3NF
Let’s see algorithms to obtain:
 A BCNF lossless join decomposition (nondeterministic)
 A 3NF lossless join, dependency preserving
decomposition
10
BCNF Decomposition Algorithm
(from Korth et al.; our book gives a recursive version)
result := {R}
compute F+
while there is a relation schema Ri in result that isn’t in BCNF
{
i.e., A doesn’t
form a key
let A  B be a nontrivial FD on Ri
s.t. A  Ri is not in F+
and A and B are disjoint
}
result:= (result – Ri)  {(Ri - B), (A,B)}
11
An Example
Given the schema:
Stuff(sid, name, serno, classroom, cid, fid, prof)
And FDs:
sid  name
fid  prof
serno  classroom, cid, fid
 Find the Boyce-Codd Normal Form for this schema
 What if instead:
sid  name
fid  prof
classroom, cid  serno
serno  cid
12
3NF Decomposition Algorithm
Let F be a minimal cover
i:=0
for each FD A  B in F {
if none of the schemas Rj, 1 j  i, contains AB
{
increment i
Ri := (A, B)
}
}
if no schema Rj, 1  j  i contains a candidate key for R {
increment i
Ri := any candidate key for R
}
return (R1, …, Ri)
Build dep.preserving
decomp.
Ensure
lossless
decomp.
13
An Example
Given the schema:
Stuff(sid, name, serno, classroom, cid, fid, prof)
And FDs:
sid  name
fid  prof
serno  classroom, cid, fid
 Find the Third Normal Form for this schema
 What if instead:
sid  name
fid  prof
classroom, cid  serno
serno  cid
14
Summary of Normalization
 We can always decompose into 3NF and get:
 Lossless join
 Dependency preservation
 But with BCNF:
 We are only guaranteed lossless joins
 The algorithm is nondeterministic, so there is not a
unique decomposition for a given schema R
 BCNF is stronger than 3NF: every BCNF schema is
also in 3NF
15
Normalization Is Good… Or Is It?
 In some cases, we might not mind redundancy, if the
data isn’t directly updated:
 Reports (people like to see breakdowns by semester,
department, course, etc.)
 Warehouses (archived copies of data for doing complex
analysis)
 Data sharing (sometimes we may export data into objectoriented or hierarchical formats)
16
XML: A Semi-Structured Data Model
17
Why XML?
XML is the confluence of several factors:





The Web needed a more declarative format for data
Documents needed a mechanism for extended tags
Database people needed a more flexible interchange format
“Lingua franca” of data
It’s parsable even if we don’t know what it means!
Original expectation:
 The whole web would go to XML instead of HTML
Today’s reality:
 Not so… But XML is used all over “under the covers”
18
Why DB People Like XML
Can get data from all sorts of sources
 Allows us to touch data we don’t own!
 This was actually a huge change in the DB community
Interesting relationships with DB techniques
 Useful to do relational-style operations
 Leverages ideas from object-oriented, semistructured data
Blends schema and data into one format
 Unlike relational model, where we need schema first
 … But too little schema can be a drawback, too!
19
XML Anatomy
Processing Instr.
<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
Open-tag
<mastersthesis mdate="2002-01-03" key="ms/Brown92">
<author>Kurt P. Brown</author>
<title>PRPL: A Database Workload Specification Language</title>
<year>1992</year>
Element
<school>Univ. of Wisconsin-Madison</school>
</mastersthesis>
<article mdate="2002-01-03" key="tr/dec/SRC1997-018">
<editor>Paul R. McJones</editor>
Attribute
<title>The 1995 SQL Reunion</title>
<journal>Digital System Research Center Report</journal>
<volume>SRC1997-018</volume>
Close-tag
<year>1997</year>
<ee>db/labs/dec/SRC1997-018.html</ee>
<ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
20
</article>
Well-Formed XML
A legal XML document – fully parsable by an XML
parser
 All open-tags have matching close-tags (unlike so many
HTML documents!), or a special:
<tag/> shortcut for empty tags (equivalent to <tag></tag>
 Attributes (which are unordered, in contrast to elements)
only appear once in an element
 There’s a single root element
 XML is case-sensitive
21
XML as a Data Model
XML “information set” includes 7 types of nodes:







Document (root)
Element
Attribute
Processing instruction
Text (content)
Namespace
Comment
XML data model includes this, plus typing info, plus
order info and a few other things
22
XML Data Model Visualized
(and simplified!)
Root
?xml
2002…
p-i
element
dblp
article
mdate
key
author title year school
1992
ms/Brown92
key
editor title journal volume year ee ee
2002…
tr/dec/…
PRPL…
Kurt P….
attribute
text
mastersthesis
mdate
root
Digital…
Univ….
1997
The…
Paul R.
db/labs/dec
SRC…
http://www.
23
What Does XML Do?
 Serves as a document format (super-HTML)
 Allows custom tags (e.g., used by MS Word, openoffice)
 Supplement it with stylesheets (XSL) to define formatting
 Data exchange format (must agree on terminology)
 Marshalling and unmarshalling data in SOAP and
Web Services
24
XML as a Super-HTML
(MS Word)
<h1 class="Section1">
<a name="_top“ />CIS 550: Database and Information
Systems</h1>
<h2 class="Section1">Fall 2004</h2>
<p class="MsoNormal">
<place>311 Towne</place>,
Tuesday/Thursday
<time Hour="13" Minute="30">1:30PM –
3:00PM</time>
</p>
25
XML Easily Encodes Relations
Student-course-grade
sid serno exp-grade
1 570103
B
23 550103
A
<student-course-grade>
<tuple><sid>1</sid><serno>570103</serno><expgrade>B</exp-grade></tuple>
<tuple><sid>23</sid><serno>550103</serno><expgrade>A</exp-grade></tuple>
</student-course-grade>
26
But XML is More Flexible…
“Non-First-Normal-Form” (NF2)
<parents>
<parent name=“Jean” >
<son>John</son>
<daughter>Joan</daughter>
<daughter>Jill</daughter>
</parent>
<parent name=“Feng”>
<daughter>Felicity</daughter>
</parent>
Coincides with “semi-structured data”,
…
invented by DB people at Penn and Stanford
27
XML and Code
 Web Services (.NET, recent Java web service toolkits) are
using XML to pass parameters and make function calls
 Why?
 Easy to be forwards-compatible
 Easy to read over and validate (?)
 Generally firewall-compatible
 Drawbacks? XML is a verbose and inefficient encoding!
 XML is used to represent:




SOAP: the “envelope” that data is marshalled into
XML Schema: gives some typing info about structures being passed
WSDL: the IDL (interface def language)
UDDI: provides an interface for querying about web services
28
Integrating XML: What If We Have
Multiple Sources with the Same Tags?
 Namespaces allow us to specify a context for different tags
 Two parts:
 Binding of namespace to URI
 Qualified names
<root xmlns=“http://www.first.com/aspace” xmlns:otherns=“…”>
<tag xmlns:myns=“http://www.fictitious.com/mypath”>
<thistag>is in the default namespace (aspace)</thistag>
<myns:thistag>is in myns</myns:thistag>
<otherns:thistag>is a different tag in otherns</otherns:thistag>
</tag>
</root>
29