Transcript DB2003
Database System Concepts
Instructor:
Jin Yuanping [email protected]
Professor of College of Software Engineering,
Southeast University
The reference book:
Database System Concepts (4th Edition),
A. Silberschatz , et al, McGraw-Hill, 2002
JYP
1
The Content of the Course
1 Introduction
2 Entity-Relationship Model
3 Relational Model
4 SQL
5 Integrity Constraints
6 Transactions
7 Concurrency Control
8 Database System Architecture
9 New Applications
10 XML
JYP
2
Topic 1: Introduction
Database System
View of Data
Data Models
Data Definition Language
Data Manipulation Language
Transaction Management
Storage Management
Database Administrator
Database Users
Overall System Structure
JYP
3
A Database System
A collection of interrelated data (DB)
A set of programs to access the data (DBMS)
A set of application programs
A group of people to administrate it (DBA)
DB contains information about a particular enterprise
DBMS provides an environment that is both convenient and
efficient to use.
JYP
4
View of Data
An architecture for a database system
JYP
5
Levels of Abstraction
Physical level describes how a record (e.g., customer) is stored.
Logical level: describes data stored in database, and the
relationships among the data.
type customer = record
name : string;
street : string;
city : integer;
end;
View level: application programs hide details of data types.
Views can also hide information (e.g., salary) for security
purposes.
JYP
6
Instances and Schemas
Similar to types, variables and values in programming languages
Schema – the logical structure of the database (e.g., set of
customers and accounts and the relationship between them)
Instance – the actual content of the database at a particular point
in time
JYP
7
Data Independence
Ability to modify a schema definition in one level without affecting
a schema definition in the next higher level.
The interfaces between the various levels and components
should be well defined so that changes in some parts do not
seriously influence others.
Two levels of data independence:
Physical data independence
Logical data independence
JYP
8
Data Models
A collection of concepts and tools for describing
data
data relationships
data semantics
data constraints
Entity-relationship model
Object-oriented model
Relational model (Oracle, DB2)
Semi-structured model (XML)
JYP
9
Entity-Relationship Model
Example of entity-relationship model
JYP
10
Relational Model
Example of tabular data in the relational model
customer-name customer-street
Jones
Smith
Curry
Lindsay
Main
North
North
Park
customer-city
Harrison
Rye
Rye
Pittsfield
customer
JYP
11
XML Data
<bank>
<account>
<account-number> A-101 </account-number>
<branch-name>
Downtown </branch-name>
<balance>
500
</balance>
</account>
<depositor>
<account-number> A-101 </account-number>
<customer-name> Johnson </customer-name>
</depositor>
</bank>
JYP
12
Data Definition Language (DDL)
Specification notation for defining the database schema
DDL compiler generates a set of tables
Data dictionary contains metadata (i.e., data about data)
Data storage and definition language – special type of DDL in
which the storage structure and access methods used by the
database system are specified
JYP
13
Data Manipulation Language (DML)
Language for accessing and manipulating the data organized by
the appropriate data model
Two classes of languages
Procedural – user specifies what data is required and how to get
those data
Nonprocedural – user specifies what data is required without
specifying how to get those data
JYP
14
Transaction Management
A transaction is a collection of operations that performs a single
logical function in a database application
Transaction-management component ensures that the database
remains in a consistent (correct) state despite system failures
(e.g., power failures and operating system crashes) and
transaction failures.
Concurrency-control manager controls the interaction among the
concurrent transactions, to ensure the consistency of the
database.
JYP
15
Storage Management
Storage manager is a program module that provides the
interface between the low-level data stored in the database and
the application programs and queries submitted to the system.
The storage manager is responsible for the following tasks:
interaction with the file manager
efficient storing, retrieving and updating of data
JYP
16
Database Administrator
Coordinates all the activities of the database system; the
database administrator has a good understanding of the
enterprise’s information resources and needs.
Database administrator's duties include:
Schema definition
Storage structure and access method definition
Schema and physical organization modification
Granting user authority to access the database
Specifying integrity constraints
Acting as liaison with users
Monitoring performance and responding to changes in
requirements
JYP
17
Database Users
Users are differentiated by the way they expect to interact with
the system
Application programmers – interact with system through DML
calls
Sophisticated users – form requests in a database query
language
Specialized users – write specialized database applications that
do not fit into the traditional data processing framework
Naïve users – invoke one of the permanent application programs
that have been written previously
JYP
18
Overall System Structure
JYP
19
Overall System Structure (Cont.)
The storage manager components also include:
Authorization and integrity manager
JYP
20
Topic 2: Entity-Relationship Model
Entity Sets
Relationship Sets
Mapping Constraints
Keys
E-R Diagram
Design of an E-R Database Schema
Reduction of an E-R Schema to Tables
JYP
21
Entity Sets
A database can be modeled as:
a collection of entities,
relationship among entities.
An entity is an object that exists and is distinguishable from other
objects.
Example: specific person, company, event, plant
An entity set is a set of entities of the same type that share the
same properties.
Example: set of all persons, companies, trees, holidays
JYP
22
Entity Sets Customer and Loan
JYP
23
Attributes
An entity is represented by a set of attributes, that is
descriptive properties possessed by all members of an
entity set.
Example:
customer = (customer-name, social-security,
customer-street, customer-city)
account = (account-number, balance)
Domain – the set of permitted values for each attribute
Attribute types:
Simple and composite attributes.
Single-valued and multi-valued attributes.
Null attributes.
Derived attributes.
JYP
24
Composite Attributes customer-name
and customer-address
JYP
25
Relationship Sets
A relationship is an association among several entities
Example:
Hayes
customer
entity set
(Hayes, A-102)
depositor
relationship set
A-102
account
entity set
A relationship set is a mathematical relation among n 2 entity
sets.
{(e1, e2, … en) | e1 E1, e2 E2, …, en En}
where (e1, e2, …, en) is a relationship
Example:
(Hayes, A-102) depositor
JYP
26
Relationship Sets (Cont.)
an attribute can also be property of a relationship set. For instance, the
depositor relationship set between entity sets customer and account may
have the attribute access-date
E-R Diagram with an Attribute Attached to a Relationship
JYP
27
Relationship set borrower
JYP
28
Degree of a Relationship Set
Refers to number of entity sets that participate in a relationship
set.
Relationship sets that involve two entity sets are binary (or
degree two). Generally, most relationship sets in a database
system are binary.
Relationship sets may involve more than two entity sets. The
entity sets customer, loan, and branch may be linked by the
ternary (degree three) relationship set CLB.
JYP
29
Roles
Entity sets of a relationship need not be distinct
The labels “manager” and “worker” are called roles; they
specify how employee entities interact via the works-for
relationship set.
Roles are indicated in E-R diagrams by labeling the lines
that connect diamonds to rectangles.
Role labels are optional, and are used to clarify
semantics of the relationship
JYP
30
Mapping Cardinalities
Express the number of entities to which another entity
can be associated via a relationship set.
Most useful in describing binary relationship sets.
For a binary relationship set the mapping cardinality must
be one of the following types:
One to one
One to many
Many to one
Many to many
We distinguish among these types by drawing either a
directed line (), signifying “one,” or an undirected line
(—), signifying “many,” between the relationship set and
the entity set.
JYP
31
Mapping Cardinalities
One to one
One to many
JYP
32
Mapping Cardinalities
Many to one
Many to many
JYP
33
One-To-One Relationship
A customer is associated with at most one loan via the
relationship borrower
A loan is associated with at most one customer via
borrower
JYP
34
One-To-Many and Many-To-One
Relationships
JYP
35
One-To-Many and Many-To-One (Cont.)
In the one-to-many relationship (a), a loan is associated with at
most one customer via borrower, a customer is associated with
several (including 0) loans via borrower
In the many-to-one relationship (b), a loan is associated with
several (including 0) customers via borrower, a customer is
associated with at most one loan via borrower
JYP
36
Many-To-Many Relationship
A customer is associated with several (possibly 0) loans
via borrower
A loan is associated with several (possibly 0) customers
via borrower
JYP
37
Existence Dependencies
If the existence of entity x depends on the existence of
entity y, then x is said to be existence dependent on y.
y is a dominant entity (in example below, loan)
x is a subordinate entity (in example below, payment)
loan
loan-payment
payment
If a loan entity is deleted, then all its associated payment
entities must be deleted also.
JYP
38
Keys
A super key of an entity set is a set of one or more
attributes whose values uniquely determine each entity.
A candidate key of an entity set is a minimal super key
social-security is candidate key of customer
account-number is candidate key of account
Although several candidate keys may exist, one of the
candidate keys is selected to be the primary key.
The combination of primary keys of the participating entity
sets forms a candidate key of a relationship set.
must consider the mapping cardinality and the semantics of
the relationship set when selecting the primary key.
(social-security, account-number) is the primary key of
depositor)
JYP
39
E-R Diagram Components
Rectangles represent entity sets.
Ellipses represent attributes
Diamonds represent relationship sets.
Lines link attributes to entity sets and entity sets to relationship
sets.
Double ellipses represent multivalued attributes.
Dashed ellipses denote derived attributes.
Primary key attributes are underlined.
JYP
40
Weak Entity Sets
An entity set that does not have a primary key is referred to as a
weak entity set.
The existence of a weak entity set depends on the existence of a
strong entity set, it must relate to the strong set via a one-tomany relationship set.
The discriminator (or partial key) of a weak entity set is the set of
attributes that distinguishes among all the entities of a weak
entity set that depend on one particular strong entity.
The primary key of a weak entity set is formed by the primary key
of the strong entity set on which the weak entity set is existence
dependent, plus the weak entity set’s discriminator.
JYP
41
Weak Entity Sets (Cont.)
We depict a weak entity set by double rectangles.
We underline the discriminator of a weak entity set with a
dashed line.
payment-number – discriminator of the payment entity set
Primary key for payment – (loan-number, payment-
number)
JYP
42
Specialization
Top-down design process, we designate subgroupings within an
entity set that are distinctive from other entities in the set.
These subgroupings become lower-level entity sets that have
attributes or participate in relationships that do not apply to the
higher-level entity set.
Depicted by a triangle component labeled ISA (i.e.,
saving-account “is an” account).
JYP
43
Specialization Example
JYP
44
Generalization
A bottom-up design process – combine a number of entity sets
that share the same features into a higher-level entity set.
Specialization and generalization are simple inversions of each
other; they are represented in an E-R diagram in the same way.
Attribute inheritance – a lower-level entity set inherits all the
attributes and relationship participation of the higher-level entity
set to which it is linked.
JYP
45
Design Constraints on a Generalization
Constraint on which entities can be members of a given
lower-level entity set.
condition-defined
user-defined
Constraint on whether or not entities may belong to more
than one lower-level entity set within a single
generalization.
disjoint
overlapping
Completeness constraint – specifies whether or not an
entity in the higher-level entity set must belong to at least
one of the lower-level entity sets within a generalization.
total
partial
JYP
46
E-R Diagram for Banking Enterprise
JYP
47
Reduction of an E-R Schema to Tables
Primary keys allow entity sets and relationship sets to be
expressed uniformly as tables which represent the
contents of the database.
A database which conforms to an E-R diagram can be
represented by a collection of tables.
For each entity set and relationship set there is a unique
table which is assigned the name of the corresponding
entity set or relationship set.
Each table has a number of columns (generally
corresponding to attributes), which have unique names.
Converting an E-R diagram to a table format is the basis
for deriving a relational database design from an E-R
diagram.
JYP
48
Representing Entity Sets as Tables
A strong entity set reduces to a table with the same
attributes.
customer-name
social-security
c-street
c-city
Jones
Smith
Hayes
321-12-3123
019-28-3746
677-89-9011
Main
North
Main
Harrison
Rye
Harrison
A weak entity set becomes a table that includes a column
for the primary key of the identifying strong entity set.
loan-number
payment-number
payment-date
payment-amount
L-17
L-23
L-15
5
11
22
10 May 1996
17 May 1996
12 May 1996
50
75
300
JYP
49
Representing Relationship Sets as Tables
A many-to-many relationship set is represented as a table with
columns for the primary keys of the two participating entity sets,
and any descriptive attributes of the relationship set.
social-security
...
account-number
...
access-date
...
The depositor table
The table corresponding to a relationship set linking a weak
entity set to its identifying strong entity set is redundant.
The payment table already contains the information that would
appear in the loan-payment table (i.e., the columns loan-number
and payment-number).
JYP
50
Representing Generalization as Tables
Method 1: Form a table for the generalized entity account.
Form a table for each entity set that is specialized (include
primary key of generalized entity set)
table
table attributes
account
account-number, balance, account-type
savings-account account-number, interest-rate
checking-account account-number, overdraft-amount
Method 2: If the generalization is disjoint and total, form a table
for each entity set that is specialized
table
table attributes
savings-account account-number, balance, interest-rate
checking-account account-number, balance, overdraft-amount
Method 2 has no table for generalized entity account
JYP
51
Topic 3: Relational Model
Structure of Relational Databases
Relational Algebra
Extended Relational-Algebra-Operations
Modification of the Database
Views
JYP
52
Basic Structure
Given sets A1, A2, …. An, a relation r is a subset of
A1 x A2 x … x An
Thus a relation is a set of n-tuples (a1, a2, …, an)
where ai Ai
Example: if
customer-name = {Jones, Smith, Curry, Lindsay}
customer-street = {Main, North, Park}
customer-city = {Harrison, Rye, Pittsfield}
Then
r = {(Jones, Main, Harrison), (Smith, North, Rye), (Curry, North, Rye),
(Lindsay, Park, Pittsfield)}
is a relation over
customer-name x customer-street x customer-city
JYP
53
Relation Schema
A1, A2, …, An are attributes and also as their domains
R = (A1, A2, …, An ) is a relation schema
Customer-schema = (customer-name, customer-street,
customer-city)
r(R) is a relation on the relation schema R
customer (Customer-schema)
JYP
54
Relation Instance
The current values (relation instance) of a relation are
specified by a table
An element t of r is a tuple, represented by a row in a
table
customer-name customer-street
Jones
Smith
Curry
Lindsay
Main
North
North
Park
customer-city
Harrison
Rye
Rye
Pittsfield
customer
JYP
55
Keys
Let K R
K is a superkey of R if values for K are sufficient to identify a
unique tuple of each possible relation r(R) by “possible r” we
mean a relation r that could exist in the enterprise we are
modeling.
Example: {customer-name, customer-street} and
{customer-name} are both superkeys of Customer, if no two
customers can possibly have the same name.
K is a candidate key if K is minimal
Example: {customer-name} is a candidate key for Customer,
since it is a superkey (assuming no two customers can possibly
have the same name), and no subset of it is a superkey.
JYP
56
Determining Keys from E-R Sets
Strong entity set. The primary key of the entity set becomes
the primary key of the relation.
Weak entity set. The primary key of the relation consists of the
union of the primary key of the strong entity set and the
discriminator of the weak entity set.
Relationship set. The union of the primary keys of the related
entity sets becomes a super key of the relation.
For binary many-to-one relationship sets, the primary key of the
“many” entity set becomes the relation’s primary key.
For one-to-one relationship sets, the relation’s primary key can
be that of either entity set.
JYP
57
Query Languages
Language in which user requests information from the database.
Typically a level higher than a standard programming language.
Categories of languages
procedural
non-procedural
“Pure” languages---concise:
Relational Algebra
Tuple Relational Calculus
Domain Relational Calculus
Pure languages form underlying basis of query languages that
people use.
JYP
58
Relational Algebra
Procedural language
Six basic operators
select
project
union
set difference
Cartesian product
rename
The operators take one or two relations as inputs and give a new
relation as a result.
JYP
59
Select Operation
Notation: p(r)
Defined as:
p(r) = {t | t r and p(t)}
Where P is a formula in propositional calculus, dealing
with terms of the form:
<attribute> = <attribute> or <constant>
>
<
“connected by” : (and), (or), (not)
JYP
60
Select Operation – Example
•
•
Relation r
A
B
C
D
1
7
5
7
12
3
23 10
A
B
C
D
1
7
23 10
A=B ^ D > 5 (r)
JYP
61
Project Operation
Notation:
A1, A2, …, Ak (r)
where A1, A2 … Ak are attribute names and r is a relation name.
The result is defined as the relation of k columns obtained by
erasing the columns that are not listed
Duplicate rows removed from result, since relations are sets
JYP
62
Project Operation – Example
Relation r:
A,C
(r )
A
B
C
10
1
20
1
30
1
40
2
A
C
A
C
1
1
1
1
1
2
2
=
JYP
63
Union Operation
Notation: r s
Defined as:
r s = {t | t r or t s}
For r s to be valid.
1. r, s must have the same number of attributes
2. The attribute domains must be compatible (e.g., 2nd column
of r deals with the same type of values as does the 2nd
column of s)
JYP
64
Union Operation – Example
Relations r, s:
A
B
A
B
1
2
2
3
1
s
r
r s:
A
B
1
2
1
3
JYP
65
Set Difference Operation
Notation r – s
Defined as:
r – s = {t | t r and t s}
Set differences must be taken between compatible relations.
r and s must have the same number of attributes
attribute domains of r and s must be compatible
JYP
66
Set Difference Operation – Example
Relations r, s:
A
B
A
B
1
2
2
3
1
s
r
r – s:
A
B
1
1
JYP
67
Cartesian-Product Operation
Notation r x s
Defined as:
r x s = {t q| t r and q s}
Assume that attributes of r(R) and s(S) are disjoint. (That is,
R S = ).
If attributes of r(R) and s(S) are not disjoint, then renaming must
be used.
JYP
68
Cartesian-Product Operation –
Example
Relations r, s:
A
B
C
D
E
1
2
10
10
20
10
+
+
–
–
r
s
r x s:
A
B
C
D
E
1
1
1
1
2
2
2
2
10
19
20
10
10
10
20
10
+
+
–
–
+
+
–
–
JYP
69
Composition of Operations
Can build expressions using multiple operations
Example: A=C(r x s)
Notation: r
s (natural join)
Let r and s be relations on schemas R and S respectively. The
result is a relation on schema R S which is obtained by
considering each pair of tuples tr from r and ts from s.
If tr and ts have the same value on each of the attributes in R S, a
tuple t is added to the result, where
t has the same value as tr on r
t has the same value as ts on s
JYP
70
Composition of Operations (Cont.)
Example:
R = (A, B, C, D)
S = (E, B, D)
Result schema = (A, B, C, D, E)
r
s is defined as:
r.A, r.B, r.C ,r.D, s.E (c.B=s.B ^ r.D=s.D(r x s))
JYP
71
Natural Join Operation – Example
Relations r, s:
A
B
C
D
B
D
E
1
2
4
1
2
a
a
b
a
b
1
3
1
2
3
a
a
a
b
b
r
r
s
s
A
B
C
D
E
1
1
1
1
2
a
a
a
a
b
JYP
72
Banking Example
branch (branch-name, branch-city, assets)
customer (customer-name, customer-street, customer-city)
account (branch-name, account-number, balance)
loan (branch-name, loan-number, amount)
depositor (customer-name, account-number)
borrower (customer-name, loan-number)
JYP
73
Extended Relational-Algebra-Operations
Generalized Projection
Outer Join
Aggregate Functions
JYP
74
Generalized Projection
Extends the projection operation by allowing arithmetic functions
to be used in the projection list.
P
(E)
F1, F2, …, Fn
E is any relational-algebra expression
Each of F1, F2, …, Fn are arithmetic expressions involving
constants and attributes in the schema of E.
Given relation credit-info(customer-name, limit, credit-balance),
find how much more each person can spend:
P customer-name, limit – credit-balance
JYP
(credit-info)
75
Outer Join
An extension of the join operation that avoids loss of information.
Computes the join and then adds tuples form one relation that
does not match tuples in the other relation to the result of the
join.
Uses null values:
null signifies that the value is unknown or does not exist
All comparisons involving null are false by definition.
JYP
76
Outer Join – Example
Relation loan
branch-name
Downtown
Redwood
Perryridge
loan-number
L-170
L-230
L-260
amount
3000
4000
1700
Relation borrower
customer-name loan-number
Jones
Smith
Hayes
L-170
L-230
L-155
JYP
77
Outer Join – Example
loan
borrower
branch-name
Downtown
Redwood
loan
loan-number
L-170
L-230
amount
3000
4000
customer-name
Jones
Smith
borrower
branch-name
Downtown
Redwood
Perryridge
loan-number
L-170
L-230
L-260
amount
3000
4000
1700
JYP
customer-name
Jones
Smith
null
78
Outer Join – Example
loan
borrower
branch-name
Downtown
Redwood
null
loan-number
amount
L-170
L-230
L-155
3000
4000
null
customer-name
Jones
Smith
Hayes
loan borrower
branch-name
Downtown
Redwood
Perryridge
null
loan-number
amount
L-170
L-230
L-260
L-155
3000
4000
1700
null
JYP
customer-name
Jones
Smith
null
Hayes
79
Aggregate Functions
Aggregation operator
takes a collection of values and returns
a single value as a result.
avg: average value
min: minimum value
max: maximum value
sum: sum of values
count: number of values
G1, G2, …, Gn
F , A , F , A , …, F
1
1
2
2
m,
Am
(E)
E is any relational-algebra expression
G1, G2 …, Gn is a list of attributes on which to group
Fi is an aggregate function
Ai is an attribute name
JYP
80
Aggregate Function – Example
Relation r:
sumC(r)
A
B
C
7
7
3
10
sum-C
27
JYP
81
Aggregate Function – Example
Relation account grouped by branch-name:
branch-name
account-number
balance
A-102
A-201
A-217
A-215
A-222
400
900
750
750
700
Perryridge
Perryridge
Brighton
Brighton
Redwood
branch-name
sum balance (account)
branch-name sum of balance
Perryridge
Brighton
Redwood
JYP
1300
1500
700
82
Modification of the Database
The content of the database may be modified using the following
operations:
Deletion
Insertion
Updating
All these operations are expressed using the assignment
operator().
JYP
83
Deletion
A delete request is expressed similarly to a query, except instead
of displaying tuples to the user, the selected tuples are removed
from the database.
Can delete only whole tuples; cannot delete values on only
particular attributes
A deletion is expressed in relational algebra by:
rr–E
where r is a relation and E is a relational algebra query.
JYP
84
Deletion Examples
Delete all account records in the Perryridge branch.
account account – branch-name = “Perryridge” (account)
Delete all loan records with amount in the range of 0 to 50
loan loan – amount 0 and amount 50 (loan)
Delete all accounts at branches located in Needham.
r1 branch-city = “Needham” (account
branch)
r2 branch-name, account-number, balance (r1)
r3 customer-name, account-number (r2
depositor)
account account – r2
depositor depositor – r3
JYP
85
Insertion
To insert data into a relation, we either:
specify a tuple to be inserted
write a query whose result is a set of tuples to be inserted
in relational algebra, an insertion is expressed by:
r r E
where r is a relation and E is a relational algebra expression.
The insertion of a single tuple is expressed by letting E be a
constant relation containing one tuple.
JYP
86
Insertion Examples
Insert information in the database specifying that Smith has $1200
in account A-973 at the Perryridge branch.
account account {(“Perryridge”, A-973, 1200)}
depositor depositor {(“Smith”, A-973)}
Provide as a gift for all loan customers in the Perryridge branch, a
$200 savings account. Let the loan number serve as the account
number for the new savings account.
r1 (branch-name = “Perryridge” (borrower
loan))
account account ( branch-name, loan-number (r1) x{(200)})
depositor depositor customer-name, loan-number (r1)
JYP
87
Updating
A mechanism to change a value in a tuple without changing all
values in the tuple
Use the generalized projection operator to do this task
r F1, F2, …, Fn (r)
Each Fi, is either the ith attribute of r, if the ith attribute is not
updated, or, if the attribute is to be updated Fi is an expression,
involving only constants and the attributes of r, which gives the
new value for the attribute
JYP
88
Update Examples
Make interest payments by increasing all balances by 5 percent.
account branch-name, account-number, balance * 1.05 (account)
Pay all accounts with balances over $10,000
6 percent interest and pay all others 5 percent
account branch-name, account-number, balance * 1.06 ( balance 10000 (account))
branch-name, account-number, balance * 1.05 ( balance 10000 (account))
JYP
89
Views
In some cases, it is not desirable for all users to see the entire
logical model (i.e., all the actual relations stored in the database.)
Consider a person who needs to know a customer’s loan number
and branch name but has no need to see the loan amount. This
person should see a relation described, in the relational algebra,
by
customer-name, loan-number, branch-name (borrower
loan)
Any relation that is not of the conceptual model but is made
visible to a user as a “virtual relation” is called a view.
JYP
90
View Definition
A view is defined using the create view statement which has the
form
create view v as <query expression>
where <query expression> is any legal relational algebra query
expression. The view name is represented by v.
Once a view is defined, the view name can be used to refer to
the virtual relation that the view generates.
View definition is not the same as creating a new relation by
evaluating the query expression. Rather, a view definition
causes the saving of an expression to be substituted into queries
using the view.
JYP
91
View Examples
Consider the view (named all-customer) consisting of branches
and their customers.
create view all-customer as
branch-name, customer-name (depositor
account)
branch-name, customer-name (borrower
loan)
We can find all customers of the Perryridge branch by writing:
customer-name( branch-name = “Perryridge” (all-customer))
JYP
92
Updates Through View
Database modifications expressed as views must be translated
to modifications of the actual relations in the database.
Consider the person who needs to see all loan data in the loan
relation except amount. The view given to the person, branchloan, is defined as:
create view branch-loan as
branch-name, loan-number (loan)
Since we allow a view name to appear wherever a relation name
is allowed, the person may write:
branch-loan branch-loan {(“Perryridge”, L-37)}
JYP
93
Updates Through Views (Cont.)
The previous insertion must be represented by an insertion into
the actual relation loan from which the view branch-loan is
constructed.
An insertion into loan requires a value for amount. The insertion
can be dealt with by either.
rejecting the insertion and returning an error message to the user.
inserting a tuple (“Perryridge”, L-37, null) into the loan relation
JYP
94
Views Defined Using Other Views
One view may be used in the expression defining another view
A view relation v1 is said to depend directly on a view relation v2
if v2 is used in the expression defining v1 , this is depicted by a
directed edge < v2 v1 >
A view relation v1 is said to depend on view relation v2 , if there is
a path from v2 to v1 .
A view relation v is said to be recursive if it depends on itself.
JYP
95
View Expansion
A way to define the meaning of views defined in terms of other
views.
Let view v1 be defined by an expression e1 that may itself contain
uses of view relations.
View expansion of an expression repeats the following
replacement step:
repeat
Find any view relation vi in e1
Replace the view relation vi by the expression defining vi
until no more view relations are present in e1
As long as the view definitions are not recursive, this loop will
terminate
JYP
96
Topic 4: SQL
SQL: Structured Query Language for commercial DBMS---more
user-friendly
Basic Structure
Set Operations
Aggregate Functions
Null Values
Nested Subqueries
Derived Relations
Views
Modification of the Database
Joined Relations
Data Definition Language
Embedded SQL
JYP
97
Basic Structure
SQL is based on set and relational operations with certain
modifications and enhancements
A typical SQL query has the form:
select A1, A2, ..., An
from r1, r2, ..., rm
where P
Ais represent attributes
ris represent relations
P is a predicate.
This query is equivalent to the relational algebra
expression.
A1, A2, ..., An (P (r1 x r2 x ... x rm))
The result of an SQL query is a relation.
JYP
98
The select Clause
The select clause corresponds to the projection operation of the
relational algebra. It is used to list the attributes desired in the
result of a query.
Q1:Find the names of all branches in the loan relation
select branch-name
from loan
the query is equivalent to:
branch-name(loan)
An asterisk in the select clause denotes “all attributes”
select *
from loan
JYP
99
The select Clause (Cont.)
SQL allows duplicates in relations as well as in query results.
To force the elimination of duplicates, insert the keyword distinct
after select.
Q2: Find the names of all branches in the loan relations, and
remove duplicates
select distinct branch-name
from loan
The keyword all specifies that duplicates not be removed.
select all branch-name
from loan
JYP
100
The select Clause (Cont.)
The select clause can contain arithmetic expressions involving
the operation +, –, , and /, and operating on constants or
attributes of tuples---generalized projection.
Q3:
select branch-name, loan-number, amount 100
from loan
would return a relation which is the same as the loan relations,
except that the attribute amount is multiplied by 100.
JYP
101
The where Clause
The where clause corresponds to the selection predicate of the
relational algebra. It consists of a predicate involving attributes
of the relations that appear in the from clause.
Q4: Find all loan number for loans made at the Perryridge branch
with loan amounts greater than $1300.
select loan-number
from loan
where branch-name = “ Perryridge” and amount > 1300
SQL uses the logical connectives and, or, and not. It allows the
use of arithmetic expressions as operands to the comparison
operators.
JYP
102
The where Clause (Cont.)
SQL Includes a between comparison operator in order to simplify
where clauses that specify that a value be less than or equal to
some value and greater than or equal to some other value.
Q5:Find the loan number of those loans with loan amounts between
$900 and $1000 (that is, $900 and $1000)
select loan-number
from loan
where amount between 900 and 1000
JYP
103
The from Clause
The from clause corresponds to the Cartesian product operation of the
relational algebra. It lists the relations to be scanned in the evaluation of
the expression.
Q6:Find the Cartesian product borrower x loan
select
from borrower, loan
Q7:Find the name and loan number of all customers having a loan at the
Perryridge branch.
select distinct customer-name, borrower.loan-number
from borrower, loan
where borrower.loan-number = loan.loan-number and
branch-name = “Perryridge”
JYP
104
The Rename Operation
The SQL mechanism for renaming attributes is accomplished
through the as clause:
old-name as new-name
Q8: Find the name and loan number of all customers having a loan
at the Perryridge branch; replace the column name loan-number
with the name loan-id.
select distinct customer-name, borrower.loan-number as loan-id
from borrower, loan
where borrower.loan-number = loan.loan-number and
branch-name = “Perryridge”
JYP
105
Tuple Variables
Tuple variables are defined in the from clause via the use of the
as clause.
Q9:Find the customer names, their loan numbers and amount for
all customers having a loan at some branch.
select distinct customer-name, T.loan-number, S.amount
from borrower as T, loan as S
where T.loan-number = S.loan-number
Q10:Find the names of all branches that have greater assets
than some branch located in Brooklyn.
select distinct T.branch-name
from branch as T, branch as S
Where T.assets > S.assets and S.branch-city = “Brooklyn”
JYP
106
String Operations
SQL includes a string-matching operator for comparisons on
character strings. Patterns are described using two special
characters:
percent (%). The % character matches any substring.
underscore (_). The _ character matches any character.
Find the names of all customers whose street includes the
substring “Main”.
select customer-name
from customer
where customer-street like “%Main%”
Match the name “Main%”
like “Main\%” escape “\”
Note in ACCESS, “%” is replaced by “*”, see query X1.
JYP
107
Ordering the Display of Tuples
Q11:List in alphabetic order the names of all customers having a
loan in Perryridge branch
select distinct customer-name
from borrower, loan
where borrower.loan-number = loan.loan-number and
branch-name = “Perryridge”
order by customer-name
We may specify desc for descending order or asc for ascending
order, for each attribute; ascending order is the default.
SQL must perform a sort to fulfil an order by request. Since
sorting a large number of tuples may be costly. It is desirable to
sort only when necessary.
JYP
108
Duplicates
In relations with duplicates, SQL can define how many copies of
tuples appear in the result.
Multiset versions of some of the relational algebra operators –
given multiset relations r1 and r2:
1. If there are c1 copies of tuple t1 in r1, and t1 satisfies selection ,
then there are c1 copies of t1 in (r1).
2. For each copy of tuple t1 in r1, there is a copy of tuple A(t1) in A(r1)
where A(t1) denotes the projection of the single tuple t1.
3. If there are c1 copies of tuple t1 in r1 and c2 copies of tuple t2 in r2,
there are c1 x c2 copies of the tuple t1. t2 in r1 x r2
JYP
109
Duplicates (Cont.)
Suppose relations r1 with schema (A, B) and r2 with
schema (C) are the following multisets:
r1 = {(1, a) (2,a)}
r2 = {(2), (3), (3)}
Then B(r1) would be {(a), (a)}, while B(r1) x r2 would
be
{(a,2), (a,2), (a,3), (a,3), (a,3), (a,3)}
SQL duplicate semantics:
select A1,, A2, ..., An
from r1, r1, ..., rm
where P
is equivalent to the multiset version of the expression:
A1,, A2, ..., An(P(r1 x r2 x ... x rm))
JYP
110
Set Operations
The set operations union, intersect, and except operate on
relations and correspond to the relational algebra operations
Each of the above operations automatically eliminates
duplicates; to retain all duplicates use the corresponding multiset
versions union all, intersect all and except all. Suppose a
tuple occurs m times in r and n times in s, then, it occurs:
m + n times in r union all s
min (m, n) times in r intersect all s
JYP
111
Set Operations
Q12: Find all customers who have a loan, an account, or both:
(select customer-name from depositor)
union
(select customer-name from borrower)
Find all customers who have both a loan and an account.
(select customer-name from depositor)
intersect
(select customer-name from borrower)
Find all customers who have an account but no loan.
select customer-name from depositor)
except
(select customer-name from borrower)
JYP
112
Aggregate Functions
These functions operate on the multiset of values of a column of
a relation, and return a value
avg: average value
min: minimum value
max: maximum value
sum: sum of values
count: number of values
JYP
113
Aggregate Functions (Cont.)
Q13:Find the average account balance at the Brighton branch.
select avg (balance)
from account
where branch-name = “Brighton”
Q14:Find the number of tuples in the customer relation.
select count (*)
from customer
Find the number of depositors in the bank.
select count (distinct customer-name)
from depositor
JYP
114
Aggregate Functions – Group By
Find the number of depositors for each branch.
select branch-name, count (distinct customer-name)
from depositor, account
where depositor.account-number = account.accountnumber
group by branch-name
Note: Attributes in select clause outside of aggregate
functions must appear in group by list
JYP
115
Aggregate Functions – Having Clause
Q15: Find the names of all branches where the average account
balance is more than $400.
select branch-name, avg (balance)
from account
group by branch-name
having avg (balance) > 400
Note: predicates in the having clause are applied after the
formation of groups.
JYP
116
Null Values
It is possible for tuples to have a null value, denoted by null,
for some of their attributes, null signifies an unknown value or
that a value does not exist.
The result of any arithmetic expression involving null is null.
Roughly speaking, all comparisons involving null return false.
More precisely.
Any comparison with null returns unknown
(unknown or unknown) = unknown,
(true and unknown) = unknown, (false and unknown) = false,
(unknown and unknown) = unknown
Result of where clause predicate is treated as false if it
evaluates to unknown
“P is unknown” evaluates to true if predicate P evaluates to
unknown
JYP
117
Null Values (Cont.)
Find all loan number which appear in the loan relation with nulll
values for amount.
select loan-number
from loan
where amount is null
Total all loan amounts
select sum (amount)
from loan
Above statement ignores null amounts; result is null if there is no
non-null amount.
All aggregate operations except count(*) ignore tuples with null
values on the aggregated attributes.
JYP
118
Nested Subqueries
SQL provides a mechanism for the nesting of subqueries.
A subquery is a select-from-where expression that is nested
within another query.
A common use of subqueries is to perform tests for set
membership, set comparisons, and set cardinality.
JYP
119
Set Membership
F in r t r (t = F)
(5 in
0
4
5
) = true
(5 in
0
4
6
) = false
(5 not in
0
4
6
) = true
JYP
120
Example Query
Q16:Find all customers who have both an account and a loan at
the bank.
select distinct customer-name
from borrower
where customer-name in (select customer-name
from depositor)
Q17:Find all customers who have a loan at the bank but do not
have an account at the bank
select distinct customer-name
from borrower
where customer-name not in (select customer-name
from depositor)
JYP
121
Example Query
Find all customers who have both an account and a loan at the
Perryridge branch.
select distinct customer-name
from borrower, loan
where borrower.loan-number = loan.loan-number and
branch-name = “Perryridge” and
(branch-name, customer-name) in
(select branch-name, customer-name
from depositor, account
where depositor.account-number =
account.account-number)
JYP
122
The Some Clause
F <comp> some r t(t r [F <comp> t])
Where <comp> can be:
(5< some
0
5
6
) = true
(read: 5 < some tuple in the relation)
(5< some
0
5
) = false
(5 = some
0
5
) = true
0
(5 some 5 ) = true (since 0 5)
(= some) in
However, ( some) not in
JYP
123
Example Query
Q18:Find all branches that have greater assets than some
branch located in Brooklyn.
select branch-name
from branch
where assets >some
(select assets
from branch
where branch-city = “Brooklyn”)
JYP
124
The All Clause
F <comp> all r t(t r [F <comp> t])
(5< all
0
5
6
) = false
(5< all
6
10
) = true
(5 = all
4
5
) = false
4
(5 all 6 ) = true (since 5 4 and 5 6)
( all) not in
However, (= all) in
JYP
125
Example Query
Q19:Find the names of all branches that have greater assets
than all branches located in Horsenneck.
select branch-name
from branch
where assets > all
(select assets
from branch
where branch-city = “Horsenneck”)
JYP
126
Test for Empty Relations
The exists construct returns the value true if the argument
subquery is nonempty.
exists r r Ø
not exists r r = Ø
JYP
127
Example Query
Find all customers who have an account at all branches located
in Brooklyn.
select distinct S.customer-name
from depositor as S
where not exists (
(select branch-name
from branch
where branch-city = “Brooklyn”)
except
(select R.branch-name
from depositor as T, account as R
where T.account-number = R.account-number and
S.customer-name=T.customer-name))
Note that X – Y = Ø X Y
JYP
128
Test for Absence of Duplicate Tuples
The unique construct tests whether a subquery has any
duplicate tuples in its result.
Q20:Find all customers who have only one account at the
Perryridge branch.
select T.customer-name
from depositor as T
where unique (
select R.customer-name
from account, depositor as R
where T.customer-name = R.customer-name and
R.account-number = account.account-number
and
account.branch-name = “Perryridge”)
Note: in ACCESS, it does not work when the subquery return
more than one tuple.
JYP
129
Example Query
Find all customers who have at least two accounts at the
Perryridge branch.
select distinct T.customer-name
from depositor T
where not unique (
select R.customer-name
from account, depositor as R
where T.customer-name = R.customer-name and
R.account-number - account.account-number and
account.branch-name = “Perryridge”)
JYP
130
Derived Relations
Find the average account balance of those branches where the
average account balance is greater than $1200.
select branch-name, avg-balance
from (select branch-name, avg (balance)
from account
group by branch-name)
as result (branch-name, avg-balance)
where avg-balance > 1200
Note that we do not need to use the having clause, since we
compute in the from clause the temporary relation result, and the
attributes of result can be used directly in the where clause.
JYP
131
Views
Provide a mechanism to hide certain data from the view of
certain users. To create a view we use the command:
create view v as <query expression>
where:
<query expression> is any legal expression
The view name is represented by v
JYP
132
Example Queries
A view consisting of branches and their customers
create view all-customer as
(select branch-name, customer-name
from depositor, account
where depositor.account-number = account.account-number)
union
(select branch-name, customer-name
from borrower, loan
where borrower.loan-number = loan.loan-number)
Find all customers of the Perryridge branch
select customer-name
from all-customer
where branch-name = “Perryridge”
JYP
133
Modification of the Database – Deletion
Delete all account records at the Perryridge branch
delete from depositor
where account-number in (select account-number
from account
where branch-name = “Perryridge”)
delete from account
where branch-name = “Perryridge”
JYP
134
Example Deletion
Delete all accounts at every branch located in Needham.
delete from depositor
where account-number in (select account-number
from branch, account
where branch-city = “Needham”
and branch.branch-name = account.branch-name)
delete from account
where branch-name in (select branch-name
from branch
where branch-city = “Needham”)
JYP
135
Example Deletion (Cont.)
Delete the record of all accounts with balances below the
average at the bank.
delete from account
where balance < (select avg (balance)
from account)
Problem: as we delete tuples from account, the average balance
changes
Solutions used in SQL:
1. First, compute avg balance and find all tuples to delete
2. Next, delete all tuples found above (without recomputing avg or
retesting the tuples)
JYP
136
Modification of the Database –
Insertion
Add a new tuple to account
insert into account
values (“Perryridge”, “A-9732”, 1200)
or equivalently
insert into account (branch-name, balance, account-number)
values (“Perryridge”, 1200, “A-9732”)
Add a new tuple to account with balance set to null
insert into account
values (“Perryridge”, “A-777”, null)
JYP
137
Modification of the Database –
Insertion
Provide as a gift for all loan customers of the Perryridge branch,
a $200 savings account. Let the loan number serve as the
account number for the new savings account
insert into account
select branch-name, loan-number, 200
from loan
where branch-name = “Perryridge”
insert into depositor
select customer-name, loan-number
from loan, borrower
where branch-name = “Perryridge”
and loan.account-number = borrower.account-number
JYP
138
Modification of the Database – Updates
Increase all accounts with balances over $10,000 by 6%, all
other accounts receive 5%.
Write two update statements:
update account
set balance = balance 1.06
where balance > 10000
update account
set balance = balance 1.05
where balance 10000
The order is important
JYP
139
Update of a View
Create a view of all loan data in loan relation, hiding the
amount attribute
create view branch-loan as
select branch-name, loan-number
from loan
Add a new tuple to branch-loan
insert into branch-loan
values (“Perryridge”,” L-307”)
This insertion must be represented by the insertion of
the tuple
(“Perryridge”, “L-307”, null)
into the loan relation
Updates on more complex views are difficult or
impossible to translate, and hence are disallowed.
JYP
140
Joined Relations
Join operations take two relations and return as a result another
relation.
These additional operations are typically used as subquery
expressions in the from clause.
Join condition – defines which tuples in the two relations match,
and what attributes are present in the result of the join.
Join type – defines how tuples in each relation that do not match
any tuple in the other relation (based on the join condition) are
treated.
Join Types
Join Conditions
inner join
left outer join
right outer join
full outer join
natural
on <predicate>
using (A1, A2, ..., An)
JYP
141
Joined Relations – Datasets for
Examples
Relation loan
branch-name
loan-number
amount
Downtown
L-170
3000
Redwood
L-230
4000
Perryridge
L-260
1700
Relation borrower
customer-name
loan-number
Jones
L-170
Smith
L-230
Hayes
L-155
JYP
142
Joined Relations – Examples
Q21:loan inner join borrower on
loan.loan-number = borrower.loan-number
branch-name
loan-number
amount
customer-name
loan-number
Downtown
L-170
3000
Jones
L-170
Redwood
L-230
4000
Smith
L-230
Q22: loan left outer join borrower on
loan.loan-number = borrower.loan-number
branch-name
loan-number
amount
customer-name
loan-number
Downtown
L-170
3000
Jones
L-170
Redwood
L-230
4000
Smith
L-230
Perryridge
L-260
1700
null
null
JYP
143
Joined Relations – Examples
loan natural inner join borrower
branch-name
loan-number
amount
customer-name
Downtown
L-170
3000
Jones
Redwood
L-230
4000
Smith
loan natural right outer join borrower
branch-name
loan-number
amount
customer-name
Downtown
L-170
3000
Jones
Redwood
L-230
4000
Smith
null
L-155
null
Hayes
JYP
144
Joined Relations – Examples
loan natural full outer join borrower using (loan-number)
branch-name
loan-number
amount
customer-name
Downtown
L-170
3000
Jones
Redwood
L-230
4000
Smith
Perryridge
L-260
1700
null
null
L-155
null
Hayes
Find all customers who have either an account or a loan (but
not both) at the bank.
select customer-name
from (depositor natural full outer join borrower)
where account-number is null or loan-number is null
JYP
145
Data Definition Language (DDL)
Allows the specification of not only a set of relations but also
information about each relation, including:
The schema for each relation.
The domain of values associated with each attribute.
Integrity constraints
The set of indices to be maintained for each relations.
Security and authorization information for each relation.
The physical storage structure of each relation on disk.
JYP
146
Domain Types in SQL
char(n). Fixed length character string, with user-specified length n.
varchar(n). Variable length character strings, with user-specified
maximum length n.
integer. Integer (a finite subset of the integers that is machine-
dependent).
smallint. Small integer (a machine-dependent subset of the integer
domain type).
numeric(p,d). Fixed point number, with user-specified precision of
p digits, with d digits to the right of decimal point.
JYP
147
Domain Types in SQL (Cont.)
real, double precision. Floating point and double-precision
floating point numbers, with machine-dependent precision.
float(n). Floating point number, with user-specified precision of
at least n digits.
date. Dates, containing a (4 digit) year, month and date.
time. Time of day, in hours, minutes and seconds.
Null values are allowed in all the domain types. Declaring an
attribute to be not null prohibits null values for that attribute.
create domain construct in SQL-92 creates user-defined domain
types
create domain person-name char(20) not null
JYP
148
Create Table Construct
An SQL relation is defined using the create table
command:
create table r (A1 D1, A2 D2, ..., An Dn,
(integrity-constraint1),
...,
(integrity-constraintk))
r is the name of the relation
each Ai is an attribute name in the schema of relation r
Di is the data type of values in the domain of attribute Ai
Example:
create table branch
(branch-name char(15) not null,
branch-city
char(30),
assets
integer)
JYP
149
Integrity Constraints in Create Table
not null
primary key (A1, ..., An)
check (P) , where P is a predicate
Example: Declare branch-name as the primary key for
branch and ensure that the values of assets are nonnegative.
create table branch
(branch-name char(15) not null,
branch-city
char(30),
assets
integer,
primary key (branch-name),
check (assets >=0))
primary key declaration on an attribute automatically
ensures not null in SQL-92
JYP
150
Drop and Alter Table Constructs
The drop table command deletes all information about
the dropped relation from the database.
The alter table command is used to add attributes to
an existing relation. All tuples in the relation are
assigned null as the value for the new attribute. The
form of the alter table command is
alter table r add A D
where A is the name of the attribute to be added to
relation r and D is the domain of A.
The alter table command can also be used to drop
attributes of a relation
alter table r drop A
when A is the name of an attribute of relation r.
JYP
151
Embedded SQL
The SQL standard defines embeddings of SQL in a variety of
programming languages such as Pascal, PL/I, Fortran, C, and
Cobol.
A language to which SQL queries are embedded is referred to as
a host language, and the SQL structures permitted in the host
language comprise embedded SQL.
The basic form of these languages follows that of the system R
embedding of SQL into PL/I.
EXEC SQL statement is used to identify embedded SQL request
to the preprocessor
EXEC SQL <embedded SQL statement > END-EXEC
JYP
152
Example Query
From within a host language, find the names and account
numbers of customers with more than the variable amount
dollars in some account.
Specify the query in SQL and declare a cursor for it
EXEC SQL
declare c cursor for
select customer-name, account-number
from depositor, account
where depositor account-number = account.account-number
and account.balance > :amount
END-EXEC
JYP
153
Embedded SQL (Cont.)
The open statement causes the query to be evaluated
EXEC SQL open c END-EXEC
The fetch statement causes the values of one tuple in the query
result to be placed on host language variables.
EXEC SQL fetch c into :cn , :an END-EXEC
Repeated calls to fetch get successive tuples in the query result;
a variable in the SQL communication area indicates when endof-file is reached.
The close statement causes the database system to delete the
temporary relation that holds the result of the query.
EXEC SQL close c END-EXEC
JYP
154
Dynamic SQL
Allows programs to construct and submit SQL queries at
run time.
Example of the use of dynamic SQL from within a C
program.
char * sqlprog = “ update account set balance = balance
* 1.05 where account-number = ? ”
EXEC SQL prepare dynprog from :sqlprog;
char account [10] = “A-101”;
EXEC SQL execute dynprog using :account;
The dynamic SQL program contains a ?, which is a place
holder for a value that is provided when the SQL program
is executed.
JYP
155
Other SQL Features
Fourth-generation languages – special language to
assist application programmers in creating templates
on the screen for a user interface, and in formatting
data for report generation, available in most commercial
database products
SQL sessions – provide the abstraction of a client and a
server (possibly remote)
client connects to an SQL server, establishing a session
executes a series of statements
disconnects the session
can commit or rollback the work carried out in the session
JYP
156
An SQL environment contains several components,
including a user identifier, and a schema, which
identifies which of several schemas a session is using.
the instructions appear in each individual transaction.
serializable schedules to be generated, while others
allow view-serializable schedules that are not conflictserializable.
JYP
157
Topic 5: Integrity Constraints
Domain Constraints
Referential Integrity
Assertions
Triggers
Functional Dependencies
JYP
158
Domain Constraints
Integrity constraints guard against accidental damage to the
database, by ensuring that authorized changes to the database
do not result in a loss of data consistency.
Domain constraints are the most elementary form of integrity
constraint.
They test values inserted in the database, and test queries to
ensure that the comparisons make sense.
JYP
159
Domain Constraints (Cont.)
The check clause in SQL-92 permits domains to be restricted:
Use check clause to ensure that an hourly-wage domain allows only
values greater than a specified value.
create domain hourly-wage numeric(5,2)
constraint value-test check(value > = 4.00)
The domain hourly-wage is declared to be a decimal number with 5
digits, 2 of which are after the decimal point
The domain has a constraint that ensures that the hourly-wage is
greater than or equal to 4.00
The clause constraint value-test is optional; useful to indicate which
constraint an update violated.
JYP
160
Referential Integrity
Ensures that a value that appears in one relation for a given set
of attributes also appears for a certain set of attributes in another
relation.
Example: If “Perryridge” is a branch name appearing in one of the
tuples in the account relation, then there exists a tuple in the branch
relation for branch “Perryridge”.
Formal Definition
Let r1(R1) and r2(R2) be relations with primary keys K1 and K2
respectively.
The subset of R2 is a foreign key referencing K1 in relation r1, if for
every t2 in r2 there must be a tuple t1 in r1 such that t1[K1] = t2[].
Referential integrity constraint: (r2) K1 (r1)
JYP
161
referential Integrity in the E-R Model
Consider relationship set R between entity sets E1 and E2. The
relational schema for R includes the primary keys K1 of E1 and K2
of E2.
Then K1 and K2 form foreign keys on the relational schemas for
E1 and E2 respectively.
Weak entity sets are also a source of referential integrity
constraints. For the relation schema for a weak entity set must
include the primary key of the entity set on which it depends.
JYP
162
Database Modification
The following tests must be made in order to preserve the
following referential integrity constraint:
(r2) K1 (r1)
Insert. If a tuple t2 is inserted into r2, the system must ensure
that there is a tuple t1 in r1 such that t1[K1] = t2[]. That is
t2 [] K1 (r1)
Delete. If a tuple, t1 is deleted from r1, the system must
compute the set of tuples in r2 that reference t1:
= t1[K1] (r2)
If this set is not empty, either the delete command is rejected
as an error, or the tuples that reference t1 must themselves be
deleted (cascading deletions are possible).
JYP
163
Database Modification (Cont.)
Update. There are two cases:
If a tuple t2 is updated in relation r2 and the update
modifies values for foreign key , then a test similar to the
insert case is made. Let t2’ denote the new value of tuple
t2. The system must ensure that
t2’[] K1(r1)
If a tuple t1 is updated in r1, and the update modifies
values for the primary key (K1), then a test similar to the
delete case is made. The system must compute
= t1[K1] (r2)
using the old value of t1 (the value before the update is
applied). If this set is not empty, the update may be
rejected as an error, or the update may be cascaded to
the tuples in the set, or the tuples in the set may be
deleted.
JYP
164
Referential Integrity in SQL
Primary and candidate keys and foreign keys can be specified as
part of the SQL create table statement:
The primary key clause of the create table statement includes a
list of the attributes that comprise the primary key.
The unique key clause of the create table statement includes a list
of the attributes that comprise a candidate key.
The foreign key clause of the create table statement includes both
a list of the attributes that comprise the foreign key and the name of
the relation referenced by the foreign key.
JYP
165
Referential Integrity in SQL – Example
create table customer
(customer-name char(20) not null,
customer-street char(30),
customer-city
char(30),
primary key (customer-name))
create table branch
(branch-name
char(15) not null,
branch-city
char(30),
assets
integer,
primary key (branch-name))
JYP
166
Referential Integrity in SQL – Example (Cont.)
create table account
(branch-name char(15),
account-number
char(10) not null,
balance
integer,
primary key (account-number),
foreign key (branch-name) references branch)
create table depositor
(customer-name
char(20) not null,
account-number
char(10) not null,
primary key (customer-name, account-number),
foreign key (account-number) references account,
foreign key (customer-name) references customer)
JYP
167
Cascading Actions in SQL
create table account
...
foreign key(branch-name) references branch
on delete cascade
on update cascade.
...)
Due to the on delete cascade clauses, if a delete of a tuple in
branch results in referential-integrity constraint violation, the
delete “cascades” to the account relation, deleting the tuple that
refers to the branch that was deleted.
Cascading updates are similar.
JYP
168
Cascading Actions in SQL (Cont.)
If there is a chain of foreign-key dependencies across multiple
relations, with on delete cascade specified for each
dependency, a deletion or update at one end of the chain can
propagate across the entire chain.
If a cascading update to delete causes a constraint violation that
cannot be handled by a further cascading operation, the system
aborts the transaction. As a result, all the changes caused by
the transaction and its cascading actions are undone.
JYP
169
Assertions
An assertion is a predicate expressing a condition that we wish
the database always to satisfy.
An assertion in SQL-92 takes the form
create assertion <assertion-name> check <predicate>
When an assertion is made, the system tests it for validity. This
testing may introduce a significant amount of overhead; hence
assertions should be used with great care.
JYP
170
Assertion Example
The sum of all loan amounts for each branch must be less than
the sum of all account balances at the branch.
create assertion sum-constraint check
(not exists (select * from branch
where (select sum(amount) from loan
where loan.branch-name = branch.branch-name)
>=(select sum(balance) from account
where account.branch-name = branch.branch-name)))
JYP
171
Assertion Example
Every loan has at least one customer who maintains an account
with a minimum balance of $1000.
create assertion balance-constraint check
(not exists (select * from loan
where not exists ( select *
from borrower, depositor, account
where loan.loan-number = borrower.loan-number
and borrower.customer-name = depositor.customer-name
and depositor.account-number = account.account-number
and account.balance >=1000)))
Note if a loan has no customer who maintains an account with a
minimum balance of $1000, the sub-query will evaluate to .
JYP
172
Assertion Example
All borrower of every loan should maintain an account with a
minimum balance of $1000.
create assertion balance-constraint check
(not exists (select * from loan
where exists ( select *
from borrower, depositor, account
where loan.loan-number = borrower.loan-number
and borrower.customer-name = depositor.customer-name
and depositor.account-number = account.account-number
and account.balance <1000))
and not exists ( select * from loan
where not exists ( select *
from borrower, depositor
where borrower.customer-name = depositor.customer-name
and borrower.loan-number = loan.loan-number)))
JYP
173
Triggers
A trigger is a statement that is executed automatically by the
system as a side effect of a modification to the database.
To design a trigger mechanism, we must:
Specify the conditions under which the trigger is to be executed.
Specify the actions to be taken when the trigger executes.
The SQL-92 standard does not include trigger, but many
implementations support triggers.
JYP
174
Trigger Example
Suppose that instead of allowing negative account balances, the
bank deals with overdrafts by
setting the account balance to zero
creating a loan in the amount of the overdraft
giving this loan a loan number identical to the account number of the
overdrawn account
The condition for executing the trigger is an update to the
account relation that results in a negative balance value.
JYP
175
Trigger Example (Cont.)
define trigger overdraft on update of account T
(if new T.balance < 0
then (insert into loan values
(T.branch-name, T.account-number, –new T.balance)
insert into borrower
(select customer-name, account-number
from depositor
where T.account-number = depositor account-number)
update account S
set S.balance = 0
where S.account-number = T.Account-number))
The keyword new used before T.balance indicates that the
value of T.balance after the update should be used; if it is
omitted, the value before the update is used.
JYP
176
Functional Dependencies
Constraints on the set of legal relations.
Require that the value for a certain set of attributes determines
uniquely the value for another set of attributes.
A functional dependency is a generalization of the notion of a
key.
JYP
177
Functional Dependencies (Cont.)
Let R be a relation schema
R R
The functional dependency
holds on R if and only if for any legal relations r(R), whenever
any two tuples t1 and t2 of r agree on the attributes , they
also agree on the attributes . That is,
t1[] = t2 [] t1[] = t2 []
K is a superkey for relation schema R if and only if K R
K is a candidate key for R if and only if
K R, and
There is no K, R
JYP
178
Functional Dependencies (Cont.)
Functional dependencies allow us to express constraints that
cannot be expressed using superkeys. Consider the schema:
Loan-info-schema = (branch-name, loan-number,
customer-name, amount).
We expect this set of functional dependencies to hold:
loan-number amount
loan-number branch-name
but would not expect the following to hold:
loan-number customer-name
JYP
179
Use of Functional Dependencies
We use functional dependencies to:
test relations to see if they are legal under a given set of functional
dependencies. If a relation r is legal under a set F of functional
dependencies, we say that r satisfies F.
specify constraints on the set of legal relations; we say that F holds
on R if all legal relations on R satisfy the set of functional
dependencies F.
Note: A specific instance of a relation schema may satisfy a
functional dependency even if the functional dependency does
not hold on all legal instances. For example, a specific instance
of Loan-schema may, by chance, satisfy loan-number
customer-name.
JYP
180
Topic 6: Transactions
Transaction Concept
Transaction State
Implementation of Atomicity and Durability
Concurrent Executions
Serializability
Recoverability
Implementation of Isolation
Transaction Definition in SQL
Testing for Serializability.
JYP
181
Transaction Concept
A transaction is a unit of program execution that
accesses and possibly updates various data items.
A transaction must see a consistent database.
During transaction execution the database may be
inconsistent.
When the transaction is committed, the database must
be consistent.
Two main issues to deal with:
Failures of various kinds, such as hardware failures and
system crashes
Concurrent execution of multiple transactions
JYP
182
ACID Properties
To preserve integrity of data, the database system must ensure:
Atomicity. Either all operations of the transaction are properly
reflected in the database or none are.
Consistency. Execution of a transaction in isolation preserves
the consistency of the database.
Isolation. Although multiple transactions may execute
concurrently, each transaction must be unaware of other
concurrently executing transactions. Intermediate transaction
results must be hidden from other concurrently executed
transactions. That is, for every pair of transactions Ti and Tj, it
appears to Ti that either Tj finished execution before Ti started,
or Tj started execution after Ti finished.
Durability. After a transaction completes successfully, the
changes it has made to the database persist, even if there are
system failures.
JYP
183
Example of Fund Transfer
Transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A – 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
Consistency requirement — the sum of A and B is unchanged by
the execution of the transaction.
Atomicity requirement — if the transaction fails after step 3 and
before step 6, the system should ensure that its updates are not
reflected in the database, otherwise an inconsistency will result.
JYP
184
Example of Fund Transfer (Cont.)
Durability requirement — once the user has been notified
that the transaction has completed (i.e., the transfer of the
$50 has taken place), the updates to the database by the
transaction must persist despite failures.
Isolation requirement — if between steps 3 and 6, another
transaction is allowed to access the partially updated
database, it will see an inconsistent database
(the sum A + B will be less than it should be).
Can be ensured trivially by running transactions serially,
that is one after the other. However, executing multiple
transactions concurrently has significant benefits, as we
will see.
JYP
185
Transaction State
Active, the initial state; the transaction stays in this
state while it is executing
Partially committed, after the final statement has
been executed.
Failed, after the discovery that normal execution can
no longer proceed.
Aborted, after the transaction has been rolled back
and the database restored to its state prior to the start
of the transaction.
Committed, after successful completion.
JYP
186
Transaction State (Cont.)
JYP
187
Implementation of Atomicity and
Durability
The recovery-management component of a database
system implements the support for atomicity and durability.
JYP
188
Concurrent Executions
Multiple transactions are allowed to run concurrently in the
system. Advantages are:
increased processor and disk utilization, leading to better
transaction throughput: one transaction can be using the CPU
while another is reading from or writing to the disk
reduced average response time for transactions: short
transactions need not wait behind long ones.
Concurrency control schemes – mechanisms to control the
interaction among the concurrent transactions in order to
prevent them from destroying the consistency of the
database
JYP
189
Schedules
Schedules – sequences that indicate the chronological order in
which instructions of concurrent transactions are executed
a schedule for a set of transactions must consist of all instructions of
those transactions
must preserve the order in which the instructions appear in each
individual transaction.
JYP
190
Example Schedules
Let T1 transfer $50 from A to B, and T2 transfer 10% of
the balance from A to B. The following is a serial
schedule (Schedule 1), in which T1 is followed by T2.
T1
read(A)
A := A – 50
write(A)
read(B)
B := B + 50
write(B)
T2
read(A)
temp := A*0.1;
A := A – temp
write(A)
read(B)
B := B + temp
write(B)
JYP
191
Let T1 and T2 be the transactions defined previously.
The following schedule (Schedule 3) is not a serial
schedule, but it is equivalent to Schedule 1.
T1
read(A)
A := A – 50
write(A)
T2
read (A)
temp := A*0.1
A = A – temp
write(A)
read(B)
B := B + 50
write(B)
read(B)
B := B + temp
write(B)
In both Schedule 1 and 3, the sum A + B is preserved.
JYP
192
Example Schedules (Cont.)
The following concurrent schedule (Schedule 4) does
not preserve the value of the sum A + B.
T1
T2
read(A)
A := A – 50
A=90
A=40
read (A)
temp := A*0.1
A = A – temp
write(A)
read(B)
write(A)
read(B)
B := B + 50
write(B)
temp=4
A=36
B=80
A=40
B=130
B := B + temp
write(B)
JYP
B=84
193
Serializability
Basic Assumption – Each transaction preserves database
consistency.
Thus serial execution of a set of transactions preserves
database consistency.
A (possibly concurrent) schedule is serializable if it is
equivalent to a serial schedule. Different forms of schedule
equivalence give rise to the notions of:
1. conflict serializability
2. view serializability
We ignore operations other than read and write instructions,
and we assume that transactions may perform arbitrary
computations on data in local buffers in between reads and
writes. Our simplified schedules consist of only read and
write instructions.
JYP
194
Conflict Serializability
Instructions li and lj of transactions Ti and Tj respectively, conflict
if and only if there exists some item Q accessed by both li and lj,
and at least one of these instructions wrote Q.
1. li = read(Q), lj = read(Q). li and lj don’t conflict.
2. li = read(Q), lj = write(Q). They conflict.
3. li = write(Q), lj = read(Q). They conflict
4. li = write(Q), lj = write(Q). They conflict
Intuitively, a conflict between li and lj forces a (logical) temporal
order between them. If li and lj are consecutive in a schedule
and they do not conflict, their results would remain the same
even if they had been interchanged in the schedule.
JYP
195
Conflict Serializability (Cont.)
If a schedule S can be transformed into a schedule S´
by a series of swaps of non-conflicting instructions, we
say that S and S´ are conflict equivalent.
We say that a schedule S is conflict serializable if it is
conflict equivalent to a serial schedule
Example of a schedule that is not conflict serializable:
T3
read(Q)
T4
write(Q)
write(Q)
We are unable to swap instructions in the above
schedule to obtain either the serial schedule < T3, T4 >,
or the serial schedule < T4, T3 >.
JYP
196
Conflict Serializability (Cont.)
Schedule 3 below can be transformed into Schedule 1,
a serial schedule where T2 follows T1, by series of
swaps of non-conflicting instructions. Therefore
Schedule 3 is conflict serializable.
T1
read(A)
write(A)
T2
read (A)
write(A)
read (B)
write(B)
read (B)
write(B)
JYP
197
View Serializability*
Let S and S´ be two schedules with the same set of
transactions. S and S´ are view equivalent if the following
three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q
in schedule S, then transaction Ti must, in schedule S´, also read
the initial value of Q.
2. For each data item Q if transaction Ti executes read(Q) in
schedule S, and that value was produced by transaction Tj (if
any), then transaction Ti must in schedule S´ also read the value
of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the
final write(Q) operation in schedule S must perform the final
write(Q) operation in schedule S´.
As can be seen, view equivalence is also based purely on reads
and writes alone.
JYP
198
View Serializability *(Cont.)
A schedule S is view serializable if it is view equivalent to a
serial schedule.
Every conflict serializable schedule is also view serializable.
Schedule 9 — a schedule which is view-serializable but not
conflict serializable.
T3
read(Q)
T4
T6
write(Q)
write(Q)
write (Q)
Every view serializable schedule which is not conflict serializable
has bind writes ( no read before write).
JYP
199
Recoverability
Need to address the effect of transaction failures on concurrently
running transactions.
Recoverable schedule — if a transaction Tj reads a data items
previously written by a transaction Ti , the commit operation of Ti
appears before the commit operation of Tj.
The following schedule (Schedule 11) is not recoverable if T9
commits immediately after the read
T8
T9
read(A)
write(A)
read(A)
read(B)
If T8 should abort, T9 would have read (and possibly shown to the
user) an inconsistent database state. Hence database must
ensure that schedules are recoverable.
JYP
200
Recoverability (Cont.)
Cascading rollback – a single transaction failure leads
to a series of transaction rollbacks. Consider the
following schedule where none of the transactions has
yet committed (so the schedule is recoverable)
T10
T11
T12
read(A)
read(B)
write(A)
read(A)
write(A)
read(A)
If T10 fails, T11 and T12 must also be rolled back.
Can lead to the undoing of a significant amount of work
JYP
201
Recoverability (Cont.)
Cascadeless schedules — cascading rollbacks cannot occur; for
each pair of transactions Ti and Tj such that Tj reads a data item
previously written by Ti, the commit operation of Ti appears
before the read operation of Tj.
Every cascadeless schedule is also recoverable
It is desirable to restrict the schedules to those that are
cascadeless
JYP
202
Implementation of Isolation
Schedules must be conflict or view serializable, and
recoverable, for the sake of database consistency, and
preferably cascadeless.
A policy in which only one transaction can execute at a time
generates serial schedules, but provides a poor degree of
concurrency.
Concurrency-control schemes tradeoff between the amount
of concurrency they allow and the amount of overhead that
they incur.
Some schemes allow only conflict-serializable schedules to
be generated, while others allow view-serializable
schedules that are not conflict-serializable.
JYP
203
Transaction Definition in SQL
Data manipulation language must include a construct for
specifying the set of actions that comprise a transaction.
In SQL, a transaction begins implicitly.
A transaction in SQL ends by:
Commit work commits current transaction and begins a new
one.
Rollback work causes current transaction to abort.
Levels of consistency specified by SQL-92:
Serializable — default
Repeatable read
Read committed
Read uncommitted
JYP
204
Levels of Consistency in SQL-92
Serializable — default
Repeatable read —only committed records to be read,
repeated reads of same record must return same value.
However, a transaction may not be serializable – it may find
some records inserted by a transaction but not find others that
have not yet been commited.
Read committed — only committed records can be read, but
successive reads of record may return different (but
committed) values.
Read uncommitted — even uncommitted records may be
read.
Lower degrees of consistency useful for gathering approximate
information about the database, e.g., statistics for query optimizer.
JYP
205
Testing for Serializability
Consider some schedule of a set of transactions T1, T2, ..., Tn
Precedence graph — a direct graph where the vertices are
the transactions (names).
We draw an arc from Ti to Tj if the two transaction conflict,
and Ti accessed the data item on which the conflict arose
earlier.
We may label the arc by the item that was accessed.
Example 1
x
y
JYP
206
Example Schedule (Schedule A)
T1
T2
read(X)
T3
T4
T5
read(Y)
read(Z)
read(V)
read(W)
read(W)
read(Y)
write(Y)
write(Z)
read(U)
read(Y)
write(Y)
read(Z)
write(Z)
read(U)
write(U)
JYP
207
Precedence Graph for Schedule A
T2
T1
T4
T5
T3
JYP
208
Test for Conflict Serializability
A schedule is conflict serializable if and only if its precedence
graph is acyclic.
Cycle-detection algorithms exist which take order n2 time, where
n is the number of vertices in the graph. (Better algorithms take
order n + e where e is the number of edges.)
If precedence graph is acyclic, the serializability order can be
obtained by a topological sorting of the graph. This is a linear
order consistent with the partial order of the graph.
For example, a serializability order for Schedule A would be
T5 T1 T3 T2 T4 .
JYP
209
Test for View Serializability
The precedence graph test for conflict serializability must be
modified to apply to a test for view serializability.
Construct a labeled precedence graph. Look for an acyclic
graph which is derived from the labeled precedence graph.
Schedule is view serializable if and only if such an acyclic
graph can be found.
The problem of looking for such an acyclic graph falls in the
class of NP-complete problems. Thus existence of an efficient
algorithm is unlikely.
However practical algorithms that just check some sufficient
conditions for view serializability can still be used.
JYP
210
Concurrency Control vs. Serializability Tests
Testing a schedule for serializability after it has executed is a
little too late!
Goal – to develop concurrency control protocols that will assure
serializability. They will generally not examine the precedence
graph as it is being created; instead a protocol will impose a
discipline that avoids nonseralizable schedules.
Will study such protocols later.
Tests for serializability help understand why a concurrency
control protocol is correct.
JYP
211
Topic 7: Concurrency Control
Lock-Based Protocols
Multiple Granularity
Deadlock Handling
Insert and Delete Operations
JYP
212
Lock-Based Protocols
A lock is a mechanism to control concurrent access to a data item
Data items can be locked in two modes :
1. exclusive (X) mode. Data item can be both read as well as
written. X-lock is requested using lock-X instruction.
2. shared (S) mode. Data item can only be read. S-lock is
requested using lock-S instruction.
Lock requests are made to concurrency-control manager.
Transaction can proceed only after request is granted.
JYP
213
Lock-Based Protocols (Cont.)
Lock-compatibility matrix
S
X
S
true
false
X
true
false
A transaction may be granted a lock on an item if the requested lock
is compatible with locks already held on the item by other
transactions
The matrix allows any number of transactions to hold shared locks
on an item, but if any transaction holds an exclusive on the item no
other transaction may hold any lock on the item.
JYP
214
Lock-Based Protocols (Cont.)
If a lock cannot be granted, the requesting transaction is made to
wait till all incompatible locks held by other transactions have been
released. The lock is then granted.
JYP
215
Lock-Based Protocols (Cont.)
Example of a transaction performing locking:
T2: lock-S(A);
read (A);
unlock(A);
lock-S(B);
read (B);
unlock(B);
display(A+B)
Locking as above is not sufficient to guarantee serializability — if A
and B get updated in-between the read of A and B, the displayed
sum would be wrong.
A locking protocol is a set of rules followed by all transactions
while requesting and releasing locks. Locking protocols restrict the
set of possible schedules.
JYP
216
Pitfalls of Lock-Based Protocols
Consider the partial schedule
T3
lock-X(B)
read(B)
B:= B-50
write(B)
T4
lock-S(A)
read(A)
lock-S(B)
lock-X(A)
Neither T3 nor T4 can make progress — executing lock-S(B)
causes T4 to wait for T3 to release its lock on B, while executing
lock-X(A) causes T3 to wait for T4 to release its lock on A.
Such a situation is called a deadlock. To handle a deadlock one
of T3 or T4 must be rolled back and its locks released.
JYP
217
Pitfalls of Lock-Based Protocols (Cont.)
The potential for deadlock exists in most locking protocols.
Deadlocks are a necessary evil.
Starvation is also possible if concurrency control manager is
badly designed. For example:
A transaction may be waiting for an X-lock on an item, while a
sequence of other transactions request and are granted an S-lock
on the same item.
The same transaction is repeatedly rolled back due to deadlocks.
Concurrency control manager can be designed to prevent
starvation.
JYP
218
The Two-Phase Locking Protocol
This is a protocol which ensures conflict-serializable schedules.
Phase 1: Growing Phase
transaction may obtain locks
transaction may not release locks
Phase 2: Shrinking Phase
transaction may release locks
transaction may not obtain locks
The protocol assures serializability. It can be proved that the
transactions can be serialized in the order of their lock points
(i.e. the point where a transaction acquired its final lock).
JYP
219
The Two-Phase Locking Protocol (Cont.)
Two-phase locking does not ensure freedom from deadlocks
Cascading roll-back is possible under two-phase locking. To
avoid this, follow a modified protocol called strict two-phase
locking. Here a transaction must hold all its exclusive locks till it
commits/aborts.
Rigorous two-phase locking is even stricter: here all locks are
held till commit/abort. In this protocol transactions can be
serialized in the order in which they commit.
JYP
220
The Two-Phase Locking Protocol (Cont.)
There can be conflict serializable schedules that cannot be
obtained if two-phase locking is used.
e.g.,
T1
T2
lock-S(X)
read (X)
unlock-S(X)
lock-X(X)
write (X)
unlock-X(X)
lock-X(Y)
write (Y)
unlock-X(Y)
JYP
221
The Two-Phase Locking Protocol (Cont.)
However, in the absence of extra information (e.g., ordering of
access to data), two-phase locking is needed for conflict
serializability in the following sense:
Given a transaction Ti that does not follow two-phase locking, we
can find a transaction Tj that uses two-phase locking, and a
schedule for Ti and Tj that is not conflict serializable.
JYP
222
Lock Conversions
Two-phase locking with lock conversions:
– First Phase:
can acquire a lock-S on item
can acquire a lock-X on item
can convert a lock-S to a lock-X (upgrade)
– Second Phase:
can release a lock-S
can release a lock-X
can convert a lock-X to a lock-S (downgrade)
This protocol assures serializability. But still relies on the
programmer to insert the various locking instructions.
JYP
223
Automatic Acquisition of Locks
A transaction Ti issues the standard read/write instruction,
without explicit locking calls.
The operation read(D) is processed as:
if Ti has a lock on D
then
read(D)
else
begin
if necessary wait until no other
transaction has a lock-X on D
grant Ti a lock-S on D;
read(D)
end
JYP
224
Automatic Acquisition of Locks (Cont.)
write(D) is processed as:
if Ti has a lock-X on D
then
write(D)
else
begin
if necessary wait until no other trans. has any lock on D,
if Ti has a lock-S on D
then
upgrade lock on D to lock-X
else
grant Ti a lock-X on D
write(D)
end;
All locks are released after commit or abort
JYP
225
Multiple Granularity
Allow data items to be of various sizes and define a hierarchy of
data granularities, where the small granularities are nested within
larger ones
Can be represented graphically as a tree
when a transaction locks a node in the tree explicitly, it implicitly
locks all the node's descendents in the same mode.
Granularity of locking (level in tree where locking is done):
fine granularity (lower in tree): high concurrency, high locking
overhead
coarse granularity (higher in tree): low locking overhead, low
concurrency
JYP
226
Example of Granularity Hierarchy
The highest level in the example hierarchy is the entire database.
The levels below are of type area, file and record in that order.
JYP
227
Intention Lock Modes
In addition to S and X lock modes, there are three additional lock
modes with multiple granularity:
intention-shared (IS): indicates explicit locking at a lower level of the
tree but only with shared locks.
intention-exclusive (IX): indicates explicit locking at a lower level
with exclusive or shared locks
shared and intention- exclusive (SIX): the subtree rooted by that
node is locked explicitly in shared mode and indicates explicit
locking at a lower level with exclusive-mode locks.
intention locks allow a higher level node to be locked in S or X
mode without having to check all descendent nodes.
JYP
228
Compatibility Matrix with Intention Lock Modes
The compatibility matrix for all lock modes is:
IS
IX
S
S IX
X
IS
IX
S
S IX
X
JYP
229
Multiple Granularity Locking Scheme
Transaction Ti can lock a node Q, using the following rules:
1. The lock compatibility matrix must be observed.
2. The root of the tree must be locked first, and may be locked in
any mode.
3. A node Q can be locked by Ti in IS or S mode only if the parent
of Q is currently locked by Ti in either IX or IS mode.
4. A node Q can be locked by Ti in X, SIX, or IX mode only if the
parent of Q is currently locked by Ti in either IX or SIX mode.
5. Ti can lock a node only if it has not previously unlocked any node
(that is, Ti is two-phase).
6. Ti can unlock a node Q only if none of the children of Q are
currently locked by Ti.
Observe that locks are acquired in root-to-leaf order, whereas they are
released in leaf-to-root order.
JYP
230
Deadlock Handling
Consider the following two transactions:
T1:
write (X)
T2:
write(Y)
write(Y)
write(X)
Schedule with deadlock
T1
lock-X on X
write (X)
T2
lock-X on Y
write (Y)
wait for lock-X on X
wait for lock-X on Y
JYP
231
Deadlock Handling
System is deadlocked if there is a set of transactions such that
every transaction in the set is waiting for another transaction in
the set.
Deadlock prevention protocols ensure that the system will never
enter into a deadlock state. Some prevention strategies :
Require that each transaction locks all its data items before it begins
execution (predeclaration).
Impose partial ordering of all data items and require that a
transaction can lock data items only in the order specified by the
partial order (graph-based protocol).
JYP
232
More Deadlock Prevention Strategies
Following schemes use transaction timestamps for the sake of
deadlock prevention alone.
wait-die scheme — non-preemptive
older transaction may wait for younger one to release data item.
Younger transactions never wait for older ones; they are rolled back
instead.
a transaction may die several times before acquiring needed data
item
wound-wait scheme — preemptive
older transaction wounds (forces rollback) younger transaction
instead of waiting for it. Younger transactions may wait for older
ones.
may be fewer rollbacks than wait-die scheme.
JYP
233
Deadlock prevention (Cont.)
Both in wait-die and in wound-wait schemes, a rolled back
transactions is restarted with its original timestamp. Older
transactions thus have precedence over newer ones, and
starvation is hence avoided.
Timeout-Based Schemes :
a transaction waits for a lock only for a specified amount of time.
After that, the wait times out and the transaction is rolled back.
thus deadlocks are not possible
simple to implement; but starvation is possible. Also difficult to
determine good value of the timeout interval.
JYP
234
Deadlock Detection*
Deadlocks can be described as a wait-for graph, which consists
of a pair G = (V,E),
V is a set of vertices (all the transactions in the system)
E is a set of edges; each element is an ordered pair Ti Tj.
If Ti Tj is in E, then there is a directed edge from Ti to Tj,
implying that Ti is waiting for Tj to release a data item.
When Ti requests a data item currently being held by Tj, then the
edge Ti Tj is inserted in the wait-for graph. This edge is
removed only when Tj is no longer holding a data item needed by
Ti.
The system is in a deadlock state if and only if the wait-for graph
has a cycle. Must invoke a deadlock-detection algorithm
periodically to look for cycles.
JYP
235
Deadlock Detection* (Cont.)
Wait-for graph without a cycle
Wait-for graph with a cycle
JYP
236
Deadlock Recovery*
When deadlock is detected :
Some transaction will have to rolled back (made a victim) to break
deadlock. Select that transaction as victim that will incur minimum
cost.
Rollback -- determine how far to roll back transaction
Total rollback: Abort the transaction and then restart it.
More effective to roll back transaction only as far as necessary to
break deadlock.
Starvation happens if same transaction is always chosen as victim.
Include the number of rollbacks in the cost factor to avoid starvation
JYP
237
Topic 8: Database System Architectures
Centralized Systems
Client--Server Systems
Distributed Systems
JYP
238
Centralized Systems
Run on a single computer system and do not interact with other
computer systems.
General-purpose computer system: one to a few CPUs and a
number of device controllers that are connected through a
common bus that provides access to shared memory.
Single-user system (e.g., personal computer or workstation):
desk-top unit, single user, usually has only one CPU and one or
two hard disks; the OS may support only one user.
Multi-user system: more disks, more memory, multiple CPUs,
and a multi-user OS. Serve a large number of users who are
connected to the system via terminals. Often called server
systems.
JYP
239
Client-Server Systems
Server systems satisfy requests generated at m client systems, whose
general structure is shown below:
Client
Client
Client
….
Client
network
Server
JYP
240
Client-Server Systems (Cont.)
Database functionality can be divided into:
Back-end: manages access structures, query evaluation and
optimization, concurrency control and recovery.
Front-end: consists of tools such as forms, report-writers, and
graphical user interface facilities.
The interface between the front-end and the back-end is through
SQL or through an application program interface.
JYP
241
Client-Server Systems (Cont.)
Advantages of replacing mainframes with networks of
workstations or personal computers connected to back-end
server machines:
better functionality for the cost
flexibility in locating resources and expanding facilities
better user interfaces
easier maintenance
One kind of server systems is transaction servers which are
widely used in relational database systems.
JYP
242
Transaction Servers
Also called query server systems or SQL server systems; clients
send requests to the server system where the transactions are
executed, and results are shipped back to the client.
Requests specified in SQL, and communicated to the server
through a remote procedure call (RPC) mechanism.
Transactional RPC allows many RPC calls to collectively form a
transaction.
Open Database Connectivity (ODBC) is an application program
interface standard from Microsoft for connecting to a server,
sending SQL requests, and receiving results.
JYP
243
Distributed Systems
Data spread over multiple machines (also referred to as sites or
nodes.
Network interconnects the machines
Data shared by users on multiple machines
Differentiate between local and global transactions
A local transaction accesses data in the single site at which the
transaction was initiated.
A global transaction either accesses data in a site different from the
one at which the transaction was initiated or accesses data in
several different sites.
JYP
244
Trade-offs in Distributed Systems
Sharing data – users at one site able to access the data residing
at some other sites.
Autonomy – each site is able to retain a degree of control over
data stored locally.
Higher system availability through redundancy — data can be
replicated at remote sites, and system can function even if a site
fails.
Disadvantage: added complexity required to ensure proper
coordination among sites.
Software development cost.
Greater potential for bugs.
Increased processing overhead.
JYP
245
Topic 9: New Applications
Decision-Support Systems
Data Analysis
Data Mining
Data Warehousing
Spatial and Geographic Databases
Multimedia Databases
Mobility and Personal Databases
Information-Retrieval Systems
JYP
246
Decision Support System
Decision-Support systems are used to make business
decisions often based on data collected by on-line
transaction-processing systems.
Examples of business decisions:
what items to stock?
What insurance premium to change?
Who to send advertisements to?
Examples of data used for making decisions
Retail sales transaction details
Customer profiles (income, age, sex, etc.)
JYP
247
Decision-Support Systems: Overview
Data analysis tasks are simplified by SQL
extensions.
Statistical analysis packages (e.g., S++) can be
interfaced with databases
Statistical analysis is a large field, will not study it here
Data mining seeks to discover knowledge
automatically in the form of statistical rules and
patterns from Large databases.
A data warehouse archives information gathered
from multiple sources, and stores it under a unified
schema, at a single site.
JYP
248
Data Analysis
Aggregate functions summarize large volumes of data
A histogram partitions the values taken by an attribute into
ranges, and computes an aggregate over the values in
each range; cumbersome to use standard SQL to
construct a histogram. Extension proposed by Red Brick:
select percentile, avg (balance)
from account
group by N_tile (balance, 10) as percentile
Here, N_tile(balance,10) divides the values for balance
into 10 consecutive ranges, with an equal number of
values in each range, duplicates are not eliminated.
JYP
249
Data Analysis (Cont.)
Small
Medium
Large
Total
Light
Dark
8
20
35
10
10
5
53
35
Total
28
45
15
88
Cross-tabulation of number by size and color of sample
relation sales with the schema Sales(color, size, number).
JYP
250
Data Analysis (Cont.)
Color
Size
Number
Light
Light
Light
Light
Dark
Dark
Dark
Dark
all
all
all
all
Small
Medium
Large
all
Small
Medium
Large
all
Small
Medium
Large
all
8
35
10
53
20
10
5
35
28
45
15
88
Can represent subtotals in relational form by using the value all
E.g. : obtain (Light, all, 53) and (Dark, all, 35) by aggregating
individual tuples with different values for size for each color.
JYP
251
Data Analysis (Cont.)
Rollup: Moving from finer-granularity data to a coarser
granularity by means of aggregation.
Drill down: Moving from coarser-granularity data to
finer-granularity data.
Proposed extensions to SQL, such as the cube
operation help to support generation of summary data
The following query generates the previous table.
select color, size, sum (number)
from sales
groupby color, size with cube
JYP
252
Data Analysis (Cont.)
Figure shows the combinations of dimensions size, color, price
In general computing cube operation with n groupby columns gives
2n different groupby combinations.
JYP
253
Data Mining
Like knowledge discovery in artificial intelligence data mining
discovers statistical rules and patterns it differs from machine
learning in that it deals with large volumes of data stored
primarily on disk.
Knowledge discovered from a database can be represented by
a set of rules.
e.g.,: “Young women with annual incomes greater than $50,000
are most likely to buy sports cars”
Discover rules using one of two models:
1. The user is involved directly in the process of knowledge
discovery.
2. The system is responsible for automatically discovering
knowledge from the database by detecting patterns and
correlation's in the data.
JYP
254
Knowledge Representation Using Rules
General form of rules X antecedent consequent
X is a list of one or more variables with associated ranges.
The rule transactions T, buys (T, bread) buys(T, milk)
states: if there is a tuple (t1, bread) in the relation buys, there
must also be a tuple (t1, milk) in the relation buys.
Population: Cross-product of the ranges of the variables in the
rule.
In the above example, the set of all transactions.
Support: Measure of what fraction of the population satisfies
both the antecedent and the consequent of the rule.
e.g., 10% of transactions buy bread and milk.
Confidence : Measure of how often the consequent is true when
the antecedent is true.
e.g., 80% of transactions that buy bread also buy milk
JYP
255
Some Classes of Data-Mining Problems
Classification : Finding rules that partition the given data
into disjoint groups (classes) that are relevant for making a
decision
(e.g.,: which of several factors help classify a person’s credit
worthiness).
Associations: Useful to determine associations between
different items
(e.g.,: someone who buys bread is quite likely also to buy
milk).
Sequence correlations: determine correlations between
related sequence data.
(e.g., : when bond rates go up stock prices go down within
two days.)
JYP
256
User-Guided Data Mining
In user-guided data mining, primary responsibility for
discovering rules is with the user.
User may runs tests on the database to verify or refute a
hypothesis. Confidence and support for rules expressing
a hypothesis are derived from the database.
An iterative process of forming and refining rules is used.
Example: Test the hypothesis “People who hold master’s
degrees are the most likely to have an excellent credit
rating.” If confidence of rule is low, may refine it into the
rule:
people P, P.degree = Masters and P.income 75, 000
P.credit = excellent
Data-visualization though graphical representations like
maps, charts, and color-coding, helps detect patterns in
data
JYP
257
Classification Rules
Classification rules help assign new objects to a set of classes.
E.g., given a new automobile insurance applicant, should he or
she be classified as low risk, medium risk or high risk?
Classification rules for above example could use a variety of
knowledge, such as educational level of applicant, salary of
applicant, age of applicant, etc.
Classification rules can be compactly shown as a Classification
tree.
JYP
258
Part of Credit Risk Classification Tree
JYP
259
Discovery of Classification Rules*
Training set: a data sample in which the grouping for each
tuple is already known.
Top down generation of classification tree.
Each internal node of the tree partitions the data into groups
based on the attribute.
The data at a node is not partitioned further if either all (or most)
of the items at the node belong to the same class, or all
attributes have been considered. Such a node is a leaf node.
Otherwise the data at the node is partitioned further by picking
an attribute for partitioning data at the node.
JYP
260
Discovery of Classification Rules *(Cont.)
Consider credit risk example: Suppose degree is chosen
to partition the data at the root. Since degree has a small
number of possible values, one child is created for each
value.
At each child node of the root, further classification is
done if required. Here, partitions are defined by income.
Since income is a continuous attribute, some number of
intervals are chosen, and one child created for each
interval.
Different classification algorithms use different ways of
choosing which attribute to partition on at each node, and
what the intervals, if any, are.
In general, different branches of the tree could grow to
different levels. Different nodes at the same level may use
different partitioning attributes.
JYP
261
Discovery of Association Rules
Example: transactions T, buys (T, bread) buys (T, milk)
In general: notion of transaction , and its itemset, the set of items
contained in the transaction
General form of rule:
transactions T, c(T, i1) and . . . and c(T, in) c(T, i0)
where c(T, ik) denotes that transaction T contains item ik.
Above can be represented as A b where A = {i1, i2, . . . , in}
and b = i0.
Support of rule = number of transactions whose itemsets contain
A {b}
Usually desire rules with strong support, which will involve only
items purchased in a significant percentage of the transactions.
JYP
262
Discovery of Association Rules (Cont.)
Consider all possible sets of relevant items.
For each set find its support (i.e. , how many transactions
purchase all items in the set).
Use sets with sufficiently high support to generate
association rules.
From set A generate the rules A - {b} b for each b A.
Support of each of the rules is support of A.
Confidence of a rule is support of A divided by support of
A - {b}.
JYP
263
Finding Support
Few sets: Determine level of support via a single pass.
A count is maintained for each set, initially set to 0.
When a transaction is fetched, the count is incremented for each
set of items which contained in the itemset of the transaction.
Sets with a high count at the end of the pass correspond to items
with a high degree of association.
Many sets: If memory not enough to hold all counts for all sets
Use multiple passes, considering only some sets in each
pass.
Optimization: Once a set is eliminated because it occurs in too
small a fraction of the transactions, none of its supersets
needs to be considered.
JYP
264
Data Warehousing
A data warehouse is a repository of information gathered
from multiple sources.
JYP
265
Data Warehousing (Cont.)
Provides a single consolidated interface to data
Data stored for an extended period, providing access to
historical data
Data/updates are periodically downloaded from online
transaction processing (OLTP) systems.
Typically, download happens each night.
Data may not be completely up-to-date, but is recent enough for
analysis.
Running large queries at the warehouse ensures that OLTP
systems are not affected by the decision-support workload.
JYP
266
Issues in Building a Warehouse
When and how to gather data.
Source driven: data source initiates data transfer
Destination driven: warehouse initiates data transfer
What schema to use.
Schema integration
Cleaning and conversion of incoming data
What data to summarize.
Raw data may be too large to store on-line
Aggregate values (totals/subtotals) often suffice
Queries on raw data can often be transformed by query optimizer to
use aggregate values
How to propagate updates.
Date at warehouse is a view on source data
Efficient view maintenance techniques required
JYP
267
Spatial and Geographic Databases
Spatial databases store information related to spatial
locations, and support efficient storage, indexing and querying
of spatial data.
Special purpose index structures are important for accessing
spatial data, and for processing spatial join queries.
Design databases (or CAD databases) store design
information about how objects are constructed e.g.: designs of
buildings, aircraft, layouts of integrated-circuits
Geographic databases store geographic information (e.g.,
maps): often called geographic information systems or GIS.
JYP
268
Representation of Geometric Information
Various geometric constructs can be represented in a
database in a normalized fashion.
Represent a line segment by the coordinates of its endpoints.
Approximate a curve by partitioning it into a sequence of
segments; represents each segment as a separate tuple that
also carries with it the identifier of the curve.
Closed polygons: list its vertices in order, starting vertex is the
same as the ending vertex.
Alternative: triangulation — divide polygon into triangles;
give the polygon an identifier with each of its triangles.
JYP
269
Representation of Geometric Constructs (Cont.)
JYP
270
Representation of Geometric Information (Cont.)
Representation of points and line segment in 3-D similar to 2-D,
except that points have an extra z component
Represent arbitrary polyhedra by dividing them into tetrahedrons,
like triangulating polygons.
Alternative: List their faces, each of which is a polygon, along with
an indication of which side of the face is inside the polyhedron.
JYP
271
Design Databases
Represent design components as objects (generally
geometric objects); the connections between the objects
indicate how the design is structured.
Simple two-dimensional objects: points, lines, triangles,
rectangles, polygons.
Complex two-dimensional objects: formed from simple
objects via union, intersection, and difference operations.
Complex three-dimensional objects: formed from simpler
objects such as spheres, cylinders, and cuboids, by union,
intersection, and difference operations.
Wireframe models represent three-dimensional surfaces
as a set of simpler objects.
JYP
272
Representation of Geometric Constructs
(a) Difference of cylinders
(b) Union of cylinders
Design databases also store non-spatial information about
objects (e.g., construction material, color, etc.)
Spatial integrity constraints are important.
E.g., pipes should not intersect, wires should not be too close to
each other, etc.
JYP
273
Geographic Data
Raster data consist of bit maps or pixel maps, in two or more
dimensions.
Example 2-D raster image: satellite image of cloud cover, where
each pixel stores the cloud visibility in a particular area.
Additional dimensions might include the temperature at different
altitudes at different regions, or measurements taken at different
points in time.
Design databases generally do not store raster data.
JYP
274
Geographic Data (Cont.)
Vector data are constructed from basic geometric objects:
points, line segments, triangles, and other polygons in two
dimensions, and cylinders, speheres, cuboids, and other
polyhedrons in three dimensions.
Vector format often used to represent map data.
Roads can be considered as two-dimensional and represented
by lines and curves.
Some features, such as rivers, may be represented either as
complex curves or as complex polygons, depending on whether
their width is relevant.
Features such as regions and lakes can be depicted as
polygons.
JYP
275
Applications of Geographic Data
Examples of geographic data
map data for vehicle navigation
distribution network information for power, telephones, water
supply, and sewage
Vehicle navigation systems store information about roads
and services for the use of drivers:
Spatial data: e.g, road/restaurant/gas-station coordinates
Non-spatial data: e.g., one-way streets, speed limits, traffic
congestion
Global Positioning System or GPS unit - utilizes
information broadcast from GPS satellites to find the
current location of user with an accuracy of tens of
meters.
increasingly used in vehicle navigation systems as well as
utility maintenance applications.
JYP
276
Spatial Queries
Nearness queries request objects that lie near a specified
location.
Nearest neighbor queries, given a point or an object, find
the nearest object that satisfies given conditions.
Region queries deal with spatial regions. e.g., ask for
objects that lie partially or fully inside a specified region.
Queries that compute intersections or unions of regions.
Spatial join of two spatial relations with the location playing
the role of join attribute.
JYP
277
Spatial Queries (Cont.)
Spatial data is typically queried using a graphical query
language; results are also displayed in a graphical
manner.
Graphical interface constitutes the front-end
Extensions of SQL with abstract data types, such as
lines, polygons and bit maps, have been proposed to
interface with back-end.
allows relational databases to store and retrieve spatial
information
queries can mix spatial and nonspatial conditions
extensions also include and allowing spatial conditions
(contains or overlaps).
JYP
278
Multimedia Databases
To provide such database functions as indexing and
consistency, it is desirable to store multimedia data in a
database (rather than storing them outside the database,
in a file system).
The database must handle large object representation.
Similarity-based retrieval must be provided by special
index structures.
Must provide guaranteed steady retrieval rates for
continuos-media data.
JYP
279
Similarity-Based Retrieval
Pictorial data - Two pictures or images that are slightly different as
represented in the database may be considered the same by a
user.
E.g., identify similar designs for registering a new trademark.
Audio data- Speech-based user interfaces allow the user to give a
command or identify a data item by speaking.
E.g., test user input against stored commands.
Handwritten data -- Identify a handwritten data item or command
stored in the database (requires similarity testing).
JYP
280
Continuos-Media Data
Most important types are video and audio data.
Characterized by high data volumes and real-time
information-delivery requirements.
Data must be delivered sufficiently fast that there are no
gaps in the audio or video.
Data must be delivered at a rate that does not cause
overflow of system buffers.
Synchronization among distinct data streams must be
maintained (video of a person speaking must show lips
moving synchronously with the audio).
JYP
281
Multimedia Data Formats
Store and transmit multimedia data in compressed form;JPEG
is the most widely used format for image data.
Encoding each frame of a video using JPEG is wasteful since,
successive frames of a video are often nearly the same.
MPEG standards use commonalties among a sequence of
frames to achieve a greater degree of compression.
MPEG-1 stores a minute of 30-frame-per-second video and
audio in approximately 12.5 MB (compares with 75 MB for
video using only JPEG); quality comparable to VHS video tape.
MPEG-2 designed for digital broadcast systems and DVD;
negligible loss of video quality. Compresses 1 minute of audiovideo to approximately 17 MB.
MPEG-4
JYP
282
Video Servers
Current video-on-demand servers are based on file systems;
existing database systems do not meet real-time response
requirements.
Multimedia data are stored on several disks (RAID
configuration), or on tertiary storage for less frequently accessed
data.
Head-end terminals - used to view multimedia data; PCs or TVs
attached to a small, inexpensive computer called a set-top box.
Network - Transmission of multimedia data from a server to
multiple head-end terminals requires a high-capacity network,
such as asynchronous-transfer-mode (ATM) network.
JYP
283
Mobility and Personal Databases
A mobile computing environment consists of mobile
computers, referred to as mobile hosts, and a wired
network of computers.
Mobile host may be able to communicate with wired
network through a wireless digital communication network.
A model for mobile communication
Mobile hosts communicate to the wired network via
computers referred to as mobile support stations.
Each mobile support station manages those mobile hosts
within its cell.
When mobile hosts move between cells, there is a handoff of
control from one mobile support station to another.
JYP
284
Database Issues in Mobile Computing
New issues for query optimization.
energy (battery power) is a scarce resource and its usage
must be minimized
mobile user’s locations may be a parameter of the query.
Users may need to be able to perform database updates
even while the mobile computer is disconnected.
e.g., mobile salesman records sale of products on (local
copy of) database.
Can result in conflicts detected on reconnection, which may
need to be resolved manually.
JYP
285
Disconnectivity and Consistency
A mobile host may remain in operation during periods of
disconnection.
Problems created if the user of the mobile host issues
queries and updates on data that resides or is cached
locally:
Recoverability: Updates entered on a disconnected machine
may be lost if the mobile host fails. Since the mobile host
represents a single point of failure, stable storage cannot be
simulated well.
Consistency : Cached data may become out of date, but the
mobile host cannot discover this until it is reconnected.
JYP
286
Information Retrieval Systems
Information retrieval (IR) systems use a simpler data
model than database systems, but provide more
powerful querying capabilities within the restricted
model.
Queries attempt to locate documents that are of interest
by specifying, for example, sets of keywords.
e.g., find documents containing the words “database
systems”
Information retrieval systems order answers based on
their estimated relevance.
e.g., user may really only want documents about
database systems, but the system may retrieve all
documents that mention the phrase “database systems”.
Documents may be ordered by, for example, how many
times the phrase appears in the document.
JYP
287
Differences From Database Systems
Information retrieval systems, unlike traditional database
systems, handle:
Unstructured documents
Searching using keywords and relevance ranking
Most information retrieval systems do not handle:
High update rates
Concurrency control
Data structured using more complex data models (e.g., relational
or object oriented data models)
Complex queries written in, e.g., SQL
JYP
288
Topic 10: XML
XML: Extensible Markup Language
Defined by the WWW Consortium (W3C)
Originally intended as a document markup language not a
database language
Documents have tags giving extra information about sections of the
document
E.g. <title> XML </title> <slide> Introduction …</slide>
Derived from SGML (Standard Generalized Markup Language), but
simpler to use than SGML
Extensible, unlike HTML
Users can add new tags, and separately specify how the tag should
be handled for display
Goal was (is?) to replace HTML as the language for publishing
documents on the Web
JYP
289
XML Introduction (Cont.)
The ability to specify new tags, and to create nested tag structures
made XML a great way to exchange data, not just documents.
Much of the use of XML has been in data exchange applications, not as a
replacement for HTML
Tags make data (relatively) self-documenting
E.g.
<bank>
<account>
<account-number> A-101 </account-number>
<branch-name>
Downtown </branch-name>
<balance>
500
</balance>
</account>
<depositor>
<account-number> A-101 </account-number>
<customer-name> Johnson </customer-name>
</depositor>
</bank>
JYP
290
XML: Motivation
Data interchange is critical in today’s networked world
Examples:
Banking: funds transfer
Order processing (especially inter-company orders)
Scientific data
– Chemistry: ChemML, …
– Genetics:
BSML (Bio-Sequence Markup Language), …
Paper flow of information between organizations is being replaced
by electronic flow of information
Each application area has its own set of standards for
representing information
XML has become the basis for all new generation data
interchange formats
JYP
291
XML Motivation (Cont.)
Earlier generation formats were based on plain text with line
headers indicating the meaning of fields
Similar in concept to email headers
Does not allow for nested structures, no standard “type” language
Tied too closely to low level document structure (lines, spaces, etc)
Each XML based standard defines what are valid elements, using
XML type specification languages to specify the syntax
DTD (Document Type Descriptors)
XML Schema
Plus textual descriptions of the semantics
XML allows new tags to be defined as required
However, this may be constrained by DTDs
A wide variety of tools is available for parsing, browsing and
querying XML documents/data
JYP
292
Structure of XML Data
Tag: label for a section of data
Element: section of data beginning with <tagname> and ending
with matching </tagname>
Elements must be properly nested
Proper nesting
<account> … <balance> …. </balance> </account>
Improper nesting
<account> … <balance> …. </account> </balance>
Formally: every start tag must have a unique matching end tag, that
is in the context of the same parent element.
Every document must have a single top-level element
JYP
293
Example of Nested Elements
<bank-1>
<customer>
<customer-name> Hayes </customer-name>
<customer-street> Main </customer-street>
<customer-city> Harrison </customer-city>
<account>
<account-number> A-102 </account-number>
<branch-name>
Perryridge </branch-name>
<balance>
400 </balance>
</account>
<account>
…
</account>
</customer>
.
.
</bank-1>
JYP
294
Motivation for Nesting
Nesting of data is useful in data transfer
Example: elements representing customer-id, customer name, and
address nested within an order element
Nesting is not supported, or discouraged, in relational databases
With multiple orders, customer name and address are stored
redundantly
normalization replaces nested structures in each order by foreign key
into table storing customer name and address information
Nesting is supported in object-relational databases
But nesting is appropriate when transferring data
External application does not have direct access to data referenced
by a foreign key
JYP
295
Structure of XML Data (Cont.)
Mixture of text with sub-elements is legal in XML.
Example:
<account>
This account is seldom used any more.
<account-number> A-102</account-number>
<branch-name> Perryridge</branch-name>
<balance>400 </balance>
</account>
Useful for document markup, but discouraged for data
representation
JYP
296
Attributes
Elements can have attributes
<account acct-type = “checking” >
<account-number> A-102 </account-number>
<branch-name> Perryridge </branch-name>
<balance> 400 </balance>
</account>
Attributes are specified by name=value pairs inside the starting
tag of an element
An element may have several attributes, but each attribute name
can only occur once
<account acct-type = “checking” monthly-fee=“5”>
JYP
297
Attributes Vs. Subelements
Distinction between subelement and attribute
In the context of documents, attributes are part of markup, while
subelement contents are part of the basic document contents
In the context of data representation, the difference is unclear and
may be confusing
Same information can be represented in two ways
– <account account-number = “A-101”> …. </account>
– <account>
<account-number>A-101</account-number> …
</account>
Suggestion: use attributes for identifiers of elements, and use
subelements for contents
JYP
298
More on XML Syntax
Elements without subelements or text content can be abbreviated
by ending the start tag with a /> and deleting the end tag
<account number=“A-101” branch=“Perryridge” balance=“200 />
To store string data that may contain tags, without the tags being
interpreted as subelements, use CDATA as below
<![CDATA[<account> … </account>]]>
Here, <account> and </account> are treated as just strings
JYP
299
Namespaces
XML data has to be exchanged between organizations
Same tag name may have different meaning in different
organizations, causing confusion on exchanged documents
Specifying a unique string as an element name avoids confusion
Better solution: use unique-name:element-name
Avoid using long unique names all over document by using XML
Namespaces
<bank Xmlns:FB=‘http://www.FirstBank.com’>
…
<FB:branch>
<FB:branchname>Downtown</FB:branchname>
<FB:branchcity> Brooklyn </FB:branchcity>
</FB:branch>
…
</bank>
JYP
300
XML Document Schema
Database schemas constrain what information can be stored,
and the data types of stored values
XML documents are not required to have an associated schema
However, schemas are very important for XML data exchange
Otherwise, a site cannot automatically interpret data received from
another site
Two mechanisms for specifying XML schema
Document Type Definition (DTD)
Widely used
XML Schema
Newer, increasing use
JYP
301
Document Type Definition (DTD)
The type of an XML document can be specified using a DTD
DTD constraints structure of XML data
What elements can occur
What attributes can/must an element have
What subelements can/must occur inside each element, and how
many times.
DTD does not constrain data types
All values represented as strings in XML
DTD syntax
<!ELEMENT element (subelements-specification) >
<!ATTLIST element (attributes) >
JYP
302
Element Specification in DTD
Subelements can be specified as
names of elements, or
#PCDATA (parsed character data), i.e., character strings
EMPTY (no subelements) or ANY (anything can be a subelement)
Example
<! ELEMENT depositor (customer-name account-number)>
<! ELEMENT customer-name (#PCDATA)>
<! ELEMENT account-number (#PCDATA)>
Subelement specification may have regular expressions
<!ELEMENT bank ( ( account | customer | depositor)+)>
Notation:
– “|” - alternatives
– “+” - 1 or more occurrences
– “*” - 0 or more occurrences
JYP
303
Bank DTD
<!DOCTYPE bank [
<!ELEMENT bank ( ( account | customer | depositor)+)>
<!ELEMENT account (account-number branch-name balance)>
<! ELEMENT customer(customer-name customer-street
customer-city)>
<! ELEMENT depositor (customer-name account-number)>
<! ELEMENT account-number (#PCDATA)>
<! ELEMENT branch-name (#PCDATA)>
<! ELEMENT balance(#PCDATA)>
<! ELEMENT customer-name(#PCDATA)>
<! ELEMENT customer-street(#PCDATA)>
<! ELEMENT customer-city(#PCDATA)>
]>
JYP
304
Attribute Specification in DTD
Attribute specification : for each attribute
Name
Type of attribute
CDATA
ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)
– more on this later
Whether
mandatory (#REQUIRED)
has a default value (value),
or neither (#IMPLIED)
Examples
<!ATTLIST account acct-type CDATA “checking”>
<!ATTLIST customer
customer-id ID
# REQUIRED
accounts
IDREFS # REQUIRED >
JYP
305
IDs and IDREFs
An element can have at most one attribute of type ID
The ID attribute value of each element in an XML document must
be distinct
Thus the ID attribute value is an object identifier
An attribute of type IDREF must contain the ID value of an
element in the same document
An attribute of type IDREFS contains a set of (0 or more) ID
values. Each ID value must contain the ID value of an element
in the same document
JYP
306
Bank DTD with Attributes
Bank DTD with ID and IDREF attribute types.
<!DOCTYPE bank-2[
<!ELEMENT account (branch, balance)>
<!ATTLIST account
account-number ID
# REQUIRED
owners
IDREFS # REQUIRED>
<!ELEMENT customer(customer-name, customer-street,
customer-city)>
<!ATTLIST customer
customer-id
ID
# REQUIRED
accounts
IDREFS # REQUIRED>
… declarations for branch, balance, customer-name,
customer-street and customer-city
]>
JYP
307
XML data with ID and IDREF attributes
<bank-2>
<account account-number=“A-401” owners=“C100 C102”>
<branch-name> Downtown </branch-name>
<balance>
500 </balance>
</account>
<customer customer-id=“C100” accounts=“A-401”>
<customer-name>Joe
</customer-name>
<customer-street> Monroe </customer-street>
<customer-city> Madison</customer-city>
</customer>
<customer customer-id=“C102” accounts=“A-401 A-402”>
<customer-name> Mary </customer-name>
<customer-street> Erin
</customer-street>
<customer-city> Newark </customer-city>
</customer>
</bank-2>
JYP
308
Limitations of DTDs
No typing of text elements and attributes
All values are strings, no integers, reals, etc.
Difficult to specify unordered sets of subelements
Order is usually irrelevant in databases
(A | B)* allows specification of an unordered set, but
Cannot ensure that each of A and B occurs only once
IDs and IDREFs are untyped
The owners attribute of an account may contain a reference to
another account, which is meaningless
owners attribute should ideally be constrained to refer to
customer elements
JYP
309
XML Schema
XML Schema is a more sophisticated schema language which
addresses the drawbacks of DTDs. Supports
Typing of values
E.g. integer, string, etc
Also, constraints on min/max values
User defined types
Is itself specified in XML syntax, unlike DTDs
More standard representation, but verbose
Is integrated with namespaces
Many more features
List types, uniqueness and foreign key constraints, inheritance ..
BUT: significantly more complicated than DTDs, not yet widely
used.
JYP
310
XML Schema Version of Bank DTD
<xsd:schema xmlns:xsd=http://www.w3.org/2001/XMLSchema>
<xsd:element name=“bank” type=“BankType”/>
<xsd:element name=“account”>
<xsd:complexType>
<xsd:sequence>
<xsd:element name=“account-number” type=“xsd:string”/>
<xsd:element name=“branch-name”
type=“xsd:string”/>
<xsd:element name=“balance”
type=“xsd:decimal”/>
</xsd:squence>
</xsd:complexType>
</xsd:element>
….. definitions of customer and depositor ….
<xsd:complexType name=“BankType”>
<xsd:squence>
<xsd:element ref=“account” minOccurs=“0” maxOccurs=“unbounded”/>
<xsd:element ref=“customer” minOccurs=“0” maxOccurs=“unbounded”/>
<xsd:element ref=“depositor” minOccurs=“0” maxOccurs=“unbounded”/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
JYP
311
Querying and Transforming XML Data
Translation of information from one XML schema to another
Querying on XML data
Above two are closely related, and handled by the same tools
Standard XML querying/translation languages
XPath
Simple language consisting of path expressions
XQuery
An XML query language with a rich set of features
Wide variety of other languages have been proposed, and some
served as basis for the Xquery standard
XML-QL, Quilt, XQL, …
JYP
312
Tree Model of XML Data
Query and transformation languages are based on a tree model of
XML data
An XML document is modeled as a tree, with nodes corresponding
to elements and attributes
Element nodes have children nodes, which can be attributes or
subelements
Text in an element is modeled as a text node child of the element
Children of a node are ordered according to their order in the XML
document
Element and attribute nodes (except for the root node) have a single
parent, which is an element node
The root node has a single child, which is the root element of the
document
We use the terminology of nodes, children, parent, siblings,
ancestor, descendant, etc., which should be interpreted in the above
tree model of XML data.
JYP
313
XPath
XPath is used to address (select) parts of documents using
path expressions
A path expression is a sequence of steps separated by “/”
Think of file names in a directory hierarchy
Result of path expression: set of values that along with their
containing elements/attributes match the specified path
E.g.
/bank-2/customer/customer-name evaluated on the
bank-2 data we saw earlier returns
<customer-name>Joe</customer-name>
<customer-name>Mary</customer-name>
E.g.
/bank-2/customer/customer-name/text( )
returns the same names, but without the enclosing tags
JYP
314
XPath (Cont.)
The initial “/” denotes root of the document (above the top-level tag)
Path expressions are evaluated left to right
Each step operates on the set of instances produced by the previous step
Selection predicates may follow any step in a path, in [ ]
E.g.
/bank-2/account[balance > 400]
returns account elements with a balance value greater than 400
/bank-2/account[balance] returns account elements containing a
balance subelement
Attributes are accessed using “@”
E.g. /bank-2/account[balance > 400]/@account-number
returns the account numbers of those accounts with balance > 400
IDREF attributes are not dereferenced automatically (more on this later)
JYP
315
Functions in XPath
XPath provides several functions
The function count() at the end of a path counts the number of
elements in the set generated by the path
E.g. /bank-2/account[customer/count() > 2]
– Returns accounts with > 2 customers
Also function for testing position (1, 2, ..) of node w.r.t. siblings
Boolean connectives and and or and function not() can be used
in predicates
IDREFs can be referenced using function id()
id() can also be applied to sets of references such as IDREFS and
even to strings containing multiple references separated by blanks
E.g. /bank-2/account/id(@owner)
returns all customers referred to from the owners attribute
of account elements.
JYP
316
More XPath Features
Operator “|” used to implement union
E.g. /bank-2/account/id(@owner) | /bank-2/loan/id(@borrower)
gives customers with either accounts or loans
However, “|” cannot be nested inside other operators.
“//” can be used to skip multiple levels of nodes
E.g. /bank-2//customer-name
finds any customer-name element anywhere under the /bank-2 element,
regardless of the element in which it is contained.
A step in the path can go to:
parents, siblings, ancestors and descendants
of the nodes generated by the previous step, not just to the children
“//”, described above, is a short from for specifying “all descendants”
“..” specifies the parent.
We omit further details,
JYP
317
XQuery
XQuery is a general purpose query language for XML data
Currently being standardized by the World Wide Web Consortium
(W3C)
The textbook description is based on a March 2001 draft of the standard.
The final version may differ, but major features likely to stay unchanged.
Alpha version of XQuery engine available free from Microsoft
XQuery is derived from the Quilt query language, which itself borrows
from SQL, XQL and XML-QL
XQuery uses a
for … let … where .. result …
syntax
for
SQL from
where SQL where
result SQL select
let allows temporary variables, and has no equivalent in SQL
JYP
318
FLWR Syntax in XQuery
For clause uses XPath expressions, and variable in for clause
ranges over values in the set returned by XPath
Simple FLWR expression in XQuery
find all accounts with balance > 400, with each result enclosed in
an <account-number> .. </account-number> tag
for
$x in /bank-2/account
let
$acctno := $x/@account-number
where $x/balance > 400
return <account-number> $acctno </account-number>
Let clause not really needed in this query, and selection can be done
In XPath. Query can be written as:
for $x in /bank-2/account[balance>400]
return <account-number> $x/@account-number
</account-number>
JYP
319
Path Expressions and Functions
Path expressions are used to bind variables in the for clause, but
can also be used in other places
E.g. path expressions can be used in let clause, to bind variables to
results of path expressions
The function distinct( ) can be used to removed duplicates in
path expression results
The function document(name) returns root of named document
E.g. document(“bank-2.xml”)/bank-2/account
Aggregate functions such as sum( ) and count( ) can be applied
to path expression results
XQuery does not support group by, but the same effect can be
got by nested queries, with nested FLWR expressions within a
result clause
More on nested queries later
JYP
320
Joins
Joins are specified in a manner very similar to SQL
for $a in /bank/account,
$c in /bank/customer,
$d in /bank/depositor
where $a/account-number = $d/account-number
and $c/customer-name = $d/customer-name
return <cust-acct> $c $a </cust-acct>
The same query can be expressed with the selections specified
as XPath selections:
for $a in /bank/account
$c in /bank/customer
$d in /bank/depositor[
account-number = $a/account-number and
customer-name = $c/customer-name]
return <cust-acct> $c $a</cust-acct>
JYP
321
Changing Nesting Structure
The following query converts data from the flat structure for bank
information into the nested structure used in bank-1
<bank-1>
for $c in /bank/customer
return
<customer>
$c/*
for $d in /bank/depositor[customer-name = $c/customer-name],
$a in /bank/account[account-number=$d/account-number]
return $a
</customer>
</bank-1>
$c/* denotes all the children of the node to which $c is bound, without the
enclosing top-level tag
Exercise for reader: write a nested query to find sum of account
balances, grouped by branch.
JYP
322
XQuery Path Expressions
$c/text() gives text content of an element without any
subelements/tags
XQuery path expressions support the “–>” operator for
dereferencing IDREFs
Equivalent to the id( ) function of XPath, but simpler to use
Can be applied to a set of IDREFs to get a set of results
June 2001 version of standard has changed “–>” to “=>”
JYP
323
Sorting in XQuery
Sortby clause can be used at the end of any expression. E.g. to
return customers sorted by name
for $c in /bank/customer
return <customer> $c/* </customer> sortby(name)
Can sort at multiple levels of nesting (sort by customer-name, and by
account-number within each customer)
<bank-1>
for $c in /bank/customer
return
<customer>
$c/*
for $d in /bank/depositor[customer-name=$c/customer-name],
$a in /bank/account[account-number=$d/account-number]
return <account> $a/* </account> sortby(account-number)
</customer> sortby(customer-name)
</bank-1>
JYP
324
Functions and Other XQuery Features
User defined functions with the type system of XMLSchema
function balances(xsd:string $c) returns list(xsd:numeric) {
for $d in /bank/depositor[customer-name = $c],
$a in /bank/account[account-number=$d/account-number]
return $a/balance
}
Types are optional for function parameters and return values
Universal and existential quantification in where clause predicates
some $e in path satisfies P
every $e in path satisfies P
XQuery also supports If-then-else clauses
JYP
325
Application Program Interface
There are two standard application program interfaces to XML
data:
SAX (Simple API for XML)
Based on parser model, user provides event handlers for
parsing events
– E.g. start of element, end of element
– Not suitable for database applications
DOM (Document Object Model)
XML data is parsed into a tree representation
Variety of functions provided for traversing the DOM tree
E.g.: Java DOM API provides Node class with methods
getParentNode( ), getFirstChild( ), getNextSibling( )
getAttribute( ), getData( ) (for text node)
getElementsByTagName( ), …
Also provides functions for updating DOM tree
JYP
326
The end of the course
Thank you for your cooperation!
JYP
327