IM_FA16-08-DataIndependenceRMx

Download Report

Transcript IM_FA16-08-DataIndependenceRMx

INF 385T: Information Modeling
Data Independence
and
The Relational Model
Slides for October 12
Karen Wickett
[email protected]
School of Information
University of Texas at Austin
Fall 2016
1
Agenda
n Data Independence
n The Relational Model
2
Data Independence
3
Data Independence Problems
• Imagine a library
• Where books are shelved alphabetically by title and assigned
labels based on their position on the shelf.
• So the third book from the left on the fifth shelf on the first
floor is labeled “3:5:1”.
• See any problems?
• What do libraries actually do?
4
Data dependence problems
•
Suppose all the programs accessing a database use:
•
the server name, directory path, file name, record location (LF/CR count) and
byte offset to find a particular field: say, ISBN.
•
You can’t change any of those things without having change all the programs that
access the database locating methods based on those features.
•
You can’t increase the length of a field to accommodate longer ISBNs, or add a
field, or reorganize the file for optimized processing, etc.
•
There are many variants of this sort of dependency.
See Codd’s enumeration at the beginning of the 1970 article.
•
In addition, consider how much unnecessary detail an application programmer needs
to know, how this makes development slow, and maintenance expensive.
•
Improving data independence improves other aspects of software engineering.
5
What’s the solution?
• Build systems so that the people trying to access data do not need to be aware
of the details of how data is physically stored.
• Then data can be updated (added to, or shuffled around) without disrupting
how information is retrieved.
• This is called data independence.
• Information retrieval is independent from the physical storage of data.
• It is achieved by defining levels of abstraction.
• Information is accessed through the mappings between the levels.
• Making a change at one level only requires updating the mappings.
6
The overarching importance of data independence
• Data independence is the most important concept in information
modeling!
• Failures to implement data independence result in inefficiencies, low
functionality, inflexibility, non-interoperability.
• And those failures happen every day, at great cost in money, time, loss of
effectiveness
• and social costs too: lack of accessibility, marginalization of cultures and
classes.
• If you become a modeler, you, or your employees, subcontractors or
colleagues, will fail at this over and over.
• And not because it is so hard, but because it takes a little more time at the
beginning, and a little more thought.
• How can we ensure data independence?
• With abstraction. [sometimes: indirection].
7
ANSI/SPARC Three level architecture [SKS] c. 1975
In the 1970s a working group within ANSI/SPARC set about to develop an architecture
for databases that would help support data independence. Part of the result was the
identification of three “levels”, now known as the ANSI/SPARC “three level model”
or “three level architecture”. These are “levels of abstraction”.
8
Levels of Abstraction [SKS]
• Physical level: describes how a record (e.g., customer) is stored.
• i.e. how it is “written to disk”
• i.e. how it is inscribed in physical media
(patterned matter or energy)
• Logical level: describes data stored in database, and the relationships among
the data.
e.g.
type customer = record
customer_id : string;
customer_name : string;
customer_street : string;
customer_city : integer;
end
• View level: describes how information is presented to users.
• A way to hide:
(a) details of data types and
(b) information (such as an employee’s salary) for security purposes.
9
ANSI/SPARC Three level architecture [EN]
View level
Logical level
Physical level
10
Data Independence [EN]
• Data independence is the ability to modify a schema definition in one level
without affecting a schema definition in the next higher level.
• The interfaces between the various levels and components should be
defined so that changes in some parts do not seriously influence others.
• Logical data independence is the capacity to change the conceptual
[logical] schema without having to change the external schemas or
application programs.
• Physical Data Independence: is the capacity to change the internal
[physical] schema without having to change the conceptual schema.
— “application programs do not depend on the physical schema”
When a schema at a lower level is changed, only the mappings between this
schema and higher-level schemas need to be changed in a DBMS that fully
supports data independence. The higher-level schemas themselves are
unchanged.
Hence, the application programs need not be changed since they refer to the
external schemas; but not to the logical or internal schemas.
11
The Relational Model
• The relational model, a model for schemas at the logical level of the
ANSI/SPARC architecture, was aimed directly at the problems of data
independence.
• Informally: the relational model represents information as tables of values.
• More formally: information is represented as a set of tuples, a relation.
• “The model uses the mathematical concept of a relation…as its basic building
block, and has its theoretical basis in set theory and first order predicate logic”
[EN]
• Due to its simplicity, its mathematical foundation, and its effectiveness this
theory received immediate attention from both practitioners and researchers.
12
The Mathematical definition of relation
Cartesian Product
•
Let A1, A2, … An be sets.
Their Cartesian product, denoted by A1 x A2 x … x An,
is the set of all tuples <a1, a2, … an> such that a1 A1, a2  A2, … an An.
[ Or: A1x A2 x … x An = {(a1, a2, …, an) | ai  Ai for i = 1, 2, …, n} ]
n-ary Relation
•
Let A1, A2, … An be sets.
An n-ary relation on these sets is a subset of A1 x A2, … x An
The sets A1, A2, …An are the domains of the relation; n is called its degree.
–
“… the definition on which the theory of databases rests.” (Rosen).
13
A relation is a subset of a Cartesian product
Consider a tabular format for representing information about textbooks, with columns for
book name, publisher name, date of the first edition, and date adopted.
The set of all possible rows in this format would be the Cartesian product of the
possibilities for each column:
strings x strings x dates x dates.
Where strings is the set of all strings and dates the set of all dates.
This is a (very large) set of 4-tuples.
Any particular instance (actual table of values) will be a relation, a set of 4-tuples which
is a subset of the above Cartesian product.
Tablestate42  strings x strings x dates x dates
For instance…
Moby Dick |
Hunger
|
…
Random
Bantam
|
|
1/5/1859 | 1/4/1999
5/6/1939 | 2/3/1999
14
Intuitively…
• The relational approach represents information as a table
• … as rows and columns of values.
• The rows are considered to be without order.
• The columns are considered ordered
•
[In an alternative conceptualization columns are without order, but
uniquely identified by a column name.]
• Each row typically corresponds to an entity, each column to a
property, and the values are specific instances of that property.
• A table is a collection of rows (of equal length).
• A relational database is a collection of tables.
15
A Table [EN]
16
A Relation
{<
<
<
<
<
>,
>,
>,
>,
>, }
17
The attributes and tuples of a relation STUDENT [EN]
Same table, different order of tuples…
18
Attributes and Roles
Tablestate42  strings x strings x dates x dates
Where strings is the set of all strings and dates the set of all dates
For instance…
Moby Dick |
Hunger
|
…
Random
Bantam
|
|
1/5/1859 | 1/4/1999
5/6/1939 | 2/3/1999
Missing from the Cartesian Product definition of the relation is the meaning of the different
columns.
Although the 3rd and 4th elements above have the same domain, they mean different
things.
So we distinguish the roles of each domain occurrence.
We do this with attributes, which are the names of roles.
Here: title, publisher, pubdate, adoptdate.
We define a function dom(X) that takes attributes as values and returns their domains, and
then re-represent our relation like this [as per EN] …
Tablestate42  { dom(title) x dom(publisher) x dom(pubdate) x dom(adoptdate) }
19
Schemas
• A relational schema R, denoted by R(A1, A2, … , An ), is made of up a
relational schema name R and a list of attibutes A1, A2, … An ,
e.g. BOOK(isbn, author, title, publication_date)
• Each attribute Ai is the name of a role played in the relation schema R by
some domain D of atomic values.
• The domain of Ai is denoted by dom(Ai).
• R is called the name of the relational schema
• or sometimes the name of the relation.
20
Relations, Relation States, Relation Schemas
E&N’s general definition of a relation
A relation r(R) is a mathematical relation of degree n on the domains dom(A1), dom(A2,),
…, dom(An) which is a subset of the Cartesian product of the domains that define R
•
r(R)  (dom(A1) x dom(A2) x …. x dom(An) )
A RM specific formulation in terms of schemas, with additional terminology and concepts
•
r(R), a relation (or relation state*) r of the relation schema** R(A1, A2, … An)
•
is a set of n-tuples r = {t1, t2, … tn},
—where each n-tuple is an ordered list of n values t = < v1, v2, … vn>,
such that each value vi is an element of dom(Ai), or is NULL.
Terminology:
The ith value in n-tuple t, which corresponds to the attribute Ai
is referred to as t[Ai] or t[i].
*Or relation instance.
**Or relation (yuck)
21
Alternatives: attribute/value pairs
• There are actually several equivalent ways to define a relation.
• A particularly important one specifies a relation in terms of ordered pairs,
where the first element of each pair is an attribute and the second is the value
of the attribute, and the set includes a pair for every attribute in the schema...
• Specifically the relation is a set of sets t1, t2 … tn, each ti consisting of a
number of ordered pairs, where the the first element of the pair is an attribute
Aj and the second element vj is a value from the domain of that attribute.
• {{<title, “Moby Dick”>, <Author, Melville>, <Language, English>},
{<title, “Tao Te Ching”>, <Author, “LaoTzu”>, <Language, Chinese>},
{<title, “Ramayana”>, <Author, Valmiki>, <Language, Chinese>}, … }
22
Alternatives: triples
An additional approach is to define a relation as a set of triples where the first
element is a primary key, and the second and third are defined as
attribute/value pairs.
{
<book42, title, “Moby Dick”>, < book42, Author, Melville>, < book42, Language, English>,
< book43, title, “Tao Te Ching”>, < book43, Author, LaoTzu>, < book43, Language, Chinese>,
< book44, title, “Ramayana”>, < book44, Author, Valmiki>, < book44, Language, Chinese>
… }
This is essentially the technique used in the W3C semantic web language,
RDF.
23
Connecting back
•
In predicate logic a relation is a predicate of two or more places: father_of(x,y),
larger_than(x,y), between(x,y,z), gives(x,y,z).
•
For every relation in this PL sense there is a set of n-tuples of which that relation is
true, that “satisfy” that relation.
•
(Remember that in PL interpretations there is a function that assigns to every nplace relation the set of n-tuples satisfying that relation)
•
This set of n-tuples we sometimes say is the relation.
•
And sometimes we say that it is the extension of the relation.
•
And sometimes we say that there are two senses of relation,
•
Intensional: the property “father_of”; and
•
Extensional: {<Smith, Smith Jr.>, <Jones, Jones Jr>, … }
24
Relation, Intension, Extension
•
The relation is formed over the cartesian product of the sets; each set has values from a
domain; that domain is used in a specific role which is conveyed by the attribute name.
•
For example, attribute Cust-name is defined over the domain of strings of 25 characters.
The role these strings play in the CUSTOMER relation is that of the name of customers.
•
Formally,
Given R(A1, A2, …, An)
r(R)  dom (A1) x dom (A2) x … x dom(An)
—
—
—
—
•
R: schema of the relation
r of R: a specific “value”, “state”, “instance”, or “population” of R.
R is also called the intension of a relation
r is also called the extension of a relation
Example
•
Let S1 = {0,1}
•
Let S2 = {a,b,c}
•
Let R  S1 X S2
•
Then for example: r(R) = {<0,a> , <0,b> , <1,c> }
is one possible “state” or “population” or “extension” r of the relation R, defined over
domains S1 and S2. It has three tuples.
25
Constraints on the Relational Model
26
Relational Integrity Constraints
Constraints are conditions that must hold on all valid relation instances.
There are three main types of constraints:
1.
Key constraints
2.
Entity integrity constraints
3.
Referential integrity constraints
27
Key Constraints
Definition: Superkey of R
A set of attributes SK of R such that no two tuples in any valid relation instance r(R)
will have the same value for SK.
•
That is, for any distinct tuples t1 and t2 in r(R), t1[SK]  t2[SK].
Definition: Key of R
A "minimal" superkey; that is, a superkey K such that removal of any attribute from
K results in a set of attributes that is not a superkey.
If a relation has several candidate keys, one is chosen arbitrarily* to be the primary key.
The primary key attributes are underlined in the relation schema.
*arbitrary in a mathematical sense
28
Key Constraints
3.4
• CAR(License_number, EngineSerialNo, Make, Model, Year)
• has two minimal superkeys (i.e. keys):
• Key1 = {License_number},
• Key2 = {SerialNo}
• both are also superkeys.
• {Make, Model} is a superkey but not a key.
29
Entity Integrity Constraints
• Definition: Relational Database Schema
A set S of relation schemas that belong to the same database. S is the
name of the database.
S = {R1, R2, ..., Rn}
• Entity Integrity: The primary key attributes PK of each relation schema
Ri in S cannot have null values in any tuple of r(Ri).
This is because primary key values are used to identify the individual
tuples.
t[PK]  null for any tuple t in r(Ri)
• Note: Other attributes of R may be similarly constrained to disallow null
values, even though they are not members of the primary key.
30
Referential Integrity (i.e. a 404 ERROR)
Background and terminology
•
A constraint involving two relations
(the previous constraints involve a single relation).
•
Used to specify a relationship among tuples in two relations:
the referencing relation and the referenced relation.
•
Tuples in the referencing relation R1 have foreign key attributes (FK) that
reference the primary key attributes (PK) of the referenced relation R2.
•
A tuple t1 in R1 is said to reference a tuple t2 in R2 if t1[FK] = t2[PK].
•
A referential integrity constraint can be displayed in a relational database schema as
a directed arc from R1’s FK to R2’s PK.
31
Example
Specify the foreign keys for the database schema
K(STUDENT,COURSE,ENROLL,BOOK_ADOPTION,TEXT)
where:
STUDENT (Ssn, Name, Major, Bdate)
COURSE(Course#, Cname, Dept)
ENROLL(Ssn, Course#, Quarter, Grade)
BOOK_ADOPTION(Course#, Quarter, Book_isbn)
TEXT(Book_isbn, Book_title, Publisher, Author)
32
Referential Integrity
•
The Referential Integrity Constraint:
The value in the foreign key column (or columns) FK of the referencing relation R1
can be either:
1) a value of an existing primary key value of the corresponding primary key
PK in the referenced relation R2,,
or..
2) a null.
(In case (2), the FK in R1 should not be a part of its own primary key.)
33
Other Types of Constraints
Semantic Integrity Constraints:
-
based on application semantics and cannot be expressed by the relational
model per se
E.g., “the max. no. of hours per employee for all projects he or she works
on is 56 hrs per week”
-
A constraint specification language may have to be used to express these
Some versions of SQL include triggers and ASSERTIONS to allow for
some of these
Languages for expressing these constraints can have decidability, tractability, and
completeness problems.
34
Choosing a primary key
35
Figure: Primary keys
3.7
36
Figure: Referential Integrity Constraints
3.7
37
Figure: Possible State. (see p.72)
3.6
38
Relational Languages
There are three main relational languages, and they fall into two categories:
• The Relational Algebra
• The Relational Calculi
—The Tuple Relational Calculus
—The Domain Relational Calculus
39
Relational Algebra
The Relational Algebra (RA)
•
The relational algebra is a set of operations, that can be performed on given tables
to construct a new desired table.
•
These operations are largely derived from the theory of sets.
—including relations and tuples
•
As the RA consists of procedures (operations) for constructing the desired table (rather
then simply a description of the conditions a desired table must satisfy, it is a
procedural language.
•
RA is sometimes considered part of the relational model itself.
•
Parts of RA are included in SQL.
40
Relational Calculi
•
The Relational Calculi (RC)
•
The relational calculi do not provide operations for constructing sequences of
procedures, but rather are languages that describe, in a single statement, the
conditions such a table must satisfy.
•
Because it declares what characteristics the desired table must have, but does not
provide a set of procedures for obtaining that table, the relational calculi are called
declarative languages and are often considered “higher” level languages than RA.
•
Both relational calculi are based directly on predicate logic.
•
There are two relational calculi
—The Tuple Relational Calculus (TRC): variables range over tuples of
relations.
—The Domain Relational Calculus (DRC): variables range over domain
values
41
Where SQL fits in
•
SQL is based primarily on the Tuple Relational Calculus, with some bits from the
Relational Algebra. It has a variety of constructs that extend the expressiveness of the
relational algebra.
•
[Beware: the SQL SELECT and the RA SELECT are two completely different
operations!!!]
42
References
•
[SKS] A. Silberschatz, H. Korth, S. Sudarshan, Database Systems Concepts. 5th edition, 2005.
•
[EN] R. Elmasri, S. Navathe, Fundamentals of Database Systems. 4th edition, 2003.
•
[AD] P. Atzeni, V DeAntonellis, Relational Database Theory. 1993.
•
[Rosen] K. Rosen. Discrete Mathematics and its Applications. 5th edition, 2003.
43