Overview of Databases I

Download Report

Transcript Overview of Databases I

Databases II
(Fall 2009)
Professor: Iluju Kiringa
[email protected]
http://www.site.uottawa.ca/~kiringa
SITE 5072
1
Review of Databases I
Chapters 2-5
2
What is a DBMS?
A centralized database is a large collection of
integrated data.
 Organizations face large quantities of data
that must be efficiently managed.

 Many store GBs, even TBs of data
 Some scientific applications store PBs of data !

A Database Management System (DBMS) is a
software package designed to store and
manage databases.
3
Content of « Databases I »

Foundations of DBMSs (Chap. 2-5, 19): :
 Conceptual design - Chap. 2:
• Input: requirements of an application
• Output: ER diagrams (i.e., ER model)
 Logical design – Chap. 3:
• Input: ER diagrammes
• Output: Relational model
 Normalization – Chap. 19:
• Input: Relational model
• Output: Relational model in normal forms
 Relational algebra and calculus – Chap. 4
 SQL – Chap. 5

Application development (Chap. 6-7)
 Classic applications: Embeded SQL, JDBC, SQLJ, stored procedures
 Internet applications : HTTP, HTML, XML, 3-tier architecture
4
Content of « Databases I » (Cont’d)

Storage and Indexing (Chap. 8-11):
 Disks and files – Chap. 9:
• Memory hierarchy
• Disk and file management
 Tree index – Chap. 10:
• ISAM tree
• B+ tree
 Hasch index – Chap. 11:
• static
• extendible
• linear
5
Content of « Databases II »

Query evaluation (Chap. 12-15)
 External sort - Chap. 13
 Evaluation of relational operators -- Chap. 14
 System R: a sample evaluation algorithm – Chap. 15

Transactions (Chap. 16-18)
 Concurrency control – Chap. 17
 Recovery – Chap. 18

Advanced topics (Chap. 22,25, 26, 23 ou 27)
 Distributed databases – Chap. 22
 Data Warehousing – Chap. 25
 Data Mining Chap. 26
6
Overview of Database Design





Requirements analysis: Find out what users want to do
with the database
Conceptual design: Use the output of RA to develop a
high-level description of the data to be stored, along with
their constraints. Output of CD usually is an ER-diagram.
Logical design: Choose a DBMS and map the conceptual
schema (ER-diagram) into the data model of the DBMS.
Output of this step is the logical schema of the data.
Schema refinement: Analyze the logical schema, identify
potential problems in it, and fix these by refining the
logical schema using known normal forms.
Physical design: Consider expected workloads that the
database will support to further refine the design in
order to meet desired performance criteria. Output here
is the physical schema.
7
ER Model

The model

Basic elements:
•
•
•

Advanced elements:
•
•
•
•

Entity: real-world object
Relationship: association between entities
Attributes: description of an entity or a relationship
Constraints: qualification of a relationship
Weak entity : Identifiable only via another entity
Hierarchy : similar to OO-languages
Agregation: sort of macros
Conceptual design using the ER model




Should a given concept be modelled as an entity or an attribute?
Should a given concept be modelled as an entity or a relationship?
Should a given concept be modelled as a binary or as a ternary
relationship?
etc
8
Relational Database Concepts

Relation: made up of 2 parts:






Instance : a table, with rows and columns.
#Rows = cardinality, #fields = degree / arity.
Schema : specifies name of relation, plus name and domain (type) of
each column (attribute).
Can think of a relation as a set of rows or tuples (i.e., all
rows are distinct), where each tuple has the same arity as
the relation schema.
Relational database: a set of relations, each with distinct
name.
Relational DB schema: set of schemas of relations in the DB.
Relational DB instance: set of relation instances in the DB.
9
Sample Relation
 Schema :
Instance :
Students(sid: string, name: string, login: string,
age: integer, gpa: real).
sid
53666
53688
53650
name
login
Jones jones@cs
Smith smith@eecs
Smith smith@math
age
18
18
19
gpa
3.4
3.2
3.8
Cardinality = 3, arity = 5, all rows distinct.
 Commercial systems allow duplicates.
 Order of attributes may or may not matter!
 Do all columns in a relation instance have to
be distinct? Depends on whether they are ordered or not.

10
Creation and Alteration of Relations
CREATE TABLE Students
(sid: CHAR(20),
name: CHAR(20),
login: CHAR(10),
age: INTEGER,
gpa: REAL)
DROP TABLE Students
CREATE TABLE Enrolled
(sid: CHAR(20),
cid: CHAR(20),
grade: CHAR(2))
ALTER TABLE Students
ADD COLUMN firstYear: integer
INSERT INTO Students (sid, name, login, age, gpa)
VALUES (53688, ‘Smith’, ‘smith@ee’, 18, 3.2)
DELETE FROM Students S
WHERE S.name = ‘Smith’
11
Primary Key and Foreign Key in SQL
CREATE TABLE Enrolled
(sid CHAR(20)
cid CHAR(20),
grade CHAR(2),
PRIMARY KEY (sid),
UNIQUE (cid, grade) )
CREATE TABLE Enrolled
(sid CHAR(20)
cid CHAR(20),
grade CHAR(2),
PRIMARY KEY (sid,cid) )
CREATE TABLE Enrolled
(sid CHAR(20), cid CHAR(20), grade CHAR(2),
PRIMARY KEY (sid,cid),
FOREIGN KEY (sid) REFERENCES Students )
12
Logical DB Design: ER to Relational
The ER model represent the initial, high-level
database design.
 The task is to generate a relational database
schema that is as close as possible to the ER
model.
 The mapping is approximate since it is hard to
translate all the constraints of the ER model into
an efficient logical (relational) model.

13
Formal Relational Query Languages

Two mathematical Query Languages form
the basis for “real” languages (e.g. SQL), and
for implementation:
 Relational Algebra: More operational, very useful
for representing execution plans.
 Relational Calculus: Lets users describe what they
want, rather than how to compute it (Nonoperational, declarative).
14
Relational Algebra

Basic operations:








Additional operations:


Selection ( ) Selects a subset of rows from relation.
Projection ( ) Deletes unwanted columns from relation.
Cross-product ( ) Allows us to combine two relations.
Set-difference ( ) Gives tuples in 1st rel., but not in 2nd rel.
Union (  ) Gives tuples in rel. 1 and in rel. 2.
Intersection, join, division, renaming: Not (theoretically)
essential, but (practically) very useful.
Since each operation returns a relation, operations
can be composed! (Algebra is “closed”.)
15
Cross-Product
Each row of S1 is paired with each row of R1.
 Result schema has one field per field of S1 and R1,
with field names `inherited’ if possible.
 Conflict: Both S1 and R1 have a field called sid.

(sid) sname rating age
(sid) bid day
22
dustin
7
45.0
22
101 10/10/96
22
dustin
7
45.0
58
103 11/12/96
31
lubber
8
55.5
22
101 10/10/96
31
lubber
8
55.5
58
103 11/12/96
58
rusty
10
35.0
22
101 10/10/96
58
rusty
10
35.0
58
103 11/12/96
 Renaming operator:
 (C(1 sid1, 5  sid2), S1 R1)
16
Joins

Condition Join:
R  c S   c ( R  S)
(sid) sname rating age
22
dustin 7
45.0
31
lubber 8
55.5
S1 
(sid) bid
58
103
58
103
S1. sid  R1. sid
day
11/12/96
11/12/96
R1
Result schema same as that of cross-product.
 Fewer tuples than cross-product, might be
able to compute more efficiently
 Sometimes called a theta-join.

17
Joins (Cont’d)

Equi-Join: A special case of condition join where
the condition c contains only equalities.
sid
sname rating age bid day
22
dustin 7
45.0 101 10/10/96
58
rusty
10
35.0 103 11/12/96
S1 


R1
sid
Result schema similar to cross-product, but only
one copy of fields for which equality is specified.
Natural Join: Equijoin on all fields having the same
name in both relations.
18
Relational Calculus

Has two flavors:
 Tuple relational calculus (TRC)
 Domain relational calculus (DRC).

Has variables, constants, comparison ops, logical
connectives, and quantifiers.




TRC: Variables range over (i.e., get bound to) tuples.
DRC: Variables range over domain elements (= field values).
Both TRC and DRC are simple subsets of first-order logic.
Expressions in the calculus are called formulas. An
answer tuple is essentially an assignment of constants
to variables that make the formula evaluate to true.
19
Tuple Relational Calculus

Query has the form:








t| p(t)

Answer includes all tuples t that make the formula
p(t) be true.

Formula is recursively defined, starting with
simple atomic formulas (getting tuples from
relations or making comparisons of values),
and building bigger formulas using the logical
connectives.
20
TRC Formulas

Atomic formula:


RRname, or R.a op S.b,
op is one of , , ,,, 
or R.a op constant
Formula:
 an atomic formula, or
  p, p  q, p  q , where p and q are formulas, or
 R( p(R)), where variable R is free in p(R), or
 R( p(R)) , where variable R is free in p(R)
 The use of quantifiers R and R is said to bind R.


A variable that is not bound is free.
21
Overview: Features of SQL

Data definition language: used to create, destroy, and
modify tables and views.
Data manipulation language: used to pose queries, and
to insert, delete, and modify rows.
 Triggers and advanced integrity constraints: used to
specify actions that the DBMS will execute automatically.
 Embeddded SQL: allows SQL to be called from a host

language, or

Dynamic SQL: allows run-time creation and execution of
queries .
22
Syntax of Basic SQL Query
SELECT
FROM
WHERE
[DISTINCT] target-list
relation-list
qualification
relation-list A list of relation names (possibly with a
range-variable after each name).
 target-list A list of attributes of relations in relation-list
 qualification Comparisons (Attr op const or Attr1 op
Attr2, where op is one of , ,  , , ,  )
combined using AND, OR and NOT.
 DISTINCT is an optional keyword indicating that the
answer should not contain duplicates.

23
Sample Query and Conceptual Evaluation
SELECT S.sname
FROM Sailors S, Reserves R
WHERE S.sid=R.sid AND R.bid=103
(sid) sname rating age
(sid) bid day
22 dustin
7
45.0
22
101 10/10/96
22 dustin
7
45.0
58
103 11/12/96
31 lubber
8
55.5
22
101 10/10/96
31 lubber
8
55.5
58
103 11/12/96
58 rusty
10
35.0
22
101 10/10/96
58 rusty
10
35.0
58
103 11/12/96
24
Aggregate Operators

Significant extension of relational
algebra
SELECT COUNT (*)
FROM Sailors S
SELECT AVG (S.age)
FROM Sailors S
WHERE S.rating=10
COUNT (*)
COUNT ( [DISTINCT] A)
SUM ( [DISTINCT] A)
AVG ( [DISTINCT] A)
MAX (A)
MIN (A)
single column
SELECT S.sname
FROM Sailors S
WHERE S.rating= (SELECT MAX(S2.rating)
FROM Sailors S2)
SELECT COUNT (DISTINCT S.rating)
FROM Sailors S
WHERE S.sname=‘Bob’
SELECT AVG ( DISTINCT S.age)
FROM Sailors S
WHERE S.rating=10
25
Queries With GROUP BY and HAVING
SELECT
FROM
WHERE
GROUP BY
HAVING

[DISTINCT] target-list
relation-list
qualification
grouping-list
group-qualification
The target-list contains (i) attribute names (ii) terms
with aggregate operations (e.g., MIN (S.age)).

The attribute list (i) must be a subset of grouping-list.
Intuitively, each answer tuple corresponds to a group, and
these attributes must have a single value per group. (A
group is a set of tuples that have the same value for all
attributes in grouping-list.)
26
Find the age of the youngest sailor with age  18,
for each rating with at least 2 such sailors
SELECT S.rating, MIN (S.age)
FROM Sailors S
WHERE S.age >= 18
GROUP BY S.rating
HAVING COUNT (*) > 1


Only S.rating and S.age are
mentioned in the SELECT,
GROUP BY or HAVING clauses;
other attributes `unnecessary’.
2nd column of result is
unnamed. (Use AS to name it.)
sid sname rating age
22 dustin
7
45.0
31 lubber
8
55.5
71 zorba
10 16.0
64 horatio
7
35.0
29 brutus
1
33.0
58 rusty
10 35.0
rating age
1
33.0
7
45.0
rating
7
35.0
7
35.0
8
55.5
Answer relation
10 35.0
27
Summary
SQL
 is more declarative than earlier, procedural query languages.
 is relationally complete; in fact, significantly more expressive
power than relational algebra.




In practice, users need to be aware of how queries are optimized and
evaluated for best results.
has any alternative ways to write a query; optimizer should
look for most efficient evaluation plan.
allows specification of rich integrity constraints.
Provides triggers to respond to changes in the database.
28