Query Processing, Resource Management and Approximate in …


4. Relational Databases
Levels of Abstraction in data
defined by various “schema” levels (ANSI schema model):
• Many views (View 1, View 2, View 3)
• Conceptual (logical) schema
• Physical schema
– Views describe how users see data (possibly different data models for different views).
– Conceptual schema defines the logical structure of the entire data enterprise.
– Physical schema describes the underlying files and indexes used.
• Schemas are defined using Data Definition Languages or DDLs; data are modified/queried using Data Manipulation Languages or DMLs.
Structure of a DBMS
• A typical DBMS has a layered architecture (top to bottom):
Query Optimization and Execution
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
DB
• These layers must consider concurrency control and recovery.
• This is one of several possible architectures.
• Another with a little more detail on next slide.
Structure of a DBMS
QUERIES from users (or Transactions or user-workload requests)

SQL (or some other User Interface Language)
QUERY OPTIMIZATION LAYER

Relational Operators (Select, Project, Join)
DATABASE OPERATOR LAYER

File processing operators (open/close file, read/write record)
FILE MANAGER LAYER (provides the file concept)

Buffer management operators (read/flush page)
BUFFER MANAGER LAYER

Disk transfer operators (malloc, read/write block)
DISK SPACE MANAGER LAYER

DB on DISK
DISK SPACE MANAGER deals with space on disk
offers an interface to higher layers (mainly the BUFFER MGR) consisting of:
allocate/deallocate space; read/write block
can be implemented directly on a raw disk system, in which case it would likely access data as
follows: read block b of track t of cylinder c on disk d
or can use the OS file system (OS file = sequence of bytes), in which case it would likely access
data as follows: read bytes b of file f, and the Operating System file manager
would then translate that into read block b of track t of cylinder c on disk d
most systems do not use the OS file system
- for portability reasons,
- to avoid OS file size peculiarities (limitations)
BUFFER MANAGER
partitions the main memory allocated to the DBMS into buffer page frames,
brings pages to and from disk as requested by higher layers (mainly the FILE Mgr).
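The two lowest layers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the 64-byte block size, and the first-in-first-out eviction rule are all hypothetical choices, and the disk space manager here sits on an OS file (block b lives at byte offset b * BLOCK_SIZE), which is the simpler of the two options described above.

```python
import os
import tempfile

BLOCK_SIZE = 64  # hypothetical block size in bytes, tiny for illustration


class DiskSpaceManager:
    """Reads/writes fixed-size blocks stored in one OS file."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()  # create the file if it does not exist

    def write_block(self, block_no, data):
        assert len(data) <= BLOCK_SIZE
        with open(self.path, "r+b") as f:
            f.seek(block_no * BLOCK_SIZE)  # block number -> byte offset
            f.write(data.ljust(BLOCK_SIZE, b"\x00"))

    def read_block(self, block_no):
        with open(self.path, "rb") as f:
            f.seek(block_no * BLOCK_SIZE)
            return f.read(BLOCK_SIZE)


class BufferManager:
    """Keeps at most n_frames pages in memory (FIFO eviction, an assumption)."""

    def __init__(self, disk, n_frames):
        self.disk = disk
        self.n_frames = n_frames
        self.frames = {}      # page number -> bytearray, kept in load order
        self.dirty = set()    # pages modified since they were loaded
        self.disk_reads = 0

    def get_page(self, page_no):
        if page_no not in self.frames:
            if len(self.frames) >= self.n_frames:
                victim = next(iter(self.frames))  # oldest loaded page
                self.flush_page(victim)
                del self.frames[victim]
            self.frames[page_no] = bytearray(self.disk.read_block(page_no))
            self.disk_reads += 1
        return self.frames[page_no]

    def mark_dirty(self, page_no):
        self.dirty.add(page_no)

    def flush_page(self, page_no):
        if page_no in self.dirty:
            self.disk.write_block(page_no, bytes(self.frames[page_no]))
            self.dirty.discard(page_no)


# Usage: three blocks on disk, but only two buffer frames.
path = os.path.join(tempfile.mkdtemp(), "db.dat")
disk = DiskSpaceManager(path)
for b in range(3):
    disk.write_block(b, b"block-%d" % b)

buf = BufferManager(disk, n_frames=2)
page = buf.get_page(0)
page[0:5] = b"BLOCK"      # modify page 0 in memory only
buf.mark_dirty(0)
buf.get_page(1)
buf.get_page(2)           # no free frame: page 0 is flushed, then evicted
first_block = disk.read_block(0)
```

The higher layers only ever ask for pages; whether a request costs a disk read depends on what the buffer manager currently holds.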
FILE MANAGER
supports the file concept to higher layers (DBMS file = collection of records and pages
of records)
supports access paths to the data in those files (e.g., Indexes).
Not all higher-level DBMS code recognizes or uses the page concept.
Almost all DBMSs use the record concept, though.
DATABASE OPERATOR LAYER
implements physical data model operators
(e.g., relational operators; select, project, join...)
QUERY OPTIMIZER
produces efficient execution plans for answering user queries (e.g., execution plans as
trees of relational operators: select, project, join, union, intersect translated from,
e.g., SQL queries).
SQL is not adequate to answer all user-database questions, e.g., Knowledge workers
working on Data Warehouses ask "what if" questions (On-Line Analytic Processing
or OLAP) not retrieval questions (SQL)
CLUSTERING and Record Identification
Disk I/O minimization is still main performance objective in designing a DBMS.
Thus, clustering records on disk correctly is important.
CLUSTERING = storing logically related records (those accessed together) physically
close, to reduce disk access time.
If the workload typically requests sequential access by SUPPLIER#, then cluster:
SUPPLIER-1-record close to SUPPLIER-2-record, ... intra-file clustering
If the workload typically requests an individual specific SUPPLIER# followed by access
to its shipment data, cluster: SUPPLIER-1-record, SUPPLIER-1's_shipment_data,
SUPPLIER-2-record, SUPPLIER-2's_shipment_data, ... inter-file clustering
RID = Record Identifier = permanently assigned record identifier (page#, record#)
RRN = Relative Record Number = permanently assigned order number - usually an
arrival order number
Overview of Database Design
• Conceptual design:
– What are the entities and relationships in the enterprise?
– What information about these entities and relationships should be stored in the database?
– What integrity constraints or business rules should be enforced?
A database `schema’ model diagram answers these questions pictorially (Entity-Relationship or ER diagrams).
• Then one maps the ER diagrams into a relational schema (using the Data Definition Language provided).
Entity Relationship Model
Entity: Real-world object type distinguishable from other object types.
[Figure: entity Employee with attributes ssn, name, lot]
An entity is described (in DB) using a set of Attributes.
– Each entity set has a key (the chosen identifier attribute(s); underlined).
– Each attribute has a domain (allowable value universe).
ER Model (Cont.)
Relationship: Association among two or more entities.
E.g., Employee Jones works in Pharmacy department.
Relationships can have attributes too!
[Figure: Employee (ssn, name, lot) -- Works_In (since) -- Department (did, dname, budget)]
Degree=2 relationship between entities, Employees and Departments.
[Figure: Employee -- Reports_To -- Employee, with roles supervisor and subordinate]
Degree=2 relationship between an entity and itself? E.g., Employee Reports_To Employee.
Must specify the “role” of each entity to distinguish them.
Relationship Cardinality Constraints
• (many-to-many) Works_In: An employee can work in many departments; a dept can have many employees working in it.
• (1-many) e.g., Manages: It may be required that each dept has at most 1 manager.
• (1-1) Manages: In addition, it may be required that each manager manages at most 1 department.
[Figure: Employee (ssn, name, lot) m -- Works_In (since) -- n Department (did, dname, budget)]
[Figure: Employee (ssn, name, lot) 1 -- Manages (since) -- m Department (did, dname, budget)]
[Figure: Employee (ssn, name, lot) 1 -- Manages (since) -- 1 Department (did, dname, budget)]
Relationship Cardinality Constraints
1-to-1
1-to Many
Many-to-1
Many-to-Many
Participation Constraints
• Every department may have to have a manager?
– This is an example of a total participation constraint: the participation of Department in Manages is said to be total (vs. partial).
[Figure: Employees (ssn, name, lot) -- Manages (since) and Works_In (since) -- Departments (did, dname, budget); the participation of Departments in Manages is marked total]
ISA (`is a’) Hierarchies
We can use attribute inheritance to save repeating shared attributes.
If we declare an ISA relationship among entity types, e.g., A ISA B (every instance of entity A is also an instance of entity B), then A entities “inherit” B entity attributes.
[Figure: Employee (ssn, name, lot) with ISA children Hourly_Emp (hourly_wages, hours_worked) and Contract_Emp (contractid); overlap allowed, covering yes]
e.g., every Hourly_Emp ISA Employee and every Contract_Emp ISA Employee. Hourly_Emps and Contract_Emps can have their own separate attributes also.
• Overlap constraints: Can Joe be an Hourly_Emp and a Contract_Emp? (Allowed/disallowed)
• Covering constraints: Does every Employee entity also have to be an Hourly_Emp or a Contract_Emp entity? (Yes/no)
Relational Database: Working Definitions
• Relational database: a set of relations
• Relation: made up of 2 parts:
– Instance or occurrence: a table, with rows and columns. #Rows = cardinality, #fields = degree
– Schema or type: specifies name of relation & name, type of each attribute
• Students(sid: string, name: string, login: string, age: integer, gpa: real).
• Strictly, a relation is a set of tuples but it is common to think of it as a
table (sequence of rows made up of a sequence of attribute values)
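As a rough illustration of these definitions, the sketch below loads the Students schema into an in-memory SQLite database through Python's standard sqlite3 module (the choice of engine is an assumption made here, not something the slides prescribe) and counts the fields and rows of one instance. The SQLite column types TEXT, INTEGER and REAL stand in for string, integer and real.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: relation name plus the name and type of each attribute.
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)

# Instance: the tuples currently stored in the relation.
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
    ],
)

cur = conn.execute("SELECT * FROM Students")
degree = len(cur.description)      # number of fields
cardinality = len(cur.fetchall())  # number of rows
print("degree =", degree, "cardinality =", cardinality)
```

The degree is fixed by the schema, while the cardinality changes every time a tuple is inserted or deleted.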
Relational Query Languages
• A major strength of the relational model: supports
simple, powerful querying of data.
• Queries can be written intuitively (specifying what,
not how), DBMS is responsible for evaluation
• The DBMS does your programming!
–
Allows a module called the optimizer to extensively reorder operations (even combine similar operations from
different concurrent requests), and still ensure that the
answer does not change.
SQL Query Language
• One of the simplest languages on earth; very English-like! Specify what, not how.
• E.g., SELECT attributes (what columns you want) FROM relations WHERE condition (what rows you want).
• Find all 18 year old students (a selection)
SELECT *
FROM Students S
WHERE S.age=18
sid    name   login     age  gpa
53666  Jones  jones@cs  18   3.4
53688  Smith  smith@ee  18   3.2
• To find just names and logins (a projection), replace 1st line:
SELECT S.name, S.login
FROM Students S
WHERE S.age=18
name   login
Jones  jones@cs
Smith  smith@ee
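The selection and the projection for the 18-year-old students can be tried out with Python's standard sqlite3 module. This is a sketch under assumptions: SQLite is just a convenient engine, the third student (Lee, age 19) is hypothetical and added only so that the WHERE clause has something to filter out, and ORDER BY is added to make the row order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
        ("53700", "Lee", "lee@cs", 19, 3.0),  # hypothetical, not in the slides
    ],
)

# Selection: keep whole rows that satisfy the condition.
selection = conn.execute(
    "SELECT * FROM Students S WHERE S.age = 18 ORDER BY S.sid"
).fetchall()

# Projection: the same rows, but only the name and login columns.
projection = conn.execute(
    "SELECT S.name, S.login FROM Students S WHERE S.age = 18 ORDER BY S.sid"
).fetchall()

for row in projection:
    print(row)
```

Only the SELECT list differs between the two queries; the WHERE clause decides which rows qualify in both.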
Querying Multiple Relations
(Join, implemented using nested loop – alternative 1)
• What does the following query produce? (WHERE is also used to combine (join) S & E.)
SELECT S.name, E.cid
FROM Students S, Enrolled E
WHERE S.sid=E.sid AND E.grade='A'
Students:
sid    name   login     age  gpa
53666  Jones  jones@cs  18   3.4
53650  Smith  smith@ee  18   3.2
Enrolled:
sid    cid          grade
53831  Carnatic101  C
53831  Reggae203    B
53650  Topology112  A
53666  History105   B
(For 53666 / History105 the join succeeds but the selection fails.)
we get:
S.name  E.cid
Smith   Topology112
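The nested-loop evaluation of this join can be written out by hand and compared with what a SQL engine returns. The sketch below is only an illustration: the function name nested_loop_join is hypothetical, tuples are plain Python tuples, and SQLite (via the standard sqlite3 module) is used purely as a convenient engine to cross-check the result.

```python
import sqlite3

students = [
    (53666, "Jones", "jones@cs", 18, 3.4),
    (53650, "Smith", "smith@ee", 18, 3.2),
]
enrolled = [
    (53831, "Carnatic101", "C"),
    (53831, "Reggae203", "B"),
    (53650, "Topology112", "A"),
    (53666, "History105", "B"),
]


def nested_loop_join(students, enrolled):
    """For every Students tuple, scan all Enrolled tuples."""
    result = []
    for s_sid, s_name, _login, _age, _gpa in students:
        for e_sid, e_cid, e_grade in enrolled:
            # Join condition AND selection condition from the WHERE clause.
            if s_sid == e_sid and e_grade == "A":
                result.append((s_name, e_cid))
    return result


by_hand = nested_loop_join(students, enrolled)

# Cross-check against the declarative query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid INTEGER, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.execute("CREATE TABLE Enrolled (sid INTEGER, cid TEXT, grade TEXT)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", students)
conn.executemany("INSERT INTO Enrolled VALUES (?, ?, ?)", enrolled)
by_sql = conn.execute(
    "SELECT S.name, E.cid FROM Students S, Enrolled E "
    "WHERE S.sid = E.sid AND E.grade = 'A'"
).fetchall()

print(by_hand, by_sql)
```

The loop touches every (Students, Enrolled) pair, so its cost grows with the product of the two cardinalities; the query itself says nothing about how the join is carried out.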
Destroying and Altering Relations
(also DDL)
DROP TABLE Students
• Destroys the relation Students. The schema information and the tuples are deleted.
ALTER TABLE Students
ADD COLUMN Year INTEGER
• The schema of Students is altered by adding a new field; every tuple in the current instance is extended, e.g., with a null value in the new field.
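Both statements can be observed in an in-memory SQLite database through Python's standard sqlite3 module; the engine is an assumption made for illustration, and other DBMSs may differ in detail. SQLite reports the null value as Python's None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
    ],
)

# ALTER TABLE: existing tuples are extended with a null in the new field.
conn.execute("ALTER TABLE Students ADD COLUMN Year INTEGER")
rows = conn.execute("SELECT sid, Year FROM Students ORDER BY sid").fetchall()

# DROP TABLE: schema information and tuples are both gone afterwards.
conn.execute("DROP TABLE Students")
try:
    conn.execute("SELECT * FROM Students")
    table_exists = True
except sqlite3.OperationalError:
    table_exists = False

print(rows, table_exists)
```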
Adding and Deleting Tuples
• Can insert a single tuple using:
INSERT INTO Students (sid, name, login, age, gpa)
VALUES (53688, 'Smith', 'smith@ee', 18, 3.2)
• Can delete all tuples satisfying some condition (e.g., name = Smith):
DELETE
FROM Students S
WHERE S.name = 'Smith'
• Many powerful variants of these commands are available!
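A minimal run of the two commands, again assuming SQLite through Python's standard sqlite3 module as a stand-in engine. The Jones tuple is inserted only so that something is left after the delete, and the delete is written without a table alias to stay portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)

# Insert single tuples, naming the columns explicitly.
conn.execute(
    "INSERT INTO Students (sid, name, login, age, gpa) "
    "VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2)"
)
conn.execute(
    "INSERT INTO Students (sid, name, login, age, gpa) "
    "VALUES ('53666', 'Jones', 'jones@cs', 18, 3.4)"
)

# Delete every tuple that satisfies the condition.
deleted = conn.execute(
    "DELETE FROM Students WHERE name = ?", ("Smith",)
).rowcount
remaining = conn.execute("SELECT name FROM Students").fetchall()

print(deleted, remaining)
```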
Views
• A view is a relation constructable from stored or base
relations. Store a definition of it, rather than the instance
(actual tuples).
CREATE VIEW YoungActiveStudents (name, grade)
AS SELECT S.name, E.grade
FROM Students S, Enrolled E
WHERE S.sid = E.sid and S.age<21
• Views can be dropped using the DROP VIEW command. How to handle DROP TABLE if there’s a view on the table?
– DROP TABLE command has options to let the user specify this.
• Views can be used to present necessary information (or a summary),
while hiding details in underlying relation(s).
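Because only the definition of a view is stored, its contents follow the base relations. The sketch below tries this with the YoungActiveStudents view in SQLite (via Python's standard sqlite3 module, an engine chosen here for convenience); the 25-year-old student Brown is hypothetical and is there only to show that the age condition filters rows out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.execute("CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53650", "Smith", "smith@math", 19, 3.8),
        ("54000", "Brown", "brown@cs", 25, 3.0),  # hypothetical, too old for the view
    ],
)
conn.executemany(
    "INSERT INTO Enrolled VALUES (?, ?, ?)",
    [
        ("53666", "Carnatic101", "C"),
        ("53650", "Topology112", "A"),
        ("54000", "History105", "B"),
    ],
)

# Only the definition is stored, not the tuples.
conn.execute(
    "CREATE VIEW YoungActiveStudents (name, grade) AS "
    "SELECT S.name, E.grade FROM Students S, Enrolled E "
    "WHERE S.sid = E.sid AND S.age < 21"
)

before = conn.execute(
    "SELECT name, grade FROM YoungActiveStudents ORDER BY name, grade"
).fetchall()

# A change to a base relation is visible through the view immediately.
conn.execute("INSERT INTO Enrolled VALUES ('53666', 'Reggae203', 'B')")
after = conn.execute(
    "SELECT name, grade FROM YoungActiveStudents ORDER BY name, grade"
).fetchall()

print(before, after)
```

Users of the view see names and grades but not sid, login, age or gpa, which is the hiding of detail mentioned above.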
Integrity Constraints (ICs)
• IC: condition that must be true for any instance in the
database; e.g., domain constraints.
• ICs are specified when (or after) relations are created.
– ICs are checked when relations are modified.
• A legal instance of a relation is one that satisfies all its ICs.
– DBMS should not allow illegal instances.
– Avoids data entry errors, too!
Primary Key Constraints
• A set of fields is a key (strictly speaking, a candidate key)
for a relation if it satisfies:
1. (Uniqueness condition) No two distinct tuples can have
same values in the key (which may be a composite)
2. (Minimality condition) The Uniqueness condition is not
true for any subset of a composite key.
– If Part 2 is false, it’s called a superkey (for superset of a
key)
– There’s always at least one key for a relation, one of the
keys is chosen (by DBA) to be the primary key, the
primary record identification or lookup column(s)
• E.g., sid is a key for Students. The set {sid, gpa} is a
superkey.
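The two conditions can be checked mechanically against one instance, as in the sketch below. The helper names is_unique and is_candidate_key_on are hypothetical, and the three Students tuples are the ones used in the Foreign Keys example of these slides. Note the limits of such a check: it can show that a set of fields is not a key (a violation exists in the instance), but passing the check on one instance does not make something a key of the relation.

```python
from itertools import combinations

students = [
    {"sid": "53666", "name": "Jones", "login": "jones@cs", "age": 18, "gpa": 3.4},
    {"sid": "53688", "name": "Smith", "login": "smith@eecs", "age": 18, "gpa": 3.2},
    {"sid": "53650", "name": "Smith", "login": "smith@math", "age": 19, "gpa": 3.8},
]


def is_unique(rows, fields):
    """Uniqueness condition: no two tuples agree on all the given fields."""
    seen = set()
    for row in rows:
        value = tuple(row[f] for f in fields)
        if value in seen:
            return False
        seen.add(value)
    return True


def is_candidate_key_on(rows, fields):
    """Uniqueness plus minimality, judged on this one instance only."""
    if not is_unique(rows, fields):
        return False
    for size in range(1, len(fields)):
        for subset in combinations(fields, size):
            if is_unique(rows, subset):
                return False  # a proper subset already identifies tuples
    return True


print(is_candidate_key_on(students, ["sid"]))         # unique and minimal
print(is_unique(students, ["sid", "gpa"]))            # unique: a superkey here
print(is_candidate_key_on(students, ["sid", "gpa"]))  # not minimal
print(is_unique(students, ["name"]))                  # two Smiths
print(is_unique(students, ["gpa"]))                   # unique here by coincidence
```

In this instance gpa alone happens to be unique, yet nobody would declare gpa a key for Students; the declaration has to come from the meaning of the data.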
Entity integrity
• No column of the primary key can contain a null value.
Foreign Keys and Referential
Integrity
• Foreign key: A field (or set of fields) in one relation used
to `refer’ to a tuple in another relation (by listing the
primary key value of the tuple in the second relation). Like a `logical
pointer’.
• E.g., sid in Enrolled is a foreign key referring to sid in
Students (sid is the primary key of Students).
– If all foreign key constraints are enforced, a special
integrity constraint, referential integrity , is achieved,
i.e., no dangling references
– E.g., if Referential Integrity is enforced (and it almost
always is) an Enrolled record cannot have a sid that is
not present in Students (students cannot enroll in
courses until they register in the school)
Foreign Keys
• Only students listed in the Students relation should be allowed to enroll for courses.
Enrolled
sid    cid          grade
53666  Carnatic101  C
53666  Reggae203    B
53650  Topology112  A
53666  History105   B
Students
sid    name   login       age  gpa
53666  Jones  jones@cs    18   3.4
53688  Smith  smith@eecs  18   3.2
53650  Smith  smith@math  19   3.8
Enforcing Referential Integrity
• Consider Students and Enrolled; sid in Enrolled is a
foreign key that references Students.
• What should be done if an Enrolled tuple with a nonexistent student id is inserted? (Reject it!)
• What should be done if a Students tuple is deleted?
– Also delete all Enrolled tuples that refer to it?
– Disallow that deletion if an Enrolled tuple refers to it?
– Set sid in Enrolled tuples that refer to it to a default sid?
– (Sometimes there is a “default default”, e.g., set sid in Enrolled
tuples to a special value null, denoting `not applicable’, if no
other default is specified.)
Referential Integrity in SQL
• SQL supports all 4 options on deletes and updates.
– Default action = NO ACTION (the violating delete/update request is rejected)
– CASCADE (also delete all tuples that refer to deleted tuple)
– SET NULL / SET DEFAULT (sets foreign key value of referencing tuple)
CREATE TABLE Enrolled
(sid CHAR(20),
cid CHAR(20),
grade CHAR(2),
PRIMARY KEY (sid,cid),
FOREIGN KEY (sid)
REFERENCES Students
ON DELETE CASCADE
ON UPDATE SET NULL)
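The reject-on-insert and cascade-on-delete behaviour can be tried in SQLite through Python's standard sqlite3 module. This is a sketch under assumptions: SQLite only enforces foreign keys after PRAGMA foreign_keys = ON, the Students table is cut down to two columns, and only the ON DELETE CASCADE clause of the statement above is exercised (the ON UPDATE clause is left out of this sketch).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch

conn.execute("CREATE TABLE Students (sid CHAR(20) PRIMARY KEY, name CHAR(20))")
conn.execute(
    """CREATE TABLE Enrolled (
           sid CHAR(20),
           cid CHAR(20),
           grade CHAR(2),
           PRIMARY KEY (sid, cid),
           FOREIGN KEY (sid) REFERENCES Students (sid)
               ON DELETE CASCADE)"""
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?)",
    [("53666", "Jones"), ("53650", "Smith")],
)
conn.executemany(
    "INSERT INTO Enrolled VALUES (?, ?, ?)",
    [
        ("53666", "Carnatic101", "C"),
        ("53666", "Reggae203", "B"),
        ("53650", "Topology112", "A"),
        ("53666", "History105", "B"),
    ],
)

# Inserting an Enrolled tuple with a nonexistent student id is rejected.
try:
    conn.execute("INSERT INTO Enrolled VALUES ('99999', 'Reggae203', 'A')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Deleting a Students tuple also deletes the Enrolled tuples that refer to it.
conn.execute("DELETE FROM Students WHERE sid = '53666'")
remaining = conn.execute("SELECT sid, cid FROM Enrolled").fetchall()

print(rejected, remaining)
```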
Where do ICs Come From?
• ICs are based on the semantics of the real-world
enterprise that is being described in the
database. I.e., the users decide semantics, not the
DB experts!
Why?
• We can check a database instance to see if an IC is
violated, but we can NEVER infer an IC by only
looking at the data instances.
• An IC is a statement about all possible instances!
• It is not a statement that can be inferred from the set of
currently existing instances.
• If ICs were inferred from current instances, then when a
relation is newly created and has, say, just 2 tuples, many,
many ICs would be inferred. E.g., in
sid    name   login     age  gpa
53666  Jones  jones@cs  18   3.4
53688  Smith  smith@ee  18   3.5
the system might infer that students MUST be 18, or that
names have to be 5 characters, or worse yet, that
gpa ranking must be the same as alphabetical name
ordering!
Key and foreign key ICs are the most common.
Who decides primary key? (and other design choices?)
• The Database design expert?
– NO! Not in isolation, anyway.
– Someone from the enterprise who understands the data and the procedures should be consulted.
– The following story illustrates this point. CAST:
– Mr. Goodwrench = MG (parts manager);
– Pointy-headed Dbexpert = Ph D
• Ph D: I've looked at your data, and decided Part Number (P#) will be designated the primary key for the relation, PARTS(P#, COLOR, WT, TIME-OF-ARRIVAL).
• MG: You're the expert.
• Ph D: Well, according to what I’ve learned in school, P# should be the primary key, because IT IS the lookup attribute!
... later
• MG: Why is lookup so slow?
• Ph D: You do store parts in the stock room ordered by P#, right?
• MG: No. We store by weight! When a shipment comes in, I take each part into the back room and throw it as far as I can. The lighter ones go further than the heavy ones so they get ordered by weight!
• Ph D: But, but… weight doesn't have the Uniqueness property! Parts with the same weight end up together in a pile!
• MG: No they don't. I tire quickly, so the first one I throw goes furthest.
• Ph D: Then we’ll use a composite primary key, (weight, time-of-arrival).
• MG: We get our keys primarily from Curt’s Lock and Key.
• The point is: This conversation should have taken place during the 1st meeting.
An ER Example:
COMPANY is described to us as follows:
1. The company is organized into depts - each with a name, number, manager. - Each manager has a
startdate. - Each department can have several locations.
2. Departments control projects - each with a name, number, location.
3. Each employee has a name, SSN, sex, address, salary, birthdate, dept, supervisor. - An
employee may work on several projects (not necessarily all controlled by his dept) for which
we keep hoursworked by project.
4. Each employee dependent has a name, sex, birthdate and relationship.
In ER diagrams, entities are represented in boxes: |EMPLOYEE| |DEPENDENT| |DEPT| |PROJECT|
An attribute (or property) of an entity describes that entity.
An ENTITY has a TYPE, including name and list of its attributes.
ENTITY TYPE SCHEMA describes the common structure shared by all entities of that type.
Project (Name, Num, Location, Dept)
ENTITY INSTANCE = individual occurrence of an entity of a particular type at a particular time
(Dome, 46, 19 Ave N & Univ, Athletics)
(IACC, 52, Bolley & Centennial, C.S.)
(Bean Res, 31, 12 Ave N & Federal, P.S.) . . .
Entity Type does not change often - very static.
Entity instances get added, changed often - very dynamic
An ER Example continued:
ATTRIBUTES are written next to the Entity they describe, usually something like the following:
Name-------------.
Number-----------|
Locations--------|--|DEPARTMENT|
Manager----------|
ManagerStartDate-'

                  .--Name
                  |--SSN
|__EMPLOYEE__|----|--Sex
                  |--Address
                  |--Salary
                  |--BirthDate
                  |--Department
                  |--Supervisor
                  `--WorksOn

Name-------------.
Number-----------|
Location---------|-|_PROJECT|
ControlDepartment'

              .-Employee
              |-DependentName
|_DEPENDENT|--|-Sex
              |-BirthDate
              `-Relationship
An ER Example:
CATEGORIES OF ATTRIBUTES:
COMPOSITE ATTRIBUTE = attribute that is subdivided into smaller parts with independent meaning.
e.g., Name attribute of Employee may be subdivided into FName, Minit, LName.
Indicated: Name (FName, Minit, LName)
Also, WorksOn may be a composite attribute of Employee, made up of Project and Hours: WorksOn (Project, Hours)
SINGLE-VALUED ATTRIBUTE: one value per entry.
MULTIVALUED ATTRIBUTE (repeating group): has multiple values per entry,
e.g., Locations (as an attribute of Department, since a Department can have multiple locations).
- For a Multivalued Attribute, use {Locations}
- WorksOn may be a multivalued attribute of Employee as well as composite: {WorksOn (Project, Hours)}
DERIVED ATTRIBUTE is an attribute whose value can be calculated from other attribute values,
e.g., Age calculated from BirthDate and CurrentDate.
KEY ATTRIBUTE: Each value can occur at most once (has the uniqueness property).
Used to identify entity instances. We will mark key attribute(s) with *.
ATTRIBUTE DOMAIN: Set of values that may be assigned (also called Value Set).
Thus the Preliminary Design of Entity Types for the COMPANY db is:
*Name-------------.
*Number-----------|
 Locations--------|--|DEPARTMENT|
 Manager----------|
 ManagerStartDate-'

                  .---Name
                  |--*SSN
|__EMPLOYEE__|----|---Sex
                  |---Address
                  |---Salary
                  |---BirthDate
                  |---Department
                  |---Supervisor
                  `---WorksOn

*Name-------------.
*Number-----------|
 Location---------|-|_PROJECT|
 ControlDepartment'

              .--Employee
              |-*DependentName
|_DEPENDENT|--|--Sex
              |--BirthDate
              `--Relationship
An ER Example continued:
RELATIONSHIPS express associations among entities.
Relationships have RELATIONSHIP TYPEs (consisting of the names of the entities and the name
of the relationship).
A Relationship type diagram for a relationship between EMPLOYEE and DEPARTMENT called
"WorksFor" is diagrammed: (in a roundish box)
|EMPLOYEE|-( WorksFor )-|DEPARTMENT|
RELATIONSHIP INSTANCEs for the above relationship might be, eg:
( John Q. Smith, Athletics )
( Fred T. Brown, Comp. Sci.)
( Betty R. Hahn, Business ) . . .
RELATIONSHIP DEGREE: Number of participating entities (usually 2)
If an entity participates more than once in the same relationship, then ROLE NAMES are
needed to distinguish multiple participations.
eg, Supervisor, Supervisee in Supervision relationship
- Called Reflexive Relationships.
- Unnecessary if entity types are distinct.
One decision that has to be made is whether an attribute or a relationship is the
appropriate way to model, e.g., "WorksOn". Above we modeled it as an attribute of EMPLOYEE
{WorksOn(Project,Hours)}
The fact that it is multivalued and composite (involving another entity, project) suggests that
it would be better to model it as a relationship (i.e., it makes a very complex attribute!)
WORKS_FOR(EMPLOYEE, DEPARTMENT)
An ER Example continued:
CONSTRAINTS ON A RELATIONSHIP
CARDINALITY CONSTRAINT can be
1-to-1
many-to-1
1-to-many or
many-to-many
1 to 1:
MANAGES(EMPLOYEE, DEPARTMENT)
Each manager MANAGES 1 dept
Each dept is MANAGED-BY 1 manager
Many to 1: WORKS_FOR(EMPLOYEE, DEPARTMENT)
Each employee WORKS_FOR 1 dept
Each dept is WORKED_FOR by many emps
Many to Many: WORKS_ON(EMPLOYEE, PROJECT)
Each employee WORKS_ON many projects
Each project is WORKED_ON by many employees
PARTICIPATION CONSTRAINT (for an entity in a relationship) can be Total, Partial or Min-Max
Total: Every EMPLOYEE WORKS_FOR some DEPARTMENT
Partial: Not every EMPLOYEE MANAGES some DEPT
RELATIONSHIP can have ATTRIBUTES (properties) as well: eg, Hours for WORKS_ON Relationship,
Manager_Start_Date in MANAGES relationship.
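One common way to carry a many-to-many relationship with an attribute over to relations is a separate table keyed on both participants. The sketch below does this for WORKS_ON with its Hours attribute, using SQLite through Python's standard sqlite3 module. Everything concrete in it is an assumption made for illustration: the table and column names, the sample employees and hours, and the choice of (ssn, pnumber) as composite primary key. Only the project names and numbers (Dome 46, IACC 52) come from the entity instances listed earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch

conn.execute("CREATE TABLE Employee (ssn TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE Project (pnumber INTEGER PRIMARY KEY, pname TEXT)")
# The relationship becomes its own relation; Hours is a relationship attribute.
conn.execute(
    """CREATE TABLE Works_On (
           ssn TEXT REFERENCES Employee (ssn),
           pnumber INTEGER REFERENCES Project (pnumber),
           hours REAL,
           PRIMARY KEY (ssn, pnumber))"""
)

conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("111", "Smith"), ("222", "Brown")],  # hypothetical employees
)
conn.executemany(
    "INSERT INTO Project VALUES (?, ?)", [(46, "Dome"), (52, "IACC")]
)
conn.executemany(
    "INSERT INTO Works_On VALUES (?, ?, ?)",
    [("111", 46, 10.0), ("111", 52, 5.5), ("222", 46, 20.0)],  # hypothetical hours
)

# Many-to-many: one employee on many projects, one project with many employees.
projects_of_111 = conn.execute(
    "SELECT COUNT(*) FROM Works_On WHERE ssn = '111'"
).fetchone()[0]
employees_on_46 = conn.execute(
    "SELECT COUNT(*) FROM Works_On WHERE pnumber = 46"
).fetchone()[0]

# The composite key allows each (employee, project) pair only once.
try:
    conn.execute("INSERT INTO Works_On VALUES ('111', 46, 3.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(projects_of_111, employees_on_46, duplicate_rejected)
```

A 1-to-1 or many-to-1 relationship would not need a separate table in this style, since the key of one side determines the other.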
An ER Example continued:
6 RELATIONSHIPS (role names, if any, above the entities; participation below):

CARDINALITY  RELATIONSHIP   ENTITIES                  ATTRIBUTES
-----------  -------------  ------------------------  ------------------
1:1          MANAGES        (EMPLOYEE, DEPARTMENT)    Manager_Start_Date
                             partial   total
1:many       WORKS_FOR      (DEPARTMENT, EMPLOYEE)
                             total       total
many:many    WORKS_ON       (EMPLOYEE, PROJECT)       Hours
                             total     total
1:many       CONTROLS       (DEPARTMENT, PROJECT)
                             partial     total
                             supervisor supervisee    (reflexive relationship with role names)
1:many       SUPERVISION    (EMPLOYEE,  EMPLOYEE)
                             partial    partial
1:many       DEPENDENTS_OF  (EMPLOYEE, DEPENDENT)
                             partial   total
An ER Example continued:
COMPANY Entity-Relationship Diagram (showing the Schema)
(double connecting lines means "total" while single line means partial participation.)
[ER diagram; the drawing did not survive extraction. Recoverable content:
Entities and attributes:
  DEPARTMENT: *Name, *Number, {Locations}, number_employees
  EMPLOYEE: Name(FN,Mi,LN), *SSN, Sex, Address, Salary, BirthDate
  PROJECT: *Name, *Numb, Locatn
  DEPENDENT: *DependentName, Sex, BirthDate, Relationship
Relationships and cardinalities:
  MANAGES: EMPLOYEE 1 to DEPARTMENT 1
  WORKS_FOR: DEPARTMENT 1 to EMPLOYEE many
  CONTROLS: DEPARTMENT 1 to PROJECT many
  SUPERVISE: EMPLOYEE 1 ('er) to EMPLOYEE many ('ee)
  WORKS_ON: EMPLOYEE many to PROJECT many, with attribute Hours
  Dependent_Of: EMPLOYEE 1 to DEPENDENT many]